arXiv Daily Papers | 2024-11-22

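This post lists the latest papers retrieved from arXiv.org on 2024-11-22. It updates automatically and groups papers into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.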

Note: The daily paper data is fetched from arXiv.org and refreshed automatically at around 12:00 every day.

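[Quick Read]: This paper asks whether the o1 model can generalize effectively to broader domains where clear standards are absent and rewards are hard to quantify. The key to the solution is the Marco-o1 model, which combines Chain-of-Thought (CoT) fine-tuning, Monte Carlo Tree Search (MCTS), reflection mechanisms, and novel reasoning strategies, optimized for complex real-world problem solving. Together, these techniques let the model reason and make decisions more effectively on open-ended problems.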

Link: https://arxiv.org/abs/2411.14405
Authors: Yu Zhao, Huifeng Yin, Bo Zeng, Hao Wang, Tianqi Shi, Chenyang Lyu, Longyue Wang, Weihua Luo, Kaifu Zhang
Keywords: sparked a surge, surge of interest, study of large, LRM, Monte Carlo Tree
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Currently OpenAI o1 has sparked a surge of interest in the study of large reasoning models (LRM). Building on this momentum, Marco-o1 not only focuses on disciplines with standard answers, such as mathematics, physics, and coding – which are well-suited for reinforcement learning (RL) – but also places greater emphasis on open-ended resolutions. We aim to address the question: “Can the o1 model effectively generalize to broader domains where clear standards are absent and rewards are challenging to quantify?” Marco-o1 is powered by Chain-of-Thought (CoT) fine-tuning, Monte Carlo Tree Search (MCTS), reflection mechanisms, and innovative reasoning strategies – optimized for complex real-world problem-solving tasks.

[NLP-1] Lightweight Safety Guardrails Using Fine-tuned BERT Embeddings COLING2025

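[Quick Read]: This paper addresses how to build efficient, reliable, and cost-controlled monitoring and control mechanisms when developing enterprise applications with large language models (LLMs). The key to the solution is fine-tuning a lightweight architecture: fine-tuning a Sentence-BERT model cuts the parameter count from LlamaGuard's 7 billion to roughly 67 million while maintaining comparable performance on the AEGIS safety benchmark. This lowers latency and maintenance costs and makes deployment more practical and scalable.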

Link: https://arxiv.org/abs/2411.14398
Authors: Aaron Zheng, Mansi Rana, Andreas Stolcke
Keywords: large language models, rapidly develop, recent proliferation, proliferation of large, large language
Categories: Computation and Language (cs.CL)
Comments: To appear in Proceedings of COLING 2025

Abstract:With the recent proliferation of large language models (LLMs), enterprises have been able to rapidly develop proof-of-concepts and prototypes. As a result, there is a growing need to implement robust guardrails that monitor, quantize and control an LLM’s behavior, ensuring that the use is reliable, safe, accurate and also aligned with the users’ expectations. Previous approaches for filtering out inappropriate user prompts or system outputs, such as LlamaGuard and OpenAI’s MOD API, have achieved significant success by fine-tuning existing LLMs. However, using fine-tuned LLMs as guardrails introduces increased latency and higher maintenance costs, which may not be practical or scalable for cost-efficient deployments. We take a different approach, focusing on fine-tuning a lightweight architecture: Sentence-BERT. This method reduces the model size from LlamaGuard’s 7 billion parameters to approximately 67 million, while maintaining comparable performance on the AEGIS safety benchmark.
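
To put the quoted parameter counts in perspective, the arithmetic below works out the size reduction. It is only a back-of-the-envelope sketch: the two counts come from the abstract, while the 2-bytes-per-parameter storage figure is an assumption (fp16/bf16 weights) that the abstract does not state.

```python
# Parameter counts as quoted in the abstract (both are approximate).
LLAMAGUARD_PARAMS = 7_000_000_000  # "7 billion"
SBERT_PARAMS = 67_000_000          # "approximately 67 million"


def reduction_factor(large: int, small: int) -> float:
    """How many times smaller the lightweight model is."""
    return large / small


def fp16_weight_gib(params: int) -> float:
    """Weight storage in GiB, ASSUMING 2 bytes per parameter (fp16/bf16)."""
    return params * 2 / 2**30


print(f"reduction: ~{reduction_factor(LLAMAGUARD_PARAMS, SBERT_PARAMS):.0f}x")
print(f"LlamaGuard weights: ~{fp16_weight_gib(LLAMAGUARD_PARAMS):.1f} GiB")
print(f"Sentence-BERT weights: ~{fp16_weight_gib(SBERT_PARAMS):.2f} GiB")
```

Under these assumptions the reduction is roughly 104x, or about 13 GiB of weights versus about 0.12 GiB.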

[NLP-2] POS-tagging to highlight the skeletal structure of sentences

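[Quick Read]: This paper develops a part-of-speech (POS) tagging model that extracts the skeletal structure of sentences, using transfer learning with the BERT architecture. The key to the solution is using a BERT model for token classification, fine-tuned on Russian text to demonstrate its effectiveness. The approach could benefit natural language processing tasks such as machine translation.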

Link: https://arxiv.org/abs/2411.14393
Authors: Grigorii Churakov
Keywords: token classification, study presents, presents the development, extract the skeletal, skeletal structure
Categories: Computation and Language (cs.CL)
Comments: In Russian. Conference: Automated control systems and information technologies this https URL Section: IT and automated systems

Abstract:This study presents the development of a part-of-speech (POS) tagging model to extract the skeletal structure of sentences using transfer learning with the BERT architecture for token classification. The model, fine-tuned on Russian text, demonstrates its effectiveness. The approach offers potential applications in enhancing natural language processing tasks, such as improving machine translation. Keywords: part of speech tagging, morphological analysis, natural language processing, BERT.
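
The idea of reducing a sentence to its skeleton from POS tags can be sketched as a simple filter. This is only an illustration under stated assumptions: the abstract does not say which tags define the skeleton, so `SKELETON_TAGS` is a hypothetical choice (Universal POS tag names), and the hand-tagged English sentence stands in for the output of a fine-tuned BERT token classifier.

```python
from typing import List, Set, Tuple

# HYPOTHETICAL choice of tags that make up the "skeleton";
# the paper's actual tag set is not given in the abstract.
SKELETON_TAGS: Set[str] = {"NOUN", "PROPN", "PRON", "VERB", "AUX"}


def skeleton(tagged: List[Tuple[str, str]], keep: Set[str] = SKELETON_TAGS) -> List[str]:
    """Keep only the tokens whose POS tag belongs to the skeleton set."""
    return [token for token, tag in tagged if tag in keep]


# Hand-written (token, tag) pairs standing in for model predictions.
tagged_sentence = [
    ("The", "DET"), ("quick", "ADJ"), ("fox", "NOUN"),
    ("quietly", "ADV"), ("crossed", "VERB"),
    ("the", "DET"), ("old", "ADJ"), ("bridge", "NOUN"), (".", "PUNCT"),
]

print(skeleton(tagged_sentence))  # ['fox', 'crossed', 'bridge']
```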

[NLP-3] UnifiedCrawl: Aggregated Common Crawl for Affordable Adaptation of LLMs on Low-Resource Languages

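[Quick Read]: This paper addresses the poor performance of large language models (LLMs) on low-resource languages, which stems mainly from limited training data. The key to the solution is UnifiedCrawl, a method that efficiently collects text for low-resource languages from the entire Common Crawl corpus, filtering and extracting it with minimal compute to produce monolingual datasets much larger than previously available sources. Fine-tuning multilingual LLMs on this data with an efficient adapter method (QLoRA) significantly boosts performance on the low-resource language while minimizing VRAM usage. Experiments show large improvements in language-modeling perplexity and higher few-shot prompting scores. The work offers an affordable way to improve LLMs for low-resource languages on consumer hardware.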

Link: https://arxiv.org/abs/2411.14343
Authors: Bethel Melesse Tessema(1), Akhil Kedia(2), Tae-Sun Chung(1) ((1) Ajou University, (2) Independent Researcher)
Keywords: limited training data, Large language models, due to limited, limited training, Common Crawl
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) under-perform on low-resource languages due to limited training data. We present a method to efficiently collect text data for low-resource languages from the entire Common Crawl corpus. Our approach, UnifiedCrawl, filters and extracts Common Crawl using minimal compute resources, yielding mono-lingual datasets much larger than previously available sources. We demonstrate that leveraging this data to fine-tune multilingual LLMs via efficient adapter methods (QLoRA) significantly boosts performance on the low-resource language, while minimizing VRAM usage. Our experiments show large improvements in language modeling perplexity and an increase in few-shot prompting scores. Our work and released source code provide an affordable approach to improve LLMs for low-resource languages using consumer hardware. Our source code is available at this https URL.

[NLP-4] Velocitune: A Velocity-based Dynamic Domain Reweighting Method for Continual Pre-training

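[Quick Read]: This paper addresses the complexities of domain-adaptive continual pre-training, in particular how to adjust the proportions of data from different domains dynamically to optimize training. The key to the solution is Velocitune, a new framework that dynamically assesses learning velocity and adjusts data proportions accordingly, favoring slower-learning domains while down-weighting faster-learning ones. The process is guided by a scaling law that indicates the desired learning goal for each domain at lower cost. Key factors behind its effectiveness include target loss prediction and data ordering.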

Link: https://arxiv.org/abs/2411.14318
Authors: Zheheng Luo, Xin Zhang, Xiao Liu, Haoling Li, Yeyun Gong, Chen Qi, Peng Cheng
Keywords: large language models, training large language, language models, large language, typically constructed
Categories: Computation and Language (cs.CL)
Comments: Work in progress

Abstract:It is well-known that a diverse corpus is critical for training large language models, which are typically constructed from a mixture of various domains. In general, previous efforts resort to sampling training data from different domains with static proportions, as well as adjusting data proportions during training. However, few methods have addressed the complexities of domain-adaptive continual pre-training. To fill this gap, we propose Velocitune, a novel framework that dynamically assesses learning velocity and adjusts data proportions accordingly, favoring slower-learning domains while shunning faster-learning ones, which is guided by a scaling law to indicate the desired learning goal for each domain with less associated cost. To evaluate the effectiveness of Velocitune, we conduct experiments on a reasoning-focused dataset with CodeLlama, as well as on a corpus specialised for system command generation with Llama3 and Mistral. Velocitune achieves performance gains in both math and code reasoning tasks and command-line generation benchmarks. Further analysis reveals that key factors driving Velocitune’s effectiveness include target loss prediction and data ordering.
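
The reweighting idea described above (favor domains that are learning slowly relative to a target loss) can be sketched as follows. This is a minimal illustration, not the paper's algorithm: the abstract gives no formulas, so the `remaining_gap` measure, the exponential update, the `temperature` parameter, and all loss values are assumptions made for the example.

```python
import math
from typing import Dict


def remaining_gap(initial: float, current: float, target: float) -> float:
    """Fraction of the initial-to-target loss gap still left.

    1.0 means no progress yet (slow learner), 0.0 means the target is reached.
    Assumes initial != target.
    """
    return (current - target) / (initial - target)


def reweight(
    weights: Dict[str, float],
    initial: Dict[str, float],
    current: Dict[str, float],
    target: Dict[str, float],
    temperature: float = 1.0,
) -> Dict[str, float]:
    """Scale each domain weight by exp(gap / temperature), then renormalize."""
    scores = {
        d: weights[d] * math.exp(remaining_gap(initial[d], current[d], target[d]) / temperature)
        for d in weights
    }
    total = sum(scores.values())
    return {d: s / total for d, s in scores.items()}


# Illustrative numbers only: "math" has closed 20% of its gap, "code" 80%.
new_weights = reweight(
    weights={"math": 0.5, "code": 0.5},
    initial={"math": 2.0, "code": 2.0},
    current={"math": 1.8, "code": 1.2},
    target={"math": 1.0, "code": 1.0},
)
print(new_weights)  # "math" receives the larger share
```

In this sketch the target losses are simply given; in the paper they are described as coming from a scaling law.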

[NLP-5] Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance

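[Quick Read]: This paper addresses hallucinations in large vision-language models (LVLMs) caused by language bias on vision-language tasks. The bias stems mainly from the different scales of training data between the LLM pretraining stage and the multimodal alignment stage, and from a learned inference bias due to the short-term dependency of text data. The key to the solution is LACING, a systemic framework that tackles language bias with a muLtimodal duAl-attention meChanIsm (MDA) aNd soft-image Guidance (IFG). MDA introduces a parallel dual-attention mechanism that strengthens the integration of visual inputs across the model. IFG introduces a learnable soft visual prompt during training and inference to replace visual inputs, designed to compel LVLMs to prioritize text inputs, and adds a new decoding strategy that reduces the model's over-reliance on adjacent text inputs. Experiments show that the method reduces language bias in LVLMs, improves visual comprehension, and reduces hallucinations without additional training resources or data.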

Link: https://arxiv.org/abs/2411.14279
Authors: Haozhe Zhao, Shuzheng Si, Liang Chen, Yichi Zhang, Maosong Sun, Mingjia Zhang, Baobao Chang
Keywords: Large vision-language models, Large vision-language, achieved impressive results, vision-language tasks, achieved impressive
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 19 pages, 12 figures

Abstract:Large vision-language models (LVLMs) have achieved impressive results in various vision-language tasks. However, despite showing promising performance, LVLMs suffer from hallucinations caused by language bias, leading to diminished focus on images and ineffective visual comprehension. We identify two primary reasons for this bias: 1. Different scales of training data between the pretraining stage of LLM and multimodal alignment stage. 2. The learned inference bias due to short-term dependency of text data. Therefore, we propose LACING, a systemic framework designed to address the language bias of LVLMs with muLtimodal duAl-attention meChanIsm (MDA) aNd soft-image Guidance (IFG). Specifically, MDA introduces a parallel dual-attention mechanism that enhances the integration of visual inputs across the model. IFG introduces a learnable soft visual prompt during training and inference to replace visual inputs, designed to compel LVLMs to prioritize text inputs. Then, IFG further proposes a novel decoding strategy using the soft visual prompt to mitigate the model’s over-reliance on adjacent text inputs. Comprehensive experiments demonstrate that our method effectively debiases LVLMs from their language bias, enhancing visual comprehension and reducing hallucinations without requiring additional training resources or data. The code and model are available at this https URL.

[NLP-6] Efficient Aspect-Based Summarization of Climate Change Reports with Small Language Models

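[Quick Read]: This paper addresses information extraction and summarization for climate change reports, in particular using Aspect-Based Summarization (ABS) systems to help decision-makers find key information quickly. The key to the solution is applying Large Language Models (LLMs) and Small Language Models (SLMs) to the ABS task in an unsupervised way, and evaluating them by applying, for the first time, an existing framework that considers both energy efficiency and task performance. The results show that modern language models, both big and small, can handle ABS for climate change reports effectively, but more research is needed when the problem is framed as a Retrieval Augmented Generation (RAG) problem. The paper also releases a new dataset to support research in this area.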

Link: https://arxiv.org/abs/2411.14272
Authors: Iacopo Ghinassi, Leonardo Catalano, Tommaso Colella
Keywords: Natural Language Processing, Climate Change action, Climate Change reports, NLP technologies, Climate Change
Categories: Computation and Language (cs.CL)
Comments:

Abstract:The use of Natural Language Processing (NLP) for helping decision-makers with Climate Change action has recently been highlighted as a use case aligning with a broader drive towards NLP technologies for social good. In this context, Aspect-Based Summarization (ABS) systems that extract and summarize relevant information are particularly useful as they provide stakeholders with a convenient way of finding relevant information in expert-curated reports. In this work, we release a new dataset for ABS of Climate Change reports and we employ different Large Language Models (LLMs) and so-called Small Language Models (SLMs) to tackle this problem in an unsupervised way. Considering the problem at hand, we also show how SLMs are not significantly worse for the problem while leading to reduced carbon footprint; we do so by applying for the first time an existing framework considering both energy efficiency and task performance to the evaluation of zero-shot generative models for ABS. Overall, our results show that modern language models, both big and small, can effectively tackle ABS for Climate Change reports but more research is needed when we frame the problem as a Retrieval Augmented Generation (RAG) problem and our work and dataset will help foster efforts in this direction.

[NLP-7] Knowledge Graphs, Large Language Models, and Hallucinations: An NLP Perspective

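[Quick Read]: This paper addresses hallucinations in Large Language Models (LLMs) used in NLP applications, where models produce plausible-sounding but factually incorrect responses that undermine trust and limit applicability. The key to the solution is using Knowledge Graphs (KGs), which provide structured, interconnected facts, as context to fill gaps in an LLM's understanding of certain topics. Combining KGs with LLMs is a promising way to mitigate hallucinations and improve reliability and accuracy while retaining wide applicability. The paper discusses the state of the art in datasets, benchmarks, knowledge integration methods, and hallucination evaluation, and identifies future research directions.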

Link: https://arxiv.org/abs/2411.14258
Authors: Ernests Lavrinovics, Russa Biswas, Johannes Bjerva, Katja Hose
Keywords: Natural Language Processing, Large Language Models, revolutionized Natural Language, Large Language, Language Processing
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 7 pages, 2 Figures, 1 Table

Abstract:Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP) based applications including automated text generation, question answering, chatbots, and others. However, they face a significant challenge: hallucinations, where models produce plausible-sounding but factually incorrect responses. This undermines trust and limits the applicability of LLMs in different domains. Knowledge Graphs (KGs), on the other hand, provide a structured collection of interconnected facts represented as entities (nodes) and their relationships (edges). In recent research, KGs have been leveraged to provide context that can fill gaps in an LLM understanding of certain topics offering a promising approach to mitigate hallucinations in LLMs, enhancing their reliability and accuracy while benefiting from their wide applicability. Nonetheless, it is still a very active area of research with various unresolved open problems. In this paper, we discuss these open challenges covering state-of-the-art datasets and benchmarks as well as methods for knowledge integration and evaluating hallucinations. In our discussion, we consider the current use of KGs in LLM systems and identify future directions within each of these challenges.

[NLP-8] Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models

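[Quick Read]: This paper addresses hallucinations in large language models, whose underlying mechanisms are poorly understood. The key is using sparse autoencoders as an interpretability tool to reveal an internal entity-recognition mechanism. Sparse autoencoders uncover meaningful directions in representation space that detect whether the model recognizes an entity, for example detecting that it does not know about an athlete or a movie. This suggests that models can have self-knowledge: internal representations about their own capabilities. The directions are causally relevant: they can steer the model to refuse to answer questions about known entities, or to hallucinate attributes of unknown entities when it would otherwise refuse. Although the sparse autoencoders are trained on the base model, the directions have a causal effect on the chat model's refusal behavior, suggesting that chat finetuning has repurposed this existing mechanism. An initial exploration of their mechanistic role finds that they disrupt the attention of downstream heads that typically move entity attributes to the final token.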

Link: https://arxiv.org/abs/2411.14257
Authors: Javier Ferrando, Oscar Obeso, Senthooran Rajamanoharan, Neel Nanda
Keywords: Hallucinations in large, large language models, widespread problem, poorly understood, limiting our ability
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Hallucinations in large language models are a widespread problem, yet the mechanisms behind whether models will hallucinate are poorly understood, limiting our ability to solve this problem. Using sparse autoencoders as an interpretability tool, we discover that a key part of these mechanisms is entity recognition, where the model detects if an entity is one it can recall facts about. Sparse autoencoders uncover meaningful directions in the representation space; these detect whether the model recognizes an entity, e.g. detecting that it doesn’t know about an athlete or a movie. This suggests that models can have self-knowledge: internal representations about their own capabilities. These directions are causally relevant: capable of steering the model to refuse to answer questions about known entities, or to hallucinate attributes of unknown entities when it would otherwise refuse. We demonstrate that despite the sparse autoencoders being trained on the base model, these directions have a causal effect on the chat model’s refusal behavior, suggesting that chat finetuning has repurposed this existing mechanism. Furthermore, we provide an initial exploration into the mechanistic role of these directions in the model, finding that they disrupt the attention of downstream heads that typically move entity attributes to the final token.

[NLP-9] Intent-Aware Dialogue Generation and Multi-Task Contrastive Learning for Multi-Turn Intent Classification

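[Quick Read]: This paper addresses the difficulty of generating large-scale, domain-specific, multilingual multi-turn dialogue datasets for training multi-turn intent classification models in chatbot systems. The key to the solution is Chain-of-Intent, a new mechanism that combines Hidden Markov Models with Large Language Models (LLMs) to generate context-aware, intent-driven conversations through self-play. Domain-specific knowledge extracted from e-commerce chat logs is used to estimate conversation turns and intent transitions, which guide the generation of coherent dialogues. With LLMs enhancing the emission probabilities, the method produces natural, contextually consistent questions and answers. The paper also proposes MINT-CL, a multi-turn intent classification framework based on multi-task contrastive learning that improves classification accuracy without requiring extensive annotated data.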

Link: https://arxiv.org/abs/2411.14252
Authors: Junhua Liu, Yong Keat Tan, Bin Fu, Kwan Hui Lim
Keywords: Generating large-scale, Hidden Markov Models, Large Language Models, combines Hidden Markov, chatbot systems
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Generating large-scale, domain-specific, multilingual multi-turn dialogue datasets remains a significant hurdle for training effective Multi-Turn Intent Classification models in chatbot systems. In this paper, we introduce Chain-of-Intent, a novel mechanism that combines Hidden Markov Models with Large Language Models (LLMs) to generate contextually aware, intent-driven conversations through self-play. By extracting domain-specific knowledge from e-commerce chat logs, we estimate conversation turns and intent transitions, which guide the generation of coherent dialogues. Leveraging LLMs to enhance emission probabilities, our approach produces natural and contextually consistent questions and answers. We also propose MINT-CL, a framework for multi-turn intent classification using multi-task contrastive learning, improving classification accuracy without the need for extensive annotated data. Evaluations show that our methods outperform baselines in dialogue quality and intent classification accuracy, especially in multilingual settings, while significantly reducing data generation efforts. Furthermore, we release MINT-E, a multilingual, intent-aware multi-turn e-commerce dialogue corpus to support future research in this area.
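
The Markov-chain part of this pipeline, sampling a sequence of intents from estimated transition probabilities, can be sketched as below. The intent names and every probability are hypothetical placeholders, not values from the paper; in the described method they would be estimated from e-commerce chat logs, and an LLM would then generate the utterance for each sampled intent (that step is omitted here).

```python
import random
from typing import Dict, List

# HYPOTHETICAL intents and probabilities, for illustration only.
INITIAL: Dict[str, float] = {"track_order": 0.5, "refund": 0.3, "product_info": 0.2}
TRANSITIONS: Dict[str, Dict[str, float]] = {
    "track_order": {"track_order": 0.2, "refund": 0.5, "product_info": 0.3},
    "refund": {"track_order": 0.3, "refund": 0.4, "product_info": 0.3},
    "product_info": {"track_order": 0.4, "refund": 0.1, "product_info": 0.5},
}


def sample(dist: Dict[str, float], rng: random.Random) -> str:
    """Draw one intent according to its probability."""
    intents = list(dist)
    return rng.choices(intents, weights=[dist[i] for i in intents], k=1)[0]


def sample_intent_chain(num_turns: int, rng: random.Random) -> List[str]:
    """Sample a chain of intents, one per turn (num_turns >= 1)."""
    chain = [sample(INITIAL, rng)]
    while len(chain) < num_turns:
        # The next intent depends only on the previous one (Markov property).
        chain.append(sample(TRANSITIONS[chain[-1]], rng))
    return chain


print(sample_intent_chain(4, random.Random(0)))
```

Passing a seeded `random.Random` keeps the sampled chains reproducible, which helps when comparing generated datasets.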

[NLP-10] Natural Language Reinforcement Learning

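[Quick Read]: This paper extends Reinforcement Learning (RL) to a natural-language representation space, proposing a new framework called Natural Language Reinforcement Learning (NLRL). The key to the solution is redefining the elements of the traditional Markov Decision Process (MDP) formulation, including task objectives, policy, value function, Bellman equation, and policy iteration, as natural-language counterparts. Building on recent advances in Large Language Models (LLMs), NLRL can achieve RL-like policy and value improvement through either pure prompting or gradient-based training. Experiments on Maze, Breakthrough, and Tic-Tac-Toe demonstrate its effectiveness, efficiency, and interpretability.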

Link: https://arxiv.org/abs/2411.14251
Authors: Xidong Feng, Ziyu Wan, Haotian Fu, Bo Liu, Mengyue Yang, Girish A. Koushik, Zhiyuan Hu, Ying Wen, Jun Wang
Keywords: Markov Decision Process, Decision Process, Markov Decision, mathematically formulates decision-making, Reinforcement Learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Extension of arXiv:2402.07157

Abstract:Reinforcement Learning (RL) mathematically formulates decision-making with Markov Decision Process (MDP). With MDPs, researchers have achieved remarkable breakthroughs across various domains, including games, robotics, and language models. This paper seeks a new possibility, Natural Language Reinforcement Learning (NLRL), by extending traditional MDP to natural language-based representation space. Specifically, NLRL innovatively redefines RL principles, including task objectives, policy, value function, Bellman equation, and policy iteration, into their language counterparts. With recent advancements in large language models (LLMs), NLRL can be practically implemented to achieve RL-like policy and value improvement by either pure prompting or gradient-based training. Experiments over Maze, Breakthrough, and Tic-Tac-Toe games demonstrate the effectiveness, efficiency, and interpretability of the NLRL framework among diverse use cases. Our code will be released at this https URL.

[NLP-11] Evaluating the Robustness of Analogical Reasoning in Large Language Models

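[Quick Read]: This paper evaluates the robustness of large language models (LLMs) on analogical reasoning tasks, that is, whether they retain their reasoning ability on variants of analogy problems that are likely dissimilar from their pre-training data. The key to the approach is a set of variant problems that test the same abstract reasoning abilities as the original analogy problems but are likely dissimilar from tasks in the pre-training data; robustness is assessed by comparing humans and GPT models on these variants. On simple letter-string analogies, human performance remains high while GPT models' performance declines sharply, indicating that LLMs often lack robustness in analogical reasoning. The work highlights the importance of evaluating AI systems for robustness as well as accuracy when testing their cognitive capabilities.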

Link: https://arxiv.org/abs/2411.14215
Authors: Martha Lewis, Melanie Mitchell
Keywords: GPT models, test analogical reasoning, analogical reasoning abilities, abstract reasoning, GPT
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 31 pages, 13 figures. arXiv admin note: text overlap with arXiv:2402.08955

Abstract:LLMs have performed well on several reasoning benchmarks, including ones that test analogical reasoning abilities. However, there is debate on the extent to which they are performing general abstract reasoning versus employing non-robust processes, e.g., that overly rely on similarity to pre-training data. Here we investigate the robustness of analogy-making abilities previously claimed for LLMs on three of four domains studied by Webb, Holyoak, and Lu (2023): letter-string analogies, digit matrices, and story analogies. For each domain we test humans and GPT models on robustness to variants of the original analogy problems that test the same abstract reasoning abilities but are likely dissimilar from tasks in the pre-training data. The performance of a system that uses robust abstract reasoning should not decline substantially on these variants. On simple letter-string analogies, we find that while the performance of humans remains high for two types of variants we tested, the GPT models’ performance declines sharply. This pattern is less pronounced as the complexity of these problems is increased, as both humans and GPT models perform poorly on both the original and variant problems requiring more complex analogies. On digit-matrix problems, we find a similar pattern but only on one out of the two types of variants we tested. On story-based analogy problems, we find that, unlike humans, the performance of GPT models is susceptible to answer-order effects, and that GPT models also may be more sensitive than humans to paraphrasing. This work provides evidence that LLMs often lack the robustness of zero-shot human analogy-making, exhibiting brittleness on most of the variations we tested. More generally, this work points to the importance of carefully evaluating AI systems not only for accuracy but also robustness when testing their cognitive capabilities.
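
To make "variants that test the same abstract reasoning" concrete, the sketch below solves a letter-string analogy of the form "abcd → abce; ijkl → ?" under different alphabet orderings. The abstract does not say which variants were used, so the reversed and shuffled alphabets here are assumptions chosen for illustration; the point is only that the same successor rule applies whatever the ordering.

```python
import random
import string

STANDARD = string.ascii_lowercase
REVERSED = STANDARD[::-1]


def shuffled_alphabet(seed: int) -> str:
    """Return a reproducible random permutation of the alphabet."""
    letters = list(STANDARD)
    random.Random(seed).shuffle(letters)
    return "".join(letters)


def successor_analogy(source: str, alphabet: str) -> str:
    """Replace the last letter with its successor in the given ordering.

    Raises IndexError if the last letter is the final one in `alphabet`.
    """
    position = alphabet.index(source[-1])
    return source[:-1] + alphabet[position + 1]


print(successor_analogy("ijkl", STANDARD))  # ijkm
print(successor_analogy("srqp", REVERSED))  # srqo

# Build a problem from a shuffled ordering: four consecutive letters in,
# the first three plus the fifth letter out.
perm = shuffled_alphabet(0)
print(perm[8:12], "->", successor_analogy(perm[8:12], perm))
```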

[NLP-12] OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs

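[Quick Read]: This paper addresses the difficulty researchers face in synthesizing a growing body of literature. The key to the solution is OpenScholar, a specialized retrieval-augmented LM that answers scientific queries by identifying relevant passages from 45 million open-access papers and synthesizing citation-backed responses. Its main contributions include: 1) ScholarQABench, the first large-scale multi-domain benchmark for literature search, used to evaluate model performance; 2) a datastore, retriever, and self-feedback inference loop that also improve off-the-shelf LMs (for instance, raising GPT-4o's correctness by 12%); 3) higher correctness than GPT-4o and PaperQA2 and citation accuracy on par with human experts, with experts preferring OpenScholar-8B and OpenScholar-GPT4o responses over expert-written ones 51% and 70% of the time, respectively.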

Link: https://arxiv.org/abs/2411.14199
Authors: Akari Asai, Jacqueline He, Rulin Shao, Weijia Shi, Amanpreet Singh, Joseph Chee Chang, Kyle Lo, Luca Soldaini, Sergey Feldman, Mike D’arcy, David Wadden, Matt Latzke, Minyang Tian, Pan Ji, Shengyan Liu, Hao Tong, Bohao Wu, Yanyu Xiong, Luke Zettlemoyer, Graham Neubig, Dan Weld, Doug Downey, Wen-tau Yih, Pang Wei Koh, Hannaneh Hajishirzi
Keywords: Scientific progress depends, progress depends, depends on researchers’, researchers’ ability, ability to synthesize
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Abstract:Scientific progress depends on researchers’ ability to synthesize the growing body of literature. Can large language models (LMs) assist scientists in this task? We introduce OpenScholar, a specialized retrieval-augmented LM that answers scientific queries by identifying relevant passages from 45 million open-access papers and synthesizing citation-backed responses. To evaluate OpenScholar, we develop ScholarQABench, the first large-scale multi-domain benchmark for literature search, comprising 2,967 expert-written queries and 208 long-form answers across computer science, physics, neuroscience, and biomedicine. On ScholarQABench, OpenScholar-8B outperforms GPT-4o by 5% and PaperQA2 by 7% in correctness, despite being a smaller, open model. While GPT-4o hallucinates citations 78 to 90% of the time, OpenScholar achieves citation accuracy on par with human experts. OpenScholar’s datastore, retriever, and self-feedback inference loop also improve off-the-shelf LMs: for instance, OpenScholar-GPT4o improves GPT-4o’s correctness by 12%. In human evaluations, experts preferred OpenScholar-8B and OpenScholar-GPT4o responses over expert-written ones 51% and 70% of the time, respectively, compared to GPT-4o’s 32%. We open-source all of our code, models, datastore, data and a public demo.

[NLP-13] Why do language models perform worse for morphologically complex languages?

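[Quick Read]: This paper investigates why language models perform differently across languages and whether morphological typology explains the gap. It tests whether morphological type (agglutinative vs. fusional) affects language-model performance, and examines three possible causes: morphological alignment of tokenizers, tokenization quality, and disparities in dataset size and measurement. There is some evidence that tokenization quality explains the gap, but none for morphological alignment. Instead, the gap shrinks most when training datasets are of equivalent size across language types after scaling by the "byte premium" (the different encoding efficiencies of different languages and orthographies). This suggests that performance differences are attributable to disparities in dataset size rather than morphological typology, a finding relevant to improving performance for low-performing and under-resourced languages.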

Link: https://arxiv.org/abs/2411.14198
Authors: Catherine Arnett, Benjamin K. Bergen
Keywords: models perform differently, perform differently, performance, performance gap, languages
Categories: Computation and Language (cs.CL)
Comments: 9 pages

Abstract:Language models perform differently across languages. It has been previously suggested that morphological typology may explain some of this variability (Cotterell et al., 2018). We replicate previous analyses and find additional new evidence for a performance gap between agglutinative and fusional languages, where fusional languages, such as English, tend to have better language modeling performance than morphologically more complex languages like Turkish. We then propose and test three possible causes for this performance gap: morphological alignment of tokenizers, tokenization quality, and disparities in dataset sizes and measurement. To test the morphological alignment hypothesis, we present MorphScore, a tokenizer evaluation metric, and supporting datasets for 22 languages. We find some evidence that tokenization quality explains the performance gap, but none for the role of morphological alignment. Instead we find that the performance gap is most reduced when training datasets are of equivalent size across language types, but only when scaled according to the so-called “byte-premium” – the different encoding efficiencies of different languages and orthographies. These results suggest that no language is harder or easier for a language model to learn on the basis of its morphological typology. Differences in performance can be attributed to disparities in dataset size. These results bear on ongoing efforts to improve performance for low-performing and under-resourced languages.
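The byte-premium scaling mentioned in the abstract can be illustrated with a minimal sketch. It assumes that a language's byte premium is the ratio of UTF-8 bytes needed to encode the same content relative to a reference language, estimated here on a single toy parallel sentence, and that dataset sizes are made comparable by multiplying the reference byte budget by that ratio. The function names and the sample sentences are illustrative and do not come from the paper.

```python
def byte_premium(text: str, reference_text: str) -> float:
    """Ratio of UTF-8 bytes used by `text` to bytes used by a parallel reference text."""
    return len(text.encode("utf-8")) / len(reference_text.encode("utf-8"))


def scaled_byte_budget(reference_bytes: int, premium: float) -> int:
    """Bytes of training data needed to match `reference_bytes` of reference-language content."""
    return round(reference_bytes * premium)


# Toy parallel pair (hypothetical sample; a real estimate needs a large parallel corpus).
english = "I am going home"
russian = "Я иду домой"

premium = byte_premium(russian, english)
budget = scaled_byte_budget(1_000_000, premium)
print(f"byte premium: {premium:.2f}, scaled budget: {budget} bytes")
```

Cyrillic letters take two bytes each in UTF-8, so the Russian sentence needs more bytes than its English counterpart even though it has fewer characters. Comparing datasets by raw byte count would therefore understate how much English content a model sees.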

[NLP-14] Visual Contexts Clarify Ambiguous Expressions: A Benchmark Dataset

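Quick read: This paper addresses the weak performance of existing vision-language models on indirect and ambiguous human communication. The key contribution is VAGUE, a multimodal benchmark of 3.9K indirect human utterances paired with corresponding scenes, designed to evaluate how well models understand indirect communication. The paper also contributes a model-based pipeline that generates prompt-solution pairs from input images. Together, these aim to advance models toward more refined, human-like interaction.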

Link: https://arxiv.org/abs/2411.14137
Authors: Heejeong Nam, Jinwoo Ahn
Keywords: Visual Question Answering, effectively interact, Question Answering, real-world scenarios, Visual Grounding
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract: The ability to perform complex reasoning across multimodal inputs is essential for models to effectively interact with humans in real-world scenarios. Advancements in vision-language models have significantly improved performance on tasks that require processing explicit and direct textual inputs, such as Visual Question Answering (VQA) and Visual Grounding (VG). However, less attention has been given to improving the model capabilities to comprehend nuanced and ambiguous forms of communication. This presents a critical challenge, as human language in real-world interactions often conveys hidden intentions that rely on context for accurate interpretation. To address this gap, we propose VAGUE, a multimodal benchmark comprising 3.9K indirect human utterances paired with corresponding scenes. Additionally, we contribute a model-based pipeline for generating prompt-solution pairs from input images. Our work aims to delve deeper into the ability of models to understand indirect communication and seeks to contribute to the development of models capable of more refined and human-like interactions. Extensive evaluation on multiple VLMs reveals that mainstream models still struggle with indirect communication when required to perform complex linguistic and visual reasoning. We release our code and data at this https URL.

[NLP-15] Learning from “Silly” Questions Improves Large Language Models But Only Slightly

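Quick read: This paper examines how data from a specific source (the “silly” questions on the Chinese website Ruozhiba) can improve the fine-tuning of large language models (LLMs) when building high-quality Supervised Fine-Tuning (SFT) datasets. The key step is to use GPT-4 to analyze successful Ruozhiba questions from the perspectives of education, psychology, and cognitive science, derive a set of explanatory rules, and apply those rules to the MMLU training set to construct fine-tuning datasets. The rules significantly improve performance on some tasks and reduce it on others, which indicates that task diversity and rule applicability need to be considered when constructing SFT datasets.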

Link: https://arxiv.org/abs/2411.14121
Authors: Tingyuan Zhu, Shudong Liu, Yidong Wang, Derek F. Wong, Han Yu, Takahiro Shinozaki, Jindong Wang
Keywords: high-quality Supervised Fine-Tuning, Constructing high-quality Supervised, high-quality Supervised, Supervised Fine-Tuning, large language models
Categories: Computation and Language (cs.CL)
Comments: 27 pages, 14 figures

Abstract:Constructing high-quality Supervised Fine-Tuning (SFT) datasets is critical for the training of large language models (LLMs). Recent studies have shown that using data from a specific source, Ruozhiba, a Chinese website where users ask “silly” questions to better understand certain topics, can lead to better fine-tuning performance. This paper aims to explore some hidden factors: the potential interpretations of its success and a large-scale evaluation of the performance. First, we leverage GPT-4 to analyze the successful cases of Ruozhiba questions from the perspective of education, psychology, and cognitive science, deriving a set of explanatory rules. Then, we construct fine-tuning datasets by applying these rules to the MMLU training set. Surprisingly, our results indicate that rules can significantly improve model performance in certain tasks, while potentially diminishing performance on others. For example, SFT data generated following the “Counterintuitive Thinking” rule can achieve approximately a 5% improvement on the “Global Facts” task, whereas the “Blurring the Conceptual Boundaries” rule leads to a performance drop of 6.14% on the “Econometrics” task. In addition, for specific tasks, different rules tend to have a consistent impact on model performance. This suggests that the differences between the extracted rules are not as significant, and the effectiveness of the rules is relatively consistent across tasks. Our research highlights the importance of considering task diversity and rule applicability when constructing SFT datasets to achieve more comprehensive performance improvements.

[NLP-16] Lost in Inference: Rediscovering the Role of Natural Language Inference for Large Language Models

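Quick read: This paper asks whether natural language inference (NLI) tasks, now rarely used to evaluate large language models (LLMs), are still informative for that purpose. The approach is to test six models of different scales and quality on five NLI benchmarks, checking whether the benchmarks discriminate between models and how accuracy develops during training. The study also measures how closely the models’ softmax distributions align with human label distributions on ambiguous or vague statements. The results show that NLI tasks discriminate well between models at different training stages and are not (all) saturated. The reported similarity between model and human label distributions increases with scale but remains much higher than the similarity between two populations of humans, which supports the potential value of NLI tasks for LLM evaluation.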

Link: https://arxiv.org/abs/2411.14103
Authors: Lovish Madaan, David Esiobu, Pontus Stenetorp, Barbara Plank, Dieuwke Hupkes
Keywords: natural language understanding, natural language inference, perform natural language, evaluating natural language, natural language
Categories: Computation and Language (cs.CL)
Comments: preprint, 13 pages

Abstract: In the recent past, a popular way of evaluating natural language understanding (NLU) was to consider a model’s ability to perform natural language inference (NLI) tasks. In this paper, we investigate whether NLI tasks, which are rarely used for LLM evaluation, can still be informative for evaluating LLMs. Focusing on five different NLI benchmarks across six models of different scales, we investigate whether they are able to discriminate models of different size and quality and how their accuracies develop during training. Furthermore, we investigate the extent to which the softmax distributions of models align with human distributions in cases where statements are ambiguous or vague. Overall, our results paint a positive picture for the NLI tasks: we find that they are able to discriminate well between models at various stages of training, yet are not (all) saturated. Furthermore, we find that while the similarity of model distributions with human label distributions increases with scale, it is still much higher than the similarity between two populations of humans, making it a potentially interesting statistic to consider.
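Comparing a model's softmax distribution with a human label distribution can be sketched as follows. The abstract does not name the measure it uses, so Jensen-Shannon divergence is an assumption here, chosen because it is a common symmetric way to compare two label distributions. The logits and annotator counts are invented for illustration.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical distributions, at most 1."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical item with labels (entailment, neutral, contradiction).
model_probs = softmax([2.0, 1.0, -1.0])
human_votes = [60, 35, 5]  # invented annotator counts for one ambiguous item
human_probs = [v / sum(human_votes) for v in human_votes]

print(f"JSD(model, human) = {js_divergence(model_probs, human_probs):.4f}")
```

A lower divergence means the model spreads its probability over the labels in roughly the same way the annotators did. The same function applied to two groups of annotators would give a human-to-human baseline for comparison.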

[NLP-17] Meaning at the Planck scale? Contextualized word embeddings for doing history, philosophy, and sociology of science

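Quick read: This paper studies the contextual and evolving meanings of scientific concepts for research in the history, philosophy, and sociology of science (HPSS). The key is to use contextualized word embeddings (CWEs), specifically BERT-based models with domain-specific pretraining, to disambiguate scientific terms such as “Planck” and predict their meanings. Using two labeled datasets (the Astro-HEP-Planck Corpus and a physics-related Wikipedia dataset), the paper evaluates five BERT models with varying degrees of domain-specific pretraining, including the custom Astro-HEP-BERT. The domain-adapted models outperform general-purpose ones at disambiguating the target term, predicting its meanings, and generating high-quality sense clusters, and the approach reveals semantic shifts of the target term in the unlabeled Astro-HEP Corpus. The study underscores the importance of domain-specific pretraining for analyzing scientific language, shows that adapting pretrained models is cost-effective for HPSS research, and offers a scalable, transferable method for modeling the meanings of scientific concepts.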

Link: https://arxiv.org/abs/2411.14073
Authors: Arno Simons
Keywords: contextualized word embeddings, Astro-HEP Corpus, unlabeled Astro-HEP Corpus, word embeddings, sociology of science
Categories: Computation and Language (cs.CL); History and Philosophy of Physics (physics.hist-ph)
Comments: 18 pages, 7 figures (1 in the Supplement)

Abstract:This paper explores the potential of contextualized word embeddings (CWEs) as a new tool in the history, philosophy, and sociology of science (HPSS) for studying contextual and evolving meanings of scientific concepts. Using the term “Planck” as a test case, I evaluate five BERT-based models with varying degrees of domain-specific pretraining, including my custom model Astro-HEP-BERT, trained on the Astro-HEP Corpus, a dataset containing 21.84 million paragraphs from 600,000 articles in astrophysics and high-energy physics. For this analysis, I compiled two labeled datasets: (1) the Astro-HEP-Planck Corpus, consisting of 2,900 labeled occurrences of “Planck” sampled from 1,500 paragraphs in the Astro-HEP Corpus, and (2) a physics-related Wikipedia dataset comprising 1,186 labeled occurrences of “Planck” across 885 paragraphs. Results demonstrate that the domain-adapted models outperform the general-purpose ones in disambiguating the target term, predicting its known meanings, and generating high-quality sense clusters, as measured by a novel purity indicator I developed. Additionally, this approach reveals semantic shifts in the target term over three decades in the unlabeled Astro-HEP Corpus, highlighting the emergence of the Planck space mission as a dominant sense. The study underscores the importance of domain-specific pretraining for analyzing scientific language and demonstrates the cost-effectiveness of adapting pretrained models for HPSS research. By offering a scalable and transferable method for modeling the meanings of scientific concepts, CWEs open up new avenues for investigating the socio-historical dynamics of scientific discourses.
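The abstract measures sense-cluster quality with a novel purity indicator whose definition is not given here. As a rough point of reference only, the sketch below computes the standard cluster purity score, which is an assumption and not the author's indicator: each cluster is credited with its most frequent sense label, and the credited counts are divided by the total number of items. The cluster assignments and sense labels are invented.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, sense_labels):
    """Fraction of items whose sense label matches the majority label of their cluster."""
    if len(cluster_ids) != len(sense_labels):
        raise ValueError("cluster_ids and sense_labels must have the same length")
    if not cluster_ids:
        raise ValueError("at least one item is required")
    by_cluster = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, sense_labels):
        by_cluster[cluster][label] += 1
    # Credit each cluster with the count of its most common label.
    majority_total = sum(counts.most_common(1)[0][1] for counts in by_cluster.values())
    return majority_total / len(cluster_ids)


# Hypothetical labels for six occurrences of "Planck" grouped into two clusters.
clusters = [0, 0, 0, 1, 1, 1]
senses = ["person", "person", "constant", "mission", "mission", "mission"]
print(round(cluster_purity(clusters, senses), 3))
```

A purity of 1.0 means every cluster contains a single sense. Standard purity rises trivially as the number of clusters grows, which is one reason a study might prefer a custom indicator.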

[NLP-18] The Master-Slave Encoder Model for Improving Patent Text Summarization: A New Approach to Combining Specifications and Claims

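Quick read: This paper addresses three problems in traditional patent abstract generation: insufficient generation quality, out-of-vocabulary (OOV) new terminology, and redundant information arising from the highly specialized, precise, and unique nature of patent texts. The key is MSEA, a patent text abstract generation model built on a master-slave encoder architecture. The model takes both the specification and the claims of a patent as input and uses the master-slave encoder to explore the features and details shared between them. It uses a pointer network to better handle new terms in the input sequence, re-weights the “remembered” and “forgotten” parts of the encoder’s input sequence to strengthen relevance to the input text, and adds an enhanced repetition suppression mechanism to keep abstracts accurate and free of redundancy. On a public patent text dataset, MSEA improves on the state-of-the-art IMHAM model by 0.006, 0.005, and 0.005 in Rouge-1, Rouge-2, and Rouge-L, respectively.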

Link: https://arxiv.org/abs/2411.14072
Authors: Shu Zhou, Xin Wang, Zhengda Zhou, Haohan Yi, Xuhui Zheng, Hao Wan
Keywords: terminology OOV caused, patent text abstract, input sequence based, master-slave encoder architecture, text abstract generation
Categories: Computation and Language (cs.CL); Programming Languages (cs.PL)
Comments: 25 pages, 1 figure

Abstract: To solve three problems, namely the insufficient generation quality caused by traditional patent text abstract generation models drawing only on patent specifications, the new-terminology OOV problem caused by rapid patent updates, and the information redundancy caused by insufficient consideration of the high professionalism, accuracy, and uniqueness of patent texts, we propose a patent text abstract generation model (MSEA) based on a master-slave encoder architecture. First, the MSEA model designs a master-slave encoder, which combines the specification in the patent text with the claims as input and fully explores the characteristics and details between the two. Then, the model enhances the consideration of new technical terms in the input sequence based on the pointer network, and further enhances the correlation with the input text by re-weighting the “remembered” and “forgotten” parts of the input sequence from the encoder. Finally, an enhanced repetition suppression mechanism for patent text is introduced to ensure that the generated abstracts are accurate and non-redundant. On a publicly available patent text dataset, compared to the state-of-the-art model, Improved Multi-Head Attention Mechanism (IMHAM), the MSEA model achieves an improvement of 0.006, 0.005, and 0.005 in Rouge-1, Rouge-2, and Rouge-L scores, respectively. MSEA leverages the characteristics of patent texts to effectively enhance the quality of patent text generation, demonstrating its advancement and effectiveness in the experiments.
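To show why a pointer mechanism helps with out-of-vocabulary terms, here is a minimal sketch of the mixing step used in a common pointer-generator formulation. It is an assumption that MSEA uses something similar, since the abstract only says the model builds on the pointer network. The output distribution blends a fixed-vocabulary softmax with a copy distribution given by attention over source tokens. All names, words, and numbers below are hypothetical.

```python
def pointer_generator_mix(p_gen, vocab_probs, attention, source_tokens):
    """Blend generation and copy distributions into one distribution over words.

    p_gen: probability of generating from the fixed vocabulary (between 0 and 1).
    vocab_probs: dict mapping word -> probability under the vocabulary softmax.
    attention: attention weights over source positions (summing to 1).
    source_tokens: source words aligned with `attention`.
    """
    final = {word: p_gen * prob for word, prob in vocab_probs.items()}
    for token, weight in zip(source_tokens, attention):
        # Copy mass goes to the source word, even if it is out of vocabulary.
        final[token] = final.get(token, 0.0) + (1.0 - p_gen) * weight
    return final


# Hypothetical decoding step: "memristor" is not in the vocabulary but appears in the source.
vocab_probs = {"a": 0.5, "device": 0.3, "<unk>": 0.2}
source_tokens = ["a", "memristor", "device"]
attention = [0.1, 0.7, 0.2]

dist = pointer_generator_mix(0.4, vocab_probs, attention, source_tokens)
print(max(dist, key=dist.get), round(sum(dist.values()), 6))
```

Because the copy distribution is defined over source positions rather than vocabulary entries, a newly coined technical term in the patent text can receive probability mass without ever appearing in the training vocabulary.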

[NLP-19] MMGenBench: Evaluating the Limits of LMMs from the Text-to-Image Generation Perspective

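Quick read: This paper addresses the lack of evaluation of Large Multimodal Models (LMMs) from the image generation perspective; existing benchmarks focus mainly on image comprehension. The key is a straightforward automated evaluation pipeline: the LMM generates an image-prompt from an input image, a text-to-image generative model creates a new image from that prompt, and the LMM is scored by comparing the original image with the generated one. The paper also introduces two benchmarks: MMGenBench-Test, which evaluates LMMs across 13 distinct image patterns, and MMGenBench-Domain, which evaluates them within the generative image domain. An evaluation of more than 50 popular LMMs supports the effectiveness and reliability of the pipeline and benchmarks and shows substantial room for improvement in image understanding and description.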

Link: https://arxiv.org/abs/2411.14062
Authors: Hailang Huang, Yong Wang, Zixuan Huang, Huaqiu Li, Tongwen Huang, Xiangxiang Chu, Richong Zhang
Keywords: Large Multimodal Models, Large Multimodal, demonstrated remarkable capabilities, Multimodal Models, remarkable capabilities
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: This project is available at: this https URL

Abstract:Large Multimodal Models (LMMs) have demonstrated remarkable capabilities. While existing benchmarks for evaluating LMMs mainly focus on image comprehension, few works evaluate them from the image generation perspective. To address this issue, we propose a straightforward automated evaluation pipeline. Specifically, this pipeline requires LMMs to generate an image-prompt from a given input image. Subsequently, it employs text-to-image generative models to create a new image based on these generated prompts. Finally, we evaluate the performance of LMMs by comparing the original image with the generated one. Furthermore, we introduce MMGenBench-Test, a comprehensive benchmark developed to evaluate LMMs across 13 distinct image patterns, and MMGenBench-Domain, targeting the performance evaluation of LMMs within the generative image domain. A thorough evaluation involving over 50 popular LMMs demonstrates the effectiveness and reliability in both the pipeline and benchmark. Our observations indicate that numerous LMMs excelling in existing benchmarks fail to adequately complete the basic tasks, related to image understanding and description. This finding highlights the substantial potential for performance improvement in current LMMs and suggests avenues for future model optimization. Concurrently, our pipeline facilitates the efficient assessment of LMMs performance across diverse domains by using solely image inputs.
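The three-stage pipeline in the abstract (image to prompt, prompt to image, then comparison) can be sketched with pluggable components. Everything concrete here is an assumption: the abstract does not say how the two images are compared, so the sketch uses cosine similarity between feature vectors, and `describe`, `generate`, and `embed` are hypothetical stand-ins for an LMM, a text-to-image model, and an image encoder.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def evaluate_lmm(image, describe: Callable, generate: Callable, embed: Callable) -> float:
    """Score an LMM by how closely the regenerated image matches the original.

    describe: LMM under test, image -> image-prompt text.
    generate: text-to-image model, prompt -> image.
    embed: image encoder, image -> feature vector.
    """
    prompt = describe(image)
    regenerated = generate(prompt)
    return cosine_similarity(embed(image), embed(regenerated))


# Toy stand-ins so the sketch runs: an "image" is simply a list of floats.
original = [1.0, 0.0, 1.0]
score = evaluate_lmm(
    original,
    describe=lambda img: " ".join(str(v) for v in img),
    generate=lambda prompt: [float(tok) for tok in prompt.split()],
    embed=lambda img: img,
)
print(round(score, 3))
```

The design keeps the LMM as the only component under test: a description that drops detail yields a regenerated image further from the original and therefore a lower score, without needing any reference captions.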

[NLP-20] DRPruning: Efficient Large Language Model Pruning through Distributionally Robust Optimization

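Quick read: This paper addresses the uneven performance degradation across domains that structured pruning causes when it is used to reduce the size and computational cost of large language models (LLMs). The key is DRPruning, which incorporates distributionally robust optimization to restore balanced performance across domains, along with further improvements to enhance robustness. In monolingual and multilingual settings, DRPruning surpasses similarly sized pruned and continued-pretraining models on perplexity, downstream tasks, and instruction tuning. The method also automatically determines optimal reference losses and data ratios, suggesting potential for broader applications.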

Link: https://arxiv.org/abs/2411.14055
Authors: Hexuan Deng, Wenxiang Jiao, Xuebo Liu, Min Zhang, Zhaopeng Tu
Keywords: Large language models, deliver impressive results, Large language, increasing model sizes, deliver impressive
Categories: Computation and Language (cs.CL)
Comments: Work in Progress

Abstract:Large language models (LLMs) deliver impressive results but face challenges from increasing model sizes and computational costs. Structured pruning reduces model size and speeds up inference but often causes uneven degradation across domains, leading to biased performance. To address this, we propose DRPruning, which incorporates distributionally robust optimization to restore balanced performance across domains, along with further improvements to enhance robustness. Experiments in monolingual and multilingual settings show that our method surpasses similarly sized models in pruning and continued pretraining over perplexity, downstream tasks, and instruction tuning. We further provide analysis demonstrating the robustness of our method towards various domains and distribution shifts. Furthermore, our method automatically determines optimal reference losses and data ratios, suggesting potential for broader applications. Our code is available at this https URL.
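The idea of using distributionally robust optimization to rebalance domains can be sketched with a generic exponentiated-gradient weight update of the kind used in group DRO. This is an assumption about the general technique, not DRPruning's actual procedure: domains whose loss exceeds a reference loss gain weight, so later training steps focus on them. The domain names, losses, and reference losses are invented.

```python
import math


def update_domain_weights(weights, losses, reference_losses, step_size=1.0):
    """One exponentiated-gradient step: upweight domains whose loss exceeds its reference."""
    scaled = [
        w * math.exp(step_size * (loss - ref))
        for w, loss, ref in zip(weights, losses, reference_losses)
    ]
    total = sum(scaled)
    return [s / total for s in scaled]


def robust_objective(weights, losses):
    """Weighted training loss minimized by the model under the current domain weights."""
    return sum(w * loss for w, loss in zip(weights, losses))


# Hypothetical per-domain losses after pruning; "code" degraded most relative to its reference.
domains = ["web", "code", "books"]
weights = [1 / 3, 1 / 3, 1 / 3]
losses = [2.1, 3.0, 2.4]
reference_losses = [2.0, 2.2, 2.3]

weights = update_domain_weights(weights, losses, reference_losses)
print(dict(zip(domains, (round(w, 3) for w in weights))))
print(round(robust_objective(weights, losses), 3))
```

The reference losses matter because domains differ in intrinsic difficulty: weighting by raw loss would always favor the hardest domain, while weighting by excess loss targets the domain that pruning hurt most.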

[NLP-21] FunctionChat-Bench: Comprehensive Evaluation of Language Models Generative Capabilities in Korean Tool-use Dialogs

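Quick read: This paper evaluates the generative capabilities of language models in tool-use dialogs, particularly in multi-turn settings. The key is FunctionChat-Bench, a benchmark of 700 evaluation items with automated assessment programs that categorizes model outputs into four types: Tool Call, Answer Completion, Slot Question, and Relevance Detection. Using the benchmark, the paper finds that high accuracy in single-turn Tool Call scenarios does not necessarily translate to strong generative performance in multi-turn dialogs. It therefore argues that, beyond generating tool call messages, models must also generate effective conversational messages that engage the user.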

Link: https://arxiv.org/abs/2411.14054
Authors: Shinbok Lee, Gaeun Seo, Daniel Lee, Byeongil Ko, Sunghee Jung, Myeongcheol Shin
Keywords: study investigates language, Answer Completion, Slot Question, tool-use dialogs, Relevance Detection
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 8 pages

Abstract:This study investigates language models’ generative capabilities in tool-use dialogs. We categorize the models’ outputs in tool-use dialogs into four distinct types: Tool Call, Answer Completion, Slot Question, and Relevance Detection, which serve as aspects for evaluation. We introduce FunctionChat-Bench, comprising 700 evaluation items and automated assessment programs. Using this benchmark, we evaluate several language models that support function calling. Our findings indicate that while language models may exhibit high accuracy in single-turn Tool Call scenarios, this does not necessarily translate to superior generative performance in multi-turn environments. We argue that the capabilities required for function calling extend beyond generating tool call messages; they must also effectively generate conversational messages that engage the user.

[NLP-22] Forecasting Future International Events: A Reliable Dataset for Text-Based Event Modeling EMNLP2024

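Quick read: This paper addresses the limited quality of existing text datasets for predicting international events. The key is WORLDREP, a new dataset that leverages the advanced reasoning capabilities of large language models (LLMs) to generate high-quality scoring labels, which are rigorously validated by domain experts in political science. The paper demonstrates the dataset’s quality and utility on real-world event prediction tasks and publicly releases the dataset together with its automation source code to support research in text-based event prediction.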

Link: https://arxiv.org/abs/2411.14042
Authors: Daehoon Gwak, Junwoo Park, Minho Park, Chaehun Park, Hyunchan Lee, Edward Choi, Jaegul Choo
Keywords: Predicting future international, Predicting future, future international events, strategic decision-making, textual information
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: EMNLP 2024 Findings

Abstract:Predicting future international events from textual information, such as news articles, has tremendous potential for applications in global policy, strategic decision-making, and geopolitics. However, existing datasets available for this task are often limited in quality, hindering the progress of related research. In this paper, we introduce WORLDREP (WORLD Relationship and Event Prediction), a novel dataset designed to address these limitations by leveraging the advanced reasoning capabilities of large-language models (LLMs). Our dataset features high-quality scoring labels generated through advanced prompt modeling and rigorously validated by domain experts in political science. We showcase the quality and utility of WORLDREP for real-world event prediction tasks, demonstrating its effectiveness through extensive experiments and analysis. Furthermore, we publicly release our dataset along with the full automation source code for data collection, labeling, and benchmarking, aiming to support and advance research in text-based event prediction.

[NLP-23] Logic Augmented Generation

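Quick read: This paper addresses the dichotomy between Semantic Knowledge Graphs (SKGs) and Large Language Models (LLMs): SKGs struggle with flexibility and unstructured information, while LLMs lack interpretability and reliability. The proposed solution is Logic Augmented Generation (LAG), which combines the structured knowledge of SKGs with the generative capabilities of LLMs. LAG uses LLMs as Reactive Continuous Knowledge Graphs that can generate potentially infinite relations and tacit knowledge on demand, while SKGs inject a discrete heuristic dimension with clear logical and factual boundaries. The aim is to provide interpretable and effective results in collective intelligence tasks such as medical diagnostics and climate projections.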

Link: https://arxiv.org/abs/2411.14012
Authors: Aldo Gangemi, Andrea Giovanni Nuzzolese
Keywords: Semantic Knowledge Graphs, Semantic Knowledge, face challenges, challenges with scalability, ambiguous information
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 10 pages, 2 figures

Abstract:Semantic Knowledge Graphs (SKG) face challenges with scalability, flexibility, contextual understanding, and handling unstructured or ambiguous information. However, they offer formal and structured knowledge enabling highly interpretable and reliable results by means of reasoning and querying. Large Language Models (LLMs) overcome those limitations making them suitable in open-ended tasks and unstructured environments. Nevertheless, LLMs are neither interpretable nor reliable. To solve the dichotomy between LLMs and SKGs we envision Logic Augmented Generation (LAG) that combines the benefits of the two worlds. LAG uses LLMs as Reactive Continuous Knowledge Graphs that can generate potentially infinite relations and tacit knowledge on-demand. SKGs are key for injecting a discrete heuristic dimension with clear logical and factual boundaries. We exemplify LAG in two tasks of collective intelligence, i.e., medical diagnostics and climate projections. Understanding the properties and limitations of LAG, which are still mostly unknown, is of utmost importance for enabling a variety of tasks involving tacit knowledge in order to provide interpretable and effective results.

[NLP-24] Sentiment Analysis of Economic Text: A Lexicon-Based Approach

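Quick read: This paper addresses the lack of a dedicated Economic Lexicon (EL) for textual analysis in economics. The key is to build a lexicon with wide coverage and human-annotated sentiment scores in the range [-1,1]. The lexicon covers terms commonly used in documents discussing economic concepts and categorizes word sentiment more accurately, so it outperforms other lexicons in textual applications in economics.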

Link: https://arxiv.org/abs/2411.13958
Authors: Luca Barbaglia, Sergio Consoli, Sebastiano Manzan, Luca Tiozzo Pezzoli, Elisa Tosetti
Keywords: specifically designed, designed for textual, discussing economic concepts, textual applications, documents discussing economic
Categories: Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 37 pages, 9 figures, 6 tables, in press

Abstract:We propose an Economic Lexicon (EL) specifically designed for textual applications in economics. We construct the dictionary with two important characteristics: 1) to have a wide coverage of terms used in documents discussing economic concepts, and 2) to provide a human-annotated sentiment score in the range [-1,1]. We illustrate the use of the EL in the context of a simple sentiment measure and consider several applications in economics. The comparison to other lexicons shows that the EL is superior due to its wider coverage of domain relevant terms and its more accurate categorization of the word sentiment.
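A lexicon-based sentiment measure of the kind the abstract describes can be sketched in a few lines. The abstract does not define its "simple sentiment measure", so averaging the scores of matched terms is an assumption, and the lexicon entries and their values below are invented for illustration rather than taken from the EL.

```python
import re

# Toy lexicon with scores in [-1, 1]; entries and values are invented for illustration.
TOY_LEXICON = {
    "growth": 0.6,
    "recovery": 0.5,
    "recession": -0.8,
    "unemployment": -0.6,
    "inflation": -0.4,
}


def sentiment_score(text, lexicon):
    """Average lexicon score of matched terms; 0.0 when no term matches."""
    tokens = re.findall(r"[a-z]+", text.lower())
    scores = [lexicon[t] for t in tokens if t in lexicon]
    return sum(scores) / len(scores) if scores else 0.0


sentence = "Growth slowed as inflation and unemployment rose."
print(round(sentiment_score(sentence, TOY_LEXICON), 3))
```

The sketch also shows why coverage matters: any economic term missing from the lexicon is silently ignored, so a lexicon with few domain terms produces scores based on very little of the text.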

[NLP-25] Towards Full Delegation: Designing Ideal Agentic Behaviors for Travel Planning

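Quick read: This paper asks how agents based on large language models (LLMs) can fully take over humans’ routine decision-making and be trusted to find solutions that fit personalized needs and adapt to ever-changing contexts. The key is the APEC Agent Constitution, a list of criteria for good agentic behavior: Accuracy, Proactivity, Efficiency, and Credibility. Under these criteria, agents are evaluated not only on what they achieve (outcome evaluation) but also on how they achieve it (procedure evaluation). To verify that APEC aligns with human preferences, the paper develops APEC-Travel, a travel planning agent that proactively extracts travelers’ hidden personalized needs through multi-round dialog. The agent is built from synthetic data and iteratively fine-tuned, and it surpasses baselines on both rule-based metrics and LLM-as-a-Judge scores.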

Link: https://arxiv.org/abs/2411.13904
Authors: Song Jiang, Da JU, Andrew Cohen, Sasha Mitts, Aaron Foss, Justine T Kao, Xian Li, Yuandong Tian
Keywords: APEC Agent Constitution, APEC Agent, LLM-based agents, Agent Constitution, agents
Categories: Computation and Language (cs.CL)
Comments:

Abstract: How will LLM-based agents be used in the future? While much of the existing work on agents has focused on improving the performance of a specific family of objective and challenging tasks, in this work, we take a different perspective by thinking about full delegation: agents take over humans’ routine decision-making processes and are trusted by humans to find solutions that fit people’s personalized needs and are adaptive to ever-changing context. In order to achieve such a goal, the behavior of the agents, i.e., agentic behaviors, should be evaluated not only on their achievements (i.e., outcome evaluation), but also on how they achieved them (i.e., procedure evaluation). For this, we propose APEC Agent Constitution, a list of criteria that an agent should follow for good agentic behaviors, including Accuracy, Proactivity, Efficiency and Credibility. To verify whether APEC aligns with human preferences, we develop APEC-Travel, a travel planning agent that proactively extracts hidden personalized needs via multi-round dialog with travelers. APEC-Travel is constructed purely from synthetic data generated by Llama3.1-405B-Instruct with a diverse set of travelers’ personas to simulate a rich distribution of dialogs. Iteratively fine-tuned to follow APEC Agent Constitution, APEC-Travel surpasses baselines by 20.7% on rule-based metrics and 9.1% on LLM-as-a-Judge scores across the constitution axes.

[NLP-26] PIORS: Personalized Intelligent Outpatient Reception based on Large Language Model with Multi-Agents Medical Scenario Simulation

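Quick read: This paper addresses the overwhelming workload of outpatient receptionist nurses in China, which reduces service quality. The key is the Personalized Intelligent Outpatient Reception System (PIORS), which integrates an LLM-based reception nurse and a collaboration between the LLM and the hospital information system (HIS) to deliver personalized, high-quality, and efficient reception services. To improve LLM performance in real-world healthcare scenarios, the paper also proposes Service Flow aware Medical Scenario Simulation (SFMSS), a data generation framework that adapts the LLM to real-world environments and the PIORS setting. In automatic and human evaluations, PIORS-Nurse outperforms all baselines, including the current state-of-the-art GPT-4o, and aligns with human preferences and clinical needs.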

Link: https://arxiv.org/abs/2411.13902
Authors: Zhijie Bao, Qingyun Liu, Ying Guo, Zhengqiang Ye, Jun Shen, Shirong Xie, Jiajie Peng, Xuanjing Huang, Zhongyu Wei
Keywords: face overwhelming workloads, reducing service quality, receptionist nurses face, ultimately reducing service, nurses face overwhelming
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:In China, receptionist nurses face overwhelming workloads in outpatient settings, limiting their time and attention for each patient and ultimately reducing service quality. In this paper, we present the Personalized Intelligent Outpatient Reception System (PIORS). This system integrates an LLM-based reception nurse and a collaboration between LLM and hospital information system (HIS) into real outpatient reception setting, aiming to deliver personalized, high-quality, and efficient reception services. Additionally, to enhance the performance of LLMs in real-world healthcare scenarios, we propose a medical conversational data generation framework named Service Flow aware Medical Scenario Simulation (SFMSS), aiming to adapt the LLM to the real-world environments and PIORS settings. We evaluate the effectiveness of PIORS and SFMSS through automatic and human assessments involving 15 users and 15 clinical experts. The results demonstrate that PIORS-Nurse outperforms all baselines, including the current state-of-the-art model GPT-4o, and aligns with human preferences and clinical needs. Further details and demo can be found at this https URL

[NLP-27] Robust Detection of Watermarks for Large Language Models Under Human Edits

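[Quick read]: This paper addresses how to detect watermarks in text generated by large language models (LLMs) after humans have edited it. The key is a new truncated goodness-of-fit test (Tr-GoF) that models human edits through mixture model detection. Tr-GoF achieves robust detection of the Gumbel-max watermark without requiring precise knowledge of human edit levels or the probabilistic specifications of the LLMs, in particular when text modifications are substantial and watermark signals are weak. Compared with the sum-based detection rules used by existing methods, the Tr-GoF statistic is more resilient to edit-induced noise, which gives it greater robustness and detection efficiency.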

Link: https://arxiv.org/abs/2411.13868
Authors: Xiang Li, Feng Ruan, Huiyuan Wang, Qi Long, Weijie J. Su
Keywords (EN): Watermarking has offered, large language models, distinguishing text generated, human edits, offered an effective
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Statistics Theory (math.ST); Methodology (stat.ME); Machine Learning (stat.ML)
Comments:

Abstract: Watermarking has offered an effective approach to distinguishing text generated by large language models (LLMs) from human-written text. However, the pervasive presence of human edits on LLM-generated text dilutes watermark signals, thereby significantly degrading detection performance of existing methods. In this paper, by modeling human edits through mixture model detection, we introduce a new method in the form of a truncated goodness-of-fit test for detecting watermarked text under human edits, which we refer to as Tr-GoF. We prove that the Tr-GoF test achieves optimality in robust detection of the Gumbel-max watermark in a certain asymptotic regime of substantial text modifications and vanishing watermark signals. Importantly, Tr-GoF achieves this optimality adaptively as it does not require precise knowledge of human edit levels or probabilistic specifications of the LLMs, in contrast to the optimal but impractical (Neyman–Pearson) likelihood ratio test. Moreover, we establish that the Tr-GoF test attains the highest detection efficiency rate in a certain regime of moderate text modifications. In stark contrast, we show that sum-based detection rules, as employed by existing methods, fail to achieve optimal robustness in both regimes because the additive nature of their statistics is less resilient to edit-induced noise. Finally, we demonstrate the competitive and sometimes superior empirical performance of the Tr-GoF test on both synthetic data and open-source LLMs in the OPT and LLaMA families.
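
To make the contrast between sum-based rules and goodness-of-fit tests concrete, here is a minimal sketch. It assumes each token yields a pivotal statistic `y` that is uniform on [0, 1) for human-written text and skewed toward 1 for watermarked text. The truncated statistic below is a generic higher-criticism-style stand-in chosen for illustration, not the paper's exact Tr-GoF statistic, and the function names and the `p_floor` parameter are hypothetical.

```python
import math
import random


def sum_statistic(pivots):
    """Sum-based rule: add up -log(1 - y) over tokens; larger suggests a watermark."""
    return sum(-math.log(1.0 - y) for y in pivots)


def truncated_hc_statistic(pivots, p_floor=None):
    """Compare the empirical CDF of p-values (1 - y) with the uniform CDF.

    P-values below p_floor are skipped (the truncation). This is a
    higher-criticism-style illustration, not the paper's Tr-GoF.
    """
    n = len(pivots)
    p_floor = 1.0 / n if p_floor is None else p_floor
    pvals = sorted(1.0 - y for y in pivots)
    best = 0.0
    for i, p in enumerate(pvals, start=1):
        if p < p_floor or p >= 1.0:
            continue  # truncated region or degenerate variance
        z = math.sqrt(n) * (i / n - p) / math.sqrt(p * (1.0 - p))
        best = max(best, z)
    return best


if __name__ == "__main__":
    random.seed(0)
    human = [random.random() for _ in range(500)]
    # Assumed edit model: only 20% of tokens still carry the watermark signal.
    edited = [random.random() ** 0.1 if random.random() < 0.2 else random.random()
              for _ in range(500)]
    for name, pivots in (("human", human), ("edited", edited)):
        print(name, sum_statistic(pivots), truncated_hc_statistic(pivots))
```

The sum rule averages signal and noise over all tokens, whereas the goodness-of-fit statistic reacts to an excess of small p-values even when most tokens have been edited.
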

[NLP-28] HARec: Hyperbolic Graph-LLM Alignment for Exploration and Exploitation in Recommender Systems

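[Quick read]: This paper tackles the information cocoon problem in recommender systems, where users are exposed to a limited diversity of content and their experience suffers. The key is to build systems that balance content exploration and exploitation and let users adjust their recommendation preferences. The proposed solution, HARec, has two core innovations: (1) a hierarchical-aware graph-LLM alignment mechanism that better captures hierarchical structure, and (2) a hyperbolic hierarchical tree structure that lets users adjust the exploration-exploitation trade-off. With these, HARec outperforms existing Euclidean and hyperbolic methods on both utility and diversity metrics.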

Link: https://arxiv.org/abs/2411.13865
Authors: Qiyao Ma, Menglin Yang, Mingxuan Ju, Tong Zhao, Neil Shah, Rex Ying
Keywords (EN): Modern recommendation systems, limiting users’ exposure, create information cocoons, Modern recommendation, limiting users’
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Modern recommendation systems often create information cocoons, limiting users’ exposure to diverse content. To enhance user experience, a crucial challenge is developing systems that can balance content exploration and exploitation, allowing users to adjust their recommendation preferences. Intuitively, this balance can be achieved through a tree-structured representation, where depth search facilitates exploitation and breadth search enables exploration. However, current works face two challenges to achieve this target: (1) Euclidean methods fail to fully capture hierarchical structures and lack flexibility in balancing exploration-exploitation, while (2) hyperbolic approaches, despite better hierarchical modeling, suffer from insufficient semantic alignment due to their reliance on Euclidean text encoders. To address these challenges, we propose HARec, a hyperbolic representation learning framework that jointly aligns user-item collaborative information with textual descriptions in hyperbolic space. Our framework introduces two key technique novelty: (1) a hierarchical-aware graph-llm alignment mechanism that enables better hierarchical representation, and (2) a hyperbolic hierarchical tree structure that facilitates user-adjustable exploration-exploitation trade-offs. Extensive experiments demonstrate that HARec consistently outperforms both Euclidean and hyperbolic baselines, achieving up to 5.49% improvement in utility metrics and 11.39% increase in diversity metrics.
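
As background for why hyperbolic space suits tree-like hierarchies, the sketch below computes the geodesic distance in the Poincaré ball. The abstract does not say which model of hyperbolic space HARec uses, so the Poincaré ball is an assumption here, and `poincare_distance` is a hypothetical helper name.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit ball.

    d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    """
    def sq_norm(x):
        return sum(a * a for a in x)

    diff = sq_norm([a - b for a, b in zip(u, v)])
    denom = (1.0 - sq_norm(u)) * (1.0 - sq_norm(v))
    return math.acosh(1.0 + 2.0 * diff / denom)


if __name__ == "__main__":
    # Points near the boundary are far apart even when their Euclidean gap is
    # modest, which leaves room for many well-separated leaf nodes.
    print(poincare_distance((0.1, 0.0), (-0.1, 0.0)))
    print(poincare_distance((0.9, 0.0), (-0.9, 0.0)))
```

Because distances grow rapidly toward the boundary, general items can sit near the origin and specific ones near the edge, which matches the depth-versus-breadth intuition in the abstract.
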

[NLP-29] Interactive and Expressive Code-Augmented Planning with Large Language Models

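[Quick read]: This paper addresses the weakness of Large Language Models (LLMs) on complex, long-horizon planning tasks, and the limits of purely code-based approaches when handling ambiguous or unstructured data. The key is REPL-Plan, a planning approach that is both fully code-expressive (it can use all the benefits of code) and dynamic. In REPL-Plan the LLM solves tasks by interacting with a Read-Eval-Print Loop (REPL), which iteratively executes and evaluates code much like a language shell or an interactive code notebook, so the model can flexibly correct errors and handle tasks dynamically. The approach achieves strong results across several planning domains compared with previous methods.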

Link: https://arxiv.org/abs/2411.13826
Authors: Anthony Z. Liu, Xinhe Wang, Jacob Sansom, Yao Fu, Jongwook Choi, Sungryull Sohn, Jaekyeom Kim, Honglak Lee
Keywords (EN): Large Language Models, Large Language, abilities in common-sense, common-sense reasoning, long-horizon planning tasks
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Large Language Models (LLMs) demonstrate strong abilities in common-sense reasoning and interactive decision-making, but often struggle with complex, long-horizon planning tasks. Recent techniques have sought to structure LLM outputs using control flow and other code-adjacent techniques to improve planning performance. These techniques include using variables (to track important information) and functions (to divide complex tasks into smaller re-usable sub-tasks). However, purely code-based approaches can be error-prone and insufficient for handling ambiguous or unstructured data. To address these challenges, we propose REPL-Plan, an LLM planning approach that is fully code-expressive (it can utilize all the benefits of code) while also being dynamic (it can flexibly adapt from errors and use the LLM for fuzzy situations). In REPL-Plan, an LLM solves tasks by interacting with a Read-Eval-Print Loop (REPL), which iteratively executes and evaluates code, similar to language shells or interactive code notebooks, allowing the model to flexibly correct errors and handle tasks dynamically. We demonstrate that REPL-Plan achieves strong results across various planning domains compared to previous methods.
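
The interaction pattern described above can be sketched as a loop that runs model-proposed code in a persistent namespace and feeds the output, or the error, back to the model. This is an illustrative sketch only: `run_in_repl`, `plan_with_repl`, and the scripted stand-in for the LLM are hypothetical names, not the paper's implementation.

```python
import contextlib
import io
import traceback


def run_in_repl(code, namespace):
    """Execute one snippet in a persistent namespace; return (ok, output)."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)
        return True, buffer.getvalue()
    except Exception:
        return False, traceback.format_exc(limit=1)


def plan_with_repl(propose, max_steps=5):
    """Alternate between a code proposer and the REPL until it returns None."""
    namespace, feedback, history = {}, "", []
    for _ in range(max_steps):
        code = propose(feedback)
        if code is None:
            break
        ok, feedback = run_in_repl(code, namespace)
        history.append((code, ok, feedback))
    return namespace, history


def make_scripted_model():
    """Stand-in for an LLM: the first attempt has a bug, the second repairs it."""
    state = {"step": 0}

    def propose(feedback):
        state["step"] += 1
        if state["step"] == 1:
            return "total = sum(prices)"  # fails: prices is not defined yet
        if "NameError" in feedback:
            return "prices = [3, 4]\ntotal = sum(prices)"
        return None  # nothing left to do

    return propose


if __name__ == "__main__":
    namespace, history = plan_with_repl(make_scripted_model())
    print(namespace["total"], [ok for _, ok, _ in history])
```

Keeping variables alive between steps is what lets a later snippet build on, or repair, an earlier one. A real system would sandbox `exec` rather than run untrusted code directly.
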

[NLP-30] InstCache: A Predictive Cache for LLM Serving

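[Quick read]: This paper addresses the high compute intensity of large language models (LLMs), which leads to long latency and a large energy footprint. The key is InstCache, a predictive cache: an instruction-aligned LLM predicts likely user instructions, and the cache is pre-populated with them, reducing repeated computation and improving response time. At its core is an instruction pre-population algorithm based on the negative log likelihood of instructions, which determines the cache size with regard to the hit rate. The cache is implemented as a hash table with minimal lookup latency. Experiments show that InstCache reaches a hit rate of up to 51.34% on the LMSys dataset, corresponding to a 2x speedup, at a memory cost of only 4.5GB.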

Link: https://arxiv.org/abs/2411.13820
Authors: Longwei Zou, Tingfeng Liu, Kai Chen, Jiangang Kong, Yangdong Deng
Keywords (EN): Large language models, human life, language models, models are revolutionizing, revolutionizing every aspect
Categories: Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:

Abstract:Large language models are revolutionizing every aspect of human life. However, the unprecedented power comes at the cost of significant computing intensity, suggesting long latency and large energy footprint. Key-Value Cache and Semantic Cache have been proposed as a solution to the above problem, but both suffer from limited scalability due to significant memory cost for each token or instruction embeddings. Motivated by the observations that most instructions are short, repetitive and predictable by LLMs, we propose to predict user-instructions by an instruction-aligned LLM and store them in a predictive cache, so-called InstCache. We introduce an instruction pre-population algorithm based on the negative log likelihood of instructions, determining the cache size with regard to the hit rate. The proposed InstCache is efficiently implemented as a hash table with minimal lookup latency for deployment. Experimental results show that InstCache can achieve up to 51.34% hit rate on LMSys dataset, which corresponds to a 2x speedup, at a memory cost of only 4.5GB.
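
A rough sketch of the idea: candidate instructions with a low negative log likelihood (NLL) are answered ahead of time and stored in a hash table, so that an exact match at serving time skips generation. The `PredictiveCache` class, the `normalize` step, and the way candidates and probabilities are supplied are all assumptions made for illustration; the abstract does not describe these details.

```python
import math


def normalize(instruction):
    """Hypothetical normalization so trivially different strings share a key."""
    return " ".join(instruction.lower().split())


class PredictiveCache:
    def __init__(self, nll_threshold):
        # A looser threshold admits more instructions: larger cache, higher hit rate.
        self.nll_threshold = nll_threshold
        self.table = {}  # Python dict, i.e. a hash table

    def prepopulate(self, candidates, answer_fn):
        """candidates: iterable of (instruction, probability under the predictor)."""
        for instruction, prob in candidates:
            if -math.log(prob) <= self.nll_threshold:
                self.table[normalize(instruction)] = answer_fn(instruction)

    def lookup(self, instruction):
        """Return the cached answer, or None on a miss (fall back to the LLM)."""
        return self.table.get(normalize(instruction))


if __name__ == "__main__":
    cache = PredictiveCache(nll_threshold=3.0)
    cache.prepopulate(
        [("Tell me a joke", 0.2), ("Recite page 412 of my diary", 1e-6)],
        answer_fn=lambda text: f"<precomputed answer to: {text}>",
    )
    print(cache.lookup("tell me a  joke"))
    print(cache.lookup("Recite page 412 of my diary"))
```

The threshold is the knob that trades memory for hit rate, which corresponds to the abstract's statement that the pre-population algorithm determines the cache size with regard to the hit rate.
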

[NLP-31] SemiKong: Curating Training and Evaluating A Semiconductor Industry-Specific Large Language Model

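[Quick read]: This paper addresses the lack of specialized knowledge in Large Language Models (LLMs) applied to the semiconductor industry, in particular the intricate physics and chemistry of semiconductor devices and processes. The key is SemiKong, the first industry-specific LLM for this domain, built on three main contributions: (a) curating a comprehensive corpus of semiconductor-related texts; (b) creating a foundational model with in-depth semiconductor knowledge; and (c) introducing a framework for integrating expert knowledge, which advances the evaluation of domain-specific AI models. Fine-tuned from a pre-trained LLM on the curated dataset, SemiKong outperforms larger general-purpose LLMs on various semiconductor manufacturing and design tasks, underscoring the value of domain-specific LLMs as a foundation for company- or tool-specific proprietary models.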

Link: https://arxiv.org/abs/2411.13802
Authors: Christopher Nguyen, William Nguyen, Atsushi Suzuki, Daisuke Oku, Hong An Phan, Sang Dinh, Zooey Nguyen, Anh Ha, Shruti Raghavan, Huy Vo, Thang Nguyen, Lan Nguyen, Yoshikuni Hirayama
Keywords (EN): Large Language Models, Large Language, Language Models, demonstrated the potential, potential to address
Categories: Computation and Language (cs.CL)
Comments: On-going work

Abstract: Large Language Models (LLMs) have demonstrated the potential to address some issues within the semiconductor industry. However, they are often general-purpose models that lack the specialized knowledge needed to tackle the unique challenges of this sector, such as the intricate physics and chemistry of semiconductor devices and processes. SemiKong, the first industry-specific LLM for the semiconductor domain, provides a foundation that can be used to develop tailored proprietary models. With SemiKong 1.0, we aim to develop a foundational model capable of understanding etching problems at an expert level. Our key contributions include (a) curating a comprehensive corpus of semiconductor-related texts, (b) creating a foundational model with in-depth semiconductor knowledge, and (c) introducing a framework for integrating expert knowledge, thereby advancing the evaluation process of domain-specific AI models. Through fine-tuning a pre-trained LLM using our curated dataset, we have shown that SemiKong outperforms larger, general-purpose LLMs in various semiconductor manufacturing and design tasks. Our extensive experiments underscore the importance of developing domain-specific LLMs as a foundation for company- or tool-specific proprietary models, paving the way for further research and applications in the semiconductor domain. Code and dataset will be available at this https URL

[NLP-32] Explaining GPT-4's Schema of Depression Using Machine Behavior Analysis

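[Quick read]: This paper asks how large language models such as GPT-4, increasingly used for mental health support, internally associate and interpret the symptoms of depression. The key is to apply contemporary measurement theory to decode how GPT-4 interrelates depressive symptoms, informing both clinical utility and theoretical understanding. The results show that GPT-4's assessment of depression has high overall convergent validity and moderately high internal consistency, but that it underemphasizes the relationship of suicidality, and overemphasizes that of psychomotor symptoms, with other symptoms. GPT-4's symptom inference patterns also suggest nuanced hypotheses, for example that sleep and fatigue are influenced by most other symptoms while feelings of worthlessness/guilt are mostly influenced by depressed mood.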

Link: https://arxiv.org/abs/2411.13800
Authors: Adithya V Ganesan, Vasudha Varadarajan, Yash Kumar Lal, Veerle C. Eijsbroek, Katarina Kjell, Oscar N.E. Kjell, Tanuja Dhanasekaran, Elizabeth C. Stade, Johannes C. Eichstaedt, Ryan L. Boyd, H. Andrew Schwartz, Lucie Flek
Keywords (EN): large language models, mental health support, grown rapidly, large language, language models
Categories: Computation and Language (cs.CL)
Comments: 21 pages, 3 tables, 6 figures, 1 supplementary table, 83 references

Abstract: Use of large language models such as ChatGPT (GPT-4) for mental health support has grown rapidly, emerging as a promising route to assess and help people with mood disorders, like depression. However, we have a limited understanding of GPT-4’s schema of mental disorders, that is, how it internally associates and interprets symptoms. In this work, we leveraged contemporary measurement theory to decode how GPT-4 interrelates depressive symptoms to inform both clinical utility and theoretical understanding. We found GPT-4’s assessment of depression: (a) had high overall convergent validity (r = .71 with self-report on 955 samples, and r = .81 with experts judgments on 209 samples); (b) had moderately high internal consistency (symptom inter-correlates r = .23 to .78) that largely aligned with literature and self-report; except that GPT-4 (c) underemphasized suicidality’s – and overemphasized psychomotor’s – relationship with other symptoms, and (d) had symptom inference patterns that suggest nuanced hypotheses (e.g. sleep and fatigue are influenced by most other symptoms while feelings of worthlessness/guilt is mostly influenced by depressed mood).

[NLP-33] NewsInterview: a Dataset and a Playground to Evaluate LLMs' Ground Gap via Informational Interviews

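[Quick read]: This paper addresses the gap between the ability of Large Language Models (LLMs) to generate coherent text and their weakness in grounding language and strategic dialogue. The authors analyze 40,000 two-person informational interviews from NPR and CNN and find that LLMs are much less likely than human interviewers to use acknowledgements and to pivot to higher-level questions, pointing to a fundamental deficit in multi-turn planning and strategic thinking. The key to the solution is a realistic simulated environment that incorporates source personas and persuasive elements, intended to support the development of agents with longer-horizon rewards. In this environment, interviewer LLMs struggle to recognize when questions have been answered and to engage persuasively, which leads to suboptimal information extraction and underlines the need to strengthen LLMs' strategic dialogue capabilities.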

Link: https://arxiv.org/abs/2411.13779
Authors: Michael Lu, Hyundong Justin Cho, Weiyan Shi, Jonathan May, Alexander Spangher
Keywords (EN): Large Language Models, generating coherent text, Large Language, demonstrated impressive capabilities, grounding language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Large Language Models (LLMs) have demonstrated impressive capabilities in generating coherent text but often struggle with grounding language and strategic dialogue. To address this gap, we focus on journalistic interviews, a domain rich in grounding communication and abundant in data. We curate a dataset of 40,000 two-person informational interviews from NPR and CNN, and reveal that LLMs are significantly less likely than human interviewers to use acknowledgements and to pivot to higher-level questions. Realizing that a fundamental deficit exists in multi-turn planning and strategic thinking, we develop a realistic simulated environment, incorporating source personas and persuasive elements, in order to facilitate the development of agents with longer-horizon rewards. Our experiments show that while source LLMs mimic human behavior in information sharing, interviewer LLMs struggle with recognizing when questions are answered and engaging persuasively, leading to suboptimal information extraction across model size and capability. These findings underscore the need for enhancing LLMs’ strategic dialogue capabilities.

[NLP-34] Benchmarking GPT-4 against Human Translators: A Comprehensive Evaluation Across Languages, Domains, and Expertise Levels

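[Quick read]: This paper evaluates GPT-4 on translation and compares it with human translators of varying expertise levels. The key is a systematic human evaluation under the MQM (Multidimensional Quality Metrics) schema, covering three language pairs (Chinese↔English, Russian↔English, Chinese↔Hindi) and three domains (News, Technology, and Biomedical). GPT-4's total error count is comparable to that of junior translators but still behind senior translators. Unlike traditional Neural Machine Translation systems, GPT-4 maintains consistent quality in resource-poor language directions. Qualitative analysis shows that GPT-4 tends toward overly literal translations and lexical inconsistency, while human translators sometimes over-interpret context and introduce hallucinations. The study presents itself as the first systematic comparison between an LLM and human translators across proficiency levels.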

Link: https://arxiv.org/abs/2411.13775
Authors: Jianhao Yan, Pingchuan Yan, Yulong Chen, Jing Li, Xianchao Zhu, Yue Zhang
Keywords (EN): varying expertise levels, presents a comprehensive, varying expertise, English
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Work in progress

Abstract: This study presents a comprehensive evaluation of GPT-4’s translation capabilities compared to human translators of varying expertise levels. Through systematic human evaluation using the MQM schema, we assess translations across three language pairs (Chinese ↔ English, Russian ↔ English, and Chinese ↔ Hindi) and three domains (News, Technology, and Biomedical). Our findings reveal that GPT-4 achieves performance comparable to junior-level translators in terms of total errors, while still lagging behind senior translators. Unlike traditional Neural Machine Translation systems, which show significant performance degradation in resource-poor language directions, GPT-4 maintains consistent translation quality across all evaluated language pairs. Through qualitative analysis, we identify distinctive patterns in translation approaches: GPT-4 tends toward overly literal translations and exhibits lexical inconsistency, while human translators sometimes over-interpret context and introduce hallucinations. This study represents the first systematic comparison between LLM and human translators across different proficiency levels, providing valuable insights into the current capabilities and limitations of LLM-based translation systems.

[NLP-35] A Framework for Evaluating LLMs Under Task Indeterminacy NEURIPS2024

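[Quick read]: This paper addresses the evaluation bias that task indeterminacy introduces into large language model (LLM) evaluation. Task indeterminacy arises from ambiguity (the task gives too little information to identify a unique interpretation) and vagueness (the task does not clearly indicate where to draw the line), either of which can leave some items in the evaluation corpus with more than one correct response. The key to the solution is an evaluation framework that disentangles the relationships between task specification, human ratings, and LLM responses, so that indeterminacy is accounted for during evaluation. A synthetic experiment shows that evaluations relying on the "gold label" assumption underestimate true performance, and the paper provides a method for estimating an error-adjusted performance interval given partial knowledge about which items are indeterminate.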

Link: https://arxiv.org/abs/2411.13760
Authors: Luke Guerdan, Hanna Wallach, Solon Barocas, Alexandra Chouldechova
Keywords (EN): Large language model, Large language, single correct response, language model, evaluation corpus
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: To Appear in NeurIPS 2024 Workshops on Evaluating Evaluations (EvalEval) and Statistical Foundations of LLMs and Foundation Models (SFLLM)

Abstract:Large language model (LLM) evaluations often assume there is a single correct response – a gold label – for each item in the evaluation corpus. However, some tasks can be ambiguous – i.e., they provide insufficient information to identify a unique interpretation – or vague – i.e., they do not clearly indicate where to draw the line when making a determination. Both ambiguity and vagueness can cause task indeterminacy – the condition where some items in the evaluation corpus have more than one correct response. In this paper, we develop a framework for evaluating LLMs under task indeterminacy. Our framework disentangles the relationships between task specification, human ratings, and LLM responses in the LLM evaluation pipeline. Using our framework, we conduct a synthetic experiment showing that evaluations that use the “gold label” assumption underestimate the true performance. We also provide a method for estimating an error-adjusted performance interval given partial knowledge about indeterminate items in the evaluation corpus. We conclude by outlining implications of our work for the research community.
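
A toy illustration of why the gold-label assumption can underestimate performance: when an item admits several correct responses, scoring against a single label marks some valid answers as wrong. The crude interval below is only a bound under stated assumptions, not the paper's error-adjusted estimator, and all function names and data are hypothetical.

```python
def gold_label_accuracy(responses, gold):
    """Accuracy when each item is assumed to have exactly one correct answer."""
    return sum(r == g for r, g in zip(responses, gold)) / len(gold)


def set_based_accuracy(responses, acceptable):
    """Accuracy when each item carries its full set of correct answers."""
    return sum(r in ok for r, ok in zip(responses, acceptable)) / len(acceptable)


def crude_performance_interval(responses, gold, maybe_indeterminate):
    """Bound accuracy when we only know which items might be indeterminate.

    Lower bound: gold-label accuracy. Upper bound: additionally count every
    gold-label miss on a possibly indeterminate item as correct.
    """
    n = len(gold)
    hits = sum(r == g for r, g in zip(responses, gold))
    rescued = sum(r != g and flag
                  for r, g, flag in zip(responses, gold, maybe_indeterminate))
    return hits / n, (hits + rescued) / n


if __name__ == "__main__":
    responses = ["A", "B", "C", "A"]
    gold = ["A", "A", "C", "B"]
    acceptable = [{"A"}, {"A", "B"}, {"C"}, {"B"}]  # item 2 is indeterminate
    print(gold_label_accuracy(responses, gold))
    print(set_based_accuracy(responses, acceptable))
    print(crude_performance_interval(responses, gold, [False, True, False, False]))
```

In this toy corpus the second item accepts both "A" and "B", so the response that the gold label counts as an error is in fact correct.
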

[NLP-36] Assessing Gender Bias in LLMs: Comparing LLM Outputs with Human Perceptions and Official Statistics COLING

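[Quick read]: This paper investigates gender bias in large language models (LLMs) by comparing their gender perception with that of human respondents, U.S. Bureau of Labor Statistics data, and a 50% no-bias benchmark. The key is a newly created evaluation set built from occupational data and role-specific sentences; because it is not one of the common benchmarks found in training data, it avoids data leakage and test set contamination. Five LLMs were asked to predict the gender for each role with single-word answers, and Kullback-Leibler (KL) divergence was used to compare model outputs with human perceptions, statistical data, and the 50% neutrality benchmark. All LLMs deviated significantly from gender neutrality and aligned more closely with the statistical data, while still reflecting inherent biases.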

Link: https://arxiv.org/abs/2411.13738
Authors: Tetiana Bas
Keywords (EN): Labor Statistics data, Bureau of Labor, Labor Statistics, large language models, study investigates gender
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: under review for Coling conference

Abstract:This study investigates gender bias in large language models (LLMs) by comparing their gender perception to that of human respondents, U.S. Bureau of Labor Statistics data, and a 50% no-bias benchmark. We created a new evaluation set using occupational data and role-specific sentences. Unlike common benchmarks included in LLM training data, our set is newly developed, preventing data leakage and test set contamination. Five LLMs were tested to predict the gender for each role using single-word answers. We used Kullback-Leibler (KL) divergence to compare model outputs with human perceptions, statistical data, and the 50% neutrality benchmark. All LLMs showed significant deviation from gender neutrality and aligned more with statistical data, still reflecting inherent biases.
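
The comparison can be illustrated with the KL divergence between two Bernoulli distributions, treating each role as a probability that the predicted gender is female. The abstract does not give the exact setup or the direction of the divergence, so both are assumptions here, and the numbers in the demo are invented for illustration rather than taken from the paper.

```python
import math


def kl_bernoulli(p, q, eps=1e-9):
    """KL(P || Q) in nats for Bernoulli distributions with success rates p and q."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


if __name__ == "__main__":
    model_share = 0.90      # illustrative: share of "female" answers for one role
    statistics_share = 0.85  # illustrative reference value, not real data
    print("vs 50% benchmark:", kl_bernoulli(model_share, 0.5))
    print("vs statistics:   ", kl_bernoulli(model_share, statistics_share))
```

A model that is far from the 50% benchmark but close to the occupational statistics yields a large first divergence and a small second one, which is the pattern the abstract reports.
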

[NLP-37] Test Security in Remote Testing Age: Perspectives from Process Data Analytics and AI

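[Quick read]: This paper addresses test security in remotely proctored high-stakes assessments. The key is to apply data analytics and AI methods to clickstream process data, which gives deeper insight into the test-taking process and can help secure remotely administered high-stakes tests. The chapter uses real-world examples to demonstrate this potential.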

Link: https://arxiv.org/abs/2411.13699
Authors: Jiangang Hao, Michael Fauss
Keywords (EN): proctored high-stake assessments, pandemic has accelerated, accelerated the implementation, implementation and acceptance, remotely proctored high-stake
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: 23 pages, 8 figures

Abstract:The COVID-19 pandemic has accelerated the implementation and acceptance of remotely proctored high-stake assessments. While the flexible administration of the tests brings forth many values, it raises test security-related concerns. Meanwhile, artificial intelligence (AI) has witnessed tremendous advances in the last five years. Many AI tools (such as the very recent ChatGPT) can generate high-quality responses to test items. These new developments require test security research beyond the statistical analysis of scores and response time. Data analytics and AI methods based on clickstream process data can get us deeper insight into the test-taking process and hold great promise for securing remotely administered high-stakes tests. This chapter uses real-world examples to show that this is indeed the case.

[NLP-38] Retrieval-Augmented Generation for Domain-Specific Question Answering: A Case Study on Pittsburgh and CMU

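[Quick read]: This paper designs a Retrieval-Augmented Generation (RAG) system that supplies large language models with relevant documents so they can accurately answer domain-specific questions about Pittsburgh and Carnegie Mellon University (CMU). The key points are: 1) a greedy scraping strategy that extracted over 1,800 subpages; 2) a hybrid annotation process combining manual and Mistral-generated question-answer pairs, with an inter-annotator agreement (IAA) score of 0.7625; and 3) a combination of BM25 and FAISS retrievers with a reranker to improve document retrieval accuracy. The RAG system significantly outperforms a non-RAG baseline, particularly on time-sensitive and complex queries, raising the F1 score from 5.45% to 42.21% with a recall of 56.18%.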

Link: https://arxiv.org/abs/2411.13691
Authors: Haojia Sun, Yaqi Wang, Shuting Zhang
Keywords (EN): Carnegie Mellon University, provide large language, answering domain-specific questions, Mellon University, Pittsburgh and Carnegie
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:We designed a Retrieval-Augmented Generation (RAG) system to provide large language models with relevant documents for answering domain-specific questions about Pittsburgh and Carnegie Mellon University (CMU). We extracted over 1,800 subpages using a greedy scraping strategy and employed a hybrid annotation process, combining manual and Mistral-generated question-answer pairs, achieving an inter-annotator agreement (IAA) score of 0.7625. Our RAG framework integrates BM25 and FAISS retrievers, enhanced with a reranker for improved document retrieval accuracy. Experimental results show that the RAG system significantly outperforms a non-RAG baseline, particularly in time-sensitive and complex queries, with an F1 score improvement from 5.45% to 42.21% and recall of 56.18%. This study demonstrates the potential of RAG systems in enhancing answer precision and relevance, while identifying areas for further optimization in document retrieval and model training.
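
One common way to combine a lexical ranking (such as BM25) with a dense ranking (such as a FAISS index) is reciprocal rank fusion. The abstract does not say how the two retrievers are merged before reranking, so the fusion method, the constant `k=60`, and the document ids below are assumptions for illustration only.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in;
    ties are broken by document id so the output is deterministic.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    lexical = ["d1", "d2", "d3"]  # hypothetical BM25 output
    dense = ["d3", "d1", "d4"]    # hypothetical dense-retriever output
    print(reciprocal_rank_fusion([lexical, dense]))
```

The fused list would then be passed to the reranker, which rescores only the top candidates. Rank-based fusion avoids having to calibrate BM25 scores against vector similarities.
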

[NLP-39] Hierarchical Text Classification (HTC) vs. eXtreme Multilabel Classification (XML): Two Sides of the Same Medal

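[Quick read]: This paper studies how text classification models transfer across two research streams: it asks how state-of-the-art models from Hierarchical Text Classification (HTC) and eXtreme Multi-Label Text Classification (XML) perform on datasets from the other domain. The key is to train and test each family of models on the other family's datasets. The HTC models HBGL and HGLCR are applied to the XML datasets Wiki10-31K, AmazonCat-13K, and Amazon-670K, while the XML models CascadeXML and XR-Transformer are applied to the HTC datasets Web of Science, The New York Times Annotated Corpus, and RCV1-V2. The results indicate that HTC models perform poorly on XML datasets, whereas XML models perform comparatively well on HTC datasets, exposing the limits of cross-domain transfer.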

Link: https://arxiv.org/abs/2411.13687
Authors: Nerijus Bertalis, Paul Granse, Ferhat Gül, Florian Hauss, Leon Menkel, David Schüler, Tom Speier, Lukas Galke, Ansgar Scherp
Keywords (EN): text classification problem, Assigning a subset, text classification, Hierarchical Text Classification, real-world applications
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Assigning a subset of labels from a fixed pool of labels to a given input text is a text classification problem with many real-world applications, such as in recommender systems. Two separate research streams address this issue. Hierarchical Text Classification (HTC) focuses on datasets with smaller label pools of hundreds of entries, accompanied by a semantic label hierarchy. In contrast, eXtreme Multi-Label Text Classification (XML) considers very large label pools with up to millions of entries, in which the labels are not arranged in any particular manner. However, in XML, a common approach is to construct an artificial hierarchy without any semantic information before or during the training process. Here, we investigate how state-of-the-art models from one domain perform when trained and tested on datasets from the other domain. The HBGL and HGLCR models from the HTC domain are trained and tested on the datasets Wiki10-31K, AmazonCat-13K, and Amazon-670K from the XML domain. On the other side, the XML models CascadeXML and XR-Transformer are trained and tested on the datasets Web of Science, The New York Times Annotated Corpus, and RCV1-V2 from the HTC domain. HTC models, on the other hand, are not equipped to handle the size of XML datasets and achieve poor transfer results. The code and numerous files that are needed to reproduce our results can be obtained from this https URL

[NLP-40] Hymba: A Hybrid-head Architecture for Small Language Models

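[Quick read]: This paper addresses the efficiency and performance challenges of small language models (LMs). The key to its solution is a hybrid-head parallel architecture that combines Transformer attention mechanisms with state space models (SSMs): attention heads provide high-resolution recall, while SSM heads provide efficient context summarization. The paper also introduces learnable meta tokens, which are prepended to prompts to store critical information and relieve the “forced-to-attend” burden on the attention mechanism. Cross-layer key-value (KV) sharing and partial sliding window attention further shrink the cache. According to the abstract, Hymba-1.5B-Base surpasses all sub-2B public models and outperforms Llama-3.2-3B with 1.32% higher average accuracy, an 11.67x smaller cache, and 3.49x the throughput.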

Link: https://arxiv.org/abs/2411.13676
Authors: Xin Dong, Yonggan Fu, Shizhe Diao, Wonmin Byeon, Zijia Chen, Ameya Sunil Mahabaleshwarkar, Shih-Yang Liu, Matthijs Van Keirsbilck, Min-Hung Chen, Yoshi Suhara, Yingyan Lin, Jan Kautz, Pavlo Molchanov
Keywords (EN): integrates transformer attention, hybrid-head parallel architecture, enhanced efficiency, language models featuring, state space models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 20 pages, models are available on huggingface

Abstract:We propose Hymba, a family of small language models featuring a hybrid-head parallel architecture that integrates transformer attention mechanisms with state space models (SSMs) for enhanced efficiency. Attention heads provide high-resolution recall, while SSM heads enable efficient context summarization. Additionally, we introduce learnable meta tokens that are prepended to prompts, storing critical information and alleviating the “forced-to-attend” burden associated with attention mechanisms. This model is further optimized by incorporating cross-layer key-value (KV) sharing and partial sliding window attention, resulting in a compact cache size. During development, we conducted a controlled study comparing various architectures under identical settings and observed significant advantages of our proposed architecture. Notably, Hymba achieves state-of-the-art results for small LMs: Our Hymba-1.5B-Base model surpasses all sub-2B public models in performance and even outperforms Llama-3.2-3B with 1.32% higher average accuracy, an 11.67x cache size reduction, and 3.49x throughput.
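
The hybrid-head idea can be illustrated with a toy layer in which an attention head and an SSM head read the same sequence in parallel and their outputs are fused. The following is a minimal sketch, not Hymba's implementation: the helper names (`causal_attention`, `ssm_scan`, `hybrid_head_layer`), the scalar `decay`, and the plain averaging used for fusion are all assumptions made for illustration, and the meta tokens are fixed vectors here rather than learned parameters.

```python
import math

def causal_attention(xs):
    """Toy single-head causal self-attention with queries = keys = values = xs."""
    out = []
    for t, q in enumerate(xs):
        # Each position may only look at itself and earlier positions.
        scores = [sum(a * b for a, b in zip(q, xs[s])) for s in range(t + 1)]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append([sum(w * xs[s][d] for s, w in enumerate(weights)) / z
                    for d in range(len(q))])
    return out

def ssm_scan(xs, decay=0.9):
    """Toy diagonal linear recurrence: h_t = decay * h_{t-1} + (1 - decay) * x_t."""
    h = [0.0] * len(xs[0])
    out = []
    for x in xs:
        h = [decay * hi + (1.0 - decay) * xi for hi, xi in zip(h, x)]
        out.append(list(h))
    return out

def hybrid_head_layer(prompt, meta_tokens):
    """Run both heads in parallel on [meta tokens + prompt] and average them.

    Averaging is an assumption of this sketch, not the paper's fusion rule.
    """
    seq = meta_tokens + prompt
    attn = causal_attention(seq)
    ssm = ssm_scan(seq)
    fused = [[0.5 * (a + s) for a, s in zip(av, sv)] for av, sv in zip(attn, ssm)]
    # Only the outputs at prompt positions are returned.
    return fused[len(meta_tokens):]

prompt = [[1.0, 0.0], [0.0, 1.0]]
with_meta = hybrid_head_layer(prompt, meta_tokens=[[0.5, 0.5]])
without_meta = hybrid_head_layer(prompt, meta_tokens=[])
```

In this toy, the attention head keeps an exact, position-by-position view of the sequence, while the recurrent head carries a fixed-size running summary, which mirrors the recall versus summarization split described in the abstract.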

[NLP-41] RadPhi-3: Small Language Models for Radiology

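[Quick read]: This paper addresses how LLM-based copilot assistants can reliably support tasks in radiology workflows. The key to its solution is RadPhi-3, a small language model with 3.8B parameters, instruction-tuned from Phi-3-mini-4k-instruct for radiology tasks. Beyond generating impression summaries for radiology reports, RadPhi-3 handles change summary generation (comparing a report with its prior report), section extraction, and tagging reports with pathologies and with the tubes, lines, or devices present in them. Its instruction tuning also drew on a credible knowledge source used by radiologists. The model achieves SOTA results on the RaLEs radiology report generation benchmark.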

Link: https://arxiv.org/abs/2411.13604
Authors: Mercy Ranjit, Shaury Srivastav, Tanuja Ganu
Keywords (EN): LLM based copilot, LLM based, based copilot assistants, Small Language Model, based copilot
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract: LLM-based copilot assistants are useful in everyday tasks. There is a proliferation in the exploration of AI assistant use cases to support radiology workflows in a reliable manner. In this work, we present RadPhi-3, a Small Language Model instruction tuned from Phi-3-mini-4k-instruct with 3.8B parameters to assist with various tasks in radiology workflows. While impression summary generation has been the primary task explored in prior works with respect to radiology reports of chest X-rays, we also explore other useful tasks such as change summary generation comparing the current radiology report and its prior report, section extraction from radiology reports, and tagging the reports with various pathologies and the tubes, lines, or devices present in them. In addition, instruction tuning RadPhi-3 involved learning from a credible knowledge source used by radiologists, this http URL. RadPhi-3 can be used both to give reliable answers to radiology-related queries and to perform useful tasks related to radiology reports. RadPhi-3 achieves SOTA results on the RaLEs radiology report generation benchmark.

[NLP-42] Improved GUI Grounding via Iterative Narrowing

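[Quick read]: This paper addresses the weak performance of Vision-Language Models (VLMs) on GUI grounding. The key to its solution is a visual prompting framework called Iterative Narrowing (IN), which progressively narrows the region in which the target is located and improves GUI grounding for both general VLMs (such as GPT-4V) and fine-tuned models. The method is evaluated on a comprehensive benchmark comprising different UI platforms.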

Link: https://arxiv.org/abs/2411.13591
Authors: Anthony Nguyen
Keywords (EN): natural language query, GUI grounding, language query, plays a crucial, GUI grounding remains
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:GUI grounding, the task of identifying a precise location on an interface image from a natural language query, plays a crucial role in enhancing the capabilities of Vision-Language Model (VLM) agents. While general VLMs, such as GPT-4V, demonstrate strong performance across various tasks, their proficiency in GUI grounding remains suboptimal. Recent studies have focused on fine-tuning these models specifically for one-shot GUI grounding, yielding significant improvements over baseline performance. We introduce a visual prompting framework called Iterative Narrowing (IN) to further enhance the performance of both general and fine-tuned models in GUI grounding. For evaluation, we tested our method on a comprehensive benchmark comprising different UI platforms.
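
The abstract does not spell out the procedure, but the name suggests a loop of predicting a point, cropping around it, and predicting again on the crop. A minimal sketch under that assumption follows. `iterative_narrowing`, `toy_predict`, the shrink factor, and the iteration count are hypothetical, and `toy_predict` is a stand-in for a VLM call whose error shrinks with the size of the region it is shown.

```python
TARGET = (812.0, 430.0)  # made-up ground-truth click position in pixels

def toy_predict(box):
    """Stand-in for a VLM: the guess is biased toward the centre of the crop."""
    left, top, w, h = box
    cx, cy = left + w / 2, top + h / 2
    return (TARGET[0] + 0.1 * (cx - TARGET[0]),
            TARGET[1] + 0.1 * (cy - TARGET[1]))

def iterative_narrowing(predict, width, height, iterations=3, shrink=0.5):
    """Repeatedly crop around the current prediction, then predict once more."""
    box = (0.0, 0.0, float(width), float(height))  # left, top, width, height
    for _ in range(iterations):
        x, y = predict(box)
        new_w, new_h = box[2] * shrink, box[3] * shrink
        # Centre the smaller crop on the prediction and keep it inside the image.
        left = min(max(x - new_w / 2, 0.0), width - new_w)
        top = min(max(y - new_h / 2, 0.0), height - new_h)
        box = (left, top, new_w, new_h)
    return predict(box)

first_guess = toy_predict((0.0, 0.0, 1920.0, 1080.0))
refined = iterative_narrowing(toy_predict, 1920, 1080)
```

With a real model, each call to `predict` would send the cropped screenshot and the query, and the returned coordinates would have to be mapped from crop space back to full-image space before the next crop is taken.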

[NLP-43] AddrLLM: Address Rewriting via Large Language Model on Nationwide Logistics Data KDD’25

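[Quick read]: This paper addresses the significant costs that abnormal addresses cause in location-based services (LBS), in particular parcel re-routing caused by inaccurate addresses. The key to its solution is AddrLLM, an address rewriting framework built on a retrieval-augmented large language model (LLM). AddrLLM overcomes the limitations of existing methods through three core modules: a Supervised Fine-Tuning module, an Address-centric Retrieval Augmented Generation module, and a Bias-free Objective Alignment module. The framework is not tailored to specific error types and does not need frequent retraining, and after online deployment it reduced the parcel re-routing rate by approximately 43%.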

Link: https://arxiv.org/abs/2411.13584
Authors: Qinchen Yang, Zhiqing Hong, Dongjiang Cao, Haotian Wang, Zejun Xie, Tian He, Yunhuai Liu, Yu Yang, Desheng Zhang
Keywords (EN): Textual description, abnormal addresses, plays an important, location-based services, delivery and navigation
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by KDD’25 ADS Track

Abstract: Textual description of a physical location, commonly known as an address, plays an important role in location-based services (LBS) such as on-demand delivery and navigation. However, the prevalence of abnormal addresses, those containing inaccuracies that fail to pinpoint a location, has led to significant costs. Address rewriting has emerged as a solution to rectify these abnormal addresses. Despite the critical need, existing address rewriting methods are limited: they are typically tailored to correct specific error types, or they frequently require retraining to process new address data effectively. In this study, we introduce AddrLLM, an innovative framework for address rewriting that is built upon a retrieval augmented large language model. AddrLLM overcomes the aforementioned limitations through a meticulously designed Supervised Fine-Tuning module, an Address-centric Retrieval Augmented Generation module, and a Bias-free Objective Alignment module. To the best of our knowledge, this study pioneers the application of an LLM-based address rewriting approach to solve the issue of abnormal addresses. Through comprehensive offline testing with real-world data on a national scale and subsequent online deployment, AddrLLM has demonstrated superior performance in integration with the existing logistics system. It has decreased the rate of parcel re-routing by approximately 43%, underscoring its efficacy in real-world applications.

[NLP-44] Just KIDDIN: Knowledge Infusion and Distillation for Detection of INdecent Memes

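[Quick read]: This paper addresses toxicity identification in online multimodal environments (e.g., textual and visual), which is challenging because of the complex contextual connections across modalities. The key to its solution is a framework that combines Knowledge Distillation (KD) from Large Visual Language Models (LVLMs) with knowledge infusion to improve toxicity detection in hateful memes. Specifically, the method extracts sub-knowledge graphs from ConceptNet, a large-scale commonsense Knowledge Graph (KG), and infuses them into a compact VLM framework. The relational context between toxic phrases in captions and memes, together with the visual concepts in memes, strengthens the model's reasoning. On two hate speech benchmark datasets, the method outperforms state-of-the-art baselines on AU-ROC, F1, and Recall by 1.1%, 7%, and 35%, respectively.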

Link: https://arxiv.org/abs/2411.12174
Authors: Rahul Garg, Trilok Padhi, Hemang Jain, Ugur Kursuncu, Ponnurangam Kumaraguru
Keywords (EN): Large Visual Language, challenging task due, multimodal environments remains, Visual Language Models, integrates Knowledge Distillation
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Toxicity identification in online multimodal environments remains a challenging task due to the complexity of contextual connections across modalities (e.g., textual and visual). In this paper, we propose a novel framework that integrates Knowledge Distillation (KD) from Large Visual Language Models (LVLMs) and knowledge infusion to enhance the performance of toxicity detection in hateful memes. Our approach extracts sub-knowledge graphs from ConceptNet, a large-scale commonsense Knowledge Graph (KG) to be infused within a compact VLM framework. The relational context between toxic phrases in captions and memes, as well as visual concepts in memes enhance the model’s reasoning capabilities. Experimental results from our study on two hate speech benchmark datasets demonstrate superior performance over the state-of-the-art baselines across AU-ROC, F1, and Recall with improvements of 1.1%, 7%, and 35%, respectively. Given the contextual complexity of the toxicity detection task, our approach showcases the significance of learning from both explicit (i.e. KG) as well as implicit (i.e. LVLMs) contextual cues incorporated through a hybrid neurosymbolic approach. This is crucial for real-world applications where accurate and scalable recognition of toxic content is critical for creating safer online environments.

[NLP-45] Generating bilingual example sentences with large language models as lexicography assistants

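[Quick read]: This paper studies how well LLMs generate and rate example sentences for bilingual dictionaries across languages with different resource levels: French (high-resource), Indonesian (mid-resource), and Tetun (low-resource), with English as the target language. Its key points are: 1) evaluating generated examples against the GDEX criteria (typicality, informativeness, and intelligibility) shows that LLM performance degrades significantly for lower-resourced languages; 2) because human preferences vary widely (low inter-annotator agreement), in-context learning is used to align LLMs with individual annotator preferences; 3) when pre-trained language models are used for automated rating, sentence perplexity serves as a good proxy for typicality and intelligibility in higher-resourced languages. The paper also contributes a dataset of 600 ratings and discusses the potential of LLMs to reduce the cost of lexicographic work, particularly for low-resource languages.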

Link: https://arxiv.org/abs/2410.03182
Authors: Raphael Merx, Ekaterina Vylomova, Kemal Kurniawan
Keywords (EN): varying resource levels, resource levels, bilingual dictionaries, varying resource, Good Dictionary
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:We present a study of LLMs’ performance in generating and rating example sentences for bilingual dictionaries across languages with varying resource levels: French (high-resource), Indonesian (mid-resource), and Tetun (low-resource), with English as the target language. We evaluate the quality of LLM-generated examples against the GDEX (Good Dictionary EXample) criteria: typicality, informativeness, and intelligibility. Our findings reveal that while LLMs can generate reasonably good dictionary examples, their performance degrades significantly for lower-resourced languages. We also observe high variability in human preferences for example quality, reflected in low inter-annotator agreement rates. To address this, we demonstrate that in-context learning can successfully align LLMs with individual annotator preferences. Additionally, we explore the use of pre-trained language models for automated rating of examples, finding that sentence perplexity serves as a good proxy for typicality and intelligibility in higher-resourced languages. Our study also contributes a novel dataset of 600 ratings for LLM-generated sentence pairs, and provides insights into the potential of LLMs in reducing the cost of lexicographic work, particularly for low-resource languages.
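
Sentence perplexity, the proxy mentioned in the abstract, is the exponential of the negative mean log-probability a language model assigns to each token. The sketch below assumes the per-token natural-log probabilities have already been obtained from some pre-trained model. The function names and the ranking helper are illustrative, and the paper's exact scoring setup is not described in the abstract.

```python
import math

def sentence_perplexity(token_logprobs):
    """Perplexity = exp(-(1/N) * sum of natural-log token probabilities)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def rank_by_typicality(candidates):
    """Sort (sentence, token_logprobs) pairs from lowest to highest perplexity."""
    return sorted(candidates, key=lambda c: sentence_perplexity(c[1]))

# Made-up log-probabilities for two candidate example sentences.
LOG_QUARTER = -1.3862943611198906  # ln(0.25)
candidates = [
    ("rarer phrasing", [-3.0, -2.5, -4.0]),
    ("common phrasing", [LOG_QUARTER] * 4),
]
ranked = rank_by_typicality(candidates)
```

Lower perplexity means the model finds the sentence less surprising, which is why it can stand in for typicality when the model has seen enough text in that language.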

[NLP-46] Source Code Foundation Models are Transferable Binary Analysis Knowledge Bases

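[Quick read]: This paper addresses the semantic gap between binary code and source code, that is, how to lift binary code to human-readable content relevant to source code. The key to its solution is a probe-and-recover framework that combines a binary-source encoder-decoder model with black-box large language models (LLMs) for binary analysis. Specifically, the method uses the pre-trained knowledge in Source Code Foundation Models (SCFMs) to synthesize relevant, symbol-rich code fragments as context, which helps black-box LLMs recover information from binaries more accurately. In zero-shot binary summarization, it achieves a 10.3% relative gain in CHRF and a 16.7% relative gain in a GPT4-based metric; in binary function name recovery, it achieves absolute increases of 6.7% and 7.4% in token-level precision and recall, respectively.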

Link: https://arxiv.org/abs/2405.19581
Authors: Zian Su, Xiangzhe Xu, Ziyang Huang, Kaiyuan Zhang, Xiangyu Zhang
Keywords (EN): Binary Reverse Engineering, Reverse Engineering, Human-Oriented Binary Reverse, Source Code Foundation, source code
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Human-Oriented Binary Reverse Engineering (HOBRE) lies at the intersection of binary and source code, aiming to lift binary code to human-readable content relevant to source code, thereby bridging the binary-source semantic gap. Recent advancements in uni-modal code model pre-training, particularly in generative Source Code Foundation Models (SCFMs) and binary understanding models, have laid the groundwork for transfer learning applicable to HOBRE. However, existing approaches for HOBRE rely heavily on uni-modal models like SCFMs for supervised fine-tuning or general LLMs for prompting, resulting in sub-optimal performance. Inspired by recent progress in large multi-modal models, we propose that it is possible to harness the strengths of uni-modal code models from both sides to bridge the semantic gap effectively. In this paper, we introduce a novel probe-and-recover framework that incorporates a binary-source encoder-decoder model and black-box LLMs for binary analysis. Our approach leverages the pre-trained knowledge within SCFMs to synthesize relevant, symbol-rich code fragments as context. This additional context enables black-box LLMs to enhance recovery accuracy. We demonstrate significant improvements in zero-shot binary summarization and binary function name recovery, with a 10.3% relative gain in CHRF and a 16.7% relative gain in a GPT4-based metric for summarization, as well as a 6.7% and 7.4% absolute increase in token-level precision and recall for name recovery, respectively. These results highlight the effectiveness of our approach in automating and improving binary code analysis.
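
The abstract reports relative gains for summarization and absolute increases in token-level precision and recall for name recovery. A minimal sketch of both notions follows. The identifier-splitting regular expression and the set-based overlap are assumptions of this example, and the paper's own tokenization may differ.

```python
import re

def name_tokens(name):
    """Split a function name on snake_case and camelCase boundaries."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return {p.lower() for p in parts}

def token_precision_recall(predicted, reference):
    """Precision and recall over the sets of name tokens."""
    pred, ref = name_tokens(predicted), name_tokens(reference)
    overlap = len(pred & ref)
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(ref) if ref else 0.0
    return precision, recall

def relative_gain(baseline, improved):
    """Relative gain: the improvement expressed as a fraction of the baseline."""
    return (improved - baseline) / baseline

precision, recall = token_precision_recall("readConfigFile", "parse_config_file")
```

An absolute increase is a plain difference between two scores (for example, 0.50 to 0.55 is 5 points), whereas `relative_gain(0.50, 0.55)` is about 0.10, so the two kinds of figures in the abstract are not directly comparable.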

[NLP-47] CodeArt: Better Code Models by Attention Regularization When Symbols Are Lacking

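[Quick read]: This paper addresses the performance degradation of Transformer-based code models when symbols are missing or uninformative. The key to its solution is a new pre-training method that uses program analysis to extract contexts a priori and a novel attention masking method that only allows the model to attend to these contexts, such as bi-directional program dependence transitive closures and token co-occurrences. The inherent self-attention mechanism then learns which of the allowed attentions matter more. To realize this, the authors enhance the tokenization and model architecture of a BERT model, construct and use attention masks, and introduce a new pre-training algorithm. On binary similarity, type inference, and malware family classification, the pre-trained model improves the SOTA from 53% to 64%, 49% to 60%, and 74% to 94%, respectively.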

Link: https://arxiv.org/abs/2402.11842
Authors: Zian Su, Xiangzhe Xu, Ziyang Huang, Zhuo Zhang, Yapeng Ye, Jianjun Huang, Xiangyu Zhang
Keywords (EN): Transformer based code, Transformer based, software engineering tasks, impressive performance, software engineering
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Transformer based code models have impressive performance in many software engineering tasks. However, their effectiveness degrades when symbols are missing or not informative. The reason is that the model may not learn to pay attention to the right correlations/contexts without the help of symbols. We propose a new method to pre-train general code models when symbols are lacking. We observe that in such cases, programs degenerate to something written in a very primitive language. We hence propose to use program analysis to extract contexts a priori (instead of relying on symbols and masked language modeling as in vanilla models). We then leverage a novel attention masking method to only allow the model attending to these contexts, e.g., bi-directional program dependence transitive closures and token co-occurrences. In the meantime, the inherent self-attention mechanism is utilized to learn which of the allowed attentions are more important compared to others. To realize the idea, we enhance the vanilla tokenization and model architecture of a BERT model, construct and utilize attention masks, and introduce a new pre-training algorithm. We pre-train this BERT-like model from scratch, using a dataset of 26 million stripped binary functions with explicit program dependence information extracted by our tool. We apply the model in three downstream tasks: binary similarity, type inference, and malware family classification. Our pre-trained model can improve the SOTAs in these tasks from 53% to 64%, 49% to 60%, and 74% to 94%, respectively. It also substantially outperforms other general pre-training techniques of code understanding models.
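
The masking idea can be sketched as follows: compute which tokens are related through dependence edges, then let the softmax see only those positions. This is an illustration under assumptions rather than CodeArt's code. Tokens are plain indices, `edges` is a made-up list of directed dependences, and "bi-directional transitive closure" is interpreted here as "i reaches j or j reaches i".

```python
import math

def reachability(n, edges):
    """Reflexive transitive closure of directed edges (Warshall's algorithm)."""
    reach = [[i == j for j in range(n)] for i in range(n)]
    for src, dst in edges:
        reach[src][dst] = True
    for k in range(n):
        for i in range(n):
            if reach[i][k]:
                for j in range(n):
                    if reach[k][j]:
                        reach[i][j] = True
    return reach

def dependence_mask(n, edges):
    """Token i may attend to token j if either one reaches the other."""
    reach = reachability(n, edges)
    return [[reach[i][j] or reach[j][i] for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax in which disallowed positions get exactly zero weight."""
    out = []
    for row, allowed in zip(scores, mask):
        kept = [s for s, ok in zip(row, allowed) if ok]  # never empty: i reaches i
        m = max(kept)
        exps = [math.exp(s - m) if ok else 0.0 for s, ok in zip(row, allowed)]
        z = sum(exps)
        out.append([e / z for e in exps])
    return out

# Tokens 0 and 2 both feed token 1; token 3 is unrelated to the others.
mask = dependence_mask(4, [(0, 1), (2, 1)])
weights = masked_softmax([[0.0] * 4 for _ in range(4)], mask)
```

Within the allowed positions, the weights still come from the attention scores, which matches the abstract's point that self-attention learns which of the permitted contexts matter most.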

[NLP-48] BEST-STD: Bidirectional Mamba-Enhanced Speech Tokenization for Spoken Term Detection ICASSP2025

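[Quick read]: This paper addresses the reliance of spoken term detection (STD) on frame-level features and computationally intensive Dynamic Time Warping (DTW) template matching, which limits its practicality. The key to its solution is encoding speech into discrete, speaker-agnostic semantic tokens, which enables fast retrieval with text-based search algorithms and handles out-of-vocabulary terms. The paper proposes bidirectional state space modeling within the Mamba encoder, trained in a self-supervised learning framework, to learn contextual frame-level features that are then encoded into discrete tokens. The resulting speech tokens show greater speaker invariance than those from existing tokenizers, and on LibriSpeech and TIMIT the method outperforms existing STD baselines while being more efficient.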

Link: https://arxiv.org/abs/2411.14100
Authors: Anup Singh, Kris Demuynck, Vipul Arora
Keywords (EN): Spoken term detection, DTW-based template matching, computationally intensive DTW-based, intensive DTW-based template, limiting its practicality
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Submitted to ICASSP 2025

Abstract:Spoken term detection (STD) is often hindered by reliance on frame-level features and the computationally intensive DTW-based template matching, limiting its practicality. To address these challenges, we propose a novel approach that encodes speech into discrete, speaker-agnostic semantic tokens. This facilitates fast retrieval using text-based search algorithms and effectively handles out-of-vocabulary terms. Our approach focuses on generating consistent token sequences across varying utterances of the same term. We also propose a bidirectional state space modeling within the Mamba encoder, trained in a self-supervised learning framework, to learn contextual frame-level features that are further encoded into discrete tokens. Our analysis shows that our speech tokens exhibit greater speaker invariance than those from existing tokenizers, making them more suitable for STD tasks. Empirical evaluation on LibriSpeech and TIMIT databases indicates that our method outperforms existing STD baselines while being more efficient.
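
Once utterances are sequences of discrete tokens, term detection can reuse ordinary text-retrieval machinery instead of DTW over frames. The sketch below shows one such option, an inverted index over token bigrams. The abstract does not say which search algorithm the authors use, so the index layout, the scoring rule, and the token IDs are all assumptions for illustration.

```python
from collections import defaultdict

def ngrams(tokens, n=2):
    """Set of contiguous n-grams in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_index(utterances, n=2):
    """Map each token n-gram to the utterances that contain it."""
    index = defaultdict(set)
    for utt_id, tokens in utterances.items():
        for gram in ngrams(tokens, n):
            index[gram].add(utt_id)
    return index

def search(index, utterances, query, n=2):
    """Score candidates by the fraction of query n-grams they contain."""
    q = ngrams(query, n)
    candidates = set()
    for gram in q:
        candidates |= index.get(gram, set())
    scored = [(len(q & ngrams(utterances[u], n)) / len(q), u) for u in candidates]
    return sorted(scored, reverse=True)

# Made-up token IDs standing in for a speech tokenizer's output.
utterances = {
    "utt_a": [7, 3, 3, 9, 12, 5],
    "utt_b": [1, 4, 4, 8, 2],
    "utt_c": [3, 9, 12, 6],
}
index = build_index(utterances)
hits = search(index, utterances, query=[3, 9, 12])
```

This kind of lookup only works if different speakers saying the same term produce nearly the same token sequence, which is why the abstract stresses speaker invariance and consistent tokenization.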

[NLP-49] WavChat: A Survey of Spoken Dialogue Models

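[Quick read]: This paper addresses the lack of a comprehensive survey of spoken dialogue systems. It compiles existing spoken dialogue systems in chronological order and categorizes them into cascaded and end-to-end paradigms, then gives an in-depth overview of the core technologies of spoken dialogue models, including speech representation, training paradigm, streaming, duplex, and interaction capabilities. It also reviews the relevant datasets, evaluation metrics, and benchmarks, aiming to support both academic research and industrial applications.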

Link: https://arxiv.org/abs/2411.13577
Authors: Shengpeng Ji, Yifu Chen, Minghui Fang, Jialong Zuo, Jingyu Lu, Hanting Wang, Ziyue Jiang, Long Zhou, Shujie Liu, Xize Cheng, Xiaoda Yang, Zehan Wang, Qian Yang, Jian Li, Yidi Jiang, Jingzhen He, Yunfei Chu, Jin Xu, Zhou Zhao
Keywords (EN): spoken dialogue models, spoken dialogue, spoken dialogue systems, captured significant attention, dialogue models
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD)
Comments: 60 pages, work in progress

Abstract:Recent advancements in spoken dialogue models, exemplified by systems like GPT-4o, have captured significant attention in the speech domain. Compared to traditional three-tier cascaded spoken dialogue models that comprise speech recognition (ASR), large language models (LLMs), and text-to-speech (TTS), modern spoken dialogue models exhibit greater intelligence. These advanced spoken dialogue models not only comprehend audio, music, and other speech-related features, but also capture stylistic and timbral characteristics in speech. Moreover, they generate high-quality, multi-turn speech responses with low latency, enabling real-time interaction through simultaneous listening and speaking capability. Despite the progress in spoken dialogue systems, there is a lack of comprehensive surveys that systematically organize and analyze these systems and the underlying technologies. To address this, we have first compiled existing spoken dialogue systems in the chronological order and categorized them into the cascaded and end-to-end paradigms. We then provide an in-depth overview of the core technologies in spoken dialogue models, covering aspects such as speech representation, training paradigm, streaming, duplex, and interaction capabilities. Each section discusses the limitations of these technologies and outlines considerations for future research. Additionally, we present a thorough review of relevant datasets, evaluation metrics, and benchmarks from the perspectives of training and evaluating spoken dialogue systems. We hope this survey will contribute to advancing both academic research and industrial applications in the field of spoken dialogue systems. The related material is available at this https URL.

Artificial Intelligence

[AI-0] Revisiting the Integration of Convolution and Attention for Vision Backbone NEURIPS2024

Link: https://arxiv.org/abs/2411.14429
Authors: Lei Zhu, Xinjiang Wang, Wayne Zhang, Rynson W. H. Lau
Keywords (EN): typically considered alternatives, multi-head self-attentions, typically considered, considered alternatives, building vision backbones
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: NeurIPS 2024

Abstract: Convolutions (Convs) and multi-head self-attentions (MHSAs) are typically considered alternatives to each other for building vision backbones. Although some works try to integrate both, they apply the two operators simultaneously at the finest pixel granularity. With Convs responsible for per-pixel feature extraction already, the question is whether we still need to include the heavy MHSAs at such a fine-grained level. In fact, this is the root cause of the scalability issue w.r.t. the input resolution for vision transformers. To address this important problem, we propose in this work to use MHSAs and Convs in parallel at different granularity levels instead. Specifically, in each layer, we use two different ways to represent an image: a fine-grained regular grid and a coarse-grained set of semantic slots. We apply different operations to these two representations: Convs to the grid for local features, and MHSAs to the slots for global features. A pair of fully differentiable soft clustering and dispatching modules is introduced to bridge the grid and set representations, thus enabling local-global fusion. Through extensive experiments on various vision tasks, we empirically verify the potential of the proposed integration scheme, named GLMix: by offloading the burden of fine-grained features to light-weight Convs, it is sufficient to use MHSAs in a few (e.g., 64) semantic slots to match the performance of recent state-of-the-art backbones, while being more efficient. Our visualization results also demonstrate that the soft clustering module produces a meaningful semantic grouping effect with only IN1k classification supervision, which may induce better interpretability and inspire new weakly-supervised semantic segmentation approaches. Code will be available at this https URL.
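
The bridge between the grid and the slots can be pictured as a soft assignment of every grid feature to every slot, followed by the reverse mapping. The sketch below is a generic dot-product soft clustering written under assumptions. It is not the GLMix module itself, and the function names, the similarity measure, and the toy features are all illustrative.

```python
import math

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    z = sum(exps)
    return [e / z for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def soft_cluster(grid, slots):
    """Softly assign N grid features to M slots and pool features into slots."""
    assign = [softmax([dot(f, s) for s in slots]) for f in grid]  # N x M weights
    dim = len(grid[0])
    pooled = []
    for m in range(len(slots)):
        mass = sum(a[m] for a in assign)  # always positive after a softmax
        pooled.append([sum(a[m] * f[d] for a, f in zip(assign, grid)) / mass
                       for d in range(dim)])
    return assign, pooled

def dispatch(assign, slots):
    """Send slot features back to the grid using the same assignment weights."""
    dim = len(slots[0])
    return [[sum(a[m] * slots[m][d] for m in range(len(slots))) for d in range(dim)]
            for a in assign]

grid = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]  # toy per-pixel features
slots = [[1.0, 0.0], [0.0, 1.0]]                          # toy slot queries
assign, pooled = soft_cluster(grid, slots)
back_on_grid = dispatch(assign, pooled)
```

Because every step is a weighted sum, gradients can flow through both directions, and global attention would only need to run over the handful of pooled slots rather than over every pixel.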

[AI-1] Whack-a-Chip: The Futility of Hardware-Centric Export Controls

Link: https://arxiv.org/abs/2411.14425
Authors: Ritwik Gupta, Leah Walker, Andrew W. Reddie
Keywords (EN): Republic of China, People Republic, steadily creating, artificial intelligence, export controls
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:

Abstract:U.S. export controls on semiconductors are widely known to be permeable, with the People’s Republic of China (PRC) steadily creating state-of-the-art artificial intelligence (AI) models with exfiltrated chips. This paper presents the first concrete, public evidence of how leading PRC AI labs evade and circumvent U.S. export controls. We examine how Chinese companies, notably Tencent, are not only using chips that are restricted under U.S. export controls but are also finding ways to circumvent these regulations by using software and modeling techniques that maximize less capable hardware. Specifically, we argue that Tencent’s ability to power its Hunyuan-Large model with non-export controlled NVIDIA H20s exemplifies broader gains in efficiency in machine learning that have eroded the moat that the United States initially built via its existing export controls. Finally, we examine the implications of this finding for the future of the United States’ export control strategy.

[AI-2] Resolving Multiple-Dynamic Model Uncertainty in Hypothesis-Driven Belief-MDPs AAMAS2025

Link: https://arxiv.org/abs/2411.14404
Authors: Ofer Dagan, Tyler Becker, Zachary N. Sunberg
Keywords (EN): encounter surprising behavior, cyber-physical systems encounter, systems encounter surprising, Markov decision process, surprising behavior
Categories: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 8 pages, 4 figures, submitted to AAMAS 2025

Abstract:When human operators of cyber-physical systems encounter surprising behavior, they often consider multiple hypotheses that might explain it. In some cases, taking information-gathering actions such as additional measurements or control inputs given to the system can help resolve uncertainty and determine the most accurate hypothesis. The task of optimizing these actions can be formulated as a belief-space Markov decision process that we call a hypothesis-driven belief MDP. Unfortunately, this problem suffers from the curse of history similar to a partially observable Markov decision process (POMDP). To plan in continuous domains, an agent needs to reason over countlessly many possible action-observation histories, each resulting in a different belief over the unknown state. The problem is exacerbated in the hypothesis-driven context because each action-observation pair spawns a different belief for each hypothesis, leading to additional branching. This paper considers the case in which each hypothesis corresponds to a different dynamic model in an underlying POMDP. We present a new belief MDP formulation that: (i) enables reasoning over multiple hypotheses, (ii) balances the goals of determining the (most likely) correct hypothesis and performing well in the underlying POMDP, and (iii) can be solved with sparse tree search.
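
Independently of the paper's planner, the belief over competing hypotheses can be maintained with Bayes' rule: each hypothesis is reweighted by how well its dynamic model explains the latest observation. The sketch below shows only that update. The hypothesis names and likelihood values are made up, and the full formulation in the paper also tracks a belief over the unknown state, which is omitted here.

```python
def update_hypothesis_belief(belief, likelihoods):
    """Bayes update: posterior(h) is proportional to prior(h) * p(observation | h).

    belief:      dict mapping hypothesis name -> prior probability
    likelihoods: dict mapping hypothesis name -> probability of the latest
                 observation under that hypothesis's dynamic model
    """
    unnormalized = {h: belief[h] * likelihoods[h] for h in belief}
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under every hypothesis")
    return {h: p / total for h, p in unnormalized.items()}

prior = {"nominal": 0.5, "actuator_fault": 0.5}
# Hypothetical likelihoods of one measurement under each dynamic model.
posterior = update_hypothesis_belief(prior, {"nominal": 0.8, "actuator_fault": 0.2})
```

An information-gathering action is useful when the hypotheses predict clearly different observations for it, since that is what makes the likelihoods diverge and the posterior sharpen.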

[AI-3] Landing Trajectory Prediction for UAS Based on Generative Adversarial Network

Link: https://arxiv.org/abs/2411.14403
Authors: Jun Xiang, Drake Essick, Luiz Gonzalez Bautista, Junfei Xie, Jun Chen
Keywords (EN): air mobility studies, advanced air mobility, Unmanned Aircraft systems, Generative Adversarial Network, mobility studies
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 9 pages, AIAA SCITECH 2023

Abstract: Models for trajectory prediction are an essential component of many advanced air mobility studies. These models help aircraft detect conflicts and plan avoidance maneuvers, which is especially important in Unmanned Aircraft Systems (UAS) landing management due to the congested airspace near vertiports. In this paper, we propose a landing trajectory prediction model for UAS based on a Generative Adversarial Network (GAN). The GAN is a well-established neural network architecture that has achieved state-of-the-art results in many generation tasks. It consists of a neural network generator and a neural network discriminator. Because of the learning capacity of neural networks, the generator is able to capture the features of the sample trajectories. The generator takes the previous trajectory as input and outputs some random status of a flight. According to the experimental results, the proposed model outputs more accurate predictions than the baseline method (GMR) on various datasets. To evaluate the proposed model, we also create a real UAV landing dataset that includes more than 2600 trajectories of drones manually controlled by real pilots.

[AI-4] Using Formal Models, Safety Shields and Certified Control to Validate AI-Based Train Systems

Link: https://arxiv.org/abs/2411.14374
Authors: Jan Gruteser (Heinrich Heine University Düsseldorf), Jan Roßbach (Heinrich Heine University Düsseldorf), Fabian Vu (Heinrich Heine University Düsseldorf), Michael Leuschel (Heinrich Heine University Düsseldorf)
Keywords: science and industry, important concern, concern in science, certificate checker, runtime certificate checker
Categories: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: In Proceedings FMAS2024, arXiv:2411.13215

Abstract:The certification of autonomous systems is an important concern in science and industry. The KI-LOK project explores new methods for certifying and safely integrating AI components into autonomous trains. We pursued a two-layered approach: (1) ensuring the safety of the steering system by formal analysis using the B method, and (2) improving the reliability of the perception system with a runtime certificate checker. This work links both strategies within a demonstrator that runs simulations on the formal model, controlled by the real AI output and the real certificate checker. The demonstrator is integrated into the validation tool ProB. This enables runtime monitoring, runtime verification, and statistical validation of formal safety properties using a formal B model. Consequently, one can detect and analyse potential vulnerabilities and weaknesses of the AI and the certificate checker. We apply these techniques to a signal detection case study and present our findings.

[AI-5] Synthesising Robust Controllers for Robot Collectives with Recurrent Tasks: A Case Study

Link: https://arxiv.org/abs/2411.14371
Authors: Till Schnittka (University of Bremen), Mario Gleirscher (University of Bremen)
Keywords: task specification, practical scale, key challenges, autonomous collectives, controller synthesis
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: In Proceedings FMAS2024, arXiv:2411.13215

Abstract:When designing correct-by-construction controllers for autonomous collectives, three key challenges are the task specification, the modelling, and its use at practical scale. In this paper, we focus on a simple yet useful abstraction for high-level controller synthesis for robot collectives with optimisation goals (e.g., maximum cleanliness, minimum energy consumption) and recurrence (e.g., re-establish contamination and charge thresholds) and safety (e.g., avoid full discharge, mutually exclusive room occupation) constraints. Due to technical limitations (related to scalability and using constraints in the synthesis), we simplify our graph-based setting from a stochastic two-player game into a single-player game on a partially observable Markov decision process (POMDP). Robustness against environmental uncertainty is encoded via partial observability. Linear-time correctness properties are verified separately after synthesising the POMDP strategy. We contribute at-scale guidance on POMDP modelling and controller synthesis for tasked robot collectives exemplified by the scenario of battery-driven robots responsible for cleaning public buildings with utilisation constraints.

[AI-6] RV4Chatbot: Are Chatbots Allowed to Dream of Electric Sheep?

Link: https://arxiv.org/abs/2411.14368
Authors: Andrea Gatti (University of Genoa), Viviana Mascardi (University of Genoa), Angelo Ferrando (University of Modena and Reggio Emilia)
Keywords: application domains, safety-critical considerations, Runtime Verification framework, Runtime Verification, Chatbots
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
Comments: In Proceedings FMAS2024, arXiv:2411.13215

Abstract:Chatbots have become integral to various application domains, including those with safety-critical considerations. As a result, there is a pressing need for methods that ensure chatbots consistently adhere to expected, safe behaviours. In this paper, we introduce RV4Chatbot, a Runtime Verification framework designed to monitor deviations in chatbot behaviour. We formalise expected behaviours as interaction protocols between the user and the chatbot. We present the RV4Chatbot design and describe two implementations that instantiate it: RV4Rasa, for monitoring chatbots created with the Rasa framework, and RV4Dialogflow, for monitoring Dialogflow chatbots. Additionally, we detail experiments conducted in a factory automation scenario using both RV4Rasa and RV4Dialogflow.

[AI-7] ROSMonitoring 2.0: Extending ROS Runtime Verification to Services and Ordered Topics

Link: https://arxiv.org/abs/2411.14367
Authors: Maryam Ghaffari Saadat (University of Manchester), Angelo Ferrando (University of Modena and Reggio Emilia), Louise A. Dennis (University of Manchester), Michael Fisher (University of Manchester)
Keywords: Formal verification, robotic applications presents, applications presents challenges, presents challenges due, distributed architecture
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: In Proceedings FMAS2024, arXiv:2411.13215

Abstract:Formal verification of robotic applications presents challenges due to their hybrid nature and distributed architecture. This paper introduces ROSMonitoring 2.0, an extension of ROSMonitoring designed to facilitate the monitoring of both topics and services while considering the order in which messages are published and received. The framework has been enhanced to support these novel features for ROS1 – and partially ROS2 environments – offering improved real-time support, security, scalability, and interoperability. We discuss the modifications made to accommodate these advancements and present results obtained from a case study involving the runtime monitoring of specific components of a fire-fighting Uncrewed Aerial Vehicle (UAV).

[AI-8] Contrasting local and global modeling with machine learning and satellite data: A case study estimating tree canopy height in African savannas

Link: https://arxiv.org/abs/2411.14354
Authors: Esther Rolf, Lucia Gordon, Milind Tambe, Andrew Davies
Keywords: facilitating environmental monitoring, regions remains critical, developing SatML models, Karingani Game Reserve, satellite imagery
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 31 pages; 9 figures

Abstract:While advances in machine learning with satellite imagery (SatML) are facilitating environmental monitoring at a global scale, developing SatML models that are accurate and useful for local regions remains critical to understanding and acting on an ever-changing planet. As increasing attention and resources are being devoted to training SatML models with global data, it is important to understand when improvements in global models will make it easier to train or fine-tune models that are accurate in specific regions. To explore this question, we contrast local and global training paradigms for SatML through a case study of tree canopy height (TCH) mapping in the Karingani Game Reserve, Mozambique. We find that recent advances in global TCH mapping do not necessarily translate to better local modeling abilities in our study region. Specifically, small models trained only with locally-collected data outperform published global TCH maps, and even outperform globally pretrained models that we fine-tune using local data. Analyzing these results further, we identify specific points of conflict and synergy between local and global modeling paradigms that can inform future research toward aligning local and global performance objectives in geospatial machine learning.

[AI-9] Automated Generation of Code Debugging Exercises

Link: https://arxiv.org/abs/2411.14303
Authors: Victor-Alexandru Pădurean, Paul Denny, Adish Singla
Keywords: instruction and emphasis, emphasis often vary, vary widely, widely across introductory, problem specifications
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Preprint of the SIGCSE’25 paper

Abstract:Debugging is an essential skill when learning to program, yet its instruction and emphasis often vary widely across introductory courses. In the era of code-generating large language models (LLMs), the ability for students to reason about code and identify errors is increasingly important. However, students frequently resort to trial-and-error methods to resolve bugs without fully understanding the underlying issues. Developing the ability to identify and hypothesize the cause of bugs is crucial but can be time-consuming to teach effectively through traditional means. This paper introduces BugSpotter, an innovative tool that leverages an LLM to generate buggy code from a problem description and verify the synthesized bugs via a test suite. Students interact with BugSpotter by designing failing test cases, where the buggy code’s output differs from the expected result as defined by the problem specification. This not only provides opportunities for students to enhance their debugging skills, but also to practice reading and understanding problem specifications. We deployed BugSpotter in a large classroom setting and compared the debugging exercises it generated to exercises hand-crafted by an instructor for the same problems. We found that the LLM-generated exercises produced by BugSpotter varied in difficulty and were well-matched to the problem specifications. Importantly, the LLM-generated exercises were comparable to those manually created by instructors with respect to student performance, suggesting that BugSpotter could be an effective and efficient aid for learning debugging.
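
The notion of a "failing test case" described above can be sketched as follows: an input counts as failing when the buggy program's output differs from the expected result. This is only an illustration, not BugSpotter's implementation. The exercise, the seeded bug, and the use of a reference function as a stand-in for the problem specification are all assumptions made here.

```python
def is_failing_test(reference, buggy, test_input):
    """Return True when test_input exposes the bug, i.e. the outputs differ."""
    return reference(test_input) != buggy(test_input)


# Hypothetical exercise: "return the largest element of a non-empty list".
def reference_max(values):
    return max(values)


def buggy_max(values):
    # Seeded bug: the search starts from 0, so the result is wrong
    # whenever every element is negative.
    best = 0
    for v in values:
        if v > best:
            best = v
    return best


exposes_bug = is_failing_test(reference_max, buggy_max, [-3, -1, -2])
passes_anyway = is_failing_test(reference_max, buggy_max, [1, 5, 2])
```

Here `[-3, -1, -2]` exposes the bug, while `[1, 5, 2]` does not. Finding the first kind of input requires reasoning about the cause of the bug rather than trial and error.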

[AI-10] Neuro-Symbolic Query Optimization in Knowledge Graphs

Link: https://arxiv.org/abs/2411.14277
Authors: Maribel Acosta, Chang Qin, Tim Schwabe
Keywords: enhance query processing, knowledge graphs, presenting a comprehensive, emerging field, comprehensive exploration
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments:

Abstract:This chapter delves into the emerging field of neuro-symbolic query optimization for knowledge graphs (KGs), presenting a comprehensive exploration of how neural and symbolic techniques can be integrated to enhance query processing. Traditional query optimizers in knowledge graphs rely heavily on symbolic methods, utilizing dataset summaries, statistics, and cost models to select efficient execution plans. However, these approaches often suffer from misestimations and inaccuracies, particularly when dealing with complex queries or large-scale datasets. Recent advancements have introduced neural models, which capture non-linear aspects of query optimization, offering promising alternatives to purely symbolic methods. In this chapter, we introduce neuro-symbolic query optimizers, a novel approach that combines the strengths of symbolic reasoning with the adaptability of neural computation. We discuss the architecture of these hybrid systems, highlighting the interplay between neural and symbolic components to improve the optimizer’s ability to navigate the search space and produce efficient execution plans. Additionally, the chapter reviews existing neural components tailored for optimizing queries over knowledge graphs and examines the limitations and challenges in deploying neuro-symbolic query optimizers in real-world environments.

[AI-11] Generating Realistic Adversarial Examples for Business Processes using Variational Autoencoders

Link: https://arxiv.org/abs/2411.14263
Authors: Alexander Stevens, Jari Peeperkorn, Johannes De Smedt, Jochen De Weerdt
Keywords: predictive process monitoring, latent space, predictive process, process monitoring, latent space methods
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:In predictive process monitoring, predictive models are vulnerable to adversarial attacks, where input perturbations can lead to incorrect predictions. Unlike in computer vision, where these perturbations are designed to be imperceptible to the human eye, the generation of adversarial examples in predictive process monitoring poses unique challenges. Minor changes to the activity sequences can create improbable or even impossible scenarios due to underlying constraints such as regulatory rules or process constraints. To address this, we focus on generating realistic adversarial examples tailored to the business process context, in contrast to the imperceptible, pixel-level changes commonly seen in computer vision adversarial attacks. This paper introduces two novel latent space attacks, which generate adversaries by adding noise to the latent space representation of the input data, rather than directly modifying the input attributes. These latent space methods are domain-agnostic and do not rely on process-specific knowledge, as we restrict the generation of adversarial examples to the learned class-specific data distributions by directly perturbing the latent space representation of the business process executions. We evaluate these two latent space methods with six other adversarial attacking methods on eleven real-life event logs and four predictive models. The first three attacking methods directly permute the activities of the historically observed business process executions. The fourth method constrains the adversarial examples to lie within the same data distribution as the original instances, by projecting the adversarial examples to the original data distribution.

[AI-12] BERT-Based Approach for Automating Course Articulation Matrix Construction with Explainable AI

Link: https://arxiv.org/abs/2411.14254
Authors: Natenaile Asmamaw Shiferaw, Simpenzwe Honore Leandre, Aman Sinha, Dillip Rout
Keywords: ensuring curriculum coherence, assessing educational effectiveness, Program Outcome, crucial task, task for ensuring
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 26 pages, 9 figures

Abstract:Course Outcome (CO) and Program Outcome (PO)/Program-Specific Outcome (PSO) alignment is a crucial task for ensuring curriculum coherence and assessing educational effectiveness. The construction of a Course Articulation Matrix (CAM), which quantifies the relationship between COs and POs/PSOs, typically involves assigning numerical values (0, 1, 2, 3) to represent the degree of alignment. In this study, we experiment with four models from the BERT family: BERT Base, DistilBERT, ALBERT, and RoBERTa, and use multiclass classification to assess the alignment between CO and PO/PSO pairs. We first evaluate traditional machine learning classifiers, such as Decision Tree, Random Forest, and XGBoost, and then apply transfer learning to evaluate the performance of the pretrained BERT models. To enhance model interpretability, we apply an Explainable AI technique, specifically Local Interpretable Model-agnostic Explanations (LIME), to provide transparency into the decision-making process. Our system achieves accuracy, precision, recall, and F1-score values of 98.66%, 98.67%, 98.66%, and 98.66%, respectively. This work demonstrates the potential of utilizing transfer learning with BERT-based models for the automated generation of CAMs, offering high performance and interpretability in educational outcome assessment.

[AI-13] AnywhereDoor: Multi-Target Backdoor Attacks on Object Detection

Link: https://arxiv.org/abs/2411.14243
Authors: Jialin Lu, Junjie Shan, Ziqi Zhao, Ka-Ho Chow
Keywords: safety-critical applications, understanding its vulnerabilities, vulnerabilities is essential, object, Backdoor
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:As object detection becomes integral to many safety-critical applications, understanding its vulnerabilities is essential. Backdoor attacks, in particular, pose a significant threat by implanting a hidden backdoor in a victim model, which adversaries can later exploit to trigger malicious behaviors during inference. However, current backdoor techniques are limited to static scenarios where attackers must define a malicious objective before training, locking the attack into a predetermined action without inference-time adaptability. Given the expressive output space in object detection, including object existence detection, bounding box estimation, and object classification, the feasibility of implanting a backdoor that provides inference-time control with a high degree of freedom remains unexplored. This paper introduces AnywhereDoor, a flexible backdoor attack tailored for object detection. Once implanted, AnywhereDoor enables adversaries to specify different attack types (object vanishing, fabrication, or misclassification) and configurations (untargeted or targeted with specific classes) to dynamically control detection behavior. This flexibility is achieved through three key innovations: (i) objective disentanglement to support a broader range of attack combinations well beyond what existing methods allow; (ii) trigger mosaicking to ensure backdoor activations are robust, even against those object detectors that extract localized regions from the input image for recognition; and (iii) strategic batching to address object-level data imbalances that otherwise hinder a balanced manipulation. Extensive experiments demonstrate that AnywhereDoor provides attackers with a high degree of control, achieving an attack success rate improvement of nearly 80% compared to adaptations of existing methods for such flexible control.

[AI-14] Towards Context-Rich Automated Biodiversity Assessments: Deriving AI-Powered Insights from Camera Trap Data

Link: https://arxiv.org/abs/2411.14219
Authors: Paul Fergus, Carl Chalmers, Naomi Matthews, Stuart Nixon, Andre Burger, Oliver Hartley, Chris Sutherland, Xavier Lambin, Steven Longmore, Serge Wich
Keywords: contextual richness needed, Camera traps offer, traps offer enormous, current automated image, impactful conservation outcomes
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 32 Pages, 22 images

Abstract:Camera traps offer enormous new opportunities in ecological studies, but current automated image analysis methods often lack the contextual richness needed to support impactful conservation outcomes. Here we present an integrated approach that combines deep learning-based vision and language models to improve ecological reporting using data from camera traps. We introduce a two-stage system: YOLOv10-X to localise and classify species (mammals and birds) within images, and a Phi-3.5-vision-instruct model to read YOLOv10-X bounding box labels to identify species, overcoming its limitation with hard-to-classify objects in images. Additionally, Phi-3.5 detects broader variables, such as vegetation type and time of day, providing rich ecological and environmental context to YOLO’s species detection output. When combined, this output is processed by the model’s natural language system to answer complex queries, and retrieval-augmented generation (RAG) is employed to enrich responses with external information, like species weight and IUCN status (information that cannot be obtained through direct visual analysis). This information is used to automatically generate structured reports, providing biodiversity stakeholders with deeper insights into, for example, species abundance, distribution, animal behaviour, and habitat selection. Our approach delivers contextually rich narratives that aid in wildlife management decisions. By providing contextually rich insights, our approach not only reduces manual effort but also supports timely decision-making in conservation, potentially shifting efforts from reactive to proactive management.

[AI-15] Physics-Informed LLM-Agent for Automated Modulation Design in Power Electronics Systems

Link: https://arxiv.org/abs/2411.14214
Authors: Junhua Liu, Fanfan Lin, Xinze Li, Kwan Hui Lim, Shuai Zhao
Keywords: complex industrial tasks, demonstrated outstanding performance, solving complex industrial, industrial tasks, Power Electronics Systems
Categories: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments:

Abstract:LLM-based autonomous agents have demonstrated outstanding performance in solving complex industrial tasks. However, in the pursuit of carbon neutrality and high-performance renewable energy systems, existing AI-assisted design automation faces significant limitations in explainability, scalability, and usability. To address these challenges, we propose LP-COMDA, an LLM-based, physics-informed autonomous agent that automates the modulation design of power converters in Power Electronics Systems with minimal human supervision. Unlike traditional AI-assisted approaches, LP-COMDA contains an LLM-based planner that gathers and validates design specifications through a user-friendly chat interface. The planner then coordinates with physics-informed design and optimization tools to iteratively generate and refine modulation designs autonomously. Through the chat interface, LP-COMDA provides an explainable design process, presenting explanations and charts. Experiments show that LP-COMDA outperforms all baseline methods, achieving a 63.2% reduction in error compared to the second-best benchmark method in terms of standard mean absolute error. Furthermore, empirical studies with 20 experts conclude that design time with LP-COMDA is over 33 times faster than conventional methods, showing its significant improvement on design efficiency over the current processes.

[AI-16] HARP: A Large-Scale Higher-Order Ambisonic Room Impulse Response Dataset ICASSP2025

Link: https://arxiv.org/abs/2411.14207
Authors: Shivam Saini, Jürgen Peissig
Keywords: Image Source Method, Room Impulse Responses, Impulse Responses, Ambisonic Room Impulse, Source Method
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Comments: Submitted to ICASSP 2025 Workshop. Dataset and code to be uploaded at: this https URL

Abstract:This contribution introduces a dataset of 7th-order Ambisonic Room Impulse Responses (HOA-RIRs), created using the Image Source Method. By employing higher-order Ambisonics, our dataset enables precise spatial audio reproduction, a critical requirement for realistic immersive audio applications. Leveraging the virtual simulation, we present a unique microphone configuration, based on the superposition principle, designed to optimize sound field coverage while addressing the limitations of traditional microphone arrays. The presented 64-microphone configuration allows us to capture RIRs directly in the Spherical Harmonics domain. The dataset features a wide range of room configurations, encompassing variations in room geometry, acoustic absorption materials, and source-receiver distances. A detailed description of the simulation setup is provided alongside for an accurate reproduction. The dataset serves as a vital resource for researchers working on spatial audio, particularly in applications involving machine learning to improve room acoustics modeling and sound field synthesis. It further provides a very high level of spatial resolution and realism crucial for tasks such as source localization, reverberation prediction, and immersive sound reproduction.

[AI-17] Is this Generated Person Existed in Real-world? Fine-grained Detecting and Calibrating Abnormal Human-body

Link: https://arxiv.org/abs/2411.14205
Authors: Zeqing Wang, Qingyang Ma, Wentao Wan, Haojie Li, Keze Wang, Yonghong Tian
Keywords: Recent improvements, generated human photos, human photos, applicability and demand, synthesis have significantly
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 16 pages, 14 figures

Abstract:Recent improvements in visual synthesis have significantly enhanced the depiction of generated human photos, which are pivotal due to their wide applicability and demand. Nonetheless, the existing text-to-image or text-to-video models often generate low-quality human photos that might differ considerably from real-world body structures, referred to as “abnormal human bodies”. Such abnormalities, typically deemed unacceptable, pose considerable challenges in the detection and repair of them within human photos. These challenges require precise abnormality recognition capabilities, which entail pinpointing both the location and the abnormality type. Intuitively, Visual Language Models (VLMs) that have obtained remarkable performance on various visual tasks are quite suitable for this task. However, their performance on abnormality detection in human photos is quite poor. Hence, it is quite important to highlight this task for the research community. In this paper, we first introduce a simple yet challenging task, i.e., Fine-grained Human-body Abnormality Detection (FHAD), and construct two high-quality datasets for evaluation. Then, we propose a meticulous framework, named HumanCalibrator, which identifies and repairs abnormalities in human body structures while preserving the other content. Experiments indicate that our HumanCalibrator achieves high accuracy in abnormality detection and accomplishes an increase in visual comparisons while preserving the other visual content.

[AI-18] ComfyGI: Automatic Improvement of Image Generation Workflows

Link: https://arxiv.org/abs/2411.14193
Authors: Dominik Sobania, Martin Briesch, Franz Rothlauf
Keywords: Automatic image generation, interest to researchers, automatic optimization methods, image generation, Automatic image
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments:

Abstract:Automatic image generation is no longer just of interest to researchers, but also to practitioners. However, current models are sensitive to the settings used and automatic optimization methods often require human involvement. To bridge this gap, we introduce ComfyGI, a novel approach to automatically improve workflows for image generation without the need for human intervention driven by techniques from genetic improvement. This enables image generation with significantly higher quality in terms of the alignment with the given description and the perceived aesthetics. On the performance side, we find that overall, the images generated with an optimized workflow are about 50% better compared to the initial workflow in terms of the median ImageReward score. These already good results are even surpassed in our human evaluation, as the participants preferred the images improved by ComfyGI in around 90% of the cases.

[AI-19] FoPru: Focal Pruning for Efficient Large Vision-Language Models

Link: https://arxiv.org/abs/2411.14164
Authors: Lei Jiang, Weizhe Huang, Tongxuan Liu, Yuting Zeng, Jing Li, Lechao Cheng, Xiaohua Xu
Keywords: Large Language Models, Language Models, Large Vision-Language Models, powerful Large Language, Vision-Language Models
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 11 pages, 7 figures

Abstract:Large Vision-Language Models (LVLMs) represent a significant advancement toward achieving superior multimodal capabilities by enabling powerful Large Language Models (LLMs) to understand visual input. Typically, LVLMs utilize visual encoders, such as CLIP, to transform images into visual tokens, which are then aligned with textual tokens through projection layers before being input into the LLM for inference. Although existing LVLMs have achieved significant success, their inference efficiency is still limited by the substantial number of visual tokens and the potential redundancy among them. To mitigate this issue, we propose Focal Pruning (FoPru), a training-free method that prunes visual tokens based on the attention-based token significance derived from the vision encoder. Specifically, we introduce two alternative pruning strategies: 1) the rank strategy, which leverages all token significance scores to retain more critical tokens in a global view; 2) the row strategy, which focuses on preserving continuous key information in images from a local perspective. Finally, the selected tokens are reordered to maintain their original positional relationships. Extensive experiments across various LVLMs and multimodal datasets demonstrate that our method can prune a large number of redundant tokens while maintaining high accuracy, leading to significant improvements in inference efficiency.
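
The two pruning strategies and the final reordering step can be illustrated with a small sketch. It assumes that each visual token already has a scalar significance score and that tokens are laid out row-major on a patch grid. The function names are hypothetical, and reading the row strategy as "keep whole grid rows with the highest total score" is an interpretation made for illustration, not a detail confirmed by the abstract.

```python
def rank_prune(scores, keep):
    """Global view: keep the `keep` highest-scoring tokens.

    The returned indices are sorted, so the surviving tokens keep their
    original positional order.
    """
    by_score = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(by_score[:keep])


def row_prune(scores, grid_w, keep_rows):
    """Local view (assumed reading): keep whole rows of the patch grid."""
    rows = [scores[r:r + grid_w] for r in range(0, len(scores), grid_w)]
    row_totals = [sum(row) for row in rows]
    best_rows = rank_prune(row_totals, keep_rows)
    # Expand each kept row back into its token indices (row-major layout).
    return [r * grid_w + c for r in best_rows for c in range(grid_w)]


# Toy 3x3 grid of made-up significance scores.
scores = [0.1, 0.9, 0.2,
          0.8, 0.7, 0.6,
          0.05, 0.3, 0.4]
rank_idx = rank_prune(scores, keep=4)
row_idx = row_prune(scores, grid_w=3, keep_rows=1)
```

On this toy grid, `rank_idx` picks tokens from two different rows, while `row_idx` keeps the contiguous middle row. That contrast is the global-versus-local trade-off the abstract describes.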

[AI-20] Differentiable SVD based on Moore-Penrose Pseudoinverse for Inverse Imaging Problems

Link: https://arxiv.org/abs/2411.14141
Authors: Yinghao Zhang, Yue Hu
Keywords: Low-rank regularization-based deep, inverse imaging problems, regularization-based deep unrolling, deep unrolling networks, achieved remarkable success
Categories: Numerical Analysis (math.NA); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages

Abstract:Low-rank regularization-based deep unrolling networks have achieved remarkable success in various inverse imaging problems (IIPs). However, the singular value decomposition (SVD) is non-differentiable when duplicated singular values occur, leading to severe numerical instability during training. In this paper, we propose a differentiable SVD based on the Moore-Penrose pseudoinverse to address this issue. To the best of our knowledge, this is the first work to provide a comprehensive analysis of the differentiability of the trivial SVD. Specifically, we show that the non-differentiability of SVD is essentially due to an underdetermined system of linear equations arising in the derivation process. We utilize the Moore-Penrose pseudoinverse to solve the system, thereby proposing a differentiable SVD. A numerical stability analysis in the context of IIPs is provided. Experimental results in color image compressed sensing and dynamic MRI reconstruction show that our proposed differentiable SVD can effectively address the numerical instability issue while ensuring computational precision. Code is available at this https URL.
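
As background on why repeated singular values cause trouble, the block below shows a coefficient that appears in commonly used SVD gradient formulas, followed by the minimum-norm property of the Moore-Penrose pseudoinverse. This is textbook material, not the paper's derivation. The symbols $F$, $M$, $x$, and $b$ are notation assumed here, and the sign of $F_{ij}$ varies between references. How the paper constructs its linear system is not stated in the abstract.

```latex
% Coefficient in the differentials of U and V for A = U \Sigma V^{\top}:
F_{ij} =
\begin{cases}
\dfrac{1}{\sigma_j^{2} - \sigma_i^{2}}, & i \neq j,\\[4pt]
0, & i = j,
\end{cases}
\qquad
\sigma_i = \sigma_j \ (i \neq j) \;\Longrightarrow\; F_{ij} \text{ is undefined.}

% For a consistent underdetermined system M x = b, the pseudoinverse
% selects the solution of minimum Euclidean norm:
x^{\star} = M^{+} b
          = \operatorname*{arg\,min}_{x \,:\, M x = b} \lVert x \rVert_2 .
```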

[AI-21] GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs CVPR’25

Link: https://arxiv.org/abs/2411.14133
Authors: Advik Raj Basani, Xiao Zhang
Keywords: Large Language Models, language processing tasks, Large Language, shown impressive proficiency, elicit harmful responses
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: 28 pages, 9 tables, 13 figures; under review at CVPR '25

Abstract:Large Language Models (LLMs) have shown impressive proficiency across a range of natural language processing tasks yet remain vulnerable to adversarial prompts, known as jailbreak attacks, carefully designed to elicit harmful responses from LLMs. Traditional methods rely on manual heuristics, which suffer from limited generalizability. While being automatic, optimization-based attacks often produce unnatural jailbreak prompts that are easy to detect by safety filters or require high computational overhead due to discrete token optimization. Witnessing the limitations of existing jailbreak methods, we introduce Generative Adversarial Suffix Prompter (GASP), a novel framework that combines human-readable prompt generation with Latent Bayesian Optimization (LBO) to improve adversarial suffix creation in a fully black-box setting. GASP leverages LBO to craft adversarial suffixes by efficiently exploring continuous embedding spaces, gradually optimizing the model to improve attack efficacy while balancing prompt coherence through a targeted iterative refinement procedure. Our experiments show that GASP can generate natural jailbreak prompts, significantly improving attack success rates, reducing training times, and accelerating inference speed, thus making it an efficient and scalable solution for red-teaming LLMs.

[AI-22] Umbrella Reinforcement Learning – computationally efficient tool for hard non-linear problems

Link: https://arxiv.org/abs/2411.14117
Authors: Egor E. Nuzhin, Nikolai V. Brilliantov
Keywords: computationally efficient approach, solving hard nonlinear, computationally efficient, reinforcement learning, hard nonlinear problems
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:We report a novel, computationally efficient approach for solving hard nonlinear problems of reinforcement learning (RL). Here we combine umbrella sampling, from computational physics/chemistry, with optimal control methods. The approach is realized on the basis of neural networks, with the use of policy gradient. It outperforms, by computational efficiency and implementation universality, all available state-of-the-art algorithms, in application to hard RL problems with sparse reward, state traps and lack of terminal states. The proposed approach uses an ensemble of simultaneously acting agents, with a modified reward which includes the ensemble entropy, yielding an optimal exploration-exploitation balance.

[AI-23] MetaCropFollow: Few-Shot Adaptation with Meta-Learning for Under-Canopy Navigation

Link: https://arxiv.org/abs/2411.14092
Authors: Thomas Woehrle, Arun N. Sivakumar, Naveen Uppalapati, Girish Chowdhary
Keywords: degraded GPS accuracy, Autonomous under-canopy navigation, faces additional challenges, additional challenges compared, degraded GPS
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Autonomous under-canopy navigation faces additional challenges compared to over-canopy settings - for example the tight spacing between the crop rows, degraded GPS accuracy and excessive clutter. Keypoint-based visual navigation has been shown to perform well in these conditions, however the differences between agricultural environments in terms of lighting, season, soil and crop type mean that a domain shift will likely be encountered at some point of the robot deployment. In this paper, we explore the use of Meta-Learning to overcome this domain shift using a minimal amount of data. We train a base-learner that can quickly adapt to new conditions, enabling more robust navigation in low-data regimes.

[AI-24] Multi LoRA Meets Vision: Merging multiple adapters to create a multi task model

Link: https://arxiv.org/abs/2411.14064
Authors: Ege Kesim, Selahattin Serdar Helli
Keywords: Parameter efficient finetuning, Parameter efficient, methods are widely, PEFT, Parameter
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Parameter efficient finetuning (PEFT) methods are widely used in LLMs and generative models in computer vision. In particular, several of them can be used together during inference to change the behavior of the base model. In this paper we investigated whether multiple LoRA adapters trained on computer vision tasks can be merged together and used during inference without loss in performance. By achieving this, multitask models can be created just by merging different LoRAs. Merging these will reduce inference time and it will not require any additional retraining. We have trained adapters on six different tasks and evaluated their performance when they are merged together. For comparison we used a model with a frozen backbone and finetuned its head. Our results show that even with simple merging techniques, creating a multitask model by merging adapters is achievable at the cost of slightly losing performance in some cases. In our experiments we merged up to three adapters together. Depending on the task and the similarity of the data adapters were trained on, merges can outperform head finetuning. We have observed that LoRAs trained with dissimilar datasets tend to perform better compared to models trained on similar datasets.
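
As a rough illustration of what a "simple merging technique" can look like, the sketch below adds the low-rank updates of several adapters onto one frozen weight matrix, W' = W + Σ_k w_k · B_k A_k. This is a generic linear merge and is only assumed here; the abstract does not say which merging rules the authors used. The function names and the per-adapter coefficients `weights` are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(base, adapters, weights=None):
    """Return base + sum_k weights[k] * (B_k @ A_k).

    base:     d_out x d_in weight matrix of the frozen layer
    adapters: list of (B, A) pairs, where B is d_out x r and A is r x d_in
    weights:  per-adapter merge coefficients (defaults to 1.0 each)
    """
    if weights is None:
        weights = [1.0] * len(adapters)
    merged = [row[:] for row in base]  # copy so the base weights stay untouched
    for w, (b, a) in zip(weights, adapters):
        delta = matmul(b, a)  # low-rank update contributed by this adapter
        for i, row in enumerate(delta):
            for j, v in enumerate(row):
                merged[i][j] += w * v
    return merged


if __name__ == "__main__":
    base = [[1.0, 0.0], [0.0, 1.0]]
    task_a = ([[1.0], [0.0]], [[0.0, 2.0]])  # rank-1 adapter for a first task
    task_b = ([[0.0], [1.0]], [[3.0, 0.0]])  # rank-1 adapter for a second task
    print(merge_lora(base, [task_a, task_b]))
```

Because the merged matrix has the same shape as the base weight, inference needs a single forward pass regardless of how many adapters were folded in, which is the source of the inference-time saving the abstract mentions.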

[AI-25] Uterine Ultrasound Image Captioning Using Deep Learning Techniques

Link: https://arxiv.org/abs/2411.14039
Authors: Abdennour Boulesnane, Boutheina Mokhtari, Oumnia Rana Segueni, Slimane Segueni
Keywords: early X-ray usage, early X-ray, X-ray usage, revolutionized medical diagnostics, significantly revolutionized medical
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Medical imaging has significantly revolutionized medical diagnostics and treatment planning, progressing from early X-ray usage to sophisticated methods like MRIs, CT scans, and ultrasounds. This paper investigates the use of deep learning for medical image captioning, with a particular focus on uterine ultrasound images. These images are vital in obstetrics and gynecology for diagnosing and monitoring various conditions across different age groups. However, their interpretation is often challenging due to their complexity and variability. To address this, a deep learning-based medical image captioning system was developed, integrating Convolutional Neural Networks with a Bidirectional Gated Recurrent Unit network. This hybrid model processes both image and text features to generate descriptive captions for uterine ultrasound images. Our experimental results demonstrate the effectiveness of this approach over baseline methods, with the proposed model achieving superior performance in generating accurate and informative captions, as indicated by higher BLEU and ROUGE scores. By enhancing the interpretation of uterine ultrasound images, our research aims to assist medical professionals in making timely and accurate diagnoses, ultimately contributing to improved patient care.

[AI-26] Multi-LLM-Agent Systems: Techniques and Business Perspectives

Link: https://arxiv.org/abs/2411.14033
Authors: Yingxuan Yang, Qiuying Peng, Jun Wang, Weinan Zhang
Keywords: large language models, LLM agents, large language, language models, reformulated and reproduced
Categories: Artificial Intelligence (cs.AI)

Abstract:In the era of (multi-modal) large language models, most operational processes can be reformulated and reproduced using LLM agents. The LLM agents can perceive, control, and get feedback from the environment so as to accomplish the given tasks in an autonomous manner. Besides the environment-interaction property, the LLM agents can call various external tools to ease the task completion process. The tools can be regarded as a predefined operational process with private or real-time knowledge that does not exist in the parameters of LLMs. As a natural trend of development, the tools for calling are becoming autonomous agents, thus the full intelligent system turns out to be a multi-LLM-agent system (MLAS). This paper discusses the technical and business landscapes of MLAS. Compared to the previous single-LLM-agent system, a MLAS has the advantages of i) higher potential of task-solving performance, ii) higher flexibility for system changing, iii) proprietary data preserving for each participating entity, and iv) feasibility of monetization for each entity. To support the ecosystem of MLAS, we provide a preliminary version of such MLAS protocol considering technical requirements, data privacy, and business incentives. As such, MLAS would be a practical solution to achieve artificial collective intelligence in the near future.

[AI-27] Mirror Target YOLO: An Improved YOLOv8 Method with Indirect Vision for Heritage Buildings Fire Detection

Link: https://arxiv.org/abs/2411.13997
Authors: Jian Liang, JunSheng Cheng
Keywords: making timely fire, heritage buildings, making timely, severe damage, damage to heritage
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Fires can cause severe damage to heritage buildings, making timely fire detection essential. Traditional dense cabling and drilling can harm these structures, so reducing the number of cameras to minimize such impact is challenging. Additionally, avoiding false alarms due to noise sensitivity and preserving the expertise of managers in fire-prone areas is crucial. To address these needs, we propose a fire detection method based on indirect vision, called Mirror Target YOLO (MITA-YOLO). MITA-YOLO integrates indirect vision deployment and an enhanced detection module. It uses mirror angles to achieve indirect views, solving issues with limited visibility in irregular spaces and aligning each indirect view with the target monitoring area. The Target-Mask module is designed to automatically identify and isolate the indirect vision areas in each image, filtering out non-target areas. This enables the model to inherit managers’ expertise in assessing fire-risk zones, improving focus and resistance to interference in fire detection. In our experiments, we created an 800-image fire dataset with indirect vision. Results show that MITA-YOLO significantly reduces camera requirements while achieving superior detection performance compared to other mainstream models.

[AI-28] Safety Without Semantic Disruptions: Editing-free Safe Image Generation via Context-preserving Dual Latent Reconstruction

Link: https://arxiv.org/abs/2411.13982
Authors: Jordan Vice, Naveed Akhtar, Richard Hartley, Ajmal Mian
Keywords: Training multimodal generative, Training multimodal, uncurated datasets, exposed to harmful, unsafe and controversial
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: This research is supported by the NISDRG project #20100007, funded by the Australian Government

Abstract:Training multimodal generative models on large, uncurated datasets can result in users being exposed to harmful, unsafe and controversial or culturally-inappropriate outputs. While model editing has been proposed to remove or filter undesirable concepts in embedding and latent spaces, it can inadvertently damage learned manifolds, distorting concepts in close semantic proximity. We identify limitations in current model editing techniques, showing that even benign, proximal concepts may become misaligned. To address the need for safe content generation, we propose a modular, dynamic solution that leverages safety-context embeddings and a dual reconstruction process using tunable weighted summation in the latent space to generate safer images. Our method preserves global context without compromising the structural integrity of the learned manifolds. We achieve state-of-the-art results on safe image generation benchmarks, while offering controllable variation of model safety. We identify trade-offs between safety and censorship, which presents a necessary perspective in the development of ethical AI models. We will release our code. Keywords: Text-to-Image Models, Generative AI, Safety, Reliability, Model Editing
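
The "tunable weighted summation in the latent space" can be pictured with a minimal sketch. It assumes the summation is a convex combination of two latent vectors, one from the original prompt and one from a safety-conditioned reconstruction. The abstract does not give the exact rule, so this is an assumption, and `blend_latents`, `z_prompt`, `z_safe` and `safety_weight` are hypothetical names.

```python
def blend_latents(z_prompt, z_safe, safety_weight):
    """Weighted sum of two latent vectors of equal length.

    safety_weight = 0.0 returns the original latent, 1.0 returns the
    safety-conditioned latent; values in between trade one for the other.
    """
    if not 0.0 <= safety_weight <= 1.0:
        raise ValueError("safety_weight must lie in [0, 1]")
    if len(z_prompt) != len(z_safe):
        raise ValueError("latents must have the same length")
    return [
        (1.0 - safety_weight) * a + safety_weight * b
        for a, b in zip(z_prompt, z_safe)
    ]


if __name__ == "__main__":
    print(blend_latents([1.0, 2.0], [3.0, 4.0], 0.5))
```

Under this assumption the base model's weights are never edited; only the latent fed to the decoder changes, and a single scalar controls how strongly the safety context is applied.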

[AI-29] On the Fairness Diversity and Reliability of Text-to-Image Generative Models

Link: https://arxiv.org/abs/2411.13981
Authors: Jordan Vice, Naveed Akhtar, Richard Hartley, Ajmal Mian
Keywords: sparked critical discussions, potential for misuse, widespread availability, availability of multimodal, sparked critical
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: This research is supported by the NISDRG project #20100007, funded by the Australian Government

Abstract:The widespread availability of multimodal generative models has sparked critical discussions on their fairness, reliability, and potential for misuse. While text-to-image models can produce high-fidelity, user-guided images, they also exhibit unpredictable behavior and vulnerabilities, which can be exploited to manipulate class or concept representations. To address this, we propose an evaluation framework designed to assess model reliability through their responses to globally- and locally-applied 'semantic' perturbations in the embedding space, pinpointing inputs that trigger unreliable behavior. Our approach offers deeper insights into two essential aspects: (i) generative diversity, evaluating the breadth of visual representations for learned concepts, and (ii) generative fairness, examining how removing concepts from input prompts affects semantic guidance. Beyond these evaluations, our method lays the groundwork for detecting unreliable, bias-injected models and retrieval of bias provenance. We will release our code. Keywords: Fairness, Reliability, AI Ethics, Bias, Text-to-Image Models

[AI-30] FedRAV: Hierarchically Federated Region-Learning for Traffic Object Classification of Autonomous Vehicles

Link: https://arxiv.org/abs/2411.13979
Authors: Yijun Zhai, Pengzhan Zhou, Yuepeng He, Fang Qu, Zhida Qin, Xianlong Jiao, Guiyan Liu, Songtao Guo
Keywords: providing great potential, train equipped deep, utilizing explosively growing, explosively growing autonomous, equipped deep learning
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: 8 pages, 4 figures

Abstract:The emerging federated learning enables distributed autonomous vehicles to train equipped deep learning models collaboratively without exposing their raw data, providing great potential for utilizing explosively growing autonomous driving data. However, considering the complicated traffic environments and driving scenarios, deploying federated learning for autonomous vehicles is inevitably challenged by non-independent and identically distributed (Non-IID) data of vehicles, which may lead to failed convergence and low training accuracy. In this paper, we propose a novel hierarchically Federated Region-learning framework of Autonomous Vehicles (FedRAV), a two-stage framework, which adaptively divides a large area containing vehicles into sub-regions based on the defined region-wise distance, and achieves personalized vehicular models and regional models. This approach ensures that the personalized vehicular model adopts the beneficial models while discarding the unprofitable ones. We validate our FedRAV framework against existing federated learning algorithms on three real-world autonomous driving datasets in various heterogeneous settings. The experiment results demonstrate that our framework outperforms those known algorithms, and improves the accuracy by at least 3.69%. The source code of FedRAV is available at: this https URL.

[AI-31] A Dataset for Evaluating Online Anomaly Detection Approaches for Discrete Multivariate Time Series

Link: https://arxiv.org/abs/2411.13951
Authors: Lucas Correia, Jan-Christoph Goos, Thomas Bäck, Anna V. Kononova
Keywords: Benchmarking anomaly detection, Benchmarking anomaly, challenging due, lack of high-quality, Benchmarking
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Systems and Control (eess.SY)

Abstract:Benchmarking anomaly detection approaches for multivariate time series is challenging due to the lack of high-quality datasets. Current publicly available datasets are too small, not diverse and feature trivial anomalies, which hinders measurable progress in this research area. We propose a solution: a diverse, extensive, and non-trivial dataset generated via state-of-the-art simulation tools that reflects realistic behaviour of an automotive powertrain, including its multivariate, dynamic and variable-state properties. To cater for both unsupervised and semi-supervised anomaly detection settings, as well as time series generation and forecasting, we make different versions of the dataset available, where training and test subsets are offered in contaminated and clean versions, depending on the task. We also provide baseline results from a small selection of approaches based on deterministic and variational autoencoders, as well as a non-parametric approach. As expected, the baseline experimentation shows that the approaches trained on the semi-supervised version of the dataset outperform their unsupervised counterparts, highlighting a need for approaches more robust to contaminated training data.

[AI-32] Separable Mixture of Low-Rank Adaptation for Continual Visual Instruction Tuning

Link: https://arxiv.org/abs/2411.13949
Authors: Ziqi Wang, Chang Che, Qi Wang, Yangyang Li, Zenglin Shi, Meng Wang
Keywords: multimodal large language, Visual instruction tuning, enables multimodal large, large language models, multimodal large
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Visual instruction tuning (VIT) enables multimodal large language models (MLLMs) to effectively handle a wide range of vision tasks by framing them as language-based instructions. Building on this, continual visual instruction tuning (CVIT) extends the capability of MLLMs to incrementally learn new tasks, accommodating evolving functionalities. While prior work has advanced CVIT through the development of new benchmarks and approaches to mitigate catastrophic forgetting, these efforts largely follow traditional continual learning paradigms, neglecting the unique challenges specific to CVIT. We identify a dual form of catastrophic forgetting in CVIT, where MLLMs not only forget previously learned visual understanding but also experience a decline in instruction following abilities as they acquire new tasks. To address this, we introduce the Separable Mixture of Low-Rank Adaptation (SMoLoRA) framework, which employs separable routing through two distinct modules - one for visual understanding and another for instruction following. This dual-routing design enables specialized adaptation in both domains, preventing forgetting while improving performance. Furthermore, we propose a novel CVIT benchmark that goes beyond existing benchmarks by additionally evaluating a model’s ability to generalize to unseen tasks and handle diverse instructions across various tasks. Extensive experiments demonstrate that SMoLoRA outperforms existing methods in mitigating dual forgetting, improving generalization to unseen tasks, and ensuring robustness in following diverse instructions.

[AI-33] LLMs as Continuous Learners: Improving the Reproduction of Defective Code in Software Issues

Link: https://arxiv.org/abs/2411.13941
Authors: Yalan Lin, Yingwei Ma, Rongyu Cao, Binhua Li, Fei Huang, Xiaodong Gu, Yongbin Li
Keywords: Reproducing buggy code, crucially important step, generated patches resolve, Reproducing buggy, crucially important
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

Abstract:Reproducing buggy code is the first and crucially important step in issue resolving, as it aids in identifying the underlying problems and validating that generated patches resolve the problem. While numerous approaches have been proposed for this task, they primarily address common, widespread errors and struggle to adapt to unique, evolving errors specific to individual code repositories. To fill this gap, we propose EvoCoder, a multi-agent continuous learning framework for issue code reproduction. EvoCoder adopts a reflection mechanism that allows the LLM to continuously learn from previously resolved problems and dynamically refine its strategies to new emerging challenges. To prevent experience bloating, EvoCoder introduces a novel hierarchical experience pool that enables the model to adaptively update common and repo-specific experiences. Our experimental results show a 20% improvement in issue reproduction rates over existing SOTA methods. Furthermore, integrating our reproduction mechanism significantly boosts the overall accuracy of the existing issue-resolving pipeline.

[AI-34] Learning to Cooperate with Humans using Generative Agents

Link: https://arxiv.org/abs/2411.13934
Authors: Yancheng Liang, Daphne Chen, Abhishek Gupta, Simon S. Du, Natasha Jaques
Keywords: multi-agent reinforcement learning, human, training simulated human, generative model, key mission
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)

Abstract:Training agents that can coordinate zero-shot with humans is a key mission in multi-agent reinforcement learning (MARL). Current algorithms focus on training simulated human partner policies which are then used to train a Cooperator agent. The simulated human is produced either through behavior cloning over a dataset of human cooperation behavior, or by using MARL to create a population of simulated agents. However, these approaches often struggle to produce a Cooperator that can coordinate well with real humans, since the simulated humans fail to cover the diverse strategies and styles employed by people in the real world. We show that learning a generative model of human partners can effectively address this issue. Our model learns a latent variable representation of the human that can be regarded as encoding the human’s unique strategy, intention, experience, or style. This generative model can be flexibly trained from any (human or neural policy) agent interaction data. By sampling from the latent space, we can use the generative model to produce different partners to train Cooperator agents. We evaluate our method – Generative Agent Modeling for Multi-agent Adaptation (GAMMA) – on Overcooked, a challenging cooperative cooking game that has become a standard benchmark for zero-shot coordination. We conduct an evaluation with real human teammates, and the results show that GAMMA consistently improves performance, whether the generative model is trained on simulated populations or human datasets. Further, we propose a method for posterior sampling from the generative model that is biased towards the human data, enabling us to efficiently improve performance with only a small amount of expensive human interaction data.

[AI-35] XAgents: A Framework for Interpretable Rule-Based Multi-Agents Cooperation

Link: https://arxiv.org/abs/2411.13932
Authors: Hailong Yang, Mingxian Gu, Renhuo Zhao, Fuping Hu, Zhaohong Deng, Yitang Chen
Keywords: large language models, Extracting implicit knowledge, Extracting implicit, logical reasoning abilities, language models
Categories: Artificial Intelligence (cs.AI)

Abstract:Extracting implicit knowledge and logical reasoning abilities from large language models (LLMs) has consistently been a significant challenge. The advancement of multi-agent systems has further enhanced the capabilities of LLMs. Inspired by the structure of multi-polar neurons (MNs), we propose the XAgents framework, an interpretable multi-agent cooperative framework based on the IF-THEN rule-based system. The IF-Parts of the rules are responsible for logical reasoning and domain membership calculation, while the THEN-Parts are comprised of domain expert agents that generate domain-specific contents. Following the calculation of the membership, XAgents transmits the task to the disparate domain rules, which subsequently generate the various responses. These responses are analogous to the answers provided by different experts to the same question. The final response is reached by eliminating the hallucinations and erroneous knowledge of the LLM through membership computation and semantic adversarial generation of the various domain rules. The incorporation of rule-based interpretability serves to bolster user confidence in the XAgents framework. We evaluate the efficacy of XAgents through a comparative analysis with the latest AutoAgents, in which XAgents demonstrated superior performance across three distinct datasets. We perform post-hoc interpretable studies with SHAP algorithm and case studies, proving the interpretability of XAgents in terms of input-output feature correlation and rule-based semantics.

[AI-36] Split Federated Learning Over Heterogeneous Edge Devices: Algorithm and Optimization

Link: https://arxiv.org/abs/2411.13907
Authors: Yunrui Sun, Gang Hu, Yinglei Teng, Dunbo Cai
Keywords: preserving privacy simultaneously, promising collaborative machine, machine learning approach, collaborative machine learning, sharing raw data
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Split Learning (SL) is a promising collaborative machine learning approach, enabling resource-constrained devices to train models without sharing raw data, while reducing computational load and preserving privacy simultaneously. However, current SL algorithms face limitations in training efficiency and suffer from prolonged latency, particularly in sequential settings, where the slowest device can bottleneck the entire process due to heterogeneous resources and frequent data exchanges between clients and servers. To address these challenges, we propose the Heterogeneous Split Federated Learning (HSFL) framework, which allows resource-constrained clients to train their personalized client-side models in parallel, utilizing different cut layers. Aiming to mitigate the impact of heterogeneous environments and accelerate the training process, we formulate a latency minimization problem that optimizes computational and transmission resources jointly. Additionally, we design a resource allocation algorithm that combines the Sample Average Approximation (SAA), Genetic Algorithm (GA), Lagrangian relaxation and Branch and Bound (B&B) methods to efficiently solve this problem. Simulation results demonstrate that HSFL outperforms other frameworks in terms of both convergence rate and model accuracy on heterogeneous devices with non-iid data, while the optimization algorithm is better than other baseline methods in reducing latency.
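
A toy latency model shows why letting clients use different cut layers helps when devices are heterogeneous. The sketch below is not the paper's algorithm: it replaces the SAA, GA, Lagrangian relaxation and B&B combination with a plain brute-force search over cut layers. The cost model (local compute, activation transfer, server compute) and all names and numbers are assumptions made for illustration.

```python
def client_latency(cut, layer_flops, act_bytes, client_speed, bandwidth, server_speed):
    """Latency for one client that runs layers [0, cut) locally and the rest on the server."""
    local = sum(layer_flops[:cut]) / client_speed
    transfer = act_bytes[cut - 1] / bandwidth  # activations produced by the last local layer
    remote = sum(layer_flops[cut:]) / server_speed
    return local + transfer + remote


def best_cuts(layer_flops, act_bytes, clients, server_speed):
    """Pick, for each client, the cut layer that minimizes its own latency.

    clients is a list of (client_speed, bandwidth) pairs. Clients train in
    parallel, so the round latency is the maximum over clients.
    """
    cuts, latencies = [], []
    for speed, bw in clients:
        options = range(1, len(layer_flops))  # keep at least one layer on each side
        cut = min(options, key=lambda c: client_latency(
            c, layer_flops, act_bytes, speed, bw, server_speed))
        cuts.append(cut)
        latencies.append(
            client_latency(cut, layer_flops, act_bytes, speed, bw, server_speed))
    return cuts, max(latencies)


if __name__ == "__main__":
    layer_flops = [4.0, 4.0, 4.0, 4.0]   # hypothetical per-layer compute cost
    act_bytes = [8.0, 2.0, 1.0, 1.0]     # hypothetical activation size after each layer
    clients = [(4.0, 4.0), (1.0, 4.0)]   # one fast and one slow device
    print(best_cuts(layer_flops, act_bytes, clients, server_speed=8.0))
```

In this toy setting the slow device offloads earlier than the fast one, which lowers the round latency compared with forcing both devices onto the same cut layer.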

[AI-37] When Online Algorithms Influence the Environment: A Dynamical Systems Analysis of the Unintended Consequences

Link: https://arxiv.org/abs/2411.13883
Authors: Prabhat Lankireddy, Jayakrishnan Nair, D Manjunath
Keywords: user, online algorithms, algorithm, recommendation algorithm, recommendations
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 13 pages, 4 figures

Abstract:We analyze the effect that online algorithms have on the environment that they are learning. As a motivation, consider recommendation systems that use online algorithms to learn optimal product recommendations based on user and product attributes. It is well known that the sequence of recommendations affects user preferences. However, typical learning algorithms treat the user attributes as static and disregard the impact of their recommendations on user preferences. Our interest is to analyze the effect of this mismatch between the model assumption of a static environment, and the reality of an evolving environment affected by the recommendations. To perform this analysis, we first introduce a model for a generic coupled evolution of the parameters that are being learned, and the environment that is affected by it. We then frame a linear bandit recommendation system (RS) into this generic model where the users are characterized by a state variable that evolves based on the sequence of recommendations. The learning algorithm of the RS does not explicitly account for this evolution and assumes that the users are static. A dynamical system model that captures the coupled evolution of the population state and the learning algorithm is described, and its equilibrium behavior is analyzed. We show that when the recommendation algorithm is able to learn the population preferences in the presence of this mismatch, the algorithm induces similarity in the preferences of the user population. In particular, we present results on how different properties of the recommendation algorithm, namely the user attribute space and the exploration-exploitation tradeoff, affect the population preferences when they are learned by the algorithm. We demonstrate these results using model simulations.
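
The coupled evolution can be illustrated with a deliberately simple deterministic toy. It is not the linear bandit model analyzed in the paper. It assumes two items, a learner that tracks the mean preference as if users were static, and users who drift a fixed fraction toward whatever is recommended. The names `simulate` and `spread` and the parameters `drift` and `learn` are hypothetical.

```python
def simulate(prefs, steps=50, drift=0.1, learn=0.5):
    """Couple a learner that assumes static users with users that drift.

    prefs: initial per-user preference for item A, each in [0, 1].
    At every step the learner moves its estimate toward the current mean
    preference, recommends A if the estimate is at least 0.5 (else B), and
    every user moves a fraction `drift` toward the recommended item
    (1.0 for A, 0.0 for B).
    """
    prefs = list(prefs)
    estimate = 0.5
    for _ in range(steps):
        mean = sum(prefs) / len(prefs)
        estimate += learn * (mean - estimate)     # learner update, users assumed static
        target = 1.0 if estimate >= 0.5 else 0.0  # greedy recommendation
        prefs = [p + drift * (target - p) for p in prefs]  # users respond to it
    return prefs, estimate


def spread(values):
    """Gap between the largest and smallest preference in the population."""
    return max(values) - min(values)


if __name__ == "__main__":
    start = [0.1, 0.4, 0.9]
    end, estimate = simulate(start)
    print(spread(start), spread(end), estimate)
```

Every user is pulled toward the same recommended item at each step, so the spread of preferences contracts by a factor of (1 − drift) per step. This is a crude analogue of the induced similarity described in the abstract.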

[AI-38] Next-Generation Phishing: How LLM Agents Empower Cyber Attackers

Link: https://arxiv.org/abs/2411.13874
Authors: Khalifa Afane, Wenqi Wei, Ying Mao, Junaid Farooq, Juntao Chen
Keywords: Large Language Models, Large Language, rise of Large, Language Models, Gmail Spam Filter
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)

Abstract:The escalating threat of phishing emails has become increasingly sophisticated with the rise of Large Language Models (LLMs). As attackers exploit LLMs to craft more convincing and evasive phishing emails, it is crucial to assess the resilience of current phishing defenses. In this study we conduct a comprehensive evaluation of traditional phishing detectors, such as Gmail Spam Filter, Apache SpamAssassin, and Proofpoint, as well as machine learning models like SVM, Logistic Regression, and Naive Bayes, in identifying both traditional and LLM-rephrased phishing emails. We also explore the emerging role of LLMs as phishing detection tools, a method already adopted by companies like NTT Security Holdings and JPMorgan Chase. Our results reveal notable declines in detection accuracy for rephrased emails across all detectors, highlighting critical weaknesses in current phishing defenses. As the threat landscape evolves, our findings underscore the need for stronger security controls and regulatory oversight on LLM-generated content to prevent its misuse in creating advanced phishing attacks. This study contributes to the development of more effective Cyber Threat Intelligence (CTI) by leveraging LLMs to generate diverse phishing variants that can be used for data augmentation, harnessing the power of LLMs to enhance phishing detection, and paving the way for more robust and adaptable threat detection systems.

[AI-39] Generative Fuzzy System for Sequence Generation

Link: https://arxiv.org/abs/2411.13867
Authors: Hailong Yang, Zhaohong Deng, Wei Zhang, Zhuangzhuang Zhao, Guanjin Wang, Kup-sze Choi
Keywords: Large Language Models, Large Language, garnered significant attention, Language Models, resemble the original
Categories: Artificial Intelligence (cs.AI)
Comments: 12 pages, 5 figures

Abstract:Generative Models (GMs), particularly Large Language Models (LLMs), have garnered significant attention in machine learning and artificial intelligence for their ability to generate new data by learning the statistical properties of training data and creating data that resemble the original. This capability offers a wide range of applications across various domains. However, the complex structures and numerous model parameters of GMs make the input-output processes opaque, complicating the understanding and control of outputs. Moreover, the purely data-driven learning mechanism limits GM’s ability to acquire broader knowledge. There remains substantial potential for enhancing the robustness and generalization capabilities of GMs. In this work, we introduce the fuzzy system, a classical modeling method that combines data and knowledge-driven mechanisms, to generative tasks. We propose a novel Generative Fuzzy System framework, named GenFS, which integrates the deep learning capabilities of GM with the interpretability and dual-driven mechanisms of fuzzy systems. Specifically, we propose an end-to-end GenFS-based model for sequence generation, called FuzzyS2S. A series of experimental studies were conducted on 12 datasets, covering three distinct categories of generative tasks: machine translation, code generation, and summary generation. The results demonstrate that FuzzyS2S outperforms the Transformer in terms of accuracy and fluency. Furthermore, it exhibits better performance on some datasets compared to state-of-the-art models T5 and CodeT5.

[AI-40] Exploratory Study Of Human-AI Interaction For Hindustani Music NEURIPS

Link: https://arxiv.org/abs/2411.13846
Authors: Nithya Shikarpur, Cheng-Zhi Anna Huang
Keywords: Hindustani vocal contours, hierarchical generative model, vocal contours, paper presents, hierarchical generative
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: Accepted at NeurIPS Creative AI Track 2024

Abstract:This paper presents a study of participants interacting with and using GaMaDHaNi, a novel hierarchical generative model for Hindustani vocal contours. To explore possible use cases in human-AI interaction, we conducted a user study with three participants, each engaging with the model through three predefined interaction modes. Although this study was conducted “in the wild” (with the model unadapted for the shift from the training data to real-world interaction), we use it as a pilot to better understand the expectations, reactions, and preferences of practicing musicians when engaging with such a model. We note their challenges as (1) the lack of restrictions in model output, and (2) the incoherence of model output. We situate these challenges in the context of Hindustani music and aim to suggest future directions for the model design to address these gaps.

[AI-41] Heterophilic Graph Neural Networks Optimization with Causal Message-passing

Link: https://arxiv.org/abs/2411.13821
Authors: Botao Wang, Jia Li, Heng Chang, Keli Zhang, Fugee Tsung
Keywords: Graph Neural Network, Neural Network, Graph Neural, promising approach, approach to capture
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Abstract:In this work, we discover that causal inference provides a promising approach to capture heterophilic message-passing in Graph Neural Networks (GNNs). By leveraging cause-effect analysis, we can discern heterophilic edges based on asymmetric node dependency. The learned causal structure offers more accurate relationships among nodes. To reduce the computational complexity, we introduce intervention-based causal inference in graph learning. We first simplify causal analysis on graphs by formulating it as a structural learning model and define the optimization problem within the Bayesian scheme. We then present an analysis of decomposing the optimization target into a consistency penalty and a structure modification based on cause-effect relations. We then estimate this target by conditional entropy and present insights into how conditional entropy quantifies the heterophily. Accordingly, we propose CausalMP, a causal message-passing discovery network for heterophilic graph learning that iteratively learns the explicit causal structure of input graphs. We conduct extensive experiments in both heterophilic and homophilic graph settings. The results demonstrate that our model achieves superior link prediction performance. Training on causal structure can also enhance node representation in classification tasks across different base models.

[AI-42] AutoMixQ: Self-Adjusting Quantization for High Performance Memory-Efficient Fine-Tuning

Link: https://arxiv.org/abs/2411.13814
Authors: Changhai Zhou, Shiyang Zhang, Yuhua Zhou, Zekai Liu, Shichao Weng
Keywords: Fine-tuning large language, Fine-tuning large, large language models, deep learning, large language
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Fine-tuning large language models (LLMs) under resource constraints is a significant challenge in deep learning. Low-Rank Adaptation (LoRA), pruning, and quantization are all effective methods for improving resource efficiency. However, combining them directly often results in suboptimal performance, especially with uniform quantization across all model layers. This is due to the complex, uneven interlayer relationships introduced by pruning, necessitating more refined quantization strategies. To address this, we propose AutoMixQ, an end-to-end optimization framework that selects optimal quantization configurations for each LLM layer. AutoMixQ leverages lightweight performance models to guide the selection process, significantly reducing time and computational resources compared to exhaustive search methods. By incorporating Pareto optimality, AutoMixQ balances memory usage and performance, approaching the upper bounds of model capability under strict resource constraints. Our experiments on widely used benchmarks show that AutoMixQ reduces memory consumption while achieving superior performance. For example, at a 30% pruning rate in LLaMA-7B, AutoMixQ achieved 66.21% on BoolQ compared to 62.45% for LoRA and 58.96% for LoftQ, while reducing memory consumption by 35.5% compared to LoRA and 27.5% compared to LoftQ.
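
The Pareto-optimality idea mentioned in the abstract can be illustrated with a minimal sketch. This is not AutoMixQ itself: the `Config` type, the configuration names, and all memory and accuracy numbers below are hypothetical, chosen only to show how non-dominated quantization configurations would be kept when trading memory against accuracy.

```python
from typing import List, NamedTuple


class Config(NamedTuple):
    name: str          # hypothetical label for a per-layer bit-width assignment
    memory_gb: float   # estimated memory footprint (lower is better)
    accuracy: float    # estimated task accuracy (higher is better)


def pareto_front(configs: List[Config]) -> List[Config]:
    """Return the configs that no other config dominates.

    A config is dominated if another one uses no more memory and reaches
    at least the same accuracy, and is strictly better on one of the two.
    """
    front = []
    for c in configs:
        dominated = any(
            o.memory_gb <= c.memory_gb
            and o.accuracy >= c.accuracy
            and (o.memory_gb < c.memory_gb or o.accuracy > c.accuracy)
            for o in configs
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda c: c.memory_gb)


# Illustrative numbers only; they are not taken from the paper.
candidates = [
    Config("all-4bit", 4.0, 0.60),
    Config("mixed-a", 5.0, 0.66),
    Config("mixed-b", 5.5, 0.64),   # dominated by mixed-a
    Config("all-8bit", 7.0, 0.67),
]
front = pareto_front(candidates)
print([c.name for c in front])
```

In a setting like the one the abstract describes, the accuracy column would presumably come from the lightweight performance models rather than from full evaluation, and the final configuration would be picked from the front according to the memory budget.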

[AI-43] A Survey on Adversarial Robustness of LiDAR-based Machine Learning Perception in Autonomous Vehicles

Link: https://arxiv.org/abs/2411.13778
Authors: Junae Kim, Amardeep Kaur
Keywords: offers great potential, vehicular technology offers, technology offers great, great potential, vehicular technology
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 20 pages, 2 figures

Abstract:In autonomous driving, the combination of AI and vehicular technology offers great potential. However, this amalgamation comes with vulnerabilities to adversarial attacks. This survey focuses on the intersection of Adversarial Machine Learning (AML) and autonomous systems, with a specific focus on LiDAR-based systems. We comprehensively explore the threat landscape, encompassing cyber-attacks on sensors and adversarial perturbations. Additionally, we investigate defensive strategies employed in countering these threats. This paper endeavors to present a concise overview of the challenges and advances in securing autonomous driving systems against adversarial threats, emphasizing the need for robust defenses to ensure safety and security.

[AI-44] FastRAG: Retrieval Augmented Generation for Semi-structured Data

Link: https://arxiv.org/abs/2411.13773
Authors: Amar Abane, Anis Bekri, Abdella Battou
Keywords: Large Language Models, increasingly complex networks, Efficiently processing, interpreting network data, operation of increasingly
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)

Abstract:Efficiently processing and interpreting network data is critical for the operation of increasingly complex networks. Recent advances in Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) techniques have improved data processing in network management. However, existing RAG methods like VectorRAG and GraphRAG struggle with the complexity and implicit nature of semi-structured technical data, leading to inefficiencies in time, cost, and retrieval. This paper introduces FastRAG, a novel RAG approach designed for semi-structured data. FastRAG employs schema learning and script learning to extract and structure data without needing to submit entire data sources to an LLM. It integrates text search with knowledge graph (KG) querying to improve accuracy in retrieving context-rich information. Evaluation results demonstrate that FastRAG provides accurate question answering, with improvements of up to 90% in time and 85% in cost compared to GraphRAG.

[AI-45] An Evaluation-Driven Approach to Designing LLM Agents: Process and Architecture

Link: https://arxiv.org/abs/2411.13768
Authors: Boming Xia, Qinghua Lu, Liming Zhu, Zhenchang Xing, Dehai Zhao, Hao Zhang
Keywords: Large Language Models, Large Language, autonomously achieving under-specified, achieving under-specified goals, LLM agents capable
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

Abstract:The advent of Large Language Models (LLMs) has enabled the development of LLM agents capable of autonomously achieving under-specified goals and continuously evolving through post-deployment improvement, sometimes without requiring code or model updates. Conventional approaches, such as pre-defined test cases and code/model redevelopment pipelines, are inadequate for addressing the unique challenges of LLM agent development, particularly in terms of quality and risk control. This paper introduces an evaluation-driven design approach, inspired by test-driven development, to address these challenges. Through a multivocal literature review (MLR), we synthesize existing LLM evaluation methods and propose a novel process model and reference architecture specifically designed for LLM agents. The proposed approach integrates online and offline evaluations to support adaptive runtime adjustments and systematic offline redevelopment, improving runtime pipelines, artifacts, system architecture, and LLMs by continuously incorporating evaluation results, including fine-grained feedback from human and AI evaluators.

[AI-46] Tiny-Align: Bridging Automatic Speech Recognition and Large Language Model on the Edge

Link: https://arxiv.org/abs/2411.13766
Authors: Ruiyang Qin, Dancheng Liu, Gelei Xu, Zheyu Yan, Chenhui Xu, Yuting Hu, X. Sharon Hu, Jinjun Xiong, Yiyu Shi
Keywords: Automatic Speech Recognition, Speech Recognition, Automatic Speech, Large Language Models, combination of Large
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: 7 pages, 8 figures

Abstract:The combination of Large Language Models (LLM) and Automatic Speech Recognition (ASR), when deployed on edge devices (called edge ASR-LLM), can serve as a powerful personalized assistant to enable audio-based interaction for users. Compared to text-based interaction, edge ASR-LLM allows accessible and natural audio interactions. Unfortunately, existing ASR-LLM models are mainly trained in high-performance computing environments and produce substantial model weights, making them difficult to deploy on edge devices. More importantly, to better serve users’ personalized needs, the ASR-LLM must be able to learn from each distinct user, given that audio input often contains highly personalized characteristics that necessitate personalized on-device training. Since individually fine-tuning the ASR or LLM often leads to suboptimal results due to modality-specific limitations, end-to-end training ensures seamless integration of audio features and language understanding (cross-modal alignment), ultimately enabling a more personalized and efficient adaptation on edge devices. However, due to the complex training requirements and substantial computational demands of existing approaches, cross-modal alignment between ASR audio and LLM can be challenging on edge devices. In this work, we propose a resource-efficient cross-modal alignment framework that bridges ASR and LLMs on edge devices to handle personalized audio input. Our framework enables efficient ASR-LLM alignment on resource-constrained devices like NVIDIA Jetson Orin (8GB RAM), achieving 50x training time speedup while improving the alignment quality by more than 50%. To the best of our knowledge, this is the first work to study efficient ASR-LLM alignment on resource-constrained edge devices.

[AI-47] AttentionBreaker: Adaptive Evolutionary Optimization for Unmasking Vulnerabilities in LLMs through Bit-Flip Attacks

Link: https://arxiv.org/abs/2411.13757
Authors: Sanjay Das, Swastik Bhattacharya, Souvik Kundu, Shamik Kundu, Anand Menon, Arnab Raha, Kanad Basu
Keywords: Large Language Models, natural language processing, revolutionized natural language, Large Language, language processing
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Large Language Models (LLMs) have revolutionized natural language processing (NLP), excelling in tasks like text generation and summarization. However, their increasing adoption in mission-critical applications raises concerns about hardware-based threats, particularly bit-flip attacks (BFAs). BFAs, enabled by fault injection methods such as Rowhammer, target model parameters in memory, compromising both integrity and performance. Identifying critical parameters for BFAs in the vast parameter space of LLMs poses significant challenges. While prior research suggests transformer-based architectures are inherently more robust to BFAs compared to traditional deep neural networks, we challenge this assumption. For the first time, we demonstrate that as few as three bit-flips can cause catastrophic performance degradation in an LLM with billions of parameters. Current BFA techniques are inadequate for exploiting this vulnerability due to the difficulty of efficiently identifying critical parameters within the immense parameter space. To address this, we propose AttentionBreaker, a novel framework tailored for LLMs that enables efficient traversal of the parameter space to identify critical parameters. Additionally, we introduce GenBFA, an evolutionary optimization strategy designed to refine the search further, isolating the most critical bits for an efficient and effective attack. Empirical results reveal the profound vulnerability of LLMs to AttentionBreaker. For example, merely three bit-flips (4.129 x 10^-9% of total parameters) in the LLaMA3-8B-Instruct 8-bit quantized (W8) model result in a complete performance collapse: accuracy on MMLU tasks drops from 67.3% to 0%, and Wikitext perplexity skyrockets from 12.6 to 4.72 x 10^5. These findings underscore the effectiveness of AttentionBreaker in uncovering and exploiting critical vulnerabilities within LLM architectures.
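
To make the notion of a bit-flip on an 8-bit quantized weight concrete, here is a minimal sketch of what a single flipped bit does to a signed 8-bit value. It only illustrates why the position of the flipped bit matters; it does not reproduce AttentionBreaker or GenBFA, and the helper name `flip_bit_int8` is hypothetical.

```python
def flip_bit_int8(value: int, bit: int) -> int:
    """Flip one bit of a signed 8-bit integer (two's complement).

    `bit` counts from 0 (least significant) to 7 (sign bit).
    """
    if not -128 <= value <= 127:
        raise ValueError("value must fit in a signed 8-bit integer")
    if not 0 <= bit <= 7:
        raise ValueError("bit must be in the range 0..7")
    raw = value & 0xFF            # reinterpret as an unsigned byte
    raw ^= 1 << bit               # flip the requested bit
    return raw - 256 if raw >= 128 else raw


weight = 3
print(flip_bit_int8(weight, 0))   # least significant bit: small change
print(flip_bit_int8(weight, 7))   # sign bit: large change in magnitude and sign
```

Flipping the sign bit moves the stored value by 128 quantization steps, so after dequantization a single flip can turn a small weight into one of the largest-magnitude values in the tensor. Which weights and bits are most damaging to a model is the search problem the abstract describes.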

[AI-48] AI-Driven Agents with Prompts Designed for High Agreeableness Increase the Likelihood of Being Mistaken for a Human in the Turing Test

Link: https://arxiv.org/abs/2411.13749
Authors: U. León-Domínguez, E. D. Flores-Flores, A. J. García-Jasso, M. K. Gómez-Cuellar, D. Torres-Sánchez, A. Basora-Marimon
Keywords: Large Language Models, Large Language, enabling verbal interaction, Language Models based, Turing Test
Categories: Artificial Intelligence (cs.AI)
Comments: 25 pages, 2 figures, 7 tables

Abstract:Large Language Models based on transformer algorithms have revolutionized Artificial Intelligence by enabling verbal interaction with machines akin to human conversation. These AI agents have surpassed the Turing Test, achieving confusion rates up to 50%. However, challenges persist, especially with the advent of robots and the need to humanize machines for improved Human-AI collaboration. In this experiment, three GPT agents with varying levels of agreeableness (disagreeable, neutral, agreeable) based on the Big Five Inventory were tested in a Turing Test. All exceeded a 50% confusion rate, with the highly agreeable AI agent surpassing 60%. This agent was also recognized as exhibiting the most human-like traits. Various explanations in the literature address why these GPT agents were perceived as human, including psychological frameworks for understanding anthropomorphism. These findings highlight the importance of personality engineering as an emerging discipline in artificial intelligence, calling for collaboration with psychology to develop ergonomic psychological models that enhance system adaptability in collaborative activities.

[AI-49] Federated Continual Learning for Edge-AI: A Comprehensive Survey

Link: https://arxiv.org/abs/2411.13740
Authors: Zi Wang, Fei Wu, Feng Yu, Yurui Zhou, Jia Hu, Geyong Min
Keywords: FCL, edge computing, network edge, close to users, continual learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Networking and Internet Architecture (cs.NI)

Abstract:Edge-AI, the convergence of edge computing and artificial intelligence (AI), has become a promising paradigm that enables the deployment of advanced AI models at the network edge, close to users. In Edge-AI, federated continual learning (FCL) has emerged as an imperative framework, which fuses knowledge from different clients while preserving data privacy and retaining knowledge from previous tasks as it learns new ones. In doing so, FCL aims to ensure stable and reliable performance of learning models in dynamic and distributed environments. In this survey, we thoroughly review the state-of-the-art research and present the first comprehensive survey of FCL for Edge-AI. We categorize FCL methods based on three task characteristics: federated class continual learning, federated domain continual learning, and federated task continual learning. For each category, an in-depth investigation and review of the representative methods are provided, covering background, challenges, problem formalisation, solutions, and limitations. In addition, existing real-world applications empowered by FCL are reviewed, indicating the current progress and potential of FCL in diverse application domains. Furthermore, we discuss and highlight several prospective research directions of FCL such as algorithm-hardware co-design for FCL and FCL with foundation models, which could provide insights into the future development and practical deployment of FCL in the era of Edge-AI.

[AI-50] Exploring Large Language Models for Climate Forecasting

Link: https://arxiv.org/abs/2411.13724
Authors: Yang Wang, Hassan A. Karimi
Keywords: reliable future climate, future climate information, support planning, future climate predictions, increasing impacts
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:With the increasing impacts of climate change, there is a growing demand for accessible tools that can provide reliable future climate information to support planning, finance, and other decision-making applications. Large language models (LLMs), such as GPT-4, present a promising approach to bridging the gap between complex climate data and the general public, offering a way for non-specialist users to obtain essential climate insights through natural language interaction. However, an essential challenge remains under-explored: evaluating the ability of LLMs to provide accurate and reliable future climate predictions, which is crucial for applications that rely on anticipating climate trends. In this study, we investigate the capability of GPT-4 in predicting rainfall at short-term (15-day) and long-term (12-month) scales. We designed a series of experiments to assess GPT’s performance under different conditions, including scenarios with and without expert data inputs. Our results indicate that GPT, when operating independently, tends to generate conservative forecasts, often reverting to historical averages in the absence of clear trend signals. This study highlights both the potential and challenges of applying LLMs for future climate predictions, providing insights into their integration with climate-related applications and suggesting directions for enhancing their predictive capabilities in the field.

[AI-51] Bimanual Dexterity for Complex Tasks

Link: https://arxiv.org/abs/2411.13677
Authors: Kenneth Shaw, Yulong Li, Jiahui Yang, Mohan Kumar Srirama, Ray Liu, Haoyu Xiong, Russell Mendonca, Deepak Pathak
Keywords: machine learning methods, generalist robot policies, train generalist robot, machine learning, expert human teleoperation
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: In CoRL 2024. Website at this https URL

Abstract:To train generalist robot policies, machine learning methods often require a substantial amount of expert human teleoperation data. An ideal robot for humans collecting data is one that closely mimics them: bimanual arms and dexterous hands. However, creating such a bimanual teleoperation system with over 50 DoF is a significant challenge. To address this, we introduce Bidex, an extremely dexterous, low-cost, low-latency and portable bimanual dexterous teleoperation system which relies on motion capture gloves and teacher arms. We compare Bidex to a Vision Pro teleoperation system and a SteamVR system and find Bidex to produce better quality data for more complex tasks at a faster rate. Additionally, we show Bidex operating a mobile bimanual robot for in-the-wild tasks. The robot hands (5k USD) and teleoperation system (7k USD) are readily reproducible and can be used on many robot arms including two xArms (16k USD). Website at this https URL

[AI-52] FabuLight-ASD: Unveiling Speech Activity via Body Language

Link: https://arxiv.org/abs/2411.13674
Authors: Hugo Carneiro, Stefan Wermter
Keywords: Active speaker detection, Wilder Active Speaker, Active speaker, human-robot interaction, speaker detection
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Sound (cs.SD)
Comments: 23 pages, 8 figures, 3 tables, accepted for publication in Neural Computing and Applications

Abstract:Active speaker detection (ASD) in multimodal environments is crucial for various applications, from video conferencing to human-robot interaction. This paper introduces FabuLight-ASD, an advanced ASD model that integrates facial, audio, and body pose information to enhance detection accuracy and robustness. Our model builds upon the existing Light-ASD framework by incorporating human pose data, represented through skeleton graphs, which minimises computational overhead. Using the Wilder Active Speaker Detection (WASD) dataset, renowned for reliable face and body bounding box annotations, we demonstrate FabuLight-ASD’s effectiveness in real-world scenarios. Achieving an overall mean average precision (mAP) of 94.3%, FabuLight-ASD outperforms Light-ASD, which has an overall mAP of 93.7% across various challenging scenarios. The incorporation of body pose information shows a particularly advantageous impact, with notable improvements in mAP observed in scenarios with speech impairment, face occlusion, and human voice background noise. Furthermore, efficiency analysis indicates only a modest increase in parameter count (27.3%) and multiply-accumulate operations (up to 2.4%), underscoring the model’s efficiency and feasibility. These findings validate the efficacy of FabuLight-ASD in enhancing ASD performance through the integration of body pose data. FabuLight-ASD’s code and model weights are available at this https URL.

[AI-53] No Free Delivery Service: Epistemic limits of passive data collection in complex social systems NEURIPS’24

Link: https://arxiv.org/abs/2411.13653
Authors: Maximilian Nickel
Keywords: Rapid model validation, Rapid model, breathtaking progress, progress in machine, machine learning
Categories: Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: To appear in NeurIPS’24

Abstract:Rapid model validation via the train-test paradigm has been a key driver for the breathtaking progress in machine learning and AI. However, modern AI systems often depend on a combination of tasks and data collection practices that violate all assumptions ensuring test validity. Yet, without rigorous model validation we cannot ensure the intended outcomes of deployed AI systems, including positive social impact, nor continue to advance AI research in a scientifically sound way. In this paper, I show that for widely considered inference settings in complex social systems the train-test paradigm not only lacks a justification but is indeed invalid for any risk estimator, including counterfactual and causal estimators, with high probability. These formal impossibility results highlight a fundamental epistemic issue, i.e., that for key tasks in modern AI we cannot know whether models are valid under current data collection practices. Importantly, this includes variants of both recommender systems and reasoning via large language models, and neither naïve scaling nor limited benchmarks are suited to address this issue. I illustrate these results via the widely used MovieLens benchmark and conclude by discussing the implications of these results for AI in social systems, including possible remedies such as participatory data curation and open science.

[AI-54] CryptoFormalEval: Integrating LLMs and Formal Verification for Automated Cryptographic Protocol Vulnerability Detection

Link: https://arxiv.org/abs/2411.13627
Authors: Cristian Curaba, Denis D’Ambrosi, Alessandro Minisini, Natalia Pérez-Campanero Antolín
Keywords: modern digital infrastructure, securing modern digital, prior formal verification, Cryptographic protocols play, digital infrastructure
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Symbolic Computation (cs.SC)

Abstract:Cryptographic protocols play a fundamental role in securing modern digital infrastructure, but they are often deployed without prior formal verification. This could lead to the adoption of distributed systems vulnerable to attack vectors. Formal verification methods, on the other hand, require complex and time-consuming techniques that lack automation. In this paper, we introduce a benchmark to assess the ability of Large Language Models (LLMs) to autonomously identify vulnerabilities in new cryptographic protocols through interaction with Tamarin, a theorem prover for protocol verification. We created a manually validated dataset of novel, flawed communication protocols and designed a method to automatically verify the vulnerabilities found by the AI agents. Our results on the performance of current frontier models on the benchmark provide insights into the possibility of cybersecurity applications that integrate LLMs with symbolic reasoning systems.

[AI-55] Non-Linear Outlier Synthesis for Out-of-Distribution Detection

Link: https://arxiv.org/abs/2411.13619
Authors: Lars Doorenbos, Raphael Sznitman, Pablo Márquez-Neila
Keywords: unexpected inputs, leading to great, reliability of supervised, supervised classifiers, classifiers is severely
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:The reliability of supervised classifiers is severely hampered by their limitations in dealing with unexpected inputs, leading to great interest in out-of-distribution (OOD) detection. Recently, OOD detectors trained on synthetic outliers, especially those generated by large diffusion models, have shown promising results in defining robust OOD decision boundaries. Building on this progress, we present NCIS, which enhances the quality of synthetic outliers in two ways: by operating directly in the diffusion model’s embedding space rather than combining disjoint models as in previous work, and by modeling class-conditional manifolds with a conditional volume-preserving network for more expressive characterization of the training distribution. We demonstrate that these improvements yield new state-of-the-art OOD detection results on standard ImageNet100 and CIFAR100 benchmarks and provide insights into the importance of data pre-processing and other key design choices. We make our code available at this https URL.

[AI-56] Verification and Validation of Autonomous Systems

Link: https://arxiv.org/abs/2411.13614
Authors: Sneha Sudhir Shetiya, Vikas Vyas, Shreyas Renukuntla
Keywords: product development phase, proficiently prevent software, software product development, prevent software defects, autonomous vehicles
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

Abstract:This paper describes how to proficiently prevent software defects in autonomous vehicles, discover and correct defects if they are encountered, and create a higher level of assurance in the software product development phase. It also describes how to ensure high assurance on software reliability.

[AI-57] SuPLE: Robot Learning with Lyapunov Rewards

Link: https://arxiv.org/abs/2411.13613
Authors: Phu Nguyen, Daniel Polani, Stas Tiomkin
Keywords: essential component, robot learning, reward, Positive Lyapunov Exponents’, learning
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 7 pages, 4 figures

Abstract:The reward function is an essential component in robot learning. Reward directly affects the sample and computational complexity of learning, and the quality of a solution. The design of informative rewards requires domain knowledge, which is not always available. We use the properties of the dynamics to produce system-appropriate reward without adding external assumptions. Specifically, we explore an approach to utilize the Lyapunov exponents of the system dynamics to generate a system-immanent reward. We demonstrate that the ‘Sum of the Positive Lyapunov Exponents’ (SuPLE) is a strong candidate for the design of such a reward. We develop a computational framework for the derivation of this reward, and demonstrate its effectiveness on classical benchmarks for sample-based stabilization of various dynamical systems. It eliminates the need to start the training trajectories at arbitrary states, also known as auxiliary exploration. While the latter is a common practice in simulated robot learning, it is impractical in real robotic systems, since they typically start from natural rest states (such as a pendulum at the bottom or a robot on the ground) and cannot be easily initialized at arbitrary states. Comparing the performance of SuPLE to commonly used reward functions, we observe that the latter fail to find a solution without auxiliary exploration, even for the task of swinging up the double pendulum and keeping it stable at the upright position, a prototypical scenario for multi-linked robots. SuPLE-induced rewards for robot learning offer a novel route for effective robot learning in typical as opposed to highly specialized or fine-tuned scenarios. Our code is publicly available for reproducibility and further research.
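
As a rough illustration of the quantity named in the abstract, the sketch below computes the sum of positive Lyapunov exponents for a linearized, frictionless single pendulum. It relies on the fact that, for a linear time-invariant system dx/dt = A x, the Lyapunov exponents are the real parts of the eigenvalues of A. The value of g/l, the helper names, and the use of a local linearization are assumptions made here for illustration; the paper's own computational framework is not described in the abstract and is not reproduced.

```python
import cmath
from typing import Iterable, List


def lyapunov_exponents_2x2(a: float, b: float, c: float, d: float) -> List[float]:
    """Lyapunov exponents of dx/dt = [[a, b], [c, d]] x.

    For a linear time-invariant system these are the real parts of the
    eigenvalues, obtained here from the characteristic polynomial.
    """
    trace = a + d
    det = a * d - b * c
    disc = cmath.sqrt(trace * trace - 4 * det)
    eigenvalues = [(trace + disc) / 2, (trace - disc) / 2]
    return sorted(e.real for e in eigenvalues)


def suple(exponents: Iterable[float]) -> float:
    """Sum of the positive Lyapunov exponents."""
    return sum(e for e in exponents if e > 0)


G_OVER_L = 9.81  # assumed g / l for a 1 m pendulum

# State is (angle offset, angular velocity), linearized at each equilibrium.
bottom = lyapunov_exponents_2x2(0.0, 1.0, -G_OVER_L, 0.0)   # hanging at rest
upright = lyapunov_exponents_2x2(0.0, 1.0, G_OVER_L, 0.0)   # inverted

print(suple(bottom), suple(upright))
```

Under these assumptions the local value is zero at the rest state and positive at the upright equilibrium, which is at least consistent with using it as a reward signal for swing-up tasks.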

[AI-58] DSTC: Direct Preference Learning with Only Self-Generated Tests and Code to Improve Code LMs

Link: https://arxiv.org/abs/2411.13611
Authors: Zhihan Liu, Shenao Zhang, Yongfei Liu, Boyi Liu, Yingxiang Yang, Zhaoran Wang
Keywords: Direct preference learning, Direct preference, preference learning, preference learning offers, Direct Preference Optimization
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

Abstract:Direct preference learning offers a promising and computation-efficient approach beyond supervised fine-tuning (SFT) for improving code generation in coding large language models (LMs). However, the scarcity of reliable preference data is a bottleneck for the performance of direct preference learning to improve the coding accuracy of code LMs. In this paper, we introduce Direct Preference Learning with Only Self-Generated Tests and Code (DSTC), a framework that leverages only self-generated code snippets and tests to construct reliable preference pairs such that direct preference learning can improve LM coding accuracy without external annotations. DSTC combines a minimax selection process and test-code concatenation to improve preference pair quality, reducing the influence of incorrect self-generated tests and enhancing model performance without the need for costly reward models. When applied with direct preference learning methods such as Direct Preference Optimization (DPO) and Kahneman-Tversky Optimization (KTO), DSTC yields stable improvements in coding accuracy (pass@1 score) across diverse coding benchmarks, including HumanEval, MBPP, and BigCodeBench, demonstrating both its effectiveness and scalability for models of various sizes. This approach autonomously enhances code generation accuracy across LLMs of varying sizes, reducing reliance on expensive annotated coding datasets.
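
For readers unfamiliar with the pass@1 score mentioned above, the sketch below shows the commonly used unbiased pass@k estimator, of which pass@1 is the k = 1 case. Whether the paper draws several samples per problem or a single one is not stated in the abstract, so the sample counts in the usage lines are purely illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of generated samples, c: number of samples that pass all
    tests, k: number of samples the metric is allowed to draw.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Every size-k subset contains at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative counts: 10 samples for one problem, 3 of which pass.
print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
```

With k = 1 the estimator reduces to the fraction of passing samples, c / n; a benchmark score is then the average of this value over all problems.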

[AI-59] Enhancing Bidirectional Sign Language Communication: Integrating YOLOv8 and NLP for Real-Time Gesture Recognition Translation

Link: https://arxiv.org/abs/2411.13597
Authors: Hasnat Jamil Bhuiyan, Mubtasim Fuad Mozumder, Md. Rabiul Islam Khan, Md. Sabbir Ahmed, Nabuat Zaman Nahim
Keywords: American Sign Language, Sign Language, American Sign, real time, time camera footage
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract: The primary concern of this research is to capture American Sign Language (ASL) data from real-time camera footage and convert it into text. In addition, we focus on creating a framework that can convert text into sign language in real time, which can help break the language barrier for people in need. In this work, to recognise ASL, we use the You Only Look Once (YOLO) model and a Convolutional Neural Network (CNN) model. The YOLO model runs in real time and automatically extracts discriminative spatial-temporal characteristics from the raw video stream without the need for any prior knowledge, eliminating design flaws. The CNN model also runs in real time for sign language detection. We introduce a novel method for converting text-based input to sign language: a framework that takes a sentence as input, identifies keywords in that sentence, and then shows, in real time, a video in which the corresponding sign language is performed. To the best of our knowledge, this is a rare study to demonstrate bidirectional sign language communication in real time for ASL.

[AI-60] A Novel Speech Analysis and Correction Tool for Arabic-Speaking Children

Link: https://arxiv.org/abs/2411.13592
Authors: Lamia Berriche,Maha Driss,Areej Ahmed Almuntashri,Asma Mufreh Lghabi,Heba Saleh Almudhi,Munerah Abdul-Aziz Almansour
Keywords: application named ArPA, Support Vector Machine, paper introduces, named ArPA, application named
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments:

Abstract: This paper introduces a new application named ArPA for Arabic-speaking children who have trouble with pronunciation. Our application comprises two key components: the diagnostic module and the therapeutic module. The diagnostic process involves capturing the child’s speech signal, preprocessing it, and analyzing it using different machine learning classifiers such as K-Nearest Neighbors (KNN), Support Vector Machine (SVM), and Decision Trees, as well as deep neural network classifiers such as ResNet18. The therapeutic module offers eye-catching gamified interfaces in which each correctly spoken letter earns a higher avatar level, providing positive reinforcement for the child’s pronunciation improvement. Two datasets were used for experimental evaluation: one from a childcare centre and the other comprising Arabic alphabet pronunciation recordings. Our work uses a novel technique for speech recognition based on Mel-spectrogram and MFCC images. The results show that the ResNet18 classifier on speech-to-image converted data effectively identifies mispronunciations in Arabic speech with an accuracy of 99.015%, with Mel-spectrogram images outperforming MFCC images.

[AI-61] Deep learning waterways for rural infrastructure development

Link: https://arxiv.org/abs/2411.13590
Authors: Matthew Pierson,Zia Mehrabi
Keywords: Earth waterways remain, middle income countries, waterways remain unmapped, Surprisingly a number, Earth waterways
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 18 pages, 6 figures

Abstract: Surprisingly, a number of Earth’s waterways remain unmapped, with a significant number in low- and middle-income countries. Here we build a computer vision model (WaterNet) to learn the location of waterways in the United States, based on high-resolution satellite imagery and digital elevation models, and then deploy it in novel environments on the African continent. Our outputs provide detail of waterway structures hitherto unmapped. When assessed against community needs requests for rural bridge building related to access to schools, health care facilities and agricultural markets, we find these newly generated waterways capture on average 93% (country range: 88-96%) of these requests, whereas OpenStreetMap and the state-of-the-art data from TDX-Hydro capture only 36% (5-72%) and 62% (37-85%), respectively. Because these new machine-learning-enabled maps are built on public and operational data acquisition, this approach offers promise for capturing humanitarian needs and planning for social development in places where cartographic efforts have so far failed to deliver. The improved performance in identifying community needs missed by existing data suggests significant value for rural infrastructure development and better targeting of development interventions.

[AI-62] Unveiling Redundancy in Diffusion Transformers (DiTs): A Systematic Study

Link: https://arxiv.org/abs/2411.13588
Authors: Xibo Sun,Jiarui Fang,Aoyu Li,Jinzhe Pan
Keywords: real-time performance adversely, impacting real-time performance, generating higher resolutions, Diffusion Transformers, increased model capacity
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 9 pages including reference

Abstract:The increased model capacity of Diffusion Transformers (DiTs) and the demand for generating higher resolutions of images and videos have led to a significant rise in inference latency, impacting real-time performance adversely. While prior research has highlighted the presence of high similarity in activation values between adjacent diffusion steps (referred to as redundancy) and proposed various caching mechanisms to mitigate computational overhead, the exploration of redundancy in existing literature remains limited, with findings often not generalizable across different DiT models. This study aims to address this gap by conducting a comprehensive investigation into redundancy across a broad spectrum of mainstream DiT models. Our experimental analysis reveals substantial variations in the distribution of redundancy across diffusion steps among different DiT models. Interestingly, within a single model, the redundancy distribution remains stable regardless of variations in input prompts, step counts, or scheduling strategies. Given the lack of a consistent pattern across diverse models, caching strategies designed for a specific group of models may not easily transfer to others. To overcome this challenge, we introduce a tool for analyzing the redundancy of individual models, enabling subsequent research to develop tailored caching strategies for specific model architectures. The project is publicly available at this https URL.
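
As a rough illustration of what "redundancy" between adjacent diffusion steps means, the sketch below scores each consecutive pair of steps by the cosine similarity of their activations. Cosine similarity, the flat-list representation of activations, and the function names are all assumptions made for this example. The abstract does not specify the metric, and this is not the authors' tool.

```python
from math import sqrt
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def adjacent_step_redundancy(activations: Sequence[Sequence[float]]) -> List[float]:
    """Return one similarity score per pair of consecutive diffusion steps.

    activations[t] is the flattened activation of one layer at step t
    (hypothetical input; a real model would need a hook to record it).
    """
    return [
        cosine_similarity(activations[t], activations[t + 1])
        for t in range(len(activations) - 1)
    ]
```

A score close to 1 for a given pair suggests the later step could reuse a cached activation. Plotting the scores over steps gives a per-model redundancy profile of the kind the study compares across models.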

[AI-63] Exploring the Adversarial Vulnerabilities of Vision-Language-Action Models in Robotics

Link: https://arxiv.org/abs/2411.13587
Authors: Taowen Wang,Dongfang Liu,James Chenhao Liang,Wenhao Yang,Qifan Wang,Cheng Han,Jiebo Luo,Ruixiang Tang
Keywords: execute complex tasks, learning framework, VLA models offer, enabling robots, robots to execute
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recently in robotics, Vision-Language-Action (VLA) models have emerged as a transformative approach, enabling robots to execute complex tasks by integrating visual and linguistic inputs within an end-to-end learning framework. While VLA models offer significant capabilities, they also introduce new attack surfaces, making them vulnerable to adversarial attacks. With these vulnerabilities largely unexplored, this paper systematically quantifies the robustness of VLA-based robotic systems. Recognizing the unique demands of robotic execution, our attack objectives target the inherent spatial and functional characteristics of robotic systems. In particular, we introduce an untargeted position-aware attack objective that leverages spatial foundations to destabilize robotic actions, and a targeted attack objective that manipulates the robotic trajectory. Additionally, we design an adversarial patch generation approach that places a small, colorful patch within the camera’s view, effectively executing the attack in both digital and physical environments. Our evaluation reveals a marked degradation in task success rates, with up to a 100% reduction across a suite of simulated robotic tasks, highlighting critical security gaps in current VLA architectures. By unveiling these vulnerabilities and proposing actionable evaluation metrics, this work advances both the understanding and enhancement of safety for VLA-based robotic systems, underscoring the necessity for developing robust defense strategies prior to physical-world deployments.

[AI-64] Artificial Intelligence in Cybersecurity: Building Resilient Cyber Diplomacy Frameworks

Link: https://arxiv.org/abs/2411.13585
Authors: Michael Stoltz
Keywords: cyber diplomacy, artificial intelligence, cyber, diplomacy, paper explores
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract: This paper explores how automation and artificial intelligence (AI) are transforming U.S. cyber diplomacy. Leveraging these technologies helps the U.S. manage the complexity and urgency of cyber diplomacy, improving decision-making, efficiency, and security. As global interconnectivity grows, cyber diplomacy (managing national interests in the digital space) has become vital. The ability of AI and automation to quickly process vast data volumes enables timely responses to cyber threats and opportunities. This paper underscores the strategic integration of these tools to maintain U.S. competitive advantage and secure national interests. Automation enhances diplomatic communication and data processing, freeing diplomats to focus on strategic decisions. AI supports predictive analytics and real-time decision-making, offering critical insights and proactive measures during high-stakes engagements. Case studies show AI’s effectiveness in monitoring cyber activities and managing international cyber policy. Challenges such as ethical concerns, security vulnerabilities, and reliance on technology are also addressed, emphasizing human oversight and strong governance frameworks. Ensuring proper ethical guidelines and cybersecurity measures allows the U.S. to harness the benefits of automation and AI while mitigating risks. By adopting these technologies, U.S. cyber diplomacy can become more proactive and effective, navigating the evolving digital landscape with greater agility.

[AI-65] Enhanced FIWARE-Based Architecture for Cyberphysical Systems With Tiny Machine Learning and Machine Learning Operations: A Case Study on Urban Mobility Systems

Link: https://arxiv.org/abs/2411.13583
Authors: Javier Conde,Andrés Munoz-Arcentales,Álvaro Alonso,Joaquín Salvachúa,Gabriel Huecas
Keywords: Internet of Things, Things is accelerating, transformation of society, accelerating the digital, digital transformation
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Networking and Internet Architecture (cs.NI)
Comments:

Abstract:The rise of AI and the Internet of Things is accelerating the digital transformation of society. Mobility computing presents specific barriers due to its real-time requirements, decentralization, and connectivity through wireless networks. New research on edge computing and tiny machine learning (tinyML) explores the execution of AI models on low-performance devices to address these issues. However, there are not many studies proposing agnostic architectures that manage the entire lifecycle of intelligent cyberphysical systems. This article extends a previous architecture based on FIWARE software components to implement the machine learning operations flow, enabling the management of the entire tinyML lifecycle in cyberphysical systems. We also provide a use case to showcase how to implement the FIWARE architecture through a complete example of a smart traffic system. We conclude that the FIWARE ecosystem constitutes a real reference option for developing tinyML and edge computing in cyberphysical systems.

[AI-66] COOD: Concept-based Zero-shot OOD Detection

Link: https://arxiv.org/abs/2411.13578
Authors: Zhendong Liu,Yi Nian,Henry Peng Zou,Li Li,Xiyang Hu,Yue Zhao
Keywords: models effectively detect, OOD detection, effectively detect, OOD, multi-label OOD detection
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:How can models effectively detect out-of-distribution (OOD) samples in complex, multi-label settings without extensive retraining? Existing OOD detection methods struggle to capture the intricate semantic relationships and label co-occurrences inherent in multi-label settings, often requiring large amounts of training data and failing to generalize to unseen label combinations. While large language models have revolutionized zero-shot OOD detection, they primarily focus on single-label scenarios, leaving a critical gap in handling real-world tasks where samples can be associated with multiple interdependent labels. To address these challenges, we introduce COOD, a novel zero-shot multi-label OOD detection framework. COOD leverages pre-trained vision-language models, enhancing them with a concept-based label expansion strategy and a new scoring function. By enriching the semantic space with both positive and negative concepts for each label, our approach models complex label dependencies, precisely differentiating OOD samples without the need for additional training. Extensive experiments demonstrate that our method significantly outperforms existing approaches, achieving approximately 95% average AUROC on both VOC and COCO datasets, while maintaining robust performance across varying numbers of labels and different types of OOD samples.

[AI-67] Integrated Water Resource Management in the Segura Hydrographic Basin: An Artificial Intelligence Approach

Link: https://arxiv.org/abs/2411.13566
Authors: Urtzi Otamendi,Mikel Maiza,Igor G. Olaizola,Basilio Sierra,Markel Flores,Marco Quartulli
Keywords: Managing resources effectively, complex governance policies, Segura Hydrographic Basin, Artificial Intelligence algorithms, Managing resources
Categories: Artificial Intelligence (cs.AI)
Comments: 15 pages, 14 figures, 8 tables

Abstract: Managing resources effectively under uncertain demand, variable availability, and complex governance policies is a significant challenge. This paper presents a paradigmatic framework for addressing these issues in water management scenarios by integrating advanced physical modelling, remote sensing techniques, and Artificial Intelligence algorithms. The proposed approach accurately predicts water availability, estimates demand, and optimizes resource allocation on both short- and long-term bases, combining a comprehensive hydrological model, agronomic crop models for precise demand estimation, and Mixed-Integer Linear Programming for efficient resource distribution. In the case study of the Segura Hydrographic Basin, the approach successfully allocated approximately 642 million cubic meters (hm³) of water over six months, minimizing the deficit to 9.7% of the total estimated demand. The methodology demonstrated significant environmental benefits, reducing CO2 emissions while optimizing resource distribution. This robust solution supports informed decision-making processes, ensuring sustainable water management across diverse contexts. The generalizability of this approach allows its adaptation to other basins, contributing to improved governance and policy implementation on a broader scale. Ultimately, the methodology has been validated and integrated into the operational water management practices in the Segura Hydrographic Basin in Spain.

[AI-68] AMSnet-KG: A Netlist Dataset for LLM-based AMS Circuit Auto-Design Using Knowledge Graph RAG

Link: https://arxiv.org/abs/2411.13560
Authors: Yichen Shi,Zhuofu Tao,Yuhao Gao,Tianjia Zhou,Cheng Chang,Yaxing Wang,Bingyu Chen,Genhao Zhang,Alvin Liu,Zhiping Yu,Ting-Jung Lin,Lei He
Keywords: High-performance analog, AMS circuit, AMS, analog and mixed-signal, full-custom designed
Categories: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Signal Processing (eess.SP)
Comments:

Abstract:High-performance analog and mixed-signal (AMS) circuits are mainly full-custom designed, which is time-consuming and labor-intensive. A significant portion of the effort is experience-driven, which makes the automation of AMS circuit design a formidable challenge. Large language models (LLMs) have emerged as powerful tools for Electronic Design Automation (EDA) applications, fostering advancements in the automatic design process for large-scale AMS circuits. However, the absence of high-quality datasets has led to issues such as model hallucination, which undermines the robustness of automatically generated circuit designs. To address this issue, this paper introduces AMSnet-KG, a dataset encompassing various AMS circuit schematics and netlists. We construct a knowledge graph with annotations on detailed functional and performance characteristics. Facilitated by AMSnet-KG, we propose an automated AMS circuit generation framework that utilizes the comprehensive knowledge embedded in LLMs. We first formulate a design strategy (e.g., circuit architecture using a number of circuit components) based on required specifications. Next, matched circuit components are retrieved and assembled into a complete topology, and transistor sizing is obtained through Bayesian optimization. Simulation results of the netlist are fed back to the LLM for further topology refinement, ensuring the circuit design specifications are met. We perform case studies of operational amplifier and comparator design to verify the automatic design flow from specifications to netlists with minimal human effort. The dataset used in this paper will be open-sourced upon publishing of this paper.

[AI-69] DrugGen: Advancing Drug Discovery with Large Language Models and Reinforcement Learning Feedback

Link: https://arxiv.org/abs/2411.14157
Authors: Mahsa Sheikholeslami,Navid Mazrouei,Yousof Gheisari,Afshin Fasihi,Matin Irajpour,Ali Motahharynia
Keywords: Traditional drug design, design faces significant, high failure rates, significant challenges due, faces significant challenges
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
Comments: 20 pages, 5 figures, 3 tables, and 7 supplementary files. To use the model, see this https URL

Abstract:Traditional drug design faces significant challenges due to inherent chemical and biological complexities, often resulting in high failure rates in clinical trials. Deep learning advancements, particularly generative models, offer potential solutions to these challenges. One promising algorithm is DrugGPT, a transformer-based model, that generates small molecules for input protein sequences. Although promising, it generates both chemically valid and invalid structures and does not incorporate the features of approved drugs, resulting in time-consuming and inefficient drug discovery. To address these issues, we introduce DrugGen, an enhanced model based on the DrugGPT structure. DrugGen is fine-tuned on approved drug-target interactions and optimized with proximal policy optimization. By giving reward feedback from protein-ligand binding affinity prediction using pre-trained transformers (PLAPT) and a customized invalid structure assessor, DrugGen significantly improves performance. Evaluation across multiple targets demonstrated that DrugGen achieves 100% valid structure generation compared to 95.5% with DrugGPT and produced molecules with higher predicted binding affinities (7.22 [6.30-8.07]) compared to DrugGPT (5.81 [4.97-6.63]) while maintaining diversity and novelty. Docking simulations further validate its ability to generate molecules targeting binding sites effectively. For example, in the case of fatty acid-binding protein 5 (FABP5), DrugGen generated molecules with superior docking scores (FABP5/11, -9.537 and FABP5/5, -8.399) compared to the reference molecule (Palmitic acid, -6.177). Beyond lead compound generation, DrugGen also shows potential for drug repositioning and creating novel pharmacophores for existing targets. By producing high-quality small molecules, DrugGen provides a high-performance medium for advancing pharmaceutical research and drug discovery.

[AI-70] Assessing data-driven predictions of band gap and electrical conductivity for transparent conducting materials

Link: https://arxiv.org/abs/2411.14034
Authors: Federico Ottomano,John Y. Goulermas,Vladimir Gusev,Rahul Savani,Michael W. Gaultois,Troy D. Manning,Hai Lin,Teresa P. Manzanera,Emmeline G. Poole,Matthew S. Dyer,John B. Claridge,Jon Alaria,Luke M. Daniels,Su Varma,David Rimmer,Kevin Sanderson,Matthew J. Rosseinsky
Keywords: Machine Learning, offered innovative perspectives, leveraging the increasing, offered innovative, innovative perspectives
Categories: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
Comments:

Abstract:Machine Learning (ML) has offered innovative perspectives for accelerating the discovery of new functional materials, leveraging the increasing availability of material databases. Despite the promising advances, data-driven methods face constraints imposed by the quantity and quality of available data. Moreover, ML is often employed in tandem with simulated datasets originating from density functional theory (DFT), and assessed through in-sample evaluation schemes. This scenario raises questions about the practical utility of ML in uncovering new and significant material classes for industrial applications. Here, we propose a data-driven framework aimed at accelerating the discovery of new transparent conducting materials (TCMs), an important category of semiconductors with a wide range of applications. To mitigate the shortage of available data, we create and validate unique experimental databases, comprising several examples of existing TCMs. We assess state-of-the-art (SOTA) ML models for property prediction from the stoichiometry alone. We propose a bespoke evaluation scheme to provide empirical evidence on the ability of ML to uncover new, previously unseen materials of interest. We test our approach on a list of 55 compositions containing typical elements of known TCMs. Although our study indicates that ML tends to identify new TCMs compositionally similar to those in the training data, we empirically demonstrate that it can highlight material candidates that may have been previously overlooked, offering a systematic approach to identify materials that are likely to display TCMs characteristics.

[AI-71] AmpliNetECG12: A lightweight SoftMax-based relativistic amplitude amplification architecture for 12 lead ECG classification

Link: https://arxiv.org/abs/2411.13903
Authors: Shreya Srivastava
Keywords: complex electrical activity, restricted computational power, Electrocardiograms using limited, promptly detect cardiac, portable devices
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract: The urgent need to promptly detect cardiac disorders from 12-lead electrocardiograms (ECGs) using limited computation is motivated by the heart’s fast and complex electrical activity and the restricted computational power of portable devices. Timely and precise diagnoses are crucial since delays might significantly impact patient health outcomes. This research presents a novel deep-learning architecture that aims to diagnose heart abnormalities quickly and accurately. We devised a new activation function called aSoftMax, designed to improve the visibility of ECG deflections. The proposed activation function is used with a Convolutional Neural Network architecture that includes kernel weight sharing across the ECG’s various leads. This method generalizes the global 12-lead ECG features and minimizes the model’s complexity by decreasing the number of trainable parameters. Combining aSoftMax with the enhanced CNN architecture yields AmpliNetECG12, which achieves an accuracy of 84% in diagnosing cardiac disorders. AmpliNetECG12 shows outstanding prediction ability when used with the CPSC2018 dataset for arrhythmia classification. The model attains an F1-score of 80.71% and a ROC-AUC score of 96.00% with 280,000 trainable parameters, which signifies the lightweight yet efficient nature of AmpliNetECG12. The stochastic characteristics of aSoftMax, a fundamental element of AmpliNetECG12, improve prediction accuracy and also increase the model’s interpretability. This feature enhances comprehension of important ECG segments in different forms of arrhythmias, establishing a new standard of explainable architecture for cardiac disorder classification.

[AI-72] SimPhony: A Device-Circuit-Architecture Cross-Layer Modeling and Simulation Framework for Heterogeneous Electronic-Photonic AI System

Link: https://arxiv.org/abs/2411.13715
Authors: Ziang Yin,Meng Zhang,Amir Begovic,Rena Huang,Jeff Zhang,Jiaqi Gu
Keywords: Electronic-photonic integrated circuits, require interdisciplinary advances, offer transformative potential, integrated circuits, transformative potential
Categories: Optics (physics.optics); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments: 7-page

Abstract:Electronic-photonic integrated circuits (EPICs) offer transformative potential for next-generation high-performance AI but require interdisciplinary advances across devices, circuits, architecture, and design automation. The complexity of hybrid systems makes it challenging even for domain experts to understand distinct behaviors and interactions across design stack. The lack of a flexible, accurate, fast, and easy-to-use EPIC AI system simulation framework significantly limits the exploration of hardware innovations and system evaluations on common benchmarks. To address this gap, we propose SimPhony, a cross-layer modeling and simulation framework for heterogeneous electronic-photonic AI systems. SimPhony offers a platform that enables (1) generic, extensible hardware topology representation that supports heterogeneous multi-core architectures with diverse photonic tensor core designs; (2) optics-specific dataflow modeling with unique multi-dimensional parallelism and reuse beyond spatial/temporal dimensions; (3) data-aware energy modeling with realistic device responses, layout-aware area estimation, link budget analysis, and bandwidth-adaptive memory modeling; and (4) seamless integration with model training framework for hardware/software co-simulation. By providing a unified, versatile, and high-fidelity simulation platform, SimPhony enables researchers to innovate and evaluate EPIC AI hardware across multiple domains, facilitating the next leap in emerging AI hardware. We open-source our codes at this https URL

[AI-73] Integrating Dynamic Correlation Shifts and Weighted Benchmarking in Extreme Value Analysis

Link: https://arxiv.org/abs/2411.13608
Authors: Dimitrios P. Panagoulias,Elissaios Sarmas,Vangelis Marinakis,Maria Virvou,George A. Tsihrintzis
Keywords: Dynamic Benchmarking Method, Benchmarking Method, Dynamic Identification, EVDBM integrates extreme, Extreme
Categories: Applications (stat.AP); Artificial Intelligence (cs.AI)
Comments: 33 pages, 8 figures

Abstract:This paper presents an innovative approach to Extreme Value Analysis (EVA) by introducing the Extreme Value Dynamic Benchmarking Method (EVDBM). EVDBM integrates extreme value theory to detect extreme events and is coupled with the novel Dynamic Identification of Significant Correlation (DISC)-Thresholding algorithm, which enhances the analysis of key variables under extreme conditions. By integrating return values predicted through EVA into the benchmarking scores, we are able to transform these scores to reflect anticipated conditions more accurately. This provides a more precise picture of how each case is projected to unfold under extreme conditions. As a result, the adjusted scores offer a forward-looking perspective, highlighting potential vulnerabilities and resilience factors for each case in a way that static historical data alone cannot capture. By incorporating both historical and probabilistic elements, the EVDBM algorithm provides a comprehensive benchmarking framework that is adaptable to a range of scenarios and contexts. The methodology is applied to real PV data, revealing critical low - production scenarios and significant correlations between variables, which aid in risk management, infrastructure design, and long-term planning, while also allowing for the comparison of different production plants. The flexibility of EVDBM suggests its potential for broader applications in other sectors where decision-making sensitivity is crucial, offering valuable insights to improve outcomes.

[AI-74] A Full-History Network Dataset for BTC Asset Decentralization Profiling

Link: https://arxiv.org/abs/2411.13603
Authors: Ling Cheng,Qian Shao,Fengzhu Zeng,Feida Zhu
Keywords: garnered increasing attention, academia and industry, garnered increasing, increasing attention, BTC asset decentralization
Categories: Statistical Finance (q-fin.ST); Artificial Intelligence (cs.AI)
Comments: IEEE BigData 2024

Abstract: Since its advent in 2009, Bitcoin (BTC) has garnered increasing attention from both academia and industry. However, due to the massive transaction volume, no systematic study has quantitatively measured the asset decentralization degree specifically from a network perspective. In this paper, by conducting a thorough analysis of the BTC transaction network, we first address the significant gap in the availability of a full-history BTC graph and network property dataset, which spans over 15 years from the genesis block (1st March, 2009) to the 845651st block (29 May 2024). We then present the first systematic investigation to profile BTC’s asset decentralization and design several decentralization degrees for quantification. Through extensive experiments, we emphasize the significant role of network properties and our network-based decentralization degree in enhancing Bitcoin analysis. Our findings demonstrate the importance of our comprehensive dataset and analysis in advancing research on Bitcoin’s transaction dynamics and decentralization, providing valuable insights into the network’s structure and its implications.

[AI-75] Large-scale cross-modality pretrained model enhances cardiovascular state estimation and cardiomyopathy detection from electrocardiograms: An AI system development and multi-center validation study

Link: https://arxiv.org/abs/2411.13602
Authors: Zhengyao Ding,Yujian Hu,Youyao Xu,Chengchen Zhao,Ziyu Li,Yiheng Mao,Haitao Li,Qian Li,Jing Wang,Yue Chen,Mengjia Chen,Longbo Wang,Xuesen Chu,Weichao Pan,Ziyi Liu,Fei Wu,Hongkun Zhang,Ting Chen,Zhengxing Huang
Keywords: present significant challenges, Cardiovascular diseases, present significant, significant challenges, ECG
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages, 8 figures

Abstract:Cardiovascular diseases (CVDs) present significant challenges for early and accurate diagnosis. While cardiac magnetic resonance imaging (CMR) is the gold standard for assessing cardiac function and diagnosing CVDs, its high cost and technical complexity limit accessibility. In contrast, electrocardiography (ECG) offers promise for large-scale early screening. This study introduces CardiacNets, an innovative model that enhances ECG analysis by leveraging the diagnostic strengths of CMR through cross-modal contrastive learning and generative pretraining. CardiacNets serves two primary functions: (1) it evaluates detailed cardiac function indicators and screens for potential CVDs, including coronary artery disease, cardiomyopathy, pericarditis, heart failure and pulmonary hypertension, using ECG input; and (2) it enhances interpretability by generating high-quality CMR images from ECG data. We train and validate the proposed CardiacNets on two large-scale public datasets (the UK Biobank with 41,519 individuals and the MIMIC-IV-ECG comprising 501,172 samples) as well as three private datasets (FAHZU with 410 individuals, SAHZU with 464 individuals, and QPH with 338 individuals), and the findings demonstrate that CardiacNets consistently outperforms traditional ECG-only models, substantially improving screening accuracy. Furthermore, the generated CMR images provide valuable diagnostic support for physicians of all experience levels. This proof-of-concept study highlights how ECG can facilitate cross-modal insights into cardiac function assessment, paving the way for enhanced CVD screening and diagnosis at a population level.

[AI-76] Can ChatGPT Overcome Behavioral Biases in the Financial Sector? Classify-and-Rethink: Multi-Step Zero-Shot Reasoning in the Gold Investment

Link: https://arxiv.org/abs/2411.13599
Authors: Shuoling Liu,Gaoguo Jia,Yuhang Jiang,Liyuan Chen,Qiang Yang
Keywords: Large Language Models, Large Language, Language Models, displaying exceptional capabilities, displaying exceptional
Categories: Statistical Finance (q-fin.ST); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) have achieved remarkable success recently, displaying exceptional capabilities in creating understandable and organized text. These LLMs have been utilized in diverse fields, such as clinical research, where domain-specific models like Med-Palm have achieved human-level performance. Recently, researchers have employed advanced prompt engineering to enhance the general reasoning ability of LLMs. Despite the remarkable success of zero-shot Chain-of-Thoughts (CoT) in solving general reasoning tasks, the potential of these methods has received limited attention in financial reasoning. To address this issue, we explore multiple prompt strategies and incorporate semantic news information to improve LLMs’ performance on financial reasoning. To the best of our knowledge, we are the first to explore this important issue by applying ChatGPT to gold investment. In this work, our aim is to investigate the financial reasoning capabilities of LLMs and their capacity to generate logical and persuasive investment opinions. We will use ChatGPT, one of the most powerful recent LLMs, and prompt engineering to achieve this goal. Our research will focus on understanding the ability of LLMs in sophisticated analysis and reasoning within the context of investment decision-making. Our study finds that ChatGPT with CoT prompt can provide more explainable predictions and overcome behavioral biases, which is crucial in finance-related tasks and can achieve higher investment returns.

[AI-77] Advance Detection Of Bull And Bear Phases In Cryptocurrency Markets

Link: https://arxiv.org/abs/2411.13586
Authors: Rahul Arulkumaran,Suyash Kumar,Shikha Tomar,Manideep Gongalla,Harshitha
Keywords: highly volatile financial, volatile financial instruments, retail investors joining, Day Moving Averages, highly volatile
Categories: Statistical Finance (q-fin.ST); Artificial Intelligence (cs.AI)
Comments:

Abstract:Cryptocurrencies are highly volatile financial instruments, with more new retail investors joining the scene each day. Bitcoin has consistently indicated the direction in which the rest of the cryptocurrency market is headed. As of today, Bitcoin has a market dominance of close to 50 percent. Bull and bear phases in cryptocurrencies are determined based on the performance of Bitcoin over the 50 Day and 200 Day Moving Averages. The aim of this paper is to foretell the performance of Bitcoin in the near future by employing predictive algorithms. This predicted data will then be used to calculate the 50 Day and 200 Day Moving Averages and subsequently plotted to establish the potential bull and bear phases.
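The moving-average comparison described in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact rule, so the crossover convention used here (short average above long average means "bull", otherwise "bear") is an assumption, and the function names `simple_moving_average` and `label_phases` are hypothetical.

```python
def simple_moving_average(prices, window):
    """Return one value per price; None until `window` prices are available."""
    averages = []
    running_sum = 0.0
    for i, price in enumerate(prices):
        running_sum += price
        if i >= window:
            # Drop the price that just left the window.
            running_sum -= prices[i - window]
        averages.append(running_sum / window if i >= window - 1 else None)
    return averages


def label_phases(prices, short_window=50, long_window=200):
    """Label each day 'bull' or 'bear' (assumed crossover rule), or None."""
    short_avg = simple_moving_average(prices, short_window)
    long_avg = simple_moving_average(prices, long_window)
    labels = []
    for s, l in zip(short_avg, long_avg):
        if s is None or l is None:
            labels.append(None)      # not enough history yet
        elif s > l:
            labels.append("bull")    # assumption: short MA above long MA
        else:
            labels.append("bear")
    return labels
```

In the paper's setting the input series would include predicted prices; here any list of daily closing prices works, and the first `long_window - 1` days are left unlabeled.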

[AI-78] The Role of AI in Financial Forecasting: ChatGPT's Potential and Challenges

Link: https://arxiv.org/abs/2411.13562
Authors: Shuochen Bi,Tingting Deng,Jue Xiao
Keywords: financial sector, financial, artificial intelligence, Internet of Things, financial sector highlight
Categories: Statistical Finance (q-fin.ST); Artificial Intelligence (cs.AI)
Comments: 7 pages, 4 figures, 3 tables

Abstract:This paper examines the outlook for artificial intelligence (AI) in the financial sector, especially in financial forecasting, along with its challenges and implications. The dynamics of AI technology, including deep learning, reinforcement learning, and integration with blockchain and the Internet of Things, highlight the continued improvement in data processing capabilities. The paper explores how AI is reshaping financial services with tailored offerings that more precisely meet the diverse needs of individual investors. It discusses the regulatory and ethical challenges that AI integration raises in the financial sector, as well as the implications for data privacy protection. It analyzes the limitations of current AI technology in financial forecasting and its potential impact on the future financial industry landscape, including changes in the job market, the emergence of new financial institutions, and user interface innovations. It emphasizes the importance of increasing investor understanding and awareness of AI, and looks ahead to future trends in AI tools for user experience that could drive wider adoption of AI in financial decision making. The huge potential, challenges, and future directions of AI in the financial sector highlight the critical role of AI technology in driving transformation and innovation in the sector.

[AI-79] Composing Ensembles of Instrument-Model Pairs for Optimizing Profitability in Algorithmic Trading

Link: https://arxiv.org/abs/2411.13559
Authors: Sahand Hassanizorgabad
Keywords: maximize their Return, nonlinear with complexity, buyers and sellers, types of assets, assets are traded
Categories: Trading and Market Microstructure (q-fin.TR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Financial markets are nonlinear and complex, with different types of assets traded between buyers and sellers, each aiming to maximize their Return on Investment (ROI). Forecasting market trends is a challenging task since various factors like stock-specific news, company profiles, public sentiments, and global economic conditions influence them. This paper describes a daily price directional predictive system for financial instruments, addressing the difficulty of predicting short-term price movements. It introduces a novel trading system methodology by proposing a two-layer Composing Ensembles architecture, optimized through grid search, to predict whether the price will rise or fall the next day. This strategy was back-tested on a wide range of financial instruments and time frames, demonstrating an improvement of 20% over a benchmark representing a standard investment strategy.

[AI-80] Cerebrovascular Segmentation via Vessel Oriented Filtering Network

Link: https://arxiv.org/abs/2210.08868
Authors: Zhanqiang Guo,Yao Luan,Jianjiang Feng,Wangsheng Lu,Yin Yin,Guangming Yang,Jie Zhou
Keywords: Magnetic Resonance Angiography, Computed Tomography Angiography, Resonance Angiography, Tomography Angiography, Magnetic Resonance
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Accurate cerebrovascular segmentation from Magnetic Resonance Angiography (MRA) and Computed Tomography Angiography (CTA) is of great significance in the diagnosis and treatment of cerebrovascular pathology. Due to the complexity and topological variability of blood vessels, complete and accurate segmentation of the vascular network is still a challenge. In this paper, we propose a Vessel Oriented Filtering Network (VOF-Net) which embeds domain knowledge into the convolutional neural network. We design oriented filters for blood vessels according to the vessel orientation field, which is obtained by an orientation estimation network. Features extracted by oriented filtering are injected into the segmentation network, so as to make use of the prior information that blood vessels are slender and curved tubular structures. Experimental results on datasets of CTA and MRA show that the proposed method is effective for vessel segmentation, and embedding the specific vascular filter improves the segmentation performance.

Computer Vision

[CV-0] Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models

Link: https://arxiv.org/abs/2411.14432
Authors: Yuhao Dong,Zuyan Liu,Hai-Long Sun,Jingkang Yang,Winston Hu,Yongming Rao,Ziwei Liu
Keywords: Large Language Models, Large Language, reasoning, multi-modal large language, Language Models
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Large Language Models (LLMs) demonstrate enhanced capabilities and reliability by reasoning more, evolving from Chain-of-Thought prompting to product-level solutions like OpenAI o1. Despite various efforts to improve LLM reasoning, high-quality long-chain reasoning data and optimized training pipelines remain inadequately explored in vision-language tasks. In this paper, we present Insight-V, an early effort to 1) scalably produce long and robust reasoning data for complex multi-modal tasks, and 2) build an effective training pipeline to enhance the reasoning capabilities of multi-modal large language models (MLLMs). Specifically, to create long and structured reasoning data without human labor, we design a two-step pipeline with a progressive strategy to generate sufficiently long and diverse reasoning paths and a multi-granularity assessment method to ensure data quality. We observe that directly supervising MLLMs with such long and complex reasoning data will not yield ideal reasoning ability. To tackle this problem, we design a multi-agent system consisting of a reasoning agent dedicated to performing long-chain reasoning and a summary agent trained to judge and summarize reasoning results. We further incorporate an iterative DPO algorithm to enhance the reasoning agent’s generation stability and quality. Based on the popular LLaVA-NeXT model and our stronger base MLLM, we demonstrate significant performance gains across challenging multi-modal benchmarks requiring visual reasoning. Benefiting from our multi-agent system, Insight-V can also easily maintain or improve performance on perception-focused multi-modal tasks.

[CV-1] Stable Flow: Vital Layers for Training-Free Image Editing

Link: https://arxiv.org/abs/2411.14430
Authors: Omri Avrahami,Or Patashnik,Ohad Fried,Egor Nemchinov,Kfir Aberman,Dani Lischinski,Daniel Cohen-Or
Keywords: Diffusion Transformer, revolutionized the field, field of content, Diffusion models, Diffusion
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: Project page is available at this https URL

Abstract:Diffusion models have revolutionized the field of content synthesis and editing. Recent models have replaced the traditional UNet architecture with the Diffusion Transformer (DiT), and employed flow-matching for improved training and sampling. However, they exhibit limited generation diversity. In this work, we leverage this limitation to perform consistent image edits via selective injection of attention features. The main challenge is that, unlike the UNet-based models, DiT lacks a coarse-to-fine synthesis structure, making it unclear in which layers to perform the injection. Therefore, we propose an automatic method to identify “vital layers” within DiT, crucial for image formation, and demonstrate how these layers facilitate a range of controlled stable edits, from non-rigid modifications to object addition, using the same mechanism. Next, to enable real-image editing, we introduce an improved image inversion method for flow models. Finally, we evaluate our approach through qualitative and quantitative comparisons, along with a user study, and demonstrate its effectiveness across multiple applications. The project page is available at this https URL

[CV-2] Unleashing the Potential of Multi-modal Foundation Models and Video Diffusion for 4D Dynamic Physical Scene Simulation

Link: https://arxiv.org/abs/2411.14423
Authors: Zhuoman Liu,Weicai Ye,Yan Luximon,Pengfei Wan,Di Zhang
Keywords: requires accurately capturing, accurately capturing diverse, modeling complex object, diverse material properties, capturing diverse material
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Homepage: this https URL

Abstract:Realistic simulation of dynamic scenes requires accurately capturing diverse material properties and modeling complex object interactions grounded in physical principles. However, existing methods are constrained to basic material types with limited predictable parameters, making them insufficient to represent the complexity of real-world materials. We introduce a novel approach that leverages multi-modal foundation models and video diffusion to achieve enhanced 4D dynamic scene simulation. Our method utilizes multi-modal models to identify material types and initialize material parameters through image queries, while simultaneously inferring 3D Gaussian splats for detailed scene representation. We further refine these material parameters using video diffusion with a differentiable Material Point Method (MPM) and optical flow guidance rather than render loss or Score Distillation Sampling (SDS) loss. This integrated framework enables accurate prediction and realistic simulation of dynamic interactions in real-world scenarios, advancing both accuracy and flexibility in physics-based simulations.

[CV-3] Multimodal Autoregressive Pre-training of Large Vision Encoders

Link: https://arxiv.org/abs/2411.14402
Authors: Enrico Fini,Mustafa Shukor,Xiujun Li,Philipp Dufter,Michal Klein,David Haldimann,Sai Aitharaju,Victor Guilherme Turrisi da Costa,Louis Béthune,Zhe Gan,Alexander T Toshev,Marcin Eichner,Moin Nabi,Yinfei Yang,Joshua M. Susskind,Alaaeldin El-Nouby
Keywords: large-scale vision encoders, vision, large-scale vision, pre-training, vision encoders
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: this https URL

Abstract:We introduce a novel method for pre-training of large-scale vision encoders. Building on recent advancements in autoregressive pre-training of vision models, we extend this framework to a multimodal setting, i.e., images and text. In this paper, we present AIMV2, a family of generalist vision encoders characterized by a straightforward pre-training process, scalability, and remarkable performance across a range of downstream tasks. This is achieved by pairing the vision encoder with a multimodal decoder that autoregressively generates raw image patches and text tokens. Our encoders excel not only in multimodal evaluations but also in vision benchmarks such as localization, grounding, and classification. Notably, our AIMV2-3B encoder achieves 89.5% accuracy on ImageNet-1k with a frozen trunk. Furthermore, AIMV2 consistently outperforms state-of-the-art contrastive models (e.g., CLIP, SigLIP) in multimodal image understanding across diverse settings.

[CV-4] Beyond Training: Dynamic Token Merging for Zero-Shot Video Understanding

Link: https://arxiv.org/abs/2411.14401
Authors: Yiming Zhang,Zhuokai Zhao,Zhaorun Chen,Zenghui Ding,Xianjun Yang,Yining Sun
Keywords: large language models, multimodal large language, Recent advancements, language models, advancements in multimodal
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Recent advancements in multimodal large language models (MLLMs) have opened new avenues for video understanding. However, achieving high fidelity in zero-shot video tasks remains challenging. Traditional video processing methods rely heavily on fine-tuning to capture nuanced spatial-temporal details, which incurs significant data and computation costs. In contrast, training-free approaches, though efficient, often lack robustness in preserving context-rich features across complex video content. To this end, we propose DYTO, a novel dynamic token merging framework for zero-shot video understanding that adaptively optimizes token efficiency while preserving crucial scene details. DYTO integrates a hierarchical frame selection and a bipartite token merging strategy to dynamically cluster key frames and selectively compress token sequences, striking a balance between computational efficiency and semantic richness. Extensive experiments across multiple benchmarks demonstrate the effectiveness of DYTO, achieving superior performance compared to both fine-tuned and training-free methods and setting a new state-of-the-art for zero-shot video understanding.
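To make the idea of bipartite token merging concrete, here is a minimal sketch of a generic bipartite matching-and-averaging step. It is not DYTO's actual algorithm, which the abstract does not specify: the alternating split into two sets, the cosine-similarity criterion, and the names `cosine` and `bipartite_merge` are all assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def bipartite_merge(tokens, r):
    """Merge up to r tokens from set A into their best matches in set B."""
    set_a, set_b = tokens[0::2], tokens[1::2]   # assumed alternating split
    if not set_b:
        return [list(t) for t in tokens]
    # For every token in A, find its most similar token in B.
    matches = []
    for i, u in enumerate(set_a):
        scores = [cosine(u, v) for v in set_b]
        j = max(range(len(set_b)), key=scores.__getitem__)
        matches.append((scores[j], i, j))
    matches.sort(reverse=True)                  # most similar pairs first
    to_merge = matches[:max(r, 0)]
    merged_a = {i for _, i, _ in to_merge}
    # Average each merged A token into its matched B token.
    sums = [list(v) for v in set_b]
    counts = [1] * len(set_b)
    for _, i, j in to_merge:
        sums[j] = [s + x for s, x in zip(sums[j], set_a[i])]
        counts[j] += 1
    merged_b = [[s / c for s in vec] for vec, c in zip(sums, counts)]
    kept_a = [list(set_a[i]) for i in range(len(set_a)) if i not in merged_a]
    return kept_a + merged_b
```

The output holds `len(tokens) - r` tokens when `r` does not exceed the size of set A. This sketch does not preserve the original token order, which a real pipeline would likely need to track.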

[CV-5] Baking Gaussian Splatting into Diffusion Denoiser for Fast and Scalable Single-stage Image-to-3D Generation

Link: https://arxiv.org/abs/2411.14384
Authors: Yuanhao Cai,He Zhang,Kai Zhang,Yixun Liang,Mengwei Ren,Fujun Luan,Qing Liu,Soo Ye Kim,Jianming Zhang,Zhifei Zhang,Yuqian Zhou,Zhe Lin,Alan Yuille
Keywords: Existing feed-forward, multi-view diffusion models, multi-view diffusion, Existing, object-centric prompt images
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: A novel one-stage 3DGS-based diffusion generates objects and scenes from a single view in ~6 seconds

Abstract:Existing feed-forward image-to-3D methods mainly rely on 2D multi-view diffusion models that cannot guarantee 3D consistency. These methods easily collapse when changing the prompt view direction and mainly handle object-centric prompt images. In this paper, we propose a novel single-stage 3D diffusion model, DiffusionGS, for object and scene generation from a single view. DiffusionGS directly outputs 3D Gaussian point clouds at each timestep to enforce view consistency and allow the model to generate robustly given prompt views from any direction, beyond object-centric inputs. In addition, to improve the capability and generalization ability of DiffusionGS, we scale up 3D training data by developing a scene-object mixed training strategy. Experiments show that our method achieves better generation quality (2.20 dB higher in PSNR and 23.25 lower in FID) and over 5x faster speed (~6s on an A100 GPU) than SOTA methods. The user study and text-to-3D applications also reveal the practical value of our method. Our project page at this https URL shows the video and interactive generation results.

[CV-6] InCrowd-VI: A Realistic Visual-Inertial Dataset for Evaluating SLAM in Indoor Pedestrian-Rich Spaces for Human Navigation

Link: https://arxiv.org/abs/2411.14358
Authors: Marziyeh Bamdad,Hans-Peter Hutter,Alireza Darvishy
Keywords: Simultaneous localization, robust SLAM solutions, Meta Aria Project, localization and mapping, development of robust
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 7 figures, 5 tables

Abstract:Simultaneous localization and mapping (SLAM) techniques can be used to help visually impaired people navigate, but the development of robust SLAM solutions for crowded spaces is limited by the lack of realistic datasets. To address this, we introduce InCrowd-VI, a novel visual-inertial dataset specifically designed for human navigation in indoor pedestrian-rich environments. Recorded using Meta Aria Project glasses, it captures realistic scenarios without environmental control. InCrowd-VI features 58 sequences totaling a 5 km trajectory length and 1.5 hours of recording time, including RGB, stereo images, and IMU measurements. The dataset captures important challenges such as pedestrian occlusions, varying crowd densities, complex layouts, and lighting changes. Ground-truth trajectories, accurate to approximately 2 cm, are provided in the dataset, originating from the Meta Aria project machine perception SLAM service. In addition, a semi-dense 3D point cloud of scenes is provided for each sequence. The evaluation of state-of-the-art visual odometry (VO) and SLAM algorithms on InCrowd-VI revealed severe performance limitations in these realistic scenarios, demonstrating the need for and value of the new dataset to advance SLAM research for visually impaired navigation in complex indoor environments.

[CV-7] DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding

Link: https://arxiv.org/abs/2411.14347
Authors: Tianhe Ren,Yihao Chen,Qing Jiang,Zhaoyang Zeng,Yuda Xiong,Wenlong Liu,Zhengyu Ma,Junyi Shen,Yuan Gao,Xiaoke Jiang,Xingyu Chen,Zhuheng Song,Yuhong Zhang,Hongjie Huang,Han Gao,Shilong Liu,Hao Zhang,Feng Li,Kent Yu,Lei Zhang
Keywords: IDEA Research, unified object-centric vision, developed by IDEA, object-centric vision model, vision model developed
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical Report

Abstract:In this paper, we introduce DINO-X, which is a unified object-centric vision model developed by IDEA Research with the best open-world object detection performance to date. DINO-X employs the same Transformer-based encoder-decoder architecture as Grounding DINO 1.5 to pursue an object-level representation for open-world object understanding. To make long-tailed object detection easy, DINO-X extends its input options to support text prompt, visual prompt, and customized prompt. With such flexible prompt options, we develop a universal object prompt to support prompt-free open-world detection, making it possible to detect anything in an image without requiring users to provide any prompt. To enhance the model’s core grounding capability, we have constructed a large-scale dataset with over 100 million high-quality grounding samples, referred to as Grounding-100M, for advancing the model’s open-vocabulary detection performance. Pre-training on such a large-scale grounding dataset leads to a foundational object-level representation, which enables DINO-X to integrate multiple perception heads to simultaneously support multiple object perception and understanding tasks, including detection, segmentation, pose estimation, object captioning, object-based QA, etc. Experimental results demonstrate the superior performance of DINO-X. Specifically, the DINO-X Pro model achieves 56.0 AP, 59.8 AP, and 52.4 AP on the COCO, LVIS-minival, and LVIS-val zero-shot object detection benchmarks, respectively. Notably, it scores 63.3 AP and 56.5 AP on the rare classes of LVIS-minival and LVIS-val benchmarks, both improving the previous SOTA performance by 5.8 AP. Such a result underscores its significantly improved capacity for recognizing long-tailed objects.

[CV-8] Layer Pruning with Consensus: A Triple-Win Solution

Link: https://arxiv.org/abs/2411.14345
Authors: Leandro Giusti Mugnaini,Carolina Tavares Duarte,Anna H. Reali Costa,Artur Jordao
Keywords: effectively reducing computational, reducing computational costs, standard structured pruning, Layer pruning offers, effectively reducing
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Layer pruning offers a promising alternative to standard structured pruning, effectively reducing computational costs, latency, and memory footprint. While notable layer-pruning approaches aim to detect unimportant layers for removal, they often rely on single criteria that may not fully capture the complex, underlying properties of layers. We propose a novel approach that combines multiple similarity metrics into a single expressive measure of low-importance layers, called the Consensus criterion. Our technique delivers a triple-win solution: low accuracy drop, high-performance improvement, and increased robustness to adversarial attacks. With up to 78.80% FLOPs reduction and performance on par with state-of-the-art methods across different benchmarks, our approach reduces energy consumption and carbon emissions by up to 66.99% and 68.75%, respectively. Additionally, it avoids shortcut learning and improves robustness by up to 4 percentage points under various adversarial attacks. Overall, the Consensus criterion demonstrates its effectiveness in creating robust, efficient, and environmentally friendly pruned models.

[CV-9] SplatR: Experience Goal Visual Rearrangement with 3D Gaussian Splatting and Dense Feature Matching

Link: https://arxiv.org/abs/2411.14322
Authors: Arjun P S,Andrew Melnik,Gora Chand Nandi
Keywords: Experience Goal Visual, Goal Visual Rearrangement, Visual Rearrangement task, robust world model, Rearrangement task stands
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The Experience Goal Visual Rearrangement task stands as a foundational challenge within Embodied AI, requiring an agent to construct a robust world model that accurately captures the goal state. The agent uses this world model to restore a shuffled scene to its original configuration, making an accurate representation of the world essential for successfully completing the task. In this work, we present a novel framework that leverages 3D Gaussian Splatting as a 3D scene representation for the experience goal visual rearrangement task. Recent advances in volumetric scene representation, like 3D Gaussian Splatting, offer fast rendering of high-quality and photo-realistic novel views. Our approach gives the agent consistent views of the current and the goal settings of the rearrangement task, which enables it to directly compare the goal state and the shuffled state of the world in image space. To compare these views, we propose to use a dense feature matching method with visual features extracted from a foundation model, leveraging its more universal feature representation, which facilitates robustness and generalization. We validate our approach on the AI2-THOR rearrangement challenge benchmark and demonstrate improvements over the current state-of-the-art methods.

[CV-10] StereoCrafter-Zero: Zero-Shot Stereo Video Generation with Noisy Restart

Link: https://arxiv.org/abs/2411.14295
Authors: Jian Shi,Qian Wang,Zhenyu Li,Peter Wonka
Keywords: Generating high-quality stereo, high-quality stereo videos, mimic human binocular, human binocular vision, binocular vision requires
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Generating high-quality stereo videos that mimic human binocular vision requires maintaining consistent depth perception and temporal coherence across frames. While diffusion models have advanced image and video synthesis, generating high-quality stereo videos remains challenging due to the difficulty of maintaining consistent temporal and spatial coherence between left and right views. We introduce StereoCrafter-Zero, a novel framework for zero-shot stereo video generation that leverages video diffusion priors without the need for paired training data. Key innovations include a noisy restart strategy to initialize stereo-aware latents and an iterative refinement process that progressively harmonizes the latent space, addressing issues like temporal flickering and view inconsistencies. Comprehensive evaluations, including quantitative metrics and user studies, demonstrate that StereoCrafter-Zero produces high-quality stereo videos with improved depth consistency and temporal smoothness, even when depth estimations are imperfect. Our framework is robust and adaptable across various diffusion models, setting a new benchmark for zero-shot stereo video generation and enabling more immersive visual experiences. Our code can be found at this https URL.

[CV-11] EasyHOI: Unleashing the Power of Large Models for Reconstructing Hand-Object Interactions in the Wild

Link: https://arxiv.org/abs/2411.14280
Authors: Yumeng Liu,Xiaoxiao Long,Zemin Yang,Yuan Liu,Marc Habermann,Christian Theobalt,Yuexin Ma,Wenping Wang
Keywords: ill-posed task, work aims, fundamental but ill-posed, hand-object interactions, reconstruct hand-object interactions
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Our work aims to reconstruct hand-object interactions from a single-view image, which is a fundamental but ill-posed task. Unlike methods that reconstruct from videos, multi-view images, or predefined 3D templates, single-view reconstruction faces significant challenges due to inherent ambiguities and occlusions. These challenges are further amplified by the diverse nature of hand poses and the vast variety of object shapes and sizes. Our key insight is that current foundational models for segmentation, inpainting, and 3D reconstruction robustly generalize to in-the-wild images, which could provide strong visual and geometric priors for reconstructing hand-object interactions. Specifically, given a single image, we first design a novel pipeline to estimate the underlying hand pose and object shape using off-the-shelf large models. Furthermore, with the initial reconstruction, we employ a prior-guided optimization scheme, which optimizes hand pose to comply with 3D physical constraints and the 2D input image content. We perform experiments across several datasets and show that our method consistently outperforms baselines and faithfully reconstructs a diverse set of hand-object interactions. Project page: this https URL

[CV-12] FocusLLaVA: A Coarse-to-Fine Approach for Efficient and Effective Visual Token Compression

Link: https://arxiv.org/abs/2411.14228
Authors: Yuke Zhu,Chi Xie,Shuang Liang,Bo Zheng,Sheng Guo
Keywords: Multi-modal Large Language, Large Language Models, Multi-modal Large, Large Language, Language Models
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advances in Multi-modal Large Language Models have demonstrated that high-resolution image input is crucial for model capabilities, especially for fine-grained tasks. However, high-resolution images lead to a quadratic increase in the number of visual tokens input into LLMs, resulting in significant computational costs. Current work develops visual token compression methods to achieve efficiency improvements, often at the expense of performance. We argue that removing visual redundancy can simultaneously improve both efficiency and performance. We build a coarse-to-fine visual token compression method, with a vision-guided sampler for compressing redundant regions with low information density, and a text-guided sampler for selecting visual tokens that are strongly correlated with the user instructions. With these two modules, the proposed FocusLLaVA achieves improvements in both efficiency and performance. We validate the effectiveness of our approach on a wide range of evaluation datasets.
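The quadratic growth in visual tokens mentioned in the abstract can be checked with simple arithmetic. The sketch below assumes a ViT-style encoder that splits the image into non-overlapping square patches; the patch size of 14 and the function name `visual_token_count` are illustrative assumptions, not details taken from the paper.

```python
def visual_token_count(height, width, patch_size=14):
    """Number of patch tokens for a ViT-style encoder (assumed tokenizer)."""
    return (height // patch_size) * (width // patch_size)


# Doubling the side length quadruples the number of visual tokens.
low_res = visual_token_count(336, 336)    # 24 * 24 patches
high_res = visual_token_count(672, 672)   # 48 * 48 patches
print(low_res, high_res, high_res / low_res)
```

Under these assumptions a 336x336 image yields 576 tokens and a 672x672 image yields 2304, which is why token compression matters for high-resolution inputs.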

[CV-13] Generative Outpainting To Enhance the Memorability of Short-Form Videos

Link: https://arxiv.org/abs/2411.14213
Authors: Alan Byju,Aman Sudhindra Ladwa,Lorin Sweeney,Alan F. Smeaton
Keywords: social media, format in advertising, short-form video format, video, short-form video
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:With the expanding use of the short-form video format in advertising, social media, entertainment, education and more, there is a need for such media to both captivate and be remembered. Video memorability indicates how likely a video is to be remembered by a viewer who has no emotional or personal connection with its content. This paper presents the results of using generative outpainting to expand the screen size of a short-form video with a view to improving its memorability. Advances in machine learning and deep learning are compared and leveraged to understand how extending the borders of video screen sizes can affect their memorability to viewers. Using quantitative evaluation, we determine the best-performing model for outpainting and the impact of outpainting based on image saliency on video memorability scores.

[CV-14] Novel View Extrapolation with Video Diffusion Priors

Link: https://arxiv.org/abs/2411.14208
Authors: Kunhao Liu,Ling Shao,Shijian Lu
Keywords: made significant strides, radiance field methods, Stable Video Diffusion, view, view extrapolation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The field of novel view synthesis has made significant strides thanks to the development of radiance field methods. However, most radiance field techniques are far better at novel view interpolation than novel view extrapolation, where the synthesized novel views are far beyond the observed training views. We design ViewExtrapolator, a novel view synthesis approach that leverages the generative priors of Stable Video Diffusion (SVD) for realistic novel view extrapolation. By redesigning the SVD denoising process, ViewExtrapolator refines the artifact-prone views rendered by radiance fields, greatly enhancing the clarity and realism of the synthesized novel views. ViewExtrapolator is a generic novel view extrapolator that can work with different types of 3D rendering such as views rendered from point clouds when only a single view or monocular video is available. Additionally, ViewExtrapolator requires no fine-tuning of SVD, making it both data-efficient and computation-efficient. Extensive experiments demonstrate the superiority of ViewExtrapolator in novel view extrapolation. Project page: this https URL.

[CV-15] Revised Regularization for Efficient Continual Learning through Correlation-Based Parameter Update in Bayesian Neural Networks

Link: https://arxiv.org/abs/2411.14202
Authors: Sanchar Palit, Biplab Banerjee, Subhasis Chaudhuri
Keywords: Bayesian neural network-based, neural network-based continual, Bayesian neural, Variational Inference, continual learning algorithm
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: at ICVGIP 2024

Abstract:We propose a Bayesian neural network-based continual learning algorithm using Variational Inference, aiming to overcome several drawbacks of existing methods. Specifically, in continual learning scenarios, storing network parameters at each step to retain knowledge poses challenges. This is compounded by the crucial need to mitigate catastrophic forgetting, particularly given the limited access to past datasets, which complicates maintaining correspondence between network parameters and datasets across all sessions. Current methods using Variational Inference with KL divergence risk catastrophic forgetting during uncertain node updates and coupled disruptions in certain nodes. To address these challenges, we propose the following strategies. To reduce the storage of the dense layer parameters, we propose a parameter distribution learning method that significantly reduces the storage requirements. In the continual learning framework employing variational inference, our study introduces a regularization term that specifically targets the dynamics and population of the mean and variance of the parameters. This term aims to retain the benefits of KL divergence while addressing related challenges. To ensure proper correspondence between network parameters and the data, our method introduces an importance-weighted Evidence Lower Bound term to capture data and parameter correlations. This enables storage of common and distinctive parameter hyperspace bases. The proposed method partitions the parameter space into common and distinctive subspaces, with conditions for effective backward and forward knowledge transfer, elucidating the network-parameter dataset correspondence. The experimental results demonstrate the effectiveness of our method across diverse datasets and various combinations of sequential datasets, yielding superior performance compared to existing approaches.

[CV-16] Regional Attention for Shadow Removal

Link: https://arxiv.org/abs/2411.14201
Authors: Hengxing Liu, Mingjia Li, Xiaojie Guo
Keywords: interacting with objects, plays a crucial, visual quality, shadow removal, Shadow
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Shadow, as a natural consequence of light interacting with objects, plays a crucial role in shaping the aesthetics of an image, but it also impairs content visibility and overall visual quality. Recent shadow removal approaches employ attention as a key component because of its effectiveness. However, they often suffer from two issues that limit practical use: large model size and high computational complexity. To address these shortcomings, this work devises a lightweight yet accurate shadow removal framework. First, we analyze the characteristics of the shadow removal task to identify the key information required for reconstructing shadow regions, and design a novel regional attention mechanism to effectively capture such information. Then, we customize a Regional Attention Shadow Removal Model (RASM, in short), which leverages non-shadow areas to assist in restoring shadow ones. Unlike existing attention-based models, our regional attention strategy allows each shadow region to interact more rationally with its surrounding non-shadow areas, seeking the regional contextual correlation between shadow and non-shadow areas. Extensive experiments demonstrate that our proposed method delivers superior performance over other state-of-the-art models in terms of accuracy and efficiency, making it appealing for practical applications.

[CV-17] CompetitorFormer: Competitor Transformer for 3D Instance Segmentation

Link: https://arxiv.org/abs/2411.14179
Authors: Duanchu Wang (1), Jing Liu (2), Haoran Gong (2), Yinghui Quan (1), Di Wang (2) ((1) School of Electronic Engineering, Xidian University; (2) School of Software Engineering, Xian Jiaotong University)
Keywords: Transformer-based methods, methods predict instance, Transformer-based, instance, dominant approach
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Transformer-based methods have become the dominant approach for 3D instance segmentation. These methods predict instance masks via instance queries, ranking them by classification confidence and IoU scores to select the top prediction as the final outcome. However, it has been observed that the current models employ a fixed and higher number of queries than the instances present within a scene. In such instances, multiple queries predict the same instance, yet only a single query is ultimately optimized. The close scores of queries in the lower-level decoders make it challenging for the dominant query to distinguish itself rapidly, which ultimately impairs the model’s accuracy and convergence efficiency. This phenomenon is referred to as inter-query competition. To address this challenge, we put forth a series of plug-and-play competition-oriented designs, collectively designated as the CompetitorFormer, with the aim of reducing competition and facilitating a dominant query. Experiments showed that integrating our designs with state-of-the-art frameworks consistently resulted in significant performance improvements in 3D instance segmentation across a range of datasets.

[CV-18] Spatiotemporal Decoupling for Efficient Vision-Based Occupancy Forecasting

Link: https://arxiv.org/abs/2411.14169
Authors: Jingyi Xu, Xieyuanli Chen, Junyi Ma, Jiawei Huang, Jintao Xu, Yue Wang, Ling Pei
Keywords: involves utilizing past, vehicle surrounding environments, present perception data, autonomous vehicle surrounding, predict future occupancy
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:The task of occupancy forecasting (OCF) involves utilizing past and present perception data to predict future occupancy states of autonomous vehicle surrounding environments, which is critical for downstream tasks such as obstacle avoidance and path planning. Existing 3D OCF approaches struggle to predict plausible spatial details for movable objects and suffer from slow inference speeds due to neglecting the bias and uneven distribution of changing occupancy states in both space and time. In this paper, we propose a novel spatiotemporal decoupling vision-based paradigm to explicitly tackle the bias and achieve both effective and efficient 3D OCF. To tackle spatial bias in empty areas, we introduce a novel spatial representation that decouples the conventional dense 3D format into 2D bird’s-eye view (BEV) occupancy with corresponding height values, enabling 3D OCF derived only from 2D predictions thus enhancing efficiency. To reduce temporal bias on static voxels, we design temporal decoupling to improve end-to-end OCF by temporally associating instances via predicted flows. We develop an efficient multi-head network EfficientOCF to achieve 3D OCF with our devised spatiotemporally decoupled representation. A new metric, conditional IoU (C-IoU), is also introduced to provide a robust 3D OCF performance assessment, especially in datasets with missing or incomplete annotations. The experimental results demonstrate that EfficientOCF surpasses existing baseline methods on accuracy and efficiency, achieving state-of-the-art performance with a fast inference time of 82.33ms with a single GPU. Our code will be released as open source.
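
The spatial decoupling described above can be illustrated with a minimal sketch. It assumes each occupied BEV cell stores a height expressed as a number of voxels and that the column is filled from the ground up. The function name `bev_to_voxels` and this filling rule are illustrative assumptions, not the representation used by EfficientOCF.

```python
import numpy as np


def bev_to_voxels(bev_occ, bev_height, num_z):
    """Lift a 2D BEV occupancy map plus per-cell heights to a dense 3D grid.

    Assumption (hypothetical, for illustration): an occupied BEV cell is
    filled from z = 0 up to its height, measured in voxels.
    """
    occ = np.asarray(bev_occ, dtype=bool)                         # (X, Y)
    heights = np.clip(np.asarray(bev_height, dtype=int), 0, num_z)
    z_index = np.arange(num_z)[None, None, :]                     # (1, 1, Z)
    filled = z_index < heights[:, :, None]                        # (X, Y, Z)
    return filled & occ[:, :, None]


bev_occ = [[1, 0], [1, 1]]
bev_height = [[2, 3], [0, 5]]      # heights above num_z are clipped
voxels = bev_to_voxels(bev_occ, bev_height, num_z=4)
print(voxels.shape, int(voxels.sum()))
```

Because the 3D grid is derived from two 2D maps, a network only has to predict X*Y occupancy values and X*Y heights instead of X*Y*Z voxels, which is the kind of saving the abstract attributes to the decoupled representation.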

[CV-19] Creating a Formally Verified Neural Network for Autonomous Navigation: An Experience Report

Link: https://arxiv.org/abs/2411.14163
Authors: Syed Ali Asadullah Bukhari, Thomas Flinkow, Medet Inkarbekov, Barak A. Pearlmutter, Rosemary Monahan
Keywords: neural networks opens, neural network, increased reliance, neural network verifier, networks opens
Categories: Logic in Computer Science (cs.LO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: In Proceedings FMAS2024, arXiv:2411.13215

Abstract:The increased reliance of self-driving vehicles on neural networks opens up the challenge of their verification. In this paper we present an experience report, describing a case study which we undertook to explore the design and training of a neural network on a custom dataset for vision-based autonomous navigation. We are particularly interested in the use of machine learning with differentiable logics to obtain networks satisfying basic safety properties by design, guaranteeing the behaviour of the neural network after training. We motivate the choice of a suitable neural network verifier for our purposes and report our observations on the use of neural network verifiers for self-driving systems.

[CV-20] Point Cloud Denoising With Fine-Granularity Dynamic Graph Convolutional Networks

Link: https://arxiv.org/abs/2411.14158
Authors: Wenqiang Xu, Wenrui Dai, Duoduo Xue, Ziyang Zheng, Chenglin Li, Junni Zou, Hongkai Xiong
Keywords: hindering down-stream tasks, Due to limitations, acquisition equipment, perturbations often corrupt, hindering down-stream
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Due to limitations in acquisition equipment, noise perturbations often corrupt 3-D point clouds, hindering down-stream tasks such as surface reconstruction, rendering, and further processing. Existing 3-D point cloud denoising methods typically fail to reliably fit the underlying continuous surface, resulting in a degradation of reconstruction performance. This paper introduces fine-granularity dynamic graph convolutional networks called GD-GCN, a novel approach to denoising in 3-D point clouds. The GD-GCN employs micro-step temporal graph convolution (MST-GConv) to perform feature learning in a gradual manner. Compared with the conventional GCN, which commonly uses discrete integer-step graph convolution, this modification introduces a more adaptable and nuanced approach to feature learning within graph convolution networks. It more accurately depicts the process of fitting the noisy point cloud to the underlying surface. The learning process for MST-GConv acts like a changing system and is managed through a type of neural network known as neural Partial Differential Equations (PDEs), which means it can adapt and improve over time. GD-GCN approximates the Riemannian metric, calculating distances between points along a low-dimensional manifold. This capability allows it to understand the local geometric structure and effectively capture diverse relationships between points from different geometric regions through geometric graph construction based on Riemannian distances. Additionally, GD-GCN incorporates robust graph spectral filters based on the Bernstein polynomial approximation, which modulate eigenvalues for complex and arbitrary spectral responses, providing theoretical guarantees for BIBO stability. Symmetric channel mixing matrices further enhance filter flexibility by enabling channel-level scaling and shifting in the spectral domain.
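
To make the Bernstein-polynomial filter concrete, here is a minimal sketch of a degree-K Bernstein spectral response evaluated on eigenvalues of a normalized graph Laplacian, which lie in [0, 2]. It uses the common Bernstein-basis parameterization. The coefficient name `theta` and the function name are illustrative assumptions, and GD-GCN's actual filters may be defined differently.

```python
from math import comb


def bernstein_response(lam, theta):
    """Evaluate h(lam) = sum_k theta[k] * C(K, k) / 2**K * (2 - lam)**(K - k) * lam**k.

    lam   : eigenvalue of a normalized graph Laplacian, expected in [0, 2]
    theta : K + 1 filter coefficients (hypothetical learnable parameters)
    """
    K = len(theta) - 1
    return sum(
        theta[k] * comb(K, k) / 2**K * (2 - lam) ** (K - k) * lam**k
        for k in range(K + 1)
    )


# With all coefficients equal to 1 the basis sums to 1 (an all-pass filter).
all_pass = [bernstein_response(lam, [1.0] * 5) for lam in (0.0, 0.7, 2.0)]
# Coefficients that decay from 1 to 0 give a low-pass response.
low_pass = [bernstein_response(lam, [1.0, 0.5, 0.0]) for lam in (0.0, 1.0, 2.0)]
```

When every coefficient is non-negative, each basis term is non-negative on [0, 2], so the response stays bounded by the largest coefficient. That boundedness is the intuition behind stability arguments for this family of filters.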

[CV-21] RestorerID: Towards Tuning-Free Face Restoration with ID Preservation

Link: https://arxiv.org/abs/2411.14125
Authors: Jiacheng Ying, Mushui Liu, Zhe Wu, Runming Zhang, Zhu Yu, Siming Fu, Si-Yuan Cao, Chao Wu, Yunlong Yu, Hui-Liang Shen
Keywords: made great progress, face restoration, Blind face restoration, made great, great progress
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 10 figures

Abstract:Blind face restoration has made great progress in producing high-quality and lifelike images. Yet it remains challenging to preserve the ID information, especially when the degradation is heavy. Current reference-guided face restoration approaches either require face alignment or personalized test-tuning, which are unfaithful or time-consuming. In this paper, we propose a tuning-free method named RestorerID that incorporates ID preservation during face restoration. RestorerID is a diffusion model-based method that restores low-quality images with varying levels of degradation by using a single reference image. To achieve this, we propose a unified framework to combine the ID injection with the base blind face restoration model. In addition, we design a novel Face ID Rebalancing Adapter (FIR-Adapter) to tackle the problems of content inconsistency and contour misalignment that are caused by information conflicts between the low-quality input and reference image. Furthermore, by employing an Adaptive ID-Scale Adjusting strategy, RestorerID can produce superior restored images across various levels of degradation. Experimental results on the Celeb-Ref dataset and real-world scenarios demonstrate that RestorerID effectively delivers high-quality face restoration with ID preservation, achieving a superior performance compared to the test-tuning approaches and other reference-guided ones. The code of RestorerID is available at this https URL.

[CV-22] Point Cloud Resampling with Learnable Heat Diffusion

Link: https://arxiv.org/abs/2411.14120
Authors: Wenqiang Xu, Wenrui Dai, Duoduo Xue, Ziyang Zheng, Chenglin Li, Junni Zou, Hongkai Xiong
Keywords: Generative diffusion models, shown empirical successes, progressively refining noise, point cloud resampling, point cloud
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Generative diffusion models have shown empirical successes in point cloud resampling, generating a denser and more uniform distribution of points from sparse or noisy 3D point clouds by progressively refining noise into structure. However, existing diffusion models employ manually predefined schemes, which often fail to recover the underlying point cloud structure due to the rigid and disruptive nature of the geometric degradation. To address this issue, we propose a novel learnable heat diffusion framework for point cloud resampling, which directly parameterizes the marginal distribution for the forward process by learning the adaptive heat diffusion schedules and local filtering scales of the time-varying heat kernel, and consequently, generates an adaptive conditional prior for the reverse process. Unlike previous diffusion models with a fixed prior, the adaptive conditional prior selectively preserves geometric features of the point cloud by minimizing a refined variational lower bound, guiding the points to evolve towards the underlying surface during the reverse process. Extensive experimental results demonstrate that the proposed point cloud resampling achieves state-of-the-art performance in representative reconstruction tasks including point cloud denoising and upsampling.

[CV-23] Uncertainty-Aware Regression for Socio-Economic Estimation via Multi-View Remote Sensing

Link: https://arxiv.org/abs/2411.14119
Authors: Fan Yang, Sahoko Ishida, Mengyan Zhang, Daniel Jenson, Swapnil Mishra, Jhonathan Navott, Seth Flaxman
Keywords: Earth observation, areas for Earth, imagery offers rich, offers rich spectral, offers rich
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 4 figures

Abstract:Remote sensing imagery offers rich spectral data across extensive areas for Earth observation. Many attempts have been made to leverage these data with transfer learning to develop scalable alternatives for estimating socio-economic conditions, reducing reliance on expensive survey-collected data. However, much of this research has primarily focused on daytime satellite imagery due to the limitation that most pre-trained models are trained on 3-band RGB images. Consequently, modeling techniques for spectral bands beyond the visible spectrum have not been thoroughly investigated. Additionally, quantifying uncertainty in remote sensing regression has been less explored, yet it is essential for more informed targeting and iterative collection of ground truth survey data. In this paper, we introduce a novel framework that leverages generic foundational vision models to process remote sensing imagery using combinations of three spectral bands to exploit multi-spectral data. We also employ methods such as heteroscedastic regression and Bayesian modeling to generate uncertainty estimates for the predictions. Experimental results demonstrate that our method outperforms existing models that use RGB or multi-spectral models with unstructured band usage. Moreover, our framework helps identify uncertain predictions, guiding future ground truth data acquisition.
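
Heteroscedastic regression is commonly implemented by having the model predict both a mean and a log-variance and training with the Gaussian negative log-likelihood. The sketch below shows that loss for a single prediction, with the constant term dropped. The function name is an assumption, and the paper may use a different likelihood or parameterization.

```python
import math


def gaussian_nll(y_true, mean, log_var):
    """Per-sample Gaussian negative log-likelihood without the constant term.

    Predicting the log-variance keeps the variance positive without constraints.
    """
    return 0.5 * (log_var + (y_true - mean) ** 2 / math.exp(log_var))


# Same prediction error of 3, with two different predicted variances.
overconfident = gaussian_nll(y_true=3.0, mean=0.0, log_var=0.0)          # variance 1
hedged = gaussian_nll(y_true=3.0, mean=0.0, log_var=math.log(9.0))       # variance 9
print(overconfident, hedged)
```

A large error costs less when the model also reports a large variance, while the `log_var` term penalizes claiming high uncertainty everywhere. This trade-off is what lets the predicted variance act as a per-sample uncertainty estimate.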

[CV-24] WARLearn: Weather-Adaptive Representation Learning WACV

Link: https://arxiv.org/abs/2411.14095
Authors: Shubham Agarwal, Raz Birman, Ofer Hadar
Keywords: adaptive representation learning, adversarial weather conditions, paper introduces WARLearn, Barlow Twins, paper introduces
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication in IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2025

Abstract:This paper introduces WARLearn, a novel framework designed for adaptive representation learning in challenging and adversarial weather conditions. Leveraging the invariance principle used in Barlow Twins, we demonstrate the capability to port existing models initially trained on clear weather data to effectively handle adverse weather conditions. With minimal additional training, our method exhibits remarkable performance gains in scenarios characterized by fog and low-light conditions. This adaptive framework extends its applicability beyond adverse weather settings, offering a versatile solution for domains exhibiting variations in data distributions. Furthermore, WARLearn is invaluable in scenarios where data distributions undergo significant shifts over time, enabling models to remain updated and accurate. Our experimental findings reveal a remarkable performance, with a mean average precision (mAP) of 52.6% on an unseen real-world foggy dataset (RTTS). Similarly, in low-light conditions, our framework achieves a mAP of 55.7% on an unseen real-world low-light dataset (ExDark). Notably, WARLearn surpasses the performance of state-of-the-art frameworks including FeatEnHancer, Image Adaptive YOLO, DENet, C2PNet, PairLIE and ZeroDCE by a substantial margin in adverse weather, improving the baseline performance in both foggy and low-light conditions. The WARLearn code is available at this https URL.
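
The invariance principle of Barlow Twins is usually expressed as a loss on the cross-correlation matrix between two batches of embeddings: diagonal entries are pushed toward 1 and off-diagonal entries toward 0. The sketch below implements that standard loss in NumPy. Pairing clear-weather and adverse-weather features as the two inputs is an assumption about how WARLearn applies it, and the weight `lambd` and the synthetic features are illustrative.

```python
import numpy as np


def barlow_twins_loss(z_a, z_b, lambd=5e-3):
    """Barlow Twins loss for two (batch, dim) embedding matrices."""
    n = z_a.shape[0]
    # Standardize each feature over the batch (assumes no constant features).
    z_a = (z_a - z_a.mean(axis=0)) / z_a.std(axis=0)
    z_b = (z_b - z_b.mean(axis=0)) / z_b.std(axis=0)
    c = z_a.T @ z_b / n                          # (dim, dim) cross-correlation
    diag = np.diagonal(c)
    on_diag = np.sum((1.0 - diag) ** 2)          # invariance term
    off_diag = np.sum(c**2) - np.sum(diag**2)    # redundancy-reduction term
    return on_diag + lambd * off_diag


rng = np.random.default_rng(0)
clear_feats = rng.normal(size=(256, 8))          # hypothetical clear-weather features
foggy_feats = clear_feats + 0.1 * rng.normal(size=(256, 8))
loss_aligned = barlow_twins_loss(clear_feats, foggy_feats)
loss_unrelated = barlow_twins_loss(clear_feats, rng.normal(size=(256, 8)))
print(loss_aligned, loss_unrelated)
```

Embeddings that agree across the two conditions yield a much smaller loss than unrelated embeddings, so minimizing it encourages features that are invariant to the change in weather.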

[CV-25] Stereo Anything: Unifying Stereo Matching with Large-Scale Mixed Data

Link: https://arxiv.org/abs/2411.14053
Authors: Xianda Guo, Chenming Zhang, Youmin Zhang, Dujun Nie, Ruilin Wang, Wenzhao Zheng, Matteo Poggi, Long Chen
Keywords: recover depth information, aiming to find, depth information, pivotal component, find corresponding points
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code will be available at this https URL

Abstract:Stereo matching has been a pivotal component in 3D vision, aiming to find corresponding points between pairs of stereo images to recover depth information. In this work, we introduce StereoAnything, a highly practical solution for robust stereo matching. Rather than focusing on a specialized model, our goal is to develop a versatile foundational model capable of handling stereo images across diverse environments. To this end, we scale up the dataset by collecting labeled stereo images and generating synthetic stereo pairs from unlabeled monocular images. To further enrich the model’s ability to generalize across different conditions, we introduce a novel synthetic dataset that complements existing data by adding variability in baselines, camera angles, and scene types. We extensively evaluate the zero-shot capabilities of our model on five public datasets, showcasing its impressive ability to generalize to new, unseen data. Code will be available at this https URL.

[CV-26] Out-Of-Distribution Detection with Diversification (Provably)

Link: https://arxiv.org/abs/2411.14049
Authors: Haiyun Yao, Zongbo Han, Huazhu Fu, Xi Peng, Qinghua Hu, Changqing Zhang
Keywords: machine learning models, ensuring reliable deployment, auxiliary outliers, learning models, crucial for ensuring
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Out-of-distribution (OOD) detection is crucial for ensuring reliable deployment of machine learning models. Recent advancements focus on utilizing easily accessible auxiliary outliers (e.g., data from the web or other datasets) in training. However, we experimentally reveal that these methods still struggle to generalize their detection capabilities to unknown OOD data, due to the limited diversity of the auxiliary outliers collected. Therefore, we thoroughly examine this problem from the generalization perspective and demonstrate that a more diverse set of auxiliary outliers is essential for enhancing the detection capabilities. However, in practice, it is difficult and costly to collect sufficiently diverse auxiliary outlier data. Therefore, we propose a simple yet practical approach with a theoretical guarantee, termed Diversity-induced Mixup for OOD detection (diverseMix), which enhances the diversity of auxiliary outlier set for training in an efficient way. Extensive experiments show that diverseMix achieves superior performance on commonly used and recent challenging large-scale benchmarks, which further confirm the importance of the diversity of auxiliary outliers.
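
Standard mixup forms convex combinations of pairs of samples with a Beta-distributed coefficient. The sketch below applies plain mixup to a batch of auxiliary outliers. It does not reproduce the diversity-induced weighting of diverseMix, which the abstract does not specify, and `alpha`, the function name, and the stand-in data are assumptions.

```python
import numpy as np


def mixup_outliers(outliers, alpha=1.0, rng=None):
    """Return convex combinations of randomly paired outlier samples."""
    rng = np.random.default_rng() if rng is None else rng
    n = outliers.shape[0]
    partner = rng.permutation(n)                         # random pairing
    # One mixing coefficient per sample, broadcast over the remaining axes.
    lam = rng.beta(alpha, alpha, size=(n,) + (1,) * (outliers.ndim - 1))
    return lam * outliers + (1.0 - lam) * outliers[partner]


rng = np.random.default_rng(0)
aux_outliers = rng.uniform(0.0, 1.0, size=(16, 3, 8, 8))   # stand-in images
mixed = mixup_outliers(aux_outliers, alpha=1.0, rng=rng)
print(mixed.shape)
```

Each mixed sample lies on the segment between two collected outliers, so the training set covers regions of input space that the original auxiliary data did not contain.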

[CV-27] Experimental comparison of graph-based approximate nearest neighbor search algorithms on edge devices

Link: https://arxiv.org/abs/2411.14006
Authors: Ali Ganbarov, Jicheng Yuan, Anh Le-Tuan, Manfred Hauswirth, Danh Le-Phuoc
Keywords: approximate nearest neighbor, smart city infrastructure, nearest neighbor, neighbor search applications, nearest neighbor search
Categories: Data Structures and Algorithms (cs.DS); Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)

Abstract:In this paper, we present an experimental comparison of various graph-based approximate nearest neighbor (ANN) search algorithms deployed on edge devices for real-time nearest neighbor search applications, such as smart city infrastructure and autonomous vehicles. To the best of our knowledge, this specific comparative analysis has not been previously conducted. While existing research has explored graph-based ANN algorithms, it has often been limited to single-threaded implementations on standard commodity hardware. Our study leverages the full computational and storage capabilities of edge devices, incorporating additional metrics such as insertion and deletion latency of new vectors and power consumption. This comprehensive evaluation aims to provide valuable insights into the performance and suitability of these algorithms for edge-based real-time tracking systems enhanced by nearest-neighbor search algorithms.

[CV-28] SEMPose: A Single End-to-end Network for Multi-object Pose Estimation

Link: https://arxiv.org/abs/2411.14002
Authors: Xin Liu, Hao Wang, Shibei Xue, Dezong Zhao
Keywords: computer vision, fundamental task, RGB image, task, methods
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:In computer vision, estimating the six-degree-of-freedom pose from an RGB image is a fundamental task. However, this task becomes highly challenging in multi-object scenes. Currently, the best methods typically employ an indirect strategy, which identifies 2D and 3D correspondences, and then solves with the Perspective-n-Points method. Yet, this approach cannot be trained end-to-end. Direct methods, on the other hand, suffer from lower accuracy due to challenges such as varying object sizes and occlusions. To address these issues, we propose SEMPose, an end-to-end multi-object pose estimation network. SEMPose utilizes a well-designed texture-shape guided feature pyramid network, effectively tackling the challenge of object size variations. Additionally, it employs an iterative refinement head structure, progressively regressing rotation and translation separately to enhance estimation accuracy. During training, we alleviate the impact of occlusion by selecting positive samples from visible parts. Experimental results demonstrate that SEMPose can perform inference at 32 FPS without requiring inputs other than the RGB image. It can accurately estimate the poses of multiple objects in real time, with inference time unaffected by the number of target objects. On the LM-O and YCB-V datasets, our method outperforms other RGB-based single-model methods, achieving higher accuracy. Even when compared with multi-model methods and approaches that use additional refinement, our results remain competitive.

[CV-29] Graph Domain Adaptation with Dual-branch Encoder and Two-level Alignment for Whole Slide Image-based Survival Prediction

Link: https://arxiv.org/abs/2411.14001
Authors: Yuntao Shou, Peiqiang Yan, Xingjian Yuan, Xiangyong Cao, Qian Zhao, Deyu Meng
Keywords: medical image analysis, slide image, medical image, based survival analysis, recent years
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 6 figures

Abstract:In recent years, histopathological whole slide image (WSI)-based survival analysis has attracted much attention in medical image analysis. In practice, WSIs usually come from different hospitals or laboratories, which can be seen as different domains, and thus may have significant differences in imaging equipment, processing procedures, and sample sources. These differences generally result in large gaps in distribution between different WSI domains, and thus the survival analysis models trained on one domain may fail to transfer to another. To address this issue, we propose a Dual-branch Encoder and Two-level Alignment (DETA) framework to explore both feature and category-level alignment between different WSI domains. Specifically, we first formulate the concerned problem as graph domain adaptation (GDA) by virtue of the graph representation of WSIs. Then we construct a dual-branch graph encoder, including the message passing branch and the shortest path branch, to explicitly and implicitly extract semantic information from the graph-represented WSIs. To realize GDA, we propose a two-level alignment approach: at the category level, we develop a coupling technique by virtue of the dual-branch structure, leading to reduced divergence between the category distributions of the two domains; at the feature level, we introduce an adversarial perturbation strategy to better augment source domain features, resulting in improved alignment in feature distribution. To the best of our knowledge, our work is the first attempt to alleviate the domain shift issue for WSI data analysis. Extensive experiments on four TCGA datasets have validated the effectiveness of our proposed DETA framework and demonstrated its superior performance in WSI-based survival analysis.

[CV-30] Transforming Static Images Using Generative Models for Video Salient Object Detection

Link: https://arxiv.org/abs/2411.13975
Authors: Suhwan Cho, Minhyeok Lee, Jungho Lee, Sangyoun Lee
Keywords: comprehensive knowledge transfer, facilitates comprehensive knowledge, common strategy, knowledge transfer, abundant and facilitates
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:In many video processing tasks, leveraging large-scale image datasets is a common strategy, as image data is more abundant and facilitates comprehensive knowledge transfer. A typical approach for simulating video from static images involves applying spatial transformations, such as affine transformations and spline warping, to create sequences that mimic temporal progression. However, in tasks like video salient object detection, where both appearance and motion cues are critical, these basic image-to-video techniques fail to produce realistic optical flows that capture the independent motion properties of each object. In this study, we show that image-to-video diffusion models can generate realistic transformations of static images while understanding the contextual relationships between image components. This ability allows the model to generate plausible optical flows, preserving semantic integrity while reflecting the independent motion of scene elements. By augmenting individual images in this way, we create large-scale image-flow pairs that significantly enhance model training. Our approach achieves state-of-the-art performance across all public benchmark datasets, outperforming existing approaches.

[CV-31] Zero-Shot Low-Light Image Enhancement via Joint Frequency Domain Priors Guided Diffusion

Link: https://arxiv.org/abs/2411.13961
Authors: Jinhong He, Shivakumara Palaiahnakote, Aoxiang Ning, Minglong Xue
Keywords: real-world paired datasets, Fourier frequency domains, supervised methods lacking, Fourier frequency, wavelet and Fourier
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Due to the singularity of real-world paired datasets and the complexity of low-light environments, supervised methods lack a degree of scene generalisation. Meanwhile, limited by poor lighting and content guidance, existing zero-shot methods cannot handle unknown severe degradation well. To address this problem, we propose a new zero-shot low-light enhancement method that compensates for the lack of light and structural information in the diffusion sampling process by effectively combining the wavelet and Fourier frequency domains to construct rich prior information. The key inspiration comes from the similarity between the wavelet and Fourier frequency domains: both light and structure information are closely related to specific frequency domain regions. Therefore, by transferring the diffusion process to the wavelet low-frequency domain and combining the wavelet and Fourier frequency domains by continuously decomposing them in the inverse process, the constructed rich illumination prior is utilised to guide the image generation enhancement process. Extensive experiments show that the framework is robust and effective in various scenarios. The code will be available at: this https URL.

[CV-32] Transforming Engineering Diagrams: A Novel Approach for PID Digitization using Transformers

Link: https://arxiv.org/abs/2411.13929
Authors: Jan Marius Stürmer, Marius Graumann, Tobias Koch
Keywords: complex technical systems, Piping and Instrumentation, Instrumentation Diagrams, complex systems, complex technical
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:The digitization of complex technical systems, such as Piping and Instrumentation Diagrams (PIDs), is crucial for efficient maintenance and operation of complex systems in hydraulic and process engineering. Previous approaches often rely on separate modules that analyze diagram elements individually, neglecting the diagram’s overall structure. We address this limitation by proposing a novel approach that utilizes the Relationformer, a state-of-the-art deep learning architecture, to extract graphs from PIDs. Our method leverages the ability of the Relationformer to simultaneously detect objects and their relationships in images, making it suitable for the task of graph extraction from engineering diagrams. We apply our proposed approach to both real-world and synthetically created PID datasets, and evaluate its effectiveness by comparing it with a modular digitization approach based on recent literature. For evaluation, we present PID2Graph, the first publicly accessible PID dataset featuring comprehensive labels for the graph structure, including symbols, nodes and their connections. To understand the effect of patching and stitching on both approaches, we compare values before and after merging the patches. For the real-world data, the Relationformer achieves convincing results, outperforming the modular digitization approach for edge detection by more than 25%. Our work provides a comprehensive framework for assessing the performance of PID digitization methods and opens up new avenues for research in this area using transformer architectures. The PID dataset used for evaluation will be published and publicly available upon acceptance of the paper.

[CV-33] Multimodal 3D Reasoning Segmentation with Complex Scenes

Link: https://arxiv.org/abs/2411.13927
Authors: Xueying Jiang, Lewei Lu, Ling Shao, Shijian Lu
Keywords (EN): recent development, development in multimodal, multimodal learning, learning has greatly, greatly advanced
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract: The recent development in multimodal learning has greatly advanced the research in 3D scene understanding in various real-world tasks such as embodied AI. However, most existing work shares two typical constraints: 1) they lack the reasoning ability needed to interact with and interpret human intention, and 2) they focus on scenarios with single-category objects only, which leads to over-simplified textual descriptions because multi-object scenarios and spatial relations among objects are neglected. We bridge the research gaps by proposing a 3D reasoning segmentation task for multiple objects in scenes. The task allows producing 3D segmentation masks and detailed textual explanations enriched by 3D spatial relations among objects. To this end, we create ReasonSeg3D, a large-scale and high-quality benchmark that integrates 3D spatial relations with generated question-answer pairs and 3D segmentation masks. In addition, we design MORE3D, a simple yet effective method that enables multi-object 3D reasoning segmentation with user questions and textual outputs. Extensive experiments show that MORE3D excels in reasoning and segmenting complex multi-object 3D scenes, and the created ReasonSeg3D offers a valuable platform for future exploration of 3D reasoning segmentation. The dataset and code will be released.

[CV-34] Quantization without Tears

Link: https://arxiv.org/abs/2411.13918
Authors: Minghao Fu, Hao Yu, Jie Shao, Junjie Zhou, Ke Zhu, Jianxin Wu
Keywords (EN): Deep neural networks, demand significant resources, achieving remarkable success, GPU memory, Deep neural
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Deep neural networks, while achieving remarkable success across diverse tasks, demand significant resources, including computation, GPU memory, bandwidth, storage, and energy. Network quantization, as a standard compression and acceleration technique, reduces storage costs and enables potential inference acceleration by discretizing network weights and activations into a finite set of integer values. However, current quantization methods are often complex and sensitive, requiring extensive task-specific hyperparameters, where even a single misconfiguration can impair model performance, limiting generality across different models and tasks. In this paper, we propose Quantization without Tears (QwT), a method that simultaneously achieves quantization speed, accuracy, simplicity, and generality. The key insight of QwT is to incorporate a lightweight additional structure into the quantized network to mitigate information loss during quantization. This structure consists solely of a small set of linear layers, keeping the method simple and efficient. More importantly, it provides a closed-form solution, allowing us to improve accuracy effortlessly under 2 minutes. Extensive experiments across various vision, language, and multimodal tasks demonstrate that QwT is both highly effective and versatile. In fact, our approach offers a robust solution for network quantization that combines simplicity, accuracy, and adaptability, which provides new insights for the design of novel quantization paradigms.
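The closed-form compensation idea can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes a single scalar ReLU unit as the "network", a uniform rounding quantizer, and a scalar affine correction `a * x + c` fitted by ordinary least squares on the residual between full-precision and quantized outputs. All names (`quantize`, `fit_affine`) and the step size of 0.5 are illustrative assumptions.

```python
def quantize(value, step):
    """Uniform quantization: snap a value to the nearest multiple of `step`."""
    return round(value / step) * step


def relu(v):
    return v if v > 0.0 else 0.0


def fit_affine(xs, residuals):
    """Closed-form least squares for residual ~ a * x + c (no iterative training)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_r = sum(residuals) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xr = sum((x - mean_x) * (r - mean_r) for x, r in zip(xs, residuals))
    a = cov_xr / var_x
    c = mean_r - a * mean_x
    return a, c


def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


# Toy full-precision "network" (assumed): y = relu(w * x + b).
w, b = 0.83, -0.21
step = 0.5  # coarse quantization step, chosen only for illustration
wq, bq = quantize(w, step), quantize(b, step)

xs = [i / 10 for i in range(-20, 21)]  # small calibration set
full = [relu(w * x + b) for x in xs]
quant = [relu(wq * x + bq) for x in xs]

# Fit the lightweight linear compensation on the quantization residual.
a, c = fit_affine(xs, [f - q for f, q in zip(full, quant)])
compensated = [q + a * x + c for q, x in zip(quant, xs)]

mse_before = mse(quant, full)
mse_after = mse(compensated, full)
print(f"MSE before: {mse_before:.5f}, after: {mse_after:.5f}")
```

Because the zero correction (`a = 0`, `c = 0`) is one of the candidates the least-squares fit considers, the compensated error on the calibration set can never exceed the uncompensated error.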

[CV-35] Panther: Illuminate the Sight of Multimodal LLMs with Instruction-Guided Visual Prompts

Link: https://arxiv.org/abs/2411.13909
Authors: Honglin Li, Yuting Gao, Chenglu Zhu, Jingdong Chen, Ming Yang, Lin Yang
Keywords (EN): Multimodal large language, large language models, perception capability rapidly, locating small objects, subtle images details
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract: Multimodal large language models (MLLMs) are rapidly closing the gap to human visual perception, yet they still lag behind in attending to subtle image details or locating small objects precisely. Common schemes to tackle these issues include deploying multiple vision encoders or operating on original high-resolution images. Few studies have used the textual instruction to improve the visual representation, which results in a loss of focus in some vision-centric tasks, a phenomenon we herein term Amblyopia. In this work, we introduce Panther, an MLLM that closely adheres to user instructions and locates targets of interest precisely, with the finesse of a black panther. Specifically, Panther comprises three integral components: Panther-VE, Panther-Bridge, and Panther-Decoder. Panther-VE integrates user instruction information at the early stages of the vision encoder, thereby extracting the most relevant and useful visual representations. The Panther-Bridge module, equipped with powerful filtering capabilities, significantly reduces redundant visual information, leading to substantial savings in training costs. The Panther-Decoder is versatile and can be employed with any decoder-only LLM architecture. Experimental results, particularly on vision-centric benchmarks, have demonstrated the effectiveness of Panther.

[CV-36] Dressing the Imagination: A Dataset for AI-Powered Translation of Text into Fashion Outfits and A Novel KAN Adapter for Enhanced Feature Adaptation

Link: https://arxiv.org/abs/2411.13901
Authors: Gayatri Deshmukh, Somsubhra De, Chirag Sehgal, Jishu Sen Gupta, Sparsh Mittal
Keywords (EN): Language Outfit Representation, industry rich language, Fashion Language Outfit, AI-driven fashion design, Specialized datasets
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review at a conference

Abstract:Specialized datasets that capture the fashion industry’s rich language and styling elements can boost progress in AI-driven fashion design. We present FLORA (Fashion Language Outfit Representation for Apparel Generation), the first comprehensive dataset containing 4,330 curated pairs of fashion outfits and corresponding textual descriptions. Each description utilizes industry-specific terminology and jargon commonly used by professional fashion designers, providing precise and detailed insights into the outfits. Hence, the dataset captures the delicate features and subtle stylistic elements necessary to create high-fidelity fashion designs. We demonstrate that fine-tuning generative models on the FLORA dataset significantly enhances their capability to generate accurate and stylistically rich images from textual descriptions of fashion sketches. FLORA will catalyze the creation of advanced AI models capable of comprehending and producing subtle, stylistically rich fashion designs. It will also help fashion designers and end-users to bring their ideas to life. As a second orthogonal contribution, we introduce KAN Adapters, which leverage Kolmogorov-Arnold Networks (KAN) as adaptive modules. They serve as replacements for traditional MLP-based LoRA adapters. With learnable spline-based activations, KAN Adapters excel in modeling complex, non-linear relationships, achieving superior fidelity, faster convergence and semantic alignment. Extensive experiments and ablation studies on our proposed FLORA dataset validate the superiority of KAN Adapters over LoRA adapters. To foster further research and collaboration, we will open-source both the FLORA and our implementation code. 

[CV-37] CLFace: A Scalable and Resource-Efficient Continual Learning Framework for Lifelong Face Recognition WACV2025

Link: https://arxiv.org/abs/2411.13886
Authors: Md Mahedi Hasan, Shoaib Meraj Sami, Nasser Nasrabadi
Keywords (EN): continuous data stream, deploying face recognition, important aspect, aspect of deploying, real-world applications
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication in the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV 2025)

Abstract: An important aspect of deploying face recognition (FR) algorithms in real-world applications is their ability to learn new face identities from a continuous data stream. However, the online training of existing deep neural network-based FR algorithms, which are pre-trained offline on large-scale stationary datasets, encounters two major challenges: (I) catastrophic forgetting of previously learned identities, and (II) the need to store past data for complete retraining from scratch, leading to significant storage constraints and privacy concerns. In this paper, we introduce CLFace, a continual learning framework designed to preserve and incrementally extend the learned knowledge. CLFace eliminates the classification layer, resulting in a resource-efficient FR model that remains fixed throughout lifelong learning and provides label-free supervision to a student model, making it suitable for open-set face recognition during incremental steps. We introduce an objective function that employs feature-level distillation to reduce drift between feature maps of the student and teacher models across multiple stages. Additionally, it incorporates a geometry-preserving distillation scheme to maintain the orientation of the teacher model’s feature embedding. Furthermore, a contrastive knowledge distillation is incorporated to continually enhance the discriminative power of the feature representation by matching similarities between new identities. Experiments on several benchmark FR datasets demonstrate that CLFace outperforms baseline approaches and state-of-the-art methods on unseen identities using both in-domain and out-of-domain datasets.
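To make the distillation objective more concrete, here is a minimal sketch of two of the terms the abstract names. It assumes a mean-squared-error form for feature-level distillation and a `1 - cosine similarity` form for the geometry-preserving term. Both forms, the function names, and the weights `w_feat` and `w_geo` are illustrative assumptions rather than the paper's actual definitions, and the contrastive term is omitted.

```python
import math


def feature_distillation(student, teacher):
    """Mean squared error between student and teacher feature vectors."""
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def geometry_preserving(student_emb, teacher_emb):
    """Penalize angular deviation of the student embedding from the teacher's."""
    return 1.0 - cosine_similarity(student_emb, teacher_emb)


def total_loss(student_feats, teacher_feats, student_emb, teacher_emb,
               w_feat=1.0, w_geo=1.0):
    """Weighted sum over stages; the weights are hypothetical hyperparameters."""
    feat = sum(feature_distillation(s, t)
               for s, t in zip(student_feats, teacher_feats))
    return w_feat * feat + w_geo * geometry_preserving(student_emb, teacher_emb)


# Two stages of (flattened) feature maps plus a final embedding.
teacher_feats = [[1.0, 2.0], [0.5, -0.5]]
student_feats = [[1.0, 2.5], [0.5, -0.5]]
loss = total_loss(student_feats, teacher_feats,
                  student_emb=[0.0, 1.0], teacher_emb=[1.0, 0.0])
print(loss)
```

In this example the first stage contributes a feature drift of 0.125 and the orthogonal embeddings contribute the maximum angular penalty for non-opposed vectors, 1.0.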

[CV-38] Sli2Vol: Segmenting 3D Medical Images Based on an Object Estimation Guided Correspondence Flow Network

Link: https://arxiv.org/abs/2411.13873
Authors: Delin An, Pengfei Gu, Milan Sonka, Chaoli Wang, Danny Z. Chen
Keywords (EN): Deep learning, shown remarkable successes, medical image, medical image segmentation, shown remarkable
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract: Deep learning (DL) methods have shown remarkable successes in medical image segmentation, often using large amounts of annotated data for model training. However, acquiring a large number of diverse labeled 3D medical image datasets is highly difficult and expensive. Recently, mask propagation DL methods were developed to reduce the annotation burden on 3D medical images. For example, Sli2Vol (Yeung et al., 2021) proposed a self-supervised framework (SSF) to learn correspondences by matching neighboring slices via slice reconstruction in the training stage; the learned correspondences were then used to propagate a labeled slice to other slices in the test stage. However, these methods are still prone to error accumulation due to the inter-slice propagation of reconstruction errors. Also, they do not handle discontinuities well, which can occur between consecutive slices in 3D images, as they emphasize exploiting object continuity. To address these challenges, in this work, we propose a new SSF, called \proposed, for segmenting any anatomical structures in 3D medical images using only a single annotated slice per training and testing volume. Specifically, in the training stage, we first propagate an annotated 2D slice of a training volume to the other slices, generating pseudo-labels (PLs). Then, we develop a novel Object Estimation Guided Correspondence Flow Network to learn reliable correspondences between consecutive slices and corresponding PLs in a self-supervised manner. In the test stage, such correspondences are utilized to propagate a single annotated slice to the other slices of a test volume. We demonstrate the effectiveness of our method on various medical image segmentation tasks with different datasets, showing better generalizability across different organs, modalities, and modals. Code is available at this https URL.

[CV-39] Decoupled Sparse Priors Guided Diffusion Compression Model for Point Clouds

Link: https://arxiv.org/abs/2411.13860
Authors: Xiaoge Zhang, Zijie Wu, Mehwish Nasim, Mingtao Feng, Ajmal Mian
Keywords (EN): Lossy compression methods, latent representations unexplored, latent points, Lossy compression, compression methods rely
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Abstract:Lossy compression methods rely on an autoencoder to transform a point cloud into latent points for storage, leaving the inherent redundancy of latent representations unexplored. To reduce redundancy in latent points, we propose a sparse priors guided method that achieves high reconstruction quality, especially at high compression ratios. This is accomplished by a dual-density scheme separately processing the latent points (intended for reconstruction) and the decoupled sparse priors (intended for storage). Our approach features an efficient dual-density data flow that relaxes size constraints on latent points, and hybridizes a progressive conditional diffusion model to encapsulate essential details for reconstruction within the conditions, which are decoupled hierarchically to intra-point and inter-point priors. Specifically, our method encodes the original point cloud into latent points and decoupled sparse priors through separate encoders. Latent points serve as intermediates, while sparse priors act as adaptive conditions. We then employ a progressive attention-based conditional denoiser to generate latent points conditioned on the decoupled priors, allowing the denoiser to dynamically attend to geometric and semantic cues from the priors at each encoding and decoding layer. Additionally, we integrate the local distribution into the arithmetic encoder and decoder to enhance local context modeling of the sparse points. The original point cloud is reconstructed through a point decoder. Compared to state-of-the-art, our method obtains superior rate-distortion trade-off, evidenced by extensive evaluations on the ShapeNet dataset and standard test datasets from MPEG group including 8iVFB, and Owlii.

[CV-40] Dealing with Synthetic Data Contamination in Online Continual Learning NEURIPS’24

Link: https://arxiv.org/abs/2411.13852
Authors: Maorong Wang, Nicolas Michel, Jiafeng Mao, Toshihiko Yamasaki
Keywords (EN): generating high-fidelity realistic, high-fidelity realistic images, generating high-fidelity, high-fidelity realistic, advancement of diffusion-based
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to NeurIPS’24

Abstract:Image generation has shown remarkable results in generating high-fidelity realistic images, in particular with the advancement of diffusion-based models. However, the prevalence of AI-generated images may have side effects for the machine learning community that are not clearly identified. Meanwhile, the success of deep learning in computer vision is driven by the massive dataset collected on the Internet. The extensive quantity of synthetic data being added to the Internet would become an obstacle for future researchers to collect “clean” datasets without AI-generated content. Prior research has shown that using datasets contaminated by synthetic images may result in performance degradation when used for training. In this paper, we investigate the potential impact of contaminated datasets on Online Continual Learning (CL) research. We experimentally show that contaminated datasets might hinder the training of existing online CL methods. Also, we propose Entropy Selection with Real-synthetic similarity Maximization (ESRM), a method to alleviate the performance deterioration caused by synthetic images when training online CL models. Experiments show that our method can significantly alleviate performance deterioration, especially when the contamination is severe. For reproducibility, the source code of our work is available at this https URL.

[CV-41] Multitask Learning for SAR Ship Detection with Gaussian-Mask Joint Segmentation

Link: https://arxiv.org/abs/2411.13847
Authors: Ming Zhao, Xin Zhang, André Kaup
Keywords (EN): synthetic aperture radar, Detecting ships, SAR ship detection, complex surroundings, strong speckle noise
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Detecting ships in synthetic aperture radar (SAR) images is challenging due to strong speckle noise, complex surroundings, and varying scales. This paper proposes MLDet, a multitask learning framework for SAR ship detection, consisting of object detection, speckle suppression, and target segmentation tasks. An angle classification loss with aspect ratio weighting is introduced to improve detection accuracy by addressing angular periodicity and object proportions. The speckle suppression task uses a dual-feature fusion attention mechanism to reduce noise and fuse shallow and denoising features, enhancing robustness. The target segmentation task, leveraging a rotated Gaussian-mask, aids the network in extracting target regions from cluttered backgrounds and improves detection efficiency with pixel-level predictions. The Gaussian-mask ensures ship centers have the highest probabilities, gradually decreasing outward under a Gaussian distribution. Additionally, a weighted rotated boxes fusion (WRBF) strategy combines multi-direction anchor predictions, filtering anchors beyond boundaries or with high overlap but low confidence. Extensive experiments on SSDD+ and HRSID datasets demonstrate the effectiveness and superiority of MLDet.
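The rotated Gaussian-mask described above can be sketched as follows. This is an illustrative construction, not the paper's code: it assumes the mask is an axis-rotated 2D Gaussian centered on the ship, with standard deviations tied to the box size through a hypothetical `scale` parameter.

```python
import math


def rotated_gaussian_mask(height, width, cx, cy, box_w, box_h, angle, scale=4.0):
    """Return a height x width grid of values in (0, 1], peaking at (cx, cy).

    `angle` is the box rotation in radians; `scale` (an assumption) sets the
    standard deviations to box_w / scale and box_h / scale.
    """
    sigma_u, sigma_v = box_w / scale, box_h / scale
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            dx, dy = x - cx, y - cy
            # Rotate pixel offsets into the box-aligned frame.
            u = dx * cos_a + dy * sin_a
            v = -dx * sin_a + dy * cos_a
            row.append(math.exp(-0.5 * ((u / sigma_u) ** 2 + (v / sigma_v) ** 2)))
        mask.append(row)
    return mask


# An 8 x 4 box centered at pixel (7, 7), first unrotated, then rotated by 90 degrees.
mask = rotated_gaussian_mask(height=15, width=15, cx=7, cy=7,
                             box_w=8, box_h=4, angle=0.0)
rotated = rotated_gaussian_mask(height=15, width=15, cx=7, cy=7,
                                box_w=8, box_h=4, angle=math.pi / 2)
print(mask[7][7], round(mask[7][9], 3), round(mask[9][7], 3))
```

The value is exactly 1 at the center and decays more slowly along the box's long axis than along its short axis, so rotating the box rotates the elongated high-probability region with it.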

[CV-42] Detecting Human Artifacts from Text-to-Image Models

Link: https://arxiv.org/abs/2411.13842
Authors: Kaihong Wang, Lingzhi Zhang, Jianming Zhang
Keywords (EN): human, recent advancements, Human Artifact, artifacts, human figures
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract: Despite recent advancements, text-to-image generation models often produce images containing artifacts, especially in human figures. These artifacts appear as poorly generated human bodies, including distorted, missing, or extra body parts, leading to visual inconsistencies with typical human anatomy and greatly impairing overall fidelity. In this study, we address this challenge by curating Human Artifact Dataset (HAD), the first large-scale dataset specifically designed to identify and localize human artifacts. HAD comprises over 37,000 images generated by several popular text-to-image models, annotated for human artifact localization. Using this dataset, we train the Human Artifact Detection Models (HADM), which can identify diverse artifact types across multiple generative domains and demonstrate strong generalization, even on images from unseen generators. Additionally, to further improve generators’ perception of human structural coherence, we use the predictions from our HADM as feedback for diffusion model finetuning. Our experiments confirm a reduction in human artifacts in the resulting model. Furthermore, we showcase a novel application of our HADM in an iterative inpainting framework to correct human artifacts in arbitrary images directly, demonstrating its utility in improving image quality. Our dataset and detection models are available at: this https URL.

[CV-43] Segment Anything in Light Fields for Real-Time Applications via Constrained Prompting

Link: https://arxiv.org/abs/2411.13840
Authors: Nikolai Goncharov, Donald G. Dansereau
Keywords (EN): Segmented light field, computer vision tasks, light field, light field domain, object pose tracking
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract: Segmented light field images can serve as a powerful representation in many computer vision tasks that exploit the geometry and appearance of objects, such as object pose tracking. In the light field domain, segmentation presents an additional objective of recognizing the same segment through all the views. Segment Anything Model 2 (SAM 2) allows producing semantically meaningful segments for monocular images and videos. However, using SAM 2 directly on light fields is highly ineffective due to unexploited constraints. In this work, we present a novel light field segmentation method that adapts SAM 2 to the light field domain without retraining or modifying the model. By utilizing the light field domain constraints, the method produces high-quality and view-consistent light field masks, outperforming the SAM 2 video tracking baseline and running 7 times faster, at real-time speed. We achieve this by exploiting the epipolar geometry cues to propagate the masks between the views, probing the SAM 2 latent space to estimate their occlusion, and further prompting SAM 2 for their refinement.
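As a rough illustration of propagating a mask between views with epipolar cues, the sketch below shifts mask pixels by a disparity proportional to the view offset, which is the standard relation for a regularly sampled light field. It assumes one constant disparity per segment and one particular sign convention. The function `propagate_mask` and its parameters are hypothetical, and the occlusion estimation and SAM 2 prompting steps from the abstract are not modeled.

```python
def propagate_mask(mask_pixels, disparity, du, dv, width, height):
    """Shift mask pixels from the reference view to the view offset by (du, dv).

    mask_pixels: set of (x, y) integer pixel coordinates in the reference view.
    disparity:   assumed constant shift in pixels per unit of view offset.
    """
    shifted = set()
    for x, y in mask_pixels:
        nx = round(x + disparity * du)
        ny = round(y + disparity * dv)
        if 0 <= nx < width and 0 <= ny < height:  # drop pixels leaving the frame
            shifted.add((nx, ny))
    return shifted


reference_mask = {(2, 3), (3, 3), (6, 3)}
neighbor_mask = propagate_mask(reference_mask, disparity=1.5, du=2, dv=0,
                               width=8, height=8)
print(sorted(neighbor_mask))  # [(5, 3), (6, 3)]
```

The pixel at (6, 3) moves to x = 9, outside the 8-pixel-wide frame, so it is discarded. A real pipeline would also need to handle pixels hidden by occluders in the target view.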

[CV-44] CLIPer: Hierarchically Improving Spatial Representation of CLIP for Open-Vocabulary Semantic Segmentation

Link: https://arxiv.org/abs/2411.13836
Authors: Lin Sun, Jiale Cao, Jin Xie, Xiaoheng Jiang, Yanwei Pang
Keywords (EN): Contrastive Language-Image Pre-training, exhibits strong zero-shot, strong zero-shot classification, zero-shot classification ability, pixel-level open-vocabulary semantic
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Homepage and code: this https URL

Abstract: Contrastive Language-Image Pre-training (CLIP) exhibits strong zero-shot classification ability on various image-level tasks, which has motivated research on adapting CLIP to pixel-level open-vocabulary semantic segmentation without additional training. The key is to improve the spatial representation of image-level CLIP, for example by replacing the self-attention map at the last layer with a self-self attention map or an attention map based on a vision foundation model. In this paper, we present a novel hierarchical framework, named CLIPer, that hierarchically improves the spatial representation of CLIP. The proposed CLIPer includes an early-layer fusion module and a fine-grained compensation module. We observe that the embeddings and attention maps at early layers can preserve spatial structural information. Inspired by this, we design the early-layer fusion module to generate segmentation maps with better spatial coherence. Afterwards, we employ a fine-grained compensation module to compensate for the local details using the self-attention maps of a diffusion model. We conduct experiments on seven segmentation datasets. Our proposed CLIPer achieves state-of-the-art performance on these datasets. For instance, using ViT-L, CLIPer achieves mIoU of 69.8% and 43.3% on VOC and COCO Object, outperforming ProxyCLIP by 9.2% and 4.1% respectively.

[CV-45] MagicDriveDiT: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control

Link: https://arxiv.org/abs/2411.13807
Authors: Ruiyuan Gao, Kai Chen, Bo Xiao, Lanqing Hong, Zhenguo Li, Qiang Xu
Keywords (EN): improved video synthesis, greatly improved video, autonomous driving, rapid advancement, advancement of diffusion
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Website: this https URL

Abstract:The rapid advancement of diffusion models has greatly improved video synthesis, especially in controllable video generation, which is essential for applications like autonomous driving. However, existing methods are limited by scalability and how control conditions are integrated, failing to meet the needs for high-resolution and long videos for autonomous driving applications. In this paper, we introduce MagicDriveDiT, a novel approach based on the DiT architecture, and tackle these challenges. Our method enhances scalability through flow matching and employs a progressive training strategy to manage complex scenarios. By incorporating spatial-temporal conditional encoding, MagicDriveDiT achieves precise control over spatial-temporal latents. Comprehensive experiments show its superior performance in generating realistic street scene videos with higher resolution and more frames. MagicDriveDiT significantly improves video generation quality and spatial-temporal controls, expanding its potential applications across various tasks in autonomous driving.

[CV-46] Hugging Rain Man: A Novel Facial Action Units Dataset for Analyzing Atypical Facial Expressions in Children with Autism Spectrum Disorder

Link: https://arxiv.org/abs/2411.13797
Authors: Yanfeng Ji, Shutong Wang, Ruyi Xu, Jingying Chen, Xinzhou Jiang, Zhengyu Deng, Yuxuan Quan, Junpeng Liu
Keywords (EN): Autism Spectrum Disorder, Spectrum Disorder, Autism Spectrum, Children with Autism, exhibit atypical facial
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Portions of the dataset, features, and pretrained models are accessible at: this https URL

Abstract: Children with Autism Spectrum Disorder (ASD) often exhibit atypical facial expressions. However, the specific objective facial features that underlie this subjective perception remain unclear. In this paper, we introduce a novel dataset, Hugging Rain Man (HRM), which includes facial action units (AUs) manually annotated by FACS experts for both children with ASD and typical development (TD). The dataset comprises a rich collection of posed and spontaneous facial expressions, totaling approximately 130,000 frames, along with 22 AUs, 10 Action Descriptors (ADs), and atypicality ratings. A statistical analysis of static images from the HRM reveals significant differences between the ASD and TD groups across multiple AUs and ADs when displaying the same emotional expressions, confirming that participants with ASD tend to demonstrate more irregular and diverse expression patterns. Subsequently, a temporal regression method was presented to analyze atypicality of dynamic sequences, thereby bridging the gap between subjective perception and objective facial characteristics. Furthermore, baseline results for AU detection are provided for future research reference. This work not only contributes to our understanding of the unique facial expression characteristics associated with ASD but also provides potential tools for ASD early screening. Portions of the dataset, features, and pretrained models are accessible at: this https URL.

[CV-47] GalaxyEdit: Large-Scale Image Editing Dataset with Enhanced Diffusion Adapter

Link: https://arxiv.org/abs/2411.13794
Authors: Aniruddha Bala, Rohan Jaiswal, Loay Rashid, Siddharth Roheda
Keywords (EN): requires a huge, huge amount, amount of annotated, Training of large-scale, annotated data
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 6 figures

Abstract: Training of large-scale text-to-image and image-to-image models requires a huge amount of annotated data. While text-to-image datasets are abundant, data available for instruction-based image-to-image tasks like object addition and removal is limited. This is because of several challenges associated with the data generation process, such as significant human effort, limited automation, suboptimal end-to-end models, data diversity constraints and high expenses. We propose an automated data generation pipeline aimed at alleviating such limitations, and introduce GalaxyEdit - a large-scale image editing dataset for add and remove operations. We fine-tune the SD v1.5 model on our dataset and find that our model can successfully handle a broader range of objects and complex editing instructions, outperforming state-of-the-art methods in FID scores by 11.2% and 26.1% for add and remove tasks respectively. Furthermore, in light of on-device usage scenarios, we expand our research to include task-specific lightweight adapters leveraging the ControlNet-xs architecture. While ControlNet-xs excels in canny and depth guided generation, we propose to improve the communication between the control network and U-Net for more intricate add and remove tasks. We achieve this by enhancing ControlNet-xs with non-linear interaction layers based on Volterra filters. Our approach outperforms ControlNet-xs in both add/remove and canny-guided image generation tasks, highlighting the effectiveness of the proposed enhancement.
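For readers unfamiliar with Volterra filters, the sketch below evaluates a generic second-order Volterra expansion: a bias, a linear term, and a quadratic term over all pairwise products of the inputs. It illustrates only the textbook formula. How the paper wires such layers between the control network and the U-Net is not described in the abstract, so the function name and the example weights are purely illustrative.

```python
def volterra_second_order(x, bias, w1, w2):
    """Second-order Volterra response for an input vector x.

    y = bias + sum_i w1[i] * x[i] + sum_i sum_j w2[i][j] * x[i] * x[j]
    """
    linear = sum(w * xi for w, xi in zip(w1, x))
    quadratic = sum(w2[i][j] * x[i] * x[j]
                    for i in range(len(x)) for j in range(len(x)))
    return bias + linear + quadratic


x = [1.0, 2.0]
y = volterra_second_order(x, bias=0.5, w1=[1.0, -1.0],
                          w2=[[0.5, 0.0], [0.25, 1.0]])
print(y)  # 4.5
```

Here the linear part contributes -1.0 and the quadratic part contributes 5.0. The quadratic cross terms are what allow the layer to model multiplicative interactions that a purely linear connection cannot.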

[CV-48] Edge-Cloud Routing for Text-to-Image Model with Token-Level Multi-Metric Prediction

Link: https://arxiv.org/abs/2411.13787
Authors: Zewei Xin, Qinya Li, Chaoyue Niu, Fan Wu
Keywords (EN): substantial size necessitates, size necessitates expensive, impressive generation capabilities, demonstrate impressive generation, models demonstrate impressive
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract: Large text-to-image models demonstrate impressive generation capabilities; however, their substantial size necessitates expensive cloud servers for deployment. Conversely, light-weight models can be deployed on edge devices at lower cost but often with inferior generation quality for complex user prompts. To strike a balance between performance and cost, we propose a routing framework, called RouteT2I, which dynamically selects either the large cloud model or the light-weight edge model for each user prompt. Since generated image quality is challenging to measure directly, RouteT2I establishes multi-dimensional quality metrics, in particular by evaluating the similarity between the generated images and both positive and negative texts that describe each specific quality metric. RouteT2I then predicts the expected quality of the generated images by identifying key tokens in the prompt and comparing their impact on the quality. RouteT2I further introduces the Pareto relative superiority to compare the multi-metric quality of the generated images. Based on this comparison and predefined cost constraints, RouteT2I allocates prompts to either the edge or the cloud. Evaluation reveals that RouteT2I significantly reduces the number of requests to the large cloud model while maintaining high-quality image generation.
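The routing step can be pictured with a small sketch. The abstract does not define the Pareto relative superiority, so the score below is a stand-in assumption: the share of quality metrics on which the cloud model is predicted to win minus the share on which the edge model wins. The cost constraint is modeled as a hypothetical cap, `cloud_budget`, on how many prompts may go to the cloud.

```python
def relative_superiority(cloud_scores, edge_scores):
    """Hypothetical score in [-1, 1]: share of metrics where the cloud model
    is predicted to win minus the share where the edge model wins."""
    wins = sum(c > e for c, e in zip(cloud_scores, edge_scores))
    losses = sum(c < e for c, e in zip(cloud_scores, edge_scores))
    return (wins - losses) / len(cloud_scores)


def route(prompts, cloud_budget):
    """Send at most `cloud_budget` prompts to the cloud, preferring those where
    the cloud model has the largest positive predicted advantage."""
    scores = {p["id"]: relative_superiority(p["cloud"], p["edge"]) for p in prompts}
    ranked = sorted(scores, key=scores.get, reverse=True)
    cloud_ids = {pid for pid in ranked[:cloud_budget] if scores[pid] > 0}
    return {pid: ("cloud" if pid in cloud_ids else "edge") for pid in scores}


# Predicted per-metric quality for each prompt (made-up numbers).
prompts = [
    {"id": "simple", "cloud": [0.80, 0.70, 0.90], "edge": [0.80, 0.72, 0.90]},
    {"id": "complex", "cloud": [0.90, 0.85, 0.80], "edge": [0.60, 0.55, 0.70]},
    {"id": "medium", "cloud": [0.75, 0.60, 0.80], "edge": [0.70, 0.65, 0.80]},
]
decisions = route(prompts, cloud_budget=1)
print(decisions)
```

Only the prompt where the cloud model dominates on every metric is sent to the cloud. Prompts with no predicted net advantage stay on the edge even when budget remains.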

[CV-49] Segment Any Class (SAC): Multi-Class Few-Shot Semantic Segmentation via Class Region Proposals

Link: https://arxiv.org/abs/2411.13774
Authors: Hussni Mohd Zakir, Eric Tatt Wei Ho
Keywords (EN): prompt-driven framework, SAM, SAC, SAM generates class-agnostic, vision foundation model
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 2 figures, 3 tables

Abstract:The Segment-Anything Model (SAM) is a vision foundation model for segmentation with a prompt-driven framework. SAM generates class-agnostic masks based on user-specified instance-referring prompts. However, adapting SAM for automated segmentation – where manual input is absent – of specific object classes often requires additional model training. We present Segment Any Class (SAC), a novel, training-free approach that task-adapts SAM for Multi-class segmentation. SAC generates Class-Region Proposals (CRP) on query images which allows us to automatically generate class-aware prompts on probable locations of class instances. CRPs are derived from elementary intra-class and inter-class feature distinctions without any additional training. Our method is versatile, accommodating any N-way K-shot configurations for the multi-class few-shot semantic segmentation (FSS) task. Unlike gradient-learning adaptation of generalist models which risk the loss of generalization and potentially suffer from catastrophic forgetting, SAC solely utilizes automated prompting and achieves superior results over state-of-the-art methods on the COCO-20i benchmark, particularly excelling in high N-way class scenarios. SAC is an interesting demonstration of a prompt-only approach to adapting foundation models for novel tasks with small, limited datasets without any modifications to the foundation model itself. This method offers interesting benefits such as intrinsic immunity to concept or feature loss and rapid, online task adaptation of foundation models.

[CV-50] Learning to Reason Iteratively and Parallelly for Complex Visual Reasoning Scenarios NEURIPS2024

链接: https://arxiv.org/abs/2411.13754
作者: Shantanu Jaiswal,Debaditya Roy,Basura Fernando,Cheston Tan
关键词-EN: requires compositional multi-step, compositional multi-step processing, complex VQA scenarios, Complex visual reasoning, address complex VQA
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: NeurIPS 2024 camera ready; source code to be released at: this https URL

点击查看摘要

Abstract:Complex visual reasoning and question answering (VQA) is a challenging task that requires compositional multi-step processing and higher-level reasoning capabilities beyond the immediate recognition and localization of objects and events. Here, we introduce a fully neural Iterative and Parallel Reasoning Mechanism (IPRM) that combines two distinct forms of computation – iterative and parallel – to better address complex VQA scenarios. Specifically, IPRM’s “iterative” computation facilitates compositional step-by-step reasoning for scenarios wherein individual operations need to be computed, stored, and recalled dynamically (e.g. when computing the query “determine the color of pen to the left of the child in red t-shirt sitting at the white table”). Meanwhile, its “parallel” computation allows for the simultaneous exploration of different reasoning paths and benefits more robust and efficient execution of operations that are mutually independent (e.g. when counting individual colors for the query: “determine the maximum occurring color amongst all t-shirts”). We design IPRM as a lightweight and fully-differentiable neural module that can be conveniently applied to both transformer and non-transformer vision-language backbones. It notably outperforms prior task-specific methods and transformer-based attention modules across various image and video VQA benchmarks testing distinct complex reasoning capabilities such as compositional spatiotemporal reasoning (AGQA), situational reasoning (STAR), multi-hop reasoning generalization (CLEVR-Humans) and causal event linking (CLEVRER-Humans). Further, IPRM’s internal computations can be visualized across reasoning steps, aiding interpretability and diagnosis of its errors.

[CV-51] FAST-Splat: Fast Ambiguity-Free Semantics Transfer in Gaussian Splatting

链接: https://arxiv.org/abs/2411.13753
作者: Ola Shorinwa,Jiankai Sun,Mac Schwager
关键词-EN: semantic Gaussian Splatting, Gaussian Splatting, Gaussian Splatting methods, Gaussian Splatting scene, semantic Gaussian
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We present FAST-Splat for fast, ambiguity-free semantic Gaussian Splatting, which seeks to address the main limitations of existing semantic Gaussian Splatting methods, namely: slow training and rendering speeds; high memory usage; and ambiguous semantic object localization. In deriving FAST-Splat, we formulate open-vocabulary semantic Gaussian Splatting as the problem of extending closed-set semantic distillation to the open-set (open-vocabulary) setting, enabling FAST-Splat to provide precise semantic object localization results, even when prompted with ambiguous user-provided natural-language queries. Further, by exploiting the explicit form of the Gaussian Splatting scene representation to the fullest extent, FAST-Splat retains the remarkable training and rendering speeds of Gaussian Splatting. Specifically, while existing semantic Gaussian Splatting methods distill semantics into a separate neural field or utilize neural models for dimensionality reduction, FAST-Splat directly augments each Gaussian with specific semantic codes, preserving the training, rendering, and memory-usage advantages of Gaussian Splatting over neural field methods. These Gaussian-specific semantic codes, together with a hash-table, enable semantic similarity to be measured with open-vocabulary user prompts and further enable FAST-Splat to respond with unambiguous semantic object labels and 3D masks, unlike prior methods. In experiments, we demonstrate that FAST-Splat is 4x to 6x faster to train with a 13x faster data pre-processing step, achieves between 18x and 75x faster rendering speeds, and requires about 3x smaller GPU memory, compared to the best-competing semantic Gaussian Splatting methods. Further, FAST-Splat achieves relatively similar or better semantic segmentation performance compared to existing methods. After the review period, we will provide links to the project website and the codebase.

[CV-52] Delta-Influence: Unlearning Poisons via Influence Functions NEURIPS2024

链接: https://arxiv.org/abs/2411.13731
作者: Wenjie Li,Jiawei Li,Christian Schroeder de Witt,Ameya Prabhu,Amartya Sanyal
关键词-EN: Addressing data integrity, machine learning models, poisoned training data, Addressing data, influence
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted at NeurIPS Workshop on Attributing Model Behavior at Scale (ATTRIB @ NeurIPS 2024)

点击查看摘要

Abstract:Addressing data integrity challenges, such as unlearning the effects of data poisoning after model training, is necessary for the reliable deployment of machine learning models. State-of-the-art influence functions, such as EK-FAC, often fail to accurately attribute abnormal model behavior to the specific poisoned training data responsible for the data poisoning attack. In addition, traditional unlearning algorithms often struggle to effectively remove the influence of poisoned samples, particularly when only a few affected examples can be identified. To address these challenges, we introduce Δ-Influence, a novel approach that leverages influence functions to trace abnormal model behavior back to the responsible poisoned training data using as little as just one poisoned test example. Δ-Influence applies data transformations that sever the link between poisoned training data and compromised test points without significantly affecting clean data. This allows Δ-Influence to detect large negative shifts in influence scores following data transformations, a phenomenon we term as influence collapse, thereby accurately identifying poisoned training data. Unlearning this subset, e.g. through retraining, effectively eliminates the data poisoning. We validate our method across three vision-based poisoning attacks and three datasets, benchmarking against four detection algorithms and five unlearning strategies. We show that Δ-Influence consistently achieves the best unlearning across all settings, showing the promise of influence functions for corrective unlearning. Our code is publicly available at: this https URL
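
The "influence collapse" test described in this abstract can be sketched as a simple thresholding step. The sketch assumes influence scores for each training point have already been computed before and after a single data transformation. The function name, the single-transformation setup, and the threshold are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Sequence


def flag_influence_collapse(before: Sequence[float], after: Sequence[float],
                            threshold: float) -> List[int]:
    """Return indices of training points whose influence score on the poisoned
    test example drops by more than `threshold` after the transformation."""
    return [i for i, (b, a) in enumerate(zip(before, after)) if a - b < -threshold]
```

The flagged indices form the candidate poisoned subset, which the abstract says can then be unlearned, for example by retraining without it.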

[CV-53] Developing Normative Gait Cycle Parameters for Clinical Analysis Using Human Pose Estimation

链接: https://arxiv.org/abs/2411.13716
作者: Rahm Ranjan,David Ahmedt-Aristizabal,Mohammad Ali Armin,Juno Kim
关键词-EN: analyse complex movements, RGB video data, RGB video, computer vision, emerging field
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Gait analysis using computer vision is an emerging field in AI, offering clinicians an objective, multi-feature approach to analyse complex movements. Despite its promise, current applications using RGB video data alone are limited in measuring clinically relevant spatial and temporal kinematics and establishing normative parameters essential for identifying movement abnormalities within a gait cycle. This paper presents a data-driven method using RGB video data and 2D human pose estimation for developing normative kinematic gait parameters. By analysing joint angles, an established kinematic measure in biomechanics and clinical practice, we aim to enhance gait analysis capabilities and improve explainability. Our cycle-wise kinematic analysis enables clinicians to simultaneously measure and compare multiple joint angles, assessing individuals against a normative population using just monocular RGB video. This approach expands clinical capacity, supports objective decision-making, and automates the identification of specific spatial and temporal deviations and abnormalities within the gait cycle.
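
Joint angles from 2D pose keypoints, the kinematic measure this abstract builds on, can be computed as the angle between two limb segments meeting at a joint. The sketch below is a generic geometric calculation. The keypoint names (hip, knee, ankle) and the coordinates in the usage note are hypothetical and are not taken from the paper.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def joint_angle(a: Point, b: Point, c: Point) -> float:
    """Angle at joint `b`, in degrees within [0, 180], formed by segments b->a and b->c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    cross = v1[0] * v2[1] - v1[1] * v2[0]
    # atan2 on (|cross|, dot) avoids the numerical issues of acos near 0 and 180 degrees.
    return math.degrees(math.atan2(abs(cross), dot))
```

For example, with a hip at (0, 0), a knee at (0, 1), and an ankle at (0, 2) in image coordinates, the leg is fully extended and the knee angle is 180 degrees. Evaluating this per frame gives a joint-angle curve over the gait cycle.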

[CV-54] Decompose and Leverage Preferences from Expert Models for Improving Trustworthiness of MLLMs

链接: https://arxiv.org/abs/2411.13697
作者: Rui Cao,Yuming Jiang,Michael Schlichtkrull,Andreas Vlachos
关键词-EN: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, Language Models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) can enhance trustworthiness by aligning with human preferences. As human preference labeling is laborious, recent works employ evaluation models for assessing MLLMs’ responses, using the model-based assessments to automate preference dataset construction. This approach, however, faces challenges with MLLMs’ lengthy and compositional responses, which often require diverse reasoning skills that a single evaluation model may not fully possess. Additionally, most existing methods rely on closed-source models as evaluators. To address these limitations, we propose DecompGen, a decomposable framework that uses an ensemble of open-sourced expert models. DecompGen breaks down each response into atomic verification tasks, assigning each task to an appropriate expert model to generate fine-grained assessments. The DecompGen feedback is used to automatically construct our preference dataset, DGPref. MLLMs aligned with DGPref via preference learning show improvements in trustworthiness, demonstrating the effectiveness of DecompGen.

[CV-55] Extending Video Masked Autoencoders to 128 frames NEURIPS’24

链接: https://arxiv.org/abs/2411.13683
作者: Nitesh Bharadwaj Gundavarapu,Luke Friedman,Raghav Goyal,Chaitra Hegde,Eirikur Agustsson,Sagar M. Waghmare,Mikhail Sirotenko,Ming-Hsuan Yang,Tobias Weyand,Boqing Gong,Leonid Sigal
关键词-EN: witnessed significant progress, foundation models demonstrating, models demonstrating strong, recent video foundation, video foundation models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 10.5 pages of main paper, 25 pages total, 4 figures and 10 tables. To appear in NeurIPS’24

点击查看摘要

Abstract:Video understanding has witnessed significant progress with recent video foundation models demonstrating strong performance owing to self-supervised pre-training objectives; Masked Autoencoders (MAE) being the design of choice. Nevertheless, the majority of prior works that leverage MAE pre-training have focused on relatively short video representations (16 / 32 frames in length) largely due to hardware memory and compute limitations that scale poorly with video length due to the dense memory-intensive self-attention decoding. One natural strategy to address these challenges is to subsample tokens to reconstruct during decoding (or decoder masking). In this work, we propose an effective strategy for prioritizing tokens which allows training on longer video sequences (128 frames) and outperforms the more typical random and uniform masking strategies. The core of our approach is an adaptive decoder masking strategy that prioritizes the most important tokens and uses quantized tokens as reconstruction objectives. Our adaptive strategy leverages a powerful MAGVIT-based tokenizer that jointly learns the tokens and their priority. We validate our design choices through exhaustive ablations and observe improved performance of the resulting long-video (128 frames) encoders over short-video (32 frames) counterparts. With our long-video masked autoencoder (LVMAE) strategy, we surpass state-of-the-art on Diving48 by 3.9 points and EPIC-Kitchens-100 verb classification by 2.5 points while relying on a simple core architecture and video-only pre-training (unlike some of the prior works that require millions of labeled video-text pairs or specialized encoders).
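
Decoder masking by priority, as described in this abstract, amounts to keeping only the highest-priority tokens as reconstruction targets. The sketch below assumes the per-token priorities are already available as floats (in the paper they are learned jointly with a MAGVIT-based tokenizer). The function name and the `keep_ratio` parameter are hypothetical.

```python
from typing import List, Sequence


def select_decoder_tokens(priorities: Sequence[float], keep_ratio: float) -> List[int]:
    """Return the sorted indices of the tokens the decoder should reconstruct,
    keeping the top `keep_ratio` fraction by priority (at least one token)."""
    k = max(1, int(len(priorities) * keep_ratio))
    ranked = sorted(range(len(priorities)), key=lambda i: priorities[i], reverse=True)
    return sorted(ranked[:k])
```

A random-masking baseline would replace the ranking step with a uniform random sample of `k` indices, which is the kind of strategy the abstract compares against.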

[CV-56] ID-Patch: Robust ID Association for Group Photo Personalization ATC

链接: https://arxiv.org/abs/2411.13632
作者: Yimeng Zhang,Tiancheng Zhi,Jing Liu,Shen Sang,Liming Jiang,Qing Yan,Sijia Liu,Linjie Luo
关键词-EN: immense creative potential, synthesize personalized group, personalized group photos, offers immense creative, identity offers immense
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project Page is: this https URL

点击查看摘要

Abstract:The ability to synthesize personalized group photos and specify the positions of each identity offers immense creative potential. While such imagery can be visually appealing, it presents significant challenges for existing technologies. A persistent issue is identity (ID) leakage, where injected facial features interfere with one another, resulting in low face resemblance, incorrect positioning, and visual artifacts. Existing methods suffer from limitations such as the reliance on segmentation models, increased runtime, or a high probability of ID leakage. To address these challenges, we propose ID-Patch, a novel method that provides robust association between identities and 2D positions. Our approach generates an ID patch and ID embeddings from the same facial features: the ID patch is positioned on the conditional image for precise spatial control, while the ID embeddings integrate with text embeddings to ensure high resemblance. Experimental results demonstrate that ID-Patch surpasses baseline methods across metrics, such as face ID resemblance, ID-position association accuracy, and generation efficiency. Project Page is: this https URL

[CV-57] Sparse Input View Synthesis: 3D Representations and Reliable Priors

链接: https://arxiv.org/abs/2411.13631
作者: Nagabhushan Somraj
关键词-EN: view synthesis refers, synthesis refers, Abstract, viewpoints, problem
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: PhD Thesis of Nagabhushan S N, Dept of ECE, Indian Institute of Science (IISc); Advisor: Dr. Rajiv Soundararajan; Thesis Reviewers: Dr. Kaushik Mitra (IIT Madras), Dr. Aniket Bera (Purdue University); Submitted: May 2024; Accepted and Defended: Sep 2024; Abstract condensed, please check the PDF for full abstract

点击查看摘要

Abstract:Novel view synthesis refers to the problem of synthesizing novel viewpoints of a scene given the images from a few viewpoints. This is a fundamental problem in computer vision and graphics, and enables a vast variety of applications such as meta-verse, free-view watching of events, video gaming, video stabilization and video compression. Recent 3D representations such as radiance fields and multi-plane images significantly improve the quality of images rendered from novel viewpoints. However, these models require a dense sampling of input views for high quality renders. Their performance goes down significantly when only a few input views are available. In this thesis, we focus on the sparse input novel view synthesis problem for both static and dynamic scenes.

[CV-58] MambaDETR: Query-based Temporal Modeling using State Space Model for Multi-View 3D Object Detection

链接: https://arxiv.org/abs/2411.13628
作者: Tong Ning,Ke Lu,Xirui Jiang,Jian Xue
关键词-EN: made great progress, great progress recently, Utilizing temporal information, Utilizing temporal, temporal fusion methods
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Utilizing temporal information to improve the performance of 3D detection has made great progress recently in the field of autonomous driving. Traditional transformer-based temporal fusion methods suffer from quadratic computational cost and information decay as the length of the frame sequence increases. In this paper, we propose a novel method called MambaDETR, whose main idea is to implement temporal fusion in the efficient state space. Moreover, we design a Motion Elimination module to remove the relatively static objects for temporal fusion. On the standard nuScenes benchmark, our proposed MambaDETR achieves remarkable results in the 3D object detection task, exhibiting state-of-the-art performance among existing temporal fusion methods.

[CV-59] Principles of Visual Tokens for Efficient Video Understanding

链接: https://arxiv.org/abs/2411.13626
作者: Xinyue Hao,Gen Li,Shreyank N Gowda,Robert B Fisher,Jonathan Huang,Anurag Arnab,Laura Sevilla-Lara
关键词-EN: made huge strides, recent years, relying largely, understanding has made, made huge
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Video understanding has made huge strides in recent years, relying largely on the power of the transformer architecture. As this architecture is notoriously expensive and video is highly redundant, research into improving efficiency has become particularly relevant. This has led to many creative solutions, including token merging and token selection. While most methods succeed in reducing the cost of the model and maintaining accuracy, an interesting pattern arises: most methods do not outperform the random sampling baseline. In this paper we take a closer look at this phenomenon and make several observations. First, we develop an oracle for the value of tokens which exposes a clear Pareto distribution where most tokens have remarkably low value, and just a few carry most of the perceptual information. Second, we analyze why this oracle is extremely hard to learn, as it does not consistently coincide with visual cues. Third, we observe that easy videos need fewer tokens to maintain accuracy. We build on these and further insights to propose a lightweight video model we call LITE that can select a small number of tokens effectively, outperforming state-of-the-art and existing baselines across datasets (Kinetics400 and Something-Something-V2) in the challenging trade-off of computation (GFLOPs) vs accuracy.

[CV-60] Unsupervised Foundation Model-Agnostic Slide-Level Representation Learning

链接: https://arxiv.org/abs/2411.13623
作者: Tim Lenz,Peter Neidlinger,Marta Ligero,Georg Wölflein,Marko van Treeck,Jakob Nikolas Kather
关键词-EN: pathology whole-slide images, Multiple Instance Learning, Instance Learning, Multiple Instance, whole-slide images
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Representation learning of pathology whole-slide images (WSIs) has primarily relied on weak supervision with Multiple Instance Learning (MIL). This approach leads to slide representations highly tailored to a specific clinical task. Self-supervised learning (SSL) has been successfully applied to train histopathology foundation models (FMs) for patch embedding generation. However, generating patient or slide level embeddings remains challenging. Existing approaches for slide representation learning extend the principles of SSL from patch level learning to entire slides by aligning different augmentations of the slide or by utilizing multimodal data. By integrating tile embeddings from multiple FMs, we propose a new single modality SSL method in feature space that generates useful slide representations. Our contrastive pretraining strategy, called COBRA, employs multiple FMs and an architecture based on Mamba-2. COBRA exceeds performance of state-of-the-art slide encoders on four different public CPTAC cohorts on average by at least +3.8% AUC, despite only being pretrained on 3048 WSIs from TCGA. Additionally, COBRA is readily compatible at inference time with previously unseen feature extractors.

[CV-61] Robust SG-NeRF: Robust Scene Graph Aided Neural Surface Reconstruction

链接: https://arxiv.org/abs/2411.13620
作者: Yi Gu,Dongjun Ye,Zhaorui Wang,Jiaxu Wang,Jiahang Cao,Renjing Xu
关键词-EN: Neural surface reconstruction, Neural surface, surface reconstruction relies, reconstruction relies heavily, relies heavily
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: this https URL

点击查看摘要

Abstract:Neural surface reconstruction relies heavily on accurate camera poses as input. Despite utilizing advanced pose estimators like COLMAP or ARKit, camera poses can still be noisy. Existing pose-NeRF joint optimization methods handle poses with small noise (inliers) effectively but struggle with large noise (outliers), such as mirrored poses. In this work, we focus on mitigating the impact of outlier poses. Our method integrates an inlier-outlier confidence estimation scheme, leveraging scene graph information gathered during the data preparation phase. Unlike previous works directly using rendering metrics as the reference, we employ a detached color network that omits the viewing direction as input to minimize the impact caused by shape-radiance ambiguities. This enhanced confidence updating strategy effectively differentiates between inlier and outlier poses, allowing us to sample more rays from inlier poses to construct more reliable radiance fields. Additionally, we introduce a re-projection loss based on the current Signed Distance Function (SDF) and pose estimations, strengthening the constraints between matching image pairs. For outlier poses, we adopt a Monte Carlo re-localization method to find better solutions. We also devise a scene graph updating strategy to provide more accurate information throughout the training process. We validate our approach on the SG-NeRF and DTU datasets. Experimental results on various datasets demonstrate that our methods can consistently improve the reconstruction qualities and pose accuracies.

[CV-62] Video2BEV: Transforming Drone Videos to BEVs for Video-based Geo-localization

链接: https://arxiv.org/abs/2411.13610
作者: Hao Ju,Zhedong Zheng
关键词-EN: single drone-view snapshot, visual geo-localization predominantly, geo-localization predominantly adopt, drone visual geo-localization, predominantly adopt
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Existing approaches to drone visual geo-localization predominantly adopt the image-based setting, where a single drone-view snapshot is matched with images from other platforms. Such task formulation, however, underutilizes the inherent video output of the drone and is sensitive to occlusions and environmental constraints. To address these limitations, we formulate a new video-based drone geo-localization task and propose the Video2BEV paradigm. This paradigm transforms the video into a Bird’s Eye View (BEV), simplifying the subsequent matching process. In particular, we employ Gaussian Splatting to reconstruct a 3D scene and obtain the BEV projection. Different from the existing transform methods, e.g., polar transform, our BEVs preserve more fine-grained details without significant distortion. To further improve model scalability toward diverse BEVs and satellite figures, our Video2BEV paradigm also incorporates a diffusion-based module for generating hard negative samples, which facilitates discriminative feature learning. To validate our approach, we introduce UniV, a new video-based geo-localization dataset that extends the image-based University-1652 dataset. UniV features flight paths at 30° and 45° elevation angles with increased frame rates of up to 10 frames per second (FPS). Extensive experiments on the UniV dataset show that our Video2BEV paradigm achieves competitive recall rates and outperforms conventional video-based methods. Compared to other methods, our proposed approach exhibits robustness at lower elevations with more occlusions.

[CV-63] What You See Is What Matters: A Novel Visual and Physics-Based Metric for Evaluating Video Generation Quality

链接: https://arxiv.org/abs/2411.13609
作者: Zihan Wang,Songlin Li,Lingyan Hao,Bowen Song,Xinyu Hu
关键词-EN: Fréchet Video Distance, models advance rapidly, generation models advance, advance rapidly, increasingly critical
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:As video generation models advance rapidly, assessing the quality of generated videos has become increasingly critical. Existing metrics, such as Fréchet Video Distance (FVD), Inception Score (IS), and ClipSim, measure quality primarily in latent space rather than from a human visual perspective, often overlooking key aspects like appearance and motion consistency to physical laws. In this paper, we propose a novel metric, VAMP (Visual Appearance and Motion Plausibility), that evaluates both the visual appearance and physical plausibility of generated videos. VAMP is composed of two main components: an appearance score, which assesses color, shape, and texture consistency across frames, and a motion score, which evaluates the realism of object movements. We validate VAMP through two experiments: corrupted video evaluation and generated video evaluation. In the corrupted video evaluation, we introduce various types of corruptions into real videos and measure the correlation between corruption severity and VAMP scores. In the generated video evaluation, we use state-of-the-art models to generate videos from carefully designed prompts and compare VAMP’s performance to human evaluators’ rankings. Our results demonstrate that VAMP effectively captures both visual fidelity and temporal consistency, offering a more comprehensive evaluation of video quality than traditional methods.
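
The corrupted-video experiment in this abstract relies on measuring correlation between corruption severity and metric scores. The abstract does not say which correlation coefficient is used, so the sketch below assumes Pearson correlation. The function name is illustrative, and any severity levels or scores passed to it would be hypothetical inputs, not results from the paper.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

A quality metric that behaves as intended should yield a strongly negative coefficient here, since scores should fall as corruption severity rises.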

[CV-64] VioPose: Violin Performance 4D Pose Estimation by Hierarchical Audiovisual Inference WACV2025

链接: https://arxiv.org/abs/2411.13607
作者: Seong Jong Yoo,Snehesh Shrestha,Irina Muresanu,Cornelia Fermüller
关键词-EN: Musicians delicately control, Musicians delicately, delicately control, control their bodies, bodies to generate
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by WACV 2025 in Round 1. First two authors contributed equally

点击查看摘要

Abstract:Musicians delicately control their bodies to generate music. Sometimes, their motions are too subtle to be captured by the human eye. To analyze how they move to produce the music, we need to estimate precise 4D human pose (3D pose over time). However, current state-of-the-art (SoTA) visual pose estimation algorithms struggle to produce accurate monocular 4D poses because of occlusions, partial views, and human-object interactions. They are limited by the viewing angle, pixel density, and sampling rate of the cameras and fail to estimate fast and subtle movements, such as in the musical effect of vibrato. We leverage the direct causal relationship between the music produced and the human motions creating them to address these challenges. We propose VioPose: a novel multimodal network that hierarchically estimates dynamics. High-level features are cascaded to low-level features and integrated into Bayesian updates. Our architecture is shown to produce accurate pose sequences, facilitating precise motion analysis, and outperforms SoTA. As part of this work, we collected the largest and the most diverse calibrated violin-playing dataset, including video, sound, and 3D motion capture poses. The project page is available at this https URL.

[CV-65] Towards Accessible Learning: Deep Learning-Based Potential Dysgraphia Detection and OCR for Potentially Dysgraphic Handwriting

链接: https://arxiv.org/abs/2411.13595
作者: Vydeki D,Divyansh Bhandari,Pranav Pratap Patil,Aarush Anand Kulkarni
关键词-EN: affects handwriting abilities, making it challenging, legibly and consistently, disorder that affects, write legibly
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Dysgraphia is a learning disorder that affects handwriting abilities, making it challenging for children to write legibly and consistently. Early detection and monitoring are crucial for providing timely support and interventions. This study applies deep learning techniques to address the dual tasks of dysgraphia detection and optical character recognition (OCR) on handwriting samples from children with potential dysgraphic symptoms. Using a dataset of handwritten samples from Malaysian schoolchildren, we developed a custom Convolutional Neural Network (CNN) model, alongside VGG16 and ResNet50, to classify handwriting as dysgraphic or non-dysgraphic. The custom CNN model outperformed the pre-trained models, achieving a test accuracy of 91.8% with high precision, recall, and AUC, demonstrating its robustness in identifying dysgraphic handwriting features. Additionally, an OCR pipeline was created to segment and recognize individual characters in dysgraphic handwriting, achieving a character recognition accuracy of approximately 43.5%. This research highlights the potential of deep learning in supporting dysgraphia assessment, laying a foundation for tools that could assist educators and clinicians in identifying dysgraphia and tracking handwriting progress over time. The findings contribute to advancements in assistive technologies for learning disabilities, offering hope for more accessible and accurate diagnostic tools in educational and clinical settings.

[CV-66] Deep Feature Response Discriminative Calibration

链接: https://arxiv.org/abs/2411.13582
作者: Wenxiang Xu,Tian Qiu,Linyun Zhou,Zunlei Feng,Mingli Song,Huiqiong Wang
关键词-EN: Deep neural networks, Deep neural, numerous applications, neural feature response, Deep
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Deep neural networks (DNNs) have numerous applications across various domains. Several optimization techniques, such as ResNet and SENet, have been proposed to improve model accuracy. These techniques improve the model performance by adjusting or calibrating feature responses according to a uniform standard. However, they lack the discriminative calibration for different features, thereby introducing limitations in the model output. Therefore, we propose a method that discriminatively calibrates feature responses. The preliminary experimental results indicate that the neural feature response follows a Gaussian distribution. Consequently, we compute confidence values by employing the Gaussian probability density function, and then integrate these values with the original response values. The objective of this integration is to improve the feature discriminability of the neural feature response. Based on the calibration values, we propose a plugin-based calibration module incorporated into a modified ResNet architecture, termed Response Calibration Networks (ResCNet). Extensive experiments on datasets like CIFAR-10, CIFAR-100, SVHN, and ImageNet demonstrate the effectiveness of the proposed approach. The developed code is publicly available at this https URL.
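
The calibration step described in this abstract (a Gaussian-density confidence combined with the original response) can be sketched as follows. The abstract does not specify how the confidence is integrated, so the multiplication used here is an assumption. The function names are hypothetical, and this is not the ResCNet module itself.

```python
import math
import statistics
from typing import List, Sequence


def gaussian_pdf(x: float, mu: float, sigma: float) -> float:
    """Density of a normal distribution N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def calibrate(responses: Sequence[float]) -> List[float]:
    """Scale each feature response by its Gaussian confidence (assumed rule),
    with mean and standard deviation estimated from the responses themselves."""
    mu = statistics.fmean(responses)
    sigma = statistics.pstdev(responses) or 1.0  # fall back to 1.0 for constant inputs
    return [r * gaussian_pdf(r, mu, sigma) for r in responses]
```

With this rule, responses far from the mean receive a low confidence and are damped, while responses near the mean keep more of their magnitude. Other integration rules, such as addition or gating, would fit the abstract's description equally well.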

[CV-67] Public Health Advocacy Dataset: A Dataset of Tobacco Usage Videos from Social Media

链接: https://arxiv.org/abs/2411.13572
作者: Naga VS Raviteja Chappa,Charlotte McCormick,Susana Rodriguez Gongora,Page Daniel Dobbs,Khoa Luu
关键词-EN: social media platforms, Health Advocacy Dataset, Public Health Advocacy, Advocacy Dataset, Health Advocacy
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Under review at International Journal of Computer Vision (IJCV); 29 figures, 5 figures;

点击查看摘要

Abstract:The Public Health Advocacy Dataset (PHAD) is a comprehensive collection of 5,730 videos related to tobacco products sourced from social media platforms like TikTok and YouTube. This dataset encompasses 4.3 million frames and includes detailed metadata such as user engagement metrics, video descriptions, and search keywords. This is the first dataset with these features providing a valuable resource for analyzing tobacco-related content and its impact. Our research employs a two-stage classification approach, incorporating a Vision-Language (VL) Encoder, demonstrating superior performance in accurately categorizing various types of tobacco products and usage scenarios. The analysis reveals significant user engagement trends, particularly with vaping and e-cigarette content, highlighting areas for targeted public health interventions. The PHAD addresses the need for multi-modal data in public health research, offering insights that can inform regulatory policies and public health strategies. This dataset is a crucial step towards understanding and mitigating the impact of tobacco usage, ensuring that public health efforts are more inclusive and effective.

[CV-68] Multimodal 3D Brain Tumor Segmentation with Adversarial Training and Conditional Random Field

Link: https://arxiv.org/abs/2411.14418
Authors: Lan Jiang,Yuchao Zheng,Miao Yu,Haiqing Zhang,Fatemah Aladwani,Alessandro Perelli
Keywords: Accurate brain tumor, challenging task due, great individual differences, Volume Generative Adversarial, Generative Adversarial Network
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*Comments: 13 pages, 7 figures, Annual Conference on Medical Image Understanding and Analysis (MIUA) 2024

Abstract:Accurate brain tumor segmentation remains a challenging task due to the structural complexity and great individual differences of gliomas. Leveraging the pre-eminent detail resilience of CRF and the spatial feature extraction capacity of V-net, we propose a multimodal 3D Volume Generative Adversarial Network (3D-vGAN) for precise segmentation. The model utilizes Pseudo-3D for V-net improvement, adds a conditional random field (CRF) after the generator, and uses the original image as supplemental guidance. Results on the BraTS-2018 dataset show that 3D-vGAN outperforms classical segmentation models, including U-net, GAN, FCN and 3D V-net, reaching specificity over 99.8%.

[CV-69] Adversarial Poisoning Attack on Quantum Machine Learning Models

Link: https://arxiv.org/abs/2411.14412
Authors: Satwik Kundu,Swaroop Ghosh
Keywords: Quantum Machine Learning, Machine Learning, potential security risks, Quantum Machine, QML
Categories: Quantum Physics (quant-ph); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Abstract:With the growing interest in Quantum Machine Learning (QML) and the increasing availability of quantum computers through cloud providers, addressing the potential security risks associated with QML has become an urgent priority. One key concern in the QML domain is the threat of data poisoning attacks in the current quantum cloud setting. Adversarial access to training data could severely compromise the integrity and availability of QML models. Classical data poisoning techniques require significant knowledge and training to generate poisoned data, and lack noise resilience, making them ineffective for QML models in the Noisy Intermediate Scale Quantum (NISQ) era. In this work, we first propose a simple yet effective technique to measure intra-class encoder state similarity (ESS) by analyzing the outputs of encoding circuits. Leveraging this approach, we introduce a quantum indiscriminate data poisoning attack, QUID. Through extensive experiments conducted in both noiseless and noisy environments (e.g., IBM_Brisbane’s noise), across various architectures and datasets, QUID achieves up to 92% accuracy degradation in model performance compared to baseline models and up to 75% accuracy degradation compared to random label-flipping. We also tested QUID against state-of-the-art classical defenses, with accuracy degradation still exceeding 50% , demonstrating its effectiveness. This work represents the first attempt to reevaluate data poisoning attacks in the context of QML.

[CV-70] Enhancing Diagnostic Precision in Gastric Bleeding through Automated Lesion Segmentation: A Deep DuS-KFCM Approach

Link: https://arxiv.org/abs/2411.14385
Authors: Xian-Xian Liu,Mingkun Xu,Yuanyuan Wei,Huafeng Qin,Qun Song,Simon Fong,Feng Tien,Wei Luo,Juntao Gao,Zhihua Zhang,Shirley Siu
Keywords: life-saving medical procedures, Kernelized Constrained Fuzzy, Spatial Kernelized Fuzzy, Dual Spatial Kernelized, Spatial Kernelized Constrained
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Abstract:Timely and precise classification and segmentation of gastric bleeding in endoscopic imagery are pivotal for the rapid diagnosis and intervention of gastric complications, which is critical in life-saving medical procedures. Traditional methods grapple with the challenge posed by the indistinguishable intensity values of bleeding tissues adjacent to other gastric structures. Our study seeks to revolutionize this domain by introducing a novel deep learning model, the Dual Spatial Kernelized Constrained Fuzzy C-Means (Deep DuS-KFCM) clustering algorithm. This Hybrid Neuro-Fuzzy system synergizes Neural Networks with Fuzzy Logic to offer a highly precise and efficient identification of bleeding regions. Implementing a two-fold coarse-to-fine strategy for segmentation, this model initially employs the Spatial Kernelized Fuzzy C-Means (SKFCM) algorithm enhanced with spatial intensity profiles and subsequently harnesses the state-of-the-art DeepLabv3+ with ResNet50 architecture to refine the segmentation output. Through extensive experiments across mainstream gastric bleeding and red spots datasets, our Deep DuS-KFCM model demonstrated unprecedented accuracy rates of 87.95%, coupled with a specificity of 96.33%, outperforming contemporary segmentation methods. The findings underscore the model’s robustness against noise and its outstanding segmentation capabilities, particularly for identifying subtle bleeding symptoms, thereby presenting a significant leap forward in medical image processing.

[CV-71] Enhancing Medical Image Segmentation with Deep Learning and Diffusion Models

Link: https://arxiv.org/abs/2411.14353
Authors: Houze Liu,Tong Zhou,Yanlin Xiang,Aoran Shen,Jiacheng Hu,Junliang Du
Keywords: accurate clinical diagnoses, Medical image segmentation, Medical image, unclear boundaries, clinical diagnoses
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments:

Abstract:Medical image segmentation is crucial for accurate clinical diagnoses, yet it faces challenges such as low contrast between lesions and normal tissues, unclear boundaries, and high variability across patients. Deep learning has improved segmentation accuracy and efficiency, but it still relies heavily on expert annotations and struggles with the complexities of medical images. The small size of medical image datasets and the high cost of data acquisition further limit the performance of segmentation networks. Diffusion models, with their iterative denoising process, offer a promising alternative for better detail capture in segmentation. However, they face difficulties in accurately segmenting small targets and maintaining the precision of boundary details. This article discusses the importance of medical image segmentation, the limitations of current deep learning approaches, and the potential of diffusion models to address these challenges.

[CV-72] Guided MRI Reconstruction via Schrödinger Bridge

Link: https://arxiv.org/abs/2411.14269
Authors: Yue Wang,Tian Zhou,Zhuo-xu Cui,Bingsheng Huang,Hairong Zheng,Dong Liang,Yanjie Zhu
Keywords: Magnetic Resonance Imaging, Magnetic Resonance, Resonance Imaging, multi-contrast imaging technique, similar structural information
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
*Comments:

Abstract:Magnetic Resonance Imaging (MRI) is a multi-contrast imaging technique in which different contrast images share similar structural information. However, conventional diffusion models struggle to effectively leverage this structural similarity. Recently, the Schrödinger Bridge (SB), a nonlinear extension of the diffusion model, has been proposed to establish diffusion paths between any distributions, allowing the incorporation of guided priors. This study proposes an SB-based, multi-contrast image-guided reconstruction framework that establishes a diffusion bridge between the guiding and target image distributions. By using the guiding image along with data consistency during sampling, the target image is reconstructed more accurately. To better address structural differences between images, we introduce an inversion strategy from the field of image editing, termed $\mathbf{I}^2$SB-inversion. Experiments on a paired T1 and T2-FLAIR dataset demonstrate that $\mathbf{I}^2$SB-inversion achieves high acceleration of up to 14.4 and outperforms existing methods in terms of both reconstruction accuracy and stability.

[CV-73] CP-UNet: Contour-based Probabilistic Model for Medical Ultrasound Images Segmentation ICASSP2025

Link: https://arxiv.org/abs/2411.14250
Authors: Ruiguo Yu,Yiyang Zhang,Yuan Tian,Zhiqiang Liu,Xuewei Li,Jie Gao
Keywords: Deep learning-based segmentation, widely utilized, utilized for detecting, ultrasound images, learning-based segmentation methods
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*Comments: 4 pages, 4 figures, 2 tables; for ICASSP 2025

Abstract:Deep learning-based segmentation methods are widely utilized for detecting lesions in ultrasound images. Throughout the imaging procedure, the attenuation and scattering of ultrasound waves cause contour blurring and the formation of artifacts, limiting the clarity of the acquired ultrasound images. To overcome this challenge, we propose a contour-based probabilistic segmentation model CP-UNet, which guides the segmentation network to enhance its focus on contour during decoding. We design a novel down-sampling module to enable the contour probability distribution modeling and encoding stages to acquire global-local features. Furthermore, the Gaussian Mixture Model utilizes optimized features to model the contour distribution, capturing the uncertainty of lesion boundaries. Extensive experiments with several state-of-the-art deep learning segmentation methods on three ultrasound image datasets show that our method performs better on breast and thyroid lesions segmentation.

[CV-74] Deep Learning Approach for Enhancing Oral Squamous Cell Carcinoma with LIME Explainable AI Technique

Link: https://arxiv.org/abs/2411.14184
Authors: Samiha Islam,Muhammad Zawad Mahmud,Shahran Rahman Alve,Md. Mejbah Ullah Chowdhury
Keywords: Histopathological Imaging Database, oral cancer analysis, oral squamous cell, squamous cell carcinoma, Histopathological Imaging
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*Comments: Under Review at an IEEE conference

Abstract:The goal of the present study is to analyze an application of deep learning models to augment the diagnostic performance for oral squamous cell carcinoma (OSCC) with a longitudinal cohort study using the Histopathological Imaging Database for oral cancer analysis. The dataset consisted of 5192 images (2435 Normal and 2511 OSCC), which were allocated among training, testing, and validation sets, with the OSCC group making up about 52% of the data. Performance was validated on a combined set containing an almost equal number of samples from each class, as the entire database was divided in half using a stratified splitting technique that preserved the near-even class proportion. We selected four deep-learning architectures for evaluation in the present study: ResNet101, DenseNet121, VGG16, and EfficientNetB3. EfficientNetB3 was found to be the best, with an accuracy of 98.33% and an F1 score of 0.9844, and it required remarkably less computing power than the other models. The next best was DenseNet121, with 90.24% accuracy and an F1 score of 90.45%. Moreover, we employed the Local Interpretable Model-agnostic Explanations (LIME) method to clarify why EfficientNetB3 made certain decisions in its predictions, improving the explainability and trustworthiness of the results. This work provides evidence that the EfficientNetB3 model, combined with explainable AI techniques such as LIME, may offer superior OSCC diagnosis, and lays important groundwork toward clinical usage.

[CV-75] Self-supervised learning for radio-astronomy source classification: a benchmark

Link: https://arxiv.org/abs/2411.14078
Authors: Thomas Cecconello,Simone Riggi,Ugo Becciano,Fabio Vitello,Andrew M. Hopkins,Giuseppe Vizzari,Concetto Spampinato,Simone Palazzo
Keywords: Square Kilometer Array, upcoming Square Kilometer, Kilometer Array, Square Kilometer, upcoming Square
Categories: Instrumentation and Methods for Astrophysics (astro-ph.IM); Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Abstract:The upcoming Square Kilometer Array (SKA) telescope marks a significant step forward in radio astronomy, presenting new opportunities and challenges for data analysis. Traditional visual models pretrained on optical photography images may not perform optimally on radio interferometry images, which have distinct visual characteristics. Self-Supervised Learning (SSL) offers a promising approach to address this issue, leveraging the abundant unlabeled data in radio astronomy to train neural networks that learn useful representations from radio images. This study explores the application of SSL to radio astronomy, comparing the performance of SSL-trained models with that of traditional models pretrained on natural images, evaluating the importance of data curation for SSL, and assessing the potential benefits of self-supervision to different domain-specific radio astronomy datasets. Our results indicate that SSL-trained models achieve significant improvements over the baseline in several downstream tasks, especially in the linear evaluation setting; when the entire backbone is fine-tuned, the benefits of SSL are less evident, but SSL still outperforms pretraining. These findings suggest that SSL can play a valuable role in efficiently enhancing the analysis of radio astronomical data. The trained models and code are available at: this https URL.

[CV-76] Automatic brain tumor segmentation in 2D intra-operative ultrasound images using MRI tumor annotations

Link: https://arxiv.org/abs/2411.14017
Authors: Mathilde Faanes,Ragnhild Holden Helland,Ole Solheim,Ingerid Reinertsen
Keywords: MRI, MRI annotated tumors, pre-operative MRI images, images, iUS images
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments: 19, 8 figures, submitted to International Journal of Computer Assisted Radiology and Surgery

Abstract:Automatic segmentation of brain tumors in intra-operative ultrasound (iUS) images could facilitate localization of tumor tissue during resection surgery. The lack of large annotated datasets limits the performance of current models. In this paper, we investigate the use of tumor annotations in pre-operative MRI images, which are more easily accessible than annotations in iUS images, for training deep learning models for iUS brain tumor segmentation. We used 180 annotated pre-operative MRI images with corresponding unannotated iUS images, and 29 annotated iUS images. Image registration was performed to transfer the MRI annotations to the corresponding iUS images before training models with the nnU-Net framework. To validate the use of MRI labels, the models were compared to a model trained with only US annotated tumors, and a model with both US and MRI annotated tumors. In addition, the results were compared to annotations validated by an expert neurosurgeon on the same test set to measure inter-observer variability. The results showed similar performance for a model trained with only MRI annotated tumors, compared to a model trained with only US annotated tumors. The model trained using both modalities obtained slightly better results with an average Dice score of 0.62, where external expert annotations achieved a score of 0.67. The results also showed that the deep learning models were comparable to expert annotation for larger tumors (> 200 mm2), but performed clearly worse for smaller tumors (< 200 mm2). This shows that MRI tumor annotations can be used as a substitute for US tumor annotations to train a deep learning model for automatic brain tumor segmentation in intra-operative ultrasound images. Small tumors are a limitation for the current models and will be the focus of future work. The main models are available here: this https URL.
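
For readers unfamiliar with the Dice score reported above, the following is a minimal sketch of how it is commonly computed for two binary masks. The function name `dice_score`, the flat-list mask representation, and the convention of returning 1.0 for two empty masks are illustrative assumptions, not details taken from the paper.

```python
def dice_score(pred, target):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for flat binary (0/1) masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total
```

A score of 1.0 means the predicted and reference masks coincide exactly, and 0.0 means they do not overlap at all.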

[CV-77] Image Compression Using Novel View Synthesis Priors

Link: https://arxiv.org/abs/2411.13862
Authors: Luyuan Peng,Mandar Chitre,Hari Vishnu,Yuen Min Too,Bharath Kalyan,Rajat Mishra,Soo Pieng Tan
Keywords: Real-time visual feedback, manipulation tasks, visual feedback, feedback is essential, inspection and manipulation
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*Comments: Preprint submitted to Ocean Engineering

Abstract:Real-time visual feedback is essential for tetherless control of remotely operated vehicles, particularly during inspection and manipulation tasks. Though acoustic communication is the preferred choice for medium-range communication underwater, its limited bandwidth renders it impractical to transmit images or videos in real-time. To address this, we propose a model-based image compression technique that leverages prior mission information. Our approach employs trained machine-learning based novel view synthesis models, and uses gradient descent optimization to refine latent representations to help generate compressible differences between camera images and rendered images. We evaluate the proposed compression technique using a dataset from an artificial ocean basin, demonstrating superior compression ratios and image quality over existing techniques. Moreover, our method exhibits robustness to introduction of new objects within the scene, highlighting its potential for advancing tetherless remotely operated vehicle operations.

[CV-78] A Multimodal Approach to The Detection and Classification of Skin Diseases

Link: https://arxiv.org/abs/2411.13855
Authors: Allen Yang(1),Edward Yang(2), ((1) Mission San Jose High School, Fremont, CA, (2) Yale University, New Haven, CT)
Keywords: Americans lack access, primary care services, avoid medical costs, forty percent delay, one-third of Americans
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments:

Abstract:According to PBS, nearly one-third of Americans lack access to primary care services, and another forty percent delay going to avoid medical costs. As a result, many diseases are left undiagnosed and untreated, even if the disease shows many physical symptoms on the skin. With the rise of AI, self-diagnosis and improved disease recognition have become more promising than ever; in spite of that, existing methods suffer from a lack of large-scale patient databases and outdated methods of study, resulting in studies being limited to only a few diseases or modalities. This study incorporates readily available and easily accessible patient information via image and text for skin disease classification on a new dataset of 26 skin disease types that includes both skin disease images (37K) and associated patient narratives. Using this dataset, baselines for various image models were established that outperform existing methods. Initially, the Resnet-50 model was only able to achieve an accuracy of 70% but, after various optimization techniques, the accuracy was improved to 80%. In addition, this study proposes a novel fine-tuning strategy for sequence classification Large Language Models (LLMs), Chain of Options, which breaks down a complex reasoning task into intermediate steps at training time instead of inference. With Chain of Options and preliminary disease recommendations from the image model, this method achieves state of the art accuracy 91% in diagnosing patient skin disease given just an image of the afflicted area as well as a patient description of the symptoms (such as itchiness or dizziness). Through this research, an earlier diagnosis of skin diseases can occur, and clinicians can work with deep learning models to give a more accurate diagnosis, improving quality of life and saving lives.

[CV-79] A Deep Learning Approach to Predict the Fall [of Price] of Cryptocurrency Long Before its Actual Fall

Link: https://arxiv.org/abs/2411.13615
Authors: Anika Tahsin Meem,Mst. Shapna Akter,Deponker Sarker Depto,M.R.C. Mahdy
Keywords: cryptocurrency market, risk factor, rapidly rising financial, market, cryptocurrency
Categories: Statistical Finance (q-fin.ST); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments: 22 pages, 3 figures

Abstract:In modern times, the cryptocurrency market is one of the world’s most rapidly rising financial markets. The cryptocurrency market is regarded as more volatile and illiquid than traditional markets such as equities, foreign exchange, and commodities. The risk of this market creates uncertainty among investors. The purpose of this research is to predict the magnitude of the risk factor of the cryptocurrency market. The risk factor is also called volatility. Our approach will assist people who invest in the cryptocurrency market by overcoming the problems and difficulties they experience. Our approach starts with calculating the risk factor of the cryptocurrency market from the existing parameters. For twenty elements of the cryptocurrency market, the risk factor has been predicted using different machine learning algorithms such as CNN, LSTM, BiLSTM, and GRU. All of the models have been applied to the calculated risk factor parameter. A new model has been developed to predict better than the existing models. Our proposed model yields a highest RMSE of 1.3229 and a lowest RMSE of 0.0089, whereas for the other existing models the highest RMSE was 14.5092 and the lowest was 0.02769. So, the proposed model performs much better than the existing models, with proper generalization. Using our approach, it will be easier for investors to trade in complicated and challenging financial assets like Bitcoin, Ethereum, and Dogecoin.
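
The RMSE figures quoted above compare predicted and actual risk-factor values; lower is better. As a minimal sketch of the metric itself (the function name `rmse` and the plain-list inputs are illustrative assumptions, and nothing here reproduces the paper's models or data):

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - b) ** 2 for a, b in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

Because RMSE is expressed in the same units as the predicted quantity, values are only comparable between models evaluated on the same series.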

Machine Learning

[LG-0] Learning Fair Robustness via Domain Mixup

Link: https://arxiv.org/abs/2411.14424
Authors: Meiyu Zhong,Ravi Tandon
Keywords: Adversarial training, predominant techniques, Adversarial, adversarial attacks, training
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Information Theory (cs.IT)
*Comments:

Abstract:Adversarial training is one of the predominant techniques for training classifiers that are robust to adversarial attacks. Recent work, however, has found that adversarial training, while making the overall classifier robust, does not necessarily provide an equal amount of robustness for all classes. In this paper, we propose the use of mixup for the problem of learning fair robust classifiers, which can provide similar robustness across all classes. Specifically, the idea is to mix inputs from the same classes and perform adversarial training on mixed up inputs. We present a theoretical analysis of this idea for the case of linear classifiers and show that mixup combined with adversarial training can provably reduce the class-wise robustness disparity. This method not only contributes to reducing the disparity in class-wise adversarial risk, but also the class-wise natural risk. Complementing our theoretical analysis, we also provide experimental results on both synthetic data and a real-world dataset (CIFAR-10), which show improvement in class-wise disparities for both natural and adversarial risks.
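
The same-class mixing step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the Beta(alpha, alpha) mixing coefficient is borrowed from standard mixup, the function name `same_class_mixup` and the list-of-lists input format are hypothetical, and the adversarial training that would follow on the mixed inputs is not shown.

```python
import random


def same_class_mixup(inputs, labels, alpha=1.0, rng=None):
    """Mix every input with a randomly chosen input of the SAME class.

    Since both endpoints share a label, the label of each mixed sample
    is unchanged. Returns (mixed_inputs, labels).
    """
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is reproducible

    # Group sample indices by class label.
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    mixed = []
    for x, label in zip(inputs, labels):
        partner = inputs[rng.choice(by_class[label])]
        lam = rng.betavariate(alpha, alpha)  # assumed mixing distribution
        mixed.append([lam * a + (1.0 - lam) * b for a, b in zip(x, partner)])
    return mixed, list(labels)
```

Each mixed sample is a convex combination of two inputs from one class, so it stays within that class's convex hull.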

[LG-1] From RNNs to Foundation Models: An Empirical Study on Commercial Building Energy Consumption NEURIPS2024

Link: https://arxiv.org/abs/2411.14421
Authors: Shourya Bose,Yijiang Li,Amy Van Sant,Yu Zhang,Kibaek Kim
Keywords: Accurate short-term energy, Accurate short-term, smart grid operations, grid operations, short-term energy consumption
Categories: Machine Learning (cs.LG)
*Comments: NeurIPS 2024 Workshop on Time Series in the Age of Large Models

Abstract:Accurate short-term energy consumption forecasting for commercial buildings is crucial for smart grid operations. While smart meters and deep learning models enable forecasting using past data from multiple buildings, data heterogeneity from diverse buildings can reduce model performance. The impact of increasing dataset heterogeneity in time series forecasting, while keeping size and model constant, is understudied. We tackle this issue using the ComStock dataset, which provides synthetic energy consumption data for U.S. commercial buildings. Two curated subsets, identical in size and region but differing in building type diversity, are used to assess the performance of various time series forecasting models, including fine-tuned open-source foundation models (FMs). The results show that dataset heterogeneity and model architecture have a greater impact on post-training forecasting performance than the parameter count. Moreover, despite the higher computational cost, fine-tuned FMs demonstrate competitive performance compared to base models trained from scratch.

[LG-2] Multi-Agent Environments for Vehicle Routing Problems

Link: https://arxiv.org/abs/2411.14411
Authors: Ricardo Gama,Daniel Fuertes,Carlos R. del-Blanco,Hugo L. Fernandes
Keywords: Operations Research, area classically dominated, discrete optimization problems, dominated by Operations, discrete optimization
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:Research on Reinforcement Learning (RL) approaches for discrete optimization problems has increased considerably, extending RL to an area classically dominated by Operations Research (OR). Vehicle routing problems are a good example of discrete optimization problems with high practical relevance where RL techniques have had considerable success. Despite these advances, open-source development frameworks remain scarce, hampering both the testing of algorithms and the ability to objectively compare results. This ultimately slows down progress in the field and limits the exchange of ideas between the RL and OR communities. Here we propose a library composed of multi-agent environments that simulate classic vehicle routing problems. The library, built on PyTorch, provides a flexible modular architecture design that allows easy customization and incorporation of new routing problems. It follows the Agent Environment Cycle (“AEC”) games model and has an intuitive API, enabling rapid adoption and easy integration into existing reinforcement learning frameworks. The library allows for a straightforward use of classical OR benchmark instances in order to narrow the gap between the test beds for algorithm benchmarking used by the RL and OR communities. Additionally, we provide benchmark instance sets for each environment, as well as baseline RL models and training code.
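
In the Agent Environment Cycle model mentioned above, agents act one at a time in turn rather than simultaneously. The toy environment below is a hypothetical sketch of that turn-taking loop for a routing-style problem; the class `ToyRoutingAEC`, its methods, and `run_episode` are invented for illustration and are not the API of the library the paper describes.

```python
class ToyRoutingAEC:
    """Hypothetical turn-based routing environment (not the paper's API)."""

    def __init__(self, num_agents, customers):
        self.agents = [f"vehicle_{i}" for i in range(num_agents)]
        self.unvisited = set(customers)
        self.routes = {agent: [] for agent in self.agents}
        self._turn = 0

    @property
    def current_agent(self):
        # Agents are selected in a fixed cyclic order.
        return self.agents[self._turn % len(self.agents)]

    def done(self):
        return not self.unvisited

    def step(self, customer):
        """Apply the current agent's action, then pass the turn on."""
        if customer not in self.unvisited:
            raise ValueError(f"customer {customer} is not available")
        self.unvisited.remove(customer)
        self.routes[self.current_agent].append(customer)
        self._turn += 1


def run_episode(env, policy):
    """AEC-style loop: observe, let one agent act, advance to the next."""
    while not env.done():
        options = sorted(env.unvisited)
        env.step(policy(env.current_agent, options))
    return env.routes
```

For example, `run_episode(ToyRoutingAEC(2, [3, 1, 2]), lambda agent, options: options[0])` alternates between the two vehicles until every customer has been assigned.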

[LG-3] Model Checking for Reinforcement Learning in Autonomous Driving: One Can Do More Than You Think!

Link: https://arxiv.org/abs/2411.14375
Authors: Rong Gu(Mälardalen University)
Keywords: Gymnasium using Python, high-level programming languages, OpenAI Gymnasium, programming languages, high-level programming
Categories: Machine Learning (cs.LG)
*Comments: In Proceedings FMAS2024, arXiv:2411.13215

Abstract:Most reinforcement learning (RL) platforms use high-level programming languages, such as OpenAI Gymnasium using Python. These frameworks provide various API and benchmarks for testing RL algorithms in different domains, such as autonomous driving (AD) and robotics. These platforms often emphasise the design of RL algorithms and the training performance but neglect the correctness of models and reward functions, which can be crucial for the successful application of RL. This paper proposes using formal methods to model AD systems and demonstrates how model checking (MC) can be used in RL for AD. Most studies combining MC and RL focus on safety, such as safety shields. However, this paper shows different facets where MC can strengthen RL. First, an MC-based model pre-analysis can reveal bugs with respect to sensor accuracy and learning step size. This step serves as a preparation of RL, which saves time if bugs exist and deepens users’ understanding of the target system. Second, reward automata can benefit the design of reward functions and greatly improve learning performance especially when the learning objectives are multiple. All these findings are supported by experiments.

[LG-4] Agnostic Learning of Arbitrary ReLU Activation under Gaussian Marginals

Link: https://arxiv.org/abs/2411.14349
Authors: Anxin Guo,Aravindan Vijayaraghavan
Keywords: Gaussian marginals, squared loss objective, arbitrarily-biased ReLU activation, ReLU activation, ReLU neuron
Categories: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
*Comments:

Abstract:We consider the problem of learning an arbitrarily-biased ReLU activation (or neuron) over Gaussian marginals with the squared loss objective. Despite the ReLU neuron being the basic building block of modern neural networks, we still do not understand the basic algorithmic question of whether one arbitrary ReLU neuron is learnable in the non-realizable setting. In particular, all existing polynomial time algorithms only provide approximation guarantees for the better-behaved unbiased setting or restricted bias setting. Our main result is a polynomial time statistical query (SQ) algorithm that gives the first constant factor approximation for arbitrary bias. It outputs a ReLU activation that achieves a loss of $O(\mathrm{OPT}) + \varepsilon$ in time $\mathrm{poly}(d, 1/\varepsilon)$, where $\mathrm{OPT}$ is the loss obtained by the optimal ReLU activation. Our algorithm presents an interesting departure from existing algorithms, which are all based on gradient descent and thus fall within the class of correlational statistical query (CSQ) algorithms. We complement our algorithmic result by showing that no polynomial time CSQ algorithm can achieve a constant factor approximation. Together, these results shed light on the intrinsic limitation of gradient descent, while identifying arguably the simplest setting (a single neuron) where there is a separation between SQ and CSQ algorithms.
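
The guarantee stated above can be written out explicitly. The notation below ($D$, $F$, $\hat{w}$, $\hat{b}$, $C$) is assumed for illustration and may differ from the paper's own, including in normalisation constants:

```latex
% Assumed setup: D is a distribution over (x, y) whose x-marginal is N(0, I_d).
\[
  F(w, b) = \mathbb{E}_{(x, y) \sim D}
            \bigl[ (\mathrm{ReLU}(\langle w, x \rangle + b) - y)^2 \bigr],
  \qquad
  \mathrm{OPT} = \min_{w, b} F(w, b).
\]
% Constant-factor agnostic guarantee, for some absolute constant C:
\[
  F(\hat{w}, \hat{b}) \le C \cdot \mathrm{OPT} + \varepsilon,
  \qquad \text{with running time } \mathrm{poly}(d, 1/\varepsilon).
\]
```

The setting is "non-realizable" (agnostic) because no ReLU is assumed to fit the labels exactly, so $\mathrm{OPT}$ may be strictly positive.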

[LG-5] Overcomplete Tensor Decomposition via Koszul-Young Flattenings

Link: https://arxiv.org/abs/2411.14344
Authors: Pravesh K. Kothari,Ankur Moitra,Alexander S. Wein
Keywords: algebraic complexity lower, complexity lower bounds, Motivated by connections, investigate Koszul-Young flattenings, complexity lower
Categories: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*Comments: 42 pages

Abstract:Motivated by connections between algebraic complexity lower bounds and tensor decompositions, we investigate Koszul-Young flattenings, which are the main ingredient in recent lower bounds for matrix multiplication. Based on this tool we give a new algorithm for decomposing an $n_1 \times n_2 \times n_3$ tensor as the sum of a minimal number of rank-1 terms, and certifying uniqueness of this decomposition. For $n_1 \le n_2 \le n_3$ with $n_1 \to \infty$ and $n_3/n_2 = O(1)$, our algorithm is guaranteed to succeed when the tensor rank is bounded by $r \le (1-\epsilon)(n_2 + n_3)$ for an arbitrary $\epsilon > 0$, provided the tensor components are generically chosen. For any fixed $\epsilon$, the runtime is polynomial in $n_3$. When $n_2 = n_3 = n$, our condition on the rank gives a factor-of-2 improvement over the classical simultaneous diagonalization algorithm, which requires $r \le n$, and also improves on the recent algorithm of Koiran (2024) which requires $r \le 4n/3$. It also improves on the PhD thesis of Persu (2018) which solves rank detection for $r \le 3n/2$. We complement our upper bounds by showing limitations, in particular that no flattening of the style we consider can surpass rank $n_2 + n_3$. Furthermore, for $n \times n \times n$ tensors, we show that an even more general class of degree-$d$ polynomial flattenings cannot surpass rank $Cn$ for a constant $C = C(d)$. This suggests that for tensor decompositions, the case of generic components may be fundamentally harder than that of random components, where efficient decomposition is possible even in highly overcomplete settings.

[LG-6] Outlier-robust Mean Estimation near the Breakdown Point via Sum-of-Squares

Link: https://arxiv.org/abs/2411.14305
Authors: Hongjie Chen,Deepak Narayanan Sridharan,David Steurer
Keywords: optimal error rate, fraction of adversarial, adversarial outliers, revisit the problem, problem of estimating
Categories: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Accepted at SODA 2025, 47 pages

Abstract:We revisit the problem of estimating the mean of a high-dimensional distribution in the presence of an $\varepsilon$-fraction of adversarial outliers. When $\varepsilon$ is at most some sufficiently small constant, previous works can achieve optimal error rate efficiently \cite{diakonikolas2018robustly, kothari2018robust}. As $\varepsilon$ approaches the breakdown point $\frac{1}{2}$, all previous algorithms incur either sub-optimal error rates or exponential running time. In this paper we give a new analysis of the canonical sum-of-squares program introduced in \cite{kothari2018robust} and show that this program efficiently achieves optimal error rate for all $\varepsilon \in [0, \frac{1}{2})$. The key ingredient for our results is a new identifiability proof for robust mean estimation that focuses on the overlap between the distributions instead of their statistical distance as in previous works. We capture this proof within the sum-of-squares proof system, thus obtaining efficient algorithms using the sum-of-squares proofs-to-algorithms paradigm \cite{raghavendra2018high}.

[LG-7] Improving Routability Prediction via NAS Using a Smooth One-shot Augmented Predictor

Link: https://arxiv.org/abs/2411.14296
Authors: Arjun Sridhar,Chen-Chia Chang,Junyao Zhang,Yiran Chen
Keywords: modern EDA tools, modern EDA, EDA tools, machine learning, optimization in modern
Categories: Machine Learning (cs.LG)

Abstract:Routability optimization in modern EDA tools has benefited greatly from using machine learning (ML) models. Constructing and optimizing the performance of ML models continues to be a challenge. Neural Architecture Search (NAS) serves as a tool to aid in the construction and improvement of these models. Traditional NAS techniques struggle to perform well on routability prediction as a result of two primary factors. First, the separation between the training objective and the search objective adds noise to the NAS process. Second, the increased variance of the search objective further complicates performing NAS. We craft a novel NAS technique, coined SOAP-NAS, to address these challenges through novel data augmentation techniques and a novel combination of one-shot and predictor-based NAS. Results show that our technique outperforms existing solutions, coming 40% closer to the ideal performance measured by ROC-AUC (area under the receiver operating characteristic curve) in DRC hotspot detection. SOAPNet is able to achieve an ROC-AUC of 0.9802 and a query time of only 0.461 ms.

[LG-8] On the Sample Complexity of One Hidden Layer Networks with Equivariance Locality and Weight Sharing

Link: https://arxiv.org/abs/2411.14288
Authors: Arash Behboodi,Gabriele Cesa
Keywords: sample complexity, convolutional neural networks, design choices contribute, neural networks, Weight sharing
Categories: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)

Abstract:Weight sharing, equivariance, and local filters, as in convolutional neural networks, are believed to contribute to the sample efficiency of neural networks. However, it is not clear how each one of these design choices contributes to the generalization error. Through the lens of statistical learning theory, we aim to provide insight into this question by characterizing the relative impact of each choice on the sample complexity. We obtain lower and upper sample complexity bounds for a class of single hidden layer networks. It is shown that the gain of equivariance is directly manifested in the bound, while getting a similar increase for weight sharing depends on the sharing mechanism. Among our results, we obtain a completely dimension-free bound for equivariant networks for a class of pooling operations. We show that the bound depends merely on the norm of filters, which is tighter than using the spectral norm of the respective matrix. We also characterize the trade-off in sample complexity between the parametrization of filters in spatial and frequency domains, particularly when spatial filters are localized as in vanilla convolutional neural networks.

[LG-9] Simulation-Aided Policy Tuning for Black-Box Robot Learning

Link: https://arxiv.org/abs/2411.14246
Authors: Shiming He,Alexander von Rohr,Dominik Baumann,Ji Xiang,Sebastian Trimpe
Keywords: robot, robot learning, learning, algorithm, efficient robot learning
Categories: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)

Abstract:How can robots learn and adapt to new tasks and situations with little data? Systematic exploration and simulation are crucial tools for efficient robot learning. We present a novel black-box policy search algorithm focused on data-efficient policy improvements. The algorithm learns directly on the robot and treats simulation as an additional information source to speed up the learning process. At the core of the algorithm, a probabilistic model learns the dependence of the policy parameters and the robot learning objective not only by performing experiments on the robot, but also by leveraging data from a simulator. This substantially reduces interaction time with the robot. Using this model, we can guarantee improvements with high probability for each policy update, thereby facilitating fast, goal-oriented learning. We evaluate our algorithm on simulated fine-tuning tasks and demonstrate the data-efficiency of the proposed dual-information source optimization algorithm. In a real robot learning experiment, we show fast and successful task learning on a robot manipulator with the aid of an imperfect simulator.

[LG-10] GNN-MultiFix: Addressing the pitfalls for GNNs for multi-label node classification

Link: https://arxiv.org/abs/2411.14094
Authors: Tianqi Zhao,Megha Khosla
Keywords: Graph neural networks, data showing state, graph data showing, neural networks, emerged as powerful
Categories: Machine Learning (cs.LG)

Abstract:Graph neural networks (GNNs) have emerged as powerful models for learning representations of graph data, showing state-of-the-art results in various tasks. Nevertheless, the superiority of these methods is usually supported either by evaluating their performance on a small subset of benchmark datasets or by reasoning about their expressive power in terms of certain graph isomorphism tests. In this paper we critically analyse both these aspects through a transductive setting for the task of node classification. First, we delve deeper into the case of multi-label node classification, which offers a more realistic scenario and has been ignored in most of the related works. Through analysing the training dynamics for GNN methods, we highlight the failure of GNNs to learn over multi-label graph datasets even for the case of abundant training data. Second, we show that specifically for transductive node classification, even the most expressive GNN may fail to learn in the absence of node attributes and without using explicit label information as input. To overcome this deficit, we propose a straightforward approach, referred to as GNN-MultiFix, that integrates the feature, label, and positional information of a node. GNN-MultiFix demonstrates significant improvement across all the multi-label datasets. We release our code publicly.

[LG-11] Exploration by Running Away from the Past

Link: https://arxiv.org/abs/2411.14085
Authors: Paul-Antoine Le Tolguenec,Yann Besse,Florent Teichteil-Koenigsbuch,Dennis G. Wilson,Emmanuel Rachelson
Keywords: reinforcement learning, central challenge, challenge of reinforcement, explore efficiently
Categories: Machine Learning (cs.LG)

Abstract:The ability to explore efficiently and effectively is a central challenge of reinforcement learning. In this work, we consider exploration through the lens of information theory. Specifically, we cast exploration as a problem of maximizing the Shannon entropy of the state occupation measure. This is done by maximizing a sequence of divergences between distributions representing an agent’s past behavior and its current behavior. Intuitively, this encourages the agent to explore new behaviors that are distinct from past behaviors. Hence, we call our method RAMP, for “Running Away froM the Past.” A fundamental question of this method is the quantification of the distribution change over time. We consider both the Kullback-Leibler divergence and the Wasserstein distance to quantify divergence between successive state occupation measures, and explain why the former might lead to undesirable exploratory behaviors in some tasks. We demonstrate that by encouraging the agent to explore by actively distancing itself from past experiences, it can effectively explore mazes and a wide range of behaviors on robotic manipulation and locomotion tasks.
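
To make the entropy objective concrete, here is a minimal sketch that computes the Shannon entropy of an empirical state occupation measure from a list of visited states. The function name `occupancy_entropy` and the toy state labels are illustrative assumptions; this is not the RAMP algorithm itself, only the quantity the abstract says it aims to maximize.

```python
import math
from collections import Counter


def occupancy_entropy(states):
    """Shannon entropy (in nats) of the empirical state occupation measure.

    `states` is a sequence of hashable state identifiers visited by an agent.
    """
    counts = Counter(states)
    total = len(states)
    # H = -sum_s p(s) * log p(s), with p(s) the visit frequency of state s.
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# Hypothetical toy trajectories: one that lingers in a single state,
# and one that spreads its visits evenly over four states.
narrow = ["s0", "s0", "s0", "s1"]
broad = ["s0", "s1", "s2", "s3"]

print(occupancy_entropy(narrow))
print(occupancy_entropy(broad))
```

The evenly spread trajectory attains the maximum entropy for four states, log 4, which is the sense in which higher entropy corresponds to broader exploration.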

[LG-12] REFOL: Resource-Efficient Federated Online Learning for Traffic Flow Forecasting

Link: https://arxiv.org/abs/2411.14046
Authors: Qingxiang Liu,Sheng Sun,Yuxuan Liang,Xiaolong Xu,Min Liu,Muhammad Bilal,Yuwei Wang,Xujing Li,Yu Zheng
Keywords: traffic flow forecasting, Multiple federated learning, privacy-leaking concerns resulting, Multiple federated, federated online learning
Categories: Machine Learning (cs.LG)

Abstract:Multiple federated learning (FL) methods have been proposed for traffic flow forecasting (TFF) to avoid the heavy-transmission and privacy-leaking concerns resulting from the disclosure of raw data in centralized methods. However, these FL methods adopt offline learning, which may yield subpar performance when concept drift occurs, i.e., when the distributions of historical and future data differ. Online learning can detect concept drift during model training and is thus more applicable to TFF. Nevertheless, the existing federated online learning method for TFF fails to efficiently solve the concept drift problem and causes tremendous computing and communication overhead. Therefore, we propose a novel method named Resource-Efficient Federated Online Learning (REFOL) for TFF, which guarantees prediction performance in a communication-lightweight and computation-efficient way. Specifically, we design a data-driven client participation mechanism to detect the occurrence of concept drift and determine clients’ participation necessity. Subsequently, we propose an adaptive online optimization strategy, which guarantees prediction performance while avoiding meaningless model updates. Then, a graph convolution-based model aggregation mechanism is designed, aiming to assess participants’ contribution based on spatial correlation without imposing extra communication and computing consumption on clients. Finally, we conduct extensive experiments on real-world datasets to demonstrate the superiority of REFOL in terms of prediction improvement and resource economization.

[LG-13] Teaching MLPs to Master Heterogeneous Graph-Structured Knowledge for Efficient and Accurate Inference

Link: https://arxiv.org/abs/2411.14035
Authors: Yunhui Liu,Xinyi Gao,Tieke He,Jianhua Zhao,Hongzhi Yin
Keywords: Graph Neural Networks, Neural Networks, achieved promising results, Heterogeneous Graph Neural, graph learning tasks
Categories: Machine Learning (cs.LG)

Abstract:Heterogeneous Graph Neural Networks (HGNNs) have achieved promising results in various heterogeneous graph learning tasks, owing to their superiority in capturing the intricate relationships and diverse relational semantics inherent in heterogeneous graph structures. However, the neighborhood-fetching latency incurred by structure dependency in HGNNs makes it challenging to deploy for latency-constrained applications that require fast inference. Inspired by recent GNN-to-MLP knowledge distillation frameworks, we introduce HG2M and HG2M+ to combine both HGNN’s superior performance and MLP’s efficient inference. HG2M directly trains student MLPs with node features as input and soft labels from teacher HGNNs as targets, and HG2M+ further distills reliable and heterogeneous semantic knowledge into student MLPs through reliable node distillation and reliable meta-path distillation. Experiments conducted on six heterogeneous graph datasets show that despite lacking structural dependencies, HG2Ms can still achieve competitive or even better performance than HGNNs and significantly outperform vanilla MLPs. Moreover, HG2Ms demonstrate a 379.24× speedup in inference over HGNNs on the large-scale IGB-3M-19 dataset, showcasing their ability for latency-sensitive deployments.

[LG-14] Time-Scale Separation in Q-Learning: Extending TD($\Delta$) for Action-Value Function Decomposition

Link: https://arxiv.org/abs/2411.14019
Authors: Mahammad Humayoo
Keywords: learn optimal policies, fundamental off-policy reinforcement, Delta, off-policy reinforcement learning, approximating action-value functions
Categories: Machine Learning (cs.LG)
Comments: 17 pages

Abstract:Q-Learning is a fundamental off-policy reinforcement learning (RL) algorithm that has the objective of approximating action-value functions in order to learn optimal policies. Nonetheless, it has difficulties in reconciling bias with variance, particularly in the context of long-term rewards. This paper introduces Q($\Delta$)-Learning, an extension of TD($\Delta$) for the Q-Learning framework. TD($\Delta$) facilitates efficient learning over several time scales by breaking the Q($\Delta$)-function into distinct discount factors. This approach offers improved learning stability and scalability, especially for long-term tasks where discounting bias may impede convergence. Our methodology guarantees that each element of the Q($\Delta$)-function is acquired individually, facilitating expedited convergence on shorter time scales and enhancing the learning of extended time scales. We demonstrate through theoretical analysis and practical evaluations on standard benchmarks like Atari that Q($\Delta$)-Learning surpasses conventional Q-Learning and TD learning methods in both tabular and deep RL environments.
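
As a sketch of what a decomposition over discount factors can look like, assume the construction mirrors the TD($\Delta$) decomposition of value functions, applied here to action values. The abstract does not spell out the exact form, so the component symbols $W_z$ and the ordering $\gamma_0 < \gamma_1 < \dots < \gamma_Z$ are assumptions for illustration:

```latex
\begin{aligned}
W_0(s,a) &= Q_{\gamma_0}(s,a), \\
W_z(s,a) &= Q_{\gamma_z}(s,a) - Q_{\gamma_{z-1}}(s,a), \qquad z = 1, \dots, Z, \\
Q_{\gamma_Z}(s,a) &= \sum_{z=0}^{Z} W_z(s,a).
\end{aligned}
```

The last line follows by telescoping the differences. Under this reading, each $W_z$ can be learned separately, with the components tied to smaller discount factors covering shorter horizons.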

[LG-15] Trajectory Representation Learning on Road Networks and Grids with Spatio-Temporal Dynamics

Link: https://arxiv.org/abs/2411.14014
Authors: Stefan Schestakov,Simon Gottschalk
Keywords: including smart city, fields including smart, vehicle movements, trajectory data, smart city
Categories: Machine Learning (cs.LG)

Abstract:Trajectory representation learning is a fundamental task for applications in fields including smart cities and urban planning, as it facilitates the utilization of trajectory data (e.g., vehicle movements) for various downstream applications, such as trajectory similarity computation or travel time estimation. This is achieved by learning low-dimensional representations from high-dimensional and raw trajectory data. However, existing methods for trajectory representation learning rely on either grid-based or road-based representations, which are inherently different and thus could lose information contained in the other modality. Moreover, these methods overlook the dynamic nature of urban traffic, relying on static road network features rather than time-varying traffic patterns. In this paper, we propose TIGR, a novel model designed to integrate grid and road network modalities while incorporating spatio-temporal dynamics to learn rich, general-purpose representations of trajectories. We evaluate TIGR on two real-world datasets and demonstrate the effectiveness of combining both modalities by substantially outperforming state-of-the-art methods, i.e., up to 43.22% for trajectory similarity, up to 16.65% for travel time estimation, and up to 10.16% for destination prediction.

[LG-16] Generative Intervention Models for Causal Perturbation Modeling

Link: https://arxiv.org/abs/2411.14003
Authors: Nora Schneider,Lars Lorch,Niki Kilbertus,Bernhard Schölkopf,Andreas Krause
Keywords: problem of predicting, perturbation, predicting perturbation effects, causal, predicting perturbation
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:We consider the problem of predicting perturbation effects via causal models. In many applications, it is a priori unknown which mechanisms of a system are modified by an external perturbation, even though the features of the perturbation are available. For example, in genomics, some properties of a drug may be known, but not their causal effects on the regulatory pathways of cells. We propose a generative intervention model (GIM) that learns to map these perturbation features to distributions over atomic interventions in a jointly-estimated causal model. Contrary to prior approaches, this enables us to predict the distribution shifts of unseen perturbation features while gaining insights about their mechanistic effects in the underlying data-generating process. On synthetic data and scRNA-seq drug perturbation data, GIMs achieve robust out-of-distribution predictions on par with unstructured approaches, while effectively inferring the underlying perturbation mechanisms, often better than other causal inference methods.

[LG-17] Market Making without Regret

Link: https://arxiv.org/abs/2411.13993
Authors: Nicolò Cesa-Bianchi,Tommaso Cesari,Roberto Colomboni,Luigi Foscari,Vinayak Pathak
Keywords: sequential decision-making setting, market maker posts, bid price, sequential decision-making, decision-making setting
Categories: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)

Abstract:We consider a sequential decision-making setting where, at every round $t$, a market maker posts a bid price $B_t$ and an ask price $A_t$ to an incoming trader (the taker) with a private valuation for one unit of some asset. If the trader’s valuation is lower than the bid price, or higher than the ask price, then a trade (sell or buy) occurs. If a trade happens at round $t$, then letting $M_t$ be the market price (observed only at the end of round $t$), the maker’s utility is $M_t - B_t$ if the maker bought the asset, and $A_t - M_t$ if they sold it. We characterize the maker’s regret with respect to the best fixed choice of bid and ask pairs under a variety of assumptions (adversarial, i.i.d., and their variants) on the sequence of market prices and valuations. Our upper bound analysis unveils an intriguing connection relating market making to first-price auctions and dynamic pricing. Our main technical contribution is a lower bound for the i.i.d. case with Lipschitz distributions and independence between prices and valuations. The difficulty in the analysis stems from the unique structure of the reward and feedback functions, allowing an algorithm to acquire information by graduating the “cost of exploration” in an arbitrary way.
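
The per-round payoff rule described in the abstract can be sketched as a small function. The name `maker_utility`, the zero payoff when no trade occurs, and the requirement that the bid not exceed the ask are assumptions made for illustration; the abstract only specifies the utility when a trade happens.

```python
def maker_utility(bid, ask, valuation, market_price):
    """Maker's utility for a single round.

    Assumptions (not stated in the abstract): bid <= ask, and a round
    without a trade yields zero utility.
    """
    if bid > ask:
        raise ValueError("bid must not exceed ask")
    if valuation < bid:
        # The taker sells one unit to the maker at the bid price.
        return market_price - bid
    if valuation > ask:
        # The taker buys one unit from the maker at the ask price.
        return ask - market_price
    # Valuation lies between bid and ask: no trade occurs.
    return 0.0


# Hypothetical round: bid 9, ask 11, market price 10.
print(maker_utility(9.0, 11.0, valuation=8.0, market_price=10.0))   # maker buys
print(maker_utility(9.0, 11.0, valuation=12.0, market_price=10.0))  # maker sells
print(maker_utility(9.0, 11.0, valuation=10.0, market_price=10.0))  # no trade
```

Regret, in the sense used in the abstract, would then compare the sum of these per-round utilities against the best fixed bid and ask pair in hindsight.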

[LG-18] Material synthesis through simulations guided by machine learning: a position paper

Link: https://arxiv.org/abs/2411.13953
Authors: Usman Syed,Federico Cunico,Uzair Khan,Eros Radicchi,Francesco Setti,Adolfo Speghini,Paolo Marone,Filiberto Semenzin,Marco Cristani
Keywords: optimal mix design, marble sludge reuse, sustainable data collection, marble sludge, position paper
Categories: Machine Learning (cs.LG)

Abstract:In this position paper, we propose an approach for sustainable data collection in the field of optimal mix design for marble sludge reuse. Marble sludge, a calcium-rich residual from stone-cutting processes, can be repurposed by mixing it with various ingredients. However, determining the optimal mix design is challenging due to the variability in sludge composition and the costly, time-consuming nature of experimental data collection. We also investigate the possibility of using machine learning models, with meta-learning as an optimization tool, to estimate the correct quantity of stone-cutting sludge to be used in aggregates to obtain a mix design with specific mechanical properties that can be used successfully in the building industry. Our approach offers two key advantages: (i) through simulations, a large dataset can be generated, saving time and money during the data collection phase, and (ii) machine learning models, with performance enhanced through hyper-parameter optimization via meta-learning, can estimate optimal mix designs, reducing the need for extensive manual experimentation, lowering costs, minimizing environmental impact, and accelerating the processing of quarry sludge. Our idea promises to streamline the marble sludge reuse process by leveraging collective data and advanced machine learning, promoting sustainability and efficiency in the stone-cutting sector.

[LG-19] Neuromorphic Attitude Estimation and Control

Link: https://arxiv.org/abs/2411.13945
Authors: Stein Stroobants,Christophe de Wagter,Guido C.H.E. De Croon
Keywords: energy limitations, real-world application, application of small, hampered by energy, SNN
Categories: Robotics (cs.RO); Machine Learning (cs.LG)

Abstract:The real-world application of small drones is mostly hampered by energy limitations. Neuromorphic computing promises extremely energy-efficient AI for autonomous flight, but is still challenging to train and deploy on real robots. In order to reap the maximal benefits from neuromorphic computing, it is desired to perform all autonomy functions end-to-end on a single neuromorphic chip, from low-level attitude control to high-level navigation. This research presents the first neuromorphic control system using a spiking neural network (SNN) to effectively map a drone’s raw sensory input directly to motor commands. We apply this method to low-level attitude estimation and control for a quadrotor, deploying the SNN on a tiny Crazyflie. We propose a modular SNN, separately training and then merging estimation and control sub-networks. The SNN is trained with imitation learning, using a flight dataset of sensory-motor pairs. Post-training, the network is deployed on the Crazyflie, issuing control commands from sensor inputs at 500 Hz. Furthermore, for the training procedure we augmented training data by flying a controller with additional excitation and time-shifting the target data to enhance the predictive capabilities of the SNN. On the real drone the perception-to-control SNN tracks attitude commands with an average error of 3 degrees, compared to 2.5 degrees for the regular flight stack. We also show the benefits of the proposed learning modifications for reducing the average tracking error and reducing oscillations. Our work shows the feasibility of performing neuromorphic end-to-end control, laying the basis for highly energy-efficient and low-latency neuromorphic autopilots.

[LG-20] NBMLSS: probabilistic forecasting of electricity prices via Neural Basis Models for Location Scale and Shape

Link: https://arxiv.org/abs/2411.13921
Authors: Alessandro Brusaferri,Danial Ramin,Andrea Ballarino
Keywords: gain detailed insights, predicted feature-conditioned distribution, flexible neural networks, multi-horizon distributional regression, distributional regression setups
Categories: Machine Learning (cs.LG)
Comments: 23 pages

Abstract:Forecasters using flexible neural networks (NN) in multi-horizon distributional regression setups often struggle to gain detailed insights into the underlying mechanisms that lead to the predicted feature-conditioned distribution parameters. In this work, we deploy a Neural Basis Model for Location, Scale and Shape, that blends the principled interpretability of GAMLSS with a computationally scalable shared basis decomposition, combined by linear projections supporting dedicated stepwise and parameter-wise feature shape functions aggregations. Experiments have been conducted on multiple market regions, achieving probabilistic forecasting performance comparable to that of distributional neural networks, while providing more insights into the model behavior through the learned nonlinear feature level maps to the distribution parameters across the prediction steps.

[LG-21] Predictive Maintenance Study for High-Pressure Industrial Compressors: Hybrid Clustering Models

Link: https://arxiv.org/abs/2411.13919
Authors: Alessandro Costa,Emilio Mastriani,Federico Incardona,Kevin Munari,Sebastiano Spinello
Keywords: high pressure industrial, pressure industrial compressors, unsupervised clustering integrated, study introduces, strategy for high
Categories: Machine Learning (cs.LG)
Comments: 10 pages, 9 figures, 2 tables, HICSS58 conference

Abstract:This study introduces a predictive maintenance strategy for high-pressure industrial compressors using sensor data and features derived from unsupervised clustering integrated into classification models. The goal is to enhance model accuracy and efficiency in detecting compressor failures. After data preprocessing, sensitive clustering parameters were tuned to identify algorithms that best capture the dataset’s temporal and operational characteristics. Clustering algorithms were evaluated using quality metrics like Normalized Mutual Information (NMI) and Adjusted Rand Index (ARI), selecting those most effective at distinguishing between normal and non-normal conditions. These features enriched regression models, improving failure detection accuracy by 4.87% on average. Although training time was reduced by 22.96%, the decrease was not statistically significant, varying across algorithms. Cross-validation and key performance metrics confirmed the benefits of clustering-based features in predictive maintenance models.
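
For readers unfamiliar with the clustering metrics mentioned above, here is a minimal pure-Python sketch of the Adjusted Rand Index using its standard pair-counting formula. The function name and the toy label vectors are illustrative assumptions, not the study's code or data; NMI is omitted to keep the example short.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same items."""
    n = len(labels_true)
    if n != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    if n < 2:
        return 1.0  # Degenerate case: no pairs to compare.

    # Pair counts within contingency cells, true clusters, predicted clusters.
    cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    cols = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = rows * cols / comb(n, 2)  # Expected index under random labeling.
    maximum = (rows + cols) / 2          # Largest attainable index.
    if maximum == expected:
        return 1.0  # Both labelings are trivial and identical in structure.
    return (cells - expected) / (maximum - expected)


# Hypothetical labels: 0 = normal operation, 1 = non-normal operation.
truth = [0, 0, 1, 1]
print(adjusted_rand_index(truth, [1, 1, 0, 0]))  # same partition, renamed clusters
print(adjusted_rand_index(truth, [0, 1, 0, 1]))  # partition that cuts across the truth
```

Because the index depends only on which items are grouped together, relabeling the clusters does not change the score, which is why it suits comparing unsupervised cluster assignments against known operating conditions.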

[LG-22] ICODE: Modeling Dynamical Systems with Extrinsic Input Information

Link: https://arxiv.org/abs/2411.13914
Authors: Zhaoyi Li,Wenjie Mei,Ke Yu,Yang Bai,Shihua Li
Keywords: future state evolution, studying complex phenomena, predicting future state, nonsmooth or piecewise, state evolution
Categories: Machine Learning (cs.LG)

Abstract:Learning models of dynamical systems with external inputs, which may be, for example, nonsmooth or piecewise, is crucial for studying complex phenomena and predicting future state evolution, which is essential for applications such as safety guarantees and decision-making. In this work, we introduce Input Concomitant Neural ODEs (ICODEs), which incorporate precise real-time input information into the learning process of the models, rather than treating the inputs as hidden parameters to be learned. The sufficient conditions to ensure the model’s contraction property are provided to guarantee that system trajectories of the trained model converge to a fixed point, regardless of initial conditions across different training processes. We validate our method through experiments on several representative real dynamics: Single-link robot, DC-to-DC converter, motion dynamics of a rigid body, Rabinovich-Fabrikant equation, Glycolytic-glycogenolytic pathway model, and heat conduction equation. The experimental results demonstrate that our proposed ICODEs efficiently learn the ground truth systems, achieving superior prediction performance under both typical and atypical inputs. This work offers a valuable class of neural ODE models for understanding physical systems with explicit external input information, with potential promising applications in fields such as physics and robotics.

[LG-23] Schemato – An LLM for Netlist-to-Schematic Conversion

Link: https://arxiv.org/abs/2411.13899
Authors: Ryoga Matsuo,Stefan Uhlich,Arun Venkitaraman,Andrea Bonetti,Chia-Yu Hsieh,Ali Momeni,Lukas Mauch,Augusto Capone,Eisaku Ohbuchi,Lorenzo Servadei
Keywords: Machine learning models, Machine learning, advancing circuit design, Machine, advancing circuit
Categories: Machine Learning (cs.LG)

Abstract:Machine learning models are advancing circuit design, particularly in analog circuits. They typically generate netlists that lack human interpretability. This is a problem as human designers heavily rely on the interpretability of circuit diagrams or schematics to intuitively understand, troubleshoot, and develop designs. Hence, to integrate domain knowledge effectively, it is crucial to translate ML-generated netlists into interpretable schematics quickly and accurately. We propose Schemato, a large language model (LLM) for netlist-to-schematic conversion. In particular, we consider our approach in the two settings of converting netlists to .asc files for LTSpice and LaTeX files for CircuiTikz schematics. Experiments on our circuit dataset show that Schemato achieves up to 93% compilation success rate for the netlist-to-LaTeX conversion task, surpassing the 26% rate scored by the state-of-the-art LLMs. Furthermore, our experiments show that Schemato generates schematics with a mean structural similarity index measure that is 3x higher than the best performing LLMs, therefore closer to the reference human design.

[LG-24] GraCo – A Graph Composer for Integrated Circuits

Link: https://arxiv.org/abs/2411.13890
Authors: Stefan Uhlich,Andrea Bonetti,Arun Venkitaraman,Ali Momeni,Ryoga Matsuo,Chia-Yu Hsieh,Eisaku Ohbuchi,Lorenzo Servadei
Keywords: involves substantial complexity, Designing integrated circuits, circuits involves substantial, Designing integrated, custom digital cells
Categories: Machine Learning (cs.LG)

Abstract:Designing integrated circuits involves substantial complexity, posing challenges in revealing its potential applications, from custom digital cells to analog circuits. Despite extensive research over the past decades in building versatile and automated frameworks, there remains room to explore more computationally efficient AI-based solutions. This paper introduces the graph composer GraCo, a novel method for synthesizing integrated circuits using reinforcement learning (RL). GraCo learns to construct a graph step-by-step, which is then converted into a netlist and simulated with SPICE. We demonstrate that GraCo is highly configurable, enabling the incorporation of prior design knowledge into the framework. We formalize how this prior knowledge can be utilized and, in particular, show that applying consistency checks enhances the efficiency of the sampling process. To evaluate its performance, we compare GraCo to a random baseline, which is known to perform well for smaller design space problems. We demonstrate that GraCo can discover circuits for tasks such as generating standard cells, including the inverter and the two-input NAND (NAND2) gate. Compared to a random baseline, GraCo requires 5x fewer sampling steps to design an inverter and successfully synthesizes a NAND2 gate that is 2.5x faster.

[LG-25] Exploring applications of topological data analysis in stock index movement prediction

Link: https://arxiv.org/abs/2411.13881
Authors: Dazhi Huang,Pengcheng Xu,Xiaocheng Huang,Jiayi Chen
Keywords: recently gained significant, gained significant attention, Topological Data Analysis, Data Analysis, recently gained
Categories: Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
Comments: 20 pages, 10 figures

Abstract:Topological Data Analysis (TDA) has recently gained significant attention in the field of financial prediction. However, the choice of point cloud construction methods, topological feature representations, and classification models has a substantial impact on prediction results. This paper addresses the classification problem of stock index movement. First, we construct point clouds for stock indices using three different methods. Next, we apply TDA to extract topological structures from the point clouds. Four distinct topological features are computed to represent the patterns in the data, and 15 combinations of these features are enumerated and input into six different machine learning models. We evaluate the predictive performance of various TDA configurations by conducting index movement classification tasks on datasets such as CSI, DAX, HSI and FTSE, providing insights into the efficiency of different TDA setups.

[LG-26] Exact and approximate error bounds for physics-informed neural networks NEURIPS2024

Link: https://arxiv.org/abs/2411.13848
Authors: Augusto T. Chantada, Pavlos Protopapas, Luca Gomez Bachar, Susana J. Landau, Claudia G. Scóccola
Keywords: solve differential equations, traditional numerical solvers, error bounds, increased recently, solve differential
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments: 10 pages, 1 figure, accepted to NeurIPS 2024 Workshop on Machine Learning and the Physical Sciences

Abstract:The use of neural networks to solve differential equations, as an alternative to traditional numerical solvers, has increased recently. However, error bounds for the obtained solutions have only been developed for certain equations. In this work, we report important progress in calculating error bounds of physics-informed neural networks (PINNs) solutions of nonlinear first-order ODEs. We give a general expression that describes the error of the solution that the PINN-based method provides for a nonlinear first-order ODE. In addition, we propose a technique to calculate an approximate bound for the general case and an exact bound for a particular case. The error bounds are computed using only the residual information and the equation structure. We apply the proposed methods to particular cases and show that they can successfully provide error bounds without relying on the numerical solution.
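
To illustrate how a residual alone can bound the solution error, here is a worked example for the simplest linear case with a constant coefficient a. This case is chosen here for illustration only; the paper treats nonlinear first-order ODEs, and this is not its derivation. The symbols \hat{u}, r and e are notation assumed for this sketch.

```latex
% Exact problem and the approximate (e.g. PINN) solution \hat{u}, which is
% assumed to satisfy the initial condition exactly:
u'(t) + a\,u(t) = f(t), \qquad u(0) = u_0, \qquad \hat{u}(0) = u_0 .

% Residual of the approximate solution:
r(t) = \hat{u}'(t) + a\,\hat{u}(t) - f(t) .

% Subtracting the two equations, the error e = \hat{u} - u satisfies
e'(t) + a\,e(t) = r(t), \qquad e(0) = 0 ,

% whose solution by the integrating factor e^{at} is
e(t) = \int_0^t e^{-a(t-s)}\, r(s)\,\mathrm{d}s ,

% so the error is bounded using only the residual and the equation structure:
|e(t)| \le \int_0^t e^{-a(t-s)}\, |r(s)|\,\mathrm{d}s .
```

No reference numerical solution appears anywhere in the bound, which matches the property the abstract highlights.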

[LG-27] Adaptable Embeddings Network (AEN)

Link: https://arxiv.org/abs/2411.13786
Authors: Stan Loosmore, Alexander Titus
Keywords: Modern day Language, significant computational cost, day Language Models, Modern day, day Language
Categories: Machine Learning (cs.LG)
Comments: 20 pages

Abstract: Modern-day Language Models see extensive use in text classification, yet this comes at significant computational cost. Compute-effective classification models are needed for low-resource environments, most notably on edge devices. We introduce Adaptable Embeddings Networks (AEN), a novel dual-encoder architecture using Kernel Density Estimation (KDE). This architecture allows for runtime adaptation of classification criteria without retraining and is non-autoregressive. Through thorough synthetic data experimentation, we demonstrate that our model produces results comparable, and in certain cases superior, to those of autoregressive models an order of magnitude larger than AEN. The architecture’s ability to preprocess and cache condition embeddings makes it ideal for edge computing applications and real-time monitoring systems.

[LG-28] On Generalization Bounds for Neural Networks with Low Rank Layers

Link: https://arxiv.org/abs/2411.13733
Authors: Andrea Pinto, Akshay Rangamani, Tomaso Poggio
Keywords: bounds remain underexplored, low-rank weight matrices, favour low-rank weight, previous optimization results, weight matrices
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Published in the MIT DSpace repository: this https URL

Abstract:While previous optimization results have suggested that deep neural networks tend to favour low-rank weight matrices, the implications of this inductive bias on generalization bounds remain underexplored. In this paper, we apply Maurer’s chain rule for Gaussian complexity to analyze how low-rank layers in deep networks can prevent the accumulation of rank and dimensionality factors that typically multiply across layers. This approach yields generalization bounds for rank and spectral norm constrained networks. We compare our results to prior generalization bounds for deep networks, highlighting how deep networks with low-rank layers can achieve better generalization than those with full-rank layers. Additionally, we discuss how this framework provides new perspectives on the generalization capabilities of deep networks exhibiting neural collapse.

[LG-29] Replicable Online Learning

Link: https://arxiv.org/abs/2411.13730
Authors: Saba Ahmadi, Siddharth Bhandari, Avrim Blum
Keywords: introduced by Impagliazzo, algorithmic replicability introduced, online, investigate the concept, concept of algorithmic
Categories: Machine Learning (cs.LG)

Abstract:We investigate the concept of algorithmic replicability introduced by Impagliazzo et al. 2022, Ghazi et al. 2021, Ahn et al. 2024 in an online setting. In our model, the input sequence received by the online learner is generated from time-varying distributions chosen by an adversary (obliviously). Our objective is to design low-regret online algorithms that, with high probability, produce the exact same sequence of actions when run on two independently sampled input sequences generated as described above. We refer to such algorithms as adversarially replicable. Previous works (such as Esfandiari et al. 2022) explored replicability in the online setting under inputs generated independently from a fixed distribution; we term this notion as iid-replicability. Our model generalizes to capture both adversarial and iid input sequences, as well as their mixtures, which can be modeled by setting certain distributions as point-masses. We demonstrate adversarially replicable online learning algorithms for online linear optimization and the experts problem that achieve sub-linear regret. Additionally, we propose a general framework for converting an online learner into an adversarially replicable one within our setting, bounding the new regret in terms of the original algorithm’s regret. We also present a nearly optimal (in terms of regret) iid-replicable online algorithm for the experts problem, highlighting the distinction between the iid and adversarial notions of replicability. Finally, we establish lower bounds on the regret (in terms of the replicability parameter and time) that any replicable online algorithm must incur.

[LG-30] Almost Sure Convergence Rates and Concentration of Stochastic Approximation and Reinforcement Learning with Markovian Noise

Link: https://arxiv.org/abs/2411.13711
Authors: Xiaochi Qian, Zixuan Xie, Xinyu Liu, Shangtong Zhang
Keywords: general contractive stochastic, stochastic approximation algorithms, contractive stochastic approximation, Markovian noise, paper establishes
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)

Abstract: This paper establishes the first almost sure convergence rate and the first maximal concentration bound with exponential tails for general contractive stochastic approximation algorithms with Markovian noise. As a corollary, we also obtain convergence rates in L^p. Key to our successes is a novel discretization of the mean ODE of stochastic approximation algorithms using intervals with diminishing (instead of constant) length. As applications, we provide the first almost sure convergence rate for Q-learning with Markovian samples without count-based learning rates. We also provide the first concentration bound for off-policy temporal difference learning with Markovian samples.

[LG-31] A Collaborative Ensemble Framework for CTR Prediction

Link: https://arxiv.org/abs/2411.13700
Authors: Xiaolong Liu, Zhichen Zeng, Xiaoyi Liu, Siyang Yuan, Weinan Song, Mengyue Hang, Yiqun Liu, Chaofei Yang, Donghyun Kim, Wen-Yen Chen, Jiyan Yang, Yiping Han, Rong Jin, Bo Long, Hanghang Tong, Philip S. Yu
Keywords: motivating extensive research, Recent advances, established scaling laws, Ensemble Training Network, Collaborative Ensemble Training
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)

Abstract:Recent advances in foundation models have established scaling laws that enable the development of larger models to achieve enhanced performance, motivating extensive research into large-scale recommendation models. However, simply increasing the model size in recommendation systems, even with large amounts of data, does not always result in the expected performance improvements. In this paper, we propose a novel framework, Collaborative Ensemble Training Network (CETNet), to leverage multiple distinct models, each with its own embedding table, to capture unique feature interaction patterns. Unlike naive model scaling, our approach emphasizes diversity and collaboration through collaborative learning, where models iteratively refine their predictions. To dynamically balance contributions from each model, we introduce a confidence-based fusion mechanism using general softmax, where model confidence is computed via negation entropy. This design ensures that more confident models have a greater influence on the final prediction while benefiting from the complementary strengths of other models. We validate our framework on three public datasets (AmazonElectronics, TaobaoAds, and KuaiVideo) as well as a large-scale industrial dataset from Meta, demonstrating its superior performance over individual models and state-of-the-art baselines. Additionally, we conduct further experiments on the Criteo and Avazu datasets to compare our method with the multi-embedding paradigm. Our results show that our framework achieves comparable or better performance with smaller embedding sizes, offering a scalable and efficient solution for CTR prediction tasks.
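
The confidence-based fusion described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not CETNet's implementation: each model is assumed to output a click probability, confidence is taken to be the negative binary entropy of that probability, and a temperature-scaled softmax over confidences gives the fusion weights. The function names and the `temperature` parameter are hypothetical.

```python
import math
from typing import List


def negative_entropy(p: float, eps: float = 1e-12) -> float:
    """Negative binary entropy of a click probability; closer to 0 means more confident."""
    p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
    return p * math.log(p) + (1.0 - p) * math.log(1.0 - p)


def softmax(scores: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax with an assumed temperature parameter."""
    scaled = [s / temperature for s in scores]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def fusion_weights(probs: List[float], temperature: float = 1.0) -> List[float]:
    """Weight each model by the softmax of its confidence."""
    return softmax([negative_entropy(p) for p in probs], temperature)


def fuse(probs: List[float], temperature: float = 1.0) -> float:
    """Confidence-weighted average of the per-model predictions."""
    weights = fusion_weights(probs, temperature)
    return sum(w * p for w, p in zip(weights, probs))


if __name__ == "__main__":
    # One confident model (0.95) and one uncertain model (0.5).
    print(fusion_weights([0.95, 0.5]))
    print(fuse([0.95, 0.5]))
```

With these assumptions, a model that predicts close to 0 or 1 pulls the fused prediction toward its own output, while a model near 0.5 contributes less.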

[LG-32] Multi-Agent Best Arm Identification in Stochastic Linear Bandits

Link: https://arxiv.org/abs/2411.13690
Authors: Sanjana Agrawal, Saúl A. Blanco
Keywords: collaborative best-arm identification, stochastic linear bandits, linear bandit instance, fixed-budget scenario, study the problem
Categories: Machine Learning (cs.LG)

Abstract:We study the problem of collaborative best-arm identification in stochastic linear bandits under a fixed-budget scenario. In our learning model, we consider multiple agents connected through a star network or a generic network, interacting with a linear bandit instance in parallel. The objective of the agents is to collaboratively learn the best arm of the given bandit instance with the help of a central server while minimizing the probability of error in best arm estimation. For this purpose, we devise the algorithms MaLinBAI-Star and MaLinBAI-Gen for star networks and generic networks respectively. Both algorithms employ an Upper-Confidence-Bound approach where agents share their knowledge through the central server during each communication round. We demonstrate, both theoretically and empirically, that our algorithms enjoy exponentially decaying probability of error in the allocated time budget. Furthermore, experimental results based on synthetic and real-world data validate the effectiveness of our algorithms over the existing multi-agent algorithms.

[LG-33] Investigating Graph Neural Networks and Classical Feature-Extraction Techniques in Activity-Cliff and Molecular Property Prediction

Link: https://arxiv.org/abs/2411.13688
Authors: Markus Dablander
Keywords: Molecular featurisation refers, numerical feature vectors, Molecular, data into numerical, classical molecular featurisations
Categories: Machine Learning (cs.LG); Biomolecules (q-bio.BM); Machine Learning (stat.ML)
Comments: Doctoral Thesis (Mathematical Institute, University of Oxford)

Abstract:Molecular featurisation refers to the transformation of molecular data into numerical feature vectors. It is one of the key research areas in molecular machine learning and computational drug discovery. Recently, message-passing graph neural networks (GNNs) have emerged as a novel method to learn differentiable features directly from molecular graphs. While such techniques hold great promise, further investigations are needed to clarify if and when they indeed manage to definitively outcompete classical molecular featurisations such as extended-connectivity fingerprints (ECFPs) and physicochemical-descriptor vectors (PDVs). We systematically explore and further develop classical and graph-based molecular featurisation methods for two important tasks: molecular property prediction, in particular, quantitative structure-activity relationship (QSAR) prediction, and the largely unexplored challenge of activity-cliff (AC) prediction. We first give a technical description and critical analysis of PDVs, ECFPs and message-passing GNNs, with a focus on graph isomorphism networks (GINs). We then conduct a rigorous computational study to compare the performance of PDVs, ECFPs and GINs for QSAR and AC-prediction. Following this, we mathematically describe and computationally evaluate a novel twin neural network model for AC-prediction. We further introduce an operation called substructure pooling for the vectorisation of structural fingerprints as a natural counterpart to graph pooling in GNN architectures. We go on to propose Sort Slice, a simple substructure-pooling technique for ECFPs that robustly outperforms hash-based folding at molecular property prediction. Finally, we outline two ideas for future research: (i) a graph-based self-supervised learning strategy to make classical molecular featurisations trainable, and (ii) trainable substructure-pooling via differentiable self-attention.
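
To make the contrast between hash-based folding and a substructure-pooling alternative concrete, here is a small sketch. The abstract does not spell out how Sort Slice works; the version below assumes it ranks substructure identifiers by how many training molecules contain them and keeps the most frequent ones, which should be read as an assumption rather than the thesis's definition. Molecules are represented by hypothetical sets of integer substructure identifiers, and all function names are illustrative.

```python
from collections import Counter
from typing import Iterable, List, Set


def hash_fold(substructure_ids: Set[int], length: int) -> List[int]:
    """Fold identifiers into a binary vector via modulo; distinct ids may collide."""
    vec = [0] * length
    for sid in substructure_ids:
        vec[sid % length] = 1
    return vec


def fit_vocabulary(training_sets: Iterable[Set[int]], length: int) -> List[int]:
    """Keep the `length` identifiers found in the most training molecules (assumed rule)."""
    counts = Counter(sid for ids in training_sets for sid in ids)
    # Sort by descending frequency; break ties by identifier for determinism.
    ranked = sorted(counts, key=lambda sid: (-counts[sid], sid))
    return ranked[:length]


def pool_with_vocabulary(substructure_ids: Set[int], vocabulary: List[int]) -> List[int]:
    """One bit per retained substructure, so no two substructures share a position."""
    return [1 if sid in substructure_ids else 0 for sid in vocabulary]


if __name__ == "__main__":
    train = [{10, 14, 7}, {10, 14}, {10, 3}]
    vocab = fit_vocabulary(train, length=2)
    print(vocab)
    print(pool_with_vocabulary({14, 3}, vocab))
    print(hash_fold({10, 14}, length=4))  # ids 10 and 14 land in the same slot
```

The design difference is that folding loses information through collisions, whereas a frequency-ranked vocabulary loses it by discarding rare substructures.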

[LG-34] Differentially Private Learning Beyond the Classical Dimensionality Regime

Link: https://arxiv.org/abs/2411.13682
Authors: Cynthia Dwork, Pranay Tankala, Linjun Zhang
Keywords: differentially private learning, proportional dimensionality regime, differentially private, high-dimensional differentially private, private learning
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Data Structures and Algorithms (cs.DS)

Abstract:We initiate the study of differentially private learning in the proportional dimensionality regime, in which the number of data samples n and problem dimension d approach infinity at rates proportional to one another, meaning that d / n \to \delta as n \to \infty for an arbitrary, given constant \delta \in (0, \infty) . This setting is significantly more challenging than that of all prior theoretical work in high-dimensional differentially private learning, which, despite the name, has assumed that \delta = 0 or is sufficiently small for problems of sample complexity O(d) , a regime typically considered “low-dimensional” or “classical” by modern standards in high-dimensional statistics. We provide sharp theoretical estimates of the error of several well-studied differentially private algorithms for robust linear regression and logistic regression, including output perturbation, objective perturbation, and noisy stochastic gradient descent, in the proportional dimensionality regime. The 1 + o(1) factor precision of our error estimates enables a far more nuanced understanding of the price of privacy of these algorithms than that afforded by existing, coarser analyses, which are essentially vacuous in the regime we consider. We incorporate several probabilistic tools that have not previously been used to analyze differentially private learning algorithms, such as a modern Gaussian comparison inequality and recent universality laws with origins in statistical physics.

[LG-35] Efficient Streaming Voice Steganalysis in Challenging Detection Scenarios

Link: https://arxiv.org/abs/2411.13612
Authors: Pengcheng Zhou, Zhengyang Fang, Zhongliang Yang, Zhili Zhou, Linna Zhou
Keywords: embed secret information, achieve concealed communication, efficiently embed secret, information hiding techniques, transmitted network media
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract: In recent years, there has been an increasing number of information hiding techniques based on network streaming media, focusing on how to covertly and efficiently embed secret information into real-time transmitted network media signals to achieve concealed communication. The misuse of these techniques can lead to significant security risks, such as the spread of malicious code, commands, and viruses. Current steganalysis methods for network voice streams face two major challenges: efficient detection under low embedding rates and short duration conditions. These challenges arise because, with low embedding rates (e.g., as low as 10%) and short transmission durations (e.g., only 0.1 second), detection models struggle to acquire sufficiently rich sample features, making effective steganalysis difficult. To address these challenges, this paper introduces a Dual-View VoIP Steganalysis Framework (DVSF). The framework first randomly obfuscates parts of the native steganographic descriptors in VoIP stream segments, making the steganographic features of hard-to-detect samples more pronounced and easier to learn. It then captures fine-grained local features related to steganography, building on the global features of VoIP. Specially constructed VoIP segment triplets further adjust the feature distances within the model. Ultimately, this method effectively addresses the detection difficulty in VoIP. Extensive experiments demonstrate that our method significantly improves the accuracy of streaming voice steganalysis in these challenging detection scenarios, surpassing existing state-of-the-art methods and offering superior near-real-time performance.

[LG-36] Preserving Expert-Level Privacy in Offline Reinforcement Learning

Link: https://arxiv.org/abs/2411.13598
Authors: Navodita Sharma, Vishnu Vinod, Abhradeep Thakurta, Alekh Agarwal, Borja Balle, Christoph Dann, Aravindan Raghuveer
Keywords: offline reinforcement learning, historical data collected, reinforcement learning, problem aims, behavioural policies
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract:The offline reinforcement learning (RL) problem aims to learn an optimal policy from historical data collected by one or more behavioural policies (experts) by interacting with an environment. However, the individual experts may be privacy-sensitive in that the learnt policy may retain information about their precise choices. In some domains like personalized retrieval, advertising and healthcare, the expert choices are considered sensitive data. To provably protect the privacy of such experts, we propose a novel consensus-based expert-level differentially private offline RL training approach compatible with any existing offline RL algorithm. We prove rigorous differential privacy guarantees, while maintaining strong empirical performance. Unlike existing work in differentially private RL, we supplement the theory with proof-of-concept experiments on classic RL environments featuring large continuous state spaces, demonstrating substantial improvements over a natural baseline across multiple tasks.

[LG-37] Browser Extension for Fake URL Detection

Link: https://arxiv.org/abs/2411.13581
Authors: Latesh G. Malik, Rohini Shambharkar, Shivam Morey, Shubhlak Kanpate, Vedika Raut
Keywords: machine learning models, increased significantly, recent years, Spam Email detection, potential to damage
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 5 pages, 2 figures

Abstract: In recent years, cyber attacks have increased in number, and with them, the intensity of the attacks and their potential to damage the user have also increased significantly. In an ever-advancing world, users find it difficult to keep up with the latest developments in technology, which can leave them vulnerable to attacks. Tools are needed to deter such attacks, and machine learning models are among the best options. This paper presents a browser extension that uses machine learning models to enhance online security by integrating three crucial functionalities: malicious URL detection, spam email detection, and network log analysis. The proposed solution uses an LGBM classifier to classify phishing websites; the model was trained on a dataset with 87 features and achieved an accuracy of 96.5%, a precision of 96.8%, and an F1 score of 96.49%. The spam email detection model uses the Multinomial NB algorithm, trained on a dataset of over 5500 messages; it achieved an accuracy of 97.09% with a precision of 100%. The results demonstrate the effectiveness of using machine learning models for cyber security.

[LG-38] Persistent Homology for Structural Characterization in Disordered Systems

Link: https://arxiv.org/abs/2411.14390
Authors: An Wang, Li Zou
Keywords: unified framework based, persistent homology, propose a unified, global phase structure, classifying global phases
Categories: Disordered Systems and Neural Networks (cond-mat.dis-nn); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Mathematical Physics (math-ph); Algebraic Topology (math.AT)
Comments: 19 pages, 17 figures

Abstract: We propose a unified framework based on persistent homology (PH) to characterize both local and global structures in disordered systems. It can simultaneously generate local and global descriptors using the same algorithm and data structure, and has been shown to be highly effective and interpretable in predicting particle rearrangements and classifying global phases. Based on this framework, we define a non-parametric metric, the Separation Index (SI), which not only outperforms traditional bond-orientational order parameters in phase classification tasks but also establishes a connection between particle environments and the global phase structure. Our methods provide an effective framework for understanding and analyzing the properties of disordered materials, with broad potential applications in materials science and even wider studies of complex systems.

[LG-39] CoNFiLD-inlet: Synthetic Turbulence Inflow Using Generative Latent Diffusion Models with Neural Fields

Link: https://arxiv.org/abs/2411.14378
Authors: Xin-Yang Liu, Meet Hemant Parikh, Xiantao Fan, Pan Du, Qing Wang, Yi-Fan Chen, Jian-Xun Wang
Keywords: Eddy-resolving turbulence simulations, simulations require stochastic, Eddy-resolving turbulence, turbulence simulations require, replicate the complex
Categories: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
Comments: 27 pages, 10 figures

Abstract: Eddy-resolving turbulence simulations require stochastic inflow conditions that accurately replicate the complex, multi-scale structures of turbulence. Traditional recycling-based methods rely on computationally expensive precursor simulations, while existing synthetic inflow generators often fail to reproduce realistic coherent structures of turbulence. Recent advances in deep learning (DL) have opened new possibilities for inflow turbulence generation, yet many DL-based methods rely on deterministic, autoregressive frameworks prone to error accumulation, resulting in poor robustness for long-term predictions. In this work, we present CoNFiLD-inlet, a novel DL-based inflow turbulence generator that integrates diffusion models with a conditional neural field (CNF)-encoded latent space to produce realistic, stochastic inflow turbulence. By parameterizing inflow conditions using Reynolds numbers, CoNFiLD-inlet generalizes effectively across a wide range of Reynolds numbers (Re_\tau between 10^3 and 10^4) without requiring retraining or parameter tuning. Comprehensive validation through a priori and a posteriori tests in Direct Numerical Simulation (DNS) and Wall-Modeled Large Eddy Simulation (WMLES) demonstrates its high fidelity, robustness, and scalability, positioning it as an efficient and versatile solution for inflow turbulence synthesis.

[LG-40] Indiscriminate Disruption of Conditional Inference on Multivariate Gaussians

Link: https://arxiv.org/abs/2411.14351
Authors: William N. Caballero, Matthew LaRosa, Alexander Fisher, Vahid Tarokh
Keywords: Gaussian influence diagrams, underpins myriad operations-research, Gaussian distribution underpins, Bayesian optimization, multivariate Gaussian distribution
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
Comments: 30 pages, 6 figures; 4 tables

Abstract: The multivariate Gaussian distribution underpins myriad operations-research, decision-analytic, and machine-learning models (e.g., Bayesian optimization, Gaussian influence diagrams, and variational autoencoders). However, despite recent advances in adversarial machine learning (AML), inference for Gaussian models in the presence of an adversary is notably understudied. Therefore, we consider a self-interested attacker who wishes to disrupt a decisionmaker’s conditional inference and subsequent actions by corrupting a set of evidentiary variables. To avoid detection, the attacker also desires the attack to appear plausible, where plausibility is determined by the density of the corrupted evidence. We consider white- and grey-box settings in which the attacker has complete and incomplete knowledge, respectively, of the decisionmaker’s underlying multivariate Gaussian distribution. Select instances are shown to reduce to quadratic and stochastic quadratic programs, and structural properties are derived to inform solution methods. We assess the impact and efficacy of these attacks in three examples: real estate evaluation, interest rate estimation, and signal processing. Each example leverages an alternative underlying model, thereby highlighting the attacks’ broad applicability. Through these applications, we also juxtapose the behavior of the white- and grey-box attacks to understand how uncertainty and structure affect attacker behavior.
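
As a rough illustration of the trade-off the attacker faces, the sketch below uses the standard conditioning formulas for a bivariate Gaussian: shifting the observed evidence moves the decisionmaker's conditional mean, but also lowers the density of that evidence. It is a toy two-variable case assumed for illustration, not the paper's attack formulation, and the function names and numbers are hypothetical.

```python
import math
from typing import Tuple


def conditional_gaussian(mu_x: float, mu_y: float, var_x: float, var_y: float,
                         cov_xy: float, y_obs: float) -> Tuple[float, float]:
    """Mean and variance of X given Y = y_obs for a bivariate Gaussian."""
    gain = cov_xy / var_y                      # regression coefficient of X on Y
    mean = mu_x + gain * (y_obs - mu_y)
    var = var_x - gain * cov_xy
    return mean, var


def evidence_log_density(mu_y: float, var_y: float, y_obs: float) -> float:
    """Log-density of the (possibly corrupted) evidence under its Gaussian marginal."""
    return -0.5 * math.log(2.0 * math.pi * var_y) - (y_obs - mu_y) ** 2 / (2.0 * var_y)


if __name__ == "__main__":
    params = dict(mu_x=0.0, mu_y=0.0, var_x=1.0, var_y=1.0, cov_xy=0.8)
    for y in (1.0, 2.0):  # true evidence vs. a hypothetical corrupted value
        mean, var = conditional_gaussian(y_obs=y, **params)
        print(y, mean, var, evidence_log_density(0.0, 1.0, y))
```

Larger shifts in the evidence disrupt the inference more but look less plausible, which is the tension the abstract describes.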

[LG-41] Logarithmic Neyman Regret for Adaptive Estimation of the Average Treatment Effect AISTATS2025

Link: https://arxiv.org/abs/2411.14341
Authors: Ojash Neopane, Aaditya Ramdas, Aarti Singh
Keywords: Average Treatment Effect, Evaluation in Reinforcement, Off-Policy Evaluation, Treatment Effect, Average Treatment
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 12 pages, 2 figures. Submitted to AISTATS 2025

Abstract: Estimation of the Average Treatment Effect (ATE) is a core problem in causal inference with strong connections to Off-Policy Evaluation in Reinforcement Learning. This paper considers the problem of adaptively selecting the treatment allocation probability in order to improve estimation of the ATE. The majority of prior work on adaptive ATE estimation focuses on asymptotic guarantees, and in turn overlooks important practical considerations such as the difficulty of learning the optimal treatment allocation as well as hyper-parameter selection. Existing non-asymptotic methods are limited by poor empirical performance and exponential scaling of the Neyman regret with respect to problem parameters. In order to address these gaps, we propose and analyze the Clipped Second Moment Tracking (ClipSMT) algorithm, a variant of an existing algorithm with strong asymptotic optimality guarantees, and provide finite sample bounds on its Neyman regret. Our analysis shows that ClipSMT achieves exponential improvements in Neyman regret on two fronts: improving the dependence on T from O(\sqrt{T}) to O(\log T), as well as reducing the exponential dependence on problem parameters to a polynomial dependence. Finally, we conclude with simulations which show the marked improvement of ClipSMT over existing approaches.
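
For intuition about what "tracking second moments" and "clipping" refer to, the sketch below computes the textbook variance-minimizing treatment probability for an inverse-probability-weighted ATE estimator and clips it away from 0 and 1. It is not the ClipSMT algorithm: the choice of estimator, the fixed clipping level, and all function names are assumptions made here for illustration.

```python
import math


def neyman_allocation(second_moment_treated: float, second_moment_control: float) -> float:
    """Treatment probability minimizing m1/pi + m0/(1 - pi), i.e. s1 / (s1 + s0)."""
    s1 = math.sqrt(second_moment_treated)
    s0 = math.sqrt(second_moment_control)
    if s1 + s0 == 0.0:
        return 0.5  # no information: fall back to a balanced design
    return s1 / (s1 + s0)


def clip(p: float, lower: float) -> float:
    """Keep the allocation inside [lower, 1 - lower] (assumed fixed clipping level)."""
    return min(max(p, lower), 1.0 - lower)


def ipw_second_moment(pi: float, m1: float, m0: float) -> float:
    """Allocation-dependent part of the per-sample variance of the IPW estimator."""
    return m1 / pi + m0 / (1.0 - pi)


if __name__ == "__main__":
    m1, m0 = 4.0, 1.0  # hypothetical second moments E[Y(1)^2], E[Y(0)^2]
    pi_star = neyman_allocation(m1, m0)
    print(pi_star, ipw_second_moment(pi_star, m1, m0), ipw_second_moment(0.5, m1, m0))
```

In an adaptive design the second moments are unknown and would be replaced by running estimates, which is where clipping guards against early estimates pushing the allocation to an extreme.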

[LG-42] Model-free learning of probability flows: Elucidating the nonequilibrium dynamics of flocking

Link: https://arxiv.org/abs/2411.14317
Authors: Nicholas M. Boffi, Eric Vanden-Eijnden
Keywords: autonomously dissipate energy, individual components autonomously, components autonomously dissipate, Active systems comprise, dissipate energy
Categories: Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); Probability (math.PR)

Abstract:Active systems comprise a class of nonequilibrium dynamics in which individual components autonomously dissipate energy. Efforts towards understanding the role played by activity have centered on computation of the entropy production rate (EPR), which quantifies the breakdown of time reversal symmetry. A fundamental difficulty in this program is that high dimensionality of the phase space renders traditional computational techniques infeasible for estimating the EPR. Here, we overcome this challenge with a novel deep learning approach that estimates probability currents directly from stochastic system trajectories. We derive a new physical connection between the probability current and two local definitions of the EPR for inertial systems, which we apply to characterize the departure from equilibrium in a canonical model of flocking. Our results highlight that entropy is produced and consumed on the spatial interface of a flock as the interplay between alignment and fluctuation dynamically creates and annihilates order. By enabling the direct visualization of when and where a given system is out of equilibrium, we anticipate that our methodology will advance the understanding of a broad class of complex nonequilibrium dynamics.

[LG-43] Learning Pore-scale Multi-phase Flow from Experimental Data with Graph Neural Network NEURIPS2024

Link: https://arxiv.org/abs/2411.14192
Authors: Yuxuan Gu, Catherine Spurin, Gege Wen
Keywords: change mitigation technologies, climate change mitigation, geological storage, hydrogen storage, mitigation technologies
Categories: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
Comments: Accepted for Machine Learning and the Physical Sciences Workshop at the 38th conference on Neural Information Processing Systems (NeurIPS 2024)

Abstract: Understanding the process of multiphase fluid flow through porous media is crucial for many climate change mitigation technologies, including CO_2 geological storage, hydrogen storage, and fuel cells. However, current numerical models are often incapable of accurately capturing the complex pore-scale physics observed in experiments. In this study, we address this challenge using a graph neural network-based approach and directly learn pore-scale fluid flow using micro-CT experimental data. We propose a Long-Short-Edge MeshGraphNet (LSE-MGN) that predicts the state of each node in the pore space at each time step. During inference, given an initial state, the model can autoregressively predict the evolution of the multiphase flow process over time. This approach successfully captures the physics from the high-resolution experimental data while maintaining computational efficiency, providing a promising direction for accurate and efficient pore-scale modeling of complex multiphase fluid flow dynamics.

[LG-44] SPARKLE: A Unified Single-Loop Primal-Dual Framework for Decentralized Bilevel Optimization

Link: https://arxiv.org/abs/2411.14166
Authors: Shuchen Zhu, Boao Kong, Songtao Lu, Xinmeng Huang, Kun Yuan
Keywords: multiple agents collaborate, problems involving nested, involving nested optimization, nested optimization structures, decentralized bilevel optimization
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 73 pages, the Thirty-Eighth Annual Conference on Neural Information Processing Systems (2024)

Abstract: This paper studies decentralized bilevel optimization, in which multiple agents collaborate to solve problems involving nested optimization structures with neighborhood communications. Most existing literature primarily utilizes gradient tracking to mitigate the influence of data heterogeneity, without exploring other well-known heterogeneity-correction techniques such as EXTRA or Exact Diffusion. Additionally, these studies often employ identical decentralized strategies for both upper- and lower-level problems, neglecting to leverage distinct mechanisms across different levels. To address these limitations, this paper proposes SPARKLE, a unified Single-loop Primal-dual AlgoRithm frameworK for decentraLized bilEvel optimization. SPARKLE offers the flexibility to incorporate various heterogeneity-correction strategies into the algorithm. Moreover, SPARKLE allows for different strategies to solve upper- and lower-level problems. We present a unified convergence analysis for SPARKLE, applicable to all its variants, with state-of-the-art convergence rates compared to existing decentralized bilevel algorithms. Our results further reveal that EXTRA and Exact Diffusion are more suitable for decentralized bilevel optimization, and using mixed strategies in bilevel algorithms brings more benefits than relying solely on gradient tracking.

[LG-45] Adjoint-based online learning of two-layer quasi-geostrophic baroclinic turbulence

Link: https://arxiv.org/abs/2411.14106
Authors: Fei Er Yan,Hugo Frezat,Julien Le Sommer,Julian Mak,Karl Otness
Keywords: Earth System Modeling, global ocean circulation, modeled ocean circulation, Earth System, ocean circulation models
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
Comments: 25 pages, 1 table, 8 figures

Abstract:For reasons of computational constraint, most global ocean circulation models used for Earth System Modeling still rely on parameterizations of sub-grid processes, and limitations in these parameterizations affect the modeled ocean circulation and impact on predictive skill. An increasingly popular approach is to leverage machine learning approaches for parameterizations, regressing for a map between the resolved state and missing feedbacks in a fluid system as a supervised learning task. However, the learning is often performed in an ‘offline’ fashion, without involving the underlying fluid dynamical model during the training stage. Here, we explore the ‘online’ approach that involves the fluid dynamical model during the training stage for the learning of baroclinic turbulence and its parameterization, with reference to ocean eddy parameterization. Two online approaches are considered: a full adjoint-based online approach, related to traditional adjoint optimization approaches that require a ‘differentiable’ dynamical model, and an approximately online approach that approximates the adjoint calculation and does not require a differentiable dynamical model. The online approaches are found to be generally more skillful and numerically stable than offline approaches. Other details relating to online training, such as window size, machine learning model setup, and designs of the loss functions, are detailed to aid in further explorations of the online training methodology for Earth System Modeling.

[LG-46] Single-Model Attribution for Spoofed Speech via Vocoder Fingerprints in an Open-World Setting

Link: https://arxiv.org/abs/2411.14013
Authors: Matías Pizarro,Mike Laszkiewicz,Dorothea Kolossa,Asja Fischer
Keywords: generation technology advances, speech generation technology, misusing spoofed speech, spoofed speech signals, technology advances
Categories: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)

Abstract:As speech generation technology advances, so do the potential threats of misusing spoofed speech signals. One way to address these threats is by attributing the signals to their source generative model. In this work, we are the first to tackle the single-model attribution task in an open-world setting, that is, we aim at identifying whether spoofed speech signals from unknown sources originate from a specific vocoder. We show that the standardized average residual between audio signals and their low-pass filtered or EnCodec filtered versions can serve as powerful vocoder fingerprints. The approach only requires data from the target vocoder and allows for simple but highly accurate distance-based model attribution. We demonstrate its effectiveness on LJSpeech and JSUT, achieving an average AUROC of over 99% in most settings. The accompanying robustness study shows that it is also resilient to noise levels up to a certain degree.

[LG-47] Accelerated zero-order SGD under high-order smoothness and overparameterized regime

Link: https://arxiv.org/abs/2411.13999
Authors: Georgii Bychkov,Darina Dvinskikh,Anastasia Antsiferova,Alexander Gasnikov,Aleksandr Lobanov
Keywords: adversarial multi-armed bandit, multi-armed bandit problem, convex stochastic optimization, stochastic optimization problem, encountered in medicine
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments: 10 pages, 1 figure

Abstract:We present a novel gradient-free algorithm to solve a convex stochastic optimization problem, such as those encountered in medicine, physics, and machine learning (e.g., the adversarial multi-armed bandit problem), where the objective function can only be computed through numerical simulation, either as the result of a real experiment or as feedback given by the function evaluations from an adversary. Thus we suppose that only black-box access to the function values of the objective is available, possibly corrupted by adversarial noise: deterministic or stochastic. The noisy setup can arise naturally from modeling randomness within a simulation or by computer discretization, or when exact function values are forbidden due to privacy issues, or when solving non-convex problems as convex ones with an inexact function oracle. By exploiting higher-order smoothness, fulfilled, e.g., in logistic regression, we improve the performance of zero-order methods developed under the assumption of classical smoothness (or having a Lipschitz gradient). The proposed algorithm enjoys optimal oracle complexity and is designed under an overparameterization setup, i.e., when the number of model parameters is much larger than the size of the training dataset. Overparameterized models fit the training data perfectly while also having good generalization and outperforming underparameterized models on unseen data. We provide convergence guarantees for the proposed algorithm under both types of noise. Moreover, we estimate the maximum permissible adversarial noise level that maintains the desired accuracy in the Euclidean setup, and then we extend our results to a non-Euclidean setup. Our theoretical results are verified on the logistic regression problem.

[LG-48] Movable Antenna-Equipped UAV for Data Collection in Backscatter Sensor Networks: A Deep Reinforcement Learning-based Approach

Link: https://arxiv.org/abs/2411.13970
Authors: Yu Bai,Boxuan Xie,Ruifan Zhu,Zheng Chang,Riku Jantti
Keywords: wireless sensor networks, promising energy-efficient solution, future wireless sensor, sensor networks, data collection time
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)

Abstract:Backscatter communication (BC) is becoming a promising energy-efficient solution for future wireless sensor networks (WSNs). Unmanned aerial vehicles (UAVs) enable flexible data collection from remote backscatter devices (BDs), yet conventional UAVs rely on omni-directional fixed-position antennas (FPAs), limiting channel gain and prolonging data collection time. To address this issue, we consider equipping a UAV with a directional movable antenna (MA) with high directivity and flexibility. The MA enhances channel gain by precisely aiming its main lobe at each BD, focusing transmission power for efficient communication. Our goal is to minimize the total data collection time by jointly optimizing the UAV’s trajectory and the MA’s orientation. We develop a deep reinforcement learning (DRL)-based strategy using the azimuth angle and distance between the UAV and each BD to simplify the agent’s observation space. To ensure stability during training, we adopt the Soft Actor-Critic (SAC) algorithm, which balances exploration with reward maximization for efficient and reliable learning. Simulation results demonstrate that our proposed MA-equipped UAV with SAC outperforms both FPA-equipped UAVs and other RL methods, achieving significant reductions in both data collection time and energy consumption.

[LG-49] Exponentially Consistent Nonparametric Clustering of Data Streams

Link: https://arxiv.org/abs/2411.13922
Authors: Bhupender Singh,Ananth Ram Rajagopalan,Srikrishna Bhashyam
Keywords: data streams generated, data streams, data streams belong, independent and identically, identically distributed
Categories: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP)

Abstract:In this paper, we consider nonparametric clustering of M independent and identically distributed (i.i.d.) data streams generated from unknown distributions. The distributions of the M data streams belong to K underlying distribution clusters. Existing results on exponentially consistent nonparametric clustering algorithms, like single linkage-based (SLINK) clustering and k-medoids distribution clustering, assume that the maximum intra-cluster distance (d_L) is smaller than the minimum inter-cluster distance (d_H). First, in the fixed sample size (FSS) setting, we show that exponential consistency can be achieved for SLINK clustering under a less strict assumption, d_I < d_H, where d_I is the maximum distance between any two sub-clusters of a cluster that partition the cluster. Note that d_I < d_L in general. Our results show that SLINK is exponentially consistent for a larger class of problems than k-medoids distribution clustering. We also identify examples where k-medoids clustering is unable to find the true clusters, but SLINK is exponentially consistent. Then, we propose a sequential clustering algorithm, named SLINK-SEQ, based on SLINK and prove that it is also exponentially consistent. Simulation results show that the SLINK-SEQ algorithm requires a smaller expected number of samples than the FSS SLINK algorithm for the same probability of error.
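
To make the fixed-sample-size setting concrete, here is a minimal sketch of single-linkage clustering applied to data streams. It is an illustration under assumptions, not the authors' algorithm: the Kolmogorov-Smirnov distance between empirical distributions, the function names `ks_distance` and `single_linkage`, and the stopping rule (a known number of clusters `k`) are all choices made here for the example. The sequential variant SLINK-SEQ is not shown.

```python
from itertools import combinations


def ks_distance(xs, ys):
    """Kolmogorov-Smirnov distance between two empirical distributions."""
    xs, ys = sorted(xs), sorted(ys)
    points = sorted(set(xs) | set(ys))
    best = 0.0
    i = j = 0
    for p in points:
        # Advance both empirical CDFs up to and including the point p.
        while i < len(xs) and xs[i] <= p:
            i += 1
        while j < len(ys) and ys[j] <= p:
            j += 1
        best = max(best, abs(i / len(xs) - j / len(ys)))
    return best


def single_linkage(streams, k):
    """Merge the two closest clusters of streams until k clusters remain."""
    clusters = [[i] for i in range(len(streams))]
    dist = {
        (a, b): ks_distance(streams[a], streams[b])
        for a, b in combinations(range(len(streams)), 2)
    }

    def linkage(c1, c2):
        # Single linkage: distance between the closest pair of members.
        return min(dist[min(a, b), max(a, b)] for a in c1 for b in c2)

    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return [sorted(c) for c in clusters]
```

Because single linkage only looks at the closest pair across two groups, a cluster whose members form a chain can still be recovered even when its diameter is large, which is the intuition behind replacing d_L with the smaller d_I in the assumption.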

[LG-50] Topology optimization of periodic lattice structures for specified mechanical properties using machine learning considering member connectivity

Link: https://arxiv.org/abs/2411.13869
Authors: Tomoya Matsuoka,Makoto Ohsaki,Kazuki Hayashi
Keywords: utilize machine learning, periodic lattice structures, machine learning, methodology to utilize, utilize machine
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments: Presented at Asian Congress of Structural and Multidisciplinary Optimization (ACSMO 2024)

Abstract:This study proposes a methodology to utilize machine learning (ML) for topology optimization of periodic lattice structures. In particular, we investigate data representation of lattice structures used as input data for ML models to improve the performance of the models, focusing on the filtering process and feature selection. We use the filtering technique to explicitly consider the connectivity of lattice members and perform feature selection to reduce the input data size. In addition, we propose a convolution approach to apply pre-trained models for small structures to structures of larger sizes. The computational cost for obtaining optimal topologies by a heuristic method is reduced by incorporating the prediction of the trained ML model into the optimization process. In the numerical examples, a response prediction model is constructed for a lattice structure of 4x4 units, and topology optimization of 4x4-unit and 8x8-unit structures is performed by simulated annealing assisted by the trained ML model. The example demonstrates that ML models achieve higher accuracy when using the filtered data as input than when using only the data representing the existence of each member. It is also demonstrated that a small-scale prediction model can be constructed with sufficient accuracy by feature selection. Additionally, the proposed method can find the optimal structure in less computation time than pure simulated annealing.

[LG-51] FLRNet: A Deep Learning Method for Regressive Reconstruction of Flow Field From Limited Sensor Measurements

Link: https://arxiv.org/abs/2411.13815
Authors: Phong C. H. Nguyen,Joseph B. Choi,Quang-Trung Luu
Keywords: mechanics require effective, limited sensor data, require effective methods, experimental fluid mechanics, fluid mechanics require
Categories: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)

Abstract:Many applications in computational and experimental fluid mechanics require effective methods for reconstructing the flow fields from limited sensor data. However, this task remains a significant challenge because the measurement operator, which provides the punctual sensor measurement for a given state of the flow field, is often ill-conditioned and non-invertible. This issue impedes the feasibility of identifying the forward map, theoretically the inverse of the measurement operator, for field reconstruction purposes. While data-driven methods are available, their generalizability across different flow conditions (e.g., different Reynolds numbers) remains in question. Moreover, they frequently face the problem of spectral bias, which leads to smooth and blurry reconstructed fields, thereby decreasing the accuracy of reconstruction. We introduce FLRNet, a deep learning method for flow field reconstruction from sparse sensor measurements. FLRNet employs a variational autoencoder with Fourier feature layers and incorporates an extra perceptual loss term during training to learn a rich, low-dimensional latent representation of the flow field. The learned latent representation is then correlated to the sensor measurement using a fully connected (dense) network. We validated the reconstruction capability and the generalizability of FLRNet under various fluid flow conditions and sensor configurations, including different sensor counts and sensor layouts. Numerical experiments show that in all tested scenarios, FLRNet consistently outperformed other baselines, delivering the most accurate reconstructed flow field and being the most robust to noise.

[LG-52] Benchmarking a wide range of optimisers for solving the Fermi-Hubbard model using the variational quantum eigensolver

Link: https://arxiv.org/abs/2411.13742
Authors: Benjamin D.M. Jones,Lana Mineh,Ashley Montanaro
Keywords: Hamiltonian variational ansatz, Hamiltonian variational, variational quantum eigensolver, variational ansatz, numerically benchmark
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: 42 pages, 30 figures. Associated data can be found at this https URL

Abstract:We numerically benchmark 30 optimisers on 372 instances of the variational quantum eigensolver for solving the Fermi-Hubbard system with the Hamiltonian variational ansatz. We rank the optimisers with respect to metrics such as final energy achieved and function calls needed to get within a certain tolerance level, and find that the best performing optimisers are variants of gradient descent such as Momentum and ADAM (using finite difference), SPSA, CMAES, and BayesMGD. We also perform gradient analysis and observe that the step size for finite difference has a very significant impact. We also consider using simultaneous perturbation (inspired by SPSA) as a gradient subroutine: here finite difference can lead to a more precise estimate of the ground state but uses more calls, whereas simultaneous perturbation can converge quicker but may be less precise in the later stages. Finally, we also study the quantum natural gradient algorithm: we implement this method for 1-dimensional Fermi-Hubbard systems, and find that whilst it can reach a lower energy with fewer iterations, this improvement is typically lost when taking total function calls into account. Our method involves performing careful hyperparameter sweeping on 4 instances. We present a variety of analysis and figures, detailed optimiser notes, and discuss future directions.
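
The trade-off between finite differences and simultaneous perturbation described above can be sketched as follows. This is a generic illustration, not the benchmark code from the paper: the function names, the step sizes `h` and `c`, and the use of central differences are assumptions made for the example.

```python
import random


def finite_difference_grad(f, x, h=1e-3):
    """Central finite differences: 2 * len(x) function calls per gradient."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad


def spsa_grad(f, x, c=1e-3, rng=random):
    """Simultaneous perturbation: 2 function calls regardless of len(x)."""
    # Perturb every coordinate at once along a random +/-1 direction.
    delta = [rng.choice((-1.0, 1.0)) for _ in x]
    up = [xi + c * di for xi, di in zip(x, delta)]
    down = [xi - c * di for xi, di in zip(x, delta)]
    diff = (f(up) - f(down)) / (2 * c)
    # Each component is a noisy estimate of the corresponding partial derivative.
    return [diff / di for di in delta]
```

For a problem with n parameters, the finite-difference estimate costs 2n evaluations of the objective while the simultaneous-perturbation estimate always costs 2, at the price of noise in each component. This matches the abstract's observation that finite differences can be more precise but use more calls.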

[LG-53] Graph neural network framework for energy mapping of hybrid monte-carlo molecular dynamics simulations of Medium Entropy Alloys

Link: https://arxiv.org/abs/2411.13670
Authors: Mashaekh Tausif Ehsan,Saifuddin Zafar,Apurba Sarker,Sourav Das Suvro,Mohammad Nasim Hasan
Keywords: drawn significant interest, Machine learning, methods have drawn, design and discovery, drawn significant
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Comments: 28 pages, 9 figures

Abstract:Machine learning (ML) methods have drawn significant interest in material design and discovery. Graph neural networks (GNNs), in particular, have demonstrated strong potential for predicting material properties. The present study proposes a graph-based representation for modeling medium-entropy alloys (MEAs). Hybrid Monte-Carlo molecular dynamics (MC/MD) simulations are employed to achieve thermally stable structures across various annealing temperatures in an MEA. These simulations generate dump files and potential energy labels, which are used to construct graph representations of the atomic configurations. Edges are created between each atom and its 12 nearest neighbors without incorporating explicit edge features. These graphs then serve as input for a Graph Convolutional Neural Network (GCNN) based ML model to predict the system’s potential energy. The GCNN architecture effectively captures the local environment and chemical ordering within the MEA structure. The GCNN-based ML model demonstrates strong performance in predicting potential energy at different steps, showing satisfactory results on both the training data and unseen configurations. Our approach presents a graph-based modeling framework for MEAs and high-entropy alloys (HEAs), which effectively captures the local chemical order (LCO) within the alloy structure. This allows us to predict key material properties influenced by LCO in both MEAs and HEAs, providing deeper insights into how atomic-scale arrangements affect the properties of these alloys.
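
The graph construction step described above (an edge from each atom to its 12 nearest neighbors, with no edge features) might look like the sketch below. The function name `knn_edges` and the brute-force search are assumptions for illustration, and the sketch ignores periodic boundary conditions, which the actual simulation cells may require.

```python
import math


def knn_edges(positions, k=12):
    """Return directed edges (i, j) from each atom i to its k nearest neighbours."""
    edges = []
    for i, p in enumerate(positions):
        # Rank all other atoms by Euclidean distance to atom i.
        others = sorted(
            (j for j in range(len(positions)) if j != i),
            key=lambda j: math.dist(p, positions[j]),
        )
        edges.extend((i, j) for j in others[:k])
    return edges
```

The brute-force search is quadratic in the number of atoms, so a spatial index would be the usual replacement for large configurations.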

[LG-54] High resolution microprice estimates from limit orderbook data using hyperdimensional vector Tsetlin Machines

Link: https://arxiv.org/abs/2411.13594
Authors: Christian D. Blakely
Keywords: higher order information, propose an error-correcting, order information, future prices, higher price rank
Categories: Trading and Market Microstructure (q-fin.TR); Machine Learning (cs.LG); Statistical Finance (q-fin.ST)

Abstract:We propose an error-correcting model for the microprice, a high-frequency estimator of future prices given higher order information of imbalances in the orderbook. The model takes into account a current microprice estimate given the spread and best bid to ask imbalance, and adjusts the microprice based on recent dynamics of higher price rank imbalances. We introduce a computationally fast estimator using a recently proposed hyperdimensional vector Tsetlin machine framework and demonstrate empirically that this estimator can provide a robust estimate of future prices in the orderbook.
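
As a point of reference for the "current microprice estimate given the spread and best bid to ask imbalance", the sketch below computes the classic imbalance-weighted mid-price from the top of the book. It is a simple baseline chosen here for illustration, not the error-correcting model or the Tsetlin machine estimator from the paper; the function name and argument names are hypothetical.

```python
def weighted_mid(bid, ask, bid_size, ask_size):
    """Imbalance-weighted mid-price at the top of the order book."""
    imbalance = bid_size / (bid_size + ask_size)
    mid = (bid + ask) / 2
    spread = ask - bid
    # A heavier bid queue (imbalance > 0.5) pushes the estimate toward the ask.
    return mid + spread * (imbalance - 0.5)
```

With a bid of 100.0, an ask of 100.1, and three times as much size on the bid as on the ask, the estimate sits three quarters of the way from bid to ask; with equal sizes it reduces to the plain mid-price.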

[LG-55] A Random Forest approach to detect and identify Unlawful Insider Trading

Link: https://arxiv.org/abs/2411.13564
Authors: Krishna Neupane,Igor Griva
Keywords: Exchange Act, privileged corporate information, unlawful insider trading, insider trading, corporate information
Categories: Statistical Finance (q-fin.ST); Machine Learning (cs.LG); Risk Management (q-fin.RM); Trading and Market Microstructure (q-fin.TR)

Abstract:According to the Exchange Act of 1934, unlawful insider trading is the abuse of access to privileged corporate information. While a blurred line exists between “routine” and “opportunistic” insider trading, detecting the strategies that insiders mold to maneuver fair market prices to their advantage is an uphill battle for hand-engineered approaches. In the context of detailed high-dimensional financial and trade data that are structurally built by multiple covariates, in this study we explore, implement, and provide a detailed comparison to the existing study (Deng et al. (2019)), and independently implement automated end-to-end state-of-the-art methods by integrating principal component analysis with the random forest (PCA-RF), followed by a standalone random forest (RF), with 320 and 3984 randomly selected, semi-manually labeled and normalized transactions from multiple industries. The settings successfully uncover latent structures and detect unlawful insider trading. Among the multiple scenarios, our best-performing model accurately classified 96.43 percent of transactions. Among all transactions, the models classify 95.47 percent of lawful transactions as lawful and 98.00 percent of unlawful transactions as unlawful. Besides, the model makes very few mistakes in classifying lawful as unlawful by missing only 2.00 percent. In addition to the classification task, the model generates a Gini impurity-based feature ranking, and our analysis shows that ownership- and governance-related features, based on permutation values, play important roles. In summary, a simple yet powerful automated end-to-end method relieves labor-intensive activities, redirecting resources to enhance rule-making and to track uncaptured unlawful insider trading transactions. We emphasize that the developed financial and trading features are capable of uncovering fraudulent behaviors.

[LG-56] Dyson Brownian motion and random matrix dynamics of weight matrices during learning NEURIPS2024

Link: https://arxiv.org/abs/2411.13512
Authors: Gert Aarts(Swansea University),Ouraman Hajizadeh(Graz),Biagio Lucini(Swansea University),Chanju Park(Swansea University)
Keywords: stochastic gradient descent, variations thereof, architectures are updated, gradient descent, descent or variations
Categories: Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); High Energy Physics - Lattice (hep-lat)
Comments: 7 pages. Contribution accepted in the NeurIPS 2024 workshop “Machine Learning and the Physical Sciences”

Abstract:During training, weight matrices in machine learning architectures are updated using stochastic gradient descent or variations thereof. In this contribution we employ concepts of random matrix theory to analyse the resulting stochastic matrix dynamics. We first demonstrate that the dynamics can generically be described using Dyson Brownian motion, leading to e.g. eigenvalue repulsion. The level of stochasticity is shown to depend on the ratio of the learning rate and the mini-batch size, explaining the empirically observed linear scaling rule. We verify this linear scaling in the restricted Boltzmann machine. Subsequently we study weight matrix dynamics in transformers (a nano-GPT), following the evolution from a Marchenko-Pastur distribution for eigenvalues at initialisation to a combination with additional structure at the end of learning.

Information Retrieval

[IR-0] Topology-Aware Popularity Debiasing via Simplicial Complexes

Link: https://arxiv.org/abs/2411.13892
Authors: Yanbiao Ji,Yue Ding,Chang Liu,Yuxiang Lu,Xin Xin,Hongtao Lu
Keywords: delivering personalized content, Recommender systems, historical interaction data, users’ historical interaction, Graph Neural Networks
Categories: Information Retrieval (cs.IR)

Abstract:Recommender systems (RS) play a critical role in delivering personalized content across various online platforms, leveraging collaborative filtering (CF) as a key technique to generate recommendations based on users’ historical interaction data. Recent advancements in CF have been driven by the adoption of Graph Neural Networks (GNNs), which model user-item interactions as bipartite graphs, enabling the capture of high-order collaborative signals. Despite their success, GNN-based methods face significant challenges due to the inherent popularity bias in the user-item interaction graph’s topology, leading to skewed recommendations that favor popular items over less-known ones. To address this challenge, we propose a novel topology-aware popularity debiasing framework, Test-time Simplicial Propagation (TSP), which incorporates simplicial complexes (SCs) to enhance the expressiveness of GNNs. Unlike traditional methods that focus on pairwise relationships, our approach captures multi-order relationships through SCs, providing a more comprehensive representation of user-item interactions. By enriching the neighborhoods of tail items and leveraging SCs for feature smoothing, TSP enables the propagation of multi-order collaborative signals and effectively mitigates biased propagation. Our TSP module is designed as a plug-and-play solution, allowing for seamless integration into pre-trained GNN-based models without the need for fine-tuning additional parameters. Extensive experiments on five real-world datasets demonstrate the superior performance of our method, particularly in long-tail recommendation tasks. Visualization results further confirm that TSP produces more uniform distributions of item representations, leading to fairer and more accurate recommendations. 

[IR-1] LEADRE: Multi-Faceted Knowledge Enhanced LLM Empowered Display Advertisement Recommender System

Link: https://arxiv.org/abs/2411.13789
Authors: Fengxin Li,Yi Li,Yue Liu,Chao Zhou,Yuan Wang,Xiaoxiang Deng,Wei Xue,Dapeng Liu,Lei Xiao,Haijie Gu,Jie Jiang,Hongyan Liu,Biao Qin,Jun He
Keywords: display advertising systems, Traditional display advertising, Display advertising, LLM Empowered Display, advertising systems utilize
Categories: Information Retrieval (cs.IR)

Abstract:Display advertising provides significant value to advertisers, publishers, and users. Traditional display advertising systems utilize a multi-stage architecture consisting of retrieval, coarse ranking, and final ranking. However, conventional retrieval methods rely on ID-based learning-to-rank mechanisms and fail to adequately utilize the content information of ads, which hampers their ability to provide diverse recommendation lists. To address this limitation, we propose leveraging the extensive world knowledge of LLMs. However, three key challenges arise when attempting to maximize the effectiveness of LLMs: “How to capture user interests”, “How to bridge the knowledge gap between LLMs and advertising system”, and “How to efficiently deploy LLMs”. To overcome these challenges, we introduce a novel LLM-based framework called LLM Empowered Display ADvertisement REcommender system (LEADRE). LEADRE consists of three core modules: (1) The Intent-Aware Prompt Engineering introduces multi-faceted knowledge and designs intent-aware <Prompt, Response> pairs that fine-tune LLMs to generate ads tailored to users’ personal interests. (2) The Advertising-Specific Knowledge Alignment incorporates auxiliary fine-tuning tasks and Direct Preference Optimization (DPO) to align LLMs with ad semantics and business value. (3) The Efficient System Deployment deploys LEADRE in an online environment by integrating both latency-tolerant and latency-sensitive services. Extensive offline experiments demonstrate the effectiveness of LEADRE and validate the contributions of individual modules. An online A/B test shows that LEADRE leads to a 1.57% and 1.17% GMV lift for serviced users on WeChat Channels and Moments, respectively. LEADRE has been deployed on both platforms, serving tens of billions of requests each day.

Attachment Download

Click to download today's full paper list
