This post presents the latest list of papers retrieved from arXiv.org on 2024-11-04. It is updated automatically and organized into five major areas: NLP, CV, ML, AI, and IR. If you would like to receive the list by email on a schedule, please leave your email address in the comments.
Table of Contents
Overview (2024-11-04)
A total of 432 papers are updated today, including:
- Natural Language Processing: 62 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 130 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 85 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 173 papers (Machine Learning (cs.LG))
Natural Language Processing
[NLP-0] Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM
[Quick Read]: This paper tackles the problem of achieving multimodal human-computer interaction while keeping a large language model (LLM) frozen. The key is Freeze-Omni, a speech-text multimodal LLM architecture whose core innovation is keeping the LLM frozen throughout training while connecting speech input and output modalities to it via a three-stage training strategy. Using text-speech paired data (such as ASR and TTS data) and only 60,000 multi-round text Q&A samples on 8 GPUs, the approach achieves speech-to-speech dialogue ability. A multi-task training design further gives Freeze-Omni duplex dialogue ability, so that its intelligence in the speech modality matches that of the text modality while keeping end-to-end latency low. This makes multimodal LLM research feasible under limited data and training resources, avoiding the catastrophic forgetting of the LLM that data and resource scarcity would otherwise cause.
Link: https://arxiv.org/abs/2411.00774
Authors: Xiong Wang, Yangze Li, Chaoyou Fu, Lei Xie, Ke Li, Xing Sun, Long Ma
Keywords (EN): brought impressive experience, large language models, excellent multimodal human-computer, multimodal human-computer interaction, models has brought
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: Project Page: this https URL
Abstract:The rapid development of large language models has brought many new smart applications, especially the excellent multimodal human-computer interaction in GPT-4o has brought impressive experience to users. In this background, researchers have proposed many multimodal LLMs that can achieve speech-to-speech dialogue recently. In this paper, we propose a speech-text multimodal LLM architecture called Freeze-Omni. Our main contribution is that the speech input and output modalities can be connected to the LLM while keeping the LLM frozen throughout the training process. We designed 3-stage training strategies both for the modeling of speech input and output, enabling Freeze-Omni to obtain speech-to-speech dialogue ability using text-speech paired data (such as ASR and TTS data) and only 60,000 multi-round text QA data on 8 GPUs. Moreover, we can effectively ensure that the intelligence of the Freeze-Omni in the speech modality is at the same level compared with that in the text modality of its backbone LLM, while the end-to-end latency of the spoken response achieves a low level. In addition, we also designed a method to achieve duplex dialogue ability through multi-task training, making Freeze-Omni have a more natural style of dialogue ability between the users. Freeze-Omni mainly provides a possibility for researchers to conduct multimodal LLM under the condition of a frozen LLM, avoiding various impacts caused by the catastrophic forgetting of LLM caused by fewer data and training resources.
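The core idea of the entry above, a permanently frozen LLM backbone with speech-side modules trained in stages, can be sketched as a parameter-group schedule. The module names and per-stage assignments below are illustrative assumptions, not the paper's actual configuration:

```python
# Hypothetical sketch of a Freeze-Omni-style training schedule: the backbone
# LLM is frozen in every stage, while speech-side modules are trained in turn.
# Module names and stage assignments are assumptions for illustration.

ALL_MODULES = {"llm_backbone", "speech_encoder", "speech_decoder"}

STAGES = {
    1: {"speech_encoder"},                    # model speech input
    2: {"speech_decoder"},                    # model speech output
    3: {"speech_encoder", "speech_decoder"},  # joint adapter tuning
}

def trainable_modules(stage):
    """Modules whose parameters receive gradients in `stage`. The LLM
    backbone is never among them -- the central constraint of the paper."""
    trainable = STAGES[stage]
    assert "llm_backbone" not in trainable  # frozen throughout training
    return trainable

def frozen_modules(stage):
    """Complement of the trainable set; always contains the LLM backbone."""
    return ALL_MODULES - trainable_modules(stage)
```

In a real framework this schedule would drive `requires_grad` flags on the corresponding parameter groups.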
[NLP-1] Mitigating Tail Narrowing in LLM Self-Improvement via Socratic-Guided Sampling
[Quick Read]: This paper addresses the performance plateau that self-improvement methods hit in large language models (LLMs). During iterative training, models tend to over-sample easy queries and under-sample complex ones, producing an imbalanced, long-tailed sampling distribution that caps further gains. The proposed solution, Guided Self-Improvement (GSI), injects Socratic-style guidance signals to help LLMs handle complex queries, improving the efficiency of sampling challenging data while reducing exploration effort and computational overhead. The key is balancing performance gains against computational efficiency, and the strategy also proves effective on held-out tasks.
Link: https://arxiv.org/abs/2411.00750
Authors: Yiwen Ding, Zhiheng Xi, Wei He, Zhuoyuan Li, Yitao Zhai, Xiaowei Shi, Xunliang Cai, Tao Gui, Qi Zhang, Xuanjing Huang
Keywords (EN): methods enable large, enable large language, Self-improvement methods enable, large language models, high-quality rationales
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Codes are publicly available at this https URL
Abstract:Self-improvement methods enable large language models (LLMs) to generate solutions themselves and iteratively train on filtered, high-quality rationales. This process proves effective and reduces the reliance on human supervision in LLMs’ reasoning, but the performance soon plateaus. We delve into the process and find that models tend to over-sample on easy queries and under-sample on queries they have yet to master. As iterations proceed, this imbalance in sampling is exacerbated, leading to a long-tail distribution where solutions to difficult queries almost diminish. This phenomenon limits the performance gain of self-improving models. A straightforward solution is brute-force sampling to balance the distribution, which significantly raises computational costs. In this paper, we introduce Guided Self-Improvement (GSI), a strategy aimed at improving the efficiency of sampling challenging heavy-tailed data. It leverages Socratic-style guidance signals to help LLM reasoning with complex queries, reducing the exploration effort and minimizing computational overhead. Experiments on four models across diverse mathematical tasks show that GSI strikes a balance between performance and efficiency, while also being effective on held-out tasks.
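A back-of-the-envelope view of why guidance lowers exploration cost in the entry above: with independent sampling attempts, collecting k correct rationales for a query with solve rate p takes k/p attempts in expectation, so anything that lifts p on hard (tail) queries directly shrinks compute. The numbers below are hypothetical:

```python
def expected_samples(k_correct, solve_rate):
    """Expected number of independent sampling attempts needed to collect
    `k_correct` correct solutions when each attempt succeeds with
    probability `solve_rate` (a geometric-distribution argument)."""
    assert 0.0 < solve_rate <= 1.0
    return k_correct / solve_rate

# Hypothetical hard query: 2% unaided solve rate vs. an assumed 20% once
# Socratic-style guidance is supplied.
unaided = expected_samples(4, 0.02)  # roughly 200 attempts
guided = expected_samples(4, 0.20)   # roughly 20 attempts
```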
[NLP-2] CORAG: A Cost-Constrained Retrieval Optimization System for Retrieval-Augmented Generation
[Quick Read]: This paper targets the hallucinations that large language models (LLMs) produce when they cannot access up-to-date information, and addresses three key challenges facing existing Retrieval-Augmented Generation (RAG) methods. The key is CORAG, a cost-constrained retrieval optimization system that uses a Monte Carlo Tree Search (MCTS) based policy framework to find optimal chunk combinations sequentially, so that correlations among chunks are considered jointly. CORAG also folds the budget constraint into the optimization of chunk combinations, handling the non-monotonicity of chunk utility and avoiding the performance degradation that comes from simply maximizing the number of included chunks.
Link: https://arxiv.org/abs/2411.00744
Authors: Ziting Wang, Haitao Yuan, Wei Dong, Gao Cong, Feifei Li
Keywords (EN): Large Language Models, Large Language, demonstrated remarkable generation, remarkable generation capabilities, Language Models
Subjects: Databases (cs.DB); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:
Abstract:Large Language Models (LLMs) have demonstrated remarkable generation capabilities but often struggle to access up-to-date information, which can lead to hallucinations. Retrieval-Augmented Generation (RAG) addresses this issue by incorporating knowledge from external databases, enabling more accurate and relevant responses. Due to the context window constraints of LLMs, it is impractical to input the entire external database context directly into the model. Instead, only the most relevant information, referred to as chunks, is selectively retrieved. However, current RAG research faces three key challenges. First, existing solutions often select each chunk independently, overlooking potential correlations among them. Second, in practice the utility of chunks is non-monotonic, meaning that adding more chunks can decrease overall utility. Traditional methods emphasize maximizing the number of included chunks, which can inadvertently compromise performance. Third, each type of user query possesses unique characteristics that require tailored handling, an aspect that current approaches do not fully consider. To overcome these challenges, we propose a cost constrained retrieval optimization system CORAG for retrieval-augmented generation. We employ a Monte Carlo Tree Search (MCTS) based policy framework to find optimal chunk combinations sequentially, allowing for a comprehensive consideration of correlations among chunks. Additionally, rather than viewing budget exhaustion as a termination condition, we integrate budget constraints into the optimization of chunk combinations, effectively addressing the non-monotonicity of chunk utility.
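CORAG searches chunk combinations with MCTS; as a much simpler stand-in, the greedy sketch below illustrates the two properties the abstract stresses: utility is scored on the combination (so a redundant chunk can add nothing, or even hurt), and the budget is part of the optimization rather than a termination condition. The function names and toy utility are assumptions, not the paper's implementation:

```python
def select_chunks(chunks, utility, cost, budget):
    """Greedy budgeted selection over chunk *combinations*.
    `utility(combo)` scores a set of chunks jointly, so correlations and
    non-monotonic utility are respected: a chunk is added only if it strictly
    improves the joint score and fits within the budget. A simplified
    stand-in for CORAG's MCTS-based policy."""
    chosen, spent = [], 0
    improved = True
    while improved:
        improved = False
        best_gain, best_c = 0.0, None
        for c in chunks:
            if c in chosen or spent + cost(c) > budget:
                continue
            gain = utility(chosen + [c]) - utility(chosen)
            if gain > best_gain:
                best_gain, best_c = gain, c
        if best_c is not None:
            chosen.append(best_c)
            spent += cost(best_c)
            improved = True
    return chosen
```

With a utility that counts distinct topics covered, a chunk duplicating an already-selected topic contributes zero marginal gain and is never picked, even when budget remains.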
[NLP-3] Decoding Dark Matter: Specialized Sparse Autoencoders for Interpreting Rare Concepts in Foundation Models
[Quick Read]: This paper addresses the difficulty Sparse Autoencoders (SAEs) have in capturing rare but crucial concepts in foundation models (FMs). The key is Specialized Sparse Autoencoders (SSAEs), which focus on specific subdomains to illuminate these elusive "dark matter" features. The paper presents a practical training recipe for SSAEs, using dense retrieval for data selection and Tilted Empirical Risk Minimization as the training objective to improve concept recall. Experiments show SSAEs capture subdomain tail concepts well, exceeding general-purpose SAEs, and a case study on the Bias in Bios dataset demonstrates practical value: removing spurious gender information raises worst-group classification accuracy by 12.5%.
Link: https://arxiv.org/abs/2411.00743
Authors: Aashiq Muhamed, Mona Diab, Virginia Smith
Keywords (EN): effective interpretability methods, developing effective interpretability, Understanding and mitigating, Specialized Sparse Autoencoders, Sparse Autoencoders
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Understanding and mitigating the potential risks associated with foundation models (FMs) hinges on developing effective interpretability methods. Sparse Autoencoders (SAEs) have emerged as a promising tool for disentangling FM representations, but they struggle to capture rare, yet crucial concepts in the data. We introduce Specialized Sparse Autoencoders (SSAEs), designed to illuminate these elusive dark matter features by focusing on specific subdomains. We present a practical recipe for training SSAEs, demonstrating the efficacy of dense retrieval for data selection and the benefits of Tilted Empirical Risk Minimization as a training objective to improve concept recall. Our evaluation of SSAEs on standard metrics, such as downstream perplexity and L_0 sparsity, show that they effectively capture subdomain tail concepts, exceeding the capabilities of general-purpose SAEs. We showcase the practical utility of SSAEs in a case study on the Bias in Bios dataset, where SSAEs achieve a 12.5% increase in worst-group classification accuracy when applied to remove spurious gender information. SSAEs provide a powerful new lens for peering into the inner workings of FMs in subdomains.
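Tilted Empirical Risk Minimization, the training objective named above, replaces the mean loss with a log-sum-exp tilt, L_t = (1/t) · log(mean(exp(t · l_i))); positive t up-weights high-loss tail examples, which is why it helps recall of rare concepts. A minimal, numerically stable implementation of just this formula (not the paper's training code):

```python
import math

def tilted_loss(losses, t):
    """t-tilted empirical risk: (1/t) * log(mean(exp(t * l))).
    As t -> 0 this recovers the plain mean; large positive t approaches the
    max loss, emphasizing rare, hard (tail) examples. Shifting by the max
    loss keeps the exponentials numerically stable."""
    m = max(losses)
    mean_exp = sum(math.exp(t * (l - m)) for l in losses) / len(losses)
    return m + math.log(mean_exp) / t
```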
[NLP-4] MolCap-Arena: A Comprehensive Captioning Benchmark on Language-Enhanced Molecular Property Prediction
[Quick Read]: This paper asks how large language models (LLMs) can enhance biomolecular property prediction, particularly on complex predictive tasks such as toxicity. The key is Molecule Caption Arena, a comprehensive benchmark that evaluates more than twenty LLMs, including general-purpose and domain-specific molecule captioners, across diverse prediction tasks to test whether LLM-extracted knowledge improves existing molecular representations. The study also introduces a novel battle-based rating system to quantify differences across models, prompts, and datasets.
Link: https://arxiv.org/abs/2411.00737
Authors: Carl Edwards, Ziqing Lu, Ehsan Hajiramezanali, Tommaso Biancalani, Heng Ji, Gabriele Scalia
Keywords (EN): Bridging biomolecular modeling, natural language information, interdisciplinary research area, promising interdisciplinary research, Bridging biomolecular
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
Comments:
Abstract:Bridging biomolecular modeling with natural language information, particularly through large language models (LLMs), has recently emerged as a promising interdisciplinary research area. LLMs, having been trained on large corpora of scientific documents, demonstrate significant potential in understanding and reasoning about biomolecules by providing enriched contextual and domain knowledge. However, the extent to which LLM-driven insights can improve performance on complex predictive tasks (e.g., toxicity) remains unclear. Further, the extent to which relevant knowledge can be extracted from LLMs also remains unknown. In this study, we present Molecule Caption Arena: the first comprehensive benchmark of LLM-augmented molecular property prediction. We evaluate over twenty LLMs, including both general-purpose and domain-specific molecule captioners, across diverse prediction tasks. To this goal, we introduce a novel, battle-based rating system. Our findings confirm the ability of LLM-extracted knowledge to enhance state-of-the-art molecular representations, with notable model-, prompt-, and dataset-specific variations. Code, resources, and data are available at this http URL.
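The "battle-based rating system" is not specified in the abstract; arena-style evaluations commonly rank competitors with Elo updates over pairwise battles, so the standard Elo rule below is offered purely as an assumed illustration of how such a ranking can be computed:

```python
def elo_update(r_winner, r_loser, k=32.0):
    """One pairwise 'battle': the logistic curve gives the winner's expected
    score, then both ratings take a K-factor step toward the observed
    outcome (1 for the winner, 0 for the loser). Standard Elo, assumed here
    rather than taken from the paper."""
    expected_win = 1.0 / (1.0 + 10.0 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected_win)
    return r_winner + delta, r_loser - delta
```

An upset (a lower-rated model beating a higher-rated one) moves the ratings further than a win between equals, which is what makes the ranking converge under repeated battles.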
[NLP-5] SPRING Lab IITM's submission to Low Resource Indic Language Translation Shared Task
[Quick Read]: This paper targets translation for four low-resource Indic languages (Khasi, Mizo, Manipuri, and Assamese). The key elements are: 1) collecting and preprocessing data from several sources (WMT task datasets, BPCC, PMIndia, and OpenLanguageData); 2) using back-translation on monolingual Mizo and Khasi data to counter the scarcity of bilingual data; 3) fine-tuning the pre-trained NLLB 3.3B model to improve translation for Assamese, Mizo, and Manipuri; 4) introducing special tokens and training on a Khasi corpus for Khasi, which the NLLB model does not support; and 5) masked language modelling followed by fine-tuning for translation in both directions between English and the Indic languages.
Link: https://arxiv.org/abs/2411.00727
Authors: Hamees Sayed, Advait Joglekar, Srinivasan Umesh
Keywords (EN): low-resource Indic languages, low-resource Indic, develop a robust, Indic languages, WMT task datasets
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: To be published in WMT 2024. Low-Resource Indic Language Translation Shared Task
Abstract:We develop a robust translation model for four low-resource Indic languages: Khasi, Mizo, Manipuri, and Assamese. Our approach includes a comprehensive pipeline from data collection and preprocessing to training and evaluation, leveraging data from WMT task datasets, BPCC, PMIndia, and OpenLanguageData. To address the scarcity of bilingual data, we use back-translation techniques on monolingual datasets for Mizo and Khasi, significantly expanding our training corpus. We fine-tune the pre-trained NLLB 3.3B model for Assamese, Mizo, and Manipuri, achieving improved performance over the baseline. For Khasi, which is not supported by the NLLB model, we introduce special tokens and train the model on our Khasi corpus. Our training involves masked language modelling, followed by fine-tuning for English-to-Indic and Indic-to-English translations.
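Back-translation, the augmentation step above, turns monolingual target-language text into synthetic parallel pairs: a reverse-direction model produces a (noisy) source-side sentence, while the genuine human text stays on the target side. A minimal sketch with the reverse model injected as a plain function; a real setup would plug in the fine-tuned NLLB model:

```python
def back_translate(target_sentences, reverse_model):
    """Build synthetic (source, target) training pairs from monolingual
    target-language text. `reverse_model` maps a target-language sentence to
    a machine-generated (hence noisy) source-language sentence; the genuine
    human-written text always ends up on the target side, which is the side
    the translation model learns to produce."""
    return [(reverse_model(t), t) for t in target_sentences]
```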
[NLP-6] A graph-based approach to extracting narrative signals from public discourse
[Quick Read]: This paper addresses the extraction, representation, and analysis of political narrative signals from digital text corpora. The key is a graph-based formalism and machine-guided method built on Abstract Meaning Representation (AMR). Concretely, AMR yields a graph-structured representation of each sentence in a corpus, and transferable concepts from narratology drive a set of heuristics that filter out core narrative signals representing 1) actors, 2) the events those actors figure in, and 3) traces of how those events are perspectivized. These signals serve as cues for further analysis, pointing to larger political narratives. A case study of State of the European Union addresses shows how the formalism inductively surfaces signals of political narratives from public discourse.
Link: https://arxiv.org/abs/2411.00702
Authors: Armin Pournaki, Tom Willaert
Keywords (EN): key interpretative devices, humans make sense, political narratives, political, Narratives
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY); Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
Comments: 23 pages, 4 figures
Abstract:Narratives are key interpretative devices by which humans make sense of political reality. As the significance of narratives for understanding current societal issues such as polarization and misinformation becomes increasingly evident, there is a growing demand for methods that support their empirical analysis. To this end, we propose a graph-based formalism and machine-guided method for extracting, representing, and analyzing selected narrative signals from digital textual corpora, based on Abstract Meaning Representation (AMR). The formalism and method introduced here specifically cater to the study of political narratives that figure in texts from digital media such as archived political speeches, social media posts, political manifestos and transcripts of parliamentary debates. We conceptualize these political narratives as a type of ontological narratives: stories by which actors position themselves as political beings, and which are akin to political worldviews in which actors present their normative vision of the world, or aspects thereof. We approach the study of such political narratives as a problem of information retrieval: starting from a textual corpus, we first extract a graph-like representation of the meaning of each sentence in the corpus using AMR. Drawing on transferable concepts from narratology, we then apply a set of heuristics to filter these graphs for representations of 1) actors, 2) the events in which these actors figure, and 3) traces of the perspectivization of these events. We approach these references to actors, events, and instances of perspectivization as core narrative signals that initiate a further analysis by alluding to larger political narratives. By means of a case study of State of the European Union addresses, we demonstrate how the formalism can be used to inductively surface signals of political narratives from public discourse.
[NLP-7] Leveraging Large Language Models for Code-Mixed Data Augmentation in Sentiment Analysis EMNLP2024
[Quick Read]: This paper addresses the challenges code-mixing (CM) poses for natural language processing: its complexity and limited data lead to weak sentiment-analysis performance. The key is using a large language model to generate synthetic CM data and then augmenting task-specific models with it. Experiments in Spanish-English and Malayalam-English show that synthetic data markedly improves F1 scores when baselines are low, with a 9.32% gain in Spanish-English in particular. Human evaluation confirms the approach is a simple, cost-effective way to produce natural-sounding CM sentences, especially beneficial in low-baseline settings. The findings suggest few-shot prompting of large language models is a promising CM data-augmentation method with significant impact on sentiment analysis, and in turn on the development of social influence systems.
Link: https://arxiv.org/abs/2411.00691
Authors: Linda Zeng
Keywords (EN): speakers blend languages, language processing due, single expression, natural language processing, speakers blend
Subjects: Computation and Language (cs.CL)
Comments: 17 pages, 4 figures, 11 tables, To be published in the Proceedings of the Second Workshop on Social Influence in Conversations (SICon 2024), co-located with EMNLP 2024
Abstract:Code-mixing (CM), where speakers blend languages within a single expression, is prevalent in multilingual societies but poses challenges for natural language processing due to its complexity and limited data. We propose using a large language model to generate synthetic CM data, which is then used to enhance the performance of task-specific models for CM sentiment analysis. Our results show that in Spanish-English, synthetic data improved the F1 score by 9.32%, outperforming previous augmentation techniques. However, in Malayalam-English, synthetic data only helped when the baseline was low; with strong natural data, additional synthetic data offered little benefit. Human evaluation confirmed that this approach is a simple, cost-effective way to generate natural-sounding CM sentences, particularly beneficial for low baselines. Our findings suggest that few-shot prompting of large language models is a promising method for CM data augmentation and has significant impact on improving sentiment analysis, an important element in the development of social influence systems.
[NLP-8] Towards Multi-Source Retrieval-Augmented Generation via Synergizing Reasoning and Preference-Driven Retrieval
[Quick Read]: This paper addresses the inability of existing Adaptive Retrieval-Augmented Generation (ARAG) systems to select the right retrieval source at the right time when multiple sources are available. The key is MSPR, a multi-source ARAG framework that synergizes reasoning and preference-driven retrieval to adaptively decide "when and what to retrieve" and "which retrieval source to use". Retrieval-action adjustment and an answer-feedback strategy let MSPR fully exploit the high-quality primary source while supplementing it with secondary sources at the right time, improving overall retrieval. Experiments on three datasets show MSPR outperforms existing methods.
Link: https://arxiv.org/abs/2411.00689
Authors: Qingfei Zhao, Ruobing Wang, Xin Wang, Daren Zha, Nan Mu
Keywords (EN): Large Language Models, reliable external knowledge, external knowledge augmentation, knowledge augmentation technique, parameterized knowledge limitations
Subjects: Computation and Language (cs.CL)
Comments: 5 pages, 1 figure
Abstract:Retrieval-Augmented Generation (RAG) has emerged as a reliable external knowledge augmentation technique to mitigate hallucination issues and parameterized knowledge limitations in Large Language Models (LLMs). Existing Adaptive RAG (ARAG) systems struggle to effectively explore multiple retrieval sources due to their inability to select the right source at the right time. To address this, we propose a multi-source ARAG framework, termed MSPR, which synergizes reasoning and preference-driven retrieval to adaptive decide “when and what to retrieve” and “which retrieval source to use”. To better adapt to retrieval sources of differing characteristics, we also employ retrieval action adjustment and answer feedback strategy. They enable our framework to fully explore the high-quality primary source while supplementing it with secondary sources at the right time. Extensive and multi-dimensional experiments conducted on three datasets demonstrate the superiority and effectiveness of MSPR.
[NLP-9] Latent Paraphrasing: Perturbation on Layers Improves Knowledge Injection in Language Models NEURIPS2024
[Quick Read]: This paper addresses the high computational cost and limited sample diversity of knowledge updates when applying large language models (LLMs) to specialized domains. The key is LaPael, a latent-level paraphrasing method that applies input-dependent noise to early model layers, generating diverse yet semantically consistent augmentations directly inside the model. This removes the recurring cost of generating paraphrases with external models and makes knowledge injection more efficient and effective. Experiments on question-answering benchmarks show LaPael outperforms standard fine-tuning and existing noise-based approaches, and combining it with data-level paraphrasing improves performance further.
Link: https://arxiv.org/abs/2411.00686
Authors: Minki Kang, Sung Ju Hwang, Gibbeum Lee, Jaewoong Cho
Keywords (EN): Large Language Models, Large Language, Language Models, continuously evolving knowledge, increasingly deployed
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: NeurIPS 2024
Abstract:As Large Language Models (LLMs) are increasingly deployed in specialized domains with continuously evolving knowledge, the need for timely and precise knowledge injection has become essential. Fine-tuning with paraphrased data is a common approach to enhance knowledge injection, yet it faces two significant challenges: high computational costs due to repetitive external model usage and limited sample diversity. To this end, we introduce LaPael, a latent-level paraphrasing method that applies input-dependent noise to early LLM layers. This approach enables diverse and semantically consistent augmentations directly within the model. Furthermore, it eliminates the recurring costs of paraphrase generation for each knowledge update. Our extensive experiments on question-answering benchmarks demonstrate that LaPael improves knowledge injection over standard fine-tuning and existing noise-based approaches. Additionally, combining LaPael with data-level paraphrasing further enhances performance.
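One way to read "input-dependent noise on early layers": perturb an early-layer hidden state with Gaussian noise whose scale is derived from the hidden state itself. The fixed per-dimension scaling below is our assumption, since LaPael learns its noise distribution rather than using a fixed form, but it shows where such a perturbation sits in the pipeline:

```python
import numpy as np

def latent_perturb(hidden, noise_scale=0.1, rng=None):
    """Perturb a hidden-state matrix (seq_len x d_model) with Gaussian noise
    whose per-dimension std is tied to that dimension's magnitude in the
    input -- a fixed-form, input-dependent stand-in for LaPael's learned
    noise distribution."""
    rng = rng or np.random.default_rng(0)
    std = noise_scale * np.abs(hidden)            # input-dependent scale
    return hidden + rng.normal(size=hidden.shape) * std
```

Because the noise scale follows the input, zero activations stay untouched while large activations receive proportionally larger, paraphrase-like perturbations.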
[NLP-10] TaxaBind: A Unified Embedding Space for Ecological Applications WACV2025
[Quick Read]: This paper addresses the problem of characterizing arbitrary species for ecological applications with a unified embedding space called TaxaBind. The key is a multimodal embedding space spanning six modalities: ground-level images of species, geographic location, satellite imagery, text, audio, and environmental features. Ground-level images serve as the binding modality, and a multimodal patching technique distills knowledge from the other modalities into it, letting TaxaBind handle tasks including species classification, cross-modal retrieval, and audio classification. The paper also builds two pretraining datasets, iSatNat and iSoundNat, and introduces the TaxaBench-8k dataset for model evaluation.
Link: https://arxiv.org/abs/2411.00683
Authors: Srikumar Sastry, Subash Khanal, Aayush Dhakal, Adeel Ahmad, Nathan Jacobs
Keywords (EN): unified embedding space, embedding space, species, images, unified embedding
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted to WACV 2025
Abstract:We present TaxaBind, a unified embedding space for characterizing any species of interest. TaxaBind is a multimodal embedding space across six modalities: ground-level images of species, geographic location, satellite image, text, audio, and environmental features, useful for solving ecological problems. To learn this joint embedding space, we leverage ground-level images of species as a binding modality. We propose multimodal patching, a technique for effectively distilling the knowledge from various modalities into the binding modality. We construct two large datasets for pretraining: iSatNat with species images and satellite images, and iSoundNat with species images and audio. Additionally, we introduce TaxaBench-8k, a diverse multimodal dataset with six paired modalities for evaluating deep learning models on ecological tasks. Experiments with TaxaBind demonstrate its strong zero-shot and emergent capabilities on a range of tasks including species classification, cross-model retrieval, and audio classification. The datasets and models are made available at this https URL.
[NLP-11] Zipfian Whitening NEURIPS2024
Link: https://arxiv.org/abs/2411.00680
Authors: Sho Yokoi, Han Bao, Hiroto Kurita, Hidetoshi Shimodaira
Keywords (EN):
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: NeurIPS 2024
[NLP-12] Phase Diagram of Vision Large Language Models Inference: A Perspective from Interaction across Image and Instruction
[Quick Read]: This paper investigates the underexplored internal behavior of Vision Large Language Models (VLLMs) when processing concatenated image and text embeddings, in particular how the two modalities interact. The key is measuring contextualization among the hidden-state vectors of tokens from different modalities, which reveals a four-phase inference dynamic across model depth: (I) Alignment, where inter-modal contextualization emerges in very early layers, suggesting feature-space alignment; (II) Intra-modal Encoding, where early layers strengthen intra-modal contextualization while suppressing inter-modal interaction, suggesting local encoding within modalities; (III) Inter-modal Encoding, where later layers strengthen contextualization across modalities, suggesting deeper fusion; and (IV) Output Preparation, where very late layers reduce contextualization globally and align hidden states toward the output space. These findings help explain the internal workings of VLLMs and the dynamics of inter-modal interaction.
Link: https://arxiv.org/abs/2411.00646
Authors: Houjing Wei, Hakaze Cho, Yuting Shi, Naoya Inoue
Keywords (EN): Vision Large Language, Large Language Models, Vision Large, Large Language, conduct causal modeling
Subjects: Computation and Language (cs.CL)
Comments: 6 pages, 5 figures
Abstract:Vision Large Language Models (VLLMs) usually take input as a concatenation of image token embeddings and text token embeddings and conduct causal modeling. However, their internal behaviors remain underexplored, raising the question of interaction among two types of tokens. To investigate such multimodal interaction during model inference, in this paper, we measure the contextualization among the hidden state vectors of tokens from different modalities. Our experiments uncover a four-phase inference dynamics of VLLMs against the depth of Transformer-based LMs, including (I) Alignment: In very early layers, contextualization emerges between modalities, suggesting a feature space alignment. (II) Intra-modal Encoding: In early layers, intra-modal contextualization is enhanced while inter-modal interaction is suppressed, suggesting a local encoding within modalities. (III) Inter-modal Encoding: In later layers, contextualization across modalities is enhanced, suggesting a deeper fusion across modalities. (IV) Output Preparation: In very late layers, contextualization is reduced globally, and hidden states are aligned towards the unembedding space.
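The contextualization measurements above can be imitated with a simple proxy: at each layer, average the cosine similarity between image-token and text-token hidden states, and track that value across depth to trace the four phases. Cosine similarity here is our stand-in, not necessarily the paper's actual metric:

```python
import numpy as np

def cross_modal_contextualization(img_states, txt_states):
    """Mean pairwise cosine similarity between image-token and text-token
    hidden states at one layer (each array: n_tokens x d_model). Higher
    values are read as stronger inter-modal contextualization; computing
    this per layer would trace the four-phase dynamics over depth."""
    def unit(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    return float((unit(img_states) @ unit(txt_states).T).mean())
```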
[NLP-13] ConvCounsel: A Conversational Dataset for Student Counseling
[Quick Read]: This paper addresses the imbalance between students and counselors in student mental-health services: in most universities the ratio exceeds the recommended 250:1 standard, causing long waits for in-person counseling and suboptimal treatment. The key is ConvCounsel, a specialized mental-health dataset emphasizing the active-listening strategy used in counseling conversations. It contains both speech and text data, supporting the development of reliable mental-health dialogue systems. The paper also presents NYCUKA, a spoken mental-health dialogue system built on the dataset, demonstrating its usefulness.
Link: https://arxiv.org/abs/2411.00604
Authors: Po-Chuan Chen, Mahdin Rohmatillah, You-Teng Lin, Jen-Tzung Chien
Keywords (EN): necessitates special attention, Student mental health, mental health dialogue, Student mental, mental health
Subjects: Computation and Language (cs.CL)
Comments: Accepted at O-COCOSDA 2024, Won Best Student Paper Award
Abstract:Student mental health is a sensitive issue that necessitates special attention. A primary concern is the student-to-counselor ratio, which surpasses the recommended standard of 250:1 in most universities. This imbalance results in extended waiting periods for in-person consultations, which cause suboptimal treatment. Significant efforts have been directed toward developing mental health dialogue systems utilizing the existing open-source mental health-related datasets. However, currently available datasets either discuss general topics or various strategies that may not be viable for direct application due to numerous ethical constraints inherent in this research domain. To address this issue, this paper introduces a specialized mental health dataset that emphasizes the active listening strategy employed in conversation for counseling, also named as ConvCounsel. This dataset comprises both speech and text data, which can facilitate the development of a reliable pipeline for mental health dialogue systems. To demonstrate the utility of the proposed dataset, this paper also presents the NYCUKA, a spoken mental health dialogue system that is designed by using the ConvCounsel dataset. The results show the merit of using this dataset.
[NLP-14] Adapting Language Models via Token Translation
【速读】: 该论文试图解决在将预训练的大型语言模型应用于新目标领域时,由于使用固定分词器(tokenizer)导致的压缩效果不佳、推理成本增加以及语义对齐减弱的问题。解决方案的关键是引入稀疏Sinkhorn分词转换(Sparse Sinkhorn Token Translation, S2T2)。S2T2通过为目标领域训练定制的分词器,并学习目标与源分词之间的转换,从而更有效地重用预训练的源分词预测器。实验结果表明,S2T2在微调的英语语言模型中,对域外蛋白质序列的困惑度和压缩效果均有显著提升,优于直接使用源或目标分词器的微调方法。此外,S2T2还发现,为较小、成本较低的模型学习到的分词转换可以直接迁移到更大、更强大的模型中,以较低成本实现S2T2的效益。
链接: https://arxiv.org/abs/2411.00593
作者: Zhili Feng,Tanya Marwah,Lester Mackey,David Alvarez-Melis,Nicolo Fusi
关键词-EN: Modern large language, effectively compress text, compress text drawn, Modern large, effectively compress
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Modern large language models use a fixed tokenizer to effectively compress text drawn from a source domain. However, applying the same tokenizer to a new target domain often leads to inferior compression, more costly inference, and reduced semantic alignment. To address this deficiency, we introduce Sparse Sinkhorn Token Translation (S2T2). S2T2 trains a tailored tokenizer for the target domain and learns to translate between target and source tokens, enabling more effective reuse of the pre-trained next-source-token predictor. In our experiments with finetuned English language models, S2T2 improves both the perplexity and the compression of out-of-domain protein sequences, outperforming direct finetuning with either the source or target tokenizer. In addition, we find that token translations learned for smaller, less expensive models can be directly transferred to larger, more powerful models to reap the benefits of S2T2 at lower cost.
摘要:现代大语言模型使用固定的 Tokenizer 来有效压缩来自源域的文本。然而,将相同的 Tokenizer 应用于新的目标域通常会导致压缩效果不佳、推理成本增加以及语义对齐度降低。为了解决这一缺陷,我们提出了稀疏 Sinkhorn Token 翻译 (Sparse Sinkhorn Token Translation, S2T2)。S2T2 为目标域训练定制的 Tokenizer,并学习在目标和源 Token 之间进行翻译,从而实现对预训练的下一源 Token 预测器的更有效复用。在我们的实验中,通过对微调的英语语言模型进行测试,S2T2 在处理域外蛋白质序列时,不仅提高了困惑度 (perplexity),还增强了压缩效果,优于直接使用源或目标 Tokenizer 进行微调。此外,我们发现,为较小、成本较低的模型学习的 Token 翻译可以直接迁移到更大、更强大的模型中,以较低的成本享受 S2T2 带来的好处。
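S2T2 的核心之一是学习目标词元与源词元之间的"翻译"。下面是一个极简示意(非论文官方实现,矩阵数值为本文假设的示例):用 Sinkhorn-Knopp 迭代把一个非负的词元相似度矩阵归一化为近似双随机矩阵,作为两个词表之间的软对应关系。

```python
# 示意性代码:Sinkhorn 迭代,把目标词元-源词元相似度矩阵
# 交替做行、列归一化,得到近似双随机的"翻译"矩阵。
# 矩阵数值为假设示例,并非论文中的真实数据。

def sinkhorn(matrix, n_iters=50):
    """对非负矩阵交替做行、列归一化(Sinkhorn-Knopp 迭代)。"""
    m = [row[:] for row in matrix]
    for _ in range(n_iters):
        # 行归一化:每个目标词元对源词元的分布和为 1
        for i, row in enumerate(m):
            s = sum(row)
            m[i] = [x / s for x in row]
        # 列归一化:每个源词元被"认领"的总量也趋于 1
        for j in range(len(m[0])):
            s = sum(row[j] for row in m)
            for i in range(len(m)):
                m[i][j] /= s
    return m

# 假设的 3x3 相似度矩阵(例如由嵌入内积取 exp 得到)
sim = [[2.0, 0.5, 0.1],
       [0.3, 1.8, 0.4],
       [0.2, 0.6, 1.5]]
translation = sinkhorn(sim)
```

论文中的"稀疏"版本还会在此基础上施加稀疏约束;这里只演示最基本的双随机归一化思路。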
[NLP-15] ReverseNER: A Self-Generated Example-Driven Framework for Zero-Shot Named Entity Recognition with Large Language Models
【速读】: 该论文试图解决大型语言模型 (LLMs) 在零样本命名实体识别 (NER) 任务中,特别是某些实体类型边界模糊的情况下表现不佳的问题。解决方案的关键在于提出了一种名为 ReverseNER 的框架,该框架通过逆向 NER 过程构建一个可靠的示例库。具体来说,ReverseNER 首先根据实体定义使用 LLM 生成实体,然后将这些实体扩展为完整的句子,同时引导 LLM 复制特定“特征句子”的结构,这些特征句子是从任务句子中通过聚类提取的。这种方法生成的句子具有清晰的实体标注,同时保持与任务句子在语义和结构上的相似性。构建示例库后,ReverseNER 选择与每个任务句子语义最相似的示例标签来支持 LLM 的推理。此外,论文还提出了一种实体级别的自一致性评分机制,以进一步提高 NER 性能。实验结果表明,ReverseNER 显著优于传统的零样本 NER 方法,并在多个少样本方法中表现出色,特别是在标注数据有限的领域中。
链接: https://arxiv.org/abs/2411.00533
作者: Anbang Wang
关键词-EN: Named Entity Recognition, zero-shot Named Entity, Entity Recognition, Named Entity, large language models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:This paper presents ReverseNER, a framework aimed at overcoming the limitations of large language models (LLMs) in zero-shot Named Entity Recognition (NER) tasks, particularly in cases where certain entity types have ambiguous boundaries. ReverseNER tackles this challenge by constructing a reliable example library with the reversed process of NER. Rather than beginning with sentences, this method uses an LLM to generate entities based on their definitions and then expands them into full sentences. During sentence generation, the LLM is guided to replicate the structure of a specific ‘feature sentence’, extracted from the task sentences by clustering. This results in well-annotated sentences with clearly labeled entities, while preserving semantic and structural similarity to the task sentences. Once the example library is constructed, the method selects the most semantically similar example labels for each task sentence to support the LLM’s inference. We also propose an entity-level self-consistency scoring mechanism to improve NER performance with LLMs. Experiments show that ReverseNER significantly outperforms traditional zero-shot NER with LLMs and surpasses several few-shot methods, marking a notable improvement in NER for domains with limited labeled data.
摘要:本文提出了 ReverseNER,这是一个旨在克服大语言模型 (LLM) 在零样本命名实体识别 (NER) 任务中的局限性的框架,特别是在某些实体类型边界模糊的情况下。ReverseNER 通过构建一个可靠的示例库来应对这一挑战,该库利用了 NER 的逆过程。与从句子开始的传统方法不同,该方法使用 LLM 根据实体定义生成实体,然后将其扩展为完整句子。在句子生成过程中,LLM 被引导复制从任务句子中通过聚类提取的特定“特征句子”的结构。这不仅生成了带有清晰标注实体的句子,还保留了与任务句子在语义和结构上的相似性。一旦构建了示例库,该方法会选择与每个任务句子语义上最相似的示例标签来支持 LLM 的推理。此外,我们还提出了一种实体级别的自一致性评分机制,以提高 LLM 在 NER 任务中的表现。实验结果表明,ReverseNER 在零样本 NER 任务中显著优于传统方法,并超越了多种少样本方法,标志着在标注数据有限的领域中 NER 性能的显著提升。
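ReverseNER 构建示例库后,需要为每个任务句子挑选语义最相似的示例来支撑推理。下面用一个极简示意(非论文官方实现)说明这一选择步骤:这里以简单的词袋余弦相似度代替真实的句向量,示例库内容均为假设数据。

```python
# 示意性代码:从示例库中为任务句子选取最相似的 (句子, 实体标注) 对。
# cosine 用词袋近似句向量相似度,仅作占位。
from collections import Counter
import math

def cosine(a, b):
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_example(task_sentence, example_library):
    """返回示例库中与任务句子最相似的 (句子, 标注) 对。"""
    return max(example_library, key=lambda ex: cosine(task_sentence, ex[0]))

# 假设的示例库:由 LLM 逆向生成并带有清晰实体标注的句子
library = [
    ("Alice joined Google in 2020", {"Alice": "PER", "Google": "ORG"}),
    ("The meeting is in Paris", {"Paris": "LOC"}),
]
best = select_example("Bob joined Microsoft in 2021", library)
```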
[NLP-16] Multi-expert Prompting Improves Reliability, Safety and Usefulness of Large Language Models EMNLP2024
链接: https://arxiv.org/abs/2411.00492
作者: Do Xuan Long,Duong Ngoc Yen,Anh Tuan Luu,Kenji Kawaguchi,Min-Yen Kan,Nancy F. Chen
关键词-EN:
类目: Computation and Language (cs.CL)
备注: EMNLP 2024 Main Conference
[NLP-17] GDTB: Genre Diverse Data for English Shallow Discourse Parsing across Modalities Text Types and Domains EMNLP2024
【速读】: 该论文试图解决现有浅层话语解析数据集(PDTB)在英语中仅限于华尔街日报语料库的问题,该语料库不公开、局限于新闻领域且已有35年历史。解决方案的关键在于提出一个新的开放访问、多领域基准数据集,基于现有的UD English GUM语料库,并展示了在跨领域关系分类实验中,尽管新数据集与PDTB兼容,但仍存在显著的领域外性能下降,而通过联合训练两个数据集可以缓解这一问题。
链接: https://arxiv.org/abs/2411.00491
作者: Yang Janet Liu,Tatsuya Aoyama,Wesley Scivetti,Yilun Zhu,Shabnam Behzad,Lauren Elizabeth Levine,Jessica Lin,Devika Tiwari,Amir Zeldes
关键词-EN: Wall Street Journal, Street Journal corpus, Wall Street, Street Journal, shallow discourse parsing
类目: Computation and Language (cs.CL)
备注: Accepted to EMNLP 2024 (main, long); camera-ready version
点击查看摘要
Abstract:Work on shallow discourse parsing in English has focused on the Wall Street Journal corpus, the only large-scale dataset for the language in the PDTB framework. However, the data is not openly available, is restricted to the news domain, and is by now 35 years old. In this paper, we present and evaluate a new open-access, multi-genre benchmark for PDTB-style shallow discourse parsing, based on the existing UD English GUM corpus, for which discourse relation annotations in other frameworks already exist. In a series of experiments on cross-domain relation classification, we show that while our dataset is compatible with PDTB, substantial out-of-domain degradation is observed, which can be alleviated by joint training on both datasets.
摘要:在英语浅层话语解析领域的研究主要集中在华尔街日报语料库上,这是PDTB框架下唯一的大规模数据集。然而,该数据集并未公开,仅限于新闻领域,并且已有35年历史。本文中,我们基于现有的UD英语GUM语料库,提出并评估了一个新的开放访问、多体裁的PDTB风格浅层话语解析基准数据集,该语料库在其他框架中已存在话语关系标注。在一系列跨领域关系分类实验中,我们发现尽管我们的数据集与PDTB兼容,但在域外数据上存在显著的性能下降,而通过联合训练两个数据集可以缓解这一问题。
[NLP-18] E2E-AFG: An End-to-End Model with Adaptive Filtering for Retrieval-Augmented Generation
【速读】: 该论文试图解决检索增强生成方法中检索内容质量不高的问题,即外部知识库中可能包含无关信息或潜在错误信息,从而影响大型语言模型的生成结果。解决方案的关键在于提出了一个端到端模型(E2E-AFG),该模型将答案存在性判断与文本生成整合在一个单一的端到端框架中,通过自适应过滤机制,使模型能够更有效地聚焦于相关内容,减少无关信息的影响,从而生成更准确的答案。实验结果表明,E2E-AFG在六个代表性的知识密集型语言数据集上均优于基线模型,证明了该方法的有效性和鲁棒性。
链接: https://arxiv.org/abs/2411.00437
作者: Yun Jiang,Zilong Xie,Wei Zhang,Yun Fang,Shuai Pan
关键词-EN: external knowledge bases, Retrieval-augmented generation methods, Retrieval-augmented generation, knowledge bases, methods often neglect
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 13 pages, 3 figures, 5 tables
点击查看摘要
Abstract:Retrieval-augmented generation methods often neglect the quality of content retrieved from external knowledge bases, resulting in irrelevant information or potential misinformation that negatively affects the generation results of large language models. In this paper, we propose an end-to-end model with adaptive filtering for retrieval-augmented generation (E2E-AFG), which integrates answer existence judgment and text generation into a single end-to-end framework. This enables the model to focus more effectively on relevant content while reducing the influence of irrelevant information and generating accurate answers. We evaluate E2E-AFG on six representative knowledge-intensive language datasets, and the results show that it consistently outperforms baseline models across all tasks, demonstrating the effectiveness and robustness of the proposed approach.
摘要:检索增强生成方法往往忽视了从外部知识库中检索内容的质量,导致无关信息或潜在的错误信息对大语言模型的生成结果产生负面影响。本文提出了一种端到端的自适应过滤检索增强生成模型(E2E-AFG),该模型将答案存在性判断与文本生成整合到一个端到端的框架中。这使得模型能够更有效地聚焦于相关内容,同时减少无关信息的影响,生成准确的答案。我们在六个具有代表性的知识密集型语言数据集上评估了E2E-AFG,结果表明,它在所有任务中均持续优于基线模型,展示了所提出方法的有效性和鲁棒性。
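E2E-AFG 将"答案存在性判断"与生成整合在同一框架中。下面用一个极简示意(非论文官方实现,函数与数据均为假设的占位)说明这种自适应过滤的思路:先判断检索片段是否含有答案线索,再据此决定生成或显式回退。

```python
# 示意性代码:自适应过滤的占位流程——
# 先做答案存在性判断,只基于相关片段生成,全不相关时回退。

def answer_exists(query, passage):
    """占位判断:查询中的词是否出现在检索片段中。"""
    return any(w in passage for w in query.split())

def generate_answer(query, passages):
    """仅基于判定为相关的片段生成答案;全不相关时显式回退。"""
    relevant = [p for p in passages if answer_exists(query, p)]
    if not relevant:
        return "无法从检索内容中找到答案"
    return "基于 %d 条相关片段作答" % len(relevant)

passages = ["巴黎 是 法国 的 首都", "橄榄球 规则 简介"]
out = generate_answer("法国 首都 是 哪里", passages)
```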
[NLP-19] DARD: A Multi-Agent Approach for Task-Oriented Dialog Systems
【速读】: 该论文试图解决多领域任务导向对话系统开发中的复杂性问题,特别是如何有效处理跨多个领域的多样化用户意图、实体类型和领域特定知识。解决方案的关键在于提出了DARD(Domain Assigned Response Delegation),一个多代理对话系统。DARD通过利用领域特定的代理,并由中央对话管理代理进行协调,成功地处理了多领域对话。该系统结合了较小微调模型(如Flan-T5-large和Mistral-7B)与大型语言模型(如Claude Sonnet 3.0)的优势,展示了其在灵活性和可组合性方面的显著优势。通过在MultiWOZ基准上的评估,DARD实现了对话信息率提升6.6%和成功率提升4.1%的最新性能。
链接: https://arxiv.org/abs/2411.00427
作者: Aman Gupta,Anirudh Ravichandran,Ziji Zhang,Swair Shah,Anurag Beniwal,Narayanan Sadagopan
关键词-EN: Task-oriented dialogue systems, Assigned Response Delegation, Domain Assigned Response, essential for applications, applications ranging
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Task-oriented dialogue systems are essential for applications ranging from customer service to personal assistants and are widely used across various industries. However, developing effective multi-domain systems remains a significant challenge due to the complexity of handling diverse user intents, entity types, and domain-specific knowledge across several domains. In this work, we propose DARD (Domain Assigned Response Delegation), a multi-agent conversational system capable of successfully handling multi-domain dialogs. DARD leverages domain-specific agents, orchestrated by a central dialog manager agent. Our extensive experiments compare and utilize various agent modeling approaches, combining the strengths of smaller fine-tuned models (Flan-T5-large & Mistral-7B) with their larger counterparts, Large Language Models (LLMs) (Claude Sonnet 3.0). We provide insights into the strengths and limitations of each approach, highlighting the benefits of our multi-agent framework in terms of flexibility and composability. We evaluate DARD using the well-established MultiWOZ benchmark, achieving state-of-the-art performance by improving the dialogue inform rate by 6.6% and the success rate by 4.1% over the best-performing existing approaches. Additionally, we discuss various annotator discrepancies and issues within the MultiWOZ dataset and its evaluation system.
摘要:面向任务的对话系统在从客户服务到个人助理的应用中至关重要,并在各个行业中得到广泛应用。然而,开发有效的多领域系统仍然是一个重大挑战,因为处理跨多个领域的多样化用户意图、实体类型和领域特定知识具有复杂性。在本研究中,我们提出了DARD(领域分配响应委托),这是一个能够成功处理多领域对话的多智能体对话系统。DARD利用由中央对话管理智能体协调的领域特定智能体。我们进行了广泛的实验,比较并利用了多种智能体建模方法,结合了较小微调模型(如Flan-T5-large和Mistral-7B)与其较大版本,即大语言模型(LLMs)(如Claude Sonnet 3.0)的优势。我们深入探讨了每种方法的优势和局限性,突出了我们多智能体框架在灵活性和可组合性方面的优势。我们使用公认的MultiWOZ基准评估了DARD,通过将对话信息率提高6.6%和成功率提高4.1%,实现了现有最佳方法的最新性能。此外,我们还讨论了MultiWOZ数据集及其评估系统中存在的各种标注差异和问题。
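DARD 的关键是中央对话管理智能体把每轮用户输入路由到对应领域的智能体。下面是一个极简示意(非论文官方实现):用假设的关键词表做路由,未命中时回退到通用智能体;领域名与关键词均为本文为说明而设。

```python
# 示意性代码:中央对话管理器按关键词把用户话语分配给领域智能体。
DOMAIN_KEYWORDS = {
    "hotel": ["酒店", "住宿"],
    "train": ["火车", "车票"],
    "restaurant": ["餐厅", "订位"],
}

def dialog_manager(utterance):
    """命中某领域关键词则路由到该领域智能体,否则交给通用智能体。"""
    for domain, keywords in DOMAIN_KEYWORDS.items():
        if any(k in utterance for k in keywords):
            return domain
    return "general"

route = dialog_manager("帮我订一家市中心的酒店")
```

真实系统中路由通常由模型(而非关键词)完成,这里仅展示"委托"这一结构本身。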
[NLP-20] Self-Evolved Reward Learning for LLMs
【速读】: 该论文试图解决在强化学习从人类反馈 (Reinforcement Learning from Human Feedback, RLHF) 中,训练可靠的奖励模型 (Reward Model, RM) 时依赖高质量人类标注数据的高成本和潜在偏差问题。解决方案的关键在于提出了一种自进化奖励学习 (Self-Evolved Reward Learning, SER) 方法,该方法通过奖励模型自身生成额外的训练数据,并迭代地改进自身性能。实验结果表明,即使在有限的人类标注数据下,通过自反馈学习可以稳健地提升奖励模型的表现,从而增强大型语言模型 (Large Language Models, LLMs) 的能力。
链接: https://arxiv.org/abs/2411.00418
作者: Chenghua Huang,Zhizhen Fan,Lu Wang,Fangkai Yang,Pu Zhao,Zeqi Lin,Qingwei Lin,Dongmei Zhang,Saravan Rajmohan,Qi Zhang
关键词-EN: Human Feedback, Reinforcement Learning, playing a pivotal, crucial technique, technique for aligning
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 19 pages,6 figures
点击查看摘要
Abstract:Reinforcement Learning from Human Feedback (RLHF) is a crucial technique for aligning language models with human preferences, playing a pivotal role in the success of conversational models like GPT-4, ChatGPT, and Llama 2. A core challenge in employing RLHF lies in training a reliable reward model (RM), which relies on high-quality labels typically provided by human experts or advanced AI system. These methods can be costly and may introduce biases that affect the language model’s responses. As language models improve, human input may become less effective in further enhancing their performance. In this paper, we propose Self-Evolved Reward Learning (SER), a novel approach where the RM generates additional training data to iteratively improve itself. We conducted extensive experiments on multiple datasets such as HH-RLHF and UltraFeedback, using models like Mistral and Llama 3, and compare SER against various baselines. Our results demonstrate that even with limited human-annotated data, learning from self-feedback can robustly enhance RM performance, thereby boosting the capabilities of large language models (LLMs).
摘要:人类反馈强化学习 (Reinforcement Learning from Human Feedback, RLHF) 是使语言模型与人类偏好对齐的关键技术,对于 GPT-4、ChatGPT 和 Llama 2 等对话模型的成功起到了至关重要的作用。在使用 RLHF 时,一个核心挑战在于训练一个可靠的奖励模型 (Reward Model, RM),这通常依赖于由人类专家或高级 AI 系统提供的高质量标签。这些方法成本高昂,并且可能引入偏差,影响语言模型的响应。随着语言模型的改进,人类输入在进一步提升其性能方面可能变得不那么有效。在本文中,我们提出了自进化奖励学习 (Self-Evolved Reward Learning, SER),这是一种新颖的方法,其中 RM 生成额外的训练数据以迭代地改进自身。我们在多个数据集(如 HH-RLHF 和 UltraFeedback)上进行了广泛的实验,使用了 Mistral 和 Llama 3 等模型,并将 SER 与各种基线方法进行了比较。我们的结果表明,即使在有限的人类标注数据下,通过自我反馈学习可以稳健地提升 RM 的性能,从而增强大语言模型 (Large Language Models, LLMs) 的能力。
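SER 的核心循环是:奖励模型先给无标注的回答对打分,把高置信的自标注结果并入训练集,再迭代更新自身。下面用一个玩具打分函数做极简示意(非论文官方实现,关键词计分仅为占位,数据均为假设示例):

```python
# 示意性代码:自进化奖励学习的一轮迭代——
# 用当前"奖励模型"给无标注对打分,只采纳分差足够大的高置信样本。

def toy_reward(answer, keywords):
    """玩具奖励:命中关键词的数量(代替真实奖励模型的打分)。"""
    return sum(1 for k in keywords if k in answer)

def self_evolve(labeled, unlabeled, keywords, threshold=2):
    """一轮 SER:筛选分差达到阈值的回答对并入训练集。"""
    new_data = list(labeled)
    for good, bad in unlabeled:
        margin = toy_reward(good, keywords) - toy_reward(bad, keywords)
        if abs(margin) >= threshold:          # 高置信才采纳
            pair = (good, bad) if margin > 0 else (bad, good)
            new_data.append(pair)
    return new_data

labeled = [("helpful detailed answer", "rude answer")]
unlabeled = [("polite helpful reply", "spam"),
             ("maybe ok", "also ok")]          # 低置信,应被过滤
grown = self_evolve(labeled, unlabeled, keywords=["helpful", "polite", "detailed"])
```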
[NLP-21] Adapting While Learning: Grounding LLM s for Scientific Problems with Intelligent Tool Usage Adaptation
【速读】: 该论文试图解决大型语言模型(LLMs)在处理复杂科学问题时容易产生幻觉的问题,同时避免过度依赖工具导致模型在解决简单问题时推理能力下降。解决方案的关键在于提出了一种新颖的两组件微调方法:世界知识蒸馏(World Knowledge Distillation, WKD)和工具使用适应(Tool Usage Adaptation, TUA)。WKD通过直接从工具生成的解决方案中学习,使模型内化领域知识;TUA则根据模型直接回答的准确性将问题分为简单和复杂两类,对简单问题保持与WKD相同的对齐目标,而对复杂问题则训练模型智能地切换到工具使用。这种方法在多个科学基准数据集上验证了其有效性,显著提升了答案准确性和工具使用精度。
链接: https://arxiv.org/abs/2411.00412
作者: Bohan Lyu,Yadi Cao,Duncan Watson-Parris,Leon Bergen,Taylor Berg-Kirkpatrick,Rose Yu
关键词-EN: Large Language Models, Large Language, demonstrate promising capabilities, Language Models, promising capabilities
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 26 pages, 15 figures
点击查看摘要
Abstract:Large Language Models (LLMs) demonstrate promising capabilities in solving simple scientific problems but often produce hallucinations for complex ones. While integrating LLMs with tools can increase reliability, this approach typically results in over-reliance on tools, diminishing the model’s ability to solve simple problems through basic reasoning. In contrast, human experts first assess problem complexity using domain knowledge before choosing an appropriate solution approach. Inspired by this human problem-solving process, we propose a novel two-component fine-tuning method. In the first component World Knowledge Distillation (WKD), LLMs learn directly from solutions generated using tool’s information to internalize domain knowledge. In the second component Tool Usage Adaptation (TUA), we partition problems into easy and hard categories based on the model’s direct answering accuracy. While maintaining the same alignment target for easy problems as in WKD, we train the model to intelligently switch to tool usage for more challenging problems. We validate our method on six scientific benchmark datasets, spanning mathematics, climate science and epidemiology. On average, our models demonstrate a 28.18% improvement in answer accuracy and a 13.89% increase in tool usage precision across all datasets, surpassing state-of-the-art models including GPT-4o and Claude-3.5.
摘要:大语言模型 (LLM) 在解决简单科学问题方面展现出令人鼓舞的能力,但在处理复杂问题时往往会产生幻觉。尽管将 LLM 与工具结合可以提高可靠性,但这种方法通常会导致过度依赖工具,从而削弱模型通过基本推理解决简单问题的能力。相比之下,人类专家首先会利用领域知识评估问题的复杂性,然后选择合适的解决方案。受此人类问题解决过程的启发,我们提出了一种新颖的两部分微调方法。在第一部分世界知识蒸馏 (World Knowledge Distillation, WKD) 中,LLM 直接从使用工具信息生成的解决方案中学习,以内化领域知识。在第二部分工具使用适应 (Tool Usage Adaptation, TUA) 中,我们根据模型直接回答的准确性将问题分为简单和困难两类。对于简单问题,我们保持与 WKD 相同的对齐目标,同时训练模型在面对更具挑战性的问题时智能地切换到工具使用。我们在六个科学基准数据集上验证了我们的方法,涵盖数学、气候科学和流行病学领域。平均而言,我们的模型在回答准确性上提高了 28.18%,在工具使用精度上提高了 13.89%,超越了包括 GPT-4o 和 Claude-3.5 在内的最先进模型。
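TUA 阶段按模型"直接作答"的准确率把问题划分为简单/复杂两类,复杂问题训练模型改用工具。下面是该划分步骤的极简示意(非论文官方实现,准确率数值与阈值均为假设示例):

```python
# 示意性代码:按直接作答准确率划分 easy / hard 两类问题。

def partition_problems(accuracy_by_problem, threshold=0.5):
    """准确率 >= threshold 的归为 easy,否则 hard(训练时引导用工具)。"""
    easy = [p for p, acc in accuracy_by_problem.items() if acc >= threshold]
    hard = [p for p, acc in accuracy_by_problem.items() if acc < threshold]
    return easy, hard

# 假设的逐题直接作答准确率
acc = {"2+2": 0.98, "解常微分方程": 0.12, "气候模型预测": 0.05, "单位换算": 0.85}
easy, hard = partition_problems(acc)
```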
[NLP-22] Enhancing Authorship Attribution through Embedding Fusion: A Novel Approach with Masked and Encoder-Decoder Language Models
【速读】: 该论文试图解决如何有效区分AI生成内容与人类撰写文本的问题。解决方案的关键在于利用预训练语言模型(Pre-trained Language Models, PLMs)的文本嵌入(textual embeddings),并通过嵌入融合(Embedding Fusion)技术整合多个语言模型的语义信息,从而提升分类性能。该方法在多个公开数据集上的评估结果显示,分类准确率超过96%,马修斯相关系数(Matthews Correlation Coefficient, MCC)超过0.93,表明其有效性和鲁棒性。
链接: https://arxiv.org/abs/2411.00411
作者: Arjun Ramesh Kaushik,Sunil Rufus R P,Nalini Ratha
关键词-EN: reliable discrimination methods, content alongside human-written, Pre-trained Language Models, Language Models, Large Language Models
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:The increasing prevalence of AI-generated content alongside human-written text underscores the need for reliable discrimination methods. To address this challenge, we propose a novel framework with textual embeddings from Pre-trained Language Models (PLMs) to distinguish AI-generated and human-authored text. Our approach utilizes Embedding Fusion to integrate semantic information from multiple Language Models, harnessing their complementary strengths to enhance performance. Through extensive evaluation across publicly available diverse datasets, our proposed approach demonstrates strong performance, achieving classification accuracy greater than 96% and a Matthews Correlation Coefficient (MCC) greater than 0.93. This evaluation is conducted on a balanced dataset of texts generated from five well-known Large Language Models (LLMs), highlighting the effectiveness and robustness of our novel methodology.
摘要:随着人工智能生成内容与人类撰写文本的日益普遍,可靠的区分方法变得至关重要。为应对这一挑战,我们提出了一种基于预训练语言模型 (Pre-trained Language Models, PLMs) 文本嵌入的新框架,用于区分人工智能生成和人类撰写的文本。我们的方法采用嵌入融合 (Embedding Fusion) 技术,整合多个语言模型的语义信息,利用它们的互补优势提升性能。通过在公开的多样化数据集上进行广泛评估,我们提出的方法表现出强大的性能,分类准确率超过 96%,马修斯相关系数 (Matthews Correlation Coefficient, MCC) 超过 0.93。此评估基于由五个知名大语言模型 (Large Language Models, LLMs) 生成的平衡文本数据集,突显了我们新方法的有效性和鲁棒性。
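嵌入融合最直接的形式是把多个语言模型对同一文本产生的向量拼接成一个特征向量,再交给下游分类器。下面是一个极简示意(非论文官方实现,向量数值为假设示例,真实嵌入应由各预训练模型给出):

```python
# 示意性代码:拼接多个模型的嵌入向量,利用它们的互补信息。

def fuse_embeddings(*embeddings):
    """按顺序拼接若干嵌入向量,返回融合后的特征向量。"""
    fused = []
    for e in embeddings:
        fused.extend(e)
    return fused

emb_model_a = [0.1, 0.3]       # 假设:模型 A 的 2 维句向量
emb_model_b = [0.7, 0.2, 0.5]  # 假设:模型 B 的 3 维句向量
features = fuse_embeddings(emb_model_a, emb_model_b)
```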
[NLP-23] MetaMetrics-MT: Tuning Meta-Metrics for Machine Translation via Human Preference Calibration
【速读】: 该论文试图解决机器翻译 (MT) 任务中现有评估指标与人类偏好之间的一致性问题。解决方案的关键在于提出了 MetaMetrics-MT,这是一种通过贝叶斯优化与高斯过程 (Bayesian optimization with Gaussian Processes) 来优化现有 MT 指标与人类判断之间相关性的创新度量方法。MetaMetrics-MT 在 WMT24 度量共享任务数据集上的实验结果表明,它在基于参考的设置中显著优于所有现有基线,成为新的最先进性能基准,同时在无参考设置中也表现出与领先指标相当的效率。
链接: https://arxiv.org/abs/2411.00390
作者: David Anugraha,Garry Kuwanto,Lucky Susanto,Derry Tanti Wijaya,Genta Indra Winata
关键词-EN: Gaussian Processes, evaluate machine translation, preferences through Bayesian, Bayesian optimization, optimization with Gaussian
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Preprint
点击查看摘要
Abstract:We present MetaMetrics-MT, an innovative metric designed to evaluate machine translation (MT) tasks by aligning closely with human preferences through Bayesian optimization with Gaussian Processes. MetaMetrics-MT enhances existing MT metrics by optimizing their correlation with human judgments. Our experiments on the WMT24 metric shared task dataset demonstrate that MetaMetrics-MT outperforms all existing baselines, setting a new benchmark for state-of-the-art performance in the reference-based setting. Furthermore, it achieves comparable results to leading metrics in the reference-free setting, offering greater efficiency.
摘要:我们提出了 MetaMetrics-MT,这是一种创新的评估指标,旨在通过基于高斯过程的贝叶斯优化,紧密对齐机器翻译 (MT) 任务中的人类偏好。MetaMetrics-MT 通过优化其与人类判断的相关性,提升了现有 MT 指标的性能。我们在 WMT24 评估任务数据集上的实验表明,MetaMetrics-MT 在基于参考的设置中超越了所有现有基线,为最先进的表现设定了新的基准。此外,在无参考设置中,它与领先的指标取得了可比的结果,同时提供了更高的效率。
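MetaMetrics-MT 的思路是调一组权重,把多个已有指标加权组合,使组合结果与人工评分的相关性最大。论文使用高斯过程贝叶斯优化;下面为可运行起见用网格搜索代替,并用简化的 Kendall tau 衡量相关性(非论文官方实现,所有分数均为假设数据):

```python
# 示意性代码:搜索权重 w,使 w*metric1 + (1-w)*metric2
# 与人工评分的 Kendall tau 最大。

def kendall_tau(a, b):
    """不处理并列的简化 Kendall tau。"""
    n, concordant, discordant = len(a), 0, 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (a[i] - a[j]) * (b[i] - b[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

def tune_weight(metric1, metric2, human, steps=101):
    """网格搜索权重(代替论文中的贝叶斯优化)。"""
    best_w, best_tau = 0.0, -2.0
    for k in range(steps):
        w = k / (steps - 1)
        combined = [w * m1 + (1 - w) * m2 for m1, m2 in zip(metric1, metric2)]
        tau = kendall_tau(combined, human)
        if tau > best_tau:
            best_w, best_tau = w, tau
    return best_w, best_tau

# 假设数据:metric2 与人工评分方向一致,metric1 近似噪声
human   = [1, 2, 3, 4]
metric1 = [4, 1, 3, 2]
metric2 = [1.1, 1.9, 3.2, 3.8]
w, tau = tune_weight(metric1, metric2, human)
```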
[NLP-24] STEM-PoM: Evaluating Language Models' Math-Symbol Reasoning in Document Parsing NEURIPS
【速读】: 该论文试图解决大语言模型(LLMs)在理解和解释数学丰富文档中抽象数学符号的能力不足的问题。解决方案的关键在于引入了一个名为STEM-PoM的综合基准数据集,该数据集专门设计用于评估LLMs在科学文本中处理数学符号的推理能力。STEM-PoM数据集从实际的ArXiv文档中提取,包含超过2000个数学符号,这些符号被分类为主变量、常量、运算符和单位描述符的主要属性,并进一步细分为子属性,如变量的标量/向量/矩阵分类以及常量和运算符的局部/全局/学科特定标签。通过实验,论文展示了当前最先进的LLMs在上下文学习和微调后的平均准确率分别为20-60%和50-60%,揭示了LLMs在数学推理能力上的显著差距,从而为开发能够稳健处理数学符号的高级数学AI模型提供了研究基础。
链接: https://arxiv.org/abs/2411.00387
作者: Jiaru Zou,Qing Wang,Pratyush Thakur,Nickvash Kani
关键词-EN: Advances in large, math-rich STEM documents, math-rich STEM, large language models, STEM documents
类目: Computation and Language (cs.CL)
备注: Accepted to NeurIPS Math-AI 2024
点击查看摘要
Abstract:Advances in large language models (LLMs) have spurred research into enhancing their reasoning capabilities, particularly in math-rich STEM documents. While LLMs can generate equations or solve math-related queries, their ability to fully understand and interpret abstract mathematical symbols in long, math-rich documents remains limited. In this paper, we introduce STEM-PoM, a comprehensive benchmark dataset designed to evaluate LLMs’ reasoning abilities on math symbols within contextual scientific text. The dataset, sourced from real-world ArXiv documents, contains over 2K math symbols classified as main attributes of variables, constants, operators, and unit descriptors, with additional sub-attributes including scalar/vector/matrix for variables and local/global/discipline-specific labels for both constants and operators. Our extensive experiments show that state-of-the-art LLMs achieve an average of 20-60% accuracy under in-context learning and 50-60% accuracy with fine-tuning, revealing a significant gap in their mathematical reasoning capabilities. STEM-PoM fuels future research of developing advanced Math-AI models that can robustly handle math symbols.
摘要:大语言模型 (Large Language Models, LLMs) 的进步激发了对增强其推理能力的研究,特别是在富含数学内容的 STEM 文档中。尽管 LLMs 能够生成方程或解决与数学相关的查询,但它们在理解和解释长篇、富含数学内容的文档中的抽象数学符号方面仍然存在局限。本文中,我们介绍了 STEM-PoM,这是一个综合基准数据集,旨在评估 LLMs 在科学文本中对数学符号的推理能力。该数据集来源于实际的 ArXiv 文档,包含超过 2,000 个数学符号,这些符号被分类为变量、常量、运算符和单位描述符的主要属性,并附加了次要属性,如变量的标量/向量/矩阵分类,以及常量和运算符的局部/全局/学科特定标签。我们的广泛实验表明,最先进的 LLMs 在上下文学习中平均达到 20-60% 的准确率,经过微调后达到 50-60% 的准确率,显示出它们在数学推理能力方面存在显著差距。STEM-PoM 为开发能够稳健处理数学符号的高级数学 AI 模型提供了未来的研究动力。
[NLP-25] GRS-QA – Graph Reasoning-Structured Question Answering Dataset
【速读】: 该论文试图解决大语言模型(LLMs)在多跳问答(M-QA)任务中,由于缺乏细粒度的推理结构数据集,导致其推理能力的影响因素不明确的问题。解决方案的关键在于引入了图推理结构问答数据集(Graph Reasoning-Structured Question Answering Dataset, GRS-QA),该数据集不仅包含语义上下文,还明确捕捉了复杂的推理路径,通过构建推理图(reasoning graphs)来表示文本上下文和逻辑流。这种细粒度的推理结构使得能够对LLMs在不同推理结构下的推理能力进行精细化评估,从而揭示了LLMs在处理不同推理结构问题时的性能差异。
链接: https://arxiv.org/abs/2411.00369
作者: Anish Pahilajani,Devasha Trivedi,Jincen Shuai,Khin S. Yone,Samyak Rajesh Jain,Namyong Park,Ryan A. Rossi,Nesreen K. Ahmed,Franck Dernoncourt,Yu Wang
关键词-EN: Large Language Models, Large Language, Language Models, advanced reasoning abilities, reasoning structures
类目: Computation and Language (cs.CL)
备注: 15 pages, 24 figures, 10 tables
点击查看摘要
Abstract:Large Language Models (LLMs) have excelled in multi-hop question-answering (M-QA) due to their advanced reasoning abilities. However, the impact of the inherent reasoning structures on LLM M-QA performance remains unclear, largely due to the absence of QA datasets that provide fine-grained reasoning structures. To address this gap, we introduce the Graph Reasoning-Structured Question Answering Dataset (GRS-QA), which includes both semantic contexts and reasoning structures for QA pairs. Unlike existing M-QA datasets, where different reasoning structures are entangled together, GRS-QA explicitly captures intricate reasoning pathways by constructing reasoning graphs, where nodes represent textual contexts and edges denote logical flows. These reasoning graphs of different structures enable a fine-grained evaluation of LLM reasoning capabilities across various reasoning structures. Our empirical analysis reveals that LLMs perform differently when handling questions with varying reasoning structures. This finding facilitates the exploration of textual structures as compared with semantics.
摘要:大语言模型 (LLMs) 因其先进的推理能力在多跳问答 (M-QA) 中表现出色。然而,推理结构对 LLM M-QA 性能的影响尚不明确,这主要是因为缺乏提供细粒度推理结构的问答数据集。为填补这一空白,我们引入了图推理结构化问答数据集 (GRS-QA),该数据集不仅包含语义上下文,还提供了问答对的推理结构。与现有的 M-QA 数据集不同,这些数据集中的不同推理结构相互交织,GRS-QA 通过构建推理图明确捕捉复杂的推理路径,其中节点代表文本上下文,边表示逻辑流。这些不同结构的推理图使得能够对 LLM 在各种推理结构中的推理能力进行细粒度评估。我们的实证分析表明,LLM 在处理具有不同推理结构的问题时表现各异。这一发现有助于探索文本结构与语义之间的对比关系。
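GRS-QA 中的"推理图"可以简单地用节点(文本上下文)和有向边(逻辑流)表示。下面用一个极简示意(非论文官方实现,节点与边均为假设示例)构造一个两跳问答的链式推理图,并用 Kahn 拓扑排序得到推理路径:

```python
# 示意性代码:推理图 = 节点(上下文)+ 有向边(逻辑流),
# 拓扑排序给出逐步推理的执行顺序。

def topological_order(edges, nodes):
    """Kahn 算法拓扑排序,返回推理图的执行顺序。"""
    indeg = {n: 0 for n in nodes}
    for src, dst in edges:
        indeg[dst] += 1
    queue = [n for n in nodes if indeg[n] == 0]
    order = []
    while queue:
        n = queue.pop(0)
        order.append(n)
        for src, dst in edges:
            if src == n:
                indeg[dst] -= 1
                if indeg[dst] == 0:
                    queue.append(dst)
    return order

# 假设的两跳问题:"某书作者出生在哪个城市?"
nodes = ["ctx_作者", "ctx_出生地", "answer"]
edges = [("ctx_作者", "ctx_出生地"), ("ctx_出生地", "answer")]
path = topological_order(edges, nodes)
```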
[NLP-26] Learning to Rank Salient Content for Query-focused Summarization EMNLP2024
【速读】: 该论文试图解决的问题是如何通过内容优先级排序来提高查询聚焦摘要(Query-focused Summarization, QFS)的相关性。解决方案的关键在于将学习排序(Learning-to-Rank, LTR)与QFS相结合,并在摘要解码器中使用共享的二级解码器在段落级别执行LTR任务。这种方法在QMSum基准测试中显著提升了Rouge-L和BertScore指标,同时在SQuALITY基准测试中在Rouge-L指标上表现优异,表明模型在生成连贯且相关性强的摘要方面具有优势。此外,该方法在训练开销上较低,且在人类评估中显示出较高的相关性和忠实度,同时保持了流畅性。
链接: https://arxiv.org/abs/2411.00324
作者: Sajad Sotudeh,Nazli Goharian
关键词-EN: Query-focused Summarization, content prioritization, study examines, enhance the summary, QMSum benchmark
类目: Computation and Language (cs.CL)
备注: Long paper accepted at EMNLP 2024 (Main)
点击查看摘要
Abstract:This study examines the potential of integrating Learning-to-Rank (LTR) with Query-focused Summarization (QFS) to enhance the summary relevance via content prioritization. Using a shared secondary decoder with the summarization decoder, we carry out the LTR task at the segment level. Compared to the state-of-the-art, our model outperforms on QMSum benchmark (all metrics) and matches on SQuALITY benchmark (2 metrics) as measured by Rouge and BertScore while offering a lower training overhead. Specifically, on the QMSum benchmark, our proposed system achieves improvements, particularly in Rouge-L (+0.42) and BertScore (+0.34), indicating enhanced understanding and relevance. While facing minor challenges in Rouge-1 and Rouge-2 scores on the SQuALITY benchmark, the model significantly excels in Rouge-L (+1.47), underscoring its capability to generate coherent summaries. Human evaluations emphasize the efficacy of our method in terms of relevance and faithfulness of the generated summaries, without sacrificing fluency. A deeper analysis reveals our model’s superiority over the state-of-the-art for broad queries, as opposed to specific ones, from a qualitative standpoint. We further present an error analysis of our model, pinpointing challenges faced and suggesting potential directions for future research in this field.
摘要:本研究探讨了将学习排序 (Learning-to-Rank, LTR) 与查询聚焦摘要 (Query-focused Summarization, QFS) 相结合,通过内容优先级排序来提升摘要相关性的潜力。我们利用与摘要解码器共享的次级解码器,在段落级别执行 LTR 任务。与当前最先进的技术相比,我们的模型在 QMSum 基准测试(所有指标)上表现更优,并在 SQuALITY 基准测试(两个指标)上与之持平,这些均通过 Rouge 和 BertScore 进行衡量,同时具有更低的训练开销。具体而言,在 QMSum 基准测试中,我们提出的系统在 Rouge-L (+0.42) 和 BertScore (+0.34) 方面取得了显著提升,表明模型的理解能力和相关性均有所增强。尽管在 SQuALITY 基准测试的 Rouge-1 和 Rouge-2 分数上遇到轻微挑战,但模型在 Rouge-L (+1.47) 上表现出色,突显了其生成连贯摘要的能力。人工评估强调了我们的方法在生成摘要的相关性和忠实度方面的有效性,同时不影响流畅性。深入分析显示,从定性角度来看,我们的模型相对当前最先进技术的优势主要体现在广泛查询而非特定查询上。此外,我们还对模型进行了错误分析,指出了面临的挑战,并提出了未来在该领域研究的可能方向。
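"段落级内容优先级排序"的最小形式是:给每个段落按与查询的相关性打分并排序,高分段落优先进入摘要。下面是一个极简示意(非论文官方实现,打分用简单的查询词覆盖率代替真实的共享解码器,语料为假设示例):

```python
# 示意性代码:段落级 LTR 的占位版本——按查询词覆盖率排序段落。

def relevance_score(query, segment):
    """查询词在段落中的覆盖率(玩具打分函数)。"""
    q_words = set(query.lower().split())
    s_words = set(segment.lower().split())
    return len(q_words & s_words) / len(q_words)

def rank_segments(query, segments):
    """按相关性从高到低排序,实现内容优先级排序。"""
    return sorted(segments, key=lambda s: relevance_score(query, s), reverse=True)

query = "budget decisions in the meeting"
segments = ["the meeting covered budget decisions for next year",
            "attendees introduced themselves",
            "lunch was served at noon"]
ranked = rank_segments(query, segments)
```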
[NLP-27] Rationale-Guided Retrieval Augmented Generation for Medical Question Answering
【速读】: 该论文试图解决大型语言模型(LLM)在生物医学应用中存在的幻觉和知识过时问题,以及检索增强生成(RAG)方法在处理这些问题时面临的挑战,如易受无关或错误上下文影响、医学查询信息针对性不足以及检索器对特定训练语料库的偏见。解决方案的关键在于提出了一种名为RAG²(RAtionale-Guided RAG)的新框架,该框架通过三个创新点来增强RAG在生物医学领域的可靠性:(1) 使用基于困惑度标签训练的小型过滤模型,选择性地增强信息片段并过滤干扰项;(2) 利用LLM生成的理由作为查询,以提高检索片段的实用性;(3) 设计了一种结构,从四个全面的生物医学语料库中均匀检索片段,有效缓解检索器的偏见。实验结果表明,RAG²显著提升了不同规模LLM的性能,最高提升达6.1%,并在三个医学问答基准测试中超越了之前最佳的医学RAG模型,最高提升达5.6%。
链接: https://arxiv.org/abs/2411.00300
作者: Jiwoong Sohn,Yein Park,Chanwoong Yoon,Sihyeon Park,Hyeon Hwang,Mujeen Sung,Hyunjae Kim,Jaewoo Kang
关键词-EN: hold significant potential, Large language models, Large language, hold significant, applications in biomedicine
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large language models (LLM) hold significant potential for applications in biomedicine, but they struggle with hallucinations and outdated knowledge. While retrieval-augmented generation (RAG) is generally employed to address these issues, it also has its own set of challenges: (1) LLMs are vulnerable to irrelevant or incorrect context, (2) medical queries are often not well-targeted for helpful information, and (3) retrievers are prone to bias toward the specific source corpus they were trained on. In this study, we present RAG^2 (RAtionale-Guided RAG), a new framework for enhancing the reliability of RAG in biomedical contexts. RAG^2 incorporates three key innovations: (1) a small filtering model trained on perplexity-based labels of rationales, which selectively augments informative snippets of documents while filtering out distractors; (2) LLM-generated rationales as queries to improve the utility of retrieved snippets; (3) a structure designed to retrieve snippets evenly from a comprehensive set of four biomedical corpora, effectively mitigating retriever bias. Our experiments demonstrate that RAG^2 improves the state-of-the-art LLMs of varying sizes, with improvements of up to 6.1%, and it outperforms the previous best medical RAG model by up to 5.6% across three medical question-answering benchmarks. Our code is available at this https URL.
摘要:大语言模型 (LLM) 在生物医学领域的应用具有巨大潜力,但它们在处理幻觉和过时知识方面存在困难。虽然检索增强生成 (RAG) 通常用于解决这些问题,但它也面临一系列挑战:(1) LLM 容易受到无关或错误上下文的影响;(2) 医学查询往往不够精准,难以获取有用信息;(3) 检索器容易偏向于其训练所用的特定语料库。在本研究中,我们提出了 RAG^2 (RAtionale-Guided RAG),这是一种在生物医学背景下增强 RAG 可靠性的新框架。RAG^2 集成了三项关键创新:一个基于困惑度标签训练的小型过滤模型,用于选择性地增强文档中有信息量的片段并过滤掉干扰项;由 LLM 生成的理由作为查询,以提高检索片段的实用性;一种结构设计,旨在从全面的四个生物医学语料库中均匀检索片段,有效缓解检索器的偏差。我们的实验表明,RAG^2 提升了不同规模的最先进 LLM 的性能,提升幅度高达 6.1%,并且在三个医学问答基准测试中,其表现优于之前最佳的医学 RAG 模型,提升幅度高达 5.6%。我们的代码可在以下链接获取:https URL。
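RAG^2 的过滤模型基于困惑度标签训练。下面只演示困惑度本身的计算,以及按阈值过滤检索片段的思路(非论文官方实现;token 对数概率为假设数据,实际应由语言模型给出):

```python
# 示意性代码:由 token 对数概率计算困惑度,并按阈值过滤片段。
import math

def perplexity(token_logprobs):
    """PPL = exp(-平均对数似然)。越低说明片段对模型越"顺"。"""
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

def filter_snippets(snippets, max_ppl):
    """保留困惑度不超过阈值的片段,过滤可能的干扰项。"""
    return [s for s, lps in snippets if perplexity(lps) <= max_ppl]

# 假设的 (片段, 逐 token 对数概率) 对
snippets = [
    ("相关的医学片段", [-0.1, -0.2, -0.1]),   # PPL ≈ 1.14
    ("无关的干扰片段", [-3.0, -2.5, -3.5]),   # PPL ≈ 20.1
]
kept = filter_snippets(snippets, max_ppl=5.0)
```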
[NLP-28] LLM-Ref: Enhancing Reference Handling in Technical Writing with Large Language Models
【速读】: 该论文试图解决大型语言模型(LLMs)在特定领域任务中准确性不足的问题,特别是通过生成式 AI 系统(RAG)进行数据合成时,检索和生成阶段的优化对输出质量的影响。解决方案的关键在于提出了 LLM-Ref 写作助手工具,该工具通过直接从文本段落中检索和生成内容,避免了传统 RAG 系统中使用的分块和索引方法,从而实现了直接从生成输出中提取参考文献的功能。此外,LLM-Ref 采用迭代响应生成策略,有效管理长上下文在语言模型限制内的处理,显著提升了 Ragas 评分,表明其在提高写作助手工具的准确性和上下文相关性方面的有效性。
链接: https://arxiv.org/abs/2411.00294
作者: Kazi Ahmed Asif Fuad,Lizhong Chen
关键词-EN: leveraging user-provided data, Large Language Models, Large Language, domain-specific tasks, user-provided data
类目: Computation and Language (cs.CL)
备注: 20 pages, 7 figures, submitted to ARR October 2024
点击查看摘要
Abstract:Large Language Models (LLMs) excel in data synthesis but can be inaccurate in domain-specific tasks, which retrieval-augmented generation (RAG) systems address by leveraging user-provided data. However, RAGs require optimization in both retrieval and generation stages, which can affect output quality. In this paper, we present LLM-Ref, a writing assistant tool that aids researchers in writing articles from multiple source documents with enhanced reference synthesis and handling capabilities. Unlike traditional RAG systems that use chunking and indexing, our tool retrieves and generates content directly from text paragraphs. This method facilitates direct reference extraction from the generated outputs, a feature unique to our tool. Additionally, our tool employs iterative response generation, effectively managing lengthy contexts within the language model's constraints. Compared to baseline RAG-based systems, our approach achieves a 3.25× to 6.26× increase in Ragas score, a comprehensive metric that provides a holistic view of a RAG system's ability to produce accurate, relevant, and contextually appropriate responses. This improvement shows our method enhances the accuracy and contextual relevance of writing assistance tools.
摘要:大语言模型(Large Language Models, LLMs)在数据合成方面表现出色,但在特定领域任务中可能存在不准确性,而检索增强生成(Retrieval-Augmented Generation, RAG)系统通过利用用户提供的数据解决了这一问题。然而,RAG系统在检索和生成阶段都需要优化,这会影响输出质量。本文介绍了一种名为LLM-Ref的写作助手工具,该工具帮助研究人员从多个源文档中撰写文章,具备增强的引用合成和处理能力。与传统的使用分块和索引的RAG系统不同,我们的工具直接从文本段落中检索和生成内容。这种方法便于直接从生成输出中提取引用,这是我们工具独有的功能。此外,我们的工具采用迭代响应生成,有效管理语言模型约束内的长上下文。与基于基线RAG的系统相比,我们的方法在Ragas评分上实现了3.25倍至6.26倍的提升,Ragas评分是一种综合指标,全面评估RAG系统生成准确、相关且上下文适切响应的能力。这一改进表明,我们的方法提高了写作助手工具的准确性和上下文相关性。
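"迭代响应生成"的基本做法是把超出上下文窗口的段落分批送入模型,每轮在上一轮草稿的基础上续写。下面是一个极简示意(非论文官方实现;generate 为占位函数,真实实现应调用 LLM,"窗口"也仅按段落数计):

```python
# 示意性代码:分批迭代生成,绕开上下文长度限制。

def generate(draft, paragraphs):
    """占位:真实实现应调用 LLM;这里仅把段落数量记入草稿。"""
    return draft + ["综合了%d段来源" % len(paragraphs)]

def iterative_generate(paragraphs, window=2):
    """每次最多处理 window 个段落,迭代更新草稿。"""
    draft = []
    for i in range(0, len(paragraphs), window):
        draft = generate(draft, paragraphs[i:i + window])
    return draft

sources = ["P1", "P2", "P3", "P4", "P5"]
article_draft = iterative_generate(sources)
```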
[NLP-29] A Demonstration of Adaptive Collaboration of Large Language Models for Medical Decision-Making
【速读】: 该论文试图解决医疗决策过程中复杂的多模态数据处理和协作问题。解决方案的关键在于引入了一个名为MDAgents的框架,该框架能够根据任务复杂性动态分配协作结构给大型语言模型 (Large Language Models, LLMs),从而模拟现实世界中的临床协作和决策过程。这种方法不仅提高了诊断的准确性,还支持在复杂医疗场景中的适应性响应,同时相比静态的多代理决策方法在计算成本上更为高效。
链接: https://arxiv.org/abs/2411.00248
作者: Yubin Kim,Chanwoo Park,Hyewon Jeong,Cristina Grau-Vilchez,Yik Siu Chan,Xuhai Xu,Daniel McDuff,Hyeonhoon Lee,Marzyeh Ghassemi,Cynthia Breazeal,Hae Won Park
关键词-EN: patient data, Large Language Models, multi-modal patient data, MDM, Language Models
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Medical Decision-Making (MDM) is a multi-faceted process that requires clinicians to assess complex multi-modal patient data, often collaboratively. Large Language Models (LLMs) promise to streamline this process by synthesizing vast medical knowledge and multi-modal health data. However, single-agent approaches are often ill-suited for nuanced medical contexts requiring adaptable, collaborative problem-solving. Our MDAgents framework addresses this need by dynamically assigning collaboration structures to LLMs based on task complexity, mimicking real-world clinical collaboration and decision-making. This framework improves diagnostic accuracy and supports adaptive responses in complex, real-world medical scenarios, making it a valuable tool for clinicians in various healthcare settings, while at the same time being more efficient in terms of computing cost than static multi-agent decision-making methods.
摘要:医疗决策制定(Medical Decision-Making, MDM)是一个多方面的过程,需要临床医生评估复杂的多模态患者数据,通常需要协作进行。大语言模型(Large Language Models, LLMs)有望通过整合庞大的医学知识和多模态健康数据来简化这一过程。然而,单一智能体往往不适合需要适应性和协作性问题解决的复杂医疗情境。我们的 MDAgents 通过根据任务复杂性动态分配协作结构给 LLMs,模拟现实世界的临床协作和决策制定,从而解决了这一需求。该框架提高了诊断的准确性,并在复杂的现实世界医疗场景中支持适应性响应,使其成为各种医疗环境中临床医生的宝贵工具,同时在计算成本方面比静态多智能体决策制定方法更为高效。
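"按任务复杂度动态分配协作结构"的路由思想可以用如下草图说明。其中阈值 0.3/0.7 与三种结构名称均为本文演示所作的假设,并非 MDAgents 的真实配置。

```python
def assign_structure(complexity):
    """按任务复杂度动态分配协作结构的示意(阈值与结构名称均为假设):
    低复杂度走单智能体,中等复杂度组建小组讨论,高复杂度采用多团队层级协作。"""
    if complexity < 0.3:
        return "solo"        # 简单病例:单个 LLM 智能体直接作答
    if complexity < 0.7:
        return "group"       # 中等病例:小组多智能体讨论
    return "multi-team"      # 疑难病例:多团队层级协作
```

相比始终启用全部智能体的静态方案,这种路由让简单病例只消耗单次推理的成本,这正是摘要中"计算成本更低"的直觉来源。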
[NLP-30] Compositional Automata Embeddings for Goal-Conditioned Reinforcement Learning
【速读】: 该论文试图解决目标条件强化学习中目标表示的局限性问题,特别是当前流行的目标表示(如目标状态或自然语言)要么局限于马尔可夫任务,要么依赖于模糊的任务语义。论文提出的解决方案是使用确定性有限自动机(Deterministic Finite Automata, DFA)的组合(cDFAs)来表示时间目标,并通过cDFAs指导强化学习(Reinforcement Learning, RL)代理的行为。关键在于,cDFAs既提供了正式的时间语义,又易于解释,同时形成了一个可数无限的概念类,具有布尔语义。为了解决cDFAs细微变化可能导致任务差异较大的问题,论文提出了一种预训练方法,即在“到达-避免派生”的DFA上预训练图神经网络嵌入(Graph Neural Network Embeddings)。这种方法通过实验验证,能够在各种cDFA任务类中实现零样本泛化(zero-shot generalization),并加速策略专业化(policy specialization),同时避免了分层方法的短视次优性(myopic suboptimality)。
链接: https://arxiv.org/abs/2411.00205
作者: Beyazit Yalcinkaya,Niklas Lauffer,Marcell Vazquez-Chanlatte,Sanjit A. Seshia
关键词-EN: Goal-conditioned reinforcement learning, Goal-conditioned reinforcement, reinforcement learning, Goal-conditioned, agent behavior
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL)
备注:
点击查看摘要
Abstract:Goal-conditioned reinforcement learning is a powerful way to control an AI agent’s behavior at runtime. That said, popular goal representations, e.g., target states or natural language, are either limited to Markovian tasks or rely on ambiguous task semantics. We propose representing temporal goals using compositions of deterministic finite automata (cDFAs) and use cDFAs to guide RL agents. cDFAs balance the need for formal temporal semantics with ease of interpretation: if one can understand a flow chart, one can understand a cDFA. On the other hand, cDFAs form a countably infinite concept class with Boolean semantics, and subtle changes to the automaton can result in very different tasks, making them difficult to condition agent behavior on. To address this, we observe that all paths through a DFA correspond to a series of reach-avoid tasks and propose pre-training graph neural network embeddings on “reach-avoid derived” DFAs. Through empirical evaluation, we demonstrate that the proposed pre-training method enables zero-shot generalization to various cDFA task classes and accelerated policy specialization without the myopic suboptimality of hierarchical methods.
摘要:目标条件强化学习是一种在运行时控制 AI 智能体行为的强大方法。然而,流行的目标表示方法,如目标状态或自然语言,要么局限于马尔可夫任务,要么依赖于模糊的任务语义。我们提出使用确定性有限自动机(Deterministic Finite Automata, DFA)的组合(cDFAs)来表示时间目标,并利用 cDFAs 指导强化学习智能体。cDFAs 在形式化时间语义的需求与易于解释之间取得了平衡:如果一个人能理解流程图,那么他就能理解 cDFA。另一方面,cDFAs 构成了一个具有布尔语义的可数无限概念类,自动机的细微变化可能导致截然不同的任务,这使得基于 cDFAs 条件化智能体行为变得困难。为了解决这一问题,我们观察到 DFA 中的所有路径都对应于一系列的到达-避免任务,并提出在“到达-避免衍生”的 DFA 上预训练图神经网络嵌入。通过实证评估,我们证明了所提出的预训练方法能够实现对各种 cDFA 任务类的零样本泛化,并加速策略专业化,同时避免了分层方法的短视次优性。
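cDFA 的合取语义("所有成员 DFA 同时接受")以及"到达/避免"两类基本任务,可以用一个极简的非神经草图演示。以下实现仅说明概念,与论文中的图神经网络嵌入无关;转移表缺省自环的约定也是本文为简化所作的假设。

```python
from dataclasses import dataclass

@dataclass
class DFA:
    """确定性有限自动机:转移表 + 起始状态 + 接受状态集合。
    未列出的 (状态, 符号) 组合缺省自环(此约定为演示假设)。"""
    transitions: dict   # (state, symbol) -> state
    start: int
    accepting: set

    def run(self, word):
        state = self.start
        for sym in word:
            state = self.transitions.get((state, sym), state)
        return state in self.accepting

def cdfa_accepts(dfas, word):
    """cDFA 的合取语义:所有成员 DFA 同时接受,组合目标才算满足。"""
    return all(d.run(word) for d in dfas)

# 两个基本的"到达-避免"目标:目标 1 = 到达 a(reach),目标 2 = 始终避免 b(avoid)
reach_a = DFA(transitions={(0, "a"): 1}, start=0, accepting={1})
avoid_b = DFA(transitions={(0, "b"): 1}, start=0, accepting={0})
```

例如轨迹 `["a"]` 同时满足两个目标,而 `["b", "a"]` 虽然最终到达了 a,却中途触碰了 b,因而组合目标失败——这正体现了"自动机的细微差异导致截然不同任务"的现象。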
[NLP-31] RESTOR: Knowledge Recovery through Machine Unlearning
【速读】: 该论文试图解决大型语言模型在训练过程中记忆了不希望的数据点(如错误事实、版权内容或敏感数据)的问题,并提出了一个名为RESTOR的框架来评估机器遗忘算法的效果。解决方案的关键在于RESTOR框架的三个维度:(1) 任务设置聚焦于现实世界的事实知识;(2) 多种数据污染场景模拟了需要被遗忘的不同类型的数据点;(3) 评估指标不仅关注遗忘不希望的知识,还强调恢复模型在遇到这些数据点之前的原始状态,即恢复性遗忘(restorative unlearning)。RESTOR框架揭示了现有遗忘算法的一些新见解,例如某些算法仅强调遗忘知识,而定位遗忘目标可以增强遗忘性能。
链接: https://arxiv.org/abs/2411.00204
作者: Keivan Rezaei,Khyathi Chandu,Soheil Feizi,Yejin Choi,Faeze Brahman,Abhilasha Ravichander
关键词-EN: Large language models, Large language, language models trained, memorize undesirable datapoints, incorrect facts
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large language models trained on web-scale corpora can memorize undesirable datapoints such as incorrect facts, copyrighted content or sensitive data. Recently, many machine unlearning methods have been proposed that aim to ‘erase’ these datapoints from trained models – that is, revert model behavior to be similar to a model that had never been trained on these datapoints. However, evaluating the success of unlearning algorithms remains challenging. In this work, we propose the RESTOR framework for machine unlearning based on the following dimensions: (1) a task setting that focuses on real-world factual knowledge, (2) a variety of corruption scenarios that emulate different kinds of datapoints that might need to be unlearned, and (3) evaluation metrics that emphasize not just forgetting undesirable knowledge, but also recovering the model’s original state before encountering these datapoints, or restorative unlearning. RESTOR helps uncover several novel insights about popular unlearning algorithms, and the mechanisms through which they operate – for instance, identifying that some algorithms merely emphasize forgetting the knowledge to be unlearned, and that localizing unlearning targets can enhance unlearning performance. Code/data is available at this http URL.
摘要:在大规模语料库上训练的大语言模型可能会记忆一些不希望的数据点,如错误的事实、受版权保护的内容或敏感数据。最近,许多机器遗忘方法被提出,旨在从训练好的模型中“擦除”这些数据点——即,使模型行为恢复到从未接触过这些数据点的状态。然而,评估遗忘算法的成功与否仍然具有挑战性。在本研究中,我们提出了基于以下维度的机器遗忘框架 RESTOR:(1)专注于现实世界事实知识的任务设置,(2)模拟不同类型可能需要遗忘的数据点的多种破坏场景,以及(3)强调不仅遗忘不希望的知识,还要恢复模型在接触这些数据点之前的原始状态,即恢复性遗忘的评估指标。RESTOR 揭示了关于流行遗忘算法的几个新见解,以及它们运作的机制——例如,识别出某些算法仅强调遗忘要遗忘的知识,而定位遗忘目标可以增强遗忘性能。代码/数据可在以下网址获取:http URL。
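"恢复性遗忘"强调的不只是忘掉污染知识,还要恢复模型接触污染数据前的原始状态。下面给出一种朴素的量化草图:遗忘后模型在干净知识上的准确率,相对于污染造成的损失恢复了多少比例。该公式是本文演示所作的假设,并非 RESTOR 的官方指标定义。

```python
def restoration_score(acc_original, acc_corrupted, acc_unlearned):
    """恢复性遗忘程度的一种朴素度量(公式为演示假设,非 RESTOR 官方指标):
    acc_original  - 原始模型在干净事实上的准确率
    acc_corrupted - 污染后模型的准确率
    acc_unlearned - 执行遗忘算法后的准确率
    返回值裁剪到 [0, 1]:1 表示完全恢复原始状态,0 表示毫无恢复。"""
    damage = acc_original - acc_corrupted
    if damage <= 0:
        return 1.0  # 污染未造成可测损害,视为已完全恢复
    recovered = acc_unlearned - acc_corrupted
    return max(0.0, min(1.0, recovered / damage))
```

这样的指标能区分"只会遗忘"与"既遗忘又恢复"的算法:前者可能把污染知识连同周边正确知识一并抹掉,得分反而很低。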
[NLP-32] Beyond Label Attention: Transparency in Language Models for Automated Medical Coding via Dictionary Learning
【速读】: 该论文试图解决医疗编码过程中模型解释性不足的问题,特别是在标签注意力机制下,模型可能会高亮与ICD编码无关的冗余标记。解决方案的关键在于利用字典学习(dictionary learning),从密集的语言模型嵌入中高效提取稀疏激活的表示,构建一个可解释的字典。这种方法不仅超越了传统的标记级表示,还能为每个ICD编码预测提供基于机制的解释,即使高亮的标记在医学上无关,也能揭示其隐藏的含义,从而提高模型的可解释性和人类可理解性。
链接: https://arxiv.org/abs/2411.00173
作者: John Wu,David Wu,Jimeng Sun
关键词-EN: time-consuming healthcare practice, unstructured clinical text, healthcare practice, standardized medical codes, translation of unstructured
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Medical coding, the translation of unstructured clinical text into standardized medical codes, is a crucial but time-consuming healthcare practice. Though large language models (LLM) could automate the coding process and improve the efficiency of such tasks, interpretability remains paramount for maintaining patient trust. Current efforts in interpretability of medical coding applications rely heavily on label attention mechanisms, which often leads to the highlighting of extraneous tokens irrelevant to the ICD code. To facilitate accurate interpretability in medical language models, this paper leverages dictionary learning that can efficiently extract sparsely activated representations from dense language model embeddings in superposition. Compared with common label attention mechanisms, our model goes beyond token-level representations by building an interpretable dictionary which enhances the mechanistic-based explanations for each ICD code prediction, even when the highlighted tokens are medically irrelevant. We show that dictionary features can steer model behavior, elucidate the hidden meanings of upwards of 90% of medically irrelevant tokens, and are human interpretable.
摘要:医学编码,即将非结构化的临床文本转化为标准化的医学代码,是医疗实践中至关重要但耗时的工作。尽管大语言模型 (Large Language Model, LLM) 能够自动化编码过程并提高此类任务的效率,但可解释性对于维持患者信任仍然至关重要。当前在医学编码应用中的可解释性研究主要依赖于标签注意力机制,这往往导致突出显示与 ICD 代码无关的冗余 Token。为了促进医学语言模型的准确可解释性,本文利用字典学习方法,能够从以叠加态 (superposition) 存储的密集语言模型嵌入中高效提取稀疏激活的表示。与常见的标签注意力机制相比,我们的模型通过构建可解释的字典,超越了 Token 级别的表示,增强了基于机制的 ICD 代码预测解释,即使在突出显示的 Token 在医学上无关紧要的情况下。我们展示了字典特征能够引导模型行为,阐明超过 90% 的医学无关 Token 的隐藏含义,并且具有人类可解释性。
[NLP-33] Scaling Up Membership Inference: When and How Attacks Succeed on Large Language Models
【速读】: 该论文试图解决的问题是验证当前的成员推断攻击 (Membership Inference Attacks, MIA) 方法是否仍然适用于大型语言模型 (Large Language Models, LLM)。解决方案的关键在于提出了一种新的基准测试方法,该方法通过连续扩展数据样本的规模,从句子 (n-grams) 到文档集合 (多个token块),来评估MIA的效果。论文还引入了一种基于数据集推断 (Dataset Inference, DI) 的二元成员检测方法,该方法通过聚合段落级别的MIA特征,使得MIA能够在文档和文档集合级别上成功实施,从而首次在预训练和微调的LLM上实现了成功的MIA。
链接: https://arxiv.org/abs/2411.00154
作者: Haritz Puerto,Martin Gubri,Sangdoo Yun,Seong Joon Oh
关键词-EN: MIA, attempt to verify, Membership inference attacks, inference attacks, training set
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Our code is available at this https URL
点击查看摘要
Abstract:Membership inference attacks (MIA) attempt to verify the membership of a given data sample in the training set for a model. MIA has become relevant in recent years, following the rapid development of large language models (LLM). Many are concerned about the usage of copyrighted materials for training them and call for methods for detecting such usage. However, recent research has largely concluded that current MIA methods do not work on LLMs. Even when they seem to work, it is usually because of the ill-designed experimental setup where other shortcut features enable “cheating.” In this work, we argue that MIA still works on LLMs, but only when multiple documents are presented for testing. We construct new benchmarks that measure the MIA performances at a continuous scale of data samples, from sentences (n-grams) to a collection of documents (multiple chunks of tokens). To validate the efficacy of current MIA approaches at greater scales, we adapt a recent work on Dataset Inference (DI) for the task of binary membership detection that aggregates paragraph-level MIA features to enable MIA at document and collection of documents level. This baseline achieves the first successful MIA on pre-trained and fine-tuned LLMs.
摘要:成员推断攻击 (Membership Inference Attack, MIA) 旨在验证给定数据样本是否存在于模型的训练集中。近年来,随着大语言模型 (Large Language Model, LLM) 的快速发展,MIA 变得愈发重要。许多人对使用受版权保护的材料进行训练表示担忧,并呼吁开发检测此类使用的方法。然而,最近的研究大多得出结论,当前的 MIA 方法对 LLM 无效。即使这些方法看似有效,通常也是由于实验设置不当,其他捷径特征使得“作弊”成为可能。在本研究中,我们认为 MIA 对 LLM 仍然有效,但仅限于在测试时提供多个文档的情况下。我们构建了新的基准,用于衡量在数据样本的连续尺度上(从句子(n-gram)到文档集合(多个 Token 块))的 MIA 性能。为了验证当前 MIA 方法在大规模数据上的有效性,我们借鉴了最近关于数据集推断 (Dataset Inference, DI) 的研究,将其应用于二元成员检测任务,通过聚合段落级别的 MIA 特征,实现在文档和文档集合级别的 MIA。这一基线方法首次成功地在预训练和微调的 LLM 上实现了 MIA。
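"把段落级 MIA 特征聚合到文档级"的思路可以用如下草图演示:采用常见的基于损失的假设(训练成员上的损失更低),对文档内各段落得分取平均,再与用已知非成员文档校准出的阈值比较。得分定义与校准方式均为本文演示所作的简化假设,并非论文采用的 Dataset Inference 具体实现。

```python
def paragraph_score(loss):
    """段落级 MIA 特征(常见假设:训练成员上的损失更低,故取负损失作得分)。"""
    return -loss

def doc_score(paragraph_losses):
    """文档级得分 = 各段落得分的平均,体现"多段落聚合"的核心思想。"""
    return sum(paragraph_score(l) for l in paragraph_losses) / len(paragraph_losses)

def calibrate_threshold(nonmember_docs):
    """用已知非成员文档的平均得分作为判定阈值(校准方式为示意性假设)。"""
    return sum(doc_score(d) for d in nonmember_docs) / len(nonmember_docs)

def document_membership(paragraph_losses, threshold):
    """平均得分高于阈值即判为"训练成员"。"""
    return doc_score(paragraph_losses) > threshold

# 用两份非成员文档(段落损失普遍较高)校准阈值
threshold = calibrate_threshold([[3.0, 3.2], [2.8, 3.0]])
```

单个句子的得分噪声很大,但聚合多个段落后信号被放大——这正对应论文"只有在提供多个文档时 MIA 才有效"的论点。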
[NLP-34] Schema Augmentation for Zero-Shot Domain Adaptation in Dialogue State Tracking
【速读】: 该论文试图解决任务导向对话系统中对话状态跟踪(DST)的零样本领域适应问题,即模型在训练时未见过的目标领域上的泛化能力。解决方案的关键在于提出了一种名为“Schema Augmentation”的数据增强方法,通过在微调过程中引入槽位名称的变体来增强语言模型的零样本领域适应能力。实验结果表明,该方法在MultiWOZ和SpokenWOZ数据集上显著提升了模型在未见领域上的准确性,甚至在某些实验中实现了超过两倍的准确性提升,同时保持了在所有领域上的同等或更优表现。
链接: https://arxiv.org/abs/2411.00150
作者: Christopher Richardson,Roshan Sharma,Neeraj Gaur,Parisa Haghani,Anirudh Sundar,Bhuvana Ramabhadran
关键词-EN: dialogue state tracking, Zero-shot domain adaptation, Zero-shot domain, state tracking, remains a challenging
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Zero-shot domain adaptation for dialogue state tracking (DST) remains a challenging problem in task-oriented dialogue (TOD) systems, where models must generalize to target domains unseen at training time. Current large language model approaches for zero-shot domain adaptation rely on prompting to introduce knowledge pertaining to the target domains. However, their efficacy strongly depends on prompt engineering, as well as the zero-shot ability of the underlying language model. In this work, we devise a novel data augmentation approach, Schema Augmentation, that improves the zero-shot domain adaptation of language models through fine-tuning. Schema Augmentation is a simple but effective technique that enhances generalization by introducing variations of slot names within the schema provided in the prompt. Experiments on MultiWOZ and SpokenWOZ showed that the proposed approach resulted in a substantial improvement over the baseline, in some experiments achieving over a twofold accuracy gain over unseen domains while maintaining equal or superior performance over all domains.
摘要:零样本领域适应(Zero-shot domain adaptation)在面向任务的对话系统(Task-oriented Dialogue, TOD)中的对话状态跟踪(Dialogue State Tracking, DST)仍然是一个具有挑战性的问题。在这种情况下,模型必须在训练时未见过的目标领域上进行泛化。当前的大语言模型(Large Language Model, LLM)在零样本领域适应方面的方法主要依赖于提示(prompting)来引入与目标领域相关的知识。然而,这些方法的有效性在很大程度上取决于提示工程(prompt engineering)以及底层语言模型的零样本能力。在本研究中,我们设计了一种新颖的数据增强方法——Schema Augmentation,通过微调(fine-tuning)来提升语言模型的零样本领域适应能力。Schema Augmentation 是一种简单但有效的技术,通过在提示中提供的模式(schema)中引入槽位名称(slot names)的变化来增强泛化能力。在 MultiWOZ 和 SpokenWOZ 上的实验表明,所提出的方法相较于基线(baseline)取得了显著的改进,在某些实验中,对未见领域的准确率提升超过两倍,同时在所有领域上保持了同等或更优的性能。
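Schema Augmentation 的核心操作——在微调时对提示中的槽位名引入随机变体——可以用如下草图演示。变体表中的具体条目是本文举例所作的假设,实际使用时需按数据集的 schema 定制。

```python
import random

# 槽位名的同义变体表:条目内容为示例假设,实际应按领域 schema 定制
SLOT_VARIANTS = {
    "hotel-pricerange": ["hotel-price range", "hotel-cost", "hotel-budget"],
    "restaurant-area": ["restaurant-location", "restaurant-district"],
}

def augment_schema(slots, rng):
    """微调时对提示中的槽位名做随机变体替换,
    即 Schema Augmentation 核心思想的极简示意:
    每个槽位以均等概率保留原名或换成一个变体。"""
    return [rng.choice([s] + SLOT_VARIANTS.get(s, [])) for s in slots]

rng = random.Random(0)
augmented = augment_schema(["hotel-pricerange", "restaurant-area"], rng)
```

模型在训练中见过同一槽位的多种写法后,便不再死记某个具体名称,从而对未见领域中陌生的槽位命名更鲁棒。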
[NLP-35] JudgeRank: Leveraging Large Language Models for Reasoning-Intensive Reranking
【速读】: 该论文试图解决在检索增强生成 (RAG) 应用中,大型语言模型 (LLMs) 作为密集编码器或列表式重排序器时,在处理推理密集型任务中缺乏细致分析文档相关性的问题。解决方案的关键是引入了一种名为 JudgeRank 的新型代理重排序器,该方法模拟人类认知过程来评估文档相关性。JudgeRank 的核心步骤包括:(1) 查询分析以识别核心问题,(2) 文档分析以提取查询感知的摘要,(3) 相关性判断以提供简明的文档相关性评估。通过在推理密集型 BRIGHT 基准上的评估,JudgeRank 显著提升了性能,并展示了其零样本泛化能力。
链接: https://arxiv.org/abs/2411.00142
作者: Tong Niu,Shafiq Joty,Ye Liu,Caiming Xiong,Yingbo Zhou,Semih Yavuz
关键词-EN: including open-domain question, open-domain question answering, retrieval-augmented generation, including open-domain, code completion
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Accurate document retrieval is crucial for the success of retrieval-augmented generation (RAG) applications, including open-domain question answering and code completion. While large language models (LLMs) have been employed as dense encoders or listwise rerankers in RAG systems, they often struggle with reasoning-intensive tasks because they lack nuanced analysis when judging document relevance. To address this limitation, we introduce JudgeRank, a novel agentic reranker that emulates human cognitive processes when assessing document relevance. Our approach consists of three key steps: (1) query analysis to identify the core problem, (2) document analysis to extract a query-aware summary, and (3) relevance judgment to provide a concise assessment of document relevance. We evaluate JudgeRank on the reasoning-intensive BRIGHT benchmark, demonstrating substantial performance improvements over first-stage retrieval methods and outperforming other popular reranking approaches. In addition, JudgeRank performs on par with fine-tuned state-of-the-art rerankers on the popular BEIR benchmark, validating its zero-shot generalization capability. Through comprehensive ablation studies, we demonstrate that JudgeRank’s performance generalizes well across LLMs of various sizes while ensembling them yields even more accurate reranking than individual models.
摘要:准确的文档检索对于检索增强生成(RAG)应用的成功至关重要,包括开放领域问答和代码补全。尽管大语言模型(LLMs)已被用作RAG系统中的密集编码器或列表式重排序器,但它们在处理推理密集型任务时常常遇到困难,因为它们在判断文档相关性时缺乏细致的分析。为了解决这一局限,我们引入了JudgeRank,这是一种新颖的智能体重排序器,它在评估文档相关性时模拟人类的认知过程。我们的方法包括三个关键步骤:(1)查询分析以识别核心问题,(2)文档分析以提取查询感知的摘要,以及(3)相关性判断以提供文档相关性的简明评估。我们在推理密集型的BRIGHT基准上评估了JudgeRank,结果显示其性能显著优于第一阶段检索方法,并超越了其他流行的重排序方法。此外,JudgeRank在流行的BEIR基准上与经过微调的最先进重排序器表现相当,验证了其零样本泛化能力。通过全面的消融研究,我们证明了JudgeRank的性能在不同规模的大语言模型中具有良好的泛化性,而将它们集成在一起则能比单个模型提供更准确的重排序。
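JudgeRank 的三步流程(查询分析、文档分析、相关性判断)可以组织成如下流水线草图。其中 `llm` 是任意"提示 → 文本"的回调;具体提示词、yes/no 的解析方式以及 0/1 打分都是本文演示所作的假设,并非论文原文。

```python
def judge_rank(llm, query, documents):
    """JudgeRank 三步流程的结构示意:查询分析 -> 文档分析 -> 相关性判断。
    提示词与打分解析方式均为假设。"""
    core = llm(f"分析查询,提炼核心问题:{query}")                    # (1) 查询分析
    scored = []
    for doc in documents:
        summary = llm(f"围绕『{core}』总结文档:{doc}")              # (2) 文档分析
        verdict = llm(f"判断摘要『{summary}』是否相关,答 yes/no")    # (3) 相关性判断
        scored.append((doc, 1 if verdict.strip().lower().startswith("yes") else 0))
    # 相关文档排前;sorted 为稳定排序,组内保持第一阶段检索的原序
    return [d for d, s in sorted(scored, key=lambda x: -x[1])]

# 用一个规则化的假 llm 演示流水线走向
def fake_llm(prompt):
    if prompt.startswith("分析查询"):
        return "python 性能优化"
    if prompt.startswith("围绕"):
        return prompt  # 摘要直接回传,保留文档内容
    return "yes" if "调优" in prompt else "no"

ranked = judge_rank(fake_llm, "如何优化 python?", ["关于猫的文章", "python 调优指南"])
```

三步拆解使每次判断都带有显式的中间推理(核心问题、查询感知摘要),这正是论文所说"模拟人类认知过程"的含义。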
[NLP-36] RSL-SQL: Robust Schema Linking in Text-to-SQL Generation
【速读】: 该论文试图解决在基于大型语言模型(LLMs)的文本到SQL生成任务中,模式链接(schema linking)可能导致的必要元素遗漏和数据库结构完整性破坏的问题。解决方案的关键在于提出了一种名为RSL-SQL的新框架,该框架结合了双向模式链接(bidirectional schema linking)、上下文信息增强(contextual information augmentation)、二元选择策略(binary selection strategy)和多轮自我修正(multi-turn self-correction)。通过前向和后向剪枝提高模式链接的召回率,并通过在完整模式和上下文信息增强的简化模式之间投票来规避风险。实验结果表明,该方法在BIRD和Spider基准测试中达到了开源解决方案中的最先进执行准确率,分别为67.2%和87.9%,并且在采用DeepSeek(成本更低)时,其性能优于一系列基于GPT-4的文本到SQL系统。
链接: https://arxiv.org/abs/2411.00073
作者: Zhenbiao Cao,Yuanlei Zheng,Zhihao Fan,Xiaojin Zhang,Wei Chen
关键词-EN: translate natural language, natural language questions, SQL statements, questions into SQL, generation aims
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注:
点击查看摘要
Abstract:Text-to-SQL generation aims to translate natural language questions into SQL statements. In large language models (LLMs) based Text-to-SQL, schema linking is a widely adopted strategy to streamline the input for LLMs by selecting only relevant schema elements, therefore reducing noise and computational overhead. However, schema linking faces risks that requires caution, including the potential omission of necessary elements and disruption of database structural integrity. To address these challenges, we propose a novel framework called RSL-SQL that combines bidirectional schema linking, contextual information augmentation, binary selection strategy, and multi-turn self-correction. Our approach improves the recall of schema linking through forward and backward pruning and hedges the risk by voting between full schema and contextual information augmented simplified schema. Experiments on the BIRD and Spider benchmarks demonstrate that our approach achieves state-of-the-art execution accuracy among open-source solutions, with 67.2% on BIRD and 87.9% on Spider using GPT-4o. Furthermore, our approach outperforms a series of GPT-4 based Text-to-SQL systems when adopting DeepSeek (much cheaper) with same intact prompts. Extensive analysis and ablation studies confirm the effectiveness of each component in our framework. The codes are available at this https URL.
摘要:文本到SQL生成旨在将自然语言问题转换为SQL语句。在大语言模型(LLM)为基础的文本到SQL生成中,模式链接是一种广泛采用的策略,通过仅选择相关的模式元素来简化输入,从而减少噪声和计算开销。然而,模式链接存在需要谨慎处理的风险,包括可能遗漏必要元素和破坏数据库结构完整性。为应对这些挑战,我们提出了一种名为RSL-SQL的新框架,该框架结合了双向模式链接、上下文信息增强、二元选择策略和多轮自校正。我们的方法通过前向和后向剪枝提高了模式链接的召回率,并通过在完整模式和上下文信息增强的简化模式之间进行投票来规避风险。在BIRD和Spider基准测试上的实验表明,我们的方法在开源解决方案中达到了最先进的执行准确率,使用GPT-4o在BIRD上达到67.2%,在Spider上达到87.9%。此外,当采用DeepSeek(成本更低)并使用相同的完整提示时,我们的方法在一系列基于GPT-4的文本到SQL系统中表现更优。广泛的分析和消融研究证实了我们框架中每个组件的有效性。代码可在以下链接获取:https URL。
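"在完整模式和简化模式之间投票"的二元选择策略可以用如下草图说明:两条候选 SQL 执行结果一致时采信简化版本,不一致或执行报错时回退到完整模式的结果。这里的具体回退规则是本文演示所作的假设,并非 RSL-SQL 的原始实现细节。

```python
def binary_select(sql_full, sql_simplified, execute):
    """二元选择策略的示意:比较完整 schema 与简化 schema 各自生成的 SQL。
    execute 为"SQL -> 执行结果"的回调;回退规则为演示假设。"""
    try:
        same = execute(sql_full) == execute(sql_simplified)
    except Exception:
        return sql_full  # 任一执行失败:保守地回退到完整 schema 的结果
    return sql_simplified if same else sql_full

# 用一张假的"SQL -> 结果"映射表演示两种分支
results = {"SELECT a": [1, 2], "SELECT b": [1, 2], "SELECT c": [9]}
choice_agree = binary_select("SELECT a", "SELECT b", results.__getitem__)
choice_differ = binary_select("SELECT a", "SELECT c", results.__getitem__)
```

这种投票机制正是摘要中"规避模式链接遗漏必要元素的风险"的具体化:简化模式带来效率,完整模式兜底正确性。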
[NLP-37] Interpretable Language Modeling via Induction-head Ngram Models
【速读】: 该论文试图解决在高风险和计算资源有限的环境中使用大型语言模型(LLMs)时,对解释性和效率的需求问题。解决方案的关键在于提出了一种名为“Induction-head ngram模型(Induction-Gram)”的方法,通过在现代ngram模型中加入一个手工设计的“induction head”来增强其效率和解释性。这个induction head利用自定义的神经相似度度量来高效地搜索模型的输入上下文,以找到可能的下一个词的补全。这种方法不仅显著提高了下一个词预测的准确性(比基线解释性模型高出26%),还可以通过推测解码加速LLM的推理过程。此外,在自然语言神经科学设置中,Induction-Gram在预测fMRI响应序列中的下一个响应时,也显著提高了预测的准确性(相关性提高了20%),可能为大脑中语言选择性的深入科学研究提供支持。
链接: https://arxiv.org/abs/2411.00066
作者: Eunji Kim,Sriya Mantena,Weiwei Yang,Chandan Singh,Sungroh Yoon,Jianfeng Gao
关键词-EN: Recent large language, Recent large, range of tasks, interpretability and efficiency, wide range
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Recent large language models (LLMs) have excelled across a wide range of tasks, but their use in high-stakes and compute-limited settings has intensified the demand for interpretability and efficiency. We address this need by proposing Induction-head ngram models (Induction-Gram), a method that builds an efficient, interpretable LM by bolstering modern ngram models with a hand-engineered “induction head”. This induction head uses a custom neural similarity metric to efficiently search the model’s input context for potential next-word completions. This process enables Induction-Gram to provide ngram-level grounding for each generated token. Moreover, experiments show that this simple method significantly improves next-word prediction over baseline interpretable models (up to 26%p) and can be used to speed up LLM inference for large models through speculative decoding. We further study Induction-Gram in a natural-language neuroscience setting, where the goal is to predict the next fMRI response in a sequence. It again provides a significant improvement over interpretable models (20% relative increase in the correlation of predicted fMRI responses), potentially enabling deeper scientific investigation of language selectivity in the brain. The code is available at this https URL.
摘要:近年来,大语言模型 (LLM) 在众多任务中表现卓越,但在高风险和计算资源受限的环境中,对其可解释性和效率的需求日益增加。我们通过提出诱导头 ngram 模型 (Induction-Gram) 来应对这一需求,该方法通过在现代 ngram 模型中加入一个手工设计的“诱导头”,构建了一个高效且可解释的语言模型。这个诱导头使用自定义的神经相似度度量,高效地在模型的输入上下文中搜索潜在的下一个词完成。这一过程使得 Induction-Gram 能够为每个生成的 Token 提供 ngram 级别的依据。此外,实验表明,这种简单的方法显著提升了下一个词预测的准确性(相对于基线可解释模型,最高提升 26%),并且可以通过推测解码加速大语言模型的推理过程。我们进一步在自然语言神经科学环境中研究了 Induction-Gram,目标是预测序列中的下一个 fMRI 响应。结果再次显示,它在可解释模型中提供了显著的改进(预测的 fMRI 响应相关性相对提高了 20%),这可能为深入研究大脑中的语言选择性提供了科学依据。代码可在以下链接获取:https URL。
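"诱导头"的检索机制——在输入上下文中寻找当前后缀此前的出现位置,并把其后继 Token 作为下一词候选——可以用一个非神经的极简版演示。论文使用自定义神经相似度做模糊检索,这里以精确匹配近似,仅展示机制本身。

```python
from collections import Counter

def induction_head_next(context, suffix_len=2):
    """诱导头思想的非神经极简版:检索当前后缀在上下文中此前的出现位置,
    统计其后继 Token,返回出现最多的候选(精确匹配近似神经相似度检索)。"""
    if len(context) <= suffix_len:
        return None
    suffix = tuple(context[-suffix_len:])
    followers = Counter()
    for i in range(len(context) - suffix_len):  # 不含末尾的后缀自身
        if tuple(context[i:i + suffix_len]) == suffix:
            followers[context[i + suffix_len]] += 1
    return followers.most_common(1)[0][0] if followers else None

# 上下文中 "cat sat" 此前出现过一次,后面跟的是 "on"
pred = induction_head_next("the cat sat on the mat and the cat sat".split())
```

每个预测都能指回上下文中的具体匹配位置,这正是"为每个生成的 Token 提供 ngram 级别依据"的可解释性来源。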
[NLP-38] Evolving Alignment via Asymmetric Self-Play
【速读】: 该论文试图解决现有强化学习从人类反馈 (RLHF) 框架在调整大型语言模型 (LLMs) 时,假设固定提示分布的局限性问题。这种假设限制了模型的可扩展性和泛化能力。论文提出的解决方案是引入一个开放式的、不对称的自博弈 (Asymmetric Self-Play) 框架,称为 Evolving Alignment via Asymmetric Self-Play (eva)。该框架将调整过程视为两个玩家之间的不对称游戏:(i) 一个生成器,利用奖励模型生成越来越丰富的提示分布;(ii) 一个求解器,学习在生成器生成的提示上产生更优的响应。eva 框架的关键在于其简单性和高效性,能够利用任何现有的 RLHF 算法进行可扩展的调整,并在多个基准测试中显著优于现有最先进的方法,无需额外的人工设计提示。
链接: https://arxiv.org/abs/2411.00062
作者: Ziyu Ye,Rishabh Agarwal,Tianqi Liu,Rishabh Joshi,Sarmishta Velury,Quoc V. Le,Qijun Tan,Yuan Liu
关键词-EN: Current RLHF frameworks, aligning large language, large language models, Current RLHF, fixed prompt distribution
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (stat.ML)
备注:
点击查看摘要
Abstract:Current RLHF frameworks for aligning large language models (LLMs) typically assume a fixed prompt distribution, which is sub-optimal and limits the scalability of alignment and generalizability of models. To address this, we introduce a general open-ended RLHF framework that casts alignment as an asymmetric game between two players: (i) a creator that generates increasingly informative prompt distributions using the reward model, and (ii) a solver that learns to produce more preferred responses on prompts produced by the creator. This framework of Evolving Alignment via Asymmetric Self-Play (eva), results in a simple and efficient approach that can utilize any existing RLHF algorithm for scalable alignment. eva outperforms state-of-the-art methods on widely-used benchmarks, without the need of any additional human crafted prompts. Specifically, eva improves the win rate of Gemma-2-9B-it on Arena-Hard from 51.6% to 60.1% with DPO, from 55.7% to 58.9% with SPPO, from 52.3% to 60.7% with SimPO, and from 54.8% to 60.3% with ORPO, surpassing its 27B version and matching claude-3-opus. This improvement is persistent even when new human crafted prompts are introduced. Finally, we show eva is effective and robust under various ablation settings.
摘要:当前用于对齐大语言模型(LLMs)的强化学习人类反馈(RLHF)框架通常假设固定的提示分布,这种假设并不理想,限制了对齐的可扩展性和模型的泛化能力。为解决这一问题,我们提出了一种通用的开放式RLHF框架,将对齐视为两个玩家之间的非对称游戏:(i)一个生成器,它利用奖励模型生成越来越丰富的提示分布;(ii)一个求解器,它学习在生成器产生的提示上生成更优的响应。这种通过非对称自我对弈进行演化对齐(Evolving Alignment via Asymmetric Self-Play, eva)的框架,形成了一种简单且高效的方法,能够利用任何现有的RLHF算法实现可扩展的对齐。eva在广泛使用的基准测试中优于最先进的方法,无需任何额外的人工设计提示。具体而言,eva在使用直接偏好优化(DPO)时,将Gemma-2-9B-it在Arena-Hard上的胜率从51.6%提升至60.1%;在使用自博弈偏好优化(SPPO)时,胜率从55.7%提升至58.9%;在使用简单偏好优化(SimPO)时,胜率从52.3%提升至60.7%;在使用优势比偏好优化(ORPO)时,胜率从54.8%提升至60.3%,超过了其27B版本,并达到了claude-3-opus的水平。即使在引入新的人工设计提示时,这种改进依然持续存在。最后,我们展示了eva在各种消融设置下依然有效且稳健。
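生成器与求解器之间的非对称博弈循环,可以抽象成如下单轮草图:评估当前提示池,选出求解器表现最差(即最具信息量)的提示,由生成器变异扩充提示池。`solve`/`reward`/`mutate` 均为抽象回调,接口与"取奖励最低的 k 条"这一选择规则都是本文演示所作的假设,并非 eva 的实际算法。

```python
def eva_round(prompts, solve, reward, mutate, k=1):
    """非对称自博弈单轮的骨架示意(接口与选择规则均为假设):
    生成器扮演出题人,挑出求解器奖励最低的 k 条提示并变异入池;
    求解器随后可在扩充后的提示池上继续用任意 RLHF 算法训练。"""
    ranked = sorted(prompts, key=lambda p: reward(p, solve(p)))  # 奖励从低到高
    hardest = ranked[:k]
    return prompts + [mutate(p) for p in hardest]

# 玩具演示:奖励 = 提示长度,生成器对最短(最难)的提示加后缀变异
pool = eva_round(["a", "bbb", "cc"],
                 solve=lambda p: p,
                 reward=lambda p, ans: len(p),
                 mutate=lambda p: p + "!")
```

提示分布随训练进程持续演化,而非固定不变——这正是摘要中"开放式"框架区别于传统 RLHF 的关键。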
[NLP-39] Generating Diverse Negations from Affirmative Sentences NEURIPS2024
【速读】: 该论文试图解决当前大型语言模型在处理否定语句时表现不佳的问题,特别是由于现有基准数据集中否定形式的多样性和复杂性不足,导致模型训练数据不足。解决方案的关键在于提出了NegVerse方法,该方法通过从肯定句中生成多种类型的否定形式(包括动词、非动词和词缀形式)来丰富否定数据集。具体来说,NegVerse利用句法结构来确定否定可能出现的位置,并通过冻结的基础大型语言模型(LLM)和提示调优来生成否定句。此外,该方法还包括一个过滤机制,用于识别否定线索并去除不合理的例子,从而生成具有高词汇相似性、良好句法保留和多样化否定形式的否定句。
链接: https://arxiv.org/abs/2411.00056
作者: Darian Rodriguez Vasquez,Afroditi Papadaki
关键词-EN: large language models, impressive performance, performance of large, struggle with reasoning, large language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at “Adaptive Foundation Models: Evolving AI for Personalized and Efficient Learning” workshop at NeurIPS 2024
点击查看摘要
Abstract:Despite the impressive performance of large language models across various tasks, they often struggle with reasoning under negated statements. Negations are important in real-world applications as they encode negative polarity in verb phrases, clauses, or other expressions. Nevertheless, they are underrepresented in current benchmarks, which mainly include basic negation forms and overlook more complex ones, resulting in insufficient data for training a language model. In this work, we propose NegVerse, a method that tackles the lack of negation datasets by producing a diverse range of negation types from affirmative sentences, including verbal, non-verbal, and affixal forms commonly found in English text. We provide new rules for masking parts of sentences where negations are most likely to occur, based on syntactic structure and use a frozen baseline LLM and prompt tuning to generate negated sentences. We also propose a filtering mechanism to identify negation cues and remove degenerate examples, producing a diverse range of meaningful perturbations. Our results show that NegVerse outperforms existing methods and generates negations with higher lexical similarity to the original sentences, better syntactic preservation and negation diversity. The code is available in this https URL
摘要:尽管大语言模型在各种任务中表现出色,但在处理否定陈述下的推理时常常遇到困难。否定在实际应用中非常重要,因为它们在动词短语、从句或其他表达中编码了否定极性。然而,当前的基准测试中对否定的表示不足,主要包含基本的否定形式,而忽略了更复杂的否定形式,导致训练语言模型的数据不足。在本研究中,我们提出了 NegVerse 方法,通过从肯定句中生成多种否定类型来解决否定数据集的缺乏问题,包括在英语文本中常见的动词否定、非动词否定和词缀否定形式。我们基于句法结构提供了新的规则,用于掩盖句子中否定最可能出现的部分,并使用冻结的基线大语言模型和提示调优来生成否定句。我们还提出了一种过滤机制,用于识别否定线索并移除退化示例,从而生成多样化的有意义扰动。我们的结果表明,NegVerse 优于现有方法,生成的否定句在词汇相似性、句法保留和否定多样性方面表现更好。代码可在以下链接中获取:https URL。
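论文区分的几类否定形式中,最简单的"动词类否定"可以用一条规则演示:在第一个助动词后插入 not。真实的 NegVerse 由冻结 LLM 按句法掩码位置生成动词、非动词、词缀等多种类型的否定,下面的规则版仅用于说明思路,助动词表也是示例性假设。

```python
AUXILIARIES = {"is", "are", "was", "were", "can", "will", "should", "do", "does", "did"}

def verbal_negate(sentence):
    """动词类否定的极简规则示意:在第一个助动词后插入 not。
    NegVerse 实际用 LLM 按句法位置生成多类型否定,此处仅演示最简单的一类。"""
    words = sentence.split()
    for i, w in enumerate(words):
        if w.lower() in AUXILIARIES:
            return " ".join(words[:i + 1] + ["not"] + words[i + 1:])
    return None  # 无助动词:该规则不适用,应交由其他否定类型(如词缀否定)处理

neg = verbal_negate("The model is robust")
```

返回 `None` 的分支正对应论文过滤机制的动机:并非每条规则对每个句子都能产生合理否定,不合理的候选需要被识别并剔除。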
[NLP-40] ACC-Debate: An Actor-Critic Approach to Multi-Agent Debate
【速读】: 该论文试图解决现有多智能体辩论(multi-agent debate, MAD)框架中辩论行为被视为自发而非学习行为的问题。解决方案的关键在于提出了基于Actor-Critic学习框架的ACC-Debate,通过训练两个专门用于辩论的智能体,使其在辩论中表现更为出色,并在多个基准测试中超越了现有的最先进辩论技术。
链接: https://arxiv.org/abs/2411.00053
作者: Andrew Estornell,Jean-Francois Ton,Yuanshun Yao,Yang Liu
关键词-EN: Large language models, Large language, language-based tasks, remarkable ability, ability to serve
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Large language models (LLMs) have demonstrated a remarkable ability to serve as general-purpose tools for various language-based tasks. Recent works have demonstrated that the efficacy of such models can be improved through iterative dialog between multiple models, frequently referred to as multi-agent debate (MAD). While debate shows promise as a means of improving model efficacy, most works in this area treat debate as an emergent behavior, rather than a learned behavior. In doing so, current debate frameworks rely on collaborative behaviors to have been sufficiently trained into off-the-shelf models. To address this limitation, we propose ACC-Debate, an Actor-Critic based learning framework to produce a two-agent team specialized in debate. We demonstrate that ACC-Debate outperforms SotA debate techniques on a wide array of benchmarks.
摘要:大语言模型 (Large Language Models, LLMs) 展示了作为各种基于语言任务通用工具的显著能力。最近的研究表明,通过多个模型之间的迭代对话,即所谓的多智能体辩论 (Multi-Agent Debate, MAD),可以显著提升这些模型的效能。尽管辩论作为一种提升模型效能的手段显示出潜力,但该领域的多数研究将辩论视为一种自发行为,而非学习行为。因此,当前的辩论框架依赖于预训练模型中已充分训练的协作行为。为解决这一局限,我们提出了基于 Actor-Critic 学习框架的 ACC-Debate,用于生成专长于辩论的双智能体团队。我们的实验证明,ACC-Debate 在广泛的基准测试中优于现有的最先进辩论技术。
[NLP-41] Larger models yield better results? Streamlined severity classification of ADHD-related concerns using BERT-based knowledge distillation
【速读】: 该论文试图解决在自然语言处理应用中,如何通过知识蒸馏(Knowledge Distillation)方法生成一个轻量级但强大的基于BERT的模型,以应对资源有限的环境。解决方案的关键在于创建了一个定制化的学生BERT模型,即LastBERT,通过显著减少模型参数(从110 million减少到29 million),同时保持了在GLUE基准测试和实际ADHD数据集上的高性能。具体来说,LastBERT在保持与DistilBERT和ClinicalBERT相当的性能的同时,大幅降低了模型大小和计算资源需求,从而提高了模型的适用性和可访问性,特别是在使用如Google Colab等常见计算工具的情况下。
链接: https://arxiv.org/abs/2411.00052
作者: Ahmed Akib Jawad Karim,Kazi Hafiz Md. Asad,Md. Golam Rabiul Alam
关键词-EN: powerful BERT based, natural language processing, Deficit Hyperactivity Disorder, Attention Deficit Hyperactivity, BERT based model
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 20 figures, 31 pages, review 1 from plos one journal
点击查看摘要
Abstract:This work focuses on the efficiency of the knowledge distillation approach in generating a lightweight yet powerful BERT based model for natural language processing applications. After the model creation, we applied the resulting model, LastBERT, to a real-world task classifying severity levels of Attention Deficit Hyperactivity Disorder (ADHD)-related concerns from social media text data. Referring to LastBERT, a customized student BERT model, we significantly lowered model parameters from 110 million BERT base to 29 million, resulting in a model approximately 73.64% smaller. On the GLUE benchmark, comprising paraphrase identification, sentiment analysis, and text classification, the student model maintained strong performance across many tasks despite this reduction. The model was also used on a real-world ADHD dataset with an accuracy and F1 score of 85%. When compared to DistilBERT (66M) and ClinicalBERT (110M), LastBERT demonstrated comparable performance, with DistilBERT slightly outperforming it at 87%, and ClinicalBERT achieving 86% across the same metrics. These findings highlight the LastBERT model’s capacity to classify degrees of ADHD severity properly, so it offers a useful tool for mental health professionals to assess and comprehend material produced by users on social networking platforms. The study emphasizes the possibilities of knowledge distillation to produce effective models fit for use in resource-limited conditions, hence advancing NLP and mental health diagnosis. Furthermore underlined by the considerable decrease in model size without appreciable performance loss is the lower computational resources needed for training and deployment, hence facilitating greater applicability. Especially using readily available computational tools like Google Colab. This study shows the accessibility and usefulness of advanced NLP methods in pragmatic world applications.
摘要:本研究聚焦于知识蒸馏方法在生成轻量级且强大的基于 BERT 的自然语言处理模型中的效率。在模型构建完成后,我们将生成的模型 LastBERT 应用于一个实际任务,即从社交媒体文本数据中分类注意力缺陷多动障碍(ADHD)相关问题的严重程度。基于 LastBERT 这一定制的学生 BERT 模型,我们将模型参数量从 BERT base 的 1.1 亿显著降低到 2900 万,使得模型大小减少了约 73.64%。在包含释义识别、情感分析和文本分类的 GLUE 基准测试中,尽管模型规模大幅缩小,学生模型在多个任务中仍保持了强劲的性能。该模型还在一个实际的 ADHD 数据集上使用,准确率和 F1 分数达到 85%。与 DistilBERT(66M)和 ClinicalBERT(110M)相比,LastBERT 表现相当,其中 DistilBERT 略胜一筹,达到 87%,而 ClinicalBERT 在相同指标上达到 86%。这些发现突显了 LastBERT 模型在正确分类 ADHD 严重程度方面的能力,因此它为心理健康专业人员提供了一个有用的工具,用于评估和理解用户在社交网络平台上产生的内容。本研究强调了知识蒸馏在资源有限条件下生成有效模型的可能性,从而推动了自然语言处理和心理健康诊断的发展。此外,模型尺寸的大幅减少而性能损失不显著,也意味着训练和部署所需的计算资源减少,从而提高了适用性,特别是在使用如 Google Colab 等现成的计算工具时。本研究展示了先进的自然语言处理方法在实际应用中的可及性和实用性。
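知识蒸馏训练学生模型时的核心损失项,是标准的 Hinton 式软标签蒸馏:教师与学生在温度 T 下的输出分布之间的 KL 散度,乘以 T² 以保持梯度量级。下面用纯 Python 给出该公式的示意;LastBERT 具体如何与任务交叉熵加权组合,摘要中未给出,属于未知细节。

```python
import math

def softmax(logits, T=1.0):
    """带温度 T 的 softmax:T 越大,分布越平滑,暴露更多教师的"暗知识"。"""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Hinton 式软标签蒸馏损失:KL(teacher || student) 乘 T^2。
    这是标准公式的示意;LastBERT 实际的损失加权组合文中未说明。"""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    return (T ** 2) * sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s))
```

学生与教师输出完全一致时损失为 0;分布偏差越大,损失越大,从而驱动 2900 万参数的学生模仿 1.1 亿参数教师的输出行为。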
[NLP-42] Rule by Rule: Learning with Confidence through Vocabulary Expansion
【速读】: 该论文试图解决在文本数据(但不限于文本数据)中进行规则学习时面临的内存消耗问题,并提高生成规则的可靠性。解决方案的关键在于引入一种创新的迭代方法,通过逐步扩展词汇表来显著减少内存消耗,并引入“置信度值”(Value of Confidence)作为生成规则可靠性的指标。该方法确保只有最稳健和可信的规则被保留,从而提升规则学习过程的整体质量。
链接: https://arxiv.org/abs/2411.00049
作者: Albert Nössig,Tobias Hell,Georg Moser
关键词-EN: learning specifically designed, innovative iterative approach, text-based data, rule learning specifically, present an innovative
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 29 pages, 8 figures
点击查看摘要
Abstract:In this paper, we present an innovative iterative approach to rule learning specifically designed for (but not limited to) text-based data. Our method focuses on progressively expanding the vocabulary utilized in each iteration resulting in a significant reduction of memory consumption. Moreover, we introduce a Value of Confidence as an indicator of the reliability of the generated rules. By leveraging the Value of Confidence, our approach ensures that only the most robust and trustworthy rules are retained, thereby improving the overall quality of the rule learning process. We demonstrate the effectiveness of our method through extensive experiments on various textual as well as non-textual datasets including a use case of significant interest to insurance industries, showcasing its potential for real-world applications.
摘要:本文提出了一种创新的迭代规则学习方法,特别针对(但不限于)基于文本的数据。该方法的核心在于逐步扩展每次迭代中使用的词汇,从而显著减少内存消耗。此外,我们引入了一个“置信度值”作为生成规则可靠性的指标。通过利用置信度值,我们的方法确保只保留最稳健和可信的规则,从而提高规则学习过程的整体质量。我们通过在多种文本和非文本数据集上的广泛实验,包括对保险行业具有重要意义的用例,展示了该方法的有效性及其在实际应用中的潜力。
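论文未在摘要中给出实现细节,以下为一个假设性的极简示意:每轮只在逐步扩大的词表上学习"词 → 标签"规则,并以该规则在训练集上的精确率作为"置信度值 (Value of Confidence)",仅保留超过阈值的规则。词表扩展步长、阈值与玩具数据均为示意性假设:

```python
from collections import Counter

def learn_rules(docs, labels, vocab, min_conf=0.8):
    """对词表中的每个词,统计其出现文档的标签分布,
    以多数标签的占比作为该规则的置信度值。"""
    rules = {}
    for term in vocab:
        hits = [lab for doc, lab in zip(docs, labels) if term in doc]
        if not hits:
            continue
        label, count = Counter(hits).most_common(1)[0]
        conf = count / len(hits)
        if conf >= min_conf:
            rules[term] = (label, conf)
    return rules

def iterative_rule_learning(docs, labels, n_rounds=3, vocab_step=2, min_conf=0.8):
    """按词频逐轮扩展词表:每轮只引入若干高频新词,
    在较小的词表上学习规则以降低内存占用。"""
    tokenized = [set(d.lower().split()) for d in docs]
    freq = Counter(t for doc in tokenized for t in doc)
    ranked = [t for t, _ in freq.most_common()]
    rules = {}
    for r in range(n_rounds):
        vocab = ranked[: vocab_step * (r + 1)]   # 词表逐轮扩展
        rules = learn_rules(tokenized, labels, vocab, min_conf)
    return rules

docs = ["great movie great fun", "bad plot bad acting",
        "great acting", "bad movie"]
labels = ["pos", "neg", "pos", "neg"]
rules = iterative_rule_learning(docs, labels)
print(rules)   # 仅保留置信度 >= 0.8 的规则,如 great→pos, bad→neg
```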
[NLP-43] CurateGPT: A flexible language-model assisted biocuration tool
【速读】: 该论文试图解决生物医学数据发现过程中数据管理(data curation)的效率问题,特别是在专家管理员面临时间和资源限制的情况下。解决方案的关键在于利用生成式 AI(Generative AI),特别是指令调优的大型语言模型(LLMs),结合代理(agents)的设计理念,来辅助人类进行推理、搜索本体和整合外部知识源。论文提出的 CurateGPT 工具结合了生成式 AI 的强大能力与可信的知识库和文献资源,通过简化管理流程,提高协作和效率,帮助管理员、研究人员和工程师应对日益增长的科学数据量。
链接: https://arxiv.org/abs/2411.00046
作者: Harry Caufield,Carlo Kroll,Shawn T O’Neil,Justin T Reese,Marcin P Joachimiak,Harshad Hegde,Nomi L Harris,Madan Krishnamurthy,James A McLaughlin,Damian Smedley,Melissa A Haendel,Peter N Robinson,Christopher J Mungall
关键词-EN: Effective data-driven biomedical, data-driven biomedical discovery, biomedical discovery requires, structured form suitable, Effective data-driven
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB); Quantitative Methods (q-bio.QM)
备注:
点击查看摘要
Abstract:Effective data-driven biomedical discovery requires data curation: a time-consuming process of finding, organizing, distilling, integrating, interpreting, annotating, and validating diverse information into a structured form suitable for databases and knowledge bases. Accurate and efficient curation of these digital assets is critical to ensuring that they are FAIR, trustworthy, and sustainable. Unfortunately, expert curators face significant time and resource constraints. The rapid pace of new information being published daily is exceeding their capacity for curation. Generative AI, exemplified by instruction-tuned large language models (LLMs), has opened up new possibilities for assisting human-driven curation. The design philosophy of agents combines the emerging abilities of generative AI with more precise methods. A curator’s tasks can be aided by agents for performing reasoning, searching ontologies, and integrating knowledge across external sources, all efforts otherwise requiring extensive manual effort. Our LLM-driven annotation tool, CurateGPT, melds the power of generative AI together with trusted knowledge bases and literature sources. CurateGPT streamlines the curation process, enhancing collaboration and efficiency in common workflows. Compared to direct interaction with an LLM, CurateGPT’s agents enable access to information beyond that in the LLM’s training data and they provide direct links to the data supporting each claim. This helps curators, researchers, and engineers scale up curation efforts to keep pace with the ever-increasing volume of scientific data.
摘要:有效的数据驱动生物医学发现需要数据管理:这是一个耗时的过程,涉及查找、组织、提炼、整合、解释、注释和验证多样化的信息,使其成为适合数据库和知识库的结构化形式。准确且高效地管理这些数字资产对于确保其符合FAIR原则(可查找性、可访问性、互操作性和可重复使用性)、可信且可持续至关重要。然而,专家管理员面临着显著的时间和资源限制。每天发布的新信息的快速步伐已经超出了他们的管理能力。生成式 AI(Generative AI),特别是通过指令微调的大语言模型(LLMs),为辅助人类驱动的数据管理开辟了新的可能性。智能体(AI Agent)的设计理念结合了生成式 AI 的新兴能力与更为精确的方法。管理员的任务可以通过智能体来辅助,这些智能体能够进行推理、搜索本体论以及跨外部资源整合知识,这些工作原本需要大量的人工努力。我们的 LLM 驱动的注释工具 CurateGPT,将生成式 AI 的力量与可信的知识库和文献资源相结合。CurateGPT 简化了数据管理过程,增强了常见工作流程中的协作和效率。与直接与 LLM 交互相比,CurateGPT 的智能体能够访问超出 LLM 训练数据范围的信息,并提供支持每个声明的数据的直接链接。这有助于管理员、研究人员和工程师扩大数据管理工作的规模,以跟上不断增长的科学数据量。
[NLP-44] A Novel Psychometrics-Based Approach to Developing Professional Competency Benchmark for Large Language Models
【速读】: 该论文试图解决当前大型语言模型(LLM)评估方法的不足,特别是现有基准测试在验证性和可靠性方面的缺陷。解决方案的关键在于采用证据中心设计(Evidence-centered design, ECD)方法论,并结合严格的心理测量学原则,构建一个基于布鲁姆分类法(Bloom’s taxonomy)的全新教育领域基准测试。该基准由教育专家团队精心设计,旨在为LLM提供一个学术上严谨且实用的评估工具,而非针对人类参与者。通过在俄语GPT模型上的实证测试,该基准揭示了当前LLM在处理复杂认知任务方面的关键能力差距,强调了生成式AI工具在教育领域的潜力与局限性。
链接: https://arxiv.org/abs/2411.00045
作者: Elena Kardanova,Alina Ivanova,Ksenia Tarasova,Taras Pashchenko,Aleksei Tikhoniuk,Elen Yusupova,Anatoly Kasprzhak,Yaroslav Kuzminov,Ekaterina Kruchinskaia,Irina Brun(National Research University Higher School of Economics, Moscow, Russia)
关键词-EN: raises questions, era of large, large language models, benchmark, existing benchmark development
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 36 pages, 2 figures
点击查看摘要
Abstract:The era of large language models (LLM) raises questions not only about how to train models, but also about how to evaluate them. Despite numerous existing benchmarks, insufficient attention is often given to creating assessments that test LLMs in a valid and reliable manner. To address this challenge, we accommodate the Evidence-centered design (ECD) methodology and propose a comprehensive approach to benchmark development based on rigorous psychometric principles. In this paper, we have made the first attempt to illustrate this approach by creating a new benchmark in the field of pedagogy and education, highlighting the limitations of existing benchmark development approach and taking into account the development of LLMs. We conclude that a new approach to benchmarking is required to match the growing complexity of AI applications in the educational context. We construct a novel benchmark guided by the Bloom’s taxonomy and rigorously designed by a consortium of education experts trained in test development. Thus the current benchmark provides an academically robust and practical assessment tool tailored for LLMs, rather than human participants. Tested empirically on the GPT model in the Russian language, it evaluates model performance across varied task complexities, revealing critical gaps in current LLM capabilities. Our results indicate that while generative AI tools hold significant promise for education - potentially supporting tasks such as personalized tutoring, real-time feedback, and multilingual learning - their reliability as autonomous teachers’ assistants right now remain rather limited, particularly in tasks requiring deeper cognitive engagement.
摘要:大语言模型 (LLM) 的时代不仅提出了如何训练模型的问题,还提出了如何评估它们的问题。尽管存在众多现有的基准测试,但往往对创建有效且可靠的评估方法关注不足。为了应对这一挑战,我们采用了证据中心设计 (Evidence-centered design, ECD) 方法论,并提出了一种基于严格心理测量原则的综合基准开发方法。本文首次尝试通过在教育学领域创建一个新的基准来阐述这种方法,突出了现有基准开发方法的局限性,并考虑了大语言模型的发展。我们得出结论,需要一种新的基准测试方法来匹配教育领域中日益复杂的 AI 应用。我们构建了一个新的基准,该基准以布鲁姆分类法为指导,并由受过测试开发培训的教育专家团队严格设计。因此,当前的基准提供了一个学术上严谨且实用的评估工具,专门为大语言模型定制,而非针对人类参与者。在俄语的 GPT 模型上进行了实证测试,它评估了模型在不同任务复杂性下的表现,揭示了当前大语言模型能力的重大差距。我们的结果表明,尽管生成式 AI 工具在教育领域具有显著潜力——可能支持个性化辅导、实时反馈和多语言学习等任务——但作为自主教师助手的可靠性目前仍然相当有限,特别是在需要深度认知参与的任务中。
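基准按布鲁姆分类法组织任务后,可分层统计模型表现以定位能力差距(摘要所述"需要深度认知参与的任务表现有限"即由此得出)。下面是一个示意性的按认知层级汇总正确率的小例子,层级名称与数据均为虚构:

```python
from collections import defaultdict

def accuracy_by_level(items):
    """按布鲁姆认知层级分别统计正确率。items 为 (层级, 是否答对) 列表。"""
    totals, hits = defaultdict(int), defaultdict(int)
    for level, correct in items:
        totals[level] += 1
        hits[level] += int(correct)
    return {lvl: hits[lvl] / totals[lvl] for lvl in totals}

# 虚构结果:低层级(记忆)表现好,高层级(分析)表现差
items = [("remember", True), ("remember", True), ("remember", False),
         ("apply", True), ("apply", False),
         ("analyze", False), ("analyze", False)]
acc = accuracy_by_level(items)
print(acc)   # remember≈0.67, apply=0.5, analyze=0.0
```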
[NLP-45] MIMIC-IV-Ext-PE: Using a large language model to predict pulmonary embolism phenotype in the MIMIC-IV dataset
【速读】: 该论文试图解决肺栓塞(Pulmonary Embolism, PE)诊断数据集缺乏的问题,特别是缺乏大规模公开可用且带有PE标签的数据集。解决方案的关键在于利用MIMIC-IV数据库中的CTPA(Computed Tomography Pulmonary Angiography)放射报告,通过两位医生手动标注结果,并应用经过微调的Bio_ClinicalBERT语言模型(VTE-BERT)来自动提取标签。通过对比VTE-BERT与诊断代码的性能,发现VTE-BERT在所有19,942例CTPA报告中的敏感性(Sensitivity)为92.4%,阳性预测值(Positive Predictive Value, PPV)为87.8%,而诊断代码在11,990例住院患者中的敏感性为95.4%,PPV为83.8%。这一方法不仅成功为近20,000例CTPA添加了标签,还展示了半监督语言模型在加速血液学研究中的外部有效性。
链接: https://arxiv.org/abs/2411.00044
作者: B. D. Lam,S. Ma,I. Kovalenko,P. Wang,O. Jafari,A. Li,S. Horng
关键词-EN: preventable in-hospital mortality, in-hospital mortality, preventable in-hospital, Pulmonary embolism, diagnosis codes
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Pulmonary embolism (PE) is a leading cause of preventable in-hospital mortality. Advances in diagnosis, risk stratification, and prevention can improve outcomes. There are few large publicly available datasets that contain PE labels for research. Using the MIMIC-IV database, we extracted all available radiology reports of computed tomography pulmonary angiography (CTPA) scans and two physicians manually labeled the results as PE positive (acute PE) or PE negative. We then applied a previously finetuned Bio_ClinicalBERT transformer language model, VTE-BERT, to extract labels automatically. We verified VTE-BERT’s reliability by measuring its performance against manual adjudication. We also compared the performance of VTE-BERT to diagnosis codes. We found that VTE-BERT has a sensitivity of 92.4% and positive predictive value (PPV) of 87.8% on all 19,942 patients with CTPA radiology reports from the emergency room and/or hospital admission. In contrast, diagnosis codes have a sensitivity of 95.4% and PPV of 83.8% on the subset of 11,990 hospitalized patients with discharge diagnosis codes. We successfully add nearly 20,000 labels to CTPAs in a publicly available dataset and demonstrate the external validity of a semi-supervised language model in accelerating hematologic research.
摘要:肺栓塞(Pulmonary Embolism, PE)是院内可预防死亡的主要原因之一。诊断、风险分层和预防方面的进展可以改善患者预后。然而,目前公开可用的包含PE标签的大型数据集较少。我们利用MIMIC-IV数据库,提取了所有可用的计算机断层扫描肺动脉造影(Computed Tomography Pulmonary Angiography, CTPA)的放射报告,并由两名医生手动将结果标记为PE阳性(急性PE)或PE阴性。随后,我们应用了先前微调的Bio_ClinicalBERT Transformer语言模型——VTE-BERT,来自动提取标签。通过测量VTE-BERT与手动裁决结果的性能对比,验证了其可靠性。我们还比较了VTE-BERT与诊断代码的性能。结果显示,在所有19,942名急诊和/或住院患者的CTPA放射报告中,VTE-BERT的敏感性为92.4%,阳性预测值(Positive Predictive Value, PPV)为87.8%。相比之下,在11,990名有出院诊断代码的住院患者子集中,诊断代码的敏感性为95.4%,PPV为83.8%。我们成功为公开数据集中的近20,000份CTPA报告添加了标签,并展示了半监督语言模型在加速血液学研究中的外部有效性。
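摘要中报告的敏感性与阳性预测值可由混淆矩阵直接计算,如下示意(标注与预测均为玩具数据):

```python
def sensitivity_ppv(y_true, y_pred):
    """敏感性 = TP/(TP+FN);阳性预测值 PPV = TP/(TP+FP)。
    这里以 1 表示 PE 阳性,0 表示阴性。"""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tp / (tp + fp)

# 玩具数据:8 份报告的人工标注与模型预测
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
sens, ppv = sensitivity_ppv(y_true, y_pred)
print(f"Sensitivity={sens:.3f}, PPV={ppv:.3f}")  # Sensitivity=0.750, PPV=0.750
```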
[NLP-46] Problem Categorization Can Help Large Language Models Solve Math Problems
【速读】: 该论文试图解决如何优化大型语言模型(Large-Language Models, LLMs)在快速准确解决数学问题中的应用。解决方案的关键在于通过将问题分类到不同类别来促进问题解决,并创建一个准确的分类数据集来优化这一过程。这种方法有助于减少LLMs在解决数学问题时的幻觉(hallucination)现象,从而提升其解决问题的能力。
链接: https://arxiv.org/abs/2411.00042
作者: Amogh Akella
关键词-EN: Large-Language Models, Models to quickly, accurately solve mathematical, usage of Large-Language, quickly and accurately
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:In this paper, we explore how to optimize the usage of Large-Language Models to quickly and accurately solve mathematical problems. In particular, we show the effectiveness of using the classification of problems into different categories to facilitate problem-solving. Additionally, we optimize the classification of problems into categories by creating an accurate dataset. We believe that our technique for problem-solving works by helping mitigate hallucination in LLMs which is key to unlocking their ability to solve math problems.
摘要:本文探讨了如何优化大语言模型 (Large-Language Models) 的使用,以快速且准确地解决数学问题。特别地,我们展示了将问题分类为不同类别以促进问题解决的有效性。此外,我们通过创建一个精确的数据集来优化问题的分类。我们相信,我们的问题解决技术通过帮助减少大语言模型中的幻觉 (hallucination),从而解锁其解决数学问题的能力,这是关键所在。
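论文思路可概括为"先分类、再按类别构造提示"。下面是一个基于关键词匹配的假设性示意,类别词表与提示模板均为虚构,仅展示流程:

```python
# 假设性的类别关键词表与提示模板,仅作流程示意
CATEGORY_KEYWORDS = {
    "geometry": {"triangle", "circle", "angle", "area"},
    "number_theory": {"prime", "divisible", "remainder", "modulo"},
    "algebra": {"equation", "polynomial", "solve", "roots"},
}
PROMPT_TEMPLATES = {
    "geometry": "You are solving a geometry problem. Draw auxiliary lines if needed.\n{q}",
    "number_theory": "You are solving a number theory problem. Consider modular arithmetic.\n{q}",
    "algebra": "You are solving an algebra problem. Simplify step by step.\n{q}",
}

def categorize(question: str) -> str:
    """按与各类别关键词的重叠数选取得分最高的类别。"""
    words = set(question.lower().replace("?", "").split())
    scores = {c: len(words & kws) for c, kws in CATEGORY_KEYWORDS.items()}
    return max(scores, key=scores.get)

def build_prompt(question: str) -> str:
    """先分类,再套用该类别的提示模板。"""
    return PROMPT_TEMPLATES[categorize(question)].format(q=question)

q = "Find the remainder when 2024 is divided by a prime"
print(categorize(q))   # number_theory
```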
[NLP-47] NeuroSym-BioCAT: Leveraging Neuro-Symbolic Methods for Biomedical Scholarly Document Categorization and Question Answering
【速读】: 该论文试图解决生物医学学术文档摘要的大量增长带来的信息检索效率问题。解决方案的关键在于结合优化的主题建模框架OVB-LDA与BI-POP CMA-ES优化技术,以提升学术文档摘要的分类效果,并采用经过领域特定数据微调的MiniLM模型进行高精度答案提取。该方法在学术文档摘要检索、金标准学术文档摘要和金标准片段三个配置中均优于现有方法,如RYGH和bio-answer finder,并展示了仅从摘要中提取答案即可达到高准确性,强调了摘要对于许多生物医学查询的充分性。尽管MiniLM模型规模较小,但其表现出了与大型模型相媲美的性能,挑战了只有大型、资源密集型模型才能处理复杂任务的传统观念。
链接: https://arxiv.org/abs/2411.00041
作者: Parvez Zamil,Gollam Rabby,Md. Sadekur Rahman,Sören Auer
关键词-EN: efficiently retrieving accurate, scholarly document abstract, document abstracts presents, scholarly document, relevant information
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL); Information Retrieval (cs.IR)
备注:
点击查看摘要
Abstract:The growing volume of biomedical scholarly document abstracts presents an increasing challenge in efficiently retrieving accurate and relevant information. To address this, we introduce a novel approach that integrates an optimized topic modelling framework, OVB-LDA, with the BI-POP CMA-ES optimization technique for enhanced scholarly document abstract categorization. Complementing this, we employ the distilled MiniLM model, fine-tuned on domain-specific data, for high-precision answer extraction. Our approach is evaluated across three configurations: scholarly document abstract retrieval, gold-standard scholarly documents abstract, and gold-standard snippets, consistently outperforming established methods such as RYGH and bio-answer finder. Notably, we demonstrate that extracting answers from scholarly documents abstracts alone can yield high accuracy, underscoring the sufficiency of abstracts for many biomedical queries. Despite its compact size, MiniLM exhibits competitive performance, challenging the prevailing notion that only large, resource-intensive models can handle such complex tasks. Our results, validated across various question types and evaluation batches, highlight the robustness and adaptability of our method in real-world biomedical applications. While our approach shows promise, we identify challenges in handling complex list-type questions and inconsistencies in evaluation metrics. Future work will focus on refining the topic model with more extensive domain-specific datasets, further optimizing MiniLM and utilizing large language models (LLM) to improve both precision and efficiency in biomedical question answering.
摘要:随着生物医学学术文档摘要数量的不断增加,如何高效地检索准确且相关的信息成为一个日益严峻的挑战。为此,我们提出了一种新颖的方法,该方法将优化的主题建模框架 OVB-LDA 与 BI-POP CMA-ES 优化技术相结合,以增强学术文档摘要的分类效果。此外,我们还采用了在领域特定数据上微调的蒸馏 MiniLM 模型,用于高精度的答案提取。我们的方法在三种配置下进行了评估:学术文档摘要检索、黄金标准学术文档摘要和黄金标准片段,均显著优于现有的方法,如 RYGH 和生物答案查找器。特别值得注意的是,我们证明了仅从学术文档摘要中提取答案即可达到高准确率,这表明摘要对于许多生物医学查询已足够充分。尽管 MiniLM 模型规模较小,但其表现却具有竞争力,挑战了只有大型、资源密集型模型才能处理此类复杂任务的传统观念。我们的结果在多种问题类型和评估批次中得到了验证,凸显了该方法在实际生物医学应用中的鲁棒性和适应性。尽管我们的方法显示出潜力,但我们仍识别出在处理复杂列表类型问题和评估指标不一致性方面的挑战。未来的工作将聚焦于使用更广泛的领域特定数据集来改进主题模型,进一步优化 MiniLM,并利用大语言模型 (LLM) 来提升生物医学问答中的精度和效率。
[NLP-48] Linear Chain Transformation: Expanding Optimization Dynamics for Fine-Tuning Large Language Models
【速读】: 该论文试图解决在微调大型语言模型(LLMs)时,如何更有效地适应特定下游任务的问题。解决方案的关键在于提出了一种名为线性链变换(Linear Chain Transformation, LinChain)的新方法,通过在微调过程中引入一系列线性变换,丰富了优化动态。LinChain通过在参数更新过程中整合多个线性变换,扩展了更新的有效秩,从而增强了模型学习复杂任务特定表示的能力。这种方法在保持推理效率的同时,提供了更灵活的训练优化路径,显著提升了LLM微调的性能,并在多个基准任务上展示了更好的泛化能力、更少的可学习参数以及更强的任务适应性。
链接: https://arxiv.org/abs/2411.00039
作者: Yulong Wang,Chang Zuo,Yin Xuan,Hong Li,Ni Wei
关键词-EN: Fine-tuning large language, large language models, adapting pretrained models, specific downstream tasks, Linear Chain Transformation
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 9 pages, 2 figures, 4 tables
点击查看摘要
Abstract:Fine-tuning large language models (LLMs) has become essential for adapting pretrained models to specific downstream tasks. In this paper, we propose Linear Chain Transformation (LinChain), a novel approach that introduces a sequence of linear transformations during fine-tuning to enrich optimization dynamics. By incorporating multiple linear transformations into the parameter update process, LinChain expands the effective rank of updates and enhances the model’s ability to learn complex task-specific representations. We demonstrate that this method significantly improves the performance of LLM fine-tuning over state-of-the-art methods by providing more flexible optimization paths during training, while maintaining the inference efficiency of the resulting model. Our experiments on various benchmark tasks show that LinChain leads to better generalization, fewer learnable parameters, and improved task adaptation, making it a compelling strategy for LLM fine-tuning.
摘要:微调大语言模型(LLM)已成为将预训练模型适应于特定下游任务的关键。本文提出了一种名为线性链变换(Linear Chain Transformation, LinChain)的新方法,该方法在微调过程中引入了一系列线性变换,以丰富优化动态。通过在参数更新过程中融入多个线性变换,LinChain扩展了更新的有效秩,并增强了模型学习复杂任务特定表示的能力。我们证明,这种方法在训练过程中提供了更灵活的优化路径,显著提升了LLM微调的性能,同时保持了最终模型的推理效率。我们在多个基准任务上的实验表明,LinChain带来了更好的泛化能力、更少的可学习参数以及更优的任务适应性,使其成为LLM微调的一种引人注目的策略。
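LinChain 的参数化可粗略理解为:在 LoRA 的 ΔW = B·A 之间插入一串可训练线性变换,得到 ΔW = B·M1·…·Mk·A;训练结束后整条链可合并为单个矩阵,因此不增加推理开销。以下为纯 Python 的结构示意(矩阵维度与数值均为假设,非论文实现):

```python
def matmul(a, b):
    """朴素矩阵乘法,仅为示意。"""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def delta_w_lora(B, A):
    # LoRA: ΔW = B · A
    return matmul(B, A)

def delta_w_linchain(B, mids, A):
    # LinChain: ΔW = B · M1 · ... · Mk · A,
    # 训练后整条链可合并为一个矩阵,推理时与 LoRA 开销相同
    out = B
    for M in mids:
        out = matmul(out, M)
    return matmul(out, A)

# 玩具维度:d=2,秩 r=2,一个中间变换
B = [[1.0, 0.0], [0.0, 1.0]]          # d x r
M1 = [[2.0, 0.0], [0.0, 3.0]]         # r x r 的中间线性变换
A = [[1.0, 1.0], [0.0, 1.0]]          # r x d
print(delta_w_linchain(B, [M1], A))   # [[2.0, 2.0], [0.0, 3.0]]
```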
[NLP-49] Topic-Conversation Relevance (TCR) Dataset and Benchmarks NEURIPS2024
【速读】: 该论文试图解决会议效率低下的问题,通过评估会议对话与预定主题的相关性来提高会议的有效性。解决方案的关键在于创建了一个全面的主题-对话相关性数据集(Topic-Conversation Relevance, TCR),该数据集包含1,500个独特的会议、2200万字的转录文本和超过15,000个会议主题,数据来源包括新收集的语音中断会议(Speech Interruption Meeting, SIM)数据和现有的公开数据集。此外,论文还提供了开源脚本,用于生成合成会议或从TCR数据集中创建增强会议,以增强数据多样性。通过使用GPT-4创建的基准测试,评估模型在理解转录文本与主题相关性方面的准确性。
链接: https://arxiv.org/abs/2411.00038
作者: Yaran Fan,Jamie Pool,Senja Filipi,Ross Cutler
关键词-EN: Workplace meetings, organizational collaboration, rated as ineffective, vital to organizational, large percentage
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: To be published in 38th Conference on Neural Information Processing Systems (NeurIPS 2024) Track on Datasets and Benchmarks
点击查看摘要
Abstract:Workplace meetings are vital to organizational collaboration, yet a large percentage of meetings are rated as ineffective. To help improve meeting effectiveness by understanding if the conversation is on topic, we create a comprehensive Topic-Conversation Relevance (TCR) dataset that covers a variety of domains and meeting styles. The TCR dataset includes 1,500 unique meetings, 22 million words in transcripts, and over 15,000 meeting topics, sourced from both newly collected Speech Interruption Meeting (SIM) data and existing public datasets. Along with the text data, we also open source scripts to generate synthetic meetings or create augmented meetings from the TCR dataset to enhance data diversity. For each data source, benchmarks are created using GPT-4 to evaluate the model accuracy in understanding transcription-topic relevance.
摘要:工作场所会议对于组织协作至关重要,但有很大比例的会议被评定为无效。为了通过理解对话是否切题来提高会议效率,我们创建了一个全面的主题-对话相关性 (Topic-Conversation Relevance, TCR) 数据集,涵盖了多种领域和会议风格。TCR 数据集包含 1,500 场独特的会议、2200 万字的转录文本以及超过 15,000 个会议主题,这些数据来源于新收集的语音中断会议 (Speech Interruption Meeting, SIM) 数据和现有的公开数据集。除了文本数据外,我们还开源了脚本,用于生成合成会议或从 TCR 数据集中创建增强会议,以增强数据多样性。对于每个数据源,我们使用 GPT-4 创建了基准测试,以评估模型在理解转录文本与主题相关性方面的准确性。
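论文中用 GPT-4 评估转录文本与主题的相关性;作为对照,下面给出一个词袋余弦相似度的朴素基线示意(并非论文方法,数据为虚构):

```python
import math
from collections import Counter

def cosine_relevance(topic: str, transcript: str) -> float:
    """词袋余弦相似度,作为主题-对话相关性的一个朴素基线。"""
    a, b = Counter(topic.lower().split()), Counter(transcript.lower().split())
    common = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in common)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

on_topic = cosine_relevance("quarterly budget review",
                            "let us review the budget numbers for the quarter")
off_topic = cosine_relevance("quarterly budget review",
                             "did anyone watch the game last night")
print(on_topic > off_topic)   # True:切题片段得分更高
```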
[NLP-50] Is Our Chatbot Telling Lies? Assessing Correctness of an LLM-based Dutch Support Chatbot
【速读】: 该论文试图解决的问题是:在荷兰语环境且训练数据有限的情况下,如何实时判断基于大型语言模型 (LLMs) 的客服聊天机器人生成的响应是否正确。解决方案的关键在于依据 AFAS 客户支持团队的决策方式定义响应的正确性标准,并结合自然语言生成 (Natural Language Generation) 和自动答案评分系统 (Automated Answer Grading Systems) 来模拟客户支持团队的决策过程。通过这种方法,研究能够在 55% 的情况下自动识别出错误信息,从而验证了自动评估聊天机器人输出正确性的可行性。
链接: https://arxiv.org/abs/2411.00034
作者: Herman Lassche(1 and 2),Michiel Overeem(1),Ayushi Rastogi(2) ((1) AFAS Software, (2) University Groningen)
关键词-EN: Companies support, gain their loyalty, customer support team, live chats, Dutch company aiming
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 10 pages + 2 pages references, 4 figures
点击查看摘要
Abstract:Companies support their customers using live chats and chatbots to gain their loyalty. AFAS is a Dutch company aiming to leverage the opportunity large language models (LLMs) offer to answer customer queries with minimal to no input from its customer support team. Adding to its complexity, it is unclear what makes a response correct, and that too in Dutch. Further, with minimal data available for training, the challenge is to identify whether an answer generated by a large language model is correct and do it on the fly. This study is the first to define the correctness of a response based on how the support team at AFAS makes decisions. It leverages literature on natural language generation and automated answer grading systems to automate the decision-making of the customer support team. We investigated questions requiring a binary response (e.g., Would it be possible to adjust tax rates manually?) or instructions (e.g., How would I adjust tax rate manually?) to test how close our automated approach comes to the support team’s rating. Our approach can identify wrong messages in 55% of the cases. This work shows the viability of automatically assessing when our chatbot tells lies.
摘要:公司通过实时聊天和聊天机器人来支持客户,以提升客户忠诚度。AFAS 是一家荷兰公司,旨在利用大语言模型 (LLMs) 的机会,以最小化甚至无需客户支持团队的介入来回答客户查询。然而,这一过程的复杂性在于,尚不清楚什么样的回答是正确的,尤其是在荷兰语环境中。此外,由于训练数据有限,挑战在于如何实时判断大语言模型生成的答案是否正确。本研究首次基于 AFAS 支持团队决策的标准来定义回答的正确性。它借鉴了自然语言生成和自动评分系统的文献,以自动化客户支持团队的决策过程。我们针对需要二元响应(例如,是否可以手动调整税率?)或指令(例如,如何手动调整税率?)的问题进行了测试,以评估我们的自动化方法与支持团队评分的接近程度。我们的方法在 55% 的情况下能够识别出错误信息。这项工作展示了自动评估聊天机器人何时提供错误信息的可行性。
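论文报告"在 55% 的情况下能识别出错误回复"。这一类指标可按如下方式计算:将自动评估器的判断与支持团队的人工判定对比,统计被识别出的错误回复占比(数据为虚构示例,仅说明计算口径):

```python
def detect_wrong_answers(gold, predicted_correct):
    """统计人工判定为错误的回复中,被自动评估器识别出来的比例。
    gold[i] 为支持团队判定第 i 条回复是否正确;
    predicted_correct[i] 为自动评估器的判断。"""
    wrong_ids = [i for i, g in enumerate(gold) if not g]
    caught = [i for i in wrong_ids if not predicted_correct[i]]
    return len(caught) / len(wrong_ids) if wrong_ids else 1.0

gold              = [True, False, True, False, False]  # 人工判定
predicted_correct = [True, False, True, True,  False]  # 自动评估器判断
recall_wrong = detect_wrong_answers(gold, predicted_correct)
print(f"{recall_wrong:.2f}")   # 0.67:识别出 3 条错误回复中的 2 条
```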
[NLP-51] WikiNER-fr-gold: A Gold-Standard NER Corpus
【速读】: 该论文试图解决多语言命名实体识别语料库 WikiNER 的标注质量问题。解决方案的关键在于提出了WikiNER-fr-gold,即 WikiNER 法语部分的修订版本。具体步骤包括:首先总结每个类别中包含的实体类型以定义标注指南,然后对语料库进行修订,最后分析WikiNER-fr语料库中的错误和不一致性,并讨论未来的工作方向。
链接: https://arxiv.org/abs/2411.00030
作者: Danrun Cao(IRISA, EXPRESSION),Nicolas Béchet(IRISA, UBS, EXPRESSION),Pierre-François Marteau(IRISA, UBS, EXPRESSION)
关键词-EN: Named Entity Recognition, multilingual Named Entity, multilingual Named, Entity Recognition corpus, Named Entity
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注:
点击查看摘要
Abstract:We address in this article the the quality of the WikiNER corpus, a multilingual Named Entity Recognition corpus, and provide a consolidated version of it. The annotation of WikiNER was produced in a semi-supervised manner i.e. no manual verification has been carried out a posteriori. Such corpus is called silver-standard. In this paper we propose WikiNER-fr-gold which is a revised version of the French proportion of WikiNER. Our corpus consists of randomly sampled 20% of the original French sub-corpus (26,818 sentences with 700k tokens). We start by summarizing the entity types included in each category in order to define an annotation guideline, and then we proceed to revise the corpus. Finally we present an analysis of errors and inconsistency observed in the WikiNER-fr corpus, and we discuss potential future work directions.
摘要:本文探讨了多语言命名实体识别语料库 WikiNER 的质量,并提供了一个整合版本。WikiNER 的标注是通过半监督方式生成的,即未进行后验的手动验证。此类语料库被称为银标准语料库。本文提出了 WikiNER-fr-gold,这是 WikiNER 法语部分的一个修订版本。我们的语料库包含原始法语子语料库中随机抽取的 20% 数据(26,818 个句子,包含 700k 个 Token)。我们首先总结了每个类别中包含的实体类型,以定义标注指南,然后对语料库进行修订。最后,我们对 WikiNER-fr 语料库中观察到的错误和不一致性进行了分析,并讨论了潜在的未来工作方向。
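修订银标准语料的一个基础步骤,是机器校验标注序列的合法性。下面是一个针对 IOB 标注的示意性校验函数(非论文原实现):I-X 标签只能跟在 B-X 或 I-X 之后,否则记为非法转移:

```python
def iob_errors(tags):
    """返回非法转移的位置:I-X 只能跟在 B-X 或 I-X 之后。"""
    errors = []
    prev = "O"
    for i, tag in enumerate(tags):
        if tag.startswith("I-"):
            ent = tag[2:]
            if prev not in (f"B-{ent}", f"I-{ent}"):
                errors.append(i)
        prev = tag
    return errors

good = ["B-PER", "I-PER", "O", "B-LOC"]
bad = ["O", "I-PER", "B-LOC", "I-ORG"]
print(iob_errors(good), iob_errors(bad))   # [] [1, 3]
```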
[NLP-52] Preserving Pre-trained Representation Space: On Effectiveness of Prefix-tuning for Large Multi-modal Models EMNLP2024
【速读】: 该论文试图解决在大规模多模态模型(LMMs)中,参数高效微调(PEFT)方法在保留预训练知识的同时如何有效适应下游任务的问题。解决方案的关键在于提出了一种名为Prefix-Tuned PEFT (PT-PEFT)的两步微调策略,该策略首先进行前缀微调(prefix-tuning)以保留预训练的特征表示空间,然后进行PEFT(如Adapter或LoRA)以提升下游任务的性能。实验结果表明,PT-PEFT不仅在图像描述和视觉问答任务中优于传统的PEFT方法,还能更好地保留预训练模型的特征表示空间。
链接: https://arxiv.org/abs/2411.00029
作者: Donghoon Kim,Gusang Lee,Kyuhong Shim,Byonghyo Shim
关键词-EN: Large Multi-modal Models, Large Multi-modal, observed that Large, multi-modal applications, unlocking new possibilities
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Findings of EMNLP 2024
点击查看摘要
Abstract:Recently, we have observed that Large Multi-modal Models (LMMs) are revolutionizing the way machines interact with the world, unlocking new possibilities across various multi-modal applications. To adapt LMMs for downstream tasks, parameter-efficient fine-tuning (PEFT) which only trains additional prefix tokens or modules, has gained popularity. Nevertheless, there has been little analysis of how PEFT works in LMMs. In this paper, we delve into the strengths and weaknesses of each tuning strategy, shifting the focus from the efficiency typically associated with these approaches. We first discover that model parameter tuning methods such as LoRA and Adapters distort the feature representation space learned during pre-training and limit the full utilization of pre-trained knowledge. We also demonstrate that prefix-tuning excels at preserving the representation space, despite its lower performance on downstream tasks. These findings suggest a simple two-step PEFT strategy called Prefix-Tuned PEFT (PT-PEFT), which successively performs prefix-tuning and then PEFT (i.e., Adapter, LoRA), combines the benefits of both. Experimental results show that PT-PEFT not only improves performance in image captioning and visual question answering compared to vanilla PEFT methods but also helps preserve the representation space of the four pre-trained models.
摘要:近年来,我们观察到大型多模态模型 (LMMs) 正在彻底改变机器与世界互动的方式,为各种多模态应用开辟了新的可能性。为了使 LMMs 适应下游任务,参数高效的微调 (PEFT) 方法,即仅训练额外的前缀 Token 或模块,已逐渐流行起来。然而,关于 PEFT 在 LMMs 中的工作机制分析却鲜有研究。本文深入探讨了每种微调策略的优缺点,将关注点从这些方法通常强调的效率转移开来。我们首先发现,诸如 LoRA 和 Adapters 等模型参数微调方法会扭曲预训练过程中学习到的特征表示空间,限制了预训练知识的充分利用。我们还证明,尽管前缀微调在下游任务中的表现较低,但它擅长保留表示空间。基于这些发现,我们提出了一种简单的两步 PEFT 策略,称为前缀微调 PEFT (PT-PEFT),该策略依次进行前缀微调和 PEFT(即 Adapter、LoRA),结合了两者的优势。实验结果表明,PT-PEFT 不仅在图像描述和视觉问答任务中比传统的 PEFT 方法表现更佳,还有助于保留四种预训练模型的表示空间。
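PT-PEFT 的两阶段流程可用如下参数分组计划示意:阶段一只训练前缀参数以保留预训练表示空间,阶段二只训练 LoRA/Adapter 参数以适配下游任务。参数命名完全为假设,仅表达训练计划的结构,并非真实训练代码:

```python
def pt_peft_schedule(params):
    """两阶段 PT-PEFT 训练计划示意:返回各阶段可训练与冻结的参数名。
    骨干网络始终冻结;先训前缀,再训 LoRA。"""
    backbone = [p for p in params if p.startswith("backbone.")]
    prefix   = [p for p in params if p.startswith("prefix.")]
    lora     = [p for p in params if p.startswith("lora.")]
    return [
        {"stage": 1, "trainable": prefix, "frozen": backbone + lora},
        {"stage": 2, "trainable": lora,   "frozen": backbone + prefix},
    ]

# 假设性的参数名,仅用于演示分组
params = ["backbone.layer0.w", "prefix.tokens", "lora.q.A", "lora.q.B"]
sched = pt_peft_schedule(params)
for stage in sched:
    print(stage["stage"], stage["trainable"])
# 1 ['prefix.tokens']
# 2 ['lora.q.A', 'lora.q.B']
```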
[NLP-53] Synergizing LLM Agents and Knowledge Graph for Socioeconomic Prediction in LBSN
【速读】: 该论文试图解决现有基于位置的社交网络数据(LBSN)在社会经济预测中存在的两个主要问题:一是依赖于启发式方法和专家知识提取任务相关知识,可能不适用于特定任务;二是忽视不同指标之间的内在关系,限制了预测精度。解决方案的关键在于将大型语言模型(LLM)代理与知识图谱(KG)相结合,通过构建位置基础知识图谱(LBKG)整合多源LBSN数据,并利用LLM代理的推理能力识别每种社会经济预测任务的相关元路径。此外,设计语义引导的注意力模块进行知识融合,并引入跨任务通信机制,在LLM代理和KG层面实现任务间的知识共享,从而提升预测性能。
链接: https://arxiv.org/abs/2411.00028
作者: Zhilun Zhou,Jingyang Fan,Yu Liu,Fengli Xu,Depeng Jin,Yong Li
关键词-EN: commercial activity estimation, LBSN data, socioeconomic prediction, location-based social networks, heterogeneous LBSN data
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
备注:
点击查看摘要
Abstract:The fast development of location-based social networks (LBSNs) has led to significant changes in society, resulting in popular studies of using LBSN data for socioeconomic prediction, e.g., regional population and commercial activity estimation. Existing studies design various graphs to model heterogeneous LBSN data, and further apply graph representation learning methods for socioeconomic prediction. However, these approaches heavily rely on heuristic ideas and expertise to extract task-relevant knowledge from diverse data, which may not be optimal for specific tasks. Additionally, they tend to overlook the inherent relationships between different indicators, limiting the prediction accuracy. Motivated by the remarkable abilities of large language models (LLMs) in commonsense reasoning, embedding, and multi-agent collaboration, in this work, we synergize LLM agents and knowledge graph for socioeconomic prediction. We first construct a location-based knowledge graph (LBKG) to integrate multi-sourced LBSN data. Then we leverage the reasoning power of LLM agent to identify relevant meta-paths in the LBKG for each type of socioeconomic prediction task, and design a semantic-guided attention module for knowledge fusion with meta-paths. Moreover, we introduce a cross-task communication mechanism to further enhance performance by enabling knowledge sharing across tasks at both LLM agent and KG levels. On the one hand, the LLM agents for different tasks collaborate to generate more diverse and comprehensive meta-paths. On the other hand, the embeddings from different tasks are adaptively merged for better socioeconomic prediction. Experiments on two datasets demonstrate the effectiveness of the synergistic design between LLM and KG, providing insights for information sharing across socioeconomic prediction tasks.
摘要:基于位置的社交网络 (LBSN) 的快速发展已对社会产生了重大影响,促成了利用 LBSN 数据进行社会经济预测(如区域人口和商业活动估算)的研究热潮。现有研究设计了多种图结构来建模异构 LBSN 数据,并进一步应用图表示学习方法进行社会经济预测。然而,这些方法严重依赖于启发式思路和专家知识从多样数据中提取与任务相关的知识,这可能并非特定任务的最佳选择。此外,它们往往忽视不同指标之间的内在联系,限制了预测精度。受大语言模型 (LLM) 在常识推理、嵌入和多智能体协作方面卓越能力的启发,本文提出了一种将 LLM 智能体与知识图谱结合用于社会经济预测的方法。首先,我们构建了一个基于位置的知识图谱 (LBKG) 以整合多源 LBSN 数据。接着,我们利用 LLM 智能体的推理能力,为每类社会经济预测任务在 LBKG 中识别相关元路径,并设计了一个语义引导的注意力模块用于基于元路径的知识融合。此外,我们引入了一种跨任务通信机制,通过在 LLM 智能体和 KG 层面实现任务间的知识共享来进一步提升性能。一方面,不同任务的 LLM 智能体协作生成更多样化和全面的元路径;另一方面,不同任务的嵌入被自适应地合并以实现更优的社会经济预测。在两个数据集上的实验验证了 LLM 与 KG 协同设计的有效性,为社会经济预测任务间的信息共享提供了见解。
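LLM 智能体在 LBKG 模式图上识别元路径的过程,本质上是枚举节点类型之间的关系序列。下面在一个虚构的小型模式图上演示元路径枚举(图结构、节点类型与关系名均为示意性假设,与论文数据无关):

```python
from collections import defaultdict

# 玩具 LBKG 模式图:边为 (头节点类型, 关系, 尾节点类型)
schema_edges = [
    ("Region", "contains", "POI"),
    ("POI", "belongs_to", "Category"),
    ("Region", "adjacent_to", "Region"),
    ("POI", "visited_by", "User"),
]

def meta_paths(start, end, max_len=3):
    """在模式图上枚举从 start 类型到 end 类型、长度不超过 max_len 的元路径。"""
    adj = defaultdict(list)
    for h, r, t in schema_edges:
        adj[h].append((r, t))
    results, stack = [], [(start, [])]
    while stack:
        node, path = stack.pop()
        if path and node == end:
            results.append(path)
        if len(path) < max_len:
            for r, t in adj[node]:
                stack.append((t, path + [r]))
    return results

paths = meta_paths("Region", "Category")
print(paths)
# [['adjacent_to', 'contains', 'belongs_to'], ['contains', 'belongs_to']]
```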
[NLP-54] Personalization of Large Language Models : A Survey
【速读】: 该论文试图解决个性化大型语言模型 (Personalized Large Language Models, LLMs) 研究领域中存在的两个主要方向之间的割裂问题,即个性化文本生成和利用LLMs进行个性化相关的下游应用(如推荐系统)。解决方案的关键在于引入了一个系统的分类法 (taxonomy),该分类法涵盖了个性化LLM使用的各个方面,包括个性化粒度、个性化技术、数据集、评估方法和应用场景。通过这一分类法,论文不仅总结了现有研究的关键差异和挑战,还为未来的研究提供了清晰的框架和指南,从而统一并深化了对个性化LLM的理解和应用。
链接: https://arxiv.org/abs/2411.00027
作者: Zhehao Zhang,Ryan A. Rossi,Branislav Kveton,Yijia Shao,Diyi Yang,Hamed Zamani,Franck Dernoncourt,Joe Barrow,Tong Yu,Sungchul Kim,Ruiyi Zhang,Jiuxiang Gu,Tyler Derr,Hongjie Chen,Junda Wu,Xiang Chen,Zichao Wang,Subrata Mitra,Nedim Lipka,Nesreen Ahmed,Yu Wang
关键词-EN: Large Language Models, Language Models, Large Language, personalized LLMs, recently become increasingly
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Personalization of Large Language Models (LLMs) has recently become increasingly important with a wide range of applications. Despite the importance and recent progress, most existing works on personalized LLMs have focused either entirely on (a) personalized text generation or (b) leveraging LLMs for personalization-related downstream applications, such as recommendation systems. In this work, we bridge the gap between these two separate main directions for the first time by introducing a taxonomy for personalized LLM usage and summarizing the key differences and challenges. We provide a formalization of the foundations of personalized LLMs that consolidates and expands notions of personalization of LLMs, defining and discussing novel facets of personalization, usage, and desiderata of personalized LLMs. We then unify the literature across these diverse fields and usage scenarios by proposing systematic taxonomies for the granularity of personalization, personalization techniques, datasets, evaluation methods, and applications of personalized LLMs. Finally, we highlight challenges and important open problems that remain to be addressed. By unifying and surveying recent research using the proposed taxonomies, we aim to provide a clear guide to the existing literature and different facets of personalization in LLMs, empowering both researchers and practitioners.
摘要:大语言模型 (LLM) 的个性化近年来变得越来越重要,应用范围广泛。尽管其重要性及近期进展显著,但大多数现有关于个性化 LLM 的研究要么完全集中在 (a) 个性化文本生成,要么 (b) 利用 LLM 进行与个性化相关的下游应用,如推荐系统。在本研究中,我们首次尝试弥合这两个主要方向之间的差距,通过引入个性化 LLM 使用的分类法,并总结关键差异和挑战。我们提供了一个个性化 LLM 基础的正式化,该基础整合并扩展了 LLM 个性化的概念,定义并讨论了个性化、使用方式及个性化 LLM 的期望特性。接着,我们通过提出系统的分类法,统一了这些多样领域和使用场景的文献,包括个性化粒度、个性化技术、数据集、评估方法及个性化 LLM 的应用。最后,我们强调了仍需解决的挑战和重要开放问题。通过使用所提出的分类法统一并综述近期研究,我们旨在为现有文献和 LLM 中不同个性化方面提供清晰的指南,从而为研究人员和实践者赋能。
[NLP-55] A Perspective for Adapting Generalist AI to Specialized Medical AI Applications and Their Challenges
【速读】: 该论文旨在解决将大型语言模型(LLMs)应用于医疗领域的复杂性和挑战,并提出一个全面的开发框架。解决方案的关键在于引入一个三步框架:1) 建模(Modeling),将复杂的医疗工作流程分解为可管理的步骤以开发特定于医疗的模型;2) 优化(Optimization),通过精心设计的提示和整合外部知识与工具来优化模型性能;3) 系统工程(System engineering),将复杂任务分解为子任务,并利用人类专家知识构建医疗AI应用。此外,论文还提供了详细的用例手册,描述了各种LLM驱动的医疗AI应用,并讨论了构建医疗AI应用时需要考虑的诸多挑战和问题,如幻觉问题、数据所有权和合规性、隐私、知识产权、计算成本、可持续性和负责任的AI要求。
链接: https://arxiv.org/abs/2411.00024
作者: Zifeng Wang,Hanyin Wang,Benjamin Danek,Ying Li,Christina Mack,Hoifung Poon,Yajun Wang,Pranav Rajpurkar,Jimeng Sun
关键词-EN: Large Language Models, Large Language, sparked widespread interest, healthcare insurance applications, integration of Large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:The integration of Large Language Models (LLMs) into medical applications has sparked widespread interest across the healthcare industry, from drug discovery and development to clinical decision support, assisting telemedicine, medical devices, and healthcare insurance applications. This perspective paper aims to discuss the inner workings of building LLM-powered medical AI applications and introduces a comprehensive framework for their development. We review existing literature and outline the unique challenges of applying LLMs in specialized medical contexts. Additionally, we introduce a three-step framework to organize medical LLM research activities: 1) Modeling: breaking down complex medical workflows into manageable steps for developing medical-specific models; 2) Optimization: optimizing the model performance with crafted prompts and integrating external knowledge and tools, and 3) System engineering: decomposing complex tasks into subtasks and leveraging human expertise for building medical AI applications. Furthermore, we offer a detailed use case playbook that describes various LLM-powered medical AI applications, such as optimizing clinical trial design, enhancing clinical decision support, and advancing medical imaging analysis. Finally, we discuss various challenges and considerations for building medical AI applications with LLMs, such as handling hallucination issues, data ownership and compliance, privacy, intellectual property considerations, compute cost, sustainability issues, and responsible AI requirements.
摘要:将大语言模型 (LLM) 整合到医疗应用中,已在整个医疗行业引起了广泛关注,从药物发现和开发到临床决策支持,再到辅助远程医疗、医疗设备和医疗保险应用。本文旨在探讨构建基于 LLM 的医疗 AI 应用的内部机制,并引入一个全面的开发框架。我们回顾了现有文献,并概述了在专业医疗环境中应用 LLM 所面临的独特挑战。此外,我们提出一个三步框架来组织医疗 LLM 研究活动:1) 建模:将复杂的医疗工作流程分解为可管理的步骤,以开发医疗专用模型;2) 优化:通过精心设计的提示优化模型性能,并整合外部知识和工具;3) 系统工程:将复杂任务分解为子任务,并利用人类专家知识构建医疗 AI 应用。此外,我们提供了一个详细的用例手册,描述了各种基于 LLM 的医疗 AI 应用,如优化临床试验设计、增强临床决策支持和推进医学影像分析。最后,我们讨论了构建基于 LLM 的医疗 AI 应用时所面临的各种挑战和考虑因素,如处理幻觉问题、数据所有权和合规性、隐私、知识产权考虑、计算成本、可持续性问题以及负责任的 AI 要求。
[NLP-56] Pistis-RAG: Enhancing Retrieval-Augmented Generation with Human Feedback
【速读】: This paper tackles the problem that traditional retrieval-augmented generation (RAG) systems cannot guarantee generation quality when relying on semantic relevance alone. The key to the solution is introducing structured human feedback (options to copy, regenerate, or dislike an output) to align large language model (LLM) outputs with human preferences. The paper proposes Pistis-RAG, a content-centric framework that effectively exploits this feedback to improve content ranking and generation quality. Simulating human feedback on public datasets, experiments show that Pistis-RAG improves accuracy by 6.06% on MMLU (English) and by 7.08% on C-EVAL (Chinese), markedly improving alignment with human preferences.
链接: https://arxiv.org/abs/2407.00072
作者: Yu Bai,Yukai Miao,Li Chen,Dawei Wang,Dan Li,Yanyu Ren,Hongtao Xie,Ce Yang,Xuhui Cai
关键词-EN: guarantee improved generation, RAG systems face, semantic relevance, guarantee improved, improved generation quality
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:RAG systems face limitations when semantic relevance alone does not guarantee improved generation quality. This issue becomes particularly evident due to the sensitivity of large language models (LLMs) to the ordering of few-shot prompts, which can affect model performance. To address this challenge, aligning LLM outputs with human preferences using structured feedback, such as options to copy, regenerate, or dislike, offers a promising method for improvement. This feedback is applied to the entire list of inputs rather than giving specific ratings for individual documents, making it a Listwide Labels Learning-to-Rank task. To address this task, we propose Pistis-RAG, a new RAG framework designed with a content-centric approach to better align LLMs with human preferences. Pistis-RAG effectively utilizes human feedback, enhancing content ranking and generation quality. To validate our framework, we use public datasets to simulate human feedback, allowing us to evaluate and refine our method effectively. Experimental results indicate that Pistis-RAG improves alignment with human preferences relative to the baseline RAG system, showing a 6.06% increase in MMLU (English) and a 7.08% increase in C-EVAL (Chinese) accuracy metrics. These results highlight Pistis-RAG’s effectiveness in overcoming the limitations associated with traditional RAG approaches.
摘要:RAG 系统在仅依赖语义相关性时,难以保证生成质量的提升。这一问题在大语言模型 (LLM) 对少样本提示 (few-shot prompts) 的顺序敏感性上尤为明显,这种敏感性会影响模型性能。为应对这一挑战,通过结构化反馈(如复制、重新生成或不喜欢的选项)来使 LLM 输出与人类偏好对齐,提供了一种有前景的改进方法。这种反馈应用于整个输入列表,而非对单个文档进行具体评分,因此它是一个列表级标签学习排序任务。为解决这一任务,我们提出了 Pistis-RAG,这是一种以内容为中心的新型 RAG 框架,旨在更好地使 LLM 与人类偏好对齐。Pistis-RAG 有效利用了人类反馈,提升了内容排序和生成质量。为验证我们的框架,我们使用公共数据集来模拟人类反馈,从而能够有效评估和优化我们的方法。实验结果表明,相对于基线 RAG 系统,Pistis-RAG 在人类偏好对齐方面有所提升,MMLU (英语) 和 C-EVAL (中文) 的准确率分别提高了 6.06% 和 7.08%。这些结果突显了 Pistis-RAG 在克服传统 RAG 方法局限性方面的有效性。
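The listwide feedback loop that Pistis-RAG describes can be sketched in a few lines: one structured signal (copy, regenerate, or dislike) labels the whole presented list rather than any single document, and the reranker learns from it. This is a minimal illustration, not the paper's implementation; the feedback-to-reward mapping, the per-source bias feature, and the learning rate are all hypothetical choices.

```python
# Minimal listwide learning-to-rank sketch: one feedback signal is
# applied to the entire presented list, not to individual documents.
# All names and weights here are illustrative.
FEEDBACK_REWARD = {"copy": 1.0, "regenerate": -1.0, "dislike": -1.0}

class ListwideReranker:
    def __init__(self, sources):
        # learned per-source bias, updated only from list-level feedback
        self.bias = {s: 0.0 for s in sources}

    def score(self, doc):
        # combine retrieval relevance with the learned preference bias
        return doc["relevance"] + self.bias[doc["source"]]

    def rank(self, docs):
        return sorted(docs, key=self.score, reverse=True)

    def update(self, presented_docs, signal, lr=0.2):
        # listwide label: every document in the presented list shares the reward
        reward = FEEDBACK_REWARD[signal]
        for doc in presented_docs:
            self.bias[doc["source"]] += lr * reward
```

A "dislike" on a list then demotes future candidates that share features with that list, which is the sense in which the feedback is a list-level label rather than a per-document rating.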
[NLP-57] Optimizing Contextual Speech Recognition Using Vector Quantization for Efficient Retrieval
【速读】: This paper addresses the computational complexity and memory cost of the contextual biasing mechanism in speech recognition models, especially with large biasing catalogues. The key to the solution is an approximation of cross-attention scoring based on vector quantization, used jointly with a retrieval-based contextual biasing method: an efficient quantized retrieval module first shortlists biasing entries by grounding them in the audio, and the shortlisted entries are then used for biasing. This improves compute and memory efficiency and lets the system exploit catalogues of several thousand entries, achieving up to a 71% relative error-rate reduction in personal entity recognition. Compared with standard dot-product cross-attention, the approximation reduces compute time by 20% and memory usage by 85-95% for lists of up to one million entries.
链接: https://arxiv.org/abs/2411.00664
作者: Nikolaos Flemotomos,Roger Hsiao,Pawel Swietojanski,Dogan Can,Xiaodan Zhuang
关键词-EN: contextually relevant information, Neural contextual biasing, improved transcription accuracy, Neural contextual, speech recognition models
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
备注: 14 pages, 7 figures, submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing
Abstract:Neural contextual biasing allows speech recognition models to leverage contextually relevant information, leading to improved transcription accuracy. However, the biasing mechanism is typically based on a cross-attention module between the audio and a catalogue of biasing entries, which means computational complexity can pose severe practical limitations on the size of the biasing catalogue and consequently on accuracy improvements. This work proposes an approximation to cross-attention scoring based on vector quantization and enables compute- and memory-efficient use of large biasing catalogues. We propose to use this technique jointly with a retrieval based contextual biasing approach. First, we use an efficient quantized retrieval module to shortlist biasing entries by grounding them on audio. Then we use retrieved entries for biasing. Since the proposed approach is agnostic to the biasing method, we investigate using full cross-attention, LLM prompting, and a combination of the two. We show that retrieval based shortlisting allows the system to efficiently leverage biasing catalogues of several thousands of entries, resulting in up to 71% relative error rate reduction in personal entity recognition. At the same time, the proposed approximation algorithm reduces compute time by 20% and memory usage by 85-95%, for lists of up to one million entries, when compared to standard dot-product cross-attention.
摘要:神经上下文偏置使得语音识别模型能够利用上下文相关信息,从而提高转录准确性。然而,偏置机制通常基于音频与偏置条目目录之间的交叉注意力模块,这意味着计算复杂性可能会对偏置目录的大小造成严重的实际限制,进而影响准确性的提升。本文提出了一种基于向量量化的交叉注意力评分近似方法,并实现了对大型偏置目录的计算和内存高效利用。我们建议将此技术与基于检索的上下文偏置方法联合使用。首先,我们使用高效的量化检索模块,通过将偏置条目与音频对齐来筛选出候选条目。然后,我们使用检索到的条目进行偏置。由于所提出的方法对偏置方法不敏感,我们研究了使用全交叉注意力、大语言模型提示以及两者的结合。我们展示了基于检索的筛选方法使得系统能够高效利用包含数千条目的偏置目录,从而在个人实体识别中实现了高达71%的相对错误率降低。同时,与标准的点积交叉注意力相比,所提出的近似算法在处理多达一百万条目的列表时,计算时间减少了20%,内存使用量减少了85-95%。
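The shortlist-then-score idea above can be sketched with a toy codebook: quantize the biasing-entry embeddings with k-means, score the audio query against centroids only, then compute exact dot-product scores just for entries in the top clusters. A simplified illustration under invented dimensions, not the paper's system:

```python
import numpy as np

def kmeans(x, k, iters=20, seed=0):
    """Naive k-means codebook over biasing-entry embeddings."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = dist.argmin(1)
        for j in range(k):
            if (assign == j).any():
                centroids[j] = x[assign == j].mean(0)
    return centroids, assign

def shortlist_and_score(query, entries, centroids, assign, top_c=2):
    """Stage 1: score centroids instead of all entries.
    Stage 2: exact dot-product scores on the shortlist only."""
    coarse = centroids @ query
    keep = np.argsort(coarse)[-top_c:]
    idx = np.where(np.isin(assign, keep))[0]
    return idx, entries[idx] @ query
```

The saving comes from stage 1 touching k centroids instead of all catalogue entries; exact scores on the shortlist are unchanged from full cross-attention scoring.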
[NLP-58] Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations
【速读】: This paper addresses the neglect of experimental analysis and planning in evaluations of large language models (LLMs). The key to the solution is to conceptualize evaluation questions as draws from an unseen super-population and to provide formulas for analyzing evaluation data, measuring the difference between two models, and planning evaluation experiments. The paper also makes concrete recommendations for minimizing statistical noise and maximizing the informativeness of experimental results.
链接: https://arxiv.org/abs/2411.00640
作者: Evan Miller
关键词-EN: large language models, language model evaluations, critical for understanding, understanding the capabilities, capabilities of large
类目: Applications (stat.AP); Computation and Language (cs.CL)
备注: 14 pages
Abstract:Evaluations are critical for understanding the capabilities of large language models (LLMs). Fundamentally, evaluations are experiments; but the literature on evaluations has largely ignored the literature from other sciences on experiment analysis and planning. This article shows researchers with some training in statistics how to think about and analyze data from language model evaluations. Conceptualizing evaluation questions as having been drawn from an unseen super-population, we present formulas for analyzing evaluation data, measuring differences between two models, and planning an evaluation experiment. We make a number of specific recommendations for running language model evaluations and reporting experiment results in a way that minimizes statistical noise and maximizes informativeness.
摘要:评估对于理解大语言模型 (LLM) 的能力至关重要。从根本上说,评估是实验;然而,关于评估的文献在很大程度上忽略了来自其他科学领域的实验分析和规划文献。本文向具有统计学基础的研究人员展示了如何思考和分析来自语言模型评估的数据。我们将评估问题概念化为从不可见的超总体中抽取的样本,并提出了分析评估数据、测量两个模型之间差异以及规划评估实验的公式。我们提出了多项具体建议,以在运行语言模型评估和报告实验结果时,最大限度地减少统计噪声并最大化信息量。
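Under the super-population view, a model's mean eval score gets a standard error, and a difference between two models graded on the same questions gets a paired confidence interval, which is narrower than an unpaired one. A sketch of the standard estimators (not code from the paper):

```python
import math

def mean_and_se(scores):
    """Sample mean and its standard error, treating questions as
    draws from an unseen super-population."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return mean, math.sqrt(var / n)

def paired_diff_ci(scores_a, scores_b, z=1.96):
    """Approximate 95% CI for the mean score difference of two models
    evaluated on the same questions (pairing reduces variance)."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean, se = mean_and_se(diffs)
    return mean - z * se, mean + z * se
```

If the interval for the difference covers zero, the eval has not distinguished the two models at that sample size.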
[NLP-59] LLM4Mat-Bench: Benchmarking Large Language Models for Materials Property Prediction NEURIPS2024
【速读】: This paper addresses the lack of standardized evaluation and benchmarking for applying large language models (LLMs) in materials science. The key to the solution is LLM4Mat-Bench, the largest benchmark to date for evaluating LLMs on predicting the properties of crystalline materials. LLM4Mat-Bench contains about 1.9M crystal structures from 10 publicly available materials data sources, covers 45 distinct properties, and supports three input modalities: crystal composition, CIF files, and crystal text descriptions. Using LLM4Mat-Bench, the authors fine-tune models of different sizes (such as LLM-Prop and MatBERT) and evaluate the property-prediction ability of chat-style LLMs (such as Llama, Gemma, and Mistral) with zero-shot and few-shot prompts. The results highlight the challenges general-purpose LLMs face in materials science and the need for task-specific predictive models and task-specific instruction-tuned LLMs for materials property prediction.
链接: https://arxiv.org/abs/2411.00177
作者: Andre Niyongabo Rubungo,Kangming Li,Jason Hattrick-Simpers,Adji Bousso Dieng
关键词-EN: Large language models, Large language, materials property prediction, materials, Large
类目: Materials Science (cond-mat.mtrl-sci); Computation and Language (cs.CL)
备注: Accepted at NeurIPS 2024-AI4Mat Workshop. The Benchmark and code can be found at: this https URL
Abstract:Large language models (LLMs) are increasingly being used in materials science. However, little attention has been given to benchmarking and standardized evaluation for LLM-based materials property prediction, which hinders progress. We present LLM4Mat-Bench, the largest benchmark to date for evaluating the performance of LLMs in predicting the properties of crystalline materials. LLM4Mat-Bench contains about 1.9M crystal structures in total, collected from 10 publicly available materials data sources, and 45 distinct properties. LLM4Mat-Bench features different input modalities: crystal composition, CIF, and crystal text description, with 4.7M, 615.5M, and 3.1B tokens in total for each modality, respectively. We use LLM4Mat-Bench to fine-tune models with different sizes, including LLM-Prop and MatBERT, and provide zero-shot and few-shot prompts to evaluate the property prediction capabilities of LLM-chat-like models, including Llama, Gemma, and Mistral. The results highlight the challenges of general-purpose LLMs in materials science and the need for task-specific predictive models and task-specific instruction-tuned LLMs in materials property prediction.
摘要:大语言模型(LLMs)在材料科学中的应用日益增多。然而,针对基于 LLM 的材料性质预测的基准测试和标准化评估却鲜有关注,这阻碍了该领域的进展。我们提出了 LLM4Mat-Bench,这是迄今为止用于评估 LLMs 在预测晶体材料性质方面性能的最大基准。LLM4Mat-Bench 总共包含约 190 万个晶体结构,收集自 10 个公开的材料数据源,涵盖 45 种不同的性质。LLM4Mat-Bench 具有不同的输入模态:晶体成分、CIF 文件和晶体文本描述,每种模态分别包含 470 万、6.155 亿和 31 亿个 Token。我们使用 LLM4Mat-Bench 对不同规模的模型进行微调,包括 LLM-Prop 和 MatBERT,并提供零样本和少样本提示,以评估 Llama、Gemma 和 Mistral 等类似 LLM-chat 模型的性质预测能力。结果突显了通用 LLMs 在材料科学中的挑战,以及在材料性质预测中需要任务特定的预测模型和任务特定指令微调的 LLMs。
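Few-shot property-prediction prompts of the kind used to probe chat-style models can be assembled from composition-value pairs. The template below is purely illustrative; it is not the benchmark's published prompt format:

```python
def build_fewshot_prompt(property_name, unit, examples, query_composition):
    """examples: list of (composition, value) pairs shown in-context.
    Format is a hypothetical stand-in for the benchmark's template."""
    lines = [f"Predict the {property_name} (in {unit}) of a crystalline "
             f"material from its composition."]
    for comp, val in examples:
        lines.append(f"Composition: {comp}\n{property_name}: {val} {unit}")
    # leave the final value blank for the model to complete
    lines.append(f"Composition: {query_composition}\n{property_name}:")
    return "\n\n".join(lines)
```

Zero-shot evaluation is the same call with an empty `examples` list.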
[NLP-60] Device-Directed Speech Detection for Follow-up Conversations Using Large Language Models
【速读】: This paper addresses accurate device-directed speech detection (DDSD) in follow-up conversations with virtual assistants (VAs), which is needed for a more natural user experience. The key to the solution is to model the first query with a large language model (LLM) when making inferences about follow-up queries, either by prompting a pretrained LLM or by fitting a binary classifier on top of it. The method also exploits automatic speech recognition (ASR) uncertainty when designing the LLM prompts. On a real-world dataset it achieves large gains (a 20-40% reduction in false alarms at a fixed 10% false-reject rate), thanks to jointly modeling the previous speech context and ASR uncertainty.
链接: https://arxiv.org/abs/2411.00023
作者: Oggi Rudovic,Pranay Dighe,Yi Su,Vineet Garg,Sameer Dharur,Xiaochuan Niu,Ahmed H. Abdelaziz,Saurabah Adya,Ahmed Tewfik
关键词-EN: Device-directed Speech Detection, virtual assistants, seamlessly interact, repeatedly invoke, accurate Device-directed Speech
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
备注:
Abstract:Follow-up conversations with virtual assistants (VAs) enable a user to seamlessly interact with a VA without the need to repeatedly invoke it using a keyword (after the first query). Therefore, accurate Device-directed Speech Detection (DDSD) from the follow-up queries is critical for enabling naturalistic user experience. To this end, we explore the notion of Large Language Models (LLMs) and model the first query when making inference about the follow-ups (based on the ASR-decoded text), via prompting of a pretrained LLM, or by adapting a binary classifier on top of the LLM. In doing so, we also exploit the ASR uncertainty when designing the LLM prompts. We show on the real-world dataset of follow-up conversations that this approach yields large gains (20-40% reduction in false alarms at 10% fixed false rejects) due to the joint modeling of the previous speech context and ASR uncertainty, compared to when follow-ups are modeled alone.
摘要:与虚拟助手(Virtual Assistants, VAs)的后续对话使得用户能够在首次查询后无需重复使用关键词即可无缝地与VA进行交互。因此,从后续查询中准确检测设备定向语音(Device-directed Speech Detection, DDSD)对于实现自然用户体验至关重要。为此,我们探讨了大语言模型(Large Language Models, LLMs)的概念,并在对后续查询进行推理时(基于自动语音识别(ASR)解码的文本),通过提示预训练的LLM或在其基础上调整二元分类器来建模首次查询。在此过程中,我们还利用了ASR的不确定性来设计LLM提示。我们在真实世界的后续对话数据集上展示了这种方法的效果,由于联合建模了先前的语音上下文和ASR不确定性,与仅单独建模后续查询相比,这种方法在固定10%的误拒率下显著减少了20-40%的误报。
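Folding ASR uncertainty into the prompt amounts to showing the LLM the n-best hypotheses with their confidences alongside the first query. A hypothetical prompt builder (the paper does not publish its template, so every string here is an assumption):

```python
def build_ddsd_prompt(first_query, nbest):
    """first_query: the user's initial request to the assistant.
    nbest: list of (hypothesis, confidence) pairs from the ASR decoder."""
    lines = [
        "A user previously asked the assistant: " + repr(first_query),
        "The ASR system transcribed the follow-up utterance as (n-best):",
    ]
    for i, (hyp, conf) in enumerate(nbest, 1):
        lines.append(f"{i}. {hyp!r} (confidence {conf:.2f})")
    lines.append("Is the follow-up directed at the assistant? Answer yes or no.")
    return "\n".join(lines)
```

A binary classifier on the LLM's hidden state would consume the same context; the prompt route simply asks for the label directly.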
[NLP-61] SFM-Protein: Integrative Co-evolutionary Pre-training for Advanced Protein Sequence Representation
【速读】: This paper addresses the problem of modeling the relationship between protein structures and their amino acid sequences, in particular the failure of traditional protein foundation models to capture co-evolutionary information. The key to the solution is a new pre-training strategy that emphasizes interactions among amino acid residues, strengthening the extraction of both short-range and long-range co-evolutionary features from sequence data. Trained on a large-scale protein sequence dataset, the model shows superior generalization, outperforming established baselines of similar size, including the ESM model, across diverse downstream tasks, marking a significant step forward in protein sequence-based modeling.
链接: https://arxiv.org/abs/2410.24022
作者: Liang He,Peiran Jin,Yaosen Min,Shufang Xie,Lijun Wu,Tao Qin,Xiaozhuan Liang,Kaiyuan Gao,Yuliang Jiang,Tie-Yan Liu
关键词-EN: perform functions intricately, functions intricately linked, essential to biological, biological systems, perform functions
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Proteins, essential to biological systems, perform functions intricately linked to their three-dimensional structures. Understanding the relationship between protein structures and their amino acid sequences remains a core challenge in protein modeling. While traditional protein foundation models benefit from pre-training on vast unlabeled datasets, they often struggle to capture critical co-evolutionary information, which evolutionary-based methods excel at. In this study, we introduce a novel pre-training strategy for protein foundation models that emphasizes the interactions among amino acid residues to enhance the extraction of both short-range and long-range co-evolutionary features from sequence data. Trained on a large-scale protein sequence dataset, our model demonstrates superior generalization ability, outperforming established baselines of similar size, including the ESM model, across diverse downstream tasks. Experimental results confirm the model’s effectiveness in integrating co-evolutionary information, marking a significant step forward in protein sequence-based modeling.
摘要:蛋白质是生物系统中的关键组成部分,其功能与其三维结构密切相关。理解蛋白质结构与其氨基酸序列之间的关系仍然是蛋白质建模中的核心挑战。尽管传统的蛋白质基础模型得益于在大量未标记数据上的预训练,但它们往往难以捕捉到关键的共进化信息,而基于进化的方法在这方面表现出色。在本研究中,我们提出了一种新的蛋白质基础模型预训练策略,该策略强调氨基酸残基之间的相互作用,以增强从序列数据中提取短程和长程共进化特征的能力。我们的模型在大规模蛋白质序列数据集上进行训练,展示了卓越的泛化能力,在各种下游任务中均优于类似规模的现有基线模型,包括 ESM 模型。实验结果证实了该模型在整合共进化信息方面的有效性,标志着基于蛋白质序列的建模取得了显著进展。
Artificial Intelligence
[AI-0] LogiCity: Advancing Neuro-Symbolic AI with Abstract Urban Simulation
链接: https://arxiv.org/abs/2411.00773
作者: Bowen Li,Zhaoyu Li,Qiwei Du,Jinqi Luo,Wenshan Wang,Yaqi Xie,Simon Stepputtis,Chen Wang,Katia P. Sycara,Pradeep Kumar Ravikumar,Alexander G. Gray,Xujie Si,Sebastian Scherer
关键词-EN: deep neural networks, Recent years, integrate symbolic reasoning, development of Neuro-Symbolic, neural networks
类目: Artificial Intelligence (cs.AI)
备注: 25 pages, 8 figures
Abstract:Recent years have witnessed the rapid development of Neuro-Symbolic (NeSy) AI systems, which integrate symbolic reasoning into deep neural networks. However, most of the existing benchmarks for NeSy AI fail to provide long-horizon reasoning tasks with complex multi-agent interactions. Furthermore, they are usually constrained by fixed and simplistic logical rules over limited entities, making them far from real-world complexities. To address these crucial gaps, we introduce LogiCity, the first simulator based on customizable first-order logic (FOL) for an urban-like environment with multiple dynamic agents. LogiCity models diverse urban elements using semantic and spatial concepts, such as IsAmbulance(X) and IsClose(X, Y). These concepts are used to define FOL rules that govern the behavior of various agents. Since the concepts and rules are abstractions, they can be universally applied to cities with any agent compositions, facilitating the instantiation of diverse scenarios. Besides, a key feature of LogiCity is its support for user-configurable abstractions, enabling customizable simulation complexities for logical reasoning. To explore various aspects of NeSy AI, LogiCity introduces two tasks, one features long-horizon sequential decision-making, and the other focuses on one-step visual reasoning, varying in difficulty and agent behaviors. Our extensive evaluation reveals the advantage of NeSy frameworks in abstract reasoning. Moreover, we highlight the significant challenges of handling more complex abstractions in long-horizon multi-agent scenarios or under high-dimensional, imbalanced data. With its flexible design, various features, and newly raised challenges, we believe LogiCity represents a pivotal step forward in advancing the next generation of NeSy AI. All the code and data are open-sourced at our website.
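The grounded first-order rules LogiCity describes, e.g. IsAmbulance(X) ∧ IsClose(X, Y) → MustYield(Y), can be evaluated over a finite set of entities by plain enumeration. A toy grounding with made-up agents, not the simulator's actual rule engine:

```python
def must_yield(entities, is_ambulance, is_close):
    """Ground the FOL rule IsAmbulance(X) & IsClose(X, Y) -> MustYield(Y).
    is_ambulance: set of entities; is_close: set of (X, Y) pairs."""
    yielders = set()
    for x in entities:
        if x not in is_ambulance:
            continue
        for y in entities:
            if y != x and (x, y) in is_close:
                yielders.add(y)
    return yielders
```

Because the predicates are abstractions over concrete agents, the same rule applies unchanged to any city instantiation, which is the point of the FOL layer.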
[AI-1] GameGen-X: Interactive Open-world Game Video Generation
链接: https://arxiv.org/abs/2411.00769
作者: Haoxuan Che,Xuanhua He,Quande Liu,Cheng Jin,Hao Chen
关键词-EN: interactively controlling open-world, diffusion transformer model, transformer model specifically, diffusion transformer, generating and interactively
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project Page: this https URL
Abstract:We introduce GameGen-X, the first diffusion transformer model specifically designed for both generating and interactively controlling open-world game videos. This model facilitates high-quality, open-domain generation by simulating an extensive array of game engine features, such as innovative characters, dynamic environments, complex actions, and diverse events. Additionally, it provides interactive controllability, predicting and altering future content based on the current clip, thus allowing for gameplay simulation. To realize this vision, we first collected and built an Open-World Video Game Dataset from scratch. It is the first and largest dataset for open-world game video generation and control, which comprises over a million diverse gameplay video clips sampling from over 150 games with informative captions from GPT-4o. GameGen-X undergoes a two-stage training process, consisting of foundation model pre-training and instruction tuning. Firstly, the model was pre-trained via text-to-video generation and video continuation, endowing it with the capability for long-sequence, high-quality open-domain game video generation. Further, to achieve interactive controllability, we designed InstructNet to incorporate game-related multi-modal control signal experts. This allows the model to adjust latent representations based on user inputs, unifying character interaction and scene content control for the first time in video generation. During instruction tuning, only the InstructNet is updated while the pre-trained foundation model is frozen, enabling the integration of interactive controllability without loss of diversity and quality of generated video content.
[AI-2] Multi-Agent Deep Q-Network with Layer-based Communication Channel for Autonomous Internal Logistics Vehicle Scheduling in Smart Manufacturing
链接: https://arxiv.org/abs/2411.00728
作者: Mohammad Feizabadi,Arman Hosseini,Zakaria Yahouni
关键词-EN: optimizing operational efficiency, autonomous internal logistic, internal logistic vehicles, scheduling autonomous internal, operational efficiency
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注: Accepted for the 5th IFAC/INSTICC INTERNATIONAL CONFERENCE ON INNOVATIVE INTELLIGENT INDUSTRIAL PRODUCTION AND LOGISTICS
Abstract:In smart manufacturing, scheduling autonomous internal logistic vehicles is crucial for optimizing operational efficiency. This paper proposes a multi-agent deep Q-network (MADQN) with a layer-based communication channel (LBCC) to address this challenge. The main goals are to minimize total job tardiness, reduce the number of tardy jobs, and lower vehicle energy consumption. The method is evaluated against nine well-known scheduling heuristics, demonstrating its effectiveness in handling dynamic job shop behaviors like job arrivals and workstation unavailabilities. The approach also proves scalable, maintaining performance across different layouts and larger problem instances, highlighting the robustness and adaptability of MADQN with LBCC in smart manufacturing.
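The three stated objectives (total tardiness, number of tardy jobs, energy consumption) can be folded into a single scalar reward for the Q-learning agents. The weights below are arbitrary placeholders, not values from the paper:

```python
def logistics_reward(total_tardiness, n_tardy_jobs, energy_kwh,
                     w_tard=1.0, w_count=0.5, w_energy=0.1):
    """Negative weighted cost: maximizing this reward jointly minimizes
    tardiness, the tardy-job count, and vehicle energy use.
    The weights are illustrative trade-off choices."""
    return -(w_tard * total_tardiness
             + w_count * n_tardy_jobs
             + w_energy * energy_kwh)
```

Each MADQN agent would receive this signal after dispatching decisions, so a schedule that finishes every job on time with low energy use scores highest.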
[AI-3] B-cosification: Transforming Deep Neural Networks to be Inherently Interpretable NEURIPS
链接: https://arxiv.org/abs/2411.00715
作者: Shreyash Arya,Sukrut Rao,Moritz Böhle,Bernt Schiele
关键词-EN: architecturally enforcing stronger, enforcing stronger alignment, obtaining highly human, effective for obtaining, decisions by architecturally
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 31 pages, 9 figures, 12 tables, Neural Information Processing Systems (NeurIPS) 2024
Abstract:B-cos Networks have been shown to be effective for obtaining highly human interpretable explanations of model decisions by architecturally enforcing stronger alignment between inputs and weight. B-cos variants of convolutional networks (CNNs) and vision transformers (ViTs), which primarily replace linear layers with B-cos transformations, perform competitively to their respective standard variants while also yielding explanations that are faithful by design. However, it has so far been necessary to train these models from scratch, which is increasingly infeasible in the era of large, pre-trained foundation models. In this work, inspired by the architectural similarities in standard DNNs and B-cos networks, we propose ‘B-cosification’, a novel approach to transform existing pre-trained models to become inherently interpretable. We perform a thorough study of design choices to perform this conversion, both for convolutional neural networks and vision transformers. We find that B-cosification can yield models that are on par with B-cos models trained from scratch in terms of interpretability, while often outperforming them in terms of classification performance at a fraction of the training cost. Subsequently, we apply B-cosification to a pretrained CLIP model, and show that, even with limited data and compute cost, we obtain a B-cosified version that is highly interpretable and competitive on zero shot performance across a variety of datasets. We release our code and pre-trained model weights at this https URL.
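The B-cos transform that replaces each linear layer scales the unit-norm linear response by |cos(x, w)|^(B-1), so a unit only responds strongly when the input aligns with its weight vector; that alignment pressure is what makes the explanations faithful by design. A dependency-free single-layer sketch (B = 2 is a common choice in B-cos networks; the helper names are ours):

```python
import math

def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def _norm(u):
    return math.sqrt(_dot(u, u))

def bcos_linear(x, weights, b=2.0, eps=1e-9):
    """B-cos transform for one layer: each weight row is unit-normalized,
    and its linear response is damped by |cos(x, w)|^(B-1)."""
    out = []
    for w in weights:
        w_hat = [wi / (_norm(w) + eps) for wi in w]
        lin = _dot(w_hat, x)               # equals ||x|| * cos(x, w)
        cos = lin / (_norm(x) + eps)
        out.append(abs(cos) ** (b - 1.0) * lin)
    return out
```

With b = 1 this reduces to an ordinary linear layer with normalized weights; larger B suppresses misaligned directions more sharply.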
[AI-4] Learning in Markov Games with Adaptive Adversaries: Policy Regret Fundamental Barriers and Efficient Algorithms NEURIPS’24
链接: https://arxiv.org/abs/2411.00707
作者: Thanh Nguyen-Tang,Raman Arora
关键词-EN: dynamically evolving environment, evolving environment modeled, Markov games focus, Markov game, dynamically evolving
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (stat.ML)
备注: NeurIPS’24
Abstract:We study learning in a dynamically evolving environment modeled as a Markov game between a learner and a strategic opponent that can adapt to the learner’s strategies. While most existing works in Markov games focus on external regret as the learning objective, external regret becomes inadequate when the adversaries are adaptive. In this work, we focus on \emphpolicy regret – a counterfactual notion that aims to compete with the return that would have been attained if the learner had followed the best fixed sequence of policy, in hindsight. We show that if the opponent has unbounded memory or if it is non-stationary, then sample-efficient learning is not possible. For memory-bounded and stationary, we show that learning is still statistically hard if the set of feasible strategies for the learner is exponentially large. To guarantee learnability, we introduce a new notion of \emphconsistent adaptive adversaries, wherein, the adversary responds similarly to similar strategies of the learner. We provide algorithms that achieve \sqrtT policy regret against memory-bounded, stationary, and consistent adversaries.
[AI-5] Algorithmic Transparency in Forecasting Support Systems
链接: https://arxiv.org/abs/2411.00699
作者: Leif Feddersen
关键词-EN: Forecasting Support Systems, Support Systems, FSS, adjustments, Forecasting Support
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:Most organizations adjust their statistical forecasts (e.g. on sales) manually. Forecasting Support Systems (FSS) enable the related process of automated forecast generation and manual adjustments. As the FSS user interface connects user and statistical algorithm, it is an obvious lever for facilitating beneficial adjustments whilst discouraging harmful adjustments. This paper reviews and organizes the literature on judgemental forecasting, forecast adjustments, and FSS design. I argue that algorithmic transparency may be a key factor towards better, integrative forecasting and test this assertion with three FSS designs that vary in their degrees of transparency based on time series decomposition. I find transparency to reduce the variance and amount of harmful forecast adjustments. Letting users adjust the algorithm’s transparent components themselves, however, leads to widely varied and overall most detrimental adjustments. Responses indicate a risk of overwhelming users with algorithmic transparency without adequate training. Accordingly, self-reported satisfaction is highest with a non-transparent FSS.
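A transparent FSS of the kind compared in the study exposes the statistical forecast as separate components. A minimal additive decomposition (centered moving-average trend, periodic means as seasonality) that could back such an interface; this is illustrative, not the study's system:

```python
def decompose_additive(y, period):
    """Split a series into trend + seasonal + remainder (additive model).
    Edge windows are shortened rather than dropped, a simplification."""
    n = len(y)
    half = period // 2
    trend = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        window = y[lo:hi]
        trend.append(sum(window) / len(window))
    detrended = [yi - ti for yi, ti in zip(y, trend)]
    seasonal_means = [sum(detrended[i::period]) / len(detrended[i::period])
                      for i in range(period)]
    seasonal = [seasonal_means[i % period] for i in range(n)]
    remainder = [yi - ti - si for yi, ti, si in zip(y, trend, seasonal)]
    return trend, seasonal, remainder
```

Showing the user these three series separately, rather than one opaque forecast line, is the transparency lever the paper manipulates across its FSS designs.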
[AI-6] CTPD: Cross-Modal Temporal Pattern Discovery for Enhanced Multimodal Electronic Health Records Analysis
链接: https://arxiv.org/abs/2411.00696
作者: Fuying Wang,Feng Wu,Yihan Tang,Lequan Yu
关键词-EN: Electronic Health Records, Integrating multimodal Electronic, multimodal Electronic Health, numerical time series, Health Records
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Technical report
Abstract:Integrating multimodal Electronic Health Records (EHR) data, such as numerical time series and free-text clinical reports, has great potential in predicting clinical outcomes. However, prior work has primarily focused on capturing temporal interactions within individual samples and fusing multimodal information, overlooking critical temporal patterns across patients. These patterns, such as trends in vital signs like abnormal heart rate or blood pressure, can indicate deteriorating health or an impending critical event. Similarly, clinical notes often contain textual descriptions that reflect these patterns. Identifying corresponding temporal patterns across different modalities is crucial for improving the accuracy of clinical outcome predictions, yet it remains a challenging task. To address this gap, we introduce a Cross-Modal Temporal Pattern Discovery (CTPD) framework, designed to efficiently extract meaningful cross-modal temporal patterns from multimodal EHR data. Our approach introduces shared initial temporal pattern representations which are refined using slot attention to generate temporal semantic embeddings. To ensure rich cross-modal temporal semantics in the learned patterns, we introduce a contrastive-based TPNCE loss for cross-modal alignment, along with two reconstruction losses to retain core information of each modality. Evaluations on two clinically critical tasks, 48-hour in-hospital mortality and 24-hour phenotype classification, using the MIMIC-III database demonstrate the superiority of our method over existing approaches.
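The slot-attention refinement CTPD builds on can be sketched without the learned projections, LayerNorm, or GRU update of the full module: attention is normalized over slots (so slots compete for inputs), then each slot becomes the weighted mean of the inputs it wins. A stripped-down illustration:

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention(inputs, slots, iters=3, eps=1e-8):
    """inputs: (n, d) token features; slots: (k, d) initial patterns.
    Simplified: no learned q/k/v projections or GRU, unlike the real module."""
    for _ in range(iters):
        attn = softmax(inputs @ slots.T, axis=1)           # compete over slots
        attn = attn / (attn.sum(axis=0, keepdims=True) + eps)
        slots = attn.T @ inputs                            # weighted mean update
    return slots
```

In CTPD's setting the inputs would be time-series and text embeddings and the slots the shared temporal-pattern representations; here the slots simply converge toward the cluster structure of the inputs.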
[AI-7] AI-based traffic analysis in digital twin networks
链接: https://arxiv.org/abs/2411.00681
作者: Sarah Al-Shareeda,Khayal Huseynov,Lal Verda Cakir,Craig Thomson,Mehmet Ozdem,Berk Canberk
关键词-EN: Digital Twin Networks, Networks Digital Twins, Digital Twin, Twin Networks, optimize physical networks
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注: Chapter 4: Digital Twins for 6G: Fundamental theory, technology and applications; pp. 83-132
Abstract:In today’s networked world, Digital Twin Networks (DTNs) are revolutionizing how we understand and optimize physical networks. These networks, also known as ‘Digital Twin Networks (DTNs)’ or ‘Networks Digital Twins (NDTs),’ encompass many physical networks, from cellular and wireless to optical and satellite. They leverage computational power and AI capabilities to provide virtual representations, leading to highly refined recommendations for real-world network challenges. Within DTNs, tasks include network performance enhancement, latency optimization, energy efficiency, and more. To achieve these goals, DTNs utilize AI tools such as Machine Learning (ML), Deep Learning (DL), Reinforcement Learning (RL), Federated Learning (FL), and graph-based approaches. However, data quality, scalability, interpretability, and security challenges necessitate strategies prioritizing transparency, fairness, privacy, and accountability. This chapter delves into the world of AI-driven traffic analysis within DTNs. It explores DTNs’ development efforts, tasks, AI models, and challenges while offering insights into how AI can enhance these dynamic networks. Through this journey, readers will gain a deeper understanding of the pivotal role AI plays in the ever-evolving landscape of networked systems.
[AI-8] Beyond the Boundaries of Proximal Policy Optimization
链接: https://arxiv.org/abs/2411.00666
作者: Charlie B. Tan,Edan Toledo,Benjamin Ellis,Jakob N. Foerster,Ferenc Huszár
关键词-EN: Proximal policy optimization, on-policy reinforcement learning, Proximal policy, widely-used algorithm, algorithm for on-policy
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Proximal policy optimization (PPO) is a widely-used algorithm for on-policy reinforcement learning. This work offers an alternative perspective of PPO, in which it is decomposed into the inner-loop estimation of update vectors, and the outer-loop application of updates using gradient ascent with unity learning rate. Using this insight we propose outer proximal policy optimization (outer-PPO); a framework wherein these update vectors are applied using an arbitrary gradient-based optimizer. The decoupling of update estimation and update application enabled by outer-PPO highlights several implicit design choices in PPO that we challenge through empirical investigation. In particular we consider non-unity learning rates and momentum applied to the outer loop, and a momentum-bias applied to the inner estimation loop. Methods are evaluated against an aggressively tuned PPO baseline on Brax, Jumanji and MinAtar environments; non-unity learning rates and momentum both achieve statistically significant improvement on Brax and Jumanji, given the same hyperparameter tuning budget.
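The decomposition is easy to state in code: the inner loop produces a PPO update vector Δθ = θ_inner − θ, and the outer loop applies it with its own optimizer. With lr = 1 and no momentum this recovers standard PPO; other settings give the outer-PPO variants the paper studies. A schematic, parameters-as-lists sketch:

```python
class OuterOptimizer:
    """Applies inner-loop update vectors with an outer learning rate and
    heavy-ball momentum (the implicit unity-lr choice in vanilla PPO)."""
    def __init__(self, lr=1.0, momentum=0.0):
        self.lr, self.momentum, self.v = lr, momentum, None

    def step(self, theta, theta_inner):
        # inner PPO run ends at theta_inner; the update vector is the gap
        delta = [ti - t for t, ti in zip(theta, theta_inner)]
        if self.v is None:
            self.v = [0.0] * len(delta)
        self.v = [self.momentum * vi + di for vi, di in zip(self.v, delta)]
        return [t + self.lr * vi for t, vi in zip(theta, self.v)]
```

Swapping this class for any gradient-based optimizer is exactly the decoupling of update estimation from update application that outer-PPO exposes.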
[AI-9] MoNTA: Accelerating Mixture-of-Experts Training with Network-Traffic-Aware Parallel Optimization
链接: https://arxiv.org/abs/2411.00662
作者: Jingming Guo,Yan Liu,Yu Meng,Zhiwei Tao,Banglan Liu,Gang Chen,Xiang Li
关键词-EN: multiple specialized expert, combines multiple specialized, specialized expert models, advanced model architecture, specialized expert
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:The Mixture of Experts (MoE) is an advanced model architecture in the industry that combines multiple specialized expert models from various domains into a single supermodel. This approach enables the model to scale without significantly increasing the computational costs of training and inference, while maximizing model performance. However, current distributed training frameworks do not consider the ultimate optimization of communication, especially for large base models. This paper proposes a network-traffic-aware parallel optimization method that selects the optimal parallel strategy based on the communication volume and the training cluster’s inter-node and intra-node network topologies. Compared to DeepSpeed, MoNTA achieves an 8x increase in AllToAll communication performance under 8-card tensor parallelism. Compared to the baseline, training a 2x70B model using 16 A800 cards, with an 8K sequence, results in a 13% overall latency performance improvement. Project Page: this https URL.
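The selection idea — estimate each candidate placement's communication time from its traffic volume and the bandwidth of the links it crosses, then pick the cheapest — can be sketched as follows; the strategy names, volumes, and bandwidth figures are made-up illustrations, not MoNTA's actual cost model:

```python
def pick_parallel_strategy(comm_plans):
    """comm_plans maps a strategy name to (communication volume in GB,
    bandwidth in GB/s of the links that traffic crosses); return the
    strategy with the lowest estimated transfer time."""
    est_time = {name: vol / bw for name, (vol, bw) in comm_plans.items()}
    return min(est_time, key=est_time.get)

# A placement that inflates traffic on fast intra-node links can lose to
# one that sends a smaller volume over the slower inter-node fabric.
choice_small = pick_parallel_strategy({
    "intra_node": (8.0, 400.0),   # 8 GB over NVLink-class links
    "inter_node": (1.0, 100.0),   # 1 GB over the network fabric
})
choice_large = pick_parallel_strategy({
    "intra_node": (8.0, 400.0),
    "inter_node": (8.0, 100.0),
})
```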
[AI-10] Physics in Next-token Prediction
链接: https://arxiv.org/abs/2411.00660
作者: Hongjun An,Yiliang Song,Xuelong Li
关键词-EN: Next-token Prediction, physics in Next-token, law of information, Information Capacity, discovered the underlying
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: First Submit
点击查看摘要
Abstract:We discovered the underlying physics in Next-token Prediction (NTP). We identified the law of information conservation within NTP and proposed the First Law of Information Capacity (IC-1), demonstrating that the essence of intelligence emergence in auto-regressive models is fundamentally a process of information transfer. We also introduced Landauer’s Principle into NTP, formulating the Second Law of Information Capacity (IC-2), which establishes the relationship between auto-regressive model training and energy consumption. Additionally, we presented several corollaries, which hold practical significance for production practices. Finally, we validated the compatibility and complementarity of our findings with existing theories.
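For context, the Landauer bound that IC-2 builds on gives the minimum energy dissipated per erased bit; a standard textbook statement (background only, not the paper's formulation):

```latex
% Minimum energy to erase one bit at temperature T (k_B: Boltzmann constant):
E_{\min} = k_B T \ln 2
% hence a training process that irreversibly handles I bits satisfies
E \;\ge\; I \, k_B T \ln 2 .
```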
[AI-11] Generative AI and Agency in Education: A Critical Scoping Review and Thematic Analysis
链接: https://arxiv.org/abs/2411.00631
作者: Jasper Roe(1),Mike Perkins(2) ((1) James Cook University Singapore, (2) British University Vietnam)
关键词-EN: Critical Digital Pedagogy, Digital Pedagogy, relationship between Generative, lens of Critical, Critical Digital
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:This scoping review examines the relationship between Generative AI (GenAI) and agency in education, analyzing the literature available through the lens of Critical Digital Pedagogy. Following PRISMA-ScR guidelines, we collected 10 studies from academic databases focusing on both learner and teacher agency in GenAI-enabled environments. We conducted an AI-supported hybrid thematic analysis that revealed three key themes: Control in Digital Spaces, Variable Engagement and Access, and Changing Notions of Agency. The findings suggest that while GenAI may enhance learner agency through personalization and support, it also risks exacerbating educational inequalities and diminishing learner autonomy in certain contexts. This review highlights gaps in the current research on GenAI’s impact on agency. These findings have implications for educational policy and practice, suggesting the need for frameworks that promote equitable access while preserving learner agency in GenAI-enhanced educational environments.
[AI-12] STAA: Spatio-Temporal Attention Attribution for Real-Time Interpreting Transformer-based Video Models
链接: https://arxiv.org/abs/2411.00630
作者: Zerui Wang,Yan Liu
关键词-EN: computer vision tasks, Transformer-based models, vision tasks, computer vision, XAI
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Transformer-based models have achieved state-of-the-art performance in various computer vision tasks, including image and video analysis. However, Transformer’s complex architecture and black-box nature pose challenges for explainability, a crucial aspect for real-world applications and scientific inquiry. Current Explainable AI (XAI) methods can only provide one-dimensional feature importance, either spatial or temporal explanation, with significant computational complexity. This paper introduces STAA (Spatio-Temporal Attention Attribution), an XAI method for interpreting video Transformer models. Differ from traditional methods that separately apply image XAI techniques for spatial features or segment contribution analysis for temporal aspects, STAA offers both spatial and temporal information simultaneously from attention values in Transformers. The study utilizes the Kinetics-400 dataset, a benchmark collection of 400 human action classes used for action recognition research. We introduce metrics to quantify explanations. We also apply optimization to enhance STAA’s raw output. By implementing dynamic thresholding and attention focusing mechanisms, we improve the signal-to-noise ratio in our explanations, resulting in more precise visualizations and better evaluation results. In terms of computational overhead, our method requires less than 3% of the computational resources of traditional XAI methods, making it suitable for real-time video XAI analysis applications. STAA contributes to the growing field of XAI by offering a method for researchers and practitioners to analyze Transformer models.
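A minimal sketch of attention-based spatio-temporal attribution with dynamic thresholding, in the spirit of the description above; the aggregation rule, quantile threshold, and tensor shapes are assumptions for illustration, not STAA's exact definitions:

```python
import numpy as np

def staa_map(attn, frames=4, patches=16, quantile=0.8):
    """Derive a joint spatio-temporal saliency map from attention values:
    average the attention each spatio-temporal token receives, then apply
    a dynamic threshold to suppress low-signal tokens."""
    # attn: (layers, heads, queries, keys) over frames*patches tokens.
    received = attn.mean(axis=(0, 1, 2))                  # (frames*patches,)
    cutoff = np.quantile(received, quantile)              # dynamic threshold
    focused = np.where(received >= cutoff, received, 0.0)
    # Rows index time (frames), columns index space (patches).
    return focused.reshape(frames, patches)

rng = np.random.default_rng(0)
attn = rng.random((2, 3, 64, 64))   # 2 layers, 3 heads, 4x16 = 64 tokens
saliency = staa_map(attn)
```

Because the map is read directly from attention values already computed in the forward pass, the cost stays a small fraction of gradient-based XAI methods, which matches the abstract's real-time claim.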
[AI-13] Lingma SWE-GPT: An Open Development-Process-Centric Language Model for Automated Software Improvement
链接: https://arxiv.org/abs/2411.00622
作者: Yingwei Ma,Rongyu Cao,Yongchang Cao,Yue Zhang,Jue Chen,Yibo Liu,Yuchen Liu,Binhua Li,Fei Huang,Yongbin Li
关键词-EN: Recent advancements, Lingma SWE-GPT, advancements in LLM-based, LLM-based agents, agents have led
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Recent advancements in LLM-based agents have led to significant progress in automatic software engineering, particularly in software maintenance and evolution. Despite these encouraging advances, current research faces two major challenges. First, SOTA performance primarily depends on closed-source models, which significantly limits the technology’s accessibility and potential for customization in diverse SE tasks. Second, these models are predominantly trained on static code data, lacking a deep understanding of the dynamic interactions, iterative problem-solving processes, and evolutionary characteristics inherent in software development. To address these challenges, our study adopts a software engineering perspective. We recognize that real-world software maintenance and evolution processes encompass not only static code data but also developers’ thought processes, utilization of external tools, and the interaction between different functional personnel. Consequently, we introduce the Lingma SWE-GPT series, comprising Lingma SWE-GPT 7B and 72B. By learning from and simulating real-world code submission activities, Lingma SWE-GPT systematically incorporates the dynamic interactions and iterative problem-solving inherent in the software development process, thereby achieving a more comprehensive understanding of software improvement processes. We conducted experimental evaluations using the SWE-bench Verified benchmark. The results demonstrate that Lingma SWE-GPT 72B successfully resolves 30.20% of the GitHub issues, marking a significant improvement in automatic issue resolution (a 22.76% relative improvement compared to Llama 3.1 405B) and approaching the performance of closed-source models (GPT-4o resolves 31.80% of issues). Notably, Lingma SWE-GPT 7B resolves 18.20% of the issues, highlighting the potential for applying smaller models to ASE tasks.
[AI-14] How to Bridge Spatial and Temporal Heterogeneity in Link Prediction? A Contrastive Method
链接: https://arxiv.org/abs/2411.00612
作者: Yu Tai,Xinglong Wu,Hongwei Yang,Hui He,Duanjing Chen,Yuanming Shao,Weizhe Zhang
关键词-EN: noteworthy research avenue, real-world complex systems, Heterogeneous Networks play, complex systems, Temporal
类目: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Temporal Heterogeneous Networks play a crucial role in capturing the dynamics and heterogeneity inherent in various real-world complex systems, rendering them a noteworthy research avenue for link prediction. However, existing methods fail to capture the fine-grained differential distribution patterns and temporal dynamic characteristics, which we refer to as spatial heterogeneity and temporal heterogeneity. To overcome such limitations, we propose a novel Contrastive Learning-based Link Prediction model, CLP, which employs a multi-view hierarchical self-supervised architecture to encode spatial and temporal heterogeneity. Specifically, aiming at spatial heterogeneity, we develop a spatial feature modeling layer to capture the fine-grained topological distribution patterns from node- and edge-level representations, respectively. Furthermore, aiming at temporal heterogeneity, we devise a temporal information modeling layer to perceive the evolutionary dependencies of dynamic graph topologies from time-level representations. Finally, we encode the spatial and temporal distribution heterogeneity from a contrastive learning perspective, enabling a comprehensive self-supervised hierarchical relation modeling for the link prediction task. Extensive experiments conducted on four real-world dynamic heterogeneous network datasets verify that our CLP model consistently outperforms the state-of-the-art models, demonstrating average improvements of 10.10% and 13.44% in terms of AUC and AP, respectively.
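The contrastive objective underpinning this kind of multi-view self-supervision is typically an InfoNCE-style loss; a generic sketch (not CLP's exact hierarchical formulation):

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=0.2):
    """Pull an anchor embedding toward a positive view (e.g. the same node
    seen at another level or time) and push it away from negatives."""
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    pos = np.exp(cos(anchor, positive) / tau)
    neg = sum(np.exp(cos(anchor, n) / tau) for n in negatives)
    return float(-np.log(pos / (pos + neg)))

anchor = np.array([1.0, 0.0])
aligned = np.array([0.9, 0.1])      # another view of the same entity
negatives = [np.array([-1.0, 0.2]), np.array([0.0, -1.0])]
loss_aligned = info_nce(anchor, aligned, negatives)
loss_misaligned = info_nce(anchor, np.array([0.0, 1.0]), negatives)
```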
[AI-15] On Deep Learning for Geometric and Semantic Scene Understanding Using On-Vehicle 3D LiDAR CVPR2023 ECCV2024
链接: https://arxiv.org/abs/2411.00600
作者: Li Li
关键词-EN: point cloud data, autonomous driving, computer vision, LiDAR point cloud, data is crucial
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: PhD thesis (Durham University, Computer Science), 149 pages (the 2024 BMVA Sullivan Doctoral Thesis Prize runner-up). Includes published content from arXiv:2407.10159 (ECCV 2024 ORAL), arXiv:2303.11203 (CVPR 2023), and arXiv:2406.10068 (3DV 2021), with minor revisions to the examined version: this https URL
点击查看摘要
Abstract:3D LiDAR point cloud data is crucial for scene perception in computer vision, robotics, and autonomous driving. Geometric and semantic scene understanding, involving 3D point clouds, is essential for advancing autonomous driving technologies. However, significant challenges remain, particularly in improving the overall accuracy (e.g., segmentation accuracy, depth estimation accuracy, etc.) and efficiency of these systems. To address the challenge in terms of accuracy related to LiDAR-based tasks, we present DurLAR, the first high-fidelity 128-channel 3D LiDAR dataset featuring panoramic ambient (near infrared) and reflectivity imagery. To improve efficiency in 3D segmentation while ensuring the accuracy, we propose a novel pipeline that employs a smaller architecture, requiring fewer ground-truth annotations while achieving superior segmentation accuracy compared to contemporary approaches. To improve the segmentation accuracy, we introduce Range-Aware Pointwise Distance Distribution (RAPiD) features and the associated RAPiD-Seg architecture. All contributions have been accepted by peer-reviewed conferences, underscoring the advancements in both accuracy and efficiency in 3D LiDAR applications for autonomous driving. Full abstract: this https URL.
[AI-16] alpha-TCVAE: On the relationship between Disentanglement and Diversity
链接: https://arxiv.org/abs/2411.00588
作者: Cristian Meo,Louis Mahon,Anirudh Goyal,Justin Dauwels
关键词-EN: usefulness remains debated, remains debated, shown promise, disentangled representations, generative modeling
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:While disentangled representations have shown promise in generative modeling and representation learning, their downstream usefulness remains debated. Recent studies re-defined disentanglement through a formal connection to symmetries, emphasizing the ability to reduce latent domains and consequently enhance generative capabilities. However, from an information theory viewpoint, assigning a complex attribute to a specific latent variable may be infeasible, limiting the applicability of disentangled representations to simple datasets. In this work, we introduce α-TCVAE, a variational autoencoder optimized using a novel total correlation (TC) lower bound that maximizes disentanglement and latent variables’ informativeness. The proposed TC bound is grounded in information theory constructs, generalizes the β-VAE lower bound, and can be reduced to a convex combination of the known variational information bottleneck (VIB) and conditional entropy bottleneck (CEB) terms. Moreover, we present quantitative analyses that support the idea that disentangled representations lead to better generative capabilities and diversity. Additionally, we perform downstream task experiments from both representation and RL domains to assess our questions from a broader ML perspective. Our results demonstrate that α-TCVAE consistently learns more disentangled representations than baselines and generates more diverse observations without sacrificing visual fidelity. Notably, α-TCVAE exhibits marked improvements on MPI3D-Real, the most realistic disentangled dataset in our study, confirming its ability to represent complex datasets when maximizing the informativeness of individual variables. Finally, testing the proposed model off-the-shelf on a state-of-the-art model-based RL agent, Director, demonstrates α-TCVAE’s downstream usefulness on the loconav Ant Maze task.
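As background for the TC bound above, the aggregate-posterior KL term of the VAE objective admits a standard three-way decomposition (this is the well-known β-TCVAE-style decomposition, not the paper's new bound):

```latex
\mathbb{E}_{p(x)}\!\left[ \mathrm{KL}\big(q(z \mid x)\,\|\,p(z)\big) \right]
 = \underbrace{I_q(x; z)}_{\text{index-code MI}}
 + \underbrace{\mathrm{KL}\Big(q(z)\,\Big\|\,\textstyle\prod_j q(z_j)\Big)}_{\text{total correlation (TC)}}
 + \underbrace{\textstyle\sum_j \mathrm{KL}\big(q(z_j)\,\|\,p(z_j)\big)}_{\text{dimension-wise KL}}
```

Weighting the TC term penalizes statistical dependence between latent dimensions, which is the lever such bounds use to trade disentanglement against informativeness.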
[AI-17] Benchmarking Bias in Large Language Models during Role-Playing
链接: https://arxiv.org/abs/2411.00585
作者: Xinyue Li,Zhenpeng Chen,Jie M. Zhang,Yiling Lou,Tianlin Li,Weisong Sun,Yang Liu,Xuanzhe Liu
关键词-EN: Large Language Models, Large Language, modern language-driven applications, profoundly influencing daily, influencing daily life
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have become foundational in modern language-driven applications, profoundly influencing daily life. A critical technique in leveraging their potential is role-playing, where LLMs simulate diverse roles to enhance their real-world utility. However, while research has highlighted the presence of social biases in LLM outputs, it remains unclear whether and to what extent these biases emerge during role-playing scenarios. In this paper, we introduce BiasLens, a fairness testing framework designed to systematically expose biases in LLMs during role-playing. Our approach uses LLMs to generate 550 social roles across a comprehensive set of 11 demographic attributes, producing 33,000 role-specific questions targeting various forms of bias. These questions, spanning Yes/No, multiple-choice, and open-ended formats, are designed to prompt LLMs to adopt specific roles and respond accordingly. We employ a combination of rule-based and LLM-based strategies to identify biased responses, rigorously validated through human evaluation. Using the generated questions as the benchmark, we conduct extensive evaluations of six advanced LLMs released by OpenAI, Mistral AI, Meta, Alibaba, and DeepSeek. Our benchmark reveals 72,716 biased responses across the studied LLMs, with individual models yielding between 7,754 and 16,963 biased responses, underscoring the prevalence of bias in role-playing contexts. To support future research, we have publicly released the benchmark, along with all scripts and experimental results.
[AI-18] WLPlan: Relational Features for Symbolic Planning
链接: https://arxiv.org/abs/2411.00577
作者: Dillon Z. Chen
关键词-EN: research generally involves, generally involves juggling, planning modules effectively, planning research generally, modules effectively
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Scalable learning for planning research generally involves juggling between different programming languages for handling learning and planning modules effectively. Interpreted languages such as Python are commonly used for learning routines due to their ease of use and their abundance of well-maintained learning libraries, while compiled languages such as C++ are used for planning routines due to their optimised resource usage. Motivated by the need for tools for developing scalable learning planners, we introduce WLPlan, a C++ package with Python bindings which implements recent promising work for automatically generating relational features of planning tasks. Such features can be used for any downstream routine, such as learning domain control knowledge or probing and understanding planning tasks. More specifically, WLPlan provides functionality for (1) transforming planning tasks into graphs, and (2) embedding planning graphs into feature vectors via graph kernels. The source code and instructions for the installation and usage of WLPlan are available at this http URL
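Step (2) rests on Weisfeiler-Leman (WL) colour refinement; a compact sketch of WL feature extraction on a small undirected graph (the interface is illustrative, not WLPlan's API):

```python
from collections import Counter

def wl_features(adj, labels, iterations=2):
    """Iteratively refine node colours by hashing each node's colour with
    the multiset of its neighbours' colours, and histogram every colour
    seen; the histogram is the graph's WL feature vector."""
    palette = {}
    def canon(structured):
        if structured not in palette:
            palette[structured] = len(palette)
        return palette[structured]
    colors = [canon(("init", l)) for l in labels]
    hist = Counter(colors)
    for _ in range(iterations):
        colors = [canon((colors[v], tuple(sorted(colors[u] for u in adj[v]))))
                  for v in range(len(adj))]
        hist.update(colors)
    return hist

# Path graph a - b - c with identical initial labels.
adj = [[1], [0, 2], [1]]
features = wl_features(adj, [0, 0, 0])
```

Comparing two graphs then reduces to comparing their colour histograms, e.g. via a dot product, which is the usual WL graph kernel.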
[AI-19] Simulate and Optimise: A two-layer mortgage simulator for designing novel mortgage assistance products
链接: https://arxiv.org/abs/2411.00563
作者: Leo Ardon,Benjamin Patrick Evans,Deepeka Garg,Annapoorani Lakshmi Narayanan,Makada Henry-Nickie,Sumitra Ganesh
关键词-EN: optimising mortgage relief, simulated multi-agent mortgage, multi-agent mortgage environment, mortgage relief products, simulated multi-agent
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computational Finance (q-fin.CP)
*备注: Accepted at the 5th ACM International Conference on AI in Finance
点击查看摘要
Abstract:We develop a novel two-layer approach for optimising mortgage relief products through a simulated multi-agent mortgage environment. While the approach is generic, here the environment is calibrated to the US mortgage market based on publicly available census data and regulatory guidelines. Through the simulation layer, we assess the resilience of households to exogenous income shocks, while the optimisation layer explores strategies to improve the robustness of households to these shocks by making novel mortgage assistance products available to households. Households in the simulation are adaptive, learning to make mortgage-related decisions (such as product enrolment or strategic foreclosures) that maximize their utility, balancing their available liquidity and equity. We show how this novel two-layer simulation approach can successfully design novel mortgage assistance products to improve household resilience to exogenous shocks, and balance the costs of providing such products through post-hoc analysis. Previously, such analysis could only be conducted through expensive pilot studies involving real participants, demonstrating the benefit of the approach for designing and evaluating financial products.
[AI-20] LLM-KT: A Versatile Framework for Knowledge Transfer from Large Language Models to Collaborative Filtering ICDM2024
链接: https://arxiv.org/abs/2411.00556
作者: Nikita Severin,Aleksei Ziablitsev,Yulia Savelyeva,Valeriy Tashchilin,Ivan Bulychev,Mikhail Yushkov,Artem Kushneruk,Amaliya Zaryvnykh,Dmitrii Kiselev,Andrey Savchenko,Ilya Makarov
关键词-EN: Large Language Model, seamlessly integrating LLM, Large Language, enhance collaborative filtering, integrating LLM
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注: accepted at ICDM 2024 (demo track)
点击查看摘要
Abstract:We present LLM-KT, a flexible framework designed to enhance collaborative filtering (CF) models by seamlessly integrating LLM (Large Language Model)-generated features. Unlike existing methods that rely on passing LLM-generated features as direct inputs, our framework injects these features into an intermediate layer of any CF model, allowing the model to reconstruct and leverage the embeddings internally. This model-agnostic approach works with a wide range of CF models without requiring architectural changes, making it adaptable to various recommendation scenarios. Our framework is built for easy integration and modification, providing researchers and developers with a powerful tool for extending CF model capabilities through efficient knowledge transfer. We demonstrate its effectiveness through experiments on the MovieLens and Amazon datasets, where it consistently improves baseline CF models. Experimental studies showed that LLM-KT is competitive with the state-of-the-art methods in context-aware settings but can be applied to a broader range of CF models than current approaches.
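The injection idea — an auxiliary target at an intermediate layer rather than LLM features as raw inputs — can be sketched with a tiny matrix-factorisation model; the shapes, linear reconstruction head, and loss weight are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, d_cf, d_llm = 4, 5, 8, 6
U = rng.normal(size=(n_users, d_cf))           # intermediate CF user embeddings
V = rng.normal(size=(n_items, d_cf))           # CF item embeddings
llm_feats = rng.normal(size=(n_users, d_llm))  # LLM-generated user features
W = 0.1 * rng.normal(size=(d_cf, d_llm))       # head reconstructing them
ratings = rng.normal(size=(n_users, n_items))

def total_loss(W, alpha=0.5):
    """CF loss plus a knowledge-transfer term: the intermediate embedding
    must be able to reconstruct the LLM features internally."""
    cf = np.mean((U @ V.T - ratings) ** 2)
    kt = np.mean((U @ W - llm_feats) ** 2)
    return cf + alpha * kt, kt

# One gradient step on the head alone already reduces the transfer term.
_, kt_before = total_loss(W)
grad_W = 2.0 * U.T @ (U @ W - llm_feats) / llm_feats.size
W_new = W - 0.1 * grad_W
_, kt_after = total_loss(W_new)
```

Because only an auxiliary loss is added, the host CF model's architecture is untouched, which is what makes the approach model-agnostic.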
[AI-21] Differentiable Physics-based System Identification for Robotic Manipulation of Elastoplastic Materials
链接: https://arxiv.org/abs/2411.00554
作者: Xintong Yang,Ze Ji,Yu-Kun Lai
关键词-EN: volumetric elastoplastic deformable, Robotic manipulation, elastoplastic deformable materials, largely due, high-dimensional space
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
*备注: Under review at the International Journal of Robotics Research
点击查看摘要
Abstract:Robotic manipulation of volumetric elastoplastic deformable materials, from foods such as dough to construction materials like clay, is in its infancy, largely due to the difficulty of modelling and perception in a high-dimensional space. Simulating the dynamics of such materials is computationally expensive. It tends to suffer from inaccurately estimated physics parameters of the materials and the environment, impeding high-precision manipulation. Estimating such parameters from raw point clouds captured by optical cameras suffers further from heavy occlusions. To address this challenge, this work introduces a novel Differentiable Physics-based System Identification (DPSI) framework that enables a robot arm to infer the physics parameters of elastoplastic materials and the environment using simple manipulation motions and incomplete 3D point clouds, aligning the simulation with the real world. Extensive experiments show that with only a single real-world interaction, the estimated parameters, Young’s modulus, Poisson’s ratio, yield stress and friction coefficients, can accurately simulate visually and physically realistic deformation behaviours induced by unseen and long-horizon manipulation motions. Additionally, the DPSI framework inherently provides physically intuitive interpretations for the parameters in contrast to black-box approaches such as deep neural networks.
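The identification loop at the heart of such a framework — adjust physics parameters until the simulated observation matches the real one — can be sketched generically; here a toy linear "simulator" and finite differences stand in for DPSI's differentiable physics engine:

```python
import numpy as np

def identify_parameter(simulate, observed, theta0=0.0, lr=0.1, steps=200):
    """Fit a physics parameter by gradient descent on the mismatch between
    simulated and observed outcomes; central finite differences stand in
    for the autodiff a differentiable simulator provides."""
    theta, eps = theta0, 1e-4
    loss = lambda th: float(np.mean((simulate(th) - observed) ** 2))
    for _ in range(steps):
        grad = (loss(theta + eps) - loss(theta - eps)) / (2 * eps)
        theta -= lr * grad
    return theta

# Toy "simulator": the deformation profile scales linearly with a stiffness
# parameter; the true value is recovered from a single observation, echoing
# the paper's single-interaction setting.
true_theta = 2.5
simulate = lambda th: th * np.linspace(0.0, 1.0, 5)
observed = simulate(true_theta)
theta_hat = identify_parameter(simulate, observed)
```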
[AI-22] Conditional Synthesis of 3D Molecules with Time Correction Sampler NEURIPS2024
链接: https://arxiv.org/abs/2411.00551
作者: Hojung Jung,Youngrok Park,Laura Schmid,Jaehyeong Jo,Dongkyu Lee,Bongsang Kim,Se-Young Yun,Jinwoo Shin
关键词-EN: demonstrated remarkable success, including molecular generation, demonstrated remarkable, remarkable success, molecular generation
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: NeurIPS 2024
点击查看摘要
Abstract:Diffusion models have demonstrated remarkable success in various domains, including molecular generation. However, conditional molecular generation remains a fundamental challenge due to an intrinsic trade-off between targeting specific chemical properties and generating meaningful samples from the data distribution. In this work, we present Time-Aware Conditional Synthesis (TACS), a novel approach to conditional generation on diffusion models. It integrates adaptively controlled plug-and-play “online” guidance into a diffusion model, driving samples toward the desired properties while maintaining validity and stability. A key component of our algorithm is our new type of diffusion sampler, Time Correction Sampler (TCS), which is used to control guidance and ensure that the generated molecules remain on the correct manifold at each reverse step of the diffusion process at the same time. Our proposed method demonstrates significant performance in conditional 3D molecular generation and offers a promising approach towards inverse molecular design, potentially facilitating advancements in drug discovery, materials science, and other related fields.
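The plug-and-play guidance idea can be sketched with a generic classifier-guidance-style reverse step (a one-dimensional toy, not the TACS/TCS sampler; the Gaussian prior and quadratic property objective are illustrative):

```python
def guided_reverse_step(x, score_fn, guidance_fn, step=0.1, w=1.0):
    """One reverse-diffusion update with plug-and-play guidance: follow the
    model score plus a weighted gradient of the property objective."""
    return x + step * (score_fn(x) + w * guidance_fn(x))

# Toy setup: the data prior is a standard Gaussian (score = -x) and the
# condition pulls samples toward a target property value c.
c = 2.0
score_fn = lambda x: -x
guidance_fn = lambda x: -(x - c)
x = 5.0
for _ in range(200):
    x = guided_reverse_step(x, score_fn, guidance_fn)
# The iterate settles where prior and condition balance: x* = w*c / (1 + w).
```

The guidance weight `w` is exactly the trade-off the abstract describes: larger `w` targets the property harder, smaller `w` keeps samples closer to the data distribution.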
[AI-23] Generative AI-based Pipeline Architecture for Increasing Training Efficiency in Intelligent Weed Control Systems
链接: https://arxiv.org/abs/2411.00548
作者: Sourav Modak,Anthony Stein
关键词-EN: demonstrated significant potential, automated crop protection, crop protection tasks, disease diagnosis, pest monitoring
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:In automated crop protection tasks such as weed control, disease diagnosis, and pest monitoring, deep learning has demonstrated significant potential. However, these advanced models rely heavily on high-quality, diverse datasets, often limited and costly in agricultural settings. Traditional data augmentation can increase dataset volume but usually lacks the real-world variability needed for robust training. This study presents a new approach for generating synthetic images to improve deep learning-based object detection models for intelligent weed control. Our GenAI-based image generation pipeline integrates the Segment Anything Model (SAM) for zero-shot domain adaptation with a text-to-image Stable Diffusion Model, enabling the creation of synthetic images that capture diverse real-world conditions. We evaluate these synthetic datasets using lightweight YOLO models, measuring data efficiency with mAP50 and mAP50-95 scores across varying proportions of real and synthetic data. Notably, YOLO models trained on datasets with 10% synthetic and 90% real images generally demonstrate superior mAP50 and mAP50-95 scores compared to those trained solely on real images. This approach not only reduces dependence on extensive real-world datasets but also enhances predictive performance. The integration of this approach opens opportunities for achieving continual self-improvement of perception modules in intelligent technical systems.
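The reported best ratio (10% synthetic, 90% real) suggests a simple dataset-assembly step; a sketch under the assumption that the total training budget is held at the real-set size (file names and sampling details are illustrative):

```python
import random

def mix_training_set(real, synthetic, synthetic_fraction=0.1, seed=0):
    """Build a training list at a fixed real/synthetic ratio while keeping
    the total budget equal to the real set size. The 10% default follows
    the ratio the study reports as effective."""
    rng = random.Random(seed)
    n_syn = round(synthetic_fraction * len(real))
    mixed = rng.sample(real, len(real) - n_syn) + rng.sample(synthetic, n_syn)
    rng.shuffle(mixed)
    return mixed

real = [f"real_{i}.jpg" for i in range(90)]
synthetic = [f"syn_{i}.jpg" for i in range(20)]
train = mix_training_set(real, synthetic)
```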
[AI-24] Human-inspired Perspectives: A Survey on AI Long-term Memory
链接: https://arxiv.org/abs/2411.00489
作者: Zihong He,Weizhe Lin,Hao Zheng,Fan Zhang,Matt Jones,Laurence Aitchison,Xuhai Xu,Miao Liu,Per Ola Kristensson,Junxiao Shen
关键词-EN: long-term memory, long-term, long-term memory systems, memory, abilities to store
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:With the rapid advancement of AI systems, their abilities to store, retrieve, and utilize information over the long term - referred to as long-term memory - have become increasingly significant. These capabilities are crucial for enhancing the performance of AI systems across a wide range of tasks. However, there is currently no comprehensive survey that systematically investigates AI’s long-term memory capabilities, formulates a theoretical framework, and inspires the development of next-generation AI long-term memory systems. This paper begins by systematically introducing the mechanisms of human long-term memory, then explores AI long-term memory mechanisms, establishing a mapping between the two. Based on the mapping relationships identified, we extend the current cognitive architectures and propose the Cognitive Architecture of Self-Adaptive Long-term Memory (SALM). SALM provides a theoretical framework for the practice of AI long-term memory and holds potential for guiding the creation of next-generation long-term memory driven AI systems. Finally, we delve into the future directions and application prospects of AI long-term memory.
[AI-25] Multi Modal Information Fusion of Acoustic and Linguistic Data for Decoding Dairy Cow Vocalizations in Animal Welfare Assessment
链接: https://arxiv.org/abs/2411.00477
作者: Bubacarr Jobarteh,Madalina Mincu,Gavojdian Dinu,Suresh Neethirajan
关键词-EN: precision livestock farming, livestock farming, data fusion, Natural Language Processing, precision livestock
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS); Quantitative Methods (q-bio.QM)
*备注: 31 pages, 22 figures, 2 tables
点击查看摘要
Abstract:Understanding animal vocalizations through multi-source data fusion is crucial for assessing emotional states and enhancing animal welfare in precision livestock farming. This study aims to decode dairy cow contact calls by employing multi-modal data fusion techniques, integrating transcription, semantic analysis, contextual and emotional assessment, and acoustic feature extraction. We utilized a Natural Language Processing model to transcribe audio recordings of cow vocalizations into written form. By fusing multiple acoustic features (frequency, duration, and intensity) with transcribed textual data, we developed a comprehensive representation of cow vocalizations. Utilizing data fusion within a custom-developed ontology, we categorized vocalizations into high-frequency calls associated with distress or arousal, and low-frequency calls linked to contentment or calmness. Analyzing the fused multi-dimensional data, we identified anxiety-related features indicative of emotional distress, including specific frequency measurements and sound spectrum results. Assessing the sentiment and acoustic features of vocalizations from 20 individual cows allowed us to determine differences in calling patterns and emotional states. Employing advanced machine learning algorithms (Random Forest, Support Vector Machine, and Recurrent Neural Networks), we effectively processed and fused multi-source data to classify cow vocalizations. These models were optimized to handle the computational demands and data quality challenges inherent in practical farm environments. Our findings demonstrate the effectiveness of multi-source data fusion and intelligent processing techniques in animal welfare monitoring. This study represents a significant advancement in animal welfare assessment, highlighting the role of innovative fusion technologies in understanding and improving the emotional wellbeing of dairy cows.
[AI-26] MIRFLEX: Music Information Retrieval Feature Library for Extraction
链接: https://arxiv.org/abs/2411.00469
作者: Anuradha Chopra,Abhinaba Roy,Dorien Herremans
关键词-EN: aid music information, music information retrieval, information retrieval research, paper introduces, introduces an extendable
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Audio and Speech Processing (eess.AS)
*备注: 2 pages, 4 tables, submitted to Extended Abstracts for the Late-Breaking Demo Session of the 25th Int. Society for Music Information Retrieval Conf., San Francisco, United States, 2024
点击查看摘要
Abstract:This paper introduces an extendable modular system that compiles a range of music feature extraction models to aid music information retrieval research. The features include musical elements like key, downbeats, and genre, as well as audio characteristics like instrument recognition, vocals/instrumental classification, and vocals gender detection. The integrated models are state-of-the-art or latest open-source. The features can be extracted as latent or post-processed labels, enabling integration into music applications such as generative music, recommendation, and playlist generation. The modular design allows easy integration of newly developed systems, making it a good benchmarking and comparison tool. This versatile toolkit supports the research community in developing innovative solutions by providing concrete musical features.
[AI-27] Uncertainty-based Offline Variational Bayesian Reinforcement Learning for Robustness under Diverse Data Corruptions NEURIPS2024
链接: https://arxiv.org/abs/2411.00465
作者: Rui Yang,Jie Wang,Guoping Wu,Bin Li
关键词-EN: Real-world offline datasets, Real-world offline, adversarial attacks, malicious attacks, Bayesian inference
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted to NeurIPS 2024
点击查看摘要
Abstract:Real-world offline datasets are often subject to data corruptions (such as noise or adversarial attacks) due to sensor failures or malicious attacks. Despite advances in robust offline reinforcement learning (RL), existing methods struggle to learn robust agents under high uncertainty caused by the diverse corrupted data (i.e., corrupted states, actions, rewards, and dynamics), leading to performance degradation in clean environments. To tackle this problem, we propose a novel robust variational Bayesian inference for offline RL (TRACER). It introduces Bayesian inference for the first time to capture the uncertainty via offline data for robustness against all types of data corruptions. Specifically, TRACER first models all corruptions as the uncertainty in the action-value function. Then, to capture such uncertainty, it uses all offline data as the observations to approximate the posterior distribution of the action-value function under a Bayesian inference framework. An appealing feature of TRACER is that it can distinguish corrupted data from clean data using an entropy-based uncertainty measure, since corrupted data often induces higher uncertainty and entropy. Based on the aforementioned measure, TRACER can regulate the loss associated with corrupted data to reduce its influence, thereby enhancing robustness and performance in clean environments. Experiments demonstrate that TRACER significantly outperforms several state-of-the-art approaches across both individual and simultaneous data corruptions.
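TRACER 的核心直觉(高熵后验对应被污染的数据,应被降权)可以用一个数值小例子来示意。下面的函数名、形状与打分规则均为示意性假设,并非 TRACER 的实际估计器:后验动作价值采样熵越高的转移,其损失权重越小。

```python
import numpy as np

def entropy_weighted_loss(td_errors, q_samples):
    """Down-weight transitions whose posterior action-value samples have high
    entropy (a proxy for corruption), in the spirit of TRACER's measure.
    q_samples: (batch, n_posterior_draws) samples of Q(s, a) per transition.
    This is a schematic illustration, not the paper's actual estimator."""
    # Gaussian differential-entropy proxy: log of the posterior std per transition
    ent = np.log(q_samples.std(axis=1) + 1e-8)
    weights = np.exp(-(ent - ent.min()))       # higher entropy -> smaller weight
    weights /= weights.sum()
    return weights, float(np.sum(weights * td_errors ** 2))

rng = np.random.default_rng(0)
clean = rng.normal(0.0, 0.1, size=(3, 32))     # low-uncertainty transitions
corrupt = rng.normal(0.0, 2.0, size=(1, 32))   # high-uncertainty (corrupted) one
w, loss = entropy_weighted_loss(np.ones(4), np.vstack([clean, corrupt]))
print(w[3] < w[:3].min())  # the corrupted transition is down-weighted
```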
[AI-28] A Multi-Granularity Supervised Contrastive Framework for Remaining Useful Life Prediction of Aero-engines
链接: https://arxiv.org/abs/2411.00461
作者: Zixuan He,Ziqian Kong,Zhengyu Chen,Yuling Zhan,Zijun Que,Zhengguo Xu
关键词-EN: Accurate remaining, RUL prediction, RUL prediction task, remaining useful life, operation of aero-engines
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注:
点击查看摘要
Abstract:Accurate remaining useful life (RUL) predictions are critical to the safe operation of aero-engines. Currently, the RUL prediction task is mainly a regression paradigm with only mean square error as the loss function and lacks research on feature space structure, the latter of which has shown excellent performance in a large number of studies. This paper develops a multi-granularity supervised contrastive (MGSC) framework from the plain intuition that samples with the same RUL label should be aligned in the feature space, and addresses the problems of overly large minibatch sizes and unbalanced samples in the implementation. The RUL prediction with MGSC is implemented using the proposed multi-phase training strategy. This paper also demonstrates a simple and scalable basic network structure and validates the proposed MGSC strategy on the CMAPSS dataset using a convolutional long short-term memory network as a baseline, which effectively improves the accuracy of RUL prediction.
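上述核心直觉(相同 RUL 标签的样本在特征空间中对齐)即监督对比目标。下面给出一个最小示意(函数名与离散化标签为假设,论文的多粒度与多阶段训练细节在此省略):

```python
import numpy as np

def mgsc_style_supcon_loss(z, labels, tau=0.1):
    """Supervised contrastive loss over (discretized) RUL labels: samples sharing
    a label are pulled together in feature space, others pushed apart.
    z: (batch, dim) L2-normalized embeddings. A sketch of the alignment idea only."""
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)                       # drop self-similarity
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    pos = (labels[:, None] == labels[None, :]) & ~np.eye(len(labels), dtype=bool)
    per_anchor = -np.where(pos, log_prob, 0.0).sum(axis=1) / np.maximum(pos.sum(axis=1), 1)
    return float(per_anchor.mean())

rng = np.random.default_rng(0)
z = rng.normal(size=(6, 16))
z /= np.linalg.norm(z, axis=1, keepdims=True)            # unit-norm embeddings
labels = np.array([0, 0, 1, 1, 2, 2])                    # discretized RUL bins
print(mgsc_style_supcon_loss(z, labels) > 0)
```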
[AI-29] Integrating Fuzzy Logic into Deep Symbolic Regression
链接: https://arxiv.org/abs/2411.00431
作者: Wout Gerdes,Erman Acar
关键词-EN: Credit card fraud, contactless payment technologies, Credit card, Deep Symbolic Regression, payment technologies
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Symbolic Computation (cs.SC)
*备注: 10 pages, 1 figure, published for XAI FIN 24 this https URL
点击查看摘要
Abstract:Credit card fraud detection is a critical concern for financial institutions, intensified by the rise of contactless payment technologies. While deep learning models offer high accuracy, their lack of explainability poses significant challenges in financial settings. This paper explores the integration of fuzzy logic into Deep Symbolic Regression (DSR) to enhance both performance and explainability in fraud detection. We investigate the effectiveness of different fuzzy logic implications, specifically Łukasiewicz, Gödel, and Product, in handling the complexity and uncertainty of fraud detection datasets. Our analysis suggests that the Łukasiewicz implication achieves the highest F1-score and overall accuracy, while the Product implication offers a favorable balance between performance and explainability. Despite performing below state-of-the-art (SOTA) models due to information loss in data transformation, our approach provides novel insights into integrating fuzzy logic into DSR for fraud detection, offering a comprehensive comparison between different implications and methods.
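论文比较的三种模糊蕴含算子都有标准闭式定义,下面以 NumPy 给出一个小示意(对 [0, 1] 区间内的真值向量化计算;仅为标准定义的演示,与论文的 DSR 集成方式无关):

```python
import numpy as np

def lukasiewicz(a, b):
    # Łukasiewicz implication: I(a, b) = min(1, 1 - a + b)
    return np.minimum(1.0, 1.0 - a + b)

def goedel(a, b):
    # Gödel implication: I(a, b) = 1 if a <= b, else b
    return np.where(a <= b, 1.0, b)

def product(a, b):
    # Product (Goguen) implication: I(a, b) = 1 if a <= b, else b / a
    return np.where(a <= b, 1.0, b / np.maximum(a, 1e-12))

a = np.array([0.2, 0.8, 0.5])
b = np.array([0.6, 0.3, 0.5])
print(lukasiewicz(a, b))  # 1, 0.5, 1
print(goedel(a, b))       # 1, 0.3, 1
print(product(a, b))      # 1, 0.375, 1
```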
[AI-30] On the Opportunities of Large Language Models for Programming Process Data
链接: https://arxiv.org/abs/2411.00414
作者: John Edwards,Arto Hellas,Juho Leinonen
关键词-EN: problems students struggle, programming process, understand how programs, programs are constructed, sorts of problems
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: 14 pages
点击查看摘要
Abstract:Computing educators and researchers have used programming process data to understand how programs are constructed and what sorts of problems students struggle with. Although such data shows promise for using it for feedback, fully automated programming process feedback systems have still been an under-explored area. The recent emergence of large language models (LLMs) have yielded additional opportunities for researchers in a wide variety of fields. LLMs are efficient at transforming content from one format to another, leveraging the body of knowledge they have been trained with in the process. In this article, we discuss opportunities of using LLMs for analyzing programming process data. To complement our discussion, we outline a case study where we have leveraged LLMs for automatically summarizing the programming process and for creating formative feedback on the programming process. Overall, our discussion and findings highlight that the computing education research and practice community is again one step closer to automating formative programming process-focused feedback.
[AI-31] Statistical Guarantees for Lifelong Reinforcement Learning using PAC-Bayesian Theory
链接: https://arxiv.org/abs/2411.00401
作者: Zhi Zhang,Chris Chow,Yasi Zhang,Yanchao Sun,Haochen Zhang,Eric Hanchen Jiang,Han Liu,Furong Huang,Yuchen Cui,Oscar Hernan Madrid Padilla
关键词-EN: Lifelong reinforcement learning, dynamic settings, reinforcement learning, paradigm for extending, extending single-task
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Lifelong reinforcement learning (RL) has been developed as a paradigm for extending single-task RL to more realistic, dynamic settings. In lifelong RL, the “life” of an RL agent is modeled as a stream of tasks drawn from a task distribution. We propose EPIC (Empirical PAC-Bayes that Improves Continuously), a novel algorithm designed for lifelong RL using PAC-Bayes theory. EPIC learns a shared policy distribution, referred to as the world policy, which enables rapid adaptation to new tasks while retaining valuable knowledge from previous experiences. Our theoretical analysis establishes a relationship between the algorithm’s generalization performance and the number of prior tasks preserved in memory. We also derive the sample complexity of EPIC in terms of RL regret. Extensive experiments on a variety of environments demonstrate that EPIC significantly outperforms existing methods in lifelong RL, offering both theoretical guarantees and practical efficacy through the use of the world policy.
[AI-32] Right this way: Can VLMs Guide Us to See More to Answer Questions? NEURIPS2024
链接: https://arxiv.org/abs/2411.00394
作者: Li Liu,Diji Yang,Sijia Zhong,Kalyana Suma Sree Tholeti,Lei Ding,Yi Zhang,Leilani H. Gilpin
关键词-EN: Vision Language Models, seek additional information, Visual Question Answering, sufficient and seek, seek additional
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: NeurIPS 2024
点击查看摘要
Abstract:In question-answering scenarios, humans can assess whether the available information is sufficient and seek additional information if necessary, rather than providing a forced answer. In contrast, Vision Language Models (VLMs) typically generate direct, one-shot responses without evaluating the sufficiency of the information. To investigate this gap, we identify a critical and challenging task in the Visual Question Answering (VQA) scenario: can VLMs indicate how to adjust an image when the visual information is insufficient to answer a question? This capability is especially valuable for assisting visually impaired individuals who often need guidance to capture images correctly. To evaluate this capability of current VLMs, we introduce a human-labeled dataset as a benchmark for this task. Additionally, we present an automated framework that generates synthetic training data by simulating “where to know” scenarios. Our empirical results show significant performance improvements in mainstream VLMs when fine-tuned with this synthetic data. This study demonstrates the potential to narrow the gap between information assessment and acquisition in VLMs, bringing their performance closer to humans.
[AI-33] Advantages of Neural Population Coding for Deep Learning
链接: https://arxiv.org/abs/2411.00393
作者: Heiko Hoffmann
关键词-EN: Scalar variables, population codes, commonly predicted, population, Scalar
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Scalar variables, e.g., the orientation of a shape in an image, are commonly predicted using a single output neuron in a neural network. In contrast, the mammalian cortex represents variables with a population of neurons. In this population code, each neuron is most active at its preferred value and shows partial activity for other values. Here, we investigate the benefit of using a population code for the output layer of a neural network. We compare population codes against single-neuron outputs and one-hot vectors. First, we show theoretically and in experiments with synthetic data that population codes improve robustness to input noise in networks of stacked linear layers. Second, we demonstrate the benefit of population codes to encode ambiguous outputs, as found for symmetric objects. Using the T-LESS dataset of feature-less real-world objects, we show that population codes improve the accuracy of predicting object orientation from RGB-image input.
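群体编码的思想可以用高斯调谐曲线来示意:每个神经元在其偏好值处响应最强、在其他值处部分响应,再用活动加权平均(population vector)解码回标量。以下参数(16 个神经元、sigma 取值)仅为演示,并非论文设置:

```python
import numpy as np

def encode_population(x, n_neurons=16, lo=0.0, hi=1.0, sigma=0.1):
    """Encode a scalar as Gaussian tuning-curve activations (a population code)."""
    preferred = np.linspace(lo, hi, n_neurons)    # each neuron's preferred value
    return np.exp(-0.5 * ((x - preferred) / sigma) ** 2)

def decode_population(activity, lo=0.0, hi=1.0):
    """Decode by the activity-weighted mean of preferred values (population vector)."""
    preferred = np.linspace(lo, hi, len(activity))
    return float(np.sum(activity * preferred) / np.sum(activity))

code = encode_population(0.37)
print(round(decode_population(code), 2))  # ≈ 0.37
```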
[AI-34] Preventing Dimensional Collapse in Self-Supervised Learning via Orthogonality Regularization NEURIPS2024
链接: https://arxiv.org/abs/2411.00392
作者: Junlin He,Jinxiao Du,Wei Ma
关键词-EN: dimensional collapse, Self-supervised learning, dimensional collapse occurs, recent years, dimensional
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: accepted by NeurIPS 2024 as a poster
点击查看摘要
Abstract:Self-supervised learning (SSL) has rapidly advanced in recent years, approaching the performance of its supervised counterparts through the extraction of representations from unlabeled data. However, dimensional collapse, where a few large eigenvalues dominate the eigenspace, poses a significant obstacle for SSL. When dimensional collapse occurs on features (e.g. hidden features and representations), it prevents features from representing the full information of the data; when dimensional collapse occurs on weight matrices, their filters are self-related and redundant, limiting their expressive power. Existing studies have predominantly concentrated on the dimensional collapse of representations, neglecting whether this can sufficiently prevent the dimensional collapse of the weight matrices and hidden features. To this end, we propose, for the first time, a mitigation approach employing orthogonal regularization (OR) across the encoder, targeting both convolutional and linear layers during pretraining. OR promotes orthogonality within weight matrices, thus safeguarding against the dimensional collapse of weight matrices, hidden features, and representations. Our empirical investigations demonstrate that OR significantly enhances the performance of SSL methods across diverse benchmarks, yielding consistent gains with both CNNs and Transformer-based architectures.
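正交正则化的一种常见形式,是惩罚每层滤波器 Gram 矩阵与单位阵的偏差。下面是一个通用的软正交惩罚示意(NumPy 实现;不一定与论文的具体公式一致):

```python
import numpy as np

def orthogonal_regularization(weights):
    """Soft orthogonality penalty: sum of ||W Wᵀ - I||_F² over weight matrices.
    Conv kernels of shape (out, in, kH, kW) are flattened to (out, -1) so that
    rows correspond to filters. A generic sketch of one common OR variant."""
    loss = 0.0
    for w in weights:
        w2d = w.reshape(w.shape[0], -1)
        gram = w2d @ w2d.T
        loss += np.sum((gram - np.eye(gram.shape[0])) ** 2)
    return loss

rng = np.random.default_rng(0)
conv_kernel = rng.normal(size=(8, 3, 3, 3)) * 0.1   # (out, in, kH, kW)
fc_weight = np.eye(4)                                # rows already orthonormal
print(orthogonal_regularization([fc_weight]))        # 0.0 — no penalty
print(orthogonal_regularization([conv_kernel]) > 0)  # non-orthogonal -> penalized
```

训练时可将该项乘以系数后加到任务损失上(total_loss = task_loss + lambda_or * reg)。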
[AI-35] Generalizability of Memorization Neural Networks
链接: https://arxiv.org/abs/2411.00372
作者: Lijia Yu,Xiao-Shan Gao,Lijun Zhang,Yibo Miao
关键词-EN: memorization, number of parameters, network memorization problem, neural network memorization, memorization neural networks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:The neural network memorization problem is to study the expressive power of neural networks to interpolate a finite dataset. Although memorization is widely believed to have a close relationship with the strong generalizability of deep learning when using over-parameterized models, to the best of our knowledge, there exists no theoretical study on the generalizability of memorization neural networks. In this paper, we give the first theoretical analysis of this topic. Since using i.i.d. training data is a necessary condition for a learning algorithm to be generalizable, memorization and its generalization theory for i.i.d. datasets are developed under mild conditions on the data distribution. First, algorithms are given to construct memorization networks for an i.i.d. dataset, which have the smallest number of parameters and even a constant number of parameters. Second, we show that, in order for the memorization networks to be generalizable, the width of the network must be at least equal to the dimension of the data, which implies that the existing memorization networks with an optimal number of parameters are not generalizable. Third, a lower bound for the sample complexity of general memorization algorithms and the exact sample complexity for memorization algorithms with constant number of parameters are given. It is also shown that there exist data distributions such that, to be generalizable for them, the memorization network must have an exponential number of parameters in the data dimension. Finally, an efficient and generalizable memorization algorithm is given when the number of training samples is greater than the efficient memorization sample complexity of the data distribution.
[AI-36] TextDestroyer: A Training- and Annotation-Free Diffusion Method for Destroying Anomal Text from Images
链接: https://arxiv.org/abs/2411.00355
作者: Mengcheng Li,Mingbao Lin,Fei Chao,Chia-Wen Lin,Rongrong Ji
关键词-EN: pre-trained diffusion model, text, Existing scene text, scene text, text removal models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:In this paper, we propose TextDestroyer, the first training- and annotation-free method for scene text destruction using a pre-trained diffusion model. Existing scene text removal models require complex annotation and retraining, and may leave faint yet recognizable text information, compromising privacy protection and content concealment. TextDestroyer addresses these issues by employing a three-stage hierarchical process to obtain accurate text masks. Our method scrambles text areas in the latent start code using a Gaussian distribution before reconstruction. During the diffusion denoising process, self-attention key and value are referenced from the original latent to restore the compromised background. Latent codes saved at each inversion step are used for replacement during reconstruction, ensuring perfect background restoration. The advantages of TextDestroyer include: (1) it eliminates labor-intensive data annotation and resource-intensive training; (2) it achieves more thorough text destruction, preventing recognizable traces; and (3) it demonstrates better generalization capabilities, performing well on both real-world scenes and generated images.
[AI-37] Examining Attacks on Consensus and Incentive Systems in Proof-of-Work Blockchains: A Systematic Literature Review
链接: https://arxiv.org/abs/2411.00349
作者: Dinitha Wijewardhana,Sugandima Vidanagamachchi,Nalin Arachchilage
关键词-EN: traditional financial systems, gained popularity due, Cryptocurrencies have gained, leading the market, financial systems
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Cryptocurrencies have gained popularity due to their transparency, security, and accessibility compared to traditional financial systems, with Bitcoin, introduced in 2009, leading the market. Bitcoin’s security relies on blockchain technology - a decentralized ledger consisting of a consensus and an incentive mechanism. The consensus mechanism, Proof of Work (PoW), requires miners to solve difficult cryptographic puzzles to add new blocks, while the incentive mechanism rewards them with newly minted bitcoins. However, as Bitcoin’s acceptance grows, it faces increasing threats from attacks targeting these mechanisms, such as selfish mining, double-spending, and block withholding. These attacks compromise security, efficiency, and reward distribution. Recent research shows that these attacks can be combined with each other or with either malicious strategies, such as network-layer attacks, or non-malicious strategies, like honest mining. These combinations lead to more sophisticated attacks, increasing the attacker’s success rates and profitability. Therefore, understanding and evaluating these attacks is essential for developing effective countermeasures and ensuring long-term security. This paper begins by examining individual attacks executed in isolation and their profitability. It then explores how combining these attacks with each other or with other malicious and non-malicious strategies can enhance their overall effectiveness and profitability. The analysis further explores how the deployment of attacks such as selfish mining and block withholding by multiple competing mining pools against each other impacts their economic returns. Lastly, a set of design guidelines is provided, outlining areas future work should focus on to prevent or mitigate the identified threats.
[AI-38] Attention Tracker: Detecting Prompt Injection Attacks in LLMs
链接: https://arxiv.org/abs/2411.00348
作者: Kuo-Han Hung,Ching-Yun Ko,Ambrish Rawat,I-Hsin Chung,Winston H. Hsu,Pin-Yu Chen
关键词-EN: Large Language Models, Large Language, executing designated action, malicious inputs manipulate, ignoring original instructions
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Project page: this https URL
点击查看摘要
Abstract:Large Language Models (LLMs) have revolutionized various domains but remain vulnerable to prompt injection attacks, where malicious inputs manipulate the model into ignoring original instructions and executing designated action. In this paper, we investigate the underlying mechanisms of these attacks by analyzing the attention patterns within LLMs. We introduce the concept of the distraction effect, where specific attention heads, termed important heads, shift focus from the original instruction to the injected instruction. Building on this discovery, we propose Attention Tracker, a training-free detection method that tracks attention patterns on instruction to detect prompt injection attacks without the need for additional LLM inference. Our method generalizes effectively across diverse models, datasets, and attack types, showing an AUROC improvement of up to 10.0% over existing methods, and performs well even on small LLMs. We demonstrate the robustness of our approach through extensive evaluations and provide insights into safeguarding LLM-integrated systems from prompt injection vulnerabilities.
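"distraction effect" 可以用一个玩具分数来具体化:选定的注意力头在原始指令 token 上投放的注意力质量占比。以下函数名、张量形状与打分规则均为示意性假设,并非论文的实际检测器:

```python
import numpy as np

def distraction_score(attn, instruction_idx, important_heads):
    """Fraction of attention mass the chosen heads put on the original instruction
    tokens; a sharp drop would suggest a prompt-injection 'distraction effect'.
    attn: (heads, query_len, key_len) attention from the last generation step.
    Illustrative sketch only — not the paper's actual detector."""
    scores = []
    for h in important_heads:
        last_query = attn[h, -1]                  # attention of the newest token
        scores.append(last_query[instruction_idx].sum() / last_query.sum())
    return float(np.mean(scores))

attn = np.random.default_rng(1).random((4, 6, 6))
score = distraction_score(attn, instruction_idx=np.arange(3), important_heads=[0, 2])
print(0.0 <= score <= 1.0)  # a ratio in [0, 1]
```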
[AI-39] An Untethered Bioinspired Robotic Tensegrity Dolphin with Multi-Flexibility Design for Aquatic Locomotion
链接: https://arxiv.org/abs/2411.00347
作者: Luyang Zhao,Yitao Jiang,Chun-Yi She,Mingi Jeong,Haibo Dong,Alberto Quattrini Li,Muhao Chen,Devin Balkcom
关键词-EN: soft dolphin robot, dolphin robot, soft dolphin, current dolphin robot, robot
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: 7 pages, 13 figures
点击查看摘要
Abstract:This paper presents the first steps toward a soft dolphin robot using a bio-inspired approach to mimic dolphin flexibility. The current dolphin robot uses a minimalist approach, with only two cable-driven degrees of freedom actuated by a pair of motors. The actuated tail moves up and down in a swimming motion, but this first proof of concept does not permit controlled turns of the robot. While existing robotic dolphins typically use revolute joints to articulate rigid bodies, our design – which will be made open-source – incorporates a flexible tail with tunable silicone skin and actuation flexibility via a cable-driven system, which mimics muscle dynamics and design flexibility with a tunable skeleton structure. The design is also tunable since the backbone can be easily printed in various geometries. The paper provides insights into how a few such variations affect robot motion and efficiency, measured by speed and cost of transport (COT). This approach demonstrates the potential of achieving dolphin-like motion through enhanced flexibility in bio-inspired robotics.
[AI-40] On the Exploration of LM-Based Soft Modular Robot Design
链接: https://arxiv.org/abs/2411.00345
作者: Weicheng Ma,Luyang Zhao,Chun-Yi She,Yitao Jiang,Alan Sun,Bo Zhu,Devin Balkcom,Soroush Vosoughi
关键词-EN: Recent large language, Recent large, enhancing knowledge-based generation, modeling real-world knowledge, demonstrated promising capabilities
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 8 pages, 7 figures
点击查看摘要
Abstract:Recent large language models (LLMs) have demonstrated promising capabilities in modeling real-world knowledge and enhancing knowledge-based generation tasks. In this paper, we further explore the potential of using LLMs to aid in the design of soft modular robots, taking into account both user instructions and physical laws, to reduce the reliance on extensive trial-and-error experiments typically needed to achieve robot designs that meet specific structural or task requirements. Specifically, we formulate the robot design process as a sequence generation task and find that LLMs are able to capture key requirements expressed in natural language and reflect them in the construction sequences of robots. To simplify, rather than conducting real-world experiments to assess design quality, we utilize a simulation tool to provide feedback to the generative model, allowing for iterative improvements without requiring extensive human annotations. Furthermore, we introduce five evaluation metrics to assess the quality of robot designs from multiple angles including task completion and adherence to instructions, supporting an automatic evaluation process. Our model performs well in evaluations for designing soft modular robots with uni- and bi-directional locomotion and stair-descending capabilities, highlighting the potential of using natural language and LLMs for robot design. However, we also observe certain limitations that suggest areas for further improvement.
[AI-41] StepCountJITAI: simulation environment for RL with application to physical activity adaptive intervention NEURIPS2024
链接: https://arxiv.org/abs/2411.00336
作者: Karine Karine,Benjamin M. Marlin
关键词-EN: including improving levels, physical activity JITAI, physical activity, domains including improving, physical activity adaptive
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted at NeurIPS 2024 workshop on Behavioral ML
点击查看摘要
Abstract:The use of reinforcement learning (RL) to learn policies for just-in-time adaptive interventions (JITAIs) is of significant interest in many behavioral intervention domains including improving levels of physical activity. In a messaging-based physical activity JITAI, a mobile health app is typically used to send messages to a participant to encourage engagement in physical activity. In this setting, RL methods can be used to learn what intervention options to provide to a participant in different contexts. However, deploying RL methods in real physical activity adaptive interventions comes with challenges: the cost and time constraints of real intervention studies result in limited data to learn adaptive intervention policies. Further, commonly used RL simulation environments have dynamics that are of limited relevance to physical activity adaptive interventions and thus shed little light on what RL methods may be optimal for this challenging application domain. In this paper, we introduce StepCountJITAI, an RL environment designed to foster research on RL methods that address the significant challenges of policy learning for adaptive behavioral interventions.
[AI-42] Personalized Federated Learning via Feature Distribution Adaptation NEURIPS
链接: https://arxiv.org/abs/2411.00329
作者: Connor J. Mclaughlin,Lili Su
关键词-EN: distributed learning framework, distributed client datasets, framework that leverages, leverages commonalities, Federated learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 38th Annual Conference on Neural Information Processing Systems (NeurIPS), 2024
点击查看摘要
Abstract:Federated learning (FL) is a distributed learning framework that leverages commonalities between distributed client datasets to train a global model. Under heterogeneous clients, however, FL can fail to produce stable training results. Personalized federated learning (PFL) seeks to address this by learning individual models tailored to each client. One approach is to decompose model training into shared representation learning and personalized classifier training. Nonetheless, previous works struggle to navigate the bias-variance trade-off in classifier learning, relying solely on limited local datasets or introducing costly techniques to improve generalization. In this work, we frame representation learning as a generative modeling task, where representations are trained with a classifier based on the global feature distribution. We then propose an algorithm, pFedFDA, that efficiently generates personalized models by adapting global generative classifiers to their local feature distributions. Through extensive computer vision benchmarks, we demonstrate that our method can adjust to complex distribution shifts with significant improvements over current state-of-the-art in data-scarce settings.
[AI-43] Constant Acceleration Flow
链接: https://arxiv.org/abs/2411.00322
作者: Dogyun Park,Sojin Lee,Sihyeon Kim,Taehoon Lee,Youngjoon Hong,Hyunwoo J. Kim
关键词-EN: significantly advanced fast, progressively straightening ordinary, straightening ordinary differential, ordinary differential equation, advanced fast generation
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Rectified flow and reflow procedures have significantly advanced fast generation by progressively straightening ordinary differential equation (ODE) flows. They operate under the assumption that image and noise pairs, known as couplings, can be approximated by straight trajectories with constant velocity. However, we observe that modeling with constant velocity and using reflow procedures have limitations in accurately learning straight trajectories between pairs, resulting in suboptimal performance in few-step generation. To address these limitations, we introduce Constant Acceleration Flow (CAF), a novel framework based on a simple constant acceleration equation. CAF introduces acceleration as an additional learnable variable, allowing for more expressive and accurate estimation of the ODE flow. Moreover, we propose two techniques to further improve estimation accuracy: initial velocity conditioning for the acceleration model and a reflow process for the initial velocity. Our comprehensive studies on toy datasets, CIFAR-10, and ImageNet 64x64 demonstrate that CAF outperforms state-of-the-art baselines for one-step generation. We also show that CAF dramatically improves few-step coupling preservation and inversion over Rectified flow. Code is available at this https URL.
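恒定速度(rectified flow)轨迹与 CAF 的恒定加速度轨迹之差,归根到底是初等运动学。下面的玩具示意固定端点 x(1) = x1,并把初速度当作自由输入;论文中速度与加速度是通过模型学习/条件化得到的,此处仅演示参数化本身:

```python
import numpy as np

def constant_velocity(x0, x1, t):
    # Rectified-flow-style straight path: x(t) = x0 + v*t with v = x1 - x0
    return x0 + t * (x1 - x0)

def constant_acceleration(x0, x1, v0, t):
    # Fixing x(1) = x1 determines a = 2*(x1 - x0 - v0);
    # trajectory: x(t) = x0 + v0*t + 0.5*a*t^2
    a = 2.0 * (x1 - x0 - v0)
    return x0 + v0 * t + 0.5 * a * t ** 2

x0, x1 = np.array([0.0]), np.array([1.0])
print(constant_velocity(x0, x1, 0.5))                           # [0.5]
print(constant_acceleration(x0, x1, np.array([0.2]), 0.5))      # [0.3] — curved path
print(constant_acceleration(x0, x1, np.array([0.2]), 1.0))      # [1.0] — same endpoint
```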
[AI-44] C2A: Client-Customized Adaptation for Parameter-Efficient Federated Learning ACL2023
链接: https://arxiv.org/abs/2411.00311
作者: Yeachan Kim,Junho Kim,Wing-Lam Mok,Jun-Hyung Park,SangKeun Lee
关键词-EN: large memory footprints, memory footprints pose, footprints pose significant, pose significant challenges, versatility of pre-trained
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注: Published at Findings of ACL 2023
点击查看摘要
Abstract:Despite the versatility of pre-trained language models (PLMs) across domains, their large memory footprints pose significant challenges in federated learning (FL), where the training model has to be distributed between a server and clients. One potential solution to bypass such constraints might be the use of parameter-efficient fine-tuning (PEFT) in the context of FL. However, we have observed that typical PEFT tends to severely suffer from heterogeneity among clients in FL scenarios, resulting in unstable and slow convergence. In this paper, we propose Client-Customized Adaptation (C2A), a novel hypernetwork-based FL framework that generates client-specific adapters by conditioning the client information. With the effectiveness of the hypernetworks in generating customized weights through learning to adopt the different characteristics of inputs, C2A can maximize the utility of shared model parameters while minimizing the divergence caused by client heterogeneity. To verify the efficacy of C2A, we perform extensive evaluations on FL scenarios involving heterogeneity in label and language distributions. Comprehensive evaluation results clearly support the superiority of C2A in terms of both efficiency and effectiveness in FL scenarios.
[AI-45] GPT for Games: An Updated Scoping Review (2020-2024)
链接: https://arxiv.org/abs/2411.00308
作者: Daijin Yang,Erica Kleinman,Casper Harteveld
关键词-EN: impressive generative capabilities, GPT impressive generative, generative capabilities, impressive generative, GPT
类目: Artificial Intelligence (cs.AI)
*备注: Submitted to IEEE Transactions on Games
点击查看摘要
Abstract:Due to GPT’s impressive generative capabilities, its applications in games are expanding rapidly. To offer researchers a comprehensive understanding of the current applications and identify both emerging trends and unexplored areas, this paper introduces an updated scoping review of 131 articles, 76 of which were published in 2024, to explore GPT’s potential for games. By coding and synthesizing the papers, we identify five prominent applications of GPT in current game research: procedural content generation, mixed-initiative game design, mixed-initiative gameplay, playing games, and game user research. Drawing on insights from these application areas and emerging research, we propose future studies should focus on expanding the technical boundaries of the GPT models and exploring the complex interaction dynamics between them and users. This review aims to illustrate the state of the art in innovative GPT applications in games, offering a foundation to enrich game development and enhance player experiences through cutting-edge AI innovations.
[AI-46] Inducing Semi-Structured Sparsity by Masking for Efficient Model Inference in Convolutional Networks NEURIPS2024
链接: https://arxiv.org/abs/2411.00288
作者: David A. Danhofer
关键词-EN: necessitates effective acceleration, effective acceleration techniques, standalone vision models, necessitates effective, crucial role
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE); Performance (cs.PF)
*备注: 15 pages, 3 figures; this work will be presented at the NeurIPS 2024 Workshop on Fine-Tuning in Modern Machine Learning: Principles and Scalability (FITML)
点击查看摘要
Abstract:The crucial role of convolutional models, both as standalone vision models and backbones in foundation models, necessitates effective acceleration techniques. This paper proposes a novel method to learn semi-structured sparsity patterns for convolution kernels in the form of maskings enabling the utilization of readily available hardware accelerations. The approach accelerates convolutional models more than two-fold during inference without decreasing model performance. At the same time, the original model weights and structure remain unchanged keeping the model thus easily updatable. Beyond the immediate practical use, the effect of maskings on prediction is easily quantifiable. Therefore, guarantees on model predictions under maskings are derived showing stability bounds for learned maskings even after updating the original underlying model.
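As a rough illustration of the kind of semi-structured pattern that hardware accelerators support, the sketch below derives a 2:4 mask (keep 2 of every 4 weights) from weight magnitude. The paper learns its maskings rather than deriving them from magnitude, so treat this purely as a format example:

```python
import numpy as np

def two_four_mask(weights):
    """Derive a 2:4 semi-structured mask: in every contiguous group of
    four weights, keep the two largest magnitudes and zero the rest.
    (The paper learns its maskings; magnitude pruning here is only a
    stand-in to show the sparsity format.)"""
    flat = weights.reshape(-1, 4)             # groups of four
    order = np.argsort(np.abs(flat), axis=1)  # ascending by magnitude
    mask = np.ones_like(flat)
    np.put_along_axis(mask, order[:, :2], 0.0, axis=1)  # drop 2 smallest
    return mask.reshape(weights.shape)

# 2 output channels, 2 input channels, 3x3 kernels: 36 weights total.
kernel = np.random.default_rng(1).normal(size=(2, 2, 3, 3))
mask = two_four_mask(kernel)
sparse_kernel = kernel * mask  # original weights stay untouched
print(mask.mean())  # exactly half of the weights survive
```

Keeping the mask separate from the weights mirrors the paper's point that the original model remains unchanged and easily updatable: only the elementwise product is sparse.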
[AI-47] MBExplainer: Multilevel bandit-based explanations for downstream models with augmented graph embeddings
链接: https://arxiv.org/abs/2411.00287
作者: Ashkan Golgoon,Ryan Franks,Khashayar Filom,Arjun Ravi Kannan
关键词-EN: local search spaces, tabular features, graph embeddings generated, features, edge features
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Numerical Analysis (math.NA); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:In many industrial applications, it is common that the graph embeddings generated from training GNNs are used in an ensemble model where the embeddings are combined with other tabular features (e.g., original node or edge features) in a downstream ML task. The tabular features may even arise naturally if, e.g., one tries to build a graph such that some of the node or edge features are stored in a tabular format. Here we address the problem of explaining the output of such ensemble models for which the input features consist of learned neural graph embeddings combined with additional tabular features. We propose MBExplainer, a model-agnostic explanation approach for downstream models with augmented graph embeddings. MBExplainer returns a human-legible triple as an explanation for an instance prediction of the whole pipeline consisting of three components: a subgraph with the highest importance, the topmost important nodal features, and the topmost important augmented downstream features. A game-theoretic formulation is used to take the contributions of each component and their interactions into account by assigning three Shapley values corresponding to their own specific games. Finding the explanation requires an efficient search through the local search spaces corresponding to each component. MBExplainer applies a novel multilevel search algorithm that enables simultaneous pruning of local search spaces in a computationally tractable way. In particular, three interweaved Monte Carlo Tree Searches are utilized to iteratively prune the local search spaces. MBExplainer also includes a global search algorithm that uses contextual bandits to efficiently allocate pruning budget among the local search spaces. We show the effectiveness of MBExplainer by presenting a set of comprehensive numerical examples on multiple public graph datasets for both node and graph classification tasks.
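The game-theoretic core above, assigning each player its average marginal contribution, is the ordinary Shapley value. The permutation-sampling sketch below shows that idea for scalar features; MBExplainer instead plays three separate games over whole components (subgraph, nodal features, tabular features) and prunes the search with MCTS and contextual bandits, none of which is reproduced here:

```python
import numpy as np

def shapley_values(model, x, baseline, n_samples=2000, seed=0):
    """Monte Carlo estimate of per-feature Shapley values by averaging
    marginal contributions over random feature orderings. This shows
    the underlying game only; it is not MBExplainer's component-level
    formulation or its multilevel search."""
    rng = np.random.default_rng(seed)
    d = len(x)
    phi = np.zeros(d)
    for _ in range(n_samples):
        perm = rng.permutation(d)
        z = baseline.copy()
        prev = model(z)
        for i in perm:
            z[i] = x[i]              # add feature i to the coalition
            cur = model(z)
            phi[i] += cur - prev     # its marginal contribution
            prev = cur
    return phi / n_samples

# For a linear model the exact Shapley value of feature i is
# w_i * (x_i - baseline_i), which the estimate should recover.
w = np.array([2.0, -1.0, 0.5])
model = lambda z: float(w @ z)
x, base = np.array([1.0, 2.0, 3.0]), np.zeros(3)
est = shapley_values(model, x, base)
print(np.round(est, 2))  # close to [2.0, -2.0, 1.5]
```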
[AI-48] SimpleFSDP: Simpler Fully Sharded Data Parallel with torch.compile
链接: https://arxiv.org/abs/2411.00284
作者: Ruisi Zhang,Tianyu Liu,Will Feng,Andrew Gu,Sanket Purandare,Wanchao Liang,Francisco Massa
关键词-EN: consumes enormous computation, enormous computation resources, requires substantial engineering, substantial engineering efforts, Fully Sharded Data
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Distributed training of large models consumes enormous computation resources and requires substantial engineering efforts to compose various training techniques. This paper presents SimpleFSDP, a PyTorch-native compiler-based Fully Sharded Data Parallel (FSDP) framework, which has a simple implementation for maintenance and composability, allows full computation-communication graph tracing, and brings performance enhancement via compiler backend optimizations. SimpleFSDP’s novelty lies in its unique torch.compile-friendly implementation of collective communications using existing PyTorch primitives, namely parametrizations, selective activation checkpointing, and DTensor. It also features the first-of-its-kind intermediate representation (IR) nodes bucketing and reordering in the TorchInductor backend for effective computation-communication overlapping. As a result, users can employ the aforementioned optimizations to automatically or manually wrap model components for minimal communication exposure. Extensive evaluations of SimpleFSDP on Llama 3 models (including the ultra-large 405B) using TorchTitan demonstrate up to 28.54% memory reduction and 68.67% throughput improvement compared to the most widely adopted FSDP2 eager framework, when composed with other distributed training techniques.
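Stripped of the compiler machinery, the FSDP pattern that SimpleFSDP builds on is: each rank permanently stores only a shard of every parameter and all-gathers the full tensor just in time for compute. A minimal single-process sketch of that idea (SimpleFSDP itself expresses it with PyTorch parametrizations, DTensor, and torch.compile, which are not shown):

```python
import numpy as np

def shard(param, world_size):
    """Split a flat parameter evenly across ranks, FSDP-style."""
    return np.array_split(param, world_size)

def all_gather(shards):
    """Reconstruct the full parameter before compute. In a real FSDP
    run this is a collective communication that the compiler backend
    tries to overlap with computation."""
    return np.concatenate(shards)

# Each rank holds only its shard; the full weight exists only
# transiently around the forward/backward pass.
full = np.arange(8.0)
shards = shard(full, world_size=4)
per_rank_memory = max(s.size for s in shards)   # 2 floats instead of 8
gathered = all_gather(shards)
print(per_rank_memory, np.array_equal(gathered, full))
```

The memory saving is exactly the sharding factor; the engineering difficulty the paper addresses is hiding the all-gather latency, which is where IR-node bucketing and reordering come in.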
[AI-49] Improving Traffic Flow Predictions with SGCN-LSTM: A Hybrid Model for Spatial and Temporal Dependencies
链接: https://arxiv.org/abs/2411.00282
作者: Alexandru T. Cismaru
关键词-EN: increased car accidents, Large amounts, significant time wasted, air pollution, car accidents
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 5 pages, 6 figures
点击查看摘要
Abstract:Large amounts of traffic can lead to negative effects such as increased car accidents, air pollution, and significant time wasted. Understanding traffic speeds on any given road segment can be highly beneficial for traffic management strategists seeking to reduce congestion. While recent studies have primarily focused on modeling spatial dependencies by using graph convolutional networks (GCNs) over fixed weighted graphs, the relationships between nodes are often more complex, with edges that interact dynamically. This paper addresses both the temporal patterns in traffic data and the intricate spatial dependencies by introducing the Signal-Enhanced Graph Convolutional Network Long Short Term Memory (SGCN-LSTM) model for predicting traffic speeds across road networks. Extensive experiments on the PEMS-BAY road network traffic dataset demonstrate the SGCN-LSTM model’s effectiveness, yielding significant improvements in Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and Mean Absolute Percentage Error (MAPE) compared to benchmark models on the same dataset.
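The spatial half of such models is typically a graph convolution over a normalized adjacency. The sketch below shows that basic building block on a toy road network; the paper's signal-enhanced edges and the LSTM temporal module are left out, so this is only the generic GCN step the model extends:

```python
import numpy as np

def gcn_layer(adj, features, weight):
    """One graph-convolution step: mix neighbor features through a
    symmetrically normalized adjacency, then apply a linear map + ReLU.
    The SGCN-LSTM model augments this with signal-enhanced edges and an
    LSTM over time (not shown)."""
    a_hat = adj + np.eye(adj.shape[0])           # add self-loops
    deg = a_hat.sum(axis=1)
    d_inv_sqrt = np.diag(deg ** -0.5)
    norm_adj = d_inv_sqrt @ a_hat @ d_inv_sqrt   # D^-1/2 (A+I) D^-1/2
    return np.maximum(norm_adj @ features @ weight, 0.0)

# Three road segments in a line, two features each (e.g., speed, occupancy).
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
feats = np.array([[60.0, 0.2], [45.0, 0.5], [30.0, 0.8]])
out = gcn_layer(adj, feats, np.eye(2))
print(out.shape)  # one smoothed feature vector per road segment
```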
[AI-50] Quantifying calibration error in modern neural networks through evidence based theory
链接: https://arxiv.org/abs/2411.00265
作者: Koffi Ismael Ouattara
关键词-EN: play pivotal roles, uncertainty play pivotal, roles in decision-making, Expected Calibration Error, deployment in critical
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic (math.LO)
*备注:
点击查看摘要
Abstract:Trustworthiness in neural networks is crucial for their deployment in critical applications, where reliability, confidence, and uncertainty play pivotal roles in decision-making. Traditional performance metrics such as accuracy and precision fail to capture these aspects, particularly in cases where models exhibit overconfidence. To address these limitations, this paper introduces a novel framework for quantifying the trustworthiness of neural networks by incorporating subjective logic into the evaluation of Expected Calibration Error (ECE). This method provides a comprehensive measure of trust, disbelief, and uncertainty by clustering predicted probabilities and fusing opinions using appropriate fusion operators. We demonstrate the effectiveness of this approach through experiments on MNIST and CIFAR-10 datasets, where post-calibration results indicate improved trustworthiness. The proposed framework offers a more interpretable and nuanced assessment of AI models, with potential applications in sensitive domains such as healthcare and autonomous systems.
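The quantity being extended, Expected Calibration Error, is simple to compute: bin predictions by confidence and take the size-weighted average gap between accuracy and confidence in each bin. A plain-ECE sketch (the paper's subjective-logic clustering and opinion fusion on top of it are not shown):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard ECE: bin predictions by confidence, then average the
    |accuracy - confidence| gap per bin, weighted by bin size."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece

# A perfectly calibrated toy model has ECE 0; an overconfident one does not.
perfect = expected_calibration_error([0.75] * 4, [1, 1, 1, 0])
overconf = expected_calibration_error([0.95] * 4, [1, 1, 0, 0])
print(round(perfect, 3), round(overconf, 3))  # 0.0 and 0.45
```

The overconfident case is exactly the failure mode the abstract highlights: accuracy alone (50% in both halves of the example) cannot see it, while the calibration gap can.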
[AI-51] TurtleBench: A Visual Programming Benchmark in Turtle Geometry
链接: https://arxiv.org/abs/2411.00264
作者: Sina Rismanchian,Yasaman Razeghi,Sameer Singh,Shayan Doroudi
关键词-EN: young age, ability to reason, images and scenes, geometric patterns, interpret geometric patterns
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Humans have the ability to reason about geometric patterns in images and scenes from a young age. However, developing large multimodal models (LMMs) capable of similar reasoning remains a challenge, highlighting the need for robust evaluation methods to assess these capabilities. We introduce TurtleBench, a benchmark designed to evaluate LMMs’ capacity to interpret geometric patterns – given visual examples, textual instructions, or both – and generate precise code outputs. Inspired by turtle geometry, a notion used to teach children foundational coding and geometric concepts, TurtleBench features tasks with patterned shapes that have underlying algorithmic logic. Our evaluation reveals that leading LMMs struggle significantly with these tasks, with GPT-4o achieving only 19% accuracy on the simplest tasks and few-shot prompting only marginally improving their performance (~2%). TurtleBench highlights the gap between human and AI performance in intuitive and visual geometrical understanding, setting the stage for future research in this area. TurtleBench stands as one of the few benchmarks to evaluate the integration of visual understanding and code generation capabilities in LMMs. Code and Dataset for this paper is provided here: this https URL
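Turtle geometry programs of the kind the benchmark targets reduce to forward moves and turns. The tiny interpreter below, with a command format assumed purely for illustration, traces such a program and checks that a square closes on itself, the sort of algorithmic regularity underlying the benchmark's patterned shapes:

```python
import math

def turtle_path(moves):
    """Trace turtle commands ('F', distance) and ('T', degrees) and
    return the visited vertices. The command format is an illustrative
    assumption, not TurtleBench's actual task format."""
    x, y, heading = 0.0, 0.0, 0.0
    points = [(x, y)]
    for cmd, arg in moves:
        if cmd == "F":                          # move forward
            x += arg * math.cos(math.radians(heading))
            y += arg * math.sin(math.radians(heading))
            points.append((round(x, 9), round(y, 9)))
        elif cmd == "T":                        # turn left
            heading = (heading + arg) % 360
    return points

# A square: forward 10, turn 90, repeated four times, closes the loop.
square = [("F", 10), ("T", 90)] * 4
pts = turtle_path(square)
closed = math.isclose(pts[0][0], pts[-1][0], abs_tol=1e-6) and \
         math.isclose(pts[0][1], pts[-1][1], abs_tol=1e-6)
print(len(pts), closed)
```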
[AI-52] Understanding Graphical Perception in Data Visualization through Zero-shot Prompting of Vision-Language Models
链接: https://arxiv.org/abs/2411.00257
作者: Grace Guo,Jenna Jiayi Kang,Raj Sanjay Shah,Hanspeter Pfister,Sashank Varma
关键词-EN: Vision Language Models, Vision Language, Language Models, accompanying textual descriptions, textual descriptions
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Vision Language Models (VLMs) have been successful at many chart comprehension tasks that require attending to both the images of charts and their accompanying textual descriptions. However, it is not well established how VLM performance profiles map to human-like behaviors. If VLMs can be shown to have human-like chart comprehension abilities, they can then be applied to a broader range of tasks, such as designing and evaluating visualizations for human readers. This paper lays the foundations for such applications by evaluating the accuracy of zero-shot prompting of VLMs on graphical perception tasks with established human performance profiles. Our findings reveal that VLMs perform similarly to humans under specific task and style combinations, suggesting that they have the potential to be used for modeling human performance. Additionally, variations to the input stimuli show that VLM accuracy is sensitive to stylistic changes such as fill color and chart contiguity, even when the underlying data and data mappings are the same.
[AI-53] Deep Learning Through A Telescoping Lens: A Simple Model Provides Empirical Insights On Grokking, Gradient Boosting, and Beyond NEURIPS2024
链接: https://arxiv.org/abs/2411.00247
作者: Alan Jeffares,Alicia Curth,Mihaela van der Schaar
关键词-EN: neural network learning, Deep learning, applying deep learning, trained neural network, neural network consisting
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: Accepted at Conference on Neural Information Processing Systems (NeurIPS) 2024
点击查看摘要
Abstract:Deep learning sometimes appears to work in unexpected ways. In pursuit of a deeper understanding of its surprising behaviors, we investigate the utility of a simple yet accurate model of a trained neural network consisting of a sequence of first-order approximations telescoping out into a single empirically operational tool for practical analysis. Across three case studies, we illustrate how it can be applied to derive new empirical insights on a diverse range of prominent phenomena in the literature – including double descent, grokking, linear mode connectivity, and the challenges of applying deep learning on tabular data – highlighting that this model allows us to construct and extract metrics that help predict and understand the a priori unexpected performance of neural networks. We also demonstrate that this model presents a pedagogical formalism allowing us to isolate components of the training process even in complex contemporary settings, providing a lens to reason about the effects of design choices such as architecture and optimization strategy, and reveals surprising parallels between neural network learning and gradient boosting.
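The telescoping model rests on the observation that the change in a network's output over training is approximately the sum of per-step first-order terms, f(w_T) - f(w_0) ≈ Σ_t ∇f(w_t)·Δw_t. A scalar toy model (not from the paper) shows numerically how tight this telescoped sum is when gradient steps are small:

```python
import numpy as np

def f(w):                        # "network output" on one fixed input
    return np.tanh(w[0]) + w[1] ** 2

def grad_f(w):
    return np.array([1.0 / np.cosh(w[0]) ** 2, 2.0 * w[1]])

def loss_grad(w):                # gradient of the loss (f(w) - 1)^2 / 2
    return (f(w) - 1.0) * grad_f(w)

w = np.array([0.5, 0.5])
f_start = f(w)
telescoped = 0.0
for _ in range(200):             # gradient descent with a small step
    step = -0.05 * loss_grad(w)
    telescoped += grad_f(w) @ step   # first-order term for this step
    w = w + step

actual = f(w) - f_start
print(round(actual, 4), round(telescoped, 4))  # nearly equal
```

The gap between `actual` and `telescoped` is the accumulated second-order remainder; the paper's contribution is turning this first-order view into operational metrics for real networks.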
[AI-54] Understanding the Limits of Vision Language Models Through the Lens of the Binding Problem
链接: https://arxiv.org/abs/2411.00238
作者: Declan Campbell,Sunayana Rane,Tyler Giallanza,Nicolò De Sabbata,Kia Ghods,Amogh Joshi,Alexander Ku,Steven M. Frankland,Thomas L. Griffiths,Jonathan D. Cohen,Taylor W. Webb
关键词-EN: vision language models, multimodal language models, documented striking heterogeneity, Recent work, language models
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注:
点击查看摘要
Abstract:Recent work has documented striking heterogeneity in the performance of state-of-the-art vision language models (VLMs), including both multimodal language models and text-to-image models. These models are able to describe and generate a diverse array of complex, naturalistic images, yet they exhibit surprising failures on basic multi-object reasoning tasks – such as counting, localization, and simple forms of visual analogy – that humans perform with near perfect accuracy. To better understand this puzzling pattern of successes and failures, we turn to theoretical accounts of the binding problem in cognitive science and neuroscience, a fundamental problem that arises when a shared set of representational resources must be used to represent distinct entities (e.g., to represent multiple objects in an image), necessitating the use of serial processing to avoid interference. We find that many of the puzzling failures of state-of-the-art VLMs can be explained as arising due to the binding problem, and that these failure modes are strikingly similar to the limitations exhibited by rapid, feedforward processing in the human brain.
[AI-55] Protecting Feed-Forward Networks from Adversarial Attacks Using Predictive Coding
链接: https://arxiv.org/abs/2411.00222
作者: Ehsan Ganjidoost,Jeff Orchard
关键词-EN: Machine Learning, modified input image, input image designed, make a mistake, modified input
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:
点击查看摘要
Abstract:An adversarial example is a modified input image designed to cause a Machine Learning (ML) model to make a mistake; these perturbations are often invisible or subtle to human observers and highlight vulnerabilities in a model’s ability to generalize from its training data. Several adversarial attacks can create such examples, each with a different perspective, effectiveness, and perceptibility of changes. Conversely, defending against such adversarial attacks improves the robustness of ML models in image processing and other domains of deep learning. Most defence mechanisms require either a level of model awareness, changes to the model, or access to a comprehensive set of adversarial examples during training, which is impractical. Another option is to use an auxiliary model in a preprocessing manner without changing the primary model. This study presents a practical and effective solution – using predictive coding networks (PCnets) as an auxiliary step for adversarial defence. By seamlessly integrating PCnets into feed-forward networks as a preprocessing step, we substantially bolster resilience to adversarial perturbations. Our experiments on MNIST and CIFAR10 demonstrate the remarkable effectiveness of PCnets in mitigating adversarial examples with about 82% and 65% improvements in robustness, respectively. The PCnet, trained on a small subset of the dataset, leverages its generative nature to effectively counter adversarial efforts, reverting perturbed images closer to their original forms. This innovative approach holds promise for enhancing the security and reliability of neural network classifiers in the face of the escalating threat of adversarial attacks.
[AI-56] ADAPT: A Game-Theoretic and Neuro-Symbolic Framework for Automated Distributed Adaptive Penetration Testing
链接: https://arxiv.org/abs/2411.00217
作者: Haozhe Lei,Yunfei Ge,Quanyan Zhu
关键词-EN: significantly impact workflow, modern critical infrastructure, impact workflow, modern critical, significantly impact
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
*备注:
点击查看摘要
Abstract:The integration of AI into modern critical infrastructure systems, such as healthcare, has introduced new vulnerabilities that can significantly impact workflow, efficiency, and safety. Additionally, the increased connectivity has made traditional human-driven penetration testing insufficient for assessing risks and developing remediation strategies. Consequently, there is a pressing need for a distributed, adaptive, and efficient automated penetration testing framework that not only identifies vulnerabilities but also provides countermeasures to enhance security posture. This work presents ADAPT, a game-theoretic and neuro-symbolic framework for automated distributed adaptive penetration testing, specifically designed to address the unique cybersecurity challenges of AI-enabled healthcare infrastructure networks. We use a healthcare system case study to illustrate the methodologies within ADAPT. The proposed solution enables a learning-based risk assessment. Numerical experiments are used to demonstrate effective countermeasures against various tactical techniques employed by adversarial AI.
[AI-57] Using Large Language Models for a standard assessment mapping for sustainable communities
链接: https://arxiv.org/abs/2411.00208
作者: Jonveaux Luc
关键词-EN: Large Language Models, Language Models, Large Language, Paris Participatory Budget, automate and standardise
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: 8 pages, 2 figures
点击查看摘要
Abstract:This paper presents a new approach to urban sustainability assessment through the use of Large Language Models (LLMs) to streamline the use of the ISO 37101 framework to automate and standardise the assessment of urban initiatives against the six “sustainability purposes” and twelve “issues” outlined in the standard. The methodology includes the development of a custom prompt based on the standard definitions and its application to two different datasets: 527 projects from the Paris Participatory Budget and 398 activities from the PROBONO Horizon 2020 project. The results show the effectiveness of LLMs in quickly and consistently categorising different urban initiatives according to sustainability criteria. The approach is particularly promising when it comes to breaking down silos in urban planning by providing a holistic view of the impact of projects. The paper discusses the advantages of this method over traditional human-led assessments, including significant time savings and improved consistency. However, it also points out the importance of human expertise in interpreting results and ethical considerations. This study can hopefully contribute to the growing body of work on AI applications in urban planning and provides a novel method for operationalising standardised sustainability frameworks in different urban contexts.
[AI-58] Whole-Herd Elephant Pose Estimation from Drone Data for Collective Behavior Analysis
链接: https://arxiv.org/abs/2411.00196
作者: Brody McNutt,Libby Zhang,Angus Carey-Douglas,Fritz Vollrath,Frank Pope,Leandra Brickson
关键词-EN: Samburu National Reserve, National Reserve, Samburu National, utilizing video footage, video footage captured
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted to CV4Animals: Computer Vision for Animal Behavior Tracking and Modeling Workshop in conjunction with Computer Vision and Pattern Recognition 2024
点击查看摘要
Abstract:This research represents a pioneering application of automated pose estimation from drone data to study elephant behavior in the wild, utilizing video footage captured from Samburu National Reserve, Kenya. The study evaluates two pose estimation workflows: DeepLabCut, known for its application in laboratory settings and emerging wildlife fieldwork, and YOLO-NAS-Pose, a newly released pose estimation model not previously applied to wildlife behavioral studies. These models are trained to analyze elephant herd behavior, focusing on low-resolution (~50 pixels) subjects to detect key points such as the head, spine, and ears of multiple elephants within a frame. Both workflows demonstrated acceptable quality of pose estimation on the test set, facilitating the automated detection of basic behaviors crucial for studying elephant herd dynamics. For the metrics selected for pose estimation evaluation on the test set – root mean square error (RMSE), percentage of correct keypoints (PCK), and object keypoint similarity (OKS) – the YOLO-NAS-Pose workflow outperformed DeepLabCut. Additionally, YOLO-NAS-Pose exceeded DeepLabCut in object detection evaluation. This approach introduces a novel method for wildlife behavioral research, including the burgeoning field of wildlife drone monitoring, with significant implications for wildlife conservation.
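Of the three evaluation metrics named above, PCK is the simplest to state: a predicted keypoint counts as correct if it falls within a pixel threshold of the ground truth. A minimal sketch with made-up elephant keypoints (RMSE and OKS follow the same distance-based pattern with different aggregation):

```python
import numpy as np

def pck(pred, truth, threshold):
    """Percentage of Correct Keypoints: a keypoint is correct if the
    prediction lies within `threshold` pixels of the ground truth."""
    dists = np.linalg.norm(pred - truth, axis=-1)
    return float((dists <= threshold).mean())

# Made-up head, spine, and ear keypoints for one elephant (pixel coords).
truth = np.array([[10.0, 10.0], [20.0, 15.0], [8.0, 6.0]])
pred = np.array([[11.0, 10.0], [20.0, 15.0], [20.0, 20.0]])  # ear is off
score = pck(pred, truth, threshold=5.0)
print(score)  # 2 of 3 keypoints within threshold
```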
[AI-59] Monitoring fairness in machine learning models that predict patient mortality in the ICU
链接: https://arxiv.org/abs/2411.00190
作者: Tempest A. van Schaik,Xinggang Liu,Louis Atallah,Omar Badawi
关键词-EN: predict patient mortality, fairness monitoring approach, machine learning models, work proposes, monitoring approach
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 8 pages
点击查看摘要
Abstract:This work proposes a fairness monitoring approach for machine learning models that predict patient mortality in the ICU. We investigate how well models perform for patient groups with different race, sex and medical diagnoses. We investigate documentation bias in clinical measurement, showing how fairness analysis provides a more detailed and insightful comparison of model performance than traditional accuracy metrics alone.
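In practice such monitoring amounts to slicing a performance metric by patient group. The sketch below reports per-group recall (sensitivity), one reasonable choice for a mortality model; the actual metrics, groupings, and thresholds used in the paper may differ:

```python
import numpy as np

def groupwise_recall(y_true, y_pred, groups):
    """Recall (sensitivity) per patient group. A fairness monitor would
    track such slices over time alongside overall accuracy; the group
    labels and the choice of recall here are illustrative."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    report = {}
    for g in np.unique(groups):
        in_group = groups == g
        positives = y_true[in_group] == 1
        if positives.any():
            report[str(g)] = float((y_pred[in_group][positives] == 1).mean())
        else:
            report[str(g)] = float("nan")  # no positives to recall
    return report

y_true = [1, 1, 0, 1, 1, 0]   # mortality outcome
y_pred = [1, 0, 0, 1, 1, 0]   # model predictions
groups = ["A", "A", "A", "B", "B", "B"]
report = groupwise_recall(y_true, y_pred, groups)
print(report)  # group A is under-served relative to group B
```

A 75% overall recall here hides that the model misses half the deaths in one group, which is exactly the gap the abstract says accuracy metrics alone cannot show.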
[AI-60] Building Multi-Agent Copilot towards Autonomous Agricultural Data Management and Analysis
链接: https://arxiv.org/abs/2411.00188
作者: Yu Pan,Jianxin Sun,Hongfeng Yu,Joe Luck,Geng Bai,Nipuna Chamara,Yufeng Ge,Tala Awada
关键词-EN: Current agricultural data, agricultural data management, data management, Current agricultural, data
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注:
点击查看摘要
Abstract:Current agricultural data management and analysis paradigms are to a large extent traditional, in which data collecting, curating, integration, loading, storing, sharing and analyzing still involve too much human effort and know-how. The experts, researchers and the farm operators need to understand the data and the whole process of the data management pipeline to make full use of the data. The essential problem of the traditional paradigm is the lack of a layer of orchestrational intelligence which can understand, organize and coordinate the data processing utilities to maximize data management and analysis outcome. The emerging reasoning and tool-mastering abilities of large language models (LLM) make them a potentially good fit for this position, which helps a shift from the traditional user-driven paradigm to an AI-driven paradigm. In this paper, we propose and explore the idea of a LLM-based copilot for autonomous agricultural data management and analysis. Based on our previously developed platform of Agricultural Data Management and Analytics (ADMA), we build a proof-of-concept multi-agent system called ADMA Copilot, which can understand the user’s intent, make plans for the data processing pipeline and accomplish tasks automatically, in which three agents: a LLM-based controller, an input formatter and an output formatter collaborate together. Different from existing LLM-based solutions, by defining a meta-program graph, our work decouples control flow and data flow to enhance the predictability of the behaviour of the agents. Experiments demonstrate the intelligence, autonomy, efficacy, efficiency, extensibility, flexibility and privacy of our system. Comparison is also made between our system and existing ones to show its superiority and potential.
[AI-61] Self-Healing Machine Learning: A Framework for Autonomous Adaptation in Real-World Environments NEURIPS2024
链接: https://arxiv.org/abs/2411.00186
作者: Paulius Rauba,Nabeel Seedat,Krzysztof Kacprzyk,Mihaela van der Schaar
关键词-EN: data generating process, Real-world machine learning, encounter model performance, performance degradation due, underlying data generating
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Advances in Neural Information Processing Systems 38 (NeurIPS 2024)
点击查看摘要
Abstract:Real-world machine learning systems often encounter model performance degradation due to distributional shifts in the underlying data generating process (DGP). Existing approaches to addressing shifts, such as concept drift adaptation, are limited by their reason-agnostic nature. By choosing from a pre-defined set of actions, such methods implicitly assume that the causes of model degradation are irrelevant to what actions should be taken, limiting their ability to select appropriate adaptations. In this paper, we propose an alternative paradigm to overcome these limitations, called self-healing machine learning (SHML). Contrary to previous approaches, SHML autonomously diagnoses the reason for degradation and proposes diagnosis-based corrective actions. We formalize SHML as an optimization problem over a space of adaptation actions to minimize the expected risk under the shifted DGP. We introduce a theoretical framework for self-healing systems and build an agentic self-healing solution H-LLM which uses large language models to perform self-diagnosis by reasoning about the structure underlying the DGP, and self-adaptation by proposing and evaluating corrective actions. Empirically, we analyze different components of H-LLM to understand why and when it works, demonstrating the potential of self-healing ML.
[AI-62] Clinical Evaluation of Medical Image Synthesis: A Case Study in Wireless Capsule Endoscopy
链接: https://arxiv.org/abs/2411.00178
作者: Panagiota Gatoula,Dimitrios E. Diamantis,Anastasios Koulaouzidis,Cristina Carretero,Stefania Chetcuti-Zammit,Pablo Cortegoso Valdivia,Begoña González-Suárez,Alessandro Mussetto,John Plevris,Alexander Robertson,Bruno Rosa,Ervin Toth,Dimitris K. Iakovidis
关键词-EN: Sharing retrospectively acquired, retrospectively acquired data, Inflammatory Bowel Disease, Wireless Capsule Endoscopy, retrospectively acquired
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
*备注:
点击查看摘要
Abstract:Sharing retrospectively acquired data is essential for both clinical research and training. Synthetic Data Generation (SDG), using Artificial Intelligence (AI) models, can overcome privacy barriers in sharing clinical data, enabling advancements in medical diagnostics. This study focuses on the clinical evaluation of medical SDG, with a proof-of-concept investigation on diagnosing Inflammatory Bowel Disease (IBD) using Wireless Capsule Endoscopy (WCE) images. The paper contributes by a) presenting a protocol for the systematic evaluation of synthetic images by medical experts and b) applying it to assess TIDE-II, a novel variational autoencoder-based model for high-resolution WCE image synthesis, with a comprehensive qualitative evaluation conducted by 10 international WCE specialists, focusing on image quality, diversity, realism, and clinical decision-making. The results show that TIDE-II generates clinically relevant WCE images, helping to address data scarcity and enhance diagnostic tools. The proposed protocol serves as a reference for future research on medical image-generation techniques.
[AI-63] Creativity in the Age of AI: Evaluating the Impact of Generative AI on Design Outputs and Designers Creative Thinking
链接: https://arxiv.org/abs/2411.00168
作者: Yue Fu,Han Bin,Tony Zhou,Marx Wang,Yixin Chen,Zelia Gomes Da Costa Lai,Jacob O. Wobbrock,Alexis Hiniker
关键词-EN: capabilities warrants investigation, increasingly permeates design, permeates design workflows, creative capabilities warrants, increasingly permeates
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:As generative AI (GenAI) increasingly permeates design workflows, its impact on design outcomes and designers’ creative capabilities warrants investigation. We conducted a within-subjects experiment where we asked participants to design advertisements both with and without GenAI support. Our results show that expert evaluators rated GenAI-supported designs as more creative and unconventional (“weird”) despite no significant differences in visual appeal, brand alignment, or usefulness, which highlights the decoupling of novelty from usefulness, the traditional dual components of creativity, in the context of GenAI usage. Moreover, while GenAI does not significantly enhance designers’ overall creative thinking abilities, users were affected differently based on native language and prior AI exposure. Native English speakers experienced reduced relaxation when using AI, whereas designers new to GenAI exhibited gains in divergent thinking, such as idea fluency and flexibility. These findings underscore the variable impact of GenAI on different user groups, suggesting the potential for customized AI tools.
[AI-64] PSL: Rethinking and Improving Softmax Loss from Pairwise Perspective for Recommendation
链接: https://arxiv.org/abs/2411.00163
作者: Weiqin Yang,Jiawei Chen,Xin Xin,Sheng Zhou,Binbin Hu,Yan Feng,Chun Chen,Can Wang
关键词-EN: recommender systems, Pairwise Softmax Loss, widely applied, applied in recommender, Softmax Loss
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注:
点击查看摘要
Abstract:Softmax Loss (SL) is widely applied in recommender systems (RS) and has demonstrated effectiveness. This work analyzes SL from a pairwise perspective, revealing two significant limitations: 1) the relationship between SL and conventional ranking metrics like DCG is not sufficiently tight; 2) SL is highly sensitive to false negative instances. Our analysis indicates that these limitations are primarily due to the use of the exponential function. To address these issues, this work extends SL to a new family of loss functions, termed Pairwise Softmax Loss (PSL), which replaces the exponential function in SL with other appropriate activation functions. While the revision is minimal, we highlight three merits of PSL: 1) it serves as a tighter surrogate for DCG with suitable activation functions; 2) it better balances data contributions; and 3) it acts as a specific BPR loss enhanced by Distributionally Robust Optimization (DRO). We further validate the effectiveness and robustness of PSL through empirical experiments. The code is available at this https URL.
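In its pairwise form, SL for a positive item i is log(1 + Σ_{j≠i} exp(s_j - s_i)), so a single false negative with a large score gap dominates the loss through the exponential. The sketch below contrasts this with a PSL-style variant that swaps the exponential on the score gaps for a ReLU; this simplified form is an assumed reading of the loss family, not the paper's exact definition:

```python
import numpy as np

def softmax_loss(scores, pos):
    """SL in pairwise form: log(1 + sum_{j != pos} exp(s_j - s_pos))."""
    diffs = np.delete(scores, pos) - scores[pos]
    return float(np.log1p(np.exp(diffs).sum()))

def pairwise_softmax_loss(scores, pos):
    """PSL-style variant: replace exp on the pairwise score gaps with a
    ReLU. The paper studies a family of activations; this simplified
    form is an assumption, not the exact definition."""
    diffs = np.delete(scores, pos) - scores[pos]
    return float(np.log1p(np.maximum(diffs, 0.0).sum()))

# Item 2 is a false negative scored far above the positive (item 0):
# the exponential makes SL explode, while the PSL variant grows gently.
scores = np.array([2.0, 0.5, 6.0])
sl = softmax_loss(scores, pos=0)
psl = pairwise_softmax_loss(scores, pos=0)
print(round(sl, 2), round(psl, 2))
```

The contrast in loss magnitudes on the same scores illustrates the false-negative sensitivity the abstract attributes to the exponential function.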
[AI-65] Unlocking the Potential of Global Human Expertise NEURIPS2024
链接: https://arxiv.org/abs/2411.00156
作者: Elliot Meyerson,Olivier Francon,Darren Sargent,Babak Hodjat,Risto Miikkulainen
关键词-EN: Solving societal problems, Solving societal, global scale requires, societal problems, scale requires
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Neural and Evolutionary Computing (cs.NE)
*备注: NeurIPS 2024; Main Paper 15 pages, Appendix 11 pages
点击查看摘要
Abstract:Solving societal problems on a global scale requires the collection and processing of ideas and methods from diverse sets of international experts. As the number and diversity of human experts increase, so does the likelihood that elements in this collective knowledge can be combined and refined to discover novel and better solutions. However, it is difficult to identify, combine, and refine complementary information in an increasingly large and diverse knowledge base. This paper argues that artificial intelligence (AI) can play a crucial role in this process. An evolutionary AI framework, termed RHEA, fills this role by distilling knowledge from diverse models created by human experts into equivalent neural networks, which are then recombined and refined in a population-based search. The framework was implemented in a formal synthetic domain, demonstrating that it is transparent and systematic. It was then applied to the results of the XPRIZE Pandemic Response Challenge, in which over 100 teams of experts across 23 countries submitted models based on diverse methodologies to predict COVID-19 cases and suggest non-pharmaceutical intervention policies for 235 nations, states, and regions across the globe. Building upon this expert knowledge, by recombining and refining the 169 resulting policy suggestion models, RHEA discovered a broader and more effective set of policies than either AI or human experts alone, as evaluated based on real-world data. The results thus suggest that AI can play a crucial role in realizing the potential of human expertise in global problem-solving.
[AI-66] Responsibility-aware Strategic Reasoning in Probabilistic Multi-Agent Systems
链接: https://arxiv.org/abs/2411.00146
作者: Chunyan Mu,Muhammad Najib,Nir Oren
关键词-EN: trustworthy autonomous systems, Probabilistic Alternating-time Temporal, Alternating-time Temporal Logic, plays a key, key role
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Responsibility plays a key role in the development and deployment of trustworthy autonomous systems. In this paper, we focus on the problem of strategic reasoning in probabilistic multi-agent systems with responsibility-aware agents. We introduce the logic PATL+R, a variant of Probabilistic Alternating-time Temporal Logic. The novelty of PATL+R lies in its incorporation of modalities for causal responsibility, providing a framework for responsibility-aware multi-agent strategic reasoning. We present an approach to synthesise joint strategies that satisfy an outcome specified in PATL+R, while optimising the share of expected causal responsibility and reward. This provides a notion of balanced distribution of responsibility and reward gain among agents. To this end, we utilise the Nash equilibrium as the solution concept for our strategic reasoning problem and demonstrate how to compute responsibility-aware Nash equilibrium strategies via a reduction to parametric model checking of concurrent stochastic multi-player games.
[AI-67] Learning Low-Dimensional Strain Models of Soft Robots by Looking at the Evolution of Their Shape with Application to Model-Based Control
链接: https://arxiv.org/abs/2411.00138
作者: Ricardo Valadas,Maximilian Stölzle,Jingyue Liu,Cosimo Della Santina
关键词-EN: Obtaining dynamic models, Obtaining dynamic, first-principle solutions, continuum soft robots, researchers have devoted
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: 8 pages, under review
点击查看摘要
Abstract:Obtaining dynamic models of continuum soft robots is central to the analysis and control of soft robots, and researchers have devoted much attention to the challenge of proposing both data-driven and first-principle solutions. Both avenues have, however, shown their limitations; the former lacks structure and performs poorly outside training data, while the latter requires significant simplifications and extensive expert knowledge to be used in practice. This paper introduces a streamlined method for learning low-dimensional, physics-based models that are both accurate and easy to interpret. We start with an algorithm that uses image data (i.e., shape evolutions) to determine the minimal necessary segments for describing a soft robot’s movement. Following this, we apply a dynamic regression and strain sparsification algorithm to identify relevant strains and define the model’s dynamics. We validate our approach through simulations with various planar soft manipulators, comparing its performance against other learning strategies, showing that our models are both computationally efficient and 25x more accurate on out-of-training-distribution inputs. Finally, we demonstrate that, thanks to the method’s ability to generate physically compatible models, the learned models can be straightforwardly combined with model-based control policies.
[AI-68] Beyond Accuracy: Ensuring Correct Predictions With Correct Rationales NEURIPS2024
链接: https://arxiv.org/abs/2411.00132
作者: Tang Li,Mengmeng Ma,Xi Peng
关键词-EN: surpass human experts, Large pretrained foundation, demonstrate exceptional performance, Large pretrained, high-stakes applications
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: In Proceedings of the 38th Conference on Neural Information Processing Systems (NeurIPS 2024)
点击查看摘要
Abstract:Large pretrained foundation models demonstrate exceptional performance and, in some high-stakes applications, even surpass human experts. However, most of these models are currently evaluated primarily on prediction accuracy, overlooking the validity of the rationales behind their accurate predictions. For the safe deployment of foundation models, there is a pressing need to ensure double-correct predictions, i.e., correct prediction backed by correct rationales. To achieve this, we propose a two-phase scheme: First, we curate a new dataset that offers structured rationales for visual recognition tasks. Second, we propose a rationale-informed optimization method to guide the model in disentangling and localizing visual evidence for each rationale, without requiring manual annotations. Extensive experiments and ablation studies demonstrate that our model outperforms state-of-the-art models by up to 10.1% in prediction accuracy across a wide range of tasks. Furthermore, our method significantly improves the model’s rationale correctness, improving localization by 7.5% and disentanglement by 36.5%. Our dataset, source code, and pretrained weights: this https URL
[AI-69] Training and Evaluating Causal Forecasting Models for Time-Series
链接: https://arxiv.org/abs/2411.00126
作者: Thomas Crasson,Yacine Nabet,Mathias Lécuyer
关键词-EN: Deep learning time-series, inform downstream decisions, Deep learning, time-series models, make forecasts
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Deep learning time-series models are often used to make forecasts that inform downstream decisions. Since these decisions can differ from those in the training set, there is an implicit requirement that time-series models will generalize outside of their training distribution. Despite this core requirement, time-series models are typically trained and evaluated on in-distribution predictive tasks. We extend the orthogonal statistical learning framework to train causal time-series models that generalize better when forecasting the effect of actions outside of their training distribution. To evaluate these models, we leverage Regression Discontinuity Designs popular in economics to construct a test set of causal treatment effects.
[AI-70] I Can Hear You: Selective Robust Training for Deepfake Audio Detection
链接: https://arxiv.org/abs/2411.00121
作者: Zirui Zhang,Wei Hao,Aroon Sankoh,William Lin,Emanuel Mendiola-Ortiz,Junfeng Yang,Chengzhi Mao
关键词-EN: detecting deepfake audio, Recent advances, posing risks, spread of disinformation, advances in AI-generated
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注:
点击查看摘要
Abstract:Recent advances in AI-generated voices have intensified the challenge of detecting deepfake audio, posing risks for scams and the spread of disinformation. To tackle this issue, we establish the largest public voice dataset to date, named DeepFakeVox-HQ, comprising 1.3 million samples, including 270,000 high-quality deepfake samples from 14 diverse sources. Despite previously reported high accuracy, existing deepfake voice detectors struggle with our diversely collected dataset, and their detection success rates drop even further under realistic corruptions and adversarial attacks. We conduct a holistic investigation into factors that enhance model robustness and show that incorporating a diversified set of voice augmentations is beneficial. Moreover, we find that the best detection models often rely on high-frequency features, which are imperceptible to humans and can be easily manipulated by an attacker. To address this, we propose the F-SAT: Frequency-Selective Adversarial Training method focusing on high-frequency components. Empirical results demonstrate that using our training dataset boosts baseline model performance (without robust training) by 33%, and our robust training further improves accuracy by 7.7% on clean samples and by 29.3% on corrupted and attacked samples, over the state-of-the-art RawNet3 model.
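The frequency-selective idea, isolating the high-frequency band that detectors over-rely on and attackers can manipulate, can be sketched with a simple FFT mask over a 1-D signal. This is a minimal illustration only; the cutoff ratio and masking scheme are assumptions, not the paper's F-SAT procedure.

```python
import numpy as np

def high_frequency_component(signal, cutoff_ratio=0.5):
    # Keep only the upper part of the spectrum of a 1-D signal.
    # cutoff_ratio is an illustrative parameter: 0.5 keeps the top
    # half of the FFT bins and zeroes the rest.
    spectrum = np.fft.rfft(signal)
    cutoff = int(len(spectrum) * cutoff_ratio)
    mask = np.zeros_like(spectrum)
    mask[cutoff:] = 1.0
    return np.fft.irfft(spectrum * mask, n=len(signal))
```

An adversarial-training loop in this spirit could restrict perturbations to this component, leaving the perceptually dominant low band untouched.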
[AI-71] Project Sid: Many-agent simulations toward AI civilization
链接: https://arxiv.org/abs/2411.00114
作者: Altera.AL,Andrew Ahn,Nic Becker,Stephanie Carroll,Nico Christie,Manuel Cortes,Arda Demirci,Melissa Du,Frankie Li,Shuying Luo,Peter Y Wang,Mathew Willows,Feitong Yang,Guangyu Robert Yang
关键词-EN: interactions remain limited, small groups, scope and complexity, evaluated in isolation, interactions remain
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注: 35 pages, 14 figures
点击查看摘要
Abstract:AI agents have been evaluated in isolation or within small groups, where interactions remain limited in scope and complexity. Large-scale simulations involving many autonomous agents – reflecting the full spectrum of civilizational processes – have yet to be explored. Here, we demonstrate how 10 - 1000+ AI agents behave and progress within agent societies. We first introduce the PIANO (Parallel Information Aggregation via Neural Orchestration) architecture, which enables agents to interact with humans and other agents in real-time while maintaining coherence across multiple output streams. We then evaluate agent performance in agent simulations using civilizational benchmarks inspired by human history. These simulations, set within a Minecraft environment, reveal that agents are capable of meaningful progress – autonomously developing specialized roles, adhering to and changing collective rules, and engaging in cultural and religious transmission. These preliminary results show that agents can achieve significant milestones towards AI civilizations, opening new avenues for large simulations, agentic organizational intelligence, and integrating AI into human civilizations.
[AI-72] PARTNR: A Benchmark for Planning and Reasoning in Embodied Multi-agent Tasks
链接: https://arxiv.org/abs/2411.00081
作者: Matthew Chang,Gunjan Chhablani,Alexander Clegg,Mikael Dallaire Cote,Ruta Desai,Michal Hlavac,Vladimir Karashchuk,Jacob Krantz,Roozbeh Mottaghi,Priyam Parashar,Siddharth Patki,Ishita Prasad,Xavier Puig,Akshara Rai,Ram Ramrakhya,Daniel Tran,Joanne Truong,John M. Turner,Eric Undersander,Tsung-Yen Yang
关键词-EN: study human-robot coordination, humaN-Robot collaboration, Reasoning Tasks, study human-robot, designed to study
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: Alphabetical author order
点击查看摘要
Abstract:We present a benchmark for Planning And Reasoning Tasks in humaN-Robot collaboration (PARTNR) designed to study human-robot coordination in household activities. PARTNR tasks exhibit characteristics of everyday tasks, such as spatial, temporal, and heterogeneous agent capability constraints. We employ a semi-automated task generation pipeline using Large Language Models (LLMs), incorporating simulation in the loop for grounding and verification. PARTNR stands as the largest benchmark of its kind, comprising 100,000 natural language tasks, spanning 60 houses and 5,819 unique objects. We analyze state-of-the-art LLMs on PARTNR tasks, across the axes of planning, perception and skill execution. The analysis reveals significant limitations in SoTA models, such as poor coordination and failures in task tracking and recovery from errors. When LLMs are paired with real humans, they require 1.5x as many steps as two humans collaborating and 1.1x more steps than a single human, underscoring the potential for improvement in these models. We further show that fine-tuning smaller LLMs with planning data can achieve performance on par with models 9 times larger, while being 8.6x faster at inference. Overall, PARTNR highlights significant challenges facing collaborative embodied agents and aims to drive research in this direction.
[AI-73] How Good Are We? Evaluating Cell AI Foundation Models in Kidney Pathology with Human-in-the-Loop Enrichment
链接: https://arxiv.org/abs/2411.00078
作者: Junlin Guo,Siqi Lu,Can Cui,Ruining Deng,Tianyuan Yao,Zhewen Tao,Yizhe Lin,Marilyn Lionts,Quan Liu,Juming Xiong,Yu Wang,Shilin Zhao,Catie Chang,Mitchell Wilkes,Mengmeng Yin,Haichun Yang,Yuankai Huo
关键词-EN: including digital pathology, promising large-scale learning, large-scale learning approach, real-world healthcare challenges, addressing real-world healthcare
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
*备注:
点击查看摘要
Abstract:Training AI foundation models has emerged as a promising large-scale learning approach for addressing real-world healthcare challenges, including digital pathology. While many of these models have been developed for tasks like disease diagnosis and tissue quantification using extensive and diverse training datasets, their readiness for deployment on some of the arguably simplest tasks, such as nuclei segmentation within a single organ (e.g., the kidney), remains uncertain. This paper seeks to answer this key question, “How good are we?”, by thoroughly evaluating the performance of recent cell foundation models on a curated multi-center, multi-disease, and multi-species external testing dataset. Additionally, we tackle a more challenging question, “How can we improve?”, by developing and assessing human-in-the-loop data enrichment strategies aimed at enhancing model performance while minimizing the reliance on pixel-level human annotation. To address the first question, we curated a multi-center, multi-disease, and multi-species dataset consisting of 2,542 kidney whole slide images (WSIs). Three state-of-the-art (SOTA) cell foundation models (Cellpose, StarDist, and CellViT) were selected for evaluation. To tackle the second question, we explored data enrichment algorithms by distilling predictions from the different foundation models with a human-in-the-loop framework, aiming to further enhance foundation model performance with minimal human effort. Our experimental results showed that all three foundation models improved over their baselines after fine-tuning on the enriched data. Interestingly, the baseline model with the highest F1 score does not yield the best segmentation outcomes after fine-tuning. This study establishes a benchmark for the development and deployment of cell vision foundation models tailored for real-world data applications.
[AI-74] RPS: A Generic Reservoir Patterns Sampler
链接: https://arxiv.org/abs/2411.00074
作者: Lamine Diop,Marc Plantevit,Arnaud Soulet
关键词-EN: modern data analysis, data analysis due, Efficient learning, important for modern, analysis due
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Combinatorics (math.CO); Probability (math.PR)
*备注: Accepted at 2024 IEEE International Conference on Big Data
点击查看摘要
Abstract:Efficient learning from streaming data is important for modern data analysis due to the continuous and rapid evolution of data streams. Despite significant advancements in stream pattern mining, challenges persist, particularly in managing complex data streams like sequential and weighted itemsets. While reservoir sampling serves as a fundamental method for randomly selecting fixed-size samples from data streams, its application to such complex patterns remains largely unexplored. In this study, we introduce an approach that harnesses a weighted reservoir to facilitate direct pattern sampling from streaming batch data, thus ensuring scalability and efficiency. We present a generic algorithm capable of addressing temporal biases and handling various pattern types, including sequential, weighted, and unweighted itemsets. Through comprehensive experiments conducted on real-world datasets, we evaluate the effectiveness of our method, showcasing its ability to construct accurate incremental online classifiers for sequential data. Our approach not only enables previously unusable online machine learning models for sequential data to achieve accuracy comparable to offline baselines but also represents significant progress in the development of incremental online sequential itemset classifiers.
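The weighted-reservoir building block the abstract refers to can be illustrated with the classic Efraimidis-Spirakis A-Res scheme: keep the k stream items with the largest key u^(1/w). This is a generic sketch of weighted reservoir sampling over a stream, not the paper's pattern-sampling algorithm.

```python
import heapq
import random

def weighted_reservoir_sample(stream, k, rng=random):
    # A-Res weighted reservoir sampling: for each (item, weight) pair,
    # draw u ~ Uniform(0, 1) and use key = u ** (1 / weight); keep the
    # k items with the largest keys in a min-heap.
    heap = []  # min-heap of (key, item)
    for item, weight in stream:
        key = rng.random() ** (1.0 / weight)
        if len(heap) < k:
            heapq.heappush(heap, (key, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, item))
    return [item for _, item in heap]
```

Because each item is inspected exactly once and only the heap is retained, the sampler runs in a single pass with O(k) memory, the property that makes reservoir methods attractive for streaming batch data.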
[AI-75] Meta-Sealing: A Revolutionizing Integrity Assurance Protocol for Transparent Tamper-Proof and Trustworthy AI System
链接: https://arxiv.org/abs/2411.00069
作者: Mahesh Vaijainthymala Krishnamoorthy
关键词-EN: public safety-has made, Artificial intelligence, safety-has made system, maintaining societal trust, made system integrity
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: 24 pages, 3 figures and 10 Code blocks, to be presented in the conference
点击查看摘要
Abstract:Artificial intelligence in critical sectors (healthcare, finance, and public safety) has made system integrity paramount for maintaining societal trust. Current verification methods for AI systems lack comprehensive lifecycle assurance, creating significant vulnerabilities in the deployment of both powerful and trustworthy AI. This research introduces Meta-Sealing, a cryptographic framework that fundamentally changes integrity verification in AI systems throughout their operational lifetime. Meta-Sealing surpasses traditional integrity protocols through its implementation of cryptographic seal chains, establishing verifiable, immutable records for all system decisions and transformations. The framework combines advanced cryptography with distributed verification, delivering tamper-evident guarantees that achieve both mathematical rigor and computational efficiency. Our implementation addresses urgent regulatory requirements for AI system transparency and auditability. The framework integrates with current AI governance standards, specifically the EU’s AI Act and the FDA’s healthcare AI guidelines, enabling organizations to maintain operational efficiency while meeting compliance requirements. Testing on financial institution data demonstrated Meta-Sealing’s capability to reduce audit timeframes by 62% while enhancing stakeholder confidence by 47%. These results can establish a new benchmark for integrity assurance in enterprise AI deployments. This research presents Meta-Sealing not merely as a technical solution, but as a foundational framework ensuring AI system integrity aligns with human values and regulatory requirements. As AI continues to influence critical decisions, Meta-Sealing provides the necessary bridge between technological advancement and verifiable trust, serving as a guardian that ensures the AI systems we depend on are as reliable and transparent as they are powerful.
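The "cryptographic seal chain" concept, in which each decision record is sealed against the previous seal so that tampering anywhere invalidates everything downstream, can be sketched as a simple hash chain. The field names and genesis value below are illustrative assumptions, not the paper's protocol.

```python
import hashlib
import json

def seal(record, prev_seal):
    # Bind a record's canonical JSON serialization to the previous seal;
    # any change to an earlier record changes every subsequent seal.
    payload = json.dumps(record, sort_keys=True).encode()
    return hashlib.sha256(prev_seal + payload).hexdigest().encode()

def verify_chain(records, genesis=b"genesis"):
    # records: list of (record, recorded_seal) pairs in chain order.
    prev = genesis
    for record, recorded_seal in records:
        if seal(record, prev) != recorded_seal:
            return False
        prev = recorded_seal
    return True
```

A verifier holding only the final seal can detect any modification, insertion, or deletion earlier in the chain, which is the tamper-evidence property the framework builds on.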
[AI-76] The ISCSLP 2024 Conversational Voice Clone (CoVoC) Challenge: Tasks, Results and Findings
链接: https://arxiv.org/abs/2411.00064
作者: Kangxiang Xia,Dake Guo,Jixun Yao,Liumeng Xue,Hanzhao Li,Shuai Wang,Zhao Guo,Lei Xie,Qingqing Zhang,Lei Luo,Minghui Dong,Peng Sun
关键词-EN: Conversational Voice Clone, style voice cloning, spontaneous style voice, Voice Clone, advance zero-shot spontaneous
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
*备注: accepted by ISCSLP 2024
点击查看摘要
Abstract:The ISCSLP 2024 Conversational Voice Clone (CoVoC) Challenge aims to benchmark and advance zero-shot spontaneous style voice cloning, particularly focusing on generating spontaneous behaviors in conversational speech. The challenge comprises two tracks: an unconstrained track without limitation on data and model usage, and a constrained track only allowing the use of constrained open-source datasets. A 100-hour high-quality conversational speech dataset is also made available with the challenge. This paper details the data, tracks, submitted systems, evaluation results, and findings.
[AI-77] P2C2Net: PDE-Preserved Coarse Correction Network for efficient prediction of spatiotemporal dynamics
链接: https://arxiv.org/abs/2411.00040
作者: Qi Wang,Pu Ren,Hao Zhou,Xin-Yang Liu,Zhiwen Deng,Yi Zhang,Ruizhi Chengze,Hongsheng Liu,Zidong Wang,Jian-Xun Wang,Ji-Rong Wen,Hao Sun,Yang Liu
关键词-EN: partial differential equations, solving partial differential, require fine mesh, small time stepping, classical numerical methods
类目: Numerical Analysis (math.NA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:When solving partial differential equations (PDEs), classical numerical methods often require fine mesh grids and small time stepping to meet stability, consistency, and convergence conditions, leading to high computational cost. Recently, machine learning has been increasingly utilized to solve PDE problems, but such approaches often encounter challenges related to interpretability, generalizability, and strong dependency on rich labeled data. Hence, we introduce a new PDE-Preserved Coarse Correction Network (P^2C^2Net) to efficiently solve spatiotemporal PDE problems on coarse mesh grids in small data regimes. The model consists of two synergistic modules: (1) a trainable PDE block that learns to update the coarse solution (i.e., the system state), based on a high-order numerical scheme with boundary condition encoding, and (2) a neural network block that consistently corrects the solution on the fly. In particular, we propose a learnable symmetric Conv filter, with weights shared over the entire model, to accurately estimate the spatial derivatives of the PDE based on the neural-corrected system state. The resulting physics-encoded model is capable of handling limited training data (e.g., 3–5 trajectories) and accelerates the prediction of PDE solutions on coarse spatiotemporal grids while maintaining high accuracy. P^2C^2Net achieves consistent state-of-the-art performance with over 50% gain (e.g., in terms of relative prediction error) across four datasets covering complex reaction-diffusion processes and turbulent flows.
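The role of the symmetric Conv filter, estimating spatial derivatives of the system state on a coarse grid, can be sketched in 1-D with a fixed central-difference stencil. In the paper the symmetric weights are learnable and shared across the model; this sketch uses the classic fixed stencil instead.

```python
import numpy as np

def symmetric_second_derivative(u, dx, stencil=None):
    # Estimate d^2u/dx^2 by convolving with a symmetric 3-point stencil.
    # [1, -2, 1] is the standard central-difference kernel; a learnable
    # filter would parameterize these weights under a symmetry constraint.
    if stencil is None:
        stencil = np.array([1.0, -2.0, 1.0])
    return np.convolve(u, stencil, mode="same") / dx**2
```

Because the stencil is symmetric, the flip performed by convolution has no effect, and boundary values (where the kernel runs off the grid) would in practice be handled by the boundary-condition encoding the abstract mentions.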
[AI-78] A Theoretical Review on Solving Algebra Problems
链接: https://arxiv.org/abs/2411.00031
作者: Xinguo Yu,Weina Cheng,Chuanzhi Yang,Ting Zhang
关键词-EN: attract significant research, significant research interest, continues to attract, past decade, attract significant
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Symbolic Computation (cs.SC)
*备注: 22pages,5figures
点击查看摘要
Abstract:Solving algebra problems (APs) continues to attract significant research interest, as evidenced by the large number of algorithms and theories proposed over the past decade. Despite these important research contributions, however, the body of work remains incomplete in terms of theoretical justification and scope. The current contribution intends to fill the gap by developing a review framework that aims to lay a theoretical base, create an evaluation scheme, and extend the scope of the investigation. This paper first develops the State Transform Theory (STT), which emphasizes that problem-solving algorithms are structured according to states and transforms, unlike traditional surveys, which merely emphasize the progress of transforms. The STT thus lays the theoretical basis for a new framework for reviewing algorithms. This new construct accommodates the relation-centric algorithms for solving both word and diagrammatic algebra problems. The latter not only highlights the necessity of introducing new states but also reveals contributions of individual algorithms that were obscured in prior reviews lacking this approach.
[AI-79] Applying Data Driven Decision Making to rank Vocational and Educational Training Programs with TOPSIS
链接: https://arxiv.org/abs/2411.00017
作者: J. M. Conejero,J. C. Preciado,A. E. Prieto,M. C. Bas,V. J. Bolos
关键词-EN: Programs in Extremadura, Vocational and Educational, Educational Programs, classification of Vocational, period 2009-2016
类目: Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
*备注: 18 pages, 7 figures
点击查看摘要
Abstract:In this paper, we present a multi-criteria classification of Vocational and Educational Programs in Extremadura (Spain) during the period 2009-2016. This ranking has been carried out by integrating the detailed information of individuals finishing such studies, together with their labor data, into a complete database. The multi-criteria method used is TOPSIS, together with a new decision support method for assessing the influence of each criterion and its dependence on the weights assigned to them. This new method is based on a worst-best case scenario analysis, and it is compared to a well-known global sensitivity analysis technique based on Pearson’s correlation ratio.
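The TOPSIS method used for the ranking follows the standard textbook steps: vector-normalise the decision matrix, weight it, and score each alternative by its relative closeness to the ideal solution. The sketch below implements that classic formulation; the paper's actual criteria and weights are not reproduced here.

```python
import numpy as np

def topsis(matrix, weights, benefit):
    # matrix: alternatives x criteria; weights: one per criterion;
    # benefit: True where larger is better, False where smaller is better.
    m = np.asarray(matrix, dtype=float)
    norm = m / np.linalg.norm(m, axis=0)        # vector normalisation
    v = norm * np.asarray(weights, dtype=float)  # weighted matrix
    ideal = np.where(benefit, v.max(axis=0), v.min(axis=0))
    anti = np.where(benefit, v.min(axis=0), v.max(axis=0))
    d_pos = np.linalg.norm(v - ideal, axis=1)    # distance to ideal
    d_neg = np.linalg.norm(v - anti, axis=1)     # distance to anti-ideal
    return d_neg / (d_pos + d_neg)               # relative closeness
```

Ranking the returned closeness scores in descending order gives the program ordering; re-running with perturbed weights is the kind of analysis the paper's worst-best case scenario method formalises.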
[AI-80] Personality-Guided Code Generation Using Large Language Models
链接: https://arxiv.org/abs/2411.00006
作者: Yaoqi Guo,Zhenpeng Chen,Jie M. Zhang,Yang Liu,Yun Ma
关键词-EN: garnered significant attention, significant attention due, natural language descriptions, streamline software development, Code generation
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Code generation, the automatic creation of source code from natural language descriptions, has garnered significant attention due to its potential to streamline software development. Inspired by research that links task-personality alignment with improved development outcomes, we conduct an empirical study on personality-guided code generation using large language models (LLMs). Specifically, we investigate how emulating personality traits appropriate to the coding tasks affects LLM performance. We extensively evaluate this approach using seven widely adopted LLMs across four representative datasets. Our results show that personality guidance significantly enhances code generation accuracy, with improved pass rates in 23 out of 28 LLM-dataset combinations. Notably, in 11 cases, the improvement exceeds 5%, and in 5 instances, it surpasses 10%, with the highest gain reaching 12.9%. Additionally, personality guidance can be easily integrated with other prompting strategies to further boost performance.
[AI-81] Mastering the Craft of Data Synthesis for CodeLLMs
链接: https://arxiv.org/abs/2411.00005
作者: Meng Chen,Philip Arthur,Qianyu Feng,Cong Duy Vu Hoang,Yu-Heng Hong,Mahdi Kazemi Moghaddam,Omid Nezami,Thien Nguyen,Gioacchino Tangari,Duy Vu,Thanh Vu,Mark Johnson,Krishnaram Kenthapadi,Don Dharmasiri,Long Duong,Yuan-Fang Li
关键词-EN: Large language models, making coding tasks, shown impressive performance, Large language, LLM evaluation
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Large language models (LLMs) have shown impressive performance in code understanding and generation, making coding tasks a key focus for researchers due to their practical applications and value as a testbed for LLM evaluation. Data synthesis and filtering techniques have been widely adopted and shown to be highly effective in this context. In this paper, we present a focused survey and taxonomy of these techniques, emphasizing recent advancements. We highlight key challenges, explore future research directions, and offer practical guidance for new researchers entering the field.
[AI-82] IC/DC: Surpassing Heuristic Solvers in Combinatorial Optimization with Diffusion Models
链接: https://arxiv.org/abs/2411.00003
作者: Seong-Hyun Hong,Hyun-Sung Kim,Zian Jang,Byung-Jun Lee
关键词-EN: shown promising results, Recent advancements, solving NP-hard problems, learning-based combinatorial optimization, Travelling Salesman Problem
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
点击查看摘要
Abstract:Recent advancements in learning-based combinatorial optimization (CO) methods have shown promising results in solving NP-hard problems without the need for expert-crafted heuristics. However, the high performance of these approaches often relies on problem-specific, human-expertise-based search after generating candidate solutions, limiting their applicability to commonly solved CO problems such as the Travelling Salesman Problem (TSP). In this paper, we present IC/DC, a CO framework that operates without any supervision. IC/DC specializes in addressing problems involving two distinct sets of items, and it does not need problem-specific search processes to generate valid solutions. IC/DC employs a novel architecture capable of capturing the intricate relationships between items, thereby enabling effective optimization in challenging CO scenarios. We train our model in a self-supervised way to minimize the cost of the solution while adhering to the problem-specific constraints. IC/DC not only achieves state-of-the-art performance compared to previous learning methods, but also surpasses well-known solvers and heuristic approaches on the Asymmetric Traveling Salesman Problem (ATSP).
[AI-83] Cross-Fundus Transformer for Multi-modal Diabetic Retinopathy Grading with Cataract
链接: https://arxiv.org/abs/2411.00726
作者: Fan Xiao,Junlin Hou,Ruiwei Zhao,Rui Feng,Haidong Zou,Lina Lu,Yi Xu,Juzhao Zhang
关键词-EN: Diabetic retinopathy, complication of diabetes, blindness worldwide, common complication, color fundus photography
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages, 4 figures
点击查看摘要
Abstract:Diabetic retinopathy (DR) is a leading cause of blindness worldwide and a common complication of diabetes. As two different imaging tools for DR grading, color fundus photography (CFP) and infrared fundus photography (IFP) are highly-correlated and complementary in clinical applications. To the best of our knowledge, this is the first study that explores a novel multi-modal deep learning framework to fuse the information from CFP and IFP towards more accurate DR grading. Specifically, we construct a dual-stream architecture Cross-Fundus Transformer (CFT) to fuse the ViT-based features of two fundus image modalities. In particular, a meticulously engineered Cross-Fundus Attention (CFA) module is introduced to capture the correspondence between CFP and IFP images. Moreover, we adopt both the single-modality and multi-modality supervisions to maximize the overall performance for DR grading. Extensive experiments on a clinical dataset consisting of 1,713 pairs of multi-modal fundus images demonstrate the superiority of our proposed method. Our code will be released for public access.
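The cross-modal fusion idea, letting token features from one fundus modality attend to the other, can be sketched as single-head cross-attention. This is a generic sketch, not the paper's CFA module: learned query/key/value projections are omitted and the shapes are illustrative.

```python
import numpy as np

def cross_attention(q_feats, kv_feats):
    # Scaled dot-product cross-attention: queries from one modality
    # (e.g. CFP tokens), keys/values from the other (e.g. IFP tokens).
    d = q_feats.shape[-1]
    scores = q_feats @ kv_feats.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax rows
    return weights @ kv_feats
```

Each output token is a convex combination of the other modality's tokens, which is how an attention-based fusion module captures the CFP/IFP correspondence the abstract describes.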
[AI-84] Deep learning-based auto-contouring of organs/structures-at-risk for pediatric upper abdominal radiotherapy
链接: https://arxiv.org/abs/2411.00594
作者: Mianyong Ding,Matteo Maspero,Annemieke S Littooij,Martine van Grotel,Raquel Davila Fajardo,Max M van Noesel,Marry M van den Heuvel-Eibrink,Geert O Janssens
关键词-EN: pediatric upper abdominal, upper abdominal tumors, computed tomography, study aimed, aimed to develop
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
*备注: 23 pages, 5 figures, 1 table. Submitted to Radiotherapy and Oncology (2024-11-01)
点击查看摘要
Abstract:Purposes: This study aimed to develop a computed tomography (CT)-based multi-organ segmentation model for delineating organs-at-risk (OARs) in pediatric upper abdominal tumors and evaluate its robustness across multiple datasets. Materials and methods: In-house postoperative CTs from pediatric patients with renal tumors and neuroblastoma (n=189) and a public dataset (n=189) with CTs covering thoracoabdominal regions were used. Seventeen OARs were delineated: nine by clinicians (Type 1) and eight using TotalSegmentator (Type 2). Auto-segmentation models were trained using the in-house data (Model-PMC-UMCU) and a combined dataset including the public data (Model-Combined). Performance was assessed with the Dice Similarity Coefficient (DSC), the 95% Hausdorff Distance (HD95), and the mean surface distance (MSD). Two clinicians rated clinical acceptability on a 5-point Likert scale across 15 patient contours. Model robustness was evaluated against sex, age, intravenous contrast, and tumor type. Results: Model-PMC-UMCU achieved mean DSC values above 0.95 for five of nine OARs, while spleen and heart ranged between 0.90 and 0.95. The stomach-bowel and pancreas exhibited DSC values below 0.90. Model-Combined demonstrated improved robustness across both datasets. Clinical evaluation revealed good usability, with both clinicians rating six of nine Type 1 OARs above four and six of eight Type 2 OARs above three. Significant performance differences were only found across age groups in both datasets, specifically in the left lung and pancreas. The 0-2 age group showed the lowest performance. Conclusion: A multi-organ segmentation model was developed, showcasing enhanced robustness when trained on combined datasets. This model is suitable for various OARs and can be applied to multiple datasets in clinical settings.
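The abstract above reports segmentation quality with the Dice Similarity Coefficient (DSC). As an illustration only (not the paper's evaluation code), DSC on two flattened binary masks can be computed like this:

```python
def dice(pred, gt):
    """Dice Similarity Coefficient between two binary masks (flattened)."""
    inter = sum(p * g for p, g in zip(pred, gt))
    total = sum(pred) + sum(gt)
    return 1.0 if total == 0 else 2.0 * inter / total

# |A ∩ B| = 2, |A| = |B| = 3  →  DSC = 2*2 / (3+3) = 2/3.
pred = [1, 1, 1, 0, 0]
gt   = [0, 1, 1, 1, 0]
```

HD95 and MSD are surface-distance metrics that require the mask geometry, so they are not sketched here.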
[AI-85] From Easy to Hard: Tackling Quantum Problems with Learned Gadgets For Real Hardware
链接: https://arxiv.org/abs/2411.00230
作者: Akash Kundu,Leopoldo Sarra
关键词-EN: notoriously difficult problem, Building quantum circuits, Reinforcement learning, Gadget Reinforcement Learning, Building quantum
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 15 pages, 8 figures. Comments are encouraged
点击查看摘要
Abstract:Building quantum circuits that perform a given task is a notoriously difficult problem. Reinforcement learning has proven to be a powerful approach, but many limitations remain due to the exponential scaling of the space of possible operations on qubits. In this paper, we develop an algorithm that automatically learns composite gates ("gadgets") and adds them as additional actions to the reinforcement learning agent to facilitate the search, namely the Gadget Reinforcement Learning (GRL) algorithm. We apply our algorithm to finding parameterized quantum circuits (PQCs) that implement the ground state of a given quantum Hamiltonian, a well-known NP-hard challenge. In particular, we focus on the transverse field Ising model (TFIM), since understanding its ground state is crucial for studying quantum phase transitions and critical behavior, and serves as a benchmark for validating quantum algorithms and simulation techniques. We show that with GRL we can find very compact PQCs that reduce the error in estimating the ground state of the TFIM by up to a factor of 10^7 and make it suitable for implementation on real hardware, compared to a pure reinforcement learning approach. Moreover, GRL scales better with increasing difficulty and to larger systems. The generality of the algorithm shows the potential for applications to other settings, including optimization tailored to specific real-world quantum platforms.
[AI-86] Prospective Learning: Learning for a Dynamic Future NEURIPS2024
链接: https://arxiv.org/abs/2411.00109
作者: Ashwin De Silva,Rahul Ramesh,Rubing Yang,Siyu Yu,Joshua T Vogelstein,Pratik Chaudhari
关键词-EN: Prospective ERM, PAC learning, ERM, learning, Prospective
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted to NeurIPS 2024
点击查看摘要
Abstract:In real-world applications, the distribution of the data, and our goals, evolve over time. The prevailing theoretical framework for studying machine learning, namely probably approximately correct (PAC) learning, largely ignores time. As a consequence, existing strategies to address the dynamic nature of data and goals exhibit poor real-world performance. This paper develops a theoretical framework called “Prospective Learning” that is tailored for situations when the optimal hypothesis changes over time. In PAC learning, empirical risk minimization (ERM) is known to be consistent. We develop a learner called Prospective ERM, which returns a sequence of predictors that make predictions on future data. We prove that the risk of prospective ERM converges to the Bayes risk under certain assumptions on the stochastic process generating the data. Prospective ERM, roughly speaking, incorporates time as an input in addition to the data. We show that standard ERM as done in PAC learning, without incorporating time, can result in failure to learn when distributions are dynamic. Numerical experiments illustrate that prospective ERM can learn synthetic and visual recognition problems constructed from MNIST and CIFAR-10.
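The abstract says Prospective ERM "incorporates time as an input in addition to the data". A toy sketch of that idea, assuming a simple linear-drift label process (purely illustrative; this is not the paper's learner, and all names are hypothetical):

```python
def prospective_fit(ts, ys):
    """Least-squares fit of y = a + b*t, i.e. time is treated as an input."""
    n = len(ts)
    tm, ym = sum(ts) / n, sum(ys) / n
    b = (sum((t - tm) * (y - ym) for t, y in zip(ts, ys))
         / sum((t - tm) ** 2 for t in ts))
    a = ym - b * tm
    return lambda t: a + b * t

# A label distribution that drifts linearly over time: y_t = 2*t + 1.
ts = list(range(10))
ys = [2 * t + 1 for t in ts]
f = prospective_fit(ts, ys)      # time-aware predictor
erm_pred = sum(ys) / len(ys)     # time-ignorant ERM: the mean of past labels
```

On future timestamps the time-aware predictor extrapolates the drift, while the time-ignorant ERM baseline keeps predicting the stale past mean, mirroring the failure mode the paper describes for standard PAC-style ERM under dynamic distributions.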
[AI-87] DOC: Explainable Decoding Out-of-domain Cell Types with Evidential Learning
链接: https://arxiv.org/abs/2411.00054
作者: Chaochen Wu,Meiyun Zuo,Lei Xie
关键词-EN: OOD cell types, OOD cell, OOD, biological systems, Cell
类目: Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: under review
点击查看摘要
Abstract:Single-cell RNA-seq (scRNA-seq) technology is a powerful tool for unraveling the complexity of biological systems. One of the essential and fundamental tasks in scRNA-seq data analysis is Cell Type Annotation (CTA). In spite of tremendous efforts in developing machine learning methods for this problem, several challenges remain. They include identifying Out-of-Domain (OOD) cell types, quantifying the uncertainty of unseen cell type annotations, and determining interpretable cell type-specific gene drivers for an OOD case. OOD cell types are often associated with therapeutic responses and disease origins, making them critical for precision medicine and early disease diagnosis. Additionally, scRNA-seq data contains tens of thousands of gene expressions. Pinpointing the gene drivers underlying CTA can provide deep insight into gene regulatory mechanisms and serve as disease biomarkers. In this study, we develop a new method, eDOC, to address the aforementioned challenges. eDOC leverages a transformer architecture with evidential learning to annotate In-Domain (IND) and OOD cell types as well as to highlight genes that contribute to both IND cells and OOD cells at single-cell resolution. Rigorous experiments demonstrate that eDOC significantly improves the efficiency and effectiveness of OOD cell type and gene driver identification compared to other state-of-the-art methods. Our findings suggest that eDOC may provide new insights into single-cell biology.
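eDOC pairs a transformer with evidential learning to flag OOD cell types via uncertainty. A minimal sketch of the standard Dirichlet evidential recipe (eDOC's exact output head may differ; this is only the generic idea, with hypothetical names):

```python
def evidential_uncertainty(evidence):
    """Dirichlet-based class probabilities and uncertainty from non-negative
    per-class evidence (standard evidential-learning recipe)."""
    k = len(evidence)
    alpha = [e + 1.0 for e in evidence]   # Dirichlet parameters
    s = sum(alpha)                        # Dirichlet strength
    probs = [a / s for a in alpha]
    uncertainty = k / s                   # high when total evidence is scarce
    return probs, uncertainty

# No evidence looks OOD-like: maximal uncertainty, uniform probabilities.
probs_ood, u_ood = evidential_uncertainty([0.0, 0.0, 0.0])
# Strong evidence for class 0: confident prediction, low uncertainty.
probs_ind, u_ind = evidential_uncertainty([9.0, 0.0, 0.0])
```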
[AI-88] Coupling quantum-like cognition with the neuronal networks within generalized probability theory
链接: https://arxiv.org/abs/2411.00036
作者: Andrei Khrennikov,Masanao Ozawa,Felix Benninger,Oded Shor
关键词-EN: recent years, years are characterized, characterized by intensive, mathematical apparatus, intensive applications
类目: Physics and Society (physics.soc-ph); Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)
*备注: RIKEN Quantum Workshop, October 11, 2024
点击查看摘要
Abstract:Recent years have been characterized by intensive applications of the methodology and mathematical apparatus of quantum theory, i.e., quantum-like modeling, in cognition, psychology, and decision making. In spite of the successful applications of this approach to a variety of psychological effects, e.g., the order, conjunction, disjunction, and response replicability effects, one may (but need not) feel dissatisfaction due to the absence of a clear coupling to the neurophysiological processes in the brain. For the moment, this is just a phenomenological approach. In this paper we construct a quantum-like representation of networks of communicating neurons. It is based not on standard quantum theory, but on generalized probability theory (GPT), with emphasis on the operational measurement approach. We employ a version of GPT based on ordered linear state spaces (instead of complex Hilbert space). A network of communicating neurons is described as a weighted ordered graph that is in turn encoded by its weight matrix. The state space of weight matrices is embedded in GPT with effect-observables and state updates within the theory of measurement instruments, which plays the crucial role. This GPT-based model exhibits the basic quantum-like effects, e.g. the order, non-repeatability, and disjunction effects; the latter is also known as interference of decisions. This GPT coupling also supports quantum-like modeling in medical diagnostics for neurological diseases, such as depression and epilepsy. Although the paper concentrates on cognition and neuronal networks, the formalism and methodology can be straightforwardly applied to a variety of biological and social networks.
[AI-89] Low-Overhead Channel Estimation via 3D Extrapolation for TDD mmWave Massive MIMO Systems Under High-Mobility Scenarios
链接: https://arxiv.org/abs/2406.08887
作者: Binggui Zhou,Xi Yang,Shaodan Ma,Feifei Gao,Guanghua Yang
关键词-EN: TDD mmWave massive, mmWave massive MIMO, uplink channel estimation, massive MIMO systems, channel estimation
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
*备注: 13 pages, 11 figures, 3 tables. This paper has been submitted to IEEE journal for possible publication
点击查看摘要
Abstract:In TDD mmWave massive MIMO systems, the downlink CSI can be attained through uplink channel estimation thanks to the uplink-downlink channel reciprocity. However, the channel aging issue is significant under high-mobility scenarios and thus necessitates frequent uplink channel estimation. In addition, large amounts of antennas and subcarriers lead to high-dimensional CSI matrices, aggravating the pilot training overhead. To systematically reduce the pilot overhead, a spatial, frequency, and temporal domain (3D) channel extrapolation framework is proposed in this paper. Considering the marginal effects of pilots in the spatial and frequency domains and the effectiveness of traditional knowledge-driven channel estimation methods, we first propose a knowledge-and-data driven spatial-frequency channel extrapolation network (KDD-SFCEN) for uplink channel estimation by exploiting the least square estimator for coarse channel estimation and joint spatial-frequency channel extrapolation to reduce the spatial-frequency domain pilot overhead. Then, resorting to the uplink-downlink channel reciprocity and temporal domain dependencies of downlink channels, a temporal uplink-downlink channel extrapolation network (TUDCEN) is proposed for slot-level channel extrapolation, aiming to enlarge the pilot signal period and thus reduce the temporal domain pilot overhead under high-mobility scenarios. Specifically, we propose the spatial-frequency sampling embedding module to reduce the representation dimension and consequent computational complexity, and we propose to exploit the autoregressive generative Transformer for generating downlink channels autoregressively. Numerical results demonstrate the superiority of the proposed framework in significantly reducing the pilot training overhead by more than 16 times and improving the system’s spectral efficiency under high-mobility scenarios.
[AI-90] Pay Less But Get More: A Dual-Attention-based Channel Estimation Network for Massive MIMO Systems with Low-Density Pilots
链接: https://arxiv.org/abs/2303.00986
作者: Binggui Zhou,Xi Yang,Shaodan Ma,Feifei Gao,Guanghua Yang
关键词-EN: massive multiple-input multiple-output, massive MIMO systems, channel state information, massive MIMO channels, massive MIMO
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
*备注: 16 pages, 9 figures, 6 tables. Accepted by IEEE Transactions on Wireless Communications
点击查看摘要
Abstract:To reap the promising benefits of massive multiple-input multiple-output (MIMO) systems, accurate channel state information (CSI) is required through channel estimation. However, due to the complicated wireless propagation environment and large-scale antenna arrays, precise channel estimation for massive MIMO systems is significantly challenging and costs an enormous training overhead. Considerable time-frequency resources are consumed to acquire sufficient accuracy of CSI, which thus severely degrades systems’ spectral and energy efficiencies. In this paper, we propose a dual-attention-based channel estimation network (DACEN) to realize accurate channel estimation via low-density pilots, by jointly learning the spatial-temporal domain features of massive MIMO channels with the temporal attention module and the spatial attention module. To further improve the estimation accuracy, we propose a parameter-instance transfer learning approach to transfer the channel knowledge learned from the high-density pilots pre-acquired during the training dataset collection period. Experimental results reveal that the proposed DACEN-based method achieves better channel estimation performance than the existing methods under various pilot-density settings and signal-to-noise ratios. Additionally, with the proposed parameter-instance transfer learning approach, the DACEN-based method achieves additional performance gain, thereby further demonstrating the effectiveness and superiority of the proposed method.
计算机视觉
[CV-0] Randomized Autoregressive Visual Generation
链接: https://arxiv.org/abs/2411.00776
作者: Qihang Yu,Ju He,Xueqing Deng,Xiaohui Shen,Liang-Chieh Chen
关键词-EN: paper presents Randomized, presents Randomized AutoRegressive, Randomized AutoRegressive modeling, presents Randomized, maintaining full compatibility
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: simple method improving autoregressive image generator to SOTA performance; Project page at this https URL
点击查看摘要
Abstract:This paper presents Randomized AutoRegressive modeling (RAR) for visual generation, which sets a new state-of-the-art performance on the image generation task while maintaining full compatibility with language modeling frameworks. The proposed RAR is simple: during a standard autoregressive training process with a next-token prediction objective, the input sequence, typically ordered in raster form, is randomly permuted into different factorization orders with a probability r, where r starts at 1 and linearly decays to 0 over the course of training. This annealing training strategy enables the model to learn to maximize the expected likelihood over all factorization orders and thus effectively improve the model’s capability of modeling bidirectional contexts. Importantly, RAR preserves the integrity of the autoregressive modeling framework, ensuring full compatibility with language modeling while significantly improving performance in image generation. On the ImageNet-256 benchmark, RAR achieves an FID score of 1.48, not only surpassing prior state-of-the-art autoregressive image generators but also outperforming leading diffusion-based and masked transformer-based methods. Code and models will be made available at this https URL
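The annealed permutation schedule described in the abstract can be sketched as follows (an illustrative reading with hypothetical names, not the released implementation):

```python
import random

def permutation_prob(step, total_steps):
    """Permutation probability r: starts at 1 and linearly decays to 0."""
    return max(0.0, 1.0 - step / total_steps)

def factorization_order(seq_len, step, total_steps, rng=random):
    """With probability r, train on a random factorization order;
    otherwise keep the raster (left-to-right) order."""
    order = list(range(seq_len))
    if rng.random() < permutation_prob(step, total_steps):
        rng.shuffle(order)
    return order

# At the end of training r = 0, so the raster order is always used.
final_order = factorization_order(6, step=100, total_steps=100)
```

Early in training (r near 1) the model sees many factorization orders, which is what forces it to learn bidirectional context; by the end it has settled back into the standard raster-order autoregressive regime.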
[CV-1] CityGaussianV2: Efficient and Geometrically Accurate Reconstruction for Large-Scale Scenes
链接: https://arxiv.org/abs/2411.00771
作者: Yang Liu,Chuanchen Luo,Zhongkai Mao,Junran Peng,Zhaoxiang Zhang
关键词-EN: revolutionized radiance field, radiance field reconstruction, Gaussian Splatting, manifesting efficient, view synthesis
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project Page: this https URL
点击查看摘要
Abstract:Recently, 3D Gaussian Splatting (3DGS) has revolutionized radiance field reconstruction, manifesting efficient and high-fidelity novel view synthesis. However, accurately representing surfaces, especially in large and complex scenarios, remains a significant challenge due to the unstructured nature of 3DGS. In this paper, we present CityGaussianV2, a novel approach for large-scale scene reconstruction that addresses critical challenges related to geometric accuracy and efficiency. Building on the favorable generalization capabilities of 2D Gaussian Splatting (2DGS), we address its convergence and scalability issues. Specifically, we implement a decomposed-gradient-based densification and depth regression technique to eliminate blurry artifacts and accelerate convergence. To scale up, we introduce an elongation filter that mitigates Gaussian count explosion caused by 2DGS degeneration. Furthermore, we optimize the CityGaussian pipeline for parallel training, achieving up to 10 \times compression, at least 25% savings in training time, and a 50% decrease in memory usage. We also established standard geometry benchmarks under large-scale scenes. Experimental results demonstrate that our method strikes a promising balance between visual quality, geometric accuracy, as well as storage and training costs. The project page is available at this https URL.
[CV-2] Face Anonymization Made Simple
链接: https://arxiv.org/abs/2411.00762
作者: Han-Wei Kung,Tuomas Varanka,Sanjay Saha,Terence Sim,Nicu Sebe
关键词-EN: Current face anonymization, inaccurate and unreliable, Current face, techniques often depend, identity loss calculated
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
*备注:
点击查看摘要
Abstract:Current face anonymization techniques often depend on identity loss calculated by face recognition models, which can be inaccurate and unreliable. Additionally, many methods require supplementary data such as facial landmarks and masks to guide the synthesis process. In contrast, our approach uses diffusion models with only a reconstruction loss, eliminating the need for facial landmarks or masks while still producing images with intricate, fine-grained details. We validated our results on two public benchmarks through both quantitative and qualitative evaluations. Our model achieves state-of-the-art performance in three key areas: identity anonymization, facial attribute preservation, and image quality. Beyond its primary function of anonymization, our model can also perform face swapping tasks by incorporating an additional facial image as input, demonstrating its versatility and potential for diverse applications. Our code and models are available at this https URL.
[CV-3] Autobiasing Event Cameras ECCV2024
链接: https://arxiv.org/abs/2411.00729
作者: Mehdi Sefidgar Dilmaghani,Waseem Shariff,Cian Ryan,Joseph Lemley,Peter Corcoran
关键词-EN: address challenges arising, machine vision applications, presents an autonomous, address challenges, challenges arising
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ECCV 2024 NeVi Workshop
点击查看摘要
Abstract:This paper presents an autonomous method to address challenges arising from severe lighting conditions in machine vision applications that use event cameras. To manage these conditions, the research explores the built-in potential of these cameras to adjust pixel functionality, known as bias settings. As cars are driven at various times and locations, shifts in lighting conditions are unavoidable. Consequently, this paper utilizes the neuromorphic YOLO-based face tracking module of a driver monitoring system as the event-based application to study. The proposed method uses numerical metrics to continuously monitor the performance of the event-based application in real time. When the application malfunctions, the system detects this through a drop in the metrics and automatically adjusts the event camera's bias values. The Nelder-Mead simplex algorithm is employed to optimize this adjustment, with fine-tuning continuing until performance returns to a satisfactory level. The advantage of bias optimization lies in its ability to handle conditions such as flickering or darkness without requiring additional hardware or software. To demonstrate the capabilities of the proposed system, it was tested under conditions where detecting human faces with default bias values was impossible. These severe conditions were simulated using dim ambient light and various flickering frequencies. Following the automatic and dynamic process of bias modification, the metrics for face detection significantly improved under all conditions. Autobiasing resulted in an increase in the YOLO confidence indicators by more than 33 percent for object detection and 37 percent for face detection, highlighting the effectiveness of the proposed method.
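The closed loop above (monitor a metric, adjust bias values until performance recovers) can be caricatured with a greedy coordinate search standing in for the paper's Nelder-Mead simplex; the metric and bias names below are hypothetical:

```python
def autobias(metric, biases, step=1.0, iters=50):
    """Greedy coordinate search over bias values: a simple stand-in for the
    Nelder-Mead simplex used in the paper. `metric` plays the role of the
    monitored performance score (e.g. YOLO detection confidence)."""
    biases = list(biases)
    best = metric(biases)
    for _ in range(iters):
        improved = False
        for i in range(len(biases)):
            for delta in (step, -step):
                cand = list(biases)
                cand[i] += delta
                score = metric(cand)
                if score > best:
                    biases, best, improved = cand, score, True
        if not improved:          # performance back at a (local) optimum
            break
    return biases, best

# Hypothetical metric peaked at bias = (3, -2); confidence falls with distance.
metric = lambda b: -((b[0] - 3) ** 2 + (b[1] + 2) ** 2)
opt, score = autobias(metric, [0.0, 0.0])
```

In the real system the metric is evaluated on live event-camera output, so each probe costs a short observation window, which is why a derivative-free method like Nelder-Mead is the natural choice.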
[CV-4] Debiasify: Self-Distillation for Unsupervised Bias Mitigation WACV2025
链接: https://arxiv.org/abs/2411.00711
作者: Nourhan Bayasi,Jamil Fayyad,Ghassan Hamarneh,Rafeef Garbi,Homayoun Najjaran
关键词-EN: Simplicity bias poses, decision rules influenced, favor simpler solutions, inadvertently learn decision, learn decision rules
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted at the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV2025)
点击查看摘要
Abstract:Simplicity bias poses a significant challenge in neural networks, often leading models to favor simpler solutions and inadvertently learn decision rules influenced by spurious correlations. This results in biased models with diminished generalizability. While many current approaches depend on human supervision, obtaining annotations for various bias attributes is often impractical. To address this, we introduce Debiasify, a novel self-distillation approach that requires no prior knowledge about the nature of biases. Our method leverages a new distillation loss to transfer knowledge within the network, from deeper layers containing complex, highly-predictive features to shallower layers with simpler, attribute-conditioned features in an unsupervised manner. This enables Debiasify to learn robust, debiased representations that generalize effectively across diverse biases and datasets, improving both worst-group performance and overall accuracy. Extensive experiments on computer vision and medical imaging benchmarks demonstrate the effectiveness of our approach, significantly outperforming previous unsupervised debiasing methods (e.g., a 10.13% improvement in worst-group accuracy for Wavy Hair classification in CelebA) and achieving comparable or superior performance to supervised approaches. Our code is publicly available at the following link: Debiasify.
[CV-5] ReMatching Dynamic Reconstruction Flow ATC
链接: https://arxiv.org/abs/2411.00705
作者: Sara Oblak,Despoina Paschalidou,Sanja Fidler,Matan Atzmon
关键词-EN: fundamental computer vision, computer vision task, Reconstructing dynamic scenes, downstream applications, Reconstructing dynamic
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注: Our project website is at this https URL
点击查看摘要
Abstract:Reconstructing dynamic scenes from image inputs is a fundamental computer vision task with many downstream applications. Despite recent advancements, existing approaches still struggle to achieve high-quality reconstructions from unseen viewpoints and timestamps. This work introduces the ReMatching framework, designed to improve generalization quality by incorporating deformation priors into dynamic reconstruction models. Our approach advocates for velocity-field-based priors, for which we suggest a matching procedure that can seamlessly supplement existing dynamic reconstruction pipelines. The framework is highly adaptable and can be applied to various dynamic representations. Moreover, it supports integrating multiple types of model priors and enables combining simpler ones to create more complex classes. Our evaluations on popular benchmarks involving both synthetic and real-world dynamic scenes demonstrate a clear improvement in reconstruction accuracy of current state-of-the-art models.
[CV-6] Why do we regularise in every iteration for imaging inverse problems?
链接: https://arxiv.org/abs/2411.00688
作者: Evangelos Papoutsellis,Zeljko Kereta,Kostas Papafitsoros
关键词-EN: solving imaging inverse, imaging inverse problems, solving imaging, imaging inverse, inverse problems
类目: Numerical Analysis (math.NA); Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)
*备注:
点击查看摘要
Abstract:Regularisation is commonly used in iterative methods for solving imaging inverse problems. Many algorithms involve the evaluation of the proximal operator of the regularisation term in every iteration, leading to a significant computational overhead since such evaluation can be costly. In this context, the ProxSkip algorithm, recently proposed for federated learning purposes, emerges as a solution. It randomly skips regularisation steps, reducing the computational time of an iterative algorithm without affecting its convergence. Here we explore for the first time the efficacy of ProxSkip on a variety of imaging inverse problems, and we also propose a novel PDHGSkip version. Extensive numerical results highlight the potential of these methods to accelerate computations while maintaining high-quality reconstructions.
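A simplified caricature of the ProxSkip idea on a scalar L1-regularised least-squares problem; note that the real algorithm adds a control-variate correction that preserves convergence guarantees, which this sketch deliberately omits:

```python
import random

def soft_threshold(x, tau):
    """Proximal operator of tau * |x| (the regularisation step being skipped)."""
    sign = 1.0 if x >= 0 else -1.0
    return sign * max(abs(x) - tau, 0.0)

def prox_skip_like(grad, x0, lr=0.1, tau=0.01, p=0.2, steps=200, seed=0):
    """Gradient descent where the prox is evaluated only with probability p.
    Simplified on purpose: real ProxSkip adds a control-variate term; this
    sketch only illustrates the random skipping of the costly prox."""
    rng = random.Random(seed)
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
        if rng.random() < p:      # pay for the prox only occasionally
            x = soft_threshold(x, tau / p)
    return x

# Toy problem: minimise 0.5*(x - 1)^2 + 0.01*|x|; the minimiser is near 0.99.
x_star = prox_skip_like(lambda x: x - 1.0, x0=5.0)
```

With p = 0.2 the expensive prox is evaluated roughly once every five iterations, which is where the reported computational savings come from.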
[CV-7] owards High-fidelity Head Blending with Chroma Keying for Industrial Applications WACV2025
链接: https://arxiv.org/abs/2411.00652
作者: Hah Min Lew,Sahng-Min Yoo,Hyunwoo Kang,Gyeong-Moon Park
关键词-EN: digital content creation, industrial Head Blending, content creation, introduce an industrial, seamlessly integrating
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
*备注: Accepted by WACV 2025. Project page: this https URL
点击查看摘要
Abstract:We introduce an industrial Head Blending pipeline for the task of seamlessly integrating an actor’s head onto a target body in digital content creation. The key challenge stems from discrepancies in head shape and hair structure, which lead to unnatural boundaries and blending artifacts. Existing methods treat foreground and background as a single task, resulting in suboptimal blending quality. To address this problem, we propose CHANGER, a novel pipeline that decouples background integration from foreground blending. By utilizing chroma keying for artifact-free background generation and introducing Head shape and long Hair augmentation ( H^2 augmentation) to simulate a wide range of head shapes and hair styles, CHANGER improves generalization on innumerable various real-world cases. Furthermore, our Foreground Predictive Attention Transformer (FPAT) module enhances foreground blending by predicting and focusing on key head and body regions. Quantitative and qualitative evaluations on benchmark datasets demonstrate that our CHANGER outperforms state-of-the-art methods, delivering high-fidelity, industrial-grade results.
[CV-8] Event-guided Low-light Video Semantic Segmentation WACV
链接: https://arxiv.org/abs/2411.00639
作者: Zhen Yao,Mooi Choo Chuah
关键词-EN: Recent video semantic, demonstrated promising results, Recent video, well-lit environments, demonstrated promising
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 12 pages, 5 figures, Accepted to IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2025
点击查看摘要
Abstract:Recent video semantic segmentation (VSS) methods have demonstrated promising results in well-lit environments. However, their performance significantly drops in low-light scenarios due to limited visibility and reduced contextual details. In addition, unfavorable low-light conditions make it harder to incorporate temporal consistency across video frames and thus, lead to video flickering effects. Compared with conventional cameras, event cameras can capture motion dynamics, filter out temporal-redundant information, and are robust to lighting conditions. To this end, we propose EVSNet, a lightweight framework that leverages event modality to guide the learning of a unified illumination-invariant representation. Specifically, we leverage a Motion Extraction Module to extract short-term and long-term temporal motions from event modality and a Motion Fusion Module to integrate image features and motion features adaptively. Furthermore, we use a Temporal Decoder to exploit video contexts and generate segmentation predictions. Such designs in EVSNet result in a lightweight architecture while achieving SOTA performance. Experimental results on 3 large-scale datasets demonstrate our proposed EVSNet outperforms SOTA methods with up to 11x higher parameter efficiency.
[CV-9] PCoTTA: Continual Test-Time Adaptation for Multi-Task Point Cloud Understanding NEURIPS2024
链接: https://arxiv.org/abs/2411.00632
作者: Jincen Jiang,Qianyu Zhou,Yuhang Li,Xinkui Zhao,Meili Wang,Lizhuang Ma,Jian Chang,Jian Jun Zhang,Xuequan Lu
关键词-EN: point cloud understanding, Continual Test-Time Adaptation, multi-task point cloud, pioneering framework, cloud understanding
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted to NeurIPS 2024
点击查看摘要
Abstract:In this paper, we present PCoTTA, an innovative, pioneering framework for Continual Test-Time Adaptation (CoTTA) in multi-task point cloud understanding, enhancing the model’s transferability towards the continually changing target domain. We introduce a multi-task setting for PCoTTA, which is practical and realistic, handling multiple tasks within one unified model during the continual adaptation. Our PCoTTA involves three key components: automatic prototype mixture (APM), Gaussian Splatted feature shifting (GSFS), and contrastive prototype repulsion (CPR). Firstly, APM is designed to automatically mix the source prototypes with the learnable prototypes with a similarity balancing factor, avoiding catastrophic forgetting. Then, GSFS dynamically shifts the testing sample toward the source domain, mitigating error accumulation in an online manner. In addition, CPR is proposed to pull the nearest learnable prototype close to the testing feature and push it away from other prototypes, making each prototype distinguishable during the adaptation. Experimental comparisons lead to a new benchmark, demonstrating PCoTTA’s superiority in boosting the model’s transferability towards the continually changing target domain.
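The automatic prototype mixture (APM) step blends source prototypes with learnable prototypes using a similarity balancing factor. The abstract does not give the exact formula, so the sketch below is one hedged reading of that description, with hypothetical names throughout:

```python
def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda w: sum(x * x for x in w) ** 0.5
    return dot / (norm(u) * norm(v))

def mix_prototypes(source, learnable, feature):
    """Blend a source prototype with a learnable prototype: the prototype
    more similar to the test feature gets more weight. An illustrative
    reading of APM, not the paper's exact formula."""
    s = cosine(feature, source)
    l = cosine(feature, learnable)
    w = s / (s + l)               # similarity balancing factor in [0, 1]
    return [w * a + (1 - w) * b for a, b in zip(source, learnable)]

# A feature fully aligned with the source prototype keeps it unchanged (w = 1).
mixed = mix_prototypes([2.0, 0.0], [0.0, 3.0], [1.0, 0.0])
```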
[CV-10] Investigating the Gestalt Principle of Closure in Deep Convolutional Neural Networks
链接: https://arxiv.org/abs/2411.00627
作者: Yuyan Zhang,Derya Soydaner,Fatemeh Behrad,Lisa Koßmann,Johan Wagemans
关键词-EN: Deep neural networks, Deep neural, object recognition, neural networks perform, perceive objects
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Published at the ESANN 2024 proceedings, European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning. Bruges (Belgium) and online event, 9-11 October 2024
点击查看摘要
Abstract:Deep neural networks perform well in object recognition, but do they perceive objects like humans? This study investigates the Gestalt principle of closure in convolutional neural networks. We propose a protocol to identify closure and conduct experiments using simple visual stimuli with progressively removed edge sections. We evaluate well-known networks on their ability to classify incomplete polygons. Our findings reveal a performance degradation as the edge removal percentage increases, indicating that current models heavily rely on complete edge information for accurate classification. The data used in our study is available on Github.
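Stimuli with "progressively removed edge sections" like those in the protocol above can be generated along the following lines (an illustrative sketch, not the study's actual stimulus code; all names are hypothetical):

```python
import math

def polygon_edge_points(n_sides, pts_per_side=20, removed_frac=0.3):
    """Sample points along a regular polygon's outline, dropping a fraction
    of each edge to mimic progressively incomplete closure stimuli."""
    keep = round(pts_per_side * (1.0 - removed_frac))
    pts = []
    for i in range(n_sides):
        a0 = 2.0 * math.pi * i / n_sides
        a1 = 2.0 * math.pi * (i + 1) / n_sides
        x0, y0 = math.cos(a0), math.sin(a0)
        x1, y1 = math.cos(a1), math.sin(a1)
        for j in range(keep):     # the tail of every edge is left out
            t = j / pts_per_side
            pts.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return pts

# A square with 30% of every edge removed keeps 14 of 20 points per edge.
stimulus = polygon_edge_points(4, pts_per_side=20, removed_frac=0.3)
```

Sweeping `removed_frac` from 0 towards 1 reproduces the increasing edge-removal percentage against which the classification accuracy was measured.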
[CV-11] ZIM: Zero-Shot Image Matting for Anything
链接: https://arxiv.org/abs/2411.00626
作者: Beomyoung Kim,Chanyong Shin,Joonhyun Jeong,Hyungsik Jung,Se-Yun Lee,Sewhan Chun,Dong-Hyun Hwang,Joonsang Yu
关键词-EN: exhibits strong zero-shot, exhibits strong, zero-shot segmentation capabilities, falls short, short in generating
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: preprint (21 pages, 16 figures, and 8 tables)
点击查看摘要
Abstract:The recent segmentation foundation model, Segment Anything Model (SAM), exhibits strong zero-shot segmentation capabilities, but it falls short in generating fine-grained precise masks. To address this limitation, we propose a novel zero-shot image matting model, called ZIM, with two key contributions: First, we develop a label converter that transforms segmentation labels into detailed matte labels, constructing the new SA1B-Matte dataset without costly manual annotations. Training SAM with this dataset enables it to generate precise matte masks while maintaining its zero-shot capability. Second, we design the zero-shot matting model equipped with a hierarchical pixel decoder to enhance mask representation, along with a prompt-aware masked attention mechanism to improve performance by enabling the model to focus on regions specified by visual prompts. We evaluate ZIM using the newly introduced MicroMat-3K test set, which contains high-quality micro-level matte labels. Experimental results show that ZIM outperforms existing methods in fine-grained mask generation and zero-shot generalization. Furthermore, we demonstrate the versatility of ZIM in various downstream tasks requiring precise masks, such as image inpainting and 3D NeRF. Our contributions provide a robust foundation for advancing zero-shot matting and its downstream applications across a wide range of computer vision tasks. The code is available at this https URL.
[CV-12] Dual Low-Rank Adaptation for Continual Learning with Pre-Trained Models
链接: https://arxiv.org/abs/2411.00623
作者: Huancheng Chen,Jingtao Li,Nidham Gazagnadou,Weiming Zhuang,Chen Chen,Lingjuan Lyu
关键词-EN: enable vision transformers, revisit continual learning, vision transformers, era of foundation, aims to enable
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:In the era of foundation models, we revisit continual learning (CL), which aims to enable vision transformers (ViTs) to learn new tasks over time. However, as the scale of these models increases, catastrophic forgetting remains a persistent challenge, particularly in the presence of significant domain shifts across tasks. Recent studies highlight a crossover between CL techniques and parameter-efficient fine-tuning (PEFT), which focuses on fine-tuning only a small set of trainable parameters to adapt to downstream tasks, such as low-rank adaptation (LoRA). While LoRA achieves faster convergence and requires fewer trainable parameters, it has seldom been explored in the context of continual learning. To address this gap, we propose a novel PEFT-CL method called Dual Low-Rank Adaptation (DualLoRA), which introduces both an orthogonal LoRA adapter and a residual LoRA adapter parallel to pre-trained weights in each layer. These components are orchestrated by a dynamic memory mechanism to strike a balance between stability and plasticity. The orthogonal LoRA adapter’s parameters are updated in an orthogonal subspace of previous tasks to mitigate catastrophic forgetting, while the residual LoRA adapter’s parameters are updated in the residual subspace spanned by task-specific bases without interaction across tasks, offering complementary capabilities for fine-tuning new tasks. On ViT-based models, we demonstrate that DualLoRA offers significant advantages in accuracy, inference speed, and memory efficiency over existing CL methods across multiple benchmarks.
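As background for the two adapters, the basic LoRA update adds a trainable low-rank path next to a frozen pretrained weight: y = W0 x + (alpha/r) B A x. A minimal sketch with plain lists (illustrative only; DualLoRA additionally constrains the subspaces in which A and B are updated):

```python
def matvec(M, v):
    # multiply matrix M (list of rows) by vector v
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def lora_forward(W0, A, B, x, alpha=1.0):
    r = len(A)                       # rank = number of rows of A
    base = matvec(W0, x)             # frozen pretrained path
    delta = matvec(B, matvec(A, x))  # trainable low-rank path
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]

W0 = [[1.0, 0.0], [0.0, 1.0]]        # frozen 2x2 weight (identity here)
A = [[1.0, 1.0]]                     # rank-1 down-projection (1x2)
B = [[0.5], [0.5]]                   # up-projection (2x1)
x = [2.0, 3.0]
print(lora_forward(W0, A, B, x))     # [4.5, 5.5]
```

Setting B to zero recovers the frozen model exactly, which is why LoRA can be added without touching the pretrained weights.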
[CV-13] HopTrack: A Real-time Multi-Object Tracking System for Embedded Devices
链接: https://arxiv.org/abs/2411.00608
作者: Xiang Li,Cheng Chen,Yuan-yao Lou,Mustafa Abdallah,Kwang Taik Kim,Saurabh Bagchi
关键词-EN: poses significant challenges, embedded devices, poses significant, computer vision, embedded
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Multi-Object Tracking (MOT) poses significant challenges in computer vision. Despite its wide application in robotics, autonomous driving, and smart manufacturing, there is limited literature addressing the specific challenges of running MOT on embedded devices. State-of-the-art MOT trackers designed for high-end GPUs often experience low processing rates (below 11 fps) when deployed on embedded devices. Existing MOT frameworks for embedded devices proposed strategies such as fusing the detector model with the feature embedding model to reduce inference latency or combining different trackers to improve tracking accuracy, but tend to compromise one for the other. This paper introduces HopTrack, a real-time multi-object tracking system tailored for embedded devices. Our system employs a novel discretized static and dynamic matching approach along with an innovative content-aware dynamic sampling technique to enhance tracking accuracy while meeting the real-time requirement. Compared with the best high-end GPU modified baseline Byte (Embed) and the best existing baseline on embedded devices MobileNet-JDE, HopTrack achieves a processing speed of up to 39.29 fps on NVIDIA AGX Xavier with a multi-object tracking accuracy (MOTA) of up to 63.12% on the MOT16 benchmark, outperforming both counterparts by 2.15% and 4.82%, respectively. Additionally, the accuracy improvement is coupled with the reduction in energy consumption (20.8%), power (5%), and memory usage (8%), which are crucial resources on embedded devices. HopTrack is also detector agnostic, allowing the flexibility of plug-and-play.
[CV-14] Federated Voxel Scene Graph for Intracranial Hemorrhage
链接: https://arxiv.org/abs/2411.00578
作者: Antoine P. Sanner,Jonathan Stieber,Nils F. Grauhan,Suam Kim,Marc A. Brockmann,Ahmed E. Othman,Anirban Mukhopadhyay
关键词-EN: Intracranial Hemorrhage, clinical centers worldwide, potentially lethal condition, Scene Graph Generation, centers worldwide
类目: Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Image and Video Processing (eess.IV)
*备注:
点击查看摘要
Abstract:Intracranial Hemorrhage is a potentially lethal condition whose manifestation is vastly diverse and shifts across clinical centers worldwide. Deep-learning-based solutions are starting to model complex relations between brain structures, but still struggle to generalize. While gathering more diverse data is the most natural approach, privacy regulations often limit the sharing of medical data. We propose the first application of Federated Scene Graph Generation. We show that our models can leverage the increased training data diversity. For Scene Graph Generation, they can recall up to 20% more clinically relevant relations across datasets compared to models trained on a single centralized dataset. Learning structured data representation in a federated setting can open the way to the development of new methods that can leverage this finer information to regularize across clients more effectively.
[CV-15] Handheld Video Document Scanning: A Robust On-Device Model for Multi-Page Document Scanning
链接: https://arxiv.org/abs/2411.00576
作者: Curtis Wigington
关键词-EN: Document capture applications, capture applications, emerged as popular, popular tools, tools for digitizing
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Document capture applications on smartphones have emerged as popular tools for digitizing documents. For many individuals, capturing documents with their smartphones is more convenient than using dedicated photocopiers or scanners, even if the quality of digitization is lower. However, using a smartphone for digitization can become excessively time-consuming and tedious when a user needs to digitize a document with multiple pages. In this work, we propose a novel approach to automatically scan multi-page documents from a video stream as the user turns through the pages of the document. Unlike previous methods that required constrained settings such as mounting the phone on a tripod, our technique is designed to allow the user to hold the phone in their hand. Our technique is trained to be robust to the motion and instability inherent in handheld scanning. Our primary contributions in this work include: (1) an efficient, on-device deep learning model that is accurate and robust for handheld scanning, (2) a novel data collection and annotation technique for video document scanning, and (3) state-of-the-art results on the PUCIT page turn dataset.
[CV-16] Automated Classification of Cell Shapes: A Comparative Evaluation of Shape Descriptors
链接: https://arxiv.org/abs/2411.00561
作者: Valentina Vadori,Antonella Peruffo,Jean-Marie Graïc,Livio Finos,Enrico Grisan
关键词-EN: cell instance segmentation, including Elliptical Fourier, Elliptical Fourier Descriptors, study addresses, addresses the challenge
类目: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
*备注:
点击查看摘要
Abstract:This study addresses the challenge of classifying cell shapes from noisy contours, such as those obtained through cell instance segmentation of histological images. We assess the performance of various features for shape classification, including Elliptical Fourier Descriptors, curvature features, and lower dimensional representations. Using an annotated synthetic dataset of noisy contours, we identify the most suitable shape descriptors and apply them to a set of real images for qualitative analysis. Our aim is to provide a comprehensive evaluation of descriptors for classifying cell shapes, which can support cell type identification and tissue characterization-critical tasks in both biological research and histopathological assessments.
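As a flavor of the Fourier-based descriptors compared in the study, a contour can be encoded as complex numbers and its DFT magnitudes normalized by the first harmonic, yielding translation- and scale-invariant features. This is a simplified cousin of Elliptical Fourier Descriptors, not the exact formulation used in the paper:

```python
import cmath

def fourier_descriptors(contour, n_harmonics=4):
    # contour: list of (x, y) points; each point becomes x + iy
    z = [complex(x, y) for x, y in contour]
    n = len(z)
    coeffs = []
    for k in range(1, n_harmonics + 1):  # skip k=0 (centroid carries translation)
        c = sum(z[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)) / n
        coeffs.append(abs(c))            # magnitudes discard rotation/phase
    first = coeffs[0]
    return [c / first for c in coeffs]   # normalize by first harmonic for scale

square = [(0, 0), (1, 0), (1, 1), (0, 1)]
big_square = [(0, 0), (3, 0), (3, 3), (0, 3)]
print(fourier_descriptors(square))
```

Scaling the square by 3 leaves the normalized descriptors unchanged, which is the invariance property shape classifiers rely on.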
[CV-17] Topology and Intersection-Union Constrained Loss Function for Multi-Region Anatomical Segmentation in Ocular Images
链接: https://arxiv.org/abs/2411.00560
作者: Ruiyu Xia,Jianqiang Li,Xi Xu,Guanghui Fu
关键词-EN: Ocular Myasthenia Gravis, Myasthenia Gravis, Ocular Myasthenia, eye muscles, double vision
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注: 5 pages, 4 figures, International Symposium on Biomedical Imaging 2025
点击查看摘要
Abstract:Ocular Myasthenia Gravis (OMG) is a rare and challenging disease to detect in its early stages, but symptoms often first appear in the eye muscles, such as drooping eyelids and double vision. Ocular images can be used for early diagnosis by segmenting different regions, such as the sclera, iris, and pupil, which allows for the calculation of area ratios to support accurate medical assessments. However, no publicly available dataset or tools currently exist for this purpose. To address this, we propose a new topology and intersection-union constrained loss function (TIU loss) that improves performance using small training datasets. We conducted experiments on a public dataset consisting of 55 subjects and 2,197 images. Our proposed method outperformed two widely used loss functions across three deep learning networks, achieving a mean Dice score of 83.12% [82.47%, 83.81%] with a 95% bootstrap confidence interval. In a low-percentage training scenario (10% of the training data), our approach showed an 8.32% improvement in Dice score compared to the baseline. Additionally, we evaluated the method in a clinical setting with 47 subjects and 501 images, achieving a Dice score of 64.44% [63.22%, 65.62%]. We did observe some bias when applying the model in clinical settings. These results demonstrate that the proposed method is accurate, and our code along with the trained model is publicly available.
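The metric reported throughout is the Dice score, 2|A∩B| / (|A| + |B|) over binary masks; a minimal sketch:

```python
def dice_score(pred, target):
    # pred, target: flat lists of 0/1 pixels; Dice = 2*|A intersect B| / (|A| + |B|)
    inter = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 2.0 * inter / total if total > 0 else 1.0

pred = [1, 1, 0, 0, 1]
target = [1, 0, 0, 0, 1]
print(dice_score(pred, target))  # 2*2 / (3+2) = 0.8
```

The bootstrap confidence intervals in the abstract come from recomputing this score over resampled subsets of the evaluation set.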
[CV-18] Is Multiple Object Tracking a Matter of Specialization? NEURIPS2024
链接: https://arxiv.org/abs/2411.00553
作者: Gianluca Mancusi,Mattia Bernardi,Aniello Panariello,Angelo Porrello,Rita Cucchiara,Simone Calderara
关键词-EN: achieved remarkable performance, Scenario-specific Tracking Architecture, Modular Deep Learning, human-related datasets, achieved remarkable
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: NeurIPS 2024
点击查看摘要
Abstract:End-to-end transformer-based trackers have achieved remarkable performance on most human-related datasets. However, training these trackers in heterogeneous scenarios poses significant challenges, including negative interference - where the model learns conflicting scene-specific parameters - and limited domain generalization, which often necessitates expensive fine-tuning to adapt the models to new domains. In response to these challenges, we introduce Parameter-efficient Scenario-specific Tracking Architecture (PASTA), a novel framework that combines Parameter-Efficient Fine-Tuning (PEFT) and Modular Deep Learning (MDL). Specifically, we define key scenario attributes (e.g, camera-viewpoint, lighting condition) and train specialized PEFT modules for each attribute. These expert modules are combined in parameter space, enabling systematic generalization to new domains without increasing inference time. Extensive experiments on MOTSynth, along with zero-shot evaluations on MOT17 and PersonPath22 demonstrate that a neural tracker built from carefully selected modules surpasses its monolithic counterpart. We release models and code.
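Combining expert modules "in parameter space" means forming a weighted sum of their parameters before inference, so a single merged module runs per forward pass and inference time does not grow with the number of experts. A minimal sketch (the attribute names and uniform weights are illustrative, not PASTA's actual combination rule):

```python
def combine_modules(modules, weights):
    # modules: list of parameter dicts {name: [floats]}; weights: one scalar
    # per module. The weighted sum yields a single module of the same shape.
    names = modules[0].keys()
    return {
        name: [
            sum(w * m[name][i] for m, w in zip(modules, weights))
            for i in range(len(modules[0][name]))
        ]
        for name in names
    }

day_expert = {"adapter.w": [1.0, 0.0]}    # e.g. trained on daytime scenes
night_expert = {"adapter.w": [0.0, 1.0]}  # e.g. trained on nighttime scenes
merged = combine_modules([day_expert, night_expert], [0.5, 0.5])
print(merged)  # {'adapter.w': [0.5, 0.5]}
```

Because the merge happens once, ahead of inference, the deployed tracker pays no per-frame cost for modularity.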
[CV-19] Tracking one-in-a-million: Large-scale benchmark for microbial single-cell tracking with experiment-aware robustness metrics ECCV2024
链接: https://arxiv.org/abs/2411.00552
作者: J. Seiffarth,L. Blöbaum,R. D. Paul,N. Friederich,A. J. Yamachui Sitcheu,R. Mikut,H. Scharr,A. Grünberger,K. Nöh
关键词-EN: presents tremendous potential, experiment parameters, time-lapses reveals crucial, reveals crucial insights, biotechnological applications
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 17 pages, 4 figures, 3 tables, BioImage Computing @ ECCV 2024
点击查看摘要
Abstract:Tracking the development of living cells in live-cell time-lapses reveals crucial insights into single-cell behavior and presents tremendous potential for biomedical and biotechnological applications. In microbial live-cell imaging (MLCI), a few to thousands of cells have to be detected and tracked within dozens of growing cell colonies. The challenge of tracking cells is heavily influenced by the experiment parameters, namely the imaging interval and maximal cell number. For now, tracking benchmarks are not widely available in MLCI and the effect of these parameters on the tracking performance is not yet known. Therefore, we present the largest publicly available and annotated dataset for MLCI, containing more than 1.4 million cell instances, 29k cell tracks, and 14k cell divisions. With this dataset at hand, we generalize existing tracking metrics to incorporate relevant imaging and experiment parameters into experiment-aware metrics. These metrics reveal that current cell tracking methods crucially depend on the choice of the experiment parameters, where their performance deteriorates at high imaging intervals and large cell colonies. Thus, our new benchmark quantifies the influence of experiment parameters on the tracking quality, and gives the opportunity to develop new data-driven methods that generalize across imaging and experiment parameters. The benchmark dataset is publicly available at this https URL.
[CV-20] 3D Equivariant Pose Regression via Direct Wigner-D Harmonics Prediction KR NEURIPS2024
链接: https://arxiv.org/abs/2411.00543
作者: Jongmin Lee,Minsu Cho
关键词-EN: vision applications, crucial task, single-image pose estimation, Determining, single-image pose
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO); Image and Video Processing (eess.IV)
*备注: Accepted to NeurIPS 2024, Project webpage at this http URL
点击查看摘要
Abstract:Determining the 3D orientations of an object in an image, known as single-image pose estimation, is a crucial task in 3D vision applications. Existing methods typically learn 3D rotations parametrized in the spatial domain using Euler angles or quaternions, but these representations often introduce discontinuities and singularities. SO(3)-equivariant networks enable the structured capture of pose patterns with data-efficient learning, but the parametrizations in spatial domain are incompatible with their architecture, particularly spherical CNNs, which operate in the frequency domain to enhance computational efficiency. To overcome these issues, we propose a frequency-domain approach that directly predicts Wigner-D coefficients for 3D rotation regression, aligning with the operations of spherical CNNs. Our SO(3)-equivariant pose harmonics predictor overcomes the limitations of spatial parameterizations, ensuring consistent pose estimation under arbitrary rotations. Trained with a frequency-domain regression loss, our method achieves state-of-the-art results on benchmarks such as ModelNet10-SO(3) and PASCAL3D+, with significant improvements in accuracy, robustness, and data efficiency.
[CV-21] Cross-modal semantic segmentation for indoor environmental perception using single-chip millimeter-wave radar raw data
链接: https://arxiv.org/abs/2411.00499
作者: Hairuo Hu,Haiyong Cong,Zhuyu Shao,Yubo Bi,Jinghao Liu
关键词-EN: indoor environmental perception, rescue operations, single-chip millimeter-wave, context of firefighting, firefighting and rescue
类目: Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 5291 words, 17 pages, 11 figures
点击查看摘要
Abstract:In the context of firefighting and rescue operations, a cross-modal semantic segmentation model based on a single-chip millimeter-wave (mmWave) radar for indoor environmental perception is proposed and discussed. To efficiently obtain high-quality labels, an automatic label generation method utilizing LiDAR point clouds and occupancy grid maps is introduced. The proposed segmentation model is based on U-Net. A spatial attention module is incorporated, which enhances the performance of the model. The results demonstrate that cross-modal semantic segmentation provides a more intuitive and accurate representation of indoor environments. Unlike traditional methods, the model’s segmentation performance is minimally affected by azimuth. Although performance declines with increasing distance, this can be mitigated by a well-designed model. Additionally, it was found that using raw ADC data as input is ineffective; compared to RA tensors, RD tensors are more suitable for the proposed model.
[CV-22] LAM-YOLO: Drones-based Small Object Detection on Lighting-Occlusion Attention Mechanism YOLO
链接: https://arxiv.org/abs/2411.00485
作者: Yuchen Zheng,Yuxin Jing,Jufeng Zhao,Guangmang Cui
关键词-EN: presents inherent challenges, detection presents inherent, target detection presents, varying lighting conditions, complicates identification
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Drone-based target detection presents inherent challenges, such as the high density and overlap of targets in drone-based images, as well as the blurriness of targets under varying lighting conditions, which complicates identification. Traditional methods often struggle to recognize numerous densely packed small targets against complex backgrounds. To address these challenges, we propose LAM-YOLO, an object detection model specifically designed for drone-based imagery. First, we introduce a light-occlusion attention mechanism to enhance the visibility of small targets under different lighting conditions. Meanwhile, we incorporate Involution modules to improve interaction among feature layers. Second, we utilize an improved SIB-IoU as the regression loss function to accelerate model convergence and enhance localization accuracy. Finally, we implement a novel detection strategy that introduces two auxiliary detection heads for identifying smaller-scale targets. The quantitative results demonstrate that LAM-YOLO outperforms methods such as Faster R-CNN, YOLOv9, and YOLOv10 in terms of mAP@0.5 and mAP@0.5:0.95 on the VisDrone2019 public dataset. Compared to the original YOLOv8, the average precision increases by 7.1%. Additionally, the proposed SIB-IoU loss function shows faster convergence during training and improved average precision over the traditional loss function.
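SIB-IoU belongs to the family of IoU-based regression losses; the abstract does not spell out its modifications, but the plain IoU it builds on can be sketched as:

```python
def iou(box_a, box_b):
    # boxes as (x1, y1, x2, y2); returns intersection-over-union in [0, 1]
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # intersection 1, union 7
```

An IoU-based regression loss is then typically 1 - IoU (plus variant-specific penalty terms), so better-aligned boxes yield smaller losses.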
[CV-23] MV-Adapter: Enhancing Underwater Instance Segmentation via Adaptive Channel Attention
链接: https://arxiv.org/abs/2411.00472
作者: Lianjun Liu
关键词-EN: Underwater instance segmentation, underwater vision tasks, fundamental and critical, critical step, Underwater
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Underwater instance segmentation is a fundamental and critical step in various underwater vision tasks. However, the decline in image quality caused by complex underwater environments presents significant challenges to existing segmentation models. While the state-of-the-art USIS-SAM model has demonstrated impressive performance, it struggles to effectively adapt to feature variations across different channels in addressing issues such as light attenuation, color distortion, and complex backgrounds. This limitation hampers its segmentation performance in challenging underwater scenarios. To address these issues, we propose the MarineVision Adapter (MV-Adapter). This module introduces an adaptive channel attention mechanism that enables the model to dynamically adjust the feature weights of each channel based on the characteristics of underwater images. By adaptively weighting features, the model can effectively handle challenges such as light attenuation, color shifts, and complex backgrounds. Experimental results show that integrating the MV-Adapter module into the USIS-SAM network architecture further improves the model’s overall performance, especially in high-precision segmentation tasks. On the USIS10K dataset, the module achieves improvements in key metrics such as mAP, AP50, and AP75 compared to competitive baseline models.
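An adaptive channel attention of the kind described can be sketched as a squeeze-and-excite gate: average each channel, pass the summary through a gate (here a single hypothetical weight plus sigmoid), and rescale the channel. This is an illustrative stand-in, not the MV-Adapter's exact design:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def channel_attention(feature_map, gate_weights):
    # feature_map: list of channels, each a flat list of spatial activations.
    # Squeeze: global average per channel. Excite: per-channel gate + sigmoid.
    # Scale: reweight each channel by its learned gate value.
    squeezed = [sum(ch) / len(ch) for ch in feature_map]
    gates = [sigmoid(w * s) for w, s in zip(gate_weights, squeezed)]
    return [[g * v for v in ch] for g, ch in zip(gates, feature_map)]

fmap = [[1.0, 1.0], [4.0, 4.0]]        # 2 channels, 2 "pixels" each
out = channel_attention(fmap, [0.0, 0.0])
print(out)  # zero gate weights give sigmoid(0) = 0.5, every channel halved
```

Training the gate weights lets the model suppress channels corrupted by light attenuation or color shift and amplify informative ones.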
[CV-24] Target-Guided Adversarial Point Cloud Transformer Towards Recognition Against Real-world Corruptions NEURIPS2024
链接: https://arxiv.org/abs/2411.00462
作者: Jie Wang,Tingfa Xu,Lihe Ding,Jianan Li
关键词-EN: corrupted data presents, Achieving robust, Adversarial Significance Identifier, Point Cloud Transformer, Adversarial Point Cloud
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by NeurIPS 2024; code: this https URL
点击查看摘要
Abstract:Achieving robust 3D perception in the face of corrupted data presents a challenging hurdle within 3D vision research. Contemporary transformer-based point cloud recognition models, albeit advanced, tend to overfit to specific patterns, consequently undermining their robustness against corruption. In this work, we introduce the Target-Guided Adversarial Point Cloud Transformer, termed APCT, a novel architecture designed to augment global structure capture through an adversarial feature erasing mechanism predicated on patterns discerned at each step during training. Specifically, APCT integrates an Adversarial Significance Identifier and a Target-guided Promptor. The Adversarial Significance Identifier is tasked with discerning token significance by integrating global contextual analysis, utilizing a structural salience index algorithm alongside an auxiliary supervisory mechanism. The Target-guided Promptor is responsible for accentuating the propensity for token discard within the self-attention mechanism, utilizing the value derived above, consequently directing the model's attention towards alternative segments in subsequent stages. By iteratively applying this strategy in multiple steps during training, the network progressively identifies and integrates an expanded array of object-associated patterns. Extensive experiments demonstrate that our method achieves state-of-the-art results on multiple corruption benchmarks.
[CV-25] ConceptFactory: Facilitate 3D Object Knowledge Annotation with Object Conceptualization NEURIPS2024
链接: https://arxiv.org/abs/2411.00448
作者: Jianhua Sun,Yuxuan Li,Longfei Xu,Nange Wang,Jiude Wei,Yining Zhang,Cewu Lu
关键词-EN: promoting machine intelligence, learn comprehensive object, Concept Template Library, Standard Concept Template, comprehensive object knowledge
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Robotics (cs.RO)
*备注: NeurIPS 2024 Track on Datasets and Benchmarks
点击查看摘要
Abstract:We present ConceptFactory, a novel scope to facilitate more efficient annotation of 3D object knowledge by recognizing 3D objects through generalized concepts (i.e. object conceptualization), aiming at promoting machine intelligence to learn comprehensive object knowledge from both vision and robotics aspects. This idea originates from the findings in human cognition research that the perceptual recognition of objects can be explained as a process of arranging generalized geometric components (e.g. cuboids and cylinders). ConceptFactory consists of two critical parts: i) ConceptFactory Suite, a unified toolbox that adopts Standard Concept Template Library (STL-C) to drive a web-based platform for object conceptualization, and ii) ConceptFactory Asset, a large collection of conceptualized objects acquired using ConceptFactory suite. Our approach enables researchers to effortlessly acquire or customize extensive varieties of object knowledge to comprehensively study different object understanding tasks. We validate our idea on a wide range of benchmark tasks from both vision and robotics aspects with state-of-the-art algorithms, demonstrating the high quality and versatility of annotations provided by our approach. Our website is available at this https URL.
[CV-26] PLATYPUS: Progressive Local Surface Estimator for Arbitrary-Scale Point Cloud Upsampling
链接: https://arxiv.org/abs/2411.00432
作者: Donghyun Kim,Hyeonkyeong Kwon,Yumin Kim,Seong Jae Hwang
关键词-EN: raw data captured, driving and robotics, noise and sparsity, downstream tasks, increasingly vital
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:3D point clouds are increasingly vital for applications like autonomous driving and robotics, yet the raw data captured by sensors often suffer from noise and sparsity, creating challenges for downstream tasks. Consequently, point cloud upsampling becomes essential for improving density and uniformity, with recent approaches showing promise by projecting randomly generated query points onto the underlying surface of sparse point clouds. However, these methods often result in outliers, non-uniformity, and difficulties in handling regions with high curvature and intricate structures. In this work, we address these challenges by introducing the Progressive Local Surface Estimator (PLSE), which more effectively captures local features in complex regions through a curvature-based sampling technique that selectively targets high-curvature areas. Additionally, we incorporate a curriculum learning strategy that leverages the curvature distribution within the point cloud to naturally assess the sample difficulty, enabling curriculum learning on point cloud data for the first time. The experimental results demonstrate that our approach significantly outperforms existing methods, achieving high-quality, dense point clouds with superior accuracy and detail.
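Curvature-based sampling can be illustrated in 2D: score each interior point of a polyline by its turning angle and keep the sharpest ones. PLSE operates on 3D point clouds with a different curvature estimate; this is only a stand-in for the idea of selectively targeting high-curvature regions:

```python
import math

def turning_angle(p_prev, p, p_next):
    # angle between the incoming and outgoing segments at point p
    a1 = math.atan2(p[1] - p_prev[1], p[0] - p_prev[0])
    a2 = math.atan2(p_next[1] - p[1], p_next[0] - p[0])
    d = abs(a2 - a1) % (2 * math.pi)
    return min(d, 2 * math.pi - d)

def sample_high_curvature(points, k):
    # Score interior points by turning angle and keep the k sharpest,
    # a 2D proxy for curvature-aware sampling on 3D point clouds.
    scores = [
        (turning_angle(points[i - 1], points[i], points[i + 1]), i)
        for i in range(1, len(points) - 1)
    ]
    scores.sort(reverse=True)
    return sorted(i for _, i in scores[:k])

# straight line with one sharp corner at index 2
pts = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
print(sample_high_curvature(pts, 1))  # [2], the corner point
```

The same score can double as a difficulty signal for curriculum learning: flat regions are easy, high-curvature regions are hard.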
[CV-27] Class Incremental Learning with Task-Specific Batch Normalization and Out-of-Distribution Detection
链接: https://arxiv.org/abs/2411.00430
作者: Xuchen Xie,Yiqiao Qiu,Run Lin,Weishi Zheng,Ruixuan Wang
关键词-EN: incremental learning, reduce catastrophic forgetting, task incremental learning, class incremental learning, incremental learning lies
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages, 4 figures, 4 tables, in submission to IEEE Transaction of Multimedia Journal (TMM)
点击查看摘要
Abstract:This study focuses on incremental learning for image classification, exploring how to reduce catastrophic forgetting of all learned knowledge when access to old data is restricted due to memory or privacy constraints. The challenge of incremental learning lies in achieving an optimal balance between plasticity, the ability to learn new knowledge, and stability, the ability to retain old knowledge. Based on whether the task identifier (task-ID) of an image can be obtained during the test stage, incremental learning for image classification is divided into two main paradigms, which are task incremental learning (TIL) and class incremental learning (CIL). The TIL paradigm has access to the task-ID, allowing it to use multiple task-specific classification heads selected based on the task-ID. Consequently, in CIL, where the task-ID is unavailable, TIL methods must predict the task-ID to extend their application to the CIL paradigm. Our previous method for TIL adds task-specific batch normalization and classification heads incrementally. This work extends the method by predicting task-ID through an “unknown” class added to each classification head. The head with the lowest “unknown” probability is selected, enabling task-ID prediction and making the method applicable to CIL. The task-specific batch normalization (BN) modules effectively adjust the distribution of output feature maps across different tasks, enhancing the model’s plasticity. Additionally, since BN has far fewer parameters than convolutional kernels, modifying only the BN layers as new tasks arrive allows the model to effectively manage parameter growth while ensuring stability across tasks. The innovation of this study lies in the first-time introduction of task-specific BN into CIL and verifying the feasibility of extending TIL methods to CIL through task-ID prediction with state-of-the-art performance on multiple datasets.
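The task-ID prediction rule described above (pick the head whose "unknown" probability is lowest) is simple to sketch:

```python
def predict_task_id(head_outputs):
    # head_outputs: one softmax distribution per task-specific head, whose
    # LAST entry is the probability of the "unknown" class. The head least
    # surprised by the sample (lowest "unknown" probability) wins.
    unknown_probs = [probs[-1] for probs in head_outputs]
    return min(range(len(unknown_probs)), key=unknown_probs.__getitem__)

heads = [
    [0.1, 0.1, 0.8],  # task 0 head: unsure, "unknown" prob 0.8
    [0.7, 0.2, 0.1],  # task 1 head: confident, "unknown" prob 0.1
]
print(predict_task_id(heads))  # 1
```

Once the task-ID is predicted this way, the matching task-specific BN and classification head can be applied exactly as in the TIL setting.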
[CV-28] Cityscape-Adverse: Benchmarking Robustness of Semantic Segmentation with Realistic Scene Modifications via Diffusion-Based Image Editing
链接: https://arxiv.org/abs/2411.00425
作者: Naufal Suryanto,Andro Aprila Adiputra,Ahmada Yusril Kadiptya,Thi-Thu-Huong Le,Derry Pratama,Yongsu Kim,Howon Kim
关键词-EN: Recent advancements, text instructions, diffusion-based image editing, advancements in generative, enabled the transformation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 19 pages, under review, code and dataset will be available at this https URL
点击查看摘要
Abstract:Recent advancements in generative AI, particularly diffusion-based image editing, have enabled the transformation of images into highly realistic scenes using only text instructions. This technology offers significant potential for generating diverse synthetic datasets to evaluate model robustness. In this paper, we introduce Cityscape-Adverse, a benchmark that employs diffusion-based image editing to simulate eight adverse conditions, including variations in weather, lighting, and seasons, while preserving the original semantic labels. We evaluate the reliability of diffusion-based models in generating realistic scene modifications and assess the performance of state-of-the-art CNN and Transformer-based semantic segmentation models under these challenging conditions. Additionally, we analyze which modifications have the greatest impact on model performance and explore how training on synthetic datasets can improve robustness in real-world adverse scenarios. Our results demonstrate that all tested models, particularly CNN-based architectures, experienced significant performance degradation under extreme conditions, while Transformer-based models exhibited greater resilience. We verify that models trained on Cityscape-Adverse show significantly enhanced resilience when applied to unseen domains. Code and datasets will be released at this https URL.
[CV-29] Improving Viewpoint-Independent Object-Centric Representations through Active Viewpoint Selection
Link: https://arxiv.org/abs/2411.00402
Authors: Yinxuan Huang,Chengmin Gao,Bin Li,Xiangyang Xue
Keywords-EN: object occlusion, viewpoint selection, complexities inherent, inherent in visual, comprehensive understanding
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments:
View Abstract
Abstract:Given the complexities inherent in visual scenes, such as object occlusion, a comprehensive understanding often requires observation from multiple viewpoints. Existing multi-viewpoint object-centric learning methods typically employ random or sequential viewpoint selection strategies. While applicable across various scenes, these strategies may not always be ideal, as certain scenes could benefit more from specific viewpoints. To address this limitation, we propose a novel active viewpoint selection strategy. This strategy predicts images from unknown viewpoints based on information from observation images for each scene. It then compares the object-centric representations extracted from both viewpoints and selects the unknown viewpoint with the largest disparity, indicating the greatest gain in information, as the next observation viewpoint. Through experiments on various datasets, we demonstrate the effectiveness of our active viewpoint selection strategy, significantly enhancing segmentation and reconstruction performance compared to random viewpoint selection. Moreover, our method can accurately predict images from unknown viewpoints.
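The selection criterion above, choosing the unknown viewpoint whose predicted representation differs most from the observed one, can be sketched as follows (using Euclidean distance as the disparity measure is an assumption; the paper compares object-centric representations extracted by its model):

```python
import numpy as np

def select_next_viewpoint(observed_repr, predicted_reprs):
    """Pick the candidate viewpoint whose predicted representation differs
    most from the observed one, i.e. the largest expected information gain."""
    disparities = [float(np.linalg.norm(observed_repr - r)) for r in predicted_reprs]
    return int(np.argmax(disparities)), disparities

observed = np.array([1.0, 0.0, 0.0])
candidates = [np.array([1.0, 0.1, 0.0]),   # nearly identical view
              np.array([-1.0, 0.5, 0.2])]  # very different view
best, d = select_next_viewpoint(observed, candidates)
```

The chosen index becomes the next observation viewpoint; in the paper the candidate representations come from a learned viewpoint-prediction module rather than being given.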
[CV-30] StyleTex: Style Image-Guided Texture Generation for 3D Models SIGGRAPH
Link: https://arxiv.org/abs/2411.00399
Authors: Zhiyu Xie,Yuqing Zhang,Xiangjun Tang,Yiqian Wu,Dehan Chen,Gongsheng Li,Xaogang Jin
Keywords-EN: Style-guided texture generation, reference image, Style-guided texture, texture generation aims, reference style image
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*Comments: Accepted to SIGGRAPH Asia 2024
View Abstract
Abstract:Style-guided texture generation aims to generate a texture that is harmonious with both the style of the reference image and the geometry of the input mesh, given a reference style image and a 3D mesh with its text description. Although diffusion-based 3D texture generation methods, such as distillation sampling, have numerous promising applications in stylized games and films, two challenges must be addressed: 1) completely decoupling style and content from the reference image for 3D models, and 2) aligning the generated texture with the color tone and style of the reference image as well as the given text prompt. To this end, we introduce StyleTex, an innovative diffusion-model-based framework for creating stylized textures for 3D models. Our key insight is to decouple style information from the reference image while disregarding content in diffusion-based distillation sampling. Specifically, given a reference image, we first decompose its style feature from the image CLIP embedding by subtracting the embedding’s orthogonal projection in the direction of the content feature, which is represented by a text CLIP embedding. Our novel approach to disentangling the reference image’s style and content information allows us to generate distinct style and content features. We then inject the style feature into the cross-attention mechanism to incorporate it into the generation process, while utilizing the content feature as a negative prompt to further dissociate content information. Finally, we incorporate these strategies into StyleTex to obtain stylized textures. The resulting textures generated by StyleTex retain the style of the reference image, while also aligning with the text prompts and intrinsic details of the given 3D mesh. Quantitative and qualitative experiments show that our method outperforms existing baseline methods by a significant margin.
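The style/content decomposition described above is a plain orthogonal-projection subtraction in CLIP embedding space, sketched below (the random vectors stand in for real CLIP image and text embeddings):

```python
import numpy as np

def decompose_style(image_emb, text_emb):
    """Subtract the image embedding's projection onto the content (text)
    direction; what remains is a style component orthogonal to content."""
    content_dir = text_emb / np.linalg.norm(text_emb)
    content_component = np.dot(image_emb, content_dir) * content_dir
    style_feature = image_emb - content_component
    return style_feature, content_component

rng = np.random.default_rng(0)
img = rng.normal(size=512)   # stand-in for an image CLIP embedding
txt = rng.normal(size=512)   # stand-in for a text CLIP embedding
style, content = decompose_style(img, txt)
```

By construction the style feature carries no component along the content direction, which is what lets the content feature be reused separately as a negative prompt.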
[CV-31] A Simple Remedy for Dataset Bias via Self-Influence: A Mislabeled Sample Perspective
Link: https://arxiv.org/abs/2411.00360
Authors: Yeonsung Jung,Jaeyun Song,June Yong Yang,Jin-Hwa Kim,Sung-Yub Kim,Eunho Yang
Keywords-EN: Learning generalized models, Learning generalized, deep learning, important undertaking, undertaking toward fairness
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*Comments:
View Abstract
Abstract:Learning generalized models from biased data is an important undertaking toward fairness in deep learning. To address this issue, recent studies attempt to identify and leverage bias-conflicting samples free from spurious correlations without prior knowledge of bias or an unbiased set. However, spurious correlation remains an ongoing challenge, primarily due to the difficulty in precisely detecting these samples. In this paper, inspired by the similarities between mislabeled samples and bias-conflicting samples, we approach this challenge from a novel perspective of mislabeled sample detection. Specifically, we delve into Influence Function, one of the standard methods for mislabeled sample detection, for identifying bias-conflicting samples and propose a simple yet effective remedy for biased models by leveraging them. Through comprehensive analysis and experiments on diverse datasets, we demonstrate that our new perspective can boost the precision of detection and rectify biased models effectively. Furthermore, our approach is complementary to existing methods, showing performance improvement even when applied to models that have already undergone recent debiasing techniques.
[CV-32] All-frequency Full-body Human Image Relighting
Link: https://arxiv.org/abs/2411.00356
Authors: Daichi Tajima,Yoshihiro Kanamori,Yuki Endo
Keywords-EN: enables post-photography editing, human images enables, images enables post-photography, enables post-photography, post-photography editing
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
*Comments: project page: this https URL
View Abstract
Abstract:Relighting of human images enables post-photography editing of lighting effects in portraits. The current mainstream approach uses neural networks to approximate lighting effects without explicitly accounting for the principle of physical shading. As a result, it often has difficulty representing high-frequency shadows and shading. In this paper, we propose a two-stage relighting method that can reproduce physically-based shadows and shading from low to high frequencies. The key idea is to approximate an environment light source with a fixed number of area light sources. The first stage employs supervised inverse rendering from a single image using neural networks and calculates physically-based shading. The second stage then calculates shadows for each area light and sums them to render the final image. We propose to make soft shadow mapping differentiable for the area-light approximation of environment lighting. We demonstrate that our method can plausibly reproduce all-frequency shadows and shading caused by environment illumination, which have been difficult to reproduce using existing methods.
[CV-33] GAFusion: Adaptive Fusing LiDAR and Camera with Multiple Guidance for 3D Object Detection
Link: https://arxiv.org/abs/2411.00340
Authors: Xiaotian Li,Baojie Fan,Jiandong Tian,Huijie Fan
Keywords-EN: Recent years, detection methods based, years have witnessed, witnessed the remarkable, remarkable progress
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments:
View Abstract
Abstract:Recent years have witnessed the remarkable progress of 3D multi-modality object detection methods based on the Bird’s-Eye-View (BEV) perspective. However, most of them overlook the complementary interaction and guidance between LiDAR and camera. In this work, we propose a novel multi-modality 3D object detection method, named GAFusion, with LiDAR-guided global interaction and adaptive fusion. Specifically, we introduce sparse depth guidance (SDG) and LiDAR occupancy guidance (LOG) to generate 3D features with sufficient depth information. In the following, LiDAR-guided adaptive fusion transformer (LGAFT) is developed to adaptively enhance the interaction of different modal BEV features from a global perspective. Meanwhile, additional downsampling with sparse height compression and multi-scale dual-path transformer (MSDPT) are designed to enlarge the receptive fields of different modal features. Finally, a temporal fusion module is introduced to aggregate features from previous frames. GAFusion achieves state-of-the-art 3D object detection results with 73.6% mAP and 74.9% NDS on the nuScenes test set.
[CV-34] NCST: Neural-based Color Style Transfer for Video Retouching
Link: https://arxiv.org/abs/2411.00335
Authors: Xintao Jiang,Yaosen Chen,Siqin Zhang,Wei Wang,Xuming Wen
Keywords-EN: color style, color style transfer, style, style transfer, color
Subjects: Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE); Image and Video Processing (eess.IV)
*Comments: 10 pages, 8 figures
View Abstract
Abstract:Video color style transfer aims to transform the color style of an original video by using a reference style image. Most existing methods employ neural networks, which come with challenges like opaque transfer processes and limited user control over the outcomes. Typically, users cannot fine-tune the resulting images or videos. To tackle this issue, we introduce a method that predicts specific parameters for color style transfer using two images. Initially, we train a neural network to learn the corresponding color adjustment parameters. When applying style transfer to a video, we fine-tune the network with key frames from the video and the chosen style image, generating precise transformation parameters. These are then applied to convert the color style of both images and videos. Our experimental results demonstrate that our algorithm surpasses current methods in color style transfer quality. Moreover, each parameter in our method has a specific, interpretable meaning, enabling users to understand the color style transfer process and allowing them to perform manual fine-tuning if desired.
[CV-35] Multiple Information Prompt Learning for Cloth-Changing Person Re-Identification
Link: https://arxiv.org/abs/2411.00330
Authors: Shengxun Wei,Zan Gao,Yibo Zhao,Weili Guan
Keywords-EN: pedestrians change clothes, Cloth-changing person re-identification, person re-identification, real world, subject closer
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments:
View Abstract
Abstract:Cloth-changing person re-identification is a subject closer to the real world, which focuses on solving the problem of person re-identification after pedestrians change clothes. The primary challenge in this field is to overcome the complex interplay between intra-class and inter-class variations and to identify features that remain unaffected by changes in appearance. Sufficient data collection for model training would significantly aid in addressing this problem. However, it is challenging to gather diverse datasets in practice. Current methods focus on implicitly learning identity information from the original image or introducing additional auxiliary models, which are largely limited by the quality of the image and the performance of the additional model. To address these issues, inspired by prompt learning, we propose a novel multiple information prompt learning (MIPL) scheme for cloth-changing person ReID, which learns identity robust features through the common prompt guidance of multiple messages. Specifically, the clothing information stripping (CIS) module is designed to decouple the clothing information from the original RGB image features to counteract the influence of clothing appearance. The Bio-guided attention (BGA) module is proposed to increase the learning intensity of the model for key information. A dual-length hybrid patch (DHP) module is employed to make the features have diverse coverage to minimize the impact of feature bias. Extensive experiments demonstrate that the proposed method outperforms all state-of-the-art methods on the LTCC, Celeb-reID, Celeb-reID-light, and CSCC datasets, achieving rank-1 scores of 74.8%, 73.3%, 66.0%, and 88.1%, respectively. When compared to AIM (CVPR23), ACID (TIP23), and SCNet (MM23), MIPL achieves rank-1 improvements of 11.3%, 13.8%, and 7.9%, respectively, on the PRCC dataset.
[CV-36] Unified Generative and Discriminative Training for Multi-modal Large Language Models
Link: https://arxiv.org/abs/2411.00304
Authors: Wei Chow,Juncheng Li,Qifan Yu,Kaihang Pan,Hao Fei,Zhiqi Ge,Shuai Yang,Siliang Tang,Hanwang Zhang,Qianru Sun
Keywords-EN: Multimodal Large Language, Large Language Models, enabled Multimodal Large, Multimodal Large, Large Language
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*Comments:
View Abstract
Abstract:In recent times, Vision-Language Models (VLMs) have been trained under two predominant paradigms. Generative training has enabled Multimodal Large Language Models (MLLMs) to tackle various complex tasks, yet issues such as hallucinations and weak object discrimination persist. Discriminative training, exemplified by models like CLIP, excels in zero-shot image-text classification and retrieval, yet struggles with complex scenarios requiring fine-grained semantic differentiation. This paper addresses these challenges by proposing a unified approach that integrates the strengths of both paradigms. Considering interleaved image-text sequences as the general format of input samples, we introduce a structure-induced training strategy that imposes semantic relationships between input samples and the MLLM’s hidden state. This approach enhances the MLLM’s ability to capture global semantics and distinguish fine-grained semantics. By leveraging dynamic sequence alignment within the Dynamic Time Warping framework and integrating a novel kernel for fine-grained semantic differentiation, our method effectively balances generative and discriminative tasks. Extensive experiments demonstrate the effectiveness of our approach, achieving state-of-the-art results in multiple generative tasks, especially those requiring cognitive and discrimination abilities. Additionally, our method surpasses discriminative benchmarks in interleaved and fine-grained retrieval tasks. By employing a retrieval-augmented generation strategy, our approach further enhances performance in some generative tasks within one model, offering a promising direction for future research in vision-language modeling.
[CV-37] RadFlag: A Black-Box Hallucination Detection Method for Medical Vision Language Models
Link: https://arxiv.org/abs/2411.00299
Authors: Sraavya Sambara(1),Serena Zhang(2),Oishi Banerjee(1),Julian Acosta(1),John Fahrner(1),Pranav Rajpurkar(1) ((1) Harvard University, (2) Stanford University)
Keywords-EN: Generating accurate radiology, Vision Language Models, current Vision Language, challenging task, Generating accurate
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments: 15 pages, 8 figures
View Abstract
Abstract:Generating accurate radiology reports from medical images is a clinically important but challenging task. While current Vision Language Models (VLMs) show promise, they are prone to generating hallucinations, potentially compromising patient care. We introduce RadFlag, a black-box method to enhance the accuracy of radiology report generation. Our method uses a sampling-based flagging technique to find hallucinatory generations that should be removed. We first sample multiple reports at varying temperatures and then use a Large Language Model (LLM) to identify claims that are not consistently supported across samples, indicating that the model has low confidence in those claims. Using a calibrated threshold, we flag a fraction of these claims as likely hallucinations, which should undergo extra review or be automatically rejected. Our method achieves high precision when identifying both individual hallucinatory sentences and reports that contain hallucinations. As an easy-to-use, black-box system that only requires access to a model’s temperature parameter, RadFlag is compatible with a wide range of radiology report generation models and has the potential to broadly improve the quality of automated radiology reporting.
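The sampling-based flagging step can be sketched as follows, with each sampled report reduced to a set of already-extracted claims. In RadFlag itself an LLM judges claim support across samples and the threshold is calibrated, so the fixed threshold and the set representation here are simplifying assumptions:

```python
def flag_claims(sampled_reports, threshold=0.5):
    """Flag claims that are not consistently supported across
    temperature-sampled reports (low support fraction = low confidence)."""
    all_claims = set().union(*sampled_reports)
    n = len(sampled_reports)
    support = {c: sum(c in r for r in sampled_reports) / n for c in all_claims}
    return sorted(c for c, frac in support.items() if frac < threshold)

# Hypothetical claims extracted from four sampled reports.
samples = [
    {"cardiomegaly", "clear lungs"},
    {"cardiomegaly", "clear lungs"},
    {"cardiomegaly", "pleural effusion"},   # inconsistent claim
    {"cardiomegaly", "clear lungs"},
]
flagged = flag_claims(samples, threshold=0.5)
```

Flagged claims would then be routed to extra review or dropped, which is how the method stays black-box: it only needs the ability to sample at different temperatures.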
[CV-38] Detection and tracking of gas plumes in LWIR hyperspectral video sequence data
Link: https://arxiv.org/abs/2411.00281
Authors: Torin Gerhart,Justin Sunu,Ekaterina Merkurjev,Jen-Mei Chang,Jerome Gilles,Andrea L. Bertozzi
Keywords-EN: Automated detection, Principal Components Analysis, gas plume detection, plume detection problem, Principal Components
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*Comments:
View Abstract
Abstract:Automated detection of chemical plumes presents a segmentation challenge. The segmentation problem for gas plumes is difficult due to the diffusive nature of the cloud. The advantage of considering hyperspectral images in the gas plume detection problem over the conventional RGB imagery is the presence of non-visual data, allowing for a richer representation of information. In this paper we present an effective method of visualizing hyperspectral video sequences containing chemical plumes and investigate the effectiveness of segmentation techniques on these post-processed videos. Our approach uses a combination of dimension reduction and histogram equalization to prepare the hyperspectral videos for segmentation. First, Principal Components Analysis (PCA) is used to reduce the dimension of the entire video sequence. This is done by projecting each pixel onto the first few Principal Components resulting in a type of spectral filter. Next, a Midway method for histogram equalization is used. These methods redistribute the intensity values in order to reduce flicker between frames. This properly prepares these high-dimensional video sequences for more traditional segmentation techniques. We compare the ability of various clustering techniques to properly segment the chemical plume. These include K-means, spectral clustering, and the Ginzburg-Landau functional.
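A compact sketch of the two preprocessing steps: PCA projection of each pixel's spectrum onto the first few principal components, and midway histogram equalization between frames. The array shapes and the rank-based midway formulation are illustrative choices, not the authors' code:

```python
import numpy as np

def pca_spectral_filter(pixels, k=3):
    """Project each hyperspectral pixel onto the first k principal
    components, acting as a spectral filter.
    pixels: (num_pixels, num_bands) array of spectra."""
    centered = pixels - pixels.mean(axis=0)
    # Rows of vt are the principal directions (sorted by singular value).
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T

def midway_equalize(a, b):
    """Midway histogram equalization: map both frames onto the 'average'
    of their intensity distributions to reduce inter-frame flicker."""
    sa, sb = np.sort(a.ravel()), np.sort(b.ravel())
    mid = (sa + sb) / 2.0                       # midway sorted intensities
    ranks_a = np.argsort(np.argsort(a.ravel()))
    ranks_b = np.argsort(np.argsort(b.ravel()))
    return mid[ranks_a].reshape(a.shape), mid[ranks_b].reshape(b.shape)

rng = np.random.default_rng(1)
cube = rng.normal(size=(100, 64))          # 100 pixels x 64 spectral bands
filtered = pca_spectral_filter(cube, k=3)
f1, f2 = midway_equalize(rng.normal(0, 1, 50), rng.normal(2, 1, 50))
```

After equalization the two frames share the same intensity distribution, which is exactly the flicker-reduction property the paper relies on before segmentation.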
[CV-39] Adaptive Residual Transformation for Enhanced Feature-Based OOD Detection in SAR Imagery
Link: https://arxiv.org/abs/2411.00274
Authors: Kyung-hwan Lee,Kyung-tae Kim
Keywords-EN: Synthetic Aperture Radar, Aperture Radar, Synthetic Aperture, deep learning architectures, Recent advances
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*Comments:
View Abstract
Abstract:Recent advances in deep learning architectures have enabled efficient and accurate classification of pre-trained targets in Synthetic Aperture Radar (SAR) images. Nevertheless, the presence of unknown targets in real battlefield scenarios is unavoidable, resulting in misclassification and reducing the accuracy of the classifier. Over the past decades, various feature-based out-of-distribution (OOD) approaches have been developed to address this issue, yet defining the decision boundary between known and unknown targets remains challenging. Additionally, unlike optical images, detecting unknown targets in SAR imagery is further complicated by high speckle noise, the presence of clutter, and the inherent similarities in back-scattered microwave signals. In this work, we propose transforming feature-based OOD detection into a class-localized feature-residual-based approach, demonstrating that this method can improve stability across varying unknown targets’ distribution conditions. Transforming feature-based OOD detection into a residual-based framework offers a more robust reference space for distinguishing between in-distribution (ID) and OOD data, particularly within the unique characteristics of SAR imagery. This adaptive residual transformation method standardizes feature-based inputs into distributional representations, enhancing OOD detection in noisy, low-information images. Our approach demonstrates promising performance in real-world SAR scenarios, effectively adapting to the high levels of noise and clutter inherent in these environments. These findings highlight the practical relevance of residual-based OOD detection for SAR applications and suggest a foundation for further advancements in unknown target detection in complex, operational settings.
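One simple instantiation of a class-localized feature-residual score is the norm of a feature's residual to its nearest class mean; the sketch below uses that reading (the paper's residual transformation is more elaborate, so treat this as an assumption-laden toy):

```python
import numpy as np

def class_residual_scores(features, class_means):
    """Score each feature by the norm of its residual to the nearest
    class mean; large residuals suggest out-of-distribution samples."""
    residual_norms = np.stack(
        [np.linalg.norm(features - mu, axis=1) for mu in class_means], axis=1)
    return residual_norms.min(axis=1)

means = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]   # per-class centroids
feats = np.array([[0.1, -0.1],    # close to class 0 -> in-distribution
                  [5.2, 4.9],     # close to class 1 -> in-distribution
                  [20.0, -7.0]])  # far from both -> likely OOD
scores = class_residual_scores(feats, means)
```

Thresholding such scores separates ID from OOD targets; the residual space gives a common reference frame regardless of which known class a sample resembles.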
[CV-40] IO Transformer: Evaluating SwinV2-Based Reward Models for Computer Vision
Link: https://arxiv.org/abs/2411.00252
Authors: Maxwell Meyer,Jack Spruyt
Keywords-EN: speech recognition tasks, performance across text, derivatives have achieved, speech recognition, Transformer
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments: 15 pages, 3 figures, 2 tables
View Abstract
Abstract:Transformers and their derivatives have achieved state-of-the-art performance across text, vision, and speech recognition tasks. However, minimal effort has been made to train transformers capable of evaluating the output quality of other models. This paper examines SwinV2-based reward models, called the Input-Output Transformer (IO Transformer) and the Output Transformer. These reward models can be leveraged for tasks such as inference quality evaluation, data categorization, and policy optimization. Our experiments demonstrate highly accurate model output quality assessment across domains where the output is entirely dependent on the input, with the IO Transformer achieving perfect evaluation accuracy on the Change Dataset 25 (CD25). We also explore modified Swin V2 architectures. Ultimately, Swin V2 remains on top with a score of 95.41% on the IO Segmentation Dataset, outperforming the IO Transformer in scenarios where the output is not entirely dependent on the input. Our work expands the application of transformer architectures to reward modeling in computer vision and provides critical insights into optimizing these models for various tasks.
[CV-41] ResiDual Transformer Alignment with Spectral Decomposition
Link: https://arxiv.org/abs/2411.00246
Authors: Lorenzo Basile,Valentino Maiorca,Luca Bortolussi,Emanuele Rodolà,Francesco Locatello
Keywords-EN: puzzling property emerges, puzzling property, property emerges, specialize in specific, specific tasks
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments:
View Abstract
Abstract:When examined through the lens of their residual streams, a puzzling property emerges in transformer networks: residual contributions (e.g., attention heads) sometimes specialize in specific tasks or input attributes. In this paper, we analyze this phenomenon in vision transformers, focusing on the spectral geometry of residuals, and explore its implications for modality alignment in vision-language models. First, we link it to the intrinsically low-dimensional structure of visual head representations, zooming into their principal components and showing that they encode specialized roles across a wide variety of input data distributions. Then, we analyze the effect of head specialization in multimodal models, focusing on how improved alignment between text and specialized heads impacts zero-shot classification performance. This specialization-performance link consistently holds across diverse pre-training data, network sizes, and objectives, demonstrating a powerful new mechanism for boosting zero-shot classification through targeted alignment. Ultimately, we translate these insights into actionable terms by introducing ResiDual, a technique for spectral alignment of the residual stream. Much like panning for gold, it lets the noise from irrelevant unit principal components (i.e., attributes) wash away to amplify task-relevant ones. Remarkably, this dual perspective on modality alignment yields fine-tuning level performances on different data distributions while modeling an extremely interpretable and parameter-efficient transformation, as we extensively show on more than 50 (pre-trained network, dataset) pairs.
[CV-42] Aquatic-GS: A Hybrid 3D Representation for Underwater Scenes
Link: https://arxiv.org/abs/2411.00239
Authors: Shaohua Liu,Junzhe Lu,Zuoya Gu,Jiajun Li,Yue Deng
Keywords-EN: imaging significantly couple, underwater imaging significantly, Neural Water Field, water medium, water
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments: 13 pages, 7 figures
View Abstract
Abstract:Representing underwater 3D scenes is a valuable yet complex task, as attenuation and scattering effects during underwater imaging significantly couple the information of the objects and the water. This coupling presents a significant challenge for existing methods in effectively representing both the objects and the water medium simultaneously. To address this challenge, we propose Aquatic-GS, a hybrid 3D representation approach for underwater scenes that effectively represents both the objects and the water medium. Specifically, we construct a Neural Water Field (NWF) to implicitly model the water parameters, while extending the latest 3D Gaussian Splatting (3DGS) to model the objects explicitly. Both components are integrated through a physics-based underwater image formation model to represent complex underwater scenes. Moreover, to construct more precise scene geometry and details, we design a Depth-Guided Optimization (DGO) mechanism that uses a pseudo-depth map as auxiliary guidance. After optimization, Aquatic-GS enables the rendering of novel underwater viewpoints and supports restoring the true appearance of underwater scenes, as if the water medium were absent. Extensive experiments on both simulated and real-world datasets demonstrate that Aquatic-GS surpasses state-of-the-art underwater 3D representation methods, achieving better rendering quality and real-time rendering performance with a 410x increase in speed. Furthermore, regarding underwater image restoration, Aquatic-GS outperforms representative dewatering methods in color correction, detail recovery, and stability. Our models, code, and datasets can be accessed at this https URL.
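The physics-based underwater image formation model referenced above is commonly written as observed = clear·e^(−βd) + B·(1 − e^(−βd)), an attenuation term plus a backscatter term, and restoring the scene "as if the water medium were absent" is its inversion. A per-channel sketch (the constants are illustrative; Aquatic-GS learns the water parameters with its Neural Water Field):

```python
import numpy as np

def underwater_formation(clear, depth, beta, backscatter):
    """Attenuation + backscatter image formation:
    observed = clear * exp(-beta*d) + backscatter * (1 - exp(-beta*d))."""
    t = np.exp(-beta * depth)          # per-channel transmission
    return clear * t + backscatter * (1.0 - t)

def restore(observed, depth, beta, backscatter):
    """Invert the model to recover the scene without the water medium."""
    t = np.exp(-beta * depth)
    return (observed - backscatter * (1.0 - t)) / t

clear = np.array([0.8, 0.5, 0.2])                  # toy RGB pixel
beta = np.array([0.1, 0.05, 0.3])                  # per-channel attenuation
backscatter = np.array([0.05, 0.2, 0.3])
obs = underwater_formation(clear, 3.0, beta, backscatter)
rec = restore(obs, 3.0, beta, backscatter)
```

With known (or learned) water parameters, the inversion recovers the clear scene exactly in this noiseless toy setting, which is the basis of the dewatering comparison in the abstract.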
[CV-43] Fashion-VDM: Video Diffusion Model for Virtual Try-On SIGGRAPH
Link: https://arxiv.org/abs/2411.00225
Authors: Johanna Karras,Yingwei Li,Nan Liu,Luyang Zhu,Innfarn Yoo,Andreas Lugmayr,Chris Lee,Ira Kemelmacher-Shlizerman
Keywords-EN: video virtual try-on, video diffusion model, virtual try-on, generating virtual try-on, present Fashion-VDM
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments: Accepted to SIGGRAPH Asia 2024
View Abstract
Abstract:We present Fashion-VDM, a video diffusion model (VDM) for generating virtual try-on videos. Given an input garment image and person video, our method aims to generate a high-quality try-on video of the person wearing the given garment, while preserving the person’s identity and motion. Image-based virtual try-on has shown impressive results; however, existing video virtual try-on (VVT) methods are still lacking garment details and temporal consistency. To address these issues, we propose a diffusion-based architecture for video virtual try-on, split classifier-free guidance for increased control over the conditioning inputs, and a progressive temporal training strategy for single-pass 64-frame, 512px video generation. We also demonstrate the effectiveness of joint image-video training for video try-on, especially when video data is limited. Our qualitative and quantitative experiments show that our approach sets the new state-of-the-art for video virtual try-on. For additional results, visit our project page: this https URL.
[CV-44] Scale-Aware Recognition in Satellite Images under Resource Constraint
Link: https://arxiv.org/abs/2411.00210
Authors: Shreelekha Revankar,Cheng Perng Phoo,Utkarsh Mall,Bharath Hariharan,Kavita Bala
Keywords-EN: swimming pools, depends strongly, imagery, features in satellite, spatial scale
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments: 15 pages, 4 figures
View Abstract
Abstract:Recognition of features in satellite imagery (forests, swimming pools, etc.) depends strongly on the spatial scale of the concept and therefore the resolution of the images. This poses two challenges: Which resolution is best suited for recognizing a given concept, and where and when should the costlier higher-resolution (HR) imagery be acquired? We present a novel scheme to address these challenges by introducing three components: (1) A technique to distill knowledge from models trained on HR imagery to recognition models that operate on imagery of lower resolution (LR), (2) a sampling strategy for HR imagery based on model disagreement, and (3) an LLM-based approach for inferring concept “scale”. With these components we present a system to efficiently perform scale-aware recognition in satellite imagery, improving accuracy over single-scale inference while following budget constraints. Our novel approach offers up to a 26.3% improvement over entirely HR baselines, using 76.3% fewer HR images.
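One plausible reading of the disagreement-based HR sampling is to score each image tile by the divergence between the LR model's and the HR model's class predictions, then spend the HR acquisition budget on the highest-scoring tiles. Both the symmetric-KL score and the budgeting scheme below are assumptions, not the paper's exact criterion:

```python
import numpy as np

def disagreement(p_lr, p_hr):
    """Symmetric KL divergence between two class distributions;
    high values mark tiles worth re-acquiring at high resolution."""
    p, q = np.asarray(p_lr, dtype=float), np.asarray(p_hr, dtype=float)
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def pick_tiles(preds_lr, preds_hr, budget):
    """Select the `budget` tiles with the largest model disagreement."""
    scores = [disagreement(p, q) for p, q in zip(preds_lr, preds_hr)]
    return sorted(np.argsort(scores)[::-1][:budget].tolist())

lr = [[0.9, 0.1], [0.5, 0.5], [0.8, 0.2]]
hr = [[0.88, 0.12], [0.05, 0.95], [0.81, 0.19]]   # tile 1 disagrees most
chosen = pick_tiles(lr, hr, budget=1)
```

Concentrating the costly HR imagery where the models disagree is what lets the system beat single-scale inference under a fixed acquisition budget.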
[CV-45] Semantic Knowledge Distillation for Onboard Satellite Earth Observation Image Classification
Link: https://arxiv.org/abs/2411.00209
Authors: Thanh-Dung Le,Vu Nguyen Ha,Ti Ti Nguyen,Geoffrey Eappen,Prabhu Thiruvasagam,Hong-fu Chou,Duc-Dung Tran,Luis M. Garces-Socarras,Jorge L. Gonzalez-Rios,Juan Carlos Merlano-Duncan,Symeon Chatzinotas
Keywords-EN: efficient Earth observation, Earth observation, efficient Earth, resource-constrained settings, study presents
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Signal Processing (eess.SP)
*Comments: Under revision
View Abstract
Abstract:This study presents an innovative dynamic weighting knowledge distillation (KD) framework tailored for efficient Earth observation (EO) image classification (IC) in resource-constrained settings. Utilizing EfficientViT and MobileViT as teacher models, this framework enables lightweight student models, particularly ResNet8 and ResNet16, to surpass 90% in accuracy, precision, and recall, adhering to the stringent confidence thresholds necessary for reliable classification tasks. Unlike conventional KD methods that rely on static weight distribution, our adaptive weighting mechanism responds to each teacher model’s confidence, allowing student models to prioritize more credible sources of knowledge dynamically. Remarkably, ResNet8 delivers substantial efficiency gains, achieving a 97.5% reduction in parameters, a 96.7% decrease in FLOPs, an 86.2% cut in power consumption, and a 63.5% increase in inference speed over MobileViT. This significant optimization of complexity and resource demands establishes ResNet8 as an optimal candidate for EO tasks, combining robust performance with feasibility in deployment. The confidence-based, adaptable KD approach underscores the potential of dynamic distillation strategies to yield high-performing, resource-efficient models tailored for satellite-based EO applications. The reproducible code is accessible on our GitHub repository.
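The dynamic weighting idea, letting each sample's distillation target lean toward the more confident teacher, can be sketched as follows (using max class probability as the confidence measure and a softmax temperature are assumptions about the mechanism, not the paper's exact formulation):

```python
import numpy as np

def softmax(z, t=1.0):
    z = np.asarray(z, dtype=float) / t
    e = np.exp(z - z.max())
    return e / e.sum()

def dynamic_kd_targets(teacher_logits, temperature=2.0):
    """Blend several teachers' softened distributions, weighting each
    teacher by its per-sample confidence (max class probability)."""
    probs = [softmax(l, temperature) for l in teacher_logits]
    conf = np.array([p.max() for p in probs])
    weights = conf / conf.sum()          # normalize to a convex combination
    target = sum(w * p for w, p in zip(weights, probs))
    return target, weights

teachers = [[4.0, 0.5, 0.2],   # confident teacher (e.g. EfficientViT)
            [1.0, 0.9, 0.8]]   # uncertain teacher (e.g. MobileViT)
target, weights = dynamic_kd_targets(teachers)
```

The student (e.g. ResNet8) would then minimize a KL loss against this per-sample target, so more credible teachers dominate dynamically rather than via static weights.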
[CV-46] Evaluating the Evolution of YOLO (You Only Look Once) Models: A Comprehensive Benchmark Study of YOLO11 and Its Predecessors
Link: https://arxiv.org/abs/2411.00201
Authors: Nidhal Jegham,Chan Young Koh,Marwan Abdelatti,Abdeltawab Hendawi
Keywords-EN: comprehensive benchmark analysis, study presents, African Wildlife, Traffic Signs, Model Size
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments: 20 pages
View Abstract
Abstract:This study presents a comprehensive benchmark analysis of various YOLO (You Only Look Once) algorithms, from YOLOv3 to the newest addition. It represents the first research to comprehensively evaluate the performance of YOLO11, the latest addition to the YOLO family. It evaluates their performance on three diverse datasets: Traffic Signs (with varying object sizes), African Wildlife (with diverse aspect ratios and at least one instance of the object per image), and Ships and Vessels (with small-sized objects of a single class), ensuring a comprehensive assessment across datasets with distinct challenges. To ensure a robust evaluation, we employ a comprehensive set of metrics, including Precision, Recall, Mean Average Precision (mAP), Processing Time, GFLOPs count, and Model Size. Our analysis highlights the distinctive strengths and limitations of each YOLO version. For example, YOLOv9 demonstrates substantial accuracy but struggles with small-object detection and efficiency, whereas YOLOv10 exhibits relatively lower accuracy due to architectural choices that affect its performance in overlapping object detection but excels in speed and efficiency. Additionally, the YOLO11 family consistently shows superior performance in terms of accuracy, speed, computational efficiency, and model size. YOLO11m achieved a remarkable balance of accuracy and efficiency, with mAP50-95 scores of 0.795, 0.81, and 0.325 on the Traffic Signs, African Wildlife, and Ships datasets, respectively, while maintaining an average inference time of 2.4 ms, a model size of 38.8 MB, and around 67.6 GFLOPs on average. These results provide critical insights for both industry and academia, facilitating the selection of the most suitable YOLO algorithm for diverse applications and guiding future enhancements.
[CV-47] Optical Lens Attack on Monocular Depth Estimation for Autonomous Driving
链接: https://arxiv.org/abs/2411.00192
作者: Ce Zhou(1),Qiben Yan(1),Daniel Kent(1),Guangjing Wang(2),Weikang Ding(1),Ziqi Zhang(3),Hayder Radha(1) ((1) Michigan State University, (2) University of South Florida, (3) Peking University)
关键词-EN: Monocular Depth Estimation, vision-based Autonomous Driving, single camera image, Monocular Depth, Depth Estimation
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
*备注: 28 pages. arXiv admin note: substantial text overlap with arXiv:2409.17376
点击查看摘要
Abstract:Monocular Depth Estimation (MDE) is a pivotal component of vision-based Autonomous Driving (AD) systems, enabling vehicles to estimate the depth of surrounding objects using a single camera image. This estimation guides essential driving decisions, such as braking before an obstacle or changing lanes to avoid collisions. In this paper, we explore vulnerabilities of MDE algorithms in AD systems, presenting LensAttack, a novel physical attack that strategically places optical lenses on the camera of an autonomous vehicle to manipulate the perceived object depths. LensAttack encompasses two attack formats: concave lens attack and convex lens attack, each utilizing different optical lenses to induce false depth perception. We first develop a mathematical model that outlines the parameters of the attack, followed by simulations and real-world evaluations to assess its efficacy on state-of-the-art MDE models. Additionally, we adopt an attack optimization method to further enhance the attack success rate by optimizing the attack focal length. To better evaluate the implications of LensAttack on AD, we conduct comprehensive end-to-end system simulations using the CARLA platform. The results reveal that LensAttack can significantly disrupt the depth estimation processes in AD systems, posing a serious threat to their reliability and safety. Finally, we discuss some potential defense methods to mitigate the effects of the proposed attack.
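论文提到为攻击参数建立了数学模型,但摘要未给出公式。下面用经典薄透镜公式做一个假设性的演示(并非论文的实际模型),说明外加透镜(凸透镜 f>0、凹透镜 f<0)如何改变像距,从而可能扭曲单目深度估计的输入:

```python
def thin_lens_image_distance(f, d_o):
    """薄透镜公式 1/f = 1/d_o + 1/d_i,求像距 d_i。
    f: 焦距(凸透镜为正,凹透镜为负);d_o: 物距。单位一致即可。"""
    inv = 1.0 / f - 1.0 / d_o  # 1/d_i = 1/f - 1/d_o
    if inv == 0:
        raise ValueError("物体位于焦点处,成像于无穷远")
    return 1.0 / inv
```

凹透镜返回负像距(虚像),与凸透镜的实像情形相对应,这正是两种攻击形式诱导不同方向深度偏差的物理直觉。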
[CV-48] Pedestrian Trajectory Prediction with Missing Data: Datasets Imputation and Benchmarking NEURIPS2024
链接: https://arxiv.org/abs/2411.00174
作者: Pranav Singh Chib,Pravendra Singh
关键词-EN: Pedestrian trajectory prediction, trajectory prediction, trajectory prediction methods, Pedestrian trajectory, trajectory
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: Accepted at NeurIPS 2024
点击查看摘要
Abstract:Pedestrian trajectory prediction is crucial for several applications such as robotics and self-driving vehicles. Significant progress has been made in the past decade thanks to the availability of pedestrian trajectory datasets, which enable trajectory prediction methods to learn from pedestrians’ past movements and predict future trajectories. However, these datasets and methods typically assume that the observed trajectory sequence is complete, ignoring real-world issues such as sensor failure, occlusion, and limited fields of view that can result in missing values in observed trajectories. To address this challenge, we present TrajImpute, a pedestrian trajectory prediction dataset that simulates missing coordinates in the observed trajectory, enhancing real-world applicability. TrajImpute maintains a uniform distribution of missing data within the observed trajectories. In this work, we comprehensively examine several imputation methods to reconstruct the missing coordinates and benchmark them for imputing pedestrian trajectories. Furthermore, we provide a thorough analysis of recent trajectory prediction methods and evaluate the performance of these models on the imputed trajectories. Our experimental evaluation of the imputation and trajectory prediction methods offers several valuable insights. Our dataset provides a foundational resource for future research on imputation-aware pedestrian trajectory prediction, potentially accelerating the deployment of these methods in real-world applications. Publicly accessible links to the datasets and code files are available at this https URL.
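TrajImpute 中比较了多种插值方法来重建缺失坐标;下面是一个最简单的基线示意(线性插值,假设性实现,非论文官方代码),对带 NaN 的观测轨迹逐坐标补全:

```python
import numpy as np

def impute_linear(traj):
    """对带 NaN 的 (T, 2) 行人轨迹逐坐标做线性插值,
    端点缺失时用最近的有效值填充。"""
    traj = traj.copy()
    t = np.arange(len(traj))
    for d in range(traj.shape[1]):
        col = traj[:, d]
        valid = ~np.isnan(col)
        traj[:, d] = np.interp(t, t[valid], col[valid])
    return traj
```

补全后的轨迹即可直接送入现有的轨迹预测模型,这也是该基准评估各预测方法的方式。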
[CV-49] SeafloorAI: A Large-scale Vision-Language Dataset for Seafloor Geological Survey
链接: https://arxiv.org/abs/2411.00172
作者: Kien X. Nguyen,Fengchun Qiao,Arthur Trembanis,Xi Peng
关键词-EN: sonar imagery analysis, major obstacle, machine learning models, machine learning, AI-ready sonar image
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:A major obstacle to the advancements of machine learning models in marine science, particularly in sonar imagery analysis, is the scarcity of AI-ready datasets. While there have been efforts to make AI-ready sonar image datasets publicly available, they suffer from limitations in terms of environment settings and scale. To bridge this gap, we introduce SeafloorAI, the first extensive AI-ready dataset for seafloor mapping across 5 geological layers, curated in collaboration with marine scientists. We further extend the dataset to SeafloorGenAI by incorporating the language component in order to facilitate the development of both vision- and language-capable machine learning models for sonar imagery. The dataset consists of 62 geo-distributed data surveys spanning 17,300 square kilometers, with 696K sonar images, 827K annotated segmentation masks, 696K detailed language descriptions and approximately 7M question-answer pairs. By making our data processing source code publicly available, we aim to engage the marine science community to enrich the data pool and inspire the machine learning community to develop more robust models. This collaborative approach will enhance the capabilities and applications of our datasets within both fields.
[CV-50] Aerial Flood Scene Classification Using Fine-Tuned Attention-based Architecture for Flood-Prone Countries in South Asia
链接: https://arxiv.org/abs/2411.00169
作者: Ibne Hassan,Aman Mujahid,Abdullah Al Hasib,Andalib Rahman Shagoto,Joyanta Jyoti Mondal,Meem Arafat Manab,Jannatun Noor
关键词-EN: South Asia experience, South Asian countries, flooding events regularly, catastrophic flooding events, South Asia
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Countries in South Asia experience many catastrophic flooding events regularly. Through image classification, it is possible to expedite search and rescue initiatives by classifying flood zones, including houses and humans. We create a new dataset collecting aerial imagery of flooding events across South Asian countries. For the classification, we propose a fine-tuned Compact Convolutional Transformer (CCT) based approach and some other cutting-edge transformer-based and Convolutional Neural Network-based architectures (CNN). We also implement the YOLOv8 object detection model and detect houses and humans within the imagery of our proposed dataset, and then compare the performance with our classification-based approach. Since the countries in South Asia have similar topography, housing structure, the color of flood water, and vegetation, this work can be more applicable to such a region as opposed to the rest of the world. The images are divided evenly into four classes: ‘flood’, ‘flood with domicile’, ‘flood with humans’, and ‘no flood’. After experimenting with our proposed dataset on our fine-tuned CCT model, which has a comparatively lower number of weight parameters than many other transformer-based architectures designed for computer vision, it exhibits an accuracy and macro average precision of 98.62% and 98.50%. The other transformer-based architectures that we implement are the Vision Transformer (ViT), Swin Transformer, and External Attention Transformer (EANet), which give an accuracy of 88.66%, 84.74%, and 66.56% respectively. We also implement DCECNN (Deep Custom Ensembled Convolutional Neural Network), which is a custom ensemble model that we create by combining MobileNet, InceptionV3, and EfficientNetB0, and we obtain an accuracy of 98.78%. The architectures we implement are fine-tuned to achieve optimal performance on our dataset.
[CV-51] A Recipe for Geometry-Aware 3D Mesh Transformers
链接: https://arxiv.org/abs/2411.00164
作者: Mohammad Farazi,Yalin Wang
关键词-EN: presents significant challenges, Utilizing patch-based transformers, unstructured geometric data, polygon meshes presents, meshes presents significant
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Utilizing patch-based transformers for unstructured geometric data such as polygon meshes presents significant challenges, primarily due to the absence of a canonical ordering and variations in input sizes. Prior approaches to handling 3D meshes and point clouds have either relied on computationally intensive node-level tokens for large objects or resorted to resampling to standardize patch size. Moreover, these methods generally lack a geometry-aware, stable Structural Embedding (SE), often depending on simplistic absolute SEs such as 3D coordinates, which compromise isometry invariance essential for tasks like semantic segmentation. In our study, we meticulously examine the various components of a geometry-aware 3D mesh transformer, from tokenization to structural encoding, assessing the contribution of each. Initially, we introduce a spectral-preserving tokenization rooted in algebraic multigrid methods. Subsequently, we detail an approach for embedding features at the patch level, accommodating patches with variable node counts. Through comparative analyses against a baseline model employing simple point-wise Multi-Layer Perceptrons (MLP), our research highlights critical insights: 1) the importance of structural and positional embeddings facilitated by heat diffusion in general 3D mesh transformers; 2) the effectiveness of novel components such as geodesic masking and feature interaction via cross-attention in enhancing learning; and 3) the superior performance and efficiency of our proposed methods in challenging segmentation and classification tasks.
[CV-52] Using Deep Neural Networks to Quantify Parking Dwell Time
链接: https://arxiv.org/abs/2411.00158
作者: Marcelo Eduardo Marques Ribas(1),Heloisa Benedet Mendes(1),Luiz Eduardo Soares de Oliveira(1),Luiz Antonio Zanlorensi(2),Paulo Ricardo Lisboa de Almeida(1) ((1) Department of Informatics - Federal University of Paraná, (2) DeepNeuronic)
关键词-EN: individual transportation solutions, smart cities, transportation solutions, individual car dwell, common practice
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Paper accepted to the 2024 International Conference on Machine Learning and Applications
点击查看摘要
Abstract:In smart cities, it is common practice to define a maximum length of stay for a given parking space to increase the space’s rotativity and discourage the usage of individual transportation solutions. However, automatically determining individual car dwell times from images faces challenges, such as images collected from low-resolution cameras, lighting variations, and weather effects. In this work, we propose a method that combines two deep neural networks to compute the dwell time of each car in a parking lot. The proposed method first defines the parking space status between occupied and empty using a deep classification network. Then, it uses a Siamese network to check if the parked car is the same as the previous image. Using an experimental protocol that focuses on a cross-dataset scenario, we show that if a perfect classifier is used, the proposed system generates 75% of perfect dwell time predictions, where the predicted value matched exactly the time the car stayed parked. Nevertheless, our experiments show a drop in prediction quality when a real-world classifier is used to predict the parking space statuses, reaching 49% of perfect predictions, showing that the proposed Siamese network is promising but impacted by the quality of the classifier used at the beginning of the pipeline.
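摘要描述的流水线由两步组成:分类网络判断车位占用状态,孪生网络判断是否仍是同一辆车。下面是把这两路逐帧输出汇总为停留时长的假设性示意(帧间隔等参数均为演示用假设,非论文实现):

```python
def dwell_times(statuses, same_car, frame_interval_s=60):
    """由逐帧输出估算每辆车的停留时长(秒)。
    statuses: 每帧车位是否被占用(分类网络输出);
    same_car: 每帧与上一帧是否为同一辆车(孪生网络输出,首帧忽略)。"""
    times, run = [], 0
    for i, occ in enumerate(statuses):
        if occ and run and same_car[i]:
            run += 1                           # 同一辆车继续停留
        elif occ:
            if run:                            # 车位被新车占据,结算上一辆
                times.append(run * frame_interval_s)
            run = 1                            # 新车进入
        elif run:                              # 车位变空,结算
            times.append(run * frame_interval_s)
            run = 0
    if run:                                    # 序列结束时仍在停留
        times.append(run * frame_interval_s)
    return times
```

摘要中报告的预测质量下降,正是因为真实分类器会在 statuses 中引入错误,使这类汇总逻辑误判车辆的进出时刻。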
[CV-53] NIMBA: Towards Robust and Principled Processing of Point Clouds With SSMs
链接: https://arxiv.org/abs/2411.00151
作者: Nursena Köprücü,Destiny Okpekpe,Antonio Orvieto
关键词-EN: large-scale deep learning, deep learning tasks, including text, dominant in large-scale, large-scale deep
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Transformers have become dominant in large-scale deep learning tasks across various domains, including text, 2D and 3D vision. However, the quadratic complexity of their attention mechanism limits their efficiency as the sequence length increases, particularly in high-resolution 3D data such as point clouds. Recently, state space models (SSMs) like Mamba have emerged as promising alternatives, offering linear complexity, scalability, and high performance in long-sequence tasks. The key challenge in the application of SSMs in this domain lies in reconciling the non-sequential structure of point clouds with the inherently directional (or bi-directional) order-dependent processing of recurrent models like Mamba. To achieve this, previous research proposed reorganizing point clouds along multiple directions or predetermined paths in 3D space, concatenating the results to produce a single 1D sequence capturing different views. In our work, we introduce a method to convert point clouds into 1D sequences that maintain 3D spatial structure with no need for data replication, allowing Mamba sequential processing to be applied effectively in an almost permutation-invariant manner. In contrast to other works, we found that our method does not require positional embeddings and allows for shorter sequence lengths while still achieving state-of-the-art results in ModelNet40 and ScanObjectNN datasets and surpassing Transformer-based models in both accuracy and efficiency.
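摘要提到此前的做法是沿预定路径把点云重排为 1D 序列;下面以 Z-order(Morton)曲线为例给出这类"按预定路径重排"方法的示意(注意这只是先前工作的代表性做法,并非 NIMBA 本身的映射):

```python
import numpy as np

def morton_order(points, bits=10):
    """将 (N, 3) 点云按 Z-order (Morton) 曲线排序,
    得到近似保留空间邻近性的 1D 序列。"""
    p = points - points.min(0)
    p = (p / (p.max() + 1e-9) * (2**bits - 1)).astype(np.int64)

    def spread(v):  # 在每个坐标的相邻比特之间插入两位空隙
        out = np.zeros_like(v)
        for b in range(bits):
            out |= ((v >> b) & 1) << (3 * b)
        return out

    codes = spread(p[:, 0]) | (spread(p[:, 1]) << 1) | (spread(p[:, 2]) << 2)
    return points[np.argsort(codes)]
```

这类固定路径依赖于点云的朝向与排布,NIMBA 的贡献正是在不复制数据的前提下构造近似置换不变的序列化方式。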
[CV-54] Self-Ensembling Gaussian Splatting for Few-shot Novel View Synthesis
链接: https://arxiv.org/abs/2411.00144
作者: Chen Zhao,Xuan Wang,Tong Zhang,Saqib Javed,Mathieu Salzmann
关键词-EN: Gaussian Splatting models, Gaussian Splatting, demonstrated remarkable effectiveness, Sigma, self-ensembling Gaussian Splatting
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注:
点击查看摘要
Abstract:3D Gaussian Splatting (3DGS) has demonstrated remarkable effectiveness for novel view synthesis (NVS). However, the 3DGS model tends to overfit when trained with sparse posed views, limiting its generalization capacity for broader pose variations. In this paper, we alleviate the overfitting problem by introducing a self-ensembling Gaussian Splatting (SE-GS) approach. We present two Gaussian Splatting models named the \mathbf{\Sigma}-model and the \mathbf{\Delta}-model. The \mathbf{\Sigma}-model serves as the primary model that generates novel-view images during inference. At the training stage, the \mathbf{\Sigma}-model is guided away from specific local optima by an uncertainty-aware perturbing strategy. We dynamically perturb the \mathbf{\Delta}-model based on the uncertainties of novel-view renderings across different training steps, resulting in diverse temporal models sampled from the Gaussian parameter space without additional training costs. The geometry of the \mathbf{\Sigma}-model is regularized by penalizing discrepancies between the \mathbf{\Sigma}-model and the temporal samples. Therefore, our SE-GS conducts an effective and efficient regularization across a large number of Gaussian Splatting models, resulting in a robust ensemble, the \mathbf{\Sigma}-model. Experimental results on the LLFF, Mip-NeRF360, DTU, and MVImgNet datasets show that our approach improves NVS quality with few-shot training views, outperforming existing state-of-the-art methods. The code is released at this https URL.
[CV-55] Muscles in Time: Learning to Understand Human Motion by Simulating Muscle Activations
链接: https://arxiv.org/abs/2411.00128
作者: David Schneider,Simon Reiß,Marco Kugler,Alexander Jaus,Kunyu Peng,Susanne Sutschet,M. Saquib Sarfraz,Sven Matthiesen,Rainer Stiefelhagen
关键词-EN: Exploring the intricate, intricate dynamics, dynamics between muscular, muscular and skeletal, skeletal structures
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Exploring the intricate dynamics between muscular and skeletal structures is pivotal for understanding human motion. This domain presents substantial challenges, primarily attributed to the intensive resources required for acquiring ground truth muscle activation data, resulting in a scarcity of datasets. In this work, we address this issue by establishing Muscles in Time (MinT), a large-scale synthetic muscle activation dataset. For the creation of MinT, we enriched existing motion capture datasets by incorporating muscle activation simulations derived from biomechanical human body models using the OpenSim platform, a common approach in biomechanics and human motion research. Starting from simple pose sequences, our pipeline enables us to extract detailed information about the timing of muscle activations within the human musculoskeletal system. Muscles in Time contains over nine hours of simulation data covering 227 subjects and 402 simulated muscle strands. We demonstrate the utility of this dataset by presenting results on neural network-based muscle activation estimation from human pose sequences with two different sequence-to-sequence architectures. Data and code are provided under this https URL.
[CV-56] PathoGen-X: A Cross-Modal Genomic Feature Trans-Align Network for Enhanced Survival Prediction from Histopathology Images
链接: https://arxiv.org/abs/2411.00749
作者: Akhila Krishna,Nikhil Cherian Kurian,Abhijeet Patil,Amruta Parulekar,Amit Sethi
关键词-EN: Accurate survival prediction, personalized cancer treatment, Accurate survival, essential for personalized, survival prediction
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Genomics (q-bio.GN); Tissues and Organs (q-bio.TO)
*备注:
点击查看摘要
Abstract:Accurate survival prediction is essential for personalized cancer treatment. However, genomic data - often a more powerful predictor than pathology data - is costly and inaccessible. We present the cross-modal genomic feature translation and alignment network for enhanced survival prediction from histopathology images (PathoGen-X). It is a deep learning framework that leverages both genomic and imaging data during training, relying solely on imaging data at testing. PathoGen-X employs transformer-based networks to align and translate image features into the genomic feature space, enhancing weaker imaging signals with stronger genomic signals. Unlike other methods, PathoGen-X translates and aligns features without projecting them to a shared latent space and requires fewer paired samples. Evaluated on TCGA-BRCA, TCGA-LUAD, and TCGA-GBM datasets, PathoGen-X demonstrates strong survival prediction performance, emphasizing the potential of enriched imaging models for accessible cancer prognosis.
[CV-57] A Graph Attention-Guided Diffusion Model for Liver Vessel Segmentation
链接: https://arxiv.org/abs/2411.00617
作者: Xiaotong Zhang,Alexander Broersen,Gonnie CM van Erp,Silvia L. Pintea,Jouke Dijkstra
关键词-EN: small liver vessel, liver vessel segmentation, liver vessel, Improving connectivity, challenging aspects
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: This work has been submitted to the IEEE for possible publication
点击查看摘要
Abstract:Improving connectivity and completeness are the most challenging aspects of small liver vessel segmentation. It is difficult for existing methods to obtain segmented liver vessel trees simultaneously with continuous geometry and detail in small vessels. We proposed a diffusion model-based method with a multi-scale graph attention guidance to break through the bottleneck to segment the liver vessels. Experiments show that the proposed method outperforms the other state-of-the-art methods used in this study on two public datasets of 3D-ircadb-01 and LiVS. Dice coefficient and Sensitivity are improved by at least 11.67% and 24.21% on 3D-ircadb-01 dataset, and are improved by at least 3.21% and 9.11% on LiVS dataset. Connectivity is also quantitatively evaluated in this study and our method performs best. The proposed method is reliable for small liver vessel segmentation.
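摘要中用于比较的 Dice 系数与灵敏度 (Sensitivity) 按如下定义计算(通用指标实现,便于理解所报告的提升幅度):

```python
import numpy as np

def dice(pred, gt):
    """Dice 系数 = 2|A∩B| / (|A| + |B|),pred/gt 为二值分割掩码。"""
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum())

def sensitivity(pred, gt):
    """灵敏度(召回率)= TP / (TP + FN),衡量真实血管被检出的比例。"""
    tp = np.logical_and(pred, gt).sum()
    return tp / gt.sum()
```

对细小血管而言灵敏度尤其关键:漏检少量体素就可能切断血管树的连通性。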
[CV-58] Tumor Location-weighted MRI-Report Contrastive Learning: A Framework for Improving the Explainability of Pediatric Brain Tumor Diagnosis
链接: https://arxiv.org/abs/2411.00609
作者: Sara Ketabi,Matthias W. Wagner,Cynthia Hawkins,Uri Tabori,Birgit Betina Ertl-Wagner,Farzad Khalvati
关键词-EN: convolutional neural networks, magnetic resonance imaging, neural networks, resonance imaging, convolutional neural
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Despite the promising performance of convolutional neural networks (CNNs) in brain tumor diagnosis from magnetic resonance imaging (MRI), their integration into the clinical workflow has been limited. That is mainly because the features contributing to a model’s prediction are unclear to radiologists and hence clinically irrelevant, i.e., the models lack explainability. As the invaluable sources of radiologists’ knowledge and expertise, radiology reports can be integrated with MRI in a contrastive learning (CL) framework, enabling learning from image-report associations, to improve CNN explainability. In this work, we train a multimodal CL architecture on 3D brain MRI scans and radiology reports to learn informative MRI representations. Furthermore, we integrate tumor location, salient to several brain tumor analysis tasks, into this framework to improve its generalizability. We then apply the learnt image representations to improve explainability and performance of genetic marker classification of pediatric Low-grade Glioma, the most prevalent brain tumor in children, as a downstream task. Our results indicate a Dice score of 31.1% between the model’s attention maps and manual tumor segmentation (as an explainability measure) with test classification performance of 87.7%, significantly outperforming the baselines. These enhancements can build trust in our model among radiologists, facilitating its integration into clinical practices for more efficient tumor diagnosis.
[CV-59] pcaGAN: Improving Posterior-Sampling cGANs via Principal Component Regularization NEURIPS2024
链接: https://arxiv.org/abs/2411.00605
作者: Matthew C. Bendel,Rizwan Ahmad,Philip Schniter
关键词-EN: ill-posed imaging inverse, imaging inverse problems, observed measurements, measurements and prior, prior knowledge
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: To appear at NeurIPS 2024
点击查看摘要
Abstract:In ill-posed imaging inverse problems, there can exist many hypotheses that fit both the observed measurements and prior knowledge of the true image. Rather than returning just one hypothesis of that image, posterior samplers aim to explore the full solution space by generating many probable hypotheses, which can later be used to quantify uncertainty or construct recoveries that appropriately navigate the perception/distortion trade-off. In this work, we propose a fast and accurate posterior-sampling conditional generative adversarial network (cGAN) that, through a novel form of regularization, aims for correctness in the posterior mean as well as the trace and K principal components of the posterior covariance matrix. Numerical experiments demonstrate that our method outperforms contemporary cGANs and diffusion models in imaging inverse problems like denoising, large-scale inpainting, and accelerated MRI recovery. The code for our model can be found here: this https URL.
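pcaGAN 的正则项针对后验均值、后验协方差的迹以及前 K 个主成分。下面演示如何从一组后验样本估计这三类统计量(仅演示统计量本身的计算,并非论文的正则化损失):

```python
import numpy as np

def posterior_cov_summary(samples, k):
    """从后验样本 (N, d) 估计均值、协方差迹与前 k 个主成分方向。"""
    mean = samples.mean(0)
    centered = samples - mean
    cov = centered.T @ centered / (len(samples) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh 返回升序特征值
    top = eigvecs[:, ::-1][:, :k]            # 取最大的 k 个主方向
    return mean, np.trace(cov), top
```

训练中让生成器的这些样本统计量逼近目标统计量,即可约束后验样本的整体分布形状,而不只是其均值。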
[CV-60] MAROON: A Framework for the Joint Characterization of Near-Field High-Resolution Radar and Optical Depth Imaging Techniques
链接: https://arxiv.org/abs/2411.00527
作者: Vanessa Wirth,Johanna Bräunig,Martin Vossiek,Tim Weyrich,Marc Stamminger
关键词-EN: robust computer-assisted tasks, Utilizing the complementary, autonomous driving, complementary strengths, strengths of wavelength-specific
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Utilizing the complementary strengths of wavelength-specific range or depth sensors is crucial for robust computer-assisted tasks such as autonomous driving. Despite this, there is still little research done at the intersection of optical depth sensors and radars operating close range, where the target is decimeters away from the sensors. Together with a growing interest in high-resolution imaging radars operating in the near field, the question arises how these sensors behave in comparison to their traditional optical counterparts. In this work, we take on the unique challenge of jointly characterizing depth imagers from both, the optical and radio-frequency domain using a multimodal spatial calibration. We collect data from four depth imagers, with three optical sensors of varying operation principle and an imaging radar. We provide a comprehensive evaluation of their depth measurements with respect to distinct object materials, geometries, and object-to-sensor distances. Specifically, we reveal scattering effects of partially transmissive materials and investigate the response of radio-frequency signals. All object measurements will be made public in form of a multimodal dataset, called MAROON.
[CV-61] SpineFM: Leveraging Foundation Models for Automatic Spine X-ray Segmentation
链接: https://arxiv.org/abs/2411.00326
作者: Samuel J. Simons,Bartłomiej W. Papież
关键词-EN: paper introduces SpineFM, lumbar spine radiographs, pipeline that achieves, paper introduces, vertebral bodies
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 4 pages, 3 figures, submitted to ISBI 2025
点击查看摘要
Abstract:This paper introduces SpineFM, a novel pipeline that achieves state-of-the-art performance in the automatic segmentation and identification of vertebral bodies in cervical and lumbar spine radiographs. SpineFM leverages the regular geometry of the spine, employing a novel inductive process to sequentially infer the location of each vertebra along the spinal column. Vertebrae are segmented using Medical-SAM-Adaptor, a robust foundation model that diverges from commonly used CNN-based models. We achieved outstanding results on two publicly available spine X-Ray datasets, with successful identification of 97.8% and 99.6% of annotated vertebrae, respectively. Our segmentations reached average Dice scores of 0.942 and 0.921 on the two datasets, surpassing previous state-of-the-art methods.
[CV-62] A Novel Breast Ultrasound Image Augmentation Method Using Advanced Neural Style Transfer: An Efficient and Explainable Approach
链接: https://arxiv.org/abs/2411.00254
作者: Lipismita Panigrahi,Prianka Rani Saha,Jurdana Masuma Iqrah,Sushil Prasad
关键词-EN: Clinical diagnosis, BUS images, recent era, breast malignancy, BUS
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Clinical diagnosis of breast malignancy (BM) is a challenging problem in the recent era. In particular, deep learning (DL) models have continued to offer important solutions for early BM diagnosis, but their performance suffers from overfitting due to the limited volume of breast ultrasound (BUS) image data. Further, large BUS datasets are difficult to manage due to privacy and legal concerns. Hence, image augmentation is a necessary and challenging step to improve the performance of the DL models. However, current DL-based augmentation models are inadequate and operate as black boxes, offering little information or justification about their suitability and efficacy. Additionally, pre- and post-augmentation require high-performance computational resources and time to produce the augmented images and evaluate model performance. Thus, this study aims to develop a novel efficient augmentation approach for BUS images with advanced neural style transfer (NST) and Explainable AI (XAI), harnessing GPU-based parallel infrastructure. We scale and distribute the training of the augmentation model across 8 GPUs using the Horovod framework on a DGX cluster, achieving a 5.09× speedup while maintaining the model’s accuracy. The proposed model is evaluated on 800 (348 benign and 452 malignant) BUS images and its performance is analyzed against other progressive techniques, using different quantitative analyses. The result indicates that the proposed approach can successfully augment the BUS images with 92.47% accuracy.
机器学习
[LG-0] Dimension-free Private Mean Estimation for Anisotropic Distributions
链接: https://arxiv.org/abs/2411.00775
作者: Yuval Dagan,Michael I. Jordan,Xuelin Yang,Lydia Zakynthinou,Nikita Zhivotovskiy
关键词-EN: Sigma, high-dimensional mean estimation, mathrm, differentially private, private
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:We present differentially private algorithms for high-dimensional mean estimation. Previous private estimators on distributions over \mathbb{R}^d suffer from a curse of dimensionality, as they require \Omega(d^{1/2}) samples to achieve non-trivial error, even in cases where O(1) samples suffice without privacy. This rate is unavoidable when the distribution is isotropic, namely, when the covariance is a multiple of the identity matrix, or when accuracy is measured with respect to the affine-invariant Mahalanobis distance. Yet, real-world data is often highly anisotropic, with signals concentrated on a small number of principal components. We develop estimators that are appropriate for such signals – our estimators are (\varepsilon,\delta)-differentially private and have sample complexity that is dimension-independent for anisotropic subgaussian distributions. Given n samples from a distribution with known covariance-proxy \Sigma and unknown mean \mu, we present an estimator \hat{\mu} that achieves error \|\hat{\mu}-\mu\|_2 \leq \alpha, as long as n \gtrsim \mathrm{tr}(\Sigma)/\alpha^2 + \mathrm{tr}(\Sigma^{1/2})/(\alpha\varepsilon). In particular, when \pmb{\sigma}^2=(\sigma_1^2, \ldots, \sigma_d^2) are the singular values of \Sigma, we have \mathrm{tr}(\Sigma)=\|\pmb{\sigma}\|_2^2 and \mathrm{tr}(\Sigma^{1/2})=\|\pmb{\sigma}\|_1, and hence our bound avoids dimension-dependence when the signal is concentrated in a few principal components. We show that this is the optimal sample complexity for this task up to logarithmic factors. Moreover, for the case of unknown covariance, we present an algorithm whose sample complexity has improved dependence on the dimension, from d^{1/2} to d^{1/4}.
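摘要给出的样本复杂度界为 n ≳ tr(Σ)/α² + tr(Σ^{1/2})/(α·ε),其中 σ 满足 Σ 的奇异值为 σ_i²,故 tr(Σ)=‖σ‖₂²、tr(Σ^{1/2})=‖σ‖₁。下面直接按该式(略去常数)计算,并验证"信号集中在少数主成分时界与维度无关"这一点:

```python
import numpy as np

def sample_bound(sigma, alpha, eps):
    """按摘要中的界(略去常数)计算:tr(Σ)/α² + tr(Σ^{1/2})/(α·ε)。
    sigma: 向量 σ,Σ 的奇异值为 σ_i²。"""
    sigma = np.asarray(sigma, float)
    tr_cov = np.sum(sigma ** 2)       # tr(Σ) = ‖σ‖₂²
    tr_sqrt = np.sum(np.abs(sigma))   # tr(Σ^{1/2}) = ‖σ‖₁
    return tr_cov / alpha**2 + tr_sqrt / (alpha * eps)
```

例如将谱从 d=1 补零扩展到 d=1000,界保持不变;而各向同性谱(全部 σ_i 相同)下两项都随 d 线性增长,对应摘要所述不可避免的维度依赖。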
[LG-1] Minibatch Optimal Transport and Perplexity Bound Estimation in Discrete Flow Matching
链接: https://arxiv.org/abs/2411.00759
作者: Etrit Haxholli,Yeti Z. Gürbüz,Oğul Can,Eli Waxman
关键词-EN: Outperforming autoregressive models, categorical data distributions, Outperforming autoregressive, remains challenging, categorical data
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Outperforming autoregressive models on categorical data distributions, such as textual data, remains challenging for continuous diffusion and flow models. Discrete flow matching, a recent framework for modeling categorical data, has shown competitive performance with autoregressive models. Despite its similarities with continuous flow matching, the rectification strategy applied in the continuous version does not directly extend to the discrete one due to the inherent stochasticity of discrete paths. This limitation necessitates exploring alternative methods to minimize state transitions during generation. To address this, we propose a dynamic-optimal-transport-like minimization objective for discrete flows with convex interpolants and derive its equivalent Kantorovich formulation. The latter defines transport cost solely in terms of inter-state similarity and is optimized using a minibatch strategy. Another limitation we address in the discrete flow framework is model evaluation. Unlike continuous flows, wherein the instantaneous change of variables enables density estimation, discrete models lack a similar mechanism due to the inherent non-determinism and discontinuity of their paths. To alleviate this issue, we propose an upper bound on the perplexity of discrete flow models, enabling performance evaluation and comparison with other methods.
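摘要中的 Kantorovich 形式在小批量等权经验分布上退化为一一匹配问题:在状态间相似度代价下求最小总代价的配对。下面用穷举置换给出一个仅适用于极小 batch 的示意(论文采用的是小批量优化策略,此处只为说明目标本身):

```python
import itertools
import numpy as np

def minibatch_ot_pairs(x0, x1, cost):
    """小批量最优传输配对:穷举置换,返回最小总代价的一一匹配。
    cost(a, b) 给出源/目标状态之间的相似度代价。"""
    n = len(x0)
    C = np.array([[cost(a, b) for b in x1] for a in x0])
    best_perm, best_cost = None, float("inf")
    for perm in itertools.permutations(range(n)):
        c = sum(C[i, perm[i]] for i in range(n))
        if c < best_cost:
            best_perm, best_cost = perm, c
    return list(enumerate(best_perm)), best_cost
```

实际训练中会用多项式时间的匹配或近似算法替代穷举;配对越"近",生成路径中的状态跳变就越少,这正是该最小化目标的动机。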
[LG-2] Hierarchical Transformer for Electrocardiogram Diagnosis
链接: https://arxiv.org/abs/2411.00755
作者: Xiaoya Tang,Jake Berquist,Benjamin A. Steinberg,Tolga Tasdizen
关键词-EN: prominent in NLP, NLP and computer, ECG signal analysis, originally prominent, computer vision
类目: Machine Learning (cs.LG)
*备注: 5 pages,3 figures,under review by ISBI 2025
点击查看摘要
Abstract:Transformers, originally prominent in NLP and computer vision, are now being adapted for ECG signal analysis. This paper introduces a novel hierarchical transformer architecture that segments the model into multiple stages by assessing the spatial size of the embeddings, thus eliminating the need for additional downsampling strategies or complex attention designs. A classification token aggregates information across feature scales, facilitating interactions between different stages of the transformer. By utilizing depth-wise convolutions in a six-layer convolutional encoder, our approach preserves the relationships between different ECG leads. Moreover, an attention gate mechanism learns associations among the leads prior to classification. This model adapts flexibly to various embedding networks and input sizes while enhancing the interpretability of transformers in ECG signal analysis.
[LG-3] Private Augmentation-Robust and Task-Agnostic Data Valuation Approach for Data Marketplace
链接: https://arxiv.org/abs/2411.00745
作者: Tayyebeh Jahani-Nezhad,Parsa Moradi,Mohammad Ali Maddah-Ali,Giuseppe Caire
关键词-EN: Evaluating datasets, buyer, data, purchase valuable data, critical challenge
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:
点击查看摘要
Abstract:Evaluating datasets in data marketplaces, where the buyer aims to purchase valuable data, is a critical challenge. In this paper, we introduce an innovative task-agnostic data valuation method called PriArTa, which computes the distance between the distribution of the buyer’s existing dataset and the seller’s dataset, allowing the buyer to determine how effectively the new data can enhance its dataset. PriArTa is communication-efficient, enabling the buyer to evaluate datasets without needing access to the entire dataset from each seller. Instead, the buyer requests that sellers perform specific preprocessing on their data and then send back the results. Using this information and a scoring metric, the buyer can evaluate the dataset. The preprocessing is designed to allow the buyer to compute the score while preserving the privacy of each seller’s dataset, mitigating the risk of information leakage before the purchase. A key feature of PriArTa is its robustness to common data transformations, ensuring consistent value assessment and reducing the risk of purchasing redundant data. The effectiveness of PriArTa is demonstrated through experiments on real-world image datasets, showing its ability to perform privacy-preserving, augmentation-robust data valuation in data marketplaces.
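A highly simplified sketch of the workflow's shape (this is not the actual PriArTa protocol: the real preprocessing would use mechanisms with formal privacy guarantees, and the scoring metric differs; here sellers release only noisy per-feature means as an illustrative placeholder):

```python
import math
import random

def summarize(dataset, noise=0.1, rng=random.Random(0)):
    """Seller-side preprocessing (illustrative stand-in, not PriArTa itself):
    release only noisy per-feature means, never the raw records.
    Fixed seed for reproducibility."""
    d = len(dataset[0])
    means = [sum(row[k] for row in dataset) / len(dataset) for k in range(d)]
    return [m + rng.gauss(0, noise) for m in means]

def score(buyer_summary, seller_summary):
    """Buyer-side score: distance between released summaries. A larger score
    suggests the seller's data covers regions the buyer's dataset lacks."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(buyer_summary, seller_summary)))

buyer = [[0.0, 0.0], [0.2, 0.1]]
similar = [[0.1, 0.0], [0.1, 0.1]]   # redundant with the buyer's data
novel = [[5.0, 5.0], [5.2, 4.9]]     # covers a new region
b = summarize(buyer)
print(score(b, summarize(similar)) < score(b, summarize(novel)))
```

The buyer never sees the sellers' records, only the preprocessed summaries, which is the communication-efficiency and privacy point the abstract makes.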
[LG-4] Modern Efficient and Differentiable Transport Equation Models using JAX: Applications to Population Balance Equations
链接: https://arxiv.org/abs/2411.00742
作者: Mohammed Alsubeihi,Arthur Jessop,Ben Moseley,Cláudio P. Fonte,Ashwin Kumar Rajagopalan
关键词-EN: Population balance equation, Population balance, far-reaching implications, automate many engineering, engineering processes
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Population balance equation (PBE) models have potential to automate many engineering processes with far-reaching implications. In the pharmaceutical sector, crystallization model-based design can contribute to shortening excessive drug development timelines. Even so, two major barriers, typical of most transport equations, not just PBEs, have limited this potential. Notably, the time taken to compute a solution to these models with representative accuracy is frequently limiting. Likewise, the model construction process is often tedious and wastes valuable time, owing to the reliance on human expertise to guess constituent models from empirical data. Hybrid models promise to overcome both barriers through tight integration of neural networks with physical PBE models. Towards eliminating experimental guesswork, hybrid models facilitate determining physical relationships from data, also known as ‘discovering physics’. Here, we aim to prepare for planned Scientific Machine Learning (SciML) integration through a contemporary implementation of an existing PBE algorithm, one with computational efficiency and differentiability at the forefront. To accomplish this, we utilized JAX, a cutting-edge library for accelerated computing. We showcase the speed benefits of this modern take on PBE modelling by benchmarking our solver to others we prepared using older, more widespread software. Primarily among these software tools is the ubiquitous NumPy, where we show JAX achieves up to 300x relative acceleration in PBE simulations. Our solver is also fully differentiable, which we demonstrate is the only feasible option for integrating learnable data-driven models at scale. We show that differentiability can be 40x faster for optimizing larger models than conventional approaches, which represents the key to neural network integration for physics discovery in later work.
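To ground what one solver step of a transport-type PBE looks like, here is a plain-Python explicit upwind step for the growth (advection) term; the paper's contribution is implementing such steps in JAX so that whole simulations are accelerated and end-to-end differentiable, which this sketch deliberately omits:

```python
def upwind_step(n, G, dx, dt):
    """One explicit upwind step for the growth term of a 1-D population balance,
    dn/dt + G * dn/dx = 0 with constant growth rate G > 0, on a uniform grid.
    Zero inflow is assumed at the left boundary. Illustrative plain-Python
    analogue of a single solver step."""
    c = G * dt / dx  # CFL number; the explicit scheme is stable for c <= 1
    assert 0 < c <= 1
    return [n[i] - c * (n[i] - n[i - 1]) if i > 0 else n[0] * (1 - c)
            for i in range(len(n))]

n0 = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0]      # initial number density on the grid
n1 = upwind_step(n0, G=1.0, dx=1.0, dt=1.0)  # c = 1: exact one-cell shift right
print(n1)
```

With c = 1 the scheme transports the distribution exactly one cell per step; for c < 1 it introduces the numerical diffusion typical of first-order upwind methods.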
[LG-5] Exploring Multi-Modality Dynamics: Insights and Challenges in Multimodal Fusion for Biomedical Tasks
链接: https://arxiv.org/abs/2411.00725
作者: Laura Wenderoth
关键词-EN: proposed by Han, paper investigates, Han, dynamics approach proposed, dynamics
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:This paper investigates the MM dynamics approach proposed by Han et al. (2022) for multi-modal fusion in biomedical classification tasks. The MM dynamics algorithm integrates feature-level and modality-level informativeness to dynamically fuse modalities for improved classification performance. However, our analysis reveals several limitations and challenges in replicating and extending the results of MM dynamics. We found that feature informativeness improves performance and explainability, while modality informativeness does not provide significant advantages and can lead to performance degradation. Based on these results, we have extended feature informativeness to image data, resulting in the development of Image MM dynamics. Although this approach showed promising qualitative results, it did not outperform baseline methods quantitatively.
[LG-6] Token-level Proximal Policy Optimization for Query Generation
链接: https://arxiv.org/abs/2411.00722
作者: Yichen Ouyang,Lu Wang,Fangkai Yang,Pu Zhao,Chenghua Huang,Jianfeng Liu,Bochen Pang,Yaming Yang,Yuefeng Zhan,Hao Sun,Qingwei Lin,Saravan Rajmohan,Weiwei Deng,Dongmei Zhang,Feng Sun,Qi Zhang
关键词-EN: Large Language Models, leverage Large Language, Proximal Policy Optimization, Token-level Proximal Policy, recommendation systems
类目: Machine Learning (cs.LG)
*备注: 10 pages
点击查看摘要
Abstract:Query generation is a critical task for web search engines (e.g. Google, Bing) and recommendation systems. Recently, state-of-the-art query generation methods leverage Large Language Models (LLMs) for their strong capabilities in context understanding and text generation. However, they still face challenges in generating high-quality queries in terms of inferring user intent based on their web search interaction history. In this paper, we propose Token-level Proximal Policy Optimization (TPPO), a novel approach designed to empower LLMs to perform better in query generation through fine-tuning. TPPO is based on the Reinforcement Learning from AI Feedback (RLAIF) paradigm, consisting of a token-level reward model and a token-level proximal policy optimization module to address the sparse reward challenge in traditional RLAIF frameworks. To evaluate the effectiveness and robustness of TPPO, we conducted experiments on both an open-source dataset and an industrial dataset collected from a globally-used search engine. The experimental results demonstrate that TPPO significantly improves the performance of query generation for LLMs and outperforms its existing competitors.
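The sparse-reward problem TPPO targets can be seen in a toy return computation: with a sequence-level reward, only the final token carries direct signal, while a token-level reward model (a hard-coded stub below; the paper trains one) gives every position feedback:

```python
def discounted_returns(rewards, gamma=0.99):
    """Return-to-go at each token position, the quantity a token-level
    policy-optimization step would use as its learning signal."""
    out, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]

# Sparse (sequence-level) reward: only the final token is credited directly,
# and earlier tokens see it only through discounting.
sparse = [0.0, 0.0, 0.0, 1.0]
# Token-level rewards (illustrative stub for a trained token-level reward model):
# every token position receives a dense signal.
dense = [0.2, 0.3, 0.2, 0.3]

print(discounted_returns(sparse))
print(discounted_returns(dense))
```

The dense variant spreads credit across the sequence, which is the intuition behind replacing a single sequence-level reward with a token-level reward model.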
[LG-7] PedSleepMAE: Generative Model for Multimodal Pediatric Sleep Signals
链接: https://arxiv.org/abs/2411.00718
作者: Saurav R. Pandey,Aaqib Saeed,Harlin Lee
关键词-EN: Pediatric sleep, health informatics, pediatric sleep signals, overlooked area, area in health
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Pediatric sleep is an important but often overlooked area in health informatics. We present PedSleepMAE, a generative model that fully leverages multimodal pediatric sleep signals including multichannel EEGs, respiratory signals, EOGs and EMG. This masked autoencoder-based model performs comparably to supervised learning models in sleep scoring and in the detection of apnea, hypopnea, EEG arousal and oxygen desaturation. Its embeddings are also shown to capture subtle differences in sleep signals coming from a rare genetic disorder. Furthermore, PedSleepMAE generates realistic signals that can be used for sleep segment retrieval, outlier detection, and missing channel imputation. This is the first general-purpose generative model trained on multiple types of pediatric sleep signals.
[LG-8] Wasserstein Flow Matching: Generative modeling over families of distributions
链接: https://arxiv.org/abs/2411.00698
作者: Doron Haviv,Aram-Alexandre Pooladian,Dana Pe’er,Brandon Amos
关键词-EN: single source distribution, single target distribution, simple probability flows, modeling typically concerns, single source
类目: Machine Learning (cs.LG)
*备注: 24 pages, 10 figures
点击查看摘要
Abstract:Generative modeling typically concerns the transport of a single source distribution to a single target distribution by learning (i.e., regressing onto) simple probability flows. However, in modern data-driven fields such as computer graphics and single-cell genomics, samples (say, point-clouds) from datasets can themselves be viewed as distributions (as, say, discrete measures). In these settings, the standard generative modeling paradigm of flow matching would ignore the relevant geometry of the samples. To remedy this, we propose Wasserstein flow matching (WFM), which appropriately lifts flow matching onto families of distributions by appealing to the Riemannian nature of the Wasserstein geometry. Our algorithm leverages theoretical and computational advances in (entropic) optimal transport, as well as the attention mechanism in our neural network architecture. We present two novel algorithmic contributions. First, we demonstrate how to perform generative modeling over Gaussian distributions, where we generate representations of granular cell states from single-cell genomics data. Secondly, we show that WFM can learn flows between high-dimensional and variable sized point-clouds and synthesize cellular microenvironments from spatial transcriptomics datasets. Code is available at [WassersteinFlowMatching](this https URL).
[LG-9] Explainable few-shot learning workflow for detecting invasive and exotic tree species
链接: https://arxiv.org/abs/2411.00684
作者: Caroline M. Gevaert,Alexandra Aguiar Pedro,Ou Ku,Hao Cheng,Pranav Chandramouli,Farzaneh Dadrass Javan,Francesco Nattino,Sonja Georgievska
关键词-EN: Deep Learning methods, extensive labeled datasets, Deep Learning, assess their performance, methods are notorious
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Deep Learning methods are notorious for relying on extensive labeled datasets to train and assess their performance. This can cause difficulties in practical situations where models should be trained for new applications for which very little data is available. While few-shot learning algorithms can address the first problem, they still lack sufficient explanations for the results. This research presents a workflow that tackles both challenges by proposing an explainable few-shot learning workflow for detecting invasive and exotic tree species in the Atlantic Forest of Brazil using Unmanned Aerial Vehicle (UAV) images. By integrating a Siamese network with explainable AI (XAI), the workflow enables the classification of tree species with minimal labeled data while providing visual, case-based explanations for the predictions. Results demonstrate the effectiveness of the proposed workflow in identifying new tree species, even in data-scarce conditions. With a lightweight backbone, e.g., MobileNet, it achieves a F1-score of 0.86 in 3-shot learning, outperforming a shallow CNN. A set of explanation metrics, i.e., correctness, continuity, and contrastivity, accompanied by visual cases, provide further insights about the prediction results. This approach opens new avenues for using AI and UAVs in forest management and biodiversity conservation, particularly concerning rare or under-studied species.
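The Siamese few-shot classification step can be sketched as nearest-prototype matching in embedding space (in the paper the embedding is a trained network such as MobileNet; here it is the identity, purely for illustration):

```python
import math

def embed(x):
    """Stand-in for a trained Siamese/MobileNet embedding network
    (assumption: identity map, used here only to keep the sketch runnable)."""
    return x

def prototype_classify(query, support):
    """3-shot nearest-prototype classification: average the support embeddings
    per species into a prototype, then assign the query to the closest one."""
    protos = {}
    for label, shots in support.items():
        vecs = [embed(s) for s in shots]
        protos[label] = [sum(v[k] for v in vecs) / len(vecs)
                         for k in range(len(vecs[0]))]
    return min(protos, key=lambda l: math.dist(embed(query), protos[l]))

# Three labeled UAV-image embeddings (3-shot) per hypothetical species.
support = {
    "invasive": [[1.0, 0.1], [0.9, 0.0], [1.1, 0.2]],
    "native":   [[0.0, 1.0], [0.1, 0.9], [0.0, 1.1]],
}
print(prototype_classify([0.95, 0.05], support))
```

The case-based explanations in the workflow come from the same machinery: the nearest support examples that determined the prototype can be shown to the user as visual evidence.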
[LG-10] Rethinking Node Representation Interpretation through Relation Coherence
链接: https://arxiv.org/abs/2411.00653
作者: Ying-Chun Lin,Jennifer Neville,Cassiano Becker,Purvanshi Metha,Nabiha Asghar,Vipul Agarwal
关键词-EN: Understanding node representations, Understanding node, uncovering biases, crucial for uncovering, building trust
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Understanding node representations in graph-based models is crucial for uncovering biases, diagnosing errors, and building trust in model decisions. However, previous work on explainable AI for node representations has primarily emphasized explanations (reasons for model predictions) rather than interpretations (mapping representations to understandable concepts). Furthermore, the limited research that focuses on interpretation lacks validation, and thus the reliability of such methods is unclear. We address this gap by proposing a novel interpretation method, Node Coherence Rate for Representation Interpretation (NCI), which quantifies how well different node relations are captured in node representations. We also propose a novel method (IME) to evaluate the accuracy of different interpretation methods. Our experimental results demonstrate that NCI reduces the error of the previous best approach by an average of 39%. We then apply NCI to derive insights about the node representations produced by several graph-based methods and assess their quality in unsupervised settings.
[LG-11] Variational Neural Stochastic Differential Equations with Change Points
链接: https://arxiv.org/abs/2411.00635
作者: Yousef El-Laham,Zhongchang Sun,Haibei Zhu,Tucker Balch,Svitlana Vyetrenko
关键词-EN: stochastic differential equations, neural stochastic differential, differential equations, explore modeling change, neural SDEs
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:In this work, we explore modeling change points in time-series data using neural stochastic differential equations (neural SDEs). We propose a novel model formulation and training procedure based on the variational autoencoder (VAE) framework for modeling time-series as a neural SDE. Unlike existing algorithms training neural SDEs as VAEs, our proposed algorithm only necessitates a Gaussian prior of the initial state of the latent stochastic process, rather than a Wiener process prior on the entire latent stochastic process. We develop two methodologies for modeling and estimating change points in time-series data with distribution shifts. Our iterative algorithm alternates between updating neural SDE parameters and updating the change points based on either a maximum likelihood-based approach or a change point detection algorithm using the sequential likelihood ratio test. We provide a theoretical analysis of this proposed change point detection scheme. Finally, we present an empirical evaluation that demonstrates the expressive power of our proposed model, showing that it can effectively model both classical parametric SDEs and some real datasets with distribution shifts.
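The likelihood-ratio idea behind the change-point update can be illustrated with a single Gaussian mean-shift scan (unit variance assumed; the paper's sequential test and the neural SDE machinery are omitted):

```python
def best_change_point(xs):
    """Scan all split points and pick the one maximizing the gain in Gaussian
    log-likelihood from fitting two segment means instead of one (unit variance
    assumed, so the gain is the reduction in sum of squared errors). A
    simplified analogue of the likelihood-ratio-based change-point step."""
    def sse(seg):
        m = sum(seg) / len(seg)
        return sum((x - m) ** 2 for x in seg)
    total = sse(xs)
    best_k, best_gain = None, 0.0
    for k in range(1, len(xs)):
        gain = total - (sse(xs[:k]) + sse(xs[k:]))
        if gain > best_gain:
            best_k, best_gain = k, gain
    return best_k

xs = [0.0, 0.1, -0.1, 0.05, 3.0, 3.1, 2.9, 3.05]  # distribution shift mid-series
print(best_change_point(xs))
```

In the paper this test alternates with updates to the neural SDE parameters; the sketch shows only the detection half of that loop.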
[LG-12] Toward Automated Algorithm Design: A Survey and Practical Guide to Meta-Black-Box-Optimization
链接: https://arxiv.org/abs/2411.00625
作者: Zeyuan Ma,Hongshu Guo,Yue-Jiao Gong,Jun Zhang,Kay Chen Tan
关键词-EN: incorporates Meta-learning approaches, Evolutionary Computation, incorporates Meta-learning, Meta-learning approaches, assist automated algorithm
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:In this survey, we introduce Meta-Black-Box-Optimization (MetaBBO) as an emerging avenue within the Evolutionary Computation (EC) community, which incorporates Meta-learning approaches to assist automated algorithm design. Despite the success of MetaBBO, the current literature provides insufficient summaries of its key aspects and lacks practical guidance for implementation. To bridge this gap, we offer a comprehensive review of recent advances in MetaBBO, providing an in-depth examination of its key developments. We begin with a unified definition of the MetaBBO paradigm, followed by a systematic taxonomy of various algorithm design tasks, including algorithm selection, algorithm configuration, solution manipulation, and algorithm generation. Further, we conceptually summarize different learning methodologies behind current MetaBBO works, including reinforcement learning, supervised learning, neuroevolution, and in-context learning with Large Language Models. A comprehensive evaluation of the latest representative MetaBBO methods is then carried out, alongside an experimental analysis of their optimization performance, computational efficiency, and generalization ability. Based on the evaluation results, we meticulously identify a set of core designs that enhance the generalization and learning effectiveness of MetaBBO. Finally, we outline the vision for the field by providing insight into the latest trends and potential future directions. Relevant literature will be continuously collected and updated at this https URL.
[LG-13] Apriori_Goal algorithm for constructing association rules for a database with a given classification
链接: https://arxiv.org/abs/2411.00615
作者: Vladimir Billig
关键词-EN: constructing association rules, algorithm, association rules, database, Apriori
类目: Databases (cs.DB); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:An efficient algorithm, Apriori_Goal, is proposed for constructing association rules for a relational database with a given classification. The algorithm’s features are related to the specifics of the database and the method of encoding its records. The algorithm proposes five criteria that characterize the quality of the rules being constructed. Different criteria are also proposed for filtering the sets used when constructing association rules. The proposed method of encoding records allows for an efficient implementation of the basic operation underlying the computation of rule characteristics. The algorithm works with a relational database, where the columns can be of different types, both continuous and discrete. Among the columns, a target discrete column is distinguished, which defines the classification of the records. This allows the original database to be divided into n subsets according to the number of categories of the target parameter. A classical example of such databases is medical databases, where the target parameter is the diagnosis established by doctors. A preprocessor, which is an important part of the algorithm, converts the properties of the objects represented by the columns of the original database into binary properties and encodes each record as a single integer. In addition to saving memory, the proposed format allows the complete preservation of information about the binary properties representing the original record. More importantly, the computationally intensive operations on records, required for calculating rule characteristics, are performed almost instantly in this format using a pair of logical operations on integers. 
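The integer encoding the preprocessor produces makes the core operation of rule evaluation a single bitwise AND per record. A minimal sketch with hypothetical medical properties (the property names and records are invented for illustration):

```python
# Each binary property is assigned one bit; a record packs all of its
# properties into a single integer.
PROPS = {"age>50": 1 << 0, "smoker": 1 << 1, "bp_high": 1 << 2}

def encode(props):
    """Pack a record's binary properties into one integer."""
    code = 0
    for p in props:
        code |= PROPS[p]
    return code

def support(records, pattern):
    """Count records containing ALL properties in `pattern`: one AND and one
    comparison per record, the near-instant operation the abstract describes."""
    mask = encode(pattern)
    return sum(1 for r in records if r & mask == mask)

records = [encode(["age>50", "smoker"]),
           encode(["age>50", "smoker", "bp_high"]),
           encode(["smoker"])]
print(support(records, ["age>50", "smoker"]))  # 2
```

Support counts like this are the basic quantity from which the algorithm's five rule-quality criteria are computed per target class.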
[LG-14] Fast and scalable Wasserstein-1 neural optimal transport solver for single-cell perturbation prediction
链接: https://arxiv.org/abs/2411.00614
作者: Yanshuo Chen,Zhengmian Hu,Wei Chen,Heng Huang
关键词-EN: responses requires mapping, single-cell data distributions, Predicting single-cell perturbation, perturbation responses requires, unpaired single-cell data
类目: Machine Learning (cs.LG); Genomics (q-bio.GN)
*备注:
点击查看摘要
Abstract:Predicting single-cell perturbation responses requires mapping between two unpaired single-cell data distributions. Optimal transport (OT) theory provides a principled framework for constructing such mappings by minimizing transport cost. Recently, Wasserstein-2 (W_2) neural optimal transport solvers (e.g., CellOT) have been employed for this prediction task. However, W_2 OT relies on the general Kantorovich dual formulation, which involves optimizing over two conjugate functions, leading to a complex min-max optimization problem that converges slowly. To address these challenges, we propose a novel solver based on the Wasserstein-1 (W_1) dual formulation. Unlike W_2, the W_1 dual simplifies the optimization to a maximization problem over a single 1-Lipschitz function, thus eliminating the need for time-consuming min-max optimization. While solving the W_1 dual only reveals the transport direction and does not directly provide a unique optimal transport map, we incorporate an additional step using adversarial training to determine an appropriate transport step size, effectively recovering the transport map. Our experiments demonstrate that the proposed W_1 neural optimal transport solver can mimic the W_2 OT solvers in finding a unique and "monotonic" map on 2D datasets. Moreover, the W_1 OT solver achieves performance on par with or surpasses W_2 OT solvers on real single-cell perturbation datasets. Furthermore, we show that the W_1 OT solver achieves a 25-45x speedup, scales better on high-dimensional transportation tasks, and can be directly applied to single-cell RNA-seq datasets with highly variable genes. Our implementation and experiments are open-sourced at this https URL.
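Some intuition for the W_1 objective: in one dimension the optimal coupling is the monotone (sorted) pairing, so W_1 between equal-size empirical samples has a closed form. The paper's solver handles the high-dimensional case, where only the 1-Lipschitz dual is tractable; this sketch covers the 1-D case only:

```python
def w1_empirical(xs, ys):
    """Wasserstein-1 distance between two equal-size empirical 1-D samples.
    In 1-D the optimal transport plan is the monotone (sorted) pairing, so
    W_1 is simply the mean absolute difference of the sorted samples."""
    xs, ys = sorted(xs), sorted(ys)
    return sum(abs(a - b) for a, b in zip(xs, ys)) / len(xs)

# Toy "perturbation": the perturbed population is the control shifted by +1.
control = [0.0, 1.0, 2.0, 3.0]
perturbed = [1.0, 2.0, 3.0, 4.0]
print(w1_empirical(control, perturbed))  # 1.0, the shift magnitude
```

The monotone pairing here is also why the paper can speak of recovering a "monotonic" map; in higher dimensions no such closed form exists, which is where the 1-Lipschitz dual and the adversarial step-size training come in.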
[LG-15] Provably and Practically Efficient Adversarial Imitation Learning with General Function Approximation NEURIPS’24 NEURIPS2024
链接: https://arxiv.org/abs/2411.00610
作者: Tian Xu,Zhilong Zhang,Ruishuo Chen,Yihao Sun,Yang Yu
关键词-EN: adversarial imitation learning, garnered significant practical, significant practical success, practical success powered, neural network approximation
类目: Machine Learning (cs.LG)
*备注: Published in NeurIPS 2024: Tian Xu, Zhilong Zhang, Ruishuo Chen, Yihao Sun, Yang Yu. Provably and practically efficient adversarial imitation learning with general function approximation. In: Advances in Neural Information Processing Systems 38 (NeurIPS’24), Vancouver, Canada, 2024
点击查看摘要
Abstract:As a prominent category of imitation learning methods, adversarial imitation learning (AIL) has garnered significant practical success powered by neural network approximation. However, existing theoretical studies on AIL are primarily limited to simplified scenarios such as tabular and linear function approximation and involve complex algorithmic designs that hinder practical implementation, highlighting a gap between theory and practice. In this paper, we explore the theoretical underpinnings of online AIL with general function approximation. We introduce a new method called optimization-based AIL (OPT-AIL), which centers on performing online optimization for reward functions and optimism-regularized Bellman error minimization for Q-value functions. Theoretically, we prove that OPT-AIL achieves polynomial expert sample complexity and interaction complexity for learning near-expert policies. To our best knowledge, OPT-AIL is the first provably efficient AIL method with general function approximation. Practically, OPT-AIL only requires the approximate optimization of two objectives, thereby facilitating practical implementation. Empirical studies demonstrate that OPT-AIL outperforms previous state-of-the-art deep AIL methods in several challenging tasks.
[LG-16] Improving self-training under distribution shifts via anchored confidence with theoretical guarantees NEURIPS2024
链接: https://arxiv.org/abs/2411.00586
作者: Taejong Joo,Diego Klabjan
关键词-EN: actual accuracy, falls short, increased discrepancy, discrepancy between prediction, prediction confidence
类目: Machine Learning (cs.LG)
*备注: NeurIPS 2024
点击查看摘要
Abstract:Self-training often falls short under distribution shifts due to an increased discrepancy between prediction confidence and actual accuracy. This typically necessitates computationally demanding methods such as neighborhood or ensemble-based label corrections. Drawing inspiration from insights on early learning regularization, we develop a principled method to improve self-training under distribution shifts based on temporal consistency. Specifically, we build an uncertainty-aware temporal ensemble with a simple relative thresholding. Then, this ensemble smooths noisy pseudo labels to promote selective temporal consistency. We show that our temporal ensemble is asymptotically correct and our label smoothing technique can reduce the optimality gap of self-training. Our extensive experiments validate that our approach consistently improves self-training performances by 8% to 16% across diverse distribution shift scenarios without a computational overhead. Besides, our method exhibits attractive properties, such as improved calibration performance and robustness to different hyperparameter choices.
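The temporal-ensemble idea can be sketched as an exponential moving average over per-step predictions plus a relative confidence threshold (an illustrative simplification; the paper's uncertainty weighting and exact thresholding rule differ):

```python
def temporal_ensemble(prob_history, momentum=0.7, rel_threshold=0.8):
    """EMA-smooth a sequence of per-class probability vectors across training
    steps, then keep the pseudo label only if the smoothed top-class probability
    exceeds rel_threshold times the highest confidence seen in the history
    (a simple relative threshold). Returns (label or None, smoothed confidence)."""
    ema = list(prob_history[0])
    for probs in prob_history[1:]:
        ema = [momentum * e + (1 - momentum) * p for e, p in zip(ema, probs)]
    top = max(ema)
    label = ema.index(top)
    if top >= rel_threshold * max(max(p) for p in prob_history):
        return label, top
    return None, top

stable = [[0.9, 0.1], [0.8, 0.2], [0.85, 0.15]]  # consistently confident -> keep
noisy = [[0.9, 0.1], [0.2, 0.8], [0.55, 0.45]]   # flip-flopping -> reject
print(temporal_ensemble(stable))
print(temporal_ensemble(noisy))
```

Rejecting the flip-flopping case is the selective temporal consistency the abstract describes: under distribution shift, only pseudo labels that stay confident over time survive the smoothing.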
[LG-17] Enhancing Adaptive Mixed-Criticality Scheduling with Deep Reinforcement Learning
链接: https://arxiv.org/abs/2411.00572
作者: Bruno Mendes(1),Pedro F. Souto(1 and 2),Pedro C. Diniz(2) ((1) Department of Informatics Engineering (DEI) Faculty of Engineering of the University of Porto (FEUP) (2) CISTER Research Centre)
关键词-EN: hard real-time systems, mixed-criticality hard real-time, fixed-priority preemptive scheduling, preemptive scheduling algorithm, Adaptive Mixed-Criticality
类目: Operating Systems (cs.OS); Machine Learning (cs.LG)
*备注: Version submitted to RTNS 2024, on 17/08/2024 (with some typos fixed)
点击查看摘要
Abstract:Adaptive Mixed-Criticality (AMC) is a fixed-priority preemptive scheduling algorithm for mixed-criticality hard real-time systems. It dominates many other scheduling algorithms for mixed-criticality systems, but does so at the cost of occasionally dropping jobs of less important/critical tasks, when low-priority jobs overrun their time budgets. In this paper we enhance AMC with a deep reinforcement learning (DRL) approach based on a Deep-Q Network. The DRL agent is trained off-line, and at run-time adjusts the low-criticality budgets of tasks to avoid budget overruns, while ensuring that no job misses its deadline if it does not overrun its budget. We have implemented and evaluated this approach by simulating realistic workloads from the automotive domain. The results show that the agent is able to reduce budget overruns by up to 50%, even when the budget of each task is chosen based on sampling the distribution of its execution time. To the best of our knowledge, this is the first use of DRL in AMC reported in the literature.
[LG-18] DeepSeq2: Enhanced Sequential Circuit Learning with Disentangled Representations
链接: https://arxiv.org/abs/2411.00530
作者: Sadaf Khan,Zhengyuan Shi,Ziyang Zheng,Min Li,Qiang Xu
关键词-EN: Electronic Design Automation, Design Automation, Electronic Design, pivotal in Electronic, enhanced model efficiency
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Circuit representation learning is increasingly pivotal in Electronic Design Automation (EDA), serving various downstream tasks with enhanced model efficiency and accuracy. One notable work, DeepSeq, has pioneered sequential circuit learning by encoding temporal correlations. However, it suffers from significant limitations including prolonged execution times and architectural inefficiencies. To address these issues, we introduce DeepSeq2, a novel framework that enhances the learning of sequential circuits, by innovatively mapping it into three distinct embedding spaces-structure, function, and sequential behavior-allowing for a more nuanced representation that captures the inherent complexities of circuit dynamics. By employing an efficient Directed Acyclic Graph Neural Network (DAG-GNN) that circumvents the recursive propagation used in DeepSeq, DeepSeq2 significantly reduces execution times and improves model scalability. Moreover, DeepSeq2 incorporates a unique supervision mechanism that captures transitioning behaviors within circuits more effectively. DeepSeq2 sets a new benchmark in sequential circuit representation learning, outperforming prior works in power estimation and reliability analysis.
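The non-recursive propagation that lets DeepSeq2 avoid DeepSeq's recursion can be sketched as level-by-level scheduling over the circuit DAG: all gates in one level depend only on earlier levels, so each level can be processed in a single batched pass.

```python
def topological_levels(n, edges):
    """Group the n nodes of a DAG into levels such that every node depends only
    on nodes in earlier levels. One propagation pass per level then replaces
    recursive traversal -- the kind of scheduling a DAG-GNN forward pass uses."""
    indeg = [0] * n
    succs = [[] for _ in range(n)]
    for u, v in edges:
        indeg[v] += 1
        succs[u].append(v)
    level = [u for u in range(n) if indeg[u] == 0]
    levels = []
    while level:
        levels.append(sorted(level))
        nxt = []
        for u in level:
            for v in succs[u]:
                indeg[v] -= 1
                if indeg[v] == 0:
                    nxt.append(v)
        level = nxt
    return levels

# Tiny combinational fragment: two inputs feed one gate, which feeds the output.
print(topological_levels(4, [(0, 2), (1, 2), (2, 3)]))  # [[0, 1], [2], [3]]
```

Batching per level is what makes execution time scale with circuit depth rather than with the number of recursive visits.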
[LG-19] Active Preference-based Learning for Multi-dimensional Personalization
链接: https://arxiv.org/abs/2411.00524
作者: Minhyeon Oh,Seungjoon Lee,Jungseul Ok
关键词-EN: shown remarkable versatility, remains challenging due, individual human preferences, human preferences remains, preferences remains challenging
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Large language models (LLMs) have shown remarkable versatility across tasks, but aligning them with individual human preferences remains challenging due to the complexity and diversity of these preferences. Existing methods often overlook the fact that preferences are multi-objective, diverse, and hard to articulate, making full alignment difficult. In response, we propose an active preference learning framework that uses binary feedback to estimate user preferences across multiple objectives. Our approach leverages Bayesian inference to update preferences efficiently and reduces user feedback through an acquisition function that optimally selects queries. Additionally, we introduce a parameter to handle feedback noise and improve robustness. We validate our approach through theoretical analysis and experiments on language generation tasks, demonstrating its feedback efficiency and effectiveness in personalizing model responses.
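The Bayesian update step can be sketched with a Bradley-Terry likelihood over a discrete set of candidate preference vectors (illustrative only; the paper's acquisition-function query selection and feedback-noise parameter are omitted):

```python
import math

def bayes_update(candidates, priors, option_a, option_b, preferred_a):
    """Posterior over candidate preference weight vectors after one binary
    comparison. Likelihood: P(prefers A | w) = sigmoid(u_A - u_B) with linear
    utilities u = <w, features> (Bradley-Terry model)."""
    def utility(w, x):
        return sum(wi * xi for wi, xi in zip(w, x))
    post = []
    for w, p in zip(candidates, priors):
        ua, ub = utility(w, option_a), utility(w, option_b)
        p_a = 1.0 / (1.0 + math.exp(ub - ua))
        post.append(p * (p_a if preferred_a else 1.0 - p_a))
    z = sum(post)
    return [p / z for p in post]

# Two hypotheses about the user: weights helpfulness, or weights brevity.
cands = [[1.0, 0.0], [0.0, 1.0]]
prior = [0.5, 0.5]
# Response A scores high on helpfulness, B on brevity; the user picked A.
post = bayes_update(cands, prior, option_a=[1.0, 0.0], option_b=[0.0, 1.0],
                    preferred_a=True)
print(post)  # posterior mass shifts toward the "helpfulness" hypothesis
```

Each round of binary feedback sharpens this posterior, which is why the framework can personalize across multiple objectives with few queries.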
[LG-20] Analyzing Multimodal Integration in the Variational Autoencoder from an Information-Theoretic Perspective
链接: https://arxiv.org/abs/2411.00522
作者: Carlotta Langer,Yasmin Kim Georgie,Ilja Porohovoj,Verena Vanessa Hafner,Nihat Ay
关键词-EN: Human perception, perception is inherently, multimodal VAE integrates, multimodal, latent space
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注:
点击查看摘要
Abstract:Human perception is inherently multimodal. We integrate, for instance, visual, proprioceptive and tactile information into one experience. Hence, multimodal learning is of importance for building robotic systems that aim at robustly interacting with the real world. One potential model that has been proposed for multimodal integration is the multimodal variational autoencoder. A variational autoencoder (VAE) consists of two networks, an encoder that maps the data to a stochastic latent space and a decoder that reconstructs this data from an element of this latent space. The multimodal VAE integrates inputs from different modalities at two points in time in the latent space and can thereby be used as a controller for a robotic agent. Here we use this architecture and introduce information-theoretic measures in order to analyze how important the integration of the different modalities is for the reconstruction of the input data. We therefore calculate two different types of measures: the first type, called single modality error, assesses how important the information from a single modality is for the reconstruction of this modality or all modalities. Secondly, the measures named loss of precision calculate the impact that missing information from only one modality has on the reconstruction of this modality or the whole vector. The VAE is trained via the evidence lower bound, which can be written as a sum of two different terms, namely the reconstruction and the latent loss. The impact of the latent loss can be weighted via an additional variable, which has been introduced to combat posterior collapse. Here we train networks with four different weighting schedules and analyze them with respect to their capabilities for multimodal integration.
[LG-21] Outlier-Oriented Poisoning Attack: A Grey-box Approach to Disturb Decision Boundaries by Perturbing Outliers in Multiclass Learning
链接: https://arxiv.org/abs/2411.00519
作者: Anum Paracha,Junaid Arshad,Mohamed Ben Farah,Khalid Ismail
关键词-EN: manipulating training datasets, machine learning, machine learning models, machine learning algorithms, OOP attack
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Poisoning attacks are a primary threat to machine learning models, aiming to compromise their performance and reliability by manipulating training datasets. This paper introduces a novel attack - Outlier-Oriented Poisoning (OOP) attack, which manipulates the labels of the samples most distanced from the decision boundaries. The paper also investigates the adverse impact of such attacks on different machine learning algorithms within a multiclass classification scenario, analyzing their variance and the correlation between different poisoning levels and performance degradation. To ascertain the severity of the OOP attack for different degrees (5% - 25%) of poisoning, we analyzed variance, accuracy, precision, recall, f1-score, and false positive rate for the chosen ML models. Using our OOP attack, we have analyzed key characteristics of multiclass machine learning algorithms and their sensitivity to poisoning attacks. Our experimentation used three publicly available datasets: IRIS, MNIST, and ISIC. Our analysis shows that KNN and GNB are the most affected algorithms, with a decrease in accuracy of 22.81% and 56.07% and an increase in false positive rate to 17.14% and 40.45% for the IRIS dataset with 15% poisoning. Further, Decision Trees and Random Forest are the most resilient algorithms, with the least accuracy disruption of 12.28% and 17.52% with 15% poisoning of the IRIS dataset. We have also analyzed the correlation between the number of dataset classes and the performance degradation of models. Our analysis highlighted that the number of classes is inversely proportional to the performance degradation, specifically the decrease in accuracy of the models, which is normalized with an increasing number of classes. Further, our analysis identified that an imbalanced dataset distribution can aggravate the impact of poisoning for machine learning models.
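The core poisoning step — relabeling the samples most distant from the decision boundary — can be sketched as follows. The nearest-centroid surrogate and the synthetic two-class data are our simplifications; the paper measures boundary distance under the actual ML models it attacks:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic two-class data (stand-in for the IRIS/MNIST/ISIC datasets).
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

def oop_poison(X, y, rate):
    # Flip labels of the samples farthest from the decision boundary,
    # approximated here by the margin between the two smallest
    # class-centroid distances of a nearest-centroid classifier.
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    d_sorted = np.sort(d, axis=1)
    margin = d_sorted[:, 1] - d_sorted[:, 0]   # large margin = far from boundary
    budget = int(rate * len(y))
    idx = np.argsort(-margin)[:budget]         # most distanced samples
    y_poisoned = y.copy()
    for i in idx:                              # relabel to another class
        cur = np.where(classes == y[i])[0][0]
        y_poisoned[i] = classes[(cur + 1) % len(classes)]
    return y_poisoned, idx

y_p, flipped = oop_poison(X, y, rate=0.15)
```

Retraining a victim model on `(X, y_p)` and comparing its accuracy against a clean baseline reproduces the kind of degradation analysis the paper performs at each poisoning rate.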
[LG-22] Zero-shot Generalization in Inventory Management: Train then Estimate and Decide
链接: https://arxiv.org/abs/2411.00515
作者: Tarkan Temizöz,Christina Imdahl,Remco Dijkman,Douniel Lamghari-Idrissi,Willem van Jaarsveld
关键词-EN: Deploying deep reinforcement, including dynamic environments, Deploying deep, deep reinforcement learning, management presents challenges
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Deploying deep reinforcement learning (DRL) in real-world inventory management presents challenges, including dynamic environments and uncertain problem parameters, e.g. demand and lead time distributions. These challenges highlight a research gap, suggesting a need for a unifying framework to model and solve sequential decision-making under parameter uncertainty. We address this by exploring an underexplored area of DRL for inventory management: training generally capable agents (GCAs) under zero-shot generalization (ZSG). Here, GCAs are advanced DRL policies designed to handle a broad range of sampled problem instances with diverse inventory challenges. ZSG refers to the ability to successfully apply learned policies to unseen instances with unknown parameters without retraining. We propose a unifying Super-Markov Decision Process formulation and the Train, then Estimate and Decide (TED) framework to train and deploy a GCA tailored to inventory management applications. The TED framework consists of three phases: training a GCA on varied problem instances, continuously estimating problem parameters during deployment, and making decisions based on these estimates. Applied to periodic review inventory problems with lost sales, cyclic demand patterns, and stochastic lead times, our trained agent, the Generally Capable Lost Sales Network (GC-LSN) consistently outperforms well-known traditional policies when problem parameters are known. Moreover, under conditions where demand and/or lead time distributions are initially unknown and must be estimated, we benchmark against online learning methods that provide worst-case performance guarantees. Our GC-LSN policy, paired with the Kaplan-Meier estimator, is demonstrated to complement these methods by providing superior empirical performance. 
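The Estimate-and-Decide phases of the TED framework can be illustrated with a toy lost-sales loop. The online mean estimator and the newsvendor-style base-stock rule below are hypothetical stand-ins for the paper's Kaplan-Meier estimator and trained GC-LSN policy:

```python
import numpy as np

rng = np.random.default_rng(1)
true_demand_mean = 5.0   # unknown to the agent at deployment time

est_mean, n_obs, inventory = 0.0, 0, 0.0
for t in range(2000):
    demand = rng.poisson(true_demand_mean)
    # Estimate: update a running estimate of the unknown demand rate.
    n_obs += 1
    est_mean += (demand - est_mean) / n_obs
    # Decide: order up to a base-stock level derived from the estimate
    # (~90% quantile under a normal approximation; assumed service target).
    base_stock = est_mean + 1.28 * np.sqrt(est_mean)
    order = max(base_stock - inventory, 0.0)
    # Lost-sales dynamics: unmet demand is lost, not backordered.
    inventory = max(inventory + order - demand, 0.0)
```

The "Train" phase — learning a generally capable policy over many sampled problem instances — is what the sketch replaces with the fixed base-stock rule.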
[LG-23] Label Cluster Chains for Multi-Label Classification
链接: https://arxiv.org/abs/2411.00514
作者: Elaine Cecília Gatto,Felipe Nakano Kenji,Jesse Read,Mauri Ferrandin,Ricardo Cerri,Celine Vens
关键词-EN: simultaneously assign multiple, assign multiple labels, supervised machine learning, label, Multi-label classification
类目: Machine Learning (cs.LG)
*备注: 26 pages, 11 figures, 5 tables
点击查看摘要
Abstract:Multi-label classification is a type of supervised machine learning that can simultaneously assign multiple labels to an instance. To solve this task, some methods divide the original problem into several sub-problems (local approach), others learn all labels at once (global approach), and others combine several classifiers (ensemble approach). Regardless of the approach used, exploring and learning label correlations is important to improve the classifier predictions. Ensemble of Classifier Chains (ECC) is a well-known multi-label method that considers label correlations and can achieve good overall performance on several multi-label datasets and evaluation measures. However, one of the challenges when working with ECC is the high dimensionality of the label space, which can impose limitations for fully-cascaded chains as the complexity increases regarding feature space expansion. To improve classifier chains, we propose a method to chain disjoint correlated label clusters obtained by applying a partition method in the label space. During the training phase, the ground truth labels of each cluster are used as new features for all of the following clusters. During the test phase, the predicted labels of clusters are used as new features for all the following clusters. Our proposal, called Label Cluster Chains for Multi-Label Classification (LCC-ML), uses multi-label Random Forests as base classifiers in each cluster, combining their predictions to obtain a final multi-label classification. Our proposal obtained better results compared to the original ECC. This shows that learning and chaining disjoint correlated label clusters can better explore and learn label correlations.
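The chaining scheme — ground-truth cluster labels as extra features during training, predicted labels during testing — can be sketched with a toy 1-nearest-neighbour base learner in place of the paper's multi-label Random Forests:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy multi-label data: 4 labels split into two correlated clusters.
X = rng.normal(size=(60, 3))
Y = np.zeros((60, 4), dtype=int)
Y[:, 0] = (X[:, 0] > 0)
Y[:, 1] = Y[:, 0]            # cluster A: labels 0 and 1 correlated
Y[:, 2] = (X[:, 1] > 0)
Y[:, 3] = Y[:, 2]            # cluster B: labels 2 and 3 correlated
clusters = [[0, 1], [2, 3]]   # assumed output of the label-partition step

def knn_predict(Xtr, Ytr, Xte):
    # 1-NN multi-label prediction (stand-in base classifier).
    d = np.linalg.norm(Xte[:, None, :] - Xtr[None, :, :], axis=2)
    return Ytr[d.argmin(axis=1)]

Xtr, Ytr, Xte, Yte = X[:40], Y[:40], X[40:], Y[40:]
Ftr, Fte = Xtr, Xte
preds = []
for cl in clusters:
    Yhat = knn_predict(Ftr, Ytr[:, cl], Fte)
    preds.append(Yhat)
    Ftr = np.hstack([Ftr, Ytr[:, cl]])   # train: ground-truth labels chained
    Fte = np.hstack([Fte, Yhat])         # test: predicted labels chained
Yhat_all = np.hstack(preds)
```

Each cluster thus sees the labels of all preceding clusters, which is how LCC-ML exploits label correlations without a fully-cascaded per-label chain.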
[LG-24] Exploring the Precise Dynamics of Single-Layer GAN Models: Leveraging Multi-Feature Discriminators for High-Dimensional Subspace Learning NEURIPS2024
链接: https://arxiv.org/abs/2411.00498
作者: Andrew Bond,Zafer Dogan
关键词-EN: contemporary machine learning, Subspace learning, critical endeavor, endeavor in contemporary, contemporary machine
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted for NeurIPS 2024, 16 pages, 7 figures
点击查看摘要
Abstract:Subspace learning is a critical endeavor in contemporary machine learning, particularly given the vast dimensions of modern datasets. In this study, we delve into the training dynamics of a single-layer GAN model from the perspective of subspace learning, framing these GANs as a novel approach to this fundamental task. Through a rigorous scaling limit analysis, we offer insights into the behavior of this model. Extending beyond prior research that primarily focused on sequential feature learning, we investigate the non-sequential scenario, emphasizing the pivotal role of inter-feature interactions in expediting training and enhancing performance, particularly with an uninformed initialization strategy. Our investigation encompasses both synthetic and real-world datasets, such as MNIST and Olivetti Faces, demonstrating the robustness and applicability of our findings to practical scenarios. By bridging our analysis to the realm of subspace learning, we systematically compare the efficacy of GAN-based methods against conventional approaches, both theoretically and empirically. Notably, our results unveil that while all methodologies successfully capture the underlying subspace, GANs exhibit a remarkable capability to acquire a more informative basis, owing to their intrinsic ability to generate new data samples. This elucidates the unique advantage of GAN-based approaches in subspace learning tasks.
[LG-25] The learned range test method for the inverse inclusion problem
链接: https://arxiv.org/abs/2411.00463
作者: Shiwei Sun,Giovanni S. Alberti
关键词-EN: inverse problem consisting, Omega, Cauchy data, pair of Cauchy, partial
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注: 25 pages, 12 figures
点击查看摘要
Abstract:We consider the inverse problem consisting of the reconstruction of an inclusion B contained in a bounded domain \Omega\subset\mathbb{R}^d from a single pair of Cauchy data (u|_{\partial\Omega}, \partial_\nu u|_{\partial\Omega}), where \Delta u=0 in \Omega\setminus\overline{B} and u=0 on \partial B. We show that the reconstruction algorithm based on the range test, a domain sampling method, can be written as a neural network with a specific architecture. We propose to learn the weights of this network in the framework of supervised learning, and to combine it with a pre-trained classifier, with the purpose of distinguishing the inclusions based on their distance from the boundary. The numerical simulations show that this learned range test method provides accurate and stable reconstructions of polygonal inclusions. Furthermore, the results are superior to those obtained with the standard range test method (without learning) and with an end-to-end fully connected deep neural network, a purely data-driven method.
[LG-26] Unlocking Your Sales Insights: Advanced XGBoost Forecasting Models for Amazon Products
链接: https://arxiv.org/abs/2411.00460
作者: Meng Wang,Yuchen Liu,Gangmin Li,Terry R.Payne,Yong Yue,Ka Lok Man
关键词-EN: important factors, sales, future transaction volume, volume, factors of profitability
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:One of the important factors of profitability is the volume of transactions. An accurate prediction of the future transaction volume becomes a pivotal factor in shaping corporate operations and decision-making processes. E-commerce has presented manufacturers with convenient sales channels with which sales can increase dramatically. In this study, we introduce a solution that leverages the XGBoost model to tackle the challenge of predicting sales for consumer electronics products on the Amazon platform. Initially, our attempts to solely predict sales volume yielded unsatisfactory results. However, by replacing the sales volume data with sales range values, we achieved satisfactory accuracy with our model. Furthermore, our results indicate that XGBoost exhibits superior predictive performance compared to traditional models.
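The pivotal preprocessing step — replacing raw sales volume with sales-range classes — might look like this; the quartile bin edges and the synthetic lognormal volumes are assumptions for illustration (the paper's Amazon data is not public):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical weekly sales volumes for a set of products.
volume = rng.lognormal(mean=3.0, sigma=1.0, size=500)

# Key move from the paper: turn the hard volume-regression target into
# sales-range classes, which the model predicts far more reliably.
edges = np.quantile(volume, [0.25, 0.5, 0.75])   # assumed quartile ranges
sales_range = np.digitize(volume, edges)          # class labels 0..3
```

An XGBoost classifier would then be trained on product features against `sales_range` instead of `volume`.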
[LG-27] Diffusion Models as Network Optimizers: Explorations and Analysis
链接: https://arxiv.org/abs/2411.00453
作者: Ruihuai Liang,Bo Yang,Pengyu Chen,Xianjin Li,Yifan Xue,Zhiwen Yu,Xuelin Cao,Yan Zhang,Mérouane Debbah,H. Vincent Poor,Chau Yuen
关键词-EN: Internet of Things, Network optimization, network optimization problems, optimization problems, fundamental challenge
类目: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注:
点击查看摘要
Abstract:Network optimization is a fundamental challenge in the Internet of Things (IoT) network, often characterized by complex features that make it difficult to solve these problems. Recently, generative diffusion models (GDMs) have emerged as a promising new approach to network optimization, with the potential to directly address these optimization problems. However, the application of GDMs in this field is still in its early stages, and there is a noticeable lack of theoretical research and empirical findings. In this study, we first explore the intrinsic characteristics of generative models. Next, we provide a concise theoretical proof and intuitive demonstration of the advantages of generative models over discriminative models in network optimization. Based on this exploration, we implement GDMs as optimizers aimed at learning high-quality solution distributions for given inputs, sampling from these distributions during inference to approximate or achieve optimal solutions. Specifically, we utilize denoising diffusion probabilistic models (DDPMs) and employ a classifier-free guidance mechanism to manage conditional guidance based on input parameters. We conduct extensive experiments across three challenging network optimization problems. By investigating various model configurations and the principles of GDMs as optimizers, we demonstrate the ability to overcome prediction errors and validate the convergence of generated solutions to optimal solutions.
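The classifier-free guidance mechanism mentioned above combines conditional and unconditional denoiser outputs in the standard way; the guidance weight w is a tunable hyperparameter, not a value from the paper:

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, w):
    # Classifier-free guidance: push the noise prediction toward the
    # conditional direction by weight w (w=0 unconditional, w=1 conditional,
    # w>1 over-emphasizes the condition, i.e. the network-parameter input).
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_u = np.array([0.1, -0.2])   # toy unconditional noise prediction
eps_c = np.array([0.3, 0.0])    # toy condition-guided noise prediction
eps = cfg_combine(eps_u, eps_c, w=2.0)
```

At each DDPM denoising step, `eps` replaces the raw network output, steering the sampled solution distribution toward the given problem parameters.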
[LG-28] Improving Few-Shot Cross-Domain Named Entity Recognition by Instruction Tuning a Word-Embedding based Retrieval Augmented Large Language Model
链接: https://arxiv.org/abs/2411.00451
作者: Subhadip Nandi,Neeraj Agrawal
关键词-EN: Few-Shot Cross-Domain NER, Cross-Domain NER, NER, process of leveraging, leveraging knowledge
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注:
点击查看摘要
Abstract:Few-Shot Cross-Domain NER is the process of leveraging knowledge from data-rich source domains to perform entity recognition on data-scarce target domains. Most previous state-of-the-art (SOTA) approaches use pre-trained language models (PLMs) for cross-domain NER. However, these models are often domain specific. To successfully use these models for new target domains, we need to modify either the model architecture or perform model finetuning using data from the new domains. Both of these result in the creation of entirely new NER models for each target domain which is infeasible for practical scenarios. Recently, several works have attempted to use LLMs to solve Few-Shot Cross-Domain NER. However, most of these are either too expensive for practical purposes or struggle to follow LLM prompt instructions. In this paper, we propose IF-WRANER (Instruction Finetuned Word-embedding based Retrieval Augmented large language model for Named Entity Recognition), a retrieval augmented LLM, finetuned for the NER task. By virtue of the regularization techniques used during LLM finetuning and the adoption of word-level embedding over sentence-level embedding during the retrieval of in-prompt examples, IF-WRANER is able to outperform previous SOTA Few-Shot Cross-Domain NER approaches. We have demonstrated the effectiveness of our model by benchmarking its performance on the open source CrossNER dataset, on which it shows more than 2% F1 score improvement over the previous SOTA model. We have deployed the model for multiple customer care domains of an enterprise. Accurate entity prediction through IF-WRANER helps direct customers to automated workflows for the domains, thereby reducing escalations to human agents by almost 15% and leading to millions of dollars in yearly savings for the company.
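The word-level retrieval idea — scoring an in-prompt example by how well each query word matches its best counterpart, rather than by one sentence embedding — can be sketched as below. The hash-seeded pseudo-embeddings and example sentences are purely illustrative:

```python
import zlib
import numpy as np

def word_vec(word, dim=16):
    # Deterministic pseudo-embedding per word; a stand-in for the real
    # word embeddings IF-WRANER retrieves with (hypothetical).
    rng = np.random.default_rng(zlib.crc32(word.encode()))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def word_level_score(query, example):
    # Average, over query words, of the best cosine match among the
    # example's words -- word-level rather than sentence-level matching.
    q = [word_vec(w) for w in query.split()]
    e = np.stack([word_vec(w) for w in example.split()])
    return float(np.mean([np.max(e @ v) for v in q]))

examples = [
    "book a flight to paris",        # hypothetical in-prompt examples
    "order a pizza for dinner",
]
query = "flight to rome"
best = max(range(len(examples)),
           key=lambda i: word_level_score(query, examples[i]))
```

The highest-scoring examples are then placed in the LLM prompt before the query sentence.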
[LG-29] A KAN-based Interpretable Framework for Process-Informed Prediction of Global Warming Potential
链接: https://arxiv.org/abs/2411.00426
作者: Jaewook Lee,Xinyang Sun,Ethan Errington,Miao Guo
关键词-EN: Global Warming Potential, Warming Potential, Global Warming, GWP prediction, GWP prediction models
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
点击查看摘要
Abstract:Accurate prediction of Global Warming Potential (GWP) is essential for assessing the environmental impact of chemical processes and materials. Traditional GWP prediction models rely predominantly on molecular structure, overlooking critical process-related information. In this study, we present an integrative GWP prediction model that combines molecular descriptors (MACCS keys and Mordred descriptors) with process information (process title, description, and location) to improve predictive accuracy and interpretability. Using a deep neural network (DNN) model, we achieved an R-squared of 86% on test data with Mordred descriptors, process location, and description information, representing a 25% improvement over the previous benchmark of 61%; XAI analysis further highlighted the significant role of process title embeddings in enhancing model predictions. To enhance interpretability, we employed a Kolmogorov-Arnold Network (KAN) to derive a symbolic formula for GWP prediction, capturing key molecular and process features and providing a transparent, interpretable alternative to black-box models, enabling users to gain insights into the molecular and process factors influencing GWP. Error analysis showed that the model performs reliably in densely populated data ranges, with increased uncertainty for higher GWP values. This analysis allows users to manage prediction uncertainty effectively, supporting data-driven decision-making in chemical and process design. Our results suggest that integrating both molecular and process-level information in GWP prediction models yields substantial gains in accuracy and interpretability, offering a valuable tool for sustainability assessments. Future work may extend this approach to additional environmental impact categories and refine the model to further enhance its predictive reliability.
[LG-30] Black-Box Forgetting NEURIPS2024
链接: https://arxiv.org/abs/2411.00409
作者: Yusuke Kuwana,Yuta Goto,Takashi Shibata,Go Irie
关键词-EN: Large-scale pre-trained models, provide remarkable zero-shot, remarkable zero-shot classification, zero-shot classification capability, classification capability covering
类目: Machine Learning (cs.LG)
*备注: NeurIPS 2024
点击查看摘要
Abstract:Large-scale pre-trained models (PTMs) provide remarkable zero-shot classification capability covering a wide variety of object classes. However, practical applications do not always require the classification of all kinds of objects, and leaving the model capable of recognizing unnecessary classes not only degrades overall accuracy but also leads to operational disadvantages. To mitigate this issue, we explore the selective forgetting problem for PTMs, where the task is to make the model unable to recognize only the specified classes while maintaining accuracy for the rest. All the existing methods assume “white-box” settings, where model information such as architectures, parameters, and gradients is available for training. However, PTMs are often “black-box,” where information on such models is unavailable for commercial reasons or social responsibilities. In this paper, we address a novel problem of selective forgetting for black-box models, named Black-Box Forgetting, and propose an approach to the problem. Given that information on the model is unavailable, we optimize the input prompt to decrease the accuracy of specified classes through derivative-free optimization. To avoid difficult high-dimensional optimization while ensuring high forgetting performance, we propose Latent Context Sharing, which introduces common low-dimensional latent components among multiple tokens for the prompt. Experiments on four standard benchmark datasets demonstrate the superiority of our method with reasonable baselines. The code is available at this https URL.
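The combination of derivative-free optimization with Latent Context Sharing can be illustrated as follows: all prompt tokens are generated from one shared low-dimensional latent, and a simple (1+1) random search optimizes a black-box score. The random projections and the toy objective are our stand-ins for the real black-box model:

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, token_dim, latent_dim = 8, 32, 4

# Latent Context Sharing (our simplification): every prompt token is
# produced from ONE shared low-dimensional latent via a fixed random
# projection, keeping the search space low-dimensional.
proj = rng.normal(size=(n_tokens, latent_dim, token_dim)) / np.sqrt(latent_dim)
target = rng.normal(size=(n_tokens, token_dim))  # hypothetical optimum

def tokens_from_latent(z):
    return np.einsum('d,ndk->nk', z, proj)

def black_box_score(tokens):
    # Stand-in for "how strongly the prompt suppresses the forgotten
    # classes"; only function evaluations are available, no gradients.
    return -np.linalg.norm(tokens - target)

# Derivative-free (1+1) random search in the shared latent space.
z = np.zeros(latent_dim)
init_score = black_box_score(tokens_from_latent(z))
best = init_score
for _ in range(500):
    cand = z + 0.3 * rng.normal(size=latent_dim)
    score = black_box_score(tokens_from_latent(cand))
    if score > best:
        z, best = cand, score
```

Searching over the 4-dimensional latent instead of the full 8x32 token matrix is exactly the dimensionality reduction that makes derivative-free optimization tractable here.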
[LG-31] Inference-to-complete: A High-performance and Programmable Data-plane Co-processor for Neural-network-driven Traffic Analysis
链接: https://arxiv.org/abs/2411.00408
作者: Dong Wen,Zhongpei Liu,Tong Yang,Tao Li,Tianyun Li,Chenglong Li,Jie Li,Zhigang Sun
关键词-EN: NN-driven IDP, emerging topic, topic for excellent, IDP, intelligent data-plane
类目: Networking and Internet Architecture (cs.NI); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注: Under review
点击查看摘要
Abstract:Neural-networks-driven intelligent data-plane (NN-driven IDP) is becoming an emerging topic for excellent accuracy and high performance. Meanwhile we argue that NN-driven IDP should satisfy three design goals: the flexibility to support various NNs models, the low-latency-high-throughput inference performance, and the data-plane-unawareness harming no performance and functionality. Unfortunately, existing work either over-modify NNs for IDP, or insert inline pipelined accelerators into the data-plane, failing to meet the flexibility and unawareness goals. In this paper, we propose Kaleidoscope, a flexible and high-performance co-processor located at the bypass of the data-plane. To address the challenge of meeting three design goals, three key techniques are presented. The programmable run-to-completion accelerators are developed for flexible inference. To further improve performance, we design a scalable inference engine which completes low-latency and low-cost inference for the mouse flows, and performs complex NNs with high accuracy for the elephant flows. Finally, raw-bytes-based NNs are introduced, which help to achieve unawareness. We prototype Kaleidoscope on both FPGA and ASIC library. In evaluation on six NNs models, Kaleidoscope reaches 256-352 ns inference latency and 100 Gbps throughput with negligible influence on the data-plane. The on-board tested NNs perform state-of-the-art accuracy among other NN-driven IDP, exhibiting the significant impact of flexibility on enhancing traffic analysis accuracy.
[LG-32] MoD: A Distribution-Based Approach for Merging Large Language Models
链接: https://arxiv.org/abs/2411.00406
作者: Quy-Anh Dang,Chris Ngo
关键词-EN: Large language models, Large language, task-specific variants, enabled the development, development of numerous
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Large language models (LLMs) have enabled the development of numerous specialized, task-specific variants. However, the maintenance and deployment of these individual models present substantial challenges in terms of resource utilization and operational efficiency. In this work, we propose the Mixture of Distributions (MoD) framework, a novel approach for merging LLMs that operates directly on their output probability distributions, rather than on model weights. Unlike traditional weight-averaging methods, MoD effectively preserves the specialized capabilities of individual models while enabling efficient knowledge sharing across tasks. Through extensive experimentation on mathematical reasoning benchmarks using Qwen2.5 models, we demonstrate that MoD significantly outperforms existing model merging techniques across multiple benchmarks. All code, data, and experimental materials are published at this https URL.
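Merging on output distributions rather than weights reduces, in the simplest case, to a weighted mixture of the models' next-token probability vectors; the uniform weights and toy distributions below are assumptions, and the paper's exact mixing rule may be more involved:

```python
import numpy as np

def merge_distributions(dists, weights):
    # Combine per-token output distributions from several LLMs into one
    # mixture distribution (renormalized for numerical safety).
    dists = np.asarray(dists, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    merged = np.tensordot(w, dists, axes=1)
    return merged / merged.sum(axis=-1, keepdims=True)

p_math = np.array([0.7, 0.2, 0.1])   # hypothetical next-token distributions
p_code = np.array([0.1, 0.3, 0.6])   # from two task-specialized models
p = merge_distributions([p_math, p_code], [0.5, 0.5])
```

Because the mixing happens at inference time over probabilities, each specialist model's weights stay untouched, which is how MoD avoids the capability loss of weight averaging.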
[LG-33] Fast Adaptation with Kernel and Gradient based Meta Leaning
链接: https://arxiv.org/abs/2411.00404
作者: JuneYoung Park,MinJae Kang
关键词-EN: Model Agnostic Meta, Agnostic Meta Learning, Model Agnostic, Agnostic Meta, MAML
类目: Machine Learning (cs.LG)
*备注: 12 pages(with reference), 2 figures, 4 tables
点击查看摘要
Abstract:Model Agnostic Meta Learning or MAML has become the standard for few-shot learning as a meta-learning problem. MAML is simple and can be applied to any model, as its name suggests. However, it often suffers from instability and computational inefficiency during both training and inference times. In this paper, we propose two algorithms to improve both the inner and outer loops of MAML, then pose an important question about what ‘meta’ learning truly is. Our first algorithm redefines the optimization problem in the function space to update the model using closed-form solutions instead of optimizing parameters through multiple gradient steps in the inner loop. In the outer loop, the second algorithm adjusts the learning of the meta-learner by assigning weights to the losses from each task of the inner loop. This method optimizes convergence during both the training and inference stages of MAML. In conclusion, our algorithms offer a new perspective on meta-learning and make significant discoveries in both theory and experiments. This research suggests a more efficient approach to few-shot learning and fast task adaptation compared to existing methods. Furthermore, it lays the foundation for establishing a new paradigm in meta-learning.
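The first algorithm's idea — replacing MAML's inner-loop gradient steps with a closed-form solution in function space — can be sketched with kernel ridge regression on a toy sine-regression task; the RBF kernel and ridge penalty are our assumptions:

```python
import numpy as np

def rbf(A, B, gamma=1.0):
    # RBF kernel matrix between two sets of points.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def closed_form_adapt(X_support, y_support, X_query, lam=1e-3):
    # Inner-loop "adaptation" solved in closed form via kernel ridge
    # regression, instead of multiple gradient steps on parameters.
    K = rbf(X_support, X_support)
    alpha = np.linalg.solve(K + lam * np.eye(len(K)), y_support)
    return rbf(X_query, X_support) @ alpha

# Toy task in the spirit of few-shot sine-wave regression benchmarks.
Xs = np.linspace(-3, 3, 20)[:, None]
ys = np.sin(Xs[:, 0])
pred = closed_form_adapt(Xs, ys, np.array([[0.5]]))
```

A single linear solve per task replaces the inner loop entirely, which is the source of the speed and stability gains the paper reports.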
[LG-34] Towards Building Secure UAV Navigation with FHE-aware Knowledge Distillation
链接: https://arxiv.org/abs/2411.00403
作者: Arjun Ramesh Kaushik,Charanjit Jutla,Nalini Ratha
关键词-EN: Unmanned Aerial Vehicles, Aerial Vehicles, Unmanned Aerial, safeguarding mission-critical systems, Fully Homomorphic Encryption
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: arXiv admin note: text overlap with arXiv:2404.17225
点击查看摘要
Abstract:In safeguarding mission-critical systems, such as Unmanned Aerial Vehicles (UAVs), preserving the privacy of path trajectories during navigation is paramount. While the combination of Reinforcement Learning (RL) and Fully Homomorphic Encryption (FHE) holds promise, the computational overhead of FHE presents a significant challenge. This paper proposes an innovative approach that leverages Knowledge Distillation to enhance the practicality of secure UAV navigation. By integrating RL and FHE, our framework addresses vulnerabilities to adversarial attacks while enabling real-time processing of encrypted UAV camera feeds, ensuring data security. To mitigate FHE’s latency, Knowledge Distillation is employed to compress the network, resulting in an impressive 18x speedup without compromising performance, as evidenced by an R-squared score of 0.9499 compared to the original model’s score of 0.9631. Our methodology underscores the feasibility of processing encrypted data for UAV navigation tasks, emphasizing security alongside performance efficiency and timely processing. These findings pave the way for deploying autonomous UAVs in sensitive environments, bolstering their resilience against potential security threats.
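The distillation objective used to compress the navigation network is, in the standard Hinton-style form, a temperature-softened cross-entropy against the teacher's outputs; the temperature and the toy logits below are illustrative, not values from the paper:

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # Soft-target cross-entropy; the T^2 factor is the usual convention
    # that keeps gradient magnitudes comparable across temperatures.
    p_teacher = softmax(teacher_logits, T)
    log_p_student = np.log(softmax(student_logits, T))
    return float(-(p_teacher * log_p_student).sum(axis=-1).mean() * T * T)

teacher = np.array([[4.0, 1.0, -2.0]])   # hypothetical navigation logits
matched = distillation_loss(teacher, teacher)
mismatched = distillation_loss(np.array([[-2.0, 1.0, 4.0]]), teacher)
```

The smaller student trained under this loss is what then runs over the FHE-encrypted camera feed, yielding the reported speedup.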
[LG-35] Towards Data Valuation via Asymmetric Data Shapley
链接: https://arxiv.org/abs/2411.00388
作者: Xi Zheng,Xiangyu Chang,Ruoxi Jia,Yong Tan
关键词-EN: economic advancements, algorithmic decision-making, vital driver, driver of technological, technological and economic
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:As data emerges as a vital driver of technological and economic advancements, a key challenge is accurately quantifying its value in algorithmic decision-making. The Shapley value, a well-established concept from cooperative game theory, has been widely adopted to assess the contribution of individual data sources in supervised machine learning. However, its symmetry axiom assumes all players in the cooperative game are homogeneous, which overlooks the complex structures and dependencies present in real-world datasets. To address this limitation, we extend the traditional data Shapley framework to asymmetric data Shapley, making it flexible enough to incorporate inherent structures within the datasets for structure-aware data valuation. We also introduce an efficient k-nearest neighbor-based algorithm for its exact computation. We demonstrate the practical applicability of our framework across various machine learning tasks and data market contexts. The code is available at: this https URL.
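For intuition, the classic (symmetric) data Shapley value can be estimated by Monte Carlo permutation sampling with a cheap 1-NN utility; the asymmetric variant proposed in the paper restricts the permutations to respect known structure, which this sketch omits:

```python
import numpy as np

rng = np.random.default_rng(0)
# Tiny 1-D dataset: point 2 carries a wrong label, so it should be valued low.
X = np.array([[0.0], [0.1], [0.05], [1.0], [1.1]])
y = np.array([0, 0, 1, 1, 1])             # index 2 is mislabeled
X_val, y_val = np.array([[0.0], [1.0]]), np.array([0, 1])

def utility(idx):
    # Validation accuracy of a 1-NN classifier trained on subset `idx`.
    if len(idx) == 0:
        return 0.0
    d = np.abs(X_val - X[idx].T)          # (n_val, |idx|) distance matrix
    return float((y[np.array(idx)][d.argmin(axis=1)] == y_val).mean())

# Monte Carlo estimate: average marginal contribution over random
# permutations of the data points.
n, n_perms = len(X), 200
values = np.zeros(n)
for _ in range(n_perms):
    perm = rng.permutation(n)
    prev, chosen = 0.0, []
    for i in perm:
        chosen.append(i)
        u = utility(chosen)
        values[i] += u - prev
        prev = u
values /= n_perms
```

The efficiency property holds by construction: the values sum to the utility of the full dataset, and the mislabeled point receives a low (typically negative) value.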
[LG-36] Preventing Model Collapse in Deep Canonical Correlation Analysis by Noise Regularization NEURIPS2024
链接: https://arxiv.org/abs/2411.00383
作者: Junlin He,Jinxiao Du,Susu Xu,Wei Ma
关键词-EN: Multi-View Representation Learning, Representation Learning, Multi-View Representation, unified representation, Canonical Correlation Analysis
类目: Machine Learning (cs.LG)
*备注: Accepted by NeurIPS 2024 as a poster
点击查看摘要
Abstract:Multi-View Representation Learning (MVRL) aims to learn a unified representation of an object from multi-view data. Deep Canonical Correlation Analysis (DCCA) and its variants share simple formulations and demonstrate state-of-the-art performance. However, with extensive experiments, we observe the issue of model collapse, i.e., the performance of DCCA-based methods will drop drastically as training proceeds. The model collapse issue could significantly hinder the wide adoption of DCCA-based methods because it is challenging to decide when to early stop. To this end, we develop NR-DCCA, which is equipped with a novel noise regularization approach to prevent model collapse. Theoretical analysis shows that the Correlation Invariant Property is the key to preventing model collapse, and our noise regularization forces the neural network to possess such a property. A framework to construct synthetic data with different common and complementary information is also developed to compare MVRL methods comprehensively. The developed NR-DCCA outperforms baselines stably and consistently in both synthetic and real-world datasets, and the proposed noise regularization approach can also be generalized to other DCCA-based methods such as DGCCA.
[LG-37] Communication Learning in Multi-Agent Systems from Graph Modeling Perspective ICLR
链接: https://arxiv.org/abs/2411.00382
作者: Shengchao Hu,Li Shen,Ya Zhang,Dacheng Tao
关键词-EN: artificial intelligence applications, numerous artificial intelligence, multiple intelligent agents, intelligence applications, target objectives
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注: Extension of the corresponding ICLR edition: arXiv:2405.08550
点击查看摘要
Abstract:In numerous artificial intelligence applications, the collaborative efforts of multiple intelligent agents are imperative for the successful attainment of target objectives. To enhance coordination among these agents, a distributed communication framework is often employed. However, indiscriminate information sharing among all agents can be resource-intensive, and the adoption of manually pre-defined communication architectures imposes constraints on inter-agent communication, thus limiting the potential for effective collaboration. Moreover, the communication framework often remains static during inference, which may result in sustained high resource consumption, as in most cases, only key decisions necessitate information sharing among agents. In this study, we introduce a novel approach wherein we conceptualize the communication architecture among agents as a learnable graph. We formulate this problem as the task of determining the communication graph while enabling the architecture parameters to update normally, thus necessitating a bi-level optimization process. Utilizing continuous relaxation of the graph representation and incorporating attention units, our proposed approach, CommFormer, efficiently optimizes the communication graph and concurrently refines architectural parameters through gradient descent in an end-to-end manner. Additionally, we introduce a temporal gating mechanism for each agent, enabling dynamic decisions on whether to receive shared information at a given time, based on current observations, thus improving decision-making efficiency. Extensive experiments on a variety of cooperative tasks substantiate the robustness of our model across diverse cooperative scenarios, where agents are able to develop more coordinated and sophisticated strategies regardless of changes in the number of agents.
[LG-38] DeepCore: Simple Fingerprint Construction for Differentiating Homologous and Piracy Models
链接: https://arxiv.org/abs/2411.00380
作者: Haifeng Sun,Lan Zhang,Xiang-Yang Li
关键词-EN: piracy models, models, increasingly important, intellectual property, protection of deep
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 9 pages
点击查看摘要
Abstract:As intellectual property rights, the copyright protection of deep models is becoming increasingly important. Existing work has made many attempts at model watermarking and fingerprinting, but it has ignored homologous models trained with similar structures or training datasets. We highlight the challenges of efficiently querying black-box piracy models to protect model copyrights without misidentifying homologous models. To address these challenges, we propose a novel method called DeepCore, which discovers that the classification confidence of the model is positively correlated with the distance of the predicted sample from the model decision boundary, and that piracy models behave more similarly on sample points classified with high confidence. DeepCore then constructs core points far away from the decision boundary by optimizing the predicted confidence of a few sample points and leverages behavioral discrepancies between piracy and homologous models to identify piracy models. Finally, we design different model identification methods, including two similarity-based methods and a clustering-based method, to identify piracy models using models' predictions of core points. Extensive experiments show the effectiveness of DeepCore in identifying various piracy models, achieving lower missed and false identification rates, and outperforming state-of-the-art methods.
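The "core point" construction, pushing a sample away from the decision boundary by maximizing predicted confidence, can be sketched on a toy binary logistic model. This is an illustrative assumption, not the paper's actual architecture or optimizer; all names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def build_core_point(x, w, b, steps=100, lr=0.5):
    """Gradient ascent on the model's predicted confidence, pushing a
    seed sample far from the decision boundary w @ x + b = 0."""
    x = x.copy()
    for _ in range(steps):
        p = sigmoid(w @ x + b)
        # d(confidence)/dx = p * (1 - p) * w for the positive class.
        x += lr * p * (1 - p) * w
    return x

w, b = np.array([1.0, -2.0]), 0.1
seed = np.array([0.2, 0.0])          # near the boundary: low confidence
core = build_core_point(seed, w, b)

conf_seed = sigmoid(w @ seed + b)
conf_core = sigmoid(w @ core + b)
```

On such high-confidence points, the intuition is that piracy models (copies of the same decision function) agree closely, while independently trained homologous models diverge.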
[LG-39] A Machine Learning Driven Website Platform and Browser Extension for Real-time Scoring and Fraud Detection for Website Legitimacy Verification and Consumer Protection
链接: https://arxiv.org/abs/2411.00368
作者: Md Kamrul Hasan Chy,Obed Nana Buadi
关键词-EN: Machine Learning-Driven website, Browser Extension designed, website legitimacy verification, introduces a Machine, Machine Learning-Driven
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: Journal of Multidisciplinary Engineering Science and Technology (JMEST) 2024
点击查看摘要
Abstract:This paper introduces a Machine Learning-Driven website Platform and Browser Extension designed to quickly enhance online security by providing real-time risk scoring and fraud detection for website legitimacy verification and consumer protection. The platform works seamlessly in the background to analyze website behavior, network traffic, and user interactions, offering immediate feedback and alerts when potential threats are detected. By integrating this system into a user-friendly browser extension, the platform empowers individuals to navigate the web safely, reducing the risk of engaging with fraudulent websites. Its real-time functionality is crucial in e-commerce and everyday browsing, where quick, actionable insights can prevent financial losses, identity theft, and exposure to malicious sites. This paper explores how this solution offers a practical, fast-acting tool for enhancing online consumer protection, underscoring its potential to play a critical role in safeguarding users and maintaining trust in digital transactions. The platform’s focus on speed and efficiency makes it an essential asset for preventing fraud in today’s increasingly digital world.
[LG-40] ROSS:RObust decentralized Stochastic learning based on Shapley values
链接: https://arxiv.org/abs/2411.00365
作者: Lina Wang,Yunsheng Yuan,Feng Li,Lingjie Duan
关键词-EN: central server, collaborate to learn, learn a global, severely challenged, data distribution
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:In the paradigm of decentralized learning, a group of agents collaborate to learn a global model using a distributed dataset without a central server; nevertheless, it is severely challenged by the heterogeneity of the data distribution across the agents. For example, the data may be distributed non-independently and identically, and even be noised or poisoned. To address these data challenges, we propose ROSS, a novel robust decentralized stochastic learning algorithm based on Shapley values, in this paper. Specifically, in each round, each agent aggregates the cross-gradient information from its neighbors, i.e., the derivatives of its local model with respect to the datasets of its neighbors, to update its local model in a momentum-like manner, while we innovate in weighting the derivatives according to their contributions measured by Shapley values. We perform solid theoretical analysis to reveal the linear convergence speedup of our ROSS algorithm. We also verify the efficacy of our algorithm through extensive experiments on public datasets. Our results demonstrate that, in the face of the above variety of data challenges, our ROSS algorithm has obvious advantages over existing state-of-the-art proposals in terms of both convergence and prediction accuracy.
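The Shapley-weighted aggregation step can be sketched for a tiny neighborhood, where exact Shapley values are tractable. The coalition utility below (cosine alignment with the agent's own gradient) is a stand-in assumption, not the paper's definition, and the momentum term is omitted.

```python
import itertools
import math
import numpy as np

rng = np.random.default_rng(1)
dim = 5
g_self = rng.normal(size=dim)
# Cross-gradients from three neighbors; neighbor "c" is a poisoned outlier.
cross = {"a": g_self + 0.1 * rng.normal(size=dim),
         "b": g_self + 0.1 * rng.normal(size=dim),
         "c": -5.0 * g_self}

def utility(coalition):
    """Alignment of the coalition's averaged gradient with the agent's
    own gradient (a toy proxy for a neighbor's contribution)."""
    if not coalition:
        return 0.0
    g = np.mean([cross[p] for p in coalition], axis=0)
    return float(g @ g_self) / (np.linalg.norm(g) * np.linalg.norm(g_self) + 1e-12)

def shapley_values(players, utility):
    """Exact Shapley values (feasible only for small player sets)."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for r in range(len(others) + 1):
            for coal in itertools.combinations(others, r):
                w = math.factorial(r) * math.factorial(n - r - 1) / math.factorial(n)
                phi[p] += w * (utility(coal + (p,)) - utility(coal))
    return phi

phi = shapley_values(list(cross), utility)
# Keep only non-negative contributions, normalize, and aggregate.
pos = {p: max(v, 0.0) for p, v in phi.items()}
total = sum(pos.values()) or 1.0
update = sum((pos[p] / total) * cross[p] for p in cross)
```

The poisoned neighbor receives a negative Shapley value and is effectively excluded from the aggregated update.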
[LG-41] Hierarchical Preference Optimization: Learning to achieve goals via feasible subgoals prediction
链接: https://arxiv.org/abs/2411.00361
作者: Utsav Singh,Souradip Chakraborty,Wesley A. Suttle,Brian M. Sadler,Anit Kumar Sahu,Mubarak Shah,Vinay P. Namboodiri,Amrit Singh Bedi
关键词-EN: introduces Hierarchical Preference, Hierarchical Preference Optimization, work introduces Hierarchical, Direct Preference Optimization, hierarchical reinforcement learning
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:This work introduces Hierarchical Preference Optimization (HPO), a novel approach to hierarchical reinforcement learning (HRL) that addresses non-stationarity and infeasible subgoal generation issues when solving complex robotic control tasks. HPO leverages maximum entropy reinforcement learning combined with token-level Direct Preference Optimization (DPO), eliminating the need for pre-trained reference policies that are typically unavailable in challenging robotic scenarios. Mathematically, we formulate HRL as a bi-level optimization problem and transform it into a primitive-regularized DPO formulation, ensuring feasible subgoal generation and avoiding degenerate solutions. Extensive experiments on challenging robotic navigation and manipulation tasks demonstrate impressive performance of HPO, where it shows an improvement of up to 35% over the baselines. Furthermore, ablation studies validate our design choices, and quantitative analyses confirm the ability of HPO to mitigate non-stationarity and infeasible subgoal generation issues in HRL.
[LG-42] Constrained Diffusion Implicit Models
链接: https://arxiv.org/abs/2411.00359
作者: Vivek Jayaram,Ira Kemelmacher-Shlizerman,Steven M. Seitz,John Thickstun
关键词-EN: diffusion implicit models, pretrained diffusion models, solving noisy linear, noisy linear inverse, implicit models
类目: Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注:
点击查看摘要
Abstract:This paper describes an efficient algorithm for solving noisy linear inverse problems using pretrained diffusion models. Extending the paradigm of denoising diffusion implicit models (DDIM), we propose constrained diffusion implicit models (CDIM) that modify the diffusion updates to enforce a constraint upon the final output. For noiseless inverse problems, CDIM exactly satisfies the constraints; in the noisy case, we generalize CDIM to satisfy an exact constraint on the residual distribution of the noise. Experiments across a variety of tasks and metrics show strong performance of CDIM, with analogous inference acceleration to unconstrained DDIM: 10 to 50 times faster than previous conditional diffusion methods. We demonstrate the versatility of our approach on many problems including super-resolution, denoising, inpainting, deblurring, and 3D point cloud reconstruction.
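For the noiseless case, enforcing the constraint exactly amounts to an orthogonal projection of the denoised estimate onto the affine set {x : Ax = y}. The sketch below shows only that projection step in numpy; the surrounding diffusion sampler and the noisy-residual generalization are omitted, and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noiseless linear inverse problem: y = A x.
A = rng.normal(size=(3, 8))
x_true = rng.normal(size=8)
y = A @ x_true

def project_onto_constraint(x0, A, y):
    """Orthogonal projection of a denoised estimate x0 onto {x : A x = y},
    the hard-constraint step for the noiseless case."""
    residual = A @ x0 - y
    correction = A.T @ np.linalg.solve(A @ A.T, residual)
    return x0 - correction

x0_hat = rng.normal(size=8)          # stand-in for a diffusion x0-prediction
x0_proj = project_onto_constraint(x0_hat, A, y)
```

In a CDIM-style sampler, a step like this would be interleaved with the usual DDIM updates so that the final output satisfies the measurement constraint.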
[LG-43] Coherent Hierarchical Probabilistic Forecasting of Electric Vehicle Charging Demand
链接: https://arxiv.org/abs/2411.00337
作者: Kedi Zheng,Hanwei Xu,Zeyang Long,Yi Wang,Qixin Chen
关键词-EN: typical load curves, significantly changes typical, smart grids, growing penetration, typical load
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注: Paper accepted for IEEE Transactions on Industry Applications. Personal use of this material is permitted. Permission from Elsevier must be obtained for all other uses
点击查看摘要
Abstract:The growing penetration of electric vehicles (EVs) significantly changes typical load curves in smart grids. With the development of fast charging technology, the volatility of EV charging demand is increasing, which requires additional flexibility for real-time power balance. The forecasting of EV charging demand involves probabilistic modeling of high dimensional time series dynamics across diverse electric vehicle charging stations (EVCSs). This paper studies the forecasting problem of multiple EVCS in a hierarchical probabilistic manner. For each charging station, a deep learning model based on a partial input convex neural network (PICNN) is trained to predict the day-ahead charging demand’s conditional distribution, preventing the common quantile crossing problem in traditional quantile regression models. Then, differentiable convex optimization layers (DCLs) are used to reconcile the scenarios sampled from the distributions to yield coherent scenarios that satisfy the hierarchical constraint. It learns a better weight matrix for adjusting the forecasting results of different targets in a machine-learning approach compared to traditional optimization-based hierarchical reconciling methods. Numerical experiments based on real-world EV charging data are conducted to demonstrate the efficacy of the proposed method.
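The hierarchical coherence constraint (station-level forecasts must sum to the aggregate) can be illustrated with the simplest reconciliation: an orthogonal projection onto the coherent subspace. This is a minimal stand-in for the paper's learned DCL reconciliation, with hypothetical names.

```python
import numpy as np

def reconcile(forecasts):
    """Orthogonal projection of [total, child_1, ..., child_k] onto the
    coherent subspace where the children sum to the total."""
    b = np.asarray(forecasts, dtype=float)
    # Constraint C b = 0 with C = [1, -1, ..., -1]: total - sum(children) = 0.
    C = np.concatenate(([1.0], -np.ones(b.size - 1)))[None, :]
    correction = C.T @ np.linalg.solve(C @ C.T, C @ b[:, None])
    return b - correction.ravel()

coherent = reconcile([100.0, 40.0, 50.0])   # incoherent: 40 + 50 != 100
```

The projection spreads the 10-unit mismatch across all three series; the paper's contribution is learning a better weighting for that adjustment rather than using the uniform one shown here.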
[LG-44] KAN-AD: Time Series Anomaly Detection with Kolmogorov-Arnold Networks
链接: https://arxiv.org/abs/2411.00278
作者: Quan Zhou,Changhua Pei,Fei Sun,Jing Han,Zhengwei Gao,Dan Pei,Haiming Zhang,Gaogang Xie,Jianhui Li
关键词-EN: providing early warnings, prevent greater losses, large-scale cloud services, promptly identify anomalies, providing early
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Time series anomaly detection (TSAD) has become an essential component of large-scale cloud services and web systems because it can promptly identify anomalies, providing early warnings to prevent greater losses. Deep learning-based forecasting methods have become very popular in TSAD due to their powerful learning capabilities. However, accurate predictions don’t necessarily lead to better anomaly detection. Due to the common occurrence of noise, i.e., local peaks and drops in time series, existing black-box learning methods can easily learn these unintended patterns, significantly affecting anomaly detection performance. Kolmogorov-Arnold Networks (KAN) offers a potential solution by decomposing complex temporal sequences into a combination of multiple univariate functions, making the training process more controllable. However, KAN optimizes univariate functions using spline functions, which are also susceptible to the influence of local anomalies. To address this issue, we present KAN-AD, which leverages the Fourier series to emphasize global temporal patterns, thereby mitigating the influence of local peaks and drops. KAN-AD improves both effectiveness and efficiency by transforming the existing black-box learning approach into learning the weights preceding univariate functions. Experimental results show that, compared to the current state-of-the-art, we achieved an accuracy increase of 15% while boosting inference speed by 55 times.
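The key intuition, that a global Fourier basis is far less sensitive to local spikes than a local spline basis, can be shown with a least-squares fit of a truncated Fourier series. This is a hedged numpy sketch of the idea, not KAN-AD's training procedure.

```python
import numpy as np

def fourier_design(t, n_terms):
    """Design matrix of a truncated Fourier series on t in [0, 1)."""
    cols = [np.ones_like(t)]
    for k in range(1, n_terms + 1):
        cols += [np.cos(2 * np.pi * k * t), np.sin(2 * np.pi * k * t)]
    return np.stack(cols, axis=1)

t = np.linspace(0, 1, 200, endpoint=False)
signal = np.sin(2 * np.pi * t)
signal_noisy = signal.copy()
signal_noisy[50] += 5.0              # a local spike (point anomaly)

Phi = fourier_design(t, n_terms=3)
coef, *_ = np.linalg.lstsq(Phi, signal_noisy, rcond=None)
fitted = Phi @ coef
```

The global fit stays close to the clean signal everywhere, so the spike stands out as a large residual, exactly the behavior wanted for anomaly detection.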
[LG-45] Improving Musical Instrument Classification with Advanced Machine Learning Techniques
链接: https://arxiv.org/abs/2411.00275
作者: Joanikij Chulev
关键词-EN: Music Information Retrieval, digital music production, Information Retrieval, gained considerable interest, considerable interest due
类目: ound (cs.SD); Information Retrieval (cs.IR); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: 43 pages, 35 figures, 14 tables
点击查看摘要
Abstract:Musical instrument classification, a key area in Music Information Retrieval, has gained considerable interest due to its applications in education, digital music production, and consumer media. Recent advances in machine learning, specifically deep learning, have enhanced the capability to identify and classify musical instruments from audio signals. This study applies various machine learning methods, including Naive Bayes, Support Vector Machines, Random Forests, Boosting techniques like AdaBoost and XGBoost, as well as deep learning models such as Convolutional Neural Networks and Artificial Neural Networks. The effectiveness of these methods is evaluated on the NSynth dataset, a large repository of annotated musical sounds. By comparing these approaches, the analysis aims to showcase the advantages and limitations of each method, providing guidance for developing more accurate and efficient classification systems. Additionally, hybrid model testing and discussion are included. This research aims to support further studies in instrument classification by proposing new approaches and future research directions.
[LG-46] Efficient Model Compression for Bayesian Neural Networks
链接: https://arxiv.org/abs/2411.00273
作者: Diptarka Saha,Zihe Liu,Feng Liang
关键词-EN: learning community recently, Compression has drawn, community recently, deep learning community, Model Compression
类目: Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Model Compression has drawn much attention within the deep learning community recently. Compressing a dense neural network offers many advantages, including lower computation cost, deployability to devices with limited storage and memory, and resistance to adversarial attacks. This may be achieved via weight pruning or fully discarding certain input features. Here we demonstrate a novel strategy to emulate principles of Bayesian model selection in a deep learning setup. Given a fully connected Bayesian neural network with spike-and-slab priors trained via a variational algorithm, we obtain the posterior inclusion probability for every node, information that is typically lost. We employ these probabilities for pruning and feature selection on a host of simulated and real-world benchmark data and find evidence of better generalizability of the pruned model in all our experiments.
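Once posterior inclusion probabilities are available, the pruning rule itself is simple: drop every node whose probability falls below a threshold (0.5 gives the classic "median probability model"). A minimal sketch with made-up probabilities:

```python
import numpy as np

def prune_by_inclusion(weights, inclusion_probs, threshold=0.5):
    """Keep a node only if its posterior inclusion probability exceeds
    the threshold (a median-probability-model style rule)."""
    keep = inclusion_probs >= threshold
    pruned = weights * keep[:, None]   # zero out rows of dropped nodes
    return pruned, keep

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))                        # 6 hidden nodes
pip = np.array([0.99, 0.9, 0.6, 0.4, 0.1, 0.02])   # hypothetical posteriors
W_pruned, kept = prune_by_inclusion(W, pip)
```

The same rule applied to input nodes performs feature selection; the paper's contribution is obtaining these probabilities from the variational spike-and-slab posterior in the first place.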
[LG-47] Unsupervised Feature Selection Algorithm Based on Graph Filtering and Self-representation
链接: https://arxiv.org/abs/2411.00270
作者: Yunhui Liang,Jianwen Gan,Yan Chen,Peng Zhou,Liang Du
关键词-EN: higher-order neighborhood information, higher-order graph information, unsupervised feature selection, Firstly,a higher-order graph, higher-order graph filter
类目: Machine Learning (cs.LG)
*备注: in Chinese language
点击查看摘要
Abstract:Aiming at the problem that existing methods could not fully capture the intrinsic structure of data without considering the higher-order neighborhood information of the data, we proposed an unsupervised feature selection algorithm based on graph filtering and self-representation. Firstly, a higher-order graph filter was applied to the data to obtain its smooth representation, and a regularizer was designed to combine the higher-order graph information for the self-representation matrix learning to capture the intrinsic structure of the data. Secondly, the l2,1 norm was used to reconstruct the error term and feature selection matrix to enhance the robustness and row sparsity of the model to select the discriminative features. Finally, an iterative algorithm was applied to effectively solve the proposed objective function, and simulation experiments were carried out to verify the effectiveness of the proposed algorithm.
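The higher-order graph filtering step can be sketched as repeated application of a low-pass filter built from the normalized Laplacian. The column-norm feature scoring at the end is a crude stand-in for the paper's l2,1-regularized self-representation; the data and graph are toy assumptions.

```python
import numpy as np

def graph_filter(X, A, order=2):
    """Low-pass higher-order graph filter: X_smooth = (I - L/2)^k X,
    with L the symmetric normalized Laplacian of adjacency A."""
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L = np.eye(A.shape[0]) - D_inv_sqrt @ A @ D_inv_sqrt
    H = np.eye(A.shape[0]) - 0.5 * L
    Xs = X
    for _ in range(order):
        Xs = H @ Xs
    return Xs

# Toy data: two clusters of samples; only features 0-1 are informative.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (5, 4)) + [3, 3, 0, 0],
               rng.normal(0, 1, (5, 4)) - [3, 3, 0, 0]])
A = np.zeros((10, 10))
A[:5, :5] = 1
A[5:, 5:] = 1
np.fill_diagonal(A, 0)

X_smooth = graph_filter(X, A, order=3)
# Proxy feature scores: column norms of the smoothed data.
scores = np.linalg.norm(X_smooth, axis=0)
```

Smoothing suppresses the noise-only features while preserving the cluster-aligned ones, so the informative features receive the highest scores.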
[LG-48] Clustering ensemble algorithm with high-order consistency learning
链接: https://arxiv.org/abs/2411.00268
作者: Jianwen Gan,Yan Chen,Peng Zhou,Liang Du
关键词-EN: High-order Consensus learning, low-quality base clusters, base clusters varies, base clusters
类目: Machine Learning (cs.LG)
*备注: in Chinese language
点击查看摘要
Abstract:Most of the research on clustering ensemble focuses on designing practical consistency learning methods. To solve the problems that the quality of base clusters varies and that low-quality base clusters harm the performance of the clustering ensemble, from the perspective of data mining, the intrinsic connections of data were mined based on the base clusters, and a high-order information fusion algorithm was proposed to represent the connections between data from different dimensions, namely Clustering Ensemble with High-order Consensus learning (HCLCE). Firstly, each high-order information was fused into a new structured consistency matrix. Then, the obtained multiple consistency matrices were fused together. Finally, multiple information was fused into a consistent result. Experimental results show that the HCLCE algorithm improves clustering accuracy by an average of 7.22%, and the Normalized Mutual Information (NMI) by an average of 9.19%, compared with the suboptimal Locally Weighted Evidence Accumulation (LWEA) algorithm. It can be seen that the proposed algorithm obtains better clustering results than clustering ensemble algorithms that use one type of information alone.
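A standard building block behind such consensus methods is the co-association matrix: how often each pair of samples is co-clustered across the base clusterings, optionally weighted by base-cluster quality. The sketch below shows that building block only, not HCLCE's high-order fusion; the weights are an assumption.

```python
import numpy as np

def co_association(labelings, weights=None):
    """Weighted co-association matrix: entry (i, j) is the (weighted)
    fraction of base clusterings that put samples i and j together."""
    labelings = np.asarray(labelings)
    m, n = labelings.shape
    weights = np.ones(m) / m if weights is None else np.asarray(weights)
    C = np.zeros((n, n))
    for w, lab in zip(weights, labelings):
        C += w * (lab[:, None] == lab[None, :])
    return C

base = [[0, 0, 1, 1],       # two base clusterings agree ...
        [0, 0, 1, 1],
        [0, 1, 0, 1]]       # ... one low-quality clustering disagrees
# Down-weighting the low-quality base clustering sharpens the consensus.
C = co_association(base, weights=[0.45, 0.45, 0.10])
```

A final clustering (e.g. hierarchical clustering on 1 - C) would then turn the consensus matrix into a consistent result.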
[LG-49] A Systematic Review of NeurIPS Dataset Management Practices
链接: https://arxiv.org/abs/2411.00266
作者: Yiwei Wu,Leah Ajmani,Shayne Longpre,Hanlin Li
关键词-EN: machine learning methods, learning methods demand, methods demand larger, demand larger training, developers face significant
类目: Machine Learning (cs.LG)
*备注: 10 pages, 2 tables
点击查看摘要
Abstract:As new machine learning methods demand larger training datasets, researchers and developers face significant challenges in dataset management. Although ethics reviews, documentation, and checklists have been established, it remains uncertain whether consistent dataset management practices exist across the community. This lack of a comprehensive overview hinders our ability to diagnose and address fundamental tensions and ethical issues related to managing large datasets. We present a systematic review of datasets published at the NeurIPS Datasets and Benchmarks track, focusing on four key aspects: provenance, distribution, ethical disclosure, and licensing. Our findings reveal that dataset provenance is often unclear due to ambiguous filtering and curation processes. Additionally, a variety of sites are used for dataset hosting, but only a few offer structured metadata and version control. These inconsistencies underscore the urgent need for standardized data infrastructures for the publication and management of datasets.
[LG-50] Space for Improvement: Navigating the Design Space for Federated Learning in Satellite Constellations
链接: https://arxiv.org/abs/2411.00263
作者: Grace Kim,Luca Powell,Filip Svoboda,Nicholas Lane
关键词-EN: missions equipping deep, capabilities on-board spacecraft, deep learning capabilities, equipping deep learning, learning capabilities on-board
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Space has emerged as an exciting new application area for machine learning, with several missions equipping deep learning capabilities on-board spacecraft. Pre-processing satellite data through on-board training is necessary to address the satellite downlink deficit, as not enough transmission opportunities are available to match the high rates of data generation. To scale this effort across entire constellations, collaborated training in orbit has been enabled through federated learning (FL). While current explorations of FL in this context have successfully adapted FL algorithms for scenario-specific constraints, these theoretical FL implementations face several limitations that prevent progress towards real-world deployment. To address this gap, we provide a holistic exploration of the FL in space domain on several fronts. 1) We develop a method for space-ification of existing FL algorithms, evaluated on 2) FLySTacK, our novel satellite constellation design and hardware aware testing platform where we perform rigorous algorithm evaluations. Finally we introduce 3) AutoFLSat, a generalized, hierarchical, autonomous FL algorithm for space that provides a 12.5% to 37.5% reduction in model training time than leading alternatives.
[LG-51] Enhancing Diversity in Bayesian Deep Learning via Hyperspherical Energy Minimization of CKA NEURIPS2024
链接: https://arxiv.org/abs/2411.00259
作者: David Smerkous,Qinxun Bai,Fuxin Li
关键词-EN: Particle-based Bayesian deep, Particle-based Bayesian, Bayesian deep learning, Bayesian deep, naive similarity metrics
类目: Machine Learning (cs.LG)
*备注: NeurIPS 2024
点击查看摘要
Abstract:Particle-based Bayesian deep learning often requires a similarity metric to compare two networks. However, naive similarity metrics lack permutation invariance and are inappropriate for comparing networks. Centered Kernel Alignment (CKA) on feature kernels has been proposed to compare deep networks but has not been used as an optimization objective in Bayesian deep learning. In this paper, we explore the use of CKA in Bayesian deep learning to generate diverse ensembles and hypernetworks that output a network posterior. Noting that CKA projects kernels onto a unit hypersphere and that directly optimizing the CKA objective leads to diminishing gradients when two networks are very similar. We propose adopting the approach of hyperspherical energy (HE) on top of CKA kernels to address this drawback and improve training stability. Additionally, by leveraging CKA-based feature kernels, we derive feature repulsive terms applied to synthetically generated outlier examples. Experiments on both diverse ensembles and hypernetworks show that our approach significantly outperforms baselines in terms of uncertainty quantification in both synthetic and realistic outlier detection tasks.
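The similarity metric at the center of this paper, linear CKA, is compact enough to sketch directly. Note how a rotated and rescaled copy of the same features scores ~1, which is exactly the regime where the CKA gradient vanishes and the hyperspherical-energy trick becomes useful.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two feature matrices
    (samples x features); invariant to orthogonal transforms and scaling."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
# A rotated-and-scaled copy of X: CKA should be ~1 despite the transform.
Q, _ = np.linalg.qr(rng.normal(size=(10, 10)))
Y_same = 3.0 * X @ Q
Y_diff = rng.normal(size=(50, 10))

sim_same = linear_cka(X, Y_same)
sim_diff = linear_cka(X, Y_diff)
```

Unlike naive weight-space distances, this comparison is permutation- and rotation-invariant, which is why it is a sensible diversity objective for ensembles.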
[LG-52] SambaMixer: State of Health Prediction of Li-ion Batteries using Mamba State Space Models
链接: https://arxiv.org/abs/2411.00233
作者: José Ignacio Olalde-Verano,Sascha Kirch,Clara Pérez-Molina,Sergio Martin
关键词-EN: determines the remaining, remaining capacity, remaining lifetime, critical parameter, parameter that determines
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:The state of health (SOH) of a Li-ion battery is a critical parameter that determines the remaining capacity and the remaining lifetime of the battery. In this paper, we propose SambaMixer, a novel structured state space model (SSM) for predicting the state of health of Li-ion batteries. The proposed SSM is based on the MambaMixer architecture, which is designed to handle multi-variate time signals. We evaluate our model on the NASA battery discharge dataset and show that our model outperforms the state-of-the-art on this dataset. We further introduce a novel anchor-based resampling method which ensures time signals are of the expected length while also serving as an augmentation technique. Finally, we condition the prediction on the sample time and the cycle time difference using positional encodings to improve the performance of our model and to learn recuperation effects. Our results prove that our model is able to predict the SOH of Li-ion batteries with high accuracy and robustness.
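The idea of anchor-based resampling, mapping an irregularly sampled signal of arbitrary length onto a fixed number of anchor points, can be sketched with linear interpolation. The paper's exact anchor placement (and its use for augmentation via randomized anchors) may differ; this is a minimal assumption-laden illustration.

```python
import numpy as np

def resample_to_length(t, x, target_len):
    """Resample an irregularly sampled signal onto target_len anchors
    spaced uniformly over the observed time span (linear interpolation)."""
    anchors = np.linspace(t[0], t[-1], target_len)
    return anchors, np.interp(anchors, t, x)

# Irregularly sampled, discharge-like curve of arbitrary length.
t = np.sort(np.random.default_rng(0).uniform(0, 10, size=137))
x = np.exp(-0.2 * t)
anchors, x_fixed = resample_to_length(t, x, target_len=64)
```

Every cycle now yields a sequence of exactly 64 values, which is what a fixed-shape sequence model expects.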
[LG-53] BOMP: Bin-Optimized Motion Planning
链接: https://arxiv.org/abs/2411.00221
作者: Zachary Tam,Karthik Dharmarajan,Tianshuang Qiu,Yahav Avigal,Jeffrey Ichnowski,Ken Goldberg
关键词-EN: Motion Planning, compute and execute, increasing productivity, Bin-Optimized Motion Planning, motion planning framework
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:In logistics, the ability to quickly compute and execute pick-and-place motions from bins is critical to increasing productivity. We present Bin-Optimized Motion Planning (BOMP), a motion planning framework that plans arm motions for a six-axis industrial robot with a long-nosed suction tool to remove boxes from deep bins. BOMP considers robot arm kinematics, actuation limits, the dimensions of a grasped box, and a varying height map of a bin environment to rapidly generate time-optimized, jerk-limited, and collision-free trajectories. The optimization is warm-started using a deep neural network trained offline in simulation with 25,000 scenes and corresponding trajectories. Experiments with 96 simulated and 15 physical environments suggest that BOMP generates collision-free trajectories that are up to 58% faster than baseline sampling-based planners and up to 36% faster than an industry-standard Up-Over-Down algorithm, which has an extremely low 15% success rate in this context. BOMP also generates jerk-limited trajectories while baselines do not. Website: this https URL.
[LG-54] MEDS-Tab: Automated tabularization and baseline methods for MEDS datasets
链接: https://arxiv.org/abs/2411.00200
作者: Nassim Oufattole,Teya Bergamaschi,Aleksia Kolo,Hyewon Jeong,Hanna Gaggin,Collin M. Stultz,Matthew B.A. McDermott
关键词-EN: reliably generate high-quality, generate high-quality baseline, supervised learning tasks, high-quality baseline models, electronic health record
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Effective, reliable, and scalable development of machine learning (ML) solutions for structured electronic health record (EHR) data requires the ability to reliably generate high-quality baseline models for diverse supervised learning tasks in an efficient and performant manner. Historically, producing such baseline models has been a largely manual effort–individual researchers would need to decide on the particular featurization and tabularization processes to apply to their individual raw, longitudinal data; and then train a supervised model over those data to produce a baseline result to compare novel methods against, all for just one task and one dataset. In this work, powered by complementary advances in core data standardization through the MEDS framework, we dramatically simplify and accelerate this process of tabularizing irregularly sampled time-series data, providing researchers the ability to automatically and scalably featurize and tabularize their longitudinal EHR data across tens of thousands of individual features, hundreds of millions of clinical events, and diverse windowing horizons and aggregation strategies, all before ultimately leveraging these tabular data to automatically produce high-caliber XGBoost baselines in a highly computationally efficient manner. This system scales to dramatically larger datasets than tabularization tools currently available to the community and enables researchers with any MEDS format dataset to immediately begin producing reliable and performant baseline prediction results on various tasks, with minimal human effort required. This system will greatly enhance the reliability, reproducibility, and ease of development of powerful ML solutions for health problems across diverse datasets and clinical settings.
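The tabularization step, turning irregular longitudinal events into one fixed feature row per patient via windowed aggregations, can be sketched in pure Python. The event schema, window sizes, and feature naming below are hypothetical stand-ins, not the MEDS-Tab format.

```python
from collections import defaultdict
from statistics import mean

# Irregular clinical events: (patient_id, time_in_days, code, value).
events = [
    ("p1", 1.0, "hr", 80.0), ("p1", 2.5, "hr", 90.0), ("p1", 9.0, "hr", 100.0),
    ("p2", 0.5, "hr", 70.0), ("p2", 6.0, "bp", 120.0),
]

def tabularize(events, windows=(7.0, 30.0), prediction_time=10.0):
    """Flatten irregular events into one fixed row per patient:
    count / mean aggregations per code over lookback windows."""
    rows = defaultdict(dict)
    grouped = defaultdict(list)
    for pid, t, code, value in events:
        grouped[(pid, code)].append((t, value))
    for (pid, code), obs in grouped.items():
        for w in windows:
            vals = [v for t, v in obs if prediction_time - w <= t <= prediction_time]
            rows[pid][f"{code}/count/{w:g}d"] = len(vals)
            rows[pid][f"{code}/mean/{w:g}d"] = mean(vals) if vals else None
    return dict(rows)

table = tabularize(events)
```

Each (code, aggregation, window) combination becomes one tabular column, ready to feed a gradient-boosting baseline such as XGBoost.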
[LG-55] Kernel Operator-Theoretic Bayesian Filter for Nonlinear Dynamical Systems
链接: https://arxiv.org/abs/2411.00198
作者: Kan Li,José C. Príncipe
关键词-EN: machine-learning alternative based, Hilbert space, Koopman operator, kernel Hilbert space, Koopman operator theory
类目: Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Motivated by the surge of interest in Koopman operator theory, we propose a machine-learning alternative based on a functional Bayesian perspective for operator-theoretic modeling of unknown, data-driven, nonlinear dynamical systems. This formulation is directly done in an infinite-dimensional space of linear operators or Hilbert space with universal approximation property. The theory of reproducing kernel Hilbert space (RKHS) allows the lifting of nonlinear dynamics to a potentially infinite-dimensional space via linear embeddings, where a general nonlinear function is represented as a set of linear functions or operators in the functional space. This allows us to apply classical linear Bayesian methods such as the Kalman filter directly in the Hilbert space, yielding nonlinear solutions in the original input space. This kernel perspective on the Koopman operator offers two compelling advantages. First, the Hilbert space can be constructed deterministically, agnostic to the nonlinear dynamics. The Gaussian kernel is universal, approximating uniformly an arbitrary continuous target function over any compact domain. Second, Bayesian filter is an adaptive, linear minimum-variance algorithm, allowing the system to update the Koopman operator and continuously track the changes across an extended period of time, ideally suited for modern data-driven applications such as real-time machine learning using streaming data. In this paper, we present several practical implementations to obtain a finite-dimensional approximation of the functional Bayesian filter (FBF). Due to the rapid decay of the Gaussian kernel, excellent approximation is obtained with a small dimension. We demonstrate that this practical approach can obtain accurate results and outperform finite-dimensional Koopman decomposition.
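The central idea, that a nonlinear map becomes linear after lifting into a (finite approximation of a) Gaussian RKHS, can be sketched with random Fourier features. This is a toy batch least-squares fit, not the paper's recursive functional Bayesian filter, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Unknown nonlinear dynamics: x_{k+1} = sin(x_k).
X = rng.uniform(-2, 2, size=400)
Y = np.sin(X)

# Random Fourier features approximate a Gaussian-kernel lift into a
# finite-dimensional surrogate of the RKHS.
D, gamma = 200, 1.0
W = rng.normal(scale=np.sqrt(2 * gamma), size=D)
b = rng.uniform(0, 2 * np.pi, size=D)

def lift(x):
    return np.sqrt(2.0 / D) * np.cos(np.outer(np.atleast_1d(x), W) + b)

# In the lifted space the dynamics are linear: a single least-squares
# solve recovers them (a Kalman-style filter would instead update these
# weights recursively from streaming data).
theta, *_ = np.linalg.lstsq(lift(X), Y, rcond=None)

x1_pred = (lift(1.3) @ theta)[0]   # predicted next state from x0 = 1.3
```

The rapid decay of the Gaussian kernel is what makes such a modest feature dimension sufficient, the property the paper exploits for its finite-dimensional FBF approximations.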
[LG-56] Machine Learning Framework for Audio-Based Content Evaluation using MFCC Chroma Spectral Contrast and Temporal Feature Engineering
链接: https://arxiv.org/abs/2411.00195
作者: Aris J. Aristorenas
关键词-EN: predicting sentiment score, study presents, Mel-Frequency Cepstral Coefficients, sentiment scores, Spectral Contrast
类目: ound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: 6 pages, 6 figures
点击查看摘要
Abstract:This study presents a machine learning framework for assessing similarity between audio content and predicting sentiment score. We construct a dataset containing audio samples from music covers on YouTube along with the audio of the original song, and sentiment scores derived from user comments, serving as proxy labels for content quality. Our approach involves extensive pre-processing, segmenting audio signals into 30-second windows, and extracting high-dimensional feature representations through Mel-Frequency Cepstral Coefficients (MFCC), Chroma, Spectral Contrast, and Temporal characteristics. Leveraging these features, we train regression models to predict sentiment scores on a 0-100 scale, achieving root mean square error (RMSE) values of 3.420, 5.482, 2.783, and 4.212, respectively. Improvements over a baseline model based on absolute difference metrics are observed. These results demonstrate the potential of machine learning to capture sentiment and similarity in audio, offering an adaptable framework for AI applications in media analysis.
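The windowed feature-extraction stage can be sketched with numpy alone: frame the signal, take per-frame magnitude spectra, and derive summary features. The spectral-contrast proxy below (peak-to-mean log ratio) is a deliberate simplification of the standard sub-band definition, and no MFCC/Chroma code is shown.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Slice a 1-D signal into overlapping frames (rows)."""
    n = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n)[:, None]
    return x[idx]

def spectral_features(x, sr, frame_len=1024, hop=512):
    """Per-frame spectral centroid and a crude spectral-contrast proxy."""
    frames = frame_signal(x, frame_len, hop) * np.hanning(frame_len)
    mag = np.abs(np.fft.rfft(frames, axis=1))
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sr)
    centroid = (mag @ freqs) / (mag.sum(axis=1) + 1e-12)
    contrast = np.log(mag.max(axis=1) + 1e-12) - np.log(mag.mean(axis=1) + 1e-12)
    return centroid, contrast

sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440.0 * t)          # pure 440 Hz tone, 1 second
centroid, contrast = spectral_features(tone, sr)
```

Stacking such per-frame features over 30-second windows yields the high-dimensional representations that the regression models then map to sentiment scores.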
[LG-57] APEBench: A Benchmark for Autoregressive Neural Emulators of PDEs NEURIPS2024
链接: https://arxiv.org/abs/2411.00180
作者: Felix Koehler,Simon Niedermayr,Rüdiger Westermann,Nils Thuerey
关键词-EN: partial differential equations, solving partial differential, evaluate autoregressive neural, Autoregressive PDE Emulator, comprehensive benchmark suite
类目: Machine Learning (cs.LG)
*备注: Accepted at Neurips 2024. The code is available at this https URL and APEBench can be installed via “pip install apebench”
点击查看摘要
Abstract:We introduce the Autoregressive PDE Emulator Benchmark (APEBench), a comprehensive benchmark suite to evaluate autoregressive neural emulators for solving partial differential equations. APEBench is based on JAX and provides a seamlessly integrated differentiable simulation framework employing efficient pseudo-spectral methods, enabling 46 distinct PDEs across 1D, 2D, and 3D. Facilitating systematic analysis and comparison of learned emulators, we propose a novel taxonomy for unrolled training and introduce a unique identifier for PDE dynamics that directly relates to the stability criteria of classical numerical methods. APEBench enables the evaluation of diverse neural architectures, and unlike existing benchmarks, its tight integration of the solver enables support for differentiable physics training and neural-hybrid emulators. Moreover, APEBench emphasizes rollout metrics to understand temporal generalization, providing insights into the long-term behavior of emulating PDE dynamics. In several experiments, we highlight the similarities between neural emulators and numerical simulators.
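The core operation of the pseudo-spectral methods mentioned above is differentiation in Fourier space: multiply by i*k, transform back. A minimal numpy sketch (the benchmark itself is built on JAX, so this is only illustrative):

```python
import numpy as np

def spectral_derivative(u, length=2 * np.pi):
    """First derivative of a periodic signal via the FFT,
    the core operation of a pseudo-spectral PDE solver."""
    n = u.size
    k = np.fft.fftfreq(n, d=length / n) * 2 * np.pi   # angular wavenumbers
    return np.real(np.fft.ifft(1j * k * np.fft.fft(u)))

x = np.linspace(0, 2 * np.pi, 128, endpoint=False)
du = spectral_derivative(np.sin(x))                    # should be cos(x)
```

For smooth periodic fields this is spectrally accurate, which is why a pseudo-spectral reference solver provides trustworthy ground truth for evaluating neural emulators.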
[LG-58] What Makes An Expert? Reviewing How ML Researchers Define “Expert”
链接: https://arxiv.org/abs/2411.00179
作者: Mark Díaz,Angela DR Smith
关键词-EN: evaluate system performance, Human experts, consult on algorithm, machine learning systems, collect and validate
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注:
点击查看摘要
Abstract:Human experts are often engaged in the development of machine learning systems to collect and validate data, consult on algorithm development, and evaluate system performance. At the same time, who counts as an ‘expert’ and what constitutes ‘expertise’ is not always explicitly defined. In this work, we review 112 academic publications that explicitly reference ‘expert’ and ‘expertise’ and that describe the development of machine learning (ML) systems to survey how expertise is characterized and the role experts play. We find that expertise is often undefined and forms of knowledge outside of formal education and professional certification are rarely sought, which has implications for the kinds of knowledge that are recognized and legitimized in ML development. Moreover, we find that expert knowledge tends to be utilized in ways focused on mining textbook knowledge, such as through data annotation. We discuss the ways experts are engaged in ML development in relation to deskilling, the social construction of expertise, and implications for responsible AI development. We point to a need for reflection and specificity in justifications of domain expert engagement, both as a matter of documentation and reproducibility, as well as a matter of broadening the range of recognized expertise.
[LG-59] EARL-BO: Reinforcement Learning for Multi-Step Lookahead High-Dimensional Bayesian Optimization
链接: https://arxiv.org/abs/2411.00171
作者: Mujin Cheon,Jay H. Lee,Dong-Yeun Koh,Calvin Tsay
关键词-EN: primarily involve one-step, one-step optimal decisions, maximizing expected improvement, involve one-step optimal, Conventional methods
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 14 pages
点击查看摘要
Abstract:Conventional methods for Bayesian optimization (BO) primarily involve one-step optimal decisions (e.g., maximizing expected improvement of the next step). To avoid myopic behavior, multi-step lookahead BO algorithms such as rollout strategies consider the sequential decision-making nature of BO, i.e., as a stochastic dynamic programming (SDP) problem, demonstrating promising results in recent years. However, owing to the curse of dimensionality, most of these methods make significant approximations or suffer scalability issues, e.g., being limited to two-step lookahead. This paper presents a novel reinforcement learning (RL)-based framework for multi-step lookahead BO in high-dimensional black-box optimization problems. The proposed method enhances the scalability and decision-making quality of multi-step lookahead BO by efficiently solving the SDP of the BO process in a near-optimal manner using RL. We first introduce an Attention-DeepSets encoder to represent the state of knowledge to the RL agent and employ off-policy learning to accelerate its initial training. We then propose a multi-task, fine-tuning procedure based on end-to-end (encoder-RL) on-policy learning. We evaluate the proposed method, EARL-BO (Encoder Augmented RL for Bayesian Optimization), on both synthetic benchmark functions and real-world hyperparameter optimization problems, demonstrating significantly improved performance compared to existing multi-step lookahead and high-dimensional BO methods.
[LG-60] Mutual Information Preserving Neural Network Pruning
链接: https://arxiv.org/abs/2411.00147
作者: Charles Westphal,Stephen Hailes,Mirco Musolesi
关键词-EN: attracting increasing interest, consumption and costs, attracting increasing, increasing interest, positive implications
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Model pruning is attracting increasing interest because of its positive implications in terms of resource consumption and costs. A variety of methods have been developed in the past years. In particular, structured pruning techniques discern the importance of nodes in neural networks (NNs) and filters in convolutional neural networks (CNNs). Global versions of these rank all nodes in a network and select the top-k, offering an advantage over local methods that rank nodes only within individual layers. By evaluating all nodes simultaneously, global techniques provide greater control over the network architecture, which improves performance. However, the ranking and selecting process carried out during global pruning can have several major drawbacks. First, the ranking is not updated in real time based on the pruning already performed, making it unable to account for inter-node interactions. Second, it is not uncommon for whole layers to be removed from a model, which leads to untrainable networks. Lastly, global pruning methods do not offer any guarantees regarding re-training. In order to address these issues, we introduce Mutual Information Preserving Pruning (MIPP). The fundamental principle of our method is to select nodes such that the mutual information (MI) between the activations of adjacent layers is maintained. We evaluate MIPP on an array of vision models and datasets, including a pre-trained ResNet50 on ImageNet, where we demonstrate MIPP’s ability to outperform state-of-the-art methods. The implementation of MIPP will be made available upon publication.
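MIPP's guiding quantity, the mutual information between activations of adjacent layers, can be estimated in its simplest form with a 2D histogram. The sketch below is a generic plug-in MI estimator, not the paper's method:

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """Histogram (plug-in) estimate of the mutual information I(X; Y) in nats."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of X
    py = pxy.sum(axis=0, keepdims=True)   # marginal of Y
    nz = pxy > 0                          # avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

rng = np.random.default_rng(0)
x = rng.standard_normal(50_000)
mi_dependent = mutual_information(x, x + 0.1 * rng.standard_normal(50_000))
mi_independent = mutual_information(x, rng.standard_normal(50_000))
```

A pruning criterion in this spirit would keep the nodes whose removal least reduces the MI between consecutive layers' activations.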
[LG-61] Learning local discrete features in explainable-by-design convolutional neural networks
链接: https://arxiv.org/abs/2411.00139
作者: Pantelis I. Kaplanoglou,Konstantinos Diamantaras
关键词-EN: lateral inhibition mechanism, proposed framework attempts, convolutional neural network, convolutional neural, inhibition mechanism
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Our proposed framework attempts to break the trade-off between performance and explainability by introducing an explainable-by-design convolutional neural network (CNN) based on the lateral inhibition mechanism. The ExplaiNet model consists of the predictor, that is a high-accuracy CNN with residual or dense skip connections, and the explainer probabilistic graph that expresses the spatial interactions of the network neurons. The value on each graph node is a local discrete feature (LDF) vector, a patch descriptor that represents the indices of antagonistic neurons ordered by the strength of their activations, which are learned with gradient descent. Using LDFs as sequences we can increase the conciseness of explanations by repurposing EXTREME, an EM-based sequence motif discovery method that is typically used in molecular biology. Having a discrete feature motif matrix for each one of intermediate image representations, instead of a continuous activation tensor, allows us to leverage the inherent explainability of Bayesian networks. By collecting observations and directly calculating probabilities, we can explain causal relationships between motifs of adjacent levels and attribute the model’s output to global motifs. Moreover, experiments on various tiny image benchmark datasets confirm that our predictor ensures the same level of performance as the baseline architecture for a given count of parameters and/or layers. Our novel method shows promise to exceed this performance while providing an additional stream of explanations. In the solved MNIST classification task, it reaches a comparable to the state-of-the-art performance for single models, using standard training setup and 0.75 million parameters.
[LG-62] Cost-Aware Query Policies in Active Learning for Efficient Autonomous Robotic Exploration
链接: https://arxiv.org/abs/2411.00137
作者: Sapphira Akins,Hans Mertens,Frances Zhu
关键词-EN: efficient data collection, finite resources, efficient data, collection is critical, action cost
类目: Robotics (cs.RO); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:In missions constrained by finite resources, efficient data collection is critical. Informative path planning, driven by automated decision-making, optimizes exploration by reducing the costs associated with accurate characterization of a target in an environment. Previous implementations of active learning did not consider the action cost for regression problems or only considered the action cost for classification problems. This paper analyzes an AL algorithm for Gaussian Process regression while incorporating action cost. The algorithm’s performance is compared on various regression problems, including terrain mapping on diverse simulated surfaces, along metrics of root mean square error, samples and distance until convergence, and model variance upon convergence. The cost-dependent acquisition policy doesn’t organically optimize information gain over distance. Instead, the traditional uncertainty metric with a distance constraint best minimizes root-mean-square error over trajectory distance. This study’s impact is to provide insight into incorporating action cost with AL methods to optimize exploration under realistic mission constraints.
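A distance-constrained uncertainty policy of the kind the study finds most effective can be sketched as follows (a toy acquisition rule; the candidate set, variances, and travel budget are made-up inputs, not the paper's setup):

```python
import numpy as np

def next_query(candidates, variances, current_pos, max_travel):
    """Pick the candidate with highest predictive variance within a travel budget.

    This mirrors a distance-constrained uncertainty policy: rank by model
    uncertainty, but only among points reachable within `max_travel`.
    """
    dists = np.linalg.norm(candidates - current_pos, axis=1)
    reachable = dists <= max_travel
    if not reachable.any():
        return int(np.argmin(dists))  # fall back to the nearest point
    masked = np.where(reachable, variances, -np.inf)
    return int(np.argmax(masked))

candidates = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
variances = np.array([0.1, 0.5, 2.0])  # hypothetical GP predictive variances
choice = next_query(candidates, variances,
                    current_pos=np.array([0.0, 0.0]), max_travel=2.0)
```

Note the far point at (5, 5) has the highest variance but is outside the budget, so the policy settles for the best reachable candidate.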
[LG-63] LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators
链接: https://arxiv.org/abs/2411.00136
作者: Krishna Teja Chitty-Venkata,Siddhisanket Raskar,Bharat Kale,Farah Ferdaus,Aditya Tanikanti,Ken Raffenetti,Valerie Taylor,Murali Emani,Venkatram Vishwanath
关键词-EN: Large Language Models, Large Language, text generation applications, propelled groundbreaking advancements, Language Models
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have propelled groundbreaking advancements across several domains and are commonly used for text generation applications. However, the computational demands of these complex models pose significant challenges, requiring efficient hardware acceleration. Benchmarking the performance of LLMs across diverse hardware platforms is crucial to understanding their scalability and throughput characteristics. We introduce LLM-Inference-Bench, a comprehensive benchmarking suite to evaluate the hardware inference performance of LLMs. We thoroughly analyze diverse hardware platforms, including GPUs from Nvidia and AMD and specialized AI accelerators, Intel Habana and SambaNova. Our evaluation includes several LLM inference frameworks and models from LLaMA, Mistral, and Qwen families with 7B and 70B parameters. Our benchmarking results reveal the strengths and limitations of various models, hardware platforms, and inference frameworks. We provide an interactive dashboard to help identify configurations for optimal performance for a given hardware platform.
[LG-64] Soft Condorcet Optimization for Ranking of General Agents
链接: https://arxiv.org/abs/2411.00119
作者: Marc Lanctot,Kate Larson,Michael Kaisers,Quentin Berthet,Ian Gemp,Manfred Diaz,Roberto-Rafael Maura-Rivero,Yoram Bachrach,Anna Koop,Doina Precup
关键词-EN: Soft Condorcet Optimization, standardized benchmarks, optimal ranking, drive progress, SCO
类目: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:A common way to drive progress of AI models and agents is to compare their performance on standardized benchmarks. Comparing the performance of general agents requires aggregating their individual performances across a potentially wide variety of different tasks. In this paper, we describe a novel ranking scheme inspired by social choice frameworks, called Soft Condorcet Optimization (SCO), to compute the optimal ranking of agents: the one that makes the fewest mistakes in predicting the agent comparisons in the evaluation data. This optimal ranking is the maximum likelihood estimate when evaluation data (which we view as votes) are interpreted as noisy samples from a ground truth ranking, a solution to Condorcet’s original voting system criteria. SCO ratings are maximal for Condorcet winners when they exist, which we show is not necessarily true for the classical rating system Elo. We propose three optimization algorithms to compute SCO ratings and evaluate their empirical performance. When serving as an approximation to the Kemeny-Young voting method, SCO rankings are on average 0 to 0.043 away from the optimal ranking in normalized Kendall-tau distance across 865 preference profiles from the PrefLib open ranking archive. In a simulated noisy tournament setting, SCO achieves accurate approximations to the ground truth ranking and the best among several baselines when 59% or more of the preference data is missing. Finally, SCO ranking provides the best approximation to the optimal ranking, measured on held-out test sets, in a problem containing 52,958 human players across 31,049 games of the classic seven-player game of Diplomacy.
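The core objective of SCO, counting how many pairwise comparisons a ranking predicts wrongly, and the normalized Kendall-tau distance used for evaluation can both be written in a few lines. The brute-force search below is only viable for tiny agent sets and is not the paper's optimization algorithm:

```python
from itertools import combinations, permutations

def pairwise_mistakes(ranking, votes):
    """Count votes (winner, loser) that the ranking orders the wrong way."""
    pos = {agent: i for i, agent in enumerate(ranking)}
    return sum(1 for winner, loser in votes if pos[winner] > pos[loser])

def best_ranking(agents, votes):
    """Brute-force the ranking with fewest mistakes (fine for tiny problems)."""
    return min(permutations(agents), key=lambda r: pairwise_mistakes(r, votes))

def kendall_tau_distance(r1, r2):
    """Normalized Kendall-tau distance: fraction of discordant pairs."""
    p1 = {a: i for i, a in enumerate(r1)}
    p2 = {a: i for i, a in enumerate(r2)}
    pairs = list(combinations(r1, 2))
    discordant = sum(1 for a, b in pairs if (p1[a] - p1[b]) * (p2[a] - p2[b]) < 0)
    return discordant / len(pairs)

votes = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "B")]  # one noisy vote
ranking = best_ranking(["A", "B", "C"], votes)
dist = kendall_tau_distance(ranking, ("A", "B", "C"))
```

With the contradictory votes (B, C) and (C, B), no ranking is mistake-free; the best one makes exactly one mistake.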
[LG-65] Derivative-Free Optimization via Finite Difference Approximation: An Experimental Study
链接: https://arxiv.org/abs/2411.00112
作者: Wang Du-Yi,Liang Guo,Liu Guangwu,Zhang Kun
关键词-EN: noisy function evaluations, solving complex optimization, complex optimization problems, Derivative-free optimization, vital in solving
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
点击查看摘要
Abstract:Derivative-free optimization (DFO) is vital in solving complex optimization problems where only noisy function evaluations are available through an oracle. Within this domain, DFO via finite difference (FD) approximation has emerged as a powerful method. Two classical approaches are the Kiefer-Wolfowitz (KW) and simultaneous perturbation stochastic approximation (SPSA) algorithms, which estimate gradients using just two samples in each iteration to conserve samples. However, this approach yields imprecise gradient estimators, necessitating diminishing step sizes to ensure convergence, often resulting in slow optimization progress. In contrast, FD estimators constructed from batch samples approximate gradients more accurately. While gradient descent algorithms using batch-based FD estimators achieve more precise results in each iteration, they require more samples and permit fewer iterations. This raises a fundamental question: which approach is more effective – KW-style methods or DFO with batch-based FD estimators? This paper conducts a comprehensive experimental comparison among these approaches, examining the fundamental trade-off between gradient estimation accuracy and iteration steps. Through extensive experiments in both low-dimensional and high-dimensional settings, we demonstrate a surprising finding: when an efficient batch-based FD estimator is applied, its corresponding gradient descent algorithm generally shows better performance compared to classical KW and SPSA algorithms in our tested scenarios.
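The two estimator families being compared can be sketched directly: a two-sample SPSA-style estimate versus coordinate-wise central differences. This is a noiseless toy, not the paper's experimental setup:

```python
import numpy as np

def spsa_gradient(f, x, c, rng):
    """Two-sample simultaneous-perturbation estimate of grad f at x."""
    delta = rng.choice([-1.0, 1.0], size=x.shape)  # Rademacher perturbation
    return (f(x + c * delta) - f(x - c * delta)) / (2 * c) * (1.0 / delta)

def batch_fd_gradient(f, x, c):
    """Central differences along each coordinate: 2*d evaluations, more accurate."""
    grad = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = c
        grad[i] = (f(x + e) - f(x - e)) / (2 * c)
    return grad

f = lambda x: np.sum(x**2)  # noiseless quadratic; true gradient is 2x
x = np.array([1.0, -2.0, 3.0])
g_batch = batch_fd_gradient(f, x, c=1e-4)
g_spsa = spsa_gradient(f, x, c=1e-4, rng=np.random.default_rng(0))
```

SPSA spends two evaluations regardless of dimension but its single-draw estimate is noisy; the batch estimator spends 2d evaluations and is far more accurate per iteration, which is exactly the trade-off the paper studies.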
[LG-66] Lagrangian neural networks for nonholonomic mechanics
链接: https://arxiv.org/abs/2411.00110
作者: Viviana Alejandra Diaz,Leandro Martin Salomone,Marcela Zuccalli
关键词-EN: Lagrangian Neural Networks, addressing physical systems, conservation laws, powerful tool, tool for addressing
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Emerging Technologies (cs.ET); Neural and Evolutionary Computing (cs.NE)
*备注:
点击查看摘要
Abstract:Lagrangian Neural Networks (LNNs) are a powerful tool for addressing physical systems, particularly those governed by conservation laws. LNNs can parametrize the Lagrangian of a system to predict trajectories with nearly conserved energy. These techniques have proven effective in unconstrained systems as well as those with holonomic constraints. In this work, we adapt LNN techniques to mechanical systems with nonholonomic constraints. We test our approach on some well-known examples with nonholonomic constraints, showing that incorporating these restrictions into the neural network’s learning improves not only trajectory estimation accuracy but also ensures adherence to constraints and exhibits better energy behavior compared to the unconstrained counterpart.
[LG-67] First, Learn What You Don't Know: Active Information Gathering for Driving at the Limits of Handling
链接: https://arxiv.org/abs/2411.00107
作者: Alexander Davydov,Franck Djeumou,Marcus Greiff,Makoto Suminaka,Michael Thompson,John Subosits,Thomas Lew
关键词-EN: Combining data-driven models, Combining data-driven, enabled effective control, enabled effective, nonlinear systems
类目: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
点击查看摘要
Abstract:Combining data-driven models that adapt online and model predictive control (MPC) has enabled effective control of nonlinear systems. However, when deployed on unstable systems, online adaptation may not be fast enough to ensure reliable simultaneous learning and control. For example, controllers on a vehicle executing highly dynamic maneuvers may push the tires to their friction limits, destabilizing the vehicle and allowing modeling errors to quickly compound and cause a loss of control. In this work, we present a Bayesian meta-learning MPC framework. We propose an expressive vehicle dynamics model that leverages Bayesian last-layer meta-learning to enable rapid online adaptation. The model’s uncertainty estimates are used to guide informative data collection and quickly improve the model prior to deployment. Experiments on a Toyota Supra show that (i) the framework enables reliable control in dynamic drifting maneuvers, (ii) online adaptation alone may not suffice for zero-shot control of a vehicle at the edge of stability, and (iii) active data collection helps achieve reliable performance.
[LG-68] Label Noise: Ignorance Is Bliss
链接: https://arxiv.org/abs/2411.00079
作者: Yilun Zhu,Jianxin Zhang,Aditya Gangrade,Clayton Scott
关键词-EN: instance-dependent label noise, label noise, instance-dependent label, Noise Ignorant Empirical, noise
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:We establish a new theoretical framework for learning under multi-class, instance-dependent label noise. This framework casts learning with label noise as a form of domain adaptation, in particular, domain adaptation under posterior drift. We introduce the concept of relative signal strength (RSS), a pointwise measure that quantifies the transferability from noisy to clean posterior. Using RSS, we establish nearly matching upper and lower bounds on the excess risk. Our theoretical findings support the simple Noise Ignorant Empirical Risk Minimization (NI-ERM) principle, which minimizes empirical risk while ignoring label noise. Finally, we translate this theoretical insight into practice: by using NI-ERM to fit a linear classifier on top of a self-supervised feature extractor, we achieve state-of-the-art performance on the CIFAR-N data challenge.
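The NI-ERM principle is deliberately simple: fit as if the labels were clean. A minimal sketch on synthetic data with symmetric label flips (the 20% flip rate and the least-squares fit are illustrative choices, not the paper's experiments):

```python
import numpy as np

def ni_erm_linear(X, y_noisy):
    """Fit a linear classifier by least squares, ignoring label noise entirely.

    y_noisy holds +/-1 labels, some of which are flipped; NI-ERM simply
    minimizes empirical risk on them as if they were clean.
    """
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias column
    w, *_ = np.linalg.lstsq(Xb, y_noisy, rcond=None)
    return w

def predict(w, X):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return np.sign(Xb @ w)

rng = np.random.default_rng(0)
X = rng.standard_normal((2_000, 2))
y_clean = np.sign(X[:, 0] + 0.5 * X[:, 1])
flip = rng.random(2_000) < 0.2                # 20% symmetric label flips
y_noisy = np.where(flip, -y_clean, y_clean)
w = ni_erm_linear(X, y_noisy)
accuracy = float(np.mean(predict(w, X) == y_clean))
```

Under symmetric noise the regression target is merely scaled toward zero, so the fitted decision boundary stays close to the clean one, which is why ignoring the noise can work well.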
[LG-69] boldsymbolmumathbfP2: Effective Sharpness Aware Minimization Requires Layerwise Perturbation Scaling
链接: https://arxiv.org/abs/2411.00075
作者: Moritz Haas,Jin Xu,Volkan Cevher,Leena Chennuru Vankadara
关键词-EN: Sharpness Aware Minimization, Sharpness Aware, Aware Minimization, architectures and datasets, SAM
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Sharpness Aware Minimization (SAM) enhances performance across various neural architectures and datasets. As models are continually scaled up to improve performance, a rigorous understanding of SAM’s scaling behaviour is paramount. To this end, we study the infinite-width limit of neural networks trained with SAM, using the Tensor Programs framework. Our findings reveal that the dynamics of standard SAM effectively reduce to applying SAM solely in the last layer in wide neural networks, even with optimal hyperparameters. In contrast, we identify a stable parameterization with layerwise perturbation scaling, which we call Maximal Update and Perturbation Parameterization ($\mu P^2$), that ensures all layers are both feature learning and effectively perturbed in the limit. Through experiments with MLPs, ResNets and Vision Transformers, we empirically demonstrate that $\mu P^2$ is the first parameterization to achieve hyperparameter transfer of the joint optimum of learning rate and perturbation radius across model scales. Moreover, we provide an intuitive condition to derive $\mu P^2$ for other perturbation rules like Adaptive SAM and SAM-ON, also ensuring balanced perturbation effects across all layers.
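A standard SAM update, ascend to a worst-case perturbation within an L2 ball, then descend using the gradient computed there, can be sketched on a toy quadratic (this shows vanilla SAM only, not the $\mu P^2$ parameterization):

```python
import numpy as np

def sam_step(grad_fn, w, lr=0.1, rho=0.05):
    """One Sharpness-Aware Minimization step.

    First move to the (first-order) worst-case point within an L2 ball of
    radius rho, then apply the gradient computed there to the original weights.
    """
    g = grad_fn(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # perturbation toward higher loss
    g_sam = grad_fn(w + eps)                     # gradient at the perturbed point
    return w - lr * g_sam

grad = lambda w: 2 * w            # gradient of the quadratic loss ||w||^2
w = np.array([1.0, -1.0])
w_next = sam_step(grad, w)
```

On this quadratic the update shrinks the weights slightly more than plain gradient descent would, since the perturbed gradient is steeper; the paper's question is how `rho` must scale per layer as width grows.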
[LG-70] Nonparametric estimation of Hawkes processes with RKHSs
链接: https://arxiv.org/abs/2411.00621
作者: Anna Bonnet,Maxime Sangnier
关键词-EN: multivariate Hawkes processes, kernel Hilbert space, nonlinear multivariate Hawkes, reproducing kernel Hilbert, Hawkes processes
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:
点击查看摘要
Abstract:This paper addresses nonparametric estimation of nonlinear multivariate Hawkes processes, where the interaction functions are assumed to lie in a reproducing kernel Hilbert space (RKHS). Motivated by applications in neuroscience, the model allows complex interaction functions, in order to express exciting and inhibiting effects, but also a combination of both (which is particularly interesting to model the refractory period of neurons), and considers in return that conditional intensities are rectified by the ReLU function. The latter feature incurs several methodological challenges, for which workarounds are proposed in this paper. In particular, it is shown that a representer theorem can be obtained for approximated versions of the log-likelihood and the least-squares criteria. Based on it, we propose an estimation method, that relies on two simple approximations (of the ReLU function and of the integral operator). We provide an approximation bound, justifying the negligible statistical effect of these approximations. Numerical results on synthetic data confirm this fact as well as the good asymptotic behavior of the proposed estimator. It also shows that our method achieves a better performance compared to related nonparametric estimation techniques and suits neuronal applications.
[LG-71] Small coresets via negative dependence: DPPs linear statistics and concentration NEURIPS2024
链接: https://arxiv.org/abs/2411.00611
作者: Rémi Bardenet,Subhroshekhar Ghosh,Hugo Simon-Onfroy,Hoang-Son Tran
关键词-EN: Determinantal point processes, tunable negative dependence, Determinantal point, point processes, negative dependence
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
*备注: Accepted at NeurIPS 2024 (Spotlight Paper). Authors are listed in alphabetical order
点击查看摘要
Abstract:Determinantal point processes (DPPs) are random configurations of points with tunable negative dependence. Because sampling is tractable, DPPs are natural candidates for subsampling tasks, such as minibatch selection or coreset construction. A coreset is a subset of a (large) training set, such that minimizing an empirical loss averaged over the coreset is a controlled replacement for the intractable minimization of the original empirical loss. Typically, the control takes the form of a guarantee that the average loss over the coreset approximates the total loss uniformly across the parameter space. Recent work has provided significant empirical support in favor of using DPPs to build randomized coresets, coupled with interesting theoretical results that are suggestive but leave some key questions unanswered. In particular, the central question of whether the cardinality of a DPP-based coreset is fundamentally smaller than one based on independent sampling remained open. In this paper, we answer this question in the affirmative, demonstrating that DPPs can provably outperform independently drawn coresets. In this vein, we contribute a conceptual understanding of coreset loss as a linear statistic of the (random) coreset. We leverage this structural observation to connect the coresets problem to a more general problem of concentration phenomena for linear statistics of DPPs, wherein we obtain effective concentration inequalities that extend well beyond the state-of-the-art, encompassing general non-projection, even non-symmetric kernels. The latter have been recently shown to be of interest in machine learning beyond coresets, but come with a limited theoretical toolbox, to the extension of which our result contributes. Finally, we are also able to address the coresets problem for vector-valued objective functions, a novelty in the coresets literature.
[LG-72] Constrained Sampling with Primal-Dual Langevin Monte Carlo NEURIPS2024
链接: https://arxiv.org/abs/2411.00568
作者: Luiz F. O. Chamon,Mohammad Reza Karimi,Anna Korba
关键词-EN: general nonlinear functions, nonlinear functions, normalization constant, constant while satisfying, satisfying a set
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 39 pages, 14 figures. Published at NeurIPS 2024
点击查看摘要
Abstract:This work considers the problem of sampling from a probability distribution known up to a normalization constant while satisfying a set of statistical constraints specified by the expected values of general nonlinear functions. This problem finds applications in, e.g., Bayesian inference, where it can constrain moments to evaluate counterfactual scenarios or enforce desiderata such as prediction fairness. Methods developed to handle support constraints, such as those based on mirror maps, barriers, and penalties, are not suited for this task. This work therefore relies on gradient descent-ascent dynamics in Wasserstein space to put forward a discrete-time primal-dual Langevin Monte Carlo algorithm (PD-LMC) that simultaneously constrains the target distribution and samples from it. We analyze the convergence of PD-LMC under standard assumptions on the target distribution and constraints, namely (strong) convexity and log-Sobolev inequalities. To do so, we bring classical optimization arguments for saddle-point algorithms to the geometry of Wasserstein space. We illustrate the relevance and effectiveness of PD-LMC in several applications.
[LG-73] PatternBoost: Constructions in Mathematics with a Little Help from AI
链接: https://arxiv.org/abs/2411.00566
作者: François Charton,Jordan S. Ellenberg,Adam Zsolt Wagner,Geordie Williamson
关键词-EN: finding interesting constructions, flexible method, method for finding, finding interesting, interesting constructions
类目: Combinatorics (math.CO); Machine Learning (cs.LG)
*备注: 32 pages
点击查看摘要
Abstract:We introduce PatternBoost, a flexible method for finding interesting constructions in mathematics. Our algorithm alternates between two phases. In the first “local” phase, a classical search algorithm is used to produce many desirable constructions. In the second “global” phase, a transformer neural network is trained on the best such constructions. Samples from the trained transformer are then used as seeds for the first phase, and the process is repeated. We give a detailed introduction to this technique, and discuss the results of its application to several problems in extremal combinatorics. The performance of PatternBoost varies across different problems, but there are many situations where its performance is quite impressive. Using our technique, we find the best known solutions to several long-standing problems, including the construction of a counterexample to a conjecture that had remained open for 30 years.
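The alternation PatternBoost describes can be caricatured in a few lines, with the transformer of the global phase replaced by a toy surrogate that reseeds from prefixes of the best constructions (everything here is a stand-in; the real method trains a transformer on the best constructions):

```python
import random

def local_search(seed_bits, score, n_flips=200, rng=None):
    """Local phase: greedy bit-flip hill climbing from a seed construction."""
    rng = rng or random
    best = list(seed_bits)
    for _ in range(n_flips):
        i = rng.randrange(len(best))
        cand = best[:]
        cand[i] ^= 1
        if score(cand) >= score(best):
            best = cand
    return best

def pattern_boost(score, n_bits=20, pop=16, rounds=3, seed=0):
    """PatternBoost-style alternation with a toy surrogate for the global phase:
    reseed from prefixes of the elite constructions instead of a transformer."""
    rng = random.Random(seed)
    seeds = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop)]
    for _ in range(rounds):
        results = [local_search(s, score, rng=rng) for s in seeds]   # local phase
        results.sort(key=score, reverse=True)
        elite = results[: pop // 4]                                  # "train" on the best
        seeds = [e[: n_bits // 2] + [rng.randint(0, 1) for _ in range(n_bits // 2)]
                 for e in elite for _ in range(4)]                    # sample new seeds
    return max(results, key=score)

best = pattern_boost(score=sum)  # toy objective: maximize the number of ones
```

On this trivial objective the loop converges to (nearly) all ones; the interesting behavior in the paper comes from hard combinatorial objectives where the learned global phase discovers structure the local search alone misses.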
[LG-74] Dirichlet process mixtures of block g priors for model selection and prediction in linear models
链接: https://arxiv.org/abs/2411.00471
作者: Anupreet Porwal,Abel Rodriguez
关键词-EN: Dirichlet process mixtures, paper introduces Dirichlet, introduces Dirichlet process, Dirichlet process, process mixtures
类目: Methodology (stat.ME); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:This paper introduces Dirichlet process mixtures of block g priors for model selection and prediction in linear models. These priors are extensions of traditional mixtures of g priors that allow for differential shrinkage for various (data-selected) blocks of parameters while fully accounting for the predictors’ correlation structure, providing a bridge between the literatures on model selection and continuous shrinkage priors. We show that Dirichlet process mixtures of block g priors are consistent in various senses and, in particular, that they avoid the conditional Lindley “paradox” highlighted by Som et al. (2016). Further, we develop a Markov chain Monte Carlo algorithm for posterior inference that requires only minimal ad-hoc tuning. Finally, we investigate the empirical performance of the prior in various real and simulated datasets. In the presence of a small number of very large effects, Dirichlet process mixtures of block g priors lead to higher power for detecting smaller but significant effects with only a minimal increase in the number of false discoveries.
[LG-75] A Lorentz-Equivariant Transformer for All of the LHC
链接: https://arxiv.org/abs/2411.00446
作者: Johann Brehmer,Víctor Bresó,Pim de Haan,Tilman Plehn,Huilin Qu,Jonas Spinner,Jesse Thaler
关键词-EN: Large Hadron Collider, Hadron Collider, Large Hadron, Geometric Algebra Transformer, Lorentz-Equivariant Geometric Algebra
类目: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex)
*备注: 26 pages, 7 figures, 8 tables
点击查看摘要
Abstract:We show that the Lorentz-Equivariant Geometric Algebra Transformer (L-GATr) yields state-of-the-art performance for a wide range of machine learning tasks at the Large Hadron Collider. L-GATr represents data in a geometric algebra over space-time and is equivariant under Lorentz transformations. The underlying architecture is a versatile and scalable transformer, which is able to break symmetries if needed. We demonstrate the power of L-GATr for amplitude regression and jet classification, and then benchmark it as the first Lorentz-equivariant generative network. For all three LHC tasks, we find significant improvements over previous architectures.
[LG-76] HAVER: Instance-Dependent Error Bounds for Maximum Mean Estimation and Applications to Q-Learning
链接: https://arxiv.org/abs/2411.00405
作者: Tuan Ngo Nguyen,Kwang-Sung Jun
关键词-EN: Monte Carlo tree, Carlo tree search, Monte Carlo, machine learning tasks, Carlo tree
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We study the problem of estimating the value of the largest mean among K distributions via samples from them (rather than estimating which distribution has the largest mean), which arises from various machine learning tasks including Q-learning and Monte Carlo tree search. While there have been a few proposed algorithms, their performance analyses have been limited to their biases rather than a precise error metric. In this paper, we propose a novel algorithm called HAVER (Head AVERaging) and analyze its mean squared error. Our analysis reveals that HAVER has a compelling performance in two respects. First, HAVER estimates the maximum mean as well as the oracle who knows the identity of the best distribution and reports its sample mean. Second, perhaps surprisingly, HAVER exhibits even better rates than this oracle when there are many distributions near the best one. Both of these improvements are the first of their kind in the literature, and we also prove that the naive algorithm that reports the largest empirical mean does not achieve these bounds. Finally, we confirm our theoretical findings via numerical experiments including bandits and Q-learning scenarios where HAVER outperforms baseline methods.
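The contrast the paper draws can be illustrated with a toy: the naive estimator takes the max of empirical means (biased upward), while a HAVER-style estimator averages the means of arms statistically tied with the best. The threshold below is an illustrative assumption, not the paper's exact rule:

```python
import numpy as np

def naive_max_estimate(arm_samples):
    """Report the largest empirical mean (biased upward by the max operation)."""
    return max(float(np.mean(s)) for s in arm_samples)

def haver_estimate(arm_samples, delta=0.05):
    """HAVER-style sketch: average the empirical means of arms whose
    confidence intervals overlap the best arm, instead of taking the max.
    The exact threshold here is an illustrative assumption."""
    means = np.array([np.mean(s) for s in arm_samples])
    widths = np.array([np.sqrt(2 * np.log(1 / delta) / len(s)) for s in arm_samples])
    best = int(np.argmax(means))
    head = means >= means[best] - widths[best]  # arms statistically tied with the best
    return float(np.mean(means[head]))

rng = np.random.default_rng(0)
# 10 arms with identical mean 0: the true maximum mean is 0.
arm_samples = [rng.standard_normal(100) for _ in range(10)]
naive = naive_max_estimate(arm_samples)
haver = haver_estimate(arm_samples)
```

By construction the head average is never larger than the naive max; averaging over near-ties is what reduces the variance when many arms are close to the best.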
[LG-77] Unified theory of upper confidence bound policies for bandit problems targeting total reward maximal reward and more
链接: https://arxiv.org/abs/2411.00339
作者: Nobuaki Kikkawa,Hiroshi Ohno
关键词-EN: max bandit problem, bandit problem, classical total-reward bandit, upper confidence bound, max bandit
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:The upper confidence bound (UCB) policy is recognized as an order-optimal solution for the classical total-reward bandit problem. While similar UCB-based approaches have been applied to the max bandit problem, which aims to maximize the cumulative maximal reward, their order optimality remains unclear. In this study, we clarify the unified conditions under which the UCB policy achieves the order optimality in both total-reward and max bandit problems. A key concept of our theory is the oracle quantity, which identifies the best arm by its highest value. This allows a unified definition of the UCB policy as pulling the arm with the highest UCB of the oracle quantity. Additionally, under this setting, optimality analysis can be conducted by replacing traditional regret with the number of failures as a core measure. One consequence of our analysis is that the confidence interval of the oracle quantity must narrow appropriately as trials increase to ensure the order optimality of UCB policies. From this consequence, we prove that the previously proposed MaxSearch algorithm satisfies this condition and is an order-optimal policy for the max bandit problem. We also demonstrate that new bandit problems and their order-optimal UCB algorithms can be systematically derived by providing the appropriate oracle quantity and its confidence interval. Building on this, we propose PIUCB algorithms, which aim to pull the arm with the highest probability of improvement (PI). These algorithms can be applied to the max bandit problem in practice and perform comparably or better than the MaxSearch algorithm in toy examples. This suggests that our theory has the potential to generate new policies tailored to specific oracle quantities.
[LG-78] In-situ Self-optimization of Quantum Dot Emission for Lasers by Machine-Learning Assisted Epitaxy
链接: https://arxiv.org/abs/2411.00332
作者: Chao Shen,Wenkang Zhan,Shujie Pan,Hongyue Hao,Ning Zhuo,Kaiyao Xin,Hui Cong,Chi Xu,Bo Xu,Tien Khee Ng,Siming Chen,Chunlai Xue,Fengqi Liu,Zhanguo Wang,Chao Zhao
关键词-EN: optimizing light source, source emissions rely, light source emissions, light source, light source gain
类目: Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Machine Learning (cs.LG)
*备注: 5 figures
点击查看摘要
Abstract:Traditional methods for optimizing light source emissions rely on a time-consuming trial-and-error approach. While in-situ optimization of light source gain media emission during growth is ideal, it has yet to be realized. In this work, we integrate in-situ reflection high-energy electron diffraction (RHEED) with machine learning (ML) to correlate the surface reconstruction with the photoluminescence (PL) of InAs/GaAs quantum dots (QDs), which serve as the active region of lasers. A lightweight ResNet-GLAM model is employed for the real-time processing of RHEED data as input, enabling effective identification of optical performance. This approach guides the dynamic optimization of growth parameters, allowing real-time feedback control to adjust the QDs emission for lasers. We successfully optimized InAs QDs on GaAs substrates, with a 3.2-fold increase in PL intensity and a reduction in full width at half maximum (FWHM) from 36.69 meV to 28.17 meV under initially suboptimal growth conditions. Our automated, in-situ self-optimized lasers with 5-layer InAs QDs achieved electrically pumped continuous-wave operation at 1240 nm with a low threshold current of 150 A/cm² at room temperature, performance comparable to samples grown through traditional manual multi-parameter optimization methods. These results mark a significant step toward intelligent, low-cost, and reproducible production of light emitters.
[LG-79] How many classifiers do we need?
链接: https://arxiv.org/abs/2411.00328
作者: Hyunsuk Kim,Liam Hodgkinson,Ryan Theisen,Michael W. Mahoney
关键词-EN: experience diminishing returns, size experience diminishing, model size experience, performance gain achieved, diminishing returns
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:As performance gains through scaling data and/or model size experience diminishing returns, it is becoming increasingly popular to turn to ensembling, where the predictions of multiple models are combined to improve accuracy. In this paper, we provide a detailed analysis of how the disagreement and the polarization (a notion we introduce and define in this paper) among classifiers relate to the performance gain achieved by aggregating individual classifiers, for majority vote strategies in classification tasks. We address these questions in the following ways. (1) An upper bound for polarization is derived, and we propose what we call a neural polarization law: most interpolating neural network models are 4/3-polarized. Our empirical results not only support this conjecture but also show that polarization is nearly constant for a dataset, regardless of hyperparameters or architectures of classifiers. (2) The error of the majority vote classifier is considered under restricted entropy conditions, and we present a tight upper bound that indicates that the disagreement is linearly correlated with the target, and that the slope is linear in the polarization. (3) We prove results for the asymptotic behavior of the disagreement in terms of the number of classifiers, which we show can help in predicting the performance for a larger number of classifiers from that of a smaller number. Our theories and claims are supported by empirical results on several image classification tasks with various types of neural networks.
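The two quantities the paper relates, majority-vote predictions and pairwise disagreement among classifiers, are easy to compute. Below is a minimal sketch of both; the labels are hypothetical, and the paper's polarization measure (not reproduced here) refines the disagreement term.

```python
from collections import Counter
from itertools import combinations

def majority_vote(predictions):
    """predictions: one label list per classifier; returns the majority
    label per example (ties broken by the smallest label)."""
    voted = []
    for labels in zip(*predictions):
        counts = Counter(labels)
        top = max(counts.values())
        voted.append(min(l for l, c in counts.items() if c == top))
    return voted

def mean_disagreement(predictions):
    """Average pairwise disagreement rate among classifiers, the quantity
    the paper shows is linearly related to the majority-vote error."""
    n = len(predictions[0])
    pairs = list(combinations(predictions, 2))
    return sum(sum(a != b for a, b in zip(p, q)) / n
               for p, q in pairs) / len(pairs)

# Three classifiers on five examples (hypothetical binary labels):
preds = [[0, 1, 1, 0, 1],
         [0, 1, 0, 0, 1],
         [1, 1, 1, 0, 0]]
print(majority_vote(preds))      # → [0, 1, 1, 0, 1]
print(mean_disagreement(preds))  # → 0.4
```

The paper's asymptotic results then let one extrapolate from the disagreement measured on a small ensemble to the expected accuracy of a larger one.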
[LG-80] Forecasting Mortality in the Middle-Aged and Older Population of England: A 1D-CNN Approach
链接: https://arxiv.org/abs/2411.00317
作者: Marjan Qazvini
关键词-EN: Convolutional Neural Networks, time series data, Neural Networks, Daily Living, Convolutional Neural
类目: Applications (stat.AP); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Convolutional Neural Networks (CNNs) are proven to be effective when data are homogeneous, such as images, or when there is a relationship between consecutive data points, such as time series data. Although CNNs are not typically applied to tabular data, we show that they can be used on longitudinal data, where individuals’ information is recorded over a period and there is therefore a relationship between consecutive records. This study considers the English Longitudinal Study of Ageing (ELSA) survey, conducted every two years. We use one-dimensional convolutional neural networks (1D-CNNs) to forecast mortality using socio-demographics, diseases, mobility impairment, Activities of Daily Living (ADLs), Instrumental Activities of Daily Living (IADLs), and lifestyle factors. As our dataset is highly imbalanced, we try different over- and undersampling methods and find that over-representing the small class improves the results. We also try our model with different activation functions. Our results show that the swish nonlinearity outperforms other functions.
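The core operation a 1D-CNN applies to each respondent's ordered wave-by-wave features can be sketched in a few lines. This is a generic illustration of the layer type, not the paper's architecture; the survey-wave values below are made up.

```python
def conv1d(sequence, kernel, stride=1):
    """Minimal 1D convolution (strictly, cross-correlation, as in deep
    learning libraries): slide the kernel over the sequence and take
    dot products. A 1D-CNN applies many such learned kernels."""
    k = len(kernel)
    return [sum(sequence[i + j] * kernel[j] for j in range(k))
            for i in range(0, len(sequence) - k + 1, stride)]

# A length-3 averaging kernel over six hypothetical survey waves
# of a single feature: the output is a smoothed local trend.
waves = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(conv1d(waves, [1 / 3, 1 / 3, 1 / 3]))
```

Stacking such layers lets the network pick up temporal patterns (e.g. a worsening ADL score across waves) that a plain tabular model treats as unrelated columns.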
[LG-81] Analysis of ELSA COVID-19 Substudy response rate using machine learning algorithms
链接: https://arxiv.org/abs/2411.00297
作者: Marjan Qazvini
关键词-EN: National Statistical Organisations, National Statistical, Statistical Organisations, year spend time, Organisations every year
类目: Applications (stat.AP); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:National Statistical Organisations every year spend time and money to collect information through surveys. Some of these surveys include follow-up studies, and usually, some participants, due to factors such as death, immigration, change of employment, and health, do not participate in future surveys. In this study, we focus on the English Longitudinal Study of Ageing (ELSA) COVID-19 Substudy, which was carried out during the COVID-19 pandemic in two waves. In this substudy, some participants from wave 1 did not participate in wave 2. Our purpose is to predict non-responses using Machine Learning (ML) algorithms such as K-nearest neighbours (KNN), random forest (RF), AdaBoost, logistic regression, neural networks (NN), and support vector classifiers (SVC). We find that RF outperforms other models in terms of balanced accuracy, KNN in terms of precision and test accuracy, and logistic regression in terms of the area under the receiver operating characteristic (ROC) curve, i.e. AUC.
[LG-82] Minimum Empirical Divergence for Sub-Gaussian Linear Bandits
链接: https://arxiv.org/abs/2411.00229
作者: Kapilan Balagopalan,Kwang-Sung Jun
关键词-EN: Minimum Empirical Divergence, Linear Minimum Empirical, called linear Thompson, linear bandit algorithm, Linear Minimum
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We propose a novel linear bandit algorithm called LinMED (Linear Minimum Empirical Divergence), which is a linear extension of the MED algorithm that was originally designed for multi-armed bandits. LinMED is a randomized algorithm that admits a closed-form computation of the arm sampling probabilities, unlike the popular randomized algorithm called linear Thompson sampling. Such a feature proves useful for off-policy evaluation, where unbiased evaluation requires accurately computing the sampling probability. We prove that LinMED enjoys a near-optimal regret bound of $d\sqrt{n}$ up to logarithmic factors, where $d$ is the dimension and $n$ is the time horizon. We further show that LinMED enjoys a $\frac{d^2}{\Delta}\log^2(n)\log(\log(n))$ problem-dependent regret, where $\Delta$ is the smallest sub-optimality gap, which is lower than the $\frac{d^2}{\Delta}\log^3(n)$ of the standard algorithm OFUL (Abbasi-yadkori et al., 2011). Our empirical study shows that LinMED has a competitive performance with the state-of-the-art algorithms.
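Why do closed-form sampling probabilities matter for off-policy evaluation? The standard inverse-propensity (IPS) estimator divides each logged reward by the probability the logging policy assigned to the chosen arm, so that probability must be known exactly. A minimal sketch under an assumed two-arm Bernoulli setup (not from the paper):

```python
import random

def ips_estimate(logged, target_probs):
    """Inverse-propensity estimate of a target policy's value from logged
    data. Each record is (arm, reward, logging_prob); the logging_prob
    must be known exactly, which is where a closed-form arm-sampling
    probability (as in LinMED) helps."""
    return sum(r * target_probs[a] / p for a, r, p in logged) / len(logged)

random.seed(2)
logging_probs = [0.5, 0.5]        # uniform logging policy over two arms
true_means = [0.2, 0.8]           # assumed Bernoulli reward means
logged = []
for _ in range(5000):
    a = random.randrange(2)                        # arm drawn by logger
    r = 1.0 if random.random() < true_means[a] else 0.0
    logged.append((a, r, logging_probs[a]))

# Evaluate a target policy that always pulls arm 1 (true value 0.8)
# without ever running it.
est = ips_estimate(logged, target_probs=[0.0, 1.0])
print(round(est, 2))
```

An algorithm like linear Thompson sampling only samples arms implicitly, so its propensities must be approximated; LinMED's closed form removes that source of bias.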
[LG-83] Inclusive KL Minimization: A Wasserstein-Fisher-Rao Gradient Flow Perspective
链接: https://arxiv.org/abs/2411.00214
作者: Jia-Jie Zhu
关键词-EN: mathematically principled perspective, Wasserstein gradient flow, Wasserstein gradient, powerful and mathematically, mathematically principled
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
点击查看摘要
Abstract:Otto’s (2001) Wasserstein gradient flow of the exclusive KL divergence functional provides a powerful and mathematically principled perspective for analyzing learning and inference algorithms. In contrast, algorithms for inclusive KL inference, i.e., minimizing $\mathrm{KL}(\pi \| \mu)$ with respect to $\mu$ for some target $\pi$, are rarely analyzed using tools from mathematical analysis. This paper shows that a general-purpose approximate inclusive KL inference paradigm can be constructed using the theory of gradient flows derived from PDE analysis. We uncover that several existing learning algorithms can be viewed as particular realizations of the inclusive KL inference paradigm. For example, existing sampling algorithms such as Arbel et al. (2019) and Korba et al. (2021) can be viewed in a unified manner as inclusive-KL inference with approximate gradient estimators. Finally, we provide the theoretical foundation for the Wasserstein-Fisher-Rao gradient flows for minimizing the inclusive KL divergence.
[LG-84] Learning Mixtures of Unknown Causal Interventions
链接: https://arxiv.org/abs/2411.00213
作者: Abhinav Kumar,Kirankumar Shiragur,Caroline Uhler
关键词-EN: diverse scientific disciplines, conduct interventions plays, learning causal relationships, Structural Equation Models, machine learning
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:The ability to conduct interventions plays a pivotal role in learning causal relationships among variables, thus facilitating applications across diverse scientific disciplines such as genomics, economics, and machine learning. However, in many instances within these applications, the process of generating interventional data is subject to noise: rather than data being sampled directly from the intended interventional distribution, interventions often yield data sampled from a blend of both intended and unintended interventional distributions. We consider the fundamental challenge of disentangling mixed interventional and observational data within linear Structural Equation Models (SEMs) with Gaussian additive noise without the knowledge of the true causal graph. We demonstrate that conducting interventions, whether do or soft, yields distributions with sufficient diversity and properties conducive to efficiently recovering each component within the mixture. Furthermore, we establish that the sample complexity required to disentangle mixed data inversely correlates with the extent of change induced by an intervention in the equations governing the affected variable values. As a result, the causal graph can be identified up to its interventional Markov Equivalence Class, similar to scenarios where no noise influences the generation of interventional data. We further support our theoretical findings by conducting simulations wherein we perform causal discovery from such mixed data.
[LG-85] Residual Deep Gaussian Processes on Manifolds
链接: https://arxiv.org/abs/2411.00161
作者: Kacper Wyrwal,Andreas Krause,Viacheslav Borovitskiy
关键词-EN: residual neural networks, propose practical deep, practical deep Gaussian, deep Gaussian process, similar in spirit
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We propose practical deep Gaussian process models on Riemannian manifolds, similar in spirit to residual neural networks. With manifold-to-manifold hidden layers and an arbitrary last layer, they can model manifold- and scalar-valued functions, as well as vector fields. We target data inherently supported on manifolds, which is too complex for shallow Gaussian processes thereon. For example, while the latter perform well on high-altitude wind data, they struggle with the more intricate, nonstationary patterns at low altitudes. Our models significantly improve performance in these settings, enhancing prediction quality and uncertainty calibration, and remain robust to overfitting, reverting to shallow models when additional complexity is unneeded. We further showcase our models on Bayesian optimisation problems on manifolds, using stylised examples motivated by robotics, and obtain substantial improvements in later stages of the optimisation process. Finally, we show our models to have potential for speeding up inference for non-manifold data, when, and if, it can be mapped to a proxy manifold well enough.
[LG-86] Enhancing Brain Source Reconstruction through Physics-Informed 3D Neural Networks
链接: https://arxiv.org/abs/2411.00143
作者: Marco Morik,Ali Hashemi,Klaus-Robert Müller,Stefan Haufe,Shinichi Nakajima
关键词-EN: Reconstructing brain sources, understanding brain function, Reconstructing brain, challenge in neuroscience, crucial for understanding
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
*备注: Under Review in IEEE Transactions on Medical Imaging
点击查看摘要
Abstract:Reconstructing brain sources is a fundamental challenge in neuroscience, crucial for understanding brain function and dysfunction. Electroencephalography (EEG) signals have a high temporal resolution. However, identifying the correct spatial location of brain sources from these signals remains difficult due to the ill-posed structure of the problem. Traditional methods predominantly rely on manually crafted priors, missing the flexibility of data-driven learning, while recent deep learning approaches focus on end-to-end learning, typically using the physical information of the forward model only for generating training data. We propose the novel hybrid method 3D-PIUNet for EEG source localization that effectively integrates the strengths of traditional and deep learning techniques. 3D-PIUNet starts from an initial physics-informed estimate by using the pseudo inverse to map from measurements to source space. Secondly, by viewing the brain as a 3D volume, we use a 3D convolutional U-Net to capture spatial dependencies and refine the solution according to the learned data prior. Training the model relies on simulated pseudo-realistic brain source data, covering different source distributions. Trained on this data, our model significantly improves spatial accuracy, demonstrating superior performance over both traditional and end-to-end data-driven methods. Additionally, we validate our findings with real EEG data from a visual task, where 3D-PIUNet successfully identifies the visual cortex and reconstructs the expected temporal behavior, thereby showcasing its practical applicability.
[LG-87] A Geometric Framework for Understanding Memorization in Generative Models
链接: https://arxiv.org/abs/2411.00113
作者: Brendan Leigh Ross,Hamidreza Kamkari,Tongzi Wu,Rasa Hosseinzadeh,Zhaoyan Liu,George Stein,Jesse C. Cresswell,Gabriel Loaiza-Ganem
关键词-EN: deep generative models, reproducing training datapoints, generative models, capable of memorizing, memorizing and reproducing
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 10 pages, 7 figures
点击查看摘要
Abstract:As deep generative models have progressed, recent work has shown them to be capable of memorizing and reproducing training datapoints when deployed. These findings call into question the usability of generative models, especially in light of the legal and privacy risks brought about by memorization. To better understand this phenomenon, we propose the manifold memorization hypothesis (MMH), a geometric framework which leverages the manifold hypothesis into a clear language in which to reason about memorization. We propose to analyze memorization in terms of the relationship between the dimensionalities of (i) the ground truth data manifold and (ii) the manifold learned by the model. This framework provides a formal standard for “how memorized” a datapoint is and systematically categorizes memorized data into two types: memorization driven by overfitting and memorization driven by the underlying data distribution. By analyzing prior work in the context of the MMH, we explain and unify assorted observations in the literature. We empirically validate the MMH using synthetic data and image datasets up to the scale of Stable Diffusion, developing new tools for detecting and preventing generation of memorized samples in the process.
[LG-88] A Universal Quantum Computer From Relativistic Motion
链接: https://arxiv.org/abs/2411.00105
作者: Philip A. LeMaitre,T. Rick Perche,Marius Krumm,Hans J. Briegel
关键词-EN: quantum computing architecture, variational quantum circuit, quantum circuit approach, relativistic quantum computing, quantum computing
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG); General Relativity and Quantum Cosmology (gr-qc)
*备注: 5 pages + appendices, 1 figure - revtex4-2
点击查看摘要
Abstract:We present an explicit construction of a relativistic quantum computing architecture using a variational quantum circuit approach that is shown to allow for universal quantum computing. The variational quantum circuit consists of tunable single-qubit rotations and entangling gates that are implemented successively. The single qubit rotations are parameterized by the proper time intervals of the qubits’ trajectories and can be tuned by varying their relativistic motion in spacetime. The entangling layer is mediated by a relativistic quantum field instead of through direct coupling between the qubits. Within this setting, we give a prescription for how to use quantum field-mediated entanglement and manipulation of the relativistic motion of qubits to obtain a universal gate set, for which compact non-perturbative expressions that are valid for general spacetimes are also obtained. We also derive a lower bound on the channel fidelity that shows the existence of parameter regimes in which all entangling operations are effectively unitary, despite the noise generated from the presence of a mediating quantum field. Finally, we consider an explicit implementation of the quantum Fourier transform with relativistic qubits.
[LG-89] Solving the 2D Advection-Diffusion Equation using Fixed-Depth Symbolic Regression and Symbolic Differentiation without Expression Trees
链接: https://arxiv.org/abs/2411.00011
作者: Edward Finkelstein
关键词-EN: fixed-depth symbolic regression, expression trees, paper presents, differentiation without expression, fixed-depth symbolic
类目: Computation (stat.CO); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 13 pages, 2 figures, 16 equations
点击查看摘要
Abstract:This paper presents a novel method for solving the 2D advection-diffusion equation using fixed-depth symbolic regression and symbolic differentiation without expression trees. The method is applied to two cases with distinct initial and boundary conditions, demonstrating its accuracy and ability to find approximate solutions efficiently. This framework offers a promising, scalable solution for finding approximate solutions to differential equations, with the potential for future improvements in computational performance and applicability to more complex systems involving vector-valued objectives.
[LG-90] RapidDock: Unlocking Proteome-scale Molecular Docking
链接: https://arxiv.org/abs/2411.00004
作者: Rafał Powalski,Bazyli Klockiewicz,Maciej Jaśkowski,Bartosz Topolski,Paweł Dąbrowski-Tumański,Maciej Wiśniewski,Łukasz Kuciński,Piotr Miłoś,Dariusz Plewczynski
关键词-EN: Accelerating molecular docking, small-molecule drug discovery, boost small-molecule drug, Accelerating molecular, revolutionize medicine
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Accelerating molecular docking – the process of predicting how molecules bind to protein targets – could boost small-molecule drug discovery and revolutionize medicine. Unfortunately, current molecular docking tools are too slow to screen potential drugs against all relevant proteins, which often results in missed drug candidates or unexpected side effects occurring in clinical trials. To address this gap, we introduce RapidDock, an efficient transformer-based model for blind molecular docking. RapidDock achieves at least a 100× speed advantage over existing methods without compromising accuracy. On the Posebusters and DockGen benchmarks, our method achieves 52.1% and 44.0% success rates (RMSD < 2 Å), respectively. The average inference time is 0.04 seconds on a single GPU, highlighting RapidDock’s potential for large-scale docking studies. We examine the key features of RapidDock that enable leveraging the transformer architecture for molecular docking, including the use of relative distance embeddings of 3D structures in attention matrices, pre-training on protein folding, and a custom loss function invariant to molecular symmetries.
信息检索
[IR-0] Making Sense of Metadata Mess: Alignment Risk Assessment for Diatom Data Use Case
链接: https://arxiv.org/abs/2411.00677
作者: Kio Polson,Marina Potapova,Uttam Meena,Chad Peiper,Joshua Brown,Joshua Agar,Jane Greenberg
关键词-EN: Biologists study Diatoms, Biologists study, fundamental algae, Sciences’ Diatom Herbarium, assess the health
类目: Information Retrieval (cs.IR)
*备注: 13 pages, 2 figures, 1 table, to be published in MTSR 2024 conference proceedings
点击查看摘要
Abstract:Biologists study Diatoms, a fundamental algae, to assess the health of aquatic systems. Diatom specimens have traditionally been preserved on analog slides, where a single slide can contain thousands of these microscopic organisms. Digitization of these collections presents both metadata challenges and opportunities. This paper reports on metadata research aimed at providing access to a digital portion of the Academy of Natural Sciences’ Diatom Herbarium, Drexel University. We report results of a 3-part study covering 1) a review of relevant metadata standards and a microscopy metadata framework shared by Hammer et al., 2) a baseline metadata alignment mapping current diatom metadata properties to standard metadata types, and 3) a metadata risk analysis associated with the course of standard data curation practices. This research is part of an effort involving the transfer of these digital slides to a new system, DataFed, to support global accessibility. The final section of this paper includes a conclusion and discusses next steps.
[IR-1] Enhancing Semantic Interoperability Across Materials Science With HIVE4MAT
链接: https://arxiv.org/abs/2411.00676
作者: Jane Greenberg,Kio Polson,Scott McClellan,Xintong Zhao,Alex Kalinowski,Yuan An
关键词-EN: linked data interactive, data interactive application, materials science, linked data, data interactive
类目: Information Retrieval (cs.IR)
*备注: 11 pages, 1 figures, 3 tables, to be published in SeMatS 2024 workshop proceedings
点击查看摘要
Abstract:HIVE4MAT is a linked data interactive application for navigating ontologies of value to materials science. HIVE enables automatic indexing of textual resources with standardized terminology. This article presents the motivation underlying HIVE4MAT, explains the system architecture, reports on two evaluations, and discusses future plans.
[IR-2] DivNet: Diversity-Aware Self-Correcting Sequential Recommendation Networks CIKM
链接: https://arxiv.org/abs/2411.00395
作者: Shuai Xiao,Zaifan Jiang
关键词-EN: aims to give, whole-page relevance, recommended items, give the final
类目: Information Retrieval (cs.IR)
*备注: Published at CIKM
点击查看摘要
Abstract:As the last stage of a typical \textit{recommendation system}, \textit{collective recommendation} aims to give the final touches to the recommended items and their layout so as to optimize overall objectives such as diversity and whole-page relevance. In practice, however, the interaction dynamics among the recommended items, their visual appearances and meta-data such as specifications are often too complex to be captured by experts’ heuristics or simple models. To address this issue, we propose \textit{diversity-aware self-correcting sequential recommendation networks} (\textit{DivNet}), which are able to estimate utility by capturing the complex interactions among sequential items and diversify recommendations simultaneously. Experiments on both offline and online settings demonstrate that \textit{DivNet} can achieve better results compared to baselines with or without collective recommendations.
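The utility-versus-diversity trade-off DivNet learns with a neural network can be illustrated with a hand-rolled greedy re-ranker in the spirit of maximal marginal relevance (MMR). This is only a stand-in for the objective, not the paper's model; the catalogue, utility, and similarity functions below are all hypothetical.

```python
def diversity_rerank(items, utility, similarity, k, lam=0.7):
    """Greedy MMR-style re-ranking: at each step pick the item with the
    best trade-off between standalone utility and dissimilarity to the
    items already selected."""
    selected = []
    remaining = list(items)
    while remaining and len(selected) < k:
        def score(x):
            redundancy = max((similarity(x, s) for s in selected), default=0.0)
            return lam * utility(x) - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy catalogue: (item_id, category, relevance); same category = redundant.
catalog = [("a", "shoes", 0.9), ("b", "shoes", 0.85),
           ("c", "hats", 0.6), ("d", "bags", 0.5)]
picked = diversity_rerank(
    catalog,
    utility=lambda x: x[2],
    similarity=lambda x, y: 1.0 if x[1] == y[1] else 0.0,
    k=3,
)
print([i[0] for i in picked])  # → ['a', 'c', 'd']
```

Note that pure utility ranking would pick both pairs of shoes (`a`, `b`); the diversity penalty pushes the second pair out in favour of a hat and a bag, which is the whole-page effect DivNet models end-to-end.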
[IR-3] A Survey on Bundle Recommendation: Methods Applications and Challenges
链接: https://arxiv.org/abs/2411.00341
作者: Meng Sun,Lin Li,Ming Li,Xiaohui Tao,Dong Zhang,Peipei Wang,Jimmy Xiangji Huang
关键词-EN: gained significant attention, enhance user experience, bundle recommendation, bundle recommendation systems, generative bundle recommendation
类目: Information Retrieval (cs.IR)
*备注:
点击查看摘要
Abstract:In recent years, bundle recommendation systems have gained significant attention in both academia and industry due to their ability to enhance user experience and increase sales by recommending a set of items as a bundle rather than individual items. This survey provides a comprehensive review of bundle recommendation, beginning with a taxonomy for exploring product bundling. We classify it into two categories based on bundling strategy from various application domains, i.e., discriminative and generative bundle recommendation. Then we formulate the corresponding tasks of the two categories and systematically review their methods: 1) representation learning from bundle and item levels and interaction modeling for discriminative bundle recommendation; 2) representation learning from item level and bundle generation for generative bundle recommendation. Subsequently, we survey the resources of bundle recommendation including datasets and evaluation metrics, and conduct reproducibility experiments on mainstream models. Lastly, we discuss the main challenges and highlight the promising future directions in the field of bundle recommendation, aiming to serve as a useful resource for researchers and practitioners. Our code and datasets are publicly available at this https URL.
[IR-4] Beyond Utility: Evaluating LLM as Recommender
链接: https://arxiv.org/abs/2411.00331
作者: Chumeng Jiang,Jiayin Wang,Weizhi Ma,Charles L. A. Clarke,Shuai Wang,Chuhan Wu,Min Zhang
关键词-EN: Large Language Models, Large Language, recent studies employed, provide personalized information, personalized information services
类目: Information Retrieval (cs.IR)
*备注:
点击查看摘要
Abstract:With the rapid development of Large Language Models (LLMs), recent studies employed LLMs as recommenders to provide personalized information services for distinct users. Despite efforts to improve the accuracy of LLM-based recommendation models, relatively little attention is paid to beyond-utility dimensions. Moreover, there are unique evaluation aspects of LLM-based recommendation models, which have been largely ignored. To bridge this gap, we explore four new evaluation dimensions and propose a multidimensional evaluation framework. The new evaluation dimensions include: 1) history length sensitivity, 2) candidate position bias, 3) generation-involved performance, and 4) hallucinations. All four dimensions have the potential to impact performance, but are largely unnecessary for consideration in traditional systems. Using this multidimensional evaluation framework, along with traditional aspects, we evaluate the performance of seven LLM-based recommenders, with three prompting strategies, comparing them with six traditional models on both ranking and re-ranking tasks on four datasets. We find that LLMs excel at handling tasks with prior knowledge and shorter input histories in the ranking setting, and perform better in the re-ranking setting, beating traditional models across multiple dimensions. However, LLMs exhibit substantial candidate position bias issues, and some models hallucinate non-existent items much more often than others. We intend our evaluation framework and observations to benefit future research on the use of LLMs as recommenders. The code and data are available at this https URL.
[IR-5] Content Aware Analysis of Scholarly Networks: A Case Study on CORD19 Dataset
链接: https://arxiv.org/abs/2411.00262
作者: Mehmet Emre Akbulut,Yusuf Erdem Nacar
关键词-EN: scientific research network, Named Entity Recognition, paper investigates, investigates the relationships, relationships among key
类目: ocial and Information Networks (cs.SI); Information Retrieval (cs.IR); Physics and Society (physics.soc-ph)
*备注:
点击查看摘要
Abstract:This paper investigates the relationships among key elements of a scientific research network, namely articles, researchers, and journals. We introduce a novel approach to using semantic information through HITS-algorithm-based propagation of topic information in the network. The topic information is derived using Named Entity Recognition and Entity Linkage. In our case, MedCAT is used to extract the topics from the CORD-19 dataset, a corpus of academic articles about COVID-19 and the coronavirus scientific network. Our approach focuses on the COVID-19 domain, utilizing the CORD-19 dataset to demonstrate the efficacy of integrating topic-related information within the citation framework. Through the application of a hybrid HITS algorithm, we show that incorporating topic data significantly influences article rankings, revealing deeper insights into the structure of the academic community.
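For reference, plain HITS on a citation graph looks as follows. The paper's hybrid variant additionally propagates MedCAT-derived topic information; that step is omitted here, and the toy graph is hypothetical.

```python
def hits(links, iterations=50):
    """Plain HITS: iteratively update hub and authority scores on a
    directed citation graph given as {node: [cited_nodes]}, with L2
    normalization after each update."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # A node's authority is the total hub score of its citers.
        auth = {n: sum(hub[src] for src, ts in links.items() if n in ts)
                for n in nodes}
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # A node's hub score is the total authority of what it cites.
        hub = {n: sum(auth[t] for t in links.get(n, ())) for n in nodes}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth

# Toy citation graph: papers p1 and p2 both cite p3; p3 cites p4.
graph = {"p1": ["p3"], "p2": ["p3"], "p3": ["p4"]}
hub, auth = hits(graph)
print(max(auth, key=auth.get))  # → p3, the most-cited-by-hubs paper
```

The hybrid scheme in the paper would reweight these updates by topical affinity, so an article's rank reflects both who cites it and what it is about.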
附件下载
点击下载今日全部论文列表