This post presents the latest paper listing retrieved from Arxiv.org on 2024-11-12. It is updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily paper digest by email, please leave your email address in the comments.
Table of Contents
Overview (2024-11-12)
797 new papers today, including:
- Natural Language Processing: 116 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 241 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 165 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 277 papers (Machine Learning (cs.LG))
Natural Language Processing
[NLP-0] UTMath: Math Evaluation with Unit Test via Reasoning-to-Coding Thoughts
[Quick Read]: This paper targets the limitations of existing math-reasoning benchmarks such as GSM8K and MATH for evaluating large language models (LLMs): narrow problem definitions, dependence on specific numbers, and preset rules, all of which hinder accurate assessment of reasoning ability and adaptability. The key to the solution is the UTMath Benchmark, a novel evaluation framework inspired by unit testing in software development, comprising 1,053 problems across 9 mathematical domains with over 68 test cases per problem to comprehensively assess model accuracy and reliability. The paper further proposes the Reasoning-to-Coding of Thoughts (RCoT) approach, which encourages LLMs to reason explicitly before generating code, yielding more sophisticated solutions and better performance, and releases the UTMath-Train training dataset (over 70k samples) to support further community exploration of mathematical reasoning.
Link: https://arxiv.org/abs/2411.07240
Authors: Bo Yang, Qingping Yang, Runtao Liu
Keywords (EN): Artificial General Intelligence, advancing Artificial General, General Intelligence, Artificial General, advancing Artificial
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:The evaluation of mathematical reasoning capabilities is essential for advancing Artificial General Intelligence (AGI). While Large Language Models (LLMs) have shown impressive performance in solving mathematical problems, existing benchmarks such as GSM8K and MATH present limitations, including narrow problem definitions with specific numbers and reliance on predetermined rules that hinder accurate assessments of reasoning and adaptability. This paper introduces the UTMath Benchmark, which robustly evaluates the models through extensive unit tests. It consists of 1,053 problems across 9 mathematical domains, with over 68 test cases per problem. We propose an innovative evaluation framework inspired by unit testing in software development, focusing on both accuracy and reliability of results. Furthermore, we introduce the Reasoning-to-Coding of Thoughts (RCoT) approach, which encourages LLMs to perform explicit reasoning before generating code, leading to more advanced solutions and improved performance. Finally, we are releasing not only the UTMath benchmark but also the UTMath-Train training dataset (more than 70k samples), to support the community in further exploring mathematical reasoning.
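To make the unit-test idea concrete, here is a minimal sketch of how such a benchmark might score a model-generated solution. The problem, test cases, and candidate `solve` function are invented for illustration; the actual UTMath harness and its 68+ cases per problem are more elaborate.

```python
# Minimal sketch of unit-test-based scoring in the spirit of UTMath.
# The problem, test cases, and candidate solution below are invented
# for illustration; the real benchmark has 1,053 problems with 68+ tests each.

candidate_code = """
def solve(n):
    # Candidate solution the model produced after explicit reasoning (RCoT):
    # the n-th triangular number.
    return n * (n + 1) // 2
"""

test_cases = [(1, 1), (2, 3), (10, 55), (100, 5050)]  # (input, expected)

def run_unit_tests(code: str, tests) -> float:
    namespace = {}
    exec(code, namespace)          # compile the generated solution
    solve = namespace["solve"]
    passed = sum(1 for x, want in tests if solve(x) == want)
    return passed / len(tests)     # pass rate over all unit tests

print(f"pass rate: {run_unit_tests(candidate_code, test_cases):.0%}")
```

Scoring by pass rate over many tests, rather than matching a single final answer, is what lets the benchmark reward general solutions over number-specific ones.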
[NLP-1] OpenThaiGPT 1.5: A Thai-Centric Open Source Large Language Model
[Quick Read]: This paper targets the development of an advanced conversational model for Thai. The key to the solution is OpenThaiGPT 1.5, built on the Qwen v2.5 architecture and fine-tuned on over 2,000,000 Thai instruction pairs. Its key features include multi-turn conversation support, Retrieval Augmented Generation (RAG) compatibility, and tool-calling functionality, with which OpenThaiGPT 1.5 delivers state-of-the-art performance on a range of Thai language tasks, surpassing other open-source Thai language models. The paper also discusses GPU memory requirements and deployment strategies for practical use.
Link: https://arxiv.org/abs/2411.07238
Authors: Sumeth Yuenyong, Kobkrit Viriyayudhakorn, Apivadee Piyatumrong, Jillaphat Jaroenkantasima
Keywords (EN): Thai instruction pairs, based on Qwen, Retrieval Augmented Generation, chat model based, advanced Thai language
Subjects: Computation and Language (cs.CL)
Comments: 8 pages, 4 tables
Abstract:OpenThaiGPT 1.5 is an advanced Thai language chat model based on Qwen v2.5, finetuned on over 2,000,000 Thai instruction pairs. This report provides an engineering perspective on the model’s development, capabilities, and performance. We discuss the model’s architecture, training process, and key features, including multi-turn conversation support, Retrieval Augmented Generation (RAG) compatibility, and tool-calling functionality. Benchmark results demonstrate OpenThaiGPT 1.5’s state-of-the-art performance on various Thai language tasks, outperforming other open-source Thai language models. We also address practical considerations such as GPU memory requirements and deployment strategies.
[NLP-2] Contextualized Evaluations: Taking the Guesswork Out of Language Model Evaluations
[Quick Read]: This paper addresses the difficulty of evaluating responses to language-model queries that lack explicit context (such as the user's identity, the query's intent, and the criteria for a useful response), which makes evaluation ill-posed and subjective. The key to the solution is contextualized evaluations, a protocol that synthetically constructs context surrounding an underspecified query and supplies it during evaluation. This approach can alter evaluation conclusions, even flipping win rates between model pairs; it nudges evaluators away from surface-level criteria such as style; and it reveals implicit biases in model behavior across contexts, notably a default tilt toward WEIRD (Western, Educated, Industrialized, Rich, and Democratic) contexts. The study also finds that models are not equally sensitive to different contexts, even when those contexts are provided in prompts.
Link: https://arxiv.org/abs/2411.07237
Authors: Chaitanya Malaviya, Joseph Chee Chang, Dan Roth, Mohit Iyyer, Mark Yatskar, Kyle Lo
Keywords (EN): Language model users, lack specification, Language model, user identity, Language
Subjects: Computation and Language (cs.CL)
Comments: Code and data available at this https URL
Abstract:Language model users often issue queries that lack specification, where the context under which a query was issued – such as the user’s identity, the query’s intent, and the criteria for a response to be useful – is not explicit. For instance, a good response to a subjective query like “What book should I read next?” would depend on the user’s preferences, and a good response to an open-ended query like “How do antibiotics work against bacteria?” would depend on the user’s expertise. This makes evaluation of responses to such queries an ill-posed task, as evaluators may make arbitrary judgments about the response quality. To remedy this, we present contextualized evaluations, a protocol that synthetically constructs context surrounding an underspecified query and provides it during evaluation. We find that the presence of context can 1) alter conclusions drawn from evaluation, even flipping win rates between model pairs, 2) nudge evaluators to make fewer judgments based on surface-level criteria, like style, and 3) provide new insights about model behavior across diverse contexts. Specifically, our procedure uncovers an implicit bias towards WEIRD contexts in models’ “default” responses and we find that models are not equally sensitive to following different contexts, even when they are provided in prompts.
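A rough sketch of the protocol's core move, pairing an underspecified query with synthesized context before judging. The context fields and prompt template below are illustrative assumptions, not the paper's exact format.

```python
# Sketch of the "contextualized evaluations" idea: pair an underspecified
# query with synthetic context (user attributes and criteria) before judging.
# The context fields and template are illustrative assumptions, not the
# paper's exact protocol.

query = "What book should I read next?"

synthetic_context = {
    "user": "a high-school student who enjoyed science fiction",
    "intent": "find an accessible novel for leisure reading",
    "criteria": "age-appropriate, under 400 pages",
}

def contextualize(query: str, ctx: dict) -> str:
    lines = [f"Query: {query}", "Context for judging the response:"]
    lines += [f"- {k}: {v}" for k, v in ctx.items()]
    lines.append("Rate how well the response serves THIS user, not a generic one.")
    return "\n".join(lines)

print(contextualize(query, synthetic_context))
```

Giving the judge an explicit user and criteria is what discourages arbitrary, style-driven verdicts on subjective queries.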
[NLP-3] TempCharBERT: Keystroke Dynamics for Continuous Access Control Based on Pre-trained Language Models
[Quick Read]: This paper tackles user identification and authentication in digital environments via keystroke dynamics. The key to the solution is TempCharBERT, an architecture that builds on CharBERT by integrating temporal-character information into the embedding layer, enabling better modeling of keystroke dynamics. This customization of pre-trained language models (PLMs), which can process both subword- and character-level information, optimizes for users' temporal typing signals (such as hold time and flight time) and yields significant gains in user identification and authentication. The paper also demonstrates the feasibility of training TempCharBERT in a federated learning setting to foster data privacy.
Link: https://arxiv.org/abs/2411.07224
Authors: Matheus Simão, Fabiano Prado, Omar Abdul Wahab, Anderson Avila
Keywords (EN): continuous access control, digital environments, widespread of digital, continuous access, access control
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: Accepted at WIFS 2024
Abstract:With the widespread of digital environments, reliable authentication and continuous access control has become crucial. It can minimize cyber attacks and prevent frauds, specially those associated with identity theft. A particular interest lies on keystroke dynamics (KD), which refers to the task of recognizing individuals’ identity based on their unique typing style. In this work, we propose the use of pre-trained language models (PLMs) to recognize such patterns. Although PLMs have shown high performance on multiple NLP benchmarks, the use of these models on specific tasks requires customization. BERT and RoBERTa, for instance, rely on subword tokenization, and they cannot be directly applied to KD, which requires temporal-character information to recognize users. Recent character-aware PLMs are able to process both subwords and character-level information and can be an alternative solution. Notwithstanding, they are still not suitable to be directly fine-tuned for KD as they are not optimized to account for user’s temporal typing information (e.g., hold time and flight time). To overcome this limitation, we propose TempCharBERT, an architecture that incorporates temporal-character information in the embedding layer of CharBERT. This allows modeling keystroke dynamics for the purpose of user identification and authentication. Our results show a significant improvement with this customization. We also showed the feasibility of training TempCharBERT on a federated learning settings in order to foster data privacy.
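The central idea, injecting hold and flight times into a character embedding layer, can be sketched as below. The dimensions and the additive fusion are assumptions for illustration; the real model extends CharBERT's embedding stack.

```python
import torch
import torch.nn as nn

# Sketch of fusing temporal typing features into a character embedding layer,
# in the spirit of TempCharBERT. Dimensions and fusion-by-addition are
# assumptions; the paper builds on CharBERT's actual embedding stack.

class TemporalCharEmbedding(nn.Module):
    def __init__(self, vocab_size=128, dim=64):
        super().__init__()
        self.char_emb = nn.Embedding(vocab_size, dim)
        self.time_proj = nn.Linear(2, dim)  # (hold time, flight time) -> dim

    def forward(self, char_ids, timings):
        # char_ids: (batch, seq); timings: (batch, seq, 2)
        return self.char_emb(char_ids) + self.time_proj(timings)

emb = TemporalCharEmbedding()
chars = torch.randint(0, 128, (1, 10))
times = torch.rand(1, 10, 2)        # per-key hold and flight times (seconds)
print(emb(chars, times).shape)      # torch.Size([1, 10, 64])
```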
[NLP-4] TreeCoders: Trees of Transformers
[Quick Read]: This paper addresses the high computational cost of traditional linear Transformer models at scale. The key to the solution is TreeCoders, a novel family of transformer trees that arranges Transformer blocks as the nodes of a complete k-ary tree, with generic classifiers learning to select the best child and route the token sequence to a specific leaf. The structure supports seamless integration of a variety of architectures without further modification and, thanks to the logarithmic complexity of tree search, enables sparse node activation, substantially reducing compute. Experiments show that TreeCoders outperform size-equivalent linear transformer models more than 76% of the time across diverse language datasets, and the design lends itself naturally to distributed implementation.
Link: https://arxiv.org/abs/2411.07218
Authors: Pierre Colonna D'Istria, Abdulrahman Altahhan
Keywords (EN): introduce TreeCoders, transformer, Transformer blocks, tree, transformer model
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:In this paper, we introduce TreeCoders, a novel family of transformer trees. We moved away from traditional linear transformers to complete k-ary trees. Transformer blocks serve as nodes, and generic classifiers learn to select the best child and route the sequence of tokens to a specific leaf. The selectors, moved outside the transformer blocks, allow for the use of a variety of architecture without further modifications. Furthermore, our proposed architecture supports sparse node activation due to the logarithmic complexity of a tree search. We validate our idea by testing a series of decoder-only tree transformers, achieving competitive results across a diverse range of language datasets. Our study demonstrates that the proposed tree transformer model outperforms a size-equivalent linear transformer model 76% of the time over a wide range of tree architectures. Furthermore, our proposed model naturally lends itself to distributed implementation.
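A toy version of the routing scheme: transformer blocks as nodes of a complete k-ary tree, with a selector outside each block picking one child, so only a root-to-leaf path of blocks runs per sequence. The mean-pooling and argmax routing below are simplifying assumptions.

```python
import torch
import torch.nn as nn

# Toy sketch of TreeCoders-style routing: blocks as nodes of a complete
# k-ary tree, a selector outside each block picking one child, so only
# O(depth) = O(log_k #nodes) blocks run per sequence.

class Node(nn.Module):
    def __init__(self, dim, k, depth):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.children_ = nn.ModuleList(
            [Node(dim, k, depth - 1) for _ in range(k)] if depth > 0 else []
        )
        self.selector = nn.Linear(dim, k) if depth > 0 else None

    def forward(self, x):
        x = self.block(x)
        if not self.children_:
            return x                            # reached a leaf
        scores = self.selector(x.mean(dim=1))   # pool tokens, score children
        child = int(scores.argmax(dim=-1)[0])   # route to the best child
        return self.children_[child](x)

tree = Node(dim=32, k=2, depth=3)     # complete binary tree: 15 blocks total
out = tree(torch.randn(1, 8, 32))     # but only 4 blocks on the chosen path run
print(out.shape)
```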
[NLP-5] The Super Weight in Large Language Models
[Quick Read]: This paper asks how to identify and exploit the tiny fraction of parameters in a large language model (LLM) that are critical to its quality, dubbed super weights. The key to the solution is a data-free method that identifies these super weights with a single forward pass through the model. The paper further finds that super weights induce correspondingly rare and extremely large activations, called super activations. Preserving super weights and super activations at high precision substantially improves quantization, making simple weight and activation quantization competitive with state-of-the-art methods.
Link: https://arxiv.org/abs/2411.07191
Authors: Mengxia Yu, De Wang, Qi Shan, Colorado Reed, Alvin Wan
Keywords (EN): Large Language Model, Language Model, Large Language, Recent works, disproportionately important
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Recent works have shown a surprising result: a small fraction of Large Language Model (LLM) parameter outliers are disproportionately important to the quality of the model. LLMs contain billions of parameters, so these small fractions, such as 0.01%, translate to hundreds of thousands of parameters. In this work, we present an even more surprising finding: Pruning as few as a single parameter can destroy an LLM’s ability to generate text – increasing perplexity by 3 orders of magnitude and reducing zero-shot accuracy to guessing. We propose a data-free method for identifying such parameters, termed super weights, using a single forward pass through the model. We additionally find that these super weights induce correspondingly rare and large activation outliers, termed super activations. When preserved with high precision, super activations can improve simple round-to-nearest quantization to become competitive with state-of-the-art methods. For weight quantization, we similarly find that by preserving the super weight and clipping other weight outliers, round-to-nearest quantization can scale to much larger block sizes than previously considered. To facilitate further research into super weights, we provide an index of super weight coordinates for common, openly available LLMs.
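The quantization recipe, preserving the super weight at full precision, clipping other outliers, then applying round-to-nearest, can be sketched as follows. Here the super weight is located by sheer magnitude for brevity, whereas the paper identifies it data-free via a single forward pass through the model.

```python
import torch

# Sketch: round-to-nearest (RTN) weight quantization that keeps one
# "super weight" in full precision and clips other outliers first.
# Thresholds and the synthetic weight matrix are illustrative only.

def rtn_quantize_preserving(w, super_idx, n_bits=4, clip_pct=0.999):
    saved = w.flatten()[super_idx].clone()       # preserve the super weight
    clip = torch.quantile(w.abs().flatten(), clip_pct)
    wq = w.clamp(-clip, clip)                    # clip other weight outliers
    scale = clip / (2 ** (n_bits - 1) - 1)
    wq = torch.round(wq / scale) * scale         # round-to-nearest
    wq.flatten()[super_idx] = saved              # restore at full precision
    return wq

w = torch.randn(64, 64)
w[3, 7] = 50.0                                   # plant an extreme outlier
super_idx = w.abs().flatten().argmax()           # here: found by magnitude
wq = rtn_quantize_preserving(w, super_idx)
print(wq[3, 7])                                  # kept exactly: 50.0
```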
[NLP-6] Counterfactual Generation from Language Models
[Quick Read]: This paper studies how to understand and manipulate the causal generation mechanisms in language models to control their behavior precisely. The key to the solution is a framework for generating true string counterfactuals based on Generalized Structural-equation Models and the Gumbel-max trick. By reformulating the language model, the method models the joint distribution over original strings and their counterfactuals under the same instantiation of sampling noise. The paper also develops an algorithm based on hindsight Gumbel sampling to infer the latent noise variables and generate counterfactuals of observed strings. Experiments show the approach produces meaningful counterfactuals while revealing that commonly used intervention techniques have considerable unintended side effects.
Link: https://arxiv.org/abs/2411.07180
Authors: Shauli Ravfogel, Anej Svete, Vésteinn Snæbjarnarson, Ryan Cotterell
Keywords (EN): Understanding and manipulating, causal generation mechanisms, controlling their behavior, generation mechanisms, essential for controlling
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: A preprint
Abstract:Understanding and manipulating the causal generation mechanisms in language models is essential for controlling their behavior. Previous work has primarily relied on techniques such as representation surgery – e.g., model ablations or manipulation of linear subspaces tied to specific concepts – to intervene on these models. To understand the impact of interventions precisely, it is useful to examine counterfactuals – e.g., how a given sentence would have appeared had it been generated by the model following a specific intervention. We highlight that counterfactual reasoning is conceptually distinct from interventions, as articulated in Pearl's causal hierarchy. Based on this observation, we propose a framework for generating true string counterfactuals by reformulating language models as Generalized Structural-equation Models using the Gumbel-max trick. This allows us to model the joint distribution over original strings and their counterfactuals resulting from the same instantiation of the sampling noise. We develop an algorithm based on hindsight Gumbel sampling that allows us to infer the latent noise variables and generate counterfactuals of observed strings. Our experiments demonstrate that the approach produces meaningful counterfactuals while at the same time showing that commonly used intervention techniques have considerable undesired side effects.
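The Gumbel-max view of sampling that underlies the counterfactual construction can be shown in a few lines: fix the sampling noise, intervene on the model, and decode again under the same noise. The two toy logit vectors stand in for a language model's next-token distributions before and after an intervention; the paper's hindsight Gumbel sampling additionally infers noise consistent with an observed string, which is not reproduced here.

```python
import numpy as np

# Sketch of the Gumbel-max view of sampling used for counterfactuals:
# fix the sampling noise, change the model (an intervention), and decode
# again under the SAME noise. The two-logit "models" are toy stand-ins
# for a real LM's next-token distributions.

rng = np.random.default_rng(0)
vocab = ["cat", "dog", "bird"]

def gumbel_max_sample(logits, gumbel_noise):
    # argmax of (logits + Gumbel noise) is a sample from softmax(logits)
    return int(np.argmax(logits + gumbel_noise))

noise = rng.gumbel(size=len(vocab))                # shared noise instance

logits_original = np.array([2.0, 1.0, 0.5])        # factual model
logits_intervened = np.array([1.0, 2.5, 0.5])      # model after intervention

factual = gumbel_max_sample(logits_original, noise)
counterfactual = gumbel_max_sample(logits_intervened, noise)
print(vocab[factual], "->", vocab[counterfactual])
```

Because both draws reuse the same noise, the second output is a counterfactual ("what this sample would have been under the intervention") rather than an independent sample.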
[NLP-7] More Expressive Attention with Negative Weights
[Quick Read]: This paper addresses the limited expressiveness and representational collapse caused by the non-negativity constraint in conventional softmax attention. The key to the solution is Cog Attention, a novel attention mechanism with two core properties: (1) attention weights may be negative, so a head can dynamically delete, copy, or retain tokens by assigning them negative, positive, or minimal weights, increasing flexibility and expressiveness; and (2) by reducing effective information paths from earlier to later tokens, it improves robustness against representational collapse. Cog Attention is applied to language modeling and image generation, where experiments show it outperforms conventional softmax attention modules.
Link: https://arxiv.org/abs/2411.07176
Authors: Ang Lv, Ruobing Xie, Shuaipeng Li, Jiayi Liao, Xingwu Sun, Zhanhui Kang, Rui Yan
Keywords (EN): Cog Attention, named Cog Attention, attention, enables attention weights, static OV matrix
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:We propose a novel attention mechanism, named Cog Attention, that enables attention weights to be negative for enhanced expressiveness, which stems from two key factors: (1) Cog Attention can shift the token deletion and copying function from a static OV matrix to dynamic QK inner products, with the OV matrix now focusing more on refinement or modification. The attention head can simultaneously delete, copy, or retain tokens by assigning them negative, positive, or minimal attention weights, respectively. As a result, a single attention head becomes more flexible and expressive. (2) Cog Attention improves the model’s robustness against representational collapse, which can occur when earlier tokens are over-squashed into later positions, leading to homogeneous representations. Negative weights reduce effective information paths from earlier to later tokens, helping to mitigate this issue. We develop Transformer-like models which use Cog Attention as attention modules, including decoder-only models for language modeling and U-ViT diffusion models for image generation. Experiments show that models using Cog Attention exhibit superior performance compared to those employing traditional softmax attention modules. Our approach suggests a promising research direction for rethinking and breaking the entrenched constraints of traditional softmax attention, such as the requirement for non-negative weights.
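To illustrate what dropping the non-negativity constraint means mechanically, here is a stand-in attention that normalizes raw QK scores by their L1 mass so weights can be negative, positive, or near zero. This is not Cog Attention's actual formulation, only a minimal demonstration of signed attention weights.

```python
import torch

# Illustration of attention with signed weights. Standard softmax forces
# non-negative attention; the stand-in below normalizes raw QK scores by
# their L1 mass so weights can be negative (delete), positive (copy), or
# near zero (ignore). NOT Cog Attention's exact formulation.

def signed_attention(q, k, v, eps=1e-6):
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
    weights = scores / (scores.abs().sum(dim=-1, keepdim=True) + eps)
    return weights @ v, weights

q = torch.randn(1, 4, 8)   # (batch, query positions, dim)
k = torch.randn(1, 6, 8)   # (batch, key positions, dim)
v = torch.randn(1, 6, 8)
out, w = signed_attention(q, k, v)
print(w[0, 0])             # some weights come out negative
```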
[NLP-8] Continual Memorization of Factoids in Large Language Models
[Quick Read]: This paper addresses large language models' (LLMs) forgetting of long-tail or specialized factoids during continual memorization. The key to the solution is REMIX (Random and Generic Data Mixing), a strategy that mixes generic data sampled from pretraining corpora, or even randomly generated word sequences, into each training stage to prevent forgetting of the factoids memorized in the first stage. REMIX alters the learning process so that the model stores factoids in earlier layers than usual and diversifies the set of layers storing them, effectively mitigating forgetting.
Link: https://arxiv.org/abs/2411.07175
Authors: Howard Chen, Jiayi Geng, Adithya Bhaskar, Dan Friedman, Danqi Chen
Keywords (EN): Large language models, Large language, absorb a massive, massive amount, inefficient for acquiring
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Large language models can absorb a massive amount of knowledge through pretraining, but pretraining is inefficient for acquiring long-tailed or specialized facts. Therefore, fine-tuning on specialized or new knowledge that reflects changes in the world has become popular, though it risks disrupting the model’s original capabilities. We study this fragility in the context of continual memorization, where the model is trained on a small set of long-tail factoids (factual associations) and must retain these factoids after multiple stages of subsequent training on other datasets. Through extensive experiments, we show that LLMs suffer from forgetting across a wide range of subsequent tasks, and simple replay techniques do not fully prevent forgetting, especially when the factoid datasets are trained in the later stages. We posit that there are two ways to alleviate forgetting: 1) protect the memorization process as the model learns the factoids, or 2) reduce interference from training in later stages. With this insight, we develop an effective mitigation strategy: REMIX (Random and Generic Data Mixing). REMIX prevents forgetting by mixing generic data sampled from pretraining corpora or even randomly generated word sequences during each stage, despite being unrelated to the memorized factoids in the first stage. REMIX can recover performance from severe forgetting, often outperforming replay-based methods that have access to the factoids from the first stage. We then analyze how REMIX alters the learning process and find that successful forgetting prevention is associated with a pattern: the model stores factoids in earlier layers than usual and diversifies the set of layers that store these factoids. The efficacy of REMIX invites further investigation into the underlying dynamics of memorization and forgetting, opening exciting possibilities for future research.
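The mixing strategy itself is simple to sketch: at each stage, dilute the stage's dataset with generic pretraining text or randomly generated word sequences. The mix ratio and word pool below are illustrative assumptions.

```python
import random

# Sketch of REMIX-style data mixing: at every training stage, pad the
# stage's dataset with generic pretraining text or randomly generated
# word sequences. Mix ratio and word list are illustrative assumptions.

random.seed(0)
WORDS = ["alpha", "beta", "gamma", "delta", "epsilon"]

def random_word_sequence(length=8):
    return " ".join(random.choices(WORDS, k=length))

def remix(stage_examples, generic_pool, mix_ratio=0.5, use_random_words=False):
    n_mix = int(len(stage_examples) * mix_ratio)
    if use_random_words:
        mixed = [random_word_sequence() for _ in range(n_mix)]
    else:
        mixed = random.sample(generic_pool, n_mix)
    combined = stage_examples + mixed
    random.shuffle(combined)
    return combined                # train this stage on the diluted mixture

factoids = ["Q: capital of X? A: Y", "Q: who wrote Z? A: W"]
generic = ["some pretraining sentence " + str(i) for i in range(100)]
print(remix(factoids, generic, mix_ratio=1.0))
```

The counterintuitive part reported by the paper is that even random word sequences, unrelated to the memorized factoids, help protect them in later stages.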
[NLP-9] A Primer on Word Embeddings: AI Techniques for Text Analysis in Social Work
[Quick Read]: This paper asks how to analyze text data more effectively in social work research. The key to the solution is introducing word embeddings, mathematical representations that capture meaning and relationships in text more precisely than traditional keyword-based approaches. The paper lays out fundamental concepts, technical foundations, and practical applications such as semantic search, clustering, and retrieval augmented generation, and demonstrates through concrete social work examples (e.g., analyzing case notes for housing-instability patterns) how embeddings can enhance research workflows. It also notes limitations to overcome, including information loss, training-data constraints, and potential biases, and stresses the importance of developing domain-specific models, creating accessible tools, and establishing best practices aligned with social work's ethical principles.
Link: https://arxiv.org/abs/2411.07156
Authors: Brian E. Perron, Kelley A. Rivenburgh, Bryan G. Victor, Zia Qi, Hui Luan
Keywords (EN): social work, Word embeddings represent, offering sophisticated tools, policy documents, offering sophisticated
Subjects: Computation and Language (cs.CL)
Comments: 37 pages, 3 figures
Abstract:Word embeddings represent a transformative technology for analyzing text data in social work research, offering sophisticated tools for understanding case notes, policy documents, research literature, and other text-based materials. This methodological paper introduces word embeddings to social work researchers, explaining how these mathematical representations capture meaning and relationships in text data more effectively than traditional keyword-based approaches. We discuss fundamental concepts, technical foundations, and practical applications, including semantic search, clustering, and retrieval augmented generation. The paper demonstrates how embeddings can enhance research workflows through concrete examples from social work practice, such as analyzing case notes for housing instability patterns and comparing social work licensing examinations across languages. While highlighting the potential of embeddings for advancing social work research, we acknowledge limitations including information loss, training data constraints, and potential biases. We conclude that successfully implementing embedding technologies in social work requires developing domain-specific models, creating accessible tools, and establishing best practices aligned with social work’s ethical principles. This integration can enhance our ability to analyze complex patterns in text data while supporting more effective services and interventions.
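As a concrete illustration of the semantic-search application described above, the sketch below ranks documents by cosine similarity to a query embedding. The tiny hand-made 3-dimensional vectors stand in for real model-produced embeddings with hundreds of dimensions.

```python
import numpy as np

# Minimal sketch of embedding-based semantic search: embed documents and a
# query, then rank by cosine similarity. Vectors here are hand-made toys.

docs = {
    "note A": np.array([0.9, 0.1, 0.0]),   # housing-related case note
    "note B": np.array([0.1, 0.8, 0.2]),
    "note C": np.array([0.8, 0.2, 0.1]),
}
query = np.array([1.0, 0.0, 0.0])           # e.g., "housing instability"

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked)  # notes most semantically similar to the query come first
```

Unlike keyword matching, this retrieves "note C" even if it never contains the literal query words, which is the advantage the paper emphasizes.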
[NLP-10] HierTOD: A Task-Oriented Dialogue System Driven by Hierarchical Goals
[Quick Read]: This paper addresses the challenges task-oriented dialogue (TOD) systems face in enterprise settings: task complexity and the lack of standardized documentation. The key to the solution is HierTOD, a system driven by hierarchical goals that supports composite workflows, enabling more proactive, mixed-initiative dialogue management. HierTOD integrates components for natural language understanding, composite goal retrieval, dialogue management, and response generation, backed by a well-organized data service with a domain knowledge base and retrieval engine, delivering efficient assistance for information collection and task execution. By unifying two TOD paradigms, slot-filling and step-by-step guidance, HierTOD proves effective and helpful in a human study.
Link: https://arxiv.org/abs/2411.07152
Authors: Lingbo Mo, Shun Jiang, Akash Maharaj, Bernard Hishamunda, Yunyao Li
Keywords (EN): single-layered workflow structure, systems assist users, hotel bookings, assist users, users in completing
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Task-Oriented Dialogue (TOD) systems assist users in completing tasks through natural language interactions, often relying on a single-layered workflow structure for slot-filling in public tasks, such as hotel bookings. However, in enterprise environments, which involve rich domain-specific knowledge, TOD systems face challenges due to task complexity and the lack of standardized documentation. In this work, we introduce HierTOD, an enterprise TOD system driven by hierarchical goals and can support composite workflows. By focusing on goal-driven interactions, our system serves a more proactive role, facilitating mixed-initiative dialogue and improving task completion. Equipped with components for natural language understanding, composite goal retriever, dialogue management, and response generation, backed by a well-organized data service with domain knowledge base and retrieval engine, HierTOD delivers efficient task assistance. Furthermore, our system implementation unifies two TOD paradigms: slot-filling for information collection and step-by-step guidance for task execution. Our human study demonstrates the effectiveness and helpfulness of HierTOD in performing both paradigms.
[NLP-11] Greenback Bears and Fiscal Hawks: Finance is a Jungle and Text Embeddings Must Adapt EMNLP2024
[Quick Read]: This paper addresses the poor performance of general-purpose text embeddings on the specialized terminology, jargon, and acronyms of the financial domain. The key to the solution is BAM embeddings, text embeddings fine-tuned for finance on a carefully constructed dataset of 14.3M query-passage pairs. BAM embeddings reach 62.8% Recall@1 on a held-out test set, versus 39.2% for the best general-purpose embedding, raise question-answering accuracy on FinanceBench by 8%, and show greater sensitivity to finance-specific elements, especially in detailed, forward-looking, and company- and date-specific queries. The paper details the approach and quantifies the importance of hard negative mining and dataset scale.
Link: https://arxiv.org/abs/2411.07142
Authors: Peter Anderson, Mano Vikash Janardhanan, Jason He, Wei Cheng, Charlie Flanagan
Keywords (EN): Financial documents, arcane jargon, BAM embeddings, text embeddings, documents are filled
Subjects: Computation and Language (cs.CL)
Comments: EMNLP 2024
Abstract:Financial documents are filled with specialized terminology, arcane jargon, and curious acronyms that pose challenges for general-purpose text embeddings. Yet, few text embeddings specialized for finance have been reported in the literature, perhaps in part due to a lack of public datasets and benchmarks. We present BAM embeddings, a set of text embeddings finetuned on a carefully constructed dataset of 14.3M query-passage pairs. Demonstrating the benefits of domain-specific training, BAM embeddings achieve Recall@1 of 62.8% on a held-out test set, vs. only 39.2% for the best general-purpose text embedding from OpenAI. Further, BAM embeddings increase question answering accuracy by 8% on FinanceBench and show increased sensitivity to the finance-specific elements that are found in detailed, forward-looking and company and date-specific queries. To support further research we describe our approach in detail, quantify the importance of hard negative mining and dataset scale.
[NLP-12] Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models
[Quick Read]: This paper addresses the lack of evaluation of existing large language models' (LLMs) Chinese factuality, proposing the first comprehensive Chinese benchmark for this: Chinese SimpleQA. The key to the solution is a benchmark with five properties: Chinese, diverse, high-quality, static, and easy-to-evaluate. Concretely, it spans 6 major topics with 99 diverse subtopics, applies a rigorous quality-control process to ensure high-quality questions and answers, and keeps reference answers static so they do not change over time. Following SimpleQA's simplicity, questions and answers are very short, and grading is straightforward via the OpenAI API. The benchmark aims to help developers better understand and improve their models' Chinese factuality and to facilitate the growth of foundation models.
Link: https://arxiv.org/abs/2411.07140
Authors: Yancheng He, Shilong Li, Jiaheng Liu, Yingshui Tan, Hui Huang, Weixun Wang, Xingyuan Bu, Hangyu Guo, Chengwei Hu, Boren Zheng, Xuepeng Liu, Dekai Sun, Wenbo Su, Bo Zheng
Keywords (EN): Large Language Models, development of Large, Large Language, Chinese, Chinese SimpleQA
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:New LLM evaluation benchmarks are important to align with the rapid development of Large Language Models (LLMs). In this work, we present Chinese SimpleQA, the first comprehensive Chinese benchmark to evaluate the factuality ability of language models to answer short questions, and Chinese SimpleQA mainly has five properties (i.e., Chinese, Diverse, High-quality, Static, Easy-to-evaluate). Specifically, first, we focus on the Chinese language over 6 major topics with 99 diverse subtopics. Second, we conduct a comprehensive quality control process to achieve high-quality questions and answers, where the reference answers are static and cannot be changed over time. Third, following SimpleQA, the questions and answers are very short, and the grading process is easy-to-evaluate based on OpenAI API. Based on Chinese SimpleQA, we perform a comprehensive evaluation on the factuality abilities of existing LLMs. Finally, we hope that Chinese SimpleQA could guide the developers to better understand the Chinese factuality abilities of their models and facilitate the growth of foundation models.
[NLP-13] Stronger Models are NOT Stronger Teachers for Instruction Tuning
[Quick Read]: This paper asks whether using larger or stronger models as response generators for instruction tuning necessarily improves a smaller model's instruction-following ability. Extensive experiments reveal the Larger Models' Paradox: larger, stronger models are not necessarily stronger teachers of smaller models. The key to the solution is a new metric, Compatibility-Adjusted Reward (CAR), which accounts for the compatibility between the teacher and the base model being fine-tuned, thereby measuring response-generator effectiveness more accurately. Experiments show CAR outperforms almost all baselines at evaluating response generators.
Link: https://arxiv.org/abs/2411.07133
Authors: Zhangchen Xu, Fengqing Jiang, Luyao Niu, Bill Yuchen Lin, Radha Poovendran
Keywords (EN): ensure large language, follow user instructions, user instructions effectively, large language models, follow user
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Instruction tuning has been widely adopted to ensure large language models (LLMs) follow user instructions effectively. The resulting instruction-following capabilities of LLMs heavily rely on the instruction datasets used for tuning. Recently, synthetic instruction datasets have emerged as an economically viable solution to provide LLMs diverse and high-quality instructions. However, existing approaches typically assume that larger or stronger models are stronger teachers for instruction tuning, and hence simply adopt these models as response generators to the synthetic instructions. In this paper, we challenge this commonly-adopted assumption. Our extensive experiments across five base models and twenty response generators reveal that larger and stronger models are not necessarily stronger teachers of smaller models. We refer to this phenomenon as the Larger Models’ Paradox. We observe that existing metrics cannot precisely predict the effectiveness of response generators since they ignore the compatibility between teachers and base models being fine-tuned. We thus develop a novel metric, named as Compatibility-Adjusted Reward (CAR) to measure the effectiveness of response generators. Our experiments across five base models demonstrate that CAR outperforms almost all baselines.
[NLP-14] Retrieval or Global Context Understanding? On Many-Shot In-Context Learning for Long-Context Evaluation
[Quick Read]: This paper addresses a gap in existing long-context language model (LCLM) evaluation: current benchmarks mainly measure retrieval abilities over extended inputs, while neglecting models' capacity for global context understanding, i.e., synthesizing and reasoning over content across the input to generate a response. The key to the solution is a new many-shot in-context learning (ICL) benchmark, MANYICLBENCH, designed to characterize LCLMs' retrieval and global context understanding capabilities separately. Benchmarking with it shows that while state-of-the-art models perform well on retrieval-oriented tasks, many degrade significantly on global-context tasks at only 16k tokens.
Link: https://arxiv.org/abs/2411.07130
Authors: Kaijian Zou, Muhammad Khalifa, Lu Wang
Keywords (EN): primarily measure LMs’, handle long-context information, LMs’ retrieval abilities, measure LMs’ retrieval, global context understanding
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Language models (LMs) have demonstrated an improved capacity to handle long-context information, yet existing long-context benchmarks primarily measure LMs’ retrieval abilities with extended inputs, e.g., pinpointing a short phrase from long-form text. Therefore, they may fall short when evaluating models’ global context understanding capacity, such as synthesizing and reasoning over content across input to generate the response. In this paper, we study long-context language model (LCLM) evaluation through many-shot in-context learning (ICL). Concretely, we identify the skills each ICL task requires, and examine models’ long-context capabilities on them. We first ask: What types of ICL tasks benefit from additional demonstrations, and are these tasks effective at evaluating LCLMs? We find that classification and summarization tasks show notable performance improvements with additional demonstrations, while translation and reasoning tasks do not exhibit clear trends. This suggests the classification tasks predominantly test models’ retrieval skills. Next, we ask: To what extent does each task require retrieval skills versus global context understanding from LCLMs? We develop metrics to categorize ICL tasks into two groups: (i) retrieval tasks that require strong retrieval ability to pinpoint relevant examples, and (ii) global context understanding tasks that necessitate a deeper comprehension of the full input. We find that not all datasets can effectively evaluate these long-context capabilities. To address this gap, we introduce a new many-shot ICL benchmark, MANYICLBENCH, designed to characterize LCLMs’ retrieval and global context understanding capabilities separately. Benchmarking 11 open-weight LCLMs with MANYICLBENCH, we find that while state-of-the-art models perform well in retrieval tasks up to 64k tokens, many show significant drops in global context tasks at just 16k tokens.
[NLP-15] Benchmarking LLMs' Judgments with No Gold Standard
[Quick Read]: This paper asks how to evaluate large language models (LLMs) on generating informative judgments, such as academic peer review, when no gold-standard reference exists. The key to the solution is GEM (Generative Estimator for Mutual Information), an evaluation metric that uses a generative model to estimate the mutual information between candidate and reference responses without requiring the reference to be a gold standard. In experiments GEM correlates competitively with human scores and is more robust than the GPT-4o Examiner against strategic manipulations such as rephrasing or elongation. The paper also introduces GRE-bench (Generating Review Evaluation Benchmark), a GEM-based benchmark that assesses LLMs' ability to generate high-quality academic peer reviews and, thanks to GEM's robustness, sidesteps data contamination by drawing on the continuous influx of new open-access papers and reviews.
Link: https://arxiv.org/abs/2411.07127
Authors: Shengwei Xu, Yuxuan Lu, Grant Schoenebeck, Yuqing Kong
Keywords (EN): Large Language Models, generating informative judgments, Generative Estimator, gold standard, Large Language
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:We introduce the GEM (Generative Estimator for Mutual Information), an evaluation metric for assessing language generation by Large Language Models (LLMs), particularly in generating informative judgments, without the need for a gold standard reference. GEM broadens the scenarios where we can benchmark LLM generation performance: from traditional ones, like machine translation and summarization, where gold standard references are readily available, to subjective tasks without clear gold standards, such as academic peer review. GEM uses a generative model to estimate mutual information between candidate and reference responses, without requiring the reference to be a gold standard. In experiments on a human-annotated dataset, GEM demonstrates competitive correlations with human scores compared to the state-of-the-art GPT-4o Examiner, and outperforms all other baselines. Additionally, GEM is more robust against strategic manipulations, such as rephrasing or elongation, which can artificially inflate scores under a GPT-4o Examiner. We also present GRE-bench (Generating Review Evaluation Benchmark) which evaluates LLMs based on how well they can generate high-quality peer reviews for academic research papers. Because GRE-bench is based upon GEM, it inherits its robustness properties. Additionally, GRE-bench circumvents data contamination problems (or data leakage) by using the continuous influx of new open-access research papers and peer reviews each year. We show GRE-bench results of various popular LLMs on their peer review capabilities using the ICLR2023 dataset.
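The quantity GEM estimates can be illustrated with a pointwise-mutual-information flavor: how much does conditioning on the reference raise a generative model's log-probability of the candidate? The `log_prob` function below is a toy stand-in for a real language model, and this sketch does not reproduce GEM's actual training or estimation procedure.

```python
import math  # kept for clarity; a real log-likelihood would use it

# Hedged sketch of the idea behind a generative mutual-information
# estimator: score a candidate by a pointwise mutual information
#   PMI = log p(candidate | reference) - log p(candidate).
# `log_prob` is a toy stand-in for a real LM's log-likelihood.

def log_prob(text: str, condition: str = "") -> float:
    # Toy stand-in: reward word overlap with the condition.
    words, cond = set(text.split()), set(condition.split())
    return -len(words) * 0.5 + 1.0 * len(words & cond)

def pmi(candidate: str, reference: str) -> float:
    return log_prob(candidate, reference) - log_prob(candidate)

print(pmi("clear novel contribution", "the paper is novel and clear"))  # 2.0
print(pmi("unrelated words here", "the paper is novel and clear"))      # 0.0
```

A mutual-information-style score rewards responses that genuinely share information with the reference, which is why mere rephrasing or padding cannot inflate it the way it can inflate an LLM examiner's rating.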
[NLP-16] SCAR: Sparse Conditioned Autoencoders for Concept Detection and Steering in LLMs
[Quick Read]: This paper addresses the problem that large language models (LLMs) may generate text misaligned with user intent or outright harmful. The key to the solution is the Sparse Conditioned Autoencoder (SCAR), a separately trained module that extends an otherwise untouched LLM. SCAR detects and steers concepts (such as toxicity) before generation, providing full steerability toward or away from those concepts without compromising the model's text-generation quality on standard evaluation benchmarks, thereby establishing a robust framework for the ethical and safe deployment of LLMs in real-world applications.
Link: https://arxiv.org/abs/2411.07122
Authors: Ruben Härle, Felix Friedrich, Manuel Brack, Björn Deiseroth, Patrick Schramowski, Kristian Kersting
Keywords (EN): Large Language Models, Large Language, demonstrated remarkable capabilities, produce harmful content, generating human-like text
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities in generating human-like text, but their output may not be aligned with the user or even produce harmful content. This paper presents a novel approach to detect and steer concepts such as toxicity before generation. We introduce the Sparse Conditioned Autoencoder (SCAR), a single trained module that extends the otherwise untouched LLM. SCAR ensures full steerability, towards and away from concepts (e.g., toxic content), without compromising the quality of the model’s text generation on standard evaluation benchmarks. We demonstrate the effective application of our approach through a variety of concepts, including toxicity, safety, and writing style alignment. As such, this work establishes a robust framework for controlling LLM generations, ensuring their ethical and safe deployment in real-world applications.
[NLP-17] Building a Taiwanese Mandarin Spoken Language Model: A First Attempt
[Quick Read]: This paper tackles building a spoken large language model (LLM) for Taiwanese Mandarin for multi-turn conversations, aiming at real-time speech-to-speech interaction. The key to the solution is an end-to-end decoder-only transformer architecture that supports full-duplex communication, allowing simultaneous speaking and listening, trained on synthesized dialogues and adjusted for real-time interaction. The authors also developed a platform to evaluate conversational fluency and response coherence in multi-turn dialogues.
Link: https://arxiv.org/abs/2411.07111
Authors: Chih-Kai Yang, Yu-Kuan Fu, Chen-An Li, Yi-Cheng Lin, Yu-Xiang Lin, Wei-Chih Chen, Ho Lam Chung, Chun-Yi Kuan, Wei-Ping Huang, Ke-Han Lu, Tzu-Quan Lin, Hsiu-Hsuan Wang, En-Pei Hu, Chan-Jan Hsu, Liang-Hsuan Tseng, I-Hsiang Chiu, Ulin Sanga, Xuanjun Chen, Po-chun Hsu, Shu-wen Yang, Hung-yi Lee
Keywords (EN): large language model, technical report presents, spoken large language, Taiwanese Mandarin, specifically tailored
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Work in progress
Abstract:This technical report presents our initial attempt to build a spoken large language model (LLM) for Taiwanese Mandarin, specifically tailored to enable real-time, speech-to-speech interaction in multi-turn conversations. Our end-to-end model incorporates a decoder-only transformer architecture and aims to achieve seamless interaction while preserving the conversational flow, including full-duplex capabilities allowing simultaneous speaking and listening. The paper also details the training process, including data preparation with synthesized dialogues and adjustments for real-time interaction. We also developed a platform to evaluate conversational fluency and response coherence in multi-turn dialogues. We hope the release of the report can contribute to the future development of spoken LLMs in Taiwanese Mandarin.
[NLP-18] Training Neural Networks as Recognizers of Formal Languages
[Quick Read]: This paper addresses the mismatch between empirical tests of neural architectures' computational power and the formal-language-theoretic claims they are meant to support. The key to the solution is to train and evaluate neural networks directly as binary classifiers of strings, rather than via informal proxy tasks such as language modeling or sequence-to-sequence transduction. The paper extends an algorithm recently proposed by Snæbjarnarson et al. (2024) for length-controlled sampling from regular languages, with much better asymptotic time complexity. Results across the Chomsky hierarchy show that simple RNNs and LSTMs often outperform causally masked transformers, and that auxiliary objectives such as language modeling can help, though no single objective improves performance uniformly across languages and architectures. The authors release the FLaRe (Formal Language Recognition) benchmark and code to support theoretically sound future work on language recognition.
Link: https://arxiv.org/abs/2411.07107
Authors: Alexandra Butoi, Ghazal Khalighinejad, Anej Svete, Josef Valvoda, Ryan Cotterell, Brian DuSell
Keywords (EN): Characterizing the computational, formal language theory, language theory remains, formal language, line of research
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 40 pages, 2 figures. Preprint
Abstract:Characterizing the computational power of neural network architectures in terms of formal language theory remains a crucial line of research, as it describes lower and upper bounds on the reasoning capabilities of modern AI. However, when empirically testing these bounds, existing work often leaves a discrepancy between experiments and the formal claims they are meant to support. The problem is that formal language theory pertains specifically to recognizers: machines that receive a string as input and classify whether it belongs to a language. On the other hand, it is common to instead use proxy tasks that are similar in only an informal sense, such as language modeling or sequence-to-sequence transduction. We correct this mismatch by training and evaluating neural networks directly as binary classifiers of strings, using a general method that can be applied to a wide variety of languages. As part of this, we extend an algorithm recently proposed by Snæbjarnarson et al. (2024) to do length-controlled sampling of strings from regular languages, with much better asymptotic time complexity than previous methods. We provide results on a variety of languages across the Chomsky hierarchy for three neural architectures: a simple RNN, an LSTM, and a causally-masked transformer. We find that the RNN and LSTM often outperform the transformer, and that auxiliary training objectives such as language modeling can help, although no single objective uniformly improves performance across languages and architectures. Our contributions will facilitate theoretically sound empirical testing of language recognition claims in future work. We have released our datasets as a benchmark called FLaRe (Formal Language Recognition), along with our code.
[NLP-19] Transformer verbatim in-context retrieval across time and scale (CoNLL 2024)
[Quick Read]: This paper studies how language models develop the ability to retrieve arbitrary in-context nouns verbatim over the course of training. The key finding is a sudden transition early in training, after roughly 1% of the training tokens, at which verbatim retrieval emerges; the transition is observed across model scales (14M to 12B parameters) and occurs slightly later for the two smallest models. The development of verbatim in-context retrieval correlates positively with performance on zero-shot benchmarks. Around the transition point, all models show an advantage for retrieving concrete nouns over abstract ones, an advantage that dissipates toward the end of training in all but the two smallest models.
Link: https://arxiv.org/abs/2411.07075
Authors: Kristijan Armeni, Marko Pranjić, Senja Pollak
Keywords (EN): predict upcoming text, cases retrieve in-context, retrieve in-context information, language models, in-context information verbatim
Subjects: Computation and Language (cs.CL)
Comments: accepted to Conference on Natural Language Learning 2024 (this https URL)
Abstract:To predict upcoming text, language models must in some cases retrieve in-context information verbatim. In this report, we investigated how the ability of language models to retrieve arbitrary in-context nouns developed during training (across time) and as language models trained on the same dataset increase in size (across scale). We then asked whether learning of in-context retrieval correlates with learning of more challenging zero-shot benchmarks. Furthermore, inspired by semantic effects in human short-term memory, we evaluated the retrieval with respect to a major semantic component of target nouns, namely whether they denote a concrete or abstract entity, as rated by humans. We show that verbatim in-context retrieval developed in a sudden transition early in the training process, after about 1% of the training tokens. This was observed across model sizes (from 14M and up to 12B parameters), and the transition occurred slightly later for the two smallest models. We further found that the development of verbatim in-context retrieval is positively correlated with the learning of zero-shot benchmarks. Around the transition point, all models showed the advantage of retrieving concrete nouns as opposed to abstract nouns. In all but two smallest models, the advantage dissipated away toward the end of training.
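The probing paradigm can be sketched as follows: present a noun list, repeat its prefix later in the prompt, and test whether the next predicted word matches the listed noun verbatim. The prompt template is an assumption, and the stand-in "model" below simply performs perfect induction from context.

```python
# Sketch of a verbatim in-context retrieval probe: show a noun list once,
# repeat its prefix, and check whether the model predicts the next noun
# exactly. `model_next_word` is a hypothetical stand-in for querying an LM.

nouns = ["table", "freedom", "engine", "justice"]  # concrete and abstract

def build_probe(nouns):
    first = "Mary wrote down: " + ", ".join(nouns) + ". "
    second = "Later she repeated: " + ", ".join(nouns[:-1]) + ","
    return first + second, nouns[-1]   # prompt and the expected noun

def model_next_word(prompt: str) -> str:
    # Stand-in "model": find the repeated prefix earlier in context and
    # return the word that followed it there (perfect induction).
    repeated = prompt.rsplit(": ", 1)[1]
    pos = prompt.find(repeated)                  # first occurrence
    tail = prompt[pos + len(repeated):].lstrip()
    return tail.split(",")[0].split(".")[0].strip()

prompt, target = build_probe(nouns)
hit = model_next_word(prompt) == target
print(prompt, "->", target, "| retrieved:", hit)
```

Scoring such probes separately for concrete versus abstract target nouns is how the concreteness advantage around the training transition can be measured.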
[NLP-20] Universal Response and Emergence of Induction in LLMs
[Quick Read]: This paper aims to understand induction, a key mechanism behind in-context learning in large language models (LLMs), whose precise circuit decomposition beyond toy models remains elusive. The key to the solution is probing model responses to weak single-token perturbations of the residual stream. The study finds that LLMs exhibit a robust, universal regime in which the response is scale-invariant under changes in perturbation strength, allowing the build-up of token correlations throughout the model to be quantified. Using this method, signatures of induction behavior are observed in the residual streams of Gemma-2-2B, Llama-3.2-3B, and GPT-2-XL, emerging gradually within intermediate layers, and the relevant model sections composing the behavior are identified. The results offer insight into the collective interplay of components within LLMs and serve as a benchmark for large-scale circuit analysis.
Link: https://arxiv.org/abs/2411.07071
Authors: Niclas Luick
Keywords (EN): precise circuit decomposition, understanding its precise, models remains elusive, considered a key, key mechanism
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 14 pages, 5 figures
Abstract:While induction is considered a key mechanism for in-context learning in LLMs, understanding its precise circuit decomposition beyond toy models remains elusive. Here, we study the emergence of induction behavior within LLMs by probing their response to weak single-token perturbations of the residual stream. We find that LLMs exhibit a robust, universal regime in which their response remains scale-invariant under changes in perturbation strength, thereby allowing us to quantify the build-up of token correlations throughout the model. By applying our method, we observe signatures of induction behavior within the residual stream of Gemma-2-2B, Llama-3.2-3B, and GPT-2-XL. Across all models, we find that these induction signatures gradually emerge within intermediate layers and identify the relevant model sections composing this behavior. Our results provide insights into the collective interplay of components within LLMs and serve as a benchmark for large-scale circuit analysis.
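A minimal version of the probing setup: add a scaled random direction to one layer's output via a forward hook and track how the downstream response scales with perturbation strength. A tiny MLP stands in for a transformer's residual stream here, and the layer choice and strengths are illustrative.

```python
import torch
import torch.nn as nn

# Sketch of probing response to weak perturbations of a hidden state:
# inject a scaled random direction at one layer via a forward hook and
# measure how the output shift scales with strength. A tiny MLP stands
# in for a transformer; layers and strengths are illustrative.

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16))
x = torch.randn(4, 16)
direction = torch.randn(16)
direction /= direction.norm()

baseline = model(x)
for strength in [1e-3, 1e-2, 1e-1]:
    handle = model[0].register_forward_hook(
        lambda mod, inp, out, s=strength: out + s * direction
    )
    response = float((model(x) - baseline).norm()) / strength
    handle.remove()
    print(f"strength={strength:.0e}  response/strength={response:.3f}")
# A roughly flat response/strength ratio across strengths indicates the
# scale-invariant regime the paper uses to quantify token correlations.
```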
[NLP-21] On Active Privacy Auditing in Supervised Fine-tuning for White-Box Language Models
[Quick Read]: This paper addresses privacy leakage during supervised fine-tuning (SFT) in NLP applications, where fine-tuning data is sensitive, domain-specific, and identifiable. The key to the solution is Parsing, a novel active privacy auditing framework that uses improved white-box membership inference attacks (MIAs) as its core technology, with novel learning objectives and a two-stage pipeline to monitor privacy leakage throughout language model (LM) fine-tuning and maximize the exposure of privacy risks. The work also improves MIA effectiveness on large LMs including GPT-2, Llama2, and certain variants of them, aiming to give the SFT community a reliable, ready-to-use privacy auditing tool and valuable insight into safeguarding privacy during fine-tuning.
Link: https://arxiv.org/abs/2411.07070
Authors: Qian Sun, Hanpeng Wu, Xi Sheryl Zhang
Keywords (EN): NLP applications, leading technique, fine-tuning process, fine-tuning, privacy
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:The pretraining and fine-tuning approach has become the leading technique for various NLP applications. However, recent studies reveal that fine-tuning data, due to their sensitive nature, domain-specific characteristics, and identifiability, pose significant privacy concerns. To help develop more privacy-resilient fine-tuning models, we introduce a novel active privacy auditing framework, dubbed Parsing, designed to identify and quantify privacy leakage risks during the supervised fine-tuning (SFT) of language models (LMs). The framework leverages improved white-box membership inference attacks (MIAs) as the core technology, utilizing novel learning objectives and a two-stage pipeline to monitor the privacy of the LMs’ fine-tuning process, maximizing the exposure of privacy risks. Additionally, we have improved the effectiveness of MIAs on large LMs including GPT-2, Llama2, and certain variants of them. Our research aims to provide the SFT community of LMs with a reliable, ready-to-use privacy auditing tool, and to offer valuable insights into safeguarding privacy during the fine-tuning process. Experimental results confirm the framework’s efficiency across various models and tasks, emphasizing notable privacy concerns in the fine-tuning process. Project code available for this https URL.
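The signal that membership inference attacks exploit can be shown with synthetic numbers: fine-tuned models tend to assign lower loss to training members than to held-out examples, so even a simple loss threshold separates the two. Parsing's white-box attack is considerably more sophisticated; this is only the baseline intuition.

```python
import numpy as np

# Sketch of the signal behind membership inference attacks (MIAs) used in
# privacy auditing: members of the fine-tuning set tend to get lower loss
# than non-members, so thresholding per-example loss separates them.
# The loss distributions below are synthetic.

rng = np.random.default_rng(0)
member_losses = rng.normal(1.0, 0.3, 1000)      # seen during fine-tuning
nonmember_losses = rng.normal(1.6, 0.3, 1000)   # held out

threshold = 1.3                                  # loss < t  =>  "member"
tpr = (member_losses < threshold).mean()
fpr = (nonmember_losses < threshold).mean()
print(f"TPR={tpr:.2f}  FPR={fpr:.2f}")           # attack advantage = TPR - FPR
```

The larger the gap between the two loss distributions, the more the fine-tuned model has leaked about its training data, which is exactly what an auditing framework wants to surface.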
[NLP-22] Zeroth-Order Adaptive Neuron Alignment Based Pruning without Re-Training
[Quick Read]: This paper addresses the impracticality of network pruning for large pre-trained models (LLMs) when re-training is prohibitively expensive. The key to the solution is NeuroAl, a top-up algorithm usable on top of any given pruning algorithm for LLMs, which exploits the activations of the dense pre-trained model to obtain sparse models that maximize neuron alignment between the sparse model's activations and the dense model's. Unlike existing methods, NeuroAl adaptively selects the best block-wise and row-wise sparsity ratios for the given model and target sparsity, requires no re-training, and consistently outperforms the latest state-of-the-art techniques across LLM families and sparsity ratios.
Link: https://arxiv.org/abs/2411.07066
Authors: Elia Cunegatti, Leonardo Lucio Custode, Giovanni Iacca
Keywords (EN): model computational cost, Network pruning, computational cost, impact on performance, aim to reduce
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Work in progress
Abstract:Network pruning is a set of computational techniques that aim to reduce a given model's computational cost by removing a subset of its parameters while having minimal impact on performance. Throughout the last decade, the most widely used pruning paradigm has focused on pruning and re-training, which nowadays is inconvenient due to the vast amount of pre-trained models, which are in any case too expensive to re-train. In this paper, we exploit functional information from dense pre-trained models, i.e., their activations, to obtain sparse models that maximize the activations' alignment w.r.t. their corresponding dense models. Hence, we propose NeuroAl, a top-up algorithm that can be used on top of any given pruning algorithm for LLMs, that modifies the block-wise and row-wise sparsity ratios to maximize the neuron alignment among activations. Moreover, differently from existing methods, our approach adaptively selects the best parameters for the block-wise and row-wise sparsity ratios w.r.t. the model and the desired sparsity (given as input), and requires no re-training. We test our method on 4 different LLM families and 3 different sparsity ratios, showing how it consistently outperforms the latest state-of-the-art techniques. The code is available at this https URL.
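The neuron-alignment objective can be sketched as scoring candidate sparsity configurations by how closely the pruned layer's activations match the dense layer's on calibration inputs. Magnitude pruning and the small ratio grid below are simplifications; NeuroAl tunes block-wise and row-wise ratios under a given overall sparsity budget.

```python
import torch

# Sketch of the "neuron alignment" objective: score a candidate sparsity
# ratio by how well the pruned layer's activations align with the dense
# layer's activations on calibration data, then keep the best candidate.

torch.manual_seed(0)
W = torch.randn(32, 64)            # dense layer weight
X = torch.randn(128, 64)           # calibration inputs
dense_out = X @ W.T

def prune_by_magnitude(W, ratio):
    k = int(W.numel() * ratio)
    thresh = W.abs().flatten().kthvalue(k).values if k > 0 else -1.0
    return W * (W.abs() > thresh)  # zero out the smallest-magnitude weights

def alignment(a, b):
    return torch.nn.functional.cosine_similarity(a.flatten(), b.flatten(), dim=0)

best = max(
    (float(alignment(X @ prune_by_magnitude(W, r).T, dense_out)), r)
    for r in [0.5, 0.6, 0.7, 0.8]
)
print(f"best alignment={best[0]:.3f} at sparsity={best[1]}")
```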
[NLP-23] Minion: A Technology Probe for Resolving Value Conflicts through Expert-Driven and User-Driven Strategies in AI Companion Applications
[Quick Read]: This paper addresses value conflicts that can arise between AI companions and users, which may offend or upset users. The key to the solution is Minion, a technology probe that helps users resolve human-AI value conflicts through a user-empowerment intervention method combining expert-driven and user-driven conflict-resolution strategies. The study created 40 value-conflict scenarios and ran a technology probe study in which 22 participants completed 274 tasks, successfully resolving conflicts 94.16% of the time, validating the approach and yielding design implications for reducing conflicts and empowering users to resolve them more effectively.
Link: https://arxiv.org/abs/2411.07042
Authors: Xianzhe Fan, Qing Xiao, Xuhui Zhou, Yuran Su, Zhicong Lu, Maarten Sap, Hong Shen
Keywords (EN): large language models, converse very naturally, large language, language models, models can role-play
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 18 pages, 5 figures
点击查看摘要
Abstract:AI companions based on large language models can role-play and converse very naturally. When value conflicts arise between the AI companion and the user, it may offend or upset the user. Yet, little research has examined such conflicts. We first conducted a formative study that analyzed 151 user complaints about conflicts with AI companions, providing design implications for our study. Based on these, we created Minion, a technology probe to help users resolve human-AI value conflicts. Minion applies a user-empowerment intervention method that provides suggestions by combining expert-driven and user-driven conflict resolution strategies. We conducted a technology probe study, creating 40 value conflict scenarios on this http URL and Talkie. 22 participants completed 274 tasks and successfully resolved conflicts 94.16% of the time. We summarize user responses, preferences, and needs in resolving value conflicts, and propose design implications to reduce conflicts and empower users to resolve them more effectively.
摘要:基于大语言模型的 AI 伴侣能够非常自然地进行角色扮演和对话。当 AI 伴侣与用户之间出现价值观冲突时,可能会冒犯或令用户感到不适。然而,关于此类冲突的研究却很少。我们首先进行了一项形成性研究,分析了 151 起用户对 AI 伴侣冲突的投诉,为我们的研究提供了设计启示。基于这些启示,我们创建了 Minion,一种技术探针,旨在帮助用户解决人机价值观冲突。Minion 采用了一种用户赋能干预方法,通过结合专家驱动和用户驱动的冲突解决策略来提供建议。我们进行了一项技术探针研究,在该 http URL 和 Talkie 上创建了 40 个价值观冲突场景。22 名参与者完成了 274 项任务,成功解决了 94.16% 的冲突。我们总结了用户在解决价值观冲突时的反应、偏好和需求,并提出了减少冲突和更有效地赋能用户解决冲突的设计启示。
[NLP-24] LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios
【速读】: 该论文试图解决大型语言模型(LLMs)在长上下文输入中稳定遵循指令的能力评估问题。现有基准测试主要关注LLMs在短上下文中的表现,而忽略了在长上下文场景下的指令遵循能力和稳定性。解决方案的关键在于引入了长上下文指令遵循基准测试(Long-context Instruction-Following Benchmark, LIFBench),这是一个可扩展的数据集,专门用于评估LLMs在长上下文中的指令遵循能力和稳定性。LIFBench包含三种长上下文场景和十一个多样化的任务,并通过自动化扩展方法生成了2,766条指令,涵盖长度、表达和变量三个维度。此外,论文还提出了基于评分标准的评估框架(LIFEval),用于自动、精确地评分复杂LLM响应,无需依赖LLM辅助评估或人工判断,从而全面分析模型在不同视角下的性能和稳定性。
链接: https://arxiv.org/abs/2411.07037
作者: Xiaodong Wu,Minhao Wang,Yichen Liu,Xiaoming Shi,He Yan,Xiangju Lu,Junmin Zhu,Wei Zhang
关键词-EN: Large Language Models, natural language processing, Large Language, language processing, natural language
类目: Computation and Language (cs.CL)
备注: 17 pages, 3 figures
点击查看摘要
Abstract:As Large Language Models (LLMs) continue to advance in natural language processing (NLP), their ability to stably follow instructions in long-context inputs has become crucial for real-world applications. While existing benchmarks assess various LLM capabilities, they rarely focus on instruction-following in long-context scenarios or stability on different inputs. In response, we introduce the Long-context Instruction-Following Benchmark (LIFBench), a scalable dataset designed to evaluate LLMs’ instruction-following capabilities and stability across long contexts. LIFBench comprises three long-context scenarios and eleven diverse tasks, supported by 2,766 instructions generated through an automated expansion method across three dimensions: length, expression, and variables. For evaluation, we propose LIFEval, a rubric-based assessment framework that provides precise, automated scoring of complex LLM responses without relying on LLM-assisted evaluations or human judgments. This approach facilitates a comprehensive analysis of model performance and stability across various perspectives. We conduct extensive experiments on 20 notable LLMs across six length intervals, analyzing their instruction-following capabilities and stability. Our work contributes LIFBench and LIFEval as robust tools for assessing LLM performance in complex, long-context settings, providing insights that can inform future LLM development.
摘要:随着大语言模型 (LLM) 在自然语言处理 (NLP) 领域的不断进步,其在长上下文输入中稳定遵循指令的能力对于实际应用变得至关重要。尽管现有的基准测试评估了各种大语言模型的能力,但它们很少关注长上下文场景中的指令遵循或不同输入下的稳定性。为此,我们引入了长上下文指令遵循基准 (LIFBench),这是一个可扩展的数据集,旨在评估大语言模型在长上下文中的指令遵循能力和稳定性。LIFBench 包含三个长上下文场景和十一个多样化的任务,支持通过自动化扩展方法在长度、表达和变量三个维度上生成的 2,766 条指令。在评估方面,我们提出了 LIFEval,这是一个基于评分标准的评估框架,能够提供精确的自动化评分,无需依赖大语言模型辅助评估或人工判断。这种方法有助于从多个角度全面分析模型的性能和稳定性。我们在六个长度区间内对 20 个知名大语言模型进行了广泛的实验,分析了它们的指令遵循能力和稳定性。我们的工作贡献了 LIFBench 和 LIFEval 作为评估大语言模型在复杂长上下文设置中性能的强大工具,为未来的大语言模型开发提供了有价值的见解。
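摘要未展开 LIFEval 的具体评分规则;以下仅示意"基于评分标准(rubric)的自动打分"这一类做法:为每条指令配置若干可程序化检查的标准,逐项检查模型响应并取通过比例,不依赖大语言模型辅助评估或人工判断。检查项均为虚构示例。

```python
from typing import Callable

def rubric_score(response: str, checks: list[Callable[[str], bool]]) -> float:
    """基于评分标准的自动打分:逐项执行可程序化检查,返回通过比例(0~1)。"""
    if not checks:
        return 0.0
    passed = sum(1 for check in checks if check(response))
    return passed / len(checks)

# 用法示例(检查项为虚构):要求响应包含指定关键词且不超过 100 词
# checks = [lambda r: "summary" in r.lower(), lambda r: len(r.split()) <= 100]
# rubric_score("A short summary of the document ...", checks)
```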
[NLP-25] UniHR: Hierarchical Representation Learning for Unified Knowledge Graph Link Prediction
【速读】: 该论文试图解决现有链接预测模型在处理不同类型事实表示(如超关系事实、时间事实和嵌套事实)时的泛化能力不足的问题。解决方案的关键在于提出了一个统一的层次表示学习框架(UniHR),该框架包括一个统一的层次数据表示模块(HiDR)和一个统一的层次结构学习模块(HiSL)。HiDR模块将超关系知识图谱、时间知识图谱和嵌套事实知识图谱统一表示为基于三元组的形式,而HiSL模块通过事实内和事实间的消息传递机制,增强了单个事实的语义信息并丰富了事实间的结构信息。实验结果表明,UniHR在处理不同类型知识图谱时表现优异,显示出其强大的泛化能力和HiSL模块的有效性。
链接: https://arxiv.org/abs/2411.07019
作者: Zhiqiang Liu,Mingyang Chen,Yin Hua,Zhuo Chen,Ziqi Liu,Lei Liang,Huajun Chen,Wen Zhang
关键词-EN: Beyond-triple fact representations, auxiliary key-value pairs, gaining significant attention, facts implying relationships, Beyond-triple fact
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Beyond-triple fact representations including hyper-relational facts with auxiliary key-value pairs, temporal facts with additional timestamps, and nested facts implying relationships between facts, are gaining significant attention. However, existing link prediction models are usually designed for one specific type of facts, making it difficult to generalize to other fact representations. To overcome this limitation, we propose a Unified Hierarchical Representation learning framework (UniHR) for unified knowledge graph link prediction. It consists of a unified Hierarchical Data Representation (HiDR) module and a unified Hierarchical Structure Learning (HiSL) module as graph encoder. The HiDR module unifies hyper-relational KGs, temporal KGs, and nested factual KGs into triple-based representations. Then HiSL incorporates intra-fact and inter-fact message passing, focusing on enhancing the semantic information within individual facts and enriching the structural information between facts. Experimental results across 7 datasets from 3 types of KGs demonstrate that our UniHR outperforms baselines designed for one specific kind of KG, indicating strong generalization capability of HiDR form and the effectiveness of HiSL module. Code and data are available at this https URL.
摘要:超越传统三元组的事实表示,包括带有辅助键值对的超关系事实、带有附加时间戳的时间事实以及暗示事实间关系的嵌套事实,正受到越来越多的关注。然而,现有的链接预测模型通常是为某一特定类型的事实设计的,这使得它们难以泛化到其他事实表示。为了克服这一局限,我们提出了一种统一的层次表示学习框架(UniHR),用于统一的知识图谱链接预测。该框架包括一个统一的层次数据表示(HiDR)模块和一个统一的层次结构学习(HiSL)模块作为图编码器。HiDR模块将超关系知识图谱、时间知识图谱和嵌套事实知识图谱统一为基于三元组的表示。随后,HiSL模块结合了事实内和事实间的消息传递,重点在于增强单个事实内的语义信息以及丰富事实间的结构信息。在来自三种知识图谱的7个数据集上的实验结果表明,我们的UniHR优于为某一特定类型知识图谱设计的基线模型,显示出HiDR形式的强大泛化能力和HiSL模块的有效性。代码和数据可在以下链接获取:https URL。
[NLP-26] Token2Wave
【速读】: 该论文试图解决现有语言模型(如BERT)在处理文本时面临的计算复杂度和内存使用问题。解决方案的关键在于提出了一种名为Token2Wave的新型token表示方法,该方法通过波形网络(Wave Network)衍生而来,能够捕捉输入文本的全局和局部语义。Token2Wave通过使用幅度分量(magnitude component)来捕捉全局语义,并通过相位分量(phase component)来编码单个token与全局语义之间的关系。这种方法不仅在计算复杂度上显著优于BERT,还能有效减少视频内存使用和训练时间,同时保持了良好的收敛行为和反向传播特性。
链接: https://arxiv.org/abs/2411.06989
作者: Xin Zhang,Victor S. Sheng
关键词-EN: Wave Network, wave-inspired complex vectors, representation method derived, global semantics, input text
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:This paper provides an in-depth analysis of Token2Wave, a novel token representation method derived from the Wave Network, designed to capture both global and local semantics of input text through wave-inspired complex vectors. In Token2Wave, each token is represented with a magnitude component, capturing the global semantics of the entire input text, and a phase component, encoding the relationships between individual tokens and the global semantics. Building on prior research that demonstrated the effectiveness of wave-like operations, such as interference and modulation, during forward propagation, this study investigates the convergence behavior, backpropagation characteristics, and embedding independence within the Token2Wave framework. A detailed computational complexity analysis shows that Token2Wave can significantly reduce video memory usage and training time compared to BERT. Gradient comparisons for the [CLS] token, total input text, and classifier parameters further highlight Token2Wave’s unique characteristics. This research offers new insights into wave-based token representations, demonstrating their potential to enable efficient and computationally friendly language model architectures.
摘要:本文深入分析了 Token2Wave,这是一种源自 Wave Network 的新型 Token 表示方法,旨在通过受波启发的复数向量捕捉输入文本的全局和局部语义。在 Token2Wave 中,每个 Token 由一个幅度分量表示,捕捉整个输入文本的全局语义,以及一个相位分量,编码单个 Token 与全局语义之间的关系。基于先前研究中展示的波状操作(如干涉和调制)在前向传播中的有效性,本研究探讨了 Token2Wave 框架中的收敛行为、反向传播特性及嵌入独立性。详细的计算复杂度分析表明,与 BERT 相比,Token2Wave 能显著减少视频内存使用和训练时间。对 [CLS] Token、总输入文本及分类器参数的梯度比较进一步突显了 Token2Wave 的独特特性。本研究为基于波的 Token 表示提供了新的见解,展示了其在实现高效且计算友好的大语言模型架构方面的潜力。
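基于摘要中"幅度分量捕捉全局语义、相位分量编码 token 与全局语义关系"的描述,下面给出一个高度简化的复数表示示意;幅度与相位的具体计算方式论文摘要未给出,此处的定义(全局语义取嵌入之和、相位用 arctan2 近似)仅为说明性假设。

```python
import numpy as np

def token2wave_sketch(token_embs: np.ndarray) -> np.ndarray:
    """将实值 token 嵌入 [num_tokens, dim] 转为"幅度 + 相位"的复数表示(示意)。

    幅度分量:取全局语义向量(此处假设为各 token 嵌入之和)的逐维大小;
    相位分量:编码单个 token 与全局语义的关系(此处假设用 arctan2 近似)。
    """
    global_sem = token_embs.sum(axis=0)                   # 全局语义向量
    magnitude = np.abs(global_sem)                        # 幅度:全局语义的逐维强度
    phase = np.arctan2(token_embs, global_sem[None, :])   # 相位:token 相对全局语义
    return magnitude[None, :] * np.exp(1j * phase)        # 波形式的复数 token 表示
```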
[NLP-27] Sniff AI: Is My Spicy Your Spicy? Exploring LLM's Perceptual Alignment with Human Smell Experiences
【速读】: 该论文试图解决的问题是如何使AI系统更好地理解和解释人类对气味的感知描述,即实现AI与人类在嗅觉感知上的对齐。解决方案的关键在于设计一个AI系统,通过“闻一闻并描述”的交互任务,让参与者描述他们闻到的气味,然后AI系统根据这些描述尝试猜测参与者所体验的气味。这一过程评估了大型语言模型(LLMs)在上下文理解和气味关系表示方面的能力,特别是在其内部的高维嵌入空间中的表现。通过定量和定性方法评估AI系统的性能,研究结果揭示了当前AI在嗅觉感知对齐方面的局限性,如对某些气味(如柠檬和薄荷)的偏见,以及对其他气味(如迷迭香)的识别失败。
链接: https://arxiv.org/abs/2411.06950
作者: Shu Zhong,Zetao Zhou,Christopher Dawes,Giada Brianz,Marianna Obrist
关键词-EN: intent is important, smell-remains underexplored, Large Language Model, Aligning, human intent
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:
点击查看摘要
Abstract:Aligning AI with human intent is important, yet perceptual alignment-how AI interprets what we see, hear, or smell-remains underexplored. This work focuses on olfaction, human smell experiences. We conducted a user study with 40 participants to investigate how well AI can interpret human descriptions of scents. Participants performed “sniff and describe” interactive tasks, with our designed AI system attempting to guess what scent the participants were experiencing based on their descriptions. These tasks evaluated the Large Language Model’s (LLMs) contextual understanding and representation of scent relationships within its internal states - high-dimensional embedding space. Both quantitative and qualitative methods were used to evaluate the AI system’s performance. Results indicated limited perceptual alignment, with biases towards certain scents, like lemon and peppermint, and consistent failures to identify others, like rosemary. We discuss these findings in light of human-AI alignment advancements, highlighting the limitations and opportunities for enhancing HCI systems with multisensory experience integration.
摘要:将人工智能与人类意图对齐至关重要,然而感知对齐——即人工智能如何解释我们所见、所闻或所嗅——仍未得到充分探索。本研究聚焦于嗅觉,即人类的气味体验。我们进行了一项包含40名参与者的用户研究,以探讨人工智能在多大程度上能够解读人类对气味的描述。参与者执行了“闻香并描述”的互动任务,我们的设计AI系统根据参与者的描述尝试猜测他们所体验的气味。这些任务评估了大语言模型(LLMs)在其内部状态——高维嵌入空间中对气味关系的上下文理解和表示能力。我们采用了定量和定性方法来评估AI系统的性能。结果显示感知对齐存在局限性,系统对某些气味如柠檬和薄荷表现出偏爱,但持续未能识别其他气味如迷迭香。我们根据这些发现讨论了人机对齐的进展,强调了在多感官体验整合中增强人机交互系统的局限性和机遇。
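摘要提示该系统需在大语言模型的高维嵌入空间中将用户描述与候选气味匹配;以下给出一个通用的最近邻检索示意。嵌入的来源(示例中的 embed 函数)与候选气味集合均为假设输入,并非论文系统的实际实现。

```python
import numpy as np

def guess_scent(desc_emb: np.ndarray, scent_embs: dict[str, np.ndarray]) -> str:
    """在高维嵌入空间中,用余弦相似度检索与用户描述最接近的候选气味。"""
    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(scent_embs, key=lambda name: cosine(desc_emb, scent_embs[name]))

# 用法示意(embed 为假设的文本嵌入函数,嵌入向量为虚构占位):
# scent_embs = {"lemon": emb1, "peppermint": emb2, "rosemary": emb3}
# guess_scent(embed("sharp, citrusy, a bit sour"), scent_embs)
```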
[NLP-28] Cancer-Answer: Empowering Cancer Care with Advanced Large Language Models
【速读】: 该论文试图解决胃肠道癌症(GI tract cancers)早期诊断和治疗中的信息获取难题。由于癌症病因复杂且症状重叠,导致诊断延迟和治疗策略不佳。论文的关键解决方案是利用大型语言模型(LLMs)如GPT-3.5 Turbo,通过预训练的医学数据生成准确、上下文相关的癌症相关查询答案。这些模型提供及时、可操作的见解,支持在癌症诊断和护理中的决策制定,从而改善患者预后。论文通过计算A1(实体准确性)和A2(语言正确性和意义性)两个指标,分别达到0.546和0.881的最大值,验证了模型的有效性。
链接: https://arxiv.org/abs/2411.06946
作者: Aniket Deroy,Subhankar Maity
关键词-EN: tract cancers account, global cancer burden, substantial portion, critical for improved, improved management
类目: Computation and Language (cs.CL)
备注: Accepted at FIRE 2024 (Track: Conversational System for Differential Diagnosis of GI Cancer)
点击查看摘要
Abstract:Gastrointestinal (GI) tract cancers account for a substantial portion of the global cancer burden, where early diagnosis is critical for improved management and patient outcomes. The complex aetiologies and overlapping symptoms across GI cancers often delay diagnosis, leading to suboptimal treatment strategies. Cancer-related queries are crucial for timely diagnosis, treatment, and patient education, as access to accurate, comprehensive information can significantly influence outcomes. However, the complexity of cancer as a disease, combined with the vast amount of available data, makes it difficult for clinicians and patients to quickly find precise answers. To address these challenges, we leverage large language models (LLMs) such as GPT-3.5 Turbo to generate accurate, contextually relevant responses to cancer-related queries. Pre-trained with medical data, these models provide timely, actionable insights that support informed decision-making in cancer diagnosis and care, ultimately improving patient outcomes. We calculate two metrics: A1 (which represents the fraction of entities present in the model-generated answer compared to the gold standard) and A2 (which represents the linguistic correctness and meaningfulness of the model-generated answer with respect to the gold standard), achieving maximum values of 0.546 and 0.881, respectively.
摘要:胃肠道(GI)癌症在全球癌症负担中占据了相当大的比例,早期诊断对于改善管理和患者预后至关重要。GI癌症的复杂病因和症状重叠常常导致诊断延迟,从而导致治疗策略不佳。癌症相关查询对于及时诊断、治疗和患者教育至关重要,因为获取准确、全面的信息可以显著影响预后。然而,癌症作为一种疾病的复杂性,加上大量可用数据的复杂性,使得临床医生和患者难以快速找到精确答案。为了应对这些挑战,我们利用大语言模型(LLMs)如GPT-3.5 Turbo来生成准确、上下文相关的癌症相关查询响应。这些模型经过医学数据预训练,提供及时、可操作的洞察,支持在癌症诊断和护理中的知情决策,最终改善患者预后。我们计算了两个指标:A1(表示模型生成答案中存在的实体与金标准相比的比例)和A2(表示模型生成答案相对于金标准的语言正确性和意义),分别达到最大值0.546和0.881。
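A1 指标(模型答案中覆盖金标准实体的比例)可用集合运算直观示意如下;实体抽取方式论文未具体说明,此处以给定的实体集合代替,示例中的实体与答案均为虚构。

```python
def a1_entity_coverage(gold_entities: set[str], answer_text: str) -> float:
    """A1:模型生成答案中覆盖金标准实体的比例(示意实现,大小写不敏感)。"""
    if not gold_entities:
        return 0.0
    answer_lower = answer_text.lower()
    hits = sum(1 for ent in gold_entities if ent.lower() in answer_lower)
    return hits / len(gold_entities)

# 用法示例(实体与答案均为虚构):
# a1_entity_coverage({"colonoscopy", "biopsy"},
#                    "A colonoscopy with biopsy is usually advised.")  # -> 1.0
```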
[NLP-29] Electroencephalogram-based Multi-class Decoding of Attended Speakers' Direction with Audio Spatial Spectrum
【速读】: 该论文试图解决从听众的脑电图(EEG)信号中解码受关注说话者的方向性焦点问题,特别是在15个方向类别上的精确解码。解决方案的关键在于结合音频空间谱(audio spatial spectra)与EEG特征,以提高解码的准确性。通过引入辅助的音频空间信息,论文提出的Sp-Aux-Deformer模型在留一受试者出(leave-one-subject-out)和留一试验出(leave-one-trial-out)场景下分别实现了57.48%和61.83%的15类解码准确率,显著提升了仅依赖EEG输入的传统方法的解码性能。
链接: https://arxiv.org/abs/2411.06928
作者: Yuanming Zhang,Jing Lu,Zhibin Lin,Fei Chen,Haoliang Du,Xia Gao
关键词-EN: developing brain-computer interfaces, directional focus, attended speaker, hearing impairment, directional focus decoding
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注:
点击查看摘要
Abstract:Decoding the directional focus of an attended speaker from listeners’ electroencephalogram (EEG) signals is essential for developing brain-computer interfaces to improve the quality of life for individuals with hearing impairment. Previous works have concentrated on binary directional focus decoding, i.e., determining whether the attended speaker is on the left or right side of the listener. However, a more precise decoding of the exact direction of the attended speaker is necessary for effective speech processing. Additionally, audio spatial information has not been effectively leveraged, resulting in suboptimal decoding results. In this paper, we observe that, on our recently presented dataset with 15-class directional focus, models relying exclusively on EEG inputs exhibit significantly lower accuracy when decoding the directional focus in both leave-one-subject-out and leave-one-trial-out scenarios. By integrating audio spatial spectra with EEG features, the decoding accuracy can be effectively improved. We employ the CNN, LSM-CNN, and EEG-Deformer models to decode the directional focus from listeners’ EEG signals with the auxiliary audio spatial spectra. The proposed Sp-Aux-Deformer model achieves notable 15-class decoding accuracies of 57.48% and 61.83% in leave-one-subject-out and leave-one-trial-out scenarios, respectively.
摘要:从听者的脑电图 (EEG) 信号中解码出被关注说话者的方向性焦点,对于开发脑机接口以改善听力障碍者的生活质量至关重要。以往的研究主要集中在二元方向性焦点解码上,即确定被关注说话者是在听者的左侧还是右侧。然而,为了实现有效的语音处理,更精确地解码被关注说话者的确切方向是必要的。此外,音频空间信息尚未得到有效利用,导致解码结果不尽如人意。在本研究中,我们观察到,在我们最近提出的包含15个方向性焦点的数据集中,仅依赖 EEG 输入的模型在留一受试者法和留一试验法场景下解码方向性焦点的准确性显著降低。通过将音频空间频谱与 EEG 特征相结合,可以有效提高解码准确性。我们采用卷积神经网络 (CNN)、LSM-CNN 和 EEG-Deformer 模型,结合辅助音频空间频谱,从听者的 EEG 信号中解码方向性焦点。所提出的 Sp-Aux-Deformer 模型在留一受试者法和留一试验法场景下分别实现了显著的 15 类解码准确率,分别为 57.48% 和 61.83%。
[NLP-30] EVQAScore: Efficient Video Question Answering Data Evaluation
【速读】: 该论文试图解决视频问答(Video QA)和视频字幕数据质量评估的挑战,特别是缺乏针对视频问答数据质量的专用评估方法。解决方案的关键在于引入了EVQAScore,这是一种无参考的评估方法,通过关键词提取技术来评估视频字幕和视频问答数据的质量。此外,该方法结合了帧采样和重缩放技术,以提高评估效率和鲁棒性,使其能够有效评估极长视频的质量。在VATEX-EVAL基准测试中,EVQAScore在视频字幕评估方面达到了最先进的性能,同时在数据选择方面,仅使用原始数据量的12.5%就超越了之前的最先进方法PAC-S和全量数据的效果。
链接: https://arxiv.org/abs/2411.06908
作者: Hao Liang,Zirong Chen,Wentao Zhang
关键词-EN: Video, video caption, core task, video caption quality, Video question-answering
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Video question-answering (QA) is a core task in video understanding. Evaluating the quality of video QA and video caption data for training video large language models (VideoLLMs) is an essential challenge. Although various methods have been proposed for assessing video caption quality, there remains a lack of dedicated evaluation methods for Video QA. To address this gap, we introduce EVQAScore, a reference-free method that leverages keyword extraction to assess both video caption and video QA data quality. Additionally, we incorporate frame sampling and rescaling techniques to enhance the efficiency and robustness of our evaluation; this enables our score to evaluate the quality of extremely long videos. Our approach achieves state-of-the-art (SOTA) performance (32.8 for Kendall correlation and 42.3 for Spearman correlation, 4.7 and 5.9 higher than the previous method PAC-S++) on the VATEX-EVAL benchmark for video caption evaluation. Furthermore, by using EVQAScore for data selection, we achieved SOTA results with only 12.5% of the original data volume, outperforming the previous SOTA method PAC-S and 100% of data.
摘要:视频问答 (Video Question-Answering, QA) 是视频理解中的核心任务。评估视频 QA 和用于训练视频大语言模型 (Video Large Language Models, VideoLLMs) 的视频字幕数据质量是一个重要的挑战。尽管已有多种方法用于评估视频字幕质量,但针对视频 QA 的专用评估方法仍然缺乏。为了填补这一空白,我们提出了 EVQAScore,一种无需参考的方法,通过关键词提取来评估视频字幕和视频 QA 数据的质量。此外,我们还结合了帧采样和重缩放技术,以提高评估的效率和鲁棒性,这使得我们的评分方法能够评估极长视频的质量。我们的方法在 VATEX-EVAL 基准测试中达到了最先进的 (State-of-the-Art, SOTA) 性能(Kendall 相关系数为 32.8,Spearman 相关系数为 42.3,分别比之前的 PAC-S++ 方法高出 4.7 和 5.9),用于视频字幕评估。此外,通过使用 EVQAScore 进行数据选择,我们仅用原始数据量的 12.5% 就实现了 SOTA 结果,超过了之前的 SOTA 方法 PAC-S 和使用 100% 数据的结果。
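以下为 EVQAScore"无参考 + 关键词抽取 + 帧采样"思路的一个高度简化示意:抽取字幕关键词,并在间隔采样的帧描述上计算平均覆盖率。关键词抽取器与打分细节均为假设,并非论文的实际实现。

```python
def extract_keywords(text: str, stopwords: set[str]) -> set[str]:
    """极简关键词抽取:分词后过滤停用词与过短词(真实系统应使用更强的抽取器)。"""
    return {w for w in text.lower().split() if w not in stopwords and len(w) > 2}

def evqa_score_sketch(caption: str, frame_descriptions: list[str],
                      stopwords: set[str], sample_every: int = 4) -> float:
    """对帧序列做间隔采样(对应论文的 frame sampling,降低长视频的评估开销),
    再计算字幕关键词在各采样帧描述中的平均覆盖率作为质量分。"""
    keywords = extract_keywords(caption, stopwords)
    if not keywords:
        return 0.0
    sampled = frame_descriptions[::sample_every]
    if not sampled:
        return 0.0
    coverage = [len(keywords & extract_keywords(f, stopwords)) / len(keywords)
                for f in sampled]
    return sum(coverage) / len(coverage)
```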
[NLP-31] LongSafetyBench: Long-Context LLMs Struggle with Safety Issues
【速读】: 该论文试图解决长上下文语言模型(long-context language models)在安全性评估方面的不足问题。现有的评估主要集中在模型的能力上,而对其安全性的研究相对缺乏。论文提出了LongSafetyBench,这是首个专门用于客观全面评估长上下文模型安全性的基准。解决方案的关键在于LongSafetyBench的设计,它包含了10个任务类别,平均长度为41,889字,能够系统地测试模型的安全性。通过在LongSafetyBench上测试八个长上下文语言模型,研究发现现有模型普遍存在安全性不足的问题,大多数主流长上下文大语言模型(LLMs)的安全响应比例低于50%。此外,长上下文场景下的安全性表现与短上下文场景下的表现并不一致。论文还提出了一种简单有效的解决方案,使开源模型能够达到与顶级闭源模型相媲美的性能。LongSafetyBench的引入旨在推动社区对长上下文模型安全性的关注,并为提升这些模型的安全性提供基准和解决方案。
链接: https://arxiv.org/abs/2411.06899
作者: Mianqiu Huang,Xiaoran Liu,Shaojun Zhou,Mozhi Zhang,Chenkun Tan,Pengyu Wang,Qipeng Guo,Zhe Xu,Linyang Li,Zhikai Lei,Linlin Li,Qun Liu,Yaqian Zhou,Xipeng Qiu,Xuanjing Huang
关键词-EN: long-context language models, models, long-context, long-context language, language models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:With the development of large language models (LLMs), the sequence length of these models continues to increase, drawing significant attention to long-context language models. However, the evaluation of these models has been primarily limited to their capabilities, with a lack of research focusing on their safety. Existing work, such as ManyShotJailbreak, has to some extent demonstrated that long-context language models can exhibit safety concerns. However, the methods used are limited and lack comprehensiveness. In response, we introduce LongSafetyBench, the first benchmark designed to objectively and comprehensively evaluate the safety of long-context models. LongSafetyBench consists of 10 task categories, with an average length of 41,889 words. After testing eight long-context language models on LongSafetyBench, we found that existing models generally exhibit insufficient safety capabilities. The proportion of safe responses from most mainstream long-context LLMs is below 50%. Moreover, models’ safety performance in long-context scenarios does not always align with that in short-context scenarios. Further investigation revealed that long-context models tend to overlook harmful content within lengthy texts. We also proposed a simple yet effective solution, allowing open-source models to achieve performance comparable to that of top-tier closed-source models. We believe that LongSafetyBench can serve as a valuable benchmark for evaluating the safety capabilities of long-context language models. We hope that our work will encourage the broader community to pay attention to the safety of long-context models and contribute to the development of solutions to improve the safety of long-context LLMs.
摘要:随着大语言模型(LLMs)的发展,这些模型的序列长度不断增加,使得长上下文语言模型引起了广泛关注。然而,目前对这些模型的评估主要集中在其能力上,而对其安全性的研究却相对缺乏。现有的工作,如 ManyShotJailbreak,在一定程度上展示了长上下文语言模型可能存在的安全问题。但这些方法局限性较大,缺乏全面性。为此,我们引入了 LongSafetyBench,这是首个旨在客观全面评估长上下文模型安全性的基准测试。LongSafetyBench 包含 10 个任务类别,平均长度为 41,889 字。在 LongSafetyBench 上测试了八种长上下文语言模型后,我们发现现有模型普遍表现出安全能力不足。大多数主流长上下文 LLMs 的安全响应比例低于 50%。此外,模型在长上下文场景中的安全表现并不总是与短上下文场景中的表现一致。进一步的研究表明,长上下文模型往往忽视长文本中的有害内容。我们还提出了一种简单而有效的解决方案,使开源模型能够达到与顶尖闭源模型相媲美的性能。我们相信 LongSafetyBench 可以作为一个有价值的基准,用于评估长上下文语言模型的安全能力。我们希望我们的工作能够鼓励更广泛的社区关注长上下文模型的安全性,并为提高长上下文 LLMs 的安全性提供解决方案。
[NLP-32] Subgraph Retrieval Enhanced by Graph-Text Alignment for Commonsense Question Answering ECML PKDD 2024
【速读】: 该论文试图解决常识问答任务中基于规则提取子图可能遗漏关键节点和图文本模态对齐不佳的问题。解决方案的关键在于提出了一种名为 SEPTA 的新框架,该框架通过将知识图谱转化为子图向量数据库,并采用BFS风格的子图采样策略来避免信息丢失,同时利用双向对比学习方法增强图文本对齐,从而提升子图检索和知识融合的效果。最终,所有检索到的信息在预测模块中进行推理,以提高任务性能。
链接: https://arxiv.org/abs/2411.06866
作者: Boci Peng,Yongchao Liu,Xiaohe Bo,Sheng Tian,Baokun Wang,Chuntao Hong,Yan Zhang
关键词-EN: Commonsense question answering, Commonsense question, question answering, requires machines
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Social and Information Networks (cs.SI)
备注: Accepted by ECML PKDD 2024
点击查看摘要
Abstract:Commonsense question answering is a crucial task that requires machines to employ reasoning according to commonsense. Previous studies predominantly employ an extracting-and-modeling paradigm to harness the information in KG, which first extracts relevant subgraphs based on pre-defined rules and then proceeds to design various strategies aiming to improve the representations and fusion of the extracted structural knowledge. Despite their effectiveness, there are still two challenges. On one hand, subgraphs extracted by rule-based methods may have the potential to overlook critical nodes and result in uncontrollable subgraph size. On the other hand, the misalignment between graph and text modalities undermines the effectiveness of knowledge fusion, ultimately impacting the task performance. To deal with the problems above, we propose a novel framework: Subgraph REtrieval enhanced by graPh-Text Alignment, named SEPTA. Firstly, we transform the knowledge graph into a database of subgraph vectors and propose a BFS-style subgraph sampling strategy to avoid information loss, leveraging the analogy between BFS and the message-passing mechanism. In addition, we propose a bidirectional contrastive learning approach for graph-text alignment, which effectively enhances both subgraph retrieval and knowledge fusion. Finally, all the retrieved information is combined for reasoning in the prediction module. Extensive experiments on five datasets demonstrate the effectiveness and robustness of our framework.
摘要:常识问答是一个关键任务,要求机器根据常识进行推理。以往的研究主要采用提取-建模范式来利用知识图谱(KG)中的信息,首先基于预定义规则提取相关子图,然后设计各种策略以改进提取的结构化知识的表示和融合。尽管这些方法有效,但仍存在两个挑战。一方面,基于规则提取的子图可能忽略关键节点,导致子图规模不可控。另一方面,图与文本模态之间的错位削弱了知识融合的效果,最终影响任务性能。为解决上述问题,我们提出了一种新的框架:子图检索增强的图-文对齐,命名为 SEPTA。首先,我们将知识图谱转化为子图向量数据库,并提出一种BFS风格的子图采样策略,以避免信息丢失,利用BFS与消息传递机制之间的类比。此外,我们提出了一种双向对比学习方法用于图-文对齐,有效增强了子图检索和知识融合。最后,所有检索到的信息在预测模块中进行推理。在五个数据集上的广泛实验证明了我们框架的有效性和鲁棒性。
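SEPTA 的"BFS 风格子图采样"借用了 BFS 与消息传递机制的类比;下面用邻接表给出一个通用的 BFS 子图采样示意,节点预算与起始实体均为假设参数,仅用于说明为何逐跳扩展能缓解规则式截断遗漏关键节点的问题。

```python
from collections import deque

def bfs_subgraph(adj: dict, seeds: list, budget: int) -> set:
    """BFS 风格子图采样:从种子实体逐跳扩展邻居,直至达到节点预算。

    相比基于规则的截断,逐跳扩展更不易遗漏与问题相关的关键节点,
    且子图规模由 budget 显式控制。
    """
    visited, queue = set(seeds), deque(seeds)
    while queue and len(visited) < budget:
        node = queue.popleft()
        for nb in adj.get(node, []):
            if nb not in visited:
                visited.add(nb)
                queue.append(nb)
                if len(visited) >= budget:
                    break
    return visited  # 子图节点集合,可据此取导出子图并向量化入库

# 用法示例(图与预算均为虚构):
# adj = {"q": ["a", "b"], "a": ["c"], "b": ["c", "d"]}
# bfs_subgraph(adj, seeds=["q"], budget=4)
```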
[NLP-33] A Unified Multi-Task Learning Architecture for Hate Detection Leveraging User-Based Information
【速读】: 该论文试图解决社交媒体中普遍存在的仇恨言论、冒犯性语言、攻击性言论、种族主义、性别歧视及其他滥用语言的问题。解决方案的关键在于引入了一种独特的模型,该模型通过利用用户内部(intra-user)和用户间(inter-user)的信息,改进了对英语语言中仇恨言论的识别。具体来说,论文采用了单任务学习(STL)和多任务学习(MTL)范式,结合深度神经网络如卷积神经网络(CNN)、门控循环单元(GRU)、双向编码器表示(BERT)和轻量级BERT(ALBERT),并通过实验验证了将特定用户特征与文本特征结合使用,能够显著提升宏观F1值和加权F1值。
链接: https://arxiv.org/abs/2411.06855
作者: Prashant Kapil,Asif Ekbal
关键词-EN: social media, common phenomena, phenomena in social, Hate speech, Artificial Intelligence
类目: Computation and Language (cs.CL)
备注: 7 pages, 1 figure, and two tables
点击查看摘要
Abstract:Hate speech, offensive language, aggression, racism, sexism, and other abusive language are common phenomena in social media. There is a need for Artificial Intelligence (AI)-based intervention which can filter hate content at scale. Most existing hate speech detection solutions have utilized the features by treating each post as an isolated input instance for the classification. This paper addresses this issue by introducing a unique model that improves hate speech identification for the English language by utilising intra-user and inter-user-based information. The experiment is conducted over single-task learning (STL) and multi-task learning (MTL) paradigms that use deep neural networks, such as convolutional neural networks (CNN), gated recurrent unit (GRU), bidirectional encoder representations from the transformer (BERT), and A Lite BERT (ALBERT). We use three benchmark datasets and conclude that combining certain user features with textual features gives significant improvements in macro-F1 and weighted-F1.
摘要:仇恨言论、冒犯性语言、攻击性言论、种族主义、性别歧视以及其他形式的辱骂性语言在社交媒体中普遍存在。为了大规模过滤这些有害内容,需要基于人工智能(AI)的干预措施。现有的仇恨言论检测解决方案大多将每条帖子视为孤立的输入实例进行分类。本文通过引入一种独特的模型来解决这一问题,该模型利用用户内部和用户之间的信息,提升了英语语言中仇恨言论的识别能力。实验在单任务学习(STL)和多任务学习(MTL)范式下进行,采用了深度神经网络,如卷积神经网络(CNN)、门控循环单元(GRU)、Transformer的双向编码器表示(BERT)以及轻量级BERT(ALBERT)。我们使用了三个基准数据集,并得出结论:将某些用户特征与文本特征结合使用,可以显著提升宏观F1值和加权F1值。
[NLP-34] Evaluating Large Language Models on Financial Report Summarization: An Empirical Study
【速读】: 该论文试图解决在金融领域应用大型语言模型(Large Language Models, LLMs)时面临的可靠性、准确性和合规性问题。解决方案的关键在于通过全面的比较研究,评估三种先进的LLMs(GLM-4, Mistral-NeMo, 和 LLaMA3.1)在生成自动化金融报告方面的有效性。论文提出了一个创新的评估框架,结合定量指标(如精度、召回率)和定性分析(如上下文适应性、一致性),以提供对模型输出质量的全面视图。此外,论文公开了金融数据集,鼓励研究人员和从业者通过社区参与和协作改进来进一步利用和审查研究成果。
链接: https://arxiv.org/abs/2411.06852
作者: Xinqi Yang,Scott Zang,Yong Ren,Dingjie Peng,Zheng Wen
关键词-EN: Large Language Models, including natural language, natural language understanding, domain-specific knowledge tasks, demonstrated remarkable versatility
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:In recent years, Large Language Models (LLMs) have demonstrated remarkable versatility across various applications, including natural language understanding, domain-specific knowledge tasks, etc. However, applying LLMs to complex, high-stakes domains like finance requires rigorous evaluation to ensure reliability, accuracy, and compliance with industry standards. To address this need, we conduct a comprehensive and comparative study on three state-of-the-art LLMs, GLM-4, Mistral-NeMo, and LLaMA3.1, focusing on their effectiveness in generating automated financial reports. Our primary motivation is to explore how these models can be harnessed within finance, a field demanding precision, contextual relevance, and robustness against erroneous or misleading information. By examining each model’s capabilities, we aim to provide an insightful assessment of their strengths and limitations. Our paper offers benchmarks for financial report analysis, encompassing proposed metrics such as ROUGE-1, BERT Score, and LLM Score. We introduce an innovative evaluation framework that integrates both quantitative metrics (e.g., precision, recall) and qualitative analyses (e.g., contextual fit, consistency) to provide a holistic view of each model’s output quality. Additionally, we make our financial dataset publicly available, inviting researchers and practitioners to leverage, scrutinize, and enhance our findings through broader community engagement and collaborative improvement. Our dataset is available on huggingface.
摘要:近年来,大语言模型 (LLMs) 在多种应用中展现了显著的多功能性,包括自然语言理解、特定领域知识任务等。然而,将大语言模型应用于金融等复杂且高风险的领域,需要严格的评估以确保其可靠性、准确性及符合行业标准。为此,我们对三种最先进的 LLMs——GLM-4、Mistral-NeMo 和 LLaMA3.1——进行了全面且比较性的研究,重点考察它们在生成自动化财务报告方面的有效性。我们的主要动机是探索这些模型如何在金融领域中得到应用,该领域要求精确性、上下文相关性以及对错误或误导信息的强健性。通过考察每个模型的能力,我们旨在提供对其优势和局限性的深入评估。我们的论文提供了财务报告分析的基准,涵盖了如 ROUGE-1、BERT Score 和 LLM Score 等提出的指标。我们引入了一种创新的评估框架,该框架结合了定量指标(如精确度、召回率)和定性分析(如上下文适应性、一致性),以全面评估每个模型的输出质量。此外,我们将财务数据集公开,邀请研究人员和从业者通过更广泛的社区参与和协作改进来利用、审查和增强我们的研究成果。我们的数据集可在 huggingface 上获取。
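文中使用的 ROUGE-1 本质是候选摘要与参考摘要的 unigram 重合度;下面给出一个不依赖第三方库的最小实现示意(返回 F1),分词采用朴素的空格切分,属简化假设。

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1 F1:候选摘要与参考摘要的 unigram 重合度(朴素空格分词的简化示意)。"""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # 多重集合交集的总频次
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```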
[NLP-35] 1-800-SHARED-TASKS @ NLU of Devanagari Script Languages: Detection of Language, Hate Speech and Targets using LLMs COLING2025
【速读】: 该论文试图解决在Devanagari脚本语言中进行语言检测、仇恨言论识别和目标检测的问题。解决方案的关键在于结合大型语言模型(如MuRIL、IndicBERT和Gemma-2)及其集成,并采用如focal loss等独特技术来应对Devanagari语言在多语言处理和类别不平衡方面的自然理解挑战。通过这种方法,研究在所有任务中均取得了竞争性的结果,分别为子任务A、B和C的F1分数0.9980、0.7652和0.6804。
链接: https://arxiv.org/abs/2411.06850
作者: Jebish Purbey,Siddartha Pullakhandam,Kanwal Mehreen,Muhammad Arham,Drishti Sharma,Ashay Srivastava,Ram Mohan Rao Kadiyala
关键词-EN: hate speech identification, detailed system description, Devanagari script languages, hate speech, speech identification
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 13 pages, Submitted to CHIPSAL workshop @ COLING 2025
点击查看摘要
Abstract:This paper presents a detailed system description of our entry for the CHiPSAL 2025 shared task, focusing on language detection, hate speech identification, and target detection in Devanagari script languages. We experimented with a combination of large language models and their ensembles, including MuRIL, IndicBERT, and Gemma-2, and leveraged unique techniques like focal loss to address challenges in the natural understanding of Devanagari languages, such as multilingual processing and class imbalance. Our approach achieved competitive results across all tasks: F1 of 0.9980, 0.7652, and 0.6804 for Sub-tasks A, B, and C respectively. This work provides insights into the effectiveness of transformer models in tasks with domain-specific and linguistic challenges, as well as areas for potential improvement in future iterations.
摘要:本文详细描述了我们为 CHiPSAL 2025 共享任务提交的系统,重点介绍了在天城文(Devanagari)脚本语言中的语言检测、仇恨言论识别和目标检测。我们实验了多种大语言模型及其集成方法,包括 MuRIL、IndicBERT 和 Gemma-2,并采用了如 focal loss 等独特技术来应对天城文语言自然理解中的挑战,例如多语言处理和类别不平衡问题。我们的方法在所有任务中均取得了具有竞争力的结果:子任务 A、B 和 C 的 F1 分数分别为 0.9980、0.7652 和 0.6804。这项工作为 Transformer 模型在具有领域特定和语言挑战的任务中的有效性提供了见解,同时也指出了未来迭代中可能的改进方向。
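文中借助 focal loss 缓解类别不平衡,其标准形式为 FL(p_t) = -α(1 - p_t)^γ · log(p_t):对易分样本降权、聚焦难例与少数类。下面给出多分类场景的 PyTorch 示意实现,γ 与 α 取常见默认值,并非论文所用超参。

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               gamma: float = 2.0, alpha: float = 0.25) -> torch.Tensor:
    """多分类 focal loss:FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t)。

    logits: [batch, num_classes];targets: [batch] 的类别下标。
    (1 - p_t)^gamma 使易分样本的损失被压低,从而聚焦少数类与难例。
    """
    log_probs = F.log_softmax(logits, dim=-1)
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)  # 真实类别的 log p_t
    pt = log_pt.exp()
    loss = -alpha * (1.0 - pt) ** gamma * log_pt
    return loss.mean()
```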
[NLP-36] LLM-Neo: Parameter Efficient Knowledge Distillation for Large Language Models ICASSP25
【速读】: 该论文试图解决如何高效地将大型语言模型(LLM)中的知识迁移到更紧凑的学生模型中的问题。解决方案的关键在于提出了一种名为LLM-Neo的新框架,该框架结合了知识蒸馏(Knowledge Distillation, KD)和低秩适应(Low-Rank Adaption, LoRA)的策略。通过重新审视KD和LoRA的共性,论文提出了一种结合两者的策略,以提高知识迁移的效率。实验结果表明,LLM-Neo在压缩Llama 2和Llama 3模型时优于多种基线方法,并展示了其在不同LoRA变体上的鲁棒性。
链接: https://arxiv.org/abs/2411.06839
作者: Runming Yang,Taiqiang Wu,Jiahao Wang,Pengfei Hu,Ngai Wong,Yujiu Yang
关键词-EN: large language model, efficiently transfers knowledge, compact student, framework that efficiently, large language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: ICASSP 25’ under review
点击查看摘要
Abstract:In this paper, we propose a novel LLM-Neo framework that efficiently transfers knowledge from a large language model (LLM) teacher to a compact student. Initially, we revisit the knowledge distillation (KD) and low-rank adaption (LoRA), and argue that they share the same paradigm. Inspired by this observation, we explore the strategy that combines LoRA and KD to enhance the efficiency of knowledge transfer. We first summarize some guidelines for this design and further develop the LLM-Neo. Experimental results on compressing Llama 2 and Llama 3 show that LLM-Neo outperforms various baselines. Further analysis demonstrates the robustness of the proposed LLM-Neo on variants of LoRA. The trained models are available at this https URL.
摘要:本文提出了一种新颖的大语言模型-Neo (LLM-Neo) 框架,该框架能够高效地将大语言模型 (LLM) 教师的知识传递给一个紧凑的学生模型。首先,我们重新审视了知识蒸馏 (Knowledge Distillation, KD) 和低秩适应 (Low-Rank Adaption, LoRA),并指出它们共享相同的范式。受此启发,我们探索了将 LoRA 和 KD 结合的策略,以提高知识传递的效率。我们首先总结了这一设计的一些指导原则,并进一步开发了 LLM-Neo。在压缩 Llama 2 和 Llama 3 的实验中,LLM-Neo 的表现优于多种基线方法。进一步的分析表明,所提出的 LLM-Neo 在不同变体的 LoRA 上具有鲁棒性。训练好的模型已可在此 https URL 获取。
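LLM-Neo 将 LoRA 与知识蒸馏(KD)视为同一范式并加以结合;以下示意其中 KD 一侧的常见做法:以教师 logits 的 KL 散度加交叉熵作为学生目标。温度与权重均为常用假设取值,非论文超参,LoRA 侧的具体结合方式以原论文为准。

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor,
            labels: torch.Tensor, T: float = 2.0, beta: float = 0.5) -> torch.Tensor:
    """蒸馏目标 = beta * KL(teacher || student) * T^2 + (1 - beta) * 交叉熵。

    student_logits / teacher_logits: [batch, vocab];labels: [batch]。
    """
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return beta * kl + (1.0 - beta) * ce

# LoRA 设定下,学生主干权重保持冻结,仅低秩增量参数对上述损失反传,
# 从而以极少的可训练参数完成知识迁移(参数高效的知识蒸馏)。
```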
[NLP-37] Persuasion with Large Language Models : a Survey
【速读】: 该论文试图解决大语言模型(LLMs)在说服性沟通中的应用及其潜在的伦理和社会风险问题。解决方案的关键在于识别和评估LLM系统在不同领域(如政治、营销、公共卫生、电子商务和慈善捐赠)中影响人类态度和行为的不同模式,以及这些系统的效果受个性化方式和内容是否标记为AI生成的影响。论文强调了当前和未来LLM基于说服的潜在风险,包括错误信息的传播、偏见的放大和隐私的侵犯,并呼吁制定伦理指南和更新监管框架,以防止不负责任和有害的LLM系统的广泛部署。
链接: https://arxiv.org/abs/2411.06837
作者: Alexander Rogiers,Sander Noels,Maarten Buyl,Tijl De Bie
关键词-EN: Large Language Models, enabling fully-automated personalized, Language Models, Large Language, interactive content generation
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:The rapid rise of Large Language Models (LLMs) has created new disruptive possibilities for persuasive communication, by enabling fully-automated personalized and interactive content generation at an unprecedented scale. In this paper, we survey the research field of LLM-based persuasion that has emerged as a result. We begin by exploring the different modes in which LLM Systems are used to influence human attitudes and behaviors. In areas such as politics, marketing, public health, e-commerce, and charitable giving, such LLM Systems have already achieved human-level or even super-human persuasiveness. We identify key factors influencing their effectiveness, such as the manner of personalization and whether the content is labelled as AI-generated. We also summarize the experimental designs that have been used to evaluate progress. Our survey suggests that the current and future potential of LLM-based persuasion poses profound ethical and societal risks, including the spread of misinformation, the magnification of biases, and the invasion of privacy. These risks underscore the urgent need for ethical guidelines and updated regulatory frameworks to avoid the widespread deployment of irresponsible and harmful LLM Systems.
摘要:大语言模型 (LLM) 的迅速崛起为说服性沟通创造了新的颠覆性可能性,通过实现前所未有的规模的全自动化个性化和互动内容生成。本文综述了由此产生的基于 LLM 的说服研究领域。我们首先探讨了 LLM 系统用于影响人类态度和行为的多种模式。在政治、营销、公共卫生、电子商务和慈善捐赠等领域,这些 LLM 系统已经达到了人类水平甚至超人类的说服力。我们识别了影响其有效性的关键因素,如个性化方式以及内容是否被标记为 AI 生成。我们还总结了用于评估进展的实验设计。我们的综述表明,基于 LLM 的说服的当前和未来潜力带来了深刻的伦理和社会风险,包括错误信息的传播、偏见的放大和隐私的侵犯。这些风险突显了迫切需要伦理指南和更新的监管框架,以避免不负责任和有害的 LLM 系统的广泛部署。
[NLP-38] HarmLevelBench: Evaluating Harm-Level Compliance and the Impact of Quantization on Model Alignment NEURIPS2024
【速读】: 该论文试图解决大语言模型(LLM)在安全性方面面临的挑战,特别是关于“越狱”技术及其对模型脆弱性的评估。解决方案的关键在于创建了一个新颖的数据集,用于评估模型输出在多个危害级别上的有害性,并进行细粒度的危害级别分析。此外,论文还通过基准测试评估了针对Vicuna 13B v1.5模型的最先进“越狱”攻击,并研究了量化技术(如AWQ和GPTQ)对模型对齐和鲁棒性的影响,揭示了在增强对转移攻击的鲁棒性与增加直接攻击的脆弱性之间的权衡。通过这些研究,论文旨在加深对LLM脆弱性的理解,并改进在面对有害内容时评估模型鲁棒性的方法,特别是在压缩策略的背景下。
链接: https://arxiv.org/abs/2411.06835
作者: Yannis Belkhiter,Giulio Zizzo,Sergio Maffeis
关键词-EN: revolutionized the NLP, NLP field, transformers architecture, NLP, LLM vulnerabilities
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注: NeurIPS 2024 Workshop on Safe Generative Artificial Intelligence (SafeGenAI)
点击查看摘要
Abstract:With the introduction of the transformers architecture, LLMs have revolutionized the NLP field with ever more powerful models. Nevertheless, their development came up with several challenges. The exponential growth in computational power and reasoning capabilities of language models has heightened concerns about their security. As models become more powerful, ensuring their safety has become a crucial focus in research. This paper aims to address gaps in the current literature on jailbreaking techniques and the evaluation of LLM vulnerabilities. Our contributions include the creation of a novel dataset designed to assess the harmfulness of model outputs across multiple harm levels, as well as a focus on fine-grained harm-level analysis. Using this framework, we provide a comprehensive benchmark of state-of-the-art jailbreaking attacks, specifically targeting the Vicuna 13B v1.5 model. Additionally, we examine how quantization techniques, such as AWQ and GPTQ, influence the alignment and robustness of models, revealing trade-offs between enhanced robustness with regards to transfer attacks and potential increases in vulnerability on direct ones. This study aims to demonstrate the influence of harmful input queries on the complexity of jailbreaking techniques, as well as to deepen our understanding of LLM vulnerabilities and improve methods for assessing model robustness when confronted with harmful content, particularly in the context of compression strategies.
摘要:随着 Transformer 架构的引入,大语言模型 (LLM) 在自然语言处理 (NLP) 领域带来了革命性的变化,涌现出越来越强大的模型。然而,其发展也伴随着诸多挑战。语言模型在计算能力和推理能力上的指数级增长,引发了对其安全性的高度关注。随着模型能力的提升,确保其安全性已成为研究中的一个关键焦点。本文旨在填补当前文献中关于越狱技术和大语言模型漏洞评估的空白。我们的贡献包括创建了一个新颖的数据集,用于评估模型输出在多个危害级别上的有害性,并着重于细粒度的危害级别分析。利用这一框架,我们提供了一个全面的基准测试,针对 Vicuna 13B v1.5 模型,评估了最先进的越狱攻击。此外,我们还研究了量化技术(如 AWQ 和 GPTQ)如何影响模型的对齐和鲁棒性,揭示了在增强对转移攻击的鲁棒性与可能增加直接攻击的脆弱性之间的权衡。本研究旨在展示有害输入查询对越狱技术复杂性的影响,并加深我们对大语言模型漏洞的理解,同时改进在面对有害内容时评估模型鲁棒性的方法,特别是在压缩策略的背景下。
[NLP-39] AssistRAG: Boosting the Potential of Large Language Models with an Intelligent Information Assistant NEURIPS2024
【速读】: 该论文试图解决大型语言模型(LLMs)在生成过程中产生的“幻觉”问题,即生成事实性错误信息。解决方案的关键在于提出了一种基于助手的检索增强生成方法(Assistant-based Retrieval-Augmented Generation, AssistRAG),通过在LLMs中集成一个智能信息助手来管理记忆和知识。该助手通过工具使用、动作执行、记忆构建和计划制定来增强信息检索和决策能力。解决方案的核心在于两阶段的训练方法:课程助手学习和强化偏好优化,这使得AssistRAG在复杂推理任务中显著优于传统方法,特别是在提升较不先进LLMs的推理能力和生成准确性方面。
链接: https://arxiv.org/abs/2411.06805
作者: Yujia Zhou,Zheng Liu,Zhicheng Dou
关键词-EN: Large Language Models, natural language processing, Large Language, generate factually incorrect, Language Models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Accepted by NeurIPS 2024 (poster)
点击查看摘要
Abstract:The emergence of Large Language Models (LLMs) has significantly advanced natural language processing, but these models often generate factually incorrect information, known as “hallucination”. Initial retrieval-augmented generation (RAG) methods like the “Retrieve-Read” framework were inadequate for complex reasoning tasks. Subsequent prompt-based RAG strategies and Supervised Fine-Tuning (SFT) methods improved performance but required frequent retraining and risked altering foundational LLM capabilities. To cope with these challenges, we propose Assistant-based Retrieval-Augmented Generation (AssistRAG), integrating an intelligent information assistant within LLMs. This assistant manages memory and knowledge through tool usage, action execution, memory building, and plan specification. Using a two-phase training approach (Curriculum Assistant Learning and Reinforced Preference Optimization), AssistRAG enhances information retrieval and decision-making. Experiments show AssistRAG significantly outperforms benchmarks, especially benefiting less advanced LLMs, by providing superior reasoning capabilities and accurate responses.
摘要:大语言模型 (LLM) 的出现显著推动了自然语言处理的发展,但这些模型常常生成事实错误的信息,即所谓的“幻觉”。早期的检索增强生成 (RAG) 方法,如“检索-阅读”框架,在复杂推理任务中表现不足。随后的基于提示的 RAG 策略和监督微调 (SFT) 方法虽然提升了性能,但需要频繁的重新训练,并存在改变基础 LLM 能力的风险。为了应对这些挑战,我们提出了基于助手的检索增强生成 (AssistRAG),将智能信息助手集成到大语言模型中。该助手通过工具使用、动作执行、记忆构建和计划制定来管理记忆和知识。通过课程助手学习与强化偏好优化的两阶段训练方法,AssistRAG 提升了信息检索和决策能力。实验表明,AssistRAG 在基准测试中显著优于其他方法,尤其对较不先进的大语言模型,通过提供更强的推理能力和准确的响应。
[NLP-40] Large-scale moral machine experiment on large language models
【速读】: 该论文试图解决的问题是如何评估大型语言模型(LLMs)在自动驾驶系统中的道德决策能力,并确保其与人类道德偏好的一致性。解决方案的关键在于通过综合分析51种不同LLM(包括多个版本的专有模型如GPT、Claude、Gemini和开源替代模型如Llama、Gemma)在道德机器实验框架下的表现,利用联合分析框架评估LLM响应与人类偏好的对齐程度,并研究模型规模、更新和架构对这一对齐的影响。研究发现,专有模型和参数超过10亿的开源模型在道德判断上与人类判断较为接近,但模型更新并不总能提升对齐效果,且许多LLM在特定伦理原则上有过度强调的倾向。这些结果表明,虽然增加模型规模可能自然地导致更接近人类的道德判断,但在自动驾驶系统中的实际应用需要权衡判断质量和计算效率。论文的全面分析为自动驾驶系统的伦理设计提供了重要见解,并强调了在AI道德决策中考虑文化背景的重要性。
链接: https://arxiv.org/abs/2411.06790
作者: Muhammad Shahrul Zaim bin Ahmad,Kazuhiro Takemoto
关键词-EN: Large Language Models, Large Language, advancement of Large, Language Models, systems necessitates understanding
类目: Computers and Society (cs.CY); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: 20 pages, 6 figures
点击查看摘要
Abstract:The rapid advancement of Large Language Models (LLMs) and their potential integration into autonomous driving systems necessitates understanding their moral decision-making capabilities. While our previous study examined four prominent LLMs using the Moral Machine experimental framework, the dynamic landscape of LLM development demands a more comprehensive analysis. Here, we evaluate moral judgments across 51 different LLMs, including multiple versions of proprietary models (GPT, Claude, Gemini) and open-source alternatives (Llama, Gemma), to assess their alignment with human moral preferences in autonomous driving scenarios. Using a conjoint analysis framework, we evaluated how closely LLM responses aligned with human preferences in ethical dilemmas and examined the effects of model size, updates, and architecture. Results showed that proprietary models and open-source models exceeding 10 billion parameters demonstrated relatively close alignment with human judgments, with a significant negative correlation between model size and distance from human judgments in open-source models. However, model updates did not consistently improve alignment with human preferences, and many LLMs showed excessive emphasis on specific ethical principles. These findings suggest that while increasing model size may naturally lead to more human-like moral judgments, practical implementation in autonomous driving systems requires careful consideration of the trade-off between judgment quality and computational efficiency. Our comprehensive analysis provides crucial insights for the ethical design of autonomous systems and highlights the importance of considering cultural contexts in AI moral decision-making.
摘要:大语言模型 (LLM) 的快速发展及其在自动驾驶系统中的潜在集成,使得理解其道德决策能力变得至关重要。尽管我们之前的研究通过 Moral Machine 实验框架考察了四个著名的大语言模型,但大语言模型发展的动态性要求进行更全面的分析。在此,我们评估了 51 个不同大语言模型在自动驾驶场景中的道德判断,包括多个版本的专有模型(如 GPT、Claude、Gemini)和开源替代模型(如 Llama、Gemma),以评估它们与人类道德偏好的契合度。通过联合分析框架,我们评估了大语言模型响应与人类偏好在伦理困境中的契合度,并考察了模型大小、更新和架构的影响。结果显示,专有模型和参数超过 100 亿的开源模型与人类判断的契合度相对较高,开源模型中模型大小与人类判断距离之间存在显著的负相关关系。然而,模型更新并未一致地提高与人类偏好的契合度,许多大语言模型在特定伦理原则上表现出过度强调。这些发现表明,尽管增加模型大小可能自然地导致更接近人类的道德判断,但在自动驾驶系统中的实际应用需要仔细权衡判断质量与计算效率之间的平衡。我们的全面分析为自动驾驶系统的伦理设计提供了关键见解,并强调了在 AI 道德决策中考虑文化背景的重要性。
[NLP-41] PDC DM-SFT: A Road for LLM SQL Bug-Fix Enhancing COLING
【速读】: 该论文试图解决现有代码大型语言模型(Code LLMs)在SQL代码错误修复任务中的不足,特别是模型在生成正确代码方面表现优异,但在修复代码错误时表现不佳的问题。解决方案的关键在于引入了一套增强LLM SQL错误修复能力的方法,主要包括两个部分:从零开始的渐进式数据集构建(Progressive Dataset Construction, PDC)和动态掩码监督微调(Dynamic Mask Supervised Fine-tuning, DM-SFT)。PDC通过广度优先和深度优先两种数据扩展方法,有效提升了数据集的多样性和深度。DM-SFT则提出了一种高效的错误修复监督学习方法,显著减少了训练步骤并缓解了SQL代码错误修复训练中的“迷失方向”问题。实验结果表明,采用这两种方法训练的Code LLM模型在性能上超越了所有当前最佳的大规模模型。
链接: https://arxiv.org/abs/2411.06767
作者: Yiwen Duan,Yonghong Yu,Xiaoming Zhao,Yichang Wu,Wenbo Liu
关键词-EN: Large Language Models, Code Large Language, Large Language, demonstrated exceptional performance, code generation tasks
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: COLING-Industry 2025 accepted
点击查看摘要
Abstract:Code Large Language Models (Code LLMs), such as Code Llama and DeepSeek-Coder, have demonstrated exceptional performance in code generation tasks. However, most existing models focus on the abilities of generating correct code, but often struggle with bug repair. We introduce a suite of methods to enhance LLM’s SQL bug-fixing abilities. The methods mainly consist of two parts: a Progressive Dataset Construction (PDC) from scratch and Dynamic Mask Supervised Fine-tuning (DM-SFT). PDC proposes two data expansion methods from the perspectives of breadth first and depth first respectively. DM-SFT introduces an efficient bug-fixing supervised learning approach, which effectively reduces the total training steps and mitigates the “disorientation” in SQL code bug-fixing training. In our evaluation, the code LLM models trained with the two methods exceed all current best-performing models, which are much larger in size.
摘要:代码大语言模型(Code LLMs),如 Code llama 和 DeepSeek-Coder,在代码生成任务中表现出色。然而,大多数现有模型侧重于生成正确代码的能力,但在修复代码漏洞方面往往表现不佳。我们提出了一套方法来增强大语言模型在 SQL 漏洞修复方面的能力。这些方法主要由两部分组成:从零开始的渐进式数据集构建(PDC)和动态掩码监督微调(DM-SFT)。PDC 从广度和深度两个角度分别提出了两种数据扩展方法。DM-SFT 引入了一种高效的漏洞修复监督学习方法,该方法有效减少了总训练步骤,并缓解了 SQL 代码漏洞修复训练中的“迷失方向”问题。在我们的评估中,通过这两种方法训练的代码大语言模型在性能上超越了所有当前表现最佳的模型,尽管这些模型的规模要大得多。
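DM-SFT 的"动态掩码"细节以原论文为准;这里仅示意其所属的一类做法:监督微调时只对被选中的 token 位置计损失(其余位置以 -100 屏蔽),把训练信号集中在需要修复的 SQL 片段上。掩码的生成策略在此为假设。

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # PyTorch 交叉熵默认忽略的标签值

def masked_sft_loss(logits: torch.Tensor, labels: torch.Tensor,
                    keep_mask: torch.Tensor) -> torch.Tensor:
    """仅在 keep_mask 为 True 的 token 位置计算监督损失。

    logits: [batch, seq, vocab];labels: [batch, seq];keep_mask: [batch, seq] 布尔。
    keep_mask 可按(此处假设的)动态策略生成,例如只保留被修改的 SQL 片段。
    """
    masked_labels = labels.masked_fill(~keep_mask, IGNORE_INDEX)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        masked_labels.reshape(-1),
        ignore_index=IGNORE_INDEX,
    )
```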
[NLP-42] Reverse Prompt Engineering
【速读】: 该论文试图解决的是黑箱、零样本语言模型逆向问题,即在不了解模型内部结构的情况下,仅通过语言模型的文本输出重建原始提示(prompt)。解决方案的关键在于提出了一种创新的框架,利用大型语言模型(Large Language Model)和优化算法,以最少的资源有效地恢复提示。实验结果表明,该方法在多个公开数据集上实现了高质量的提示恢复,生成的提示与原始提示相比,比当前最先进的方法更为相似。此外,应用案例研究展示了该方法在生成高质量文本数据方面的强大潜力。
链接: https://arxiv.org/abs/2411.06729
作者: Hanqing Li,Diego Klabjan
关键词-EN: model inversion problem, zero-shot language model, language model inversion, language model, paper explores
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:This paper explores a new black-box, zero-shot language model inversion problem and proposes an innovative framework for prompt reconstruction using only text outputs from a language model. Leveraging a large language model alongside an optimization algorithm, the proposed method effectively recovers prompts with minimal resources. Experimental results on several datasets derived from public sources indicate that the proposed approach achieves high-quality prompt recovery and generates prompts more similar to the originals than current state-of-the-art methods. Additionally, the use-case study demonstrates the method’s strong potential for generating high-quality text data.
摘要:本文探讨了一种新的黑箱、零样本语言模型逆向问题,并提出了一种创新的框架,该框架仅利用语言模型的文本输出进行提示词重建。通过结合大语言模型与优化算法,所提出的方法能够以最少的资源有效地恢复提示词。在来自公共源的多个数据集上的实验结果表明,该方法在提示词恢复的质量上优于当前最先进的方法,并且生成的提示词与原始提示词更为相似。此外,用例研究展示了该方法在生成高质量文本数据方面的强大潜力。
[NLP-43] Model Fusion through Bayesian Optimization in Language Model Fine-Tuning
【速读】: 该论文试图解决在微调预训练模型(fine-tuning pre-trained models)过程中,选择最佳模型和超参数的难题。解决方案的关键在于引入了一种新的模型融合技术(model fusion technique),通过多目标贝叶斯优化(multi-objective Bayesian optimization)同时优化期望的指标和损失函数。此外,论文还提出了一种两阶段过程,将贝叶斯优化过程整合到框架中,以有效选择超参数。实验结果表明,这种贝叶斯优化引导的方法在各种下游任务中显著提升了性能。
链接: https://arxiv.org/abs/2411.06710
作者: Chaeyun Jang,Hyungi Lee,Jungtaek Kim,Juho Lee
关键词-EN: widely adopted technique, widely adopted, adaptability and reliability, Fine-tuning, Bayesian
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Fine-tuning pre-trained models for downstream tasks is a widely adopted technique known for its adaptability and reliability across various domains. Despite its conceptual simplicity, fine-tuning entails several troublesome engineering choices, such as selecting hyperparameters and determining checkpoints from an optimization trajectory. To tackle the difficulty of choosing the best model, one effective solution is model fusion, which combines multiple models in a parameter space. However, we observe a large discrepancy between loss and metric landscapes during the fine-tuning of pre-trained language models. Building on this observation, we introduce a novel model fusion technique that optimizes both the desired metric and loss through multi-objective Bayesian optimization. In addition, to effectively select hyperparameters, we establish a two-stage procedure by integrating Bayesian optimization processes into our framework. Experiments across various downstream tasks show considerable performance improvements using our Bayesian optimization-guided method.
摘要:针对下游任务对预训练模型进行微调是一种广泛采用的技术,以其跨多个领域的适应性和可靠性著称。尽管概念上看似简单,微调过程涉及多个棘手的工程选择,例如选择超参数和从优化轨迹中确定检查点。为了解决选择最佳模型的难题,一种有效的解决方案是模型融合,即在参数空间中结合多个模型。然而,我们在预训练语言模型的微调过程中观察到损失和指标景观之间存在显著差异。基于这一观察,我们提出了一种新颖的模型融合技术,通过多目标贝叶斯优化同时优化期望的指标和损失。此外,为了有效选择超参数,我们将贝叶斯优化过程整合到我们的框架中,建立了一个两阶段程序。在各种下游任务中的实验表明,使用我们的贝叶斯优化引导方法可以显著提升性能。
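该工作在参数空间做模型融合,并用多目标贝叶斯优化同时权衡指标与损失;下面给出按权重线性插值多个微调检查点的通用示意,融合系数即可作为贝叶斯优化的搜索变量。接口为 PyTorch state_dict 风格,属通用做法而非论文实现。

```python
import torch

def fuse_state_dicts(state_dicts: list[dict], weights: list[float]) -> dict:
    """参数空间模型融合:对多个检查点的同名张量按系数加权平均。

    weights 即待搜索的融合系数(例如由多目标贝叶斯优化提议),此处归一化到和为 1。
    """
    total = sum(weights)
    norm_w = [w / total for w in weights]
    fused = {}
    for key in state_dicts[0]:
        fused[key] = sum(w * sd[key].float() for w, sd in zip(norm_w, state_dicts))
    return fused

# 外层流程(示意):贝叶斯优化提议 weights -> 融合 -> 验证集上同时评估指标与损失
# -> 更新代理模型并提议下一组 weights,迭代至收敛。
```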
[NLP-44] What Should Baby Models Read? Exploring Sample-Efficient Data Composition on Model Performance CONLL2024
【速读】: 该论文试图解决在样本高效训练环境下,预训练数据组成对小型语言模型性能的影响问题。解决方案的关键在于识别不同数据源(如儿童导向语言数据 (CHILDES)、经典书籍 (Gutenberg)、合成数据 (TinyStories) 及其混合数据 (Mix))对不同规模模型(从1800万到70500万参数)性能的影响。研究发现,较小规模的模型在训练于更复杂和丰富的数据集(如Gutenberg)时表现更佳,而CHILDES和TinyStories数据集在所有模型规模下表现均不佳。这表明,样本高效训练的最佳数据集选择依赖于模型规模,且儿童导向语言和简化故事并非适用于所有规模的语言模型。论文强调了在样本高效语言模型训练中,考虑数据集组成和模型容量两者的重要性。
链接: https://arxiv.org/abs/2411.06672
作者: Hong Meng Yam,Nathan J Paek
关键词-EN: sample-efficient setting, explore the impact, impact of pre-training, performance of small, pre-training data composition
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 8 pages, 6 figures, CoNLL 2024 (Shared Task) Accepted Paper
点击查看摘要
Abstract:We explore the impact of pre-training data composition on the performance of small language models in a sample-efficient setting. Using datasets limited to 10 million words, we evaluate several dataset sources, including child-directed speech (CHILDES), classic books (Gutenberg), synthetic data (TinyStories), and a mix of these (Mix) across different model sizes ranging from 18 million to 705 million parameters. Our experiments show that smaller models (e.g., GPT2-97M, GPT2-705M, Llama-360M) perform better when trained on more complex and rich datasets like Gutenberg. Models trained on the CHILDES and TinyStories datasets underperformed across all model sizes. These findings suggest that the optimal dataset for sample efficient training depends on the model size, and that neither child-directed speech nor simplified stories are optimal for language models of all sizes. We highlight the importance of considering both dataset composition and model capacity for effective sample efficient language model training.
摘要:我们探讨了预训练数据组成对样本高效设置下小型语言模型性能的影响。使用限制在1000万词的数据集,我们评估了多个数据源,包括面向儿童的语音数据(CHILDES)、经典书籍(Gutenberg)、合成数据(TinyStories)以及这些数据的混合(Mix),涵盖了从1800万到70500万参数的不同模型大小。我们的实验表明,较小的模型(例如,GPT2-97M、GPT2-705M、Llama-360M)在训练于更复杂和丰富的数据集如Gutenberg时表现更佳。在所有模型大小下,使用CHILDES和TinyStories数据集训练的模型表现均不佳。这些发现表明,样本高效训练的最佳数据集依赖于模型大小,并且无论是面向儿童的语音数据还是简化的故事,都不是适用于所有大小语言模型的最佳选择。我们强调了在有效的样本高效语言模型训练中,考虑数据集组成和模型容量两者的重要性。
[NLP-45] Bridge: A Unified Framework to Knowledge Graph Completion via Language Models and Knowledge Representation
【速读】: 该论文试图解决知识图谱补全 (Knowledge Graph Completion, KGC) 中仅利用结构信息或语义信息导致模型性能不佳的问题。解决方案的关键在于提出了一种名为 Bridge 的新框架,该框架通过联合编码知识图谱 (Knowledge Graphs, KGs) 的结构信息和语义信息来提升模型性能。具体来说,Bridge 策略性地分别使用预训练语言模型 (Pre-trained Language Models, PLMs) 对实体和关系进行编码,以更好地利用 PLMs 的语义知识,并通过结构学习原则实现结构化表示学习。此外,为了弥合 KGs 和 PLMs 之间的差距,论文采用了一种自监督表示学习方法 BYOL 对 PLMs 进行微调,但与 BYOL 不同的是,Bridge 将三元组分成两部分来创建不同的视图,从而避免了语义信息的改变。实验结果表明,Bridge 在三个基准数据集上均优于现有的最先进模型。
链接: https://arxiv.org/abs/2411.06660
作者: Qiao Qiao,Yuepei Li,Qing Wang,Kang Zhou,Qi Li
关键词-EN: Knowledge graph completion, existing Knowledge Graphs, graph completion, Knowledge graph, inferring missing triples
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Knowledge graph completion (KGC) is a task of inferring missing triples based on existing Knowledge Graphs (KGs). Both structural and semantic information are vital for successful KGC. However, existing methods only use either the structural knowledge from the KG embeddings or the semantic information from pre-trained language models (PLMs), leading to suboptimal model performance. Moreover, since PLMs are not trained on KGs, directly using PLMs to encode triples may be inappropriate. To overcome these limitations, we propose a novel framework called Bridge, which jointly encodes structural and semantic information of KGs. Specifically, we strategically encode entities and relations separately by PLMs to better utilize the semantic knowledge of PLMs and enable structured representation learning via a structural learning principle. Furthermore, to bridge the gap between KGs and PLMs, we employ a self-supervised representation learning method called BYOL to fine-tune PLMs with two different views of a triple. Unlike BYOL, which uses augmentation methods to create two semantically similar views of the same image (potentially altering the semantic information), we strategically separate the triple into two parts to create different views, thus avoiding semantic alteration. Experiments demonstrate that Bridge outperforms the SOTA models on three benchmark datasets.
摘要:知识图谱补全 (Knowledge Graph Completion, KGC) 是一项基于现有知识图谱 (Knowledge Graphs, KGs) 推断缺失三元组的任务。结构信息和语义信息对于成功的 KGC 都至关重要。然而,现有方法仅使用来自 KG 嵌入的结构知识或来自预训练语言模型 (Pre-trained Language Models, PLMs) 的语义信息,导致模型性能次优。此外,由于 PLMs 并非在 KGs 上进行训练,直接使用 PLMs 编码三元组可能并不合适。为了克服这些限制,我们提出了一种名为 Bridge 的新框架,该框架联合编码 KGs 的结构信息和语义信息。具体而言,我们通过 PLMs 分别策略性地编码实体和关系,以更好地利用 PLMs 的语义知识,并通过结构学习原则实现结构化表示学习。此外,为了弥合 KGs 和 PLMs 之间的差距,我们采用了一种名为 BYOL 的自监督表示学习方法,通过两种不同的三元组视图对 PLMs 进行微调。与 BYOL 使用增强方法创建同一图像的两个语义相似视图(可能改变语义信息)不同,我们策略性地将三元组分成两部分以创建不同视图,从而避免语义改变。实验表明,Bridge 在三个基准数据集上优于现有的最先进模型。
[NLP-46] Renaissance: Investigating the Pretraining of Vision-Language Encoders
【速读】: 该论文试图解决视觉-语言任务模型预训练中的最佳实践问题,特别是关于如何设计、训练这些模型以及在预训练过程中冻结模型部分的有效性。解决方案的关键在于通过元分析实验,发现冻结视觉-语言模型的大部分结构在预训练过程中可以显著节省计算资源,而不会影响下游任务的性能。此外,论文还探讨了基于视觉模型与基于文本模型的视觉-语言Transformer的效果差异,并引入了一个名为Renaissance的建模平台,用于灵活地创建、训练和评估视觉-语言Transformer编码器。
链接: https://arxiv.org/abs/2411.06657
作者: Clayton Fields,Casey Kennington
关键词-EN: past several years, vision-language tasks, questions related, vision-language, Abstract
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:In the past several years there has been an explosion of available models for vision-language tasks. Unfortunately, the literature still leaves open a number of questions related to best practices in designing and training such models. In this paper we seek to answer several questions related to the pretraining of vision-language encoders through meta-analysis. In our first set of experiments, we show that we can save significant compute at no cost to downstream performance, by freezing large parts of vision-language models during pretraining. In our second set of experiments we examine the effect of basing a VL transformer on a vision model versus a text model. Additionally, we introduce a VL modeling platform called Renaissance that we use to conduct all of the experiments. This program offers a great deal of flexibility in creating, training and evaluating transformer encoders for VL modeling. The source code for Renaissance can be found at this https URL.
摘要:近年来,针对视觉-语言任务的可用模型数量激增。然而,相关文献仍未完全解答在设计和训练此类模型时的最佳实践问题。本文旨在通过元分析回答与视觉-语言编码器预训练相关的几个问题。在第一组实验中,我们展示了在预训练过程中冻结视觉-语言模型的大部分内容,可以在不影响下游性能的情况下显著节省计算资源。在第二组实验中,我们探讨了基于视觉模型与基于文本模型构建视觉-语言 Transformer 的效果差异。此外,我们引入了一个名为 Renaissance 的视觉-语言建模平台,用于进行所有实验。该平台在创建、训练和评估视觉-语言 Transformer 编码器方面提供了极大的灵活性。Renaissance 的源代码可以在以下链接找到:https URL。
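"预训练时冻结视觉-语言模型的大部分参数"在 PyTorch 中对应于关闭相应参数的梯度;以下为通用示意,模块名称为假设,Renaissance 的实际接口请参见其源码。

```python
import torch.nn as nn

def freeze_submodules(model: nn.Module, frozen_prefixes: set[str]) -> None:
    """按参数名前缀冻结子模块:被冻结部分不再计算梯度,节省显存与训练算力。"""
    for name, param in model.named_parameters():
        if any(name.startswith(p) for p in frozen_prefixes):
            param.requires_grad = False

# 例如冻结视觉塔与文本塔、仅训练融合层(模块名为假设):
# freeze_submodules(model, {"vision_encoder", "text_encoder"})
```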
[NLP-47] Explore the Reasoning Capability of LLMs in the Chess Testbed NAACL2025
【速读】: 该论文试图解决大型语言模型在长期复杂推理任务(如国际象棋)中的表现不足问题。解决方案的关键在于通过整合专家标注的策略和战术数据集(MATE)来增强模型的推理能力。具体来说,论文收集了包含100万种国际象棋局面及其候选移动的MATE数据集,并由国际象棋专家对这些移动进行了策略和战术的标注。随后,论文对LLaMA-3-8B模型进行了微调,并将其在国际象棋移动选择任务中的表现与GPT、Claude和Gemini等最先进的商业语言模型进行了比较。实验结果表明,通过语言解释增强的模型在选择更优国际象棋移动方面表现优于其他模型,这表明语言解释能够显著提升大型语言模型的推理能力。
链接: https://arxiv.org/abs/2411.06655
作者: Shu Wang,Lei Ji,Renxi Wang,Wenxiao Zhao,Haokun Liu,Yifan Hou,Ying Nian Wu
关键词-EN: human intelligence, large language models, Reasoning, language, language models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: submitted to NAACL2025
点击查看摘要
Abstract:Reasoning is a central capability of human intelligence. In recent years, with the advent of large-scale datasets, pretrained large language models have emerged with new capabilities, including reasoning. However, these models still struggle with long-term, complex reasoning tasks, such as playing chess. Based on the observation that expert chess players employ a dual approach combining long-term strategic play with short-term tactical play along with language explanation, we propose improving the reasoning capability of large language models in chess by integrating annotated strategy and tactic. Specifically, we collect a dataset named MATE, which consists of 1 million chess positions with candidate moves annotated by chess experts for strategy and tactics. We finetune the LLaMA-3-8B model and compare it against state-of-the-art commercial language models in the task of selecting better chess moves. Our experiments show that our models perform better than GPT, Claude, and Gemini models. We find that language explanations can enhance the reasoning capability of large language models.
摘要:推理是人类智能的核心能力之一。近年来,随着大规模数据集的出现,预训练的大语言模型展现出新的能力,包括推理能力。然而,这些模型在处理长期、复杂的推理任务(如国际象棋)时仍面临挑战。基于观察到国际象棋专家采用结合长期战略与短期战术以及语言解释的双重方法,我们提出通过整合标注的战略和战术来提升大语言模型在国际象棋中的推理能力。具体而言,我们收集了一个名为 MATE 的数据集,该数据集包含 100 万个棋局位置,其候选走法由国际象棋专家从战略和战术角度进行了标注。我们对 LLaMA-3-8B 模型进行了微调,并在选择更优走法的任务中与当前最先进的商业语言模型(如 GPT、Claude 和 Gemini)进行了比较。实验结果表明,我们的模型在性能上优于 GPT、Claude 和 Gemini 模型。我们发现,语言解释能够增强大语言模型的推理能力。
[NLP-48] Understanding Scaling Laws with Statistical and Approximation Theory for Transformer Neural Networks on Intrinsically Low-dimensional Data
【速读】: 该论文试图解决的问题是:为什么基于Transformer的大型语言模型在训练时,其泛化误差会遵循依赖于模型规模和数据规模的幂律缩放规律。解决方案的关键在于建立了一种新的统计估计和数学近似理论,用于解释当输入数据集中在一个低维流形上时,Transformer模型的行为。该理论预测了泛化误差与训练数据规模和网络规模之间的幂律关系,其中幂指数取决于训练数据的内在维度 (intrinsic dimension) d。通过利用低维数据结构和流形假设,论文成功解释了Transformer模型的缩放规律,并验证了理论预测与实际训练大型语言模型(LLMs)在自然语言数据集上的观察结果高度一致。这些结果表明,数据的内在维度是影响Transformer模型缩放规律的关键因素。
链接: https://arxiv.org/abs/2411.06646
作者: Alex Havrilla,Wenjing Liao
关键词-EN: deep neural networks, transformer scaling laws, scaling laws, training deep neural, scaling law dependent
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注:
点击查看摘要
Abstract:When training deep neural networks, a model’s generalization error is often observed to follow a power scaling law dependent both on the model size and the data size. Perhaps the best known examples of such scaling laws are for transformer-based large language models, where networks with billions of parameters are trained on trillions of tokens of text. Yet, despite sustained widespread interest, a rigorous understanding of why transformer scaling laws exist is still missing. To answer this question, we establish novel statistical estimation and mathematical approximation theories for transformers when the input data are concentrated on a low-dimensional manifold. Our theory predicts a power law between the generalization error and both the training data size and the network size for transformers, where the power depends on the intrinsic dimension d of the training data. Notably, the constructed model architecture is shallow, requiring only logarithmic depth in d. By leveraging low-dimensional data structures under a manifold hypothesis, we are able to explain transformer scaling laws in a way which respects the data geometry. Moreover, we test our theory with empirical observation by training LLMs on natural language datasets. We find the observed empirical data scaling laws closely agree with our theoretical predictions. Taken together, these results rigorously show the intrinsic dimension of data to be a crucial quantity affecting transformer scaling laws in both theory and practice.
摘要:在训练深度神经网络时,模型的泛化误差通常被观察到遵循一种依赖于模型规模和数据规模的幂律缩放规律。其中,基于 Transformer 的大语言模型可能是这类缩放规律最为人熟知的例子,这些网络拥有数十亿参数,并在数万亿 Token 的文本数据上进行训练。然而,尽管持续受到广泛关注,对于 Transformer 缩放规律为何存在,目前仍缺乏严格的理论理解。为了解答这一问题,我们为 Transformer 建立了新的统计估计和数学近似理论,特别是在输入数据集中于低维流形的情况下。我们的理论预测,在 Transformer 中,泛化误差与训练数据规模和网络规模之间存在一种幂律关系,其中幂指数取决于训练数据的内在维度 d。值得注意的是,所构建的模型架构较浅,其深度仅需与 d 成对数关系。通过利用低维数据结构在流形假设下的特性,我们能够以尊重数据几何的方式解释 Transformer 的缩放规律。此外,我们通过在自然语言数据集上训练大语言模型来验证我们的理论,发现观察到的经验数据缩放规律与我们的理论预测高度一致。综上所述,这些结果严格表明,数据的内在维度是影响 Transformer 缩放规律在理论和实践中至关重要的量。
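作为补充,可以用一条示意公式概括摘要所述的缩放形式(这是按摘要重构的一般形式,并非论文定理的精确陈述;其中 n 为训练样本数,N 为网络规模,d 为数据内在维度):

```latex
\mathbb{E}\big[\mathcal{L}_{\mathrm{gen}}\big]
  \;\lesssim\; n^{-\alpha(d)} + N^{-\beta(d)},
\qquad \alpha(d),\ \beta(d) > 0 \text{ 且随 } d \text{ 增大而减小}.
```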
[NLP-49] Model Editing for LLMs4Code: How Far are We? ICSE2025
【速读】: 该论文试图解决大语言模型在代码生成任务中存在的知识错误或过时问题,特别是在高成本的模型重新训练不可行的情况下。解决方案的关键在于引入了一种名为模型编辑(Model Editing)的新技术,通过系统地比较和分析现有的最先进模型编辑技术,来有效和高效地修正大语言模型中的错误知识。论文提出了一个名为CLMEEval的基准,包含两个数据集(CoNaLa-Edit和CodeSearchNet-Edit),用于评估六种先进的模型编辑技术在三种大语言模型(CodeLlama、CodeQwen1.5和Stable-Code)上的表现。研究发现,基于外部记忆的GRACE方法在知识编辑的有效性和特异性方面表现最佳,但泛化能力(即编辑是否能推广到其他语义相同的输入)是现有技术的普遍挑战。基于此,论文还提出了一种改进版的GRACE,称为A-GRACE,通过对比学习来更好地捕捉输入的语义。
链接: https://arxiv.org/abs/2411.06638
作者: Xiaopeng Li,Shangwen Wang,Shasha Li,Jun Ma,Jie Yu,Xiaodong Liu,Jing Wang,Bin Ji,Weimin Zhang
关键词-EN: Large Language Models, Large Language, software engineering domain, exhibit outstanding performance, model editing techniques
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL)
备注: Accepted by ICSE2025. The code is available at: this https URL
点击查看摘要
Abstract:Large Language Models for Code (LLMs4Code) have been found to exhibit outstanding performance in the software engineering domain, especially the remarkable performance in coding tasks. However, even the most advanced LLMs4Code can inevitably contain incorrect or outdated code knowledge. Due to the high cost of training LLMs4Code, it is impractical to re-train the models for fixing these problematic code knowledge. Model editing is a new technical field for effectively and efficiently correcting erroneous knowledge in LLMs, where various model editing techniques and benchmarks have been proposed recently. Despite that, a comprehensive study that thoroughly compares and analyzes the performance of the state-of-the-art model editing techniques for adapting the knowledge within LLMs4Code across various code-related tasks is notably absent. To bridge this gap, we perform the first systematic study on applying state-of-the-art model editing approaches to repair the inaccuracy of LLMs4Code. To that end, we introduce a benchmark named CLMEEval, which consists of two datasets, i.e., CoNaLa-Edit (CNLE) with 21K+ code generation samples and CodeSearchNet-Edit (CSNE) with 16K+ code summarization samples. With the help of CLMEEval, we evaluate six advanced model editing techniques on three LLMs4Code: CodeLlama (7B), CodeQwen1.5 (7B), and Stable-Code (3B). Our findings include that the external memorization-based GRACE approach achieves the best knowledge editing effectiveness and specificity (the editing does not influence untargeted knowledge), while generalization (whether the editing can generalize to other semantically-identical inputs) is a universal challenge for existing techniques. Furthermore, building on in-depth case analysis, we introduce an enhanced version of GRACE called A-GRACE, which incorporates contrastive learning to better capture the semantics of the inputs.
摘要:代码大语言模型 (LLMs4Code) 在软件工程领域表现出色,特别是在编码任务中展现出显著的性能。然而,即使是最先进的 LLMs4Code 也难免包含错误或过时的代码知识。由于训练 LLMs4Code 的成本高昂,重新训练模型以修正这些有问题的代码知识是不切实际的。模型编辑是一个新兴的技术领域,旨在有效且高效地纠正大语言模型中的错误知识,近期已提出了多种模型编辑技术和基准。尽管如此,目前尚缺乏一项全面的研究,能够彻底比较和分析最先进的模型编辑技术在不同代码相关任务中适应 LLMs4Code 知识的表现。为了填补这一空白,我们首次系统地研究了将最先进的模型编辑方法应用于修复 LLMs4Code 的不准确性。为此,我们引入了一个名为 CLMEEval 的基准,该基准包含两个数据集:CoNaLa-Edit (CNLE) 包含超过 21,000 个代码生成样本,CodeSearchNet-Edit (CSNE) 包含超过 16,000 个代码摘要样本。借助 CLMEEval,我们评估了六种先进的模型编辑技术在三种 LLMs4Code 上的表现:CodeLlama (7B)、CodeQwen1.5 (7B) 和 Stable-Code (3B)。我们的研究发现,基于外部记忆的 GRACE 方法在知识编辑的有效性和特异性(编辑不影响未目标知识)方面表现最佳,而泛化性(编辑能否泛化到其他语义相同的输入)是现有技术的普遍挑战。此外,基于深入的案例分析,我们引入了 GRACE 的增强版本 A-GRACE,该版本结合了对比学习,以更好地捕捉输入的语义。
[NLP-50] CriticAL: Critic Automation with Language Models
【速读】: 该论文试图解决自动化科学模型批评的问题,特别是在使用大型语言模型(LLM)进行科学发现时,如何避免模型批评中的幻觉现象。解决方案的关键在于引入CriticAL(Critic Automation with Language Models),它利用LLM生成捕捉模型预测与数据之间差异的摘要统计量,并通过假设检验评估这些差异的显著性。CriticAL作为一个验证器,通过嵌入假设检验框架来验证模型及其批评的正确性,从而确保生成的批评既准确又透明,且具有可操作性。
链接: https://arxiv.org/abs/2411.06590
作者: Michael Y. Li,Vivek Vajipey,Noah D. Goodman,Emily B. Fox
关键词-EN: fundamental goal, models, CriticAL, scientific research, scientific
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Understanding the world through models is a fundamental goal of scientific research. While large language model (LLM) based approaches show promise in automating scientific discovery, they often overlook the importance of criticizing scientific models. Criticizing models deepens scientific understanding and drives the development of more accurate models. Automating model criticism is difficult because it traditionally requires a human expert to define how to compare a model with data and evaluate if the discrepancies are significant–both rely heavily on understanding the modeling assumptions and domain. Although LLM-based critic approaches are appealing, they introduce new challenges: LLMs might hallucinate the critiques themselves. Motivated by this, we introduce CriticAL (Critic Automation with Language Models). CriticAL uses LLMs to generate summary statistics that capture discrepancies between model predictions and data, and applies hypothesis tests to evaluate their significance. We can view CriticAL as a verifier that validates models and their critiques by embedding them in a hypothesis testing framework. In experiments, we evaluate CriticAL across key quantitative and qualitative dimensions. In settings where we synthesize discrepancies between models and datasets, CriticAL reliably generates correct critiques without hallucinating incorrect ones. We show that both human and LLM judges consistently prefer CriticAL’s critiques over alternative approaches in terms of transparency and actionability. Finally, we show that CriticAL’s critiques enable an LLM scientist to improve upon human-designed models on real-world datasets.
摘要:通过模型理解世界是科学研究的基本目标。尽管基于大语言模型 (LLM) 的方法在自动化科学发现方面显示出潜力,但它们往往忽视了批判科学模型的重要性。批判模型深化了科学理解,并推动了更准确模型的开发。自动化模型批判是困难的,因为传统上需要人类专家来定义如何比较模型与数据,并评估差异是否显著——这两者都严重依赖于对建模假设和领域的理解。尽管基于 LLM 的批判方法具有吸引力,但它们引入了新的挑战:LLM 可能会产生幻觉式的批判。受此启发,我们引入了 CriticAL(Critic Automation with Language Models)。CriticAL 使用 LLM 生成捕捉模型预测与数据之间差异的摘要统计量,并应用假设检验来评估其显著性。我们可以将 CriticAL 视为一个验证器,通过将其嵌入假设检验框架来验证模型及其批判。在实验中,我们评估了 CriticAL 在关键的定量和定性维度上的表现。在合成模型与数据集之间差异的设置中,CriticAL 可靠地生成了正确的批判,而没有产生错误的幻觉。我们展示了在透明性和可操作性方面,人类和 LLM 评判者一致偏好 CriticAL 的批判,而非其他方法。最后,我们展示了 CriticAL 的批判使 LLM 科学家能够在真实世界数据集上改进人类设计的模型。
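下面给出 CriticAL 验证环节思路的一个最小示意:用摘要统计量刻画"预测与数据的差异",再用假设检验判断其显著性。其中以固定的残差均值代替论文中由 LLM 生成的统计量,这一替换属于本文示例的假设:

```python
# 示意代码:用假设检验评估"模型预测 vs 观测数据"差异的显著性。
import numpy as np
from scipy import stats

def critique(y_pred: np.ndarray, y_obs: np.ndarray, alpha: float = 0.05) -> dict:
    residuals = y_obs - y_pred                           # 摘要统计量:残差
    # 原假设 H0:残差均值为 0(模型无系统性偏差)
    t_stat, p_value = stats.ttest_1samp(residuals, popmean=0.0)
    return {
        "summary_stat": float(residuals.mean()),
        "t_stat": float(t_stat),
        "p_value": float(p_value),
        "significant": bool(p_value < alpha),            # 显著则输出一条"批评"
    }

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y_obs = rng.normal(loc=0.5, scale=1.0, size=200)     # 观测数据
    y_pred = np.zeros(200)                               # 存在系统性偏差的预测
    print(critique(y_pred, y_obs))
```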
[NLP-51] The KIPARLA Forest treebank of spoken Italian: an overview of initial design choices
【速读】: 该论文旨在解决为意大利KIParla语料库创建树库(treebank)的初始设计问题。解决方案的关键在于详细讨论和确定树库构建过程中的设计选择,包括语料的标注标准、句法结构分析方法以及数据结构的设计,以确保树库的准确性和实用性。
链接: https://arxiv.org/abs/2411.06554
作者: Ludovica Pannitto,Caterina Mauri
关键词-EN: Italian KIParla corpus, initial design choices, design choices discussed, Italian KIParla, KIParla corpus
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:The paper presents an overview of initial design choices discussed towards the creation of a treebank for the Italian KIParla corpus.
摘要:本文概述了在为意大利KIParla语料库创建树库过程中所讨论的初步设计选择。
[NLP-52] In-Context Learning for Preserving Patient Privacy: A Framework for Synthesizing Realistic Patient Portal Messages ALT ML4H
【速读】: 该论文试图解决在COVID-19疫情后,患者门户消息数量大幅增加导致临床医生工作负担加重的问题。解决方案的关键在于引入一个基于大型语言模型(LLM)的框架,用于可配置且真实的患者门户消息生成。该框架利用少样本基础文本生成技术,仅需少量去标识化的患者门户消息即可帮助LLM更好地匹配真实数据的样式和语气。与现有的隐私保护合成文本生成方法不同,该框架被临床专家认为是符合HIPAA标准的,能够确保所有敏感属性得到保护。通过广泛的定量和人工评估,研究显示该框架生成的数据质量高于其他生成方法及相关数据集。这一工作为大规模合成患者消息数据集的发布提供了路径,这些数据集在风格上与真实样本相似,并且符合HIPAA标准,同时减少了人工去标识化的工作量。
链接: https://arxiv.org/abs/2411.06549
作者: Joseph Gatto,Parker Seegmiller,Timothy E. Burdick,Sarah Masud Preum
关键词-EN: patient portal messages, patient portal, significantly contributing, portal messages, large and sustained
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Findings paper presented at Machine Learning for Health (ML4H) symposium 2024, December 15-16, 2024, Vancouver, Canada, 8 pages
点击查看摘要
Abstract:Since the COVID-19 pandemic, clinicians have seen a large and sustained influx in patient portal messages, significantly contributing to clinician burnout. To the best of our knowledge, there are no large-scale public patient portal messages corpora researchers can use to build tools to optimize clinician portal workflows. Informed by our ongoing work with a regional hospital, this study introduces an LLM-powered framework for configurable and realistic patient portal message generation. Our approach leverages few-shot grounded text generation, requiring only a small number of de-identified patient portal messages to help LLMs better match the true style and tone of real data. Clinical experts in our team deem this framework as HIPAA-friendly, unlike existing privacy-preserving approaches to synthetic text generation which cannot guarantee all sensitive attributes will be protected. Through extensive quantitative and human evaluation, we show that our framework produces data of higher quality than comparable generation methods as well as all related datasets. We believe this work provides a path forward for (i) the release of large-scale synthetic patient message datasets that are stylistically similar to ground-truth samples and (ii) HIPAA-friendly data generation which requires minimal human de-identification efforts.
摘要:自 COVID-19 疫情以来,临床医生面临了大量且持续增长的病人门户消息,这显著加剧了临床医生的职业倦怠。据我们所知,目前尚无大规模的公开病人门户消息语料库供研究人员用于构建优化临床医生门户工作流程的工具。基于我们与区域医院的持续合作,本研究引入了一个由大语言模型 (LLM) 驱动的框架,用于可配置且真实的病人门户消息生成。我们的方法利用少样本基础文本生成 (few-shot grounded text generation),仅需少量去标识化的病人门户消息,即可帮助大语言模型更好地匹配真实数据的样式和语气。我们团队的临床专家认为,该框架符合 HIPAA 标准,与现有的合成文本生成隐私保护方法不同,后者无法保证所有敏感属性都能得到保护。通过广泛的定量和人工评估,我们证明该框架生成的数据质量高于可比生成方法以及所有相关数据集。我们相信这项工作为以下两个方面提供了前进的道路:(i) 发布大规模的合成病人消息数据集,这些数据集在风格上与真实样本相似;(ii) 符合 HIPAA 标准的数据生成,仅需最少的人工去标识化工作。
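关于论文所述的 few-shot grounded 文本生成,下面是一个与论文实现无关的提示词组装示意:用少量去标识化真实消息作为示例,引导模型匹配真实数据的语气与风格(函数与具体措辞均为本文假设):

```python
# 示意代码:少样本提示词组装,用去标识化示例"接地"生成风格。
def build_portal_prompt(deidentified_examples: list[str], topic: str) -> str:
    shots = "\n\n".join(f"示例 {i + 1}:\n{msg}"
                        for i, msg in enumerate(deidentified_examples))
    return (
        "你是一名患者,正在通过医院门户网站给医生留言。\n"
        f"请模仿以下去标识化真实消息的语气与风格,围绕“{topic}”写一条新消息,\n"
        "不得包含任何真实姓名、日期或联系方式。\n\n"
        f"{shots}\n\n新消息:"
    )
```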
[NLP-53] CineXDrama: Relevance Detection and Sentiment Analysis of Bangla YouTube Comments on Movie-Drama using Transformers: Insights from Interpretability Tool
【速读】: 该论文试图解决在YouTube上对孟加拉语电影和剧集评论进行情感分析时,如何有效过滤无关评论的问题。解决方案的关键在于提出一个两阶段的系统:首先评估评论的相关性,然后对被判定为相关的评论进行情感分析。论文引入了包含14,000条手动收集和预处理评论的数据集,这些评论被标注为相关或不相关以及正面或负面情感。通过使用包括BanglaBERT在内的八个Transformer模型进行分类任务,BanglaBERT在相关性检测和情感分析中分别达到了83.99%和93.3%的最高准确率。此外,研究还整合了LIME(Local Interpretable Model-agnostic Explanations)以解释模型决策,增强了系统的透明性。
链接: https://arxiv.org/abs/2411.06548
作者: Usafa Akther Rifa,Pronay Debnath,Busra Kamal Rafa,Shamaun Safa Hridi,Md. Aminur Rahman
关键词-EN: platform for Bangla, Bangla movies, recent years, movies and dramas, leading platform
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:In recent years, YouTube has become the leading platform for Bangla movies and dramas, where viewers express their opinions in comments that convey their sentiments about the content. However, not all comments are relevant for sentiment analysis, necessitating a filtering mechanism. We propose a system that first assesses the relevance of comments and then analyzes the sentiment of those deemed relevant. We introduce a dataset of 14,000 manually collected and preprocessed comments, annotated for relevance (relevant or irrelevant) and sentiment (positive or negative). Eight transformer models, including BanglaBERT, were used for classification tasks, with BanglaBERT achieving the highest accuracy (83.99% for relevance detection and 93.3% for sentiment analysis). The study also integrates LIME to interpret model decisions, enhancing transparency.
摘要:近年来,YouTube已成为孟加拉电影和戏剧的主要平台,观众在评论中表达他们对内容的情感。然而,并非所有评论都适合进行情感分析,因此需要一个过滤机制。我们提出了一种系统,首先评估评论的相关性,然后对被判定为相关的评论进行情感分析。我们引入了一个包含14,000条手动收集和预处理的评论的数据集,这些评论被标注为相关性(相关或不相关)和情感(正面或负面)。我们使用了八个Transformer模型,包括BanglaBERT,用于分类任务,其中BanglaBERT在相关性检测和情感分析中分别达到了最高准确率(83.99%和93.3%)。研究还集成了LIME来解释模型决策,增强了透明度。
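两阶段流程可以按如下方式串联(仅为示意:模型路径与标签名均为占位假设,实际应替换为各自微调后的 BanglaBERT 分类器):

```python
# 示意代码:先相关性过滤、再情感分析的两阶段流水线。
from transformers import pipeline

relevance_clf = pipeline("text-classification", model="path/to/relevance-banglabert")
sentiment_clf = pipeline("text-classification", model="path/to/sentiment-banglabert")

def analyze(comment: str) -> dict:
    rel = relevance_clf(comment)[0]
    if rel["label"] != "relevant":          # 第一阶段:过滤无关评论
        return {"relevant": False}
    sent = sentiment_clf(comment)[0]        # 第二阶段:情感分析
    return {"relevant": True, "sentiment": sent["label"], "score": sent["score"]}
```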
[NLP-54] Probabilistic Consensus through Ensemble Validation: A Framework for LLM Reliability
【速读】: 该论文试图解决大语言模型 (Large Language Models, LLMs) 在高度依赖可靠性的领域(如医疗、法律和金融)中自主部署时缺乏可靠性的问题。解决方案的关键在于引入了一种新的框架,通过模型共识来重新利用集成方法进行内容验证。该框架在78个复杂案例测试中显著提高了精确度,从73.1%提升至93.9%(使用两个模型)和95.6%(使用三个模型),并显示出较强的模型间一致性(κ > 0.76)。尽管当前方法受限于多选格式要求和处理延迟,但它为在关键应用中实现可靠的自主AI系统提供了直接价值。
链接: https://arxiv.org/abs/2411.06535
作者: Ninad Naik
关键词-EN: Large Language Models, Large Language, shown significant advances, domains like healthcare, shown significant
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 8 pages, 6 tables
点击查看摘要
Abstract:Large Language Models (LLMs) have shown significant advances in text generation but often lack the reliability needed for autonomous deployment in high-stakes domains like healthcare, law, and finance. Existing approaches rely on external knowledge or human oversight, limiting scalability. We introduce a novel framework that repurposes ensemble methods for content validation through model consensus. In tests across 78 complex cases requiring factual accuracy and causal consistency, our framework improved precision from 73.1% to 93.9% with two models (95% CI: 83.5%-97.9%) and to 95.6% with three models (95% CI: 85.2%-98.8%). Statistical analysis indicates strong inter-model agreement ($\kappa > 0.76$) while preserving sufficient independence to catch errors through disagreement. We outline a clear pathway to further enhance precision with additional validators and refinements. Although the current approach is constrained by multiple-choice format requirements and processing latency, it offers immediate value for enabling reliable autonomous AI systems in critical applications.
摘要:大语言模型(LLMs)在文本生成方面取得了显著进展,但在医疗、法律和金融等高风险领域的自主部署中,往往缺乏所需的可靠性。现有方法依赖于外部知识或人工监督,限制了其可扩展性。我们提出了一种新颖的框架,通过模型共识重新利用集成方法进行内容验证。在涵盖78个复杂案例的测试中,这些案例需要事实准确性和因果一致性,我们的框架在两个模型的情况下将精度从73.1%提高到93.9%(95% CI: 83.5%-97.9%),在三个模型的情况下提高到95.6%(95% CI: 85.2%-98.8%)。统计分析表明,模型间具有较强的共识性(κ > 0.76),同时保持了足够的独立性,通过分歧捕捉错误。我们明确了一条通过增加验证器和改进来进一步提高精度的路径。尽管当前方法受限于多项选择格式要求和处理延迟,但它为在关键应用中实现可靠的自主AI系统提供了即时价值。
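其核心的共识验证逻辑大致如下(最小示意:模型调用接口与阈值均为本文假设,论文中的具体指标与流程以原文为准):

```python
# 示意代码:多模型共识验证(多选题场景)的最小骨架。
from collections import Counter
from typing import Callable, Optional

def consensus_validate(question: str,
                       models: list[Callable[[str], str]],
                       min_agree: float = 1.0) -> tuple[Optional[str], float]:
    answers = [m(question) for m in models]           # 每个模型给出一个选项
    top, votes = Counter(answers).most_common(1)[0]
    agreement = votes / len(answers)
    # 仅当共识比例达到阈值才接受答案,否则拒答或交由人工复核
    return (top if agreement >= min_agree else None), agreement
```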
[NLP-55] Epistemic Integrity in Large Language Models
【速读】: 该论文试图解决大语言模型(Large Language Models, LLMs)在生成信息时存在的高自信度与实际准确性之间的不匹配问题,即认知失调(epistemic miscalibration)。解决方案的关键在于引入了一个新的人类标注数据集和一个创新的方法来测量LLMs的语言自信度,该方法相比之前的基准降低了超过50%的错误率。通过在多个数据集上的验证,该方法揭示了模型在语言表达上的自信度与其真实准确性之间的显著不一致。进一步的人类评估证实了这种失调的严重性,强调了LLMs过度自信可能导致的误导用户的风险。该框架为诊断和纠正这种失调提供了重要步骤,为实现跨领域的更可信的AI系统铺平了道路。
链接: https://arxiv.org/abs/2411.06528
作者: Bijean Ghafouri,Shahrad Mohammadzadeh,James Zhou,Pratheeksha Nair,Jacob-Junqi Tian,Mayank Goel,Reihaneh Rabbany,Jean-François Godbout,Kellin Pelrine
关键词-EN: Large language models, high confidence poses, Large language, confidence poses risks, language models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
点击查看摘要
Abstract:Large language models are increasingly relied upon as sources of information, but their propensity for generating false or misleading statements with high confidence poses risks for users and society. In this paper, we confront the critical problem of epistemic miscalibration – where a model’s linguistic assertiveness fails to reflect its true internal certainty. We introduce a new human-labeled dataset and a novel method for measuring the linguistic assertiveness of Large Language Models (LLMs) which cuts error rates by over 50% relative to previous benchmarks. Validated across multiple datasets, our method reveals a stark misalignment between how confidently models linguistically present information and their actual accuracy. Further human evaluations confirm the severity of this miscalibration. This evidence underscores the urgent risk of the overstated certainty LLMs hold which may mislead users on a massive scale. Our framework provides a crucial step forward in diagnosing this miscalibration, offering a path towards correcting it and more trustworthy AI across domains.
摘要:大语言模型正越来越多地被用作信息来源,但它们高自信地生成虚假或误导性陈述的倾向对用户和社会构成了风险。本文针对知识性误校准这一关键问题——即模型的语言断言性未能反映其真实的内部确定性——进行了探讨。我们引入了一个新的人工标注数据集和一种新颖的方法来测量大语言模型(LLMs)的语言断言性,该方法相对于先前的基准将错误率降低了超过50%。在多个数据集上验证后,我们的方法揭示了模型在语言上呈现信息时的自信程度与其实际准确性之间存在显著的不一致。进一步的人类评估证实了这种误校准的严重性。这一证据强调了LLMs过度自信的确定性可能大规模误导用户的紧迫风险。我们的框架为诊断这种误校准提供了关键步骤,为纠正这一问题并实现跨领域的更可信的AI提供了路径。
[NLP-56] CULL-MT: Compression Using Language and Layer pruning for Machine Translation
【速读】: 该论文试图解决多语言机器翻译模型在支持大量语言对时推理操作成本过高的问题。解决方案的关键在于提出了一种名为CULL-MT的压缩方法,通过结构层剪枝和选择性语言方向来优化模型。具体来说,CULL-MT采用贪心策略识别并剪除不重要的层,然后通过知识蒸馏和参数高效微调来减轻剪枝带来的影响。实验结果表明,在NLLB-3.3B和LLaMA3.1-8B-Instruct模型上,CULL-MT能够在保持翻译质量的同时显著减少模型层数,从而降低推理成本。
链接: https://arxiv.org/abs/2411.06506
作者: Pedram Rostami,Mohammad Javad Dousti
关键词-EN: outperform traditional bilingual, Multilingual machine translation, Multilingual machine, traditional bilingual models, outperform traditional
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Multilingual machine translation models often outperform traditional bilingual models by leveraging translation knowledge transfer. Recent advancements have led to these models supporting hundreds of languages and achieving state-of-the-art results across various translation directions. However, as these models grow larger, their inference operations become increasingly costly. In many use cases, there is no need to support such a wide range of language pairs, as translation is typically needed in only a few selected directions. In this paper, we present CULL-MT, a compression method for machine translation models based on structural layer pruning and selected language directions. Our approach identifies and prunes unimportant layers using a greedy strategy, then mitigates the impact by applying knowledge distillation from the original model along with parameter-efficient fine-tuning. We apply CULL-MT to the NLLB-3.3B and LLaMA3.1-8B-Instruct models. In a multi-way translation scenario (Persian, French, and German to English), we find the NLLB-3.3B model to be robust, allowing 25% of layers to be pruned with only a 0.9 spBLEU drop. However, LLaMA3.1-8B-Instruct is more sensitive, with a 2.0 spBLEU drop after pruning 5 layers.
摘要:多语言机器翻译模型通常通过利用翻译知识迁移来超越传统的双语模型。近期的进展使得这些模型能够支持数百种语言,并在各种翻译方向上达到最先进的结果。然而,随着这些模型的规模不断扩大,其推理操作的成本也显著增加。在许多应用场景中,并不需要支持如此广泛的语言对,因为翻译通常仅在少数选定的方向上需要。本文提出了CULL-MT,一种基于结构层剪枝和选定语言方向的机器翻译模型压缩方法。我们的方法采用贪婪策略识别并剪枝不重要的层,然后通过从原始模型进行知识蒸馏和参数高效微调来减轻影响。我们将CULL-MT应用于NLLB-3.3B和LLaMA3.1-8B-Instruct模型。在多向翻译场景(波斯语、法语和德语到英语)中,我们发现NLLB-3.3B模型具有较强的鲁棒性,允许剪枝25%的层,仅导致0.9 spBLEU的下降。然而,LLaMA3.1-8B-Instruct模型则更为敏感,剪枝5层后导致2.0 spBLEU的下降。
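贪心层剪枝的骨架可示意如下(与 CULL-MT 的具体实现无关:eval_spbleu 为假设的评测回调,阈值与停机条件也属本文假设):

```python
# 示意代码:贪心结构层剪枝——每轮删除使 spBLEU 下降最小的一层。
from typing import Callable

def greedy_prune(layers: list,
                 eval_spbleu: Callable[[list], float],
                 budget: int,
                 max_drop: float = 1.0) -> list:
    base = eval_spbleu(layers)
    for _ in range(budget):
        candidates = [(i, eval_spbleu(layers[:i] + layers[i + 1:]))
                      for i in range(len(layers))]
        i_best, score = max(candidates, key=lambda x: x[1])   # 损失最小的删除
        if base - score > max_drop:
            break                                             # 下降过大,停止剪枝
        layers = layers[:i_best] + layers[i_best + 1:]
        base = score
    return layers
```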
[NLP-57] VocalTweets: Investigating Social Media Offensive Language Among Nigerian Musicians
【速读】: 该论文试图解决音乐人在社交媒体上使用攻击性语言的问题,特别是在尼日利亚音乐人群体中的表现。解决方案的关键在于引入了VocalTweets数据集,这是一个包含12位尼日利亚知名音乐人推文的代码转换和多语言数据集,采用二元分类方法标记为“正常”或“攻击性”。通过使用HuggingFace的base-Twitter-RoBERTa模型进行训练,论文实现了74.5的F1分数,并通过与OLID数据集的跨语料库实验验证了数据集的通用性。
链接: https://arxiv.org/abs/2411.06477
作者: Sunday Oluyele,Juwon Akingbade,Victor Akinode
关键词-EN: express their opinions, posts online, convey different messages, music compared, social media
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 13 pages, 5 figures, 6 tables
点击查看摘要
Abstract:Musicians frequently use social media to express their opinions, but they often convey different messages in their music compared to their posts online. Some utilize these platforms to abuse their colleagues, while others use them to show support for political candidates or engage in activism, as seen during the #EndSars protest. While extensive research has been done on offensive language detection on social media, the usage of offensive language by musicians has received limited attention. In this study, we introduce VocalTweets, a code-switched and multilingual dataset comprising tweets from 12 prominent Nigerian musicians, labeled with a binary classification method as Normal or Offensive. We trained a model using HuggingFace’s base-Twitter-RoBERTa, achieving an F1 score of 74.5. Additionally, we conducted cross-corpus experiments with the OLID dataset to evaluate the generalizability of our dataset.
摘要:音乐家经常使用社交媒体表达他们的观点,但他们在音乐中传达的信息往往与在线帖子中的不同。有些人利用这些平台攻击同事,而另一些人则通过这些平台支持政治候选人或参与社会活动,如在#EndSars抗议活动中所见。尽管关于社交媒体上冒犯性语言检测的研究已经非常广泛,但音乐家使用冒犯性语言的情况却鲜有关注。在本研究中,我们引入了VocalTweets,这是一个代码切换和多语言的数据集,包含了来自12位尼日利亚知名音乐家的推文,采用二元分类方法标记为“正常”或“冒犯性”。我们使用HuggingFace的base-Twitter-RoBERTa模型进行训练,达到了74.5的F1分数。此外,我们还与OLID数据集进行了跨语料库实验,以评估我们数据集的泛化能力。
[NLP-58] ClinicalBench: Can LLMs Beat Traditional ML Models in Clinical Prediction?
【速读】: 该论文试图解决的问题是:在临床预测任务中,大型语言模型(LLMs)是否能够超越传统的机器学习模型(如SVM和XGBoost)。解决方案的关键在于构建了一个名为ClinicalBench的新基准,用于全面研究通用和医疗领域LLMs在临床预测建模中的能力,并将其与传统ML模型进行比较。ClinicalBench涵盖了三种常见的临床预测任务、两个数据库、14个通用LLMs、8个医疗LLMs以及11个传统ML模型。通过广泛的实证研究,论文发现尽管LLMs在不同模型规模、多样化的提示或微调策略下表现出色,但在临床预测任务中仍未能超越传统ML模型,这揭示了LLMs在临床推理和决策制定方面的潜在不足。因此,论文呼吁在临床应用中采用LLMs时需谨慎,并建议ClinicalBench可以作为连接LLMs在医疗领域开发与实际临床实践之间的桥梁。
链接: https://arxiv.org/abs/2411.06469
作者: Canyu Chen,Jian Yu,Shan Chen,Che Liu,Zhongwei Wan,Danielle Bitterman,Fei Wang,Kai Shu
关键词-EN: Large Language Models, Large Language, hold great promise, text processing tasks, medical licensing exams
类目: Computation and Language (cs.CL)
备注: The first two authors contributed equally. 10 pages for main paper, 66 pages including appendix. Project website: this https URL
点击查看摘要
Abstract:Large Language Models (LLMs) hold great promise to revolutionize current clinical systems for their superior capacities on medical text processing tasks and medical licensing exams. Meanwhile, traditional ML models such as SVM and XGBoost have still been mainly adopted in clinical prediction tasks. An emerging question is: can LLMs beat traditional ML models in clinical prediction? Thus, we build a new benchmark ClinicalBench to comprehensively study the clinical predictive modeling capacities of both general-purpose and medical LLMs, and compare them with traditional ML models. ClinicalBench embraces three common clinical prediction tasks, two databases, 14 general-purpose LLMs, 8 medical LLMs, and 11 traditional ML models. Through extensive empirical investigation, we discover that both general-purpose and medical LLMs, even with different model scales, diverse prompting or fine-tuning strategies, still cannot beat traditional ML models in clinical prediction yet, shedding light on their potential deficiency in clinical reasoning and decision-making. We call for caution when practitioners adopt LLMs in clinical applications. ClinicalBench can be utilized to bridge the gap between LLMs’ development for healthcare and real-world clinical practice.
摘要:大语言模型 (LLMs) 在革新当前临床系统方面展现出巨大潜力,这主要归功于其在医学文本处理任务和医学执照考试中的卓越能力。与此同时,传统的机器学习模型如支持向量机 (SVM) 和极端梯度提升 (XGBoost) 仍然在临床预测任务中占据主导地位。一个新兴的问题是:大语言模型能否在临床预测中超越传统机器学习模型?为此,我们构建了一个新的基准——临床基准 (ClinicalBench),以全面研究通用和医学大语言模型在临床预测建模方面的能力,并将其与传统机器学习模型进行比较。临床基准涵盖了三种常见的临床预测任务、两个数据库、14个通用大语言模型、8个医学大语言模型以及11个传统机器学习模型。通过广泛的实证研究,我们发现,尽管通用和医学大语言模型在模型规模、提示策略或微调方法上存在差异,但它们在临床预测中仍未能超越传统机器学习模型,这揭示了它们在临床推理和决策制定方面的潜在不足。我们呼吁从业者在临床应用中采用大语言模型时应保持谨慎。临床基准可以用于弥合大语言模型在医疗领域的开发与实际临床实践之间的差距。
[NLP-59] Prompt-Efficient Fine-Tuning for GPT-like Deep Models to Reduce Hallucination and to Improve Reproducibility in Scientific Text Generation Using Stochastic Optimisation Techniques
【速读】: 该论文试图解决大型语言模型(LLMs)在复杂科学文本生成任务中存在的准确性、一致性和幻觉控制问题。解决方案的关键在于采用参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)方法,特别是针对GPT-like模型引入低秩适应(Low-Rank Adaptation, LoRA)适配器,以专门针对质谱学领域的文献进行微调,生成名为MS-GPT的模型。通过BLEU、ROUGE和困惑度(Perplexity)等评估方法,MS-GPT在文本连贯性和可重复性方面表现优于基线GPT-2,并通过统计分析(Wilcoxon秩和检验)得到验证。此外,论文提出了一种基于模型输出在受控提示下余弦相似度的可重复性度量,展示了MS-GPT在稳定性方面的提升。该研究强调了PEFT在优化LLMs以适应科学环境中的潜力,同时降低了计算成本并提高了模型可靠性。
链接: https://arxiv.org/abs/2411.06445
作者: Daniil Sulimov
关键词-EN: Large Language Models, Large Language, text generation tasks, generation tasks, limitations in accuracy
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 73 pages, 6 figures
点击查看摘要
Abstract:Large Language Models (LLMs) are increasingly adopted for complex scientific text generation tasks, yet they often suffer from limitations in accuracy, consistency, and hallucination control. This thesis introduces a Parameter-Efficient Fine-Tuning (PEFT) approach tailored for GPT-like models, aiming to mitigate hallucinations and enhance reproducibility, particularly in the computational domain of mass spectrometry. We implemented Low-Rank Adaptation (LoRA) adapters to refine GPT-2, termed MS-GPT, using a specialized corpus of mass spectrometry literature. Through novel evaluation methods applied to LLMs, including BLEU, ROUGE, and Perplexity scores, the fine-tuned MS-GPT model demonstrated superior text coherence and reproducibility compared to the baseline GPT-2, confirmed through statistical analysis with the Wilcoxon rank-sum test. Further, we propose a reproducibility metric based on cosine similarity of model outputs under controlled prompts, showcasing MS-GPT’s enhanced stability. This research highlights PEFT’s potential to optimize LLMs for scientific contexts, reducing computational costs while improving model reliability.
摘要:大语言模型 (LLMs) 在复杂的科学文本生成任务中得到了越来越多的应用,但它们在准确性、一致性和幻觉控制方面往往存在局限性。本论文提出了一种针对 GPT 类模型的参数高效微调 (Parameter-Efficient Fine-Tuning, PEFT) 方法,旨在减少幻觉现象并增强可重复性,特别是在质谱计算领域。我们采用了低秩适应 (Low-Rank Adaptation, LoRA) 适配器对 GPT-2 进行微调,命名为 MS-GPT,使用了一个专门的质谱文献语料库。通过应用包括 BLEU、ROUGE 和困惑度 (Perplexity) 评分在内的新颖评估方法,微调后的 MS-GPT 模型在文本连贯性和可重复性方面表现优于基线 GPT-2,这一结论通过 Wilcoxon 秩和检验的统计分析得到了证实。此外,我们提出了一种基于模型输出在受控提示下余弦相似度的可重复性度量方法,展示了 MS-GPT 增强的稳定性。本研究突显了 PEFT 在优化科学背景下大语言模型的潜力,同时降低了计算成本并提高了模型可靠性。
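用 LoRA 适配器微调 GPT-2 的典型写法如下(基于 peft 库的常见用法;超参数为示意,MS-GPT 的实际配置以论文为准):

```python
# 示意代码:用 LoRA 对 GPT-2 做参数高效微调。
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("gpt2")

lora_cfg = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["c_attn"],      # GPT-2 注意力中的合并 QKV 投影
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # 仅训练少量低秩参数,其余冻结
```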
[NLP-60] PLM-Based Discrete Diffusion Language Models with Entropy-Adaptive Gibbs Sampling
【速读】: 该论文试图解决将预训练语言模型 (Pretrained Language Models, PLMs) 与离散扩散模型有效集成的问题,特别是在下游自然语言生成任务中的性能提升。解决方案的关键在于提出了 Diffusion-EAGS 方法,该方法通过引入熵跟踪模块来辅助 PLMs 在扩散过程中的去噪决策,并提出了基于熵的噪声调度策略,以提高生成阶段的熵自适应采样效果。这些创新使得 Diffusion-EAGS 在下游生成任务中表现优异,不仅提高了文本质量和多样性,还能实现精确的词级别控制,并适应双语和低资源环境。
链接: https://arxiv.org/abs/2411.06438
作者: Hyukhun Koh,Minha Jhang,Dohyung Kim,Sangmook Lee,Kyomin Jung
关键词-EN: Pretrained Language Models, integrating Pretrained Language, discrete diffusion language, demonstrated promising results, language models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Recently, discrete diffusion language models have demonstrated promising results in NLP. However, there has been limited research on integrating Pretrained Language Models (PLMs) into discrete diffusion models, resulting in underwhelming performance in downstream NLP generation tasks. This integration is particularly challenging because of the discrepancy between step-wise denoising strategy of diffusion models and single-step mask prediction approach of MLM-based PLMs. In this paper, we introduce Diffusion-EAGS, a novel approach that effectively integrates PLMs with the diffusion models. Furthermore, as it is challenging for PLMs to determine where to apply denoising during the diffusion process, we integrate an entropy tracking module to assist them. Finally, we propose entropy-based noise scheduling in the forward process to improve the effectiveness of entropy-adaptive sampling throughout the generation phase. Experimental results show that Diffusion-EAGS outperforms existing diffusion baselines in downstream generation tasks, achieving high text quality and diversity with precise token-level control. We also show that our model is capable of adapting to bilingual and low-resource settings, which are common in real-world applications.
摘要:近年来,离散扩散语言模型在自然语言处理(NLP)领域展示了令人鼓舞的结果。然而,将预训练语言模型(Pretrained Language Models, PLMs)整合到离散扩散模型中的研究相对有限,导致在下游NLP生成任务中的表现不尽如人意。这种整合尤其具有挑战性,因为扩散模型的逐步去噪策略与基于掩码语言模型(MLM)的PLMs的单步掩码预测方法之间存在差异。本文中,我们提出了Diffusion-EAGS,一种新颖的方法,能够有效整合PLMs与扩散模型。此外,由于PLMs在扩散过程中难以确定去噪的应用位置,我们引入了一个熵跟踪模块来辅助它们。最后,我们在前向过程中提出了基于熵的噪声调度,以提高生成阶段熵自适应采样的有效性。实验结果表明,Diffusion-EAGS在下游生成任务中优于现有的扩散基线模型,实现了高质量和多样化的文本生成,并具有精确的Token级控制。我们还展示了我们的模型能够适应双语和低资源环境,这在实际应用中非常常见。
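熵自适应采样的核心思路可示意如下:每一步只确定当前模型最有把握(熵最低)的掩码位置(这是按摘要重构的简化逻辑,并非论文原实现):

```python
# 示意代码:每步解开 k 个熵最低的掩码位置,其余留待后续去噪。
import torch

def entropy_adaptive_step(logits: torch.Tensor, masked: torch.Tensor, k: int = 1):
    """logits: [seq, vocab];masked: [seq] 布尔掩码,True 表示位置仍被掩码。"""
    probs = logits.softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)   # 逐位置熵
    entropy[~masked] = float("inf")                                # 只在掩码位置中选
    pick = entropy.topk(k, largest=False).indices                  # 熵最低的 k 个位置
    tokens = probs[pick].argmax(dim=-1)                            # 贪心填入
    masked = masked.clone()
    masked[pick] = False
    return pick, tokens, masked
```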
[NLP-61] SequentialBreak: Large Language Models Can be Fooled by Embedding Jailbreak Prompts into Sequential Prompt Chains
【速读】: 该论文试图解决大型语言模型(LLMs)在面对顺序提示链攻击时容易受到恶意提示影响的问题。解决方案的关键在于提出了一种名为SequentialBreak的新型越狱攻击方法,该方法利用顺序提示链中的漏洞,通过在单个查询中嵌入有害提示于良性提示之间,从而诱导LLMs生成有害响应。SequentialBreak展示了其在不同场景(如题库、对话完成和游戏环境)中的灵活性和适应性,并通过实验证明了其在攻击成功率上的显著提升。该研究强调了加强LLM安全性的迫切需求,并提供了相关实验结果和代码的GitHub仓库。
链接: https://arxiv.org/abs/2411.06426
作者: Bijoy Ahmed Saiem,MD Sadik Hossain Shanto,Rakib Ahsan,Md Rafi ur Rashid
关键词-EN: Large Language Models, Large Language, raising significant security, significant security concerns, Language Models
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:As the integration of the Large Language Models (LLMs) into various applications increases, so does their susceptibility to misuse, raising significant security concerns. Numerous jailbreak attacks have been proposed to assess the security defense of LLMs. Current jailbreak attacks mainly rely on scenario camouflage, prompt obfuscation, prompt optimization, and prompt iterative optimization to conceal malicious prompts. In particular, sequential prompt chains in a single query can lead LLMs to focus on certain prompts while ignoring others, facilitating context manipulation. This paper introduces SequentialBreak, a novel jailbreak attack that exploits this vulnerability. We discuss several scenarios, not limited to examples like Question Bank, Dialog Completion, and Game Environment, where the harmful prompt is embedded within benign ones that can fool LLMs into generating harmful responses. The distinct narrative structures of these scenarios show that SequentialBreak is flexible enough to adapt to various prompt formats beyond those discussed. Extensive experiments demonstrate that SequentialBreak uses only a single query to achieve a substantial gain of attack success rate over existing baselines against both open-source and closed-source models. Through our research, we highlight the urgent need for more robust and resilient safeguards to enhance LLM security and prevent potential misuse. All the result files and website associated with this research are available in this GitHub repository: this https URL.
摘要:随着大语言模型 (LLM) 在各种应用中的集成日益增加,其易受滥用的风险也随之上升,引发了重大的安全担忧。已有多种越狱攻击方法被提出,以评估 LLM 的安全防御能力。当前的越狱攻击主要依赖于场景伪装、提示混淆、提示优化和提示迭代优化,以隐藏恶意提示。特别是,单个查询中的顺序提示链可以使 LLM 专注于某些提示而忽略其他提示,从而促进上下文操控。本文介绍了 SequentialBreak,这是一种利用此漏洞的新型越狱攻击。我们讨论了多种场景,包括但不限于题库、对话补全和游戏环境等,在这些场景中,有害提示被嵌入在良性提示中,从而诱使 LLM 生成有害响应。这些场景的独特叙事结构表明,SequentialBreak 具有足够的灵活性,能够适应各种提示格式,而不仅限于所讨论的格式。广泛的实验表明,SequentialBreak 仅通过单个查询就能在现有基线基础上显著提高攻击成功率,无论是针对开源还是闭源模型。通过我们的研究,我们强调了迫切需要更强大和更具弹性的防护措施,以增强 LLM 的安全性并防止潜在的滥用。所有与本研究相关的结果文件和网站均可在以下 GitHub 仓库中获取:this https URL。
[NLP-62] Ablation is Not Enough to Emulate DPO: How Neuron Dynamics Drive Toxicity Reduction
【速读】: 该论文试图解决的问题是关于安全微调算法(Safety fine-tuning algorithms)如何有效减少语言模型输出的有害内容的具体内部机制。当前的解释认为,直接偏好优化(Direct Preference Optimisation, DPO)通过抑制最具毒性的多层感知器(MLP)神经元来学习偏移量,以避免残差流中的有毒区域。然而,通过消融最具毒性的神经元并应用激活补丁技术,研究发现这一解释并不完整。关键发现是,DPO通过多个神经元组的累积效应来降低毒性,不仅减少了有毒方向的写作,还促进了残差流中的反毒性。此外,DPO对神经元激活的调整具有噪声性,许多神经元实际上增加了毒性,这表明DPO是通过平衡对立神经元效应来实现毒性减少的复杂过程。
链接: https://arxiv.org/abs/2411.06424
作者: Yushi Yang,Filip Sondej,Harry Mayne,Adam Mahdi
关键词-EN: Safety fine-tuning algorithms, fine-tune language models, exact internal mechanisms, Safety fine-tuning, reduce harmful outputs
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Safety fine-tuning algorithms are commonly used to fine-tune language models to reduce harmful outputs, but the exact internal mechanisms of how those models achieve this remain unclear. In studying direct preference optimisation (DPO) for toxicity reduction, current explanations claim that DPO works by dampening the most toxic MLP neurons to learn an offset to avert toxic regions in the residual stream. However, by ablating the most toxic neurons and applying activation patching, we find this explanation incomplete. By projecting neuron activation changes onto a toxicity probe, we find that only 31.8% of toxicity reduction comes from dampened toxic neurons. Instead, DPO reduces toxicity by accumulating effects across multiple neuron groups, both reducing writing in the toxic direction and promoting anti-toxicity in the residual stream. Moreover, DPO gives noisy adjustments to neuron activations, with many neurons actually increasing toxicity. This indicates that DPO is a balancing process between opposing neuron effects to achieve toxicity reduction.
摘要:安全微调算法常用于微调语言模型以减少有害输出,但其内部机制的确切工作原理尚不清楚。在研究用于降低毒性的直接偏好优化(DPO)时,现有解释声称DPO通过抑制最具毒性的MLP神经元来学习偏移量,以避开残差流中的毒性区域。然而,通过消融最具毒性的神经元并应用激活补丁,我们发现这一解释并不完整。通过将神经元激活变化投影到毒性探针上,我们发现仅有31.8%的毒性降低来自被抑制的毒性神经元。相反,DPO通过在多个神经元组之间积累效应来降低毒性,既减少了向毒性方向的书写,又促进了残差流中的抗毒性。此外,DPO对神经元激活进行了噪声调整,许多神经元实际上增加了毒性。这表明DPO是一个在对抗性神经元效应之间进行平衡的过程,以实现毒性降低。
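论文中"把激活变化投影到毒性探针上"的度量方式,大致可写成如下草图(探针的训练方式与激活的具体取法以原文为准,这里的输入形状均为本文假设):

```python
# 示意代码:度量 DPO 前后各神经元写出变化在毒性方向上的分量。
import torch

def toxicity_contribution(act_before: torch.Tensor,
                          act_after: torch.Tensor,
                          probe: torch.Tensor) -> torch.Tensor:
    """act_*: [num_neurons, d_model] 各神经元写出向量;probe: [d_model] 毒性方向。"""
    probe = probe / probe.norm()
    delta = act_after - act_before          # 微调引起的写出变化
    return delta @ probe                    # 负值=降低毒性,正值=增加毒性

# 据此即可统计:总毒性降低中有多大比例来自"被抑制的毒性神经元"。
```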
[NLP-63] Fineweb-Edu-Ar: Machine-translated Corpus to Support Arabic Small Language Models
【速读】: 该论文试图解决多语言大型语言模型(LLMs)在训练过程中面临的高质量数据稀缺问题。解决方案的关键在于利用机器翻译(Machine Translation, MT)技术,将高质量的英文文本适配到目标低资源语言,从而生成合成数据集。具体来说,论文介绍了FineWeb-Edu-Ar,这是一个基于HuggingFace的FineWeb-Edu数据集的机器翻译版本,是目前公开可用的最大阿拉伯语机器翻译数据集,按阿拉伯语训练的分词器 (tokenizer) 计,规模达202B个token。
链接: https://arxiv.org/abs/2411.06402
作者: Sultan Alrashed,Dmitrii Khizbullin,David R. Pugh
关键词-EN: large language models, grow and develop, data demands, language models, large language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:As large language models (LLMs) grow and develop, so do their data demands. This is especially true for multilingual LLMs, where the scarcity of high-quality and readily available data online has led to a multitude of synthetic dataset generation approaches. A key technique in this space is machine translation (MT), where high-quality English text is adapted to a target, comparatively low-resource language. This report introduces FineWeb-Edu-Ar, a machine-translated version of the exceedingly popular (deduplicated) FineWeb-Edu dataset from HuggingFace. To the best of our knowledge, FineWeb-Edu-Ar is the largest publicly available machine-translated Arabic dataset out there, with a size of 202B tokens as measured by an Arabic-trained tokenizer.
摘要:随着大语言模型 (Large Language Model, LLM) 的不断发展和壮大,其对数据的需求也随之增加。对于多语言大语言模型而言,这一现象尤为明显。由于高质量且易于获取的在线数据稀缺,导致出现了多种合成数据集生成方法。在这一领域中,机器翻译 (Machine Translation, MT) 是一项关键技术,它将高质量的英文文本适配到目标的、相对资源匮乏的语言中。本报告介绍了 FineWeb-Edu-Ar,这是 HuggingFace 上极受欢迎的(去重后的)FineWeb-Edu 数据集的机器翻译版本。据我们所知,FineWeb-Edu-Ar 是目前公开可用的最大的机器翻译阿拉伯语数据集,按阿拉伯语训练的分词器 (tokenizer) 计,其规模达到 2020 亿个 Token。
[NLP-64] CausalStock: Deep End-to-end Causal Discovery for News-driven Stock Movement Prediction NEURIPS2024
【速读】: 该论文试图解决新闻驱动的多股票走势预测任务中的两个关键问题:一是如何有效发现股票之间的因果关系(relation discovery),二是如何从大量新闻数据中提取有效信息以减少噪音影响。解决方案的关键在于提出了一个名为CausalStock的新框架,该框架通过设计一种依赖于滞后的时间因果发现机制(lag-dependent temporal causal discovery mechanism)来建模股票间的时间因果图分布,并利用功能因果模型(Functional Causal Model)来封装和预测股票走势。此外,通过利用大型语言模型(LLMs)的优秀文本评估能力,设计了一个去噪新闻编码器(Denoised News Encoder),以从海量新闻数据中提取有用信息。实验结果表明,CausalStock在多个真实世界数据集上优于现有基线方法,并提供了具有良好解释性的预测机制。
链接: https://arxiv.org/abs/2411.06391
作者: Shuqi Li,Yuebo Sun,Yuxin Lin,Xin Gao,Shuo Shang,Rui Yan
关键词-EN: multi-stock movement prediction, news-driven multi-stock movement, movement prediction, multi-stock movement, causal relations
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL)
备注: Accepted by NeurIPS 2024
点击查看摘要
Abstract:There are two issues in news-driven multi-stock movement prediction tasks that are not well solved in the existing works. On the one hand, “relation discovery” is a pivotal part when leveraging the price information of other stocks to achieve accurate stock movement prediction. Given that stock relations are often unidirectional, such as the “supplier-consumer” relationship, causal relations are more appropriate to capture the impact between stocks. On the other hand, there is substantial noise existing in the news data leading to extracting effective information with difficulty. With these two issues in mind, we propose a novel framework called CausalStock for news-driven multi-stock movement prediction, which discovers the temporal causal relations between stocks. We design a lag-dependent temporal causal discovery mechanism to model the temporal causal graph distribution. Then a Functional Causal Model is employed to encapsulate the discovered causal relations and predict the stock movements. Additionally, we propose a Denoised News Encoder by taking advantage of the excellent text evaluation ability of large language models (LLMs) to extract useful information from massive news data. The experiment results show that CausalStock outperforms the strong baselines for both news-driven multi-stock movement prediction and multi-stock movement prediction tasks on six real-world datasets collected from the US, China, Japan, and UK markets. Moreover, getting benefit from the causal relations, CausalStock could offer a clear prediction mechanism with good explainability.
摘要:在新闻驱动的多股票走势预测任务中,现有研究尚未充分解决两个关键问题。一方面,“关系发现”在利用其他股票的价格信息进行准确的股票走势预测时起着至关重要的作用。鉴于股票关系通常是单向的,例如“供应商-消费者”关系,因果关系更适合捕捉股票之间的影响。另一方面,新闻数据中存在大量噪音,导致难以提取有效信息。针对这两个问题,我们提出了一种名为CausalStock的新框架,用于新闻驱动的多股票走势预测,该框架能够发现股票之间的时间因果关系。我们设计了一种依赖于滞后的时间因果发现机制,以建模时间因果图的分布。随后,采用功能因果模型(Functional Causal Model)来封装发现的因果关系并预测股票走势。此外,我们提出了一种去噪新闻编码器,利用大语言模型(LLMs)优秀的文本评估能力,从海量新闻数据中提取有用信息。实验结果表明,CausalStock在新闻驱动的多股票走势预测和多股票走势预测任务中,均优于六个来自美国、中国、日本和英国市场的真实数据集上的强基线模型。此外,得益于因果关系,CausalStock能够提供具有良好解释性的清晰预测机制。
[NLP-65] Self-Training Meets Consistency: Improving LLMs' Reasoning With Consistency-Driven Rationale Evaluation
【速读】: 该论文试图解决现有自训练方法在评估大语言模型(LLMs)生成的理由(rationales)时可能存在的单一评估标准导致模型学习到错误推理模式的问题。解决方案的关键在于提出了CREST(Consistency-driven Rationale Evaluation for Self-Training)框架,通过后续问题对每个理由进行进一步评估,并利用这一评估结果指导训练。具体方法包括:(1) 过滤掉在后续问题中频繁导致错误答案的理由;(2) 基于原始问题和后续问题的理由评估结果的混合偏好进行偏好学习。实验结果表明,CREST不仅提高了理由的逻辑鲁棒性和正确性,还增强了模型的推理能力。
链接: https://arxiv.org/abs/2411.06387
作者: Jaehyeok Lee,Keisuke Sakaguchi,JinYeong Bak
关键词-EN: large language models, approach for large, large language, improves reasoning abilities, language models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: under review
点击查看摘要
Abstract:Self-training approach for large language models (LLMs) improves reasoning abilities by training the models on their self-generated rationales. Previous approaches have labeled rationales that produce correct answers for a given question as appropriate for training. However, a single measure risks misjudging rationale quality, leading the models to learn flawed reasoning patterns. To address this issue, we propose CREST (Consistency-driven Rationale Evaluation for Self-Training), a self-training framework that further evaluates each rationale through follow-up questions and leverages this evaluation to guide its training. Specifically, we introduce two methods: (1) filtering out rationales that frequently result in incorrect answers on follow-up questions and (2) preference learning based on mixed preferences from rationale evaluation results of both original and follow-up questions. Experiments on three question-answering datasets using open LLMs show that CREST not only improves the logical robustness and correctness of rationales but also improves reasoning abilities compared to previous self-training approaches.
摘要:针对大语言模型 (LLM) 的自训练方法通过在其自我生成的推理路径上进行训练,提升了模型的推理能力。以往的方法将那些能正确回答给定问题的推理路径标记为适合训练的。然而,单一的评估标准存在误判推理路径质量的风险,可能导致模型学习到错误的推理模式。为解决这一问题,我们提出了 CREST(基于一致性的推理路径评估自训练框架),该框架通过后续问题进一步评估每条推理路径,并利用这一评估结果指导训练。具体来说,我们引入了两种方法:(1)过滤掉在后续问题中频繁导致错误答案的推理路径;(2)基于原始问题和后续问题的推理路径评估结果的混合偏好进行偏好学习。在三个问答数据集上使用开放的 LLM 进行的实验表明,CREST 不仅提高了推理路径的逻辑鲁棒性和正确性,还比以往的自训练方法提升了推理能力。
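CREST 第一步"按后续问题正确率过滤推理路径"可示意如下(接口与阈值均为本文假设):

```python
# 示意代码:保留在后续问题上足够稳健的推理路径,其余过滤掉。
from typing import Callable

def filter_rationales(rationales: list[str],
                      followups: list[tuple[str, str]],
                      answer_with: Callable[[str, str], str],
                      min_acc: float = 0.5) -> list[str]:
    kept = []
    for r in rationales:
        correct = sum(answer_with(r, q) == gold for q, gold in followups)
        if correct / len(followups) >= min_acc:   # 后续问题正确率达标才保留
            kept.append(r)
    return kept
```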
[NLP-66] LLM Vocabulary Compression for Low-Compute Environments NEURIPS2024
【速读】: 该论文试图解决语言模型中最终线性层(final linear layer)的内存占用问题,通过压缩该层来减少内存使用,从而提高模型在低计算资源环境下的适用性。解决方案的关键在于利用字节对编码(Byte Pair Encoding, BPE)对token进行分组,避免生成内存密集型的logits张量,从而在不显著影响性能的情况下,将内存使用减少了高达3.4倍。实验结果表明,该方法在TinyStories数据集上的表现与GPT-Neo和GPT2相当,同时吞吐量提高了最多3倍。
链接: https://arxiv.org/abs/2411.06371
作者: Sreeram Vennam,Anish Joishy,Ponnurangam Kumaraguru
关键词-EN: significant performance loss, reducing memory usage, final linear layer, Byte Pair Encoding, language models
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Machine Learning and Compression Workshop @ NeurIPS 2024
点击查看摘要
Abstract:We present a method to compress the final linear layer of language models, reducing memory usage by up to 3.4x without significant performance loss. By grouping tokens based on Byte Pair Encoding (BPE) merges, we prevent materialization of the memory-intensive logits tensor. Evaluations on the TinyStories dataset show that our method performs on par with GPT-Neo and GPT2 while significantly improving throughput by up to 3x, making it suitable for low-compute environments.
摘要:我们提出了一种压缩语言模型最终线性层的方法,能够在不显著损失性能的情况下将内存使用量减少至多3.4倍。通过基于字节对编码(Byte Pair Encoding, BPE)合并的Token分组,我们避免了内存密集型logits张量的具体化。在TinyStories数据集上的评估表明,我们的方法在性能上与GPT-Neo和GPT2相当,同时吞吐量显著提高了至多3倍,使其非常适合低计算环境。
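按组分解输出层、避免物化完整 logits 张量的思路,可用如下草图表达(分组方式与论文基于 BPE 合并的细节不同之处均属本文假设):

```python
# 示意代码:两级输出头——先选 Token 组,再只物化所选组内的 logits。
import torch
import torch.nn as nn

class GroupedLMHead(nn.Module):
    def __init__(self, d_model: int, group_sizes: list[int]):
        super().__init__()
        self.group_head = nn.Linear(d_model, len(group_sizes))      # 先选组
        self.inner_heads = nn.ModuleList(
            nn.Linear(d_model, g) for g in group_sizes)             # 组内再选词

    def forward(self, h: torch.Tensor):
        """h: [batch, d_model];返回 (组 logits, 各样本所选组内的词 logits 列表)。"""
        group_logits = self.group_head(h)                           # [batch, num_groups]
        gid = group_logits.argmax(dim=-1)
        inner_logits = [self.inner_heads[int(g)](h[i])              # 只算所选组
                        for i, g in enumerate(gid)]
        return group_logits, inner_logits
```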
[NLP-67] Prompts Matter: Comparing ML/GAI Approaches for Generating Inductive Qualitative Coding Results
【速读】: 该论文试图解决的问题是如何利用生成式 AI (GAI) 技术提高定性研究中的归纳编码效率和准确性。解决方案的关键在于将人类编码过程融入到 GAI 的指令中,通过引入理论指导和已知方法,优化 GAI 的编码结果。研究结果表明,这种方法显著减少了 GAI 编码结果与人类编码之间的差异,显示出其在定性编码过程中的优势。
链接: https://arxiv.org/abs/2411.06316
作者: John Chen,Alexandros Lotsos,Lexie Zhao,Grace Wang,Uri Wilensky,Bruce Sherin,Michael Horn
关键词-EN: Inductive qualitative methods, research for decades, conduct rigorously, inductive coding results, mainstay of education
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Accepted by AERA 2025 Annual Meeting
点击查看摘要
Abstract:Inductive qualitative methods have been a mainstay of education research for decades, yet it takes much time and effort to conduct rigorously. Recent advances in artificial intelligence, particularly with generative AI (GAI), have led to initial success in generating inductive coding results. Like human coders, GAI tools rely on instructions to work, and how to instruct it may matter. To understand how ML/GAI approaches could contribute to qualitative coding processes, this study applied two known and two theory-informed novel approaches to an online community dataset and evaluated the resulting coding results. Our findings show significant discrepancies between ML/GAI approaches and demonstrate the advantage of our approaches, which introduce human coding processes into GAI prompts.
摘要:归纳性定性方法几十年来一直是教育研究的主要手段,但其执行过程需要大量的时间和精力。近年来,人工智能,特别是生成式 AI (Generative AI) 的进步,已经在生成归纳编码结果方面取得了初步成功。与人类编码员类似,生成式 AI 工具依赖于指令来工作,而如何指导它可能至关重要。为了理解机器学习 (ML) 和生成式 AI 方法如何能够促进定性编码过程,本研究应用了两种已知的方法和两种基于理论的新方法于一个在线社区数据集,并评估了由此产生的编码结果。我们的研究结果显示,ML 和生成式 AI 方法之间存在显著差异,并展示了我们方法的优势,即在生成式 AI 提示中引入人类编码过程。
[NLP-68] Golden Touchstone: A Comprehensive Bilingual Benchmark for Evaluating Financial Large Language Models
【速读】: 该论文试图解决金融领域大型语言模型(LLM)评估标准化的迫切需求,现有金融基准在语言和任务覆盖范围有限、数据质量低以及对LLM评估适应性不足等问题。解决方案的关键是提出了“Golden Touchstone”,这是首个全面的双语金融LLM基准,涵盖了中英文的八个核心金融自然语言处理(NLP)任务。该基准通过广泛的开放数据收集和行业特定需求开发,旨在全面评估模型的语言理解和生成能力。通过对比分析主要模型(如GPT-4、Llama3、FinGPT和FinMA)在该基准上的表现,揭示了它们在处理复杂金融信息方面的优缺点。此外,论文还开源了Touchstone-GPT,一个通过持续预训练和金融指令调优的金融LLM,展示了在双语基准上的强大性能,但仍存在特定领域的局限性。
链接: https://arxiv.org/abs/2411.06272
作者: Xiaojun Wu,Junxi Liu,Huanyi Su,Zhouchi Lin,Yiyan Qi,Chengjin Xu,Jiajun Su,Jiajie Zhong,Fuwei Wang,Saizhuo Wang,Fengrui Hua,Jia Li,Jian Guo
关键词-EN: increasingly prevalent, standardized method, method to comprehensively, comprehensively assess, Golden Touchstone
类目: Computation and Language (cs.CL); Computational Engineering, Finance, and Science (cs.CE)
备注: 26 pages, 9 tables, 3 figures
点击查看摘要
Abstract:As large language models become increasingly prevalent in the financial sector, there is a pressing need for a standardized method to comprehensively assess their performance. However, existing finance benchmarks often suffer from limited language and task coverage, as well as challenges such as low-quality datasets and inadequate adaptability for LLM evaluation. To address these limitations, we propose “Golden Touchstone”, the first comprehensive bilingual benchmark for financial LLMs, which incorporates representative datasets from both Chinese and English across eight core financial NLP tasks. Developed from extensive open source data collection and industry-specific demands, this benchmark includes a variety of financial tasks aimed at thoroughly assessing models’ language understanding and generation capabilities. Through comparative analysis of major models on the benchmark, such as GPT-4o, Llama3, FinGPT and FinMA, we reveal their strengths and limitations in processing complex financial information. Additionally, we open-sourced Touchstone-GPT, a financial LLM trained through continual pre-training and financial instruction tuning, which demonstrates strong performance on the bilingual benchmark but still has limitations in specific tasks. This research not only provides the financial large language models with a practical evaluation tool but also guides the development and optimization of future research. The source code for Golden Touchstone and model weight of Touchstone-GPT have been made publicly available at this https URL, contributing to the ongoing evolution of FinLLMs and fostering further research in this critical area.
摘要:随着大语言模型在金融领域的应用日益广泛,迫切需要一种标准化的方法来全面评估其性能。然而,现有的金融基准测试往往存在语言和任务覆盖范围有限的问题,同时还面临数据质量低下和适应性不足等挑战。为了解决这些局限性,我们提出了“Golden Touchstone”,这是首个全面的双语金融大语言模型基准测试,涵盖了中英文两种语言在八个核心金融自然语言处理任务中的代表性数据集。该基准测试基于广泛的开源数据收集和行业特定需求开发,包含多种金融任务,旨在全面评估模型的语言理解和生成能力。通过对GPT-4o、Llama3、FinGPT和FinMA等主要模型在该基准测试上的比较分析,我们揭示了它们在处理复杂金融信息方面的优势和局限性。此外,我们还开源了Touchstone-GPT,这是一款通过持续预训练和金融指令微调训练的金融大语言模型,该模型在双语基准测试中表现出色,但在特定任务上仍存在局限性。这项研究不仅为金融大语言模型提供了一个实用的评估工具,还指导了未来研究的发展和优化。Golden Touchstone的源代码和Touchstone-GPT的模型权重已在this https URL公开发布,为金融大语言模型的持续演进和该关键领域的进一步研究做出了贡献。
[NLP-69] Robust Detection of LLM-Generated Text: A Comparative Analysis
【速读】: 该论文试图解决大语言模型(LLM)生成文本的检测问题,关键在于开发一种能够准确识别文本是否由LLM生成的检测器。解决方案的核心是采用多种分类方法,包括传统的机器学习技术(如逻辑回归、k-means聚类、高斯朴素贝叶斯、支持向量机)、基于Transformer的方法(如BERT),以及使用LLM自身进行检测的算法。研究重点在于模型的泛化能力、对抗攻击的潜在风险以及模型评估的准确性。
链接: https://arxiv.org/abs/2411.06248
作者: Yongye Su,Yuqing Wu
关键词-EN: generate complex texts, large language models, aspects of life, network resources, ability of large
类目: Computation and Language (cs.CL)
备注: 8 pages
点击查看摘要
Abstract:The ability of large language models to generate complex texts allows them to be widely integrated into many aspects of life, and their output can quickly fill all network resources. As the impact of LLMs grows, it becomes increasingly important to develop powerful detectors for the generated text. Such detectors are essential to prevent the potential misuse of these technologies and to protect areas such as social media from the negative effects of false content generated by LLMs. The main goal of LLM-generated text detection is to determine whether text is generated by an LLM, which is a basic binary classification task. In our work, we mainly use three different classification approaches based on open source datasets: traditional machine learning techniques such as logistic regression, k-means clustering, Gaussian Naive Bayes, and support vector machines; transformer-based methods such as BERT; and finally algorithms that use LLMs themselves to detect LLM-generated text. We focus on model generalization, potential adversarial attacks, and accuracy of model evaluation. Finally, possible future research directions are proposed, and the current experimental results are summarized.
摘要:大语言模型生成复杂文本的能力使其能够广泛集成到生活的许多方面,其输出可以迅速填满所有网络资源。随着大语言模型的影响力不断扩大,开发强大的生成文本检测器变得愈发重要。这一检测器对于防止这些技术的潜在滥用,以及保护社交媒体等区域免受大语言模型生成虚假内容的负面影响至关重要。大语言模型生成文本检测的主要目标是确定文本是否由大语言模型生成,这是一个基本的二分类任务。在我们的工作中,我们基于开源数据集主要使用了三种不同的分类方法:传统的机器学习技术,如逻辑回归、k-means 聚类、高斯朴素贝叶斯、支持向量机;基于 Transformer 的方法,如 BERT;以及使用大语言模型来检测大语言模型生成文本的算法。我们重点关注模型的泛化能力、潜在的对抗攻击以及模型评估的准确性。最后,提出了未来的可能研究方向,并总结了当前的实验结果。
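摘要中"传统机器学习"一类基线(以 TF-IDF + 逻辑回归为例)的最小可运行示意如下(数据为占位,仅展示流水线结构):

```python
# 示意代码:TF-IDF 特征 + 逻辑回归的二分类 LLM 文本检测器。
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["an llm wrote this ...", "a human wrote this ..."]   # 占位训练数据
labels = [1, 0]                                               # 1 = LLM 生成

detector = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
detector.fit(texts, labels)
print(detector.predict(["is this text machine generated?"]))
```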
[NLP-70] An $\mathbf{L}^*$ Algorithm for Deterministic Weighted Regular Languages
【速读】: 该论文试图解决从黑箱模型中提取有限状态自动机(Finite State Automata, FSAs)的问题,以获得对复杂模型行为的可解释性洞察。解决方案的关键在于提出了一种加权版本的Angluin的L*算法,用于精确学习支持除法的确定性加权FSAs。该方法不仅忠实于原始算法,还通过将学习过程与FSA最小化联系起来,展示了L*算法如何直接学习目标语言的最小自动机。
链接: https://arxiv.org/abs/2411.06228
作者: Clemente Pasti,Talu Karagöz,Anej Svete,Franz Nowak,Reda Boumasmoud,Ryan Cotterell
关键词-EN: Extracting finite state, complex model behaviors, finite state automata, black-box models offers, gaining interpretable insights
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Extracting finite state automata (FSAs) from black-box models offers a powerful approach to gaining interpretable insights into complex model behaviors. To support this pursuit, we present a weighted variant of Angluin’s (1987) $\mathbf{L}^*$ algorithm for learning FSAs. We stay faithful to the original algorithm, devising a way to exactly learn deterministic weighted FSAs whose weights support division. Furthermore, we formulate the learning process in a manner that highlights the connection with FSA minimization, showing how $\mathbf{L}^*$ directly learns a minimal automaton for the target language.
摘要:从黑箱模型中提取有限状态自动机 (Finite State Automata, FSAs) 提供了一种强大的方法,以获得对复杂模型行为的可解释性洞察。为了支持这一追求,我们提出了一种基于 Angluin (1987) 的 $\mathbf{L}^*$ 算法的加权变体,用于学习 FSAs。我们忠实于原始算法,设计了一种方法来精确学习支持除法的确定性加权 FSAs。此外,我们将学习过程以一种突出与 FSA 最小化联系的方式进行公式化,展示了 $\mathbf{L}^*$ 如何直接为目标语言学习一个最小化的自动机。
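L* 风格学习的核心是"成员查询填表 + 闭合性检查"。下面给出一个未加权的极简骨架以说明这一机制;加权版本需把布尔隶属值换成权重并依赖除法,细节以论文为准(接口均为本文假设):

```python
# 示意代码:L* 观察表的最小骨架(未加权,仅演示机制)。
from typing import Callable

def fill_table(prefixes: list[str], suffixes: list[str],
               member: Callable[[str], bool]) -> dict:
    # 对每个前缀,用成员查询记录它在各后缀下的行为,作为该前缀的"行"
    return {p: tuple(member(p + s) for s in suffixes) for p in prefixes}

def closed(table: dict, prefixes: list[str], alphabet: list[str],
           suffixes: list[str], member: Callable[[str], bool]):
    rows = set(table.values())
    for p in prefixes:
        for a in alphabet:
            row = tuple(member(p + a + s) for s in suffixes)
            if row not in rows:
                return False, p + a      # 发现新行,需把该前缀加入表中
    return True, None
```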
[NLP-71] Incorporating Human Explanations for Robust Hate Speech Detection ACL
【速读】: 该论文试图解决大型变压器语言模型(LM)在仇恨言论(HS)检测中的泛化性和鲁棒性问题,特别是在处理隐含的刻板印象意图时。解决方案的关键在于设计了一个新的任务——刻板印象意图蕴含(Stereotype Intent Entailment, SIE),通过该任务鼓励模型在上下文中理解刻板印象的存在。论文通过三阶段分析,首先确认了建模上下文相关的刻板印象意图的必要性,然后设计了SIE任务,最后通过消融测试和用户研究验证了SIE目标能提升内容理解,但仍需解决隐含意图建模的挑战。
链接: https://arxiv.org/abs/2411.06213
作者: Jennifer L. Chen,Faisal Ladhak,Daniel Li,Noémie Elhadad
关键词-EN: large transformer language, robustness present ethical, present ethical implications, Social Bias Frames, Bias Frames dataset
类目: Computation and Language (cs.CL)
备注: 2021 ACL Unimplicit Workshop
点击查看摘要
Abstract:Given the black-box nature and complexity of large transformer language models (LM), concerns about generalizability and robustness present ethical implications for domains such as hate speech (HS) detection. Using the content rich Social Bias Frames dataset, containing human-annotated stereotypes, intent, and targeted groups, we develop a three stage analysis to evaluate if LMs faithfully assess hate speech. First, we observe the need for modeling contextually grounded stereotype intents to capture implicit semantic meaning. Next, we design a new task, Stereotype Intent Entailment (SIE), which encourages a model to contextually understand stereotype presence. Finally, through ablation tests and user studies, we find a SIE objective improves content understanding, but challenges remain in modeling implicit intent.
摘要:鉴于大型 Transformer 语言模型(LM)的黑箱性质和复杂性,其在仇恨言论(HS)检测等领域的泛化能力和鲁棒性问题引发了伦理关注。我们利用内容丰富的 Social Bias Frames 数据集,该数据集包含人工标注的刻板印象、意图和目标群体信息,设计了一个三阶段的分析流程,以评估这些语言模型是否能够忠实地识别仇恨言论。首先,我们观察到,为了捕捉隐含的语义意义,需要对基于上下文的刻板印象意图进行建模。其次,我们设计了一项新的任务——刻板印象意图蕴含(Stereotype Intent Entailment, SIE),旨在促使模型在上下文中理解刻板印象的存在。最后,通过消融测试和用户研究,我们发现 SIE 目标确实提升了内容理解能力,但在建模隐含意图方面仍面临挑战。
[NLP-72] IOPO: Empowering LLMs with Complex Instruction Following via Input-Output Preference Optimization
【速读】: 该论文试图解决大型语言模型(LLMs)在复杂指令跟随能力方面的不足,特别是在指令复杂性快速增加的应用场景中。解决方案的关键在于提出了TRACE基准和IOPO(Input-Output Preference Optimization)对齐方法。TRACE基准包含120K训练数据和1K评估数据,用于提升和评估复杂指令跟随能力。IOPO方法通过考虑输入和输出偏好对,使LLMs不仅快速对齐响应偏好,还能细致地探索指令偏好。实验结果表明,IOPO在域内和域外数据集上分别比SFT和DPO方法有显著提升,分别为8.15%、2.18%和6.29%、3.13%。
链接: https://arxiv.org/abs/2411.06208
作者: Xinghua Zhang,Haiyang Yu,Cheng Fu,Fei Huang,Yongbin Li
关键词-EN: large language models, applications leverage LLMs, accurately follow instructions, language models, realm of large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Work in progress
点击查看摘要
Abstract:In the realm of large language models (LLMs), the ability of models to accurately follow instructions is paramount as more agents and applications leverage LLMs for construction, where the complexity of instructions is rapidly increasing. However, on the one hand, there is only a certain amount of complex instruction evaluation data; on the other hand, there are no dedicated algorithms to improve the ability to follow complex instructions. To this end, this paper introduces TRACE, a benchmark for improving and evaluating the complex instruction-following ability, which consists of 120K training data and 1K evaluation data. Furthermore, we propose the IOPO (Input-Output Preference Optimization) alignment method which takes both input and output preference pairs into consideration, where LLMs not only rapidly align with response preferences but also meticulously explore the instruction preferences. Extensive experiments on both in-domain and out-of-domain datasets confirm the effectiveness of IOPO, showing 8.15%, 2.18% improvements on in-domain data and 6.29%, 3.13% on out-of-domain data compared to SFT and DPO respectively.
摘要:在大语言模型 (LLM) 领域,模型准确遵循指令的能力至关重要,因为越来越多的智能体和应用正在利用 LLM 进行构建,而指令的复杂性也在迅速增加。然而,一方面,现有的复杂指令评估数据有限;另一方面,目前尚无专门用于提升复杂指令遵循能力的算法。为此,本文引入了 TRACE,这是一个用于提升和评估复杂指令遵循能力的基准,包含 120K 训练数据和 1K 评估数据。此外,我们提出了 IOPO(输入-输出偏好优化)对齐方法,该方法综合考虑了输入和输出偏好对,使得 LLM 不仅能够快速对齐响应偏好,还能细致地探索指令偏好。在领域内和领域外数据集上的广泛实验验证了 IOPO 的有效性,与 SFT 和 DPO 相比,在领域内数据上分别提升了 8.15% 和 2.18%,在领域外数据上分别提升了 6.29% 和 3.13%。
[NLP-73] Exploring Knowledge Boundaries in Large Language Models for Retrieval Judgment
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在处理动态变化知识和未知静态知识时遇到的挑战。解决方案的关键在于通过训练一个知识边界模型(Knowledge Boundary Model, KBM)来区分不同类型的问题,从而减少不必要的检索请求,降低时间和计算成本,同时提升LLMs的整体性能。具体来说,KBM能够有效界定知识边界,显著减少为达到最佳端到端性能所需的检索比例,并在动态知识、长尾静态知识和多跳问题等复杂场景中验证了其有效性,同时展示了其作为外部LLM插件的功能。
链接: https://arxiv.org/abs/2411.06207
作者: Zhen Zhang,Xinyu Wang,Yong Jiang,Zhuo Chen,Feiteng Mu,Mengting Hu,Pengjun Xie,Fei Huang
关键词-EN: Large Language Models, Large Language, Language Models, practical applications, increasingly recognized
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) are increasingly recognized for their practical applications. However, these models often encounter challenges in dynamically changing knowledge, as well as in managing unknown static knowledge. Retrieval-Augmented Generation (RAG) tackles this challenge and has shown a significant impact on LLMs. Actually, we find that the impact of RAG on the question answering capabilities of LLMs can be categorized into three groups: beneficial, neutral, and harmful. By minimizing retrieval requests that yield neutral or harmful results, we can effectively reduce both time and computational costs, while also improving the overall performance of LLMs. This insight motivates us to differentiate between types of questions using certain metrics as indicators, to decrease the retrieval ratio without compromising performance. In our work, we propose a method that is able to identify different types of questions from this view by training a Knowledge Boundary Model (KBM). Experiments conducted on 11 English and Chinese datasets illustrate that the KBM effectively delineates the knowledge boundary, significantly decreasing the proportion of retrievals required for optimal end-to-end performance. Specifically, we evaluate the effectiveness of KBM in three complex scenarios: dynamic knowledge, long-tail static knowledge, and multi-hop problems, as well as its functionality as an external LLM plug-in.
摘要:大语言模型(Large Language Models, LLMs)在实际应用中的价值日益受到认可。然而,这些模型在处理动态变化的知识以及未知静态知识时常常面临挑战。检索增强生成(Retrieval-Augmented Generation, RAG)技术正是针对这一难题而提出的解决方案,并在提升大语言模型的性能方面展现出显著效果。实际上,我们发现RAG对大语言模型问答能力的影响可以分为三类:有益、中性及有害。通过减少那些产生中性或有害结果的检索请求,我们不仅能够有效降低时间和计算成本,还能提升大语言模型的整体表现。这一发现促使我们利用特定指标来区分不同类型的问题,从而在不损害性能的前提下降低检索比例。在我们的研究中,我们提出了一种通过训练知识边界模型(Knowledge Boundary Model, KBM)来识别不同类型问题的方法。在11个中英文数据集上的实验表明,KBM能够有效界定知识边界,显著减少为达到最佳端到端性能所需的检索比例。具体而言,我们在三种复杂场景中评估了KBM的有效性:动态知识、长尾静态知识及多跳问题,并测试了其作为外部大语言模型插件的功能。
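其"先判知识边界、再决定是否检索"的控制流可以概括为如下最小示意(kbm、llm、retriever 均为假设的可调用接口,阈值为超参数,非论文代码):

```python
def answer_with_kbm(question, kbm, llm, retriever, threshold=0.5):
    """kbm(question) 返回"模型自身知识足以作答"的概率估计(假设接口)。
    高于阈值时直接作答,从而省去中性或有害的检索;否则走 RAG 路径。"""
    if kbm(question) >= threshold:
        return llm(question)                # 知识边界之内:直接作答
    docs = retriever(question)              # 边界之外:仅此时才发起检索
    return llm(f"参考资料:{docs}\n问题:{question}")

# 玩具用法:用固定返回值的桩函数演示控制流
print(answer_with_kbm(
    "今天的金价是多少?",                    # 动态知识,模型没把握
    kbm=lambda q: 0.1,
    llm=lambda p: f"LLM 基于输入作答:{p[:24]}...",
    retriever=lambda q: "某财经站点的最新行情摘要"))
```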
[NLP-74] WMT24 Test Suite: Gender Resolution in Speaker-Listener Dialogue Roles
【速读】: 该论文试图解决文学风格对话场景中性别识别的难度及其受性别刻板印象影响的问题。解决方案的关键在于分析对话外的角色和说话方式的刻板印象如何显著影响对话内指称对象的性别一致性。通过构建包含角色元信息和说话方式的测试集,研究揭示了这些外部因素对性别识别的显著影响。
链接: https://arxiv.org/abs/2411.06194
作者: Hillary Dawkins,Isar Nejadgholi,Chi-kiu Lo
关键词-EN: literary-style dialogue settings, assess the difficulty, resolution in literary-style, literary-style dialogue, dialogue settings
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:We assess the difficulty of gender resolution in literary-style dialogue settings and the influence of gender stereotypes. Instances of the test suite contain spoken dialogue interleaved with external meta-context about the characters and the manner of speaking. We find that character and manner stereotypes outside of the dialogue significantly impact the gender agreement of referents within the dialogue.
摘要:我们评估了在文学风格对话场景中性别识别的难度及其受性别刻板印象的影响。测试集中的实例包含角色之间的对话,并穿插了关于角色和说话方式的外部元上下文。我们发现,对话之外的角色和说话方式的刻板印象显著影响了对话内指代对象的性别一致性。
[NLP-75] M-Longdoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework
【速读】: 该论文试图解决在处理包含文本、图表等多模态内容的长文档时,人工阅读耗时且效率低下的问题。解决方案的关键在于提出了M-LongDoc基准数据集和一个自动化框架,用于评估大型多模态模型的性能。论文进一步提出了检索感知调优方法,以实现高效且有效的多模态文档阅读。与现有工作相比,M-LongDoc包含更近期且篇幅更长的文档,并要求开放式解决方案而非仅提取答案。论文首次直接针对多模态长文档的检索场景进行训练框架设计,并通过全自动方式构建训练语料库,用于问答任务。实验结果显示,该调优方法相较于基线开源模型,在模型响应的正确性上实现了4.6%的相对提升。
链接: https://arxiv.org/abs/2411.06176
作者: Yew Ken Chia,Liying Cheng,Hou Pong Chan,Chaoqun Liu,Maojia Song,Sharifah Mahani Aljunied,Soujanya Poria,Lidong Bing
关键词-EN: practical applications, ability to understand, business and practical, documents, answer questions
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:The ability to understand and answer questions over documents can be useful in many business and practical applications. However, documents often contain lengthy and diverse multimodal contents such as texts, figures, and tables, which are very time-consuming for humans to read thoroughly. Hence, there is an urgent need to develop effective and automated methods to aid humans in this task. In this work, we introduce M-LongDoc, a benchmark of 851 samples, and an automated framework to evaluate the performance of large multimodal models. We further propose a retrieval-aware tuning approach for efficient and effective multimodal document reading. Compared to existing works, our benchmark consists of more recent and lengthy documents with hundreds of pages, while also requiring open-ended solutions and not just extractive answers. To our knowledge, our training framework is the first to directly address the retrieval setting for multimodal long documents. To enable tuning open-source models, we construct a training corpus in a fully automatic manner for the question-answering task over such documents. Experiments show that our tuning approach achieves a relative improvement of 4.6% for the correctness of model responses, compared to the baseline open-source models. Our data, code, and models are available at this https URL.
摘要:理解和回答文档中的问题在许多商业和实际应用中具有重要价值。然而,文档通常包含冗长且多样化的多模态内容,如文本、图表和表格,这些内容对于人类来说阅读起来非常耗时。因此,迫切需要开发有效且自动化的方法来辅助人类完成这一任务。在本研究中,我们引入了 M-LongDoc,一个包含 851 个样本的基准数据集,以及一个自动化框架,用于评估大型多模态模型的性能。我们进一步提出了一种检索感知调优方法,以实现高效且有效的多模态文档阅读。与现有工作相比,我们的基准数据集包含了更近期且篇幅更长的文档,这些文档通常有数百页,并且不仅需要提取式答案,还需要开放式解决方案。据我们所知,我们的训练框架是首个直接针对多模态长文档检索场景的解决方案。为了支持开源模型的调优,我们采用全自动方式构建了一个用于此类文档问答任务的训练语料库。实验结果表明,与基线开源模型相比,我们的调优方法在模型响应的正确性上实现了 4.6% 的相对提升。我们的数据、代码和模型可通过以下链接获取:https URL。
[NLP-76] Clustering Algorithms and RAG Enhancing Semi-Supervised Text Classification with Large LLMs
【速读】: 该论文试图解决在文本分类任务中,尽管数据量庞大但标注样本有限的问题。解决方案的关键在于创新性地结合了少样本学习(few-shot learning)、检索增强生成(Retrieval-Augmented Generation, RAG)和传统的统计聚类方法。通过这种集成方法,系统能够在极少量的标注实例基础上进行有效学习,并生成高质量的标注数据。这是首次将RAG与聚类技术结合用于文本数据生成,实验结果表明,仅使用少样本增强数据即可达到接近全标注数据集的性能,在复杂文本分类任务中分别实现了95.41%和82.43%的准确率。
链接: https://arxiv.org/abs/2411.06175
作者: Shan Zhong,Jiahao Zeng,Yongxin Yu,Bohong Lin
关键词-EN: innovative semi-supervised learning, semi-supervised learning approach, addressing the challenge, paper introduces, introduces an innovative
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:This paper introduces an innovative semi-supervised learning approach for text classification, addressing the challenge of abundant data but limited labeled examples. Our methodology integrates few-shot learning with retrieval-augmented generation (RAG) and conventional statistical clustering, enabling effective learning from a minimal number of labeled instances while generating high-quality labeled data. To the best of our knowledge, we are the first to incorporate RAG alongside clustering in text data generation. Our experiments on the Reuters and Web of Science datasets demonstrate state-of-the-art performance, with few-shot augmented data alone producing results nearly equivalent to those achieved with fully labeled datasets. Notably, accuracies of 95.41% and 82.43% were achieved for complex text document classification tasks, where the number of categories can exceed 100.
摘要:本文介绍了一种创新的半监督学习方法,用于解决文本分类中数据丰富但标注样本有限的问题。我们的方法结合了少样本学习与检索增强生成 (RAG) 以及传统的统计聚类技术,能够在极少标注实例的情况下实现高效学习,并生成高质量的标注数据。据我们所知,我们是首个将 RAG 与聚类技术结合应用于文本数据生成的研究。在 Reuters 和 Web of Science 数据集上的实验结果表明,我们的方法达到了最先进的性能,仅通过少样本增强数据就能获得与全标注数据集相当的结果。特别地,在类别数量可能超过 100 的复杂文本文档分类任务中,我们分别达到了 95.41% 和 82.43% 的准确率。
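该"统计聚类 + 检索式少样本提示"流程的一个简化示意如下(基于 scikit-learn;ask_llm 为假设的大模型调用接口,具体提示词与论文无关):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def pseudo_label(unlabeled, labeled_texts, labeled_y, ask_llm=None, k=2):
    vec = TfidfVectorizer().fit(unlabeled + labeled_texts)
    U, L = vec.transform(unlabeled), vec.transform(labeled_texts)
    clusters = KMeans(n_clusters=k, n_init=10).fit_predict(U)  # 统计聚类
    labels = []
    for i, text in enumerate(unlabeled):
        sims = cosine_similarity(U[i], L).ravel()  # 检索最相似的已标注样本
        shots = [(labeled_texts[j], labeled_y[j])
                 for j in sims.argsort()[::-1][:3]]
        # 将检索到的少样本示例连同待标注文本交给 LLM 生成伪标签;
        # 没有 LLM 时退化为最近邻投票,便于离线演示
        labels.append(ask_llm(text, shots) if ask_llm else labeled_y[sims.argmax()])
    return clusters, labels
```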
[NLP-77] SEEKR: Selective Attention-Guided Knowledge Retention for Continual Learning of Large Language Models EMNLP2024
【速读】: 该论文试图解决在持续学习(Continual Learning, CL)过程中,大型语言模型(Large Language Models, LLMs)面临的灾难性遗忘问题。解决方案的关键在于提出了一种基于选择性注意力引导的知识保留方法(SElective attEntion-guided Knowledge Retention, SEEKR)。SEEKR通过在选定的注意力头(attention heads)上进行注意力蒸馏(attention distillation),以实现更细粒度的知识保留。具体来说,SEEKR利用遗忘性(forgettability)和任务敏感性(task-sensitivity)两种度量方法,识别出最有价值的注意力头,从而在数据回放(data replay)过程中,显著减少所需的回放数据量,同时保持甚至提升模型性能。实验结果表明,SEEKR在性能和效率上均优于现有方法,能够在使用其他方法所需回放数据的1/10甚至1%的情况下,达到相当的或更好的效果。
链接: https://arxiv.org/abs/2411.06171
作者: Jinghan He,Haiyun Guo,Kuan Zhu,Zihan Zhao,Ming Tang,Jinqiao Wang
关键词-EN: evolving real-world demands, real-world demands, dynamically adapt, evolving real-world, knowledge retention
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: EMNLP2024
点击查看摘要
Abstract:Continual learning (CL) is crucial for language models to dynamically adapt to the evolving real-world demands. To mitigate the catastrophic forgetting problem in CL, data replay has been proven a simple and effective strategy, and the subsequent data-replay-based distillation can further enhance the performance. However, existing methods fail to fully exploit the knowledge embedded in models from previous tasks, resulting in the need for a relatively large number of replay samples to achieve good results. In this work, we first explore and emphasize the importance of attention weights in knowledge retention, and then propose a SElective attEntion-guided Knowledge Retention method (SEEKR) for data-efficient replay-based continual learning of large language models (LLMs). Specifically, SEEKR performs attention distillation on the selected attention heads for finer-grained knowledge retention, where the proposed forgettability-based and task-sensitivity-based measures are used to identify the most valuable attention heads. Experimental results on two continual learning benchmarks for LLMs demonstrate the superiority of SEEKR over the existing methods on both performance and efficiency. Explicitly, SEEKR achieves comparable or even better performance with only 1/10 of the replayed data used by other methods, and reduces the proportion of replayed data to 1%.
摘要:持续学习 (Continual Learning, CL) 对于语言模型动态适应不断变化的现实世界需求至关重要。为了缓解 CL 中的灾难性遗忘问题,数据回放已被证明是一种简单且有效的策略,而基于数据回放的蒸馏可以进一步增强性能。然而,现有方法未能充分利用模型在先前任务中嵌入的知识,导致需要相对大量的回放样本才能取得良好效果。在本研究中,我们首先探讨并强调了注意力权重在知识保留中的重要性,然后提出了一种选择性注意力引导的知识保留方法 (SElective attEntion-guided Knowledge Retention, SEEKR),用于大语言模型 (Large Language Models, LLMs) 的高效数据回放持续学习。具体而言,SEEKR 对选定的注意力头进行注意力蒸馏,以实现更细粒度的知识保留,其中提出的基于遗忘性和任务敏感性的度量用于识别最有价值的注意力头。在两个针对 LLMs 的持续学习基准测试中的实验结果表明,SEEKR 在性能和效率方面均优于现有方法。具体来说,SEEKR 在使用其他方法所需回放数据的 1/10 的情况下,实现了可比甚至更好的性能,并将回放数据的比例降低至 1%。
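其核心的"在选定注意力头上做注意力蒸馏"可示意如下(PyTorch;张量形状与头的选择结果均为假设,遗忘性/任务敏感性度量的计算从略):

```python
import torch
import torch.nn.functional as F

def attention_distill_loss(student_attn, teacher_attn, head_ids, eps=1e-9):
    """student_attn / teacher_attn: [batch, n_heads, q_len, k_len] 的注意力分布;
    head_ids:按论文的遗忘性与任务敏感性度量选出的"最有价值"注意力头索引。"""
    s = student_attn[:, head_ids]          # 只在选定的头上做细粒度知识保留
    t = teacher_attn[:, head_ids]
    return F.kl_div((s + eps).log(), t, reduction="batchmean")

s = torch.softmax(torch.randn(2, 12, 8, 8), dim=-1)   # 玩具注意力图
t = torch.softmax(torch.randn(2, 12, 8, 8), dim=-1)
print(attention_distill_loss(s, t, head_ids=[0, 3, 7]))
```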
[NLP-78] Expansion Quantization Network: An Efficient Micro-emotion Annotation and Detection Framework
【速读】: 该论文试图解决现有情感检测数据集依赖人工标注所带来的高成本、主观性强和标签不平衡问题,特别是微情感标注不足和情感强度表示缺失的问题。解决方案的关键在于提出了全标签和训练集标签回归方法,通过将标签值映射到能量强度级别,充分利用机器模型的学习能力和标签间的相互依赖关系,以揭示样本中的多种情感。这一方法最终形成了情感量化网络(Emotion Quantization Network, EQN)框架,用于微情感检测和标注。实验结果表明,EQN框架在多个情感数据集上的表现优于传统模型,实现了微情感的自动检测和标注,并首次引入了能量级别评分,为情感检测分析和情感计算的量化研究提供了有力支持。
链接: https://arxiv.org/abs/2411.06160
作者: Jingyi Zhou,Senlin Luo,Haofan Chen
关键词-EN: Text emotion detection, advancing artificial intelligence, Text emotion, emotion detection constitutes, emotion detection
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Text emotion detection constitutes a crucial foundation for advancing artificial intelligence from basic comprehension to the exploration of emotional reasoning. Most existing emotion detection datasets rely on manual annotations, which are associated with high costs, substantial subjectivity, and severe label imbalances. This is particularly evident in the inadequate annotation of micro-emotions and the absence of emotional intensity representation, which fail to capture the rich emotions embedded in sentences and adversely affect the quality of downstream task completion. By proposing an all-labels and training-set label regression method, we map label values to energy intensity levels, thereby fully leveraging the learning capabilities of machine models and the interdependencies among labels to uncover multiple emotions within samples. This led to the establishment of the Emotion Quantization Network (EQN) framework for micro-emotion detection and annotation. Using five commonly employed sentiment datasets, we conducted comparative experiments with various models, validating the broad applicability of our framework within NLP machine learning models. Based on the EQN framework, emotion detection and annotation are conducted on the GoEmotions dataset. A comprehensive comparison with the results from Google literature demonstrates that the EQN framework possesses a high capability for automatic detection and annotation of micro-emotions. The EQN framework is the first to achieve automatic micro-emotion annotation with energy-level scores, providing strong support for further emotion detection analysis and the quantitative research of emotion computing.
摘要:文本情感检测构成了推动人工智能从基础理解到情感推理探索的关键基础。大多数现有的情感检测数据集依赖于人工标注,这伴随着高成本、显著的主观性以及严重的标签不平衡问题。尤其在微情感的不足标注和情感强度表示的缺失方面,未能捕捉句子中蕴含的丰富情感,并严重影响下游任务完成的质量。通过提出全标签和训练集标签回归方法,我们将标签值映射到能量强度级别,从而充分利用机器模型的学习能力和标签间的相互依赖性,揭示样本中的多种情感。这导致了微情感检测和标注的情感量化网络(Emotion Quantization Network, EQN)框架的建立。利用五种常用的情感数据集,我们与多种模型进行了对比实验,验证了我们的框架在自然语言处理(NLP)机器学习模型中的广泛适用性。基于EQN框架,我们在GoEmotions数据集上进行了情感检测和标注。与Google文献结果的全面比较表明,EQN框架具有高能力的微情感自动检测和标注功能。EQN框架是首个实现能量级别评分自动微情感标注的框架,为情感检测分析和情感计算的定量研究提供了强有力的支持。
[NLP-79] From References to Insights: Collaborative Knowledge Minigraph Agents for Automating Scholarly Literature Review
【速读】: 该论文试图解决学术文献综述过程中耗时且复杂的问题,提出了一种名为协作知识微图代理 (Collaborative Knowledge Minigraph Agents, CKMAs) 的新框架。解决方案的关键在于设计了一种基于提示的算法,即知识微图构建代理 (Knowledge Minigraph Construction Agent, KMCA),用于自动识别学术文献中的信息片段之间的关系,并构建知识微图。随后,通过利用大型语言模型 (Large Language Models) 的能力,多路径摘要代理 (Multiple Path Summarization Agent, MPSA) 能够从不同视角高效组织信息片段及其关系,生成文献综述段落。实验结果表明,该方法能够生成信息丰富、完整、一致且有洞察力的总结,推动了大型语言模型在专业领域的应用。
链接: https://arxiv.org/abs/2411.06159
作者: Zhi Zhang,Yan Liu,Sheng-hua Zhong,Gong Chen,Yu Yang,Jiannong Cao
关键词-EN: guiding future studies, Literature reviews play, identifying gaps, specific topics, play a crucial
类目: Computation and Language (cs.CL); Computational Engineering, Finance, and Science (cs.CE)
备注:
点击查看摘要
Abstract:Literature reviews play a crucial role in scientific research for understanding the current state of research, identifying gaps, and guiding future studies on specific topics. However, the process of conducting a comprehensive literature review remains time-consuming. This paper proposes a novel framework, collaborative knowledge minigraph agents (CKMAs), to automate scholarly literature reviews. A novel prompt-based algorithm, the knowledge minigraph construction agent (KMCA), is designed to identify relationships between information pieces from academic literature and automatically construct knowledge minigraphs. By leveraging the capabilities of large language models on constructed knowledge minigraphs, the multiple path summarization agent (MPSA) efficiently organizes information pieces and relationships from different viewpoints to generate literature review paragraphs. We evaluate CKMAs on three benchmark datasets. Experimental results demonstrate that the proposed techniques generate informative, complete, consistent, and insightful summaries for different research problems, promoting the use of LLMs in more professional fields.
摘要:文献综述在科学研究中扮演着至关重要的角色,它有助于理解当前研究的现状、识别研究空白,并为特定主题的未来研究提供指导。然而,进行全面的文献综述过程仍然耗时费力。本文提出了一种新颖的框架——协作知识微图智能体 (CKMAs),用于自动化学术文献综述。设计了一种基于提示的算法——知识微图构建智能体 (KMCA),用于识别学术文献中信息片段之间的关系,并自动构建知识微图。通过利用大语言模型在构建的知识微图上的能力,多路径摘要智能体 (MPSA) 能够从不同视角高效组织信息片段及其关系,生成文献综述段落。我们在三个基准数据集上评估了 CKMAs。实验结果表明,所提出的技术能够为不同的研究问题生成信息丰富、完整、一致且富有洞察力的总结,推动了大语言模型在更专业领域的应用。
[NLP-80] Building an Efficient Multilingual Non-Profit IR System for the Islamic Domain Leveraging Multiprocessing Design in Rust
【速读】: 该论文试图解决在宗教和文化遗产领域,特别是伊斯兰文学领域,缺乏利用先进人工智能工具进行信息检索(Information Retrieval, IR)的统一资源的问题。解决方案的关键在于开发一个多语言的非营利性IR系统,针对这一领域特有的挑战,如多语言领域特定语料库的准备(尤其是在数据有限的情况下)、在资源受限设备上的模型部署以及在有限预算下实现快速搜索。通过采用继续预训练(continued pre-training)进行领域适应和语言缩减(language reduction)以减小模型尺寸的方法,论文成功构建了一个轻量级的多语言检索模型,其性能优于在通用领域数据上预训练的大型模型。此外,论文提出的架构利用Rust语言的能力,展示了在低资源环境下实现高效语义搜索的可能性。
链接: https://arxiv.org/abs/2411.06151
作者: Vera Pavlova,Mohammed Makhlouf
关键词-EN: Natural Language Processing, including Information Retrieval, including Information, applications of Natural, Information Retrieval
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:The widespread use of large language models (LLMs) has dramatically improved many applications of Natural Language Processing (NLP), including Information Retrieval (IR). However, domains that are not driven by commercial interest often lag behind in benefiting from AI-powered solutions. One such area is religious and heritage corpora. Alongside similar domains, Islamic literature holds significant cultural value and is regularly utilized by scholars and the general public. Navigating this extensive amount of text is challenging, and there is currently no unified resource that allows for easy searching of this data using advanced AI tools. This work focuses on the development of a multilingual non-profit IR system for the Islamic domain. This process brings a few major challenges, such as preparing multilingual domain-specific corpora when data is limited in certain languages, deploying a model on resource-constrained devices, and enabling fast search on a limited budget. By employing methods like continued pre-training for domain adaptation and language reduction to decrease model size, a lightweight multilingual retrieval model was prepared, demonstrating superior performance compared to larger models pre-trained on general domain data. Furthermore, evaluating the proposed architecture that utilizes Rust Language capabilities shows the possibility of implementing efficient semantic search in a low-resource setting.
摘要:大语言模型 (LLM) 的广泛应用显著提升了自然语言处理 (NLP) 在诸多领域的应用效果,包括信息检索 (IR)。然而,那些不受商业利益驱动的领域往往在受益于 AI 驱动的解决方案方面进展缓慢。宗教与文化遗产语料库便是这样一个领域。与类似领域一样,伊斯兰文学具有重要的文化价值,并经常被学者和公众所利用。然而,处理如此庞大的文本量极具挑战性,目前尚无统一的资源能够利用先进的 AI 工具轻松搜索这些数据。本研究专注于开发一个面向伊斯兰领域的多语言非营利性信息检索系统。这一过程中面临若干主要挑战,如在某些语言数据有限的情况下准备多语言领域专用语料库、在资源受限的设备上部署模型,以及在预算有限的情况下实现快速搜索。通过采用如领域适应的持续预训练和语言缩减以减小模型尺寸的方法,我们准备了一个轻量级的多语言检索模型,其性能优于在通用领域数据上预训练的大型模型。此外,评估利用 Rust 语言能力的提议架构表明,在资源匮乏的环境中实现高效的语义搜索是可能的。
[NLP-81] StopHC: A Harmful Content Detection and Mitigation Architecture for Social Media Platforms
【速读】: 该论文试图解决社交媒体平台上有害、仇恨和冒犯性内容对用户心理健康构成威胁的问题。解决方案的关键在于提出了一个名为 StopHC 的有害内容检测与缓解架构。该架构包含两个核心模块:一是采用深度神经网络架构进行有害内容检测,二是利用网络免疫算法来封锁有毒节点并阻止有害内容的传播。通过在两个真实世界数据集上的实验,验证了该解决方案的有效性。
链接: https://arxiv.org/abs/2411.06138
作者: Ciprian-Octavian Truică,Ana-Teodora Constantinescu,Elena-Simona Apostol
关键词-EN: social media users, harmful content detection, harmful content, mental health, users has started
类目: ocial and Information Networks (cs.SI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:The mental health of social media users is increasingly put at risk by harmful, hateful, and offensive content. In this paper, we propose StopHC, a harmful content detection and mitigation architecture for social media platforms. Our aim with StopHC is to create more secure online environments. Our solution contains two modules, one that employs a deep neural network architecture for harmful content detection, and one that uses a network immunization algorithm to block toxic nodes and stop the spread of harmful content. The efficacy of our solution is demonstrated by experiments conducted on two real-world datasets.
摘要:社交媒体用户的心理健康正越来越多地受到有害、仇恨和冒犯性内容的威胁。本文提出了 StopHC,这是一种针对社交媒体平台的有害内容检测与缓解架构。我们的目标是通过 StopHC 创建更安全的在线环境。我们的解决方案包含两个模块:一个采用深度神经网络架构进行有害内容检测,另一个使用网络免疫算法来封锁有毒节点并阻止有害内容的传播。通过在两个真实世界数据集上进行的实验,我们展示了该解决方案的有效性。
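其中"网络免疫"一步的直观做法可示意如下(基于 networkx;按毒性分数与度中心性联合排序并封锁 top-k 节点,仅为思路演示,未必与论文算法一致):

```python
import networkx as nx

def immunize(G, toxicity, k=2):
    """toxicity:节点 -> 检测模块输出的毒性分数(假设已由第一个模块得到)。
    封锁"毒性 × 传播能力"最高的 k 个节点,以阻断有害内容扩散。"""
    centrality = nx.degree_centrality(G)
    ranked = sorted(G.nodes,
                    key=lambda v: toxicity.get(v, 0.0) * centrality[v],
                    reverse=True)
    blocked = ranked[:k]
    G.remove_nodes_from(blocked)
    return blocked

G = nx.karate_club_graph()
tox = {n: (0.9 if n in (0, 33) else 0.1) for n in G.nodes}  # 玩具毒性分数
print(immunize(G, tox))  # 预期封锁两个高毒性的枢纽节点:[0, 33]
```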
[NLP-82] Detecting Reference Errors in Scientific Literature with Large Language Models
【速读】: 该论文试图解决科学论文中常见的引用和引述错误问题,这些问题难以检测且耗时,对科学出版构成重大挑战。解决方案的关键在于评估OpenAI的GPT系列大型语言模型在检测引述错误方面的能力。研究通过准备一个专家标注的、来自期刊文章的陈述-引用对数据集,并在不同设置下(包括不同程度的引用信息增强)评估这些模型。结果表明,大型语言模型能够在有限上下文且无需微调的情况下检测出错误的引用。这一研究为利用人工智能辅助科学论文的撰写、评审和出版提供了新的视角,并讨论了进一步改进的可能性。
链接: https://arxiv.org/abs/2411.06101
作者: Tianmai M. Zhang,Neil F. Abernethy
关键词-EN: large language models, Reference errors, detect quotation errors, quotation errors, errors
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Reference errors, such as citation and quotation errors, are common in scientific papers. Such errors can result in the propagation of inaccurate information, but are difficult and time-consuming to detect, posing a significant challenge to scientific publishing. To support automatic detection of reference errors, this work evaluated the ability of large language models in OpenAI’s GPT family to detect quotation errors. Specifically, we prepared an expert-annotated, general-domain dataset of statement-reference pairs from journal articles. Large language models were evaluated in different settings with varying amounts of reference information provided by retrieval augmentation. Our results showed that large language models are able to detect erroneous citations with limited context and without fine-tuning. This study contributes to the growing literature that seeks to utilize artificial intelligence to assist in the writing, reviewing, and publishing of scientific papers. Potential avenues for further improvements in this task are also discussed.
摘要:参考错误,如引用和引述错误,在科学论文中很常见。这些错误可能导致不准确信息的传播,但检测这些错误既困难又耗时,对科学出版构成重大挑战。为了支持参考错误的自动检测,本研究评估了OpenAI的GPT系列大语言模型在检测引述错误方面的能力。具体而言,我们准备了一个由专家标注的、来自期刊文章的陈述-参考对通用领域数据集。在不同设置下,通过检索增强提供不同数量的参考信息,对大语言模型进行了评估。结果显示,大语言模型能够在有限上下文且无需微调的情况下检测出错误的引用。本研究为利用人工智能辅助科学论文的撰写、审阅和出版的日益增长的文献做出了贡献,并讨论了进一步改进此任务的潜在途径。
[NLP-83] ZhoBLiMP: a Systematic Assessment of Language Models with Linguistic Minimal Pairs in Chinese
【速读】: 该论文试图解决的问题是评估语言模型(LMs)在非英语语言(特别是中文)中对自然语言语法的掌握程度,并探讨其学习模式。解决方案的关键在于引入了ZhoBLiMP,这是一个迄今为止最全面的中文语言学最小对(linguistic minimal pairs)基准,涵盖了15种语言现象和118个范例。通过训练和评估20个不同规模(从14M到1.4B参数)的语言模型,以及14个现成的LLMs,研究团队发现,约500M参数的模型在经过1B token的训练后,能够较好地掌握中文语法,进一步扩展模型规模带来的收益有限。此外,研究还观察到语言模型在某些语言现象上呈现出U型学习模式,类似于儿童语言习得的过程。
链接: https://arxiv.org/abs/2411.06096
作者: Yikang Liu,Yeting Shen,Hongao Zhu,Lilong Xu,Zhiheng Qian,Siyuan Song,Kejia Zhang,Jialong Tang,Pei Zhang,Baosong Yang,Rui Wang,Hai Hu
关键词-EN: syntax of natural, widely evaluated, Chinese, natural languages, Chinese grammar
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Whether and how language models (LMs) acquire the syntax of natural languages has been widely evaluated under the minimal pair paradigm. However, a lack of wide-coverage benchmarks in languages other than English has constrained systematic investigations into the issue. Addressing it, we first introduce ZhoBLiMP, the most comprehensive benchmark of linguistic minimal pairs for Chinese to date, with 118 paradigms, covering 15 linguistic phenomena. We then train 20 LMs of different sizes (14M to 1.4B) on Chinese corpora of various volumes (100M to 3B tokens) and evaluate them along with 14 off-the-shelf LLMs on ZhoBLiMP. The overall results indicate that Chinese grammar can be mostly learned by models with around 500M parameters, trained on 1B tokens with one epoch, showing limited benefits for further scaling. Most (N=95) linguistic paradigms are of easy or medium difficulty for LMs, while there are still 13 paradigms that remain challenging even for models with up to 32B parameters. In regard to how LMs acquire Chinese grammar, we observe a U-shaped learning pattern in several phenomena, similar to those observed in child language acquisition.
摘要:语言模型 (LMs) 是否以及如何习得自然语言的语法,已在最小对范式下得到了广泛评估。然而,除英语外的其他语言缺乏广泛覆盖的基准,限制了对该问题的系统研究。为此,我们首先引入了 ZhoBLiMP,这是迄今为止最全面的中文语言学最小对基准,包含 118 个范式,涵盖 15 种语言现象。随后,我们训练了 20 个不同规模的语言模型(从 14M 到 1.4B 参数),使用不同规模的中文语料库(从 100M 到 3B Token),并将其与 14 个现成的大语言模型一起在 ZhoBLiMP 上进行评估。总体结果表明,约 500M 参数的模型在经过 1B Token 的一次训练后,基本能掌握大部分中文语法,进一步扩大规模带来的收益有限。大多数(N=95)语言学范式对语言模型来说难度为简单或中等,但仍有 13 个范式对高达 32B 参数的模型来说依然具有挑战性。关于语言模型如何习得中文语法,我们观察到在某些现象中存在 U 形学习模式,类似于儿童语言习得中的观察结果。
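最小对评测的判分方式本身很直接:比较语言模型赋予合法句与非法句的序列对数概率。下面是基于 transformers 的示意(模型名与例句均为占位,实际评测应使用中文语言模型与 ZhoBLiMP 数据):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # 占位模型,仅演示接口
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def sent_logprob(sentence):
    ids = tok(sentence, return_tensors="pt").input_ids
    loss = lm(ids, labels=ids).loss                    # 平均下一词负对数似然
    return -loss.item() * (ids.size(1) - 1)            # 还原为句子总对数概率

good, bad = "The keys are on the table.", "The keys is on the table."
print(sent_logprob(good) > sent_logprob(bad))          # 期望:模型偏好合法句
```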
[NLP-84] Optimizing Large Language Models through Quantization: A Comparative Analysis of PTQ and QAT Techniques
【速读】: 该论文试图解决大型语言模型(LLMs)在资源受限环境下的优化问题,特别是通过量化技术来减少模型大小和计算成本。解决方案的关键在于综合应用后训练量化(Post-Training Quantization, PTQ)和量化感知训练(Quantization-Aware Training, QAT),并引入一种新的缩放因子γ来平衡量化后的性能损失。论文通过实验验证了INT8和INT4量化在模型大小、计算成本和功耗方面的显著优化效果,同时提出了一种基于层敏感性和权重方差的混合精度量化理论框架,以实现最优的比特分配策略。这些方法在边缘设备上的硬件效率评估中表现出色,显著提升了吞吐量并降低了功耗。
链接: https://arxiv.org/abs/2411.06084
作者: Jahid Hasan
关键词-EN: optimizing Large Language, Large Language Models, Large Language, Quantization-Aware Training, optimizing Large
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 9 pages, 2 figures
点击查看摘要
Abstract:This paper presents a comprehensive analysis of quantization techniques for optimizing Large Language Models (LLMs), specifically focusing on Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT). Through empirical evaluation across models ranging from 10M to 1B parameters, we demonstrate that quantization can achieve up to 68% reduction in model size while maintaining performance within 6% of full-precision baselines when utilizing our proposed scaling factor γ. Our experiments show that INT8 quantization delivers a 40% reduction in computational cost and power consumption, while INT4 quantization further improves these metrics by 60%. We introduce a novel theoretical framework for mixed-precision quantization, deriving optimal bit allocation strategies based on layer sensitivity and weight variance. Hardware efficiency evaluations on edge devices reveal that our quantization approach enables up to 2.4x throughput improvement for INT8 and 3x for INT4, with 60% power reduction compared to full-precision models.
摘要:本文对优化大语言模型(Large Language Models, LLMs)的量化技术进行了全面分析,特别关注了训练后量化(Post-Training Quantization, PTQ)和量化感知训练(Quantization-Aware Training, QAT)。通过在从10M到1B参数的模型上的实证评估,我们证明了量化技术可以在保持性能在全精度基线的6%以内的情况下,实现模型大小高达68%的减少,这得益于我们提出的缩放因子γ。我们的实验表明,INT8量化可以减少40%的计算成本和功耗,而INT4量化则在此基础上进一步提升了60%。我们引入了一种新的混合精度量化理论框架,基于层敏感度和权重方差推导出最佳比特分配策略。在边缘设备上的硬件效率评估显示,我们的量化方法使得INT8的吞吐量提高了2.4倍,INT4提高了3倍,同时与全精度模型相比,功耗减少了60%。
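作为参照,训练后量化中最基础的对称 INT8 量化可示意如下(numpy;缩放因子 γ 此处仅作为可调超参数处理,其具体设定方式属论文内容,这里是假设):

```python
import numpy as np

def quantize_int8(w, gamma=1.0):
    """对称 PTQ:scale 由张量最大绝对值与缩放因子 γ 共同决定。"""
    scale = gamma * np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize_int8(w)
err = np.abs(w - dequantize(q, scale)).max()
print(f"INT8 最大重构误差:{err:.4f}")  # 存储开销约为 FP32 的 1/4
```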
[NLP-85] Zyda-2: a 5 Trillion Token High-Quality Dataset
【速读】: 该论文旨在解决大规模语言模型预训练所需的高质量数据集问题。解决方案的关键在于构建了Zyda-2数据集,这是一个包含五万亿标记的庞大数据集,通过整合高质量的开源标记(如FineWeb和DCLM),并通过交叉去重和基于模型的质量过滤技术,提炼出最高质量的子集。这一数据集的构建为训练Zamba2系列模型提供了基础,使其在其权重类别中达到最先进的水平。
链接: https://arxiv.org/abs/2411.06068
作者: Yury Tokpanov,Paolo Glorioso,Quentin Anthony,Beren Millidge
关键词-EN: language model pretraining, trillion token dataset, technical report, dataset for language, model pretraining
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: initial upload 11/08/24
点击查看摘要
Abstract:In this technical report, we present Zyda-2: a five trillion token dataset for language model pretraining. Zyda-2 was used to train our Zamba2 series of models which are state-of-the-art for their weight class. We build Zyda-2 by collating high-quality open-source tokens such as FineWeb and DCLM, then distilling them to the highest-quality subset via cross-deduplication and model-based quality filtering. Zyda-2 is released under a permissive open license, and is available at this https URL
摘要:在本技术报告中,我们介绍了 Zyda-2:一个用于大语言模型预训练的五万亿 Token 数据集。Zyda-2 被用于训练我们的 Zamba2 系列模型,这些模型在其权重类别中处于最先进水平。我们通过收集高质量的开源 Token,如 FineWeb 和 DCLM,然后通过交叉去重和基于模型的质量过滤,提炼出最高质量的子集,构建了 Zyda-2。Zyda-2 以宽松的开源许可证发布,并可通过此 https URL 获取。
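其中"交叉去重"可以用 n-gram 指纹加 Jaccard 相似度来粗略示意(相当于 MinHash 的朴素替身;阈值与 n 的取值为假设,真实流水线在万亿 token 规模上会采用更高效的近似算法):

```python
import hashlib

def fingerprints(text, n=13):
    """对文档的所有 13-gram 取哈希,得到指纹集合。"""
    words = text.split()
    return {hashlib.md5(" ".join(words[i:i + n]).encode()).hexdigest()
            for i in range(max(1, len(words) - n + 1))}

def cross_dedup(kept_corpus, new_corpus, threshold=0.8):
    """与任一已保留文档的 Jaccard 相似度超过阈值者视为重复,予以剔除。"""
    kept_fps = [fingerprints(d) for d in kept_corpus]
    survivors = []
    for doc in new_corpus:
        fp = fingerprints(doc)
        dup = any(len(fp & old) / max(1, len(fp | old)) > threshold
                  for old in kept_fps)
        if not dup:
            survivors.append(doc)
    return survivors
```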
[NLP-86] Sufficient Context: A New Lens on Retrieval Augmented Generation Systems
【速读】: 该论文试图解决的问题是确定在检索增强生成 (RAG) 系统中,错误产生的原因是由于大语言模型 (LLMs) 未能有效利用检索到的上下文,还是检索到的上下文本身不足以回答查询。解决方案的关键在于提出了“充分上下文”的新概念,并开发了一种分类方法,用于判断实例是否具有足够的信息来回答查询。通过基于上下文充分性的错误分层分析,研究者发现专有 LLMs(如 Gemini、GPT、Claude)在上下文充分时表现优异,但在上下文不足时往往输出错误答案而非选择放弃。相反,开源 LLMs(如 Llama、Mistral、Gemma)即使在上下文充分时也经常产生幻觉或选择放弃。基于这些发现,研究者探索了减少 RAG 系统中幻觉的方法,包括一种新的选择性生成方法,该方法利用充分上下文信息进行引导性放弃,从而提高了模型在响应时的正确答案比例,对于 Gemini、GPT 和 Gemma 分别提升了 2-10%。
链接: https://arxiv.org/abs/2411.06037
作者: Hailey Joren,Jianyi Zhang,Chun-Sung Ferng,Da-Cheng Juan,Ankur Taly,Cyrus Rashtchian
关键词-EN: context, Retrieval Augmented Generation, Augmenting LLMs, sufficient context, leads to improved
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Augmenting LLMs with context leads to improved performance across many applications. Despite much research on Retrieval Augmented Generation (RAG) systems, an open question is whether errors arise because LLMs fail to utilize the context from retrieval or because the context itself is insufficient to answer the query. To shed light on this, we develop a new notion of sufficient context, along with a way to classify instances that have enough information to answer the query. We then use sufficient context to analyze several models and datasets. By stratifying errors based on context sufficiency, we find that proprietary LLMs (Gemini, GPT, Claude) excel at answering queries when the context is sufficient, but often output incorrect answers instead of abstaining when the context is not. On the other hand, open-source LLMs (Llama, Mistral, Gemma) hallucinate or abstain often, even with sufficient context. We further categorize cases where the context is useful and improves accuracy, even though it does not fully answer the query and the model errs without it. Building on our findings, we explore ways to reduce hallucinations in RAG systems, including a new selective generation method that leverages sufficient context information for guided abstention. Our method improves the fraction of correct answers among times where the model responds by 2-10% for Gemini, GPT, and Gemma.
摘要:通过将上下文信息融入大语言模型 (LLM),可以显著提升其在多种应用中的表现。尽管关于检索增强生成 (Retrieval Augmented Generation, RAG) 系统的研究已有很多,但一个尚未解决的问题是:错误的出现是由于 LLM 未能有效利用检索到的上下文,还是上下文本身不足以回答查询。为了深入探讨这一问题,我们提出了一种新的“充分上下文”概念,并设计了一种方法来分类哪些实例拥有足够的信息来回答查询。随后,我们利用这一概念分析了多个模型和数据集。通过基于上下文充分性的错误分层分析,我们发现,专有 LLM(如 Gemini、GPT、Claude)在上下文充分时能够出色地回答查询,但在上下文不足时,它们往往输出错误答案而非选择放弃。相比之下,开源 LLM(如 Llama、Mistral、Gemma)即使在上下文充分的情况下,也经常出现幻觉或选择放弃。此外,我们还进一步分类了上下文虽有用但不足以完全回答查询的情况,以及上下文缺失导致模型出错的情况。基于这些发现,我们探索了减少 RAG 系统中幻觉的方法,包括一种新的选择性生成方法,该方法利用充分上下文信息来指导模型在必要时选择放弃。我们的方法使得 Gemini、GPT 和 Gemma 在模型响应时的正确答案比例提高了 2-10%。
[NLP-87] LLM -GLOBE: A Benchmark Evaluating the Cultural Values Embedded in LLM Output
【速读】: 该论文试图解决的问题是如何评估大型语言模型(LLMs)的文化价值系统,并探讨这些模型在不同文化背景下的表现差异。解决方案的关键在于提出了一个名为LLM-GLOBE的基准,该基准基于文化心理学理论和经验验证的GLOBE框架,用于系统地评估LLMs的文化价值。论文还引入了一种新颖的“LLMs-as-a-Jury”方法,通过自动化评估开放式内容,实现对文化价值的大规模概念层面分析。这种方法使得能够比较中国和美国LLMs的文化价值,揭示了东西方文化价值系统的异同,并指出了开放生成任务在评估文化价值方面的潜力。
链接: https://arxiv.org/abs/2411.06032
作者: Elise Karinshak,Amanda Hu,Kewen Kong,Vishwanatha Rao,Jingren Wang,Jindong Wang,Yi Zeng
关键词-EN: biased generative content, human intention, early stages, Immense effort, dedicated to minimizing
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Immense effort has been dedicated to minimizing the presence of harmful or biased generative content and better aligning AI output to human intention; however, research investigating the cultural values of LLMs is still in very early stages. Cultural values underpin how societies operate, providing profound insights into the norms, priorities, and decision making of their members. In recognition of this need for further research, we draw upon cultural psychology theory and the empirically-validated GLOBE framework to propose the LLM-GLOBE benchmark for evaluating the cultural value systems of LLMs, and we then leverage the benchmark to compare the values of Chinese and US LLMs. Our methodology includes a novel “LLMs-as-a-Jury” pipeline which automates the evaluation of open-ended content to enable large-scale analysis at a conceptual level. Results clarify similarities and differences that exist between Eastern and Western cultural value systems and suggest that open-generation tasks represent a more promising direction for evaluation of cultural values. We interpret the implications of this research for subsequent model development, evaluation, and deployment efforts as they relate to LLMs, AI cultural alignment more broadly, and the influence of AI cultural value systems on human-AI collaboration outcomes.
摘要:在减少有害或偏见生成内容的存在以及更好地使 AI 输出与人类意图对齐方面,已经投入了巨大的努力;然而,研究大语言模型(LLM)的文化价值观仍处于非常早期的阶段。文化价值观支撑着社会的运作,为理解其成员的规范、优先事项和决策提供了深刻的见解。鉴于这一进一步研究的需求,我们借鉴了文化心理学理论和经验证实的 GLOBE 框架,提出了 LLM-GLOBE 基准,用于评估大语言模型的文化价值体系,并利用该基准比较了中国和美国大语言模型的价值观。我们的方法包括一种新颖的“LLM 作为陪审团”流程,该流程自动化了开放式内容的评估,以实现概念层面的大规模分析。结果阐明了东西方文化价值体系之间的相似性和差异,并表明开放生成任务代表了评估文化价值的更有前景的方向。我们解释了这项研究对后续模型开发、评估和部署工作的意义,这些工作与大语言模型、更广泛的 AI 文化对齐以及 AI 文化价值体系对人与 AI 协作结果的影响有关。
[NLP-88] Improved intent classification based on context information using a windows-based approach
【速读】: 该论文试图解决在对话系统中自然语言理解模块的意图分类任务中,仅依赖当前话语而忽略上下文信息的问题。解决方案的关键在于提出了几种利用上下文信息的方法,通过将对话历史与当前话语进行拼接,结合卷积神经网络(CNN)和BERT的有效向量表示,实现了基于窗口的意图分类。实验结果表明,利用用户和系统之前的对话信息作为上下文,显著提升了意图分类的准确性。
链接: https://arxiv.org/abs/2411.06022
作者: Jeanfranco D. Farfan-Escobedo,Julio C. Dos Reis
关键词-EN: Natural Language Understanding, Language Understanding, Natural Language, Conversational systems, intent classification
类目: Computation and Language (cs.CL)
备注: In preparation for Journal Submission
点击查看摘要
Abstract:Conversational systems have a Natural Language Understanding (NLU) module. In this module, there is a task known as intent classification that aims at identifying what a user is attempting to achieve from an utterance. Previous works use only the current utterance to predict the intent of a given query and do not consider the role of the context (one or a few previous utterances) in the dialog flow for this task. In this work, we propose several approaches to investigate the role of contextual information for the intent classification task. Each approach is used to carry out a concatenation between the dialogue history and the current utterance. Our intent classification method is based on a convolutional neural network that obtains effective vector representations from BERT to perform accurate intent classification using a window-based approach. Our experiments were carried out on a real-world Brazilian Portuguese corpus with dialog flows provided by the Wavy global company. Our results achieved substantial improvements over the baseline of isolated utterances (without context) in three approaches using the user's utterance and system's response from previous messages as dialogue context.
摘要:对话系统包含一个自然语言理解 (Natural Language Understanding, NLU) 模块。在该模块中,有一个任务称为意图分类 (intent classification),旨在识别用户从一句话中试图实现的目标。以往的研究仅使用当前的语句来预测给定查询的意图,并未考虑上下文(一个或几个之前的语句)在对话流程中对此任务的作用。在本研究中,我们提出了几种方法来探讨上下文信息在意图分类任务中的作用。每种方法都用于在对话历史和当前语句之间进行连接。我们的意图分类方法基于卷积神经网络 (convolutional neural network),该网络从 BERT 中获取有效的向量表示,以执行基于窗口的准确意图分类。我们的实验在一个真实的巴西葡萄牙语语料库上进行,该语料库由 Wavy 全球公司提供的对话流程组成。我们的结果在三种方法上显著优于基线,即孤立语句(无上下文),这三种方法使用了用户语句和系统从前几条消息中的响应作为对话上下文。
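文中"基于窗口"的上下文拼接可示意如下:取最近 window 轮的用户话语与系统回复,与当前话语用分隔符连接后再送入 BERT 编码(分隔符与窗口大小为假设的超参数):

```python
def build_window_input(history, current_utt, window=2, sep=" [SEP] "):
    """history: [(用户话语, 系统回复), ...];返回拼接后的编码器输入文本。"""
    turns = []
    for user_utt, sys_resp in history[-window:]:  # 只保留最近 window 轮
        turns += [user_utt, sys_resp]
    return sep.join(turns + [current_utt])

history = [("quero mudar meu plano", "qual plano você deseja?")]
print(build_window_input(history, "o plano ilimitado"))
# quero mudar meu plano [SEP] qual plano você deseja? [SEP] o plano ilimitado
```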
[NLP-89] he Dark Patterns of Personalized Persuasion in Large Language Models : Exposing Persuasive Linguistic Features for Big Five Personality Traits in LLM s Responses
【速读】: 该论文试图解决的问题是如何理解大型语言模型(LLMs)在生成个性化说服性输出时调整语言特征的机制。解决方案的关键在于识别了13种对影响不同大五人格模型层次至关重要的语言特征,并通过分析19个LLMs在五个模型家族中的表现,揭示了这些模型如何根据提示中的个性特征信息调整输出。研究发现,模型在处理神经质时使用更多与焦虑相关的词汇,在处理尽责性时增加与成就相关的词汇,而在处理开放性时减少认知过程词汇的使用。不同模型家族在适应开放性和尽责性方面表现出色,而只有少数模型在适应神经质方面表现突出。这些发现表明,LLMs能够根据个性提示定制响应,从而创造出影响接收者心理和福祉的说服性内容。
链接: https://arxiv.org/abs/2411.06008
作者: Wiktoria Mieleszczenko-Kowszewicz,Dawid Płudowski,Filip Kołodziejczyk,Jakub Świstak,Julian Sienkiewicz,Przemysław Biecek
关键词-EN: Large Language Models, adjust linguistic features, Large Language, linguistic features, personalized persuasive outputs
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 31 pages
点击查看摘要
Abstract:This study explores how the Large Language Models (LLMs) adjust linguistic features to create personalized persuasive outputs. While research showed that LLMs personalize outputs, a gap remains in understanding the linguistic features of their persuasive capabilities. We identified 13 linguistic features crucial for influencing personalities across different levels of the Big Five model of personality. We analyzed how prompts with personality trait information influenced the output of 19 LLMs across five model families. The findings show that models use more anxiety-related words for neuroticism, increase achievement-related words for conscientiousness, and employ fewer cognitive processes words for openness to experience. Some model families excel at adapting language for openness to experience, others for conscientiousness, while only one model adapts language for neuroticism. Our findings show how LLMs tailor responses based on personality cues in prompts, indicating their potential to create persuasive content affecting the mind and well-being of the recipients.
摘要:本研究探讨了大语言模型 (LLMs) 如何调整语言特征以生成个性化的说服性输出。尽管已有研究表明 LLMs 能够个性化输出,但对于其说服能力的语言特征理解仍存在差距。我们识别了 13 种对影响大五人格模型不同层次人格至关重要的语言特征。我们分析了包含人格特质信息的提示如何影响来自五个模型家族的 19 个 LLMs 的输出。研究发现,模型在神经质方面使用更多与焦虑相关的词汇,在尽责性方面增加与成就相关的词汇,而在开放性方面则减少认知过程词汇的使用。某些模型家族在适应开放性语言方面表现出色,其他则在尽责性方面,而仅有一个模型在神经质方面进行了语言适应。我们的研究结果显示了 LLMs 如何根据提示中的人格线索定制响应,表明它们具有生成影响接收者心理和福祉的说服性内容的潜力。
[NLP-90] GUIDEQ: Framework for Guided Questioning for progressive informational collection and classification
【速读】: 该论文试图解决的问题是大型语言模型(LLMs)在问答系统中由于缺乏或缺失分类所需的特定领域信息而导致的分类不准确问题。解决方案的关键是提出了一个名为GUIDEQ的新框架,该框架通过利用分类器模型的可解释性,结合LLMs生成引导性问题,以获取更多相关信息,从而提高文本分类的准确性。GUIDEQ通过遮挡技术提取代表标签的最显著关键词,并基于分类器输出的前三个标签和这些关键词设计引导性问题的提示策略,以收集特定和相关的进一步信息,实现更精准的分类。实验结果表明,GUIDEQ在F1-Score上优于其他基于LLM的基线方法,显著提升了问答系统的性能。
链接: https://arxiv.org/abs/2411.05991
作者: Priya Mishra,Suraj Racha,Kaustubh Ponkshe,Adit Akarsh,Ganesh Ramakrishnan
关键词-EN: Question Answering, important part, part of tasks, information, Answering
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Question Answering (QA) is an important part of tasks like text classification through information gathering. These are finding increasing use in sectors like healthcare, customer support, legal services, etc., to collect and classify responses into actionable categories. LLMs, although they can support QA systems, face a significant challenge of insufficient or missing information for classification. Although LLMs excel in reasoning, the models rely on their parametric knowledge to answer. However, questioning the user requires domain-specific information that aids in collecting accurate information. Our work, GUIDEQ, presents a novel framework for asking guided questions to make further progress from partial information. We leverage the explainability derived from the classifier model, along with LLMs, to ask guided questions that further enhance the information. This further information helps in more accurate classification of a text. GUIDEQ derives the most significant keywords representative of a label using occlusions. We develop GUIDEQ's prompting strategy for guided questions based on the top-3 classifier label outputs and the significant words, to seek specific and relevant information and classify in a targeted manner. Through our experimental results, we demonstrate that GUIDEQ outperforms other LLM-based baselines, yielding improved F1-Score through the accurate collection of relevant further information. We perform various analytical studies and also observe better question quality with our method.
摘要:问答 (Question Answering, QA) 是信息收集过程中文本分类任务的重要组成部分。这些任务在医疗、客户支持、法律服务等领域中得到了越来越多的应用,用于收集并将响应分类为可操作的类别。尽管大语言模型 (LLM) 能够支持问答系统,但它们在分类时面临信息不足或缺失的重大挑战。虽然大语言模型在推理方面表现出色,但模型依赖于其参数化知识来回答问题。然而,向用户提问需要特定领域的信息,以帮助收集准确的信息。我们的工作 GUIDEQ 提出了一种新颖的框架,用于提出引导性问题,以便在部分信息的基础上进一步推进。我们利用分类器模型导出的可解释性,结合大语言模型,提出引导性问题以进一步增强信息。这些进一步的信息有助于更准确地对文本进行分类。GUIDEQ 通过遮挡方法提取代表标签的最显著关键词。我们基于分类器标签输出的前三个结果和显著词,开发了 GUIDEQ 的引导性问题提示策略,以寻求特定和相关的信息,并以有针对性的方式进行分类。通过我们的实验结果,我们证明了 GUIDEQ 优于其他基于大语言模型的基线方法,通过准确收集相关进一步信息,提高了 F1 分数。我们还进行了各种分析研究,并观察到我们的方法具有更好的问题质量。
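其中"用遮挡法提取标签显著关键词"的思路可示意如下(classify 为假设的分类器接口,返回各标签概率;逐词遮挡,按目标标签概率的下降幅度排序):

```python
def occlusion_keywords(text, classify, target_label, top_k=5):
    """classify(text) -> {label: prob}(假设接口)。删去某词后目标标签的
    概率下降越多,说明该词对这一标签越显著。"""
    words = text.split()
    base = classify(text)[target_label]
    drops = []
    for i in range(len(words)):
        occluded = " ".join(words[:i] + words[i + 1:])
        drops.append((base - classify(occluded)[target_label], words[i]))
    drops.sort(reverse=True)
    return [w for _, w in drops[:top_k]]
```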
[NLP-91] Game-theoretic LLM : Agent Workflow for Negotiation Games
【速读】: 该论文试图解决大型语言模型(LLMs)在战略决策情境中的合理性问题,特别是在博弈论框架下。研究的核心问题是LLMs在面对复杂博弈时,如何偏离理性策略,尤其是在支付矩阵增大或序列树加深的情况下。解决方案的关键在于设计多种博弈论工作流程,这些流程旨在指导LLMs的推理和决策过程,增强其计算纳什均衡和在不确定及不完全信息条件下做出理性选择的能力。实验结果表明,采用这些工作流程显著提高了LLMs在博弈论任务中的合理性和鲁棒性,特别是在识别最优策略、谈判场景中的近最优分配以及减少谈判中被利用的脆弱性方面。此外,研究还探讨了代理是否应采用这些工作流程的元策略考虑,认识到使用或放弃工作流程本身就是一个博弈论问题。
链接: https://arxiv.org/abs/2411.05990
作者: Wenyue Hua,Ollie Liu,Lingyao Li,Alfonso Amayuelas,Julie Chen,Lucas Jiang,Mingyu Jin,Lizhou Fan,Fei Sun,William Wang,Xintong Wang,Yongfeng Zhang
关键词-EN: large language models, paper investigates, game theory, LLMs, language models
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: 44 pages, 12 figures
点击查看摘要
Abstract:This paper investigates the rationality of large language models (LLMs) in strategic decision-making contexts, specifically within the framework of game theory. We evaluate several state-of-the-art LLMs across a spectrum of complete-information and incomplete-information games. Our findings reveal that LLMs frequently deviate from rational strategies, particularly as the complexity of the game increases with larger payoff matrices or deeper sequential trees. To address these limitations, we design multiple game-theoretic workflows that guide the reasoning and decision-making processes of LLMs. These workflows aim to enhance the models' ability to compute Nash Equilibria and make rational choices, even under conditions of uncertainty and incomplete information. Experimental results demonstrate that the adoption of these workflows significantly improves the rationality and robustness of LLMs in game-theoretic tasks. Specifically, with the workflow, LLMs exhibit marked improvements in identifying optimal strategies, achieving near-optimal allocations in negotiation scenarios, and reducing susceptibility to exploitation during negotiations. Furthermore, we explore the meta-strategic considerations of whether it is rational for agents to adopt such workflows, recognizing that the decision to use or forgo the workflow constitutes a game-theoretic issue in itself. Our research contributes to a deeper understanding of LLMs' decision-making capabilities in strategic contexts and provides insights into enhancing their rationality through structured workflows. The findings have implications for the development of more robust and strategically sound AI agents capable of navigating complex interactive environments. Code and data supporting this study are available at this https URL.
摘要:本文研究了大语言模型 (LLMs) 在战略决策情境中的合理性,特别是基于博弈论框架下的表现。我们评估了多种最先进的 LLMs 在完全信息和不完全信息博弈中的表现。研究结果显示,随着博弈复杂度的增加,例如支付矩阵的扩大或序列树的加深,LLMs 经常偏离理性策略。为了解决这些局限性,我们设计了多个博弈论工作流程,以指导 LLMs 的推理和决策过程。这些工作流程旨在增强模型计算纳什均衡和在不确定及不完全信息条件下做出理性选择的能力。实验结果表明,采用这些工作流程显著提升了 LLMs 在博弈论任务中的合理性和鲁棒性。具体而言,通过这些工作流程,LLMs 在识别最优策略、在谈判场景中实现接近最优的分配以及减少谈判中被利用的脆弱性方面表现出显著改进。此外,我们探讨了智能体是否应采用此类工作流程的元策略考虑,认识到使用或放弃工作流程本身就是一个博弈论问题。本研究有助于深化对 LLMs 在战略情境中决策能力的理解,并提供了通过结构化工作流程增强其合理性的见解。研究结果对开发能够应对复杂交互环境的更稳健和战略上合理的 AI 智能体具有重要意义。支持本研究的代码和数据可通过以下链接获取:this https URL。
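作为工作流程中"计算纳什均衡"一步的参照,下面给出枚举双矩阵博弈纯策略纳什均衡的教科书式实现(以囚徒困境为例;这只是均衡求解本身的示意,与论文的 LLM 工作流无直接对应):

```python
import numpy as np

def pure_nash(A, B):
    """A[i, j] / B[i, j]:行、列玩家在策略组合 (i, j) 下的收益。
    (i, j) 是纯策略纳什均衡,当且仅当双方都无法单方面改进。"""
    return [(i, j)
            for i in range(A.shape[0]) for j in range(A.shape[1])
            if A[i, j] >= A[:, j].max() and B[i, j] >= B[i, :].max()]

# 囚徒困境:策略 0 = 合作,1 = 背叛;唯一纯策略均衡应为 (1, 1)
A = np.array([[-1, -3],
              [ 0, -2]])
print(pure_nash(A, A.T))  # [(1, 1)]
```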
[NLP-92] Fine-Grained Reward Optimization for Machine Translation using Error Severity Mappings
【速读】: 该论文试图解决强化学习 (Reinforcement Learning, RL) 在神经机器翻译 (Neural Machine Translation) 系统中使用句子级反馈导致的奖励稀疏问题 (reward sparsity problem),即模型仅接收整个句子的单一评分,导致学习信号效率低下。解决方案的关键在于引入细粒度的词级别奖励机制 (fine-grained token-level reward mechanisms),并结合先进的质量评估系统 xCOMET 作为词级别奖励模型。xCOMET 通过预测源-翻译对中的细粒度错误范围及其严重性,提供详细的反馈。实验结果表明,使用词级别奖励训练的模型在翻译质量和训练稳定性方面均优于基于句子级奖励的基线模型。
链接: https://arxiv.org/abs/2411.05986
作者: Miguel Moura Ramos,Tomás Almeida,Daniel Vareta,Filipe Azevedo,Sweta Agrawal,Patrick Fernandes,André F. T. Martins
关键词-EN: Reinforcement learning, neural machine translation, accurately assess translation, training neural machine, effective and robust
类目: Computation and Language (cs.CL)
备注: 10 pages, work-in-progress
点击查看摘要
Abstract:Reinforcement learning (RL) has been proven to be an effective and robust method for training neural machine translation systems, especially when paired with powerful reward models that accurately assess translation quality. However, most research has focused on RL methods that use sentence-level feedback, which leads to inefficient learning signals due to the reward sparsity problem – the model receives a single score for the entire sentence. To address this, we introduce a novel approach that leverages fine-grained token-level reward mechanisms with RL methods. We use xCOMET, a state-of-the-art quality estimation system as our token-level reward model. xCOMET provides detailed feedback by predicting fine-grained error spans and their severity given source-translation pairs. We conduct experiments on small and large translation datasets to compare the impact of sentence-level versus fine-grained reward signals on translation quality. Our results show that training with token-level rewards improves translation quality across language pairs over baselines according to automatic and human evaluation. Furthermore, token-level reward optimization also improves training stability, evidenced by a steady increase in mean rewards over training epochs.
摘要:强化学习 (Reinforcement Learning, RL) 已被证明是训练神经机器翻译系统的有效且稳健的方法,尤其是在与能够准确评估翻译质量的强大奖励模型结合使用时。然而,大多数研究集中在使用句子级反馈的 RL 方法上,这导致了学习信号的效率低下,原因是奖励稀疏问题——模型仅收到整个句子的单一分数。为了解决这一问题,我们提出了一种新颖的方法,该方法利用细粒度的 Token 级奖励机制与 RL 方法相结合。我们使用 xCOMET,一种最先进的质量评估系统作为我们的 Token 级奖励模型。xCOMET 通过预测细粒度错误范围及其严重性,为源-翻译对提供详细的反馈。我们在小型和大型翻译数据集上进行了实验,比较了句子级与细粒度奖励信号对翻译质量的影响。我们的结果表明,根据自动和人工评估,使用 Token 级奖励进行训练在多种语言对的翻译质量上均优于基线。此外,Token 级奖励优化还提高了训练的稳定性,表现为训练周期内平均奖励的稳步增加。
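将 xCOMET 式的"错误范围 + 严重性"标注映射为词级奖励的一种做法可示意如下(严重性到奖励值的映射为假设;error_spans 以 token 下标表示):

```python
SEVERITY_REWARD = {"minor": -0.5, "major": -1.0, "critical": -2.0}

def token_rewards(tokens, error_spans, base_reward=0.1):
    """error_spans: [(起始 token, 结束 token, 严重性)],假设来自 xCOMET 类
    质量评估系统;未被错误范围覆盖的 token 获得小额正奖励,
    从而把稀疏的句级分数摊开为稠密的词级学习信号。"""
    rewards = [base_reward] * len(tokens)
    for start, end, severity in error_spans:
        for t in range(start, min(end, len(tokens))):
            rewards[t] = SEVERITY_REWARD[severity]
    return rewards

tokens = "der Hund schläft auf dem Stuhl".split()
print(token_rewards(tokens, [(4, 6, "major")]))
# [0.1, 0.1, 0.1, 0.1, -1.0, -1.0]
```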
[NLP-93] FactLens: Benchmarking Fine-Grained Fact Verification
【速读】: 该论文试图解决大语言模型 (Large Language Models, LLMs) 在生成内容时容易产生事实错误的问题,特别是传统验证方法依赖于整体模型,可能掩盖复杂声明中的细微错误。论文提出了一种细粒度验证 (fine-grained verification) 的解决方案,即将复杂声明分解为更小的子声明 (sub-claims) 进行单独验证,从而更精确地识别不准确之处,提高透明度并减少证据检索的模糊性。解决方案的关键在于引入 FactLens 基准,用于评估细粒度事实验证,并提供子声明质量的自动化评估工具和指标。该基准数据经过手动策划,以确保高质量的基准事实。研究结果显示,自动化 FactLens 评估工具与人类判断之间存在良好的一致性,并讨论了子声明特征对整体验证性能的影响。
链接: https://arxiv.org/abs/2411.05980
作者: Kushan Mitra,Dan Zhang,Sajjadur Rahman,Estevam Hruschka
关键词-EN: Large Language Models, Large Language, shown impressive capability, produce factually incorrect, factually incorrect information
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 12 pages, under review
点击查看摘要
Abstract:Large Language Models (LLMs) have shown impressive capability in language generation and understanding, but their tendency to hallucinate and produce factually incorrect information remains a key limitation. To verify LLM-generated contents and claims from other sources, traditional verification approaches often rely on holistic models that assign a single factuality label to complex claims, potentially obscuring nuanced errors. In this paper, we advocate for a shift toward fine-grained verification, where complex claims are broken down into smaller sub-claims for individual verification, allowing for more precise identification of inaccuracies, improved transparency, and reduced ambiguity in evidence retrieval. However, generating sub-claims poses challenges, such as maintaining context and ensuring semantic equivalence with respect to the original claim. We introduce FactLens, a benchmark for evaluating fine-grained fact verification, with metrics and automated evaluators of sub-claim quality. The benchmark data is manually curated to ensure high-quality ground truth. Our results show alignment between automated FactLens evaluators and human judgments, and we discuss the impact of sub-claim characteristics on the overall verification performance.
摘要:大语言模型 (LLM) 在语言生成和理解方面展现了令人印象深刻的能力,但其产生幻觉和输出事实错误信息的倾向仍然是一个关键限制。为了验证 LLM 生成的内容和其他来源的主张,传统的验证方法通常依赖于整体模型,这些模型为复杂的主张分配单一的事实标签,可能会掩盖细微的错误。本文主张转向细粒度验证,即将复杂的主张分解为更小的子主张进行单独验证,从而实现更精确的不准确性识别、提高透明度以及减少证据检索中的模糊性。然而,生成子主张面临挑战,如保持上下文一致性和确保与原始主张的语义等价性。我们引入了 FactLens,一个用于评估细粒度事实验证的基准,包括子主张质量的指标和自动化评估器。基准数据经过人工精心策划,以确保高质量的基准事实。我们的结果显示,自动化 FactLens 评估器与人类判断之间存在一致性,并讨论了子主张特征对整体验证性能的影响。
[NLP-94] he Empirical Impact of Data Sanitization on Language Models NEURIPS2024
【速读】: 该论文试图解决数据清洗(data sanitization)对语言模型(language model)理解能力的影响问题。解决方案的关键在于通过实证分析,比较原始数据集和经过清洗的数据集在多个基准语言建模任务(如理解问答(QA)、蕴含关系、情感分析和文本分类)中的表现差异。研究发现,数据清洗对某些任务(如情感分析或蕴含关系)的影响较小,通常在1-5%之间,而对理解问答任务的影响较大,性能下降达25%。论文进一步探讨了任务关键实体的存在与性能下降的关系,并提出了一种基于内容子采样的策略来修复已清洗的数据集。
链接: https://arxiv.org/abs/2411.05978
作者: Anwesan Pal,Radhika Bhargava,Kyle Hinsz,Jacques Esterhuizen,Sudipta Bhattacharya
关键词-EN: identifying sensitive content, personally identifiable information, modeling involves identifying, involves identifying sensitive, language modeling involves
类目: Computation and Language (cs.CL)
备注: Paper accepted at Safe Generative AI Workshop at NeurIPS 2024
点击查看摘要
Abstract:Data sanitization in the context of language modeling involves identifying sensitive content, such as personally identifiable information (PII), and redacting them from a dataset corpus. It is a common practice used in natural language processing (NLP) to maintain privacy. Nevertheless, the impact of data sanitization on the language understanding capability of a language model remains less studied. This paper empirically analyzes the effects of data sanitization across several benchmark language-modeling tasks including comprehension question answering (QA), entailment, sentiment analysis, and text classification. Our experiments cover a wide spectrum comprising finetuning small-scale language models, to prompting large language models (LLMs), on both original and sanitized datasets, and comparing their performance across the tasks. Interestingly, our results suggest that for some tasks such as sentiment analysis or entailment, the impact of redaction is quite low, typically around 1-5%, while for tasks such as comprehension QA there is a big drop of 25% in performance observed in redacted queries as compared to the original. For tasks that have a higher impact, we perform a deeper dive to inspect the presence of task-critical entities. Finally, we investigate correlation between performance and number of redacted entities, and also suggest a strategy to repair an already redacted dataset by means of content-based subsampling. Additional details are available at this https URL.
摘要:在语言模型背景下,数据净化涉及识别敏感内容,如个人身份信息 (PII),并将其从数据集语料库中删除。这是自然语言处理 (NLP) 中维护隐私的常见做法。然而,数据净化对语言模型语言理解能力的影响研究较少。本文实证分析了数据净化在多个基准语言建模任务中的影响,包括理解问答 (QA)、蕴含、情感分析和文本分类。我们的实验涵盖了从微调小规模语言模型到提示大语言模型 (LLMs) 的广泛范围,同时在原始和净化数据集上进行,并比较它们在各任务中的表现。有趣的是,我们的结果表明,对于某些任务如情感分析或蕴含,删除的影响非常低,通常在 1-5% 左右,而对于理解 QA 任务,净化查询的性能下降了 25%,相比原始数据集。对于影响较大的任务,我们进行了深入分析,检查任务关键实体的存在。最后,我们研究了性能与删除实体数量之间的相关性,并提出了一种通过基于内容的子采样修复已净化数据集的策略。更多详情请访问此 https URL。
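数据净化中最基础的基于正则的 PII 消除可示意如下(模式清单远非完备,仅为演示;实际流程通常还会叠加命名实体识别等更强的检测手段):

```python
import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def sanitize(text):
    """把命中的敏感片段替换为占位标签,返回净化后的文本。"""
    for tag, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{tag}]", text)
    return text

print(sanitize("Reach John at john.doe@mail.com or 555-123-4567."))
# Reach John at [EMAIL] or [PHONE].
```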
[NLP-95] oward Transdisciplinary Approaches to Audio Deepfake Discernment
【速读】: 该论文试图解决音频深度伪造(audio deepfake)检测的挑战,特别是当前人工智能模型在理解和处理语言变异性和人类语音复杂性方面的不足。解决方案的关键在于跨学科合作,将语言学知识融入人工智能方法中,以实现更全面和鲁棒的深度伪造检测。具体来说,通过专家参与的循环(expert-in-the-loop)和超越专家无关的AI方法,可以提升检测的准确性和可靠性。
链接: https://arxiv.org/abs/2411.05969
作者: Vandana P. Janeja,Christine Mallinson
关键词-EN: Artificial Intelligence methods, Artificial Intelligence, lens across Artificial, audio deepfake detection, perspective calls
类目: ound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注:
点击查看摘要
Abstract:This perspective calls for scholars across disciplines to address the challenge of audio deepfake detection and discernment through an interdisciplinary lens across Artificial Intelligence methods and linguistics. With an avalanche of tools for the generation of realistic-sounding fake speech on one side, the detection of deepfakes is lagging on the other. Particularly hindering audio deepfake detection is the fact that current AI models lack a full understanding of the inherent variability of language and the complexities and uniqueness of human speech. We see the promising potential in recent transdisciplinary work that incorporates linguistic knowledge into AI approaches to provide pathways for expert-in-the-loop and to move beyond expert agnostic AI-based methods for more robust and comprehensive deepfake detection.
摘要:这一视角呼吁跨学科的学者们通过跨人工智能方法和语言学的多学科视角,来应对音频深度伪造检测和辨别这一挑战。一方面,随着生成逼真假语音工具的激增,另一方面,深度伪造的检测却相对滞后。特别是阻碍音频深度伪造检测的一个事实是,当前的 AI 模型缺乏对语言内在变异性和人类语音复杂性与独特性的全面理解。我们看到了近期跨学科工作的前景,这些工作将语言学知识融入 AI 方法,为专家参与的循环提供了途径,并超越了专家无关的基于 AI 的方法,以实现更稳健和全面的深度伪造检测。
[NLP-96] Sentiment Analysis of Cyberbullying Data in Social Media
【速读】: 该论文试图解决社交媒体中的网络欺凌(cyberbullying)问题,特别是通过情感分析(sentiment analysis)技术来检测和识别高风险的受害者。解决方案的关键在于利用深度学习和自然语言理解技术,具体采用了带有长短期记忆(LSTM)单元的循环神经网络,并结合了两种不同的嵌入方法:BERT嵌入和OpenAI的嵌入API。通过对比这两种方法在Formspring网络欺凌数据上的情感分析性能,评估其在检测欺凌行为中的有效性。
链接: https://arxiv.org/abs/2411.05958
作者: Arvapalli Sai Susmitha,Pradeep Pujari
关键词-EN: today digital age, modern life, digital age, integral part, part of modern
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Social media has become an integral part of modern life, but it has also brought with it the pervasive issue of cyberbullying a serious menace in today’s digital age. Cyberbullying, a form of harassment that occurs on social networks, has escalated alongside the growth of these platforms. Sentiment analysis holds significant potential not only for detecting bullying phrases but also for identifying victims who are at high risk of harm, whether to themselves or others. Our work focuses on leveraging deep learning and natural language understanding techniques to detect traces of bullying in social media posts. We developed a Recurrent Neural Network with Long Short-Term Memory (LSTM) cells, using different embeddings. One approach utilizes BERT embeddings, while the other replaces the embeddings layer with the recently released embeddings API from OpenAI. We conducted a performance comparison between these two approaches to evaluate their effectiveness in sentiment analysis of Formspring Cyberbullying data. Our Code is Available at this https URL
摘要:社交媒体已成为现代生活不可或缺的一部分,但同时也带来了网络欺凌这一普遍问题,成为当今数字时代的一大威胁。网络欺凌,即在社交网络上发生的骚扰行为,随着这些平台的发展而愈演愈烈。情感分析不仅在检测欺凌性言论方面具有巨大潜力,还能识别出那些处于高风险伤害中的受害者,无论是对自己还是他人。我们的研究重点是利用深度学习和自然语言理解技术,检测社交媒体帖子中的欺凌痕迹。我们开发了一种基于长短期记忆(LSTM)单元的循环神经网络,采用了不同的嵌入方式。一种方法使用BERT嵌入,而另一种方法则用OpenAI最近发布的嵌入API替代了嵌入层。我们通过对比这两种方法在Formspring网络欺凌数据情感分析中的表现,评估了它们的有效性。我们的代码可在以下链接获取:https URL
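论文所用的"嵌入 + LSTM + 分类头"结构可用 PyTorch 最小化示意如下(超参数与词表大小为随意取值;按论文思路,嵌入层可替换为 BERT 向量或 OpenAI embeddings API 的输出):

```python
import torch
import torch.nn as nn

class BullyingDetector(nn.Module):
    def __init__(self, vocab_size=30000, emb_dim=128, hidden_dim=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, 2)  # 欺凌 / 非欺凌

    def forward(self, token_ids):                   # [batch, seq_len]
        embedded = self.embedding(token_ids)
        _, (h_n, _) = self.lstm(embedded)           # 取最终隐状态
        return self.classifier(h_n[-1])

model = BullyingDetector()
logits = model(torch.randint(0, 30000, (4, 32)))    # batch=4,序列长=32
print(logits.shape)                                 # torch.Size([4, 2])
```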
[NLP-97] NeKo: Toward Post Recognition Generative Correction Large Language Models with Task-Oriented Experts
【速读】: 该论文试图解决通用后识别错误校正器的构建问题,特别是如何在大量混合领域数据集上最有效地训练模型。解决方案的关键在于学习数据集特定的特征,并将这些知识整合到一个单一模型中。论文提出了一种多任务校正的混合专家模型(Multi-Task Correction MoE),通过训练专家模型来处理语音转文本、语言转文本和视觉转文本数据集,从而实现对每个数据集的令牌进行路由到相应的专家模型。实验结果表明,该方法在Open ASR Leaderboard上实现了新的最先进性能,平均相对WER降低了5.0%,并在语音和翻译任务中显著提升了BLEU分数。在零样本评估中,NeKo模型在Hyporadise基准测试中相对于GPT-3.5和Claude-Opus分别实现了15.5%到27.6%的相对WER降低。
链接: https://arxiv.org/abs/2411.05945
作者: Yen-Ting Lin,Chao-Han Huck Yang,Zhehuai Chen,Piotr Zelasko,Xuesong Yang,Zih-Ching Chen,Krishna C Puvvada,Szu-Wei Fu,Ke Hu,Jun Wei Chiu,Jagadeesh Balam,Boris Ginsburg,Yu-Chiang Frank Wang
关键词-EN: general-purpose post-recognition error, post-recognition error corrector, error corrector poses, crucial question, general-purpose post-recognition
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Audio and Speech Processing (eess.AS)
备注: NeKo work has been done in June 2024. NeKo LMs will be open source on this https URL under the MIT license
点击查看摘要
Abstract:Construction of a general-purpose post-recognition error corrector poses a crucial question: how can we most effectively train a model on a large mixture of domain datasets? The answer would lie in learning dataset-specific features and digesting their knowledge in a single model. Previous methods achieve this by having separate correction language models, resulting in a significant increase in parameters. In this work, we present Mixture-of-Experts as a solution, highlighting that MoEs are much more than a scalability tool. We propose a Multi-Task Correction MoE, where we train the experts to become an "expert" of speech-to-text, language-to-text and vision-to-text datasets by learning to route each dataset's tokens to its mapped expert. Experiments on the Open ASR Leaderboard show that we explore a new state-of-the-art performance by achieving an average relative 5.0% WER reduction and substantial improvements in BLEU scores for speech and translation tasks. On zero-shot evaluation, NeKo outperforms GPT-3.5 and Claude-Opus with 15.5% to 27.6% relative WER reduction in the Hyporadise benchmark. NeKo performs competitively on grammar and post-OCR correction as a multi-task model.
摘要:构建一个通用的后识别错误校正器面临一个关键问题:如何在大量混合领域数据集上最有效地训练模型?答案在于学习数据集特定的特征,并将这些知识整合到一个单一模型中。以往的方法通过使用独立的校正语言模型来实现这一点,从而导致参数数量显著增加。在本研究中,我们提出将“专家混合”(Mixture-of-Experts, MoE)作为解决方案,强调MoE不仅仅是扩展性的工具。我们提出了一种多任务校正MoE,通过训练专家学习将每个数据集的Token路由到其对应的专家,使其成为语音转文本、语言转文本和视觉转文本数据集的“专家”。在Open ASR排行榜上的实验表明,我们通过实现平均相对5.0%的WER(词错误率)降低,以及在语音和翻译任务中显著提升的BLEU分数,探索了一种新的最先进性能。在零样本评估中,NeKo在Hyporadise基准测试中分别以15.5%至27.6%的相对WER降低,优于GPT-3.5和Claude-Opus。作为多任务模型,NeKo在语法和后OCR校正方面表现出色。
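下面用几十行 PyTorch 展示“按 token 路由到任务专家”的基本形态,仅为理解 MoE 路由机制的玩具示例,与 NeKo 的真实架构、专家数量和路由策略无关:

```python
import torch
import torch.nn as nn

class TinyTaskMoE(nn.Module):
    """极简多任务 MoE:每个 token 由路由器打分后交给得分最高的专家处理。"""

    def __init__(self, d_model: int = 64, n_experts: int = 3):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)        # 路由打分
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_model * 4),
                          nn.GELU(),
                          nn.Linear(d_model * 4, d_model))
            for _ in range(n_experts)                      # 例如:ASR / 翻译 / OCR 校正专家
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model);此处用 top-1 路由
        scores = self.router(x).softmax(dim=-1)            # (tokens, n_experts)
        top1 = scores.argmax(dim=-1)                       # 每个 token 选中的专家
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top1 == e
            if mask.any():
                out[mask] = expert(x[mask]) * scores[mask][:, e:e + 1]
        return out

moe = TinyTaskMoE()
tokens = torch.randn(10, 64)
print(moe(tokens).shape)  # torch.Size([10, 64])
```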
[NLP-98] Quantifying artificial intelligence through algebraic generalization
【速读】: 该论文试图解决现代人工智能系统在符号处理和抽象能力方面的不足,特别是在需要可解释性和可靠性的技术领域。解决方案的关键在于采用代数电路复杂性理论(algebraic circuit complexity)来量化符号泛化能力。通过将符号推理问题重构为代数表达式,并利用代数电路复杂性理论中的工具,论文提出了一种基于复杂性理论属性的基准定义方法,从而能够系统地评估和提升AI系统的符号处理能力。此外,代数电路作为通用的数学对象,能够生成大量样本,非常适合当前数据需求量大的机器学习算法。
链接: https://arxiv.org/abs/2411.05943
作者: Takuya Ito,Murray Campbell,Lior Horesh,Tim Klinger,Parikshit Ram
关键词-EN: modern artificial intelligence, algebraic circuit complexity, algebraic circuit, artificial intelligence, scientific quantification
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
备注:
点击查看摘要
Abstract:The rapid development of modern artificial intelligence (AI) systems has created an urgent need for their scientific quantification. While their fluency across a variety of domains is impressive, modern AI systems fall short on tests requiring symbolic processing and abstraction - a glaring limitation given the necessity for interpretable and reliable technology. Despite a surge of reasoning benchmarks emerging from the academic community, no comprehensive and theoretically-motivated framework exists to quantify reasoning (and more generally, symbolic ability) in AI systems. Here, we adopt a framework from computational complexity theory to explicitly quantify symbolic generalization: algebraic circuit complexity. Many symbolic reasoning problems can be recast as algebraic expressions. Thus, algebraic circuit complexity theory - the study of algebraic expressions as circuit models (i.e., directed acyclic graphs) - is a natural framework to study the complexity of symbolic computation. The tools of algebraic circuit complexity enable the study of generalization by defining benchmarks in terms of their complexity-theoretic properties (i.e., the difficulty of a problem). Moreover, algebraic circuits are generic mathematical objects; for a given algebraic circuit, an arbitrarily large number of samples can be generated for a specific circuit, making it an optimal testbed for the data-hungry machine learning algorithms that are used today. Here, we adopt tools from algebraic circuit complexity theory, apply it to formalize a science of symbolic generalization, and address key theoretical and empirical challenges for its successful application to AI science and its impact on the broader community.
摘要:现代人工智能(AI)系统的快速发展迫切需要对其进行科学的量化。尽管这些系统在跨多个领域的表现令人印象深刻,但在需要符号处理和抽象能力的测试中,现代AI系统的表现却明显不足——考虑到可解释性和可靠性技术的必要性,这是一个显著的局限。尽管学术界涌现出大量推理基准,但尚未存在一个全面且理论驱动的框架来量化AI系统中的推理能力(以及更广泛的符号能力)。在此,我们采用计算复杂性理论中的一个框架来明确量化符号泛化能力:代数电路复杂性。许多符号推理问题可以重构为代数表达式。因此,代数电路复杂性理论——将代数表达式研究为电路模型(即有向无环图)——自然成为研究符号计算复杂性的框架。代数电路复杂性理论的工具通过定义基于其复杂性理论属性的基准(即问题的难度),促进了泛化能力的研究。此外,代数电路是通用的数学对象;对于给定的代数电路,可以生成任意大量的样本用于特定电路,这使其成为当今数据需求旺盛的机器学习算法的理想测试平台。在此,我们采用代数电路复杂性理论的工具,将其应用于符号泛化科学的规范化,并解决其成功应用于AI科学及其对更广泛社区影响的关键理论和实证挑战。
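摘要中“对给定代数电路可生成任意多样本”的含义,可以用下面的小例子直观说明(电路结构为随意编造,并非论文中的基准):

```python
import numpy as np

def sample_circuit(x1, x2, x3):
    """一个手写的小代数电路(以 + 与 × 为内部节点的 DAG):
        g1 = x1 + x2
        g2 = x2 * x3
        out = g1 * g2 + x1
    对应多项式 (x1 + x2) * x2 * x3 + x1。"""
    g1 = x1 + x2
    g2 = x2 * x3
    return g1 * g2 + x1

# 任意数量的训练样本:随机采样输入,电路本身给出精确标签
rng = np.random.default_rng(0)
X = rng.integers(-5, 6, size=(1000, 3))
y = sample_circuit(X[:, 0], X[:, 1], X[:, 2])
print(X.shape, y.shape)  # (1000, 3) (1000,)
```

电路的深度、节点数等复杂性理论属性可作为“题目难度”的刻度,这正是文中把基准按复杂性性质定义的出发点。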
[NLP-99] Mitigating Hallucination with ZeroG: An Advanced Knowledge Management Engine
【速读】: 该论文试图解决数字文档管理中高效知识提取的挑战,特别是传统方法在处理复杂文档时出现的幻觉(hallucinations)和高延迟问题。解决方案的关键在于ZeroG的创新方法,它通过知识蒸馏(knowledge distillation)和提示调优(prompt tuning)来提升模型性能。ZeroG利用一个小型模型来复制大型教师模型的行为,采用黑箱蒸馏方法,不依赖中间特征,从而优化计算效率。这种方法显著提高了准确性并减少了响应时间,同时结合文档摄取和元数据利用的高级技术,改进了问答系统的准确性。通过整合图数据库和强大的元数据管理,ZeroG进一步简化了信息检索,实现了精确和上下文感知的响应,为现代文档管理提供了一个可扩展的解决方案。
链接: https://arxiv.org/abs/2411.05936
作者: Anantha Sharma,Sheeba Elizabeth John,Fatemeh Rezapoor Nikroo,Krupali Bhatt,Mrunal Zambre,Aditi Wikhe
关键词-EN: presents significant challenges, documents presents significant, Large Language Models, presents significant, digital documents presents
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Theory (cs.IT)
备注: 10 pages, 4 figures, 1 table
点击查看摘要
Abstract:The growth of digital documents presents significant challenges in efficient management and knowledge extraction. Traditional methods often struggle with complex documents, leading to issues such as hallucinations and high latency in responses from Large Language Models (LLMs). ZeroG, an innovative approach, significantly mitigates these challenges by leveraging knowledge distillation and prompt tuning to enhance model performance. ZeroG utilizes a smaller model that replicates the behavior of a larger teacher model, ensuring contextually relevant and grounded responses. By employing a black-box distillation approach, it creates a distilled dataset without relying on intermediate features, optimizing computational efficiency. This method significantly enhances accuracy and reduces response times, providing a balanced solution for modern document management. Incorporating advanced techniques for document ingestion and metadata utilization, ZeroG improves the accuracy of question-and-answer systems. The integration of graph databases and robust metadata management further streamlines information retrieval, allowing for precise and context-aware responses. By transforming how organizations interact with complex data, ZeroG enhances productivity and user experience, offering a scalable solution for the growing demands of digital document management.
摘要:数字文档的增长给高效管理和知识提取带来了重大挑战。传统方法在处理复杂文档时常常遇到困难,导致大语言模型(LLM)在响应中出现幻觉和高延迟问题。ZeroG 作为一种创新方法,通过利用知识蒸馏和提示调优显著缓解了这些挑战,从而提升了模型性能。ZeroG 使用一个较小的模型来复制较大教师模型的行为,通过采用黑箱蒸馏方法,确保了上下文相关且基于事实的响应,同时无需依赖中间特征,优化了计算效率。这种方法显著提高了准确性并减少了响应时间,为现代文档管理提供了一种平衡的解决方案。结合先进的文档摄取和元数据利用技术,ZeroG 提高了问答系统的准确性。图数据库和强大的元数据管理的集成进一步简化了信息检索,实现了精确且上下文感知的响应。通过改变组织与复杂数据的交互方式,ZeroG 提升了生产力和用户体验,为不断增长的数字文档管理需求提供了一种可扩展的解决方案。
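黑箱蒸馏只依赖教师模型的输出,而不触及其中间特征。以下为这一思想的极简示意(teacher / student 均用线性层代替,损失与超参数皆为示例,并非 ZeroG 的实际配置):

```python
import torch
import torch.nn.functional as F

teacher = torch.nn.Linear(16, 4)   # 假想的“大”教师模型(仅示意)
student = torch.nn.Linear(16, 4)   # 假想的小学生模型
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

# 第一步:只查询教师的输出,构建蒸馏数据集(不使用中间特征)
with torch.no_grad():
    inputs = torch.randn(256, 16)
    soft_labels = teacher(inputs).softmax(dim=-1)

# 第二步:学生拟合教师的软标签
for _ in range(100):
    log_probs = student(inputs).log_softmax(dim=-1)
    loss = F.kl_div(log_probs, soft_labels, reduction="batchmean")
    opt.zero_grad()
    loss.backward()
    opt.step()
print(f"final KL: {loss.item():.4f}")
```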
[NLP-100] BERTrend: Neural Topic Modeling for Emerging Trends Detection EMNLP2024
【速读】: 该论文试图解决在大规模、动态变化的文本语料库中检测和跟踪新兴趋势及弱信号的问题,特别是在监控科学文献、管理品牌声誉、监控关键基础设施等应用中。解决方案的关键在于引入了一种名为BERTrend的新方法,该方法利用在线神经主题建模技术,通过考虑文档数量和更新频率来量化主题随时间的流行度。BERTrend引入了一个新的度量标准,将主题分类为噪声、弱信号或强信号,从而标记出新兴且快速增长的主题以供进一步调查。实验结果表明,BERTrend能够准确检测和跟踪有意义的弱信号,同时过滤掉噪声,为大规模、动态文本语料库中的趋势监控提供了一个全面的解决方案。此外,结合大型语言模型(Large Language Models)的使用,BERTrend还能提供趋势事件的可解释性。
链接: https://arxiv.org/abs/2411.05930
作者: Allaa Boutaleb,Jerome Picault,Guillaume Grosjean
关键词-EN: managing brand reputation, surveilling critical infrastructure, Detecting and tracking, monitoring scientific literature, text-based event detection
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 17 pages, 12 figures, FuturED 2024: Workshop on Future of Event Detection (CoLocated with EMNLP 2024)
点击查看摘要
Abstract:Detecting and tracking emerging trends and weak signals in large, evolving text corpora is vital for applications such as monitoring scientific literature, managing brand reputation, surveilling critical infrastructure and more generally to any kind of text-based event detection. Existing solutions often fail to capture the nuanced context or dynamically track evolving patterns over time. BERTrend, a novel method, addresses these limitations using neural topic modeling in an online setting. It introduces a new metric to quantify topic popularity over time by considering both the number of documents and update frequency. This metric classifies topics as noise, weak, or strong signals, flagging emerging, rapidly growing topics for further investigation. Experimentation on two large real-world datasets demonstrates BERTrend’s ability to accurately detect and track meaningful weak signals while filtering out noise, offering a comprehensive solution for monitoring emerging trends in large-scale, evolving text corpora. The method can also be used for retrospective analysis of past events. In addition, the use of Large Language Models together with BERTrend offers efficient means for the interpretability of trends of events.
摘要:在大规模、不断演变的文本语料库中检测和跟踪新兴趋势及微弱信号,对于监控科学文献、管理品牌声誉、监控关键基础设施等应用至关重要,并且在任何基于文本的事件检测中都具有普遍意义。现有解决方案往往无法捕捉微妙的上下文信息或动态跟踪随时间演变的模式。BERTrend 是一种新颖的方法,通过在线环境中的神经主题建模来解决这些局限性。它引入了一种新的指标,通过考虑文档数量和更新频率来量化主题随时间的热度。该指标将主题分类为噪声、弱信号或强信号,标记出新兴且快速增长的主题以供进一步调查。在两个大型真实世界数据集上的实验表明,BERTrend 能够准确检测和跟踪有意义的微弱信号,同时过滤掉噪声,为大规模、演变中的文本语料库中的新兴趋势监控提供了一个全面的解决方案。该方法还可用于对过去事件的回顾性分析。此外,结合大语言模型与 BERTrend 的使用,为事件趋势的可解释性提供了高效的途径。
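摘要所述“同时考虑文档数量与更新频率”的流行度指标,可以用如下假设性的打分与阈值分类来直观理解(衰减系数与阈值均为编造,并非论文公式):

```python
from dataclasses import dataclass

@dataclass
class TopicStats:
    doc_counts: list[int]   # 每个时间片归入该主题的文档数

def popularity(stats: TopicStats, decay: float = 0.8) -> float:
    """随时间指数衰减地累计文档数:越新的时间片权重越高,
    既反映规模(文档数)也反映更新频率(是否持续有新文档)。"""
    score = 0.0
    for count in stats.doc_counts:
        score = decay * score + count
    return score

def classify(score: float, weak: float = 5.0, strong: float = 20.0) -> str:
    if score < weak:
        return "noise"
    return "weak signal" if score < strong else "strong signal"

burst = TopicStats(doc_counts=[0, 1, 2, 8, 15])   # 一个快速增长的主题
print(classify(popularity(burst)))                 # strong signal
```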
[NLP-101] Reducing Distraction in Long-Context Language Models by Focused Learning
【速读】: 该论文试图解决大型语言模型(LLMs)在处理长上下文时由于无关信息过多而导致注意力分散的问题。解决方案的关键在于提出了一种结合检索增强数据扩充和对比学习的训练方法。具体来说,在微调过程中,使用检索器提取最相关的片段作为增强输入,并通过引入辅助的对比学习目标,确保原始上下文和检索子上下文的输出紧密对齐,从而提高模型在长上下文中的信息辨别能力。
链接: https://arxiv.org/abs/2411.05928
作者: Zijun Wu,Bingyuan Liu,Ran Yan,Lei Chen,Thomas Delteil
关键词-EN: Large Language Models, Language Models, Large Language, Recent advancements, advancements in Large
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Recent advancements in Large Language Models (LLMs) have significantly enhanced their capacity to process long contexts. However, effectively utilizing this long context remains a challenge due to the issue of distraction, where irrelevant information dominates lengthy contexts, causing LLMs to lose focus on the most relevant segments. To address this, we propose a novel training method that enhances LLMs’ ability to discern relevant information through a unique combination of retrieval-based data augmentation and contrastive learning. Specifically, during fine-tuning with long contexts, we employ a retriever to extract the most relevant segments, serving as augmented inputs. We then introduce an auxiliary contrastive learning objective to explicitly ensure that outputs from the original context and the retrieved sub-context are closely aligned. Extensive experiments on long single-document and multi-document QA benchmarks demonstrate the effectiveness of our proposed method.
摘要:近年来,大语言模型 (Large Language Models, LLMs) 在处理长上下文方面的能力显著提升。然而,如何有效利用这些长上下文仍然是一个挑战,因为无关信息在长上下文中占据主导地位,导致 LLMs 在处理时失去对最相关部分的聚焦。为了解决这一问题,我们提出了一种新的训练方法,通过结合基于检索的数据增强和对比学习,增强 LLMs 识别相关信息的能力。具体而言,在长上下文的微调过程中,我们使用一个检索器提取最相关的片段,作为增强输入。随后,我们引入一个辅助的对比学习目标,以明确确保原始上下文和检索到的子上下文的输出紧密对齐。在长单文档和多文档问答基准上的广泛实验表明,我们提出的方法具有显著的有效性。
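论文的辅助目标是让“完整上下文”与“检索子上下文”两种输入下的输出保持一致。下面用 KL 散度近似这一“对齐”思想,给出总损失的一种最小写法(并非论文中的对比损失本身,权重为假设值):

```python
import torch
import torch.nn.functional as F

def focused_loss(logits_full: torch.Tensor,
                 logits_retrieved: torch.Tensor,
                 labels: torch.Tensor,
                 align_weight: float = 0.5) -> torch.Tensor:
    """logits_*: (batch, vocab) 的下一词 logits;labels: (batch,)。
    总损失 = 标准语言建模损失 + 两种上下文输出分布的对齐项。"""
    lm_loss = F.cross_entropy(logits_retrieved, labels)
    align = F.kl_div(logits_retrieved.log_softmax(-1),
                     logits_full.softmax(-1),
                     reduction="batchmean")
    return lm_loss + align_weight * align

logits_a = torch.randn(4, 100)   # 完整长上下文下的输出
logits_b = torch.randn(4, 100)   # 检索子上下文下的输出
labels = torch.randint(0, 100, (4,))
print(focused_loss(logits_a, logits_b, labels))
```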
[NLP-102] Towards Multi-Modal Mastery: A 4.5B Parameter Truly Multi-Modal Small Language Model
【速读】: 该论文试图解决多模态输入和输出任务中的复杂问题,提出了一种45亿(4.5B)参数的小型语言模型,能够处理文本、图像、视频和音频等多种模态。解决方案的关键在于利用最新的语言建模和多任务学习技术,构建一个多功能且高性能的模型,即使在边缘设备上也能进行推理。实验结果表明,该模型在多个基准测试中表现出色,为多模态人工智能的进一步发展铺平了道路。
链接: https://arxiv.org/abs/2411.05903
作者: Ben Koska,Mojmír Horváth
关键词-EN: handle multiple input, including text, parameter small language, output modalities, input and output
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:
点击查看摘要
Abstract:We present a novel 4.5B parameter small language model that can handle multiple input and output modalities, including text, images, videos, and audio. Despite its small size, the model achieves near state-of-the-art performance on a variety of tasks, demonstrating the potential of multi-modal models to tackle complex real-world problems. Our approach leverages recent advancements in language modeling and multi-task learning to create a versatile and high-performing model that can even be deployed for edge inference. Experimental results show the model’s strong performance across multiple benchmarks, paving the way for further progress in multi-modal artificial intelligence.
摘要:我们提出了一种新颖的 45 亿(4.5B)参数小型语言模型,该模型能够处理多种输入和输出模态,包括文本、图像、视频和音频。尽管其规模较小,但该模型在多种任务上达到了接近最先进水平的性能,展示了多模态模型解决复杂现实问题的潜力。我们的方法利用了语言建模和多任务学习领域的最新进展,创建了一个多功能且高性能的模型,甚至可以部署于边缘推理场景。实验结果显示,该模型在多个基准测试中表现出色,为多模态人工智能的进一步发展铺平了道路。
[NLP-103] Autoregressive Models in Vision: A Survey
【速读】: 该论文试图解决的问题是如何在计算机视觉领域中应用自回归模型(autoregressive models),并探讨其在不同视觉数据表示层级(如像素级、token级和尺度级)中的表现和应用。解决方案的关键在于对视觉自回归模型的全面分类和分析,包括基于像素、token和尺度的模型框架,以及它们在图像生成、视频生成、3D生成和多模态生成等任务中的应用。此外,论文还探讨了自回归模型与其他生成模型之间的联系,并指出了当前面临的挑战和未来的研究方向。
链接: https://arxiv.org/abs/2411.05902
作者: Jing Xiong,Gongye Liu,Lun Huang,Chengyue Wu,Taiqiang Wu,Yao Mu,Yuan Yao,Hui Shen,Zhongwei Wan,Jinfa Huang,Chaofan Tao,Shen Yan,Huaxiu Yao,Lingpeng Kong,Hongxia Yang,Mi Zhang,Guillermo Sapiro,Jiebo Luo,Ping Luo,Ngai Wong
关键词-EN: natural language processing, autoregressive models, models, Autoregressive, huge success
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Autoregressive modeling has been a huge success in the field of natural language processing (NLP). Recently, autoregressive models have emerged as a significant area of focus in computer vision, where they excel in producing high-quality visual content. Autoregressive models in NLP typically operate on subword tokens. However, the representation strategy in computer vision can vary in different levels, i.e., pixel-level, token-level, or scale-level, reflecting the diverse and hierarchical nature of visual data compared to the sequential structure of language. This survey comprehensively examines the literature on autoregressive models applied to vision. To improve readability for researchers from diverse research backgrounds, we start with preliminary sequence representation and modeling in vision. Next, we divide the fundamental frameworks of visual autoregressive models into three general sub-categories, including pixel-based, token-based, and scale-based models based on the strategy of representation. We then explore the interconnections between autoregressive models and other generative models. Furthermore, we present a multi-faceted categorization of autoregressive models in computer vision, including image generation, video generation, 3D generation, and multi-modal generation. We also elaborate on their applications in diverse domains, including emerging domains such as embodied AI and 3D medical AI, with about 250 related references. Finally, we highlight the current challenges to autoregressive models in vision with suggestions about potential research directions. We have also set up a GitHub repository to organize the papers included in this survey at: this https URL.
摘要:自回归建模在自然语言处理 (NLP) 领域取得了巨大成功。近年来,自回归模型在计算机视觉领域崭露头角,尤其在生成高质量视觉内容方面表现卓越。在 NLP 中,自回归模型通常基于子词 Token 进行操作。然而,计算机视觉中的表示策略可以在不同层次上有所不同,即像素级、Token 级或尺度级,这反映了视觉数据与语言序列结构相比的多样性和层次性。本综述全面审视了应用于视觉的自回归模型的文献。为了提高来自不同研究背景的研究人员的可读性,我们首先介绍了视觉中的初步序列表示和建模。接下来,我们将视觉自回归模型的基本框架分为三大类,包括基于像素的、基于 Token 的和基于尺度的模型,这些分类基于表示策略。随后,我们探讨了自回归模型与其他生成模型之间的联系。此外,我们提出了计算机视觉中自回归模型的多方面分类,包括图像生成、视频生成、3D 生成和多模态生成。我们还详细阐述了它们在多个领域的应用,包括新兴领域如具身 AI 和 3D 医疗 AI,涉及约 250 篇相关文献。最后,我们指出了视觉自回归模型当前面临的挑战,并提出了潜在的研究方向。我们还建立了一个 Github 仓库,用于整理本综述中包含的论文,地址为:\urlthis https URL。
[NLP-104] Humans Continue to Outperform Large Language Models in Complex Clinical Decision-Making: A Study with Medical Calculators
【速读】: 该论文试图解决的问题是评估大型语言模型(LLMs)在临床决策支持任务中的有效性,特别是它们在推荐和使用医学计算器方面的能力。解决方案的关键在于通过比较LLMs和医学实习生在多个临床场景中的表现,评估它们在风险分层、预后评估和疾病诊断等任务中推荐医学计算器的准确性。研究结果表明,尽管最高性能的LLM(如GPT-4o)在回答准确性上达到了74.3%,但人类注释者的平均准确性为79.5%,显示出人类在复杂临床任务中仍优于LLMs。这一发现强调了当前LLMs在理解和计算器知识方面的不足,提示在临床应用中仍需谨慎依赖LLMs。
链接: https://arxiv.org/abs/2411.05897
作者: Nicholas Wan,Qiao Jin,Joey Chan,Guangzhi Xiong,Serina Applebaum,Aidan Gilson,Reid McMurry,R. Andrew Taylor,Aidong Zhang,Qingyu Chen,Zhiyong Lu
关键词-EN: medical licensing exams, large language models, effectively support clinical, support clinical decision-making, remains uncertain
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
点击查看摘要
Abstract:Although large language models (LLMs) have been assessed for general medical knowledge using medical licensing exams, their ability to effectively support clinical decision-making tasks, such as selecting and using medical calculators, remains uncertain. Here, we evaluate the capability of both medical trainees and LLMs to recommend medical calculators in response to various multiple-choice clinical scenarios such as risk stratification, prognosis, and disease diagnosis. We assessed eight LLMs, including open-source, proprietary, and domain-specific models, with 1,009 question-answer pairs across 35 clinical calculators and measured human performance on a subset of 100 questions. While the highest-performing LLM, GPT-4o, provided an answer accuracy of 74.3% (CI: 71.5-76.9%), human annotators, on average, outperformed LLMs with an accuracy of 79.5% (CI: 73.5-85.0%). With error analysis showing that the highest-performing LLMs continue to make mistakes in comprehension (56.6%) and calculator knowledge (8.1%), our findings emphasize that humans continue to surpass LLMs on complex clinical tasks such as calculator recommendation.
摘要:尽管大语言模型 (LLMs) 已通过医学执照考试评估了其一般医学知识,但它们在有效支持临床决策任务(如选择和使用医学计算器)方面的能力仍不确定。在此,我们评估了医学实习生和 LLMs 在应对各种多选临床场景(如风险分层、预后和疾病诊断)时推荐医学计算器的能力。我们评估了八种 LLMs,包括开源、专有和领域特定模型,涉及 35 种临床计算器的 1,009 个问答对,并在 100 个问题子集上测量了人类的表现。尽管表现最佳的 LLM,GPT-4o,提供了 74.3% 的答案准确率(置信区间:71.5-76.9%),但人类标注者的平均准确率达到了 79.5%(置信区间:73.5-85.0%),表现优于 LLMs。通过错误分析显示,表现最佳的 LLMs 在理解(56.6%)和计算器知识(8.1%)方面仍存在错误,这强调了在复杂的临床任务(如计算器推荐)中,人类仍超越 LLMs。
[NLP-105] One Small and One Large for Document-level Event Argument Extraction
【速读】: 该论文试图解决文档级事件论元抽取 (Document-level Event Argument Extraction, EAE) 中由于输入长度增加而导致的两个主要挑战:1) 事件间语义边界的区分困难;2) 冗余信息的干扰。解决方案的关键在于提出了两种方法:首先,引入基于小型语言模型 (Small Language Models, SLMs) 的协同和结构事件论元抽取模型 (Co and Structure Event Argument Extraction model, CsEAE)。CsEAE 包括一个协同出现感知模块,通过上下文标签和协同出现事件提示提取,整合当前输入中所有事件的信息;以及一个结构感知模块,通过建立触发句与其他句子之间的结构关系,减少冗余信息的干扰。其次,引入新的提示将抽取任务转化为适用于大型语言模型 (Large Language Models, LLMs) 的生成任务,解决了在监督微调 (Supervised Fine-Tuning, SFT) 条件下使用 LLMs 进行 EAE 的性能差距。此外,通过在多个数据集上微调 LLMs,并应用 CsEAE 的见解,进一步提升了性能。实验结果表明,CsEAE 在 Rams、WikiEvents 和 MLEE 数据集上的 Arg-C F1 指标分别比基线 PAIE 提高了 2.1%、2.3% 和 3.2%。
链接: https://arxiv.org/abs/2411.05895
作者: Jiaren Peng,Hongda Sun,Wenzhong Yang,Fuyuan Wei,Liang He,Liejun Wang
关键词-EN: Event Argument Extraction, distinguishing semantic boundaries, Structure Event Argument, increased input length, Argument Extraction model
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Document-level Event Argument Extraction (EAE) faces two challenges due to increased input length: 1) difficulty in distinguishing semantic boundaries between events, and 2) interference from redundant information. To address these issues, we propose two methods. The first method introduces the Co and Structure Event Argument Extraction model (CsEAE) based on Small Language Models (SLMs). CsEAE includes a co-occurrences-aware module, which integrates information about all events present in the current input through context labeling and co-occurrences event prompts extraction. Additionally, CsEAE includes a structure-aware module that reduces interference from redundant information by establishing structural relationships between the sentence containing the trigger and other sentences in the document. The second method introduces new prompts to transform the extraction task into a generative task suitable for Large Language Models (LLMs), addressing gaps in EAE performance using LLMs under Supervised Fine-Tuning (SFT) conditions. We also fine-tuned on multiple datasets to develop an LLM that performs better across most datasets. Finally, we applied insights from CsEAE to LLMs, achieving further performance improvements. This suggests that reliable insights validated on SLMs are also applicable to LLMs. We tested our models on the Rams, WikiEvents, and MLEE datasets. The CsEAE model achieved improvements of 2.1%, 2.3%, and 3.2% in the Arg-C F1 metric compared to the baseline, PAIE. For LLMs, we demonstrated that their performance on document-level datasets is comparable to that of SLMs. All code is available at this https URL.
摘要:文档级事件论元抽取 (Event Argument Extraction, EAE) 由于输入长度增加面临两个挑战:1) 难以区分事件之间的语义边界,2) 冗余信息的干扰。为解决这些问题,我们提出了两种方法。第一种方法引入了基于小型语言模型 (Small Language Models, SLMs) 的协同与结构事件论元抽取模型 (Co and Structure Event Argument Extraction model, CsEAE)。CsEAE 包含一个共现感知模块,通过上下文标签和共现事件提示抽取,整合当前输入中所有事件的信息。此外,CsEAE 还包含一个结构感知模块,通过建立触发句与其他句子之间的结构关系,减少冗余信息的干扰。第二种方法引入了新的提示,将抽取任务转化为适合大语言模型 (Large Language Models, LLMs) 的生成任务,解决了在监督微调 (Supervised Fine-Tuning, SFT) 条件下使用 LLMs 进行 EAE 的性能差距。我们还微调了多个数据集,开发了一个在大多数数据集上表现更好的 LLM。最后,我们将 CsEAE 的见解应用于 LLMs,进一步提升了性能。这表明,在 SLMs 上验证的可靠见解同样适用于 LLMs。我们在 Rams、WikiEvents 和 MLEE 数据集上测试了我们的模型。与基线模型 PAIE~\citePAIE 相比,CsEAE 模型在 Arg-C F1 指标上分别提升了 2.1%、2.3% 和 3.2%。对于 LLMs,我们证明了它们在文档级数据集上的性能与 SLMs 相当~\footnote所有代码可在以下链接获取:https URL。
[NLP-106] SSSD: Simply-Scalable Speculative Decoding
【速读】: 该论文试图解决在大规模数据中心中使用推测解码(Speculative Decoding)加速大型语言模型(LLM)推理时,面临的性能不佳和部署复杂性问题。解决方案的关键在于提供了一种理论解释,说明如何在更大的批量大小(≥8)下有效利用推测解码,并引入了一种无需额外训练或复杂部署的小型LLM集成方法。该方法在连续批处理设置下,实现了短上下文生成时4倍的吞吐量提升且无延迟影响,以及长上下文生成时1.7-2倍的延迟和吞吐量改进。
链接: https://arxiv.org/abs/2411.05894
作者: Michele Marzollo,Jiawei Zhuang,Niklas Roemer,Lorenz K. Müller,Lukas Cavigelli
关键词-EN: Large Language Model, Language Model inference, accelerating Large Language, Large Language, Language Model
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 14 pages, 7 figures
点击查看摘要
Abstract:Over the past year, Speculative Decoding has gained popularity as a technique for accelerating Large Language Model inference. While several methods have been introduced, most struggle to deliver satisfactory performance at batch sizes typical for data centers (≥ 8) and often involve significant deployment complexities. In this work, we offer a theoretical explanation of how Speculative Decoding can be effectively utilized with larger batch sizes. We also introduce a method that integrates seamlessly into existing systems without additional training or the complexity of deploying a small LLM. In a continuous batching setting, we achieve a 4x increase in throughput without any latency impact for short context generation, and a 1.7-2x improvement in both latency and throughput for longer contexts.
摘要:在过去的一年中,推测性解码(Speculative Decoding)作为一种加速大语言模型(Large Language Model, LLM)推理的技术,逐渐受到关注。尽管已有多种方法被提出,但大多数方法在数据中心典型批量大小(≥8)下难以提供令人满意的性能,并且通常涉及显著的部署复杂性。在本研究中,我们提供了一个理论解释,说明如何在大批量大小下有效利用推测性解码。我们还提出了一种方法,该方法可以无缝集成到现有系统中,无需额外的训练或部署小型LLM的复杂性。在连续批处理设置中,我们实现了短上下文生成时吞吐量4倍的提升,且无任何延迟影响;对于较长上下文,延迟和吞吐量均提高了1.7-2倍。
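推测性解码的正确性来自其接受/拒绝准则:以 min(1, p/q) 的概率接受草稿 token,拒绝时从残差分布重采样,最终样本严格服从目标分布 p。下面用随机分布验证这一点(单步玩具示例,与论文的系统设计无关):

```python
import numpy as np

rng = np.random.default_rng(0)

def speculative_step(p: np.ndarray, q: np.ndarray) -> int:
    """单个位置的推测采样:q 为草稿模型分布,p 为目标模型分布。
    先按 q 提议,再以 min(1, p/q) 接受;拒绝时从 max(0, p - q)
    的归一化残差分布重采,保证最终样本严格服从 p。"""
    proposal = rng.choice(len(q), p=q)
    if rng.random() < min(1.0, p[proposal] / q[proposal]):
        return proposal                       # 接受草稿 token
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(p), p=residual))  # 拒绝后重采样

vocab = 5
p = rng.dirichlet(np.ones(vocab))             # 假想的目标分布
q = rng.dirichlet(np.ones(vocab))             # 假想的草稿分布
samples = [speculative_step(p, q) for _ in range(10000)]
print(np.bincount(samples, minlength=vocab) / 10000)  # 应接近 p
print(p)
```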
[NLP-107] Identifying and Decomposing Compound Ingredients in Meal Plans Using Large Language Models KR
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在膳食计划中的应用问题,特别是它们识别和分解复合成分的能力。解决方案的关键在于评估不同模型(如GPT-4o、Llama-3 (70b) 和 Mixtral (8x7b))在识别和分解复杂成分组合方面的表现,并识别出在识别关键元素(如调味品和油)时的困难。尽管初步结果显示某些模型在分解成分方面表现出色,但所有模型在识别这些关键元素时都存在挑战。因此,未来的研究需要针对这些局限性进行改进,以提升营养建议的准确性和健康效果。
链接: https://arxiv.org/abs/2411.05892
作者: Leon Kopitar,Leon Bedrac,Larissa J Strath,Jiang Bian,Gregor Stiglic
关键词-EN: Large Language Models, Large Language, decompose compound ingredients, effectiveness of Large, meal planning
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Comments: Presented at NeLaMKRR@KR, 2024 ( arXiv:2410.05339 )
点击查看摘要
Abstract:This study explores the effectiveness of Large Language Models in meal planning, focusing on their ability to identify and decompose compound ingredients. We evaluated three models-GPT-4o, Llama-3 (70b), and Mixtral (8x7b)-to assess their proficiency in recognizing and breaking down complex ingredient combinations. Preliminary results indicate that while Llama-3 (70b) and GPT-4o excels in accurate decomposition, all models encounter difficulties with identifying essential elements like seasonings and oils. Despite strong overall performance, variations in accuracy and completeness were observed across models. These findings underscore LLMs’ potential to enhance personalized nutrition but highlight the need for further refinement in ingredient decomposition. Future research should address these limitations to improve nutritional recommendations and health outcomes.
摘要:本研究探讨了大语言模型在膳食规划中的有效性,重点考察其识别和分解复合成分的能力。我们评估了三种模型——GPT-4o、Llama-3 (70b) 和 Mixtral (8x7b)——以评估它们在识别和分解复杂成分组合方面的熟练程度。初步结果显示,尽管 Llama-3 (70b) 和 GPT-4o 在准确分解方面表现出色,但所有模型在识别如调味品和油等基本元素时均遇到困难。尽管整体表现强劲,各模型在准确性和完整性方面仍存在差异。这些发现强调了大语言模型在增强个性化营养方面的潜力,但同时也突显了在成分分解方面进一步改进的必要性。未来的研究应针对这些局限性进行改进,以提升营养建议和健康结果。
[NLP-108] When are 1.58 bits enough? A Bottom-up Exploration of BitNet Quantization
【速读】: 该论文试图解决当代机器学习模型(如语言模型)在训练和推理过程中资源需求巨大的问题。解决方案的关键在于探索并验证了在多层感知器(MLP)和图神经网络(GNN)等非Transformer模型架构中,以及在Transformer基础的编码器-解码器模型中,使用1.58位权重(ternary weights)进行训练的可行性和有效性。研究结果表明,在所有这些模型中,1.58位训练在性能上与标准32/16位模型相当,甚至在某些情况下表现更优,从而实现了高效的推理和资源利用。
链接: https://arxiv.org/abs/2411.05882
作者: Jacob Nielsen,Lukas Galke,Peter Schneider-Kamp
关键词-EN: Contemporary machine learning, immense resource requirements, Contemporary machine, machine learning models, machine learning
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 10 pages, 2 tables, 6 figures
点击查看摘要
Abstract:Contemporary machine learning models, such as language models, are powerful, but come with immense resource requirements both at training and inference time. It has been shown that decoder-only language models can be trained to a competitive state with ternary weights (1.58 bits per weight), facilitating efficient inference. Here, we start our exploration with non-transformer model architectures, investigating 1.58-bit training for multi-layer perceptrons and graph neural networks. Then, we explore 1.58-bit training in other transformer-based language models, namely encoder-only and encoder-decoder models. Our results show that in all of these settings, 1.58-bit training is on par with or sometimes even better than the standard 32/16-bit models.
摘要:当代机器学习模型,如语言模型,虽然功能强大,但在训练和推理阶段都需要大量的资源。已有研究表明,仅解码器的语言模型可以通过三元权重(每权重1.58比特)进行训练,达到竞争水平,从而实现高效的推理。在此,我们从非Transformer模型架构开始探索,研究多层感知器和图神经网络的1.58比特训练。接着,我们进一步探索基于Transformer的其他语言模型,即仅编码器和编码器-解码器模型的1.58比特训练。我们的实验结果表明,在所有这些情况下,1.58比特训练的表现与标准32/16比特模型相当,有时甚至更优。
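BitNet 风格的 1.58 比特是指每个权重只取 {-1, 0, +1} 三值(log2(3) ≈ 1.58)。下面是常见 absmean 三值化的一个示意实现(训练时通常还需配合直通估计器,此处仅展示前向量化本身):

```python
import torch

def ternarize(w: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """absmean 三值化:先按权重绝对值均值缩放,再四舍五入并
    截断到 {-1, 0, +1},最后乘回缩放因子。"""
    scale = w.abs().mean().clamp(min=eps)
    return (w / scale).round().clamp(-1, 1) * scale

w = torch.randn(4, 4)
w_q = ternarize(w)
# 去掉缩放因子后,取值只会落在 {-1, 0, 1} 之中
print(torch.unique(w_q / w.abs().mean()))
```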
[NLP-109] Generative Adapter: Contextualizing Language Models in Parameters with A Single Forward Pass
【速读】: 该论文试图解决大型语言模型(LMs)在新任务或领域中适应时面临的精度和计算成本之间的权衡问题。传统方法如微调(fine-tuning)和提示(prompting)分别存在显著的训练成本和推理开销。论文提出的解决方案是 GenerativeAdapter,这是一种高效且有效的适应方法,通过直接将新上下文映射到低秩的LM适配器(adapters),从而显著减少推理开销,且无需进行微调。关键在于,适配器生成器通过自监督学习进行训练,能够将任何新任务或领域上下文映射到新的适配器,从而实现对单一冻结LM的快速适应。实验结果表明,GenerativeAdapter在知识获取、从示范中学习和用户个性化等多个适应场景中表现优异,特别是在StreamingQA任务中,F1分数提升了63.5%,且在MetaICL和MSC任务中均表现出显著的计算和内存成本优势。
链接: https://arxiv.org/abs/2411.05877
作者: Tong Chen,Hao Fang,Patrick Xia,Xiaodong Liu,Benjamin Van Durme,Luke Zettlemoyer,Jianfeng Gao,Hao Cheng
关键词-EN: Large language models, Large language, improve performance, text prompts, prompts that define
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注:
点击查看摘要
Abstract:Large language models (LMs) are typically adapted to improve performance on new contexts (e.g., text prompts that define new tasks or domains) through fine-tuning or prompting. However, there is an accuracy-compute tradeoff – fine-tuning incurs significant training cost and prompting increases inference overhead. We introduce GenerativeAdapter, an effective and efficient adaptation method that directly maps new contexts to low-rank LM adapters, thereby significantly reducing inference overhead with no need for finetuning. The adapter generator is trained via self-supervised learning, and can be used to adapt a single frozen LM for any new task simply by mapping the associated task or domain context to a new adapter. We apply GenerativeAdapter to two pretrained LMs (Mistral-7B-Instruct and Llama2-7B-Chat) and evaluate the adapted models in three adaptation scenarios: knowledge acquisition from documents, learning from demonstrations, and personalization for users. In StreamingQA, our approach is effective in injecting knowledge into the LM’s parameters, achieving a 63.5% improvement in F1 score over the model with supervised fine-tuning (from 19.5 to 31.5) for contexts as long as 32K tokens. In the MetaICL in-context learning evaluation, our method achieves an average accuracy of 44.9 across 26 tasks, outperforming the base model. On MSC, our method proves to be highly competitive in memorizing user information from conversations with a 4x reduction in computation and memory costs compared to prompting with full conversation history. Together, these results suggest that GenerativeAdapter should allow for general adaptation to a wide range of different contexts.
摘要:大语言模型 (LMs) 通常通过微调 (fine-tuning) 或提示 (prompting) 来适应新的上下文(例如定义新任务或领域的文本提示)。然而,这涉及精度和计算成本之间的权衡——微调会产生显著的训练成本,而提示则增加了推理开销。我们引入了 生成式适配器 (GenerativeAdapter),这是一种高效且有效的适应方法,它直接将新上下文映射到低秩的 LM 适配器,从而显著减少推理开销,且无需进行微调。适配器生成器通过自监督学习进行训练,可以简单地通过将相关任务或领域上下文映射到新的适配器,来适应单一的冻结 LM 模型。我们将 生成式适配器 应用于两个预训练的 LMs(Mistral-7B-Instruct 和 Llama2-7B-Chat),并在三种适应场景中评估了适应后的模型:从文档中获取知识、从演示中学习以及用户个性化。在 StreamingQA 中,我们的方法有效地将知识注入到 LM 的参数中,对于长达 32K Token 的上下文,F1 分数比监督微调模型提高了 63.5%(从 19.5 提升到 31.5)。在 MetaICL 的上下文学习评估中,我们的方法在 26 个任务中平均准确率达到 44.9,优于基础模型。在 MSC 中,我们的方法在记忆用户对话信息方面表现出高度竞争力,与使用完整对话历史进行提示相比,计算和内存成本降低了 4 倍。这些结果共同表明,生成式适配器 应能广泛适应各种不同的上下文。
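其核心是把上下文一次前向映射成低秩适配器(LoRA 形式的 B@A),从而免去推理时反复携带长提示。以下玩具版只演示这一映射的形状关系,各维度均为假设,与论文的生成器结构无关:

```python
import torch
import torch.nn as nn

class AdapterGenerator(nn.Module):
    """把上下文向量映射为一个低秩权重更新 Delta W = B @ A(LoRA 形式)。"""

    def __init__(self, ctx_dim: int = 256, d_model: int = 64, rank: int = 4):
        super().__init__()
        self.d_model, self.rank = d_model, rank
        self.to_A = nn.Linear(ctx_dim, rank * d_model)
        self.to_B = nn.Linear(ctx_dim, d_model * rank)

    def forward(self, ctx: torch.Tensor) -> torch.Tensor:
        A = self.to_A(ctx).view(self.rank, self.d_model)   # (r, d)
        B = self.to_B(ctx).view(self.d_model, self.rank)   # (d, r)
        return B @ A                                       # (d, d) 低秩更新

gen = AdapterGenerator()
frozen_W = torch.randn(64, 64)            # 冻结 LM 中的某个权重矩阵
ctx = torch.randn(256)                    # 新任务 / 新文档的上下文编码
adapted_W = frozen_W + gen(ctx)           # 一次前向即得“已适配”的权重
print(adapted_W.shape)                    # torch.Size([64, 64])
```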
[NLP-110] Towards Improved Preference Optimization Pipeline: from Data Generation to Budget-Controlled Regularization
【速读】: 该论文试图解决直接偏好优化 (Direct Preference Optimization, DPO) 在大型语言模型 (LLMs) 对齐人类偏好或特定目标时所面临的高质量偏好数据需求和优化不稳定性的问题。解决方案的关键在于两个方面:首先,论文提出了一种迭代成对排序机制 (iterative pairwise ranking mechanism),用于生成高质量的偏好数据,通过成对比较信号来推导完成项的偏好排序,从而解决现有评分型奖励模型在分布外任务上表现不佳的问题。其次,论文设计了一种预算控制的正则化方法 (budget-controlled regularization formulation),以允许在训练过程中适度降低偏好样本的预测似然性,从而改善偏好优化的收敛性,这与广泛使用的监督下词预测正则化方法形成对比。通过结合这两种设计,论文展示了其方法在两个流行基准测试中超越了现有的最先进模型。
链接: https://arxiv.org/abs/2411.05875
作者: Zhuotong Chen,Fang Liu,Jennifer Zhu,Wanyu Du,Yanjun Qi
关键词-EN: Direct Preference Optimization, aligning large language, preference data generation, Direct Preference, preference data
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 15 pages
点击查看摘要
Abstract:Direct Preference Optimization (DPO) and its variants have become the de facto standards for aligning large language models (LLMs) with human preferences or specific goals. However, DPO requires high-quality preference data and suffers from unstable preference optimization. In this work, we aim to improve the preference optimization pipeline by taking a closer look at preference data generation and training regularization techniques. For preference data generation, we demonstrate that existing scoring-based reward models produce unsatisfactory preference data and perform poorly on out-of-distribution tasks. This significantly impacts the LLM alignment performance when using these data for preference tuning. To ensure high-quality preference data generation, we propose an iterative pairwise ranking mechanism that derives preference ranking of completions using pairwise comparison signals. For training regularization, we observe that preference optimization tends to achieve better convergence when the LLM predicted likelihood of preferred samples gets slightly reduced. However, the widely used supervised next-word prediction regularization strictly prevents any likelihood reduction of preferred samples. This observation motivates our design of a budget-controlled regularization formulation. Empirically we show that combining the two designs leads to aligned models that surpass existing SOTA across two popular benchmarks.
摘要:直接偏好优化 (Direct Preference Optimization, DPO) 及其变体已成为将大语言模型 (Large Language Models, LLMs) 与人类偏好或特定目标对齐的实际标准。然而,DPO 需要高质量的偏好数据,并且在偏好优化过程中存在不稳定性。在本研究中,我们旨在通过深入探讨偏好数据生成和训练正则化技术来改进偏好优化流程。对于偏好数据生成,我们发现现有基于评分的奖励模型生成的偏好数据质量不佳,并且在分布外任务上表现较差。这显著影响了使用这些数据进行偏好调优时的 LLM 对齐性能。为了确保高质量的偏好数据生成,我们提出了一种迭代成对排序机制,该机制利用成对比较信号推导出完成项的偏好排序。在训练正则化方面,我们观察到,当 LLM 预测的偏好样本似然性略有降低时,偏好优化往往能实现更好的收敛。然而,广泛使用的监督式下一词预测正则化严格阻止了偏好样本似然性的任何降低。这一观察结果促使我们设计了一种预算控制的正则化公式。实证结果表明,结合这两种设计,对齐模型在两个流行的基准测试中均超越了现有的最先进水平。
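迭代成对排序可以理解为:用成对比较信号(而非绝对打分)对候选补全排序。下面用一个假想的成对比较器演示这一流程(实际系统中比较器由奖励模型或 LLM 评审充当,这里用文本长度冒充“质量”,仅为演示):

```python
import functools

def pairwise_judge(a: str, b: str) -> int:
    """假想的成对比较器:返回负数表示 a 应排在 b 前面。
    实际中应由奖励模型或 LLM 评审给出胜负。"""
    return -1 if len(a) > len(b) else (1 if len(a) < len(b) else 0)

def rank_completions(completions: list[str]) -> list[str]:
    """用成对比较信号对候选补全排序(最优在前)。"""
    return sorted(completions, key=functools.cmp_to_key(pairwise_judge))

candidates = ["short", "a much longer and detailed answer", "medium reply"]
print(rank_completions(candidates))
```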
[NLP-111] Dialectal Coverage And Generalization in Arabic Speech Recognition
【速读】: 该论文试图解决阿拉伯语自动语音识别 (ASR) 系统在面对丰富的方言多样性和低资源语言特性时的鲁棒性问题。解决方案的关键在于探索三个影响ASR性能的关键因素:预训练中方言覆盖率的作用、方言特定微调与多方言方法的有效性对比,以及对未见方言的泛化能力。通过在不同方言组合上的广泛实验,研究提供了推进多中心语言(如阿拉伯语)ASR系统开发的关键见解。
链接: https://arxiv.org/abs/2411.05872
作者: Amirbek Djanibekov,Hawau Olamide Toyin,Raghad Alshalan,Abdullah Alitr,Hanan Aldarmaki
关键词-EN: Developing robust automatic, demands effective strategies, automatic speech recognition, robust automatic speech, Developing robust
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:
点击查看摘要
Abstract:Developing robust automatic speech recognition (ASR) systems for Arabic, a language characterized by its rich dialectal diversity and often considered a low-resource language in speech technology, demands effective strategies to manage its complexity. This study explores three critical factors influencing ASR performance: the role of dialectal coverage in pre-training, the effectiveness of dialect-specific fine-tuning compared to a multi-dialectal approach, and the ability to generalize to unseen dialects. Through extensive experiments across different dialect combinations, our findings offer key insights towards advancing the development of ASR systems for pluricentric languages like Arabic.
摘要:为阿拉伯语开发鲁棒的自动语音识别 (ASR) 系统,这一语言以其丰富的方言多样性著称,并且在语音技术领域常被视为低资源语言,需要有效的策略来管理其复杂性。本研究探讨了影响 ASR 性能的三个关键因素:预训练中对方言覆盖率的作用、方言特定微调与多方言方法的有效性对比,以及对未见方言的泛化能力。通过在不同方言组合上的广泛实验,我们的研究结果为推进像阿拉伯语这样的多中心语言的 ASR 系统开发提供了关键见解。
[NLP-112] LEGO-GraphRAG: Modularizing Graph-based Retrieval-Augmented Generation for Design Space Exploration
【速读】: 该论文试图解决在基于图的检索增强生成 (RAG) 中缺乏统一框架和系统分类的问题。解决方案的关键在于提出了 LEGO-GraphRAG 模块化框架,该框架将图知识检索过程细分为三个相互关联的模块:子图提取 (subgraph-extraction)、路径过滤 (path-filtering) 和路径优化 (path-refinement)。通过系统地总结和分类每个模块的相关算法和神经网络模型 (NN),论文为 GraphRAG 实例的设计空间提供了更清晰的认识,并识别了影响 GraphRAG 实现效果的关键设计因素,如图耦合 (Graph Coupling) 和计算成本 (Computational Cost)。通过广泛的实证研究,论文构建了高质量的 GraphRAG 实例,并分析了它们对检索和推理性能的影响,从而为优化 GraphRAG 实例设计提供了重要见解。
链接: https://arxiv.org/abs/2411.05844
作者: Yukun Cao,Zengyi Gao,Zhiyang Li,Xike Xie,S Kevin Zhou
关键词-EN: Large Language Models, Large Language, addresses significant challenges, Retrieval-Augmented Generation, capabilities of Large
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:GraphRAG addresses significant challenges in Retrieval-Augmented Generation (RAG) by leveraging graphs with embedded knowledge to enhance the reasoning capabilities of Large Language Models (LLMs). Despite its promising potential, the GraphRAG community currently lacks a unified framework for fine-grained decomposition of the graph-based knowledge retrieval process. Furthermore, there is no systematic categorization or evaluation of existing solutions within the retrieval process. In this paper, we present LEGO-GraphRAG, a modular framework that decomposes the retrieval process of GraphRAG into three interconnected modules: subgraph-extraction, path-filtering, and path-refinement. We systematically summarize and classify the algorithms and neural network (NN) models relevant to each module, providing a clearer understanding of the design space for GraphRAG instances. Additionally, we identify key design factors, such as Graph Coupling and Computational Cost, that influence the effectiveness of GraphRAG implementations. Through extensive empirical studies, we construct high-quality GraphRAG instances using a representative selection of solutions and analyze their impact on retrieval and reasoning performance. Our findings offer critical insights into optimizing GraphRAG instance design, ultimately contributing to the advancement of more accurate and contextually relevant LLM applications.
摘要:GraphRAG 通过利用嵌入知识的图结构来增强大语言模型(Large Language Models, LLMs)的推理能力,从而解决了检索增强生成(Retrieval-Augmented Generation, RAG)中的重大挑战。尽管 GraphRAG 具有巨大的潜力,但目前该领域缺乏一个统一的框架来对基于图的知识检索过程进行细粒度的分解。此外,对于检索过程中的现有解决方案,也没有系统的分类或评估。在本文中,我们提出了 LEGO-GraphRAG,这是一个模块化框架,将 GraphRAG 的检索过程分解为三个相互关联的模块:子图提取(subgraph-extraction)、路径过滤(path-filtering)和路径优化(path-refinement)。我们系统地总结和分类了与每个模块相关的算法和神经网络(Neural Network, NN)模型,从而为 GraphRAG 实例的设计空间提供了更清晰的认识。此外,我们识别了影响 GraphRAG 实现效果的关键设计因素,如图耦合(Graph Coupling)和计算成本(Computational Cost)。通过广泛的实证研究,我们使用代表性的解决方案构建了高质量的 GraphRAG 实例,并分析了它们对检索和推理性能的影响。我们的研究结果为优化 GraphRAG 实例设计提供了关键见解,最终有助于推动更准确和上下文相关的大语言模型应用的发展。
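按论文的模块划分,检索流程可写成“子图提取 → 路径过滤 → 路径优化”三段可插拔的流水线。以下骨架只示意模块接口,函数体均为占位实现,与论文中的具体算法无关:

```python
from typing import Callable

Graph = dict[str, list[str]]          # 邻接表表示的知识图谱
Path = list[str]

def extract_subgraph(g: Graph, seeds: list[str], hops: int = 2) -> Graph:
    """子图提取:从问题实体出发做 k 跳邻域扩展(占位实现)。"""
    keep = set(seeds)
    frontier = set(seeds)
    for _ in range(hops):
        frontier = {n for u in frontier for n in g.get(u, [])}
        keep |= frontier
    return {u: [v for v in g.get(u, []) if v in keep] for u in keep}

def filter_paths(g: Graph, start: str, max_len: int = 3) -> list[Path]:
    """路径过滤:枚举短路径作为候选(真实系统会用打分模型筛选)。"""
    paths, stack = [], [[start]]
    while stack:
        path = stack.pop()
        paths.append(path)
        if len(path) < max_len:
            stack.extend(path + [n] for n in g.get(path[-1], []) if n not in path)
    return paths

def refine(paths: list[Path], score: Callable[[Path], float]) -> Path:
    """路径优化:按打分函数挑出最有用的一条证据路径。"""
    return max(paths, key=score)

kg = {"Euler": ["Basel", "graph theory"], "Basel": ["Switzerland"], "graph theory": []}
sub = extract_subgraph(kg, ["Euler"])
best = refine(filter_paths(sub, "Euler"), score=len)  # 示意:以路径长度为分数
print(best)  # ['Euler', 'Basel', 'Switzerland']
```

这种接口化的写法也正对应论文“对每个模块的算法与 NN 模型分别替换、组合评估”的实验思路。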
[NLP-113] Hierarchical Sentiment Analysis Framework for Hate Speech Detection: Implementing Binary and Multiclass Classification Strategy
【速读】: 该论文试图解决社交媒体上仇恨言论自动检测中的一个关键问题,即如何有效区分仇恨言论与普通和冒犯性语言。解决方案的关键在于提出了一种新的多任务模型,该模型结合了共享情感表示,并利用了基于Transformer的模型(如Hugging Face提供的模型)和情感分析技术,以提高检测的准确性并减少误报。通过这种方式,模型能够更精确地识别仇恨言论,同时避免将包含某些词汇的普通信息错误归类为仇恨言论。
链接: https://arxiv.org/abs/2411.05819
作者: Faria Naznin,Md Touhidur Rahman,Shahran Rahman Alve
关键词-EN: hate speech, significant challenge, challenge in automating, social media, media is distinguishing
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 20 Pages
点击查看摘要
Abstract:A significant challenge in automating hate speech detection on social media is distinguishing hate speech from regular and offensive language. Hate speech is an essential category of content that web filters seek to remove, and only automated methods can manage this volume of daily data. To solve this problem, the Natural Language Processing community is currently investigating different approaches to hate speech detection. Previous approaches (e.g., Convolutional Neural Networks, multi-channel BERT models, and lexical detection) have typically achieved low precision without carefully treating related tasks like sentiment analysis and emotion classification; they tend to group all messages containing specific words as hate speech simply because those terms often appear alongside hateful rhetoric. In this research, we present a hate speech text classification system built upon deep learning and machine learning. We propose a new multitask model integrated with shared emotional representations to detect hate speech in English. The Transformer-based model from Hugging Face, combined with sentiment analysis, helped us prevent false positives. We conclude that utilizing sentiment analysis and a Transformer-based trained model considerably improves hate speech detection across multiple datasets.
摘要:在自动化社交媒体上的仇恨言论检测中,一个重大挑战是如何区分仇恨言论与常规及冒犯性语言。这些内容是网络过滤器试图移除的重要类别。只有自动化方法才能处理如此庞大的每日数据量。为了解决这一问题,自然语言处理社区目前正在探索不同的仇恨言论检测方法。除了这些方法外,以往的方案(如卷积神经网络、多通道 BERT 模型和词汇检测)在未仔细处理情感分析和情绪分类等关联任务的情况下,总是达到较低的精确度。它们仍然倾向于将包含特定词汇的所有消息归类为仇恨言论,仅仅因为这些词汇经常与仇恨言论同时出现。在本研究中,我们的论文提出了基于深度学习和机器学习的仇恨言论文本分类系统模型。本文中,我们提出了一种新的多任务模型,该模型集成了共享的情绪表示,用于跨英语语言检测仇恨言论。我们使用的 Hugging Face 的 Transformer 模型和情感分析帮助我们防止了误报。结论。我们得出结论,利用情感分析和基于 Transformer 的训练模型显著提高了跨多个数据集的仇恨言论检测效果。
[NLP-114] LA4SR: illuminating the dark proteome with generative AI
【速读】: 该论文试图解决微生物序列分类的问题,特别是针对藻类暗蛋白组(algal dark proteome)的分类。解决方案的关键在于重新工程化开源语言模型(如GPT-2、BLOOM、DistilRoBERTa、ELECTRA和Mamba),使其适用于微生物序列分类任务。这些模型在分类任务中表现出色,F1分数高达95,且比BLASTP快16,580倍,召回率提高2.9倍。特别地,1B参数的LA4SR模型在训练数据不足2%的情况下仍能达到86的F1分数,显示出强大的泛化能力。此外,模型对序列末端信息的完整性或混乱性具有鲁棒性,表明其能够有效处理不完整的序列。最后,论文还提供了自定义的AI解释性软件工具,用于将氨基酸模式归因于AI生成过程,并在进化和生物物理背景下解释其输出。
链接: https://arxiv.org/abs/2411.06798
作者: David R. Nelson,Ashish Kumar Jaiswal,Noha Ismail,Alexandra Mystikou,Kourosh Salehi-Ashtiani
关键词-EN: biological sequence analysis, show promise, promise for biological, sequence analysis, re-engineered open-source LMs
类目: Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Quantitative Methods (q-bio.QM)
备注:
点击查看摘要
Abstract:AI language models (LMs) show promise for biological sequence analysis. We re-engineered open-source LMs (GPT-2, BLOOM, DistilRoBERTa, ELECTRA, and Mamba, ranging from 70M to 12B parameters) for microbial sequence classification. The models achieved F1 scores up to 95 and operated 16,580x faster and at 2.9x the recall of BLASTP. They effectively classified the algal dark proteome - uncharacterized proteins comprising about 65% of total proteins - validated on new data including a new, complete Hi-C/Pacbio Chlamydomonas genome. Larger (1B) LA4SR models reached high accuracy (F1 86) when trained on less than 2% of available data, rapidly achieving strong generalization capacity. High accuracy was achieved when training data had intact or scrambled terminal information, demonstrating robust generalization to incomplete sequences. Finally, we provide custom AI explainability software tools for attributing amino acid patterns to AI generative processes and interpret their outputs in evolutionary and biophysical contexts.
摘要:AI 语言模型 (Language Models, LMs) 在生物序列分析中展现出潜力。我们重新设计了开源的 LMs(包括 GPT-2、BLOOM、DistilRoBERTa、ELECTRA 和 Mamba,参数规模从 70M 到 12B 不等)用于微生物序列分类。这些模型达到了高达 95 的 F1 分数,并且比 BLASTP 快 16,580 倍,召回率提高了 2.9 倍。它们有效地分类了藻类暗蛋白组——约占蛋白质总数 65% 的未表征蛋白质——并在包括新的完整 Hi-C/Pacbio Chlamydomonas 基因组在内的新数据上得到了验证。当训练数据量不足可用数据的 2% 时,更大的 (1B) LA4SR 模型达到了高准确度(F1 86),迅速实现了强大的泛化能力。当训练数据包含完整或打乱的末端信息时,模型仍能实现高准确度,表明其对不完整序列具有稳健的泛化能力。最后,我们提供了定制的 AI 可解释性软件工具,用于将氨基酸模式归因于 AI 生成过程,并在进化和生物物理背景下解释其输出。
[NLP-115] CTC-Assisted LLM-Based Contextual ASR
【速读】: 该论文试图解决当前端到端(E2E)自动语音识别(ASR)系统在识别稀有词汇时面临的挑战,特别是复杂架构和解码机制导致的性能受限及干扰词影响的问题。解决方案的关键在于提出了一种基于CTC辅助的大语言模型(LLM)上下文ASR模型,并引入了一种高效的过滤算法。通过利用粗略的CTC解码结果来筛选潜在的相关热词,并将这些热词整合到LLM的提示输入中,该模型在Librispeech测试集上实现了显著的性能提升,特别是在识别长尾稀有词汇方面,WER/B-WER分别达到了1.27%/3.67%和2.72%/8.02%,显著优于基线LLM-based ASR模型及其他相关工作。此外,该模型在处理多达2000个偏置词时仍能保持良好性能。
链接: https://arxiv.org/abs/2411.06437
作者: Guanrou Yang,Ziyang Ma,Zhifu Gao,Shiliang Zhang,Xie Chen
关键词-EN: customization holds substantial, holds substantial practical, Contextual ASR, Contextual ASR model, ASR
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: SLT 2024
点击查看摘要
Abstract:Contextual ASR or hotword customization holds substantial practical value. Despite the impressive performance of current end-to-end (E2E) automatic speech recognition (ASR) systems, they often face challenges in accurately recognizing rare words. Typical E2E contextual ASR models commonly feature complex architectures and decoding mechanisms, limited in performance and susceptible to interference from distractor words. With large language model (LLM)-based ASR models emerging as the new mainstream, we propose a CTC-Assisted LLM-Based Contextual ASR model with an efficient filtering algorithm. By using coarse CTC decoding results to filter potential relevant hotwords and incorporating them into LLM prompt input, our model attains WER/B-WER of 1.27%/3.67% and 2.72%/8.02% on the Librispeech test-clean and test-other sets targeting on recognizing rare long-tail words, demonstrating significant improvements compared to the baseline LLM-based ASR model, and substantially surpassing other related work. More remarkably, with the help of the large language model and proposed filtering algorithm, our contextual ASR model still performs well with 2000 biasing words.
摘要:上下文自动语音识别(ASR)或热词定制具有重要的实际价值。尽管当前的端到端(E2E)自动语音识别系统表现出色,但在准确识别罕见词汇方面仍面临挑战。典型的E2E上下文ASR模型通常具有复杂的架构和解码机制,性能受限且易受干扰词的影响。随着基于大语言模型(LLM)的ASR模型成为新的主流,我们提出了一种CTC辅助的LLM基础上下文ASR模型,并配备了一种高效的过滤算法。通过使用粗略的CTC解码结果来筛选潜在的相关热词,并将其整合到LLM的提示输入中,我们的模型在Librispeech test-clean和test-other数据集上分别达到了1.27%/3.67%和2.72%/8.02%的WER/B-WER,针对识别罕见的长尾词汇,相较于基线LLM基础ASR模型显示出显著的改进,并大幅超越了其他相关工作。更为显著的是,借助大语言模型和提出的过滤算法,我们的上下文ASR模型在处理2000个偏置词时仍表现出色。
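其过滤算法的直觉是:先拿粗略的 CTC 解码文本,用近似匹配筛出可能相关的热词,再拼进 LLM 提示。下面用编辑距离滑窗做一个示意过滤器(阈值与提示模板均为假设,并非论文的实际算法):

```python
def edit_distance(a: str, b: str) -> int:
    """标准 Levenshtein 距离(滚动数组动态规划)。"""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

def filter_hotwords(ctc_text: str, hotwords: list[str], max_dist: int = 1) -> list[str]:
    """对 CTC 粗解码结果做滑窗匹配,保留编辑距离足够小的热词。"""
    tokens = ctc_text.split()
    kept = []
    for hw in hotwords:
        n = len(hw.split())
        windows = (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        if any(edit_distance(w.lower(), hw.lower()) <= max_dist for w in windows):
            kept.append(hw)
    return kept

ctc = "please call docter smith tomorrow"
bias_list = ["doctor smith", "geneva", "tomorrow"]
relevant = filter_hotwords(ctc, bias_list)
print(relevant)  # ['doctor smith', 'tomorrow']
print(f"热词: {', '.join(relevant)}\n请校正转写: {ctc}")  # 拼入 LLM 提示的示意
```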
人工智能
[AI-0] Add-it: Training-Free Object Insertion in Images With Pretrained Diffusion Models
链接: https://arxiv.org/abs/2411.07232
作者: Yoad Tewel,Rinon Gal,Dvir Samuel,Yuval Atzmon,Lior Wolf,Gal Chechik
关键词-EN: semantic image editing, challenging task, task in semantic, preserving the original, seamlessly integrating
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
备注: Project page is at this https URL
点击查看摘要
Abstract:Adding objects to images based on text instructions is a challenging task in semantic image editing, requiring a balance between preserving the original scene and seamlessly integrating the new object in a fitting location. Despite extensive efforts, existing models often struggle with this balance, particularly with finding a natural location for adding an object in complex scenes. We introduce Add-it, a training-free approach that extends diffusion models’ attention mechanisms to incorporate information from three key sources: the scene image, the text prompt, and the generated image itself. Our weighted extended-attention mechanism maintains structural consistency and fine details while ensuring natural object placement. Without task-specific fine-tuning, Add-it achieves state-of-the-art results on both real and generated image insertion benchmarks, including our newly constructed “Additing Affordance Benchmark” for evaluating object placement plausibility, outperforming supervised methods. Human evaluations show that Add-it is preferred in over 80% of cases, and it also demonstrates improvements in various automated metrics.
[AI-1] Tooling or Not Tooling? The Impact of Tools on Language Agents for Chemistry Problem Solving
链接: https://arxiv.org/abs/2411.07228
作者: Botao Yu,Frazier N. Baker,Ziru Chen,Garrett Herb,Boyu Gou,Daniel Adu-Ampratwum,Xia Ning,Huan Sun
关键词-EN: large language models, enhance large language, chemistry problem solving, LLM-based agents augmented, language models
类目: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注:
点击查看摘要
Abstract:To enhance large language models (LLMs) for chemistry problem solving, several LLM-based agents augmented with tools have been proposed, such as ChemCrow and Coscientist. However, their evaluations are narrow in scope, leaving a large gap in understanding the benefits of tools across diverse chemistry tasks. To bridge this gap, we develop ChemAgent, an enhanced chemistry agent over ChemCrow, and conduct a comprehensive evaluation of its performance on both specialized chemistry tasks and general chemistry questions. Surprisingly, ChemAgent does not consistently outperform its base LLMs without tools. Our error analysis with a chemistry expert suggests that: For specialized chemistry tasks, such as synthesis prediction, we should augment agents with specialized tools; however, for general chemistry questions like those in exams, agents’ ability to reason correctly with chemistry knowledge matters more, and tool augmentation does not always help.
[AI-2] Grounding Video Models to Actions through Goal Conditioned Exploration
链接: https://arxiv.org/abs/2411.07223
作者: Yunhao Luo,Yilun Du
关键词-EN: Large video models, amounts of Internet, Large video, Internet video, pretrained on massive
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Project page at this https URL
点击查看摘要
Abstract:Large video models, pretrained on massive amounts of Internet video, provide a rich source of physical knowledge about the dynamics and motions of objects and tasks. However, video models are not grounded in the embodiment of an agent, and do not describe how to actuate the world to reach the visual states depicted in a video. To tackle this problem, current methods use a separate vision-based inverse dynamic model trained on embodiment-specific data to map image states to actions. Gathering data to train such a model is often expensive and challenging, and this model is limited to visual settings similar to the ones in which data are available. In this paper, we investigate how to directly ground video models to continuous actions through self-exploration in the embodied environment – using generated video states as visual goals for exploration. We propose a framework that uses trajectory level action generation in combination with video guidance to enable an agent to solve complex tasks without any external supervision, e.g., rewards, action labels, or segmentation masks. We validate the proposed approach on 8 tasks in Libero, 6 tasks in MetaWorld, 4 tasks in Calvin, and 12 tasks in iThor Visual Navigation. We show how our approach is on par with or even surpasses multiple behavior cloning baselines trained on expert demonstrations while without requiring any action annotations.
[AI-3] Explaining RL Decisions with Trajectories: A Reproducibility Study
链接: https://arxiv.org/abs/2411.07200
作者: Karim Abdel Sadek,Matteo Nulli,Joan Velja,Jort Vincenti
关键词-EN: trajectories, original paper introduces, Explaining RL decisions, Explaining, investigates the reproducibility
类目: Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:This work investigates the reproducibility of the paper ‘Explaining RL decisions with trajectories’. The original paper introduces a novel approach in explainable reinforcement learning based on attributing an agent’s decisions to specific clusters of trajectories encountered during training. We verify the main claims from the paper, which state that (i) training on fewer trajectories induces a lower initial state value, (ii) trajectories in a cluster present similar high-level patterns, (iii) distant trajectories influence the decision of an agent, and (iv) humans correctly identify the attributed trajectories to the decision of the agent. We recovered the environments used by the authors from the partial original code they provided for one of the environments (Grid-World) and implemented the remaining ones from scratch (Seaquest, HalfCheetah, Breakout and Q*Bert). While we confirm that (i), (ii), and (iii) partially hold, we extend the authors’ largely qualitative experiments by introducing a quantitative metric to further support (iii), and new experiments and visual results for (i). Moreover, we investigate the use of different clustering algorithms and encoder architectures to further support (ii). We could not support (iv), given the limited extent of the original experiments. We conclude that, while some of the claims can be supported, further investigations and experiments could be of interest. We recognise the novelty of the work from the authors and hope that our work paves the way for clearer and more transparent approaches.
[AI-4] OmniEdit: Building Image Editing Generalist Models Through Specialist Supervision
链接: https://arxiv.org/abs/2411.07199
作者: Cong Wei,Zheyang Xiong,Weiming Ren,Xinrun Du,Ge Zhang,Wenhu Chen
关键词-EN: demonstrated significant potential, Instruction-guided image editing, manually annotated image, training diffusion models, Instruction-guided image
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 21 pages
点击查看摘要
Abstract:Instruction-guided image editing methods have demonstrated significant potential by training diffusion models on automatically synthesized or manually annotated image editing pairs. However, these methods remain far from practical, real-life applications. We identify three primary challenges contributing to this gap. Firstly, existing models have limited editing skills due to the biased synthesis process. Secondly, these methods are trained with datasets with a high volume of noise and artifacts. This is due to the application of simple filtering methods like CLIP-score. Thirdly, all these datasets are restricted to a single low resolution and fixed aspect ratio, limiting the versatility to handle real-world use cases. In this paper, we present OmniEdit, an omnipotent editor that handles seven different image editing tasks with any aspect ratio seamlessly. Our contribution is fourfold: (1) OmniEdit is trained by utilizing the supervision from seven different specialist models to ensure task coverage. (2) We utilize importance sampling based on the scores provided by large multimodal models (like GPT-4o) instead of CLIP-score to improve the data quality. (3) We propose a new editing architecture called EditNet to greatly boost the editing success rate. (4) We provide images with different aspect ratios to ensure that our model can handle any image in the wild. We have curated a test set containing images of different aspect ratios, accompanied by diverse instructions to cover different tasks. Both automatic evaluation and human evaluations demonstrate that OmniEdit can significantly outperform all the existing models. Our code, dataset and model will be available at this https URL
[AI-5] NatureLM-audio: an Audio-Language Foundation Model for Bioacoustics
链接: https://arxiv.org/abs/2411.07186
作者: David Robinson,Marius Miron,Masato Hagiwara,Olivier Pietquin
关键词-EN: showing emergent abilities, Large language models, audio represent, general audio, prompted with text
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注: Demo page: this https URL. The code will be open-sourced and available shortly.
点击查看摘要
Abstract:Large language models (LLMs) prompted with text and audio represent the state of the art in various auditory tasks, including speech, music, and general audio, showing emergent abilities on unseen tasks. However, these capabilities have yet to be fully demonstrated in bioacoustics tasks, such as detecting animal vocalizations in large recordings, classifying rare and endangered species, and labeling context and behavior - tasks that are crucial for conservation, biodiversity monitoring, and the study of animal behavior. In this work, we present NatureLM-audio, the first audio-language foundation model specifically designed for bioacoustics. Our carefully curated training dataset comprises text-audio pairs spanning a diverse range of bioacoustics, speech, and music data, designed to address the challenges posed by limited annotated datasets in the field. We demonstrate successful transfer of learned representations from music and speech to bioacoustics, and our model shows promising generalization to unseen taxa and tasks. Importantly, we test NatureLM-audio on a novel benchmark (BEANS-Zero) and it sets the new state of the art (SotA) on several bioacoustics tasks, including zero-shot classification of unseen species. To advance bioacoustics research, we also open-source the code for generating training and benchmark data, as well as for training the model.
[AI-6] Gradual Fine-Tuning with Graph Routing for Multi-Source Unsupervised Domain Adaptation
链接: https://arxiv.org/abs/2411.07185
作者: Yao Ma,Samuel Louvan,Zhunxuan Wang
关键词-EN: Multi-source unsupervised domain, Multi-source unsupervised, unsupervised domain adaptation, domain adaptation aims, multiple source domains
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: In Proceedings of the 3rd Conference on Lifelong Learning Agents (CoLLAs 2024)
点击查看摘要
Abstract:Multi-source unsupervised domain adaptation aims to leverage labeled data from multiple source domains for training a machine learning model to generalize well on a target domain without labels. Source domain selection plays a crucial role in determining the model’s performance. It relies on the similarities amongst source and target domains. Nonetheless, existing work for source domain selection often involves heavyweight computational procedures, especially when dealing with numerous source domains and the need to identify the best ones from them. In this paper, we introduce a framework for gradual fine-tuning (GFT) of machine learning models on multiple source domains. We represent multiple source domains as an undirected weighted graph. We then give a new generalization error bound for GFT along any path within the graph, which is used to determine the optimal path corresponding to the optimal training order. With this formulation, we introduce three lightweight graph-routing strategies which tend to minimize the error bound. Our best strategy improves accuracy by 2.3% over the state of the art on the Natural Language Inference (NLI) task and achieves competitive performance on the Sentiment Analysis (SA) task, including a 3.9% improvement on a more diverse subset of the SA data.
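把源域建成带权无向图后,“最优训练顺序”可近似为图上一条代价最小的路径。下面用 Dijkstra 最短路示意这一路由思想(边权代表域间差异,数值为编造;论文实际最小化的是沿路径的泛化误差界,而非简单的边权之和):

```python
import heapq

def cheapest_route(edges: dict[tuple[str, str], float],
                   start: str, target: str) -> list[str]:
    """在无向加权图上找 start 到 target 的最小代价路径,
    作为“逐步微调顺序”的示意(边权近似域间距离)。"""
    adj: dict[str, list[tuple[str, float]]] = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    heap, seen = [(0.0, start, [start])], set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == target:
            return path
        if node in seen:
            continue
        seen.add(node)
        for nxt, w in adj.get(node, []):
            if nxt not in seen:
                heapq.heappush(heap, (cost + w, nxt, path + [nxt]))
    return []

domain_dist = {("news", "wiki"): 0.2, ("wiki", "target"): 0.3,
               ("news", "target"): 0.9, ("reviews", "target"): 0.4,
               ("news", "reviews"): 0.5}
print(cheapest_route(domain_dist, "news", "target"))
# ['news', 'wiki', 'target'] —— 先在 news 微调,再 wiki,最后 target
```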
[AI-7] Anytime Sequential Halving in Monte-Carlo Tree Search
链接: https://arxiv.org/abs/2411.07171
作者: Dominic Sagers,Mark H.M. Winands,Dennis J.N.J. Soemers
关键词-EN: Monte-Carlo Tree Search, minimize cumulative regret, Sequential Halving, minimize simple regret, typically uses multi-armed
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted by the Computers and Games 2024 conference
点击查看摘要
Abstract:Monte-Carlo Tree Search (MCTS) typically uses multi-armed bandit (MAB) strategies designed to minimize cumulative regret, such as UCB1, as its selection strategy. However, in the root node of the search tree, it is more sensible to minimize simple regret. Previous work has proposed using Sequential Halving as selection strategy in the root node, as, in theory, it performs better with respect to simple regret. However, Sequential Halving requires a budget of iterations to be predetermined, which is often impractical. This paper proposes an anytime version of the algorithm, which can be halted at any arbitrary time and still return a satisfactory result, while being designed such that it approximates the behavior of Sequential Halving. Empirical results in synthetic MAB problems and ten different board games demonstrate that the algorithm’s performance is competitive with Sequential Halving and UCB1 (and their analogues in MCTS).
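为说明"随时可停"的思想,下面是一个极简 Python 草图:以翻倍的预算反复运行标准 Sequential Halving,并在多轮之间共享经验均值统计,任何时刻停止都能返回当前最优臂。论文算法的具体构造与此不同,此处仅作示意。

```python
import math

def sequential_halving(arms, budget, stats):
    """One standard Sequential Halving pass. arms: list of pull() callables;
    stats[i] = [reward_sum, pull_count] is shared across passes."""
    alive = list(range(len(arms)))
    rounds = max(1, math.ceil(math.log2(len(arms))))
    per_round = max(1, budget // rounds)
    while len(alive) > 1:
        pulls = max(1, per_round // len(alive))
        for i in alive:
            for _ in range(pulls):
                stats[i][0] += arms[i]()
                stats[i][1] += 1
        # keep the better half by empirical mean
        alive.sort(key=lambda i: stats[i][0] / stats[i][1], reverse=True)
        alive = alive[: max(1, len(alive) // 2)]
    return alive[0]

def anytime_sequential_halving(arms, should_stop):
    """Restart Sequential Halving with a doubling budget until halted;
    whenever we stop, the last completed pass already gives an answer."""
    stats = [[0.0, 0] for _ in arms]
    budget, best = 2 * len(arms), 0
    while not should_stop():
        best = sequential_halving(arms, budget, stats)
        budget *= 2
    return best
```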
[AI-8] Acoustic-based 3D Human Pose Estimation Robust to Human Position BMVC2024
链接: https://arxiv.org/abs/2411.07165
作者: Yusuke Oumi,Yuto Shibata,Go Irie,Akisato Kimura,Yoshimitsu Aoki,Mariko Isogawa
关键词-EN: human pose estimation, low-level acoustic signals, pose estimation, paper explores, explores the problem
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: Accepted at BMVC2024
点击查看摘要
Abstract:This paper explores the problem of 3D human pose estimation from only low-level acoustic signals. The existing active acoustic sensing-based approach for 3D human pose estimation implicitly assumes that the target user is positioned along a line between loudspeakers and a microphone. Because reflection and diffraction of sound by the human body cause subtle acoustic signal changes compared to sound obstruction, the existing model degrades its accuracy significantly when subjects deviate from this line, limiting its practicality in real-world scenarios. To overcome this limitation, we propose a novel method composed of a position discriminator and reverberation-resistant model. The former predicts the standing positions of subjects and applies adversarial learning to extract subject position-invariant features. The latter utilizes acoustic signals before the estimation target time as references to enhance robustness against the variations in sound arrival times due to diffraction and reflection. We construct an acoustic pose estimation dataset that covers diverse human locations and demonstrate through experiments that our proposed method outperforms existing approaches.
[AI-9] A Domain-Agnostic Neurosymbolic Approach for Big Social Data Analysis: Evaluating Mental Health Sentiment on Social Media during COVID-19
链接: https://arxiv.org/abs/2411.07163
作者: Vedant Khandelwal,Manas Gaur,Ugur Kursuncu,Valerie Shalin,Amit Sheth
关键词-EN: Monitoring public sentiment, Monitoring public, public sentiment, Monitoring, symbolic knowledge sources
类目: Artificial Intelligence (cs.AI)
*备注: 13 Pages, 5 Figures, 5 Tables, 2024 IEEE International Conference on Big Data, Regular Paper
点击查看摘要
Abstract:Monitoring public sentiment via social media is potentially helpful during health crises such as the COVID-19 pandemic. However, traditional frequency-based, data-driven neural network-based approaches can miss newly relevant content due to the evolving nature of language in a dynamically evolving environment. Human-curated symbolic knowledge sources, such as lexicons for standard language and slang terms, can potentially elevate social media signals in evolving language. We introduce a neurosymbolic method that integrates neural networks with symbolic knowledge sources, enhancing the detection and interpretation of mental health-related tweets relevant to COVID-19. Our method was evaluated using a corpus of large datasets (approximately 12 billion tweets, 2.5 million subreddit data, and 700k news articles) and multiple knowledge graphs. This method dynamically adapts to evolving language, outperforming purely data-driven models with an F1 score exceeding 92%. This approach also showed faster adaptation to new data and lower computational demands than fine-tuning pre-trained large language models (LLMs). This study demonstrates the benefit of neurosymbolic methods in interpreting text in a dynamic environment for tasks such as health surveillance.
[AI-10] RoundTable: Investigating Group Decision-Making Mechanism in Multi-Agent Collaboration
链接: https://arxiv.org/abs/2411.07161
作者: Young-Min Cho,Raphael Shu,Nilaksh Das,Tamer Alkhouli,Yi-An Lai,Jason Cai,Monica Sunkara,Yi Zhang
关键词-EN: enhancing collective intelligence, Systems in eliciting, eliciting cross-agent communication, study investigates, investigates the efficacy
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
*备注: preprint
点击查看摘要
Abstract:This study investigates the efficacy of Multi-Agent Systems in eliciting cross-agent communication and enhancing collective intelligence through group decision-making in a decentralized setting. Unlike centralized mechanisms, where a fixed hierarchy governs social choice, decentralized group decision-making allows agents to engage in joint deliberation. Our research focuses on the dynamics of communication and decision-making within various social choice methods. By applying different voting rules in various environments, we find that moderate decision flexibility yields better outcomes. Additionally, exploring the linguistic features of agent-to-agent conversations reveals indicators of effective collaboration, offering insights into communication patterns that facilitate or hinder collaboration. Finally, we propose various methods for determining the optimal stopping point in multi-agent collaborations based on linguistic cues. Our findings contribute to a deeper understanding of how decentralized decision-making and group conversation shape multi-agent collaboration, with implications for the design of more effective MAS environments.
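摘要中提到对多种社会选择(投票)规则进行比较;下面用两种经典规则给出一个最小 Python 示意,智能体的偏好排序为虚构数据:

```python
from collections import Counter

def plurality(ballots):
    """Plurality: each agent's top choice gets one vote; most votes wins."""
    return Counter(b[0] for b in ballots).most_common(1)[0][0]

def borda(ballots):
    """Borda count: position k in a ranking of m options scores m-1-k."""
    scores, m = Counter(), len(ballots[0])
    for ranking in ballots:
        for pos, option in enumerate(ranking):
            scores[option] += m - 1 - pos
    return scores.most_common(1)[0][0]

# Hypothetical post-deliberation rankings from three agents
ballots = [["A", "B", "C"], ["B", "A", "C"], ["B", "C", "A"]]
print(plurality(ballots), borda(ballots))  # -> B B
```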
[AI-11] Variational Graph Contrastive Learning
链接: https://arxiv.org/abs/2411.07150
作者: Shifeng Xie,Jhony H. Giraldo
关键词-EN: encode high-dimensional graph-structured, high-dimensional graph-structured data, Subgraph Gaussian Embedding, Graph representation learning, Gaussian Embedding Contrast
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Graph representation learning (GRL) is a fundamental task in machine learning, aiming to encode high-dimensional graph-structured data into low-dimensional vectors. Self-supervised learning (SSL) methods are widely used in GRL because they can avoid expensive human annotation. In this work, we propose a novel Subgraph Gaussian Embedding Contrast (SGEC) method. Our approach introduces a subgraph Gaussian embedding module, which adaptively maps subgraphs to a structured Gaussian space, ensuring the preservation of graph characteristics while controlling the distribution of generated subgraphs. We employ optimal transport distances, including Wasserstein and Gromov-Wasserstein distances, to effectively measure the similarity between subgraphs, enhancing the robustness of the contrastive learning process. Extensive experiments across multiple benchmarks demonstrate that SGEC outperforms or presents competitive performance against state-of-the-art approaches. Our findings provide insights into the design of SSL methods for GRL, emphasizing the importance of the distribution of the generated contrastive pairs.
[AI-12] Edify 3D: Scalable High-Quality 3D Asset Generation
链接: https://arxiv.org/abs/2411.07135
作者: NVIDIA:Maciej Bala,Yin Cui,Yifan Ding,Yunhao Ge,Zekun Hao,Jon Hasselgren,Jacob Huffman,Jingyi Jin,J.P. Lewis,Zhaoshuo Li,Chen-Hsuan Lin,Yen-Chen Lin,Tsung-Yi Lin,Ming-Yu Liu,Alice Luo,Qianli Ma,Jacob Munkberg,Stella Shi,Fangyin Wei,Donglai Xiang,Jiashu Xu,Xiaohui Zeng,Qinsheng Zhang
关键词-EN: advanced solution designed, introduce Edify, advanced solution, solution designed, asset generation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
*备注: Project website: this https URL
点击查看摘要
Abstract:We introduce Edify 3D, an advanced solution designed for high-quality 3D asset generation. Our method first synthesizes RGB and surface normal images of the described object at multiple viewpoints using a diffusion model. The multi-view observations are then used to reconstruct the shape, texture, and PBR materials of the object. Our method can generate high-quality 3D assets with detailed geometry, clean shape topologies, high-resolution textures, and materials within 2 minutes of runtime.
[AI-13] Token Merging for Training-Free Semantic Binding in Text-to-Image Synthesis NEURIPS2024
链接: https://arxiv.org/abs/2411.07132
作者: Taihang Hu,Linxuan Li,Joost van de Weijer,Hongcheng Gao,Fahad Shahbaz Khan,Jian Yang,Ming-Ming Cheng,Kai Wang,Yaxing Wang
关键词-EN: accurately bind semantically, models exhibit remarkable, remarkable generation capabilities, exhibit remarkable generation, bind semantically related
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted by Neurips2024
点击查看摘要
Abstract:Although text-to-image (T2I) models exhibit remarkable generation capabilities, they frequently fail to accurately bind semantically related objects or attributes in the input prompts; a challenge termed semantic binding. Previous approaches either involve intensive fine-tuning of the entire T2I model or require users or large language models to specify generation layouts, adding complexity. In this paper, we define semantic binding as the task of associating a given object with its attribute, termed attribute binding, or linking it to other related sub-objects, referred to as object binding. We introduce a novel method called Token Merging (ToMe), which enhances semantic binding by aggregating relevant tokens into a single composite token. This ensures that the object, its attributes and sub-objects all share the same cross-attention map. Additionally, to address potential confusion among main objects with complex textual prompts, we propose end token substitution as a complementary strategy. To further refine our approach in the initial stages of T2I generation, where layouts are determined, we incorporate two auxiliary losses, an entropy loss and a semantic binding loss, to iteratively update the composite token to improve the generation integrity. We conducted extensive experiments to validate the effectiveness of ToMe, comparing it against various existing methods on the T2I-CompBench and our proposed GPT-4o object binding benchmark. Our method is particularly effective in complex scenarios that involve multiple objects and attributes, which previous methods often fail to address. The code will be publicly available at this https URL.
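Token Merging 的核心操作可以粗略理解为:把语义相关的若干 prompt token 的嵌入聚合成一个复合 token,使它们共享同一张 cross-attention 图。下面是一个基于 PyTorch 的简化示意,其中取均值作为聚合方式是本文的假设,论文实际的聚合与辅助损失更复杂:

```python
import torch

def merge_tokens(prompt_emb: torch.Tensor, group: list) -> torch.Tensor:
    """prompt_emb: (seq_len, dim) text-encoder output.
    group: indices of semantically related tokens, e.g. an object and
    its attribute; they are replaced by one composite token."""
    composite = prompt_emb[group].mean(dim=0)   # aggregate the group
    out = prompt_emb.clone()
    out[group[0]] = composite                   # composite token in place
    keep = [i for i in range(out.size(0)) if i not in group[1:]]
    return out[keep]                            # drop the merged duplicates

emb = torch.randn(77, 768)                      # e.g. a CLIP prompt embedding
merged = merge_tokens(emb, [2, 3])              # merge tokens 2 and 3
print(merged.shape)                             # torch.Size([76, 768])
```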
[AI-14] Fast and Robust Contextual Node Representation Learning over Dynamic Graphs
链接: https://arxiv.org/abs/2411.07123
作者: Xingzhi Guo,Silong Wang,Baojian Zhou,Yanghua Xiao,Steven Skiena
关键词-EN: Real-world graphs grow, graphs grow rapidly, efficiently maintaining robust, Real-world graphs, grow rapidly
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Real-world graphs grow rapidly with edge and vertex insertions over time, motivating the problem of efficiently maintaining robust node representation over evolving graphs. Recent efficient GNNs are designed to decouple recursive message passing from the learning process, and favor Personalized PageRank (PPR) as the underlying feature propagation mechanism. However, most PPR-based GNNs are designed for static graphs, and efficient PPR maintenance remains an open problem. Further, there is surprisingly little theoretical justification for the choice of PPR, despite its impressive empirical performance. In this paper, we are inspired by the recent PPR formulation as an explicit \ell_1-regularized optimization problem and propose a unified dynamic graph learning framework based on sparse node-wise attention. We also present a set of desired properties to justify the choice of PPR in SOTA GNNs, which serve as a guideline for future node attention designs. Meanwhile, we take advantage of the PPR-equivalent optimization formulation and employ the proximal gradient method (ISTA) to improve the efficiency of PPR-based GNNs by up to 6 times. Finally, we instantiate a simple-yet-effective model (GoPPE) with robust positional encodings by maximizing PPR previously used as attention. The model performs comparably to or better than the SOTA baselines and greatly outperforms them when the initial node attributes are noisy during graph evolution, demonstrating the effectiveness and robustness of GoPPE.
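摘要提到把 PPR 写成显式的 \ell_1 正则优化问题并用近端梯度法(ISTA)求解。下面在一个 \ell_1 正则最小二乘的替代目标上演示 ISTA,该目标对应 PPR 线性方程组 (I - (1-α)Pᵀ)x = αs;论文实际使用的 PPR 等价目标与此不同,仅作示意:

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ppr_ista(P, s, alpha=0.15, lam=1e-4, iters=500):
    """ISTA on an l1-regularized least-squares surrogate of the PPR
    system (I - (1 - alpha) P^T) x = alpha * s.

    P: (n, n) row-stochastic transition matrix (dense for simplicity)
    s: (n,) seed distribution
    """
    n = P.shape[0]
    A = np.eye(n) - (1.0 - alpha) * P.T
    b = alpha * s
    step = 1.0 / np.linalg.norm(A, 2) ** 2      # 1/L, L = Lipschitz constant
    x = np.zeros(n)
    for _ in range(iters):
        grad = A.T @ (A @ x - b)                # gradient of 0.5*||Ax - b||^2
        x = soft_threshold(x - step * grad, step * lam)
    return x
```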
[AI-15] Learning Multi-Agent Collaborative Manipulation for Long-Horizon Quadrupedal Pushing
链接: https://arxiv.org/abs/2411.07104
作者: Chuye Hong,Yuming Feng,Yaru Niu,Shiqi Liu,Yuxiang Yang,Wenhao Yu,Tingnan Zhang,Jie Tan,Ding Zhao
关键词-EN: handling large objects, demanding real-world applications, remain limited, industrial automation, large objects
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注:
点击查看摘要
Abstract:Recently, quadrupedal robots have achieved significant success in locomotion, but their manipulation capabilities, particularly in handling large objects, remain limited, restricting their usefulness in demanding real-world applications such as search and rescue, construction, industrial automation, and room organization. This paper tackles the task of obstacle-aware, long-horizon pushing by multiple quadrupedal robots. We propose a hierarchical multi-agent reinforcement learning framework with three levels of control. The high-level controller integrates an RRT planner and a centralized adaptive policy to generate subgoals, while the mid-level controller uses a decentralized goal-conditioned policy to guide the robots toward these sub-goals. A pre-trained low-level locomotion policy executes the movement commands. We evaluate our method against several baselines in simulation, demonstrating significant improvements, with a 36.0% higher success rate and a 24.5% reduction in completion time compared to the best baseline. Our framework successfully enables long-horizon, obstacle-aware manipulation tasks like Push-Cuboid and Push-T on Go1 robots in the real world.
[AI-16] Bounded Rationality Equilibrium Learning in Mean Field Games
链接: https://arxiv.org/abs/2411.07099
作者: Yannick Eich,Christian Fabian,Kai Cui,Heinz Koeppl
关键词-EN: tractably model behavior, large agent populations, field games, QRE, MFG QRE
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注:
点击查看摘要
Abstract:Mean field games (MFGs) tractably model behavior in large agent populations. The literature on learning MFG equilibria typically focuses on finding Nash equilibria (NE), which assume perfectly rational agents and are hence implausible in many realistic situations. To overcome these limitations, we incorporate bounded rationality into MFGs by leveraging the well-known concept of quantal response equilibria (QRE). Two novel types of MFG QRE enable the modeling of large agent populations where individuals only noisily estimate the true objective. We also introduce a second source of bounded rationality to MFGs by restricting agents’ planning horizon. The resulting novel receding horizon (RH) MFGs are combined with QRE and existing approaches to model different aspects of bounded rationality in MFGs. We formally define MFG QRE and RH MFGs and compare them to existing equilibrium concepts such as entropy-regularized NE. Subsequently, we design generalized fixed point iteration and fictitious play algorithms to learn QRE and RH equilibria. After a theoretical analysis, we give different examples to evaluate the capabilities of our learning algorithms and outline practical differences between the equilibrium concepts.
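QRE 把"完全理性的最优反应"换成带噪声的 logit(软最大)反应。下面用一个单步、收益依赖群体动作分布的玩具收益函数,演示摘要中提到的广义不动点迭代;温度与阻尼系数等均为示意性假设:

```python
import numpy as np

def logit_response(q, temp):
    """Quantal (logit) response: a noisy best response to payoffs q."""
    z = np.exp((q - q.max()) / temp)
    return z / z.sum()

def qre_fixed_point(payoff, n_actions, temp=0.5, damping=0.5, iters=200):
    """Damped fixed-point iteration: mu <- (1-d)*mu + d*logit(payoff(mu)).
    payoff maps the population action distribution mu to per-action payoffs."""
    mu = np.full(n_actions, 1.0 / n_actions)
    for _ in range(iters):
        mu = (1 - damping) * mu + damping * logit_response(payoff(mu), temp)
    return mu

# Congestion-style example: an action pays less the more it is used
print(qre_fixed_point(lambda mu: -mu, n_actions=3))  # -> roughly uniform
```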
[AI-17] A Multi-Agent Approach for REST API Testing with Semantic Graphs and LLM-Driven Inputs ICSE2025
链接: https://arxiv.org/abs/2411.07098
作者: Myeongsoo Kim,Tyler Stennett,Saurabh Sinha,Alessandro Orso
关键词-EN: REST API testing, REST API, black-box REST API, REST API specifications, API testing
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: To be published in the 47th IEEE/ACM International Conference on Software Engineering (ICSE 2025)
点击查看摘要
Abstract:As modern web services increasingly rely on REST APIs, their thorough testing has become crucial. Furthermore, the advent of REST API specifications such as the OpenAPI Specification has led to the emergence of many black-box REST API testing tools. However, these tools often focus on individual test elements in isolation (e.g., APIs, parameters, values), resulting in lower coverage and less effectiveness in detecting faults (i.e., 500 response codes). To address these limitations, we present AutoRestTest, the first black-box framework to adopt a dependency-embedded multi-agent approach for REST API testing, integrating Multi-Agent Reinforcement Learning (MARL) with a Semantic Property Dependency Graph (SPDG) and Large Language Models (LLMs). Our approach treats REST API testing as a separable problem, where four agents – API, dependency, parameter, and value – collaborate to optimize API exploration. LLMs handle domain-specific value restrictions, the SPDG model simplifies the search space for dependencies using a similarity score between API operations, and MARL dynamically optimizes the agents’ behavior. Evaluated on 12 real-world REST services, AutoRestTest outperforms the four leading black-box REST API testing tools, including those assisted by RESTGPT (which augments realistic test inputs using LLMs), in terms of code coverage, operation coverage, and fault detection. Notably, AutoRestTest is the only tool able to identify an internal server error in Spotify. Our ablation study underscores the significant contributions of the agent learning, SPDG, and LLM components.
[AI-18] Towards Characterizing Cyber Networks with Large Language Models
链接: https://arxiv.org/abs/2411.07089
作者: Alaric Hartsock,Luiz Manella Pereira,Glenn Fink
关键词-EN: Threat hunting analyzes, hunting analyzes large, sparse adversarial behavior, Threat hunting, find sparse adversarial
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 5 pages, 2 figures
点击查看摘要
Abstract:Threat hunting analyzes large, noisy, high-dimensional data to find sparse adversarial behavior. We believe adversarial activities, however they are disguised, are extremely difficult to completely obscure in high dimensional space. In this paper, we employ these latent features of cyber data to find anomalies via a prototype tool called Cyber Log Embeddings Model (CLEM). CLEM was trained on Zeek network traffic logs from both a real-world production network and an Internet of Things (IoT) cybersecurity testbed. The model is deliberately overtrained on a sliding window of data to characterize each window closely. We use the Adjusted Rand Index (ARI) to compare the k-means clustering of CLEM output to expert labeling of the embeddings. Our approach demonstrates that there is promise in using natural language modeling to understand cyber data.
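摘要中的评估步骤——对嵌入做 k-means 聚类,再用调整兰德指数(ARI)与专家标注对比——可以用 scikit-learn 简洁复现;下面的嵌入与标签均为随机占位数据:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 64))        # stand-in for CLEM output
expert_labels = rng.integers(0, 5, size=1000)   # hypothetical expert labels

clusters = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(embeddings)
print("ARI:", adjusted_rand_score(expert_labels, clusters))  # 1.0 = perfect
```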
[AI-19] OCMDP: Observation-Constrained Markov Decision Process
链接: https://arxiv.org/abs/2411.07087
作者: Taiyi Wang,Jianheng Liu,Jiaye Li,Zhihao Wu,Yu Wu
关键词-EN: Markov Decision Process, practical applications, decision-making processes, processes must balance, acquiring information
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: Full paper, 14 Pages
点击查看摘要
Abstract:In many practical applications, decision-making processes must balance the costs of acquiring information with the benefits it provides. Traditional control systems often assume full observability, an unrealistic assumption when observations are expensive. We tackle the challenge of simultaneously learning observation and control strategies in such cost-sensitive environments by introducing the Observation-Constrained Markov Decision Process (OCMDP), where the policy influences the observability of the true state. To manage the complexity arising from the combined observation and control actions, we develop an iterative, model-free deep reinforcement learning algorithm that separates the sensing and control components of the policy. This decomposition enables efficient learning in the expanded action space by focusing on when and what to observe, as well as determining optimal control actions, without requiring knowledge of the environment’s dynamics. We validate our approach on a simulated diagnostic task and a realistic healthcare environment using HeartPole. Given both scenarios, the experimental results demonstrate that our model achieves a substantial reduction in observation costs on average, significantly outperforming baseline methods by a notable margin in efficiency.
[AI-20] To Train or Not to Train: Balancing Efficiency and Training Cost in Deep Reinforcement Learning for Mobile Edge Computing
链接: https://arxiv.org/abs/2411.07086
作者: Maddalena Boscaro,Federico Mason,Federico Chiariotti,Andrea Zanella
关键词-EN: Artificial Intelligence, end users’ requirements, Mobile Edge Computing, demand patterns, enables communication
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
点击查看摘要
Abstract:Artificial Intelligence (AI) is a key component of 6G networks, as it enables communication and computing services to adapt to end users’ requirements and demand patterns. The management of Mobile Edge Computing (MEC) is a meaningful example of AI application: computational resources available at the network edge need to be carefully allocated to users, whose jobs may have different priorities and latency requirements. The research community has developed several AI algorithms to perform this resource allocation, but it has neglected a key aspect: learning is itself a computationally demanding task, and considering free training results in idealized conditions and performance in simulations. In this work, we consider a more realistic case in which the cost of learning is specifically accounted for, presenting a new algorithm to dynamically select when to train a Deep Reinforcement Learning (DRL) agent that allocates resources. Our method is highly general, as it can be directly applied to any scenario involving a training overhead, and it can approach the same performance as an ideal learning agent even under realistic training conditions.
[AI-21] StoryTeller: Improving Long Video Description through Global Audio-Visual Character Identification
链接: https://arxiv.org/abs/2411.07076
作者: Yichen He,Yuan Lin,Jianchao Wu,Hanchong Zhang,Yuchen Zhang,Ruicheng Le
关键词-EN: Existing large vision-language, extended video spanning, video spanning minutes, Existing large, processing short
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Existing large vision-language models (LVLMs) are largely limited to processing short, seconds-long videos and struggle with generating coherent descriptions for extended videos spanning minutes or more. Long video description introduces new challenges, such as plot-level consistency across descriptions. To address these, we identify audio-visual character identification, i.e., matching character names to each dialogue, as a key factor. We propose StoryTeller, a system for generating dense descriptions of long videos, incorporating both low-level visual concepts and high-level plot information. StoryTeller uses a multimodal large language model that integrates visual, audio, and text modalities to perform audio-visual character identification on minute-long video clips. The results are then fed into an LVLM to enhance consistency of video description. We validate our approach on movie description tasks and introduce MovieStory101, a dataset with dense descriptions for three-minute movie clips. To evaluate long video descriptions, we create MovieQA, a large set of multiple-choice questions for the MovieStory101 test set. We assess descriptions by inputting them into GPT-4 to answer these questions, using accuracy as an automatic evaluation metric. Experiments show that StoryTeller outperforms all open and closed-source baselines on MovieQA, achieving 9.5% higher accuracy than the strongest baseline, Gemini-1.5-pro, and demonstrating a +15.56% advantage in human side-by-side evaluations. Additionally, incorporating audio-visual character identification from StoryTeller improves the performance of all video description models, with Gemini-1.5-pro and GPT-4o showing relative improvements of 5.5% and 13.0%, respectively, in accuracy on MovieQA.
[AI-22] An Interpretable X-ray Style Transfer via Trainable Local Laplacian Filter
链接: https://arxiv.org/abs/2411.07072
作者: Dominik Eckert,Ludwig Ritschl,Christopher Syben,Christian Hümmer,Julia Wicklein,Marcel Beister,Steffen Kappler,Sebastian Stober
关键词-EN: preferred visual impressions, Local Laplacian Filter, Radiologists have preferred, X-ray style transfer, LLF style transfer
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Radiologists have preferred visual impressions or ‘styles’ of X-ray images that are manually adjusted to their needs to support their diagnostic performance. In this work, we propose an automatic and interpretable X-ray style transfer by introducing a trainable version of the Local Laplacian Filter (LLF). From the shape of the LLF’s optimized remap function, the characteristics of the style transfer can be inferred and reliability of the algorithm can be ensured. Moreover, we enable the LLF to capture complex X-ray style features by replacing the remap function with a Multi-Layer Perceptron (MLP) and adding a trainable normalization layer. We demonstrate the effectiveness of the proposed method by transforming unprocessed mammographic X-ray images into images that match the style of target mammograms and achieve a Structural Similarity Index (SSIM) of 0.94 compared to 0.82 of the baseline LLF style transfer method from Aubry et al.
[AI-23] Designing Reliable Experiments with Generative Agent-Based Modeling: A Comprehensive Guide Using Concordia by Google DeepMind
链接: https://arxiv.org/abs/2411.07038
作者: Alejandro Leonardo García Navarro,Nataliia Koneva,Alfonso Sánchez-Macián,José Alberto Hernández,Manuel Goyanes
关键词-EN: technical expertise required, conducting large-scale experiments, social sciences, face challenges, challenges when conducting
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:In social sciences, researchers often face challenges when conducting large-scale experiments, particularly due to the simulations’ complexity and the lack of technical expertise required to develop such frameworks. Agent-Based Modeling (ABM) is a computational approach that simulates agents’ actions and interactions to evaluate how their behaviors influence the outcomes. However, the traditional implementation of ABM can be demanding and complex. Generative Agent-Based Modeling (GABM) offers a solution by enabling scholars to create simulations where AI-driven agents can generate complex behaviors based on underlying rules and interactions. This paper introduces a framework for designing reliable experiments using GABM, making sophisticated simulation techniques more accessible to researchers across various fields. We provide a step-by-step guide for selecting appropriate tools, designing the model, establishing experimentation protocols, and validating results.
[AI-24] Evaluating the Accuracy of Chatbots in Financial Literature
链接: https://arxiv.org/abs/2411.07031
作者: Orhan Erdem,Kristi Hassett,Feyzullah Egriboyun
关键词-EN: Gemini Advanced, employing novel methodologies, evaluate the reliability, hallucination rates, hallucination
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:We evaluate the reliability of two chatbots, ChatGPT (4o and o1-preview versions) and Gemini Advanced, in providing references on financial literature, employing novel methodologies. Alongside the conventional binary approach commonly used in the literature, we developed a nonbinary approach and a recency measure to assess how hallucination rates vary with how recent a topic is. After analyzing 150 citations, ChatGPT-4o had a hallucination rate of 20.0% (95% CI, 13.6%-26.4%), while the o1-preview had a hallucination rate of 21.3% (95% CI, 14.8%-27.9%). In contrast, Gemini Advanced exhibited higher hallucination rates: 76.7% (95% CI, 69.9%-83.4%). While hallucination rates increased for more recent topics, this trend was not statistically significant for Gemini Advanced. These findings emphasize the importance of verifying chatbot-provided references, particularly in rapidly evolving fields.
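摘要中的置信区间可以用比例的正态近似直接复核:p ± 1.96·sqrt(p(1−p)/n)。以 ChatGPT-4o 为例,150 条引文中 30 条幻觉即 20.0%:

```python
import math

def proportion_ci(k, n, z=1.96):
    """Normal-approximation 95% CI for a rate k/n."""
    p = k / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p, p - half, p + half

p, lo, hi = proportion_ci(30, 150)
print(f"{p:.1%} (95% CI, {lo:.1%}-{hi:.1%})")  # 20.0% (95% CI, 13.6%-26.4%)
```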
[AI-25] Leveraging LSTM for Predictive Modeling of Satellite Clock Bias
链接: https://arxiv.org/abs/2411.07015
作者: Ahan Bhatt,Ishaan Mehta,Pravin Patidar
关键词-EN: Satellite clock bias, Long Short-Term Memory, Satellite clock, clock bias, Satellite
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
*备注: 6 Pages, 6 figures (8 sub-figures), 5 Tables. Index Terms: LSTM, Satellite Navigation, Deep Learning, Clock Bias
点击查看摘要
Abstract:Satellite clock bias prediction plays a crucial role in enhancing the accuracy of satellite navigation systems. In this paper, we propose an approach utilizing Long Short-Term Memory (LSTM) networks to predict satellite clock bias. We gather data from the PRN 8 satellite of the Galileo constellation and preprocess it to obtain a single-difference sequence, crucial for normalizing the data. Normalization allows resampling of the data, ensuring that the predictions are equidistant and complete. Our methodology involves training the LSTM model on varying lengths of datasets, ranging from 7 days to 31 days. We employ a training set consisting of two days’ worth of data in each case. Our LSTM model exhibits exceptional accuracy, with a Root Mean Square Error (RMSE) of 2.11 \times 10^{-11}. Notably, our approach outperforms traditional methods used for similar time-series forecasting projects, being 170 times more accurate than RNN, 2.3 \times 10^{7} times more accurate than MLP, and 1.9 \times 10^{4} times more accurate than ARIMA. This study holds significant potential in enhancing the accuracy and efficiency of low-power receivers used in various devices, particularly those requiring power conservation. By providing more accurate predictions of satellite clock bias, the findings of this research can be integrated into the algorithms of such devices, enabling them to function with heightened precision while conserving power. Improved accuracy in clock bias predictions ensures that low-power receivers can maintain optimal performance levels, thereby enhancing the overall reliability and effectiveness of satellite navigation systems. Consequently, this advancement holds promise for a wide range of applications, including remote areas, IoT devices, wearable technology, and other devices where power efficiency and navigation accuracy are paramount.
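下面是一个最小化的 PyTorch LSTM 单变量时序预测草图,示意"滑动窗口 + LSTM + RMSE"的做法;网络规模、窗口长度与合成的正弦数据均为本文假设,并非论文配置:

```python
import torch
import torch.nn as nn

class ClockBiasLSTM(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):             # x: (batch, window, 1)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])  # predict the next single-difference value

series = torch.sin(torch.linspace(0, 20, 500)).unsqueeze(-1)  # toy sequence
window = 32
X = torch.stack([series[i:i + window] for i in range(len(series) - window)])
y = series[window:]

model = ClockBiasLSTM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(X), y)
    loss.backward()
    opt.step()
print("train RMSE:", loss.sqrt().item())
```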
[AI-26] A neural-network based anomaly detection system and a safety protocol to protect vehicular network
链接: https://arxiv.org/abs/2411.07013
作者: Marco Franceschini
关键词-EN: Intelligent Transport Systems, Cooperative Intelligent Transport, Intelligent Transport, improve road safety, accurate data exchange
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
*备注: Master’s thesis 2023-2024
点击查看摘要
Abstract:This thesis addresses the use of Cooperative Intelligent Transport Systems (CITS) to improve road safety and efficiency by enabling vehicle-to-vehicle communication, highlighting the importance of secure and accurate data exchange. To ensure safety, the thesis proposes a Machine Learning-based Misbehavior Detection System (MDS) using Long Short-Term Memory (LSTM) networks to detect and mitigate incorrect or misleading messages within vehicular networks. Trained offline on the VeReMi dataset, the detection model is tested in real-time within a platooning scenario, demonstrating that it can prevent nearly all accidents caused by misbehavior by triggering a defense protocol that dissolves the platoon if anomalies are detected. The results show that while the system can accurately detect general misbehavior, it struggles to label specific types due to varying traffic conditions, implying the difficulty of creating a universally adaptive protocol. However, the thesis suggests that with more data and further refinement, this MDS could be implemented in real-world CITS, enhancing driving safety by mitigating risks from misbehavior in cooperative driving networks.
[AI-27] Permutative redundancy and uncertainty of the objective in deep learning
链接: https://arxiv.org/abs/2411.07008
作者: Vacslav Glukhov
关键词-EN: traditional deep learning, deep learning architectures, Implications of uncertain, uncertain objective functions, functions and permutative
类目: Artificial Intelligence (cs.AI)
*备注: 22 pages, 3 figures
点击查看摘要
Abstract:Implications of uncertain objective functions and permutative symmetry of traditional deep learning architectures are discussed. It is shown that traditional architectures are polluted by an astronomical number of equivalent global and local optima. Uncertainty of the objective makes local optima unattainable, and, as the size of the network grows, the global optimization landscape likely becomes a tangled web of valleys and ridges. Some remedies which reduce or eliminate ghost optima are discussed including forced pre-pruning, re-ordering, ortho-polynomial activations, and modular bio-inspired architectures.
[AI-28] Non-Adversarial Inverse Reinforcement Learning via Successor Feature Matching
链接: https://arxiv.org/abs/2411.07007
作者: Arnav Kumar Jain,Harley Wiltzer,Jesse Farebrother,Irina Rish,Glen Berseth,Sanjiban Choudhury
关键词-EN: inverse reinforcement learning, inverse reinforcement, agent seeks, seeks to replicate, IRL
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:In inverse reinforcement learning (IRL), an agent seeks to replicate expert demonstrations through interactions with the environment. Traditionally, IRL is treated as an adversarial game, where an adversary searches over reward models, and a learner optimizes the reward through repeated RL procedures. This game-solving approach is both computationally expensive and difficult to stabilize. In this work, we propose a novel approach to IRL by direct policy optimization: exploiting a linear factorization of the return as the inner product of successor features and a reward vector, we design an IRL algorithm by policy gradient descent on the gap between the learner and expert features. Our non-adversarial method does not require learning a reward function and can be solved seamlessly with existing actor-critic RL algorithms. Remarkably, our approach works in state-only settings without expert action labels, a setting which behavior cloning (BC) cannot solve. Empirical results demonstrate that our method learns from as few as a single expert demonstration and achieves improved performance on various control tasks.
[AI-29] Estimating Causal Effects in Partially Directed Parametric Causal Factor Graphs
链接: https://arxiv.org/abs/2411.07006
作者: Malte Luttermann,Tanya Braun,Ralf Möller,Marcel Gehrke
关键词-EN: maintaining exact answers, probabilistic relational models, causal factor graphs, exact answers, parametric causal factor
类目: Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注: Accepted to the Proceedings of the 16th International Conference on Scalable Uncertainty Management (SUM 2024)
点击查看摘要
Abstract:Lifting uses a representative of indistinguishable individuals to exploit symmetries in probabilistic relational models, denoted as parametric factor graphs, to speed up inference while maintaining exact answers. In this paper, we show how lifting can be applied to causal inference in partially directed graphs, i.e., graphs that contain both directed and undirected edges to represent causal relationships between random variables. We present partially directed parametric causal factor graphs (PPCFGs) as a generalisation of previously introduced parametric causal factor graphs, which require a fully directed graph. We further show how causal inference can be performed on a lifted level in PPCFGs, thereby extending the applicability of lifted causal inference to a broader range of models requiring less prior knowledge about causal relationships.
[AI-30] Enhancing Robot Assistive Behaviour with Reinforcement Learning and Theory of Mind
链接: https://arxiv.org/abs/2411.07003
作者: Antonio Andriella,Giovanni Falcone,Silvia Rossi
关键词-EN: Theory of Mind, effective human-robot collaboration, interpret humans’ beliefs, achieving effective human-robot, human-robot collaboration
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注:
点击查看摘要
Abstract:The adaptation to users’ preferences and the ability to infer and interpret humans’ beliefs and intents, which is known as the Theory of Mind (ToM), are two crucial aspects for achieving effective human-robot collaboration. Despite its importance, very few studies have investigated the impact of adaptive robots with ToM abilities. In this work, we present an exploratory comparative study to investigate how social robots equipped with ToM abilities impact users’ performance and perception. We design a two-layer architecture. The Q-learning agent on the first layer learns the robot’s higher-level behaviour. On the second layer, a heuristic-based ToM infers the user’s intended strategy and is responsible for implementing the robot’s assistance, as well as providing the motivation behind its choice. We conducted a user study in a real-world setting, involving 56 participants who interacted with either an adaptive robot capable of ToM, or with a robot lacking such abilities. Our findings suggest that participants in the ToM condition performed better, accepted the robot’s assistance more often, and perceived its ability to adapt, predict and recognise their intents to a higher degree. Our preliminary insights could inform future research and pave the way for designing more complex computation architectures for adaptive behaviour with ToM capabilities.
[AI-31] Which PPML Would a User Choose? A Structured Decision Support Framework for Developers to Rank PPML Techniques Based on User Acceptance Criteria
链接: https://arxiv.org/abs/2411.06995
作者: Sascha Löbner,Sebastian Pape,Vanessa Bracamonte,Kittiphop Phalakarn
关键词-EN: machine learning approach, machine learning, needed computational power, Preserving Machine Learning, machine learning based
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注:
点击查看摘要
Abstract:Using Privacy-Enhancing Technologies (PETs) for machine learning often influences the characteristics of a machine learning approach, e.g., the needed computational power, timing of the answers or how the data can be utilized. When designing a new service, the developer faces the problem that some decisions require a trade-off. For example, the use of a PET may cause a delay in the responses or adding noise to the data to improve the users’ privacy might have a negative impact on the accuracy of the machine learning approach. As of now, there is no structured way in which the users’ perception of a machine learning based service can contribute to the selection of Privacy Preserving Machine Learning (PPML) methods. This is especially a challenge since one cannot assume that users have a deep technical understanding of these technologies. Therefore, they can only be asked about certain attributes that they can perceive when using the service and not directly which PPML they prefer. This study introduces a decision support framework with the aim of supporting the selection of PPML technologies based on user preferences. Based on prior work analysing User Acceptance Criteria (UAC), we translate these criteria into differentiating characteristics for various PPML techniques. As a final result, we achieve a technology ranking based on the User Acceptance Criteria while providing technology insights for the developers. We demonstrate its application using the use case of classifying privacy-relevant information. Our contribution consists of the decision support framework, which comprises a process to connect PPML technologies with UAC, a process for evaluating the characteristics that separate PPML techniques, and a ranking method to evaluate the best PPML technique for the use case.
[AI-32] Imitation from Diverse Behaviors: Wasserstein Quality Diversity Imitation Learning with Single-Step Archive Exploration
链接: https://arxiv.org/abs/2411.06965
作者: Xingrui Yu,Zhenglin Wan,David Mark Bossens,Yueming Lyu,Qing Guo,Ivor W. Tsang
关键词-EN: diversity imitation learning, imitation learning, quality diversity imitation, Traditional imitation learning, diverse and high-performance
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Learning diverse and high-performance behaviors from a limited set of demonstrations is a grand challenge. Traditional imitation learning methods usually fail in this task because most of them are designed to learn one specific behavior even with multiple demonstrations. Therefore, novel techniques for quality diversity imitation learning are needed to solve the above challenge. This work introduces Wasserstein Quality Diversity Imitation Learning (WQDIL), which 1) improves the stability of imitation learning in the quality diversity setting with latent adversarial training based on a Wasserstein Auto-Encoder (WAE), and 2) mitigates a behavior-overfitting issue using a measure-conditioned reward function with a single-step archive exploration bonus. Empirically, our method significantly outperforms state-of-the-art IL methods, achieving near-expert or beyond-expert QD performance on the challenging continuous control tasks derived from MuJoCo environments.
[AI-33] ENAT: Rethinking Spatial-temporal Interactions in Token-based Image Synthesis NEURIPS2024
链接: https://arxiv.org/abs/2411.06959
作者: Zanlin Ni,Yulin Wang,Renping Zhou,Yizeng Han,Jiayi Guo,Zhiyuan Liu,Yuan Yao,Gao Huang
关键词-EN: tokens, NATs, mask tokens, visible tokens, Recently
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted by NeurIPS2024
点击查看摘要
Abstract:Recently, token-based generation has demonstrated its effectiveness in image synthesis. As a representative example, non-autoregressive Transformers (NATs) can generate decent-quality images in a few steps. NATs perform generation in a progressive manner, where the latent tokens of a resulting image are incrementally revealed. At each step, the unrevealed image regions are padded with mask tokens and inferred by the NAT. In this paper, we delve into the mechanisms behind the effectiveness of NATs and uncover two important patterns that naturally emerge from NATs: Spatially (within a step), although mask and visible tokens are processed uniformly by NATs, the interactions between them are highly asymmetric. Specifically, mask tokens mainly gather information for decoding, while visible tokens tend to primarily provide information, and their deep representations can be built only upon themselves. Temporally (across steps), the interactions between adjacent generation steps mostly concentrate on updating the representations of a few critical tokens, while the computation for the majority of tokens is generally repetitive. Driven by these findings, we propose EfficientNAT (ENAT), a NAT model that explicitly encourages these critical interactions inherent in NATs. At the spatial level, we disentangle the computations of visible and mask tokens by encoding visible tokens independently, while decoding mask tokens conditioned on the fully encoded visible tokens. At the temporal level, we prioritize the computation of the critical tokens at each step, while maximally reusing previously computed token representations to supplement necessary information. ENAT improves the performance of NATs notably with significantly reduced computational cost. Experiments on ImageNet-256, ImageNet-512 and MS-COCO validate the effectiveness of ENAT. Code is available at this https URL.
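ENAT 在空间维度上的"解耦"可以用标准 Transformer 模块粗略示意:先只对可见 token 编码,再让 mask token 通过 cross-attention 从可见 token 的表示中解码;以下超参数与接线均为示意性假设,并非论文实现:

```python
import torch
import torch.nn as nn

class DisentangledNATBlock(nn.Module):
    def __init__(self, dim=256, heads=8, depth=2):
        super().__init__()
        enc = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        dec = nn.TransformerDecoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, depth)
        self.decoder = nn.TransformerDecoder(dec, depth)

    def forward(self, visible_tokens, mask_tokens):
        memory = self.encoder(visible_tokens)     # visible tokens encoded alone
        return self.decoder(mask_tokens, memory)  # masks attend to visible ones

block = DisentangledNATBlock()
visible = torch.randn(2, 100, 256)  # (batch, n_visible, dim)
masked = torch.randn(2, 156, 256)   # (batch, n_mask, dim)
print(block(visible, masked).shape) # torch.Size([2, 156, 256])
```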
[AI-34] Multi-modal Iterative and Deep Fusion Frameworks for Enhanced Passive DOA Sensing via a Green Massive H2AD MIMO Receiver
链接: https://arxiv.org/abs/2411.06927
作者: Jiatong Bai,Minghao Chen,Wankai Tang,Yifan Li,Cunhua Pan,Yongpeng Wu,Feng Shu
关键词-EN: source incident angles, existing DOA estimation, assume ideal source, ideal source incident, DOA estimation
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Most existing DOA estimation methods assume ideal source incident angles with minimal noise. Moreover, directly using pre-estimated angles to calculate weighted coefficients can lead to performance loss. Thus, a green multi-modal (MM) fusion DOA framework is proposed to realize a more practical, low-cost and time-efficient DOA estimation for an H^2AD array. Firstly, two more efficient clustering methods, global maximum cos_similarity clustering (GMaxCS) and global minimum distance clustering (GMinD), are presented to infer more precise true solutions from the candidate solution sets. Based on this, an iteration weighted fusion (IWF)-based method is introduced to iteratively update the weighted fusion coefficients and the clustering center of the true solution classes by using the estimated values. In particular, the coarse DOA calculated by the fully digital (FD) subarray serves as the initial cluster center. The above process yields two methods called MM-IWF-GMaxCS and MM-IWF-GMinD. To further provide higher-accuracy DOA estimation, a fusion network (fusionNet) is proposed to aggregate the inferred two-part true angles, which generates two effective approaches called MM-fusionNet-GMaxCS and MM-fusionNet-GMinD. Simulation outcomes show that the proposed four approaches can achieve ideal DOA performance and the CRLB. Meanwhile, the proposed MM-fusionNet-GMaxCS and MM-fusionNet-GMinD exhibit superior DOA performance compared to MM-IWF-GMaxCS and MM-IWF-GMinD, especially in the extremely-low SNR range.
[AI-35] Slowing Down Forgetting in Continual Learning
链接: https://arxiv.org/abs/2411.06916
作者: Pascal Janetzky,Tobias Schlagenhauf,Stefan Feuerriegel
关键词-EN: common challenge, challenge in continual, framework, continual learning, additional tasks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:A common challenge in continual learning (CL) is catastrophic forgetting, where the performance on old tasks drops after new, additional tasks are learned. In this paper, we propose a novel framework called ReCL to slow down forgetting in CL. Our framework exploits an implicit bias of gradient-based neural networks due to which these converge to margin maximization points. Such convergence points allow us to reconstruct old data from previous tasks, which we then combine with the current training data. Our framework is flexible and can be applied on top of existing, state-of-the-art CL methods to slow down forgetting. We further demonstrate the performance gain from our framework across a large series of experiments, including different CL scenarios (class incremental, domain incremental, and task incremental learning), different datasets (MNIST, CIFAR10), and different network architectures. Across all experiments, we find large performance gains through ReCL. To the best of our knowledge, our framework is the first to address catastrophic forgetting by leveraging models in CL as their own memory buffers.
[AI-36] Gaussian Process Emulators for Few-Shot Segmentation in Cardiac MRI
链接: https://arxiv.org/abs/2411.06911
作者: Bruno Viti,Franz Thaler,Kathrin Lisa Kapper,Martin Urschler,Martin Holler,Elias Karabelas
关键词-EN: helping to diagnose, cardiovascular diseases, analysis and assessment, diagnose and treat, treat various cardiovascular
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Submitted to Statistical Atlases and Computational Modeling of the Heart (STACOM) 2024
点击查看摘要
Abstract:Segmentation of cardiac magnetic resonance images (MRI) is crucial for the analysis and assessment of cardiac function, helping to diagnose and treat various cardiovascular diseases. Most recent techniques rely on deep learning and usually require an extensive amount of labeled data. To overcome this problem, few-shot learning has the capability of reducing data dependency on labeled data. In this work, we introduce a new method that merges few-shot learning with a U-Net architecture and Gaussian Process Emulators (GPEs), enhancing data integration from a support set for improved performance. GPEs are trained to learn the relation between the support images and the corresponding masks in latent space, facilitating the segmentation of unseen query images given only a small labeled support set at inference. We test our model with the M&Ms-2 public dataset to assess its ability to segment the heart in cardiac magnetic resonance imaging from different orientations, and compare it with state-of-the-art unsupervised and few-shot methods. Our architecture shows higher DICE coefficients compared to these methods, especially in the more challenging setups where the size of the support set is considerably small.
[AI-37] GraphRPM: Risk Pattern Mining on Industrial Large Attributed Graphs ECML PKDD 2024
链接: https://arxiv.org/abs/2411.06878
作者: Sheng Tian,Xintan Zeng,Yifei Hu,Baokun Wang,Yongchao Liu,Yue Jin,Changhua Meng,Chuntao Hong,Tianyi Zhang,Weiqiang Wang
关键词-EN: offering enhanced interpretability, black-box models commonly, models commonly utilized, industrial companies due, Graph-based patterns
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Social and Information Networks (cs.SI)
*备注: Accepted by ECML PKDD 2024
点击查看摘要
Abstract:Graph-based patterns are extensively employed and favored by practitioners within industrial companies due to their capacity to represent the behavioral attributes and topological relationships among users, thereby offering enhanced interpretability in comparison to black-box models commonly utilized for classification and recognition tasks. For instance, within the scenario of transaction risk management, a graph pattern that is characteristic of a particular risk category can be readily employed to discern transactions fraught with risk, delineate networks of criminal activity, or investigate the methodologies employed by fraudsters. Nonetheless, graph data in industrial settings is often characterized by its massive scale, encompassing data sets with millions or even billions of nodes, making the manual extraction of graph patterns not only labor-intensive but also necessitating specialized knowledge in particular domains of risk. Moreover, existing methodologies for mining graph patterns encounter significant obstacles when tasked with analyzing large-scale attributed graphs. In this work, we introduce GraphRPM, an industry-purpose parallel and distributed risk pattern mining framework on large attributed graphs. The framework incorporates a novel edge-involved graph isomorphism network alongside optimized operations for parallel graph computation, which collectively contribute to a considerable reduction in computational complexity and resource expenditure. Moreover, the intelligent filtration of efficacious risky graph patterns is facilitated by the proposed evaluation metrics. Comprehensive experimental evaluations conducted on real-world datasets of varying sizes substantiate the capability of GraphRPM to adeptly address the challenges inherent in mining patterns from large-scale industrial attributed graphs, thereby underscoring its substantial value for industrial deployment.
[AI-38] Multi-Modal interpretable automatic video captioning
链接: https://arxiv.org/abs/2411.06872
作者: Antoine Hanna-Asaad,Decky Aspandi,Titus Zaharia
关键词-EN: natural language format, describe video contents, Video captioning aims, interpreting scenes, actions and events
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Video captioning aims to describe video contents in natural language, which involves understanding and interpreting scenes, actions and events that occur simultaneously in the view. Current approaches have mainly concentrated on visual cues, often neglecting the rich information available from the other important modality, audio, including its inter-dependencies with the visual stream. In this work, we introduce a novel video captioning method trained with a multi-modal contrastive loss that emphasizes both multi-modal integration and interpretability. Our approach is designed to capture the dependency between these modalities, resulting in more accurate, thus pertinent captions. Furthermore, we highlight the importance of interpretability, employing multiple attention mechanisms that provide explanations of the model’s decision-making process. Our experimental results demonstrate that our proposed method performs favorably against the state-of-the-art models on the commonly used benchmark datasets MSR-VTT and VATEX.
[AI-39] AI-Native Multi-Access Future Networks – The REASON Architecture
链接: https://arxiv.org/abs/2411.06870
作者: Konstantinos Katsaros,Ioannis Mavromatis,Kostantinos Antonakoglou,Saptarshi Ghosh,Dritan Kaleshi,Toktam Mahmoodi,Hamid Asgari,Anastasios Karousos,Iman Tavakkolnia,Hossein Safi,Harald Hass,Constantinos Vrontos,Amin Emami,Juan Parra Ullauri,Shadi Moazzeni,Dimitra Simeonidou
关键词-EN: past years, sixth generation, generation of communication, gaining momentum, multiple access technologies
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注: Accepted for publication at IEEE Access
点击查看摘要
Abstract:The development of the sixth generation of communication networks (6G) has been gaining momentum over the past years, with a target of being introduced by 2030. Several initiatives worldwide are developing innovative solutions and setting the direction for the key features of these networks. Some common emerging themes are the tight integration of AI, the convergence of multiple access technologies and sustainable operation, aiming to meet stringent performance and societal requirements. To that end, we are introducing REASON - Realising Enabling Architectures and Solutions for Open Networks. The REASON project aims to address technical challenges in future network deployments, such as E2E service orchestration, sustainability, security and trust management, and policy management, utilising AI-native principles, considering multiple access technologies and cloud-native solutions. This paper presents REASON’s architecture and the identified requirements for future networks. The architecture is meticulously designed for modularity, interoperability, scalability, simplified troubleshooting, flexibility, and enhanced security, taking into consideration current and future standardisation efforts, and the ease of implementation and training. It is structured into four horizontal layers: Physical Infrastructure, Network Service, Knowledge, and End-User Application, complemented by two vertical layers: Management and Orchestration, and E2E Security. This layered approach ensures a robust, adaptable framework to support the diverse and evolving requirements of 6G networks, fostering innovation and facilitating seamless integration of advanced technologies.
[AI-40] Computable Model-Independent Bounds for Adversarial Quantum Machine Learning
链接: https://arxiv.org/abs/2411.06863
作者: Bacui Li,Tansu Alpcan,Chandra Thapa,Udaya Parampalli
关键词-EN: QML opens doors, offers potential speedup, QML, machine learning, machine learning models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Quantum Physics (quant-ph)
*备注: 21 pages, 9 figures
点击查看摘要
Abstract:By leveraging the principles of quantum mechanics, quantum machine learning (QML) opens doors to novel approaches in machine learning and offers potential speedups. However, machine learning models are well-documented to be vulnerable to malicious manipulations, and this susceptibility extends to the models of QML. This situation necessitates a thorough understanding of QML’s resilience against adversarial attacks, particularly in an era where quantum computing capabilities are expanding. In this regard, this paper examines model-independent bounds on adversarial performance for QML. To the best of our knowledge, we introduce the first computation of an approximate lower bound for adversarial error when evaluating model resilience against sophisticated quantum-based adversarial attacks. Experimental results are compared to the computed bound, demonstrating the potential of QML models to achieve high robustness. In the best case, the experimental error is only 10% above the estimated bound, offering evidence of the inherent robustness of quantum models. This work not only advances our theoretical understanding of quantum model resilience but also provides a precise reference bound for the future development of robust QML algorithms.
[AI-41] Enhancing Phishing Detection through Feature Importance Analysis and Explainable AI: A Comparative Study of CatBoost XGBoost and EBM Models
链接: https://arxiv.org/abs/2411.06860
作者: Abdullah Fajar,Setiadi Yazid,Indra Budi
关键词-EN: Phishing attacks remain, demanding robust detection, robust detection methods, online security, demanding robust
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Phishing attacks remain a persistent threat to online security, demanding robust detection methods. This study investigates the use of machine learning to identify phishing URLs, emphasizing the crucial role of feature selection and model interpretability for improved performance. Employing Recursive Feature Elimination, the research pinpointed key features like “length_url,” “time_domain_activation” and “Page_rank” as strong indicators of phishing attempts. The study evaluated various algorithms, including CatBoost, XGBoost, and Explainable Boosting Machine, assessing their robustness and scalability. XGBoost emerged as highly efficient in terms of runtime, making it well-suited for large datasets. CatBoost, on the other hand, demonstrated resilience by maintaining high accuracy even with reduced features. To enhance transparency and trustworthiness, Explainable AI techniques, such as SHAP, were employed to provide insights into feature importance. The study’s findings highlight that effective feature selection and model interpretability can significantly bolster phishing detection systems, paving the way for more efficient and adaptable defenses against evolving cyber threats.
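As a rough illustration of the pipeline the abstract describes (RFE for feature selection, a gradient-boosted classifier, SHAP for interpretation), the sketch below wires the scikit-learn, XGBoost and SHAP libraries together on synthetic data. The dataset, the choice of 10 retained features, and the XGBoost hyperparameters are placeholders, not the study's settings.

```python
import shap
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder data standing in for URL features such as "length_url",
# "time_domain_activation" and "Page_rank".
X, y = make_classification(n_samples=1000, n_features=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Recursive Feature Elimination keeps the strongest phishing indicators.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10)
X_train_sel = selector.fit_transform(X_train, y_train)
X_test_sel = selector.transform(X_test)

model = xgb.XGBClassifier(n_estimators=300, max_depth=6, eval_metric="logloss")
model.fit(X_train_sel, y_train)
print("test accuracy:", model.score(X_test_sel, y_test))

# SHAP values explain which features drive each individual prediction.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test_sel)
```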
[AI-42] Scientific machine learning in ecological systems: A study on the predator-prey dynamics
链接: https://arxiv.org/abs/2411.06858
作者: Ranabir Devgupta,Raj Abhijit Dandekar,Rajat Dandekar,Sreedath Panat
关键词-EN: Lotka Volterra Predator, Volterra Predator Prey, Predator Prey Model, Ordinary Differential Equations, Universal Differential Equations
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: 16 pages, 7 figures, 1 table
点击查看摘要
Abstract:In this study, we apply two pillars of Scientific Machine Learning: Neural Ordinary Differential Equations (Neural ODEs) and Universal Differential Equations (UDEs) to the Lotka Volterra Predator Prey Model, a fundamental ecological model describing the dynamic interactions between predator and prey populations. The Lotka-Volterra model is critical for understanding ecological dynamics, population control, and species interactions, as it is represented by a system of differential equations. In this work, we aim to uncover the underlying differential equations without prior knowledge of the system, relying solely on training data and neural networks. Using robust modeling in the Julia programming language, we demonstrate that both Neural ODEs and UDEs can be effectively utilized for prediction and forecasting of the Lotka-Volterra system. More importantly, we introduce the forecasting breakdown point: the time at which forecasting fails for both Neural ODEs and UDEs. We observe how UDEs outperform Neural ODEs by effectively recovering the underlying dynamics and achieving accurate forecasting with significantly less training data. Additionally, we introduce Gaussian noise of varying magnitudes (from mild to high) to simulate real-world data perturbations and show that UDEs exhibit superior robustness, effectively recovering the underlying dynamics even in the presence of noisy data, while Neural ODEs struggle with high levels of noise. Through extensive hyperparameter optimization, we offer insights into neural network architectures, activation functions, and optimizers that yield the best results. This study opens the door to applying Scientific Machine Learning frameworks for forecasting tasks across a wide range of ecological and scientific domains.
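The study itself is implemented in Julia; as a language-agnostic illustration of the Neural ODE half of the comparison, here is a minimal PyTorch sketch that generates Lotka-Volterra trajectories and fits an MLP vector field with the third-party torchdiffeq package (assumed installed). The parameter values, network size, and training schedule are illustrative only.

```python
import torch
import torch.nn as nn
from torchdiffeq import odeint  # pip install torchdiffeq

def lotka_volterra(t, z, a=1.5, b=1.0, c=3.0, d=1.0):
    """Ground-truth predator-prey dynamics used to generate training data."""
    x, y = z[..., 0], z[..., 1]
    return torch.stack([a * x - b * x * y, d * x * y - c * y], dim=-1)

t = torch.linspace(0.0, 10.0, 200)
z0 = torch.tensor([1.0, 1.0])
with torch.no_grad():
    true_traj = odeint(lotka_volterra, z0, t)  # (200, 2) reference trajectory

class ODEFunc(nn.Module):
    """A small MLP standing in for the unknown vector field."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 2))
    def forward(self, t, z):
        return self.net(z)

func = ODEFunc()
opt = torch.optim.Adam(func.parameters(), lr=1e-3)
for step in range(500):  # toy training loop
    opt.zero_grad()
    pred = odeint(func, z0, t)          # integrate the learned field
    loss = ((pred - true_traj) ** 2).mean()
    loss.backward()
    opt.step()
```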
[AI-43] Learning Interpretable Network Dynamics via Universal Neural Symbolic Regression
链接: https://arxiv.org/abs/2411.06833
作者: Jiao Hu,Jiaxu Cui,Bo Yang
关键词-EN: Discovering governing equations, Discovering governing, rich data, assist in decision-making, fundamental challenge
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Symbolic Computation (cs.SC)
*备注: preprint
点击查看摘要
Abstract:Discovering governing equations of complex network dynamics is a fundamental challenge in contemporary science with rich data, which can uncover the mysterious patterns and mechanisms of the formation and evolution of complex phenomena in various fields and assist in decision-making. In this work, we develop a universal computational tool that can automatically, efficiently, and accurately learn the symbolic changing patterns of complex system states by combining the excellent fitting ability from deep learning and the equation inference ability from pre-trained symbolic regression. We conduct intensive experimental verifications on more than ten representative scenarios from physics, biochemistry, ecology, epidemiology, etc. Results demonstrate the outstanding effectiveness and efficiency of our tool by comparing with the state-of-the-art symbolic regression techniques for network dynamics. The application to real-world systems including global epidemic transmission and pedestrian movements has verified its practical applicability. We believe that our tool can serve as a universal solution to dispel the fog of hidden mechanisms of changes in complex phenomena, advance toward interpretability, and inspire more scientific discoveries.
[AI-44] Combining Domain and Alignment Vectors to Achieve Better Knowledge-Safety Trade-offs in LLM s
链接: https://arxiv.org/abs/2411.06824
作者: Megh Thakkar,Yash More,Quentin Fournier,Matthew Riemer,Pin-Yu Chen,Amal Zouaq,Payel Das,Sarath Chandar
关键词-EN: general-purpose instruction-tuned counterparts, specific technical fields, technical fields compared, training domain-expert LLMs, instruction-tuned counterparts
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:There is a growing interest in training domain-expert LLMs that excel in specific technical fields compared to their general-purpose instruction-tuned counterparts. However, these expert models often experience a loss in their safety abilities in the process, making them capable of generating harmful content. As a solution, we introduce an efficient and effective merging-based alignment method called MergeAlign that interpolates the domain and alignment vectors, creating safer domain-specific models while preserving their utility. We apply MergeAlign on Llama3 variants that are experts in medicine and finance, obtaining substantial alignment improvements with minimal to no degradation on domain-specific benchmarks. We study the impact of model merging through model similarity metrics and contributions of individual models being merged. We hope our findings open new research avenues and inspire more efficient development of safe expert LLMs.
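Reading the abstract, the merge appears to operate on "task vectors" (weight deltas relative to a shared base model). The sketch below shows one plausible interpolation of that kind; the mixing rule and the weight `alpha` are assumptions, since the paper's exact formula is not given in the abstract.

```python
import torch

def merge_align(base_sd, domain_sd, aligned_sd, alpha=0.5):
    """Interpolate a domain task vector with an alignment task vector.

    Each argument is a state_dict from models sharing one architecture
    (floating-point weight tensors assumed):
        merged = base + alpha * (domain - base) + (1 - alpha) * (aligned - base)
    alpha is a hypothetical mixing weight, not the paper's setting.
    """
    merged = {}
    for name, base_w in base_sd.items():
        domain_vec = domain_sd[name] - base_w   # domain task vector
        align_vec = aligned_sd[name] - base_w   # alignment task vector
        merged[name] = base_w + alpha * domain_vec + (1 - alpha) * align_vec
    return merged
```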
[AI-45] Generative midtended cognition and Artificial Intelligence. Thinging with thinging things
链接: https://arxiv.org/abs/2411.06812
作者: Xabier E. Barandiaran,Marta Pérez-Verdugo
关键词-EN: generative midtended cognition, generative, generative midtended, exploring the integration, midtended cognition
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注: 16 pages, 2 figures. Submitted to “Synthese” Journal, accepted
点击查看摘要
Abstract:This paper introduces the concept of “generative midtended cognition”, exploring the integration of generative AI with human cognition. The term “generative” reflects AI’s ability to iteratively produce structured outputs, while “midtended” captures the potential hybrid (human-AI) nature of the process. It stands between traditional conceptions of intended creation, understood as directed from within, and extended processes that bring exo-biological processes into the creative process. We examine current generative technologies (based on multimodal transformer architectures typical of large language models like ChatGPT), to explain how they can transform human cognitive agency beyond what standard theories of extended cognition can capture. We suggest that the type of cognitive activity typical of the coupling between a human and generative technologies is closer (but not equivalent) to social cognition than to classical extended cognitive paradigms. Yet, it deserves a specific treatment. We provide an explicit definition of generative midtended cognition in which we treat interventions by AI systems as constitutive of the agent’s intentional creative processes. Furthermore, we distinguish two dimensions of generative hybrid creativity: 1. Width: captures the sensitivity to the context of the generative process (from the single letter to the whole historical and surrounding data), 2. Depth: captures the granularity of iteration loops involved in the process. Generative midtended cognition stands in the middle depth between conversational forms of cognition in which complete utterances or creative units are exchanged, and micro-cognitive (e.g. neural) subpersonal processes. Finally, the paper discusses the potential risks and benefits of widespread generative AI adoption, including the challenges of authenticity, generative power asymmetry, and creative boost or atrophy.
[AI-46] JPEG AI Image Compression Visual Artifacts: Detection Methods and Dataset
链接: https://arxiv.org/abs/2411.06810
作者: Daria Tsereh,Mark Mirgaleev,Ivan Molodetskikh,Roman Kazantsev,Dmitriy Vatolin
关键词-EN: Learning-based image compression, Learning-based image, improved in recent, recent years, years and started
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*备注:
点击查看摘要
Abstract:Learning-based image compression methods have improved in recent years and started to outperform traditional codecs. However, neural-network approaches can unexpectedly introduce visual artifacts in some images. We therefore propose methods to separately detect three types of artifacts (texture and boundary degradation, color change, and text corruption), to localize the affected regions, and to quantify the artifact strength. We consider only those regions that exhibit distortion due solely to the neural compression but that a traditional codec recovers successfully at a comparable bitrate. We employed our methods to collect artifacts for the JPEG AI verification model with respect to HM-18.0, the H.265 reference software. We processed about 350,000 unique images from the Open Images dataset using different compression-quality parameters; the result is a dataset of 46,440 artifacts validated through crowd-sourced subjective assessment. Our proposed dataset and methods are valuable for testing neural-network-based image codecs, identifying bugs in these codecs, and enhancing their performance. We make source code of the methods and the dataset publicly available.
[AI-47] Evolving Efficient Genetic Encoding for Deep Spiking Neural Networks
链接: https://arxiv.org/abs/2411.06792
作者: Wenxuan Pan,Feifei Zhao,Bing Han,Haibo Tong,Yi Zeng
关键词-EN: Spiking Neural Networks, Artificial Neural Networks, Spiking Neural, Neural Networks, Artificial Neural
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:By exploiting discrete signal processing and simulating brain neuron communication, Spiking Neural Networks (SNNs) offer a low-energy alternative to Artificial Neural Networks (ANNs). However, existing SNN models still face high computational costs due to the numerous time steps as well as network depth and scale. The tens of billions of neurons and trillions of synapses in the human brain are developed from only 20,000 genes, which inspires us to design an efficient genetic encoding strategy that dynamically evolves to regulate large-scale deep SNNs at low cost. Therefore, we first propose a genetically scaled SNN encoding scheme that incorporates globally shared genetic interactions to indirectly optimize neuronal encoding instead of weights, which brings clear reductions in parameters and energy consumption. Then, a spatio-temporal evolutionary framework is designed to optimize the inherently initial wiring rules. Two dynamic regularization operators in the fitness function evolve the neuronal encoding to a suitable distribution and enhance the information quality of the genetic interactions, respectively, substantially accelerating evolutionary speed and improving efficiency. Experiments show that our approach compresses parameters by approximately 50% to 80%, while outperforming models on the same architectures by 0.21% to 4.38% on CIFAR-10, CIFAR-100 and ImageNet. In summary, the consistent trends of the proposed genetically encoded spatio-temporal evolution across different datasets and architectures highlight its significant enhancements in terms of efficiency, broad scalability and robustness, demonstrating the advantages of brain-inspired evolutionary genetic coding for SNN optimization.
[AI-48] ScaleKD: Strong Vision Transformers Could Be Excellent Teachers NEURIPS2024
链接: https://arxiv.org/abs/2411.06786
作者: Jiawei Fan,Chao Li,Xiaolong Liu,Anbang Yao
关键词-EN: pre-trained vision transformer, vision transformer, advance cross architecture, knowledge density differences, exhibit scalable properties
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: This work is accepted to NeurIPS 2024. The project page: this https URL
点击查看摘要
Abstract:In this paper, we question if well pre-trained vision transformer (ViT) models could be used as teachers that exhibit scalable properties to advance cross architecture knowledge distillation (KD) research, in the context of using large-scale datasets for evaluation. To make this possible, our analysis underlines the importance of seeking effective strategies to align (1) feature computing paradigm differences, (2) model scale differences, and (3) knowledge density differences. By combining three coupled components namely cross attention projector, dual-view feature mimicking and teacher parameter perception tailored to address the above problems, we present a simple and effective KD method, called ScaleKD. Our method can train student backbones that span across a variety of convolutional neural network (CNN), multi-layer perceptron (MLP), and ViT architectures on image classification datasets, achieving state-of-the-art distillation performance. For instance, taking a well pre-trained Swin-L as the teacher model, our method gets 75.15%|82.03%|84.16%|78.63%|81.96%|83.93%|83.80%|85.53% top-1 accuracies for MobileNet-V1|ResNet-50|ConvNeXt-T|Mixer-S/16|Mixer-B/16|ViT-S/16|Swin-T|ViT-B/16 models trained on ImageNet-1K dataset from scratch, showing 3.05%|3.39%|2.02%|4.61%|5.52%|4.03%|2.62%|3.73% absolute gains to the individually trained counterparts. Intriguingly, when scaling up the size of teacher models or their pre-training datasets, our method showcases the desired scalable properties, bringing increasingly larger gains to student models. The student backbones trained by our method transfer well on downstream MS-COCO and ADE20K datasets. More importantly, our method could be used as a more efficient alternative to the time-intensive pre-training paradigm for any target student model if a strong pre-trained ViT is available, reducing the amount of viewed training samples up to 195x.
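ScaleKD's specific components (cross attention projector, dual-view feature mimicking, teacher parameter perception) are not reproduced here; for orientation, the snippet below shows only the generic logit-distillation baseline that such methods build upon, with an assumed temperature and mixing weight.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Standard logit distillation: a KL term toward the teacher's softened
    distribution plus cross-entropy to the ground truth. T and alpha are
    illustrative hyperparameters, not ScaleKD's settings.
    """
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)  # T^2 rescales gradients
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```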
[AI-49] QuadWBG: Generalizable Quadrupedal Whole-Body Grasping
链接: https://arxiv.org/abs/2411.06782
作者: Jilong Wang,Javokhirbek Rajabov,Chaoyi Xu,Yiming Zheng,He Wang
关键词-EN: significantly improve household, improve household duties, advanced manipulation capabilities, Legged robots, urban maintenance
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
点击查看摘要
Abstract:Legged robots with advanced manipulation capabilities have the potential to significantly improve household duties and urban maintenance. Despite considerable progress in developing robust locomotion and precise manipulation methods, seamlessly integrating these into cohesive whole-body control for real-world applications remains challenging. In this paper, we present a modular framework for robust and generalizable whole-body loco-manipulation controller based on a single arm-mounted camera. By using reinforcement learning (RL), we enable a robust low-level policy for command execution over 5 dimensions (5D) and a grasp-aware high-level policy guided by a novel metric, Generalized Oriented Reachability Map (GORM). The proposed system achieves state-of-the-art one-time grasping accuracy of 89% in the real world, including challenging tasks such as grasping transparent objects. Through extensive simulations and real-world experiments, we demonstrate that our system can effectively manage a large workspace, from floor level to above body height, and perform diverse whole-body loco-manipulation tasks.
[AI-50] MP-PINN: A Multi-Phase Physics-Informed Neural Network for Epidemic Forecasting
链接: https://arxiv.org/abs/2411.06781
作者: Thang Nguyen,Dung Nguyen,Kha Pham,Truyen Tran
关键词-EN: Forecasting temporal processes, observed time-series data, temporal processes, observed time-series, neural networks make
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Forecasting temporal processes such as virus spreading in epidemics often requires more than just observed time-series data, especially at the beginning of a wave when data is limited. Traditional methods employ mechanistic models like the SIR family, which make strong assumptions about the underlying spreading process, often represented as a small set of compact differential equations. Data-driven methods such as deep neural networks make no such assumptions and can capture the generative process in more detail, but fail in long-term forecasting due to data limitations. We propose a new hybrid method called MP-PINN (Multi-Phase Physics-Informed Neural Network) to overcome the limitations of these two major approaches. MP-PINN instils the spreading mechanism into a neural network, enabling the mechanism to update in phases over time, reflecting the dynamics of the epidemics due to policy interventions. Experiments on COVID-19 waves demonstrate that MP-PINN achieves superior performance over pure data-driven or model-driven approaches for both short-term and long-term forecasting.
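To make the "physics-informed" part concrete, here is a minimal PINN residual for the classic SIR model, the kind of mechanism MP-PINN instils into its network. The fixed beta/gamma values below are purely illustrative; the paper instead updates the mechanism in phases to reflect policy interventions.

```python
import torch
import torch.nn as nn

class SIRPINN(nn.Module):
    """Maps time t (shape (n, 1)) to the compartments (S, I, R)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(),
                                 nn.Linear(64, 64), nn.Tanh(),
                                 nn.Linear(64, 3))
    def forward(self, t):
        return self.net(t)

def sir_residual(model, t, beta=0.3, gamma=0.1, N=1.0):
    """Physics loss: penalise deviation from dS/dt = -beta*S*I/N,
    dI/dt = beta*S*I/N - gamma*I, dR/dt = gamma*I."""
    t = t.requires_grad_(True)
    S, I, R = model(t).unbind(dim=-1)
    dS = torch.autograd.grad(S.sum(), t, create_graph=True)[0].squeeze(-1)
    dI = torch.autograd.grad(I.sum(), t, create_graph=True)[0].squeeze(-1)
    dR = torch.autograd.grad(R.sum(), t, create_graph=True)[0].squeeze(-1)
    r1 = dS + beta * S * I / N
    r2 = dI - beta * S * I / N + gamma * I
    r3 = dR - gamma * I
    return (r1 ** 2 + r2 ** 2 + r3 ** 2).mean()
```

In a full training loop this residual would be added to a data-fitting loss on the observed case counts.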
[AI-51] Machine vision-aware quality metrics for compressed image and video assessment
链接: https://arxiv.org/abs/2411.06776
作者: Mikhail Dremin(1),Konstantin Kozhemyakov(1),Ivan Molodetskikh(1),Malakhov Kirill(2),Artur Sagitov(2 and 3),Dmitriy Vatolin(1) ((1) Lomonosov Moscow State University, (2) Huawei Technologies Co., Ltd., (3) Independent Researcher Linjianping)
关键词-EN: maintaining file size, enhance human-perceived visual, human-perceived visual quality, developing video-compression algorithms, file size
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 16 pages, 10 figures
点击查看摘要
Abstract:A main goal in developing video-compression algorithms is to enhance human-perceived visual quality while maintaining file size. But modern video-analysis efforts such as detection and recognition, which are integral to video surveillance and autonomous vehicles, involve so much data that they necessitate machine-vision processing with minimal human intervention. In such cases, the video codec must be optimized for machine vision. This paper explores the effects of compression on detection and recognition algorithms (objects, faces, and license plates) and introduces novel full-reference image/video-quality metrics for each task, tailored to machine vision. Experimental results indicate our proposed metrics correlate better with the machine-vision results for the respective tasks than do existing image/video-quality metrics.
[AI-52] A Text Classification Model Combining Adversarial Training with Pre-trained Language Model and neural networks: A Case Study on Telecom Fraud Incident Texts
链接: https://arxiv.org/abs/2411.06772
作者: Liu Zhuoxian,Shi Tuo,Hu Xiaofeng
关键词-EN: Front-line police officers, police call reported, targeted prevention measures, facilitate targeted prevention, Front-line police
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Front-line police officers often categorize all Telecom Fraud cases reported through police calls into 14 subcategories to facilitate targeted prevention measures, such as precise public education. However, the associated data is characterized by its large volume, diverse information content, and variations in expression. Currently, there is a lack of efficient and accurate intelligent models to replace manual classification, which, while precise, is relatively inefficient. To address these challenges, this paper proposes a text classification model that combines adversarial training with a pre-trained language model and neural networks. The linguistically-motivated pre-trained language model extracts three types of language features, and the Fast Gradient Method algorithm is then used to perturb the generated embedding layer. Subsequently, Bi-directional Long Short-Term Memory and Convolutional Neural Networks extract contextual syntactic information and local semantic information, respectively. The model achieved an 83.9% classification accuracy when trained on a portion of telecom fraud case data provided by the operational department. The model established in this paper has been deployed in the operational department, freeing up a significant amount of manpower and improving the department’s efficiency in combating Telecom Fraud crimes. Furthermore, considering the universality of the model established in this paper, other application scenarios await further exploration.
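The Fast Gradient Method step mentioned in the abstract is a standard embedding-space perturbation for adversarial training of text classifiers. Below is a widely used PyTorch-style sketch of that step; the embedding parameter name and epsilon are assumptions, not the authors' code.

```python
import torch

class FGM:
    """Fast Gradient Method on the embedding layer: nudge the word
    embeddings along the gradient direction, compute an adversarial loss,
    then restore the original weights."""
    def __init__(self, model, emb_name="embedding", epsilon=1.0):
        self.model, self.emb_name, self.epsilon = model, emb_name, epsilon
        self.backup = {}

    def attack(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad and self.emb_name in name and param.grad is not None:
                self.backup[name] = param.data.clone()
                norm = torch.norm(param.grad)
                if norm != 0:
                    param.data.add_(self.epsilon * param.grad / norm)

    def restore(self):
        for name, param in self.model.named_parameters():
            if name in self.backup:
                param.data = self.backup[name]
        self.backup = {}

# Typical loop: loss.backward(); fgm.attack(); adversarial forward/backward;
# fgm.restore(); optimizer.step()
```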
[AI-53] Research on an intelligent fault diagnosis method for nuclear power plants based on ETCN-SSA combined algorithm
链接: https://arxiv.org/abs/2411.06765
作者: Jiayan Fang,Siwei Li,Yichun Wu
关键词-EN: nuclear power plants, nuclear power professionals, nuclear power, Utilizing fault diagnosis, Utilizing fault
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
*备注:
点击查看摘要
Abstract:Utilizing fault diagnosis methods is crucial for nuclear power professionals to achieve efficient and accurate fault diagnosis for nuclear power plants (NPPs). The performance of traditional methods is limited by their dependence on complex feature extraction and skilled expert knowledge, which can be time-consuming and subjective. This paper proposes a novel intelligent fault diagnosis method for NPPs that combines enhanced temporal convolutional network (ETCN) with sparrow search algorithm (SSA). ETCN utilizes temporal convolutional network (TCN), self-attention (SA) mechanism and residual block for enhancing performance. ETCN excels at extracting local features and capturing time series information, while SSA adaptively optimizes its hyperparameters for superior performance. The proposed method’s performance is experimentally verified on a CPR1000 simulation dataset. Compared to other advanced intelligent fault diagnosis methods, the proposed one demonstrates superior performance across all evaluation metrics. This makes it a promising tool for NPP intelligent fault diagnosis, ultimately enhancing operational reliability.
[AI-54] KLCBL: An Improved Police Incident Classification Model
链接: https://arxiv.org/abs/2411.06749
作者: Liu Zhuoxian,Shi Tuo,Hu Xiaofeng
关键词-EN: public security intelligence, automated system limitations, online fraud cases, grassroots agencies struggle, Convolutional Neural Network
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Police incident data is crucial for public security intelligence, yet grassroots agencies struggle with efficient classification due to manual inefficiency and automated system limitations, especially in telecom and online fraud cases. This research proposes a multichannel neural network model, KLCBL, integrating Kolmogorov-Arnold Networks (KAN), a linguistically enhanced text preprocessing approach (LERT), Convolutional Neural Network (CNN), and Bidirectional Long Short-Term Memory (BiLSTM) for police incident classification. Evaluated with real data, KLCBL achieved 91.9% accuracy, outperforming baseline models. The model addresses classification challenges, enhances police informatization, improves resource allocation, and offers broad applicability to other classification tasks.
[AI-55] Dockformer: A transformer-based molecular docking paradigm for large-scale virtual screening
链接: https://arxiv.org/abs/2411.06740
作者: Zhangfan Yang,Junkai Ji,Shan He,Jianqiang Li,Ruibin Bai,Zexuan Zhu,Yew Soon Ong
关键词-EN: Molecular docking enables, compound library increases, identify potential ligands, compound libraries, compound library
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 14 pages, 5 figures
点击查看摘要
Abstract:Molecular docking enables virtual screening of compound libraries to identify potential ligands that target proteins of interest, a crucial step in drug development; however, as the size of the compound library increases, the computational complexity of traditional docking models increases. Deep learning algorithms can provide data-driven research and development models to increase the speed of the docking process. Unfortunately, few models can achieve superior screening performance compared to that of traditional models. Therefore, a novel deep learning-based docking approach named Dockformer is introduced in this study. Dockformer leverages multimodal information to capture the geometric topology and structural knowledge of molecules and can directly generate binding conformations with the corresponding confidence measures in an end-to-end manner. The experimental results show that Dockformer achieves success rates of 90.53% and 82.71% on the PDBbind core set and PoseBusters benchmarks, respectively, and more than a 100-fold increase in the inference process speed, outperforming almost all state-of-the-art docking methods. In addition, the ability of Dockformer to identify the main protease inhibitors of coronaviruses is demonstrated in a real-world virtual screening scenario. Considering its high docking accuracy and screening efficiency, Dockformer can be regarded as a powerful and robust tool in the field of drug design.
[AI-56] Multi-Modal Forecaster: Jointly Predicting Time Series and Textual Data
链接: https://arxiv.org/abs/2411.06735
作者: Kai Kim,Howard Tsai,Rajat Sen,Abhimanyu Das,Zihao Zhou,Abhishek Tanpure,Mathew Luo,Rose Yu
关键词-EN: Current forecasting approaches, well-curated multimodal benchmark, Current forecasting, rich textual data, time series due
类目: Artificial Intelligence (cs.AI)
*备注: 21 pages, 4 tables, 2 figures
点击查看摘要
Abstract:Current forecasting approaches are largely unimodal and ignore the rich textual data that often accompanies the time series, due to the lack of a well-curated multimodal benchmark dataset. In this work, we develop TimeText Corpus (TTC), a carefully curated, time-aligned text and time-series dataset for multimodal forecasting. Our dataset is composed of sequences of numbers and text aligned to timestamps, and includes data from two different domains: climate science and healthcare. Our data is a significant contribution to the rare selection of available multimodal datasets. We also propose the Hybrid Multi-Modal Forecaster (Hybrid-MMF), a multimodal LLM that jointly forecasts both text and time series data using shared embeddings. However, contrary to our expectations, our Hybrid-MMF model does not outperform existing baselines in our experiments. This negative result highlights the challenges inherent in multimodal forecasting. Our code and data are available at this https URL.
[AI-57] On the Principles of ReLU Networks with One Hidden Layer
链接: https://arxiv.org/abs/2411.06728
作者: Changcun Huang
关键词-EN: simplest feedforward neural, feedforward neural network, general network architectures, feedforward neural, neural network
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注:
点击查看摘要
Abstract:A neural network with one hidden layer or a two-layer network (regardless of the input layer) is the simplest feedforward neural network, whose mechanism may be the basis of more general network architectures. However, even to this type of simple architecture, it is also a “black box”; that is, it remains unclear how to interpret the mechanism of its solutions obtained by the back-propagation algorithm and how to control the training process through a deterministic way. This paper systematically studies the first problem by constructing universal function-approximation solutions. It is shown that, both theoretically and experimentally, the training solution for the one-dimensional input could be completely understood, and that for a higher-dimensional input can also be well interpreted to some extent. Those results pave the way for thoroughly revealing the black box of two-layer ReLU networks and advance the understanding of deep ReLU networks.
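A concrete instance of the "universal function-approximation solutions" the abstract refers to: for one-dimensional input, a one-hidden-layer ReLU network can be constructed explicitly so that each hidden unit contributes one change of slope, which is why such networks compute piecewise linear functions. The NumPy sketch below builds an interpolating network for sin(x); it illustrates the general principle, not the paper's specific construction.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def fit_two_layer_relu(xs, ys):
    """Explicitly construct a one-hidden-layer ReLU net interpolating (xs, ys):
        f(x) = ys[0] + sum_i c_i * relu(x - xs[i]),
    where c_0 is the first slope and each later c_i is a change of slope."""
    slopes = np.diff(ys) / np.diff(xs)
    c = np.concatenate([[slopes[0]], np.diff(slopes)])  # per-unit slope changes
    return lambda x: ys[0] + np.sum(c * relu(x[:, None] - xs[:-1][None, :]), axis=1)

xs = np.linspace(0, 2 * np.pi, 20)
f = fit_two_layer_relu(xs, np.sin(xs))
x_test = np.linspace(0, 2 * np.pi, 200)
max_err = np.max(np.abs(f(x_test) - np.sin(x_test)))  # small interpolation error
```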
[AI-58] Script-Strategy Aligned Generation: Aligning LLM s with Expert-Crafted Dialogue Scripts and Therapeutic Strategies for Psychotherapy
链接: https://arxiv.org/abs/2411.06723
作者: Xin Sun,Jan de Wit,Zhuying Li,Jiahuan Pei,Abdallah El Ali,Jos A.Bosch
关键词-EN: conversational agents, LLMs, Chatbots, improve access, SSAG
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Chatbots or conversational agents (CAs) are increasingly used to improve access to digital psychotherapy. Many current systems rely on rigid, rule-based designs, heavily dependent on expert-crafted dialogue scripts for guiding therapeutic conversations. Although recent advances in large language models (LLMs) offer the potential for more flexible interactions, their lack of controllability and transparency poses significant challenges in sensitive areas like psychotherapy. In this work, we explored how aligning LLMs with expert-crafted scripts can enhance psychotherapeutic chatbot performance. Our comparative study showed that LLMs aligned with expert-crafted scripts through prompting and fine-tuning significantly outperformed both pure LLMs and rule-based chatbots, achieving a more effective balance between dialogue flexibility and adherence to therapeutic principles. Building on these findings, we proposed “Script-Strategy Aligned Generation (SSAG)”, a flexible alignment approach that reduces reliance on fully scripted content while enhancing LLMs’ therapeutic adherence and controllability. In a 10-day field study, SSAG demonstrated performance comparable to full script alignment and outperformed rule-based chatbots, empirically supporting SSAG as an efficient approach for aligning LLMs with domain expertise. Our work advances LLM applications in psychotherapy by providing a controllable, adaptable, and scalable solution for digital interventions, reducing reliance on expert effort. It also provides a collaborative framework for domain experts and developers to efficiently build expertise-aligned chatbots, broadening access to psychotherapy and behavioral interventions.
[AI-59] Synthesize Partition then Adapt: Eliciting Diverse Samples from Foundation Models
链接: https://arxiv.org/abs/2411.06722
作者: Yeming Wen,Swarat Chaudhuri
关键词-EN: accommodating varying preferences, Presenting users, varying preferences, diverse responses, crucial for enhancing
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Presenting users with diverse responses from foundation models is crucial for enhancing user experience and accommodating varying preferences. However, generating multiple high-quality and diverse responses without sacrificing accuracy remains a challenge, especially when using greedy sampling. In this work, we propose a novel framework, Synthesize-Partition-Adapt (SPA), that leverages the abundant synthetic data available in many domains to elicit diverse responses from foundation models. By leveraging signal provided by data attribution methods such as influence functions, SPA partitions data into subsets, each targeting unique aspects of the data, and trains multiple model adaptations optimized for these subsets. Experimental results demonstrate the effectiveness of our approach in diversifying foundation model responses while maintaining high quality, showcased through the HumanEval and MBPP tasks in the code generation domain and several tasks in the natural language understanding domain, highlighting its potential to enrich user experience across various applications.
[AI-60] Ambient AI Scribing Support: Comparing the Performance of Specialized AI Agent ic Architecture to Leading Foundational Models
链接: https://arxiv.org/abs/2411.06713
作者: Chanseo Lee,Sonu Kumar,Kimon A. Vogt,Sam Meraj
关键词-EN: Health AI Scribe, compares Sporo Health, study compares Sporo, Sporo Health, proprietary model fine-tuned
类目: Artificial Intelligence (cs.AI)
*备注: arXiv admin note: substantial text overlap with arXiv:2410.15528
点击查看摘要
Abstract:This study compares Sporo Health’s AI Scribe, a proprietary model fine-tuned for medical scribing, with various LLMs (GPT-4o, GPT-3.5, Gemma-9B, and Llama-3.2-3B) in clinical documentation. We analyzed de-identified patient transcripts from partner clinics, using clinician-provided SOAP notes as the ground truth. Each model generated SOAP summaries using zero-shot prompting, with performance assessed via recall, precision, and F1 scores. Sporo outperformed all models, achieving the highest recall (73.3%), precision (78.6%), and F1 score (75.3%) with the lowest performance variance. Statistically significant differences (p < 0.05) were found between Sporo and the other models, with post-hoc tests showing significant improvements over GPT-3.5, Gemma-9B, and Llama 3.2-3B. While Sporo outperformed GPT-4o by up to 10%, the difference was not statistically significant (p = 0.25). Clinical user satisfaction, measured with a modified PDQI-9 inventory, favored Sporo. Evaluations indicated Sporo’s outputs were more accurate and relevant. This highlights the potential of Sporo’s multi-agentic architecture to improve clinical workflows.
[AI-61] Anytime Probabilistically Constrained Provably Convergent Online Belief Space Planning
链接: https://arxiv.org/abs/2411.06711
作者: Andrey Zhitnikov,Vadim Indelman
关键词-EN: account future risk, autonomously operating robot, Taking into account, account future, future risk
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: arXiv admin note: text overlap with arXiv:2302.10439 by other authors
点击查看摘要
Abstract:Taking into account future risk is essential for an autonomously operating robot to find online not only the best but also a safe action to execute. In this paper, we build upon the recently introduced formulation of probabilistic belief-dependent constraints. We present an anytime approach employing the Monte Carlo Tree Search (MCTS) method in continuous domains. Unlike previous approaches, our method assures safety anytime with respect to the currently expanded search tree without relying on the convergence of the search. We prove convergence in probability with an exponential rate of a version of our algorithms and study proposed techniques via extensive simulations. Even with a tiny number of tree queries, the best action found by our approach is much safer than the baseline. Moreover, our approach constantly finds better than the baseline action in terms of objective. This is because we revise the values and statistics maintained in the search tree and remove from them the contribution of the pruned actions.
[AI-62] Autonomous Droplet Microfluidic Design Framework with Large Language Models
链接: https://arxiv.org/abs/2411.06691
作者: Dinh-Nguyen Nguyen,Raymond Kai-Yu Tong,Ngoc-Duy Dinh
关键词-EN: current assessment tools, Droplet-based microfluidic devices, biological research, Droplet-based microfluidic, substantial promise
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Droplet-based microfluidic devices have substantial promise as cost-effective alternatives to current assessment tools in biological research. Moreover, machine learning models that leverage tabular data, including input design parameters and their corresponding efficiency outputs, are increasingly utilised to automate the design process of these devices and to predict their performance. However, these models fail to fully leverage the data presented in the tables, neglecting crucial contextual information, including column headings and their associated descriptions. This study presents MicroFluidic-LLMs, a framework designed for processing and feature extraction, which effectively captures contextual information from tabular data formats. MicroFluidic-LLMs overcomes processing challenges by transforming the content into a linguistic format and leveraging pre-trained large language models (LLMs) for analysis. We evaluate our MicroFluidic-LLMs framework on 11 prediction tasks, covering aspects such as geometry, flow conditions, regimes, and performance, utilising a publicly available dataset on flow-focusing droplet microfluidics. We demonstrate that our MicroFluidic-LLMs framework can empower deep neural network models to be highly effective and straightforward while minimising the need for extensive data preprocessing. Moreover, the exceptional performance of deep neural network models, particularly when combined with advanced natural language processing models such as DistilBERT and GPT-2, reduces the mean absolute error in the droplet diameter and generation rate by nearly 5- and 7-fold, respectively, and enhances the regime classification accuracy by over 4%, compared with the performance reported in a previous study. This study lays the foundation for the huge potential applications of LLMs and machine learning in a wider spectrum of microfluidic applications.
[AI-63] High-Frequency Enhanced Hybrid Neural Representation for Video Compression
链接: https://arxiv.org/abs/2411.06685
作者: Li Yu,Zhihui Li,Jimin Xiao,Moncef Gabbouj
关键词-EN: achieved swift decoding, swift decoding speeds, encoding video content, Neural Representation Network, Hybrid Neural Representation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
*备注:
点击查看摘要
Abstract:Neural Representations for Videos (NeRV) have simplified the video codec process and achieved swift decoding speeds by encoding video content into a neural network, presenting a promising solution for video compression. However, existing work overlooks the crucial issue that videos reconstructed by these methods lack high-frequency details. To address this problem, this paper introduces a High-Frequency Enhanced Hybrid Neural Representation Network. Our method focuses on leveraging high-frequency information to improve the synthesis of fine details by the network. Specifically, we design a wavelet high-frequency encoder that incorporates Wavelet Frequency Decomposer (WFD) blocks to generate high-frequency feature embeddings. Next, we design the High-Frequency Feature Modulation (HFM) block, which leverages the extracted high-frequency embeddings to enhance the fitting process of the decoder. Finally, with the refined Harmonic decoder block and a Dynamic Weighted Frequency Loss, we further reduce the potential loss of high-frequency information. Experiments on the Bunny and UVG datasets demonstrate that our method outperforms other methods, showing notable improvements in detail preservation and compression performance.
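As background for the wavelet high-frequency encoder, a single-level 2-D discrete wavelet transform already separates out the sub-bands that WFD-style blocks work from. The sketch below uses the PyWavelets package on a placeholder frame; how the paper actually embeds these bands into the decoder is not shown here.

```python
import numpy as np
import pywt  # pip install PyWavelets

# A 2-D DWT splits an image into a low-frequency approximation and three
# high-frequency detail sub-bands; the detail bands carry the fine textures
# that implicit video representations tend to lose.
image = np.random.rand(256, 256).astype(np.float32)  # placeholder video frame
cA, (cH, cV, cD) = pywt.dwt2(image, "haar")
# cA: low-frequency approximation; cH/cV/cD: horizontal, vertical and
# diagonal high-frequency details, each of shape (128, 128).
high_freq = np.stack([cH, cV, cD])  # what a high-frequency encoder would embed
```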
[AI-64] WDMoE: Wireless Distributed Mixture of Experts for Large Language Models
链接: https://arxiv.org/abs/2411.06681
作者: Nan Xue,Yaping Sun,Zhiyong Chen,Meixia Tao,Xiaodong Xu,Liang Qian,Shuguang Cui,Wenjun Zhang,Ping Zhang
关键词-EN: Large Language Models, language processing tasks, natural language processing, Large Language, achieved significant success
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Information Theory (cs.IT)
*备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have achieved significant success in various natural language processing tasks, but the role of wireless networks in supporting LLMs has not been thoroughly explored. In this paper, we propose a wireless distributed Mixture of Experts (WDMoE) architecture to enable collaborative deployment of LLMs across edge servers at the base station (BS) and mobile devices in wireless networks. Specifically, we decompose the MoE layer in LLMs by placing the gating network and the preceding neural network layer at BS, while distributing the expert networks among the devices. This deployment leverages the parallel inference capabilities of expert networks on mobile devices, effectively utilizing the limited computing and caching resources of these devices. Accordingly, we develop a performance metric for WDMoE-based LLMs, which accounts for both model capability and latency. To minimize the latency while maintaining accuracy, we jointly optimize expert selection and bandwidth allocation based on the performance metric. Moreover, we build a hardware testbed using NVIDIA Jetson kits to validate the effectiveness of WDMoE. Both theoretical simulations and practical hardware experiments demonstrate that the proposed method can significantly reduce the latency without compromising LLM performance.
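For readers unfamiliar with MoE routing, the snippet below is a minimal top-k gated mixture-of-experts layer. In WDMoE the gating network would sit at the base station and the experts on separate mobile devices; here everything runs in one process purely to show the routing arithmetic, and all sizes are arbitrary.

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """Minimal top-k mixture-of-experts layer (single-process sketch)."""
    def __init__(self, dim, n_experts=4, k=2):
        super().__init__()
        self.gate = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                           nn.Linear(4 * dim, dim)) for _ in range(n_experts)])
        self.k = k

    def forward(self, x):                        # x: (tokens, dim)
        scores = self.gate(x)                    # (tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)
        weights = torch.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):               # route tokens to chosen experts
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out

layer = TopKMoE(dim=32)
out = layer(torch.randn(8, 32))  # (8, 32)
```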
[AI-65] Adversarial Detection with a Dynamically Stable System
链接: https://arxiv.org/abs/2411.06666
作者: Xiaowei Long,Jie Lin,Xiangyuan Yang
关键词-EN: Dynamically Stable System, maliciously crafted adversarial, Stable System, Toggle, Dynamically Stable
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Adversarial detection is designed to identify and reject maliciously crafted adversarial examples (AEs) which are generated to disrupt the classification of target models. Presently, various input transformation-based methods have been developed for adversarial example detection, which typically rely on empirical experience and lead to unreliability against new attacks. To address this issue, we propose and conduct a Dynamically Stable System (DSS), which can effectively distinguish adversarial examples from normal examples according to the stability of input examples. Particularly, in our paper, the generation of adversarial examples is considered as the perturbation process of a Lyapunov dynamic system, and we propose an example stability mechanism, in which a novel control term is added in adversarial example generation to ensure that the normal examples can achieve dynamic stability while the adversarial examples cannot achieve the stability. Then, based on the proposed example stability mechanism, a Dynamically Stable System (DSS) is proposed, which can utilize the disruption and restoration actions to determine the stability of input examples and detect the adversarial examples through changes in the stability of the input examples. In comparison with existing methods on three benchmark datasets (MNIST, CIFAR10, and CIFAR100), our evaluation results show that our proposed DSS can achieve ROC-AUC values of 99.83%, 97.81% and 94.47%, surpassing the state-of-the-art (SOTA) values of 97.35%, 91.10% and 93.49% of the other 7 methods.
[AI-66] An Efficient Memory Module for Graph Few-Shot Class-Incremental Learning
链接: https://arxiv.org/abs/2411.06659
作者: Dong Li,Aijia Zhang,Junqi Gao,Biqing Qi
关键词-EN: gained significant attention, catastrophic forgetting problem, Incremental graph learning, gained significant, significant attention
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 16 pages, 6 figures, 38th Conference on Neural Information Processing Systems, 2024
点击查看摘要
Abstract:Incremental graph learning has gained significant attention for its ability to address the catastrophic forgetting problem in graph representation learning. However, traditional methods often rely on a large number of labels for node classification, which is impractical in real-world applications. This makes few-shot incremental learning on graphs a pressing need. Current methods typically require extensive training samples from meta-learning to build memory and perform intensive fine-tuning of GNN parameters, leading to high memory consumption and potential loss of previously learned knowledge. To tackle these challenges, we introduce Mecoin, an efficient method for building and maintaining memory. Mecoin employs Structured Memory Units to cache prototypes of learned categories, as well as Memory Construction Modules to update these prototypes for new categories through interactions between the nodes and the cached prototypes. Additionally, we have designed a Memory Representation Adaptation Module (MRaM) to store probabilities associated with each class prototype, reducing the need for parameter fine-tuning and lowering the forgetting rate. When a sample matches its corresponding class prototype, the relevant probabilities are retrieved from the MRaM. Knowledge is then distilled back into the GNN through a Graph Knowledge Distillation Module, preserving the model’s memory. We analyze the effectiveness of Mecoin in terms of generalization error and explore the impact of different distillation strategies on model performance through experiments and VC-dimension analysis. Compared to other related works, Mecoin shows superior performance in accuracy and forgetting rate. Our code is publicly available at this https URL.
[AI-67] Predicting Country Instability Using Bayesian Deep Learning and Random Forest
链接: https://arxiv.org/abs/2411.06639
作者: Adam Zebrowski,Haithem Afli
关键词-EN: instability thwarting socio-economic, unpredictably high levels, thwarting socio-economic growth, instability thwarting, Country instability
类目: Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
*备注:
点击查看摘要
Abstract:Country instability is a global issue, with unpredictably high levels of instability thwarting socio-economic growth and possibly causing a slew of negative consequences. As a result, uncertainty prediction models for a country are becoming increasingly important in the real world, and they are expanding to provide more input from ‘big data’ collections, as well as the interconnectedness of global economies and social networks. This has culminated in massive volumes of qualitative data from outlets like television, print, digital, and social media, necessitating the use of artificial intelligence (AI) tools like machine learning to make sense of it all and promote predictive precision [1]. The Global Database of Events, Language, and Tone (GDELT Project) records broadcast, print, and web news in over 100 languages every second of every day, identifying the people, locations, organisations, counts, themes, outlets, and events that propel our global community and offering a free open platform for computation on the entire world. The main goal of our research is to investigate how, as our data grows more voluminous and fine-grained, we can conduct a more complex methodological analysis of political conflict. The GDELT dataset, which was released in 2012, is the first and potentially the most technologically sophisticated publicly accessible dataset on political conflict.
[AI-68] Exploring social bots: A feature-based approach to improve bot detection in social networks
链接: https://arxiv.org/abs/2411.06626
作者: Salvador Lopez-Joya,Jose A. Diaz-Garcia,M. Dolores Ruiz,Maria J. Martin-Bautista
关键词-EN: spread of misinformation, political messages, malicious links, importance of social, social media
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:The importance of social media in our daily lives has unfortunately led to an increase in the spread of misinformation, political messages and malicious links. One of the most popular ways of carrying out those activities is using automated accounts, also known as bots, which makes the detection of such accounts a necessity. This paper addresses that problem by investigating features based on the user account profile and its content, aiming to understand the relevance of each feature as a basis for improving future bot detectors. Through an exhaustive process of research, inference and feature selection, we are able to surpass the state of the art on several metrics using classical machine learning algorithms and identify the types of features that are most important in detecting automated accounts.
[AI-69] A Review of Fairness and A Practical Guide to Selecting Context-Appropriate Fairness Metrics in Machine Learning
链接: https://arxiv.org/abs/2411.06624
作者: Caleb J.S. Barr,Olivia Erdelyi,Paul D. Docherty,Randolph C. Grace
关键词-EN: Recent regulatory proposals, artificial intelligence emphasize, intelligence emphasize fairness, Recent regulatory, proposals for artificial
类目: Artificial Intelligence (cs.AI)
*备注: 15 pages, 4 figures, 1 table
点击查看摘要
Abstract:Recent regulatory proposals for artificial intelligence emphasize fairness requirements for machine learning models. However, precisely defining the appropriate measure of fairness is challenging due to philosophical, cultural and political contexts. Biases can infiltrate machine learning models in complex ways depending on the model’s context, rendering a single common metric of fairness insufficient. This ambiguity highlights the need for criteria to guide the selection of context-aware measures, an issue of increasing importance given the proliferation of ever tighter regulatory requirements. To address this, we developed a flowchart to guide the selection of contextually appropriate fairness measures. Twelve criteria were used to formulate the flowchart. This included consideration of model assessment criteria, model selection criteria, and data bias. We also review fairness literature in the context of machine learning and link it to core regulatory instruments to assist policymakers, AI developers, researchers, and other stakeholders in appropriately addressing fairness concerns and complying with relevant regulatory requirements.
[AI-70] MEANT: Multimodal Encoder for Antecedent Information
链接: https://arxiv.org/abs/2411.06616
作者: Benjamin Iyoya Irving,Annika Marie Schoene
关键词-EN: split across modalities, stock market, ideal candidate, multimodal evaluation, information
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:The stock market provides a rich well of information that can be split across modalities, making it an ideal candidate for multimodal evaluation. Multimodal data plays an increasingly important role in the development of machine learning and has been shown to positively impact performance. But information can do more than exist across modes – it can exist across time. How should we attend to temporal data that consists of multiple information types? This work introduces (i) the MEANT model, a Multimodal Encoder for Antecedent information, and (ii) a new dataset called TempStock, which consists of price, Tweets, and graphical data with over a million Tweets from all of the companies in the S&P 500 Index. We find that MEANT improves performance over existing baselines by over 15%, and that the textual information affects performance far more than the visual information on our time-dependent task, according to our ablation study.
[AI-71] vTune: Verifiable Fine-Tuning for LLM s Through Backdooring
链接: https://arxiv.org/abs/2411.06611
作者: Eva Zhang,Arka Pal,Akilesh Potti,Micah Goldblum
关键词-EN: fine-tuning large language, large language models, increasingly prevalent, large language, rely on third-party
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注:
点击查看摘要
Abstract:As fine-tuning large language models (LLMs) becomes increasingly prevalent, users often rely on third-party services with limited visibility into their fine-tuning processes. This lack of transparency raises the question: how do consumers verify that fine-tuning services are performed correctly? For instance, a service provider could claim to fine-tune a model for each user, yet simply send all users back the same base model. To address this issue, we propose vTune, a simple method that uses a small number of backdoor data points added to the training data to provide a statistical test for verifying that a provider fine-tuned a custom model on a particular user’s dataset. Unlike existing works, vTune is able to scale to verification of fine-tuning on state-of-the-art LLMs, and can be used both with open-source and closed-source models. We test our approach across several model families and sizes as well as across multiple instruction-tuning datasets, and find that the statistical test is satisfied with p-values on the order of ~10^-40, with no negative impact on downstream task performance. Further, we explore several attacks that attempt to subvert vTune and demonstrate the method’s robustness to these attacks.
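The statistical test itself is not specified in the abstract; one simple way to obtain p-values of the kind reported is a one-sided binomial test on how often the fine-tuned model emits the planted backdoor completions. The numbers below are invented for illustration, and vTune's actual test statistic may differ.

```python
from scipy.stats import binomtest

# Hypothetical check: query the fine-tuned model with n backdoor prompts and
# count how often it emits the planted completion. Under the null hypothesis
# (the provider never trained on our data), an exact match happens only by
# chance; a tiny p-value is evidence the fine-tuning really took place.
n_probes = 20
n_matches = 18       # matches observed on the backdoor prompts (made up)
p_chance = 1e-4      # assumed chance rate of the exact planted completion
result = binomtest(n_matches, n_probes, p_chance, alternative="greater")
print(result.pvalue)  # astronomically small, in the spirit of the ~10^-40 regime
```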
[AI-72] Gen-AI for User Safety: A Survey
链接: https://arxiv.org/abs/2411.06606
作者: Akshar Prabhu Desai,Tejasvi Ravi,Mohammad Luqman,Nithya Kota,Pranjul Yadav
关键词-EN: Machine Learning, Gen-AI techniques, supervised and unsupervised, techniques, Gen-AI
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注:
点击查看摘要
Abstract:Machine Learning and data mining techniques (i.e. supervised and unsupervised techniques) are used across domains to detect user safety violations. Examples include classifiers used to detect whether an email is spam or a web-page is requesting bank login information. However, existing ML/DM classifiers are limited in their ability to understand natural language with respect to context and nuance. These challenges are overcome with the arrival of Gen-AI techniques, given their inherent ability to translate between languages and to be fine-tuned across various tasks and domains. In this manuscript, we provide a comprehensive overview of the work done using Gen-AI techniques for user safety. In particular, we first survey the various domains (e.g. phishing, malware, content moderation, counterfeit, physical safety) across which Gen-AI techniques have been applied. Next, we describe how Gen-AI techniques can be used in conjunction with various data modalities, i.e. text, images, videos, audio, and executable binaries, to detect violations of user safety. Further, we also provide an overview of how Gen-AI techniques can be used in an adversarial setting. We believe that this work represents the first summarization of Gen-AI techniques for user safety.
[AI-73] OffLight: An Offline Multi-Agent Reinforcement Learning Framework for Traffic Signal Control
链接: https://arxiv.org/abs/2411.06601
作者: Rohit Bokade,Xiaoning Jin
关键词-EN: modern urban mobility, Efficient traffic signal, city traffic patterns, traffic signal control, complex city traffic
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注:
点击查看摘要
Abstract:Efficient traffic signal control is critical for modern urban mobility, but traditional systems often struggle to adapt to complex city traffic patterns. Multi-Agent Reinforcement Learning, or MARL, offers adaptive solutions, yet online MARL methods require extensive real-time interactions, which are costly and time-intensive. Offline MARL addresses these issues by using historical traffic data, but it faces challenges due to the diverse behavior policies in real-world datasets, where different controllers complicate learning.
[AI-74] Federated LLM s Fine-tuned with Adaptive Importance-Aware LoRA
链接: https://arxiv.org/abs/2411.06581
作者: Yang Su,Na Yan,Yansha Deng
关键词-EN: pre-trained Large Language, Large Language Models, preserving data privacy, enables task-specific adaptation, Large Language
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:
点击查看摘要
Abstract:Federated fine-tuning of pre-trained Large Language Models (LLMs) enables task-specific adaptation across diverse datasets while preserving data privacy. However, the large model size and heterogeneity in client resources pose significant computational and communication challenges. To address these issues, in this paper, we propose a novel Heterogeneous Adaptive Federated Low-Rank Adaptation (LoRA) fine-tuned LLM framework (HAFL). To accommodate client resource heterogeneity, we first introduce an importance-based parameter truncation scheme, which allows clients to have different LoRA ranks, and smoothed sensitivity scores are used as importance indicators. Despite its flexibility, the truncation process may cause performance degradation. To tackle this problem, we develop an importance-based parameter freezing scheme. In this approach, both the cloud server and clients maintain the same LoRA rank, while clients selectively update only the most important decomposed LoRA rank-1 matrices, keeping the rest frozen. To mitigate the information dilution caused by the zero-padding aggregation method, we propose an adaptive aggregation approach that operates at the decomposed rank-1 matrix level. Experiments on the 20 Newsgroups classification task show that our method converges quickly with low communication overhead, and avoids the performance degradation seen in the truncation-based heterogeneous LoRA rank scheme when distributing models to clients. Additionally, our adaptive aggregation method achieves faster convergence compared to the zero-padding approach.
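As a rough illustration of the freezing scheme, the sketch below scores each rank-1 LoRA component with a simple magnitude proxy (standing in for the paper's smoothed sensitivity scores) and masks gradients so only the top-k components train. The shapes and importance measure are assumptions, not the paper's code.

```python
import torch

def freeze_unimportant_rank1(A, B, k):
    """A: (r, d_in), B: (d_out, r) LoRA factors (leaf tensors with
    requires_grad=True); keep only the top-k rank-1 components trainable."""
    # Magnitude of each rank-1 term as an importance proxy (the paper uses
    # smoothed sensitivity scores instead).
    importance = A.norm(dim=1) * B.norm(dim=0)
    keep = torch.zeros(A.shape[0], dtype=torch.bool)
    keep[torch.topk(importance, k).indices] = True
    # Zero the gradients of frozen components so only the top-k get updated.
    A.register_hook(lambda g: g * keep[:, None].to(g.dtype))
    B.register_hook(lambda g: g * keep[None, :].to(g.dtype))
    return keep
```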
[AI-75] Discovering emergent connections in quantum physics research via dynamic word embeddings
链接: https://arxiv.org/abs/2411.06577
作者: Felix Frohnert,Xuemei Gu,Mario Krenn,Evert van Nieuwenburg
关键词-EN: researchers naturally form, naturally form subgroups, form subgroups focusing, researchers naturally, naturally form
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)
*备注: 7 pages; 4 figures; 1 table; Appendix: 2 pages, 2 figures
点击查看摘要
Abstract:As the field of quantum physics evolves, researchers naturally form subgroups focusing on specialized problems. While this encourages in-depth exploration, it can limit the exchange of ideas across structurally similar problems in different subfields. To encourage cross-talk among these different specialized areas, data-driven approaches using machine learning have recently shown promise to uncover meaningful connections between research concepts, promoting cross-disciplinary innovation. Current state-of-the-art approaches represent concepts using knowledge graphs and frame the task as a link prediction problem, where connections between concepts are explicitly modeled. In this work, we introduce a novel approach based on dynamic word embeddings for concept combination prediction. Unlike knowledge graphs, our method captures implicit relationships between concepts, can be learned in a fully unsupervised manner, and encodes a broader spectrum of information. We demonstrate that this representation enables accurate predictions about the co-occurrence of concepts within research abstracts over time. To validate the effectiveness of our approach, we provide a comprehensive benchmark against existing methods and offer insights into the interpretability of these embeddings, particularly in the context of quantum physics research. Our findings suggest that this representation offers a more flexible and informative way of modeling conceptual relationships in scientific literature.
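A toy sketch of the general recipe, with made-up corpora and concept names: train one embedding model per time slice and track how the cosine similarity between two concepts evolves. In practice the per-slice spaces would also need alignment (e.g. orthogonal Procrustes), which is omitted here.

```python
from gensim.models import Word2Vec

# Made-up per-year corpora of tokenized abstracts.
slices = {
    2018: [["quantum", "error", "correction"], ["machine", "learning", "models"]],
    2023: [["machine", "learning", "for", "quantum", "error", "correction"]],
}
for year, abstracts in sorted(slices.items()):
    model = Word2Vec(abstracts, vector_size=32, window=3, min_count=1, seed=0)
    sim = model.wv.similarity("quantum", "learning")
    print(year, f"cos(quantum, learning) = {sim:.3f}")
```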
[AI-76] Learning Loss Landscapes in Preference Optimization
链接: https://arxiv.org/abs/2411.06568
作者: Carlo Alfano,Silvia Sapora,Jakob Nicolaus Foerster,Patrick Rebeschini,Yee Whye Teh
关键词-EN: Preference Optimization, Direct Preference Optimization, empirical study investigating, Odds-Ratio Preference Optimization, preference datasets
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:We present an empirical study investigating how specific properties of preference datasets, such as mixed-quality or noisy data, affect the performance of Preference Optimization (PO) algorithms. Our experiments, conducted in MuJoCo environments, reveal several scenarios where state-of-the-art PO methods experience significant drops in performance. To address this issue, we introduce a novel PO framework based on mirror descent, which can recover existing methods like Direct Preference Optimization (DPO) and Odds-Ratio Preference Optimization (ORPO) for specific choices of the mirror map. Within this framework, we employ evolutionary strategies to discover new loss functions capable of handling the identified problematic scenarios. These new loss functions lead to significant performance improvements over DPO and ORPO across several tasks. Additionally, we demonstrate the generalization capability of our approach by applying the discovered loss functions to fine-tuning large language models using mixed-quality data, where they outperform ORPO.
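For reference, the standard DPO loss that the proposed mirror-descent framework recovers as a special case can be sketched as follows; the log-probabilities are assumed to be summed over response tokens.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Reward margin measured relative to a frozen reference policy.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * margin).mean()

# Dummy batch of 4 preference pairs:
lc, lr, rc, rr = (torch.randn(4) for _ in range(4))
print(dpo_loss(lc, lr, rc, rr))
```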
[AI-77] Foundation Model for Composite Materials and Microstructural Analysis
链接: https://arxiv.org/abs/2411.06565
作者: Ting-Ju Wei,Chuin-Shan (David) Chen
关键词-EN: unlocked numerous opportunities, rapid advancement, advancement of machine, unlocked numerous, numerous opportunities
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:The rapid advancement of machine learning has unlocked numerous opportunities for materials science, particularly in accelerating the design and analysis of materials. However, a significant challenge lies in the scarcity and high cost of obtaining high-quality materials datasets. In other fields, such as natural language processing, foundation models pre-trained on large datasets have achieved exceptional success in transfer learning, effectively leveraging latent features to achieve high performance on tasks with limited data. Despite this progress, the concept of foundation models remains underexplored in materials science. Here, we present a foundation model specifically designed for composite materials. Our model is pre-trained on a dataset of short-fiber composites to learn robust latent features. During transfer learning, the MMAE accurately predicts homogenized stiffness, with an $R^2$ score reaching as high as 0.959 and consistently exceeding 0.91, even when trained on limited data. These findings validate the feasibility and effectiveness of foundation models in composite materials. We anticipate extending this approach to more complex three-dimensional composite materials, polycrystalline materials, and beyond. Moreover, this framework enables high-accuracy predictions even when experimental data are scarce, paving the way for more efficient and cost-effective materials design and analysis.
[AI-78] Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents
链接: https://arxiv.org/abs/2411.06559
作者: Yu Gu,Boyuan Zheng,Boyu Gou,Kai Zhang,Cheng Chang,Sanjari Srivastava,Yanan Xie,Peng Qi,Huan Sun,Yu Su
关键词-EN: automating web-based tasks, demonstrated promising capabilities, underperform largely compared, current reactive approaches, web-based tasks
类目: Artificial Intelligence (cs.AI)
*备注: 18 pages, 6 figures, 4 tables
点击查看摘要
Abstract:Language agents have demonstrated promising capabilities in automating web-based tasks, though their current reactive approaches still underperform largely compared to humans. While incorporating advanced planning algorithms, particularly tree search methods, could enhance these agents’ performance, implementing tree search directly on live websites poses significant safety risks and practical constraints due to irreversible actions such as confirming a purchase. In this paper, we introduce a novel paradigm that augments language agents with model-based planning, pioneering the innovative use of large language models (LLMs) as world models in complex web environments. Our method, WebDreamer, builds on the key insight that LLMs inherently encode comprehensive knowledge about website structures and functionalities. Specifically, WebDreamer uses LLMs to simulate outcomes for each candidate action (e.g., “what would happen if I click this button?”) using natural language descriptions, and then evaluates these imagined outcomes to determine the optimal action at each step. Empirical results on two representative web agent benchmarks with online interaction – VisualWebArena and Mind2Web-live – demonstrate that WebDreamer achieves substantial improvements over reactive baselines. By establishing the viability of LLMs as world models in web environments, this work lays the groundwork for a paradigm shift in automated web interaction. More broadly, our findings open exciting new avenues for future research into 1) optimizing LLMs specifically for world modeling in complex, dynamic environments, and 2) model-based speculative planning for language agents.
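A minimal sketch of the simulate-then-score loop described above, with `llm` as a hypothetical text-in/text-out callable and purely illustrative prompts:

```python
def pick_action(llm, page_description, candidate_actions, goal):
    best_action, best_score = None, float("-inf")
    for action in candidate_actions:
        # 1) LLM as world model: imagine the outcome in natural language.
        outcome = llm(f"Page: {page_description}\nIf I {action}, what happens next?")
        # 2) Score the imagined outcome against the task goal.
        reply = llm(f"Goal: {goal}\nPredicted outcome: {outcome}\n"
                    "Rate progress toward the goal from 0 to 10. Reply with a number.")
        score = float(reply.strip().split()[0])
        if score > best_score:
            best_action, best_score = action, score
    return best_action  # only the chosen action is executed on the live site
```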
[AI-79] Is Linear Feedback on Smoothed Dynamics Sufficient for Stabilizing Contact-Rich Plans? ICRA2025
链接: https://arxiv.org/abs/2411.06542
作者: Yuki Shirai,Tong Zhao,H.J. Terry Suh,Huaijiang Zhu,Xinpei Ni,Jiuguang Wang,Max Simchowitz,Tao Pang
关键词-EN: Designing planners, synthesis tools assume, extremely challenging, violates the smoothness, tools assume
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注: Under review for ICRA2025
点击查看摘要
Abstract:Designing planners and controllers for contact-rich manipulation is extremely challenging as contact violates the smoothness conditions that many gradient-based controller synthesis tools assume. Contact smoothing approximates a non-smooth system with a smooth one, allowing one to use these synthesis tools more effectively. However, applying classical control synthesis methods to smoothed contact dynamics remains relatively under-explored. This paper analyzes the efficacy of linear controller synthesis using differential simulators based on contact smoothing. We introduce natural baselines for leveraging contact smoothing to compute (a) open-loop plans robust to uncertain conditions and/or dynamics, and (b) feedback gains to stabilize around open-loop plans. Using robotic bimanual whole-body manipulation as a testbed, we perform extensive empirical experiments on over 300 trajectories and analyze why LQR seems insufficient for stabilizing contact-rich plans.
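For readers unfamiliar with the baseline, computing an LQR feedback gain around a nominal linearization takes a few lines with SciPy; the matrices below are toy stand-ins for a smoothed contact model's linearization.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

# Toy linearization (A, B) standing in for a smoothed contact model.
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = np.array([[0.0],
              [0.1]])
Q, R = np.eye(2), 0.01 * np.eye(1)

P = solve_discrete_are(A, B, Q, R)
K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)  # u = -K (x - x_nominal)
print("feedback gain K =", K)
```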
[AI-80] A Next-Generation Approach to Airline Reservations: Integrating Cloud Microservices with AI and Blockchain for Enhanced Operational Performance
链接: https://arxiv.org/abs/2411.06538
作者: Biman Barua,M. Shamim Kaiser
关键词-EN: distributed artificial intelligence, Cloud microservices, incorporates the Cloud, artificial intelligence modules, generation airline reservation
类目: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
*备注: 25 pages, 8 figures
点击查看摘要
Abstract:This research proposes the development of a next-generation airline reservation system that incorporates cloud microservices, distributed artificial intelligence modules, and blockchain technology to improve efficiency, safety, and customer satisfaction. Traditional reservation systems encounter issues related to scalability, data integrity, and the level of service offered to customers, which this architecture addresses through modular and data-centric design approaches. Separating operations such as reservations, payments, and customer data management allows the system to achieve 30% higher availability and a 40% improvement in scalability-related performance. The system's AI-driven modules use past booking patterns together with customer profiles to estimate demand and make recommendations, increasing customer engagement by 25%. Moreover, blockchain provides an incorruptible ledger for all transactions, mitigating fraud and increasing transparency by 20%. The system was analyzed in a simulator and with machine-learning evaluations that rated it against conventional systems. The results show clear enhancements in transaction speed, with secure data processing rates rising by 35% and system response time improving by 15%. The system can also be applied to other high-transaction industries such as logistics and hospitality. This architectural design illustrates how advanced technologies can transform the airline reservation sector, with implications of growing effectiveness, improved security, and greater customer contentment.
[AI-81] I2VControl-Camera: Precise Video Camera Control with Adjustable Motion Strength
链接: https://arxiv.org/abs/2411.06525
作者: Wanquan Feng,Jiawei Liu,Pengqi Tu,Tianhao Qi,Mingzhen Sun,Tianxiang Ma,Songtao Zhao,Siyu Zhou,Qian He
关键词-EN: broad potential applications, Video generation technologies, potential applications, developing rapidly, broad potential
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Video generation technologies are developing rapidly and have broad potential applications. Among these technologies, camera control is crucial for generating professional-quality videos that accurately meet user expectations. However, existing camera control methods still suffer from several limitations, including control precision and the neglect of the control for subject motion dynamics. In this work, we propose I2VControl-Camera, a novel camera control method that significantly enhances controllability while providing adjustability over the strength of subject motion. To improve control precision, we employ point trajectory in the camera coordinate system instead of only extrinsic matrix information as our control signal. To accurately control and adjust the strength of subject motion, we explicitly model the higher-order components of the video trajectory expansion, not merely the linear terms, and design an operator that effectively represents the motion strength. We use an adapter architecture that is independent of the base model structure. Experiments on static and dynamic scenes show that our framework outperforms previous methods both quantitatively and qualitatively.
[AI-82] Does This Summary Answer My Question? Modeling Query-Focused Summary Readers with Rational Speech Acts
链接: https://arxiv.org/abs/2411.06524
作者: Cesare Spinoso-Di Piano,Jackie Chi Kit Cheung
关键词-EN: existing QFS systems, QFS systems, Rational Speech Act, QFS, causing QFS systems
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Query-focused summarization (QFS) is the task of generating a summary in response to a user-written query. Despite its user-oriented nature, there has been limited work in QFS in explicitly considering a user’s understanding of a generated summary, potentially causing QFS systems to underperform at inference time. In this paper, we adapt the Rational Speech Act (RSA) framework, a model of human communication, to explicitly model a reader’s understanding of a query-focused summary and integrate it within the generation method of existing QFS systems. In particular, we introduce the answer reconstruction objective which approximates a reader’s understanding of a summary by their ability to use it to reconstruct the answer to their initial query. Using this objective, we are able to re-rank candidate summaries generated by existing QFS systems and select summaries that better align with their corresponding query and reference summary. More generally, our study suggests that a simple and effective way of improving a language generation system designed for a user-centered task may be to explicitly incorporate its user requirements into the system’s generation procedure.
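A bare-bones sketch of answer-reconstruction re-ranking; here a crude token-overlap score stands in for the reader model's actual reconstruction ability, so treat it as an illustration of the selection logic only.

```python
def reconstruction_score(summary, reference_answer):
    """Token-overlap proxy for how well a reader could recover the answer."""
    s = set(summary.lower().split())
    a = set(reference_answer.lower().split())
    return len(s & a) / max(len(a), 1)

def rerank(candidate_summaries, reference_answer):
    # Prefer summaries from which the original answer is most recoverable.
    return sorted(candidate_summaries,
                  key=lambda s: reconstruction_score(s, reference_answer),
                  reverse=True)
```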
[AI-83] Offline Handwritten Signature Verification Using a Stream-Based Approach ICPR
链接: https://arxiv.org/abs/2411.06510
作者: Kecia G. de Moura,Rafael M. O. Cruz,Robert Sabourin
关键词-EN: Handwritten Signature Verification, Handwritten Signature, Signature Verification, distinguish between genuine, genuine and forged
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted for oral presentation at the International Conference on Pattern Recognition (ICPR) 2024
点击查看摘要
Abstract:Handwritten Signature Verification (HSV) systems distinguish between genuine and forged signatures. Traditional HSV development involves a static batch configuration, constraining the system’s ability to model signatures to the limited data available. Signatures exhibit high intra-class variability and are sensitive to various factors, including time and external influences, imparting them a dynamic nature. This paper investigates the signature learning process within a data stream context. We propose a novel HSV approach with an adaptive system that receives an infinite sequence of signatures and is updated over time. Experiments were carried out on GPDS Synthetic, CEDAR, and MCYT datasets. Results demonstrate the superior performance of the proposed method compared to standard approaches that use a Support Vector Machine as a classifier. Implementation of the method is available at this https URL.
[AI-84] Understanding the Role of Equivariance in Self-supervised Learning NEURIPS2024
链接: https://arxiv.org/abs/2411.06508
作者: Yifei Wang,Kaiwen Hu,Sharut Gupta,Ziyu Ye,Yisen Wang,Stefanie Jegelka
关键词-EN: Contrastive learning, self-supervised learning, leading paradigm, widely observed, price of sacrificing
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Machine Learning (stat.ML)
*备注: Accepted at NeurIPS 2024
点击查看摘要
Abstract:Contrastive learning has been a leading paradigm for self-supervised learning, but it is widely observed that it comes at the price of sacrificing useful features (e.g., colors) by being invariant to data augmentations. Given this limitation, there has been a surge of interest in equivariant self-supervised learning (E-SSL) that learns features to be augmentation-aware. However, even for the simplest rotation prediction method, there is a lack of rigorous understanding of why, when, and how E-SSL learns useful features for downstream tasks. To bridge this gap between practice and theory, we establish an information-theoretic perspective to understand the generalization ability of E-SSL. In particular, we identify a critical explaining-away effect in E-SSL that creates a synergy between the equivariant and classification tasks. This synergy effect encourages models to extract class-relevant features to improve its equivariant prediction, which, in turn, benefits downstream tasks requiring semantic features. Based on this perspective, we theoretically analyze the influence of data transformations and reveal several principles for practical designs of E-SSL. Our theory not only aligns well with existing E-SSL methods but also sheds light on new directions by exploring the benefits of model equivariance. We believe that a theoretically grounded understanding on the role of equivariance would inspire more principled and advanced designs in this field. Code is available at this https URL.
[AI-85] Barriers to Complexity-Theoretic Proofs that Achieving AGI Using Machine Learning is Intractable
链接: https://arxiv.org/abs/2411.06498
作者: Michael Guerzhoy
关键词-EN: achieving human-like intelligence, van Rooij, recent paper, complexity-theoretic sense, proved that achieving
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:A recent paper (van Rooij et al. 2024) claims to have proved that achieving human-like intelligence using learning from data is intractable in a complexity-theoretic sense. We identify that the proof relies on an unjustified assumption about the distribution of (input, output) pairs to the system. We briefly discuss that assumption in the context of two fundamental barriers to repairing the proof: the need to precisely define "human-like," and the need to account for the fact that a particular machine learning system will have particular inductive biases that are key to the analysis.
[AI-86] LProtector: An LLM -driven Vulnerability Detection System
链接: https://arxiv.org/abs/2411.06493
作者: Ze Sheng,Fenghua Wu,Xiangwu Zuo,Chao Li,Yuxin Qiao
关键词-EN: large language model, language model, paper presents LProtector, RAG, paper presents
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: 5 pages, 4 figures. This is a preprint version of the article. The final version will be published in the proceedings of the IEEE conference
点击查看摘要
Abstract:This paper presents LProtector, an automated vulnerability detection system for C/C++ codebases driven by the large language model (LLM) GPT-4o and Retrieval-Augmented Generation (RAG). As software complexity grows, traditional methods face challenges in detecting vulnerabilities effectively. LProtector leverages GPT-4o’s powerful code comprehension and generation capabilities to perform binary classification and identify vulnerabilities within target codebases. We conducted experiments on the Big-Vul dataset, showing that LProtector outperforms two state-of-the-art baselines in terms of F1 score, demonstrating the potential of integrating LLMs with vulnerability detection.
[AI-87] Hermes: A Large Language Model Framework on the Journey to Autonomous Networks
链接: https://arxiv.org/abs/2411.06490
作者: Fadhel Ayed,Ali Maatouk,Nicola Piovesan,Antonio De Domenico,Merouane Debbah,Zhi-Quan Luo
关键词-EN: automating cellular network, Network Digital Twins, drive toward automating, increasing complexity, network
类目: Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
*备注:
点击查看摘要
Abstract:The drive toward automating cellular network operations has grown with the increasing complexity of these systems. Despite advancements, full autonomy currently remains out of reach due to reliance on human intervention for modeling network behaviors and defining policies to meet target requirements. Network Digital Twins (NDTs) have shown promise in enhancing network intelligence, but the successful implementation of this technology is constrained by use case-specific architectures, limiting its role in advancing network autonomy. A more capable network intelligence, or "telecommunications brain", is needed to enable seamless, autonomous management of cellular networks. Large Language Models (LLMs) have emerged as potential enablers for this vision but face challenges in network modeling, especially in reasoning and handling diverse data types. To address these gaps, we introduce Hermes, a chain of LLM agents that uses "blueprints" for constructing NDT instances through structured and explainable logical steps. Hermes allows automatic, reliable, and accurate network modeling of diverse use cases and configurations, thus marking progress toward fully autonomous network operations.
[AI-88] RL-Pruner: Structured Pruning Using Reinforcement Learning for CNN Compression and Acceleration
链接: https://arxiv.org/abs/2411.06463
作者: Boyao Wang,Volodymyr Kindratenko
关键词-EN: Convolutional Neural Networks, Convolutional Neural, demonstrated exceptional performance, recent years, demonstrated exceptional
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Convolutional Neural Networks (CNNs) have demonstrated exceptional performance in recent years. Compressing these models not only reduces storage requirements, making deployment to edge devices feasible, but also accelerates inference, thereby reducing latency and computational costs. Structured pruning, which removes filters at the layer level, directly modifies the model architecture. This approach achieves a more compact architecture while maintaining target accuracy, ensuring that the compressed model retains good compatibility and hardware efficiency. Our method is based on a key observation: filters in different layers of a neural network have varying importance to the model’s performance. When the number of filters to prune is fixed, the optimal pruning distribution across different layers is uneven to minimize performance loss. Layers that are more sensitive to pruning should account for a smaller proportion of the pruning distribution. To leverage this insight, we propose RL-Pruner, which uses reinforcement learning to learn the optimal pruning distribution. RL-Pruner can automatically extract dependencies between filters in the input model and perform pruning, without requiring model-specific pruning implementations. We conducted experiments on models such as GoogleNet, ResNet, and MobileNet, comparing our approach to other structured pruning methods to validate its effectiveness. Our code is available at this https URL.
[AI-89] Over-parameterized Student Model via Tensor Decomposition Boosted Knowledge Distillation NEURIPS2024
链接: https://arxiv.org/abs/2411.06448
作者: Yu-Liang Zhan,Zhong-Yi Lu,Hao Sun,Ze-Feng Gao
关键词-EN: enabled large pre-trained, Increased training parameters, Increased training, large pre-trained models, student model
类目: Artificial Intelligence (cs.AI)
*备注: 38th Conference on Neural Information Processing Systems (NeurIPS 2024)
点击查看摘要
Abstract:Increased training parameters have enabled large pre-trained models to excel in various downstream tasks. Nevertheless, the extensive computational requirements associated with these models hinder their widespread adoption within the community. We focus on Knowledge Distillation (KD), where a compact student model is trained to mimic a larger teacher model, facilitating the transfer of knowledge of large models. In contrast to much of the previous work, we scale up the parameters of the student model during training, to benefit from overparameterization without increasing the inference latency. In particular, we propose a tensor decomposition strategy that effectively over-parameterizes the relatively small student model through an efficient and nearly lossless decomposition of its parameter matrices into higher-dimensional tensors. To ensure efficiency, we further introduce a tensor constraint loss to align the high-dimensional tensors between the student and teacher models. Comprehensive experiments validate the significant performance enhancement by our approach in various KD tasks, covering computer vision and natural language processing areas. Our code is available at this https URL.
[AI-90] Local Implicit Wavelet Transformer for Arbitrary-Scale Super-Resolution BMVC2024
链接: https://arxiv.org/abs/2411.06442
作者: Minghong Duan,Linhao Qu,Shaolei Liu,Manning Wang
关键词-EN: Implicit neural representations, Implicit Wavelet Transformer, Local Implicit Wavelet, recently demonstrated promising, demonstrated promising potential
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted by BMVC 2024
点击查看摘要
Abstract:Implicit neural representations have recently demonstrated promising potential in arbitrary-scale Super-Resolution (SR) of images. Most existing methods predict the pixel in the SR image based on the queried coordinate and ensemble nearby features, overlooking the importance of incorporating high-frequency prior information in images, which results in limited performance in reconstructing high-frequency texture details in images. To address this issue, we propose the Local Implicit Wavelet Transformer (LIWT) to enhance the restoration of high-frequency texture details. Specifically, we decompose the features extracted by an encoder into four sub-bands containing different frequency information using Discrete Wavelet Transform (DWT). We then introduce the Wavelet Enhanced Residual Module (WERM) to transform these four sub-bands into high-frequency priors, followed by utilizing the Wavelet Mutual Projected Fusion (WMPF) and the Wavelet-aware Implicit Attention (WIA) to fully exploit the high-frequency prior information for recovering high-frequency details in images. We conducted extensive experiments on benchmark datasets to validate the effectiveness of LIWT. Both qualitative and quantitative results demonstrate that LIWT achieves promising performance in arbitrary-scale SR tasks, outperforming other state-of-the-art methods. The code is available at this https URL.
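The DWT step described above is easy to reproduce in isolation; the sketch below uses a Haar wavelet (an assumption, the paper may use a different basis) to split a feature map into the four sub-bands.

```python
import numpy as np
import pywt

feature_map = np.random.rand(64, 64)
# One level of a 2-D Haar DWT yields four sub-bands.
LL, (horiz, vert, diag) = pywt.dwt2(feature_map, "haar")
print(LL.shape)  # (32, 32) low-frequency approximation
# The horizontal/vertical/diagonal detail bands carry the high-frequency
# information that LIWT turns into high-frequency priors.
```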
[AI-91] Reinforcement learning for Quantum Tiq-Taq-Toe
链接: https://arxiv.org/abs/2411.06429
作者: Catalin-Viorel Dinu,Thomas Moerland
关键词-EN: Quantum, well-known benchmark, benchmark and playground, quantum computing, machine learning
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Quantum Tiq-Taq-Toe is a well-known benchmark and playground for both quantum computing and machine learning. Despite its popularity, no reinforcement learning (RL) methods have been applied to Quantum Tiq-Taq-Toe. Although there has been some research on Quantum Chess, this game is significantly more complex in terms of computation and analysis. Therefore, we study the combination of quantum computing and reinforcement learning in Quantum Tiq-Taq-Toe, which may serve as an accessible testbed for the integration of both fields. Quantum games are challenging to represent classically due to their inherent partial observability and the potential for exponential state complexity. In Quantum Tiq-Taq-Toe, states are observed through Measurement (a 3x3 matrix of state probabilities) and Move History (a 9x9 matrix of entanglement relations), making strategy complex as each move can collapse the quantum state.
[AI-92] Neuro-Symbolic Rule Lists
链接: https://arxiv.org/abs/2411.06428
作者: Sascha Xu,Nils Philipp Walter,Jilles Vreeken
关键词-EN: Machine learning models, Machine learning, accountability and fairness, deployed in sensitive, sensitive areas
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Machine learning models deployed in sensitive areas such as healthcare must be interpretable to ensure accountability and fairness. Rule lists (if Age < 35 ∧ Priors > 0 then Recidivism = True, else if next condition …) offer full transparency, making them well-suited for high-stakes decisions. However, learning such rule lists presents significant challenges. Existing methods based on combinatorial optimization require feature pre-discretization and impose restrictions on rule size. Neuro-symbolic methods use more scalable continuous optimization yet place similar pre-discretization constraints and suffer from unstable optimization. To address the existing limitations, we introduce NeuRules, an end-to-end trainable model that unifies discretization, rule learning, and rule order into a single differentiable framework. We formulate a continuous relaxation of the rule list learning problem that converges to a strict rule list through temperature annealing. NeuRules learns both the discretizations of individual features, as well as their combination into conjunctive rules without any pre-processing or restrictions. Extensive experiments demonstrate that NeuRules consistently outperforms both combinatorial and neuro-symbolic methods, effectively learning simple and complex rules, as well as their order, across a wide range of datasets.
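The temperature-annealed relaxation can be illustrated with a single differentiable threshold (not the paper's exact parameterization): as the temperature shrinks, the soft test approaches a hard rule condition.

```python
import torch

def soft_greater_than(x, threshold, temperature):
    # Differentiable surrogate for the hard test "x > threshold".
    return torch.sigmoid((x - threshold) / temperature)

x = torch.tensor([20.0, 34.0, 50.0])
for tau in (10.0, 1.0, 0.1):
    print(tau, soft_greater_than(x, threshold=35.0, temperature=tau))
# As tau -> 0 the outputs saturate toward 0/1, recovering a strict rule list.
```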
[AI-93] Generating Mixcode Popular Songs with Artificial Intelligence: Concepts Plans and Speculations
链接: https://arxiv.org/abs/2411.06420
作者: Abhishek Kaushik,Kayla Rush
关键词-EN: potent form, create the emotions, Music, form of expression, artificial intelligence
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注: Link to the paper: this https URL Published in The International Conference on AI and Musical Creativity at the University of Oxford (2024) this https URL
点击查看摘要
Abstract:Music is a potent form of expression that can communicate, accentuate or even create the emotions of an individual or a collective. Both historically and in contemporary experiences, musical expression was and is commonly instrumentalized for social, political and/or economic purposes. Generative artificial intelligence provides a wealth of both opportunities and challenges with regard to music and its role in society. This paper discusses a proposed project integrating artificial intelligence and popular music, with the ultimate goal of creating a powerful tool for implementing music for social transformation, education, healthcare, and emotional well-being. Given that it is being presented at the outset of a collaboration between a computer scientist/data analyst and an ethnomusicologist/social anthropologist, it is mainly conceptual and somewhat speculative in nature.
[AI-94] Automated Strategy Invention for Confluence of Term Rewrite Systems
链接: https://arxiv.org/abs/2411.06409
作者: Liao Zhang,Fabian Mitterwallner,Jan Jakubuv,Cezary Kaliszyk
关键词-EN: compiler optimization, Term rewriting plays, plays a crucial, crucial role, role in software
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Term rewriting plays a crucial role in software verification and compiler optimization. With dozens of highly parameterizable techniques developed to prove various system properties, automatic term rewriting tools work in an extensive parameter space. This complexity exceeds human capacity for parameter selection, motivating an investigation into automated strategy invention. In this paper, we focus on confluence, an important property of term rewrite systems, and apply machine learning to develop the first learning-guided automatic confluence prover. Moreover, we randomly generate a large dataset to analyze confluence for term rewrite systems. Our results focus on improving the state-of-the-art automatic confluence prover CSI: When equipped with our invented strategies, it surpasses its human-designed strategies both on the augmented dataset and on the original human-created benchmark dataset Cops, proving/disproving the confluence of several term rewrite systems for which no automated proofs were known before.
[AI-95] Mastering NIM and Impartial Games with Weak Neural Networks: An AlphaZero-inspired Multi-Frame Approach
链接: https://arxiv.org/abs/2411.06403
作者: Søren Riis
关键词-EN: Bei Zhou experimentally, Zhou experimentally finding, Bei Zhou, Harvey Friedman, Zhou experimentally
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:This paper provides a theoretical framework that validates and explains the results in the work with Bei Zhou, which experimentally found that AlphaZero-style reinforcement learning algorithms struggle to learn optimal play in NIM, a canonical impartial game proposed as an AI challenge by Harvey Friedman in 2017. Our analysis resolves a controversy around these experimental results, which revealed unexpected difficulties in learning NIM despite its mathematical simplicity compared to games like chess and Go. Our key contributions are as follows: We prove that by incorporating recent game history, these limited AlphaZero models can, in principle, achieve optimal play in NIM. We introduce a novel search strategy where roll-outs preserve game-theoretic values during move selection, guided by a specialised policy network. We provide constructive proofs showing that our approach enables optimal play within the $\text{AC}^0$ complexity class despite the theoretical limitations of these networks. This research demonstrates how constrained neural networks, when properly designed, can achieve sophisticated decision-making even in domains where their basic computational capabilities appear insufficient.
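As background for the "mathematical simplicity" claim, optimal NIM play follows from classical Sprague-Grundy theory and fits in a few lines: a position is lost for the player to move iff the XOR (nim-sum) of the pile sizes is zero, and any winning move reduces some pile so the nim-sum becomes zero.

```python
from functools import reduce
from operator import xor

def optimal_move(piles):
    nim_sum = reduce(xor, piles)
    if nim_sum == 0:
        return None  # every move loses against perfect play
    for i, pile in enumerate(piles):
        target = pile ^ nim_sum
        if target < pile:
            return i, pile - target  # take this many stones from pile i

print(optimal_move([3, 4, 5]))  # (0, 2): taking 2 from the first pile zeroes the XOR
```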
[AI-96] A Variance Minimization Approach to Temporal-Difference Learning
链接: https://arxiv.org/abs/2411.06396
作者: Xingguo Chen,Yu Gong,Shangdong Yang,Wenhao Wang
关键词-EN: Fast-converging algorithms, reinforcement learning, contemporary requirement, requirement in reinforcement, Projected Bellman Error
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Fast-converging algorithms are a contemporary requirement in reinforcement learning. In the context of linear function approximation, the magnitude of the smallest eigenvalue of the key matrix is a major factor reflecting the convergence speed. Traditional value-based RL algorithms focus on minimizing errors. This paper introduces a variance minimization (VM) approach for value-based RL instead of error minimization. Based on this approach, we proposed two objectives, the Variance of Bellman Error (VBE) and the Variance of Projected Bellman Error (VPBE), and derived the VMTD, VMTDC, and VMETD algorithms. We provided proofs of their convergence and optimal policy invariance of the variance minimization. Experimental studies validate the effectiveness of the proposed algorithms.
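One plausible formalization of the VBE objective under standard TD notation (the paper's exact definitions may differ):

```latex
% Assumed notation: the one-step TD (Bellman) error is
%   \delta_t = r_{t+1} + \gamma V_\theta(s_{t+1}) - V_\theta(s_t).
% Instead of the usual mean-squared objective E[\delta_t^2], VBE minimizes its variance:
\mathrm{VBE}(\theta)
  = \mathbb{E}\!\left[\left(\delta_t - \mathbb{E}[\delta_t]\right)^2\right]
  = \mathbb{E}[\delta_t^2] - \left(\mathbb{E}[\delta_t]\right)^2 ,
% and VPBE applies the same construction to the projected Bellman error.
```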
[AI-97] Class Granularity: How richly does your knowledge graph represent the real world?
链接: https://arxiv.org/abs/2411.06385
作者: Sumin Seo,Heeseon Cheon,Hyunho Kim
关键词-EN: Class Granularity, utilize knowledge graphs, effectively manage, manage and utilize, knowledge graphs
类目: Artificial Intelligence (cs.AI)
*备注: 10 pages
点击查看摘要
Abstract:To effectively manage and utilize knowledge graphs, it is crucial to have metrics that can assess the quality of knowledge graphs from various perspectives. While there have been studies on knowledge graph quality metrics, there has been a lack of research on metrics that measure how richly ontologies, which form the backbone of knowledge graphs, are defined, or on the impact of richly defined ontologies. In this study, we propose a new metric called Class Granularity, which measures how well a knowledge graph is structured in terms of how finely classes with unique characteristics are defined. Furthermore, this research presents the potential impact of Class Granularity on knowledge graphs' downstream tasks. In particular, we explore its influence on graph embedding and provide experimental results. Additionally, this research goes beyond traditional Linked Open Data comparison studies, which mainly focus on factors like scale and class distribution, by using Class Granularity to compare four different LOD sources.
[AI-98] Phantom: Constraining Generative Artificial Intelligence Models for Practical Domain Specific Peripherals Trace Synthesizing
链接: https://arxiv.org/abs/2411.06376
作者: Zhibai Huang,Yihan Shen,Yongchen Xie,Zhixiang Wei,Yun wang,Fangxin Liu,Tao Song,Zhengwei Qi
关键词-EN: Component Interconnect Express, Peripheral Component Interconnect, facto interconnect standard, Interconnect Express, Peripheral Component
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
*备注:
点击查看摘要
Abstract:Peripheral Component Interconnect Express (PCIe) is the de facto interconnect standard for high-speed peripherals and CPUs. Prototyping and optimizing PCIe devices for emerging scenarios is an ongoing challenge. Since Transaction Layer Packets (TLPs) capture device-CPU interactions, it is crucial to analyze and generate realistic TLP traces for effective device design and optimization. Generative AI offers a promising approach for creating intricate, custom TLP traces necessary for PCIe hardware and software development. However, existing models often generate impractical traces due to the absence of PCIe-specific constraints, such as TLP ordering and causality. This paper presents Phantom, the first framework that treats TLP trace generation as a generative AI problem while incorporating PCIe-specific constraints. We validate Phantom's effectiveness by generating TLP traces for an actual PCIe network interface card. Experimental results show that Phantom produces practical, large-scale TLP traces, significantly outperforming existing models, with improvements of up to $1000\times$ in task-specific metrics and up to $2.19\times$ in Fréchet Inception Distance (FID) compared to backbone-only methods.
[AI-99] BayesNAM: Leveraging Inconsistency for Reliable Explanations
链接: https://arxiv.org/abs/2411.06367
作者: Hoki Kim,Jinseong Park,Yujin Choi,Seungyun Lee,Jaewook Lee
关键词-EN: explainable artificial intelligence, recently proposed explainable, proposed explainable artificial, utilizes neural network-based, neural network-based architectures
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注: Under Review
点击查看摘要
Abstract:Neural additive model (NAM) is a recently proposed explainable artificial intelligence (XAI) method that utilizes neural network-based architectures. Given the advantages of neural networks, NAMs provide intuitive explanations for their predictions with high model performance. In this paper, we analyze a critical yet overlooked phenomenon: NAMs often produce inconsistent explanations, even when using the same architecture and dataset. Traditionally, such inconsistencies have been viewed as issues to be resolved. However, we argue instead that these inconsistencies can provide valuable explanations within the given data model. Through a simple theoretical framework, we demonstrate that these inconsistencies are not mere artifacts but emerge naturally in datasets with multiple important features. To effectively leverage this information, we introduce a novel framework, Bayesian Neural Additive Model (BayesNAM), which integrates Bayesian neural networks and feature dropout, with theoretical proof demonstrating that feature dropout effectively captures model inconsistencies. Our experiments demonstrate that BayesNAM effectively reveals potential problems such as insufficient data or structural limitations of the model, providing more reliable explanations and potential remedies.
[AI-100] Layer-Wise Feature Metric of Semantic-Pixel Matching for Few-Shot Learning
链接: https://arxiv.org/abs/2411.06363
作者: Hao Tang,Junhao Lu,Guoheng Huang,Ming Li,Xuhang Chen,Guo Zhong,Zhengguang Tan,Zinuo Li
关键词-EN: traditional metric-based approaches, traditional metric-based, Few-Shot Learning, metric-based approaches, approaches often rely
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:In Few-Shot Learning (FSL), traditional metric-based approaches often rely on global metrics to compute similarity. However, in natural scenes, the spatial arrangement of key instances is often inconsistent across images. This spatial misalignment can result in mismatched semantic pixels, leading to inaccurate similarity measurements. To address this issue, we propose a novel method called the Layer-Wise Features Metric of Semantic-Pixel Matching (LWFM-SPM) to make finer comparisons. Our method enhances model performance through two key modules: (1) the Layer-Wise Embedding (LWE) Module, which refines the cross-correlation of image pairs to generate well-focused feature maps for each layer; (2) the Semantic-Pixel Matching (SPM) Module, which aligns critical pixels based on semantic embeddings using an assignment algorithm. We conducted extensive experiments to evaluate our method on four widely used few-shot classification benchmarks: miniImageNet, tieredImageNet, CUB-200-2011, and CIFAR-FS. The results indicate that LWFM-SPM achieves competitive performance across these benchmarks. Our code will be publicly available on this https URL.
[AI-101] Deep Active Learning in the Open World
链接: https://arxiv.org/abs/2411.06353
作者: Tian Xie,Jifan Zhang,Haoyue Bai,Robert Nowak
关键词-EN: encounter unfamiliar conditions, unanticipated situations, scenarios often encounter, encounter unfamiliar, unfamiliar conditions
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Machine learning models deployed in open-world scenarios often encounter unfamiliar conditions and perform poorly in unanticipated situations. As AI systems advance and find application in safety-critical domains, effectively handling out-of-distribution (OOD) data is crucial to building open-world learning systems. In this work, we introduce ALOE, a novel active learning algorithm for open-world environments designed to enhance model adaptation by incorporating new OOD classes via a two-stage approach. First, diversity sampling selects a representative set of examples, followed by energy-based OOD detection to prioritize likely unknown classes for annotation. This strategy accelerates class discovery and learning, even under constrained annotation budgets. Evaluations on three long-tailed image classification benchmarks demonstrate that ALOE outperforms traditional active learning baselines, effectively expanding known categories while balancing annotation cost. Our findings reveal a crucial tradeoff between enhancing known-class performance and discovering new classes, setting the stage for future advancements in open-world machine learning.
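The energy score used for OOD prioritization follows the standard formulation from the energy-based OOD literature; the temperature and the use of ranking rather than a hard threshold are assumptions here.

```python
import torch

def energy_score(logits, T=1.0):
    # E(x) = -T * logsumexp(f(x)/T); higher energy => more likely OOD.
    return -T * torch.logsumexp(logits / T, dim=-1)

logits = torch.randn(5, 10)                               # 5 samples, 10 known classes
scores = energy_score(logits)
annotate_first = torch.argsort(scores, descending=True)   # most-OOD candidates first
print(annotate_first)
```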
[AI-102] Balancing Power and Ethics: A Framework for Addressing Human Rights Concerns in Military AI WWW
链接: https://arxiv.org/abs/2411.06336
作者: Mst Rafia Islam,Azmine Toushik Wasi
关键词-EN: significant strides recently, made significant strides, international humanitarian law, strides recently, made significant
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: Accepted for oral (only 3 papers are selected!) Harms and Risks of AI in the Military Workshop (HRAIM 2024) at Mila Quebec ( this https URL )
点击查看摘要
Abstract:AI has made significant strides recently, leading to various applications in both civilian and military sectors. The military sees AI as a solution for developing more effective and faster technologies. While AI offers benefits like improved operational efficiency and precision targeting, it also raises serious ethical and legal concerns, particularly regarding human rights violations. Autonomous weapons that make decisions without human input can threaten the right to life and violate international humanitarian law. To address these issues, we propose a three-stage framework (Design, In Deployment, and During/After Use) for evaluating human rights concerns in the design, deployment, and use of military AI. Each phase includes multiple components that address various concerns specific to that phase, ranging from bias and regulatory issues to violations of International Humanitarian Law. By this framework, we aim to balance the advantages of AI in military operations with the need to protect human rights.
[AI-103] NeuReg: Domain-invariant 3D Image Registration on Human and Mouse Brains
链接: https://arxiv.org/abs/2411.06315
作者: Taha Razzaq,Asim Iqbal
关键词-EN: accurately curate structural, curate structural boundaries, Medical brain imaging, imaging relies heavily, Medical brain
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Quantitative Methods (q-bio.QM)
*备注: 15 pages, 5 figures, 5 tables
点击查看摘要
Abstract:Medical brain imaging relies heavily on image registration to accurately curate structural boundaries of brain features for various healthcare applications. Deep learning models have shown remarkable performance in image registration in recent years. Still, they often struggle to handle the diversity of 3D brain volumes, challenged by their structural and contrastive variations and their imaging domains. In this work, we present NeuReg, a Neuro-inspired 3D image registration architecture with the feature of domain invariance. NeuReg generates domain-agnostic representations of imaging features and incorporates a shifting window-based Swin Transformer block as the encoder. This enables our model to capture the variations across brain imaging modalities and species. We demonstrate a new benchmark in multi-domain publicly available datasets comprising human and mouse 3D brain volumes. Extensive experiments reveal that our model (NeuReg) outperforms the existing baseline deep learning-based image registration models and provides a high-performance boost on cross-domain datasets, where models are trained on ‘source-only’ domain and tested on completely ‘unseen’ target domains. Our work establishes a new state-of-the-art for domain-agnostic 3D brain image registration, underpinned by Neuro-inspired Transformer-based architecture.
[AI-104] Optimal Driver Warning Generation in Dynamic Driving Environment ICRA2024
链接: https://arxiv.org/abs/2411.06306
作者: Chenran Li,Aolin Xu,Enna Sachdeva,Teruhisa Misu,Behzad Dariush
关键词-EN: advanced driver assistance, driver assistance system, warning, driver warning system, driver
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注: ICRA 2024
点击查看摘要
Abstract:The driver warning system that alerts the human driver about potential risks during driving is a key feature of an advanced driver assistance system. Existing driver warning technologies, mainly the forward collision warning and unsafe lane change warning, can reduce the risk of collision caused by human errors. However, the current design methods have several major limitations. Firstly, the warnings are mainly generated in a one-shot manner without modeling the ego driver’s reactions and surrounding objects, which reduces the flexibility and generality of the system over different scenarios. Additionally, the triggering conditions of warning are mostly rule-based threshold-checking given the current state, which lacks the prediction of the potential risk in a sufficiently long future horizon. In this work, we study the problem of optimally generating driver warnings by considering the interactions among the generated warning, the driver behavior, and the states of ego and surrounding vehicles on a long horizon. The warning generation problem is formulated as a partially observed Markov decision process (POMDP). An optimal warning generation framework is proposed as a solution to the proposed POMDP. The simulation experiments demonstrate the superiority of the proposed solution to the existing warning generation methods.
[AI-105] Analyzing the Evolution of Graphs and Texts
链接: https://arxiv.org/abs/2411.06295
作者: Xingzhi Guo
关键词-EN: representation learning algorithms, achieve human-level performance, downstream tasks, natural languages, sentence classification
类目: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
*备注: PhD dissertation
点击查看摘要
Abstract:With the recent advance of representation learning algorithms on graphs (e.g., DeepWalk/GraphSage) and natural languages (e.g., Word2Vec/BERT), the state-of-the-art models can even achieve human-level performance over many downstream tasks, particularly for the task of node and sentence classification. However, most algorithms focus on large-scale models for static graphs and text corpus without considering the inherent dynamic characteristics or discovering the reasons behind the changes. This dissertation aims to efficiently model the dynamics in graphs (such as social networks and citation graphs) and understand the changes in texts (specifically news titles and personal biographies). To achieve this goal, we utilize the renowned Personalized PageRank algorithm to create effective dynamic network embeddings for evolving graphs. Our proposed approaches significantly improve the running time and accuracy for both detecting network abnormal intruders and discovering entity meaning shifts over large-scale dynamic graphs. For text changes, we analyze the post-publication changes in news titles to understand the intents behind the edits and discuss the potential impact of title changes from an information integrity perspective. Moreover, we investigate self-presented occupational identities in Twitter users' biographies over five years, investigating job prestige and demographics effects in how people disclose jobs, quantifying over-represented jobs and their transitions over time.
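The Personalized PageRank algorithm at the heart of the dissertation's embedding approach can be computed by simple power iteration. The sketch below runs it on a toy 4-node graph; the dissertation's dynamic-graph algorithms are considerably more elaborate.

```python
import numpy as np

def personalized_pagerank(A, seed, alpha=0.15, iters=100):
    """Power iteration for PPR: p = alpha * e_seed + (1 - alpha) * W^T p,
    where W is the row-stochastic transition matrix derived from A."""
    n = A.shape[0]
    deg = A.sum(axis=1, keepdims=True)
    W = A / np.maximum(deg, 1)            # row-normalize; degree-0 rows stay zero
    p = np.zeros(n)
    e = np.zeros(n); e[seed] = 1.0
    for _ in range(iters):
        p = alpha * e + (1 - alpha) * W.T @ p
    return p

# Toy 4-node graph: a 0-1-2 path plus a 2-3 edge
A = np.array([[0,1,0,0],[1,0,1,0],[0,1,0,1],[0,0,1,0]], float)
print(personalized_pagerank(A, seed=0))   # probability mass concentrates near the seed
```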
[AI-106] A Comprehensive Survey and Guide to Multimodal Large Language Models in Vision-Language Tasks
链接: https://arxiv.org/abs/2411.06284
作者: Chia Xin Liang,Pu Tian,Caitlyn Heqi Yin,Yao Yua,Wei An-Hou,Li Ming,Tianyang Wang,Ziqian Bi,Ming Liu
关键词-EN: large language models, Generative Models, rapidly developing field, multimodal large language, examining their architectures
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:This survey and application guide to multimodal large language models(MLLMs) explores the rapidly developing field of MLLMs, examining their architectures, applications, and impact on AI and Generative Models. Starting with foundational concepts, we delve into how MLLMs integrate various data types, including text, images, video and audio, to enable complex AI systems for cross-modal understanding and generation. It covers essential topics such as training methods, architectural components, and practical applications in various fields, from visual storytelling to enhanced accessibility. Through detailed case studies and technical analysis, the text examines prominent MLLM implementations while addressing key challenges in scalability, robustness, and cross-modal learning. Concluding with a discussion of ethical considerations, responsible AI development, and future directions, this authoritative resource provides both theoretical frameworks and practical insights. It offers a balanced perspective on the opportunities and challenges in the development and deployment of MLLMs, and is highly valuable for researchers, practitioners, and students interested in the intersection of natural language processing and computer vision.
[AI-107] Multi-View Majority Vote Learning Algorithms: Direct Minimization of PAC-Bayesian Bounds
链接: https://arxiv.org/abs/2411.06276
作者: Mehdi Hennequin,Abdelkrim Zitouni,Khalid Benabdeslem,Haytham Elghazel,Yacine Gaci
关键词-EN: majority voting methods, voting methods, framework has significantly, significantly advanced, advanced our understanding
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:The PAC-Bayesian framework has significantly advanced our understanding of statistical learning, particularly in majority voting methods. However, its application to multi-view learning remains underexplored. In this paper, we extend PAC-Bayesian theory to the multi-view setting, introducing novel PAC-Bayesian bounds based on Rényi divergence. These bounds improve upon traditional Kullback-Leibler divergence and offer more refined complexity measures. We further propose first and second-order oracle PAC-Bayesian bounds, along with an extension of the C-bound for multi-view learning. To ensure practical applicability, we develop efficient optimization algorithms with self-bounding properties.
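For reference, the Rényi divergence used in place of KL in these bounds is the standard definition (for α > 1 and posterior ρ absolutely continuous with respect to prior π), which recovers KL in the limit α → 1:

```latex
D_{\alpha}(\rho \,\|\, \pi) \;=\; \frac{1}{\alpha - 1}\,
\ln \mathbb{E}_{h \sim \pi}\!\left[\left(\frac{\mathrm{d}\rho}{\mathrm{d}\pi}(h)\right)^{\alpha}\right],
\qquad
\lim_{\alpha \to 1} D_{\alpha}(\rho \,\|\, \pi) = \mathrm{KL}(\rho \,\|\, \pi).
```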
[AI-108] AI's Spatial Intelligence: Evaluating AI's Understanding of Spatial Transformations in PSVT:R and Augmented Reality
链接: https://arxiv.org/abs/2411.06269
作者: Uttamasha Monjoree,Wei Yan
关键词-EN: important in Architecture, Spatial, spatial rotation process, rotation process, Revised PSVT
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Spatial intelligence is important in Architecture, Construction, Science, Technology, Engineering, and Mathematics (STEM), and Medicine. Understanding three-dimensional (3D) spatial rotations can involve verbal descriptions and visual or interactive examples, illustrating how objects change orientation in 3D space. Recent studies show Artificial Intelligence (AI) with language and vision capabilities still face limitations in spatial reasoning. In this paper, we have studied generative AI’s spatial capabilities of understanding rotations of objects utilizing its image and language processing features. We examined the spatial intelligence of the GPT-4 model with vision in understanding spatial rotation process with diagrams based on the Revised Purdue Spatial Visualization Test: Visualization of Rotations (Revised PSVT:R). Next, we incorporated a layer of coordinate system axes on Revised PSVT:R to study the variations in GPT-4’s performance. We also examined GPT-4’s understanding of 3D rotations in Augmented Reality (AR) scenes that visualize spatial rotations of an object in 3D space and observed increased accuracy of GPT-4’s understanding of the rotations by adding supplementary textual information depicting the rotation process or mathematical representations of the rotation (e.g., matrices). The results indicate that while GPT-4 as a major current Generative AI model lacks the understanding of a spatial rotation process, it has the potential to understand the rotation process with additional information that can be provided by methods such as AR. By combining the potentials in spatial intelligence of AI with AR’s interactive visualization abilities, we expect to offer enhanced guidance for students’ spatial learning activities. Such spatial guidance can benefit understanding spatial transformations and additionally support processes like assembly, fabrication, and manufacturing.
[AI-109] GuidelineGuard: An Agentic Framework for Medical Note Evaluation with Guideline Adherence
链接: https://arxiv.org/abs/2411.06264
作者: MD Ragib Shahriyear
关键词-EN: Large Language Models, Language Models, artificial intelligence-based applications, Large Language, advancements in Large
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注:
点击查看摘要
Abstract:Although rapid advancements in Large Language Models (LLMs) are facilitating the integration of artificial intelligence-based applications and services in healthcare, limited research has focused on the systematic evaluation of medical notes for guideline adherence. This paper introduces GuidelineGuard, an agentic framework powered by LLMs that autonomously analyzes medical notes, such as hospital discharge and office visit notes, to ensure compliance with established healthcare guidelines. By identifying deviations from recommended practices and providing evidence-based suggestions, GuidelineGuard helps clinicians adhere to the latest standards from organizations like the WHO and CDC. This framework offers a novel approach to improving documentation quality and reducing clinical errors.
[AI-110] Federated Split Learning for Human Activity Recognition with Differential Privacy
链接: https://arxiv.org/abs/2411.06263
作者: Josue Ndeko,Shaba Shaon,Aubrey Beal,Avimanyu Sahoo,Dinh C. Nguyen
关键词-EN: Federated Split Learning, human activity recognition, intelligent human activity, Split Learning, Federated Split
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注: Accepted to IEEE Consumer Communications and Networking Conference (CCNC), 6 pages
点击查看摘要
Abstract:This paper proposes a novel intelligent human activity recognition (HAR) framework based on a new design of Federated Split Learning (FSL) with Differential Privacy (DP) over edge networks. Our FSL-DP framework leverages both accelerometer and gyroscope data, achieving significant improvements in HAR accuracy. The evaluation includes a detailed comparison between traditional Federated Learning (FL) and our FSL framework, showing that the FSL framework outperforms FL models in both accuracy and loss metrics. Additionally, we examine the privacy-performance trade-off under different data settings in the DP mechanism, highlighting the balance between privacy guarantees and model accuracy. The results also indicate that our FSL framework achieves faster communication times per training round compared to traditional FL, further emphasizing its efficiency and effectiveness. This work provides valuable insight and a novel framework which was tested on a real-life dataset.
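As a rough illustration of the DP mechanism the framework relies on, the sketch below clips a client-side update and adds Gaussian noise before it leaves the device; the clipping norm and noise multiplier are illustrative, not the paper's settings, and a real deployment needs a privacy accountant to track (ε, δ).

```python
import numpy as np

def privatize_update(update, clip_norm=1.0, noise_multiplier=1.1, rng=None):
    """Gaussian mechanism: clip the update to clip_norm, then add Gaussian noise.
    Illustrative only: real (epsilon, delta) guarantees require formal accounting."""
    rng = rng if rng is not None else np.random.default_rng(0)
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / max(norm, 1e-12))
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=update.shape)
    return clipped + noise

grad = np.array([0.8, -2.4, 0.3])   # hypothetical client-side model update
print(privatize_update(grad))        # what the client actually transmits
```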
[AI-111] Knowledge Authoring with Factual English Rules and Actions
链接: https://arxiv.org/abs/2411.06253
作者: Yuheng Wang
关键词-EN: KALM, Knowledge, KALM called KALM, Authoring Logic Machine, Knowledge Authoring Logic
类目: Artificial Intelligence (cs.AI)
*备注: PhD thesis
点击查看摘要
Abstract:Knowledge representation and reasoning systems represent knowledge as collections of facts and rules. KRRs can represent complex concepts and relations, and they can query and manipulate information in sophisticated ways. Unfortunately, the KRR technology has been hindered by the fact that specifying the requisite knowledge requires skills that most domain experts do not have, and professional knowledge engineers are hard to find. Some recent CNL-based approaches, such as the Knowledge Authoring Logic Machine (KALM), have been shown to have very high accuracy compared to others, and a natural question is to what extent the CNL restrictions can be lifted. Besides the CNL restrictions, KALM has limitations in terms of the types of knowledge it can represent. To address these issues, we propose an extension of KALM called KALM for Factual Language (KALMF). KALMF uses a neural parser for natural language, MS, to parse what we call factual English sentences, which require little grammar training to use. Building upon KALMF, we propose KALM for Rules and Actions (KALMR), to represent and reason with rules and actions. Furthermore, we identify the reasons behind the slow speed of KALM and make optimizations to address this issue. Our evaluation using multiple benchmarks shows that our approaches achieve a high level of correctness on fact and query authoring (95%) and on rule authoring (100%). When used for authoring and reasoning with actions, our approach achieves more than 99.3% correctness, demonstrating its effectiveness in enabling more sophisticated knowledge representation and reasoning. We also illustrate the logical reasoning capabilities of our approach by drawing attention to the problems faced by the famous AI, ChatGPT. Finally, the evaluation of the newly proposed speed optimization points not only to a 68% runtime improvement but also to better accuracy of the overall system.
[AI-112] Quasi-random Multi-Sample Inference for Large Language Models
链接: https://arxiv.org/abs/2411.06251
作者: Aditya Parashar,Aditya Vikram Singh,Avinash Amballa,Jinlin Lai,Benjamin Rozonoyer
关键词-EN: Large language models, Large language, language models, Large, multi-sample decoding strategies
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Large language models (LLMs) are often equipped with multi-sample decoding strategies. An LLM implicitly defines an arithmetic code book, facilitating efficient and embarrassingly parallelizable arithmetic sampling to produce multiple samples using quasi-random codes. Traditional text generation methods, such as beam search and sampling-based techniques, have notable limitations: they lack parallelizability or diversity of sampled sequences. This study explores the potential of arithmetic sampling, contrasting it with ancestral sampling across two decoding tasks that employ multi-sample inference: chain-of-thought reasoning with self-consistency and machine translation with minimum Bayes risk decoding. Our results demonstrate that arithmetic sampling produces more diverse samples, significantly improving reasoning and translation performance as the sample size increases. We observe a 3-5% point increase in accuracy on the GSM8K dataset and a 0.45-0.89% point increment in COMET score for WMT19 tasks using arithmetic sampling without any significant computational overhead.
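Arithmetic sampling can be pictured as driving the usual inverse-CDF token sampler with structured codes in [0,1) instead of i.i.d. uniforms: each code deterministically decodes to one sequence, and evenly spaced codes yield diverse samples. The toy below assumes fixed per-step distributions (a real LLM would condition on the prefix); it sketches the idea, not the paper's codebook construction.

```python
import numpy as np

def decode_with_code(step_probs, u):
    """Decode one sequence from a code u in [0,1), arithmetic-coding style:
    at each step, pick the token whose CDF segment contains u, then rescale u
    into that segment so later steps consume the remaining bits of the code."""
    tokens = []
    for probs in step_probs:              # probs: next-token distribution per step
        cdf = np.cumsum(probs)
        tok = int(np.searchsorted(cdf, u, side="right"))
        lo = cdf[tok - 1] if tok > 0 else 0.0
        u = (u - lo) / probs[tok]         # renormalize the code within the chosen segment
        tokens.append(tok)
    return tokens

# Hypothetical fixed per-step distributions over a 3-token vocabulary
step_probs = [np.array([0.6, 0.3, 0.1])] * 3
codes = (np.arange(4) + 0.5) / 4          # evenly spaced quasi-random codes
print([decode_with_code(step_probs, u) for u in codes])  # 4 distinct sequences
```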
[AI-113] Multimodal Contrastive Learning of Urban Space Representations from POI Data
链接: https://arxiv.org/abs/2411.06229
作者: Xinglei Wang,Tao Cheng,Stephen Law,Zichao Zeng,Lu Yin,Junyuan Liu
关键词-EN: POI semantic attributes, inadequate spatial information, spatial information modelling, Existing methods, data face
类目: Artificial Intelligence (cs.AI)
*备注: 19 pages, 5 figures, 7 tables
点击查看摘要
Abstract:Existing methods for learning urban space representations from Point-of-Interest (POI) data face several limitations, including issues with geographical delineation, inadequate spatial information modelling, underutilisation of POI semantic attributes, and computational inefficiencies. To address these issues, we propose CaLLiPer (Contrastive Language-Location Pre-training), a novel representation learning model that directly embeds continuous urban spaces into vector representations that can capture the spatial and semantic distribution of urban environment. This model leverages a multimodal contrastive learning objective, aligning location embeddings with textual POI descriptions, thereby bypassing the need for complex training corpus construction and negative sampling. We validate CaLLiPer’s effectiveness by applying it to learning urban space representations in London, UK, where it demonstrates 5-15% improvement in predictive performance for land use classification and socioeconomic mapping tasks compared to state-of-the-art methods. Visualisations of the learned representations further illustrate our model’s advantages in capturing spatial variations in urban semantics with high accuracy and fine resolution. Additionally, CaLLiPer achieves reduced training time, showcasing its efficiency and scalability. This work provides a promising pathway for scalable, semantically rich urban space representation learning that can support the development of geospatial foundation models. The implementation code is available at this https URL.
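The multimodal contrastive objective resembles a CLIP-style symmetric InfoNCE loss that aligns each location embedding with its POI text description. Below is a minimal numpy sketch assuming pre-computed, L2-normalized embeddings; it illustrates the objective, not the released CaLLiPer code.

```python
import numpy as np

def clip_style_loss(loc_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE: matched (location, POI-text) pairs sit on the diagonal
    of the similarity matrix; both softmax directions contribute cross-entropy."""
    logits = loc_emb @ txt_emb.T / temperature     # (N, N) scaled cosine similarities
    n = logits.shape[0]
    log_softmax_rows = logits - np.log(np.exp(logits).sum(1, keepdims=True))
    log_softmax_cols = logits - np.log(np.exp(logits).sum(0, keepdims=True))
    return -(np.trace(log_softmax_rows) + np.trace(log_softmax_cols)) / (2 * n)

rng = np.random.default_rng(0)
def l2norm(x): return x / np.linalg.norm(x, axis=1, keepdims=True)
loc = l2norm(rng.normal(size=(8, 32)))             # hypothetical location embeddings
txt = l2norm(loc + 0.1 * rng.normal(size=(8, 32))) # matched POI-text embeddings
print(clip_style_loss(loc, txt))                    # small loss for well-aligned pairs
```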
[AI-114] Smart-LLaMA: Two-Stage Post-Training of Large Language Models for Smart Contract Vulnerability Detection and Explanation
链接: https://arxiv.org/abs/2411.06221
作者: Lei Yu,Shiqi Chen,Hang Yuan,Peng Wang,Zhirong Huang,Jingyuan Zhang,Chenjie Shen,Fengjun Zhang,Li Yang,Jiajia Ma
关键词-EN: smart contract, smart contract security, smart, blockchain technology, critical challenge
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
*备注:
点击查看摘要
Abstract:With the rapid development of blockchain technology, smart contract security has become a critical challenge. Existing smart contract vulnerability detection methods face three main issues: (1) Insufficient quality of datasets, lacking detailed explanations and precise vulnerability locations. (2) Limited adaptability of large language models (LLMs) to the smart contract domain, as most LLMs are pre-trained on general text data but minimal smart contract-specific data. (3) Lack of high-quality explanations for detected vulnerabilities, as existing methods focus solely on detection without clear explanations. These limitations hinder detection performance and make it harder for developers to understand and fix vulnerabilities quickly, potentially leading to severe financial losses. To address these problems, we propose Smart-LLaMA, an advanced detection method based on the LLaMA language model. First, we construct a comprehensive dataset covering four vulnerability types with labels, detailed explanations, and precise vulnerability locations. Second, we introduce Smart Contract-Specific Continual Pre-Training, using raw smart contract data to enable the LLM to learn smart contract syntax and semantics, enhancing their domain adaptability. Furthermore, we propose Explanation-Guided Fine-Tuning, which fine-tunes the LLM using paired vulnerable code and explanations, enabling both vulnerability detection and reasoned explanations. We evaluate explanation quality through LLM and human evaluation, focusing on Correctness, Completeness, and Conciseness. Experimental results show that Smart-LLaMA outperforms state-of-the-art baselines, with average improvements of 6.49% in F1 score and 3.78% in accuracy, while providing reliable explanations.
[AI-115] Multistage non-deterministic classification using secondary concept graphs and graph convolutional networks for high-level feature extraction
链接: https://arxiv.org/abs/2411.06212
作者: Masoud Kargar,Nasim Jelodari,Alireza Assadzadeh
关键词-EN: visually depict relationships, visually depict, relationships and structures, Graph Convolutional Networks, depict relationships
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 13 Pages, 15 figures, and 4 Tables
点击查看摘要
Abstract:Graphs, comprising nodes and edges, visually depict relationships and structures, posing challenges in extracting high-level features due to their intricate connections. Multiple connections introduce complexities in discovering patterns, where node weights may affect some features more than others. In domains with diverse topics, graph representations illustrate interrelations among features. Pattern discovery within graphs is recognized as NP-hard. Graph Convolutional Networks (GCNs) are a prominent deep learning approach for acquiring meaningful representations by leveraging node connectivity and characteristics. Despite achievements, predicting and assigning 9 deterministic classes often involves errors. To address this challenge, we present a multi-stage non-deterministic classification method based on a secondary conceptual graph and graph convolutional networks, which includes distinct steps: 1) leveraging GCN for the extraction and generation of 12 high-level features; 2) employing incomplete, non-deterministic models for feature extraction, conducted before reaching a definitive prediction; and 3) formulating definitive forecasts grounded in conceptual (logical) graphs. The empirical findings indicate that our proposed approach outperforms contemporary methods in classification tasks. Across three datasets (Cora, Citeseer, and PubMed), the achieved accuracies are 96%, 93%, and 95%, respectively. Code is available at this https URL.
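The GCN feature extraction in step 1 follows the standard propagation rule with self-loops and symmetric normalization. A minimal numpy version is sketched below, with a random weight matrix standing in for learned parameters and an output width of 12 to echo the high-level features mentioned above.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN propagation step: H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W)."""
    A_hat = A + np.eye(A.shape[0])                  # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ H @ W, 0.0)          # ReLU activation

rng = np.random.default_rng(0)
A = np.array([[0,1,1,0],[1,0,1,0],[1,1,0,1],[0,0,1,0]], float)  # toy graph
H = rng.normal(size=(4, 5))                         # initial node features
W = rng.normal(size=(5, 12))                        # project to 12 high-level features
print(gcn_layer(A, H, W).shape)                     # (4, 12)
```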
[AI-116] Artificial Intelligence for Collective Intelligence: A National-Scale Research Strategy
链接: https://arxiv.org/abs/2411.06211
作者: Seth Bullock(1),Nirav Ajmeri(1),Mike Batty(2),Michaela Black(3),John Cartlidge(1),Robert Challen(1),Cangxiong Chen(4),Jing Chen(5),Joan Condell(3),Leon Danon(1),Adam Dennett(2),Alison Heppenstall(6),Paul Marshall(1),Phil Morgan(5),Aisling O’Kane(1),Laura G. E. Smith(4),Theresa Smith(4),Hywel T. P. Williams(7) ((1) University of Bristol, (2) University College London, (3) Ulster University, (4) University of Bath, (5) Cardiff University, (6) University of Glasgow, (7) University of Exeter)
关键词-EN: Advances in artificial, trans-national scale, national-scale collective intelligence, great potential, nature and present
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注: 25 pages, 3 figures, Accepted for publication at Knowledge Engineering Review (KER)
点击查看摘要
Abstract:Advances in artificial intelligence (AI) have great potential to help address societal challenges that are both collective in nature and present at national or trans-national scale. Pressing challenges in healthcare, finance, infrastructure and sustainability, for instance, might all be productively addressed by leveraging and amplifying AI for national-scale collective intelligence. The development and deployment of this kind of AI faces distinctive challenges, both technical and socio-technical. Here, a research strategy for mobilising inter-disciplinary research to address these challenges is detailed and some of the key issues that must be faced are outlined.
[AI-117] OpenAI-o1 AB Testing: Does the o1 model really do good reasoning in math problem solving?
链接: https://arxiv.org/abs/2411.06198
作者: Leo Li,Ye Luo,Tingyou Pan
关键词-EN: robust logical reasoning, logical reasoning capabilities, previous large language, large language models, International Mathematics Olympiad
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:The Orion-1 model by OpenAI is claimed to have more robust logical reasoning capabilities than previous large language models. However, some suggest the excellence might be partially due to the model "memorizing" solutions, resulting in less satisfactory performance when prompted with problems not in the training data. We conduct a comparison experiment using two datasets: one consisting of International Mathematics Olympiad (IMO) problems, which is easily accessible; the other one consisting of Chinese National Team Training camp (CNT) problems, which have similar difficulty but are not as publicly accessible. We label the response for each problem and compare the performance between the two datasets. We conclude that there is no significant evidence to show that the model relies on memorizing problems and solutions. Also, we perform case studies to analyze some features of the model's response.
[AI-118] Generalizing Hyperedge Expansion for Hyper-relational Knowledge Graph Modeling
链接: https://arxiv.org/abs/2411.06191
作者: Yu Liu,Shu Yang,Jingtao Ding,Quanming Yao,Yong Li
关键词-EN: triple-based knowledge graph, hyper-relational knowledge graph, generalizes triple-based knowledge, additional attribute-value qualifiers, knowledge graph
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:By representing knowledge in a primary triple associated with additional attribute-value qualifiers, hyper-relational knowledge graph (HKG) that generalizes triple-based knowledge graph (KG) has been attracting research attention recently. Compared with KG, HKG is enriched with the semantic qualifiers as well as the hyper-relational graph structure. However, to model HKG, existing studies mainly focus on either semantic information or structural information therein, which however fail to capture both simultaneously. To tackle this issue, in this paper, we generalize the hyperedge expansion in hypergraph learning and propose an equivalent transformation for HKG modeling, referred to as TransEQ. Specifically, the equivalent transformation transforms a HKG to a KG, which considers both semantic and structural characteristics. Then an encoder-decoder framework is developed to bridge the modeling research between KG and HKG. In the encoder part, KG-based graph neural networks are leveraged for structural modeling; while in the decoder part, various HKG-based scoring functions are exploited for semantic modeling. Especially, we design the sharing embedding mechanism in the encoder-decoder framework with semantic relatedness captured. We further theoretically prove that TransEQ preserves complete information in the equivalent transformation, and also achieves full expressivity. Finally, extensive experiments on three benchmarks demonstrate the superior performance of TransEQ in terms of both effectiveness and efficiency. On the largest benchmark WikiPeople, TransEQ significantly improves the state-of-the-art models by 15% on MRR.
[AI-119] Deep Reinforcement Learning for Digital Twin-Oriented Complex Networked Systems
链接: https://arxiv.org/abs/2411.06148
作者: Jiaqi Wen,Bogdan Gabrys,Katarzyna Musial
关键词-EN: Complex Networked System, Oriented Complex Networked, Digital Twin Oriented, Twin Oriented Complex, Networked System
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:The Digital Twin Oriented Complex Networked System (DT-CNS) aims to build and extend a Complex Networked System (CNS) model with progressively increasing dynamics complexity towards an accurate reflection of reality – a Digital Twin of reality. Our previous work proposed evolutionary DT-CNSs to model the long-term adaptive network changes in an epidemic outbreak. This study extends this framework by proposing the temporal DT-CNS model, where reinforcement learning-driven nodes make decisions on temporal directed interactions in an epidemic outbreak. We consider cooperative nodes, as well as egocentric and ignorant "free-riders" in the cooperation. We describe this epidemic spreading process with the Susceptible-Infected-Recovered (SIR) model and investigate the impact of epidemic severity on the epidemic resilience for different types of nodes. Our experimental results show that (i) full cooperation leads to a higher reward and lower infection number than a cooperation with egocentric or ignorant "free-riders"; (ii) an increasing number of "free-riders" in a cooperation leads to a smaller reward, while an increasing number of egocentric "free-riders" further escalates the infection numbers and (iii) higher infection rates and a slower recovery weakens networks' resilience to severe epidemic outbreaks. These findings also indicate that promoting cooperation and reducing "free-riders" can improve public health during epidemics.
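The SIR dynamics underlying the epidemic process can be illustrated with a short discrete-time simulation; the infection and recovery rates below are illustrative values, not the paper's calibrated parameters.

```python
import numpy as np

def simulate_sir(beta, gamma, s0=0.99, i0=0.01, steps=200):
    """Discrete-time SIR on population fractions:
    dS = -beta*S*I, dI = beta*S*I - gamma*I, dR = gamma*I."""
    s, i, r = s0, i0, 0.0
    history = []
    for _ in range(steps):
        new_inf = beta * s * i
        new_rec = gamma * i
        s, i, r = s - new_inf, i + new_inf - new_rec, r + new_rec
        history.append((s, i, r))
    return np.array(history)

traj = simulate_sir(beta=0.3, gamma=0.1)        # R0 = 3: a severe outbreak
print("peak infected fraction:", traj[:, 1].max())
```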
[AI-120] AI-Compass: A Comprehensive and Effective Multi-module Testing Tool for AI Systems
链接: https://arxiv.org/abs/2411.06146
作者: Zhiyu Zhu,Zhibo Jin,Hongsheng Hu,Minhui Xue,Ruoxi Sun,Seyit Camtepe,Praveen Gauravaram,Huaming Chen
关键词-EN: demonstrated superior performance, demonstrated superior, superior performance, deep learning techniques, testing
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:AI systems, in particular with deep learning techniques, have demonstrated superior performance for various real-world applications. Given the need for tailored optimization in specific scenarios, as well as the concerns related to the exploits of subsurface vulnerabilities, a more comprehensive and in-depth testing of AI systems becomes a pivotal topic. We have seen the emergence of testing tools in real-world applications that aim to expand testing capabilities. However, they often concentrate on ad-hoc tasks, rendering them unsuitable for simultaneously testing multiple aspects or components. Furthermore, trustworthiness issues arising from adversarial attacks and the challenge of interpreting deep learning models pose new challenges for developing more comprehensive and in-depth AI system testing tools. In this study, we design and implement a testing tool, AI-Compass, to comprehensively and effectively evaluate AI systems. The tool extensively assesses multiple measurements towards adversarial robustness, model interpretability, and performs neuron analysis. The feasibility of the proposed testing tool is thoroughly validated across various modalities, including image classification, object detection, and text classification. Extensive experiments demonstrate that AI-Compass is the state-of-the-art tool for a comprehensive assessment of the robustness and trustworthiness of AI systems. Our research sheds light on a general solution for the AI systems testing landscape.
[AI-121] Aquila-plus: Prompt-Driven Visual-Language Models for Pixel-Level Remote Sensing Image Understanding
链接: https://arxiv.org/abs/2411.06142
作者: Kaixuan Lu
关键词-EN: remote sensing image, sensing image understanding, demonstrating their powerful, led to significant, significant advances
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:The recent development of vision language models (VLMs) has led to significant advances in visual-language integration through visual instruction tuning, and they have rapidly evolved in the field of remote sensing image understanding, demonstrating their powerful capabilities. However, existing RSVLMs mainly focus on image-level or frame-level understanding, making it difficult to achieve fine-grained pixel-level visual-language alignment. Additionally, the lack of mask-based instructional data limits their further development. In this paper, we propose a mask-text instruction tuning method called Aquila-plus, which extends the capabilities of RSVLMs to achieve pixel-level visual understanding by incorporating fine-grained mask regions into language instructions. To achieve this, we first meticulously constructed a mask region-text dataset containing 100K samples, and then designed a visual-language model by injecting pixel-level representations into a large language model (LLM). Specifically, Aquila-plus uses a convolutional CLIP as the visual encoder and employs a mask-aware visual extractor to extract precise visual mask features from high-resolution inputs. Experimental results demonstrate that Aquila-plus outperforms existing methods in various region understanding tasks, showcasing its novel capabilities in pixel-level instruction tuning.
[AI-122] Online Parallel Multi-Task Relationship Learning via Alternating Direction Method of Multipliers
链接: https://arxiv.org/abs/2411.06135
作者: Ruiyu Li,Peilin Zhao,Guangxia Li,Zhiqiang Xu,Xuewei Li
关键词-EN: Online multi-task learning, enhances streaming data, streaming data processing, multiple tasks, multi-task learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: Accpeted by Neurocomputing
点击查看摘要
Abstract:Online multi-task learning (OMTL) enhances streaming data processing by leveraging the inherent relations among multiple tasks. It can be described as an optimization problem in which a single loss function is defined for multiple tasks. Existing gradient-descent-based methods for this problem might suffer from gradient vanishing and poor conditioning issues. Furthermore, the centralized setting hinders their application to online parallel optimization, which is vital to big data analytics. Therefore, this study proposes a novel OMTL framework based on the alternating direction method of multipliers (ADMM), a recent breakthrough in optimization suitable for the distributed computing environment because of its decomposable and easy-to-implement nature. The relations among multiple tasks are modeled dynamically to fit the constant changes in an online scenario. In a classical distributed computing architecture with a central server, the proposed OMTL algorithm with the ADMM optimizer outperforms SGD-based approaches in terms of accuracy and efficiency. Because the central server might become a bottleneck when the data scale grows, we further tailor the algorithm to a decentralized setting, so that each node can work by only exchanging information with local neighbors. Experimental results on a synthetic and several real-world datasets demonstrate the efficiency of our methods.
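At its core, ADMM alternates parallelizable local updates, a consensus step, and a dual ascent step, which is what makes it attractive for distributed OMTL. The sketch below solves a toy consensus problem with quadratic per-task losses; it illustrates the decomposable structure, not the paper's full algorithm.

```python
import numpy as np

# Consensus ADMM for min_x sum_i 0.5*(x - a_i)^2, rewritten with local copies
# x_i and a shared variable z under the constraint x_i = z. The closed-form
# x-update below is specific to these quadratic losses; real tasks would solve
# a local subproblem instead.
a = np.array([1.0, 3.0, 8.0])      # per-task targets (toy "tasks")
rho = 1.0                           # ADMM penalty parameter
x = np.zeros_like(a)                # local variables, one per task
z = 0.0                             # consensus variable
u = np.zeros_like(a)                # scaled dual variables

for _ in range(50):
    x = (a + rho * (z - u)) / (1 + rho)   # local updates (parallelizable per task)
    z = np.mean(x + u)                     # consensus averaging (server step)
    u = u + x - z                          # dual ascent on the constraint x_i = z

print(z, np.mean(a))                # z converges to the average of the targets
```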
[AI-123] Research on reinforcement learning based warehouse robot navigation algorithm in complex warehouse layout
链接: https://arxiv.org/abs/2411.06128
作者: Keqin Li,Lipeng Liu,Jiajing Chen,Dezhi Yu,Xiaofan Zhou,Ming Li,Congyu Wang,Zhao Li
关键词-EN: Proximal Policy Optimization, make real-time decision, optimal path, global optimal path, Dijkstra algorithm
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:In this paper, efficiently finding the optimal path in a complex warehouse layout and making real-time decisions is the key problem. This paper proposes a new method combining Proximal Policy Optimization (PPO) and Dijkstra's algorithm, Proximal Policy-Dijkstra (PP-D). The PP-D method realizes efficient strategy learning and real-time decision making through PPO, and uses Dijkstra's algorithm to plan the global optimal path, thus ensuring high navigation accuracy and significantly improving the efficiency of path planning. Specifically, PPO enables robots to quickly adapt and optimize action strategies in dynamic environments through its stable policy-updating mechanism. Dijkstra's algorithm ensures globally optimal path planning in a static environment. Finally, through comparison experiments and analysis of the proposed framework against the traditional algorithm, the results show that the PP-D method has significant advantages in improving the accuracy of navigation prediction and enhancing the robustness of the system. Especially in complex warehouse layouts, the PP-D method can find the optimal path more accurately and reduce collision and stagnation. This proves the reliability and effectiveness of the robot navigation algorithm in complex warehouse layouts.
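The Dijkstra half of PP-D is standard shortest-path search; a compact heap-based version on a hypothetical warehouse corridor graph is sketched below (the PPO policy-learning half is omitted).

```python
import heapq

def dijkstra(graph, start, goal):
    """Shortest path on a weighted graph given as {node: [(neighbor, cost), ...]}."""
    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == goal:
            break
        if d > dist.get(node, float("inf")):
            continue                          # stale heap entry, skip
        for nbr, cost in graph.get(node, []):
            nd = d + cost
            if nd < dist.get(nbr, float("inf")):
                dist[nbr], prev[nbr] = nd, node
                heapq.heappush(heap, (nd, nbr))
    path, node = [], goal
    while node != start:                      # walk back through predecessors
        path.append(node)
        node = prev[node]
    return [start] + path[::-1], dist[goal]

# Hypothetical 4-node warehouse corridor graph with aisle traversal costs
warehouse = {"A": [("B", 1), ("C", 4)], "B": [("C", 1), ("D", 5)], "C": [("D", 1)]}
print(dijkstra(warehouse, "A", "D"))          # (['A', 'B', 'C', 'D'], 3.0)
```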
[AI-124] Characteristics of Political Misinformation Over the Past Decade
链接: https://arxiv.org/abs/2411.06122
作者: Erik J Schlicht
关键词-EN: real-world consequences, misinformation, spread online, information, Facebook and Instagram
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Although misinformation tends to spread online, it can have serious real-world consequences. In order to develop automated tools to detect and mitigate the impact of misinformation, researchers must leverage algorithms that can adapt to the modality (text, images and video), the source, and the content of the false information. However, these characteristics tend to change dynamically across time, making it challenging to develop robust algorithms to fight misinformation spread. Therefore, this paper uses natural language processing to find common characteristics of political misinformation over a twelve-year period. The results show that misinformation has increased dramatically in recent years and that it has increasingly started to be shared from sources with primary information modalities of text and images (e.g., Facebook and Instagram), although video sharing sources containing misinformation are starting to increase (e.g., TikTok). Moreover, it was discovered that statements expressing misinformation contain more negative sentiment than accurate information. However, the sentiment associated with both accurate and inaccurate information has trended downward, indicating a generally more negative tone in political statements across time. Finally, recurring misinformation categories were uncovered that occur over multiple years, which may imply that people tend to share inaccurate statements around information they fear or don't understand (Science and Medicine, Crime, Religion), information that impacts them directly (Policy, Election Integrity, Economic), or Public Figures who are salient in their daily lives. Together, it is hoped that these insights will assist researchers in developing algorithms that are temporally invariant and capable of detecting and mitigating misinformation across time.
[AI-125] Evaluating the Propensity of Generative AI for Producing Disinformation During an Election Cycle
链接: https://arxiv.org/abs/2411.06120
作者: Erik J Schlicht
关键词-EN: Russian Internet Research, Internet Research Agency, Chinese Spamouflage operation, Generative Artificial Intelligence, Research Agency effort
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Generative Artificial Intelligence offers a powerful tool for adversaries who wish to engage in influence operations, such as the Chinese Spamouflage operation and the Russian Internet Research Agency effort that both sought to interfere with recent US election cycles. Therefore, this study seeks to investigate the propensity of current Generative AI models for producing harmful disinformation during an election cycle. The probability that different Generative AI models produced disinformation when given adversarial prompts was evaluated, in addition to the associated harm. This allows the expected harm for each model to be computed, and it was discovered that Copilot and Gemini tied for the overall safest performance by realizing the lowest expected harm, while GPT-4o produced the greatest rates of harmful disinformation, resulting in much higher expected harm scores. The impact of disinformation category was also investigated, and Gemini was safest within the political category of disinformation, while Copilot was safest for topics related to health. Moreover, characteristics of adversarial roles were discovered that led to greater expected harm across all models. Finally, classification models were developed that predicted disinformation production based on the conditions considered in this study, which offers insight into factors important for predicting disinformation production. Based on all of these insights, recommendations are provided that seek to mitigate factors that lead to harmful disinformation being produced by Generative AI models. It is hoped that developers will use these insights to improve future models.
[AI-126] Energy-efficient Hybrid Model Predictive Trajectory Planning for Autonomous Electric Vehicles
链接: https://arxiv.org/abs/2411.06111
作者: Fan Ding,Xuewen Luo,Gaoxuan Li,Hwa Hui Tew,Junn Yong Loo,Chor Wai Tong,A.S.M Bakibillah,Ziyuan Zhao,Zhiyu Tao
关键词-EN: Energy-efficient Hybrid Model, Hybrid Model Predictive, Model Predictive Planner, energy-saving optimization strategy, limited battery life
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: Accepted at the IEEE International Conference on Systems, Man, and Cybernetics (SMC) 2024
点击查看摘要
Abstract:To tackle the twin challenges of limited battery life and lengthy charging durations in electric vehicles (EVs), this paper introduces an Energy-efficient Hybrid Model Predictive Planner (EHMPP), which employs an energy-saving optimization strategy. EHMPP focuses on refining the design of the motion planner to be seamlessly integrated with the existing automatic driving algorithms, without additional hardware. It has been validated through simulation experiments on the Prescan, CarSim, and Matlab platforms, demonstrating that it can increase passive recovery energy by 11.74% and effectively track motor speed and acceleration at optimal power. To sum up, EHMPP not only aids in trajectory planning but also significantly boosts energy efficiency in autonomous EVs.
[AI-127] Personalize to generalize: Towards a universal medical multi-modality generalization through personalization
链接: https://arxiv.org/abs/2411.06106
作者: Zhaorui Tan,Xi Yang,Tan Pan,Tianyi Liu,Chen Jiang,Xin Guo,Qiufeng Wang,Anh Nguyen,Yuan Qi,Kaizhu Huang,Yuan Cheng
关键词-EN: unique clinical characteristics, groundbreaking healthcare framework, tailoring medical treatments, clinical characteristics, groundbreaking healthcare
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Personalized medicine is a groundbreaking healthcare framework for the 21st century, tailoring medical treatments to individuals based on unique clinical characteristics, including diverse medical imaging modalities. Given the significant differences among these modalities due to distinct underlying imaging principles, generalization in multi-modal medical image tasks becomes substantially challenging. Previous methods addressing multi-modal generalization rarely consider personalization, primarily focusing on common anatomical information. This paper aims to bridge multi-modal generalization with the concept of personalized medicine. Specifically, we propose a novel approach to derive a tractable form of the underlying personalized invariant representation $\mathbb{X}_h$ by leveraging individual-level constraints and a learnable biological prior. We demonstrate the feasibility and benefits of learning a personalized $\mathbb{X}_h$, showing that this representation is highly generalizable and transferable across various multi-modal medical tasks. Our method is rigorously validated on medical imaging modalities emphasizing both physical structure and functional information, encompassing a range of tasks that require generalization. Extensive experimental results consistently show that our approach significantly improves performance across diverse scenarios, confirming its effectiveness.
[AI-128] LT-DARTS: An Architectural Approach to Enhance Deep Long-Tailed Learning
链接: https://arxiv.org/abs/2411.06098
作者: Yuhan Pan,Yanan Sun,Wei Gong
关键词-EN: Deep long-tailed recognition, imbalanced data distributions, Differential Architecture Search, Deep long-tailed, widely studied
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Deep long-tailed recognition has been widely studied to address the issue of imbalanced data distributions in real-world scenarios. However, there has been insufficient focus on the design of neural architectures, despite empirical evidence suggesting that architecture can significantly impact performance. In this paper, we attempt to mitigate long-tailed issues through architectural improvements. To simplify the design process, we utilize Differential Architecture Search (DARTS) to achieve this goal. Unfortunately, existing DARTS methods struggle to perform well in long-tailed scenarios. To tackle this challenge, we introduce Long-Tailed Differential Architecture Search (LT-DARTS). Specifically, we conduct extensive experiments to explore architectural components that demonstrate better performance on long-tailed data and propose a new search space based on our observations. This ensures that the architecture obtained through our search process incorporates superior components. Additionally, we propose replacing the learnable linear classifier with an Equiangular Tight Frame (ETF) classifier to further enhance our method. This classifier effectively alleviates the biased search process and prevents performance collapse. Extensive experimental evaluations demonstrate that our approach consistently improves upon existing methods from an orthogonal perspective and achieves state-of-the-art results with simple enhancements.
[AI-129] A Multimodal Adaptive Graph-based Intelligent Classification Model for Fake News
链接: https://arxiv.org/abs/2411.06097
作者: Junhao (Leo) Xu
关键词-EN: Numerous studies, deep learning, Graph-based Intelligent Classification, proposed to detect, multi-modalities based
类目: Artificial Intelligence (cs.AI)
*备注: 8 pages
点击查看摘要
Abstract:Numerous studies have been proposed to detect fake news focusing on multi-modalities based on machine and/or deep learning. However, studies focusing on graph-based structures using geometric deep learning are lacking. To address this challenge, we introduce the Multimodal Adaptive Graph-based Intelligent Classification (aptly referred to as MAGIC) for fake news detection. Specifically, Bidirectional Encoder Representations from Transformers (BERT) was used for text vectorization whilst ResNet50 was used for images. A comprehensive information interaction graph was built using the adaptive Graph Attention Network before classifying the multimodal input through the Softmax function. MAGIC was trained and tested on two fake news datasets, that is, Fakeddit (English) and Multimodal Fake News Detection (Chinese), with the model achieving an accuracy of 98.8% and 86.3%, respectively. Ablation experiments also revealed MAGIC to yield superior performance across both the datasets. Findings show that a graph-based deep learning adaptive model is effective in detecting multimodal fake news, surpassing state-of-the-art methods.
[AI-130] Cross-Domain Transfer Learning using Attention Latent Features for Multi-Agent Trajectory Prediction
链接: https://arxiv.org/abs/2411.06087
作者: Jia Quan Loh,Xuewen Luo,Fan Ding,Hwa Hui Tew,Junn Yong Loo,Ze Yang Ding,Susilawati Susilawati,Chee Pin Tan
关键词-EN: intelligent transportation systems, deep learning architectures, sensor hardware, transportation systems, advancements of sensor
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: Accepted at the IEEE International Conference on Systems, Man, and Cybernetics (SMC) 2024
点击查看摘要
Abstract:With the advancements of sensor hardware, traffic infrastructure and deep learning architectures, trajectory prediction of vehicles has established a solid foundation in intelligent transportation systems. However, existing solutions are often tailored to specific traffic networks at particular time periods. Consequently, deep learning models trained on one network may struggle to generalize effectively to unseen networks. To address this, we proposed a novel spatial-temporal trajectory prediction framework that performs cross-domain adaption on the attention representation of a Transformer-based model. A graph convolutional network is also integrated to construct dynamic graph feature embeddings that accurately model the complex spatial-temporal interactions between the multi-agent vehicles across multiple traffic domains. The proposed framework is validated on two case studies involving the cross-city and cross-period settings. Experimental results show that our proposed framework achieves superior trajectory prediction and domain adaptation performances over the state-of-the-art models.
[AI-131] Aquila: A Hierarchically Aligned Visual-Language Model for Enhanced Remote Sensing Image Comprehension
链接: https://arxiv.org/abs/2411.06074
作者: Kaixuan Lu,Ruiqian Zhang,Xiao Huang,Yuxing Xie
关键词-EN: showing great promise, made significant strides, remote sensing, visual instruction tuning, remote sensing vision
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Recently, large vision language models (VLMs) have made significant strides in visual language capabilities through visual instruction tuning, showing great promise in the field of remote sensing image interpretation. However, existing remote sensing vision language models (RSVLMs) often fall short in capturing the complex characteristics of remote sensing scenes, as they typically rely on low resolution, single scale visual features and simplistic methods to map visual features to language features. In this paper, we present Aquila, an advanced visual language foundation model designed to enable richer visual feature representation and more precise visual-language feature alignment for remote sensing images. Our approach introduces a learnable Hierarchical Spatial Feature Integration (SFI) module that supports high resolution image inputs and aggregates multi scale visual features, allowing for the detailed representation of complex visual information. Additionally, the SFI module is repeatedly integrated into the layers of the large language model (LLM) to achieve deep visual language feature alignment, without compromising the model’s performance in natural language processing tasks. These innovations, capturing detailed visual effects through higher resolution and multi scale input, and enhancing feature alignment significantly improve the model’s ability to learn from image text data. We validate the effectiveness of Aquila through extensive quantitative experiments and qualitative analyses, demonstrating its superior performance.
[AI-132] GFT: Graph Foundation Model with Transferable Tree Vocabulary NEURIPS2024
链接: https://arxiv.org/abs/2411.06070
作者: Zehong Wang,Zheyuan Zhang,Nitesh V Chawla,Chuxu Zhang,Yanfang Ye
关键词-EN: social network analysis, graph foundation model, Graph Foundation, broader applications, drug discovery
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
*备注: Accepted by NeurIPS 2024
点击查看摘要
Abstract:Inspired by the success of foundation models in applications such as ChatGPT, as graph data has been ubiquitous, one can envision the far-reaching impacts that can be brought by Graph Foundation Models (GFMs) with broader applications in the areas such as scientific research, social network analysis, drug discovery, and e-commerce. Despite the significant progress of pre-trained graph neural networks, there haven’t been GFMs that can achieve desired performance on various graph-learning-related tasks. Building GFMs may rely on a vocabulary that encodes transferable patterns shared among different tasks and domains. Unlike image and text, defining such transferable patterns for graphs remains an open question. In this paper, we aim to bridge this gap by rethinking the transferable patterns on graphs as computation trees – i.e., tree structures derived from the message-passing process. Based on this insight, we propose a cross-task, cross-domain graph foundation model named GFT, short for Graph Foundation model with transferable Tree vocabulary. By treating computation trees as tokens within the transferable vocabulary, GFT improves model generalization and reduces the risk of negative transfer. The theoretical analyses and extensive experimental studies have demonstrated the transferability of computation trees and shown the effectiveness of GFT across diverse tasks and domains in graph learning. The open source code and data are available at this https URL.
[AI-133] Diversity and Inclusion in AI for Recruitment: Lessons from Industry Workshop
链接: https://arxiv.org/abs/2411.06066
作者: Muneera Bano,Didar Zowghi,Fernando Mourao,Sarah Kaur,Tao Zhang
关键词-EN: Artificial Intelligence, potential to significantly, significantly enhance, enhance the efficiency, efficiency and effectiveness
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Artificial Intelligence (AI) systems for online recruitment markets have the potential to significantly enhance the efficiency and effectiveness of job placements and even promote fairness or inclusive hiring practices. Neglecting Diversity and Inclusion (DI) in these systems, however, can perpetuate biases, leading to unfair hiring practices and decreased workplace diversity, while exposing organisations to legal and reputational risks. Despite the acknowledged importance of DI in AI, there is a gap in research on effectively implementing DI guidelines in real-world recruitment systems. Challenges include a lack of awareness and framework for operationalising DI in a cost-effective, context-sensitive manner. This study aims to investigate the practical application of DI guidelines in AI-driven online job-seeking systems, specifically exploring how these principles can be operationalised to create more inclusive recruitment processes. We conducted a co-design workshop with a large multinational recruitment company focusing on two AI-driven recruitment use cases. User stories and personas were applied to evaluate the impacts of AI on diverse stakeholders. Follow-up interviews were conducted to assess the workshop’s long-term effects on participants’ awareness and application of DI principles. The co-design workshop successfully increased participants’ understanding of DI in AI. However, translating awareness into operational practice posed challenges, particularly in balancing DI with business goals. The results suggest developing tailored DI guidelines and ongoing support to ensure the effective adoption of inclusive AI practices.
[AI-134] Wild Narratives: Exploring the Effects of Animal Chatbots on Empathy and Positive Attitudes toward Animals
链接: https://arxiv.org/abs/2411.06060
作者: Jingshu Li,Aaditya Patwari,Yi-Chieh Lee
关键词-EN: animal abuse cases, abuse cases, cases are reported, Rises, users’ perceptions
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Rises in the number of animal abuse cases are reported around the world. While chatbots have been effective in influencing their users’ perceptions and behaviors, little if any research has hitherto explored the design of chatbots that embody animal identities for the purpose of eliciting empathy toward animals. We therefore conducted a mixed-methods experiment to investigate how specific design cues in such chatbots can shape their users’ perceptions of both the chatbots’ identities and the type of animal they represent. Our findings indicate that such chatbots can significantly increase empathy, improve attitudes, and promote prosocial behavioral intentions toward animals, particularly when they incorporate emotional verbal expressions and authentic details of such animals’ lives. These results expand our understanding of chatbots with non-human identities and highlight their potential for use in conservation initiatives, suggesting a promising avenue whereby technology could foster a more informed and empathetic society.
[AI-135] An Empirical Analysis on Spatial Reasoning Capabilities of Large Multimodal Models
链接: https://arxiv.org/abs/2411.06048
作者: Fatemeh Shiri,Xiao-Yu Guo,Mona Golestan Far,Xin Yu,Gholamreza Haffari,Yuan-Fang Li
关键词-EN: Large Multimodal Models, Large Multimodal, Multimodal Models, achieved strong performance, language tasks
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Large Multimodal Models (LMMs) have achieved strong performance across a range of vision and language tasks. However, their spatial reasoning capabilities are under-investigated. In this paper, we construct a novel VQA dataset, Spatial-MM, to comprehensively study LMMs' spatial understanding and reasoning capabilities. Our analyses on object-relationship and multi-hop reasoning reveal several important findings. Firstly, bounding boxes and scene graphs, even synthetic ones, can significantly enhance LMMs' spatial reasoning. Secondly, LMMs struggle more with questions posed from the human perspective than the camera perspective about the image. Thirdly, chain of thought (CoT) prompting does not improve model performance on complex multi-hop questions involving spatial relations. Moreover, spatial reasoning steps are much less accurate than non-spatial ones across MLLMs. Lastly, our perturbation analysis on GQA-spatial reveals that LMMs are much stronger at basic object detection than complex spatial reasoning. We believe our benchmark dataset and in-depth analyses can spark further research on LMMs' spatial reasoning. Spatial-MM benchmark is available at: this https URL
[AI-136] Personalized News Recommendation System via LLM Embedding and Co-Occurrence Patterns
链接: https://arxiv.org/abs/2411.06046
作者: Zheng Li,Kai Zhange
关键词-EN: demonstrated remarkable emerging, achieved rapid development, remarkable emerging capabilities, large language models, past two years
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:In the past two years, large language models (LLMs) have achieved rapid development and demonstrated remarkable emerging capabilities. Concurrently, with powerful semantic understanding and reasoning capabilities, LLMs have significantly empowered the rapid advancement of the recommendation system field. Specifically, in news recommendation (NR), systems must comprehend and process a vast amount of clicked news text to infer the probability of candidate news clicks. This requirement exceeds the capabilities of traditional NR models but aligns well with the strengths of LLMs. In this paper, we propose a novel NR algorithm to reshape the news model via LLM Embedding and Co-Occurrence Pattern (LECOP). On one hand, we fine-tuned the LLM by contrastive learning using large-scale datasets to encode news, which can fully explore the semantic information of news to thoroughly identify user preferences. On the other hand, we explored multiple co-occurrence patterns to mine collaborative information. Those patterns include news ID co-occurrence, Item-Item keywords co-occurrence and Intra-Item keywords co-occurrence. The keywords mentioned above are all generated by LLM. As far as we know, this is the first time that such detailed co-occurrence patterns have been constructed via LLM to capture collaboration. Extensive experiments demonstrate the superior performance of our proposed novel method.
[AI-137] PointCG: Self-supervised Point Cloud Learning via Joint Completion and Generation
链接: https://arxiv.org/abs/2411.06041
作者: Yun Liu,Peng Li,Xuefeng Yan,Liangliang Nan,Bing Wang,Honghua Chen,Lina Gong,Wei Zhao,Mingqiang Wei
关键词-EN: cloud learning lies, self-supervised point cloud, point cloud learning, objects effectively, pre-training framework
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:The core of self-supervised point cloud learning lies in setting up appropriate pretext tasks, to construct a pre-training framework that enables the encoder to perceive 3D objects effectively. In this paper, we integrate two prevalent methods, masked point modeling (MPM) and 3D-to-2D generation, as pretext tasks within a pre-training framework. We leverage the spatial awareness and precise supervision offered by these two methods to address their respective limitations: ambiguous supervision signals and insensitivity to geometric information. Specifically, the proposed framework, abbreviated as PointCG, consists of a Hidden Point Completion (HPC) module and an Arbitrary-view Image Generation (AIG) module. We first capture visible points from arbitrary views as inputs by removing hidden points. Then, HPC extracts representations of the inputs with an encoder and completes the entire shape with a decoder, while AIG is used to generate rendered images based on the visible points’ representations. Extensive experiments demonstrate the superiority of the proposed method over the baselines in various downstream tasks. Our code will be made available upon acceptance.
[AI-138] CGLearn: Consistent Gradient-Based Learning for Out-of-Distribution Generalization
链接: https://arxiv.org/abs/2411.06040
作者: Jawad Chowdhury,Gabriel Terejanu
关键词-EN: achieving highly predictive, models necessitates learning, learning models necessitates, Improving generalization, machine learning models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 9 pages, 3 figures
点击查看摘要
Abstract:Improving generalization and achieving highly predictive, robust machine learning models necessitates learning the underlying causal structure of the variables of interest. A prominent and effective method for this is learning invariant predictors across multiple environments. In this work, we introduce a simple yet powerful approach, CGLearn, which relies on the agreement of gradients across various environments. This agreement serves as a powerful indication of reliable features, while disagreement suggests less reliability due to potential differences in underlying causal mechanisms. Our proposed method demonstrates superior performance compared to state-of-the-art methods in both linear and nonlinear settings across various regression and classification tasks. CGLearn shows robust applicability even in the absence of separate environments by exploiting invariance across different subsamples of observational data. Comprehensive experiments on both synthetic and real-world datasets highlight its effectiveness in diverse scenarios. Our findings underscore the importance of leveraging gradient agreement for learning causal invariance, providing a significant step forward in the field of robust machine learning. The source code of the linear and nonlinear implementation of CGLearn is open-source and available at: this https URL.
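A minimal sketch of the gradient-agreement idea is given below: per-environment gradients are compared, and only features whose gradient signs agree across environments receive updates. The linear model, data, and hard sign-agreement mask are simplifying assumptions, not the authors' exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5

def make_env(flip):
    # Feature 0 has a stable effect; feature 1 flips sign across environments.
    X = rng.normal(size=(n, d))
    y = 2.0 * X[:, 0] + (-3.0 if flip else 3.0) * X[:, 1] + rng.normal(size=n)
    return X, y

envs = [make_env(False), make_env(True)]
w = np.zeros(d)

def grad(X, y, w):
    return 2 * X.T @ (X @ w - y) / len(y)  # gradient of MSE loss

grads = np.stack([grad(X, y, w) for X, y in envs])
agree = np.all(np.sign(grads) == np.sign(grads[0]), axis=0)
update = grads.mean(axis=0) * agree        # update only agreeing features
print("sign agreement per feature:", agree)
```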
[AI-139] CROPS: A Deployable Crop Management System Over All Possible State Availabilities
链接: https://arxiv.org/abs/2411.06034
作者: Jing Wu,Zhixin Lai,Shengjie Liu,Suiyao Chen,Ran Tao,Pan Zhao,Chuyuan Tao,Yikun Cheng,Naira Hovakimyan
关键词-EN: optimal management strategy, Decision Support System, strategy for nitrogen, nitrogen and irrigation
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Exploring the optimal management strategy for nitrogen and irrigation has a significant impact on crop yield, economic profit, and the environment. To tackle this optimization challenge, this paper introduces a deployable CRop management system Over all Possible State availabilities (CROPS). CROPS employs a language model (LM) as a reinforcement learning (RL) agent to explore optimal management strategies within the Decision Support System for Agrotechnology Transfer (DSSAT) crop simulations. A distinguishing feature of this system is that the states used for decision-making are partially observed through random masking. Consequently, the RL agent is tasked with two primary objectives: optimizing management policies and inferring masked states. This approach significantly enhances the RL agent’s robustness and adaptability across various real-world agricultural scenarios. Extensive experiments on maize crops in Florida, USA, and Zaragoza, Spain, validate the effectiveness of CROPS. Not only did CROPS achieve State-of-the-Art (SOTA) results across various evaluation metrics such as production, profit, and sustainability, but the trained management policies are also immediately deployable in over ten million real-world contexts. Furthermore, the pre-trained policies possess a noise resilience property, which enables them to minimize potential sensor biases, ensuring robustness and generalizability. Finally, unlike previous methods, the strength of CROPS lies in its unified and elegant structure, which eliminates the need for pre-defined states or multi-stage training. These advancements highlight the potential of CROPS in revolutionizing agricultural practices.
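The partial observability described above can be mimicked with a simple random mask over the state vector, as in the hedged sketch below; the state layout, mask probability, and sentinel value are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
MASK = -1.0  # sentinel marking an unobserved entry

def mask_state(state, p_mask=0.3):
    hidden = rng.random(state.shape) < p_mask
    observed = state.copy()
    observed[hidden] = MASK
    return observed, hidden  # agent sees `observed`, learns to infer hidden entries

state = np.array([0.8, 12.4, 3.1, 0.05])  # e.g., soil moisture, N level, ...
obs, hidden = mask_state(state)
print(obs, hidden)
```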
[AI-140] A Picture is Worth A Thousand Numbers: Enabling LLM s Reason about Time Series via Visualization
链接: https://arxiv.org/abs/2411.06018
作者: Haoxin Liu,Chenghao Liu,B. Aditya Prakash
关键词-EN: Large language models, Large language, demonstrated reasoning abilities, real world, largely underexplored
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Large language models (LLMs), with demonstrated reasoning abilities across multiple domains, are largely underexplored for time-series reasoning (TsR), which is ubiquitous in the real world. In this work, we propose TimerBed, the first comprehensive testbed for evaluating LLMs’ TsR performance. Specifically, TimerBed includes stratified reasoning patterns with real-world tasks, comprehensive combinations of LLMs and reasoning strategies, and various supervised models as comparison anchors. We perform extensive experiments with TimerBed, test multiple current beliefs, and verify the initial failures of LLMs in TsR, evidenced by the ineffectiveness of zero-shot (ZST) prompting and the performance degradation of few-shot in-context learning (ICL). Further, we identify one possible root cause: the numerical modeling of data. To address this, we propose a prompt-based solution, VL-Time, using visualization-modeled data and language-guided reasoning. Experimental results demonstrate that VL-Time enables multimodal LLMs to be non-trivial ZST and powerful ICL reasoners for time series, achieving about 140% average performance improvement and 99% average token cost reduction.
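The core of VL-Time is to replace raw numbers with a rendered plot that a multimodal LLM can read. The sketch below shows only the visualization-modeling step under assumed data and styling; the subsequent language-guided prompt to the LLM is omitted.

```python
import matplotlib
matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt
import numpy as np

t = np.arange(200)
series = np.sin(t / 10) + 0.1 * np.random.default_rng(0).normal(size=t.size)

fig, ax = plt.subplots(figsize=(6, 2.5))
ax.plot(t, series)
ax.set_xlabel("time step")
ax.set_ylabel("value")
fig.savefig("series.png", dpi=150, bbox_inches="tight")
# series.png would then be attached to a reasoning prompt for a multimodal LLM.
```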
[AI-141] A Comprehensive Guide to Enhancing Antibiotic Discovery Using Machine Learning Derived Bio-computation
链接: https://arxiv.org/abs/2411.06009
作者: Khartik Uppalapati,Eeshan Dandamudi,S. Nick Ice,Gaurav Chandra,Kirsten Bischof,Christian L. Lorson,Kamal Singh
关键词-EN: Traditional drug discovery, Traditional drug, Artificial Intelligence, Machine Learning, complex process
类目: Artificial Intelligence (cs.AI)
*备注: 65 pages
点击查看摘要
Abstract:Traditional drug discovery is a long, expensive, and complex process. Advances in Artificial Intelligence (AI) and Machine Learning (ML) are beginning to change this narrative. Here, we provide a comprehensive overview of different AI and ML tools that can be used to streamline and accelerate the drug discovery process. By using data sets to train ML algorithms, it is possible to discover drugs or drug-like compounds relatively quickly, and efficiently. Additionally, we address limitations in AI-based drug discovery and development, including the scarcity of high-quality data to train AI models and ethical considerations. The growing impact of AI on the pharmaceutical industry is also highlighted. Finally, we discuss how AI and ML can expedite the discovery of new antibiotics to combat the problem of worldwide antimicrobial resistance (AMR).
[AI-142] Longitudinal Ensemble Integration for sequential classification with multimodal data ICLR2025
链接: https://arxiv.org/abs/2411.05983
作者: Aviad Susman,Repack Krishnamurthy,Richard Yan Chak Li,Mohammad Olaimat,Serdar Bozdag,Bino Varghese,Nasim Sheikh-Bahei,Gaurav Pandey
关键词-EN: Effectively modeling multimodal, Effectively modeling, application areas, modeling multimodal longitudinal, Longitudinal Ensemble Integration
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 11 pages, submitted to ICLR 2025
点击查看摘要
Abstract:Effectively modeling multimodal longitudinal data is a pressing need in various application areas, especially biomedicine. Despite this, few approaches exist in the literature for this problem, with most not adequately taking into account the multimodality of the data. In this study, we developed multiple configurations of a novel multimodal and longitudinal learning framework, Longitudinal Ensemble Integration (LEI), for sequential classification. We evaluated LEI’s performance, and compared it against existing approaches, for the early detection of dementia, which is among the most studied multimodal sequential classification tasks. LEI outperformed these approaches due to its use of intermediate base predictions arising from the individual data modalities, which enabled their better integration over time. LEI’s design also enabled the identification of features that were consistently important across time for the effective prediction of dementia-related diagnoses. Overall, our work demonstrates the potential of LEI for sequential classification from longitudinal multimodal data.
[AI-143] Unmasking the Shadows: Pinpoint the Implementations of Anti-Dynamic Analysis Techniques in Malware Using LLM
链接: https://arxiv.org/abs/2411.05982
作者: Haizhou Wang,Nanqing Luo,Peng Liu
关键词-EN: detection systems nowadays, malware detection systems, dynamic analysis processes, capability of detecting, detection systems
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Sandboxes and other dynamic analysis processes are prevalent in malware detection systems nowadays to enhance the capability of detecting 0-day malware. Therefore, techniques of anti-dynamic analysis (TADA) are prevalent in modern malware samples, and sandboxes can suffer from false negatives and analysis failures when analyzing the samples with TADAs. In such cases, human reverse engineers will get involved in conducting dynamic analysis manually (i.e., debugging, patching), which in turn also gets obstructed by TADAs. In this work, we propose a Large Language Model (LLM) based workflow that can pinpoint the location of the TADA implementation in the code, to help reverse engineers place breakpoints used in debugging. Our evaluation shows that we successfully identified the locations of 87.80% of known TADA implementations adopted from public repositories. In addition, we successfully pinpointed the locations of TADAs in 4 well-known malware samples that are documented in online malware analysis blogs.
[AI-144] Aligned Vector Quantization for Edge-Cloud Collabrative Vision-Language Models
链接: https://arxiv.org/abs/2411.05961
作者: Xiao Liu,Lijun Zhang,Deepak Ganesan,Hui Guan
关键词-EN: Visual Question Answering, Vision Language Models, Vision Language, Question Answering, Visual Question
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 12 pages, 7 figures
点击查看摘要
Abstract:Vision Language Models (VLMs) are central to Visual Question Answering (VQA) systems and are typically deployed in the cloud due to their high computational demands. However, this cloud-only approach underutilizes edge computational resources and requires significant bandwidth for transmitting raw images. In this paper, we introduce an edge-cloud collaborative VQA system, called LLaVA-AlignedVQ, which features a novel Aligned Vector Quantization algorithm (AlignedVQ) that efficiently compresses intermediate features without compromising accuracy to support partitioned execution. Our experiments demonstrate that LLaVA-AlignedVQ achieves an approximately 1365x compression rate of intermediate features, reducing data transmission overhead by 96.8% compared to transmitting JPEG90-compressed images to the cloud. LLaVA-AlignedVQ achieves an inference speedup of 2-15x while maintaining high accuracy, remaining within -2.23% to +1.6% of the original model’s accuracy across eight VQA datasets, compared to the cloud-only solution.
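To make the quantization step concrete, here is a toy nearest-neighbor vector quantizer: each intermediate feature vector is replaced by a one-byte codebook index before crossing the edge-cloud link. The codebook size, feature dimensions, and the omission of AlignedVQ's alignment training are all simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(256, 64))  # 256 codes, 64-dim features

def quantize(features):
    # features: (n, 64) -> nearest codebook index per vector (1 byte each)
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1).astype(np.uint8)

def dequantize(indices):
    return codebook[indices]  # cloud side reconstructs approximate features

feats = rng.normal(size=(10, 64))
idx = quantize(feats)
recon = dequantize(idx)
print(idx.nbytes, "bytes transmitted for", feats.nbytes, "bytes of features")
```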
[AI-145] GCI-ViTAL: Gradual Confidence Improvement with Vision Transformers for Active Learning on Label Noise
链接: https://arxiv.org/abs/2411.05939
作者: Moseli Mots’oehli,Kyungim Baek
关键词-EN: minimizing labeling costs, Convolutional Neural Networks, Chest X-ray datasets, train accurate classifiers, Active learning aims
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: under review
点击查看摘要
Abstract:Active learning aims to train accurate classifiers while minimizing labeling costs by strategically selecting informative samples for annotation. This study focuses on image classification tasks, comparing AL methods on CIFAR10, CIFAR100, Food101, and the Chest X-ray datasets under varying label noise rates. We investigate the impact of model architecture by comparing Convolutional Neural Networks (CNNs) and Vision Transformer (ViT)-based models. Additionally, we propose a novel deep active learning algorithm, GCI-ViTAL, designed to be robust to label noise. GCI-ViTAL utilizes prediction entropy and the Frobenius norm of last-layer attention vectors compared to class-centric clean set attention vectors. Our method identifies samples that are both uncertain and semantically divergent from typical images in their assigned class. This allows GCI-ViTAL to select informative data points even in the presence of label noise while flagging potentially mislabeled candidates. Label smoothing is applied to train a model that is not overly confident about potentially noisy labels. We evaluate GCI-ViTAL under varying levels of symmetric label noise and compare it to five other AL strategies. Our results demonstrate that using ViTs leads to significant performance improvements over CNNs across all AL strategies, particularly in noisy label settings. We also find that using the semantic information of images as label grounding helps in training a more robust model under label noise. Notably, we do not perform extensive hyperparameter tuning, providing an out-of-the-box comparison that addresses the common challenge practitioners face in selecting models and active learning strategies without an exhaustive literature review on training and fine-tuning vision models on real-world application data.
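A hedged sketch of the acquisition score described above follows: it mixes prediction entropy with the Frobenius distance between a sample's last-layer attention map and a class-centric clean-set prototype. The shapes, the single-map simplification, and the mixing weight are illustrative assumptions, not GCI-ViTAL's exact formulation.

```python
import numpy as np

def entropy(probs):
    return -(probs * np.log(probs + 1e-12)).sum(-1)

def acquisition_scores(probs, attn, clean_attn, alpha=0.5):
    """probs: (n, C); attn: (n, h, w); clean_attn: (C, h, w) class prototypes."""
    pred = probs.argmax(-1)
    attn_dist = np.linalg.norm(attn - clean_attn[pred], axis=(1, 2))  # Frobenius norm
    return alpha * entropy(probs) + (1 - alpha) * attn_dist

rng = np.random.default_rng(1)
p = rng.dirichlet(np.ones(10), size=100)   # softmax outputs for 100 samples
a = rng.random((100, 14, 14))              # last-layer attention maps
c = rng.random((10, 14, 14))               # clean-set prototypes per class
query = np.argsort(-acquisition_scores(p, a, c))[:8]  # most informative samples
print(query)
```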
[AI-146] Qwen2.5-32B: Leveraging Self-Consistent Tool-Integrated Reasoning for Bengali Mathematical Olympiad Problem Solving
链接: https://arxiv.org/abs/2411.05934
作者: Saad Tahmid,Sourav Sarker
关键词-EN: BUET CSE Fest, BUET CSE, CSE Fest, Tool Integrated Reasoning, present an innovative
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:We present an innovative approach for solving mathematical problems in Bengali, developed for the DL Sprint 3.0 BUET CSE Fest 2024 Competition. Our method uses advanced deep learning models, notably the Qwen 2.5 series, with improvements made through prompt engineering, model quantization, and Tool Integrated Reasoning (TIR) to handle complex calculations. Initially, we explored various model architectures, including fine-tuned Mistral and quantized Qwen models, refining them with translation techniques, Retrieval-Augmented Generation (RAG), and custom dataset curation. Manual hyperparameter tuning optimized parameters like temperature and top-p to enhance model adaptability and accuracy. Removal of RAG and parameter adjustments further improved robustness. Our approach highlights the potential of advanced NLP techniques in solving Bengali mathematical problems.
[AI-147] Moving Off-the-Grid: Scene-Grounded Video Representations NEURIPS2024
链接: https://arxiv.org/abs/2411.05927
作者: Sjoerd van Steenkiste,Daniel Zoran,Yi Yang,Yulia Rubanova,Rishabh Kabra,Carl Doersch,Dilara Gokay,Joseph Heyward,Etienne Pot,Klaus Greff,Drew A. Hudson,Thomas Albert Keck,Joao Carreira,Alexey Dosovitskiy,Mehdi S. M. Sajjadi,Thomas Kipf
关键词-EN: Current vision models, models typically maintain, Current vision, typically maintain, maintain a fixed
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted to NeurIPS 2024 (spotlight). Project page: this https URL
点击查看摘要
Abstract:Current vision models typically maintain a fixed correspondence between their representation structure and image space. Each layer comprises a set of tokens arranged “on-the-grid,” which biases patches or tokens to encode information at a specific spatio(-temporal) location. In this work we present Moving Off-the-Grid (MooG), a self-supervised video representation model that offers an alternative approach, allowing tokens to move “off-the-grid” to better enable them to represent scene elements consistently, even as they move across the image plane through time. By using a combination of cross-attention and positional embeddings we disentangle the representation structure and image structure. We find that a simple self-supervised objective, next-frame prediction, trained on video data results in a set of latent tokens which bind to specific scene structures and track them as they move. We demonstrate the usefulness of MooG’s learned representation both qualitatively and quantitatively by training readouts on top of the learned representation on a variety of downstream tasks. We show that MooG can provide a strong foundation for different vision tasks when compared to “on-the-grid” baselines.
[AI-148] Integrating Object Detection Modality into Visual Language Model for Enhanced Autonomous Driving Agent NEURIPS2024
链接: https://arxiv.org/abs/2411.05898
作者: Linfeng He,Yiming Sun,Sihao Wu,Jiaxu Liu,Xiaowei Huang
关键词-EN: perception module specialised, enhancing visual comprehension, integrating visual language, framework for enhancing, module specialised
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: accepted by SafeGenAI workshop of NeurIPS 2024
点击查看摘要
Abstract:In this paper, we propose a novel framework for enhancing visual comprehension in autonomous driving systems by integrating visual language models (VLMs) with an additional visual perception module specialised in object detection. We extend the Llama-Adapter architecture by incorporating a YOLOS-based detection network alongside the CLIP perception network, addressing limitations in object detection and localisation. Our approach introduces camera ID-separators to improve multi-view processing, crucial for comprehensive environmental awareness. Experiments on the DriveLM visual question answering challenge demonstrate significant improvements over baseline models, with enhanced performance in ChatGPT scores, BLEU scores, and CIDEr metrics, indicating the closeness of model answers to the ground truth. Our method represents a promising step towards more capable and interpretable autonomous driving systems. Possible safety enhancements enabled by the detection modality are also discussed.
[AI-149] Towards Equitable ASD Diagnostics: A Comparative Study of Machine and Deep Learning Models Using Behavioral and Facial Data
链接: https://arxiv.org/abs/2411.05880
作者: Mohammed Aledhari,Mohamed Rahouti,Ali Alfatemi
关键词-EN: Autism Spectrum Disorder, Autism Spectrum, Spectrum Disorder, gender-specific symptom differences, symptom differences overlooked
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Autism Spectrum Disorder (ASD) is often underdiagnosed in females due to gender-specific symptom differences overlooked by conventional diagnostics. This study evaluates machine learning models, particularly Random Forest and convolutional neural networks, for enhancing ASD diagnosis through structured data and facial image analysis. Random Forest achieved 100% validation accuracy across datasets, highlighting its ability to manage complex relationships and reduce false negatives, which is crucial for early intervention and addressing gender biases. In image-based analysis, MobileNet outperformed the baseline CNN, achieving 87% accuracy, though a 30% validation loss suggests possible overfitting, requiring further optimization for robustness in clinical settings. Future work will emphasize hyperparameter tuning, regularization, and transfer learning. Integrating behavioral data with facial analysis could improve diagnosis for underdiagnosed groups. These findings suggest Random Forest’s high accuracy and balanced precision-recall metrics could enhance clinical workflows. MobileNet’s lightweight structure also shows promise for resource-limited environments, enabling accessible ASD screening. Addressing model explainability and clinician trust will be vital.
[AI-150] Interplay between Federated Learning and Explainable Artificial Intelligence: a Scoping Review
链接: https://arxiv.org/abs/2411.05874
作者: Luis M. Lopez-Ramos,Florian Leiser,Aditya Rastogi,Steven Hicks,Inga Strümke,Vince I. Madai,Tobias Budig,Ali Sunyaev,Adam Hilbert
关键词-EN: preserving important aspects, Explainable artificial intelligence, implementation of Federated, Federated learning, aspects of privacy
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 16 pages, 11 figures, submitted in IEEE Trans. Knowledge and Data Engineering
点击查看摘要
Abstract:The joint implementation of Federated learning (FL) and Explainable artificial intelligence (XAI) will allow training models from distributed data and explaining their inner workings while preserving important aspects of privacy. Towards establishing the benefits and tensions associated with their interplay, this scoping review maps those publications that jointly deal with FL and XAI, focusing on publications where an interplay between FL and model interpretability or post-hoc explanations was found. In total, 37 studies met our criteria, with more papers focusing on explanation methods (mainly feature relevance) than on interpretability (mainly algorithmic transparency). Most works used simulated horizontal FL setups involving 10 or fewer data centers. Only one study explicitly and quantitatively analyzed the influence of FL on model explanations, revealing a significant research gap. Aggregation of interpretability metrics across FL nodes created generalized global insights at the expense of node-specific patterns being diluted. Eight papers addressed the benefits of incorporating explanation methods as a component of the FL algorithm. Studies using established FL libraries or following reporting guidelines are a minority. More quantitative research and structured, transparent practices are needed to fully understand their mutual impact and the conditions under which it occurs.
[AI-151] Modeling Nonlinear Oscillator Networks Using Physics-Informed Hybrid Reservoir Computing
链接: https://arxiv.org/abs/2411.05867
作者: Andrew Shannon,Conor Houghton,David Barton,Martin Homer
关键词-EN: oscillator networks remains, networks remains challenging, remains challenging due, non-linear oscillator networks, simplified analytical models
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 27 pages, 10 figures, 17 supplementary figures. Code available at this https URL
点击查看摘要
Abstract:Surrogate modeling of non-linear oscillator networks remains challenging due to discrepancies between simplified analytical models and real-world complexity. To bridge this gap, we investigate hybrid reservoir computing, combining reservoir computing with “expert” analytical models. Simulating the absence of an exact model, we first test the surrogate models with parameter errors in their expert model. Second, we assess their performance when their expert model lacks key non-linear coupling terms present in an extended ground-truth model. We focus on short-term forecasting across diverse dynamical regimes, evaluating the use of these surrogates for control applications. We show that hybrid reservoir computers generally outperform standard reservoir computers and exhibit greater robustness to parameter tuning. Notably, unlike standard reservoir computers, the performance of the hybrid does not degrade when crossing an observed spectral radius threshold. Furthermore, there is good performance for dynamical regimes not accessible to the expert model, demonstrating the contribution of the reservoir.
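The hybrid construction can be sketched in a few lines: the linear readout is fit on the reservoir state concatenated with the expert model's (deliberately imperfect) prediction. The toy signal, reservoir sizes, and expert model below are assumptions chosen only to make the structure concrete.

```python
import numpy as np

rng = np.random.default_rng(0)
N, steps = 300, 2000
Win = rng.uniform(-0.5, 0.5, size=(N, 1))
W = rng.normal(size=(N, N))
W *= 0.9 / np.abs(np.linalg.eigvals(W)).max()   # set spectral radius to 0.9

t = np.arange(steps) * 0.05
signal = np.sin(t) + 0.05 * rng.normal(size=steps)  # ground-truth series

def expert(u):
    return 0.95 * u  # imperfect analytical one-step model (wrong gain)

x, states = np.zeros(N), []
for u in signal[:-1]:
    x = np.tanh(W @ x + Win @ [u])
    states.append(np.concatenate([x, [expert(u)]]))  # hybrid feature vector
X, y = np.array(states), signal[1:]
Wout = np.linalg.lstsq(X, y, rcond=None)[0]          # train the linear readout
print("train MSE:", np.mean((X @ Wout - y) ** 2))
```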
[AI-152] Bilinear Fuzzy Genetic Algorithm and Its Application on the Optimum Design of Steel Structures with Semi-rigid Connections
链接: https://arxiv.org/abs/2411.05865
作者: Salar Farahmand-Tabar,Payam Ashtari
关键词-EN: improved bilinear fuzzy, semi-rigid connections, bilinear fuzzy genetic, BFGA, improved bilinear
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
*备注: 19 pages, 12 figures, book chapter, Springer
点击查看摘要
Abstract:An improved bilinear fuzzy genetic algorithm (BFGA) is introduced in this chapter for the design optimization of steel structures with semi-rigid connections. Semi-rigid connections provide a compromise between the stiffness of fully rigid connections and the flexibility of fully pinned connections. However, designing such structures is challenging due to the nonlinear behavior of semi-rigid connections. The BFGA is a robust optimization method that combines the strengths of fuzzy logic and genetic algorithms to handle the complexity and uncertainties of structural design problems. Compared to a standard GA, the BFGA is demonstrated to generate high-quality solutions in reasonable time. The application of the BFGA is demonstrated through the optimization of steel structures with semi-rigid connections, considering weight and performance criteria. The results show that the proposed BFGA is capable of finding optimal designs that satisfy all the design requirements and constraints. The proposed approach provides a promising solution for the optimization of complex structures with nonlinear behavior.
[AI-153] Boosting the Efficiency of Metaheuristics Through Opposition-Based Learning in Optimum Locating of Control Systems in Tall Buildings
链接: https://arxiv.org/abs/2411.05864
作者: Salar Farahmand-Tabar,Sina Shirgir
关键词-EN: metaheuristic optimization algorithms, Opposition-based learning, solving complex engineering, metaheuristic optimization, opposition strategies
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注: 17 pages, 4 figures, book chapter, Springer
点击查看摘要
Abstract:Opposition-based learning (OBL) is an effective approach to improve the performance of metaheuristic optimization algorithms, which are commonly used for solving complex engineering problems. This chapter provides a comprehensive review of the literature on the use of opposition strategies in metaheuristic optimization algorithms, discussing the benefits and limitations of this approach. An overview of the opposition strategy concept, its various implementations, and its impact on the performance of metaheuristic algorithms is presented. Furthermore, case studies on the application of opposition strategies in engineering problems are provided, including the optimum locating of control systems in tall buildings. A shear frame with a Magnetorheological (MR) fluid damper is considered as a case study. The results demonstrate that the incorporation of opposition strategies in metaheuristic algorithms significantly enhances the quality and speed of the optimization process. This chapter aims to provide a clear understanding of the opposition strategy in metaheuristic optimization algorithms and its engineering applications, with the ultimate goal of facilitating its adoption in real-world engineering problems.
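The opposition operation itself is one line: for a candidate x in [a, b], its opposite is a + b - x, and evaluating both roughly doubles the chance of starting near a good region. Below is a minimal sketch with an arbitrary sphere fitness; all sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, dim, pop = -5.0, 5.0, 4, 20

def fitness(x):
    return (x ** 2).sum(axis=-1)   # sphere function (to minimize)

X = rng.uniform(a, b, size=(pop, dim))
X_opp = a + b - X                  # opposition-based candidates
both = np.concatenate([X, X_opp])
best = both[np.argsort(fitness(both))[:pop]]  # keep the better half
print("best initial fitness:", fitness(best).min())
```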
[AI-154] Conditional Diffusion Model for Longitudinal Medical Image Generation
链接: https://arxiv.org/abs/2411.05860
作者: Duy-Phuong Dao,Hyung-Jeong Yang,Jahae Kim
关键词-EN: Alzheimers disease progresses, disease progresses slowly, Alzheimers disease, involves complex interaction, biological factors
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 4 pages, 2 figures, conference
点击查看摘要
Abstract:Alzheimer's disease progresses slowly and involves complex interactions between various biological factors. Longitudinal medical imaging data can capture this progression over time. However, longitudinal data frequently encounter issues such as missing data due to patient dropouts, irregular follow-up intervals, and varying lengths of observation periods. To address these issues, we designed a diffusion-based model for 3D longitudinal medical image generation using single magnetic resonance imaging (MRI). This involves the injection of a conditioning MRI and time-visit encoding into the model, enabling control over the change between source and target images. The experimental results indicate that the proposed method generates higher-quality images compared to other competing methods.
[AI-155] Enhancing Financial Fraud Detection with Human-in-the-Loop Feedback and Feedback Propagation
链接: https://arxiv.org/abs/2411.05859
作者: Prashank Kadam
关键词-EN: significantly enhance machine, enhance machine learning, Subject Matter Experts, patterns change rapidly, machine learning models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
*备注: International Conference on Machine Learning and Applications 2024
点击查看摘要
Abstract:Human-in-the-loop (HITL) feedback mechanisms can significantly enhance machine learning models, particularly in financial fraud detection, where fraud patterns change rapidly, and fraudulent nodes are sparse. Even small amounts of feedback from Subject Matter Experts (SMEs) can notably boost model performance. This paper examines the impact of HITL feedback on both traditional and advanced techniques using proprietary and publicly available datasets. Our results show that HITL feedback improves model accuracy, with graph-based techniques benefiting the most. We also introduce a novel feedback propagation method that extends feedback across the dataset, further enhancing detection accuracy. By leveraging human expertise, this approach addresses challenges related to evolving fraud patterns, data sparsity, and model interpretability, ultimately improving model robustness and streamlining the annotation process.
[AI-156] Evaluating the Economic Implications of Using Machine Learning in Clinical Psychiatry ALT ML4H
链接: https://arxiv.org/abs/2411.05856
作者: Soaad Hossain,James Rasalingam,Arhum Waheed,Fatah Awil,Rachel Kandiah,Syed Ishtiaque Ahmed
关键词-EN: clinical psychiatry, machine learning, literature covering, growing interest, increasing number
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: 11 pages, submitted to Machine Learning for Health (ML4H) 2024
点击查看摘要
Abstract:With the growing interest in using AI and machine learning (ML) in medicine, there is an increasing amount of literature covering the application and ethics of using AI and ML in areas of medicine such as clinical psychiatry. The problem is that there is little literature covering the economic aspects associated with using ML in clinical psychiatry. This study addresses this gap by specifically studying the economic implications of using ML in clinical psychiatry. In this paper, we evaluate the economic implications of using ML in clinical psychiatry through three problem-oriented case studies, literature on economics, socioeconomic and medical AI, and two types of health economic evaluations. In addition, we provide details on fairness, legal, ethical, and other considerations for ML in clinical psychiatry.
[AI-157] Harmful YouTube Video Detection: A Taxonomy of Online Harm and MLLM s as Alternative Annotators
链接: https://arxiv.org/abs/2411.05854
作者: Claire Wonjeong Jo,Miki Wesołowska,Magdalena Wojcieszak
关键词-EN: Short video platforms, Short video, users globally, Instagram, video platforms
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
*备注:
点击查看摘要
Abstract:Short video platforms, such as YouTube, Instagram, or TikTok, are used by billions of users globally. These platforms expose users to harmful content, ranging from clickbait or physical harms to misinformation or online hate. Yet, detecting harmful videos remains challenging due to an inconsistent understanding of what constitutes harm and the limited resources and mental toll involved in human annotation. As such, this study advances measures and methods to detect harm in video content. First, we develop a comprehensive taxonomy for online harm on video platforms, categorizing it into six categories: Information, Hate and harassment, Addictive, Clickbait, Sexual, and Physical harms. Next, we establish multimodal large language models as reliable annotators of harmful videos. We analyze 19,422 YouTube videos using 14 image frames, 1 thumbnail, and text metadata, comparing the accuracy of crowdworkers (MTurk) and GPT-4-Turbo with domain expert annotations serving as the gold standard. Our results demonstrate that GPT-4-Turbo outperforms crowdworkers in both binary classification (harmful vs. harmless) and multi-label harm categorization tasks. Methodologically, this study extends the application of LLMs to multi-label and multi-modal contexts beyond text annotation and binary classification. Practically, our study contributes to online harm mitigation by guiding the definitions and identification of harmful content on video platforms.
[AI-158] Federated Data-Driven Kalman Filtering for State Estimation
链接: https://arxiv.org/abs/2411.05847
作者: Nikos Piperigkos,Alexandros Gkillas,Christos Anagnostopoulos,Aris S. Lalos
关键词-EN: localization framework based, highly accurate localization, federated learning paradigm, Extended Kalman Filtering, localization framework
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:This paper proposes a novel localization framework based on the collaborative training or federated learning paradigm for highly accurate localization of autonomous vehicles. More specifically, we build on the standard approach of KalmanNet, a recurrent neural network aiming to estimate the underlying system uncertainty of traditional Extended Kalman Filtering, and reformulate it by the adapt-then-combine concept to FedKalmanNet. The latter is trained in a distributed manner by a group of vehicles (or clients), with local training datasets consisting of vehicular location and velocity measurements, through a global server aggregation operation. The FedKalmanNet is then used by each vehicle to localize itself, by estimating the associated system uncertainty matrices (i.e., Kalman gain). Our aim is to demonstrate the benefits of collaborative training for state estimation in autonomous driving, over collaborative decision-making which requires rich V2X communication resources for measurement exchange and sensor fusion under real-time constraints. An extensive experimental and evaluation study conducted in the CARLA autonomous driving simulator highlights the superior performance of FedKalmanNet over state-of-the-art collaborative decision-making approaches, in localizing vehicles without the need for real-time V2X communication.
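Server-side aggregation in such a federated setup typically reduces to a weighted average of client parameters (FedAvg-style); the sketch below shows that step under the assumption that each vehicle's KalmanNet weights are plain arrays, with the network architecture itself omitted.

```python
import numpy as np

def fed_avg(client_weights, client_sizes):
    """Weighted average of per-client parameter lists."""
    total = sum(client_sizes)
    agg = [np.zeros_like(w) for w in client_weights[0]]
    for weights, n in zip(client_weights, client_sizes):
        for i, w in enumerate(weights):
            agg[i] += (n / total) * w
    return agg

# Four vehicles, each with a toy two-tensor model (invented shapes).
clients = [[np.random.rand(3, 3), np.random.rand(3)] for _ in range(4)]
global_model = fed_avg(clients, client_sizes=[100, 80, 120, 60])
print([w.shape for w in global_model])
```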
[AI-159] Diversify Contextualize and Adapt: Efficient Entropy Modeling for Neural Image Codec NEURIPS2024
链接: https://arxiv.org/abs/2411.05832
作者: Jun-Hyuk Kim,Seungeon Kim,Won-Hee Lee,Dokwan Oh
关键词-EN: Designing a fast, entropy models, effective entropy model, autoregressive entropy models, adaptation-based entropy models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
*备注: NeurIPS 2024
点击查看摘要
Abstract:Designing a fast and effective entropy model is challenging but essential for practical application of neural codecs. Beyond spatial autoregressive entropy models, more efficient backward adaptation-based entropy models have been recently developed. They not only reduce decoding time by using smaller number of modeling steps but also maintain or even improve rate–distortion performance by leveraging more diverse contexts for backward adaptation. Despite their significant progress, we argue that their performance has been limited by the simple adoption of the design convention for forward adaptation: using only a single type of hyper latent representation, which does not provide sufficient contextual information, especially in the first modeling step. In this paper, we propose a simple yet effective entropy modeling framework that leverages sufficient contexts for forward adaptation without compromising on bit-rate. Specifically, we introduce a strategy of diversifying hyper latent representations for forward adaptation, i.e., using two additional types of contexts along with the existing single type of context. In addition, we present a method to effectively use the diverse contexts for contextualizing the current elements to be encoded/decoded. By addressing the limitation of the previous approach, our proposed framework leads to significant performance improvements. Experimental results on popular datasets show that our proposed framework consistently improves rate–distortion performance across various bit-rate regions, e.g., 3.73% BD-rate gain over the state-of-the-art baseline on the Kodak dataset.
[AI-160] To Ask or Not to Ask? Detecting Absence of Information in Vision and Language Navigation WACV2025
链接: https://arxiv.org/abs/2411.05831
作者: Savitha Sam Abraham,Sourav Garg,Feras Dayoub
关键词-EN: Vision Language Navigation, Language Navigation, agents’ inquisitive abilities, Vision Language, Recent research
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at WACV 2025
点击查看摘要
Abstract:Recent research in Vision Language Navigation (VLN) has overlooked the development of agents’ inquisitive abilities, which allow them to ask clarifying questions when instructions are incomplete. This paper addresses how agents can recognize “when” they lack sufficient information, without focusing on “what” is missing, particularly in VLN tasks with vague instructions. Equipping agents with this ability enhances efficiency by reducing potential digressions and seeking timely assistance. The challenge in identifying such uncertain points is balancing between being overly cautious (high recall) and overly confident (high precision). We propose an attention-based instruction-vagueness estimation module that learns associations between instructions and the agent’s trajectory. By leveraging instruction-to-path alignment information during training, the module’s vagueness estimation performance improves by around 52% in terms of precision-recall balance. In our ablative experiments, we also demonstrate the effectiveness of incorporating this additional instruction-to-path attention network alongside the cross-modal attention networks within the navigator module. Our results show that the attention scores from the instruction-to-path attention network serve as better indicators for estimating vagueness.
[AI-161] AI Multi-Agent Interoperability Extension for Managing Multiparty Conversations
链接: https://arxiv.org/abs/2411.05828
作者: Diego Gosmar,Deborah A. Dahl,Emmett Coin,David Attwater
关键词-EN: Open Voice Network, Open Voice Interoperability, Voice Interoperability Initiative, Open Voice, existing Multi-Agent Interoperability
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注: 20 pages, 3 figures
点击查看摘要
Abstract:This paper presents a novel extension to the existing Multi-Agent Interoperability specifications of the Open Voice Interoperability Initiative (originally also known as OVON from the Open Voice Network). This extension enables AI agents developed with different technologies to communicate using a universal, natural language-based API or NLP-based standard APIs. Focusing on the management of multiparty AI conversations, this work introduces new concepts such as the Convener Agent, Floor-Shared Conversational Space, Floor Manager, Multi-Conversant Support, and mechanisms for handling Interruptions and Uninvited Agents. Additionally, it explores the Convener’s role as a message relay and controller of participant interactions, enhancing both scalability and security. These advancements are crucial for ensuring smooth, efficient, and secure interactions in scenarios where multiple AI agents need to collaborate, debate, or contribute to a discussion. The paper elaborates on these concepts and provides practical examples, illustrating their implementation within the conversation envelope structure.
[AI-162] From Pixels to Prose: Advancing Multi-Modal Language Models for Remote Sensing
链接: https://arxiv.org/abs/2411.05826
作者: Xintian Sun,Benji Peng,Charles Zhang,Fei Jin,Qian Niu,Junyu Liu,Keyu Chen,Ming Li,Pohsun Feng,Ziqian Bi,Ming Liu,Yichao Zhang
关键词-EN: simple image acquisition, complex systems capable, Remote sensing, remote sensing data, evolved from simple
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 10 pages, 1 figure
点击查看摘要
Abstract:Remote sensing has evolved from simple image acquisition to complex systems capable of integrating and processing visual and textual data. This review examines the development and application of multi-modal language models (MLLMs) in remote sensing, focusing on their ability to interpret and describe satellite imagery using natural language. We cover the technical underpinnings of MLLMs, including dual-encoder architectures, Transformer models, self-supervised and contrastive learning, and cross-modal integration. The unique challenges of remote sensing data (varying spatial resolutions, spectral richness, and temporal changes) are analyzed for their impact on MLLM performance. Key applications such as scene description, object detection, change detection, text-to-image retrieval, image-to-text generation, and visual question answering are discussed to demonstrate their relevance in environmental monitoring, urban planning, and disaster response. We review significant datasets and resources supporting the training and evaluation of these models. Challenges related to computational demands, scalability, data quality, and domain adaptation are highlighted. We conclude by proposing future research directions and technological advancements to further enhance MLLM utility in remote sensing.
[AI-163] FlexCAD: Unified and Versatile Controllable CAD Generation with Fine-tuned Large Language Models
链接: https://arxiv.org/abs/2411.05823
作者: Zhanwei Zhang,Shizhao Sun,Wenxiao Wang,Deng Cai,Jiang Bian
关键词-EN: creating computer-aided design, CAD, computer-aided design, growing interest, interest in creating
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
*备注: 23 pages
点击查看摘要
Abstract:Recently, there is a growing interest in creating computer-aided design (CAD) models based on user intent, known as controllable CAD generation. Existing work offers limited controllability and needs separate models for different types of control, reducing efficiency and practicality. To achieve controllable generation across all CAD construction hierarchies, such as sketch-extrusion, extrusion, sketch, face, loop and curve, we propose FlexCAD, a unified model by fine-tuning large language models (LLMs). First, to enhance comprehension by LLMs, we represent a CAD model as a structured text by abstracting each hierarchy as a sequence of text tokens. Second, to address various controllable generation tasks in a unified model, we introduce a hierarchy-aware masking strategy. Specifically, during training, we mask a hierarchy-aware field in the CAD text with a mask token. This field, composed of a sequence of tokens, can be set flexibly to represent various hierarchies. Subsequently, we ask LLMs to predict this masked field. During inference, the user intent is converted into a CAD text with a mask token replacing the part the user wants to modify, which is then fed into FlexCAD to generate new CAD models. Comprehensive experiments on public dataset demonstrate the effectiveness of FlexCAD in both generation quality and controllability. Code will be available at this https URL.
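The hierarchy-aware masking can be pictured as replacing one span of a structured CAD token sequence with a mask token; the toy vocabulary and tags below are invented for illustration and are not FlexCAD's actual tokenization.

```python
MASK = "<mask>"

cad_tokens = ["<sketch>", "<face>", "<loop>", "line", "arc", "line",
              "</loop>", "</face>", "</sketch>", "<extrude>", "10", "</extrude>"]

def mask_hierarchy(tokens, open_tag, close_tag):
    i, j = tokens.index(open_tag), tokens.index(close_tag)
    target = tokens[i:j + 1]                      # ground-truth span to predict
    return tokens[:i] + [MASK] + tokens[j + 1:], target

masked, target = mask_hierarchy(cad_tokens, "<loop>", "</loop>")
print(masked)   # input fed to the LLM
print(target)   # training label for the masked field
```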
[AI-164] Guiding Genetic Programming with Graph Neural Networks GECCO2024
链接: https://arxiv.org/abs/2411.05820
作者: Piotr Wyrwiński,Krzysztof Krawiec
关键词-EN: algorithm acquires knowledge, search algorithm acquires, commonly assumed, algorithm acquires, space and evaluating
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Symbolic Computation (cs.SC); Machine Learning (stat.ML)
*备注: Full version of the same-titled paper accepted at GECCO 2024
点击查看摘要
Abstract:In evolutionary computation, it is commonly assumed that a search algorithm acquires knowledge about a problem instance by sampling solutions from the search space and evaluating them with a fitness function. This is necessarily inefficient because fitness reveals very little about solutions – yet they contain more information that can be potentially exploited. To address this observation in genetic programming, we propose EvoNUDGE, which uses a graph neural network to elicit additional knowledge from symbolic regression problems. The network is queried on the problem before an evolutionary run to produce a library of subprograms, which is subsequently used to seed the initial population and bias the actions of search operators. In an extensive experiment on a large number of problem instances, EvoNUDGE is shown to significantly outperform multiple baselines, including the conventional tree-based genetic programming and the purely neural variant of the method.
[AI-165] Learning Characteristics of Reverse Quaternion Neural Network
链接: https://arxiv.org/abs/2411.05816
作者: Shogo Yamauchi,Tohru Nitta,Takaaki Ohnishi
关键词-EN: quaternion neural network, Reverse Quaternion Neural, feedforward quaternion neural, quaternion neural, multi-layer feedforward quaternion
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:The purpose of this paper is to propose a new multi-layer feedforward quaternion neural network model architecture, the Reverse Quaternion Neural Network, which utilizes the non-commutative nature of quaternion products, and to clarify its learning characteristics. While quaternion neural networks have been used in various fields, there have been no reports on the characteristics of multi-layer feedforward quaternion neural networks in which weights are applied in the reverse direction. This paper investigates the learning characteristics of the Reverse Quaternion Neural Network from two perspectives: learning speed and generalization on rotation. As a result, it is found that the Reverse Quaternion Neural Network has a learning speed comparable to existing models and can obtain a rotation representation different from that of existing models.
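The non-commutativity being exploited is easy to verify numerically: for the quaternion units, i*j = k while j*i = -k, so multiplying by a weight from the right is a genuinely different operation from multiplying from the left. A small worked example:

```python
def qmul(p, q):
    # Hamilton product of quaternions given as (w, x, y, z) tuples.
    pw, px, py, pz = p
    qw, qx, qy, qz = q
    return (pw*qw - px*qx - py*qy - pz*qz,
            pw*qx + px*qw + py*qz - pz*qy,
            pw*qy - px*qz + py*qw + pz*qx,
            pw*qz + px*qy - py*qx + pz*qw)

i = (0.0, 1.0, 0.0, 0.0)
j = (0.0, 0.0, 1.0, 0.0)
print(qmul(i, j))   # (0, 0, 0, 1)  ->  k
print(qmul(j, i))   # (0, 0, 0, -1) -> -k
```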
[AI-166] SkipSNN: Efficiently Classifying Spike Trains with Event-attention
链接: https://arxiv.org/abs/2411.05806
作者: Hang Yin,Yao Su,Liping Liu,Thomas Hartvigsen,Xin Dai,Xiangnan Kong
关键词-EN: binary event sequence, machine learning community, Spiking Neural Networks, spike trains, important topic
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Published as a research paper at IEEE BigData 2024
点击查看摘要
Abstract:Spike train classification has recently become an important topic in the machine learning community, where each spike train is a binary event sequence with temporal sparsity of signals of interest and temporal noise properties. A promising model for it should follow the design principle of performing intensive computation only when signals of interest appear. Such tasks therefore mainly use Spiking Neural Networks (SNNs) due to their consideration of the temporal sparsity of spike trains. However, the basic mechanism of SNNs ignores the temporal-noise issue, which makes them computationally expensive and power-hungry when analyzing spike trains on resource-constrained platforms. As an event-driven model, an SNN neuron reacts to any input signal, making it difficult to quickly find signals of interest. In this paper, we introduce an event-attention mechanism that enables SNNs to dynamically highlight useful signals of the original spike trains. To this end, we propose SkipSNN, which extends existing SNN models by learning to mask out noise by skipping membrane potential updates and shortening the effective size of the computational graph. This process is analogous to how people choose to open and close their eyes to filter the information they see. We evaluate SkipSNN on various neuromorphic tasks and demonstrate that it achieves significantly better computational efficiency and classification accuracy than other state-of-the-art SNNs.
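A toy rendition of the skipping idea: a leaky integrate-and-fire neuron whose membrane update is skipped whenever a simple gate sees no recent input. The heuristic gate stands in for SkipSNN's learned skipping policy, and the spike train is invented.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 100
inputs = rng.binomial(1, 0.05, size=T).astype(float)  # sparse, noisy spikes
inputs[40:50] = 1.0                                   # burst = signal of interest

v, tau, v_th = 0.0, 0.9, 1.5
spikes, updates = [], 0
for t in range(T):
    if inputs[max(0, t - 2):t + 1].sum() == 0:  # gate: no input in last 3 steps
        spikes.append(0)
        continue                                 # membrane update skipped
    updates += 1
    v = tau * v + inputs[t]                      # leaky integration
    fired = v >= v_th
    spikes.append(int(fired))
    if fired:
        v = 0.0                                  # reset after a spike
print("membrane updates performed:", updates, "of", T)
```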
[AI-167] Similarity-based context aware continual learning for spiking neural networks
链接: https://arxiv.org/abs/2411.05802
作者: Bing Han,Feifei Zhao,Yang Li,Qingqun Kong,Xianqi Li,Yi Zeng
关键词-EN: coordinate relevant neuronal, relevant neuronal populations, learn continuously changing, neuronal populations based, continuously changing tasks
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Biological brains have the capability to adaptively coordinate relevant neuronal populations based on the task context to learn continuously changing tasks in real-world environments. However, existing spiking neural network-based continual learning algorithms treat each task equally, ignoring the guiding role of different task similarity associations for network learning, which limits knowledge utilization efficiency. Inspired by the context-dependent plasticity mechanism of the brain, we propose a Similarity-based Context Aware Spiking Neural Network (SCA-SNN) continual learning algorithm to efficiently accomplish task incremental learning and class incremental learning. Based on contextual similarity across tasks, the SCA-SNN model can adaptively reuse neurons from previous tasks that are beneficial for new tasks (the more similar, the more neurons are reused) and flexibly expand new neurons for the new task (the more similar, the fewer neurons are expanded). Selective reuse and discriminative expansion significantly improve the utilization of previous knowledge and reduce energy consumption. Extensive experimental results on CIFAR100, ImageNet generalized datasets, and FMNIST-MNIST, SVHN-CIFAR100 mixed datasets show that our SCA-SNN model achieves superior performance compared to both SNN-based and DNN-based continual learning algorithms. Additionally, our algorithm has the capability to adaptively select similar groups of neurons for related tasks, offering a promising approach to enhancing the biological interpretability of efficient continual learning.
[AI-168] NeoPhysIx: An Ultra Fast 3D Physical Simulator as Development Tool for AI Algorithms
链接: https://arxiv.org/abs/2411.05799
作者: Jörn Fischer,Thomas Ihme
关键词-EN: Programming and Reinforcement, require extensive computational, extensive computational resources, Traditional AI algorithms, physical scenarios effectively
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 7 Pages, 4 Figures
点击查看摘要
Abstract:Traditional AI algorithms, such as Genetic Programming and Reinforcement Learning, often require extensive computational resources to simulate real-world physical scenarios effectively. While advancements in multi-core processing have been made, the inherent limitations of parallelizing rigid body dynamics lead to significant communication overheads, hindering substantial performance gains for simple simulations. This paper introduces NeoPhysIx, a novel 3D physical simulator designed to overcome these challenges. By adopting innovative simulation paradigms and focusing on essential algorithmic elements, NeoPhysIx achieves unprecedented speedups exceeding 1000x compared to real-time. This acceleration is realized through strategic simplifications, including point cloud collision detection, joint angle determination, and friction force estimation. The efficacy of NeoPhysIx is demonstrated through its application in training a legged robot with 18 degrees of freedom and six sensors, controlled by an evolved genetic program. Remarkably, simulating half a year of robot lifetime within a mere 9 hours on a single core of a standard mid-range CPU highlights the significant efficiency gains offered by NeoPhysIx. This breakthrough paves the way for accelerated AI development and training in physically-grounded domains.
[AI-169] A Genetic Algorithm for Multi-Capacity Fixed-Charge Flow Network Design
链接: https://arxiv.org/abs/2411.05798
作者: Caleb Eardley,Dalton Gomez,Ryan Dupuis,Michael Papadopoulos,Sean Yaw
关键词-EN: Fixed-Charge Network Flow, Multi-Capacity Fixed-Charge Network, Fixed-Charge Network, Multi-Capacity Fixed-Charge, aims to assign
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:The Multi-Capacity Fixed-Charge Network Flow (MC-FCNF) problem, a generalization of the Fixed-Charge Network Flow problem, aims to assign capacities to edges in a flow network such that a target amount of flow can be hosted at minimum cost. The cost model for both problems dictates that the fixed cost of an edge is incurred for any non-zero amount of flow hosted by that edge. This problem naturally arises in many areas including infrastructure design, transportation, telecommunications, and supply chain management. The MC-FCNF problem is NP-hard, so solving large instances using exact techniques is impractical. This paper presents a genetic algorithm designed to quickly find high-quality flow solutions to the MC-FCNF problem. The genetic algorithm uses a novel solution representation scheme that eliminates the need to repair invalid flow solutions, an issue common to many other genetic algorithms for the MC-FCNF problem. The genetic algorithm's efficiency is demonstrated through an evaluation using real-world CO2 capture and storage infrastructure design data. The evaluation results highlight the genetic algorithm's potential for solving large-scale network design problems.
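The fixed-charge cost model is simple to state in code: any edge carrying non-zero flow pays its full fixed cost plus a per-unit variable cost, subject to the chosen capacity. The tiny network and cost figures below are invented to illustrate how a candidate flow solution would be scored inside such a GA.

```python
edges = {
    ("s", "a"): {"cap": 10, "fixed": 50.0, "var": 2.0},
    ("a", "t"): {"cap": 10, "fixed": 30.0, "var": 1.5},
    ("s", "t"): {"cap": 4,  "fixed": 80.0, "var": 0.5},
}

def solution_cost(flows):
    cost = 0.0
    for e, f in flows.items():
        if f < 0 or f > edges[e]["cap"]:
            return float("inf")                       # infeasible solution
        if f > 0:
            cost += edges[e]["fixed"] + edges[e]["var"] * f
    return cost

# Route 10 units of flow from s to t: 6 via a, 4 direct.
print(solution_cost({("s", "a"): 6, ("a", "t"): 6, ("s", "t"): 4}))  # 183.0
```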
[AI-170] A Comprehensive Survey of Time Series Forecasting: Architectural Diversity and Open Challenges
链接: https://arxiv.org/abs/2411.05793
作者: Jongseon Kim(1 and 3),Hyungjoon Kim(1 and 4),HyunGi Kim(2),Dongjun Lee(1),Sungroh Yoon(1 and 2) ((1) Interdisciplinary Program in Artificial Intelligence, Seoul National University, (2) Department of Electrical and Computer Engineering, Seoul National University, (3) R&D Department, LG Chem, (4) R&D Department, Samsung SDI)
关键词-EN: Time series forecasting, Time series, series forecasting, series, Time
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Submitted to the Artificial Intelligence Review on October 10, 2024
点击查看摘要
Abstract:Time series forecasting is a critical task that provides key information for decision-making across various fields. Recently, various fundamental deep learning architectures such as MLPs, CNNs, RNNs, and GNNs have been developed and applied to solve time series forecasting problems. However, the structural limitations caused by the inductive biases of each deep learning architecture have constrained their performance. Transformer models, which excel at handling long-term dependencies, have become significant architectural components for time series forecasting. However, recent research has shown that alternatives such as simple linear layers can outperform Transformers. These findings have opened up new possibilities for using diverse architectures. In this context of exploration into various models, the architectural modeling of time series forecasting has now entered a renaissance. This survey not only provides a historical context for time series forecasting but also offers a comprehensive and timely analysis of the movement toward architectural diversification. By comparing and re-examining various deep learning models, we uncover new perspectives and present the latest trends in time series forecasting, including the emergence of hybrid models, diffusion models, Mamba models, and foundation models. By focusing on the inherent characteristics of time series data, we also address open challenges that have gained attention in time series forecasting, such as channel dependency, distribution shift, causality, and feature extraction. This survey explores vital elements that can enhance forecasting performance through diverse approaches. These contributions lower the entry barriers for newcomers to the field of time series forecasting, while also offering seasoned researchers broad perspectives, new opportunities, and deep insights.
[AI-171] FedCVD: The First Real-World Federated Learning Benchmark on Cardiovascular Disease Data
链接: https://arxiv.org/abs/2411.07050
作者: Yukun Zhang,Guanzhong Chen,Zenglin Xu,Jianyong Wang,Dun Zeng,Junfan Li,Jinghua Wang,Yuan Qi,Irwin King
关键词-EN: death worldwide, highlighting the critical, diagnosis and treatment, diagnose CVDs early, early diagnosis
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 10 pages, 4 figures
点击查看摘要
Abstract:Cardiovascular diseases (CVDs) are currently the leading cause of death worldwide, highlighting the critical need for early diagnosis and treatment. Machine learning (ML) methods can help diagnose CVDs early, but their performance relies on access to substantial data with high quality. However, the sensitive nature of healthcare data often restricts individual clinical institutions from sharing data to train sufficiently generalized and unbiased ML models. Federated Learning (FL) is an emerging approach, which offers a promising solution by enabling collaborative model training across multiple participants without compromising the privacy of the individual data owners. However, to the best of our knowledge, there has been limited prior research applying FL to the cardiovascular disease domain. Moreover, existing FL benchmarks and datasets are typically simulated and may fall short of replicating the complexity of natural heterogeneity found in realistic datasets that challenges current FL algorithms. To address these gaps, this paper presents the first real-world FL benchmark for cardiovascular disease detection, named FedCVD. This benchmark comprises two major tasks: electrocardiogram (ECG) classification and echocardiogram (ECHO) segmentation, based on naturally scattered datasets constructed from the CVD data of seven institutions. Our extensive experiments on these datasets reveal that FL faces new challenges with real-world non-IID and long-tail data. The code and datasets of FedCVD are available at this https URL.
[AI-172] DiffSR: Learning Radar Reflectivity Synthesis via Diffusion Model from Satellite Observations
链接: https://arxiv.org/abs/2411.06714
作者: Xuming He,Zhiwang Zhou,Wenlong Zhang,Xiangyu Zhao,Hao Chen,Shiqi Chen,Lei Bai
关键词-EN: synthesis can fill, radar data synthesis, radar, data, Weather radar data
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Weather radar data synthesis can fill in data for areas where ground observations are missing. Existing methods often employ reconstruction-based approaches with MSE loss to reconstruct radar data from satellite observation. However, such methods lead to over-smoothing, which hinders the generation of high-frequency details or high-value observation areas associated with convective weather. To address this issue, we propose a two-stage diffusion-based method called DiffSR. We first pre-train a reconstruction model on global-scale data to obtain radar estimation and then synthesize radar reflectivity by combining radar estimation results with satellite data as conditions for the diffusion model. Extensive experiments show that our method achieves state-of-the-art (SOTA) results, demonstrating the ability to generate high-frequency details and high-value areas.
[AI-173] Enhancing frozen histological section images using permanent-section-guided deep learning with nuclei attention
链接: https://arxiv.org/abs/2411.06583
作者: Elad Yoshai,Gil Goldinger,Miki Haifler,Natan T. Shaked
关键词-EN: diagnosis during surgeries, produced within minutes, rapid diagnosis, Permanent sections, sections
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
*备注:
点击查看摘要
Abstract:In histological pathology, frozen sections are often used for rapid diagnosis during surgeries, as they can be produced within minutes. However, they suffer from artifacts and often lack crucial diagnostic details, particularly within the cell nuclei region. Permanent sections, on the other hand, contain more diagnostic detail but require a time-intensive preparation process. Here, we present a generative deep learning approach to enhance frozen section images by leveraging guidance from permanent sections. Our method places a strong emphasis on the nuclei region, which contains critical information in both frozen and permanent sections. Importantly, our approach avoids generating artificial data in blank regions, ensuring that the network only enhances existing features without introducing potentially unreliable information. We achieve this through a segmented attention network, incorporating nuclei-segmented images during training and adding an additional loss function to refine the nuclei details in the generated permanent images. We validated our method across various tissues, including kidney, breast, and colon. This approach significantly improves histological efficiency and diagnostic accuracy, enhancing frozen section images within seconds, and seamlessly integrating into existing laboratory workflows.
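The abstract mentions an additional loss that refines nuclei detail using segmentation masks, but does not spell out its form. A plausible minimal sketch is a mask-weighted reconstruction loss; the L1 base loss and the weighting factor `alpha` below are assumptions for illustration, not the paper's exact objective.

```python
"""Hedged sketch: a nuclei-weighted reconstruction loss of the kind the
abstract describes. The weighting scheme and `alpha` are assumptions."""
import torch
import torch.nn.functional as F

def nuclei_attention_loss(pred, target, nuclei_mask, alpha=4.0):
    """L1 reconstruction loss with extra weight inside segmented nuclei.

    pred, target: (B, C, H, W) generated and permanent-section images.
    nuclei_mask:  (B, 1, H, W) binary nuclei segmentation.
    """
    per_pixel = F.l1_loss(pred, target, reduction="none")
    weights = 1.0 + alpha * nuclei_mask          # emphasize the nuclei region
    return (per_pixel * weights).mean()

# usage sketch with random tensors
pred = torch.rand(2, 3, 64, 64)
target = torch.rand(2, 3, 64, 64)
mask = (torch.rand(2, 1, 64, 64) > 0.8).float()
print(nuclei_attention_loss(pred, target, mask))
```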
[AI-174] Assessing Foundational Medical Segment Anything (Med-SAM1 Med-SAM2) Deep Learning Models for Left Atrial Segmentation in 3D LGE MRI
链接: https://arxiv.org/abs/2411.05963
作者: Mehri Mehrnia,Mohamed Elbayumi,Mohammed S. M. Elbaz
关键词-EN: common cardiac arrhythmia, Atrial fibrillation, failure and stroke, common cardiac, heart failure
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Atrial fibrillation (AF), the most common cardiac arrhythmia, is associated with heart failure and stroke. Accurate segmentation of the left atrium (LA) in 3D late gadolinium-enhanced (LGE) MRI is helpful for evaluating AF, as fibrotic remodeling in the LA myocardium contributes to arrhythmia and serves as a key determinant of therapeutic strategies. However, manual LA segmentation is labor-intensive and challenging. Recent foundational deep learning models, such as the Segment Anything Model (SAM), pre-trained on diverse datasets, have demonstrated promise in generic segmentation tasks. MedSAM, a fine-tuned version of SAM for medical applications, enables efficient, zero-shot segmentation without domain-specific training. Despite the potential of the MedSAM model, it has not yet been evaluated for the complex task of LA segmentation in 3D LGE-MRI. This study aims to (1) evaluate the performance of MedSAM in automating LA segmentation, (2) compare the performance of the MedSAM2 model, which uses a single prompt with automated tracking, with the MedSAM1 model, which requires a separate prompt for each slice, and (3) analyze the performance of MedSAM1 in terms of Dice score (i.e., segmentation accuracy) by varying the size and location of the box prompt.
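For reference, the Dice score used here as the segmentation-accuracy metric has a standard definition; the snippet below is that textbook formula, not code from the study.

```python
"""Standard Dice score for binary segmentation masks."""
import numpy as np

def dice_score(pred, gt, eps=1e-8):
    """Dice = 2|A ∩ B| / (|A| + |B|) for binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)

pred = np.zeros((4, 4), dtype=np.uint8); pred[1:3, 1:3] = 1
gt = np.zeros((4, 4), dtype=np.uint8);   gt[1:4, 1:4] = 1
print(round(dice_score(pred, gt), 3))  # 2*4/(4+9) ≈ 0.615
```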
[AI-175] From Electrode to Global Brain: Integrating Multi- and Cross-Scale Brain Connections and Interactions Under Cross-Subject and Within-Subject Scenarios
链接: https://arxiv.org/abs/2411.05862
作者: Chen Zhige,Qin Chengxuan
关键词-EN: electroencephalogram signals pose, signals pose great, pose great challenges, multi-scale spatial data, spatial data distribution
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:The individual variabilities of electroencephalogram signals pose great challenges to cross-subject motor imagery (MI) classification, especially for the data-scarce single-source to single-target (STS) scenario. The multi-scale spatial data distribution differences cannot be fully eliminated in MI experiments, for the topological structure and connections are inherent properties of the human brain. Overall, no prior literature investigates the multi-scale spatial data distribution problem in the STS cross-subject MI classification task, in either intra-subject or inter-subject scenarios. In this paper, a novel multi-scale spatial domain adaptation network (MSSDAN), consisting of both a multi-scale spatial feature extractor (MSSFE) and a deep domain adaptation method called multi-scale spatial domain adaptation (MSSDA), is proposed and verified; our goal is to integrate the principles of multi-scale brain topological structures in order to solve the multi-scale spatial data distribution difference problem.
[AI-176] Input-Driven Dynamics for Robust Memory Retrieval in Hopfield Networks
链接: https://arxiv.org/abs/2411.05849
作者: Simone Betteti,Giacomo Baggio,Francesco Bullo,Sandro Zampieri
关键词-EN: human brain, mathematically idealized, idealized yet insightful, memory retrieval, Hopfield model
类目: Neurons and Cognition (q-bio.NC); Statistical Mechanics (cond-mat.stat-mech); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Dynamical Systems (math.DS)
*备注: 24 pages, 15 figures
点击查看摘要
Abstract:The Hopfield model provides a mathematically idealized yet insightful framework for understanding the mechanisms of memory storage and retrieval in the human brain. This model has inspired four decades of extensive research on learning and retrieval dynamics, capacity estimates, and sequential transitions among memories. Notably, the role and impact of external inputs has been largely underexplored, from their effects on neural dynamics to how they facilitate effective memory retrieval. To bridge this gap, we propose a novel dynamical system framework in which the external input directly influences the neural synapses and shapes the energy landscape of the Hopfield model. This plasticity-based mechanism provides a clear energetic interpretation of the memory retrieval process and proves effective at correctly classifying highly mixed inputs. Furthermore, we integrate this model within the framework of modern Hopfield architectures, using this connection to elucidate how current and past information are combined during the retrieval process. Finally, we embed both the classic and the new model in an environment disrupted by noise and compare their robustness during memory retrieval.
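To make the idea concrete: in the classic Hopfield model, patterns are stored Hebbianly and retrieval descends an energy landscape. The sketch below adds an input-dependent perturbation of the synapses in the spirit of the abstract; the specific `beta * outer(u, u)` plasticity term is an illustrative assumption, not the paper's exact rule.

```python
"""Hedged sketch of an input-driven Hopfield update. The classic model is
standard; the input term that perturbs the synapses is an assumption."""
import numpy as np

rng = np.random.default_rng(0)
N, P = 64, 3
patterns = rng.choice([-1, 1], size=(P, N))
W = (patterns.T @ patterns) / N          # Hebbian storage
np.fill_diagonal(W, 0)

def retrieve(u, beta=0.5, steps=500):
    """Asynchronous retrieval with input-modulated synapses."""
    W_eff = W + beta * np.outer(u, u) / N   # input reshapes the landscape
    s = u.copy()
    for _ in range(steps):
        i = rng.integers(N)
        s[i] = 1 if W_eff[i] @ s >= 0 else -1
    return s

def energy(s, W_):
    """Classic Hopfield energy E(s) = -1/2 s^T W s."""
    return -0.5 * s @ W_ @ s

# probe with a noisy version of pattern 0
probe = patterns[0].copy()
probe[rng.choice(N, size=10, replace=False)] *= -1
out = retrieve(probe)
print("overlap with stored pattern:", (out @ patterns[0]) / N)
```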
[AI-177] Utilizing RNN for Real-time Cryptocurrency Price Prediction and Trading Strategy Optimization
链接: https://arxiv.org/abs/2411.05829
作者: Shamima Nasrin Tumpa,Kehelwala Dewage Gayan Maduranga
关键词-EN: Recurrent Neural Networks, Neural Networks, Recurrent Neural, real-time cryptocurrency price, optimized trading strategies
类目: Statistical Finance (q-fin.ST); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 10 pages, 16 figures, 1 table
点击查看摘要
Abstract:This study explores the use of Recurrent Neural Networks (RNN) for real-time cryptocurrency price prediction and optimized trading strategies. Given the high volatility of the cryptocurrency market, traditional forecasting models often fall short. By leveraging RNNs’ capability to capture long-term patterns in time-series data, this research aims to improve accuracy in price prediction and develop effective trading strategies. The project follows a structured approach involving data collection, preprocessing, and model refinement, followed by rigorous backtesting for profitability and risk assessment. This work contributes to both the academic and practical fields by providing a robust predictive model and optimized trading strategies that address the challenges of cryptocurrency trading.
[AI-178] SurfGNN: A robust surface-based prediction model with interpretability for coactivation maps of spatial and cortical features
链接: https://arxiv.org/abs/2411.05825
作者: Zhuoshuo Li(1),Jiong Zhang(2),Youbing Zeng(1),Jiaying Lin(1),Dan Zhang(3),Jianjia Zhang(1),Duan Xu(4),Hosung Kim(5),Bingguang Liu(6),Mengting Liu(1) ((1) Department of Biomedical Engineering, Sun Yat-sen University, Shenzhen, China, (2) Institute of Biomedical Engineering, Ningbo Institute of Materials Technology and Engineering, Chinese Academy of Sciences, Ningbo, China, (3) School of Cyber Science and Engineering, Ningbo University of Technology, Ningbo, China, (4) Department of Radiology, School of Medicine, University of California San Francisco, San Francisco, CA, USA, (5) USC Mark and Mary Stevens Neuroimaging and Informatics Institute, Keck School of Medicine of USC, University of Southern California, Los Angeles, CA, USA, (6) Department of Radiology, Shenzhen Maternity and Child Healthcare Hospital, Shenzhen, China)
关键词-EN: Current brain surface-based, graph neural networks, Current brain, cortical feature level, overlook the variability
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 15 pages, 6 figures
点击查看摘要
Abstract:Current brain surface-based prediction models often overlook the variability of regional attributes at the cortical feature level. While graph neural networks (GNNs) excel at capturing regional differences, they encounter challenges when dealing with complex, high-density graph structures. In this work, we consider the cortical surface mesh as a sparse graph and propose an interpretable prediction model-Surface Graph Neural Network (SurfGNN). SurfGNN employs topology-sampling learning (TSL) and region-specific learning (RSL) structures to manage individual cortical features at both lower and higher scales of the surface mesh, effectively tackling the challenges posed by the overly abundant mesh nodes and addressing the issue of heterogeneity in cortical regions. Building on this, a novel score-weighted fusion (SWF) method is implemented to merge nodal representations associated with each cortical feature for prediction. We apply our model to a neonatal brain age prediction task using a dataset of harmonized MR images from 481 subjects (503 scans). SurfGNN outperforms all existing state-of-the-art methods, demonstrating an improvement of at least 9.0% and achieving a mean absolute error (MAE) of 0.827±0.056 in postmenstrual weeks. Furthermore, it generates feature-level activation maps, indicating its capability to identify robust regional variations in different morphometric contributions for prediction.
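The score-weighted fusion (SWF) step can be pictured as a learned, softmax-normalized weighting of the per-feature nodal representations. The sketch below uses a single linear score head, which is an assumption; the paper's actual scoring network is not described in this abstract.

```python
"""Hedged sketch of score-weighted fusion (SWF): per-feature representations
are merged with learned, softmax-normalized scores."""
import torch
import torch.nn as nn

class ScoreWeightedFusion(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)        # one scalar score per branch

    def forward(self, branch_reprs):          # list of (B, dim) tensors
        h = torch.stack(branch_reprs, dim=1)  # (B, F, dim)
        w = torch.softmax(self.score(h), dim=1)
        return (w * h).sum(dim=1)             # (B, dim) fused representation

fusion = ScoreWeightedFusion(dim=32)
out = fusion([torch.randn(8, 32) for _ in range(4)])
print(out.shape)  # torch.Size([8, 32])
```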
[AI-179] Neurophysiological Analysis in Motor and Sensory Cortices for Improving Motor Imagination
链接: https://arxiv.org/abs/2411.05811
作者: Si-Hyun Kim,Sung-Jin Kim,Dae-Hyeok Lee
关键词-EN: enables direct communication, offering potential solutions, Brain-computer interface, sensorimotor cortex, decoding neural signals
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
*备注: 4 pages, 3 figures, 1 table, Name of Conference: International Winter Conference on Brain-Computer Interface
点击查看摘要
Abstract:Brain-computer interface (BCI) enables direct communication between the brain and external devices by decoding neural signals, offering potential solutions for individuals with motor impairments. This study explores the neural signatures of motor execution (ME) and motor imagery (MI) tasks using EEG signals, focusing on four conditions categorized as sense-related (hot and cold) and motor-related (pull and push) conditions. We conducted scalp topography analysis to examine activation patterns in the sensorimotor cortex, revealing distinct regional differences: sense-related conditions primarily activated the posterior region of the sensorimotor cortex, while motor-related conditions activated the anterior region of the sensorimotor cortex. These spatial distinctions align with neurophysiological principles, suggesting condition-specific functional subdivisions within the sensorimotor cortex. We further evaluated the performances of three neural network models (EEGNet, ShallowConvNet, and DeepConvNet), demonstrating that ME tasks achieved higher classification accuracies compared to MI tasks. Specifically, in sense-related conditions, the highest accuracy was observed in the cold condition. In motor-related conditions, the pull condition showed the highest performance, with DeepConvNet yielding the highest results. These findings provide insights into optimizing BCI applications by leveraging specific condition-induced neural activations.
[AI-180] Is it me or is A larger than B: Uncovering the determinants of relational cognitive dissonance resolution
链接: https://arxiv.org/abs/2411.05809
作者: Tomer Barak,Yonatan Loewenstein
关键词-EN: computational mechanisms underlying, explores the computational, computational mechanisms, mechanisms underlying, underlying the resolution
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:This study explores the computational mechanisms underlying the resolution of cognitive dissonances. We focus on scenarios in which an observation violates the expected relationship between objects. For instance, an agent expects object A to be smaller than B in some feature space but observes the opposite. One solution is to adjust the expected relationship according to the new observation and change the expectation to A being larger than B. An alternative solution would be to adapt the representation of A and B in the feature space such that in the new representation, the relationship that A is smaller than B is maintained. While both pathways resolve the dissonance, they generalize differently to different tasks. Using Artificial Neural Networks (ANNs) capable of relational learning, we demonstrate the existence of these two pathways and show that the chosen pathway depends on the dissonance magnitude. Large dissonances alter the representation of the objects, while small dissonances lead to adjustments in the expected relationships. We show that this effect arises from the inherently different learning dynamics of relationships and representations and study the implications.
[AI-181] Do LLM Personas Dream of Bull Markets? Comparing Human and AI Investment Strategies Through the Lens of the Five-Factor Model
链接: https://arxiv.org/abs/2411.05801
作者: Harris Borman,Anna Leontjeva,Luiz Pizzato,Max Kun Jiang,Dan Jermyn
关键词-EN: Large Language Models, Language Models, Large Language, human-like manner, demonstrated the ability
类目: atistical Finance (q-fin.ST); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); General Finance (q-fin.GN)
*备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have demonstrated the ability to adopt a personality and behave in a human-like manner. There is a large body of research that investigates the behavioural impacts of personality in less obvious areas such as investment attitudes or creative decision making. In this study, we investigated whether an LLM persona with a specific Big Five personality profile would perform an investment task similarly to a human with the same personality traits. We used a simulated investment task to determine if these results could be generalised into actual behaviours. In this simulated environment, our results show these personas produced meaningful behavioural differences in all assessed categories, with these behaviours generally being consistent with expectations derived from human research. We found that LLMs are able to generalise traits into expected behaviours in three areas: learning style, impulsivity and risk appetite while environmental attitudes could not be accurately represented. In addition, we showed that LLMs produce behaviour that is more reflective of human behaviour in a simulation environment compared to a survey environment.
计算机视觉
[CV-0] Watermark Anything with Localized Messages KR
链接: https://arxiv.org/abs/2411.07231
作者: Tom Sander,Pierre Fernandez,Alain Durmus,Teddy Furon,Matthijs Douze
关键词-EN: tailored to handle, Image, WAM, handle small watermarked, Image watermarking
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
*备注: Under review. Code at this https URL
点击查看摘要
Abstract:Image watermarking methods are not tailored to handle small watermarked areas. This restricts applications in real-world scenarios where parts of the image may come from different sources or have been edited. We introduce a deep-learning model for localized image watermarking, dubbed the Watermark Anything Model (WAM). The WAM embedder imperceptibly modifies the input image, while the extractor segments the received image into watermarked and non-watermarked areas and recovers one or several hidden messages from the areas found to be watermarked. The models are jointly trained at low resolution and without perceptual constraints, then post-trained for imperceptibility and multiple watermarks. Experiments show that WAM is competitive with state-of-the-art methods in terms of imperceptibility and robustness, especially against inpainting and splicing, even on high-resolution images. Moreover, it offers new capabilities: WAM can locate watermarked areas in spliced images and extract distinct 32-bit messages with less than 1 bit error from multiple small regions (no larger than 10% of the image surface), even for small 256×256 images.
[CV-1] Learning from Limited and Imperfect Data
链接: https://arxiv.org/abs/2411.07229
作者: Harsh Rangwani
关键词-EN: Neural Network training, Deep Neural Network, Deep Neural, manually balanced, Network training
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ICVGIP’24 Young Researcher Symposium Abstract
点击查看摘要
Abstract:The datasets used for Deep Neural Network training (e.g., ImageNet, MSCOCO, etc.) are often manually balanced across categories (classes) to facilitate learning of all the categories. This curation process is often expensive and requires throwing away precious annotated data to balance the frequency across classes. This is because the distribution of data in the world (e.g., internet, etc.) significantly differs from the well-curated datasets and is often over-populated with samples from common categories. The algorithms designed for well-curated datasets perform suboptimally when used to learn from imperfect datasets with long-tailed imbalances and distribution shifts. For deep models to be widely used, it is necessary to move away from the costly curation process by developing robust algorithms that can learn from real-world data distributions. Toward this goal, we develop practical algorithms for Deep Neural Networks that can learn from limited and imperfect data present in the real world. These works are divided into four segments, each covering a scenario of learning from limited or imperfect data. The first part of the works focuses on Learning Generative Models for Long-Tail Data, where we mitigate the mode-collapse for tail (minority) classes and enable diverse aesthetic image generations as head (majority) classes. In the second part, we enable effective generalization on tail classes through Inductive Regularization schemes, which allow tail classes to generalize as the head classes without enforcing explicit generation of images. In the third part, we develop algorithms for Optimizing Relevant Metrics compared to the average accuracy for learning from long-tailed data with limited annotation (semi-supervised), followed by the fourth part, which focuses on the effective domain adaptation of the model to various domains with zero to very few labeled samples.
[CV-2] DLCR: A Generative Data Expansion Framework via Diffusion for Clothes-Changing Person Re-ID WACV2025
链接: https://arxiv.org/abs/2411.07205
作者: Nyle Siddiqui,Florinel Alin Croitoru,Gaurav Kumar Nayak,Radu Tudor Ionescu,Mubarak Shah
关键词-EN: recent exhibited strength, open research question, recent exhibited, exhibited strength, open research
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Published in WACV 2025
点击查看摘要
Abstract:With the recent exhibited strength of generative diffusion models, an open research question is if images generated by these models can be used to learn better visual representations. While this generative data expansion may suffice for easier visual tasks, we explore its efficacy on a more difficult discriminative task: clothes-changing person re-identification (CC-ReID). CC-ReID aims to match people appearing in non-overlapping cameras, even when they change their clothes across cameras. Not only are current CC-ReID models constrained by the limited diversity of clothing in current CC-ReID datasets, but generating additional data that retains important personal features for accurate identification is a current challenge. To address this issue we propose DLCR, a novel data expansion framework that leverages pre-trained diffusion and large language models (LLMs) to accurately generate diverse images of individuals in varied attire. We generate additional data for five benchmark CC-ReID datasets (PRCC, CCVID, LaST, VC-Clothes, and LTCC) and increase their clothing diversity by 10×, totaling over 2.1M images generated. DLCR employs diffusion-based text-guided inpainting, conditioned on clothing prompts constructed using LLMs, to generate synthetic data that only modifies a subject’s clothes while preserving their personally identifiable features. With this massive increase in data, we introduce two novel strategies - progressive learning and test-time prediction refinement - that respectively reduce training time and further boost CC-ReID performance. On the PRCC dataset, we obtain a large top-1 accuracy improvement of 11.3% by training CAL, a previous state of the art (SOTA) method, with DLCR-generated data. We publicly release our code and generated data for each dataset here: this https URL.
[CV-3] SAMPart3D: Segment Any Part in 3D Objects
链接: https://arxiv.org/abs/2411.07184
作者: Yunhan Yang,Yukun Huang,Yuan-Chen Guo,Liangjun Lu,Xiaoyang Wu,Edmund Y. Lam,Yan-Pei Cao,Xihui Liu
关键词-EN: part segmentation, Vision Language Models, playing a vital, part, powerful Vision Language
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project Page: this https URL
点击查看摘要
Abstract:3D part segmentation is a crucial and challenging task in 3D perception, playing a vital role in applications such as robotics, 3D generation, and 3D editing. Recent methods harness the powerful Vision Language Models (VLMs) for 2D-to-3D knowledge distillation, achieving zero-shot 3D part segmentation. However, these methods are limited by their reliance on text prompts, which restricts the scalability to large-scale unlabeled datasets and the flexibility in handling part ambiguities. In this work, we introduce SAMPart3D, a scalable zero-shot 3D part segmentation framework that segments any 3D object into semantic parts at multiple granularities, without requiring predefined part label sets as text prompts. For scalability, we use text-agnostic vision foundation models to distill a 3D feature extraction backbone, allowing scaling to large unlabeled 3D datasets to learn rich 3D priors. For flexibility, we distill scale-conditioned part-aware 3D features for 3D part segmentation at multiple granularities. Once the segmented parts are obtained from the scale-conditioned part-aware 3D features, we use VLMs to assign semantic labels to each part based on the multi-view renderings. Compared to previous methods, our SAMPart3D can scale to the recent large-scale 3D object dataset Objaverse and handle complex, non-ordinary objects. Additionally, we contribute a new 3D part segmentation benchmark to address the lack of diversity and complexity of objects and parts in existing benchmarks. Experiments show that our SAMPart3D significantly outperforms existing zero-shot 3D part segmentation methods, and can facilitate various applications such as part-level editing and interactive segmentation.
[CV-4] Cascaded Dual Vision Transformer for Accurate Facial Landmark Detection WACV2025
链接: https://arxiv.org/abs/2411.07167
作者: Ziqiang Dang,Jianfang Li,Lin Liu
关键词-EN: Dual Vision Transformer, Facial landmark detection, Vision Transformer, Long Skip Connections, Dual Vision
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by WACV 2025. Supplementary material is included at the end of the main paper (3 pages, 5 figures, 2 tables)
点击查看摘要
Abstract:Facial landmark detection is a fundamental problem in computer vision for many downstream applications. This paper introduces a new facial landmark detector based on vision transformers, which consists of two unique designs: Dual Vision Transformer (D-ViT) and Long Skip Connections (LSC). Based on the observation that the channel dimension of feature maps essentially represents the linear bases of the heatmap space, we propose learning the interconnections between these linear bases to model the inherent geometric relations among landmarks via Channel-split ViT. We integrate such channel-split ViT into the standard vision transformer (i.e., spatial-split ViT), forming our Dual Vision Transformer to constitute the prediction blocks. We also suggest using long skip connections to deliver low-level image features to all prediction blocks, thereby preventing useful information from being discarded by intermediate supervision. Extensive experiments are conducted to evaluate the performance of our proposal on the widely used benchmarks, i.e., WFLW, COFW, and 300W, demonstrating that our model outperforms the previous SOTAs across all three benchmarks.
[CV-5] Lost in Tracking Translation: A Comprehensive Analysis of Visual SLAM in Human-Centered XR and IoT Ecosystems
链接: https://arxiv.org/abs/2411.07146
作者: Yasra Chandio,Khotso Selialia,Joseph DeGol,Luis Garcia,Fatima M. Anwar
关键词-EN: enhancing augmented reality, augmented reality experiences, empowered nascent applications, experiences for users, tracking
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Advancements in tracking algorithms have empowered nascent applications across various domains, from steering autonomous vehicles to guiding robots to enhancing augmented reality experiences for users. However, these algorithms are application-specific and do not work across applications with different types of motion; even a tracking algorithm designed for a given application does not work in scenarios deviating from highly standard conditions. For example, a tracking algorithm designed for robot navigation inside a building will not work for tracking the same robot in an outdoor environment. To demonstrate this problem, we evaluate the performance of the state-of-the-art tracking methods across various applications and scenarios. To inform our analysis, we first categorize algorithmic, environmental, and locomotion-related challenges faced by tracking algorithms. We quantitatively evaluate the performance using multiple tracking algorithms and representative datasets for a wide range of Internet of Things (IoT) and Extended Reality (XR) applications, including autonomous vehicles, drones, and humans. Our analysis shows that no tracking algorithm works across different applications and scenarios within applications. Ultimately, using the insights generated from our analysis, we discuss multiple approaches to improving the tracking performance using input data characterization, leveraging intermediate information, and output evaluation.
[CV-6] Nuremberg Letterbooks: A Multi-Transcriptional Dataset of Early 15th Century Manuscripts for Document Analysis
链接: https://arxiv.org/abs/2411.07138
作者: Martin Mayr,Julian Krenz,Katharina Neumeier,Anna Bub,Simon Bürcky,Nina Brolich,Klaus Herbers,Mechthild Habermann,Peter Fleischmann,Andreas Maier,Vincent Christlein
关键词-EN: highly standardized labels, analysis utilize highly, utilize highly standardized, document analysis utilize, simplifying specific tasks
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Most datasets in the field of document analysis utilize highly standardized labels, which, while simplifying specific tasks, often produce outputs that are not directly applicable to humanities research. In contrast, the Nuremberg Letterbooks dataset, which comprises historical documents from the early 15th century, addresses this gap by providing multiple types of transcriptions and accompanying metadata. This approach allows for developing methods that are more closely aligned with the needs of the humanities. The dataset includes 4 books containing 1711 labeled pages written by 10 scribes. Three types of transcriptions are provided for handwritten text recognition: basic, diplomatic, and regularized. For the latter two, versions with and without expanded abbreviations are also available. A combination of letter ID and writer ID supports writer identification due to changing writers within pages. In the technical validation, we established baselines for various tasks, demonstrating data consistency and providing benchmarks for future research to build upon.
[CV-7] Edify Image: High-Quality Image Generation with Pixel Space Laplacian Diffusion Models
链接: https://arxiv.org/abs/2411.07126
作者: NVIDIA: Yuval Atzmon,Maciej Bala,Yogesh Balaji,Tiffany Cai,Yin Cui,Jiaojiao Fan,Yunhao Ge,Siddharth Gururani,Jacob Huffman,Ronald Isaac,Pooya Jannaty,Tero Karras,Grace Lam,J. P. Lewis,Aaron Licata,Yen-Chen Lin,Ming-Yu Liu,Qianli Ma,Arun Mallya,Ashlee Martino-Tarr,Doug Mendez,Seungjun Nah,Chris Pruett,Fitsum Reda,Jiaming Song,Ting-Chun Wang,Fangyin Wei,Xiaohui Zeng,Yu Zeng,Qinsheng Zhang
关键词-EN: generating photorealistic image, photorealistic image content, introduce Edify Image, diffusion models capable, Edify Image
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We introduce Edify Image, a family of diffusion models capable of generating photorealistic image content with pixel-perfect accuracy. Edify Image utilizes cascaded pixel-space diffusion models trained using a novel Laplacian diffusion process, in which image signals at different frequency bands are attenuated at varying rates. Edify Image supports a wide range of applications, including text-to-image synthesis, 4K upsampling, ControlNets, 360 HDR panorama generation, and finetuning for image customization.
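The core intuition of a Laplacian diffusion process, attenuating different frequency bands at different rates, can be illustrated with an ordinary Laplacian pyramid. The per-band decay rates below are made up for illustration; Edify Image's actual noise schedule is not given in the abstract.

```python
"""Hedged sketch: decompose an image into frequency bands with a Laplacian
pyramid and attenuate each band at its own rate. Rates are illustrative."""
import numpy as np
from scipy.ndimage import zoom, gaussian_filter

def laplacian_pyramid(img, levels=3):
    bands, cur = [], img
    for _ in range(levels):
        low = gaussian_filter(cur, sigma=1.0)
        small = low[::2, ::2]
        up = zoom(small, 2, order=1)[: cur.shape[0], : cur.shape[1]]
        bands.append(cur - up)      # high-frequency residual at this scale
        cur = small
    bands.append(cur)               # coarsest low-pass band
    return bands

def attenuate(bands, t, rates=(3.0, 2.0, 1.0, 0.5)):
    """Each band decays as exp(-rate * t); finer bands decay faster."""
    return [b * np.exp(-r * t) for b, r in zip(bands, rates)]

def reconstruct(bands):
    cur = bands[-1]
    for b in reversed(bands[:-1]):
        cur = zoom(cur, 2, order=1)[: b.shape[0], : b.shape[1]] + b
    return cur

img = np.random.rand(64, 64)
attenuated = reconstruct(attenuate(laplacian_pyramid(img), t=0.5))
print(attenuated.shape)  # (64, 64)
```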
[CV-8] Decoding Visual Experience and Mapping Semantics through Whole-Brain Analysis Using fMRI Foundation Models
链接: https://arxiv.org/abs/2411.07121
作者: Yanchen Wang,Adam Turnbull,Tiange Xiang,Yunlong Xu,Sa Zhou,Adnan Masoud,Shekoofeh Azizi,Feng Vankee Lin,Ehsan Adeli
关键词-EN: Magnetic Resonance Imaging, brain activity corresponds, functional Magnetic Resonance, visual cortex, Neural decoding
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Neural decoding, the process of understanding how brain activity corresponds to different stimuli, has been a primary objective in cognitive sciences. Over the past three decades, advancements in functional Magnetic Resonance Imaging and machine learning have greatly improved our ability to map visual stimuli to brain activity, especially in the visual cortex. Concurrently, research has expanded into decoding more complex processes like language and memory across the whole brain, utilizing techniques to handle greater variability and improve signal accuracy. We argue that “seeing” involves more than just mapping visual stimuli onto the visual cortex; it engages the entire brain, as various emotions and cognitive states can emerge from observing different scenes. In this paper, we develop algorithms to enhance our understanding of visual processes by incorporating whole-brain activation maps while individuals are exposed to visual stimuli. We utilize large-scale fMRI encoders and Image generative models pre-trained on large public datasets, which are then fine-tuned through Image-fMRI contrastive learning. Our models hence can decode visual experience across the entire cerebral cortex, surpassing the traditional confines of the visual cortex. We first compare our method with state-of-the-art approaches to decoding visual processing and show improved predictive semantic accuracy by 43%. A network ablation analysis suggests that beyond the visual cortex, the default mode network contributes most to decoding stimuli, in line with the proposed role of this network in sense-making and semantic processing. Additionally, we implemented zero-shot imagination decoding on an extra validation dataset, achieving a p-value of 0.0206 for mapping the reconstructed images and ground-truth text stimuli, which substantiates the model’s capability to capture semantic meanings across various scenarios.
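The "Image-fMRI contrastive learning" step presumably follows the CLIP recipe; the snippet below is the standard symmetric InfoNCE objective over paired embeddings, with placeholder dimensions rather than the paper's actual encoders.

```python
"""Standard CLIP-style contrastive objective over paired fMRI/image
embeddings; encoders and dimensions are placeholders."""
import torch
import torch.nn.functional as F

def info_nce(fmri_emb, img_emb, temperature=0.07):
    """Symmetric InfoNCE: matching pairs lie on the diagonal."""
    f = F.normalize(fmri_emb, dim=-1)
    v = F.normalize(img_emb, dim=-1)
    logits = f @ v.t() / temperature          # (B, B) similarity matrix
    labels = torch.arange(f.size(0))
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.t(), labels))

loss = info_nce(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```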
[CV-9] ConvMixFormer- A Resource-efficient Convolution Mixer for Transformer-based Dynamic Hand Gesture Recognition
链接: https://arxiv.org/abs/2411.07118
作者: Mallika Garg,Debashis Ghosh,Pyari Mohan Pradhan
关键词-EN: demonstrated remarkable success, natural language processing, computer vision, demonstrated remarkable, remarkable success
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Transformer models have demonstrated remarkable success in many domains such as natural language processing (NLP) and computer vision. With the growing interest in transformer-based architectures, they are now utilized for gesture recognition. So, we also explore and devise a novel ConvMixFormer architecture for dynamic hand gestures. The transformers use quadratic scaling of the attention features with the sequential data, due to which these models are computationally complex and heavy. We have considered this drawback of the transformer and designed a resource-efficient model that replaces the self-attention in the transformer with the simple convolutional layer-based token mixer. The computational cost and the parameters used for the convolution-based mixer are comparatively less than the quadratic self-attention. Convolution-mixer helps the model capture the local spatial features that self-attention struggles to capture due to their sequential processing nature. Further, an efficient gate mechanism is employed instead of a conventional feed-forward network in the transformer to help the model control the flow of features within different stages of the proposed model. This design uses fewer learnable parameters, nearly half that of the vanilla transformer, which enables fast and efficient training. The proposed method is evaluated on NVidia Dynamic Hand Gesture and Briareo datasets and our model has achieved state-of-the-art results on single and multimodal inputs. We have also shown the parameter efficiency of the proposed ConvMixFormer model compared to other methods. The source code is available at this https URL.
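A rough sketch of the block the abstract describes, with self-attention replaced by a depthwise convolutional token mixer and the feed-forward network replaced by a gated projection, is shown below. The kernel size, normalization placement, and gate design are assumptions, not the published architecture.

```python
"""Hedged sketch of a convolutional token mixer with a gated channel MLP,
approximating the block the abstract describes."""
import torch
import torch.nn as nn

class ConvMixerBlock(nn.Module):
    def __init__(self, dim, kernel_size=7):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # depthwise conv mixes tokens along the sequence axis (no attention)
        self.token_mix = nn.Conv1d(dim, dim, kernel_size,
                                   padding=kernel_size // 2, groups=dim)
        self.norm2 = nn.LayerNorm(dim)
        self.value = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)   # gate controls feature flow

    def forward(self, x):                 # x: (B, N, C)
        h = self.norm1(x)
        h = self.token_mix(h.transpose(1, 2)).transpose(1, 2)
        x = x + h                         # residual token mixing
        h = self.norm2(x)
        x = x + self.value(h) * torch.sigmoid(self.gate(h))
        return x

blk = ConvMixerBlock(dim=64)
print(blk(torch.randn(2, 49, 64)).shape)  # torch.Size([2, 49, 64])
```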
[CV-10] Arctique: An artificial histopathological dataset unifying realism and controllability for uncertainty quantification
链接: https://arxiv.org/abs/2411.07097
作者: Jannik Franzen,Claudia Winklmayr,Vanessa E. Guarino,Christoph Karg,Xiaoyan Yu,Nora Koreuber,Jan P. Albrecht,Philip Bischoff,Dagmar Kainmueller
关键词-EN: Uncertainty Quantification, crucial for reliable, reliable image segmentation, discern true uncertainty, Quantification
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 13 pages, 4 figures
点击查看摘要
Abstract:Uncertainty Quantification (UQ) is crucial for reliable image segmentation. Yet, while the field sees continual development of novel methods, a lack of agreed-upon benchmarks limits their systematic comparison and evaluation: Current UQ methods are typically tested either on overly simplistic toy datasets or on complex real-world datasets that do not allow to discern true uncertainty. To unify both controllability and complexity, we introduce Arctique, a procedurally generated dataset modeled after histopathological colon images. We chose histopathological images for two reasons: 1) their complexity in terms of intricate object structures and highly variable appearance, which yields challenging segmentation problems, and 2) their broad prevalence for medical diagnosis and respective relevance of high-quality UQ. To generate Arctique, we established a Blender-based framework for 3D scene creation with intrinsic noise manipulation. Arctique contains 50,000 rendered images with precise masks as well as noisy label simulations. We show that by independently controlling the uncertainty in both images and labels, we can effectively study the performance of several commonly used UQ methods. Hence, Arctique serves as a critical resource for benchmarking and advancing UQ techniques and other methodologies in complex, multi-object environments, bridging the gap between realism and controllability. All code is publicly available, allowing re-creation and controlled manipulations of our shipped images as well as creation and rendering of new scenes.
[CV-11] Extreme Rotation Estimation in the Wild
链接: https://arxiv.org/abs/2411.07096
作者: Hana Bezalel,Dotan Ankri,Ruojin Cai,Hadar Averbuch-Elor
关键词-EN: limited or non-overlapping, non-overlapping field, Internet images captured, images captured, Internet image pairs
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project webpage: this https URL
点击查看摘要
Abstract:We present a technique and benchmark dataset for estimating the relative 3D orientation between a pair of Internet images captured in an extreme setting, where the images have limited or non-overlapping field of views. Prior work targeting extreme rotation estimation assume constrained 3D environments and emulate perspective images by cropping regions from panoramic views. However, real images captured in the wild are highly diverse, exhibiting variation in both appearance and camera intrinsics. In this work, we propose a Transformer-based method for estimating relative rotations in extreme real-world settings, and contribute the ExtremeLandmarkPairs dataset, assembled from scene-level Internet photo collections. Our evaluation demonstrates that our approach succeeds in estimating the relative rotations in a wide variety of extreme-view Internet image pairs, outperforming various baselines, including dedicated rotation estimation techniques and contemporary 3D reconstruction methods.
[CV-12] Increasing Rosacea Awareness Among Population Using Deep Learning and Statistical Approaches
链接: https://arxiv.org/abs/2411.07074
作者: Chengyu Yang,Chengjun Liu
关键词-EN: National Rosacea Society, million Americans suffer, million Americans, Rosacea Society, National Rosacea
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to 2024 International Conference on Medical Imaging and Computer-Aided Diagnosis
点击查看摘要
Abstract:Approximately 16 million Americans suffer from rosacea according to the National Rosacea Society. To increase rosacea awareness, automatic rosacea detection methods using deep learning and explainable statistical approaches are presented in this paper. The deep learning method applies the ResNet-18 for rosacea detection, and the statistical approaches utilize the means of the two classes, namely, the rosacea class vs. the normal class, and the principal component analysis to extract features from the facial images for automatic rosacea detection. The contributions of the proposed methods are three-fold. First, the proposed methods are able to automatically distinguish patients who are suffering from rosacea from people who are clean of this disease. Second, the statistical approaches address the explainability issue that allows doctors and patients to understand and trust the results. And finally, the proposed methods will not only help increase rosacea awareness in the general population but also help remind the patients who suffer from this disease of possible early treatment since rosacea is more treatable at its early stages. The code and data are available at this https URL.
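The explainable statistical pipeline, class means plus PCA, can be sketched as a nearest-class-mean classifier in PCA space. Synthetic arrays below stand in for the real face images; the dimensions are illustrative.

```python
"""Sketch of the statistical approach the abstract outlines: project face
images with PCA, then classify by the nearer class mean. Synthetic data."""
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1024))          # flattened face images (toy)
y = rng.integers(0, 2, size=100)          # 1 = rosacea, 0 = normal

pca = PCA(n_components=20).fit(X)
Z = pca.transform(X)
means = {c: Z[y == c].mean(axis=0) for c in (0, 1)}

def predict(z):
    """Nearest class mean in PCA space; distances stay interpretable,
    which supports the explainability goal."""
    d = {c: np.linalg.norm(z - m) for c, m in means.items()}
    return min(d, key=d.get)

preds = np.array([predict(z) for z in Z])
print("train accuracy:", (preds == y).mean())
```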
[CV-13] Learning Collective Dynamics of Multi-Agent Systems using Event-based Vision
链接: https://arxiv.org/abs/2411.07039
作者: Minah Lee,Uday Kamal,Saibal Mukhopadhyay
关键词-EN: vision-based perception, specifically focusing, convergence time, multi-agent systems, paper proposes
类目: Multiagent Systems (cs.MA); Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:This paper proposes a novel problem: vision-based perception to learn and predict the collective dynamics of multi-agent systems, specifically focusing on interaction strength and convergence time. Multi-agent systems are defined as collections of more than ten interacting agents that exhibit complex group behaviors. Unlike prior studies that assume knowledge of agent positions, we focus on deep learning models to directly predict collective dynamics from visual data, captured as frames or events. Due to the lack of relevant datasets, we create a simulated dataset using a state-of-the-art flocking simulator, coupled with a vision-to-event conversion framework. We empirically demonstrate the effectiveness of event-based representation over traditional frame-based methods in predicting these collective behaviors. Based on our analysis, we present event-based vision for Multi-Agent dynamic Prediction (evMAP), a deep learning architecture designed for real-time, accurate understanding of interaction strength and collective behavior emergence in multi-agent systems.
[CV-14] Scaling Mesh Generation via Compressive Tokenization
链接: https://arxiv.org/abs/2411.07025
作者: Haohan Weng,Zibo Zhao,Biwen Lei,Xianghui Yang,Jian Liu,Zeqiang Lai,Zhuo Chen,Yuhong Liu,Jie Jiang,Chunchao Guo,Tong Zhang,Shenghua Gao,C. L. Philip Chen
关键词-EN: Blocked and Patchified, Patchified Tokenization, effective mesh representation, propose a compressive, compressive yet effective
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
*备注: Homepage: this https URL , Code: this https URL
点击查看摘要
Abstract:We propose a compressive yet effective mesh representation, Blocked and Patchified Tokenization (BPT), facilitating the generation of meshes exceeding 8k faces. BPT compresses mesh sequences by employing block-wise indexing and patch aggregation, reducing their length by approximately 75% compared to the original sequences. This compression milestone unlocks the potential to utilize mesh data with significantly more faces, thereby enhancing detail richness and improving generation robustness. Empowered with the BPT, we have built a foundation mesh generative model trained on scaled mesh data to support flexible control for point clouds and images. Our model demonstrates the capability to generate meshes with intricate details and accurate topology, achieving SoTA performance on mesh generation and reaching the level for direct product usage.
[CV-15] SIESEF-FusionNet: Spatial Inter-correlation Enhancement and Spatially-Embedded Feature Fusion Network for LiDAR Point Cloud Semantic Segmentation
链接: https://arxiv.org/abs/2411.06991
作者: Jiale Chen,Fei Xia,Jianliang Mao,Haoping Wang,Chuanlin Zhang
关键词-EN: intelligent perception systems, point cloud semantic, autonomous driving, perception systems, classes in point
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 9 pages, 4 figures
点击查看摘要
Abstract:The ambiguity at the boundaries of different semantic classes in point cloud semantic segmentation often leads to incorrect decisions in intelligent perception systems, such as autonomous driving. Hence, accurate delineation of the boundaries is crucial for improving safety in autonomous driving. A novel spatial inter-correlation enhancement and spatially-embedded feature fusion network (SIESEF-FusionNet) is proposed in this paper, enhancing spatial inter-correlation by combining inverse distance weighting and angular compensation to extract more beneficial spatial information without causing redundancy. Meanwhile, a new spatial adaptive pooling module is also designed, embedding enhanced spatial information into semantic features for strengthening the context-awareness of semantic features. Experimental results demonstrate that 83.7% mIoU and 97.8% OA are achieved by SIESEF-FusionNet on the Toronto3D dataset, with performance superior to other baseline methods. A value of 61.1% mIoU is reached on the semanticKITTI dataset, where a marked improvement in segmentation performance is observed. In addition, the effectiveness and plug-and-play capability of the proposed modules are further verified through ablation studies.
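The spatial inter-correlation enhancement combines inverse distance weighting with an angular compensation term. The sketch below implements IDW neighbor aggregation plus a simple cosine-based angular factor; that cosine factor is an assumption, since the paper's exact compensation formula is not given here.

```python
"""Hedged sketch: inverse-distance-weighted neighbor aggregation with an
assumed angular term, approximating the spatial enhancement described."""
import numpy as np

rng = np.random.default_rng(0)
pts = rng.random((100, 3))                 # LiDAR points (toy)
feats = rng.random((100, 16))              # per-point features

def aggregate(i, k=8, eps=1e-8):
    d = np.linalg.norm(pts - pts[i], axis=1)
    nn = np.argsort(d)[1 : k + 1]          # skip the point itself
    w_dist = 1.0 / (d[nn] + eps)           # inverse distance weighting
    # angular compensation (assumption): favor neighbors aligned with
    # the local mean direction around the query point
    dirs = (pts[nn] - pts[i]) / (d[nn, None] + eps)
    mean_dir = dirs.mean(0) / (np.linalg.norm(dirs.mean(0)) + eps)
    w_ang = 1.0 + dirs @ mean_dir          # in [0, 2]
    w = w_dist * w_ang
    return (feats[nn] * w[:, None]).sum(0) / w.sum()

print(aggregate(0).shape)  # (16,)
```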
[CV-16] A Hierarchical Compression Technique for 3D Gaussian Splatting Compression
链接: https://arxiv.org/abs/2411.06976
作者: He Huang,Wenjie Huang,Qi Yang,Yiling Xu,Zhu li
关键词-EN: demonstrates excellent rendering, Gaussian Splatting, excellent rendering quality, demonstrates excellent, view synthesis
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*备注:
点击查看摘要
Abstract:3D Gaussian Splatting (GS) demonstrates excellent rendering quality and generation speed in novel view synthesis. However, substantial data size poses challenges for storage and transmission, making 3D GS compression an essential technology. Current 3D GS compression research primarily focuses on developing more compact scene representations, such as converting explicit 3D GS data into implicit forms. In contrast, compression of the GS data itself has hardly been explored. To address this gap, we propose a Hierarchical GS Compression (HGSC) technique. Initially, we prune unimportant Gaussians based on importance scores derived from both global and local significance, effectively reducing redundancy while maintaining visual quality. An Octree structure is used to compress 3D positions. Based on the 3D GS Octree, we implement a hierarchical attribute compression strategy by employing a KD-tree to partition the 3D GS into multiple blocks. We apply farthest point sampling to select anchor primitives within each block and others as non-anchor primitives with varying Levels of Details (LoDs). Anchor primitives serve as reference points for predicting non-anchor primitives across different LoDs to reduce spatial redundancy. For anchor primitives, we use the region adaptive hierarchical transform to achieve near-lossless compression of various attributes. For non-anchor primitives, each is predicted based on the k-nearest anchor primitives. To further minimize prediction errors, the reconstructed LoD and anchor primitives are combined to form new anchor primitives to predict the next LoD. Our method notably achieves superior compression quality and a significant data size reduction of over 4.5 times compared to the state-of-the-art compression method on small scenes datasets.
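Two steps of the hierarchical pipeline are easy to sketch: farthest point sampling to choose anchor primitives, and inverse-distance prediction of non-anchor attributes from the k nearest anchors. The octree coding, KD-tree blocking, LoD recursion, and transforms are omitted; everything below is a simplified illustration, not the authors' implementation.

```python
"""Hedged sketch of anchor selection (farthest point sampling) and
k-nearest-anchor attribute prediction from the HGSC description."""
import numpy as np

rng = np.random.default_rng(0)
pts = rng.random((200, 3))                # Gaussian centers (toy)
attr = rng.random((200, 4))               # e.g., color/opacity attributes

def farthest_point_sampling(p, m):
    idx = [0]
    d = np.linalg.norm(p - p[0], axis=1)
    for _ in range(m - 1):
        nxt = int(d.argmax())
        idx.append(nxt)
        d = np.minimum(d, np.linalg.norm(p - p[nxt], axis=1))
    return np.array(idx)

anchors = farthest_point_sampling(pts, 20)
is_anchor = np.zeros(len(pts), bool)
is_anchor[anchors] = True

def predict_attr(i, k=3):
    """Inverse-distance-weighted average of the k nearest anchors; only
    the residual would need to be coded for each non-anchor primitive."""
    d = np.linalg.norm(pts[anchors] - pts[i], axis=1)
    nn = np.argsort(d)[:k]
    w = 1.0 / (d[nn] + 1e-8)
    return (attr[anchors[nn]] * w[:, None]).sum(0) / w.sum()

residuals = [attr[i] - predict_attr(i)
             for i in range(len(pts)) if not is_anchor[i]]
print("mean |residual|:", float(np.abs(residuals).mean()))
```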
[CV-17] MapSAM: Adapting Segment Anything Model for Automated Feature Detection in Historical Maps
链接: https://arxiv.org/abs/2411.06971
作者: Xue Xia,Daiwei Zhang,Wenxuan Song,Wei Huang,Lorenz Hurni
关键词-EN: historical map segmentation, Automated feature detection, significantly accelerate, accelerate the reconstruction, map segmentation tasks
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Automated feature detection in historical maps can significantly accelerate the reconstruction of the geospatial past. However, this process is often constrained by the time-consuming task of manually digitizing sufficient high-quality training data. The emergence of visual foundation models, such as the Segment Anything Model (SAM), offers a promising solution due to their remarkable generalization capabilities and rapid adaptation to new data distributions. Despite this, directly applying SAM in a zero-shot manner to historical map segmentation poses significant challenges, including poor recognition of certain geospatial features and a reliance on input prompts, which limits its ability to be fully automated. To address these challenges, we introduce MapSAM, a parameter-efficient fine-tuning strategy that adapts SAM into a prompt-free and versatile solution for various downstream historical map segmentation tasks. Specifically, we employ Weight-Decomposed Low-Rank Adaptation (DoRA) to integrate domain-specific knowledge into the image encoder. Additionally, we develop an automatic prompt generation process, eliminating the need for manual input. We further enhance the positional prompt in SAM, transforming it into a higher-level positional-semantic prompt, and modify the cross-attention mechanism in the mask decoder with masked attention for more effective feature aggregation. The proposed MapSAM framework demonstrates promising performance across two distinct historical map segmentation tasks: one focused on linear features and the other on areal features. Experimental results show that it adapts well to various features, even when fine-tuned with extremely limited data (e.g. 10 shots).
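DoRA, the adapter used on the image encoder, reparameterizes a frozen weight into a learnable magnitude and a LoRA-updated direction, roughly W' = m · (W0 + BA) / ||W0 + BA|| with column-wise norms. The sketch below follows that published formulation; the shapes and the rank are placeholder choices.

```python
"""Sketch of the DoRA reparameterization: frozen base weight, low-rank
direction update, learnable per-column magnitude."""
import torch
import torch.nn as nn

class DoRALinear(nn.Module):
    def __init__(self, w0: torch.Tensor, rank: int = 8):
        super().__init__()
        out_dim, in_dim = w0.shape
        self.w0 = nn.Parameter(w0, requires_grad=False)   # frozen base
        self.A = nn.Parameter(torch.randn(rank, in_dim) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_dim, rank))
        # magnitude initialized to the column norms of w0
        self.m = nn.Parameter(w0.norm(dim=0, keepdim=True))

    def forward(self, x):
        w = self.w0 + self.B @ self.A                     # direction update
        w = self.m * w / w.norm(dim=0, keepdim=True)      # renormalize
        return x @ w.t()

layer = DoRALinear(torch.randn(16, 32))
print(layer(torch.randn(4, 32)).shape)                    # torch.Size([4, 16])
```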
[CV-18] Robust Fine-tuning of Zero-shot Models via Variance Reduction NEURIPS2024
链接: https://arxiv.org/abs/2411.06966
作者: Beier Zhu,Jiequan Cui,Hanwang Zhang
关键词-EN: OOD accuracy, OOD, accuracy, CLIP, models like CLIP
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to NeurIPS 2024
点击查看摘要
Abstract:When fine-tuning zero-shot models like CLIP, our desideratum is for the fine-tuned model to excel in both in-distribution (ID) and out-of-distribution (OOD). Recently, ensemble-based models (ESM) have been shown to offer significant robustness improvement, while preserving high ID accuracy. However, our study finds that ESMs do not solve the ID-OOD trade-offs: they achieve peak performance for ID and OOD accuracy at different mixing coefficients. When optimized for OOD accuracy, the ensemble model exhibits a noticeable decline in ID accuracy, and vice versa. In contrast, we propose a sample-wise ensembling technique that can simultaneously attain the best ID and OOD accuracy without the trade-offs. Specifically, we construct a Zero-Shot Failure (ZSF) set containing training samples incorrectly predicted by the zero-shot model. For each test sample, we calculate its distance to the ZSF set and assign a higher weight to the fine-tuned model in the ensemble if the distance is small. We term our method Variance Reduction Fine-tuning (VRF), as it effectively reduces the variance in ensemble predictions, thereby decreasing residual error. On ImageNet and five derived distribution shifts, our VRF further improves the OOD accuracy by 1.5 - 2.0 pp over the ensemble baselines while maintaining or increasing ID accuracy. VRF achieves similar large robustness gains (0.9 - 3.1 pp) on other distribution shifts benchmarks. Codes are available at this https URL.
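The sample-wise ensembling can be sketched directly from the abstract: build the Zero-Shot Failure (ZSF) set, measure each test sample's distance to it, and weight the fine-tuned model more when that distance is small. The exponential distance-to-weight mapping and its bandwidth `tau` below are assumptions; the paper's exact weighting function is not stated here.

```python
"""Hedged sketch of the VRF sample-wise ensemble. The ZSF set and
distance-based weighting follow the abstract; the exponential mapping
from distance to weight is an assumption."""
import numpy as np

rng = np.random.default_rng(0)
feat_dim, n_train = 32, 500
train_feats = rng.normal(size=(n_train, feat_dim))
zs_wrong = rng.random(n_train) < 0.3          # zero-shot failures (toy flags)
zsf_set = train_feats[zs_wrong]               # Zero-Shot Failure set

def vrf_predict(x_feat, logits_zeroshot, logits_finetuned, tau=5.0):
    """Weight the fine-tuned model more when x is close to the ZSF set."""
    d = np.linalg.norm(zsf_set - x_feat, axis=1).min()
    w_ft = np.exp(-d / tau)                   # small distance -> big weight
    return w_ft * logits_finetuned + (1.0 - w_ft) * logits_zeroshot

x = rng.normal(size=feat_dim)
out = vrf_predict(x, logits_zeroshot=np.array([0.2, 0.8]),
                  logits_finetuned=np.array([0.9, 0.1]))
print(out)
```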
[CV-19] UMFC: Unsupervised Multi-Domain Feature Calibration for Vision-Language Models NEURIPS2024
链接: https://arxiv.org/abs/2411.06921
作者: Jiachen Liang,Ruibing Hou,Minyang Hu,Hong Chang,Shiguang Shan,Xilin Chen
关键词-EN: zero-shot transfer capabilities, shown powerful zero-shot, powerful zero-shot transfer, Pre-trained vision-language models, Pre-trained vision-language
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: NeurIPS 2024
点击查看摘要
Abstract:Pre-trained vision-language models (e.g., CLIP) have shown powerful zero-shot transfer capabilities. But they still struggle with domain shifts and typically require labeled data to adapt to downstream tasks, which could be costly. In this work, we aim to leverage unlabeled data that naturally spans multiple domains to enhance the transferability of vision-language models. Under this unsupervised multi-domain setting, we have identified inherent model bias within CLIP, notably in its visual and text encoders. Specifically, we observe that CLIP’s visual encoder tends to prioritize encoding domain over discriminative category information, meanwhile its text encoder exhibits a preference for domain-relevant classes. To mitigate this model bias, we propose a training-free and label-free feature calibration method, Unsupervised Multi-domain Feature Calibration (UMFC). UMFC estimates image-level biases from domain-specific features and text-level biases from the direction of domain transition. These biases are subsequently subtracted from original image and text features separately, to render them domain-invariant. We evaluate our method on multiple settings including transductive learning and test-time adaptation. Extensive experiments show that our method outperforms CLIP and performs on par with the state-of-the-arts that need additional annotations or optimization. Our code is available at this https URL.
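The image-side calibration can be approximated as: estimate a domain-level bias from (pseudo-)domain-specific features and subtract it. The sketch below uses k-means clusters as stand-in domains and the cluster mean as the bias estimate; both choices, and the omission of the text-side calibration, are simplifications of the actual training-free procedure.

```python
"""Hedged sketch of UMFC-style image feature calibration: estimate a
domain bias from clustered features and subtract it."""
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
img_feats = rng.normal(size=(300, 64))        # CLIP image features (toy)

# cluster unlabeled features into pseudo-domains, take each cluster mean
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(img_feats)
domain_means = km.cluster_centers_[km.labels_]        # per-sample bias

calibrated = img_feats - domain_means                 # remove domain bias
calibrated /= np.linalg.norm(calibrated, axis=1, keepdims=True)
print(calibrated.shape)  # (300, 64), now more domain-invariant
```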
[CV-20] BuckTales : A multi-UAV dataset for multi-object tracking and re-identification of wild antelopes
链接: https://arxiv.org/abs/2411.06896
作者: Hemal Naik,Junran Yang,Dipin Das,Margaret C Crofoot,Akanksha Rathore,Vivek Hari Sridhar
关键词-EN: Understanding animal behaviour, Unmanned Aerial Vehicles, Understanding animal, Understanding, central to predicting
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 9 pages, 5 figures
点击查看摘要
Abstract:Understanding animal behaviour is central to predicting, understanding, and mitigating impacts of natural and anthropogenic changes on animal populations and ecosystems. However, the challenges of acquiring and processing long-term, ecologically relevant data in wild settings have constrained the scope of behavioural research. The increasing availability of Unmanned Aerial Vehicles (UAVs), coupled with advances in machine learning, has opened new opportunities for wildlife monitoring using aerial tracking. However, limited availability of datasets with wild animals in natural habitats has hindered progress in automated computer vision solutions for long-term animal tracking. Here we introduce BuckTales, the first large-scale UAV dataset designed to solve multi-object tracking (MOT) and re-identification (Re-ID) problem in wild animals, specifically the mating behaviour (or lekking) of blackbuck antelopes. Collected in collaboration with biologists, the MOT dataset includes over 1.2 million annotations including 680 tracks across 12 high-resolution (5.4K) videos, each averaging 66 seconds and featuring 30 to 130 individuals. The Re-ID dataset includes 730 individuals captured with two UAVs simultaneously. The dataset is designed to drive scalable, long-term animal behaviour tracking using multiple camera sensors. By providing baseline performance with two detectors, and benchmarking several state-of-the-art tracking methods, our dataset reflects the real-world challenges of tracking wild animals in socially and ecologically relevant contexts. In making these data widely available, we hope to catalyze progress in MOT and Re-ID for wild animals, fostering insights into animal behaviour, conservation efforts, and ecosystem dynamics through automated, long-term monitoring.
[CV-21] Multi-scale Frequency Enhancement Network for Blind Image Deblurring
链接: https://arxiv.org/abs/2411.06893
作者: Yawen Xiang,Heng Zhou,Chengyang Li,Zhongbo Li,Yongqiang Xie
关键词-EN: image preprocessing technique, essential image preprocessing, detailed images form, images form blurry, preprocessing technique
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Image deblurring is an essential image preprocessing technique, aiming to recover clear and detailed images from blurry ones. However, existing algorithms often fail to effectively integrate multi-scale feature extraction with frequency enhancement, limiting their ability to reconstruct fine textures. Additionally, non-uniform blur in images also restricts the effectiveness of image restoration. To address these issues, we propose a multi-scale frequency enhancement network (MFENet) for blind image deblurring. To capture the multi-scale spatial and channel information of blurred images, we introduce a multi-scale feature extraction module (MS-FE) based on depthwise separable convolutions, which provides rich target features for deblurring. We propose a frequency enhanced blur perception module (FEBP) that employs wavelet transforms to extract high-frequency details and utilizes multi-strip pooling to perceive non-uniform blur, combining multi-scale information with frequency enhancement to improve the restoration of image texture details. Experimental results on the GoPro and HIDE datasets demonstrate that the proposed method achieves superior deblurring performance in both visual quality and objective evaluation metrics. Furthermore, in downstream object detection tasks, the proposed blind image deblurring algorithm significantly improves detection accuracy, further validating its effectiveness and robustness in the field of image deblurring.
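As a toy illustration of the wavelet step mentioned in the abstract (not the paper's FEBP module itself), the snippet below uses PyWavelets to extract the three high-frequency subbands of an image with a single-level 2D Haar DWT; the wavelet choice and helper name are assumptions.

```python
import numpy as np
import pywt

def high_frequency_details(img):
    """Single-level 2D Haar DWT; return the three high-frequency detail subbands.

    img: 2D numpy array (grayscale image)
    """
    cA, (cH, cV, cD) = pywt.dwt2(img, 'haar')  # approximation + horizontal/vertical/diagonal details
    return cH, cV, cD

img = np.random.rand(64, 64)
cH, cV, cD = high_frequency_details(img)
print(cH.shape)  # (32, 32)
```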
[CV-22] Classification of residential and non-residential buildings based on satellite data using deep learning
链接: https://arxiv.org/abs/2411.06879
作者: Jai G Singla
关键词-EN: infrastructure development, population estimation, non-residential categories, categories is crucial, residential and non-residential
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注:
点击查看摘要
Abstract:Accurate classification of buildings into residential and non-residential categories is crucial for urban planning, infrastructure development, population estimation and resource allocation. Carrying out such a classification manually from satellite data is a complex and laborious job. In this paper, we propose a novel deep learning approach that combines high-resolution satellite data (50 cm resolution imagery + 1 m grid interval DEM) and vector data to achieve high-performance building classification. Our architecture leverages LeakyReLU and ReLU activations to capture nonlinearities in the data and employs feature-engineering techniques to eliminate highly correlated features, resulting in improved computational efficiency. Experimental results on a large-scale dataset demonstrate the effectiveness of our model, achieving an impressive overall F1-score of 0.9936. The proposed approach offers a scalable and accurate solution for building classification, enabling informed decision-making in urban planning and resource allocation. This research contributes to the field of urban analysis by providing a valuable tool for understanding the built environment and optimizing resource utilization.
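The abstract mentions feature engineering that eliminates highly correlated features. A standard recipe for that step is sketched below with pandas; the threshold and column names are illustrative, not taken from the paper.

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.95):
    """Drop one feature out of every pair whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    # keep only the strict upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({'a': x, 'b': x + 1e-3 * rng.normal(size=200), 'c': rng.normal(size=200)})
print(drop_correlated(df).columns.tolist())  # ['a', 'c'] -- 'b' duplicates 'a'
```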
[CV-23] CapeLLM: Support-Free Category-Agnostic Pose Estimation with Multimodal Large Language Models
链接: https://arxiv.org/abs/2411.06869
作者: Junho Kim,Hyungjin Chung,Byung-Hoon Kim
关键词-EN: diverse object categories, object categories, traditionally relied, fail to fully, fully capture
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Category-agnostic pose estimation (CAPE) has traditionally relied on support images with annotated keypoints, a process that is often cumbersome and may fail to fully capture the necessary correspondences across diverse object categories. Recent efforts have begun exploring the use of text-based queries, where the need for support keypoints is eliminated. However, the optimal use of textual descriptions for keypoints remains an underexplored area. In this work, we introduce CapeLLM, a novel approach that leverages a text-based multimodal large language model (MLLM) for CAPE. Our method only employs query image and detailed text descriptions as an input to estimate category-agnostic keypoints. We conduct extensive experiments to systematically explore the design space of LLM-based CAPE, investigating factors such as choosing the optimal description for keypoints, neural network architectures, and training strategies. Thanks to the advanced reasoning capabilities of the pre-trained MLLM, CapeLLM demonstrates superior generalization and robust performance. Our approach sets a new state-of-the-art on the MP-100 benchmark in the challenging 1-shot setting, marking a significant advancement in the field of category-agnostic pose estimation.
[CV-24] Veri-Car: Towards Open-world Vehicle Information Retrieval
链接: https://arxiv.org/abs/2411.06864
作者: Andrés Muñoz,Nancy Thomas,Annita Vapsi,Daciel Borrajo
关键词-EN: service sectors require, sectors require tools, extract vehicle characteristics, characteristics from images, industrial and service
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 33 pages, 12 figures
点击查看摘要
Abstract:Many industrial and service sectors require tools to extract vehicle characteristics from images. This is a complex task, not only because of the variety of noise and the large number of classes, but also because of the constant introduction of new vehicle models to the market. In this paper, we present Veri-Car, an integrated information retrieval approach designed to help with this task. It leverages supervised learning techniques to accurately identify the make, type, model, year, color, and license plate of cars. The approach also addresses the challenge of handling open-world problems, where new car models and variations frequently emerge, by employing a sophisticated combination of pre-trained models and a hierarchical multi-similarity loss. Veri-Car demonstrates robust performance, achieving high precision and accuracy in classifying both seen and unseen data. Additionally, it integrates an ensemble license plate detector and an OCR model to extract license plate numbers with impressive accuracy.
[CV-25] Fast and Efficient Transformer-based Method for Bird's Eye View Instance Prediction ITSC2024
链接: https://arxiv.org/abs/2411.06851
作者: Miguel Antunes-García,Luis M. Bergasa,Santiago Montiel-Marín,Rafael Barea,Fabio Sánchez-García,Ángel Llamazares
关键词-EN: Accurate object detection, Accurate object, critical to ensure, ensure the safety, safety and efficiency
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: The article has been presented in the 27th IEEE International Conference on Intelligent Transportation Systems (IEEE ITSC 2024) on September, 2024. Number of pages: 6, Number of figures: 4
点击查看摘要
Abstract:Accurate object detection and prediction are critical to ensure the safety and efficiency of self-driving architectures. Predicting object trajectories and occupancy enables autonomous vehicles to anticipate movements and make decisions with future information, increasing their adaptability and reducing the risk of accidents. Current State-Of-The-Art (SOTA) approaches often isolate the detection, tracking, and prediction stages, which can lead to significant prediction errors due to accumulated inaccuracies between stages. Recent advances have improved the feature representation of multi-camera perception systems through Bird’s-Eye View (BEV) transformations, boosting the development of end-to-end systems capable of predicting environmental elements directly from vehicle sensor data. These systems, however, often suffer from high processing times and number of parameters, creating challenges for real-world deployment. To address these issues, this paper introduces a novel BEV instance prediction architecture based on a simplified paradigm that relies only on instance segmentation and flow prediction. The proposed system prioritizes speed, aiming at reduced parameter counts and inference times compared to existing SOTA architectures, thanks to the incorporation of an efficient transformer-based architecture. Furthermore, the implementation of the proposed architecture is optimized for performance improvements in PyTorch version 2.1. Code and trained models are available at this https URL
[CV-26] HSTrack: Bootstrap End-to-End Multi-Camera 3D Multi-object Tracking with Hybrid Supervision
链接: https://arxiv.org/abs/2411.06780
作者: Shubo Lin,Yutong Kou,Bing Li,Weiming Hu,Jin Gao
关键词-EN: prevailing methods follow, object queries handle, object queries, employs track queries, manage the lifecycle
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 9 pages, 2 figures
点击查看摘要
Abstract:In camera-based 3D multi-object tracking (MOT), the prevailing methods follow the tracking-by-query-propagation paradigm, which employs track queries to manage the lifecycle of identity-consistent tracklets while object queries handle the detection of new-born tracklets. However, this intertwined paradigm forces the inter-temporal tracking task and the single-frame detection task to share the same model parameters, complicating training optimization. Drawing inspiration from studies on the roles of attention components in transformer-based decoders, we identify that the dispersing effect of self-attention necessitates object queries to match with new-born tracklets. This matching strategy diverges from the detection pre-training phase, where object queries align with all ground-truth targets, resulting in insufficient supervision signals. To address these issues, we present HSTrack, a novel plug-and-play method designed to co-facilitate multi-task learning for detection and tracking. HSTrack constructs a parallel weight-sharing decoder devoid of self-attention layers, circumventing competition between different types of queries. Considering the characteristics of the cross-attention layer and distinct query types, our parallel decoder adopts one-to-one and one-to-many label assignment strategies for track queries and object queries, respectively. Leveraging the shared architecture, HSTrack further improves trackers in spatio-temporal modeling and quality candidate generation. Extensive experiments demonstrate that HSTrack consistently delivers improvements when integrated with various query-based 3D MOT trackers. For example, HSTrack improves the state-of-the-art PF-Track method by +2.3% AMOTA and +1.7% mAP on the nuScenes dataset.
[CV-27] Multi-Stage Knowledge Integration of Vision-Language Models for Continual Learning
链接: https://arxiv.org/abs/2411.06764
作者: Hongsheng Zhang,Zhong Ji,Jingren Liu,Yanwei Pang,Jungong Han
关键词-EN: Vision Language Models, Vision Language, large-scale image-text datasets, specific unseen tasks, enable zero-shot predictions
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Vision Language Models (VLMs), pre-trained on large-scale image-text datasets, enable zero-shot predictions for unseen data but may underperform on specific unseen tasks. Continual learning (CL) can help VLMs effectively adapt to new data distributions without joint training, but faces challenges of catastrophic forgetting and generalization forgetting. Although significant progress has been achieved by distillation-based methods, they exhibit two severe limitations. One is that the popularly adopted single-teacher paradigm fails to impart comprehensive knowledge; the other is that existing methods inadequately leverage the multimodal information in the original training dataset and instead rely on additional data for distillation, which increases computational and storage overhead. To mitigate both limitations, drawing on Knowledge Integration Theory (KIT), we propose a Multi-Stage Knowledge Integration network (MulKI) to emulate the human learning process in distillation methods. MulKI achieves this through four stages: Eliciting Ideas, Adding New Ideas, Distinguishing Ideas, and Making Connections. During the four stages, we first leverage prototypes to align across modalities, eliciting cross-modal knowledge, and then add new knowledge by constructing fine-grained intra- and inter-modality relationships with prototypes. After that, knowledge from two teacher models is adaptively distinguished and re-weighted. Finally, we connect models within and across tasks, integrating preceding and new knowledge. Our method demonstrates significant improvements in maintaining zero-shot capabilities while supporting continual learning across diverse downstream tasks, showcasing its potential in adapting VLMs to evolving data distributions.
[CV-28] LuSh-NeRF: Lighting up and Sharpening NeRFs for Low-light Scenes NEURIPS2024
链接: https://arxiv.org/abs/2411.06757
作者: Zefan Qu,Ke Xu,Gerhard Petrus Hancke,Rynson W.H. Lau
关键词-EN: Neural Radiance Fields, Neural Radiance, Radiance Fields, shown remarkable performances, producing novel-view images
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by NeurIPS 2024
点击查看摘要
Abstract:Neural Radiance Fields (NeRFs) have shown remarkable performances in producing novel-view images from high-quality scene images. However, hand-held low-light photography challenges NeRFs as the captured images may simultaneously suffer from low visibility, noise, and camera shakes. While existing NeRF methods may handle either low light or motion, directly combining them or incorporating additional image-based enhancement methods does not work as these degradation factors are highly coupled. We observe that noise in low-light images is always sharp regardless of camera shakes, which implies an implicit order of these degradation factors within the image formation process. To this end, we propose in this paper a novel model, named LuSh-NeRF, which can reconstruct a clean and sharp NeRF from a group of hand-held low-light images. The key idea of LuSh-NeRF is to sequentially model noise and blur in the images via multi-view feature consistency and frequency information of NeRF, respectively. Specifically, LuSh-NeRF includes a novel Scene-Noise Decomposition (SND) module for decoupling the noise from the scene representation and a novel Camera Trajectory Prediction (CTP) module for the estimation of camera motions based on low-frequency scene information. To facilitate training and evaluations, we construct a new dataset containing both synthetic and real images. Experiments show that LuSh-NeRF outperforms existing approaches. Our code and dataset can be found here: this https URL.
[CV-29] Can KAN Work? Exploring the Potential of Kolmogorov-Arnold Networks in Computer Vision
链接: https://arxiv.org/abs/2411.06727
作者: Yueyang Cang,Yu hang liu,Li Shi
关键词-EN: neural network architecture, efficient neural network, theoretically efficient neural, Kolmogorov-Arnold Networks, network architecture
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Kolmogorov-Arnold Networks (KANs), as a theoretically efficient neural network architecture, have garnered attention for their potential in capturing complex patterns. However, their application in computer vision remains relatively unexplored. This study first analyzes the potential of KAN in computer vision tasks, evaluating the performance of KAN and its convolutional variants in image classification and semantic segmentation. The focus is placed on examining their characteristics across varying data scales and noise levels. Results indicate that while KAN exhibits stronger fitting capabilities, it is highly sensitive to noise, limiting its robustness. To address this challenge, we propose a smoothness regularization method and introduce a Segment Deactivation technique. Both approaches enhance KAN’s stability and generalization, demonstrating its potential in handling complex visual data tasks.
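The abstract does not spell out the proposed smoothness regularization; a common way to realize such a penalty is to discourage curvature in the learnable spline coefficients, as in this hypothetical sketch (shapes, weight, and helper name are all assumptions, not the paper's exact regularizer).

```python
import torch

def smoothness_penalty(spline_coeffs, weight=1e-3):
    """Penalize second differences of per-edge spline coefficients (a curvature proxy).

    spline_coeffs: (..., K) learnable coefficients over K grid points per KAN edge
    """
    second_diff = spline_coeffs[..., 2:] - 2 * spline_coeffs[..., 1:-1] + spline_coeffs[..., :-2]
    return weight * second_diff.pow(2).mean()

coeffs = torch.randn(16, 16, 12, requires_grad=True)  # e.g. 16x16 edges, 12 grid points each
loss = smoothness_penalty(coeffs)  # add this to the task loss during training
loss.backward()
print(loss.item())
```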
[CV-30] GTA-Net: An IoT-Integrated 3D Human Pose Estimation System for Real-Time Adolescent Sports Posture Correction
链接: https://arxiv.org/abs/2411.06725
作者: Shizhe Yuan,Li Zhou
关键词-EN: gained significant attention, human pose estimation-based, Temporal Convolutional Networks, Graph Convolutional Networks, artificial intelligence
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 18 pages
点击查看摘要
Abstract:With the advancement of artificial intelligence, 3D human pose estimation-based systems for sports training and posture correction have gained significant attention in adolescent sports. However, existing methods face challenges in handling complex movements, providing real-time feedback, and accommodating diverse postures, particularly with occlusions, rapid movements, and the resource constraints of Internet of Things (IoT) devices, making it difficult to balance accuracy and real-time performance. To address these issues, we propose GTA-Net, an intelligent system for posture correction and real-time feedback in adolescent sports, integrated within an IoT-enabled environment. This model enhances pose estimation in dynamic scenes by incorporating Graph Convolutional Networks (GCN), Temporal Convolutional Networks (TCN), and Hierarchical Attention mechanisms, achieving real-time correction through IoT devices. Experimental results show GTA-Net’s superior performance on Human3.6M, HumanEva-I, and MPI-INF-3DHP datasets, with Mean Per Joint Position Error (MPJPE) values of 32.2mm, 15.0mm, and 48.0mm, respectively, significantly outperforming existing methods. The model also demonstrates strong robustness in complex scenarios, maintaining high accuracy even with occlusions and rapid movements. This system enhances real-time posture correction and offers broad applications in intelligent sports and health management.
[CV-31] Shallow Signed Distance Functions for Kinematic Collision Bodies
链接: https://arxiv.org/abs/2411.06719
作者: Osman Akar,Yushan Han,Yizhou Chen,Weixian Lan,Benn Gallagher,Ronald Fedkiw,Joseph Teran
关键词-EN: shallow SDF, implicit shape representations, collision queries arising, avatar collision queries, shape representations designed
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Preprint
点击查看摘要
Abstract:We present learning-based implicit shape representations designed for real-time avatar collision queries arising in the simulation of clothing. Signed distance functions (SDFs) have been used for such queries for many years due to their computational efficiency. Recently, deep neural networks have been used for implicit shape representations (DeepSDFs) due to their ability to represent multiple shapes with modest memory requirements compared to traditional representations over dense grids. However, the computational expense of DeepSDFs prevents their use in real-time clothing simulation applications. We design a learning-based representation of SDFs for human avatars whose bodies change shape kinematically due to joint-based skinning. Rather than using a single DeepSDF for the entire avatar, we use a collection of extremely computationally efficient (shallow) neural networks that represent localized deformations arising from changes in body shape induced by the variation of a single joint. This requires a stitching process to combine each shallow SDF in the collection together into one SDF representing the signed closest distance to the boundary of the entire body. To achieve this, we augment each shallow SDF with an additional output that resolves whether or not the individual shallow SDF value is referring to a closest point on the boundary of the body, or to a point on the interior of the body (but on the boundary of the individual shallow SDF). Our model is extremely fast and accurate and we demonstrate its applicability with real-time simulation of garments driven by animated characters.
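To make the stitching idea concrete: given that each shallow SDF also outputs a flag saying whether its closest point lies on the true body boundary, the per-point combination can be as simple as a masked minimum. The sketch below is a schematic reading of the abstract, not the authors' implementation; all names and shapes are assumptions.

```python
import numpy as np

def stitch_sdfs(distances, on_body_boundary):
    """Combine per-joint shallow SDF outputs into one body SDF value per query point.

    distances: (P, J) signed distances from J shallow SDFs at P query points
    on_body_boundary: (P, J) bool, whether each shallow SDF's closest point lies on
        the true body boundary (the extra output described in the abstract)
    """
    # ignore distances whose closest point is only an internal seam, not the body surface;
    # a point with no valid SDF would come back as inf and needs separate handling
    masked = np.where(on_body_boundary, distances, np.inf)
    return masked.min(axis=1)

d = np.random.randn(5, 3)
flag = np.random.rand(5, 3) > 0.3
print(stitch_sdfs(d, flag))
```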
[CV-32] United Domain Cognition Network for Salient Object Detection in Optical Remote Sensing Images
链接: https://arxiv.org/abs/2411.06703
作者: Yanguang Sun,Jian Yang,Lei Luo
关键词-EN: achieved significant breakthroughs, deep learning-based salient, deep learning-based, significant breakthroughs, achieved significant
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at TGRS 2024
点击查看摘要
Abstract:Recently, deep learning-based salient object detection (SOD) in optical remote sensing images (ORSIs) has achieved significant breakthroughs. We observe that existing ORSIs-SOD methods consistently center around optimizing pixel features in the spatial domain, progressively distinguishing between backgrounds and objects. However, pixel information represents local attributes, which are often correlated with their surrounding context. Even with strategies expanding the local region, spatial features remain biased towards local characteristics, lacking the ability of global perception. To address this problem, we introduce the Fourier transform to generate global frequency features and achieve an image-size receptive field. To be specific, we propose a novel United Domain Cognition Network (UDCNet) to jointly explore the global-local information in the frequency and spatial domains. Technically, we first design a frequency-spatial domain transformer block that mutually amalgamates the complementary local spatial and global frequency features to strengthen the capability of the initial input features. Furthermore, a dense semantic excavation module is constructed to capture higher-level semantics for guiding the positioning of remote sensing objects. Finally, we devise a dual-branch joint optimization decoder that applies the saliency and edge branches to generate high-quality representations for predicting salient objects. Experimental results demonstrate the superiority of the proposed UDCNet method over 24 state-of-the-art models, through extensive quantitative and qualitative comparisons on three widely-used ORSIs-SOD datasets. The source code is available at: this https URL.
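The global-frequency idea rests on the fact that every FFT coefficient aggregates all spatial positions, giving an image-size receptive field. Below is a minimal sketch of extracting such global features with torch.fft; the paper's frequency-spatial transformer block is of course far more involved.

```python
import torch

def global_frequency_features(x):
    """Image-size receptive field via 2D FFT: return amplitude and phase maps.

    x: (B, C, H, W) feature map
    """
    freq = torch.fft.fft2(x, norm='ortho')
    amplitude = freq.abs()    # each entry mixes information from every spatial location
    phase = freq.angle()      # phase carries structural/positional information
    return amplitude, phase

x = torch.randn(2, 8, 32, 32)
amp, pha = global_frequency_features(x)
print(amp.shape, pha.shape)  # torch.Size([2, 8, 32, 32]) twice
```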
[CV-33] Track Any Peppers: Weakly Supervised Sweet Pepper Tracking Using VLMs
链接: https://arxiv.org/abs/2411.06702
作者: Jia Syuen Lim,Yadan Luo,Zhi Chen,Tianqi Wei,Scott Chapman,Zi Huang
关键词-EN: Sweet Peppers Challenge, weakly supervised ensemble, Sweet Peppers, Peppers Challenge, present Track
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:In the Detection and Multi-Object Tracking of Sweet Peppers Challenge, we present Track Any Peppers (TAP) - a weakly supervised ensemble technique for sweet peppers tracking. TAP leverages the zero-shot detection capabilities of vision-language foundation models like Grounding DINO to automatically generate pseudo-labels for sweet peppers in video sequences with minimal human intervention. These pseudo-labels, refined when necessary, are used to train a YOLOv8 segmentation network. To enhance detection accuracy under challenging conditions, we incorporate pre-processing techniques such as relighting adjustments and apply depth-based filtering during post-inference. For object tracking, we integrate the Matching by Segment Anything (MASA) adapter with the BoT-SORT algorithm. Our approach achieves a HOTA score of 80.4%, MOTA of 66.1%, Recall of 74.0%, and Precision of 90.7%, demonstrating effective tracking of sweet peppers without extensive manual effort. This work highlights the potential of foundation models for efficient and accurate object detection and tracking in agricultural settings.
[CV-34] HomoMatcher: Dense Feature Matching Results with Semi-Dense Efficiency by Homography Estimation
链接: https://arxiv.org/abs/2411.06700
作者: Xiaolong Wang,Lei Yu,Yingying Zhang,Jiangwei Lao,Lixiang Ru,Liheng Zhong,Jingdong Chen,Yu Zhang,Ming Yang
关键词-EN: drives many applications, fundamental problem, problem in computer, computer vision, vision that drives
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages, 5 figures, conference under review
点击查看摘要
Abstract:Feature matching between image pairs is a fundamental problem in computer vision that drives many applications, such as SLAM. Recently, semi-dense matching approaches have achieved substantial performance enhancements and established a widely-accepted coarse-to-fine paradigm. However, the majority of existing methods focus on improving coarse feature representation rather than the fine-matching module. Prior fine-matching techniques, which rely on point-to-patch matching probability expectation or direct regression, often lack precision and do not guarantee the continuity of feature points across sequential images. To address this limitation, this paper concentrates on enhancing the fine-matching module in the semi-dense matching framework. We employ a lightweight and efficient homography estimation network to generate the perspective mapping between patches obtained from coarse matching. This patch-to-patch approach achieves the overall alignment of two patches, resulting in a higher sub-pixel accuracy by incorporating additional constraints. By leveraging the homography estimation between patches, we can achieve a dense matching result with low computational cost. Extensive experiments demonstrate that our method achieves higher accuracy compared to previous semi-dense matchers. Meanwhile, our dense matching results exhibit similar end-point-error accuracy compared to previous dense matchers while maintaining semi-dense efficiency.
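At the heart of the fine-matching module is applying an estimated 3x3 homography to map coordinates from one coarse patch onto the other. A minimal sketch of that warp follows; the homography values here are made up for illustration, and the network that estimates them is not shown.

```python
import numpy as np

def warp_points(H, pts):
    """Map 2D points through a 3x3 homography (patch-to-patch perspective mapping).

    H: (3, 3) homography, pts: (N, 2) pixel coordinates in the source patch
    """
    homo = np.hstack([pts, np.ones((len(pts), 1))])   # lift to homogeneous coordinates
    mapped = homo @ H.T
    return mapped[:, :2] / mapped[:, 2:3]             # divide by w to go back to pixels

H = np.array([[1.02, 0.01, 0.5],
              [0.00, 0.98, -0.3],
              [1e-4, 0.0, 1.0]])
pts = np.array([[4.0, 4.0], [7.5, 2.0]])
print(warp_points(H, pts))
```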
[CV-35] Layout Control and Semantic Guidance with Attention Loss Backward for T2I Diffusion Model
链接: https://arxiv.org/abs/2411.06692
作者: Guandong Li
关键词-EN: aiming to create, core demands, creative and logical, logical while satisfying, satisfying additional
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Controllable image generation has always been one of the core demands in image generation, aiming to create images that are both creative and logical while satisfying additional specified conditions. In the post-AIGC era, controllable generation relies on diffusion models and is accomplished by maintaining certain components or introducing inference interferences. This paper addresses key challenges in controllable generation: 1. mismatched object attributes during generation and poor prompt-following effects; 2. inadequate completion of controllable layouts. We propose a training-free method based on attention loss backward, cleverly controlling the cross-attention map. By utilizing external conditions such as prompts that can reasonably map onto the attention map, we can control image generation without any training or fine-tuning. This method addresses issues like attribute mismatch and poor prompt-following while introducing explicit layout constraints for controllable image generation. Our approach has achieved excellent practical applications in production, and we hope it can serve as an inspiring technical report in this field.
[CV-36] SeedEdit: Align Image Re-Generation to Image Editing
链接: https://arxiv.org/abs/2411.06686
作者: Yichun Shi,Peng Wang,Weilin Huang
关键词-EN: text prompt, image, introduce SeedEdit, Abstract, prompt
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Our website: this https URL
点击查看摘要
Abstract:We introduce SeedEdit, a diffusion model that is able to revise a given image with any text prompt. In our perspective, the key to such a task is to obtain an optimal balance between maintaining the original image, i.e. image reconstruction, and generating a new image, i.e. image re-generation. To this end, we start from a weak generator (text-to-image model) that creates diverse pairs between such two directions and gradually align it into a strong image editor that well balances between the two tasks. SeedEdit can achieve more diverse and stable editing capability over prior image editing methods, enabling sequential revision over images generated by diffusion models.
[CV-37] Learning from Different Samples: A Source-free Framework for Semi-supervised Domain Adaptation
链接: https://arxiv.org/abs/2411.06665
作者: Xinyang Huang,Chuang Zhu,Bowen Zhang,Shanghang Zhang
关键词-EN: target samples, widely studied due, target, generalization ability, Semi-supervised domain adaptation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Semi-supervised domain adaptation (SSDA) has been widely studied due to its ability to utilize a few labeled target data to improve the generalization ability of the model. However, existing methods only consider designing certain strategies for target samples to adapt, ignoring the exploration of customized learning for different target samples. When the model encounters complex target distributions, existing methods perform poorly due to their inability to clearly and comprehensively learn the knowledge of multiple types of target samples. To fill this gap, this paper focuses on designing a framework that uses different strategies for comprehensively mining different target samples. We propose a novel source-free framework (SOUF) to achieve semi-supervised fine-tuning of the source pre-trained model on the target domain. Different from existing SSDA methods, SOUF decouples SSDA from the perspectives of different target samples, specifically designing robust learning techniques for unlabeled, reliably labeled, and noisy pseudo-labeled target samples. For unlabeled target samples, probability-based weighted contrastive learning (PWC) helps the model learn more discriminative feature representations. To mine the latent knowledge of labeled target samples, reliability-based mixup contrastive learning (RMC) learns complex knowledge from the constructed reliable sample set. Finally, predictive regularization learning (PR) further mitigates the misleading effect of noisy pseudo-labeled samples on the model. Extensive experiments on benchmark datasets demonstrate the superiority of our framework over state-of-the-art methods.
[CV-38] LFSamba: Marry SAM with Mamba for Light Field Salient Object Detection
链接: https://arxiv.org/abs/2411.06652
作者: Zhengyi Liu,Longzhen Wang,Xianyong Fang,Zhengzheng Tu,Linbo Wang
关键词-EN: spatial geometric information, rich spatial geometric, light field camera, virtual reality, captured multi-focus images
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by SPL
点击查看摘要
Abstract:A light field camera can reconstruct 3D scenes using captured multi-focus images that contain rich spatial geometric information, enhancing applications in stereoscopic photography, virtual reality, and robotic vision. In this work, a state-of-the-art salient object detection model for multi-focus light field images, called LFSamba, is introduced to emphasize four main insights: (a) Efficient feature extraction, where SAM is used to extract modality-aware discriminative features; (b) Inter-slice relation modeling, leveraging Mamba to capture long-range dependencies across multiple focal slices, thus extracting implicit depth cues; (c) Inter-modal relation modeling, utilizing Mamba to integrate all-focus and multi-focus images, enabling mutual enhancement; (d) Weakly supervised learning capability, developing a scribble annotation dataset from an existing pixel-level mask dataset, establishing the first scribble-supervised baseline for light field salient object detection. https://github.com/liuzywen/LFScribble
[CV-39] Machine learning enabled velocity model building with uncertainty quantification
链接: https://arxiv.org/abs/2411.06651
作者: Rafael Orozco,Huseyin Tuna Erdinc,Yunlin Zeng,Mathias Louboutin,Felix J. Herrmann
关键词-EN: Accurately characterizing migration, Accurately characterizing, migration velocity models, velocity models, characterizing migration velocity
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Accurately characterizing migration velocity models is crucial for a wide range of geophysical applications, from hydrocarbon exploration to monitoring of CO2 sequestration projects. Traditional velocity model building methods such as Full-Waveform Inversion (FWI) are powerful but often struggle with the inherent complexities of the inverse problem, including noise, limited bandwidth, receiver aperture and computational constraints. To address these challenges, we propose a scalable methodology that integrates generative modeling, in the form of Diffusion networks, with physics-informed summary statistics, making it suitable for complicated imaging problems including field datasets. By defining these summary statistics in terms of subsurface-offset image volumes for poor initial velocity models, our approach allows for computationally efficient generation of Bayesian posterior samples for migration velocity models that offer a useful assessment of uncertainty. To validate our approach, we introduce a battery of tests that measure the quality of the inferred velocity models, as well as the quality of the inferred uncertainties. With modern synthetic datasets, we reconfirm gains from using subsurface-image gathers as the conditioning observable. For complex velocity model building involving salt, we propose a new iterative workflow that refines amortized posterior approximations with salt flooding and demonstrate how the uncertainty in the velocity model can be propagated to the final product reverse time migrated images. Finally, we present a proof of concept on field datasets to show that our method can scale to industry-sized problems.
[CV-40] Few-shot Semantic Learning for Robust Multi-Biome 3D Semantic Mapping in Off-Road Environments
链接: https://arxiv.org/abs/2411.06632
作者: Deegan Atha,Xianmei Lei,Shehryar Khattak,Anna Sabel,Elle Miller,Aurelio Noca,Grace Lim,Jeffrey Edlund,Curtis Padgett,Patrick Spieler
关键词-EN: environments pose significant, pose significant perception, significant perception challenges, high-speed autonomous navigation, autonomous navigation due
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: Accepted to Australasian Conference on Robotics and Automation (ACRA 2024)
点击查看摘要
Abstract:Off-road environments pose significant perception challenges for high-speed autonomous navigation due to unstructured terrain, degraded sensing conditions, and domain-shifts among biomes. Learning semantic information across these conditions and biomes can be challenging when a large amount of ground truth data is required. In this work, we propose an approach that leverages a pre-trained Vision Transformer (ViT) with fine-tuning on a small (500 images), sparse and coarsely labeled (30% pixels) multi-biome dataset to predict 2D semantic segmentation classes. These classes are fused over time via a novel range-based metric and aggregated into a 3D semantic voxel map. We demonstrate zero-shot out-of-biome 2D semantic segmentation on the Yamaha (52.9 mIoU) and Rellis (55.5 mIoU) datasets along with few-shot coarse sparse labeling with existing data for improved segmentation performance on Yamaha (66.6 mIoU) and Rellis (67.2 mIoU). We further illustrate the feasibility of using a voxel map with a range-based semantic fusion approach to handle common off-road hazards like pop-up hazards, overhangs, and water features.
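The abstract's "novel range-based metric" for temporal fusion is not specified; one plausible reading is to down-weight semantic evidence from distant points before accumulating it into voxels, as in this hypothetical sketch. The linear weighting, max_range, and data layout are assumptions, not the paper's actual metric.

```python
import numpy as np

def fuse_semantics(voxel_logits, voxel_idx, point_logits, ranges, max_range=50.0):
    """Accumulate per-point class logits into a voxel map, down-weighting far points.

    voxel_logits: dict mapping voxel index (tuple) -> accumulated (C,) logits
    voxel_idx: (N, 3) integer voxel coordinates per 3D point
    point_logits: (N, C) 2D segmentation logits projected onto the points
    ranges: (N,) sensor-to-point distances
    """
    weights = np.clip(1.0 - ranges / max_range, 0.0, 1.0)  # simple range-based confidence
    for idx, logit, w in zip(map(tuple, voxel_idx), point_logits, weights):
        voxel_logits[idx] = voxel_logits.get(idx, 0.0) + w * logit
    return voxel_logits

vox = fuse_semantics({}, np.array([[0, 0, 0], [0, 0, 0]]),
                     np.array([[0.2, 0.8], [0.6, 0.4]]), np.array([5.0, 40.0]))
print(vox)  # nearby observation dominates the fused logits
```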
[CV-41] Adaptive and Temporally Consistent Gaussian Surfels for Multi-view Dynamic Reconstruction
链接: https://arxiv.org/abs/2411.06602
作者: Decai Chen,Brianne Oberson,Ingo Feldmann,Oliver Schreer,Anna Hilsmann,Peter Eisert
关键词-EN: Gaussian Splatting, recently achieved notable, achieved notable success, Splatting has recently, dynamic surface reconstruction
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:3D Gaussian Splatting has recently achieved notable success in novel view synthesis for dynamic scenes and geometry reconstruction in static scenes. Building on these advancements, early methods have been developed for dynamic surface reconstruction by globally optimizing entire sequences. However, reconstructing dynamic scenes with significant topology changes, emerging or disappearing objects, and rapid movements remains a substantial challenge, particularly for long sequences. To address these issues, we propose AT-GS, a novel method for reconstructing high-quality dynamic surfaces from multi-view videos through per-frame incremental optimization. To avoid local minima across frames, we introduce a unified and adaptive gradient-aware densification strategy that integrates the strengths of conventional cloning and splitting techniques. Additionally, we reduce temporal jittering in dynamic surfaces by ensuring consistency in curvature maps across consecutive frames. Our method achieves superior accuracy and temporal coherence in dynamic surface reconstruction, delivering high-fidelity space-time novel view synthesis, even in complex and challenging scenes. Extensive experiments on diverse multi-view video datasets demonstrate the effectiveness of our approach, showing clear advantages over baseline methods. Project page: \urlthis https URL
[CV-42] Graph Neural Networks for modelling breast biomechanical compression MICCAI2024
链接: https://arxiv.org/abs/2411.06596
作者: Hadeel Awwad,Eloy García,Robert Martí
关键词-EN: modalities to X-ray, X-ray procedures, accurate image registration, procedures like mammography, image registration
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Deep Breath @ MICCAI 2024 | The code is available at this URL: this https URL
点击查看摘要
Abstract:Breast compression simulation is essential for accurate image registration from 3D modalities to X-ray procedures like mammography. It accounts for tissue shape and position changes due to compression, ensuring precise alignment and improved analysis. Although Finite Element Analysis (FEA) is reliable for approximating soft tissue deformation, it struggles with balancing accuracy and computational efficiency. Recent studies have used data-driven models trained on FEA results to speed up tissue deformation predictions. We propose to explore Physics-based Graph Neural Networks (PhysGNN) for breast compression simulation. PhysGNN has been used for data-driven modelling in other domains, and this work presents the first investigation of their potential in predicting breast deformation during mammographic compression. Unlike conventional data-driven models, PhysGNN, which incorporates mesh structural information and enables inductive learning on unstructured grids, is well-suited for capturing complex breast tissue geometries. Trained on deformations from incremental FEA simulations, PhysGNN’s performance is evaluated by comparing predicted nodal displacements with those from finite element (FE) simulations. This deep learning (DL) framework shows promise for accurate, rapid breast deformation approximations, offering enhanced computational efficiency for real-world scenarios.
[CV-43] Region-Aware Text-to-Image Generation via Hard Binding and Soft Refinement
链接: https://arxiv.org/abs/2411.06558
作者: Zhennan Chen,Yajie Li,Haofan Wang,Zhibo Chen,Zhengkai Jiang,Jun Li,Qian Wang,Jian Yang,Ying Tai
关键词-EN: precise layout composition, layout composition, Regional Soft Refinement, descriptions for precise, precise layout
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:In this paper, we present RAG, a Regional-Aware text-to-image Generation method conditioned on regional descriptions for precise layout composition. Regional prompting, or compositional generation, which enables fine-grained spatial control, has gained increasing attention for its practicality in real-world applications. However, previous methods either introduce additional trainable modules, thus only applicable to specific models, or manipulate score maps within cross-attention layers using attention masks, resulting in limited control strength when the number of regions increases. To handle these limitations, we decouple multi-region generation into two sub-tasks: the construction of individual regions (Regional Hard Binding), which ensures the regional prompt is properly executed, and the overall detail refinement (Regional Soft Refinement) over regions, which dismisses visual boundaries and enhances adjacent interactions. Furthermore, RAG makes repainting feasible, where users can modify specific unsatisfactory regions of the last generation while keeping all other regions unchanged, without relying on additional inpainting models. Our approach is tuning-free and applicable to other frameworks as an enhancement of the prompt-following property. Quantitative and qualitative experiments demonstrate that RAG achieves superior performance in attribute binding and object relationships compared with previous tuning-free methods.
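The Regional Hard Binding idea can be caricatured as mask-gated composition of per-region latents. The sketch below is only a schematic of that composition step; in the actual method it happens inside the diffusion denoising loop and is followed by soft refinement across region boundaries.

```python
import torch

def hard_bind(region_latents, masks):
    """Compose per-region latents into one latent using binary layout masks.

    region_latents: (R, C, H, W) latents denoised under each regional prompt
    masks: (R, 1, H, W) binary masks, assumed to partition the canvas
    """
    return (region_latents * masks).sum(dim=0)  # each pixel keeps its own region's latent

lat = torch.randn(2, 4, 8, 8)
m = torch.zeros(2, 1, 8, 8)
m[0, :, :, :4], m[1, :, :, 4:] = 1.0, 1.0       # left/right layout split
print(hard_bind(lat, m).shape)  # torch.Size([4, 8, 8])
```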
[CV-44] Extended multi-stream temporal-attention module for skeleton-based human action recognition (HAR)
链接: https://arxiv.org/abs/2411.06553
作者: Faisal Mehmood,Xin Guo,Enqing Chen,Muhammad Azeem Akbar,Arif Ali Khan,Sami Ullah
关键词-EN: human action recognition, effective skeleton-based human, skeleton-based human action, Graph convolutional networks, convolutional networks
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: This paper accepted in Computers in Human Behavior Journal
点击查看摘要
Abstract:Graph convolutional networks (GCNs) are an effective technique for skeleton-based human action recognition (HAR), as they generalize CNNs to a more flexible non-Euclidean domain. Previous GCN-based models still have several issues: (I) the graph structure is the same for all model layers and input data.
[CV-45] Image Segmentation from Shadow-Hints using Minimum Spanning Trees
链接: https://arxiv.org/abs/2411.06530
作者: Moritz Heep,Eduard Zell
关键词-EN: notoriously difficult task, RGB space, notoriously difficult, difficult task, trained on thousands
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注:
点击查看摘要
Abstract:Image segmentation in RGB space is a notoriously difficult task where state-of-the-art methods are trained on thousands or even millions of annotated images. While the performance is impressive, it is still not perfect. We propose a novel image segmentation method that achieves similar segmentation quality but without training. Instead, we require an image sequence with a static camera and a single light source at varying positions, as used for photometric stereo, for example.
[CV-46] Diffusion Sampling Correction via Approximately 10 Parameters
链接: https://arxiv.org/abs/2411.06503
作者: Guangyi Wang,Wei Peng,Lijiang Li,Wenyu Chen,Yuren Cai,Songzhi Su
关键词-EN: Diffusion Probabilistic Models, Diffusion Probabilistic, demonstrated exceptional performance, Probabilistic Models, generative tasks
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Diffusion Probabilistic Models (DPMs) have demonstrated exceptional performance in generative tasks, but this comes at the expense of sampling efficiency. To enhance sampling speed without sacrificing quality, various distillation-based accelerated sampling algorithms have been recently proposed. However, they typically require significant additional training costs and model parameter storage, which limit their practical application. In this work, we propose PCA-based Adaptive Search (PAS), which optimizes existing solvers for DPMs with minimal learnable parameters and training costs. Specifically, we first employ PCA to obtain a few orthogonal unit basis vectors to span the high-dimensional sampling space, which enables us to learn just a set of coordinates to correct the sampling direction; furthermore, based on the observation that the cumulative truncation error exhibits an "S"-shape, we design an adaptive search strategy that further enhances the sampling efficiency and reduces the number of stored parameters to approximately 10. Extensive experiments demonstrate that PAS can significantly enhance existing fast solvers in a plug-and-play manner with negligible costs. For instance, on CIFAR10, PAS requires only 12 parameters and less than 1 minute of training on a single NVIDIA A100 GPU to optimize the DDIM from 15.69 FID (NFE=10) to 4.37.
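The core trick is that only a handful of coordinates in a PCA subspace are learned to nudge each sampling direction. A minimal sketch under assumed shapes follows; QR on random vectors stands in for the actual PCA basis, and the correction form is a simplification of the paper's solver-specific update.

```python
import torch

def corrected_direction(eps, basis, coords):
    """Shift a solver's update direction along a learned combination of basis vectors.

    eps: (D,) original sampling direction (e.g. predicted noise, flattened)
    basis: (k, D) orthonormal basis spanning the sampling space
    coords: (k,) the small set of learnable coordinates (the ~10 parameters)
    """
    return eps + coords @ basis

D, k = 3 * 32 * 32, 4
q, _ = torch.linalg.qr(torch.randn(D, k))      # stand-in for the PCA basis, shape (D, k)
basis = q.T                                    # (k, D)
coords = torch.zeros(k, requires_grad=True)    # learnable correction coordinates
eps = torch.randn(D)
print(corrected_direction(eps, basis, coords).shape)  # torch.Size([3072])
```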
[CV-47] Mitigating covariate shift in non-colocated data with learned parameter priors
链接: https://arxiv.org/abs/2411.06499
作者: Behraj Khan,Behroz Mirza,Nouman Durrani,Tahir Syed
关键词-EN: compromising model selection, training data biases, time or space, training data, data biases cross-validation
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:When training data are distributed across time or space, covariate shift across fragments of training data biases cross-validation, compromising model selection and assessment. We present Fragmentation-Induced covariate-shift Remediation (FIcsR), which minimizes an f-divergence between a fragment's covariate distribution and that of the standard cross-validation baseline. We show an equivalence with popular importance-weighting methods. The method's numerical solution poses a computational challenge owing to the overparametrized nature of a neural network, and we derive a Fisher Information approximation. When accumulated over fragments, this provides a global estimate of the amount of shift remediation thus far needed, and we incorporate that as a prior via the minimization objective. In the paper, we run extensive classification experiments on multiple data classes, over 40 datasets, and with data batched over multiple sequence lengths. We extend the study to the k-fold cross-validation setting through a similar set of experiments. An ablation study exposes the method to varying amounts of shift and demonstrates slower degradation with FIcsR in place. The results are promising under all these conditions; with improved accuracy against batch and fold state-of-the-art by more than 5% and 10%, respectively.
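The paper notes an equivalence with importance weighting. A common classifier-based stand-in for such density-ratio weights is sketched below with scikit-learn; this illustrates only the weighting idea, not FIcsR's f-divergence objective or its Fisher Information machinery.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def importance_weights(frag_X, ref_X):
    """Estimate w(x) = p_ref(x) / p_frag(x) with a domain classifier (odds ratio)."""
    X = np.vstack([frag_X, ref_X])
    y = np.concatenate([np.zeros(len(frag_X)), np.ones(len(ref_X))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    p = clf.predict_proba(frag_X)[:, 1]          # probability of "reference distribution"
    return p / np.clip(1.0 - p, 1e-6, None)      # per-sample density-ratio estimate

rng = np.random.default_rng(0)
frag = rng.normal(0.5, 1.0, size=(200, 3))       # covariate-shifted fragment
ref = rng.normal(0.0, 1.0, size=(200, 3))        # cross-validation baseline distribution
print(importance_weights(frag, ref)[:5])
```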
[CV-48] DDIM-Driven Coverless Steganography Scheme with Real Key
链接: https://arxiv.org/abs/2411.06486
作者: Mingyu Yu,Haonan Miao,Zhengping Jin,Sujuan Qing
关键词-EN: Typical steganography embeds, exploiting their redundancy, Typical steganography, embeds secret information, Generative Adversarial Networks
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Typical steganography embeds secret information into images by exploiting their redundancy. Since the visual imperceptibility of secret information is a key factor in scheme evaluation, conventional methods aim to balance this requirement with embedding capacity. Consequently, integrating emerging image generation models and secret transmission has been extensively explored to achieve a higher embedding capacity. Previous works mostly focus on generating stego-images with Generative Adversarial Networks (GANs) and usually rely on pseudo-keys, namely conditions or parameters involved in the generation process, which are related to secret images. However, studies on diffusion-based coverless steganography remain insufficient. In this work, we leverage the Denoising Diffusion Implicit Model (DDIM) to generate high-quality stego-images without introducing pseudo-keys, instead employing real keys to enhance security. Furthermore, our method offers low-image-correlation real-key protection by incorporating chaotic encryption. Another core innovation is that our method requires only one-time negotiation for multiple communications, unlike prior methods that necessitate negotiation for each interaction.
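To illustrate the chaotic-encryption ingredient in a toy form (this is not the paper's scheme, and a bare chaotic XOR stream is not production-grade cryptography): a logistic map seeded by the real key can generate a keystream that is XORed with the protected data, so applying the same operation twice decrypts.

```python
import numpy as np  # unused here, kept only if you extend to array keystreams

def logistic_keystream(x0, r, n):
    """Generate n pseudo-random bytes from the chaotic logistic map x <- r*x*(1-x)."""
    x, out = x0, []
    for _ in range(n):
        x = r * x * (1.0 - x)
        out.append(int(x * 256) % 256)
    return bytes(out)

def xor_encrypt(data, key_x0=0.3141592, r=3.99):
    """XOR data with a chaotic keystream; applying it twice decrypts."""
    ks = logistic_keystream(key_x0, r, len(data))
    return bytes(a ^ b for a, b in zip(data, ks))

ct = xor_encrypt(b"real key material")
print(xor_encrypt(ct))  # b'real key material'
```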
[CV-49] KMM: Key Frame Mask Mamba for Extended Motion Generation
链接: https://arxiv.org/abs/2411.06481
作者: Zeyu Zhang,Hang Gao,Akide Liu,Qi Chen,Feng Chen,Yiran Wang,Danning Li,Hao Tang
关键词-EN: Human motion generation, generative computer vision, Human motion, game development, computer vision
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Human motion generation is a cutting-edge area of research in generative computer vision, with promising applications in video creation, game development, and robotic manipulation. The recent Mamba architecture shows promising results in efficiently modeling long and complex sequences, yet two significant challenges remain: Firstly, directly applying Mamba to extended motion generation is ineffective, as the limited capacity of the implicit memory leads to memory decay. Secondly, Mamba struggles with multimodal fusion compared to Transformers and lacks alignment with textual queries, often confusing directions (left or right) or omitting parts of longer text queries. To address these challenges, our paper presents three key contributions: Firstly, we introduce KMM, a novel architecture featuring Key frame Masking Modeling, designed to enhance Mamba’s focus on key actions in motion segments. This approach addresses the memory decay problem and represents a pioneering method in customizing strategic frame-level masking in SSMs. Additionally, we designed a contrastive learning paradigm for addressing the multimodal fusion problem in Mamba and improving the motion-text alignment. Finally, we conducted extensive experiments on the go-to dataset, BABEL, achieving state-of-the-art performance with a reduction of more than 57% in FID and 70% fewer parameters compared to previous state-of-the-art methods. See project website: this https URL
[CV-50] Superpixel Segmentation: A Long-Lasting Ill-Posed Problem
链接: https://arxiv.org/abs/2411.06478
作者: Rémi Giraud,Michaël Clément
关键词-EN: computer vision pipelines, image over-segmentation, vision pipelines, essential to computer, computer vision
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:For many years, image over-segmentation into superpixels has been essential to computer vision pipelines, by creating homogeneous and identifiable regions of similar sizes. Such a constrained segmentation problem would require a clear definition and specific evaluation criteria. However, the validation framework for superpixel methods, typically viewed as standard object segmentation, has rarely been thoroughly studied. In this work, we first take a step back to show that superpixel segmentation is fundamentally an ill-posed problem, due to the implicit regularity constraint on the shape and size of superpixels. We also demonstrate through a novel comprehensive study that the literature suffers from only evaluating certain aspects, sometimes incorrectly and with inappropriate metrics. Concurrently, recent deep learning-based superpixel methods mainly focus on the object segmentation task at the expense of regularity. In this ill-posed context, we show that we can achieve competitive results using a recent architecture like the Segment Anything Model (SAM), without dedicated training for the superpixel segmentation task. This leads to rethinking superpixel segmentation and the necessary properties depending on the targeted downstream task.
[CV-51] Dropout the High-rate Downsampling: A Novel Design Paradigm for UHD Image Restoration WACV2025
链接: https://arxiv.org/abs/2411.06456
作者: Chen Wu,Ling Wang,Long Peng,Dianjie Lu,Zhuoran Zheng
关键词-EN: high-end mobile devices, UHD images, UHD image restoration, UHD, mobile devices
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: WACV2025
点击查看摘要
Abstract:With the popularization of high-end mobile devices, Ultra-high-definition (UHD) images have become ubiquitous in our lives. The restoration of UHD images is a highly challenging problem due to the exaggerated pixel count, which often leads to memory overflow during processing. Existing methods either downsample UHD images at a high rate before processing or split them into multiple patches for separate processing. However, high-rate downsampling leads to significant information loss, while patch-based approaches inevitably introduce boundary artifacts. In this paper, we propose a novel design paradigm to solve the UHD image restoration problem, called D2Net. D2Net enables direct full-resolution inference on UHD images without the need for high-rate downsampling or dividing the images into several patches. Specifically, we ingeniously utilize the characteristics of the frequency domain to establish long-range dependencies of features. Taking into account the richer local patterns in UHD images, we also design a multi-scale convolutional group to capture local features. Additionally, during the decoding stage, we dynamically incorporate features from the encoding stage to reduce the flow of irrelevant information. Extensive experiments on three UHD image restoration tasks, including low-light image enhancement, image dehazing, and image deblurring, show that our model achieves better quantitative and qualitative results than state-of-the-art methods.
[CV-52] Improved Video VAE for Latent Video Diffusion Model
链接: https://arxiv.org/abs/2411.06449
作者: Pingyu Wu,Kai Zhu,Yu Liu,Liming Zhao,Wei Zhai,Yang Cao,Zheng-Jun Zha
关键词-EN: Variational Autoencoder, compress pixel data, OpenAI Sora, diffusion generation models, image VAE
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注:
点击查看摘要
Abstract:Variational Autoencoder (VAE) aims to compress pixel data into a low-dimensional latent space, playing an important role in OpenAI’s Sora and other latent video diffusion generation models. While most existing video VAEs inflate a pretrained image VAE into a 3D causal structure for temporal-spatial compression, this paper presents two astonishing findings: (1) The initialization from a well-trained image VAE with the same latent dimensions suppresses the improvement of subsequent temporal compression capabilities. (2) The adoption of causal reasoning leads to unequal information interactions and unbalanced performance between frames. To alleviate these problems, we propose a keyframe-based temporal compression (KTC) architecture and a group causal convolution (GCConv) module to further improve video VAE (IV-VAE). Specifically, the KTC architecture divides the latent space into two branches, in which one half completely inherits the compression prior of keyframes from a lower-dimension image VAE while the other half involves temporal compression through 3D group causal convolution, reducing temporal-spatial conflicts and accelerating the convergence speed of video VAE. The GCConv in the above 3D half uses standard convolution within each frame group to ensure inter-frame equivalence, and employs causal logical padding between groups to maintain flexibility in processing variable frame videos. Extensive experiments on five benchmarks demonstrate the SOTA video reconstruction and generation capabilities of the proposed IV-VAE (this https URL).
[CV-53] SamRobNODDI: Q-Space Sampling-Augmented Continuous Representation Learning for Robust and Generalized NODDI
链接: https://arxiv.org/abs/2411.06444
作者: Taohui Xiao,Jian Cheng,Wenxin Fan,Enqing Dong,Hairong Zheng,Shanshan Wang
关键词-EN: Neurite Orientation Dispersion, magnetic resonance imaging, Neurite Orientation, Orientation Dispersion, Dispersion and Density
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注:
点击查看摘要
Abstract:Neurite Orientation Dispersion and Density Imaging (NODDI) microstructure estimation from diffusion magnetic resonance imaging (dMRI) is of great significance for the discovery and treatment of various neurological diseases. Current deep learning-based methods accelerate the speed of NODDI parameter estimation and improve the accuracy. However, most methods require the number and coordinates of gradient directions during testing and training to remain strictly consistent, significantly limiting the generalization and robustness of these models in NODDI parameter estimation. In this paper, we propose a q-space sampling augmentation-based continuous representation learning framework (SamRobNODDI) to achieve robust and generalized NODDI. Specifically, a continuous representation learning method based on q-space sampling augmentation is introduced to fully explore the information between different gradient directions in q-space. Furthermore, we design a sampling consistency loss to constrain the outputs of different sampling schemes, ensuring that the outputs remain as consistent as possible, thereby further enhancing performance and robustness to varying q-space sampling schemes. SamRobNODDI is also a flexible framework that can be applied to different backbone networks. To validate the effectiveness of the proposed method, we compared it with 7 state-of-the-art methods across 18 different q-space sampling schemes, demonstrating that the proposed SamRobNODDI has better performance, robustness, generalization, and flexibility.
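The sampling consistency loss presumably constrains parameter maps predicted from different q-space sampling schemes of the same data to agree; here is a minimal MSE-based sketch of that idea (the exact loss form and tensor layout in the paper may differ).

```python
import torch
import torch.nn.functional as F

def sampling_consistency_loss(pred_a, pred_b):
    """Penalize disagreement between NODDI parameter maps predicted under two
    different q-space sampling schemes of the same subject."""
    return F.mse_loss(pred_a, pred_b)

a = torch.randn(2, 3, 16, 16)                 # e.g. 3 NODDI parameter maps per subject
b = a + 0.05 * torch.randn_like(a)            # prediction from another sampling scheme
print(sampling_consistency_loss(a, b).item())
```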
[CV-54] Detecting AutoEncoder is Enough to Catch LDM Generated Images
链接: https://arxiv.org/abs/2411.06441
作者: Dmitry Vesnin,Dmitry Levshun,Andrey Chechulin
关键词-EN: Latent Diffusion Models, recent years, diffusion models, detecting images generated, images
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:In recent years, diffusion models have become one of the main methods for generating images. However, detecting images generated by these models remains a challenging task. This paper proposes a novel method for detecting images generated by Latent Diffusion Models (LDM) by identifying artifacts introduced by their autoencoders. By training a detector to distinguish between real images and those reconstructed by the LDM autoencoder, the method enables detection of generated images without directly training on them. The novelty of this research lies in the fact that, unlike similar approaches, this method does not require training on synthesized data, significantly reducing computational costs and enhancing generalization ability. Experimental results show high detection accuracy with minimal false positives, making this approach a promising tool for combating fake images.
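The training recipe implied by the abstract is easy to prototype: round-trip real images through an LDM autoencoder and train a binary classifier on (real, reconstruction) pairs. The sketch below uses the `diffusers` `AutoencoderKL` and a ResNet-18 head; the specific VAE checkpoint and the backbone are assumptions on our side, not the paper's setup.

```python
import torch
from diffusers import AutoencoderKL            # pip install diffusers
from torchvision.models import resnet18

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()
detector = resnet18(num_classes=2)             # real vs. autoencoder-reconstructed

def make_training_batch(real):                 # real: (B, 3, H, W) in [-1, 1]
    with torch.no_grad():
        latents = vae.encode(real).latent_dist.sample()
        recon = vae.decode(latents).sample     # autoencoder round trip
    x = torch.cat([real, recon])
    y = torch.cat([torch.zeros(len(real)), torch.ones(len(recon))]).long()
    return x, y                                # no generated images needed
```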
[CV-55] SplatFormer: Point Transformer for Robust 3D Gaussian Splatting
链接: https://arxiv.org/abs/2411.06390
作者: Yutong Chen,Marko Mihajlovic,Xiyi Chen,Yiming Wang,Sergey Prokudin,Siyu Tang
关键词-EN: transformed photorealistic reconstruction, recently transformed photorealistic, high visual fidelity, Gaussian Splatting, achieving high visual
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Code and dataset are publicly available. Project page: this https URL
点击查看摘要
Abstract:3D Gaussian Splatting (3DGS) has recently transformed photorealistic reconstruction, achieving high visual fidelity and real-time performance. However, rendering quality significantly deteriorates when test views deviate from the camera angles used during training, posing a major challenge for applications in immersive free-viewpoint rendering and navigation. In this work, we conduct a comprehensive evaluation of 3DGS and related novel view synthesis methods under out-of-distribution (OOD) test camera scenarios. By creating diverse test cases with synthetic and real-world datasets, we demonstrate that most existing methods, including those incorporating various regularization techniques and data-driven priors, struggle to generalize effectively to OOD views. To address this limitation, we introduce SplatFormer, the first point transformer model specifically designed to operate on Gaussian splats. SplatFormer takes as input an initial 3DGS set optimized under limited training views and refines it in a single forward pass, effectively removing potential artifacts in OOD test views. To our knowledge, this is the first successful application of point transformers directly on 3DGS sets, surpassing the limitations of previous multi-scene training methods, which could handle only a restricted number of input views during inference. Our model significantly improves rendering quality under extreme novel views, achieving state-of-the-art performance in these challenging scenarios and outperforming various 3DGS regularization techniques, multi-scene models tailored for sparse view synthesis, and diffusion-based frameworks.
[CV-56] SAN: Structure-Aware Network for Complex and Long-tailed Chinese Text Recognition ICDAR2023
链接: https://arxiv.org/abs/2411.06381
作者: Junyi Zhang,Chang Liu,Chun Yang
关键词-EN: factors affecting model, Chinese text recognition, affecting model performance, factors affecting, complex characters
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Published in ICDAR 2023
点击查看摘要
Abstract:In text recognition, complex glyphs and tail classes have always been factors affecting model performance. Specifically for Chinese text recognition, a lack of shape-awareness can lead to confusion among visually similar complex characters. Because such characters are often tail classes that appear less frequently in the training set, it is harder for the model to capture their shape information. Hence, in this work, we propose a structure-aware network utilizing hierarchical composition information to improve the recognition performance of complex characters. Implementation-wise, we first propose an auxiliary radical branch and integrate it into the base recognition network as a regularization term, which distills hierarchical composition information into the feature extractor. A Tree-Similarity-based weighting mechanism is then proposed to further utilize the depth information in the hierarchical representation. Experiments demonstrate that the proposed approach can significantly improve the performance on complex characters and tail characters, yielding a better overall performance. Code is available at this https URL.
[CV-57] PKF: Probabilistic Data Association Kalman Filter for Multi-Object Tracking
链接: https://arxiv.org/abs/2411.06378
作者: Hanwen Cao,George J. Pappas,Nikolay Atanasov
关键词-EN: probabilistic data association, Kalman filter, data association, apply Expectation Maximization, data association filter
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:In this paper, we derive a new Kalman filter with probabilistic data association between measurements and states. We formulate a variational inference problem to approximate the posterior density of the state conditioned on the measurement data. We view the unknown data association as a latent variable and apply Expectation Maximization (EM) to obtain a filter with update step in the same form as the Kalman filter but with expanded measurement vector of all potential associations. We show that the association probabilities can be computed as permanents of matrices with measurement likelihood entries. We also propose an ambiguity check that associates only a subset of ambiguous measurements and states probabilistically, thus reducing the association time and preventing low-probability measurements from harming the estimation accuracy. Experiments in simulation show that our filter achieves lower tracking errors than the well-established joint probabilistic data association filter (JPDAF), while running at comparable rate. We also demonstrate the effectiveness of our filter in multi-object tracking (MOT) on multiple real-world datasets, including MOT17, MOT20, and DanceTrack. We achieve better higher order tracking accuracy (HOTA) than previous Kalman-filter methods and remain real-time. Associating only bounding boxes without deep features or velocities, our method ranks top-10 on both MOT17 and MOT20 in terms of HOTA. Given offline detections, our algorithm tracks at 250+ fps on a single laptop CPU. Code is available at this https URL.
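The permanent-based association step can be illustrated on a toy likelihood matrix. The brute-force permanent below is exponential and only meant to show the math; each row of the result gives the marginal probabilities that measurement i belongs to state j.

```python
import numpy as np
from itertools import permutations

def permanent(M):
    n = M.shape[0]
    return sum(np.prod([M[i, p[i]] for i in range(n)])
               for p in permutations(range(n)))

def association_probs(L):
    """Marginal association probabilities for a square likelihood matrix L:
    P[i, j] = L[i, j] * perm(L without row i, col j) / perm(L)."""
    n = L.shape[0]
    P = np.zeros_like(L, dtype=float)
    for i in range(n):
        for j in range(n):
            minor = np.delete(np.delete(L, i, 0), j, 1)
            P[i, j] = L[i, j] * permanent(minor)
    return P / permanent(L)                    # each row sums to 1

L = np.array([[0.9, 0.2, 0.1],
              [0.1, 0.8, 0.3],
              [0.2, 0.1, 0.7]])
print(association_probs(L).round(3))
```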
[CV-58] Through the Curved Cover: Synthesizing Cover Aberrated Scenes with Refractive Field WACV2025
链接: https://arxiv.org/abs/2411.06365
作者: Liuyue Xie,Jiancong Guo,Laszlo A. Jeni,Zhiheng Jia,Mingyang Li,Yunwen Zhou,Chao Guo
关键词-EN: Recent extended reality, Recent extended, hazards and falls, extended reality headsets, robots have adopted
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: WACV 2025
点击查看摘要
Abstract:Recent extended reality headsets and field robots have adopted covers to protect the front-facing cameras from environmental hazards and falls. The surface irregularities on the cover can lead to optical aberrations like blurring and non-parametric distortions. Novel view synthesis methods like NeRF and 3D Gaussian Splatting are ill-equipped to synthesize from sequences with optical aberrations. To address this challenge, we introduce SynthCover to enable novel view synthesis through protective covers for downstream extended reality applications. SynthCover employs a Refractive Field that estimates the cover’s geometry, enabling precise analytical calculation of refracted rays. Experiments on synthetic and real-world scenes demonstrate our method’s ability to accurately model scenes viewed through protective covers, achieving a significant improvement in rendering quality compared to prior methods. We also show that the model can adjust well to various cover geometries with synthetic sequences captured with covers of different surface curvatures. To motivate further studies on this problem, we provide the benchmarked dataset containing real and synthetic walkable scenes captured with protective cover optical aberrations.
[CV-59] Classification in Japanese Sign Language Based on Dynamic Facial Expressions
链接: https://arxiv.org/abs/2411.06347
作者: Yui Tatsumi,Shoko Tanaka,Shunsuke Akamatsu,Takahiro Shindo,Hiroshi Watanabe
关键词-EN: visual language expressed, Sign language, expressed through hand, non-manual markers, Japanese Sign Language
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 2024 IEEE 13th Global Conference on Consumer Electronics (GCCE 2024)
点击查看摘要
Abstract:Sign language is a visual language expressed through hand movements and non-manual markers. Non-manual markers include facial expressions and head movements. These expressions vary across different nations. Therefore, specialized analysis methods for each sign language are necessary. However, research on Japanese Sign Language (JSL) recognition is limited due to a lack of datasets. The development of recognition models that consider both manual and non-manual features of JSL is crucial for precise and smooth communication with deaf individuals. In JSL, sentence types such as affirmative statements and questions are distinguished by facial expressions. In this paper, we propose a JSL recognition method that focuses on facial expressions. Our proposed method utilizes a neural network to analyze facial features and classify sentence types. Through the experiments, we confirm our method’s effectiveness by achieving a classification accuracy of 96.05%.
[CV-60] Activation Map Compression through Tensor Decomposition for Deep Learning
链接: https://arxiv.org/abs/2411.06346
作者: Le-Trung Nguyen,Aël Quélennec,Enzo Tartaglione,Samuel Tardieu,Van-Tam Nguyen
关键词-EN: framework called Edge, Internet of Things, Things and Deep, exponentially growing industrial, growing industrial fields
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Internet of Things and Deep Learning are synergistically and exponentially growing industrial fields with a massive call for their unification into a common framework called Edge AI. While on-device inference is a well-explored topic in recent research, backpropagation remains an open challenge due to its prohibitive computational and memory costs compared to the extreme resource constraints of embedded devices. Drawing on tensor decomposition research, we tackle the main bottleneck of backpropagation, namely the memory footprint of activation map storage. We investigate and compare the effects of activation compression using Singular Value Decomposition and its tensor variant, High-Order Singular Value Decomposition. The application of low-order decomposition results in considerable memory savings while preserving the features essential for learning, and also offers theoretical guarantees of convergence. Experimental results obtained on mainstream architectures and tasks demonstrate Pareto-superiority over other state-of-the-art solutions, in terms of the trade-off between generalization and memory footprint.
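The core memory trick, storing a truncated SVD of each activation instead of the full tensor, can be sketched in a few lines. This toy version factorizes over the batch dimension only; the paper's HOSVD variant and the integration into autograd are omitted, and the rank choice is an assumption.

```python
import torch

def compress_activation(a, rank=8):
    """Keep only rank-`rank` factors of the (batch, features) matrix."""
    mat = a.reshape(a.shape[0], -1)
    U, S, Vh = torch.linalg.svd(mat, full_matrices=False)
    return U[:, :rank], S[:rank], Vh[:rank]

def decompress_activation(U, S, Vh, shape):
    return ((U * S) @ Vh).reshape(shape)       # low-rank reconstruction

a = torch.randn(32, 64, 28, 28)                # an activation map
U, S, Vh = compress_activation(a)
kept = (U.numel() + S.numel() + Vh.numel()) / a.numel()
a_hat = decompress_activation(U, S, Vh, a.shape)
print(f"memory kept: {kept:.1%}, rel. error: {(a - a_hat).norm() / a.norm():.3f}")
```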
[CV-61] CityGuessr: City-Level Video Geo-Localization on a Global Scale ECCV
链接: https://arxiv.org/abs/2411.06344
作者: Parth Parag Kulkarni,Gaurav Kumar Nayak,Mubarak Shah
关键词-EN: current times, Video, Video geolocalization, problem, geolocalization
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to ECVA European Conference on Computer Vision (ECCV) 2024
点击查看摘要
Abstract:Video geolocalization is a crucial problem in current times. Given just a video, ascertaining where it was captured from can have a plethora of advantages. The problem of worldwide geolocalization has been tackled before, but only using the image modality. Its video counterpart remains relatively unexplored. Meanwhile, video geolocalization has also garnered some attention in the recent past, but the existing methods are all restricted to specific regions. This motivates us to explore the problem of video geolocalization at a global scale. Hence, we propose a novel problem of worldwide video geolocalization with the objective of hierarchically predicting the correct city, state/province, country, and continent, given a video. However, no large scale video datasets that have extensive worldwide coverage exist, to train models for solving this problem. To this end, we introduce a new dataset, CityGuessr68k comprising of 68,269 videos from 166 cities all over the world. We also propose a novel baseline approach to this problem, by designing a transformer-based architecture comprising of an elegant Self-Cross Attention module for incorporating scenes as well as a TextLabel Alignment strategy for distilling knowledge from textlabels in feature space. To further enhance our location prediction, we also utilize soft-scene labels. Finally we demonstrate the performance of our method on our new dataset as well as Mapillary(MSLS). Our code and datasets are available at: this https URL
[CV-62] SEM-Net: Efficient Pixel Modelling for image inpainting with Spatially Enhanced SSM WACV2025
链接: https://arxiv.org/abs/2411.06318
作者: Shuang Chen,Haozheng Zhang,Amir Atapour-Abarghouei,Hubert P. H. Shum
关键词-EN: partially damaged image, Image inpainting aims, aims to repair, repair a partially, partially damaged
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by WACV 2025
点击查看摘要
Abstract:Image inpainting aims to repair a partially damaged image based on the information from known regions of the image. Achieving semantically plausible inpainting results is particularly challenging because it requires the reconstructed regions to exhibit similar patterns to the semantically consistent regions. This requires a model with a strong capacity to capture long-range dependencies. Existing models struggle in this regard due to the slow growth of receptive field in Convolutional Neural Network (CNN)-based methods and patch-level interactions in Transformer-based methods, which are ineffective for capturing long-range dependencies. Motivated by this, we propose SEM-Net, a novel visual State Space Model (SSM) vision network, modelling corrupted images at the pixel level while capturing long-range dependencies (LRDs) in state space, achieving a linear computational complexity. To address the inherent lack of spatial awareness in SSM, we introduce the Snake Mamba Block (SMB) and Spatially-Enhanced Feedforward Network. These innovations enable SEM-Net to outperform state-of-the-art inpainting methods on two distinct datasets, showing significant improvements in capturing LRDs and enhancement in spatial consistency. Additionally, SEM-Net achieves state-of-the-art performance on motion deblurring, demonstrating its generalizability. Our source code will be released at this https URL.
[CV-63] Adaptive Aspect Ratios with Patch-Mixup-ViT-based Vehicle ReID
链接: https://arxiv.org/abs/2411.06297
作者: Mei Qiu,Lauren Ann Christopher,Stanley Chien,Lingxi Li
关键词-EN: Vision Transformers, shown exceptional performance, shown exceptional, aspect ratios, Vision
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Vision Transformers (ViTs) have shown exceptional performance in vehicle re-identification (ReID) tasks. However, non-square aspect ratios of image or video inputs can negatively impact re-identification accuracy. To address this challenge, we propose a novel, human perception driven, and general ViT-based ReID framework that fuses models trained on various aspect ratios. Our key contributions are threefold: (i) We analyze the impact of aspect ratios on performance using the VeRi-776 and VehicleID datasets, providing guidance for input settings based on the distribution of original image aspect ratios. (ii) We introduce patch-wise mixup strategy during ViT patchification (guided by spatial attention scores) and implement uneven stride for better alignment with object aspect ratios. (iii) We propose a dynamic feature fusion ReID network to enhance model robustness. Our method outperforms state-of-the-art transformer-based approaches on both datasets, with only a minimal increase in inference time per image.
[CV-64] Hidden in Plain Sight: Evaluating Abstract Shape Recognition in Vision-Language Models
链接: https://arxiv.org/abs/2411.06287
作者: Arshia Hemmat,Adam Davies,Tom A. Lamb,Jianhao Yuan,Philip Torr,Ashkan Khakzar,Francesco Pinto
关键词-EN: early neural image, neural image classifiers, image classifiers relied, early neural, neural image
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Despite the importance of shape perception in human vision, early neural image classifiers relied less on shape information for object recognition than other (often spurious) features. While recent research suggests that current large Vision-Language Models (VLMs) exhibit more reliance on shape, we find them to still be seriously limited in this regard. To quantify such limitations, we introduce IllusionBench, a dataset that challenges current cutting-edge VLMs to decipher shape information when the shape is represented by an arrangement of visual elements in a scene. Our extensive evaluations reveal that, while these shapes are easily detectable by human annotators, current VLMs struggle to recognize them, indicating important avenues for future work in developing more robust visual perception systems. The full dataset and codebase are available at: this https URL
[CV-65] Zero-Shot NAS via the Suppression of Local Entropy Decrease
链接: https://arxiv.org/abs/2411.06236
作者: Ning Wu,Han Huang,Yueting Xu,Zhifeng Hao
关键词-EN: neural architecture search, time-consuming part, part of neural, local entropy, NAS
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
*备注: 8 pages, 2 figures
点击查看摘要
Abstract:Architecture performance evaluation is the most time-consuming part of neural architecture search (NAS). Zero-Shot NAS accelerates the evaluation by utilizing zero-cost proxies instead of training. Though effective, existing zero-cost proxies require invoking backpropagations or running networks on input data, making it difficult to further accelerate the computation of proxies. To alleviate this issue, architecture topologies are used to evaluate the performance of networks in this study. We prove that particular architectural topologies decrease the local entropy of feature maps, which degrades specific features to a bias, thereby reducing network performance. Based on this proof, architectural topologies are utilized to quantify the suppression of local entropy decrease (SED) as a data-free and running-free proxy. Experimental results show that SED outperforms most state-of-the-art proxies in terms of architecture selection on five benchmarks, with computation time reduced by three orders of magnitude. We further compare the SED-based NAS with state-of-the-art proxies. SED-based NAS selects the architecture with higher accuracy and fewer parameters in only one second. The theoretical analyses of local entropy and experimental results demonstrate that the suppression of local entropy decrease facilitates selecting optimal architectures in Zero-Shot NAS.
[CV-66] Crowd3D: Robust Monocular Crowd Reconstruction with Upright Space
链接: https://arxiv.org/abs/2411.06232
作者: Jing Huang,Hao Wen,Tianyi Zhou,Haozhe Lin,Yu-Kun Lai,Kun Li
关键词-EN: achieve globally consistent, globally consistent reconstruction, Human-scene Virtual Interaction, Virtual Interaction Point, unknown camera parameters
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 14 pages including reference
点击查看摘要
Abstract:This paper aims to reconstruct hundreds of people’s 3D poses, shapes, and locations from a single image with unknown camera parameters. Due to the small and highly varying 2D human scales, depth ambiguity, and perspective distortion, no existing methods can achieve globally consistent reconstruction and accurate reprojection. To address these challenges, we first propose Crowd3D, which leverages a new concept, Human-scene Virtual Interaction Point (HVIP), to convert the complex 3D human localization into 2D-pixel localization with robust camera and ground estimation to achieve globally consistent reconstruction. To achieve stable generalization on different camera FoVs without test-time optimization, we propose an extended version, Crowd3D++, which eliminates the influence of camera parameters and the cropping operation by the proposed canonical upright space and ground-aware normalization transform. In the defined upright space, Crowd3D++ also designs an HVIPNet to regress 2D HVIP and infer the depths. Besides, we contribute two benchmark datasets, LargeCrowd and SyntheticCrowd, for evaluating crowd reconstruction in large scenes. The source code and data will be made publicly available after acceptance.
[CV-67] Text2CAD: Text to 3D CAD Generation via Technical Drawings
链接: https://arxiv.org/abs/2411.06206
作者: Mohsen Yavartanoo,Sangmin Hong,Reyhaneh Neshatavar,Kyoung Mu Lee
关键词-EN: industrial Computer-Aided Design, CAD models, CAD, Computer-Aided Design, modern manufacturing
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:The generation of industrial Computer-Aided Design (CAD) models from user requests and specifications is crucial to enhancing efficiency in modern manufacturing. Traditional methods of CAD generation rely heavily on manual inputs and struggle with complex or non-standard designs, making them less suited for dynamic industrial needs. To overcome these challenges, we introduce Text2CAD, a novel framework that employs stable diffusion models tailored to automate the generation process and efficiently bridge the gap between user specifications in text and functional CAD models. This approach directly translates the user's textual descriptions into detailed isometric images, which are then precisely converted into orthographic views, e.g., top, front, and side, providing sufficient information to reconstruct 3D CAD models. This process not only streamlines the creation of CAD models from textual descriptions but also ensures that the resulting models uphold the physical and dimensional consistency essential for practical engineering applications. Our experimental results show that Text2CAD effectively generates technical drawings that are accurately translated into high-quality 3D CAD models, showing substantial potential to revolutionize CAD automation in response to user demands.
[CV-68] Multi-object Tracking by Detection and Query: an efficient end-to-end manner
链接: https://arxiv.org/abs/2411.06197
作者: Shukun Jia,Yichao Cao,Feng Yang,Xin Lu,Xiaobo Lu
关键词-EN: newly emerging tracking, Multi-object tracking, Learnable Associator, detection and newly, newly emerging
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Multi-object tracking is advancing through two dominant paradigms: traditional tracking by detection and newly emerging tracking by query. In this work, we fuse them together and propose the tracking-by-detection-and-query paradigm, which is achieved by a Learnable Associator. Specifically, the basic information interaction module and the content-position alignment module are proposed for thorough information Interaction among object queries. Tracking results are directly Decoded from these queries. Hence, we name the method as LAID. Compared to tracking-by-query models, LAID achieves competitive tracking accuracy with notably higher training efficiency. With regard to tracking-by-detection methods, experimental results on DanceTrack show that LAID significantly surpasses the state-of-the-art heuristic method by 3.9% on HOTA metric and 6.1% on IDF1 metric. On SportsMOT, LAID also achieves the best score on HOTA metric. By holding low training cost, strong tracking capabilities, and an elegant end-to-end approach all at once, LAID presents a forward-looking direction for the field.
[CV-69] LSSInst: Improving Geometric Modeling in LSS-Based BEV Perception with Instance Representation
链接: https://arxiv.org/abs/2411.06173
作者: Weijie Ma,Jingwei Jiang,Yang Yang,Zehui Chen,Hao Chen
关键词-EN: view transformation paradigm, forward view transformation, BEV representation formulated, BEV representation, gained by camera-only
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by 3DV 2025
点击查看摘要
Abstract:With the attention gained by camera-only 3D object detection in autonomous driving, methods based on Bird-Eye-View (BEV) representation especially derived from the forward view transformation paradigm, i.e., lift-splat-shoot (LSS), have recently seen significant progress. The BEV representation formulated by the frustum based on depth distribution prediction is ideal for learning the road structure and scene layout from multi-view images. However, to retain computational efficiency, the compressed BEV representation such as in resolution and axis is inevitably weak in retaining the individual geometric details, undermining the methodological generality and applicability. With this in mind, to compensate for the missing details and utilize multi-view geometry constraints, we propose LSSInst, a two-stage object detector incorporating BEV and instance representations in tandem. The proposed detector exploits fine-grained pixel-level features that can be flexibly integrated into existing LSS-based BEV networks. Having said that, due to the inherent gap between two representation spaces, we design the instance adaptor for the BEV-to-instance semantic coherence rather than pass the proposal naively. Extensive experiments demonstrated that our proposed framework is of excellent generalization ability and performance, which boosts the performances of modern LSS-based BEV perception methods without bells and whistles and outperforms current LSS-based state-of-the-art works on the large-scale nuScenes benchmark.
[CV-70] Scalable Tokenization-Free Diffusion Model Architectures with Efficient Initial Convolution and Fixed-Size Reusable Structures for On-Device Image Generation
链接: https://arxiv.org/abs/2411.06119
作者: Sanchar Palit,Sathya Veera Reddy Dendi,Mallikarjuna Talluri,Raj Narayana Gadde
关键词-EN: Vision Transformers, Vision Transformers require, widely adopted, reusable transformer block, Vision
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 8 pages
点击查看摘要
Abstract:Vision Transformers and U-Net architectures have been widely adopted in the implementation of Diffusion Models. However, each architecture presents specific challenges when realized on-device. Vision Transformers require positional embedding to maintain correspondence between the tokens processed by the transformer, although they offer the advantage of using fixed-size, reusable repetitive blocks following tokenization. The U-Net architecture lacks these attributes, as it utilizes variable-sized intermediate blocks for down-convolution and up-convolution in the noise estimation backbone for the diffusion process. To address these issues, we propose an architecture that utilizes a fixed-size, reusable transformer block as a core structure, making it more suitable for hardware implementation. Our architecture is characterized by low complexity, token-free design, absence of positional embeddings, uniformity, and scalability, making it highly suitable for deployment on mobile and resource-constrained devices. The proposed model exhibits competitive and consistent performance across both unconditional and conditional image generation tasks, and achieves a state-of-the-art FID score of 1.6 on unconditional image generation with the CelebA dataset.
[CV-71] Pattern Integration and Enhancement Vision Transformer for Self-Supervised Learning in Remote Sensing
链接: https://arxiv.org/abs/2411.06091
作者: Kaixuan Lu,Ruiqian Zhang,Xiao Huang,Yuxing Xie,Xiaogang Ning,Hanchao Zhang,Mengke Yuan,Pan Zhang,Tao Wang,Tongkui Liao
关键词-EN: remote sensing images, remote sensing, Recent self-supervised learning, unlabeled remote sensing, sensing images
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Recent self-supervised learning (SSL) methods have demonstrated impressive results in learning visual representations from unlabeled remote sensing images. However, most remote sensing images predominantly consist of scenographic scenes containing multiple ground objects without explicit foreground targets, which limits the performance of existing SSL methods that focus on foreground targets. This raises the question: Is there a method that can automatically aggregate similar objects within scenographic remote sensing images, thereby enabling models to differentiate knowledge embedded in various geospatial patterns for improved feature representation? In this work, we present the Pattern Integration and Enhancement Vision Transformer (PIEViT), a novel self-supervised learning framework designed specifically for remote sensing imagery. PIEViT utilizes a teacher-student architecture to address both image-level and patch-level tasks. It employs the Geospatial Pattern Cohesion (GPC) module to explore the natural clustering of patches, enhancing the differentiation of individual features. The Feature Integration Projection (FIP) module further refines masked token reconstruction using geospatially clustered patches. We validated PIEViT across multiple downstream tasks, including object detection, semantic segmentation, and change detection. Experiments demonstrated that PIEViT enhances the representation of internal patch features, providing significant improvements over existing self-supervised baselines. It achieves excellent results in object detection, land cover classification, and change detection, underscoring its robustness, generalization, and transferability for remote sensing image interpretation tasks.
[CV-72] GlocalCLIP: Object-agnostic Global-Local Prompt Learning for Zero-shot Anomaly Detection
链接: https://arxiv.org/abs/2411.06071
作者: Jiyul Ham,Yonggon Jung,Jun-Geol Baek
关键词-EN: data scarcity arises, Zero-shot anomaly detection, training data, training samples, data scarcity
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 28 pages, 33 figures
点击查看摘要
Abstract:Zero-shot anomaly detection (ZSAD) is crucial for detecting abnormal patterns in target datasets without using training samples, specifically in scenarios where there are distributional differences between the target domain and training data or where data scarcity arises because of restricted access. Although recently pretrained vision-language models demonstrate strong zero-shot performance across various visual tasks, they focus on learning class semantics, which makes their direct application to ZSAD challenging. To address this scenario, we propose GlocalCLIP, which uniquely separates global and local prompts and jointly optimizes them. This approach enables the object-agnostic glocal semantic prompt design to effectively capture general normal and anomalous patterns without dependency on specific objects in the image. We refine the text prompts for more precise adjustments by utilizing deep-text prompt tuning in the text encoder. In the vision encoder, we apply V-V attention layers to capture detailed local image features. Finally, we introduce glocal contrastive learning to improve the complementary learning of global and local prompts, effectively detecting abnormal patterns across various domains. The generalization performance of GlocalCLIP in ZSAD was demonstrated on 15 real-world datasets from both the industrial and medical domains, achieving superior performance compared to existing methods.
[CV-73] AI-Driven Stylization of 3D Environments
链接: https://arxiv.org/abs/2411.06067
作者: Yuanbo Chen,Yixiao Kang,Yukun Song,Cyrus Vachha,Sining Huang
关键词-EN: Gaussian Splatting, higher fidelity, representations like NeRFs, primitive objects, Gaussian
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注:
点击查看摘要
Abstract:In this system, we discuss methods to stylize a scene of 3D primitive objects into a higher fidelity 3D scene using novel 3D representations like NeRFs and 3D Gaussian Splatting. Our approach leverages existing image stylization systems and image-to-3D generative models to create a pipeline that iteratively stylizes and composites 3D objects into scenes. We show our results on adding generated objects into a scene and discuss limitations.
[CV-74] Dynamic Textual Prompt For Rehearsal-free Lifelong Person Re-identification
链接: https://arxiv.org/abs/2411.06023
作者: Hongyu Chen,Bingliang Jiao,Wenxuan Wang,Peng Wang
关键词-EN: Lifelong person re-identification, person re-identification attempts, Lifelong person, continuous data streams, person re-identification
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 12 pages, 6 figures
点击查看摘要
Abstract:Lifelong person re-identification attempts to recognize people across cameras and integrate new knowledge from continuous data streams. Key challenges involve addressing catastrophic forgetting caused by parameter updating and domain shift, and maintaining performance in seen and unseen domains. Many previous works rely on data memories to retain prior samples. However, the amount of retained data increases linearly with the number of training domains, leading to continually increasing memory consumption. Additionally, these methods may suffer significant performance degradation when data preservation is prohibited due to privacy concerns. To address these limitations, we propose using textual descriptions as guidance to encourage the ReID model to learn cross-domain invariant features without retaining samples. The key insight is that natural language can describe pedestrian instances with an invariant style, suggesting a shared textual space for any pedestrian images. By leveraging this shared textual space as an anchor, we can prompt the ReID model to embed images from various domains into a unified semantic space, thereby alleviating catastrophic forgetting caused by domain shifts. To achieve this, we introduce a task-driven dynamic textual prompt framework in this paper. This model features a dynamic prompt fusion module, which adaptively constructs and fuses two different textual prompts as anchors. This effectively guides the ReID model to embed images into a unified semantic space. Additionally, we design a text-visual feature alignment module to learn a more precise mapping between fine-grained visual and textual features. We also developed a learnable knowledge distillation module that allows our model to dynamically balance retaining existing knowledge with acquiring new knowledge. Extensive experiments demonstrate that our method outperforms SOTAs under various settings.
[CV-75] GaussianSpa: An “Optimizing-Sparsifying” Simplification Framework for Compact and High-Quality 3D Gaussian Splatting
链接: https://arxiv.org/abs/2411.06019
作者: Yangming Zhang,Wenqi Jia,Wei Niu,Miao Yin
关键词-EN: leveraging continuous aggregations, model scene geometry, Gaussian Splatting, view synthesis, leveraging continuous
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注: Project page at this https URL
点击查看摘要
Abstract:3D Gaussian Splatting (3DGS) has emerged as a mainstream approach for novel view synthesis, leveraging continuous aggregations of Gaussian functions to model scene geometry. However, 3DGS suffers from substantial memory requirements to store the multitude of Gaussians, hindering its practicality. To address this challenge, we introduce GaussianSpa, an optimization-based simplification framework for compact and high-quality 3DGS. Specifically, we formulate the simplification as an optimization problem associated with the 3DGS training. Correspondingly, we propose an efficient “optimizing-sparsifying” solution that alternately solves two independent sub-problems, gradually imposing strong sparsity onto the Gaussians in the training process. Our comprehensive evaluations on various datasets show the superiority of GaussianSpa over existing state-of-the-art approaches. Notably, GaussianSpa achieves an average PSNR improvement of 0.9 dB on the real-world Deep Blending dataset with 10× fewer Gaussians compared to the vanilla 3DGS. Our project page is available at this https URL.
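In its simplest form, the alternating “optimizing-sparsifying” scheme reduces to interleaving gradient steps with a hard top-k projection on the Gaussians' opacities. The toy below fits a sparse vector just to show the mechanic; the projection interval, the keep budget, and using opacity as the pruning score are assumptions on our part.

```python
import torch

torch.manual_seed(0)
target = torch.zeros(1000); target[:100] = torch.rand(100)  # sparse ground truth
opacities = torch.rand(1000, requires_grad=True)
opt = torch.optim.Adam([opacities], lr=1e-2)

for step in range(500):
    opt.zero_grad()
    loss = (opacities - target).pow(2).mean()  # stand-in for the rendering loss
    loss.backward()
    opt.step()                                 # "optimizing" sub-problem
    if step % 10 == 0:                         # "sparsifying" sub-problem:
        with torch.no_grad():                  # project onto a top-k support
            cutoff = opacities.topk(150).values[-1]
            opacities[opacities < cutoff] = 0.0

print("surviving Gaussians:", (opacities.abs() > 1e-3).sum().item())
```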
[CV-76] A Modular Conditional Diffusion Framework for Image Reconstruction
链接: https://arxiv.org/abs/2411.05993
作者: Magauiya Zhussip,Iaroslav Koshelev,Stamatis Lefkimmiatis
关键词-EN: blind image restoration, demonstrated outstanding performance, Diffusion Probabilistic Models, image restoration, recently utilized
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Diffusion Probabilistic Models (DPMs) have been recently utilized to deal with various blind image restoration (IR) tasks, where they have demonstrated outstanding performance in terms of perceptual quality. However, the task-specific nature of existing solutions and the excessive computational costs related to their training, make such models impractical and challenging to use for different IR tasks than those that were initially trained for. This hinders their wider adoption, especially by those who lack access to powerful computational resources and vast amount of training data. In this work we aim to address the above issues and enable the successful adoption of DPMs in practical IR-related applications. Towards this goal, we propose a modular diffusion probabilistic IR framework (DP-IR), which allows us to combine the performance benefits of existing pre-trained state-of-the-art IR networks and generative DPMs, while it requires only the additional training of a relatively small module (0.7M params) related to the particular IR task of interest. Moreover, the architecture of the proposed framework allows for a sampling strategy that leads to at least four times reduction of neural function evaluations without suffering any performance loss, while it can also be combined with existing acceleration techniques such as DDIM. We evaluate our model on four benchmarks for the tasks of burst JDD-SR, dynamic scene deblurring, and super-resolution. Our method outperforms existing approaches in terms of perceptual quality while it retains a competitive performance with respect to fidelity metrics.
[CV-77] Emotional Images: Assessing Emotions in Images and Potential Biases in Generative Models
链接: https://arxiv.org/abs/2411.05985
作者: Maneet Mehta,Cody Buntain
关键词-EN: generative artificial intelligence, paper examines potential, artificial intelligence, Google Vision Transformer, paper examines
类目: Computers and Society (cs.CY); Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:This paper examines potential biases and inconsistencies in emotional evocation of images produced by generative artificial intelligence (AI) models and their potential bias toward negative emotions. In particular, we assess this bias by comparing the emotions evoked by an AI-produced image to the emotions evoked by prompts used to create those images. As a first step, the study evaluates three approaches for identifying emotions in images – traditional supervised learning, zero-shot learning with vision-language models, and cross-modal auto-captioning – using EmoSet, a large dataset of image-emotion annotations that categorizes images across eight emotional types. Results show fine-tuned models, particularly Google’s Vision Transformer (ViT), significantly outperform zero-shot and caption-based methods in recognizing emotions in images. For a cross-modality comparison, we then analyze the differences between emotions in text prompts – via existing text-based emotion-recognition models – and the emotions evoked in the resulting images. Findings indicate that AI-generated images frequently lean toward negative emotional content, regardless of the original prompt. This emotional skew in generative models could amplify negative affective content in digital spaces, perpetuating its prevalence and impact. The study advocates for a multidisciplinary approach to better align AI emotion recognition with psychological insights and address potential biases in generative AI outputs across digital media.
[CV-78] Utilisation of Vision Systems and Digital Twin for Maintaining Cleanliness in Public Spaces ICCV
链接: https://arxiv.org/abs/2411.05964
作者: Mateusz Wasala,Krzysztof Blachut,Hubert Szolc,Marcin Kowalczyk,Michal Danilowicz,Tomasz Kryjak
关键词-EN: high cleanliness standards, maintaining high cleanliness, Digital Twin technology, public spaces results, Digital Twin
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注: Accepted for the ICCVG 2024: International Conference on Computer Vision and Graphics, Poland
点击查看摘要
Abstract:Nowadays, the increasing demand for maintaining high cleanliness standards in public spaces results in the search for innovative solutions. The deployment of CCTV systems equipped with modern cameras and software enables not only real-time monitoring of the cleanliness status but also automatic detection of impurities and optimisation of cleaning schedules. The Digital Twin technology allows for the creation of a virtual model of the space, facilitating the simulation, training, and testing of cleanliness management strategies before implementation in the real world. In this paper, we present the utilisation of advanced vision surveillance systems and the Digital Twin technology in cleanliness management, using a railway station as an example. The Digital Twin was created based on an actual 3D model in the Nvidia Omniverse Isaac Sim simulator. A litter detector, bin occupancy level detector, stain segmentation, and a human detector (including the cleaning crew) along with their movement analysis were implemented. A preliminary assessment was conducted, and potential modifications for further enhancement and future development of the system were identified.
[CV-79] Querying Perception Streams with Spatial Regular Expressions
链接: https://arxiv.org/abs/2411.05946
作者: Jacob Anderson,Georgios Fainekos,Bardh Hoxha,Hideki Okamoto,Danil Prokhorov
关键词-EN: analysis generates large, generates large volumes, data analysis generates, fields like robotics, analysis generates
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Formal Languages and Automata Theory (cs.FL)
*备注: This work has been submitted to the International Journal on Software Tools for Technology Transfer
点击查看摘要
Abstract:Perception in fields like robotics, manufacturing, and data analysis generates large volumes of temporal and spatial data to effectively capture their environments. However, sorting through this data for specific scenarios is a meticulous and error-prone process, often dependent on the application, and lacks generality and reproducibility. In this work, we introduce SpREs as a novel querying language for pattern matching over perception streams containing spatial and temporal data derived from multi-modal dynamic environments. To highlight the capabilities of SpREs, we developed the STREM tool as both an offline and online pattern matching framework for perception data. We demonstrate the offline capabilities of STREM through a case study on a publicly available AV dataset (Woven Planet Perception) and its online capabilities through a case study integrating STREM in ROS with the CARLA simulator. We also conduct performance benchmark experiments on various SpRE queries. Using our matching framework, we are able to find over 20,000 matches within 296 ms making STREM applicable in runtime monitoring applications.
[CV-80] ViT Enhanced Privacy-Preserving Secure Medical Data Sharing and Classification
链接: https://arxiv.org/abs/2411.05901
作者: Al Amin,Kamrul Hasan,Sharif Ullah,M. Shamim Hossain
关键词-EN: minimizing computational overhead, medical image analysis, secure data sharing, sharing are critical, image analysis
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 2 pages, 2 figures
点击查看摘要
Abstract:Privacy-preserving and secure data sharing are critical for medical image analysis, while maintaining accuracy and minimizing computational overhead are also crucial. Applying existing deep neural networks (DNNs) to encrypted medical data is not always easy and often compromises performance and security. To address these limitations, this research introduces a secure framework consisting of a learnable encryption method based on the block-pixel operation to encrypt the data, which is subsequently integrated with the Vision Transformer (ViT). The proposed framework ensures data privacy and security by creating unique scrambling patterns per key, providing robust performance against leading bit attacks and minimum difference attacks.
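A block-pixel style transform is straightforward to prototype: one key seeds both a block permutation and a pixel shuffle within blocks, and a ViT can still learn on the result because its patch embedding is block-wise to begin with. The block size and the exact pair of scrambles below are assumptions; the paper's learnable variant is more involved.

```python
import numpy as np

def block_pixel_encrypt(img, key, block=16):
    rng = np.random.default_rng(key)           # unique scrambling per key
    H, W, C = img.shape
    gh, gw = H // block, W // block
    blocks = (img.reshape(gh, block, gw, block, C)
                 .transpose(0, 2, 1, 3, 4)
                 .reshape(gh * gw, block * block * C))
    blocks = blocks[rng.permutation(len(blocks))]           # shuffle block order
    blocks = blocks[:, rng.permutation(blocks.shape[1])]    # shuffle pixels per block
    return (blocks.reshape(gh, gw, block, block, C)
                  .transpose(0, 2, 1, 3, 4)
                  .reshape(H, W, C))

enc = block_pixel_encrypt(np.random.rand(224, 224, 3), key=42)
print(enc.shape)  # (224, 224, 3)
```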
[CV-81] Enhancing Cardiovascular Disease Prediction through Multi-Modal Self-Supervised Learning BMVC
链接: https://arxiv.org/abs/2411.05900
作者: Francesco Girlanda,Olga Demler,Bjoern Menze,Neda Davoudi
关键词-EN: diseases remains imperative, Accurate prediction, cardiac magnetic resonance, diagnosis and intervention, necessitating robust
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted to British Machine Vision Conference (BMVC) 2024
点击查看摘要
Abstract:Accurate prediction of cardiovascular diseases remains imperative for early diagnosis and intervention, necessitating robust and precise predictive models. Recently, there has been a growing interest in multi-modal learning for uncovering novel insights not available through uni-modal datasets alone. By combining cardiac magnetic resonance images, electrocardiogram signals, and available medical information, our approach enables the capture of holistic status about individuals' cardiovascular health by leveraging shared information across modalities. Integrating information from multiple modalities and benefiting from self-supervised learning techniques, our model provides a comprehensive framework for enhancing cardiovascular disease prediction with limited annotated datasets. We employ a masked autoencoder to pre-train the electrocardiogram (ECG) encoder, enabling it to extract relevant features from raw electrocardiogram data, and an image encoder to extract relevant features from cardiac magnetic resonance images. Subsequently, we utilize a multi-modal contrastive learning objective to transfer knowledge from an expensive and complex modality, cardiac magnetic resonance imaging, to cheap and simple modalities such as electrocardiograms and medical information. Finally, we fine-tuned the pre-trained encoders on specific predictive tasks, such as myocardial infarction. Our proposed method enhanced the image information by leveraging different available modalities and outperformed the supervised approach by 7.6% in balanced accuracy.
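The cross-modal transfer step is essentially a CLIP-style symmetric InfoNCE between ECG and CMR embeddings of the same subject; a hedged sketch follows (the temperature value and the symmetrized form are assumptions).

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(ecg_emb, cmr_emb, temperature=0.1):
    """ecg_emb, cmr_emb: (B, D) embeddings of the same B subjects; matching
    rows are positives, everything else in the batch is a negative."""
    ecg = F.normalize(ecg_emb, dim=-1)
    cmr = F.normalize(cmr_emb, dim=-1)
    logits = ecg @ cmr.t() / temperature       # (B, B) similarity matrix
    labels = torch.arange(len(ecg))            # positives on the diagonal
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2
```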
[CV-82] Predictive Digital Twin for Condition Monitoring Using Thermal Imaging
链接: https://arxiv.org/abs/2411.05887
作者: Daniel Menges,Florian Stadtmann,Henrik Jordheim,Adil Rasheed
关键词-EN: Proper Orthogonal Decomposition, Dynamic Mode Decomposition, Principal Component Analysis, twin specifically designed, Robust Principal Component
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:This paper explores the development and practical application of a predictive digital twin specifically designed for condition monitoring, using advanced mathematical models and thermal imaging techniques. Our work presents a comprehensive approach to integrating Proper Orthogonal Decomposition (POD), Robust Principal Component Analysis (RPCA), and Dynamic Mode Decomposition (DMD) to establish a robust predictive digital twin framework. We employ these methods in a real-time experimental setup involving a heated plate monitored through thermal imaging. This system effectively demonstrates the digital twin’s capabilities in real-time predictions, condition monitoring, and anomaly detection. Additionally, we introduce the use of a human-machine interface that includes virtual reality, enhancing user interaction and system understanding. The primary contributions of our research lie in the demonstration of these advanced techniques in a tangible setup, showcasing the potential of digital twins to transform industry practices by enabling more proactive and strategic asset management.
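Of the three ingredients, DMD is the one that gives the digital twin its predictive power: it fits a linear operator advancing one thermal frame to the next. The sketch below is the textbook exact-DMD algorithm on random stand-in data, not the paper's implementation; the rank is an assumption.

```python
import numpy as np

def dmd(X, rank=5):
    """Exact DMD on a snapshot matrix X (columns = flattened frames):
    returns spatial modes and eigenvalues (|lambda| > 1 grows, < 1 decays)."""
    X1, X2 = X[:, :-1], X[:, 1:]                 # snapshot pairs x_k -> x_{k+1}
    U, S, Vh = np.linalg.svd(X1, full_matrices=False)
    U, S, Vh = U[:, :rank], S[:rank], Vh[:rank]  # rank-truncated POD basis
    A_tilde = U.conj().T @ X2 @ Vh.conj().T / S  # operator in POD coordinates
    eigvals, W = np.linalg.eig(A_tilde)          # per-mode temporal dynamics
    modes = (X2 @ Vh.conj().T / S) @ W           # exact DMD spatial modes
    return modes, eigvals

frames = np.random.rand(64 * 64, 100)            # 100 flattened 64x64 thermal frames
modes, eigvals = dmd(frames)
print(modes.shape, np.abs(eigvals).round(2))
```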
[CV-83] Smile upon the Face but Sadness in the Eyes: Emotion Recognition based on Facial Expressions and Eye Behaviors
链接: https://arxiv.org/abs/2411.05879
作者: Yuanyuan Liu,Lin Wei,Kejun Liu,Yibing Zhan,Zijing Chen,Zhe Chen,Shiguang Shan
关键词-EN: facial expression recognition, Multimodal Emotion Recognition, Emotion Recognition, facial expressions, eye behaviors
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Emotion Recognition (ER) is the process of identifying human emotions from given data. Currently, the field heavily relies on facial expression recognition (FER) because facial expressions contain rich emotional cues. However, it is important to note that facial expressions may not always precisely reflect genuine emotions, and FER-based results may yield misleading ER. To understand and bridge this gap between FER and ER, we introduce eye behaviors as important emotional cues for the creation of a new Eye-behavior-aided Multimodal Emotion Recognition (EMER) dataset. Different from existing multimodal ER datasets, the EMER dataset employs a stimulus material-induced spontaneous emotion generation method to integrate non-invasive eye behavior data, like eye movements and eye fixation maps, with facial videos, aiming to obtain natural and accurate human emotions. Notably, for the first time, we provide annotations for both ER and FER in the EMER, enabling a comprehensive analysis to better illustrate the gap between both tasks. Furthermore, we specifically design a new EMERT architecture to concurrently enhance performance in both ER and FER by efficiently identifying and bridging the emotion gap between the two tasks. Specifically, our EMERT employs modality-adversarial feature decoupling and a multi-task Transformer to augment the modeling of eye behaviors, thus providing an effective complement to facial expressions. In the experiment, we introduce seven multimodal benchmark protocols for a variety of comprehensive evaluations of the EMER dataset. The results show that the EMERT outperforms other state-of-the-art multimodal methods by a large margin, revealing the importance of modeling eye behaviors for robust ER. To sum up, we provide a comprehensive analysis of the importance of eye behaviors in ER, advancing the study on addressing the gap between FER and ER for more robust ER performance.
[CV-84] Joint-Optimized Unsupervised Adversarial Domain Adaptation in Remote Sensing Segmentation with Prompted Foundation Model
链接: https://arxiv.org/abs/2411.05878
作者: Shuchang Lyu,Qi Zhaoa,Guangliang Cheng,Yiwei He,Zheng Zhou,Guangbiao Wang,Zhenwei Shi
关键词-EN: Sensing Semantic Segmentation, Unsupervised Domain Adaptation, Remote Sensing Semantic, remote sensing scenes, Semantic Segmentation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 12 pages,6 figures, 6 tables
点击查看摘要
Abstract:Unsupervised Domain Adaptation for Remote Sensing Semantic Segmentation (UDA-RSSeg) addresses the challenge of adapting a model trained on source domain data to target domain samples, thereby minimizing the need for annotated data across diverse remote sensing scenes. This task presents two principal challenges: (1) severe inconsistencies in feature representation across different remote sensing domains, and (2) a domain gap that emerges due to the representation bias of source domain patterns when translating features to predictive logits. To tackle these issues, we propose a joint-optimized adversarial network incorporating the “Segment Anything Model (SAM) (SAM-JOANet)” for UDA-RSSeg. Our approach integrates SAM to leverage its robust generalized representation capabilities, thereby alleviating feature inconsistencies. We introduce a finetuning decoder designed to convert SAM-Encoder features into predictive logits. Additionally, a feature-level adversarial-based prompted segmentor is employed to generate class-agnostic maps, which guide the finetuning decoder’s feature representations. The network is optimized end-to-end, combining the prompted segmentor and the finetuning decoder. Extensive evaluations on benchmark datasets, including ISPRS (Potsdam/Vaihingen) and CITY-OSM (Paris/Chicago), demonstrate the effectiveness of our method. The results, supported by visualization and analysis, confirm the method’s interpretability and robustness. The code of this paper is available at this https URL.
[CV-85] Poor Mans Training on MCUs: A Memory-Efficient Quantized Back-Propagation-Free Approach
链接: https://arxiv.org/abs/2411.05873
作者: Yequan Zhao,Hai Li,Ian Young,Zheng Zhang
关键词-EN: Back propagation, neural network training, training, default solution, computation in neural
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Neural and Evolutionary Computing (cs.NE)
*备注:
点击查看摘要
Abstract:Back propagation (BP) is the default solution for gradient computation in neural network training. However, implementing BP-based training on various edge devices such as FPGA, microcontrollers (MCUs), and analog computing platforms face multiple major challenges, such as the lack of hardware resources, long time-to-market, and dramatic errors in a low-precision setting. This paper presents a simple BP-free training scheme on an MCU, which makes edge training hardware design as easy as inference hardware design. We adopt a quantized zeroth-order method to estimate the gradients of quantized model parameters, which can overcome the error of a straight-through estimator in a low-precision BP scheme. We further employ a few dimension reduction methods (e.g., node perturbation, sparse training) to improve the convergence of zeroth-order training. Experiment results show that our BP-free training achieves comparable performance as BP-based training on adapting a pre-trained image classifier to various corrupted data on resource-constrained edge devices (e.g., an MCU with 1024-KB SRAM for dense full-model training, or an MCU with 256-KB SRAM for sparse training). This method is most suitable for application scenarios where memory cost and time-to-market are the major concerns, but longer latency can be tolerated.
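The heart of the method is a zeroth-order gradient estimate: two forward evaluations replace the backward pass, so no activations need to be stored. Below is a plain SPSA-style estimator without the paper's quantization and dimension-reduction tricks, which is a simplification on our part.

```python
import torch

def spsa_grad(loss_fn, params, eps=1e-2):
    """Estimate the gradient of loss_fn at `params` from two forward
    passes along a random +/-1 perturbation (no backpropagation)."""
    delta = torch.randint(0, 2, params.shape).float() * 2 - 1  # Rademacher
    diff = loss_fn(params + eps * delta) - loss_fn(params - eps * delta)
    return diff / (2 * eps) * delta            # unbiased up to O(eps^2)

theta = torch.randn(10)
loss = lambda p: (p ** 2).sum()                # toy objective, true grad = 2*theta
print(spsa_grad(loss, theta)[:3])
print((2 * theta)[:3])                         # compare directions
```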
[CV-86] Saliency Assisted Quantization for Neural Networks
链接: https://arxiv.org/abs/2411.05858
作者: Elmira Mousa Rezabeyk,Salar Beigzad,Yasin Hamzavi,Mohsen Bagheritabar,Seyedeh Sogol Mirikhoozani
关键词-EN: Deep learning methods, Deep learning, image classification, significant place, place in image
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Deep learning methods have established a significant place in image classification. While prior research has focused on enhancing final outcomes, the opaque nature of the decision-making process in these models remains a concern for experts. Additionally, the deployment of these methods can be problematic in resource-limited environments. This paper tackles the inherent black-box nature of these models by providing real-time explanations during the training phase, compelling the model to concentrate on the most distinctive and crucial aspects of the input. Furthermore, we employ established quantization techniques to address resource constraints. To assess the effectiveness of our approach, we explore how quantization influences the interpretability and accuracy of Convolutional Neural Networks through a comparative analysis of saliency maps from standard and quantized models. Quantization is implemented during the training phase using the Parameterized Clipping Activation method, with a focus on the MNIST and FashionMNIST benchmark datasets. We evaluated three bit-width configurations (2-bit, 4-bit, and mixed 4/2-bit) to explore the trade-off between efficiency and interpretability, with each configuration designed to highlight varying impacts on saliency map clarity and model accuracy. The results indicate that while quantization is crucial for implementing models on resource-limited devices, it necessitates a trade-off between accuracy and interpretability. Lower bit-widths result in more pronounced reductions in both metrics, highlighting the necessity of meticulous quantization parameter selection in applications where model transparency is paramount. The study underscores the importance of achieving a balance between efficiency and interpretability in the deployment of neural networks.
[CV-87] Learning Morphisms with Gauss-Newton Approximation for Growing Networks
Link: https://arxiv.org/abs/2411.05855
Authors: Neal Lawton,Aram Galstyan,Greg Ver Steeg
Keywords-EN: Neural Architecture Search, Architecture Search, Neural Architecture, called network morphisms, architecture called network
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
*Comments: 12 pages, 4 figures
Abstract:A popular method for Neural Architecture Search (NAS) is based on growing networks via small local changes to the network's architecture called network morphisms. These methods start with a small seed network and progressively grow the network by adding new neurons in an automated way. However, it remains a challenge to efficiently determine which parts of the network are best to grow. Here we propose a NAS method for growing a network by using a Gauss-Newton approximation of the loss function to efficiently learn and evaluate candidate network morphisms. We compare our method with state-of-the-art NAS methods on CIFAR-10 and CIFAR-100 classification tasks, and conclude that our method learns architectures of similar or better quality at a smaller computational cost.
[CV-88] Reducing catastrophic forgetting of incremental learning in the absence of rehearsal memory with task-specific token
Link: https://arxiv.org/abs/2411.05846
Authors: Young Jo Choi,Min Kyoon Yoo,Yu Rang Park
Keywords-EN: generally display catastrophic, Deep learning models, models generally display, display catastrophic forgetting, Deep learning
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*Comments:
Abstract:Deep learning models generally display catastrophic forgetting when learning new data continuously. Many incremental learning approaches address this problem by reusing data from previous tasks while learning new tasks. However, the direct access to past data generates privacy and security concerns. To address these issues, we present a novel method that preserves previous knowledge without storing previous data. This method is inspired by the architecture of a vision transformer and employs a unique token capable of encapsulating the compressed knowledge of each task. This approach generates task-specific embeddings by directing attention differently based on the task associated with the data, thereby effectively mimicking the impact of having multiple models through tokens. Our method incorporates a distillation process that ensures efficient interactions even after multiple additional learning steps, thereby optimizing the model against forgetting. We measured the performance of our model in terms of accuracy and backward transfer using a benchmark dataset for different task-incremental learning scenarios. Our results demonstrate the superiority of our approach, which achieved the highest accuracy and lowest backward transfer among the compared methods. In addition to presenting a new model, our approach lays the foundation for various extensions within the spectrum of vision-transformer architectures.
[CV-89] StegaVision: Enhancing Steganography with Attention Mechanism AAAI-25
Link: https://arxiv.org/abs/2411.05838
Authors: Abhinav Kumar,Pratham Singla,Aayan Yadav
Keywords-EN: spatial attention, attention, spatial, Image steganography, Image
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments: AAAI-25 Student Abstract
Abstract:Image steganography is the technique of embedding secret information within images. The development of deep learning has led to significant advances in this field. However, existing methods often struggle to balance image quality, embedding capacity, and security. This paper proposes a novel approach to image steganography by enhancing an encoder-decoder architecture with attention mechanisms, specifically focusing on channel and spatial attention modules. We systematically investigate five configurations: (1) channel attention, (2) spatial attention, (3) sequential channel followed by spatial attention, (4) spatial attention followed by channel attention and (5) parallel channel and spatial attention. Our experiments show that adding attention mechanisms improves the ability to embed hidden information while maintaining the visual quality of the images. The increase in the PSNR and SSIM scores shows that using a parallel combination of channel and spatial attention improves image quality and hiding capacity simultaneously. This is in contrast to previous works where there is a tradeoff between them. This study shows that attention mechanisms in image steganography lead to better hiding of secret information. Our code is available at this https URL.
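For readers unfamiliar with the configurations compared above, here is one plausible PyTorch sketch of the parallel channel-and-spatial variant (configuration 5). How the two attended branches are fused (a sum here) is an assumption; the paper may combine them differently.

```python
import torch
import torch.nn as nn

class ParallelAttention(nn.Module):
    # Channel and spatial attention computed side by side, then merged.
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.channel = nn.Sequential(          # squeeze-and-excite style
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        self.spatial = nn.Sequential(          # 7x7 conv over pooled maps
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x):
        ca = self.channel(x)                               # (B, C, 1, 1)
        pooled = torch.cat([x.mean(1, keepdim=True),
                            x.amax(1, keepdim=True)], dim=1)
        sa = self.spatial(pooled)                          # (B, 1, H, W)
        return x * ca + x * sa                             # parallel fusion

feats = torch.randn(2, 32, 64, 64)
print(ParallelAttention(32)(feats).shape)  # torch.Size([2, 32, 64, 64])
```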
[CV-90] On the Trade-Off between Stability and Fidelity of Gaussian-Smoothed Saliency Maps
Link: https://arxiv.org/abs/2411.05837
Authors: Zhuorui Ye,Farzan Farnia
Keywords-EN: neural network classifiers, Gradient-based saliency maps, gradient-based maps, saliency maps, Standard gradient-based maps
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments:
Abstract:Gradient-based saliency maps have been widely used to interpret the decisions of neural network classifiers and discover phenomena from their learned functions. Standard gradient-based maps are frequently observed to be highly sensitive to the randomness of training data and the stochasticity in the training process. In this work, we study the role of randomized smoothing in the well-known Smooth-Grad algorithm in the stability of the gradient-based maps to the randomness of training samples. We extend the algorithmic stability framework to gradient-based saliency maps and prove bounds on the stability error of standard Simple-Grad, Integrated-Gradients, and Smooth-Grad saliency maps. Our theoretical results suggest the role of Gaussian smoothing in boosting the stability of gradient-based maps to the randomness of training settings. On the other hand, we analyze the faithfulness of the Smooth-Grad maps to the original Simple-Grad and show the lower fidelity under a more intense Gaussian smoothing. We support our theoretical results by performing several numerical experiments on standard image datasets. Our empirical results confirm our hypothesis on the fidelity-stability trade-off in the application of Gaussian smoothing to gradient-based interpretation maps.
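The Smooth-Grad procedure analyzed here is compact enough to state in code: average the input gradients over several Gaussian-perturbed copies of the input. A minimal PyTorch sketch with a toy classifier follows; the model and names are illustrative only.

```python
import torch

def smooth_grad(model, x, target, sigma=0.1, n_samples=25):
    # Smooth-Grad: mean input gradient over Gaussian perturbations of x.
    grads = torch.zeros_like(x)
    for _ in range(n_samples):
        noisy = (x + sigma * torch.randn_like(x)).requires_grad_(True)
        score = model(noisy)[:, target].sum()
        grads += torch.autograd.grad(score, noisy)[0]
    return grads / n_samples

model = torch.nn.Sequential(torch.nn.Flatten(),
                            torch.nn.Linear(28 * 28, 10)).eval()
saliency = smooth_grad(model, torch.randn(1, 1, 28, 28), target=3)
print(saliency.shape)  # torch.Size([1, 1, 28, 28])
```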
[CV-91] Prion-ViT: Prions-Inspired Vision Transformers for Temperature prediction with Specklegrams
Link: https://arxiv.org/abs/2411.05836
Authors: Abhishek Sebastian,Pragna R
Keywords-EN: Fiber Specklegram Sensors, data poses significant, conventional predictive models, poses significant challenges, environmental monitoring due
Categories: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP); Optics (physics.optics)
*Comments:
Abstract:Fiber Specklegram Sensors (FSS) are widely used in environmental monitoring due to their high sensitivity to temperature fluctuations, yet the complex, nonlinear nature of specklegram data poses significant challenges for conventional predictive models. This study introduces a novel Prion-Vision Transformer (Prion-ViT) model, inspired by biological prion memory mechanisms, to enhance long-term dependency modeling for accurate temperature prediction using FSS data. By leveraging a persistent memory state, the Prion-ViT effectively retains and propagates essential features across multiple layers, thereby improving prediction accuracy and reducing the mean absolute error (MAE) to 0.52 °C, outperforming traditional models like ResNet, Inception Net V2, and existing transformer-based architectures. The study addresses the specific challenges of applying Vision Transformers (ViTs) to FSS data and demonstrates that the prion-inspired memory mechanism offers a robust solution for capturing complex optical interference patterns in specklegrams. These findings establish Prion-ViT as a promising advancement for real-time industrial temperature monitoring applications, with potential applicability to other optical sensing domains.
[CV-92] A Theory of Stabilization by Skull Carving
Link: https://arxiv.org/abs/2411.05827
Authors: Mathieu Lamarre,Patrick Anderson,Étienne Danvoye
Keywords-EN: photoreal avatar construction, training data collection, virtual reality, essential for applications, applications in photoreal
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments: 4 pages, 3 figures
Abstract:Accurate stabilization of facial motion is essential for applications in photoreal avatar construction for 3D games, virtual reality, movies, and training data collection. For the latter, stabilization must work automatically for the general population with people of varying morphology. Distinguishing rigid skull motion from facial expressions is critical, since misalignment between skull motion and facial expressions can lead to animation models that are hard to control and cannot fit natural motion. Existing methods struggle to work with sparse sets of very different expressions, such as when combining multiple units from the Facial Action Coding System (FACS). Certain approaches are not robust enough: some depend on motion data to find stable points, while others rely on invalid one-size-fits-all physiological assumptions. In this paper, we leverage recent advances in neural signed distance fields and differentiable isosurface meshing to compute skull-stabilization rigid transforms directly on unstructured triangle meshes or point clouds, significantly enhancing accuracy and robustness. We introduce the concept of a stable hull as the surface of the boolean intersection of stabilized scans, analogous to the visual hull in shape-from-silhouette and the photo hull from space carving. This hull resembles a skull overlaid with minimal soft-tissue thickness; upper teeth are automatically included. Our skull carving algorithm simultaneously optimizes the stable hull shape and rigid transforms to get accurate stabilization of complex expressions for large diverse sets of people, outperforming existing methods.
[CV-93] SPACE: SPAtial-aware Consistency rEgularization for anomaly detection in Industrial applications WACV2025
Link: https://arxiv.org/abs/2411.05822
Authors: Daehwan Kim,Hyungmin Kim,Daun Jeong,Sungho Suh,Hansang Cho
Keywords-EN: Feature Encoder, anomaly detection methodology, Spatial Consistency regularization, Consistency regularization Loss, methodology that integrates
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments: Accepted to WACV 2025
Abstract:In this paper, we propose SPACE, a novel anomaly detection methodology that integrates a Feature Encoder (FE) into the structure of the Student-Teacher method. The proposed method has two key elements: Spatial Consistency regularization Loss (SCL) and Feature converter Module (FM). SCL prevents overfitting in student models by avoiding excessive imitation of the teacher model. Simultaneously, it facilitates the expansion of normal data features by steering clear of abnormal areas generated through data augmentation. This dual functionality ensures a robust boundary between normal and abnormal data. The FM prevents the learning of ambiguous information from the FE. This protects the learned features and enables more effective detection of structural and logical anomalies. Through these elements, SPACE is able to minimize the influence of the FE while integrating various data augmentations. In this study, we evaluated the proposed method on the MVTec LOCO, MVTec AD, and VisA datasets. Experimental results, through qualitative evaluation, demonstrate the superior detection performance and efficiency of each module compared to state-of-the-art methods.
[CV-94] Benchmarking Vision Language Action Models on Robotic Learning Tasks
Link: https://arxiv.org/abs/2411.05821
Authors: Pranav Guruprasad,Harshvardhan Sikka,Jaewoo Song,Yangyue Wang,Paul Pu Liang
Keywords-EN: combine visual understanding, developing general-purpose robotic, language comprehension, visual understanding, general-purpose robotic systems
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments: 16 Pages, 9 Figures
Abstract:Vision-language-action (VLA) models represent a promising direction for developing general-purpose robotic systems, demonstrating the ability to combine visual understanding, language comprehension, and action generation. However, systematic evaluation of these models across diverse robotic tasks remains limited. In this work, we present a comprehensive evaluation framework and benchmark suite for assessing VLA models. We profile three state-of-the-art VLMs and VLAs - GPT-4o, OpenVLA, and JAT - across 20 diverse datasets from the Open-X-Embodiment collection, evaluating their performance on various manipulation tasks. Our analysis reveals several key insights: 1. current VLA models show significant variation in performance across different tasks and robot platforms, with GPT-4o demonstrating the most consistent performance through sophisticated prompt engineering, 2. all models struggle with complex manipulation tasks requiring multi-step planning, and 3. model performance is notably sensitive to action space characteristics and environmental factors. We release our evaluation framework and findings to facilitate systematic assessment of future VLA models and identify critical areas for improvement in the development of general-purpose robotic systems.
[CV-95] UEVAVD: A Dataset for Developing UAVs Eye View Active Object Detection
Link: https://arxiv.org/abs/2411.04348
Authors: Xinhua Jiang,Tianpeng Liu,Li Liu,Zhen Liu,Yongxiang Liu
Keywords-EN: UAV-based object detection, UAV AOD, UAV AOD problem, UAV AOD method, object detection
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*Comments:
Abstract:Occlusion is a longstanding difficulty that challenges the UAV-based object detection. Many works address this problem by adapting the detection model. However, few of them exploit that the UAV could fundamentally improve detection performance by changing its viewpoint. Active Object Detection (AOD) offers an effective way to achieve this purpose. Through Deep Reinforcement Learning (DRL), AOD endows the UAV with the ability of autonomous path planning to search for the observation that is more conducive to target identification. Unfortunately, there exists no available dataset for developing the UAV AOD method. To fill this gap, we released a UAV’s eye view active vision dataset named UEVAVD and hope it can facilitate research on the UAV AOD problem. Additionally, we improve the existing DRL-based AOD method by incorporating the inductive bias when learning the state representation. First, due to the partial observability, we use the gated recurrent unit to extract state representations from the observation sequence instead of the single-view observation. Second, we pre-decompose the scene with the Segment Anything Model (SAM) and filter out the irrelevant information with the derived masks. With these practices, the agent could learn an active viewing policy with better generalization capability. The effectiveness of our innovations is validated by the experiments on the UEVAVD dataset. Our dataset will soon be available at this https URL.
[CV-96] A Hyperspectral Imaging Dataset and Methodology for Intraoperative Pixel-Wise Classification of Metastatic Colon Cancer in the Liver
Link: https://arxiv.org/abs/2411.06969
Authors: Ivica Kopriva,Dario Sitnik,Laura-Isabelle Dion-Bertrand,Marija Milković Periša,Mirko Hadžija,Marijana Popović Hadžija
Keywords-EN: holds significant potential, Hyperspectral imaging, computational pathology, potential for transforming, transforming the field
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*Comments: 12 pages, 5 figures, 5 tables
Abstract:Hyperspectral imaging (HSI) holds significant potential for transforming the field of computational pathology. However, there is currently a shortage of pixel-wise annotated HSI data necessary for training deep learning (DL) models. Additionally, the number of HSI-based research studies remains limited, and in many cases, the advantages of HSI over traditional RGB imaging have not been conclusively demonstrated, particularly for specimens collected intraoperatively. To address these challenges we present a database consisting of 27 HSIs of hematoxylin-eosin stained frozen sections, collected from 14 patients with colon adenocarcinoma metastasized to the liver. It is intended to support validation of pixel-wise classification for intraoperative tumor resection. The HSIs were acquired in the spectral range of 450 to 800 nm, with a resolution of 1 nm, resulting in images of 1384x1035 pixels. Pixel-wise annotations were performed by three pathologists. To overcome challenges such as experimental variability and the lack of annotated data, we combined label-propagation-based semi-supervised learning (SSL) with spectral-spatial features extracted by the multiscale principle of relevant information (MPRI) method and the tensor singular spectrum analysis method. Using only 1% of labeled pixels per class, the SSL-MPRI method achieved a micro balanced accuracy (BACC) of 0.9313 and a micro F1-score of 0.9235 on the HSI dataset. The performance on corresponding RGB images was lower, with a micro BACC of 0.8809 and a micro F1-score of 0.8688. These improvements are statistically significant. The SSL-MPRI approach outperformed six DL architectures trained with 63% of labeled pixels. Data and code are available at: this https URL.
[CV-97] Maximizing domain generalization in fetal brain tissue segmentation: the role of synthetic data generation intensity clustering and real image fine-tuning
Link: https://arxiv.org/abs/2411.06842
Authors: Vladyslav Zalevskyi,Thomas Sanchez,Margaux Roulet,Hélène Lajous,Jordina Aviles Verdera,Jana Hutter,Hamza Kebiri,Meritxell Bach Cuadra
Keywords-EN: faces challenges due, magnetic resonance imaging, fetal brain MRI, brain tissue segmentation, Fetal brain tissue
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*Comments:
Abstract:Fetal brain tissue segmentation in magnetic resonance imaging (MRI) is a crucial tool that supports the understanding of neurodevelopment, yet it faces challenges due to the heterogeneity of data coming from different scanners and settings, and due to data scarcity. Recent approaches based on domain randomization, like SynthSeg, have shown a great potential for single source domain generalization, by simulating images with randomized contrast and image resolution from the label maps. In this work, we investigate how to maximize the out-of-domain (OOD) generalization potential of SynthSeg-based methods in fetal brain MRI. Specifically, when studying data generation, we demonstrate that the simple Gaussian mixture models used in SynthSeg enable more robust OOD generalization than physics-informed generation methods. We also investigate how intensity clustering can help create more faithful synthetic images, and observe that it is key to achieving a non-trivial OOD generalization capability when few label classes are available. Finally, by combining for the first time SynthSeg with modern fine-tuning approaches based on weight averaging, we show that fine-tuning a model pre-trained on synthetic data on a few real image-segmentation pairs in a new domain can lead to improvements in the target domain, but also in other domains. We summarize our findings as five key recommendations that we believe can guide practitioners who would like to develop SynthSeg-based approaches in other organs or modalities.
[CV-98] SynStitch: a Self-Supervised Learning Network for Ultrasound Image Stitching Using Synthetic Training Pairs and Indirect Supervision
Link: https://arxiv.org/abs/2411.06750
Authors: Xing Yao,Runxuan Yu,Dewei Hu,Hao Yang,Ange Lou,Jiacheng Wang,Daiwei Lu,Gabriel Arenas,Baris Oguz,Alison Pouch,Nadav Schwartz,Brett C Byram,Ipek Oguz
Keywords-EN: varied probe positions, probe positions, varied probe, FOV, stitching
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*Comments:
Abstract:Ultrasound (US) image stitching can expand the field-of-view (FOV) by combining multiple US images from varied probe positions. However, registering US images with only partially overlapping anatomical contents is a challenging task. In this work, we introduce SynStitch, a self-supervised framework designed for 2DUS stitching. SynStitch consists of a synthetic stitching pair generation module (SSPGM) and an image stitching module (ISM). SSPGM utilizes a patch-conditioned ControlNet to generate realistic 2DUS stitching pairs with known affine matrix from a single input image. ISM then utilizes this synthetic paired data to learn 2DUS stitching in a supervised manner. Our framework was evaluated against multiple leading methods on a kidney ultrasound dataset, demonstrating superior 2DUS stitching performance through both qualitative and quantitative analyses. The code will be made public upon acceptance of the paper.
[CV-99] Séparation en composantes structures, textures et bruit d'une image : apport de l'utilisation des contourlettes
Link: https://arxiv.org/abs/2411.06696
Authors: Jerome Gilles
Keywords-EN: improve image decomposition, propose to improve, propose, Abstract, image decomposition algorithms
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*Comments: in French language, GRETSI Symposium on Signal and Image Processing, Dijon, France, September 2009
Abstract:In this paper, we propose to improve image decomposition algorithms in the case of noisy images. In \cite{gilles1, aujoluvw}, the authors propose to separate structures, textures and noise from an image. Unfortunately, the use of separable wavelets shows some artefacts. In this paper, we propose to replace the wavelet transform by the contourlet transform, which better approximates geometry in images. For that, we define contourlet spaces and their associated norms. We then obtain an iterative algorithm, which we test on two noisy textured images.
[CV-100] METRIC: a complete methodology for performances evaluation of automatic target Detection Recognition and Tracking algorithms in infrared imagery
Link: https://arxiv.org/abs/2411.06695
Authors: Jérôme Gilles,Stéphane Landeau,Tristan Dagobert,Philippe Chevalier,Eric Stiée,Damien Diaz,Jean-Luc Maillart
Keywords-EN: algorithms performance assessment, automatic target detection, recognition and tracking, question of automatic, automatic target
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*Comments:
Abstract:In this communication, we deal with the question of automatic target detection, recognition and tracking (ATD/R/T) algorithm performance assessment. We propose a complete evaluation methodology, covering the development of objective image datasets and the definition of metrics adapted to the different tasks (detection, recognition and tracking). We present some performance results which are currently being processed in a French MoD program called 2ACI ("Acquisition Automatique de Cibles par Imagerie").
[CV-101] PRISM: Privacy-preserving Inter-Site MRI Harmonization via Disentangled Representation Learning
Link: https://arxiv.org/abs/2411.06513
Authors: Sarang Galada,Tanurima Halder,Kunal Deo,Ram P Krish,Kshitij Jadhav
Keywords-EN: Multi-site MRI studies, Privacy-preserving Inter-Site MRI, site-specific variations arising, differences in methodology, acquisition protocols
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*Comments: This work has been submitted to ISBI 2025
Abstract:Multi-site MRI studies often suffer from site-specific variations arising from differences in methodology, hardware, and acquisition protocols, thereby compromising accuracy and reliability in clinical AI/ML tasks. We present PRISM (Privacy-preserving Inter-Site MRI Harmonization), a novel Deep Learning framework for harmonizing structural brain MRI across multiple sites while preserving data privacy. PRISM employs a dual-branch autoencoder with contrastive learning and variational inference to disentangle anatomical features from style and site-specific variations, enabling unpaired image translation without traveling subjects or multiple MRI modalities. Our modular design allows harmonization to any target site and seamless integration of new sites without the need for retraining or fine-tuning. Using multi-site structural MRI data, we demonstrate PRISM’s effectiveness in downstream tasks such as brain tissue segmentation and validate its harmonization performance through multiple experiments. Our framework addresses key challenges in medical AI/ML, including data privacy, distribution shifts, model generalizability and interpretability. Code is available at this https URL
[CV-102] A Hybrid Approach for COVID-19 Detection: Combining Wasserstein GAN with Transfer Learning
Link: https://arxiv.org/abs/2411.06397
Authors: Sumera Rounaq,Shahid Munir Shah,Mahmoud Aljawarneh,Sarah Khan,Ghulam Muhammad
Keywords-EN: Viral Pneumonia, Viral Pneumonia cases, early diagnosis, induced Viral Pneumonia, extremely contagious
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*Comments:
Abstract:COVID-19 is extremely contagious and its rapid growth has drawn attention towards its early diagnosis. Early diagnosis of COVID-19 enables healthcare professionals and government authorities to break the chain of transmission and flatten the epidemic curve. With the number of cases accelerating across the developed world, COVID-19-induced Viral Pneumonia is a big challenge. The overlap of COVID-19 with Viral Pneumonia and other lung infections, together with limited datasets and long training hours, is a serious problem to cater for. A limited amount of data often results in over-fitted models that do not produce generalized results. To fill this gap, we propose a GAN-based approach to synthesize images which are later fed into deep learning models to classify images of COVID-19, Normal, and Viral Pneumonia. Specifically, a customized Wasserstein GAN is proposed to generate 19% more chest X-ray images compared to the real images. This expanded dataset is then used to train four proposed deep learning models: VGG-16, ResNet-50, GoogLeNet and MNAST. The results showed that the expanded dataset enabled the deep learning models to deliver high classification accuracies. In particular, VGG-16 achieved the highest accuracy of 99.17% among all four proposed schemes. The remaining models, ResNet-50, GoogLeNet and MNAST, delivered testing accuracies of 93.9%, 94.49% and 97.75% respectively. Later, the efficiency of these models is compared with state-of-the-art models on the basis of accuracy. Further, our proposed models can be applied to address the issue of scant datasets for any image analysis problem.
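As background, the standard Wasserstein GAN recipe that the paper customizes trains a critic to maximize the gap between its mean scores on real and generated samples, with weight clipping to keep the critic approximately 1-Lipschitz. The PyTorch sketch below follows the original Arjovsky et al. formulation, not the customized variant used in the paper.

```python
import torch

def critic_loss(critic, real, fake):
    # Negated Wasserstein objective: minimizing this maximizes
    # E[D(real)] - E[D(fake)].
    return critic(fake).mean() - critic(real).mean()

def clip_critic_weights(critic, c=0.01):
    # Weight clipping enforces the Lipschitz constraint in the original WGAN.
    with torch.no_grad():
        for p in critic.parameters():
            p.clamp_(-c, c)

critic = torch.nn.Sequential(torch.nn.Linear(64, 128),
                             torch.nn.LeakyReLU(0.2),
                             torch.nn.Linear(128, 1))
loss = critic_loss(critic, torch.randn(32, 64), torch.randn(32, 64))
loss.backward()
clip_critic_weights(critic)
```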
[CV-103] A novel algorithm for optimizing bundle adjustment in image sequence alignment
Link: https://arxiv.org/abs/2411.06343
Authors: Hailin Xu,Hongxia Wang,Huanshui Zhang
Keywords-EN: Bundle Adjustment, Optimal Control Algorithm, squares method, typical choice, commonly optimized
Categories: Optimization and Control (math.OC); Computer Vision and Pattern Recognition (cs.CV)
*Comments:
Abstract:The Bundle Adjustment (BA) model is commonly optimized using a nonlinear least squares method, with the Levenberg-Marquardt (L-M) algorithm being a typical choice. However, despite the L-M algorithm's effectiveness, its sensitivity to initial conditions often results in slower convergence when applied to poorly conditioned datasets, motivating the exploration of alternative optimization strategies. This paper introduces a novel algorithm for optimizing the BA model in the context of image sequence alignment for cryo-electron tomography, utilizing optimal control theory to directly optimize general nonlinear functions. The proposed Optimal Control Algorithm (OCA) exhibits superior convergence rates and effectively mitigates the oscillatory behavior frequently observed in the L-M algorithm. Extensive experiments on both synthetic and real-world datasets were conducted to evaluate the algorithm's performance. The results demonstrate that the OCA achieves faster convergence compared to the L-M algorithm. Moreover, the incorporation of a bisection-based update procedure significantly enhances the OCA's performance, particularly in poorly initialized datasets. These findings indicate that the OCA can substantially improve the efficiency of 3D reconstructions in cryo-electron tomography.
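For reference, the Levenberg-Marquardt baseline that the proposed OCA is compared against takes damped Gauss-Newton steps. A minimal NumPy sketch of one such step is shown below (the OCA itself is not reproduced here, and the toy linear residual is an illustrative assumption).

```python
import numpy as np

def lm_step(residual_fn, jacobian_fn, theta, lam):
    # One L-M update: solve (J^T J + lam * I) d = -J^T r for the step d.
    r = residual_fn(theta)
    J = jacobian_fn(theta)
    A = J.T @ J + lam * np.eye(theta.size)
    return theta + np.linalg.solve(A, -J.T @ r)

# Toy usage on a linear least-squares problem (converges in one step).
A = np.random.default_rng(0).standard_normal((10, 3))
b = A @ np.array([1.0, 2.0, 3.0])
theta = lm_step(lambda t: A @ t - b, lambda t: A, np.zeros(3), lam=1e-3)
print(theta)  # approximately [1, 2, 3]
```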
[CV-104] Exploring Out-of-distribution Detection for Sparse-view Computed Tomography with Diffusion Models
Link: https://arxiv.org/abs/2411.06308
Authors: Ezgi Demircan-Tureyen,Felix Lucka,Tristan van Leeuwen
Keywords-EN: inverse imaging problems, Recent works demonstrate, OOD, imaging problems, works demonstrate
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*Comments:
Abstract:Recent works demonstrate the effectiveness of diffusion models as unsupervised solvers for inverse imaging problems. Sparse-view computed tomography (CT) has greatly benefited from these advancements, achieving improved generalization without reliance on measurement parameters. However, this comes at the cost of potential hallucinations, especially when handling out-of-distribution (OOD) data. To ensure reliability, it is essential to study OOD detection for CT reconstruction across both clinical and industrial applications. This need further extends to enabling the OOD detector to function effectively as an anomaly inspection tool. In this paper, we explore the use of a diffusion model, trained to capture the target distribution for CT reconstruction, as an in-distribution prior. Building on recent research, we employ the model to reconstruct partially diffused input images and assess OOD-ness through multiple reconstruction errors. Adapting this approach for sparse-view CT requires redefining the notions of “input” and “reconstruction error”. Here, we use filtered backprojection (FBP) reconstructions as input and investigate various definitions of reconstruction error. Our proof-of-concept experiments on the MNIST dataset highlight both successes and failures, demonstrating the potential and limitations of integrating such an OOD detector into a CT reconstruction system. Our findings suggest that effective OOD detection can be achieved by comparing measurements with forward-projected reconstructions, provided that reconstructions from noisy FBP inputs are conditioned on the measurements. However, conditioning can sometimes lead the OOD detector to inadvertently reconstruct OOD images well. To counter this, we introduce a weighting approach that improves robustness against highly informative OOD measurements, albeit with a trade-off in performance in certain cases.
[CV-105] Alleviating Hyperparameter-Tuning Burden in SVM Classifiers for Pulmonary Nodules Diagnosis with Multi-Task Bayesian Optimization
Link: https://arxiv.org/abs/2411.06184
Authors: Wenhao Chi,Haiping Liu,Hongqiao Dong,Wenhua Liang,Bo Liu
Keywords-EN: measure tumor characteristics, non-invasive medical imaging, multi-task Bayesian optimization, tumor characteristics, field of non-invasive
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments: 12 pages, 4 figures, 37 references
Abstract:In the field of non-invasive medical imaging, radiomic features are utilized to measure tumor characteristics. However, these features can be affected by the techniques used to discretize the images, ultimately impacting the accuracy of diagnosis. To investigate the influence of various image discretization methods on diagnosis, it is common practice to evaluate multiple discretization strategies individually. This approach often leads to redundant and time-consuming tasks such as training predictive models and fine-tuning hyperparameters separately. This study examines the feasibility of employing multi-task Bayesian optimization to accelerate the hyperparameters search for classifying benign and malignant pulmonary nodules using RBF SVM. Our findings suggest that multi-task Bayesian optimization significantly accelerates the search for hyperparameters in comparison to a single-task approach. To the best of our knowledge, this is the first investigation to utilize multi-task Bayesian optimization in a critical medical context.
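As a point of reference, single-task Bayesian optimization of RBF-SVM hyperparameters can be set up in a few lines with scikit-optimize; the multi-task variant studied in the paper, which shares information across discretization settings, is not shown. The dataset and search ranges below are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from skopt import gp_minimize
from skopt.space import Real

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

def objective(params):
    C, gamma = params
    clf = SVC(kernel="rbf", C=C, gamma=gamma)
    # Negative accuracy, since gp_minimize minimizes its objective.
    return -cross_val_score(clf, X, y, cv=5).mean()

space = [Real(1e-2, 1e3, prior="log-uniform", name="C"),
         Real(1e-4, 1e1, prior="log-uniform", name="gamma")]
result = gp_minimize(objective, space, n_calls=30, random_state=0)
print(result.x, -result.fun)  # best (C, gamma) and its CV accuracy
```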
[CV-106] Epi-NAF: Enhancing Neural Attenuation Fields for Limited-Angle CT with Epipolar Consistency Conditions
Link: https://arxiv.org/abs/2411.06181
Authors: Daniel Gilo,Tzofi Klinghoffer,Or Litany
Keywords-EN: inverse rendering domain, initially successful, rendering domain, marking a paradigm, traditional techniques
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*Comments: This work has been submitted to the IEEE for possible publication
Abstract:Neural field methods, initially successful in the inverse rendering domain, have recently been extended to CT reconstruction, marking a paradigm shift from traditional techniques. While these approaches deliver state-of-the-art results in sparse-view CT reconstruction, they struggle in limited-angle settings, where input projections are captured over a restricted angle range. We present a novel loss term based on consistency conditions between corresponding epipolar lines in X-ray projection images, aimed at regularizing neural attenuation field optimization. By enforcing these consistency conditions, our approach, Epi-NAF, propagates supervision from input views within the limited-angle range to predicted projections over the full cone-beam CT range. This loss results in both qualitative and quantitative improvements in reconstruction compared to baseline methods.
[CV-107] Efficient Self-Supervised Barlow Twins from Limited Tissue Slide Cohorts for Colonic Pathology Diagnostics
Link: https://arxiv.org/abs/2411.05959
Authors: Cassandre Notton,Vasudev Sharma,Vincent Quoc-Huy Trinh,Lina Chen,Minqi Xu,Sonal Varma,Mahdi S. Hosseini
Keywords-EN: established dysplasia-carcinoma sequence, CRC screening, colorectal polyps screening, established dysplasia-carcinoma, dysplasia-carcinoma sequence
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*Comments: Submission Under Review
Abstract:Colorectal cancer (CRC) is one of the few cancers that have an established dysplasia-carcinoma sequence that benefits from screening. Everyone over 50 years of age in Canada is eligible for CRC screening. About 20% of those people will undergo a biopsy for a pre-neoplastic polyp and, in many cases, multiple polyps. As such, these polyp biopsies make up the bulk of a pathologist's workload. Developing an efficient computational model to help screen these polyp biopsies can improve the pathologist's workflow and help guide their attention to critical areas on the slide. Deep learning (DL) models face significant challenges in computational pathology (CPath) because of the gigapixel image size of whole-slide images and the scarcity of detailed annotated datasets. It is, therefore, crucial to leverage self-supervised learning (SSL) methods to alleviate the burden and cost of data annotation. However, current research lacks methods to apply SSL frameworks to analyze pathology data effectively. This paper aims to propose an optimized Barlow Twins framework for colorectal polyps screening. We adapt its hyperparameters, augmentation strategy and encoder to the specificity of the pathology data to enhance performance. Additionally, we investigate the best Field of View (FoV) for colorectal polyps screening and propose a new benchmark dataset for CRC screening, made of four types of colorectal polyps and normal tissue, by performing downstream tasks on the MHIST and NCT-CRC-7K datasets. Furthermore, we show that the SSL representations are more meaningful and qualitative than the supervised ones and that Barlow Twins benefits from the Swin Transformer when applied to pathology data. Code is available at this https URL.
[CV-108] UnDIVE: Generalized Underwater Video Enhancement Using Generative Priors WACV2025
Link: https://arxiv.org/abs/2411.05886
Authors: Suhas Srinath,Aditya Chandrasekar,Hemang Jamadagni,Rajiv Soundararajan,Prathosh A P
Keywords-EN: gained significant attention, marine exploration, research topic, imaging has gained, gained significant
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*Comments: Accepted to IEEE/CVF WACV 2025
Abstract:With the rise of marine exploration, underwater imaging has gained significant attention as a research topic. Underwater video enhancement has become crucial for real-time computer vision tasks in marine exploration. However, most existing methods focus on enhancing individual frames and neglect video temporal dynamics, leading to visually poor enhancements. Furthermore, the lack of ground-truth references limits the use of abundant available underwater video data in many applications. To address these issues, we propose a two-stage framework for enhancing underwater videos. The first stage uses a denoising diffusion probabilistic model to learn a generative prior from unlabeled data, capturing robust and descriptive feature representations. In the second stage, this prior is incorporated into a physics-based image formulation for spatial enhancement, while also enforcing temporal consistency between video frames. Our method enables real-time and computationally-efficient processing of high-resolution underwater videos at lower resolutions, and offers efficient enhancement in the presence of diverse water-types. Extensive experiments on four datasets show that our approach generalizes well and outperforms existing enhancement methods. Our code is available at this http URL.
[CV-109] Alternative Learning Paradigms for Image Quality Transfer
Link: https://arxiv.org/abs/2411.05885
Authors: Ahmed Karam Eldaly,Matteo Figini,Daniel C. Alexander
Keywords-EN: Image Quality Transfer, Quality Transfer, higher quality images, rich information learned, higher quality
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments: Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) this https URL
Abstract:Image Quality Transfer (IQT) aims to enhance the contrast and resolution of low-quality medical images, e.g. obtained from low-power devices, with rich information learned from higher quality images. In contrast to existing IQT methods which adopt supervised learning frameworks, in this work, we propose two novel formulations of the IQT problem. The first approach uses an unsupervised learning framework, whereas the second is a combination of both supervised and unsupervised learning. The unsupervised learning approach considers a sparse representation (SRep) and dictionary learning model, which we call IQT-SRep, whereas the combination of supervised and unsupervised learning approach is based on deep dictionary learning (DDL), which we call IQT-DDL. The IQT-SRep approach trains two dictionaries using a SRep model using pairs of low- and high-quality volumes. Subsequently, the SRep of a low-quality block, in terms of the low-quality dictionary, can be directly used to recover the corresponding high-quality block using the high-quality dictionary. On the other hand, the IQT-DDL approach explicitly learns a high-resolution dictionary to upscale the input volume, while the entire network, including high dictionary generator, is simultaneously optimised to take full advantage of deep learning methods. The two models are evaluated using a low-field magnetic resonance imaging (MRI) application aiming to recover high-quality images akin to those obtained from high-field scanners. Experiments comparing the proposed approaches against state-of-the-art supervised deep learning IQT method (IQT-DL) identify that the two novel formulations of the IQT problem can avoid bias associated with supervised methods when tested using out-of-distribution data that differs from the distribution of the data the model was trained on. This highlights the potential benefit of these novel paradigms for IQT.
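A rough scikit-learn sketch of the coupled-dictionary idea behind IQT-SRep: learn a single dictionary on concatenated low/high-quality patch pairs, then transfer the sparse code of a low-quality patch to the high-quality half of the dictionary. The synthetic patches and all names below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning, SparseCoder

rng = np.random.default_rng(0)
hi = rng.standard_normal((500, 64))            # "high-quality" patch vectors
lo = hi @ rng.standard_normal((64, 64)) * 0.1  # degraded "low-quality" view

# Learn a joint dictionary on concatenated patches, then split it in two.
joint = DictionaryLearning(n_components=32, random_state=0)
joint.fit(np.hstack([lo, hi]))
D_lo, D_hi = joint.components_[:, :64], joint.components_[:, 64:]

# Sparse-code a low-quality patch with (row-normalized) D_lo, then
# reconstruct the paired high-quality patch with the matching D_hi atoms.
s = np.linalg.norm(D_lo, axis=1, keepdims=True)
coder = SparseCoder(dictionary=D_lo / s, transform_algorithm="omp",
                    transform_n_nonzero_coefs=5)
codes = coder.transform(lo[:5])
hi_hat = codes @ (D_hi / s)
print(hi_hat.shape)  # (5, 64)
```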
[CV-110] Untrained Perceptual Loss for image denoising of line-like structures in MR images
Link: https://arxiv.org/abs/2411.05884
Authors: Elisabeth Pfaehler,Daniel Pflugfelder,Hanno Scharr
Keywords-EN: Magnetic Resonance, acquisition of Magnetic, shorter scan times, scan times lead, Structural Similarity Index
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments:
Abstract:In the acquisition of Magnetic Resonance (MR) images, shorter scan times lead to higher image noise. Therefore, automatic image denoising using deep learning methods is of high interest. MR images containing line-like structures such as roots or vessels yield special characteristics, as they display connected structures and carry sparse information. For this kind of data, it is important to consider voxel neighborhoods when training a denoising network. In this paper, we translate the Perceptual Loss to 3D data by comparing feature maps of untrained networks in the loss function, as done previously for 2D data. We tested the performance of the untrained Perceptual Loss (uPL) on 3D image denoising of MR images displaying brain vessels (MR angiograms - MRA) and images of plant roots in soil. We investigate the impact of various uPL characteristics such as weight initialization, network depth, kernel size, and pooling operations on the results. We tested the performance of the uPL loss on four Rician noise levels using evaluation metrics such as the Structural Similarity Index Metric (SSIM). We observe that our uPL outperforms conventional loss functions such as the L1 loss or a loss based on the SSIM. The uPL network's initialization is not important, while network depth and pooling operations impact denoising performance. For example, for both datasets a network with five convolutional layers led to the best performance, while a network with more layers led to a performance drop. We also find that small uPL networks lead to better or comparable results than large networks such as VGG. We observe superior performance of our loss for both datasets, all noise levels, and three network architectures. In conclusion, for images containing line-like structures, uPL is an alternative to other loss functions for 3D image denoising.
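The untrained Perceptual Loss is simple to sketch: compare feature maps of a frozen, randomly initialized convolutional network that is never trained. Below is a minimal 3D PyTorch version with five convolutional layers (the depth reported as best above); the width and other hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

def make_untrained_extractor(depth=5, width=16):
    # Random, frozen 3D conv net used only as a feature extractor.
    layers, in_ch = [], 1
    for _ in range(depth):
        layers += [nn.Conv3d(in_ch, width, kernel_size=3, padding=1),
                   nn.ReLU()]
        in_ch = width
    net = nn.Sequential(*layers)
    for p in net.parameters():
        p.requires_grad_(False)  # the extractor itself is never trained
    return net.eval()

def untrained_perceptual_loss(extractor, denoised, clean):
    # Gradients still flow into `denoised`, i.e. into the denoising network.
    return nn.functional.l1_loss(extractor(denoised), extractor(clean))

ext = make_untrained_extractor()
x = torch.rand(1, 1, 32, 32, 32)
y = torch.rand(1, 1, 32, 32, 32)
print(untrained_perceptual_loss(ext, x, y))
```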
[CV-111] Benchmarking 3D multi-coil NC-PDNet MRI reconstruction
Link: https://arxiv.org/abs/2411.05883
Authors: Asma Tanabene(NEUROSPIN, MIND),Chaithya Giliyar Radhakrishna(NEUROSPIN, MIND),Aurélien Massire,Mariappan S. Nadar,Philippe Ciuciu(NEUROSPIN, MIND)
Keywords-EN: parallel imaging acquisitions, shown great promise, Deep learning, promise for MRI, parallel imaging
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*Comments:
Abstract:Deep learning has shown great promise for MRI reconstruction from undersampled data, yet there is a lack of research on validating its performance in 3D parallel imaging acquisitions with non-Cartesian undersampling. In addition, the artifacts and the resulting image quality depend on the undersampling pattern. To address this uncharted territory, we extend the Non-Cartesian Primal-Dual Network (NC-PDNet), a state-of-the-art unrolled neural network, to a 3D multi-coil setting. We evaluated the impact of channel-specific versus channel-agnostic training configurations and examined the effect of coil compression. Finally, we benchmark four distinct non-Cartesian undersampling patterns, with an acceleration factor of six, using the publicly available Calgary-Campinas dataset. Our results show that NC-PDNet trained on compressed data with varying input channel numbers achieves an average PSNR of 42.98 dB for 1 mm isotropic, 32-channel whole-brain 3D reconstruction. With an inference time of 4.95 s and a GPU memory usage of 5.49 GB, our approach demonstrates significant potential for clinical research applications.
[CV-112] Trends, Challenges and Future Directions in Deep Learning for Glaucoma: A Systematic Review
Link: https://arxiv.org/abs/2411.05876
Authors: Mahtab Faraji,Homa Rashidisabet,George R. Nahass,RV Paul Chan,Thasarat S Vajaranant,Darvin Yi
Keywords-EN: Preferred Reporting Items, Deep Learning, Preferred Reporting, Reporting Items, Items for Systematic
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*Comments:
Abstract:Here, we examine the latest advances in glaucoma detection through Deep Learning (DL) algorithms using Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA). This study focuses on three aspects of DL-based glaucoma detection frameworks: input data modalities, processing strategies, and model architectures and applications. Moreover, we analyze trends in employing each aspect since the onset of DL in this field. Finally, we address current challenges and suggest future research directions.
[CV-113] Exploring the Feasibility of Affordable Sonar Technology: Object Detection in Underwater Environments Using the Ping 360
Link: https://arxiv.org/abs/2411.05863
Authors: Md Junayed Hasan,Somasundar Kannan,Ali Rohan,Mohd Asif Shah
Keywords-EN: complex underwater obstacles, detecting complex underwater, Ping, underwater obstacles, sonar
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET)
*Comments: This work is currently under review. This is a pre-print
Abstract:This study explores the potential of the Ping 360 sonar device, primarily used for navigation, in detecting complex underwater obstacles. The key motivation behind this research is the device’s affordability and open-source nature, offering a cost-effective alternative to more expensive imaging sonar systems. The investigation focuses on understanding the behaviour of the Ping 360 in controlled environments and assessing its suitability for object detection, particularly in scenarios where human operators are unavailable for inspecting offshore structures in shallow waters. Through a series of carefully designed experiments, we examined the effects of surface reflections and object shadows in shallow underwater environments. Additionally, we developed a manually annotated sonar image dataset to train a U-Net segmentation model. Our findings indicate that while the Ping 360 sonar demonstrates potential in simpler settings, its performance is limited in more cluttered or reflective environments unless extensive data pre-processing and annotation are applied. To our knowledge, this is the first study to evaluate the Ping 360’s capabilities for complex object detection. By investigating the feasibility of low-cost sonar devices, this research provides valuable insights into their limitations and potential for future AI-based interpretation, marking a unique contribution to the field.
[CV-114] Navigating Distribution Shifts in Medical Image Analysis: A Survey
Link: https://arxiv.org/abs/2411.05824
Authors: Zixian Su,Jingwei Guo,Xi Yang,Qiufeng Wang,Frans Coenen,Kaizhu Huang
Keywords-EN: Medical Image Analysis, Image Analysis, enhancing clinical diagnostics, enhancing clinical, personalized treatment
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments:
Abstract:Medical Image Analysis (MedIA) has become indispensable in modern healthcare, enhancing clinical diagnostics and personalized treatment. Despite the remarkable advancements supported by deep learning (DL) technologies, their practical deployment faces challenges due to distribution shifts, where models trained on specific datasets underperform across others from varying hospitals, regions, or patient populations. To navigate this issue, researchers have been actively developing strategies to increase the adaptability and robustness of DL models, enabling their effective use in unfamiliar and diverse environments. This paper systematically reviews approaches that apply DL techniques to MedIA systems affected by distribution shifts. Unlike traditional categorizations based on technical specifications, our approach is grounded in the real-world operational constraints faced by healthcare institutions. Specifically, we categorize the existing body of work into Joint Training, Federated Learning, Fine-tuning, and Domain Generalization, with each method tailored to distinct scenarios caused by Data Accessibility, Privacy Concerns, and Collaborative Protocols. This perspective equips researchers with a nuanced understanding of how DL can be strategically deployed to address distribution shifts in MedIA, ensuring diverse and robust medical applications. By delving deeper into these topics, we highlight potential pathways for future research that not only address existing limitations but also push the boundaries of deployable MedIA technologies.
Machine Learning
[LG-0] DeepONet as a Multi-Operator Extrapolation Model: Distributed Pretraining with Physics-Informed Fine-Tuning
Link: https://arxiv.org/abs/2411.07239
Authors: Zecheng Zhang,Christian Moya,Lu Lu,Guang Lin,Hayden Schaeffer
Keywords-EN: diverse function data, neural network, diverse function, distributed neural operator, achieve multi-operator learning
Categories: Machine Learning (cs.LG)
*Comments:
Abstract:We propose a novel fine-tuning method to achieve multi-operator learning through training a distributed neural operator with diverse function data and then zero-shot fine-tuning the neural network using physics-informed losses for downstream tasks. Operator learning effectively approximates solution operators for PDEs and various PDE-related problems, yet it often struggles to generalize to new tasks. To address this, we investigate fine-tuning a pretrained model, while carefully selecting an initialization that enables rapid adaptation to new tasks with minimal data. Our approach combines distributed learning to integrate data from various operators in pre-training, while physics-informed methods enable zero-shot fine-tuning, minimizing the reliance on downstream data. We investigate standard fine-tuning and Low-Rank Adaptation fine-tuning, applying both to train complex nonlinear target operators that are difficult to learn only using random initialization. Through comprehensive numerical examples, we demonstrate the advantages of our approach, showcasing significant improvements in accuracy. Our findings provide a robust framework for advancing multi-operator learning and highlight the potential of transfer learning techniques in this domain.
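A minimal sketch of the zero-shot, physics-informed fine-tuning idea: with no labeled downstream data, a pretrained network is adapted by minimizing a PDE residual alone. The small MLP stand-in and the 1-D heat equation u_t = u_xx below are illustrative assumptions, not the paper's DeepONet setup.

```python
import torch

def physics_informed_loss(model, x, t):
    # Residual of u_t = u_xx, computed with automatic differentiation;
    # only collocation points are needed, no labels.
    x = x.requires_grad_(True)
    t = t.requires_grad_(True)
    u = model(torch.stack([x, t], dim=-1)).squeeze(-1)
    u_t = torch.autograd.grad(u.sum(), t, create_graph=True)[0]
    u_x = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    u_xx = torch.autograd.grad(u_x.sum(), x, create_graph=True)[0]
    return ((u_t - u_xx) ** 2).mean()

# Stand-in for a pretrained operator network; fine-tuned on the residual only.
model = torch.nn.Sequential(torch.nn.Linear(2, 64), torch.nn.Tanh(),
                            torch.nn.Linear(64, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x, t = torch.rand(256), torch.rand(256)
for _ in range(100):
    opt.zero_grad()
    loss = physics_informed_loss(model, x, t)
    loss.backward()
    opt.step()
```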
[LG-1] Score-based generative diffusion with “active” correlated noise sources
Link: https://arxiv.org/abs/2411.07233
Authors: Alexandra Lamtyugina,Agnish Kumar Behera,Aditya Nandy,Carlos Floyd,Suriyanarayanan Vaikuntanathan
Keywords-EN: Diffusion models exhibit, models exhibit robust, Diffusion models, robust generative properties, exhibit robust generative
Categories: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn)
*Comments: 18 pages, 11 figures
Abstract:Diffusion models exhibit robust generative properties by approximating the underlying distribution of a dataset and synthesizing data by sampling from the approximated distribution. In this work, we explore how the generative performance may be modulated if noise sources with temporal correlations – akin to those used in the field of active matter – are used for the destruction of the data in the forward process. Our numerical and analytical experiments suggest that the corresponding reverse process may exhibit improved generative properties.
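The temporally correlated ("active") noise alluded to here is commonly modeled as an Ornstein-Uhlenbeck process, whose correlation decays exponentially with time constant tau, in contrast to the white noise of a standard forward process. A NumPy sketch of such a noise source:

```python
import numpy as np

def ou_noise(n_steps, shape, tau=10.0, rng=None):
    # Ornstein-Uhlenbeck process: unit stationary variance, correlation
    # time of roughly `tau` steps (dt = 1 per step).
    rng = rng or np.random.default_rng(0)
    dt = 1.0
    eta = rng.standard_normal(shape)
    traj = [eta]
    for _ in range(n_steps - 1):
        eta = eta - (dt / tau) * eta \
              + np.sqrt(2 * dt / tau) * rng.standard_normal(shape)
        traj.append(eta)
    return np.stack(traj)

noise = ou_noise(100, (8, 8), tau=20.0)  # (100, 8, 8), correlated in time
```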
[LG-2] Feature Selection Based on Wasserstein Distance
Link: https://arxiv.org/abs/2411.07217
Authors: Fuwei Li
Keywords-EN: Wasserstein distance, selection method based, feature selection, Wasserstein, feature selection method
Categories: Machine Learning (cs.LG)
*Comments:
Abstract:In this paper, we present a novel feature selection method based on the Wasserstein distance. Feature selection plays a critical role in reducing the dimensionality of input data, thereby improving machine learning efficiency and generalization performance. Unlike traditional feature selection approaches that rely on criteria such as correlation or KL divergence, our method leverages the Wasserstein distance to measure the similarity between distributions of selected features and original features. This approach inherently accounts for similarities between classes, making it robust in scenarios involving noisy labels. Experimental results demonstrate that our method outperforms traditional approaches, particularly in challenging settings involving noisy labeled data.
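As a simplified illustration of Wasserstein-based feature scoring (not the paper's exact criterion, which compares the distributions of selected features against those of the original features), one can rank features by the 1-D Wasserstein distance between their class-conditional distributions:

```python
import numpy as np
from scipy.stats import wasserstein_distance
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=3, random_state=0)

# Score each feature by the Wasserstein distance between its two
# class-conditional distributions; larger means more discriminative.
scores = np.array([wasserstein_distance(X[y == 0, j], X[y == 1, j])
                   for j in range(X.shape[1])])
top_k = np.argsort(scores)[::-1][:3]
print(top_k, scores[top_k])
```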
[LG-3] Comparing Bottom-Up and Top-Down Steering Approaches on In-Context Learning Tasks
Link: https://arxiv.org/abs/2411.07213
Authors: Madeline Brumley,Joe Kwon,David Krueger,Dmitrii Krasheninnikov,Usman Anwar
Keywords-EN: large language models, robustly steering models, desired behaviors, language models, key objective
Categories: Machine Learning (cs.LG)
*Comments:
Abstract:A key objective of interpretability research on large language models (LLMs) is to develop methods for robustly steering models toward desired behaviors. To this end, two distinct approaches to interpretability -- "bottom-up" and "top-down" -- have been presented, but there has been little quantitative comparison between them. We present a case study comparing the effectiveness of representative vector steering methods from each branch: function vectors (FV; arXiv:2310.15213), as a bottom-up method, and in-context vectors (ICV; arXiv:2311.06668) as a top-down method. While both aim to capture compact representations of broad in-context learning tasks, we find they are effective only on specific types of tasks: ICVs outperform FVs in behavioral shifting, whereas FVs excel in tasks requiring more precision. We discuss the implications for future evaluations of steering methods and for further research into top-down and bottom-up steering given these findings.
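Mechanically, both FV- and ICV-style steering reduce to adding a vector to a layer's hidden activations during the forward pass; the two branches differ in how that vector is derived. A hedged PyTorch sketch of the shared mechanism (the transformer module path in the commented usage is hypothetical):

```python
import torch

def add_steering_hook(layer, vector, alpha=1.0):
    # Forward hook that shifts the layer's hidden states by alpha * vector.
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * vector.to(hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return layer.register_forward_hook(hook)

# Tiny self-contained demo on a linear layer:
layer = torch.nn.Linear(4, 4)
handle = add_steering_hook(layer, torch.ones(4), alpha=0.5)
print(layer(torch.zeros(1, 4)))  # outputs shifted by 0.5
handle.remove()

# Hypothetical usage with a transformer block of an LLM:
# handle = add_steering_hook(model.transformer.h[12], task_vector, alpha=2.0)
```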
[LG-4] General Geospatial Inference with a Population Dynamics Foundation Model
Link: https://arxiv.org/abs/2411.07207
Authors: Mohit Agarwal,Mimi Sun,Chaitanya Kamath,Arbaaz Muslim,Prithul Sarker,Joydeep Paul,Hector Yee,Marcin Sieniek,Kim Jablonski,Yael Mayer,David Fork,Sheila de Guia,Jamie McPike,Adam Boulanger,Tomer Shekel,David Schottlander,Yao Xiao,Manjit Chakravarthy Manukonda,Yun Liu,Neslihan Bulut,Sami Abu-el-haija,Arno Eigenwillig,Parth Kothari,Bryan Perozzi,Monica Bharel,Von Nguyen,Luke Barrington,Niv Efron,Yossi Matias,Greg Corrado,Krish Eswaran,Shruthi Prabhakara,Shravya Shetty,Gautam Prasad
Keywords-EN: requires governmental agencies, allocate limited resources, world requires governmental, identify high-risk groups, strategically allocate limited
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY)
*Comments: 28 pages, 16 figures, preprint
Abstract:Supporting the health and well-being of dynamic populations around the world requires governmental agencies, organizations and researchers to understand and reason over complex relationships between human behavior and local contexts in order to identify high-risk groups and strategically allocate limited resources. Traditional approaches to these classes of problems often entail developing manually curated, task-specific features and models to represent human behavior and the natural and built environment, which can be challenging to adapt to new, or even, related tasks. To address this, we introduce a Population Dynamics Foundation Model (PDFM) that aims to capture the relationships between diverse data modalities and is applicable to a broad range of geospatial tasks. We first construct a geo-indexed dataset for postal codes and counties across the United States, capturing rich aggregated information on human behavior from maps, busyness, and aggregated search trends, and environmental factors such as weather and air quality. We then model this data and the complex relationships between locations using a graph neural network, producing embeddings that can be adapted to a wide range of downstream tasks using relatively simple models. We evaluate the effectiveness of our approach by benchmarking it on 27 downstream tasks spanning three distinct domains: health indicators, socioeconomic factors, and environmental measurements. The approach achieves state-of-the-art performance on all 27 geospatial interpolation tasks, and on 25 out of the 27 extrapolation and super-resolution tasks. We combined the PDFM with a state-of-the-art forecasting foundation model, TimesFM, to predict unemployment and poverty, achieving performance that surpasses fully supervised forecasting. The full set of embeddings and sample code are publicly available for researchers.
[LG-5] Data-Driven Predictive Control of Nonholonomic Robots Based on a Bilinear Koopman Realization: Data Does Not Replace Geometry
链接: https://arxiv.org/abs/2411.07192
作者: Mario Rosenfelder,Lea Bold,Hannes Eschmann,Peter Eberhard,Karl Worthmann,Henrik Ebel
关键词-EN: effortless data generation, Dynamic Mode Decomposition, growing trend, trend towards effortless, increasing interest
类目: Systems and Control (eess.SY); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 23 pages, 12 figures
点击查看摘要
Abstract:Advances in machine learning and the growing trend towards effortless data generation in real-world systems have led to increasing interest in data-inferred models and data-based control in robotics. It seems appealing to govern robots solely based on data, bypassing the traditional, more elaborate pipeline of system modeling through first principles and subsequent controller design. One promising data-driven approach is the Extended Dynamic Mode Decomposition (EDMD) for control-affine systems, a system class which contains many vehicles and machines of immense practical importance including, e.g., typical wheeled mobile robots. EDMD can be highly data-efficient, computationally inexpensive, can deal with nonlinear dynamics as prevalent in robotics and mechanics, and has a sound theoretical foundation rooted in Koopman theory. On this background, the present paper examines how EDMD models can be integrated into predictive controllers for nonholonomic mobile robots. In addition to the conventional kinematic mobile robot, we also cover the complete data-driven control pipeline - from data acquisition to control design - when the robot is not treated in terms of first-order kinematics but in a second-order manner, allowing one to account for actuator dynamics. Using only real-world measurement data, it is shown in both simulations and hardware experiments that the surrogate models enable high-precision predictive controllers in the studied cases. However, the findings raise significant concerns about purely data-centric approaches that overlook the underlying geometry of nonholonomic systems, showing that, for nonholonomic systems, some geometric insight seems necessary and cannot be easily compensated for with large amounts of data.
[LG-6] Revisiting Ensembling in One-Shot Federated Learning NEURIPS2024
链接: https://arxiv.org/abs/2411.07182
作者: Youssef Allouah,Akash Dhasade,Rachid Guerraoui,Nirupam Gupta,Anne-Marie Kermarrec,Rafael Pinot,Rafael Pires,Rishi Sharma
关键词-EN: training machine learning, sharing raw data, appealing approach, approach to training, training machine
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: Accepted at NeurIPS 2024
点击查看摘要
Abstract:Federated learning (FL) is an appealing approach to training machine learning models without sharing raw data. However, standard FL algorithms are iterative and thus induce a significant communication cost. One-shot federated learning (OFL) trades the iterative exchange of models between clients and the server for a single round of communication, thereby saving substantially on communication costs. Not surprisingly, OFL exhibits a performance gap in terms of accuracy with respect to FL, especially under high data heterogeneity. We introduce FENS, a novel federated ensembling scheme that approaches the accuracy of FL with the communication efficiency of OFL. Learning in FENS proceeds in two phases: first, clients train models locally and send them to the server, similar to OFL; second, clients collaboratively train a lightweight prediction aggregator model using FL. We showcase the effectiveness of FENS through exhaustive experiments spanning several datasets and heterogeneity levels. In the particular case of the heterogeneously distributed CIFAR-10 dataset, FENS achieves up to 26.9% higher accuracy than state-of-the-art (SOTA) OFL, being only 3.1% lower than FL. At the same time, FENS incurs at most 4.3x more communication than OFL, whereas FL is at least 10.9x more communication-intensive than FENS.
[LG-7] Joint Age-State Belief is All You Need: Minimizing AoII via Pull-Based Remote Estimation
链接: https://arxiv.org/abs/2411.07179
作者: Ismail Cosandal,Sennur Ulukus,Nail Akar
关键词-EN: recently proposed freshness, recently proposed, proposed freshness, freshness and mismatch, mismatch metric
类目: Information Theory (cs.IT); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP); Systems and Control (eess.SY)
*备注:
点击查看摘要
Abstract:Age of incorrect information (AoII) is a recently proposed freshness and mismatch metric that penalizes an incorrect estimation along with its duration. Therefore, keeping track of AoII requires the knowledge of both the source and estimation processes. In this paper, we consider a time-slotted pull-based remote estimation system under a sampling rate constraint where the information source is a general discrete-time Markov chain (DTMC) process. Moreover, packet transmission times from the source to the monitor are non-zero, which prevents the monitor from having perfect information on the actual AoII process at any time. Hence, for this pull-based system, we propose that the monitor maintain a sufficient statistic called the "belief", which stands for the joint distribution of the age and source processes obtained from the history of all observations. Using the belief, we first propose a maximum a posteriori (MAP) estimator to be used at the monitor, as opposed to existing martingale estimators in the literature. Second, we obtain the optimality equations from the belief-MDP (Markov decision process) formulation. Finally, we propose two belief-dependent policies, one of which is based on deep reinforcement learning, and the other is a threshold-based policy based on the instantaneous expected AoII.
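To make the joint age-state belief concrete, here is a minimal sketch for a two-state DTMC source with a fixed MAP estimate held between pulls. The transition matrix, the age cap, and the ten-slot horizon are all hypothetical choices for illustration; the paper's model is more general:

```python
# Minimal sketch (assumed setup): joint age-state belief for a two-state DTMC
# source, with a fixed MAP estimate held at the monitor between samples.
import numpy as np

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])          # hypothetical source transition matrix
MAX_AGE = 50

# belief[s, a] = P(source state = s, AoII = a)
belief = np.zeros((2, MAX_AGE))
belief[0, 0] = 1.0                  # start: state 0 known perfectly

def map_estimate(belief):
    return int(np.argmax(belief.sum(axis=1)))   # MAP over the state marginal

def step(belief, x_hat):
    new = np.zeros_like(belief)
    for s in range(2):
        for a in range(MAX_AGE):
            p = belief[s, a]
            if p == 0.0:
                continue
            for s2 in range(2):
                if s2 == x_hat:
                    new[s2, 0] += p * P[s, s2]        # estimate correct: AoII resets
                else:
                    new[s2, min(a + 1, MAX_AGE - 1)] += p * P[s, s2]  # AoII grows
    return new

x_hat = map_estimate(belief)
for t in range(10):                  # no new samples pulled during these slots
    belief = step(belief, x_hat)
ages = np.arange(MAX_AGE)
print("expected AoII after 10 slots:", float((belief.sum(axis=0) * ages).sum()))
```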
[LG-8] Enhancing Predictive Maintenance in Mining Mobile Machinery through a TinyML-enabled Hierarchical Inference Network
链接: https://arxiv.org/abs/2411.07168
作者: Raúl de la Fuente,Luciano Radrigan,Anibal S Morales
关键词-EN: challenging Predictive Maintenance, Predictive Maintenance, Mining machinery operating, faces high wear, Edge Sensor Network
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA); Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
*备注: This work has been submitted to the IEEE Access for possible publication
点击查看摘要
Abstract:Mining machinery operating in variable environments faces high wear and unpredictable stress, challenging Predictive Maintenance (PdM). This paper introduces the Edge Sensor Network for Predictive Maintenance (ESN-PdM), a hierarchical inference framework across edge devices, gateways, and cloud services for real-time condition monitoring. The system dynamically adjusts the inference location (on-device, on-gateway, or on-cloud) based on trade-offs among accuracy, latency, and battery life, leveraging Tiny Machine Learning (TinyML) techniques for model optimization on resource-constrained devices. Performance evaluations showed that on-sensor and on-gateway inference modes achieved over 90% classification accuracy, while cloud-based inference reached 99%. On-sensor inference reduced power consumption by approximately 44%, enabling up to 104 hours of operation. Latency was lowest for on-device inference (3.33 ms), increasing when offloading to the gateway (146.67 ms) or cloud (641.71 ms). The ESN-PdM framework provides a scalable, adaptive solution for reliable anomaly detection and PdM, crucial for maintaining machinery uptime in remote environments. By balancing accuracy, latency, and energy consumption, this approach advances PdM frameworks for industrial applications.
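The dynamic placement decision can be illustrated with a small rule that trades off the accuracy, latency, and power figures reported in the abstract. The scoring rule itself and the battery threshold are assumptions for illustration, not the paper's actual policy:

```python
# Illustrative placement rule using the numbers from the abstract; the rule
# and the battery threshold are assumptions, not the paper's policy.
MODES = {                      # accuracy, latency (ms), relative power draw
    "on-device":  {"acc": 0.90, "latency_ms": 3.33,   "power": 0.56},
    "on-gateway": {"acc": 0.90, "latency_ms": 146.67, "power": 1.00},
    "on-cloud":   {"acc": 0.99, "latency_ms": 641.71, "power": 1.00},
}

def pick_mode(battery_frac, deadline_ms, min_acc):
    """Choose the most accurate mode that meets the latency deadline,
    preferring low power when the battery is running down."""
    feasible = [(m, s) for m, s in MODES.items()
                if s["latency_ms"] <= deadline_ms and s["acc"] >= min_acc]
    if not feasible:
        return "on-device"                      # degrade gracefully
    if battery_frac < 0.2:                      # save energy when low
        return min(feasible, key=lambda ms: ms[1]["power"])[0]
    return max(feasible, key=lambda ms: ms[1]["acc"])[0]

print(pick_mode(battery_frac=0.8, deadline_ms=700, min_acc=0.95))  # on-cloud
print(pick_mode(battery_frac=0.1, deadline_ms=200, min_acc=0.85))  # on-device
```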
[LG-9] Efficient Adaptive Optimization via Subset-Norm and Subspace-Momentum: Fast Memory-Reduced Training with Convergence Guarantees
链接: https://arxiv.org/abs/2411.07120
作者: Thien Hang Nguyen,Huy Le Nguyen
关键词-EN: large-scale neural networks, efficient adaptive optimization, neural networks, reduce memory requirements, introduce two complementary
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Optimization and Control (math.OC)
*备注:
点击查看摘要
Abstract:We introduce two complementary techniques for efficient adaptive optimization that reduce memory requirements while accelerating training of large-scale neural networks. The first technique, Subset-Norm adaptive step size, generalizes AdaGrad-Norm and AdaGrad(-Coordinate) by reducing the second moment term's memory footprint from O(d) to O(\sqrt{d}) through step-size sharing, where d is the model size. For non-convex smooth objectives under coordinate-wise sub-gaussian gradient noise, we prove a noise-adapted high-probability convergence guarantee showing improved dimensional dependence over existing methods. Our second technique, Subspace-Momentum, reduces the momentum state's memory footprint by operating in a low-dimensional subspace while applying standard SGD in the orthogonal complement. We establish high-probability convergence rates under similar relaxed assumptions. Empirical evaluation on LLaMA models from 60M to 1B parameters demonstrates the effectiveness of our methods, where combining subset-norm with subspace-momentum achieves Adam's validation perplexity in approximately half the training tokens (6.8B vs 13.1B) while using only 20% of Adam's optimizer-state memory footprint and requiring minimal additional hyperparameter tuning.
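A minimal sketch of the Subset-Norm idea, under the assumption that subsets are contiguous coordinate blocks of size roughly \sqrt{d}: the AdaGrad-style second-moment accumulator is stored once per subset instead of once per coordinate, so its memory drops from O(d) to O(\sqrt{d}). The grouping scheme and step sizes here are illustrative choices:

```python
# Subset-Norm sketch (illustrative grouping and constants): one AdaGrad-style
# second-moment accumulator per coordinate block of size ~sqrt(d).
import numpy as np

def make_groups(d):
    k = int(np.ceil(np.sqrt(d)))                 # block size ~ sqrt(d)
    n_groups = int(np.ceil(d / k))
    return np.minimum(np.arange(d) // k, n_groups - 1), np.zeros(n_groups)

def subset_norm_step(w, grad, groups, acc, lr=0.1, eps=1e-8):
    # accumulate the squared gradient norm within each subset (O(sqrt(d)) state)
    acc += np.bincount(groups, weights=grad**2, minlength=acc.size)
    denom = np.sqrt(acc[groups]) + eps           # shared step size per subset
    return w - lr * grad / denom, acc

d = 10
w = np.zeros(d)
groups, acc = make_groups(d)
for _ in range(200):
    grad = 2.0 * (w - 1.0)                       # gradient of ||w - 1||^2
    w, acc = subset_norm_step(w, grad, groups, acc)
print("w after 200 steps:", np.round(w, 2))      # approaches the minimizer at 1
```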
[LG-10] TinyML Security: Exploring Vulnerabilities in Resource-Constrained Machine Learning Systems
链接: https://arxiv.org/abs/2411.07114
作者: Jacob Huckelberry,Yuke Zhang,Allison Sansone,James Mickens,Peter A. Beerel,Vijay Janapa Reddi
关键词-EN: Tiny Machine Learning, enable machine learning, machine learning inference, Machine Learning, Tiny Machine
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: Submitted to Proceedings of the IEEE
点击查看摘要
Abstract:Tiny Machine Learning (TinyML) systems, which enable machine learning inference on highly resource-constrained devices, are transforming edge computing but encounter unique security challenges. These devices, restricted by RAM and CPU capabilities two to three orders of magnitude smaller than those of conventional systems, make traditional software and hardware security solutions impractical. The physical accessibility of these devices exacerbates their susceptibility to side-channel attacks and information leakage. Additionally, TinyML models pose security risks, with weights potentially encoding sensitive data and query interfaces that can be exploited. This paper offers the first thorough survey of TinyML security threats. We present a device taxonomy that differentiates between IoT, EdgeML, and TinyML, highlighting vulnerabilities unique to TinyML. We list various attack vectors, assess their threat levels using the Common Vulnerability Scoring System, and evaluate both existing and possible defenses. Our analysis identifies where traditional security measures are adequate and where solutions tailored to TinyML are essential. Our results underscore the pressing need for specialized security solutions in TinyML to ensure robust and secure edge computing applications. We aim to inform the research community and inspire innovative approaches to protecting this rapidly evolving and critical field.
[LG-11] Differentially-Private Collaborative Online Personalized Mean Estimation
链接: https://arxiv.org/abs/2411.07094
作者: Yauhen Yakimenka,Chung-Wei Weng,Hsuan-Yin Lin,Eirik Rosnes,Jörg Kliewer
关键词-EN: arbitrary unknown agent-specific, data variance estimation, continuously receiving data, agents continuously receiving, unknown agent-specific distributions
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注: Presented in part at the 2023 IEEE International Symposium on Information Theory (ISIT)
点击查看摘要
Abstract:We consider the problem of collaborative personalized mean estimation under a privacy constraint in an environment of several agents continuously receiving data according to arbitrary unknown agent-specific distributions. In particular, we provide a method based on hypothesis testing coupled with differential privacy and data variance estimation. Two privacy mechanisms and two data variance estimation schemes are proposed, and we provide a theoretical convergence analysis of the proposed algorithm for any bounded unknown distributions on the agents’ data, showing that collaboration provides faster convergence than a fully local approach where agents do not share data. Moreover, we provide analytical performance curves for the case with an oracle class estimator, i.e., the class structure of the agents, where agents receiving data from distributions with the same mean are considered to be in the same class, is known. The theoretical faster-than-local convergence guarantee is backed up by extensive numerical results showing that for a considered scenario the proposed approach indeed converges much faster than a fully local approach, and performs comparably to ideal performance where all data is public. This illustrates the benefit of private collaboration in an online setting.
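A minimal sketch of the flavor of this setup, with assumed details: each agent releases a Laplace-privatized local mean, a crude two-sample test (a simple stand-in for the paper's hypothesis test) decides which peers look like the same class, and the agent averages over accepted peers:

```python
# Minimal sketch (assumed mechanism): Laplace-privatized local means, a crude
# equality test standing in for the paper's hypothesis test, then averaging
# over accepted peers to speed up the personalized mean estimate.
import numpy as np

rng = np.random.default_rng(1)
eps, sensitivity = 1.0, 1.0          # hypothetical privacy budget per release
n_agents, T = 6, 2000
true_means = np.array([0.0, 0.0, 0.0, 5.0, 5.0, 5.0])   # two latent classes

data = true_means[:, None] + rng.normal(size=(n_agents, T))
local_means = data.mean(axis=1)
noisy = local_means + rng.laplace(scale=sensitivity / eps, size=n_agents)

def same_class(i, j, tau=3.0 * sensitivity / eps):
    # crude test: decline collaboration if noisy means differ too much
    return abs(noisy[i] - noisy[j]) <= tau

for i in range(n_agents):
    peers = [j for j in range(n_agents) if same_class(i, j)]
    est = noisy[peers].mean()        # collaboration averages away the noise
    print(f"agent {i}: local={noisy[i]:+.2f}  collaborative={est:+.2f}")
```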
[LG-12] General framework for online-to-nonconvex conversion: Schedule-free SGD is also effective for nonconvex optimization
链接: https://arxiv.org/abs/2411.07061
作者: Kwangjun Ahn,Gagik Magakyan,Ashok Cutkosky
关键词-EN: work investigates, investigates the effectiveness, Defazio, schedule-free SGD, schedule-free methods
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: Comments would be appreciated!
点击查看摘要
Abstract:This work investigates the effectiveness of schedule-free methods, developed by A. Defazio et al. (NeurIPS 2024), in nonconvex optimization settings, inspired by their remarkable empirical success in training neural networks. Specifically, we show that schedule-free SGD achieves optimal iteration complexity for nonsmooth, nonconvex optimization problems. Our proof begins with the development of a general framework for online-to-nonconvex conversion, which converts a given online learning algorithm into an optimization algorithm for nonconvex losses. Our general framework not only recovers existing conversions but also leads to two novel conversion schemes. Notably, one of these new conversions corresponds directly to schedule-free SGD, allowing us to establish its optimality. Additionally, our analysis provides valuable insights into the parameter choices for schedule-free SGD, addressing a theoretical gap that the convex theory cannot explain.
[LG-13] HeteroSample: Meta-path Guided Sampling for Heterogeneous Graph Representation Learning
链接: https://arxiv.org/abs/2411.07022
作者: Ao Liu,Jing Chen,Ruiying Du,Cong Wu,Yebo Feng,Teng Li,Jianfeng Ma
关键词-EN: Internet of Things, capture complex interactions, complex IoT systems, resulted in vast, interactions among devices
类目: Machine Learning (cs.LG)
*备注: 11 pages
点击查看摘要
Abstract:The rapid expansion of the Internet of Things (IoT) has resulted in vast, heterogeneous graphs that capture complex interactions among devices, sensors, and systems. Efficient analysis of these graphs is critical for deriving insights in IoT scenarios such as smart cities, industrial IoT, and intelligent transportation systems. However, the scale and diversity of IoT-generated data present significant challenges, and existing methods often struggle with preserving the structural integrity and semantic richness of these complex graphs. Many current approaches fail to maintain the balance between computational efficiency and the quality of the insights generated, leading to potential loss of critical information necessary for accurate decision-making in IoT applications. We introduce HeteroSample, a novel sampling method designed to address these challenges by preserving the structural integrity, node and edge type distributions, and semantic patterns of IoT-related graphs. HeteroSample works by incorporating the novel top-leader selection, balanced neighborhood expansion, and meta-path guided sampling strategies. The key idea is to leverage the inherent heterogeneous structure and semantic relationships encoded by meta-paths to guide the sampling process. This approach ensures that the resulting subgraphs are representative of the original data while significantly reducing computational overhead. Extensive experiments demonstrate that HeteroSample outperforms state-of-the-art methods, achieving up to 15% higher F1 scores in tasks such as link prediction and node classification, while reducing runtime by 20%. These advantages make HeteroSample a transformative tool for scalable and accurate IoT applications, enabling more effective and efficient analysis of complex IoT systems, ultimately driving advancements in smart cities, industrial IoT, and beyond.
[LG-14] Hierarchical Conditional Tabular GAN for Multi-Tabular Synthetic Data Generation
链接: https://arxiv.org/abs/2411.07009
作者: Wilhelm Ågren,Victorio Úbeda Sosa
关键词-EN: privacy regulations limit, synthetic data, approach to leverage, data, leverage when access
类目: Machine Learning (cs.LG); Databases (cs.DB)
*备注:
点击查看摘要
Abstract:The generation of synthetic data is a state-of-the-art approach to leverage when access to real data is limited or privacy regulations limit the usability of sensitive data. A fair amount of research has been conducted on synthetic data generation for single-tabular datasets, but only a limited amount of research has been conducted on multi-tabular datasets with complex table relationships. In this paper we propose the algorithm HCTGAN to synthesize multi-tabular data from complex multi-tabular datasets. We compare our results to the probabilistic model HMA1. Our findings show that our proposed algorithm can more efficiently sample large amounts of synthetic data for deep and complex multi-tabular datasets, whilst achieving adequate data quality and always guaranteeing referential integrity. We conclude that the HCTGAN algorithm is suitable for generating large amounts of synthetic data efficiently for deep multi-tabular datasets with complex relationships. We additionally suggest that the HMA1 model should be used on smaller datasets when emphasis is on data quality.
[LG-15] Efficient Unsupervised Domain Adaptation Regression for Spatial-Temporal Air Quality Sensor Fusion
链接: https://arxiv.org/abs/2411.06917
作者: Keivan Faghih Niresi,Ismail Nejjar,Olga Fink
关键词-EN: Internet of Things, air pollution monitoring, affordable Internet, recent years due, scalability and cost-effectiveness
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
点击查看摘要
Abstract:The deployment of affordable Internet of Things (IoT) sensors for air pollution monitoring has increased in recent years due to their scalability and cost-effectiveness. However, accurately calibrating these sensors in uncontrolled environments remains a significant challenge. While expensive reference sensors can provide accurate ground truth data, they are often deployed on a limited scale due to high costs, leading to a scarcity of labeled data. In diverse urban environments, data distributions constantly shift due to varying factors such as traffic patterns, industrial activities, and weather conditions, which impact sensor readings. Consequently, traditional machine learning models – despite their increasing deployment for environmental sensor calibration – often struggle to provide reliable pollutant measurements across different locations due to domain shifts. To address these challenges, we propose a novel unsupervised domain adaptation (UDA) method specifically tailored for regression tasks on graph-structured data. Our approach leverages Graph Neural Networks (GNNs) to model the relationships between sensors. To effectively capture critical spatial-temporal interactions, we incorporate spatial-temporal graph neural networks (STGNNs), which extend GNNs by incorporating temporal dynamics. To handle the resulting larger embeddings, we propose a domain adaptation method using a closed-form solution inspired by the Tikhonov-regularized least-squares problem. This method leverages Cholesky decomposition and power iteration to align the subspaces between source and target domains. By aligning these subspaces, our approach allows low-cost IoT sensors to learn calibration parameters from expensive reference sensors. This facilitates reliable pollutant measurements in new locations without the need for additional costly equipment.
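A minimal sketch of one way the closed-form alignment could look, under assumed details: project source and target features onto their top principal subspaces, align the source basis to the target basis, and solve a Tikhonov-regularized least-squares problem via a Cholesky factorization, echoing the ingredients named in the abstract:

```python
# Minimal sketch (assumed variant) of closed-form subspace alignment for UDA
# regression: top principal subspaces per domain, source-to-target alignment,
# then a Tikhonov-regularized ridge solve via Cholesky.
import numpy as np

rng = np.random.default_rng(2)
n_s, n_t, d, k = 500, 400, 20, 5

scale = np.linspace(2.0, 0.5, d)                 # anisotropic feature scales
Xs = rng.normal(size=(n_s, d)) * scale           # labelled source (reference sensors)
ys = Xs[:, 0] - 2 * Xs[:, 1] + 0.1 * rng.normal(size=n_s)
Xt = 1.3 * rng.normal(size=(n_t, d)) * scale     # shifted target (low-cost sensors)

def top_subspace(X, k):
    _, V = np.linalg.eigh(np.cov(X, rowvar=False))
    return V[:, -k:]                             # top-k eigenvectors

Vs, Vt = top_subspace(Xs, k), top_subspace(Xt, k)
M = Vs.T @ Vt                                    # align source basis to target basis
Zs = Xs @ Vs @ M                                 # source data in aligned coordinates
Zt = Xt @ Vt

lam = 1e-1                                       # Tikhonov regularisation strength
L = np.linalg.cholesky(Zs.T @ Zs + lam * np.eye(k))
w = np.linalg.solve(L.T, np.linalg.solve(L, Zs.T @ ys))
print("target predictions (first 3):", np.round(Zt @ w, 2)[:3])
```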
[LG-16] SPARTAN: A Sparse Transformer Learning Local Causation
链接: https://arxiv.org/abs/2411.06890
作者: Anson Lei,Ingmar Posner,Bernhard Schölkopf
关键词-EN: local causal, local causal structures, Causal structures play, local causal graphs, Causal structures
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Causal structures play a central role in world models that flexibly adapt to changes in the environment. While recent works motivate the benefits of discovering local causal graphs for dynamics modelling, in this work we demonstrate that accurately capturing these relationships in complex settings remains challenging for the current state-of-the-art. To remedy this shortcoming, we postulate that sparsity is a critical ingredient for the discovery of such local causal structures. To this end we present the SPARse TrANsformer World model (SPARTAN), a Transformer-based world model that learns local causal structures between entities in a scene. By applying sparsity regularisation on the attention pattern between object-factored tokens, SPARTAN identifies sparse local causal models that accurately predict future object states. Furthermore, we extend our model to capture sparse interventions with unknown targets on the dynamics of the environment. This results in a highly interpretable world model that can efficiently adapt to changes. Empirically, we evaluate SPARTAN against the current state-of-the-art in object-centric world models on observation-based environments and demonstrate that our model can learn accurate local causal graphs and achieve significantly improved few-shot adaptation to changes in the dynamics of the environment as well as robustness against removing irrelevant distractors.
[LG-17] WassFFed: Wasserstein Fair Federated Learning
链接: https://arxiv.org/abs/2411.06881
作者: Zhongxuan Han,Li Zhang,Chaochao Chen,Xiaolin Zheng,Fei Zheng,Yuyuan Li,Jianwei Yin
关键词-EN: Federated Learning, Fair Federated Learning, scenarios where users’, Wasserstein Fair Federated, Federated Learning framework
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Submitted to TKDE
点击查看摘要
Abstract:Federated Learning (FL) employs a training approach to address scenarios where users’ data cannot be shared across clients. Achieving fairness in FL is imperative since training data in FL is inherently geographically distributed among diverse user groups. Existing research on fairness predominantly assumes access to the entire training data, making direct transfer to FL challenging. However, the limited existing research on fairness in FL does not effectively address two key challenges, i.e., (CH1) Current methods fail to deal with the inconsistency between fair optimization results obtained with surrogate functions and fair classification results. (CH2) Directly aggregating local fair models does not always yield a globally fair model due to non Identical and Independent data Distributions (non-IID) among clients. To address these challenges, we propose a Wasserstein Fair Federated Learning framework, namely WassFFed. To tackle CH1, we ensure that the outputs of local models, rather than the loss calculated with surrogate functions or classification results with a threshold, remain independent of various user groups. To resolve CH2, we employ a Wasserstein barycenter calculation of all local models’ outputs for each user group, bringing local model outputs closer to the global output distribution to ensure consistency between the global model and local models. We conduct extensive experiments on three real-world datasets, demonstrating that WassFFed outperforms existing approaches in striking a balance between accuracy and fairness.
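The Wasserstein barycenter step is easiest to see in one dimension, where the barycenter of several score distributions is obtained by averaging their quantile functions, i.e., averaging sorted samples. The sketch below uses hypothetical per-client score distributions and is only a 1-D illustration of the aggregation idea:

```python
# 1-D illustration: the Wasserstein barycenter of one-dimensional distributions
# is the average of their quantile functions (average of sorted samples).
import numpy as np

rng = np.random.default_rng(3)
# hypothetical per-client model scores for one user group
client_scores = [rng.normal(loc=mu, scale=s, size=1000)
                 for mu, s in [(0.2, 1.0), (0.5, 0.8), (-0.1, 1.2)]]

sorted_scores = np.stack([np.sort(s) for s in client_scores])
barycenter = sorted_scores.mean(axis=0)          # quantile-function average

# map each client's scores toward the barycenter (optimal transport in 1-D
# simply matches sorted order)
aligned = []
for s in client_scores:
    ranks = np.argsort(np.argsort(s))            # rank of each sample
    aligned.append(barycenter[ranks])
print("client means before:", [f"{s.mean():+.2f}" for s in client_scores])
print("client means after :", [f"{a.mean():+.2f}" for a in aligned])
```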
[LG-18] Generative Feature Training of Thin 2-Layer Networks
链接: https://arxiv.org/abs/2411.06848
作者: Johannes Hertrich,Sebastian Neumayer
关键词-EN: small datasets, hidden weights based, small number, neural networks, approximation of functions
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:We consider the approximation of functions by 2-layer neural networks with a small number of hidden weights based on the squared loss and small datasets. Due to the highly non-convex energy landscape, gradient-based training often suffers from local minima. As a remedy, we initialize the hidden weights with samples from a learned proposal distribution, which we parameterize as a deep generative model. To train this model, we exploit the fact that with fixed hidden weights, the optimal output weights solve a linear equation. After learning the generative model, we refine the sampled weights with a gradient-based post-processing in the latent space. Here, we also include a regularization scheme to counteract potential noise. Finally, we demonstrate the effectiveness of our approach by numerical examples.
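The linear-equation fact the method exploits is easy to demonstrate: once the hidden weights are fixed, the output layer of a 2-layer network is an ordinary least-squares problem on the hidden features. A minimal sketch, with randomly drawn hidden weights standing in for samples from the learned proposal distribution:

```python
# With fixed hidden weights, the squared-loss-optimal output weights of a
# 2-layer network solve a linear least-squares problem.
import numpy as np

rng = np.random.default_rng(4)
n, d, m = 64, 1, 10                      # small dataset, thin hidden layer
X = rng.uniform(-1, 1, size=(n, d))
y = np.sin(3 * X[:, 0])                  # target function

W = rng.normal(size=(m, d))              # hidden weights (stand-in for samples
b = rng.normal(size=m)                   # from a learned proposal distribution)
Phi = np.tanh(X @ W.T + b)               # fixed hidden features, shape (n, m)

# optimal output weights in closed form (least squares on the features)
a, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print("train MSE with optimal output layer:",
      float(np.mean((Phi @ a - y) ** 2)))
```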
[LG-19] Spatially Constrained Transformer with Efficient Global Relation Modelling for Spatio-Temporal Prediction
链接: https://arxiv.org/abs/2411.06836
作者: Ashutosh Sao,Simon Gottschalk
关键词-EN: Accurate spatio-temporal prediction, Accurate spatio-temporal, Convolutional Neural Networks, smart cities, prediction is crucial
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Accurate spatio-temporal prediction is crucial for the sustainable development of smart cities. However, current approaches often struggle to capture important spatio-temporal relationships, particularly overlooking global relations among distant city regions. Most existing techniques predominantly rely on Convolutional Neural Networks (CNNs) to capture global relations. However, CNNs exhibit neighbourhood bias, making them insufficient for capturing distant relations. To address this limitation, we propose ST-SampleNet, a novel transformer-based architecture that combines CNNs with self-attention mechanisms to capture both local and global relations effectively. Moreover, as the number of regions increases, the quadratic complexity of self-attention becomes a challenge. To tackle this issue, we introduce a lightweight region sampling strategy that prunes non-essential regions and enhances the efficiency of our approach. Furthermore, we introduce a spatially constrained position embedding that incorporates spatial neighbourhood information into the self-attention mechanism, aiding in semantic interpretation and improving the performance of ST-SampleNet. Our experimental evaluation on three real-world datasets demonstrates the effectiveness of ST-SampleNet. Additionally, our efficient variant achieves a 40% reduction in computational costs with only a marginal compromise in performance, approximately 1%.
[LG-20] Adaptive Conditional Expert Selection Network for Multi-domain Recommendation
链接: https://arxiv.org/abs/2411.06826
作者: Kuiyao Dong,Xingyu Lou,Feng Liu,Ruian Wang,Wenyi Yu,Ping Wang,Jun Wang
关键词-EN: powerful expressive ability, Multi-domain recommendation, standard in Multi-domain, expressive ability, facto standard
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注:
点击查看摘要
Abstract:Mixture-of-Experts (MOE) has recently become the de facto standard in Multi-domain recommendation (MDR) due to its powerful expressive ability. However, such MOE-based methods typically employ all experts for each instance, leading to scalability issues and low discriminability between domains and experts. Furthermore, the design of commonly used domain-specific networks exacerbates the scalability issues. To tackle these problems, we propose a novel method named CESAA, consisting of a Conditional Expert Selection (CES) module and an Adaptive Expert Aggregation (AEA) module. Specifically, CES first combines a sparse gating strategy with domain-shared experts. Then AEA utilizes mutual information loss to strengthen the correlations between experts and specific domains, and significantly improve the distinction between experts. As a result, only domain-shared experts and selected domain-specific experts are activated for each instance, striking a balance between computational efficiency and model performance. Experimental results on both public ranking and industrial retrieval datasets verify the effectiveness of our method in MDR tasks.
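A minimal sketch of the conditional selection idea, with assumed details: a domain-shared expert is always active, while a sparse gate activates only the top-k domain-specific experts per instance:

```python
# Sparse conditional expert selection sketch (assumed structure): the shared
# expert always runs; a top-k gate activates only a few specific experts.
import numpy as np

rng = np.random.default_rng(5)
d, n_experts, k = 8, 4, 1

x = rng.normal(size=d)                        # one instance's features
W_gate = rng.normal(size=(n_experts, d))
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
shared = rng.normal(size=(d, d))              # domain-shared expert

logits = W_gate @ x
top = np.argsort(logits)[-k:]                 # sparse selection: top-k experts
gate = np.zeros(n_experts)
gate[top] = np.exp(logits[top]) / np.exp(logits[top]).sum()  # renormalise

out = x @ shared                              # shared expert always active
for i in top:
    out = out + gate[i] * (x @ experts[i])    # only selected experts computed
print("selected experts:", top, "output norm:", float(np.linalg.norm(out)))
```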
[LG-21] Large Language Model in Medical Informatics: Direct Classification and Enhanced Text Representations for Automatic ICD Coding
链接: https://arxiv.org/abs/2411.06823
作者: Zeyd Boukhers,AmeerAli Khan,Qusai Ramadan,Cong Yang
关键词-EN: accurately classifying International, classifying International Classification, Large Language Models, Convolutional Neural Network, Addressing the complexity
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注: accepted at the 2024 IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2024)
点击查看摘要
Abstract:Addressing the complexity of accurately classifying International Classification of Diseases (ICD) codes from medical discharge summaries is challenging due to the intricate nature of medical documentation. This paper explores the use of Large Language Models (LLM), specifically the LLAMA architecture, to enhance ICD code classification through two methodologies: direct application as a classifier and as a generator of enriched text representations within a Multi-Filter Residual Convolutional Neural Network (MultiResCNN) framework. We evaluate these methods by comparing them against state-of-the-art approaches, revealing LLAMA’s potential to significantly improve classification outcomes by providing deep contextual insights into medical texts.
[LG-22] Streetwise Agents: Empowering Offline RL Policies to Outsmart Exogenous Stochastic Disturbances in RTC
链接: https://arxiv.org/abs/2411.06815
作者: Aditya Soni,Mayukh Das,Anjaly Parayil,Supriyo Ghosh,Shivam Shandilya,Ching-An Cheng,Vishak Gopal,Sami Khairy,Gabriel Mittag,Yasaman Hosseinkashi,Chetan Bansal
关键词-EN: real production systems, production systems limits, feedback-driven decision making, training online, difficulty of exploring
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:The difficulty of exploring and training online on real production systems limits the scope of real-time online data/feedback-driven decision making. The most feasible approach is to adopt offline reinforcement learning from limited trajectory samples. However, after deployment, such policies fail due to exogenous factors that temporarily or permanently disturb/alter the transition distribution of the assumed decision process structure induced by offline samples. This results in critical policy failures and generalization errors in sensitive domains like Real-Time Communication (RTC). We solve this crucial problem of identifying robust actions in the presence of domain shifts due to unseen exogenous stochastic factors in the wild. As it is impossible to learn generalized offline policies within the support of offline data that are robust to these unseen exogenous disturbances, we propose a novel post-deployment shaping of policies (Streetwise), conditioned on real-time characterization of out-of-distribution sub-spaces. This leads to robust actions in bandwidth estimation (BWE) of network bottlenecks in RTC and in standard benchmarks. Our extensive experimental results on BWE and other standard offline RL benchmark environments demonstrate a significant improvement (approximately 18% in some scenarios) in final returns w.r.t. end-user metrics over state-of-the-art baselines.
[LG-23] Structuring the Processing Frameworks for Data Stream Evaluation and Application
链接: https://arxiv.org/abs/2411.06799
作者: Joanna Komorniczak,Paweł Ksieniewicz,Paweł Zyblewski
关键词-EN: resembles real-world applications, data stream processing, data stream, stream processing frameworks, data stream classification
类目: Machine Learning (cs.LG); Databases (cs.DB)
*备注:
点击查看摘要
Abstract:The following work addresses the problem of frameworks for data stream processing that can be used to evaluate solutions in an environment that resembles real-world applications. The definition of structured frameworks stems from a need to reliably evaluate data stream classification methods, considering the constraints of delayed and limited label access. Current experimental evaluation often boundlessly exploits the assumption of complete and immediate label access to monitor the recognition quality and to adapt the methods to changing concepts. The problem is addressed by reviewing currently described methods and techniques for data stream processing and verifying their outcomes in a simulated environment. The outcome of this work is a proposed taxonomy of data stream processing frameworks, showing the linkage between drift detection and classification methods in view of the natural phenomenon of label delay.
[LG-24] White-Box Diffusion Transformer for single-cell RNA-seq generation
链接: https://arxiv.org/abs/2411.06785
作者: Zhuorui Cui,Shengze Dong,Ding Liu
关键词-EN: cell RNA sequencing, single cell RNA, characterizing cellular subpopulations, technology offers advantages, RNA sequencing
类目: Machine Learning (cs.LG); Genomics (q-bio.GN)
*备注: 11pages, 3 figures
点击查看摘要
Abstract:As a powerful tool for characterizing cellular subpopulations and cellular heterogeneity, single cell RNA sequencing (scRNA-seq) technology offers the advantages of high throughput and multidimensional analysis. However, the process of data acquisition is often constrained by high cost and limited sample availability. To overcome these limitations, we propose a hybrid model based on the Diffusion model and the White-Box transformer that aims to generate synthetic and biologically plausible scRNA-seq data. The Diffusion model progressively introduces noise into the data and then recovers the original data through a denoising process, a forward-and-reverse procedure that is particularly suitable for generating complex data distributions. The White-Box transformer is a deep learning architecture that emphasizes mathematical interpretability. By minimizing the encoding rate of the data and maximizing the sparsity of the representation, it not only reduces the computational burden, but also provides clear insight into the underlying structure. Our White-Box Diffusion Transformer combines the generative capabilities of the Diffusion model with the mathematical interpretability of the White-Box transformer. Through experiments using six different single-cell RNA-seq datasets, we visualize both generated and real data using the t-SNE dimensionality reduction technique, and quantify the similarity between generated and real data using various metrics, demonstrating comparable performance of the White-Box Diffusion Transformer and the Diffusion Transformer in generating scRNA-seq data alongside significant improvements in training efficiency and resource utilization. Our code is available at this https URL
[LG-25] Model Partition and Resource Allocation for Split Learning in Vehicular Edge Networks
链接: https://arxiv.org/abs/2411.06773
作者: Lu Yu,Zheng Chang,Yunjian Jia,Geyong Min
关键词-EN: autonomous driving technologies, networks presents significant, presents significant challenges, vehicular networks presents, integration of autonomous
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: arXiv admin note: text overlap with arXiv:2306.12194 by other authors
点击查看摘要
Abstract:The integration of autonomous driving technologies with vehicular networks presents significant challenges in privacy preservation, communication efficiency, and resource allocation. This paper proposes a novel U-shaped split federated learning (U-SFL) framework to address these challenges in vehicular edge networks. U-SFL is able to enhance privacy protection by keeping both raw data and labels on the vehicular user (VU) side while enabling parallel processing across multiple vehicles. To optimize communication efficiency, we introduce a semantic-aware auto-encoder (SAE) that significantly reduces the dimensionality of transmitted data while preserving essential semantic information. Furthermore, we develop a deep reinforcement learning (DRL) based algorithm to solve the NP-hard problem of dynamic resource allocation and split point selection. Our comprehensive evaluation demonstrates that U-SFL achieves comparable classification performance to traditional split learning (SL) while substantially reducing data transmission volume and communication latency. The proposed DRL-based optimization algorithm shows good convergence in balancing latency, energy consumption, and learning performance.
[LG-26] Sketched Adaptive Federated Deep Learning: A Sharp Convergence Analysis
链接: https://arxiv.org/abs/2411.06770
作者: Zhijie Chen,Qiaobo Li,Arindam Banerjee
关键词-EN: Combining gradient compression, fewer communication rounds, Combining gradient, gradient compression methods, adaptive federated learning
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Combining gradient compression methods (e.g., CountSketch, quantization) and adaptive optimizers (e.g., Adam, AMSGrad) is a desirable goal in federated learning (FL), with potential benefits for both fewer communication rounds and less per-round communication. In spite of the preliminary empirical success of sketched adaptive methods, existing convergence analyses show the communication cost to have a linear dependence on the ambient dimension, i.e., the number of parameters, which is prohibitively high for modern deep learning models. In this work, we introduce specific sketched adaptive federated learning (SAFL) algorithms and, as our main contribution, provide theoretical convergence analyses in different FL settings with guarantees on communication cost depending only logarithmically (instead of linearly) on the ambient dimension. Unlike existing analyses, we show that the entry-wise sketching noise present in the preconditioners and the first moments of SAFL can be implicitly addressed by leveraging the recently popularized anisotropic curvatures in deep learning losses, e.g., fast-decaying loss Hessian eigenvalues. In the i.i.d. client setting of FL, we show that SAFL achieves asymptotic O(1/\sqrt{T}) convergence, and converges faster in the initial epochs. In the non-i.i.d. client setting, where non-adaptive methods lack convergence guarantees, we show that SACFL (SAFL with clipping) algorithms can provably converge in spite of the additional heavy-tailed noise. Our theoretical claims are supported by empirical studies on vision and language tasks, and in both fine-tuning and training-from-scratch regimes. Surprisingly, as a by-product of our analysis, the proposed SAFL methods are competitive with the state-of-the-art communication-efficient federated learning algorithms based on error feedback.
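A minimal sketch of the sketching side of SAFL, under assumed mechanics: a client compresses its gradient with CountSketch, the server forms an unbiased unsketched estimate, and an Adam-style adaptive step consumes the estimate. The sketch width, hash construction, and optimizer constants are illustrative:

```python
# CountSketch compression of a gradient plus an Adam-style server step
# (assumed mechanics; width, hashes, and constants are illustrative).
import numpy as np

rng = np.random.default_rng(6)
d, width = 10_000, 512                           # ambient dim vs sketch width

h = rng.integers(0, width, size=d)               # hash bucket per coordinate
s = rng.choice([-1.0, 1.0], size=d)              # random signs

def sketch(g):
    return np.bincount(h, weights=s * g, minlength=width)

def unsketch(S):
    return s * S[h]                              # unbiased estimate of g

g = np.zeros(d)
g[:50] = rng.normal(size=50) * 5                 # roughly sparse gradient
g_hat = unsketch(sketch(g))                      # what the server recovers
err = np.linalg.norm(g_hat - g) / np.linalg.norm(g)
print(f"relative recovery error: {err:.2f}")

# Adam-style server step on the unsketched gradient (state kept server-side)
m, v = np.zeros(d), np.zeros(d)
beta1, beta2, lr, eps = 0.9, 0.999, 1e-3, 1e-8
m = beta1 * m + (1 - beta1) * g_hat
v = beta2 * v + (1 - beta2) * g_hat**2
w_update = -lr * m / (np.sqrt(v) + eps)
```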
[LG-27] Precision Glass Thermoforming Assisted by Neural Networks
链接: https://arxiv.org/abs/2411.06762
作者: Yuzhou Zhang,Mohan Hua,Haihui Ruan
关键词-EN: require curve profiles, chemical inertness, good processability, optical transparency, curve profiles
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Glass with good processability, chemical inertness, and optical transparency has been widely used in optical and aesthetic products, many of which require curve profiles with high precision. To meet increasingly tight geometrical tolerances and fast product update rates, the traditional approach of developing a thermoforming process through trial and error can waste considerable time and resources and often ends in failure. Hence, there is a need to develop an efficient predictive model, replacing costly simulations or experiments, to assist the design of precision glass thermoforming. In this work, we report a dimensionless back-propagation neural network (BPNN) that can adequately predict the form errors and thus compensate for these errors in mold design to achieve precision glass molding. Based on the precision molds, we also discuss the issue of error magnification, considering that cover glass for AR/VR glasses or smartphones, with its extremely large scale of production, may require a lower level of mold machining accuracy. It is expected that this BPNN will also be implementable in the glass-manufacturing industry, i.e., trained using industrial data for precision mold designs.
[LG-28] Neuromodulated Meta-Learning
链接: https://arxiv.org/abs/2411.06746
作者: Jingyao Wang,Huijie Guo,Wenwen Qiang,Jiangmeng Li,Changwen Zheng,Hui Xiong,Gang Hua
关键词-EN: enabling efficient interaction, Humans excel, FNS, diverse environments, enabling efficient
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Humans excel at adapting perceptions and actions to diverse environments, enabling efficient interaction with the external world. This adaptive capability relies on the biological nervous system (BNS), which activates different brain regions for distinct tasks. Meta-learning similarly trains machines to handle multiple tasks but relies on a fixed network structure, which is not as flexible as the BNS. To investigate the role of flexible network structure (FNS) in meta-learning, we conduct extensive empirical and theoretical analyses, finding that model performance is tied to structure, with no universally optimal pattern across tasks. This reveals the crucial role of FNS in meta-learning: ensuring that meta-learning generates the optimal structure for each task maximizes the performance and learning efficiency of meta-learning. Motivated by this insight, we propose to define, measure, and model FNS in meta-learning. First, we define that an effective FNS should possess frugality, plasticity, and sensitivity. Then, to quantify FNS in practice, we present three measurements for these properties, collectively forming the "structure constraint" with theoretical support. Building on this, we finally propose Neuromodulated Meta-Learning (NeuronML) to model FNS in meta-learning. It utilizes bi-level optimization to update both weights and structure under the structure constraint. Extensive theoretical and empirical evaluations demonstrate the effectiveness of NeuronML on various tasks. Code is publicly available at this https URL.
[LG-29] Beating Adversarial Low-Rank MDPs with Unknown Transition and Bandit Feedback NEURIPS2024
链接: https://arxiv.org/abs/2411.06739
作者: Haolin Liu,Zakaria Mhammedi,Chen-Yu Wei,Julian Zimmert
关键词-EN: adversarial losses, minimization in low-rank, low-rank MDPs, MDPs with fixed, bandit loss feedback
类目: Machine Learning (cs.LG)
*备注: NeurIPS 2024
点击查看摘要
Abstract:We consider regret minimization in low-rank MDPs with fixed transition and adversarial losses. Previous work has investigated this problem under either full-information loss feedback with unknown transitions (Zhao et al., 2024), or bandit loss feedback with known transitions (Foster et al., 2022). First, we improve the poly(d, A, H)T^{5/6} regret bound of Zhao et al. (2024) to poly(d, A, H)T^{2/3} for the full-information unknown-transition setting, where d is the rank of the transitions, A is the number of actions, H is the horizon length, and T is the number of episodes. Next, we initiate the study of the setting with bandit loss feedback and unknown transitions. Assuming that the loss has a linear structure, we propose both model-based and model-free algorithms achieving poly(d, A, H)T^{2/3} regret, though they are computationally inefficient. We also propose oracle-efficient model-free algorithms with poly(d, A, H)T^{4/5} regret. We show that the linear structure is necessary for the bandit case: without structure on the reward function, the regret has to scale polynomially with the number of states. This is contrary to the full-information case (Zhao et al., 2024), where the regret can be independent of the number of states even for unstructured reward functions.
[LG-30] Mr.Steve: Instruction-Following Agents in Minecraft with What-Where-When Memory
链接: https://arxiv.org/abs/2411.06736
作者: Junyeong Park,Junmo Cho,Sungjin Ahn
关键词-EN: developing general-purpose embodied, LLM-augmented hierarchical approaches, Significant advances, environments like Minecraft, made in developing
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Significant advances have been made in developing general-purpose embodied AI in environments like Minecraft through the adoption of LLM-augmented hierarchical approaches. While these approaches, which combine high-level planners with low-level controllers, show promise, low-level controllers frequently become performance bottlenecks due to repeated failures. In this paper, we argue that the primary cause of failure in many low-level controllers is the absence of an episodic memory system. To address this, we introduce Mr.Steve (Memory Recall Steve-1), a novel low-level controller equipped with Place Event Memory (PEM), a form of episodic memory that captures what, where, and when information from episodes. This directly addresses the main limitation of the popular low-level controller, Steve-1. Unlike previous models that rely on short-term memory, PEM organizes spatial and event-based data, enabling efficient recall and navigation in long-horizon tasks. Additionally, we propose an Exploration Strategy and a Memory-Augmented Task Solving Framework, allowing agents to alternate between exploration and task-solving based on recalled events. Our approach significantly improves task-solving and exploration efficiency compared to existing methods. We will release our code and demos on the project page: this https URL.
[LG-31] GSL-PCD: Improving Generalist-Specialist Learning with Point Cloud Feature-based Task Partitioning
链接: https://arxiv.org/abs/2411.06733
作者: Xiu Yuan
关键词-EN: Deep Reinforcement Learning, Generalization in Deep, Deep Reinforcement, Reinforcement Learning, set of scenarios
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注:
点击查看摘要
Abstract:Generalization in Deep Reinforcement Learning (DRL) across unseen environment variations often requires training over a diverse set of scenarios. Many existing DRL algorithms struggle with efficiency when handling numerous variations. The Generalist-Specialist Learning (GSL) framework addresses this by first training a generalist model on all variations, then creating specialists from the generalist’s weights, each focusing on a subset of variations. The generalist then refines its learning with assistance from the specialists. However, random task partitioning in GSL can impede performance by assigning vastly different variations to the same specialist, often resulting in each specialist focusing on only one variation, which raises computational costs. To improve this, we propose Generalist-Specialist Learning with Point Cloud Feature-based Task Partitioning (GSL-PCD). Our approach clusters environment variations based on features extracted from object point clouds and uses balanced clustering with a greedy algorithm to assign similar variations to the same specialist. Evaluations on robotic manipulation tasks from the ManiSkill benchmark demonstrate that point cloud feature-based partitioning outperforms vanilla partitioning by 9.4%, with a fixed number of specialists, and reduces computational and sample requirements by 50% to achieve comparable performance.
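A minimal sketch of the balanced partitioning step, with assumed details: given a feature vector per environment variation (extracted from object point clouds) and cluster centroids, assign variations greedily in order of distance while capping each specialist's load:

```python
# Balanced greedy assignment sketch (assumed details): place the most
# confident variation-specialist pairs first, subject to a per-specialist cap.
import numpy as np

rng = np.random.default_rng(7)
n_vars, d, n_spec = 12, 16, 3
feats = rng.normal(size=(n_vars, d))             # per-variation point-cloud features
centroids = feats[rng.choice(n_vars, n_spec, replace=False)]

dist = np.linalg.norm(feats[:, None] - centroids[None], axis=2)
cap = int(np.ceil(n_vars / n_spec))              # balance constraint

assign = -np.ones(n_vars, dtype=int)
load = np.zeros(n_spec, dtype=int)
# greedily place the smallest-distance (most confident) pairs first
for v, c in sorted(((v, c) for v in range(n_vars) for c in range(n_spec)),
                   key=lambda vc: dist[vc[0], vc[1]]):
    if assign[v] == -1 and load[c] < cap:
        assign[v], load[c] = c, load[c] + 1
print("specialist per variation:", assign, "loads:", load)
```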
[LG-32] Real-time Monitoring and Analysis of Track and Field Athletes Based on Edge Computing and Deep Reinforcement Learning Algorithm
链接: https://arxiv.org/abs/2411.06720
作者: Xiaowei Tang,Bin Long,Li Zhou
关键词-EN: addressing the limitations, real-time performance, deep learning algorithms, track and field, deep learning
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 17 pages
点击查看摘要
Abstract:This research focuses on real-time monitoring and analysis of track and field athletes, addressing the limitations of traditional monitoring systems in terms of real-time performance and accuracy. We propose an IoT-optimized system that integrates edge computing and deep learning algorithms. Traditional systems often experience delays and reduced accuracy when handling complex motion data, whereas our method, by incorporating a SAC-optimized deep learning model within the IoT architecture, achieves efficient motion recognition and real-time feedback. Experimental results show that this system significantly outperforms traditional methods in response time, data processing accuracy, and energy efficiency, particularly excelling in complex track and field events. This research not only enhances the precision and efficiency of athlete monitoring but also provides new technical support and application prospects for sports science research.
[LG-33] Learning a Single Neuron Robustly to Distributional Shifts and Adversarial Label Noise
链接: https://arxiv.org/abs/2411.06697
作者: Shuyao Li,Sushrut Karmalkar,Ilias Diakonikolas,Jelena Diakonikolas
关键词-EN: adversarial distribution shifts, best-fit function
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:We study the problem of learning a single neuron with respect to the L_2^2-loss in the presence of adversarial distribution shifts, where the labels can be arbitrary, and the goal is to find a "best-fit" function. More precisely, given training samples from a reference distribution \mathcal{p}_0, the goal is to approximate the vector \mathbf{w}^* which minimizes the squared loss with respect to the worst-case distribution that is close in \chi^2-divergence to \mathcal{p}_0. We design a computationally efficient algorithm that recovers a vector \hat{\mathbf{w}} satisfying \mathbb{E}_{\mathcal{p}^*}[(\sigma(\hat{\mathbf{w}} \cdot \mathbf{x}) - y)^2] \leq C \mathbb{E}_{\mathcal{p}^*}[(\sigma(\mathbf{w}^* \cdot \mathbf{x}) - y)^2] + \epsilon, where C > 1 is a dimension-independent constant and (\mathbf{w}^*, \mathcal{p}^*) is the witness attaining the min-max risk \min_{\mathbf{w} : \|\mathbf{w}\| \leq W} \max_{\mathcal{p}} \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{p}}[(\sigma(\mathbf{w} \cdot \mathbf{x}) - y)^2] - \nu \chi^2(\mathcal{p}, \mathcal{p}_0). Our algorithm follows a primal-dual framework and is designed by directly bounding the risk with respect to the original, nonconvex L_2^2 loss. From an optimization standpoint, our work opens new avenues for the design of primal-dual algorithms under structured nonconvexity.
[LG-34] Shedding Light on Problems with Hyperbolic Graph Learning
链接: https://arxiv.org/abs/2411.06688
作者: Isay Katsman,Anna Gilbert
关键词-EN: machine learning literature, Recent papers, graph representation learning, graph machine learning, hyperbolic graph representation
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Preprint
点击查看摘要
Abstract:Recent papers in the graph machine learning literature have introduced a number of approaches for hyperbolic representation learning. The asserted benefits are improved performance on a variety of graph tasks, node classification and link prediction included. Claims have also been made about the geometric suitability of particular hierarchical graph datasets to representation in hyperbolic space. Despite these claims, our work makes a surprising discovery: when simple Euclidean models with comparable numbers of parameters are properly trained in the same environment, in most cases, they perform as well, if not better, than all introduced hyperbolic graph representation learning models, even on graph datasets previously claimed to be the most hyperbolic as measured by Gromov \delta-hyperbolicity (i.e., perfect trees). This observation gives rise to a simple question: how can this be? We answer this question by taking a careful look at the field of hyperbolic graph representation learning as it stands today, and find that a number of papers fail to diligently present baselines, make faulty modelling assumptions when constructing algorithms, and use misleading metrics to quantify the geometry of graph datasets. We take a closer look at each of these three problems, elucidate the issues, perform an analysis of methods, and introduce a parametric family of benchmark datasets to ascertain the applicability of (hyperbolic) graph neural networks.
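For readers unfamiliar with the metric mentioned above, the Gromov \delta-hyperbolicity of a finite metric space can be computed from the four-point condition; trees attain \delta = 0, which is why perfect trees are regarded as maximally hyperbolic. A small self-contained check:

```python
# Gromov delta-hyperbolicity via the four-point condition: for each quadruple,
# take the three pairwise-distance sums; delta bounds half the gap between the
# largest and second-largest sum. Trees give delta = 0.
import itertools
import numpy as np

def delta_hyperbolicity(D):
    delta = 0.0
    for x, y, z, w in itertools.combinations(range(D.shape[0]), 4):
        sums = sorted([D[x, y] + D[z, w],
                       D[x, z] + D[y, w],
                       D[x, w] + D[y, z]])
        delta = max(delta, (sums[2] - sums[1]) / 2.0)
    return delta

# path graph on 5 nodes (a tree): shortest-path distances |i - j|
idx = np.arange(5)
D_tree = np.abs(np.subtract.outer(idx, idx)).astype(float)
print("tree delta :", delta_hyperbolicity(D_tree))      # 0.0

# 5-cycle: shortest-path distances min(|i - j|, 5 - |i - j|)
diff = np.abs(np.subtract.outer(idx, idx))
D_cycle = np.minimum(diff, 5 - diff).astype(float)
print("cycle delta:", delta_hyperbolicity(D_cycle))     # > 0
```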
[LG-35] A Novel Combined Data-Driven Approach for Electricity Theft Detection
链接: https://arxiv.org/abs/2411.06649
作者: Kedi Zheng,Qixin Chen,Yi Wang,Chongqing Kang,Qing Xia
关键词-EN: Energy Internet, important feature, data mining techniques, energy, two-way flow
类目: Systems and Control (eess.SY); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: Paper accepted for IEEE Transactions on Industrial Informatics. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses
点击查看摘要
Abstract:The two-way flow of information and energy is an important feature of the Energy Internet. Data analytics is a powerful tool in the information flow that aims to solve practical problems using data mining techniques. As the problem of electricity thefts via tampering with smart meters continues to increase, the abnormal behaviors of thefts become more diversified and more difficult to detect. Thus, a data analytics method for detecting various types of electricity thefts is required. However, the existing methods either require a labeled dataset or additional system information which is difficult to obtain in reality or have poor detection accuracy. In this paper, we combine two novel data mining techniques to solve the problem. One technique is the Maximum Information Coefficient (MIC), which can find the correlations between the non-technical loss (NTL) and a certain electricity behavior of the consumer. MIC can be used to precisely detect thefts that appear normal in shapes. The other technique is the clustering technique by fast search and find of density peaks (CFSFDP). CFSFDP finds the abnormal users among thousands of load profiles, making it quite suitable for detecting electricity thefts with arbitrary shapes. Next, a framework for combining the advantages of the two techniques is proposed. Numerical experiments on the Irish smart meter dataset are conducted to show the good performance of the combined method.
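A minimal sketch of the CFSFDP side of the method, with assumed data: the density rho_i counts neighbours within a cutoff, delta_i is the distance to the nearest higher-density point, and profiles with low density but high delta are natural theft candidates:

```python
# CFSFDP sketch (assumed synthetic data): rho = local density within a cutoff,
# delta = distance to the nearest higher-density point; low-rho / high-delta
# load profiles are flagged as suspicious.
import numpy as np

rng = np.random.default_rng(8)
normal = rng.normal(0, 1, size=(200, 24))        # hypothetical daily load profiles
thieves = rng.normal(4, 1, size=(5, 24))         # a few far-away abnormal users
X = np.vstack([normal, thieves])

D = np.linalg.norm(X[:, None] - X[None], axis=2)
d_c = np.quantile(D[D > 0], 0.02)                # cutoff at the ~2% quantile

rho = (D < d_c).sum(axis=1) - 1                  # local density (exclude self)
delta = np.empty(len(X))
for i in range(len(X)):
    higher = np.where(rho > rho[i])[0]
    delta[i] = D[i, higher].min() if len(higher) else D[i].max()

score = delta / (rho + 1)                        # high delta, low rho = outlier
print("top-5 suspected users:", np.argsort(score)[-5:])
```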
[LG-36] Mixed Effects Deep Learning Autoencoder for interpretable analysis of single cell RNA Sequencing data
链接: https://arxiv.org/abs/2411.06635
作者: Aixa X. Andrade,Son Nguyen,Albert Montillo
关键词-EN: Single-cell RNA sequencing, Single-cell RNA, RNA sequencing, deep learning, Effects Deep Learning
类目: Machine Learning (cs.LG); Genomics (q-bio.GN)
*备注: Main manuscript: 29 pages, including 10 figures and 8 tables. Supplemental material: 17 pages
点击查看摘要
Abstract:Single-cell RNA sequencing (scRNA-seq) data are often confounded due to technical or biological batch effects. Existing deep learning models aim to mitigate these effects but may inadvertently discard batch-specific information. We propose a Mixed Effects Deep Learning (MEDL) Autoencoder framework that separately models batch-invariant (fixed effects) and batch-specific (random effects) components. By decoupling fixed effects representing biological states from random effects capturing batch-specific variations, MEDL integrates both types of information into predictive models, minimizing information loss. This approach improves interpretability enabling 2D visualizations that show how the same cell would appear across different batches, facilitating exploration of batch-specific variations. We applied MEDL to three datasets: Healthy Heart, Autism Spectrum Disorder (ASDc), and Acute Myeloid Leukemia (AML). In Healthy Heart, MEDL managed 147 batches, assessing its capacity to handle high batch numbers. In ASDc, MEDL captured donor heterogeneity between autistic and healthy individuals, while in AML, it distinguished heterogeneity in a complex setting with variable cell-type presence and malignant cells in diseased donors. These applications demonstrate MEDL’s potential to capture fixed and random effects, improve visualization, and enhance predictive accuracy, offering a robust framework for cellular heterogeneity analysis across diverse datasets.
[LG-37] Inductive Graph Few-shot Class Incremental Learning
链接: https://arxiv.org/abs/2411.06634
作者: Yayong Li,Peyman Moghadam,Can Peng,Nan Ye,Piotr Koniusz
关键词-EN: Graph Neural Networks, Neural Networks, GNN classifier, Graph Neural, Graph Few-Shot Class
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Node classification with Graph Neural Networks (GNN) under a fixed set of labels is well known, in contrast to Graph Few-Shot Class Incremental Learning (GFSCIL), which involves learning a GNN classifier as graph nodes and classes grow over time sporadically. We introduce inductive GFSCIL that continually learns novel classes with newly emerging nodes while maintaining performance on old classes without accessing previous data. This addresses the practical concern of transductive GFSCIL, which requires storing the entire graph with historical data. Compared to the transductive GFSCIL, the inductive setting exacerbates catastrophic forgetting due to inaccessible previous data during incremental training, in addition to the overfitting issue caused by label sparsity. Thus, we propose a novel method, called Topology-based class Augmentation and Prototype calibration (TAP). To be specific, it first creates a triple-branch multi-topology class augmentation method to enhance model generalization ability. As each incremental session receives a disjoint subgraph with nodes of novel classes, the multi-topology class augmentation method helps replicate such a setting in the base session to boost backbone versatility. In incremental learning, given the limited number of novel class samples, we propose an iterative prototype calibration to improve the separation of class prototypes. Furthermore, as backbone fine-tuning causes feature distribution drift and prototypes of old classes start failing over time, we propose a prototype shift method for old classes to compensate for the drift. We showcase the proposed method on four datasets.
[LG-38] Using Diffusion Models as Generative Replay in Continual Federated Learning – What will Happen?
链接: https://arxiv.org/abs/2411.06618
作者: Yongsheng Mei,Liangqi Yuan,Dong-Jun Han,Kevin S. Chan,Christopher G. Brinton,Tian Lan
关键词-EN: introducing continuous learning, dynamically over time, introducing continuous, Federated learning, continual federated learning
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:
点击查看摘要
Abstract:Federated learning (FL) has become a cornerstone in decentralized learning, where, in many scenarios, the incoming data distribution will change dynamically over time, introducing continuous learning (CL) problems. This continual federated learning (CFL) task presents unique challenges, particularly regarding catastrophic forgetting and non-IID input data. Existing solutions include using a replay buffer to store historical data or leveraging generative adversarial networks. Nevertheless, motivated by recent advancements in the diffusion model for generative tasks, this paper introduces DCFL, a novel framework tailored to address the challenges of CFL in dynamic distributed learning environments. Our approach harnesses the power of the conditional diffusion model to generate synthetic historical data at each local device during communication, effectively mitigating latent shifts in dynamic data distribution inputs. We provide the convergence bound for the proposed CFL framework and demonstrate its promising performance across multiple datasets, showcasing its effectiveness in tackling the complexities of CFL tasks.
[LG-39] Are Neuromorphic Architectures Inherently Privacy-preserving? An Exploratory Study
链接: https://arxiv.org/abs/2411.06613
作者: Ayana Moshruba,Ihsen Alouani,Maryam Parsa
关键词-EN: Artificial Neural Networks, Spiking Neural Networks, sensitive application areas, Neural Networks, growing concern
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Neural and Evolutionary Computing (cs.NE)
*备注:
点击查看摘要
Abstract:While machine learning (ML) models are becoming mainstream, especially in sensitive application areas, the risk of data leakage has become a growing concern. Attacks like membership inference (MIA) have shown that trained models can reveal sensitive data, jeopardizing confidentiality. While traditional Artificial Neural Networks (ANNs) dominate ML applications, neuromorphic architectures, specifically Spiking Neural Networks (SNNs), are emerging as promising alternatives due to their low power consumption and event-driven processing, akin to biological neurons. Privacy in ANNs is well-studied; however, little work has explored the privacy-preserving properties of SNNs. This paper examines whether SNNs inherently offer better privacy. Using MIAs, we assess the privacy resilience of SNNs versus ANNs across diverse datasets. We analyze the impact of learning algorithms (surrogate gradient and evolutionary), frameworks (snnTorch, TENNLab, LAVA), and parameters on SNN privacy. Our findings show that SNNs consistently outperform ANNs in privacy preservation, with evolutionary algorithms offering additional resilience. For instance, on CIFAR-10, SNNs achieve an AUC of 0.59, significantly lower than ANNs’ 0.82, and on CIFAR-100, SNNs maintain an AUC of 0.58 compared to ANNs’ 0.88. Additionally, we explore the privacy-utility trade-off with Differentially Private Stochastic Gradient Descent (DPSGD), finding that SNNs sustain less accuracy loss than ANNs under similar privacy constraints.
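For readers unfamiliar with the AUC numbers above, the sketch below shows the simplest loss-threshold membership inference attack and how its AUC is computed. The toy loss distributions are assumptions; the paper's attack setup is more involved.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def mia_auc(loss_members, loss_nonmembers):
    """Loss-threshold membership inference: training members tend to have
    lower loss, so -loss is a membership score. AUC near 0.5 means little
    leakage; near 1.0 means strong leakage."""
    scores = np.concatenate([-loss_members, -loss_nonmembers])
    labels = np.concatenate([np.ones(len(loss_members)),
                             np.zeros(len(loss_nonmembers))])
    return roc_auc_score(labels, scores)

rng = np.random.default_rng(1)
members = rng.gamma(2.0, 0.3, size=5000)       # toy: members fit better (lower loss)
nonmembers = rng.gamma(2.0, 0.5, size=5000)
print(f"attack AUC: {mia_auc(members, nonmembers):.2f}")
```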
[LG-40] MolMiner: Transformer architecture for fragment-based autoregressive generation of molecular stories
链接: https://arxiv.org/abs/2411.06608
作者: Raul Ortega Ochoa,Tejs Vegge,Jes Frellsen
关键词-EN: high-throughput screening paradigms, Deep generative models, Deep generative, screening paradigms, popular choice
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*备注:
点击查看摘要
Abstract:Deep generative models for molecular discovery have become a very popular choice in new high-throughput screening paradigms. These models build on advances in natural language processing and computer vision, achieving ever greater results. However, generative molecular modelling has unique challenges that are often overlooked. Chemical validity, interpretability of the generation process and flexibility to variable molecular sizes are among some of the remaining challenges for generative models in computational materials design. In this work, we propose an autoregressive approach that decomposes molecular generation into a sequence of discrete and interpretable steps using molecular fragments as units, a 'molecular story'. Enforcing chemical rules in the stories guarantees the chemical validity of the generated molecules, the discrete sequential steps of a molecular story make the process transparent, improving interpretability, and the autoregressive nature of the approach allows the size of the molecule to be a decision of the model. We demonstrate the validity of the approach in a multi-target inverse design of electroactive organic compounds, focusing on the target properties of solubility, redox potential, and synthetic accessibility. Our results show that the model can effectively bias the generation distribution according to the prompted multi-target objective.
[LG-41] An Energy-Based Self-Adaptive Learning Rate for Stochastic Gradient Descent: Enhancing Unconstrained Optimization with VAV method
链接: https://arxiv.org/abs/2411.06573
作者: Jiahao Zhang,Christian Moya,Guang Lin
关键词-EN: Vector Auxiliary Variable, learning rate remains, achieving model stability, essential for achieving, Auxiliary Variable
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Optimizing the learning rate remains a critical challenge in machine learning, essential for achieving model stability and efficient convergence. The Vector Auxiliary Variable (VAV) algorithm introduces a novel energy-based self-adjustable learning rate optimization method designed for unconstrained optimization problems. It incorporates an auxiliary variable r to facilitate efficient energy approximation without backtracking while adhering to the unconditional energy dissipation law. Notably, VAV demonstrates superior stability with larger learning rates and achieves faster convergence in the early stage of the training process. Comparative analyses demonstrate that VAV outperforms Stochastic Gradient Descent (SGD) across various tasks. This paper also provides rigorous proof of the energy dissipation law and establishes the convergence of the algorithm under reasonable assumptions. Additionally, r acts as an empirical lower bound of the training loss in practice, offering a novel scheduling approach that further enhances algorithm performance.
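The abstract does not spell out the VAV update itself, so as an illustration of the general idea only (an auxiliary energy variable r that decays monotonically and rescales the step), here is a scalar variant of the related AEGD scheme of Liu and Tian; the VAV method differs in its details.

```python
import numpy as np

def energy_adaptive_gd(f, grad, x0, lr=0.1, c=1.0, steps=200):
    """Energy-based self-adjusting step size (AEGD-style sketch, not VAV):
    r shadows sqrt(f(x) + c) and can only decrease, giving an
    unconditional energy dissipation property."""
    x = np.asarray(x0, dtype=float)
    r = np.sqrt(f(x) + c)
    for _ in range(steps):
        v = grad(x) / (2.0 * np.sqrt(f(x) + c))
        r = r / (1.0 + 2.0 * lr * np.sum(v * v))   # monotone decay of the energy variable
        x = x - 2.0 * lr * r * v                   # r rescales the effective step
    return x, r

f = lambda x: np.sum((x - 3.0) ** 2)
grad = lambda x: 2.0 * (x - 3.0)
x_opt, r_final = energy_adaptive_gd(f, grad, x0=np.zeros(5))
print(np.round(x_opt, 3), f"final energy variable r = {r_final:.3f}")
```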
[LG-42] Fitting Multiple Machine Learning Models with Performance Based Clustering
链接: https://arxiv.org/abs/2411.06572
作者: Mehmet Efe Lorasdagi,Ahmet Berker Koc,Ali Taha Koc,Suleyman Serdar Kozat
关键词-EN: Traditional machine learning, single generating mechanism, machine learning approaches, learning approaches assume, single mechanism assumption
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
点击查看摘要
Abstract:Traditional machine learning approaches assume that data comes from a single generating mechanism, which may not hold for most real-life data. In these cases, the single-mechanism assumption can result in suboptimal performance. We introduce a clustering framework that eliminates this assumption by grouping the data according to the relations between the features and the target values, and we obtain multiple separate models to learn different parts of the data. We further extend our framework to applications having streaming data, where we produce outcomes using an ensemble of models. For this, the ensemble weights are updated based on the incoming data batches. We demonstrate the performance of our approach on widely studied real-life datasets, showing significant improvements over the traditional single-model approaches.
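A minimal sketch of the "group data by the feature-target relation, fit one model per group" idea, as an EM-style alternation between assignment and refitting; the paper's actual clustering criterion and streaming ensemble are not reproduced.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_k_models(X, y, k=2, iters=10, seed=0):
    """Alternate between (1) refitting each model on its assigned samples and
    (2) reassigning each sample to the model that predicts it best, so groups
    reflect feature-target relations rather than feature-space proximity."""
    rng = np.random.default_rng(seed)
    assign = rng.integers(k, size=len(y))
    models = [LinearRegression() for _ in range(k)]
    for _ in range(iters):
        for j in range(k):
            if np.any(assign == j):
                models[j].fit(X[assign == j], y[assign == j])
        errs = np.stack([(y - m.predict(X)) ** 2 for m in models], axis=1)
        assign = errs.argmin(axis=1)
    return models, assign

# two generating mechanisms mixed in one dataset: y = 2x and y = -2x
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(400, 1))
y = np.where(rng.random(400) < 0.5, 2.0, -2.0) * X[:, 0] + 0.05 * rng.normal(size=400)
models, assign = fit_k_models(X, y)
print([round(m.coef_[0], 2) for m in models])     # approximately [2.0, -2.0], order may vary
```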
[LG-43] Thermodynamically-Informed Iterative Neural Operators for Heterogeneous Elastic Localization
链接: https://arxiv.org/abs/2411.06529
作者: Conlain Kelly,Surya R. Kalidindi
关键词-EN: Engineering problems frequently, spatially-varying discontinuous coefficients, problems frequently require, frequently require solution, Engineering problems
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*备注: Submitted to Elsevier
点击查看摘要
Abstract:Engineering problems frequently require solution of governing equations with spatially-varying discontinuous coefficients. Even for linear elliptic problems, mapping large ensembles of coefficient fields to solutions can become a major computational bottleneck using traditional numerical solvers. Furthermore, machine learning methods such as neural operators struggle to fit these maps due to sharp transitions and high contrast in the coefficient fields and a scarcity of informative training data. In this work, we focus on a canonical problem in computational mechanics: prediction of local elastic deformation fields over heterogeneous material structures subjected to periodic boundary conditions. We construct a hybrid approximation for the coefficient-to-solution map using a Thermodynamically-informed Iterative Neural Operator (TherINO). Rather than using coefficient fields as direct inputs and iterating over a learned latent space, we employ thermodynamic encodings – drawn from the constitutive equations – and iterate over the solution space itself. Through an extensive series of case studies, we elucidate the advantages of these design choices in terms of efficiency, accuracy, and flexibility. We also analyze the model’s stability and extrapolation properties on out-of-distribution coefficient fields and demonstrate an improved speed-accuracy tradeoff for predicting elastic quantities of interest.
[LG-44] Causal Representation Learning from Multimodal Biological Observations
链接: https://arxiv.org/abs/2411.06518
作者: Yuewen Sun,Lingjing Kong,Guangyi Chen,Loka Li,Gongxu Luo,Zijian Li,Yixuan Zhang,Yujia Zheng,Mengyue Yang,Petar Stojanov,Eran Segal,Eric P. Xing,Kun Zhang
关键词-EN: underlying biological mechanisms, biological applications, Prevalent, biological, machine learning
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM); Methodology (stat.ME)
*备注:
点击查看摘要
Abstract:Prevalent in biological applications (e.g., human phenotype measurements), multimodal datasets can provide valuable insights into the underlying biological mechanisms. However, current machine learning models designed to analyze such datasets still lack interpretability and theoretical guarantees, which are essential to biological applications. Recent advances in causal representation learning have shown promise in uncovering the interpretable latent causal variables with formal theoretical certificates. Unfortunately, existing works for multimodal distributions either rely on restrictive parametric assumptions or provide rather coarse identification results, limiting their applicability to biological research which favors a detailed understanding of the mechanisms. In this work, we aim to develop flexible identification conditions for multimodal data and principled methods to facilitate the understanding of biological datasets. Theoretically, we consider a flexible nonparametric latent distribution (c.f., parametric assumptions in prior work) permitting causal relationships across potentially different modalities. We establish identifiability guarantees for each latent component, extending the subspace identification results from prior work. Our key theoretical ingredient is the structural sparsity of the causal connections among distinct modalities, which, as we will discuss, is natural for a large collection of biological systems. Empirically, we propose a practical framework to instantiate our theoretical insights. We demonstrate the effectiveness of our approach through extensive experiments on both numerical and synthetic datasets. Results on a real-world human phenotype dataset are consistent with established medical research, validating our theoretical and methodological framework.
[LG-45] Individual Regret in Cooperative Stochastic Multi-Armed Bandits
链接: https://arxiv.org/abs/2411.06501
作者: Idan Barnea,Tal Lancewicki,Yishay Mansour
关键词-EN: stochastic Multi-Armed Bandits, Multi-Armed Bandits, arbitrary connected communication, individual regret bound, connected communication graph
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 42 pages, 1 figure
点击查看摘要
Abstract:We study the regret in stochastic Multi-Armed Bandits (MAB) with multiple agents that communicate over an arbitrary connected communication graph. We show a near-optimal individual regret bound of $\tilde{O}(\sqrt{AT/m}+A)$, where $A$ is the number of actions, $T$ the time horizon, and $m$ the number of agents. In particular, assuming a sufficient number of agents, we achieve a regret bound of $\tilde{O}(A)$, which is independent of the sub-optimality gaps and the diameter of the communication graph. To the best of our knowledge, our study is the first to show an individual regret bound in cooperative stochastic MAB that is independent of the graph's diameter and applicable to non-fully-connected communication graphs.
[LG-46] Towards Graph Neural Network Surrogates Leveraging Mechanistic Expert Knowledge for Pandemic Response
链接: https://arxiv.org/abs/2411.06500
作者: Agatha Schmidt,Henrik Zunker,Alexander Heinlein,Martin J. Kühn
关键词-EN: evidence-based decision making, guide evidence-based decision, proven fundamental, fundamental to guide, guide evidence-based
类目: Machine Learning (cs.LG); Populations and Evolution (q-bio.PE)
*备注: 22 pages, 8 figures
点击查看摘要
Abstract:During the COVID-19 crisis, mechanistic models have been proven fundamental to guide evidence-based decision making. However, time-critical decisions in a dynamically changing environment restrict the time available for modelers to gather supporting evidence. As infectious disease dynamics are often heterogeneous on a spatial or demographic scale, models should be resolved accordingly. In addition, with a large number of potential interventions, all scenarios can barely be computed on time, even when using supercomputing facilities. We suggest to combine complex mechanistic models with data-driven surrogate models to allow for on-the-fly model adaptations by public health experts. We build upon a spatially and demographically resolved infectious disease model and train a graph neural network for data sets representing early phases of the pandemic. The resulting networks reached an execution time of less than a second, a significant speedup compared to the metapopulation approach. The suggested approach yields potential for on-the-fly execution and, thus, integration of disease dynamics models in low-barrier website applications. For the approach to be used with decision-making, datasets with larger variance will have to be considered.
[LG-47] Accelerating Large Language Model Training with 4D Parallelism and Memory Consumption Estimator
链接: https://arxiv.org/abs/2411.06465
作者: Kazuki Fujii,Kohei Watanabe,Rio Yokota
关键词-EN: including Tensor Parallelism, large language model, including Tensor, distribute model parameters, Pipeline Parallelism
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:
点击查看摘要
Abstract:In large language model (LLM) training, several parallelization strategies, including Tensor Parallelism (TP), Pipeline Parallelism (PP), Data Parallelism (DP), as well as Sequence Parallelism (SP) and Context Parallelism (CP), are employed to distribute model parameters, activations, and optimizer states across devices. Identifying the optimal parallelization configuration for each environment while avoiding GPU memory overflow remains a challenging task. In this study, we provide precise formulas to estimate the memory consumed by parameters, gradients, optimizer states, and activations for 4D parallel training (DP, TP, PP, CP) in the Llama architecture. We conducted 454 experiments on A100 and H100 GPUs, incorporating often neglected factors such as temporary buffers and memory fragmentation into our analysis. Results indicate that when the estimated memory usage is below 80% of the available GPU memory, the training never encounters out-of-memory errors. This simple yet effective formula allows us to identify parallelization configurations that could lead to memory overflow in advance, significantly reducing the configuration search space. Additionally, through a comprehensive exploration of optimal configurations in 4D parallelism, our analysis of the 454 experimental results provides empirical insights into optimal 4D parallelism configurations.
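As a back-of-the-envelope companion to the estimator described above (the paper's formulas additionally cover activations, temporary buffers, and fragmentation), a textbook mixed-precision Adam accounting under TP/PP/DP sharding might look like the sketch below; the byte constants and the ZeRO-1-style optimizer sharding are assumptions, not the paper's exact model.

```python
def estimate_params_memory_gb(n_params, tp, pp, dp, zero1=True):
    """Rough per-GPU memory for weights, gradients, and Adam states in a
    bf16/fp32 mixed-precision setup:
      2 B bf16 weights + 2 B bf16 grads      -> sharded across TP * PP
      12 B optimizer states (fp32 master,
        momentum, variance)                  -> additionally sharded across DP
                                                when ZeRO-1-style sharding is on
    Activations (and hence CP) are ignored in this simplified sketch."""
    shard = tp * pp
    weights_and_grads = n_params * (2 + 2) / shard
    optimizer_states = n_params * 12 / shard / (dp if zero1 else 1)
    return (weights_and_grads + optimizer_states) / 1e9

est = estimate_params_memory_gb(n_params=70e9, tp=8, pp=4, dp=4)
print(f"~{est:.1f} GB per GPU; per the paper's 80% rule, stay below 64 GB on an 80 GB A100/H100")
```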
[LG-48] Predictors of disease outbreaks at continental scale in the African region: Insights and predictions with geospatial artificial intelligence using earth observations and routine disease surveillance data
链接: https://arxiv.org/abs/2411.06436
作者: Scott Pezanowski,Etien Luc Koua,Joseph C Okeibunor,Abdou Salam Gueye
关键词-EN: incorporating relevant high-spatial, relevant high-spatial resolution, research adopts computational, maintaining local-level analysis, adopts computational techniques
类目: Machine Learning (cs.LG)
*备注: 15 pages, 3 figures, 7 tables
点击查看摘要
Abstract:Objectives: Our research adopts computational techniques to analyze disease outbreaks weekly over a large geographic area while maintaining local-level analysis by incorporating relevant high-spatial resolution cultural and environmental datasets. The abundance of data about disease outbreaks gives scientists an excellent opportunity to uncover patterns in disease spread and make future predictions. However, data over a sizeable geographic area quickly outpace human cognition. Our study area covers a significant portion of the African continent (about 17,885,000 km²). The data size makes computational analysis vital to assist human decision-makers. Methods: We first applied global and local spatial autocorrelation for malaria, cholera, meningitis, and yellow fever case counts. We then used machine learning to predict the weekly presence of these diseases in the second-level administrative district. Lastly, we used machine learning feature importance methods on the variables that affect spread. Results: Our spatial autocorrelation results show that geographic nearness is critical but varies in effect and space. Moreover, we identified many interesting hot and cold spots and spatial outliers. The machine learning model infers a binary class of cases or none with the best F1 score of 0.96 for malaria. Machine learning feature importance uncovered critical cultural and environmental factors affecting outbreaks and variations between diseases. Conclusions: Our study shows that data analytics and machine learning are vital to understanding and monitoring disease outbreaks locally across vast areas. The speed at which these methods produce insights can be critical during epidemics and emergencies.
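The global spatial autocorrelation step in Methods is typically computed with Moran's I; below is a generic implementation on a toy adjacency structure (the study's actual district weights matrix is not reproduced here).

```python
import numpy as np

def morans_i(values, W):
    """Global Moran's I: values are case counts per district and W is a
    row-standardized spatial weights matrix (1 for adjacent districts,
    0 otherwise, each row divided by its sum). I > 0 indicates spatial
    clustering, I < 0 dispersion, I near 0 randomness."""
    z = values - values.mean()
    n = len(values)
    return (n / W.sum()) * (z @ W @ z) / (z @ z)

# toy 1-D chain of 6 districts with a hotspot on the left
values = np.array([30.0, 25.0, 20.0, 2.0, 1.0, 0.0])
W = np.zeros((6, 6))
for i in range(5):
    W[i, i + 1] = W[i + 1, i] = 1.0
W /= W.sum(axis=1, keepdims=True)
print(f"Moran's I = {morans_i(values, W):.2f}")   # positive: neighbors resemble each other
```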
[LG-49] UniGAD: Unifying Multi-level Graph Anomaly Detection NEURIPS2024
链接: https://arxiv.org/abs/2411.06427
作者: Yiqing Lin,Jianheng Tang,Chenyi Zi,H.Vicky Zhao,Yuan Yao,Jia Li
关键词-EN: Graph Anomaly Detection, Anomaly Detection, aims to identify, identify uncommon, graph-structured data
类目: Machine Learning (cs.LG)
*备注: Accepted by NeurIPS 2024. All codes can be found at this https URL
点击查看摘要
Abstract:Graph Anomaly Detection (GAD) aims to identify uncommon, deviated, or suspicious objects within graph-structured data. Existing methods generally focus on a single graph object type (node, edge, graph, etc.) and often overlook the inherent connections among different object types of graph anomalies. For instance, a money laundering transaction might involve an abnormal account and the broader community it interacts with. To address this, we present UniGAD, the first unified framework for detecting anomalies at node, edge, and graph levels jointly. Specifically, we develop the Maximum Rayleigh Quotient Subgraph Sampler (MRQSampler) that unifies multi-level formats by transferring objects at each level into graph-level tasks on subgraphs. We theoretically prove that MRQSampler maximizes the accumulated spectral energy of subgraphs (i.e., the Rayleigh quotient) to preserve the most significant anomaly information. To further unify multi-level training, we introduce a novel GraphStitch Network to integrate information across different levels, adjust the amount of sharing required at each level, and harmonize conflicting training goals. Comprehensive experiments show that UniGAD outperforms both existing GAD methods specialized for a single task and graph prompt-based approaches for multiple tasks, while also providing robust zero-shot task transferability. All codes can be found at this https URL.
[LG-50] Locally Adaptive One-Class Classifier Fusion with Dynamic ℓp-Norm Constraints for Robust Anomaly Detection
链接: https://arxiv.org/abs/2411.06406
作者: Sepehr Nourmohammadi,Arda Sarp Yenicesu,Ozgur S. Oguz
关键词-EN: one-class classifier fusion, locally adaptive learning, p-norm constraints, paper presents, approach to one-class
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:This paper presents a novel approach to one-class classifier fusion through locally adaptive learning with dynamic $\ell_p$-norm constraints. We introduce a framework that dynamically adjusts fusion weights based on local data characteristics, addressing fundamental challenges in ensemble-based anomaly detection. Our method incorporates an interior-point optimization technique that significantly improves computational efficiency compared to traditional Frank-Wolfe approaches, achieving up to 19-fold speed improvements in complex scenarios. The framework is extensively evaluated on standard UCI benchmark datasets and specialized temporal sequence datasets, demonstrating superior performance across diverse anomaly types. Statistical validation through Skillings-Mack tests confirms our method's significant advantages over existing approaches, with consistent top rankings in both pure and non-pure learning scenarios. The framework's ability to adapt to local data patterns while maintaining computational efficiency makes it particularly valuable for real-time applications where rapid and accurate anomaly detection is crucial.
[LG-51] Local vs. Global Models for Hierarchical Forecasting
链接: https://arxiv.org/abs/2411.06394
作者: Zhao Yingjie,Mahdi Abolghasemi
关键词-EN: presenting significant challenges, Hierarchical time series, time series forecasting, series forecasting plays, Global Forecasting Models
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Hierarchical time series forecasting plays a crucial role in decision-making in various domains while presenting significant challenges for modelling as they involve multiple levels of aggregation, constraints, and availability of information. This study explores the influence of distinct information utilisation on the accuracy of hierarchical forecasts, proposing and evaluating local models and a range of Global Forecasting Models (GFMs). In contrast to local models, which forecast each series independently, we develop GFMs to exploit cross-series and cross-hierarchies information, improving both forecasting performance and computational efficiency. We employ reconciliation methods to ensure coherency in forecasts and use the Mean Absolute Scaled Error (MASE) and Multiple Comparisons with the Best (MCB) tests to assess statistical significance. The findings indicate that GFMs possess significant advantages for hierarchical forecasting, providing more accurate and computationally efficient solutions across different levels in a hierarchy. Two specific GFMs based on LightGBM are introduced, demonstrating superior accuracy and lower model complexity than their counterpart local models and conventional methods such as Exponential Smoothing (ES) and Autoregressive Integrated Moving Average (ARIMA).
[LG-52] Metric Learning for Tag Recommendation: Tackling Data Sparsity and Cold Start Issues
链接: https://arxiv.org/abs/2411.06374
作者: Yuanshuai Luo,Rui Wang,Yaxin Liang,Ankai Liang,Wenyi Liu
关键词-EN: Internet services, part of Internet, personalized recommendation systems, social media, digital information
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:With the rapid growth of digital information, personalized recommendation systems have become an indispensable part of Internet services, especially in the fields of e-commerce, social media, and online entertainment. However, traditional collaborative filtering and content-based recommendation methods have limitations in dealing with data sparsity and cold start problems, especially in the face of large-scale heterogeneous data, which makes it difficult to meet user expectations. This paper proposes a new tag recommendation algorithm based on metric learning, which aims to overcome the challenges of traditional recommendation systems by learning effective distance or similarity metrics to capture the subtle differences between user preferences and item features. Experimental results show that the algorithm outperforms baseline methods including local response metric learning (LRML), collaborative metric learning (CML), and adaptive tensor factorization (ATF) based on adversarial learning on multiple evaluation metrics. It performs especially well in the accuracy of the first few recommended items, while maintaining high robustness and high recommendation accuracy.
[LG-53] Optimized Inference for 1.58-bit LLMs: A Time and Memory-Efficient Algorithm for Binary and Ternary Matrix Multiplication
链接: https://arxiv.org/abs/2411.06360
作者: Mohsen Dehghankar,Mahdi Erfanian,Abolfazl Asudeh
关键词-EN: Large Language Models, Large Language, advanced computational infrastructure, Language Models, success and versatility
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
*备注:
点击查看摘要
Abstract:Despite their tremendous success and versatility, Large Language Models (LLMs) suffer from inference inefficiency while relying on advanced computational infrastructure. To address these challenges and make LLMs more accessible and cost-effective, in this paper, we propose algorithms to improve the inference time and memory efficiency of 1.58-bit LLMs with ternary weight matrices. Particularly focusing on matrix multiplication as the bottleneck operation of inference, we observe that, once trained, the weight matrices of a model no longer change. This allows us to preprocess these matrices and create indices that help reduce the storage requirements by a logarithmic factor while enabling our efficient inference algorithms. Specifically, for an $n \times n$ weight matrix, our efficient algorithm guarantees a time complexity of $O(n^2/\log n)$, a logarithmic factor improvement over the standard $O(n^2)$ vector-matrix multiplication. Besides theoretical analysis, we conduct extensive experiments to evaluate the practical efficiency of our algorithms. Our results confirm the superiority of the approach both with respect to time and memory, as we observed a reduction in inference time up to 29x and memory usage up to 6x.
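The preprocessing idea can be illustrated with the classic table-lookup (Four-Russians-style) scheme, which attains the $O(n^2/\log n)$ bound for binary, and hence ternary, matrices; this toy sketch does not attempt the paper's additional storage compression.

```python
import numpy as np

def block_tables(x, b):
    """For each block of b entries of x, precompute all 2**b subset sums
    (O(2**b) per block via doubling), so a row's dot product with that
    block becomes a single table lookup."""
    tables = []
    for s in range(0, len(x), b):
        sums = np.zeros(1)
        for e in x[s:s + b]:
            sums = np.concatenate([sums, sums + e])
        tables.append(sums)
    return tables

def encode_rows(B, b):
    """Pack each row's bits per block into lookup indices; done once,
    offline, since trained weight matrices no longer change."""
    return [[int(sum(int(v) << j for j, v in enumerate(row[s:s + b])))
             for s in range(0, B.shape[1], b)] for row in B]

def matvec(enc, tables):
    return np.array([sum(t[i] for t, i in zip(tables, row)) for row in enc])

rng = np.random.default_rng(0)
n = 64
b = max(1, int(np.log2(n)))                   # block width ~ log n
W = rng.integers(-1, 2, size=(n, n))          # ternary weights in {-1, 0, 1}
Wp, Wm = (W == 1).astype(int), (W == -1).astype(int)
enc_p, enc_m = encode_rows(Wp, b), encode_rows(Wm, b)
x = rng.normal(size=n)
tables = block_tables(x, b)                   # O(n^2 / log n) work per query vector
y = matvec(enc_p, tables) - matvec(enc_m, tables)
assert np.allclose(y, W @ x)
```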
[LG-54] Client Contribution Normalization for Enhanced Federated Learning
链接: https://arxiv.org/abs/2411.06352
作者: Mayank Kumar Kundalwal,Anurag Saraswat,Ishan Mishra,Deepak Mishra
关键词-EN: substantial communication costs, Mobile devices, traditional centralized machine, centralized machine learning, machine learning models
类目: Machine Learning (cs.LG)
*备注: Accepted at IEEE INDICON 2024
点击查看摘要
Abstract:Mobile devices, including smartphones and laptops, generate decentralized and heterogeneous data, presenting significant challenges for traditional centralized machine learning models due to substantial communication costs and privacy risks. Federated Learning (FL) offers a promising alternative by enabling collaborative training of a global model across decentralized devices without data sharing. However, FL faces challenges due to statistical heterogeneity among clients, where non-independent and identically distributed (non-IID) data impedes model convergence and performance. This paper focuses on data-dependent heterogeneity in FL and proposes a novel approach leveraging mean latent representations extracted from locally trained models. The proposed method normalizes client contributions based on these representations, allowing the central server to estimate and adjust for heterogeneity during aggregation. This normalization enhances the global model’s generalization and mitigates the limitations of conventional federated averaging methods. The main contributions include introducing a normalization scheme using mean latent representations to handle statistical heterogeneity in FL, demonstrating the seamless integration with existing FL algorithms to improve performance in non-IID settings, and validating the approach through extensive experiments on diverse datasets. Results show significant improvements in model accuracy and consistency across skewed distributions. Our experiments with six FL schemes: FedAvg, FedProx, FedBABU, FedNova, SCAFFOLD, and SGDM highlight the robustness of our approach. This research advances FL by providing a practical and computationally efficient solution for statistical heterogeneity, contributing to the development of more reliable and generalized machine learning models.
[LG-55] CRTRE: Causal Rule Generation with Target Trial Emulation Framework
链接: https://arxiv.org/abs/2411.06338
作者: Junda Wang,Weijian Li,Han Wang,Hanjia Lyu,Caroline P. Thirukumaran,Addisu Mesfin,Hong Yu,Jiebo Luo
关键词-EN: gaining increasing attention, increasing attention, biomedical domain, gaining increasing, Causal inference
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Causal inference and model interpretability are gaining increasing attention, particularly in the biomedical domain. Despite recent advances, decorrelating features in nonlinear environments with human-interpretable representations remains underexplored. In this study, we introduce a novel method called causal rule generation with target trial emulation framework (CRTRE), which applies randomized trial design principles to estimate the causal effect of association rules. We then incorporate such association rules for downstream applications such as prediction of disease onsets. Extensive experiments on six healthcare datasets, including synthetic data, real-world disease collections, and MIMIC-III/IV, demonstrate the model's superior performance. Specifically, our method achieved a $\beta$ error of 0.907, outperforming DWR (1.024) and SVM (1.141). On real-world datasets, our model achieved accuracies of 0.789, 0.920, and 0.300 for the Esophageal Cancer, Heart Disease, and Cauda Equina Syndrome prediction tasks, respectively, consistently surpassing baseline models. On the ICD code prediction tasks, it achieved AUC Macro scores of 92.8 on MIMIC-III and 96.7 on MIMIC-IV, outperforming the state-of-the-art models KEPT and MSMN. Expert evaluations further validate the model's effectiveness, causality, and interpretability.
[LG-56] Regret Minimization and Statistical Inference in Online Decision Making with High-dimensional Covariates
链接: https://arxiv.org/abs/2411.06329
作者: Congyuan Duan,Wanteng Ma,Jiashuo Jiang,Dong Xia
关键词-EN: context bandit model, high-dimensional online decision-making, sparse linear context, linear context bandit, investigates regret minimization
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:This paper investigates regret minimization, statistical inference, and their interplay in high-dimensional online decision-making based on the sparse linear context bandit model. We integrate the $\varepsilon$-greedy bandit algorithm for decision-making with a hard thresholding algorithm for estimating sparse bandit parameters and introduce an inference framework based on a debiasing method using inverse propensity weighting. Under a margin condition, our method achieves either $O(T^{1/2})$ regret or classical $O(T^{1/2})$-consistent inference, indicating an unavoidable trade-off between exploration and exploitation. If a diverse covariate condition holds, we demonstrate that a pure-greedy bandit algorithm, i.e., exploration-free, combined with a debiased estimator based on average weighting can simultaneously achieve optimal $O(\log T)$ regret and $O(T^{1/2})$-consistent inference. We also show that a simple sample mean estimator can provide valid inference for the optimal policy's value. Numerical simulations and experiments on Warfarin dosing data validate the effectiveness of our methods.
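A toy version of the decision-making half, with ε-greedy arm selection and a periodically refit, hard-thresholded sparse estimate, is sketched below. Ridge-then-threshold is an assumed stand-in for the paper's hard thresholding estimator, and the debiased inference step is omitted.

```python
import numpy as np

def hard_threshold(v, s):
    """Keep the s largest-magnitude coordinates, zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-s:]
    out[idx] = v[idx]
    return out

def eps_greedy_sparse_bandit(T=2000, d=50, s=3, K=10, eps=0.05, seed=0):
    rng = np.random.default_rng(seed)
    beta = np.zeros(d); beta[:s] = 1.0            # true sparse parameter
    beta_hat = np.zeros(d)
    X_hist, r_hist = [], []
    for t in range(1, T + 1):
        arms = rng.normal(size=(K, d))            # context vectors of K candidate arms
        if rng.random() < eps:                    # explore
            a = rng.integers(K)
        else:                                     # exploit the current sparse estimate
            a = int(np.argmax(arms @ beta_hat))
        X_hist.append(arms[a])
        r_hist.append(arms[a] @ beta + 0.1 * rng.normal())
        if t % 100 == 0:                          # periodic refit + hard thresholding
            X, r = np.array(X_hist), np.array(r_hist)
            ridge = np.linalg.solve(X.T @ X + np.eye(d), X.T @ r)
            beta_hat = hard_threshold(ridge, s)
    return beta_hat

print(np.nonzero(eps_greedy_sparse_bandit())[0])  # should recover the support {0, 1, 2}
```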
[LG-57] Emotion-Aware Interaction Design in Intelligent User Interface Using Multi-Modal Deep Learning
链接: https://arxiv.org/abs/2411.06326
作者: Shiyu Duan,Ziyi Wang,Shixiao Wang,Mengmeng Chen,Runsheng Zhang
关键词-EN: making technology, emotion recognition, user interface, technology, user interaction
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:In an era where user interaction with technology is ubiquitous, the importance of user interface (UI) design cannot be overstated. A well-designed UI not only enhances usability but also fosters more natural, intuitive, and emotionally engaging experiences, making technology more accessible and impactful in everyday life. This research addresses this growing need by introducing an advanced emotion recognition system to significantly improve the emotional responsiveness of UI. By integrating facial expressions, speech, and textual data through a multi-branch Transformer model, the system interprets complex emotional cues in real-time, enabling UIs to interact more empathetically and effectively with users. Using the public MELD dataset for validation, our model demonstrates substantial improvements in emotion recognition accuracy and F1 scores, outperforming traditional methods. These findings underscore the critical role that sophisticated emotion recognition plays in the evolution of UIs, making technology more attuned to user needs and emotions. This study highlights how enhanced emotional intelligence in UIs is not only about technical innovation but also about fostering deeper, more meaningful connections between users and the digital world, ultimately shaping how people interact with technology in their daily lives.
[LG-58] When are dynamical systems learned from time series data statistically accurate? NEURIPS2024
链接: https://arxiv.org/abs/2411.06311
作者: Jeongjin Park,Nicole Yang,Nisha Chandramoorthy
关键词-EN: Conventional notions, describe the ability, capture meaningful information, fail to describe, Conventional
类目: Machine Learning (cs.LG); Mathematical Physics (math-ph); Dynamical Systems (math.DS); Statistics Theory (math.ST)
*备注: in NeurIPS 2024
点击查看摘要
Abstract:Conventional notions of generalization often fail to describe the ability of learned models to capture meaningful information from dynamical data. A neural network that learns complex dynamics with a small test error may still fail to reproduce its physical behavior, including associated statistical moments and Lyapunov exponents. To address this gap, we propose an ergodic theoretic approach to generalization of complex dynamical models learned from time series data. Our main contribution is to define and analyze generalization of a broad suite of neural representations of classes of ergodic systems, including chaotic systems, in a way that captures emulating underlying invariant, physical measures. Our results provide theoretical justification for why regression methods for generators of dynamical systems (Neural ODEs) fail to generalize, and why their statistical accuracy improves upon adding Jacobian information during training. We verify our results on a number of ergodic chaotic systems and neural network parameterizations, including MLPs, ResNets, Fourier Neural layers, and RNNs.
[LG-59] Intelligent Fault Diagnosis of Type and Severity in Low-Frequency Low Bit-Depth Signals
链接: https://arxiv.org/abs/2411.06299
作者: Tito Spadini,Kenji Nose-Filho,Ricardo Suyama
关键词-EN: Intelligent Fault Diagnosis, Fault Diagnosis, Intelligent Fault, rotating machinery utilizing, focuses on Intelligent
类目: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
*备注:
点击查看摘要
Abstract:This study focuses on Intelligent Fault Diagnosis (IFD) in rotating machinery utilizing a single microphone and a data-driven methodology, effectively diagnosing 42 classes of fault types and severities. The research leverages sound data from the imbalanced MaFaulDa dataset, aiming to strike a balance between high performance and low resource consumption. The testing phase encompassed a variety of configurations, including sampling, quantization, signal normalization, silence removal, Wiener filtering, data scaling, windowing, augmentation, and classifier tuning using XGBoost. Through the analysis of time, frequency, mel-frequency, and statistical features, we achieved an impressive accuracy of 99.54% and an F-Beta score of 99.52% with just 6 boosting trees at an 8 kHz, 8-bit configuration. Moreover, when utilizing only MFCCs along with their first- and second-order deltas, we recorded an accuracy of 97.83% and an F-Beta score of 97.67%. Lastly, by implementing a greedy wrapper approach, we obtained a remarkable accuracy of 96.82% and an F-Beta score of 98.86% using 50 selected features, nearly all of which were first- and second-order deltas of the MFCCs.
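The MFCC-plus-deltas feature configuration is easy to reproduce in outline. The sketch below uses synthetic signals in place of MaFaulDa recordings and assumes librosa and xgboost are installed; only the 8 kHz rate and the 6 boosting trees are taken from the abstract, while the 8-bit quantization and the tuning pipeline are omitted.

```python
import numpy as np
import librosa
from xgboost import XGBClassifier

def mfcc_with_deltas(y, sr=8000, n_mfcc=13):
    """MFCCs plus first- and second-order deltas, summarized by the
    per-coefficient mean: one fixed-length vector per recording."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    feats = np.concatenate([mfcc,
                            librosa.feature.delta(mfcc),
                            librosa.feature.delta(mfcc, order=2)])
    return feats.mean(axis=1)

# synthetic stand-ins for microphone recordings (2 s at 8 kHz); real use
# would load MaFaulDa audio instead, e.g. y, _ = librosa.load(path, sr=8000)
rng = np.random.default_rng(0)
t = np.arange(16000) / 8000
healthy = [np.sin(2 * np.pi * 60 * t) + 0.1 * rng.normal(size=t.size) for _ in range(8)]
faulty = [np.sin(2 * np.pi * 60 * t) ** 3 + 0.4 * rng.normal(size=t.size) for _ in range(8)]
X = np.stack([mfcc_with_deltas(y) for y in healthy + faulty])
labels = [0] * 8 + [1] * 8
clf = XGBClassifier(n_estimators=6, max_depth=6)   # 6 boosting trees, as in the paper
clf.fit(X, labels)
print(clf.predict(X[:2]))
```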
[LG-60] TinyML NLP Approach for Semantic Wireless Sentiment Classification
链接: https://arxiv.org/abs/2411.06291
作者: Ahmed Y. Radwan,Mohammad Shehab,Mohamed-Slim Alouini
关键词-EN: Natural Language Processing, Natural Language, semantic sentiment analysis, impair users’ privacy, Language Processing
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Information Theory (cs.IT)
*备注: Submitted for WCNC-2025, Under Review
点击查看摘要
Abstract:Natural Language Processing (NLP) operations, such as semantic sentiment analysis and text synthesis, may often impair users' privacy and demand significant on-device computational resources. Centralized learning (CL) on the edge offers an alternative energy-efficient approach, yet requires the collection of raw information, which affects the user's privacy. While Federated learning (FL) preserves privacy, it requires high computational energy on board tiny user devices. We introduce split learning (SL) as an energy-efficient alternative, privacy-preserving tiny machine learning (TinyML) scheme and compare it to FL and CL in the presence of Rayleigh fading and additive noise. Our results show that SL reduces processing power and CO2 emissions while maintaining high accuracy, whereas FL offers a balanced compromise between efficiency and privacy. Hence, this study provides insights into deploying energy-efficient, privacy-preserving NLP models on edge devices.
[LG-61] SPIKANs: Separable Physics-Informed Kolmogorov-Arnold Networks
链接: https://arxiv.org/abs/2411.06286
作者: Bruno Jacob,Amanda A. Howard,Panos Stinis
关键词-EN: partial differential equations, neural network structures, solving partial differential, alternative neural network, Physics-Informed Neural Networks
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:
点击查看摘要
Abstract:Physics-Informed Neural Networks (PINNs) have emerged as a promising method for solving partial differential equations (PDEs) in scientific computing. While PINNs typically use multilayer perceptrons (MLPs) as their underlying architecture, recent advancements have explored alternative neural network structures. One such innovation is the Kolmogorov-Arnold Network (KAN), which has demonstrated benefits over traditional MLPs, including faster neural scaling and better interpretability. The application of KANs to physics-informed learning has led to the development of Physics-Informed KANs (PIKANs), enabling the use of KANs to solve PDEs. However, despite their advantages, KANs often suffer from slower training speeds, particularly in higher-dimensional problems where the number of collocation points grows exponentially with the dimensionality of the system. To address this challenge, we introduce Separable Physics-Informed Kolmogorov-Arnold Networks (SPIKANs). This novel architecture applies the principle of separation of variables to PIKANs, decomposing the problem such that each dimension is handled by an individual KAN. This approach drastically reduces the computational complexity of training without sacrificing accuracy, facilitating their application to higher-dimensional PDEs. Through a series of benchmark problems, we demonstrate the effectiveness of SPIKANs, showcasing their superior scalability and performance compared to PIKANs and highlighting their potential for solving complex, high-dimensional PDEs in scientific computing.
[LG-62] A Natural Primal-Dual Hybrid Gradient Method for Adversarial Neural Network Training on Solving Partial Differential Equations
链接: https://arxiv.org/abs/2411.06278
作者: Shu Liu,Stanley Osher,Wuchen Li
关键词-EN: primal-dual hybrid gradient, scalable preconditioned primal-dual, preconditioned primal-dual hybrid, hybrid gradient algorithm, partial differential equations
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
点击查看摘要
Abstract:We propose a scalable preconditioned primal-dual hybrid gradient algorithm for solving partial differential equations (PDEs). We multiply the PDE with a dual test function to obtain an inf-sup problem whose loss functional involves lower-order differential operators. The Primal-Dual Hybrid Gradient (PDHG) algorithm is then leveraged for this saddle point problem. By introducing suitable precondition operators to the proximal steps in the PDHG algorithm, we obtain an alternative natural gradient ascent-descent optimization scheme for updating the neural network parameters. We apply the Krylov subspace method (MINRES) to evaluate the natural gradients efficiently. Such treatment readily handles the inversion of precondition matrices via matrix-vector multiplication. A posterior convergence analysis is established for the time-continuous version of the proposed method. The algorithm is tested on various types of PDEs with dimensions ranging from 1 to 50, including linear and nonlinear elliptic equations, reaction-diffusion equations, and Monge-Ampère equations stemming from $L^2$ optimal transport problems. We compare the performance of the proposed method with several commonly used deep learning algorithms such as physics-informed neural networks (PINNs), the DeepRitz method, weak adversarial networks (WANs), etc., for solving PDEs using the Adam and L-BFGS optimizers. The numerical results suggest that the proposed method performs efficiently and robustly and converges more stably.
[LG-63] Constraints and Variables Reduction for Optimal Power Flow Using Hierarchical Graph Neural Networks with Virtual Node-Splitting
链接: https://arxiv.org/abs/2411.06268
作者: Thuan Phamh,Xingpeng Li
关键词-EN: graph neural network, Power system networks, capture individual generator, individual generator features, system networks
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Power system networks are often modeled as homogeneous graphs, which limits the ability of graph neural networks (GNNs) to capture individual generator features at the same nodes. By introducing the proposed virtual node-splitting strategy, generator-level attributes like costs, limits, and ramp rates can be fully captured by GNN models, improving GNN's learning capacity and prediction accuracy. The optimal power flow (OPF) problem is used for real-time grid operations. The limited timeframe motivates studies to create size-reduced OPF (ROPF) models to relieve the computational complexity. In this paper, with virtual node-splitting, a novel two-stage adaptive hierarchical GNN is developed to (i) predict critical lines that would be congested, and then (ii) predict base generators that would operate at the maximum capacity. This will substantially reduce the constraints and variables needed for OPF, creating the proposed ROPFLG model with reduced monitor lines and reduced generator-specific variables and constraints. Two ROPF models, ROPFL and ROPFG, with just reduced lines or generators respectively, are also implemented as additional benchmark models. Case studies show that the proposed ROPFLG consistently outperforms the benchmark full OPF (FOPF) and the other two ROPF methods, achieving significant computational time savings while reliably finding optimal solutions.
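A minimal sketch of the virtual node-splitting idea using networkx, with hypothetical generator attribute names (cost, pmax, ramp); the paper's hierarchical GNN itself is not shown.

```python
import networkx as nx

def split_generators(grid, gens):
    """Give every generator at a bus its own virtual node carrying
    generator-level attributes, connected to the physical bus, so a GNN
    sees co-located generators as distinct nodes."""
    h = grid.copy()
    for bus, units in gens.items():
        for name, attrs in units.items():
            v = f"{bus}::{name}"
            h.add_node(v, kind="generator", **attrs)
            h.add_edge(v, bus)
    return h

grid = nx.Graph([("bus1", "bus2"), ("bus2", "bus3")])
gens = {"bus1": {"G1": {"cost": 20.0, "pmax": 100.0, "ramp": 30.0},
                 "G2": {"cost": 35.0, "pmax": 50.0, "ramp": 10.0}}}
h = split_generators(grid, gens)
print(sorted(h.nodes))   # bus1's two generators become separate graph nodes
```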
[LG-64] Towards Establishing Guaranteed Error for Learned Database Operations ICLR'24
链接: https://arxiv.org/abs/2411.06243
作者: Sepanta Zeighami,Cyrus Shahabi
关键词-EN: estimating aggregate attribute, demonstrated substantial performance, substantial performance enhancements, fundamental data management, estimating aggregate
类目: Databases (cs.DB); Machine Learning (cs.LG)
*备注: Appeared in ICLR’24
点击查看摘要
Abstract:Machine learning models have demonstrated substantial performance enhancements over non-learned alternatives in various fundamental data management operations, including indexing (locating items in an array), cardinality estimation (estimating the number of matching records in a database), and range-sum estimation (estimating aggregate attribute values for query-matched records). However, real-world systems frequently favor less efficient non-learned methods due to their ability to offer (worst-case) error guarantees - an aspect where learned approaches often fall short. The primary objective of these guarantees is to ensure system reliability, ensuring that the chosen approach consistently delivers the desired level of accuracy across all databases. In this paper, we embark on the first theoretical study of such guarantees for learned methods, presenting the necessary conditions for such guarantees to hold when using machine learning to perform indexing, cardinality estimation and range-sum estimation. Specifically, we present the first known lower bounds on the model size required to achieve the desired accuracy for these three key database operations. Our results bound the required model size for given average and worst-case errors in performing database operations, serving as the first theoretical guidelines governing how model size must change based on data size to be able to guarantee an accuracy level. More broadly, our established guarantees pave the way for the broader adoption and integration of learned models into real-world systems.
[LG-65] Theoretical Analysis of Learned Database Operations under Distribution Shift through Distribution Learnability ICML'24
链接: https://arxiv.org/abs/2411.06241
作者: Sepanta Zeighami,Cyrus Shahabi
关键词-EN: substantial performance benefits, cardinality estimation, learned models, provide substantial performance, machine learning
类目: Machine Learning (cs.LG); Databases (cs.DB)
*备注: Appeared in ICML’24 (oral)
点击查看摘要
Abstract:Use of machine learning to perform database operations, such as indexing, cardinality estimation, and sorting, is shown to provide substantial performance benefits. However, when datasets change and data distribution shifts, empirical results also show performance degradation for learned models, possibly to worse than non-learned alternatives. This, together with a lack of theoretical understanding of learned methods, undermines their practical applicability, since there are no guarantees on how well the models will perform after deployment. In this paper, we present the first known theoretical characterization of the performance of learned models in dynamic datasets, for the aforementioned operations. Our results show novel theoretical characteristics achievable by learned models and provide bounds on the performance of the models that characterize their advantages over non-learned methods, showing why and when learned models can outperform the alternatives. Our analysis develops the distribution learnability framework and novel theoretical tools which build the foundation for the analysis of learned database operations in the future.
[LG-66] Web Scale Graph Mining for Cyber Threat Intelligence
链接: https://arxiv.org/abs/2411.06239
作者: Scott Freitas,Amir Gharib
关键词-EN: cyberattacks demands accurate, today increasingly sophisticated, large-scale cyberattacks demands, Defending against today, threat intelligence
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注:
点击查看摘要
Abstract:Defending against today’s increasingly sophisticated and large-scale cyberattacks demands accurate, real-time threat intelligence. Traditional approaches struggle to scale, integrate diverse telemetry, and adapt to a constantly evolving security landscape. We introduce Threat Intelligence Tracking via Adaptive Networks (TITAN), an industry-scale graph mining framework that generates cyber threat intelligence at unprecedented speed and scale. TITAN introduces a suite of innovations specifically designed to address the complexities of the modern security landscape, including: (1) a dynamic threat intelligence graph that maps the intricate relationships between millions of entities, incidents, and organizations; (2) real-time update mechanisms that automatically decay and prune outdated intel; (3) integration of security domain knowledge to bootstrap initial reputation scores; and (4) reputation propagation algorithms that uncover hidden threat actor infrastructure. Integrated into Microsoft Unified Security Operations Platform (USOP), which is deployed across hundreds of thousands of organizations worldwide, TITAN’s threat intelligence powers key detection and disruption capabilities. With an impressive average macro-F1 score of 0.89 and a precision-recall AUC of 0.94, TITAN identifies millions of high-risk entities each week, enabling a 6x increase in non-file threat intelligence. Since its deployment, TITAN has increased the product’s incident disruption rate by a remarkable 21%, while reducing the time to disrupt by a factor of 1.9x, and maintaining 99% precision, as confirmed by customer feedback and thorough manual evaluation by security experts–ultimately saving customers from costly security breaches.
[LG-67] Leveraging Retrieval-Augmented Generation for University Knowledge Retrieval
链接: https://arxiv.org/abs/2411.06237
作者: Arshia Hemmat,Kianoosh Vadaei,Mohammad Hassan Heydari,Afsaneh Fatemi
关键词-EN: Large Language Models, university-related question answering, Language Models, Large Language, Retrieval-Augmented Generation
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: 6 pages, 2 figures, 1 table, Submitted to 15th IKT conference
点击查看摘要
Abstract:This paper introduces an innovative approach using Retrieval-Augmented Generation (RAG) pipelines with Large Language Models (LLMs) to enhance information retrieval and query response systems for university-related question answering. By systematically extracting data from the university official webpage and employing advanced prompt engineering techniques, we generate accurate, contextually relevant responses to user queries. We developed a comprehensive university benchmark, UniversityQuestionBench (UQB), to rigorously evaluate our system's performance, based on common key metrics in the field of RAG pipelines, assessing accuracy and reliability through various metrics and real-world scenarios. Our experimental results demonstrate significant improvements in the precision and relevance of generated responses, enhancing user experience and reducing the time required to obtain relevant answers. In summary, this paper presents a novel application of RAG pipelines and LLMs, supported by a meticulously prepared university benchmark, offering valuable insights into advanced AI techniques for academic data retrieval and setting the stage for future research in this domain.
[LG-68] Early Prediction of Natural Gas Pipeline Leaks Using the MKTCN Model
链接: https://arxiv.org/abs/2411.06214
作者: Xuguang Li,Zhonglin Zuo,Zheng Dong,Yang Yang
关键词-EN: Natural gas pipeline, substantial economic losses, Natural gas, pose severe risks, gas pipeline leaks
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 12 pages, 6 figures
点击查看摘要
Abstract:Natural gas pipeline leaks pose severe risks, leading to substantial economic losses and potential hazards to human safety. In this study, we develop an accurate model for the early prediction of pipeline leaks. To the best of our knowledge, unlike previous anomaly detection work, this is the first application of internal pipeline data to the early prediction of leaks. The modeling process addresses two main challenges: long-term dependencies and sample imbalance. First, we introduce a dilated convolution-based prediction model to capture long-term dependencies, as dilated convolution expands the model's receptive field without added computational cost. Second, to mitigate sample imbalance, we propose the MKTCN model, which incorporates the Kolmogorov-Arnold Network as the fully connected layer in a dilated convolution model, enhancing network generalization. Finally, we validate the MKTCN model through extensive experiments on two real-world datasets. Results demonstrate that MKTCN outperforms baseline models in generalization and classification, particularly under severe data imbalance, and effectively predicts leaks up to 5000 seconds in advance. Overall, the MKTCN model represents a significant advancement in early pipeline leak prediction, providing robust generalization and improved modeling of the long-term dependencies inherent in multi-dimensional time-series data.
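To illustrate the receptive-field argument for dilated convolutions, here is a generic causal dilated block in PyTorch; the MKTCN's Kolmogorov-Arnold output layer and its imbalance handling are not reproduced.

```python
import torch
import torch.nn as nn

class DilatedBlock(nn.Module):
    """Stack of 1-D causal convolutions with dilations 1, 2, 4, ...:
    the receptive field grows exponentially with depth at constant
    per-layer cost, which is how long-term dependencies are captured."""
    def __init__(self, channels=32, levels=6, kernel=3):
        super().__init__()
        layers = []
        for i in range(levels):
            d = 2 ** i
            layers += [nn.ConstantPad1d(((kernel - 1) * d, 0), 0.0),  # causal left-padding
                       nn.Conv1d(channels, channels, kernel, dilation=d),
                       nn.ReLU()]
        self.net = nn.Sequential(*layers)

    def forward(self, x):            # x: (batch, channels, time)
        return self.net(x)

block = DilatedBlock()
receptive = 1 + sum((3 - 1) * 2 ** i for i in range(6))   # = 127 time steps
x = torch.randn(1, 32, 500)
print(block(x).shape, "receptive field:", receptive)
```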
[LG-69] Advanced Wildfire Prediction in Morocco: Developing a Deep Learning Dataset from Multisource Observations
链接: https://arxiv.org/abs/2411.06202
作者: Ayoub Jadouli,Chaker El Amrani
关键词-EN: pose significant threats, Wildfires pose significant, necessitating advanced predictive, advanced predictive methods, threats to ecosystems
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Wildfires pose significant threats to ecosystems, economies, and communities worldwide, necessitating advanced predictive methods for effective mitigation. This study introduces a novel and comprehensive dataset specifically designed for wildfire prediction in Morocco, addressing its unique geographical and climatic challenges. By integrating satellite observations and ground station data, we compile essential environmental indicators such as vegetation health (NDVI), population density, soil moisture levels, and meteorological data aimed at predicting next-day wildfire occurrences with high accuracy. Our methodology incorporates state-of-the-art machine learning and deep learning algorithms, demonstrating superior performance in capturing wildfire dynamics compared to traditional models. Preliminary results show that models using this dataset achieve an accuracy of up to 90%, significantly improving prediction capabilities. The public availability of this dataset fosters scientific collaboration, aiming to refine predictive models and develop innovative wildfire management strategies. Our work not only advances the technical field of dataset creation but also emphasizes the necessity for localized research in underrepresented regions, providing a scalable model for other areas facing similar environmental challenges.
[LG-70] Weak to Strong Learning from Aggregate Labels
链接: https://arxiv.org/abs/2411.06200
作者: Yukti Makhija,Rishi Saket
关键词-EN: aggregate labels, weak learner, strong learner, training data consists, aggregate
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
*备注: 19 pages
点击查看摘要
Abstract:In learning from aggregate labels, the training data consists of sets or "bags" of feature-vectors (instances) along with an aggregate label for each bag derived from the (usually 0,1-valued) labels of its instances. In learning from label proportions (LLP), the aggregate label is the average of the bag's instance labels, whereas in multiple instance learning (MIL) it is the OR. The goal is to train an instance-level predictor, typically achieved by fitting a model on the training data, in particular one that maximizes the accuracy which is the fraction of satisfied bags i.e., those on which the predicted labels are consistent with the aggregate label. A weak learner has a constant accuracy < 1 on the training bags, while a strong learner's accuracy can be arbitrarily close to 1. We study the problem of using a weak learner on such training bags with aggregate labels to obtain a strong learner, analogous to supervised learning for which boosting algorithms are known. Our first result shows the impossibility of boosting in LLP using weak classifiers of any accuracy < 1 by constructing a collection of bags for which such weak learners (for any weight assignment) exist, while not admitting any strong learner. A variant of this construction also rules out boosting in MIL for a non-trivial range of weak learner accuracy. In the LLP setting however, we show that a weak learner (with small accuracy) on large enough bags can in fact be used to obtain a strong learner for small bags, in polynomial time. We also provide a more efficient, sampling-based variant of our procedure with probabilistic guarantees which are empirically validated on three real and two synthetic datasets. Our work is the first to theoretically study weak to strong learning from aggregate labels, with an algorithm to achieve the same for LLP, while proving the impossibility of boosting for both LLP and MIL.
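【代码示例】下面用 numpy 演示摘要中"满足的包"(satisfied bags)准确率的计算方式:LLP 的聚合标签取包内实例标签的均值,MIL 取 OR;仅为概念示意。

```python
import numpy as np

def bag_accuracy(bags_pred, bags_label, mode="llp"):
    """bags_pred: 每个包内实例的 0/1 预测列表; bags_label: 对应的聚合标签。"""
    satisfied = 0
    for pred, label in zip(bags_pred, bags_label):
        pred = np.asarray(pred)
        agg = pred.mean() if mode == "llp" else int(pred.any())  # LLP 取均值, MIL 取 OR
        satisfied += np.isclose(agg, label)
    return satisfied / len(bags_pred)

bags = [[1, 0, 1, 0], [0, 0, 0], [1, 1]]
print(bag_accuracy(bags, [0.5, 0.0, 1.0], mode="llp"))  # 1.0
print(bag_accuracy(bags, [1, 0, 1], mode="mil"))        # 1.0
```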
[LG-71] State Chrono Representation for Enhancing Generalization in Reinforcement Learning
链接: https://arxiv.org/abs/2411.06174
作者: Jianda Chen,Wen Zheng Terence Ng,Zichen Chen,Sinno Jialin Pan,Tianwei Zhang
关键词-EN: image-based inputs, crucial to establish, establish a robust, robust and generalizable, generalizable state representation
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注:
点击查看摘要
Abstract:In reinforcement learning with image-based inputs, it is crucial to establish a robust and generalizable state representation. Recent advancements in metric learning, such as deep bisimulation metric approaches, have shown promising results in learning structured low-dimensional representation space from pixel observations, where the distance between states is measured based on task-relevant features. However, these approaches face challenges in demanding generalization tasks and scenarios with non-informative rewards. This is because they fail to capture sufficient long-term information in the learned representations. To address these challenges, we propose a novel State Chrono Representation (SCR) approach. SCR augments state metric-based representations by incorporating extensive temporal information into the update step of bisimulation metric learning. It learns state distances within a temporal framework that considers both future dynamics and cumulative rewards over current and long-term future states. Our learning strategy effectively incorporates future behavioral information into the representation space without introducing a significant number of additional parameters for modeling dynamics. Extensive experiments conducted in DeepMind Control and Meta-World environments demonstrate that SCR achieves better performance comparing to other recent metric-based methods in demanding generalization tasks. The codes of SCR are available in this https URL.
[LG-72] HiHa: Introducing Hierarchical Harmonic Decomposition to Implicit Neural Compression for Atmospheric Data
链接: https://arxiv.org/abs/2411.06155
作者: Zhewen Xu,Baoxiang Pan,Hongliang Li,Xiaohui Wei
关键词-EN: transferring massive atmospheric, large climate models, rapid development, development of large, large climate
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注:
点击查看摘要
Abstract:The rapid development of large climate models has created the requirement of storing and transferring massive atmospheric data worldwide. Therefore, data compression is essential for meteorological research, but an efficient compression scheme capable of keeping high accuracy with high compressibility is still lacking. As an emerging technique, Implicit Neural Representation (INR) has recently acquired impressive momentum and demonstrates high promise for compressing diverse natural data. However, INR-based compression encounters a bottleneck due to the sophisticated spatio-temporal properties and variability. To address this issue, we propose Hierarchical Harmonic decomposition implicit neural compression (HiHa) for atmospheric data. HiHa first segments the data into multi-frequency signals through decomposition into multiple complex harmonics, and then tackles each harmonic respectively with a frequency-based hierarchical compression module consisting of sparse storage, multi-scale INR and iterative decomposition sub-modules. We additionally design a temporal residual compression module to accelerate compression by utilizing temporal continuity. Experiments show that HiHa outperforms both mainstream compressors and other INR-based methods in both compression fidelity and capabilities, and also demonstrate that using compressed data in existing data-driven models can achieve the same accuracy as raw data.
[LG-73] Mutual-energy inner product optimization method for constructing feature coordinates and image classification in Machine Learning
链接: https://arxiv.org/abs/2411.06100
作者: Yuanxiu Wang
关键词-EN: suitable coordinate system, mutual-energy inner product, machine learning, classes of samples, feature coordinate system
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 23 pages,5 figures
点击查看摘要
Abstract:As a key task in machine learning, data classification is essentially to find a suitable coordinate system to represent data features of different classes of samples. This paper proposes the mutual-energy inner product optimization method for constructing a feature coordinate system. First, by analyzing the solution space and eigenfunctions of partial differential equations describing a non-uniform membrane, the mutual-energy inner product is defined. Second, by expressing the mutual-energy inner product as a series of eigenfunctions, it shows a significant advantage of enhancing low-frequency features and suppressing high-frequency noise, compared with the Euclidean inner product. Then, a mutual-energy inner product optimization model is built to extract data features, and the convexity and concavity properties of its objective function are discussed. Next, by combining the finite element method, a stable and efficient sequential linearization algorithm is constructed to solve the optimization model. This algorithm only solves equations including a positive definite symmetric matrix and linear programming with a few constraints, and its vectorized implementation is discussed. Finally, the mutual-energy inner product optimization method is used to construct feature coordinates, and multi-class Gaussian classifiers are trained on the MNIST training set. Good prediction results of the Gaussian classifiers are achieved on the MNIST test set.
[LG-74] Concept Bottleneck Language Models For protein design
链接: https://arxiv.org/abs/2411.06090
作者: Aya Abdelsalam Ismail,Tuomas Oikarinen,Amy Wang,Julius Adebayo,Samuel Stanton,Taylor Joren,Joseph Kleinhenz,Allen Goodman,Héctor Corrada Bravo,Kyunghyun Cho,Nathan C. Frey
关键词-EN: introduce Concept Bottleneck, Protein Language Models, Concept Bottleneck Protein, Concept Bottleneck Models, Bottleneck Protein Language
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We introduce Concept Bottleneck Protein Language Models (CB-pLM), a generative masked language model with a layer where each neuron corresponds to an interpretable concept. Our architecture offers three key benefits: i) Control: We can intervene on concept values to precisely control the properties of generated proteins, achieving a 3 times larger change in desired concept values compared to baselines. ii) Interpretability: A linear mapping between concept values and predicted tokens allows transparent analysis of the model’s decision-making process. iii) Debugging: This transparency facilitates easy debugging of trained models. Our models achieve pre-training perplexity and downstream task performance comparable to traditional masked protein language models, demonstrating that interpretability does not compromise performance. While adaptable to any language model, we focus on masked protein language models due to their importance in drug discovery and the ability to validate our model’s capabilities through real-world experiments and expert knowledge. We scale our CB-pLM from 24 million to 3 billion parameters, making them the largest Concept Bottleneck Models trained and the first capable of generative language modeling.
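【代码示例】概念瓶颈层的最小 PyTorch 草图:瓶颈层每个神经元对应一个概念,概念值到 token logits 为线性映射以保证可解释性,推理时可直接改写概念值实现干预;属示意实现,维度等均为假设,非 CB-pLM 官方代码。

```python
import torch
import torch.nn as nn

class ConceptBottleneck(nn.Module):
    def __init__(self, d_model=64, n_concepts=8, vocab_size=30):
        super().__init__()
        self.to_concepts = nn.Linear(d_model, n_concepts)   # 隐状态 -> 概念值
        self.to_tokens = nn.Linear(n_concepts, vocab_size)  # 线性映射保证可解释

    def forward(self, h, intervene=None):
        c = self.to_concepts(h)              # c[..., i] 即第 i 个概念的取值
        if intervene is not None:            # 干预:直接改写指定概念的值
            c = c.clone()
            for idx, value in intervene.items():
                c[..., idx] = value
        return self.to_tokens(c), c

model = ConceptBottleneck()
h = torch.randn(4, 10, 64)                   # (batch, seq, d_model) 假设的隐状态
logits, concepts = model(h, intervene={0: 2.0})  # 把概念 0 固定为 2.0 以控制生成属性
print(logits.shape, concepts.shape)
```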
[LG-75] A Survey on Kolmogorov-Arnold Network
链接: https://arxiv.org/abs/2411.06078
作者: Shriyank Somvanshi,Syed Aaqib Javed,Md Monzurul Islam,Diwas Pandit,Subasish Das
关键词-EN: Kolmogorov-Arnold representation theorem, systematic review explores, network model inspired, theoretical foundations, Kolmogorov-Arnold Networks
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:This systematic review explores the theoretical foundations, evolution, applications, and future potential of Kolmogorov-Arnold Networks (KAN), a neural network model inspired by the Kolmogorov-Arnold representation theorem. KANs distinguish themselves from traditional neural networks by using learnable, spline-parameterized functions instead of fixed activation functions, allowing for flexible and interpretable representations of high-dimensional functions. This review details KAN’s architectural strengths, including adaptive edge-based activation functions that improve parameter efficiency and scalability in applications such as time series forecasting, computational biomedicine, and graph learning. Key advancements, including Temporal-KAN, FastKAN, and Partial Differential Equation (PDE) KAN, illustrate KAN’s growing applicability in dynamic environments, enhancing interpretability, computational efficiency, and adaptability for complex function approximation tasks. Additionally, this paper discusses KAN’s integration with other architectures, such as convolutional, recurrent, and transformer-based models, showcasing its versatility in complementing established neural networks for tasks requiring hybrid approaches. Despite its strengths, KAN faces computational challenges in high-dimensional and noisy data settings, motivating ongoing research into optimization strategies, regularization techniques, and hybrid models. This paper highlights KAN’s role in modern neural architectures and outlines future directions to improve its computational efficiency, interpretability, and scalability in data-intensive applications.
[LG-76] Model Selection for Average Reward RL with Application to Utility Maximization in Repeated Games
链接: https://arxiv.org/abs/2411.06069
作者: Alireza Masoumian,James R. Wright
关键词-EN: Markov Decision Process, Markov Decision, Decision Process, Process whose structure, online model selection
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:In standard RL, a learner attempts to learn an optimal policy for a Markov Decision Process whose structure (e.g. state space) is known. In online model selection, a learner attempts to learn an optimal policy for an MDP knowing only that it belongs to one of M > 1 model classes of varying complexity. Recent results have shown that this can be feasibly accomplished in episodic online RL. In this work, we propose \mathsf{MRBEAR}, an online model selection algorithm for the average reward RL setting. The regret of the algorithm is in \tilde O(M C_{m^*}^2 \mathsf{B}_{m^*}(T,\delta)) where C_{m^*} represents the complexity of the simplest well-specified model class and \mathsf{B}_{m^*}(T,\delta) is its corresponding regret bound. This result shows that in average reward RL, like the episodic online RL, the additional cost of model selection scales only linearly in M, the number of model classes. We apply \mathsf{MRBEAR} to the interaction between a learner and an opponent in a two-player simultaneous general-sum repeated game, where the opponent follows a fixed unknown limited memory strategy. The learner's goal is to maximize its utility without knowing the opponent's utility function. The interaction is over T rounds with no episode or discounting which leads us to measure the learner's performance by average reward regret. In this application, our algorithm enjoys an opponent-complexity-dependent regret in \tilde O(M(\mathsf{sp}(h^*) B^{m^*} A^{m^*+1})^{3/2} \sqrt{T}), where m^* \le M is the unknown memory limit of the opponent, \mathsf{sp}(h^*) is the unknown span of optimal bias induced by the opponent, and A and B are the number of actions for the learner and opponent respectively. We also show that the exponential dependency on m^* is inevitable by proving a lower bound on the learner's regret.
[LG-77] Learning Mixtures of Experts with EM
链接: https://arxiv.org/abs/2411.06056
作者: Quentin Fruytier,Aryan Mokhtari,Sujay Sanghavi
关键词-EN: Machine Learning models, Machine Learning, Learning models, input space, model trained
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Mixtures of Experts (MoE) are Machine Learning models that involve partitioning the input space, with a separate “expert” model trained on each partition. Recently, MoE have become popular as components in today’s large language models as a means to reduce training and inference costs. There, the partitioning function and the experts are both learnt jointly via gradient descent on the log-likelihood. In this paper we focus on studying the efficiency of the Expectation Maximization (EM) algorithm for the training of MoE models. We first rigorously analyze EM for the cases of linear or logistic experts, where we show that EM is equivalent to Mirror Descent with unit step size and a Kullback-Leibler Divergence regularizer. This perspective allows us to derive new convergence results and identify conditions for local linear convergence based on the signal-to-noise ratio (SNR). Experiments on synthetic and (small-scale) real-world data show that EM outperforms the gradient descent algorithm both in terms of convergence rate and the achieved accuracy.
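【代码示例】下面用 numpy 实现两组件线性专家混合的 EM(简化设定:等方差高斯噪声且方差固定、混合权重代替门控网络),演示 E 步责任度与 M 步加权最小二乘;仅为示意,非论文的完整分析对象。

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 2
X = rng.normal(size=(n, d))
true_w = np.array([[2.0, -1.0], [-3.0, 0.5]])
z = rng.integers(0, 2, n)                                  # 真实专家分配
y = np.einsum('nd,nd->n', X, true_w[z]) + 0.1 * rng.normal(size=n)

W = rng.normal(size=(2, d))                                # 初始化专家参数
pi, sigma2 = np.array([0.5, 0.5]), 1.0                     # 简化:噪声方差固定
for _ in range(50):
    # E 步:各专家的责任度(高斯对数似然 + 混合权重,按行归一化)
    resid = y[:, None] - X @ W.T                           # (n, 2)
    logp = -0.5 * resid**2 / sigma2 + np.log(pi)
    logp -= logp.max(axis=1, keepdims=True)
    r = np.exp(logp); r /= r.sum(axis=1, keepdims=True)
    # M 步:对每个专家做责任度加权的最小二乘
    for k in range(2):
        Xw = X * r[:, [k]]
        W[k] = np.linalg.solve(Xw.T @ X + 1e-6 * np.eye(d), Xw.T @ y)
    pi = r.mean(axis=0)

print(np.round(W, 2))  # 应接近 true_w(两行顺序可能互换)
```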
[LG-78] Linear Spherical Sliced Optimal Transport: A Fast Metric for Comparing Spherical Data
链接: https://arxiv.org/abs/2411.06055
作者: Xinran Liu,Yikun Bai,Rocío Díaz Martín,Kaiwen Shi,Ashkan Shahbazi,Bennett A. Landman,Catie Chang,Soheil Kolouri
关键词-EN: optimal transport, Sliced optimal transport, spherical sliced Wasserstein, computer vision, important in fields
类目: Machine Learning (cs.LG); Metric Geometry (math.MG)
*备注:
点击查看摘要
Abstract:Efficient comparison of spherical probability distributions becomes important in fields such as computer vision, geosciences, and medicine. Sliced optimal transport distances, such as spherical and stereographic spherical sliced Wasserstein distances, have recently been developed to address this need. These methods reduce the computational burden of optimal transport by slicing hyperspheres into one-dimensional projections, i.e., lines or circles. Concurrently, linear optimal transport has been proposed to embed distributions into L^2 spaces, where the L^2 distance approximates the optimal transport distance, thereby simplifying comparisons across multiple distributions. In this work, we introduce the Linear Spherical Sliced Optimal Transport (LSSOT) framework, which utilizes slicing to embed spherical distributions into L^2 spaces while preserving their intrinsic geometry, offering a computationally efficient metric for spherical probability measures. We establish the metricity of LSSOT and demonstrate its superior computational efficiency in applications such as cortical surface registration, 3D point cloud interpolation via gradient flow, and shape embedding. Our results demonstrate the significant computational benefits and high accuracy of LSSOT in these applications.
[LG-79] Personalized Hierarchical Split Federated Learning in Wireless Networks
链接: https://arxiv.org/abs/2411.06042
作者: Md-Ferdous Pervej,Andreas F. Molisch
关键词-EN: Extreme resource constraints, resource constraints make, constraints make large-scale, make large-scale machine, Extreme resource
类目: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI); Systems and Control (eess.SY)
*备注:
点击查看摘要
Abstract:Extreme resource constraints make large-scale machine learning (ML) with distributed clients challenging in wireless networks. On the one hand, large-scale ML requires massive information exchange between clients and server(s). On the other hand, these clients have limited battery and computation powers that are often dedicated to operational computations. Split federated learning (SFL) is emerging as a potential solution to mitigate these challenges, by splitting the ML model into client-side and server-side model blocks, where only the client-side block is trained on the client device. However, practical applications require personalized models that are suitable for the client's personal task. Motivated by this, we propose a personalized hierarchical split federated learning (PHSFL) algorithm that is specially designed to achieve better personalization performance. More specifically, owing to the fact that regardless of the severity of the statistical data distributions across the clients, many of the features have similar attributes, we only train the body part of the federated learning (FL) model while keeping the (randomly initialized) classifier frozen during the training phase. We first perform extensive theoretical analysis to understand the impact of model splitting and hierarchical model aggregations on the global model. Once the global model is trained, we fine-tune each client classifier to obtain the personalized models. Our empirical findings suggest that while the globally trained model with the untrained classifier performs quite similarly to other existing solutions, the fine-tuned models show significantly improved personalized performance.
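【代码示例】下面的 PyTorch 片段演示 PHSFL 的核心技巧:全局训练阶段冻结随机初始化的分类器、只更新特征提取主体,之后再解冻分类器做个性化微调;为省略了联邦切分与聚合的单机示意。

```python
import torch
import torch.nn as nn

body = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 64))
classifier = nn.Linear(64, 10)

for p in classifier.parameters():      # 全局训练阶段:分类器保持随机初始化并冻结
    p.requires_grad = False

opt = torch.optim.SGD(body.parameters(), lr=0.1)
x, y = torch.randn(16, 32), torch.randint(0, 10, (16,))
loss = nn.functional.cross_entropy(classifier(body(x)), y)
loss.backward()
opt.step()                             # 只有 body 的参数被更新

for p in classifier.parameters():      # 个性化阶段:解冻分类器,在本地数据上微调
    p.requires_grad = True
opt_head = torch.optim.SGD(classifier.parameters(), lr=0.01)
```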
[LG-80] Parallel Multi-path Feed Forward Neural Networks (PMFFNN) for Long Columnar Datasets: A Novel Approach to Complexity Reduction
链接: https://arxiv.org/abs/2411.06020
作者: Ayoub Jadouli,Chaker El Amrani
关键词-EN: Convolutional Neural Networks, one-dimensional Convolutional Neural, Feed-Forward Neural Networks, Neural Networks, Forward Neural Networks
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Traditional Feed-Forward Neural Networks (FFNN) and one-dimensional Convolutional Neural Networks (1D CNN) often encounter difficulties when dealing with long, columnar datasets that contain numerous features. The challenge arises from two primary factors: the large volume of data and the potential absence of meaningful relationships between features. In conventional training, large datasets can overwhelm the model, causing significant portions of the input to remain underutilized. As a result, the model may fail to capture the critical information necessary for effective learning, which leads to diminished performance. To overcome these limitations, we introduce a novel architecture called Parallel Multi-path Feed Forward Neural Networks (PMFFNN). Our approach leverages multiple parallel pathways to process distinct subsets of columns from the input dataset. By doing so, the architecture ensures that each subset of features receives focused attention, which is often neglected in traditional models. This approach maximizes the utilization of feature diversity, ensuring that no critical data sections are overlooked during training. Our architecture offers two key advantages. First, it allows for more effective handling of long, columnar data by distributing the learning task across parallel paths. Second, it reduces the complexity of the model by narrowing the feature scope in each path, which leads to faster training times and improved resource efficiency. The empirical results indicate that PMFFNN outperforms traditional FFNNs and 1D CNNs, providing an optimized solution for managing large-scale data.
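【代码示例】按摘要描述给出 PMFFNN 的结构草图:输入列切分为若干子集,各自经过一条并行 MLP 路径后拼接分类;层宽等超参数为假设值。

```python
import torch
import torch.nn as nn

class PMFFNN(nn.Module):
    def __init__(self, n_features=300, n_paths=3, hidden=64, n_classes=2):
        super().__init__()
        self.chunk = n_features // n_paths       # 假设特征数能被路径数整除
        self.paths = nn.ModuleList(
            nn.Sequential(nn.Linear(self.chunk, hidden), nn.ReLU(),
                          nn.Linear(hidden, hidden), nn.ReLU())
            for _ in range(n_paths)
        )
        self.head = nn.Linear(hidden * n_paths, n_classes)

    def forward(self, x):                        # x: (batch, n_features)
        outs = [path(x[:, i * self.chunk:(i + 1) * self.chunk])
                for i, path in enumerate(self.paths)]
        return self.head(torch.cat(outs, dim=1))  # 各路径输出拼接后分类

print(PMFFNN()(torch.randn(8, 300)).shape)       # torch.Size([8, 2])
```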
[LG-81] Variance-Aware Linear UCB with Deep Representation for Neural Contextual Bandits
链接: https://arxiv.org/abs/2411.05979
作者: Ha Manh Bui,Enrique Mallada,Anqi Liu
关键词-EN: deep neural networks, neural upper confidence, upper confidence bound, leveraging the representation, representation power
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:By leveraging the representation power of deep neural networks, neural upper confidence bound (UCB) algorithms have shown success in contextual bandits. To further balance the exploration and exploitation, we propose Neural-\sigma^2-LinearUCB, a variance-aware algorithm that utilizes \sigma^2_t, i.e., an upper bound of the reward noise variance at round t, to enhance the uncertainty quantification quality of the UCB, resulting in a regret performance improvement. We provide an oracle version for our algorithm characterized by an oracle variance upper bound \sigma^2_t and a practical version with a novel estimation for this variance bound. Theoretically, we provide rigorous regret analysis for both versions and prove that our oracle algorithm achieves a better regret guarantee than other neural-UCB algorithms in the neural contextual bandits setting. Empirically, our practical method enjoys a similar computational efficiency, while outperforming state-of-the-art techniques by having a better calibration and lower regret across multiple standard settings, including on the synthetic, UCI, MNIST, and CIFAR-10 datasets.
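【代码示例】下面用 numpy 给出方差加权 LinearUCB 的 oracle 版本草图:以每轮方差上界 \sigma_t^2 对岭回归统计量加权,并按置信宽度选臂;方差上界在此直接给定,且为线性(而非神经)特征的简化示意。

```python
import numpy as np

d, n_arms, alpha = 5, 10, 1.0
A = np.eye(d)            # 加权设计矩阵: sum x x^T / sigma_t^2 + I
b = np.zeros(d)          # 加权响应向量: sum r x / sigma_t^2
rng = np.random.default_rng(0)
theta_true = rng.normal(size=d)

for t in range(1000):
    contexts = rng.normal(size=(n_arms, d))
    sigma2_t = 0.5 + 0.5 * rng.random()            # 假设已知的方差上界 sigma_t^2
    theta = np.linalg.solve(A, b)
    A_inv = np.linalg.inv(A)
    ucb = contexts @ theta + alpha * np.sqrt(
        np.einsum('ad,dk,ak->a', contexts, A_inv, contexts))
    a = int(np.argmax(ucb))                        # 选 UCB 最大的臂
    x = contexts[a]
    r = x @ theta_true + np.sqrt(sigma2_t) * rng.normal()
    A += np.outer(x, x) / sigma2_t                 # 方差加权更新
    b += r * x / sigma2_t

print(np.round(np.linalg.solve(A, b) - theta_true, 2))  # 估计误差应接近 0
```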
[LG-82] he effect of different feature selection methods on models created with XGBoost
链接: https://arxiv.org/abs/2411.05937
作者: Jorge Neyra,Vishal B. Siramshetty,Huthaifa I. Ashqar
关键词-EN: superb regularization methods, popular machine learning, machine learning algorithm, feature selection methods, selection methods
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注:
点击查看摘要
Abstract:This study examines the effect that different feature selection methods have on models created with XGBoost, a popular machine learning algorithm with superb regularization methods. It shows that three different ways of reducing the dimensionality of features produce no statistically significant change in the prediction accuracy of the model. This suggests that the traditional idea of removing the noisy training data to make sure models do not overfit may not apply to XGBoost. But it may still be viable in order to reduce computational complexity.
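【代码示例】下面用 sklearn 与 xgboost 演示"不同特征选择方法 + XGBoost"的对比流程(无选择基线、过滤式 SelectKBest、PCA 降维),并用交叉验证比较精度;数据为合成示例,所列方法与论文实际采用的未必一致。

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=50, n_informative=10,
                           random_state=0)
pipelines = {
    "baseline": make_pipeline(XGBClassifier(n_estimators=100, verbosity=0)),
    "kbest":    make_pipeline(SelectKBest(f_classif, k=15),
                              XGBClassifier(n_estimators=100, verbosity=0)),
    "pca":      make_pipeline(PCA(n_components=15),
                              XGBClassifier(n_estimators=100, verbosity=0)),
}
for name, pipe in pipelines.items():
    scores = cross_val_score(pipe, X, y, cv=5)   # 五折交叉验证精度
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```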
[LG-83] DNAMite: Interpretable Calibrated Survival Analysis with Discretized Additive Models
链接: https://arxiv.org/abs/2411.05923
作者: Mike Van Ness,Billy Block,Madeleine Udell
关键词-EN: Survival analysis, machine learning, Survival, machine learning models, classic problem
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Survival analysis is a classic problem in statistics with important applications in healthcare. Most machine learning models for survival analysis are black-box models, limiting their use in healthcare settings where interpretability is paramount. More recently, glass-box machine learning models have been introduced for survival analysis, with both strong predictive performance and interpretability. Still, several gaps remain, as no prior glass-box survival model can produce calibrated shape functions with enough flexibility to capture the complex patterns often found in real data. To fill this gap, we introduce a new glass-box machine learning model for survival analysis called DNAMite. DNAMite uses feature discretization and kernel smoothing in its embedding module, making it possible to learn shape functions with a flexible balance of smoothness and jaggedness. Further, DNAMite produces calibrated shape functions that can be directly interpreted as contributions to the cumulative incidence function. Our experiments show that DNAMite generates shape functions closer to true shape functions on synthetic data, while making predictions with comparable predictive performance and better calibration than previous glass-box and black-box models.
[LG-84] Streaming Bayes GFlowNets
链接: https://arxiv.org/abs/2411.05899
作者: Tiago da Silva,Daniel Augusto de Souza,Diego Mesquita
关键词-EN: Bayes’ rule naturally, Bayes’ rule, rule naturally, Bayes’, data arrives
类目: Machine Learning (cs.LG)
*备注: 25 pages, 8 figures
点击查看摘要
Abstract:Bayes’ rule naturally allows for inference refinement in a streaming fashion, without the need to recompute posteriors from scratch whenever new data arrives. In principle, Bayesian streaming is straightforward: we update our prior with the available data and use the resulting posterior as a prior when processing the next data chunk. In practice, however, this recipe entails i) approximating an intractable posterior at each time step; and ii) encapsulating results appropriately to allow for posterior propagation. For continuous state spaces, variational inference (VI) is particularly convenient due to its scalability and the tractability of variational posteriors. For discrete state spaces, however, state-of-the-art VI results in analytically intractable approximations that are ill-suited for streaming settings. To enable streaming Bayesian inference over discrete parameter spaces, we propose streaming Bayes GFlowNets (abbreviated as SB-GFlowNets) by leveraging the recently proposed GFlowNets – a powerful class of amortized samplers for discrete compositional objects. Notably, SB-GFlowNet approximates the initial posterior using a standard GFlowNet and subsequently updates it using a tailored procedure that requires only the newly observed data. Our case studies in linear preference learning and phylogenetic inference showcase the effectiveness of SB-GFlowNets in sampling from an unnormalized posterior in a streaming setting. As expected, we also observe that SB-GFlowNets is significantly faster than repeatedly training a GFlowNet from scratch to sample from the full posterior.
[LG-85] A Comparative Analysis of Machine Learning Models for DDoS Detection in IoT Networks
链接: https://arxiv.org/abs/2411.05890
作者: Sushil Shakya,Robert Abbas
关键词-EN: Stochastic Gradient Descent, paper presents, machine learning, machine learning models, Stochastic Gradient
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 6 pages, 6 figures
点击查看摘要
Abstract:This paper presents the detection of DDoS attacks in IoT networks using machine learning models. The rapid growth of IoT networks has made them highly susceptible to various forms of cyberattacks, many of whose security procedures are implemented in an irregular manner. It evaluates the efficacy of different machine learning models, such as XGBoost, K-Nearest Neighbours, Stochastic Gradient Descent, and Naïve Bayes, in detecting DDoS attacks from normal network traffic. Each model has been evaluated on several performance metrics, such as accuracy, precision, recall, and F1-score, to understand the suitability of each model in real-time detection and response against DDoS threats. This comparative analysis will, therefore, enumerate the unique strengths and weaknesses of each model with respect to IoT environments that are dynamic and hence moving in nature. The effectiveness of these models is analyzed, showing how machine learning can greatly enhance IoT security frameworks, offering adaptive, efficient, and reliable DDoS detection capabilities. These findings have shown the potential of machine learning in addressing the pressing need for robust IoT security solutions that can mitigate modern cyber threats and assure network integrity.
[LG-86] Sdn Intrusion Detection Using Machine Learning Method
链接: https://arxiv.org/abs/2411.05888
作者: Muhammad Zawad Mahmud,Shahran Rahman Alve,Samiha Islam,Mohammad Monirujjaman Khan
关键词-EN: Gradient Boosting, Software-defined network, directly programmable, underlying infrastructure, Decision Tree
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注: 15 Pages, 14 Figures
点击查看摘要
Abstract:Software-defined network (SDN) is a new approach that allows network control (the control plane) to become directly programmable, and the underlying infrastructure to be abstracted from applications and network services. When it comes to security, the centralization that this demands is ripe for a variety of cyber threats that are not typically seen in other network architectures. The authors in this research developed a novel machine-learning method to capture infections in networks. We applied the classifier to the UNSW-NB 15 intrusion detection benchmark and trained a model with this data. Random Forest and Decision Tree are classifiers used for assessment, along with Gradient Boosting and AdaBoost. The best-performing model was Gradient Boosting, with accuracy, recall, and F1 score of 99.87%, 100%, and 99.85%, respectively, which makes it reliable in the detection of intrusions for SDN networks. The second best-performing classifier was Random Forest with 99.38% accuracy, followed by AdaBoost and Decision Tree. The research shows that the reason Gradient Boosting is so effective in this task is that it combines weak learners to create a strong ensemble model that can predict whether traffic is normal or malicious with high accuracy. This paper indicates that the GBDT-IDS model is able to improve network security significantly and has better features in terms of both real-time detection accuracy and low false positive rates. In future work, we will integrate this model into a live SDN environment to observe its application and scalability. This research serves as an initial base from which one can make further strides to enhance security in SDN using ML techniques and build more secure, resilient networks.
[LG-87] Rethinking Deep Learning: Non-backpropagation and Non-optimization Machine Learning Approach Using Hebbian Neural Networks
链接: https://arxiv.org/abs/2411.05861
作者: Kei Itoh
关键词-EN: Hebbian learning, scientific challenges, provide a powerful, powerful tool, tool for addressing
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注: 13 pages, 4 figures
点击查看摘要
Abstract:Developing strong AI could provide a powerful tool for addressing social and scientific challenges. Neural networks (NNs), inspired by biological systems, have the potential to achieve this. However, weight optimization techniques using error backpropagation are not observed in biological systems, raising doubts about current NN approaches. In this context, Itoh (2024) solved the MNIST classification problem without using objective functions or backpropagation. However, weight updates were not used, so it does not qualify as machine learning AI. In this study, I develop a machine learning method that mimics biological neural systems by implementing Hebbian learning in NNs, without backpropagation or optimization methods, to solve the MNIST classification problem and analyze its output. Development proceeded in three stages. In the first stage, I applied the Hebbian learning rule to the MNIST character recognition algorithm by Itoh (2024), resulting in lower accuracy than non-Hebbian NNs, highlighting the limitations of conventional training procedures for Hebbian learning. In the second stage, I examined the properties of individually trained NNs using norm-based cognition, showing that NNs trained on a specific label respond powerfully to that label. In the third stage, I created an MNIST character recognition program using vector norm magnitude as the criterion, achieving an accuracy of approximately 75%. This demonstrates that Hebbian learning NNs can recognize handwritten characters without objective functions, backpropagation, or optimization processes. Based on these results, developing a mechanism based on norm-based cognition as a fundamental unit and then increasing complexity to achieve indirect similarity cognition should help mimic biological neural systems and contribute to realizing strong AI.
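【代码示例】下面用 numpy 给出自联想 Hebbian 学习(无目标函数、无反向传播)加"基于范数的认知"判别的玩具示意:每个类别各自按 Hebb 规则累积权重,识别时取响应范数最大的类别;数据为合成原型(代替 MNIST),非原文实现。

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, dim = 3, 100
protos = rng.normal(size=(n_classes, dim))          # 假设数据:每类一个原型
def sample(c):
    return protos[c] + 0.3 * rng.normal(size=dim)

# 自联想 Hebbian 学习:对每个类别分别累积 W = sum_x x x^T(post = pre = x)
W = np.zeros((n_classes, dim, dim))
for c in range(n_classes):
    for _ in range(100):
        x = sample(c)
        W[c] += np.outer(x, x) / 100                # Hebb 规则: dW ∝ post·pre^T

def predict(x):
    # 基于范数的认知:对本类样本, W[c] @ x 的响应范数显著更大
    return int(np.argmax([np.linalg.norm(W[c] @ x) for c in range(n_classes)]))

acc = np.mean([predict(sample(c)) == c
               for c in range(n_classes) for _ in range(50)])
print(f"accuracy: {acc:.2f}")                       # 对该玩具数据通常接近 1.0
```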
[LG-88] Financial Fraud Detection using Jump-Attentive Graph Neural Networks
链接: https://arxiv.org/abs/2411.05857
作者: Prashank Kadam
关键词-EN: services online continues, continues to grow, surged correspondingly, online continues, financial services online
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: International Conference on Machine Learning and Applications 2024
点击查看摘要
Abstract:As the availability of financial services online continues to grow, the incidence of fraud has surged correspondingly. Fraudsters continually seek new and innovative ways to circumvent the detection algorithms in place. Traditionally, fraud detection relied on rule-based methods, where rules were manually created based on transaction data features. However, these techniques soon became ineffective due to their reliance on manual rule creation and their inability to detect complex data patterns. Today, a significant portion of the financial services sector employs various machine learning algorithms, such as XGBoost, Random Forest, and neural networks, to model transaction data. While these techniques have proven more efficient than rule-based methods, they still fail to capture interactions between different transactions and their interrelationships. Recently, graph-based techniques have been adopted for financial fraud detection, leveraging graph topology to aggregate neighborhood information of transaction data using Graph Neural Networks (GNNs). Despite showing improvements over previous methods, these techniques still struggle to keep pace with the evolving camouflaging tactics of fraudsters and suffer from information loss due to over-smoothing. In this paper, we propose a novel algorithm that employs an efficient neighborhood sampling method, effective for camouflage detection and preserving crucial feature information from non-similar nodes. Additionally, we introduce a novel GNN architecture that utilizes attention mechanisms and preserves holistic neighborhood information to prevent information loss. We test our algorithm on financial data to show that our method outperforms other state-of-the-art graph algorithms.
[LG-89] ♠ SPADE ♠: Split Peak Attention DEcomposition
链接: https://arxiv.org/abs/2411.05852
作者: Malcolm Wolff,Kin G. Olivares,Boris Oreshkin,Sunny Ruan,Sitan Yang,Abhinav Katoch,Shankar Ramasubramanian,Youxin Zhang,Michael W. Mahoney,Dmitry Efimov,Vincent Quenneville-Bélair
关键词-EN: faces challenges induced, Peak Events, promotions and holidays, Peak events create, Split Peak Attention
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Demand forecasting faces challenges induced by Peak Events (PEs) corresponding to special periods such as promotions and holidays. Peak events create significant spikes in demand followed by demand ramp down periods. Neural networks like MQCNN and MQT overreact to demand peaks by carrying over the elevated PE demand into subsequent Post-Peak-Event (PPE) periods, resulting in significantly over-biased forecasts. To tackle this challenge, we introduce a neural forecasting model called Split Peak Attention DEcomposition, SPADE. This model reduces the impact of PEs on subsequent forecasts by modeling forecasting as consisting of two separate tasks: one for PEs; and the other for the rest. Its architecture then uses masked convolution filters and a specialized Peak Attention module. We show SPADE’s performance on a worldwide retail dataset with hundreds of millions of products. Our results reveal a reduction in PPE degradation by 4.5% and an improvement in PE accuracy by 3.9%, relative to current production models.
[LG-90] Multivariate Data Augmentation for Predictive Maintenance using Diffusion
链接: https://arxiv.org/abs/2411.05848
作者: Andrew Thompson,Alexander Sommers,Alicia Russell-Gilbert,Logan Cummins,Sudip Mittal,Shahram Rahimi,Maria Seale,Joseph Jaboure,Thomas Arnold,Joshua Church
关键词-EN: optimize system repairs, fault data, data, Predictive maintenance, financial domains
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Predictive maintenance has been used to optimize system repairs in the industrial, medical, and financial domains. This technique relies on the consistent ability to detect and predict anomalies in critical systems. AI models have been trained to detect system faults, improving predictive maintenance efficiency. Typically there is a lack of fault data to train these models, due to organizations working to keep fault occurrences and down time to a minimum. For newly installed systems, no fault data exists since they have yet to fail. By using diffusion models for synthetic data generation, the complex training datasets for these predictive models can be supplemented with high level synthetic fault data to improve their performance in anomaly detection. By learning the relationship between healthy and faulty data in similar systems, a diffusion model can attempt to apply that relationship to healthy data of a newly installed system that has no fault data. The diffusion model would then be able to generate useful fault data for the new system, and enable predictive models to be trained for predictive maintenance. The following paper demonstrates a system for generating useful, multivariate synthetic data for predictive maintenance, and how it can be applied to systems that have yet to fail.
[LG-91] Neural Precision Polarization: Simplifying Neural Network Inference with Dual-Level Precision
链接: https://arxiv.org/abs/2411.05845
作者: Dinithi Jayasuriya,Nastaran Darabi,Maeesha Binte Hashem,Amit Ranjan Trivedi
关键词-EN: assigning low precision, reserving high precision, high precision levels, scheme for DNN, DNN inference
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We introduce a precision polarization scheme for DNN inference that utilizes only very low and very high precision levels, assigning low precision to the majority of network weights and activations while reserving high precision paths for targeted error compensation. This separation allows for distinct optimization of each precision level, thereby reducing memory and computation demands without compromising model accuracy. In the discussed approach, a floating-point model can be trained in the cloud and then downloaded to an edge device, where network weights and activations are directly quantized to meet the edge devices’ desired level, such as NF4 or INT8. To address accuracy loss from quantization, surrogate paths are introduced, leveraging low-rank approximations on a layer-by-layer basis. These paths are trained with a sensitivity-based metric on minimal training data to recover accuracy loss under quantization as well as due to process variability, such as when the main prediction path is implemented using analog acceleration. Our simulation results show that neural precision polarization enables approximately 464 TOPS per Watt MAC efficiency and reliability by integrating rank-8 error recovery paths with highly efficient, though potentially unreliable, bit plane-wise compute-in-memory processing.
[LG-92] Efficient and Robust Freeway Traffic Speed Estimation under Oblique Grid using Vehicle Trajectory Data
链接: https://arxiv.org/abs/2411.05842
作者: Yang He,Chengchuan An,Yuheng Jia,Jiachao Liu,Zhenbo Lu,Jingxin Xia
关键词-EN: Accurately estimating spatiotemporal, significant challenge due, limited sensor deployment, spatiotemporal traffic states, potential data corruption
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注: accepted by T-ITS
点击查看摘要
Abstract:Accurately estimating spatiotemporal traffic states on freeways is a significant challenge due to limited sensor deployment and potential data corruption. In this study, we propose an efficient and robust low-rank model for precise spatiotemporal traffic speed state estimation (TSE) using low-penetration vehicle trajectory data. Leveraging traffic wave priors, an oblique grid-based matrix is first designed to transform the inherent dependencies of spatiotemporal traffic states into the algebraic low-rankness of a matrix. Then, with the enhanced traffic state low-rankness in the oblique matrix, a low-rank matrix completion method is tailored to explicitly capture spatiotemporal traffic propagation characteristics and precisely reconstruct traffic states. In addition, an anomaly-tolerant module based on a sparse matrix is developed to accommodate corrupted data input and thereby improve the TSE model robustness. Notably, driven by the understanding of traffic waves, the computational complexity of the proposed efficient method is only correlated with the problem size itself, not with dataset size and hyperparameter selection prevalent in existing studies. Extensive experiments demonstrate the effectiveness, robustness, and efficiency of the proposed model. The performance of the proposed method achieves up to a 12% improvement in Root Mean Squared Error (RMSE) in the TSE scenarios and an 18% improvement in RMSE in the robust TSE scenarios, and it runs more than 20 times faster than the state-of-the-art (SOTA) methods.
[LG-93] FLEXtime: Filterbank learning for explaining time series
链接: https://arxiv.org/abs/2411.05841
作者: Thea Brüsch,Kristoffer K. Wickstrøm,Mikkel N. Schmidt,Robert Jenssen,Tommy S. Alstrøm
关键词-EN: explaining predictions based, instance-wise saliency mask, time series, explaining predictions, predictions based
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:State-of-the-art methods for explaining predictions based on time series are built on learning an instance-wise saliency mask for each time step. However, for many types of time series, the salient information is found in the frequency domain. Adopting existing methods to the frequency domain involves naively zeroing out frequency content in the signals, which goes against established signal processing theory. Therefore, we propose a new method entitled FLEXtime, that uses a filterbank to split the time series into frequency bands and learns the optimal combinations of these bands. FLEXtime avoids the drawbacks of zeroing out frequency bins and is more stable and easier to train compared to the naive method. Our extensive evaluation shows that FLEXtime on average outperforms state-of-the-art explainability methods across a range of datasets. FLEXtime fills an important gap in the time series explainability literature and can provide a valuable tool for a wide range of time series like EEG and audio.
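【代码示例】下面用 scipy 演示 FLEXtime 思路的第一步:用带通滤波器组把信号拆成若干频带,再按权重重组;论文中这些频带权重是学习得到的最优掩码,此处用固定权重示意。

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

fs = 100.0
t = np.arange(0, 10, 1 / fs)
x = np.sin(2 * np.pi * 3 * t) + 0.5 * np.sin(2 * np.pi * 20 * t)  # 3 Hz + 20 Hz 混合信号

bands = [(1, 8), (8, 16), (16, 32)]                  # 假设的三个频带
filterbank = [butter(4, band, btype="bandpass", fs=fs, output="sos")
              for band in bands]
components = np.stack([sosfiltfilt(sos, x) for sos in filterbank])

weights = np.array([1.0, 0.0, 0.2])                  # FLEXtime 中该权重为可学习掩码
explanation = weights @ components                   # 频带加权重组,而非粗暴置零 FFT bin
print(components.shape, explanation.shape)           # (3, 1000) (1000,)
```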
[LG-94] Fully Automated Correlated Time Series Forecasting in Minutes VLDB2025
链接: https://arxiv.org/abs/2411.05833
作者: Xinle Wu,Xingjian Wu,Dalin Zhang,Miao Zhang,Chenjuan Guo,Bin Yang,Christian S. Jensen
关键词-EN: Societal and industrial, systems increasingly leverage, increasingly leverage sensors, search, time series
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: accepted by PVLDB 2025
点击查看摘要
Abstract:Societal and industrial infrastructures and systems increasingly leverage sensors that emit correlated time series. Forecasting of future values of such time series based on recorded historical values has important benefits. Automatically designed models achieve higher accuracy than manually designed models. Given a forecasting task, which includes a dataset and a forecasting horizon, automated design methods automatically search for an optimal forecasting model for the task in a manually designed search space, and then train the identified model using the dataset to enable the forecasting. Existing automated methods face three challenges. First, the search space is constructed by human experts, rendering the methods only semi-automated and yielding search spaces prone to subjective biases. Second, it is time consuming to search for an optimal model. Third, training the identified model for a new task is also costly. These challenges limit the practicability of automated methods in real-world settings. To contend with the challenges, we propose a fully automated and highly efficient correlated time series forecasting framework where the search and training can be done in minutes. The framework includes a data-driven, iterative strategy to automatically prune a large search space to obtain a high-quality search space for a new forecasting task. It includes a zero-shot search strategy to efficiently identify the optimal model in the customized search space. And it includes a fast parameter adaptation strategy to accelerate the training of the identified model. Experiments on seven benchmark datasets offer evidence that the framework is capable of state-of-the-art accuracy and is much more efficient than existing methods.
[LG-95] GitChameleon: Unmasking the Version-Switching Capabilities of Code Generation Models
链接: https://arxiv.org/abs/2411.05830
作者: Nizar Islah,Justine Gehring,Diganta Misra,Eilif Muller,Irina Rish,Terry Yue Zhuo,Massimo Caccia
关键词-EN: frequent version updates, software libraries presents, rapid evolution, evolution of software, presents a significant
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:The rapid evolution of software libraries presents a significant challenge for code generation models, which must adapt to frequent version updates while maintaining compatibility with previous versions. Existing code completion benchmarks often overlook this dynamic aspect, and the one that does consider it relies on static code prediction tasks without execution-based evaluation, offering a limited perspective on a model's practical usability. To address this gap, we introduce GitChameleon, a novel, manually curated dataset comprising 116 Python code completion problems, each conditioned on specific library versions and accompanied by executable unit tests. GitChameleon is designed to rigorously assess the ability of modern large language models (LLMs) to generate version-specific code that is not only syntactically correct but also functionally accurate upon execution. Our comprehensive evaluations reveal that state-of-the-art LLMs struggle with this task; for instance, GPT-4o achieves a pass@10 of only 39.9% (43.7% when provided with error feedback), highlighting the complexity of the problem and the limitations of current models. By providing an execution-based benchmark that emphasizes the dynamic nature of code libraries, GitChameleon serves as a critical tool to advance the development of more adaptable and reliable code generation models. To facilitate further exploration of version-conditioned code generation, we make our code repository publicly accessible at this https URL.
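【代码示例】摘要中的 pass@10 可按 Codex 论文提出的无偏估计式 pass@k = 1 − C(n−c, k)/C(n, k) 计算,其中 n 为每题生成的候选数、c 为通过单元测试的候选数;下面是一个简短实现(示意,未必是 GitChameleon 的官方评测脚本)。

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n 个生成样本中有 c 个通过单元测试时, pass@k 的无偏估计。"""
    if n - c < k:          # 失败样本不足 k 个时, 任取 k 个必含通过样本
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# 例:每题生成 20 个候选,某题有 3 个通过,则 pass@10 ≈ 0.89
print(round(pass_at_k(20, 3, 10), 2))
```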
[LG-96] Open LLM s are Necessary for Current Private Adaptations and Outperform their Closed Alternatives NEURIPS2024
链接: https://arxiv.org/abs/2411.05818
作者: Vincent Hanke,Tom Blanchard,Franziska Boenisch,Iyiola Emmanuel Olatunji,Michael Backes,Adam Dziedzic
关键词-EN: open Large Language, Large Language Models, made significant progress, closed LLMs, Large Language
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: Accepted at NeurIPS 2024
点击查看摘要
Abstract:While open Large Language Models (LLMs) have made significant progress, they still fall short of matching the performance of their closed, proprietary counterparts, making the latter attractive even for the use on highly private data. Recently, various new methods have been proposed to adapt closed LLMs to private data without leaking private information to third parties and/or the LLM provider. In this work, we analyze the privacy protection and performance of the four most recent methods for private adaptation of closed LLMs. By examining their threat models and thoroughly comparing their performance under different privacy levels according to differential privacy (DP), various LLM architectures, and multiple datasets for classification and generation tasks, we find that: (1) all the methods leak query data, i.e., the (potentially sensitive) user data that is queried at inference time, to the LLM provider, (2) three out of four methods also leak large fractions of private training data to the LLM provider while the method that protects private data requires a local open LLM, (3) all the methods exhibit lower performance compared to three private gradient-based adaptation methods for local open LLMs, and (4) the private adaptation methods for closed LLMs incur higher monetary training and query costs than running the alternative methods on local open LLMs. This yields the conclusion that, to achieve truly privacy-preserving LLM adaptations that yield high performance and more privacy at lower costs, taking into account current methods and models, one should use open LLMs.
[LG-97] AI for ERW Detection in Clearance Operations – The State of Research
链接: https://arxiv.org/abs/2411.05813
作者: Björn Kischelewski,Gregory Cathcart,David Wahl,Benjamin Guedj
关键词-EN: ERW risk prediction, ERW, ERW risk, ERW clearance, remnants of war
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:The clearance of explosive remnants of war (ERW) continues to be a predominantly manual and high-risk process that can benefit from advances in technology to improve its efficiency and effectiveness. In particular, research on artificial intelligence for ERW clearance has grown significantly in recent years. However, this research spans a wide range of fields, making it difficult to gain a comprehensive understanding of current trends and developments. Therefore, this article provides a literature review of academic research on AI for ERW detection for clearance operations. It finds that research can be grouped into two main streams, AI for ERW object detection and AI for ERW risk prediction, with the latter being much less studied than the former. From the analysis of the eligible literature, we develop three opportunities for future research, including a call for renewed efforts in the use of AI for ERW risk prediction, the combination of different AI systems and data sources, and novel approaches to improve ERW risk prediction performance, such as pattern-based prediction. Finally, we provide a perspective on the future of AI for ERW clearance. We emphasize the role of traditional machine learning for this task, the need to dynamically incorporate expert knowledge into the models, and the importance of effectively integrating AI systems with real-world operations.
[LG-98] Variational Bayes Decomposition for Inverse Estimation with Superimposed Multispectral Intensity
链接: https://arxiv.org/abs/2411.05805
作者: Akinori Asahara,Yoshihiro Osakabe,Yamamoto Mitsuya,Hidekazu Morita
关键词-EN: variational Bayesian inference, measured wave intensity, X-ray intensity, variational Bayesian, Bayesian inference
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Signal Processing (eess.SP)
*备注:
点击查看摘要
Abstract:A variational Bayesian inference for measured wave intensity, such as X-ray intensity, is proposed in this paper. Such data is widely used to obtain information about unobservable features of an object, such as a material sample and its components. The proposed method assumes that particles represent the wave, and their behaviors are stochastically modeled. The inference is accurate even if the data is noisy because of a smooth prior setting. Moreover, two experimental results in this paper show the feasibility of the proposed method.
[LG-99] Semantic Information G Theory for Range Control with Tradeoff between Purposiveness and Efficiency
链接: https://arxiv.org/abs/2411.05789
作者: Chenguang Lu
关键词-EN: Recent advances, deep learning suggest, semantic mutual information, information, deep learning
类目: Information Theory (cs.IT); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 9 pages and 6 Figures
点击查看摘要
Abstract:Recent advances in deep learning suggest that we need to maximize and minimize two different kinds of information simultaneously. The Information Max-Min (IMM) method has been used in deep learning, reinforcement learning, and maximum entropy control. Shannon’s information rate-distortion function is the theoretical basis of Minimizing Mutual Information (MMI) and data compression, but it is not enough to solve the IMM problem. The author has proposed the semantic information G theory (i.e., Shannon-Lu theory), including the semantic information G measure and the information rate fidelity function R(G) (R is the MMI for the given G of semantic mutual information). The parameter solution of the R(G) function provides a general method to improve the information efficiency, G/R. This paper briefly introduces the semantic information G measure and the parametric solution of the R(G) function. Two examples reveal that the parametric solution can help us optimize range control with the tradeoff between purposiveness (i.e., semantic mutual information) and information efficiency. It seems that the R(G) function can serve as the theoretical basis of IMM methods, but we still need further research in combination with deep learning, reinforcement learning, and constraint control.
[LG-100] Conditional simulation via entropic optimal transport: Toward non-parametric estimation of conditional Brenier maps
链接: https://arxiv.org/abs/2411.07154
作者: Ricardo Baptista,Aram-Alexandre Pooladian,Michael Brennan,Youssef Marzouk,Jonathan Niles-Weed
关键词-EN: conditional Brenier maps, conditional Brenier, Brenier maps, construct conditional Brenier, Generate samples
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 26 pages, 4 figures
点击查看摘要
Abstract:Conditional simulation is a fundamental task in statistical modeling: Generate samples from the conditionals given finitely many data points from a joint distribution. One promising approach is to construct conditional Brenier maps, where the components of the map pushforward a reference distribution to conditionals of the target. While many estimators exist, few, if any, come with statistical or algorithmic guarantees. To this end, we propose a non-parametric estimator for conditional Brenier maps based on the computational scalability of entropic optimal transport. Our estimator leverages a result of Carlier et al. (2010), which shows that optimal transport maps under a rescaled quadratic cost asymptotically converge to conditional Brenier maps; our estimator is precisely the entropic analogues of these converging maps. We provide heuristic justifications for choosing the scaling parameter in the cost as a function of the number of samples by fully characterizing the Gaussian setting. We conclude by comparing the performance of the estimator to other machine learning and non-parametric approaches on benchmark datasets and Bayesian inference problems.
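【代码示例】熵正则最优传输可用 Sinkhorn 迭代高效求解,这正是论文可扩展性的来源;下面是离散分布间熵 OT 的 numpy 最小实现(相对论文的条件 Brenier 映射估计大为简化,仅作示意)。

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.05, n_iter=500):
    """交替缩放求解熵正则 OT, 返回传输计划 P (行和为 a, 列和为 b)。"""
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 1))
y = rng.normal(loc=2.0, size=(60, 1))
C = (x - y.T) ** 2
C = C / C.max()                             # 归一化代价, 避免 exp 下溢
P = sinkhorn(np.full(50, 1 / 50), np.full(60, 1 / 60), C)
print(P.sum(), (P * C).sum())               # 总质量约为 1;熵正则 OT 代价
```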
[LG-101] Effectively Leveraging Momentum Terms in Stochastic Line Search Frameworks for Fast Optimization of Finite-Sum Problems
链接: https://arxiv.org/abs/2411.07102
作者: Matteo Lapucci,Davide Pucci
关键词-EN: deep learning scenarios, address unconstrained finite-sum, unconstrained finite-sum optimization, scale deep learning, finite-sum optimization problems
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:In this work, we address unconstrained finite-sum optimization problems, with particular focus on instances originating in large scale deep learning scenarios. Our main interest lies in the exploration of the relationship between recent line search approaches for stochastic optimization in the overparametrized regime and momentum directions. First, we point out that combining these two elements with computational benefits is not straightforward. To this aim, we propose a solution based on mini-batch persistency. We then introduce an algorithmic framework that exploits a mix of data persistency, conjugate-gradient type rules for the definition of the momentum parameter and stochastic line searches. The resulting algorithm is empirically shown to outperform other popular methods from the literature, obtaining state-of-the-art results in both convex and nonconvex large scale training problems.
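As a hedged sketch of the ingredients discussed above (not the authors' exact algorithm), the snippet below combines a heavy-ball momentum direction with an Armijo-type backtracking line search evaluated on a persistent mini-batch, which is what keeps the extra function evaluations cheap. All function names and constants are illustrative.

```python
import numpy as np

def sls_momentum_step(w, m, batch, loss_grad, beta=0.9,
                      alpha0=1.0, c=0.1, shrink=0.5, max_backtracks=20):
    """One update: backtrack along a momentum direction on a persistent batch."""
    f0, g = loss_grad(w, batch)      # loss and gradient on this mini-batch
    m = beta * m + g                 # heavy-ball style momentum direction
    alpha = alpha0
    for _ in range(max_backtracks):  # Armijo test reuses the same batch (persistency)
        if loss_grad(w - alpha * m, batch)[0] <= f0 - c * alpha * (g @ m):
            break
        alpha *= shrink
    return w - alpha * m, m

# toy usage: mini-batch least squares
rng = np.random.default_rng(0)
A, b = rng.normal(size=(128, 10)), rng.normal(size=128)

def loss_grad(w, idx):
    r = A[idx] @ w - b[idx]
    return 0.5 * r @ r / len(idx), A[idx].T @ r / len(idx)

w, m = np.zeros(10), np.zeros(10)
for step in range(100):
    idx = rng.choice(128, size=32, replace=False)  # batch persists within the step
    w, m = sls_momentum_step(w, m, idx, loss_grad)
```

Note that when the momentum direction is not a descent direction the Armijo test degenerates; practical variants, including conjugate-gradient-type rules for the momentum parameter as in the paper, guard against this.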
[LG-102] Reconstruction of neuromorphic dynamics from a single scalar time series using variational autoencoder and neural network map
链接: https://arxiv.org/abs/2411.07055
作者: Pavel V. Kuptsov,Nataliya V. Stankevich
关键词-EN: scalar time series, single scalar time, paper examines, examines the reconstruction, neuromorphic behavior
类目: Pattern Formation and Solitons (nlin.PS); Machine Learning (cs.LG); Biological Physics (physics.bio-ph)
*备注: 15 pages, 15 figures, 3 tables
点击查看摘要
Abstract:This paper examines the reconstruction of a family of dynamical systems with neuromorphic behavior using a single scalar time series. A model of a physiological neuron based on the Hodgkin-Huxley formalism is considered. A single time series of one of its variables is shown to be enough to train a neural network that can operate as a discrete-time dynamical system with one control parameter. The neural network system is created in two steps. First, the delay-coordinate embedding vectors are constructed from the original time series and their dimension is reduced by means of a variational autoencoder to obtain the recovered state-space vectors. It is shown that an appropriate reduced dimension can be determined by analyzing the autoencoder training process. Second, pairs of the recovered state-space vectors at consecutive time steps, supplied with a constant value playing the role of a control parameter, are used to train another neural network to make it operate as a recurrent map. The regimes of the thus-created neural network system observed when its control parameter is varied are in very good accordance with those of the original system, though they were not explicitly presented during training.
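The first step of the two-step procedure, delay-coordinate embedding, is simple to state in code. Below is a minimal sketch using a synthetic signal as a stand-in for the neuron's voltage trace; the VAE dimension reduction and recurrent-map training are not reproduced, and the embedding dimension and lag are illustrative.

```python
import numpy as np

def delay_embed(x, dim, tau=1):
    """Rows are [x[t], x[t+tau], ..., x[t+(dim-1)*tau]]."""
    n = len(x) - (dim - 1) * tau
    return np.stack([x[i * tau : i * tau + n] for i in range(dim)], axis=1)

t = np.linspace(0, 100, 5000)
x = np.sin(t) + 0.5 * np.sin(0.7 * t)   # stand-in for a single scalar voltage trace
E = delay_embed(x, dim=8, tau=5)        # embedding vectors, shape (n, 8)
pairs = (E[:-1], E[1:])                 # consecutive-step pairs for the recurrent map
```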
[LG-103] Unified Bayesian representation for high-dimensional multi-modal biomedical data for small-sample classification
链接: https://arxiv.org/abs/2411.07043
作者: Albert Belenguer-Llorens,Carlos Sevilla-Salcedo,Jussi Tohka,Vanessa Gómez-Verdejo
关键词-EN: Bayesian algorithm designed, providing explainable solutions, Bayesian algorithm, small sample sizes, algorithm designed
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 36 pages, 3 figures and 3 tables
点击查看摘要
Abstract:We present BALDUR, a novel Bayesian algorithm designed to deal with multi-modal datasets and small sample sizes in high-dimensional settings while providing explainable solutions. To do so, the proposed model combines within a common latent space the different data views to extract the relevant information to solve the classification task and prune out the irrelevant/redundant features/data views. Furthermore, to provide generalizable solutions in small sample size scenarios, BALDUR efficiently integrates dual kernels over the views with a small sample-to-feature ratio. Finally, its linear nature ensures the explainability of the model outcomes, allowing its use for biomarker identification. This model was tested over two different neurodegeneration datasets, outperforming the state-of-the-art models and detecting features aligned with markers already described in the scientific literature.
[LG-104] Data-Driven Gradient Optimization for Field Emission Management in a Superconducting Radio-Frequency Linac
链接: https://arxiv.org/abs/2411.07018
作者: Steven Goldenberg,Kawser Ahammed,Adam Carpenter,Jiang Li,Riad Suleiman,Chris Tennant
关键词-EN: radio-frequency linear accelerators, superconducting radio-frequency linear, linear accelerators, significant problems, problems in superconducting
类目: Accelerator Physics (physics.acc-ph); Machine Learning (cs.LG)
*备注: 14 pages, 6 figures, 10 tables
点击查看摘要
Abstract:Field emission can cause significant problems in superconducting radio-frequency linear accelerators (linacs). When cavity gradients are pushed higher, radiation levels within the linacs may rise exponentially, causing degradation of many nearby systems. This research aims to utilize machine learning with uncertainty quantification to predict radiation levels at multiple locations throughout the linacs and ultimately to optimize cavity gradients so as to reduce field-emission-induced radiation while maintaining the total linac energy gain necessary for the experimental physics program. The optimized solutions show over 40% reductions in both neutron and gamma radiation relative to the standard operational settings.
[LG-105] Causal-discovery-based root-cause analysis and its application in time-series prediction error diagnosis
链接: https://arxiv.org/abs/2411.06990
作者: Hiroshi Yokoyama,Ryusei Shingaki,Kaneharu Nishino,Shohei Shimizu,Thong Pham
关键词-EN: Recent rapid advancements, Recent rapid, error diagnosis challenging, black boxes, diagnosis challenging
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 10 pages with 5 figures
点击查看摘要
Abstract:Recent rapid advancements in machine learning have greatly enhanced the accuracy of prediction models, but most models remain “black boxes”, making prediction error diagnosis challenging, especially with outliers. This lack of transparency hinders trust and reliability in industrial applications. Heuristic attribution methods, while helpful, often fail to capture true causal relationships, leading to inaccurate error attributions. Various root-cause analysis methods have been developed using Shapley values, yet they typically require predefined causal graphs, limiting their applicability for prediction errors in machine learning models. To address these limitations, we introduce the Causal-Discovery-based Root-Cause Analysis (CD-RCA) method that estimates causal relationships between the prediction error and the explanatory variables, without needing a predefined causal graph. By simulating synthetic error data, CD-RCA can identify variable contributions to outliers in prediction errors via Shapley values. Extensive simulations show CD-RCA outperforms current heuristic attribution methods, and a sensitivity analysis reveals new patterns where Shapley values may misattribute errors, paving the way for more accurate error attribution methods.
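For intuition on the attribution step, here is a minimal permutation-sampling Shapley sketch that attributes a single squared prediction error to input variables. CD-RCA additionally discovers a causal graph and simulates synthetic error data, neither of which is shown; the linear "black box" and the baseline values are illustrative.

```python
import numpy as np

def shapley_error_attribution(model, x, baseline, y_true, n_perm=500, seed=0):
    """phi[j] ~ contribution of feature j to the squared error at x."""
    rng = np.random.default_rng(seed)
    d = len(x)
    def err(z):
        return (model(z) - y_true) ** 2
    phi = np.zeros(d)
    for _ in range(n_perm):
        order = rng.permutation(d)
        z = baseline.copy()
        prev = err(z)
        for j in order:                 # switch features on one at a time
            z[j] = x[j]
            cur = err(z)
            phi[j] += cur - prev        # marginal contribution of feature j
            prev = cur
    return phi / n_perm

# toy usage with a linear "black box"
w = np.array([2.0, -1.0, 0.0])
model = lambda z: z @ w
x, baseline = np.array([1.0, 3.0, 5.0]), np.zeros(3)
phi = shapley_error_attribution(model, x, baseline, y_true=0.0)
```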
[LG-106] Data-driven discovery of mechanical models directly from MRI spectral data
链接: https://arxiv.org/abs/2411.06958
作者: D.G.J. Heesterbeek,M.H.C. van Riel,T. van Leeuwen,C.A.T. van den Berg,A. Sbrizzi
关键词-EN: Finding interpretable biomechanical, Finding interpretable, interpretable biomechanical models, physiology and disease, provide insight
类目: Medical Physics (physics.med-ph); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注: 11 pages regular paper with 8 figures, 9 pages supplementary material with 6 figures, 1 supplementary video
点击查看摘要
Abstract:Finding interpretable biomechanical models can provide insight into the functionality of organs with regard to physiology and disease. However, identifying broadly applicable dynamical models for in vivo tissue remains challenging. In this proof of concept study we propose a reconstruction framework for data-driven discovery of dynamical models from experimentally obtained undersampled MRI spectral data. The method makes use of the previously developed spectro-dynamic framework which allows for reconstruction of displacement fields at high spatial and temporal resolution required for model identification. The proposed framework combines this method with data-driven discovery of interpretable models using Sparse Identification of Non-linear Dynamics (SINDy). The design of the reconstruction algorithm is such that a symbiotic relation between the reconstruction of the displacement fields and the model identification is created. Our method does not rely on periodicity of the motion. It is successfully validated using spectral data of a dynamic phantom gathered on a clinical MRI scanner. The dynamic phantom is programmed to perform motion adhering to 5 different (non-linear) ordinary differential equations. The proposed framework performed better than a 2-step approach where the displacement fields were first reconstructed from the undersampled data without any information on the model, followed by data-driven discovery of the model using the reconstructed displacement fields. This study serves as a first step in the direction of data-driven discovery of in vivo models.
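The data-driven discovery step relies on SINDy, whose core loop (sequentially thresholded least squares over a candidate library) fits in a few lines. The sketch below recovers a linear oscillator from clean samples; the paper's coupling to spectro-dynamic MRI reconstruction is not reproduced, and the library and threshold are illustrative.

```python
import numpy as np

def poly_library(X):
    """Candidate terms 1, x, y, x^2, x*y, y^2 for 2-D states."""
    x, y = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x, y, x * x, x * y, y * y])

def stlsq(Theta, dX, thresh=0.1, n_iter=10):
    """Sequentially thresholded least squares: dX ~ Theta @ Xi, Xi sparse."""
    Xi = np.linalg.lstsq(Theta, dX, rcond=None)[0]
    for _ in range(n_iter):
        Xi[np.abs(Xi) < thresh] = 0.0            # drop small coefficients
        for k in range(dX.shape[1]):             # refit each equation on active terms
            big = np.abs(Xi[:, k]) >= thresh
            if big.any():
                Xi[big, k] = np.linalg.lstsq(Theta[:, big], dX[:, k], rcond=None)[0]
    return Xi

# toy data from the linear oscillator dx/dt = y, dy/dt = -x
t = np.linspace(0, 10, 1000)
X = np.column_stack([np.cos(t), -np.sin(t)])
dX = np.column_stack([-np.sin(t), -np.cos(t)])
Xi = stlsq(poly_library(X), dX)   # recovers only the x and y library columns
```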
[LG-107] Understanding Generalization in Quantum Machine Learning with Margins
链接: https://arxiv.org/abs/2411.06919
作者: Tak Hur,Daniel K. Park
关键词-EN: Understanding and improving, quantum machine learning, improving generalization capabilities, machine learning, capabilities is crucial
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 18 pages, 6 figures
点击查看摘要
Abstract:Understanding and improving generalization capabilities is crucial for both classical and quantum machine learning (QML). Recent studies have revealed shortcomings in current generalization theories, particularly those relying on uniform bounds, across both classical and quantum settings. In this work, we present a margin-based generalization bound for QML models, providing a more reliable framework for evaluating generalization. Our experimental studies on the quantum phase recognition (QPR) dataset demonstrate that margin-based metrics are strong predictors of generalization performance, outperforming traditional metrics like parameter count. By connecting this margin-based metric to quantum information theory, we demonstrate how to enhance the generalization performance of QML through a classical-quantum hybrid approach when applied to classical data.
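As a purely classical illustration of the margin quantity such bounds are built on, the snippet below computes per-sample multiclass margins (true-class score minus the best competing score); the fraction of samples with small margin is the kind of statistic a margin-based bound tracks instead of parameter count. The QML specifics and margin normalization are omitted.

```python
import numpy as np

def margins(scores, labels):
    """scores: (n, n_classes) model outputs; labels: (n,) integer classes."""
    true = scores[np.arange(len(labels)), labels]
    masked = scores.copy()
    masked[np.arange(len(labels)), labels] = -np.inf
    return true - masked.max(axis=1)     # positive margin = correctly classified

rng = np.random.default_rng(0)
scores = rng.normal(size=(1000, 3))      # stand-in for per-class model scores
labels = rng.integers(0, 3, size=1000)
m = margins(scores, labels)
print("fraction with margin below 0.5:", (m < 0.5).mean())
```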
[LG-108] Effect sizes as a statistical feature-selector-based learning to detect breast cancer
链接: https://arxiv.org/abs/2411.06868
作者: Nicolas Masino,Antonio Quintero-Rincon
关键词-EN: Breast cancer detection, open research field, tremendous effort devoted, Breast cancer, research field
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注: 16 pages, 10 figures, 5 tables,2024 IEEE Biennial Congress of Argentina (ARGENCON)
点击查看摘要
Abstract:Breast cancer detection is still an open research field, despite the tremendous effort devoted to this area. Effect size is a statistical concept that measures the strength of the relationship between two variables on a numeric scale. Feature selection is widely used to reduce the dimensionality of data by selecting only a subset of predictor variables to improve a learning model. In this work, an algorithm and experimental results demonstrate the feasibility of developing a statistical feature-selector-based learning tool capable of reducing the data dimensionality using parametric effect size measures computed from features extracted from cell nuclei images. The SVM classifier with a linear kernel as a learning tool achieved an accuracy of over 90%. These excellent results suggest that effect size measures meet the standards of established feature-selection methods.
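A hedged sketch of the described pipeline: rank features by a parametric effect size (Cohen's d here), keep those with large effects, and train a linear-kernel SVM. The scikit-learn breast-cancer dataset (features computed from cell nuclei images) is a convenient stand-in, and the 0.8 threshold is illustrative, not the paper's exact setup.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)   # features from cell nuclei images

def cohens_d(X, y):
    """Absolute standardized mean difference between the two classes."""
    a, b = X[y == 0], X[y == 1]
    pooled = np.sqrt((a.var(0, ddof=1) + b.var(0, ddof=1)) / 2)
    return np.abs(a.mean(0) - b.mean(0)) / pooled

d = cohens_d(X, y)
keep = d >= 0.8                               # keep only "large" effect sizes
Xtr, Xte, ytr, yte = train_test_split(X[:, keep], y, random_state=0)
clf = SVC(kernel="linear").fit(Xtr, ytr)
print(keep.sum(), "features kept, accuracy:", clf.score(Xte, yte))
```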
[LG-109] Optimized Quality of Service prediction in FSO Links over South Africa using Ensemble Learning
链接: https://arxiv.org/abs/2411.06832
作者: S.O. Adebusola,P.A. Owolawi,J.S. Ojo,P.S. Maswikaneng
关键词-EN: Fibre optic communication, Fibre optic, optic communication system, Gradient Boost Regression, expected to increase
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Signal Processing (eess.SP); Optics (physics.optics)
*备注:
点击查看摘要
Abstract:Fibre optic communication systems are expected to grow exponentially in application due to their numerous advantages over copper wires, such as long-distance transmission, low power requirements, higher carrying capacity and high bandwidth, among others. Such network bandwidth surpasses transmission methods that include copper cables and microwaves. Despite these benefits, free-space optical communications are severely impacted by harsh weather conditions like mist, precipitation, blizzard, fume, soil, and drizzle debris in the atmosphere, all of which affect the Quality of Service (QoS) rendered by the systems. The primary goal of this article is to optimize the QoS using the ensemble learning models Random Forest, AdaBoost Regression, Stacking Regression, Gradient Boost Regression, and Multilayer Neural Network. To accomplish this goal, meteorological data, visibility, wind speed, and altitude were obtained from the South Africa Weather Services archive over a ten-year period (2010 to 2019) at four locations: Polokwane, Kimberley, Bloemfontein, and George. We estimated the data rate, received power, fog-induced attenuation, bit error rate and power penalty from the collected and processed data. The RMSE and R-squared values of the model across the study locations, Polokwane, Kimberley, Bloemfontein, and George, are 0.0073 and 0.9951, 0.0065 and 0.9998, 0.0060 and 0.9941, and 0.0032 and 0.9906, respectively. The results show that ensemble learning techniques in transmission modeling can significantly enhance service quality and meet customer service level agreements, and that the ensemble methods were successful in efficiently optimizing the signal-to-noise ratio, which in turn enhanced the QoS at the point of reception.
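For reference, the model family compared in the study maps directly onto scikit-learn estimators. The sketch below evaluates them on synthetic placeholder features; the actual study trains on ten years of South African weather records with FSO link quantities (e.g., attenuation) as targets.

```python
import numpy as np
from sklearn.ensemble import (RandomForestRegressor, AdaBoostRegressor,
                              GradientBoostingRegressor, StackingRegressor)
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))       # stand-ins: visibility, wind speed, altitude
y = 2.0 * X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.normal(size=500)  # e.g. attenuation

models = {
    "random_forest": RandomForestRegressor(n_estimators=200, random_state=0),
    "adaboost": AdaBoostRegressor(random_state=0),
    "gradient_boost": GradientBoostingRegressor(random_state=0),
    "stacking": StackingRegressor(
        estimators=[("rf", RandomForestRegressor(random_state=0)),
                    ("gb", GradientBoostingRegressor(random_state=0))],
        final_estimator=Ridge()),
}
for name, model in models.items():
    r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name}: mean R^2 = {r2:.4f}")
```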
[LG-110] Predicting ionic conductivity in solids from the machine-learned potential energy landscape
链接: https://arxiv.org/abs/2411.06804
作者: Artem Maevskiy,Alexandra Carvalho,Emil Sataev,Volha Turchyna,Keian Noori,Aleksandr Rodin,A. H. Castro Neto,Andrey Ustyuzhanin
关键词-EN: advancing solid-state batteries, traditional lithium-ion batteries, offer improved energy, improved energy density, solid-state batteries
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Discovering new superionic materials is essential for advancing solid-state batteries, which offer improved energy density and safety compared to the traditional lithium-ion batteries with liquid electrolytes. Conventional computational methods for identifying such materials are resource-intensive and not easily scalable. Recently, universal interatomic potential models have been developed using equivariant graph neural networks. These models are trained on extensive datasets of first-principles force and energy calculations. One can achieve significant computational advantages by leveraging them as the foundation for traditional methods of assessing the ionic conductivity, such as molecular dynamics or nudged elastic band techniques. However, the generalization error from model inference on diverse atomic structures arising in such calculations can compromise the reliability of the results. In this work, we propose an approach for the quick and reliable evaluation of ionic conductivity through the analysis of a universal interatomic potential. Our method incorporates a set of heuristic structure descriptors that effectively employ the rich knowledge of the underlying model while requiring minimal generalization capabilities. Using our descriptors, we rank lithium-containing materials in the Materials Project database according to their expected ionic conductivity. Eight out of the ten highest-ranked materials are confirmed to be superionic at room temperature in first-principles calculations. Notably, our method achieves a speed-up factor of approximately 50 compared to molecular dynamics driven by a machine-learning potential, and is at least 3,000 times faster compared to first-principles molecular dynamics.
[LG-111] Methane projections from Canada’s oil sands tailings using scientific deep learning reveal significant underestimation
链接: https://arxiv.org/abs/2411.06741
作者: Esha Saha,Oscar Wang,Amit K. Chakraborty,Pablo Venegas Garcia,Russell Milne,Hao Wang
关键词-EN: Canada Athabasca Oil, Canada Athabasca, Athabasca Oil Sands, Bitumen extraction, synthetic crude oil
类目: Applications (stat.AP); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 19 pages, 8 figures, 2 tables
点击查看摘要
Abstract:Bitumen extraction for the production of synthetic crude oil in Canada’s Athabasca Oil Sands industry has recently come under the spotlight for being a significant source of greenhouse gas emissions. A major cause of concern is methane, a greenhouse gas produced by the anaerobic biodegradation of hydrocarbons in oil sands residues, or tailings, stored in settling basins commonly known as oil sands tailing ponds. In order to determine the methane-emitting potential of these tailing ponds and to project future methane emissions, we use real-time weather data, mechanistic models developed from laboratory-controlled experiments, and industrial reports to train a physics-constrained machine learning model. Our trained model can successfully identify the directions of active ponds and estimate their emission levels, which are generally hard to obtain due to data sampling restrictions. We found that each active oil sands tailing pond could emit between 950 to 1500 tonnes of methane per year, whose environmental impact is equivalent to carbon dioxide emissions from at least 6000 gasoline powered vehicles. Although abandoned ponds are often presumed to have insignificant emissions, our findings indicate that these ponds could become active over time and potentially emit up to 1000 tonnes of methane each year. Taking an average over all datasets that were used in model training, we estimate that emissions around major oil sands regions would need to be reduced by approximately 12% over a year to reduce the average methane concentrations to 2005 levels.
[LG-112] Truth, beauty, and goodness in grand unification: a machine learning approach
链接: https://arxiv.org/abs/2411.06718
作者: Shinsuke Kawai,Nobuchika Okada
关键词-EN: Grand Unified Theory, machine learning techniques, Grand Unified, Unified Theory, GUT Higgs field
类目: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG); High Energy Physics - Theory (hep-th)
*备注: 8 pages, 4 figures
点击查看摘要
Abstract:We investigate the flavour sector of the supersymmetric SU(5) Grand Unified Theory (GUT) model using machine learning techniques. The minimal SU(5) model is known to predict fermion masses that disagree with observed values in nature. There are two well-known approaches to address this issue: one involves introducing a 45-representation Higgs field, while the other employs a higher-dimensional operator involving the 24-representation GUT Higgs field. We compare these two approaches by numerically optimising a loss function, defined as the ratio of determinants of mass matrices. Our findings indicate that the 24-Higgs approach achieves the observed fermion masses with smaller modifications to the original minimal SU(5) model.
[LG-113] Quantum Policy Gradient in Reproducing Kernel Hilbert Space
链接: https://arxiv.org/abs/2411.06650
作者: David M. Bossens,Kishor Bharti,Jayne Thompson
关键词-EN: Parametrised quantum circuits, circuits offer expressive, Parametrised quantum, quantum circuits offer, quantum
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Parametrised quantum circuits offer expressive and data-efficient representations for machine learning. Due to quantum states residing in a high-dimensional complex Hilbert space, parametrised quantum circuits have a natural interpretation in terms of kernel methods. The representation of quantum circuits in terms of quantum kernels has been studied widely in quantum supervised learning, but has been overlooked in the context of quantum reinforcement learning. This paper proposes parametric and non-parametric policy gradient and actor-critic algorithms with quantum kernel policies in quantum environments. This approach, implemented with both numerical and analytical quantum policy gradient techniques, allows exploiting the many advantages of kernel methods, including available analytic forms for the gradient of the policy and tunable expressiveness. The proposed approach is suitable for vector-valued action spaces and each of the formulations demonstrates a quadratic reduction in query complexity compared to their classical counterparts. Two actor-critic algorithms, one based on stochastic policy gradient and one based on deterministic policy gradient (comparable to the popular DDPG algorithm), demonstrate additional query complexity reductions compared to quantum policy gradient algorithms under favourable conditions.
[LG-114] Few measurement shots challenge generalization in learning to classify entanglement
链接: https://arxiv.org/abs/2411.06600
作者: Leonardo Banchi,Jason Pereira,Marco Zamboni
关键词-EN: extract general laws, ability to extract, quantum setting, quantum, classical machine-learning methods
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG); Mathematical Physics (math-ph); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:The ability to extract general laws from a few known examples depends on the complexity of the problem and on the amount of training data. In the quantum setting, the learner’s generalization performance is further challenged by the destructive nature of quantum measurements that, together with the no-cloning theorem, limits the amount of information that can be extracted from each training sample. In this paper we focus on hybrid quantum learning techniques where classical machine-learning methods are paired with quantum algorithms and show that, in some settings, the uncertainty coming from a few measurement shots can be the dominant source of errors. We identify an instance of this possibly general issue by focusing on the classification of maximally entangled vs. separable states, showing that this toy problem becomes challenging for learners unaware of entanglement theory. Finally, we introduce an estimator based on classical shadows that performs better in the big data, few copy regime. Our results show that the naive application of classical machine-learning methods to the quantum setting is problematic, and that a better theoretical foundation of quantum learning is required.
[LG-115] Multi-Parameter Molecular MRI Quantification using Physics-Informed Self-Supervised Learning
链接: https://arxiv.org/abs/2411.06447
作者: Alex Finkelstein,Nikita Vladimirov,Moritz Zaiss,Or Perlman
关键词-EN: Biophysical model fitting, obtaining quantitative parameters, Biophysical model, model fitting plays, signals and images
类目: Medical Physics (physics.med-ph); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: This project was funded by the European Union (ERC, BabyMagnet, project no. 101115639), the Ministry of Innovation, Science and Technology, Israel, and a grant from the Tel Aviv University Center for AI and Data Science (TAD, The Blavatnik AI and Data Science Fund). None of above can be held responsible for views and opinions expressed, which are those of the authors alone
点击查看摘要
Abstract:Biophysical model fitting plays a key role in obtaining quantitative parameters from physiological signals and images. However, the model complexity for molecular magnetic resonance imaging (MRI) often translates into excessive computation time, which makes clinical use impractical. Here, we present a generic computational approach for solving the parameter extraction inverse problem posed by ordinary differential equation (ODE) modeling coupled with experimental measurement of the system dynamics. This is achieved by formulating a numerical ODE solver to function as a step-wise analytical one, thereby making it compatible with automatic differentiation-based optimization. This enables efficient gradient-based model fitting, and provides a new approach to parameter quantification based on self-supervised learning from a single data observation. The neural-network-based train-by-fit pipeline was used to quantify semisolid magnetization transfer (MT) and chemical exchange saturation transfer (CEST) amide proton exchange parameters in the human brain, in an in-vivo molecular MRI study (n=4). The entire pipeline of the first whole brain quantification was completed in 18.3 ± 8.3 minutes, which is an order-of-magnitude faster than comparable alternatives. Reusing the single-subject-trained network for inference in new subjects took 1.0 ± 0.2 s, to provide results in agreement with literature values and scan-specific fit results (Pearson’s r > 0.98, p < 0.0001).
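The core trick, running a numerical ODE solver with differentiable operations so that parameters can be fitted by gradient descent through the solver, can be sketched in a few lines of PyTorch. A toy two-pool exchange model stands in for the paper's MT/CEST dynamics; all names and values are illustrative.

```python
import torch

def rk4_step(f, y, t, dt, k):
    """One classical Runge-Kutta step built from differentiable torch ops."""
    k1 = f(y, t, k)
    k2 = f(y + 0.5 * dt * k1, t + 0.5 * dt, k)
    k3 = f(y + 0.5 * dt * k2, t + 0.5 * dt, k)
    k4 = f(y + dt * k3, t + dt, k)
    return y + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

def two_pool(y, t, k):   # dy0/dt = -k*y0 + k*y1, dy1/dt = k*y0 - k*y1
    return torch.stack([-k * y[0] + k * y[1], k * y[0] - k * y[1]])

def simulate(k, n=50, dt=0.1):
    y = torch.tensor([1.0, 0.0])
    out = [y]
    for i in range(n):
        y = rk4_step(two_pool, y, i * dt, dt, k)
        out.append(y)
    return torch.stack(out)

target = simulate(torch.tensor(0.7)).detach()   # "measured" system dynamics
k = torch.tensor(0.2, requires_grad=True)       # exchange-rate parameter to fit
opt = torch.optim.Adam([k], lr=0.05)
for _ in range(200):                            # gradient descent through the solver
    opt.zero_grad()
    loss = ((simulate(k) - target) ** 2).mean()
    loss.backward()
    opt.step()
print(k.item())   # converges toward 0.7
```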
[LG-116] Optimal Execution with Reinforcement Learning
链接: https://arxiv.org/abs/2411.06389
作者: Yadh Hafsi,Edoardo Vittori
关键词-EN: limited time frame, optimal execution strategy, limit order book, aiming to determine, time frame
类目: Trading and Market Microstructure (q-fin.TR); Machine Learning (cs.LG)
*备注: 9 pages
点击查看摘要
Abstract:This study investigates the development of an optimal execution strategy through reinforcement learning, aiming to determine the most effective approach for traders to buy and sell inventory within a limited time frame. Our proposed model leverages input features derived from the current state of the limit order book. To simulate this environment and overcome the limitations associated with relying on historical data, we utilize the multi-agent market simulator ABIDES, which provides a diverse range of depth levels within the limit order book. We present a custom MDP formulation followed by the results of our methodology and benchmark the performance against standard execution strategies. Our findings suggest that the reinforcement learning-based approach demonstrates significant potential.
[LG-117] A Learned Proximal Alternating Minimization Algorithm and Its Induced Network for a Class of Two-block Nonconvex and Nonsmooth Optimization
链接: https://arxiv.org/abs/2411.06333
作者: Yunmei Chen,Lezhi Liu,Lei Zhang
关键词-EN: general learned proximal, learned proximal alternating, nonconvex optimization problems, work proposes, proposes a general
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:This work proposes a general learned proximal alternating minimization algorithm, LPAM, for solving learnable two-block nonsmooth and nonconvex optimization problems. We tackle the nonsmoothness by an appropriate smoothing technique with automatic diminishing smoothing effect. For smoothed nonconvex problems we modify the proximal alternating linearized minimization (PALM) scheme by incorporating the residual learning architecture, which has proven to be highly effective in deep network training, and employing the block coordinate descent (BCD) iterates as a safeguard for the convergence of the algorithm. We prove that there is a subsequence of the iterates generated by LPAM which has at least one accumulation point, and that each accumulation point is a Clarke stationary point. Our method is widely applicable, as one can employ various learning problems formulated as two-block optimizations, and it is also easily extended to solving multi-block nonsmooth and nonconvex optimization problems. The network, whose architecture follows LPAM exactly, namely LPAM-net, inherits the convergence properties of the algorithm, making the network interpretable. As an example application of LPAM-net, we present numerical and theoretical results on its application to joint multi-modal MRI reconstruction with significantly under-sampled k-space data. The experimental results indicate that the proposed LPAM-net is parameter-efficient and has favourable performance in comparison with some state-of-the-art methods.
[LG-118] Amortized Bayesian Local Interpolation NetworK: Fast covariance parameter estimation for Gaussian Processes
链接: https://arxiv.org/abs/2411.06324
作者: Brandon R. Feng,Reetam Majumder,Brian J. Reich,Mohamed A. Abba
关键词-EN: process called Kriging, Gaussian processes, unseen spatial locations, flexibility and interpretability, Amortized Bayesian Local
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:
点击查看摘要
Abstract:Gaussian processes (GPs) are a ubiquitous tool for geostatistical modeling with high levels of flexibility and interpretability, and the ability to make predictions at unseen spatial locations through a process called Kriging. Estimation of Kriging weights relies on the inversion of the process’ covariance matrix, creating a computational bottleneck for large spatial datasets. In this paper, we propose an Amortized Bayesian Local Interpolation NetworK (A-BLINK) for fast covariance parameter estimation, which uses two pre-trained deep neural networks to learn a mapping from spatial location coordinates and covariance function parameters to Kriging weights and the spatial variance, respectively. The fast prediction time of these networks allows us to bypass the matrix inversion step, creating large computational speedups over competing methods in both frequentist and Bayesian settings, and also provides full posterior inference and predictions using Markov chain Monte Carlo sampling methods. We show significant increases in computational efficiency over comparable scalable GP methodology in an extensive simulation study with lower parameter estimation error. The efficacy of our approach is also demonstrated using a temperature dataset of US climate normals for 1991–2020 based on over 7,000 weather stations.
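For context, the classical computation that A-BLINK learns to bypass is shown below: Kriging weights come from solving a linear system with the n-by-n covariance matrix, the O(n^3) step that dominates for large spatial datasets. The exponential covariance, simple (zero-mean) Kriging form, and all parameter values are illustrative.

```python
import numpy as np

def exp_cov(coords, range_=0.5, var=1.0, nugget=1e-6):
    """Exponential covariance matrix with a small nugget for conditioning."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return var * np.exp(-d / range_) + nugget * np.eye(len(coords))

def kriging_weights(coords, new_coord, range_=0.5, var=1.0):
    K = exp_cov(coords, range_, var)                       # n x n covariance
    k = var * np.exp(-np.linalg.norm(coords - new_coord, axis=1) / range_)
    return np.linalg.solve(K, k)                           # the O(n^3) bottleneck

rng = np.random.default_rng(0)
coords = rng.uniform(size=(300, 2))        # observed station locations
z = np.sin(4 * coords[:, 0]) + rng.normal(scale=0.1, size=300)
w = kriging_weights(coords, np.array([0.5, 0.5]))
pred = w @ z                               # simple Kriging prediction
```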
[LG-119] Deep Nonparametric Conditional Independence Tests for Images
链接: https://arxiv.org/abs/2411.06140
作者: Marco Simnacher,Xiangnan Xu,Hani Park,Christoph Lippert,Sonja Greven
关键词-EN: nonparametric CITs, dependence between random, embedding maps, Conditional independence tests, DNCITs
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Statistics Theory (math.ST); Methodology (stat.ME)
*备注: 50 pages, 13 figures
点击查看摘要
Abstract:Conditional independence tests (CITs) test for conditional dependence between random variables. As existing CITs are limited in their applicability to complex, high-dimensional variables such as images, we introduce deep nonparametric CITs (DNCITs). The DNCITs combine embedding maps, which extract feature representations of high-dimensional variables, with nonparametric CITs applicable to these feature representations. For the embedding maps, we derive general properties on their parameter estimators to obtain valid DNCITs and show that these properties include embedding maps learned through (conditional) unsupervised or transfer learning. For the nonparametric CITs, appropriate tests are selected and adapted to be applicable to feature representations. Through simulations, we investigate the performance of the DNCITs for different embedding maps and nonparametric CITs under varying confounder dimensions and confounder relationships. We apply the DNCITs to brain MRI scans and behavioral traits, given confounders, of healthy individuals from the UK Biobank (UKB), confirming null results from a number of ambiguous personality neuroscience studies with a larger data set and with our more powerful tests. In addition, in a confounder control study, we apply the DNCITs to brain MRI scans and a confounder set to test for sufficient confounder control, leading to a potential reduction in the confounder dimension under improved confounder control compared to existing state-of-the-art confounder control studies for the UKB. Finally, we provide an R package implementing the DNCITs.
[LG-120] Exploring Structural Nonlinearity in Binary Polariton-Based Neuromorphic Architectures
链接: https://arxiv.org/abs/2411.06124
作者: Evgeny Sedov,Alexey Kavokin
关键词-EN: leveraging polariton dyads, interfering polariton condensates, optically excited pairs, binary logic gate, network leveraging polariton
类目: Disordered Systems and Neural Networks (cond-mat.dis-nn); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Applied Physics (physics.app-ph)
*备注:
点击查看摘要
Abstract:This study investigates the performance of a binarized neuromorphic network leveraging polariton dyads, optically excited pairs of interfering polariton condensates within a microcavity to function as binary logic gate neurons. Employing numerical simulations, we explore various neuron configurations, both linear (NAND, NOR) and nonlinear (XNOR), to assess their effectiveness in image classification tasks. We demonstrate that structural nonlinearity, derived from the network’s layout, plays a crucial role in facilitating complex computational tasks, effectively reducing the reliance on the inherent nonlinearity of individual neurons. Our findings suggest that the network’s configuration and the interaction among its elements can emulate the benefits of nonlinearity, thus potentially simplifying the design and manufacturing of neuromorphic systems and enhancing their scalability. This shift in focus from individual neuron properties to network architecture could lead to significant advancements in the efficiency and applicability of neuromorphic computing.
[LG-121] BreakGPT: Leveraging Large Language Models for Predicting Asset Price Surges
链接: https://arxiv.org/abs/2411.06076
作者: Aleksandr Simonyan
关键词-EN: architecture adapted specifically, sharp upward movements, large language model, paper introduces BreakGPT, architecture adapted
类目: Statistical Finance (q-fin.ST); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:This paper introduces BreakGPT, a novel large language model (LLM) architecture adapted specifically for time series forecasting and the prediction of sharp upward movements in asset prices. By leveraging both the capabilities of LLMs and Transformer-based models, this study evaluates BreakGPT and other Transformer-based models for their ability to address the unique challenges posed by highly volatile financial markets. The primary contribution of this work lies in demonstrating the effectiveness of combining time series representation learning with LLM prediction frameworks. We showcase BreakGPT as a promising solution for financial forecasting with minimal training and as a strong competitor for capturing both local and global temporal dependencies.
[LG-122] Filling in Missing FX Implied Volatilities with Uncertainties: Improving VAE-Based Volatility Imputation
链接: https://arxiv.org/abs/2411.05998
作者: Achintya Gopal
关键词-EN: requires methods, methods to fill, missing implied volatilities, common problem, missing implied
类目: Statistical Finance (q-fin.ST); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 35 pages, 22 figures, 10 tables
点击查看摘要
Abstract:Missing data is a common problem in finance and often requires methods to fill in the gaps, or in other words, imputation. In this work, we focused on the imputation of missing implied volatilities for FX options. Prior work has used variational autoencoders (VAEs), a neural network-based approach, to solve this problem; however, using stronger classical baselines such as Heston with jumps can significantly outperform their results. We show that simple modifications to the architecture of the VAE lead to significant imputation performance improvements (e.g., in low missingness regimes, nearly cutting the error by half), removing the necessity of using β-VAEs. Further, we modify the VAE imputation algorithm in order to better handle the uncertainty in data, as well as to obtain accurate uncertainty estimates around imputed values.
[LG-123] Energy Efficient Protein Language Models: Leveraging Small Language Models with LoRA for Controllable Protein Generation
链接: https://arxiv.org/abs/2411.05966
作者: Aayush Shah,Shankar Jayaratnam
关键词-EN: Large language models, demonstrated significant success, shown promising results, natural language processing, Large language
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Large language models (LLMs) have demonstrated significant success in natural language processing (NLP) tasks and have shown promising results in other domains such as protein sequence generation. However, there remain salient differences between LLMs used for NLP, which effectively handle multiple tasks and are available in small sizes, and protein language models that are often specialized for specific tasks and only exist in larger sizes. In this work, we introduce two small protein language models, based on Llama-3-8B and Phi-3-mini, that are capable of both uncontrollable and controllable protein generation. For the uncontrollable generation task, our best model achieves an average pLDDT score of 69.75, demonstrating robust performance in generating viable protein structures. For the controllable generation task, in which the model generates proteins according to properties specified in the prompt, we achieve a remarkable average TM-Score of 0.84, indicating high structural similarity to target proteins. We chose 10 properties, including six classes of enzymes, to extend the capabilities of prior protein language models. Our approach utilizes the Low-Rank Adaptation (LoRA) technique, reducing trainable parameters to just 4% of the original model size, lowering computational requirements. By using a subset of the UniRef50 dataset and small models, we reduced the overall training time by 70% without compromising performance. Notably, Phi-3-mini reduced trainable parameters by 60%, decreasing training cost by 30% compared to Llama 3. Consequently, Phi-3 achieved a comparable TM-Score of 0.81, demonstrating that smaller models can match the performance of larger ones, like Llama 3. We also demonstrate the deployment of our models on the energy-efficient ET-SoC-1 chip, significantly improving the TPS/W by a factor of 3.
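The LoRA mechanism referenced above is compact enough to sketch generically: a frozen linear layer plus a trainable low-rank update B·A with a scaling factor. This is an illustration of the technique, not the authors' code, and the dimensions and rank are placeholders.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        for p in self.base.parameters():   # freeze the pretrained weights
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, r))  # zero init: no-op at start
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

layer = LoRALinear(4096, 4096, r=8)
n_train = sum(p.numel() for p in layer.parameters() if p.requires_grad)
n_total = sum(p.numel() for p in layer.parameters())
print(f"trainable fraction: {n_train / n_total:.2%}")  # well under 1% for this layer
```

Initializing B at zero makes the adapted layer exactly reproduce the frozen model at the start of fine-tuning, so training perturbs rather than replaces the pretrained behavior.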
[LG-124] A method based on Generative Adversarial Networks for disentangling physical and chemical properties of stars in astronomical spectra
链接: https://arxiv.org/abs/2411.05960
作者: Raúl Santoveña,Carlos Dafonte,Minia Manteiga
关键词-EN: compression techniques focused, Data compression techniques, compression techniques, techniques focused, modern era
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Solar and Stellar Astrophysics (astro-ph.SR); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Data compression techniques focused on information preservation have become essential in the modern era of big data. In this work, an encoder-decoder architecture has been designed, in which adversarial training, a modification of the traditional autoencoder, is used in the context of astrophysical spectral analysis. The goal of this proposal is to obtain an intermediate representation of the astronomical stellar spectra, in which the contribution to the flux of a star due to the most influential physical properties (its surface temperature and gravity) disappears and the variance reflects only the effect of the chemical composition over the spectrum. A deep learning scheme is used with the aim of disentangling, in the latent space, the desired parameters from the rest of the information contained in the data. This work proposes a version of adversarial training that makes use of one discriminator per parameter to be disentangled, thus avoiding the exponential combination of discretized values that arises when a single discriminator is used. To test the effectiveness of the method, synthetic astronomical data from the APOGEE and Gaia surveys are used. In conjunction with the work presented, we also provide a disentangling framework (GANDALF) available to the community, which allows the replication, visualization, and extension of the method to domains of any nature.
[LG-125] Tackling extreme urban heat: a machine learning approach to assess the impacts of climate change and the efficacy of climate adaptation strategies in urban microclimates
链接: https://arxiv.org/abs/2411.05952
作者: Grant Buster,Jordan Cox,Brandon N. Benton,Ryan N. King
关键词-EN: climate adaptation efforts, climate change progress, adaptation efforts, Los Angeles urban, urban heat
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:As urbanization and climate change progress, urban heat becomes a priority for climate adaptation efforts. High temperatures concentrated in urban heat islands can drive increased risk of heat-related death and illness as well as increased energy demand for cooling. However, estimating the effects of urban heat is an ongoing field of research typically burdened by an imprecise description of the built environment, significant computational cost, and a lack of high-resolution estimates of the impacts of climate change. Here, we present open-source, computationally efficient machine learning methods that can improve the accuracy of urban temperature estimates when compared to historical reanalysis data. These models are applied to residential buildings in Los Angeles, and we compare the energy benefits of heat mitigation strategies to the impacts of climate change. We find that cooling demand is likely to increase substantially through midcentury, but engineered high-albedo surfaces could lessen this increase by more than 50%. The corresponding increase in heating demand complicates this narrative, but total annual energy use from combined heating and cooling with electric heat pumps in the Los Angeles urban climate is shown to benefit from the engineered cooling strategies under both current and future climates.
[LG-126] Compactly-supported nonstationary kernels for computing exact Gaussian processes on big data
链接: https://arxiv.org/abs/2411.05869
作者: Mark D. Risser,Marcus M. Noack,Hengrui Luo,Ronald Pandolfi
关键词-EN: stochastic function approximation, probabilistic machine learning, machine learning methods, machine learning, analyzing real-world measurements
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP); Computation (stat.CO); Methodology (stat.ME)
*备注:
点击查看摘要
Abstract:The Gaussian process (GP) is a widely used probabilistic machine learning method for stochastic function approximation, stochastic modeling, and analyzing real-world measurements of nonlinear processes. Unlike many other machine learning methods, GPs include an implicit characterization of uncertainty, making them extremely useful across many areas of science, technology, and engineering. Traditional implementations of GPs involve stationary kernels (also termed covariance functions) that limit their flexibility and exact methods for inference that prevent application to data sets with more than about ten thousand points. Modern approaches to address stationarity assumptions generally fail to accommodate large data sets, while all attempts to address scalability focus on approximating the Gaussian likelihood, which can involve subjectivity and lead to inaccuracies. In this work, we explicitly derive an alternative kernel that can discover and encode both sparsity and nonstationarity. We embed the kernel within a fully Bayesian GP model and leverage high-performance computing resources to enable the analysis of massive data sets. We demonstrate the favorable performance of our novel kernel relative to existing exact and approximate GP methods across a variety of synthetic data examples. Furthermore, we conduct space-time prediction based on more than one million measurements of daily maximum temperature and verify that our results outperform state-of-the-art methods in the Earth sciences. More broadly, having access to exact GPs that use ultra-scalable, sparsity-discovering, nonstationary kernels allows GP methods to truly compete with a wide variety of machine learning methods.
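One ingredient the paper combines, compact support, can be illustrated with the standard Wendland C^2 kernel: covariances vanish beyond a cutoff radius, so the covariance matrix is sparse by construction. The paper's derived kernel additionally encodes nonstationarity, which this sketch does not attempt; the support radius and sizes are illustrative.

```python
import numpy as np
from scipy import sparse

def wendland_c2(coords, support=0.2):
    """Wendland C^2 kernel (valid in up to 3 dimensions): (1-r)^4_+ (4r+1)."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    r = d / support
    k = np.where(r < 1.0, (1 - r) ** 4 * (4 * r + 1), 0.0)  # exactly zero beyond support
    return sparse.csr_matrix(k)  # dense intermediate kept for clarity only

rng = np.random.default_rng(0)
coords = rng.uniform(size=(2000, 2))
K = wendland_c2(coords)
print(f"nonzero covariance entries: {K.nnz / 2000**2:.2%}")  # sparse by construction
```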
[LG-127] Provably Faster Algorithms for Bilevel Optimization via Without-Replacement Sampling
链接: https://arxiv.org/abs/2411.05868
作者: Junyi Li,Heng Huang
关键词-EN: experienced significant advancements, significant advancements recently, Bilevel Optimization, experienced significant, significant advancements
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Bilevel Optimization has experienced significant advancements recently with the introduction of new efficient algorithms. Mirroring the success in single-level optimization, stochastic gradient-based algorithms are widely used in bilevel optimization. However, a common limitation in these algorithms is the presumption of independent sampling, which can lead to increased computational costs due to the complicated hyper-gradient formulation of bilevel problems. To address this challenge, we study the example-selection strategy for bilevel optimization in this work. More specifically, we introduce a without-replacement sampling based algorithm which achieves a faster convergence rate compared to its counterparts that rely on independent sampling. Beyond the standard bilevel optimization formulation, we extend our discussion to conditional bilevel optimization and also two special cases: minimax and compositional optimization. Finally, we validate our algorithms over both synthetic and real-world applications. Numerical results clearly showcase the superiority of our algorithms.
[LG-128] A Fundamental Accuracy–Robustness Trade-off in Regression and Classification
链接: https://arxiv.org/abs/2411.05853
作者: Sohail Bahmani
关键词-EN: simple intuition, optimal predictor, predictor is smooth, cost of accuracy, derive a fundamental
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We derive a fundamental trade-off between standard and adversarial risk in a rather general situation that formalizes the following simple intuition: “If no (nearly) optimal predictor is smooth, adversarial robustness comes at the cost of accuracy.” As a concrete example, we evaluate the derived trade-off in regression with polynomial ridge functions under mild regularity conditions.
[LG-129] Are Deep Learning Methods Suitable for Downscaling Global Climate Projections? Review and Intercomparison of Existing Models
链接: https://arxiv.org/abs/2411.05850
作者: Jose González-Abad,José Manuel Gutiérrez
关键词-EN: including Perfect Prognosis, global climate change, climate change projections, Deep Learning, downscaling global climate
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Under review for Earth’s Future
点击查看摘要
Abstract:Deep Learning (DL) has shown promise for downscaling global climate change projections under different approaches, including Perfect Prognosis (PP) and Regional Climate Model (RCM) emulation. Unlike emulators, PP downscaling models are trained on observational data, so it remains an open question whether they can plausibly extrapolate unseen conditions and changes in future emissions scenarios. Here we focus on this problem as the main drawback for the operationalization of these methods and present the results of 1) a literature review to identify state-of-the-art DL models for PP downscaling and 2) an intercomparison experiment to evaluate the performance of these models and to assess their extrapolation capability using a common experimental framework, taking into account the sensitivity of results to different training replicas. We focus on minimum and maximum temperatures and precipitation over Spain, a region with a range of climatic conditions with different influential regional processes. We conclude with a discussion of the findings, limitations of existing methods, and prospects for future development.
[LG-130] Assessing and Enhancing Graph Neural Networks for Combinatorial Optimization: Novel Approaches and Application in Maximum Independent Set Problems
链接: https://arxiv.org/abs/2411.05834
作者: Chenchuhui Hu
关键词-EN: computation time grows, time grows exponentially, QUBO unsupervised approach, Combinatorial optimization, Graph Neural Networks
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Combinatorial optimization (CO) problems are challenging as the computation time grows exponentially with the input. Graph Neural Networks (GNNs) show promise for solving CO problems. This study investigates the effectiveness of GNNs in solving the maximum independent set (MIS) problem, inspired by the intriguing findings of Schuetz et al., and aims to enhance this solver. Despite the promise shown by GNNs, some researchers observed discrepancies when reproducing the findings, particularly when compared to, for instance, the greedy algorithm. We reproduced Schuetz et al.’s Quadratic Unconstrained Binary Optimization (QUBO) unsupervised approach and explored the possibility of combining it with a supervised learning approach for solving MIS problems. While the QUBO unsupervised approach did not guarantee maximal or optimal solutions, it provided a solid first guess for post-processing techniques like greedy decoding or tree-based methods. Moreover, our findings indicated that the supervised approach could further refine the QUBO unsupervised solver, as the learned model assigned meaningful probabilities to each node as initial node features, which could then be improved with the QUBO unsupervised approach. Thus, GNNs offer a valuable method for solving CO problems by integrating learned graph structures rather than relying solely on traditional heuristic functions. This research highlights the potential of GNNs to boost solver performance by leveraging ground truth during training and using optimization functions to learn structural graph information, marking a pioneering step towards improving prediction accuracy in a non-autoregressive manner.
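The QUBO relaxation for MIS that the study reproduces is easy to sketch: relax binary node variables to probabilities, minimize -Σ p_i plus a penalty on edges, then greedily decode a feasible independent set. The GNN that would produce the probabilities from graph structure is replaced here by free per-node logits; the penalty weight and learning rate are illustrative.

```python
import numpy as np

def qubo_mis(edges, n, penalty=2.0, lr=0.05, steps=500, seed=0):
    """Gradient descent on the relaxed QUBO objective, then greedy decoding."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(scale=0.1, size=n)         # logits for node probabilities
    E = np.array(edges)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-theta))          # sigmoid relaxation of binaries
        grad_p = -np.ones(n)                      # gradient of -sum(p)
        np.add.at(grad_p, E[:, 0], penalty * p[E[:, 1]])  # edge penalty terms
        np.add.at(grad_p, E[:, 1], penalty * p[E[:, 0]])
        theta -= lr * grad_p * p * (1 - p)        # chain rule through the sigmoid
    # greedy decoding: take nodes by descending probability if still feasible
    adj = {i: set() for i in range(n)}
    for i, j in edges:
        adj[i].add(j); adj[j].add(i)
    chosen, blocked = [], set()
    for i in np.argsort(-p):
        if i not in blocked:
            chosen.append(int(i)); blocked |= adj[int(i)]
    return chosen

edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]   # a square with one diagonal
print(qubo_mis(edges, n=4))                        # recovers {1, 3}
```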
[LG-131] Demo: Multi-Modal Seizure Prediction System
链接: https://arxiv.org/abs/2411.05817
作者: Ali Saeizadeh,Pietro Brach del Prever,Douglas Schonholtz,Raffaele Guida,Emrecan Demirors,Jorge M. Jimenez,Pedram Johari,Tommaso Melodia
关键词-EN: utilizing Deep Learning, Deep Learning, utilizing Deep, predicting epileptic seizures, epileptic seizures benefiting
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 1 page, 1 figure, Proceedings of the IEEE 20th International Conference on Body Sensor Networks (BSN), October 2024
点击查看摘要
Abstract:This demo presents SeizNet, an innovative system for predicting epileptic seizures that benefits from a multi-modal sensor network and utilizes Deep Learning (DL) techniques. Epilepsy affects approximately 65 million people worldwide, many of whom experience drug-resistant seizures. SeizNet aims at providing highly accurate alerts, allowing individuals to take preventive measures without being disturbed by false alarms. SeizNet uses a combination of data collected through either invasive (intracranial electroencephalogram (iEEG)) or non-invasive (electroencephalogram (EEG) and electrocardiogram (ECG)) sensors, processed by advanced DL algorithms that are optimized for real-time inference at the edge, ensuring privacy and minimizing data transmission. SeizNet achieves 97% accuracy in seizure prediction while meeting the size and energy constraints of an implantable device.
[LG-132] Graph Neural Networks for Financial Fraud Detection: A Review
链接: https://arxiv.org/abs/2411.05815
作者: Dawei Cheng,Yao Zou,Sheng Xiang,Changjun Jiang
关键词-EN: financial fraud detection, fraud detection, financial fraud, increasingly complex due, grown increasingly complex
类目: Statistical Finance (q-fin.ST); Machine Learning (cs.LG)
*备注: 17 Pages, 2 Figures
点击查看摘要
Abstract:The landscape of financial transactions has grown increasingly complex due to the expansion of global economic integration and advancements in information technology. This complexity poses greater challenges in detecting and managing financial fraud. This review explores the role of Graph Neural Networks (GNNs) in addressing these challenges by proposing a unified framework that categorizes existing GNN methodologies applied to financial fraud detection. Specifically, by examining a series of detailed research questions, this review delves into the suitability of GNNs for financial fraud detection, their deployment in real-world scenarios, and the design considerations that enhance their effectiveness. This review reveals that GNNs are exceptionally adept at capturing complex relational patterns and dynamics within financial networks, significantly outperforming traditional fraud detection methods. Unlike previous surveys that often overlook the specific potentials of GNNs or address them only superficially, our review provides a comprehensive, structured analysis, distinctly focusing on the multifaceted applications and deployments of GNNs in financial fraud detection. This review not only highlights the potential of GNNs to improve fraud detection mechanisms but also identifies current gaps and outlines future research directions to enhance their deployment in financial systems. Through a structured review of over 100 studies, this review paper contributes to the understanding of GNN applications in financial fraud detection, offering insights into their adaptability and potential integration strategies.
[LG-133] Forecasting Company Fundamentals
链接: https://arxiv.org/abs/2411.05791
作者: Felix Divo,Eric Endress,Kevin Endler,Kristian Kersting,Devendra Singh Dhami
关键词-EN: assessing companies’ financial, success and stability, key to assessing, assessing companies’, companies’ financial
类目: Statistical Finance (q-fin.ST); Machine Learning (cs.LG); General Economics (econ.GN); Applications (stat.AP)
*备注: 24 pages, 9 figures, under review
点击查看摘要
Abstract:Company fundamentals are key to assessing companies’ financial and overall success and stability. Forecasting them is important in multiple fields, including investing and econometrics. While statistical and contemporary machine learning methods have been applied to many time series tasks, there is a lack of comparison of these approaches on this particularly challenging data regime. To this end, we try to bridge this gap and thoroughly evaluate the theoretical properties and practical performance of 22 deterministic and probabilistic company fundamentals forecasting models on real company data. We observe that deep learning models provide superior forecasting performance to classical models, in particular when considering uncertainty estimation. To validate the findings, we compare them to human analyst expectations and find that their accuracy is comparable to the automatic forecasts. We further show how these high-quality forecasts can benefit automated stock allocation. We close by presenting possible ways of integrating domain experts to further improve performance and increase reliability.
[LG-134] Comparative Analysis of LSTM GRU and Transformer Models for Stock Price Prediction
链接: https://arxiv.org/abs/2411.05790
作者: Jue Xiao,Tingting Deng,Shuochen Bi
关键词-EN: recent fast-paced financial, investors constantly seek, fast-paced financial markets, recent fast-paced, fast-paced financial
类目: Statistical Finance (q-fin.ST); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:In recent fast-paced financial markets, investors constantly seek ways to gain an edge and make informed decisions. Although achieving perfect accuracy in stock price predictions remains elusive, artificial intelligence (AI) advancements have significantly enhanced our ability to analyze historical data and identify potential trends. This paper takes AI-driven stock price trend prediction as its core research topic, builds a model training dataset from the well-known Tesla stock covering 2015 to 2024, and compares LSTM, GRU, and Transformer models. Among these, the LSTM model proves most consistent for stock trend prediction, and the experimental results show that its accuracy is 94%. These methods ultimately allow investors to make more informed decisions and gain a clearer insight into market behaviors.
[LG-135] News-Driven Stock Price Forecasting in Indian Markets: A Comparative Study of Advanced Deep Learning Models
链接: https://arxiv.org/abs/2411.05788
作者: Kaushal Attaluri,Mukesh Tripathi,Srinithi Reddy,Shivendra
关键词-EN: challenge for traders, Forecasting stock market, complex challenge, engineers due, multitude of factors
类目: Statistical Finance (q-fin.ST); Machine Learning (cs.LG)
*备注: 7 pages, 9 figures, 1 table
点击查看摘要
Abstract:Forecasting stock market prices remains a complex challenge for traders, analysts, and engineers due to the multitude of factors that influence price movements. Recent advancements in artificial intelligence (AI) and natural language processing (NLP) have significantly enhanced stock price prediction capabilities. AI’s ability to process vast and intricate data sets has led to more sophisticated forecasts. However, achieving consistently high accuracy in stock price forecasting remains elusive. In this paper, we leverage 30 years of historical data from national banks in India, sourced from the National Stock Exchange, to forecast stock prices. Our approach utilizes state-of-the-art deep learning models, including multivariate multi-step Long Short-Term Memory (LSTM), Facebook Prophet with LightGBM optimized through Optuna, and Seasonal Auto-Regressive Integrated Moving Average (SARIMA). We further integrate sentiment analysis from tweets and reliable financial sources such as Business Standard and Reuters, acknowledging their crucial influence on stock price fluctuations.
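摘要提到的 SARIMA 属于经典统计基线,下面用 statsmodels 给出其拟合与多步预测的极简示意;季节阶数等均为示例取值,并非论文设定:

```python
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

# SARIMAX 拟合与预测示意;(p,d,q)(P,D,Q,s) 的阶数为演示取值。
y = np.cumsum(np.random.default_rng(0).normal(size=300)) + 100  # 模拟日收盘价
model = SARIMAX(y, order=(1, 1, 1), seasonal_order=(1, 0, 1, 5))
fit = model.fit(disp=False)
forecast = fit.forecast(steps=10)          # 向前预测 10 步
print(forecast[:3])
```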
信息检索
[IR-0] Invar-RAG: Invariant LLM-aligned Retrieval for Better Generation
链接: https://arxiv.org/abs/2411.07021
作者: Ziwei Liu,Liang Zhang,Qian Li,Jianghua Wu,Guangxu Zhu
关键词-EN: providing reliable answer, reliable answer predictions, shown impressive capability, addressing hallucination problems, Retrieval-augmented generation
类目: Information Retrieval (cs.IR)
*备注:
点击查看摘要
Abstract:Retrieval-augmented generation (RAG) has shown impressive capability in providing reliable answer predictions and addressing hallucination problems. A typical RAG implementation uses powerful retrieval models to extract external information and large language models (LLMs) to generate answers. In contrast, recent LLM-based retrieval has gained attention for its substantial improvements in information retrieval (IR) due to the LLMs’ semantic understanding capability. However, directly applying LLM to RAG systems presents challenges. This may cause feature locality problems as massive parametric knowledge can hinder effective usage of global information across the corpus; for example, an LLM-based retriever often inputs document summaries instead of full documents. Moreover, various pre-trained tasks in LLMs introduce variance, further weakening performance as a retriever. To address these issues, we propose a novel two-stage fine-tuning architecture called Invar-RAG. In the retrieval stage, an LLM-based retriever is constructed by integrating LoRA-based representation learning to tackle feature locality issues. To enhance retrieval performance, we develop two patterns (invariant and variant patterns) and an invariance loss to reduce LLM variance. In the generation stage, a refined fine-tuning method is employed to improve LLM accuracy in generating answers based on retrieved information. Experimental results show that Invar-RAG significantly outperforms existing baselines across three open-domain question answering (ODQA) datasets. Code is available in the Supplementary Material for reproducibility.
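论文摘要未给出不变性损失的具体形式,下面是按其思路(同一查询在 invariant/variant 两种模式下的表示应保持一致)写的一个极简示意,损失定义与权重 lam 均为本文假设:

```python
import torch
import torch.nn.functional as F

# 不变性损失示意:拉近两种模式下归一化表示之间的距离,以降低 LLM 方差。
def invariance_loss(z_a, z_b):
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    return ((z_a - z_b) ** 2).sum(dim=-1).mean()

z_inv = torch.randn(8, 768)                # invariant 模式下的查询表示(演示数据)
z_var = torch.randn(8, 768)                # variant 模式下的查询表示
retrieval_loss = torch.tensor(0.5)         # 占位:常规检索(对比)损失
lam = 0.1                                  # 假设的权重系数
total = retrieval_loss + lam * invariance_loss(z_inv, z_var)
print(total.item())
```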
[IR-1] LLM-Assisted Relevance Assessments: When Should We Ask LLMs for Help?
链接: https://arxiv.org/abs/2411.06877
作者: Rikiya Takehi,Ellen M. Voorhees,Tetsuya Sakai
关键词-EN: evaluate ranking algorithms, easily evaluate ranking, Test collections, ranking algorithms, researchers to quickly
类目: Information Retrieval (cs.IR)
*备注:
点击查看摘要
Abstract:Test collections are information retrieval tools that allow researchers to quickly and easily evaluate ranking algorithms. While test collections have become an integral part of IR research, the process of data creation involves significant efforts in manual annotations, which often makes it very expensive and time-consuming. Thus, the test collections could become small when the budget is limited, which may lead to unstable evaluations. As an alternative, recent studies have proposed the use of large language models (LLMs) to completely replace human assessors. However, while LLMs seem to somewhat correlate with human judgments, they are not perfect and often show bias. Moreover, even if a well-performing LLM or prompt is found on one dataset, there is no guarantee that it will perform similarly in practice, due to differences in tasks and data. A complete replacement with LLMs is therefore argued to be too risky and not fully trustable. In this paper, we propose LLM-Assisted Relevance Assessments (LARA), an effective method to balance manual annotations with LLM annotations, which helps to build a rich and reliable test collection. We use the LLM’s predicted relevance probabilities in order to select the most profitable documents to manually annotate under a budget constraint. While solely relying on LLM’s predicted probabilities to select manual annotations performs fairly well, LARA, with theoretical reasoning, guides the human annotation process even more effectively via online calibration learning. Then, using the calibration model learned from the limited manual annotations, LARA debiases the LLM predictions to annotate the remaining non-assessed data. Empirical evaluations on TREC-COVID and TREC-8 Ad Hoc datasets show that LARA outperforms the alternative solutions under almost any budget constraint.
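LARA 的核心是在预算内挑选最值得人工标注的文档并校准 LLM 概率,下面给出一个离线版本的极简示意;其中“按不确定度(最接近 0.5)选样”与等渗回归校准器均为示例假设,论文实际采用在线校准学习:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
p_llm = rng.uniform(size=1000)                   # LLM 给出的相关概率(演示数据)
budget = 100
pick = np.argsort(np.abs(p_llm - 0.5))[:budget]  # 最接近 0.5 = 最不确定,优先人工标注
human = (rng.uniform(size=budget) < p_llm[pick]).astype(float)  # 模拟人工标签

calib = IsotonicRegression(out_of_bounds="clip")
calib.fit(p_llm[pick], human)                    # 用人工标注拟合校准器(此处为离线示意)
p_debiased = calib.predict(p_llm)                # 去偏后的 LLM 相关概率
p_final = p_debiased.copy()
p_final[pick] = human                            # 已人工标注的直接采用人工标签
print(p_final[:5])
```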
[IR-2] Boosting the Targeted Transferability of Adversarial Examples via Salient Region Weighted Feature Drop
链接: https://arxiv.org/abs/2411.06784
作者: Shanjun Xu,Linghui Li,Kaiguo Yuan,Bingyu Li
关键词-EN: Deep neural networks, presenting significant risks, Deep neural, Weighted Feature Drop, presenting significant
类目: Information Retrieval (cs.IR)
*备注: 9 pages
点击查看摘要
Abstract:Deep neural networks can be vulnerable to adversarially crafted examples, presenting significant risks to practical applications. A prevalent approach for adversarial attacks relies on the transferability of adversarial examples, which are generated from a substitute model and leveraged to attack unknown black-box models. Despite various proposals aimed at improving transferability, the success of these attacks in targeted black-box scenarios is often hindered by the tendency for adversarial examples to overfit to the surrogate models. In this paper, we introduce a novel framework based on Salient region Weighted Feature Drop (SWFD) designed to enhance the targeted transferability of adversarial examples. Drawing from the observation that examples with higher transferability exhibit smoother distributions in the deep-layer outputs, we propose the weighted feature drop mechanism to modulate activation values according to weights scaled by norm distribution, effectively addressing the overfitting issue when generating adversarial examples. Additionally, by leveraging salient regions within the image to construct auxiliary images, our method enables the adversarial example’s features to be transferred to the target category in a model-agnostic manner, thereby enhancing the transferability. Comprehensive experiments confirm that our approach outperforms state-of-the-art methods across diverse configurations. On average, the proposed SWFD raises the attack success rate for normally trained models and robust models by 16.31% and 7.06%, respectively.
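按摘要描述,SWFD 依据范数分布缩放的权重对深层特征做丢弃,以下是该机制的一个极简示意;权重构造与丢弃比例均为本文假设:

```python
import torch

# 加权特征丢弃示意:按各通道激活范数导出的权重随机置零部分通道,
# 抑制替代模型特有的高激活,缓解对抗样本对替代模型的过拟合。
def weighted_feature_drop(feat, drop_ratio=0.2):
    # feat: (batch, channels, h, w) 的深层特征图
    norms = feat.flatten(2).norm(dim=2)             # 每通道范数 (batch, channels)
    w = norms / (norms.sum(dim=1, keepdim=True) + 1e-8)
    k = int(drop_ratio * feat.size(1))
    idx = torch.multinomial(w, k)                   # 范数越大越可能被丢弃(假设)
    mask = torch.ones_like(norms)
    mask.scatter_(1, idx, 0.0)
    return feat * mask[:, :, None, None]

feat = torch.randn(2, 64, 7, 7)
print(weighted_feature_drop(feat).shape)            # torch.Size([2, 64, 7, 7])
```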
[IR-3] Annotative Indexing
链接: https://arxiv.org/abs/2411.06256
作者: Charles L. A. Clarke
关键词-EN: traditional inverted indexes, generalizes traditional inverted, paper introduces annotative, introduces annotative indexing, inverted indexes
类目: Information Retrieval (cs.IR)
*备注:
点击查看摘要
Abstract:This paper introduces annotative indexing, a novel framework that unifies and generalizes traditional inverted indexes, column stores, object stores, and graph databases. As a result, annotative indexing can provide the underlying indexing framework for databases that support knowledge graphs, entity retrieval, semi-structured data, and ranked retrieval. While we primarily focus on human language data in the form of text, annotative indexing is sufficiently general to support a range of other datatypes, and we provide examples of SQL-like queries over a JSON store that includes numbers and dates. Taking advantage of the flexibility of annotative indexing, we also demonstrate a fully dynamic annotative index incorporating support for ACID properties of transactions with hundreds of multiple concurrent readers and writers.
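下面用纯 Python 给出“注释式索引”概念的极简示意:统一的 posting 既可表示倒排词项,也可表示字段或结构注释,并支持简单的区间包含查询;数据模型为演示假设,并非论文的具体实现:

```python
from collections import defaultdict

# 注释式索引示意:posting 统一为 (doc_id, start, end, payload),
# 词项、字段、实体乃至图边都可以表示为某种 kind 的注释。
class AnnotativeIndex:
    def __init__(self):
        self.postings = defaultdict(list)  # kind -> [(doc, start, end, payload)]

    def add(self, kind, doc, start, end, payload=None):
        self.postings[kind].append((doc, start, end, payload))

    def contained_in(self, kind, outer_kind):
        # 返回落在某外层注释区间内的注释,近似"结构包含"查询
        outers = self.postings[outer_kind]
        return [a for a in self.postings[kind]
                for o in outers
                if a[0] == o[0] and o[1] <= a[1] and a[2] <= o[2]]

idx = AnnotativeIndex()
idx.add("token", doc=1, start=0, end=5, payload="fraud")
idx.add("title", doc=1, start=0, end=40)
print(idx.contained_in("token", "title"))  # 标题区间内出现的词项
```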
[IR-4] KeyB2: Selecting Key Blocks is Also Important for Long Document Ranking with Large Language Models
链接: https://arxiv.org/abs/2411.06254
作者: Minghan Li,Eric Gaussier,Juntao Li,Guodong Zhou
关键词-EN: large language models, Llama has significantly, significantly advanced information, language models, rapid development
类目: Information Retrieval (cs.IR)
*备注:
点击查看摘要
Abstract:The rapid development of large language models (LLMs) like Llama has significantly advanced information retrieval (IR) systems. However, using LLMs for long documents, as in RankLLaMA, remains challenging due to computational complexity, especially concerning input token length. Furthermore, the internal mechanisms of LLMs during ranking are still not fully understood. In this paper, we first explore the internal workings of LLMs during relevance judgement and identify that specific attention heads play a crucial role in aligning relevant tokens. This observation inspires us to revisit the block pre-ranking strategy used in KeyB, which remains state-of-the-art (SOTA) on the TREC 2019 DL document ranking dataset. Building on these insights, we develop KeyB2, an advanced long document IR approach that integrates block pre-ranking with the performance of LLMs. KeyB2 efficiently identifies and processes the most relevant blocks, reducing computational costs and improving ranking effectiveness. Additionally, we introduce a new bi-encoder block matching strategy for KeyB2. Comprehensive experiments on long-document datasets, including TREC 2019 DL, Robust04, and MLDR-zh, show that KeyB2 outperforms baselines like RankLLaMA and KeyB by reducing reranking time and GPU memory usage while enhancing retrieval performance, achieving new SOTA results on TREC 2019 DL with higher NDCG@10 and MAP scores.
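块预排序的核心步骤可以用几行代码示意:双编码器给各文本块打分,仅保留 top-k 块送入 LLM 重排器;下面的向量用随机数代替真实编码器输出,仅为本文假设的演示:

```python
import numpy as np

# 块预排序示意:按查询-块余弦相似度挑选 top-k 块,降低 LLM 的输入长度。
def select_key_blocks(q_vec, block_vecs, k=3):
    sims = block_vecs @ q_vec / (
        np.linalg.norm(block_vecs, axis=1) * np.linalg.norm(q_vec) + 1e-8)
    return np.argsort(-sims)[:k]                    # 相似度最高的 k 个块下标

rng = np.random.default_rng(0)
q_vec = rng.normal(size=128)                        # 查询向量(演示数据)
block_vecs = rng.normal(size=(20, 128))             # 长文档切出的 20 个块
print(select_key_blocks(q_vec, block_vecs))         # 拼接后送入 LLM 重排器
```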
[IR-5] Interpret the Internal States of Recommendation Model with Sparse Autoencoder
链接: https://arxiv.org/abs/2411.06112
作者: Jiayin Wang,Xiaoyu Zhang,Weizhi Ma,Min Zhang
关键词-EN: Explainable recommendation systems, Explainable recommendation, enhance transparency, important to enhance, recommendation models
类目: Information Retrieval (cs.IR)
*备注:
点击查看摘要
Abstract:Explainable recommendation systems are important to enhance transparency, accuracy, and fairness. Beyond result-level explanations, model-level interpretations can provide valuable insights that allow developers to optimize system designs and implement targeted improvements. However, most current approaches depend on specialized model designs, which often lack generalization capabilities. Given the various kinds of recommendation models, existing methods have limited ability to effectively interpret them. To address this issue, we propose RecSAE, an automatic, generalizable probing method for interpreting the internal states of Recommendation models with Sparse AutoEncoder. RecSAE serves as a plug-in module that does not affect original models during interpretations, while also enabling predictable modifications to their behaviors based on interpretation results. Firstly, we train an autoencoder with sparsity constraints to reconstruct internal activations of recommendation models, making the RecSAE latents more interpretable and monosemantic than the original neuron activations. Secondly, we automate the construction of concept dictionaries based on the relationship between latent activations and input item sequences. Thirdly, RecSAE validates these interpretations by predicting latent activations on new item sequences using the concept dictionary and deriving interpretation confidence scores from precision and recall. We demonstrate RecSAE’s effectiveness on two datasets, identifying hundreds of highly interpretable concepts from pure ID-based models. Latent ablation studies further confirm that manipulating latent concepts produces corresponding changes in model output behavior, underscoring RecSAE’s utility for both understanding and targeted tuning of recommendation models. Code and data are publicly available at this https URL.
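下面是 RecSAE 所用稀疏自编码器思路的极简示意:重构推荐模型内部激活并施加 L1 稀疏惩罚;维度与惩罚系数均为演示假设,并非论文公开配置:

```python
import torch
import torch.nn as nn

# 稀疏自编码器示意:重构内部激活,L1 约束使潜变量更稀疏、更单义。
class SparseAE(nn.Module):
    def __init__(self, d_act=256, d_latent=2048):
        super().__init__()
        self.enc = nn.Linear(d_act, d_latent)
        self.dec = nn.Linear(d_latent, d_act)

    def forward(self, h):
        z = torch.relu(self.enc(h))                 # 非负稀疏潜激活
        return self.dec(z), z

sae = SparseAE()
h = torch.randn(64, 256)                            # 推荐模型某层的激活(演示数据)
h_hat, z = sae(h)
loss = ((h_hat - h) ** 2).mean() + 1e-3 * z.abs().mean()  # 重构 + 稀疏惩罚
print(loss.item())
```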
[IR-6] Snippet-based Conversational Recommender System
链接: https://arxiv.org/abs/2411.06064
作者: Haibo Sun,Naoki Otani,Hannah Kim,Dan Zhang,Nikita Bhutani
关键词-EN: Conversational Recommender Systems, Recommender Systems, provide personalized recommendations, Conversational Recommender, provide personalized
类目: Information Retrieval (cs.IR)
*备注:
点击查看摘要
Abstract:Conversational Recommender Systems (CRS) engage users in interactive dialogues to gather preferences and provide personalized recommendations. Traditionally, CRS rely on pre-defined attributes or expensive, domain-specific annotated datasets to guide conversations, which limits flexibility and adaptability across domains. In this work, we introduce SnipRec, a novel CRS that enhances dialogues and recommendations by extracting diverse expressions and preferences from user-generated content (UGC) like customer reviews. Using large language models, SnipRec maps user responses and UGC to concise snippets, which are used to generate clarification questions and retrieve relevant items. Our approach eliminates the need for domain-specific training, making it adaptable to new domains and effective without prior knowledge of user preferences. Extensive experiments on the Yelp dataset demonstrate the effectiveness of snippet-based representations against document and sentence-based representations. Additionally, SnipRec is able to improve Hits@10 by 0.25 over the course of five conversational turns, underscoring the efficiency of SnipRec in capturing user preferences through multi-turn conversations.
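片段化检索的打分方式可以如下示意:商品由其评论片段的向量集合表示,得分取“查询片段与商品片段的最大相似度”再对查询片段求和;向量均为随机演示数据,打分函数为本文假设,并非论文接口:

```python
import numpy as np

# 片段化检索打分示意:每个查询片段与商品片段做最佳匹配后求和。
def item_score(query_snips, item_snips):
    sims = query_snips @ item_snips.T               # 已归一化向量的余弦相似度
    return sims.max(axis=1).sum()                   # 每个查询片段取最佳匹配

rng = np.random.default_rng(0)
normalize = lambda m: m / np.linalg.norm(m, axis=1, keepdims=True)
query_snips = normalize(rng.normal(size=(3, 64)))   # 从用户回复抽取的 3 个片段
items = [normalize(rng.normal(size=(5, 64))) for _ in range(10)]  # 10 个候选商品
best = max(range(10), key=lambda i: item_score(query_snips, items[i]))
print("推荐商品:", best)
```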
[IR-7] The Shapley index for music streaming platforms
链接: https://arxiv.org/abs/2411.07166
作者: Gustavo Bergantiños,Juan D. Moreno-Ternero
关键词-EN: music streaming platforms, streaming platforms, measure the popularity, music streaming, Shapley index combining
类目: Theoretical Economics (econ.TH); Computer Science and Game Theory (cs.GT); Information Retrieval (cs.IR)
*备注:
点击查看摘要
Abstract:We study an index to measure the popularity of artists in music streaming platforms. This index, which can be used to allocate the amount raised via paid subscriptions among participating artists, is based on the Shapley value, a centerpiece in cooperative game theory. We characterize this Shapley index by combining several axioms formalizing principles with normative appeal. This permits placing the index in the literature, as an alternative to the well-known (and widely used in the industry) pro-rata and user-centric indices.
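作为 Shapley 指数的计算示意,下面在一个三艺人的玩具博弈上精确枚举 Shapley 值;其中特征函数 v(S) 定义为“至少听过 S 中一位艺人的用户数”,这一定义是本文为演示所作的假设,并非论文的原始设定:

```python
from itertools import combinations
from math import factorial

# 玩具博弈:每个用户由其听过的艺人集合表示,v(S) = 被 S 覆盖的用户数。
users = [{"A"}, {"A", "B"}, {"B", "C"}]
artists = ["A", "B", "C"]

def v(S):
    return sum(1 for u in users if u & set(S))

def shapley(i):
    others = [a for a in artists if a != i]
    n, total = len(artists), 0.0
    for r in range(len(others) + 1):
        for S in combinations(others, r):
            w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
            total += w * (v(S + (i,)) - v(S))      # 加权边际贡献
    return total

print({a: round(shapley(a), 3) for a in artists})   # 各艺人的收入份额(按用户数计)
```

对这个玩具例,三位艺人的份额之和等于 v(全体)=3,即全部订阅用户数,体现了 Shapley 分配的有效性(efficiency)公理。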