本篇博文主要展示 2024-11-18 从 arXiv.org 论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR五个大方向区分。若需要邮件定时接收,请在评论区留下你的邮箱地址。

说明:每日论文数据从 arXiv.org 获取,每天中午12:00左右定时自动更新。

友情提示:如果您需要通过邮箱接收每日论文数据,请在评论处留下你的邮箱地址。

目录

概览 (2024-11-18)

今日共更新372篇论文,其中:

  • 自然语言处理40篇(Computation and Language (cs.CL))
  • 人工智能100篇(Artificial Intelligence (cs.AI))
  • 计算机视觉100篇(Computer Vision and Pattern Recognition (cs.CV))
  • 机器学习110篇(Machine Learning (cs.LG))

自然语言处理

[NLP-0] Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization

【速读】: 该论文试图解决现有开源多模态大语言模型(MLLMs)因训练过程中的分布偏移而导致多模态推理(Multimodal Reasoning)能力受限、尤其是链式思维(Chain-of-Thought, CoT)性能不足的问题。解决方案的关键在于引入偏好优化(Preference Optimization, PO)过程,具体包括:(1) 数据方面,设计自动化偏好数据构建流程,创建高质量、大规模的多模态推理偏好数据集(MMPR);(2) 模型方面,探索将PO与MLLMs结合,开发了一种简单而有效的方法,称为混合偏好优化(Mixed Preference Optimization, MPO),显著提升了多模态CoT性能。实验结果表明,该方法在多个基准测试中表现优异,尤其在多模态推理任务上,InternVL2-8B-MPO模型在MathVista上的准确率达到67.0,比InternVL2-8B高出8.7个百分点,性能接近10倍大的InternVL2-76B模型。

链接: https://arxiv.org/abs/2411.10442
作者: Weiyun Wang,Zhe Chen,Wenhai Wang,Yue Cao,Yangzhou Liu,Zhangwei Gao,Jinguo Zhu,Xizhou Zhu,Lewei Lu,Yu Qiao,Jifeng Dai
关键词-EN: Existing open-source multimodal, Existing open-source, training process involving, process involving pre-training, open-source multimodal large
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Existing open-source multimodal large language models (MLLMs) generally follow a training process involving pre-training and supervised fine-tuning. However, these models suffer from distribution shifts, which limit their multimodal reasoning, particularly in the Chain-of-Thought (CoT) performance. To address this, we introduce a preference optimization (PO) process to enhance the multimodal reasoning capabilities of MLLMs. Specifically, (1) on the data side, we design an automated preference data construction pipeline to create MMPR, a high-quality, large-scale multimodal reasoning preference dataset. and (2) on the model side, we explore integrating PO with MLLMs, developing a simple yet effective method, termed Mixed Preference Optimization (MPO), which boosts multimodal CoT performance. Our approach demonstrates improved performance across multiple benchmarks, particularly in multimodal reasoning tasks. Notably, our model, InternVL2-8B-MPO, achieves an accuracy of 67.0 on MathVista, outperforming InternVL2-8B by 8.7 points and achieving performance comparable to the 10x larger InternVL2-76B. We hope this study could inspire further advancements in MLLMs. Code, data, and model shall be publicly released.
摘要:现有的开源多模态大语言模型 (MLLMs) 通常遵循预训练和监督微调的训练过程。然而,这些模型在分布偏移问题上表现不佳,限制了其在多模态推理,特别是在思维链 (Chain-of-Thought, CoT) 性能上的表现。为解决这一问题,我们引入了一种偏好优化 (Preference Optimization, PO) 过程,以增强 MLLMs 的多模态推理能力。具体来说,(1) 在数据方面,我们设计了一个自动化的偏好数据构建管道,创建了 MMPR,这是一个高质量、大规模的多模态推理偏好数据集;(2) 在模型方面,我们探索将 PO 与 MLLMs 结合,开发了一种简单而有效的方法,称为混合偏好优化 (Mixed Preference Optimization, MPO),显著提升了多模态 CoT 性能。我们的方法在多个基准测试中展示了改进的性能,特别是在多模态推理任务中。值得注意的是,我们的模型 InternVL2-8B-MPO 在 MathVista 上达到了 67.0 的准确率,比 InternVL2-8B 高出 8.7 分,并且性能可与大 10 倍的 InternVL2-76B 相媲美。我们希望这项研究能够激发 MLLMs 领域的进一步发展。代码、数据和模型将公开发布。
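
为便于理解"混合偏好优化"的大致形态,下面给出一个极简的 PyTorch 示意:把 DPO 风格的偏好项与 SFT 生成项加权相加。注意损失的组合方式、权重以及函数签名均为本文的假设示意,并非论文 MPO 的官方实现。

```python
# 极简示意:将 DPO 风格的偏好损失与 SFT 生成损失加权组合(组合方式与权重均为假设)。
# logp_* 表示策略模型与参考模型在被选/被拒回答上的序列对数概率。
import torch
import torch.nn.functional as F

def mixed_preference_loss(logp_chosen, logp_rejected,
                          ref_logp_chosen, ref_logp_rejected,
                          sft_nll_chosen, beta=0.1, w_pref=1.0, w_sft=1.0):
    # 偏好项:鼓励策略模型相对参考模型更偏好被选回答
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    pref_loss = -F.logsigmoid(margin).mean()
    # 生成项:对被选回答做标准的负对数似然(SFT)
    sft_loss = sft_nll_chosen.mean()
    return w_pref * pref_loss + w_sft * sft_loss

# 用随机数演示调用方式
b = 4
loss = mixed_preference_loss(torch.randn(b), torch.randn(b),
                             torch.randn(b), torch.randn(b),
                             torch.rand(b))
print(loss.item())
```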

[NLP-1] Mitigating Hallucination in Multimodal Large Language Model via Hallucination-targeted Direct Preference Optimization

【速读】: 该论文试图解决多模态大语言模型 (Multimodal Large Language Models, MLLMs) 中存在的幻觉问题 (hallucination),这一问题限制了其实际应用。解决方案的关键在于引入了一种名为幻觉针对性直接偏好优化 (Hallucination-targeted Direct Preference Optimization, HDPO) 的新方法,该方法不同于以往的策略,从幻觉的多种形式和成因入手进行针对性优化。具体而言,研究团队开发了三种偏好对数据,分别针对以下幻觉成因:(1) 视觉能力不足,(2) 长上下文生成,(3) 多模态冲突。实验结果表明,HDPO 在多个幻觉评估数据集上表现优异,超越了大多数最先进 (SOTA) 方法,显示出其潜在优势。此外,消融研究和深入分析进一步验证了该方法的有效性,并提示通过扩展规模可能实现进一步的改进。

链接: https://arxiv.org/abs/2411.10436
作者: Yuhan Fu,Ruobing Xie,Xingwu Sun,Zhanhui Kang,Xirong Li
关键词-EN: Large Language Models, Multimodal Large Language, Language Models, Large Language, Direct Preference Optimization
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) are known to hallucinate, which limits their practical applications. Recent works have attempted to apply Direct Preference Optimization (DPO) to enhance the performance of MLLMs, but have shown inconsistent improvements in mitigating hallucinations. To address this issue more effectively, we introduce Hallucination-targeted Direct Preference Optimization (HDPO) to reduce hallucinations in MLLMs. Unlike previous approaches, our method tackles hallucinations from their diverse forms and causes. Specifically, we develop three types of preference pair data targeting the following causes of MLLM hallucinations: (1) insufficient visual capabilities, (2) long context generation, and (3) multimodal conflicts. Experimental results demonstrate that our method achieves superior performance across multiple hallucination evaluation datasets, surpassing most state-of-the-art (SOTA) methods and highlighting the potential of our approach. Ablation studies and in-depth analyses further confirm the effectiveness of our method and suggest the potential for further improvements through scaling up.
摘要:多模态大语言模型 (Multimodal Large Language Models, MLLMs) 以其产生幻觉的特性而闻名,这限制了它们的实际应用。近期研究尝试将直接偏好优化 (Direct Preference Optimization, DPO) 应用于提升 MLLMs 的性能,但在减少幻觉方面显示出不一致的改进效果。为了更有效地解决这一问题,我们引入了针对幻觉的直接偏好优化 (Hallucination-targeted Direct Preference Optimization, HDPO),以减少 MLLMs 中的幻觉现象。与以往方法不同,我们的方法从幻觉的多种形式和成因入手。具体而言,我们针对以下 MLLM 幻觉的成因开发了三种类型的偏好对数据:(1) 视觉能力不足,(2) 长上下文生成,以及 (3) 多模态冲突。实验结果表明,我们的方法在多个幻觉评估数据集上表现优异,超越了大多数最先进 (State-of-the-Art, SOTA) 方法,突显了我们方法的潜力。消融研究和深入分析进一步证实了我们方法的有效性,并指出了通过扩展规模进一步改进的潜力。
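
下面用一小段 Python 示意论文中"针对三类幻觉成因构造偏好对数据"的数据形态。字段名、成因标签与示例内容均为假设,仅帮助理解数据的组织方式,并非论文的官方数据构建流程。

```python
# 极简示意:按三类幻觉成因组织偏好对数据(字段名与示例均为假设)。
def build_preference_pairs(samples):
    pairs = []
    for s in samples:
        # s 形如 {"image": ..., "question": ..., "good": 无幻觉回答, "bad": 含幻觉回答,
        #         "cause": "visual" / "long_context" / "conflict"}
        pairs.append({
            "prompt": {"image": s["image"], "text": s["question"]},
            "chosen": s["good"],       # 更符合图像内容、无幻觉的回答
            "rejected": s["bad"],      # 针对特定成因构造的幻觉回答
            "hallucination_cause": s["cause"],
        })
    return pairs

demo = [{"image": "img_001.jpg", "question": "图中有几只猫?",
         "good": "图中有两只猫。", "bad": "图中有三只猫和一只狗。", "cause": "visual"}]
print(build_preference_pairs(demo))
```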

[NLP-2] Towards Automatic Evaluation of Task-Oriented Dialogue Flows

【速读】: 该论文试图解决面向任务的对话系统中对话流程(dialogue flows)质量评估缺乏标准方法的问题。解决方案的关键是引入了一种名为FuDGE(Fuzzy Dialogue-Graph Edit Distance)的新型度量标准,该标准通过评估对话流程的结构复杂性和对对话数据的表示覆盖度来衡量其质量。FuDGE能够量化单个对话与流程的匹配程度,从而评估整个对话集被流程表示的总体效果。通过在手动配置和自动生成的对话流程上的广泛实验,证明了FuDGE及其评估框架的有效性,从而为对话设计者和自动化技术提供了标准化和优化的工具,以实现更高的效率和自动化水平。

链接: https://arxiv.org/abs/2411.10416
作者: Mehrnoosh Mirtaheri,Nikhil Varghese,Chandra Khatri,Amol Kelkar
关键词-EN: Task-oriented dialogue systems, predefined conversation schemes, directed acyclic graphs, dialogue systems rely, Task-oriented dialogue
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Task-oriented dialogue systems rely on predefined conversation schemes (dialogue flows) often represented as directed acyclic graphs. These flows can be manually designed or automatically generated from previously recorded conversations. Due to variations in domain expertise or reliance on different sets of prior conversations, these dialogue flows can manifest in significantly different graph structures. Despite their importance, there is no standard method for evaluating the quality of dialogue flows. We introduce FuDGE (Fuzzy Dialogue-Graph Edit Distance), a novel metric that evaluates dialogue flows by assessing their structural complexity and representational coverage of the conversation data. FuDGE measures how well individual conversations align with a flow and, consequently, how well a set of conversations is represented by the flow overall. Through extensive experiments on manually configured flows and flows generated by automated techniques, we demonstrate the effectiveness of FuDGE and its evaluation framework. By standardizing and optimizing dialogue flows, FuDGE enables conversational designers and automated techniques to achieve higher levels of efficiency and automation.
摘要:面向任务的对话系统依赖于预定义的对话方案(对话流程),这些方案通常表示为有向无环图。这些流程可以手动设计,也可以从先前记录的对话中自动生成。由于领域专业知识的差异或依赖于不同的先前对话集,这些对话流程可能呈现出显著不同的图结构。尽管其重要性不言而喻,但目前尚无标准方法来评估对话流程的质量。我们引入了 FuDGE(模糊对话图编辑距离),这是一种新颖的度量方法,通过评估对话流程的结构复杂性和对对话数据的表示覆盖度来评估其质量。FuDGE 衡量单个对话与流程的匹配程度,进而评估一组对话在整体上被流程表示的程度。通过在手动配置的流程和自动生成技术生成的流程上进行广泛实验,我们展示了 FuDGE 及其评估框架的有效性。通过标准化和优化对话流程,FuDGE 使对话设计师和自动技术能够实现更高水平的效率和自动化。
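
下面给出一个极简的 Python 示意,用动态规划计算"对话话语序列"与"流程路径节点序列"之间的模糊编辑距离:替换代价取 1 减去文本相似度(此处用 difflib 粗略代替语义相似度)。这只是对 FuDGE 思想的示意性草图,并非论文的官方定义或实现。

```python
# 极简示意:对话与流程路径之间的"模糊编辑距离"(代价定义为假设)。
from difflib import SequenceMatcher

def fuzzy_edit_distance(conversation, flow_path):
    sim = lambda a, b: SequenceMatcher(None, a, b).ratio()
    n, m = len(conversation), len(flow_path)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = float(i)
    for j in range(1, m + 1):
        dp[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dp[i - 1][j - 1] + (1.0 - sim(conversation[i - 1], flow_path[j - 1]))
            dp[i][j] = min(dp[i - 1][j] + 1.0,   # 删除一句话语
                           dp[i][j - 1] + 1.0,   # 跳过一个流程节点
                           sub)                   # 模糊匹配/替换
    return dp[n][m]

conv = ["你好,我想查订单", "订单号是 12345", "谢谢"]
path = ["问候", "询问订单号", "结束语"]
print(fuzzy_edit_distance(conv, path))
```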

[NLP-3] Llama Guard 3 Vision: Safeguarding Human-AI Image Understanding Conversations

【速读】: 该论文试图解决在涉及图像理解的人机对话中,如何有效保障多模态大语言模型(LLM)输入和输出的内容安全问题。解决方案的关键在于引入Llama Guard 3 Vision,这是一个基于多模态LLM的安全防护系统,专门设计用于支持图像推理用例,并优化了检测有害多模态(文本和图像)提示及相应文本响应的能力。该系统通过在Llama 3.2-Vision上进行微调,并在内部基准测试中使用MLCommons分类法展示了强大的性能,同时测试了其对抗对抗性攻击的鲁棒性。

链接: https://arxiv.org/abs/2411.10414
作者: Jianfeng Chi,Ujjwal Karn,Hongyuan Zhan,Eric Smith,Javier Rando,Yiming Zhang,Kate Plawiak,Zacharie Delpierre Coudert,Kartikeya Upasani,Mahesh Pasupuleti
关键词-EN: multimodal LLM inputs, introduce Llama Guard, Llama Guard, LLM inputs, involves image understanding
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We introduce Llama Guard 3 Vision, a multimodal LLM-based safeguard for human-AI conversations that involves image understanding: it can be used to safeguard content for both multimodal LLM inputs (prompt classification) and outputs (response classification). Unlike the previous text-only Llama Guard versions (Inan et al., 2023; Llama Team, 2024b,a), it is specifically designed to support image reasoning use cases and is optimized to detect harmful multimodal (text and image) prompts and text responses to these prompts. Llama Guard 3 Vision is fine-tuned on Llama 3.2-Vision and demonstrates strong performance on the internal benchmarks using the MLCommons taxonomy. We also test its robustness against adversarial attacks. We believe that Llama Guard 3 Vision serves as a good starting point to build more capable and robust content moderation tools for human-AI conversation with multimodal capabilities.
摘要:我们引入了 Llama Guard 3 Vision,这是一种基于多模态大语言模型的安全防护系统,用于涉及图像理解的人机对话:它可以用于保护多模态大语言模型的输入(提示分类)和输出(响应分类)内容。与之前的仅文本版本的 Llama Guard(Inan et al., 2023; Llama Team, 2024b,a)不同,它专门设计用于支持图像推理用例,并优化以检测有害的多模态(文本和图像)提示以及针对这些提示的文本响应。Llama Guard 3 Vision 在 Llama 3.2-Vision 上进行了微调,并在使用 MLCommons 分类法的内部基准测试中展示了强大的性能。我们还测试了其对抗对抗性攻击的鲁棒性。我们相信,Llama Guard 3 Vision 为构建更具能力和鲁棒性的多模态人机对话内容审核工具提供了一个良好的起点。

[NLP-4] Features that Make a Difference: Leveraging Gradients for Improved Dictionary Learning NAACL2025

【速读】: 该论文试图解决传统稀疏自编码器 (Sparse Autoencoders, SAEs) 在训练过程中仅考虑激活值而忽视这些激活值对下游计算的影响,从而导致忽略那些激活值较小但强烈影响模型输出的特征的问题。解决方案的关键在于引入梯度稀疏自编码器 (Gradient SAEs, g-SAEs),通过增强TopK激活函数,使其在选择k个元素时依赖于输入激活的梯度。这种方法使得在给定稀疏水平下,g-SAEs生成的重构更忠实于原始网络性能,并且学习到的潜在变量在任意上下文中更能有效引导模型。通过考虑激活的下游效应,g-SAEs不仅关注特征的表示(representations),还兼顾了特征作为行动(actions)的前瞻性作用。

链接: https://arxiv.org/abs/2411.10397
作者: Jeffrey Olmo,Jared Wilson,Max Forsey,Bryce Hepner,Thomas Vin Howe,David Wingate
关键词-EN: network internal activations, overcomplete decomposition, extracting neural network, sparse autoencoder architecture, internal activations
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 9 pages, 8 figures. Submitted to NAACL 2025

点击查看摘要

Abstract:Sparse Autoencoders (SAEs) are a promising approach for extracting neural network representations by learning a sparse and overcomplete decomposition of the network’s internal activations. However, SAEs are traditionally trained considering only activation values and not the effect those activations have on downstream computations. This limits the information available to learn features, and biases the autoencoder towards neglecting features which are represented with small activation values but strongly influence model outputs. To address this, we introduce Gradient SAEs (g-SAEs), which modify the k -sparse autoencoder architecture by augmenting the TopK activation function to rely on the gradients of the input activation when selecting the k elements. For a given sparsity level, g-SAEs produce reconstructions that are more faithful to original network performance when propagated through the network. Additionally, we find evidence that g-SAEs learn latents that are on average more effective at steering models in arbitrary contexts. By considering the downstream effects of activations, our approach leverages the dual nature of neural network features as both \textitrepresentations , retrospectively, and \textitactions , prospectively. While previous methods have approached the problem of feature discovery primarily focused on the former aspect, g-SAEs represent a step towards accounting for the latter as well.
摘要:稀疏自编码器 (Sparse Autoencoders, SAEs) 是一种有前景的方法,通过学习网络内部激活的稀疏且过完备的分解,来提取神经网络的表示。然而,传统的 SAEs 在训练时仅考虑激活值,而未考虑这些激活对下游计算的影响。这限制了可用于学习特征的信息量,并使自编码器倾向于忽略那些激活值较小但强烈影响模型输出的特征。为解决这一问题,我们引入了梯度稀疏自编码器 (Gradient SAEs, g-SAEs),通过增强 TopK 激活函数,使其在选择 k 个元素时依赖于输入激活的梯度,从而修改了 k-稀疏自编码器架构。对于给定的稀疏水平,g-SAEs 生成的重构在通过网络传播时更忠实于原始网络性能。此外,我们发现 g-SAEs 学习的潜在变量在平均水平上更能有效地在任意上下文中引导模型。通过考虑激活的下游效应,我们的方法利用了神经网络特征的双重性质:作为回顾性的表示 (representations) 和前瞻性的行动 (actions)。尽管先前的方法主要关注前一方面,g-SAEs 代表了一个向同时考虑后一方面的迈进。
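
下面是一个极简的 PyTorch 示意,说明"在 TopK 稀疏化时参考梯度信息"的大致做法:用激活与其梯度的乘积的绝对值作为选择依据。打分方式、梯度的获取与映射方式均为假设,并非论文 g-SAE 的精确公式。

```python
# 极简示意:带梯度信息的 TopK 稀疏自编码器(打分方式为假设)。
import torch
import torch.nn as nn

class GradientTopKSAE(nn.Module):
    def __init__(self, d_model, d_hidden, k):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)
        self.k = k

    def forward(self, x, x_grad):
        # x: 网络内部激活;x_grad: 下游损失对该激活的梯度(需由外部反传得到)
        pre_acts = self.encoder(x)
        grad_acts = self.encoder(x_grad)              # 将梯度也映射到特征空间(假设做法)
        score = (pre_acts * grad_acts).abs()          # 以"激活 × 梯度"衡量下游影响
        topk = torch.topk(score, self.k, dim=-1)
        mask = torch.zeros_like(pre_acts).scatter_(-1, topk.indices, 1.0)
        latents = pre_acts * mask                     # 只保留 k 个"最有影响"的特征
        return self.decoder(latents), latents

sae = GradientTopKSAE(d_model=64, d_hidden=512, k=32)
x, g = torch.randn(8, 64), torch.randn(8, 64)
recon, z = sae(x, g)
print(recon.shape, int((z != 0).sum(dim=-1)[0]))
```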

[NLP-5] A Survey of Event Causality Identification: Principles, Taxonomy, Challenges and Assessment

【速读】: 该论文试图解决事件因果关系识别 (Event Causality Identification, ECI) 这一自然语言处理 (NLP) 中的关键任务,旨在从文本数据中自动提取因果关系。解决方案的关键在于系统地构建了ECI的基础原理、技术框架和挑战,并提出了一个全面的分类体系 (taxonomy) 来分类和澄清当前的研究方法。具体而言,论文将ECI方法分为句子级 (SECI) 和文档级 (DECI) 两大类,分别探讨了基于特征模式匹配、深度语义编码、因果知识预训练与提示微调、外部知识增强等SECI方法,以及基于事件图推理和提示技术的DECI方法。此外,论文还对现有方法进行了定量评估,并指出了未来研究的方向,以克服当前的局限性和拓展ECI的应用范围。

链接: https://arxiv.org/abs/2411.10371
作者: Zefan Zeng,Qing Cheng,Xingchen Hu,Yuehang Si,Zhong Liu
关键词-EN: Natural Language Processing, Language Processing, Natural Language, automatically extracting causalities, Event Causality Identification
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Event Causality Identification (ECI) has become a crucial task in Natural Language Processing (NLP), aimed at automatically extracting causalities from textual data. In this survey, we systematically address the foundational principles, technical frameworks, and challenges of ECI, offering a comprehensive taxonomy to categorize and clarify current research methodologies, as well as a quantitative assessment of existing models. We first establish a conceptual framework for ECI, outlining key definitions, problem formulations, and evaluation standards. Our taxonomy classifies ECI methods according to the two primary tasks of sentence-level (SECI) and document-level (DECI) event causality identification. For SECI, we examine feature pattern-based matching, deep semantic encoding, causal knowledge pre-training and prompt-based fine-tuning, and external knowledge enhancement methods. For DECI, we highlight approaches focused on event graph reasoning and prompt-based techniques to address the complexity of cross-sentence causal inference. Additionally, we analyze the strengths, limitations, and open challenges of each approach. We further conduct an extensive quantitative evaluation of various ECI methods on two benchmark datasets. Finally, we explore future research directions, highlighting promising pathways to overcome current limitations and broaden ECI applications.
摘要:事件因果关系识别 (Event Causality Identification, ECI) 已成为自然语言处理 (Natural Language Processing, NLP) 中的一个关键任务,旨在自动从文本数据中提取因果关系。在本综述中,我们系统地探讨了 ECI 的基础原理、技术框架及其面临的挑战,提供了一个全面的分类体系来归类和阐明当前的研究方法,并对现有模型进行了定量评估。我们首先为 ECI 建立了一个概念框架,概述了关键定义、问题表述和评估标准。我们的分类体系根据句子级 (Sentence-level Event Causality Identification, SECI) 和文档级 (Document-level Event Causality Identification, DECI) 事件因果关系识别这两个主要任务对 ECI 方法进行了分类。对于 SECI,我们考察了基于特征模式的匹配、深度语义编码、因果知识预训练和基于提示的微调,以及外部知识增强方法。对于 DECI,我们强调了专注于事件图推理和基于提示的技术,以应对跨句子因果推理的复杂性。此外,我们分析了每种方法的优势、局限性和开放挑战。我们进一步对两个基准数据集上的多种 ECI 方法进行了广泛的定量评估。最后,我们探讨了未来的研究方向,突出了克服当前局限性和扩展 ECI 应用的有前景的路径。

[NLP-6] Safe Text-to-Image Generation: Simply Sanitize the Prompt Embedding

【速读】: 该论文试图解决文本到图像生成模型(T2I)在生成过程中可能产生不安全内容的问题。解决方案的关键在于提出了一种视觉无关的安全生成框架,称为嵌入净化器(Embedding Sanitizer, ES),该框架专注于从提示嵌入(prompt embeddings)中移除不适当概念,并使用净化后的嵌入来指导模型进行安全生成。ES通过在文本编码器的输出上应用一个即插即用模块,实现了与不同T2I模型的无缝集成,并通过独特的评分机制动态调整净化强度,以平衡防御性能和生成质量。通过在五个提示基准上的广泛评估,ES在净化不安全生成源(提示嵌入)方面达到了最先进的鲁棒性,显著优于现有的安全措施。

链接: https://arxiv.org/abs/2411.10329
作者: Huming Qiu,Guanxu Chen,Mi Zhang,Min Yang
关键词-EN: made significant progress, generating high-quality images, recent years, made significant, significant progress
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In recent years, text-to-image (T2I) generation models have made significant progress in generating high-quality images that align with text descriptions. However, these models also face the risk of unsafe generation, potentially producing harmful content that violates usage policies, such as explicit material. Existing safe generation methods typically focus on suppressing inappropriate content by erasing undesired concepts from visual representations, while neglecting to sanitize the textual representation. Although these methods help mitigate the risk of misuse to certain extent, their robustness remains insufficient when dealing with adversarial attacks. Given that semantic consistency between input text and output image is a fundamental requirement for T2I models, we identify that textual representations (i.e., prompt embeddings) are likely the primary source of unsafe generation. To this end, we propose a vision-agnostic safe generation framework, Embedding Sanitizer (ES), which focuses on erasing inappropriate concepts from prompt embeddings and uses the sanitized embeddings to guide the model for safe generation. ES is applied to the output of the text encoder as a plug-and-play module, enabling seamless integration with different T2I models as well as other safeguards. In addition, ES’s unique scoring mechanism assigns a score to each token in the prompt to indicate its potential harmfulness, and dynamically adjusts the sanitization intensity to balance defensive performance and generation quality. Through extensive evaluation on five prompt benchmarks, our approach achieves state-of-the-art robustness by sanitizing the source (prompt embedding) of unsafe generation compared to nine baseline methods. It significantly outperforms existing safeguards in terms of interpretability and controllability while maintaining generation quality.
摘要:近年来,文本到图像 (Text-to-Image, T2I) 生成模型在生成与文本描述高度一致的高质量图像方面取得了显著进展。然而,这些模型也面临着不安全生成的风险,可能会产生违反使用政策的危害性内容,例如露骨材料。现有的安全生成方法通常侧重于通过从视觉表示中擦除不希望的概念来抑制不当内容,而忽略了文本表示的净化。尽管这些方法在一定程度上帮助缓解了误用风险,但在应对对抗性攻击时,其鲁棒性仍然不足。鉴于输入文本与输出图像之间的语义一致性是 T2I 模型的基本要求,我们识别出文本表示(即提示嵌入)很可能是导致不安全生成的主要源头。为此,我们提出了一种与视觉无关的安全生成框架——嵌入净化器 (Embedding Sanitizer, ES),该框架专注于从提示嵌入中擦除不当概念,并使用净化后的嵌入来指导模型进行安全生成。ES 作为即插即用模块应用于文本编码器的输出,能够无缝集成到不同的 T2I 模型以及其他安全措施中。此外,ES 独特的评分机制为提示中的每个 Token 分配一个分数,以指示其潜在的危害性,并动态调整净化强度,以平衡防御性能和生成质量。通过对五个提示基准进行广泛评估,我们的方法在净化不安全生成源头(提示嵌入)方面达到了最先进的鲁棒性,相比九种基线方法表现出色。它在可解释性和可控性方面显著优于现有的安全措施,同时保持了生成质量。
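
下面用一个极简的 PyTorch 示意嵌入净化器的基本思路:对提示嵌入逐 Token 打有害度分数,并按分数将其向某个"安全"嵌入插值后再送入生成模型。打分器结构与安全嵌入的来源均为假设,并非论文的官方实现。

```python
# 极简示意:逐 Token 打分并净化提示嵌入(网络结构与插值方式均为假设)。
import torch
import torch.nn as nn

class ToySanitizer(nn.Module):
    def __init__(self, d_embed):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(d_embed, 128), nn.ReLU(),
                                    nn.Linear(128, 1), nn.Sigmoid())
        # 假设:用一个可学习的"安全概念"嵌入作为替换目标
        self.safe_embedding = nn.Parameter(torch.zeros(d_embed))

    def forward(self, prompt_embeds, strength=1.0):
        # prompt_embeds: [batch, seq_len, d_embed],即文本编码器的输出
        harm_score = self.scorer(prompt_embeds)            # 每个 Token 的有害度 (0~1)
        alpha = strength * harm_score                       # 动态净化强度
        sanitized = (1 - alpha) * prompt_embeds + alpha * self.safe_embedding
        return sanitized, harm_score

sanitizer = ToySanitizer(d_embed=768)
emb = torch.randn(1, 77, 768)
clean_emb, scores = sanitizer(emb)
print(clean_emb.shape, scores.shape)
```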


[NLP-7] Emotion Detection in Reddit: Comparative Study of Machine Learning and Deep Learning Techniques

【速读】: 该论文试图解决文本情感检测问题,关键在于通过利用GoEmotions数据集和多种模型(包括六种机器学习模型、三种集成模型和一种长短期记忆网络模型)来确定最优的情感检测模型。研究结果表明,Stacking分类器在准确性和性能上优于其他模型,包括预训练的EmoBERTa模型,并且通过Streamlit网络应用展示了其在实际文本情感分析中的应用潜力。

链接: https://arxiv.org/abs/2411.10328
作者: Maliheh Alaeddini
关键词-EN: significantly influences behavior, Emotion detection, human communication, influences behavior, decision-making processes
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Emotion detection is pivotal in human communication, as it significantly influences behavior, relationships, and decision-making processes. This study concentrates on text-based emotion detection by leveraging the GoEmotions dataset, which annotates Reddit comments with 27 distinct emotions. These emotions are subsequently mapped to Ekman’s six basic categories: joy, anger, fear, sadness, disgust, and surprise. We employed a range of models for this task, including six machine learning models, three ensemble models, and a Long Short-Term Memory (LSTM) model to determine the optimal model for emotion detection. Results indicate that the Stacking classifier outperforms other models in accuracy and performance. We also benchmark our models against EmoBERTa, a pre-trained emotion detection model, with our Stacking classifier proving more effective. Finally, the Stacking classifier is deployed via a Streamlit web application, underscoring its potential for real-world applications in text-based emotion analysis.
摘要:情感检测在人类交流中至关重要,因为它显著影响行为、关系和决策过程。本研究专注于利用 GoEmotions 数据集进行基于文本的情感检测,该数据集将 Reddit 评论标注为 27 种不同的情感。这些情感随后被映射到 Ekman 的六个基本类别:喜悦、愤怒、恐惧、悲伤、厌恶和惊讶。我们采用了多种模型进行此任务,包括六种机器学习模型、三种集成模型和一个长短期记忆 (LSTM) 模型,以确定最适合情感检测的模型。结果表明,Stacking 分类器在准确性和性能方面优于其他模型。我们还对我们的模型与预训练的情感检测模型 EmoBERTa 进行了基准测试,结果显示我们的 Stacking 分类器更为有效。最后,通过 Streamlit 网络应用程序部署了 Stacking 分类器,突显了其在基于文本的情感分析中的实际应用潜力。
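
下面给出一个可运行的 scikit-learn 最小示例,演示论文采用的 Stacking 思路:以 TF-IDF 作为文本特征,叠加若干基分类器并用逻辑回归做元学习器。基学习器组合与参数为假设,GoEmotions 数据此处以几条占位样本代替。

```python
# 极简示意:TF-IDF + Stacking 分类器做文本情感分类(基学习器与样本均为占位)。
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB

texts = ["I love this so much!", "This made my day, fantastic!",
         "This is terrifying...", "I am really scared of the dark",
         "What a surprise, I did not expect that!", "Wow, totally unexpected news",
         "I feel so sad today", "Losing him broke my heart"]
labels = ["joy", "joy", "fear", "fear", "surprise", "surprise", "sadness", "sadness"]

stack = StackingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("svm", LinearSVC()),
                ("nb", MultinomialNB()),
                ("rf", RandomForestClassifier(n_estimators=50))],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=2,  # 占位样本太少,真实数据上应使用更大的折数
)
model = make_pipeline(TfidfVectorizer(), stack)
model.fit(texts, labels)
print(model.predict(["I am really happy about the result"]))
```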

[NLP-8] The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use

【速读】: 该论文试图解决的问题是如何评估和展示Claude 3.5 Computer Use这一前沿AI模型在复杂现实环境中的能力,特别是其在图形用户界面(GUI)自动化方面的表现。解决方案的关键在于设计和执行一系列跨领域和软件的精心设计的任务,以全面测试模型的端到端语言到桌面操作的能力。通过这些案例研究,论文不仅展示了Claude 3.5 Computer Use的初步能力和局限性,还提供了一个开箱即用的代理框架,用于部署基于API的GUI自动化模型,从而简化了实现过程。此外,论文还提出了关于规划、行动和批判性思考的问题,这些问题对于未来模型的改进至关重要。

链接: https://arxiv.org/abs/2411.10323
作者: Siyuan Hu,Mingyu Ouyang,Difei Gao,Mike Zheng Shou
关键词-EN: graphical user interface, recently released model, user interface, recently released, graphical user
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: 40 pages, 21 figures, preprint

点击查看摘要

Abstract:The recently released model, Claude 3.5 Computer Use, stands out as the first frontier AI model to offer computer use in public beta as a graphical user interface (GUI) agent. As an early beta, its capability in the real-world complex environment remains unknown. In this case study to explore Claude 3.5 Computer Use, we curate and organize a collection of carefully designed tasks spanning a variety of domains and software. Observations from these cases demonstrate Claude 3.5 Computer Use’s unprecedented ability in end-to-end language to desktop actions. Along with this study, we provide an out-of-the-box agent framework for deploying API-based GUI automation models with easy implementation. Our case studies aim to showcase a groundwork of capabilities and limitations of Claude 3.5 Computer Use with detailed analyses and bring to the fore questions about planning, action, and critic, which must be considered for future improvement. We hope this preliminary exploration will inspire future research into the GUI agent community. All the test cases in the paper can be tried through the project: this https URL.
摘要:最近发布的模型 Claude 3.5 Computer Use 作为首个提供计算机使用功能的 AI 模型,在公开测试阶段以图形用户界面 (GUI) 智能体的形式脱颖而出。作为早期测试版,其在真实复杂环境中的能力尚不明确。在本案例研究中,我们精心策划并组织了一系列跨领域和软件的复杂任务,以探索 Claude 3.5 Computer Use 的功能。这些案例的观察结果展示了 Claude 3.5 Computer Use 在从语言到桌面操作的端到端过程中的前所未有的能力。同时,我们提供了一个开箱即用的智能体框架,用于部署基于 API 的 GUI 自动化模型,实现简便。我们的案例研究旨在展示 Claude 3.5 Computer Use 的能力和局限性,并进行详细分析,同时提出关于规划、行动和批判的疑问,这些疑问对于未来的改进至关重要。我们希望这一初步探索能够激发 GUI 智能体领域未来的研究。本文中的所有测试案例均可通过以下项目进行尝试:this https URL。
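
下面是一个极简的 Python 骨架,示意"截图 → 模型决策 → 执行动作"的 GUI 智能体主循环。其中 capture_screen、query_model、execute 都是假设的占位函数,并非论文随附框架或 Anthropic 官方 API 的真实调用方式。

```python
# 极简示意:GUI 智能体的"感知-决策-执行"循环骨架(所有函数均为占位)。
import time

def capture_screen():
    """截取当前屏幕,返回图像字节(占位实现)。"""
    return b"fake-screenshot"

def query_model(instruction, screenshot, history):
    """把任务指令、当前截图与历史动作交给多模态模型,返回下一步动作(占位实现)。"""
    return {"type": "click", "x": 100, "y": 200} if not history else {"type": "done"}

def execute(action):
    """在桌面上执行动作,例如鼠标点击或键盘输入(占位实现)。"""
    print("执行动作:", action)

def run_agent(instruction, max_steps=20):
    history = []
    for _ in range(max_steps):
        action = query_model(instruction, capture_screen(), history)
        if action["type"] == "done":
            break
        execute(action)
        history.append(action)
        time.sleep(0.5)  # 给界面留出响应时间
    return history

run_agent("打开浏览器并搜索今天的天气")
```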

[NLP-9] Unveiling Topological Structures in Text: A Comprehensive Survey of Topological Data Analysis Applications in NLP

【速读】: 该论文试图解决在自然语言处理(NLP)领域中,机器学习(ML)技术在处理现实世界数据时面临的挑战,如数据不平衡、噪声、标签不足和高维度问题。解决方案的关键在于引入拓扑数据分析(Topological Data Analysis, TDA),这是一种能够捕捉数据内在形状的统计方法,即使在存在噪声的情况下也能有效工作。论文通过综述85篇相关研究,将这些努力分为理论和非理论两种方法。理论方法试图从拓扑视角解释语言现象,而非理论方法则将TDA与ML特征结合,利用多种数值表示技术。论文最后探讨了该领域存在的挑战和未解决的问题。

链接: https://arxiv.org/abs/2411.10298
作者: Adaku Uchendu,Thai Le
关键词-EN: extract valuable insights, wealth of information, internet has led, computational methods, methods to analyze
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The surge of data available on the internet has led to the adoption of various computational methods to analyze and extract valuable insights from this wealth of information. Among these, the field of Machine Learning (ML) has thrived by leveraging data to extract meaningful insights. However, ML techniques face notable challenges when dealing with real-world data, often due to issues of imbalance, noise, insufficient labeling, and high dimensionality. To address these limitations, some researchers advocate for the adoption of Topological Data Analysis (TDA), a statistical approach that discerningly captures the intrinsic shape of data despite noise. Despite its potential, TDA has not gained as much traction within the Natural Language Processing (NLP) domain compared to structurally distinct areas like computer vision. Nevertheless, a dedicated community of researchers has been exploring the application of TDA in NLP, yielding 85 papers we comprehensively survey in this paper. Our findings categorize these efforts into theoretical and nontheoretical approaches. Theoretical approaches aim to explain linguistic phenomena from a topological viewpoint, while non-theoretical approaches merge TDA with ML features, utilizing diverse numerical representation techniques. We conclude by exploring the challenges and unresolved questions that persist in this niche field. Resources and a list of papers on this topic can be found at: this https URL.
摘要:互联网上可用数据的激增促使人们采用各种计算方法来分析和从中提取有价值的见解。在这些方法中,机器学习(Machine Learning, ML)领域通过利用数据提取有意义的洞察而蓬勃发展。然而,ML技术在处理现实世界数据时面临显著挑战,这通常是由于数据的不平衡、噪声、标签不足和高维度问题。为了解决这些局限性,一些研究人员提倡采用拓扑数据分析(Topological Data Analysis, TDA),这是一种统计方法,能够在噪声存在的情况下敏锐地捕捉数据的内在形状。尽管TDA具有潜力,但与计算机视觉等结构上不同的领域相比,它在自然语言处理(Natural Language Processing, NLP)领域并未获得同样的关注。尽管如此,一个专注的研究群体一直在探索TDA在NLP中的应用,并产生了85篇论文,我们在本文中进行了全面的综述。我们的研究发现将这些努力分为理论和非理论方法。理论方法旨在从拓扑视角解释语言现象,而非理论方法则将TDA与ML特征结合,利用多种数值表示技术。最后,我们探讨了这一小众领域中存在的挑战和未解决的问题。相关资源和论文列表可以在以下链接中找到:this https URL。

[NLP-10] Scaling Law for Post-training after Model Pruning

【速读】: 该论文试图解决大语言模型(LLMs)在经过模型剪枝(model pruning)后,如何有效恢复性能的问题。解决方案的关键在于引入了一种缩放定律(scaling law),用于确定剪枝后模型进行后训练(post-training)所需的最佳数据量。具体来说,论文通过实验发现,剪枝比例越高,后训练所需的数据量越大,而较大的LLMs则需要较少的数据。该缩放定律基于剪枝前后的参数数量以及后训练的token数量来预测模型的损失,并且发现从小型LLMs中建立的缩放定律可以可靠地外推到更大的LLMs。这一研究为剪枝后LLMs的后训练提供了宝贵的见解,并为优化后训练数据的使用提供了实用的缩放定律。

链接: https://arxiv.org/abs/2411.10272
作者: Xiaodong Chen,Yuxuan Hu,Jing Zhang,Xiaokang Zhang,Cuiping Li,Hong Chen
关键词-EN: Large language models, Large language, Transformer architecture, domains and tasks, architecture are widely
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models (LLMs) based on the Transformer architecture are widely employed across various domains and tasks. However, their increasing size imposes significant hardware demands, limiting practical deployment. To mitigate this, model pruning techniques have been developed to create more efficient models while maintaining high performance. Despite this, post-training after pruning is crucial for performance recovery and can be resource-intensive. This paper investigates the post-training requirements of pruned LLMs and introduces a scaling law to determine the optimal amount of post-training data. Post-training experiments with the Llama-3 and Qwen-2.5 series models, pruned using depth pruning, width pruning, and 2:4 semi-structured pruning, show that higher pruning ratios necessitate more post-training data for performance recovery, whereas larger LLMs require less. The proposed scaling law predicts a model’s loss based on its parameter counts before and after pruning, as well as the post-training token counts. Furthermore, we find that the scaling law established from smaller LLMs can be reliably extrapolated to larger LLMs. This work provides valuable insights into the post-training of pruned LLMs and offers a practical scaling law for optimizing post-training data usage.
摘要:基于 Transformer 架构的大语言模型 (LLM) 在各个领域和任务中得到了广泛应用。然而,其规模的不断扩大对硬件提出了显著的需求,限制了实际部署的可行性。为了缓解这一问题,模型剪枝技术应运而生,旨在创建更高效的模型同时保持高性能。尽管如此,剪枝后的训练后处理对于性能恢复至关重要,且可能需要大量资源。本文探讨了剪枝后 LLM 的训练后处理需求,并引入了一种缩放定律来确定最佳的训练后数据量。通过对 Llama-3 和 Qwen-2.5 系列模型进行深度剪枝、宽度剪枝和 2:4 半结构化剪枝后的训练后实验,结果表明,较高的剪枝比例需要更多的训练后数据以恢复性能,而较大的 LLM 则需要较少的数据。所提出的缩放定律根据模型剪枝前后的参数数量以及训练后 Token 数量来预测模型的损失。此外,我们发现从小型 LLM 中建立的缩放定律可以可靠地外推到更大的 LLM 上。这项工作为剪枝后 LLM 的训练后处理提供了宝贵的见解,并为优化训练后数据使用提供了一个实用的缩放定律。
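
下面用 SciPy 给出一个拟合"再训练数据量与损失"关系的极简示意。这里采用的函数形式 L(D) = E + B / D^beta 以及全部数据点都是假设,仅演示如何拟合一条缩放曲线;论文提出的缩放定律还依赖剪枝前后的参数量,此处未体现。

```python
# 极简示意:拟合剪枝后再训练的损失缩放曲线(函数形式与数据均为假设)。
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(d_tokens, E, B, beta):
    # E: 不可约损失;B, beta: 再训练 Token 数 d_tokens 的幂律系数
    return E + B / d_tokens**beta

d = np.array([1e8, 3e8, 1e9, 3e9, 1e10])      # 假设的再训练 Token 数
loss = np.array([3.6, 3.3, 3.0, 2.8, 2.7])    # 假设的对应验证损失

popt, _ = curve_fit(scaling_law, d, loss, p0=[2.5, 50.0, 0.2], maxfev=10000)
print("拟合参数 E, B, beta:", popt)
print("预测 5e9 Token 再训练后的损失:", scaling_law(5e9, *popt))
```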

[NLP-11] Scaling up the Evaluation of Collaborative Problem Solving: Promises and Challenges of Coding Chat Data with ChatGPT

【速读】: 该论文试图解决在协作问题解决 (Collaborative Problem Solving, CPS) 研究中,如何高效编码通信数据以实现大规模评估的挑战。解决方案的关键在于利用 ChatGPT 直接编码 CPS 聊天数据,并通过在多个数据集和编码框架上进行基准测试,发现 ChatGPT 在处理口语化讨论时表现优于人工编码,但在涉及专业科学术语和复杂上下文的任务中表现不足。这一发现为研究人员提供了实用指南,以制定高效且可扩展的通信数据分析策略。

链接: https://arxiv.org/abs/2411.10246
作者: Jiangang Hao,Wenju Cui,Patrick Kyllonen,Emily Kerzabi,Lei Liu,Michael Flor
关键词-EN: Collaborative problem solving, Collaborative problem, century skill, problem solving, widely recognized
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
备注: 21 pages, 3 figures, 5 tables. Initially report in the edArXiv:xw6kz

点击查看摘要

Abstract:Collaborative problem solving (CPS) is widely recognized as a critical 21st century skill. Efficiently coding communication data is a big challenge in scaling up research on assessing CPS. This paper reports the findings on using ChatGPT to directly code CPS chat data by benchmarking performance across multiple datasets and coding frameworks. We found that ChatGPT-based coding outperformed human coding in tasks where the discussions were characterized by colloquial languages but fell short in tasks where the discussions dealt with specialized scientific terminology and contexts. The findings offer practical guidelines for researchers to develop strategies for efficient and scalable analysis of communication data from CPS tasks.
摘要:协作问题解决(Collaborative Problem Solving, CPS)被广泛认为是21世纪的关键技能。在评估CPS的研究中,高效地编码通信数据是一个巨大的挑战。本文报告了使用ChatGPT直接编码CPS聊天数据的发现,通过在多个数据集和编码框架上进行基准测试来评估其性能。我们发现,基于ChatGPT的编码在讨论以口语化语言为特征的任务中优于人工编码,但在讨论涉及专业科学术语和上下文的任务中表现不足。这些发现为研究人员提供了实用的指导,以制定策略来高效且可扩展地分析来自CPS任务的通信数据。
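
下面给出一个调用 OpenAI 接口对 CPS 聊天话语逐条打编码标签的极简示意。其中模型名称、编码类别与提示词均为假设,仅演示"用大模型直接编码通信数据"的基本流程,并非论文使用的确切配置。

```python
# 极简示意:用大模型按给定编码框架标注 CPS 聊天话语(类别与提示词均为假设)。
from openai import OpenAI

client = OpenAI()  # 需要设置 OPENAI_API_KEY 环境变量

CODES = ["分享信息", "提出方案", "协商分工", "维持社交", "其他"]

def code_utterance(utterance):
    prompt = ("你是协作问题解决(CPS)研究的编码员。"
              f"请把下面这句聊天话语归入以下类别之一:{'、'.join(CODES)}。"
              f"只输出类别名称。\n话语:{utterance}")
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

print(code_utterance("要不你负责查资料,我来写报告?"))
```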

[NLP-12] Measuring Non-Adversarial Reproduction of Training Data in Large Language Models

【速读】: 该论文试图解决生成式语言模型在自然和良性提示下对训练数据的非对抗性再现问题。解决方案的关键在于量化模型输出与预训练数据之间的重叠,并研究如何通过提示策略减少这种重叠,特别是在最坏情况下的再现。研究发现,尽管适当的提示策略可以平均减少非对抗性再现,但在良性交互中完全消除训练数据的再现仍需更强的防御措施。

链接: https://arxiv.org/abs/2411.10242
作者: Michael Aerni,Javier Rando,Edoardo Debenedetti,Nicholas Carlini,Daphne Ippolito,Florian Tramèr
关键词-EN: Large language models, Large language, models memorize parts, language models memorize, memorize parts
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models memorize parts of their training data. Memorizing short snippets and facts is required to answer questions about the world and to be fluent in any language. But models have also been shown to reproduce long verbatim sequences of memorized text when prompted by a motivated adversary. In this work, we investigate an intermediate regime of memorization that we call non-adversarial reproduction, where we quantify the overlap between model responses and pretraining data when responding to natural and benign prompts. For a variety of innocuous prompt categories (e.g., writing a letter or a tutorial), we show that up to 15% of the text output by popular conversational language models overlaps with snippets from the Internet. In worst cases, we find generations where 100% of the content can be found exactly online. For the same tasks, we find that human-written text has far less overlap with Internet data. We further study whether prompting strategies can close this reproduction gap between models and humans. While appropriate prompting can reduce non-adversarial reproduction on average, we find that mitigating worst-case reproduction of training data requires stronger defenses – even for benign interactions.
摘要:大语言模型会记忆部分训练数据。记忆短片段和事实是回答关于世界的问题和流利使用任何语言的必要条件。但已有研究表明,当受到有动机的敌对者提示时,模型也会复制记忆文本中的长篇逐字序列。在本研究中,我们探讨了一种称为非敌对复制的中间记忆状态,即在自然和良性提示下,量化模型响应与预训练数据之间的重叠。对于多种无害的提示类别(例如,撰写信件或教程),我们发现流行对话语言模型的文本输出中,高达15%的内容与互联网片段重叠。在最坏情况下,我们发现生成的内容中100%的内容可以完全在线上找到。对于相同的任务,我们发现人类撰写的文本与互联网数据的重叠度要低得多。我们进一步研究了提示策略是否能缩小模型与人类之间的复制差距。虽然适当的提示可以平均减少非敌对复制,但我们发现,减轻最坏情况下的训练数据复制需要更强的防御措施——即使在良性互动中也是如此。
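
下面用纯 Python 给出一个衡量"模型输出与语料重叠程度"的粗略示意:统计输出中有多大比例的字符被某个长度不小于 k、且能在语料中原样找到的片段覆盖。阈值 k 与具体度量方式均为假设,并非论文采用的精确定义。

```python
# 极简示意:基于固定长度片段匹配的重叠比例(阈值与度量方式为假设)。
def overlap_ratio(generation, corpus, k=50):
    n = len(generation)
    if n < k:
        return 1.0 if generation in corpus else 0.0
    covered = [False] * n
    for i in range(n - k + 1):
        if generation[i:i + k] in corpus:      # 朴素子串查找,真实语料应改用后缀数组等索引
            for j in range(i, i + k):
                covered[j] = True
    return sum(covered) / n

corpus = "To be, or not to be, that is the question: whether 'tis nobler in the mind to suffer..."
gen = "He wrote that to be, or not to be, that is the question we all face."
print(round(overlap_ratio(gen, corpus, k=20), 3))
```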

[NLP-13] Entropy and type-token ratio in gigaword corpora

【速读】: 该论文试图解决词汇多样性(lexical diversity)在不同语言和文本类型中的量化问题,特别是如何将这一概念操作化。解决方案的关键在于研究熵(entropy)和文本-词元比率(text-token ratio)这两个广泛使用的词汇多样性指标,并在包含英语、西班牙语和土耳其语的大规模语言数据集中进行验证。研究发现,熵和文本-词元比率之间存在一种跨数据集的功能关系,并且在词汇量较大的情况下,这种关系可以通过一个解析表达式来解释,这与Zipf定律和Heaps定律有关。这一发现不仅深化了对文本结构理论的理解,还为自然语言处理等领域的实际应用提供了新的视角。

链接: https://arxiv.org/abs/2411.10227
作者: Pablo Rosillo-Rodes,Maxi San Miguel,David Sanchez
关键词-EN: Lexical diversity measures, measures the vocabulary, vocabulary variation, Lexical diversity, diversity measures
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR); Physics and Society (physics.soc-ph)
备注: 12 pages, 10 figures, 7 tables

点击查看摘要

Abstract:Lexical diversity measures the vocabulary variation in texts. While its utility is evident for analyses in language change and applied linguistics, it is not yet clear how to operationalize this concept in a unique way. We here investigate entropy and text-token ratio, two widely employed metrics for lexical diversities, in six massive linguistic datasets in English, Spanish, and Turkish, consisting of books, news articles, and tweets. These gigaword corpora correspond to languages with distinct morphological features and differ in registers and genres, thus constituting a diverse testbed for a quantitative approach to lexical diversity. Strikingly, we find a functional relation between entropy and text-token ratio that holds across the corpora under consideration. Further, in the limit of large vocabularies we find an analytical expression that sheds light on the origin of this relation and its connection with both Zipf and Heaps laws. Our results then contribute to the theoretical understanding of text structure and offer practical implications for fields like natural language processing.
摘要:词汇多样性衡量文本中词汇的变化情况。尽管其在语言变化和应用语言学分析中的实用性显而易见,但如何以独特的方式将这一概念操作化尚不明确。本文研究了熵和文本-Token比率这两种广泛用于衡量词汇多样性的指标,在六个大规模的英语、西班牙语和土耳其语语料库中,这些语料库包括书籍、新闻文章和推文。这些语料库对应于具有不同形态特征的语言,并且在语域和体裁上有所不同,因此构成了一个多样化的测试平台,用于定量研究词汇多样性。值得注意的是,我们发现熵和文本-Token比率之间存在一种功能关系,这种关系在所考虑的语料库中普遍成立。此外,在大词汇量的极限情况下,我们找到了一个解析表达式,揭示了这种关系的起源及其与Zipf定律和Heaps定律的联系。我们的研究结果有助于深化对文本结构理论的理解,并为自然语言处理等领域提供了实际应用的启示。
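
论文讨论的两个指标都可以直接计算:熵 H = -Σ p_i log2 p_i,类符/形符比 TTR = V/N。下面是一个最朴素的示意(仅按空格分词,真实语料需要更严谨的分词与预处理)。

```python
# 极简示意:计算一段文本的香农熵与类符/形符比。
import math
from collections import Counter

def lexical_diversity(text):
    tokens = text.lower().split()
    counts = Counter(tokens)
    n = len(tokens)                      # 形符数 N
    v = len(counts)                      # 类符数 V(不同词的数量)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return entropy, v / n

text = "the cat sat on the mat and the dog sat on the rug"
h, ttr = lexical_diversity(text)
print(f"entropy = {h:.3f} bits, type-token ratio = {ttr:.3f}")
```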

[NLP-14] Increasing the Accessibility of Causal Domain Knowledge via Causal Information Extraction Methods: A Case Study in the Semiconductor Manufacturing Industry

【速读】: 该论文试图解决从半导体制造行业的实际文档中自动化提取因果信息的问题,以帮助识别和缓解潜在故障、提高过程效率、促进质量改进并应对各种运营挑战。解决方案的关键在于开发了两种因果信息提取方法:单阶段序列标注 (Single-Stage Sequence Tagging, SST) 和多阶段序列标注 (Multi-Stage Sequence Tagging, MST)。研究通过评估这些方法在半导体制造公司现有文档(包括演示文稿和FMEA文档)上的表现,发现MST方法在提取因果信息方面特别适用于半结构化文档(如FMEA),并能达到93%的F1分数。此外,MST在从演示文稿中提取文本时也能达到73%的F1分数。研究还强调了选择与领域更匹配的语言模型以及进行领域内微调的重要性。

链接: https://arxiv.org/abs/2411.10172
作者: Houssam Razouk,Leonie Benischke,Daniel Garber,Roman Kern
关键词-EN: enhancing process efficiency, prompting quality improvements, causal information extraction, mitigating potential failures, causal information
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 17 pages, 2 figures

点击查看摘要

Abstract:The extraction of causal information from textual data is crucial in the industry for identifying and mitigating potential failures, enhancing process efficiency, prompting quality improvements, and addressing various operational challenges. This paper presents a study on the development of automated methods for causal information extraction from actual industrial documents in the semiconductor manufacturing industry. The study proposes two types of causal information extraction methods, single-stage sequence tagging (SST) and multi-stage sequence tagging (MST), and evaluates their performance using existing documents from a semiconductor manufacturing company, including presentation slides and FMEA (Failure Mode and Effects Analysis) documents. The study also investigates the effect of representation learning on downstream tasks. The presented case study showcases that the proposed MST methods for extracting causal information from industrial documents are suitable for practical applications, especially for semi structured documents such as FMEAs, with a 93% F1 score. Additionally, MST achieves a 73% F1 score on texts extracted from presentation slides. Finally, the study highlights the importance of choosing a language model that is more aligned with the domain and in-domain fine-tuning.
摘要:从文本数据中提取因果信息在工业领域至关重要,它有助于识别和缓解潜在故障、提升流程效率、促进质量改进以及应对各种运营挑战。本文研究了从半导体制造业实际文档中开发自动化因果信息提取方法的过程。研究提出了两种因果信息提取方法:单阶段序列标注 (Single-Stage Sequence Tagging, SST) 和多阶段序列标注 (Multi-Stage Sequence Tagging, MST),并使用半导体制造公司的现有文档(包括演示文稿和失效模式与影响分析 (Failure Mode and Effects Analysis, FMEA) 文档)评估了它们的性能。研究还探讨了表示学习对下游任务的影响。案例研究展示,所提出的 MST 方法在从工业文档中提取因果信息方面适用于实际应用,特别是在 FMEA 等半结构化文档中,其 F1 得分为 93%。此外,MST 在从演示文稿中提取的文本上达到了 73% 的 F1 得分。最后,研究强调了选择更符合领域特性的语言模型以及进行领域内微调的重要性。
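
下面用 Hugging Face transformers 给出一个把因果信息抽取当作 BIO 序列标注的极简示意。标签集合与底座模型(bert-base-uncased)均为假设,仅演示序列标注的调用方式;按论文结论,实际应选择与领域更匹配的模型并做领域内微调。

```python
# 极简示意:BIO 序列标注式的因果片段抽取(标签与底座模型均为假设,未经微调)。
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-CAUSE", "I-CAUSE", "B-EFFECT", "I-EFFECT"]
name = "bert-base-uncased"  # 占位底座,实际应替换为领域内微调后的模型
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForTokenClassification.from_pretrained(
    name, num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={l: i for i, l in enumerate(labels)},
)

text = "Excessive chamber temperature causes wafer surface defects."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits            # [1, seq_len, num_labels]
pred_ids = logits.argmax(dim=-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print(list(zip(tokens, [labels[i] for i in pred_ids])))  # 未微调前输出基本是随机标签
```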

[NLP-15] Evaluating the role of `Constitutions' for learning from AI feedback NEURIPS2024

【速读】: 该论文试图解决的问题是如何通过使用不同的“宪法”(constitutions)来提高大型语言模型(LLMs)在医疗访谈中以患者为中心的沟通质量。解决方案的关键在于比较四种不同的宪法对反馈质量的影响,特别是它们在情感质量和信息收集与提供方面的表现。研究发现,详细的宪法在情感质量方面表现更好,但在信息收集和提供等实用技能方面,没有任何宪法优于基线模型。这表明,尽管应优先考虑详细的宪法,但在某些领域,AI反馈作为奖励信号的有效性可能存在局限性。

链接: https://arxiv.org/abs/2411.10168
作者: Saskia Redgate,Andrew M. Bean,Adam Mahdi
关键词-EN: large language models, growing capabilities, capabilities of large, large language, training and assessing
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 4 pages, 2 figures. In NeurIPS 2024 Workshop on Language Gamification

点击查看摘要

Abstract:The growing capabilities of large language models (LLMs) have led to their use as substitutes for human feedback for training and assessing other LLMs. These methods often rely on `constitutions’, written guidelines which a critic model uses to provide feedback and improve generations. We investigate how the choice of constitution affects feedback quality by using four different constitutions to improve patient-centered communication in medical interviews. In pairwise comparisons conducted by 215 human raters, we found that detailed constitutions led to better results regarding emotive qualities. However, none of the constitutions outperformed the baseline in learning more practically-oriented skills related to information gathering and provision. Our findings indicate that while detailed constitutions should be prioritised, there are possible limitations to the effectiveness of AI feedback as a reward signal in certain areas.
摘要:大语言模型(LLMs)能力的不断提升,使得它们被用作替代人类反馈,用于训练和评估其他大语言模型。这些方法通常依赖于“宪法”,即批评模型用于提供反馈和改进生成的书面指导方针。我们通过使用四种不同的宪法来改进医疗访谈中的以患者为中心的沟通,研究了宪法选择对反馈质量的影响。在由215名人类评分者进行的成对比较中,我们发现详细的宪法在情感质量方面带来了更好的结果。然而,在涉及信息收集和提供的更具实践导向的技能学习方面,没有任何宪法优于基线。我们的研究结果表明,尽管应优先考虑详细的宪法,但在某些领域,AI反馈作为奖励信号的有效性可能存在局限性。

[NLP-16] Compound-QA: A Benchmark for Evaluating LLMs on Compound Questions

【速读】: 该论文试图解决现有大型语言模型(LLMs)评估基准在处理复杂交互问题时的不足,特别是忽视了真实应用中复合问题的处理能力。解决方案的关键在于引入了复合问题合成(Compound Question Synthesis, CQ-Syn)方法,创建了Compound-QA基准,该基准专注于包含多个子问题的复合问题。通过从现有问答数据集中提取并人工验证,Compound-QA涵盖了事实陈述、因果关系、假设分析、比较选择和评估建议五个类别,从理解、推理和知识三个维度评估LLMs的能力。研究结果表明,复合问题处理能力显著低于单一问题,但通过多种方法可以显著提升模型在复合问题上的理解和推理能力。

链接: https://arxiv.org/abs/2411.10163
作者: Yutao Hou,Yajing Luo,Zhiwen Ruan,Hongru Wang,Weifeng Ge,Yun Chen,Guanhua Chen
关键词-EN: Large language models, develop diverse evaluation, Large language, diverse evaluation benchmarks, demonstrate remarkable performance
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) demonstrate remarkable performance across various tasks, prompting researchers to develop diverse evaluation benchmarks. However, existing benchmarks typically measure the ability of LLMs to respond to individual questions, neglecting the complex interactions in real-world applications. In this paper, we introduce Compound Question Synthesis (CQ-Syn) to create the Compound-QA benchmark, focusing on compound questions with multiple sub-questions. This benchmark is derived from existing QA datasets, annotated with proprietary LLMs and verified by humans for accuracy. It encompasses five categories: Factual-Statement, Cause-and-Effect, Hypothetical-Analysis, Comparison-and-Selection, and Evaluation-and-Suggestion. It evaluates the LLM capability in terms of three dimensions including understanding, reasoning, and knowledge. Our assessment of eight open-source LLMs using Compound-QA reveals distinct patterns in their responses to compound questions, which are significantly poorer than those to non-compound questions. Additionally, we investigate various methods to enhance LLMs performance on compound questions. The results indicate that these approaches significantly improve the models’ comprehension and reasoning abilities on compound questions.
摘要:大语言模型(Large Language Models, LLMs)在各种任务中展现出卓越的性能,促使研究人员开发多样化的评估基准。然而,现有的基准通常仅测量LLMs对单个问题的响应能力,忽略了现实应用中复杂的交互情况。本文中,我们引入了复合问题合成(Compound Question Synthesis, CQ-Syn)以创建复合问答基准(Compound-QA),专注于包含多个子问题的复合问题。该基准源自现有的问答数据集,通过专有的LLMs进行标注,并由人工验证其准确性。它涵盖五个类别:事实陈述(Factual-Statement)、因果关系(Cause-and-Effect)、假设分析(Hypothetical-Analysis)、比较与选择(Comparison-and-Selection)以及评估与建议(Evaluation-and-Suggestion)。该基准从理解、推理和知识三个维度评估LLM的能力。我们使用Compound-QA对八个开源LLM进行评估,发现它们在复合问题上的响应明显不如非复合问题。此外,我们还研究了多种提升LLM在复合问题上性能的方法。结果表明,这些方法显著提高了模型在复合问题上的理解和推理能力。

[NLP-17] An Effective Framework to Help Large Language Models Handle Numeric-involved Long-context Tasks

【速读】: 该论文试图解决大型语言模型(LLMs)在处理涉及数值计算的长上下文任务时性能显著下降的问题。解决方案的关键在于提出了一种工作流程,将数值相关的长上下文任务分解为四个低级子任务:判断、提取、代码处理和结论生成。前两个子任务相对简单,可以使用较小的模型高效处理长上下文;而在需要数值计算时,利用LLMs生成的代码来执行计算,从而避免了LLMs在计算能力上的不足。该方法不仅提高了任务的准确性,还显著降低了API调用的成本。

链接: https://arxiv.org/abs/2411.10145
作者: Yijiong Yu
关键词-EN: Large Language Models, Large Language, demonstrated remarkable capabilities, traditional retrieval tasks, Language Models
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities in handling long texts and have almost perfect performance in traditional retrieval tasks. However, their performance significantly degrades when it comes to numerical calculations in the long-context. Numeric-involved long-context tasks typically cannot be addressed by current LLMs in normal settings due to their inherent limitations in simultaneously handling complex and massive information. Some CoT like prompting methods can improve accuracy but demands massive output tokens, which is costly and slow. To address this issue, we propose a workflow, which decompose a numeric-involved long-context task into 4 low-level subtasks: judging, extracting and processing with code and conclusion. The former 2 subtasks is relatively simple, which allows us to use smaller models for efficiently processing long context. When numerical calculations are required, we use code generated by LLMs to avoid the disadvantage of LLM not being good at calculations. The results in 2 numeric-involved long-context benchmarks demonstrate our workflow can not only improve accuracy, but also significantly reduce the cost of API calls.
摘要:大语言模型(LLMs)在处理长文本方面展现了卓越的能力,并且在传统的检索任务中几乎达到了完美的表现。然而,当涉及到长上下文中的数值计算时,其性能显著下降。由于其固有的局限性,即难以同时处理复杂和大量的信息,当前的 LLMs 在常规设置下通常无法解决涉及数值的长上下文任务。一些类似于思维链(CoT)的提示方法虽然可以提高准确性,但需要大量的输出 Token,这既昂贵又缓慢。为了解决这一问题,我们提出了一种工作流程,将涉及数值的长上下文任务分解为四个低级子任务:判断、提取、代码处理和结论。前两个子任务相对简单,允许我们使用较小的模型来高效处理长上下文。当需要数值计算时,我们使用由 LLMs 生成的代码来避免 LLM 在计算方面的不足。在两个涉及数值的长上下文基准测试中的结果表明,我们的工作流程不仅能提高准确性,还能显著降低 API 调用的成本。
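
下面用一个极简的 Python 骨架示意该工作流程的拆解方式:小模型判断并抽取数值、大模型生成计算代码、执行代码、给出结论。其中 small_model_extract 与 llm_write_code 均为假设的占位函数,并非论文的官方实现。

```python
# 极简示意:"判断/抽取 -> 生成代码 -> 执行 -> 结论"的流水线骨架(占位函数)。
def small_model_extract(long_document, question):
    """用较小的模型从长文档中判断并抽取与问题相关的数值(占位实现)。"""
    return {"q3_revenue": 1250.0, "q4_revenue": 1410.0}

def llm_write_code(question, extracted):
    """让大模型针对问题与抽取结果生成一段计算代码(占位实现)。"""
    return "result = (extracted['q4_revenue'] - extracted['q3_revenue']) / extracted['q3_revenue']"

def answer(long_document, question):
    extracted = small_model_extract(long_document, question)
    code = llm_write_code(question, extracted)
    scope = {"extracted": extracted}
    exec(code, scope)                      # 用代码执行数值计算,绕开大模型不擅长算术的问题
    return f"环比增长率约为 {scope['result']:.2%}"

print(answer("<一份很长的财报文本>", "第四季度营收相比第三季度增长了多少?"))
```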

[NLP-18] Legal Evalutions and Challenges of Large Language Models

【速读】: 该论文试图解决大型语言模型(LLMs)在法律领域的应用问题,特别是评估这些模型在理解和应用法律文本、推理法律问题以及预测判决方面的表现。解决方案的关键在于系统性地测试和比较当前最先进的LLMs,包括开源、闭源以及专门为法律领域训练的模型,通过分析其在英汉法律案例中的表现,揭示LLMs在法律应用中的潜力与局限性,尤其是法律语言解释和法律推理准确性方面的挑战。

链接: https://arxiv.org/abs/2411.10137
作者: Jiaqi Wang,Huan Zhao,Zhenyuan Yang,Peng Shu,Junhao Chen,Haobo Sun,Ruixi Liang,Shixin Li,Pengcheng Shi,Longjun Ma,Zongjia Liu,Zhengliang Liu,Tianyang Zhong,Yutong Zhang,Chong Ma,Xin Zhang,Tuo Zhang,Tianli Ding,Yudan Ren,Tianming Liu,Xi Jiang,Shu Zhang
关键词-EN: Large Language Models, testing methods based, large models, Large Language, applying legal provisions
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In this paper, we review legal testing methods based on Large Language Models (LLMs), using the OPENAI o1 model as a case study to evaluate the performance of large models in applying legal provisions. We compare current state-of-the-art LLMs, including open-source, closed-source, and legal-specific models trained specifically for the legal domain. Systematic tests are conducted on English and Chinese legal cases, and the results are analyzed in depth. Through systematic testing of legal cases from common law systems and China, this paper explores the strengths and weaknesses of LLMs in understanding and applying legal texts, reasoning through legal issues, and predicting judgments. The experimental results highlight both the potential and limitations of LLMs in legal applications, particularly in terms of challenges related to the interpretation of legal language and the accuracy of legal reasoning. Finally, the paper provides a comprehensive analysis of the advantages and disadvantages of various types of models, offering valuable insights and references for the future application of AI in the legal field.
摘要:本文综述了基于大语言模型 (LLM) 的法律测试方法,以 OPENAI o1 模型为例,评估大型模型在应用法律条款方面的性能。我们对比了当前最先进的 LLM,包括开源、闭源以及专门为法律领域训练的法律专用模型。系统测试了英文和中文的法律案例,并对结果进行了深入分析。通过对普通法系和中国法律案例的系统测试,本文探讨了 LLM 在理解和应用法律文本、通过法律问题进行推理以及预测判决方面的优势和劣势。实验结果突显了 LLM 在法律应用中的潜力和局限性,特别是在法律语言解释和法律推理准确性方面的挑战。最后,本文对各类模型的优缺点进行了全面分析,为未来 AI 在法律领域的应用提供了宝贵的见解和参考。

[NLP-19] Prompting and Fine-tuning Large Language Models for Automated Code Review Comment Generation

【速读】: 该论文试图解决生成准确代码审查评论的挑战,这一挑战源于任务输出的多样性和非唯一性。解决方案的关键在于采用参数高效、量化的低秩微调(QLoRA)方法对开源大型语言模型(LLM)进行微调,并在消费级硬件上实现这一过程。此外,论文还探索了通过增强语义元数据信息到提示中,以提升代码审查评论生成的性能。具体来说,通过在输入代码补丁中加入函数调用图和代码摘要,利用GPT-3.5模型进行少样本提示,显著提升了生成评论的质量,BLEU-4评分比预训练基线提高了约90%。同时,通过QLoRA微调的Code Llama和Llama 3.1模型,以及少样本提示的Gemini-1.0 Pro模型,也在该任务上取得了竞争性的结果,性能提升范围从25%到83%不等。

链接: https://arxiv.org/abs/2411.10129
作者: Md. Asif Haider,Ayesha Binte Mostofa,Sk. Sabit Bin Mosaddek,Anindya Iqbal,Toufique Ahmed
关键词-EN: Generating accurate code, Generating accurate, significant challenge due, remains a significant, significant challenge
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Generating accurate code review comments remains a significant challenge due to the inherently diverse and non-unique nature of the task output. Large language models pretrained on both programming and natural language data tend to perform well in code-oriented tasks. However, large-scale pretraining is not always feasible due to its environmental impact and project-specific generalizability issues. In this work, first we fine-tune open-source Large language models (LLM) in parameter-efficient, quantized low-rank (QLoRA) fashion on consumer-grade hardware to improve review comment generation. Recent studies demonstrate the efficacy of augmenting semantic metadata information into prompts to boost performance in other code-related tasks. To explore this in code review activities, we also prompt proprietary, closed-source LLMs augmenting the input code patch with function call graphs and code summaries. Both of our strategies improve the review comment generation performance, with function call graph augmented few-shot prompting on the GPT-3.5 model surpassing the pretrained baseline by around 90% BLEU-4 score on the CodeReviewer dataset. Moreover, few-shot prompted Gemini-1.0 Pro, QLoRA fine-tuned Code Llama and Llama 3.1 models achieve competitive results (ranging from 25% to 83% performance improvement) on this task. An additional human evaluation study further validates our experimental findings, reflecting real-world developers’ perceptions of LLM-generated code review comments based on relevant qualitative metrics.
摘要:生成准确的代码审查评论仍然是一个重大挑战,这主要归因于任务输出的固有多样性和非唯一性。在编程和自然语言数据上预训练的大语言模型在面向代码的任务中表现出色。然而,由于其环境影响和项目特定泛化性问题,大规模预训练并不总是可行。在本研究中,首先我们在消费级硬件上以参数高效、量化的低秩(QLoRA)方式微调开源大语言模型(LLM),以改进审查评论生成。最近的研究表明,将语义元数据信息融入提示中可以提升其他代码相关任务的性能。为了在代码审查活动中探索这一点,我们还通过增加函数调用图和代码摘要来提示专有的闭源大语言模型。这两种策略都提高了审查评论生成的性能,其中在GPT-3.5模型上使用函数调用图增强的少样本提示在CodeReviewer数据集上的BLEU-4评分比预训练基线高出约90%。此外,少样本提示的Gemini-1.0 Pro、QLoRA微调的Code Llama和Llama 3.1模型在该任务上也取得了有竞争力的结果(性能提升范围从25%到83%)。额外的人类评估研究进一步验证了我们的实验结果,反映了基于相关定性指标的LLM生成的代码审查评论在实际开发者中的感知。
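
下面给出一个 QLoRA(4bit 量化 + LoRA)微调开源模型的极简示意,使用 transformers、bitsandbytes 与 peft。模型名称、LoRA 超参数与样本组织方式均为假设,仅演示在消费级显卡上做参数高效微调的常见写法,并非论文的确切配置。

```python
# 极简示意:QLoRA 微调代码审查评论生成模型(模型与超参数均为假设,需 GPU 与 bitsandbytes)。
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

name = "codellama/CodeLlama-7b-hf"
bnb = BitsAndBytesConfig(load_in_4bit=True,
                         bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, quantization_config=bnb, device_map="auto")
model = prepare_model_for_kbit_training(model)

lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()   # 只训练极小比例的参数

# 训练样本的组织方式(假设):把代码补丁与人工审查评论拼成一条指令式文本,
# 之后可用 trl 的 SFTTrainer 等在这类 (diff, 评论) 对上继续训练。
example = ("### Code diff:\n- if (x = 1) {\n+ if (x == 1) {\n"
           "### Review comment:\n赋值运算符应为相等比较,建议使用 ==。")
```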

[NLP-20] Memorization in Attention-only Transformers AISTATS2025

【速读】: 该论文试图解决现有研究在多注意力头机制(multi-head attention)记忆能力分析中对上下文大小限制不切实际的问题。解决方案的关键在于提出了一种新的证明方法,适用于基于语言的Transformer模型,并将其扩展到任意上下文大小。该方法通过引入注意力层实现更有效的精确记忆(exact memorization),并首次提出了分布的近似记忆(approximate memorization of distributions)概念。实验验证表明,所提出的界限更准确地反映了语言模型的真实记忆能力,并与先前的工作进行了精确比较。

链接: https://arxiv.org/abs/2411.10115
作者: Léo Dana,Muni Sreenivas Pydi,Yann Chevaleyre
关键词-EN: Recent research, context size, research has explored, findings are constrained, constrained by unrealistic
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 16 pages, 6 figures, submitted to AISTATS 2025,

点击查看摘要

Abstract:Recent research has explored the memorization capacity of multi-head attention, but these findings are constrained by unrealistic limitations on the context size. We present a novel proof for language-based Transformers that extends the current hypothesis to any context size. Our approach improves upon the state-of-the-art by achieving more effective exact memorization with an attention layer, while also introducing the concept of approximate memorization of distributions. Through experimental validation, we demonstrate that our proposed bounds more accurately reflect the true memorization capacity of language models, and provide a precise comparison with prior work.
摘要:近期研究探讨了多头注意力机制的记忆能力,但这些发现受限于对上下文大小的不切实际的限制。我们提出了一种基于语言的 Transformer 的新颖证明,将当前假设扩展到任意上下文大小。我们的方法通过注意力层实现了更有效的精确记忆,同时引入了分布的近似记忆概念,从而超越了现有技术水平。通过实验验证,我们展示了所提出的界限更准确地反映了语言模型的真实记忆能力,并提供了与先前工作的精确比较。

[NLP-21] Xmodel-1.5: An 1B-scale Multilingual LLM

【速读】: 该论文试图解决多语言自然语言处理任务中的性能问题,特别是提升模型在非英语语言(如泰语、阿拉伯语和法语)中的表现。解决方案的关键在于引入了一个名为Xmodel-1.5的10亿参数多语言大模型,该模型在约2万亿个token上进行了预训练,并在多个语言上展示了强大的性能。此外,论文还贡献了一个由朱拉隆功大学集成创新学院学生标注的泰语评估数据集,以促进多语言AI研究的发展。

链接: https://arxiv.org/abs/2411.10083
作者: Wang Qun,Liu Yang,Lin Qingquan,Jiang Ling
关键词-EN: large model pretrained, trillion tokens, pretrained on approximately, Chulalongkorn University School, Chinese and English
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We introduce Xmodel-1.5, a novel 1-billion-parameter multilingual large model pretrained on approximately 2 trillion tokens. The model demonstrates strong performance across several languages, with particularly notable results in Thai, Arabic, and French, alongside its effectiveness in Chinese and English. In addition, we contribute to the research community by releasing a Thai evaluation dataset, which includes hundreds of questions annotated by students from Chulalongkorn University’s School of Integrated Innovation. While the results are promising, we acknowledge that there is still room for improvement. We hope this work advances ongoing efforts in multilingual AI research and promotes better cross-linguistic understanding in various natural language processing tasks. Our models and code are publicly available on GitHub at this https URL.
摘要:我们介绍了 Xmodel-1.5,这是一种新颖的 10 亿参数的多语言大语言模型,预训练于约 2 万亿个 Token 上。该模型在多种语言中表现出强劲的性能,特别是在泰语、阿拉伯语和法语中取得了显著成果,同时在中文和英语中也展现出有效性。此外,我们通过发布一个泰语评估数据集,为研究社区做出了贡献,该数据集包含数百个由朱拉隆功大学综合创新学院的学生标注的问题。尽管结果令人鼓舞,但我们承认仍有改进空间。我们希望这项工作能够推动多语言 AI 研究的不断进展,并促进在各种自然语言处理任务中更好地实现跨语言理解。我们的模型和代码已在 GitHub 上公开,链接为 https URL。

[NLP-22] Understanding The Effect Of Temperature On Alignment With Human Opinions

【速读】: 该论文试图解决的问题是如何从大型语言模型(LLMs)中提取与人类观点一致的意见分布,并探讨这些模型在多大程度上反映了人类的主观性。解决方案的关键在于通过实验分析三种直接的方法(采样、对数概率和直接提示)来获取意见分布,并评估这些方法在主观任务中的表现。研究发现,通过简单的参数调整,采样和对数概率方法能够返回更一致的输出,但同时也指出,假设模型完全反映人类观点可能存在局限性,这强调了进一步研究人类主观性如何影响模型不确定性的必要性。

链接: https://arxiv.org/abs/2411.10080
作者: Maja Pavlovic,Massimo Poesio
关键词-EN: recent studies focus, effectively extract aligned, capabilities of LLMs, recent studies, increasing capabilities
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:With the increasing capabilities of LLMs, recent studies focus on understanding whose opinions are represented by them and how to effectively extract aligned opinion distributions. We conducted an empirical analysis of three straightforward methods for obtaining distributions and evaluated the results across a variety of metrics. Our findings suggest that sampling and log-probability approaches with simple parameter adjustments can return better aligned outputs in subjective tasks compared to direct prompting. Yet, assuming models reflect human opinions may be limiting, highlighting the need for further research on how human subjectivity affects model uncertainty.
摘要:随着大语言模型(LLM)能力的不断提升,近期的研究重点在于理解这些模型所代表的观点来源,以及如何有效地提取与其一致的观点分布。我们针对三种直接的方法进行了实证分析,以获取这些分布,并在多种评价指标下对结果进行了评估。研究结果表明,在主观任务中,通过简单的参数调整,采样和基于对数概率的方法相较于直接提示能够返回更为一致的输出。然而,假设模型完全反映人类观点可能存在局限性,这突显了进一步研究人类主观性如何影响模型不确定性的必要性。
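
下面用 NumPy 示意论文比较的两种获取观点分布的方式:直接对各选项的对数几率做温度缩放的 softmax,或按该分布多次采样后统计频率。其中 get_option_logits 为假设的占位函数,实际应从模型的 log-probability 接口读取。

```python
# 极简示意:温度缩放的 log-prob 分布 vs 重复采样估计的分布(选项与 logits 均为占位)。
import numpy as np

OPTIONS = ["非常同意", "同意", "不同意", "非常不同意"]

def get_option_logits(question):
    """假设:返回模型对各选项(首 Token)的未归一化对数几率。"""
    return np.array([2.1, 1.3, 0.4, -0.8])

def logprob_distribution(question, temperature=1.0):
    logits = get_option_logits(question) / temperature
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

def sampling_distribution(question, temperature=1.0, n=200, seed=0):
    rng = np.random.default_rng(seed)
    probs = logprob_distribution(question, temperature)
    draws = rng.choice(len(OPTIONS), size=n, p=probs)   # 模拟 n 次独立采样
    return np.bincount(draws, minlength=len(OPTIONS)) / n

q = "政府应当提高最低工资吗?"
print("log-prob 分布:", logprob_distribution(q, temperature=1.5).round(3))
print("采样估计分布:", sampling_distribution(q, temperature=1.5).round(3))
```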

[NLP-23] Layer Importance and Hallucination Analysis in Large Language Models via Enhanced Activation Variance-Sparsity

【速读】: 该论文试图解决大型语言模型(LLMs)中不同层的重要性评估问题,并提出了一种解决方案来优化模型性能和解释性。解决方案的关键在于引入激活方差-稀疏度评分(Activation Variance-Sparsity Score, AVSS),通过结合归一化激活方差和稀疏度来量化每一层对整体模型性能的贡献。论文通过实验证明,基于AVSS的层重要性排序和剪枝策略可以保留超过90%的原始性能,揭示了LLM架构中的潜在冗余。此外,论文还提出了增强版的AVSS(EAVSS),专门用于评估各层的幻觉生成倾向,通过引入幻觉特定激活方差(Hallucination-Specific Activation Variance, HSAV)和幻觉特定稀疏度(Hallucination-Specific Sparsity, HSS)指标,精确识别易产生幻觉的层,并通过对比学习在这些层上进行优化,有效减少了幻觉生成,最大性能提升达12%。该方法在多个数据集上验证了其有效性,为LLMs的层重要性评估和幻觉缓解提供了全面的框架。

链接: https://arxiv.org/abs/2411.10069
作者: Zichen Song,Sitan Huang,Yuxin Wu,Zhongfeng Kang
关键词-EN: optimizing model performance, large language models, Activation Variance-Sparsity Score, crucial for optimizing, model performance
类目: Computation and Language (cs.CL); Performance (cs.PF)
备注: 20 pages, 5 figures

点击查看摘要

Abstract:Evaluating the importance of different layers in large language models (LLMs) is crucial for optimizing model performance and interpretability. This paper first explores layer importance using the Activation Variance-Sparsity Score (AVSS), which combines normalized activation variance and sparsity to quantify each layer’s contribution to overall model performance. By ranking layers based on AVSS and pruning the least impactful 25%, our experiments on tasks such as question answering, language modeling, and sentiment classification show that over 90% of the original performance is retained, highlighting potential redundancies in LLM architectures. Building on AVSS, we propose an enhanced version tailored to assess hallucination propensity across layers (EAVSS). This improved approach introduces Hallucination-Specific Activation Variance (HSAV) and Hallucination-Specific Sparsity (HSS) metrics, allowing precise identification of hallucination-prone layers. By incorporating contrastive learning on these layers, we effectively mitigate hallucination generation, contributing to more robust and efficient LLMs(The maximum performance improvement is 12%). Our results on the NQ, SciQ, TriviaQA, TruthfulQA, and WikiQA datasets demonstrate the efficacy of this method, offering a comprehensive framework for both layer importance evaluation and hallucination mitigation in LLMs.
摘要:评估大语言模型 (LLM) 中不同层的重要性对于优化模型性能和解释性至关重要。本文首先使用激活方差-稀疏度评分 (Activation Variance-Sparsity Score, AVSS) 探索层的重要性,该评分结合了归一化的激活方差和稀疏度,以量化每一层对整体模型性能的贡献。通过基于 AVSS 对层进行排序并修剪影响最小的 25%,我们在问答、语言建模和情感分类等任务中的实验表明,超过 90% 的原始性能得以保留,突显了 LLM 架构中的潜在冗余。在 AVSS 的基础上,我们提出了一种增强版本,专门用于评估各层产生幻觉的倾向 (Enhanced AVSS, EAVSS)。这一改进方法引入了幻觉特定激活方差 (Hallucination-Specific Activation Variance, HSAV) 和幻觉特定稀疏度 (Hallucination-Specific Sparsity, HSS) 指标,从而能够精确识别易产生幻觉的层。通过在这些层上结合对比学习,我们有效地减少了幻觉的生成,有助于构建更稳健和高效的 LLM(最大性能提升为 12%)。我们在 NQ、SciQ、TriviaQA、TruthfulQA 和 WikiQA 数据集上的结果证明了该方法的有效性,提供了一个全面的框架,用于在 LLM 中评估层重要性和减少幻觉。
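
下面用 PyTorch 给出一个逐层计算"激活方差与稀疏度组合得分"并排序的极简示意。论文的 AVSS 有其具体公式,此处"方差 ×(1 - 稀疏度)"的组合方式以及稀疏度的近零阈值均为假设,仅演示按层打分与排序的流程。

```python
# 极简示意:逐层的激活方差-稀疏度打分与排序(组合公式与阈值均为假设)。
import torch

def layer_scores(activations, zero_thresh=1e-3):
    """activations: [num_layers, num_tokens, hidden_dim] 的各层激活。"""
    scores = []
    for h in activations:
        var = h.var().item()                                        # 激活方差
        sparsity = (h.abs() < zero_thresh).float().mean().item()    # 近零元素比例作为稀疏度
        scores.append(var * (1.0 - sparsity))                        # 假设的组合方式
    scores = torch.tensor(scores)
    return scores / scores.sum()                                     # 归一化,便于跨层比较

acts = torch.randn(12, 64, 768) * torch.linspace(0.2, 1.0, 12).view(-1, 1, 1)
scores = layer_scores(acts)
print("得分最低(可优先剪枝)的 3 层:", torch.argsort(scores)[:3].tolist())
```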

[NLP-24] CMATH: Cross-Modality Augmented Transformer with Hierarchical Variational Distillation for Multimodal Emotion Recognition in Conversation

【速读】: 该论文试图解决多模态对话情感识别(Multimodal Emotion Recognition in Conversation, MER)中由于模态信息质量不均和单一粒度融合导致的情感识别不准确问题。解决方案的关键在于提出了一个名为CMATH(Cross-Modality Augmented Transformer with Hierarchical Variational Distillation)的新模型,该模型包含两个主要组件:多模态交互融合(Multimodal Interaction Fusion)和层次化变分蒸馏(Hierarchical Variational Distillation)。多模态交互融合通过模态重构(Modality Reconstruction)获取高质量的模态压缩表示,并采用非对称融合策略的跨模态增强Transformer(Cross-Modality Augmented Transformer, CMA-Transformer)来处理模态信息的不均衡性。层次化变分蒸馏则通过设计变分融合网络将细粒度表示融合为粗粒度表示,并引入层次蒸馏框架来保持不同粒度模态表示之间的一致性。实验结果表明,CMATH模型在IEMOCAP和MELD数据集上优于现有的最先进基线模型。

链接: https://arxiv.org/abs/2411.10060
作者: Xiaofei Zhu,Jiawei Cheng,Zhou Yang,Zhuo Chen,Qingyang Wang,Jianfeng Yao
关键词-EN: integrating multimodal information, Cross-Modality Augmented Transformer, Hierarchical Variational Distillation, utterances by integrating, Multimodal
类目: Multimedia (cs.MM); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Multimodal emotion recognition in conversation (MER) aims to accurately identify emotions in conversational utterances by integrating multimodal information. Previous methods usually treat multimodal information as equal quality and employ symmetric architectures to conduct multimodal fusion. However, in reality, the quality of different modalities usually varies considerably, and it is difficult for a symmetric architecture to accurately recognize conversational emotions when dealing with uneven modal information. Furthermore, fusing multi-modality information in a single granularity may fail to adequately integrate modal information, exacerbating the inaccuracy in emotion recognition. In this paper, we propose a novel Cross-Modality Augmented Transformer with Hierarchical Variational Distillation, called CMATH, which consists of two major components, i.e., Multimodal Interaction Fusion and Hierarchical Variational Distillation. The former is comprised of two submodules, including Modality Reconstruction and Cross-Modality Augmented Transformer (CMA-Transformer), where Modality Reconstruction focuses on obtaining high-quality compressed representation of each modality, and CMA-Transformer adopts an asymmetric fusion strategy which treats one modality as the central modality and takes others as auxiliary modalities. The latter first designs a variational fusion network to fuse the fine-grained representations learned by CMA-Transformer into a coarse-grained representation. Then, it introduces a hierarchical distillation framework to maintain the consistency between modality representations with different granularities. Experiments on the IEMOCAP and MELD datasets demonstrate that our proposed model outperforms previous state-of-the-art baselines. Implementation code is available at this https URL cjw-MER/CMATH.
摘要:多模态对话情感识别(Multimodal Emotion Recognition in Conversation, MER)旨在通过整合多模态信息,准确识别对话中的情感。以往的方法通常将多模态信息视为同等质量,并采用对称架构进行多模态融合。然而,在实际应用中,不同模态的质量通常存在显著差异,使用对称架构在处理不均衡模态信息时难以准确识别对话情感。此外,在单一粒度上融合多模态信息可能无法充分整合模态信息,从而加剧情感识别的不准确性。本文提出了一种新的跨模态增强Transformer与分层变分蒸馏模型,称为CMATH,该模型包含两个主要组件,即多模态交互融合和分层变分蒸馏。前者由两个子模块组成,包括模态重构和跨模态增强Transformer(CMA-Transformer),其中模态重构专注于获取每种模态的高质量压缩表示,而CMA-Transformer采用非对称融合策略,将一种模态视为中心模态,其他模态作为辅助模态。后者首先设计了一个变分融合网络,将CMA-Transformer学习到的细粒度表示融合为粗粒度表示。然后,引入一个分层蒸馏框架,以保持不同粒度模态表示之间的一致性。在IEMOCAP和MELD数据集上的实验表明,我们提出的模型优于以往的最先进基线模型。实现代码可在以下链接获取:https URL cjw-MER/CMATH。
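
围绕 CMA-Transformer 的非对称融合策略(以某一模态为中心模态、其余为辅助模态),下面给出一个基于 PyTorch 的最小示意草图(非官方实现):用标准的 nn.MultiheadAttention,让中心模态充当 query、辅助模态充当 key/value。维度、层结构与归一化方式均为假设,仅用于说明“非对称”的含义。

```python
import torch
import torch.nn as nn

class CrossModalityAugmentedBlock(nn.Module):
    """示意:以中心模态为 query、辅助模态为 key/value 的非对称跨模态注意力。"""
    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, central, auxiliary):
        # central:   (B, T_c, dim)  中心模态(如文本)
        # auxiliary: (B, T_a, dim)  辅助模态(如音频/视觉序列)
        fused, _ = self.cross_attn(query=central, key=auxiliary, value=auxiliary)
        return self.norm(central + fused)   # 残差连接 + 层归一化

# 用法示例
block = CrossModalityAugmentedBlock()
text = torch.randn(2, 20, 256)          # 中心模态
audio_visual = torch.randn(2, 50, 256)  # 辅助模态
out = block(text, audio_visual)
print(out.shape)  # torch.Size([2, 20, 256])
```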

[NLP-25] Towards unearthing neglected climate innovations from scientific literature using Large Language Models NEURIPS2024

【速读】: 该论文试图解决气候变化应对策略中创新解决方案的识别和部署问题,特别是那些已存在于科学文献中但未被充分利用的解决方案。解决方案的关键在于利用大型语言模型(Large Language Models, LLMs)如GPT4-o,对从OpenAlex获取的科学论文标题和摘要进行多维度评估,包括气候变化缓解潜力、技术发展阶段和部署准备度。通过将语言模型的输出与人类评估进行比较,研究展示了LLM在快速、高效且一致地识别潜在有影响力的气候创新方面的有效性,从而增强气候行动策略。

链接: https://arxiv.org/abs/2411.10055
作者: César Quilodrán-Casas,Christopher Waite,Nicole Alhadeff,Diyona Dsouza,Cathal Hughes,Larissa Kunstel-Tabet,Alyssa Gilbert
关键词-EN: urgent global threat, global threat, needing the rapid, poses an urgent, urgent global
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 10 pages. Accepted in the LatinX in AI workshop at NeurIPS 2024

点击查看摘要

Abstract:Climate change poses an urgent global threat, needing the rapid identification and deployment of innovative solutions. We hypothesise that many of these solutions already exist within scientific literature but remain underutilised. To address this gap, this study employs a curated dataset sourced from OpenAlex, a comprehensive repository of scientific papers. Utilising Large Language Models (LLMs), such as GPT4-o from OpenAI, we evaluate title-abstract pairs from scientific papers on seven dimensions, covering climate change mitigation potential, stage of technological development, and readiness for deployment. The outputs of the language models are then compared with human evaluations to assess their effectiveness in identifying promising yet overlooked climate innovations. Our findings suggest that these LLM-based models can effectively augment human expertise, uncovering climate solutions that are potentially impactful but with far greater speed, throughput and consistency. Here, we focused on UK-based solutions, but the workflow is region-agnostic. This work contributes to the discovery of neglected innovations in scientific literature and demonstrates the potential of AI in enhancing climate action strategies.
摘要:气候变化构成了紧迫的全球威胁,迫切需要快速识别和部署创新解决方案。我们假设许多这些解决方案已经存在于科学文献中,但未得到充分利用。为了填补这一空白,本研究采用从 OpenAlex 这一全面的科学论文数据库中筛选的数据集。利用大语言模型 (LLMs),如 OpenAI 的 GPT4-o,我们对科学论文的标题-摘要对进行七维评估,涵盖气候变化缓解潜力、技术发展阶段以及部署准备情况。随后,将语言模型的输出与人工评估进行比较,以评估其在识别有前景但被忽视的气候创新方面的有效性。我们的研究结果表明,基于 LLM 的模型能够有效增强人类专业知识,以更快的速度、吞吐量和一致性揭示潜在影响重大的气候解决方案。在此,我们专注于英国的解决方案,但工作流程与地区无关。这项工作有助于发现科学文献中被忽视的创新,并展示了 AI 在增强气候行动策略方面的潜力。
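
论文的主要流程是让 LLM 对论文的“标题-摘要”对在七个维度上打分,再与人工评估对比。下面是一个示意性草图(非作者代码),假设使用 OpenAI Python SDK 的 chat.completions 接口与 gpt-4o 模型,且只演示其中三个维度;提示词措辞、维度名称与评分范围均为假设。

```python
from openai import OpenAI

client = OpenAI()  # 需要设置 OPENAI_API_KEY 环境变量

PROMPT = """You are screening scientific papers for climate innovations.
Rate the paper below on a 1-5 scale for each dimension and reply as JSON
with keys: mitigation_potential, development_stage, deployment_readiness.

Title: {title}
Abstract: {abstract}"""

def score_paper(title: str, abstract: str) -> str:
    """返回模型给出的 JSON 字符串评分(示意用,未做解析与重试)。"""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": PROMPT.format(title=title, abstract=abstract)}],
    )
    return resp.choices[0].message.content

# 用法示例(标题与摘要为虚构)
print(score_paper("Low-cost heat pump retrofit", "We describe a retrofit method ..."))
```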

[NLP-26] Information Extraction from Clinical Notes: Are We Ready to Switch to Large Language Models?

【速读】: 该论文旨在解决临床自然语言处理(NLP)中的信息提取(IE)问题,特别是命名实体识别(NER)和关系提取(RE)任务。解决方案的关键在于比较和评估大型语言模型(LLMs)如LLaMA-2和LLaMA-3与传统深度学习模型如BiomedBERT在临床文本上的性能、泛化能力、计算资源需求和吞吐量。研究结果表明,尽管LLaMA模型在NER和RE任务上表现优于BiomedBERT,但其计算资源需求更高且吞吐量较低,因此选择合适的模型应基于具体任务需求、可用计算资源和使用场景。

链接: https://arxiv.org/abs/2411.10020
作者: Yan Hu,Xu Zuo,Yujia Zhou,Xueqing Peng,Jimin Huang,Vipina K. Keloth,Vincent J. Zhang,Ruey-Ling Weng,Qingyu Chen,Xiaoqian Jiang,Kirk E. Roberts,Hua Xu
关键词-EN: natural language processing, Information extraction, Named Entity Recognition, clinical natural language, language processing
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Background: Information extraction (IE) is critical in clinical natural language processing (NLP). While large language models (LLMs) excel on generative tasks, their performance on extractive tasks remains debated. Methods: We investigated Named Entity Recognition (NER) and Relation Extraction (RE) using 1,588 clinical notes from four sources (UT Physicians, MTSamples, MIMIC-III, and i2b2). We developed an annotated corpus covering 4 clinical entities and 16 modifiers, and compared instruction-tuned LLaMA-2 and LLaMA-3 against BiomedBERT in terms of performance, generalizability, computational resources, and throughput. Results: LLaMA models outperformed BiomedBERT across datasets. With sufficient training data, LLaMA showed modest improvements (1% on NER, 1.5-3.7% on RE); improvements were larger with limited training data. On unseen i2b2 data, LLaMA-3-70B outperformed BiomedBERT by 7% (F1) on NER and 4% on RE. However, LLaMA models required more computing resources and ran up to 28 times slower. We implemented “Kiwi,” a clinical IE package featuring both models, available at this https URL. Conclusion: This study is among the first to develop and evaluate a comprehensive clinical IE system using open-source LLMs. Results indicate that LLaMA models outperform BiomedBERT for clinical NER and RE but with higher computational costs and lower throughputs. These findings highlight that choosing between LLMs and traditional deep learning methods for clinical IE applications should remain task-specific, taking into account both performance metrics and practical considerations such as available computing resources and the intended use case scenarios.
摘要:背景:信息提取(Information Extraction, IE)在临床自然语言处理(Natural Language Processing, NLP)中至关重要。尽管大语言模型(Large Language Models, LLMs)在生成任务上表现出色,但其在提取任务上的表现仍存在争议。方法:我们使用来自四个来源(UT Physicians、MTSamples、MIMIC-III 和 i2b2)的 1,588 份临床笔记,研究了命名实体识别(Named Entity Recognition, NER)和关系提取(Relation Extraction, RE)。我们开发了一个包含 4 种临床实体和 16 种修饰词的标注语料库,并比较了指令微调的 LLaMA-2 和 LLaMA-3 与 BiomedBERT 在性能、泛化性、计算资源和吞吐量方面的表现。结果:LLaMA 模型在所有数据集上均优于 BiomedBERT。在充足的训练数据下,LLaMA 显示出适度的改进(NER 提升 1%,RE 提升 1.5-3.7%);在有限的训练数据下,改进更为显著。在未见过的 i2b2 数据上,LLaMA-3-70B 在 NER 上的 F1 分数比 BiomedBERT 高出 7%,在 RE 上高出 4%。然而,LLaMA 模型需要更多的计算资源,运行速度最多慢 28 倍。我们实现了“Kiwi”,一个集成了这两种模型的临床 IE 包,可在以下网址获取:https URL。结论:本研究是首批使用开源 LLMs 开发和评估全面临床 IE 系统的研究之一。结果表明,LLaMA 模型在临床 NER 和 RE 上优于 BiomedBERT,但计算成本更高,吞吐量更低。这些发现强调,在选择 LLMs 和传统深度学习方法进行临床 IE 应用时,应根据任务的具体需求,综合考虑性能指标和实际因素,如可用的计算资源和预期的使用场景。

[NLP-27] Once More With Feeling: Measuring Emotion of Acting Performances in Contemporary American Film

【速读】: 该论文试图解决的问题是如何通过计算方法探索电影中的表演表现。解决方案的关键在于应用语音情感识别模型(speech emotion recognition models)和变异社会语言学分析框架(variationist sociolinguistic analytical framework),对当代美国流行电影的语料库进行分析,以揭示叙事结构、历时变化以及基于类型和对话的约束在口语表演中的体现。

链接: https://arxiv.org/abs/2411.10018
作者: Naitian Zhou,David Bamman
关键词-EN: contemporary American film, Abstract, cinematography, editing, find narrative structure
类目: Computation and Language (cs.CL)
备注: Accepted CHR 2024

点击查看摘要

Abstract:Narrative film is a composition of writing, cinematography, editing, and performance. While much computational work has focused on the writing or visual style in film, we conduct in this paper a computational exploration of acting performance. Applying speech emotion recognition models and a variationist sociolinguistic analytical framework to a corpus of popular, contemporary American film, we find narrative structure, diachronic shifts, and genre- and dialogue-based constraints located in spoken performances.
摘要:叙事电影是写作、摄影、剪辑和表演的综合体。尽管许多计算工作集中在电影的写作或视觉风格上,但本文进行了一项关于表演表现的计算探索。我们将语音情感识别模型和变异社会语言学分析框架应用于一个当代美国流行电影的语料库,发现叙事结构、历时变化以及基于类型和对话的约束条件存在于口语表演中。

[NLP-28] Orca: Enhancing Role-Playing Abilities of Large Language Models by Integrating Personality Traits

【速读】: 该论文试图解决现有大型语言模型(LLMs)在个性化对话系统中忽视心理因素的问题,特别是角色扮演对话代理在理解用户个性特质方面的不足。解决方案的关键在于提出了一种名为Orca的框架,该框架通过四个阶段实现:(1)个性特质推断(Personality traits inferring),利用LLMs推断用户的五大人格特质报告和评分;(2)数据增强(Data Augment),模拟用户的个人资料、背景故事和心理活动;(3)数据集构建(Dataset construction),采用个性条件指令提示(PCIP)来刺激LLMs;(4)模型训练(Modeling and Training),通过个性条件指令调优(PTIT和PSIT),使用生成的数据来增强现有的开源LLMs。该框架通过引入OrcaBench基准测试,验证了其在感知个性特质和提升角色扮演能力方面的优越性和有效性。

链接: https://arxiv.org/abs/2411.10006
作者: Yuxuan Huang
关键词-EN: Large language models, personalized dialogue systems, Large language, numerous role-playing conversational, role-playing conversational agents
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models have catalyzed the development of personalized dialogue systems, and numerous role-playing conversational agents have emerged. However, previous research predominantly focused on enhancing the model’s capability to follow instructions by designing character profiles, neglecting the psychological factors that drive human conversations. In this paper, we propose Orca, a framework for data processing and training LLMs of custom characters by integrating personality traits. Orca comprises four stages: (1) Personality traits inferring, leverage LLMs to infer user’s BigFive personality trait reports and scores. (2) Data Augment, simulate user’s profile, background story, and psychological activities. (3) Dataset construction, personality-conditioned instruction prompting (PCIP) to stimulate LLMs. (4) Modeling and Training, personality-conditioned instruction tuning (PTIT and PSIT), using the generated data to enhance existing open-source LLMs. We introduce OrcaBench, the first benchmark for evaluating the quality of content generated by LLMs on social platforms across multiple scales. Our experiments demonstrate that our proposed model achieves superior performance on this benchmark, demonstrating its excellence and effectiveness in perceiving personality traits that significantly improve role-playing abilities. Our Code is available at this https URL.
摘要:大语言模型(LLM)推动了个性化对话系统的发展,众多角色扮演的对话智能体(AI Agent)应运而生。然而,以往的研究主要集中在通过设计角色档案来增强模型遵循指令的能力,却忽视了驱动人类对话的心理因素。本文提出了Orca框架,通过整合人格特质来进行数据处理和训练自定义角色的LLM。Orca框架包括四个阶段:(1)人格特质推断,利用LLM推断用户的大五人格特质报告和评分;(2)数据增强,模拟用户的个人档案、背景故事和心理活动;(3)数据集构建,采用人格条件指令提示(PCIP)来刺激LLM;(4)建模与训练,使用人格条件指令调优(PTIT和PSIT),利用生成的数据来增强现有的开源LLM。我们引入了OrcaBench,这是首个用于评估LLM在社交平台上生成内容质量的多尺度基准。我们的实验表明,所提出的模型在该基准上表现优异,展示了其在感知人格特质方面的卓越性和有效性,显著提升了角色扮演能力。我们的代码可在以下链接获取:https URL。
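
Orca 的第 (3) 阶段使用“人格条件指令提示 (PCIP)”,即把推断出的大五人格评分注入角色扮演指令中。下面是一个构造此类提示的最小示意草图(非官方实现),字段名与模板措辞均为假设。

```python
def build_pcip_prompt(character: str, big_five: dict, user_message: str) -> str:
    """示意:将大五人格评分拼入角色扮演指令(Personality-Conditioned Instruction Prompting)。"""
    traits = ", ".join(f"{k}: {v}/5" for k, v in big_five.items())
    return (
        f"你将扮演角色「{character}」。\n"
        f"该角色的大五人格评分为:{traits}。\n"
        f"请在回复中体现上述人格特质(例如高外向性意味着语气热情、主动)。\n"
        f"用户说:{user_message}\n"
        f"请以该角色的口吻回复:"
    )

# 用法示例(人格评分为虚构)
profile = {"开放性": 4, "尽责性": 2, "外向性": 5, "宜人性": 3, "神经质": 1}
print(build_pcip_prompt("海盗船长", profile, "你周末一般做什么?"))
```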

[NLP-29] HistoLens: An LLM-Powered Framework for Multi-Layered Analysis of Historical Texts – A Case Application of Yantie Lun

【速读】: 该论文试图解决历史文本的多层次分析问题,特别是如何利用大型语言模型 (LLMs) 和自然语言处理 (NLP) 技术来深入挖掘和可视化历史文本中的文化、思想和地理信息。解决方案的关键在于提出了一个名为 HistoLens 的多层次分析框架,该框架整合了命名实体识别、知识图谱构建和地理信息可视化等技术,并通过案例研究《盐铁论》展示了其在历史研究和教育中的应用潜力。HistoLens 不仅能够多维度、可视化和定量地分析历史文本,还能利用 LLMs 构建机器教学场景,提供解释性分析,从而为历史文本研究提供新的视角和辅助工具。

链接: https://arxiv.org/abs/2411.09978
作者: Yifan Zeng
关键词-EN: Large Language Models, Language Models, Large Language, Yantie Lun, Western Han
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper proposes HistoLens, a multi-layered analysis framework for historical texts based on Large Language Models (LLMs). Using the important Western Han dynasty text “Yantie Lun” as a case study, we demonstrate the framework’s potential applications in historical research and education. HistoLens integrates NLP technology (especially LLMs), including named entity recognition, knowledge graph construction, and geographic information visualization. The paper showcases how HistoLens explores Western Han culture in “Yantie Lun” through multi-dimensional, visual, and quantitative methods, focusing particularly on the influence of Confucian and Legalist thoughts on political, economic, military, and ethnic. We also demonstrate how HistoLens constructs a machine teaching scenario using LLMs for explainable analysis, based on a dataset of Confucian and Legalist ideas extracted with LLM assistance. This approach offers novel and diverse perspectives for studying historical texts like “Yantie Lun” and provides new auxiliary tools for history education. The framework aims to equip historians and learners with LLM-assisted tools to facilitate in-depth, multi-layered analysis of historical texts and foster innovation in historical education.
摘要:本文提出了 HistoLens,这是一个基于大语言模型 (LLM) 的多层次历史文本分析框架。以重要的西汉文献《盐铁论》为例,我们展示了该框架在历史研究和教育中的潜在应用。HistoLens 集成了自然语言处理技术(特别是 LLM),包括命名实体识别、知识图谱构建和地理信息可视化。本文展示了 HistoLens 如何通过多维度、可视化和定量的方法探索《盐铁论》中的西汉文化,特别关注儒家和法家思想对政治、经济、军事和民族的影响。我们还展示了 HistoLens 如何基于从 LLM 辅助提取的儒家和法家思想数据集,构建机器教学场景进行可解释性分析。这种方法为研究《盐铁论》等历史文本提供了新颖且多样的视角,并为历史教育提供了新的辅助工具。该框架旨在为历史学家和学习者提供 LLM 辅助工具,以促进对历史文本的深入、多层次分析,并推动历史教育的创新。

[NLP-30] Large Language Models as User-Agents for Evaluating Task-Oriented-Dialogue Systems

【速读】: 该论文试图解决传统离线数据集在评估面向任务的对话系统(Task-Oriented Dialogue, TOD)时缺乏上下文感知能力的问题。解决方案的关键在于利用大型语言模型(Large Language Models, LLMs)创建上下文感知的用户代理(user-agents),通过在上下文中提供示例引导LLM生成对话,并跟踪用户目标状态。这种方法不仅提高了用户代理在多样性和任务完成度指标上的表现,还提出了在此动态框架下自动评估TOD模型的方法论。

链接: https://arxiv.org/abs/2411.09972
作者: Taaha Kazi,Ruiliang Lyu,Sizhe Zhou,Dilek Hakkani-Tur,Gokhan Tur
关键词-EN: evaluate task-oriented dialogue, task-oriented dialogue, evaluate task-oriented, Traditionally, offline datasets
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Traditionally, offline datasets have been used to evaluate task-oriented dialogue (TOD) models. These datasets lack context awareness, making them suboptimal benchmarks for conversational systems. In contrast, user-agents, which are context-aware, can simulate the variability and unpredictability of human conversations, making them better alternatives as evaluators. Prior research has utilized large language models (LLMs) to develop user-agents. Our work builds upon this by using LLMs to create user-agents for the evaluation of TOD systems. This involves prompting an LLM, using in-context examples as guidance, and tracking the user-goal state. Our evaluation of diversity and task completion metrics for the user-agents shows improved performance with the use of better prompts. Additionally, we propose methodologies for the automatic evaluation of TOD models within this dynamic framework.
摘要:传统上,任务导向对话 (Task-Oriented Dialogue, TOD) 模型的评估依赖于离线数据集。这些数据集缺乏上下文感知能力,使得它们作为对话系统的基准并不理想。相比之下,具备上下文感知能力的用户代理 (User-Agents) 能够模拟人类对话的多样性和不可预测性,因此作为评估者更具优势。先前的研究已经利用大语言模型 (Large Language Models, LLMs) 来开发用户代理。我们的工作在此基础上,利用 LLMs 创建用于 TOD 系统评估的用户代理。这包括通过上下文示例引导 LLM 生成提示,并跟踪用户目标状态。我们对用户代理的多样性和任务完成指标的评估显示,使用更好的提示可以提升性能。此外,我们提出了在此动态框架内自动评估 TOD 模型的方法。
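
该方法的核心是让 LLM 充当用户代理:依据剩余的用户目标生成发言,并在对话过程中跟踪目标状态。下面是一个与具体模型无关的最小循环草图(非官方实现),其中 user_agent_reply 与 tod_system_reply 为假设的占位函数,分别代表任意的用户代理模型与被评测的 TOD 系统;目标状态的判断逻辑也做了极度简化。

```python
def simulate_dialogue(goal: dict, user_agent_reply, tod_system_reply, max_turns=10):
    """示意:用户代理与 TOD 系统交替发言,并跟踪目标槽位是否已被满足。"""
    history, remaining = [], dict(goal)          # remaining: 尚未达成的槽位
    for _ in range(max_turns):
        user_utt = user_agent_reply(history, remaining)   # LLM 依据剩余目标生成用户发言
        history.append(("user", user_utt))
        sys_utt = tod_system_reply(history)               # 被评测的 TOD 系统回复
        history.append(("system", sys_utt))
        # 目标状态跟踪:系统回复中提到的槽位值视为已满足(简化判断,仅作示意)
        remaining = {k: v for k, v in remaining.items() if str(v) not in sys_utt}
        if not remaining:                                  # 任务完成
            break
    return history, len(remaining) == 0

# 用法示例:用简单的占位函数代替真实模型
goal = {"菜系": "川菜", "人数": 4}
ua = lambda h, rem: "我想订一家川菜馆,4个人。"
tod = lambda h: "好的,已为您预订川菜餐厅,4位用餐。"
dialogue, success = simulate_dialogue(goal, ua, tod)
print(success)
```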

[NLP-31] LoRA-LiteE: A Computationally Efficient Framework for Chatbot Preference-Tuning

【速读】: 该论文试图解决在资源受限环境下,如何高效地进行聊天机器人的人类偏好调优问题。解决方案的关键在于引入了一种名为LoRA-Lite Ensemble (LoRA-LiteE)的创新框架,该框架结合了监督微调 (Supervised Fine-tuning, SFT)、低秩适应 (Low-Rank Adaptation, LoRA) 和集成学习技术,通过聚合轻量级模型的预测结果,实现了性能与计算成本之间的平衡。实验结果表明,LoRA-LiteE模型在Chatbot Arena基准数据集上的表现与未经微调的GPT-4相当,并且在资源受限条件下优于单一的大规模模型,从而为资源受限环境下的聊天机器人偏好调优提供了一种可行且高效的解决方案。

链接: https://arxiv.org/abs/2411.09947
作者: Yahe Yang,Chunliang Tao,Xiaojing Fan
关键词-EN: Effective preference tuning, enhancing user satisfaction, aligning chatbot responses, notably Reinforcement Learning, satisfaction and engagement
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Effective preference tuning is pivotal in aligning chatbot responses with human expectations, enhancing user satisfaction and engagement. Traditional approaches, notably Reinforcement Learning from Human Feedback (RLHF) as employed in advanced models like GPT-4, have demonstrated considerable success in this domain. However, RLHF methods are often computationally intensive and resource-demanding, limiting their scalability and accessibility for broader applications. To address these challenges, this study introduces LoRA-Lite Ensemble (LoRA-LiteE), an innovative framework that combines Supervised Fine-tuning (SFT) with Low-Rank Adaptation (LoRA) and Ensemble Learning techniques to effectively aggregate predictions of lightweight models, which aim to achieve a balance between the performance and computational cost. Utilizing the Chatbot Arena benchmark dataset, we conduct a comprehensive comparative analysis among our LoRA-LiteE model, corresponding base models at different scales, and GPT-4 trained with RLHF. Our empirical results demonstrate that the proposed LoRA-LiteE model achieves comparable performance to un-finetuned GPT-4 and outperforms the single larger-scale models under limited resource constraints. These findings highlight that our LoRA-LiteE provides a feasible and efficient methodology for human preference prediction in chatbot systems, enhancing scalability and accessibility, and thereby broadening the applicability of preference-tuned chatbots in resource-constrained environments.
摘要:在使聊天机器人的回复与人类期望相一致,从而提高用户满意度和参与度方面,有效的偏好调整起着关键作用。传统方法,特别是像 GPT-4 这样的高级模型中采用的人类反馈强化学习 (Reinforcement Learning from Human Feedback, RLHF),在此领域已显示出显著的成功。然而,RLHF 方法通常计算密集且资源需求高,限制了其在更广泛应用中的可扩展性和可访问性。为应对这些挑战,本研究引入了 LoRA-Lite 集成 (LoRA-Lite Ensemble, LoRA-LiteE),这是一种创新框架,结合了监督微调 (Supervised Fine-tuning, SFT)、低秩适应 (Low-Rank Adaptation, LoRA) 和集成学习技术,以有效聚合轻量级模型的预测结果,旨在性能与计算成本之间取得平衡。利用 Chatbot Arena 基准数据集,我们对 LoRA-LiteE 模型、不同规模的相应基础模型以及使用 RLHF 训练的 GPT-4 进行了全面的比较分析。我们的实证结果表明,所提出的 LoRA-LiteE 模型在未微调的 GPT-4 上实现了可比拟的性能,并且在资源受限的情况下优于单一的大规模模型。这些发现强调了我们的 LoRA-LiteE 为聊天机器人系统中的人类偏好预测提供了一种可行且高效的方法,增强了可扩展性和可访问性,从而在资源受限的环境中扩大了偏好调整聊天机器人的适用性。
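
LoRA-LiteE 的最后一步是聚合多个经 SFT + LoRA 微调的轻量模型的偏好预测。下面是一个只演示“聚合”环节的最小示意草图(非官方实现):假设每个轻量模型输出“回复 A 优于回复 B”的概率,用(可加权的)平均得到集成预测;数据均为虚构。

```python
import numpy as np

def ensemble_preference(prob_a_wins: np.ndarray, weights=None) -> np.ndarray:
    """示意:聚合多个轻量模型的偏好概率。
    prob_a_wins: (模型数, 样本数),每行是一个模型给出的 P(回复A 更好)。"""
    if weights is None:
        weights = np.ones(prob_a_wins.shape[0]) / prob_a_wins.shape[0]
    return np.average(prob_a_wins, axis=0, weights=weights)

# 用法示例:3 个 LoRA 微调后的轻量模型、5 个对比样本(概率为虚构)
probs = np.array([
    [0.9, 0.2, 0.6, 0.7, 0.4],
    [0.8, 0.3, 0.5, 0.6, 0.5],
    [0.7, 0.1, 0.7, 0.8, 0.3],
])
agg = ensemble_preference(probs)
print((agg > 0.5).astype(int))  # 1 表示集成后判断回复 A 更受偏好
```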

[NLP-32] SlimLM: An Efficient Small Language Model for On-Device Document Assistance

【速读】: 该论文试图解决在智能手机上部署小型语言模型(SLMs)的实际性能和应用问题。解决方案的关键在于开发了一系列针对移动设备文档辅助任务优化的SLMs,命名为SlimLM。通过在三星Galaxy S24上进行广泛的实验,研究确定了模型大小(从125M到7B参数)、上下文长度和推理时间之间的最佳权衡,以实现高效的设备内处理。SlimLM在SlimPajama-627B数据集上预训练,并在自建的DocAssist数据集上进行微调,用于摘要、问答和建议任务。研究结果表明,SlimLM在现有SLMs中表现出可比或更优的性能,并为未来的设备内语言模型研究提供了基准。此外,论文还提供了一个Android应用程序,为SLM的实际部署提供了实用见解。

链接: https://arxiv.org/abs/2411.09944
作者: Thang M. Pham,Phat T. Nguyen,Seunghyun Yoon,Viet Dac Lai,Franck Dernoncourt,Trung Bui
关键词-EN: smartphones remains underexplored, show promises, remains underexplored, small language models, Samsung Galaxy
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:While small language models (SLMs) show promises for mobile deployment, their real-world performance and applications on smartphones remains underexplored. We present SlimLM, a series of SLMs optimized for document assistance tasks on mobile devices. Through extensive experiments on a Samsung Galaxy S24, we identify the optimal trade-offs between model size (ranging from 125M to 7B parameters), context length, and inference time for efficient on-device processing. SlimLM is pre-trained on SlimPajama-627B and fine-tuned on DocAssist, our constructed dataset for summarization, question answering and suggestion tasks. Our smallest model demonstrates efficient performance on S24, while larger variants offer enhanced capabilities within mobile constraints. We evaluate SlimLM against existing SLMs, showing comparable or superior performance and offering a benchmark for future research in on-device language models. We also provide an Android application, offering practical insights into SLM deployment. Our findings provide valuable insights and illuminate the capabilities of running advanced language models on high-end smartphones, potentially reducing server costs and enhancing privacy through on-device processing.
摘要:尽管小型语言模型 (SLM) 在移动设备部署方面显示出潜力,但其在智能手机上的实际性能和应用仍未得到充分探索。我们提出了 SlimLM,这是一系列针对移动设备上的文档辅助任务进行优化的小型语言模型。通过在三星 Galaxy S24 上进行广泛的实验,我们确定了模型大小(范围从 125M 到 7B 参数)、上下文长度和推理时间之间的最佳权衡,以实现高效的在设备处理。SlimLM 在 SlimPajama-627B 上进行了预训练,并在我们构建的 DocAssist 数据集上进行了微调,该数据集用于摘要、问答和建议任务。我们最小的模型在 S24 上展示了高效性能,而较大的变体在移动设备的限制下提供了增强的能力。我们评估了 SlimLM 与现有 SLM 的性能,显示了可比或更优的表现,并为未来在设备语言模型的研究提供了基准。我们还提供了一个 Android 应用程序,为 SLM 部署提供了实际见解。我们的研究结果提供了宝贵的见解,并展示了在高端智能手机上运行先进语言模型的能力,这可能通过在设备处理来减少服务器成本并增强隐私。

[NLP-33] Refined and Segmented Price Sentiment Indices from Survey Comments

【速读】: 该论文旨在通过增强价格情绪指数来更精确地理解价格趋势,不仅从消费者的角度,也从企业的角度。解决方案的关键在于利用大型语言模型 (LLM) 对日本内阁府经济观察者调查中的价格相关评论进行分类。通过结合评论领域和受访者行业的信息,论文能够区分评论是来自消费者还是企业,以及评论涉及的是商品还是服务。这种方法不仅构建了通用价格情绪指数,还针对消费者和价格、商品和服务等更具体的对象构建了指数。通过使用LLM进行更准确的分类,并结合多个LLM的输出,论文展示了分类性能的潜在提升,从而构建了一个与现有指数相关性更高的价格情绪指数。此外,基于受访者行业的评论选择进一步增强了消费者价格指数的相关性。

链接: https://arxiv.org/abs/2411.09937
作者: Masahiro Suzuki,Hiroki Sakaji
关键词-EN: Economy Watchers Survey, precisely understand price, Economy Watchers, understand price trends, Watchers Survey
类目: Computation and Language (cs.CL); Computational Finance (q-fin.CP)
备注: Accepted to IEEE BigData 2024. 9 pages, 11 tables, 1 figure

点击查看摘要

Abstract:We aim to enhance a price sentiment index and to more precisely understand price trends from the perspective of not only consumers but also businesses. We extract comments related to prices from the Economy Watchers Survey conducted by the Cabinet Office of Japan and classify price trends using a large language model (LLM). We classify whether the survey sample reflects the perspective of consumers or businesses, and whether the comments pertain to goods or services by utilizing information on the fields of comments and the industries of respondents included in the Economy Watchers Survey. From these classified price-related comments, we construct price sentiment indices not only for a general purpose but also for more specific objectives by combining perspectives on consumers and prices, as well as goods and services. It becomes possible to achieve a more accurate classification of price directions by employing a LLM for classification. Furthermore, integrating the outputs of multiple LLMs suggests the potential for the better performance of the classification. The use of more accurately classified comments allows for the construction of an index with a higher correlation to existing indices than previous studies. We demonstrate that the correlation of the price index for consumers, which has a larger sample size, is further enhanced by selecting comments for aggregation based on the industry of the survey respondents.
摘要:我们的目标是增强价格情绪指数,并从消费者和企业的角度更精确地理解价格趋势。我们从日本内阁府进行的“经济观察者调查”中提取与价格相关的评论,并使用大语言模型 (LLM) 对价格趋势进行分类。我们根据评论领域和受访者行业信息,分类调查样本反映的是消费者还是企业的视角,以及评论涉及的是商品还是服务。通过对这些分类后的价格相关评论,我们不仅构建了通用目的的价格情绪指数,还通过结合消费者和价格、商品和服务等更具体的视角,构建了更细化的价格情绪指数。通过使用 LLM 进行分类,可以实现对价格方向的更准确分类。此外,整合多个 LLM 的输出表明,分类性能有可能得到进一步提升。使用更准确分类的评论构建的指数,其与现有指数的相关性高于以往研究。我们证明,通过根据调查受访者的行业选择评论进行汇总,消费者价格指数的相关性在样本量较大的情况下得到了进一步增强。
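
在用 LLM 将价格相关评论按“上涨/持平/下跌”分类并区分消费者/企业、商品/服务之后,即可按细分组合构建价格情绪指数。下面是一个示意性草图(非官方实现),采用常见的扩散指数式计算(上涨占比减去下跌占比);标签名称与分组字段均为假设。

```python
from collections import Counter

def price_sentiment_index(comments):
    """示意:按 (视角, 对象) 分组计算扩散指数式的价格情绪指数,取值范围 [-1, 1]。
    comments: [{"perspective": "消费者"/"企业", "target": "商品"/"服务", "label": "上涨"/"持平"/"下跌"}]"""
    groups = {}
    for c in comments:
        groups.setdefault((c["perspective"], c["target"]), []).append(c["label"])
    index = {}
    for key, labels in groups.items():
        n = len(labels); cnt = Counter(labels)
        index[key] = (cnt["上涨"] - cnt["下跌"]) / n
    return index

# 用法示例(数据为虚构)
data = [
    {"perspective": "消费者", "target": "商品", "label": "上涨"},
    {"perspective": "消费者", "target": "商品", "label": "上涨"},
    {"perspective": "消费者", "target": "商品", "label": "下跌"},
    {"perspective": "企业",   "target": "服务", "label": "持平"},
]
print(price_sentiment_index(data))
```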

[NLP-34] JRadiEvo: A Japanese Radiology Report Generation Model Enhanced by Evolutionary Optimization of Model Merging NEURIPS’24

【速读】: 该论文试图解决在非英语医疗环境中,如何利用有限的数据资源高效地生成准确的放射报告的问题。解决方案的关键在于通过进化优化模型合并 (Evolutionary optimization of model merging, JRadiEvo) 技术,将非医学领域的视觉语言基础模型扩展到医学领域,从而在仅使用50个翻译样本的情况下,成功开发出一个能够从X光图像生成准确日语报告的模型。这一方法不仅显著提高了数据利用效率,还使得模型在参数规模较小的情况下(仅80亿参数),能够在医院内部本地部署,满足严格的隐私和安全要求。

链接: https://arxiv.org/abs/2411.09933
作者: Kaito Baba,Ryota Yagi,Junichiro Takahashi,Risa Kishikawa,Satoshi Kodera
关键词-EN: rapid advancement, large language models, significant advancements, model, large language
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)
备注: Accepted by NeurIPS’24 Workshop on AIM-FM: Advancements In Medical Foundation Models: Explainability, Robustness, Security, and Beyond

点击查看摘要

Abstract:With the rapid advancement of large language models (LLMs), foundational models (FMs) have seen significant advancements. Healthcare is one of the most crucial application areas for these FMs, given the significant time and effort required for physicians to analyze large volumes of patient data. Recent efforts have focused on adapting multimodal FMs to the medical domain through techniques like instruction-tuning, leading to the development of medical foundation models (MFMs). However, these approaches typically require large amounts of training data to effectively adapt models to the medical field. Moreover, most existing models are trained on English datasets, limiting their practicality in non-English-speaking regions where healthcare professionals and patients are not always fluent in English. The need for translation introduces additional costs and inefficiencies. To address these challenges, we propose a Japanese Radiology report generation model enhanced by Evolutionary optimization of model merging (JRadiEvo). This is the first attempt to extend a non-medical vision-language foundation model to the medical domain through evolutionary optimization of model merging. We successfully created a model that generates accurate Japanese reports from X-ray images using only 50 translated samples from publicly available data. This model, developed with highly efficient use of limited data, outperformed leading models from recent research trained on much larger datasets. Additionally, with only 8 billion parameters, this relatively compact foundation model can be deployed locally within hospitals, making it a practical solution for environments where APIs and other external services cannot be used due to strict privacy and security requirements.
摘要:随着大语言模型(LLMs)的快速发展,基础模型(FMs)也取得了显著进步。医疗领域是这些基础模型最重要的应用领域之一,因为医生需要花费大量时间和精力来分析大量的患者数据。最近的研究致力于通过指令微调等技术将多模态基础模型适应于医疗领域,从而开发出医疗基础模型(MFMs)。然而,这些方法通常需要大量的训练数据才能有效地将模型适应于医疗领域。此外,大多数现有模型是基于英语数据集进行训练的,这限制了它们在非英语地区(如医疗专业人员和患者不总是精通英语的地区)的实用性。翻译需求引入了额外的成本和低效率。为了应对这些挑战,我们提出了一种通过模型合并的进化优化增强的日语放射报告生成模型(JRadiEvo)。这是首次尝试通过模型合并的进化优化将非医疗领域的视觉语言基础模型扩展到医疗领域。我们成功地创建了一个模型,仅使用从公开数据中翻译的50个样本,就能从X光图像生成准确的日语报告。该模型在高效利用有限数据的情况下,表现优于近期研究中基于更大数据集训练的领先模型。此外,该模型仅有80亿参数,相对紧凑,可以在医院内部本地部署,使其成为在因严格隐私和安全要求而无法使用API和其他外部服务的场景中的实用解决方案。
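
JRadiEvo 的关键是对“模型合并”的权重做进化优化。下面是一个与具体模型无关的最小示意草图(非官方实现):对若干候选模型参数的线性插值权重做简单的“保留最优 + 高斯变异”搜索,以验证集得分作为适应度;其中 fitness 为假设的占位函数,搜索策略也远比论文中使用的进化算法简化。

```python
import numpy as np

def merge_params(param_sets, weights):
    """示意:按权重对多个模型的同名参数做线性插值合并。"""
    w = np.array(weights) / np.sum(weights)
    return {k: sum(w[i] * p[k] for i, p in enumerate(param_sets)) for k in param_sets[0]}

def evolve_merge_weights(param_sets, fitness, generations=20, pop=8, sigma=0.1, seed=0):
    """示意:对合并权重做简单的进化搜索(保留最优个体 + 高斯变异)。"""
    rng = np.random.default_rng(seed)
    best_w = np.ones(len(param_sets)) / len(param_sets)
    best_score = fitness(merge_params(param_sets, best_w))
    for _ in range(generations):
        for _ in range(pop):
            cand = np.clip(best_w + rng.normal(0, sigma, best_w.shape), 1e-3, None)
            score = fitness(merge_params(param_sets, cand))
            if score > best_score:
                best_w, best_score = cand, score
    return best_w / best_w.sum(), best_score

# 用法示例:两个“模型”各含一个参数,适应度为与目标参数的负距离(均为虚构)
models = [{"w": np.array([1.0, 0.0])}, {"w": np.array([0.0, 1.0])}]
target = np.array([0.3, 0.7])
fit = lambda merged: -np.linalg.norm(merged["w"] - target)
print(evolve_merge_weights(models, fit))
```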

[NLP-35] Research on Domain-Specific Chinese Spelling Correction Method Based on Plugin Extension Modules

【速读】: 该论文试图解决现有中文拼写校正模型在处理特定领域文本时表现不佳的问题。解决方案的关键在于设计了一个插件扩展模块,该模块能够学习特定领域术语的特征,从而在不损害模型通用拼写校正性能的前提下,提升模型在特定领域的校正能力。通过在医疗、法律和官方文档等领域的扩展模块集成,实验结果表明模型的校正性能显著优于未使用扩展模块的基线模型。

链接: https://arxiv.org/abs/2411.09884
作者: Xiaowu Zhang,Hongfei Zhao,Xuan Chang
关键词-EN: Chinese spelling correction, handling domain-specific texts, Chinese spelling, Traditional Chinese spelling, correction method based
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper proposes a Chinese spelling correction method based on plugin extension modules, aimed at addressing the limitations of existing models in handling domain-specific texts. Traditional Chinese spelling correction models are typically trained on general-domain datasets, resulting in poor performance when encountering specialized terminology in domain-specific texts. To address this issue, we design an extension module that learns the features of domain-specific terminology, thereby enhancing the model’s correction capabilities within specific domains. This extension module can provide domain knowledge to the model without compromising its general spelling correction performance, thus improving its accuracy in specialized fields. Experimental results demonstrate that after integrating extension modules for medical, legal, and official document domains, the model’s correction performance is significantly improved compared to the baseline model without any extension modules.
摘要:本文提出了一种基于插件扩展模块的中文拼写校正方法,旨在解决现有模型在处理特定领域文本时的局限性。传统的中文拼写校正模型通常在通用领域数据集上进行训练,导致在遇到特定领域文本中的专业术语时表现不佳。为解决这一问题,我们设计了一个扩展模块,该模块能够学习特定领域术语的特征,从而增强模型在特定领域内的校正能力。该扩展模块能够在不影响模型通用拼写校正性能的前提下,为模型提供领域知识,从而提高其在专业领域的准确性。实验结果表明,在为医疗、法律和官方文档领域集成扩展模块后,模型的校正性能相比没有扩展模块的基线模型显著提升。

[NLP-36] KULCQ: An Unsupervised Keyword-based Utterance Level Clustering Quality Metric

【速读】: 该论文试图解决在无监督环境下评估对话数据聚类质量的问题。传统方法依赖于带有意图标签的数据,限制了其扩展性。论文提出的解决方案之关键是引入了一种基于关键词的聚类质量评估指标——关键词驱动的语句级别聚类质量 (Keyword-based Utterance Level Clustering Quality, KULCQ)。该指标通过分析关键词来评估聚类质量,不仅考虑了聚类的几何特性,还捕捉了对话数据中的语义关系,从而在无标签数据的情况下更有效地评估聚类效果。

链接: https://arxiv.org/abs/2411.09853
作者: Pranav Guruprasad,Negar Mokhberian,Nikhil Varghese,Chandra Khatri,Amol Kelkar
关键词-EN: Intent discovery, agents and improving, clustering, clustering quality, Intent
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Intent discovery is crucial for both building new conversational agents and improving existing ones. While several approaches have been proposed for intent discovery, most rely on clustering to group similar utterances together. Traditional evaluation of these utterance clusters requires intent labels for each utterance, limiting scalability. Although some clustering quality metrics exist that do not require labeled data, they focus solely on cluster geometry while ignoring the linguistic nuances present in conversational transcripts. In this paper, we introduce Keyword-based Utterance Level Clustering Quality (KULCQ), an unsupervised metric that leverages keyword analysis to evaluate clustering quality. We demonstrate KULCQ’s effectiveness by comparing it with existing unsupervised clustering metrics and validate its performance through comprehensive ablation studies. Our results show that KULCQ better captures semantic relationships in conversational data while maintaining consistency with geometric clustering principles.
摘要:意图发现对于构建新的对话智能体和改进现有智能体至关重要。尽管已经提出了多种意图发现方法,但大多数方法依赖于聚类技术将相似的语句分组。传统的聚类评估方法需要每个语句的意图标签,这限制了其可扩展性。虽然存在一些不需要标签数据的聚类质量指标,但它们仅关注聚类几何结构,而忽略了对话记录中的语言细微差别。本文中,我们引入了基于关键词的语句级别聚类质量评估指标 (Keyword-based Utterance Level Clustering Quality, KULCQ),这是一种利用关键词分析来评估聚类质量的无监督指标。我们通过与现有无监督聚类指标的比较,展示了 KULCQ 的有效性,并通过全面的消融研究验证了其性能。结果表明,KULCQ 在捕捉对话数据中的语义关系方面表现更优,同时保持了与几何聚类原则的一致性。
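
KULCQ 的思想是利用每个簇的关键词来度量簇内一致性与簇间区分度。下面是一个示意性草图(非官方实现,具体公式为假设):用词频近似关键词提取,以“簇内语句对本簇关键词的覆盖率”减去“与其他簇关键词的平均重叠率”作为简化评分。

```python
from collections import Counter

def top_keywords(utterances, k=5):
    """示意:用简单词频提取簇关键词(实际可替换为 TF-IDF 等更强的方法)。"""
    words = [w.lower() for u in utterances for w in u.split()]
    return {w for w, _ in Counter(words).most_common(k)}

def kulcq_like_score(clusters, k=5):
    """示意评分:簇内覆盖率(语句包含本簇关键词的比例)- 簇间关键词平均重叠率。"""
    keywords = [top_keywords(c, k) for c in clusters]
    scores = []
    for i, cluster in enumerate(clusters):
        coverage = sum(any(w in u.lower() for w in keywords[i]) for u in cluster) / len(cluster)
        overlaps = [len(keywords[i] & keywords[j]) / max(len(keywords[i] | keywords[j]), 1)
                    for j in range(len(clusters)) if j != i]
        scores.append(coverage - (sum(overlaps) / len(overlaps) if overlaps else 0.0))
    return sum(scores) / len(scores)

# 用法示例(语句为虚构)
clusters = [
    ["book a flight to tokyo", "i want to book a flight", "flight booking please"],
    ["cancel my hotel reservation", "please cancel the hotel booking"],
]
print(round(kulcq_like_score(clusters), 3))
```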

[NLP-37] A Benchmark for Long-Form Medical Question Answering NEURIPS2024

【速读】: 该论文试图解决现有大型语言模型(LLMs)在长篇医疗问答(QA)评估中缺乏有效基准的问题。现有医疗QA评估基准主要集中在自动指标和选择题上,未能充分捕捉或评估LLMs在实际临床应用中的复杂性。此外,现有关于长篇答案生成的研究多为闭源,缺乏医学专家的人工标注,导致结果难以复现和改进。论文的关键解决方案是引入一个新的公开基准,该基准包含真实世界的消费者医疗问题,并由医学专家进行长篇答案的标注评估。通过成对比较不同开源和闭源医疗及通用LLMs的回答,基于正确性、有用性、有害性和偏见等标准进行评估,并进行全面的LLM-as-a-judge分析,研究人类判断与LLMs判断的一致性。初步结果显示,开源LLMs在医疗QA中的表现具有与领先闭源模型相媲美的潜力。

链接: https://arxiv.org/abs/2411.09834
作者: Pedram Hosseini,Jessica M. Sin,Bing Ren,Bryceton G. Thomas,Elnaz Nouri,Ali Farahanchi,Saeed Hassanpour
关键词-EN: evaluating large language, large language models, medical question answering, large language, medical
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: AIM-FM: Advancements in Medical Foundation Models Workshop, 38th Conference on Neural Information Processing Systems (NeurIPS 2024)

点击查看摘要

Abstract:There is a lack of benchmarks for evaluating large language models (LLMs) in long-form medical question answering (QA). Most existing medical QA evaluation benchmarks focus on automatic metrics and multiple-choice questions. While valuable, these benchmarks fail to fully capture or assess the complexities of real-world clinical applications where LLMs are being deployed. Furthermore, existing studies on evaluating long-form answer generation in medical QA are primarily closed-source, lacking access to human medical expert annotations, which makes it difficult to reproduce results and enhance existing baselines. In this work, we introduce a new publicly available benchmark featuring real-world consumer medical questions with long-form answer evaluations annotated by medical doctors. We performed pairwise comparisons of responses from various open and closed-source medical and general-purpose LLMs based on criteria such as correctness, helpfulness, harmfulness, and bias. Additionally, we performed a comprehensive LLM-as-a-judge analysis to study the alignment between human judgments and LLMs. Our preliminary results highlight the strong potential of open LLMs in medical QA compared to leading closed models. Code Data: this https URL
摘要:目前缺乏用于评估大语言模型 (LLM) 在长篇医疗问答 (QA) 中的基准测试。大多数现有的医疗 QA 评估基准侧重于自动指标和多项选择题。尽管这些基准具有价值,但它们未能完全捕捉或评估 LLM 在实际临床应用中部署时的复杂性。此外,现有关于评估医疗 QA 中长篇回答生成的研究主要为闭源,缺乏人类医疗专家的注释,这使得难以复现结果并提升现有基线。在本研究中,我们引入了一个新的公开可用基准,该基准包含真实世界的消费者医疗问题,并由医生进行长篇回答评估注释。我们根据正确性、有用性、有害性和偏见等标准,对来自各种开源和闭源医疗及通用 LLM 的回答进行了成对比较。此外,我们还进行了全面的 LLM-as-a-judge 分析,以研究人类判断与 LLM 之间的对齐情况。我们的初步结果显示,与领先的闭源模型相比,开源 LLM 在医疗 QA 中展现出强大的潜力。代码和数据:此 https URL

[NLP-38] Evaluating Gender Bias in Large Language Models

【速读】: 该论文试图解决的问题是大型语言模型(LLMs)在职业语境中代词选择上的性别偏见。解决方案的关键在于通过三种不同的句子处理方法(masked tokens, unmasked sentences, 和 sentence completion)以及职业名称生成来评估模型在性别分布上的表现。研究发现,模型的代词选择与美国劳动力数据中的性别分布呈正相关,女性代词更常与女性主导的职业相关联,而男性代词则更常与男性主导的职业相关联。特别是句子完成方法显示出与实际性别分布最强的相关性。此外,名称生成虽然呈现出更平衡的性别分布,但在男性或女性主导的职业中仍有显著差异。总体而言,提示方法(prompting)对性别分布的影响大于模型选择本身,这突显了在LLMs中解决性别偏见问题的复杂性,并强调了提示在性别映射中的重要性。

链接: https://arxiv.org/abs/2411.09826
作者: Michael Döll,Markus Döhring,Andreas Müller
关键词-EN: Large Language Models, language models, gender distribution, Gender bias, Gender
类目: Computation and Language (cs.CL)
备注: 13 pages, 12 figures, 1 table

点击查看摘要

Abstract:Gender bias in artificial intelligence has become an important issue, particularly in the context of language models used in communication-oriented applications. This study examines the extent to which Large Language Models (LLMs) exhibit gender bias in pronoun selection in occupational contexts. The analysis evaluates the models GPT-4, GPT-4o, PaLM 2 Text Bison and Gemini 1.0 Pro using a self-generated dataset. The jobs considered include a range of occupations, from those with a significant male presence to those with a notable female concentration, as well as jobs with a relatively equal gender distribution. Three different sentence processing methods were used to assess potential gender bias: masked tokens, unmasked sentences, and sentence completion. In addition, the LLMs suggested names of individuals in specific occupations, which were then examined for gender distribution. The results show a positive correlation between the models’ pronoun choices and the gender distribution present in U.S. labor force data. Female pronouns were more often associated with female-dominated occupations, while male pronouns were more often associated with male-dominated occupations. Sentence completion showed the strongest correlation with actual gender distribution, while name generation resulted in a more balanced ‘politically correct’ gender distribution, albeit with notable variations in predominantly male or female occupations. Overall, the prompting method had a greater impact on gender distribution than the model selection itself, highlighting the complexity of addressing gender bias in LLMs. The findings highlight the importance of prompting in gender mapping.
摘要:人工智能中的性别偏见已成为一个重要问题,特别是在面向通信应用的语言模型中。本研究探讨了大语言模型 (LLMs) 在职业语境中代词选择时表现出的性别偏见程度。分析评估了 GPT-4、GPT-4o、PaLM 2 Text Bison 和 Gemini 1.0 Pro 模型,使用自生成数据集。所考虑的职业包括从男性占显著比例到女性占显著比例,以及性别分布相对均衡的各种职业。采用了三种不同的句子处理方法来评估潜在的性别偏见:掩码 Token (masked tokens)、未掩码句子 (unmasked sentences) 和句子补全 (sentence completion)。此外,LLMs 为特定职业推荐了个人姓名,随后对其性别分布进行了检查。结果显示,模型代词选择与美国劳动力数据中的性别分布呈正相关。女性代词更多地与女性主导的职业相关联,而男性代词更多地与男性主导的职业相关联。句子补全显示出与实际性别分布最强的相关性,而姓名生成则产生了更为平衡的“政治正确”性别分布,尽管在男性或女性占主导的职业中存在显著差异。总体而言,提示方法对性别分布的影响大于模型选择本身,凸显了在 LLMs 中解决性别偏见的复杂性。研究结果强调了提示在性别映射中的重要性。
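
论文将模型的代词选择与美国劳动力数据中的性别分布做相关分析。下面是一个示意性草图(非官方实现),演示如何用 Spearman 相关系数比较“模型输出中女性代词的占比”与“该职业实际女性从业比例”;职业与数值均为虚构。

```python
from scipy.stats import spearmanr

# 虚构数据:各职业下模型生成女性代词的占比 与 实际劳动力中女性占比
occupations = ["nurse", "engineer", "teacher", "carpenter", "cashier"]
model_female_pronoun_ratio = [0.92, 0.15, 0.70, 0.08, 0.65]
labor_force_female_share   = [0.88, 0.16, 0.74, 0.03, 0.71]

rho, p_value = spearmanr(model_female_pronoun_ratio, labor_force_female_share)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
# rho 越接近 1,说明模型的代词选择越贴近现实职业中的性别分布(即复现了现实中的偏差)
```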

[NLP-39] Evaluating the Predictive Capacity of ChatGPT for Academic Peer Review Outcomes Across Multiple Platforms

【速读】: 该论文试图解决的问题是如何利用大型语言模型(LLMs)如ChatGPT来预测学术论文的同行评审结果。解决方案的关键在于引入两种新的评估情境,并采用更为稳健的方法——平均多个ChatGPT评分。具体来说,研究通过平均30个ChatGPT的预测结果,基于评审指南,仅使用提交的标题和摘要,来评估不同平台(如F1000Research、SciPost Physics和ICLR)的论文质量。研究发现,尽管在某些情境下ChatGPT能够产生弱正相关的预发表质量评估,但其有效性和最佳应用策略在不同平台间存在显著差异。此外,最适用于ChatGPT的输入内容也因平台而异,包括是否使用全文以及是否采用链式思维系统提示等策略。

链接: https://arxiv.org/abs/2411.09763
作者: Mike Thelwall,Abdullah Yaghi
关键词-EN: Large Language Models, Language Models, Large Language, predict peer review, peer review outcomes
类目: Digital Libraries (cs.DL); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:While previous studies have demonstrated that Large Language Models (LLMs) can predict peer review outcomes to some extent, this paper builds on that by introducing two new contexts and employing a more robust method - averaging multiple ChatGPT scores. The findings show that averaging 30 ChatGPT predictions, based on reviewer guidelines and using only the submitted titles and abstracts, failed to predict peer review outcomes for F1000Research (Spearman’s rho=0.00). However, it produced mostly weak positive correlations with the quality dimensions of SciPost Physics (rho=0.25 for validity, rho=0.25 for originality, rho=0.20 for significance, and rho=0.08 for clarity) and a moderate positive correlation for papers from the International Conference on Learning Representations (ICLR) (rho=0.38). Including the full text of articles significantly increased the correlation for ICLR (rho=0.46) and slightly improved it for F1000Research (rho=0.09), while it had variable effects on the four quality dimension correlations for SciPost LaTeX files. The use of chain-of-thought system prompts slightly increased the correlation for F1000Research (rho=0.10), marginally reduced it for ICLR (rho=0.37), and further decreased it for SciPost Physics (rho=0.16 for validity, rho=0.18 for originality, rho=0.18 for significance, and rho=0.05 for clarity). Overall, the results suggest that in some contexts, ChatGPT can produce weak pre-publication quality assessments. However, the effectiveness of these assessments and the optimal strategies for employing them vary considerably across different platforms, journals, and conferences. Additionally, the most suitable inputs for ChatGPT appear to differ depending on the platform.
摘要:尽管先前的研究表明大语言模型 (LLMs) 在一定程度上可以预测同行评审的结果,但本文在此基础上引入了两个新的情境,并采用了一种更为稳健的方法——平均多个 ChatGPT 评分。研究发现,基于评审指南并仅使用提交的标题和摘要,平均 30 个 ChatGPT 预测未能预测 F1000Research 的同行评审结果(Spearman’s rho=0.00)。然而,它对 SciPost Physics 的质量维度产生了大部分弱正相关(有效性 rho=0.25,原创性 rho=0.25,重要性 rho=0.20,清晰度 rho=0.08),并对国际学习表示会议 (ICLR) 的论文产生了中等正相关(rho=0.38)。包含文章全文显著增加了 ICLR 的相关性(rho=0.46),并略微改善了 F1000Research 的相关性(rho=0.09),而对 SciPost LaTeX 文件的四个质量维度相关性产生了不同影响。使用思维链系统提示略微增加了 F1000Research 的相关性(rho=0.10),略微降低了 ICLR 的相关性(rho=0.37),并进一步降低了 SciPost Physics 的相关性(有效性 rho=0.16,原创性 rho=0.18,重要性 rho=0.18,清晰度 rho=0.05)。总体而言,结果表明在某些情境下,ChatGPT 可以产生弱出版前质量评估。然而,这些评估的有效性以及最佳应用策略在不同平台、期刊和会议之间存在显著差异。此外,ChatGPT 最适合的输入似乎因平台而异。

人工智能

[AI-0] VeriGraph: Scene Graphs for Execution Verifiable Robot Planning

链接: https://arxiv.org/abs/2411.10446
作者: Daniel Ekpo,Mara Levy,Saksham Suri,Chuong Huynh,Abhinav Shrivastava
关键词-EN: challenges remain due, Recent advancements, vision-language models, offer potential, incorrect action sequences
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent advancements in vision-language models (VLMs) offer potential for robot task planning, but challenges remain due to VLMs’ tendency to generate incorrect action sequences. To address these limitations, we propose VeriGraph, a novel framework that integrates VLMs for robotic planning while verifying action feasibility. VeriGraph employs scene graphs as an intermediate representation, capturing key objects and spatial relationships to improve plan verification and refinement. The system generates a scene graph from input images and uses it to iteratively check and correct action sequences generated by an LLM-based task planner, ensuring constraints are respected and actions are executable. Our approach significantly enhances task completion rates across diverse manipulation scenarios, outperforming baseline methods by 58% for language-based tasks and 30% for image-based tasks.

[AI-1] Mitigating Parameter Degeneracy using Joint Conditional Diffusion Model for WECC Composite Load Model in Power Systems

链接: https://arxiv.org/abs/2411.10431
作者: Feiqin Zhu,Dmitrii Torbunov,Yihui Ren,Zhongjing Jiang,Tianqiao Zhao,Amirthagunaraj Yogarathnam,Meng Yue
关键词-EN: gained widespread attention, Data-driven modeling, recent years, gained widespread, widespread attention
类目: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Data-driven modeling for dynamic systems has gained widespread attention in recent years. Its inverse formulation, parameter estimation, aims to infer the inherent model parameters from observations. However, parameter degeneracy, where different combinations of parameters yield the same observable output, poses a critical barrier to accurately and uniquely identifying model parameters. In the context of WECC composite load model (CLM) in power systems, utility practitioners have observed that CLM parameters carefully selected for one fault event may not perform satisfactorily in another fault. Here, we innovate a joint conditional diffusion model-based inverse problem solver (JCDI), that incorporates a joint conditioning architecture with simultaneous inputs of multi-event observations to improve parameter generalizability. Simulation studies on the WECC CLM show that the proposed JCDI effectively reduces uncertainties of degenerate parameters, thus the parameter estimation error is decreased by 42.1% compared to a single-event learning scheme. This enables the model to achieve high accuracy in predicting power trajectories under different fault events, including electronic load tripping and motor stalling, outperforming standard deep reinforcement learning and supervised learning approaches. We anticipate this work will contribute to mitigating parameter degeneracy in system dynamics, providing a general parameter estimation framework across various scientific domains.

[AI-2] Evaluating Creativity and Deception in Large Language Models: A Simulation Framework for Multi-Agent Balderdash ACL2024

链接: https://arxiv.org/abs/2411.10422
作者: Parsa Hejabi,Elnaz Rahmati,Alireza S. Ziabari,Preni Golazizian,Jesse Thomason,Morteza Dehghani
关键词-EN: Large Language Models, Large Language, Language Models, creativity remains underexplored, shown impressive capabilities
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
*备注: Accepted at Wordplay: When Language Meets Games @ ACL 2024

点击查看摘要

Abstract:Large Language Models (LLMs) have shown impressive capabilities in complex tasks and interactive environments, yet their creativity remains underexplored. This paper introduces a simulation framework utilizing the game Balderdash to evaluate both the creativity and logical reasoning of LLMs. In Balderdash, players generate fictitious definitions for obscure terms to deceive others while identifying correct definitions. Our framework enables multiple LLM agents to participate in this game, assessing their ability to produce plausible definitions and strategize based on game rules and history. We implemented a centralized game engine featuring various LLMs as participants and a judge LLM to evaluate semantic equivalence. Through a series of experiments, we analyzed the performance of different LLMs, examining metrics such as True Definition Ratio, Deception Ratio, and Correct Guess Ratio. The results provide insights into the creative and deceptive capabilities of LLMs, highlighting their strengths and areas for improvement. Specifically, the study reveals that infrequent vocabulary in LLMs’ input leads to poor reasoning on game rules and historical context (this https URL).

[AI-3] Repurposing Stable Diffusion Attention for Training-Free Unsupervised Interactive Segmentation

链接: https://arxiv.org/abs/2411.10411
作者: Markus Karmann,Onay Urfalioglu
关键词-EN: obtain high quality, based Image Segmentation, Recent progress, quality semantic labels, prompt based Image
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent progress in interactive point prompt based Image Segmentation allows to significantly reduce the manual effort to obtain high quality semantic labels. State-of-the-art unsupervised methods use self-supervised pre-trained models to obtain pseudo-labels which are used in training a prompt-based segmentation model. In this paper, we propose a novel unsupervised and training-free approach based solely on the self-attention of Stable Diffusion. We interpret the self-attention tensor as a Markov transition operator, which enables us to iteratively construct a Markov chain. Pixel-wise counting of the required number of iterations along the Markov-chain to reach a relative probability threshold yields a Markov-iteration-map, which we simply call a Markov-map. Compared to the raw attention maps, we show that our proposed Markov-map has less noise, sharper semantic boundaries and more uniform values within semantically similar regions. We integrate the Markov-map in a simple yet effective truncated nearest neighbor framework to obtain interactive point prompt based segmentation. Despite being training-free, we experimentally show that our approach yields excellent results in terms of Number of Clicks (NoC), even outperforming state-of-the-art training based unsupervised methods in most of the datasets.
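
该论文把 Stable Diffusion 的自注意力矩阵视作马尔可夫转移算子:从用户点击的位置出发迭代传播概率,并统计每个位置首次达到相对概率阈值所需的迭代次数,得到 Markov-map。下面是一个纯 NumPy 的最小示意草图(非官方实现),阈值与归一化细节均为假设。

```python
import numpy as np

def markov_map(attention, seed_idx, threshold=0.5, max_iters=64):
    """示意:attention 为 (N, N) 自注意力矩阵(每行和为 1,视作转移概率),
    seed_idx 为用户点击的 token/像素索引;返回每个位置首次达到相对阈值所需的迭代次数。"""
    n = attention.shape[0]
    p = np.zeros(n); p[seed_idx] = 1.0          # 初始分布集中在点击位置
    iters_needed = np.full(n, max_iters, dtype=float)
    for t in range(1, max_iters + 1):
        p = p @ attention                        # 马尔可夫链迭代一步
        rel = p / (p.max() + 1e-12)              # 相对概率
        newly = (rel >= threshold) & (iters_needed == max_iters)
        iters_needed[newly] = t                  # 记录首次越过阈值的迭代步数
    return iters_needed                          # 数值越小,越可能与点击点同属一个物体

# 用法示例:构造一个行随机矩阵模拟自注意力
rng = np.random.default_rng(0)
A = rng.random((16, 16)); A = A / A.sum(axis=1, keepdims=True)
print(markov_map(A, seed_idx=3))
```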

[AI-4] Deep Learning for Micro-Scale Crack Detection on Imbalanced Datasets Using Key Point Localization

链接: https://arxiv.org/abs/2411.10389
作者: Fatahlla Moreh(Christian Albrechts University, Kiel, Germany),Yusuf Hasan(Aligarh Muslim University, Aligarh, India),Bilal Zahid Hussain(Texas A&M University, College Station, USA),Mohammad Ammar(Aligarh Muslim University, Aligarh, India),Sven Tomforde(Christian Albrechts University, Kiel, Germany)
关键词-EN: Internal crack detection, structural health monitoring, Internal crack, health monitoring, crack detection
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Internal crack detection has been a subject of focus in structural health monitoring. By focusing on crack detection in structural datasets, it is demonstrated that deep learning (DL) methods can effectively analyze seismic wave fields interacting with micro-scale cracks, which are beyond the resolution of conventional visual inspection. This work explores a novel application of DL-based key point detection technique, where cracks are localized by predicting the coordinates of four key points that define a bounding region of the crack. The study not only opens new research directions for non-visual applications but also effectively mitigates the impact of imbalanced data which poses a challenge for previous DL models, as it can be biased toward predicting the majority class (non-crack regions). Popular DL techniques, such as the Inception blocks, are used and investigated. The model shows an overall reduction in loss when applied to micro-scale crack detection and is reflected in the lower average deviation between the location of actual and predicted cracks, with an average Intersection over Union (IoU) being 0.511 for all micro cracks (greater than 0.00 micrometers) and 0.631 for larger micro cracks (greater than 4 micrometers).

[AI-5] Low-Latency Task-Oriented Communications with Multi-Round Multi-Task Deep Learning

链接: https://arxiv.org/abs/2411.10385
作者: Yalin E. Sagduyu,Tugba Erpek,Aylin Yener,Sennur Ulukus
关键词-EN: learns compressed latent, compressed latent representations, transmitter learns compressed, channel, learns compressed
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Information Theory (cs.IT); Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:In this paper, we address task-oriented (or goal-oriented) communications where an encoder at the transmitter learns compressed latent representations of data, which are then transmitted over a wireless channel. At the receiver, a decoder performs a machine learning task, specifically for classifying the received signals. The deep neural networks corresponding to the encoder-decoder pair are jointly trained, taking both channel and data characteristics into account. Our objective is to achieve high accuracy in completing the underlying task while minimizing the number of channel uses determined by the encoder’s output size. To this end, we propose a multi-round, multi-task learning (MRMTL) approach for the dynamic update of channel uses in multi-round transmissions. The transmitter incrementally sends an increasing number of encoded samples over the channel based on the feedback from the receiver, and the receiver utilizes the signals from a previous round to enhance the task performance, rather than only considering the latest transmission. This approach employs multi-task learning to jointly optimize accuracy across varying number of channel uses, treating each configuration as a distinct task. By evaluating the confidence of the receiver in task decisions, MRMTL decides on whether to allocate additional channel uses in multiple rounds. We characterize both the accuracy and the delay (total number of channel uses) of MRMTL, demonstrating that it achieves the accuracy close to that of conventional methods requiring large numbers of channel uses, but with reduced delay by incorporating signals from a prior round. We consider the CIFAR-10 dataset, convolutional neural network architectures, and AWGN and Rayleigh channel models for performance evaluation. We show that MRMTL significantly improves the efficiency of task-oriented communications, balancing accuracy and latency effectively.

[AI-6] Towards High-Fidelity 3D Portrait Generation with Rich Details by Cross-View Prior-Aware Diffusion

链接: https://arxiv.org/abs/2411.10369
作者: Haoran Wei,Wencheng Han,Xingping Dong,Jianbing Shen
关键词-EN: Recent diffusion-based Single-image, methods typically employ, Recent diffusion-based, diffusion-based Single-image, provide multi-view knowledge
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent diffusion-based Single-image 3D portrait generation methods typically employ 2D diffusion models to provide multi-view knowledge, which is then distilled into 3D representations. However, these methods usually struggle to produce high-fidelity 3D models, frequently yielding excessively blurred textures. We attribute this issue to the insufficient consideration of cross-view consistency during the diffusion process, resulting in significant disparities between different views and ultimately leading to blurred 3D representations. In this paper, we address this issue by comprehensively exploiting multi-view priors in both the conditioning and diffusion procedures to produce consistent, detail-rich portraits. From the conditioning standpoint, we propose a Hybrid Priors Diffusion model, which explicitly and implicitly incorporates multi-view priors as conditions to enhance the status consistency of the generated multi-view portraits. From the diffusion perspective, considering the significant impact of the diffusion noise distribution on detailed texture generation, we propose a Multi-View Noise Resampling Strategy integrated within the optimization process leveraging cross-view priors to enhance representation consistency. Extensive experiments demonstrate that our method can produce 3D portraits with accurate geometry and rich details from a single image. The project page is at this https URL.

[AI-7] Mechanisms of Generative Image-to-Image Translation Networks

链接: https://arxiv.org/abs/2411.10368
作者: Guangzong Chen,Mingui Sun,Zhi-Hong Mao,Kangni Liu,Wenyan Jia
关键词-EN: Generative Adversarial Networks, Generative Adversarial, class of neural, neural networks, Generative
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Generative Adversarial Networks (GANs) are a class of neural networks that have been widely used in the field of image-to-image translation. In this paper, we propose a streamlined image-to-image translation network with a simpler architecture compared to existing models. We investigate the relationship between GANs and autoencoders and provide an explanation for the efficacy of employing only the GAN component for tasks involving image translation. We show that adversarial training for GAN models yields results comparable to those of existing methods without additional complex loss penalties. Subsequently, we elucidate the rationale behind this phenomenon. We also incorporate experimental results to demonstrate the validity of our findings.

[AI-8] Continual Adversarial Reinforcement Learning (CARL) of False Data Injection detection: forgetting and explainability

链接: https://arxiv.org/abs/2411.10367
作者: Pooja Aslami,Kejun Chen,Timothy M. Hansen,Malik Hassanaly
关键词-EN: False data injection, data injection attacks, renewable energy production, growing concern linked, increased renewable energy
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:False data injection attacks (FDIAs) on smart inverters are a growing concern linked to increased renewable energy production. While data-based FDIA detection methods are also actively developed, we show that they remain vulnerable to impactful and stealthy adversarial examples that can be crafted using Reinforcement Learning (RL). We propose to include such adversarial examples in data-based detection training procedure via a continual adversarial RL (CARL) approach. This way, one can pinpoint the deficiencies of data-based detection, thereby offering explainability during their incremental improvement. We show that a continual learning implementation is subject to catastrophic forgetting, and additionally show that forgetting can be addressed by employing a joint training strategy on all generated FDIA scenarios.

[AI-9] Forming Auxiliary High-confident Instance-level Loss to Promote Learning from Label Proportions

链接: https://arxiv.org/abs/2411.10364
作者: Tianhao Ma,Han Chen,Juncheng Hu,Yungang Zhu,Ximing Li
关键词-EN: weakly-supervised learning task, challenging weakly-supervised learning, High-confident Instance-level Loss, instance-level loss, auxiliary instance-level loss
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Learning from label proportions (LLP), i.e., a challenging weakly-supervised learning task, aims to train a classifier by using bags of instances and the proportions of classes within bags, rather than annotated labels for each instance. Beyond the traditional bag-level loss, the mainstream methodology of LLP is to incorporate an auxiliary instance-level loss with pseudo-labels formed by predictions. Unfortunately, we empirically observed that the pseudo-labels are often inaccurate due to over-smoothing, especially for the scenarios with large bag sizes, hurting the classifier induction. To alleviate this problem, we suggest a novel LLP method, namely Learning from Label Proportions with Auxiliary High-confident Instance-level Loss (L^2P-AHIL). Specifically, we propose a dual entropy-based weight (DEW) method to adaptively measure the confidences of pseudo-labels. It simultaneously emphasizes accurate predictions at the bag level and avoids overly smoothed predictions. We then form high-confident instance-level loss with DEW, and jointly optimize it with the bag-level loss in a self-training manner. The experimental results on benchmark datasets show that L^2P-AHIL can surpass the existing baseline methods, and the performance gain can be more significant as the bag size increases.

[AI-10] Domain Adaptation-based Edge Computing for Cross-Conditions Fault Diagnosis

链接: https://arxiv.org/abs/2411.10340
作者: Yanzhi Wang,Chu Wang,Jinhong Wu,Ziyang Yu,Qi Zhou
关键词-EN: diagnosis technology supports, Fault diagnosis, Fault diagnosis technology, operation of mechanical, mechanical equipment
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
*备注: 28 pages, 11 figures

点击查看摘要

Abstract:Fault diagnosis technology supports the healthy operation of mechanical equipment. However, the varying conditions during the operation of mechanical equipment lead to significant disparities in data distribution, posing challenges to fault diagnosis. Furthermore, when deploying applications, traditional methods often encounter issues such as latency and data security. Therefore, conducting fault diagnosis and deploying application methods under cross-operating conditions holds significant value. This paper proposes a domain adaptation-based lightweight fault diagnosis framework for edge computing scenarios. Incorporating the local maximum mean discrepancy into knowledge transfer aligns the feature distributions of different domains in a high-dimensional feature space, to discover a common feature space across domains. The acquired fault diagnosis expertise from the cloud-model is transferred to the lightweight edge-model using adaptation knowledge transfer methods. While ensuring real-time diagnostic capabilities, accurate fault diagnosis is achieved across working conditions. We conducted validation experiments on the NVIDIA Jetson Xavier NX kit. In terms of diagnostic performance, the proposed method significantly improved diagnostic accuracy, with average increases of 34.44% and 17.33% compared to the comparison method, respectively. Regarding lightweight effectiveness, the proposed method achieved an average inference speed increase of 80.47%. Additionally, compared to the cloud-model, the parameter count of the edge-model decreased by 96.37%, while the Flops decreased by 83.08%.
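
The alignment term at the heart of this kind of knowledge transfer can be illustrated with a plain kernel MMD between source- and target-condition features; the paper uses the local (class-conditional) variant, but the global form below shows the mechanism that would be added to the diagnosis loss.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    # pairwise squared distances, then RBF kernel
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd2(source_feats, target_feats, sigma=1.0):
    """Squared maximum mean discrepancy between two feature batches."""
    k_ss = gaussian_kernel(source_feats, source_feats, sigma).mean()
    k_tt = gaussian_kernel(target_feats, target_feats, sigma).mean()
    k_st = gaussian_kernel(source_feats, target_feats, sigma).mean()
    return k_ss + k_tt - 2 * k_st

src = np.random.normal(0.0, 1.0, (64, 16))   # features under the source working condition
tgt = np.random.normal(0.5, 1.0, (64, 16))   # features under the target working condition
print(mmd2(src, tgt))                        # would be added to the fault-diagnosis training loss
```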

[AI-11] A Realistic Collimated X-Ray Image Simulation Pipeline

链接: https://arxiv.org/abs/2411.10308
作者: Benjamin El-Zein,Dominik Eckert,Thomas Weber,Maximilian Rohleder,Ludwig Ritschl,Steffen Kappler,Andreas Maier
关键词-EN: detectors position relative, Collimator detection remains, task in X-ray, X-ray systems, detection remains
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Medical Physics (physics.med-ph)
*备注:

点击查看摘要

Abstract:Collimator detection remains a challenging task in X-ray systems with unreliable or unavailable information about the detector's position relative to the source. This paper presents a physically motivated image processing pipeline for simulating the characteristics of collimator shadows in X-ray images. By generating randomized labels for collimator shapes and locations, incorporating scattered radiation simulation, and including Poisson noise, the pipeline enables the expansion of limited datasets for training deep neural networks. We validate the proposed pipeline by a qualitative and quantitative comparison against real collimator shadows. Furthermore, it is demonstrated that utilizing simulated data within our deep learning framework not only serves as a suitable substitute for actual collimators but also enhances the generalization performance when applied to real-world data.

[AI-12] RETR: Multi-View Radar Detection Transformer for Indoor Perception NEURIPS2024

链接: https://arxiv.org/abs/2411.10293
作者: Ryoma Yataka,Adriano Cardace,Pu Perry Wang,Petros Boufounos,Ryuhei Takahashi
关键词-EN: rising interest due, affordable costs driven, emerging automotive imaging, reduced privacy concerns, automotive imaging radar
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Differential Geometry (math.DG)
*备注: 24 pages, Accepted to NeurIPS 2024

点击查看摘要

Abstract:Indoor radar perception has seen rising interest due to affordable costs driven by emerging automotive imaging radar developments and the benefits of reduced privacy concerns and reliability under hazardous conditions (e.g., fire and smoke). However, existing radar perception pipelines fail to account for distinctive characteristics of the multi-view radar setting. In this paper, we propose Radar dEtection TRansformer (RETR), an extension of the popular DETR architecture, tailored for multi-view radar perception. RETR inherits the advantages of DETR, eliminating the need for hand-crafted components for object detection and segmentation in the image plane. More importantly, RETR incorporates carefully designed modifications such as 1) depth-prioritized feature similarity via a tunable positional encoding (TPE); 2) a tri-plane loss from both radar and camera coordinates; and 3) a learnable radar-to-camera transformation via reparameterization, to account for the unique multi-view radar setting. Evaluated on two indoor radar perception datasets, our approach outperforms existing state-of-the-art methods by a margin of 15.38+ AP for object detection and 11.77+ IoU for instance segmentation, respectively.

[AI-13] The ParClusterers Benchmark Suite (PCBS): A Fine-Grained Analysis of Scalable Graph Clustering VLDB’25

链接: https://arxiv.org/abs/2411.10290
作者: Shangdi Yu,Jessica Shi,Jamison Meindl,David Eisenstat,Xiaoen Ju,Sasan Tavakkol,Laxman Dhulipala,Jakub Łącki,Vahab Mirrokni,Julian Shun
关键词-EN: ParClusterers Benchmark Suite, graph clustering algorithms, clustering algorithms, Benchmark Suite, graph clustering
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注: This is a preliminary version of a paper that will appear at VLDB’25

点击查看摘要

Abstract:We introduce the ParClusterers Benchmark Suite (PCBS) – a collection of highly scalable parallel graph clustering algorithms and benchmarking tools that streamline comparing different graph clustering algorithms and implementations. The benchmark includes clustering algorithms that target a wide range of modern clustering use cases, including community detection, classification, and dense subgraph mining. The benchmark toolkit makes it easy to run and evaluate multiple instances of different clustering algorithms, which can be useful for fine-tuning the performance of clustering on a given task, and for comparing different clustering algorithms based on different metrics of interest, including clustering quality and running time. Using PCBS, we evaluate a broad collection of real-world graph clustering datasets. Somewhat surprisingly, we find that the best quality results are obtained by algorithms that are not included in many popular graph clustering toolkits. The PCBS provides a standardized way to evaluate and judge the quality-performance tradeoffs of the active research area of scalable graph clustering algorithms. We believe it will help enable fair, accurate, and nuanced evaluation of graph clustering algorithms in the future.

[AI-14] Systolic Arrays and Structured Pruning Co-design for Efficient Transformers in Edge Systems

链接: https://arxiv.org/abs/2411.10285
作者: Pedro Palacios,Rafael Medina,Jean-Luc Rouas,Giovanni Ansaloni,David Atienza
关键词-EN: edge devices necessitates, devices necessitates cross-stack, Efficient deployment, necessitates cross-stack optimization, deployment of resource-intensive
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
*备注: 7 pages, 10 figures

点击查看摘要

Abstract:Efficient deployment of resource-intensive transformers on edge devices necessitates cross-stack optimization. We thus study the interrelation between structured pruning and systolic acceleration, matching the size of pruned blocks with the systolic array dimensions. In this setting, computations of pruned weight blocks can be skipped, reducing run-time and energy consumption, but potentially impacting quality of service (QoS). To evaluate the trade-offs between systolic array size and sparsity opportunities, we present a novel co-design framework that integrates algorithmic optimization, system simulation, and hardware design. Targeting speech recognition using transformers as a case study, we analyze how configuration choices across the stack affect performance metrics. Results demonstrate that structured pruning on systems featuring systolic array acceleration can effectively increase performance, while maintaining high QoS levels. Up to 26% system-wide speedups due to structured pruning were measured, with only 1.4% word error rate degradation on the standard Librispeech dataset.
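
The co-design idea of matching pruned blocks to the systolic array can be sketched as block-wise magnitude pruning, where every zeroed tile corresponds to an array-sized computation that can be skipped. Block size and sparsity below are illustrative values, not the paper's configuration.

```python
import numpy as np

def block_prune(weight, block=16, sparsity=0.5):
    """Zero out the weakest (block x block) tiles of a weight matrix."""
    rows, cols = weight.shape
    assert rows % block == 0 and cols % block == 0
    tiles = weight.reshape(rows // block, block, cols // block, block)
    norms = np.linalg.norm(tiles, axis=(1, 3))            # one norm per tile
    threshold = np.quantile(norms, sparsity)
    mask = (norms >= threshold)[:, None, :, None]          # keep only the strongest tiles
    return (tiles * mask).reshape(rows, cols), mask.mean()

W = np.random.randn(128, 128)
W_pruned, kept = block_prune(W, block=16, sparsity=0.5)    # block size matches the systolic array
print(f"fraction of 16x16 tiles kept: {kept:.2f}")          # every pruned tile is a skippable array pass
```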

[AI-15] Lateral Movement Detection via Time-aware Subgraph Classification on Authentication Logs

链接: https://arxiv.org/abs/2411.10279
作者: Jiajun Zhou,Jiacheng Yao,Xuanze Chen,Shanqing Yu,Qi Xuan,Xiaoniu Yang
关键词-EN: advanced persistent threat, lateral movement detection, Lateral movement, crucial component, component of advanced
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Lateral movement is a crucial component of advanced persistent threat (APT) attacks in networks. Attackers exploit security vulnerabilities in internal networks or IoT devices, expanding their control after initial infiltration to steal sensitive data or carry out other malicious activities, posing a serious threat to system security. Existing research suggests that attackers generally employ seemingly unrelated operations to mask their malicious intentions, thereby evading existing lateral movement detection methods and hiding their intrusion traces. In this regard, we analyze host authentication log data from a graph perspective and propose a multi-scale lateral movement detection framework called LMDetect. The main workflow of this framework proceeds as follows: 1) Construct a heterogeneous multigraph from host authentication log data to strengthen the correlations among internal system entities; 2) Design a time-aware subgraph generator to extract subgraphs centered on authentication events from the heterogeneous authentication multigraph; 3) Design a multi-scale attention encoder that leverages both local and global attention to capture hidden anomalous behavior patterns in the authentication subgraphs, thereby achieving lateral movement detection. Extensive experiments on two real-world authentication log datasets demonstrate the effectiveness and superiority of our framework in detecting lateral movement behaviors.

[AI-16] The Unreasonable Effectiveness of Guidance for Diffusion Models

链接: https://arxiv.org/abs/2411.10257
作者: Tim Kaiser,Nikolas Adaloglou,Markus Kollmann
关键词-EN: error-correcting technique, perceptual quality, quality of images, images generated, auxiliary diffusion model
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Preprint. 19 pages, 14 figures in total, including references and appendix

点击查看摘要

Abstract:Guidance is an error-correcting technique used to improve the perceptual quality of images generated by diffusion models. Typically, the correction is achieved by linear extrapolation, using an auxiliary diffusion model that has lower performance than the primary model. Using a 2D toy example, we show that it is highly beneficial when the auxiliary model exhibits similar errors as the primary one but stronger. We verify this finding in higher dimensions, where we show that competitive generative performance to state-of-the-art guidance methods can be achieved when the auxiliary model differs from the primary one only by having stronger weight regularization. As an independent contribution, we investigate whether upweighting long-range spatial dependencies improves visual fidelity. The result is a novel guidance method, which we call sliding window guidance (SWG), that guides the primary model with itself by constraining its receptive field. Intriguingly, SWG aligns better with human preferences than state-of-the-art guidance methods while requiring neither training, architectural modifications, nor class conditioning. The code will be released.
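
The guidance mechanism the abstract builds on is a linear extrapolation of the primary model's prediction away from a weaker auxiliary prediction; in sliding window guidance the weaker prediction comes from the primary model itself with a constrained receptive field. The toy sketch below, with a hypothetical windowed evaluation standing in for that constraint, shows the shape of the update only.

```python
import numpy as np

def guided_prediction(strong, weak, guidance_scale=2.0):
    # classic linear extrapolation used by guidance methods
    return weak + guidance_scale * (strong - weak)

def windowed_prediction(model, x, window=8):
    """Hypothetical receptive-field restriction: evaluate the model on sliding windows."""
    out = np.zeros_like(x)
    for start in range(0, x.shape[-1], window):
        out[..., start:start + window] = model(x[..., start:start + window])
    return out

model = lambda x: 0.9 * x - 0.1 * x.mean(axis=-1, keepdims=True)   # toy "denoiser"
x_t = np.random.randn(1, 64)                 # noisy sample at some diffusion step
strong = model(x_t)                          # full receptive field
weak = windowed_prediction(model, x_t)       # constrained receptive field (the "auxiliary" model)
print(guided_prediction(strong, weak).shape)
```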

[AI-17] Artificial Intelligence in Pediatric Echocardiography: Exploring Challenges, Opportunities and Clinical Applications with Explainable AI and Federated Learning

链接: https://arxiv.org/abs/2411.10255
作者: Mohammed Yaseen Jabarulla,Theodor Uden,Thomas Jack,Philipp Beerbaum,Steffen Oeltze-Jafra
关键词-EN: heart diseases present, Pediatric heart diseases, present a broad, broad spectrum, pediatric echocardiography
类目: Artificial Intelligence (cs.AI)
*备注: This article is planned for submission to Frontiers Journal

点击查看摘要

Abstract:Pediatric heart diseases present a broad spectrum of congenital and acquired diseases. More complex congenital malformations require a differentiated and multimodal decision-making process, usually including echocardiography as a central imaging method. Artificial intelligence (AI) offers considerable promise for clinicians by facilitating automated interpretation of pediatric echocardiography data. However, adapting AI technologies for pediatric echocardiography analysis has challenges such as limited public data availability, data privacy, and AI model transparency. Recently, researchers have focused on disruptive technologies, such as federated learning (FL) and explainable AI (XAI), to improve automatic diagnostic and decision support workflows. This study offers a comprehensive overview of the limitations and opportunities of AI in pediatric echocardiography, emphasizing the synergistic workflow and role of XAI and FL, identifying research gaps, and exploring potential future developments. Additionally, three relevant clinical use cases demonstrate the functionality of XAI and FL with a focus on (i) view recognition, (ii) disease classification, (iii) segmentation of cardiac structures, and (iv) quantitative assessment of cardiac function.

[AI-18] Generative AI in Multimodal User Interfaces: Trends Challenges and Cross-Platform Adaptability

链接: https://arxiv.org/abs/2411.10234
作者: J. Bieniek,M. Rahouti,D. C. Verma
关键词-EN: computer interaction expand, human computer interaction, reshaping user interfaces, user interfaces, introducing new possibilities
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: 13 pages, 4 figures

点击查看摘要

Abstract:As the boundaries of human computer interaction expand, Generative AI emerges as a key driver in reshaping user interfaces, introducing new possibilities for personalized, multimodal and cross-platform interactions. This integration reflects a growing demand for more adaptive and intuitive user interfaces that can accommodate diverse input types such as text, voice and video, and deliver seamless experiences across devices. This paper explores the integration of generative AI in modern user interfaces, examining historical developments and focusing on multimodal interaction, cross-platform adaptability and dynamic personalization. A central theme is the interface dilemma, which addresses the challenge of designing effective interactions for multimodal large language models, assessing the trade-offs between graphical, voice-based and immersive interfaces. The paper further evaluates lightweight frameworks tailored for mobile platforms, spotlighting the role of mobile hardware in enabling scalable multimodal AI. Technical and ethical challenges, including context retention, privacy concerns and balancing cloud and on-device processing are thoroughly examined. Finally, the paper outlines future directions such as emotionally adaptive interfaces, predictive AI driven user interfaces and real-time collaborative systems, underscoring generative AI’s potential to redefine adaptive user-centric interfaces across platforms.

[AI-19] ColorEdit: Training-free Image-Guided Color editing with diffusion model

链接: https://arxiv.org/abs/2411.10232
作者: Xingxi Yin,Zhi Li,Jingfeng Zhang,Chenglin Li,Yin Zhang
关键词-EN: demonstrating remarkable efficacy, impressive generative capabilities, text-guided image editing, image editing tasks, image editing
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Text-to-image (T2I) diffusion models, with their impressive generative capabilities, have been adopted for image editing tasks, demonstrating remarkable efficacy. However, due to attention leakage and collision between the cross-attention map of the object and the new color attribute from the text prompt, text-guided image editing methods may fail to change the color of an object, resulting in a misalignment between the resulting image and the text prompt. In this paper, we conduct an in-depth analysis on the process of text-guided image synthesizing and what semantic information different cross-attention blocks have learned. We observe that the visual representation of an object is determined in the up-block of the diffusion model in the early stage of the denoising process, and color adjustment can be achieved through value matrices alignment in the cross-attention layer. Based on our findings, we propose a straightforward, yet stable, and effective image-guided method to modify the color of an object without requiring any additional fine-tuning or training. Lastly, we present a benchmark dataset called COLORBENCH, the first benchmark to evaluate the performance of color change methods. Extensive experiments validate the effectiveness of our method in object-level color editing and surpass the performance of popular text-guided image editing approaches in both synthesized and real images.

[AI-20] A Low-Resolution Image is Worth 1x1 Words: Enabling Fine Image Super-Resolution with Transformers and TaylorShift

链接: https://arxiv.org/abs/2411.10231
作者: Sanath Budakegowdanadoddi Nagaraju,Brian Bernhard Moser,Tobias Christian Nauen,Stanislav Frolov,Federico Raue,Andreas Dengel
关键词-EN: image reconstruction quality, fine-grained detail enhancement, recently advanced image, advanced image reconstruction, challenges remain due
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:Transformer-based Super-Resolution (SR) models have recently advanced image reconstruction quality, yet challenges remain due to computational complexity and an over-reliance on large patch sizes, which constrain fine-grained detail enhancement. In this work, we propose TaylorIR to address these limitations by utilizing a patch size of 1x1, enabling pixel-level processing in any transformer-based SR model. To address the significant computational demands under the traditional self-attention mechanism, we employ the TaylorShift attention mechanism, a memory-efficient alternative based on Taylor series expansion, achieving full token-to-token interactions with linear complexity. Experimental results demonstrate that our approach achieves new state-of-the-art SR performance while reducing memory consumption by up to 60% compared to traditional self-attention-based transformers.
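
To see why a Taylor-series treatment of attention admits linear complexity, consider the generic first-order linearization below: replacing exp(q·k) with 1 + q·k lets the sums over keys be precomputed once, so no n x n attention matrix is ever formed. This is an illustrative approximation, not TaylorShift's exact formulation.

```python
import numpy as np

def taylor_linear_attention(Q, K, V, eps=1e-6):
    # kernel(q, k) = 1 + q.k  (first-order Taylor expansion of exp(q.k))
    S = K.T @ V                      # (d, d_v): sum_j k_j v_j^T, computed once
    k_sum = K.sum(axis=0)            # (d,)
    v_sum = V.sum(axis=0)            # (d_v,)
    numer = v_sum[None, :] + Q @ S   # per query: sum_j (1 + q.k_j) v_j
    denom = K.shape[0] + Q @ k_sum   # per query: sum_j (1 + q.k_j)
    return numer / (denom[:, None] + eps)

n, d = 1024, 32
Q, K, V = (np.random.randn(n, d) * 0.1 for _ in range(3))
out = taylor_linear_attention(Q, K, V)
print(out.shape)                     # (1024, 32), computed without an n x n attention matrix
```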

[AI-21] MCL: Multi-view Enhanced Contrastive Learning for Chest X-ray Report Generation

链接: https://arxiv.org/abs/2411.10224
作者: Kang Liu,Zhuoqi Ma,Kun Xie,Zhicheng Jiao,Qiguang Miao
关键词-EN: enhancing doctor-patient communication, planning treatment strategies, Radiology reports, enhanced Contrastive Learning, Multi-view enhanced Contrastive
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: this https URL

点击查看摘要

Abstract:Radiology reports are crucial for planning treatment strategies and enhancing doctor-patient communication, yet manually writing these reports is burdensome for radiologists. While automatic report generation offers a solution, existing methods often rely on single-view radiographs, limiting diagnostic accuracy. To address this problem, we propose MCL, a Multi-view enhanced Contrastive Learning method for chest X-ray report generation. Specifically, we first introduce multi-view enhanced contrastive learning for visual representation by maximizing agreements between multi-view radiographs and their corresponding report. Subsequently, to fully exploit patient-specific indications (e.g., patient’s symptoms) for report generation, we add a transitional "bridge" for missing indications to reduce embedding space discrepancies caused by their presence or absence. Additionally, we construct Multi-view CXR and Two-view CXR datasets from public sources to support research on multi-view report generation. Our proposed MCL surpasses recent state-of-the-art methods across multiple datasets, achieving a 5.0% F1 RadGraph improvement on MIMIC-CXR, a 7.3% BLEU-1 improvement on MIMIC-ABN, a 3.1% BLEU-4 improvement on Multi-view CXR, and an 8.2% F1 CheXbert improvement on Two-view CXR.
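
A minimal sketch of the multi-view contrastive objective described above: the fused embedding of a study's radiograph views is pulled towards its own report embedding and pushed away from the other reports in the batch. The encoders, mean-pooling fusion and temperature are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

def info_nce(image_embs, report_embs, temperature=0.07):
    """image_embs: (batch, n_views, dim); report_embs: (batch, dim)."""
    fused = image_embs.mean(axis=1)                              # fuse the radiograph views
    fused /= np.linalg.norm(fused, axis=1, keepdims=True)
    reports = report_embs / np.linalg.norm(report_embs, axis=1, keepdims=True)
    logits = fused @ reports.T / temperature                     # (batch, batch) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)                  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                          # matched pairs lie on the diagonal

imgs = np.random.randn(8, 2, 128)       # 8 studies, 2 views each (e.g. frontal + lateral)
reps = np.random.randn(8, 128)          # 8 corresponding report embeddings
print(info_nce(imgs, reps))
```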

[AI-22] An Empirical Study on LLM-based Agents for Automated Bug Fixing

链接: https://arxiv.org/abs/2411.10213
作者: Xiangxin Meng,Zexiong Ma,Pengfei Gao,Chao Peng
关键词-EN: Large language models, development environment interaction, addressing software defects, Large language, fix bugs automatically
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) and LLM-based Agents have been applied to fix bugs automatically, demonstrating the capability in addressing software defects by engaging in development environment interaction, iterative validation and code modification. However, systematic analysis of these agent and non-agent systems remains limited, particularly regarding performance variations among top-performing ones. In this paper, we examine seven proprietary and open-source systems on the SWE-bench Lite benchmark for automated bug fixing. We first assess each system’s overall performance, noting instances solvable by all or none of these systems, and explore why some instances are uniquely solved by specific system types. We also compare fault localization accuracy at file and line levels and evaluate bug reproduction capabilities, identifying instances solvable only through dynamic reproduction. Through analysis, we concluded that further optimization is needed in both the LLM itself and the design of Agentic flow to improve the effectiveness of the Agent in bug fixing.

[AI-23] A logic for reasoning with inconsistent knowledge – A reformulation using nowadays terminology (2024)

链接: https://arxiv.org/abs/2411.10197
作者: Nico Roos
关键词-EN: logic, inconsistent knowledge, situations humans, set of premisses, inconsistent
类目: Artificial Intelligence (cs.AI)
*备注: The original version was published in the Artificial Intelligence journal. This original version uses ‘justifications’ in the proof system, which we would call nowadays ‘arguments’. The current version presents the same results but now using the terminology of an assumption-based argumentation system

点击查看摘要

Abstract:In many situations humans have to reason with inconsistent knowledge. These inconsistencies may occur due to not fully reliable sources of information. In order to reason with inconsistent knowledge, it is not possible to view a set of premisses as absolute truths as is done in predicate logic. Viewing the set of premisses as a set of assumptions, however, it is possible to deduce useful conclusions from an inconsistent set of premisses. In this paper a logic for reasoning with inconsistent knowledge is described. This logic is a generalization of the work of N. Rescher [15]. In the logic a reliability relation is used to choose between incompatible assumptions. These choices are only made when a contradiction is derived. As long as no contradiction is derived, the knowledge is assumed to be consistent. This makes it possible to define an argumentation-based deduction process for the logic. For the logic a semantics based on the ideas of Y. Shoham [22, 23] is defined. It turns out that the semantics for the logic is a preferential semantics according to the definition of S. Kraus, D. Lehmann and M. Magidor [12]. Therefore the logic is a logic of system P and possesses all the properties of an ideal non-monotonic logic.

[AI-24] FengWu-W2S: A deep learning model for seamless weather-to-subseasonal forecast of global atmosphere

链接: https://arxiv.org/abs/2411.10191
作者: Fenghua Ling,Kang Chen,Jiye Wu,Tao Han,Jing-Jia Luo,Wanli Ouyang,Lei Bai
关键词-EN: produces warning information, produces warning, warning information, information at continuum, long-standing pursuit
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注: 23 pages,8 figures

点击查看摘要

Abstract:Seamless forecasting that produces warning information at continuum timescales based on only one system is a long-standing pursuit for weather-climate service. While the rapid advancement of deep learning has induced revolutionary changes in classical forecasting field, current efforts are still focused on building separate AI models for weather and climate forecasts. To explore the seamless forecasting ability based on one AI model, we propose FengWu-Weather to Subseasonal (FengWu-W2S), which builds on the FengWu global weather forecast model and incorporates an ocean-atmosphere-land coupling structure along with a diverse perturbation strategy. FengWu-W2S can generate 6-hourly atmosphere forecasts extending up to 42 days through an autoregressive and seamless manner. Our hindcast results demonstrate that FengWu-W2S reliably predicts atmospheric conditions out to 3-6 weeks ahead, enhancing predictive capabilities for global surface air temperature, precipitation, geopotential height and intraseasonal signals such as the Madden-Julian Oscillation (MJO) and North Atlantic Oscillation (NAO). Moreover, our ablation experiments on forecast error growth from daily to seasonal timescales reveal potential pathways for developing AI-based integrated system for seamless weather-climate forecasting in the future.

[AI-25] Agentic LLMs in the Supply Chain: Towards Autonomous Multi-Agent Consensus-Seeking

链接: https://arxiv.org/abs/2411.10184
作者: Valeria Jannelli,Stefan Schoepf,Matthias Bickel,Torbjørn Netland,Alexandra Brintrup
关键词-EN: Large Language Models, Language Models, Large Language, delivery times require, explores how Large
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper explores how Large Language Models (LLMs) can automate consensus-seeking in supply chain management (SCM), where frequent decisions on problems such as inventory levels and delivery times require coordination among companies. Traditional SCM relies on human consensus in decision-making to avoid emergent problems like the bullwhip effect. Some routine consensus processes, especially those that are time-intensive and costly, can be automated. Existing solutions for automated coordination have faced challenges due to high entry barriers locking out SMEs, limited capabilities, and limited adaptability in complex scenarios. However, recent advances in Generative AI, particularly LLMs, show promise in overcoming these barriers. LLMs, trained on vast datasets can negotiate, reason, and plan, facilitating near-human-level consensus at scale with minimal entry barriers. In this work, we identify key limitations in existing approaches and propose autonomous LLM agents to address these gaps. We introduce a series of novel, supply chain-specific consensus-seeking frameworks tailored for LLM agents and validate the effectiveness of our approach through a case study in inventory management. To accelerate progress within the SCM community, we open-source our code, providing a foundation for further advancements in LLM-powered autonomous supply chain solutions.

[AI-26] Let people fail! Exploring the influence of explainable virtual and robotic agents in learning-by-doing tasks

链接: https://arxiv.org/abs/2411.10176
作者: Marco Matarese,Francesco Rea,Katharina J. Rohlfing,Alessandra Sciutti
关键词-EN: Collaborative decision-making, agents presents opportunities, opportunities and challenges, presents opportunities, Collaborative
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Collaborative decision-making with artificial intelligence (AI) agents presents opportunities and challenges. While human-AI performance often surpasses that of individuals, the impact of such technology on human behavior remains insufficiently understood, primarily when AI agents can provide justifiable explanations for their suggestions. This study compares the effects of classic vs. partner-aware explanations on human behavior and performance during a learning-by-doing task. Three participant groups were involved: one interacting with a computer, another with a humanoid robot, and a third one without assistance. Results indicated that partner-aware explanations influenced participants differently based on the type of artificial agents involved. With the computer, participants enhanced their task completion times. At the same time, those interacting with the humanoid robot were more inclined to follow its suggestions, although they did not reduce their timing. Interestingly, participants autonomously performing the learning-by-doing task demonstrated superior knowledge acquisition than those assisted by explainable AI (XAI). These findings raise profound questions and have significant implications for automated tutoring and human-AI collaboration.

[AI-27] The Surprising Ineffectiveness of Pre-Trained Visual Representations for Model-Based Reinforcement Learning NEURIPS2024

链接: https://arxiv.org/abs/2411.10175
作者: Moritz Schneider,Robert Krug,Narunas Vaskevicius,Luigi Palmieri,Joschka Boedecker
关键词-EN: require extensive amounts, Visual Reinforcement Learning, Visual Reinforcement, Reinforcement Learning, methods often require
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Published at the 38th Conference on Neural Information Processing Systems (NeurIPS 2024). Project page: this https URL

点击查看摘要

Abstract:Visual Reinforcement Learning (RL) methods often require extensive amounts of data. As opposed to model-free RL, model-based RL (MBRL) offers a potential solution with efficient data utilization through planning. Additionally, RL lacks generalization capabilities for real-world tasks. Prior work has shown that incorporating pre-trained visual representations (PVRs) enhances sample efficiency and generalization. While PVRs have been extensively studied in the context of model-free RL, their potential in MBRL remains largely unexplored. In this paper, we benchmark a set of PVRs on challenging control tasks in a model-based RL setting. We investigate the data efficiency, generalization capabilities, and the impact of different properties of PVRs on the performance of model-based agents. Our results, perhaps surprisingly, reveal that for MBRL current PVRs are not more sample efficient than learning representations from scratch, and that they do not generalize better to out-of-distribution (OOD) settings. To explain this, we analyze the quality of the trained dynamics model. Furthermore, we show that data diversity and network architecture are the most important contributors to OOD generalization performance.

[AI-28] A Hard-Label Cryptanalytic Extraction of Non-Fully Connected Deep Neural Networks using Side-Channel Attacks

链接: https://arxiv.org/abs/2411.10174
作者: Benoit Coqueret,Mathieu Carbone,Olivier Sentieys,Gabriel Zaid
关键词-EN: Deep Neural Networks, Neural Networks, Deep Neural, high fidelity, past decade
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:During the past decade, Deep Neural Networks (DNNs) proved their value on a large variety of subjects. However despite their high value and public accessibility, the protection of the intellectual property of DNNs is still an issue and an emerging research field. Recent works have successfully extracted fully-connected DNNs using cryptanalytic methods in hard-label settings, proving that it was possible to copy a DNN with high fidelity, i.e., high similitude in the output predictions. However, the current cryptanalytic attacks cannot target complex, i.e., not fully connected, DNNs and are limited to special cases of neurons present in deep networks. In this work, we introduce a new end-to-end attack framework designed for model extraction of embedded DNNs with high fidelity. We describe a new black-box side-channel attack which splits the DNN in several linear parts for which we can perform cryptanalytic extraction and retrieve the weights in hard-label settings. With this method, we are able to adapt cryptanalytic extraction, for the first time, to non-fully connected DNNs, while maintaining a high fidelity. We validate our contributions by targeting several architectures implemented on a microcontroller unit, including a Multi-Layer Perceptron (MLP) of 1.7 million parameters and a shortened MobileNetv1. Our framework successfully extracts all of these DNNs with high fidelity (88.4% for the MobileNetv1 and 93.2% for the MLP). Furthermore, we use the stolen model to generate adversarial examples and achieve close to white-box performance on the victim’s model (95.8% and 96.7% transfer rate).

[AI-29] Semantics and Spatiality of Emergent Communication NEURIPS2024

链接: https://arxiv.org/abs/2411.10173
作者: Rotem Ben Zion,Boaz Carmeli,Orr Paradise,Yonatan Belinkov
关键词-EN: develop opaque goal-oriented, perform collaborative tasks, opaque goal-oriented communication, artificial agents, agents are jointly
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注: 34 pages, to be published in NeurIPS 2024

点击查看摘要

Abstract:When artificial agents are jointly trained to perform collaborative tasks using a communication channel, they develop opaque goal-oriented communication protocols. Good task performance is often considered sufficient evidence that meaningful communication is taking place, but existing empirical results show that communication strategies induced by common objectives can be counterintuitive whilst solving the task nearly perfectly. In this work, we identify a goal-agnostic prerequisite to meaningful communication, which we term semantic consistency, based on the idea that messages should have similar meanings across instances. We provide a formal definition for this idea, and use it to compare the two most common objectives in the field of emergent communication: discrimination and reconstruction. We prove, under mild assumptions, that semantically inconsistent communication protocols can be optimal solutions to the discrimination task, but not to reconstruction. We further show that the reconstruction objective encourages a stricter property, spatial meaningfulness, which also accounts for the distance between messages. Experiments with emergent communication games validate our theoretical results. These findings demonstrate an inherent advantage of distance-based communication goals, and contextualize previous empirical discoveries.

[AI-30] Imagine-2-Drive: High-Fidelity World Modeling in CARLA for Autonomous Vehicles ICRA2025

链接: https://arxiv.org/abs/2411.10171
作者: Anant Garg,K Madhava Krishna
关键词-EN: modeling diverse behavioral, diverse behavioral modes, model-based Reinforcement Learning, based state space, effective decision-making
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: Submitted to ICRA 2025

点击查看摘要

Abstract:In autonomous driving with image based state space, accurate prediction of future events and modeling diverse behavioral modes are essential for safety and effective decision-making. World model-based Reinforcement Learning (WMRL) approaches offer a promising solution by simulating future states from current state and actions. However, the utility of world models is often limited by typical RL policies being restricted to deterministic or single-Gaussian distributions; failing to capture the full spectrum of possible actions reduces their adaptability in complex, dynamic environments. In this work, we introduce Imagine-2-Drive, a framework that consists of two components, VISTAPlan, a high-fidelity world model for accurate future prediction and Diffusion Policy Actor (DPA), a diffusion based policy to model multi-modal behaviors for trajectory prediction. We use VISTAPlan to simulate and evaluate trajectories from DPA and use Denoising Diffusion Policy Optimization (DDPO) to train DPA to maximize the cumulative sum of rewards over the trajectories. We analyze the benefits of each component and the framework as a whole in CARLA with standard driving metrics. As a consequence of our twin novelties, VISTAPlan and DPA, we significantly outperform the state of the art (SOTA) world models on standard driving metrics by 15% and 20% on Route Completion and Success Rate respectively.

[AI-31] Mitigating Sycophancy in Decoder-Only Transformer Architectures: Synthetic Data Intervention

链接: https://arxiv.org/abs/2411.10156
作者: Libo Wang
关键词-EN: decoder-only transformer architecture, large language models, sycophancy problem caused, synthetic data intervention, research applies synthetic
类目: Artificial Intelligence (cs.AI)
*备注: This research is also submitted to OpenReview. The main text is 9 pages (excluding citations), 7 figures, and 1 table

点击查看摘要

Abstract:To address the sycophancy problem caused by reinforcement learning from human feedback in large language models, this research applies synthetic data intervention technology to the decoder-only transformer architecture. Based on the research gaps in the existing literature, the researcher designed an experimental process to reduce the tendency of models to cater by generating diversified data, and used GPT4o as an experimental tool for verification. The experiment used 100 true and false questions, and compared the performance of the model trained with synthetic data intervention and the original untrained model on multiple indicators. The results show that the model trained with synthetic data intervention (SDI) performs better in terms of accuracy rate and sycophancy rate, and is significantly effective in reducing sycophancy phenomena. Notably, the data set, experimental process, code and data results have been uploaded to Github, the link is this https URL.

[AI-32] Causal Time-Series Synchronization for Multi-Dimensional Forecasting

链接: https://arxiv.org/abs/2411.10152
作者: Michael Mayr,Georgios C. Chasparis,Josef Küng
关键词-EN: Digital Twins require, Twins require modeling, require modeling approaches, Digital Twins, industry high expectations
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 14 pages

点击查看摘要

Abstract:The process industry’s high expectations for Digital Twins require modeling approaches that can generalize across tasks and diverse domains with potentially different data dimensions and distributional shifts, i.e., Foundational Models. Despite success in natural language processing and computer vision, transfer learning with (self-) supervised signals for pre-training general-purpose models is largely unexplored in the context of Digital Twins in the process industry due to challenges posed by multi-dimensional time-series data, lagged cause-effect dependencies, complex causal structures, and varying numbers of (exogenous) variables. We propose a novel channel-dependent pre-training strategy that leverages synchronized cause-effect pairs to overcome these challenges by breaking down the multi-dimensional time-series data into pairs of cause-effect variables. Our approach focuses on: (i) identifying highly lagged causal relationships using data-driven methods, (ii) synchronizing cause-effect pairs to generate training samples for channel-dependent pre-training, and (iii) evaluating the effectiveness of this approach in channel-dependent forecasting. Our experimental results demonstrate significant improvements in forecasting accuracy and generalization capability compared to traditional training methods.
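
The synchronization step can be illustrated with a simple lag estimate from cross-correlation followed by shifting the cause channel so that each cause value is paired with its lagged effect; the paper's causal discovery is more sophisticated, and `estimate_lag`/`synchronize` below are hypothetical helpers showing only the alignment idea.

```python
import numpy as np

def estimate_lag(cause, effect, max_lag=50):
    """Pick the lag that maximizes the absolute correlation between cause[t] and effect[t + lag]."""
    corrs = [np.corrcoef(cause[:len(cause) - lag] if lag else cause, effect[lag:])[0, 1]
             for lag in range(max_lag + 1)]
    return int(np.argmax(np.abs(corrs)))

def synchronize(cause, effect, lag):
    # drop the unmatched ends so cause[t] is paired with effect[t + lag]
    return cause[:len(cause) - lag], effect[lag:]

t = np.arange(2000)
cause = np.sin(0.02 * t) + 0.1 * np.random.randn(len(t))
effect = np.roll(cause, 30) + 0.1 * np.random.randn(len(t))   # effect lags cause by 30 steps (toy)

lag = estimate_lag(cause, effect)
c_sync, e_sync = synchronize(cause, effect, lag)
print(lag, c_sync.shape, e_sync.shape)                        # synchronized pair for pre-training
```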

[AI-33] Generative Agent Simulations of 1000 People

链接: https://arxiv.org/abs/2411.10109
作者: Joon Sung Park,Carolyn Q. Zou,Aaron Shaw,Benjamin Mako Hill,Carrie Cai,Meredith Ringel Morris,Robb Willer,Percy Liang,Michael S. Bernstein
关键词-EN: human behavioral simulation, enable broad applications, general-purpose computational agents, replicate human behavior, General Social Survey
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The promise of human behavioral simulation–general-purpose computational agents that replicate human behavior across domains–could enable broad applications in policymaking and social science. We present a novel agent architecture that simulates the attitudes and behaviors of 1,052 real individuals–applying large language models to qualitative interviews about their lives, then measuring how well these agents replicate the attitudes and behaviors of the individuals that they represent. The generative agents replicate participants’ responses on the General Social Survey 85% as accurately as participants replicate their own answers two weeks later, and perform comparably in predicting personality traits and outcomes in experimental replications. Our architecture reduces accuracy biases across racial and ideological groups compared to agents given demographic descriptions. This work provides a foundation for new tools that can help investigate individual and collective behavior.

[AI-34] Multi-Task Adversarial Variational Autoencoder for Estimating Biological Brain Age with Multimodal Neuroimaging

链接: https://arxiv.org/abs/2411.10100
作者: Muhammad Usman,Azka Rehman,Abdullah Shahid,Abd Ur Rehman,Sung-Min Gho,Aleum Lee,Tariq M.Khan,Imran Razzak
关键词-EN: incorporating functional MRI, functional connectivity measurements, Adversarial Variational Autoencoder, functional MRI data, structural MRI data
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Despite advances in deep learning for estimating brain age from structural MRI data, incorporating functional MRI data is challenging due to its complex structure and the noisy nature of functional connectivity measurements. To address this, we present the Multitask Adversarial Variational Autoencoder, a custom deep learning framework designed to improve brain age predictions through multimodal MRI data integration. This model separates latent variables into generic and unique codes, isolating shared and modality-specific features. By integrating multitask learning with sex classification as an additional task, the model captures sex-specific aging patterns. Evaluated on the OpenBHB dataset, a large multisite brain MRI collection, the model achieves a mean absolute error of 2.77 years, outperforming traditional methods. This success positions M-AVAE as a powerful tool for metaverse-based healthcare applications in brain age estimation.

[AI-35] AI and the Future of Work in Africa White Paper

链接: https://arxiv.org/abs/2411.10091
作者: Jacki O’Neill,Vukosi Marivate,Barbara Glover,Winnie Karanu,Girmaw Abebe Tadesse,Akua Gyekye,Anne Makena,Wesley Rosslyn-Smith,Matthew Grollnek,Charity Wayua,Rehema Baguma,Angel Maduke,Sarah Spencer,Daniel Kandie,Dennis Ndege Maari,Natasha Mutangana,Maxamed Axmed,Nyambura Kamau,Muhammad Adamu,Frank Swaniker,Brian Gatuguti,Jonathan Donner,Mark Graham,Janet Mumo,Caroline Mbindyo,Charlette N’Guessan,Irene Githinji,Lesego Makhafola,Sean Kruger,Olivia Etyang,Mulang Onando,Joe Sevilla,Nanjira Sambuli,Martin Mbaya,Paul Breloff,Gideon M. Anapey,Tebogo L. Mogaleemang,Tiyani Nghonyama,Muthoni Wanyoike,Bhekani Mbuli,Lawrence Nderu,Wambui Nyabero,Uzma Alam,Kayode Olaleye,Caroline Njenga,Abigail Sellen,David Kairo,Rutendo Chabikwa,Najeeb G. Abdulhamid,Ketry Kubasu,Chinasa T. Okolo,Eugenia Akpo,Joel Budu,Issa Karambal,Joseph Berkoh,William Wasswa,Muchai Njagwi,Rob Burnet,Loise Ochanda,Hanlie de Bod,Elizabeth Ankrah,Selemani Kinyunyu,Mutembei Kariuki,Angel Maduke,Kizito Kiyimba,Farida Eleshin,Lillian Secelela Madeje,Catherine Muraga,Ida Nganga,Judy Gichoya,Tabbz Maina,Samuel Maina,Muchai Mercy,Millicent Ochieng,Stephanie Nyairo
关键词-EN: including Microsoft Research, team including Microsoft, Microsoft Research, University of Oxford, workshop in Nairobi
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This white paper is the output of a multidisciplinary workshop in Nairobi (Nov 2023), led by a cross-organisational team including Microsoft Research, NEPAD, Lelapa AI, and the University of Oxford. The workshop brought together diverse thought-leaders from various sectors and backgrounds to discuss the implications of Generative AI for the future of work in Africa. Discussions centred around four key themes: Macroeconomic Impacts; Jobs, Skills and Labour Markets; Workers’ Perspectives; and Africa-Centric AI Platforms. The white paper provides an overview of the current state and trends of generative AI and its applications in different domains, as well as the challenges and risks associated with its adoption and regulation. It represents a diverse set of perspectives to create a set of insights and recommendations which aim to encourage debate and collaborative action towards creating a dignified future of work for everyone across Africa.

[AI-36] PFML: Self-Supervised Learning of Time-Series Data Without Representation Collapse

链接: https://arxiv.org/abs/2411.10087
作者: Einari Vaaras,Manu Airaksinen,Okko Räsänen
关键词-EN: data-driven learning approach, Self-supervised learning, SSL, data, SSL methods
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Self-supervised learning (SSL) is a data-driven learning approach that utilizes the innate structure of the data to guide the learning process. In contrast to supervised learning, which depends on external labels, SSL utilizes the inherent characteristics of the data to produce its own supervisory signal. However, one frequent issue with SSL methods is representation collapse, where the model outputs a constant input-invariant feature representation. This issue hinders the potential application of SSL methods to new data modalities, as trying to avoid representation collapse wastes researchers’ time and effort. This paper introduces a novel SSL algorithm for time-series data called Prediction of Functionals from Masked Latents (PFML). Instead of predicting masked input signals or their latent representations directly, PFML operates by predicting statistical functionals of the input signal corresponding to masked embeddings, given a sequence of unmasked embeddings. The algorithm is designed to avoid representation collapse, rendering it straightforwardly applicable to different time-series data domains, such as novel sensor modalities in clinical data. We demonstrate the effectiveness of PFML through complex, real-life classification tasks across three different data modalities: infant posture and movement classification from multi-sensor inertial measurement unit data, emotion recognition from speech data, and sleep stage classification from EEG data. The results show that PFML is superior to a conceptually similar pre-existing SSL method and competitive against the current state-of-the-art SSL method, while also being conceptually simpler and without suffering from representation collapse.
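
The target construction that distinguishes PFML from plain masked prediction can be sketched as follows: the regression targets at masked positions are statistical functionals of the masked frames (here mean, standard deviation and mean absolute slope; the paper's exact functional set may differ) rather than the frames themselves.

```python
import numpy as np

def functionals(frame):
    """Illustrative statistical functionals of one masked frame."""
    return np.array([frame.mean(), frame.std(), np.abs(np.diff(frame)).mean()])

def build_targets(signal, frame_len=20, mask_ratio=0.3, seed=0):
    rng = np.random.default_rng(seed)
    frames = signal[:len(signal) // frame_len * frame_len].reshape(-1, frame_len)
    masked = rng.random(len(frames)) < mask_ratio
    targets = np.stack([functionals(f) for f in frames[masked]])
    inputs = frames.copy()
    inputs[masked] = 0.0                      # masked positions fed to the encoder
    return inputs, masked, targets            # the model regresses `targets` at masked positions

signal = np.cumsum(np.random.randn(1000))     # toy 1-D time series (e.g. one sensor channel)
inputs, mask, targets = build_targets(signal)
print(inputs.shape, int(mask.sum()), targets.shape)
```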

[AI-37] Adapting the Biological SSVEP Response to Artificial Neural Networks

链接: https://arxiv.org/abs/2411.10084
作者: Emirhan Böge,Yasemin Gunindi,Erchan Aptoula,Nihan Alp,Huseyin Ozkan
关键词-EN: crucial for understanding, frequency tagging, Neuron, tagging, Neuron importance assessment
类目: Artificial Intelligence (cs.AI)
*备注: 10 pages, 5 figures

点击查看摘要

Abstract:Neuron importance assessment is crucial for understanding the inner workings of artificial neural networks (ANNs) and improving their interpretability and efficiency. This paper introduces a novel approach to neuron significance assessment inspired by frequency tagging, a technique from neuroscience. By applying sinusoidal contrast modulation to image inputs and analyzing resulting neuron activations, this method enables fine-grained analysis of a network’s decision-making processes. Experiments conducted with a convolutional neural network for image classification reveal notable harmonics and intermodulations in neuron-specific responses under part-based frequency tagging. These findings suggest that ANNs exhibit behavior akin to biological brains in tuning to flickering frequencies, thereby opening avenues for neuron/filter importance assessment through frequency tagging. The proposed method holds promise for applications in network pruning, and model interpretability, contributing to the advancement of explainable artificial intelligence and addressing the lack of transparency in neural networks. Future research directions include developing novel loss functions to encourage biologically plausible behavior in ANNs.
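
A toy version of frequency tagging applied to a single network unit is sketched below: the input's contrast is modulated sinusoidally across frames, the unit's activation is recorded per frame, and its spectrum is inspected at the tagging frequency. The single hand-made filter stands in for a real CNN's neurons, purely for illustration.

```python
import numpy as np

frames, tag_freq, fs = 256, 6.0, 64.0             # number of frames, tag frequency (Hz), frame rate
t = np.arange(frames) / fs
image = np.random.rand(32, 32)                     # static input image (toy)
unit_weights = np.abs(np.random.randn(32, 32))     # one positively weighted toy "neuron"

activations = []
for k in range(frames):
    contrast = 0.5 * (1 + np.sin(2 * np.pi * tag_freq * t[k]))   # sinusoidal contrast modulation
    modulated = contrast * image
    activations.append(np.maximum(0.0, np.sum(unit_weights * modulated)))  # ReLU-like response

spectrum = np.abs(np.fft.rfft(np.asarray(activations) - np.mean(activations)))
freqs = np.fft.rfftfreq(frames, d=1 / fs)
peak = freqs[np.argmax(spectrum)]
print(f"strongest response at {peak:.2f} Hz (tag at {tag_freq} Hz)")  # tagged units peak at the tag frequency
```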

[AI-38] Real-Time AI-Driven People Tracking and Counting Using Overhead Cameras

链接: https://arxiv.org/abs/2411.10072
作者: Ishrath Ahamed,Chamith Dilshan Ranathunga,Dinuka Sandun Udayantha,Benny Kai Kiat Ng,Chau Yuen
关键词-EN: intelligent transportation systems, Accurate people counting, safety protocols, Accurate people, energy management
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: This paper is accepted to IEEE Region 10 conference (TENCON) 2024

点击查看摘要

Abstract:Accurate people counting in smart buildings and intelligent transportation systems is crucial for energy management, safety protocols, and resource allocation. This is especially critical during emergencies, where precise occupant counts are vital for safe evacuation. Existing methods struggle with large crowds, often losing accuracy with even a few additional people. To address this limitation, this study proposes a novel approach combining a new object tracking algorithm, a novel counting algorithm, and a fine-tuned object detection model. This method achieves 97% accuracy in real-time people counting with a frame rate of 20-27 FPS on a low-power edge computer.

[AI-39] Evidential Federated Learning for Skin Lesion Image Classification ICPR2024

链接: https://arxiv.org/abs/2411.10071
作者: Rutger Hendrix,Federica Proietto Salanitri,Concetto Spampinato,Simone Palazzo,Ulas Bagci
关键词-EN: skin lesion classification, pre-trained Vision Transformer, distributed skin lesion, lesion classification, Vision Transformer
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Published as a conference paper at ICPR 2024

点击查看摘要

Abstract:We introduce FedEvPrompt, a federated learning approach that integrates principles of evidential deep learning, prompt tuning, and knowledge distillation for distributed skin lesion classification. FedEvPrompt leverages two sets of prompts: b-prompts (for low-level basic visual knowledge) and t-prompts (for task-specific knowledge) prepended to frozen pre-trained Vision Transformer (ViT) models trained in an evidential learning framework to maximize class evidences. Crucially, knowledge sharing across federation clients is achieved only through knowledge distillation on attention maps generated by the local ViT models, ensuring enhanced privacy preservation compared to traditional parameter or synthetic image sharing methodologies. FedEvPrompt is optimized within a round-based learning paradigm, where each round involves training local models followed by attention maps sharing with all federation clients. Experimental validation conducted in a real distributed setting, on the ISIC2019 dataset, demonstrates the superior performance of FedEvPrompt against baseline federated learning algorithms and knowledge distillation methods, without sharing model parameters. In conclusion, FedEvPrompt offers a promising approach for federated learning, effectively addressing challenges such as data heterogeneity, imbalance, privacy preservation, and knowledge sharing.

[AI-40] Federated Domain Generalization via Prompt Learning and Aggregation

链接: https://arxiv.org/abs/2411.10063
作者: Shuai Gong,Chaoran Cui,Chunyun Zhang,Wenna Wang,Xiushan Nie,Lei Zhu
关键词-EN: addressing data heterogeneity, global prompts, prompts, aims to improve, privacy-preserving constraints
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:Federated domain generalization (FedDG) aims to improve the global model generalization in unseen domains by addressing data heterogeneity under privacy-preserving constraints. A common strategy in existing FedDG studies involves sharing domain-specific knowledge among clients, such as spectrum information, class prototypes, and data styles. However, this knowledge is extracted directly from local client samples, and sharing such sensitive information poses a potential risk of data leakage, which might not fully meet the requirements of FedDG. In this paper, we introduce prompt learning to adapt pre-trained vision-language models (VLMs) in the FedDG scenario, and leverage locally learned prompts as a more secure bridge to facilitate knowledge transfer among clients. Specifically, we propose a novel FedDG framework through Prompt Learning and AggregatioN (PLAN), which comprises two training stages to collaboratively generate local prompts and global prompts at each federated round. First, each client performs both text and visual prompt learning using their own data, with local prompts indirectly synchronized by regarding the global prompts as a common reference. Second, all domain-specific local prompts are exchanged among clients and selectively aggregated into the global prompts using lightweight attention-based aggregators. The global prompts are finally applied to adapt VLMs to unseen target domains. As our PLAN framework requires training only a limited number of prompts and lightweight aggregators, it offers notable advantages in computational and communication efficiency for FedDG. Extensive experiments demonstrate the superior generalization ability of PLAN across four benchmark datasets.
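
The lightweight attention-based aggregation of local prompts into global prompts might look roughly like the sketch below, where client prompts are combined with softmax weights derived from a learnable query; the actual aggregator and prompt shapes used by PLAN may differ.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def aggregate_prompts(local_prompts, query):
    """local_prompts: (n_clients, prompt_len, dim); query: (dim,) learnable aggregator parameter."""
    keys = local_prompts.mean(axis=1)                       # one key per client prompt
    weights = softmax(keys @ query / np.sqrt(len(query)))   # attention weights over clients
    global_prompt = np.tensordot(weights, local_prompts, axes=(0, 0))
    return global_prompt, weights

n_clients, prompt_len, dim = 4, 8, 512
local_prompts = np.random.randn(n_clients, prompt_len, dim)   # learned on each client's domain
query = np.random.randn(dim)
global_prompt, w = aggregate_prompts(local_prompts, query)
print(global_prompt.shape, np.round(w, 3))                     # global prompt then adapts the VLM
```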

[AI-41] KuaiFormer: Transformer-Based Retrieval at Kuaishou

链接: https://arxiv.org/abs/2411.10057
作者: Chi Liu,Jiangxia Cao,Rui Huang,Kai Zheng,Qiang Luo,Kun Gai,Guorui Zhou
关键词-EN: Deep Neural Network, large-scale content recommendation, responsible for selecting, ranking modules, initial stage
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In large-scale content recommendation systems, retrieval serves as the initial stage in the pipeline, responsible for selecting thousands of candidate items from billions of options to pass on to ranking modules. Traditionally, the dominant retrieval method has been Embedding-Based Retrieval (EBR) using a Deep Neural Network (DNN) dual-tower structure. However, applying transformer in retrieval tasks has been the focus of recent research, though real-world industrial deployment still presents significant challenges. In this paper, we introduce KuaiFormer, a novel transformer-based retrieval framework deployed in a large-scale content recommendation system. KuaiFormer fundamentally redefines the retrieval process by shifting from conventional score estimation tasks (such as click-through rate estimate) to a transformer-driven Next Action Prediction paradigm. This shift enables more effective real-time interest acquisition and multi-interest extraction, significantly enhancing retrieval performance. KuaiFormer has been successfully integrated into Kuaishou App’s short-video recommendation system since May 2024, serving over 400 million daily active users and resulting in a marked increase in average daily usage time of Kuaishou users. We provide insights into both the technical and business aspects of deploying transformer in large-scale recommendation systems, addressing practical challenges encountered during industrial implementation. Our findings offer valuable guidance for engineers and researchers aiming to leverage transformer models to optimize large-scale content recommendation systems.

[AI-42] That Chip Has Sailed: A Critique of Unfounded Skepticism Around AI for Chip Design

链接: https://arxiv.org/abs/2411.10053
作者: Anna Goldie,Azalia Mirhoseini,Jeff Dean
关键词-EN: superhuman chip layouts, generating superhuman chip, deep reinforcement learning, open-sourced on GitHub, introduced a deep
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In 2020, we introduced a deep reinforcement learning method capable of generating superhuman chip layouts, which we then published in Nature and open-sourced on GitHub. AlphaChip has inspired an explosion of work on AI for chip design, and has been deployed in state-of-the-art chips across Alphabet and extended by external chipmakers. Even so, a non-peer-reviewed invited paper at ISPD 2023 questioned its performance claims, despite failing to run our method as described in Nature. For example, it did not pre-train the RL method (removing its ability to learn from prior experience), used substantially fewer compute resources (20x fewer RL experience collectors and half as many GPUs), did not train to convergence (standard practice in machine learning), and evaluated on test cases that are not representative of modern chips. Recently, Igor Markov published a meta-analysis of three papers: our peer-reviewed Nature paper, the non-peer-reviewed ISPD paper, and Markov’s own unpublished paper (though he does not disclose that he co-authored it). Although AlphaChip has already achieved widespread adoption and impact, we publish this response to ensure that no one is wrongly discouraged from innovating in this impactful area.

[AI-43] Jal Anveshak: Prediction of fishing zones using fine-tuned LlaMa 2

链接: https://arxiv.org/abs/2411.10050
作者: Arnav Mejari,Maitreya Vaghulade,Paarshva Chitaliya,Arya Telang,Lynette D’mello
关键词-EN: witnessed significant advancements, Indian government efforts, collecting data related, recent years, significant advancements
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In recent years, the global and Indian government efforts in monitoring and collecting data related to the fisheries industry have witnessed significant advancements. Despite this wealth of data, there exists an untapped potential for leveraging artificial intelligence based technological systems to benefit Indian fishermen in coastal areas. To fill this void in the Indian technology ecosystem, the authors introduce Jal Anveshak. This is an application framework written in Dart and Flutter that uses a Llama 2 based Large Language Model fine-tuned on pre-processed and augmented government data related to fishing yield and availability. Its main purpose is to help Indian fishermen safely get the maximum yield of fish from coastal areas and to resolve their fishing related queries in multilingual and multimodal ways.

[AI-44] Physics-informed neural networks need a physicist to be accurate: the case of mass and heat transport in Fischer-Tropsch catalyst particles

链接: https://arxiv.org/abs/2411.10048
作者: Tymofii Nikolaienko,Harshil Patel,Aniruddha Panda,Subodh Madhav Joshi,Stanislav Jaso,Kaushic Kalyanaraman
关键词-EN: Physics-Informed Neural Networks, influential technology, merging the swift, theoretical physics, swift and automated
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Chemical Physics (physics.chem-ph); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:Physics-Informed Neural Networks (PINNs) have emerged as an influential technology, merging the swift and automated capabilities of machine learning with the precision and dependability of simulations grounded in theoretical physics. PINNs are often employed to solve algebraic or differential equations to replace some or even all steps of multi-stage computational workflows, leading to their significant speed-up. However, wide adoption of PINNs is still hindered by reliability issues, particularly at extreme ends of the input parameter ranges. In this study, we demonstrate this in the context of a system of coupled non-linear differential reaction-diffusion and heat transfer equations related to Fischer-Tropsch synthesis, which are solved by a finite-difference method with a PINN used in evaluating their source terms. It is shown that the testing strategies traditionally used to assess the accuracy of neural networks as function approximators can overlook the peculiarities which ultimately cause instabilities of the finite-difference solver. We propose a domain knowledge-based modifications to the PINN architecture ensuring its correct asymptotic behavior. When combined with an improved numerical scheme employed as an initial guess generator, the proposed modifications are shown to recover the overall stability of the simulations, while preserving the speed-up brought by PINN as the workflow component. We discuss the possible applications of the proposed hybrid transport equation solver in context of chemical reactors simulations.

[AI-45] Rethinking Normalization Strategies and Convolutional Kernels for Multimodal Image Fusion

链接: https://arxiv.org/abs/2411.10036
作者: Dan He,Guofen Wang,Weisheng Li,Yucheng Shu,Wenbo Li,Lijian Yang,Yuping Huang,Feiyan Li
关键词-EN: Multimodal image fusion, Multimodal image, aims to integrate, image fusion, modalities to obtain
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Multimodal image fusion (MMIF) aims to integrate information from different modalities to obtain a comprehensive image, aiding downstream tasks. However, existing methods tend to prioritize natural image fusion and focus on information complementary and network training strategies. They ignore the essential distinction between natural and medical image fusion and the influence of underlying components. This paper dissects the significant differences between the two tasks regarding fusion goals, statistical properties, and data distribution. Based on this, we rethink the suitability of the normalization strategy and convolutional kernels for end-to-end fusion. Specifically, this paper proposes a mixture of instance normalization and group normalization to preserve sample independence and reinforce intrinsic feature representation. This strategy promotes the potential of enriching feature maps, thus boosting fusion performance. To this end, we further introduce the large kernel convolution, effectively expanding receptive fields and enhancing the preservation of image detail. Moreover, the proposed multipath adaptive fusion module recalibrates the decoder input with features of various scales and receptive fields, ensuring the transmission of crucial information. Extensive experiments demonstrate that our method exhibits state-of-the-art performance in multiple fusion tasks and significantly improves downstream applications. The code is available at this https URL.
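A minimal sketch of the normalization-mixture idea, assuming a simple learnable blend of instance-normalized and group-normalized outputs (the paper's exact combination is not specified in the abstract):

```python
# Illustrative only: blend instance normalization and group normalization with a
# learnable weight. The paper's actual mixing scheme may differ.
import torch
import torch.nn as nn

class MixedINGN(nn.Module):
    def __init__(self, channels: int, groups: int = 8):
        super().__init__()
        self.inorm = nn.InstanceNorm2d(channels, affine=True)
        self.gnorm = nn.GroupNorm(groups, channels)
        self.alpha = nn.Parameter(torch.tensor(0.5))  # learnable mixing weight

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a = torch.sigmoid(self.alpha)
        return a * self.inorm(x) + (1 - a) * self.gnorm(x)

x = torch.randn(2, 32, 64, 64)
print(MixedINGN(32)(x).shape)  # torch.Size([2, 32, 64, 64])
```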

[AI-46] VMID: A Multimodal Fusion LLM Framework for Detecting and Identifying Misinformation of Short Videos

链接: https://arxiv.org/abs/2411.10032
作者: Weihao Zhong,Yinhao Xiao,Minghui Xu,Xiuzhen Cheng
关键词-EN: access current events, Short video platforms, offering a highly, important channels, highly engaging
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: arXiv admin note: text overlap with arXiv:2211.10973 by other authors

点击查看摘要

Abstract:Short video platforms have become important channels for news dissemination, offering a highly engaging and immediate way for users to access current events and share information. However, these platforms have also emerged as significant conduits for the rapid spread of misinformation, as fake news and rumors can leverage the visual appeal and wide reach of short videos to circulate extensively among audiences. Existing fake news detection methods mainly rely on single-modal information, such as text or images, or apply only basic fusion techniques, limiting their ability to handle the complex, multi-layered information inherent in short videos. To address these limitations, this paper presents a novel fake news detection method based on multimodal information, designed to identify misinformation through a multi-level analysis of video content. This approach effectively utilizes different modal representations to generate a unified textual description, which is then fed into a large language model for comprehensive evaluation. The proposed framework successfully integrates multimodal features within videos, significantly enhancing the accuracy and reliability of fake news detection. Experimental results demonstrate that the proposed approach outperforms existing models in terms of accuracy, robustness, and utilization of multimodal information, achieving an accuracy of 90.93%, which is significantly higher than the best baseline model (SV-FEND) at 81.05%. Furthermore, case studies provide additional evidence of the effectiveness of the approach in accurately distinguishing between fake news, debunking content, and real incidents, highlighting its reliability and robustness in real-world applications.

[AI-47] MOT_FCG: Enhanced Representation of Motion and Appearance Features

链接: https://arxiv.org/abs/2411.10028
作者: Yanzhao Fang
关键词-EN: spatial motion features, multi-object tracking, goal of multi-object, detect and track, maintaining a unique
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 12 pages, 7 figures

点击查看摘要

Abstract:The goal of multi-object tracking (MOT) is to detect and track all objects in a scene across frames, while maintaining a unique identity for each object. Most existing methods rely on the spatial motion features and appearance embedding features of the detected objects in consecutive frames. Effectively and robustly representing the spatial and appearance features of long trajectories has become a critical factor affecting the performance of MOT. We propose a novel approach for appearance and spatial feature representation, improving upon the clustering association method MOT_FCG. For spatial motion features, we propose Diagonal Modulated GIoU, which more accurately represents the relationship between the position and shape of the objects. For appearance features, we utilize a dynamic appearance representation that incorporates confidence information, enabling the trajectory appearance features to be more robust and global. Based on the baseline model MOT_FCG, we achieved 76.1 HOTA, 80.4 MOTA and 81.3 IDF1 on the MOT17 validation set, and also achieved competitive performance on the MOT20 and DanceTrack validation sets.
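The abstract does not define the Diagonal Modulated GIoU; for reference, the standard GIoU it builds on can be computed as in the sketch below (the paper's modulation term is not shown):

```python
# Standard GIoU between axis-aligned boxes (x1, y1, x2, y2); the "Diagonal
# Modulated" variant builds on a quantity like this, but its exact form is not
# given in the abstract.
import numpy as np

def giou(box_a: np.ndarray, box_b: np.ndarray) -> float:
    # Intersection
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    # Union
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    iou = inter / union
    # Smallest enclosing box
    ex1, ey1 = min(box_a[0], box_b[0]), min(box_a[1], box_b[1])
    ex2, ey2 = max(box_a[2], box_b[2]), max(box_a[3], box_b[3])
    enclose = (ex2 - ex1) * (ey2 - ey1)
    return iou - (enclose - union) / enclose

print(giou(np.array([0, 0, 10, 10]), np.array([5, 5, 15, 15])))
```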

[AI-48] MicroCrackAttentionNeXt: Advancing Microcrack Detection in Wave Field Analysis Using Deep Neural Networks through Feature Visualization

链接: https://arxiv.org/abs/2411.10015
作者: Fatahlla Moreh(Christian Albrechts University, Kiel, Germany),Yusuf Hasan(Aligarh Muslim University, Aligarh, India),Bilal Zahid Hussain(Texas A&M University, College Station, USA),Mohammad Ammar(Aligarh Muslim University, Aligarh, India),Sven Tomforde(Christian Albrechts University, Kiel, Germany)
关键词-EN: wave fields interacting, Micro Crack detection, Micro Crack, automated pipeline, pipeline using wave
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Micro crack detection using deep neural networks (DNNs) through an automated pipeline, using wave fields interacting with the damaged areas, is highly sought after. These high-dimensional spatio-temporal crack data are limited, and these datasets have large dimensions in the temporal domain. The dataset presents a substantial class imbalance, with crack pixels constituting an average of only 5% of the total pixels per sample. This extreme class imbalance poses a challenge for deep learning models with the different micro-scale cracks, as the network can be biased toward predicting the majority class, generally leading to poor detection accuracy. This study builds upon the previous benchmark SpAsE-Net, an asymmetric encoder-decoder network for micro-crack detection. The impact of various activation and loss functions was examined through feature space visualization using the manifold discovery and analysis (MDA) algorithm. The optimized architecture and training methodology achieved an accuracy of 86.85%.
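For context, a common way to handle such an extreme (~5% positive) class imbalance in segmentation is to combine a positively weighted BCE term with a Dice term; the sketch below is a generic example, not necessarily the loss adopted in the paper:

```python
# Generic imbalance-aware segmentation loss: positively weighted BCE plus Dice.
# This illustrates the general remedy, not the loss chosen in the paper.
import torch
import torch.nn.functional as F

def imbalance_loss(logits: torch.Tensor, target: torch.Tensor, pos_weight: float = 19.0) -> torch.Tensor:
    # pos_weight ~ (negative pixels / positive pixels), roughly 95/5 = 19 here
    bce = F.binary_cross_entropy_with_logits(
        logits, target, pos_weight=torch.tensor(pos_weight, device=logits.device))
    probs = torch.sigmoid(logits)
    inter = (probs * target).sum()
    dice = 1 - (2 * inter + 1.0) / (probs.sum() + target.sum() + 1.0)
    return bce + dice

logits = torch.randn(1, 1, 64, 64)
target = (torch.rand(1, 1, 64, 64) < 0.05).float()   # ~5% positive pixels
print(imbalance_loss(logits, target))
```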

[AI-49] DeepMedcast: A Deep Learning Method for Generating Intermediate Weather Forecasts among Multiple NWP Models

链接: https://arxiv.org/abs/2411.10010
作者: Atsushi Kudo
关键词-EN: Numerical weather prediction, AI-driven NWP models, Numerical weather, NWP models, diverse NWP outputs
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 12 pages, 8 figures

点击查看摘要

Abstract:Numerical weather prediction (NWP) centers around the world operate a variety of NWP models, and recent advances in AI-driven NWP models have increased the availability of diverse NWP outputs. While this expansion holds the potential to improve forecast accuracy, it also raises a critical challenge of identifying the most reliable predictions for specific forecast scenarios. Traditional approaches, such as ensemble or weighted averaging, combine multiple NWP outputs but often generate unrealistic atmospheric fields, complicating the production of reliable and consistent forecasts in operational settings. In this study, we introduce DeepMedcast, a deep learning method that generates intermediate forecast, or “medcast”, between two or more NWP outputs. Unlike ensemble averaging, DeepMedcast can provide consistent and explainable medcast without distorting meteorological fields. This paper details the methodology and case studies of DeepMedcast, discussing its advantages and potential contributions to operational forecasting.

[AI-50] Graph-based Complexity for Causal Effect by Empirical Plug-in

链接: https://arxiv.org/abs/2411.10008
作者: Rina Dechter,Annie Raichev,Alexander Ihler,Jin Tian
关键词-EN: causal effect queries, effect queries, paper focuses, causal effect, computing empirical plug-in
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper focuses on the computational complexity of computing empirical plug-in estimates for causal effect queries. Given a causal graph and observational data, any identifiable causal query can be estimated from an expression over the observed variables, called the estimand. The estimand can then be evaluated by plugging in probabilities computed empirically from data. In contrast to conventional wisdom, which assumes that high-dimensional probabilistic functions will lead to exponential evaluation time of the estimand, we show that computation can be done efficiently, potentially in time linear in the data size, depending on the estimand’s hypergraph. In particular, we show that both the treewidth and hypertree width of the estimand’s structure bound the evaluation complexity of the plug-in estimands, analogous to their role in the complexity of probabilistic inference in graphical models. Often, the hypertree width provides a more effective bound, since the empirical distributions are sparse.
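As a hypothetical illustration of an empirical plug-in estimate, the snippet below evaluates the back-door estimand P(Y=1 | do(X=1)) = Σ_z P(Y=1 | X=1, Z=z) P(Z=z) directly from observed-data frequencies (toy data, not from the paper):

```python
# Hypothetical empirical plug-in estimate of a back-door adjusted causal effect,
# P(Y=1 | do(X=1)) = sum_z P(Y=1 | X=1, Z=z) * P(Z=z), using toy observational data.
import pandas as pd

df = pd.DataFrame({
    "Z": [0, 0, 0, 1, 1, 1, 0, 1, 1, 0],
    "X": [0, 1, 1, 0, 1, 1, 0, 0, 1, 1],
    "Y": [0, 1, 0, 1, 1, 1, 0, 0, 1, 1],
})

effect = 0.0
for z, p_z in df["Z"].value_counts(normalize=True).items():
    subset = df[(df["X"] == 1) & (df["Z"] == z)]
    if len(subset) > 0:                      # empirical conditional P(Y=1 | X=1, Z=z)
        effect += p_z * (subset["Y"] == 1).mean()
print(f"plug-in estimate of P(Y=1 | do(X=1)) = {effect:.3f}")
```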

[AI-51] DuSEGO: Dual Second-order Equivariant Graph Ordinary Differential Equation

链接: https://arxiv.org/abs/2411.10000
作者: Yingxu Wang,Nan Yin,Mingyan Xiao,Xinhao Yi,Siwei Liu,Shangsong Liang
关键词-EN: Graph Neural Networks, Neural Networks, achieved significant success, modeling complex dynamic, complex dynamic systems
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) with equivariant properties have achieved significant success in modeling complex dynamic systems and molecular properties. However, their expressiveness ability is limited by: (1) Existing methods often overlook the over-smoothing issue caused by traditional GNN models, as well as the gradient explosion or vanishing problems in deep GNNs. (2) Most models operate on first-order information, neglecting that the real world often consists of second-order systems, which further limits the model’s representation capabilities. To address these issues, we propose the Dual Second-order Equivariant Graph Ordinary Differential Equation (DuSEGO) for equivariant representation. Specifically, DuSEGO applies dual second-order equivariant graph ordinary differential equations (Graph ODEs) on graph embeddings and node coordinates simultaneously. Theoretically, we first prove that DuSEGO maintains the equivariant property. Furthermore, we provide theoretical insights showing that DuSEGO effectively alleviates the over-smoothing problem in both feature representation and coordinate update. Additionally, we demonstrate that the proposed DuSEGO mitigates the exploding and vanishing gradients problem, facilitating the training of deep multi-layer GNNs. Extensive experiments on benchmark datasets validate the superiority of the proposed DuSEGO compared to baselines.

[AI-52] Unlocking Transfer Learning for Open-World Few-Shot Recognition

链接: https://arxiv.org/abs/2411.09986
作者: Byeonggeun Kim,Juntae Lee,Kyuhong Shim,Simyung Chang
关键词-EN: termed closed-set classes, Few-Shot Open-Set Recognition, critical real-world challenge, identifying open-set inputs, closed-set classes
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Few-Shot Open-Set Recognition (FSOSR) targets a critical real-world challenge, aiming to categorize inputs into known categories, termed closed-set classes, while identifying open-set inputs that fall outside these classes. Although transfer learning where a model is tuned to a given few-shot task has become a prominent paradigm in closed-world, we observe that it fails to expand to open-world. To unlock this challenge, we propose a two-stage method which consists of open-set aware meta-learning with open-set free transfer learning. In the open-set aware meta-learning stage, a model is trained to establish a metric space that serves as a beneficial starting point for the subsequent stage. During the open-set free transfer learning stage, the model is further adapted to a specific target task through transfer learning. Additionally, we introduce a strategy to simulate open-set examples by modifying the training dataset or generating pseudo open-set examples. The proposed method achieves state-of-the-art performance on two widely recognized benchmarks, miniImageNet and tieredImageNet, with only a 1.5% increase in training effort. Our work demonstrates the effectiveness of transfer learning in FSOSR.

[AI-53] Steering AI-Driven Personalization of Scientific Text for General Audiences

链接: https://arxiv.org/abs/2411.09969
作者: Taewook Kim,Dhruv Agarwal,Jordan Ackerman,Manaswi Saha
关键词-EN: Digital media platforms, Digital media, media platforms, social media, offer opportunities
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: 23 pages, 5 figures, 1 table

点击查看摘要

Abstract:Digital media platforms (e.g., social media, science blogs) offer opportunities to communicate scientific content to general audiences at scale. However, these audiences vary in their scientific expertise, literacy levels, and personal backgrounds, making effective science communication challenging. To address this challenge, we designed TranSlider, an AI-powered tool that generates personalized translations of scientific text based on individual user profiles (e.g., hobbies, location, and education). Our tool features an interactive slider that allows users to steer the degree of personalization from 0 (weakly relatable) to 100 (strongly relatable), leveraging LLMs to generate the translations with given degrees. Through an exploratory study with 15 participants, we investigated both the utility of these AI-personalized translations and how interactive reading features influenced users’ understanding and reading experiences. We found that participants who preferred higher degrees of personalization appreciated the relatable and contextual translations, while those who preferred lower degrees valued concise translations with subtle contextualization. Furthermore, participants reported the compounding effect of multiple translations on their understanding of scientific content. Given these findings, we discuss several implications of AI-personalized translation tools in facilitating communication in collaborative contexts.

[AI-54] Seeing Clearly by Layer Two: Enhancing Attention Heads to Alleviate Hallucination in LVLMs

链接: https://arxiv.org/abs/2411.09968
作者: Xiaofeng Zhang,Yihao Quan,Chaochen Gu,Chen Shen,Xiaosong Yuan,Shaotian Yan,Hao Cheng,Kaijie Wu,Jieping Ye
关键词-EN: multimodal large language, large language models, image tokens, attention sinks, attention
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The hallucination problem in multimodal large language models (MLLMs) remains a common issue. Although image tokens occupy a majority of the input sequence of MLLMs, there is limited research to explore the relationship between image tokens and hallucinations. In this paper, we analyze the distribution of attention scores for image tokens across each layer and head of the model, revealing an intriguing and common phenomenon: most hallucinations are closely linked to the pattern of attention sinks in the self-attention matrix of image tokens, where shallow layers exhibit dense attention sinks and deeper layers show sparse attention sinks. We further analyze the attention heads of different layers and find that heads with high-density attention sink in the image part play a positive role in alleviating hallucinations. In this paper, we propose a training-free method named Enhancing Attention Heads (EAH), an approach designed to enhance the convergence of image tokens attention sinks in the shallow layers. EAH identifies the attention head that shows the vision sink in a shallow layer and extracts its attention matrix. This attention map is then broadcast to other heads in the layer, thereby strengthening the layer to pay more attention to the image itself. With extensive experiments, EAH shows significant hallucination-mitigating performance on different MLLMs and metrics, proving its effectiveness and generality.
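A simplified sketch of the broadcasting idea (how head density is scored and which layers are selected follows the paper, not this sketch): pick the head whose attention mass on image-token positions is largest and copy its map to the other heads of that layer:

```python
# Simplified illustration of broadcasting a "vision sink" head's attention map to
# the other heads of a shallow layer. Head/layer selection details are the paper's.
import torch

def enhance_attention_heads(attn: torch.Tensor, image_slice: slice) -> torch.Tensor:
    # attn: (num_heads, seq_len, seq_len) attention weights of one shallow layer
    image_mass = attn[:, :, image_slice].sum(dim=(1, 2))   # mass each head puts on image tokens
    sink_head = image_mass.argmax()
    return attn[sink_head].unsqueeze(0).expand_as(attn).clone()

attn = torch.softmax(torch.randn(8, 128, 128), dim=-1)
enhanced = enhance_attention_heads(attn, image_slice=slice(0, 64))
print(enhanced.shape)  # torch.Size([8, 128, 128])
```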

[AI-55] Instruction-Guided Editing Controls for Images and Multimedia: A Survey in LLM era

链接: https://arxiv.org/abs/2411.09955
作者: Thanh Tam Nguyen,Zhao Ren,Trinh Pham,Phi Le Nguyen,Hongzhi Yin,Quoc Viet Hung Nguyen
关键词-EN: digital content creation, transformed digital content, rapid advancement, advancement of large, learning has transformed
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:The rapid advancement of large language models (LLMs) and multimodal learning has transformed digital content creation and manipulation. Traditional visual editing tools require significant expertise, limiting accessibility. Recent strides in instruction-based editing have enabled intuitive interaction with visual content, using natural language as a bridge between user intent and complex editing operations. This survey provides an overview of these techniques, focusing on how LLMs and multimodal models empower users to achieve precise visual modifications without deep technical knowledge. By synthesizing over 100 publications, we explore methods from generative adversarial networks to diffusion models, examining multimodal integration for fine-grained content control. We discuss practical applications across domains such as fashion, 3D scene manipulation, and video synthesis, highlighting increased accessibility and alignment with human intuition. Our survey compares existing literature, emphasizing LLM-empowered editing, and identifies key challenges to stimulate further research. We aim to democratize powerful visual editing across various industries, from entertainment to education. Interested readers are encouraged to access our repository at this https URL.

[AI-56] GGAvatar: Reconstructing Garment-Separated 3D Gaussian Splatting Avatars from Monocular Video

链接: https://arxiv.org/abs/2411.09952
作者: Jingxuan Chen
关键词-EN: Gaussian Splatting Avatar, virtual try-ons, animation and virtual, Splatting Avatar, Gaussian Splatting
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
*备注: MMAsia’24 Accepted

点击查看摘要

Abstract:Avatar modelling has broad applications in human animation and virtual try-ons. Recent advancements in this field have focused on high-quality and comprehensive human reconstruction but often overlook the separation of clothing from the body. To bridge this gap, this paper introduces GGAvatar (Garment-separated 3D Gaussian Splatting Avatar), which relies on monocular videos. Through advanced parameterized templates and unique phased training, this model effectively achieves decoupled, editable, and realistic reconstruction of clothed humans. Comparative evaluations with other costly models confirm GGAvatar’s superior quality and efficiency in modelling both clothed humans and separable garments. The paper also showcases applications in clothing editing, as illustrated in Figure 1, highlighting the model’s benefits and the advantages of effective disentanglement. The code is available at this https URL.

[AI-57] TEESlice: Protecting Sensitive Neural Network Models in Trusted Execution Environments When Attackers have Pre-Trained Models

链接: https://arxiv.org/abs/2411.09945
作者: Ding Li,Ziqi Zhang,Mengyu Yao,Yifeng Cai,Yao Guo,Xiangqun Chen
关键词-EN: Trusted Execution Environments, Trusted Execution, Execution Environments, safeguard on-device models, safeguard on-device
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted by TOSEM. Extended version of the SP24 paper ( arXiv:2310.07152 )

点击查看摘要

Abstract:Trusted Execution Environments (TEE) are used to safeguard on-device models. However, directly employing TEEs to secure the entire DNN model is challenging due to the limited computational speed. Utilizing GPU can accelerate DNN’s computation speed but commercial widely-available GPUs usually lack security protection. To this end, scholars introduce TSDP, a method that protects privacy-sensitive weights within TEEs and offloads insensitive weights to GPUs. Nevertheless, current methods do not consider the presence of a knowledgeable adversary who can access abundant publicly available pre-trained models and datasets. This paper investigates the security of existing methods against such a knowledgeable adversary and reveals their inability to fulfill their security promises. Consequently, we introduce a novel partition before training strategy, which effectively separates privacy-sensitive weights from other components of the model. Our evaluation demonstrates that our approach can offer full model protection with a computational cost reduced by a factor of 10. In addition to traditional CNN models, we also demonstrate the scalability to large language models. Our approach can compress the private functionalities of the large language model to lightweight slices and achieve the same level of protection as the shielding-whole-model baseline.

[AI-58] Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level

链接: https://arxiv.org/abs/2411.09921
作者: Andong Deng,Tongjia Chen,Shoubin Yu,Taojiannan Yang,Lincoln Spencer,Yapeng Tian,Ajmal Saeed Mian,Mohit Bansal,Chen Chen
关键词-EN: Video Reasoning, Reasoning, generating visual answers, Video, Motion-Grounded Video Reasoning
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In this paper, we introduce Motion-Grounded Video Reasoning, a new motion understanding task that requires generating visual answers (video segmentation masks) according to the input question, and hence needs implicit spatiotemporal reasoning and grounding. This task extends existing spatiotemporal grounding work focusing on explicit action/motion grounding, to a more general format by enabling implicit reasoning via questions. To facilitate the development of the new task, we collect a large-scale dataset called GROUNDMORE, which comprises 1,715 video clips, 249K object masks that are deliberately designed with 4 question types (Causal, Sequential, Counterfactual, and Descriptive) for benchmarking deep and comprehensive motion reasoning abilities. GROUNDMORE uniquely requires models to generate visual answers, providing a more concrete and visually interpretable response than plain texts. It evaluates models on both spatiotemporal grounding and reasoning, fostering to address complex challenges in motion-related video reasoning, temporal perception, and pixel-level understanding. Furthermore, we introduce a novel baseline model named Motion-Grounded Video Reasoning Assistant (MORA). MORA incorporates the multimodal reasoning ability from the Multimodal LLM, the pixel-level perception capability from the grounding model (SAM), and the temporal perception ability from a lightweight localization head. MORA achieves respectable performance on GROUNDMORE outperforming the best existing visual grounding baseline model by an average of 21.5% relatively. We hope this novel and challenging task will pave the way for future advancements in robust and general motion understanding via video reasoning segmentation

[AI-59] AMXFP4: Taming Activation Outliers with Asymmetric Microscaling Floating-Point for 4-bit LLM Inference

链接: https://arxiv.org/abs/2411.09909
作者: Janghwan Lee,Jiwoong Park,Jinseok Kim,Yongjik Kim,Jungju Oh,Jinwook Oh,Jungwook Choi
关键词-EN: Scaling Large Language, Large Language Models, Scaling Large, Language Models, Large Language
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Scaling Large Language Models (LLMs) with extended context lengths has increased the need for efficient low-bit quantization to manage their substantial computational demands. However, reducing precision to 4 bits frequently degrades performance due to activation outliers. To address this, we propose Asymmetric Microscaling 4-bit Floating-Point (AMXFP4) for efficient LLM inference. This novel data format leverages asymmetric shared scales to mitigate outliers while naturally capturing the asymmetry introduced by group-wise quantization. Unlike conventional 4-bit quantization methods that rely on data rotation and costly calibration, AMXFP4 uses asymmetric shared scales for direct 4-bit casting, achieving near-ideal quantization accuracy across various LLM tasks, including multi-turn conversations, long-context reasoning, and visual question answering. Our AMXFP4 format significantly outperforms MXFP4 and other leading quantization techniques, enabling robust, calibration-free 4-bit inference.
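For intuition, the sketch below shows generic asymmetric group-wise 4-bit integer quantization with a shared scale and zero point per group; AMXFP4 itself uses 4-bit floating-point elements with asymmetric shared scales, so the details differ:

```python
# Generic asymmetric group-wise 4-bit quantization (integer codes, per-group
# shared scale and zero point). AMXFP4 uses FP4 elements and differs in detail.
import torch

def quantize_groupwise_4bit(x: torch.Tensor, group_size: int = 32):
    g = x.reshape(-1, group_size)
    x_min = g.min(dim=1, keepdim=True).values
    x_max = g.max(dim=1, keepdim=True).values
    scale = (x_max - x_min).clamp(min=1e-8) / 15.0           # 4 bits -> 16 levels (0..15)
    zero = (-x_min / scale).round()
    q = (g / scale + zero).round().clamp(0, 15)              # quantized codes
    dequant = (q - zero) * scale                              # reconstruction
    return q.to(torch.uint8), dequant.reshape(x.shape)

x = torch.randn(4, 64)
codes, x_hat = quantize_groupwise_4bit(x)
print((x - x_hat).abs().max())                                # per-group quantization error
```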

[AI-60] Statistical Analysis of Policy Space Compression Problem

链接: https://arxiv.org/abs/2411.09900
作者: Majid Molaei,Marcello Restelli,Alberto Maria Metelli,Matteo Papini
关键词-EN: partially observable problems, address continuous state-action, Policy search methods, policy space, offering a framework
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Policy search methods are crucial in reinforcement learning, offering a framework to address continuous state-action and partially observable problems. However, the complexity of exploring vast policy spaces can lead to significant inefficiencies. Reducing the policy space through policy compression emerges as a powerful, reward-free approach to accelerate the learning process. This technique condenses the policy space into a smaller, representative set while maintaining most of the original effectiveness. Our research focuses on determining the necessary sample size to learn this compressed set accurately. We employ Rényi divergence to measure the similarity between true and estimated policy distributions, establishing error bounds for good approximations. To simplify the analysis, we employ the l_1 norm, determining sample size requirements for both model-based and model-free settings. Finally, we correlate the error bounds from the l_1 norm with those from Rényi divergence, distinguishing between policies near the vertices and those in the middle of the policy space, to determine the lower and upper bounds for the required sample sizes.

[AI-61] Off-Dynamics Reinforcement Learning via Domain Adaptation and Reward Augmented Imitation NEURIPS2024

链接: https://arxiv.org/abs/2411.09891
作者: Yihong Guo,Yixuan Wang,Yuanyuan Shi,Pan Xu,Anqi Liu
关键词-EN: target domain, source domain, domain, imitation learning, target optimal trajectories
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: Published at Neurips 2024

点击查看摘要

Abstract:Training a policy in a source domain for deployment in the target domain under a dynamics shift can be challenging, often resulting in performance degradation. Previous work tackles this challenge by training on the source domain with modified rewards derived by matching distributions between the source and the target optimal trajectories. However, pure modified rewards only ensure the behavior of the learned policy in the source domain resembles trajectories produced by the target optimal policies, which does not guarantee optimal performance when the learned policy is actually deployed to the target domain. In this work, we propose to utilize imitation learning to transfer the policy learned from the reward modification to the target domain so that the new policy can generate the same trajectories in the target domain. Our approach, Domain Adaptation and Reward Augmented Imitation Learning (DARAIL), utilizes the reward modification for domain adaptation and follows the general framework of generative adversarial imitation learning from observation (GAIfO) by applying a reward augmented estimator for the policy optimization step. Theoretically, we present an error bound for our method under a mild assumption regarding the dynamics shift to justify the motivation of our method. Empirically, our method outperforms the pure modified reward method without imitation learning and also outperforms other baselines in benchmark off-dynamics environments.

[AI-62] A Hybrid Artificial Intelligence System for Automated EEG Background Analysis and Report Generation

链接: https://arxiv.org/abs/2411.09874
作者: Chin-Sung Tung,Sheng-Fu Liang,Shu-Feng Chang,Chung-Ping Young
关键词-EN: plays a crucial, neurological disorders, crucial role, EEG, University Abnormal EEG
类目: Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
*备注: Example code available at this https URL

点击查看摘要

Abstract:Electroencephalography (EEG) plays a crucial role in the diagnosis of various neurological disorders. However, small hospitals and clinics often lack advanced EEG signal analysis systems and are prone to misinterpretation in manual EEG reading. This study proposes an innovative hybrid artificial intelligence (AI) system for automatic interpretation of EEG background activity and report generation. The system combines deep learning models for posterior dominant rhythm (PDR) prediction, unsupervised artifact removal, and expert-designed algorithms for abnormality detection. For PDR prediction, 1530 labeled EEGs were used, and the best ensemble model achieved a mean absolute error (MAE) of 0.237, a root mean square error (RMSE) of 0.359, an accuracy of 91.8% within a 0.6Hz error, and an accuracy of 99% within a 1.2Hz error. The AI system significantly outperformed neurologists in detecting generalized background slowing (p = 0.02; F1: AI 0.93, neurologists 0.82) and demonstrated improved focal abnormality detection, although not statistically significant (p = 0.79; F1: AI 0.71, neurologists 0.55). Validation on both an internal dataset and the Temple University Abnormal EEG Corpus showed consistent performance (F1: 0.884 and 0.835, respectively; p = 0.66), demonstrating generalizability. The use of large language models (LLMs) for report generation demonstrated 100% accuracy, verified by three other independent LLMs. This hybrid AI system provides an easily scalable and accurate solution for EEG interpretation in resource-limited settings, assisting neurologists in improving diagnostic accuracy and reducing misdiagnosis rates.
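The reported PDR metrics (MAE, RMSE, and accuracy within a frequency tolerance) can be reproduced from predictions and labels as in this small example (toy values, not the study's data):

```python
# Computing MAE, RMSE, and accuracy within a frequency tolerance for PDR predictions.
import numpy as np

def pdr_metrics(pred_hz: np.ndarray, true_hz: np.ndarray, tol_hz: float = 0.6) -> dict:
    err = pred_hz - true_hz
    return {
        "MAE": float(np.abs(err).mean()),
        "RMSE": float(np.sqrt((err ** 2).mean())),
        f"acc_within_{tol_hz}Hz": float((np.abs(err) <= tol_hz).mean()),
    }

pred = np.array([9.8, 10.2, 8.5, 11.0])
true = np.array([10.0, 10.0, 9.5, 10.8])
print(pdr_metrics(pred, true))
```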

[AI-63] InterFormer: Towards Effective Heterogeneous Interaction Learning for Click-Through Rate Prediction

链接: https://arxiv.org/abs/2411.09852
作者: Zhichen Zeng,Xiaolong Liu,Mengyue Hang,Xiaoyi Liu,Qinghai Zhou,Chaofei Yang,Yiqun Liu,Yichen Ruan,Laming Chen,Yuxin Chen,Yujia Hao,Jiaqi Xu,Jade Nie,Xi Liu,Buyun Zhang,Wei Wen,Siyang Yuan,Kai Wang,Wen-Yen Chen,Yiping Han,Huayu Li,Chunzhi Yang,Bo Long,Philip S. Yu,Hanghang Tong,Jiyan Yang
关键词-EN: Click-through rate, recommender systems, information, predicts the probability, task in recommender
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 10 pages, 6 figures

点击查看摘要

Abstract:Click-through rate (CTR) prediction, which predicts the probability of a user clicking an ad, is a fundamental task in recommender systems. The emergence of heterogeneous information, such as user profile and behavior sequences, depicts user interests from different aspects. A mutually beneficial integration of heterogeneous information is the cornerstone towards the success of CTR prediction. However, most of the existing methods suffer from two fundamental limitations, including (1) insufficient inter-mode interaction due to the unidirectional information flow between modes, and (2) aggressive information aggregation caused by early summarization, resulting in excessive information loss. To address the above limitations, we propose a novel module named InterFormer to learn heterogeneous information interaction in an interleaving style. To achieve better interaction learning, InterFormer enables bidirectional information flow for mutually beneficial learning across different modes. To avoid aggressive information aggregation, we retain complete information in each data mode and use a separate bridging arch for effective information selection and summarization. Our proposed InterFormer achieves state-of-the-art performance on three public datasets and a large-scale industrial dataset.

[AI-64] Enhancing Diffusion Posterior Sampling for Inverse Problems by Integrating Crafted Measurements

链接: https://arxiv.org/abs/2411.09850
作者: Shijie Zhou,Huaisheng Zhu,Rohan Sharma,Ruiyi Zhang,Kaiyi Ji,Changyou Chen
关键词-EN: powerful foundation model, foundation model, visual generation, powerful foundation, posterior sampling
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Diffusion models have emerged as a powerful foundation model for visual generation. With an appropriate sampling process, it can effectively serve as a generative prior to solve general inverse problems. Current posterior sampling based methods take the measurement (i.e., degraded image sample) into the posterior sampling to infer the distribution of the target data (i.e., clean image sample). However, in this manner, we show that high-frequency information can be prematurely introduced during the early stages, which could induce larger posterior estimate errors during the restoration sampling. To address this issue, we first reveal that forming the log posterior gradient with the noisy measurement ( i.e., samples from a diffusion forward process) instead of the clean one can benefit the reverse process. Consequently, we propose a novel diffusion posterior sampling method DPS-CM, which incorporates a Crafted Measurement (i.e., samples generated by a reverse denoising process, compared to random sampling with noise in standard methods) to form the posterior estimate. This integration aims to mitigate the misalignment with the diffusion prior caused by cumulative posterior estimate errors. Experimental results demonstrate that our approach significantly improves the overall capacity to solve general and noisy inverse problems, such as Gaussian deblurring, super-resolution, inpainting, nonlinear deblurring, and tasks with Poisson noise, relative to existing approaches.
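A schematic of a standard diffusion-posterior-sampling guidance step (not the proposed DPS-CM, which replaces the measurement with a crafted one): estimate the clean sample from x_t, measure its mismatch with the measurement under the forward operator, and step along the gradient of that mismatch:

```python
# Schematic DPS-style guidance step, shown only to clarify the baseline the paper
# builds on; DPS-CM modifies the measurement used in this term.
import torch

def guided_step(x_t, y, alpha_bar_t, eps_model, A, step_size=1.0):
    x_t = x_t.detach().requires_grad_(True)
    eps = eps_model(x_t)                                           # predicted noise
    x0_hat = (x_t - (1 - alpha_bar_t).sqrt() * eps) / alpha_bar_t.sqrt()
    residual = ((y - A(x0_hat)) ** 2).sum()                        # data-fidelity term
    grad = torch.autograd.grad(residual, x_t)[0]
    return (x_t - step_size * grad).detach()                       # combined with the usual reverse update

# Toy usage with an identity forward operator and a dummy noise predictor
x_t = torch.randn(1, 3, 8, 8)
y = torch.randn(1, 3, 8, 8)
out = guided_step(x_t, y, torch.tensor(0.5), lambda x: 0.1 * x, lambda x: x)
print(out.shape)
```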

[AI-65] Deep Autoencoders for Unsupervised Anomaly Detection in Wildfire Prediction

链接: https://arxiv.org/abs/2411.09844
作者: İrem Üstek,Miguel Arana-Catania,Alexander Farr,Ivan Petrunin
关键词-EN: significantly increasing hazard, global ecosystems due, climate crisis, pose a significantly, significantly increasing
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 33 pages, 18 figure, 16 tables. To appear in Earth and Space Science

点击查看摘要

Abstract:Wildfires pose a significantly increasing hazard to global ecosystems due to the climate crisis. Due to its complex nature, there is an urgent need for innovative approaches to wildfire prediction, such as machine learning. This research took a unique approach, differentiating from classical supervised learning, and addressed the gap in unsupervised wildfire prediction using autoencoders and clustering techniques for anomaly detection. Historical weather and normalised difference vegetation index datasets of Australia for 2005 - 2021 were utilised. Two main unsupervised approaches were analysed. The first used a deep autoencoder to obtain latent features, which were then fed into clustering models, isolation forest, local outlier factor and one-class SVM for anomaly detection. The second approach used a deep autoencoder to reconstruct the input data and use reconstruction errors to identify anomalies. Long Short-Term Memory (LSTM) autoencoders and fully connected (FC) autoencoders were employed in this part, both in an unsupervised way learning only from nominal data. The FC autoencoder outperformed its counterparts, achieving an accuracy of 0.71, an F1-score of 0.74, and an MCC of 0.42. These findings highlight the practicality of this method, as it effectively predicts wildfires in the absence of ground truth, utilising an unsupervised learning technique.
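A minimal sketch of the reconstruction-error approach, assuming a small fully connected autoencoder trained only on nominal data and a quantile threshold on training errors (the study's architectures and features are richer than this):

```python
# Reconstruction-error anomaly detection: train an FC autoencoder on nominal data,
# then flag samples whose reconstruction error exceeds a high training-error quantile.
import torch
import torch.nn as nn

class FCAutoencoder(nn.Module):
    def __init__(self, dim: int, hidden: int = 8):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
        self.dec = nn.Linear(hidden, dim)

    def forward(self, x):
        return self.dec(self.enc(x))

dim = 16
nominal = torch.randn(512, dim)                       # stand-in for nominal weather/NDVI features
model = FCAutoencoder(dim)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(200):                                  # train on nominal data only
    opt.zero_grad()
    loss = ((model(nominal) - nominal) ** 2).mean()
    loss.backward()
    opt.step()

with torch.no_grad():
    train_err = ((model(nominal) - nominal) ** 2).mean(dim=1)
    threshold = train_err.quantile(0.95)
    test = torch.randn(4, dim) * 3                    # out-of-distribution samples
    test_err = ((model(test) - test) ** 2).mean(dim=1)
    print(test_err > threshold)                       # True marks a flagged anomaly
```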

[AI-66] Real-time Adapting Routing (RAR): Improving Efficiency Through Continuous Learning in Software Powered by Layered Foundation Models

链接: https://arxiv.org/abs/2411.09837
作者: Kirill Vasilevski,Dayi Lin,Ahmed Hassan
关键词-EN: large language models, Real-time Adaptive Routing, powered software, Foundation Model, people often opt
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:To balance the quality and inference cost of a Foundation Model (FM, such as large language models (LLMs)) powered software, people often opt to train a routing model that routes requests to FMs with different sizes and capabilities. Existing routing models rely on learning the optimal routing decision from carefully curated data, require complex computations to be updated, and do not consider the potential evolution of weaker FMs. In this paper, we propose Real-time Adaptive Routing (RAR), an approach to continuously adapt FM routing decisions while using guided in-context learning to enhance the capabilities of weaker FM. The goal is to reduce reliance on stronger, more expensive FMs. We evaluate our approach on different subsets of the popular MMLU benchmark. Over time, our approach routes 50.2% fewer requests to computationally expensive models while maintaining around 90.5% of the general response quality. In addition, the guides generated from stronger models have shown intra-domain generalization and led to a better quality of responses compared to an equivalent approach with a standalone weaker FM.
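A toy illustration of the routing idea (not the RAR algorithm itself): keep a per-topic estimate of how often the weaker model suffices and route to the stronger model when that estimate falls below a threshold, updating the estimate online:

```python
# Toy adaptive router between a cheap ("weak") and expensive ("strong") model.
# Illustrates the idea of continuously adapting routing decisions, not RAR itself.
import random
from collections import defaultdict

class SimpleRouter:
    def __init__(self, threshold: float = 0.8):
        self.stats = defaultdict(lambda: [2, 2])   # topic -> [successes, attempts], optimistic prior
        self.threshold = threshold

    def route(self, topic: str) -> str:
        s, n = self.stats[topic]
        return "weak_model" if s / n >= self.threshold else "strong_model"

    def update(self, topic: str, weak_was_good: bool):
        self.stats[topic][0] += int(weak_was_good)
        self.stats[topic][1] += 1

router = SimpleRouter()
for _ in range(20):
    if router.route("astronomy") == "weak_model":
        # In practice, quality would be judged by feedback or a verifier; simulated here.
        router.update("astronomy", weak_was_good=random.random() < 0.6)
print(router.route("astronomy"), router.stats["astronomy"])
```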

[AI-67] A Self-Supervised Model for Multi-modal Stroke Risk Prediction NEURIPS2024

链接: https://arxiv.org/abs/2411.09822
作者: Camille Delgrange,Olga Demler,Samia Mora,Bjoern Menze,Ezequiel de la Rosa,Neda Davoudi
关键词-EN: Predicting stroke risk, Predicting stroke, complex challenge, data modalities, data
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted as oral paper at AIM-FM workshop, Neurips 2024

点击查看摘要

Abstract:Predicting stroke risk is a complex challenge that can be enhanced by integrating diverse clinically available data modalities. This study introduces a self-supervised multimodal framework that combines 3D brain imaging, clinical data, and image-derived features to improve stroke risk prediction prior to onset. By leveraging large unannotated clinical datasets, the framework captures complementary and synergistic information across image and tabular data modalities. Our approach is based on a contrastive learning framework that couples contrastive language-image pretraining with an image-tabular matching module, to better align multimodal data representations in a shared latent space. The model is trained on the UK Biobank, which includes structural brain MRI and clinical data. We benchmark its performance against state-of-the-art unimodal and multimodal methods using tabular, image, and image-tabular combinations under diverse frozen and trainable model settings. The proposed model outperformed self-supervised tabular (image) methods by 2.6% (2.6%) in ROC-AUC and by 3.3% (5.6%) in balanced accuracy. Additionally, it showed a 7.6% increase in balanced accuracy compared to the best multimodal supervised model. Through interpretable tools, our approach demonstrated better integration of tabular and image data, providing richer and more aligned embeddings. Gradient-weighted Class Activation Mapping heatmaps further revealed activated brain regions commonly associated in the literature with brain aging, stroke risk, and clinical outcomes. This robust self-supervised multimodal framework surpasses state-of-the-art methods for stroke risk prediction and offers a strong foundation for future studies integrating diverse data modalities to advance clinical predictive modelling.
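A generic sketch of the contrastive alignment objective, assuming a CLIP-style symmetric loss between image and tabular embeddings; the paper's framework additionally includes an image-tabular matching module not shown here:

```python
# CLIP-style symmetric contrastive loss for aligning image and tabular embeddings
# in a shared latent space; a generic sketch of the alignment objective only.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(img_emb: torch.Tensor, tab_emb: torch.Tensor, temperature: float = 0.07):
    img = F.normalize(img_emb, dim=-1)
    tab = F.normalize(tab_emb, dim=-1)
    logits = img @ tab.T / temperature               # (batch, batch) similarity matrix
    labels = torch.arange(img.size(0))               # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))

img_emb = torch.randn(8, 128)   # e.g. from a 3D imaging encoder
tab_emb = torch.randn(8, 128)   # e.g. from a clinical-variable encoder
print(contrastive_alignment_loss(img_emb, tab_emb))
```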

[AI-68] WelQrate: Defining the Gold Standard in Small Molecule Drug Discovery Benchmarking

链接: https://arxiv.org/abs/2411.09820
作者: Yunchao (Lance) Liu,Ha Dong,Xin Wang,Rocco Moretti,Yu Wang,Zhaoqian Su,Jiawei Gu,Bobby Bodenheimer,Charles David Weaver,Jens Meiler,Tyler Derr
关键词-EN: drug discovery, revolutionized computer-aided drug, drug discovery benchmarking, computer-aided drug discovery, molecule drug discovery
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
*备注: * denotes equal contribution

点击查看摘要

Abstract:While deep learning has revolutionized computer-aided drug discovery, the AI community has predominantly focused on model innovation and placed less emphasis on establishing best benchmarking practices. We posit that without a sound model evaluation framework, the AI community’s efforts cannot reach their full potential, thereby slowing the progress and transfer of innovation into real-world drug discovery. Thus, in this paper, we seek to establish a new gold standard for small molecule drug discovery benchmarking, WelQrate. Specifically, our contributions are threefold: WelQrate Dataset Collection - we introduce a meticulously curated collection of 9 datasets spanning 5 therapeutic target classes. Our hierarchical curation pipelines, designed by drug discovery experts, go beyond the primary high-throughput screen by leveraging additional confirmatory and counter screens along with rigorous domain-driven preprocessing, such as Pan-Assay Interference Compounds (PAINS) filtering, to ensure the high-quality data in the datasets; WelQrate Evaluation Framework - we propose a standardized model evaluation framework considering high-quality datasets, featurization, 3D conformation generation, evaluation metrics, and data splits, which provides a reliable benchmarking for drug discovery experts conducting real-world virtual screening; Benchmarking - we evaluate model performance through various research questions using the WelQrate dataset collection, exploring the effects of different models, dataset quality, featurization methods, and data splitting strategies on the results. In summary, we recommend adopting our proposed WelQrate as the gold standard in small molecule drug discovery benchmarking. The WelQrate dataset collection, along with the curation codes, and experimental scripts are all publicly available at this http URL.

[AI-69] Evaluating Loss Landscapes from a Topology Perspective

链接: https://arxiv.org/abs/2411.09807
作者: Tiankai Xie,Caleb Geniesse,Jiaqing Chen,Yaoqing Yang,Dmitriy Morozov,Michael W. Mahoney,Ross Maciejewski,Gunther H. Weber
关键词-EN: loss landscapes, provide valuable insights, loss, Characterizing the loss, visualizing loss landscapes
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Characterizing the loss of a neural network with respect to model parameters, i.e., the loss landscape, can provide valuable insights into properties of that model. Various methods for visualizing loss landscapes have been proposed, but less emphasis has been placed on quantifying and extracting actionable and reproducible insights from these complex representations. Inspired by powerful tools from topological data analysis (TDA) for summarizing the structure of high-dimensional data, here we characterize the underlying shape (or topology) of loss landscapes, quantifying the topology to reveal new insights about neural networks. To relate our findings to the machine learning (ML) literature, we compute simple performance metrics (e.g., accuracy, error), and we characterize the local structure of loss landscapes using Hessian-based metrics (e.g., largest eigenvalue, trace, eigenvalue spectral density). Following this approach, we study established models from image pattern recognition (e.g., ResNets) and scientific ML (e.g., physics-informed neural networks), and we show how quantifying the shape of loss landscapes can provide new insights into model performance and learning dynamics.
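One of the local Hessian-based metrics mentioned above, the largest eigenvalue, can be estimated with power iteration over Hessian-vector products; the generic PyTorch sketch below is an illustration, not the paper's tooling:

```python
# Estimating the largest Hessian eigenvalue of a loss via power iteration with
# Hessian-vector products (double backprop). Generic curvature probe, toy model.
import torch

def top_hessian_eigenvalue(loss_fn, params, iters: int = 30) -> float:
    loss = loss_fn()
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    v = torch.randn_like(flat_grad)
    v /= v.norm()
    eig = torch.tensor(0.0)
    for _ in range(iters):
        hv = torch.autograd.grad(flat_grad @ v, params, retain_graph=True)
        hv = torch.cat([h.reshape(-1) for h in hv])
        eig = v @ hv                       # Rayleigh quotient with unit-norm v
        v = hv / hv.norm()
    return eig.item()

model = torch.nn.Linear(10, 1)
x, y = torch.randn(32, 10), torch.randn(32, 1)
params = list(model.parameters())
print(top_hessian_eigenvalue(lambda: ((model(x) - y) ** 2).mean(), params))
```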

[AI-70] AI-Driven Human-Autonomy Teaming in Tactical Operations: Proposed Framework Challenges and Future Directions

链接: https://arxiv.org/abs/2411.09788
作者: Desta Haileselassie Hagos,Hassan El Alami,Danda B. Rawat
关键词-EN: Artificial Intelligence, AI-driven HAT, rapidly transforming tactical, machine learning techniques, human decision-making capabilities
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注: Submitted for review to the Proceedings of the IEEE

点击查看摘要

Abstract:Artificial Intelligence (AI) techniques, particularly machine learning techniques, are rapidly transforming tactical operations by augmenting human decision-making capabilities. This paper explores AI-driven Human-Autonomy Teaming (HAT) as a transformative approach, focusing on how it empowers human decision-making in complex environments. While trust and explainability continue to pose significant challenges, our exploration focuses on the potential of AI-driven HAT to transform tactical operations. By improving situational awareness and supporting more informed decision-making, AI-driven HAT can enhance the effectiveness and safety of such operations. To this end, we propose a comprehensive framework that addresses the key components of AI-driven HAT, including trust and transparency, optimal function allocation between humans and AI, situational awareness, and ethical considerations. The proposed framework can serve as a foundation for future research and development in the field. By identifying and discussing critical research challenges and knowledge gaps in this framework, our work aims to guide the advancement of AI-driven HAT for optimizing tactical operations. We emphasize the importance of developing scalable and ethical AI-driven HAT systems that ensure seamless human-machine collaboration, prioritize ethical considerations, enhance model transparency through Explainable AI (XAI) techniques, and effectively manage the cognitive load of human operators.

[AI-71] SureMap: Simultaneous Mean Estimation for Single-Task and Multi-Task Disaggregated Evaluation NEURIPS2024

链接: https://arxiv.org/abs/2411.09730
作者: Mikhail Khodak,Lester Mackey,Alexandra Chouldechova,Miroslav Dudík
关键词-EN: Disaggregated evaluation, machine learning model, assessing performance, Disaggregated, evaluation
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applications (stat.AP); Machine Learning (stat.ML)
*备注: NeurIPS 2024

点击查看摘要

Abstract:Disaggregated evaluation – estimation of performance of a machine learning model on different subpopulations – is a core task when assessing performance and group-fairness of AI systems. A key challenge is that evaluation data is scarce, and subpopulations arising from intersections of attributes (e.g., race, sex, age) are often tiny. Today, it is common for multiple clients to procure the same AI model from a model developer, and the task of disaggregated evaluation is faced by each customer individually. This gives rise to what we call the multi-task disaggregated evaluation problem, wherein multiple clients seek to conduct a disaggregated evaluation of a given model in their own data setting (task). In this work we develop a disaggregated evaluation method called SureMap that has high estimation accuracy for both multi-task and single-task disaggregated evaluations of blackbox models. SureMap’s efficiency gains come from (1) transforming the problem into structured simultaneous Gaussian mean estimation and (2) incorporating external data, e.g., from the AI system creator or from their other clients. Our method combines maximum a posteriori (MAP) estimation using a well-chosen prior together with cross-validation-free tuning via Stein’s unbiased risk estimate (SURE). We evaluate SureMap on disaggregated evaluation tasks in multiple domains, observing significant accuracy improvements over several strong competitors.

[AI-72] Towards Neural Foundation Models for Vision: Aligning EEG, MEG and fMRI Representations for Decoding, Encoding and Modality Conversion

链接: https://arxiv.org/abs/2411.09723
作者: Matteo Ferrante,Tommaso Boccato,Grigorii Rashkov,Nicola Toschi
关键词-EN: leveraging contrastive learning, representations of brain activity, multimodal representations of brain, contrastive learning, paper presents
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper presents a novel approach towards creating a foundational model for aligning neural data and visual stimuli across multimodal representations of brain activity by leveraging contrastive learning. We used electroencephalography (EEG), magnetoencephalography (MEG), and functional magnetic resonance imaging (fMRI) data. Our framework’s capabilities are demonstrated through three key experiments: decoding visual information from neural data, encoding images into neural representations, and converting between neural modalities. The results highlight the model’s ability to accurately capture semantic information across different brain imaging techniques, illustrating its potential in decoding, encoding, and modality conversion tasks.

[AI-73] Iterative Batch Reinforcement Learning via Safe Diversified Model-based Policy Search

链接: https://arxiv.org/abs/2411.09722
作者: Amna Najib,Stefan Depeweg,Phillip Swazinna
关键词-EN: previously collected sets, environment during training, relying exclusively, direct interaction, exclusively on previously
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: Workshop on Safe and Robust Robot Learning for Operation in the Real World (SAFE-ROL) at CoRL 2024

点击查看摘要

Abstract:Batch reinforcement learning enables policy learning without direct interaction with the environment during training, relying exclusively on previously collected sets of interactions. This approach is, therefore, well-suited for high-risk and cost-intensive applications, such as industrial control. Learned policies are commonly restricted to act in a similar fashion as observed in the batch. In a real-world scenario, learned policies are deployed in the industrial system, inevitably leading to the collection of new data that can subsequently be added to the existing recording. The process of learning and deployment can thus take place multiple times throughout the lifespan of a system. In this work, we propose to exploit this iterative nature of applying offline reinforcement learning to guide learned policies towards efficient and informative data collection during deployment, leading to continuous improvement of learned policies while remaining within the support of collected data. We present an algorithmic methodology for iterative batch reinforcement learning based on ensemble-based model-based policy search, augmented with safety and, importantly, a diversity criterion.

[AI-74] NFRs in Medical Imaging

链接: https://arxiv.org/abs/2411.09718
作者: Amanda Vallentin
关键词-EN: great pressure due, pressure due, diagnostic imaging, growing workload, imaging
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The diagnostic imaging departments are under great pressure due to a growing workload. The number of required scans is growing and there is a shortage of qualified labor. AI solutions for medical imaging applications have shown great potential. However, very few diagnostic imaging models have been approved for hospital use and even fewer are being implemented at the hospitals. The most common reason why software projects fail is poor requirements engineering; in particular, non-functional requirements (NFRs) can be detrimental to a project. Research shows that machine learning professionals struggle to work with NFRs and that there is a need to adapt NFR frameworks to machine learning and AI-based software. This study uses qualitative methods to interact with key stakeholders to identify which types of NFRs are important for medical imaging applications. The study was done at a single Danish hospital and found that NFRs of type Efficiency, Accuracy, Interoperability, Reliability, Usability, Adaptability, and Fairness were important to the stakeholders. Efficiency was especially important, since the diagnostic imaging department tries to spend as little time as possible on each scan.

[AI-75] AI-Driven Feedback Loops in Digital Technologies: Psychological Impacts on User Behaviour and Well-Being

链接: https://arxiv.org/abs/2411.09706
作者: Anthonette Adanyin
关键词-EN: shape user behavior, social media networks, wearable devices, media networks, rapid spread
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:The rapid spread of digital technologies has produced data-driven feedback loops, wearable devices, social media networks, and mobile applications that shape user behavior, motivation, and mental well-being. While these systems encourage self-improvement and the development of healthier habits through real-time feedback, they also create psychological risks such as technostress, addiction, and loss of autonomy. The present study also aims to investigate the positive and negative psychological consequences of feedback mechanisms on users’ behaviour and well-being. Employing a descriptive survey method, the study collected data from 200 purposely selected users to assess changes in behaviour, motivation, and mental well-being related to health, social, and lifestyle applications. Results indicate that while feedback mechanisms facilitate goal attainment and social interconnection through streaks and badges, among other components, they also enhance anxiety, mental weariness, and loss of productivity due to actions that are considered feedback-seeking. Furthermore, test subjects reported that their actions are unconsciously shaped by app feedback, often at the expense of personal autonomy, while real-time feedback minimally influences professional or social interactions. The study shows that data-driven feedback loops deliver not only motivational benefits but also psychological challenges. To mitigate these risks, users should establish boundaries regarding their use of technology to prevent burnout and addiction, while developers need to refine feedback mechanisms to reduce cognitive load and foster more inclusive participation. Future research should focus on designing feedback mechanisms that promote well-being without compromising individual freedom or increasing social comparison.

[AI-76] Prices Bids Values: Everything Everywhere All at Once

链接: https://arxiv.org/abs/2411.09355
作者: Ermis Soumalias,Jakob Heiss,Jakob Weissteiner,Sven Seuken
关键词-EN: previous SOTA, MLHCA, SOTA, queries, iterative combinatorial
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study the design of iterative combinatorial auctions (ICAs). The main challenge in this domain is that the bundle space grows exponentially in the number of items. To address this, several papers have recently proposed machine learning (ML)-based preference elicitation algorithms that aim to elicit only the most important information from bidders to maximize efficiency. The SOTA ML-based algorithms elicit bidders’ preferences via value queries (i.e., “What is your value for the bundle {A, B}?”). However, the most popular iterative combinatorial auction in practice elicits information via more practical demand queries (i.e., “At prices p, what is your most preferred bundle of items?”). In this paper, we examine the advantages of value and demand queries from both an auction design and an ML perspective. We propose a novel ML algorithm that provably integrates the full information from both query types. As suggested by our theoretical analysis, our experimental results verify that combining demand and value queries results in significantly better learning performance. Building on these insights, we present MLHCA, the most efficient ICA ever designed. MLHCA substantially outperforms the previous SOTA in realistic auction settings, delivering large efficiency gains. Compared to the previous SOTA, MLHCA reduces efficiency loss by up to a factor of 10, and in the most challenging and realistic domain, MLHCA outperforms the previous SOTA using 30% fewer queries. Thus, MLHCA achieves efficiency improvements that translate to welfare gains of hundreds of millions of USD, while also reducing the cognitive load on the bidders, establishing a new benchmark both for practicability and for economic impact.

[AI-77] Cyber-Forensic Review of Human Footprint and Gait for Personal Identification

链接: https://arxiv.org/abs/2204.09344
作者: Kapil Kumar Nagwanshi
关键词-EN: system AADHAR card, identification system AADHAR, Indian biometric identification, PAN card, AADHAR card
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:The human footprint has a unique set of ridges unmatched by any other human being, and it can therefore be used in identity documents such as birth certificates, the Indian biometric identification system AADHAR card, driving licenses, PAN cards, and passports. In many crime scenes an accused must walk around and leaves footwear impressions as well as barefoot prints, so recovering these footprints is crucial for identifying the criminals. Footprint-based biometrics is a comparatively newer technique for personal identification. Fingerprint, retina, iris and face recognition are the methods most commonly used for attendance records. The world is currently facing the problem of global terrorism, and identifying terrorists is challenging because they live like ordinary citizens. Their soft targets include industries of special interest such as defence, silicon and nanotechnology chip manufacturing units, and the pharmaceutical sector, as well as temples, other holy places, and markets, where they often pose as religious persons. These are places where footprints can be obtained quickly. Gait alone can be sufficient to predict the behaviour of suspects. The present research is driven to assess the usefulness of footprint and gait as alternatives for personal identification.

[AI-78] Identifying Key Drivers of Heatwaves: A Novel Spatio-Temporal Framework for Extreme Event Detection

链接: https://arxiv.org/abs/2411.10108
作者: J. Pérez-Aracil,C. Peláez-Rodríguez,Ronan McAdam,Antonello Squintu,Cosmin M. Marina,Eugenio Lorente-Ramos,Niklas Luther,Veronica Torralba,Enrico Scoccimarro,Leone Cavicchia,Matteo Giuliani,Eduardo Zorita,Felicitas Hansen,David Barriopedro,Ricardo Garcia-Herrera,Pedro A. Gutiérrez,Jürg Luterbacher,Elena Xoplaki,Andrea Castelletti,S. Salcedo-Sanz
关键词-EN: produce significant societal, environmental impacts, societal and environmental, Heatwaves, extreme atmospheric events
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI)
*备注: 28 pages, 10 figures, 4 tables

点击查看摘要

Abstract:Heatwaves (HWs) are extreme atmospheric events that produce significant societal and environmental impacts. Predicting these extreme events remains challenging, as their complex interactions with large-scale atmospheric and climatic variables are difficult to capture with traditional statistical and dynamical models. This work presents a general method for driver identification in extreme climate events. A novel framework (STCO-FS) is proposed to identify key immediate (short-term) HW drivers by combining clustering algorithms with an ensemble evolutionary algorithm. The framework analyzes spatio-temporal data, reduces dimensionality by grouping similar geographical nodes for each variable, and develops driver selection in spatial and temporal domains, identifying the best time lags between predictive variables and HW occurrences. The proposed method has been applied to analyze HWs in the Adda river basin in Italy. The approach effectively identifies significant variables influencing HWs in this region. This research can potentially enhance our understanding of HW drivers and predictability.

[AI-79] EyeDiff: text-to-image diffusion model improves rare eye disease diagnosis

链接: https://arxiv.org/abs/2411.10004
作者: Ruoyu Chen,Weiyi Zhang,Bowen Liu,Xiaolan Chen,Pusheng Xu,Shunming Liu,Mingguang He,Danli Shi
关键词-EN: global healthcare systems, vision-threatening retinal diseases, retinal diseases poses, healthcare systems, rising prevalence
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 28 pages, 2 figures

点击查看摘要

Abstract:The rising prevalence of vision-threatening retinal diseases poses a significant burden on the global healthcare systems. Deep learning (DL) offers a promising solution for automatic disease screening but demands substantial data. Collecting and labeling large volumes of ophthalmic images across various modalities encounters several real-world challenges, especially for rare diseases. Here, we introduce EyeDiff, a text-to-image model designed to generate multimodal ophthalmic images from natural language prompts and evaluate its applicability in diagnosing common and rare diseases. EyeDiff is trained on eight large-scale datasets using the advanced latent diffusion model, covering 14 ophthalmic image modalities and over 80 ocular diseases, and is adapted to ten multi-country external datasets. The generated images accurately capture essential lesional characteristics, achieving high alignment with text prompts as evaluated by objective metrics and human experts. Furthermore, integrating generated images significantly enhances the accuracy of detecting minority classes and rare eye diseases, surpassing traditional oversampling methods in addressing data imbalance. EyeDiff effectively tackles the issue of data imbalance and insufficiency typically encountered in rare diseases and addresses the challenges of collecting large-scale annotated images, offering a transformative solution to enhance the development of expert-level disease diagnosis models in the ophthalmic field.

[AI-80] Building 6G Radio Foundation Models with Transformer Architectures

链接: https://arxiv.org/abs/2411.09996
作者: Ahmed Aboulfotouh,Ashkan Eshaghbeigi,Hatem Abou-Zeid
关键词-EN: Foundation deep learning, learn general, designed to learn, robust and adaptable, target modality
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
*备注:

点击查看摘要

Abstract:Foundation deep learning (DL) models are general models, designed to learn general, robust and adaptable representations of their target modality, enabling finetuning across a range of downstream tasks. These models are pretrained on large, unlabeled datasets using self-supervised learning (SSL). Foundation models have demonstrated better generalization than traditional supervised approaches, a critical requirement for wireless communications where the dynamic environment demands model adaptability. In this work, we propose and demonstrate the effectiveness of a Vision Transformer (ViT) as a radio foundation model for spectrogram learning. We introduce a Masked Spectrogram Modeling (MSM) approach to pretrain the ViT in a self-supervised fashion. We evaluate the ViT-based foundation model on two downstream tasks: Channel State Information (CSI)-based Human Activity sensing and Spectrogram Segmentation. Experimental results demonstrate competitive performance to supervised training while generalizing across diverse domains. Notably, the pretrained ViT model outperforms a four-times larger model that is trained from scratch on the spectrogram segmentation task, while requiring significantly less training time, and achieves competitive performance on the CSI-based human activity sensing task. This work demonstrates the effectiveness of ViT with MSM for pretraining as a promising technique for scalable foundation model development in future 6G networks.
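
As a rough illustration of Masked Spectrogram Modeling, the snippet below randomly masks square patches of a spectrogram batch so a ViT can be trained to reconstruct the hidden regions. The patch size, mask ratio, and zero-filling are assumptions for illustration; the paper's exact masking strategy may differ.

```python
import torch

def mask_spectrogram(spec, patch=16, mask_ratio=0.75):
    """Randomly zero out square patches of a spectrogram batch for
    masked-reconstruction pretraining (MSM-style). Simplified sketch;
    spec is (B, 1, F, T) with F and T divisible by `patch`."""
    B, _, Fdim, Tdim = spec.shape
    nf, nt = Fdim // patch, Tdim // patch
    n_patches, n_mask = nf * nt, int(mask_ratio * nf * nt)
    masked = spec.clone()
    mask = torch.zeros(B, n_patches, dtype=torch.bool)
    for b in range(B):
        idx = torch.randperm(n_patches)[:n_mask]
        mask[b, idx] = True
        for p in idx.tolist():
            i, j = divmod(p, nt)
            masked[b, :, i*patch:(i+1)*patch, j*patch:(j+1)*patch] = 0.0
    return masked, mask   # train the ViT to reconstruct spec where mask is True

masked, mask = mask_spectrogram(torch.randn(2, 1, 128, 256))
```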

[AI-81] Self-Supervised Radio Pre-training: Toward Foundational Models for Spectrogram Learning

链接: https://arxiv.org/abs/2411.09849
作者: Ahmed Aboulfotouh,Ashkan Eshaghbeigi,Dimitrios Karslidis,Hatem Abou-Zeid
关键词-EN: natural language processing, Masked Spectrogram Modeling, trained on large, self-supervised learning techniques, Foundational deep learning
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注:

点击查看摘要

Abstract:Foundational deep learning (DL) models are general models trained on large, diverse, and unlabelled datasets, typically using self-supervised learning techniques, and they have led to significant advancements, especially in natural language processing. These pretrained models can be fine-tuned for related downstream tasks, offering faster development and reduced training costs, while often achieving improved performance. In this work, we introduce Masked Spectrogram Modeling, a novel self-supervised learning approach for pretraining foundational DL models on radio signals. Adopting a Convolutional LSTM architecture for efficient spatio-temporal processing, we pretrain the model with an unlabelled radio dataset collected from over-the-air measurements. Subsequently, the pretrained model is fine-tuned for two downstream tasks: spectrum forecasting and segmentation. Experimental results demonstrate that our methodology achieves competitive performance in both forecasting accuracy and segmentation, validating its effectiveness for developing foundational radio models.

[AI-82] Deep Learning for Fetal Inflammatory Response Diagnosis in the Umbilical Cord

链接: https://arxiv.org/abs/2411.09767
作者: Marina A. Ayad,Ramin Nateghi,Abhishek Sharma,Lawrence Chillrud,Tilly Seesillapachai,Lee A.D. Cooper,Jeffery A. Goldstein
关键词-EN: ascending intrauterine infection, fetal inflammatory response, umbilical cord, fetal inflammatory, result of ascending
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Inflammation of the umbilical cord can be seen as a result of ascending intrauterine infection or other inflammatory stimuli. Acute fetal inflammatory response (FIR) is characterized by infiltration of the umbilical cord by fetal neutrophils, and can be associated with neonatal sepsis or fetal inflammatory response syndrome. Recent advances in deep learning in digital pathology have demonstrated favorable performance across a wide range of clinical tasks, such as diagnosis and prognosis. In this study we classified FIR from whole slide images (WSI). We digitized 4100 histological slides of umbilical cord stained with hematoxylin and eosin (HE) and extracted placental diagnoses from the electronic health record. We built models using attention-based whole-slide learning. We compared strategies between features extracted by a model (ConvNeXtXLarge) pretrained on non-medical images (ImageNet), and one pretrained using histopathology images (UNI). We trained multiple iterations of each model and combined them into an ensemble. The predictions from the ensemble of models trained using UNI achieved an overall balanced accuracy of 0.836 on the test dataset. In comparison, the ensembled predictions using ConvNeXtXLarge had a lower balanced accuracy of 0.7209. Heatmaps generated from the top-accuracy model appropriately highlighted arteritis in cases of FIR 2. In FIR 1, the highest-performing model assigned high attention to areas of activated-appearing stroma in Wharton’s Jelly. However, other high-performing models assigned attention to umbilical vessels. We developed models for diagnosis of FIR from placental histology images, helping reduce interobserver variability among pathologists. Future work may examine the utility of these models for identifying infants at risk of systemic inflammatory response or early onset neonatal sepsis.
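
The whole-slide models described here aggregate patch-level features with attention before classifying the slide. Below is a minimal attention-pooling head in the spirit of attention-based MIL; the feature dimension, hidden size, and three-class head are assumptions for illustration, not the study's architecture.

```python
import torch
import torch.nn as nn

class AttentionMILHead(nn.Module):
    """Attention pooling over patch features of one whole slide, followed by
    a slide-level classifier, in the spirit of attention-based MIL."""
    def __init__(self, in_dim=1024, hid_dim=256, n_classes=3):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.Tanh(),
                                  nn.Linear(hid_dim, 1))
        self.classifier = nn.Linear(in_dim, n_classes)

    def forward(self, patch_feats):                       # (N_patches, in_dim)
        a = torch.softmax(self.attn(patch_feats), dim=0)  # attention weight per patch
        slide_feat = (a * patch_feats).sum(dim=0)         # weighted slide embedding
        return self.classifier(slide_feat), a.squeeze(-1)

logits, attn = AttentionMILHead()(torch.randn(500, 1024))
```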

[AI-83] Feature Selection via Dynamic Graph-based Attention Block in MI-based EEG Signals

链接: https://arxiv.org/abs/2411.09709
作者: Hyeon-Taek Han,Dae-Hyeok Lee,Heon-Gyu Kwak
关键词-EN: technology enables direct, enables direct interaction, Brain-computer interface, analyzing brain signals, EEG signals
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 4 pages, 2 figures, 1 table, Name of Conference: International Conference on Brain-Computer Interface

点击查看摘要

Abstract:Brain-computer interface (BCI) technology enables direct interaction between humans and computers by analyzing brain signals. Electroencephalogram (EEG) is one of the non-invasive tools used in BCI systems, providing high temporal resolution for real-time applications. However, EEG signals are often affected by a low signal-to-noise ratio, physiological artifacts, and individual variability, posing challenges for extracting distinct features. Also, motor imagery (MI)-based EEG signals could contain features with low correlation to MI characteristics, which might cause the weights of the deep model to become biased towards those features. To address these problems, we proposed an end-to-end deep preprocessing method that effectively enhances MI characteristics while attenuating features with low correlation to MI characteristics. The proposed method consists of temporal, spatial, graph, and similarity blocks to preprocess MI-based EEG signals, aiming to extract more discriminative features and improve robustness. We evaluated the proposed method using the public dataset 2a of BCI Competition IV to compare performance when integrating the proposed method into conventional models, including DeepConvNet, M-ShallowConvNet, and EEGNet. The experimental results showed that the proposed method achieved improved performance and led to more clustered feature distributions for MI tasks. Hence, we demonstrated that our proposed method can enhance discriminative features related to MI characteristics.

计算机视觉

[CV-0] LLaVA-o1: Let Vision Language Models Reason Step-by-Step

链接: https://arxiv.org/abs/2411.10440
作者: Guowei Xu,Peng Jin,Li Hao,Yibing Song,Lichao Sun,Li Yuan
关键词-EN: Large language models, demonstrated substantial advancements, Large language, demonstrated substantial, substantial advancements
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 11 pages, 5 figures

点击查看摘要

Abstract:Large language models have demonstrated substantial advancements in reasoning capabilities, particularly through inference-time scaling, as illustrated by models such as OpenAI’s o1. However, current Vision-Language Models (VLMs) often struggle to perform systematic and structured reasoning, especially when handling complex visual question-answering tasks. In this work, we introduce LLaVA-o1, a novel VLM designed to conduct autonomous multistage reasoning. Unlike chain-of-thought prompting, LLaVA-o1 independently engages in sequential stages of summarization, visual interpretation, logical reasoning, and conclusion generation. This structured approach enables LLaVA-o1 to achieve marked improvements in precision on reasoning-intensive tasks. To accomplish this, we compile the LLaVA-o1-100k dataset, integrating samples from various visual question answering sources and providing structured reasoning annotations. Besides, we propose an inference-time stage-level beam search method, which enables effective inference-time scaling. Remarkably, with only 100k training samples and a simple yet effective inference time scaling method, LLaVA-o1 not only outperforms its base model by 8.9% on a wide range of multimodal reasoning benchmarks, but also surpasses the performance of larger and even closed-source models, such as Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct.
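
The stage-level beam search can be pictured as ordinary beam search in which candidates are whole reasoning stages rather than tokens. The sketch below captures only that control flow; `generate` and `score` are hypothetical callbacks standing in for the model's stage generator and candidate scorer, and are not part of any released LLaVA-o1 API.

```python
def stage_level_beam_search(prompt, stages, generate, score,
                            beam_width=2, n_candidates=4):
    """Beam search over whole reasoning stages rather than tokens.
    `generate(context, stage, n)` returns n candidate stage outputs and
    `score(text)` rates a partial response; both are hypothetical hooks."""
    beams = [prompt]
    for stage in stages:                    # e.g. summary, caption, reasoning, conclusion
        candidates = [ctx + cand
                      for ctx in beams
                      for cand in generate(ctx, stage, n_candidates)]
        beams = sorted(candidates, key=score, reverse=True)[:beam_width]
    return beams[0]
```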

[CV-1] M-VAR: Decoupled Scale-wise Autoregressive Modeling for High-Quality Image Generation

链接: https://arxiv.org/abs/2411.10433
作者: Sucheng Ren,Yaodong Yu,Nataniel Ruiz,Feng Wang,Alan Yuille,Cihang Xie
关键词-EN: exists recent work, computer vision, exists recent, recent work, work in computer
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:There exists recent work in computer vision, named VAR, that proposes a new autoregressive paradigm for image generation. Diverging from the vanilla next-token prediction, VAR structurally reformulates the image generation into a coarse-to-fine next-scale prediction. In this paper, we show that this scale-wise autoregressive framework can be effectively decoupled into intra-scale modeling, which captures local spatial dependencies within each scale, and inter-scale modeling, which models cross-scale relationships progressively from coarse-to-fine scales. This decoupling structure allows VAR to be rebuilt in a more computationally efficient manner. Specifically, for intra-scale modeling – crucial for generating high-fidelity images – we retain the original bidirectional self-attention design to ensure comprehensive modeling; for inter-scale modeling, which semantically connects different scales but is computationally intensive, we apply linear-complexity mechanisms like Mamba to substantially reduce computational overhead. We term this new framework M-VAR. Extensive experiments demonstrate that our method outperforms existing models in both image quality and generation speed. For example, our 1.5B model, with fewer parameters and faster inference speed, outperforms the largest VAR-d30-2B. Moreover, our largest model M-VAR-d32 impressively registers 1.78 FID on ImageNet 256×256 and outperforms the prior-art autoregressive models LlamaGen/VAR by 0.4/0.19 and popular diffusion models LDM/DiT by 1.82/0.49, respectively. Code is available at this https URL.

[CV-2] Generation of synthetic gait data: application to multiple sclerosis patients gait patterns

链接: https://arxiv.org/abs/2411.10377
作者: Klervi Le Gall,Lise Bellanger,David Laplaud
关键词-EN: severe non-traumatic disability, Multiple sclerosis, increasing worldwide, severe non-traumatic, non-traumatic disability
类目: Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:Multiple sclerosis (MS) is the leading cause of severe non-traumatic disability in young adults and its incidence is increasing worldwide. The variability of gait impairment in MS necessitates the development of a non-invasive, sensitive, and cost-effective tool for quantitative gait evaluation. The eGait movement sensor, designed to characterize human gait through unit quaternion time series (QTS) representing hip rotations, is a promising approach. However, the small sample sizes typical of clinical studies pose challenges for the stability of gait data analysis tools. To address these challenges, this article presents two key scientific contributions. First, a comprehensive framework is proposed for transforming QTS data into a form that preserves the essential geometric properties of gait while enabling the use of any tabular synthetic data generation method. Second, a synthetic data generation method is introduced, based on nearest neighbors weighting, which produces high-fidelity synthetic QTS data suitable for small datasets and private data environments. The effectiveness of the proposed method is demonstrated through its application to MS gait data, showing very good fidelity while respecting the initial geometry of the data. Thanks to this work, we are able to produce synthetic data sets and work on the stability of clustering methods.

[CV-3] Interactive Image-Based Aphid Counting in Yellow Water Traps under Stirring Actions

链接: https://arxiv.org/abs/2411.10357
作者: Xumin Gao,Mark Stevens,Grzegorz Cielniak
关键词-EN: low visibility arising, current vision-based aphid, counting, water traps suffer, vision-based aphid counting
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The current vision-based aphid counting methods in water traps suffer from undercounts caused by occlusions and low visibility arising from dense aggregation of insects and other objects. To address this problem, we propose a novel aphid counting method through interactive stirring actions. We use interactive stirring to alter the distribution of aphids in the yellow water trap and capture a sequence of images which are then used for aphid detection and counting through an optimized small object detection network based on Yolov5. We also propose a counting confidence evaluation system to evaluate the confidence of counting results. The final counting result is a weighted sum of the counting results from all sequence images based on the counting confidence. Experimental results show that our proposed aphid detection network significantly outperforms the original Yolov5, with improvements of 33.9% in AP@0.5 and 26.9% in AP@[0.5:0.95] on the aphid test set. In addition, the aphid counting test results using our proposed counting confidence evaluation system show significant improvements over the static counting method, closely aligning with manual counting results.
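
The final count described above is a confidence-weighted sum over the image sequence. A minimal sketch of that fusion step follows; the confidence values themselves are assumed to be given, and the numbers in the usage line are made up.

```python
import numpy as np

def fuse_counts(counts, confidences):
    """Weighted-sum fusion of per-image counts from the stirred sequence,
    trusting images with higher counting confidence more."""
    counts = np.asarray(counts, dtype=float)
    w = np.asarray(confidences, dtype=float)
    w = w / w.sum()
    return float(np.round((w * counts).sum()))

print(fuse_counts([41, 47, 44], [0.6, 0.9, 0.8]))   # made-up counts and confidences
```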

[CV-4] BiDense: Binarization for Dense Prediction

链接: https://arxiv.org/abs/2411.10346
作者: Rui Yin,Haotong Qin,Yulun Zhang,Wenbo Li,Yong Guo,Jianjun Zhu,Cheng Wang,Biao Jia
关键词-EN: computer vision, Dense prediction, Channel-adaptive Full-precision Bypass, dense prediction tasks, critical task
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Dense prediction is a critical task in computer vision. However, previous methods often require extensive computational resources, which hinders their real-world application. In this paper, we propose BiDense, a generalized binary neural network (BNN) designed for efficient and accurate dense prediction tasks. BiDense incorporates two key techniques: the Distribution-adaptive Binarizer (DAB) and the Channel-adaptive Full-precision Bypass (CFB). The DAB adaptively calculates thresholds and scaling factors for binarization, effectively retaining more information within BNNs. Meanwhile, the CFB facilitates full-precision bypassing for binary convolutional layers undergoing various channel size transformations, which enhances the propagation of real-valued signals and minimizes information loss. By leveraging these techniques, BiDense preserves more real-valued information, enabling more accurate and detailed dense predictions in BNNs. Extensive experiments demonstrate that our framework achieves performance levels comparable to full-precision models while significantly reducing memory usage and computational costs.
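
To give a feel for a distribution-adaptive binarizer, the snippet below derives a per-channel threshold and scaling factor from the activation statistics before taking the sign. This is a generic sketch of the idea, under the assumption of mean and mean-absolute-deviation statistics; BiDense's DAB may compute these quantities differently.

```python
import torch

def distribution_adaptive_binarize(x):
    """Binarize activations with a per-channel threshold (mean) and scaling
    factor (mean absolute deviation) derived from the input distribution.
    Generic sketch; BiDense's exact formulation may differ. x: (B, C, H, W)."""
    thr = x.mean(dim=(2, 3), keepdim=True)                   # adaptive threshold
    centered = x - thr
    scale = centered.abs().mean(dim=(2, 3), keepdim=True)    # adaptive scale
    return torch.sign(centered) * scale

out = distribution_adaptive_binarize(torch.randn(1, 8, 32, 32))
```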

[CV-5] Comparative Analysis of Machine Learning Approaches for Bone Age Assessment: A Comprehensive Study on Three Distinct Models

链接: https://arxiv.org/abs/2411.10345
作者: Nandavardhan R.,Somanathan R.,Vikram Suresh,Savaridassan P
关键词-EN: Radiologists and doctors, X-ray images, doctors make, non-dominant hands, hands of children
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Radiologists and doctors make use of X-ray images of the non-dominant hands of children and infants to assess the possibility of genetic conditions and growth abnormalities. This is done by assessing the difference between the actual extent of growth found using the X-rays and the chronological age of the subject. The assessment was done conventionally using The Greulich Pyle (GP) or Tanner Whitehouse (TW) approach. These approaches require a high level of expertise and may often lead to observer bias. Hence, to automate the process of assessing the X-rays, and to increase its accuracy and efficiency, several machine learning models have been developed. These machine-learning models have several differences in their accuracy and efficiencies, leading to an unclear choice for the suitable model depending on their needs and available resources. Methods: In this study, we have analyzed the 3 most widely used models for the automation of bone age prediction, which are the Xception model, VGG model and CNN model. These models were trained on the preprocessed dataset and the accuracy was measured using the MAE in terms of months for each model. Using this, the comparison between the models was done. Results: The 3 models, Xception, VGG, and CNN models have been tested for accuracy and other relevant factors.

[CV-6] Y-MAP-Net: Real-time depth normals segmentation multi-label captioning and 2D human pose in RGB images

链接: https://arxiv.org/abs/2411.10334
作者: Ammar Qammaz,Nikolaos Vasilikopoulos,Iason Oikonomidis,Antonis A. Argyros
关键词-EN: Y-shaped neural network, RGB images, Y-shaped neural, real-time multi-task learning, neural network architecture
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 8 page paper, 6 Figures, 3 Tables

点击查看摘要

Abstract:We present Y-MAP-Net, a Y-shaped neural network architecture designed for real-time multi-task learning on RGB images. Y-MAP-Net simultaneously predicts depth, surface normals, human pose, semantic segmentation and generates multi-label captions, all from a single network evaluation. To achieve this, we adopt a multi-teacher, single-student training paradigm, where task-specific foundation models supervise the network’s learning, enabling it to distill their capabilities into a lightweight architecture suitable for real-time applications. Y-MAP-Net exhibits strong generalization, simplicity and computational efficiency, making it ideal for robotics and other practical scenarios. To support future research, we will release our code publicly.

[CV-7] Number it: Temporal Grounding Videos like Flipping Manga

链接: https://arxiv.org/abs/2411.10332
作者: Yongliang Wu,Xinting Hu,Yuyang Sun,Yizhou Zhou,Wenbo Zhu,Fengyun Rao,Bernt Schiele,Xu Yang
关键词-EN: Large Language Models, Video Large Language, Language Models, Large Language, made remarkable advancements
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 11 pages, 7 figures

点击查看摘要

Abstract:Video Large Language Models (Vid-LLMs) have made remarkable advancements in comprehending video content for QA dialogue. However, they struggle to extend this visual understanding to tasks requiring precise temporal localization, known as Video Temporal Grounding (VTG). To address this gap, we introduce Number-Prompt (NumPro), a novel method that empowers Vid-LLMs to bridge visual comprehension with temporal grounding by adding unique numerical identifiers to each video frame. Treating a video as a sequence of numbered frame images, NumPro transforms VTG into an intuitive process: flipping through manga panels in sequence. This allows Vid-LLMs to “read” event timelines, accurately linking visual content with corresponding temporal information. Our experiments demonstrate that NumPro significantly boosts VTG performance of top-tier Vid-LLMs without additional computational cost. Furthermore, fine-tuning on a NumPro-enhanced dataset defines a new state-of-the-art for VTG, surpassing previous top-performing methods by up to 6.9% in mIoU for moment retrieval and 8.5% in mAP for highlight detection. The code will be available at this https URL.
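
NumPro's core preprocessing step is simply stamping a frame index onto every frame. A sketch using OpenCV is shown below; the font, position, colour, and output codec are arbitrary choices rather than the paper's settings.

```python
import cv2

def number_frames(video_path, out_path):
    """Stamp a frame index onto every frame so a Vid-LLM can refer to
    moments by number. Font, position and codec are arbitrary choices."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        cv2.putText(frame, str(idx), (10, 40), cv2.FONT_HERSHEY_SIMPLEX,
                    1.2, (0, 0, 255), 3)   # red frame number, top-left corner
        writer.write(frame)
        idx += 1
    cap.release()
    writer.release()
```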

[CV-8] CNN-Based Classification of Persian Miniature Paintings from Five Renowned Schools

链接: https://arxiv.org/abs/2411.10330
作者: Mojtaba Shahi,Roozbeh Rajabi,Farnaz Masoumzadeh
关键词-EN: Persian miniature painting, Convolutional Neural Networks, computational painting analysis, painting analysis focused, classify Persian miniatures
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 20 pages, submitted to journal

点击查看摘要

Abstract:This article addresses the gap in computational painting analysis focused on Persian miniature painting, a rich cultural and artistic heritage. It introduces a novel approach using Convolutional Neural Networks (CNN) to classify Persian miniatures from five schools: Herat, Tabriz-e Avval, Shiraz-e Avval, Tabriz-e Dovvom, and Qajar. The method achieves an average accuracy of over 91%. A meticulously curated dataset captures the distinct features of each school, with a patch-based CNN approach classifying image segments independently before merging results for enhanced accuracy. This research contributes significantly to digital art analysis, providing detailed insights into the dataset, CNN architecture, training, and validation processes. It highlights the potential for future advancements in automated art analysis, bridging machine learning, art history, and digital humanities, thereby aiding the preservation and understanding of Persian cultural heritage.

[CV-9] Melanoma Detection with Uncertainty Quantification

链接: https://arxiv.org/abs/2411.10322
作者: SangHyuk Kim,Edward Gaibor,Brian Matejek,Daniel Haehn
关键词-EN: improving survival rates, Early detection, survival rates, crucial for improving, improving survival
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 5 pages, 5 figures, 3 tables, submitted to ISBI2025

点击查看摘要

Abstract:Early detection of melanoma is crucial for improving survival rates. Current detection tools often utilize data-driven machine learning methods but often overlook the full integration of multiple datasets. We combine publicly available datasets to enhance data diversity, allowing numerous experiments to train and evaluate various classifiers. We then calibrate them to minimize misdiagnoses by incorporating uncertainty quantification. Our experiments on benchmark datasets show accuracies of up to 93.2% before and 97.8% after applying uncertainty-based rejection, leading to a reduction in misdiagnoses by over 40.5%. Our code and data are publicly available, and a web-based interface for quick melanoma detection of user-supplied images is also provided.
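
The uncertainty-based rejection reported above amounts to abstaining whenever the model's confidence falls below a threshold. A generic sketch of that decision rule follows; the threshold value and the use of plain softmax confidence are assumptions, and the paper's calibration step is not reproduced here.

```python
import numpy as np

def predict_with_rejection(probs, threshold=0.9):
    """Predict the argmax class but abstain (NaN) when the maximum softmax
    confidence is below `threshold`. probs: (N, n_classes)."""
    conf = probs.max(axis=1)
    preds = probs.argmax(axis=1).astype(float)
    preds[conf < threshold] = np.nan     # NaN marks cases referred for manual review
    return preds

print(predict_with_rejection(np.array([[0.97, 0.03], [0.55, 0.45]])))   # [0. nan]
```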

[CV-10] Probabilistic Prior Driven Attention Mechanism Based on Diffusion Model for Imaging Through Atmospheric Turbulence

链接: https://arxiv.org/abs/2411.10321
作者: Guodong Sun,Qixiang Ma,Liqiang Zhang,Hongwei Wang,Zixuan Gao,Haotian Zhang
关键词-EN: Atmospheric turbulence introduces, Turbulence Removal Network, turbulence introduces severe, Atmospheric turbulence, challenging traditional image
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Atmospheric turbulence introduces severe spatial and geometric distortions, challenging traditional image restoration methods. We propose the Probabilistic Prior Turbulence Removal Network (PPTRN), which combines probabilistic diffusion-based prior modeling with Transformer-driven feature extraction to address this issue. PPTRN employs a two-stage approach: first, a latent encoder and Transformer are jointly trained on clear images to establish robust feature representations. Then, a Denoising Diffusion Probabilistic Model (DDPM) models prior distributions over latent vectors, guiding the Transformer in capturing diverse feature variations essential for restoration. A key innovation in PPTRN is the Probabilistic Prior Driven Cross Attention mechanism, which integrates the DDPM-generated prior with feature embeddings to reduce artifacts and enhance spatial coherence. Extensive experiments validate that PPTRN significantly improves restoration quality on turbulence-degraded images, setting a new benchmark in clarity and structural fidelity.

[CV-11] M3TR: Generalist HD Map Construction with Variable Map Priors

链接: https://arxiv.org/abs/2411.10316
作者: Fabian Immel,Richard Fehler,Frank Bieder,Jan-Hendrik Pauls,Christoph Stiller
关键词-EN: Autonomous vehicles require, vehicles require road, Autonomous vehicles, require road information, map
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Autonomous vehicles require road information for their operation, usually in form of HD maps. Since offline maps eventually become outdated or may only be partially available, online HD map construction methods have been proposed to infer map information from live sensor data. A key issue remains how to exploit such partial or outdated map information as a prior. We introduce M3TR (Multi-Masking Map Transformer), a generalist approach for HD map construction both with and without map priors. We address shortcomings in ground truth generation for Argoverse 2 and nuScenes and propose the first realistic scenarios with semantically diverse map priors. Examining various query designs, we use an improved method for integrating prior map elements into a HD map construction model, increasing performance by +4.3 mAP. Finally, we show that training across all prior scenarios yields a single Generalist model, whose performance is on par with previous Expert models that can handle only one specific type of map prior. M3TR thus is the first model capable of leveraging variable map priors, making it suitable for real-world deployment. Code is available at this https URL

[CV-12] Modification Takes Courage: Seamless Image Stitching via Reference-Driven Inpainting

链接: https://arxiv.org/abs/2411.10309
作者: Ziqi Xie,Xiao Lai,Weidong Zhao,Xianhui Liu,Wenlong Hou
关键词-EN: Current image stitching, produce noticeable seams, Current image, Reference-Driven Inpainting Stitcher, produce noticeable
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 17 pages, 10 figures

点击查看摘要

Abstract:Current image stitching methods often produce noticeable seams in challenging scenarios such as uneven hue and large parallax. To tackle this problem, we propose the Reference-Driven Inpainting Stitcher (RDIStitcher), which reformulates the image fusion and rectangling as a reference-based inpainting model, incorporating a larger modification fusion area and stronger modification intensity than previous methods. Furthermore, we introduce a self-supervised model training method, which enables the implementation of RDIStitcher without requiring labeled data by fine-tuning a Text-to-Image (T2I) diffusion model. Recognizing difficulties in assessing the quality of stitched images, we present the Multimodal Large Language Models (MLLMs)-based metrics, offering a new perspective on evaluating stitched image quality. Compared to the state-of-the-art (SOTA) method, extensive experiments demonstrate that our method significantly enhances content coherence and seamless transitions in the stitched images. Especially in the zero-shot experiments, our method exhibits strong generalization capabilities. Code: this https URL

[CV-13] Multidimensional Byte Pair Encoding: Shortened Sequences for Improved Visual Data Generation

链接: https://arxiv.org/abs/2411.10281
作者: Tim Elsner,Paula Usinger,Julius Nehring-Wirxel,Gregor Kobsik,Victor Czech,Yanjiang He,Isaak Lim,Leif Kobbelt
关键词-EN: Byte Pair Encoding, transformers benefit greatly, language processing, benefit greatly, greatly from text
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In language processing, transformers benefit greatly from text being condensed. This is achieved through a larger vocabulary that captures word fragments instead of plain characters. This is often done with Byte Pair Encoding. In the context of images, tokenisation of visual data is usually limited to regular grids obtained from quantisation methods, without global content awareness. Our work improves tokenisation of visual data by bringing Byte Pair Encoding from 1D to multiple dimensions, as a complementary add-on to existing compression. We achieve this through counting constellations of token pairs and replacing the most frequent token pair with a newly introduced token. The multidimensionality only increases the computation time by a factor of 2 for images, making it applicable even to large datasets like ImageNet within minutes on consumer hardware. This is a lossless preprocessing step. Our evaluation shows improved training and inference performance of transformers on visual data achieved by compressing frequent constellations of tokens: The resulting sequences are shorter, with more uniformly distributed information content, e.g. condensing empty regions in an image into single tokens. As our experiments show, these condensed sequences are easier to process. We additionally introduce a strategy to amplify this compression further by clustering the vocabulary.
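
The heart of multidimensional Byte Pair Encoding is counting adjacent token-pair constellations on a grid and repeatedly merging the most frequent one. The sketch below implements only the counting and selection step for horizontal and vertical neighbours; the merge bookkeeping and vocabulary update are omitted, and the grid in the usage line is a toy example.

```python
from collections import Counter
import numpy as np

def most_frequent_pair(grid):
    """Count horizontally and vertically adjacent token pairs in a 2D token
    grid and return the most frequent constellation; this is the selection
    step of a multidimensional BPE merge (replacement is omitted)."""
    grid = np.asarray(grid)
    pairs = Counter()
    for i in range(grid.shape[0]):
        for j in range(grid.shape[1]):
            if j + 1 < grid.shape[1]:
                pairs[("h", int(grid[i, j]), int(grid[i, j + 1]))] += 1
            if i + 1 < grid.shape[0]:
                pairs[("v", int(grid[i, j]), int(grid[i + 1, j]))] += 1
    return pairs.most_common(1)[0]

print(most_frequent_pair([[0, 0, 1], [0, 0, 1]]))   # a pair of token ids and its count
```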

[CV-14] 4DPV: 4D Pet from Videos by Coarse-to-Fine Non-Rigid Radiance Fields ACCV2024

链接: https://arxiv.org/abs/2411.10275
作者: Sergio M. de Paco,Antonio Agudo
关键词-EN: multiple RGB sequences, multiple RGB, simultaneously recover, recover the camera, camera pose
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 17th Asian Conference on Computer Vision (ACCV 2024)

点击查看摘要

Abstract:We present a coarse-to-fine neural deformation model to simultaneously recover the camera pose and the 4D reconstruction of an unknown object from multiple RGB sequences in the wild. To that end, our approach does not require any pre-built 3D template, 3D training data, or controlled illumination conditions, and can solve the problem in a self-supervised manner. Our model exploits canonical and image-variant spaces where both coarse and fine components are considered. We introduce a neural local quadratic model with spatio-temporal consistency to encode fine details that is combined with canonical embeddings in order to establish correspondences across sequences. We thoroughly validate the method on challenging scenarios with complex and real-world deformations, providing both quantitative and qualitative evaluations, an ablation study and a comparison with respect to competing approaches. Our project is available at this https URL.

[CV-15] Fill in the blanks: Rethinking Interpretability in vision

链接: https://arxiv.org/abs/2411.10273
作者: Pathirage N. Deelaka,Tharindu Wickremasinghe,Devin Y. De Silva,Lisara N. Gajaweera
关键词-EN: deep learning, deep learning models, deep learning aided, observed in contemporary, key challenge
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Model interpretability is a key challenge that has yet to align with the advancements observed in contemporary state-of-the-art deep learning models. In particular, deep learning aided vision tasks require interpretability, in order for their adoption in more specialized domains such as medical imaging. Although the field of explainable AI (XAI) developed methods for interpreting vision models along with early convolutional neural networks, recent XAI research has mainly focused on assigning attributes via saliency maps. As such, these methods are restricted to providing explanations at a sample level, and many explainability methods suffer from low adaptability across a wide range of vision models. In our work, we re-think vision-model explainability from a novel perspective, to probe the general input structure that a model has learnt during its training. To this end, we ask the question: “How would a vision model fill in a masked image?”. Experiments on standard vision datasets and pre-trained models reveal consistent patterns, which could be integrated as an additional model-agnostic explainability tool in modern machine-learning platforms. The code will be available at this https URL

[CV-16] Partial Scene Text Retrieval

链接: https://arxiv.org/abs/2411.10261
作者: Hao Wang,Minghui Liao,Zhouyi Xie,Wenyu Liu,Xiang Bai
关键词-EN: retrieval involves localizing, text retrieval involves, image gallery, partial patches, text-line instances
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted on TPAMI

点击查看摘要

Abstract:The task of partial scene text retrieval involves localizing and searching for text instances that are the same or similar to a given query text from an image gallery. However, existing methods can only handle text-line instances, leaving the problem of searching for partial patches within these text-line instances unsolved due to a lack of patch annotations in the training data. To address this issue, we propose a network that can simultaneously retrieve both text-line instances and their partial patches. Our method embeds the two types of data (query text and scene text instances) into a shared feature space and measures their cross-modal similarities. To handle partial patches, our proposed approach adopts a Multiple Instance Learning (MIL) approach to learn their similarities with query text, without requiring extra annotations. However, constructing bags, which is a standard step of conventional MIL approaches, can introduce numerous noisy samples for training, and lower inference speed. To address this issue, we propose a Ranking MIL (RankMIL) approach to adaptively filter those noisy samples. Additionally, we present a Dynamic Partial Match Algorithm (DPMA) that can directly search for the target partial patch from a text-line instance during the inference stage, without requiring bags. This greatly improves the search efficiency and the performance of retrieving partial patches. The source code and dataset are available at this https URL.

[CV-17] Visual-Linguistic Agent : Towards Collaborative Contextual Object Reasoning

链接: https://arxiv.org/abs/2411.10252
作者: Jingru Yang,Huan Yu,Yang Jingxin,Chentianye Xu,Yin Biao,Yu Sun,Shengfeng He
关键词-EN: Large Language Models, Multimodal Large Language, Large Language, reliable visual interpretation, Language Models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) excel at descriptive tasks within images but often struggle with precise object localization, a critical element for reliable visual interpretation. In contrast, traditional object detection models provide high localization accuracy but frequently generate detections lacking contextual coherence due to limited modeling of inter-object relationships. To address this fundamental limitation, we introduce the Visual-Linguistic Agent (VLA), a collaborative framework that combines the relational reasoning strengths of MLLMs with the precise localization capabilities of traditional object detectors. In the VLA paradigm, the MLLM serves as a central Linguistic Agent, working collaboratively with specialized Vision Agents for object detection and classification. The Linguistic Agent evaluates and refines detections by reasoning over spatial and contextual relationships among objects, while the classification Vision Agent offers corrective feedback to improve classification accuracy. This collaborative approach enables VLA to significantly enhance both spatial reasoning and object localization, addressing key challenges in multimodal understanding. Extensive evaluations on the COCO dataset demonstrate substantial performance improvements across multiple detection models, highlighting VLA’s potential to set a new benchmark in accurate and contextually coherent object detection.

[CV-18] Morpho-Aware Global Attention for Image Matting

链接: https://arxiv.org/abs/2411.10251
作者: Jingru Yang,Chengzhi Cao,Chentianye Xu,Zhongwei Xie,Kaixiang Huang,Yang Zhou,Shengfeng He
关键词-EN: Convolutional Neural Networks, Neural Networks, face inherent challenges, Vision Transformers, Convolutional Neural
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) face inherent challenges in image matting, particularly in preserving fine structural details. ViTs, with their global receptive field enabled by the self-attention mechanism, often lose local details such as hair strands. Conversely, CNNs, constrained by their local receptive field, rely on deeper layers to approximate global context but struggle to retain fine structures at greater depths. To overcome these limitations, we propose a novel Morpho-Aware Global Attention (MAGA) mechanism, designed to effectively capture the morphology of fine structures. MAGA employs Tetris-like convolutional patterns to align the local shapes of fine structures, ensuring optimal local correspondence while maintaining sensitivity to morphological details. The extracted local morphology information is used as query embeddings, which are projected onto global key embeddings to emphasize local details in a broader context. Subsequently, by projecting onto value embeddings, MAGA seamlessly integrates these emphasized morphological details into a unified global structure. This approach enables MAGA to simultaneously focus on local morphology and unify these details into a coherent whole, effectively preserving fine structures. Extensive experiments show that our MAGA-based ViT achieves significant performance gains, outperforming state-of-the-art methods across two benchmarks with average improvements of 4.3% in SAD and 39.5% in MSE.

[CV-19] ScribbleVS: Scribble-Supervised Medical Image Segmentation via Dynamic Competitive Pseudo Label Selection

链接: https://arxiv.org/abs/2411.10237
作者: Tao Wang,Xinlin Zhang,Yuanbin Chen,Yuanbo Zhou,Longxuan Zhao,Tao Tan,Tong Tong
关键词-EN: provide substantial support, clinical medicine, support to clinicians, precise image segmentation, provide substantial
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In clinical medicine, precise image segmentation can provide substantial support to clinicians. However, achieving such precision often requires a large amount of finely annotated data, which can be costly. Scribble annotation presents a more efficient alternative, boosting labeling efficiency. However, utilizing such minimal supervision for medical image segmentation training, especially with scribble annotations, poses significant challenges. To address these challenges, we introduce ScribbleVS, a novel framework that leverages scribble annotations. We introduce a Regional Pseudo Labels Diffusion Module to expand the scope of supervision and reduce the impact of noise present in pseudo labels. Additionally, we propose a Dynamic Competitive Selection module for enhanced refinement in selecting pseudo labels. Experiments conducted on the ACDC and MSCMRseg datasets have demonstrated promising results, achieving performance levels that even exceed those of fully supervised methodologies. The codes of this study are available at this https URL.

[CV-20] Learning Generalizable 3D Manipulation With 10 Demonstrations

链接: https://arxiv.org/abs/2411.10203
作者: Yu Ren,Yang Cong,Ronghan Chen,Jiahao Long
关键词-EN: Semantic Guided Perception, industrial automation, automation and service, manipulation skill learning, Spatial Generalized Decision
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Learning robust and generalizable manipulation skills from demonstrations remains a key challenge in robotics, with broad applications in industrial automation and service robotics. While recent imitation learning methods have achieved impressive results, they often require large amounts of demonstration data and struggle to generalize across different spatial variants. In this work, we present a novel framework that learns manipulation skills from as few as 10 demonstrations, yet still generalizes to spatial variants such as different initial object positions and camera viewpoints. Our framework consists of two key modules: Semantic Guided Perception (SGP), which constructs task-focused, spatially aware 3D point cloud representations from RGB-D inputs; and Spatial Generalized Decision (SGD), an efficient diffusion-based decision-making module that generates actions via denoising. To effectively learn generalization ability from limited data, we introduce a critical spatially equivariant training strategy that captures the spatial knowledge embedded in expert demonstrations. We validate our framework through extensive experiments on both simulation benchmarks and real-world robotic systems. Our method demonstrates a 60 percent improvement in success rates over state-of-the-art approaches on a series of challenging tasks, even with substantial variations in object poses and camera viewpoints. This work shows significant potential for advancing efficient, generalizable manipulation skill learning in real-world applications.

[CV-21] Block based Adaptive Compressive Sensing with Sampling Rate Control

链接: https://arxiv.org/abs/2411.10200
作者: Kosuke Iwama,Ryugo Morita,Jinjia Zhou
关键词-EN: exploit data redundancy, Nyquist rate, Compressive sensing, compressive sensing framework, adaptive compressive sensing
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to MMAsia2024

点击查看摘要

Abstract:Compressive sensing (CS), acquiring and reconstructing signals below the Nyquist rate, has great potential in image and video acquisition to exploit data redundancy and greatly reduce the amount of sampled data. To further reduce the sampled data while keeping the video quality, this paper explores the temporal redundancy in video CS and proposes a block based adaptive compressive sensing framework with a sampling rate (SR) control strategy. To avoid redundant compression of non-moving regions, we first incorporate moving block detection between consecutive frames, and only transmit the measurements of moving blocks. The non-moving regions are reconstructed from the previous frame. In addition, we propose a block storage system and a dynamic threshold to achieve adaptive SR allocation to each frame based on the area of moving regions and target SR for controlling the average SR within the target SR. Finally, to reduce blocking artifacts and improve reconstruction quality, we adopt a cooperative reconstruction of the moving and non-moving blocks by referring to the measurements of the non-moving blocks from the previous frame. Extensive experiments have demonstrated that this work is able to control SR and obtain better performance than existing works.
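
A simplified version of the moving-block detection that decides which blocks to sample is sketched below: blocks whose mean absolute difference to the previous frame exceeds a threshold are flagged as moving. The block size and fixed threshold are assumptions; the paper uses a dynamic threshold tied to the target sampling rate.

```python
import numpy as np

def moving_blocks(prev, curr, block=16, diff_thresh=8.0):
    """Flag blocks whose mean absolute difference to the previous frame
    exceeds a threshold; only these would be compressively sampled.
    prev, curr: grayscale frames with shapes divisible by `block`."""
    H, W = curr.shape
    mask = np.zeros((H // block, W // block), dtype=bool)
    for i in range(0, H, block):
        for j in range(0, W, block):
            d = np.abs(curr[i:i+block, j:j+block].astype(float) -
                       prev[i:i+block, j:j+block].astype(float)).mean()
            mask[i // block, j // block] = d > diff_thresh
    return mask

mask = moving_blocks(np.zeros((64, 64)), np.random.rand(64, 64) * 255)
```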

[CV-22] STLight: a Fully Convolutional Approach for Efficient Predictive Learning by Spatio-Temporal joint Processing WACV2025

链接: https://arxiv.org/abs/2411.10198
作者: Andrea Alfarano,Alberto Alfarano,Linda Friso,Andrea Bacciu,Irene Amerini,Fabrizio Silvestri
关键词-EN: self-supervised learning paradigm, predicting future frames, future frames based, Spatio-Temporal predictive Learning, paradigm that enables
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at WACV 2025 conference

点击查看摘要

Abstract:Spatio-Temporal predictive Learning is a self-supervised learning paradigm that enables models to identify spatial and temporal patterns by predicting future frames based on past frames. Traditional methods, which use recurrent neural networks to capture temporal patterns, have proven their effectiveness but come with high system complexity and computational demand. Convolutions could offer a more efficient alternative but are limited by their characteristic of treating all previous frames equally, resulting in poor temporal characterization, and by their local receptive field, limiting the capacity to capture distant correlations among frames. In this paper, we propose STLight, a novel method for spatio-temporal learning that relies solely on channel-wise and depth-wise convolutions as learnable layers. STLight overcomes the limitations of traditional convolutional approaches by rearranging spatial and temporal dimensions together, using a single convolution to mix both types of features into a comprehensive spatio-temporal patch representation. This representation is then processed in a purely convolutional framework, capable of focusing simultaneously on the interaction among near and distant patches, and subsequently allowing for efficient reconstruction of the predicted frames. Our architecture achieves state-of-the-art performance on STL benchmarks across different datasets and settings, while significantly improving computational efficiency in terms of parameters and computational FLOPs. The code is publicly available
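
STLight's fully convolutional mixing can be approximated by folding past frames into the channel dimension and applying a depth-wise convolution followed by a 1x1 channel-wise convolution. The module below is a generic sketch of that pattern, with kernel size and activation chosen arbitrarily; it is not the paper's exact block.

```python
import torch
import torch.nn as nn

class DepthwisePointwiseMixer(nn.Module):
    """Past frames are folded into the channel axis and mixed with a
    depth-wise convolution (spatial mixing) followed by a 1x1 channel-wise
    convolution (temporal/channel mixing). Generic sketch, not STLight's block."""
    def __init__(self, t_in=4, channels=3, kernel=7):
        super().__init__()
        c = t_in * channels
        self.depthwise = nn.Conv2d(c, c, kernel, padding=kernel // 2, groups=c)
        self.pointwise = nn.Conv2d(c, c, 1)
        self.act = nn.GELU()

    def forward(self, x):                 # x: (B, T*C, H, W)
        return self.pointwise(self.act(self.depthwise(x)))

y = DepthwisePointwiseMixer()(torch.randn(1, 12, 64, 64))
```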

[CV-23] DiMoDif: Discourse Modality-information Differentiation for Audio-visual Deepfake Detection and Localization

链接: https://arxiv.org/abs/2411.10193
作者: Christos Koutlis,Symeon Papadopoulos
关键词-EN: posing significant threats, rapidly advanced, societal trust, technology has rapidly, integrity and societal
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Deepfake technology has rapidly advanced, posing significant threats to information integrity and societal trust. While significant progress has been made in detecting deepfakes, the simultaneous manipulation of audio and visual modalities, sometimes at small parts but still altering the meaning, presents a more challenging detection scenario. We present a novel audio-visual deepfake detection framework that leverages the inter-modality differences in machine perception of speech, based on the assumption that in real samples - in contrast to deepfakes - visual and audio signals coincide in terms of information. Our framework leverages features from deep networks that specialize in video and audio speech recognition to spot frame-level cross-modal incongruities, and in that way to temporally localize the deepfake forgery. To this end, DiMoDif employs a Transformer encoder-based architecture with a feature pyramid scheme and local attention, and optimizes the detection model through a composite loss function accounting for frame-level detections and fake intervals localization. DiMoDif outperforms the state-of-the-art on the Temporal Forgery Localization task by +47.88% AP@0.75 on AV-Deepfake1M, and performs on-par on LAV-DF. On the Deepfake Detection task, it outperforms the state-of-the-art by +30.5% AUC on AV-Deepfake1M, +2.8% AUC on FakeAVCeleb, and performs on-par on LAV-DF. Code available at this https URL.

[CV-24] NeISF: Neural Incident Stokes Field for Polarized Inverse Rendering of Conductors and Dielectrics

链接: https://arxiv.org/abs/2411.10189
作者: Chenhao Li,Taishi Ono,Takeshi Uemori,Sho Nitta,Hajime Mihara,Alexander Gatto,Hajime Nagahara,Yusuke Moriuchi
关键词-EN: greatly improved shape, Recent inverse rendering, Recent inverse, utilizing polarization cues, improved shape
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recent inverse rendering methods have greatly improved shape, material, and illumination reconstruction by utilizing polarization cues. However, existing methods only support dielectrics, ignoring conductors that are found everywhere in life. Since conductors and dielectrics have different reflection properties, using previous conductor methods will lead to obvious errors. In addition, conductors are glossy, which may cause strong specular reflection and is hard to reconstruct. To solve the above issues, we propose NeISF++, an inverse rendering pipeline that supports conductors and dielectrics. The key ingredient for our proposal is a general pBRDF that describes both conductors and dielectrics. As for the strong specular reflection problem, we propose a novel geometry initialization method using DoLP images. This physical cue is invariant to intensities and thus robust to strong specular reflections. Experimental results on our synthetic and real datasets show that our method surpasses the existing polarized inverse rendering methods for geometry and material decomposition as well as downstream tasks like relighting.
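
【代码示意】: 摘要提到用 DoLP(线偏振度)图像做几何初始化,因为该物理量对光强不敏感、对强镜面反射更稳健。下面给出由 0/45/90/135 度偏振图像计算线偏振 Stokes 分量与 DoLP 的标准公式草图,仅作原理说明,与论文的具体管线无关。

```python
import numpy as np

def stokes_from_polarizer_images(i0, i45, i90, i135):
    """由 0/45/90/135 度偏振图像估计线偏振 Stokes 分量(标准公式)。"""
    s0 = 0.5 * (i0 + i45 + i90 + i135)
    s1 = i0 - i90
    s2 = i45 - i135
    return s0, s1, s2

def dolp(s0, s1, s2, eps=1e-8):
    """线偏振度 DoLP = sqrt(S1^2 + S2^2) / S0,对光强缩放不敏感,
    因此可作为对强镜面反射更稳健的几何初始化线索(与摘要动机一致)。"""
    return np.sqrt(s1**2 + s2**2) / (s0 + eps)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    imgs = [rng.random((4, 4)) for _ in range(4)]
    s0, s1, s2 = stokes_from_polarizer_images(*imgs)
    print(dolp(s0, s1, s2))
```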

[CV-25] Try-On-Adapter: A Simple and Flexible Try-On Paradigm

链接: https://arxiv.org/abs/2411.10187
作者: Hanzhong Guo,Jianfeng Zhang,Cheng Zou,Jun Li,Meng Wang,Ruxue Wen,Pingzhong Tang,Jingdong Chen,Ming Yang
关键词-EN: providing significant research, dressed person conditioned, Image-based virtual try-on, generate realistic images, naturally dressed person
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Image virtual try-on, 7 pages, 3 figures

点击查看摘要

Abstract:Image-based virtual try-on, widely used in online shopping, aims to generate images of a naturally dressed person conditioned on certain garments, providing significant research and commercial potential. A key challenge of try-on is to generate realistic images of the model wearing the garments while preserving the details of the garments. Previous methods focus on masking certain parts of the original model’s standing image, and then inpainting the masked areas to generate realistic images of the model wearing corresponding reference garments, which treat the try-on task as an inpainting task. However, such implementations require the user to provide a complete, high-quality standing image, which is user-unfriendly in practical applications. In this paper, we propose Try-On-Adapter (TOA), an outpainting paradigm that differs from the existing inpainting paradigm. Our TOA can preserve the given face and garment, naturally imagine the remaining parts of the image, and provide flexible control ability with various conditions, e.g., garment properties and human pose. In the experiments, TOA shows excellent performance on the virtual try-on task even given relatively low-quality face and garment images in qualitative comparisons. Additionally, TOA achieves the state-of-the-art performance of FID scores 5.56 and 7.23 for paired and unpaired on the VITON-HD dataset in quantitative comparisons.

[CV-26] Efficient Progressive Image Compression with Variance-aware Masking WACV2025

链接: https://arxiv.org/abs/2411.10185
作者: Alberto Presta,Enzo Tartaglione,Attilio Fiandrotti,Marco Grangetto,Pamela Cosman
关键词-EN: Learned progressive image, progressive image compression, Learned progressive, progressive image, image compression
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages. Accepted at WACV 2025

点击查看摘要

Abstract:Learned progressive image compression is gaining momentum as it allows improved image reconstruction as more bits are decoded at the receiver. We propose a progressive image compression method in which an image is first represented as a pair of base-quality and top-quality latent representations. Next, a residual latent representation is encoded as the element-wise difference between the top and base representations. Our scheme enables progressive image compression with element-wise granularity by introducing a masking system that ranks each element of the residual latent representation from most to least important, dividing it into complementary components, which can be transmitted separately to the decoder in order to obtain different reconstruction quality. The masking system does not add further parameters nor complexity. At the receiver, any elements of the top latent representation excluded from the transmitted components can be independently replaced with the mean predicted by the hyperprior architecture, ensuring reliable reconstructions at any intermediate quality level. We also introduced Rate Enhancement Modules (REMs), which refine the estimation of entropy parameters using already decoded components. We obtain results competitive with state-of-the-art competitors, while significantly reducing computational complexity, decoding time, and number of parameters.
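
【代码示意】: 下面用 NumPy 演示"残差潜变量按重要性排序、按比例渐进传输,未传输元素用先验均值填充"的逐元素渐进重建机制;以幅值近似"重要性"与先验均值的取法均为假设,并非论文掩码系统的实现。

```python
import numpy as np

def progressive_reconstruct(base, top, prior_mean, keep_fraction):
    """示意:residual = top - base,按 |residual| 大小排序(此处以幅值近似"重要性"),
    只传输最重要的 keep_fraction 部分;未传输的元素用先验均值 prior_mean 填充。
    排序准则与先验来源均为假设,仅演示"逐元素渐进重建"的机制。"""
    residual = top - base
    order = np.argsort(-np.abs(residual).ravel())          # 重要性从高到低
    k = int(keep_fraction * residual.size)
    mask = np.zeros(residual.size, dtype=bool)
    mask[order[:k]] = True
    mask = mask.reshape(residual.shape)
    # 接收端:传输到的元素用真实 top 值,未传输的用先验均值代替
    return np.where(mask, top, prior_mean)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base = rng.normal(size=(8, 8)); top = base + rng.normal(scale=0.5, size=(8, 8))
    prior = np.full_like(top, top.mean())
    for f in (0.25, 0.5, 1.0):
        rec = progressive_reconstruct(base, top, prior, f)
        print(f"keep {f:.0%}: MSE = {np.mean((rec - top) ** 2):.4f}")
```

随着传输比例提高,重建误差单调下降,正对应摘要所述"任意中间质量层级都能得到可靠重建"的渐进特性。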

[CV-27] Visual question answering based evaluation metrics for text-to-image generation ISCAS2024

链接: https://arxiv.org/abs/2411.10183
作者: Mizuki Miyamoto,Ryugo Morita,Jinjia Zhou
关键词-EN: received considerable attention, generated images, image generation tasks, text-guided image manipulation, input text
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to ISCAS2024

点击查看摘要

Abstract:Text-to-image generation and text-guided image manipulation have received considerable attention in the field of image generation tasks. However, the mainstream evaluation methods for these tasks have difficulty in evaluating whether all the information from the input text is accurately reflected in the generated images, and they mainly focus on evaluating the overall alignment between the input text and the generated images. This paper proposes new evaluation metrics that assess the alignment between input text and generated images for every individual object. Firstly, according to the input text, chatGPT is utilized to produce questions for the generated images. After that, we use Visual Question Answering(VQA) to measure the relevance of the generated images to the input text, which allows for a more detailed evaluation of the alignment compared to existing methods. In addition, we use Non-Reference Image Quality Assessment(NR-IQA) to evaluate not only the text-image alignment but also the quality of the generated images. Experimental results show that our proposed evaluation approach is the superior metric that can simultaneously assess finer text-image alignment and image quality while allowing for the adjustment of these ratios.

[CV-28] CART: Compositional Auto-Regressive Transformer for Image Generation CVPR2025

链接: https://arxiv.org/abs/2411.10180
作者: Siddharth Roheda
关键词-EN: enabling diverse applications, achieved remarkable advancements, virtual reality, recent years, remarkable advancements
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: under review at CVPR 2025

点击查看摘要

Abstract:In recent years, image synthesis has achieved remarkable advancements, enabling diverse applications in content creation, virtual reality, and beyond. We introduce a novel approach to image generation using Auto-Regressive (AR) modeling, which leverages a next-detail prediction strategy for enhanced fidelity and scalability. While AR models have achieved transformative success in language modeling, replicating this success in vision tasks has presented unique challenges due to the inherent spatial dependencies in images. Our proposed method addresses these challenges by iteratively adding finer details to an image compositionally, constructing it as a hierarchical combination of base and detail image factors. This strategy is shown to be more effective than the conventional next-token prediction and even surpasses the state-of-the-art next-scale prediction approaches. A key advantage of this method is its scalability to higher resolutions without requiring full model retraining, making it a versatile solution for high-resolution image generation.
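
【代码示意】: 摘要的核心是把图像构造为"基底 + 逐级细节"的层级组合并按 next-detail 顺序逐步细化。下面用类似拉普拉斯金字塔的分解/重建给出一个朴素近似,仅说明该分解思想,并非论文的自回归生成模型。

```python
import torch
import torch.nn.functional as F

def base_detail_decompose(image, num_levels=3):
    """示意:把图像拆成一个低分辨率"基底"与逐级"细节"残差,
    重建时从基底出发逐级加回细节(类似拉普拉斯金字塔)。"""
    details, current = [], image
    for _ in range(num_levels):
        down = F.avg_pool2d(current, 2)
        up = F.interpolate(down, scale_factor=2, mode="bilinear", align_corners=False)
        details.append(current - up)     # 当前尺度的细节因子
        current = down                   # 继续向更粗尺度分解
    return current, details              # current 即最粗的基底

def reconstruct(base, details):
    """从基底开始,由粗到细依次加回细节(对应"next-detail"的逐级细化顺序)。"""
    current = base
    for detail in reversed(details):
        current = F.interpolate(current, scale_factor=2, mode="bilinear",
                                align_corners=False) + detail
    return current

if __name__ == "__main__":
    x = torch.rand(1, 3, 64, 64)
    base, details = base_detail_decompose(x)
    print("重建误差:", (reconstruct(base, details) - x).abs().max().item())  # 接近 0
```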

[CV-29] SEAGULL: No-reference Image Quality Assessment for Regions of Interest via Vision-Language Instruction Tuning

链接: https://arxiv.org/abs/2411.10161
作者: Zewen Chen,Juan Wang,Wen Wang,Sunhan Xu,Hang Xiong,Yun Zeng,Jian Guo,Shuxun Wang,Chunfeng Yuan,Bing Li,Weiming Hu
关键词-EN: Regions of Interest, Existing Image Quality, Existing Image, works explore quality, methods achieve remarkable
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Existing Image Quality Assessment (IQA) methods achieve remarkable success in analyzing quality for overall image, but few works explore quality analysis for Regions of Interest (ROIs). The quality analysis of ROIs can provide fine-grained guidance for image quality improvement and is crucial for scenarios focusing on region-level quality. This paper proposes a novel network, SEAGULL, which can SEe and Assess ROIs quality with GUidance from a Large vision-Language model. SEAGULL incorporates a vision-language model (VLM), masks generated by Segment Anything Model (SAM) to specify ROIs, and a meticulously designed Mask-based Feature Extractor (MFE) to extract global and local tokens for specified ROIs, enabling accurate fine-grained IQA for ROIs. Moreover, this paper constructs two ROI-based IQA datasets, SEAGULL-100w and SEAGULL-3k, for training and evaluating ROI-based IQA. SEAGULL-100w comprises about 100w synthetic distortion images with 33 million ROIs for pre-training to improve the model’s ability of regional quality perception, and SEAGULL-3k contains about 3k authentic distortion ROIs to enhance the model’s ability to perceive real world distortions. After pre-training on SEAGULL-100w and fine-tuning on SEAGULL-3k, SEAGULL shows remarkable performance on fine-grained ROI quality assessment. Code and datasets are publicly available at this https URL.

[CV-30] Outliers resistant image classification by anomaly detection

链接: https://arxiv.org/abs/2411.10150
作者: Anton Sergeev,Victor Minchenkov,Aleksei Soldatov,Vasiliy Kakurin,Yaroslav Mazikov
关键词-EN: manual assembly processes, processes in production, automatic monitoring, monitoring of manual, including computer vision
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 19 pages, in Russian

点击查看摘要

Abstract:Various technologies, including computer vision models, are employed for the automatic monitoring of manual assembly processes in production. These models detect and classify events such as the presence of components in an assembly area or the connection of components. A major challenge with detection and classification algorithms is their susceptibility to variations in environmental conditions and unpredictable behavior when processing objects that are not included in the training dataset. As it is impractical to add all possible subjects in the training sample, an alternative solution is necessary. This study proposes a model that simultaneously performs classification and anomaly detection, employing metric learning to generate vector representations of images in a multidimensional space, followed by classification using cross-entropy. For experimentation, a dataset of over 327,000 images was prepared. Experiments were conducted with various computer vision model architectures, and the outcomes of each approach were compared.

[CV-31] Matrix-Valued LogSumExp Approximation for Colour Morphology

链接: https://arxiv.org/abs/2411.10141
作者: Marvin Kahra,Michael Breuß,Andreas Kleefeld,Martin Welk
关键词-EN: Mathematical morphology, image processing, window that moves, change certain pixels, part of image
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 42 pages, 10 figures, to be submitted in JMIV

点击查看摘要

Abstract:Mathematical morphology is a part of image processing that uses a window that moves across the image to change certain pixels according to certain operations. The concepts of supremum and infimum play a crucial role here, but it proves challenging to define them generally for higher-dimensional data, such as colour representations. Numerous approaches have therefore been taken to solve this problem with certain compromises. In this paper we will analyse the construction of a new approach, which we have already presented experimentally in paper [Kahra, M., Breuß, M., Kleefeld, A., Welk, M., DGMM 2024, pp. 325-337]. This is based on a method by Burgeth and Kleefeld [Burgeth, B., Kleefeld, A., ISMM 2013, pp. 243-254], who regard the colours as symmetric 2\times2 matrices and compare them by means of the Loewner order in a bi-cone through different suprema. However, we will replace the supremum with the LogExp approximation for the maximum instead. This allows us to transfer the associativity of the dilation from the one-dimensional case to the higher-dimensional case. In addition, we will investigate the minimality property and specify a relaxation to ensure that our approach is continuously dependent on the input data.
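
【代码示意】: 摘要用 LogSumExp(LogExp)近似替代对称 2×2 矩阵(颜色)的上确界。下面的草图用矩阵指数/对数实现 (1/s)·log(Σ exp(s·M_i)),并数值验证其在 Loewner 序意义下不小于每个输入矩阵;参数 s 为假设值,具体构造请以论文为准。

```python
import numpy as np
from scipy.linalg import expm, logm

def logsumexp_max(matrices, s=50.0):
    """对一组对称 2x2 矩阵(摘要中把颜色视为这样的矩阵)用 LogSumExp 近似其"最大值":
    M_max ≈ (1/s) * log( sum_i exp(s * M_i) ),s 越大越接近 Loewner 意义下的上确界。
    由于矩阵对数是算子单调的,该结果在 Loewner 序下不小于每个 M_i。"""
    acc = sum(expm(s * m) for m in matrices)
    return np.real(logm(acc)) / s

if __name__ == "__main__":
    a = np.array([[1.0, 0.2], [0.2, 0.5]])
    b = np.array([[0.3, 0.0], [0.0, 1.2]])
    approx = logsumexp_max([a, b])
    print(approx)
    # 验证上界性质:approx - a 与 approx - b 的特征值应非负(允许微小数值误差)
    for m in (a, b):
        print(np.linalg.eigvalsh(approx - m))
```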

[CV-32] CoSAM: Self-Correcting SAM for Domain Generalization in 2D Medical Image Segmentation

链接: https://arxiv.org/abs/2411.10136
作者: Yihang Fu,Ziyang Chen,Yiwen Ye,Xingliang Lei,Zhisong Wang,Yong Xia
关键词-EN: variations in imaging, imaging protocols, protocols and scanners, exhibit distribution shifts, medical image segmentation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Medical images often exhibit distribution shifts due to variations in imaging protocols and scanners across different medical centers. Domain Generalization (DG) methods aim to train models on source domains that can generalize to unseen target domains. Recently, the segment anything model (SAM) has demonstrated strong generalization capabilities due to its prompt-based design, and has gained significant attention in image segmentation tasks. Existing SAM-based approaches attempt to address the need for manual prompts by introducing prompt generators that automatically generate these prompts. However, we argue that auto-generated prompts may not be sufficiently accurate under distribution shifts, potentially leading to incorrect predictions that still require manual verification and correction by clinicians. To address this challenge, we propose a method for 2D medical image segmentation called Self-Correcting SAM (CoSAM). Our approach begins by generating coarse masks using SAM in a prompt-free manner, providing prior prompts for the subsequent stages, and eliminating the need for prompt generators. To automatically refine these coarse masks, we introduce a generalized error decoder that simulates the correction process typically performed by clinicians. Furthermore, we generate diverse prompts as feedback based on the corrected masks, which are used to iteratively refine the predictions within a self-correcting loop, enhancing the generalization performance of our model. Extensive experiments on two medical image segmentation benchmarks across multiple scenarios demonstrate the superiority of CoSAM over state-of-the-art SAM-based methods.

[CV-33] Efficient Density Control for 3D Gaussian Splatting

链接: https://arxiv.org/abs/2411.10133
作者: Xiaobin Deng,Changyu Diao,Min Li,Ruohan Yu,Duanqing Xu
关键词-EN: balancing advanced rendering, Gaussian Splatting, view synthesis, balancing advanced, real-time performance
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) excels in novel view synthesis, balancing advanced rendering quality with real-time performance. However, in trained scenes, a large number of Gaussians with low opacity significantly increase rendering costs. This issue arises due to flaws in the split and clone operations during the densification process, which lead to extensive Gaussian overlap and subsequent opacity reduction. To enhance the efficiency of Gaussian utilization, we improve the adaptive density control of 3DGS. First, we introduce a more efficient long-axis split operation to replace the original clone and split, which mitigates Gaussian overlap and improves densification efficiency. Second, we propose a simple adaptive pruning technique to reduce the number of low-opacity Gaussians. Finally, by dynamically lowering the splitting threshold and applying importance weighting, the efficiency of Gaussian utilization is further improved. We evaluate our proposed method on various challenging real-world datasets. Experimental results show that our Efficient Density Control (EDC) can enhance both the rendering speed and quality.
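
【代码示意】: 下面的草图演示"按不透明度阈值 + 重要性加权分数对高斯做自适应剪枝"的一种简单实现;阈值、保留比例与"重要性"的定义均为假设,仅说明低不透明度高斯剪枝的机制,并非论文的 EDC 算法本身。

```python
import torch

def prune_low_opacity(opacities, importances, base_threshold=0.01, keep_ratio=0.9):
    """示意:先按固定不透明度阈值筛掉过低者,再按"重要性加权"的分数
    保留前 keep_ratio 的高斯,返回最终保留掩码。"""
    score = opacities * importances                      # 重要性加权分数
    keep = opacities > base_threshold                    # 过滤低不透明度高斯
    k = int(keep_ratio * opacities.numel())
    topk = torch.zeros_like(keep)
    topk[torch.topk(score, k).indices] = True
    return keep & topk                                   # 最终保留的高斯掩码

if __name__ == "__main__":
    torch.manual_seed(0)
    opac = torch.rand(10000)          # 每个高斯的不透明度
    imp = torch.rand(10000)           # 例如累积梯度/渲染贡献度,作为重要性(假设)
    mask = prune_low_opacity(opac, imp)
    print(f"kept {mask.float().mean():.1%} of Gaussians")
```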

[CV-34] Towards Multi-View Consistent Style Transfer with One-Step Diffusion via Vision Conditioning ECCV2024

链接: https://arxiv.org/abs/2411.10130
作者: Yushen Zuo,Jun Xiao,Kin-Chung Chan,Rongkang Dong,Cuixin Yang,Zongqi He,Hao Xie,Kin-Man Lam
关键词-EN: increasingly attractive topic, style transfer, increasingly attractive, attractive topic, transfer
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by ECCV 2024 AI for Visual Arts Workshop and Challenges, 18 pages, 7 figures

点击查看摘要

Abstract:The stylization of 3D scenes is an increasingly attractive topic in 3D vision. Although image style transfer has been extensively researched with promising results, directly applying 2D style transfer methods to 3D scenes often fails to preserve the structural and multi-view properties of 3D environments, resulting in unpleasant distortions in images from different viewpoints. To address these issues, we leverage the remarkable generative prior of diffusion-based models and propose a novel style transfer method, OSDiffST, based on a pre-trained one-step diffusion model (i.e., SD-Turbo) for rendering diverse styles in multi-view images of 3D scenes. To efficiently adapt the pre-trained model for multi-view style transfer on small datasets, we introduce a vision condition module to extract style information from the reference style image to serve as conditional input for the diffusion model and employ LoRA in diffusion model for adaptation. Additionally, we consider color distribution alignment and structural similarity between the stylized and content images using two specific loss functions. As a result, our method effectively preserves the structural information and multi-view consistency in stylized images without any 3D information. Experiments show that our method surpasses other promising style transfer methods in synthesizing various styles for multi-view images of 3D scenes. Stylized images from different viewpoints generated by our method achieve superior visual quality, with better structural integrity and less distortion. The source code is available at this https URL.

[CV-35] CorrCLIP: Reconstructing Correlations in CLIP with Off-the-Shelf Foundation Models for Open-Vocabulary Semantic Segmentation

链接: https://arxiv.org/abs/2411.10086
作者: Dengke Zhang,Fagui Liu,Quan Tang
关键词-EN: assign semantic labels, Open-vocabulary semantic segmentation, set of categories, semantic segmentation aims, aims to assign
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Open-vocabulary semantic segmentation aims to assign semantic labels to each pixel without relying on a predefined set of categories. Contrastive Language-Image Pre-training (CLIP) demonstrates outstanding zero-shot classification capabilities but struggles with the pixel-wise segmentation task as the captured inter-patch correlations correspond to no specific visual concepts. Despite previous CLIP-based works improving inter-patch correlations by self-self attention, they still face the inherent limitation that image patches tend to have high similarity to outlier ones. In this work, we introduce CorrCLIP, a training-free approach for open-vocabulary semantic segmentation, which reconstructs significantly coherent inter-patch correlations utilizing foundation models. Specifically, it employs the Segment Anything Model (SAM) to define the scope of patch interactions, ensuring that patches interact only with semantically similar ones. Furthermore, CorrCLIP obtains an understanding of an image’s semantic layout via self-supervised models to determine concrete similarity values between image patches, which addresses the similarity irregularity problem caused by the aforementioned restricted patch interaction regime. Finally, CorrCLIP reuses the region masks produced by SAM to update the segmentation map. As a training-free method, CorrCLIP achieves a notable improvement across eight challenging benchmarks regarding the averaged mean Intersection over Union, boosting it from 44.4% to 51.0%.

[CV-36] Influence of Depth Camera Noise Models on Respiration Estimation

链接: https://arxiv.org/abs/2411.10081
作者: Maurice Rohr,Sebastian Dill
关键词-EN: Depth cameras, capturing vital signs, interesting modality, modality for capturing, vital signs
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Poster Prague 2023 Conference, 4 pages

点击查看摘要

Abstract:Depth cameras are an interesting modality for capturing vital signs such as respiratory rate. Plenty of approaches exist to extract vital signs in a controlled setting, but in order to apply them more flexibly, for example in multi-camera settings, a simulated environment is needed to generate enough data for training and testing of new algorithms. We show first results of a 3D-rendering simulation pipeline that focuses on different noise models in order to generate realistic, depth-camera based respiratory signals using both synthetic and real respiratory signals as a baseline. While most noise can be accurately modelled as Gaussian in this context, we can show that once the available image resolution becomes too low, the differences between different noise models surface.

[CV-37] Uncertainty-Weighted Mutual Distillation for Multi-View Fusion

链接: https://arxiv.org/abs/2411.10077
作者: Jiwoong Yang,Haejun Chung,Ikbeom Jang
关键词-EN: effectively leveraging images, leveraging images captured, angles and locations, effectively leveraging, leveraging images
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Multi-view learning often faces challenges in effectively leveraging images captured from different angles and locations. This challenge is particularly pronounced when addressing inconsistencies and uncertainties between views. In this paper, we propose a novel Multi-View Uncertainty-Weighted Mutual Distillation (MV-UWMD) method. Our method enhances prediction consistency by performing hierarchical mutual distillation across all possible view combinations, including single-view, partial multi-view, and full multi-view predictions. This introduces an uncertainty-based weighting mechanism through mutual distillation, allowing effective exploitation of unique information from each view while mitigating the impact of uncertain predictions. We extend a CNN-Transformer hybrid architecture to facilitate robust feature learning and integration across multiple view combinations. We conducted extensive experiments using a large, unstructured dataset captured from diverse, non-fixed viewpoints. The results demonstrate that MV-UWMD improves prediction accuracy and consistency compared to existing multi-view learning approaches.
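
【代码示意】: 以下 PyTorch 草图给出"不确定性加权互蒸馏"的一种常见实现:两组预测互为师生做双向 KL 蒸馏,并用教师预测的熵作为样本级权重(熵越大、不确定性越高,权重越小);具体加权形式为假设,并非论文 MV-UWMD 的实现。

```python
import torch
import torch.nn.functional as F

def uncertainty_weighted_distillation(logits_a, logits_b, temperature=2.0):
    """示意:双向(互)蒸馏 + 不确定性加权。加权形式为假设,仅展示机制。"""
    def kd(student, teacher):
        p_t = F.softmax(teacher / temperature, dim=-1)
        entropy = -(p_t * p_t.clamp_min(1e-8).log()).sum(-1)          # 教师不确定性
        weight = torch.exp(-entropy)                                   # 熵越大权重越小
        kl = F.kl_div(F.log_softmax(student / temperature, dim=-1),
                      p_t, reduction="none").sum(-1)
        return (weight * kl).mean() * temperature ** 2
    # 两个视角组合互为师生,取双向蒸馏损失之和
    return kd(logits_a, logits_b.detach()) + kd(logits_b, logits_a.detach())

if __name__ == "__main__":
    a, b = torch.randn(8, 5, requires_grad=True), torch.randn(8, 5, requires_grad=True)
    loss = uncertainty_weighted_distillation(a, b)
    loss.backward()
    print(float(loss))
```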

[CV-38] Improving the accuracy of automated labeling of specimen images datasets via a confidence-based process

链接: https://arxiv.org/abs/2411.10074
作者: Quentin Bateux,Jonathan Koss,Patrick W. Sweeney,Erika Edwards,Nelson Rios,Aaron M. Dollar
关键词-EN: natural history collections, imagery and metadata, digitization of natural, natural history, history collections
类目: Computer Vision and Pattern Recognition (cs.CV); Populations and Evolution (q-bio.PE)
*备注:

点击查看摘要

Abstract:The digitization of natural history collections over the past three decades has unlocked a treasure trove of specimen imagery and metadata. There is great interest in making this data more useful by further labeling it with additional trait data, and modern deep learning techniques utilizing convolutional neural nets (CNNs) and similar networks show particular promise to reduce the amount of required manual labeling by human experts, making the process much faster and less expensive. However, in most cases, the accuracy of these approaches is too low for reliable utilization of the automatic labeling, typically in the range of 80-85% accuracy. In this paper, we present and validate an approach that can greatly improve this accuracy, essentially by examining the confidence that the network has in the generated label as well as utilizing a user-defined threshold to reject labels that fall below a chosen level. We demonstrate that a naive model that produced 86% initial accuracy can achieve improved performance - over 95% accuracy (rejecting about 40% of the labels) or over 99% accuracy (rejecting about 65%) by selecting higher confidence thresholds. This gives flexibility to adapt existing models to the statistical requirements of various types of research and has the potential to move these automatic labeling approaches from being unusably inaccurate to being an invaluable new tool. After validating the approach in a number of ways, we annotate the reproductive state of a large dataset of over 600,000 herbarium specimens. The analysis of the results points at under-investigated correlations as well as general alignment with known trends. By sharing this new dataset alongside this work, we want to allow ecologists to gather insights for their own research questions, at their chosen point of accuracy/coverage trade-off.
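
【代码示意】: 摘要的核心做法是"只接受置信度高于阈值的自动标注,以覆盖率换准确率"。下面用随机模拟的一个约 86% 基础准确率的分类器,演示不同阈值下准确率/覆盖率的权衡曲线;数据与置信度分布均为模拟假设,数值不对应论文结果。

```python
import numpy as np

def accuracy_coverage_curve(confidences, predictions, labels, thresholds):
    """只接受置信度 >= 阈值的自动标注,统计不同阈值下的准确率与覆盖率。"""
    results = []
    for t in thresholds:
        accepted = confidences >= t
        coverage = accepted.mean()
        acc = (predictions[accepted] == labels[accepted]).mean() if accepted.any() else float("nan")
        results.append((t, coverage, acc))
    return results

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, n_cls = 5000, 10
    labels = rng.integers(0, n_cls, n)
    # 模拟一个"基础准确率约 86%"的分类器:预测正确时置信度整体更高
    correct = rng.random(n) < 0.86
    predictions = np.where(correct, labels, (labels + rng.integers(1, n_cls, n)) % n_cls)
    confidences = np.where(correct, rng.beta(8, 2, n), rng.beta(3, 4, n))
    for t, cov, acc in accuracy_coverage_curve(confidences, predictions, labels, [0.0, 0.7, 0.9]):
        print(f"threshold={t:.1f}  coverage={cov:.1%}  accuracy={acc:.1%}")
```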

[CV-39] Step-wise Distribution Alignment Guided Style Prompt Tuning for Source-free Cross-domain Few-shot Learning

链接: https://arxiv.org/abs/2411.10070
作者: Huali Xu,Yongxiang Liu,Li Liu,Shuaifeng Zhi,Shuzhou Sun,Tianpeng Liu,MingMing Cheng
关键词-EN: develop source-domain training, source-domain training strategies, enhance model transferability, cross-domain few-shot learning, training strategies
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 15 pages, 12 figures, 7 tables

点击查看摘要

Abstract:Existing cross-domain few-shot learning (CDFSL) methods, which develop source-domain training strategies to enhance model transferability, face challenges with large-scale pre-trained models (LMs) due to inaccessible source data and training strategies. Moreover, fine-tuning LMs for CDFSL demands substantial computational resources, limiting practicality. This paper addresses the source-free CDFSL (SF-CDFSL) problem, tackling few-shot learning (FSL) in the target domain using only pre-trained models and a few target samples without source data or strategies. To overcome the challenge of inaccessible source data, this paper introduces Step-wise Distribution Alignment Guided Style Prompt Tuning (StepSPT), which implicitly narrows domain gaps through prediction distribution optimization. StepSPT proposes a style prompt to align target samples with the desired distribution and adopts a dual-phase optimization process. In the external process, a step-wise distribution alignment strategy factorizes prediction distribution optimization into a multi-step alignment problem to tune the style prompt. In the internal process, the classifier is updated using standard cross-entropy loss. Evaluations on five datasets demonstrate that StepSPT outperforms existing prompt tuning-based methods and SOTAs. Ablation studies further verify its effectiveness. Code will be made publicly available at this https URL.

[CV-40] Diachronic Document Dataset for Semantic Layout Analysis

链接: https://arxiv.org/abs/2411.10068
作者: Thibault Clérice(ALMAnaCH),Juliette Janes(ALMAnaCH),Hugo Scheithauer,Sarah Bénière(ALMAnaCH),Florian Cafiero(PSL),Laurent Romary(ALMAnaCH, DCIS),Simon Gabay,Benoît Sagot
关键词-EN: Text Encoding Initiative, Encoding Initiative, Text Encoding, open-access dataset designed, semantic layout analysis
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We present a novel, open-access dataset designed for semantic layout analysis, built to support document recreation workflows through mapping with the Text Encoding Initiative (TEI) standard. This dataset includes 7,254 annotated pages spanning a large temporal range (1600-2024) of digitised and born-digital materials across diverse document types (magazines, papers from sciences and humanities, PhD theses, monographs, plays, administrative reports, etc.) sorted into modular subsets. By incorporating content from different periods and genres, it addresses varying layout complexities and historical changes in document structure. The modular design allows domain-specific configurations. We evaluate object detection models on this dataset, examining the impact of input size and subset-based training. Results show that a 1280-pixel input size for YOLO is optimal and that training on subsets generally benefits from incorporating them into a generic model rather than fine-tuning pre-trained weights.

[CV-41] EchoMimicV2: Towards Striking Simplified and Semi-Body Human Animation

链接: https://arxiv.org/abs/2411.10061
作者: Rang Meng,Xingyu Zhang,Yuming Li,Chenguang Ma
关键词-EN: Recent work, movement maps conditions, half-body human animation, human animation, movement maps
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recent work on human animation usually involves audio, pose, or movement map conditions, thereby achieving vivid animation quality. However, these methods often face practical challenges due to extra control conditions, cumbersome condition injection modules, or being limited to head-region driving. Hence, we ask if it is possible to achieve striking half-body human animation while simplifying unnecessary conditions. To this end, we propose a half-body human animation method, dubbed EchoMimicV2, that leverages a novel Audio-Pose Dynamic Harmonization strategy, including Pose Sampling and Audio Diffusion, to enhance half-body details, facial and gestural expressiveness, and meanwhile reduce conditions redundancy. To compensate for the scarcity of half-body data, we utilize Head Partial Attention to seamlessly accommodate headshot data into our training framework, which can be omitted during inference, providing a free lunch for animation. Furthermore, we design the Phase-specific Denoising Loss to guide motion, detail, and low-level quality for animation in specific phases, respectively. Besides, we also present a novel benchmark for evaluating the effectiveness of half-body human animation. Extensive experiments and analyses demonstrate that EchoMimicV2 surpasses existing methods in both quantitative and qualitative evaluations.

[CV-42] GSEditPro: 3D Gaussian Splatting Editing with Attention-based Progressive Localization

链接: https://arxiv.org/abs/2411.10033
作者: Yanhao Sun,RunZe Tian,Xiao Han,XinYao Liu,Yan Zhang,Kai Xu
关键词-EN: Neural Radiance Fields, Radiance Fields, Neural Radiance, text-driven generative editing, emergence of large-scale
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Pacific Graphics 2024

点击查看摘要

Abstract:With the emergence of large-scale Text-to-Image(T2I) models and implicit 3D representations like Neural Radiance Fields (NeRF), many text-driven generative editing methods based on NeRF have appeared. However, the implicit encoding of geometric and textural information poses challenges in accurately locating and controlling objects during editing. Recently, significant advancements have been made in the editing methods of 3D Gaussian Splatting, a real-time rendering technology that relies on explicit representation. However, these methods still suffer from issues including inaccurate localization and limited manipulation over editing. To tackle these challenges, we propose GSEditPro, a novel 3D scene editing framework which allows users to perform various creative and precise editing using text prompts only. Leveraging the explicit nature of the 3D Gaussian distribution, we introduce an attention-based progressive localization module to add semantic labels to each Gaussian during rendering. This enables precise localization on editing areas by classifying Gaussians based on their relevance to the editing prompts derived from cross-attention layers of the T2I model. Furthermore, we present an innovative editing optimization method based on 3D Gaussian Splatting, obtaining stable and refined editing results through the guidance of Score Distillation Sampling and pseudo ground truth. We prove the efficacy of our method through extensive experiments.

[CV-43] Toward Robust and Accurate Adversarial Camouflage Generation against Vehicle Detectors

链接: https://arxiv.org/abs/2411.10029
作者: Jiawei Zhou,Linye Lyu,Daojing He,Yu Li
关键词-EN: multi-view attack performance, widely used physical, superiority in multi-view, Adversarial camouflage, Adversarial
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 14 pages. arXiv admin note: substantial text overlap with arXiv:2402.15853

点击查看摘要

Abstract:Adversarial camouflage is a widely used physical attack against vehicle detectors for its superiority in multi-view attack performance. One promising approach involves using differentiable neural renderers to facilitate adversarial camouflage optimization through gradient back-propagation. However, existing methods often struggle to capture environmental characteristics during the rendering process or produce adversarial textures that can precisely map to the target vehicle. Moreover, these approaches neglect diverse weather conditions, reducing the efficacy of generated camouflage across varying weather scenarios. To tackle these challenges, we propose a robust and accurate camouflage generation method, namely RAUCA. The core of RAUCA is a novel neural rendering component, End-to-End Neural Renderer Plus (E2E-NRP), which can accurately optimize and project vehicle textures and render images with environmental characteristics such as lighting and weather. In addition, we integrate a multi-weather dataset for camouflage generation, leveraging the E2E-NRP to enhance the attack robustness. Experimental results on six popular object detectors show that RAUCA-final outperforms existing methods in both simulation and real-world settings.

[CV-44] Towards Utilising a Range of Neural Activations for Comprehending Representational Associations

链接: https://arxiv.org/abs/2411.10019
作者: Laura O’Mahony,Nikola S. Nikolov,David JP O’Sullivan
关键词-EN: highest direction projections, understand intermediate representations, Recent efforts, label individual neurons, examining extremal neuron
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 18 pages, 11 figures

点击查看摘要

Abstract:Recent efforts to understand intermediate representations in deep neural networks have commonly attempted to label individual neurons and combinations of neurons that make up linear directions in the latent space by examining extremal neuron activations and the highest direction projections. In this paper, we show that this approach, although yielding a good approximation for many purposes, fails to capture valuable information about the behaviour of a representation. Neural network activations are generally dense, and so a more complex, but realistic scenario is that linear directions encode information at various levels of stimulation. We hypothesise that non-extremal level activations contain complex information worth investigating, such as statistical associations, and thus may be used to locate confounding human interpretable concepts. We explore the value of studying a range of neuron activations by taking the case of mid-level output neuron activations and demonstrate on a synthetic dataset how they can inform us about aspects of representations in the penultimate layer not evident through analysing maximal activations alone. We use our findings to develop a method to curate data from mid-range logit samples for retraining to mitigate spurious correlations, or confounding concepts in the penultimate layer, on real benchmark datasets. The success of our method exemplifies the utility of inspecting non-maximal activations to extract complex relationships learned by models.

[CV-45] Efficient Depth Estimation for Unstable Stereo Camera Systems on AR Glasses

链接: https://arxiv.org/abs/2411.10013
作者: Yongfan Liu,Hyoukjun Kwon
关键词-EN: augmented reality, fundamental component, component in augmented, depth estimation, cost volume
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Stereo depth estimation is a fundamental component in augmented reality (AR) applications. Although AR applications require very low latency for real-time operation, traditional depth estimation models often rely on time-consuming preprocessing steps such as rectification to achieve high accuracy. Also, non-standard ML-operator-based algorithms such as cost volume require significant latency, which is aggravated on compute resource-constrained mobile platforms. Therefore, we develop hardware-friendly alternatives to the costly cost volume and preprocessing and design two new models based on them, MultiHeadDepth and HomoDepth. Our approach for cost volume replaces it with a new group-pointwise convolution-based operator and an approximation of cosine similarity based on layernorm and dot product. For online stereo rectification (preprocessing), we introduce a homography matrix prediction network with a rectification positional encoding (RPE), which delivers both low latency and robustness to unrectified images, eliminating the need for preprocessing. Our MultiHeadDepth, which includes optimized cost volume, provides 11.8-30.3% improvements in accuracy and 22.9-25.2% reduction in latency compared to a state-of-the-art depth estimation model for AR glasses from industry. Our HomoDepth, which includes optimized preprocessing (Homography + RPE) upon MultiHeadDepth, can process unrectified images and reduce the end-to-end latency by 44.5%. We adopt a multi-task learning framework to handle misaligned stereo inputs on HomoDepth, which reduces the AbsRel error by 10.0-24.3%. The results demonstrate the efficacy of our approaches in achieving both high model performance and low latency, which makes a step forward toward practical depth estimation on future AR devices.
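
【代码示意】: 摘要提到用 LayerNorm 与点积来近似余弦相似度以降低算子开销。下面给出按这一思路的理解写出的草图(无仿射参数的 LayerNorm 使向量零均值、单位方差,点积除以维度 d 即近似去均值后的余弦相似度),并与标准余弦相似度作对比;具体实现细节以论文为准。

```python
import torch
import torch.nn.functional as F

def layernorm_cosine(a, b):
    """示意:用 LayerNorm + 点积近似余弦相似度,避免逐对计算范数的开销。
    这是对摘要思路的一种理解,并非论文的原始实现。"""
    d = a.shape[-1]
    a_n = F.layer_norm(a, (d,))          # 零均值、单位方差 => 范数约为 sqrt(d)
    b_n = F.layer_norm(b, (d,))
    return (a_n * b_n).sum(-1) / d

if __name__ == "__main__":
    torch.manual_seed(0)
    a, b = torch.randn(4, 64), torch.randn(4, 64)
    approx = layernorm_cosine(a, b)
    exact = F.cosine_similarity(a - a.mean(-1, keepdim=True),
                                b - b.mean(-1, keepdim=True), dim=-1)
    print(torch.stack([approx, exact], dim=-1))   # 两列应非常接近
```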

[CV-46] Adaptive Non-Uniform Timestep Sampling for Diffusion Model Training

链接: https://arxiv.org/abs/2411.09998
作者: Myunsoo Kim,Donghyeon Ki,Seong-Woong Shim,Byung-Jun Lee
关键词-EN: including image generation, natural language processing, highly expressive generative, demonstrated exceptional success, expressive generative model
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:As a highly expressive generative model, diffusion models have demonstrated exceptional success across various domains, including image generation, natural language processing, and combinatorial optimization. However, as data distributions grow more complex, training these models to convergence becomes increasingly computationally intensive. While diffusion models are typically trained using uniform timestep sampling, our research shows that the variance in stochastic gradients varies significantly across timesteps, with high-variance timesteps becoming bottlenecks that hinder faster convergence. To address this issue, we introduce a non-uniform timestep sampling method that prioritizes these more critical timesteps. Our method tracks the impact of gradient updates on the objective for each timestep, adaptively selecting those most likely to minimize the objective effectively. Experimental results demonstrate that this approach not only accelerates the training process, but also leads to improved performance at convergence. Furthermore, our method shows robust performance across various datasets, scheduling strategies, and diffusion architectures, outperforming previously proposed timestep sampling and weighting heuristics that lack this degree of robustness.
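
【代码示意】: 下面给出"非均匀时间步采样"的一个通用草图:维护每个时间步损失的滑动估计,按其构造采样分布,优先采样更关键(损失/方差更大)的时间步;权重形式与平滑系数均为假设,并非论文的具体准则。

```python
import torch

class AdaptiveTimestepSampler:
    """示意:基于各时间步损失滑动平均的非均匀时间步采样器(参数均为假设)。"""
    def __init__(self, num_timesteps, ema=0.9, temperature=1.0):
        self.stats = torch.ones(num_timesteps)       # 每个时间步损失的滑动平均
        self.ema, self.temperature = ema, temperature

    def sample(self, batch_size):
        probs = torch.softmax(self.stats / self.temperature, dim=0)
        t = torch.multinomial(probs, batch_size, replacement=True)
        return t, probs[t]            # 返回时间步与其采样概率(可用于重要性加权)

    def update(self, t, losses):
        for ti, li in zip(t.tolist(), losses.detach().tolist()):
            self.stats[ti] = self.ema * self.stats[ti] + (1 - self.ema) * li

if __name__ == "__main__":
    sampler = AdaptiveTimestepSampler(num_timesteps=1000)
    t, p = sampler.sample(batch_size=8)
    fake_losses = torch.rand(8)       # 实际训练中应为各样本的扩散训练损失
    sampler.update(t, fake_losses)
    print(t, p)
```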

[CV-47] Explanation for Trajectory Planning using Multi-modal Large Language Model for Autonomous Driving ECCV2024

链接: https://arxiv.org/abs/2411.09971
作者: Shota Yamazaki,Chenyu Zhang,Takuya Nanri,Akio Shigekane,Siyuan Wang,Jo Nishiyama,Tao Chu,Kohei Yokosawa
关键词-EN: style autonomous driving, autonomous driving models, ego vehicle, style autonomous, developed recently
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: Accepted and presented at ECCV 2024 2nd Workshop on Vision-Centric Autonomous Driving (VCAD) on September 30, 2024. 13 pages, 5 figures

点击查看摘要

Abstract:End-to-end style autonomous driving models have been developed recently. These models lack interpretability of the decision-making process from perception to control of the ego vehicle, resulting in anxiety for passengers. To alleviate this, it is effective to build a model that outputs captions describing the future behaviors of the ego vehicle and the reasons behind them. However, the existing approaches generate reasoning text that inadequately reflects the future plans of the ego vehicle, because they train models to output captions using momentary control signals as inputs. In this study, we propose a reasoning model that takes future planning trajectories of the ego vehicle as inputs to solve this limitation, together with a newly collected dataset.

[CV-48] A Polarization Image Dehazing Method Based on the Principle of Physical Diffusion

链接: https://arxiv.org/abs/2411.09924
作者: Zhenjun Zhang,Lijun Tang,Hongjin Wang,Lilian Zhang,Yunze He,Yaonan Wang
关键词-EN: Computer vision, unmanned vehicles, surveillance systems, remote sensing, systems and remote
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Computer vision is increasingly used in areas such as unmanned vehicles, surveillance systems and remote sensing. However, in foggy scenarios, image degradation leads to loss of target details, which seriously affects the accuracy and effectiveness of these vision tasks. Polarized light, due to the fact that its electromagnetic waves vibrate in a specific direction, is able to resist scattering and refraction effects in complex media more effectively compared to unpolarized light. As a result, polarized light has a greater ability to maintain its polarization characteristics in complex transmission media and under long-distance imaging conditions. This property makes polarized imaging especially suitable for complex scenes such as outdoor and underwater, especially in foggy environments, where higher quality images can be obtained. Based on this advantage, we propose an innovative semi-physical polarization dehazing method that does not rely on an external light source. The method simulates the diffusion process of fog and designs a diffusion kernel that corresponds to the image blurriness caused by this diffusion. By employing spatiotemporal Fourier transforms and deconvolution operations, the method recovers the state of fog droplets prior to diffusion and the light inversion distribution of objects. This approach effectively achieves dehazing and detail enhancement of the scene.

[CV-49] mmSpyVR: Exploiting mmWave Radar for Penetrating Obstacles to Uncover Privacy Vulnerability of Virtual Reality

链接: https://arxiv.org/abs/2411.09914
作者: Luoyu Mei,Ruofeng Liu,Zhimeng Yin,Qingchuan Zhao,Wenchao Jiang,Shuai Wang,Kangjie Lu,Tian He
关键词-EN: introduces significant privacy, significant privacy risks, enhancing user experiences, Virtual reality, introduces significant
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Virtual reality (VR), while enhancing user experiences, introduces significant privacy risks. This paper reveals a novel vulnerability in VR systems that allows attackers to capture VR privacy through obstacles utilizing millimeter-wave (mmWave) signals without physical intrusion and virtual connection with the VR devices. We propose mmSpyVR, a novel attack on VR user’s privacy via mmWave radar. The mmSpyVR framework encompasses two main parts: (i) A transfer learning-based feature extraction model to achieve VR feature extraction from mmWave signal. (ii) An attention-based VR privacy spying module to spy VR privacy information from the extracted feature. The mmSpyVR demonstrates the capability to extract critical VR privacy from the mmWave signals that have penetrated through obstacles. We evaluate mmSpyVR through IRB-approved user studies. Across 22 participants engaged in four experimental scenes utilizing VR devices from three different manufacturers, our system achieves an application recognition accuracy of 98.5% and keystroke recognition accuracy of 92.6%. This newly discovered vulnerability has implications across various domains, such as cybersecurity, privacy protection, and VR technology development. We also engage with VR manufacturer Meta to discuss and explore potential mitigation strategies. Data and code are publicly available for scrutiny and research at this https URL

[CV-50] DiffFNO: Diffusion Fourier Neural Operator

链接: https://arxiv.org/abs/2411.09911
作者: Xiaoyi Liu,Hao Tang
关键词-EN: Weighted Fourier Neural, Fourier Neural Operator, Weighted Fourier, Attention-based Neural Operator, Neural Operator
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We introduce DiffFNO, a novel diffusion framework for arbitrary-scale super-resolution strengthened by a Weighted Fourier Neural Operator (WFNO). Mode Re-balancing in WFNO effectively captures critical frequency components, significantly improving the reconstruction of high-frequency image details that are crucial for super-resolution tasks. Gated Fusion Mechanism (GFM) adaptively complements WFNO’s spectral features with spatial features from an Attention-based Neural Operator (AttnNO). This enhances the network’s capability to capture both global structures and local details. Adaptive Time-Step (ATS) ODE solver, a deterministic sampling strategy, accelerates inference without sacrificing output quality by dynamically adjusting integration step sizes. Extensive experiments demonstrate that DiffFNO achieves state-of-the-art (SOTA) results, outperforming existing methods across various scaling factors by a margin of 2 to 4 dB in PSNR, including those beyond the training distribution. It also achieves this at lower inference time. Our approach sets a new standard in super-resolution, delivering both superior accuracy and computational efficiency.
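
【代码示意】: DiffFNO 建立在傅里叶神经算子(FNO)之上。下面给出通用的 FNO 谱卷积层草图(只在保留的低频模式上做可学习线性变换),帮助理解其频域建模方式;这只是简化的通用构件,并非论文中带模式重加权(Mode Re-balancing)的 WFNO 实现。

```python
import torch
import torch.nn as nn

class SpectralConv2d(nn.Module):
    """示意:FNO 的谱卷积层——在频域中仅对保留的低频模式做可学习的线性变换。"""
    def __init__(self, in_channels, out_channels, modes1, modes2):
        super().__init__()
        scale = 1.0 / (in_channels * out_channels)
        self.modes1, self.modes2 = modes1, modes2
        self.weight = nn.Parameter(
            scale * torch.randn(in_channels, out_channels, modes1, modes2, dtype=torch.cfloat))

    def forward(self, x):                               # x: (B, C, H, W)
        b, c, h, w = x.shape
        x_ft = torch.fft.rfft2(x)                       # (B, C, H, W//2+1),复数谱
        out_ft = torch.zeros(b, self.weight.shape[1], h, w // 2 + 1,
                             dtype=torch.cfloat, device=x.device)
        out_ft[:, :, :self.modes1, :self.modes2] = torch.einsum(
            "bixy,ioxy->boxy", x_ft[:, :, :self.modes1, :self.modes2], self.weight)
        return torch.fft.irfft2(out_ft, s=(h, w))

if __name__ == "__main__":
    layer = SpectralConv2d(3, 8, modes1=12, modes2=12)
    print(layer(torch.randn(2, 3, 64, 64)).shape)       # torch.Size([2, 8, 64, 64])
```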

[CV-51] Free Lunch in Pathology Foundation Model: Task-specific Model Adaptation with Concept-Guided Feature Enhancement

链接: https://arxiv.org/abs/2411.09894
作者: Yanyan Huang,Weiqin Zhao,Yihang Chen,Yu Fu,Lequan Yu
关键词-EN: medical imaging field, specific downstream tasks, foundation models, pathology foundation models, imaging field
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Whole slide image (WSI) analysis is gaining prominence within the medical imaging field. Recent advances in pathology foundation models have shown the potential to extract powerful feature representations from WSIs for downstream tasks. However, these foundation models are usually designed for general-purpose pathology image analysis and may not be optimal for specific downstream tasks or cancer types. In this work, we present Concept Anchor-guided Task-specific Feature Enhancement (CATE), an adaptable paradigm that can boost the expressivity and discriminativeness of pathology foundation models for specific downstream tasks. Based on a set of task-specific concepts derived from the pathology vision-language model with expert-designed prompts, we introduce two interconnected modules to dynamically calibrate the generic image features extracted by foundation models for certain tasks or cancer types. Specifically, we design a Concept-guided Information Bottleneck module to enhance task-relevant characteristics by maximizing the mutual information between image features and concept anchors while suppressing superfluous information. Moreover, a Concept-Feature Interference module is proposed to utilize the similarity between calibrated features and concept anchors to further generate discriminative task-specific features. The extensive experiments on public WSI datasets demonstrate that CATE significantly enhances the performance and generalizability of MIL models. Additionally, heatmap and umap visualization results also reveal the effectiveness and interpretability of CATE. The source code is available at this https URL.

[CV-52] Memory Proxy Maps for Visual Navigation

链接: https://arxiv.org/abs/2411.09893
作者: Faith Johnson,Bryan Bo Cao,Ashwin Ashok,Shubham Jain,Kristin Dana
关键词-EN: previously unseen environments, detailed environment maps, Visual navigation, navigate in previously, previously unseen
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: arXiv admin note: substantial text overlap with arXiv:2402.12498

点击查看摘要

Abstract:Visual navigation takes inspiration from humans, who navigate in previously unseen environments using vision without detailed environment maps. Inspired by this, we introduce a novel no-RL, no-graph, no-odometry approach to visual navigation using feudal learning to build a three tiered agent. Key to our approach is a memory proxy map (MPM), an intermediate representation of the environment learned in a self-supervised manner by the high-level manager agent that serves as a simplified memory, approximating what the agent has seen. We demonstrate that recording observations in this learned latent space is an effective and efficient memory proxy that can remove the need for graphs and odometry in visual navigation tasks. For the mid-level manager agent, we develop a waypoint network (WayNet) that outputs intermediate subgoals, or waypoints, imitating human waypoint selection during local navigation. For the low-level worker agent, we learn a classifier over a discrete action space that avoids local obstacles and moves the agent towards the WayNet waypoint. The resulting feudal navigation network offers a novel approach with no RL, no graph, no odometry, and no metric map; all while achieving SOTA results on the image goal navigation task.

[CV-53] Content-Aware Preserving Image Generation

链接: https://arxiv.org/abs/2411.09871
作者: Giang H. Le,Anh Q. Nguyen,Byeongkeun Kang,Yeejin Lee
关键词-EN: Remarkable progress, generative models, introduction of generative, content, Remarkable
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 35 pages, 12 figures, 1 table, journal

点击查看摘要

Abstract:Remarkable progress has been achieved in image generation with the introduction of generative models. However, precisely controlling the content in generated images remains a challenging task due to their fundamental training objective. This paper addresses this challenge by proposing a novel image generation framework explicitly designed to incorporate desired content in output images. The framework utilizes advanced encoding techniques, integrating subnetworks called content fusion and frequency encoding modules. The frequency encoding module first captures features and structures of reference images by exclusively focusing on selected frequency components. Subsequently, the content fusion module generates a content-guiding vector that encapsulates desired content features. During the image generation process, content-guiding vectors from real images are fused with projected noise vectors. This ensures the production of generated images that not only maintain consistent content from guiding images but also exhibit diverse stylistic variations. To validate the effectiveness of the proposed framework in preserving content attributes, extensive experiments are conducted on widely used benchmark datasets, including Flickr-Faces-High Quality, Animal Faces High Quality, and Large-scale Scene Understanding datasets.

[CV-54] Face De-identification: State-of-the-art Methods and Comparative Studies

链接: https://arxiv.org/abs/2411.09863
作者: Jingyi Cao,Xiangyi Chen,Bo Liu,Ming Ding,Rong Xie,Li Song,Zhu Li,Wenjun Zhang
关键词-EN: image acquisition technologies, acquisition technologies, Face de-identification, facial recognition, image acquisition
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:The widespread use of image acquisition technologies, along with advances in facial recognition, has raised serious privacy concerns. Face de-identification usually refers to the process of concealing or replacing personal identifiers, which is regarded as an effective means to protect the privacy of facial images. A significant number of methods for face de-identification have been proposed in recent years. In this survey, we provide a comprehensive review of state-of-the-art face de-identification methods, categorized into three levels: pixel-level, representation-level, and semantic-level techniques. We systematically evaluate these methods based on two key criteria, the effectiveness of privacy protection and preservation of image utility, highlighting their advantages and limitations. Our analysis includes qualitative and quantitative comparisons of the main algorithms, demonstrating that deep learning-based approaches, particularly those using Generative Adversarial Networks (GANs) and diffusion models, have achieved significant advancements in balancing privacy and utility. Experimental results reveal that while recent methods demonstrate strong privacy protection, trade-offs remain in visual fidelity and computational complexity. This survey not only summarizes the current landscape but also identifies key challenges and future research directions in face de-identification.

[CV-55] Masked Image Contrastive Learning for Efficient Visual Conceptual Pre-training

链接: https://arxiv.org/abs/2411.09858
作者: Xiaoyu Yang,Lijian Xu
关键词-EN: efficient visual conceptual, straightforward pre-training paradigm, paper proposes, proposes a scalable, scalable and straightforward
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages

点击查看摘要

Abstract:This paper proposes a scalable and straightforward pre-training paradigm for efficient visual conceptual representation called masked image contrastive learning (MiCL). Our MiCL approach is simple: we randomly mask patches to generate different views within an image and contrast them among a mini-batch of images. The core idea behind MiCL consists of two designs. First, masked tokens have the potential to significantly diminish the conceptual redundancy inherent in images, and create distinct views with substantial fine-grained differences on the semantic concept level instead of the instance level. Second, contrastive learning is adept at extracting high-level semantic conceptual features during the pre-training, circumventing the high-frequency interference and additional costs associated with image reconstruction. Importantly, MiCL learns highly semantic conceptual representations efficiently without relying on hand-crafted data augmentations or additional auxiliary modules. Empirically, MiCL demonstrates high scalability with Vision Transformers, as the ViT-L/16 can complete pre-training in 133 hours using only 4 A100 GPUs, achieving 85.8% accuracy in downstream fine-tuning tasks.
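
【代码示意】: 下面的 PyTorch 草图演示 MiCL 思路的一个最小化版本:对同一图像随机遮蔽 patch 生成两个视图,并在 mini-batch 内做标准 InfoNCE 对比学习;patch 大小、遮蔽比例与编码器均为假设,仅作机制说明,并非论文实现。

```python
import torch
import torch.nn.functional as F

def random_patch_mask(images, patch=8, mask_ratio=0.5):
    """示意:随机遮蔽图像 patch,生成同一图像的一个"视图"(被遮蔽的 patch 置零)。"""
    b, c, h, w = images.shape
    nh, nw = h // patch, w // patch
    keep = (torch.rand(b, 1, nh, nw, device=images.device) > mask_ratio).float()
    mask = keep.repeat_interleave(patch, dim=2).repeat_interleave(patch, dim=3)
    return images * mask

def info_nce(z1, z2, temperature=0.2):
    """标准 InfoNCE:同一图像两个遮蔽视图的特征互为正样本,批内其余为负样本。"""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature
    labels = torch.arange(z1.shape[0], device=z1.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

if __name__ == "__main__":
    torch.manual_seed(0)
    imgs = torch.randn(4, 3, 32, 32)
    view1, view2 = random_patch_mask(imgs), random_patch_mask(imgs)
    encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 128))
    print(float(info_nce(encoder(view1), encoder(view2))))
```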

[CV-56] Architect: Generating Vivid and Interactive 3D Scenes with Hierarchical 2D Inpainting

链接: https://arxiv.org/abs/2411.09823
作者: Yian Wang,Xiaowen Qiu,Jiageng Liu,Zhehuan Chen,Jiting Cai,Yufei Wang,Tsun-Hsuan Wang,Zhou Xian,Chuang Gan
关键词-EN: Creating large-scale interactive, Creating large-scale, development of Robotics, large-scale interactive, Robotics and Embodied
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Creating large-scale interactive 3D environments is essential for the development of Robotics and Embodied AI research. Current methods, including manual design, procedural generation, diffusion-based scene generation, and large language model (LLM) guided scene design, are hindered by limitations such as excessive human effort, reliance on predefined rules or training datasets, and limited 3D spatial reasoning ability. Since pre-trained 2D image generative models better capture scene and object configuration than LLMs, we address these challenges by introducing Architect, a generative framework that creates complex and realistic 3D embodied environments leveraging diffusion-based 2D image inpainting. In detail, we utilize foundation visual perception models to obtain each generated object from the image and leverage pre-trained depth estimation models to lift the generated 2D image to 3D space. Our pipeline is further extended to a hierarchical and iterative inpainting process to continuously generate placement of large furniture and small objects to enrich the scene. This iterative structure brings the flexibility for our method to generate or refine scenes from various starting points, such as text, floor plans, or pre-arranged environments.

[CV-57] Automatic Classification of General Movements in Newborns ML4H ALT

链接: https://arxiv.org/abs/2411.09821
作者: Daphné Chopard,Sonia Laguna,Kieran Chin-Cheong,Annika Dietz,Anna Badura,Sven Wellmann,Julia E Vogt
关键词-EN: developing nervous system, offer valuable insights, coordinated body movements, coordinated body, nervous system
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: Findings paper presented at Machine Learning for Health (ML4H) symposium 2024, December 15-16, 2024, Vancouver, Canada, 6 pages

点击查看摘要

Abstract:General movements (GMs) are spontaneous, coordinated body movements in infants that offer valuable insights into the developing nervous system. Assessed through the Prechtl GM Assessment (GMA), GMs are reliable predictors for neurodevelopmental disorders. However, GMA requires specifically trained clinicians, who are limited in number. To scale up newborn screening, there is a need for an algorithm that can automatically classify GMs from infant video recordings. This data poses challenges, including variability in recording length, device type, and setting, with each video coarsely annotated for overall movement quality. In this work, we introduce a tool for extracting features from these recordings and explore various machine learning techniques for automated GM classification.

[CV-58] Video Denoising in Fluorescence Guided Surgery

链接: https://arxiv.org/abs/2411.09798
作者: Trevor Seets,Andreas Velten
关键词-EN: Fluorescence guided surgery, delineating tissue types, promising surgical technique, Fluorescence guided, guided surgery
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Fluorescence guided surgery (FGS) is a promising surgical technique that gives surgeons a unique view of tissue that is used to guide their practice by delineating tissue types and diseased areas. As new fluorescent contrast agents are developed that have low fluorescent photon yields, it becomes increasingly important to develop computational models that allow FGS systems to maintain good video quality in real-time environments. To further complicate this task, FGS has a difficult bias noise term from laser leakage light (LLL), which represents unfiltered excitation light that can be on the order of the fluorescent signal. Most conventional video denoising methods focus on zero-mean noise and non-causal processing, both of which are violated in FGS. Luckily, in FGS a co-located reference video is often also captured, which we use to simulate the LLL and assist in the denoising process. In this work, we propose an accurate noise simulation pipeline that includes LLL and propose three baseline deep learning based algorithms for FGS video denoising.

[CV-59] NACNet: A Histology Context-aware Transformer Graph Convolution Network for Predicting Treatment Response to Neoadjuvant Chemotherapy in Triple Negative Breast Cancer

链接: https://arxiv.org/abs/2411.09766
作者: Qiang Li,George Teodoro,Yi Jiang,Jun Kong
关键词-EN: challenging task clinically, requires understanding complex, Neoadjuvant chemotherapy, understanding complex histology, triple negative breast
类目: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
*备注: This paper is accepted by Computerized Medical Imaging and Graphics (Nov 07 2024)

点击查看摘要

Abstract:Neoadjuvant chemotherapy (NAC) response prediction for triple negative breast cancer (TNBC) patients is a challenging task clinically as it requires understanding complex histology interactions within the tumor microenvironment (TME). Digital whole slide images (WSIs) capture detailed tissue information, but their giga-pixel size necessitates computational methods based on multiple instance learning, which typically analyze small, isolated image tiles without the spatial context of the TME. To address this limitation and incorporate TME spatial histology interactions in predicting NAC response for TNBC patients, we developed a histology context-aware transformer graph convolution network (NACNet). Our deep learning method identifies the histopathological labels on individual image tiles from WSIs, constructs a spatial TME graph, and represents each node with features derived from tissue texture and social network analysis. It predicts NAC response using a transformer graph convolution network model enhanced with graph isomorphism network layers. We evaluate our method with WSIs from a cohort of TNBC patients (N=105) and compare its performance with multiple state-of-the-art machine learning and deep learning models, including both graph and non-graph approaches. Our NACNet achieves 90.0% accuracy, 96.0% sensitivity, 88.0% specificity, and an AUC of 0.82, through eight-fold cross-validation, outperforming baseline models. These comprehensive experimental results suggest that NACNet holds strong potential for stratifying TNBC patients by NAC response, thereby helping to prevent overtreatment, improve patient quality of life, reduce treatment cost, and enhance clinical outcomes, marking an important advancement toward personalized breast cancer treatment.

[CV-60] Partial Multi-View Clustering via Meta-Learning and Contrastive Feature Alignment

链接: https://arxiv.org/abs/2411.09758
作者: BoHao Chen
关键词-EN: presents significant challenges, significant challenges practical, challenges practical research, practical research problem, Partial multi-view clustering
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Partial multi-view clustering (PVC) presents a significant and practical research problem for data analysis in real-world applications, especially when some views of the data are partially missing. Existing clustering methods struggle to handle incomplete views effectively, leading to suboptimal clustering performance. In this paper, we propose a novel dual optimization framework based on contrastive learning, which aims to maximize the consistency of latent features in incomplete multi-view data and improve clustering performance through deep learning models. By combining a fine-tuned Vision Transformer and k-nearest neighbors (KNN), we fill in missing views and dynamically adjust view weights using self-supervised learning and meta-learning. Experimental results demonstrate that our framework outperforms state-of-the-art clustering models on the BDGP and HW datasets, particularly in handling complex and incomplete multi-view data.

[CV-61] Analyzing the AI Nudification Application Ecosystem

链接: https://arxiv.org/abs/2411.09751
作者: Cassidy Gibson,Daniel Olszewski,Natalie Grace Brigham,Anna Crowder,Kevin R. B. Butler,Patrick Traynor,Elissa M. Redmiles,Tadayoshi Kohno
关键词-EN: clothed person, AI-based nudification applications, image subject, produce nude, source image
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
*备注: 22 pages, 5 figures, 2 tables

点击查看摘要

Abstract:Given a source image of a clothed person (an image subject), AI-based nudification applications can produce nude (undressed) images of that person. Moreover, not only do such applications exist, but there is ample evidence of the use of such applications in the real world and without the consent of an image subject. Still, despite the growing awareness of the existence of such applications and their potential to violate the rights of image subjects and cause downstream harms, there has been no systematic study of the nudification application ecosystem across multiple applications. We conduct such a study here, focusing on 20 popular and easy-to-find nudification websites. We study the positioning of these web applications (e.g., finding that most sites explicitly target the nudification of women, not all people), the features that they advertise (e.g., ranging from undressing-in-place to the rendering of image subjects in sexual positions, as well as differing user-privacy options), and their underlying monetization infrastructure (e.g., credit cards and cryptocurrencies). We believe this work will empower future, data-informed conversations – within the scientific, technical, and policy communities – on how to better protect individuals’ rights and minimize harm in the face of modern (and future) AI-based nudification applications. Content warning: This paper includes descriptions of web applications that can be used to create synthetic non-consensual explicit AI-created imagery (SNEACI). This paper also includes an artistic rendering of a user interface for such an application.

[CV-62] Adversarial Attacks Using Differentiable Rendering: A Survey

链接: https://arxiv.org/abs/2411.09749
作者: Matthew Hull,Chao Zhang,Zsolt Kira,Duen Horng Chau
关键词-EN: deep neural networks, deceive deep neural, Differentiable rendering methods, differentiable rendering capabilities, physically plausible adversarial
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Differentiable rendering methods have emerged as a promising means for generating photo-realistic and physically plausible adversarial attacks by manipulating 3D objects and scenes that can deceive deep neural networks (DNNs). Recently, differentiable rendering capabilities have evolved significantly into a diverse landscape of libraries, such as Mitsuba, PyTorch3D, and methods like Neural Radiance Fields and 3D Gaussian Splatting for solving inverse rendering problems that share conceptually similar properties commonly used to attack DNNs, such as back-propagation and optimization. However, the adversarial machine learning research community has not yet fully explored or understood such capabilities for generating attacks. Some key reasons are that researchers often have different attack goals, such as misclassification or misdetection, and use different tasks to accomplish these goals by manipulating different representations in a scene, such as the mesh or texture of an object. This survey adopts a task-oriented unifying framework that systematically summarizes common tasks, such as manipulating textures, altering illumination, and modifying 3D meshes to exploit vulnerabilities in DNNs. Our framework enables easy comparison of existing works, reveals research gaps and spotlights exciting future research directions in this rapidly evolving field. Through focusing on how these tasks enable attacks on various DNNs such as image classification, facial recognition, object detection, optical flow and depth estimation, our survey helps researchers and practitioners better understand the vulnerabilities of computer vision systems against photorealistic adversarial attacks that could threaten real-world applications.

[CV-63] On the Foundation Model for Cardiac MRI Reconstruction MICCAI

链接: https://arxiv.org/abs/2411.10403
作者: Chi Zhang,Michael Loecher,Cagan Alkan,Mahmut Yurt,Shreyas S. Vasanawala,Daniel B. Ennis
关键词-EN: cardiac magnetic resonance, machine learning, recent years, magnetic resonance, widely investigated
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: For MICCAI CMRxRecon Challenge 2024 team CardiAxs

点击查看摘要

Abstract:In recent years, machine learning (ML) based reconstruction has been widely investigated and employed in cardiac magnetic resonance (CMR) imaging. ML-based reconstructions can deliver clinically acceptable image quality under substantially accelerated scans. ML-based reconstruction, however, also requires substantial data and computational time to train the neural network, which is often optimized for a fixed acceleration rate or image contrast. In practice, imaging parameters are often tuned to best suit the diagnosis, which may differ from the training data. This can result in degraded image quality, and multiple trained networks are needed to fulfill the clinical demands. In this study, we propose a foundation model that uses adaptive unrolling, channel-shifting, and Pattern and Contrast-Prompt-UNet (PCP-UNet) to tackle the problem. In particular, the undersampled data goes through a different number of unrolled iterations according to its acceleration rate. Channel-shifting improves reconstructed data quality. The PCP-UNet is equipped with an image contrast and sampling pattern prompt. In vivo CMR experiments were performed using mixed combinations of image contrasts, acceleration rates, and (under)sampling patterns. The proposed foundation model has significantly improved image quality for a wide range of CMR protocols and outperforms the conventional ML-based method.

[CV-64] OneNet: A Channel-Wise 1D Convolutional U-Net

链接: https://arxiv.org/abs/2411.09838
作者: Sanghyun Byun,Kayvan Shah,Ayushi Gang,Christopher Apton,Jacob Song,Woo Seong Chung
关键词-EN: efficient feature extraction, computer vision architectures, computer vision, feature extraction, vision architectures leverage
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Many state-of-the-art computer vision architectures leverage U-Net for its adaptability and efficient feature extraction. However, the multi-resolution convolutional design often leads to significant computational demands, limiting deployment on edge devices. We present a streamlined alternative: a 1D convolutional encoder that retains accuracy while enhancing its suitability for edge applications. Our novel encoder architecture achieves semantic segmentation through channel-wise 1D convolutions combined with pixel-unshuffle operations. By incorporating PixelShuffle, known for improving accuracy in super-resolution tasks while reducing computational load, OneNet captures spatial relationships without requiring 2D convolutions, reducing parameters by up to 47%. Additionally, we explore a fully 1D encoder-decoder that achieves a 71% reduction in size, albeit with some accuracy loss. We benchmark our approach against U-Net variants across diverse mask-generation tasks, demonstrating that it preserves accuracy effectively. Although focused on image segmentation, this architecture is adaptable to other convolutional applications. Code for the project is available at this https URL .
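
As a rough illustration of how channel-wise 1D convolutions can stand in for 2D convolutions once spatial neighborhoods are folded into channels, the sketch below combines `nn.PixelUnshuffle`, a 1D convolutional stack over the flattened spatial axis, and `nn.PixelShuffle`. The channel counts and kernel sizes are assumptions for illustration, not the OneNet configuration.

```python
# Rough sketch of a channel-wise 1D convolution block: pixel-unshuffle folds
# spatial neighborhoods into channels, 1D convolutions mix them along the
# flattened spatial axis, and pixel-shuffle restores resolution.
import torch
import torch.nn as nn

class ChannelWise1DBlock(nn.Module):
    def __init__(self, channels=3, downscale=4, hidden=64):
        super().__init__()
        folded = channels * downscale * downscale        # channels after unshuffle
        self.down = nn.PixelUnshuffle(downscale)
        self.conv1d = nn.Sequential(
            nn.Conv1d(folded, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv1d(hidden, folded, kernel_size=3, padding=1),
        )
        self.up = nn.PixelShuffle(downscale)

    def forward(self, x):                                # x: (B, C, H, W)
        y = self.down(x)                                 # (B, C*d*d, H/d, W/d)
        b, c, h, w = y.shape
        y = self.conv1d(y.flatten(2))                    # treat H*W as the 1D axis
        return self.up(y.view(b, c, h, w))

out = ChannelWise1DBlock()(torch.randn(1, 3, 64, 64))    # -> (1, 3, 64, 64)
```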

机器学习

[LG-0] MARS: Unleashing the Power of Variance Reduction for Training Large Models

链接: https://arxiv.org/abs/2411.10438
作者: Huizhuo Yuan,Yifeng Liu,Shuang Wu,Xun Zhou,Quanquan Gu
关键词-EN: deep neural networks, Training deep neural, variance reduction, deep neural, neural networks
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 23 pages, 7 figures, 6 tables

点击查看摘要

Abstract:Training deep neural networks–and more recently, large models–demands efficient and scalable optimizers. Adaptive gradient algorithms like Adam, AdamW, and their variants have been central to this task. Despite the development of numerous variance reduction algorithms in the past decade aimed at accelerating stochastic optimization in both convex and nonconvex settings, variance reduction has not found widespread success in training deep neural networks or large language models. Consequently, it has remained a less favored approach in modern AI. In this paper, to unleash the power of variance reduction for efficient training of large models, we propose a unified optimization framework, MARS (Make vAriance Reduction Shine), which reconciles preconditioned gradient methods with variance reduction via a scaled stochastic recursive momentum technique. Within our framework, we introduce three instances of MARS that leverage preconditioned gradient updates based on AdamW, Lion, and Shampoo, respectively. We also draw a connection between our algorithms and existing optimizers. Experimental results on training GPT-2 models indicate that MARS consistently outperforms AdamW by a large margin.
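
MARS builds on a scaled stochastic recursive momentum estimator. As background only, the snippet below shows a generic recursive-momentum (STORM-style) variance-reduced step; it is not the MARS update and omits the scaling and preconditioning (AdamW-, Lion-, or Shampoo-style) that the paper introduces.

```python
# Background sketch: a generic recursive-momentum (STORM-style) variance-reduced
# step, d_t = g_t + (1 - a) * (d_{t-1} - g_prev_t), where g_prev_t is the gradient
# of the *current* mini-batch evaluated at the *previous* parameters.
import torch

def recursive_momentum_step(params, d_prev, grad_now, grad_prev_at_now, a=0.1, lr=1e-3):
    """All arguments are flat, parameter-shaped tensors; returns the new estimator d_t."""
    d = grad_now + (1.0 - a) * (d_prev - grad_prev_at_now)
    with torch.no_grad():
        params -= lr * d   # plain SGD step on the variance-reduced direction
    return d
```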

[LG-1] Private Counterfactual Retrieval With Immutable Features

链接: https://arxiv.org/abs/2411.10429
作者: Shreya Meel,Pasan Dissanayake,Mohamed Nomeir,Sanghamitra Dutta,Sennur Ulukus
关键词-EN: minimum change needed, counterfactual explanations provide, classification task, favorable class, explanations provide
类目: Information Theory (cs.IT); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:In a classification task, counterfactual explanations provide the minimum change needed for an input to be classified into a favorable class. We consider the problem of privately retrieving the exact closest counterfactual from a database of accepted samples while enforcing that certain features of the input sample cannot be changed, i.e., they are immutable. An applicant (user) whose feature vector is rejected by a machine learning model wants to retrieve the sample closest to them in the database without altering a private subset of their features, which constitutes the immutable set. While doing this, the user should keep their feature vector, immutable set and the resulting counterfactual index information-theoretically private from the institution. We refer to this as the immutable private counterfactual retrieval (I-PCR) problem, which generalizes PCR to a more practical setting. In this paper, we propose two I-PCR schemes by leveraging techniques from private information retrieval (PIR) and characterize their communication costs. Further, we quantify the information that the user learns about the database and compare it for the proposed schemes.

[LG-2] Back to Supervision: Boosting Word Boundary Detection through Frame Classification

链接: https://arxiv.org/abs/2411.10423
作者: Simone Carnemolla,Salvatore Calcagno,Simone Palazzo,Daniela Giordano
关键词-EN: speech processing tasks, Speech segmentation, speech processing, phoneme levels, levels is crucial
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Speech segmentation at both word and phoneme levels is crucial for various speech processing tasks. It significantly aids in extracting meaningful units from an utterance, thus enabling the generation of discrete elements. In this work we propose a model-agnostic framework to perform word boundary detection in a supervised manner, also employing a label augmentation technique and an output-frame selection strategy. We trained and tested on the Buckeye dataset and tested only on the TIMIT dataset, using state-of-the-art encoder models, including pre-trained solutions (Wav2Vec 2.0 and HuBERT), as well as convolutional and convolutional recurrent networks. Our method, with the HuBERT encoder, surpasses the performance of other state-of-the-art architectures, whether trained in supervised or self-supervised settings on the same datasets. Specifically, we achieved F-values of 0.8427 on the Buckeye dataset and 0.7436 on the TIMIT dataset, along with R-values of 0.8489 and 0.7807, respectively. These results establish a new state-of-the-art for both datasets. Beyond the immediate task, our approach offers a robust and efficient preprocessing method for future research in audio tokenization.

[LG-3] Multiscale Dubuc: A New Similarity Measure for Time Series

链接: https://arxiv.org/abs/2411.10418
作者: Mahsa Khazaei,Azim Ahmadzadeh,Krishna Rukmini Puthucode
关键词-EN: Longest Common Subsequence, Quantifying similarities, Dynamic Time Warping, Multiscale Dubuc Distance, time series
类目: Machine Learning (cs.LG)
*备注: 6 pages, 3 figures, IEEE Big Data 2024

点击查看摘要

Abstract:Quantifying similarities between time series in a meaningful way remains a challenge in time series analysis, despite many advances in the field. Most real-world solutions still rely on a few popular measures, such as Euclidean Distance (EuD), Longest Common Subsequence (LCSS), and Dynamic Time Warping (DTW). The strengths and weaknesses of these measures have been studied extensively, and incremental improvements have been proposed. In this study, however, we present a different similarity measure that fuses the notion of Dubuc’s variation from fractal analysis with the Intersection-over-Union (IoU) measure which is widely used in object recognition (also known as the Jaccard Index). In this proof-of-concept paper, we introduce the Multiscale Dubuc Distance (MDD) measure and prove that it is a metric, possessing desirable properties such as the triangle inequality. We use 95 datasets from the UCR Time Series Classification Archive to compare MDD’s performance with EuD, LCSS, and DTW. Our experiments show that MDD’s overall success, without any case-specific customization, is comparable to DTW with optimized window sizes per dataset. We also highlight several datasets where MDD’s performance improves significantly when its single parameter is customized. This customization serves as a powerful tool for gauging MDD’s sensitivity to noise. Lastly, we show that MDD’s running time is linear in the length of the time series, which is crucial for real-world applications involving very large datasets.

[LG-4] Framework for Co-distillation Driven Federated Learning to Address Class Imbalance in Healthcare

链接: https://arxiv.org/abs/2411.10383
作者: Suraj Racha,Shubh Gupta,Humaira Firdowse,Aastik Solanki,Ganesh Ramakrishnan,Kshitij S. Jadhav
关键词-EN: enabling collaborative model, retaining data privacy, collaborative model training, distributed machine learning, enabling collaborative
类目: Machine Learning (cs.LG)
*备注: Accepted at CODS COMAD’24 and to be published in the Discover Data Journal( this https URL )

点击查看摘要

Abstract:Federated Learning (FL) is a pioneering approach in distributed machine learning, enabling collaborative model training across multiple clients while retaining data privacy. However, the inherent heterogeneity due to imbalanced resource representations across multiple clients poses significant challenges, often introducing bias towards the majority class. This issue is particularly prevalent in healthcare settings, where hospitals acting as clients share medical images. To address class imbalance and reduce bias, we propose a co-distillation driven framework in a federated healthcare setting. Unlike traditional federated setups with a designated server client, our framework promotes knowledge sharing among clients to collectively improve learning outcomes. Our experiments demonstrate that in a federated healthcare setting, co-distillation outperforms other federated methods in handling class imbalance. Additionally, we demonstrate that our framework has the least standard deviation with increasing imbalance while outperforming other baselines, signifying the robustness of our framework for FL in healthcare.
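
A hedged sketch of the co-distillation idea: each client fits its local labels and additionally matches the averaged soft predictions of its peers on a shared batch. The temperature, loss weighting, and choice of shared data are assumptions for illustration, not the framework proposed in the paper.

```python
# Sketch of co-distillation among federated clients: local cross-entropy plus a
# KL term toward the averaged peer predictions on a shared (e.g., unlabeled) batch.
import torch
import torch.nn.functional as F

def co_distillation_loss(model, peers, x_local, y_local, x_shared, T=2.0, alpha=0.5):
    ce = F.cross_entropy(model(x_local), y_local)            # supervised local loss
    with torch.no_grad():                                     # average peer soft labels
        peer_soft = torch.stack(
            [F.softmax(p(x_shared) / T, dim=-1) for p in peers]
        ).mean(dim=0)
    student_log = F.log_softmax(model(x_shared) / T, dim=-1)
    kd = F.kl_div(student_log, peer_soft, reduction="batchmean") * (T * T)
    return ce + alpha * kd
```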

[LG-5] Weakly-Supervised Multimodal Learning on MIMIC-CXR ML4H ALT

链接: https://arxiv.org/abs/2411.10356
作者: Andrea Agostini,Daphné Chopard,Yang Meng,Norbert Fortin,Babak Shahbaba,Stephan Mandt,Thomas M. Sutter,Julia E. Vogt
关键词-EN: label scarcity pose, scarcity pose significant, pose significant challenges, Multimodal data integration, proposed Multimodal Variational
类目: Machine Learning (cs.LG)
*备注: Findings paper presented at Machine Learning for Health (ML4H) symposium 2024, December 15-16, 2024, Vancouver, Canada, 13 pages. arXiv admin note: text overlap with arXiv:2403.05300

点击查看摘要

Abstract:Multimodal data integration and label scarcity pose significant challenges for machine learning in medical settings. To address these issues, we conduct an in-depth evaluation of the newly proposed Multimodal Variational Mixture-of-Experts (MMVM) VAE on the challenging MIMIC-CXR dataset. Our analysis demonstrates that the MMVM VAE consistently outperforms other multimodal VAEs and fully supervised approaches, highlighting its strong potential for real-world medical applications.

[LG-6] On the Cost of Model-Serving Frameworks: An Experimental Evaluation

链接: https://arxiv.org/abs/2411.10337
作者: Pasquale De Rosa,Yérom-David Bromberg,Pascal Felber,Djob Mvondo,Valerio Schiavoni
关键词-EN: applying pre-trained models, inference phase, making predictions, process of applying, applying pre-trained
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In machine learning (ML), the inference phase is the process of applying pre-trained models to new, unseen data with the objective of making predictions. During the inference phase, end-users interact with ML services to gain insights, recommendations, or actions based on the input data. For this reason, serving strategies are nowadays crucial for deploying and managing models in production environments effectively. These strategies ensure that models are available, scalable, reliable, and performant for real-world applications, such as time series forecasting, image classification, natural language processing, and so on. In this paper, we evaluate the performances of five widely-used model serving frameworks (TensorFlow Serving, TorchServe, MLServer, MLflow, and BentoML) under four different scenarios (malware detection, cryptocoin prices forecasting, image classification, and sentiment analysis). We demonstrate that TensorFlow Serving is able to outperform all the other frameworks in serving deep learning (DL) models. Moreover, we show that DL-specific frameworks (TensorFlow Serving and TorchServe) display significantly lower latencies than the three general-purpose ML frameworks (BentoML, MLFlow, and MLServer).
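
For readers who want to reproduce this kind of comparison, a generic latency micro-benchmark is sketched below: it fires repeated HTTP inference requests and reports percentile latencies. The endpoint URL and JSON payload are placeholders that depend on the framework under test; this is not the paper's benchmarking harness.

```python
# Generic latency micro-benchmark against an HTTP model-serving endpoint.
# The URL and payload below are placeholders, not a specific framework's API.
import json
import statistics
import time
import urllib.request

def measure_latency(url, payload, n_requests=100):
    data = json.dumps(payload).encode()
    latencies = []
    for _ in range(n_requests):
        req = urllib.request.Request(url, data=data,
                                     headers={"Content-Type": "application/json"})
        start = time.perf_counter()
        urllib.request.urlopen(req).read()
        latencies.append((time.perf_counter() - start) * 1000.0)  # milliseconds
    latencies.sort()
    return {"p50_ms": statistics.median(latencies),
            "p99_ms": latencies[int(0.99 * (len(latencies) - 1))]}

# Example (placeholder endpoint):
# measure_latency("http://localhost:8501/predict", {"instances": [[1.0, 2.0]]})
```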

[LG-7] Bitcoin Research with a Transaction Graph Dataset

链接: https://arxiv.org/abs/2411.10325
作者: Hugo Schnoering,Michalis Vazirgiannis
关键词-EN: Satoshi Nakamoto, fully decentralized manner, decentralized manner, central authority, digital economy
类目: Machine Learning (cs.LG); General Finance (q-fin.GN)
*备注:

点击查看摘要

Abstract:Bitcoin, launched in 2008 by Satoshi Nakamoto, established a new digital economy where value can be stored and transferred in a fully decentralized manner - alleviating the need for a central authority. This paper introduces a large-scale dataset in the form of a transactions graph representing transactions between Bitcoin users along with a set of tasks and baselines. The graph includes 252 million nodes and 785 million edges, covering a time span of nearly 13 years and 670 million transactions. Each node and edge is timestamped. As for supervised tasks, we provide two labeled sets: (i) 33,000 nodes labeled by entity type, and (ii) nearly 100,000 Bitcoin addresses labeled with an entity name and an entity type. This is the largest publicly available dataset of Bitcoin transactions designed to facilitate advanced research and exploration in this domain, overcoming the limitations of existing datasets. Various graph neural network models are trained to predict node labels, establishing a baseline for future research. In addition, several use cases are presented to demonstrate the dataset’s applicability beyond Bitcoin analysis. Finally, all data and source code are made publicly available to enable reproducibility of the results.
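
The baselines mentioned above are graph neural networks for node-label prediction. A toy, from-scratch two-layer graph convolution is sketched below to make the task concrete; the feature dimensions and label set are placeholders, and the real 252-million-node graph would require neighborhood sampling rather than a dense adjacency matrix.

```python
# Toy node-classification baseline: two-layer graph convolution on a small,
# dense, row-normalized adjacency. Dimensions and labels are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyGCN(nn.Module):
    def __init__(self, in_dim, hidden, n_classes):
        super().__init__()
        self.w1, self.w2 = nn.Linear(in_dim, hidden), nn.Linear(hidden, n_classes)

    def forward(self, x, adj_norm):          # adj_norm: row-normalized (N, N) adjacency
        h = F.relu(self.w1(adj_norm @ x))    # aggregate neighbors, then transform
        return self.w2(adj_norm @ h)

# Toy usage: 5 nodes, 8 features, 3 entity-type classes.
x = torch.randn(5, 8)
adj = torch.eye(5) + torch.rand(5, 5).round()       # self-loops plus random edges
adj_norm = adj / adj.sum(dim=1, keepdim=True)       # row normalization
logits = TinyGCN(8, 16, 3)(x, adj_norm)             # (5, 3) class scores per node
```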

[LG-8] Towards Sample-Efficiency and Generalization of Transfer and Inverse Reinforcement Learning: A Comprehensive Literature Review

链接: https://arxiv.org/abs/2411.10268
作者: Hossein Hassani,Roozbeh Razavi-Far,Mehrdad Saif,Liang Lin
关键词-EN: solving sequential decision-making, sequential decision-making problems, sub-domain of machine, concerned with solving, solving sequential
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reinforcement learning (RL) is a sub-domain of machine learning, mainly concerned with solving sequential decision-making problems by a learning agent that interacts with the decision environment to improve its behavior through the reward it receives from the environment. This learning paradigm is, however, well-known for being time-consuming due to the necessity of collecting a large amount of data, making RL suffer from sample inefficiency and difficult generalization. Furthermore, the construction of an explicit reward function that accounts for the trade-off between multiple desiderata of a decision problem is often a laborious task. These challenges have been recently addressed utilizing transfer and inverse reinforcement learning (T-IRL). In this regard, this paper is devoted to a comprehensive review of realizing the sample efficiency and generalization of RL algorithms through T-IRL. Following a brief introduction to RL, the fundamental T-IRL methods are presented and the most recent advancements in each research field have been extensively reviewed. Our findings denote that a majority of recent research works have dealt with the aforementioned challenges by utilizing human-in-the-loop and sim-to-real strategies for the efficient transfer of knowledge from source domains to the target domain under the transfer learning scheme. Under the IRL structure, training schemes that require a low number of experience transitions and extension of such frameworks to multi-agent and multi-intention problems have been the priority of researchers in recent years.

[LG-9] MDHP-Net: Detecting Injection Attacks on In-vehicle Network using Multi-Dimensional Hawkes Process and Temporal Model

链接: https://arxiv.org/abs/2411.10258
作者: Qi Liu,Yanchen Liu,Ruifeng Li,Chenhong Cao,Yufeng Li,Xingyu Li,Peng Wang,Runhan Feng
关键词-EN: Electronic Control Unit, Electronic Control, Control Unit, offering enhanced functionalities, functionalities through Electronic
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注:

点击查看摘要

Abstract:The integration of intelligent and connected technologies in modern vehicles, while offering enhanced functionalities through Electronic Control Units and interfaces like OBD-II and telematics, also exposes the vehicle’s in-vehicle network (IVN) to potential cyberattacks. In this paper, we consider a specific type of cyberattack known as the injection attack. As demonstrated by empirical data from real-world cybersecurity adversarial competitions (available at this https URL), these injection attacks have an excitation effect over time, gradually manipulating network traffic and disrupting the vehicle’s normal functioning, ultimately compromising both its stability and safety. To profile the abnormal behavior of attackers, we propose a novel injection attack detector to extract long-term features of attack behavior. Specifically, we first provide a theoretical analysis of modeling the time-excitation effects of the attack using the Multi-Dimensional Hawkes Process (MDHP). A gradient descent solver specifically tailored for MDHP, MDHP-GDS, is developed to accurately estimate optimal MDHP parameters. We then propose an injection attack detector, MDHP-Net, which integrates optimal MDHP parameters with MDHP-LSTM blocks to enhance temporal feature extraction. By introducing MDHP parameters, MDHP-Net captures complex temporal features that a standard Long Short-Term Memory (LSTM) cannot, enriching temporal dependencies within our customized structure. Extensive evaluations demonstrate the effectiveness of our proposed detection approach.
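
For reference, the standard multi-dimensional Hawkes intensity with exponential kernels is sketched below to make the "time-excitation" modeling concrete; the paper's exact parameterization and its MDHP-GDS solver are not reproduced here.

```python
# Standard multi-dimensional Hawkes intensity with exponential kernels:
# lambda_i(t) = mu_i + sum_j sum_{t_k^j < t} alpha[i, j] * exp(-beta[i, j] * (t - t_k^j))
import numpy as np

def mdhp_intensity(t, mu, alpha, beta, events):
    """mu: (D,); alpha, beta: (D, D); events: list of D arrays of past event times."""
    lam = mu.astype(float)
    for j, times_j in enumerate(events):
        past = times_j[times_j < t]
        if past.size:
            decay = np.exp(-np.outer(beta[:, j], t - past))   # (D, n_past)
            lam = lam + alpha[:, j] * decay.sum(axis=1)
    return lam

# Toy usage: two event streams exciting each other.
mu = np.array([0.2, 0.1]); alpha = np.array([[0.5, 0.3], [0.2, 0.4]]); beta = np.ones((2, 2))
print(mdhp_intensity(5.0, mu, alpha, beta, [np.array([1.0, 4.0]), np.array([3.5])]))
```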

[LG-10] Uncertainty in Supply Chain Digital Twins: A Quantum-Classical Hybrid Approach

链接: https://arxiv.org/abs/2411.10254
作者: Abdullah Abdullah,Fannya Ratana Sandjaja,Ayesha Abdul Majeed,Gyan Wickremasinghe,Karen Rafferty,Vishal Sharma
关键词-EN: financial risk assessment, supply chain digital, chain digital twins, hybrid machine learning, investigates uncertainty quantification
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This study investigates uncertainty quantification (UQ) using quantum-classical hybrid machine learning (ML) models for applications in complex and dynamic fields, such as attaining resiliency in supply chain digital twins and financial risk assessment. Although quantum feature transformations have been integrated into ML models for complex data tasks, a gap exists in determining their impact on UQ within their hybrid architectures (quantum-classical approach). This work applies existing UQ techniques for different models within a hybrid framework, examining how quantum feature transformation affects uncertainty propagation. Increasing qubits from 4 to 16 shows varied model responsiveness to outlier detection (OD) samples, which is a critical factor for resilient decision-making in dynamic environments. This work shows how quantum computing techniques can transform data features for UQ, particularly when combined with traditional methods.

[LG-11] Efficient Neural Hybrid System Learning and Transition System Abstraction for Dynamical Systems

链接: https://arxiv.org/abs/2411.10240
作者: Yejiang Yang,Zihao Mo,Weiming Xiang
关键词-EN: hybrid modeling framework, network hybrid modeling, computationally efficient, dynamics learning, Computational Tree Logic
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper proposes a neural network hybrid modeling framework for dynamics learning to promote an interpretable, computationally efficient way of dynamics learning and system identification. First, a low-level model will be trained to learn the system dynamics, which utilizes multiple simple neural networks to approximate the local dynamics generated from data-driven partitions. Then, based on the low-level model, a high-level model will be trained to abstract the low-level neural hybrid system model into a transition system that allows Computational Tree Logic Verification to promote the model’s ability with human interaction and verification efficiency.

[LG-12] Machine Learning Algorithms to Assess Site Closure Time Frames for Soil and Groundwater Contamination

链接: https://arxiv.org/abs/2411.10214
作者: Vu-Anh Le,Haruko Murakami Wainwright,Hansell Gonzalez-Raymat,Carol Eddy-Dilek
关键词-EN: Monitored Natural Attenuation, Monitored Natural, Natural Attenuation, minimal environmental disruption, groundwater contamination due
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 14 pages, 6 figures

点击查看摘要

Abstract:Monitored Natural Attenuation (MNA) is gaining prominence as an effective method for managing soil and groundwater contamination due to its cost-efficiency and minimal environmental disruption. Despite its benefits, MNA necessitates extensive groundwater monitoring to ensure that contaminant levels decrease to meet safety standards. This study expands the capabilities of PyLEnM, a Python package designed for long-term environmental monitoring, by incorporating new algorithms to enhance its predictive and analytical functionalities. We introduce methods to estimate the timeframe required for contaminants like Sr-90 and I-129 to reach regulatory safety standards using linear regression and to forecast future contaminant levels with the Bidirectional Long Short-Term Memory (Bi-LSTM) networks. Additionally, Random Forest regression is employed to identify factors influencing the time to reach safety standards. Our methods are illustrated using data from the Savannah River Site (SRS) F-Area, where preliminary findings reveal a notable downward trend in contaminant levels, with variability linked to initial concentrations and groundwater flow dynamics. The Bi-LSTM model effectively predicts contaminant concentrations for the next four years, demonstrating the potential of advanced time series analysis to improve MNA strategies and reduce reliance on manual groundwater sampling. The code, along with its usage instructions, validation, and requirements, is available at: this https URL.
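
A hedged illustration of the linear-regression step described above: fit a linear trend to log-concentration and solve for the year the fitted trend crosses a regulatory threshold. The log-linear assumption, the synthetic data, and the threshold value are placeholders for illustration, not PyLEnM's implementation.

```python
# Estimate the year a downward contaminant trend reaches a safety threshold
# by fitting log10(concentration) ~ year and inverting the fitted line.
import numpy as np

def years_to_threshold(years, concentrations, threshold):
    """Return the calendar year the fitted trend crosses `threshold`, or None."""
    slope, intercept = np.polyfit(years, np.log10(concentrations), deg=1)
    if slope >= 0:                      # no downward trend: no predicted attainment
        return None
    return (np.log10(threshold) - intercept) / slope

# Example with synthetic data: concentration decaying by ~20% per year.
yrs = np.arange(2010, 2021)
conc = 100.0 * 0.8 ** (yrs - 2010)
print(years_to_threshold(yrs, conc, threshold=8.0))   # crossing year (illustrative units)
```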

[LG-13] Embedding Byzantine Fault Tolerance into Federated Learning via Virtual Data-Driven Consistency Scoring Plugin

链接: https://arxiv.org/abs/2411.10212
作者: Youngjoon Lee,Jinu Gong,Joonhyuk Kang
关键词-EN: multiple edge devices, compromised edge devices, edge devices, transmitting private data, federated learning
类目: Machine Learning (cs.LG)
*备注: 7 pages

点击查看摘要

Abstract:Given sufficient data from multiple edge devices, federated learning (FL) enables training a shared model without transmitting private data to a central server. However, FL is generally vulnerable to Byzantine attacks from compromised edge devices, which can significantly degrade the model performance. In this paper, we propose an intuitive plugin that can be integrated into existing FL techniques to achieve Byzantine resilience. The key idea is to generate virtual data samples and evaluate model consistency scores across local updates to effectively filter out compromised edge devices. By utilizing this scoring mechanism before the aggregation phase, the proposed plugin enables existing FL techniques to become robust against Byzantine attacks while maintaining their original benefits. Numerical results on a medical image classification task validate that plugging the proposed approach into representative FL algorithms effectively achieves Byzantine resilience. Furthermore, the proposed plugin maintains the original convergence properties of the base FL algorithms when no Byzantine attacks are present.
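
A minimal sketch of the scoring idea before aggregation: evaluate every client model on a batch of virtual samples, score each by closeness to the element-wise median prediction, and keep only the most consistent updates. The virtual-data generator, scoring rule, and keep ratio are assumptions for illustration, not the proposed plugin.

```python
# Filter client models by prediction consistency on virtual samples before aggregation.
import torch

def filter_by_consistency(client_models, virtual_x, keep_ratio=0.7):
    with torch.no_grad():
        preds = torch.stack([torch.softmax(m(virtual_x), dim=-1) for m in client_models])
    median_pred = preds.median(dim=0).values                    # (B, C) robust reference
    scores = -((preds - median_pred) ** 2).mean(dim=(1, 2))     # higher = more consistent
    n_keep = max(1, int(keep_ratio * len(client_models)))
    keep_idx = scores.topk(n_keep).indices.tolist()
    return [client_models[i] for i in keep_idx]                 # pass these to the aggregator
```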

[LG-14] Neural Port-Hamiltonian Models for Nonlinear Distributed Control: An Unconstrained Parametrization Approach

链接: https://arxiv.org/abs/2411.10096
作者: Muhammad Zakwan,Giancarlo Ferrari-Trecate
关键词-EN: policies relying solely, large-scale cyber-physical systems, cyber-physical systems requires, systems requires optimal, requires optimal distributed
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注: The paper has 15 pages, and has been submitted for a possible publication. arXiv admin note: text overlap with arXiv:2403.17785

点击查看摘要

Abstract:The control of large-scale cyber-physical systems requires optimal distributed policies relying solely on limited communication with neighboring agents. However, computing stabilizing controllers for nonlinear systems while optimizing complex costs remains a significant challenge. Neural Networks (NNs), known for their expressivity, can be leveraged to parametrize control policies that yield good performance. However, NNs’ sensitivity to small input changes poses a risk of destabilizing the closed-loop system. Many existing approaches enforce constraints on the controllers’ parameter space to guarantee closed-loop stability, leading to computationally expensive optimization procedures. To address these problems, we leverage the framework of port-Hamiltonian systems to design continuous-time distributed control policies for nonlinear systems that guarantee closed-loop stability and finite \mathcal{L}_2 or incremental \mathcal{L}_2 gains, independent of the optimization parameters of the controllers. This eliminates the need to constrain parameters during optimization, allowing the use of standard techniques such as gradient-based methods. Additionally, we discuss discretization schemes that preserve the dissipation properties of these controllers for implementation on embedded systems. The effectiveness of the proposed distributed controllers is demonstrated through consensus control of non-holonomic mobile robots subject to collision avoidance and averaged voltage regulation with weighted power sharing in DC microgrids.

[LG-15] Unsupervised Congestion Status Identification Using LMP Data

链接: https://arxiv.org/abs/2411.10058
作者: Kedi Zheng,Qixin Chen,Yi Wang,Chongqing Kang,Le Xie
关键词-EN: locational marginal prices, market strategy making, congestion LMP data, marginal prices, price forecasting
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注: Paper accepted for IEEE Transactions on Smart Grid. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses

点击查看摘要

Abstract:Having a better understanding of how locational marginal prices (LMPs) change helps in price forecasting and market strategy making. This paper investigates the fundamental distribution of the congestion part of LMPs in high-dimensional Euclidean space using an unsupervised approach. LMP models based on the lossless and lossy DC optimal power flow (DC-OPF) are analyzed to show the overlapping subspace property of the LMP data. The congestion part of LMPs is spanned by certain row vectors of the power transfer distribution factor (PTDF) matrix, and the subspace attributes of an LMP vector are found to uniquely reflect the instantaneous congestion status of all the transmission lines. The proposed method searches for the basis vectors that span the subspaces of congestion LMP data in hierarchical ways. In the bottom-up search, the data belonging to 1-dimensional subspaces are detected, and other data are projected on the orthogonal subspaces. This procedure is repeated until all the basis vectors are found or the basis gap appears. Top-down searching is used to address the basis gap by hyperplane detection with outliers. Once all the basis vectors are detected, the congestion status can be identified. Numerical experiments based on the IEEE 30-bus system, IEEE 118-bus system, Illinois 200-bus system, and Southwest Power Pool are conducted to show the performance of the proposed method.

[LG-16] Model Inversion Attacks: A Survey of Approaches and Countermeasures

链接: https://arxiv.org/abs/2411.10023
作者: Zhanke Zhou,Jianing Zhu,Fengfei Yu,Xuan Li,Xiong Peng,Tongliang Liu,Bo Han
关键词-EN: applications from Euclidean, Euclidean to non-Euclidean, driven numerous research, deep neural networks, success of deep
类目: Machine Learning (cs.LG)
*备注: 40 pages, 17 figures

点击查看摘要

Abstract:The success of deep neural networks has driven numerous research studies and applications from Euclidean to non-Euclidean data. However, there are increasing concerns about privacy leakage, as these networks rely on processing private data. Recently, a new type of privacy attack, the model inversion attacks (MIAs), aims to extract sensitive features of private data for training by abusing access to a well-trained model. The effectiveness of MIAs has been demonstrated in various domains, including images, texts, and graphs. These attacks highlight the vulnerability of neural networks and raise awareness about the risk of privacy leakage within the research community. Despite the significance, there is a lack of systematic studies that provide a comprehensive overview and deeper insights into MIAs across different domains. This survey aims to summarize up-to-date MIA methods in both attacks and defenses, highlighting their contributions and limitations, underlying modeling principles, optimization challenges, and future directions. We hope this survey bridges the gap in the literature and facilitates future research in this critical area. Besides, we are maintaining a repository to keep track of relevant research at this https URL.

[LG-17] Fully Dynamic Adversarially Robust Correlation Clustering in Polylogarithmic Update Time

链接: https://arxiv.org/abs/2411.09979
作者: Vladimir Braverman,Prathamesh Dharangutte,Shreyas Pai,Vihan Shah,Chen Wang
关键词-EN:
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-18] Establishing and Evaluating Trustworthy AI: Overview and Research Challenges

链接: https://arxiv.org/abs/2411.09973
作者: Dominik Kowald,Sebastian Scher,Viktoria Pammer-Schindler,Peter Müllner,Kerstin Waxnegger,Lea Demelius,Angela Fessl,Maximilian Toller,Inti Gabriel Mendoza Estrada,Ilija Simic,Vedran Sabol,Andreas Truegler,Eduardo Veas,Roman Kern,Tomislav Nad,Simone Kopeinik
关键词-EN: shape modern life, Artificial intelligence, shape modern, modern life, driving innovation
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注: Accepted in Frontiers in Big Data and AI, Research Topic: Towards Fair AI for Trustworthy Artificial Intelligence

点击查看摘要

Abstract:Artificial intelligence (AI) technologies (re-)shape modern life, driving innovation in a wide range of sectors. However, some AI systems have yielded unexpected or undesirable outcomes or have been used in questionable manners. As a result, there has been a surge in public and academic discussions about aspects that AI systems must fulfill to be considered trustworthy. In this paper, we synthesize existing conceptualizations of trustworthy AI along six requirements: 1) human agency and oversight, 2) fairness and non-discrimination, 3) transparency and explainability, 4) robustness and accuracy, 5) privacy and security, and 6) accountability. For each one, we provide a definition, describe how it can be established and evaluated, and discuss requirement-specific research challenges. Finally, we conclude this analysis by identifying overarching research challenges across the requirements with respect to 1) interdisciplinary research, 2) conceptual clarity, 3) context-dependency, 4) dynamics in evolving systems, and 5) investigations in real-world contexts. Thus, this paper synthesizes and consolidates a wide-ranging and active discussion currently taking place in various academic sub-communities and public forums. It aims to serve as a reference for a broad audience and as a basis for future research directions.

[LG-19] Zero-shot Voice Conversion with Diffusion Transformers

链接: https://arxiv.org/abs/2411.09943
作者: Songting Liu
关键词-EN: aims to transform, utterance to match, source speech utterance, voice conversion, voice conversion aims
类目: ound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Zero-shot voice conversion aims to transform a source speech utterance to match the timbre of a reference speech from an unseen speaker. Traditional approaches struggle with timbre leakage, insufficient timbre representation, and mismatches between training and inference tasks. We propose Seed-VC, a novel framework that addresses these issues by introducing an external timbre shifter during training to perturb the source speech timbre, mitigating leakage and aligning training with inference. Additionally, we employ a diffusion transformer that leverages the entire reference speech context, capturing fine-grained timbre features through in-context learning. Experiments demonstrate that Seed-VC outperforms strong baselines like OpenVoice and CosyVoice, achieving higher speaker similarity and lower word error rates in zero-shot voice conversion tasks. We further extend our approach to zero-shot singing voice conversion by incorporating fundamental frequency (F0) conditioning, achieving performance comparable to current state-of-the-art methods. Our findings highlight the effectiveness of Seed-VC in overcoming core challenges, paving the way for more accurate and versatile voice conversion systems.

[LG-20] Is Precise Recovery Necessary? A Task-Oriented Imputation Approach for Time Series Forecasting on Variable Subset

链接: https://arxiv.org/abs/2411.09928
作者: Qi Hao,Runchang Liang,Yue Gao,Hao Dong,Wei Fan,Lu Jiang,Pengyang Wang
关键词-EN: multivariate time series, Variable Subset Forecasting, time series, inference phase, training phase
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Variable Subset Forecasting (VSF) refers to a unique scenario in multivariate time series forecasting, where available variables in the inference phase are only a subset of the variables in the training phase. VSF presents significant challenges as the entire time series may be missing, and neither inter- nor intra-variable correlations persist. Such conditions impede the effectiveness of traditional imputation methods, primarily focusing on filling in individual missing data points. Inspired by the principle of feature engineering that not all variables contribute positively to forecasting, we propose Task-Oriented Imputation for VSF (TOI-VSF), a novel framework that shifts the focus from accurate data recovery to directly supporting the downstream forecasting task. TOI-VSF incorporates a self-supervised imputation module, agnostic to the forecasting model, designed to fill in missing variables while preserving the vital characteristics and temporal patterns of time series data. Additionally, we implement a joint learning strategy for imputation and forecasting, ensuring that the imputation process is directly aligned with and beneficial to the forecasting objective. Extensive experiments across four datasets demonstrate the superiority of TOI-VSF, outperforming baseline methods by 15% on average.
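
A hedged sketch of joint imputation-and-forecasting training: hide a fraction of the observed entries for a self-supervised reconstruction term, then add the forecasting loss on the imputed series. The masking rate, loss weighting, and the `imputer`/`forecaster` interfaces are assumptions, not TOI-VSF itself.

```python
# Hedged sketch of joint imputation + forecasting training (interfaces assumed):
#   imputer(x_masked, mask) -> completed series (B, T, V)
#   forecaster(x_completed) -> forecast (B, H, V)
import torch
import torch.nn.functional as F

def joint_step(imputer, forecaster, x, observed_mask, y_future, lam=0.3):
    # x: (B, T, V) with zeros at missing entries; observed_mask: 1.0 where observed.
    drop = (torch.rand_like(observed_mask) > 0.2).float() * observed_mask   # hide ~20% of observed entries
    x_filled = imputer(x * drop, drop)
    hidden = (observed_mask - drop).bool()                                  # entries hidden for self-supervision
    recon = F.mse_loss(x_filled[hidden], x[hidden]) if hidden.any() else x.new_zeros(())
    forecast_loss = F.mse_loss(forecaster(x_filled), y_future)
    return forecast_loss + lam * recon                                      # imputation aligned with forecasting
```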

[LG-21] Physics-informed Machine Learning for Battery Pack Thermal Management

链接: https://arxiv.org/abs/2411.09915
作者: Zheng Liu,Yuan Jiang,Yumeng Li,Pingfeng Wang
关键词-EN: Battery thermal management, electric vehicles, popularity of electric, demand for lithium-ion, thermal management systems
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:With the popularity of electric vehicles, the demand for lithium-ion batteries is increasing. Temperature significantly influences the performance and safety of batteries. Battery thermal management systems can effectively control the temperature of batteries; therefore, the performance and safety can be ensured. However, the development process of battery thermal management systems is time-consuming and costly due to the extensive training dataset needed by data-driven models requiring enormous computational costs for finite element analysis. Therefore, a new approach to constructing surrogate models is needed in the era of AI. Physics-informed machine learning enforces the physical laws in surrogate models, making it the perfect candidate for estimating battery pack temperature distribution. In this study, we first developed a 21700 battery pack indirect liquid cooling system with cold plates on the top and bottom with thermal paste surrounding the battery cells. Then, the simplified finite element model was built based on experiment results. Due to the high coolant flow rate, the cold plates can be considered as constant temperature boundaries, while battery cells are the heat sources. The physics-informed convolutional neural network served as a surrogate model to estimate the temperature distribution of the battery pack. The loss function was constructed considering the heat conduction equation based on the finite difference method. The physics-informed loss function helped the convergence of the training process with less data. As a result, the physics-informed convolutional neural network showed a more than 15 percent improvement in accuracy compared to the data-driven method with the same training data.
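
A small sketch of a physics-informed loss term of the kind described above: penalize the finite-difference residual of a steady-state heat equation on the predicted 2D temperature map. The grid spacing, conductivity, and source term are placeholders for illustration, not the paper's battery-pack model.

```python
# Physics-informed penalty: finite-difference residual of k * laplacian(T) + q = 0
# on a predicted temperature map, added to the usual data-fitting loss.
import torch
import torch.nn.functional as F

def heat_equation_residual_loss(T_pred, heat_source, dx=1.0, k=1.0):
    """T_pred, heat_source: (B, 1, H, W) tensors on the same grid."""
    lap_kernel = torch.tensor([[[[0., 1., 0.],
                                 [1., -4., 1.],
                                 [0., 1., 0.]]]], device=T_pred.device) / (dx * dx)
    laplacian = F.conv2d(T_pred, lap_kernel, padding=1)
    residual = k * laplacian + heat_source
    return (residual[..., 1:-1, 1:-1] ** 2).mean()   # ignore the boundary rows/columns
```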

[LG-22] Self-Supervised Learning of Grasping Arbitrary Objects On-the-Move

链接: https://arxiv.org/abs/2411.09904
作者: Takuya Kiyokawa,Eiki Nagata,Yoshihisa Tsurumine,Yuhwan Kwon,Takamitsu Matsubara
关键词-EN: utilizing robots’ mobility, grasping enhances manipulation, Mobile grasping, enhances manipulation efficiency, Mobile grasping enhances
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 8 pages, 9 figures

点击查看摘要

Abstract:Mobile grasping enhances manipulation efficiency by utilizing robots’ mobility. This study aims to enable a commercial off-the-shelf robot for mobile grasping, requiring precise timing and pose adjustments. Self-supervised learning can develop a generalizable policy to adjust the robot’s velocity and determine grasp position and orientation based on the target object’s shape and pose. Due to mobile grasping’s complexity, action primitivization and step-by-step learning are crucial to avoid data sparsity in learning from trial and error. This study simplifies mobile grasping into two grasp action primitives and a moving action primitive, which can be operated with limited degrees of freedom for the manipulator. This study introduces three fully convolutional neural network (FCN) models to predict static grasp primitive, dynamic grasp primitive, and residual moving velocity error from visual inputs. A two-stage grasp learning approach facilitates seamless FCN model learning. The ablation study demonstrated that the proposed method achieved the highest grasping accuracy and pick-and-place efficiency. Furthermore, randomizing object shapes and environments in the simulation effectively achieved generalizable mobile grasping.

[LG-23] Deep learning robotics using self-supervised spatial differentiation drive autonomous contact-based semiconductor characterization

链接: https://arxiv.org/abs/2411.09892
作者: Alexander E. Siemenn,Basita Das,Kangyu Ji,Fang Sheng,Tonio Buonassisi
关键词-EN: Integrating autonomous contact-based, Integrating autonomous, enhance measurement quality, Integrating, measurement quality
类目: Robotics (cs.RO); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Integrating autonomous contact-based robotic characterization into self-driving laboratories can enhance measurement quality, reliability, and throughput. While deep learning models support robust autonomy, current methods lack pixel-precision positioning and require extensive labeled data. To overcome these challenges, we propose a self-supervised convolutional neural network with a spatially differentiable loss function, incorporating shape priors to refine predictions of optimal robot contact poses for semiconductor characterization. This network improves valid pose generation by 20.0%, relative to existing models. We demonstrate our network’s performance by driving a 4-degree-of-freedom robot to characterize photoconductivity at 3,025 predicted poses across a gradient of perovskite compositions, achieving throughputs over 125 measurements per hour. Spatially mapping photoconductivity onto each drop-casted film reveals regions of inhomogeneity. With this self-supervised deep learning-driven robotic system, we enable high-precision and reliable automation of contact-based characterization techniques at high throughputs, thereby allowing the measurement of previously inaccessible yet important semiconductor properties for self-driving laboratories.

[LG-24] InvestESG: A multi-agent reinforcement learning benchmark for studying climate investment as a social dilemma

链接: https://arxiv.org/abs/2411.09856
作者: Xiaoxuan Hou,Jiayi Yuan,Joel Z. Leibo,Natasha Jaques
关键词-EN: multi-agent reinforcement learning, impact of Environmental, reinforcement learning, multi-agent reinforcement, designed to study
类目: Machine Learning (cs.LG); Computers and Society (cs.CY); Multiagent Systems (cs.MA); General Economics (econ.GN)
*备注:

点击查看摘要

Abstract:InvestESG is a novel multi-agent reinforcement learning (MARL) benchmark designed to study the impact of Environmental, Social, and Governance (ESG) disclosure mandates on corporate climate investments. Supported by both PyTorch and GPU-accelerated JAX framework, the benchmark models an intertemporal social dilemma where companies balance short-term profit losses from climate mitigation efforts and long-term benefits from reducing climate risk, while ESG-conscious investors attempt to influence corporate behavior through their investment decisions. Companies allocate capital across mitigation, greenwashing, and resilience, with varying strategies influencing climate outcomes and investor preferences. Our experiments show that without ESG-conscious investors with sufficient capital, corporate mitigation efforts remain limited under the disclosure mandate. However, when a critical mass of investors prioritizes ESG, corporate cooperation increases, which in turn reduces climate risks and enhances long-term financial stability. Additionally, providing more information about global climate risks encourages companies to invest more in mitigation, even without investor involvement. Our findings align with empirical research using real-world data, highlighting MARL’s potential to inform policy by providing insights into large-scale socio-economic challenges through efficient testing of alternative policy and market designs.

[LG-25] Fair Secretaries with Unfair Predictions NEURIPS2024

链接: https://arxiv.org/abs/2411.09854
作者: Eric Balkanski,Will Ma,Andreas Maggiori
关键词-EN: decision-making under uncertainty, uncertainty that leverages, leverages the power, power of machine-learned, making any assumption
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
*备注: to appear at NeurIPS 2024

点击查看摘要

Abstract:Algorithms with predictions is a recent framework for decision-making under uncertainty that leverages the power of machine-learned predictions without making any assumption about their quality. The goal in this framework is for algorithms to achieve an improved performance when the predictions are accurate while maintaining acceptable guarantees when the predictions are erroneous. A serious concern with algorithms that use predictions is that these predictions can be biased and, as a result, cause the algorithm to make decisions that are deemed unfair. We show that this concern manifests itself in the classical secretary problem in the learning-augmented setting – the state-of-the-art algorithm can have zero probability of accepting the best candidate, which we deem unfair, despite promising to accept a candidate whose expected value is at least \max\{\Omega(1), 1 - O(\epsilon)\} times the optimal value, where \epsilon is the prediction error. We show how to preserve this promise while also guaranteeing to accept the best candidate with probability \Omega(1). Our algorithm and analysis are based on a new “pegging” idea that diverges from existing works and simplifies/unifies some of their results. Finally, we extend to the k-secretary problem and complement our theoretical analysis with experiments.
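
The abstract does not describe the "pegging" algorithm itself, so the snippet below is only a generic learning-augmented secretary rule written for intuition: trust the prediction when a candidate clears a prediction-based threshold, otherwise fall back to the classical 37% rule. The threshold, the fallback fraction, and the overall structure are assumptions, not the paper's method.

```python
import random

def augmented_secretary(values, predicted_max, eps=0.1, cutoff_frac=0.368):
    """Toy learning-augmented secretary rule (illustrative only): accept the
    first candidate within (1 - eps) of the predicted maximum; otherwise use
    the classical observe-then-accept fallback."""
    n = len(values)
    threshold = (1 - eps) * predicted_max
    cutoff = max(1, int(cutoff_frac * n))
    best_early = max(values[:cutoff])        # only consulted once i >= cutoff
    for i, v in enumerate(values):
        if v >= threshold:                   # trust the (possibly wrong) prediction
            return i
        if i >= cutoff and v > best_early:   # classical 37%-rule fallback
            return i
    return n - 1                             # forced to take the last candidate

random.seed(0)
vals = [random.random() for _ in range(100)]
pick = augmented_secretary(vals, predicted_max=max(vals) * 0.95)
print(pick, vals[pick], max(vals))
```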

[LG-26] Towards a Fairer Non-negative Matrix Factorization

链接: https://arxiv.org/abs/2411.09847
作者: Lara Kassab,Erin George,Deanna Needell,Haowen Geng,Nika Jafar Nia,Aoxi Li
关键词-EN: techniques provide powerful, provide powerful tools, Topic modeling, Non-negative Matrix Factorization, dimensionality reduction
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Topic modeling, or more broadly, dimensionality reduction, techniques provide powerful tools for uncovering patterns in large datasets and are widely applied across various domains. We investigate how Non-negative Matrix Factorization (NMF) can introduce bias in the representation of data groups, such as those defined by demographics or protected attributes. We present an approach, called Fairer-NMF, that seeks to minimize the maximum reconstruction loss for different groups relative to their size and intrinsic complexity. Further, we present two algorithms for solving this problem. The first is an alternating minimization (AM) scheme and the second is a multiplicative updates (MU) scheme which demonstrates a reduced computational time compared to AM while still achieving similar performance. Lastly, we present numerical experiments on synthetic and real datasets to evaluate the overall performance and trade-offs of Fairer-NMF.
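
As a rough illustration of the min-max idea (not the authors' AM or MU algorithm), the sketch below runs standard multiplicative updates for a row-weighted NMF objective and re-weights groups toward the one with the worst relative reconstruction loss. The exponentiated-gradient weighting and all hyperparameters are assumptions.

```python
import numpy as np

def fairness_aware_nmf(X, groups, rank=5, iters=300, eta=2.0, seed=0):
    """Sketch of fairness-aware NMF: weighted multiplicative updates where
    groups with larger relative loss get up-weighted."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, rank)) + 1e-3
    H = rng.random((rank, m)) + 1e-3
    gids = np.unique(groups)
    gw = {g: 1.0 for g in gids}                      # per-group weights
    eps = 1e-9
    for _ in range(iters):
        d = np.array([gw[g] for g in groups])[:, None]   # row weights
        WH = W @ H
        # multiplicative updates for the weighted objective ||sqrt(D)(X - WH)||^2
        H *= (W.T @ (d * X)) / (W.T @ (d * WH) + eps)
        WH = W @ H
        W *= ((d * X) @ H.T) / ((d * WH) @ H.T + eps)
        # relative per-group loss, normalized by group "energy"
        R = X - W @ H
        losses = {g: np.linalg.norm(R[groups == g]) ** 2
                     / (np.linalg.norm(X[groups == g]) ** 2 + eps) for g in gids}
        # exponentiated-gradient step toward the worst-off group
        z = sum(np.exp(eta * losses[g]) for g in gids)
        gw = {g: len(gids) * np.exp(eta * losses[g]) / z for g in gids}
    return W, H

# toy usage: two groups of rows with different scales (i.e., complexity)
rng = np.random.default_rng(1)
X = np.vstack([rng.random((40, 20)), rng.random((10, 20)) * 3])
groups = np.array([0] * 40 + [1] * 10)
W, H = fairness_aware_nmf(X, groups)
```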

[LG-27] FedRewind: Rewinding Continual Model Exchange for Decentralized Federated Learning

链接: https://arxiv.org/abs/2411.09842
作者: Luca Palazzo,Matteo Pennisi,Federica Proietto Salanitri,Giovanni Bellitto,Simone Palazzo,Concetto Spampinato
关键词-EN: leverages model exchange, learning, address spatial distribution, present FedRewind, nodes
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, we present FedRewind, a novel approach to decentralized federated learning that leverages model exchange among nodes to address the issue of data distribution shift. Drawing inspiration from continual learning (CL) principles and cognitive neuroscience theories for memory retention, FedRewind implements a decentralized routing mechanism where nodes send/receive models to/from other nodes in the federation to address spatial distribution challenges inherent in distributed learning (FL). During local training, federation nodes periodically send their models back (i.e., rewind) to the nodes they received them from for a limited number of iterations. This strategy reduces the distribution shift between nodes’ data, leading to enhanced learning and generalization performance. We evaluate our method on multiple benchmarks, demonstrating its superiority over standard decentralized federated learning methods and those enforcing specific routing schemes within the federation. Furthermore, the combination of federated and continual learning concepts enables our method to tackle the more challenging federated continual learning task, with data shifts over both space and time, surpassing existing baselines.

[LG-28] The Good, The Efficient and the Inductive Biases: Exploring Efficiency in Deep Learning Through the Use of Inductive Biases

链接: https://arxiv.org/abs/2411.09827
作者: David W. Romero
关键词-EN: Deep Learning, numerous breakthroughs achieved, Deep Learning efficiency, Deep Learning algorithms, Learning
类目: Machine Learning (cs.LG)
*备注: PhD Dissertation

点击查看摘要

Abstract:The emergence of Deep Learning has marked a profound shift in machine learning, driven by numerous breakthroughs achieved in recent years. However, as Deep Learning becomes increasingly present in everyday tools and applications, there is a growing need to address unresolved challenges related to its efficiency and sustainability. This dissertation delves into the role of inductive biases – particularly, continuous modeling and symmetry preservation – as strategies to enhance the efficiency of Deep Learning. It is structured in two main parts. The first part investigates continuous modeling as a tool to improve the efficiency of Deep Learning algorithms. Continuous modeling involves the idea of parameterizing neural operations in a continuous space. The research presented here demonstrates substantial benefits for the (i) computational efficiency – in time and memory, (ii) the parameter efficiency, and (iii) design efficiency – the complexity of designing neural architectures for new datasets and tasks. The second focuses on the role of symmetry preservation on Deep Learning efficiency. Symmetry preservation involves designing neural operations that align with the inherent symmetries of data. The research presented in this part highlights significant gains both in data and parameter efficiency through the use of symmetry preservation. However, it also acknowledges a resulting trade-off of increased computational costs. The dissertation concludes with a critical evaluation of these findings, openly discussing their limitations and proposing strategies to address them, informed by literature and the author insights. It ends by identifying promising future research avenues in the exploration of inductive biases for efficiency, and their wider implications for Deep Learning.

[LG-29] Learning Parameter Sharing with Tensor Decompositions and Sparsity

链接: https://arxiv.org/abs/2411.09816
作者: Cem Üyük,Mike Lasby,Mohamed Yassin,Utku Evci,Yani Ioannou
关键词-EN: achieve remarkable performance, neural networks achieve, networks achieve remarkable, size hinders deployment, Large neural networks
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large neural networks achieve remarkable performance, but their size hinders deployment on resource-constrained devices. While various compression techniques exist, parameter sharing remains relatively unexplored. This paper introduces Fine-grained Parameter Sharing (FiPS), a novel algorithm that leverages the relationship between parameter sharing, tensor decomposition, and sparsity to efficiently compress large vision transformer models. FiPS employs a shared base and sparse factors to represent shared neurons across multi-layer perceptron (MLP) modules. Shared parameterization is initialized via Singular Value Decomposition (SVD) and optimized by minimizing block-wise reconstruction error. Experiments demonstrate that FiPS compresses DeiT-B and Swin-L MLPs to 25-40% of their original parameter count while maintaining accuracy within 1 percentage point of the original models.
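
A minimal sketch of the SVD-based shared-parameterization idea follows (initialization only; the sparsity of the factors and the block-wise reconstruction optimization described in the abstract are omitted, and the matrix shapes are hypothetical).

```python
import numpy as np

def shared_basis_init(weight_mats, shared_rank=64):
    """Sketch in the spirit of FiPS, not its exact algorithm: stack MLP
    weight matrices that share an input dimension, truncate the SVD, and
    express each layer as a shared base times a layer-specific factor."""
    stacked = np.concatenate(weight_mats, axis=1)        # (d_in, sum of d_out)
    U, S, Vt = np.linalg.svd(stacked, full_matrices=False)
    base = U[:, :shared_rank] * S[:shared_rank]          # shared base (d_in, r)
    factors, col = [], 0
    for W in weight_mats:
        factors.append(Vt[:shared_rank, col:col + W.shape[1]])   # (r, d_out)
        col += W.shape[1]
    return base, factors

# toy dimensions; real ViT MLP matrices would be much larger
rng = np.random.default_rng(0)
mats = [rng.standard_normal((256, 1024)) for _ in range(4)]
base, factors = shared_basis_init(mats, shared_rank=64)
err = np.linalg.norm(mats[0] - base @ factors[0]) / np.linalg.norm(mats[0])
print(f"relative reconstruction error of layer 0: {err:.3f}")
```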

[LG-30] Can Features for Phishing URL Detection Be Trusted Across Diverse Datasets? A Case Study with Explainable AI

链接: https://arxiv.org/abs/2411.09813
作者: Maraz Mia,Darius Derakhshan,Mir Mehedi A. Pritom
关键词-EN: prevalent cyber threat, revealing sensitive private, sensitive private information, phishing URL, Phishing
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 8 pages, 10 figures, The 11th International Conference on Networking, Systems and Security, December 19-21, 2024

点击查看摘要

Abstract:Phishing has been a prevalent cyber threat that manipulates users into revealing sensitive private information through deceptive tactics, designed to masquerade as trustworthy entities. Over the years, proactive detection of phishing URLs (or websites) has been established as a widely-accepted defense approach. In the literature, we often find supervised Machine Learning (ML) models with highly competitive performance for detecting phishing websites based on the extracted features from both phishing and benign (i.e., legitimate) websites. However, it is still unclear if these features or indicators are dependent on a particular dataset or whether they generalize for overall phishing detection. In this paper, we delve deeper into this issue by analyzing two publicly available phishing URL datasets, where each dataset has its own set of unique and overlapping features related to URL string and website contents. We want to investigate if overlapping features are similar in nature across datasets and how the model performs when trained on one dataset and tested on the other. We conduct practical experiments and leverage explainable AI (XAI) methods such as SHAP plots to provide insights into different features’ contributions in the case of phishing detection to answer our primary question, “Can features for phishing URL detection be trusted across diverse datasets?”. Our case study experiment results show that features for phishing URL detection can often be dataset-dependent and thus may not be trusted across different datasets even though they share the same set of feature behaviors.
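
A minimal sketch of the train-on-one-dataset, test-on-the-other protocol the abstract describes, using scikit-learn on synthetic stand-in data; the features, model choice, and label construction are assumptions, and the SHAP attribution step is only indicated in a comment.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def cross_dataset_eval(X_a, y_a, X_b, y_b):
    """Train on dataset A (restricted to features overlapping with B),
    evaluate in-domain on a held-out split of A and cross-domain on B.
    A large gap between the two AUCs suggests dataset-dependent features."""
    X_tr, X_te, y_tr, y_te = train_test_split(X_a, y_a, test_size=0.3, random_state=0)
    clf = RandomForestClassifier(n_estimators=300, random_state=0)
    clf.fit(X_tr, y_tr)
    in_domain = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    cross = roc_auc_score(y_b, clf.predict_proba(X_b)[:, 1])
    # feature attribution could then be done with, e.g., shap.TreeExplainer(clf)
    return in_domain, cross

# synthetic stand-ins: each "dataset" keys its labels to a different feature
rng = np.random.default_rng(0)
X_a = rng.random((1000, 10)); y_a = (X_a[:, 0] + 0.1 * rng.random(1000) > 0.5).astype(int)
X_b = rng.random((800, 10));  y_b = (X_b[:, 3] + 0.1 * rng.random(800) > 0.5).astype(int)
print(cross_dataset_eval(X_a, y_a, X_b, y_b))   # high in-domain, near-chance cross-domain
```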

[LG-31] Edge Caching Optimization with PPO and Transfer Learning for Dynamic Environments

链接: https://arxiv.org/abs/2411.09812
作者: Farnaz Niknia,Ping Wang
关键词-EN: rising traffic loads, traffic loads strain, loads strain backhaul, strain backhaul links, Proximal Policy Optimization
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:This paper addresses the challenge of edge caching in dynamic environments, where rising traffic loads strain backhaul links and core networks. We propose a Proximal Policy Optimization (PPO)-based caching strategy that fully incorporates key file attributes such as size, lifetime, importance, and popularity, while also considering random file request arrivals, reflecting more realistic edge caching scenarios. In dynamic environments, changes such as shifts in content popularity and variations in request rates frequently occur, making previously learned policies less effective as they were optimized for earlier conditions. Without adaptation, caching efficiency and response times can degrade. While learning a new policy from scratch in a new environment is an option, it is highly inefficient and computationally expensive. Thus, adapting an existing policy to these changes is critical. To address this, we develop a mechanism that detects changes in content popularity and request rates, ensuring timely adjustments to the caching strategy. We also propose a transfer learning-based PPO algorithm that accelerates convergence in new environments by leveraging prior knowledge. Simulation results demonstrate the significant effectiveness of our approach, outperforming a recent Deep Reinforcement Learning (DRL)-based method.
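
The abstract states that changes in content popularity and request rates are detected to trigger adaptation, but not how. The sketch below is one plausible, assumed mechanism: a sliding-window test on the request rate that, when triggered, would warm-start PPO fine-tuning rather than retraining from scratch.

```python
import numpy as np

def detect_rate_shift(arrivals, window=200, z_thresh=4.0):
    """Minimal request-rate change detector (illustrative, not the paper's
    mechanism): compare the most recent window against the preceding one
    with a z-like statistic and signal when the gap is large."""
    if len(arrivals) < 2 * window:
        return False
    recent = np.array(arrivals[-window:])
    previous = np.array(arrivals[-2 * window:-window])
    pooled_std = np.sqrt((recent.var() + previous.var()) / window) + 1e-9
    return abs(recent.mean() - previous.mean()) / pooled_std > z_thresh

# toy stream: requests per second jump from ~5 to ~15 halfway through
rng = np.random.default_rng(0)
stream = list(rng.poisson(5, 400)) + list(rng.poisson(15, 400))
for t in range(400, len(stream)):
    if detect_rate_shift(stream[:t]):
        print(f"shift detected at t={t}; warm-start PPO fine-tuning here")
        break
```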

[LG-32] Fair Resource Allocation in Weakly Coupled Markov Decision Processes

链接: https://arxiv.org/abs/2411.09804
作者: Xiaohui Tu,Yossiri Adulyasak,Nima Akbarzadeh,Erick Delage
关键词-EN: Markov decision processes, sub-Markov decision processes, coupled Markov decision, weakly coupled Markov, fair resource allocation
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We consider fair resource allocation in sequential decision-making environments modeled as weakly coupled Markov decision processes, where resource constraints couple the action spaces of N sub-Markov decision processes (sub-MDPs) that would otherwise operate independently. We adopt a fairness definition using the generalized Gini function instead of the traditional utilitarian (total-sum) objective. After introducing a general but computationally prohibitive solution scheme based on linear programming, we focus on the homogeneous case where all sub-MDPs are identical. For this case, we show for the first time that the problem reduces to optimizing the utilitarian objective over the class of “permutation invariant” policies. This result is particularly useful as we can exploit Whittle index policies in the restless bandits setting while, for the more general setting, we introduce a count-proportion-based deep reinforcement learning approach. Finally, we validate our theoretical findings with comprehensive experiments, confirming the effectiveness of our proposed method in achieving fairness.
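
The generalized Gini social-evaluation function referenced in the abstract scores a vector of per-sub-MDP returns by sorting them and applying non-increasing weights, so the worst-off components count most. A small sketch follows; the specific weight vector is an illustrative choice, not the paper's.

```python
import numpy as np

def generalized_gini_welfare(returns, w=None):
    """Generalized Gini evaluation: weighted sum of returns sorted in
    increasing order with non-increasing weights (here 1/k by default)."""
    x = np.sort(np.asarray(returns, dtype=float))   # ascending: worst first
    if w is None:
        w = 1.0 / np.arange(1, len(x) + 1)          # decreasing weights
    w = np.asarray(w, dtype=float)
    return float(np.dot(w, x) / w.sum())            # normalized for readability

print(generalized_gini_welfare([10.0, 10.0, 10.0]))   # equal allocation scores higher
print(generalized_gini_welfare([30.0, 0.0, 0.0]))     # same total, less fair, lower score
```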

[LG-33] Modeling human decomposition: a Bayesian approach

链接: https://arxiv.org/abs/2411.09802
作者: D. Hudson Smith,Noah Nisbet,Carl Ehrett,Cristina I. Tica,Madeline M. Atwell,Katherine E. Weisensee
关键词-EN: individualistic variables affect, Environmental and individualistic, individualistic variables, model, PMI
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Environmental and individualistic variables affect the rate of human decomposition in complex ways. These effects complicate the estimation of the postmortem interval (PMI) based on observed decomposition characteristics. In this work, we develop a generative probabilistic model for decomposing human remains based on PMI and a wide range of environmental and individualistic variables. This model explicitly represents the effect of each variable, including PMI, on the appearance of each decomposition characteristic, allowing for direct interpretation of model effects and enabling the use of the model for PMI inference and optimal experimental design. In addition, the probabilistic nature of the model allows for the integration of expert knowledge in the form of prior distributions. We fit this model to a diverse set of 2,529 cases from the GeoFOR dataset. We demonstrate that the model accurately predicts 24 decomposition characteristics with an ROC AUC score of 0.85. Using Bayesian inference techniques, we invert the decomposition model to predict PMI as a function of the observed decomposition characteristics and environmental and individualistic variables, producing an R-squared measure of 71%. Finally, we demonstrate how to use the fitted model to design future experiments that maximize the expected amount of new information about the mechanisms of decomposition using the Expected Information Gain formalism.

[LG-34] Combining Machine Learning Defenses without Conflicts

链接: https://arxiv.org/abs/2411.09776
作者: Vasisht Duddu,Rui Zhang,N. Asokan
关键词-EN: Machine learning, defenses, combining multiple defenses, multiple defenses, effective
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Machine learning (ML) defenses protect against various risks to security, privacy, and fairness. Real-life models need simultaneous protection against multiple different risks which necessitates combining multiple defenses. But combining defenses with conflicting interactions in an ML model can be ineffective, incurring a significant drop in the effectiveness of one or more defenses being combined. Practitioners need a way to determine if a given combination can be effective. Experimentally identifying effective combinations can be time-consuming and expensive, particularly when multiple defenses need to be combined. We need an inexpensive, easy-to-use combination technique to identify effective combinations. Ideally, a combination technique should be (a) accurate (correctly identifies whether a combination is effective or not), (b) scalable (allows combining multiple defenses), (c) non-invasive (requires no change to the defenses being combined), and (d) general (is applicable to different types of defenses). Prior works have identified several ad-hoc techniques but none satisfy all the requirements above. We propose a principled combination technique, Def\Con, to identify effective defense combinations. Def\Con meets all requirements, achieving 90% accuracy on eight combinations explored in prior work and 81% in 30 previously unexplored combinations that we empirically evaluate in this paper.

[LG-35] Beyond Static Tools: Evaluating Large Language Models for Cryptographic Misuse Detection

链接: https://arxiv.org/abs/2411.09772
作者: Zohaib Masood(1),Miguel Vargas Martin(1) ((1) Ontario Tech University)
关键词-EN: Large Language Models, including security-critical tasks, Large Language, developers increasingly relying, Language Models
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The use of Large Language Models (LLMs) in software development is rapidly growing, with developers increasingly relying on these models for coding assistance, including security-critical tasks. Our work presents a comprehensive comparison between traditional static analysis tools for cryptographic API misuse detection (CryptoGuard, CogniCrypt, and Snyk Code) and the LLMs GPT and Gemini. Using benchmark datasets (OWASP, CryptoAPI, and MASC), we evaluate the effectiveness of each tool in identifying cryptographic misuses. Our findings show that GPT-4o-mini surpasses current state-of-the-art static analysis tools on the CryptoAPI and MASC datasets, though it lags on the OWASP dataset. Additionally, we assess the quality of LLM responses to determine which models provide actionable and accurate advice, giving developers insights into their practical utility for secure coding. This study highlights the comparative strengths and limitations of static analysis versus LLM-driven approaches, offering valuable insights into the evolving role of AI in advancing software security practices.

[LG-36] Modeling AdaGrad, RMSProp, and Adam with Integro-Differential Equations

链接: https://arxiv.org/abs/2411.09734
作者: Carlos Heredia
关键词-EN: first-order integro-differential equations, Adam optimization algorithms, first-order integro-differential, Adam optimization, integro-differential equations
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Optimization and Control (math.OC)
*备注: 22 pages

点击查看摘要

Abstract:In this paper, we propose a continuous-time formulation for the AdaGrad, RMSProp, and Adam optimization algorithms by modeling them as first-order integro-differential equations. We perform numerical simulations of these equations to demonstrate their validity as accurate approximations of the original algorithms. Our results indicate a strong agreement between the behavior of the continuous-time models and the discrete implementations, thus providing a new perspective on the theoretical understanding of adaptive optimization methods.
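
The abstract does not reproduce the equations, so the following uses one common continuous-time analogue of Adam in which the moment estimates follow exponential-kernel dynamics; it may differ from the paper's exact integro-differential formulation. The sketch compares an Euler simulation of that model with discrete Adam on a one-dimensional quadratic.

```python
import numpy as np

def grad(theta):
    return 2.0 * (theta - 3.0)        # gradient of f(x) = (x - 3)^2

def continuous_adam(theta0, T=15.0, dt=1e-3, lr=1.0, a=10.0, b=1.0, eps=1e-8):
    """Euler simulation of an assumed continuous-time Adam analogue:
    m' = a (g - m), v' = b (g^2 - v), theta' = -lr m / (sqrt(v) + eps)."""
    theta, m, v = theta0, 0.0, 0.0
    for _ in range(int(T / dt)):
        g = grad(theta)
        m += dt * a * (g - m)
        v += dt * b * (g * g - v)
        theta -= dt * lr * m / (np.sqrt(v) + eps)
    return theta

def discrete_adam(theta0, steps=2000, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    theta, m, v = theta0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(theta)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        mhat, vhat = m / (1 - b1 ** t), v / (1 - b2 ** t)
        theta -= lr * mhat / (np.sqrt(vhat) + eps)
    return theta

print(continuous_adam(10.0), discrete_adam(10.0))   # both should approach 3
```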

[LG-37] To bootstrap or to rollout? An optimal and adaptive interpolation

链接: https://arxiv.org/abs/2411.09731
作者: Wenlong Mou,Jian Qian
关键词-EN: subgraph Bellman operators, subgraph Bellman, Bootstrapping and rollout, Bellman operators, subgraph Bellman estimators
类目: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Bootstrapping and rollout are two fundamental principles for value function estimation in reinforcement learning (RL). We introduce a novel class of Bellman operators, called subgraph Bellman operators, that interpolate between bootstrapping and rollout methods. Our estimator, derived by solving the fixed point of the empirical subgraph Bellman operator, combines the strengths of the bootstrapping-based temporal difference (TD) estimator and the rollout-based Monte Carlo (MC) methods. Specifically, the error upper bound of our estimator approaches the optimal variance achieved by TD, with an additional term depending on the exit probability of a selected subset of the state space. At the same time, the estimator exhibits the finite-sample adaptivity of MC, with sample complexity depending only on the occupancy measure of this subset. We complement the upper bound with an information-theoretic lower bound, showing that the additional term is unavoidable given a reasonable sample size. Together, these results establish subgraph Bellman estimators as an optimal and adaptive framework for reconciling TD and MC methods in policy evaluation.
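
As a rough intuition for interpolating between bootstrapping and rollout (not the paper's fixed-point estimator), the tabular sketch below bootstraps with TD targets whenever the next state stays inside a chosen subset and otherwise backs up the full Monte Carlo return; the toy chain MDP and all constants are assumptions.

```python
import numpy as np

def subset_td_mc_eval(trajectories, subset, gamma=0.95, alpha=0.1, n_states=10):
    """Hybrid policy evaluation sketch: TD(0) inside `subset`, Monte Carlo
    rollout return once a transition exits the subset."""
    V = np.zeros(n_states)
    for traj in trajectories:                  # traj: list of (s, r, s_next)
        G, returns = 0.0, [0.0] * len(traj)    # rollout returns G_t
        for t in reversed(range(len(traj))):
            G = traj[t][1] + gamma * G
            returns[t] = G
        for t, (s, r, s_next) in enumerate(traj):
            if s_next in subset:               # bootstrap (TD) target
                target = r + gamma * V[s_next]
            else:                              # rollout (MC) target
                target = returns[t]
            V[s] += alpha * (target - V[s])
    return V

# toy chain MDP: states 0..9, reward 1 on reaching the terminal state 9
rng = np.random.default_rng(0)
trajs = []
for _ in range(500):
    s, traj = 0, []
    while s != 9:
        s_next = min(9, s + rng.integers(1, 3))
        traj.append((s, 1.0 if s_next == 9 else 0.0, s_next))
        s = s_next
    trajs.append(traj)
print(subset_td_mc_eval(trajs, subset={0, 1, 2, 3, 4}))
```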

[LG-38] Physics-informed neural networks (PINNs) for numerical model error approximation and superresolution

链接: https://arxiv.org/abs/2411.09728
作者: Bozhou Zhuang,Sashank Rana,Brandon Jones,Danny Smyl
关键词-EN: finite element analysis, Numerical modeling errors, model, model errors, finite element
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Computation (stat.CO)
*备注:

点击查看摘要

Abstract:Numerical modeling errors are unavoidable in finite element analysis. The presence of model errors inherently reflects both model accuracy and uncertainty. To date there have been few methods for explicitly quantifying errors at points of interest (e.g. at finite element nodes). The lack of explicit model error approximators has been addressed recently with the emergence of machine learning (ML), which closes the loop between numerical model features/solutions and explicit model error approximations. In this paper, we propose physics-informed neural networks (PINNs) for simultaneous numerical model error approximation and superresolution. To test our approach, numerical data was generated using finite element simulations on a two-dimensional elastic plate with a central opening. Four- and eight-node quadrilateral elements were used in the discretization to represent the reduced-order and higher-order models, respectively. It was found that the developed PINNs effectively predict model errors in both x and y displacement fields with small differences between predictions and ground truth. Our findings demonstrate that the integration of physics-informed loss functions enables neural networks (NNs) to surpass a purely data-driven approach for approximating model errors.
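
A minimal sketch of the general physics-informed loss pattern (data misfit plus an autograd-computed PDE residual) follows; the elastic-plate equations, element types, and error targets from the paper are replaced here by a simple Laplace residual and random placeholder data.

```python
import torch
import torch.nn as nn

class PINN(nn.Module):
    """Tiny fully connected network mapping coordinates (x, y) to a scalar field."""
    def __init__(self, width=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, width), nn.Tanh(),
            nn.Linear(width, width), nn.Tanh(),
            nn.Linear(width, 1))
    def forward(self, xy):
        return self.net(xy)

def physics_informed_loss(model, xy_data, u_data, xy_colloc, lam=1.0):
    # data-fit term on labeled nodes (e.g., coarse FE solutions or errors)
    data_loss = torch.mean((model(xy_data) - u_data) ** 2)
    # PDE residual term on collocation points via autograd (Laplace(u) = 0 stand-in)
    xy = xy_colloc.clone().requires_grad_(True)
    u = model(xy)
    grads = torch.autograd.grad(u.sum(), xy, create_graph=True)[0]
    u_xx = torch.autograd.grad(grads[:, 0].sum(), xy, create_graph=True)[0][:, 0]
    u_yy = torch.autograd.grad(grads[:, 1].sum(), xy, create_graph=True)[0][:, 1]
    return data_loss + lam * torch.mean((u_xx + u_yy) ** 2)

model = PINN()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
xy_d, u_d, xy_c = torch.rand(128, 2), torch.rand(128, 1), torch.rand(256, 2)
for _ in range(100):
    opt.zero_grad()
    loss = physics_informed_loss(model, xy_d, u_d, xy_c)
    loss.backward()
    opt.step()
```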

[LG-39] Early-Scheduled Handover Preparation in 5G NR Millimeter-Wave Systems

链接: https://arxiv.org/abs/2411.09720
作者: Dino Pjanić,Alexandros Sopasakis,Andres Reial,Fredrik Tufvesson
关键词-EN: cellular network driven, Early-Scheduled Handover Preparation, critical functions, cellular network, network driven
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注:

点击查看摘要

Abstract:The handover (HO) procedure is one of the most critical functions in a cellular network driven by measurements of the user channel of the serving and neighboring cells. The success rate of the entire HO procedure is significantly affected by the preparation stage. As massive Multiple-Input Multiple-Output (MIMO) systems with large antenna arrays allow resolving finer details of channel behavior, we investigate how machine learning can be applied to time series data of beam measurements in the Fifth Generation (5G) New Radio (NR) system to improve the HO procedure. This paper introduces the Early-Scheduled Handover Preparation scheme designed to enhance the robustness and efficiency of the HO procedure, particularly in scenarios involving high mobility and dense small cell deployments. Early-Scheduled Handover Preparation focuses on optimizing the timing of the HO preparation phase by leveraging machine learning techniques to predict the earliest possible trigger points for HO events. We identify a new early trigger for HO preparation and demonstrate how it can beneficially reduce the required time for HO execution reducing channel quality degradation. These insights enable a new HO preparation scheme that offers a novel, user-aware, and proactive HO decision making in MIMO scenarios incorporating mobility.

[LG-40] Residual Multi-Task Learner for Applied Ranking

链接: https://arxiv.org/abs/2411.09705
作者: Cong Fu,Kun Wang,Jiahua Wu,Yizhou Chen,Guangda Huzhang,Yabo Ni,Anxiang Zeng,Zhiming Zhou
关键词-EN: Modern e-commerce platforms, provide personalized services, e-commerce platforms rely, platforms rely heavily, diverse user feedback
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Modern e-commerce platforms rely heavily on modeling diverse user feedback to provide personalized services. Consequently, multi-task learning has become an integral part of their ranking systems. However, existing multi-task learning methods encounter two main challenges: some lack explicit modeling of task relationships, resulting in inferior performance, while others have limited applicability due to being computationally intensive, having scalability issues, or relying on strong assumptions. To address these limitations and better fit our real-world scenario, pre-rank in Shopee Search, we introduce in this paper ResFlow, a lightweight multi-task learning framework that enables efficient cross-task information sharing via residual connections between corresponding layers of task networks. Extensive experiments on datasets from various scenarios and modalities demonstrate its superior performance and adaptability over state-of-the-art methods. The online A/B tests in Shopee Search showcase its practical value in large-scale industrial applications, evidenced by a 1.29% increase in OPU (order-per-user) without additional system latency. ResFlow is now fully deployed in the pre-rank module of Shopee Search. To facilitate efficient online deployment, we propose a novel offline metric Weighted Recall@K, which aligns well with our online metric OPU, addressing the longstanding online-offline metric misalignment issue. Besides, we propose to fuse scores from the multiple tasks additively when ranking items, which outperforms traditional multiplicative fusion. The code is released at this https URL
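
The abstract proposes additive score fusion across tasks and a Weighted Recall@K offline metric without giving formulas, so the sketch below is an assumed reading: fused scores are a weighted sum of per-task scores, and Weighted Recall@K is the fraction of total relevance weight captured in the top-k items.

```python
import numpy as np

def fuse_scores_additive(task_scores, task_weights):
    """Additive fusion of per-task ranking scores (the weights are illustrative)."""
    return sum(w * s for w, s in zip(task_weights, task_scores))

def weighted_recall_at_k(scores, relevance_weights, k):
    """Assumed Weighted Recall@K: share of total relevance weight (e.g.,
    order value per item) ranked within the top-k positions."""
    top_k = np.argsort(-scores)[:k]
    return relevance_weights[top_k].sum() / (relevance_weights.sum() + 1e-12)

rng = np.random.default_rng(0)
click_scores, order_scores = rng.random(1000), rng.random(1000)
fused = fuse_scores_additive([click_scores, order_scores], [0.3, 0.7])
order_weights = (rng.random(1000) < 0.05).astype(float)   # items that led to orders
print(weighted_recall_at_k(fused, order_weights, k=100))
```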

[LG-41] The Spatial Complexity of Optical Computing and How to Reduce It

链接: https://arxiv.org/abs/2411.10435
作者: Yandong Li,Francesco Monticone
关键词-EN: hardware requires resources, Similar to algorithms, memory to run, hardware requires, consume time
类目: Optics (physics.optics); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Similar to algorithms, which consume time and memory to run, hardware requires resources to function. For devices processing physical waves, implementing operations needs sufficient “space,” as dictated by wave physics. How much space is needed to perform a certain function is a fundamental question in optics, with recent research addressing it for given mathematical operations, but not for more general computing tasks, e.g., classification. Inspired by computational complexity theory, we study the “spatial complexity” of optical computing systems in terms of scaling laws - specifically, how their physical dimensions must scale as the dimension of the mathematical operation increases - and propose a new paradigm for designing optical computing systems: space-efficient neuromorphic optics, based on structural sparsity constraints and neural pruning methods motivated by wave physics (notably, the concept of “overlapping nonlocality”). On two mainstream platforms, free-space optics and on-chip integrated photonics, our methods demonstrate substantial size reductions (to 1%-10% the size of conventional designs) with minimal compromise on performance. Our theoretical and computational results reveal a trend of diminishing returns on accuracy as structure dimensions increase, providing a new perspective for interpreting and approaching the ultimate limits of optical computing - a balanced trade-off between device size and accuracy.

[LG-42] Fused Gromov-Wasserstein Variance Decomposition with Linear Optimal Transport

链接: https://arxiv.org/abs/2411.10204
作者: Michael Wilson,Tom Needham,Anuj Srivastava
关键词-EN: Wasserstein distances form, Linear Optimal Transport, Wasserstein spaces, LOT embeddings, Wasserstein distances
类目: Methodology (stat.ME); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Wasserstein distances form a family of metrics on spaces of probability measures that have recently seen many applications. However, statistical analysis in these spaces is complex due to the nonlinearity of Wasserstein spaces. One potential solution to this problem is Linear Optimal Transport (LOT). This method allows one to find a Euclidean embedding, called LOT embedding, of measures in some Wasserstein spaces, but some information is lost in this embedding. So, to understand whether statistical analysis relying on LOT embeddings can make valid inferences about original data, it is helpful to quantify how well these embeddings describe that data. To answer this question, we present a decomposition of the Fréchet variance of a set of measures in the 2-Wasserstein space, which allows one to compute the percentage of variance explained by LOT embeddings of those measures. We then extend this decomposition to the Fused Gromov-Wasserstein setting. We also present several experiments that explore the relationship between the dimension of the LOT embedding, the percentage of variance explained by the embedding, and the classification accuracy of machine learning classifiers built on the embedded data. We use the MNIST handwritten digits dataset, IMDB-50000 dataset, and Diffusion Tensor MRI images for these experiments. Our results illustrate the effectiveness of low dimensional LOT embeddings in terms of the percentage of variance explained and the classification accuracy of models built on the embedded data.
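
In one dimension, optimal transport maps are compositions of quantile functions, so the LOT embedding and the fraction of variance it explains can be sketched with plain NumPy; the Fused Gromov-Wasserstein extension and higher-dimensional solvers are omitted, and grid sizes are arbitrary.

```python
import numpy as np

def lot_embed_1d(sample_sets, reference, n_quantiles=200):
    """LOT embedding for 1-D measures: the OT map from the reference to each
    measure is its quantile function evaluated on a fixed grid; subtracting
    the reference quantiles gives a Euclidean vector per measure."""
    qs = (np.arange(n_quantiles) + 0.5) / n_quantiles
    ref_q = np.quantile(reference, qs)
    return np.stack([np.quantile(x, qs) - ref_q for x in sample_sets])

def variance_explained(sample_sets, embeddings):
    """Ratio of summed squared embedding distances to summed squared
    2-Wasserstein distances; in 1-D the embedding is exact, so this is ~1."""
    qs = (np.arange(embeddings.shape[1]) + 0.5) / embeddings.shape[1]
    Q = np.stack([np.quantile(x, qs) for x in sample_sets])
    w2_sq = ((Q[:, None, :] - Q[None, :, :]) ** 2).mean(-1)
    emb_sq = ((embeddings[:, None, :] - embeddings[None, :, :]) ** 2).mean(-1)
    return emb_sq.sum() / (w2_sq.sum() + 1e-12)

rng = np.random.default_rng(0)
measures = [rng.normal(loc=mu, scale=1.0, size=500) for mu in np.linspace(-2, 2, 10)]
reference = rng.normal(size=500)
emb = lot_embed_1d(measures, reference)
print(variance_explained(measures, emb))   # ~1.0 in the 1-D case
```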

[LG-43] Continuous Bayesian Model Selection for Multivariate Causal Discovery

链接: https://arxiv.org/abs/2411.10154
作者: Anish Dhir,Ruby Sedgwick,Avinash Kori,Ben Glocker,Mark van der Wilk
关键词-EN: ensure structure identifiability, approaches require restrictive, Current causal discovery, Bayesian model selection, discovery approaches require
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Current causal discovery approaches require restrictive model assumptions or assume access to interventional data to ensure structure identifiability. These assumptions often do not hold in real-world applications leading to a loss of guarantees and poor accuracy in practice. Recent work has shown that, in the bivariate case, Bayesian model selection can greatly improve accuracy by exchanging restrictive modelling for more flexible assumptions, at the cost of a small probability of error. We extend the Bayesian model selection approach to the important multivariate setting by making the large discrete selection problem scalable through a continuous relaxation. We demonstrate how for our choice of Bayesian non-parametric model, the Causal Gaussian Process Conditional Density Estimator (CGP-CDE), an adjacency matrix can be constructed from the model hyperparameters. This adjacency matrix is then optimised using the marginal likelihood and an acyclicity regulariser, outputting the maximum a posteriori causal graph. We demonstrate the competitiveness of our approach on both synthetic and real-world datasets, showing it is possible to perform multivariate causal discovery without infeasible assumptions using Bayesian model selection.

[LG-44] BONE: a unifying framework for Bayesian online learning in non-stationary environments

链接: https://arxiv.org/abs/2411.10153
作者: Gerardo Duran-Martin,Leandro Sánchez-Betancourt,Alexander Y. Shestopaloff,Kevin Murphy
关键词-EN: perform Bayesian online, Bayesian online learning, perform Bayesian, Bayesian online, non-stationary environments
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose a unifying framework for methods that perform Bayesian online learning in non-stationary environments. We call the framework BONE, which stands for (B)ayesian (O)nline learning in (N)on-stationary (E)nvironments. BONE provides a common structure to tackle a variety of problems, including online continual learning, prequential forecasting, and contextual bandits. The framework requires specifying three modelling choices: (i) a model for measurements (e.g., a neural network), (ii) an auxiliary process to model non-stationarity (e.g., the time since the last changepoint), and (iii) a conditional prior over model parameters (e.g., a multivariate Gaussian). The framework also requires two algorithmic choices, which we use to carry out approximate inference under this framework: (i) an algorithm to estimate beliefs (posterior distribution) about the model parameters given the auxiliary variable, and (ii) an algorithm to estimate beliefs about the auxiliary variable. We show how this modularity allows us to write many different existing methods as instances of BONE; we also use this framework to propose a new method. We then experimentally compare existing methods with our proposed new method on several datasets; we provide insights into the situations that make one method more suitable than another for a given task.

[LG-45] DaYu: Data-Driven Model for Geostationary Satellite Observed Cloud Images Forecasting

链接: https://arxiv.org/abs/2411.10144
作者: Xujun Wei,Feng Zhang,Renhe Zhang,Wenwen Li,Cuiping Liu,Bin Guo,Jingwei Li,Haoyang Fu,Xu Tang
关键词-EN: Artificial Intelligence, widely demonstrated strong, demonstrated strong competitiveness, past few years, weather forecasting systems
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In the past few years, Artificial Intelligence (AI)-based weather forecasting methods have widely demonstrated strong competitiveness among the weather forecasting systems. However, these methods are insufficient for high-spatial-resolution short-term nowcasting within 6 hours, which is crucial for warning short-duration, mesoscale and small-scale weather events. Geostationary satellite remote sensing provides detailed, high spatio-temporal and all-day observations, which can address the above limitations of existing methods. Therefore, this paper proposed an advanced data-driven thermal infrared cloud images forecasting model, “DaYu.” Unlike existing data-driven weather forecasting models, DaYu is specifically designed for geostationary satellite observations, with a temporal resolution of 0.5 hours and a spatial resolution of 0.05^\circ \times 0.05^\circ . DaYu is based on a large-scale transformer architecture, which enables it to capture fine-grained cloud structures and learn fast-changing spatio-temporal evolution features effectively. Moreover, its attention mechanism design achieves a balance in computational complexity, making it practical for applications. DaYu not only achieves accurate forecasts up to 3 hours with a correlation coefficient higher than 0.9, 6 hours higher than 0.8, and 12 hours higher than 0.7, but also detects short-duration, mesoscale, and small-scale weather events with enhanced detail, effectively addressing the shortcomings of existing methods in providing detailed short-term nowcasting within 6 hours. Furthermore, DaYu has significant potential in short-term climate disaster prevention and mitigation.

[LG-46] On the Universal Statistical Consistency of Expansive Hyperbolic Deep Convolutional Neural Networks

链接: https://arxiv.org/abs/2411.10128
作者: Sagar Ghosh,Kushal Bose,Swagatam Das
关键词-EN: accomplishing widespread applications, deep neural networks, Convolutional Neural Networks, Neural Networks, Deep Convolutional Neural
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The emergence of Deep Convolutional Neural Networks (DCNNs) has been a pervasive tool for accomplishing widespread applications in computer vision. Despite its potential capability to capture intricate patterns inside the data, the underlying embedding space remains Euclidean and primarily pursues contractive convolution. Several instances can serve as a precedent for the exacerbating performance of DCNNs. The recent advancement of neural networks in the hyperbolic spaces gained traction, incentivizing the development of convolutional deep neural networks in the hyperbolic space. In this work, we propose Hyperbolic DCNN based on the Poincaré Disc. The work predominantly revolves around analyzing the nature of expansive convolution in the context of the non-Euclidean domain. We further offer extensive theoretical insights pertaining to the universal consistency of the expansive convolution in the hyperbolic space. Several simulations were performed not only on the synthetic datasets but also on some real-world datasets. The experimental results reveal that the hyperbolic convolutional architecture outperforms the Euclidean ones by a commendable margin.
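
Hyperbolic convolutional layers are typically assembled from Möbius operations on the Poincaré ball. The snippet below shows the standard Möbius addition and origin exponential map, which are generic building blocks rather than the authors' expansive convolution layer.

```python
import numpy as np

def mobius_add(x, y, c=1.0):
    """Mobius addition on the Poincare ball with curvature -c (standard formula)."""
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    den = 1 + 2 * c * xy + (c ** 2) * x2 * y2
    return num / den

def expmap0(v, c=1.0):
    """Exponential map at the origin: lifts a Euclidean (tangent) vector onto
    the Poincare ball, the usual way features enter hyperbolic layers."""
    norm = np.linalg.norm(v) + 1e-12
    return np.tanh(np.sqrt(c) * norm) * v / (np.sqrt(c) * norm)

x = expmap0(np.array([0.3, -0.2]))
y = expmap0(np.array([0.1, 0.4]))
z = mobius_add(x, y)
print(z, np.linalg.norm(z) < 1.0)   # the result stays inside the unit disc
```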

[LG-47] Energy-GNoME: A Living Database of Selected Materials for Energy Applications

链接: https://arxiv.org/abs/2411.10125
作者: Paolo De Angelis,Giovanni Trezza,Giulio Barletta,Pietro Asinari,Eliodoro Chiavazzo
关键词-EN: Artificial Intelligence, driving significant advancements, science is driving, driving significant, significant advancements
类目: Materials Science (cond-mat.mtrl-sci); Other Condensed Matter (cond-mat.other); Machine Learning (cs.LG)
*备注: 60 pages, 16 figures

点击查看摘要

Abstract:Artificial Intelligence (AI) in materials science is driving significant advancements in the discovery of advanced materials for energy applications. The recent GNoME protocol identifies over 380,000 novel stable crystals. From this, we identify over 33,000 materials with potential as energy materials forming the Energy-GNoME database. Leveraging Machine Learning (ML) and Deep Learning (DL) tools, our protocol mitigates cross-domain data bias using feature spaces to identify potential candidates for thermoelectric materials, novel battery cathodes, and novel perovskites. Classifiers with both structural and compositional features identify domains of applicability, where we expect enhanced accuracy of the regressors. Such regressors are trained to predict key materials properties like, thermoelectric figure of merit (zT), band gap (Eg), and cathode voltage ( \Delta V_c ). This method significantly narrows the pool of potential candidates, serving as an efficient guide for experimental and computational chemistry investigations and accelerating the discovery of materials suited for electricity generation, energy storage and conversion.

[LG-48] Recent Advances on Machine Learning-aided DSP for Short-reach and Long-haul Optical Communications

链接: https://arxiv.org/abs/2411.10101
作者: Laurent Schmalen,Vincent Lauinger,Jonas Ney,Norbert Wehn,Patrick Matalla,Sebastian Randel,Alexander von Bank,Eike-Manuel Edelmann
关键词-EN: optical communications, highlight recent advances, machine learning, learning for implementing, implementing equalizers
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: paper accompanying an invited presentation at OFC 2025

点击查看摘要

Abstract:In this paper, we highlight recent advances in the use of machine learning for implementing equalizers for optical communications. We highlight both algorithmic advances as well as implementation aspects using conventional and neuromorphic hardware.

[LG-49] Adaptive Physics-Guided Neural Network

链接: https://arxiv.org/abs/2411.10064
作者: David Shulman,Itai Dattner
关键词-EN: predicting quality attributes, physics-guided neural network, integrating physical laws, neural network, framework for predicting
类目: Methodology (stat.ME); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper introduces an adaptive physics-guided neural network (APGNN) framework for predicting quality attributes from image data by integrating physical laws into deep learning models. The APGNN adaptively balances data-driven and physics-informed predictions, enhancing model accuracy and robustness across different environments. Our approach is evaluated on both synthetic and real-world datasets, with comparisons to conventional data-driven models such as ResNet. For the synthetic data, 2D domains were generated using three distinct governing equations: the diffusion equation, the advection-diffusion equation, and the Poisson equation. Non-linear transformations were applied to these domains to emulate complex physical processes in image form. In real-world experiments, the APGNN consistently demonstrated superior performance in the diverse thermal image dataset. On the cucumber dataset, characterized by low material diversity and controlled conditions, APGNN and PGNN showed similar performance, both outperforming the data-driven ResNet. However, in the more complex thermal dataset, particularly for outdoor materials with higher environmental variability, APGNN outperformed both PGNN and ResNet by dynamically adjusting its reliance on physics-based versus data-driven insights. This adaptability allowed APGNN to maintain robust performance across structured, low-variability settings and more heterogeneous scenarios. These findings underscore the potential of adaptive physics-guided learning to integrate physical constraints effectively, even in challenging real-world contexts with diverse environmental conditions.
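
How APGNN balances the two prediction sources is not detailed in the abstract; the sketch below is an assumed minimal realization in which a learned, input-dependent gate mixes a data-driven branch with a placeholder physics prior.

```python
import torch
import torch.nn as nn

class AdaptivePhysicsGuidedNet(nn.Module):
    """Illustrative adaptively weighted physics-guided predictor (not the
    actual APGNN architecture): a gate in [0, 1] decides how much to rely
    on the data-driven branch versus a physics-based prior."""
    def __init__(self, in_dim=16, hidden=64):
        super().__init__()
        self.data_branch = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.gate = nn.Sequential(nn.Linear(in_dim, 1), nn.Sigmoid())

    def physics_prior(self, x):
        # placeholder physics model; a real one would encode a governing equation
        return x.mean(dim=1, keepdim=True)

    def forward(self, x):
        alpha = self.gate(x)      # ~1: trust data branch, ~0: trust physics prior
        return alpha * self.data_branch(x) + (1 - alpha) * self.physics_prior(x)

model = AdaptivePhysicsGuidedNet()
y_hat = model(torch.rand(8, 16))
print(y_hat.shape)   # torch.Size([8, 1])
```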

[LG-50] Dense ReLU Neural Networks for Temporal-spatial Model

链接: https://arxiv.org/abs/2411.09961
作者: Zhi Zhang,Carlos Misael Madrid Padilla,Xiaokai Luo,Oscar Hernan Madrid Padilla,Daren Wang
关键词-EN: Rectified Linear Unit, Linear Unit, Rectified Linear, fully connected deep, utilizing the Rectified
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:In this paper, we focus on fully connected deep neural networks utilizing the Rectified Linear Unit (ReLU) activation function for nonparametric estimation. We derive non-asymptotic bounds that lead to convergence rates, addressing both temporal and spatial dependence in the observed measurements. By accounting for dependencies across time and space, our models better reflect the complexities of real-world data, enhancing both predictive performance and theoretical robustness. We also tackle the curse of dimensionality by modeling the data on a manifold, exploring the intrinsic dimensionality of high-dimensional data. We broaden existing theoretical findings of temporal-spatial analysis by applying them to neural networks in more general contexts and demonstrate that our proof techniques are effective for models with short-range dependence. Our empirical simulations across various synthetic response functions underscore the superior performance of our method, outperforming established approaches in the existing literature. These findings provide valuable insights into the strong capabilities of dense neural networks for temporal-spatial modeling across a broad range of function classes.

[LG-51] Revealing the Evolution of Order in Materials Microstructures Using Multi-Modal Computer Vision

链接: https://arxiv.org/abs/2411.09896
作者: Arman Ter-Petrosyan,Michael Holden,Jenna A. Bilbrey,Sarah Akers,Christina Doty,Kayla H. Yano,Le Wang,Rajendra Paudel,Eric Lang,Khalid Hattar,Ryan B. Comes,Yingge Du,Bethany E. Matthews,Steven R. Spurgeon
关键词-EN: extreme environments depends, direct property-defining microstructural, property-defining microstructural order, energy storage, materials for microelectronics
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注: 30 pages, 5 figures, 2 tables

点击查看摘要

Abstract:The development of high-performance materials for microelectronics, energy storage, and extreme environments depends on our ability to describe and direct property-defining microstructural order. Our present understanding is typically derived from laborious manual analysis of imaging and spectroscopy data, which is difficult to scale, challenging to reproduce, and lacks the ability to reveal latent associations needed for mechanistic models. Here, we demonstrate a multi-modal machine learning (ML) approach to describe order from electron microscopy analysis of the complex oxide La_{1-x}Sr_xFeO_3. We construct a hybrid pipeline based on fully and semi-supervised classification, allowing us to evaluate both the characteristics of each data modality and the value each modality adds to the ensemble. We observe distinct differences in the performance of uni- and multi-modal models, from which we draw general lessons in describing crystal order using computer vision.

[LG-52] SymbolFit: Automatic Parametric Modeling with Symbolic Regression

链接: https://arxiv.org/abs/2411.09851
作者: Ho Fung Tsoi,Dylan Rankin,Cecile Caillol,Miles Cranmer,Sridhara Dasu,Javier Duarte,Philip Harris,Elliot Lipeles,Vladimir Loncar
关键词-EN: simultaneously providing uncertainty, providing uncertainty estimates, automates parametric modeling, introduce SymbolFit, single run
类目: High Energy Physics - Experiment (hep-ex); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
*备注: 53 pages, 35 figures. Under review

点击查看摘要

Abstract:We introduce SymbolFit, a framework that automates parametric modeling by using symbolic regression to perform a machine-search for functions that fit the data, while simultaneously providing uncertainty estimates in a single run. Traditionally, constructing a parametric model to accurately describe binned data has been a manual and iterative process, requiring an adequate functional form to be determined before the fit can be performed. The main challenge arises when the appropriate functional forms cannot be derived from first principles, especially when there is no underlying true closed-form function for the distribution. In this work, we address this problem by utilizing symbolic regression, a machine learning technique that explores a vast space of candidate functions without needing a predefined functional form, treating the functional form itself as a trainable parameter. Our approach is demonstrated in data analysis applications in high-energy physics experiments at the CERN Large Hadron Collider (LHC). We demonstrate its effectiveness and efficiency using five real proton-proton collision datasets from new physics searches at the LHC, namely the background modeling in resonance searches for high-mass dijet, trijet, paired-dijet, diphoton, and dimuon events. We also validate the framework using several toy datasets with one and more variables.

[LG-53] Can EEG resting state data benefit data-driven approaches for motor-imagery decoding?

链接: https://arxiv.org/abs/2411.09789
作者: Rishan Mehta,Param Rajpura,Yogesh Kumar Meena
关键词-EN: reveal individual-specific traits, Resting-state EEG data, Resting-state EEG, neuroscience research serve, EEG
类目: Signal Processing (eess.SP); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

Abstract:Resting-state EEG data in neuroscience research serve as reliable markers for user identification and reveal individual-specific traits. Despite this, the use of resting-state data in EEG classification models is limited. In this work, we propose a feature concatenation approach to enhance decoding models’ generalization by integrating resting-state EEG, aiming to improve motor imagery BCI performance and develop a user-generalized model. Using feature concatenation, we combine the EEGNet model, a standard convolutional neural network for EEG signal classification, with functional connectivity measures derived from resting-state EEG data. The findings suggest that although grounded in neuroscience with data-driven learning, the concatenation approach has limited benefits for generalizing models in within-user and across-user scenarios. While an improvement in mean accuracy for within-user scenarios is observed on two datasets, concatenation doesn’t benefit across-user scenarios when compared with random data concatenation. The findings indicate the necessity of further investigation on the model interpretability and the effect of random data concatenation on model robustness.

[LG-54] Reinforced Disentanglers on Random Unitary Circuits

链接: https://arxiv.org/abs/2411.09784
作者: Ning Bao,Keiichiro Furuya,Gun Suer
关键词-EN: two-qubit gates arranged, random Clifford circuits, proximal policy optimization, von Neumann entropy, averaged von Neumann
类目: Quantum Physics (quant-ph); Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG)
*备注: 9 pages, 7 figures, 1 table. Submitted to QIP 2025

点击查看摘要

Abstract:We search for efficient disentanglers on random Clifford circuits of two-qubit gates arranged in a brick-wall pattern, using the proximal policy optimization (PPO) algorithm (Schulman et al., 2017). Disentanglers are defined as a set of projective measurements inserted between consecutive entangling layers. An efficient disentangler is a set of projective measurements that minimize the averaged von Neumann entropy of the final state with the least number of total projections possible. The problem is naturally amenable to reinforcement learning techniques by taking the binary matrix representing the projective measurements along the circuit as our state, and actions as bit flipping operations on this binary matrix that add or delete measurements at specified locations. We give rewards to our agent dependent on the averaged von Neumann entropy of the final state and the configuration of measurements, such that the agent learns the optimal policy that will take it from the initial state of no measurements to the optimal measurement state that minimizes the entanglement entropy. Our results indicate that the number of measurements required to disentangle a random quantum circuit is drastically less than the numerical results of measurement-induced phase transition papers. Additionally, the reinforcement learning procedure enables us to characterize the pattern of optimal disentanglers, which is not possible in the works of measurement-induced phase transitions.

[LG-55] Spatio-Temporal Jump Model for Urban Thermal Comfort Monitoring

链接: https://arxiv.org/abs/2411.09726
作者: Federico P. Cortese,Antonio Pievatolo
关键词-EN: cities face increasing, face increasing heat, essential for well-being, cities face, face increasing
类目: Applications (stat.AP); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Thermal comfort is essential for well-being in urban spaces, especially as cities face increasing heat from urbanization and climate change. Existing thermal comfort models usually overlook temporal dynamics alongside spatial dependencies. We address this problem by introducing a spatio-temporal jump model that clusters data with persistence across both spatial and temporal dimensions. This framework enhances interpretability, minimizes abrupt state changes, and easily handles missing data. We validate our approach through extensive simulations, demonstrating its accuracy in recovering the true underlying partition. When applied to hourly environmental data gathered from a set of weather stations located across the city of Singapore, our proposal identifies meaningful thermal comfort regimes, demonstrating its effectiveness in dynamic urban settings and suitability for real-world monitoring. The comparison of these regimes with feedback on thermal preference indicates the potential of an unsupervised approach to avoid extensive surveys.

[LG-56] Machine learning approaches to explore important features behind bird flight modes

链接: https://arxiv.org/abs/2411.09714
作者: Yukino Kawai,Tatsuya Hisada,Kozue Shiomi,Momoko Hayamizu
关键词-EN: flight styles, specific flight styles, primarily classified, classified as flapping, characterized by rapid
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*备注: 6 pages, 5 figures

点击查看摘要

Abstract:Birds exhibit a variety of flight styles, primarily classified as flapping, which is characterized by rapid up-and-down wing movements, and soaring, which involves gliding with wings outstretched. Each species usually performs specific flight styles, and this has been argued in terms of morphological and physiological adaptation. However, it remains a challenge to evaluate the contribution of each factor to the difference in flight styles. In this study, using phenotypic data from 635 migratory bird species, such as body mass, wing length, and breeding periods, we quantified the relative importance of each feature using Feature Importance and SHAP values, and used them to construct weighted L1 distance matrices and construct NJ trees. Comparison with traditional phylogenetic logistic regression revealed similarity in top-ranked features, but also differences in overall weight distributions and clustering patterns in NJ trees. Our results highlight the complexity of constructing a biologically useful distance matrix from correlated phenotypic features, while the complementary nature of these weighting methods suggests the potential utility of multi-faceted approaches to assessing feature contributions.

[LG-57] Decoding Fatigue Levels of Pilots Using EEG Signals with Hybrid Deep Neural Networks

链接: https://arxiv.org/abs/2411.09707
作者: Dae-Hyeok Lee,Sung-Jin Kim,Si-Hyun Kim
关键词-EN: pilots’ mental states, abnormal mental states, mental states, pilots’ mental, abnormal mental
类目: Signal Processing (eess.SP); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: 4 pages, 3 figures, 1 table, Name of Conference: International Winter Conference on Brain-Computer Interface

点击查看摘要

Abstract:The detection of pilots’ mental states is critical, as abnormal mental states have the potential to cause catastrophic accidents. This study demonstrates the feasibility of using deep learning techniques to classify different fatigue levels, specifically a normal state, low fatigue, and high fatigue. To the best of our knowledge, this is the first study to classify fatigue levels in pilots. Our approach employs a hybrid deep neural network comprising five convolutional blocks and one long short-term memory block to extract significant features from electroencephalography signals. Ten pilots participated in the experiment, which was conducted in a simulated flight environment. Compared to four conventional models, our proposed model achieved a superior grand-average accuracy of 0.8801, outperforming other models by at least 0.0599 in classifying fatigue levels. In addition to successfully classifying fatigue levels, our model provided valuable feedback to subjects. Therefore, we anticipate that our study will make significant contributions to the advancement of autonomous flight and driving technologies, leveraging artificial intelligence in the future.
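
Following the architectural description in the abstract (five convolutional blocks followed by one LSTM block), a hedged PyTorch sketch is given below; channel widths, kernel sizes, and input dimensions are guesses rather than the authors' configuration.

```python
import torch
import torch.nn as nn

class FatigueNet(nn.Module):
    """Sketch of a hybrid CNN-LSTM classifier for three EEG fatigue levels."""
    def __init__(self, n_channels=32, n_classes=3):
        super().__init__()
        blocks, in_ch = [], n_channels
        for out_ch in (32, 64, 64, 128, 128):          # five conv blocks
            blocks += [nn.Conv1d(in_ch, out_ch, kernel_size=5, padding=2),
                       nn.BatchNorm1d(out_ch), nn.ELU(), nn.MaxPool1d(2)]
            in_ch = out_ch
        self.cnn = nn.Sequential(*blocks)
        self.lstm = nn.LSTM(input_size=128, hidden_size=64, batch_first=True)
        self.head = nn.Linear(64, n_classes)

    def forward(self, x):              # x: (batch, EEG channels, time samples)
        feats = self.cnn(x)            # (batch, 128, time / 32)
        feats = feats.transpose(1, 2)  # (batch, time / 32, 128)
        _, (h, _) = self.lstm(feats)
        return self.head(h[-1])        # logits for normal / low / high fatigue

model = FatigueNet()
logits = model(torch.randn(4, 32, 1024))   # 4 trials, 32 channels, 1024 samples
print(logits.shape)                        # torch.Size([4, 3])
```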

信息检索

附件下载

点击下载今日全部论文列表