This blog post presents the latest list of papers retrieved from Arxiv.org on 2024-08-07. The list is updated automatically and grouped into five major areas: NLP, CV, ML, AI, and IR. If you would like to receive it by email on a regular schedule, please leave your email address in the comments.

Note: the daily paper data is retrieved from Arxiv.org and refreshed automatically every morning at around 10:30.

Friendly reminder: if you would like to receive the daily paper data by email, please leave your email address in the comments; the email is likewise sent automatically at around 10:30 every day.

Overview (2024-08-07)

A total of 312 new papers are updated today, including:

  • Natural Language Processing: 47 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 76 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 76 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 95 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] LLaVA-OneVision: Easy Visual Task Transfer

Link: https://arxiv.org/abs/2408.03326
Authors: Bo Li,Yuanhan Zhang,Dong Guo,Renrui Zhang,Feng Li,Hao Zhang,Kaichen Zhang,Yanwei Li,Ziwei Liu,Chunyuan Li
Keywords: LLaVA-NeXT blog series, open large multimodal, large multimodal models, developed by consolidating, insights into data
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Project Homepage: this https URL

Abstract:We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations in the LLaVA-NeXT blog series. Our experimental results demonstrate that LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video scenarios. Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities/scenarios, yielding new emerging capabilities. In particular, strong video understanding and cross-scenario capabilities are demonstrated through task transfer from images to videos.

[NLP-1] CoverBench: A Challenging Benchmark for Complex Claim Verification

Link: https://arxiv.org/abs/2408.03325
Authors: Alon Jacovi,Moran Ambar,Eyal Ben-David,Uri Shaham,Amir Feder,Mor Geva,Dror Marcus,Avi Caciularu
Keywords: language models’ outputs, growing line, line of research, correctness of language, language models’
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:There is a growing line of research on verifying the correctness of language models’ outputs. At the same time, LMs are being used to tackle complex queries that require reasoning. We introduce CoverBench, a challenging benchmark focused on verifying LM outputs in complex reasoning settings. Datasets that can be used for this purpose are often designed for other complex reasoning tasks (e.g., QA) targeting specific use-cases (e.g., financial tables), requiring transformations, negative sampling and selection of hard examples to collect such a benchmark. CoverBench provides a diversified evaluation for complex claim verification in a variety of domains, types of reasoning, relatively long inputs, and a variety of standardizations, such as multiple representations for tables where available, and a consistent schema. We manually vet the data for quality to ensure low levels of label noise. Finally, we report a variety of competitive baseline results to show CoverBench is challenging and has very significant headroom. The data is available at this https URL .

[NLP-2] Training LLMs to Recognize Hedges in Spontaneous Narratives

Link: https://arxiv.org/abs/2408.03319
Authors: Amie J. Paige,Adil Soubki,John Murzaku,Owen Rambow,Susan E. Brennan
Keywords: soften critical feedback, mark utterances, lack of commitment, attribute responsibility, invite input
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Amie Paige, Adil Soubki, and John Murzaku contributed equally to this study

Abstract:Hedges allow speakers to mark utterances as provisional, whether to signal non-prototypicality or “fuzziness”, to indicate a lack of commitment to an utterance, to attribute responsibility for a statement to someone else, to invite input from a partner, or to soften critical feedback in the service of face-management needs. Here we focus on hedges in an experimentally parameterized corpus of 63 Roadrunner cartoon narratives spontaneously produced from memory by 21 speakers for co-present addressees, transcribed to text (Galati and Brennan, 2010). We created a gold standard of hedges annotated by human coders (the Roadrunner-Hedge corpus) and compared three LLM-based approaches for hedge detection: fine-tuning BERT, and zero and few-shot prompting with GPT-4o and LLaMA-3. The best-performing approach was a fine-tuned BERT model, followed by few-shot GPT-4o. After an error analysis on the top performing approaches, we used an LLM-in-the-Loop approach to improve the gold standard coding, as well as to highlight cases in which hedges are ambiguous in linguistically interesting ways that will guide future research. This is the first step in our research program to train LLMs to interpret and generate collateral signals appropriately and meaningfully in conversation.
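
The fine-tuned BERT baseline described above is a standard sequence-classification setup. The sketch below shows what such a fine-tuning run could look like with Hugging Face Transformers; the example sentences, binary label scheme, and hyperparameters are illustrative assumptions rather than the paper's actual Roadrunner-Hedge configuration.

```python
# Minimal sketch of a fine-tuned-BERT hedge detector (sentence-level, binary).
import torch
from torch.utils.data import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

texts = ["It sort of rolled off the cliff, I think.", "The coyote fell."]
labels = [1, 0]  # 1 = contains a hedge, 0 = no hedge (hypothetical labels)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

class HedgeDataset(Dataset):
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="hedge-bert", num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=HedgeDataset(texts, labels),
)
trainer.train()
```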

[NLP-3] Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

Link: https://arxiv.org/abs/2408.03314
Authors: Charlie Snell,Jaehoon Lee,Kelvin Xu,Aviral Kumar
Keywords: open-ended natural language, building generally self-improving, generally self-improving agents, Enabling LLMs, natural language
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Enabling LLMs to improve their outputs by using more test-time computation is a critical step towards building generally self-improving agents that can operate on open-ended natural language. In this paper, we study the scaling of inference-time computation in LLMs, with a focus on answering the question: if an LLM is allowed to use a fixed but non-trivial amount of inference-time compute, how much can it improve its performance on a challenging prompt? Answering this question has implications not only on the achievable performance of LLMs, but also on the future of LLM pretraining and how one should tradeoff inference-time and pre-training compute. Despite its importance, little research attempted to understand the scaling behaviors of various test-time inference methods. Moreover, current work largely provides negative results for a number of these strategies. In this work, we analyze two primary mechanisms to scale test-time computation: (1) searching against dense, process-based verifier reward models; and (2) updating the model’s distribution over a response adaptively, given the prompt at test time. We find that in both cases, the effectiveness of different approaches to scaling test-time compute critically varies depending on the difficulty of the prompt. This observation motivates applying a “compute-optimal” scaling strategy, which acts to most effectively allocate test-time compute adaptively per prompt. Using this compute-optimal strategy, we can improve the efficiency of test-time compute scaling by more than 4x compared to a best-of-N baseline. Additionally, in a FLOPs-matched evaluation, we find that on problems where a smaller base model attains somewhat non-trivial success rates, test-time compute can be used to outperform a 14x larger model.
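
As a concrete illustration of spending extra test-time compute, the sketch below implements a toy best-of-N scheme with a verifier, plus a naive "compute-optimal" twist that allocates a larger sample budget to prompts estimated to be harder. The sampler, verifier, and difficulty estimate are random stand-ins, not the paper's trained models.

```python
# Toy best-of-N test-time scaling with a verifier and per-prompt budgets.
import random

def sample_answer(prompt: str) -> str:
    # Stand-in for one stochastic LLM decode of `prompt`.
    return f"answer-{random.randint(0, 9)}"

def verifier_score(prompt: str, answer: str) -> float:
    # Stand-in for a learned verifier / reward model score in [0, 1].
    return random.random()

def estimated_difficulty(prompt: str) -> float:
    # Stand-in: e.g. 1 - pass rate of the base model on similar prompts.
    return min(1.0, len(prompt) / 200)

def best_of_n(prompt: str, total_budget: int = 64) -> str:
    # Compute-optimal flavour: spend more samples on prompts judged harder.
    n = max(1, int(total_budget * estimated_difficulty(prompt)))
    candidates = [sample_answer(prompt) for _ in range(n)]
    return max(candidates, key=lambda a: verifier_score(prompt, a))

print(best_of_n("If a train leaves at 3pm travelling 60 km/h ..."))
```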

[NLP-4] KaPO: Knowledge-aware Preference Optimization for Controllable Knowledge Selection in Retrieval-Augmented Language Models

Link: https://arxiv.org/abs/2408.03297
Authors: Ruizhe Zhang,Yongxin Xu,Yuzhen Xiao,Runchuan Zhu,Xinke Jiang,Xu Chu,Junfeng Zhao,Yasha Wang
Keywords: Retrieval-Augmented Generation, large language models, integrating external knowledge, encounter when dealing, knowledge-intensive tasks
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:By integrating external knowledge, Retrieval-Augmented Generation (RAG) has become an effective strategy for mitigating the hallucination problems that large language models (LLMs) encounter when dealing with knowledge-intensive tasks. However, in the process of integrating external non-parametric supporting evidence with internal parametric knowledge, inevitable knowledge conflicts may arise, leading to confusion in the model’s responses. To enhance the knowledge selection of LLMs in various contexts, some research has focused on refining their behavior patterns through instruction-tuning. Nonetheless, due to the absence of explicit negative signals and comparative objectives, models fine-tuned in this manner may still exhibit undesirable behaviors in the intricate and realistic retrieval scenarios. To this end, we propose a Knowledge-aware Preference Optimization, dubbed KaPO, aimed at achieving controllable knowledge selection in real retrieval scenarios. Concretely, we explore and simulate error types across diverse context combinations and learn how to avoid these negative signals through preference optimization methods. Simultaneously, by adjusting the balance between response length and the proportion of preference data representing different behavior patterns, we enhance the adherence capabilities and noise robustness of LLMs in a balanced manner. Experimental results show that KaPO outperforms previous methods for handling knowledge conflicts by over 37%, while also exhibiting robust generalization across various out-of-distribution datasets.
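
KaPO's controllable knowledge selection is trained with preference optimization over responses that do or do not exhibit error patterns. The sketch below shows a generic DPO-style preference loss on (chosen, rejected) sequence log-probabilities as one plausible instantiation; KaPO's actual objective, data construction, and balancing scheme are not reproduced here.

```python
# Schematic DPO-style preference loss over (chosen, rejected) responses.
import torch
import torch.nn.functional as F

def preference_loss(logp_chosen, logp_rejected,
                    ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Reward margin of the policy relative to a frozen reference model.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * margin).mean()

# Toy per-sequence log-probabilities (e.g. summed token log-probs).
loss = preference_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                       torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss.item())
```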

[NLP-5] SARA: Singular-Value Based Adaptive Low-Rank Adaption

Link: https://arxiv.org/abs/2408.03290
Authors: Jihao Gu,Shuai Chen,Zelin Wang,Yibo Zhang,Ping Gong
Keywords: adding inference overhead, large pre-trained models, inference overhead, adding inference, LoRA method assumes
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:With the increasing number of parameters in large pre-trained models, LoRA as a parameter-efficient fine-tuning(PEFT) method is widely used for not adding inference overhead. The LoRA method assumes that weight changes during fine-tuning can be approximated by low-rank matrices. However, the rank values need to be manually verified to match different downstream tasks, and they cannot accommodate the varying importance of different layers in the model. In this work, we first analyze the relationship between the performance of different layers and their ranks using SVD. Based on this, we design the Singular-Value Based Adaptive Low-Rank Adaption(SARA), which adaptively finds the rank during initialization by performing SVD on the pre-trained weights. Additionally, we explore the Mixture-of-SARA(Mo-SARA), which significantly reduces the number of parameters by fine-tuning only multiple parallel sets of singular values controlled by a router. Extensive experiments on various complex tasks demonstrate the simplicity and parameter efficiency of our methods. They can effectively and adaptively find the most suitable rank for each layer of each model.
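
The core idea of choosing a rank from the singular-value spectrum of the pre-trained weights can be sketched in a few lines. The energy threshold, random weight matrix, and adapter initialization below are illustrative assumptions, not the exact SARA procedure.

```python
# Sketch: pick a per-layer LoRA rank from the SVD of a pre-trained weight.
import torch

def adaptive_rank(weight: torch.Tensor, energy: float = 0.90) -> int:
    s = torch.linalg.svdvals(weight)                 # singular values, descending
    cum = torch.cumsum(s**2, dim=0) / torch.sum(s**2)
    return int(torch.searchsorted(cum, torch.tensor(energy)).item()) + 1

w = torch.randn(768, 768)                            # stand-in pre-trained weight
r = adaptive_rank(w)
lora_A = torch.nn.Parameter(torch.randn(r, w.shape[1]) * 0.01)
lora_B = torch.nn.Parameter(torch.zeros(w.shape[0], r))
print(f"layer rank chosen by SVD energy: {r}")
```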

[NLP-6] StructEval: Deepen and Broaden Large Language Model Assessment via Structured Evaluation ACL2024

Link: https://arxiv.org/abs/2408.03281
Authors: Boxi Cao,Mengjie Ren,Hongyu Lin,Xianpei Han,Feng Zhang,Junfeng Zhan,Le Sun
Keywords: large language models, atomic test objective, development of large, large language, test objective
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: ACL 2024; Benchmark at this https URL at this https URL

Abstract:Evaluation is the baton for the development of large language models. Current evaluations typically employ a single-item assessment paradigm for each atomic test objective, which struggles to discern whether a model genuinely possesses the required capabilities or merely memorizes/guesses the answers to specific questions. To this end, we propose a novel evaluation framework referred to as StructEval. Starting from an atomic test objective, StructEval deepens and broadens the evaluation by conducting a structured assessment across multiple cognitive levels and critical concepts, and therefore offers a comprehensive, robust and consistent evaluation for LLMs. Experiments on three widely-used benchmarks demonstrate that StructEval serves as a reliable tool for resisting the risk of data contamination and reducing the interference of potential biases, thereby providing more reliable and consistent conclusions regarding model capabilities. Our framework also sheds light on the design of future principled and trustworthy LLM evaluation protocols.

[NLP-7] Synthesizing Text-to-SQL Data from Weak and Strong LLMs ACL2024

Link: https://arxiv.org/abs/2408.03256
Authors: Jiaxi Yang,Binyuan Hui,Min Yang,Jian Yang,Junyang Lin,Chang Zhou
Keywords: closed-source large language, large language models, remains a challenge, large language, synthetic data approach
Subjects: Computation and Language (cs.CL)
Comments: 12 pages, 7 figures, ACL 2024

Abstract:The capability gap between open-source and closed-source large language models (LLMs) remains a challenge in text-to-SQL tasks. In this paper, we introduce a synthetic data approach that combines data produced by larger, more powerful models (strong models) with error information data generated by smaller, not well-aligned models (weak models). The method not only enhances the domain generalization of text-to-SQL models but also explores the potential of error data supervision through preference learning. Furthermore, we employ the synthetic data approach for instruction tuning on open-source LLMs, resulting SENSE, a specialized text-to-SQL model. The effectiveness of SENSE is demonstrated through state-of-the-art results on the SPIDER and BIRD benchmarks, bridging the performance gap between open-source models and methods prompted by closed-source models.

[NLP-8] Unveiling Factual Recall Behaviors of Large Language Models through Knowledge Neurons

Link: https://arxiv.org/abs/2408.03247
Authors: Yifei Wang,Yuheng Chen,Wanting Wen,Yu Sheng,Linjing Li,Daniel Dajun Zeng
Keywords: Large Language Models, Language Models, Large Language, investigate whether Large, Models
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:In this paper, we investigate whether Large Language Models (LLMs) actively recall or retrieve their internal repositories of factual knowledge when faced with reasoning tasks. Through an analysis of LLMs’ internal factual recall at each reasoning step via Knowledge Neurons, we reveal that LLMs fail to harness the critical factual associations under certain circumstances. Instead, they tend to opt for alternative, shortcut-like pathways to answer reasoning questions. By manually manipulating the recall process of parametric knowledge in LLMs, we demonstrate that enhancing this recall process directly improves reasoning performance whereas suppressing it leads to notable degradation. Furthermore, we assess the effect of Chain-of-Thought (CoT) prompting, a powerful technique for addressing complex reasoning tasks. Our findings indicate that CoT can intensify the recall of factual knowledge by encouraging LLMs to engage in orderly and reliable reasoning. Furthermore, we explored how contextual conflicts affect the retrieval of facts during the reasoning process to gain a comprehensive understanding of the factual recall behaviors of LLMs. Code and data will be available soon.

[NLP-9] Making Long-Context Language Models Better Multi-Hop Reasoners ACL2024

Link: https://arxiv.org/abs/2408.03246
Authors: Yanyang Li,Shuo Liang,Michael R. Lyu,Liwei Wang
Keywords: multiple NLP applications, NLP applications, Recent advancements, multiple NLP, enhanced language models
Subjects: Computation and Language (cs.CL)
Comments: ACL 2024 Main Conference Camera Ready; Dataset, model, and code are available at this https URL

Abstract:Recent advancements in long-context modeling have enhanced language models (LMs) for complex tasks across multiple NLP applications. Despite this progress, we find that these models struggle with multi-hop reasoning and exhibit decreased performance in the presence of noisy contexts. In this paper, we introduce Reasoning with Attributions, a novel approach that prompts LMs to supply attributions for each assertion during their reasoning. We validate our approach through experiments on three multi-hop datasets, employing both proprietary and open-source models, and demonstrate its efficacy and resilience. Furthermore, we explore methods to augment reasoning capabilities via fine-tuning and offer an attribution-annotated dataset and a specialized training strategy. Our fine-tuned model achieves competitive performance on multi-hop reasoning benchmarks, closely paralleling proprietary LMs such as ChatGPT and Claude-instant.

[NLP-10] A Debiased Nearest Neighbors Framework for Multi-Label Text Classification

Link: https://arxiv.org/abs/2408.03202
Authors: Zifeng Cheng,Zhiwei Jiang,Yafeng Yin,Zhaoling Chen,Cong Wang,Shiping Ge,Qiguo Huang,Qing Gu
Keywords: Multi-Label Text Classification, involves assigning multiple, assigning multiple non-exclusive, Multi-Label Text, Text Classification
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Multi-Label Text Classification (MLTC) is a practical yet challenging task that involves assigning multiple non-exclusive labels to each document. Previous studies primarily focus on capturing label correlations to assist label prediction by introducing special labeling schemes, designing specific model structures, or adding auxiliary tasks. Recently, the k Nearest Neighbor ( k NN) framework has shown promise by retrieving labeled samples as references to mine label co-occurrence information in the embedding space. However, two critical biases, namely embedding alignment bias and confidence estimation bias, are often overlooked, adversely affecting prediction performance. In this paper, we introduce a DEbiased Nearest Neighbors (DENN) framework for MLTC, specifically designed to mitigate these biases. To address embedding alignment bias, we propose a debiased contrastive learning strategy, enhancing neighbor consistency on label co-occurrence. For confidence estimation bias, we present a debiased confidence estimation strategy, improving the adaptive combination of predictions from k NN and inductive binary classifications. Extensive experiments conducted on four public benchmark datasets (i.e., AAPD, RCV1-V2, Amazon-531, and EUR-LEX57K) showcase the effectiveness of our proposed method. Besides, our method does not introduce any extra parameters.
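
The kNN side of this setup retrieves labelled neighbours in an embedding space and turns their label vectors into a retrieval-based prediction that is interpolated with the classifier's output. The toy sketch below uses random embeddings and an assumed interpolation weight; the paper's debiased contrastive training and confidence estimation are not shown.

```python
# Toy kNN retrieval of label co-occurrence for multi-label classification.
import numpy as np

rng = np.random.default_rng(0)
store_emb = rng.normal(size=(1000, 128))             # datastore embeddings
store_emb /= np.linalg.norm(store_emb, axis=1, keepdims=True)
store_labels = rng.integers(0, 2, size=(1000, 54))   # multi-hot label vectors

def knn_label_scores(query_emb, k=16):
    q = query_emb / np.linalg.norm(query_emb)
    sims = store_emb @ q                              # cosine similarity
    idx = np.argsort(-sims)[:k]
    weights = np.exp(sims[idx]) / np.exp(sims[idx]).sum()
    return weights @ store_labels[idx]                # per-label kNN score

query = rng.normal(size=128)
classifier_probs = rng.uniform(size=54)               # stand-in model output
alpha = 0.5                                           # interpolation weight (assumed)
combined = alpha * classifier_probs + (1 - alpha) * knn_label_scores(query)
print(combined[:5])
```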

[NLP-11] Leveraging Parameter Efficient Training Methods for Low Resource Text Classification: A Case Study in Marathi

Link: https://arxiv.org/abs/2408.03172
Authors: Pranita Deshmukh,Nikita Kulkarni,Sanhita Kulkarni,Kareena Manghani,Raviraj Joshi
Keywords: Natural Language Processing, advanced Natural Language, Bidirectional Encoder Representations, advanced Natural, Language Processing
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted at I2CT 2024

Abstract:With the surge in digital content in low-resource languages, there is an escalating demand for advanced Natural Language Processing (NLP) techniques tailored to these languages. BERT (Bidirectional Encoder Representations from Transformers), serving as the foundational framework for numerous NLP architectures and language models, is increasingly employed for the development of low-resource NLP models. Parameter Efficient Fine-Tuning (PEFT) is a method for fine-tuning Large Language Models (LLMs) and reducing the training parameters to some extent to decrease the computational costs needed for training the model and achieve results comparable to a fully fine-tuned model. In this work, we present a study of PEFT methods for the Indic low-resource language Marathi. We conduct a comprehensive analysis of PEFT methods applied to various monolingual and multilingual Marathi BERT models. These approaches are evaluated on prominent text classification datasets like MahaSent, MahaHate, and MahaNews. The incorporation of PEFT techniques is demonstrated to significantly expedite the training speed of the models, addressing a critical aspect of model development and deployment. In this study, we explore Low-Rank Adaptation of Large Language Models (LoRA) and adapter methods for low-resource text classification. We show that these methods are competitive with full fine-tuning and can be used without loss in accuracy. This study contributes valuable insights into the effectiveness of Marathi BERT models, offering a foundation for the continued advancement of NLP capabilities in Marathi and similar Indic languages.
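
A LoRA-style PEFT setup of the kind evaluated here can be expressed compactly with the peft library. The checkpoint name, target modules, and hyperparameters below are plausible defaults (assumed), not the exact configuration used for the Marathi experiments.

```python
# Hedged sketch: LoRA adapters on a Marathi BERT classifier via `peft`.
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model

model = AutoModelForSequenceClassification.from_pretrained(
    "l3cube-pune/marathi-bert-v2", num_labels=3)  # assumed checkpoint name

lora_cfg = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.1,
    target_modules=["query", "value"],   # BERT attention projections
    task_type="SEQ_CLS",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()   # only the low-rank adapters are trainable
# ...then train with the usual Trainer / training loop on MahaSent etc.
```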

[NLP-12] Conditioning LLMs with Emotion in Neural Machine Translation

Link: https://arxiv.org/abs/2408.03150
Authors: Charles Brazier,Jean-Luc Rouas
Keywords: Natural Language Processing, Language Processing tasks, Large Language Models, including Machine Translation, Large Language
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 6 pages, In Proceedings of the 21st International Conference on Spoken Language Translation (IWSLT), Bangkok, Thailand, 2024

Abstract:Large Language Models (LLMs) have shown remarkable performance in Natural Language Processing tasks, including Machine Translation (MT). In this work, we propose a novel MT pipeline that integrates emotion information extracted from a Speech Emotion Recognition (SER) model into LLMs to enhance translation quality. We first fine-tune five existing LLMs on the Libri-trans dataset and select the most performant model. Subsequently, we augment LLM prompts with different dimensional emotions and train the selected LLM under these different configurations. Our experiments reveal that integrating emotion information, especially arousal, into LLM prompts leads to notable improvements in translation quality.
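
One simple way to condition an LLM translation prompt on SER output is to verbalize the predicted emotion dimensions in the instruction. The prompt wording and dimension names below are illustrative assumptions; the paper's exact prompt formats may differ.

```python
# Sketch: building an emotion-conditioned translation prompt from SER output.
def build_prompt(src_text: str, emotion: dict) -> str:
    emo_desc = ", ".join(f"{k}={v:.2f}" for k, v in emotion.items())
    return (f"The speaker's emotional state is: {emo_desc}.\n"
            f"Translate the following English sentence into French, "
            f"preserving this emotion:\n{src_text}")

ser_output = {"arousal": 0.81, "valence": 0.23, "dominance": 0.40}
print(build_prompt("I can't believe you did that!", ser_output))
```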

[NLP-13] Leveraging Entity Information for Cross-Modality Correlation Learning: The Entity-Guided Multimodal Summarization ACL

Link: https://arxiv.org/abs/2408.03149
Authors: Yanghai Zhang,Ye Liu,Shiwei Wu,Kai Zhang,Xukai Liu,Qi Liu,Enhong Chen
Keywords: Multimodal Summarization model, Multimodal Summarization, rapid increase, increase in multimedia, spurred advancements
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: In ACL-Findings 2024

Abstract:The rapid increase in multimedia data has spurred advancements in Multimodal Summarization with Multimodal Output (MSMO), which aims to produce a multimodal summary that integrates both text and relevant images. The inherent heterogeneity of content within multimodal inputs and outputs presents a significant challenge to the execution of MSMO. Traditional approaches typically adopt a holistic perspective on coarse image-text data or individual visual objects, overlooking the essential connections between objects and the entities they represent. To integrate the fine-grained entity knowledge, we propose an Entity-Guided Multimodal Summarization model (EGMS). Our model, building on BART, utilizes dual multimodal encoders with shared weights to process text-image and entity-image information concurrently. A gating mechanism then combines visual data for enhanced textual summary generation, while image selection is refined through knowledge distillation from a pre-trained vision-language model. Extensive experiments on public MSMO dataset validate the superiority of the EGMS method, which also prove the necessity to incorporate entity information into MSMO problem.

[NLP-14] Inference Optimizations for Large Language Models: Effects, Challenges and Practical Considerations

Link: https://arxiv.org/abs/2408.03130
Authors: Leo Donisch,Sigurd Schacht,Carsten Lanquillon
Keywords: natural language processing, Large language models, tasks without retraining, ubiquitous in natural, Large language
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large language models are ubiquitous in natural language processing because they can adapt to new tasks without retraining. However, their sheer scale and complexity present unique challenges and opportunities, prompting researchers and practitioners to explore novel model training, optimization, and deployment methods. This literature review focuses on various techniques for reducing resource requirements and compressing large language models, including quantization, pruning, knowledge distillation, and architectural optimizations. The primary objective is to explore each method in-depth and highlight its unique challenges and practical applications. The discussed methods are categorized into a taxonomy that presents an overview of the optimization landscape and helps navigate it to understand the research trajectory better.
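
Among the surveyed techniques, post-training weight quantization is easy to illustrate numerically: map float weights to int8 with a scale factor and measure the round-trip error. Real deployments add per-channel scales, calibration data, and fused low-precision kernels; the snippet below shows only the core arithmetic.

```python
# Core arithmetic of symmetric int8 post-training weight quantization.
import numpy as np

w = np.random.randn(4, 4).astype(np.float32)      # pretend layer weights
scale = np.abs(w).max() / 127.0                   # symmetric per-tensor scale
w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_dequant = w_int8.astype(np.float32) * scale     # what the kernel "sees"

print("max abs quantization error:", np.abs(w - w_dequant).max())
```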

[NLP-15] Lisbon Computational Linguists at SemEval-2024 Task 2: Using A Mistral 7B Model and Data Augmentation SEMEVAL-2024

Link: https://arxiv.org/abs/2408.03127
Authors: Artur Guimarães,Bruno Martins,João Magalhães
Keywords: Clinical Trial Reports, Natural Language Inference, safe biomedical Natural, biomedical Natural Language, Clinical Trials
Subjects: Computation and Language (cs.CL)
Comments: 8 pages, 1 figure, submitted and accepted into the “18th International Workshop on Semantic Evaluation (SemEval-2024)”

Abstract:This paper describes our approach to the SemEval-2024 safe biomedical Natural Language Inference for Clinical Trials (NLI4CT) task, which concerns classifying statements about Clinical Trial Reports (CTRs). We explored the capabilities of Mistral-7B, a generalist open-source Large Language Model (LLM). We developed a prompt for the NLI4CT task, and fine-tuned a quantized version of the model using an augmented version of the training dataset. The experimental results show that this approach can produce notable results in terms of the macro F1-score, while having limitations in terms of faithfulness and consistency. All the developed code is publicly available on a GitHub repository

[NLP-16] COMMENTATOR: A Code-mixed Multilingual Text Annotation Framework

Link: https://arxiv.org/abs/2408.03125
Authors: Rajvee Sheth,Shubh Nisar,Heenaben Prajapati,Himanshu Beniwal,Mayank Singh
Keywords: NLP community increasingly, community increasingly addresses, increasingly addresses challenges, multilingual datasets efficiently, NLP community
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:As the NLP community increasingly addresses challenges associated with multilingualism, robust annotation tools are essential to handle multilingual datasets efficiently. In this paper, we introduce a code-mixed multilingual text annotation framework, COMMENTATOR, specifically designed for annotating code-mixed text. The tool demonstrates its effectiveness in token-level and sentence-level language annotation tasks for Hinglish text. We perform robust qualitative human-based evaluations to showcase COMMENTATOR led to 5x faster annotations than the best baseline. Our code is publicly available at this https URL. The demonstration video is available at this https URL.

[NLP-17] Evaluating the Translation Performance of Large Language Models Based on Euas-20

Link: https://arxiv.org/abs/2408.03119
Authors: Yan Huang,Wei Liu
Keywords: BERT and GPT, large language models, deep learning technology, natural language processing, achieved breakthrough results
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 15 pages, 8 figures

Abstract:In recent years, with the rapid development of deep learning technology, large language models (LLMs) such as BERT and GPT have achieved breakthrough results in natural language processing tasks. Machine translation (MT), as one of the core tasks of natural language processing, has also benefited from the development of large language models and achieved a qualitative leap. Despite the significant progress in translation performance achieved by large language models, machine translation still faces many challenges. Therefore, in this paper, we construct the dataset Euas-20 to evaluate the performance of large language models on translation tasks, the translation ability on different languages, and the effect of pre-training data on the translation ability of LLMs for researchers and developers.

[NLP-18] Topic Modeling with Fine-tuning LLMs and Bag of Sentences

Link: https://arxiv.org/abs/2408.03099
Authors: Johannes Schneider
Keywords: Large language models, Large language, classical topic models, outperforming classical topic, modeling outperforming classical
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: This is the submitted journal version, enhanced with the novel fine-tuning part, of "Efficient and Flexible Topic Modeling using Pretrained Embeddings and Bag of Sentences", which appeared at the International Conference on Agents and Artificial Intelligence (ICAART) in 2024

Abstract:Large language models (LLMs) are increasingly used for topic modeling outperforming classical topic models such as LDA. Commonly, pre-trained LLM encoders such as BERT are used out-of-the-box despite the fact that fine-tuning is known to improve LLMs considerably. The challenge lies in obtaining a suitable (labeled) dataset for fine-tuning. In this paper, we use the recent idea to use bag of sentences as the elementary unit in computing topics. In turn, we derive an approach FT-Topic to perform unsupervised fine-tuning relying primarily on two steps for constructing a training dataset in an automatic fashion. First, a heuristic method identifies pairs of sentence groups that are either assumed to be of the same or different topics. Second, we remove sentence pairs that are likely labeled incorrectly. The dataset is then used to fine-tune an encoder LLM, which can be leveraged by any topic modeling approach using embeddings. However, in this work, we demonstrate its effectiveness by deriving a novel state-of-the-art topic modeling method called SenClu, which achieves fast inference through an expectation-maximization algorithm and hard assignments of sentence groups to a single topic, while giving users the possibility to encode prior knowledge on the topic-document distribution. Code is at this https URL
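
A minimal "bag of sentences" pipeline can be approximated by embedding sentences, clustering them, and reading a document's topic mixture off its sentences' cluster assignments. The sketch below uses an off-the-shelf sentence encoder and k-means as generic stand-ins; the FT-Topic fine-tuning and SenClu's EM procedure with hard assignments are not reproduced.

```python
# Sketch: topics over a "bag of sentences" via embeddings + clustering.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

docs = [["The match ended in a draw.", "The striker missed a penalty."],
        ["The central bank raised rates.", "Inflation slowed last quarter."]]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [s for doc in docs for s in doc]
emb = encoder.encode(sentences)

n_topics = 2
topics = KMeans(n_clusters=n_topics, n_init=10).fit_predict(emb)

# Per-document topic mixture = histogram of its sentences' clusters.
i = 0
for d, doc in enumerate(docs):
    assign = topics[i:i + len(doc)]
    i += len(doc)
    dist = np.bincount(assign, minlength=n_topics) / len(doc)
    print(f"doc {d} topic mixture: {dist}")
```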

[NLP-19] 500xCompressor: Generalized Prompt Compression for Large Language Models

Link: https://arxiv.org/abs/2408.03094
Authors: Zongqian Li,Yixuan Su,Nigel Collier
Keywords: enhancing inference speed, improving user experience, reducing costs, inference speed, user experience
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Prompt compression is crucial for enhancing inference speed, reducing costs, and improving user experience. However, current methods face challenges such as low compression ratios and potential data leakage during evaluation. To address these issues, we propose 500xCompressor, a method that compresses extensive natural language contexts into a minimum of one single special token. The 500xCompressor introduces approximately 0.3% additional parameters and achieves compression ratios ranging from 6x to 480x. It is designed to compress any text, answer various types of questions, and could be utilized by the original large language model (LLM) without requiring fine-tuning. Initially, 500xCompressor was pretrained on the Arxiv Corpus, followed by fine-tuning on the ArxivQA dataset, and subsequently evaluated on strictly unseen and classical question answering (QA) datasets. The results demonstrate that the LLM retained 62.26-72.89% of its capabilities compared to using non-compressed prompts. This study also shows that not all the compressed tokens are equally utilized and that K V values have significant advantages over embeddings in preserving information at high compression ratios. The highly compressive nature of natural language prompts, even for fine-grained complex information, suggests promising potential for future applications and further research into developing a new LLM language.

[NLP-20] Extend Model Merging from Fine-Tuned to Pre-Trained Large Language Models via Weight Disentanglement

Link: https://arxiv.org/abs/2408.03092
Authors: Le Yu,Bowen Yu,Haiyang Yu,Fei Huang,Yongbin Li
Keywords: Large Language Models, Merging Large Language, amalgamate multiple homologous, Large Language, LLMs
Subjects: Computation and Language (cs.CL)
Comments: 17 pages

Abstract:Merging Large Language Models (LLMs) aims to amalgamate multiple homologous LLMs into one with all the capabilities. Ideally, any LLMs sharing the same backbone should be mergeable, irrespective of whether they are Fine-Tuned (FT) with minor parameter changes or Pre-Trained (PT) with substantial parameter shifts. However, existing methods often manually assign the model importance, rendering them feasible only for LLMs with similar parameter alterations, such as multiple FT LLMs. The diverse parameter changed ranges between FT and PT LLMs pose challenges for current solutions in empirically determining the optimal combination. In this paper, we make a pioneering effort to broaden the applicability of merging techniques from FT to PT LLMs. We initially examine the efficacy of current methods in merging FT and PT LLMs, discovering that they struggle to deal with PT LLMs. Subsequently, we introduce an approach based on WeIght DisENtanglement (WIDEN) to effectively extend the merging scope, which first disentangles model weights into magnitude and direction components, and then performs adaptive fusion by considering their respective contributions. In the experiments, we merge Qwen1.5-Chat (an FT LLM with instruction-following skills) with Sailor (a PT LLM with multilingual abilities) across 7B and 14B model scales. Results reveal that: (1) existing solutions usually fail when merging Sailor, either losing both abilities or only retaining instruction-following skills; (2) WIDEN successfully injects the multilingual abilities of Sailor into Qwen1.5-Chat and make it proficient in Southeast Asian languages, achieving enhancements in the fundamental capabilities. In light of previous research, we also merge multiple 13B FT LLMs and observe that WIDEN achieves a balanced amalgamation of instruction following, mathematical reasoning, and code generation skills.
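
The weight-disentanglement idea can be illustrated by splitting each model's deviation from a shared backbone into a magnitude and a direction component before fusing. The simple averaging rule below is a placeholder (assumed) for WIDEN's adaptive, contribution-aware fusion.

```python
# Toy sketch: disentangle weight deltas into magnitude/direction, then fuse.
import torch

backbone = torch.randn(4, 4)
w_ft = backbone + 0.1 * torch.randn(4, 4)   # fine-tuned model weight (small shift)
w_pt = backbone + 1.0 * torch.randn(4, 4)   # continued-pretrained weight (large shift)

def disentangle(delta):
    magnitude = delta.norm(dim=1, keepdim=True)          # per-row magnitude
    direction = delta / (magnitude + 1e-8)               # unit direction
    return magnitude, direction

m1, d1 = disentangle(w_ft - backbone)
m2, d2 = disentangle(w_pt - backbone)

# Placeholder fusion: average magnitudes and renormalized average directions.
merged_delta = ((m1 + m2) / 2) * torch.nn.functional.normalize((d1 + d2) / 2, dim=1)
w_merged = backbone + merged_delta
print(w_merged.shape)
```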

[NLP-21] Enhancing Complex Causality Extraction via Improved Subtask Interaction and Knowledge Fusion NLPCC2024

Link: https://arxiv.org/abs/2408.03079
Authors: Jinglong Gao,Chen Lu,Xiao Ding,Zhongyang Li,Ting Liu,Bing Qin
Keywords: Complex Causality Extraction, Event Causality Extraction, Causality Extraction, ECE, ECE simultaneously
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: NLPCC 2024 Oral

Abstract:Event Causality Extraction (ECE) aims at extracting causal event pairs from texts. Despite ChatGPT’s recent success, fine-tuning small models remains the best approach for the ECE task. However, existing fine-tuning based ECE methods cannot address all three key challenges in ECE simultaneously: 1) Complex Causality Extraction, where multiple causal-effect pairs occur within a single sentence; 2) Subtask Interaction, which involves modeling the mutual dependence between the two subtasks of ECE, i.e., extracting events and identifying the causal relationship between extracted events; and 3) Knowledge Fusion, which requires effectively fusing the knowledge in two modalities, i.e., the expressive pretrained language models and the structured knowledge graphs. In this paper, we propose a unified ECE framework (UniCE) to address all three issues in ECE simultaneously. Specifically, we design a subtask interaction mechanism to enable mutual interaction between the two ECE subtasks. Besides, we design a knowledge fusion mechanism to fuse knowledge in the two modalities. Furthermore, we employ separate decoders for each subtask to facilitate complex causality extraction. Experiments on three benchmark datasets demonstrate that our method achieves state-of-the-art performance and outperforms ChatGPT with a margin of at least 30% F1-score. More importantly, our model can also be used to effectively improve the ECE performance of ChatGPT via in-context learning.

[NLP-22] Towards an Analysis of Discourse and Interactional Pragmatic Reasoning Capabilities of Large Language Models

Link: https://arxiv.org/abs/2408.03074
Authors: Amelie Robrecht,Judith Sieker,Clara Lachenmaier,Sina Zarieß,Stefan Kopp
Keywords: pragmatic abilities, pragmatics, interactional pragmatics, discourse pragmatics, Abstract
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:In this work, we want to give an overview on which pragmatic abilities have been tested in LLMs so far and how these tests have been carried out. To do this, we first discuss the scope of the field of pragmatics and suggest a subdivision into discourse pragmatics and interactional pragmatics. We give a non-exhaustive overview of the phenomena of those two subdomains and the methods traditionally used to analyze them. We subsequently consider the resulting heterogeneous set of phenomena and methods as a starting point for our survey of work on discourse pragmatics and interactional pragmatics in the context of LLMs.

[NLP-23] Probing structural constraints of negation in Pretrained Language Models

Link: https://arxiv.org/abs/2408.03070
Authors: David Kletz (Lattice, LLF - UMR7110, UPCité),Marie Candito (UPCité, LLF - UMR7110),Pascal Amsili (Lattice)
Keywords: Contradictory results, pretrained language models, masked polarity item, Negative Polarity Item, Polarity Item
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Contradictory results about the encoding of the semantic impact of negation in pretrained language models (PLMs) have been drawn recently (e.g. Kassner and Schütze (2020); Gubelmann and Handschuh (2022)). In this paper we focus rather on the way PLMs encode negation and its formal impact, through the phenomenon of the Negative Polarity Item (NPI) licensing in English. More precisely, we use probes to identify which contextual representations best encode 1) the presence of negation in a sentence, and 2) the polarity of a neighboring masked polarity item. We find that contextual representations of tokens inside the negation scope do allow for (i) a better prediction of the presence of not compared to those outside the scope and (ii) a better prediction of the right polarity of a masked polarity item licensed by not, although the magnitude of the difference varies from PLM to PLM. Importantly, in both cases the trend holds even when controlling for distance to not. This tends to indicate that the embeddings of these models do reflect the notion of negation scope, and do encode the impact of negation on NPI licensing. Yet, further control experiments reveal that the presence of other lexical items is also better captured when using the contextual representation of a token within the same syntactic clause than outside from it, suggesting that PLMs simply capture the more general notion of syntactic clause.

[NLP-24] Analysis of Argument Structure Constructions in a Deep Recurrent Language Model

Link: https://arxiv.org/abs/2408.03062
Authors: Pegah Ramezani,Achim Schilling,Patrick Krauss
Keywords: Argument Structure Constructions, cognitive computational neuroscience, Stochastic Neighbor Embedding, fundamental question, question in cognitive
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Understanding how language and linguistic constructions are processed in the brain is a fundamental question in cognitive computational neuroscience. In this study, we explore the representation and processing of Argument Structure Constructions (ASCs) in a recurrent neural language model. We trained a Long Short-Term Memory (LSTM) network on a custom-made dataset consisting of 2000 sentences, generated using GPT-4, representing four distinct ASCs: transitive, ditransitive, caused-motion, and resultative constructions. We analyzed the internal activations of the LSTM model’s hidden layers using Multidimensional Scaling (MDS) and t-Distributed Stochastic Neighbor Embedding (t-SNE) to visualize the sentence representations. The Generalized Discrimination Value (GDV) was calculated to quantify the degree of clustering within these representations. Our results show that sentence representations form distinct clusters corresponding to the four ASCs across all hidden layers, with the most pronounced clustering observed in the last hidden layer before the output layer. This indicates that even a relatively simple, brain-constrained recurrent neural network can effectively differentiate between various construction types. These findings are consistent with previous studies demonstrating the emergence of word class and syntax rule representations in recurrent language models trained on next word prediction tasks. In future work, we aim to validate these results using larger language models and compare them with neuroimaging data obtained during continuous speech perception. This study highlights the potential of recurrent neural language models to mirror linguistic processing in the human brain, providing valuable insights into the computational and neural mechanisms underlying language understanding.
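
The analysis pipeline (run sentences through an LSTM, take hidden states, project with t-SNE or MDS) can be sketched as follows. The token IDs, network sizes, and four example items are simplified placeholders rather than the GPT-4-generated dataset used in the study.

```python
# Sketch: extract LSTM hidden states per sentence and project them with t-SNE.
import torch
import torch.nn as nn
from sklearn.manifold import TSNE

sentences = [[1, 4, 7], [2, 5, 7], [3, 6, 8], [1, 5, 9]]   # token-id stand-ins
labels = ["transitive", "ditransitive", "caused-motion", "resultative"]

emb = nn.Embedding(num_embeddings=10, embedding_dim=16)
lstm = nn.LSTM(input_size=16, hidden_size=32, num_layers=2, batch_first=True)

reps = []
with torch.no_grad():
    for ids in sentences:
        x = emb(torch.tensor(ids).unsqueeze(0))
        _, (h, _) = lstm(x)
        reps.append(h[-1, 0])                 # last layer's final hidden state
reps = torch.stack(reps).numpy()

proj = TSNE(n_components=2, perplexity=2).fit_transform(reps)
for lab, (x, y) in zip(labels, proj):
    print(f"{lab}: ({x:.1f}, {y:.1f})")
```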

[NLP-25] OpenOmni: A Collaborative Open Source Tool for Building Future-Ready Multimodal Conversational Agents

Link: https://arxiv.org/abs/2408.03047
Authors: Qiang Sun,Yuanyi Luo,Sirui Li,Wenxiao Zhang,Wei Liu
Keywords: Multimodal conversational agents, Multimodal conversational, conversational agents, agents are highly, highly desirable
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Multimodal conversational agents are highly desirable because they offer natural and human-like interaction. However, there is a lack of comprehensive end-to-end solutions to support collaborative development and benchmarking. While proprietary systems like GPT-4o and Gemini demonstrating impressive integration of audio, video, and text with response times of 200-250ms, challenges remain in balancing latency, accuracy, cost, and data privacy. To better understand and quantify these issues, we developed OpenOmni, an open-source, end-to-end pipeline benchmarking tool that integrates advanced technologies such as Speech-to-Text, Emotion Detection, Retrieval Augmented Generation, Large Language Models, along with the ability to integrate customized models. OpenOmni supports local and cloud deployment, ensuring data privacy and supporting latency and accuracy benchmarking. This flexible framework allows researchers to customize the pipeline, focusing on real bottlenecks and facilitating rapid proof-of-concept development. OpenOmni can significantly enhance applications like indoor assistance for visually impaired individuals, advancing human-computer interaction. Our demonstration video is available this https URL, demo is available via this https URL, code is available via this https URL.

[NLP-26] L3iTC at the FinLLM Challenge Task: Quantization for Financial Text Classification Summarization

Link: https://arxiv.org/abs/2408.03033
Authors: Elvys Linhares Pontes,Carlos-Emiliano González-Gallardo,Mohamed Benjannet,Caryn Qu,Antoine Doucet
Keywords: FinLLM Challenge Task, financial text, financial text classification, financial text summarization, details our participation
Subjects: Computation and Language (cs.CL); Computational Engineering, Finance, and Science (cs.CE)
Comments: Joint Workshop of the 8th Financial Technology and Natural Language Processing (FinNLP) and the 1st Agent AI for Scenario Planning (AgentScen), 2024

Abstract:This article details our participation (L3iTC) in the FinLLM Challenge Task 2024, focusing on two key areas: Task 1, financial text classification, and Task 2, financial text summarization. To address these challenges, we fine-tuned several large language models (LLMs) to optimize performance for each task. Specifically, we used 4-bit quantization and LoRA to determine which layers of the LLMs should be trained at a lower precision. This approach not only accelerated the fine-tuning process on the training data provided by the organizers but also enabled us to run the models on low GPU memory. Our fine-tuned models achieved third place for the financial classification task with an F1-score of 0.7543 and secured sixth place in the financial summarization task on the official test datasets.
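
The described combination of 4-bit quantization and LoRA corresponds to a QLoRA-style setup. The sketch below loads a base model in 4-bit with bitsandbytes and attaches LoRA adapters; the model name and hyperparameters are illustrative assumptions, not the team's exact configuration.

```python
# Hedged sketch of 4-bit (QLoRA-style) fine-tuning with LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",           # assumed base checkpoint
    quantization_config=bnb_cfg, device_map="auto")

lora_cfg = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()
# ...fine-tune on the financial classification / summarization data.
```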

[NLP-27] Fact Finder – Enhancing Domain Expertise of Large Language Models by Incorporating Knowledge Graphs

Link: https://arxiv.org/abs/2408.03010
Authors: Daniel Steinigen,Roman Teucher,Timm Heine Ruland,Max Rudat,Nicolas Flores-Herr,Peter Fischer,Nikola Milosevic,Christopher Schymura,Angelo Ziletti
Keywords: Large Language Models, natural language queries, answering natural language, Language Models, Large Language
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 10 pages, 7 figures

Abstract:Recent advancements in Large Language Models (LLMs) have showcased their proficiency in answering natural language queries. However, their effectiveness is hindered by limited domain-specific knowledge, raising concerns about the reliability of their responses. We introduce a hybrid system that augments LLMs with domain-specific knowledge graphs (KGs), thereby aiming to enhance factual correctness using a KG-based retrieval approach. We focus on a medical KG to demonstrate our methodology, which includes (1) pre-processing, (2) Cypher query generation, (3) Cypher query processing, (4) KG retrieval, and (5) LLM-enhanced response generation. We evaluate our system on a curated dataset of 69 samples, achieving a precision of 78% in retrieving correct KG nodes. Our findings indicate that the hybrid system surpasses a standalone LLM in accuracy and completeness, as verified by an LLM-as-a-Judge evaluation method. This positions the system as a promising tool for applications that demand factual correctness and completeness, such as target identification – a critical process in pinpointing biological entities for disease treatment or crop enhancement. Moreover, its intuitive search interface and ability to provide accurate responses within seconds make it well-suited for time-sensitive, precision-focused research contexts. We publish the source code together with the dataset and the prompt templates used.
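
The described pipeline (Cypher query generation, query execution against the KG, and KG-grounded answer generation) might look roughly as follows. The generate_cypher and llm_answer functions and the graph schema are hypothetical stand-ins; only the neo4j driver calls follow the real client API.

```python
# Condensed, hedged sketch of an LLM + knowledge-graph retrieval pipeline.
from neo4j import GraphDatabase

def generate_cypher(question: str) -> str:
    # Stand-in for the LLM-based Cypher generation + post-processing steps.
    return ("MATCH (d:Drug)-[:TREATS]->(c:Condition {name: $name}) "
            "RETURN d.name AS drug")

def llm_answer(question: str, facts: list[str]) -> str:
    # Stand-in for KG-grounded answer generation.
    return f"Based on the graph: {', '.join(facts)}"

question = "Which drugs treat hypertension?"
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    records = session.run(generate_cypher(question), name="hypertension")
    facts = [r["drug"] for r in records]
print(llm_answer(question, facts))
```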
摘要:大型语言模型(LLM)的最新进展显示了它们在回答自然语言查询方面的熟练程度。然而,它们的有效性受到特定领域知识有限的阻碍,这引发了人们对其答复可靠性的担忧。我们引入了一种混合系统,它用特定于领域的知识图(KG)来扩充LLM,从而旨在使用基于KG的检索方法来提高事实的正确性。我们以一个医疗KG为例来演示我们的方法,它包括(1)预处理,(2)Cypher查询生成,(3)Cypher查询处理,(4)KG检索,和(5)LLM增强的响应生成。我们在69个样本的精选数据集上对我们的系统进行了评估,在检索正确的KG节点方面取得了78%的精度。我们的研究结果表明,混合系统在准确性和完备性方面都超过了独立的LLM,并通过LLM-as-a-Judge评估方法进行了验证。这将该系统定位为一种很有前途的工具,适用于要求事实正确性和完整性的应用,例如目标识别–这是为疾病治疗或作物改良精确定位生物实体的关键过程。此外,其直观的搜索界面和在几秒钟内提供准确响应的能力使其非常适合时间敏感、注重精确的研究环境。我们发布源代码以及使用的数据集和提示模板。
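
按摘要描述的五步流程,下面是一个极简的管线草图,说明"LLM生成Cypher → 在知识图上检索 → 将检索结果拼进回答提示"的衔接方式;其中的Neo4j连接参数与 generate_cypher、ask_llm 两个函数均为假设的占位实现,并非论文代码。

```python
from neo4j import GraphDatabase

def generate_cypher(question: str) -> str:
    """(2) Cypher查询生成:假设由LLM把自然语言问题翻译成Cypher,此处仅为占位。"""
    raise NotImplementedError

def ask_llm(prompt: str) -> str:
    """(5) LLM增强的响应生成:假设的LLM调用接口,此处仅为占位。"""
    raise NotImplementedError

def answer_with_kg(question: str,
                   uri: str = "bolt://localhost:7687",
                   auth: tuple = ("neo4j", "password")) -> str:
    cypher = generate_cypher(question)  # (2)(3) 生成并处理Cypher查询
    with GraphDatabase.driver(uri, auth=auth) as driver:
        with driver.session() as session:
            records = [r.data() for r in session.run(cypher)]  # (4) KG检索
    context = "\n".join(str(r) for r in records)
    prompt = f"请仅根据以下知识图谱检索结果回答问题。\n检索结果:\n{context}\n问题:{question}"
    return ask_llm(prompt)
```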

[NLP-28] Empathy Level Alignment via Reinforcement Learning for Empathetic Response Generation
[NLP-28] 通过强化学习进行同理心水平对齐以产生同理心反应

链接: https://arxiv.org/abs/2408.02976
作者: Hui Ma,Bo Zhang,Bo Xu,Jian Wang,Hongfei Lin,Xiao Sun
关键词-EN: human-like dialogue systems, building human-like dialogue, Empathetic response generation, empathetic responses, produce empathetic responses
关键词-ZN: 类人对话系统,建立类人对话,同理心反应生成,同理心反应,产生同理心反应
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Empathetic response generation, aiming at understanding the user’s situation and feelings and respond empathically, is crucial in building human-like dialogue systems. Previous methods mainly focus on using maximum likelihood estimation as the optimization objective for training response generation models, without taking into account the empathy level alignment between generated responses and target responses. To this end, we propose an empathetic response generation using reinforcement learning (EmpRL) framework. The framework designs an effective empathy reward function and generates empathetic responses by maximizing the expected reward through reinforcement learning. Given the powerful text generation capability of pre-trained language models, EmpRL utilizes the pre-trained T5 model as the generator and conducts further training to initialize the policy. To align the empathy level between generated responses and target responses in the context, an empathy reward function containing three empathy communication mechanisms, i.e., emotional reaction, interpretation, and exploration, is constructed using pre-designed and pre-trained empathy identifiers. Finally, the proximal policy optimization algorithm is used to further train the policy to produce empathetic responses. Both automatic and manual evaluations demonstrate that the proposed EmpRL framework can improve the quality of generated responses, enhance the empathy level similarity between generated and target responses, and produce empathetic responses covering both affective and cognitive aspects.
摘要:共情反应生成是构建人性化对话系统的关键,其目的是理解用户的处境和感受,并做出共情反应。以往的方法主要是以最大似然估计作为训练反应生成模型的优化目标,而没有考虑生成的反应与目标反应之间的共情水平对齐。为此,我们提出了一种基于强化学习的共情反应生成(EmpRL)框架。该框架设计了一种有效的共情奖励函数,并通过强化学习最大化期望回报来产生共情反应。鉴于预先训练的语言模型具有强大的文本生成能力,EmpRL使用预先训练的T5模型作为生成器,并进行进一步的训练来初始化策略。为了使产生的反应和目标反应之间的共情水平在语境中保持一致,使用预先设计和预先训练的共情识别器,构建了包含情感反应、解释和探索三种共情交流机制的共情奖励函数。最后,使用最近策略优化算法对策略进行进一步训练,以产生移情响应。自动和手动测试都表明,EmpRL框架可以提高生成反应的质量,提高生成反应和目标反应之间的共情水平相似性,并产生涵盖情感和认知两个方面的共情反应。
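
下面用一个极简草图示意"共情水平对齐"奖励的构造思路:用三个预先训练好的共情机制打分器(情感反应、解释、探索)分别给生成回复和目标回复打分,奖励取两者水平差距的负值;打分器接口与具体计分方式均为假设,实际论文还会在此奖励之上用PPO进行策略优化。

```python
def empathy_reward(generated: str, target: str, identifiers) -> float:
    """identifiers: 三个假设的共情机制打分器(情感反应ER、解释IP、探索EX),
    各自把一条回复映射为一个共情水平分数。"""
    gen_levels = [score(generated) for score in identifiers]
    tgt_levels = [score(target) for score in identifiers]
    # 生成回复与目标回复在三种机制上的水平差距越小,奖励越高
    gaps = [abs(g - t) for g, t in zip(gen_levels, tgt_levels)]
    return -sum(gaps) / len(gaps)
```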

[NLP-29] EC-Guide: A Comprehensive E-Commerce Guide for Instruction Tuning and Quantization
[NLP-29] EC指南:指令调整和量化的全面电子商务指南

链接: https://arxiv.org/abs/2408.02970
作者: Zhaopeng Feng,Zijie Meng,Zuozhu Liu
关键词-EN: Large language models, attracted considerable attention, Large language, language models, attracted considerable
关键词-ZN: 大型语言模型,引起了相当大的关注,大型语言,语言模型,引起了相当大的关注
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have attracted considerable attention in various fields for their cost-effective solutions to diverse challenges, especially with advancements in instruction tuning and quantization. E-commerce, with its complex tasks and extensive product-user interactions, presents a promising application area for LLMs. However, the domain-specific concepts and knowledge inherent in e-commerce pose significant challenges for adapting general LLMs. To address this issue, we developed EC-Guide (this https URL), a comprehensive e-commerce guide for instruction tuning and quantization of LLMs. We also heuristically integrated Chain-of-Thought (CoT) during inference to enhance arithmetic performance. Our approach achieved the 2nd place in Track 2 and 5th place in Track 5 at the Amazon KDD Cup'24 (this https URL). Additionally, our solution is model-agnostic, enabling effective scalability across larger systems.
摘要:大型语言模型(LLM)因其针对各种挑战的经济高效解决方案而在各个领域引起了相当大的关注,特别是随着指令调优和量化方面的进步。电子商务任务复杂,产品与用户之间的交互广泛,为LLM提供了一个有前途的应用领域。然而,电子商务固有的特定领域概念和知识对适应通用LLM提出了重大挑战。为了解决这个问题,我们开发了EC-Guide(this https URL),这是一款用于LLM指令调整和量化的全面电子商务指南。我们还在推理过程中启发式地集成了思维链(CoT),以提高算术性能。我们的方法在Amazon KDD Cup'24(this https URL)的Track 2中获得第2名,在Track 5中获得第5名。此外,我们的解决方案是模型不可知的,可以在更大的系统中实现有效的可扩展性。

[NLP-30] Accuracy and Consistency of LLMs in the Registered Dietitian Exam: The Impact of Prompt Engineering and Knowledge Retrieval
[NLP-30] 大型语言模型在注册营养师考试中的准确性和一致性:提示工程和知识检索的影响

链接: https://arxiv.org/abs/2408.02964
作者: Iman Azimi,Mohan Qi,Li Wang,Amir M. Rahmani,Youlin Li
关键词-EN: Large language models, boosting patient engagement, accelerating clinical decision-making, facilitating medical education, fundamentally transforming human-facing
关键词-ZN: 大型语言模型,提高患者参与度,加快临床决策,促进医学教育,从根本上改变人性
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are fundamentally transforming human-facing applications in the health and well-being domains: boosting patient engagement, accelerating clinical decision-making, and facilitating medical education. Although state-of-the-art LLMs have shown superior performance in several conversational applications, evaluations within nutrition and diet applications are still insufficient. In this paper, we propose to employ the Registered Dietitian (RD) exam to conduct a standard and comprehensive evaluation of state-of-the-art LLMs, GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro, assessing both accuracy and consistency in nutrition queries. Our evaluation includes 1050 RD exam questions encompassing several nutrition topics and proficiency levels. In addition, for the first time, we examine the impact of Zero-Shot (ZS), Chain of Thought (CoT), Chain of Thought with Self Consistency (CoT-SC), and Retrieval Augmented Prompting (RAP) on both accuracy and consistency of the responses. Our findings revealed that while these LLMs obtained acceptable overall performance, their results varied considerably with different prompts and question domains. GPT-4o with CoT-SC prompting outperformed the other approaches, whereas Gemini 1.5 Pro with ZS recorded the highest consistency. For GPT-4o and Claude 3.5, CoT improved the accuracy, and CoT-SC improved both accuracy and consistency. RAP was particularly effective for GPT-4o to answer Expert level questions. Consequently, choosing the appropriate LLM and prompting technique, tailored to the proficiency level and specific domain, can mitigate errors and potential risks in diet and nutrition chatbots.
摘要:大型语言模型(LLM)正在从根本上改变健康和福祉领域中面向人的应用程序:提高患者参与度,加速临床决策,并促进医学教育。尽管最先进的LLM在若干对话应用程序中表现出了卓越的性能,但在营养和饮食应用中的评估仍然不足。在本文中,我们建议使用注册营养师(RD)考试,对最先进的LLM(GPT-4o、Claude 3.5 Sonnet和Gemini 1.5 Pro)进行标准而全面的评估,同时评估其在营养查询上的准确性和一致性。我们的评估包括1050道RD考题,涵盖多个营养主题和熟练程度。此外,我们还首次考察了零样本(ZS)、思维链(CoT)、带自我一致性的思维链(CoT-SC)和检索增强提示(RAP)对回答准确性和一致性的影响。我们的研究结果表明,虽然这些LLM的整体表现可以接受,但其结果随提示方式和问题领域的不同而差异很大。带有CoT-SC提示的GPT-4o表现优于其他方法,而带有ZS的Gemini 1.5 Pro一致性最高。对于GPT-4o和Claude 3.5,CoT提高了准确度,CoT-SC同时提高了准确度和一致性。RAP对GPT-4o回答专家级问题特别有效。因此,选择适合熟练程度和特定领域的LLM与提示技术,可以减少饮食和营养聊天机器人的错误和潜在风险。
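
作为参考,下面给出"带自我一致性的思维链(CoT-SC)"的一个最小示意:对同一道题采样多条思维链,再对最终答案做多数投票;其中 ask_llm 为假设的LLM调用接口,"最后一行即最终答案"也只是示例性约定。

```python
from collections import Counter

def cot_self_consistency(question: str, ask_llm, n_samples: int = 5) -> str:
    """ask_llm(prompt, temperature): 假设的LLM调用接口,返回包含推理过程的文本。"""
    prompt = f"{question}\n请一步一步推理,并在最后一行单独给出最终答案。"
    final_answers = []
    for _ in range(n_samples):
        reply = ask_llm(prompt, temperature=0.7)               # 较高温度下采样多条思维链
        final_answers.append(reply.strip().splitlines()[-1])   # 约定最后一行为最终答案
    # 多数投票:出现次数最多的答案即为自我一致性答案
    return Counter(final_answers).most_common(1)[0][0]
```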

[NLP-31] Are Female Carpenters like Blue Bananas? A Corpus Investigation of Occupation Gender Typicality
[NLP-31] 女木匠像蓝香蕉吗?职业性别典型性的语料库调查

链接: https://arxiv.org/abs/2408.02948
作者: Da Ju,Karen Ulrich,Adina Williams
关键词-EN: mention surprising properties, People tend, language to mention, mention surprising, properties of events
关键词-ZN: 提及令人惊讶的属性,人们倾向,提及的语言,提及令人惊讶的,事件的属性
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:People tend to use language to mention surprising properties of events: for example, when a banana is blue, we are more likely to mention color than when it is yellow. This fact is taken to suggest that yellowness is somehow a typical feature of bananas, and blueness is exceptional. Similar to how a yellow color is typical of bananas, there may also be genders that are typical of occupations. In this work, we explore this question using information theoretic techniques coupled with corpus statistic analysis. In two distinct large corpora, we do not find strong evidence that occupations and gender display the same patterns of mentioning as do bananas and color. Instead, we find that gender mentioning is correlated with femaleness of occupation in particular, suggesting perhaps that woman-dominated occupations are seen as somehow ``more gendered’’ than male-dominated ones, and thereby they encourage more gender mentioning overall.
摘要:人们倾向于使用语言来提及事件中令人惊讶的属性:例如,当香蕉是蓝色时,我们比当香蕉是黄色时更有可能提及颜色。这一事实被认为表明黄色是香蕉的典型特征,而蓝色则是例外。与香蕉的典型黄色类似,职业也可能存在典型的性别。在这项工作中,我们使用信息论技术结合语料库统计分析来探讨这个问题。在两个不同的大型语料库中,我们没有发现强有力的证据表明职业和性别表现出与香蕉和颜色相同的提及模式。相反,我们发现性别提及尤其与职业的女性化程度相关,这表明女性主导的职业在某种程度上被视为比男性主导的职业"更性别化",从而鼓励整体上更多地提及性别。
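
下面给出一个示意性的小例子,说明可以如何用点互信息(PMI)这类信息论量来刻画"某职业被提及时附带性别标记"的倾向;摘要并未给出论文使用的具体统计量,此处的PMI仅作为一种可能的实现思路。

```python
import math
from collections import Counter

def gender_mention_pmi(observations):
    """observations: (occupation, gender_mentioned) 列表,gender_mentioned为布尔值,
    表示该职业在语料中被提及时是否带有性别标记。返回每个职业的PMI值。"""
    total = len(observations)
    occ_counts = Counter(occ for occ, _ in observations)
    mention_count = sum(1 for _, m in observations if m)
    joint_counts = Counter(occ for occ, m in observations if m)

    pmi = {}
    for occ, n_occ in occ_counts.items():
        n_joint = joint_counts.get(occ, 0)
        if n_joint == 0:
            continue
        # PMI = log [ P(职业, 提及性别) / (P(职业) * P(提及性别)) ]
        pmi[occ] = math.log((n_joint / total) / ((n_occ / total) * (mention_count / total)))
    return pmi

# 用法示例(玩具数据,仅演示接口)
obs = [("carpenter", True), ("carpenter", False), ("nurse", True), ("nurse", True)]
print(gender_mention_pmi(obs))
```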

[NLP-32] Self-Supervised Learning for Multi-Channel Neural Transducer
[NLP-32] 多通道神经转换器的自监督学习

链接: https://arxiv.org/abs/2408.02945
作者: Atsushi Kojima
关键词-EN: automatic speech recognition, framework significantly improves, ASR model, ASR, ASR model based
关键词-ZN: 自动语音识别,框架显着改进,ASB模型,ASB,基于ASB模型
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Self-supervised learning, such as with the wav2vec 2.0 framework significantly improves the accuracy of end-to-end automatic speech recognition (ASR). Wav2vec 2.0 has been applied to single-channel end-to-end ASR models. In this work, we explored a self-supervised learning method for a multi-channel end-to-end ASR model based on the wav2vec 2.0 framework. As the multi-channel end-to-end ASR model, we focused on a multi-channel neural transducer. In pre-training, we compared three different methods for feature quantization to train a multi-channel conformer audio encoder: joint quantization, feature-wise quantization and channel-wise quantization. In fine-tuning, we trained the multi-channel conformer-transducer. All experiments were conducted using the far-field in-house and CHiME-4 datasets. The results of the experiments showed that feature-wise quantization was the most effective among the methods. We observed a 66% relative reduction in character error rate compared with the model without any pre-training for the far-field in-house dataset.
摘要:以wav2vec 2.0框架为代表的自监督学习显著提高了端到端自动语音识别(ASR)的准确率。wav2vec 2.0已应用于单通道端到端ASR模型。在这项工作中,我们探索了一种基于wav2vec 2.0框架的多通道端到端ASR模型的自监督学习方法。作为多通道端到端ASR模型,我们重点研究了一种多通道神经转换器(neural transducer)。在预训练中,我们比较了三种不同的特征量化方法来训练多通道Conformer音频编码器:联合量化、逐特征量化和逐通道量化。在微调阶段,我们训练了多通道Conformer转换器。所有实验都使用远场内部数据集和CHiME-4数据集进行。实验结果表明,逐特征量化是其中最有效的方法。与没有任何预训练的模型相比,我们在远场内部数据集上观察到字符错误率相对降低了66%。

[NLP-33] HARMONIC: Harnessing LLMs for Tabular Data Synthesis and Privacy Protection
[NLP-33] HARMONIC:利用LLM进行表格数据合成和隐私保护

链接: https://arxiv.org/abs/2408.02927
作者: Yuxin Wang,Duanyu Feng,Yongfu Dai,Zhengyu Chen,Jimin Huang,Sophia Ananiadou,Qianqian Xie,Hao Wang
关键词-EN: tabular data, advancing deep learning, tabular data generation, Data, tabular data presented
关键词-ZN: 表格数据,推进深度学习,表格数据生成,数据,呈现的表格数据
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Data serves as the fundamental foundation for advancing deep learning, particularly tabular data presented in a structured format, which is highly conducive to modeling. However, even in the era of LLM, obtaining tabular data from sensitive domains remains a challenge due to privacy or copyright concerns. Hence, exploring how to effectively use models like LLMs to generate realistic and privacy-preserving synthetic tabular data is urgent. In this paper, we take a step forward to explore LLMs for tabular data synthesis and privacy protection, by introducing a new framework HARMONIC for tabular data generation and evaluation. In the tabular data generation of our framework, unlike previous small-scale LLM-based methods that rely on continued pre-training, we explore the larger-scale LLMs with fine-tuning to generate tabular data and enhance privacy. Based on idea of the k-nearest neighbors algorithm, an instruction fine-tuning dataset is constructed to inspire LLMs to discover inter-row relationships. Then, with fine-tuning, LLMs are trained to remember the format and connections of the data rather than the data itself, which reduces the risk of privacy leakage. In the evaluation part of our framework, we develop specific privacy risk metrics DLT for LLM synthetic data generation, as well as performance evaluation metrics LLE for downstream LLM tasks. Our experiments find that this tabular data generation framework achieves equivalent performance to existing methods with better privacy, which also demonstrates our evaluation framework for the effectiveness of synthetic data and privacy risks in LLM scenarios.
摘要:数据是推进深度学习的基础,尤其是以结构化格式表示的表格数据,非常有利于建模。然而,即使在LLM时代,由于隐私或版权问题,从敏感域获取表格数据仍然是一个挑战。因此,探索如何有效地使用像LLMS这样的模型来生成真实且保护隐私的合成表格数据是当务之急。在本文中,我们通过引入一种新的表格数据生成和评估框架Harmonic,进一步探索用于表格数据合成和隐私保护的LLMS。在我们框架的表格数据生成中,不同于以往基于LLM的小规模方法依赖于持续的预训练,我们探索了更大规模的LLM,并进行了微调以生成表格数据和增强隐私。基于k-近邻算法的思想,构造了一个指令微调数据集,以启发LLMS发现行间关系。然后,通过微调,LLM被训练成记住数据的格式和连接,而不是数据本身,这降低了隐私泄露的风险。在框架的评估部分,我们为LLM合成数据生成开发了特定的隐私风险度量DLT,并为下游LLM任务开发了性能评估度量LLE。实验发现,该表格数据生成框架具有与现有方法相当的性能,具有更好的保密性,这也验证了我们对LLM场景中合成数据的有效性和隐私风险的评估框架。
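
摘要提到"基于k近邻思想构造指令微调数据集,启发LLM发现行间关系";下面是一个按这一思路写的示意草图:为每一行找出若干最近邻行作为上下文,让模型据此生成一条新行。其中行的序列化方式、encode编码函数和提示词写法均为假设,并非论文的实际实现。

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def build_knn_instruction_data(rows, encode, k=5):
    """rows: 已序列化为文本的表格行;encode: 行文本 -> 向量 的假设编码函数。
    返回 (instruction, output) 形式的指令微调样本列表。"""
    X = np.stack([encode(r) for r in rows])
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1 是因为最近邻结果包含自身
    _, idx = nn.kneighbors(X)

    samples = []
    for i, row in enumerate(rows):
        neighbors = [rows[j] for j in idx[i] if j != i][:k]
        instruction = (
            "以下是若干条相互相似的表格行:\n"
            + "\n".join(neighbors)
            + "\n请生成一条与上述行分布一致、但内容不同的新行:"
        )
        samples.append({"instruction": instruction, "output": row})
    return samples
```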

[NLP-34] Intermediate direct preference optimization
[NLP-34] 中间直接偏好优化

链接: https://arxiv.org/abs/2408.02923
作者: Atsushi Kojima
关键词-EN: intermediate DPO model, intermediate DPO, intermediate DPO loss, DPO, DPO model
关键词-ZN: 中间DPO模型,中间DPO,中间DPO损失,DPO,DPO模型
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We propose the intermediate direct preference optimization (DPO) method to calculate the DPO loss at selected intermediate layers as an auxiliary loss for finetuning large language models (LLMs). The conventional DPO method fine-tunes a supervised fine-tuning (SFT) model by calculating the DPO loss using logits from the final layer. In our intermediate DPO approach, DPO losses are calculated using the logits from K-selected intermediate layers and averaged to obtain the intermediate DPO loss. For training the intermediate DPO model, the final loss is obtained by calculating the weighted sum of the DPO and intermediate DPO losses. During inference, the intermediate DPO model decodes using the final layer logits similarly to the conventional DPO model. In experiments using the ultrafeedback dataset, the performance of the intermediate DPO model was evaluated using GPT-4. As a result, the intermediate DPO model trained using the intermediate DPO loss calculated at the 22nd layer of a 32-layer SFT model achieved win rates of 52.5% and 67.5% against the conventional DPO and SFT models, respectively, demonstrating the effectiveness of the proposed method. Furthermore, we report the relationships among the position of the selected intermediate layers, the number of layers, and performance.
摘要:我们提出了中间直接偏好优化(DPO)方法,在选定的中间层上计算DPO损失,作为微调大型语言模型(LLM)的辅助损失。传统的DPO方法通过使用最后一层的logits计算DPO损失来微调有监督微调(SFT)模型。在我们的中间DPO方法中,使用K个选定中间层的logits计算DPO损失,并对其取平均以获得中间DPO损失。在训练中间DPO模型时,最终损失由DPO损失与中间DPO损失的加权和给出。在推理过程中,中间DPO模型与传统DPO模型一样,使用最后一层的logits进行解码。在使用UltraFeedback数据集的实验中,我们用GPT-4评估了中间DPO模型的性能。结果表明,使用32层SFT模型第22层计算的中间DPO损失训练得到的中间DPO模型,相对于传统DPO模型和SFT模型分别获得了52.5%和67.5%的胜率,证明了该方法的有效性。此外,我们还报告了所选中间层的位置、层数与性能之间的关系。
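
下面用几行PyTorch风格的代码示意摘要中的两个关键量:最后一层的标准DPO损失,以及"最终损失 = DPO损失 + 权重 × 中间层DPO损失均值"的组合方式;beta、权重系数以及中间层logits的取得方式(例如将LM头作用于中间隐状态)均为示例性假设。

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """标准DPO损失:比较策略模型与参考模型对偏好/非偏好回复的对数概率差。"""
    margins = beta * ((policy_chosen_logps - ref_chosen_logps)
                      - (policy_rejected_logps - ref_rejected_logps))
    return -F.logsigmoid(margins).mean()

def intermediate_dpo_total_loss(final_layer_loss, intermediate_layer_losses, weight=0.5):
    """最终损失 = 最后一层DPO损失 + weight * 选定中间层DPO损失的平均值(加权和)。"""
    intermediate_loss = torch.stack(intermediate_layer_losses).mean()
    return final_layer_loss + weight * intermediate_loss
```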

[NLP-35] Data Checklist: On Unit-Testing Datasets with Usable Information
[NLP-35] 数据清单:关于具有可用信息的单元测试数据集

链接: https://arxiv.org/abs/2408.02919
作者: Heidi C. Zhang,Shabnam Behzad,Kawin Ethayarajh,Dan Jurafsky
关键词-EN: model behavior, software engineering, tool for understanding, analogous to unit-testing, Model
关键词-ZN: 模型行为,软件工程,理解工具,类似于单元测试,模型
类目: Computation and Language (cs.CL)
备注: 17 pages, 4 figures. COLM 2024

点击查看摘要

Abstract:Model checklists (Ribeiro et al., 2020) have emerged as a useful tool for understanding the behavior of LLMs, analogous to unit-testing in software engineering. However, despite datasets being a key determinant of model behavior, evaluating datasets, e.g., for the existence of annotation artifacts, is largely done ad hoc, once a problem in model behavior has already been found downstream. In this work, we take a more principled approach to unit-testing datasets by proposing a taxonomy based on the V-information literature. We call a collection of such unit tests a data checklist. Using a checklist, not only are we able to recover known artifacts in well-known datasets such as SNLI, but we also discover previously unknown artifacts in preference datasets for LLM alignment. Data checklists further enable a new kind of data filtering, which we use to improve the efficacy and data efficiency of preference alignment.
摘要:模型检查表(Ribeiro等人,2020年)已成为了解LLM行为的有用工具,类似于软件工程中的单元测试。然而,尽管数据集是模型行为的关键决定因素,但评估数据集,例如,一旦下游已经发现模型行为中的问题,那么注释工件的存在就很大程度上是临时完成的。在这项工作中,我们通过基于V信息文献提出分类法,对单元测试数据集采取了更有原则的方法。我们将此类单元测试的集合称为数据清单。使用检查表,我们不仅能够恢复SNLI等知名数据集中的已知伪影,而且还在LLM对齐的偏好数据集中发现以前未知的伪影。数据检查表进一步实现了一种新型的数据过滤,我们使用它来提高偏好对齐的有效性和数据效率。

[NLP-36] Leveraging Inter-Chunk Interactions for Enhanced Retrieval in Large Language Model-Based Question Answering
[NLP-36] 利用块间交互在基于大型语言模型的问题解答中增强检索

链接: https://arxiv.org/abs/2408.02907
作者: Tiezheng Guo,Chen Wang,Yanyi Liu,Jiawei Tang,Pan Li,Sai Xu,Qingwen Yang,Xianlin Gao,Zhi Li,Yingyou Wen
关键词-EN: Retrieving external knowledge, Retrieving external, knowledge and prompting, effective paradigm, performance of question-answering
关键词-ZN: 检索外部知识,检索外部,知识与提示,有效范式,问答绩效
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Retrieving external knowledge and prompting large language models with relevant information is an effective paradigm to enhance the performance of question-answering tasks. Previous research typically handles paragraphs from external documents in isolation, resulting in a lack of context and ambiguous references, particularly in multi-document and complex tasks. To overcome these challenges, we propose a new retrieval framework IIER, that leverages Inter-chunk Interactions to Enhance Retrieval. This framework captures the internal connections between document chunks by considering three types of interactions: structural, keyword, and semantic. We then construct a unified Chunk-Interaction Graph to represent all external documents comprehensively. Additionally, we design a graph-based evidence chain retriever that utilizes previous paths and chunk interactions to guide the retrieval process. It identifies multiple seed nodes based on the target question and iteratively searches for relevant chunks to gather supporting evidence. This retrieval process refines the context and reasoning chain, aiding the large language model in reasoning and answer generation. Extensive experiments demonstrate that IIER outperforms strong baselines across four datasets, highlighting its effectiveness in improving retrieval and reasoning capabilities.
摘要:提取外部知识并用相关信息提示大型语言模型是提高问答任务绩效的有效范式。以前的研究通常孤立地处理外部文档中的段落,导致缺乏上下文和模糊引用,特别是在多文档和复杂任务中。为了克服这些挑战,我们提出了一个新的检索框架IIER,该框架利用组块间的交互来增强检索。该框架通过考虑三种类型的交互来捕获文档块之间的内部联系:结构交互、关键字交互和语义交互。然后,我们构造了一个统一的块交互图来全面地表示所有外部文档。此外,我们设计了一个基于图的证据链检索器,该检索器利用先前的路径和块交互来指导检索过程。它根据目标问题识别多个种子节点,并迭代搜索相关的组块来收集支持证据。这个检索过程细化了上下文和推理链,帮助大型语言模型进行推理和答案生成。大量的实验表明,IIER在四个数据集上的表现优于强基线,突显了它在提高检索和推理能力方面的有效性。
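
下面用networkx给出"块交互图"三类边(结构、关键词、语义)的一个构图草图;关键词集合、embed编码函数与相似度阈值均为假设,基于证据链的检索部分未包含在内。

```python
import numpy as np
import networkx as nx

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def build_chunk_interaction_graph(chunks, keywords, embed, sim_threshold=0.8):
    """chunks: 文档切块文本列表;keywords: 每个块的关键词集合;
    embed: 块文本 -> 向量 的假设编码函数。"""
    g = nx.Graph()
    g.add_nodes_from(range(len(chunks)))
    vectors = [embed(c) for c in chunks]

    for i in range(len(chunks) - 1):
        g.add_edge(i, i + 1, kind="structural")            # 相邻块:结构边
    for i in range(len(chunks)):
        for j in range(i + 1, len(chunks)):
            if keywords[i] & keywords[j]:
                g.add_edge(i, j, kind="keyword")           # 共享关键词:关键词边
            elif cosine(vectors[i], vectors[j]) > sim_threshold:
                g.add_edge(i, j, kind="semantic")          # 语义相似:语义边
    return g
```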

[NLP-37] Lighthouse: A User-Friendly Library for Reproducible Video Moment Retrieval and Highlight Detection
[NLP-37] Lighthouse:一个用户友好的可再现视频时刻检索和亮点检测库

链接: https://arxiv.org/abs/2408.02901
作者: Taichi Nishimura,Shota Nakada,Hokuto Munakata,Tatsuya Komatsu
关键词-EN: video moment retrieval, reproducible video moment, highlight detection, user-friendly library, video moment
关键词-ZN: 视频时刻检索、可再现视频时刻、亮点检测、用户友好库、视频时刻
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM)
备注: 6 pages; library tech report

点击查看摘要

Abstract:We propose Lighthouse, a user-friendly library for reproducible video moment retrieval and highlight detection (MR-HD). Although researchers proposed various MR-HD approaches, the research community holds two main issues. The first is a lack of comprehensive and reproducible experiments across various methods, datasets, and video-text features. This is because no unified training and evaluation codebase covers multiple settings. The second is user-unfriendly design. Because previous works use different libraries, researchers set up individual environments. In addition, most works release only the training codes, requiring users to implement the whole inference process of MR-HD. Lighthouse addresses these issues by implementing a unified reproducible codebase that includes six models, three features, and five datasets. In addition, it provides an inference API and web demo to make these methods easily accessible for researchers and developers. Our experiments demonstrate that Lighthouse generally reproduces the reported scores in the reference papers. The code is available at this https URL.
摘要:我们提出了Lighthouse,一个用户友好的库,用于可复现的视频时刻检索和亮点检测(MR-HD)。尽管研究人员提出了各种MR-HD方法,但研究界存在两个主要问题。首先是缺乏跨各种方法、数据集和视频文本特征的全面且可复现的实验,这是因为没有统一的训练和评估代码库覆盖多种设置。第二个问题是设计对用户不够友好。由于以前的工作使用不同的库,研究人员需要搭建各自独立的环境;此外,大多数工作只发布训练代码,要求用户自行实现MR-HD的整个推理过程。Lighthouse通过实现统一的可复现代码库来解决这些问题,该代码库包括六个模型、三种特征和五个数据集。此外,它还提供了推理API和Web演示,使研究人员和开发人员可以轻松地使用这些方法。我们的实验表明,Lighthouse基本上能够复现参考论文中报告的分数。代码可在此HTTPS URL上找到。

[NLP-38] SETN: Stock Embedding Enhanced with Textual and Network Information
[NLP-38] SETN:利用文本和网络信息增强股票嵌入

链接: https://arxiv.org/abs/2408.02899
作者: Takehiro Takayanagi,Hiroki Sakaji,Kiyoshi Izumi
关键词-EN: Stock embedding, Stock, vector representation, information, stock embeddings obtained
关键词-ZN: 股票嵌入、股票、载体表示、信息、获得的股票嵌入
类目: Computation and Language (cs.CL); Computational Engineering, Finance, and Science (cs.CE)
备注:

点击查看摘要

Abstract:Stock embedding is a method for vector representation of stocks. There is a growing demand for vector representations of stock, i.e., stock embedding, in wealth management sectors, and the method has been applied to various tasks such as stock price prediction, portfolio optimization, and similar fund identifications. Stock embeddings have the advantage of enabling the quantification of relative relationships between stocks, and they can extract useful information from unstructured data such as text and network data. In this study, we propose stock embedding enhanced with textual and network information (SETN) using a domain-adaptive pre-trained transformer-based model to embed textual information and a graph neural network model to grasp network information. We evaluate the performance of our proposed model on related company information extraction tasks. We also demonstrate that stock embeddings obtained from the proposed model perform better in creating thematic funds than those obtained from baseline methods, providing a promising pathway for various applications in the wealth management industry.
摘要:股票嵌入是一种股票的向量表示方法。在财富管理领域,对股票的向量表示,即股票嵌入的需求越来越大,该方法已被应用于各种任务,如股价预测、投资组合优化和类似的基金识别。股票嵌入的优点是能够量化股票之间的相对关系,并且可以从文本和网络数据等非结构化数据中提取有用的信息。在这项研究中,我们提出了用文本和网络信息增强的股票嵌入(SETN),使用基于域自适应的预训练变压器模型来嵌入文本信息,并使用图神经网络模型来获取网络信息。我们对我们提出的模型在相关公司信息提取任务上的性能进行了评估。我们还证明了基于该模型的股票嵌入在创建主题基金方面的表现优于通过基线方法获得的股票嵌入,为财富管理行业的各种应用提供了一条有前途的路径。

[NLP-39] A Framework for Fine-Tuning LLMs using Heterogeneous Feedback
[NLP-39] 使用异质反馈微调LLM的框架

链接: https://arxiv.org/abs/2408.02861
作者: Ryan Aponte(1),Ryan A. Rossi(2),Shunan Guo(2),Franck Dernoncourt(2),Tong Yu(2),Xiang Chen(2),Subrata Mitra(2),Nedim Lipka(2) ((1) Carnegie Mellon University, (2) Adobe Research)
关键词-EN: including text summarization, Large language models, Large language, web navigation, range of tasks
关键词-ZN: 包括文本摘要、大型语言模型、大型语言、网络导航、任务范围
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 7 pages, 1 figure

点击查看摘要

Abstract:Large language models (LLMs) have been applied to a wide range of tasks, including text summarization, web navigation, and chatbots. They have benefitted from supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) following an unsupervised pretraining. These datasets can be difficult to collect, limited in scope, and vary in sample quality. Additionally, datasets can vary extensively in supervision format, from numerical to binary as well as multi-dimensional with many different values. We present a framework for fine-tuning LLMs using heterogeneous feedback, which has two main components. First, we combine the heterogeneous feedback data into a single supervision format, compatible with methods like SFT and RLHF. Next, given this unified feedback dataset, we extract a high-quality and diverse subset to obtain performance increases potentially exceeding the full dataset. We conduct extensive experiments to understand the effectiveness of these techniques for incorporating heterogeneous feedback, and demonstrate improvements from using a high-quality and diverse subset of the data. We find that our framework is able to improve models in multiple areas simultaneously, such as in instruction following and bias reduction.
摘要:大语言模型已被广泛应用于文本摘要、网络导航和聊天机器人等任务中。他们受益于无监督预训练后的监督微调(SFT)和人类反馈强化学习(RLHF)。这些数据集可能很难收集,范围有限,样本质量参差不齐。此外,数据集可以在监管格式上有很大的不同,从数字到二进制以及具有许多不同值的多维数据。我们提出了一种使用异质反馈微调LLMS的框架,该框架包括两个主要组成部分。首先,我们将不同种类的反馈数据合并成单一的监督格式,与SFT和RLHF等方法兼容。接下来,给出这个统一的反馈数据集,我们提取一个高质量和多样化的子集,以获得潜在地超过完整数据集的性能提升。我们进行了广泛的实验,以了解这些技术在整合异质反馈方面的有效性,并展示了使用高质量和多样化的数据子集的改进。我们发现,我们的框架能够同时在多个领域改进模型,如在指令遵循和减少偏差方面。

[NLP-40] Interpretation of the Intent Detection Problem as Dynamics in a Low-dimensional Space ECAI-2024
[NLP-40] 意图检测问题解释为低维空间中的动力学

链接: https://arxiv.org/abs/2408.02838
作者: Eduardo Sanchez-Karhunen,Jose F. Quesada-Moreno,Miguel A. Gutiérrez-Naranjo
关键词-EN: text classification task, Intent detection, users query, text classification, recognize and label
关键词-ZN: 文本分类任务、意图检测、用户查询、文本分类、识别和标签
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注: Camera-Ready version. Accepted paper at 27th European Conference on Artificial Intelligence (ECAI-2024)

点击查看摘要

Abstract:Intent detection is a text classification task whose aim is to recognize and label the semantics behind a users query. It plays a critical role in various business applications. The output of the intent detection module strongly conditions the behavior of the whole system. This sequence analysis task is mainly tackled using deep learning techniques. Despite the widespread use of these techniques, the internal mechanisms used by networks to solve the problem are poorly understood. Recent lines of work have analyzed the computational mechanisms learned by RNNs from a dynamical systems perspective. In this work, we investigate how different RNN architectures solve the SNIPS intent detection problem. Sentences injected into trained networks can be interpreted as trajectories traversing a hidden state space. This space is constrained to a low-dimensional manifold whose dimensionality is related to the embedding and hidden layer sizes. To generate predictions, RNN steers the trajectories towards concrete regions, spatially aligned with the output layer matrix rows directions. Underlying the system dynamics, an unexpected fixed point topology has been identified with a limited number of attractors. Our results provide new insights into the inner workings of networks that solve the intent detection task.
摘要:意图检测是一项文本分类任务,其目的是识别和标注用户查询背后的语义。它在各种业务应用中发挥着关键作用。意图检测模块的输出强烈地制约着整个系统的行为。这个序列分析任务主要是使用深度学习技术来处理的。尽管这些技术被广泛使用,但网络用来解决问题的内部机制却鲜为人知。最近的工作从动态系统的角度分析了RNN学习的计算机制。在这项工作中,我们研究了不同的RNN体系结构如何解决Snips意图检测问题。注入训练网络的句子可以被解释为穿越隐藏状态空间的轨迹。该空间被限制为低维流形,其维度与嵌入层和隐藏层的大小有关。为了生成预测,RNN将轨迹导向具体区域,空间上与输出层矩阵行方向对齐。在系统动力学的基础上,用有限数量的吸引子确定了一个意想不到的不动点拓扑。我们的结果为解决意图检测任务的网络的内部工作提供了新的见解。

[NLP-41] Examining Gender and Power on Wikipedia Through Face and Politeness
[NLP-41] 通过面子和礼貌审视维基百科上的性别和权力

链接: https://arxiv.org/abs/2408.02798
作者: Adil Soubki,Shyne Choi,Owen Rambow
关键词-EN: sociolinguistic theory, face acts, analyzing discourse, discourse by combining, combining two interdependent
关键词-ZN: 社会语言学理论,面部行为,分析话语,结合话语,结合两个相互依存的话语
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We propose a framework for analyzing discourse by combining two interdependent concepts from sociolinguistic theory: face acts and politeness. While politeness has robust existing tools and data, face acts are less resourced. We introduce a new corpus created by annotating Wikipedia talk pages with face acts and we use this to train a face act tagger. We then employ our framework to study how face and politeness interact with gender and power in discussions between Wikipedia editors. Among other findings, we observe that female Wikipedians are not only more polite, which is consistent with prior studies, but that this difference corresponds with significantly more language directed at humbling aspects of their own face. Interestingly, the distinction nearly vanishes once limiting to editors with administrative power.
摘要:我们通过结合社会语言学理论中两个相互依赖的概念:面子行为(face acts)和礼貌,提出了一个分析话语的框架。虽然礼貌已有完善的现有工具和数据,但面子行为的资源较少。我们引入了一个通过用面子行为标注维基百科讨论页面而创建的新语料库,并用它来训练面子行为标注器。然后,我们利用该框架研究在维基百科编辑之间的讨论中,面子和礼貌如何与性别和权力相互作用。除其他发现外,我们观察到女性维基人不仅更有礼貌(这与之前的研究一致),而且这种差异与明显更多地使用贬抑自己面子的语言相对应。有趣的是,一旦局限于拥有管理权限的编辑,这种区别几乎就消失了。

[NLP-42] Entity Retrieval for Answering Entity-Centric Questions
[NLP-42] 回答以实体为中心的问题的实体检索

链接: https://arxiv.org/abs/2408.02795
作者: Hassan S. Shavarani,Anoop Sarkar
关键词-EN: retrieval-augmented question answering, crucial factor, Entity Retrieval, retrieval, question answering
关键词-ZN: 检索增强问答,关键因素,实体检索,检索,问答
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: 17 pages total, 10 Tables, 4 Figures

点击查看摘要

Abstract:The similarity between the question and indexed documents is a crucial factor in document retrieval for retrieval-augmented question answering. Although this is typically the only method for obtaining the relevant documents, it is not the sole approach when dealing with entity-centric questions. In this study, we propose Entity Retrieval, a novel retrieval method which rather than relying on question-document similarity, depends on the salient entities within the question to identify the retrieval documents. We conduct an in-depth analysis of the performance of both dense and sparse retrieval methods in comparison to Entity Retrieval. Our findings reveal that our method not only leads to more accurate answers to entity-centric questions but also operates more efficiently.
摘要:问题和索引文档之间的相似性是用于检索增强问答的文档检索中的一个关键因素。尽管这通常是获取相关文档的唯一方法,但它并不是处理以实体为中心的问题的唯一方法。在这项研究中,我们提出了实体检索,这是一种新颖的检索方法,它不依赖于问题与文档的相似性,而是依赖于问题中的突出实体来识别检索文档。与实体检索相比,我们对密集和稀疏检索方法的性能进行了深入分析。我们的研究结果表明,我们的方法不仅可以更准确地回答以实体为中心的问题,而且还可以更有效地运行。
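
下面是"按实体而非问句-文档相似度检索"思路的一个极简草图:先建立实体到文档的倒排索引,检索时按问题中显著实体的命中数给文档排序;extract_entities 为假设的实体抽取函数(例如NER或实体链接),排序方式也仅作示例。

```python
from collections import defaultdict

def build_entity_index(docs, extract_entities):
    """docs: {doc_id: 文本};extract_entities: 文本 -> 实体集合(假设的抽取函数)。"""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for entity in extract_entities(text):
            index[entity].add(doc_id)
    return index

def entity_retrieve(question, index, extract_entities, top_k=5):
    """不计算问句与文档的向量相似度,而按问题中显著实体的命中数排序候选文档。"""
    scores = defaultdict(int)
    for entity in extract_entities(question):
        for doc_id in index.get(entity, ()):
            scores[doc_id] += 1
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```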

[NLP-43] LLM economicus? Mapping the Behavioral Biases of LLMs via Utility Theory
[NLP-43] 经济人LLM?通过效用理论绘制LLM的行为偏差

链接: https://arxiv.org/abs/2408.02784
作者: Jillian Ross,Yoon Kim,Andrew W. Lo
关键词-EN: homo economicus, economic, LLMs, biases, economic behavior
关键词-ZN: 经济人、经济、法学硕士、偏见、经济行为
类目: Computation and Language (cs.CL)
备注: Accepted to COLM 2024

点击查看摘要

Abstract:Humans are not homo economicus (i.e., rational economic beings). As humans, we exhibit systematic behavioral biases such as loss aversion, anchoring, framing, etc., which lead us to make suboptimal economic decisions. Insofar as such biases may be embedded in text data on which large language models (LLMs) are trained, to what extent are LLMs prone to the same behavioral biases? Understanding these biases in LLMs is crucial for deploying LLMs to support human decision-making. We propose utility theory-a paradigm at the core of modern economic theory-as an approach to evaluate the economic biases of LLMs. Utility theory enables the quantification and comparison of economic behavior against benchmarks such as perfect rationality or human behavior. To demonstrate our approach, we quantify and compare the economic behavior of a variety of open- and closed-source LLMs. We find that the economic behavior of current LLMs is neither entirely human-like nor entirely economicus-like. We also find that most current LLMs struggle to maintain consistent economic behavior across settings. Finally, we illustrate how our approach can measure the effect of interventions such as prompting on economic biases.
摘要:人不是经济人(即理性的经济存在)。作为人类,我们表现出系统性的行为偏差,如损失厌恶、锚定、框架效应等,这导致我们做出次优的经济决策。既然这种偏差可能嵌入在训练大型语言模型(LLM)的文本数据中,那么LLM在多大程度上也容易表现出相同的行为偏差?了解LLM中的这些偏差对于部署LLM以支持人类决策至关重要。我们提出以效用理论–现代经济理论的核心范式–作为评估LLM经济偏差的一种方法。效用理论能够量化经济行为,并将其与完全理性或人类行为等基准进行比较。为了演示我们的方法,我们量化并比较了各种开源和闭源LLM的经济行为。我们发现,当前LLM的经济行为既不完全像人类,也不完全像经济人。我们还发现,大多数当前的LLM难以在不同情境下保持一致的经济行为。最后,我们说明了我们的方法如何衡量提示(prompting)等干预措施对经济偏差的影响。

[NLP-44] Dynamic Language Group-Based MoE: Enhancing Efficiency and Flexibility for Code-Switching Speech Recognition
[NLP-44] 基于动态语言组的MoE:提高语码转换语音识别的效率和灵活性

链接: https://arxiv.org/abs/2407.18581
作者: Hukai Huang,Shenghui Lu,Yahui Shan,He Qu,Wenhao Guan,Qingyang Hong,Lin Li
关键词-EN: approach is ideally, multilingual and code-switching, challenges due, multi-expert architecture, ideally suited
关键词-ZN: 理想的方法是多语言和代码切换,面临挑战,多专家架构,非常适合
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The Mixture of Experts (MoE) approach is ideally suited for tackling multilingual and code-switching (CS) challenges due to its multi-expert architecture. This work introduces the DLG-MoE, which is optimized for bilingual and CS scenarios. Our novel Dynamic Language Group-based MoE layer features a language router with shared weights for explicit language modeling, while independent unsupervised routers within the language group handle attributes beyond language. This structure not only enhances expert extension capabilities but also supports dynamic top-k training, allowing for flexible inference across various top-k values and improving overall performance. The model requires no pre-training and supports streaming recognition, achieving state-of-the-art (SOTA) results with unmatched flexibility compared to other methods. The Code will be released.
摘要:混合专家(MoE)方法因其多专家架构而非常适合应对多语言和语码转换(CS)挑战。这项工作介绍了DLG-MoE,它针对双语和CS场景进行了优化。我们新颖的基于动态语言组的MoE层具有一个语言路由器,该路由器使用共享权重进行显式语言建模,而语言组内的独立无监督路由器处理语言以外的属性。这种结构不仅增强了专家扩展能力,还支持动态top-k训练,允许在各种top-k值下灵活推理并提高整体性能。该模型不需要预训练,并支持流式识别,与其他方法相比,以无与伦比的灵活性实现最先进的(SOTA)结果。代码将会发布。

[NLP-45] DrawTalking: Building Interactive Worlds by Sketching and Speaking ATC
[NLP-45] DrawTalking:通过素描和口语构建互动世界

链接: https://arxiv.org/abs/2401.05631
作者: Karl Toby Rosenberg,Rubaiat Habib Kazi,Li-Yi Wei,Haijun Xia,Ken Perlin
关键词-EN: controlling interactive worlds, introduce DrawTalking, telling stories, approach to building, building and controlling
关键词-ZN: 控制互动世界,引入DrawTalking、讲故事、构建方法、构建和控制
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Emerging Technologies (cs.ET); Graphics (cs.GR)
备注: 25 pages, 27 figures; Matching version accepted at UIST 2024

点击查看摘要

Abstract:We introduce DrawTalking, an approach to building and controlling interactive worlds by sketching and speaking while telling stories. It emphasizes user control and flexibility, and gives programming-like capability without requiring code. An early open-ended study with our prototype shows that the mechanics resonate and are applicable to many creative-exploratory use cases, with the potential to inspire and inform research in future natural interfaces for creative exploration and authoring.
摘要:我们介绍了DrawTalking,这是一种通过在讲故事时绘制草图和说话来构建和控制交互世界的方法。它强调用户控制和灵活性,并在不需要代码的情况下提供类似编程的能力。对我们原型进行的早期开放式研究表明,这些机制具有共鸣,适用于许多创造性探索用例,有可能激励和指导未来自然界面的研究,以进行创造性探索和创作。

[NLP-46] VisionUnite: A Vision-Language Foundation Model for Ophthalmology Enhanced with Clinical Knowledge
[NLP-46] VisionUnite:通过临床知识增强的眼科视觉语言基础模型

链接: https://arxiv.org/abs/2408.02865
作者: Zihan Li,Diping Song,Zefeng Yang,Deming Wang,Fei Li,Xiulan Zhang,Paul E. Kinahan,Yu Qiao
关键词-EN: improved diagnostic methods, advanced equipment, developed regions, regions with limited, limited access
关键词-ZN: 改进的诊断方法、先进的设备、发达的地区、准入有限的地区
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The need for improved diagnostic methods in ophthalmology is acute, especially in the less developed regions with limited access to specialists and advanced equipment. Therefore, we introduce VisionUnite, a novel vision-language foundation model for ophthalmology enhanced with clinical knowledge. VisionUnite has been pretrained on an extensive dataset comprising 1.24 million image-text pairs, and further refined using our proposed MMFundus dataset, which includes 296,379 high-quality fundus image-text pairs and 889,137 simulated doctor-patient dialogue instances. Our experiments indicate that VisionUnite outperforms existing generative foundation models such as GPT-4V and Gemini Pro. It also demonstrates diagnostic capabilities comparable to junior ophthalmologists. VisionUnite performs well in various clinical scenarios including open-ended multi-disease diagnosis, clinical explanation, and patient interaction, making it a highly versatile tool for initial ophthalmic disease screening. VisionUnite can also serve as an educational aid for junior ophthalmologists, accelerating their acquisition of knowledge regarding both common and rare ophthalmic conditions. VisionUnite represents a significant advancement in ophthalmology, with broad implications for diagnostics, medical education, and understanding of disease mechanisms.
摘要:眼科对改进诊断方法的需求非常迫切,特别是在缺乏专家和先进设备的欠发达地区。因此,我们引入了一种新的视觉-语言基础模型VisionUnite,用于增强临床知识的眼科。VisionUnite已经在一个包含124万个图像-文本对的广泛数据集上进行了预培训,并使用我们建议的MMFundus数据集进行了进一步优化,其中包括296,379个高质量眼底图像-文本对和889,137个模拟医生-患者对话实例。我们的实验表明,VisionUnite的性能优于现有的GPT-4V和Gemini Pro等生成基础模型。它还展示了可与初级眼科医生媲美的诊断能力。VisionUnite在各种临床场景中表现良好,包括开放式多疾病诊断、临床解释和患者互动,使其成为一种高度通用的初始眼科疾病筛查工具。VisionUnite还可以作为初级眼科医生的教育辅助工具,加速他们获得有关常见和罕见眼科疾病的知识。VisionUnite代表了眼科学的重大进步,对诊断学、医学教育和对疾病机制的理解具有广泛的意义。

人工智能

[AI-0] LLaVA-OneVision: Easy Visual Task Transfer

链接: https://arxiv.org/abs/2408.03326
作者: Bo Li,Yuanhan Zhang,Dong Guo,Renrui Zhang,Feng Li,Hao Zhang,Kaichen Zhang,Yanwei Li,Ziwei Liu,Chunyuan Li
关键词-EN: LLaVA-NeXT blog series, open large multimodal, large multimodal models, developed by consolidating, insights into data
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: Project Homepage: this https URL

点击查看摘要

Abstract:We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations in the LLaVA-NeXT blog series. Our experimental results demonstrate that LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video scenarios. Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities/scenarios, yielding new emerging capabilities. In particular, strong video understanding and cross-scenario capabilities are demonstrated through task transfer from images to videos.

[AI-1] Training LLMs to Recognize Hedges in Spontaneous Narratives

链接: https://arxiv.org/abs/2408.03319
作者: Amie J. Paige,Adil Soubki,John Murzaku,Owen Rambow,Susan E. Brennan
关键词-EN: soften critical feedback, mark utterances, lack of commitment, attribute responsibility, invite input
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Amie Paige, Adil Soubki, and John Murzaku contributed equally to this study

点击查看摘要

Abstract:Hedges allow speakers to mark utterances as provisional, whether to signal non-prototypicality or “fuzziness”, to indicate a lack of commitment to an utterance, to attribute responsibility for a statement to someone else, to invite input from a partner, or to soften critical feedback in the service of face-management needs. Here we focus on hedges in an experimentally parameterized corpus of 63 Roadrunner cartoon narratives spontaneously produced from memory by 21 speakers for co-present addressees, transcribed to text (Galati and Brennan, 2010). We created a gold standard of hedges annotated by human coders (the Roadrunner-Hedge corpus) and compared three LLM-based approaches for hedge detection: fine-tuning BERT, and zero and few-shot prompting with GPT-4o and LLaMA-3. The best-performing approach was a fine-tuned BERT model, followed by few-shot GPT-4o. After an error analysis on the top performing approaches, we used an LLM-in-the-Loop approach to improve the gold standard coding, as well as to highlight cases in which hedges are ambiguous in linguistically interesting ways that will guide future research. This is the first step in our research program to train LLMs to interpret and generate collateral signals appropriately and meaningfully in conversation.

[AI-2] Fusing Forces: Deep-Human-Guided Refinement of Segmentation Masks ICPR2024

链接: https://arxiv.org/abs/2408.03304
作者: Rafael Sterzinger,Christian Stippel,Robert Sablatnig
关键词-EN: elaborate figurative illustrations, figurative illustrations featured, characterized by elaborate, Etruscan mirrors constitute, constitute a significant
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: 16 pages, accepted at ICPR2024

点击查看摘要

Abstract:Etruscan mirrors constitute a significant category in Etruscan art, characterized by elaborate figurative illustrations featured on their backside. A laborious and costly aspect of their analysis and documentation is the task of manually tracing these illustrations. In previous work, a methodology has been proposed to automate this process, involving photometric-stereo scanning in combination with deep neural networks. While achieving quantitative performance akin to an expert annotator, some results still lack qualitative precision and, thus, require annotators for inspection and potential correction, maintaining resource intensity. In response, we propose a deep neural network trained to interactively refine existing annotations based on human guidance. Our human-in-the-loop approach streamlines annotation, achieving equal quality with up to 75% less manual input required. Moreover, during the refinement process, the relative improvement of our methodology over pure manual labeling reaches peak values of up to 26%, attaining drastically better quality quicker. By being tailored to the complex task of segmenting intricate lines, specifically distinguishing it from previous methods, our approach offers drastic improvements in efficacy, transferable to a broad spectrum of applications beyond Etruscan mirrors.

[AI-3] Understanding How Blind Users Handle Object Recognition Errors: Strategies and Challenges

链接: https://arxiv.org/abs/2408.03303
作者: Jonggi Hong,Hernisa Kacorri
关键词-EN: Object recognition, object recognition systems, recognition technologies hold, hold the potential, potential to support
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Object recognition technologies hold the potential to support blind and low-vision people in navigating the world around them. However, the gap between benchmark performances and practical usability remains a significant challenge. This paper presents a study aimed at understanding blind users’ interaction with object recognition systems for identifying and avoiding errors. Leveraging a pre-existing object recognition system, URCam, fine-tuned for our experiment, we conducted a user study involving 12 blind and low-vision participants. Through in-depth interviews and hands-on error identification tasks, we gained insights into users’ experiences, challenges, and strategies for identifying errors in camera-based assistive technologies and object recognition systems. During interviews, many participants preferred independent error review, while expressing apprehension toward misrecognitions. In the error identification task, participants varied viewpoints, backgrounds, and object sizes in their images to avoid and overcome errors. Even after repeating the task, participants identified only half of the errors, and the proportion of errors identified did not significantly differ from their first attempts. Based on these insights, we offer implications for designing accessible interfaces tailored to the needs of blind and low-vision users in identifying object recognition errors.

[AI-4] KaPO: Knowledge-aware Preference Optimization for Controllable Knowledge Selection in Retrieval-Augmented Language Models

链接: https://arxiv.org/abs/2408.03297
作者: Ruizhe Zhang,Yongxin Xu,Yuzhen Xiao,Runchuan Zhu,Xinke Jiang,Xu Chu,Junfeng Zhao,Yasha Wang
关键词-EN: Retrieval-Augmented Generation, large language models, integrating external knowledge, encounter when dealing, knowledge-intensive tasks
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:By integrating external knowledge, Retrieval-Augmented Generation (RAG) has become an effective strategy for mitigating the hallucination problems that large language models (LLMs) encounter when dealing with knowledge-intensive tasks. However, in the process of integrating external non-parametric supporting evidence with internal parametric knowledge, inevitable knowledge conflicts may arise, leading to confusion in the model’s responses. To enhance the knowledge selection of LLMs in various contexts, some research has focused on refining their behavior patterns through instruction-tuning. Nonetheless, due to the absence of explicit negative signals and comparative objectives, models fine-tuned in this manner may still exhibit undesirable behaviors in the intricate and realistic retrieval scenarios. To this end, we propose a Knowledge-aware Preference Optimization, dubbed KaPO, aimed at achieving controllable knowledge selection in real retrieval scenarios. Concretely, we explore and simulate error types across diverse context combinations and learn how to avoid these negative signals through preference optimization methods. Simultaneously, by adjusting the balance between response length and the proportion of preference data representing different behavior patterns, we enhance the adherence capabilities and noise robustness of LLMs in a balanced manner. Experimental results show that KaPO outperforms previous methods for handling knowledge conflicts by over 37%, while also exhibiting robust generalization across various out-of-distribution datasets.

[AI-5] Static IR Drop Prediction with Attention U-Net and Saliency-Based Explainability

链接: https://arxiv.org/abs/2408.03292
作者: Lizi Zhang,Azadeh Davoodi
关键词-EN: significant recent progress, translation task, significant recent, recent progress, progress to reduce
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:There has been significant recent progress to reduce the computational effort of static IR drop analysis using neural networks, and modeling as an image-to-image translation task. A crucial issue is the lack of sufficient data from real industry designs to train these networks. Additionally, there is no methodology to explain a high-drop pixel in a predicted IR drop image to its specific root-causes. In this work, we first propose a U-Net neural network model with attention gates which is specifically tailored to achieve fast and accurate image-based static IR drop prediction. Attention gates allow selective emphasis on relevant parts of the input data without supervision which is desired because of the often sparse nature of the IR drop map. We propose a two-phase training process which utilizes a mix of artificially-generated data and a limited number of points from real designs. The results are, on-average, 18% (53%) better in MAE and 14% (113%) in F1 score compared to the winner of the ICCAD 2023 contest (and U-Net only) when tested on real designs. Second, we propose a fast method using saliency maps which can explain a predicted IR drop in terms of specific input pixels contributing the most to a drop. In our experiments, we show the number of high IR drop pixels can be reduced on-average by 18% by mimicking upsize of a tiny portion of PDN’s resistive edges.

[AI-6] StructEval: Deepen and Broaden Large Language Model Assessment via Structured Evaluation ACL2024

链接: https://arxiv.org/abs/2408.03281
作者: Boxi Cao,Mengjie Ren,Hongyu Lin,Xianpei Han,Feng Zhang,Junfeng Zhan,Le Sun
关键词-EN: large language models, atomic test objective, development of large, large language, test objective
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: ACL 2024;Benchmark at this https URL at this https URL

点击查看摘要

Abstract:Evaluation is the baton for the development of large language models. Current evaluations typically employ a single-item assessment paradigm for each atomic test objective, which struggles to discern whether a model genuinely possesses the required capabilities or merely memorizes/guesses the answers to specific questions. To this end, we propose a novel evaluation framework referred to as StructEval. Starting from an atomic test objective, StructEval deepens and broadens the evaluation by conducting a structured assessment across multiple cognitive levels and critical concepts, and therefore offers a comprehensive, robust and consistent evaluation for LLMs. Experiments on three widely-used benchmarks demonstrate that StructEval serves as a reliable tool for resisting the risk of data contamination and reducing the interference of potential biases, thereby providing more reliable and consistent conclusions regarding model capabilities. Our framework also sheds light on the design of future principled and trustworthy LLM evaluation protocols.

[AI-7] Compress and Compare: Interactively Evaluating Efficiency and Behavior Across ML Model Compression Experiments

链接: https://arxiv.org/abs/2408.03274
作者: Angie Boggust,Venkatesh Sivaraman,Yannick Assogba,Donghao Ren,Dominik Moritz,Fred Hohman
关键词-EN: deploy machine learning, Compress and Compare, learning models on-device, machine learning models, compression
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted to VIS 2024

点击查看摘要

Abstract:To deploy machine learning models on-device, practitioners use compression algorithms to shrink and speed up models while maintaining their high-quality output. A critical aspect of compression in practice is model comparison, including tracking many compression experiments, identifying subtle changes in model behavior, and negotiating complex accuracy-efficiency trade-offs. However, existing compression tools poorly support comparison, leading to tedious and, sometimes, incomplete analyses spread across disjoint tools. To support real-world comparative workflows, we develop an interactive visual system called Compress and Compare. Within a single interface, Compress and Compare surfaces promising compression strategies by visualizing provenance relationships between compressed models and reveals compression-induced behavior changes by comparing models’ predictions, weights, and activations. We demonstrate how Compress and Compare supports common compression analysis tasks through two case studies, debugging failed compression on generative language models and identifying compression artifacts in image classification models. We further evaluate Compress and Compare in a user study with eight compression experts, illustrating its potential to provide structure to compression workflows, help practitioners build intuition about compression, and encourage thorough analysis of compression’s effect on model behavior. Through these evaluations, we identify compression-specific challenges that future visual analytics tools should consider and Compress and Compare visualizations that may generalize to broader model comparison tasks.

[AI-8] Unveiling Factual Recall Behaviors of Large Language Models through Knowledge Neurons

链接: https://arxiv.org/abs/2408.03247
作者: Yifei Wang,Yuheng Chen,Wanting Wen,Yu Sheng,Linjing Li,Daniel Dajun Zeng
关键词-EN: Large Language Models, Language Models, Large Language, investigate whether Large, Models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In this paper, we investigate whether Large Language Models (LLMs) actively recall or retrieve their internal repositories of factual knowledge when faced with reasoning tasks. Through an analysis of LLMs’ internal factual recall at each reasoning step via Knowledge Neurons, we reveal that LLMs fail to harness the critical factual associations under certain circumstances. Instead, they tend to opt for alternative, shortcut-like pathways to answer reasoning questions. By manually manipulating the recall process of parametric knowledge in LLMs, we demonstrate that enhancing this recall process directly improves reasoning performance whereas suppressing it leads to notable degradation. Furthermore, we assess the effect of Chain-of-Thought (CoT) prompting, a powerful technique for addressing complex reasoning tasks. Our findings indicate that CoT can intensify the recall of factual knowledge by encouraging LLMs to engage in orderly and reliable reasoning. Furthermore, we explored how contextual conflicts affect the retrieval of facts during the reasoning process to gain a comprehensive understanding of the factual recall behaviors of LLMs. Code and data will be available soon.

[AI-9] Personalizing Federated Instrument Segmentation with Visual Trait Priors in Robotic Surgery

链接: https://arxiv.org/abs/2408.03208
作者: Jialang Xu,Jiacheng Wang,Lequan Yu,Danail Stoyanov,Yueming Jin,Evangelos B. Mazomenos
关键词-EN: Personalized federated learning, surgical instrument segmentation, federated learning, promising approach, Existing PFL methods
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO); Medical Physics (physics.med-ph)
*备注: 9 pages, 3 figures, under review

点击查看摘要

Abstract:Personalized federated learning (PFL) for surgical instrument segmentation (SIS) is a promising approach. It enables multiple clinical sites to collaboratively train a series of models in privacy, with each model tailored to the individual distribution of each site. Existing PFL methods rarely consider the personalization of multi-headed self-attention, and do not account for appearance diversity and instrument shape similarity, both inherent in surgical scenes. We thus propose PFedSIS, a novel PFL method with visual trait priors for SIS, incorporating global-personalized disentanglement (GPD), appearance-regulation personalized enhancement (APE), and shape-similarity global enhancement (SGE), to boost SIS performance in each site. GPD represents the first attempt at head-wise assignment for multi-headed self-attention personalization. To preserve the unique appearance representation of each site and gradually leverage the inter-site difference, APE introduces appearance regulation and provides customized layer-wise aggregation solutions via hypernetworks for each site’s personalized parameters. The mutual shape information of instruments is maintained and shared via SGE, which enhances the cross-style shape consistency on the image level and computes the shape-similarity contribution of each site on the prediction level for updating the global parameters. PFedSIS outperforms state-of-the-art methods with +1.51% Dice, +2.11% IoU, -2.79 ASSD, -15.55 HD95 performance gains. The corresponding code and models will be released at this https URL.

[AI-10] Adversarial Safety-Critical Scenario Generation using Naturalistic Human Driving Priors

链接: https://arxiv.org/abs/2408.03200
作者: Kunkun Hao,Yonggang Luo,Wen Cui,Yuqiao Bai,Jucheng Yang,Songyang Yan,Yuxi Pan,Zijiang Yang
关键词-EN: Evaluating the decision-making, test scenarios play, crucial role, decision-making system, system is indispensable
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: Published in IEEE Transactions on Intelligent Vehicles, 2023

点击查看摘要

Abstract:Evaluating the decision-making system is indispensable in developing autonomous vehicles, while realistic and challenging safety-critical test scenarios play a crucial role. Obtaining these scenarios is non-trivial, thanks to the long-tailed distribution, sparsity, and rarity in real-world data sets. To tackle this problem, in this paper, we introduce a natural adversarial scenario generation solution using naturalistic human driving priors and reinforcement learning techniques. By doing this, we can obtain large-scale test scenarios that are both diverse and realistic. Specifically, we build a simulation environment that mimics natural traffic interaction scenarios. Informed by this environment, we implement a two-stage procedure. The first stage incorporates conventional rule-based models, e.g., IDM~(Intelligent Driver Model) and MOBIL~(Minimizing Overall Braking Induced by Lane changes) model, to coarsely and discretely capture and calibrate key control parameters from the real-world dataset. Next, we leverage GAIL~(Generative Adversarial Imitation Learning) to represent driver behaviors continuously. The derived GAIL can be further used to design a PPO~(Proximal Policy Optimization)-based actor-critic network framework to fine-tune the reward function, and then optimizes our natural adversarial scenario generation solution. Extensive experiments have been conducted in the NGSIM dataset including the trajectory of 3,000 vehicles. Essential traffic parameters were measured in comparison with the baseline model, e.g., the collision rate, accelerations, steering, and the number of lane changes. Our findings demonstrate that the proposed model can generate realistic safety-critical test scenarios covering both naturalness and adversariality, which can be a cornerstone for the development of autonomous vehicles.

[AI-11] Training on the Fly: On-device Self-supervised Learning aboard Nano-drones within 20 mW

链接: https://arxiv.org/abs/2408.03168
作者: Elia Cereda,Alessandro Giusti,Daniele Palossi
关键词-EN: Miniaturized cyber-physical systems, increasingly attractive technology, Miniaturized cyber-physical, cyber-physical systems, powered by tiny
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注: This paper has been accepted for publication in the IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems. Copyright 2024 IEEE

点击查看摘要

Abstract:Miniaturized cyber-physical systems (CPSes) powered by tiny machine learning (TinyML), such as nano-drones, are becoming an increasingly attractive technology. Their small form factor (i.e., ~10cm diameter) ensures vast applicability, ranging from the exploration of narrow disaster scenarios to safe human-robot interaction. Simple electronics make these CPSes inexpensive, but strongly limit the computational, memory, and sensing resources available on board. In real-world applications, these limitations are further exacerbated by domain shift. This fundamental machine learning problem implies that model perception performance drops when moving from the training domain to a different deployment one. To cope with and mitigate this general problem, we present a novel on-device fine-tuning approach that relies only on the limited ultra-low power resources available aboard nano-drones. Then, to overcome the lack of ground-truth training labels aboard our CPS, we also employ a self-supervised method based on ego-motion consistency. Albeit our work builds on top of a specific real-world vision-based human pose estimation task, it is widely applicable for many embedded TinyML use cases. Our 512-image on-device training procedure is fully deployed aboard an ultra-low power GWT GAP9 System-on-Chip and requires only 1MB of memory while consuming as low as 19mW or running in just 510ms (at 38mW). Finally, we demonstrate the benefits of our on-device learning approach by field-testing our closed-loop CPS, showing a reduction in horizontal position error of up to 26% vs. a non-fine-tuned state-of-the-art baseline. In the most challenging never-seen-before environment, our on-device learning procedure makes the difference between succeeding or failing the mission.

[AI-12] Dilated Convolution with Learnable Spacings makes visual models more aligned with humans: a Grad-CAM study IJCAI2024

链接: https://arxiv.org/abs/2408.03164
作者: Rabih Chamas,Ismail Khalfaoui-Hassani,Timothee Masquelier
关键词-EN: Learnable Spacing, advanced convolution method, recent advanced convolution, Dilated Convolution, receptive fields
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted at The Trustworthy AI Workshop, IJCAI 2024

点击查看摘要

Abstract:Dilated Convolution with Learnable Spacing (DCLS) is a recent advanced convolution method that allows enlarging the receptive fields (RF) without increasing the number of parameters, like the dilated convolution, yet without imposing a regular grid. DCLS has been shown to outperform the standard and dilated convolutions on several computer vision benchmarks. Here, we show that, in addition, DCLS increases the models’ interpretability, defined as the alignment with human visual strategies. To quantify it, we use the Spearman correlation between the models’ GradCAM heatmaps and the ClickMe dataset heatmaps, which reflect human visual attention. We took eight reference models - ResNet50, ConvNeXt (T, S and B), CAFormer, ConvFormer, and FastViT (sa 24 and 36) - and drop-in replaced the standard convolution layers with DCLS ones. This improved the interpretability score in seven of them. Moreover, we observed that Grad-CAM generated random heatmaps for two models in our study: CAFormer and ConvFormer models, leading to low interpretability scores. We addressed this issue by introducing Threshold-Grad-CAM, a modification built on top of Grad-CAM that enhanced interpretability across nearly all models. The code and checkpoints to reproduce this study are available at: this https URL.

[AI-13] User-in-the-loop Evaluation of Multimodal LLMs for Activity Assistance

链接: https://arxiv.org/abs/2408.03160
作者: Mrinal Verghese,Brian Chen,Hamid Eghbalzadeh,Tushar Nagarajan,Ruta Desai
关键词-EN: Large Language Models, facilitate vision-powered assistants, powered by Large, Large Language, Conditioned Language Models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 9 pages, 4 figures

点击查看摘要

Abstract:Our research investigates the capability of modern multimodal reasoning models, powered by Large Language Models (LLMs), to facilitate vision-powered assistants for multi-step daily activities. Such assistants must be able to 1) encode relevant visual history from the assistant’s sensors, e.g., camera, 2) forecast future actions for accomplishing the activity, and 3) replan based on the user in the loop. To evaluate the first two capabilities, grounding visual history and forecasting in short and long horizons, we conduct benchmarking of two prominent classes of multimodal LLM approaches – Socratic Models and Vision Conditioned Language Models (VCLMs) on video-based action anticipation tasks using offline datasets. These offline benchmarks, however, do not allow us to close the loop with the user, which is essential to evaluate the replanning capabilities and measure successful activity completion in assistive scenarios. To that end, we conduct a first-of-its-kind user study, with 18 participants performing 3 different multi-step cooking activities while wearing an egocentric observation device called Aria and following assistance from multimodal LLMs. We find that the Socratic approach outperforms VCLMs in both offline and online settings. We further highlight how grounding long visual history, common in activity assistance, remains challenging in current models, especially for VCLMs, and demonstrate that offline metrics do not indicate online performance.

[AI-14] Optimizing Disease Prediction with Artificial Intelligence Driven Feature Selection and Attention Networks

链接: https://arxiv.org/abs/2408.03151
作者: D. Dhinakaran,S. Edwin Raja,M. Thiyagarajan,J. Jeno Jasmine,P. Raghavan
关键词-EN: Electronic Health Records, repositories of Electronic, machine learning methodologies, ignited innovative strategies, Stabilized Energy Valley
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 16 Pages, 4 Figures

点击查看摘要

Abstract:The rapid integration of machine learning methodologies in healthcare has ignited innovative strategies for disease prediction, particularly with the vast repositories of Electronic Health Records (EHR) data. This article delves into the realm of multi-disease prediction, presenting a comprehensive study that introduces a pioneering ensemble feature selection model. This model, designed to optimize learning systems, combines statistical, deep, and optimally selected features through the innovative Stabilized Energy Valley Optimization with Enhanced Bounds (SEV-EB) algorithm. The objective is to achieve unparalleled accuracy and stability in predicting various disorders. This work proposes an advanced ensemble model that synergistically integrates statistical, deep, and optimally selected features. This combination aims to enhance the predictive power of the model by capturing diverse aspects of the health data. At the heart of the proposed model lies the SEV-EB algorithm, a novel approach to optimal feature selection. The algorithm introduces enhanced bounds and stabilization techniques, contributing to the robustness and accuracy of the overall prediction model. To further elevate the predictive capabilities, an HSC-AttentionNet is introduced. This network architecture combines deep temporal convolution capabilities with LSTM, allowing the model to capture both short-term patterns and long-term dependencies in health data. Rigorous evaluations showcase the remarkable performance of the proposed model. Achieving a 95% accuracy and 94% F1-score in predicting various disorders, the model surpasses traditional methods, signifying a significant advancement in disease prediction accuracy. The implications of this research extend beyond the confines of academia.

[AI-15] COMMENTATOR: A Code-mixed Multilingual Text Annotation Framework

链接: https://arxiv.org/abs/2408.03125
作者: Rajvee Sheth,Shubh Nisar,Heenaben Prajapati,Himanshu Beniwal,Mayank Singh
关键词-EN: NLP community increasingly, community increasingly addresses, increasingly addresses challenges, multilingual datasets efficiently, NLP community
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:As the NLP community increasingly addresses challenges associated with multilingualism, robust annotation tools are essential to handle multilingual datasets efficiently. In this paper, we introduce a code-mixed multilingual text annotation framework, COMMENTATOR, specifically designed for annotating code-mixed text. The tool demonstrates its effectiveness in token-level and sentence-level language annotation tasks for Hinglish text. We perform robust qualitative human-based evaluations to showcase that COMMENTATOR led to 5x faster annotations than the best baseline. Our code is publicly available at this https URL. The demonstration video is available at this https URL.

[AI-16] Evaluating the Translation Performance of Large Language Models Based on Euas-20

链接: https://arxiv.org/abs/2408.03119
作者: Yan Huang,Wei Liu
关键词-EN: BERT and GPT, large language models, deep learning technology, natural language processing, achieved breakthrough results
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 15 pages, 8 figures

点击查看摘要

Abstract:In recent years, with the rapid development of deep learning technology, large language models (LLMs) such as BERT and GPT have achieved breakthrough results in natural language processing tasks. Machine translation (MT), as one of the core tasks of natural language processing, has also benefited from the development of large language models and achieved a qualitative leap. Despite the significant progress in translation performance achieved by large language models, machine translation still faces many challenges. Therefore, in this paper, we construct the dataset Euas-20 to help researchers and developers evaluate the translation performance of large language models, their translation ability across different languages, and the effect of pre-training data on their translation ability.

[AI-17] Learning Provably Robust Policies in Uncertain Parametric Environments

链接: https://arxiv.org/abs/2408.03093
作者: Yannik Schnitzer,Alessandro Abate,David Parker
关键词-EN: unknown distribution, learning MDP policies, present a data-driven, transition probabilities, probabilities are defined
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:We present a data-driven approach for learning MDP policies that are robust across stochastic environments whose transition probabilities are defined by parameters with an unknown distribution. We produce probably approximately correct (PAC) guarantees for the performance of these learned policies in a new, unseen environment over the unknown distribution. Our approach is based on finite samples of the MDP environments, for each of which we build an approximation of the model as an interval MDP, by exploring a set of generated trajectories. We use the built approximations to synthesise a single policy that performs well (meets given requirements) across the sampled environments, and furthermore bound its risk (of not meeting the given requirements) when deployed in an unseen environment. Our procedure offers a trade-off between the guaranteed performance of the learned policy and the risk of not meeting the guarantee in an unseen environment. Our approach exploits knowledge of the environment’s state space and graph structure, and we show how additional knowledge of its parametric structure can be leveraged to optimize learning and to obtain tighter guarantees from fewer samples. We evaluate our approach on a diverse range of established benchmarks, demonstrating that we can generate highly performing and robust policies, along with guarantees that tightly quantify their performance and the associated risk.
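The abstract leaves the construction of the interval MDP unspecified; the sketch below shows one simple way such interval transition bounds could be formed from transition counts gathered in a few sampled environments (element-wise min/max over smoothed empirical estimates). The function names and the smoothing choice are illustrative assumptions, not the authors' procedure.

```python
import numpy as np

def interval_mdp_from_samples(counts_per_env):
    """Build lower/upper transition bounds from per-environment transition counts.

    counts_per_env: list of arrays of shape (S, A, S), each holding transition
    counts gathered from trajectories in one sampled environment.
    Returns (P_low, P_high): element-wise bounds over the empirical estimates.
    """
    estimates = []
    for counts in counts_per_env:
        totals = counts.sum(axis=-1, keepdims=True)
        # Laplace smoothing keeps unvisited (s, a) pairs well-defined
        est = (counts + 1.0) / (totals + counts.shape[-1])
        estimates.append(est)
    stacked = np.stack(estimates)          # (num_envs, S, A, S)
    return stacked.min(axis=0), stacked.max(axis=0)

# Toy example: 3 sampled environments, 2 states, 1 action
rng = np.random.default_rng(0)
envs = [rng.integers(0, 20, size=(2, 1, 2)).astype(float) for _ in range(3)]
P_low, P_high = interval_mdp_from_samples(envs)
print(P_low.shape, P_high.shape)
```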

[AI-18] Enhancing Complex Causality Extraction via Improved Subtask Interaction and Knowledge Fusion NLPCC2024

链接: https://arxiv.org/abs/2408.03079
作者: Jinglong Gao,Chen Lu,Xiao Ding,Zhongyang Li,Ting Liu,Bing Qin
关键词-EN: Complex Causality Extraction, Event Causality Extraction, Causality Extraction, ECE, ECE simultaneously
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: NLPCC 2024 Oral

点击查看摘要

Abstract:Event Causality Extraction (ECE) aims at extracting causal event pairs from texts. Despite ChatGPT’s recent success, fine-tuning small models remains the best approach for the ECE task. However, existing fine-tuning based ECE methods cannot address all three key challenges in ECE simultaneously: 1) Complex Causality Extraction, where multiple causal-effect pairs occur within a single sentence; 2) Subtask Interaction, which involves modeling the mutual dependence between the two subtasks of ECE, i.e., extracting events and identifying the causal relationship between extracted events; and 3) Knowledge Fusion, which requires effectively fusing the knowledge in two modalities, i.e., the expressive pretrained language models and the structured knowledge graphs. In this paper, we propose a unified ECE framework (UniCE) to address all three issues in ECE simultaneously. Specifically, we design a subtask interaction mechanism to enable mutual interaction between the two ECE subtasks. Besides, we design a knowledge fusion mechanism to fuse knowledge in the two modalities. Furthermore, we employ separate decoders for each subtask to facilitate complex causality extraction. Experiments on three benchmark datasets demonstrate that our method achieves state-of-the-art performance and outperforms ChatGPT with a margin of at least 30% F1-score. More importantly, our model can also be used to effectively improve the ECE performance of ChatGPT via in-context learning.

[AI-19] BodySLAM: A Generalized Monocular Visual SLAM Framework for Surgical Applications

链接: https://arxiv.org/abs/2408.03078
作者: G. Manni(1 and 2),C. Lauretti(2),F. Prata(3),R. Papalia(3),L. Zollo(2),P. Soda(1) ((1) Research Unit of Computer Systems and Bioinformatics Department of Engineering Università Campus Bio-Medico di Roma, (2) Unit of Advanced Robotics and Human-Centred Technologies Department of Engineering Università Campus Bio-Medico di Roma, (3) Department of Urology Fondazione Policlinico Universitario Campus Bio-Medico)
关键词-EN: Depth Estimation Module, Monocular Depth Estimation, Pose Estimation Module, Endoscopic surgery relies, Monocular Pose Estimation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: 15 pages, 7 figures

点击查看摘要

Abstract:Endoscopic surgery relies on two-dimensional views, posing challenges for surgeons in depth perception and instrument manipulation. While Simultaneous Localization and Mapping (SLAM) has emerged as a promising solution to address these limitations, its implementation in endoscopic procedures presents significant challenges due to hardware limitations, such as the use of a monocular camera and the absence of odometry sensors. This study presents a robust deep learning-based SLAM approach that combines state-of-the-art and newly developed models. It consists of three main parts: the Monocular Pose Estimation Module that introduces a novel unsupervised method based on the CycleGAN architecture, the Monocular Depth Estimation Module that leverages the novel Zoe architecture, and the 3D Reconstruction Module which uses information from the previous models to create a coherent surgical map. The performance of the procedure was rigorously evaluated using three publicly available datasets (Hamlyn, EndoSLAM, and SCARED) and benchmarked against two state-of-the-art methods, EndoSFMLearner and EndoDepth. The integration of Zoe in the MDEM demonstrated superior performance compared to state-of-the-art depth estimation algorithms in endoscopy, whereas the novel approach in the MPEM exhibited competitive performance and the lowest inference time. The results showcase the robustness of our approach in laparoscopy, gastroscopy, and colonoscopy, three different scenarios in endoscopic surgery. The proposed SLAM approach has the potential to improve the accuracy and efficiency of endoscopic procedures by providing surgeons with enhanced depth perception and 3D reconstruction capabilities.

[AI-20] Solving QUBO on the Loihi 2 Neuromorphic Processor

链接: https://arxiv.org/abs/2408.03076
作者: Alessandro Pierro,Philipp Stratmann,Gabriel Andres Fonseca Guerra,Sumedh Risbud,Timothy Shea,Ashish Rao Mangalore,Andreas Wild
关键词-EN: Quadratic Unconstrained Binary, Unconstrained Binary Optimization, solving Quadratic Unconstrained, Binary Optimization problems, Quadratic Unconstrained
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Discrete Mathematics (cs.DM)
*备注: 12 pages, 3 figures. Shared first authorship: Alessandro Pierro, Philipp Stratmann, and Gabriel Andres Fonseca Guerra

点击查看摘要

Abstract:In this article, we describe an algorithm for solving Quadratic Unconstrained Binary Optimization problems on the Intel Loihi 2 neuromorphic processor. The solver is based on a hardware-aware fine-grained parallel simulated annealing algorithm developed for Intel’s neuromorphic research chip Loihi 2. Preliminary results show that our approach can generate feasible solutions in as little as 1 ms and is up to 37x more energy efficient than two baseline solvers running on a CPU. These advantages could be especially relevant for size-, weight-, and power-constrained edge computing applications.
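For readers who want a concrete reference point, a plain single-bit-flip simulated-annealing QUBO solver of the kind typically run as a CPU baseline can be sketched as follows. This is a generic textbook sketch, not the hardware-aware algorithm deployed on Loihi 2.

```python
import numpy as np

def simulated_annealing_qubo(Q, n_steps=5000, t_start=2.0, t_end=0.01, seed=0):
    """Minimize x^T Q x over binary x with single-bit-flip simulated annealing."""
    rng = np.random.default_rng(seed)
    n = Q.shape[0]
    x = rng.integers(0, 2, size=n)
    energy = x @ Q @ x
    for step in range(n_steps):
        t = t_start * (t_end / t_start) ** (step / n_steps)  # geometric cooling schedule
        i = rng.integers(n)
        x_new = x.copy()
        x_new[i] ^= 1                       # flip one bit
        e_new = x_new @ Q @ x_new
        # Always accept downhill moves; accept uphill moves with Boltzmann probability
        if e_new <= energy or rng.random() < np.exp((energy - e_new) / t):
            x, energy = x_new, e_new
    return x, energy

# Toy 8-variable QUBO with a random symmetric cost matrix
rng = np.random.default_rng(1)
A = rng.normal(size=(8, 8))
Q = (A + A.T) / 2
print(simulated_annealing_qubo(Q))
```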

[AI-21] OpenOmni: A Collaborative Open Source Tool for Building Future-Ready Multimodal Conversational Agents

链接: https://arxiv.org/abs/2408.03047
作者: Qiang Sun,Yuanyi Luo,Sirui Li,Wenxiao Zhang,Wei Liu
关键词-EN: Multimodal conversational agents, Multimodal conversational, conversational agents, agents are highly, highly desirable
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Multimodal conversational agents are highly desirable because they offer natural and human-like interaction. However, there is a lack of comprehensive end-to-end solutions to support collaborative development and benchmarking. While proprietary systems like GPT-4o and Gemini demonstrate impressive integration of audio, video, and text with response times of 200-250ms, challenges remain in balancing latency, accuracy, cost, and data privacy. To better understand and quantify these issues, we developed OpenOmni, an open-source, end-to-end pipeline benchmarking tool that integrates advanced technologies such as Speech-to-Text, Emotion Detection, Retrieval Augmented Generation, Large Language Models, along with the ability to integrate customized models. OpenOmni supports local and cloud deployment, ensuring data privacy and supporting latency and accuracy benchmarking. This flexible framework allows researchers to customize the pipeline, focusing on real bottlenecks and facilitating rapid proof-of-concept development. OpenOmni can significantly enhance applications like indoor assistance for visually impaired individuals, advancing human-computer interaction. Our demonstration video is available at this https URL, the demo is available via this https URL, and the code is available via this https URL.

[AI-22] Highly Efficient Self-Adaptive Reward Shaping for Reinforcement Learning

链接: https://arxiv.org/abs/2408.03029
作者: Haozhe Ma,Zhengding Luo,Thanh Vinh Vo,Kuankuan Sima,Tze-Yun Leong
关键词-EN: informative reward signals, Reward shaping addresses, addresses the challenge, reinforcement learning, learning by constructing
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Reward shaping addresses the challenge of sparse rewards in reinforcement learning by constructing denser and more informative reward signals. To achieve self-adaptive and highly efficient reward shaping, we propose a novel method that incorporates success rates derived from historical experiences into shaped rewards. Our approach utilizes success rates sampled from Beta distributions, which dynamically evolve from uncertain to reliable values as more data is collected. Initially, the self-adaptive success rates exhibit more randomness to encourage exploration. Over time, they become more certain to enhance exploitation, thus achieving a better balance between exploration and exploitation. We employ Kernel Density Estimation (KDE) combined with Random Fourier Features (RFF) to derive the Beta distributions, resulting in a computationally efficient implementation in high-dimensional continuous state spaces. This method provides a non-parametric and learning-free approach. The proposed method is evaluated on a wide range of continuous control tasks with sparse and delayed rewards, demonstrating significant improvements in sample efficiency and convergence stability compared to several baselines.
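A minimal sketch of the central idea, sampling a success rate from a Beta posterior over past outcomes and folding it into the shaped reward, is shown below. The exact shaping form is an assumption, and the KDE/RFF machinery the paper uses to generalize success rates across continuous states is omitted.

```python
import numpy as np

class BetaSuccessShaper:
    """Per-region Beta posteriors over success; sampled rates shape the sparse reward."""

    def __init__(self, n_regions, seed=0):
        self.alpha = np.ones(n_regions)   # prior successes + 1
        self.beta = np.ones(n_regions)    # prior failures + 1
        self.rng = np.random.default_rng(seed)

    def update(self, region, success):
        if success:
            self.alpha[region] += 1.0
        else:
            self.beta[region] += 1.0

    def shaped_reward(self, region, env_reward, bonus_scale=0.1):
        # Early on the Beta sample is noisy (exploration); with data it concentrates (exploitation).
        sampled_rate = self.rng.beta(self.alpha[region], self.beta[region])
        return env_reward + bonus_scale * sampled_rate

shaper = BetaSuccessShaper(n_regions=4)
shaper.update(region=2, success=True)
print(shaper.shaped_reward(region=2, env_reward=0.0))
```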

[AI-23] Integrating Controllable Motion Skills from Demonstrations

链接: https://arxiv.org/abs/2408.03018
作者: Honghao Liao,Zhiheng Li,Ziyu Meng,Ran Song,Yibin Li,Wei Zhang
关键词-EN: motion skills, versatile motion skills, motion, expanding applications, mastery of versatile
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The expanding applications of legged robots require their mastery of versatile motion skills. Correspondingly, researchers must address the challenge of integrating multiple diverse motion skills into controllers. While existing reinforcement learning (RL)-based approaches have achieved notable success in multi-skill integration for legged robots, these methods often require intricate reward engineering or are restricted to integrating a predefined set of motion skills constrained by specific task objectives, resulting in limited flexibility. In this work, we introduce a flexible multi-skill integration framework named Controllable Skills Integration (CSI). CSI enables the integration of a diverse set of motion skills with varying styles into a single policy without the need for complex reward tuning. Furthermore, in a hierarchical control manner, the trained low-level policy can be coupled with a high-level Natural Language Inference (NLI) module to enable preliminary language-directed skill control. Our experiments demonstrate that CSI can flexibly integrate a diverse array of motion skills more comprehensively and facilitate the transitions between different skills. Additionally, CSI exhibits good scalability as the number of motion skills to be integrated increases significantly.

[AI-24] NeurDB: On the Design and Implementation of an AI-powered Autonomous Database

链接: https://arxiv.org/abs/2408.03013
作者: Zhanhao Zhao,Shaofeng Cai,Haotian Gao,Hexiang Pan,Siqi Xiang,Naili Xing,Gang Chen,Beng Chin Ooi,Yanyan Shen,Yuncheng Wu,Meihui Zhang
关键词-EN: relieve end-user burdens, aiming to relieve, industry sectors, increasingly embracing, optimization and intelligent
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Databases are increasingly embracing AI to provide autonomous system optimization and intelligent in-database analytics, aiming to relieve end-user burdens across various industry sectors. Nonetheless, most existing approaches fail to account for the dynamic nature of databases, which renders them ineffective for real-world applications characterized by evolving data and workloads. This paper introduces NeurDB, an AI-powered autonomous database that deepens the fusion of AI and databases with adaptability to data and workload drift. NeurDB establishes a new in-database AI ecosystem that seamlessly integrates AI workflows within the database. This integration enables efficient and effective in-database AI analytics and fast-adaptive learned system components. Empirical evaluations demonstrate that NeurDB substantially outperforms existing solutions in managing AI analytics tasks, with the proposed learned components more effectively handling environmental dynamism than state-of-the-art approaches.

[AI-25] Cross-cultural analysis of pedestrian group behaviour influence on crossing decisions in interactions with autonomous vehicles ITSC2024

链接: https://arxiv.org/abs/2408.03003
作者: Sergio Martín Serrano,Óscar Méndez Blanco,Stewart Worrall,Miguel Ángel Sotelo,David Fernández-Llorca
关键词-EN: Understanding cultural backgrounds, diverse societal norms, varied cultural contexts, Understanding cultural, enhancing acceptance
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: Paper accepted at the 27th IEEE International Conference on Intelligent Transportation Systems (ITSC 2024)

点击查看摘要

Abstract:Understanding cultural backgrounds is crucial for the seamless integration of autonomous driving into daily life as it ensures that systems are attuned to diverse societal norms and behaviours, enhancing acceptance and safety in varied cultural contexts. In this work, we investigate the impact of co-located pedestrians on crossing behaviour, considering cultural and situational factors. To accomplish this, a full-scale virtual reality (VR) environment was created in the CARLA simulator, enabling the identical experiment to be replicated in both Spain and Australia. Participants (N=30) attempted to cross the road at an urban crosswalk alongside other pedestrians exhibiting conservative to more daring behaviours, while an autonomous vehicle (AV) approached with different driving styles. For the analysis of interactions, we utilized questionnaires and direct measures of the moment when participants entered the lane. Our findings indicate that pedestrians tend to cross the same traffic gap together, even though reckless behaviour by the group reduces confidence and makes the situation perceived as more complex. Australian participants were willing to take fewer risks than Spanish participants, adopting more cautious behaviour when it was uncertain whether the AV would yield.

[AI-26] LLMs as Probabilistic Minimally Adequate Teachers for DFA Learning

链接: https://arxiv.org/abs/2408.02999
作者: Lekai Chen,Ashutosh Trivedi,Alvaro Velasquez
关键词-EN: large language models, Minimally Adequate Teacher, probabilistic Minimally Adequate, language models, emergence of intelligence
类目: Formal Languages and Automata Theory (cs.FL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The emergence of intelligence in large language models (LLMs) has inspired investigations into their integration into automata learning. This paper introduces the probabilistic Minimally Adequate Teacher (pMAT) formulation, which leverages a probabilistic oracle that may randomly give persistent errors when answering membership queries for deterministic finite automata (DFA) learning. Given the tendency of LLMs to produce hallucinatory content, we have developed techniques to improve answer accuracy and ensure the correctness of the learned automata. We propose the Discrimination prompt as well as the Verification prompt and explore their advantages over common prompts. Additionally, we compare DFA learning performance between the TTT algorithm and common active learning algorithms. To address the exponential number of persistent errors, we implement a dynamic query cache refinement algorithm that identifies and corrects conflicting queries by combining the active and passive learning algorithms. The empirical results demonstrate the robustness and efficiency of our approach, providing a theoretical foundation for automata learning with LLMs in the loop.
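As a toy illustration of working with an unreliable membership oracle, the sketch below wraps a noisy oracle in a query cache so the DFA learner always sees stable answers; the majority vote used here is only a naive mitigation and is not the paper's dynamic query cache refinement or its prompt designs.

```python
import random
from collections import Counter

def majority_membership(query_oracle, word, n_votes=5):
    """Resolve a noisy boolean membership query by repeated sampling and majority vote."""
    votes = Counter(query_oracle(word) for _ in range(n_votes))
    return votes.most_common(1)[0][0]

class CachedNoisyTeacher:
    """Wraps a noisy membership oracle so the DFA learner sees stable, cached answers."""

    def __init__(self, query_oracle, n_votes=5):
        self.query_oracle = query_oracle
        self.n_votes = n_votes
        self.cache = {}

    def member(self, word):
        if word not in self.cache:
            self.cache[word] = majority_membership(self.query_oracle, word, self.n_votes)
        return self.cache[word]

# Toy oracle: accepts words with an even number of 'a', but answers wrongly 20% of the time
def noisy_oracle(word, p_err=0.2):
    truth = word.count("a") % 2 == 0
    return truth if random.random() > p_err else not truth

teacher = CachedNoisyTeacher(noisy_oracle)
print(teacher.member("abab"), teacher.member("aba"))
```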

[AI-27] ASR-enhanced Multimodal Representation Learning for Cross-Domain Product Retrieval

链接: https://arxiv.org/abs/2408.02978
作者: Ruixiang Zhao,Jian Jia,Yan Li,Xuehan Bai,Quan Chen,Han Li,Peng Jiang,Xirong Li
关键词-EN: live stream promotions, E-commerce is increasingly, increasingly multimedia-enriched, manner as images, stream promotions
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages, 5 figures

点击查看摘要

Abstract:E-commerce is increasingly multimedia-enriched, with products exhibited in a broad-domain manner as images, short videos, or live stream promotions. A unified and vectorized cross-domain production representation is essential. Due to large intra-product variance and high inter-product similarity in the broad-domain scenario, a visual-only representation is inadequate. While Automatic Speech Recognition (ASR) text derived from the short or live-stream videos is readily accessible, how to de-noise the excessively noisy text for multimodal representation learning is mostly untouched. We propose ASR-enhanced Multimodal Product Representation Learning (AMPere). In order to extract product-specific information from the raw ASR text, AMPere uses an easy-to-implement LLM-based ASR text summarizer. The LLM-summarized text, together with visual data, is then fed into a multi-branch network to generate compact multimodal embeddings. Extensive experiments on a large-scale tri-domain dataset verify the effectiveness of AMPere in obtaining a unified multimodal product representation that clearly improves cross-domain product retrieval.

[AI-28] Empathy Level Alignment via Reinforcement Learning for Empathetic Response Generation

链接: https://arxiv.org/abs/2408.02976
作者: Hui Ma,Bo Zhang,Bo Xu,Jian Wang,Hongfei Lin,Xiao Sun
关键词-EN: human-like dialogue systems, building human-like dialogue, Empathetic response generation, empathetic responses, produce empathetic responses
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Empathetic response generation, which aims to understand the user’s situation and feelings and respond empathetically, is crucial for building human-like dialogue systems. Previous methods mainly focus on using maximum likelihood estimation as the optimization objective for training response generation models, without taking into account the empathy level alignment between generated responses and target responses. To this end, we propose an empathetic response generation using reinforcement learning (EmpRL) framework. The framework designs an effective empathy reward function and generates empathetic responses by maximizing the expected reward through reinforcement learning. Given the powerful text generation capability of pre-trained language models, EmpRL utilizes the pre-trained T5 model as the generator and conducts further training to initialize the policy. To align the empathy level between generated responses and target responses in the context, an empathy reward function containing three empathy communication mechanisms, i.e., emotional reaction, interpretation, and exploration, is constructed using pre-designed and pre-trained empathy identifiers. Finally, the proximal policy optimization algorithm is used to further train the policy to produce empathetic responses. Both automatic and manual evaluations demonstrate that the proposed EmpRL framework can improve the quality of generated responses, enhance the empathy level similarity between generated and target responses, and produce empathetic responses covering both affective and cognitive aspects.

[AI-29] Anytime Multi-Agent Path Finding with an Adaptive Delay-Based Heuristic

链接: https://arxiv.org/abs/2408.02960
作者: Thomy Phan,Benran Zhang,Shao-Hung Chan,Sven Koenig
关键词-EN: Anytime multi-agent path, Anytime multi-agent, multi-agent path finding, scalable path optimization, Large Neighborhood Search
类目: Artificial Intelligence (cs.AI)
*备注: arXiv admin note: text overlap with arXiv:2312.16767

点击查看摘要

Abstract:Anytime multi-agent path finding (MAPF) is a promising approach to scalable path optimization in multi-agent systems. MAPF-LNS, based on Large Neighborhood Search (LNS), is the current state-of-the-art approach where a fast initial solution is iteratively optimized by destroying and repairing selected paths of the solution. Current MAPF-LNS variants commonly use an adaptive selection mechanism to choose among multiple destroy heuristics. However, to determine promising destroy heuristics, MAPF-LNS requires a considerable amount of exploration time. As common destroy heuristics are non-adaptive, any performance bottleneck caused by these heuristics cannot be overcome via adaptive heuristic selection alone, thus limiting the overall effectiveness of MAPF-LNS in terms of solution cost. In this paper, we propose Adaptive Delay-based Destroy-and-Repair Enhanced with Success-based Self-Learning (ADDRESS) as a single-destroy-heuristic variant of MAPF-LNS. ADDRESS applies restricted Thompson Sampling to the top-K set of the most delayed agents to select a seed agent for adaptive LNS neighborhood generation. We evaluate ADDRESS in multiple maps from the MAPF benchmark set and demonstrate cost improvements by at least 50% in large-scale scenarios with up to a thousand agents, compared with the original MAPF-LNS and other state-of-the-art methods.
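The seed-agent selection step can be illustrated with a small sketch: keep a Beta posterior per agent over whether destroying and repairing around it improved the solution, restrict sampling to the K most delayed agents, and take the sampled maximum. The delay definition and update rule below are illustrative assumptions rather than the exact ADDRESS procedure.

```python
import numpy as np

class RestrictedThompsonSelector:
    """Thompson Sampling over the top-K most delayed agents to pick an LNS seed agent."""

    def __init__(self, n_agents, k=10, seed=0):
        self.success = np.ones(n_agents)  # Beta alpha: destroy/repair improved the solution
        self.failure = np.ones(n_agents)  # Beta beta: it did not
        self.k = k
        self.rng = np.random.default_rng(seed)

    def select(self, delays):
        # Restrict candidates to the K agents with the largest current delay
        top_k = np.argsort(delays)[-self.k:]
        samples = self.rng.beta(self.success[top_k], self.failure[top_k])
        return int(top_k[int(np.argmax(samples))])

    def update(self, agent, improved):
        if improved:
            self.success[agent] += 1.0
        else:
            self.failure[agent] += 1.0

selector = RestrictedThompsonSelector(n_agents=100, k=10)
delays = np.random.default_rng(1).integers(0, 50, size=100)
seed_agent = selector.select(delays)
selector.update(seed_agent, improved=True)
print(seed_agent)
```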

[AI-30] Few-shot Scooping Under Domain Shift via Simulated Maximal Deployment Gaps

链接: https://arxiv.org/abs/2408.02949
作者: Yifan Zhu,Pranay Thangeda,Erica L Tevere,Ashish Goel,Erik Kramer,Hari D Nayar,Melkior Ornik,Kris Hauser
关键词-EN: sample granular materials, deep kernel Gaussian, Deep Kernel Calibration, Maximal Deployment Gaps, deep kernel
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注: arXiv admin note: substantial text overlap with arXiv:2303.02893

点击查看摘要

Abstract:Autonomous lander missions on extraterrestrial bodies need to sample granular materials while coping with domain shifts, even when sampling strategies are extensively tuned on Earth. To tackle this challenge, this paper studies the few-shot scooping problem and proposes a vision-based adaptive scooping strategy that uses the deep kernel Gaussian process method trained with a novel meta-training strategy to learn online from very limited experience on out-of-distribution target terrains. Our Deep Kernel Calibration with Maximal Deployment Gaps (kCMD) strategy explicitly trains a deep kernel model to adapt to large domain shifts by creating simulated maximal deployment gaps from an offline training dataset and training models to overcome these deployment gaps during training. Employed in a Bayesian Optimization sequential decision-making framework, the proposed method allows the robot to perform high-quality scooping actions on out-of-distribution terrains after a few attempts, significantly outperforming non-adaptive methods proposed in the excavation literature as well as other state-of-the-art meta-learning methods. The proposed method also demonstrates zero-shot transfer capability, successfully adapting to the NASA OWLAT platform, which serves as a state-of-the-art simulator for potential future planetary missions. These results demonstrate the potential of training deep models with simulated deployment gaps for more generalizable meta-learning in high-capacity models. Furthermore, they highlight the promise of our method in autonomous lander sampling missions by enabling landers to overcome the deployment gap between Earth and extraterrestrial bodies.

[AI-31] Scaling Laws for Data Poisoning in LLMs

链接: https://arxiv.org/abs/2408.02946
作者: Dillon Bowen,Brendan Murphy,Will Cai,David Khachaturov,Adam Gleave,Kellin Pelrine
关键词-EN: Recent work shows, Recent work, data poisoning, data, work shows
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 20 pages

点击查看摘要

Abstract:Recent work shows that LLMs are vulnerable to data poisoning, in which they are trained on partially corrupted or harmful data. Poisoned data is hard to detect, breaks guardrails, and leads to undesirable and harmful behavior. Given the intense efforts by leading labs to train and deploy increasingly larger and more capable LLMs, it is critical to ask if the risk of data poisoning will be naturally mitigated by scale, or if it is an increasing threat. We consider three threat models by which data poisoning can occur: malicious fine-tuning, imperfect data curation, and intentional data contamination. Our experiments evaluate the effects of data poisoning on 23 frontier LLMs ranging from 1.5-72 billion parameters on three datasets which speak to each of our threat models. We find that larger LLMs are increasingly vulnerable, learning harmful behavior – including sleeper agent behavior – significantly more quickly than smaller LLMs with even minimal data poisoning. These results underscore the need for robust safeguards against data poisoning in larger LLMs.

[AI-32] Doubly Stochastic Adaptive Neighbors Clustering via the Marcus Mapping

链接: https://arxiv.org/abs/2408.02932
作者: Jinghui Yuan,Chusheng Zeng,Fangyuan Xie,Zhe Cao,Rong Wang,Feiping Nie,Xuelong Li
关键词-EN: Doubly stochastic symmetric, Marcus mapping, Doubly stochastic, similarity graph-based clustering, Marcus
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Clustering is a fundamental task in machine learning and data science, and similarity graph-based clustering is an important approach within this domain. Doubly stochastic symmetric similarity graphs provide numerous benefits for clustering problems and downstream tasks, yet learning such graphs remains a significant challenge. Marcus theorem states that a strictly positive symmetric matrix can be transformed into a doubly stochastic symmetric matrix by diagonal matrices. However, in clustering, learning sparse matrices is crucial for computational efficiency. We extend Marcus theorem by proposing the Marcus mapping, which indicates that certain sparse matrices can also be transformed into doubly stochastic symmetric matrices via diagonal matrices. Additionally, we introduce rank constraints into the clustering problem and propose the Doubly Stochastic Adaptive Neighbors Clustering algorithm based on the Marcus Mapping (ANCMM). This ensures that the learned graph naturally divides into the desired number of clusters. We validate the effectiveness of our algorithm through extensive comparisons with state-of-the-art algorithms. Finally, we explore the relationship between the Marcus mapping and optimal transport. We prove that the Marcus mapping solves a specific type of optimal transport problem and demonstrate that solving this problem through Marcus mapping is more efficient than directly applying optimal transport methods.
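The Marcus-theorem statement itself, that a strictly positive symmetric matrix can be scaled by a diagonal matrix into a doubly stochastic symmetric matrix, can be illustrated with a symmetric Sinkhorn-style iteration. The sketch below covers only this classical dense positive case; it does not implement the paper's sparse Marcus mapping or the ANCMM clustering algorithm.

```python
import numpy as np

def symmetric_sinkhorn(A, n_iters=1000, tol=1e-10):
    """Find a positive vector d so that diag(d) @ A @ diag(d) is (approximately) doubly stochastic.

    A is assumed symmetric with strictly positive entries (the Marcus-theorem setting).
    Uses the damped fixed-point iteration d <- sqrt(d / (A d)).
    """
    d = np.ones(A.shape[0])
    for _ in range(n_iters):
        d = np.sqrt(d / (A @ d))
        if np.max(np.abs(d * (A @ d) - 1.0)) < tol:   # row sums of diag(d) A diag(d)
            break
    return d

rng = np.random.default_rng(0)
M = rng.random((5, 5)) + 0.1
A = (M + M.T) / 2            # strictly positive symmetric matrix
d = symmetric_sinkhorn(A)
P = np.diag(d) @ A @ np.diag(d)
print(P.sum(axis=0))         # columns (and rows, by symmetry) sum to ~1
print(np.allclose(P, P.T))   # the result stays symmetric
```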

[AI-33] The Need for a Big World Simulator: A Scientific Challenge for Continual Learning

链接: https://arxiv.org/abs/2408.02930
作者: Saurabh Kumar,Hong Jun Jeon,Alex Lewandowski,Benjamin Van Roy
关键词-EN: small agent, conceptual view, view that motivates, frame offers, small agent operating
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted to the Finding the Frame Workshop at RLC 2024

点击查看摘要

Abstract:The “small agent, big world” frame offers a conceptual view that motivates the need for continual learning. The idea is that a small agent operating in a much bigger world cannot store all information that the world has to offer. To perform well, the agent must be carefully designed to ingest, retain, and eject the right information. To enable the development of performant continual learning agents, a number of synthetic environments have been proposed. However, these benchmarks suffer from limitations, including unnatural distribution shifts and a lack of fidelity to the “small agent, big world” framing. This paper aims to formalize two desiderata for the design of future simulated environments. These two criteria aim to reflect the objectives and complexity of continual learning in practical settings while enabling rapid prototyping of algorithms on a smaller scale.

[AI-34] HARMONIC: Harnessing LLMs for Tabular Data Synthesis and Privacy Protection

链接: https://arxiv.org/abs/2408.02927
作者: Yuxin Wang,Duanyu Feng,Yongfu Dai,Zhengyu Chen,Jimin Huang,Sophia Ananiadou,Qianqian Xie,Hao Wang
关键词-EN: tabular data, advancing deep learning, tabular data generation, Data, tabular data presented
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Data serves as the fundamental foundation for advancing deep learning, particularly tabular data presented in a structured format, which is highly conducive to modeling. However, even in the era of LLM, obtaining tabular data from sensitive domains remains a challenge due to privacy or copyright concerns. Hence, exploring how to effectively use models like LLMs to generate realistic and privacy-preserving synthetic tabular data is urgent. In this paper, we take a step forward to explore LLMs for tabular data synthesis and privacy protection, by introducing a new framework HARMONIC for tabular data generation and evaluation. In the tabular data generation of our framework, unlike previous small-scale LLM-based methods that rely on continued pre-training, we explore the larger-scale LLMs with fine-tuning to generate tabular data and enhance privacy. Based on the idea of the k-nearest neighbors algorithm, an instruction fine-tuning dataset is constructed to inspire LLMs to discover inter-row relationships. Then, with fine-tuning, LLMs are trained to remember the format and connections of the data rather than the data itself, which reduces the risk of privacy leakage. In the evaluation part of our framework, we develop specific privacy risk metrics DLT for LLM synthetic data generation, as well as performance evaluation metrics LLE for downstream LLM tasks. Our experiments find that this tabular data generation framework achieves equivalent performance to existing methods with better privacy, which also demonstrates our evaluation framework for the effectiveness of synthetic data and privacy risks in LLM scenarios.
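A rough sketch of the k-nearest-neighbors idea described above, pairing each tabular row with its nearest rows to form an instruction-style fine-tuning example, might look like the following. The prompt template and field names are illustrative assumptions, not the paper's format.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def build_knn_instruction_examples(rows, k=3):
    """Turn each numeric row into an instruction example whose context is its k nearest rows."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(rows)
    _, idx = nn.kneighbors(rows)          # the first neighbor of each row is the row itself
    examples = []
    for i, neighbors in enumerate(idx):
        context = [rows[j].tolist() for j in neighbors[1:]]
        examples.append({
            "instruction": "Generate one new row consistent with the following rows.",
            "input": str(context),
            "output": str(rows[i].tolist()),
        })
    return examples

rows = np.random.default_rng(0).random((10, 4))
print(build_knn_instruction_examples(rows, k=3)[0])
```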

[AI-35] A Taxonomy of Architecture Options for Foundation Model-based Agents : Analysis and Decision Model

链接: https://arxiv.org/abs/2408.02920
作者: Jingwen Zhou,Qinghua Lu,Jieshan Chen,Liming Zhu,Xiwei Xu,Zhenchang Xing,Stefan Harrer
关键词-EN: rapid advancement, technology has led, led to widespread, widespread applications, design
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: Under review

点击查看摘要

Abstract:The rapid advancement of AI technology has led to widespread applications of agent systems across various domains. However, the need for detailed architecture design poses significant challenges in designing and operating these systems. This paper introduces a taxonomy focused on the architectures of foundation-model-based agents, addressing critical aspects such as functional capabilities and non-functional qualities. We also discuss the operations involved in both design-time and run-time phases, providing a comprehensive view of architectural design and operational characteristics. By unifying and detailing these classifications, our taxonomy aims to improve the design of foundation-model-based agents. Additionally, the paper establishes a decision model that guides critical design and runtime decisions, offering a structured approach to enhance the development of foundation-model-based agents. Our contributions include providing a structured architecture design option and guiding the development process of foundation-model-based agents, thereby addressing current fragmentation in the field.

[AI-36] KOI: Accelerating Online Imitation Learning via Hybrid Key-state Guidance

链接: https://arxiv.org/abs/2408.02912
作者: Jingxian Lu,Wenke Xia,Dong Wang,Zhigang Wang,Bin Zhao,Di Hu,Xuelong Li
关键词-EN: Online Imitation Learning, extensive online exploration, online exploration space, Online Imitation, efficient online imitation
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: Submitted to Corl 2024

点击查看摘要

Abstract:Online Imitation Learning methods struggle with the gap between extensive online exploration space and limited expert trajectories, which hinders efficient exploration due to inaccurate task-aware reward estimation. Inspired by the findings from cognitive neuroscience that task decomposition could facilitate cognitive processing for efficient learning, we hypothesize that an agent could estimate precise task-aware imitation rewards for efficient online exploration by decomposing the target task into the objectives of “what to do” and the mechanisms of “how to do”. In this work, we introduce the hybrid Key-state guided Online Imitation (KOI) learning approach, which leverages the integration of semantic and motion key states as guidance for task-aware reward estimation. Initially, we utilize the visual-language models to segment the expert trajectory into semantic key states, indicating the objectives of “what to do”. Within the intervals between semantic key states, optical flow is employed to capture motion key states to understand the process of “how to do”. By integrating a thorough grasp of both semantic and motion key states, we refine the trajectory-matching reward computation, encouraging task-aware exploration for efficient online imitation learning. Our experimental results show that our method is more sample-efficient in the Meta-World and LIBERO environments. We also conduct real-world robotic manipulation experiments to validate the efficacy of our method, demonstrating the practical applicability of our KOI method.
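The motion-key-state step can be approximated with off-the-shelf dense optical flow: compute the flow magnitude between consecutive frames and keep the frames where it peaks. The sketch below uses OpenCV's Farneback flow as a stand-in; the actual key-state criterion used by KOI is not specified here, so the ranking rule is an assumption.

```python
import cv2
import numpy as np

def motion_key_frames(gray_frames, top_k=3):
    """Rank frames in a segment by mean dense optical-flow magnitude and return the top-k indices."""
    magnitudes = []
    for prev, curr in zip(gray_frames[:-1], gray_frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mag, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])
        magnitudes.append(float(mag.mean()))
    order = np.argsort(magnitudes)[::-1][:top_k]
    return sorted(int(i) + 1 for i in order)   # +1: flow i is between frame i and frame i+1

# Toy example with random frames (replace with real grayscale video frames)
frames = [np.random.randint(0, 255, (64, 64), dtype=np.uint8) for _ in range(10)]
print(motion_key_frames(frames))
```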

[AI-37] Enabling Intelligent Traffic Systems: A Deep Learning Method for Accurate Arabic License Plate Recognition

链接: https://arxiv.org/abs/2408.02904
作者: M. A. Sayedelahl
关键词-EN: accurate Egyptian Vehicle, Egyptian Vehicle License, Vehicle License Plate, Egyptian Vehicle, License Plate Recognition
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper introduces a novel two-stage framework for accurate Egyptian Vehicle License Plate Recognition (EVLPR). The first stage employs image processing techniques to reliably localize license plates, while the second stage utilizes a custom-designed deep learning model for robust Arabic character recognition. The proposed system achieves a remarkable 99.3% accuracy on a diverse dataset, surpassing existing approaches. Its potential applications extend to intelligent traffic management, including traffic violation detection and parking optimization. Future research will focus on enhancing the system’s capabilities through architectural refinements, expanded datasets, and addressing system dependencies.

[AI-38] A Metric Driven Approach to Mixed Precision Training

链接: https://arxiv.org/abs/2408.02897
作者: Mitchelle Rasquinha,Gil Tabak
关键词-EN: deep learning methodologies, increasing neural network, neural network size, network size improves, improves model quality
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:As deep learning methodologies have developed, it has been generally agreed that increasing neural network size improves model quality. However, this is at the expense of memory and compute requirements, which also need to be increased. Various efficiency techniques have been proposed to rein in hardware costs, one being the use of low precision numerics. Recent accelerators have introduced several different 8-bit data types to help accommodate DNNs in terms of numerics. In this paper, we identify a metric driven methodology to aid in the choice of numerics. We demonstrate how such a methodology can help scale training of a language representation model. The technique can be generalized to other model architectures.
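The abstract does not spell out the metric; as an illustration of what a metric-driven choice of numerics can look like, the sketch below measures the relative error each candidate low-precision format introduces on a tensor and picks the cheapest format under an error budget. The format list and the error budget are assumptions for illustration only.

```python
import numpy as np

def relative_cast_error(x, dtype):
    """Relative L2 error introduced by casting a tensor to a lower-precision format and back."""
    x = x.astype(np.float32)
    x_cast = x.astype(dtype).astype(np.float32)
    return float(np.linalg.norm(x - x_cast) / (np.linalg.norm(x) + 1e-12))

def pick_precision(x, candidates=("float16", "float32"), budget=1e-3):
    """Pick the cheapest candidate format whose cast error stays under the budget."""
    for dtype in candidates:                      # candidates ordered cheapest-first
        if relative_cast_error(x, np.dtype(dtype)) <= budget:
            return dtype
    return candidates[-1]

# Example: decide the storage format for a tensor of small-magnitude gradients
grads = np.random.default_rng(0).normal(scale=1e-4, size=10000)
print(pick_precision(grads))
```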

[AI-39] VizECGNet: Visual ECG Image Network for Cardiovascular Diseases Classification with Multi-Modal Training and Knowledge Distillation ICIP

链接: https://arxiv.org/abs/2408.02888
作者: Ju-Hyeon Nam,Seo-Hyung Park,Su Jung Kim,Sang-Chul Lee
关键词-EN: heart electrical signal, heart conditions, heart electrical, ECG, heart
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted in International Conference on Image Processing (ICIP) 2024

点击查看摘要

Abstract:An electrocardiogram (ECG) captures the heart’s electrical signal to assess various heart conditions. In practice, ECG data is stored as either digitized signals or printed images. Despite the emergence of numerous deep learning models for digitized signals, many hospitals prefer image storage due to cost considerations. Recognizing the unavailability of raw ECG signals in many clinical settings, we propose VizECGNet, which uses only printed ECG graphics to determine the prognosis of multiple cardiovascular diseases. During training, cross-modal attention modules (CMAM) are used to integrate information from two modalities - image and signal, while self-modality attention modules (SMAM) capture inherent long-range dependencies in ECG data of each modality. Additionally, we utilize knowledge distillation to improve the similarity between two distinct predictions from each modality stream. This innovative multi-modal deep learning architecture enables the utilization of only ECG images during inference. VizECGNet with image input achieves higher performance in precision, recall, and F1-Score compared to signal-based ECG classification models, with improvements of 3.50%, 8.21%, and 7.38%, respectively.
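The distillation step that pulls the image-stream and signal-stream predictions together can be sketched with a standard symmetric KL objective over temperature-softened logits; the symmetric form and the temperature are assumptions, since the abstract only states that the two predictions are made more similar.

```python
import torch
import torch.nn.functional as F

def symmetric_distillation_loss(image_logits, signal_logits, temperature=2.0):
    """Symmetric KL divergence between image-stream and signal-stream predictions (softened by temperature)."""
    log_p_img = F.log_softmax(image_logits / temperature, dim=-1)
    log_p_sig = F.log_softmax(signal_logits / temperature, dim=-1)
    # kl_div(input, target) expects log-probabilities as input and probabilities as target
    kl_a = F.kl_div(log_p_img, log_p_sig.exp(), reduction="batchmean")
    kl_b = F.kl_div(log_p_sig, log_p_img.exp(), reduction="batchmean")
    return (temperature ** 2) * 0.5 * (kl_a + kl_b)

image_logits = torch.randn(8, 5)   # e.g., logits over 5 cardiovascular disease classes
signal_logits = torch.randn(8, 5)
print(symmetric_distillation_loss(image_logits, signal_logits))
```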

[AI-40] Compromising Embodied Agents with Contextual Backdoor Attacks

链接: https://arxiv.org/abs/2408.02882
作者: Aishan Liu,Yuguang Zhou,Xianglong Liu,Tianyuan Zhang,Siyuan Liang,Jiakai Wang,Yanjun Pu,Tianlin Li,Junqi Zhang,Wenbo Zhou,Qing Guo,Dacheng Tao
关键词-EN: Large language models, Large language, transformed the development, language models, embodied intelligence
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) have transformed the development of embodied intelligence. By providing a few contextual demonstrations, developers can utilize the extensive internal knowledge of LLMs to effortlessly translate complex tasks described in abstract language into sequences of code snippets, which will serve as the execution logic for embodied agents. However, this paper uncovers a significant backdoor security threat within this process and introduces a novel method called \method. By poisoning just a few contextual demonstrations, attackers can covertly compromise the contextual environment of a black-box LLM, prompting it to generate programs with context-dependent defects. These programs appear logically sound but contain defects that can activate and induce unintended behaviors when the operational agent encounters specific triggers in its interactive environment. To compromise the LLM’s contextual environment, we employ adversarial in-context generation to optimize poisoned demonstrations, where an LLM judge evaluates these poisoned prompts, reporting to an additional LLM that iteratively optimizes the demonstration in a two-player adversarial game using chain-of-thought reasoning. To enable context-dependent behaviors in downstream agents, we implement a dual-modality activation strategy that controls both the generation and execution of program defects through textual and visual triggers. We expand the scope of our attack by developing five program defect modes that compromise key aspects of confidentiality, integrity, and availability in embodied agents. To validate the effectiveness of our approach, we conducted extensive experiments across various tasks, including robot planning, robot manipulation, and compositional visual reasoning. Additionally, we demonstrate the potential impact of our approach by successfully attacking real-world autonomous driving systems.

[AI-41] Hide and Seek: Fingerprinting Large Language Models with Evolutionary Learning

链接: https://arxiv.org/abs/2408.02871
作者: Dmitri Iourovitski,Sanat Sharma,Rakshak Talwar
关键词-EN: Large Language Model, Large Language, generated by Large, Language Model, grown exponentially
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:As content generated by Large Language Model (LLM) has grown exponentially, the ability to accurately identify and fingerprint such text has become increasingly crucial. In this work, we introduce a novel black-box approach for fingerprinting LLMs, achieving an impressive 72% accuracy in identifying the correct family of models (Such as Llama, Mistral, Gemma, etc) among a lineup of LLMs. We present an evolutionary strategy that leverages the capabilities of one LLM to discover the most salient features for identifying other LLMs. Our method employs a unique “Hide and Seek” algorithm, where an Auditor LLM generates discriminative prompts, and a Detective LLM analyzes the responses to fingerprint the target models. This approach not only demonstrates the feasibility of LLM-driven model identification but also reveals insights into the semantic manifolds of different LLM families. By iteratively refining prompts through in-context learning, our system uncovers subtle distinctions between model outputs, providing a powerful tool for LLM analysis and verification. This research opens new avenues for understanding LLM behavior and has significant implications for model attribution, security, and the broader field of AI transparency.

[AI-42] On The Stability of Moral Preferences: A Problem with Computational Elicitation Methods

链接: https://arxiv.org/abs/2408.02862
作者: Kyle Boerstler,Vijay Keswani,Lok Chan,Jana Schaich Borg,Vincent Conitzer,Hoda Heidari,Walter Sinnott-Armstrong
关键词-EN: frameworks feature heavily, elicitation frameworks feature, Preference elicitation frameworks, frameworks feature, feature heavily
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注: To appear in AIES 2024

点击查看摘要

Abstract:Preference elicitation frameworks feature heavily in the research on participatory ethical AI tools and provide a viable mechanism to enquire and incorporate the moral values of various stakeholders. As part of the elicitation process, surveys about moral preferences, opinions, and judgments are typically administered only once to each participant. This methodological practice is reasonable if participants’ responses are stable over time such that, all other relevant factors being held constant, their responses today will be the same as their responses to the same questions at a later time. However, we do not know how often that is the case. It is possible that participants’ true moral preferences change, are subject to temporary moods or whims, or are influenced by environmental factors we don’t track. If participants’ moral responses are unstable in such ways, it would raise important methodological and theoretical issues for how participants’ true moral preferences, opinions, and judgments can be ascertained. We address this possibility here by asking the same survey participants the same moral questions about which patient should receive a kidney when only one is available ten times in ten different sessions over two weeks, varying only presentation order across sessions. We measured how often participants gave different responses to simple (Study One) and more complicated (Study Two) repeated scenarios. On average, the fraction of times participants changed their responses to controversial scenarios was around 10-18% across studies, and this instability is observed to have positive associations with response time and decision-making difficulty. We discuss the implications of these results for the efficacy of moral preference elicitation, highlighting the role of response instability in causing value misalignment between stakeholders and AI tools trained on their moral judgments.
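The headline quantity, how often a participant changes their answer to the same repeated scenario, is straightforward to compute; a minimal sketch is shown below, with the array layout being an assumption.

```python
import numpy as np

def instability_rate(responses):
    """Fraction of repeated presentations where a participant's answer differs from the previous one.

    responses: array of shape (n_participants, n_sessions) with categorical answers
    to the same scenario across sessions.
    """
    responses = np.asarray(responses)
    changes = responses[:, 1:] != responses[:, :-1]
    return changes.mean(axis=1)   # per-participant instability

# 3 participants answering the same kidney scenario in 10 sessions (A/B choice)
answers = [list("AAAAABAAAA"), list("ABABABABAB"), list("BBBBBBBBBB")]
print(instability_rate(answers))  # roughly [0.22, 1.0, 0.0]
```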

[AI-43] Development of REGAI: Rubric Enabled Generative Artificial Intelligence

链接: https://arxiv.org/abs/2408.02811
作者: Zach Johnson,Jeremy Straub
关键词-EN: based artificial intelligence, generative artificial intelligence, retrieval augmented generation, large language model, artificial intelligence
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper presents and evaluates a new retrieval augmented generation (RAG) and large language model (LLM)-based artificial intelligence (AI) technique: rubric enabled generative artificial intelligence (REGAI). REGAI uses rubrics, which can be created manually or automatically by the system, to enhance the performance of LLMs for evaluation purposes. REGAI improves on the performance of both classical LLMs and RAG-based LLM techniques. This paper describes REGAI, presents data regarding its performance and discusses several possible application areas for the technology.

[AI-44] Examining Gender and Power on Wikipedia Through Face and Politeness

链接: https://arxiv.org/abs/2408.02798
作者: Adil Soubki,Shyne Choi,Owen Rambow
关键词-EN: sociolinguistic theory, face acts, analyzing discourse, discourse by combining, combining two interdependent
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose a framework for analyzing discourse by combining two interdependent concepts from sociolinguistic theory: face acts and politeness. While politeness has robust existing tools and data, face acts are less resourced. We introduce a new corpus created by annotating Wikipedia talk pages with face acts and we use this to train a face act tagger. We then employ our framework to study how face and politeness interact with gender and power in discussions between Wikipedia editors. Among other findings, we observe that female Wikipedians are not only more polite, which is consistent with prior studies, but that this difference corresponds with significantly more language directed at humbling aspects of their own face. Interestingly, the distinction nearly vanishes once limiting to editors with administrative power.

[AI-45] Diffusion Models as Data Mining Tools ECCV2024

链接: https://arxiv.org/abs/2408.02752
作者: Ioannis Siglidis,Aleksander Holynski,Alexei A. Efros,Mathieu Aubry,Shiry Ginosar
关键词-EN: generative models trained, paper demonstrates, synthesis as tools, generative models, data
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Project Page: this https URL Accepted in ECCV 2024

点击查看摘要

Abstract:This paper demonstrates how to use generative models trained for image synthesis as tools for visual data mining. Our insight is that since contemporary generative models learn an accurate representation of their training data, we can use them to summarize the data by mining for visual patterns. Concretely, we show that after finetuning conditional diffusion models to synthesize images from a specific dataset, we can use these models to define a typicality measure on that dataset. This measure assesses how typical visual elements are for different data labels, such as geographic location, time stamps, semantic labels, or even the presence of a disease. This analysis-by-synthesis approach to data mining has two key advantages. First, it scales much better than traditional correspondence-based approaches since it does not require explicitly comparing all pairs of visual elements. Second, while most previous works on visual data mining focus on a single dataset, our approach works on diverse datasets in terms of content and scale, including a historical car dataset, a historical face dataset, a large worldwide street-view dataset, and an even larger scene dataset. Furthermore, our approach allows for translating visual elements across class labels and analyzing consistent changes.
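
One way to read the typicality measure described above is as the gap between how well the label-conditional diffusion model explains an image and how well an unconditional (null-conditioned) pass does. The sketch below follows that reading; it is an assumption about the formulation rather than the paper's exact definition, and `eps_model` is a hypothetical noise-prediction network that accepts `cond=None` for the unconditional case.

```python
import torch

@torch.no_grad()
def typicality(x, cond, eps_model, alphas_cumprod, n_samples=16):
    # Score how "typical" images x are for condition cond, as the gap between the
    # unconditional and conditional denoising errors (higher = more typical).
    device = x.device
    T = alphas_cumprod.shape[0]
    score = 0.0
    for _ in range(n_samples):
        t = torch.randint(0, T, (x.shape[0],), device=device)
        a = alphas_cumprod[t].view(-1, 1, 1, 1)
        noise = torch.randn_like(x)
        x_t = a.sqrt() * x + (1 - a).sqrt() * noise          # forward diffusion at step t
        err_cond = (eps_model(x_t, t, cond) - noise).pow(2).mean(dim=(1, 2, 3))
        err_uncond = (eps_model(x_t, t, None) - noise).pow(2).mean(dim=(1, 2, 3))
        score = score + (err_uncond - err_cond)
    return score / n_samples                                  # one score per image
```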

[AI-46] MDM: Advancing Multi-Domain Distribution Matching for Automatic Modulation Recognition Dataset Synthesis

链接: https://arxiv.org/abs/2408.02714
作者: Dongwei Xu,Jiajun Chen,Yao Lu,Tianhao Xia,Qi Xuan,Wei Wang,Yun Lin,Xiaoniu Yang
关键词-EN: Automatic Modulation Recognition, Modulation Recognition, Automatic Modulation, introduced into Automatic, deep learning technology
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recently, deep learning technology has been successfully introduced into Automatic Modulation Recognition (AMR) tasks. However, the success of deep learning is all attributed to the training on large-scale datasets. Such a large amount of data brings huge pressure on storage, transmission and model training. In order to solve the problem of large amount of data, some researchers put forward the method of data distillation, which aims to compress large training data into smaller synthetic datasets to maintain its performance. While numerous data distillation techniques have been developed within the realm of image processing, the unique characteristics of signals set them apart. Signals exhibit distinct features across various domains, necessitating specialized approaches for their analysis and processing. To this end, a novel dataset distillation method, Multi-domain Distribution Matching (MDM), is proposed. MDM employs the Discrete Fourier Transform (DFT) to translate time-domain signals into the frequency domain, and then uses a model to compute distribution matching losses between the synthetic and real datasets, considering both the time and frequency domains. Ultimately, these two losses are integrated to update the synthetic dataset. We conduct extensive experiments on three AMR datasets. Experimental results show that, compared with baseline methods, our method achieves better performance under the same compression ratio. Furthermore, we conduct cross-architecture generalization experiments on several models, and the experimental results show that our synthetic datasets can generalize well on other unseen models.
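
As a rough illustration of the multi-domain matching idea, the sketch below computes a distribution-matching loss in the time domain, computes it again on the DFT magnitude spectrum, and combines the two to update the synthetic set. It is a simplified reading of the abstract, not the authors' implementation; the feature extractors and the weighting factor `lam` are assumptions.

```python
import torch

def dm_loss(feat_net, real, syn):
    # Distribution matching: distance between mean embeddings of real and synthetic batches.
    return (feat_net(real).mean(dim=0) - feat_net(syn).mean(dim=0)).pow(2).sum()

def mdm_loss(feat_time, feat_freq, real, syn, lam=1.0):
    # Time-domain matching plus frequency-domain matching on the DFT magnitude spectrum.
    loss_t = dm_loss(feat_time, real, syn)
    real_f = torch.fft.rfft(real, dim=-1).abs()
    syn_f = torch.fft.rfft(syn, dim=-1).abs()
    loss_f = dm_loss(feat_freq, real_f, syn_f)
    return loss_t + lam * loss_f  # combined objective used to update the synthetic signals

# Toy usage: 1-D signals of length 128, identity "feature extractors" for brevity
real = torch.randn(32, 128)
syn = torch.randn(16, 128, requires_grad=True)
loss = mdm_loss(torch.nn.Identity(), torch.nn.Identity(), real, syn)
loss.backward()  # gradients flow into the synthetic dataset
```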

[AI-47] Automatic Voice Identification after Speech Resynthesis using PPG

链接: https://arxiv.org/abs/2408.02712
作者: Thibault Gaudier(LST, LIUM),Marie Tahon(LIUM, LST),Anthony Larcher(LST, LIUM),Yannick Estève(LIA)
关键词-EN: voice conversion preserves, speech resynthesis system.A, PPG-based speech resynthesis, speech edition preserves, intermediate representations.Phonetic PosteriorGrams
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Speech resynthesis is a generic task for which we want to synthesize audio with another audio as input, which finds applications for media monitors and journalists. Among different tasks addressed by speech resynthesis, voice conversion preserves the linguistic information while modifying the identity of the speaker, and speech edition preserves the identity of the speaker but some words are modified. In both cases, we need to disentangle speaker and phonetic contents in intermediate representations. Phonetic PosteriorGrams (PPG) are a frame-level probabilistic representation of phonemes, and are usually considered speaker-independent. This paper presents a PPG-based speech resynthesis system. A perceptive evaluation assesses that it produces correct audio quality. Then, we demonstrate that an automatic speaker verification model is not able to recover the source speaker after re-synthesis with PPG, even when the model is trained on synthetic data.

[AI-48] Text Conditioned Symbolic Drumbeat Generation using Latent Diffusion Models

链接: https://arxiv.org/abs/2408.02711
作者: Pushkar Jajoria,James McDermott
关键词-EN: Latent Diffusion Models, Diffusion Models, study introduces, introduces a text-conditioned, text-conditioned approach
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:This study introduces a text-conditioned approach to generating drumbeats with Latent Diffusion Models (LDMs). It uses informative conditioning text extracted from training data filenames. By pretraining a text and drumbeat encoder through contrastive learning within a multimodal network, aligned following CLIP, we align the modalities of text and music closely. Additionally, we examine an alternative text encoder based on multihot text encodings. Inspired by music's multi-resolution nature, we propose a novel LSTM variant, MultiResolutionLSTM, designed to operate at various resolutions independently. In common with recent LDMs in the image space, it speeds up the generation process by running diffusion in a latent space provided by a pretrained unconditional autoencoder. We demonstrate the originality and variety of the generated drumbeats by measuring distance (both over binary pianorolls and in the latent space) versus the training dataset and among the generated drumbeats. We also assess the generated drumbeats through a listening test focused on questions of quality, aptness for the prompt text, and novelty. We show that the generated drumbeats are novel and apt to the prompt text, and comparable in quality to those created by human musicians.

[AI-49] Enhancing Medical Learning and Reasoning Systems: A Boxology-Based Comparative Analysis of Design Patterns

链接: https://arxiv.org/abs/2408.02709
作者: Chi Him Ng
关键词-EN: design patterns, systems’ design patterns, study analyzes hybrid, established design patterns, patterns
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This study analyzes hybrid AI systems’ design patterns and their effectiveness in clinical decision-making using the boxology framework. It categorizes and compares various architectures combining machine learning and rule-based reasoning to provide insights into their structural foundations and healthcare applications. Addressing two main questions, how to categorize these systems against established design patterns and how to extract insights through comparative analysis, the study uses design patterns from software engineering to understand and optimize healthcare AI systems. Boxology helps identify commonalities and create reusable solutions, enhancing these systems’ scalability, reliability, and performance. Five primary architectures are examined: REML, MLRB, RBML, RMLT, and PERML. Each has unique strengths and weaknesses, highlighting the need for tailored approaches in clinical tasks. REML excels in high-accuracy prediction for datasets with limited data; MLRB in handling large datasets and complex data integration; RBML in explainability and trustworthiness; RMLT in managing high-dimensional data; and PERML, though limited in analysis, shows promise in urgent care scenarios. The study introduces four new patterns, creates five abstract categorization patterns, and refines those five further to specific systems. These contributions enhance Boxology’s taxonomical organization and offer novel approaches to integrating expert knowledge with machine learning. Boxology’s structured, modular approach offers significant advantages in developing and analyzing hybrid AI systems, revealing commonalities, and promoting reusable solutions. In conclusion, this study underscores hybrid AI systems’ crucial role in advancing healthcare and Boxology’s potential to drive further innovation in AI integration, ultimately improving clinical decision support and patient outcomes.

[AI-50] SnapE – Training Snapshot Ensembles of Link Prediction Models ISWC

链接: https://arxiv.org/abs/2408.02707
作者: Ali Shaban,Heiko Paulheim
关键词-EN: Snapshot ensembles, prediction, models, Snapshot, Abstract
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted at International Semantic Web Conference (ISWC) 2024

点击查看摘要

Abstract:Snapshot ensembles have been widely used in various fields of prediction. They allow for training an ensemble of prediction models at the cost of training a single one. They are known to yield more robust predictions by creating a set of diverse base models. In this paper, we introduce an approach to transfer the idea of snapshot ensembles to link prediction models in knowledge graphs. Moreover, since link prediction in knowledge graphs is a setup without explicit negative examples, we propose a novel training loop that iteratively creates negative examples using previous snapshot models. An evaluation with four base models across four datasets shows that this approach constantly outperforms the single model approach, while keeping the training time constant.
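
To make the training recipe concrete, here is a minimal sketch of a snapshot-ensemble loop with a cosine-cyclic learning rate, in which negatives after the first cycle are mined with the previous snapshot. It is an interpretation of the abstract rather than the authors' code; the batching and corruption callables (`sample_positives`, `random_corruption`, `corrupt_with_snapshot`) and the loss are hypothetical placeholders supplied by the caller.

```python
# Sketch of snapshot-ensemble training for link prediction (illustration only).
import copy
import math
import torch

def train_snapshot_ensemble(model, loss_fn, sample_positives, random_corruption,
                            corrupt_with_snapshot, cycles=4, steps=1000, lr_max=0.01):
    snapshots = []
    opt = torch.optim.SGD(model.parameters(), lr=lr_max)
    for _ in range(cycles):
        for step in range(steps):
            # Cosine-annealed learning rate restarted every cycle: the usual
            # snapshot-ensemble schedule, so each cycle ends near a local minimum.
            for g in opt.param_groups:
                g["lr"] = 0.5 * lr_max * (1 + math.cos(math.pi * step / steps))
            pos = sample_positives()
            # After the first cycle, negatives are generated with the previous
            # snapshot (the iterative negative-sampling idea described above).
            neg = random_corruption(pos) if not snapshots else corrupt_with_snapshot(pos, snapshots[-1])
            loss = loss_fn(model, pos, neg)
            opt.zero_grad()
            loss.backward()
            opt.step()
        snapshots.append(copy.deepcopy(model))  # one ensemble member per cycle
    return snapshots  # at test time, average the members' link-prediction scores
```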

[AI-51] Bayesian Kolmogorov Arnold Networks (Bayesian_KANs): A Probabilistic Approach to Enhance Accuracy and Interpretability

链接: https://arxiv.org/abs/2408.02706
作者: Masoud Muhammed Hassan
关键词-EN: strong predictive skills, Kolmogorov Arnold Networks, Bayesian Kolmogorov Arnold, deep learning models, deep learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Because of its strong predictive skills, deep learning has emerged as an essential tool in many industries, including healthcare. Traditional deep learning models, on the other hand, frequently lack interpretability and fail to take prediction uncertainty into account, two crucial components of clinical decision making. In order to produce explainable and uncertainty aware predictions, this study presents a novel framework called Bayesian Kolmogorov Arnold Networks (BKANs), which combines the expressive capacity of Kolmogorov Arnold Networks with Bayesian inference. We employ BKANs on two medical datasets, which are widely used benchmarks for assessing machine learning models in medical diagnostics: the Pima Indians Diabetes dataset and the Cleveland Heart Disease dataset. Our method provides useful insights into prediction confidence and decision boundaries and outperforms traditional deep learning models in terms of prediction accuracy. Moreover, BKANs’ capacity to represent aleatoric and epistemic uncertainty guarantees doctors receive more solid and trustworthy decision support. Our Bayesian strategy improves the interpretability of the model and considerably minimises overfitting, which is important for tiny and imbalanced medical datasets, according to experimental results. We present possible expansions to further use BKANs in more complicated multimodal datasets and address the significance of these discoveries for future research in building reliable AI systems for healthcare. This work paves the way for a new paradigm in deep learning model deployment in vital sectors where transparency and reliability are crucial.

[AI-52] PSNE: Efficient Spectral Sparsification Algorithms for Scaling Network Embedding

链接: https://arxiv.org/abs/2408.02705
作者: Longlong Lin,Yunfeng Yu,Zihao Wang,Zeli Wang,Yuying Zhao,Jin Zhao,Tao Jia
关键词-EN: numerous practical applications, received extensive attention, PPR matrix, continuous dense vector, dense vector space
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Network embedding has numerous practical applications and has received extensive attention in graph learning, which aims at mapping vertices into a low-dimensional and continuous dense vector space by preserving the underlying structural properties of the graph. Many network embedding methods have been proposed, among which factorization of the Personalized PageRank (PPR for short) matrix has been empirically and theoretically well supported recently. However, several fundamental issues cannot be addressed. (1) Existing methods invoke a seminal Local Push subroutine to approximate a single row or column of the PPR matrix. Thus, they have to execute n (n is the number of nodes) Local Push subroutines to obtain a provable PPR matrix, resulting in prohibitively high computational costs for large n. (2) The PPR matrix has limited power in capturing the structural similarity between vertices, leading to performance degradation. To overcome these dilemmas, we propose PSNE, an efficient spectral sparsification method for scaling network embedding, which can quickly obtain embedding vectors that retain strong structural similarities. Specifically, PSNE first designs a matrix polynomial sparsifier to accelerate the calculation of the PPR matrix, which has a theoretical guarantee in terms of the Frobenius norm. Subsequently, PSNE proposes a simple but effective multiple-perspective strategy to enhance further the representation power of the obtained approximate PPR matrix. Finally, PSNE applies a randomized singular value decomposition algorithm on the sparse and multiple-perspective PPR matrix to get the target embedding vectors. Experimental evaluation of real-world and synthetic datasets shows that our solutions are indeed more efficient, effective, and scalable compared with ten competitors.
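
For intuition, the sketch below builds a truncated-series (matrix polynomial) approximation of the PPR matrix and factorizes it with randomized SVD to obtain node embeddings. It is a bare-bones illustration of the pipeline, not PSNE itself: the spectral sparsifier and the multiple-perspective strategy are omitted, and all hyperparameters are arbitrary.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.utils.extmath import randomized_svd

def approx_ppr(adj, alpha=0.15, order=10):
    # Truncated series PPR ~ alpha * sum_k (1 - alpha)^k P^k, with P the row-normalized adjacency.
    # Note: without a sparsifier the matrix powers densify quickly; avoiding that is PSNE's point.
    n = adj.shape[0]
    deg = np.asarray(adj.sum(axis=1)).ravel()
    deg[deg == 0] = 1.0
    P = sp.diags(1.0 / deg) @ adj
    term = sp.identity(n, format="csr")
    ppr = alpha * term
    for k in range(1, order + 1):
        term = term @ P
        ppr = ppr + alpha * (1 - alpha) ** k * term
    return ppr

def embed_nodes(adj, dim=32, alpha=0.15, order=10):
    # Randomized SVD of the approximate PPR matrix yields the embedding vectors.
    M = approx_ppr(adj, alpha, order)
    U, S, _ = randomized_svd(M, n_components=dim, random_state=0)
    return U * np.sqrt(S)

# Toy usage on a random sparse graph
A = sp.random(500, 500, density=0.01, format="csr", random_state=0)
A = ((A + A.T) > 0).astype(float)   # symmetrize into an unweighted adjacency
print(embed_nodes(A).shape)          # (500, 32)
```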

[AI-53] Spatial-temporal Graph Convolutional Networks with Diversified Transformation for Dynamic Graph Representation Learning

链接: https://arxiv.org/abs/2408.02704
作者: Ling Wang,Yixiang Huang,Hao Wu
关键词-EN: describe evolving interactions, real-world applications, describe evolving, evolving interactions, interactions between nodes
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 8 pages, 1 figure

点击查看摘要

Abstract:Dynamic graphs (DG) are often used to describe evolving interactions between nodes in real-world applications. Temporal patterns are a natural feature of DGs and are also key to representation learning. However, existing dynamic GCN models are mostly composed of static GCNs and sequence modules, which results in the separation of spatiotemporal information and cannot effectively capture complex temporal patterns in DGs. To address this problem, this study proposes a spatial-temporal graph convolutional networks with diversified transformation (STGCNDT), which includes three aspects: a) constructing a unified graph tensor convolutional network (GTCN) using tensor M-products without the need to represent spatiotemporal information separately; b) introducing three transformation schemes in GTCN to model complex temporal patterns to aggregate temporal information; and c) constructing an ensemble of diversified transformation schemes to obtain higher representation capabilities. Empirical studies on four DGs that appear in communication networks show that the proposed STGCNDT significantly outperforms state-of-the-art models in solving link weight estimation tasks due to the diversified transformations.

[AI-54] DeepNetBeam: A Framework for the Analysis of Functionally Graded Porous Beams

链接: https://arxiv.org/abs/2408.02698
作者: Mohammad Sadegh Eshaghi,Mostafa Bamdad,Cosmin Anitescu,Yizheng Wang,Xiaoying Zhuang,Timon Rabczuk
关键词-EN: Scientific Machine Learning, Machine Learning, Scientific Machine, investigates different Scientific, Neural Operator methods
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This study investigates different Scientific Machine Learning (SciML) approaches for the analysis of functionally graded (FG) porous beams and compares them under a new framework. The beam material properties are assumed to vary as an arbitrary continuous function. The methods consider the output of a neural network/operator as an approximation to the displacement fields and derive the equations governing beam behavior based on the continuum formulation. The methods are implemented in the framework and formulated by three approaches: (a) the vector approach leads to a Physics-Informed Neural Network (PINN), (b) the energy approach brings about the Deep Energy Method (DEM), and (c) the data-driven approach, which results in a class of Neural Operator methods. Finally, a neural operator has been trained to predict the response of the porous beam with functionally graded material under any porosity distribution pattern and any arbitrary traction condition. The results are validated with analytical and numerical reference solutions. The data and code accompanying this manuscript will be publicly available at this https URL.

[AI-55] Why Rectified Power Unit Networks Fail and How to Improve It: An Effective Theory Perspective

链接: https://arxiv.org/abs/2408.02697
作者: Taeyoung Kim,Myungjoo Kang
关键词-EN: Rectified Power Unit, Rectified Linear Unit, Power Unit, Linear Unit, Rectified Power
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 25 pages, 8 figures

点击查看摘要

Abstract:The Rectified Power Unit (RePU) activation function, unlike the Rectified Linear Unit (ReLU), has the advantage of being differentiable when constructing neural networks. However, it can be experimentally observed that, when deep layers are stacked, neural networks constructed with RePU encounter critical issues. These issues include exploding or vanishing values and failure of training, and they occur regardless of the hyperparameter initialization. From the perspective of effective theory, we aim to identify the causes of this phenomenon and propose a new activation function that retains the advantages of RePU while overcoming its drawbacks.
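
A two-line definition of RePU and a quick numerical illustration of the depth problem described above are given below; the width, depth, and initialization are arbitrary choices for the demonstration, not values from the paper.

```python
import torch

def repu(x, p=2):
    # Rectified Power Unit: ReLU(x) raised to the power p (differentiable at 0 for p >= 2).
    return torch.clamp(x, min=0) ** p

torch.manual_seed(0)
x = torch.randn(64, 128)
for depth in range(20):
    w = torch.randn(128, 128) / 128 ** 0.5   # plain 1/sqrt(fan_in) initialization
    x = repu(x @ w)
    print(depth, float(x.abs().mean()))       # activation scale drifts rapidly with depth
```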

[AI-56] Distribution-Level Memory Recall for Continual Learning: Preserving Knowledge and Avoiding Confusion

链接: https://arxiv.org/abs/2408.02695
作者: Shaoxu Cheng,Kanglei Geng,Chiyuan He,Zihuan Qiu,Linfeng Xu,Heqian Qiu,Lanxiao Wang,Qingbo Wu,Fanman Meng,Hongliang Li
关键词-EN: Deep Neural Networks, enable Deep Neural, Neural Networks, Deep Neural, forgetting previously learned
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Continual Learning (CL) aims to enable Deep Neural Networks (DNNs) to learn new data without forgetting previously learned knowledge. The key to achieving this goal is to avoid confusion at the feature level, i.e., avoiding confusion within old tasks and between new and old tasks. Previous prototype-based CL methods generate pseudo features for old knowledge replay by adding Gaussian noise to the centroids of old classes. However, the distribution in the feature space exhibits anisotropy during the incremental process, which prevents the pseudo features from faithfully reproducing the distribution of old knowledge in the feature space, leading to confusion in classification boundaries within old tasks. To address this issue, we propose the Distribution-Level Memory Recall (DMR) method, which uses a Gaussian mixture model to precisely fit the feature distribution of old knowledge at the distribution level and generate pseudo features in the next stage. Furthermore, resistance to confusion at the distribution level is also crucial for multimodal learning, as the problem of multimodal imbalance results in significant differences in feature responses between different modalities, exacerbating confusion within old tasks in prototype-based CL methods. Therefore, we mitigate the multi-modal imbalance problem by using the Inter-modal Guidance and Intra-modal Mining (IGIM) method to guide weaker modalities with prior information from dominant modalities and further explore useful information within modalities. For the second key, We propose the Confusion Index to quantitatively describe a model’s ability to distinguish between new and old tasks, and we use the Incremental Mixup Feature Enhancement (IMFE) method to enhance pseudo features with new sample features, alleviating classification confusion between new and old knowledge.
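
The distribution-level replay idea (fit a mixture model to each old class's features, then sample pseudo features from it in the next stage) can be sketched in a few lines with scikit-learn. This is only an illustration of that single component under assumed settings (number of mixture components, full covariances); the IGIM and IMFE parts of the method are not shown.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_class_memory(features_per_class, n_components=3):
    # One GMM per old class, fit on that class's feature vectors (distribution-level memory).
    return {c: GaussianMixture(n_components=n_components, covariance_type="full",
                               random_state=0).fit(f)
            for c, f in features_per_class.items()}

def recall_pseudo_features(memory, n_per_class=100):
    # Sample pseudo features from each stored GMM for replay alongside new-task data.
    xs, ys = [], []
    for c, gmm in memory.items():
        samples, _ = gmm.sample(n_per_class)
        xs.append(samples)
        ys.append(np.full(len(samples), c))
    return np.vstack(xs), np.concatenate(ys)

# Toy usage: two old classes with 64-dimensional features
rng = np.random.default_rng(0)
memory = fit_class_memory({0: rng.normal(0, 1, (200, 64)), 1: rng.normal(3, 1, (200, 64))})
X_replay, y_replay = recall_pseudo_features(memory)
print(X_replay.shape, y_replay.shape)   # (200, 64) (200,)
```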

[AI-57] KAN based Autoencoders for Factor Models

链接: https://arxiv.org/abs/2408.02694
作者: Tianqi Wang,Shubham Singh
关键词-EN: Inspired by recent, Kolmogorov-Arnold Networks, latent factor conditional, conditional asset pricing, factor conditional asset
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Finance (q-fin.CP)
*备注: 7 pages

点击查看摘要

Abstract:Inspired by recent advances in Kolmogorov-Arnold Networks (KANs), we introduce a novel approach to latent factor conditional asset pricing models. While previous machine learning applications in asset pricing have predominantly used Multilayer Perceptrons with ReLU activation functions to model latent factor exposures, our method introduces a KAN-based autoencoder which surpasses MLP models in both accuracy and interpretability. Our model offers enhanced flexibility in approximating exposures as nonlinear functions of asset characteristics, while simultaneously providing users with an intuitive framework for interpreting latent factors. Empirical backtesting demonstrates our model’s superior ability to explain cross-sectional risk exposures. Moreover, long-short portfolios constructed using our model’s predictions achieve higher Sharpe ratios, highlighting its practical value in investment management.

[AI-58] Attention is all you need for an improved CNN-based flash flood susceptibility modeling. The case of the ungauged Rheraya watershed Morocco

链接: https://arxiv.org/abs/2408.02692
作者: Akram Elghouat,Ahmed Algouti,Abdellah Algouti,Soukaina Baid
关键词-EN: Effective flood hazard, Effective flood, flash flood susceptibility, hazard management requires, management requires evaluating
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Effective flood hazard management requires evaluating and predicting flash flood susceptibility. Convolutional neural networks (CNNs) are commonly used for this task but face issues like gradient explosion and overfitting. This study explores the use of an attention mechanism, specifically the convolutional block attention module (CBAM), to enhance CNN models for flash flood susceptibility in the ungauged Rheraya watershed, a flood prone region. We used ResNet18, DenseNet121, and Xception as backbone architectures, integrating CBAM at different locations. Our dataset included 16 conditioning factors and 522 flash flood inventory points. Performance was evaluated using accuracy, precision, recall, F1-score, and the area under the curve (AUC) of the receiver operating characteristic (ROC). Results showed that CBAM significantly improved model performance, with DenseNet121 incorporating CBAM in each convolutional block achieving the best results (accuracy = 0.95, AUC = 0.98). Distance to river and drainage density were identified as key factors. These findings demonstrate the effectiveness of the attention mechanism in improving flash flood susceptibility modeling and offer valuable insights for disaster management.
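
For reference, a standard CBAM block (channel attention followed by spatial attention) looks roughly like the PyTorch sketch below. The reduction ratio and kernel size follow common defaults, and the exact placement inside the ResNet18/DenseNet121/Xception backbones used in the study may differ.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    # Convolutional Block Attention Module: channel attention then spatial attention.
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        b, c, _, _ = x.shape
        # Channel attention from average- and max-pooled descriptors sharing one MLP
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        # Spatial attention from channel-wise average and max maps
        s = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))

# Example: a small convolutional block with CBAM appended, as it might be inserted into a backbone.
block = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), CBAM(32))
print(block(torch.randn(1, 16, 64, 64)).shape)   # torch.Size([1, 32, 64, 64])
```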

[AI-59] Symmetric Graph Contrastive Learning against Noisy Views for Recommendation

链接: https://arxiv.org/abs/2408.02691
作者: Chu Zhao,Enneng Yang,Yuliang Liang,Jianzhe Zhao,Guibing Guo,Xingwei Wang
关键词-EN: Graph Contrastive Learning, leverages data augmentation, Graph Contrastive, produce contrasting views, data augmentation techniques
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注: 24 pages, submitted to TOIS

点击查看摘要

Abstract:Graph Contrastive Learning (GCL) leverages data augmentation techniques to produce contrasting views, enhancing the accuracy of recommendation systems through learning the consistency between contrastive views. However, existing augmentation methods, such as directly perturbing interaction graph (e.g., node/edge dropout), may interfere with the original connections and generate poor contrasting views, resulting in sub-optimal performance. In this paper, we define the views that share only a small amount of information with the original graph due to poor data augmentation as noisy views (i.e., the last 20% of the views with a cosine similarity value less than 0.1 to the original view). We demonstrate through detailed experiments that noisy views will significantly degrade recommendation performance. Further, we propose a model-agnostic Symmetric Graph Contrastive Learning (SGCL) method with theoretical guarantees to address this issue. Specifically, we introduce symmetry theory into graph contrastive learning, based on which we propose a symmetric form and contrast loss resistant to noisy interference. We provide theoretical proof that our proposed SGCL method has a high tolerance to noisy views. Further demonstration is given by conducting extensive experiments on three real-world datasets. The experimental results demonstrate that our approach substantially increases recommendation accuracy, with relative improvements reaching as high as 12.25% over nine other competing models. These results highlight the efficacy of our method.
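
The operational definition of a noisy view quoted above (cosine similarity to the original view below 0.1) is straightforward to check; the sketch below does exactly that on placeholder embeddings and is meant only to illustrate the criterion, not the SGCL training procedure.

```python
import torch
import torch.nn.functional as F

def flag_noisy_views(orig_emb, view_emb, threshold=0.1):
    # Flag augmented views whose cosine similarity to the original graph's embedding
    # falls below the threshold (the working definition of a noisy view above).
    sim = F.cosine_similarity(orig_emb, view_emb, dim=-1)
    return sim < threshold, sim

# Toy usage with random embeddings standing in for graph/view representations
orig = torch.randn(8, 64)
views = orig + 3.0 * torch.randn(8, 64)   # heavy perturbation to mimic poor augmentation
noisy, sims = flag_noisy_views(orig, views)
print(int(noisy.sum()), "of", len(sims), "views flagged as noisy")
```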

[AI-60] Spatio-Temporal Partial Sensing Forecast for Long-term Traffic

链接: https://arxiv.org/abs/2408.02689
作者: Zibo Liu,Zhe Jiang,Zelin Xu,Tingsong Xiao,Zhengkun Xiao,Haibo Wang,Shigang Chen
关键词-EN: recent measurements, installed at chosen, Traffic, locations, partial sensing
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Traffic forecasting uses recent measurements by sensors installed at chosen locations to forecast the future road traffic. Existing work either assumes all locations are equipped with sensors or focuses on short-term forecast. This paper studies partial sensing traffic forecast of long-term traffic, assuming sensors only at some locations. The study is important in lowering the infrastructure investment cost in traffic management since deploying sensors at all locations could incur prohibitively high cost. However, the problem is challenging due to the unknown distribution at unsensed locations, the intricate spatio-temporal correlation in long-term forecasting, as well as noise in data and irregularities in traffic patterns (e.g., road closure). We propose a Spatio-Temporal Partial Sensing (STPS) forecast model for long-term traffic prediction, with several novel contributions, including a rank-based embedding technique to capture irregularities and overcome noise, a spatial transfer matrix to overcome the spatial distribution shift from permanently sensed locations to unsensed locations, and a multi-step training process that utilizes all available data to successively refine the model parameters for better accuracy. Extensive experiments on several real-world traffic datasets demonstrate that STPS outperforms the state-of-the-art and achieves superior accuracy in partial sensing long-term forecasting.

[AI-61] A Systematic Review of Intermediate Fusion in Multimodal Deep Learning for Biomedical Applications

链接: https://arxiv.org/abs/2408.02686
作者: Valerio Guarrasi,Fatih Aksu,Camillo Maria Caruso,Francesco Di Feola,Aurora Rofena,Filippo Ruffini,Paolo Soda
关键词-EN: Deep learning, handle complex, intermediate fusion methods, high-dimensional data, Multimodal deep learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Deep learning has revolutionized biomedical research by providing sophisticated methods to handle complex, high-dimensional data. Multimodal deep learning (MDL) further enhances this capability by integrating diverse data types such as imaging, textual data, and genetic information, leading to more robust and accurate predictive models. In MDL, differently from early and late fusion methods, intermediate fusion stands out for its ability to effectively combine modality-specific features during the learning process. This systematic review aims to comprehensively analyze and formalize current intermediate fusion methods in biomedical applications. We investigate the techniques employed, the challenges faced, and potential future directions for advancing intermediate fusion methods. Additionally, we introduce a structured notation to enhance the understanding and application of these methods beyond the biomedical domain. Our findings are intended to support researchers, healthcare professionals, and the broader deep learning community in developing more sophisticated and insightful multimodal models. Through this review, we aim to provide a foundational framework for future research and practical applications in the dynamic field of MDL.

[AI-62] Artificial Neural Networks for Photonic Applications: From Algorithms to Implementation

链接: https://arxiv.org/abs/2408.02685
作者: Pedro Freire,Egor Manuylovich,Jaroslaw E. Prilepsky,Sergei K. Turitsy
关键词-EN: artificial neural networks, neural networks, broad audience, applied mathematics, targets a broad
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP); Optics (physics.optics)
*备注:

点击查看摘要

Abstract:This tutorial-review on applications of artificial neural networks in photonics targets a broad audience, ranging from optical research and engineering communities to computer science and applied mathematics. We focus here on the research areas at the interface between these disciplines, attempting to find the right balance between technical details specific to each domain and overall clarity. First, we briefly recall key properties and peculiarities of some core neural network types, which we believe are the most relevant to photonics, also linking the layer’s theoretical design to some photonics hardware realizations. After that, we elucidate the question of how to fine-tune the selected model’s design to perform the required task with optimized accuracy. Then, in the review part, we discuss recent developments and progress for several selected applications of neural networks in photonics, including multiple aspects relevant to optical communications, imaging, sensing, and the design of new materials and lasers. In the following section, we put a special emphasis on how to accurately evaluate the complexity of neural networks in the context of the transition from algorithms to hardware implementation. The introduced complexity characteristics are used to analyze the applications of neural networks in optical communications, as a specific, albeit highly important example, comparing those with some benchmark signal processing methods. We combine the description of the well-known model compression strategies used in machine learning, with some novel techniques introduced recently in optical applications of neural networks. It is important to stress that although our focus in this tutorial-review is on photonics, we believe that the methods and techniques presented here can be handy in a much wider range of scientific and engineering applications.

[AI-63] Recording First-person Experiences to Build a New Type of Foundation Model

链接: https://arxiv.org/abs/2408.02680
作者: Dionis Barcari,David Gamez,Aliya Grig
关键词-EN: Foundation models, current AI boom, big impact, impact in recent, recent years
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: 5 pages, 5 figures, 3 tables. arXiv admin note: substantial text overlap with arXiv:2408.00030

点击查看摘要

Abstract:Foundation models have had a big impact in recent years and billions of dollars are being invested in them in the current AI boom. The more popular ones, such as Chat-GPT, are trained on large amounts of Internet data. However, it is becoming apparent that this data is likely to be exhausted soon, and technology companies are looking for new sources of data to train the next generation of foundation models. Reinforcement learning, RAG, prompt engineering and cognitive modelling are often used to fine-tune and augment the behaviour of foundation models. These techniques have been used to replicate people, such as Caryn Marjorie. These chatbots are not based on people’s actual emotional and physiological responses to their environment, so they are, at best, a surface-level approximation to the characters they are imitating. To address these issues, we have developed a recording rig that captures what the wearer is seeing and hearing as well as their skin conductance (GSR), facial expression and brain state (14 channel EEG). AI algorithms are used to process this data into a rich picture of the environment and internal states of the subject. Foundation models trained on this data could replicate human behaviour much more accurately than the personality models that have been developed so far. This type of model has many potential applications, including recommendation, personal assistance, GAN systems, dating and recruitment. This paper gives some background to this work and describes the recording rig and preliminary tests of its functionality. It then suggests how a new type of foundation model could be created from the data captured by the rig and outlines some applications. Data gathering and model training are expensive, so we are currently working on the launch of a start-up that could raise funds for the next stage of the project.

[AI-64] Patient-centered data science: an integrative framework for evaluating and predicting clinical outcomes in the digital health era

链接: https://arxiv.org/abs/2408.02677
作者: Mohsen Amoei,Dan Poenaru
关键词-EN: patient-centered data science, digital health era, study proposes, health era, patient-centered data
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:This study proposes a novel, integrative framework for patient-centered data science in the digital health era. We developed a multidimensional model that combines traditional clinical data with patient-reported outcomes, social determinants of health, and multi-omic data to create comprehensive digital patient representations. Our framework employs a multi-agent artificial intelligence approach, utilizing various machine learning techniques including large language models, to analyze complex, longitudinal datasets. The model aims to optimize multiple patient outcomes simultaneously while addressing biases and ensuring generalizability. We demonstrate how this framework can be implemented to create a learning healthcare system that continuously refines strategies for optimal patient care. This approach has the potential to significantly improve the translation of digital health innovations into real-world clinical benefits, addressing current limitations in AI-driven healthcare models.

[AI-65] On Biases in a UK Biobank-based Retinal Image Classification Model MICCAI

链接: https://arxiv.org/abs/2408.02676
作者: Anissa Alloula,Rima Mustafa,Daniel R McGowan,Bartłomiej W. Papież
关键词-EN: Recent work, uncovered alarming disparities, machine learning models, work has uncovered, uncovered alarming
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Image and Video Processing (eess.IV)
*备注: To appear at MICCAI FAIMI Workshop 2024

点击查看摘要

Abstract:Recent work has uncovered alarming disparities in the performance of machine learning models in healthcare. In this study, we explore whether such disparities are present in the UK Biobank fundus retinal images by training and evaluating a disease classification model on these images. We assess possible disparities across various population groups and find substantial differences despite strong overall performance of the model. In particular, we discover unfair performance for certain assessment centres, which is surprising given the rigorous data standardisation protocol. We compare how these differences emerge and apply a range of existing bias mitigation methods to each one. A key insight is that each disparity has unique properties and responds differently to the mitigation methods. We also find that these methods are largely unable to enhance fairness, highlighting the need for better bias mitigation methods tailored to the specific type of bias.

[AI-66] Dynamic Language Group-Based MoE: Enhancing Efficiency and Flexibility for Code-Switching Speech Recognition

链接: https://arxiv.org/abs/2407.18581
作者: Hukai Huang,Shenghui Lu,Yahui Shan,He Qu,Wenhao Guan,Qingyang Hong,Lin Li
关键词-EN: approach is ideally, multilingual and code-switching, challenges due, multi-expert architecture, ideally suited
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The Mixture of Experts (MoE) approach is ideally suited for tackling multilingual and code-switching (CS) challenges due to its multi-expert architecture. This work introduces the DLG-MoE, which is optimized for bilingual and CS scenarios. Our novel Dynamic Language Group-based MoE layer features a language router with shared weights for explicit language modeling, while independent unsupervised routers within the language group handle attributes beyond language. This structure not only enhances expert extension capabilities but also supports dynamic top-k training, allowing for flexible inference across various top-k values and improving overall performance. The model requires no pre-training and supports streaming recognition, achieving state-of-the-art (SOTA) results with unmatched flexibility compared to other methods. The Code will be released.
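
The routing structure described above (an explicit language router on top, independent routers inside each language group, and a top-k that can change at inference time) can be pictured with the toy module below. Group sizes, expert widths and the hard group assignment are my own simplifications for illustration, not the DLG-MoE architecture itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedTopKRouter(nn.Module):
    # Two-level routing sketch: a language router picks a group, then an in-group
    # router mixes that group's experts with top-k gating (k adjustable at inference).
    def __init__(self, dim, experts_per_group=4, n_groups=2):
        super().__init__()
        self.lang_router = nn.Linear(dim, n_groups)  # explicit language modeling
        self.group_routers = nn.ModuleList(nn.Linear(dim, experts_per_group) for _ in range(n_groups))
        self.experts = nn.ModuleList(
            nn.ModuleList(nn.Linear(dim, dim) for _ in range(experts_per_group)) for _ in range(n_groups))

    def forward(self, x, k=2):
        g = self.lang_router(x).argmax(dim=-1)       # hard language-group choice per token
        out = torch.zeros_like(x)
        for gi in range(len(self.experts)):
            mask = g == gi
            if not mask.any():
                continue
            xi = x[mask]
            gate = F.softmax(self.group_routers[gi](xi), dim=-1)
            topv, topi = gate.topk(k, dim=-1)
            topv = topv / topv.sum(dim=-1, keepdim=True)   # renormalize the selected gates
            yi = torch.zeros_like(xi)
            for slot in range(k):
                for e, expert in enumerate(self.experts[gi]):
                    sel = topi[:, slot] == e
                    if sel.any():
                        yi[sel] += topv[sel, slot].unsqueeze(-1) * expert(xi[sel])
            out[mask] = yi
        return out

router = GroupedTopKRouter(dim=16)
print(router(torch.randn(5, 16), k=2).shape)   # torch.Size([5, 16]); k can be changed at inference
```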

[AI-67] DrawTalking: Building Interactive Worlds by Sketching and Speaking ATC

链接: https://arxiv.org/abs/2401.05631
作者: Karl Toby Rosenberg,Rubaiat Habib Kazi,Li-Yi Wei,Haijun Xia,Ken Perlin
关键词-EN: controlling interactive worlds, introduce DrawTalking, telling stories, approach to building, building and controlling
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Emerging Technologies (cs.ET); Graphics (cs.GR)
*备注: 25 pages, 27 figures; Matching version accepted at UIST 2024

点击查看摘要

Abstract:We introduce DrawTalking, an approach to building and controlling interactive worlds by sketching and speaking while telling stories. It emphasizes user control and flexibility, and gives programming-like capability without requiring code. An early open-ended study with our prototype shows that the mechanics resonate and are applicable to many creative-exploratory use cases, with the potential to inspire and inform research in future natural interfaces for creative exploration and authoring.

[AI-68] QADQN: Quantum Attention Deep Q-Network for Financial Market Prediction

链接: https://arxiv.org/abs/2408.03088
作者: Siddhant Dutta,Nouhaila Innan,Alberto Marchisio,Sadok Ben Yahia,Muhammad Shafique
关键词-EN: optimal trading strategy, trading strategy development, strategy development remain, development remain challenging, remain challenging due
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted at the 2024 IEEE International Conference on Quantum Computing and Engineering (QCE24), QCRL, September 2024

点击查看摘要

Abstract:Financial market prediction and optimal trading strategy development remain challenging due to market complexity and volatility. Our research in quantum finance and reinforcement learning for decision-making demonstrates the approach of quantum-classical hybrid algorithms to tackling real-world financial challenges. In this respect, we corroborate the concept with rigorous backtesting and validate the framework’s performance under realistic market conditions, by including fixed transaction cost per trade. This paper introduces a Quantum Attention Deep Q-Network (QADQN) approach to address these challenges through quantum-enhanced reinforcement learning. Our QADQN architecture uses a variational quantum circuit inside a traditional deep Q-learning framework to take advantage of possible quantum advantages in decision-making. We gauge the QADQN agent’s performance on historical data from major market indices, including the S&P 500. We evaluate the agent’s learning process by examining its reward accumulation and the effectiveness of its experience replay mechanism. Our empirical results demonstrate the QADQN’s superior performance, achieving better risk-adjusted returns with Sortino ratios of 1.28 and 1.19 for non-overlapping and overlapping test periods respectively, indicating effective downside risk management.
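
Since results are reported as Sortino ratios, a small reference implementation of the standard definition (mean excess return over downside deviation, annualized) may help; the target rate and annualization convention here are common defaults and may not match the paper's exact setup.

```python
import numpy as np

def sortino_ratio(returns, target=0.0, periods_per_year=252):
    # Annualized Sortino ratio: mean excess return over the downside deviation.
    r = np.asarray(returns, dtype=float)
    excess = r - target
    downside = np.minimum(excess, 0.0)
    downside_dev = np.sqrt(np.mean(downside ** 2))
    if downside_dev == 0:
        return np.inf
    return np.sqrt(periods_per_year) * excess.mean() / downside_dev

# Example with synthetic daily returns
rng = np.random.default_rng(0)
print(round(sortino_ratio(rng.normal(0.0005, 0.01, 252)), 2))
```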

[AI-69] LLM-Empowered Resource Allocation in Wireless Communications Systems

链接: https://arxiv.org/abs/2408.02944
作者: Woongsup Lee,Jeonghun Park
关键词-EN: large language models, wireless communication systems, language models, resource allocation, recent success
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注: submitted to possible IEEE journal

点击查看摘要

Abstract:The recent success of large language models (LLMs) has spurred their application in various fields. In particular, there have been efforts to integrate LLMs into various aspects of wireless communication systems. The use of LLMs in wireless communication systems has the potential to realize artificial general intelligence (AGI)-enabled wireless networks. In this paper, we investigate an LLM-based resource allocation scheme for wireless communication systems. Specifically, we formulate a simple resource allocation problem involving two transmit pairs and develop an LLM-based resource allocation approach that aims to maximize either energy efficiency or spectral efficiency. Additionally, we consider the joint use of low-complexity resource allocation techniques to compensate for the reliability shortcomings of the LLM-based scheme. After confirming the applicability and feasibility of LLM-based resource allocation, we address several key technical challenges that remain in applying LLMs in practice.

[AI-70] VisionUnite: A Vision-Language Foundation Model for Ophthalmology Enhanced with Clinical Knowledge

链接: https://arxiv.org/abs/2408.02865
作者: Zihan Li,Diping Song,Zefeng Yang,Deming Wang,Fei Li,Xiulan Zhang,Paul E. Kinahan,Yu Qiao
关键词-EN: improved diagnostic methods, advanced equipment, developed regions, regions with limited, limited access
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The need for improved diagnostic methods in ophthalmology is acute, especially in the less developed regions with limited access to specialists and advanced equipment. Therefore, we introduce VisionUnite, a novel vision-language foundation model for ophthalmology enhanced with clinical knowledge. VisionUnite has been pretrained on an extensive dataset comprising 1.24 million image-text pairs, and further refined using our proposed MMFundus dataset, which includes 296,379 high-quality fundus image-text pairs and 889,137 simulated doctor-patient dialogue instances. Our experiments indicate that VisionUnite outperforms existing generative foundation models such as GPT-4V and Gemini Pro. It also demonstrates diagnostic capabilities comparable to junior ophthalmologists. VisionUnite performs well in various clinical scenarios including open-ended multi-disease diagnosis, clinical explanation, and patient interaction, making it a highly versatile tool for initial ophthalmic disease screening. VisionUnite can also serve as an educational aid for junior ophthalmologists, accelerating their acquisition of knowledge regarding both common and rare ophthalmic conditions. VisionUnite represents a significant advancement in ophthalmology, with broad implications for diagnostics, medical education, and understanding of disease mechanisms.

[AI-71] Multistain Pretraining for Slide Representation Learning in Pathology ECCV’24

链接: https://arxiv.org/abs/2408.02859
作者: Guillaume Jaume,Anurag Vaidya,Andrew Zhang,Andrew H. Song,Richard J. Chen,Sharifa Sahai,Dandan Mo,Emilio Madrigal,Long Phi Le,Faisal Mahmood
关键词-EN: Developing self-supervised learning, Developing self-supervised, gigapixel whole-slide images, computational pathology, learn universal
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: ECCV’24

点击查看摘要

Abstract:Developing self-supervised learning (SSL) models that can learn universal and transferable representations of H&E gigapixel whole-slide images (WSIs) is becoming increasingly valuable in computational pathology. These models hold the potential to advance critical tasks such as few-shot classification, slide retrieval, and patient stratification. Existing approaches for slide representation learning extend the principles of SSL from small images (e.g., 224 x 224 patches) to entire slides, usually by aligning two different augmentations (or views) of the slide. Yet the resulting representation remains constrained by the limited clinical and biological diversity of the views. Instead, we postulate that slides stained with multiple markers, such as immunohistochemistry, can be used as different views to form a rich task-agnostic training signal. To this end, we introduce Madeleine, a multimodal pretraining strategy for slide representation learning. Madeleine is trained with a dual global-local cross-stain alignment objective on large cohorts of breast cancer samples (N=4,211 WSIs across five stains) and kidney transplant samples (N=12,070 WSIs across four stains). We demonstrate the quality of slide representations learned by Madeleine on various downstream evaluations, ranging from morphological and molecular classification to prognostic prediction, comprising 21 tasks using 7,299 WSIs from multiple medical centers. Code is available at this https URL.

[AI-72] Training a multilayer dynamical spintronic network with standard machine learning tools to perform time series classification

链接: https://arxiv.org/abs/2408.02835
作者: Erwan Plouet,Dédalo Sanz-Hernández,Aymeric Vecchiola,Julie Grollier,Frank Mizrahi
关键词-EN: low energy cost, ability to process, process time-series, time-series at low, low energy
类目: Disordered Systems and Neural Networks (cond-mat.dis-nn); Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 7 pages, 4 figures

点击查看摘要

Abstract:The ability to process time-series at low energy cost is critical for many applications. Recurrent neural networks, which can perform such tasks, are computationally expensive when implemented in software on conventional computers. Here we propose to implement a recurrent neural network in hardware using spintronic oscillators as dynamical neurons. Using numerical simulations, we build a multi-layer network and demonstrate that we can use backpropagation through time (BPTT) and standard machine learning tools to train this network. Leveraging the transient dynamics of the spintronic oscillators, we solve the sequential digits classification task with 89.83±2.91% accuracy, as good as the equivalent software network. We devise guidelines on how to choose the time constant of the oscillators as well as hyper-parameters of the network to adapt to different input time scales.

[AI-73] A Review on Organ Deformation Modeling Approaches for Reliable Surgical Navigation using Augmented Reality

链接: https://arxiv.org/abs/2408.02713
作者: Zheng Han,Qi Dou
关键词-EN: revolutionize surgical procedures, visualize critical structures, patient body, organ deformation, revolutionize surgical
类目: Medical Physics (physics.med-ph); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Augmented Reality (AR) holds the potential to revolutionize surgical procedures by allowing surgeons to visualize critical structures within the patient’s body. This is achieved through superimposing preoperative organ models onto the actual anatomy. Challenges arise from dynamic deformations of organs during surgery, making preoperative models inadequate for faithfully representing intraoperative anatomy. To enable reliable navigation in augmented surgery, modeling of intraoperative deformation to obtain an accurate alignment of the preoperative organ model with the intraoperative anatomy is indispensable. Despite the existence of various methods proposed to model intraoperative organ deformation, there are still few literature reviews that systematically categorize and summarize these approaches. This review aims to fill this gap by providing a comprehensive and technical-oriented overview of modeling methods for intraoperative organ deformation in augmented reality in surgery. Through a systematic search and screening process, 112 closely relevant papers were included in this review. By presenting the current status of organ deformation modeling methods and their clinical applications, this review seeks to enhance the understanding of organ deformation modeling in AR-guided surgery, and discuss the potential topics for future advancements.

[AI-74] Inventory problems and the parametric measure m_lambda

链接: https://arxiv.org/abs/2408.02700
作者: Irina Georgescu
关键词-EN: credibility measure, credibility theory, lambda, credibility, measure
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Theoretical Economics (econ.TH)
*备注:

点击查看摘要

Abstract:The credibility theory was introduced by B. Liu as a new way to describe the fuzzy uncertainty. The credibility measure is the fundamental notion of the credibility theory. Recently, L.Yang and K. Iwamura extended the credibility measure by defining the parametric measure m_\lambda ( \lambda is a real parameter in the interval [0,1] and for \lambda= 1/2 we obtain as a particular case the notion of credibility measure). By using the m_\lambda -measure, we studied in this paper a risk neutral multi-item inventory problem. Our construction generalizes the credibilistic inventory model developed by Y. Li and Y. Liu in 2019. In our model, the components of demand vector are fuzzy variables and the maximization problem is formulated by using the notion of m_\lambda -expected value. We shall prove a general formula for the solution of optimization problem, from which we obtained effective formulas for computing the optimal solutions in the particular cases where the demands are trapezoidal and triangular fuzzy numbers. For \lambda=1/2 we obtain as a particular case the computation formulas of the optimal solutions of the credibilistic inventory problem of Li and Liu. These computation formulas are applied for some m_\lambda -models obtained from numerical data.
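
For readers who have not met the parametric measure before, the standard construction from the possibility (Pos) and necessity (Nec) measures, which I assume is the one intended here, is shown below; λ = 1/2 recovers the credibility measure, and the associated expected value is defined Choquet-style.

```latex
% Parametric measure (assumed standard definition):
m_{\lambda}(A) = \lambda\,\mathrm{Pos}(A) + (1-\lambda)\,\mathrm{Nec}(A),
  \qquad \lambda \in [0,1], \qquad m_{1/2} = \mathrm{Cr}\ \text{(credibility)} .
% Associated expected value of a fuzzy variable \xi:
E_{\lambda}[\xi] = \int_{0}^{+\infty} m_{\lambda}(\xi \ge r)\,\mathrm{d}r
  - \int_{-\infty}^{0} m_{\lambda}(\xi \le r)\,\mathrm{d}r .
```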

[AI-75] Diff-PIC: Revolutionizing Particle-In-Cell Simulation for Advancing Nuclear Fusion with Diffusion Models

链接: https://arxiv.org/abs/2408.02693
作者: Chuan Liu,Chunshu Wu,Mingkai Chen,James Chenhao Liang,Ang Li,Michael Huang,Chuang Ren,Dongfang Liu,Ying Nian Wu,Tong Geng
关键词-EN: harnessing energy extracted, drawing significant attention, Sustainable energy, Laser-Plasma Interaction, fusion ignition underscore
类目: Computational Physics (physics.comp-ph); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Sustainable energy is a crucial global challenge, and recent breakthroughs in nuclear fusion ignition underscore the potential of harnessing energy extracted from nuclear fusion in everyday life, thereby drawing significant attention to fusion ignition research, especially Laser-Plasma Interaction (LPI). Unfortunately, the complexity of LPI at ignition scale renders theory-based analysis nearly impossible – instead, it has to rely heavily on Particle-in-Cell (PIC) simulations, which is extremely computationally intensive, making it a major bottleneck in advancing fusion ignition. In response, this work introduces Diff-PIC, a novel paradigm that leverages conditional diffusion models as a computationally efficient alternative to PIC simulations for generating high-fidelity scientific data. Specifically, we design a distillation paradigm to distill the physical patterns captured by PIC simulations into diffusion models, demonstrating both theoretical and practical feasibility. Moreover, to ensure practical effectiveness, we provide solutions for two critical challenges: (1) We develop a physically-informed conditional diffusion model that can learn and generate meaningful embeddings for mathematically continuous physical conditions. This model offers algorithmic generalization and adaptable transferability, effectively capturing the complex relationships between physical conditions and simulation outcomes; and (2) We employ the rectified flow technique to make our model a one-step conditional diffusion model, enhancing its efficiency further while maintaining high fidelity and physical validity. Diff-PIC establishes a new paradigm for using diffusion models to overcome the computational barriers in nuclear fusion research, setting a benchmark for future innovations and advancements in this field.

计算机视觉

[CV-0] LLaVA-OneVision: Easy Visual Task Transfer

链接: https://arxiv.org/abs/2408.03326
作者: Bo Li,Yuanhan Zhang,Dong Guo,Renrui Zhang,Feng Li,Hao Zhang,Kaichen Zhang,Yanwei Li,Ziwei Liu,Chunyuan Li
关键词-EN: LLaVA-NeXT blog series, open large multimodal, large multimodal models, developed by consolidating, insights into data
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: Project Homepage: this https URL

点击查看摘要

Abstract:We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations in the LLaVA-NeXT blog series. Our experimental results demonstrate that LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video scenarios. Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities/scenarios, yielding new emerging capabilities. In particular, strong video understanding and cross-scenario capabilities are demonstrated through task transfer from images to videos.

[CV-1] MDT-A2G: Exploring Masked Diffusion Transformers for Co-Speech Gesture Generation

链接: https://arxiv.org/abs/2408.03312
作者: Xiaofeng Mao,Zhengkai Jiang,Qilin Wang,Chencan Fu,Jiangning Zhang,Jiafu Wu,Yabiao Wang,Chengjie Wang,Wei Li,Mingmin Chi
关键词-EN: Masked Diffusion Transformer, Convolutional Neural Network, Recent advancements, Diffusion Transformers, co-speech gesture generation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recent advancements in the field of Diffusion Transformers have substantially improved the generation of high-quality 2D images, 3D videos, and 3D shapes. However, the effectiveness of the Transformer architecture in the domain of co-speech gesture generation remains relatively unexplored, as prior methodologies have predominantly employed convolutional neural networks (CNNs) or only a few simple transformer layers. In an attempt to bridge this research gap, we introduce a novel Masked Diffusion Transformer for co-speech gesture generation, referred to as MDT-A2G, which directly implements the denoising process on gesture sequences. To enhance the contextual reasoning capability of temporally aligned speech-driven gestures, this model employs a mask modeling scheme specifically designed to strengthen temporal relation learning among sequence gestures, thereby expediting the learning process and leading to coherent and realistic motions. Apart from audio, our MDT-A2G model also integrates multi-modal information, encompassing text, emotion, and identity. Furthermore, we propose an efficient inference strategy that diminishes the denoising computation by leveraging previously calculated results, thereby achieving a speedup with negligible performance degradation. Experimental results demonstrate that MDT-A2G excels in gesture generation, boasting a learning speed that is over 6 \times faster than traditional diffusion transformers and an inference speed that is 5.7 \times faster than the standard diffusion model.

[CV-2] Fusing Forces: Deep-Human-Guided Refinement of Segmentation Masks ICPR2024

链接: https://arxiv.org/abs/2408.03304
作者: Rafael Sterzinger,Christian Stippel,Robert Sablatnig
关键词-EN: elaborate figurative illustrations, figurative illustrations featured, characterized by elaborate, Etruscan mirrors constitute, constitute a significant
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: 16 pages, accepted at ICPR2024

点击查看摘要

Abstract:Etruscan mirrors constitute a significant category in Etruscan art, characterized by elaborate figurative illustrations featured on their backside. A laborious and costly aspect of their analysis and documentation is the task of manually tracing these illustrations. In previous work, a methodology has been proposed to automate this process, involving photometric-stereo scanning in combination with deep neural networks. While achieving quantitative performance akin to an expert annotator, some results still lack qualitative precision and, thus, require annotators for inspection and potential correction, maintaining resource intensity. In response, we propose a deep neural network trained to interactively refine existing annotations based on human guidance. Our human-in-the-loop approach streamlines annotation, achieving equal quality with up to 75% less manual input required. Moreover, during the refinement process, the relative improvement of our methodology over pure manual labeling reaches peak values of up to 26%, attaining drastically better quality quicker. By being tailored to the complex task of segmenting intricate lines, specifically distinguishing it from previous methods, our approach offers drastic improvements in efficacy, transferable to a broad spectrum of applications beyond Etruscan mirrors.

[CV-3] TextIM: Part-aware Interactive Motion Synthesis from Text

链接: https://arxiv.org/abs/2408.03302
作者: Siyuan Fan,Bo Du,Xiantao Cai,Bo Peng,Longling Sun
关键词-EN: synthesizing TEXT-driven human, Interactive Motions, human Interactive Motions, TEXT-driven human Interactive, synthesizing TEXT-driven
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In this work, we propose TextIM, a novel framework for synthesizing TEXT-driven human Interactive Motions, with a focus on the precise alignment of part-level semantics. Existing methods often overlook the critical roles of interactive body parts and fail to adequately capture and align part-level semantics, resulting in inaccuracies and even erroneous movement outcomes. To address these issues, TextIM utilizes a decoupled conditional diffusion framework to enhance the detailed alignment between interactive movements and corresponding semantic intents from textual descriptions. Our approach leverages large language models, functioning as a human brain, to identify interacting human body parts and to comprehend interaction semantics to generate complicated and subtle interactive motion. Guided by the refined movements of the interacting parts, TextIM further extends these movements into a coherent whole-body motion. We design a spatial coherence module to complement the entire body movements while maintaining consistency and harmony across body parts using a part graph convolutional network. For training and evaluation, we carefully selected and re-labeled interactive motions from HUMANML3D to develop a specialized dataset. Experimental results demonstrate that TextIM produces semantically accurate human interactive motions, significantly enhancing the realism and applicability of synthesized interactive motions in diverse scenarios, even including interactions with deformable and dynamically changing objects.

[CV-4] DopQ-ViT: Towards Distribution-Friendly and Outlier-Aware Post-Training Quantization for Vision Transformers

链接: https://arxiv.org/abs/2408.03291
作者: Lianwei Yang,Haisong Gong
关键词-EN: hinder widespread adoption, high computational cost, garnered significant attention, significant latency issues, Vision transformers
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Vision transformers (ViTs) have garnered significant attention for their performance in vision tasks; however, the high computational cost and significant latency issues have hindered widespread adoption. Post-training quantization (PTQ), a promising method for model compression, still faces accuracy degradation challenges with ViTs. There are two reasons for this: the existing quantization paradigm does not fit the power-law distribution of post-Softmax activations well, and accuracy inevitably decreases after reparameterizing post-LayerNorm activations. We propose a Distribution-Friendly and Outlier-Aware Post-training Quantization method for Vision Transformers, named DopQ-ViT. DopQ-ViT analyzes the inefficiencies of current quantizers and introduces a distribution-friendly Tan Quantizer called TanQ. TanQ focuses more on values near 1, more accurately preserving the power-law distribution of post-Softmax activations, and achieves favorable results. Moreover, when reparameterizing post-LayerNorm activations from channel-wise to layer-wise quantization, the accuracy degradation is mainly due to the significant impact of outliers in the scaling factors. Therefore, DopQ-ViT proposes a method to Search for the Optimal Scaling Factor, denoted as SOSF, which compensates for the influence of outliers and preserves the performance of the quantization model. DopQ-ViT has undergone extensive validation and demonstrates significant performance improvements in quantization models, particularly in low-bit settings.
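
The abstract does not spell out TanQ's exact form, so the snippet below is only an illustrative non-uniform quantizer built from a tan-shaped companding curve that concentrates levels near 1; the name `tan_quantize` and the parameter `alpha` are made up for the example and are not the paper's method.

```python
import torch

def tan_quantize(x: torch.Tensor, n_bits: int = 4, alpha: float = 0.99) -> torch.Tensor:
    """Illustrative non-uniform quantizer for post-Softmax activations in [0, 1].

    NOT the paper's TanQ: the tan-shaped companding curve merely shows how
    quantization levels can be concentrated near 1, where the few large
    attention probabilities of a power-law distribution live.
    """
    n_levels = 2 ** n_bits - 1
    half_pi = alpha * torch.pi / 2                         # stay strictly inside tan's domain
    scale = torch.tan(torch.tensor(half_pi))
    warped = torch.tan(x.clamp(0, 1) * half_pi) / scale    # [0,1] -> [0,1], steep near 1
    dequant = torch.round(warped * n_levels) / n_levels    # uniform grid in the warped domain
    return torch.atan(dequant * scale) / half_pi           # map back to the original domain
```

Because the warp has a large slope near 1, a uniform grid in the warped domain corresponds to finer steps near 1 in the original domain, which is the behaviour the abstract attributes to TanQ.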

[CV-5] Biomedical SAM 2: Segment Anything in Biomedical Images and Videos

链接: https://arxiv.org/abs/2408.03286
作者: Zhiling Yan,Weixiang Sun,Rong Zhou,Zhengqing Yuan,Kai Zhang,Yiwei Li,Tianming Liu,Quanzheng Li,Xiang Li,Lifang He,Lichao Sun
关键词-EN: measuring biological structures, biological structures, video object segmentation, essential for diagnosing, diagnosing and analyzing
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Medical image segmentation and video object segmentation are essential for diagnosing and analyzing diseases by identifying and measuring biological structures. Recent advances in natural domain have been driven by foundation models like the Segment Anything Model 2 (SAM 2). To explore the performance of SAM 2 in biomedical applications, we designed two evaluation pipelines for single-frame image segmentation and multi-frame video segmentation with varied prompt designs, revealing SAM 2’s limitations in medical contexts. Consequently, we developed BioSAM 2, an enhanced foundation model optimized for biomedical data based on SAM 2. Our experiments show that BioSAM 2 not only surpasses the performance of existing state-of-the-art foundation models but also matches or even exceeds specialist models, demonstrating its efficacy and potential in the medical domain.

[CV-6] ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer ECCV

链接: https://arxiv.org/abs/2408.03284
作者: Jiazhi Guan,Zhiliang Xu,Hang Zhou,Kaisiyuan Wang,Shengyi He,Zhanwang Zhang,Borong Liang,Haocheng Feng,Errui Ding,Jingtuo Liu,Jingdong Wang,Youjian Zhao,Ziwei Liu
关键词-EN: require long-term videos, applications including, retain visible artifacts, virtual presenters, Abstract
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Multimedia (cs.MM)
*备注: Accepted to European Conference on Computer Vision (ECCV), 2024. Project page: this https URL

点击查看摘要

Abstract:Lip-syncing videos with given audio is the foundation for various applications including the creation of virtual presenters or performers. While recent studies explore high-fidelity lip-sync with different techniques, their task-orientated models either require long-term videos for clip-specific training or retain visible artifacts. In this paper, we propose a unified and effective framework ReSyncer, that synchronizes generalized audio-visual facial information. The key design is revisiting and rewiring the Style-based generator to efficiently adopt 3D facial dynamics predicted by a principled style-injected Transformer. By simply re-configuring the information insertion mechanisms within the noise and style space, our framework fuses motion and appearance with unified training. Extensive experiments demonstrate that ReSyncer not only produces high-fidelity lip-synced videos according to audio, but also supports multiple appealing properties that are suitable for creating virtual presenters and performers, including fast personalized fine-tuning, video-driven lip-syncing, the transfer of speaking styles, and even face swapping. Resources can be found at this https URL.

[CV-7] AMES: Asymmetric and Memory-Efficient Similarity Estimation for Instance-level Retrieval ECCV2024

链接: https://arxiv.org/abs/2408.03282
作者: Pavel Suma,Giorgos Kordopatis-Zilos,Ahmet Iscen,Giorgos Tolias
关键词-EN: instance-level image retrieval, image retrieval re-ranking, limit memory usage, ultimately aiming, investigates the problem
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ECCV 2024

点击查看摘要

Abstract:This work investigates the problem of instance-level image retrieval re-ranking with the constraint of memory efficiency, ultimately aiming to limit memory usage to 1KB per image. Departing from the prevalent focus on performance enhancements, this work prioritizes the crucial trade-off between performance and memory requirements. The proposed model uses a transformer-based architecture designed to estimate image-to-image similarity by capturing interactions within and across images based on their local descriptors. A distinctive property of the model is the capability for asymmetric similarity estimation. Database images are represented with a smaller number of descriptors compared to query images, enabling performance improvements without increasing memory consumption. To ensure adaptability across different applications, a universal model is introduced that adjusts to a varying number of local descriptors during the testing phase. Results on standard benchmarks demonstrate the superiority of our approach over both hand-crafted and learned models. In particular, compared with current state-of-the-art methods that overlook their memory footprint, our approach not only attains superior performance but does so with a significantly reduced memory footprint. The code and pretrained models are publicly available at: this https URL

[CV-8] LAC-Net: Linear-Fusion Attention-Guided Convolutional Network for Accurate Robotic Grasping Under the Occlusion IROS2024

链接: https://arxiv.org/abs/2408.03238
作者: Jinyu Zhang,Yongchong Gu,Jianxiong Gao,Haitao Lin,Qiang Sun,Xinwei Sun,Xiangyang Xue,Yanwei Fu
关键词-EN: perceiving complete object, visual perception, addresses the challenge, challenge of perceiving, paper addresses
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注: accepted by IROS2024

点击查看摘要

Abstract:This paper addresses the challenge of perceiving complete object shapes through visual perception. While prior studies have demonstrated encouraging outcomes in segmenting the visible parts of objects within a scene, amodal segmentation, in particular, has the potential to allow robots to infer the occluded parts of objects. To this end, this paper introduces a new framework that explores amodal segmentation for robotic grasping in cluttered scenes, thus greatly enhancing robotic grasping abilities. Initially, we use a conventional segmentation algorithm to detect the visible segments of the target object, which provides shape priors for completing the full object mask. Particularly, to explore how to utilize semantic features from RGB images and geometric information from depth images, we propose a Linear-fusion Attention-guided Convolutional Network (LAC-Net). LAC-Net utilizes the linear-fusion strategy to effectively fuse this cross-modal data, and then uses the prior visible mask as attention map to guide the network to focus on target feature locations for further complete mask recovery. Using the amodal mask of the target object provides advantages in selecting more accurate and robust grasp points compared to relying solely on the visible segments. The results on different datasets show that our method achieves state-of-the-art performance. Furthermore, the robot experiments validate the feasibility and robustness of this method in the real world. Our code and demonstrations are available on the project page: this https URL.

[CV-9] Contrastive Learning for Image Complexity Representation

链接: https://arxiv.org/abs/2408.03230
作者: Shipeng Liu,Liang Zhao,Dengfeng Chen,Zhanping Song
关键词-EN: Quantifying and evaluating, image complexity, evaluating image complexity, instrumental in enhancing, Quantifying
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Quantifying and evaluating image complexity can be instrumental in enhancing the performance of various computer vision tasks. Supervised learning can effectively learn image complexity features from well-annotated datasets. However, creating such datasets incurs expensive manual annotation costs, and models may also learn human subjective biases from them. In this work, we build on the MoCo v2 framework and utilize contrastive learning to represent image complexity, naming the method CLIC (Contrastive Learning for Image Complexity). We find that there are complexity differences between different local regions of an image, and propose Random Crop and Mix (RCM), which can produce positive samples consisting of multi-scale local crops. RCM can also expand the training set and increase data diversity without introducing additional data. We conduct extensive experiments with CLIC, comparing it with both unsupervised and supervised methods. The results demonstrate that the performance of CLIC is comparable to that of state-of-the-art supervised methods. In addition, we establish pipelines that apply CLIC to computer vision tasks to effectively improve their performance.
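
As a rough illustration of how RCM-style positive pairs could be produced (the exact crop scales and mixing rule are not given in the abstract, so everything below is an assumption), consider:

```python
import torchvision.transforms as T
from PIL import Image

def rcm_positive_pair(img: Image.Image, out_size: int = 224, scales=(0.2, 0.4, 0.8)):
    """Illustrative Random-Crop-and-Mix style positive pair for contrastive learning.

    Each view is assembled from local crops taken at several scales and mixed
    in pixel space; the paper's actual RCM recipe may differ.
    """
    to_tensor = T.ToTensor()

    def one_view():
        crops = []
        for s in scales:
            # Local crop covering roughly a fraction `s` of the image area.
            crop = T.RandomResizedCrop(out_size, scale=(s, min(1.0, s + 0.1)))(img)
            crops.append(to_tensor(crop))
        return sum(crops) / len(crops)   # naive multi-scale pixel-space mix

    return one_view(), one_view()        # two views of the same image = positive pair
```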

[CV-10] Line-based 6-DoF Object Pose Estimation and Tracking With an Event Camera

链接: https://arxiv.org/abs/2408.03225
作者: Zibin Liu,Banglei Guan,Yang Shang,Qifeng Yu,Laurent Kneip
关键词-EN: Pose estimation, high dynamic range, object pose estimation, fundamental application, Pose
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by IEEE Transactions on Image Processing,2024

点击查看摘要

Abstract:Pose estimation and tracking of objects is a fundamental application in 3D vision. Event cameras possess remarkable attributes such as high dynamic range, low latency, and resilience against motion blur, which enables them to address challenging high dynamic range scenes or high-speed motion. These features make event cameras an ideal complement over standard cameras for object pose estimation. In this work, we propose a line-based robust pose estimation and tracking method for planar or non-planar objects using an event camera. Firstly, we extract object lines directly from events, then provide an initial pose using a globally-optimal Branch-and-Bound approach, where 2D-3D line correspondences are not known in advance. Subsequently, we utilize event-line matching to establish correspondences between 2D events and 3D models. Furthermore, object poses are refined and continuously tracked by minimizing event-line distances. Events are assigned different weights based on these distances, employing robust estimation algorithms. To evaluate the precision of the proposed methods in object pose estimation and tracking, we have devised and established an event-based moving object dataset. Compared against state-of-the-art methods, the robustness and accuracy of our methods have been validated both on synthetic experiments and the proposed dataset. The source code is available at this https URL.

[CV-11] Learning to Learn without Forgetting using Attention

链接: https://arxiv.org/abs/2408.03219
作者: Anna Vettoruzzo,Joaquin Vanschoren,Mohamed-Rafik Bouguelia,Thorsteinn Rögnvaldsson
关键词-EN: retaining previously learned, previously learned patterns, previously learned experience, previously learned, ability to continually
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: Published at 3rd Conference on Lifelong Learning Agents (CoLLAs), 2024

点击查看摘要

Abstract:Continual learning (CL) refers to the ability to continually learn over time by accommodating new knowledge while retaining previously learned experience. While this concept is inherent in human learning, current machine learning methods are highly prone to overwrite previously learned patterns and thus forget past experience. Instead, model parameters should be updated selectively and carefully, avoiding unnecessary forgetting while optimally leveraging previously learned patterns to accelerate future learning. Since hand-crafting effective update mechanisms is difficult, we propose meta-learning a transformer-based optimizer to enhance CL. This meta-learned optimizer uses attention to learn the complex relationships between model parameters across a stream of tasks, and is designed to generate effective weight updates for the current task while preventing catastrophic forgetting on previously encountered tasks. Evaluations on benchmark datasets like SplitMNIST, RotatedMNIST, and SplitCIFAR-100 affirm the efficacy of the proposed approach in terms of both forward and backward transfer, even on small sets of labeled data, highlighting the advantages of integrating a meta-learned optimizer within the continual learning framework.

[CV-12] IPAdapter-Instruct: Resolving Ambiguity in Image-based Conditioning using Instruct Prompts

链接: https://arxiv.org/abs/2408.03209
作者: Ciara Rowles,Shimon Vainer,Dante De Nigris,Slava Elizarov,Konstantin Kutsy,Simon Donné
关键词-EN: fine structural details, Diffusion models continuously, accurately describing image, models continuously push, practice proves
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 17 pages, 10 figures, Project page: this https URL

点击查看摘要

Abstract:Diffusion models continuously push the boundary of state-of-the-art image generation, but the process is hard to control with any nuance: practice proves that textual prompts are inadequate for accurately describing image style or fine structural details (such as faces). ControlNet and IPAdapter address this shortcoming by conditioning the generative process on imagery instead, but each individual instance is limited to modeling a single conditional posterior: for practical use-cases, where multiple different posteriors are desired within the same workflow, training and using multiple adapters is cumbersome. We propose IPAdapter-Instruct, which combines natural-image conditioning with "Instruct" prompts to swap between interpretations for the same conditioning image: style transfer, object extraction, both, or something else still? IPAdapter-Instruct efficiently learns multiple tasks with minimal loss in quality compared to dedicated per-task models.

[CV-13] Personalizing Federated Instrument Segmentation with Visual Trait Priors in Robotic Surgery

链接: https://arxiv.org/abs/2408.03208
作者: Jialang Xu,Jiacheng Wang,Lequan Yu,Danail Stoyanov,Yueming Jin,Evangelos B. Mazomenos
关键词-EN: Personalized federated learning, surgical instrument segmentation, federated learning, promising approach, Existing PFL methods
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO); Medical Physics (physics.med-ph)
*备注: 9 pages, 3 figures, under review

点击查看摘要

Abstract:Personalized federated learning (PFL) for surgical instrument segmentation (SIS) is a promising approach. It enables multiple clinical sites to collaboratively train a series of models in privacy, with each model tailored to the individual distribution of each site. Existing PFL methods rarely consider the personalization of multi-headed self-attention, and do not account for appearance diversity and instrument shape similarity, both inherent in surgical scenes. We thus propose PFedSIS, a novel PFL method with visual trait priors for SIS, incorporating global-personalized disentanglement (GPD), appearance-regulation personalized enhancement (APE), and shape-similarity global enhancement (SGE), to boost SIS performance in each site. GPD represents the first attempt at head-wise assignment for multi-headed self-attention personalization. To preserve the unique appearance representation of each site and gradually leverage the inter-site difference, APE introduces appearance regulation and provides customized layer-wise aggregation solutions via hypernetworks for each site’s personalized parameters. The mutual shape information of instruments is maintained and shared via SGE, which enhances the cross-style shape consistency on the image level and computes the shape-similarity contribution of each site on the prediction level for updating the global parameters. PFedSIS outperforms state-of-the-art methods with +1.51% Dice, +2.11% IoU, -2.79 ASSD, -15.55 HD95 performance gains. The corresponding code and models will be released at this https URL.

[CV-14] Efficient NeRF Optimization – Not All Samples Remain Equally Hard

链接: https://arxiv.org/abs/2408.03193
作者: Juuso Korhonen,Goutham Rangu,Hamed R. Tavakoli,Juho Kannala
关键词-EN: Neural Radiance Fields, Radiance Fields, Neural Radiance, online hard sample, hard sample mining
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We propose an application of online hard sample mining for efficient training of Neural Radiance Fields (NeRF). NeRF models produce state-of-the-art quality for many 3D reconstruction and rendering tasks but require substantial computational resources. The encoding of the scene information within the NeRF network parameters necessitates stochastic sampling. We observe that during the training, a major part of the compute time and memory usage is spent on processing already learnt samples, which no longer affect the model update significantly. We identify the backward pass on the stochastic samples as the computational bottleneck during the optimization. We thus perform the first forward pass in inference mode as a relatively low-cost search for hard samples. This is followed by building the computational graph and updating the NeRF network parameters using only the hard samples. To demonstrate the effectiveness of the proposed approach, we apply our method to Instant-NGP, resulting in significant improvements of the view-synthesis quality over the baseline (on average a 1 dB improvement for the same training time, or a 2x speedup to reach the same PSNR level), along with approximately 40% memory savings coming from using only the hard samples to build the computational graph. As our method only interfaces with the network module, we expect it to be widely applicable.
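
A minimal sketch of the two-pass idea, independent of the Instant-NGP integration; the model, loss, and `keep_ratio` below are placeholders rather than the authors' implementation.

```python
import torch

def hard_sample_step(model, rays, targets, optimizer, keep_ratio=0.25):
    """One training step with online hard sample mining (a sketch of the
    cheap-search-then-update scheme described above)."""
    # Pass 1: gradient-free forward pass to score per-sample losses cheaply.
    with torch.no_grad():
        per_sample_loss = ((model(rays) - targets) ** 2).mean(dim=-1)

    # Keep only the hardest samples (largest loss).
    k = max(1, int(keep_ratio * rays.shape[0]))
    hard_idx = torch.topk(per_sample_loss, k).indices

    # Pass 2: build the computational graph and update on the hard samples only.
    optimizer.zero_grad()
    loss = ((model(rays[hard_idx]) - targets[hard_idx]) ** 2).mean()
    loss.backward()
    optimizer.step()
    return loss.item()
```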

[CV-15] An Object is Worth 64x64 Pixels: Generating 3D Object via Image Diffusion

链接: https://arxiv.org/abs/2408.03178
作者: Xingguang Yan,Han-Hung Lee,Ziyu Wan,Angel X. Chang
关键词-EN: Object Images, generating realistic, representation termed, Object, Abstract
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
*备注: Project Page: this https URL

点击查看摘要

Abstract:We introduce a new approach for generating realistic 3D models with UV maps through a representation termed “Object Images.” This approach encapsulates surface geometry, appearance, and patch structures within a 64x64 pixel image, effectively converting complex 3D shapes into a more manageable 2D format. By doing so, we address the challenges of both geometric and semantic irregularity inherent in polygonal meshes. This method allows us to use image generation models, such as Diffusion Transformers, directly for 3D shape generation. Evaluated on the ABO dataset, our generated shapes with patch structures achieve point cloud FID comparable to recent 3D generative models, while naturally supporting PBR material generation.

[CV-16] Dilated Convolution with Learnable Spacings makes visual models more aligned with humans: a Grad-CAM study IJCAI2024

链接: https://arxiv.org/abs/2408.03164
作者: Rabih Chamas,Ismail Khalfaoui-Hassani,Timothee Masquelier
关键词-EN: Learnable Spacing, advanced convolution method, recent advanced convolution, Dilated Convolution, receptive fields
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted at The Trustworthy AI Workshop, IJCAI 2024

点击查看摘要

Abstract:Dilated Convolution with Learnable Spacings (DCLS) is a recent advanced convolution method that allows enlarging the receptive fields (RF) without increasing the number of parameters, like the dilated convolution, yet without imposing a regular grid. DCLS has been shown to outperform the standard and dilated convolutions on several computer vision benchmarks. Here, we show that, in addition, DCLS increases the models’ interpretability, defined as the alignment with human visual strategies. To quantify it, we use the Spearman correlation between the models’ GradCAM heatmaps and the ClickMe dataset heatmaps, which reflect human visual attention. We took eight reference models - ResNet50, ConvNeXt (T, S and B), CAFormer, ConvFormer, and FastViT (sa 24 and 36) - and drop-in replaced the standard convolution layers with DCLS ones. This improved the interpretability score in seven of them. Moreover, we observed that Grad-CAM generated random heatmaps for two models in our study, CAFormer and ConvFormer, leading to low interpretability scores. We addressed this issue by introducing Threshold-Grad-CAM, a modification built on top of Grad-CAM that enhanced interpretability across nearly all models. The code and checkpoints to reproduce this study are available at: this https URL.
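
The interpretability score itself is straightforward to reproduce in spirit: it is the Spearman correlation between a model's Grad-CAM heatmap and the corresponding ClickMe human-attention map. A hedged sketch (pre-processing and resizing details assumed):

```python
import numpy as np
from scipy.stats import spearmanr

def alignment_score(gradcam_map: np.ndarray, human_map: np.ndarray) -> float:
    """Spearman correlation between a Grad-CAM heatmap and a human-attention
    heatmap (e.g. from ClickMe), both given as 2D arrays on the same grid.
    A sketch of the metric described above; the paper's pre-processing may differ.
    """
    assert gradcam_map.shape == human_map.shape, "resize both maps to a common grid first"
    rho, _ = spearmanr(gradcam_map.ravel(), human_map.ravel())
    return float(rho)
```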

[CV-17] User-in-the-loop Evaluation of Multimodal LLMs for Activity Assistance

链接: https://arxiv.org/abs/2408.03160
作者: Mrinal Verghese,Brian Chen,Hamid Eghbalzadeh,Tushar Nagarajan,Ruta Desai
关键词-EN: Large Language Models, facilitate vision-powered assistants, powered by Large, Large Language, Conditioned Language Models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 9 pages, 4 figures

点击查看摘要

Abstract:Our research investigates the capability of modern multimodal reasoning models, powered by Large Language Models (LLMs), to facilitate vision-powered assistants for multi-step daily activities. Such assistants must be able to 1) encode relevant visual history from the assistant’s sensors, e.g., camera, 2) forecast future actions for accomplishing the activity, and 3) replan based on the user in the loop. To evaluate the first two capabilities, grounding visual history and forecasting in short and long horizons, we conduct benchmarking of two prominent classes of multimodal LLM approaches – Socratic Models and Vision Conditioned Language Models (VCLMs) on video-based action anticipation tasks using offline datasets. These offline benchmarks, however, do not allow us to close the loop with the user, which is essential to evaluate the replanning capabilities and measure successful activity completion in assistive scenarios. To that end, we conduct a first-of-its-kind user study, with 18 participants performing 3 different multi-step cooking activities while wearing an egocentric observation device called Aria and following assistance from multimodal LLMs. We find that the Socratic approach outperforms VCLMs in both offline and online settings. We further highlight how grounding long visual history, common in activity assistance, remains challenging in current models, especially for VCLMs, and demonstrate that offline metrics do not indicate online performance.

[CV-18] Iterative CT Reconstruction via Latent Variable Optimization of Shallow Diffusion Models

链接: https://arxiv.org/abs/2408.03156
作者: Sho Ozaki,Shizuo Kaji,Toshikazu Imae,Kanabu Nawa,Hideomi Yamashita,Keiichi Nakagawa
关键词-EN: garnered significant attention, diffusion model, diffusion, garnered significant, significant attention
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Medical Physics (physics.med-ph)
*备注: 19 pages, 9 figures

点击查看摘要

Abstract:Image generative AI has garnered significant attention in recent years. In particular, the diffusion model, a core component of recent generative AI, produces high-quality images with rich diversity. In this study, we propose a novel CT reconstruction method by combining the denoising diffusion probabilistic model with iterative CT reconstruction. In sharp contrast to previous studies, we optimize the fidelity loss of CT reconstruction with respect to the latent variable of the diffusion model, instead of the image and model parameters. To suppress anatomical structure changes produced by the diffusion model, we make the diffusion and reverse processes shallow, and fix a set of added noises in the reverse process to make it deterministic during inference. We demonstrate the effectiveness of the proposed method through sparse view CT reconstruction of 1/10 view projection data. Despite the simplicity of the implementation, the proposed method shows the capability of reconstructing high-quality images while preserving the patient’s anatomical structure, and outperforms existing methods including iterative reconstruction, iterative reconstruction with total variation, and the diffusion model alone in terms of quantitative indices such as SSIM and PSNR. We also explore further sparse view CT using 1/20 view projection data with the same trained diffusion model. As the number of iterations increases, image quality improvement comparable to that of 1/10 sparse view CT reconstruction is achieved. In principle, the proposed method can be widely applied not only to CT but also to other imaging modalities such as MRI, PET, and SPECT.
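
A generic sketch of the central idea, optimizing the fidelity loss with respect to the latent while keeping the reverse process deterministic; `reverse_diffusion` and `forward_projection` are placeholder callables, not the authors' code.

```python
import torch

def latent_ct_reconstruction(latent_init, fixed_noises, reverse_diffusion,
                             forward_projection, sinogram, n_iters=200, lr=0.05):
    """Sketch of latent-variable optimization for iterative CT reconstruction.

    `reverse_diffusion(z, fixed_noises)` is a shallow, deterministic reverse
    process (its noises are fixed) mapping a latent to an image, and
    `forward_projection(img)` models the CT system; both are placeholders.
    """
    z = latent_init.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([z], lr=lr)
    for _ in range(n_iters):
        optimizer.zero_grad()
        image = reverse_diffusion(z, fixed_noises)               # deterministic generation
        fidelity = ((forward_projection(image) - sinogram) ** 2).mean()
        fidelity.backward()                                      # gradient w.r.t. the latent only
        optimizer.step()
    with torch.no_grad():
        return reverse_diffusion(z, fixed_noises)
```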

[CV-19] Leveraging Entity Information for Cross-Modality Correlation Learning: The Entity-Guided Multimodal Summarization ACL

链接: https://arxiv.org/abs/2408.03149
作者: Yanghai Zhang,Ye Liu,Shiwei Wu,Kai Zhang,Xukai Liu,Qi Liu,Enhong Chen
关键词-EN: Multimodal Summarization model, Multimodal Summarization, rapid increase, increase in multimedia, spurred advancements
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
*备注: In ACL-Findings 2024

点击查看摘要

Abstract:The rapid increase in multimedia data has spurred advancements in Multimodal Summarization with Multimodal Output (MSMO), which aims to produce a multimodal summary that integrates both text and relevant images. The inherent heterogeneity of content within multimodal inputs and outputs presents a significant challenge to the execution of MSMO. Traditional approaches typically adopt a holistic perspective on coarse image-text data or individual visual objects, overlooking the essential connections between objects and the entities they represent. To integrate the fine-grained entity knowledge, we propose an Entity-Guided Multimodal Summarization model (EGMS). Our model, building on BART, utilizes dual multimodal encoders with shared weights to process text-image and entity-image information concurrently. A gating mechanism then combines visual data for enhanced textual summary generation, while image selection is refined through knowledge distillation from a pre-trained vision-language model. Extensive experiments on the public MSMO dataset validate the superiority of the EGMS method, which also proves the necessity of incorporating entity information into the MSMO problem.

[CV-20] SuperSimpleNet: Unifying Unsupervised and Supervised Learning for Fast and Reliable Surface Defect Detection ICPR2024

链接: https://arxiv.org/abs/2408.03143
作者: Blaž Rolih,Matic Fučka,Danijel Skočaj
关键词-EN: surface defect detection, localise abnormal regions, surface defect, captured objects, identify and localise
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to ICPR 2024

点击查看摘要

Abstract:The aim of surface defect detection is to identify and localise abnormal regions on the surfaces of captured objects, a task that’s increasingly demanded across various industries. Current approaches frequently fail to fulfil the extensive demands of these industries, which encompass high performance, consistency, and fast operation, along with the capacity to leverage the entirety of the available training data. Addressing these gaps, we introduce SuperSimpleNet, an innovative discriminative model that evolved from SimpleNet. This advanced model significantly enhances its predecessor’s training consistency, inference time, as well as detection performance. SuperSimpleNet operates in an unsupervised manner using only normal training images but also benefits from labelled abnormal training images when they are available. SuperSimpleNet achieves state-of-the-art results in both the supervised and the unsupervised settings, as demonstrated by experiments across four challenging benchmark datasets. Code: this https URL .

[CV-21] Benchmarking In-the-wild Multimodal Disease Recognition and A Versatile Baseline

链接: https://arxiv.org/abs/2408.03120
作者: Tianqi Wei,Zhi Chen,Zi Huang,Xin Yu
关键词-EN: recognizing in-laboratory diseased, Existing plant disease, achieved remarkable performance, in-laboratory diseased images, Existing plant
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Existing plant disease classification models have achieved remarkable performance in recognizing in-laboratory diseased images. However, their performance often significantly degrades in classifying in-the-wild images. Furthermore, we observed that in-the-wild plant images may exhibit similar appearances across various diseases (i.e., small inter-class discrepancy) while the same diseases may look quite different (i.e., large intra-class variance). Motivated by this observation, we propose an in-the-wild multimodal plant disease recognition dataset that contains not only the largest number of disease classes but also text-based descriptions for each disease. Particularly, the newly provided text descriptions are introduced to provide rich information in the textual modality and facilitate in-the-wild disease classification with small inter-class discrepancy and large intra-class variance issues. Therefore, our proposed dataset can be regarded as an ideal testbed for evaluating disease recognition methods in the real world. In addition, we further present a strong yet versatile baseline that models text descriptions and visual data through multiple prototypes for a given class. By fusing the contributions of multimodal prototypes in classification, our baseline can effectively address the small inter-class discrepancy and large intra-class variance issues. Remarkably, our baseline model can not only classify diseases but also recognize diseases in few-shot or training-free scenarios. Extensive benchmarking results demonstrate that our proposed in-the-wild multimodal dataset poses many new challenges for the plant disease recognition task and leaves a large space for future work to improve.
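
As a hedged illustration of classification with fused multimodal prototypes (the concrete fusion in the paper may differ; the weighting `alpha` and the single-prototype-per-class simplification are assumptions):

```python
import torch
import torch.nn.functional as F

def prototype_scores(img_feat, visual_protos, text_protos, alpha=0.5):
    """Score disease classes by cosine similarity to visual and text prototypes.

    img_feat:      (D,)   embedding of the query image
    visual_protos: (C, D) one visual prototype per class (e.g. mean of a few shots)
    text_protos:   (C, D) embedding of each class's text description
    """
    img_feat = F.normalize(img_feat, dim=-1)
    v = F.normalize(visual_protos, dim=-1)
    t = F.normalize(text_protos, dim=-1)
    return alpha * (v @ img_feat) + (1 - alpha) * (t @ img_feat)  # (C,) class scores
```

Because the visual prototypes can be built from a handful of labelled images, or dropped entirely in favour of the text prototypes alone, the same scoring rule covers the few-shot and training-free settings mentioned in the abstract.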

[CV-22] Prototype Learning for Micro-gesture Classification MICRO IJCAI-2024

链接: https://arxiv.org/abs/2408.03097
作者: Guoliang Chen,Fei Wang,Kun Li,Zhiliang Wu,Hehe Fan,Yi Yang,Meng Wang,Dan Guo
关键词-EN: challenge at IJCAI, Micro-gesture Classification, micro-gesture classification task, track of Micro-gesture, briefly introduce
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 1st Place in Micro-gesture Classification in MiGA at IJCAI-2024

点击查看摘要

Abstract:In this paper, we briefly introduce the solution developed by our team, HFUT-VUT, for the track of Micro-gesture Classification in the MiGA challenge at IJCAI 2024. The micro-gesture classification task involves recognizing the category of a given video clip, which focuses on more fine-grained and subtle body movements compared to typical action recognition tasks. Given the inherent complexity of micro-gesture recognition, which includes large intra-class variability and minimal inter-class differences, we utilize two innovative modules, i.e., the cross-modal fusion module and the prototypical refinement module, to improve the discriminative ability of MG features, thereby improving the classification accuracy. Our solution achieved significant success, ranking 1st in the track of Micro-gesture Classification. We surpassed the performance of last year’s leading team by a substantial margin, improving Top-1 accuracy by 6.13%.

[CV-23] BodySLAM: A Generalized Monocular Visual SLAM Framework for Surgical Applications

链接: https://arxiv.org/abs/2408.03078
作者: G. Manni(1 and 2),C. Lauretti(2),F. Prata(3),R. Papalia(3),L. Zollo(2),P. Soda(1) ((1) Research Unit of Computer Systems and Bioinformatics Department of Engineering Università Campus Bio-Medico di Roma, (2) Unit of Advanced Robotics and Human-Centred Technologies Department of Engineering Università Campus Bio-Medico di Roma, (3) Department of Urology Fondazione Policlinico Universitario Campus Bio-Medico)
关键词-EN: Depth Estimation Module, Monocular Depth Estimation, Pose Estimation Module, Endoscopic surgery relies, Monocular Pose Estimation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: 15 pages, 7 figures

点击查看摘要

Abstract:Endoscopic surgery relies on two-dimensional views, posing challenges for surgeons in depth perception and instrument manipulation. While Simultaneous Localization and Mapping (SLAM) has emerged as a promising solution to address these limitations, its implementation in endoscopic procedures presents significant challenges due to hardware limitations, such as the use of a monocular camera and the absence of odometry sensors. This study presents a robust deep learning-based SLAM approach that combines state-of-the-art and newly developed models. It consists of three main parts: the Monocular Pose Estimation Module that introduces a novel unsupervised method based on the CycleGAN architecture, the Monocular Depth Estimation Module that leverages the novel Zoe architecture, and the 3D Reconstruction Module which uses information from the previous models to create a coherent surgical map. The performance of the procedure was rigorously evaluated using three publicly available datasets (Hamlyn, EndoSLAM, and SCARED) and benchmarked against two state-of-the-art methods, EndoSFMLearner and EndoDepth. The integration of Zoe in the MDEM demonstrated superior performance compared to state-of-the-art depth estimation algorithms in endoscopy, whereas the novel approach in the MPEM exhibited competitive performance and the lowest inference time. The results showcase the robustness of our approach in laparoscopy, gastroscopy, and colonoscopy, three different scenarios in endoscopic surgery. The proposed SLAM approach has the potential to improve the accuracy and efficiency of endoscopic procedures by providing surgeons with enhanced depth perception and 3D reconstruction capabilities.

[CV-24] SCOPE: A Synthetic Multi-Modal Dataset for Collective Perception Including Physical-Correct Weather Conditions

链接: https://arxiv.org/abs/2408.03065
作者: Jörg Gamerdinger,Sven Teufel,Patrick Schulz,Stephan Amann,Jan-Patrick Kirchner,Oliver Bringmann
关键词-EN: received considerable attention, limited sensing ranges, Collective perception, autonomous driving, received considerable
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Collective perception has received considerable attention as a promising approach to overcome occlusions and limited sensing ranges of vehicle-local perception in autonomous driving. In order to develop and test novel collective perception technologies, appropriate datasets are required. These datasets must include not only different environmental conditions, as they strongly influence the perception capabilities, but also a wide range of scenarios with different road users as well as realistic sensor models. Therefore, we propose the Synthetic COllective PErception (SCOPE) dataset. SCOPE is the first synthetic multi-modal dataset that incorporates realistic camera and LiDAR models as well as parameterized and physically accurate weather simulations for both sensor types. The dataset contains 17,600 frames from over 40 diverse scenarios with up to 24 collaborative agents, infrastructure sensors, and passive traffic, including cyclists and pedestrians. In addition, recordings from two novel digital-twin maps from Karlsruhe and Tübingen are included. The dataset is available at this https URL

[CV-25] MGFs: Masked Gaussian Fields for Meshing Building based on Multi-View Images

链接: https://arxiv.org/abs/2408.03060
作者: Tengfei Wang,Zongqian Zhan,Rui Xia,Linxia Ji,Xin Wang
关键词-EN: garnered substantial research, substantial research interest, Masked Gaussian Fields, Gaussian fields-based methods, image-based building surface
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注:

点击查看摘要

Abstract:Over the last few decades, image-based building surface reconstruction has garnered substantial research interest and has been applied across various fields, such as heritage preservation, architectural planning, etc. Compared to the traditional photogrammetric and NeRF-based solutions, recently, Gaussian fields-based methods have exhibited significant potential in generating surface meshes due to their time-efficient training and detailed 3D information preservation. However, most Gaussian fields-based methods are trained with all image pixels, encompassing building and nonbuilding areas, which results in significant noise in the building meshes and degraded time efficiency. This paper proposes a novel framework, Masked Gaussian Fields (MGFs), designed to generate accurate surface reconstruction for buildings in a time-efficient way. The framework first applies EfficientSAM and COLMAP to generate multi-level masks of the building and the corresponding masked point clouds. Subsequently, the masked Gaussian fields are trained by integrating two innovative losses: a multi-level perceptual masked loss focused on constructing building regions and a boundary loss aimed at enhancing the details of the boundaries between different masks. Finally, we improve the tetrahedral surface mesh extraction method based on the masked Gaussian spheres. Comprehensive experiments on UAV images demonstrate that, compared to the traditional method and several NeRF-based and Gaussian-based SOTA solutions, our approach significantly improves both the accuracy and efficiency of building surface reconstruction. Notably, as a byproduct, there is an additional gain in the novel view synthesis of buildings.

[CV-26] Comb Prune Distill: Towards Unified Pruning for Vision Model Compression ITSC2024

链接: https://arxiv.org/abs/2408.03046
作者: Jonas Schmitt,Ruiping Liu,Junwei Zheng,Jiaming Zhang,Rainer Stiefelhagen
关键词-EN: Lightweight and effective, limited resources, intelligent vehicles, essential for devices, devices with limited
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by ITSC 2024. Code is publicly available at: this https URL

点击查看摘要

Abstract:Lightweight and effective models are essential for devices with limited resources, such as intelligent vehicles. Structured pruning offers a promising approach to model compression and efficiency enhancement. However, existing methods often tie pruning techniques to specific model architectures or vision tasks. To address this limitation, we propose a novel unified pruning framework Comb, Prune, Distill (CPD), which addresses both model-agnostic and task-agnostic concerns simultaneously. Our framework employs a combing step to resolve hierarchical layer-wise dependency issues, enabling architecture independence. Additionally, the pruning pipeline adaptively removes parameters based on importance scoring metrics regardless of the vision task. To support the model in retaining its learned information, we introduce knowledge distillation during the pruning step. Extensive experiments demonstrate the generalizability of our framework, encompassing both convolutional neural network (CNN) and transformer models, as well as image classification and segmentation tasks. In image classification we achieve a speedup of up to x4.3 with an accuracy loss of 1.8%, and in semantic segmentation up to x1.89 with a 5.1% loss in mIoU.

[CV-27] Targeted Visual Prompting for Medical Visual Question Answering MICCAI

链接: https://arxiv.org/abs/2408.03043
作者: Sergio Tascon-Morales,Pablo Márquez-Neila,Raphael Sznitman
关键词-EN: multimodal large language, classical model architectures, large language models, recent years, rapidly evolved
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at the MICCAI AMAI Workshop 2024

点击查看摘要

Abstract:With growing interest in recent years, medical visual question answering (Med-VQA) has rapidly evolved, with multimodal large language models (MLLMs) emerging as an alternative to classical model architectures. Specifically, their ability to add visual information to the input of pre-trained LLMs brings new capabilities for image interpretation. However, simple visual errors cast doubt on the actual visual understanding abilities of these models. To address this, region-based questions have been proposed as a means to assess and enhance actual visual understanding through compositional evaluation. To combine these two perspectives, this paper introduces targeted visual prompting to equip MLLMs with region-based questioning capabilities. By presenting the model with both the isolated region and the region in its context in a customized visual prompt, we show the effectiveness of our method across multiple datasets while comparing it to several baseline models. Our code and data are available at this https URL.
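
A small sketch of how such a targeted visual prompt could be assembled with PIL (the side-by-side layout and the helper name `build_visual_prompt` are assumptions for illustration, not the paper's implementation):

```python
from PIL import Image, ImageDraw

def build_visual_prompt(image_path: str, box, pad: int = 8) -> Image.Image:
    """Compose a targeted visual prompt: the isolated region next to the full
    image with the same region outlined in context.

    `box` is a (left, upper, right, lower) pixel box around the target region.
    """
    img = Image.open(image_path).convert("RGB")
    region = img.crop(box)                                          # isolated region
    context = img.copy()
    ImageDraw.Draw(context).rectangle(box, outline="red", width=3)  # region in its context

    # Place the two views side by side on a white canvas.
    h = max(region.height, context.height)
    canvas = Image.new("RGB", (region.width + context.width + pad, h), "white")
    canvas.paste(region, (0, 0))
    canvas.paste(context, (region.width + pad, 0))
    return canvas
```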

[CV-28] Nighttime Pedestrian Detection Based on Fore-Background Contrast Learning

链接: https://arxiv.org/abs/2408.03030
作者: He Yao,Yongjun Zhang,Huachun Jian,Li Zhang,Ruzhong Cheng
关键词-EN: background information, channel attention mechanisms, channel attention, attention, attention mechanisms
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The significance of background information is frequently overlooked in contemporary research concerning channel attention mechanisms. This study addresses the issue of suboptimal single-spectral nighttime pedestrian detection performance under low-light conditions by incorporating background information into the channel attention mechanism. Despite numerous studies focusing on the development of efficient channel attention mechanisms, the relevance of background information has been largely disregarded. By adopting a contrast learning approach, we reexamine channel attention with regard to pedestrian objects and background information for nighttime pedestrian detection, resulting in the proposed Fore-Background Contrast Attention (FBCA). FBCA possesses two primary attributes: (1) channel descriptors form remote dependencies with global spatial feature information; (2) the integration of background information enhances the distinction between channels concentrating on low-light pedestrian features and those focusing on background information. Consequently, the acquired channel descriptors exhibit a higher semantic level and spatial accuracy. Experimental outcomes demonstrate that FBCA significantly outperforms existing methods in single-spectral nighttime pedestrian detection, achieving state-of-the-art results on the NightOwls and TJU-DHD-pedestrian datasets. Furthermore, this methodology also yields performance improvements for the multispectral LLVIP dataset. These findings indicate that integrating background information into the channel attention mechanism effectively mitigates detector performance degradation caused by illumination factors in nighttime scenarios.

[CV-29] CKNN: Cleansed k-Nearest Neighbor for Unsupervised Video Anomaly Detection

链接: https://arxiv.org/abs/2408.03014
作者: Jihun Yi,Sungroh Yoon
关键词-EN: unsupervised video anomaly, video anomaly detection, Anomaly Cluster, anomaly detection, anomaly
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In this paper, we address the problem of unsupervised video anomaly detection (UVAD). The task aims to detect abnormal events in test video using unlabeled videos as training data. The presence of anomalies in the training data poses a significant challenge in this task, particularly because they form clusters in the feature space. We refer to this property as the “Anomaly Cluster” issue. The condensed nature of these anomalies makes it difficult to distinguish between normal and abnormal data in the training set. Consequently, training conventional anomaly detection techniques using an unlabeled dataset often leads to sub-optimal results. To tackle this difficulty, we propose a new method called Cleansed k-Nearest Neighbor (CKNN), which explicitly filters out the Anomaly Clusters by cleansing the training dataset. Following the k-nearest neighbor algorithm in the feature space provides powerful anomaly detection capability. Although the identified Anomaly Cluster issue presents a significant challenge to applying k-nearest neighbor in UVAD, our proposed cleansing scheme effectively addresses this problem. We evaluate the proposed method on various benchmark datasets and demonstrate that CKNN outperforms the previous state-of-the-art UVAD method by up to 8.5% (from 82.0 to 89.0) in terms of AUROC. Moreover, we emphasize that the performance of the proposed method is comparable to that of the state-of-the-art method trained using anomaly-free data.
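
A simplified sketch of the two-stage pipeline; the cleansing criterion is abstracted into a `keep_mask`, since the paper's cluster-aware cleansing rule is not reproduced here, and any reasonable filter can be plugged in for illustration.

```python
from sklearn.neighbors import NearestNeighbors

def cknn_scores(train_feats, test_feats, keep_mask, k=5):
    """Two-step sketch of the CKNN idea: cleanse the training set, then score
    test samples by their k-NN distance to the cleansed features.
    """
    cleansed = train_feats[keep_mask]                      # step 1: cleanse training set
    nn = NearestNeighbors(n_neighbors=k).fit(cleansed)     # step 2: k-NN in feature space
    dists, _ = nn.kneighbors(test_feats)
    return dists.mean(axis=1)                              # higher distance = more anomalous
```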

[CV-30] Dual-path Collaborative Generation Network for Emotional Video Captioning

链接: https://arxiv.org/abs/2408.03006
作者: Cheng Ye,Weidong Chen,Jingyu Li,Lei Zhang,Zhendong Mao
关键词-EN: Emotional Video Captioning, visual emotional cues, emotional cues, Video Captioning, Emotional
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Acccepted by ACM Multimedia 2024, oral

点击查看摘要

Abstract:Emotional Video Captioning is an emerging task that aims to describe factual content with the intrinsic emotions expressed in videos. The essence of the EVC task is to effectively perceive subtle and ambiguous visual emotional cues during caption generation, which traditional video captioning neglects. Existing emotional video captioning methods perceive global visual emotional cues at first, and then combine them with the video features to guide the emotional caption generation, which neglects two characteristics of the EVC task. Firstly, their methods neglect the dynamic subtle changes in the intrinsic emotions of the video, which makes it difficult to meet the needs of common scenes with diverse and changeable emotions. Secondly, as their methods incorporate emotional cues into each step, the guidance role of emotion is overemphasized, which causes factual content to be more or less ignored during generation. To this end, we propose a dual-path collaborative generation network, which dynamically perceives the evolution of visual emotional cues while generating emotional captions through collaborative learning. Specifically, in the dynamic emotion perception path, we propose a dynamic emotion evolution module, which first aggregates visual features and historical caption features to summarize the global visual emotional cues, and then dynamically selects emotional cues required to be re-composed at each stage. Besides, in the adaptive caption generation path, to balance the description of factual content and emotional cues, we propose an emotion adaptive decoder. Thus, our method can generate emotion-related words at the necessary time steps, and our caption generation balances the guidance of factual content and emotional cues well. Extensive experiments on three challenging datasets demonstrate the superiority of our approach and each proposed module.

[CV-31] Multitask and Multimodal Neural Tuning for Large Models

链接: https://arxiv.org/abs/2408.03001
作者: Hao Sun,Yu Song,Jihong Hu,Yen-Wei Chen,Lanfen Lin
关键词-EN: demonstrated impressive capabilities, recent years, demonstrated impressive, impressive capabilities, large-scale multimodal models
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:In recent years, large-scale multimodal models have demonstrated impressive capabilities across various domains. However, enabling these models to effectively perform multiple multimodal tasks simultaneously remains a significant challenge. To address this, we introduce a novel tuning method called neural tuning, designed to handle diverse multimodal tasks concurrently, including reasoning segmentation, referring segmentation, image captioning, and text-to-image generation. Neural tuning emulates sparse distributed representation in human brain, where only specific subsets of neurons are activated for each task. Additionally, we present a new benchmark, MMUD, where each sample is annotated with multiple task labels. By applying neural tuning to pretrained large models on the MMUD benchmark, we achieve simultaneous task handling in a streamlined and efficient manner. All models, code, and datasets will be publicly available after publication, facilitating further research and development in this field.

[CV-32] DreamLCM: Towards High-Quality Text-to-3D Generation via Latent Consistency Model ACM-MM2024

链接: https://arxiv.org/abs/2408.02993
作者: Yiming Zhong,Xiaolin Zhang,Yao Zhao,Yunchao Wei
关键词-EN: developed rapidly due, SDS method, task has developed, developed rapidly, poor quality due
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 15 pages, 9 figures, ACM MM 2024

点击查看摘要

Abstract:Recently, the text-to-3D task has developed rapidly due to the appearance of the SDS method. However, the SDS method always generates 3D objects with poor quality due to the over-smooth issue. This issue is attributed to two factors: 1) the DDPM single-step inference produces poor guidance gradients; 2) the randomness from the input noises and timesteps averages the details of the 3D objects. In this paper, to address the issue, we propose DreamLCM, which incorporates the Latent Consistency Model (LCM). DreamLCM leverages the powerful image generation capabilities inherent in LCM, enabling it to generate consistent and high-quality guidance, i.e., predicted noises or images. Powered by the improved guidance, the proposed method can provide accurate and detailed gradients to optimize the target 3D models. In addition, we propose two strategies to enhance the generation quality further. Firstly, we propose a guidance calibration strategy, utilizing an Euler solver to calibrate the guidance distribution and help 3D models converge faster. Secondly, we propose a dual-timestep strategy, increasing the consistency of guidance and optimizing 3D models from geometry to appearance in DreamLCM. Experiments show that DreamLCM achieves state-of-the-art results in both generation quality and training efficiency. The code is available at this https URL.

[CV-33] Diffusion Model Meets Non-Exemplar Class-Incremental Learning and Beyond

链接: https://arxiv.org/abs/2408.02983
作者: Jichuan Zhang,Yali Li,Xin Liu,Shengjin Wang
关键词-EN: Non-exemplar class-incremental learning, resist catastrophic forgetting, Non-exemplar class-incremental, class samples, resist catastrophic
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Non-exemplar class-incremental learning (NECIL) aims to resist catastrophic forgetting without saving old class samples. Prior methodologies generally employ simple rules to generate features for replaying, suffering from a large distribution gap between replayed features and real ones. To address the aforementioned issue, we propose a simple yet effective Diffusion-based Feature Replay (DiffFR) method for NECIL. First, to alleviate the limited representational capacity caused by fixing the feature extractor, we employ Siamese-based self-supervised learning for initial generalizable features. Second, we devise diffusion models to generate class-representative features highly similar to real features, which provides an effective way for exemplar-free knowledge memorization. Third, we introduce prototype calibration to direct the diffusion model’s focus towards learning the distribution shapes of features, rather than the entire distribution. Extensive experiments on public datasets demonstrate significant performance gains of our DiffFR, outperforming the state-of-the-art NECIL methods by 3.0% in average. The code will be made publicly available soon.

[CV-34] Sample-agnostic Adversarial Perturbation for Vision-Language Pre-training Models

链接: https://arxiv.org/abs/2408.02980
作者: Haonan Zheng,Wen Jiang,Xinyang Deng,Wenrui Li
关键词-EN: Recent studies, Vision-Language Pre-training, intentionally designed perturbations, security have highlighted, highlighted the vulnerability
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 13 pages, 8 figures, published in ACMMM2024

点击查看摘要

Abstract:Recent studies on AI security have highlighted the vulnerability of Vision-Language Pre-training (VLP) models to subtle yet intentionally designed perturbations in images and texts. Investigating multimodal systems’ robustness via adversarial attacks is crucial in this field. Most multimodal attacks are sample-specific, generating a unique perturbation for each sample to construct adversarial samples. To the best of our knowledge, it is the first work through multimodal decision boundaries to explore the creation of a universal, sample-agnostic perturbation that applies to any image. Initially, we explore strategies to move sample points beyond the decision boundaries of linear classifiers, refining the algorithm to ensure successful attacks under the top k accuracy metric. Based on this foundation, in visual-language tasks, we treat visual and textual modalities as reciprocal sample points and decision hyperplanes, guiding image embeddings to traverse text-constructed decision boundaries, and vice versa. This iterative process consistently refines a universal perturbation, ultimately identifying a singular direction within the input space which is exploitable to impair the retrieval performance of VLP models. The proposed algorithms support the creation of global perturbations or adversarial patches. Comprehensive experiments validate the effectiveness of our method, showcasing its data, task, and model transferability across various VLP models and datasets. Code: this https URL

[CV-35] ASR-enhanced Multimodal Representation Learning for Cross-Domain Product Retrieval

链接: https://arxiv.org/abs/2408.02978
作者: Ruixiang Zhao,Jian Jia,Yan Li,Xuehan Bai,Quan Chen,Han Li,Peng Jiang,Xirong Li
关键词-EN: live stream promotions, E-commerce is increasingly, increasingly multimedia-enriched, manner as images, stream promotions
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages, 5 figures

点击查看摘要

Abstract:E-commerce is increasingly multimedia-enriched, with products exhibited in a broad-domain manner as images, short videos, or live stream promotions. A unified and vectorized cross-domain product representation is essential. Due to large intra-product variance and high inter-product similarity in the broad-domain scenario, a visual-only representation is inadequate. While Automatic Speech Recognition (ASR) text derived from the short or live-stream videos is readily accessible, how to de-noise the excessively noisy text for multimodal representation learning is mostly untouched. We propose ASR-enhanced Multimodal Product Representation Learning (AMPere). In order to extract product-specific information from the raw ASR text, AMPere uses an easy-to-implement LLM-based ASR text summarizer. The LLM-summarized text, together with visual data, is then fed into a multi-branch network to generate compact multimodal embeddings. Extensive experiments on a large-scale tri-domain dataset verify the effectiveness of AMPere in obtaining a unified multimodal product representation that clearly improves cross-domain product retrieval.

[CV-36] Fast Point Cloud Geometry Compression with Context-based Residual Coding and INR-based Refinement ECCV2024

链接: https://arxiv.org/abs/2408.02966
作者: Hao Xu,Xi Zhang,Xiaolin Wu
关键词-EN: characterizing neighboring relations, regular sample grids, videos of regular, set of unordered, difficulties in characterizing
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注: Accepted by ECCV 2024

点击查看摘要

Abstract:Compressing a set of unordered points is far more challenging than compressing images/videos of regular sample grids, because of the difficulties in characterizing neighboring relations in an irregular layout of points. Many researchers resort to voxelization to introduce regularity, but this approach suffers from quantization loss. In this research, we use the KNN method to determine the neighborhoods of raw surface points. This gives us a means to determine the spatial context in which the latent features of 3D points are compressed by arithmetic coding. As such, the conditional probability model is adaptive to local geometry, leading to significant rate reduction. Additionally, we propose a dual-layer architecture where a non-learning base layer reconstructs the main structures of the point cloud at low complexity, while a learned refinement layer focuses on preserving fine details. This design leads to reductions in model complexity and coding latency by two orders of magnitude compared to SOTA methods. Moreover, we incorporate an implicit neural representation (INR) into the refinement layer, allowing the decoder to sample points on the underlying surface at arbitrary densities. This work is the first to effectively exploit content-aware local contexts for compressing irregular raw point clouds, achieving high rate-distortion performance, low complexity, and the ability to function as an arbitrary-scale upsampling network simultaneously.
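
The KNN neighborhood construction the abstract describes is the standard first step for defining local spatial context on an irregular point set. The sketch below shows only that step with SciPy's k-d tree; the choice of k and how the neighborhoods feed the entropy model are illustrative assumptions, not the paper's pipeline.

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_neighborhoods(points: np.ndarray, k: int = 16) -> np.ndarray:
    """Return, for each raw surface point, the indices of its k nearest neighbors.

    points: (N, 3) array of xyz coordinates.
    Returns an (N, k) index array; the query point itself is excluded.
    """
    tree = cKDTree(points)
    # k + 1 because the nearest neighbor of any point is the point itself.
    _, idx = tree.query(points, k=k + 1)
    return idx[:, 1:]

# Toy usage: these neighborhoods define the local spatial context in which
# per-point latent features could be conditioned for arithmetic coding.
pts = np.random.rand(1000, 3).astype(np.float32)
neighbors = knn_neighborhoods(pts, k=16)   # shape (1000, 16)
```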

[CV-37] Online Temporal Action Localization with Memory-Augmented Transformer ECCV2024 KR

链接: https://arxiv.org/abs/2408.02957
作者: Youngkil Song,Dongkeun Kim,Minsu Cho,Suha Kwak
关键词-EN: identifying multiple action, multiple action instances, Online temporal action, task of identifying, identifying multiple
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to ECCV 2024, Project page: this https URL

点击查看摘要

Abstract:Online temporal action localization (On-TAL) is the task of identifying multiple action instances given a streaming video. Since existing methods take as input only a video segment of fixed size per iteration, they are limited in considering long-term context and require tuning the segment size carefully. To overcome these limitations, we propose memory-augmented transformer (MATR). MATR utilizes the memory queue that selectively preserves the past segment features, allowing to leverage long-term context for inference. We also propose a novel action localization method that observes the current input segment to predict the end time of the ongoing action and accesses the memory queue to estimate the start time of the action. Our method outperformed existing methods on two datasets, THUMOS14 and MUSES, surpassing not only TAL methods in the online setting but also some offline TAL methods.
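
A minimal sketch of the memory-queue idea, assuming a plain FIFO buffer of per-segment features; MATR's actual selective-preservation policy is more involved, so this only illustrates how long-term context can be carried across streaming segments.

```python
import torch

class SegmentMemoryQueue:
    """Illustrative fixed-size memory of past segment features (FIFO)."""
    def __init__(self, max_len: int, feat_dim: int):
        self.max_len = max_len
        self.memory = torch.empty(0, feat_dim)

    def push(self, segment_feat: torch.Tensor) -> None:
        # segment_feat: (1, feat_dim) summary of the current input segment.
        self.memory = torch.cat([self.memory, segment_feat], dim=0)
        if self.memory.size(0) > self.max_len:
            self.memory = self.memory[-self.max_len:]   # drop the oldest entries

    def read(self) -> torch.Tensor:
        return self.memory                              # (<= max_len, feat_dim)

# Toy usage with a streaming video: one feature vector per incoming segment.
queue = SegmentMemoryQueue(max_len=8, feat_dim=256)
for _ in range(20):
    queue.push(torch.randn(1, 256))
context = queue.read()   # long-term context available to the localization head
```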

[CV-38] WWW: Where Which and Whatever Enhancing Interpretability in Multimodal Deepfake Detection IJCAI2024

链接: https://arxiv.org/abs/2408.02954
作者: Juho Jung,Sangyoun Lee,Jooeon Kang,Yunjin Na
关键词-EN: manipulate entire frames, detection accuracies exceeding, detection manipulate entire, oversaturated detection accuracies, multimodal deepfake detection
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 4 pages, 2 figures, 2 tables, Accepted as Oral Presentation at The Trustworthy AI Workshop @ IJCAI 2024

点击查看摘要

Abstract:All current benchmarks for multimodal deepfake detection manipulate entire frames using various generation techniques, resulting in oversaturated detection accuracies exceeding 94% at the video-level classification. However, these benchmarks struggle to detect dynamic deepfake attacks with challenging frame-by-frame alterations presented in real-world scenarios. To address this limitation, we introduce FakeMix, a novel clip-level evaluation benchmark aimed at identifying manipulated segments within both video and audio, providing insight into the origins of deepfakes. Furthermore, we propose novel evaluation metrics, Temporal Accuracy (TA) and Frame-wise Discrimination Metric (FDM), to assess the robustness of deepfake detection models. Evaluating state-of-the-art models against diverse deepfake benchmarks, particularly FakeMix, demonstrates the effectiveness of our approach comprehensively. Specifically, while achieving an Average Precision (AP) of 94.2% at the video-level, the evaluation of the existing models at the clip-level using the proposed metrics, TA and FDM, yielded sharp declines in accuracy to 53.1%, and 52.1%, respectively.

[CV-39] Segmenting Small Stroke Lesions with Novel Labeling Strategies

链接: https://arxiv.org/abs/2408.02929
作者: Liang Shang,Zhengyang Lou,Andrew L. Alexander,Vivek Prabhakaran,William A. Sethares,Veena A. Nair,Nagesh Adluru
关键词-EN: Deep neural networks, demonstrated exceptional efficacy, Deep neural, demonstrated exceptional, exceptional efficacy
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Deep neural networks have demonstrated exceptional efficacy in stroke lesion segmentation. However, the delineation of small lesions, critical for stroke diagnosis, remains a challenge. In this study, we propose two straightforward yet powerful approaches that can be seamlessly integrated into a variety of networks: Multi-Size Labeling (MSL) and Distance-Based Labeling (DBL), with the aim of enhancing the segmentation accuracy of small lesions. MSL divides lesion masks into various categories based on lesion volume while DBL emphasizes the lesion boundaries. Experimental evaluations on the Anatomical Tracings of Lesions After Stroke (ATLAS) v2.0 dataset showcase that an ensemble of MSL and DBL achieves consistently better or equal performance on recall (3.6% and 3.7%), F1 (2.4% and 1.5%), and Dice scores (1.3% and 0.0%) compared to the top-1 winner of the 2022 MICCAI ATLAS Challenge on both the subset only containing small lesions and the entire dataset, respectively. Notably, on the mini-lesion subset, a single MSL model surpasses the previous best ensemble strategy, with enhancements of 1.0% and 0.3% on F1 and Dice scores, respectively. Our code is available at: this https URL.
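
A minimal sketch of the Multi-Size Labeling idea: split a binary lesion mask into connected components and relabel each by its volume. The volume thresholds and the three-category scheme below are placeholders for illustration, not the values used in the paper.

```python
import numpy as np
from scipy import ndimage

def multi_size_labels(mask: np.ndarray, small_vox: int = 100, medium_vox: int = 1000) -> np.ndarray:
    """Relabel a binary lesion mask into volume-based categories (illustrative MSL variant).

    0 = background, 1 = small lesion, 2 = medium lesion, 3 = large lesion.
    """
    labeled, num = ndimage.label(mask > 0)
    out = np.zeros_like(mask, dtype=np.uint8)
    for comp_id in range(1, num + 1):
        comp = labeled == comp_id
        vol = int(comp.sum())
        if vol <= small_vox:
            out[comp] = 1
        elif vol <= medium_vox:
            out[comp] = 2
        else:
            out[comp] = 3
    return out

# Toy 3D example with one small and one large lesion.
mask = np.zeros((32, 32, 32), dtype=np.uint8)
mask[2:4, 2:4, 2:4] = 1
mask[10:25, 10:25, 10:25] = 1
categories = multi_size_labels(mask)   # multi-class target emphasizing small lesions
```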

[CV-40] Evaluation of Segment Anything Model 2: The Role of SAM2 in the Underwater Environment

链接: https://arxiv.org/abs/2408.02924
作者: Shijie Lian,Hua Li
关键词-EN: underwater visualization tasks, segment underwater instances, underwater instance segmentation, large-scale modeling, academic community
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:With breakthroughs in large-scale modeling, the Segment Anything Model (SAM) and its extensions have been attempted for applications in various underwater visualization tasks in marine sciences, and have had a significant impact on the academic community. Recently, Meta has further developed the Segment Anything Model 2 (SAM2), which significantly improves running speed and segmentation accuracy compared to its predecessor. This report aims to explore the potential of SAM2 in marine science by evaluating it on the underwater instance segmentation benchmark datasets UIIS and USIS10K. The experiments show that the performance of SAM2 is extremely dependent on the type of user-provided prompts. When using the ground truth bounding box as prompt, SAM2 performed excellently in the underwater instance segmentation domain. However, when running in automatic mode, SAM2’s ability with point prompts to sense and segment underwater instances is significantly degraded. It is hoped that this paper will inspire researchers to further explore the SAM model family in the underwater domain. The results and evaluation codes in this paper are available at this https URL.

[CV-41] Pose Magic: Efficient and Temporally Consistent Human Pose Estimation with a Hybrid Mamba-GCN Network

链接: https://arxiv.org/abs/2408.02922
作者: Xinyi Zhang,Qiqi Bao,Qinpeng Cui,Wenming Yang,Qingmin Liao
关键词-EN: Human Pose Estimation, based on Transformers, Human Pose, Pose Estimation, Pose Magic
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Current state-of-the-art (SOTA) methods in 3D Human Pose Estimation (HPE) are primarily based on Transformers. However, existing Transformer-based 3D HPE backbones often encounter a trade-off between accuracy and computational efficiency. To resolve the above dilemma, in this work, leveraging recent advances in state space models, we utilize Mamba for high-quality and efficient long-range modeling. Nonetheless, Mamba still faces challenges in precisely exploiting the local dependencies between joints. To address these issues, we propose a new attention-free hybrid spatiotemporal architecture named Hybrid Mamba-GCN (Pose Magic). This architecture introduces local enhancement with GCN by capturing relationships between neighboring joints, thus producing new representations to complement Mamba’s outputs. By adaptively fusing representations from Mamba and GCN, Pose Magic demonstrates superior capability in learning the underlying 3D structure. To meet the requirements of real-time inference, we also provide a fully causal version. Extensive experiments show that Pose Magic achieves new SOTA results (↓0.9 mm) while saving 74.1% FLOPs. In addition, Pose Magic exhibits optimal motion consistency and the ability to generalize to unseen sequence lengths.
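
The "adaptively fusing representations from Mamba and GCN" step can be pictured as a learned gate that weighs the two branches per feature. The sketch below is a generic gated fusion under that assumption; Pose Magic's actual fusion module may differ.

```python
import torch
import torch.nn as nn

class AdaptiveBranchFusion(nn.Module):
    """Illustrative gated fusion of a long-range branch (e.g. Mamba) and a local branch (e.g. GCN)."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, long_range: torch.Tensor, local: torch.Tensor) -> torch.Tensor:
        # Both inputs: (batch, joints, dim). The gate decides, per feature,
        # how much to trust each branch.
        alpha = self.gate(torch.cat([long_range, local], dim=-1))
        return alpha * long_range + (1.0 - alpha) * local

fusion = AdaptiveBranchFusion(dim=128)
mamba_out = torch.randn(2, 17, 128)   # stand-in for the Mamba branch output
gcn_out = torch.randn(2, 17, 128)     # stand-in for the GCN branch output
fused = fusion(mamba_out, gcn_out)
```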

[CV-42] Dual-View Pyramid Pooling in Deep Neural Networks for Improved Medical Image Classification and Confidence Calibration

链接: https://arxiv.org/abs/2408.02906
作者: Xiaoqing Zhang,Qiushi Nie,Zunjie Xiao,Jilu Zhao,Xiao Wu,Pengxin Guo,Runzhi Li,Jin Liu,Yanjie Wei,Yi Pan
关键词-EN: deep neural networks, neural networks, maps in deep, deep neural, CCP
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 27

点击查看摘要

Abstract:Spatial pooling (SP) and cross-channel pooling (CCP) operators have been applied to aggregate spatial features and pixel-wise features from feature maps in deep neural networks (DNNs), respectively. Their main goal is to reduce computation and memory overhead without visibly weakening the performance of DNNs. However, SP often faces the problem of losing the subtle feature representations, while CCP has a high possibility of ignoring salient feature representations, which may lead to both miscalibration of confidence issues and suboptimal medical classification results. To address these problems, we propose a novel dual-view framework, the first to systematically investigate the relative roles of SP and CCP by analyzing the difference between spatial features and pixel-wise features. Based on this framework, we propose a new pooling method, termed dual-view pyramid pooling (DVPP), to aggregate multi-scale dual-view features. DVPP aims to boost both medical image classification and confidence calibration performance by fully leveraging the merits of SP and CCP operators from a dual-axis perspective. Additionally, we discuss how to fulfill DVPP with five parameter-free implementations. Extensive experiments on six 2D/3D medical image classification tasks show that our DVPP surpasses state-of-the-art pooling methods in terms of medical image classification results and confidence calibration across different DNNs.
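
To make the SP/CCP distinction concrete, the sketch below uses mean aggregation as one common choice for each operator: spatial pooling collapses the spatial dimensions, cross-channel pooling collapses the channel dimension. This is only a plain illustration of the two views, not the DVPP module itself.

```python
import torch

def spatial_pool(feat: torch.Tensor) -> torch.Tensor:
    """Spatial pooling (SP): aggregate over H and W, keeping one value per channel."""
    # feat: (batch, channels, H, W) -> (batch, channels)
    return feat.mean(dim=(2, 3))

def cross_channel_pool(feat: torch.Tensor) -> torch.Tensor:
    """Cross-channel pooling (CCP): aggregate over channels, keeping the spatial map."""
    # feat: (batch, channels, H, W) -> (batch, 1, H, W)
    return feat.mean(dim=1, keepdim=True)

x = torch.randn(4, 256, 14, 14)
sp_view = spatial_pool(x)         # per-channel summary (spatial detail lost)
ccp_view = cross_channel_pool(x)  # per-pixel summary (channel saliency lost)
# A dual-view scheme would combine both summaries rather than pick one.
```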

[CV-43] Enabling Intelligent Traffic Systems: A Deep Learning Method for Accurate Arabic License Plate Recognition

链接: https://arxiv.org/abs/2408.02904
作者: M. A. Sayedelahl
关键词-EN: accurate Egyptian Vehicle, Egyptian Vehicle License, Vehicle License Plate, Egyptian Vehicle, License Plate Recognition
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper introduces a novel two-stage framework for accurate Egyptian Vehicle License Plate Recognition (EVLPR). The first stage employs image processing techniques to reliably localize license plates, while the second stage utilizes a custom-designed deep learning model for robust Arabic character recognition. The proposed system achieves a remarkable 99.3% accuracy on a diverse dataset, surpassing existing approaches. Its potential applications extend to intelligent traffic management, including traffic violation detection and parking optimization. Future research will focus on enhancing the system’s capabilities through architectural refinements, expanded datasets, and addressing system dependencies.

[CV-44] Lighthouse: A User-Friendly Library for Reproducible Video Moment Retrieval and Highlight Detection

链接: https://arxiv.org/abs/2408.02901
作者: Taichi Nishimura,Shota Nakada,Hokuto Munakata,Tatsuya Komatsu
关键词-EN: video moment retrieval, reproducible video moment, highlight detection, user-friendly library, video moment
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM)
*备注: 6 pages; library tech report

点击查看摘要

Abstract:We propose Lighthouse, a user-friendly library for reproducible video moment retrieval and highlight detection (MR-HD). Although researchers have proposed various MR-HD approaches, the research community faces two main issues. The first is a lack of comprehensive and reproducible experiments across various methods, datasets, and video-text features, because no unified training and evaluation codebase covers multiple settings. The second is user-unfriendly design: because previous works use different libraries, researchers must set up individual environments. In addition, most works release only the training code, requiring users to implement the whole inference process of MR-HD. Lighthouse addresses these issues by implementing a unified, reproducible codebase that includes six models, three features, and five datasets. In addition, it provides an inference API and web demo to make these methods easily accessible for researchers and developers. Our experiments demonstrate that Lighthouse generally reproduces the reported scores in the reference papers. The code is available at this https URL.

[CV-45] MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine

链接: https://arxiv.org/abs/2408.02900
作者: Yunfei Xie,Ce Zhou,Lang Gao,Juncheng Wu,Xianhang Li,Hong-Yu Zhou,Sheng Liu,Lei Xing,James Zou,Cihang Xie,Yuyin Zhou
关键词-EN: paper introduces, million images, annotations, multimodal, enriched annotations
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: The project page is at this https URL

点击查看摘要

Abstract:This paper introduces MedTrinity-25M, a comprehensive, large-scale multimodal dataset for medicine, covering over 25 million images across 10 modalities, with multigranular annotations for more than 65 diseases. These enriched annotations encompass both global textual information, such as disease/lesion type, modality, region-specific descriptions, and inter-regional relationships, as well as detailed local annotations for regions of interest (ROIs), including bounding boxes and segmentation masks. Unlike existing approaches, which are limited by the availability of image-text pairs, we have developed the first automated pipeline that scales up multimodal data by generating multigranular visual and textual annotations (in the form of image-ROI-description triplets) without the need for any paired text descriptions. Specifically, data from over 90 different sources have been collected, preprocessed, and grounded using domain-specific expert models to identify ROIs related to abnormal regions. We then build a comprehensive knowledge base and prompt multimodal large language models to perform retrieval-augmented generation with the identified ROIs as guidance, resulting in multigranular textual descriptions. Compared to existing datasets, MedTrinity-25M provides the most enriched annotations, supporting a comprehensive range of multimodal tasks such as captioning and report generation, as well as vision-centric tasks like classification and segmentation. Pretraining on MedTrinity-25M, our model achieves state-of-the-art performance on VQA-RAD and PathVQA, surpassing both multimodal large language models and other representative SoTA approaches. This dataset can also be utilized to support large-scale pre-training of multimodal medical AI models, contributing to the development of future foundation models in the medical domain.

[CV-46] Diverse Generation while Maintaining Semantic Coordination: A Diffusion-Based Data Augmentation Method for Object Detection ICPR2024

链接: https://arxiv.org/abs/2408.02891
作者: Sen Nie,Zhuo Wang,Xinxin Wang,Kun He
关键词-EN: Recent studies emphasize, Recent studies, Category Affinity Matrix, Surrounding Region Alignment, object detection models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 15 pages, 7 figures, ICPR2024

点击查看摘要

Abstract:Recent studies emphasize the crucial role of data augmentation in enhancing the performance of object detection models. However, existing methodologies often struggle to effectively harmonize dataset diversity with semantic coordination. To bridge this gap, we introduce an innovative augmentation technique leveraging pre-trained conditional diffusion models to mediate this balance. Our approach encompasses the development of a Category Affinity Matrix, meticulously designed to enhance dataset diversity, and a Surrounding Region Alignment strategy, which ensures the preservation of semantic coordination in the augmented images. Extensive experimental evaluations confirm the efficacy of our method in enriching dataset diversity while seamlessly maintaining semantic coordination. Our method yields substantial average improvements of +1.4AP, +0.9AP, and +3.4AP over existing alternatives on three distinct object detection models, respectively.

[CV-47] VizECGNet: Visual ECG Image Network for Cardiovascular Diseases Classification with Multi-Modal Training and Knowledge Distillation ICIP

链接: https://arxiv.org/abs/2408.02888
作者: Ju-Hyeon Nam,Seo-Hyung Park,Su Jung Kim,Sang-Chul Lee
关键词-EN: heart electrical signal, heart conditions, heart electrical, ECG, heart
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted in International Conference on Image Processing (ICIP) 2024

点击查看摘要

Abstract:An electrocardiogram (ECG) captures the heart’s electrical signal to assess various heart conditions. In practice, ECG data is stored as either digitized signals or printed images. Despite the emergence of numerous deep learning models for digitized signals, many hospitals prefer image storage due to cost considerations. Recognizing the unavailability of raw ECG signals in many clinical settings, we propose VizECGNet, which uses only printed ECG graphics to determine the prognosis of multiple cardiovascular diseases. During training, cross-modal attention modules (CMAM) are used to integrate information from two modalities - image and signal, while self-modality attention modules (SMAM) capture inherent long-range dependencies in ECG data of each modality. Additionally, we utilize knowledge distillation to improve the similarity between two distinct predictions from each modality stream. This innovative multi-modal deep learning architecture enables the utilization of only ECG images during inference. VizECGNet with image input achieves higher performance in precision, recall, and F1-Score compared to signal-based ECG classification models, with improvements of 3.50%, 8.21%, and 7.38%, respectively.
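
The knowledge-distillation component, pulling the image-stream and signal-stream predictions toward each other, can be sketched with a symmetric temperature-scaled KL loss. This is a generic formulation of cross-stream distillation, not necessarily the exact loss used in VizECGNet.

```python
import torch
import torch.nn.functional as F

def mutual_distillation_loss(image_logits: torch.Tensor,
                             signal_logits: torch.Tensor,
                             temperature: float = 2.0) -> torch.Tensor:
    """Symmetric KL distillation between the two modality streams' predictions."""
    t = temperature
    p_img = F.log_softmax(image_logits / t, dim=-1)
    p_sig = F.log_softmax(signal_logits / t, dim=-1)
    kl_img_to_sig = F.kl_div(p_img, p_sig.exp(), reduction="batchmean")
    kl_sig_to_img = F.kl_div(p_sig, p_img.exp(), reduction="batchmean")
    return 0.5 * (kl_img_to_sig + kl_sig_to_img) * (t * t)

img_logits = torch.randn(8, 5)   # predictions from the ECG-image stream
sig_logits = torch.randn(8, 5)   # predictions from the ECG-signal stream
loss = mutual_distillation_loss(img_logits, sig_logits)
```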

[CV-48] Body of Her: A Preliminary Study on End-to-End Humanoid Agent

链接: https://arxiv.org/abs/2408.02879
作者: Tenglong Ao
关键词-EN: Interactive virtual humanoid, virtual humanoid agent, physical world, humanoid agent, crucial interface
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Technical Report v1; Project Page: this https URL

点击查看摘要

Abstract:Interactive virtual humanoid agent is a crucial interface with the physical world. A relatively complete humanoid agent first needs to have face and body, then possess both verbal and non-verbal (such as eye contact, facial expression, lip motion, gesture, and manipulation) abilities, and finally, it is capable of real-time duplex communication, e.g., the ability to actively interrupt conversations. Most prior systems typically only consider a subset of these elements, leaving a gap from realistic humanoid agent. In this work, we propose a real-time, duplex, interactive end-to-end network capable of modeling realistic agent behaviors, including speech, full-body movements for talking, responding, idling, and manipulation. This system is a multimodal model integrating audio and visual inputs, extended from a pre-trained large language model (LLM). We collect approximately 200,000 hours of audio, around 130,000 hours of video data, and about 20,000 alignment samples to build the model. The final model demonstrates capabilities that are difficult to achieve in previous systems, such as generalized object manipulation. This work performs a preliminary exploration of the end-to-end approach in this field, aiming to inspire further research towards scaling up.

[CV-49] Analyzing Data Efficiency and Performance of Machine Learning Algorithms for Assessing Low Back Pain Physical Rehabilitation Exercises

链接: https://arxiv.org/abs/2408.02855
作者: Aleksa Marusic,Louis Annabi,Sao Msi Nguyen,Adriana Tapus
关键词-EN: active research area, Analyzing human motion, Analyzing human, research area, active research
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
*备注: European Conference on Mobile Robots (2023)

点击查看摘要

Abstract:Analyzing human motion is an active research area, with various applications. In this work, we focus on human motion analysis in the context of physical rehabilitation using a robot coach system. Computer-aided assessment of physical rehabilitation entails evaluation of patient performance in completing prescribed rehabilitation exercises, based on processing movement data captured with a sensory system, such as RGB and RGB-D cameras. As 2D and 3D human pose estimation from RGB images had made impressive improvements, we aim to compare the assessment of physical rehabilitation exercises using movement data obtained from both RGB-D camera (Microsoft Kinect) and estimation from RGB videos (OpenPose and BlazePose algorithms). A Gaussian Mixture Model (GMM) is employed from position (and orientation) features, with performance metrics defined based on the log-likelihood values from GMM. The evaluation is performed on a medical database of clinical patients carrying out low back-pain rehabilitation exercises, previously coached by robot Poppy.
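
The GMM log-likelihood scoring described here maps directly onto scikit-learn. In the sketch below, the pose features, the number of mixture components, and the averaging of per-frame log-likelihoods are illustrative placeholders; the actual feature extraction and clinical scoring follow the paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Fit a GMM on pose features from correctly executed exercises, then score new
# executions by their log-likelihood under that model.
correct_feats = np.random.randn(500, 30)   # stand-in for joint-position features per frame
gmm = GaussianMixture(n_components=4, covariance_type="full", random_state=0)
gmm.fit(correct_feats)

new_execution = np.random.randn(120, 30)          # frames of a patient's attempt
frame_loglik = gmm.score_samples(new_execution)   # per-frame log-likelihood
performance_metric = frame_loglik.mean()          # higher = closer to correct executions
```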

[CV-50] GAReT: Cross-view Video Geolocalization with Adapters and Auto-Regressive Transformers ECCV2024

链接: https://arxiv.org/abs/2408.02840
作者: Manu S Pillai,Mamshad Nayeem Rizve,Mubarak Shah
关键词-EN: Cross-view video geo-localization, aims to derive, Cross-view video, Cross-view, derive GPS trajectories
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at ECCV 2024

点击查看摘要

Abstract:Cross-view video geo-localization (CVGL) aims to derive GPS trajectories from street-view videos by aligning them with aerial-view images. Despite their promising performance, current CVGL methods face significant challenges. These methods use camera and odometry data, typically absent in real-world scenarios. They utilize multiple adjacent frames and various encoders for feature extraction, resulting in high computational costs. Moreover, these approaches independently predict each street-view frame’s location, resulting in temporally inconsistent GPS trajectories. To address these challenges, in this work, we propose GAReT, a fully transformer-based method for CVGL that does not require camera and odometry data. We introduce GeoAdapter, a transformer-adapter module designed to efficiently aggregate image-level representations and adapt them for video inputs. Specifically, we train a transformer encoder on video frames and aerial images, then freeze the encoder to optimize the GeoAdapter module to obtain video-level representation. To address temporally inconsistent trajectories, we introduce TransRetriever, an encoder-decoder transformer model that predicts GPS locations of street-view frames by encoding top-k nearest neighbor predictions per frame and auto-regressively decoding the best neighbor based on the previous frame’s predictions. Our method’s effectiveness is validated through extensive experiments, demonstrating state-of-the-art performance on benchmark datasets. Our code is available at this https URL.

[CV-51] DaCapo: a modular deep learning framework for scalable 3D image segmentation

链接: https://arxiv.org/abs/2408.02834
作者: William Patton,Jeff L. Rhoades,Marwan Zouinkhi,David G. Ackerman,Caroline Malin-Mayor,Diane Adjavon,Larissa Heinrich,Davis Bennett,Yurii Zubov,CellMap Project Team,Aubrey V. Weigel,Jan Funke
关键词-EN: specialized deep learning, deep learning library, learning library tailored, existing machine learning, machine learning approaches
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:DaCapo is a specialized deep learning library tailored to expedite the training and application of existing machine learning approaches on large, near-isotropic image data. In this correspondence, we introduce DaCapo’s unique features optimized for this specific domain, highlighting its modular structure, efficient experiment management tools, and scalable deployment capabilities. We discuss its potential to improve access to large-scale, isotropic image segmentation and invite the community to explore and contribute to this open-source initiative.

[CV-52] Mitigating Malicious Attacks in Federated Learning via Confidence-aware Defense

链接: https://arxiv.org/abs/2408.02813
作者: Qilei Li,Ahmed M. Abdelmoniem
关键词-EN: distributed machine learning, machine learning paradigm, Federated Learning, emerging distributed machine, sharing private local
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Federated Learning (FL) is an emerging distributed machine learning paradigm that allows multiple clients to collaboratively train a global model without sharing private local data. However, FL systems are vulnerable to attacks from malicious clients, who can degrade the global model performance through data poisoning and model poisoning. Existing defense methods typically focus on a single type of attack, such as Byzantine attacks or backdoor attacks, and are often ineffective against potential data poisoning attacks like label flipping and label shuffling. Additionally, these methods often lack accuracy and robustness in detecting and handling malicious updates. To address these issues, we propose a novel method based on model confidence scores, which evaluates the uncertainty of client model updates to detect and defend against malicious clients. Our approach is comprehensively effective for both model poisoning and data poisoning attacks and is capable of accurately identifying and mitigating potential malicious updates from being aggregated. Experimental results demonstrate that our method significantly improves the robustness of FL systems against various types of attacks, also achieving higher model accuracy and stability across various scenarios.
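
A minimal sketch of the confidence-aware idea: score each client update by some confidence measure and exclude statistical outliers before aggregation. The scoring (mean max-softmax on a probe set), the z-score rule, and the threshold below are assumptions for illustration, not the paper's actual detection criterion.

```python
import numpy as np

def filter_clients_by_confidence(confidences, z_thresh: float = 1.5):
    """Keep only clients whose confidence score is not an outlier within the cohort."""
    c = np.asarray(confidences, dtype=np.float64)
    z = (c - c.mean()) / (c.std() + 1e-8)          # standardize across clients
    return [i for i, zi in enumerate(z) if abs(zi) <= z_thresh]

# Toy usage: client 3 reports an anomalously low confidence and is excluded.
scores = [0.91, 0.88, 0.90, 0.35, 0.89]
kept = filter_clients_by_confidence(scores)        # -> [0, 1, 2, 4]
# The server would then average only the kept clients' model updates.
```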

[CV-53] SiCo: A Size-Controllable Virtual Try-On Approach for Informed Decision-Making

链接: https://arxiv.org/abs/2408.02803
作者: Sherry X. Chen,Alex Christopher Lim,Yimeng Liu,Pradeep Sen,Misha Sra
关键词-EN: Virtual try-on, applications aim, making purchase decisions, aim to improve, VTO
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Virtual try-on (VTO) applications aim to improve the online shopping experience by allowing users to preview garments, before making purchase decisions. However, many VTO tools fail to consider the crucial relationship between a garment’s size and the user’s body size, often employing a one-size-fits-all approach when visualizing a clothing item. This results in poor size recommendations and purchase decisions leading to increased return rates. To address this limitation, we introduce SiCo, an online VTO system, where users can upload images of themselves and visualize how different sizes of clothing would look on their body to help make better-informed purchase decisions. Our user study shows SiCo’s superiority over baseline VTO. The results indicate that our approach significantly enhances user ability to gauge the appearance of outfits on their bodies and boosts their confidence in selecting clothing sizes that match desired goals. Based on our evaluation, we believe our VTO design has the potential to reduce return rates and enhance the online clothes shopping experience. Our code is available at this https URL.

[CV-54] Gaussian Mixture based Evidential Learning for Stereo Matching

链接: https://arxiv.org/abs/2408.02796
作者: Weide Liu,Xingxing Wang,Lu Wang,Jun Cheng,Fayao Liu,Xulei Yang
关键词-EN: Gaussian mixture based, based evidential learning, evidential learning solution, robust stereo matching, mixture based evidential
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In this paper, we introduce a novel Gaussian mixture based evidential learning solution for robust stereo matching. Diverging from previous evidential deep learning approaches that rely on a single Gaussian distribution, our framework posits that individual image data adheres to a mixture-of-Gaussian distribution in stereo matching. This assumption yields more precise pixel-level predictions and more accurately mirrors the real-world image distribution. By further employing the inverse-Gamma distribution as an intermediary prior for each mixture component, our probabilistic model achieves improved depth estimation compared to its counterpart with the single Gaussian and effectively captures the model uncertainty, which enables a strong cross-domain generation ability. We evaluated our method for stereo matching by training the model using the Scene Flow dataset and testing it on KITTI 2015 and Middlebury 2014. The experiment results consistently show that our method brings improvements over the baseline methods in a trustworthy manner. Notably, our approach achieved new state-of-the-art results on both the in-domain validated data and the cross-domain datasets, demonstrating its effectiveness and robustness in stereo matching tasks.

[CV-55] Lesion Elevation Prediction from Skin Images Improves Diagnosis MICCAI

链接: https://arxiv.org/abs/2408.02792
作者: Kumar Abhishek,Ghassan Hamarneh
关键词-EN: skin lesion elevation, incorporating additional features, skin lesion, lesion elevation labels, deep learning-based computer-aided
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Medical Image Computing and Computer-Assisted Intervention (MICCAI) ISIC Skin Image Analysis Workshop (MICCAI ISIC) 2024; 12 pages, 2 tables, 4 figures

点击查看摘要

Abstract:While deep learning-based computer-aided diagnosis for skin lesion image analysis is approaching dermatologists’ performance levels, there are several works showing that incorporating additional features such as shape priors, texture, color constancy, and illumination further improves the lesion diagnosis performance. In this work, we look at another clinically useful feature, skin lesion elevation, and investigate the feasibility of predicting and leveraging skin lesion elevation labels. Specifically, we use a deep learning model to predict image-level lesion elevation labels from 2D skin lesion images. We test the elevation prediction accuracy on the derm7pt dataset, and use the elevation prediction model to estimate elevation labels for images from five other datasets: ISIC 2016, 2017, and 2018 Challenge datasets, MSK, and DermoFit. We evaluate cross-domain generalization by using these estimated elevation labels as auxiliary inputs to diagnosis models, and show that these improve the classification performance, with AUROC improvements of up to 6.29% and 2.69% for dermoscopic and clinical images, respectively. The code is publicly available at this https URL.

[CV-56] GazeXplain: Learning to Predict Natural Language Explanations of Visual Scanpaths ECCV2024

链接: https://arxiv.org/abs/2408.02788
作者: Xianyu Chen,Ming Jiang,Qi Zhao
关键词-EN: exploring visual scenes, visual scanpath prediction, scanpath prediction, visual, visual scenes
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: To appear in ECCV2024

点击查看摘要

Abstract:While exploring visual scenes, humans’ scanpaths are driven by their underlying attention processes. Understanding visual scanpaths is essential for various applications. Traditional scanpath models predict the where and when of gaze shifts without providing explanations, creating a gap in understanding the rationale behind fixations. To bridge this gap, we introduce GazeXplain, a novel study of visual scanpath prediction and explanation. This involves annotating natural-language explanations for fixations across eye-tracking datasets and proposing a general model with an attention-language decoder that jointly predicts scanpaths and generates explanations. It integrates a unique semantic alignment mechanism to enhance the consistency between fixations and explanations, alongside a cross-dataset co-training approach for generalization. These novelties present a comprehensive and adaptable solution for explainable human visual scanpath prediction. Extensive experiments on diverse eye-tracking datasets demonstrate the effectiveness of GazeXplain in both scanpath prediction and explanation, offering valuable insights into human visual attention and cognitive processes.

[CV-57] Segmentation Style Discovery: Application to Skin Lesion Images MICCAI

链接: https://arxiv.org/abs/2408.02787
作者: Kumar Abhishek,Jeremy Kawahara,Ghassan Hamarneh
关键词-EN: Variability in medical, medical image segmentation, choice of tools, medical image, annotator preferences
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Medical Image Computing and Computer-Assisted Intervention (MICCAI) ISIC Skin Image Analysis Workshop (MICCAI ISIC) 2024; 13 pages, 2 tables, 3 figures

点击查看摘要

Abstract:Variability in medical image segmentation, arising from annotator preferences, expertise, and their choice of tools, has been well documented. While the majority of multi-annotator segmentation approaches focus on modeling annotator-specific preferences, they require annotator-segmentation correspondence. In this work, we introduce the problem of segmentation style discovery, and propose StyleSeg, a segmentation method that learns plausible, diverse, and semantically consistent segmentation styles from a corpus of image-mask pairs without any knowledge of annotator correspondence. StyleSeg consistently outperforms competing methods on four publicly available skin lesion segmentation (SLS) datasets. We also curate ISIC-MultiAnnot, the largest multi-annotator SLS dataset with annotator correspondence, and our results show a strong alignment, using our newly proposed measure AS2, between the predicted styles and annotator preferences. The code and the dataset are available at this https URL.

[CV-58] LR-Net: A Lightweight and Robust Network for Infrared Small Target Detection

链接: https://arxiv.org/abs/2408.02780
作者: Chuang Yu,Yunpeng Liu,Jinmiao Zhao,Zelin Shi
关键词-EN: small target detection, infrared small target, difficulty meeting actual, meeting actual comprehensive, Limited by equipment
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Constrained by equipment limitations and the lack of intrinsic target features, existing infrared small target detection methods have difficulty meeting actual comprehensive performance requirements. Therefore, we propose an innovative lightweight and robust network (LR-Net), which abandons the complex structure and achieves an effective balance between detection accuracy and resource consumption. Specifically, to ensure lightweight design and robustness, on the one hand, we construct a lightweight feature extraction attention (LFEA) module, which can fully extract target features and strengthen information interaction across channels. On the other hand, we construct a simple refined feature transfer (RFT) module. Compared with direct cross-layer connections, the RFT module can improve the network’s feature refinement extraction capability with little resource consumption. Meanwhile, to solve the problem of small target loss in high-level feature maps, on the one hand, we propose a low-level feature distribution (LFD) strategy to use low-level features to supplement the information of high-level features. On the other hand, we introduce an efficient simplified bilinear interpolation attention module (SBAM) to promote the guidance constraints of low-level features on high-level features and the fusion of the two. In addition, we abandon the traditional resizing method and adopt a new training and inference cropping strategy, which is more robust to datasets with multi-scale samples. Extensive experimental results show that our LR-Net achieves state-of-the-art (SOTA) performance. Notably, on the basis of the proposed LR-Net, we achieve 3rd place in the “ICPR 2024 Resource-Limited Infrared Small Target Detection Challenge Track 2: Lightweight Infrared Small Target Detection”.

[CV-59] Refined Infrared Small Target Detection Scheme with Single-Point Supervision

链接: https://arxiv.org/abs/2408.02773
作者: Jinmiao Zhao,Zelin Shi,Chuang Yu,Yunpeng Liu
关键词-EN: infrared small target, small target detection, small target, infrared small, target detection
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recently, infrared small target detection with single-point supervision has attracted extensive attention. However, the detection accuracy of existing methods has difficulty meeting actual needs. Therefore, we propose an innovative refined infrared small target detection scheme with single-point supervision, which has excellent segmentation accuracy and detection rate. Specifically, we introduce label evolution with single point supervision (LESPS) framework and explore the performance of various excellent infrared small target detection networks based on this framework. Meanwhile, to improve the comprehensive performance, we construct a complete post-processing strategy. On the one hand, to improve the segmentation accuracy, we use a combination of test-time augmentation (TTA) and conditional random field (CRF) for post-processing. On the other hand, to improve the detection rate, we introduce an adjustable sensitivity (AS) strategy for post-processing, which fully considers the advantages of multiple detection results and reasonably adds some areas with low confidence to the fine segmentation image in the form of centroid points. In addition, to further improve the performance and explore the characteristics of this task, on the one hand, we construct and find that a multi-stage loss is helpful for fine-grained detection. On the other hand, we find that a reasonable sliding window cropping strategy for test samples has better performance for actual multi-size samples. Extensive experimental results show that the proposed scheme achieves state-of-the-art (SOTA) performance. Notably, the proposed scheme won the third place in the “ICPR 2024 Resource-Limited Infrared Small Target Detection Challenge Track 1: Weakly Supervised Infrared Small Target Detection”.

[CV-60] From Recognition to Prediction: Leveraging Sequence Reasoning for Action Anticipation

链接: https://arxiv.org/abs/2408.02769
作者: Xin Liu,Chao Hao,Zitong Yu,Huanjing Yue,Jingyu Yang
关键词-EN: action anticipation task, anticipation task refers, action anticipation, anticipation task, refers to predicting
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by ACM TOMM

点击查看摘要

Abstract:The action anticipation task refers to predicting what action will happen based on observed videos, which requires the model to have a strong ability to summarize the present and then reason about the future. Experience and common sense suggest that there is a significant correlation between different actions, which provides valuable prior knowledge for the action anticipation task. However, previous methods have not effectively modeled this underlying statistical relationship. To address this issue, we propose a novel end-to-end video modeling architecture that utilizes attention mechanisms, named Anticipation via Recognition and Reasoning (ARR). ARR decomposes the action anticipation task into action recognition and sequence reasoning tasks, and effectively learns the statistical relationship between actions by next action prediction (NAP). In comparison to existing temporal aggregation strategies, ARR is able to extract more effective features from observable videos to make more reasonable predictions. In addition, to address the challenge of relationship modeling that requires extensive training data, we propose an innovative approach for the unsupervised pre-training of the decoder, which leverages the inherent temporal dynamics of video to enhance the reasoning capabilities of the network. Extensive experiments on the Epic-kitchen-100, EGTEA Gaze+, and 50salads datasets demonstrate the efficacy of the proposed methods. The code is available at this https URL.

[CV-61] ConDL: Detector-Free Dense Image Matching

链接: https://arxiv.org/abs/2408.02766
作者: Monika Kwiatkowski,Simon Matern,Olaf Hellwich
关键词-EN: deep-learning framework designed, dense image correspondences, estimating dense image, introduce a deep-learning, deep-learning framework
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this work, we introduce a deep-learning framework designed for estimating dense image correspondences. Our fully convolutional model generates dense feature maps for images, where each pixel is associated with a descriptor that can be matched across multiple images. Unlike previous methods, our model is trained on synthetic data that includes significant distortions, such as perspective changes, illumination variations, shadows, and specular highlights. Utilizing contrastive learning, our feature maps achieve greater invariance to these distortions, enabling robust matching. Notably, our method eliminates the need for a keypoint detector, setting it apart from many existing image-matching techniques.
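
The contrastive objective over matched pixel descriptors can be written as a standard InfoNCE loss, sketched below. This shows the kind of training signal a detector-free dense matcher like ConDL could use; the temperature, sampling of matched locations, and the synthetic-distortion stand-in are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dense_infonce_loss(desc_a: torch.Tensor, desc_b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE over corresponding pixel descriptors from two views of the same scene.

    desc_a, desc_b: (N, D) descriptors at N matched pixel locations;
    row i of desc_a corresponds to row i of desc_b.
    """
    a = F.normalize(desc_a, dim=-1)
    b = F.normalize(desc_b, dim=-1)
    logits = a @ b.t() / temperature                    # similarity of every descriptor pair
    targets = torch.arange(a.size(0), device=a.device)  # the true match is the positive
    return F.cross_entropy(logits, targets)

# Toy usage: descriptors sampled from an original and a heavily distorted rendering.
d_orig = torch.randn(512, 128)
d_warp = d_orig + 0.1 * torch.randn(512, 128)           # stand-in for the distorted view
loss = dense_infonce_loss(d_orig, d_warp)
```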

[CV-62] Dimensionality Reduction and Nearest Neighbors for Improving Out-of-Distribution Detection in Medical Image Segmentation

链接: https://arxiv.org/abs/2408.02761
作者: McKell Woodland,Nihil Patel,Austin Castelo,Mais Al Taie,Mohamed Eltaher,Joshua P. Yung,Tucker J. Netherton,Tiffany L. Calderone,Jessica I. Sanchez,Darrel W. Cleere,Ahmed Elsaiey,Nakul Gupta,David Victor,Laura Beretta,Ankit B. Patel,Kristy K. Brock
关键词-EN: Clinically deployed deep, deployed deep learning-based, deep learning-based segmentation, Clinically deployed, learning-based segmentation models
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Expansion of “Dimensionality Reduction for Improving Out-of-Distribution Detection in Medical Image Segmentation” arXiv:2308.03723 . Submitted to the Journal for Machine Learning in Biomedical Imaging. Code available at this https URL

点击查看摘要

Abstract:Clinically deployed deep learning-based segmentation models are known to fail on data outside of their training distributions. While clinicians review the segmentations, these models tend to perform well in most instances, which could exacerbate automation bias. Therefore, detecting out-of-distribution images at inference is critical to warn the clinicians that the model likely failed. This work applied the Mahalanobis distance (MD) post hoc to the bottleneck features of four Swin UNETR and nnU-net models that segmented the liver on T1-weighted magnetic resonance imaging and computed tomography. By reducing the dimensions of the bottleneck features with either principal component analysis or uniform manifold approximation and projection, images the models failed on were detected with high performance and minimal computational load. In addition, this work explored a non-parametric alternative to the MD, a k-th nearest neighbors distance (KNN). KNN drastically improved scalability and performance over MD when both were applied to raw and average-pooled bottleneck features.
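
The dimensionality-reduction plus k-th nearest-neighbor distance recipe described here maps cleanly onto scikit-learn. In the sketch below, the feature dimensionality, number of PCA components, k, and the percentile threshold are placeholders; extracting the bottleneck features from the segmentation model is assumed to happen elsewhere.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

# Fit the detector on bottleneck features from in-distribution training images,
# then flag test images whose k-th nearest-neighbor distance is large.
train_feats = np.random.randn(1000, 768)   # stand-in for pooled bottleneck features
test_feats = np.random.randn(50, 768)

pca = PCA(n_components=32).fit(train_feats)   # dimensionality reduction step
train_low = pca.transform(train_feats)
test_low = pca.transform(test_feats)

k = 10
nn_index = NearestNeighbors(n_neighbors=k).fit(train_low)
dists, _ = nn_index.kneighbors(test_low)
ood_score = dists[:, -1]                      # distance to the k-th neighbor
threshold = np.percentile(ood_score, 95)      # placeholder operating point
is_ood = ood_score > threshold                # images to flag for clinician review
```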

[CV-63] Diffusion Models as Data Mining Tools ECCV2024

链接: https://arxiv.org/abs/2408.02752
作者: Ioannis Siglidis,Aleksander Holynski,Alexei A. Efros,Mathieu Aubry,Shiry Ginosar
关键词-EN: generative models trained, paper demonstrates, synthesis as tools, generative models, data
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Project Page: this https URL Accepted in ECCV 2024

点击查看摘要

Abstract:This paper demonstrates how to use generative models trained for image synthesis as tools for visual data mining. Our insight is that since contemporary generative models learn an accurate representation of their training data, we can use them to summarize the data by mining for visual patterns. Concretely, we show that after finetuning conditional diffusion models to synthesize images from a specific dataset, we can use these models to define a typicality measure on that dataset. This measure assesses how typical visual elements are for different data labels, such as geographic location, time stamps, semantic labels, or even the presence of a disease. This analysis-by-synthesis approach to data mining has two key advantages. First, it scales much better than traditional correspondence-based approaches since it does not require explicitly comparing all pairs of visual elements. Second, while most previous works on visual data mining focus on a single dataset, our approach works on diverse datasets in terms of content and scale, including a historical car dataset, a historical face dataset, a large worldwide street-view dataset, and an even larger scene dataset. Furthermore, our approach allows for translating visual elements across class labels and analyzing consistent changes.

[CV-64] Privacy-Safe Iris Presentation Attack Detection

链接: https://arxiv.org/abs/2408.02750
作者: Mahsa Mitcheff,Patrick Tinsley,Adam Czajka
关键词-EN: presentation attack detection, iris PAD, iris presentation attack, iris PAD methods, iris PAD benchmarks
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:This paper proposes a framework for a privacy-safe iris presentation attack detection (PAD) method, designed solely with synthetically-generated, identity-leakage-free iris images. Once trained, the method is evaluated in a classical way using state-of-the-art iris PAD benchmarks. We designed two generative models for the synthesis of ISO/IEC 19794-6-compliant iris images. The first model synthesizes bona fide-looking samples. To avoid “identity leakage,” the generated samples that accidentally matched those used in the model’s training were excluded. The second model synthesizes images of irises with textured contact lenses and is conditioned by a given contact lens brand to have better control over textured contact lens appearance when forming the training set. Our experiments demonstrate that models trained solely on synthetic data achieve a lower but still reasonable performance when compared to solutions trained with iris images collected from human subjects. This is the first-of-its-kind attempt to use solely synthetic data to train a fully-functional iris PAD solution, and despite the performance gap between regular and the proposed methods, this study demonstrates that with the increasing fidelity of generative models, creating such privacy-safe iris PAD methods may be possible. The source codes and generative models trained for this work are offered along with the paper.

[CV-65] MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models

链接: https://arxiv.org/abs/2408.02718
作者: Fanqing Meng,Jin Wang,Chuanhao Li,Quanfeng Lu,Hao Tian,Jiaqi Liao,Xizhou Zhu,Jifeng Dai,Yu Qiao,Ping Luo,Kaipeng Zhang,Wenqi Shao
关键词-EN: Large Vision-Language Models, crucial for Large, Large Vision-Language, process multiple images, capability to process
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project Page: this https URL

点击查看摘要

Abstract:The capability to process multiple images is crucial for Large Vision-Language Models (LVLMs) to develop a more thorough and nuanced understanding of a scene. Recent multi-image LVLMs have begun to address this need. However, their evaluation has not kept pace with their development. To fill this gap, we introduce the Multimodal Multi-image Understanding (MMIU) benchmark, a comprehensive evaluation suite designed to assess LVLMs across a wide range of multi-image tasks. MMIU encompasses 7 types of multi-image relationships, 52 tasks, 77K images, and 11K meticulously curated multiple-choice questions, making it the most extensive benchmark of its kind. Our evaluation of 24 popular LVLMs, including both open-source and proprietary models, reveals significant challenges in multi-image comprehension, particularly in tasks involving spatial understanding. Even the most advanced models, such as GPT-4o, achieve only 55.7% accuracy on MMIU. Through multi-faceted analytical experiments, we identify key performance gaps and limitations, providing valuable insights for future model and data improvements. We aim for MMIU to advance the frontier of LVLM research and development, moving us toward achieving sophisticated multimodal multi-image user interactions.

[CV-66] RCDM: Enabling Robustness for Conditional Diffusion Model

链接: https://arxiv.org/abs/2408.02710
作者: Weifeng Xu,Xiang Zhu,Xiaoyong Li
关键词-EN: conditional diffusion model, standard diffusion model, Robust Conditional Diffusion, diffusion model, conditional diffusion
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The conditional diffusion model (CDM) enhances the standard diffusion model by providing more control, improving the quality and relevance of the outputs, and making the model adaptable to a wider range of complex tasks. However, inaccurate conditional inputs in the inverse process of CDM can easily lead to generating fixed errors in the neural network, which diminishes the adaptability of a well-trained model. The existing methods like data augmentation, adversarial training, robust optimization can improve the robustness, while they often face challenges such as high computational complexity, limited applicability to unknown perturbations, and increased training difficulty. In this paper, we propose a lightweight solution, the Robust Conditional Diffusion Model (RCDM), based on control theory to dynamically reduce the impact of noise and significantly enhance the model’s robustness. RCDM leverages the collaborative interaction between two neural networks, along with optimal control strategies derived from control theory, to optimize the weights of two networks during the sampling process. Unlike conventional techniques, RCDM establishes a mathematical relationship between fixed errors and the weights of the two neural networks without incurring additional computational overhead. Extensive experiments were conducted on MNIST and CIFAR-10 datasets, and the results demonstrate the effectiveness and adaptability of our proposed model.

[CV-67] Compositional Physical Reasoning of Objects and Events from Videos

链接: https://arxiv.org/abs/2408.02687
作者: Zhenfang Chen,Shilong Dong,Kexin Yi,Yunzhu Li,Mingyu Ding,Antonio Torralba,Joshua B. Tenenbaum,Chuang Gan
关键词-EN: hidden physical properties, physical properties, physical, objects’ physical properties, hidden physical
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: arXiv admin note: text overlap with arXiv:2205.01089

点击查看摘要

Abstract:Understanding and reasoning about objects’ physical properties in the natural world is a fundamental challenge in artificial intelligence. While some properties like colors and shapes can be directly observed, others, such as mass and electric charge, are hidden from the objects’ visual appearance. This paper addresses the unique challenge of inferring these hidden physical properties from objects’ motion and interactions and predicting corresponding dynamics based on the inferred physical properties. We first introduce the Compositional Physical Reasoning (ComPhy) dataset. For a given set of objects, ComPhy includes limited videos of them moving and interacting under different initial conditions. The model is evaluated based on its capability to unravel the compositional hidden properties, such as mass and charge, and use this knowledge to answer a set of questions. Besides the synthetic videos from simulators, we also collect a real-world dataset to show further test physical reasoning abilities of different models. We evaluate state-of-the-art video reasoning models on ComPhy and reveal their limited ability to capture these hidden properties, which leads to inferior performance. We also propose a novel neuro-symbolic framework, Physical Concept Reasoner (PCR), that learns and reasons about both visible and hidden physical properties from question answering. After training, PCR demonstrates remarkable capabilities. It can detect and associate objects across frames, ground visible and hidden physical properties, make future and counterfactual predictions, and utilize these extracted representations to answer challenging questions.

[CV-68] On Biases in a UK Biobank-based Retinal Image Classification Model MICCAI

链接: https://arxiv.org/abs/2408.02676
作者: Anissa Alloula,Rima Mustafa,Daniel R McGowan,Bartłomiej W. Papież
关键词-EN: Recent work, uncovered alarming disparities, machine learning models, work has uncovered, uncovered alarming
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Image and Video Processing (eess.IV)
*备注: To appear at MICCAI FAIMI Workshop 2024

点击查看摘要

Abstract:Recent work has uncovered alarming disparities in the performance of machine learning models in healthcare. In this study, we explore whether such disparities are present in the UK Biobank fundus retinal images by training and evaluating a disease classification model on these images. We assess possible disparities across various population groups and find substantial differences despite strong overall performance of the model. In particular, we discover unfair performance for certain assessment centres, which is surprising given the rigorous data standardisation protocol. We compare how these differences emerge and apply a range of existing bias mitigation methods to each one. A key insight is that each disparity has unique properties and responds differently to the mitigation methods. We also find that these methods are largely unable to enhance fairness, highlighting the need for better bias mitigation methods tailored to the specific type of bias.

[CV-69] On Feasibility of Intent Obfuscating Attacks

链接: https://arxiv.org/abs/2408.02674
作者: Zhaobin Li,Patrick Shafto
关键词-EN: avoid culpability, common tactic, adversarial situations, Intent obfuscation, enabling the attacker
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
*备注: 31 pages, 18 Figures. Includes technical appendix. To be published in AIES 2024

点击查看摘要

Abstract:Intent obfuscation is a common tactic in adversarial situations, enabling the attacker to both manipulate the target system and avoid culpability. Surprisingly, it has rarely been implemented in adversarial attacks on machine learning systems. We are the first to propose incorporating intent obfuscation in generating adversarial examples for object detectors: by perturbing another non-overlapping object to disrupt the target object, the attacker hides their intended target. We conduct a randomized experiment on 5 prominent detectors – YOLOv3, SSD, RetinaNet, Faster R-CNN, and Cascade R-CNN – using both targeted and untargeted attacks and achieve success on all models and attacks. We analyze the success factors characterizing intent obfuscating attacks, including target object confidence and perturb object sizes. We then demonstrate that the attacker can exploit these success factors to increase success rates for all models and attacks. Finally, we discuss known defenses and legal repercussions.

[CV-70] Segment Anything in Medical Images and Videos: Benchmark and Deployment

链接: https://arxiv.org/abs/2408.03322
作者: Jun Ma,Sumin Kim,Feifei Li,Mohammed Baharoon,Reza Asakereh,Hongwei Lyu,Bo Wang
关键词-EN: data remains unclear, medical data remains, Recent advances, segmentation foundation models, remains unclear
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recent advances in segmentation foundation models have enabled accurate and efficient segmentation across a wide range of natural images and videos, but their utility to medical data remains unclear. In this work, we first present a comprehensive benchmarking of the Segment Anything Model 2 (SAM2) across 11 medical image modalities and videos and point out its strengths and weaknesses by comparing it to SAM1 and MedSAM. Then, we develop a transfer learning pipeline and demonstrate SAM2 can be quickly adapted to the medical domain by fine-tuning. Furthermore, we implement SAM2 as a 3D slicer plugin and Gradio API for efficient 3D image and video segmentation. The code has been made publicly available at this https URL.

[CV-71] SGSR: Structure-Guided Multi-Contrast MRI Super-Resolution via Spatio-Frequency Co-Query Attention

链接: https://arxiv.org/abs/2408.03194
作者: Shaoming Zheng,Yinsong Wang,Siyi Du,Chen Qin
关键词-EN: Magnetic Resonance Imaging, Magnetic Resonance, Resonance Imaging, leading diagnostic modality, range of exams
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: The 15th International Workshop on Machine Learning in Medical Imaging (MLMI 2024)

点击查看摘要

Abstract:Magnetic Resonance Imaging (MRI) is a leading diagnostic modality for a wide range of exams, where multiple contrast images are often acquired for characterizing different tissues. However, acquiring high-resolution MRI typically extends scan time, which can introduce motion artifacts. Super-resolution of MRI therefore emerges as a promising approach to mitigate these challenges. Earlier studies have investigated the use of multiple contrasts for MRI super-resolution (MCSR), whereas majority of them did not fully exploit the rich contrast-invariant structural information. To fully utilize such crucial prior knowledge of multi-contrast MRI, in this work, we propose a novel structure-guided MCSR (SGSR) framework based on a new spatio-frequency co-query attention (CQA) mechanism. Specifically, CQA performs attention on features of multiple contrasts with a shared structural query, which is particularly designed to extract, fuse, and refine the common structures from different contrasts. We further propose a novel frequency-domain CQA module in addition to the spatial domain, to enable more fine-grained structural refinement. Extensive experiments on fastMRI knee data and low-field brain MRI show that SGSR outperforms state-of-the-art MCSR methods with statistical significance.
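
As a rough illustration of what a "co-query attention" over multiple contrasts might look like, the sketch below uses a single shared structural query that attends over the pooled tokens of all contrasts. This covers only the spatial-domain idea; the frequency-domain branch and the rest of the SGSR architecture are not reproduced, and all module and dimension names are assumptions.

```python
import torch
import torch.nn as nn

class CoQueryAttention(nn.Module):
    """Attend over features of several MRI contrasts with one shared
    structural query (spatial-domain sketch only)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))  # shared structural query
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, contrast_feats):
        # contrast_feats: list of (B, N, dim) token maps, one per contrast
        kv = torch.cat(contrast_feats, dim=1)      # pool tokens of all contrasts
        q = self.query.expand(kv.size(0), -1, -1)
        fused, _ = self.attn(q, kv, kv)            # estimate of the common structure
        return fused                               # (B, 1, dim)

# toy usage with three contrasts
feats = [torch.randn(2, 64, 32) for _ in range(3)]
print(CoQueryAttention(32)(feats).shape)  # torch.Size([2, 1, 32])
```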

[CV-72] Training-Free Condition Video Diffusion Models for single frame Spatial-Semantic Echocardiogram Synthesis MICCAI2024

链接: https://arxiv.org/abs/2408.03035
作者: Van Phi Nguyen,Tri Nhan Luong Ha,Huy Hieu Pham,Quoc Long Tran
关键词-EN: Conditional video diffusion, Conditional video, shown promising results, video diffusion models, video synthesis
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to MICCAI 2024

点击查看摘要

Abstract:Conditional video diffusion models (CDM) have shown promising results for video synthesis, potentially enabling the generation of realistic echocardiograms to address the problem of data scarcity. However, current CDMs require a paired segmentation map and echocardiogram dataset. We present a new method called Free-Echo for generating realistic echocardiograms from a single end-diastolic segmentation map without additional training data. Our method is based on the 3D-Unet with Temporal Attention Layers model and is conditioned on the segmentation map using a training-free conditioning method based on SDEdit. We evaluate our model on two public echocardiogram datasets, CAMUS and EchoNet-Dynamic. We show that our model can generate plausible echocardiograms that are spatially aligned with the input segmentation map, achieving performance comparable to training-based CDMs. Our work opens up new possibilities for generating echocardiograms from a single segmentation map, which can be used for data augmentation, domain adaptation, and other applications in medical imaging. Our code is available at this https URL

[CV-73] VisionUnite: A Vision-Language Foundation Model for Ophthalmology Enhanced with Clinical Knowledge

链接: https://arxiv.org/abs/2408.02865
作者: Zihan Li,Diping Song,Zefeng Yang,Deming Wang,Fei Li,Xiulan Zhang,Paul E. Kinahan,Yu Qiao
关键词-EN: improved diagnostic methods, advanced equipment, developed regions, regions with limited, limited access
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The need for improved diagnostic methods in ophthalmology is acute, especially in the less developed regions with limited access to specialists and advanced equipment. Therefore, we introduce VisionUnite, a novel vision-language foundation model for ophthalmology enhanced with clinical knowledge. VisionUnite has been pretrained on an extensive dataset comprising 1.24 million image-text pairs, and further refined using our proposed MMFundus dataset, which includes 296,379 high-quality fundus image-text pairs and 889,137 simulated doctor-patient dialogue instances. Our experiments indicate that VisionUnite outperforms existing generative foundation models such as GPT-4V and Gemini Pro. It also demonstrates diagnostic capabilities comparable to junior ophthalmologists. VisionUnite performs well in various clinical scenarios including open-ended multi-disease diagnosis, clinical explanation, and patient interaction, making it a highly versatile tool for initial ophthalmic disease screening. VisionUnite can also serve as an educational aid for junior ophthalmologists, accelerating their acquisition of knowledge regarding both common and rare ophthalmic conditions. VisionUnite represents a significant advancement in ophthalmology, with broad implications for diagnostics, medical education, and understanding of disease mechanisms.

[CV-74] Multistain Pretraining for Slide Representation Learning in Pathology ECCV’24

链接: https://arxiv.org/abs/2408.02859
作者: Guillaume Jaume,Anurag Vaidya,Andrew Zhang,Andrew H. Song,Richard J. Chen,Sharifa Sahai,Dandan Mo,Emilio Madrigal,Long Phi Le,Faisal Mahmood
关键词-EN: Developing self-supervised learning, Developing self-supervised, gigapixel whole-slide images, computational pathology, learn universal
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: ECCV’24

点击查看摘要

Abstract:Developing self-supervised learning (SSL) models that can learn universal and transferable representations of H&E gigapixel whole-slide images (WSIs) is becoming increasingly valuable in computational pathology. These models hold the potential to advance critical tasks such as few-shot classification, slide retrieval, and patient stratification. Existing approaches for slide representation learning extend the principles of SSL from small images (e.g., 224 x 224 patches) to entire slides, usually by aligning two different augmentations (or views) of the slide. Yet the resulting representation remains constrained by the limited clinical and biological diversity of the views. Instead, we postulate that slides stained with multiple markers, such as immunohistochemistry, can be used as different views to form a rich task-agnostic training signal. To this end, we introduce Madeleine, a multimodal pretraining strategy for slide representation learning. Madeleine is trained with a dual global-local cross-stain alignment objective on large cohorts of breast cancer samples (N=4,211 WSIs across five stains) and kidney transplant samples (N=12,070 WSIs across four stains). We demonstrate the quality of slide representations learned by Madeleine on various downstream evaluations, ranging from morphological and molecular classification to prognostic prediction, comprising 21 tasks using 7,299 WSIs from multiple medical centers. Code is available at this https URL.

[CV-75] Scribble-Based Interactive Segmentation of Medical Hyperspectral Images

链接: https://arxiv.org/abs/2408.02708
作者: Zhonghao Wang,Junwen Wang,Charlie Budd,Oscar MacCormac,Jonathan Shapey,Tom Vercauteren
关键词-EN: broad spectral range, medical imaging modality, advanced medical imaging, captures optical data, imaging modality
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Hyperspectral imaging (HSI) is an advanced medical imaging modality that captures optical data across a broad spectral range, providing novel insights into the biochemical composition of tissues. HSI may enable precise differentiation between various tissue types and pathologies, making it particularly valuable for tumour detection, tissue classification, and disease diagnosis. Deep learning-based segmentation methods have shown considerable advancements, offering automated and accurate results. However, these methods face challenges with HSI datasets due to limited annotated data and discrepancies from hardware and acquisition techniques [clancy2020surgical, studier2023heiporspectral]. Variability in clinical protocols also leads to different definitions of structure boundaries. Interactive segmentation methods, utilizing user knowledge and clinical insights, can overcome these issues and achieve precise segmentation results [zhao2013overview]. This work introduces a scribble-based interactive segmentation framework for medical hyperspectral images. The proposed method utilizes deep learning for feature extraction and a geodesic distance map generated from user-provided scribbles to obtain the segmentation results. The experiment results show that utilising the geodesic distance maps based on deep learning-extracted features achieved better segmentation results than geodesic distance maps directly generated from hyperspectral images, reconstructed RGB images, or Euclidean distance maps.
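
For intuition, a geodesic distance map from scribbles can be computed as a shortest-path problem on the pixel grid, where each step is priced by the difference between neighbouring feature vectors (deep features in the paper's case). The sketch below is a plain Dijkstra over 4-connected pixels, purely illustrative; the paper's actual implementation and cost function are not reproduced.

```python
import heapq
import numpy as np

def geodesic_map(features, scribble_mask):
    """Geodesic distance from scribble pixels, with step costs given by the
    feature difference between neighbouring pixels (4-connectivity).
    features: (H, W, C) array; scribble_mask: (H, W) boolean array."""
    H, W, _ = features.shape
    dist = np.full((H, W), np.inf)
    heap = []
    for y, x in zip(*np.nonzero(scribble_mask)):
        dist[y, x] = 0.0
        heapq.heappush(heap, (0.0, y, x))
    while heap:
        d, y, x = heapq.heappop(heap)
        if d > dist[y, x]:
            continue
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < H and 0 <= nx < W:
                step = np.linalg.norm(features[ny, nx] - features[y, x]) + 1e-3
                if d + step < dist[ny, nx]:
                    dist[ny, nx] = d + step
                    heapq.heappush(heap, (d + step, ny, nx))
    return dist

# toy usage: one scribble pixel in the centre of a random feature map
feat = np.random.rand(32, 32, 8)
scribble = np.zeros((32, 32), dtype=bool); scribble[16, 16] = True
print(geodesic_map(feat, scribble).shape)
```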

机器学习

[LG-0] ClassiFIM: An Unsupervised Method To Detect Phase Transitions

链接: https://arxiv.org/abs/2408.03323
作者: Victor Kasatkin,Evgeny Mozgunov,Nicholas Ezzell,Utkarsh Mishra,Itay Hen,Daniel Lidar
关键词-EN: Fisher Information Metric, Fisher Information, Information Metric, proposed by physicists, problem proposed
类目: Machine Learning (cs.LG)
*备注: 23 pages, 5 figures

点击查看摘要

Abstract:Estimation of the Fisher Information Metric (FIM-estimation) is an important task that arises in unsupervised learning of phase transitions, a problem proposed by physicists. This work completes the definition of the task by defining rigorous evaluation metrics distMSE, distMSEPS, and distRE and introduces ClassiFIM, a novel machine learning method designed to solve the FIM-estimation task. Unlike existing methods for unsupervised learning of phase transitions, ClassiFIM directly estimates a well-defined quantity (the FIM), allowing it to be rigorously compared to any present and future other methods that estimate the same. ClassiFIM transforms a dataset for the FIM-estimation task into a dataset for an auxiliary binary classification task and involves selecting and training a model for the latter. We prove that the output of ClassiFIM approaches the exact FIM in the limit of infinite dataset size and under certain regularity conditions. We implement ClassiFIM on multiple datasets, including datasets describing classical and quantum phase transitions, and find that it achieves a good ground truth approximation with modest computational resources. Furthermore, we independently implement two alternative state-of-the-art methods for unsupervised estimation of phase transition locations on the same datasets and find that ClassiFIM predicts such locations at least as well as these other methods. To emphasize the generality of our method, we also propose and generate the MNIST-CNN dataset, which consists of the output of CNNs trained on MNIST for different hyperparameter choices. Using ClassiFIM on this dataset suggests there is a phase transition in the distribution of image-prediction pairs for CNNs trained on MNIST, demonstrating the broad scope of FIM-estimation beyond physics.

[LG-1] Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

链接: https://arxiv.org/abs/2408.03314
作者: Charlie Snell,Jaehoon Lee,Kelvin Xu,Aviral Kumar
关键词-EN: open-ended natural language, building generally self-improving, generally self-improving agents, Enabling LLMs, natural language
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Enabling LLMs to improve their outputs by using more test-time computation is a critical step towards building generally self-improving agents that can operate on open-ended natural language. In this paper, we study the scaling of inference-time computation in LLMs, with a focus on answering the question: if an LLM is allowed to use a fixed but non-trivial amount of inference-time compute, how much can it improve its performance on a challenging prompt? Answering this question has implications not only on the achievable performance of LLMs, but also on the future of LLM pretraining and how one should tradeoff inference-time and pre-training compute. Despite its importance, little research attempted to understand the scaling behaviors of various test-time inference methods. Moreover, current work largely provides negative results for a number of these strategies. In this work, we analyze two primary mechanisms to scale test-time computation: (1) searching against dense, process-based verifier reward models; and (2) updating the model’s distribution over a response adaptively, given the prompt at test time. We find that in both cases, the effectiveness of different approaches to scaling test-time compute critically varies depending on the difficulty of the prompt. This observation motivates applying a “compute-optimal” scaling strategy, which acts to most effectively allocate test-time compute adaptively per prompt. Using this compute-optimal strategy, we can improve the efficiency of test-time compute scaling by more than 4x compared to a best-of-N baseline. Additionally, in a FLOPs-matched evaluation, we find that on problems where a smaller base model attains somewhat non-trivial success rates, test-time compute can be used to outperform a 14x larger model.
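
The best-of-N baseline referenced in the abstract is easy to state in code: draw N candidate answers and keep the one a verifier scores highest. The sketch below uses toy stand-ins for the LLM sampler and the reward model; it is not the paper's compute-optimal strategy, only the baseline it is compared against.

```python
import random

def best_of_n(prompt, sample_fn, verifier_fn, n=8):
    """Best-of-N test-time scaling: sample N candidates and return the one
    the verifier (e.g., a learned reward model) scores highest."""
    candidates = [sample_fn(prompt) for _ in range(n)]
    scores = [verifier_fn(prompt, c) for c in candidates]
    best = max(range(n), key=lambda i: scores[i])
    return candidates[best], scores[best]

# toy stand-ins: the "LLM" guesses a number, the verifier rewards closeness to 42
sample = lambda p: random.randint(0, 100)
verify = lambda p, c: -abs(c - 42)
print(best_of_n("What is 6 x 7?", sample, verify, n=16))
```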

[LG-2] Fusing Forces: Deep-Human-Guided Refinement of Segmentation Masks ICPR2024

链接: https://arxiv.org/abs/2408.03304
作者: Rafael Sterzinger,Christian Stippel,Robert Sablatnig
关键词-EN: elaborate figurative illustrations, figurative illustrations featured, characterized by elaborate, Etruscan mirrors constitute, constitute a significant
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: 16 pages, accepted at ICPR2024

点击查看摘要

Abstract:Etruscan mirrors constitute a significant category in Etruscan art, characterized by elaborate figurative illustrations featured on their backside. A laborious and costly aspect of their analysis and documentation is the task of manually tracing these illustrations. In previous work, a methodology has been proposed to automate this process, involving photometric-stereo scanning in combination with deep neural networks. While achieving quantitative performance akin to an expert annotator, some results still lack qualitative precision and, thus, require annotators for inspection and potential correction, maintaining resource intensity. In response, we propose a deep neural network trained to interactively refine existing annotations based on human guidance. Our human-in-the-loop approach streamlines annotation, achieving equal quality with up to 75% less manual input required. Moreover, during the refinement process, the relative improvement of our methodology over pure manual labeling reaches peak values of up to 26%, attaining drastically better quality quicker. By being tailored to the complex task of segmenting intricate lines, specifically distinguishing it from previous methods, our approach offers drastic improvements in efficacy, transferable to a broad spectrum of applications beyond Etruscan mirrors.

[LG-3] SARA: Singular-Value Based Adaptive Low-Rank Adaption

链接: https://arxiv.org/abs/2408.03290
作者: Jihao Gu,Shuai Chen,Zelin Wang,Yibo Zhang,Ping Gong
关键词-EN: adding inference overhead, large pre-trained models, inference overhead, adding inference, LoRA method assumes
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:With the increasing number of parameters in large pre-trained models, LoRA as a parameter-efficient fine-tuning (PEFT) method is widely used because it adds no inference overhead. The LoRA method assumes that weight changes during fine-tuning can be approximated by low-rank matrices. However, the rank values need to be manually verified to match different downstream tasks, and they cannot accommodate the varying importance of different layers in the model. In this work, we first analyze the relationship between the performance of different layers and their ranks using SVD. Based on this, we design the Singular-Value Based Adaptive Low-Rank Adaption (SARA), which adaptively finds the rank during initialization by performing SVD on the pre-trained weights. Additionally, we explore the Mixture-of-SARA (Mo-SARA), which significantly reduces the number of parameters by fine-tuning only multiple parallel sets of singular values controlled by a router. Extensive experiments on various complex tasks demonstrate the simplicity and parameter efficiency of our methods. They can effectively and adaptively find the most suitable rank for each layer of each model.
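
One plausible reading of "adaptively finds the rank during initialization by performing SVD on the pre-trained weights" is to keep enough singular values to retain a fixed fraction of the spectral energy per layer. The retention criterion below is an assumption for illustration, not SARA's actual rule.

```python
import torch

def adaptive_rank(weight, energy=0.5, max_rank=64):
    """Pick a per-layer low-rank dimension from the SVD of the pretrained
    weight: keep enough singular values to retain `energy` of the spectral
    energy (illustrative criterion, not the paper's)."""
    s = torch.linalg.svdvals(weight)
    cum = torch.cumsum(s**2, dim=0) / (s**2).sum()
    r = int(torch.searchsorted(cum, torch.tensor(energy)).item()) + 1
    return min(r, max_rank)

# a nearly rank-2 layer gets a small rank, a dense random layer a larger one
low = torch.randn(256, 2) @ torch.randn(2, 256) + 0.01 * torch.randn(256, 256)
print(adaptive_rank(low), adaptive_rank(torch.randn(256, 256)))
```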

[LG-4] Malicious Internet Entity Detection Using Local Graph Inference

链接: https://arxiv.org/abs/2408.03287
作者: Simon Mandlik,Tomas Pevny,Vaclav Smidl,Lukas Bajer
关键词-EN: Detection of malicious, high expressive power, computer security, malicious behavior, challenging problem
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: A preprint. Full publication: this https URL

点击查看摘要

Abstract:Detection of malicious behavior in a large network is a challenging problem for machine learning in computer security, since it requires a model with high expressive power and scalable inference. Existing solutions struggle to achieve this feat – current cybersec-tailored approaches are still limited in expressivity, and methods successful in other domains do not scale well for large volumes of data, rendering frequent retraining impossible. This work proposes a new perspective for learning from graph data that is modeling network entity interactions as a large heterogeneous graph. High expressivity of the method is achieved with neural network architecture HMILnet that naturally models this type of data and provides theoretical guarantees. The scalability is achieved by pursuing local graph inference, i.e., classifying individual vertices and their neighborhood as independent samples. Our experiments exhibit improvement over the state-of-the-art Probabilistic Threat Propagation (PTP) algorithm, show a further threefold accuracy improvement when additional data is used, which is not possible with the PTP algorithm, and demonstrate the generalization capabilities of the method to new, previously unseen entities.

[LG-5] StructEval: Deepen and Broaden Large Language Model Assessment via Structured Evaluation ACL2024

链接: https://arxiv.org/abs/2408.03281
作者: Boxi Cao,Mengjie Ren,Hongyu Lin,Xianpei Han,Feng Zhang,Junfeng Zhan,Le Sun
关键词-EN: large language models, atomic test objective, development of large, large language, test objective
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: ACL 2024;Benchmark at this https URL at this https URL

点击查看摘要

Abstract:Evaluation is the baton for the development of large language models. Current evaluations typically employ a single-item assessment paradigm for each atomic test objective, which struggles to discern whether a model genuinely possesses the required capabilities or merely memorizes/guesses the answers to specific questions. To this end, we propose a novel evaluation framework referred to as StructEval. Starting from an atomic test objective, StructEval deepens and broadens the evaluation by conducting a structured assessment across multiple cognitive levels and critical concepts, and therefore offers a comprehensive, robust and consistent evaluation for LLMs. Experiments on three widely-used benchmarks demonstrate that StructEval serves as a reliable tool for resisting the risk of data contamination and reducing the interference of potential biases, thereby providing more reliable and consistent conclusions regarding model capabilities. Our framework also sheds light on the design of future principled and trustworthy LLM evaluation protocols.

[LG-6] Compress and Compare: Interactively Evaluating Efficiency and Behavior Across ML Model Compression Experiments

链接: https://arxiv.org/abs/2408.03274
作者: Angie Boggust,Venkatesh Sivaraman,Yannick Assogba,Donghao Ren,Dominik Moritz,Fred Hohman
关键词-EN: deploy machine learning, Compress and Compare, learning models on-device, machine learning models, compression
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted to VIS 2024

点击查看摘要

Abstract:To deploy machine learning models on-device, practitioners use compression algorithms to shrink and speed up models while maintaining their high-quality output. A critical aspect of compression in practice is model comparison, including tracking many compression experiments, identifying subtle changes in model behavior, and negotiating complex accuracy-efficiency trade-offs. However, existing compression tools poorly support comparison, leading to tedious and, sometimes, incomplete analyses spread across disjoint tools. To support real-world comparative workflows, we develop an interactive visual system called Compress and Compare. Within a single interface, Compress and Compare surfaces promising compression strategies by visualizing provenance relationships between compressed models and reveals compression-induced behavior changes by comparing models’ predictions, weights, and activations. We demonstrate how Compress and Compare supports common compression analysis tasks through two case studies, debugging failed compression on generative language models and identifying compression artifacts in image classification models. We further evaluate Compress and Compare in a user study with eight compression experts, illustrating its potential to provide structure to compression workflows, help practitioners build intuition about compression, and encourage thorough analysis of compression’s effect on model behavior. Through these evaluations, we identify compression-specific challenges that future visual analytics tools should consider and Compress and Compare visualizations that may generalize to broader model comparison tasks.

[LG-7] Analysis of Partially-Calibrated Sparse Subarrays for Direction Finding with Extended Degrees of Freedom

链接: https://arxiv.org/abs/2408.03236
作者: W. S. Leite,R. C. de Lamare
关键词-EN: Multiple Signal Classification, Coarray Multiple Signal, partially-calibrated sparse subarrays, Generalized Coarray Multiple, multiple partially-calibrated sparse
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 6 pages, 5 figures

点击查看摘要

Abstract:This paper investigates the problem of direction-of-arrival (DOA) estimation using multiple partially-calibrated sparse subarrays. In particular, we present the Generalized Coarray Multiple Signal Classification (GCA-MUSIC) DOA estimation algorithm to scenarios with partially-calibrated sparse subarrays. The proposed GCA-MUSIC algorithm exploits the difference coarray for each subarray, followed by a specific pseudo-spectrum merging rule that is based on the intersection of the signal subspaces associated to each subarray. This rule assumes that there is no a priori knowledge about the cross-covariance between subarrays. In that way, only the second-order statistics of each subarray are used to estimate the directions with increased degrees of freedom, i.e., the estimation procedure preserves the coarray Multiple Signal Classification and sparse arrays properties to estimate more sources than the number of physical sensors in each subarray. Numerical simulations show that the proposed GCA-MUSIC has better performance than other similar strategies.

[LG-8] Don't Think It Twice: Exploit Shift Invariance for Efficient Online Streaming Inference of CNNs

链接: https://arxiv.org/abs/2408.03223
作者: Christodoulos Kechris,Jonathan Dan,Jose Miranda,David Atienza
关键词-EN: Deep learning time-series, Deep learning, convolutional neural networks, learning time-series processing, learning time-series
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep learning time-series processing often relies on convolutional neural networks with overlapping windows. This overlap allows the network to produce an output faster than the window length. However, it introduces additional computations. This work explores the potential to optimize computational efficiency during inference by exploiting convolution's shift-invariance properties to skip the calculation of layer activations between successive overlapping windows. Although convolutions are shift-invariant, zero-padding and pooling operations, widely used in such networks, are not, and they complicate efficient streaming inference. We introduce StreamiNNC, a strategy to deploy Convolutional Neural Networks for online streaming inference. We explore the adverse effects of zero padding and pooling on the accuracy of streaming inference, deriving theoretical error upper bounds for pooling during streaming. We address these limitations by proposing signal padding and pooling alignment and provide guidelines for designing and deploying models for StreamiNNC. We validate our method in simulated data and on three real-world biomedical signal processing applications. StreamiNNC achieves a low deviation between streaming output and normal inference for all three networks (2.03 - 3.55% NRMSE). This work demonstrates that it is possible to linearly speed up the inference of streaming CNNs processing overlapping windows, negating the additional computation typically incurred by overlapping windows.
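
The shift-invariance argument is easy to demonstrate for a single unpadded 1-D convolution: outputs for the overlapping part of consecutive windows are identical, so only the outputs that touch new samples need to be computed. The streaming wrapper below is a minimal sketch of that idea, deliberately ignoring the zero-padding and pooling complications the paper addresses; the class name is an assumption, not the authors' code.

```python
import torch
import torch.nn as nn

class StreamingConv1d:
    """Streaming wrapper for an unpadded, stride-1 Conv1d: cache the last
    (kernel_size - 1) input samples and convolve only over new samples."""
    def __init__(self, conv: nn.Conv1d):
        assert conv.padding == (0,) and conv.stride == (1,)
        self.conv = conv
        self.tail = None  # last (kernel_size - 1) input samples

    def step(self, new_samples):  # new_samples: (B, C, L_new)
        k = self.conv.kernel_size[0]
        x = new_samples if self.tail is None else torch.cat([self.tail, new_samples], dim=-1)
        self.tail = x[..., -(k - 1):] if k > 1 else None
        return self.conv(x)  # outputs only for positions touching new samples

# check against full (non-streaming) inference
conv = nn.Conv1d(1, 4, kernel_size=5)
signal = torch.randn(1, 1, 100)
stream = StreamingConv1d(conv)
chunks = [stream.step(signal[..., i:i + 20]) for i in range(0, 100, 20)]
assert torch.allclose(torch.cat(chunks, dim=-1), conv(signal), atol=1e-5)
```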

[LG-9] Masked Random Noise for Communication Efficient Federated Learning

链接: https://arxiv.org/abs/2408.03220
作者: Shiwei Li,Yingyi Cheng,Haozhao Wang,Xing Tang,Shijie Xu,Weihong Luo,Yuhua Li,Dugang Liu,Xiuqiang He,and Ruixuan Li
关键词-EN: safeguards data privacy, effectively safeguards data, data privacy, random noise, Masked Random Noise
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: Accepted by MM 2024

点击查看摘要

Abstract:Federated learning is a promising distributed training paradigm that effectively safeguards data privacy. However, it may involve significant communication costs, which hinders training efficiency. In this paper, we aim to enhance communication efficiency from a new perspective. Specifically, we request the distributed clients to find optimal model updates relative to global model parameters within predefined random noise. For this purpose, we propose Federated Masked Random Noise (FedMRN), a novel framework that enables clients to learn a 1-bit mask for each model parameter and apply masked random noise (i.e., the Hadamard product of random noise and masks) to represent model updates. To make FedMRN feasible, we propose an advanced mask training strategy, called progressive stochastic masking (PSM). After local training, each client only needs to transmit local masks and a random seed to the server. Additionally, we provide theoretical guarantees for the convergence of FedMRN under both strongly convex and non-convex assumptions. Extensive experiments are conducted on four popular datasets. The results show that FedMRN exhibits superior convergence speed and test accuracy compared to relevant baselines, while attaining a similar level of accuracy as FedAvg.
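
The communication saving comes from representing an update as a seeded noise tensor times a 1-bit mask, so only the seed and the mask ever travel to the server. The sketch below illustrates that encoding/decoding round trip; the mask is picked greedily by sign agreement (with an extra scalar scale) purely for illustration, whereas FedMRN learns the mask during local training via PSM.

```python
import torch

def encode_update(delta, seed):
    """Client side: approximate the local update `delta` by scale * (mask * noise),
    where noise comes from a shared seed and mask is 1 bit per parameter."""
    g = torch.Generator().manual_seed(seed)
    noise = torch.randn(delta.shape, generator=g)
    mask = (torch.sign(delta) == torch.sign(noise)).float()        # 1-bit mask
    scale = (delta * mask * noise).sum() / ((mask * noise).pow(2).sum() + 1e-12)
    return mask.to(torch.uint8), scale.item()

def decode_update(mask, scale, seed, shape):
    """Server side: rebuild the (approximate) update from seed + mask."""
    g = torch.Generator().manual_seed(seed)
    noise = torch.randn(shape, generator=g)
    return scale * mask.float() * noise

delta = torch.randn(1000)
mask, scale = encode_update(delta, seed=42)
approx = decode_update(mask, scale, 42, delta.shape)
print(torch.nn.functional.cosine_similarity(delta, approx, dim=0))
```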

[LG-10] Learning to Learn without Forgetting using Attention

链接: https://arxiv.org/abs/2408.03219
作者: Anna Vettoruzzo,Joaquin Vanschoren,Mohamed-Rafik Bouguelia,Thorsteinn Rögnvaldsson
关键词-EN: retaining previously learned, previously learned patterns, previously learned experience, previously learned, ability to continually
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: Published at 3rd Conference on Lifelong Learning Agents (CoLLAs), 2024

点击查看摘要

Abstract:Continual learning (CL) refers to the ability to continually learn over time by accommodating new knowledge while retaining previously learned experience. While this concept is inherent in human learning, current machine learning methods are highly prone to overwrite previously learned patterns and thus forget past experience. Instead, model parameters should be updated selectively and carefully, avoiding unnecessary forgetting while optimally leveraging previously learned patterns to accelerate future learning. Since hand-crafting effective update mechanisms is difficult, we propose meta-learning a transformer-based optimizer to enhance CL. This meta-learned optimizer uses attention to learn the complex relationships between model parameters across a stream of tasks, and is designed to generate effective weight updates for the current task while preventing catastrophic forgetting on previously encountered tasks. Evaluations on benchmark datasets like SplitMNIST, RotatedMNIST, and SplitCIFAR-100 affirm the efficacy of the proposed approach in terms of both forward and backward transfer, even on small sets of labeled data, highlighting the advantages of integrating a meta-learned optimizer within the continual learning framework.

[LG-11] FedBAT: Communication-Efficient Federated Learning via Learnable Binarization ICML2024

链接: https://arxiv.org/abs/2408.03215
作者: Shiwei Li,Wenchao Xu,Haozhao Wang,Xing Tang,Yining Qi,Shijie Xu,Weihong Luo,Yuhua Li,Xiuqiang He,Ruixuan Li
关键词-EN: exposing users’ privacy, promising distributed machine, effectively exploit large-scale, exploit large-scale data, machine learning paradigm
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: Accepted by ICML 2024

点击查看摘要

Abstract:Federated learning is a promising distributed machine learning paradigm that can effectively exploit large-scale data without exposing users’ privacy. However, it may incur significant communication overhead, thereby potentially impairing the training efficiency. To address this challenge, numerous studies suggest binarizing the model updates. Nonetheless, traditional methods usually binarize model updates in a post-training manner, resulting in significant approximation errors and consequent degradation in model accuracy. To this end, we propose Federated Binarization-Aware Training (FedBAT), a novel framework that directly learns binary model updates during the local training process, thus inherently reducing the approximation errors. FedBAT incorporates an innovative binarization operator, along with meticulously designed derivatives to facilitate efficient learning. In addition, we establish theoretical guarantees regarding the convergence of FedBAT. Extensive experiments are conducted on four popular datasets. The results show that FedBAT significantly accelerates the convergence and exceeds the accuracy of baselines by up to 9%, even surpassing that of FedAvg in some cases.
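
A common way to learn binary quantities during training, rather than binarizing after the fact, is a sign operator with a straight-through gradient estimator. The sketch below shows that standard trick on a toy update; FedBAT's own binarization operator and its carefully designed derivatives differ from this, so treat it as background intuition only.

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """Sign binarization with a straight-through gradient (standard STE),
    so a binary-communicated update can still be optimised end-to-end."""
    @staticmethod
    def forward(ctx, x, alpha):
        ctx.save_for_backward(x)
        return alpha * torch.sign(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # pass gradients through where |x| is small, clip elsewhere
        return grad_out * (x.abs() <= 1).float(), (grad_out * torch.sign(x)).sum()

# the client optimises `update`, but only alpha * sign(update) is ever sent
update = torch.zeros(100, requires_grad=True)
alpha = torch.tensor(0.1, requires_grad=True)
target = torch.randn(100)
opt = torch.optim.SGD([update, alpha], lr=0.05)
for _ in range(200):
    loss = (BinarizeSTE.apply(update, alpha) - target).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
print(loss.item())
```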

[LG-12] RELIEF: Reinforcement Learning Empowered Graph Feature Prompt Tuning

链接: https://arxiv.org/abs/2408.03195
作者: Jiapeng Zhu,Zichen Ding,Jianxiang Yu,Jiaqi Tan,Xiang Li,Weining Qian
关键词-EN: Natural Language Processing, Language Processing, Natural Language, achievements in Natural, Graph Neural Network
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The advent of the “pre-train, prompt” paradigm has recently extended its generalization ability and data efficiency to graph representation learning, following its achievements in Natural Language Processing (NLP). Initial graph prompt tuning approaches tailored specialized prompting functions for Graph Neural Network (GNN) models pre-trained with specific strategies, such as edge prediction, thus limiting their applicability. In contrast, another pioneering line of research has explored universal prompting via adding prompts to the input graph’s feature space, thereby removing the reliance on specific pre-training strategies. However, the necessity to add feature prompts to all nodes remains an open question. Motivated by findings from prompt tuning research in the NLP domain, which suggest that highly capable pre-trained models need less conditioning signal to achieve desired behaviors, we advocate for strategically incorporating necessary and lightweight feature prompts to certain graph nodes to enhance downstream task performance. This introduces a combinatorial optimization problem, requiring a policy to decide 1) which nodes to prompt and 2) what specific feature prompts to attach. We then address the problem by framing the prompt incorporation process as a sequential decision-making problem and propose our method, RELIEF, which employs Reinforcement Learning (RL) to optimize it. At each step, the RL agent selects a node (discrete action) and determines the prompt content (continuous action), aiming to maximize cumulative performance gain. Extensive experiments on graph and node-level tasks with various pre-training strategies in few-shot scenarios demonstrate that our RELIEF outperforms fine-tuning and other prompt-based approaches in classification performance and data efficiency.

[LG-13] An Object is Worth 64x64 Pixels: Generating 3D Object via Image Diffusion

链接: https://arxiv.org/abs/2408.03178
作者: Xingguang Yan,Han-Hung Lee,Ziyu Wan,Angel X. Chang
关键词-EN: Object Images, generating realistic, representation termed, Object, Abstract
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
*备注: Project Page: this https URL

点击查看摘要

Abstract:We introduce a new approach for generating realistic 3D models with UV maps through a representation termed “Object Images.” This approach encapsulates surface geometry, appearance, and patch structures within a 64x64 pixel image, effectively converting complex 3D shapes into a more manageable 2D format. By doing so, we address the challenges of both geometric and semantic irregularity inherent in polygonal meshes. This method allows us to use image generation models, such as Diffusion Transformers, directly for 3D shape generation. Evaluated on the ABO dataset, our generated shapes with patch structures achieve point cloud FID comparable to recent 3D generative models, while naturally supporting PBR material generation.

[LG-14] Leveraging Parameter Efficient Training Methods for Low Resource Text Classification: A Case Study in Marathi

链接: https://arxiv.org/abs/2408.03172
作者: Pranita Deshmukh,Nikita Kulkarni,Sanhita Kulkarni,Kareena Manghani,Raviraj Joshi
关键词-EN: Natural Language Processing, advanced Natural Language, Bidirectional Encoder Representations, advanced Natural, Language Processing
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Accepted at I2CT 2024

点击查看摘要

Abstract:With the surge in digital content in low-resource languages, there is an escalating demand for advanced Natural Language Processing (NLP) techniques tailored to these languages. BERT (Bidirectional Encoder Representations from Transformers), serving as the foundational framework for numerous NLP architectures and language models, is increasingly employed for the development of low-resource NLP models. Parameter Efficient Fine-Tuning (PEFT) is a method for fine-tuning Large Language Models (LLMs) and reducing the training parameters to some extent to decrease the computational costs needed for training the model and achieve results comparable to a fully fine-tuned model. In this work, we present a study of PEFT methods for the Indic low-resource language Marathi. We conduct a comprehensive analysis of PEFT methods applied to various monolingual and multilingual Marathi BERT models. These approaches are evaluated on prominent text classification datasets like MahaSent, MahaHate, and MahaNews. The incorporation of PEFT techniques is demonstrated to significantly expedite the training speed of the models, addressing a critical aspect of model development and deployment. In this study, we explore Low-Rank Adaptation of Large Language Models (LoRA) and adapter methods for low-resource text classification. We show that these methods are competitive with full fine-tuning and can be used without loss in accuracy. This study contributes valuable insights into the effectiveness of Marathi BERT models, offering a foundation for the continued advancement of NLP capabilities in Marathi and similar Indic languages.

[LG-15] Iterative CT Reconstruction via Latent Variable Optimization of Shallow Diffusion Models

链接: https://arxiv.org/abs/2408.03156
作者: Sho Ozaki,Shizuo Kaji,Toshikazu Imae,Kanabu Nawa,Hideomi Yamashita,Keiichi Nakagawa
关键词-EN: garnered significant attention, diffusion model, diffusion, garnered significant, significant attention
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Medical Physics (physics.med-ph)
*备注: 19 pages, 9 figures

点击查看摘要

Abstract:Image generative AI has garnered significant attention in recent years. In particular, the diffusion model, a core component of recent generative AI, produces high-quality images with rich diversity. In this study, we propose a novel CT reconstruction method by combining the denoising diffusion probabilistic model with iterative CT reconstruction. In sharp contrast to previous studies, we optimize the fidelity loss of CT reconstruction with respect to the latent variable of the diffusion model, instead of the image and model parameters. To suppress anatomical structure changes produced by the diffusion model, we shallow the diffusion and reverse processes, and fix a set of added noises in the reverse process to make it deterministic during inference. We demonstrate the effectiveness of the proposed method through sparse view CT reconstruction of 1/10 view projection data. Despite the simplicity of the implementation, the proposed method shows the capability of reconstructing high-quality images while preserving the patient’s anatomical structure, and outperforms existing methods including iterative reconstruction, iterative reconstruction with total variation, and the diffusion model alone in terms of quantitative indices such as SSIM and PSNR. We also explore further sparse view CT using 1/20 view projection data with the same trained diffusion model. As the number of iterations increases, image quality improvement comparable to that of 1/10 sparse view CT reconstruction is achieved. In principle, the proposed method can be widely applied not only to CT but also to other imaging modalities such as MRI, PET, and SPECT.
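
The core optimization idea, fitting the CT data-fidelity term with respect to the diffusion latent while the reverse process is kept deterministic by fixing its noises, can be illustrated with toy stand-ins. In the sketch below, `decoder` is a placeholder for the frozen, fixed-noise reverse diffusion and `A` for the projection operator; nothing here reproduces the paper's schedule or denoiser.

```python
import torch

def reconstruct(z_init, decoder, A, y, steps=200, lr=0.05):
    """Optimise the latent z so that the decoded image matches the measured
    projections y under the forward operator A (data-fidelity loss on z)."""
    z = z_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        x = decoder(z)                   # image from latent
        loss = (A(x) - y).pow(2).mean()  # CT data-fidelity term
        opt.zero_grad(); loss.backward(); opt.step()
    return decoder(z).detach(), loss.item()

# toy stand-ins: a fixed nonlinear "decoder" and a sparse-view "projection"
torch.manual_seed(0)
D = torch.randn(64, 16)
decoder = lambda z: torch.tanh(D @ z)
P = torch.randn(8, 64)                   # 8 "views" of a 64-pixel image
x_true = torch.tanh(D @ torch.randn(16))
y = P @ x_true
x_rec, err = reconstruct(torch.zeros(16), decoder, lambda x: P @ x, y)
print(err, (x_rec - x_true).abs().mean().item())
```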

[LG-16] TSC: A Simple Two-Sided Constraint against Over-Smoothing KDD2024

链接: https://arxiv.org/abs/2408.03152
作者: Furong Peng,Kang Liu,Xuan Lu,Yuhua Qian,Hongren Yan,Chao Ma
关键词-EN: Convolutional Neural Network, Graph Convolutional Neural, Convolutional Neural, analyzing relational data, widely adopted method
类目: Machine Learning (cs.LG)
*备注: accept by KDD2024

点击查看摘要

Abstract:Graph Convolutional Neural Network (GCN), a widely adopted method for analyzing relational data, enhances node discriminability through the aggregation of neighboring information. Usually, stacking multiple layers can improve the performance of GCN by leveraging information from high-order neighbors. However, the increase of the network depth will induce the over-smoothing problem, which can be attributed to the quality and quantity of neighbors changing: (a) neighbor quality, node's neighbors become overlapping in high order, leading to aggregated information becoming indistinguishable, (b) neighbor quantity, the exponentially growing aggregated neighbors submerges the node's initial feature by recursively aggregating operations. Current solutions mainly focus on one of the above causes and seldom consider both at once. Aiming at tackling both causes of over-smoothing in one shot, we introduce a simple Two-Sided Constraint (TSC) for GCNs, comprising two straightforward yet potent techniques: random masking and contrastive constraint. The random masking acts on the representation matrix's columns to regulate the degree of information aggregation from neighbors, thus preventing the convergence of node representations. Meanwhile, the contrastive constraint, applied to the representation matrix's rows, enhances the discriminability of the nodes. Designed as a plug-in module, TSC can be easily coupled with GCN or SGC architectures. Experimental analyses on diverse real-world graph datasets verify that our approach markedly reduces the convergence of node's representation and the performance degradation in deeper GCN.
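
To make the two constraints concrete, the sketch below applies random masking to the columns of a node representation matrix and an InfoNCE-style contrastive term to its rows. The drop rate, temperature, and loss form are illustrative assumptions, not the exact TSC formulation.

```python
import torch
import torch.nn.functional as F

def column_mask(H, drop_rate=0.3, training=True):
    """Randomly zero whole feature columns of the representation matrix H
    (N x d) to temper how much neighbour aggregation homogenises features."""
    if not training:
        return H
    keep = (torch.rand(H.size(1), device=H.device) > drop_rate).float()
    return H * keep / (1 - drop_rate)

def row_contrastive(H, H_aug, temperature=0.5):
    """InfoNCE-style constraint on rows (nodes): each node should stay closest
    to its own augmented view, keeping node representations distinguishable."""
    a = F.normalize(H, dim=1)
    b = F.normalize(H_aug, dim=1)
    logits = a @ b.t() / temperature
    labels = torch.arange(H.size(0), device=H.device)
    return F.cross_entropy(logits, labels)

# toy usage on a random 50-node representation matrix
H = torch.randn(50, 16)
H_aug = column_mask(H)
print(row_contrastive(H, H_aug).item())
```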

[LG-17] Optimizing Disease Prediction with Artificial Intelligence Driven Feature Selection and Attention Networks

链接: https://arxiv.org/abs/2408.03151
作者: D. Dhinakaran,S. Edwin Raja,M. Thiyagarajan,J. Jeno Jasmine,P. Raghavan
关键词-EN: Electronic Health Records, repositories of Electronic, machine learning methodologies, ignited innovative strategies, Stabilized Energy Valley
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 16 Pages, 4 Figures

点击查看摘要

Abstract:The rapid integration of machine learning methodologies in healthcare has ignited innovative strategies for disease prediction, particularly with the vast repositories of Electronic Health Records (EHR) data. This article delves into the realm of multi-disease prediction, presenting a comprehensive study that introduces a pioneering ensemble feature selection model. This model, designed to optimize learning systems, combines statistical, deep, and optimally selected features through the innovative Stabilized Energy Valley Optimization with Enhanced Bounds (SEV-EB) algorithm. The objective is to achieve unparalleled accuracy and stability in predicting various disorders. This work proposes an advanced ensemble model that synergistically integrates statistical, deep, and optimally selected features. This combination aims to enhance the predictive power of the model by capturing diverse aspects of the health data. At the heart of the proposed model lies the SEV-EB algorithm, a novel approach to optimal feature selection. The algorithm introduces enhanced bounds and stabilization techniques, contributing to the robustness and accuracy of the overall prediction model. To further elevate the predictive capabilities, an HSC-AttentionNet is introduced. This network architecture combines deep temporal convolution capabilities with LSTM, allowing the model to capture both short-term patterns and long-term dependencies in health data. Rigorous evaluations showcase the remarkable performance of the proposed model. Achieving a 95% accuracy and 94% F1-score in predicting various disorders, the model surpasses traditional methods, signifying a significant advancement in disease prediction accuracy. The implications of this research extend beyond the confines of academia.

[LG-18] Conditioning LLMs with Emotion in Neural Machine Translation

链接: https://arxiv.org/abs/2408.03150
作者: Charles Brazier,Jean-Luc Rouas
关键词-EN: Natural Language Processing, Language Processing tasks, Large Language Models, including Machine Translation, Large Language
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 6 pages, In Proceedings of the 21st International Conference on Spoken Language Translation (IWSLT), Bangkok, Thailand, 2024

点击查看摘要

Abstract:Large Language Models (LLMs) have shown remarkable performance in Natural Language Processing tasks, including Machine Translation (MT). In this work, we propose a novel MT pipeline that integrates emotion information extracted from a Speech Emotion Recognition (SER) model into LLMs to enhance translation quality. We first fine-tune five existing LLMs on the Libri-trans dataset and select the most performant model. Subsequently, we augment LLM prompts with different dimensional emotions and train the selected LLM under these different configurations. Our experiments reveal that integrating emotion information, especially arousal, into LLM prompts leads to notable improvements in translation quality.

[LG-19] Closed-loop Diffusion Control of Complex Physical Systems

链接: https://arxiv.org/abs/2408.03124
作者: Long Wei,Haodong Feng,Peiyan Hu,Tao Zhang,Yuchen Yang,Xiang Zheng,Ruiqi Feng,Dixia Fan,Tailin Wu
关键词-EN: complex physical systems, physical systems, control, science and engineering, complex physical
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The control problems of complex physical systems have wide applications in science and engineering. Several previous works have demonstrated that generative control methods based on diffusion models have significant advantages for solving these problems. However, existing generative control methods face challenges in handling closed-loop control, which is an inherent constraint for effective control of complex physical systems. In this paper, we propose a Closed-Loop Diffusion method for Physical systems Control (CL-DiffPhyCon). By adopting an asynchronous denoising schedule for different time steps, CL-DiffPhyCon generates control signals conditioned on real-time feedback from the environment. Thus, CL-DiffPhyCon is able to speed up diffusion control methods in a closed-loop framework. We evaluate CL-DiffPhyCon on the 1D Burgers’ equation control and 2D incompressible fluid control tasks. The results demonstrate that CL-DiffPhyCon achieves notable control performance with significant sampling acceleration.

[LG-20] Topic Modeling with Fine-tuning LLMs and Bag of Sentences

链接: https://arxiv.org/abs/2408.03099
作者: Johannes Schneider
关键词-EN: Large language models, Large language, classical topic models, outperforming classical topic, modeling outperforming classical
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: This is the submitted journal version, enhanced with the novel fine-tuning part, of "Efficient and Flexible Topic Modeling using Pretrained Embeddings and Bag of Sentences", which appeared at the International Conference on Agents and Artificial Intelligence (ICAART) in 2024

点击查看摘要

Abstract:Large language models (LLMs) are increasingly used for topic modeling, outperforming classical topic models such as LDA. Commonly, pre-trained LLM encoders such as BERT are used out-of-the-box despite the fact that fine-tuning is known to improve LLMs considerably. The challenge lies in obtaining a suitable (labeled) dataset for fine-tuning. In this paper, we use the recent idea to use bag of sentences as the elementary unit in computing topics. In turn, we derive an approach FT-Topic to perform unsupervised fine-tuning relying primarily on two steps for constructing a training dataset in an automatic fashion. First, a heuristic method identifies pairs of sentence groups that are either assumed to be of the same or different topics. Second, we remove sentence pairs that are likely labeled incorrectly. The dataset is then used to fine-tune an encoder LLM, which can be leveraged by any topic modeling approach using embeddings. However, in this work, we demonstrate its effectiveness by deriving a novel state-of-the-art topic modeling method called SenClu, which achieves fast inference through an expectation-maximization algorithm and hard assignments of sentence groups to a single topic, while giving users the possibility to encode prior knowledge on the topic-document distribution. Code is at this https URL

[LG-21] Learning Provably Robust Policies in Uncertain Parametric Environments

链接: https://arxiv.org/abs/2408.03093
作者: Yannik Schnitzer,Alessandro Abate,David Parker
关键词-EN: unknown distribution, learning MDP policies, present a data-driven, transition probabilities, probabilities are defined
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:We present a data-driven approach for learning MDP policies that are robust across stochastic environments whose transition probabilities are defined by parameters with an unknown distribution. We produce probably approximately correct (PAC) guarantees for the performance of these learned policies in a new, unseen environment over the unknown distribution. Our approach is based on finite samples of the MDP environments, for each of which we build an approximation of the model as an interval MDP, by exploring a set of generated trajectories. We use the built approximations to synthesise a single policy that performs well (meets given requirements) across the sampled environments, and furthermore bound its risk (of not meeting the given requirements) when deployed in an unseen environment. Our procedure offers a trade-off between the guaranteed performance of the learned policy and the risk of not meeting the guarantee in an unseen environment. Our approach exploits knowledge of the environment’s state space and graph structure, and we show how additional knowledge of its parametric structure can be leveraged to optimize learning and to obtain tighter guarantees from less samples. We evaluate our approach on a diverse range of established benchmarks, demonstrating that we can generate highly performing and robust policies, along with guarantees that tightly quantify their performance and the associated risk.

[LG-22] Research on Autonomous Driving Decision-making Strategies based Deep Reinforcement Learning

链接: https://arxiv.org/abs/2408.03084
作者: Zixiang Wang,Hao Yan,Changsong Wei,Junyu Wang,Shi Bo,Minheng Xiao
关键词-EN: behavior decision-making subsystem, autonomous driving system, deep reinforcement learning, key component, important symbol
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The behavior decision-making subsystem is a key component of the autonomous driving system, which reflects the decision-making ability of the vehicle and the driver, and is an important symbol of the high-level intelligence of the vehicle. However, the existing rule-based decision-making schemes are limited by the prior knowledge of designers, and it is difficult to cope with complex and changeable traffic scenarios. In this work, an advanced deep reinforcement learning model is adopted, which can autonomously learn and optimize driving strategies in a complex and changeable traffic environment by modeling the driving decision-making process as a reinforcement learning problem. Specifically, we used Deep Q-Network (DQN) and Proximal Policy Optimization (PPO) for comparative experiments. DQN guides the agent to choose the best action by approximating the state-action value function, while PPO improves the decision-making quality by optimizing the policy function. We also introduce improvements in the design of the reward function to promote the robustness and adaptability of the model in real-world driving situations. Experimental results show that the decision-making strategy based on deep reinforcement learning has better performance than the traditional rule-based method in a variety of driving tasks.
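
For reference, a minimal DQN update of the kind compared against PPO in this abstract might look as follows; the state dimension, network size, and batch of transitions are placeholder assumptions, not the paper's driving setup.

```python
import torch
import torch.nn as nn

state_dim, n_actions, gamma = 8, 3, 0.99
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def dqn_update(s, a, r, s_next, done):
    """One TD update on a batch of (state, action, reward, next state, done)."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)        # Q(s, a)
    with torch.no_grad():
        # Bootstrap target from the (periodically synced) target network.
        target = r + gamma * (1 - done) * target_net(s_next).max(1).values
    loss = nn.functional.smooth_l1_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# One update on a batch of fake transitions.
b = 32
loss = dqn_update(torch.randn(b, state_dim),
                  torch.randint(0, n_actions, (b,)),
                  torch.randn(b),
                  torch.randn(b, state_dim),
                  torch.zeros(b))
```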

[LG-23] Highly Efficient Self-Adaptive Reward Shaping for Reinforcement Learning

链接: https://arxiv.org/abs/2408.03029
作者: Haozhe Ma,Zhengding Luo,Thanh Vinh Vo,Kuankuan Sima,Tze-Yun Leong
关键词-EN: informative reward signals, Reward shaping addresses, addresses the challenge, reinforcement learning, learning by constructing
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Reward shaping addresses the challenge of sparse rewards in reinforcement learning by constructing denser and more informative reward signals. To achieve self-adaptive and highly efficient reward shaping, we propose a novel method that incorporates success rates derived from historical experiences into shaped rewards. Our approach utilizes success rates sampled from Beta distributions, which dynamically evolve from uncertain to reliable values as more data is collected. Initially, the self-adaptive success rates exhibit more randomness to encourage exploration. Over time, they become more certain to enhance exploitation, thus achieving a better balance between exploration and exploitation. We employ Kernel Density Estimation (KDE) combined with Random Fourier Features (RFF) to derive the Beta distributions, resulting in a computationally efficient implementation in high-dimensional continuous state spaces. This method provides a non-parametric and learning-free approach. The proposed method is evaluated on a wide range of continuous control tasks with sparse and delayed rewards, demonstrating significant improvements in sample efficiency and convergence stability compared to several baselines.
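
The Beta-distributed success-rate idea can be sketched as follows. The grid discretisation of the state space stands in for the paper's KDE plus random-Fourier-feature estimator, and the additive bonus form is an illustrative assumption.

```python
import numpy as np

class BetaSuccessShaper:
    """Toy self-adaptive shaping: success rates per discretised state region
    are modelled as Beta(alpha, beta) and sampled to form a reward bonus."""
    def __init__(self, n_regions, bonus_scale=0.1, seed=0):
        self.alpha = np.ones(n_regions)   # prior successes
        self.beta = np.ones(n_regions)    # prior failures
        self.scale = bonus_scale
        self.rng = np.random.default_rng(seed)

    def shaped_reward(self, region, env_reward):
        # Early on, Beta(1, 1) is uniform (exploratory); with more data the
        # sampled rate concentrates, favouring exploitation.
        rate = self.rng.beta(self.alpha[region], self.beta[region])
        return env_reward + self.scale * rate

    def update(self, region, success):
        self.alpha[region] += success
        self.beta[region] += 1 - success

shaper = BetaSuccessShaper(n_regions=100)
r = shaper.shaped_reward(region=42, env_reward=0.0)
shaper.update(region=42, success=1)
```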

[LG-24] NeurDB: On the Design and Implementation of an AI-powered Autonomous Database

链接: https://arxiv.org/abs/2408.03013
作者: Zhanhao Zhao,Shaofeng Cai,Haotian Gao,Hexiang Pan,Siqi Xiang,Naili Xing,Gang Chen,Beng Chin Ooi,Yanyan Shen,Yuncheng Wu,Meihui Zhang
关键词-EN: relieve end-user burdens, aiming to relieve, industry sectors, increasingly embracing, optimization and intelligent
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Databases are increasingly embracing AI to provide autonomous system optimization and intelligent in-database analytics, aiming to relieve end-user burdens across various industry sectors. Nonetheless, most existing approaches fail to account for the dynamic nature of databases, which renders them ineffective for real-world applications characterized by evolving data and workloads. This paper introduces NeurDB, an AI-powered autonomous database that deepens the fusion of AI and databases with adaptability to data and workload drift. NeurDB establishes a new in-database AI ecosystem that seamlessly integrates AI workflows within the database. This integration enables efficient and effective in-database AI analytics and fast-adaptive learned system components. Empirical evaluations demonstrate that NeurDB substantially outperforms existing solutions in managing AI analytics tasks, with the proposed learned components more effectively handling environmental dynamism than state-of-the-art approaches.

[LG-25] Federated Learning Architectures: A Performance Evaluation with Crop Yield Prediction Application

链接: https://arxiv.org/abs/2408.02998
作者: Anwesha Mukherjee,Rajkumar Buyya
关键词-EN: Federated learning, decentralized federated learning, federated learning frameworks, Long Short-Term Memory, decentralized federated
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated learning has become an emerging technology for data analysis in IoT applications. This paper implements centralized and decentralized federated learning frameworks for crop yield prediction based on a Long Short-Term Memory network. For centralized federated learning, multiple clients and one server are considered, where the clients exchange their model updates with the server, which acts as the aggregator to build the global model. For the decentralized framework, a collaborative network is formed among the devices using either a ring topology or a mesh topology. In this network, each device receives model updates from its neighbour devices and performs aggregation to build the upgraded model. The performance of the centralized and decentralized federated learning frameworks is evaluated in terms of prediction accuracy, precision, recall, F1-Score, and training time. The experimental results show that prediction accuracies of ≥97% and 97.5% are achieved using the centralized and decentralized federated learning-based frameworks, respectively. The results also show that with centralized federated learning the response time can be reduced by about 75% compared to the cloud-only framework. Finally, the future research directions of the use of federated learning in crop yield prediction are explored in this paper.
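
A minimal sketch of the centralized aggregation step (FedAvg-style weighted averaging) is shown below; in the decentralized variants described above, each device would instead average the updates received from its ring or mesh neighbours. The parameter shapes are placeholders, not the paper's LSTM.

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Server-side aggregation: average client model parameters weighted by
    local dataset size. Each entry of client_weights is a list of parameter
    arrays (e.g. LSTM weight matrices) with identical shapes across clients."""
    total = sum(client_sizes)
    n_layers = len(client_weights[0])
    return [
        sum(w[l] * (n / total) for w, n in zip(client_weights, client_sizes))
        for l in range(n_layers)
    ]

# Three toy clients, each holding two parameter arrays.
rng = np.random.default_rng(0)
clients = [[rng.normal(size=(4, 4)), rng.normal(size=4)] for _ in range(3)]
global_model = fedavg(clients, client_sizes=[120, 80, 200])
```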

[LG-26] A Differential Smoothness-based Compact-Dynamic Graph Convolutional Network for Spatiotemporal Signal Recovery

链接: https://arxiv.org/abs/2408.02987
作者: Pengcheng Gao,Zicheng Gao,Ye Yuan
关键词-EN: High quality spatiotemporal, real application scenarios, High quality, spatiotemporal signal, quality spatiotemporal signal
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:High quality spatiotemporal signal is vitally important for real application scenarios like energy management, traffic planning and cyber security. Due to uncontrollable factors like abrupt sensor breakdown or communication faults, the spatiotemporal signal collected by sensors is often incomplete. A dynamic graph convolutional network (DGCN) is effective for spatiotemporal signal recovery. However, it adopts a static GCN and a sequence neural network to explore the spatial and temporal patterns separately. Such separated two-step processing only loosely couples the spatial and temporal dimensions, thereby failing to capture the complex inner spatiotemporal correlation. To address this issue, this paper proposes a Compact-Dynamic Graph Convolutional Network (CDGCN) for spatiotemporal signal recovery with the following two-fold ideas: a) leveraging the tensor M-product to build a unified tensor graph convolution framework, which considers both spatial and temporal patterns simultaneously; and b) constructing a differential smoothness-based objective function to reduce the noise interference in the spatiotemporal signal, thereby further improving the recovery accuracy. Experiments on real-world spatiotemporal datasets demonstrate that the proposed CDGCN significantly outperforms the state-of-the-art models in terms of recovery accuracy.
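
The tensor M-product underlying the unified tensor graph convolution can be sketched as below. This is the generic M-product (mode-3 transform, facewise multiplication, inverse transform), not the CDGCN layer itself; shapes and the choice of M are illustrative.

```python
import numpy as np

def m_product(A, B, M):
    """Tensor M-product sketch: transform the third mode by M, multiply the
    frontal slices facewise, then transform back with M^{-1}.
    Shapes: A is (n1, n2, d), B is (n2, n3, d), M is an invertible (d, d)."""
    A_hat = np.einsum('ijk,lk->ijl', A, M)              # mode-3 transform of A
    B_hat = np.einsum('ijk,lk->ijl', B, M)              # mode-3 transform of B
    C_hat = np.einsum('ijk,jlk->ilk', A_hat, B_hat)     # facewise matrix products
    return np.einsum('ijk,lk->ijl', C_hat, np.linalg.inv(M))

d = 5
A = np.random.rand(3, 4, d)
B = np.random.rand(4, 2, d)
M = np.linalg.qr(np.random.rand(d, d))[0]               # any invertible transform
C = m_product(A, B, M)                                  # shape (3, 2, 5)
```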

[LG-27] Wave Interpolation Neural Operator: Interpolated Prediction of Electric Fields Across Untrained Wavelengths

链接: https://arxiv.org/abs/2408.02971
作者: Joonhyuk Seo,Chanik Kang,Dongjin Seo,Haejun Chung
关键词-EN: Designing photonic structures, Designing photonic, high computational costs, photonic structures requires, structures requires electromagnetic
类目: Machine Learning (cs.LG); Emerging Technologies (cs.ET); Optics (physics.optics)
*备注: 9 pages, 5 figures, 4 tables / Appendix: 4 pages, 4 figures, 3 tables

点击查看摘要

Abstract:Designing photonic structures requires electromagnetic simulations, which often require high computational costs. Researchers have developed surrogate solvers for predicting electric fields to alleviate the computational issues. However, existing surrogate solvers are limited to performing inference at fixed simulation conditions and require retraining for different conditions. To address this, we propose Wave Interpolation Neural Operator (WINO), a novel surrogate solver enabling simulation condition interpolation across a continuous spectrum of broadband wavelengths. WINO introduces the Fourier Group Convolution Shuffling operator and a new conditioning method to efficiently predict electric fields from both trained and untrained wavelength data, achieving significant improvements in parameter efficiency and spectral interpolation performance. Our model demonstrates approximately 100 times faster performance than traditional finite-difference frequency-domain simulations. Moreover, compared to the state-of-the-art model, we achieve a 74% reduction in parameters and 80.5% improvements in prediction accuracy for untrained wavelengths, and 13.2% improvements for trained wavelengths.

[LG-28] Data-Driven Stochastic Closure Modeling via Conditional Diffusion Model and Neural Operator

链接: https://arxiv.org/abs/2408.02965
作者: Xinghao Dong,Chuanqi Chen,Jin-Long Wu
关键词-EN: direct numerical simulation, data-driven stochastic closure, stochastic closure model, Closure models, stochastic closure
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:Closure models are widely used in simulating complex multiscale dynamical systems such as turbulence and the earth system, for which direct numerical simulation that resolves all scales is often too expensive. For those systems without a clear scale separation, deterministic and local closure models often lack enough generalization capability, which limits their performance in many real-world applications. In this work, we propose a data-driven modeling framework for constructing stochastic and non-local closure models via conditional diffusion model and neural operator. Specifically, the Fourier neural operator is incorporated into a score-based diffusion model, which serves as a data-driven stochastic closure model for complex dynamical systems governed by partial differential equations (PDEs). We also demonstrate how accelerated sampling methods can improve the efficiency of the data-driven stochastic closure model. The results show that the proposed methodology provides a systematic approach via generative machine learning techniques to construct data-driven stochastic closure models for multiscale dynamical systems with continuous spatiotemporal fields.

[LG-29] Synaptic Modulation using Interspike Intervals Increases Energy Efficiency of Spiking Neural Networks

链接: https://arxiv.org/abs/2408.02961
作者: Dylan Adams,Magda Zajaczkowska,Ashiq Anjum,Andrea Soltoggio,Shirin Dora
关键词-EN: Spiking Neural Networks, Artificial Neural Networks, Spiking Neural, Artificial Neural, Neural Networks
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注: 9 pages, 3 figures

点击查看摘要

Abstract:Despite basic differences between Spiking Neural Networks (SNN) and Artificial Neural Networks (ANN), most research on SNNs involves adapting ANN-based methods for SNNs. Pruning (dropping connections) and quantization (reducing precision) are often used to improve the energy efficiency of SNNs. These methods are very effective for ANNs, whose energy needs are determined by signals transmitted on synapses. However, the event-driven paradigm in SNNs implies that energy is consumed by spikes. In this paper, we propose a new synapse model whose weights are modulated by Interspike Intervals (ISI), i.e. the time difference between two spikes. SNNs composed of this synapse model, termed ISI Modulated SNNs (IMSNN), can use gradient descent to estimate how the ISI of a neuron changes after updating its synaptic parameters. A higher ISI implies fewer spikes and vice-versa. The learning algorithm for IMSNNs exploits this information to selectively propagate gradients such that learning is achieved by increasing the ISIs, resulting in a network that generates fewer spikes. The performance of IMSNNs with dense and convolutional layers has been evaluated in terms of classification accuracy and the number of spikes using the MNIST and FashionMNIST datasets. The performance comparison with conventional SNNs shows that IMSNNs exhibit up to 90% reduction in the number of spikes while maintaining similar classification accuracy.
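
A toy rendering of an ISI-modulated synapse feeding a leaky integrate-and-fire neuron is given below. The modulation function g and all constants are illustrative assumptions; the paper derives its own modulation and trains it with gradient descent.

```python
import numpy as np

def g(isi, tau=20.0):
    """Illustrative ISI modulation: longer intervals give weaker drive."""
    return np.exp(-isi / tau)

def run_lif(spike_times, w=0.8, v_th=1.0, v_decay=0.95, t_max=200):
    """Leaky integrate-and-fire neuron driven by one ISI-modulated synapse."""
    v, last_spike, out_spikes = 0.0, None, []
    spike_set = set(spike_times)
    for t in range(t_max):
        v *= v_decay                                   # membrane leak
        if t in spike_set:
            isi = t - last_spike if last_spike is not None else np.inf
            # The first spike uses the unmodulated weight; later spikes are
            # scaled by the interval since the previous presynaptic spike.
            v += w * (g(isi) if np.isfinite(isi) else 1.0)
            last_spike = t
        if v >= v_th:
            out_spikes.append(t)
            v = 0.0                                    # reset after firing
    return out_spikes

print(run_lif(spike_times=[5, 10, 15, 60, 120]))
```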

[LG-30] Kolmogorov-Arnold PointNet: Deep learning for prediction of fluid fields on irregular geometries

链接: https://arxiv.org/abs/2408.02950
作者: Ali Kashefi
关键词-EN: supervised deep learning, deep learning framework, fluid flow fields, shared Kolmogorov-Arnold Networks, present Kolmogorov-Arnold PointNet
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:We present Kolmogorov-Arnold PointNet (KA-PointNet) as a novel supervised deep learning framework for the prediction of incompressible steady-state fluid flow fields in irregular domains, where the predicted fields are a function of the geometry of the domains. In KA-PointNet, we implement shared Kolmogorov-Arnold Networks (KANs) in the segmentation branch of the PointNet architecture. We utilize Jacobi polynomials to construct shared KANs. As a benchmark test case, we consider incompressible laminar steady-state flow over a cylinder, where the geometry of its cross-section varies over the data set. We investigate the performance of Jacobi polynomials with different degrees as well as special cases of Jacobi polynomials such as Legendre polynomials, Chebyshev polynomials of the first and second kinds, and Gegenbauer polynomials, in terms of the computational cost of training and accuracy of prediction of the test set. Additionally, we compare the performance of PointNet with shared KANs (i.e., KA-PointNet) and PointNet with shared Multilayer Perceptrons (MLPs). It is observed that when the number of trainable parameters is approximately equal, PointNet with shared KANs (i.e., KA-PointNet) outperforms PointNet with shared MLPs.
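
A shared polynomial KAN-style layer of the kind described here can be sketched with Chebyshev polynomials of the first kind, one of the special cases of the Jacobi family mentioned above; the tanh squashing, degree, and shapes are illustrative assumptions, not KA-PointNet's implementation.

```python
import numpy as np

def cheb_basis(x, degree):
    """Chebyshev polynomials of the first kind T_0..T_degree, evaluated
    elementwise on x (assumed to lie in [-1, 1] after tanh squashing)."""
    T = [np.ones_like(x), x]
    for _ in range(2, degree + 1):
        T.append(2 * x * T[-1] - T[-2])
    return np.stack(T[: degree + 1], axis=-1)          # (..., degree + 1)

def kan_layer(x, coeffs):
    """Forward pass of a shared polynomial KAN-style layer: every input-output
    edge carries its own learnable 1-D polynomial, and each output sums its
    incoming edge functions. x: (n_points, d_in); coeffs: (d_in, d_out, degree+1)."""
    z = np.tanh(x)                                     # squash into [-1, 1]
    basis = cheb_basis(z, coeffs.shape[-1] - 1)        # (n, d_in, degree + 1)
    return np.einsum('nid,iod->no', basis, coeffs)

rng = np.random.default_rng(0)
pts = rng.normal(size=(1024, 3))                       # toy point cloud
w = rng.normal(size=(3, 16, 5)) * 0.1                  # degree-4 polynomials
features = kan_layer(pts, w)                           # (1024, 16)
```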

[LG-31] Scaling Laws for Data Poisoning in LLMs

链接: https://arxiv.org/abs/2408.02946
作者: Dillon Bowen,Brendan Murphy,Will Cai,David Khachaturov,Adam Gleave,Kellin Pelrine
关键词-EN: Recent work shows, Recent work, data poisoning, data, work shows
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 20 pages

点击查看摘要

Abstract:Recent work shows that LLMs are vulnerable to data poisoning, in which they are trained on partially corrupted or harmful data. Poisoned data is hard to detect, breaks guardrails, and leads to undesirable and harmful behavior. Given the intense efforts by leading labs to train and deploy increasingly larger and more capable LLMs, it is critical to ask if the risk of data poisoning will be naturally mitigated by scale, or if it is an increasing threat. We consider three threat models by which data poisoning can occur: malicious fine-tuning, imperfect data curation, and intentional data contamination. Our experiments evaluate the effects of data poisoning on 23 frontier LLMs ranging from 1.5-72 billion parameters on three datasets which speak to each of our threat models. We find that larger LLMs are increasingly vulnerable, learning harmful behavior – including sleeper agent behavior – significantly more quickly than smaller LLMs with even minimal data poisoning. These results underscore the need for robust safeguards against data poisoning in larger LLMs.

[LG-32] Achieving More with Less: A Tensor-Optimization-Powered Ensemble Method

链接: https://arxiv.org/abs/2408.02936
作者: Jinghui Yuan,Weijin Jiang,Zhe Cao,Fangyuan Xie,Rong Wang,Feiping Nie,Xuelong Li
关键词-EN: base learners, base, Ensemble learning, weak base learner, learners
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Ensemble learning is a method that leverages weak learners to produce a strong learner. However, obtaining a large number of base learners requires substantial time and computational resources. Therefore, it is meaningful to study how to achieve the performance typically obtained with many base learners using only a few. We argue that to achieve this, it is essential to enhance both classification performance and generalization ability during the ensemble process. To increase model accuracy, each weak base learner needs to be more efficiently integrated. It is observed that different base learners exhibit varying levels of accuracy in predicting different classes. To capitalize on this, we introduce a confidence tensor $\tilde{\mathbf{\Theta}}$, whose entry $\tilde{\Theta}_{rst}$ signifies that the $t$-th base classifier assigns the sample to class $r$ while it actually belongs to class $s$. To the best of our knowledge, this is the first time an evaluation of the performance of base classifiers across different classes has been proposed. The proposed confidence tensor compensates for the strengths and weaknesses of each base classifier in different classes, enabling the method to achieve superior results with a smaller number of base learners. To enhance generalization performance, we design a smooth and convex objective function that leverages the concept of margin, making the strong learner more discriminative. Furthermore, it is proved that in the gradient matrix of the loss function, the sum of each column's elements is zero, allowing us to solve a constrained optimization problem using gradient-based methods. We then compare our algorithm with random forests of ten times the size and other classical methods across numerous datasets, demonstrating the superiority of our approach.
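
One plausible way to estimate such a confidence tensor from held-out predictions is sketched below; the indexing convention and Laplace smoothing are assumptions for illustration, not the paper's exact construction or its margin-based objective.

```python
import numpy as np

def confidence_tensor(preds, y_true, n_classes):
    """Entry [r, s, t] is the (smoothed) frequency with which base classifier t
    predicts class r for samples whose true class is s.
    preds: (n_samples, n_classifiers) of predicted labels; y_true: (n_samples,)."""
    n_clf = preds.shape[1]
    theta = np.ones((n_classes, n_classes, n_clf))     # Laplace smoothing
    for t in range(n_clf):
        for r, s in zip(preds[:, t], y_true):
            theta[r, s, t] += 1
    # Normalise over the predicted class, per true class and classifier.
    return theta / theta.sum(axis=0, keepdims=True)

rng = np.random.default_rng(0)
y = rng.integers(0, 3, size=500)
p = np.stack([np.where(rng.random(500) < 0.8, y, rng.integers(0, 3, 500))
              for _ in range(4)], axis=1)              # 4 noisy base classifiers
theta = confidence_tensor(p, y, n_classes=3)
```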

[LG-33] Doubly Stochastic Adaptive Neighbors Clustering via the Marcus Mapping

链接: https://arxiv.org/abs/2408.02932
作者: Jinghui Yuan,Chusheng Zeng,Fangyuan Xie,Zhe Cao,Rong Wang,Feiping Nie,Xuelong Li
关键词-EN: Doubly stochastic symmetric, Marcus mapping, Doubly stochastic, similarity graph-based clustering, Marcus
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Clustering is a fundamental task in machine learning and data science, and similarity graph-based clustering is an important approach within this domain. Doubly stochastic symmetric similarity graphs provide numerous benefits for clustering problems and downstream tasks, yet learning such graphs remains a significant challenge. Marcus theorem states that a strictly positive symmetric matrix can be transformed into a doubly stochastic symmetric matrix by diagonal matrices. However, in clustering, learning sparse matrices is crucial for computational efficiency. We extend Marcus theorem by proposing the Marcus mapping, which indicates that certain sparse matrices can also be transformed into doubly stochastic symmetric matrices via diagonal matrices. Additionally, we introduce rank constraints into the clustering problem and propose the Doubly Stochastic Adaptive Neighbors Clustering algorithm based on the Marcus Mapping (ANCMM). This ensures that the learned graph naturally divides into the desired number of clusters. We validate the effectiveness of our algorithm through extensive comparisons with state-of-the-art algorithms. Finally, we explore the relationship between the Marcus mapping and optimal transport. We prove that the Marcus mapping solves a specific type of optimal transport problem and demonstrate that solving this problem through Marcus mapping is more efficient than directly applying optimal transport methods.
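
The diagonal-scaling idea behind the Marcus mapping can be illustrated with a simple fixed-point iteration that makes a strictly positive symmetric similarity matrix doubly stochastic. This is not the ANCMM optimisation itself, and the iteration shown is only one convenient way to compute such a scaling.

```python
import numpy as np

def symmetric_scale(A, n_iter=500, tol=1e-10):
    """Find a positive diagonal scaling d so that diag(d) @ A @ diag(d) is
    (approximately) doubly stochastic, i.e. d_i * (A d)_i = 1 for every i."""
    d = np.ones(A.shape[0])
    for _ in range(n_iter):
        d_new = np.sqrt(d / (A @ d))       # fixed point of d_i * (A d)_i = 1
        if np.max(np.abs(d_new - d)) < tol:
            d = d_new
            break
        d = d_new
    return np.diag(d) @ A @ np.diag(d)

rng = np.random.default_rng(0)
S = rng.random((6, 6))
S = (S + S.T) / 2 + 1e-3                   # strictly positive symmetric similarity
P = symmetric_scale(S)
print(P.sum(axis=0), P.sum(axis=1))        # both close to all-ones
```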

[LG-34] The Need for a Big World Simulator: A Scientific Challenge for Continual Learning

链接: https://arxiv.org/abs/2408.02930
作者: Saurabh Kumar,Hong Jun Jeon,Alex Lewandowski,Benjamin Van Roy
关键词-EN: small agent, conceptual view, view that motivates, frame offers, small agent operating
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted to the Finding the Frame Workshop at RLC 2024

点击查看摘要

Abstract:The “small agent, big world” frame offers a conceptual view that motivates the need for continual learning. The idea is that a small agent operating in a much bigger world cannot store all information that the world has to offer. To perform well, the agent must be carefully designed to ingest, retain, and eject the right information. To enable the development of performant continual learning agents, a number of synthetic environments have been proposed. However, these benchmarks suffer from limitations, including unnatural distribution shifts and a lack of fidelity to the “small agent, big world” framing. This paper aims to formalize two desiderata for the design of future simulated environments. These two criteria aim to reflect the objectives and complexity of continual learning in practical settings while enabling rapid prototyping of algorithms on a smaller scale.

[LG-35] HARMONIC: Harnessing LLMs for Tabular Data Synthesis and Privacy Protection

链接: https://arxiv.org/abs/2408.02927
作者: Yuxin Wang,Duanyu Feng,Yongfu Dai,Zhengyu Chen,Jimin Huang,Sophia Ananiadou,Qianqian Xie,Hao Wang
关键词-EN: tabular data, advancing deep learning, tabular data generation, Data, tabular data presented
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Data serves as the fundamental foundation for advancing deep learning, particularly tabular data presented in a structured format, which is highly conducive to modeling. However, even in the era of LLMs, obtaining tabular data from sensitive domains remains a challenge due to privacy or copyright concerns. Hence, exploring how to effectively use models like LLMs to generate realistic and privacy-preserving synthetic tabular data is urgent. In this paper, we take a step forward to explore LLMs for tabular data synthesis and privacy protection, by introducing a new framework, HARMONIC, for tabular data generation and evaluation. In the tabular data generation of our framework, unlike previous small-scale LLM-based methods that rely on continued pre-training, we explore larger-scale LLMs with fine-tuning to generate tabular data and enhance privacy. Based on the idea of the k-nearest neighbors algorithm, an instruction fine-tuning dataset is constructed to inspire LLMs to discover inter-row relationships. Then, with fine-tuning, LLMs are trained to remember the format and connections of the data rather than the data itself, which reduces the risk of privacy leakage. In the evaluation part of our framework, we develop specific privacy risk metrics DLT for LLM synthetic data generation, as well as performance evaluation metrics LLE for downstream LLM tasks. Our experiments find that this tabular data generation framework achieves performance equivalent to existing methods with better privacy, which also validates our evaluation framework for the effectiveness of synthetic data and privacy risks in LLM scenarios.

[LG-36] A Metric Driven Approach to Mixed Precision Training

链接: https://arxiv.org/abs/2408.02897
作者: Mitchelle Rasquinha,Gil Tabak
关键词-EN: deep learning methodologies, increasing neural network, neural network size, network size improves, improves model quality
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:As deep learning methodologies have developed, it has been generally agreed that increasing neural network size improves model quality. However, this is at the expense of memory and compute requirements, which also need to be increased. Various efficiency techniques have been proposed to rein in hardware costs, one being the use of low precision numerics. Recent accelerators have introduced several different 8-bit data types to help accommodate DNNs in terms of numerics. In this paper, we identify a metric driven methodology to aid in the choice of numerics. We demonstrate how such a methodology can help scale training of a language representation model. The technique can be generalized to other model architectures.

[LG-37] Compromising Embodied Agents with Contextual Backdoor Attacks

链接: https://arxiv.org/abs/2408.02882
作者: Aishan Liu,Yuguang Zhou,Xianglong Liu,Tianyuan Zhang,Siyuan Liang,Jiakai Wang,Yanjun Pu,Tianlin Li,Junqi Zhang,Wenbo Zhou,Qing Guo,Dacheng Tao
关键词-EN: Large language models, Large language, transformed the development, language models, embodied intelligence
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) have transformed the development of embodied intelligence. By providing a few contextual demonstrations, developers can utilize the extensive internal knowledge of LLMs to effortlessly translate complex tasks described in abstract language into sequences of code snippets, which will serve as the execution logic for embodied agents. However, this paper uncovers a significant backdoor security threat within this process and introduces a novel contextual backdoor attack method. By poisoning just a few contextual demonstrations, attackers can covertly compromise the contextual environment of a black-box LLM, prompting it to generate programs with context-dependent defects. These programs appear logically sound but contain defects that can activate and induce unintended behaviors when the operational agent encounters specific triggers in its interactive environment. To compromise the LLM’s contextual environment, we employ adversarial in-context generation to optimize poisoned demonstrations, where an LLM judge evaluates these poisoned prompts, reporting to an additional LLM that iteratively optimizes the demonstration in a two-player adversarial game using chain-of-thought reasoning. To enable context-dependent behaviors in downstream agents, we implement a dual-modality activation strategy that controls both the generation and execution of program defects through textual and visual triggers. We expand the scope of our attack by developing five program defect modes that compromise key aspects of confidentiality, integrity, and availability in embodied agents. To validate the effectiveness of our approach, we conducted extensive experiments across various tasks, including robot planning, robot manipulation, and compositional visual reasoning. Additionally, we demonstrate the potential impact of our approach by successfully attacking real-world autonomous driving systems.

[LG-38] Back-Projection Diffusion: Solving the Wideband Inverse Scattering Problem with Diffusion Models

链接: https://arxiv.org/abs/2408.02866
作者: Borong Zhang,Martín Guerra,Qin Li,Leonardo Zepeda-Núñez
关键词-EN: wideband scattering data, inverse scattering map, posterior distribution induced, Wideband back-projection diffusion, wideband scattering
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:We present Wideband back-projection diffusion, an end-to-end probabilistic framework for approximating the posterior distribution induced by the inverse scattering map from wideband scattering data. This framework leverages conditional diffusion models coupled with the underlying physics of wave-propagation and symmetries in the problem, to produce highly accurate reconstructions. The framework introduces a factorization of the score function into a physics-based latent representation inspired by the filtered back-propagation formula and a conditional score function conditioned on this latent representation. These two steps are also constrained to obey symmetries in the formulation while being amenable to compression by imposing the rank structure found in the filtered back-projection formula. As a result, empirically, our framework is able to provide sharp reconstructions effortlessly, even recovering sub-Nyquist features in the multiple-scattering regime. It has low-sample and computational complexity, its number of parameters scales sub-linearly with the target resolution, and it has stable training dynamics.

[LG-39] A Framework for Fine-Tuning LLMs using Heterogeneous Feedback

链接: https://arxiv.org/abs/2408.02861
作者: Ryan Aponte(1),Ryan A. Rossi(2),Shunan Guo(2),Franck Dernoncourt(2),Tong Yu(2),Xiang Chen(2),Subrata Mitra(2),Nedim Lipka(2) ((1) Carnegie Mellon University, (2) Adobe Research)
关键词-EN: including text summarization, Large language models, Large language, web navigation, range of tasks
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 7 pages, 1 figure

点击查看摘要

Abstract:Large language models (LLMs) have been applied to a wide range of tasks, including text summarization, web navigation, and chatbots. They have benefitted from supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) following an unsupervised pretraining. These datasets can be difficult to collect, limited in scope, and vary in sample quality. Additionally, datasets can vary extensively in supervision format, from numerical to binary as well as multi-dimensional with many different values. We present a framework for fine-tuning LLMs using heterogeneous feedback, which has two main components. First, we combine the heterogeneous feedback data into a single supervision format, compatible with methods like SFT and RLHF. Next, given this unified feedback dataset, we extract a high-quality and diverse subset to obtain performance increases potentially exceeding the full dataset. We conduct extensive experiments to understand the effectiveness of these techniques for incorporating heterogeneous feedback, and demonstrate improvements from using a high-quality and diverse subset of the data. We find that our framework is able to improve models in multiple areas simultaneously, such as in instruction following and bias reduction.

[LG-40] Active Learning for WBAN-based Health Monitoring

链接: https://arxiv.org/abs/2408.02849
作者: Cho-Chun Chiu,Tuan Nguyen,Ting He,Shiqiang Wang,Beom-Su Kim,Ki-Il Kim
关键词-EN: body area network, wireless body area, area network, learning machine learning, learning problem motivated
类目: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注:

点击查看摘要

Abstract:We consider a novel active learning problem motivated by the need of learning machine learning models for health monitoring in wireless body area network (WBAN). Due to the limited resources at body sensors, collecting each unlabeled sample in WBAN incurs a nontrivial cost. Moreover, training health monitoring models typically requires labels indicating the patient’s health state that need to be generated by healthcare professionals, which cannot be obtained at the same pace as data collection. These challenges make our problem fundamentally different from classical active learning, where unlabeled samples are free and labels can be queried in real time. To handle these challenges, we propose a two-phased active learning method, consisting of an online phase where a coreset construction algorithm is proposed to select a subset of unlabeled samples based on their noisy predictions, and an offline phase where the selected samples are labeled to train the target model. The samples selected by our algorithm are proved to yield a guaranteed error in approximating the full dataset in evaluating the loss function. Our evaluation based on real health monitoring data and our own experimentation demonstrates that our solution can drastically save the data curation cost without sacrificing the quality of the target model.

[LG-41] Heterogeneous graph attention network improves cancer multiomics integration

链接: https://arxiv.org/abs/2408.02845
作者: Sina Tabakhi,Charlotte Vandermeulen,Ian Sudbery,Haiping Lu
关键词-EN: data demands advanced, demands advanced integration, human diseases, multiomics data demands, data demands
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA); Biomolecules (q-bio.BM); Genomics (q-bio.GN)
*备注: 29 pages, 13 figures

点击查看摘要

Abstract:The increase in high-dimensional multiomics data demands advanced integration models to capture the complexity of human diseases. Graph-based deep learning integration models, despite their promise, struggle with small patient cohorts and high-dimensional features, often applying independent feature selection without modeling relationships among omics. Furthermore, conventional graph-based omics models focus on homogeneous graphs, lacking multiple types of nodes and edges to capture diverse structures. We introduce a Heterogeneous Graph ATtention network for omics integration (HeteroGATomics) to improve cancer diagnosis. HeteroGATomics performs joint feature selection through a multi-agent system, creating dedicated networks of feature and patient similarity for each omic modality. These networks are then combined into one heterogeneous graph for learning holistic omic-specific representations and integrating predictions across modalities. Experiments on three cancer multiomics datasets demonstrate HeteroGATomics’ superior performance in cancer diagnosis. Moreover, HeteroGATomics enhances interpretability by identifying important biomarkers contributing to the diagnosis outcomes.

[LG-42] Interpretation of the Intent Detection Problem as Dynamics in a Low-dimensional Space ECAI-2024

链接: https://arxiv.org/abs/2408.02838
作者: Eduardo Sanchez-Karhunen,Jose F. Quesada-Moreno,Miguel A. Gutiérrez-Naranjo
关键词-EN: text classification task, Intent detection, users query, text classification, recognize and label
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
*备注: Camera-Ready version. Accepted paper at 27th European Conference on Artificial Intelligence (ECAI-2024)

点击查看摘要

Abstract:Intent detection is a text classification task whose aim is to recognize and label the semantics behind a user's query. It plays a critical role in various business applications. The output of the intent detection module strongly conditions the behavior of the whole system. This sequence analysis task is mainly tackled using deep learning techniques. Despite the widespread use of these techniques, the internal mechanisms used by networks to solve the problem are poorly understood. Recent lines of work have analyzed the computational mechanisms learned by RNNs from a dynamical systems perspective. In this work, we investigate how different RNN architectures solve the SNIPS intent detection problem. Sentences injected into trained networks can be interpreted as trajectories traversing a hidden state space. This space is constrained to a low-dimensional manifold whose dimensionality is related to the embedding and hidden layer sizes. To generate predictions, the RNN steers the trajectories towards concrete regions, spatially aligned with the directions of the output-layer matrix rows. Underlying the system dynamics, an unexpected fixed point topology has been identified with a limited number of attractors. Our results provide new insights into the inner workings of networks that solve the intent detection task.

[LG-43] DaCapo: a modular deep learning framework for scalable 3D image segmentation

链接: https://arxiv.org/abs/2408.02834
作者: William Patton,Jeff L. Rhoades,Marwan Zouinkhi,David G. Ackerman,Caroline Malin-Mayor,Diane Adjavon,Larissa Heinrich,Davis Bennett,Yurii Zubov,CellMap Project Team,Aubrey V. Weigel,Jan Funke
关键词-EN: specialized deep learning, deep learning library, learning library tailored, existing machine learning, machine learning approaches
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:DaCapo is a specialized deep learning library tailored to expedite the training and application of existing machine learning approaches on large, near-isotropic image data. In this correspondence, we introduce DaCapo’s unique features optimized for this specific domain, highlighting its modular structure, efficient experiment management tools, and scalable deployment capabilities. We discuss its potential to improve access to large-scale, isotropic image segmentation and invite the community to explore and contribute to this open-source initiative.

[LG-44] Wave-RVFL: A Randomized Neural Network Based on Wave Loss Function

链接: https://arxiv.org/abs/2408.02824
作者: M. Sajid,A. Quadir,M. Tanveer
关键词-EN: vector functional link, random vector functional, strong generalization capabilities, functional link, network is well-regarded
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The random vector functional link (RVFL) network is well-regarded for its strong generalization capabilities in the field of machine learning. However, its inherent dependencies on the square loss function make it susceptible to noise and outliers. Furthermore, the calculation of RVFL’s unknown parameters necessitates matrix inversion of the entire training sample, which constrains its scalability. To address these challenges, we propose the Wave-RVFL, an RVFL model incorporating the wave loss function. We formulate and solve the proposed optimization problem of the Wave-RVFL using the adaptive moment estimation (Adam) algorithm in a way that successfully eliminates the requirement for matrix inversion and significantly enhances scalability. The Wave-RVFL exhibits robustness against noise and outliers by preventing over-penalization of deviations, thereby maintaining a balanced approach to managing noise and outliers. The proposed Wave-RVFL model is evaluated on multiple UCI datasets, both with and without the addition of noise and outliers, across various domains and sizes. Empirical results affirm the superior performance and robustness of the Wave-RVFL compared to baseline models, establishing it as a highly effective and scalable classification solution.
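
The abstract's key point, fitting the RVFL output weights with Adam instead of a matrix inversion, can be sketched as follows. The squared loss below is only a stand-in: the paper's wave loss (whose exact form is not reproduced here) would replace it, and all sizes are illustrative.

```python
import numpy as np

def rvfl_features(X, W, b):
    """RVFL hidden features: random projection plus activation, concatenated
    with the original (direct-link) inputs."""
    H = np.tanh(X @ W + b)
    return np.concatenate([X, H], axis=1)

def train_output_weights(D, y, lr=0.01, n_iter=2000):
    """Fit the output weights with Adam on a plain squared loss, avoiding the
    matrix inversion of the classical closed-form RVFL solution."""
    beta = np.zeros(D.shape[1])
    m, v = np.zeros_like(beta), np.zeros_like(beta)
    b1, b2, eps = 0.9, 0.999, 1e-8
    for t in range(1, n_iter + 1):
        grad = D.T @ (D @ beta - y) / len(y)
        m = b1 * m + (1 - b1) * grad
        v = b2 * v + (1 - b2) * grad ** 2
        m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)
        beta -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = np.sign(X[:, 0] + 0.5 * X[:, 1])          # toy binary labels in {-1, +1}
W, b = rng.normal(size=(10, 50)), rng.normal(size=50)
D = rvfl_features(X, W, b)
beta = train_output_weights(D, y)
print(round(float(np.mean(np.sign(D @ beta) == y)), 3))
```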

[LG-45] Pre-trained Encoder Inference: Revealing Upstream Encoders In Downstream Machine Learning Services

链接: https://arxiv.org/abs/2408.02814
作者: Shaopeng Fu,Xuexue Sun,Ke Qing,Tianhang Zheng,Di Wang
关键词-EN: easily accessed online, downstream machine learning, build downstream machine, Pre-trained Encoder Inference, PEI attack
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Though pre-trained encoders can be easily accessed online to build downstream machine learning (ML) services quickly, various attacks have been designed to compromise the security and privacy of these encoders. While most attacks target encoders on the upstream side, it remains unknown how an encoder could be threatened when deployed in a downstream ML service. This paper unveils a new vulnerability: the Pre-trained Encoder Inference (PEI) attack, which poses privacy threats to encoders hidden behind downstream ML services. With only API access to a targeted downstream service and a set of candidate encoders, the PEI attack can infer which candidate encoder is secretly used by the targeted service. We evaluate the attack performance of PEI against real-world encoders on three downstream tasks: image classification, text classification, and text-to-image generation. Experiments show that the PEI attack succeeds in revealing the hidden encoder in most cases and seldom makes mistakes even when the hidden encoder is not in the candidate set. We also conducted a case study on one of the most recent vision-language models, LLaVA, to illustrate that the PEI attack is useful in assisting other ML attacks such as adversarial attacks. The code is available at this https URL.

[LG-46] Mitigating Malicious Attacks in Federated Learning via Confidence-aware Defense

链接: https://arxiv.org/abs/2408.02813
作者: Qilei Li,Ahmed M. Abdelmoniem
关键词-EN: distributed machine learning, machine learning paradigm, Federated Learning, emerging distributed machine, sharing private local
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Federated Learning (FL) is an emerging distributed machine learning paradigm that allows multiple clients to collaboratively train a global model without sharing private local data. However, FL systems are vulnerable to attacks from malicious clients, who can degrade the global model performance through data poisoning and model poisoning. Existing defense methods typically focus on a single type of attack, such as Byzantine attacks or backdoor attacks, and are often ineffective against potential data poisoning attacks like label flipping and label shuffling. Additionally, these methods often lack accuracy and robustness in detecting and handling malicious updates. To address these issues, we propose a novel method based on model confidence scores, which evaluates the uncertainty of client model updates to detect and defend against malicious clients. Our approach is comprehensively effective for both model poisoning and data poisoning attacks and is capable of accurately identifying and mitigating potential malicious updates from being aggregated. Experimental results demonstrate that our method significantly improves the robustness of FL systems against various types of attacks, also achieving higher model accuracy and stability across various scenarios.

[LG-47] Deciphering Air Travel Disruptions: A Machine Learning Approach

链接: https://arxiv.org/abs/2408.02802
作者: Aravinda Jatavallabha,Jacob Gerlach,Aadithya Naresh
关键词-EN: departure time, Decision Tree Regression, Random Forest Regression, trends by examining, examining factors
类目: Machine Learning (cs.LG)
*备注: 10 pages, 11 figures, 6 tables

点击查看摘要

Abstract:This research investigates flight delay trends by examining factors such as departure time, airline, and airport. It employs regression machine learning methods to predict the contributions of various sources to delays. Time-series models, including LSTM, Hybrid LSTM, and Bi-LSTM, are compared with baseline regression models such as Multiple Regression, Decision Tree Regression, Random Forest Regression, and Neural Network. Despite considerable errors in the baseline models, the study aims to identify influential features in delay prediction, potentially informing flight planning strategies. Unlike previous work, this research focuses on regression tasks and explores the use of time-series models for predicting flight delays. It offers insights into aviation operations by independently analyzing each delay component (e.g., security, weather).

[LG-48] Sparse Deep Learning Models with the $\ell_1$ Regularization

链接: https://arxiv.org/abs/2408.02801
作者: Lixin Shen,Rui Wang,Yuesheng Xu,Mingsong Yan
关键词-EN: Sparse neural networks, Sparse neural, regularization parameters, reducing its complexity, neural networks
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Sparse neural networks are highly desirable in deep learning for reducing model complexity. The goal of this paper is to study how choices of regularization parameters influence the sparsity level of learned neural networks. We first derive the $\ell_1$-norm sparsity-promoting deep learning models, including single and multiple regularization parameter models, from a statistical viewpoint. We then characterize the sparsity level of a regularized neural network in terms of the choice of the regularization parameters. Based on the characterizations, we develop iterative algorithms for selecting regularization parameters so that the weight parameters of the resulting deep neural network enjoy prescribed sparsity levels. Numerical experiments are presented to demonstrate the effectiveness of the proposed algorithms in choosing desirable regularization parameters and obtaining corresponding neural networks having both predetermined sparsity levels and satisfactory approximation accuracy.
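
The role of the regularization parameter is easy to see in a proximal-gradient (ISTA) sketch for an $\ell_1$-regularised least-squares layer: the soft-thresholding step sets small weights exactly to zero, and larger lam gives fewer nonzeros. The paper's contribution, an iterative rule for choosing the parameters to hit a prescribed sparsity level, is not reproduced here; the problem sizes below are illustrative.

```python
import numpy as np

def soft_threshold(w, thresh):
    """Proximal operator of the l1 norm: shrinks weights toward zero and
    sets small ones exactly to zero, which is what yields sparsity."""
    return np.sign(w) * np.maximum(np.abs(w) - thresh, 0.0)

def ista(X, y, lam, lr=0.1, n_iter=500):
    """Proximal gradient descent for an l1-regularised least-squares layer."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / len(y)
        w = soft_threshold(w - lr * grad, lr * lam)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
w_true = np.zeros(50)
w_true[:5] = rng.normal(size=5)
y = X @ w_true + 0.01 * rng.normal(size=200)
for lam in (0.01, 0.1, 1.0):
    w = ista(X, y, lam)
    print(lam, int((w != 0).sum()))   # nonzero count shrinks as lam grows
```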

[LG-49] Examining Gender and Power on Wikipedia Through Face and Politeness

链接: https://arxiv.org/abs/2408.02798
作者: Adil Soubki,Shyne Choi,Owen Rambow
关键词-EN: sociolinguistic theory, face acts, analyzing discourse, discourse by combining, combining two interdependent
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose a framework for analyzing discourse by combining two interdependent concepts from sociolinguistic theory: face acts and politeness. While politeness has robust existing tools and data, face acts are less resourced. We introduce a new corpus created by annotating Wikipedia talk pages with face acts and we use this to train a face act tagger. We then employ our framework to study how face and politeness interact with gender and power in discussions between Wikipedia editors. Among other findings, we observe that female Wikipedians are not only more polite, which is consistent with prior studies, but that this difference corresponds with significantly more language directed at humbling aspects of their own face. Interestingly, the distinction nearly vanishes once limiting to editors with administrative power.

[LG-50] Algorithm-Informed Graph Neural Networks for Leakage Detection and Localization in Water Distribution Networks

链接: https://arxiv.org/abs/2408.02797
作者: Zepeng Zhang,Olga Fink
关键词-EN: Detecting and localizing, water distribution networks, significant challenge, efficient and sustainable, sustainable management
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Detecting and localizing leakages is a significant challenge for the efficient and sustainable management of water distribution networks (WDN). Leveraging the inherent graph structure of WDNs, recent approaches have used graph-based data-driven methods. However, these methods often learn shortcuts that work well with in-distribution data but fail to generalize to out-of-distribution data. To address this limitation and inspired by the perfect generalization ability of classical algorithms, we propose an algorithm-informed graph neural network (AIGNN). Recognizing that WDNs function as flow networks, incorporating max-flow information can be beneficial for inferring pressures. In the proposed framework, we first train AIGNN to emulate the Ford-Fulkerson algorithm for solving max-flow problems. This algorithmic knowledge is then transferred to address the pressure estimation problem in WDNs. Two AIGNNs are deployed, one to reconstruct pressure based on the current measurements, and another to predict pressure based on previous measurements. Leakages are detected and localized by comparing the outputs of the reconstructor and the predictor. By pretraining AIGNNs to reason like algorithms, they are expected to extract more task-relevant and generalizable features. Experimental results demonstrate that the proposed algorithm-informed approach achieves superior results with better generalization ability compared to GNNs that do not incorporate algorithmic knowledge.
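
For reference, the classical algorithm that AIGNN is pretrained to emulate is Ford-Fulkerson; a compact BFS-based (Edmonds-Karp) implementation is sketched below. The dense capacity matrix and the small example graph are illustrative.

```python
from collections import deque

def max_flow(capacity, source, sink):
    """Ford-Fulkerson with BFS augmenting paths (Edmonds-Karp).
    capacity is a dense n x n matrix of non-negative edge capacities."""
    n = len(capacity)
    residual = [row[:] for row in capacity]
    flow = 0
    while True:
        # BFS for an augmenting path in the residual graph.
        parent = [-1] * n
        parent[source] = source
        q = deque([source])
        while q and parent[sink] == -1:
            u = q.popleft()
            for v in range(n):
                if parent[v] == -1 and residual[u][v] > 0:
                    parent[v] = u
                    q.append(v)
        if parent[sink] == -1:
            return flow                        # no augmenting path left
        # Find the bottleneck along the path, then push flow along it.
        bottleneck, v = float('inf'), sink
        while v != source:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = sink
        while v != source:
            residual[parent[v]][v] -= bottleneck
            residual[v][parent[v]] += bottleneck
            v = parent[v]
        flow += bottleneck

cap = [[0, 3, 2, 0],
       [0, 0, 1, 2],
       [0, 0, 0, 3],
       [0, 0, 0, 0]]
print(max_flow(cap, source=0, sink=3))         # 5
```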

[LG-51] 4D-Var using Hessian approximation and backpropagation applied to automatically-differentiable numerical and machine learning models

链接: https://arxiv.org/abs/2408.02767
作者: Kylen Solvik,Stephen G. Penny,Stephan Hoyer
关键词-EN: software-based tangent linear, supports automatic differentiation, tangent linear model, automatic differentiation, difficult to implement
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS); Geophysics (physics.geo-ph)
*备注: 24 pages, 7 figures

点击查看摘要

Abstract:Constraining a numerical weather prediction (NWP) model with observations via 4D variational (4D-Var) data assimilation is often difficult to implement in practice due to the need to develop and maintain a software-based tangent linear model and adjoint model. One of the most common 4D-Var algorithms uses an incremental update procedure, which has been shown to be an approximation of the Gauss-Newton method. Here we demonstrate that when using a forecast model that supports automatic differentiation, an efficient and in some cases more accurate alternative approximation of the Gauss-Newton method can be applied by combining backpropagation of errors with Hessian approximation. This approach can be used with either a conventional numerical model implemented within a software framework that supports automatic differentiation, or a machine learning (ML) based surrogate model. We test the new approach on a variety of Lorenz-96 and quasi-geostrophic models. The results indicate potential for a deeper integration of modeling, data assimilation, and new technologies in a next-generation of operational forecast systems that leverage weather models designed to support automatic differentiation.

[LG-52] ConDL: Detector-Free Dense Image Matching

链接: https://arxiv.org/abs/2408.02766
作者: Monika Kwiatkowski,Simon Matern,Olaf Hellwich
关键词-EN: deep-learning framework designed, dense image correspondences, estimating dense image, introduce a deep-learning, deep-learning framework
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this work, we introduce a deep-learning framework designed for estimating dense image correspondences. Our fully convolutional model generates dense feature maps for images, where each pixel is associated with a descriptor that can be matched across multiple images. Unlike previous methods, our model is trained on synthetic data that includes significant distortions, such as perspective changes, illumination variations, shadows, and specular highlights. Utilizing contrastive learning, our feature maps achieve greater invariance to these distortions, enabling robust matching. Notably, our method eliminates the need for a keypoint detector, setting it apart from many existing image-matching techniques.

[LG-53] Dimensionality Reduction and Nearest Neighbors for Improving Out-of-Distribution Detection in Medical Image Segmentation

链接: https://arxiv.org/abs/2408.02761
作者: McKell Woodland,Nihil Patel,Austin Castelo,Mais Al Taie,Mohamed Eltaher,Joshua P. Yung,Tucker J. Netherton,Tiffany L. Calderone,Jessica I. Sanchez,Darrel W. Cleere,Ahmed Elsaiey,Nakul Gupta,David Victor,Laura Beretta,Ankit B. Patel,Kristy K. Brock
关键词-EN: Clinically deployed deep, deployed deep learning-based, deep learning-based segmentation, Clinically deployed, learning-based segmentation models
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Expansion of “Dimensionality Reduction for Improving Out-of-Distribution Detection in Medical Image Segmentation” arXiv:2308.03723 . Submitted to the Journal for Machine Learning in Biomedical Imaging. Code available at this https URL

点击查看摘要

Abstract:Clinically deployed deep learning-based segmentation models are known to fail on data outside of their training distributions. While clinicians review the segmentations, these models tend to perform well in most instances, which could exacerbate automation bias. Therefore, detecting out-of-distribution images at inference is critical to warn the clinicians that the model likely failed. This work applied the Mahalanobis distance (MD) post hoc to the bottleneck features of four Swin UNETR and nnU-net models that segmented the liver on T1-weighted magnetic resonance imaging and computed tomography. By reducing the dimensions of the bottleneck features with either principal component analysis or uniform manifold approximation and projection, images the models failed on were detected with high performance and minimal computational load. In addition, this work explored a non-parametric alternative to the MD, a k-th nearest neighbors distance (KNN). KNN drastically improved scalability and performance over MD when both were applied to raw and average-pooled bottleneck features.
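
A minimal numpy rendering of the two scoring routes discussed here (PCA reduction followed by either a Mahalanobis distance or a k-th nearest neighbour distance) is sketched below; the feature dimension, number of components, and k are illustrative choices, not the values used in the study.

```python
import numpy as np

def ood_scores(train_feats, test_feats, n_components=32, k=5):
    """Score test images by distance to the training feature distribution,
    after reducing the bottleneck features with PCA. Higher score = more OOD."""
    mu = train_feats.mean(axis=0)
    Xc = train_feats - mu
    # PCA via SVD of the centred training features.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T
    Ztr, Zte = Xc @ P, (test_feats - mu) @ P
    # Mahalanobis distance in the reduced space.
    cov = np.cov(Ztr, rowvar=False) + 1e-6 * np.eye(n_components)
    inv_cov = np.linalg.inv(cov)
    d = Zte - Ztr.mean(axis=0)
    md = np.sqrt(np.einsum('ij,jk,ik->i', d, inv_cov, d))
    # Non-parametric alternative: distance to the k-th nearest training point.
    dists = np.linalg.norm(Zte[:, None, :] - Ztr[None, :, :], axis=-1)
    knn = np.sort(dists, axis=1)[:, k - 1]
    return md, knn

rng = np.random.default_rng(0)
train = rng.normal(size=(500, 256))                      # in-distribution features
test = np.vstack([rng.normal(size=(20, 256)),            # in-distribution
                  rng.normal(loc=3.0, size=(20, 256))])  # shifted / OOD
md_scores, knn_scores = ood_scores(train, test)
```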

[LG-54] Classification of Raw MEG/EEG Data with Detach-Rocket Ensemble: An Improved ROCKET Algorithm for Multivariate Time Series Analysis

链接: https://arxiv.org/abs/2408.02760
作者: Adrià Solana,Erik Fransén,Gonzalo Uribarri
关键词-EN: Time Series Classification, Multivariate Time Series, acquisition modalities involve, simultaneous time-dependent recording, Random Convolutional Kernel
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Neurons and Cognition (q-bio.NC); Machine Learning (stat.ML)
*备注: To be published in European Conference on Machine Learning and Data Mining 2024, 20 pages, 7 figures, 2 tables

点击查看摘要

Abstract:Multivariate Time Series Classification (MTSC) is a ubiquitous problem in science and engineering, particularly in neuroscience, where most data acquisition modalities involve the simultaneous time-dependent recording of brain activity in multiple brain regions. In recent years, Random Convolutional Kernel models such as ROCKET and MiniRocket have emerged as highly effective time series classification algorithms, capable of achieving state-of-the-art accuracy results with low computational load. Despite their success, these types of models face two major challenges when employed in neuroscience: 1) they struggle to deal with high-dimensional data such as EEG and MEG, and 2) they are difficult to interpret. In this work, we present a novel ROCKET-based algorithm, named Detach-Rocket Ensemble, that is specifically designed to address these two problems in MTSC. Our algorithm leverages pruning to provide an integrated estimation of channel importance, and ensembles to achieve better accuracy and provide a label probability. Using a synthetic multivariate time series classification dataset in which we control the amount of information carried by each of the channels, we first show that our algorithm is able to correctly recover the channel importance for classification. Then, using two real-world datasets, a MEG dataset and an EEG dataset, we show that Detach-Rocket Ensemble is able to provide both interpretable channel relevance and competitive classification accuracy, even when applied directly to the raw brain data, without the need for feature engineering.

[LG-55] A Novel Hybrid Approach for Tornado Prediction in the United States: Kalman-Convolutional BiLSTM with Multi-Head Attention

链接: https://arxiv.org/abs/2408.02751
作者: Jiawei Zhou
关键词-EN: intense atmospheric vortex, atmospheric vortex phenomena, pose significant challenges, Hybrid Scan Reflectivity, intense atmospheric
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Tornadoes are among the most intense atmospheric vortex phenomena and pose significant challenges for detection and forecasting. Conventional methods, which heavily depend on ground-based observations and radar data, are limited by issues such as decreased accuracy over greater distances and a high rate of false positives. To address these challenges, this study utilizes the Seamless Hybrid Scan Reflectivity (SHSR) dataset from the Multi-Radar Multi-Sensor (MRMS) system, which integrates data from multiple radar sources to enhance accuracy. A novel hybrid model, the Kalman-Convolutional BiLSTM with Multi-Head Attention, is introduced to improve dynamic state estimation and capture both spatial and temporal dependencies within the data. This model demonstrates superior performance in precision, recall, F1-Score, and accuracy compared to methods such as K-Nearest Neighbors (KNN) and LightGBM. The results highlight the considerable potential of advanced machine learning techniques to improve tornado prediction and reduce false alarm rates. Future research will focus on expanding datasets, exploring innovative model architectures, and incorporating large language models (LLMs) to provide deeper insights. This research introduces a novel model for tornado prediction, offering a robust framework for enhancing forecasting accuracy and public safety.

[LG-56] MDM: Advancing Multi-Domain Distribution Matching for Automatic Modulation Recognition Dataset Synthesis

链接: https://arxiv.org/abs/2408.02714
作者: Dongwei Xu,Jiajun Chen,Yao Lu,Tianhao Xia,Qi Xuan,Wei Wang,Yun Lin,Xiaoniu Yang
关键词-EN: Automatic Modulation Recognition, Modulation Recognition, Automatic Modulation, introduced into Automatic, deep learning technology
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recently, deep learning technology has been successfully introduced into Automatic Modulation Recognition (AMR) tasks. However, the success of deep learning is all attributed to the training on large-scale datasets. Such a large amount of data brings huge pressure on storage, transmission and model training. In order to solve the problem of large amount of data, some researchers put forward the method of data distillation, which aims to compress large training data into smaller synthetic datasets to maintain its performance. While numerous data distillation techniques have been developed within the realm of image processing, the unique characteristics of signals set them apart. Signals exhibit distinct features across various domains, necessitating specialized approaches for their analysis and processing. To this end, a novel dataset distillation method–Multi-domain Distribution Matching (MDM) is proposed. MDM employs the Discrete Fourier Transform (DFT) to translate time-domain signals into the frequency domain, and then uses a model to compute distribution matching losses between the synthetic and real datasets, considering both the time and frequency domains. Ultimately, these two losses are integrated to update the synthetic dataset. We conduct extensive experiments on three AMR datasets. Experimental results show that, compared with baseline methods, our method achieves better performance under the same compression ratio. Furthermore, we conduct cross-architecture generalization experiments on several models, and the experimental results show that our synthetic datasets can generalize well on other unseen models.
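
As a rough, illustrative sketch of the two-domain matching idea (not the MDM release): real and synthetic signal batches are compared both in the time domain and on their DFT magnitudes, and the synthetic set is updated by gradient descent. The random-feature `embed` function and the toy sinusoidal data are stand-ins for the model and datasets used in the paper.

```python
import torch

torch.manual_seed(0)
T = 128
W = torch.randn(T, 32) / T ** 0.5                      # stand-in random feature extractor

def embed(x):                                          # MDM would use a network here
    return torch.tanh(x @ W)

real = torch.sin(torch.linspace(0, 12.56, T)).repeat(256, 1) + 0.1 * torch.randn(256, T)
syn = torch.randn(32, T, requires_grad=True)           # small synthetic set being distilled
opt = torch.optim.Adam([syn], lr=0.05)

real_time = embed(real).mean(0)                        # real-batch statistics are fixed
real_freq = torch.fft.rfft(real, dim=1).abs().mean(0)

for step in range(200):
    opt.zero_grad()
    loss_time = (embed(syn).mean(0) - real_time).pow(2).sum()                         # time domain
    loss_freq = (torch.fft.rfft(syn, dim=1).abs().mean(0) - real_freq).pow(2).sum()   # DFT magnitudes
    loss = loss_time + loss_freq
    loss.backward()
    opt.step()

print(f"final combined matching loss: {loss.item():.4f}")
```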

[LG-57] Text Conditioned Symbolic Drumbeat Generation using Latent Diffusion Models

链接: https://arxiv.org/abs/2408.02711
作者: Pushkar Jajoria,James McDermott
关键词-EN: Latent Diffusion Models, Diffusion Models, study introduces, introduces a text-conditioned, text-conditioned approach
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:This study introduces a text-conditioned approach to generating drumbeats with Latent Diffusion Models (LDMs). It uses informative conditioning text extracted from training data filenames. By pretraining a text and drumbeat encoder through contrastive learning within a multimodal network, aligned following CLIP, we align the modalities of text and music closely. Additionally, we examine an alternative text encoder based on multihot text encodings. Inspired by music's multi-resolution nature, we propose a novel LSTM variant, MultiResolutionLSTM, designed to operate at various resolutions independently. In common with recent LDMs in the image space, it speeds up the generation process by running diffusion in a latent space provided by a pretrained unconditional autoencoder. We demonstrate the originality and variety of the generated drumbeats by measuring distance (both over binary pianorolls and in the latent space) versus the training dataset and among the generated drumbeats. We also assess the generated drumbeats through a listening test focused on questions of quality, aptness for the prompt text, and novelty. We show that the generated drumbeats are novel and apt to the prompt text, and comparable in quality to those created by human musicians.

[LG-58] RCDM: Enabling Robustness for Conditional Diffusion Model

链接: https://arxiv.org/abs/2408.02710
作者: Weifeng Xu,Xiang Zhu,Xiaoyong Li
关键词-EN: conditional diffusion model, standard diffusion model, Robust Conditional Diffusion, diffusion model, conditional diffusion
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The conditional diffusion model (CDM) enhances the standard diffusion model by providing more control, improving the quality and relevance of the outputs, and making the model adaptable to a wider range of complex tasks. However, inaccurate conditional inputs in the inverse process of CDM can easily lead to generating fixed errors in the neural network, which diminishes the adaptability of a well-trained model. The existing methods like data augmentation, adversarial training, robust optimization can improve the robustness, while they often face challenges such as high computational complexity, limited applicability to unknown perturbations, and increased training difficulty. In this paper, we propose a lightweight solution, the Robust Conditional Diffusion Model (RCDM), based on control theory to dynamically reduce the impact of noise and significantly enhance the model’s robustness. RCDM leverages the collaborative interaction between two neural networks, along with optimal control strategies derived from control theory, to optimize the weights of two networks during the sampling process. Unlike conventional techniques, RCDM establishes a mathematical relationship between fixed errors and the weights of the two neural networks without incurring additional computational overhead. Extensive experiments were conducted on MNIST and CIFAR-10 datasets, and the results demonstrate the effectiveness and adaptability of our proposed model.

[LG-59] SnapE – Training Snapshot Ensembles of Link Prediction Models ISWC

链接: https://arxiv.org/abs/2408.02707
作者: Ali Shaban,Heiko Paulheim
关键词-EN: Snapshot ensembles, prediction, models, Snapshot, Abstract
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted at International Semantic Web Conference (ISWC) 2024

点击查看摘要

Abstract:Snapshot ensembles have been widely used in various fields of prediction. They allow for training an ensemble of prediction models at the cost of training a single one. They are known to yield more robust predictions by creating a set of diverse base models. In this paper, we introduce an approach to transfer the idea of snapshot ensembles to link prediction models in knowledge graphs. Moreover, since link prediction in knowledge graphs is a setup without explicit negative examples, we propose a novel training loop that iteratively creates negative examples using previous snapshot models. An evaluation with four base models across four datasets shows that this approach constantly outperforms the single model approach, while keeping the training time constant.
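
The snapshot-ensemble mechanism itself is generic and easy to reproduce. Below is a minimal PyTorch sketch for an ordinary classifier (not the SnapE link-prediction code, and without the paper's iterative negative-example loop): the learning rate is cosine-annealed and restarted each cycle, and the weights at the end of every cycle become one ensemble member.

```python
import copy, math, torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(512, 10)
y = (X[:, 0] + X[:, 1] > 0).long()

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
loss_fn = nn.CrossEntropyLoss()
cycles, epochs_per_cycle, lr_max = 5, 30, 0.05
snapshots = []

for c in range(cycles):
    opt = torch.optim.SGD(model.parameters(), lr=lr_max)
    for e in range(epochs_per_cycle):
        # cosine annealing within the cycle, restarted at lr_max at each cycle boundary
        lr = 0.5 * lr_max * (1 + math.cos(math.pi * e / epochs_per_cycle))
        for g in opt.param_groups:
            g["lr"] = lr
        opt.zero_grad()
        loss_fn(model(X), y).backward()
        opt.step()
    snapshots.append(copy.deepcopy(model).eval())      # one ensemble member per cycle

# Ensemble prediction: average the snapshot probabilities.
with torch.no_grad():
    probs = torch.stack([s(X).softmax(dim=1) for s in snapshots]).mean(0)
print("ensemble accuracy:", (probs.argmax(1) == y).float().mean().item())
```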

[LG-60] Bayesian Kolmogorov Arnold Networks (Bayesian_KANs): A Probabilistic Approach to Enhance Accuracy and Interpretability

链接: https://arxiv.org/abs/2408.02706
作者: Masoud Muhammed Hassan
关键词-EN: strong predictive skills, Kolmogorov Arnold Networks, Bayesian Kolmogorov Arnold, deep learning models, deep learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Because of its strong predictive skills, deep learning has emerged as an essential tool in many industries, including healthcare. Traditional deep learning models, on the other hand, frequently lack interpretability and omit to take prediction uncertainty into account, two crucial components of clinical decision making. In order to produce explainable and uncertainty aware predictions, this study presents a novel framework called Bayesian Kolmogorov Arnold Networks (BKANs), which combines the expressive capacity of Kolmogorov Arnold Networks with Bayesian inference. We employ BKANs on two medical datasets, which are widely used benchmarks for assessing machine learning models in medical diagnostics: the Pima Indians Diabetes dataset and the Cleveland Heart Disease dataset. Our method provides useful insights into prediction confidence and decision boundaries and outperforms traditional deep learning models in terms of prediction accuracy. Moreover, BKANs' capacity to represent aleatoric and epistemic uncertainty guarantees doctors receive more solid and trustworthy decision support. Our Bayesian strategy improves the interpretability of the model and considerably minimises overfitting, which is important for tiny and imbalanced medical datasets, according to experimental results. We present possible expansions to further use BKANs in more complicated multimodal datasets and address the significance of these discoveries for future research in building reliable AI systems for healthcare. This work paves the way for a new paradigm in deep learning model deployment in vital sectors where transparency and reliability are crucial.

[LG-61] PSNE: Efficient Spectral Sparsification Algorithms for Scaling Network Embedding

链接: https://arxiv.org/abs/2408.02705
作者: Longlong Lin,Yunfeng Yu,Zihao Wang,Zeli Wang,Yuying Zhao,Jin Zhao,Tao Jia
关键词-EN: numerous practical applications, received extensive attention, PPR matrix, continuous dense vector, dense vector space
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Network embedding has numerous practical applications and has received extensive attention in graph learning, which aims at mapping vertices into a low-dimensional and continuous dense vector space by preserving the underlying structural properties of the graph. Many network embedding methods have been proposed, among which factorization of the Personalized PageRank (PPR for short) matrix has been empirically and theoretically well supported recently. However, several fundamental issues cannot be addressed. (1) Existing methods invoke a seminal Local Push subroutine to approximate a single row or column of the PPR matrix. Thus, they have to execute $n$ ($n$ is the number of nodes) Local Push subroutines to obtain a provable PPR matrix, resulting in prohibitively high computational costs for large $n$. (2) The PPR matrix has limited power in capturing the structural similarity between vertices, leading to performance degradation. To overcome these dilemmas, we propose PSNE, an efficient spectral sparsification method for scaling network embedding, which can fast obtain the embedding vectors that retain strong structural similarities. Specifically, PSNE first designs a matrix polynomial sparser to accelerate the calculation of the PPR matrix, which has a theoretical guarantee in terms of the Frobenius norm. Subsequently, PSNE proposes a simple but effective multiple-perspective strategy to enhance further the representation power of the obtained approximate PPR matrix. Finally, PSNE applies a randomized singular value decomposition algorithm on the sparse and multiple-perspective PPR matrix to get the target embedding vectors. Experimental evaluation of real-world and synthetic datasets shows that our solutions are indeed more efficient, effective, and scalable compared with ten competitors.
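
For orientation, the sketch below follows the generic PPR-factorization recipe that PSNE builds on: a truncated matrix polynomial approximates the PPR matrix and a randomized SVD of its (log-scaled) entries yields node embeddings. It is not the PSNE implementation; the sparsifier with Frobenius-norm guarantees and the multiple-perspective strategy described above are not reproduced here.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.utils.extmath import randomized_svd

rng = np.random.default_rng(0)
n, alpha, K, dim = 500, 0.15, 10, 32
A = sp.random(n, n, density=0.02, random_state=0, format="csr")
A = A + A.T
A.data[:] = 1.0                                        # symmetric, unweighted toy graph
deg = np.asarray(A.sum(axis=1)).ravel() + 1e-12
P = sp.diags(1.0 / deg) @ A                            # row-stochastic transition matrix

# Truncated polynomial: PPR ~= alpha * sum_{k=0..K} (1 - alpha)^k * P^k
ppr = alpha * sp.identity(n, format="csr")
Pk = sp.identity(n, format="csr")
for k in range(1, K + 1):
    Pk = Pk @ P
    ppr = ppr + alpha * (1 - alpha) ** k * Pk

# Log-scaled entries (as in common PPR-factorization embeddings), then randomized SVD.
M = np.log1p(ppr.toarray() * n)
U, S, _ = randomized_svd(M, n_components=dim, random_state=0)
emb = U * np.sqrt(S)
print("embedding matrix:", emb.shape)
```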

[LG-62] Spatial-temporal Graph Convolutional Networks with Diversified Transformation for Dynamic Graph Representation Learning

链接: https://arxiv.org/abs/2408.02704
作者: Ling Wang,Yixiang Huang,Hao Wu
关键词-EN: describe evolving interactions, real-world applications, describe evolving, evolving interactions, interactions between nodes
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 8 pages, 1 figure

点击查看摘要

Abstract:Dynamic graphs (DG) are often used to describe evolving interactions between nodes in real-world applications. Temporal patterns are a natural feature of DGs and are also key to representation learning. However, existing dynamic GCN models are mostly composed of static GCNs and sequence modules, which results in the separation of spatiotemporal information and cannot effectively capture complex temporal patterns in DGs. To address this problem, this study proposes a spatial-temporal graph convolutional networks with diversified transformation (STGCNDT), which includes three aspects: a) constructing a unified graph tensor convolutional network (GTCN) using tensor M-products without the need to represent spatiotemporal information separately; b) introducing three transformation schemes in GTCN to model complex temporal patterns to aggregate temporal information; and c) constructing an ensemble of diversified transformation schemes to obtain higher representation capabilities. Empirical studies on four DGs that appear in communication networks show that the proposed STGCNDT significantly outperforms state-of-the-art models in solving link weight estimation tasks due to the diversified transformations.

[LG-63] DeepNetBeam: A Framework for the Analysis of Functionally Graded Porous Beams

链接: https://arxiv.org/abs/2408.02698
作者: Mohammad Sadegh Eshaghi,Mostafa Bamdad,Cosmin Anitescu,Yizheng Wang,Xiaoying Zhuang,Timon Rabczuk
关键词-EN: Scientific Machine Learning, Machine Learning, Scientific Machine, investigates different Scientific, Neural Operator methods
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This study investigates different Scientific Machine Learning (SciML) approaches for the analysis of functionally graded (FG) porous beams and compares them under a new framework. The beam material properties are assumed to vary as an arbitrary continuous function. The methods consider the output of a neural network/operator as an approximation to the displacement fields and derive the equations governing beam behavior based on the continuum formulation. The methods are implemented in the framework and formulated by three approaches: (a) the vector approach leads to a Physics-Informed Neural Network (PINN), (b) the energy approach brings about the Deep Energy Method (DEM), and (c) the data-driven approach, which results in a class of Neural Operator methods. Finally, a neural operator has been trained to predict the response of the porous beam with functionally graded material under any porosity distribution pattern and any arbitrary traction condition. The results are validated with analytical and numerical reference solutions. The data and code accompanying this manuscript will be publicly available at this https URL.

[LG-64] Why Rectified Power Unit Networks Fail and How to Improve It: An Effective Theory Perspective

链接: https://arxiv.org/abs/2408.02697
作者: Taeyoung Kim,Myungjoo Kang
关键词-EN: Rectified Power Unit, Rectified Linear Unit, Power Unit, Linear Unit, Rectified Power
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 25 pages, 8 figures

点击查看摘要

Abstract:The Rectified Power Unit (RePU) activation functions, unlike the Rectified Linear Unit (ReLU), have the advantage of being differentiable when constructing neural networks. However, it can be experimentally observed that, when deep layers are stacked, neural networks constructed with RePU encounter critical issues. These issues include the values exploding or vanishing and failure of training, and they occur regardless of the hyperparameter initialization. From the perspective of effective theory, we aim to identify the causes of this phenomenon and propose a new activation function that retains the advantages of RePU while overcoming its drawbacks.
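
The instability is easy to reproduce numerically. The snippet below (our own toy experiment, not from the paper) stacks randomly initialized layers and tracks the RMS of the activations: with RePU (p = 2) the scale drifts by many orders of magnitude within a few layers, while ReLU stays roughly constant.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_rms(act, depth=8, width=256):
    x = rng.normal(size=(128, width))
    for _ in range(depth):
        W = rng.normal(size=(width, width)) * np.sqrt(2.0 / width)   # He-style init
        x = act(x @ W)
    return np.sqrt((x ** 2).mean())                                  # RMS of final activations

relu = lambda z: np.maximum(z, 0.0)
repu = lambda z: np.maximum(z, 0.0) ** 2        # RePU with p = 2

print("ReLU layer-8 RMS:", forward_rms(relu))   # stays O(1)
print("RePU layer-8 RMS:", forward_rms(repu))   # blows up by dozens of orders of magnitude
```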

[LG-65] Distribution-Level Memory Recall for Continual Learning: Preserving Knowledge and Avoiding Confusion

链接: https://arxiv.org/abs/2408.02695
作者: Shaoxu Cheng,Kanglei Geng,Chiyuan He,Zihuan Qiu,Linfeng Xu,Heqian Qiu,Lanxiao Wang,Qingbo Wu,Fanman Meng,Hongliang Li
关键词-EN: Deep Neural Networks, enable Deep Neural, Neural Networks, Deep Neural, forgetting previously learned
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Continual Learning (CL) aims to enable Deep Neural Networks (DNNs) to learn new data without forgetting previously learned knowledge. The key to achieving this goal is to avoid confusion at the feature level, i.e., avoiding confusion within old tasks and between new and old tasks. Previous prototype-based CL methods generate pseudo features for old knowledge replay by adding Gaussian noise to the centroids of old classes. However, the distribution in the feature space exhibits anisotropy during the incremental process, which prevents the pseudo features from faithfully reproducing the distribution of old knowledge in the feature space, leading to confusion in classification boundaries within old tasks. To address this issue, we propose the Distribution-Level Memory Recall (DMR) method, which uses a Gaussian mixture model to precisely fit the feature distribution of old knowledge at the distribution level and generate pseudo features in the next stage. Furthermore, resistance to confusion at the distribution level is also crucial for multimodal learning, as the problem of multimodal imbalance results in significant differences in feature responses between different modalities, exacerbating confusion within old tasks in prototype-based CL methods. Therefore, we mitigate the multi-modal imbalance problem by using the Inter-modal Guidance and Intra-modal Mining (IGIM) method to guide weaker modalities with prior information from dominant modalities and further explore useful information within modalities. For the second key, we propose the Confusion Index to quantitatively describe a model's ability to distinguish between new and old tasks, and we use the Incremental Mixup Feature Enhancement (IMFE) method to enhance pseudo features with new sample features, alleviating classification confusion between new and old knowledge.
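
The core replay idea, fitting a Gaussian mixture to stored old-class features and sampling pseudo features from it rather than adding isotropic noise to a centroid, can be sketched in a few lines with scikit-learn. This is a toy illustration on synthetic 2-D features, not the DMR/IGIM/IMFE pipeline.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Anisotropic, bimodal "old class" features that a single centroid plus noise would miss.
old_feats = np.vstack([
    rng.multivariate_normal([0, 0], [[2.0, 1.5], [1.5, 2.0]], size=400),
    rng.multivariate_normal([6, -3], [[0.3, 0.0], [0.0, 3.0]], size=400),
])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(old_feats)
pseudo_feats, _ = gmm.sample(n_samples=512)    # replayed alongside new-task features

print("fitted component means:\n", np.round(gmm.means_, 2))
print("pseudo feature batch:", pseudo_feats.shape)
```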

[LG-66] KAN based Autoencoders for Factor Models

链接: https://arxiv.org/abs/2408.02694
作者: Tianqi Wang,Shubham Singh
关键词-EN: Inspired by recent, Kolmogorov-Arnold Networks, latent factor conditional, conditional asset pricing, factor conditional asset
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Finance (q-fin.CP)
*备注: 7 pages

点击查看摘要

Abstract:Inspired by recent advances in Kolmogorov-Arnold Networks (KANs), we introduce a novel approach to latent factor conditional asset pricing models. While previous machine learning applications in asset pricing have predominantly used Multilayer Perceptrons with ReLU activation functions to model latent factor exposures, our method introduces a KAN-based autoencoder which surpasses MLP models in both accuracy and interpretability. Our model offers enhanced flexibility in approximating exposures as nonlinear functions of asset characteristics, while simultaneously providing users with an intuitive framework for interpreting latent factors. Empirical backtesting demonstrates our model’s superior ability to explain cross-sectional risk exposures. Moreover, long-short portfolios constructed using our model’s predictions achieve higher Sharpe ratios, highlighting its practical value in investment management.

[LG-67] Attention is all you need for an improved CNN-based flash flood susceptibility modeling. The case of the ungauged Rheraya watershed Morocco

链接: https://arxiv.org/abs/2408.02692
作者: Akram Elghouat,Ahmed Algouti,Abdellah Algouti,Soukaina Baid
关键词-EN: Effective flood hazard, Effective flood, flash flood susceptibility, hazard management requires, management requires evaluating
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Effective flood hazard management requires evaluating and predicting flash flood susceptibility. Convolutional neural networks (CNNs) are commonly used for this task but face issues like gradient explosion and overfitting. This study explores the use of an attention mechanism, specifically the convolutional block attention module (CBAM), to enhance CNN models for flash flood susceptibility in the ungauged Rheraya watershed, a flood prone region. We used ResNet18, DenseNet121, and Xception as backbone architectures, integrating CBAM at different locations. Our dataset included 16 conditioning factors and 522 flash flood inventory points. Performance was evaluated using accuracy, precision, recall, F1-score, and the area under the curve (AUC) of the receiver operating characteristic (ROC). Results showed that CBAM significantly improved model performance, with DenseNet121 incorporating CBAM in each convolutional block achieving the best results (accuracy = 0.95, AUC = 0.98). Distance to river and drainage density were identified as key factors. These findings demonstrate the effectiveness of the attention mechanism in improving flash flood susceptibility modeling and offer valuable insights for disaster management.
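
For reference, the CBAM block used here follows the standard formulation of Woo et al. (2018): channel attention from a shared MLP over average- and max-pooled descriptors, followed by spatial attention from a 7x7 convolution. The PyTorch sketch below is generic (not the authors' code) and can be dropped after a convolutional block of a ResNet/DenseNet/Xception backbone.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        self.mlp = nn.Sequential(                      # shared MLP for channel attention
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        b, c, _, _ = x.shape
        # Channel attention: shared MLP over global average- and max-pooled descriptors.
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        # Spatial attention: 7x7 conv over channel-wise average and max maps.
        s = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))

feat = torch.randn(2, 64, 32, 32)                      # e.g. a DenseNet block output
print(CBAM(64)(feat).shape)                            # torch.Size([2, 64, 32, 32])
```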

[LG-68] Symmetric Graph Contrastive Learning against Noisy Views for Recommendation

链接: https://arxiv.org/abs/2408.02691
作者: Chu Zhao,Enneng Yang,Yuliang Liang,Jianzhe Zhao,Guibing Guo,Xingwei Wang
关键词-EN: Graph Contrastive Learning, leverages data augmentation, Graph Contrastive, produce contrasting views, data augmentation techniques
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注: 24 pages, submitted to TOIS

点击查看摘要

Abstract:Graph Contrastive Learning (GCL) leverages data augmentation techniques to produce contrasting views, enhancing the accuracy of recommendation systems through learning the consistency between contrastive views. However, existing augmentation methods, such as directly perturbing interaction graph (e.g., node/edge dropout), may interfere with the original connections and generate poor contrasting views, resulting in sub-optimal performance. In this paper, we define the views that share only a small amount of information with the original graph due to poor data augmentation as noisy views (i.e., the last 20% of the views with a cosine similarity value less than 0.1 to the original view). We demonstrate through detailed experiments that noisy views will significantly degrade recommendation performance. Further, we propose a model-agnostic Symmetric Graph Contrastive Learning (SGCL) method with theoretical guarantees to address this issue. Specifically, we introduce symmetry theory into graph contrastive learning, based on which we propose a symmetric form and contrast loss resistant to noisy interference. We provide theoretical proof that our proposed SGCL method has a high tolerance to noisy views. Further demonstration is given by conducting extensive experiments on three real-world datasets. The experimental results demonstrate that our approach substantially increases recommendation accuracy, with relative improvements reaching as high as 12.25% over nine other competing models. These results highlight the efficacy of our method.

[LG-69] Spatio-Temporal Partial Sensing Forecast for Long-term Traffic

链接: https://arxiv.org/abs/2408.02689
作者: Zibo Liu,Zhe Jiang,Zelin Xu,Tingsong Xiao,Zhengkun Xiao,Haibo Wang,Shigang Chen
关键词-EN: recent measurements, installed at chosen, Traffic, locations, partial sensing
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Traffic forecasting uses recent measurements by sensors installed at chosen locations to forecast the future road traffic. Existing work either assumes all locations are equipped with sensors or focuses on short-term forecast. This paper studies partial sensing traffic forecast of long-term traffic, assuming sensors only at some locations. The study is important in lowering the infrastructure investment cost in traffic management since deploying sensors at all locations could incur prohibitively high cost. However, the problem is challenging due to the unknown distribution at unsensed locations, the intricate spatio-temporal correlation in long-term forecasting, as well as noise in data and irregularities in traffic patterns (e.g., road closure). We propose a Spatio-Temporal Partial Sensing (STPS) forecast model for long-term traffic prediction, with several novel contributions, including a rank-based embedding technique to capture irregularities and overcome noise, a spatial transfer matrix to overcome the spatial distribution shift from permanently sensed locations to unsensed locations, and a multi-step training process that utilizes all available data to successively refine the model parameters for better accuracy. Extensive experiments on several real-world traffic datasets demonstrate that STPS outperforms the state-of-the-art and achieves superior accuracy in partial sensing long-term forecasting.

[LG-70] A probabilistic framework for learning non-intrusive corrections to long-time climate simulations from short-time training data

链接: https://arxiv.org/abs/2408.02688
作者: Benedikt Barthel Sorensen,Leonardo Zepeda-Núñez,Ignacio Lopez-Gomez,Zhong Yi Wan,Rob Carver,Fei Sha,Themistoklis Sapsis
关键词-EN: science and engineering, ubiquitous in science, Chaotic systems, simulations, data
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS); Atmospheric and Oceanic Physics (physics.ao-ph); Fluid Dynamics (physics.flu-dyn)
*备注:

点击查看摘要

Abstract:Chaotic systems, such as turbulent flows, are ubiquitous in science and engineering. However, their study remains a challenge due to the large range of scales, and the strong interaction with other, often not fully understood, physics. As a consequence, the spatiotemporal resolution required for accurate simulation of these systems is typically computationally infeasible, particularly for applications of long-term risk assessment, such as the quantification of extreme weather risk due to climate change. While data-driven modeling offers some promise of alleviating these obstacles, the scarcity of high-quality simulations results in limited available data to train such models, which is often compounded by the lack of stability for long-horizon simulations. As such, the computational, algorithmic, and data restrictions generally imply that the probability of rare extreme events is not accurately captured. In this work we present a general strategy for training neural network models to non-intrusively correct under-resolved long-time simulations of chaotic systems. The approach is based on training a post-processing correction operator on under-resolved simulations nudged towards a high-fidelity reference. This enables us to learn the dynamics of the underlying system directly, which allows us to use very little training data, even when the statistics thereof are far from converged. Additionally, through the use of probabilistic network architectures we are able to leverage the uncertainty due to the limited training data to further improve extrapolation capabilities. We apply our framework to severely under-resolved simulations of quasi-geostrophic flow and demonstrate its ability to accurately predict the anisotropic statistics over time horizons more than 30 times longer than the data seen in training.

[LG-71] A Systematic Review of Intermediate Fusion in Multimodal Deep Learning for Biomedical Applications

链接: https://arxiv.org/abs/2408.02686
作者: Valerio Guarrasi,Fatih Aksu,Camillo Maria Caruso,Francesco Di Feola,Aurora Rofena,Filippo Ruffini,Paolo Soda
关键词-EN: Deep learning, handle complex, intermediate fusion methods, high-dimensional data, Multimodal deep learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Deep learning has revolutionized biomedical research by providing sophisticated methods to handle complex, high-dimensional data. Multimodal deep learning (MDL) further enhances this capability by integrating diverse data types such as imaging, textual data, and genetic information, leading to more robust and accurate predictive models. In MDL, differently from early and late fusion methods, intermediate fusion stands out for its ability to effectively combine modality-specific features during the learning process. This systematic review aims to comprehensively analyze and formalize current intermediate fusion methods in biomedical applications. We investigate the techniques employed, the challenges faced, and potential future directions for advancing intermediate fusion methods. Additionally, we introduce a structured notation to enhance the understanding and application of these methods beyond the biomedical domain. Our findings are intended to support researchers, healthcare professionals, and the broader deep learning community in developing more sophisticated and insightful multimodal models. Through this review, we aim to provide a foundational framework for future research and practical applications in the dynamic field of MDL.
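
As a minimal illustration of what "intermediate fusion" means in practice (a generic sketch, not tied to any paper in the review): each modality gets its own encoder, the learned mid-level features are concatenated, and a joint head makes the prediction; early fusion would concatenate raw inputs and late fusion would average per-modality predictions instead.

```python
import torch
import torch.nn as nn

class IntermediateFusionNet(nn.Module):
    def __init__(self, img_dim=512, tab_dim=20, hidden=64, n_classes=2):
        super().__init__()
        self.img_enc = nn.Sequential(nn.Linear(img_dim, hidden), nn.ReLU())   # e.g. imaging features
        self.tab_enc = nn.Sequential(nn.Linear(tab_dim, hidden), nn.ReLU())   # e.g. clinical variables
        self.head = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, n_classes))

    def forward(self, img_feat, tab_feat):
        # Fuse at the level of learned intermediate representations, not raw inputs or outputs.
        z = torch.cat([self.img_enc(img_feat), self.tab_enc(tab_feat)], dim=1)
        return self.head(z)

model = IntermediateFusionNet()
logits = model(torch.randn(8, 512), torch.randn(8, 20))
print(logits.shape)    # torch.Size([8, 2])
```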

[LG-72] Artificial Neural Networks for Photonic Applications: From Algorithms to Implementation

链接: https://arxiv.org/abs/2408.02685
作者: Pedro Freire,Egor Manuylovich,Jaroslaw E. Prilepsky,Sergei K. Turitsyn
关键词-EN: artificial neural networks, neural networks, broad audience, applied mathematics, targets a broad
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP); Optics (physics.optics)
*备注:

点击查看摘要

Abstract:This tutorial-review on applications of artificial neural networks in photonics targets a broad audience, ranging from optical research and engineering communities to computer science and applied mathematics. We focus here on the research areas at the interface between these disciplines, attempting to find the right balance between technical details specific to each domain and overall clarity. First, we briefly recall key properties and peculiarities of some core neural network types, which we believe are the most relevant to photonics, also linking the layer’s theoretical design to some photonics hardware realizations. After that, we elucidate the question of how to fine-tune the selected model’s design to perform the required task with optimized accuracy. Then, in the review part, we discuss recent developments and progress for several selected applications of neural networks in photonics, including multiple aspects relevant to optical communications, imaging, sensing, and the design of new materials and lasers. In the following section, we put a special emphasis on how to accurately evaluate the complexity of neural networks in the context of the transition from algorithms to hardware implementation. The introduced complexity characteristics are used to analyze the applications of neural networks in optical communications, as a specific, albeit highly important example, comparing those with some benchmark signal processing methods. We combine the description of the well-known model compression strategies used in machine learning, with some novel techniques introduced recently in optical applications of neural networks. It is important to stress that although our focus in this tutorial-review is on photonics, we believe that the methods and techniques presented here can be handy in a much wider range of scientific and engineering applications.

[LG-73] Open Set Recognition for Random Forest

链接: https://arxiv.org/abs/2408.02684
作者: Guanchao Feng,Dhruv Desai,Stefano Pasquali,Dhagash Mehta
关键词-EN: incomplete knowledge, changing regimes, collect training, difficult to collect, open-set recognition
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:In many real-world classification or recognition tasks, it is often difficult to collect training examples that exhaust all possible classes due to, for example, incomplete knowledge during training or ever changing regimes. Therefore, samples from unknown/novel classes may be encountered in testing/deployment. In such scenarios, the classifiers should be able to i) perform classification on known classes, and at the same time, ii) identify samples from unknown classes. This is known as open-set recognition. Although random forest has been an extremely successful framework as a general-purpose classification (and regression) method, in practice, it usually operates under the closed-set assumption and is not able to identify samples from new classes when run out of the box. In this work, we propose a novel approach to enabling open-set recognition capability for random forest classifiers by incorporating distance metric learning and distance-based open-set recognition. The proposed method is validated on both synthetic and real-world datasets. The experimental results indicate that the proposed approach outperforms state-of-the-art distance-based open-set recognition methods.
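
A simple way to see the problem, and a baseline fix: train a random forest on the known classes, but reject a test point as "unknown" when it lies far from all training data. The sketch below uses a plain Euclidean nearest-neighbor distance with a quantile threshold; the paper instead learns a distance metric and uses forest-based distances, so treat this only as an illustrative baseline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
known_X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(5, 1, (200, 2))])
known_y = np.array([0] * 200 + [1] * 200)
novel_X = rng.normal([-6, 6], 1, (50, 2))               # a class never seen in training

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(known_X, known_y)
knn = NearestNeighbors(n_neighbors=5).fit(known_X)
tau = np.quantile(knn.kneighbors(known_X)[0].mean(axis=1), 0.99)  # distance threshold

def open_set_predict(X):
    dist = knn.kneighbors(X)[0].mean(axis=1)
    pred = rf.predict(X)
    return np.where(dist > tau, -1, pred)               # -1 means "unknown class"

print("novel points flagged unknown:", (open_set_predict(novel_X) == -1).mean())
```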

[LG-74] Improving Machine Learning Based Sepsis Diagnosis Using Heart Rate Variability

链接: https://arxiv.org/abs/2408.02683
作者: Sai Balaji,Christopher Sun,Anaiy Somalwar
关键词-EN: enhancing patient outcomes, patient outcomes, early and accurate, enhancing patient, Random Forest classifiers
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The early and accurate diagnosis of sepsis is critical for enhancing patient outcomes. This study aims to use heart rate variability (HRV) features to develop an effective predictive model for sepsis detection. Critical HRV features are identified through feature engineering methods, including statistical bootstrapping and the Boruta algorithm, after which XGBoost and Random Forest classifiers are trained with differential hyperparameter settings. In addition, ensemble models are constructed to pool the prediction probabilities of high-recall and high-precision classifiers and improve model performance. Finally, a neural network model is trained on the HRV features, achieving an F1 score of 0.805, a precision of 0.851, and a recall of 0.763. The best-performing machine learning model is compared to this neural network through an interpretability analysis, where Local Interpretable Model-agnostic Explanations are implemented to determine decision-making criterion based on numerical ranges and thresholds for specific features. This study not only highlights the efficacy of HRV in automated sepsis diagnosis but also increases the transparency of black box outputs, maximizing clinical applicability.

[LG-75] Recording First-person Experiences to Build a New Type of Foundation Model

链接: https://arxiv.org/abs/2408.02680
作者: Dionis Barcari,David Gamez,Aliya Grig
关键词-EN: Foundation models, current AI boom, big impact, impact in recent, recent years
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: 5 pages, 5 figures, 3 tables. arXiv admin note: substantial text overlap with arXiv:2408.00030

点击查看摘要

Abstract:Foundation models have had a big impact in recent years and billions of dollars are being invested in them in the current AI boom. The more popular ones, such as Chat-GPT, are trained on large amounts of Internet data. However, it is becoming apparent that this data is likely to be exhausted soon, and technology companies are looking for new sources of data to train the next generation of foundation models. Reinforcement learning, RAG, prompt engineering and cognitive modelling are often used to fine-tune and augment the behaviour of foundation models. These techniques have been used to replicate people, such as Caryn Marjorie. These chatbots are not based on people's actual emotional and physiological responses to their environment, so they are, at best, a surface-level approximation to the characters they are imitating. To address these issues, we have developed a recording rig that captures what the wearer is seeing and hearing as well as their skin conductance (GSR), facial expression and brain state (14 channel EEG). AI algorithms are used to process this data into a rich picture of the environment and internal states of the subject. Foundation models trained on this data could replicate human behaviour much more accurately than the personality models that have been developed so far. This type of model has many potential applications, including recommendation, personal assistance, GAN systems, dating and recruitment. This paper gives some background to this work and describes the recording rig and preliminary tests of its functionality. It then suggests how a new type of foundation model could be created from the data captured by the rig and outlines some applications. Data gathering and model training are expensive, so we are currently working on the launch of a start-up that could raise funds for the next stage of the project.

[LG-76] Visual Analysis of Multi-outcome Causal Graphs

链接: https://arxiv.org/abs/2408.02679
作者: Mengjie Fan,Jinlu Yu,Daniel Weiskopf,Nan Cao,Huai-Yu Wang,Liang Zhou
关键词-EN: multi-outcome causal graphs, causal graphs, multi-outcome causal, causal, graphs
类目: Machine Learning (cs.LG); Graphics (cs.GR); Human-Computer Interaction (cs.HC); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:We introduce a visual analysis method for multiple causal graphs with different outcome variables, namely, multi-outcome causal graphs. Multi-outcome causal graphs are important in healthcare for understanding multimorbidity and comorbidity. To support the visual analysis, we collaborated with medical experts to devise two comparative visualization techniques at different stages of the analysis process. First, a progressive visualization method is proposed for comparing multiple state-of-the-art causal discovery algorithms. The method can handle mixed-type datasets comprising both continuous and categorical variables and assist in the creation of a fine-tuned causal graph of a single outcome. Second, a comparative graph layout technique and specialized visual encodings are devised for the quick comparison of multiple causal graphs. In our visual analysis approach, analysts start by building individual causal graphs for each outcome variable, and then, multi-outcome causal graphs are generated and visualized with our comparative technique for analyzing differences and commonalities of these causal graphs. Evaluation includes quantitative measurements on benchmark datasets, a case study with a medical expert, and expert user studies with real-world health research data.

[LG-77] Convergence rates of stochastic gradient method with independent sequences of step-size and momentum weight

链接: https://arxiv.org/abs/2408.02678
作者: Wen-Liang Hwang
关键词-EN: navigate ravines efficiently, stochastic sub-gradient method, momentum weight, large-scale learning algorithms, local minimum
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:In large-scale learning algorithms, the momentum term is usually included in the stochastic sub-gradient method to improve the learning speed because it can navigate ravines efficiently to reach a local minimum. However, step-size and momentum weight hyper-parameters must be appropriately tuned to optimize convergence. We thus analyze the convergence rate using stochastic programming with Polyak's acceleration of two commonly used step-size learning rates: "diminishing-to-zero" and "constant-and-drop" (where the sequence is divided into stages and a constant step-size is applied at each stage) under strongly convex functions over a compact convex set with bounded sub-gradients. For the former, we show that the convergence rate can be written as a product of exponential in step-size and polynomial in momentum weight. Our analysis justifies the convergence of using the default momentum weight setting and the diminishing-to-zero step-size sequence in large-scale machine learning software. For the latter, we present the condition for the momentum weight sequence to converge at each stage.
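
To fix notation for readers (our own illustrative formulation, not equations copied from the paper), the projected stochastic heavy-ball update and examples of the two step-size regimes can be written as

```latex
% Projected stochastic heavy-ball (Polyak) update over the compact convex set X,
% with step-size alpha_k and momentum weight beta_k; g_k is a stochastic sub-gradient.
\[
  x_{k+1} = \Pi_{\mathcal{X}}\bigl( x_k - \alpha_k\, g_k + \beta_k (x_k - x_{k-1}) \bigr),
  \qquad g_k \in \partial f(x_k; \xi_k).
\]
% Example instances of the two step-size regimes discussed in the abstract:
\[
  \text{diminishing-to-zero: } \alpha_k = \frac{\alpha_0}{k+1},
  \qquad
  \text{constant-and-drop: } \alpha_k = \alpha^{(s)} \text{ for all } k \text{ in stage } s.
\]
```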

[LG-78] Patient-centered data science: an integrative framework for evaluating and predicting clinical outcomes in the digital health era

链接: https://arxiv.org/abs/2408.02677
作者: Mohsen Amoei,Dan Poenaru
关键词-EN: patient-centered data science, digital health era, study proposes, health era, patient-centered data
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:This study proposes a novel, integrative framework for patient-centered data science in the digital health era. We developed a multidimensional model that combines traditional clinical data with patient-reported outcomes, social determinants of health, and multi-omic data to create comprehensive digital patient representations. Our framework employs a multi-agent artificial intelligence approach, utilizing various machine learning techniques including large language models, to analyze complex, longitudinal datasets. The model aims to optimize multiple patient outcomes simultaneously while addressing biases and ensuring generalizability. We demonstrate how this framework can be implemented to create a learning healthcare system that continuously refines strategies for optimal patient care. This approach has the potential to significantly improve the translation of digital health innovations into real-world clinical benefits, addressing current limitations in AI-driven healthcare models.

[LG-79] On Biases in a UK Biobank-based Retinal Image Classification Model MICCAI

链接: https://arxiv.org/abs/2408.02676
作者: Anissa Alloula,Rima Mustafa,Daniel R McGowan,Bartłomiej W. Papież
关键词-EN: Recent work, uncovered alarming disparities, machine learning models, work has uncovered, uncovered alarming
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Image and Video Processing (eess.IV)
*备注: To appear at MICCAI FAIMI Workshop 2024

点击查看摘要

Abstract:Recent work has uncovered alarming disparities in the performance of machine learning models in healthcare. In this study, we explore whether such disparities are present in the UK Biobank fundus retinal images by training and evaluating a disease classification model on these images. We assess possible disparities across various population groups and find substantial differences despite strong overall performance of the model. In particular, we discover unfair performance for certain assessment centres, which is surprising given the rigorous data standardisation protocol. We compare how these differences emerge and apply a range of existing bias mitigation methods to each one. A key insight is that each disparity has unique properties and responds differently to the mitigation methods. We also find that these methods are largely unable to enhance fairness, highlighting the need for better bias mitigation methods tailored to the specific type of bias.

[LG-80] Solving the Wide-band Inverse Scattering Problem via Equivariant Neural Networks

链接: https://arxiv.org/abs/2212.06068
作者: Borong Zhang,Leonardo Zepeda-Núñez,Qin Li
关键词-EN: inverse scattering problem, deep neural network, expensive optimization loop, neural network architecture, inverse map
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注: 21 pages, 9 figures, and 4 tables

点击查看摘要

Abstract:This paper introduces a novel deep neural network architecture for solving the inverse scattering problem in frequency domain with wide-band data, by directly approximating the inverse map, thus avoiding the expensive optimization loop of classical methods. The architecture is motivated by the filtered back-projection formula in the full aperture regime and with homogeneous background, and it leverages the underlying equivariance of the problem and compressibility of the integral operator. This drastically reduces the number of training parameters, and therefore the computational and sample complexity of the method. In particular, we obtain an architecture whose number of parameters scale sub-linearly with respect to the dimension of the inputs, while its inference complexity scales super-linearly but with very small constants. We provide several numerical tests that show that the current approach results in better reconstruction than optimization-based techniques such as full-waveform inversion, but at a fraction of the cost while being competitive with state-of-the-art machine learning methods.

[LG-81] Hedge Fund Portfolio Construction Using PolyModel Theory and iTransformer

链接: https://arxiv.org/abs/2408.03320
作者: Siqiao Zhao,Zhikang Dong,Zeyu Cao,Raphael Douady
关键词-EN: machine learning methods, apply machine learning, Polymodel theory, making it challenging, key problem
类目: Portfolio Management (q-fin.PM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:When constructing portfolios, a key problem is that a lot of financial time series data are sparse, making it challenging to apply machine learning methods. PolyModel theory can solve this issue and demonstrate superiority in portfolio construction from various aspects. To implement the PolyModel theory for constructing a hedge fund portfolio, we begin by identifying an asset pool, utilizing over 10,000 hedge funds for the past 29 years' data. PolyModel theory also involves choosing a wide-ranging set of risk factors, which includes various financial indices, currencies, and commodity prices. This comprehensive selection mirrors the complexities of the real-world environment. Leveraging the PolyModel theory, we create quantitative measures such as Long-term Alpha, Long-term Ratio, and SVaR. We also use more classical measures like the Sharpe ratio or Morningstar's MRAR. To enhance the performance of the constructed portfolio, we also employ the latest deep learning techniques (iTransformer) to capture the upward trend, while efficiently controlling the downside, using all the features. The iTransformer model is specifically designed to address the challenges in high-dimensional time series forecasting and could largely improve our strategies. More precisely, our strategies achieve a better Sharpe ratio and annualized return. The above process enables us to create multiple portfolio strategies aiming for high returns and low risks when compared to various benchmarks.

[LG-82] Pre-training and in-context learning IS Bayesian inference a la De Finetti

链接: https://arxiv.org/abs/2408.03307
作者: Naimeng Ye,Hanming Yang,Andrew Siah,Hongseok Namkoong
关键词-EN: Accurately gauging uncertainty, Accurately gauging, intelligent systems, longstanding goal, goal of intelligent
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurately gauging uncertainty on the underlying environment is a longstanding goal of intelligent systems. We characterize which latent concepts pre-trained sequence models are naturally able to reason with. We go back to De Finetti’s predictive view of Bayesian reasoning: instead of modeling latent parameters through priors and likelihoods like topic models do, De Finetti has long advocated for modeling exchangeable (permutation invariant) sequences of observables. According to this view, pre-training autoregressive models formulates informed beliefs based on prior observations (“empirical Bayes”), and forward generation is a simulated instantiation of an environment (“posterior inference”). This connection allows extending in-context learning (ICL) beyond predictive settings, highlighting sequence models’ ability to perform explicit statistical inference. In particular, we show the sequence prediction loss over exchangeable documents controls performance on downstream tasks where uncertainty quantification is key. Empirically, we propose and demonstrate several approaches for encoding exchangeability in sequence model architectures: data augmentation, regularization, and causal masking.

[LG-83] Convergence Conditions for Stochastic Line Search Based Optimization of Over-parametrized Models

链接: https://arxiv.org/abs/2408.03199
作者: Matteo Lapucci,Davide Pucci
关键词-EN: fitting over-parametrized models, finite-sum problems related, over-parametrized models, solve the finite-sum, finite-sum problems
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, we deal with algorithms to solve the finite-sum problems related to fitting over-parametrized models, that typically satisfy the interpolation condition. In particular, we focus on approaches based on stochastic line searches and employing general search directions. We define conditions on the sequence of search directions that guarantee finite termination and bounds for the backtracking procedure. Moreover, we shed light on the additional property of directions needed to prove fast (linear) convergence of the general class of algorithms when applied to PL functions in the interpolation regime. From the point of view of algorithms design, the proposed analysis identifies safeguarding conditions that could be employed in relevant algorithmic framework. In particular, it could be of interest to integrate stochastic line searches within momentum, conjugate gradient or adaptive preconditioning methods.

[LG-84] Active Learning for Level Set Estimation Using Randomized Straddle Algorithms

链接: https://arxiv.org/abs/2408.03144
作者: Yu Inatsu,Shion Takeno,Kentaro Kutsukake,Ichiro Takeuchi
关键词-EN: Level set estimation, Level set, set estimation, problem of identifying, theoretical guarantees
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 21 pages, 4 figures

点击查看摘要

Abstract:Level set estimation (LSE), the problem of identifying the set of input points where a function takes value above (or below) a given threshold, is important in practical applications. When the function is expensive-to-evaluate and black-box, the straddle algorithm, which is a representative heuristic for LSE based on Gaussian process models, and its extensions having theoretical guarantees have been developed. However, many of existing methods include a confidence parameter $\beta^{1/2}_t$ that must be specified by the user, and methods that choose $\beta^{1/2}_t$ heuristically do not provide theoretical guarantees. In contrast, theoretically guaranteed values of $\beta^{1/2}_t$ need to be increased depending on the number of iterations and candidate points, and are conservative and not good for practical performance. In this study, we propose a novel method, the randomized straddle algorithm, in which $\beta_t$ in the straddle algorithm is replaced by a random sample from the chi-squared distribution with two degrees of freedom. The confidence parameter in the proposed method has the advantages of not needing adjustment, not depending on the number of iterations and candidate points, and not being conservative. Furthermore, we show that the proposed method has theoretical guarantees that depend on the sample complexity and the number of iterations. Finally, we confirm the usefulness of the proposed method through numerical experiments using synthetic and real data.
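
Our reading of the abstract in code form (a sketch, not the authors' implementation): the usual straddle acquisition sqrt(beta) * sigma(x) - |mu(x) - h| is kept, but beta is redrawn every iteration from a chi-squared distribution with two degrees of freedom instead of being a hand-tuned constant. The GP model, test function, and threshold below are arbitrary stand-ins.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)
f = lambda x: np.sin(3 * x) + 0.5 * x                  # expensive black-box stand-in
h = 0.5                                                # level-set threshold
grid = np.linspace(0, 3, 300).reshape(-1, 1)

X = rng.uniform(0, 3, (3, 1))                          # initial design
y = f(X).ravel()
for t in range(20):
    gpr = GaussianProcessRegressor(alpha=1e-6, normalize_y=True).fit(X, y)
    mu, sigma = gpr.predict(grid, return_std=True)
    beta = rng.chisquare(df=2)                         # randomized confidence parameter
    acq = np.sqrt(beta) * sigma - np.abs(mu - h)       # straddle score with random beta
    x_next = grid[np.argmax(acq)].reshape(1, 1)
    X, y = np.vstack([X, x_next]), np.append(y, f(x_next).ravel())

gpr = GaussianProcessRegressor(alpha=1e-6, normalize_y=True).fit(X, y)
mu, _ = gpr.predict(grid, return_std=True)
print("estimated fraction of the domain above the threshold:", (mu > h).mean())
```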

[LG-85] QADQN: Quantum Attention Deep Q-Network for Financial Market Prediction

链接: https://arxiv.org/abs/2408.03088
作者: Siddhant Dutta,Nouhaila Innan,Alberto Marchisio,Sadok Ben Yahia,Muhammad Shafique
关键词-EN: optimal trading strategy, trading strategy development, strategy development remain, development remain challenging, remain challenging due
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted at the 2024 IEEE International Conference on Quantum Computing and Engineering (QCE24), QCRL, September 2024

点击查看摘要

Abstract:Financial market prediction and optimal trading strategy development remain challenging due to market complexity and volatility. Our research in quantum finance and reinforcement learning for decision-making demonstrates the approach of quantum-classical hybrid algorithms to tackling real-world financial challenges. In this respect, we corroborate the concept with rigorous backtesting and validate the framework's performance under realistic market conditions, by including fixed transaction cost per trade. This paper introduces a Quantum Attention Deep Q-Network (QADQN) approach to address these challenges through quantum-enhanced reinforcement learning. Our QADQN architecture uses a variational quantum circuit inside a traditional deep Q-learning framework to take advantage of possible quantum advantages in decision-making. We gauge the QADQN agent's performance on historical data from major market indices, including the S&P 500. We evaluate the agent's learning process by examining its reward accumulation and the effectiveness of its experience replay mechanism. Our empirical results demonstrate the QADQN's superior performance, achieving better risk-adjusted returns with Sortino ratios of 1.28 and 1.19 for non-overlapping and overlapping test periods respectively, indicating effective downside risk management.

[LG-86] Matrix Multiplication on Quantum Computer

链接: https://arxiv.org/abs/2408.03085
作者: Jiaqi Yao,Ding Liu
关键词-EN: quantum matrix multiplication, Quantum Fourier Transform, universal quantum matrix, matrix multiplication, quantum matrix
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper introduces an innovative and practical approach to universal quantum matrix multiplication. We designed optimized quantum adders and multipliers based on Quantum Fourier Transform (QFT), which significantly reduced the number of gates used compared to classical adders and multipliers. Subsequently, we construct a basic universal quantum matrix multiplication and extend it to the Strassen algorithm. We conduct comparative experiments to analyze the performance of the quantum matrix multiplication and evaluate the acceleration provided by the optimized quantum adder and multiplier. Furthermore, we investigate the advantages and disadvantages of the quantum Strassen algorithm compared to basic quantum matrix multiplication.

[LG-87] Evaluating Posterior Probabilities: Decision Theory Proper Scoring Rules and Calibration

链接: https://arxiv.org/abs/2408.02841
作者: Luciana Ferrer,Daniel Ramos
关键词-EN: calibration metrics, output posterior probabilities, calibration, machine learning classifiers, classifiers are designed
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Most machine learning classifiers are designed to output posterior probabilities for the classes given the input sample. These probabilities may be used to make the categorical decision on the class of the sample; provided as input to a downstream system; or provided to a human for interpretation. Evaluating the quality of the posteriors generated by these systems is an essential problem which was addressed decades ago with the invention of proper scoring rules (PSRs). Unfortunately, much of the recent machine learning literature uses calibration metrics – most commonly, the expected calibration error (ECE) – as a proxy to assess posterior performance. The problem with this approach is that calibration metrics reflect only one aspect of the quality of the posteriors, ignoring the discrimination performance. For this reason, we argue that calibration metrics should play no role in the assessment of posterior quality. Expected PSRs should instead be used for this job, preferably normalized for ease of interpretation. In this work, we first give a brief review of PSRs from a practical perspective, motivating their definition using Bayes decision theory. We discuss why expected PSRs provide a principled measure of the quality of a system's posteriors and why calibration metrics are not the right tool for this job. We argue that calibration metrics, while not useful for performance assessment, may be used as diagnostic tools during system development. With this purpose in mind, we discuss a simple and practical calibration metric, called calibration loss, derived from a decomposition of expected PSRs. We compare this metric with the ECE and with the expected score divergence calibration metric from the PSR literature and argue, using theoretical and empirical evidence, that calibration loss is superior to these two metrics.
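
A small numeric illustration of the paper's argument (our own toy example, not from the paper): a posterior that always outputs the class prior is perfectly calibrated yet useless, so ECE scores it as excellent, while the expected log score (a proper scoring rule) correctly prefers a discriminative but slightly miscalibrated posterior.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 5000)

def log_score(p, y):                       # expected log PSR (cross-entropy), lower is better
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def ece(p, y, bins=10):                    # standard binned expected calibration error
    edges = np.linspace(0, 1, bins + 1)
    idx = np.clip(np.digitize(p, edges) - 1, 0, bins - 1)
    return sum(abs(y[idx == b].mean() - p[idx == b].mean()) * (idx == b).mean()
               for b in range(bins) if (idx == b).any())

prior = np.full_like(y, 0.5, dtype=float)                               # calibrated but useless
informative = np.where(y == 1, 0.8, 0.2) + rng.normal(0, 0.02, y.size)  # discriminative, miscalibrated

for name, p in [("prior-only posterior", prior), ("informative posterior", informative)]:
    print(f"{name}: log score = {log_score(p, y):.3f}, ECE = {ece(p, y):.3f}")
```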

[LG-88] Optimizing Cox Models with Stochastic Gradient Descent: Theoretical Foundations and Practical Guidances

链接: https://arxiv.org/abs/2408.02839
作者: Lang Zeng,Weijing Tang,Zhao Ren,Ying Ding
关键词-EN: variants poses substantial, poses substantial computational, substantial computational challenges, SGD, network variants poses
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Optimizing Cox regression and its neural network variants poses substantial computational challenges in large-scale studies. Stochastic gradient descent (SGD), known for its scalability in model optimization, has recently been adapted to optimize Cox models. Unlike its conventional application, which typically targets a sum of independent individual loss, SGD for Cox models updates parameters based on the partial likelihood of a subset of data. Despite its empirical success, the theoretical foundation for optimizing Cox partial likelihood with SGD is largely underexplored. In this work, we demonstrate that the SGD estimator targets an objective function that is batch-size-dependent. We establish that the SGD estimator for the Cox neural network (Cox-NN) is consistent and achieves the optimal minimax convergence rate up to a polylogarithmic factor. For Cox regression, we further prove the $\sqrt{n}$-consistency and asymptotic normality of the SGD estimator, with variance depending on the batch size. Furthermore, we quantify the impact of batch size on Cox-NN training and its effect on the SGD estimator's asymptotic efficiency in Cox regression. These findings are validated by extensive numerical experiments and provide guidance for selecting batch sizes in SGD applications. Finally, we demonstrate the effectiveness of SGD in a real-world application where GD is unfeasible due to the large scale of data.
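
For concreteness, the mini-batch objective in question can be sketched as follows: within each batch, the risk set of subject i consists of the batch members whose observed time is at least t_i, which makes the loss batch-size dependent, exactly the property the paper analyzes. The PyTorch snippet below uses a standard batch partial likelihood and toy exponential survival data of our own; it is not the authors' code.

```python
import torch

def cox_batch_loss(risk_score, time, event):
    """Negative Cox partial log-likelihood on one mini-batch (Breslow-style ties)."""
    order = torch.argsort(time, descending=True)       # then risk sets are prefixes
    risk_score, event = risk_score[order], event[order]
    log_cum_hazard = torch.logcumsumexp(risk_score, dim=0)
    return -((risk_score - log_cum_hazard) * event).sum() / event.sum().clamp(min=1)

torch.manual_seed(0)
X = torch.randn(1000, 5)
beta_true = torch.tensor([1.0, -1.0, 0.5, 0.0, 0.0])
time = torch.distributions.Exponential(torch.exp(X @ beta_true)).sample()   # PH model holds
event = torch.ones(1000)                               # no censoring in this toy example

beta = torch.zeros(5, requires_grad=True)
opt = torch.optim.SGD([beta], lr=0.1)
for step in range(2000):
    idx = torch.randint(0, 1000, (64,))                # mini-batch: the loss is batch-size dependent
    opt.zero_grad()
    cox_batch_loss(X[idx] @ beta, time[idx], event[idx]).backward()
    opt.step()

print("estimated beta:", beta.data)                    # should approach beta_true
```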

[LG-89] Training a multilayer dynamical spintronic network with standard machine learning tools to perform time series classification

链接: https://arxiv.org/abs/2408.02835
作者: Erwan Plouet,Dédalo Sanz-Hernández,Aymeric Vecchiola,Julie Grollier,Frank Mizrahi
关键词-EN: low energy cost, ability to process, process time-series, time-series at low, low energy
类目: Disordered Systems and Neural Networks (cond-mat.dis-nn); Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 7 pages, 4 figures

点击查看摘要

Abstract:The ability to process time-series at low energy cost is critical for many applications. Recurrent neural networks, which can perform such tasks, are computationally expensive when implemented in software on conventional computers. Here we propose to implement a recurrent neural network in hardware using spintronic oscillators as dynamical neurons. Using numerical simulations, we build a multi-layer network and demonstrate that we can use backpropagation through time (BPTT) and standard machine learning tools to train this network. Leveraging the transient dynamics of the spintronic oscillators, we solve the sequential digits classification task with 89.83 ± 2.91% accuracy, as good as the equivalent software network. We devise guidelines on how to choose the time constant of the oscillators as well as hyper-parameters of the network to adapt to different input time scales.
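
The training setup described above relies only on standard BPTT; a minimal software stand-in might look like the sketch below, with an ordinary nn.RNN cell in place of the simulated spintronic oscillators (the dimensions and data are toy values of our own).

```python
import torch
import torch.nn as nn

class SeqClassifier(nn.Module):
    """Ordinary RNN plus a linear readout; in the paper the oscillator network plays
    the role of the recurrent layer, but the BPTT training loop is the same."""
    def __init__(self, in_dim=1, hidden=32, n_classes=10):
        super().__init__()
        self.rnn = nn.RNN(in_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):                  # x: (batch, time, in_dim)
        _, h_last = self.rnn(x)            # h_last: (1, batch, hidden)
        return self.head(h_last[0])

model = SeqClassifier()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(64, 28, 1)                 # toy sequences, 28 time steps each
y = torch.randint(0, 10, (64,))            # toy digit labels
loss = loss_fn(model(x), y)
loss.backward()                            # gradients propagate back through time (BPTT)
opt.step()
print("one training step done, loss =", float(loss))
```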

[LG-90] Adaptive Learning for Quantum Linear Regression

链接: https://arxiv.org/abs/2408.02833
作者: Costantino Carugno,Maurizio Ferrari Dacrema,Paolo Cremonesi
关键词-EN: handle machine learning, machine learning problems, cloud-based services, services has enabled, handle machine
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The recent availability of quantum annealers as cloud-based services has enabled new ways to handle machine learning problems, and several relevant algorithms have been adapted to run on these devices. In a recent work, linear regression was formulated as a quadratic binary optimization problem that can be solved via quantum annealing. Although this approach promises a computational time advantage for large datasets, the quality of the solution is limited by the necessary use of a precision vector, used to approximate the real-numbered regression coefficients in the quantum formulation. In this work, we focus on the practical challenge of improving the precision vector encoding: instead of setting an array of generic values equal for all coefficients, we allow each one to be expressed by its specific precision, which is tuned with a simple adaptive algorithm. This approach is evaluated on synthetic datasets of increasing size, and linear regression is solved using the D-Wave Advantage quantum annealer, as well as classical solvers. To the best of our knowledge, this is the largest dataset ever evaluated for linear regression on a quantum annealer. The results show that our formulation is able to deliver improved solution quality in all instances, and could better exploit the potential of current quantum devices.
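
A toy sketch of the precision-vector idea discussed in the abstract: each real coefficient is approximated by a binary combination of a small set of precision values, and a per-coefficient, rescaled grid can reduce the approximation error. The grids and the brute-force stand-in for the annealer below are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np
from itertools import product

def best_encoding(target, precision):
    """Brute-force the binary vector whose weighted sum is closest to `target`
    (standing in for the choice the annealer would make)."""
    best_bits, best_err = None, np.inf
    for bits in product((0, 1), repeat=len(precision)):
        err = abs(target - np.dot(bits, precision))
        if err < best_err:
            best_bits, best_err = np.array(bits), err
    return best_bits, best_err

generic = np.array([-2.0, 1.0, 0.5, 0.25])        # one fixed grid shared by all coefficients
tuned = np.array([-1.5, 0.75, 0.375, 0.1875])     # a per-coefficient, rescaled grid

for w in (1.3, -0.6):
    for name, p in (("generic", generic), ("tuned", tuned)):
        bits, err = best_encoding(w, p)
        print(f"w={w:+.2f} {name:>7}: bits={bits} value={np.dot(bits, p):+.4f} err={err:.4f}")
```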

[LG-91] Setting the duration of online A/B experiments

链接: https://arxiv.org/abs/2408.02830
作者: Harrison H. Li,Chaoyu Yu
关键词-EN: resulting confidence interval, sufficient statistical power, sample size, confidence interval, wasting resources
类目: Methodology (stat.ME); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In designing an online A/B experiment, it is crucial to select a sample size and duration that ensure the resulting confidence interval (CI) for the treatment effect is the right width to detect an effect of meaningful magnitude with sufficient statistical power without wasting resources. While the relationship between sample size and CI width is well understood, the effect of experiment duration on CI width remains less clear. This paper provides an analytical formula for the width of a CI based on a ratio treatment effect estimator as a function of both sample size (N) and duration (T). The formula is derived from a mixed effects model with two variance components. One component, referred to as the temporal variance, persists over time for experiments where the same users are kept in the same experiment arm across different days. The remaining error variance component, by contrast, decays to zero as T gets large. The formula we derive introduces a key parameter that we call the user-specific temporal correlation (UTC), which quantifies the relative sizes of the two variance components and can be estimated from historical experiments. Higher UTC indicates a slower decay in CI width over time. On the other hand, when the UTC is 0 – as for experiments where users shuffle in and out of the experiment across days – the CI width decays at the standard parametric 1/T rate. We also study how access to pre-period data for the users in the experiment affects the CI width decay. We show our formula closely explains CI widths on real A/B experiments at YouTube.
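
The abstract does not reproduce the derived formula, but the two-variance-component behaviour it describes can be sketched schematically as below; the functional form, the variance values, and the normal critical value are illustrative assumptions of ours, not the paper's result.

```python
import numpy as np
from scipy import stats

def ci_width(n_users, n_days, temporal_var, error_var, alpha=0.05):
    """Illustrative two-component width: a user-level term that does not shrink with
    more days plus a residual term that vanishes as n_days grows. The functional
    form is an assumption for illustration, not the formula derived in the paper."""
    z = stats.norm.ppf(1 - alpha / 2)
    var_of_mean = temporal_var / n_users + error_var / (n_users * n_days)
    return 2 * z * np.sqrt(var_of_mean)

temporal_var, error_var = 0.2, 0.8       # hypothetical variance components
for days in (1, 7, 14, 28):
    print(f"{days:>2} days -> CI width {ci_width(100_000, days, temporal_var, error_var):.5f}")
# With a nonzero persistent (temporal) component the width plateaus as the experiment
# runs longer; with temporal_var = 0 it keeps shrinking with duration.
```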

[LG-92] Continuous Monitoring via Repeated Significance

链接: https://arxiv.org/abs/2408.02821
作者: Eric Bax,Arundhyoti Sarkar,Alex Shtoff
关键词-EN: multiple interim analyses, Requiring statistical significance, statistically significant result, interim analysis, multiple interim
类目: Methodology (stat.ME); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Requiring statistical significance at multiple interim analyses to declare a statistically significant result for an AB test allows less stringent requirements for significance at each interim analysis. Repeated significance competes well with methods built on assumptions about the test – assumptions that may be impossible to evaluate a priori and may require extra data to evaluate empirically. Instead, requiring repeated significance allows the data itself to prove directly that the required results are not due to chance alone. We explain how to apply tests with repeated significance to continuously monitor unbounded tests – tests that do not have an a priori bound on running time or number of observations. We show that it is impossible to maintain a constant requirement for significance for unbounded tests, but that we can come arbitrarily close to that goal.
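
A quick Monte Carlo sketch of the general idea (not the paper's exact procedure): because a win must persist across every interim look, each look can use a less stringent threshold than a single final test, while the overall false-positive rate stays well below the per-look level.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def false_positive_rate(per_look_alpha, looks, n_per_look=1_000, sims=2_000):
    """Monte Carlo type-I error under the null when declaring a winner requires the
    one-sided z-test to be significant at EVERY interim analysis."""
    z_crit = stats.norm.ppf(1 - per_look_alpha)
    hits = 0
    for _ in range(sims):
        x = rng.standard_normal(looks * n_per_look)           # null: zero treatment effect
        z = np.cumsum(x) / np.sqrt(np.arange(1, x.size + 1))  # running z statistic
        if np.all(z[n_per_look - 1 :: n_per_look] > z_crit):  # one check per interim look
            hits += 1
    return hits / sims

print("one final look, alpha 0.05 per look:", false_positive_rate(0.05, looks=1))
print("four looks,     alpha 0.10 per look:", false_positive_rate(0.10, looks=4))
```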

[LG-93] KAN we improve on HEP classification tasks? Kolmogorov-Arnold Networks applied to an LHC physics example

链接: https://arxiv.org/abs/2408.02743
作者: Johannes Erdmann,Florian Mausolf,Jan Lukas Späh
关键词-EN: Kolmogorov-Arnold Networks, KANs, Recently, Networks, performance
类目: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); Data Analysis, Statistics and Probability (physics.data-an)
*备注: 25 pages, 9 figures

点击查看摘要

Abstract:Recently, Kolmogorov-Arnold Networks (KANs) have been proposed as an alternative to multilayer perceptrons, suggesting advantages in performance and interpretability. We study a typical binary event classification task in high-energy physics including high-level features and comment on the performance and interpretability of KANs in this context. We find that the learned activation functions of a one-layer KAN resemble the log-likelihood ratio of the input features. In deeper KANs, the activations in the first KAN layer differ from those in the one-layer KAN, which indicates that the deeper KANs learn more complex representations of the data. We study KANs with different depths and widths and we compare them to multilayer perceptrons in terms of performance and number of trainable parameters. For the chosen classification task, we do not find that KANs are more parameter efficient. However, small KANs may offer advantages in terms of interpretability that come at the cost of only a moderate loss in performance.

[LG-94] Image Super-resolution Inspired Electron Density Prediction

链接: https://arxiv.org/abs/2402.12335
作者: Chenghan Li,Or Sharir,Shunyue Yuan,Garnet K. Chan
关键词-EN: convolutional residual network, trivially generated guess, accurate ground-state quantum, ground-state quantum mechanical, quantum mechanical density
类目: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Drawing inspiration from the domain of image super-resolution, we view the electron density as a 3D grayscale image and use a convolutional residual network to transform a crude and trivially generated guess of the molecular density into an accurate ground-state quantum mechanical density. We find that this model outperforms all prior density prediction approaches. Because the input is itself a real-space density, the predictions are equivariant to molecular symmetry transformations even though the model is not constructed to be. Due to its simplicity, the model is directly applicable to unseen molecular conformations and chemical elements. We show that fine-tuning on limited new data provides high accuracy even in challenging cases of exotic elements and charge states. Our work suggests new routes to learning real-space physical quantities drawing from the established ideas of image processing.
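
A minimal PyTorch sketch of the general idea of refining a crude density "image" with a 3D residual convolutional network; the layer sizes and block structure below are illustrative, not the architecture used in the paper.

```python
import torch
import torch.nn as nn

class ResBlock3D(nn.Module):
    """One 3D convolutional residual block."""
    def __init__(self, channels=32):
        super().__init__()
        self.conv1 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.SiLU()

    def forward(self, x):
        return x + self.conv2(self.act(self.conv1(x)))

class DensityRefiner(nn.Module):
    """Maps a crude single-channel density guess on a 3D grid to a corrected density."""
    def __init__(self, channels=32, blocks=4):
        super().__init__()
        self.inp = nn.Conv3d(1, channels, kernel_size=3, padding=1)
        self.body = nn.Sequential(*[ResBlock3D(channels) for _ in range(blocks)])
        self.out = nn.Conv3d(channels, 1, kernel_size=3, padding=1)

    def forward(self, rho_guess):
        # Predict a residual correction on top of the trivially generated guess.
        return rho_guess + self.out(self.body(self.inp(rho_guess)))

rho = torch.rand(2, 1, 32, 32, 32)        # batch of crude densities on a 32^3 grid
print(DensityRefiner()(rho).shape)         # torch.Size([2, 1, 32, 32, 32])
```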

信息检索

[IR-0] CADRL: Category-aware Dual-agent Reinforcement Learning for Explainable Recommendations over Knowledge Graphs

链接: https://arxiv.org/abs/2408.03166
作者: Shangfei Zheng,Hongzhi Yin,Tong Chen,Xiangjie Kong,Jian Hou,Pengpeng Zhao
关键词-EN: mitigate data sparsity, address cold-start issues, recommender systems, Knowledge graphs, widely adopted
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Knowledge graphs (KGs) have been widely adopted to mitigate data sparsity and address cold-start issues in recommender systems. While existing KGs-based recommendation methods can predict user preferences and demands, they fall short in generating explicit recommendation paths and lack explainability. As a step beyond the above methods, recent advancements utilize reinforcement learning (RL) to find suitable items for a given user via explainable recommendation paths. However, the performance of these solutions is still limited by the following two points. (1) Lack of ability to capture contextual dependencies from neighboring information. (2) The excessive reliance on short recommendation paths due to efficiency concerns. To surmount these challenges, we propose a category-aware dual-agent reinforcement learning (CADRL) model for explainable recommendations over KGs. Specifically, our model comprises two components: (1) a category-aware gated graph neural network that jointly captures context-aware item representations from neighboring entities and categories, and (2) a dual-agent RL framework where two agents efficiently traverse long paths to search for suitable items. Finally, experimental results show that CADRL outperforms state-of-the-art models in terms of both effectiveness and efficiency on large-scale datasets.

[IR-1] Modeling User Intent Beyond Trigger: Incorporating Uncertainty for Trigger-Induced Recommendation

链接: https://arxiv.org/abs/2408.03091
作者: Jianxing Ma,Zhibo Xiao,Luwei Yang,Hansheng Xue,Xuanzhou Liu,Wen Jiang,Wei Ning,Guannan Zhang
关键词-EN: user intent, users’ short-term intent, Trigger-Induced Recommendation, immersive browsing experience, intent
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:To cater to users’ desire for an immersive browsing experience, numerous e-commerce platforms provide various recommendation scenarios, with a focus on Trigger-Induced Recommendation (TIR) tasks. However, the majority of current TIR methods heavily rely on the trigger item to understand user intent, lacking a higher-level exploration and exploitation of user intent (e.g., popular items and complementary items), which may result in an overly convergent understanding of users’ short-term intent and can be detrimental to users’ long-term purchasing experiences. Moreover, users’ short-term intent shows uncertainty and is affected by various factors such as browsing context and historical behaviors, which poses challenges to user intent modeling. To address these challenges, we propose a novel model called Deep Uncertainty Intent Network (DUIN), comprising three essential modules: i) Explicit Intent Exploit Module extracting explicit user intent using the contrastive learning paradigm; ii) Latent Intent Explore Module exploring latent user intent by leveraging the multi-view relationships between items; iii) Intent Uncertainty Measurement Module offering a distributional estimation and capturing the uncertainty associated with user intent. Experiments on three real-world datasets demonstrate the superior performance of DUIN compared to existing baselines. Notably, DUIN has been deployed across all TIR scenarios in our e-commerce platform, with online A/B testing results conclusively validating its superiority.

[IR-2] The Crowd in MOOCs: A Study of Learning Patterns at Scale

链接: https://arxiv.org/abs/2408.03025
作者: Xin Zhou,Aixin Sun,Jie Zhang,Donghui Lin
关键词-EN: Massive Open Online, Massive Open, Open Online, learners’ learning behavior, learning activity data
类目: Information Retrieval (cs.IR)
*备注: 16 pages

点击查看摘要

Abstract:The increasing availability of learning activity data in Massive Open Online Courses (MOOCs) enables us to conduct a large-scale analysis of learners’ learning behavior. In this paper, we analyze a dataset of 351 million learning activities from 0.8 million unique learners enrolled in over 1.6 thousand courses within two years. Specifically, we mine and identify the learning patterns of the crowd from both temporal and course enrollment perspectives leveraging mutual information theory and sequential pattern mining methods. From the temporal perspective, we find that the time intervals between consecutive learning activities of learners exhibit a mix of power-law and periodic cosine function distribution. By qualifying the relationship between course pairs, we observe that the most frequently co-enrolled courses usually fall in the same category or the same university. We demonstrate these findings can facilitate manifold applications including recommendation tasks on courses. A simple recommendation model utilizing the course enrollment patterns is competitive with the baselines with 200× faster training time.

[IR-3] Fact Finder – Enhancing Domain Expertise of Large Language Models by Incorporating Knowledge Graphs

链接: https://arxiv.org/abs/2408.03010
作者: Daniel Steinigen,Roman Teucher,Timm Heine Ruland,Max Rudat,Nicolas Flores-Herr,Peter Fischer,Nikola Milosevic,Christopher Schymura,Angelo Ziletti
关键词-EN: Large Language Models, natural language queries, answering natural language, Language Models, Large Language
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
*备注: 10 pages, 7 figures

点击查看摘要

Abstract:Recent advancements in Large Language Models (LLMs) have showcased their proficiency in answering natural language queries. However, their effectiveness is hindered by limited domain-specific knowledge, raising concerns about the reliability of their responses. We introduce a hybrid system that augments LLMs with domain-specific knowledge graphs (KGs), thereby aiming to enhance factual correctness using a KG-based retrieval approach. We focus on a medical KG to demonstrate our methodology, which includes (1) pre-processing, (2) Cypher query generation, (3) Cypher query processing, (4) KG retrieval, and (5) LLM-enhanced response generation. We evaluate our system on a curated dataset of 69 samples, achieving a precision of 78% in retrieving correct KG nodes. Our findings indicate that the hybrid system surpasses a standalone LLM in accuracy and completeness, as verified by an LLM-as-a-Judge evaluation method. This positions the system as a promising tool for applications that demand factual correctness and completeness, such as target identification – a critical process in pinpointing biological entities for disease treatment or crop enhancement. Moreover, its intuitive search interface and ability to provide accurate responses within seconds make it well-suited for time-sensitive, precision-focused research contexts. We publish the source code together with the dataset and the prompt templates used.
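
Steps (2) to (5) of the pipeline can be sketched roughly as below using the Neo4j Python driver; the toy medical schema, the stub question_to_cypher function, and the prompt format are illustrative assumptions of ours, not the authors' implementation.

```python
from neo4j import GraphDatabase

def question_to_cypher(question: str) -> str:
    # Placeholder for step (2): a real system would prompt an LLM to write this query.
    return (
        "MATCH (d:Drug)-[:TREATS]->(c:Condition {name: $condition}) "
        "RETURN d.name AS drug LIMIT 10"
    )

def retrieve_facts(uri, user, password, question, condition):
    # Steps (3) and (4): run the generated Cypher query and collect KG facts.
    cypher = question_to_cypher(question)
    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        with driver.session() as session:
            return [record["drug"] for record in session.run(cypher, condition=condition)]

def build_prompt(question, facts):
    # Step (5): pack the retrieved facts into the generation prompt.
    context = "\n".join(f"- {fact}" for fact in facts) or "- (no matching nodes found)"
    return f"Answer using only these knowledge-graph facts:\n{context}\n\nQuestion: {question}"

# facts = retrieve_facts("bolt://localhost:7687", "neo4j", "password",
#                        "Which drugs treat hypertension?", "hypertension")
# print(build_prompt("Which drugs treat hypertension?", facts))
```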

[IR-4] A Real-Time Adaptive Multi-Stream GPU System for Online Approximate Nearest Neighborhood Search CIKM’24

链接: https://arxiv.org/abs/2408.02937
作者: Yiping Sun,Yang Shi,Jiaolong Du
关键词-EN: Approximate Nearest Neighbor, Nearest Neighbor Search, Approximate Nearest, Nearest Neighbor, emerging LLM applications
类目: Information Retrieval (cs.IR)
*备注: Accepted by CIKM’24

点击查看摘要

Abstract:In recent years, Approximate Nearest Neighbor Search (ANNS) has played a pivotal role in modern search and recommendation systems, especially in emerging LLM applications like Retrieval-Augmented Generation. There is a growing exploration into harnessing the parallel computing capabilities of GPUs to meet the substantial demands of ANNS. However, existing systems primarily focus on offline scenarios, overlooking the distinct requirements of online applications that necessitate real-time insertion of new vectors. This limitation renders such systems inefficient for real-world scenarios. Moreover, previous architectures struggled to effectively support real-time insertion due to their reliance on serial execution streams. In this paper, we introduce a novel Real-Time Adaptive Multi-Stream GPU ANNS System (RTAMS-GANNS). Our architecture achieves its objectives through three key advancements: 1) We initially examined the real-time insertion mechanisms in existing GPU ANNS systems and discovered their reliance on repetitive copying and memory allocation, which significantly hinders real-time effectiveness on GPUs. As a solution, we introduce a dynamic vector insertion algorithm based on memory blocks, which includes in-place rearrangement. 2) To enable real-time vector insertion in parallel, we introduce a multi-stream parallel execution mode, which differs from existing systems that operate serially within a single stream. Our system utilizes a dynamic resource pool, allowing multiple streams to execute concurrently without additional execution blocking. 3) Through extensive experiments and comparisons, our approach effectively handles varying QPS levels across different datasets, reducing latency by up to 40%-80%. The proposed system has also been deployed in real-world industrial search and recommendation systems, serving hundreds of millions of users daily, and has achieved good results.

[IR-5] Wiping out the limitations of Large Language Models – A Taxonomy for Retrieval Augmented Generation

链接: https://arxiv.org/abs/2408.02854
作者: Mahei Manhai Li,Irina Nikishina,Özge Sevgili,Martin Semman
关键词-EN: Current research, evolving very quickly, technological innovations, business contexts, unit of analysis
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Current research on RAGs is distributed across various disciplines, and since the technology is evolving very quickly, its unit of analysis is mostly on technological innovations, rather than applications in business contexts. Thus, in this research, we aim to create a taxonomy to conceptualize a comprehensive overview of the constituting characteristics that define RAG applications, facilitating the adoption of this technology in the IS community. To the best of our knowledge, no RAG application taxonomies have been developed so far. We describe our methodology for developing the taxonomy, which includes the criteria for selecting papers, an explanation of our rationale for employing a Large Language Model (LLM)-supported approach to extract and identify initial characteristics, and a concise overview of our systematic process for conceptualizing the taxonomy. Our systematic taxonomy development process includes four iterative phases designed to refine and enhance our understanding and presentation of RAG’s core dimensions. We have developed a total of five meta-dimensions and sixteen dimensions to comprehensively capture the concept of Retrieval-Augmented Generation (RAG) applications. When discussing our findings, we also detail the specific research areas and pose key research questions to guide future information system researchers as they explore the emerging topics of RAG systems.

[IR-6] Entity Retrieval for Answering Entity-Centric Questions

链接: https://arxiv.org/abs/2408.02795
作者: Hassan S. Shavarani,Anoop Sarkar
关键词-EN: retrieval-augmented question answering, crucial factor, Entity Retrieval, retrieval, question answering
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
*备注: 17 pages total, 10 Tables, 4 Figures

点击查看摘要

Abstract:The similarity between the question and indexed documents is a crucial factor in document retrieval for retrieval-augmented question answering. Although this is typically the only method for obtaining the relevant documents, it is not the sole approach when dealing with entity-centric questions. In this study, we propose Entity Retrieval, a novel retrieval method which rather than relying on question-document similarity, depends on the salient entities within the question to identify the retrieval documents. We conduct an in-depth analysis of the performance of both dense and sparse retrieval methods in comparison to Entity Retrieval. Our findings reveal that our method not only leads to more accurate answers to entity-centric questions but also operates more efficiently.
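
A minimal sketch of the idea of retrieving by the salient entities in the question rather than by question-document similarity; the tiny corpus and the exact-match entity detector are illustrative stand-ins for a real entity linker and index.

```python
from collections import defaultdict

docs = {
    "d1": "Marie Curie won the Nobel Prize in Physics in 1903.",
    "d2": "The Eiffel Tower was completed in 1889 in Paris.",
    "d3": "Marie Curie was born in Warsaw.",
}
known_entities = {"Marie Curie", "Eiffel Tower", "Paris", "Warsaw"}

# Inverted index from entity -> documents that mention it.
entity_index = defaultdict(set)
for doc_id, text in docs.items():
    for entity in known_entities:
        if entity in text:
            entity_index[entity].add(doc_id)

def entity_retrieval(question):
    salient = [e for e in known_entities if e in question]      # stand-in entity linker
    hits = set().union(*(entity_index[e] for e in salient)) if salient else set()
    return salient, sorted(hits)

print(entity_retrieval("Where was Marie Curie born?"))
# (['Marie Curie'], ['d1', 'd3'])
```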

[IR-7] Symmetric Graph Contrastive Learning against Noisy Views for Recommendation

链接: https://arxiv.org/abs/2408.02691
作者: Chu Zhao,Enneng Yang,Yuliang Liang,Jianzhe Zhao,Guibing Guo,Xingwei Wang
关键词-EN: Graph Contrastive Learning, leverages data augmentation, Graph Contrastive, produce contrasting views, data augmentation techniques
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注: 24 pages, submitted to TOIS

点击查看摘要

Abstract:Graph Contrastive Learning (GCL) leverages data augmentation techniques to produce contrasting views, enhancing the accuracy of recommendation systems through learning the consistency between contrastive views. However, existing augmentation methods, such as directly perturbing interaction graph (e.g., node/edge dropout), may interfere with the original connections and generate poor contrasting views, resulting in sub-optimal performance. In this paper, we define the views that share only a small amount of information with the original graph due to poor data augmentation as noisy views (i.e., the last 20% of the views with a cosine similarity value less than 0.1 to the original view). We demonstrate through detailed experiments that noisy views will significantly degrade recommendation performance. Further, we propose a model-agnostic Symmetric Graph Contrastive Learning (SGCL) method with theoretical guarantees to address this issue. Specifically, we introduce symmetry theory into graph contrastive learning, based on which we propose a symmetric form and contrast loss resistant to noisy interference. We provide theoretical proof that our proposed SGCL method has a high tolerance to noisy views. Further demonstration is given by conducting extensive experiments on three real-world datasets. The experimental results demonstrate that our approach substantially increases recommendation accuracy, with relative improvements reaching as high as 12.25% over nine other competing models. These results highlight the efficacy of our method.
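
The noisy-view criterion quoted in the abstract (cosine similarity below 0.1 to the original view) is easy to state in code; the following toy numpy sketch applies it to random embeddings and is not the SGCL training code.

```python
import numpy as np

rng = np.random.default_rng(3)

def cosine_sim(a, b):
    return np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))

# Toy embeddings of the original graph and of one augmented (contrastive) view.
z_orig = rng.standard_normal((1_000, 64))
z_view = 0.3 * z_orig + 0.95 * rng.standard_normal((1_000, 64))   # a heavily perturbed view

sims = cosine_sim(z_orig, z_view)
noisy = sims < 0.1        # the working definition of a noisy view quoted in the abstract
print(f"{noisy.mean():.1%} of the toy views fall below the 0.1 similarity threshold")
```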

附件下载

点击下载今日全部论文列表