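Latest arXiv papers for today (2024-08-13)

This post lists the latest papers retrieved from arXiv.org on 2024-08-13. It is updated automatically and grouped into five areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.

Note: The daily paper data is retrieved from arXiv.org and updated automatically at around 10:30 every morning.

Tip: If you would like to receive the daily paper data by email, leave your email address in the comments. The email is also sent automatically at around 10:30 every day.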

Link: https://arxiv.org/abs/2408.06335
Authors: Tanisha Khurana, Kaushik Pillalamarri, Vikram Pande, Munindar Singh
Keywords: Natural Language Processing, Language Processing, Natural Language, paper explores humor, paper explores
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:This paper explores humor detection through a linguistic lens, prioritizing syntactic, semantic, and contextual features over computational methods in Natural Language Processing. We categorize features into syntactic, semantic, and contextual dimensions, including lexicons, structural statistics, Word2Vec, WordNet, and phonetic style. Our proposed model, Colbert, utilizes BERT embeddings and parallel hidden layers to capture sentence congruity. By combining syntactic, semantic, and contextual features, we train Colbert for humor detection. Feature engineering examines essential syntactic and semantic features alongside BERT embeddings. SHAP interpretations and decision trees identify influential features, revealing that a holistic approach improves humor detection accuracy on unseen data. Integrating linguistic cues from different dimensions enhances the model’s ability to understand humor complexity beyond traditional computational methods.

[NLP-1] FastFiD: Improve Inference Efficiency of Open Domain Question Answering via Sentence Selection (ACL 2024)

Link: https://arxiv.org/abs/2408.06333
Authors: Yufei Huang, Xu Han, Maosong Sun
Keywords: Open Domain Question, Domain Question Answering, Open Domain, dense passage retrieval, pretrained language models
Categories: Computation and Language (cs.CL)
Comments: ACL 2024 Main Conference

Abstract:Open Domain Question Answering (ODQA) has been advancing rapidly in recent times, driven by significant developments in dense passage retrieval and pretrained language models. Current models typically incorporate the FiD framework, which is composed by a neural retriever alongside an encoder-decoder neural reader. In the answer generation process, the retriever will retrieve numerous passages (around 100 for instance), each of which is then individually encoded by the encoder. Subsequently, the decoder makes predictions based on these encoded passages. Nevertheless, this framework can be relatively time-consuming, particularly due to the extensive length of the gathered passages. To address this, we introduce FastFiD in this paper, a novel approach that executes sentence selection on the encoded passages. This aids in retaining valuable sentences while reducing the context length required for generating answers. Experiments on three commonly used datasets (Natural Questions, TriviaQA and ASQA) demonstrate that our method can enhance the inference speed by 2.3X-5.7X, while simultaneously maintaining the model’s performance. Moreover, an in-depth analysis of the model’s attention reveals that the selected sentences indeed hold a substantial contribution towards the final answer. The codes are publicly available at this https URL.
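
As a rough illustration of the sentence-selection idea in the abstract above, the sketch below keeps only the highest-scoring sentences from the retrieved passages, which shortens the context used for answer generation. The word-overlap score, the helper names `split_sentences`, `overlap_score`, and `select_sentences`, and the `top_k` parameter are all assumptions made for illustration. The paper performs selection on encoded passages with a trained model, which is not reproduced here.

```python
import re

def split_sentences(passage):
    # Naive splitter on ., ! or ? followed by whitespace (illustrative only).
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", passage) if s.strip()]

def overlap_score(question, sentence):
    # Hypothetical relevance score: fraction of question words found in the sentence.
    q = set(re.findall(r"\w+", question.lower()))
    s = set(re.findall(r"\w+", sentence.lower()))
    return len(q & s) / len(q) if q else 0.0

def select_sentences(question, passages, top_k=2):
    # Score every sentence of every passage, keep the top_k overall,
    # then restore the original reading order.
    scored = []
    for p_idx, passage in enumerate(passages):
        for s_idx, sent in enumerate(split_sentences(passage)):
            scored.append((overlap_score(question, sent), p_idx, s_idx, sent))
    best = sorted(scored, key=lambda t: -t[0])[:top_k]
    best.sort(key=lambda t: (t[1], t[2]))
    return [t[3] for t in best]

question = "Who wrote the novel Dracula?"
passages = [
    "Dracula is a novel published in 1897. Bram Stoker wrote the novel Dracula in London.",
    "The weather in Dublin is mild. Stoker was born in Dublin.",
]
selected = select_sentences(question, passages, top_k=2)
original_len = sum(len(p.split()) for p in passages)   # words before selection
reduced_len = sum(len(s.split()) for s in selected)    # words after selection
print(selected, original_len, reduced_len)
```

In this toy run, both kept sentences come from the first passage and the context shrinks from 26 words to 15. A real system would replace `overlap_score` with a learned selector.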

[NLP-2] Animate or Inanimate That is the Question for Large Language Models

Link: https://arxiv.org/abs/2408.06332
Authors: Leonardo Ranaldi, Giulia Pucci, Fabio Massimo Zanzotto
Keywords: multi-layered language understanding, shaping their memory, deeply intertwined, plays an essential, essential role
Categories: Computation and Language (cs.CL)

Abstract:The cognitive essence of humans is deeply intertwined with the concept of animacy, which plays an essential role in shaping their memory, vision, and multi-layered language understanding. Although animacy appears in language via nuanced constraints on verbs and adjectives, it is also learned and refined through extralinguistic information. Similarly, we assume that the LLMs’ limited abilities to understand natural language when processing animacy are motivated by the fact that these models are trained exclusively on text. Hence, the question this paper aims to answer arises: can LLMs, in their digital wisdom, process animacy in a similar way to what humans would do? We then propose a systematic analysis via prompting approaches. In particular, we probe different LLMs by prompting them using animate, inanimate, usual, and stranger contexts. Results reveal that, although LLMs have been trained predominantly on textual data, they exhibit human-like behavior when faced with typical animate and inanimate entities in alignment with earlier studies. Hence, LLMs can adapt to understand unconventional situations by recognizing oddities as animated without needing to interface with unspoken cognitive triggers humans rely on to break down animations.

[NLP-3] VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents

Link: https://arxiv.org/abs/2408.06327
Authors: Xiao Liu, Tianjie Zhang, Yu Gu, Iat Long Iong, Yifan Xu, Xixuan Song, Shudan Zhang, Hanyu Lai, Xinyi Liu, Hanlin Zhao, Jiadai Sun, Xinyue Yang, Yu Yang, Zehan Qi, Shuntian Yao, Xueqiao Sun, Siyi Cheng, Qinkai Zheng, Hao Yu, Hanchen Zhang, Wenyi Hong, Ming Ding, Lihang Pan, Xiaotao Gu, Aohan Zeng, Zhengxiao Du, Chan Hee Song, Yu Su, Yuxiao Dong, Jie Tang
Keywords: Large Multimodal Models, Large Multimodal, form highly capable, highly capable Visual, Visual Foundation Agents
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Large Multimodal Models (LMMs) have ushered in a new era in artificial intelligence, merging capabilities in both language and vision to form highly capable Visual Foundation Agents. These agents are postulated to excel across a myriad of tasks, potentially approaching general artificial intelligence. However, existing benchmarks fail to sufficiently challenge or showcase the full potential of LMMs in complex, real-world environments. To address this gap, we introduce VisualAgentBench (VAB), a comprehensive and pioneering benchmark specifically designed to train and evaluate LMMs as visual foundation agents across diverse scenarios, including Embodied, Graphical User Interface, and Visual Design, with tasks formulated to probe the depth of LMMs’ understanding and interaction capabilities. Through rigorous testing across nine proprietary LMM APIs and eight open models, we demonstrate the considerable yet still developing agent capabilities of these models. Additionally, VAB constructs a trajectory training set constructed through hybrid methods including Program-based Solvers, LMM Agent Bootstrapping, and Human Demonstrations, promoting substantial performance improvements in LMMs through behavior cloning. Our work not only aims to benchmark existing models but also provides a solid foundation for future development into visual foundation agents. Code, train and test data, and part of fine-tuned open LMMs are available at this https URL.

[NLP-4] Long-Form Answers to Visual Questions from Blind and Low Vision People

Link: https://arxiv.org/abs/2408.06303
Authors: Mina Huh, Fangyuan Xu, Yi-Hao Peng, Chongyan Chen, Hansika Murugu, Danna Gurari, Eunsol Choi, Amy Pavel
Keywords: long-form answers, Vision language models, answers, generate long-form answers, Vision language
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: COLM 2024

Abstract:Vision language models can now generate long-form answers to questions about images - long-form visual question answers (LFVQA). We contribute VizWiz-LF, a dataset of long-form answers to visual questions posed by blind and low vision (BLV) users. VizWiz-LF contains 4.2k long-form answers to 600 visual questions, collected from human expert describers and six VQA models. We develop and annotate functional roles of sentences of LFVQA and demonstrate that long-form answers contain information beyond the question answer such as explanations and suggestions. We further conduct automatic and human evaluations with BLV and sighted people to evaluate long-form answers. BLV people perceive both human-written and generated long-form answers to be plausible, but generated answers often hallucinate incorrect visual details, especially for unanswerable visual questions (e.g., blurry or irrelevant images). To reduce hallucinations, we evaluate the ability of VQA models to abstain from answering unanswerable questions across multiple prompting strategies.

[NLP-5] The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

Link: https://arxiv.org/abs/2408.06292
Authors: Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, David Ha
Keywords: artificial general intelligence, developing agents capable, discovering new knowledge, grand challenges, challenges of artificial
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:One of the grand challenges of artificial general intelligence is developing agents capable of conducting scientific research and discovering new knowledge. While frontier models have already been used as aids to human scientists, e.g. for brainstorming ideas, writing code, or prediction tasks, they still conduct only a small part of the scientific process. This paper presents the first comprehensive framework for fully automatic scientific discovery, enabling frontier large language models to perform research independently and communicate their findings. We introduce The AI Scientist, which generates novel research ideas, writes code, executes experiments, visualizes results, describes its findings by writing a full scientific paper, and then runs a simulated review process for evaluation. In principle, this process can be repeated to iteratively develop ideas in an open-ended fashion, acting like the human scientific community. We demonstrate its versatility by applying it to three distinct subfields of machine learning: diffusion modeling, transformer-based language modeling, and learning dynamics. Each idea is implemented and developed into a full paper at a cost of less than $15 per paper. To evaluate the generated papers, we design and validate an automated reviewer, which we show achieves near-human performance in evaluating paper scores. The AI Scientist can produce papers that exceed the acceptance threshold at a top machine learning conference as judged by our automated reviewer. This approach signifies the beginning of a new era in scientific discovery in machine learning: bringing the transformative benefits of AI agents to the entire research process of AI itself, and taking us closer to a world where endless affordable creativity and innovation can be unleashed on the world’s most challenging problems. Our code is open-sourced at this https URL

[NLP-6] Synthetic Patient-Physician Dialogue Generation from Clinical Notes Using LLM

Link: https://arxiv.org/abs/2408.06285
Authors: Trisha Das, Dina Albassam, Jimeng Sun
Keywords: enhance patient-physician communication, improve healthcare accessibility, Medical dialogue systems, enhance patient-physician, patient-physician communication
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Medical dialogue systems (MDS) enhance patient-physician communication, improve healthcare accessibility, and reduce costs. However, acquiring suitable data to train these systems poses significant challenges. Privacy concerns prevent the use of real conversations, necessitating synthetic alternatives. Synthetic dialogue generation from publicly available clinical notes offers a promising solution to this issue, providing realistic data while safeguarding privacy. Our approach, SynDial, uses a single LLM iteratively with zero-shot prompting and a feedback loop to generate and refine high-quality synthetic dialogues. The feedback consists of weighted evaluation scores for similarity and extractiveness. The iterative process ensures dialogues meet predefined thresholds, achieving superior extractiveness as a result of the feedback loop. Additionally, evaluation shows that the generated dialogues excel in factuality metric compared to the baselines and has comparable diversity scores with GPT4.
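
The generate-score-refine loop described in the abstract above might look roughly like the following sketch. Everything concrete here is an assumption for illustration: the Jaccard-based `similarity`, the token-share `extractiveness`, the equal weights, the 0.6 threshold, and `stub_llm`, which returns canned drafts in place of a real LLM call. The abstract does not specify the paper's actual metrics or prompts.

```python
import re

def tokens(text):
    return re.findall(r"\w+", text.lower())

def similarity(note, dialogue):
    # Assumed metric: Jaccard overlap between the two vocabularies.
    a, b = set(tokens(note)), set(tokens(dialogue))
    return len(a & b) / len(a | b) if a | b else 0.0

def extractiveness(note, dialogue):
    # Assumed metric: share of dialogue tokens that also occur in the note.
    note_vocab = set(tokens(note))
    d = tokens(dialogue)
    return sum(t in note_vocab for t in d) / len(d) if d else 0.0

def refine(note, generate, w_sim=0.5, w_ext=0.5, threshold=0.6, max_rounds=5):
    # Regenerate until the weighted score reaches the threshold,
    # passing the previous score back as textual feedback.
    feedback, best, rounds = None, ("", -1.0), 0
    for _ in range(max_rounds):
        rounds += 1
        dialogue = generate(note, feedback)
        score = w_sim * similarity(note, dialogue) + w_ext * extractiveness(note, dialogue)
        if score > best[1]:
            best = (dialogue, score)
        if score >= threshold:
            break
        feedback = f"score {score:.2f} is below {threshold}; stay closer to the note"
    return best[0], best[1], rounds

# Hypothetical stand-in for an LLM: returns canned drafts in order.
drafts = iter([
    "Doctor: How are you? Patient: Fine, thanks.",
    "Doctor: You report chest pain for two days? Patient: Yes, chest pain for two days.",
])

def stub_llm(note, feedback):
    return next(drafts)

note = "Patient reports chest pain for two days."
dialogue, score, rounds = refine(note, stub_llm)
print(rounds, round(score, 3))
```

In this toy run the first draft scores far below the threshold, so the loop asks for a second draft, which is accepted.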

[NLP-7] MovieSum: An Abstractive Summarization Dataset for Movie Screenplays (ACL 2024)

Link: https://arxiv.org/abs/2408.06281
Authors: Rohit Saxena, Frank Keller
Keywords: long input contexts, long input, input contexts, movie screenplays, Movie
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: ACL 2024 Findings

Abstract:Movie screenplay summarization is challenging, as it requires an understanding of long input contexts and various elements unique to movies. Large language models have shown significant advancements in document summarization, but they often struggle with processing long input contexts. Furthermore, while television transcripts have received attention in recent studies, movie screenplay summarization remains underexplored. To stimulate research in this area, we present a new dataset, MovieSum, for abstractive summarization of movie screenplays. This dataset comprises 2200 movie screenplays accompanied by their Wikipedia plot summaries. We manually formatted the movie screenplays to represent their structural elements. Compared to existing datasets, MovieSum possesses several distinctive features: (1) It includes movie screenplays, which are longer than scripts of TV episodes. (2) It is twice the size of previous movie screenplay datasets. (3) It provides metadata with IMDb IDs to facilitate access to additional external knowledge. We also show the results of recently released large language models applied to summarization on our dataset to provide a detailed baseline.

[NLP-8] Review-driven Personalized Preference Reasoning with Large Language Models for Recommendation

Link: https://arxiv.org/abs/2408.06276
Authors: Jieyong Kim, Hyunseo Kim, Hyunjin Cho, SeongKu Kang, Buru Chang, Jinyoung Yeo, Dongha Lee
Keywords: Large Language Models, Language Models, Large Language, generating significant interest, demonstrated exceptional performance
Categories: Computation and Language (cs.CL)

Abstract:Recent advancements in Large Language Models (LLMs) have demonstrated exceptional performance across a wide range of tasks, generating significant interest in their application to recommendation systems. However, existing methods have not fully capitalized on the potential of LLMs, often constrained by limited input information or failing to fully utilize their advanced reasoning capabilities. To address these limitations, we introduce EXP3RT, a novel LLM-based recommender designed to leverage rich preference information contained in user and item reviews. EXP3RT is basically fine-tuned through distillation from a teacher LLM to perform three key tasks in order: EXP3RT first extracts and encapsulates essential subjective preferences from raw reviews, aggregates and summarizes them according to specific criteria to create user and item profiles. It then generates detailed step-by-step reasoning followed by predicted rating, i.e., reasoning-enhanced rating prediction, by considering both subjective and objective information from user/item profiles and item descriptions. This personalized preference reasoning from EXP3RT enhances rating prediction accuracy and also provides faithful and reasonable explanations for recommendation. Extensive experiments show that EXP3RT outperforms existing methods on both rating prediction and candidate item reranking for top-k recommendation, while significantly enhancing the explainability of recommendation systems.

[NLP-9] FuxiTranyu: A Multilingual Large Language Model Trained with Balanced Data

Link: https://arxiv.org/abs/2408.06273
Authors: Haoran Sun, Renren Jin, Shaoyang Xu, Leiyu Pan, Supryadi, Menglong Cui, Jiangcun Dui, Yikun Lei, Lei Yang, Ling Shi, Juesi Xiao, Shaolin Zhu, Deyi Xiong
Keywords: Large language models, Large language, demonstrated prowess, Large, multilingual
Categories: Computation and Language (cs.CL)

Abstract:Large language models (LLMs) have demonstrated prowess in a wide range of tasks. However, many LLMs exhibit significant performance discrepancies between high- and low-resource languages. To mitigate this challenge, we present FuxiTranyu, an open-source multilingual LLM, which is designed to satisfy the need of the research community for balanced and high-performing multilingual capabilities. FuxiTranyu-8B, the base model with 8 billion parameters, is trained from scratch on a meticulously balanced multilingual data repository that contains 600 billion tokens covering 43 natural languages and 16 programming languages. In addition to the base model, we also develop two instruction-tuned models: FuxiTranyu-8B-SFT that is fine-tuned on a diverse multilingual instruction dataset, and FuxiTranyu-8B-DPO that is further refined with DPO on a preference dataset for enhanced alignment ability. Extensive experiments on a wide range of multilingual benchmarks demonstrate the competitive performance of FuxiTranyu against existing multilingual LLMs, e.g., BLOOM-7B, PolyLM-13B, Llama-2-Chat-7B and Mistral-7B-Instruct. Interpretability analyses at both the neuron and representation level suggest that FuxiTranyu is able to learn consistent multilingual representations across different languages. To promote further research into multilingual LLMs and their working mechanisms, we release both the base and instruction-tuned FuxiTranyu models together with 58 pretraining checkpoints at HuggingFace and Github.

[NLP-10] Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment

Link: https://arxiv.org/abs/2408.06266
Authors: Karel D’Oosterlinck, Winnie Xu, Chris Develder, Thomas Demeester, Amanpreet Singh, Christopher Potts, Douwe Kiela, Shikib Mehri
Keywords: Large Language Models, Large Language, Language Models, alignment objectives, Anchored Preference Optimization
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Large Language Models (LLMs) are often aligned using contrastive alignment objectives and preference pair datasets. The interaction between model, paired data, and objective makes alignment a complicated procedure, sometimes producing subpar results. We study this and find that (i) preference data gives a better learning signal when the underlying responses are contrastive, and (ii) alignment objectives lead to better performance when they specify more control over the model during training. Based on these insights, we introduce Contrastive Learning from AI Revisions (CLAIR), a data-creation method which leads to more contrastive preference pairs, and Anchored Preference Optimization (APO), a controllable and more stable alignment objective. We align Llama-3-8B-Instruct using various comparable datasets and alignment objectives and measure MixEval-Hard scores, which correlate highly with human judgments. The CLAIR preferences lead to the strongest performance out of all datasets, and APO consistently outperforms less controllable objectives. Our best model, trained on 32K CLAIR preferences with APO, improves Llama-3-8B-Instruct by 7.65%, closing the gap with GPT4-turbo by 45%. Our code is available at this https URL.

[NLP-11] Context-aware Visual Storytelling with Visual Prefix Tuning and Contrastive Learning

Link: https://arxiv.org/abs/2408.06259
Authors: Yingjin Song, Denis Paperno, Albert Gatt
Keywords: storytelling systems generate, systems generate multi-sentence, Visual storytelling systems, generate multi-sentence stories, image sequences
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 12 figures, accepted by INLG 2024

Abstract:Visual storytelling systems generate multi-sentence stories from image sequences. In this task, capturing contextual information and bridging visual variation bring additional challenges. We propose a simple yet effective framework that leverages the generalization capabilities of pretrained foundation models, only training a lightweight vision-language mapping network to connect modalities, while incorporating context to enhance coherence. We introduce a multimodal contrastive objective that also improves visual relevance and story informativeness. Extensive experimental results, across both automatic metrics and human evaluations, demonstrate that the stories generated by our framework are diverse, coherent, informative, and interesting.

[NLP-12] FLEURS-R: A Restored Multilingual Speech Corpus for Generation Tasks

Link: https://arxiv.org/abs/2408.06227
Authors: Min Ma, Yuma Koizumi, Shigeki Karita, Heiga Zen, Jason Riesa, Haruko Ishikawa, Michiel Bacchiani
Keywords: Few-shot Learning Evaluation, Few-shot Learning, Universal Representations, restoration applied version, paper introduces FLEURS-R
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Abstract:This paper introduces FLEURS-R, a speech restoration applied version of the Few-shot Learning Evaluation of Universal Representations of Speech (FLEURS) corpus. FLEURS-R maintains an N-way parallel speech corpus in 102 languages as FLEURS, with improved audio quality and fidelity by applying the speech restoration model Miipher. The aim of FLEURS-R is to advance speech technology in more languages and catalyze research including text-to-speech (TTS) and other speech generation tasks in low-resource languages. Comprehensive evaluations with the restored speech and TTS baseline models trained from the new corpus show that the new corpus obtained significantly improved speech quality while maintaining the semantic contents of the speech. The corpus is publicly released via Hugging Face.

[NLP-13] On Effects of Steering Latent Representation for Large Language Model Unlearning

Link: https://arxiv.org/abs/2408.06223
Authors: Dang Huu-Tien, Trung-Tin Pham, Hoang Thanh-Tung, Naoya Inoue
Keywords: large language model, Representation Misdirection, large language, target random representation, steers model representation
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 15 pages, 5 figures, 8 tables

Abstract:Representation Misdirection for Unlearning (RMU), which steers model representation in the intermediate layer to a target random representation, is an effective method for large language model (LLM) unlearning. Despite its high performance, the underlying cause and explanation remain underexplored. In this paper, we first theoretically demonstrate that steering forget representations in the intermediate layer reduces token confidence, causing LLMs to generate wrong or nonsense responses. Second, we investigate how the coefficient influences the alignment of forget-sample representations with the random direction and hint at the optimal coefficient values for effective unlearning across different network layers. Third, we show that RMU unlearned models are robust against adversarial jailbreak attacks. Last, our empirical analysis shows that RMU is less effective when applied to the middle and later layers in LLMs. To resolve this drawback, we propose Adaptive RMU – a simple yet effective alternative method that makes unlearning effective with most layers. Extensive experiments demonstrate that Adaptive RMU significantly improves the unlearning performance compared to prior art while incurring no additional computational cost.
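
To make "steering a representation toward a target random representation" concrete, here is a toy sketch. It assumes the forget objective is the mean squared distance between an intermediate activation and a random unit vector `u` scaled by a coefficient `c`, and it applies gradient steps directly to a small activation vector rather than to model weights. The function names, the dimension 8, and the value `c = 5.0` are illustrative assumptions, not values taken from the paper.

```python
import math
import random

def unit_random_vector(dim, seed=0):
    # Random direction u with unit length (the assumed steering target direction).
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def forget_loss(h, u, c):
    # Mean squared distance between activation h and the scaled target c * u.
    return sum((a - c * t) ** 2 for a, t in zip(h, u)) / len(u)

def steer(h, u, c, lr=0.5, steps=50):
    # Plain gradient descent on the activation itself; the gradient of the
    # loss with respect to each component is 2 * (a - c * t) / dim.
    n = len(h)
    h = list(h)
    for _ in range(steps):
        h = [a - lr * 2.0 * (a - c * t) / n for a, t in zip(h, u)]
    return h

u = unit_random_vector(8)
h0 = [1.0] * 8          # toy "forget-sample" activation
c = 5.0                 # steering coefficient (illustrative value)
h1 = steer(h0, u, c)
loss_before = forget_loss(h0, u, c)
loss_after = forget_loss(h1, u, c)
norm_after = math.sqrt(sum(a * a for a in h1))
print(loss_before, loss_after, norm_after)
```

After steering, the activation lies almost exactly on `c * u`, so its norm is close to `c`. This is the quantity the coefficient controls in this simplified picture.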

[NLP-14] Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers

Link: https://arxiv.org/abs/2408.06195
Authors: Zhenting Qi, Mingyuan Ma, Jiahang Xu, Li Lyna Zhang, Fan Yang, Mao Yang
Keywords: small language models, Carlo Tree Search, language models, superior models, significantly improves reasoning
Categories: Computation and Language (cs.CL)

Abstract:This paper introduces rStar, a self-play mutual reasoning approach that significantly improves reasoning capabilities of small language models (SLMs) without fine-tuning or superior models. rStar decouples reasoning into a self-play mutual generation-discrimination process. First, a target SLM augments the Monte Carlo Tree Search (MCTS) with a rich set of human-like reasoning actions to construct higher quality reasoning trajectories. Next, another SLM, with capabilities similar to the target SLM, acts as a discriminator to verify each trajectory generated by the target SLM. The mutually agreed reasoning trajectories are considered mutual consistent, thus are more likely to be correct. Extensive experiments across five SLMs demonstrate rStar can effectively solve diverse reasoning problems, including GSM8K, GSM-Hard, MATH, SVAMP, and StrategyQA. Remarkably, rStar boosts GSM8K accuracy from 12.51% to 63.91% for LLaMA2-7B, from 36.46% to 81.88% for Mistral-7B, from 74.53% to 91.13% for LLaMA3-8B-Instruct. Code will be available at this https URL.
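
The generation-discrimination step described in the abstract above can be pictured with the following minimal sketch. It assumes the second model is shown only the first part of each candidate trajectory and that a trajectory counts as mutually consistent when the second model reaches the same final answer. The tree search that produces the candidates is omitted, and `toy_discriminator` is a hypothetical stand-in for a second language model.

```python
from collections import Counter

def mutually_consistent(trajectories, discriminator, split=0.5):
    # Keep trajectories whose final answer is reproduced by the second model
    # when it sees only the first part of the reasoning steps.
    kept = []
    for steps, answer in trajectories:
        cut = max(1, int(len(steps) * split))
        if discriminator(steps[:cut]) == answer:
            kept.append((steps, answer))
    return kept

def pick_answer(trajectories, discriminator):
    # Majority vote over consistent trajectories; fall back to all of them
    # if none are consistent.
    kept = mutually_consistent(trajectories, discriminator)
    pool = kept if kept else trajectories
    return Counter(ans for _, ans in pool).most_common(1)[0][0]

def toy_discriminator(prefix_steps):
    # Hypothetical second model: recomputes the first step "a + b = ..."
    # itself, then doubles the result to finish the toy problem.
    left = prefix_steps[0].split("=")[0]
    a, b = (int(x) for x in left.split("+"))
    return str((a + b) * 2)

# Toy problem: "Add 3 and 4, then double the result."
candidates = [
    (["3 + 4 = 7", "7 * 2 = 14"], "14"),
    (["4 + 3 = 7", "7 + 7 = 14"], "14"),
    (["3 + 4 = 8", "8 * 2 = 16"], "16"),   # arithmetic slip in the first step
]
kept = mutually_consistent(candidates, toy_discriminator)
answer = pick_answer(candidates, toy_discriminator)
print(len(kept), answer)
```

The trajectory with the arithmetic slip is filtered out because the second model, recomputing from the same prefix, does not arrive at 16.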

[NLP-15] Improving Structural Diversity of Blackbox LLMs via Chain-of-Specification Prompting
[NLP-15] 通过规范链提示改进黑盒LLM的结构多样性

链接: https://arxiv.org/abs/2408.06186
作者: Halley Young,Yimeng Zeng,Jacob Gardner,Osbert Bastani
关键词-EN: large language models, key challenge facing, challenge facing large, facing large language, diversity
关键词-ZN: 大语言模型，面临的关键挑战，面临的大语言，多样性
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The capability to generate diverse text is a key challenge facing large language models (LLMs). Thus far, diversity has been studied via metrics such as n-gram diversity or diversity of BERT embeddings. However, for these kinds of diversity, the user has little control over the dimensions along which diversity is considered. For example, in the poetry domain, one might desire diversity in terms of rhyme and meter, whereas in the code domain, one might desire diversity in terms of the kinds of expressions used to solve a problem. We propose a diversity metric called structural diversity, where the user provides a mapping from generated text to features capturing the kinds of diversity that they care about. In addition, we propose a novel strategy called chain-of-specification (CoS) prompting for improving diversity by first having the LLM generate a specification encoding one instance of structural features, and then prompting the LLM to generate text that satisfies these features; notably, our strategy works with blackbox LLMs. In our experiments, we show that for structural diversity in the poetry and code domains, CoS significantly improves diversity compared to several baselines.
摘要：生成多样化文本的能力是大型语言模型(LLM)面临的关键挑战。到目前为止，多样性主要通过n-gram多样性或BERT嵌入多样性等指标来研究。然而，对于这些类型的多样性，用户几乎无法控制衡量多样性的维度。例如，在诗歌领域，人们可能希望在押韵和格律方面具有多样性，而在代码领域，人们可能希望在用于解决问题的表达式类型方面具有多样性。我们提出了一种称为结构多样性的多样性度量，其中用户提供从生成文本到特征的映射，这些特征刻画了他们所关心的多样性类型。此外，我们提出了一种称为规范链(CoS)提示的新策略来提高多样性：首先让LLM生成一个规范，对结构特征的一个实例进行编码，然后提示LLM生成满足这些特征的文本；值得注意的是，我们的策略适用于黑盒LLM。在我们的实验中，我们表明，对于诗歌和代码领域的结构多样性，CoS相比若干基线显著提高了多样性。
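A user-supplied feature mapping, as described in this abstract, can be sketched as follows. The abstract does not give the exact formula, so the two scores used here (the ratio of distinct feature values and the Shannon entropy of their distribution) are assumptions, and `poem_shape` is a hypothetical feature map.

```python
import math
from collections import Counter


def structural_diversity(texts, feature_fn):
    """Return (distinct_ratio, entropy_bits) over the features of `texts`.

    feature_fn maps a text to a hashable feature value chosen by the user.
    Both scores are illustrative choices, not the paper's definition.
    """
    if not texts:
        raise ValueError("need at least one text")
    counts = Counter(feature_fn(t) for t in texts)
    n = len(texts)
    distinct_ratio = len(counts) / n
    entropy_bits = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return distinct_ratio, entropy_bits


def poem_shape(poem):
    """Hypothetical feature map: (number of non-empty lines, ends with '?')."""
    lines = [ln for ln in poem.splitlines() if ln.strip()]
    return (len(lines), poem.rstrip().endswith("?"))
```

Swapping in a different `feature_fn` (for example, one that extracts the kinds of expressions used in a code snippet) changes which dimension of diversity is measured, without touching the scoring function.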

[NLP-16] LipidBERT: A Lipid Language Model Pre-trained on METiS de novo Lipid Library
[NLP-16] LipidBERT：在METiS从头(de novo)脂质库上预训练的脂质语言模型

链接: https://arxiv.org/abs/2408.06150
作者: Tianhao Yu,Cai Yao,Zhuorui Sun,Feng Shi,Lin Zhang,Kangjie Lyu,Xuan Bai,Andong Liu,Xicheng Zhang,Jiali Zou,Wenshou Wang,Chris Lai,Kai Wang
关键词-EN: million virtual lipids, virtual screening techniques, generate and maintain, maintain a database, virtual lipids
关键词-ZN: 百万虚拟脂质，虚拟筛查技术，生成和维护，维护数据库，虚拟脂质
类目: Computation and Language (cs.CL); Chemical Physics (physics.chem-ph); Biomolecules (q-bio.BM)
备注:

点击查看摘要

Abstract:In this study, we generate and maintain a database of 10 million virtual lipids through METiS’s in-house de novo lipid generation algorithms and lipid virtual screening techniques. These virtual lipids serve as a corpus for pre-training, lipid representation learning, and downstream task knowledge transfer, culminating in state-of-the-art LNP property prediction performance. We propose LipidBERT, a BERT-like model pre-trained with the Masked Language Model (MLM) and various secondary tasks. Additionally, we compare the performance of embeddings generated by LipidBERT and PhatGPT, our GPT-like lipid generation model, on downstream tasks. The proposed bilingual LipidBERT model operates in two languages: the language of ionizable lipid pre-training, using in-house dry-lab lipid structures, and the language of LNP fine-tuning, utilizing in-house LNP wet-lab data. This dual capability positions LipidBERT as a key AI-based filter for future screening tasks, including new versions of METiS de novo lipid libraries and, more importantly, candidates for in vivo testing for organ-targeting LNPs. To the best of our knowledge, this is the first successful demonstration of the capability of a pre-trained language model on virtual lipids and its effectiveness in downstream tasks using wet-lab data. This work showcases the clever utilization of METiS’s in-house de novo lipid library as well as the power of dry-wet lab integration.
摘要：在这项研究中，我们通过METiS内部的从头(de novo)脂质生成算法和脂质虚拟筛选技术，生成并维护了一个包含1000万个虚拟脂质的数据库。这些虚拟脂质作为预训练、脂质表征学习和下游任务知识迁移的语料库，最终实现了最先进的LNP性质预测性能。我们提出了LipidBERT，一个类似BERT的模型，采用掩码语言模型(MLM)和多种辅助任务进行预训练。此外，我们还比较了LipidBERT和我们的类GPT脂质生成模型PhatGPT所生成的嵌入在下游任务中的性能。所提出的双语LipidBERT模型以两种语言运行：可电离脂质预训练语言，使用内部干实验室脂质结构；以及LNP微调语言，利用内部LNP湿实验室数据。这一双重能力使LipidBERT成为未来筛选任务的关键AI过滤器，包括新版本的METiS从头脂质库，更重要的是，用于筛选器官靶向LNP体内测试的候选物。据我们所知，这是首次成功展示预训练语言模型在虚拟脂质上的能力，以及它在使用湿实验室数据的下游任务中的有效性。这项工作展示了对METiS内部从头脂质库的巧妙利用，以及干湿实验室整合的力量。
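The masked-language-model objective mentioned in this abstract can be sketched in a few lines. This is a simplified illustration and not LipidBERT's pipeline: the character-level tokenization of a lipid string, the mask probability, and the function name `mask_tokens` are assumptions, and the BERT refinement of sometimes keeping or randomizing a selected token is omitted.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Randomly replace tokens with a mask symbol for MLM-style training.

    Returns (masked_tokens, labels). A label holds the original token at
    masked positions and None elsewhere, so a loss would only be computed
    on the masked positions.
    """
    rng = random.Random(seed)  # local generator keeps the sketch reproducible
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


# Hypothetical input: a molecule string split into single characters.
example_tokens = list("CCN(CC)CCO")
masked_example, label_example = mask_tokens(example_tokens, mask_prob=0.3)
```

A model trained this way learns to predict the hidden tokens from their context, which is what makes the resulting embeddings reusable for downstream property prediction.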

[NLP-17] Med42-v2: A Suite of Clinical LLMs
[NLP-17] Med42-v2：临床LLM套件

链接: https://arxiv.org/abs/2408.06142
作者: Clément Christophe,Praveen K Kanithi,Tathagata Raha,Shadab Khan,Marco AF Pimentel
关键词-EN: large language models, clinical large language, introduces a suite, designed to address, large language
关键词-ZN: 大型语言模型，临床大型语言，引入一个套件，旨在解决，大型语言
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Med42-v2 introduces a suite of clinical large language models (LLMs) designed to address the limitations of generic models in healthcare settings. These models are built on Llama3 architecture and fine-tuned using specialized clinical data. They underwent multi-stage preference alignment to effectively respond to natural prompts. While generic models are often preference-aligned to avoid answering clinical queries as a precaution, Med42-v2 is specifically trained to overcome this limitation, enabling its use in clinical settings. Across various medical benchmarks, Med42-v2 models outperform both the original Llama3 models, in the 8B and 70B parameter configurations, and GPT-4. These LLMs are developed to understand clinical queries, perform reasoning tasks, and provide valuable assistance in clinical environments. The models are now publicly available at this https URL.
摘要：Med42-v2引入了一套临床大型语言模型(LLM)，旨在解决通用模型在医疗保健环境中的局限性。这些模型基于Llama3架构构建，并使用专业临床数据进行微调。它们经过多阶段的偏好对齐，以有效地响应自然提示。通用模型通常经过偏好对齐，出于谨慎而避免回答临床问题，而Med42-v2经过专门训练以克服这一限制，使其能够在临床环境中使用。在各种医学基准上，Med42-v2模型在8B和70B两种参数配置下均优于原始Llama3模型，并优于GPT-4。这些LLM旨在理解临床查询、执行推理任务并在临床环境中提供有价值的帮助。这些模型现已在此https URL上公开提供。

[NLP-18] Utilize Transformers for translating Wikipedia category names
[NLP-18] 利用Transformer翻译维基百科类别名称

链接: https://arxiv.org/abs/2408.06124
作者: Hoang-Thang Ta,Quoc Thang La
关键词-EN: navigating content efficiently, articles are categorized, content efficiently, categorized to aid, aid readers
关键词-ZN: 有效导航内容，文章分类，内容高效，分类以帮助，帮助读者
类目: Computation and Language (cs.CL)
备注: 5 pages, 1 figure

点击查看摘要

Abstract:On Wikipedia, articles are categorized to aid readers in navigating content efficiently. The manual creation of new categories can be laborious and time-intensive. To tackle this issue, we built language models to translate Wikipedia categories from English to Vietnamese with a dataset containing 15,000 English-Vietnamese category pairs. Subsequently, small to medium-scale Transformer pre-trained models with a sequence-to-sequence architecture were fine-tuned for category translation. The experiments revealed that OPUS-MT-en-vi surpassed other models, attaining the highest performance with a BLEU score of 0.73, despite its smaller model storage. We expect our paper to be an alternative solution for translation tasks with limited computer resources.
摘要：在维基百科上，文章会被分类，以帮助读者高效地浏览内容。手动创建新类别可能既费力又耗时。为了解决这个问题，我们构建了语言模型，利用包含15,000个英语-越南语类别对的数据集，将维基百科类别从英语翻译为越南语。随后，我们对具有序列到序列架构的中小规模Transformer预训练模型进行了微调，用于类别翻译。实验显示，OPUS-MT-en-vi超越了其他模型，尽管其模型存储体积较小，仍以0.73的BLEU分数取得了最高性能。我们希望本文能为计算资源有限的翻译任务提供一种替代解决方案。

[NLP-19] How ChatGPT Changed the Media's Narratives on AI: A Semi-Automated Narrative Analysis Through Frame Semantics
[NLP-19] ChatGPT如何改变人工智能上的媒体叙事：通过框架语义的半自动叙事分析

链接: https://arxiv.org/abs/2408.06120
作者: Igor Ryazanov,Carl Öhman,Johanna Björklund
关键词-EN: technology media coverage, recent explosion, media coverage, technology media, mixed-method frame semantics-based
关键词-ZN: 技术媒体报道，最近爆炸，媒体报道，技术媒体，基于混合方法框架语义
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: 18 pages, 6 figures and 2 appendices (5 pages)

点击查看摘要

Abstract:The recent explosion of attention to AI is arguably one of the biggest in the technology’s media coverage. To investigate the effects it has on the discourse, we perform a mixed-method frame semantics-based analysis on a dataset of more than 49,000 sentences collected from 5846 news articles that mention AI. The dataset covers the twelve-month period centred around the launch of OpenAI’s chatbot ChatGPT and is collected from the most visited open-access English-language news publishers. Our findings indicate that during the half year succeeding the launch, media attention rose tenfold, from already historically high levels. During this period, discourse has become increasingly centred around experts and political leaders, and AI has become more closely associated with dangers and risks. A deeper review of the data also suggests a qualitative shift in the types of threat AI is thought to represent, as well as the anthropomorphic qualities ascribed to it.
摘要：最近对人工智能的关注激增，可以说是该技术媒体报道史上规模最大的一次。为了研究它对话语的影响，我们对从5846篇提及人工智能的新闻文章中收集的49,000多个句子组成的数据集，进行了基于框架语义的混合方法分析。该数据集涵盖以OpenAI聊天机器人ChatGPT发布为中心的十二个月，采集自访问量最大的开放获取英语新闻出版商。我们的研究结果表明，在发布后的半年内，媒体关注度在已处于历史高位的基础上增长了十倍。在此期间，话语越来越以专家和政治领导人为中心，人工智能也与危险和风险更紧密地联系在一起。对数据的进一步审视还表明，人们认为人工智能所代表的威胁类型，以及赋予它的拟人化特征，都发生了质的转变。

[NLP-20] Building Decision Making Models Through Language Model Regime
[NLP-20] 通过语言模型机制构建决策模型

链接: https://arxiv.org/abs/2408.06087
作者: Yu Zhang,Haoxiang Liu,Feijun Jiang,Weihua Luo,Kaifu Zhang
关键词-EN: decision making, making problems leveraging, decision, making, problems leveraging
关键词-ZN: 决策制定，利用问题，决策，制定，利用问题
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We propose a novel approach for decision making problems leveraging the generalization capabilities of large language models (LLMs). Traditional methods such as expert systems, planning algorithms, and reinforcement learning often exhibit limited generalization, typically requiring the training of new models for each unique task. In contrast, LLMs demonstrate remarkable success in generalizing across varied language tasks, inspiring a new strategy for training decision making models. Our approach, referred to as “Learning then Using” (LTU), entails a two-stage process. Initially, the learning phase develops a robust foundational decision making model by integrating diverse knowledge from various domains and decision making contexts. The subsequent using phase refines this foundation model for specific decision making scenarios. Distinct from other studies that employ LLMs for decision making through supervised learning, our LTU method embraces a versatile training methodology that combines broad pre-training with targeted fine-tuning. Experiments in e-commerce domains such as advertising and search optimization have shown that the LTU approach outperforms traditional supervised learning regimes in decision making capabilities and generalization. The LTU approach is the first practical training architecture for both single-step and multi-step decision making tasks combined with LLMs, which can be applied beyond game and robot domains. It provides a robust and adaptable framework for decision making and enhances the effectiveness and flexibility of various systems in tackling various challenges.
摘要：我们提出了一种利用大型语言模型(LLM)的泛化能力来解决决策问题的新方法。传统的方法，如专家系统、规划算法和强化学习，往往泛化能力有限，通常需要为每个不同的任务训练新的模型。相比之下，LLM在跨多种语言任务的泛化方面取得了显著成功，这启发了一种训练决策模型的新策略。我们的方法被称为“先学后用”(Learning then Using, LTU)，包含两个阶段。首先，学习(learning)阶段通过整合来自不同领域和决策场景的多样知识，构建一个稳健的基础决策模型。随后的使用(using)阶段针对特定的决策场景对该基础模型进行精调。与其他通过监督学习将LLM用于决策的研究不同，我们的LTU方法采用了一种通用的训练方法，将广泛的预训练与有针对性的微调相结合。在广告和搜索优化等电子商务领域的实验表明，LTU方法在决策能力和泛化能力方面优于传统的监督学习范式。LTU方法是第一个将单步和多步决策任务与LLM相结合的实用训练架构，可应用于游戏和机器人以外的领域。它为决策提供了一个稳健且适应性强的框架，提高了各类系统应对各种挑战的有效性和灵活性。

[NLP-21] An Investigation Into Explainable Audio Hate Speech Detection SIGDIAL2024
[NLP-21] 可解释音频仇恨语音检测研究

链接: https://arxiv.org/abs/2408.06065
作者: Jinmyeong An,Wonjun Lee,Yejin Jeon,Jungseul Ok,Yunsu Kim,Gary Geunbae Lee
关键词-EN: hate speech, hate speech detection, speech, content largely unexplored, hate
关键词-ZN: 仇恨言论，仇恨言论检测，言论，内容基本上未被探索，仇恨
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted to SIGDIAL 2024

点击查看摘要

Abstract:Research on hate speech has predominantly revolved around detection and interpretation from textual inputs, leaving verbal content largely unexplored. While there has been limited exploration into hate speech detection within verbal acoustic speech inputs, the aspect of interpretability has been overlooked. Therefore, we introduce a new task of explainable audio hate speech detection. Specifically, we aim to identify the precise time intervals, referred to as audio frame-level rationales, which serve as evidence for hate speech classification. Towards this end, we propose two different approaches: cascading and End-to-End (E2E). The cascading approach initially converts audio to transcripts, identifies hate speech within these transcripts, and subsequently locates the corresponding audio time frames. Conversely, the E2E approach processes audio utterances directly, which allows it to pinpoint hate speech within specific time frames. Additionally, due to the lack of explainable audio hate speech datasets that include audio frame-level rationales, we curated a synthetic audio dataset to train our models. We further validated these models on actual human speech utterances and found that the E2E approach outperforms the cascading method in terms of the audio frame Intersection over Union (IoU) metric. Furthermore, we observed that including frame-level rationales significantly enhances hate speech detection accuracy for the E2E approach. Disclaimer: The reader may encounter content of an offensive or hateful nature. However, given the nature of the work, this cannot be avoided.
摘要：对仇恨言论的研究主要围绕文本输入的检测和解释展开，口头内容在很大程度上尚未得到探索。虽然已有少量工作探索了语音输入中的仇恨言论检测，但可解释性方面一直被忽视。因此，我们提出了一个新任务：可解释的音频仇恨言论检测。具体地说，我们的目标是识别出精确的时间区间，称为音频帧级依据(rationale)，作为仇恨言论分类的证据。为此，我们提出了两种不同的方法：级联和端到端(E2E)。级联方法首先将音频转换成转录文本，识别这些文本中的仇恨言论，然后定位相应的音频时间帧。相比之下，E2E方法直接处理音频话语，这使它能够在特定时间帧内精确定位仇恨言论。此外，由于缺乏包含音频帧级依据的可解释音频仇恨言论数据集，我们构建了一个合成音频数据集来训练我们的模型。我们在真实的人类语音上进一步验证了这些模型，发现在音频帧交并比(IoU)指标上，E2E方法优于级联方法。此外，我们观察到，加入帧级依据能显著提高E2E方法的仇恨言论检测准确率。免责声明：读者可能会遇到冒犯性或仇恨性质的内容。然而，鉴于这项工作的性质，这是无法避免的。
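The audio frame Intersection over Union (IoU) metric named in this abstract can be illustrated with a short sketch. It assumes that predicted and reference rationales are given as collections of frame indices; the paper's exact frame size and aggregation are not stated in the abstract, so `frame_iou` and the 100-frame example are hypothetical.

```python
def frame_iou(pred_frames, gold_frames):
    """Intersection over Union of two collections of audio frame indices."""
    pred, gold = set(pred_frames), set(gold_frames)
    union = pred | gold
    if not union:
        # Assumption: two empty rationales count as no overlap.
        return 0.0
    return len(pred & gold) / len(union)


# Hypothetical example: a prediction covering frames 50..149 against a
# reference covering frames 100..199.
example_iou = frame_iou(range(50, 150), range(100, 200))
```

The score is 1.0 only when the predicted frames match the reference exactly, so it penalizes rationales that are too long as well as ones that are too short.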

[NLP-22] On Tables with Numbers, with Numbers
[NLP-22] 用数字谈带数字的表格

链接: https://arxiv.org/abs/2408.06062
作者: Konstantinos Kogkalidis,Stergios Chatzikyriakidis
关键词-EN: contemporary computational linguistics, critical reflection, culture of contemporary, growing obsession, tables with numbers
关键词-ZN: 当代计算语言学、批判性反思、当代文化、日益增长的痴迷、数字表格
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper is a critical reflection on the epistemic culture of contemporary computational linguistics, framed in the context of its growing obsession with tables with numbers. We argue against tables with numbers on the basis of their epistemic irrelevance, their environmental impact, their role in enabling and exacerbating social inequalities, and their deep ties to commercial applications and profit-driven research. We substantiate our arguments with empirical evidence drawn from a meta-analysis of computational linguistics research over the last decade.
摘要：本文对当代计算语言学的认识文化进行批判性反思，并将其置于该领域日益痴迷于“带数字的表格”这一背景之下。我们反对带数字的表格，理由是它们在认识上无关紧要、对环境有影响、助长并加剧社会不平等，以及与商业应用和利润驱动的研究关系密切。我们用对过去十年计算语言学研究的元分析所得的经验证据来支持我们的论点。

[NLP-23] DiagESC: Dialogue Synthesis for Integrating Depression Diagnosis into Emotional Support Conversation SIGDIAL2024
[NLP-23] DiagESC：将抑郁症诊断整合到情感支持对话中的对话合成

链接: https://arxiv.org/abs/2408.06044
作者: Seungyeon Seo,Gary Geunbae Lee
关键词-EN: experiencing mental distress, health care aim, individuals experiencing mental, Diagnostic Emotional Support, Emotional Support Conversation
关键词-ZN: 经历精神困扰、医疗保健目标、经历精神疾病的个人、诊断情感支持、情感支持对话
类目: Computation and Language (cs.CL)
备注: Accepted by SIGDIAL 2024

点击查看摘要

Abstract:Dialogue systems for mental health care aim to provide appropriate support to individuals experiencing mental distress. While extensive research has been conducted to deliver adequate emotional support, existing studies cannot identify individuals who require professional medical intervention and cannot offer suitable guidance. We introduce the Diagnostic Emotional Support Conversation task for an advanced mental health management system. We develop the DESC dataset to assess depression symptoms while maintaining user experience by utilizing task-specific utterance generation prompts and a strict filtering algorithm. Evaluations by professional psychological counselors indicate that DESC has a superior ability to diagnose depression than existing data. Additionally, conversational quality evaluation reveals that DESC maintains fluent, consistent, and coherent dialogues.
摘要：心理健康护理对话系统旨在为经历精神困扰的个人提供适当的支持。虽然已经进行了广泛的研究来提供足够的情感支持，但现有的研究无法识别需要专业医疗干预的个人，也无法提供适当的指导。我们为先进的心理健康管理系统引入了诊断情感支持对话任务。我们开发DESC数据集来评估抑郁症状，同时通过利用特定任务的话语生成提示和严格的过滤算法来保持用户体验。专业心理咨询师的评估表明，DESC诊断抑郁症的能力优于现有数据。此外，对话质量评估表明，DESC保持流畅、一致和连贯的对话。

[NLP-24] Enhancing Dialogue Speech Recognition with Robust Contextual Awareness via Noise Representation Learning SIGDIAL2024
[NLP-24] 通过噪音表示学习增强具有强大上下文感知的对话语音识别

链接: https://arxiv.org/abs/2408.06043
作者: Wonjun Lee,San Kim,Gary Geunbae Lee
关键词-EN: requiring accurate Automatic, accurate Automatic Speech, Recent dialogue systems, Automatic Speech Recognition, accurate Automatic
关键词-ZN: 需要准确的自动、准确的自动语音、最近的对话系统、自动语音识别、准确的自动
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 11 pages, 2 figures, Accepted to SIGDIAL2024

点击查看摘要

Abstract:Recent dialogue systems rely on turn-based spoken interactions, requiring accurate Automatic Speech Recognition (ASR). Errors in ASR can significantly impact downstream dialogue tasks. To address this, using dialogue context from user and agent interactions for transcribing subsequent utterances has been proposed. This method incorporates the transcription of the user’s speech and the agent’s response as model input, using the accumulated context generated by each turn. However, this context is susceptible to ASR errors because it is generated by the ASR model in an auto-regressive fashion. Such noisy context can further degrade the benefits of context input, resulting in suboptimal ASR performance. In this paper, we introduce Context Noise Representation Learning (CNRL) to enhance robustness against noisy context, ultimately improving dialogue speech recognition accuracy. To maximize the advantage of context awareness, our approach includes decoder pre-training using text-based dialogue data and noise representation learning for a context encoder. Based on the evaluation of speech dialogues, our method shows superior results compared to baselines. Furthermore, the strength of our approach is highlighted in noisy environments where user speech is barely audible due to real-world noise, relying on contextual information to transcribe the input accurately.
摘要：近期的对话系统依赖于基于话轮的口语交互，需要准确的自动语音识别(ASR)。ASR中的错误会显著影响下游对话任务。为了解决这个问题，已有工作提出利用用户与智能体交互的对话上下文来转录后续话语。该方法将用户语音的转录和智能体的回复作为模型输入，使用每一轮累积生成的上下文。然而，该上下文很容易受到ASR错误的影响，因为它是由ASR模型以自回归方式生成的。这种带噪上下文会进一步削弱上下文输入的益处，导致ASR性能欠佳。在本文中，我们引入了上下文噪声表示学习(CNRL)来增强对带噪上下文的鲁棒性，最终提高对话语音识别的准确率。为了最大限度地发挥上下文感知的优势，我们的方法包括使用基于文本的对话数据对解码器进行预训练，以及对上下文编码器进行噪声表示学习。在语音对话上的评估表明，我们的方法优于基线。此外，在用户语音因真实世界噪声而几乎听不清的嘈杂环境中，我们的方法的优势尤为突出，它能依靠上下文信息准确地转录输入。

[NLP-25] ARPA: A Novel Hybrid Model for Advancing Visual Word Disambiguation Using Large Language Models and Transformers
[NLP-25] ARPA：一种利用大型语言模型和Transformer推进视觉词义消歧的新型混合模型

链接: https://arxiv.org/abs/2408.06040
作者: Aristi Papastavrou,Maria Lymperaiou,Giorgos Stamou
关键词-EN: Visual Word Sense, Word Sense Disambiguation, natural language processing, rapidly evolving fields, visual word disambiguation
关键词-ZN: 视觉词义、词义消歧、自然语言处理、快速发展的领域、视觉词消歧
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In the rapidly evolving fields of natural language processing and computer vision, Visual Word Sense Disambiguation (VWSD) stands as a critical, yet challenging task. The quest for models that can seamlessly integrate and interpret multimodal data is more pressing than ever. Imagine a system that can understand language with the depth and nuance of human cognition, while simultaneously interpreting the rich visual context of the world around it. We present ARPA, an architecture that fuses the unparalleled contextual understanding of large language models with the advanced feature extraction capabilities of transformers, which then pass through a custom Graph Neural Network (GNN) layer to learn intricate relationships and subtle nuances within the data. This innovative architecture not only sets a new benchmark in visual word disambiguation but also introduces a versatile framework poised to transform how linguistic and visual data interact by harnessing the synergistic strengths of its components, ensuring robust performance even in the most complex disambiguation scenarios. Through a series of experiments and comparative analysis, we reveal the substantial advantages of our model, underscoring its potential to redefine standards in the field. Beyond its architectural prowess, our architecture excels through experimental enrichments, including sophisticated data augmentation and multi-modal training techniques. ARPA’s introduction marks a significant milestone in visual word disambiguation, offering a compelling solution that bridges the gap between linguistic and visual modalities. We invite researchers and practitioners to explore the capabilities of our model, envisioning a future where such hybrid models drive unprecedented advancements in artificial intelligence. 
摘要：在快速发展的自然语言处理和计算机视觉领域，视觉词义消歧(VWSD)是一项关键而又具有挑战性的任务。对能够无缝整合和解释多模态数据的模型的追求比以往任何时候都更加紧迫。设想这样一个系统：它能以人类认知的深度和细腻程度理解语言，同时解释周围世界丰富的视觉上下文。我们提出了ARPA，这种架构将大型语言模型无与伦比的上下文理解能力与Transformer的高级特征提取能力相融合，其输出再经过一个定制的图神经网络(GNN)层，以学习数据中的复杂关系和细微差别。这一创新架构不仅在视觉词义消歧方面树立了新的基准，还引入了一个通用框架，通过利用各组件的协同优势来改变语言数据与视觉数据的交互方式，确保即使在最复杂的消歧场景中也能实现稳健的性能。通过一系列实验和对比分析，我们揭示了我们模型的显著优势，凸显了它重新定义该领域标准的潜力。除了架构上的优势之外，我们的架构还得益于多项实验性增强，包括精细的数据增强和多模态训练技术。ARPA的推出是视觉词义消歧的一个重要里程碑，它提供了一个有说服力的解决方案，弥合了语言模态与视觉模态之间的差距。我们邀请研究人员和从业者探索我们模型的能力，展望这类混合模型推动人工智能取得前所未有进步的未来。

[NLP-26] Controlling Surprisal in Music Generation via Information Content Curve Matching
[NLP-26] 通过信息量曲线匹配控制音乐生成中的惊异度

链接: https://arxiv.org/abs/2408.06022
作者: Mathias Rose Bjare,Stefan Lattner,Gerhard Widmer
关键词-EN: music generation systems, IIC, recent years, encouraging research, Instantaneous Information Content
关键词-ZN: 音乐生成系统，IIC，近年来，鼓励研究，即时信息内容
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: 8 pages, 4 figures, 2 tables, accepted at the 25th Int. Society for Music Information Retrieval Conf., San Francisco, USA, 2024

点击查看摘要

Abstract:In recent years, the quality and public interest in music generation systems have grown, encouraging research into various ways to control these systems. We propose a novel method for controlling surprisal in music generation using sequence models. To achieve this goal, we define a metric called Instantaneous Information Content (IIC). The IIC serves as a proxy function for the perceived musical surprisal (as estimated from a probabilistic model) and can be calculated at any point within a music piece. This enables the comparison of surprisal across different musical content even if the musical events occur in irregular time intervals. We use beam search to generate musical material whose IIC curve closely approximates a given target IIC. We experimentally show that the IIC correlates with harmonic and rhythmic complexity and note density. The correlation decreases with the length of the musical context used for estimating the IIC. Finally, we conduct a qualitative user study to test if human listeners can identify the IIC curves that have been used as targets when generating the respective musical material. We provide code for creating IIC interpolations and IIC visualizations on this https URL.
摘要：近年来，音乐生成系统的质量和公众对它的兴趣都在增长，这推动了对控制这些系统的各种方法的研究。我们提出了一种利用序列模型控制音乐生成中惊异度(surprisal)的新方法。为了实现这一目标，我们定义了一个称为瞬时信息量(IIC)的指标。IIC作为感知到的音乐惊异度的代理函数(由概率模型估计)，可以在音乐作品中的任意时刻计算。这使得即使音乐事件以不规则的时间间隔发生，也能比较不同音乐内容之间的惊异度。我们使用束搜索来生成音乐素材，使其IIC曲线紧密逼近给定的目标IIC。我们通过实验表明，IIC与和声复杂度、节奏复杂度以及音符密度相关。这种相关性随着用于估计IIC的音乐上下文长度的增加而减小。最后，我们进行了一项定性的用户研究，以测试人类听众能否识别出生成相应音乐素材时用作目标的IIC曲线。我们在此https URL上提供了用于创建IIC插值和IIC可视化的代码。
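The surprisal that this abstract builds on is the standard information content of an event, the negative log probability a model assigns to it. The sketch below shows only that per-event quantity and a simple distance to a target curve; it does not reproduce the paper's Instantaneous Information Content, whose time-dependent definition is not given in the abstract. The function names and the mean-absolute-error comparison are assumptions.

```python
import math


def information_content(prob):
    """Surprisal in bits of an event with model probability `prob`."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(prob)


def ic_curve(probs):
    """Per-event information content for a sequence of probabilities."""
    return [information_content(p) for p in probs]


def curve_error(curve, target):
    """Mean absolute difference between two equally long curves.

    An assumed score for how closely a generated sequence follows a
    target curve; a search procedure could prefer candidates with a
    lower value.
    """
    if len(curve) != len(target):
        raise ValueError("curves must have the same length")
    return sum(abs(a - b) for a, b in zip(curve, target)) / len(curve)
```

An event the model considers certain carries no surprisal, and each halving of its probability adds one bit, which is why the curve rises at unexpected notes.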

[NLP-27] The Language of Trauma: Modeling Traumatic Event Descriptions Across Domains with Explainable AI
[NLP-27] 创伤语言：用可解释的人工智能建模跨领域的创伤事件描述

链接: https://arxiv.org/abs/2408.05977
作者: Miriam Schirmer,Tobias Leemann,Gjergji Kasneci,Jürgen Pfeffer,David Jurgens
关键词-EN: diverse online contexts, online contexts, Psychological trauma, Psychological, distressing events
关键词-ZN: 不同的在线背景、在线背景、心理创伤、心理、痛苦事件
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Psychological trauma can manifest following various distressing events and is captured in diverse online contexts. However, studies traditionally focus on a single aspect of trauma, often neglecting the transferability of findings across different scenarios. We address this gap by training language models with progressing complexity on trauma-related datasets, including genocide-related court data, a Reddit dataset on post-traumatic stress disorder (PTSD), counseling conversations, and Incel forum posts. Our results show that the fine-tuned RoBERTa model excels in predicting traumatic events across domains, slightly outperforming large language models like GPT-4. Additionally, SLALOM-feature scores and conceptual explanations effectively differentiate and cluster trauma-related language, highlighting different trauma aspects and identifying sexual abuse and experiences related to death as a common traumatic event across all datasets. This transferability is crucial as it allows for the development of tools to enhance trauma detection and intervention in diverse populations and settings.
摘要：心理创伤可能在各种痛苦事件之后显现，并在多种在线语境中被记录下来。然而，传统研究往往只关注创伤的某一方面，常常忽视研究结果在不同场景之间的可迁移性。我们通过在创伤相关数据集(包括与种族灭绝相关的法庭数据、关于创伤后应激障碍(PTSD)的Reddit数据集、咨询对话和Incel论坛帖子)上训练复杂度逐步递增的语言模型来弥补这一空白。我们的结果表明，微调后的RoBERTa模型在跨领域预测创伤事件方面表现出色，略优于GPT-4等大型语言模型。此外，SLALOM特征得分和概念解释能有效地区分和聚类与创伤相关的语言，突出了不同的创伤侧面，并将性虐待和与死亡相关的经历识别为所有数据集中常见的创伤事件。这种可迁移性至关重要，因为它使人们能够开发工具，在不同人群和环境中加强创伤检测和干预。

[NLP-28] ConvKGYarn: Spinning Configurable and Scalable Conversational Knowledge Graph QA datasets with Large Language Models
[NLP-28] ConvKGYarn：使用大型语言模型构建可配置和可扩展的对话知识图谱QA数据集

链接: https://arxiv.org/abs/2408.05948
作者: Ronak Pradeep,Daniel Lee,Ali Mousavi,Jeff Pound,Yisi Sang,Jimmy Lin,Ihab Ilyas,Saloni Potdar,Mostafa Arefiyan,Yunyao Li
关键词-EN: Large Language Models, Large Language, advancement of Large, assistants necessitates dynamic, Language Models
关键词-ZN: 大型语言模型，大型语言，大型的进步，助手需要动态，语言模型
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The rapid advancement of Large Language Models (LLMs) and conversational assistants necessitates dynamic, scalable, and configurable conversational datasets for training and evaluation. These datasets must accommodate diverse user interaction modes, including text and voice, each presenting unique modeling challenges. Knowledge Graphs (KGs), with their structured and evolving nature, offer an ideal foundation for current and precise knowledge. Although human-curated KG-based conversational datasets exist, they struggle to keep pace with the rapidly changing user information needs. We present ConvKGYarn, a scalable method for generating up-to-date and configurable conversational KGQA datasets. Qualitative psychometric analyses confirm our method can generate high-quality datasets rivaling a popular conversational KGQA dataset while offering it at scale and covering a wide range of human-interaction configurations. We showcase its utility by testing LLMs on diverse conversations - exploring model behavior on conversational KGQA sets with different configurations grounded in the same KG fact set. Our results highlight the ability of ConvKGYarn to improve KGQA foundations and evaluate parametric knowledge of LLMs, thus offering a robust solution to the constantly evolving landscape of conversational assistants.
摘要：大型语言模型(LLM)和对话助手的快速发展，要求有动态、可扩展且可配置的对话数据集用于训练和评估。这些数据集必须适应多样的用户交互模式，包括文本和语音，每种模式都带来独特的建模挑战。知识图谱(KG)具有结构化和不断演化的特性，为获取最新而精确的知识提供了理想的基础。尽管存在人工整理的基于KG的对话数据集，但它们难以跟上快速变化的用户信息需求。我们提出了ConvKGYarn，一种用于生成最新且可配置的对话式KGQA数据集的可扩展方法。定性的心理测量分析证实，我们的方法可以生成可与某流行对话式KGQA数据集相媲美的高质量数据集，同时能够大规模生成，并覆盖广泛的人机交互配置。我们通过在多样的对话上测试LLM来展示其实用性，即在基于同一KG事实集、配置不同的对话式KGQA数据集上探索模型行为。我们的结果凸显了ConvKGYarn改进KGQA基础并评估LLM参数化知识的能力，从而为不断演变的对话助手领域提供了稳健的解决方案。

[NLP-29] A New Pipeline For Generating Instruction Dataset via RAG and Self Fine-Tuning
[NLP-29] 通过RAG和自微调生成指令数据集的新流水线

链接: https://arxiv.org/abs/2408.05911
作者: Chih-Wei Song,Yu-Kai Lee,Yin-Te Tsai
关键词-EN: large language models, recent years, enterprises and organizations, specialized Agents rely, rapid development
关键词-ZN: 大型语言模型，近年来，企业和组织，专业化代理依赖，快速发展
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 5 pages, SCA 2024: The 7th IEEE International Workshop on Smart Computing Applications

点击查看摘要

Abstract:With the rapid development of large language models in recent years, there has been an increasing demand for domain-specific Agents that can cater to the unique needs of enterprises and organizations. Unlike general models, which strive for broad coverage, these specialized Agents rely on focused datasets tailored to their intended applications. This research proposes a pipeline that leverages the power of LLMs and the Retrieval-Augmented Generation related framework to construct high-quality instruction datasets for fine-tuning on specific domains using custom document collections. By ingesting domain-specific documents, the pipeline generates relevant and contextually appropriate instructions, thus effectively creating a comprehensive dataset for fine-tuning LLMs on the target domain. This approach overcomes the limitations of traditional dataset creation methods, which often rely on manual curation or web-scraping techniques that may introduce noise and irrelevant data. Notably, our pipeline offers a dynamic solution that can quickly adapt to updates or modifications in the domain-specific document collection, eliminating the need for complete retraining. Additionally, it addresses the challenge of data scarcity by enabling the generation of instruction datasets from a limited set of initial documents, rendering it suitable for unpopular or specialized domains where comprehensive datasets are scarce. As a case study, we apply this approach to the domain of psychiatry, a field requiring specialized knowledge and sensitive handling of patient information. The resulting fine-tuned LLM demonstrates showcases the viability of the proposed approach and underscores its potential for widespread adoption across various industries and domains where tailored, accurate, and contextually relevant language models are indispensable.

[NLP-30] AdTEC: A Unified Benchmark for Evaluating Text Quality in Search Engine Advertising

Link: https://arxiv.org/abs/2408.05906
Authors: Peinan Zhang, Yusuke Sakai, Masato Mita, Hiroki Ouchi, Taro Watanabe
Keywords: texts automatically created, language generation technology, natural language generation, generation technology, real-world setting
Categories: Computation and Language (cs.CL)
Comments:

Abstract:With the increase in fluent ad texts automatically created by natural language generation technology, there is high demand to verify the quality of these creatives in a real-world setting. We propose AdTEC, the first public benchmark to evaluate ad texts in multiple aspects from the perspective of practical advertising operations. Our contributions are: (i) Defining five tasks for evaluating the quality of ad texts and building a dataset based on the actual operational experience of advertising agencies, which is typically kept in-house. (ii) Validating the performance of existing pre-trained language models (PLMs) and human evaluators on the dataset. (iii) Analyzing the characteristics of the benchmark and presenting its challenges. The results show that while PLMs have already reached a practical usage level in several tasks, humans still outperform them in certain domains, implying that there is significant room for improvement in those areas.

[NLP-31] GlyphPattern: An Abstract Pattern Recognition for Vision-Language Models

Link: https://arxiv.org/abs/2408.05894
Authors: Zixuan Wu, Yoolim Kim, Carolyn Jane Anderson
Keywords: made rapid progress, powerful large language, abstract pattern recognition, textual data, foundation of powerful
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract:Vision-Language Models (VLMs) building upon the foundation of powerful large language models have made rapid progress in reasoning across visual and textual data. While VLMs perform well on vision tasks that they are trained on, our results highlight key challenges in abstract pattern recognition. We present GlyphPattern, a 954 item dataset that pairs 318 human-written descriptions of visual patterns from 40 writing systems with three visual presentation styles. GlyphPattern evaluates abstract pattern recognition in VLMs, requiring models to understand and judge natural language descriptions of visual patterns. GlyphPattern patterns are drawn from a large-scale cognitive science investigation of human writing systems; as a result, they are rich in spatial reference and compositionality. Our experiments show that GlyphPattern is challenging for state-of-the-art VLMs (GPT-4o achieves only 55% accuracy), with marginal gains from few-shot prompting. Our detailed error analysis reveals challenges at multiple levels, including visual processing, natural language understanding, and pattern generalization.

[NLP-32] Creating Arabic LLM Prompts at Scale

Link: https://arxiv.org/abs/2408.05882
Authors: Abdelrahman El-Sheikh, Ahmed Elmogtaba, Kareem Darwish, Muhammad Elmallah, Ashraf Elneima, Hassan Sawaf
Keywords: chatGPT and BARD, BARD has popularized, debut of chatGPT, answers that matches, BARD
Categories: Computation and Language (cs.CL)
Comments:

Abstract:The debut of ChatGPT and Bard has popularized instruction-following text generation using LLMs, where a user can interrogate an LLM using natural language requests and obtain natural language answers that match their requests. Training LLMs to respond in this manner requires a large number of worked-out examples of user requests (aka prompts) with corresponding gold responses. In this paper, we introduce two methods for creating such prompts for Arabic cheaply and quickly. The first method entails automatically translating existing prompt datasets from English, such as PromptSource and Super-NaturalInstructions, and then using machine translation quality estimation to retain only high-quality translations. The second method involves creating natural language prompts on top of existing Arabic NLP datasets. Using these two methods we were able to create more than 67.4 million Arabic prompts that cover a variety of tasks including summarization, headline generation, grammar checking, open/closed question answering, creative writing, etc. We show that fine-tuning an open 7-billion-parameter large language model, namely base Qwen2 7B, enables it to outperform a state-of-the-art 70-billion-parameter instruction-tuned model, namely Llama3 70B, in handling Arabic prompts.

[NLP-33] LLM-Based Robust Product Classification in Commerce and Compliance

Link: https://arxiv.org/abs/2408.05874
Authors: Sina Gholamian, Gianfranco Romani, Bartosz Rudnikowicz, Laura Skylaki
Keywords: Product classification, crucial task, compliance regulations, regulations are verified, verified and taxes
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 11 pages

Abstract:Product classification is a crucial task in international trade, as compliance regulations are verified and taxes and duties are applied based on product categories. Manual classification of products is time-consuming and error-prone, and the sheer volume of products imported and exported renders the manual process infeasible. Consequently, e-commerce platforms and enterprises involved in international trade have turned to automatic product classification using machine learning. However, current approaches do not consider the real-world challenges associated with product classification, such as heavily abbreviated and incomplete product descriptions. In addition, recent advancements in generative Large Language Models (LLMs) and their reasoning capabilities remain largely untapped in product classification and e-commerce. In this research, we explore the real-life challenges of industrial classification and we propose data perturbations that allow for realistic data simulation. Furthermore, we employ LLM-based product classification to improve the robustness of the prediction in the presence of incomplete data. Our research shows that LLMs with in-context learning outperform the supervised approaches in the clean-data scenario. Additionally, we illustrate that LLMs are significantly more robust than the supervised approaches when data attacks are present.

[NLP-34] Defining Boundaries: A Spectrum of Task Feasibility for Large Language Models

Link: https://arxiv.org/abs/2408.05873
Authors: Wenbo Zhang, Zihang Xu, Hengrui Cai
Keywords: Large language models, shown remarkable performance, Large language, language models, leading to incorrect
Categories: Computation and Language (cs.CL)
Comments: 20 pages, 9 tables, 15 Figures

Abstract:Large language models (LLMs) have shown remarkable performance in various tasks but often fail to handle queries that exceed their knowledge and capabilities, leading to incorrect or fabricated responses. This paper addresses the need for LLMs to recognize and refuse infeasible tasks due to the required skills surpassing their capabilities. We first systematically conceptualize infeasible tasks for LLMs, providing formal definitions and categorizations that cover a spectrum of related hallucinations. We develop and benchmark a new dataset comprising diverse infeasible and feasible tasks to test multiple LLMs’ abilities on task feasibility. Furthermore, we explore the potential of training enhancements to increase LLMs’ refusal capabilities with fine-tuning. Experiments validate the effectiveness of our methods, offering promising directions for refining the operational boundaries of LLMs in real applications.

[NLP-35] Iterative Improvement of an Additively Regularized Topic Model

Link: https://arxiv.org/abs/2408.05840
Authors: Alex Gorbulev, Vasiliy Alekseev, Konstantin Vorontsov
Keywords: soft clustering problem, topic model, Topic, clustering problem, unknown clusters
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Probability (math.PR)
Comments: A full draft of the second version of the article

Abstract:Topic modelling is fundamentally a soft clustering problem (of known objects – documents, over unknown clusters – topics). That is, the task is ill-posed. In particular, the topic models are unstable and incomplete. As a result, the process of finding a good topic model (repeated hyperparameter selection, model training, and topic quality assessment) can be particularly long and labor-intensive. We aim to simplify the process, to make it more deterministic and provable. To this end, we present a method for iterative training of a topic model. The essence of the method is that a series of related topic models are trained so that each subsequent model is at least as good as the previous one, i.e., that it retains all the good topics found earlier. The connection between the models is achieved by additive regularization. The result of this iterative training is the last topic model in the series, which we call the iteratively updated additively regularized topic model (ITAR). Experiments conducted on several collections of natural language texts show that the proposed ITAR model performs better than other popular topic models (LDA, ARTM, BERTopic), its topics are diverse, and its perplexity (ability to “explain” the underlying data) is moderate.

[NLP-36] SAGA: A Participant-specific Examination of Story Alternatives and Goal Applicability for a Deeper Understanding of Complex Events

Link: https://arxiv.org/abs/2408.05793
Authors: Sai Vallurupalli, Katrin Erk, Francis Ferraro
Keywords: Interpreting and assessing, assessing goal driven, participant achievement lens, goal driven actions, Interpreting
Categories: Computation and Language (cs.CL)
Comments: Accepted to Findings of the Association for Computational Linguistics 2024

Abstract:Interpreting and assessing goal driven actions is vital to understanding and reasoning over complex events. It is important to be able to acquire the knowledge needed for this understanding, though doing so is challenging. We argue that such knowledge can be elicited through a participant achievement lens. We analyze a complex event in a narrative according to the intended achievements of the participants in that narrative, the likely future actions of the participants, and the likelihood of goal success. We collect 6.3K high quality goal and action annotations reflecting our proposed participant achievement lens, with an average weighted Fleiss-Kappa IAA of 80%. Our collection contains annotated alternate versions of each narrative. These alternate versions vary minimally from the “original” story, but can license drastically different inferences. Our findings suggest that while modern large language models can reflect some of the goal-based knowledge we study, they find it challenging to fully capture the design and intent behind concerted actions, even when the model pretraining included the data from which we extracted the goal knowledge. We show that smaller models fine-tuned on our dataset can achieve performance surpassing larger models.
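The agreement figure quoted in this abstract is a Fleiss' kappa score. As a rough illustration of how that statistic is computed, here is a minimal sketch of the standard unweighted textbook formula. It is not the weighted variant the abstract mentions and not the authors' own script, and the rating tables in the usage lines are made up.

```python
def fleiss_kappa(counts):
    """Unweighted Fleiss' kappa.

    counts[i][j] is the number of raters who assigned item i to category j.
    Every item must be rated by the same number of raters (at least two).
    The statistic is undefined when all ratings fall into a single category.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])

    # Observed agreement: for each item, the share of rater pairs that agree.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(per_item) / n_items

    # Chance agreement from the overall category proportions.
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    expected = sum(p * p for p in proportions)

    return (observed - expected) / (1 - expected)


# Hypothetical data: 3 raters, 2 categories.
print(fleiss_kappa([[3, 0], [0, 3]]))                   # full agreement
print(fleiss_kappa([[3, 0], [0, 3], [2, 1], [1, 2]]))   # partial agreement
```

A value of 1 means the raters always agree, and a value near 0 means agreement is no better than chance.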

[NLP-37] HiLight: A Hierarchy-aware Light Global Model with Hierarchical Local ConTrastive Learning

Link: https://arxiv.org/abs/2408.05786
Authors: Zhijian Chen, Zhonghua Li, Jianxin Yang, Ye Qi
Keywords: multi-label classification head, multi-label classification, Hierarchical text classification, Hierarchical local conTrastive, structure encoder
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Hierarchical text classification (HTC) is a special sub-task of multi-label classification (MLC) whose taxonomy is constructed as a tree and in which each sample is assigned at least one path in the tree. The latest HTC models contain three modules: a text encoder, a structure encoder and a multi-label classification head. Specifically, the structure encoder is designed to encode the hierarchy of the taxonomy. However, the structure encoder has a scaling problem: as the taxonomy size increases, the learnable parameters of recent HTC works grow rapidly. Recursive regularization is another widely used method to introduce hierarchical information, but it suffers from a collapse problem and is generally relaxed by assigning it a small weight (e.g., 1e-6). In this paper, we propose a Hierarchy-aware Light Global model with Hierarchical local conTrastive learning (HiLight), a lightweight and efficient global model consisting only of a text encoder and a multi-label classification head. We propose a new learning task to introduce the hierarchical information, called Hierarchical Local Contrastive Learning (HiLCL). Extensive experiments are conducted on two benchmark datasets to demonstrate the effectiveness of our model.

[NLP-38] LI-TTA: Language Informed Test-Time Adaptation for Automatic Speech Recognition INTERSPEECH2024

Link: https://arxiv.org/abs/2408.05769
Authors: Eunseop Yoon, Hee Suk Yoon, John Harvill, Mark Hasegawa-Johnson, Chang D. Yoo
Keywords: target environment diverges, original training environment, Automatic Speech Recognition, domain shift challenge, Informed Test-Time Adaptation
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: INTERSPEECH 2024

Abstract:Test-Time Adaptation (TTA) has emerged as a crucial solution to the domain shift challenge, wherein the target environment diverges from the original training environment. A prime exemplification is TTA for Automatic Speech Recognition (ASR), which enhances model performance by leveraging output prediction entropy minimization as a self-supervision signal. However, a key limitation of this self-supervision lies in its primary focus on acoustic features, with minimal attention to the linguistic properties of the input. To address this gap, we propose Language Informed Test-Time Adaptation (LI-TTA), which incorporates linguistic insights during TTA for ASR. LI-TTA integrates corrections from an external language model to merge linguistic with acoustic information by minimizing the CTC loss from the correction alongside the standard TTA loss. With extensive experiments, we show that LI-TTA effectively improves the performance of TTA for ASR in various distribution shift situations.
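The self-supervision signal mentioned in this abstract, output prediction entropy, can be illustrated with a small sketch. The function below only computes the quantity that entropy-minimization TTA would try to reduce. It is an illustration under simple assumptions (plain Python lists of per-frame logits, natural-log entropy), not the LI-TTA implementation, and the example logits are made up.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one frame's logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_prediction_entropy(frame_logits):
    """Average Shannon entropy (in nats) of the per-frame output distributions.

    A lower value means the model is more confident about each frame.
    """
    entropies = []
    for logits in frame_logits:
        probs = softmax(logits)
        entropies.append(-sum(p * math.log(p) for p in probs if p > 0.0))
    return sum(entropies) / len(entropies)


# Hypothetical logits: one uncertain frame and one confident frame.
print(mean_prediction_entropy([[0.0, 0.0], [8.0, 0.0]]))
```

In an actual adaptation loop this value would serve as a loss to minimize on unlabeled test audio. According to the abstract, LI-TTA minimizes a CTC loss derived from language-model corrections alongside that standard loss.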

[NLP-39] Reference-free Hallucination Detection for Large Vision-Language Models

Link: https://arxiv.org/abs/2408.05767
Authors: Qing Li, Chenyang Lyu, Jiahui Geng, Derui Zhu, Maxim Panov, Fakhri Karray
Keywords: Large vision-language models, made significant progress, Large vision-language, vision-language models, recent years
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large vision-language models (LVLMs) have made significant progress in recent years. While LVLMs exhibit excellent ability in language understanding, question answering, and conversations of visual inputs, they are prone to producing hallucinations. While several methods are proposed to evaluate the hallucinations in LVLMs, most are reference-based and depend on external tools, which complicates their practical application. To assess the viability of alternative methods, it is critical to understand whether the reference-free approaches, which do not rely on any external tools, can efficiently detect hallucinations. Therefore, we initiate an exploratory study to demonstrate the effectiveness of different reference-free solutions in detecting hallucinations in LVLMs. In particular, we conduct an extensive study on three kinds of techniques: uncertainty-based, consistency-based, and supervised uncertainty quantification methods on four representative LVLMs across two different tasks. The empirical results show that the reference-free approaches are capable of effectively detecting non-factual responses in LVLMs, with the supervised uncertainty quantification method outperforming the others, achieving the best performance across different settings.

[NLP-40] Language-Informed Beam Search Decoding for Multilingual Machine Translation ACL2024

Link: https://arxiv.org/abs/2408.05738
Authors: Yilin Yang, Stefan Lee, Prasad Tadepalli
Keywords: auto-regressive Neural Machine, Neural Machine Translation, Neural Machine, decoding auto-regressive Neural, including multilingual NMT
Categories: Computation and Language (cs.CL)
Comments: ACL 2024 Findings

Abstract:Beam search decoding is the de-facto method for decoding auto-regressive Neural Machine Translation (NMT) models, including multilingual NMT where the target language is specified as an input. However, decoding multilingual NMT models commonly produces “off-target” translations – yielding translation outputs not in the intended language. In this paper, we first conduct an error analysis of off-target translations for a strong multilingual NMT model and identify how these decodings are produced during beam search. We then propose Language-informed Beam Search (LiBS), a general decoding algorithm incorporating an off-the-shelf Language Identification (LiD) model into beam search decoding to reduce off-target translations. LiBS is an inference-time procedure that is NMT-model agnostic and does not require any additional parallel data. Results show that our proposed LiBS algorithm improves BLEU by +1.1 and +0.9 on average on the WMT and OPUS datasets, and reduces off-target rates from 22.9% to 7.7% and from 65.8% to 25.3%, respectively.
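To make the idea of folding a language-identification score into decoding more concrete, here is a minimal sketch that re-ranks finished candidates by adding a weighted log LID probability to the translation log-probability. The abstract does not specify how LiBS combines the two scores, so the additive formula, the `weight` parameter, and the word-list `toy_lid` function are all assumptions made for illustration only.

```python
import math


def rerank_with_lid(candidates, lid_prob, target_lang, weight=1.0):
    """Re-rank (text, nmt_logprob) pairs using a language-ID probability.

    lid_prob(text, lang) is assumed to return a probability in [0, 1].
    The combined score is nmt_logprob + weight * log(lid_prob).
    """
    scored = []
    for text, nmt_logprob in candidates:
        prob = max(lid_prob(text, target_lang), 1e-12)  # avoid log(0)
        scored.append((nmt_logprob + weight * math.log(prob), text))
    scored.sort(reverse=True)
    return [text for _, text in scored]


def toy_lid(text, lang):
    """Hypothetical stand-in for a LID model: share of words in a tiny German list."""
    german_words = {"das", "ist", "ein", "haus"}
    words = text.lower().split()
    share = sum(w in german_words for w in words) / len(words) if words else 0.0
    return share if lang == "de" else 1.0 - share


# The English candidate has the better translation score but is off-target.
beam = [("this is a house", -1.0), ("das ist ein haus", -1.2)]
print(rerank_with_lid(beam, toy_lid, "de"))
```

With the toy scorer, the German candidate moves to the top once the language penalty is applied. A real system would plug in an off-the-shelf LID model and apply the score during the search rather than only at the end.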

[NLP-41] Training an NLP Scholar at a Small Liberal Arts College: A Backwards Designed Course Proposal

Link: https://arxiv.org/abs/2408.05664
Authors: Grusha Prasad, Forrest Davis
Keywords: natural language processing, generated student interest, NLP, language processing, NLP scholar
Categories: Computation and Language (cs.CL)
Comments: 9 pages, Presented at 6th Workshop on Teaching NLP

Abstract:The rapid growth in natural language processing (NLP) over the last couple of years has generated student interest and excitement in learning more about the field. In this paper, we present two types of students that NLP courses might want to train. First, an “NLP engineer” who is able to flexibly design, build and apply new technologies in NLP for a wide range of tasks. Second, an “NLP scholar” who is able to pose, refine and answer questions in NLP and how it relates to society, while also learning to effectively communicate these answers to a broader audience. While these two types of skills are not mutually exclusive – NLP engineers should be able to think critically, and NLP scholars should be able to build systems – we think that courses can differ in the balance of these skills. As educators at Small Liberal Arts Colleges, we find that the strengths of our students and our institution favor an approach that is better suited to training NLP scholars. In this paper we articulate what kinds of skills an NLP scholar should have, and then adopt a backwards design to propose course components that can aid the acquisition of these skills.

[NLP-42] WiDe-analysis: Enabling One-click Content Moderation Analysis on Wikipedia's Articles for Deletion

Link: https://arxiv.org/abs/2408.05655
Authors: Hsuvas Borkakoty, Luis Espinosa-Anke
Keywords: platforms grow, existing policies, online platforms, crucial for ensuring, ensuring activity
Categories: Computation and Language (cs.CL)
Comments: System Demonstration

Abstract:Content moderation in online platforms is crucial for ensuring activity therein adheres to existing policies, especially as these platforms grow. NLP research in this area has typically focused on automating some part of it given that it is not feasible to monitor all active discussions effectively. Past works have focused on revealing deletion patterns with techniques like sentiment analysis, or on developing platform-specific models such as Wikipedia policy or stance detectors. Unsurprisingly, however, this valuable body of work is rather scattered, with little to no agreement with regard to, e.g., the deletion discussion corpora used for training or the number of stance labels. Moreover, while efforts have been made to connect stance with rationales (e.g., to ground a deletion decision on the relevant policy), there is little explainability work beyond that. In this paper, we introduce a suite of experiments on Wikipedia deletion discussions and wide-analysis (Wikipedia Deletion Analysis), a Python package aimed at providing one-click analysis of content moderation discussions. We release all assets associated with wide-analysis, including data, models and the Python package, and a HuggingFace space with the goal of accelerating research on automating content moderation in Wikipedia and beyond.

[NLP-43] Speculative Diffusion Decoding: Accelerating Language Generation through Diffusion

Link: https://arxiv.org/abs/2408.05636
Authors: Jacob K Christopher, Brian R Bartoldson, Bhavya Kailkhura, Ferdinando Fioretto
Keywords: widely adopted method, accelerate large language, Speculative decoding, widely adopted, adopted method
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Speculative decoding has emerged as a widely adopted method to accelerate large language model inference without sacrificing the quality of the model outputs. While this technique has facilitated notable speed improvements by enabling parallel sequence verification, its efficiency remains inherently limited by the reliance on incremental token generation in existing draft models. To overcome this limitation, this paper proposes an adaptation of speculative decoding which uses discrete diffusion models to generate draft sequences. This allows parallelization of both the drafting and verification steps, providing significant speed-ups to the inference process. Our proposed approach, Speculative Diffusion Decoding (SpecDiff), is validated on standard language generation benchmarks and empirically demonstrated to provide up to 8.7x speed-up over standard generation processes and up to 2.5x speed-up over existing speculative decoding approaches.
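For readers unfamiliar with the draft-then-verify loop that speculative decoding builds on, the following sketch shows one step of a simplified greedy variant: a cheap draft model proposes several tokens, and the target model keeps the longest prefix it agrees with. This is neither SpecDiff nor the usual rejection-sampling scheme. The `draft_propose` and `target_next` interfaces and the toy integer "models" are hypothetical, and the sequential loop only simulates what a real system would verify in one parallel pass.

```python
def speculative_step(prefix, draft_propose, target_next, k):
    """Run one draft-then-verify step and return the tokens to append.

    draft_propose(prefix, k) -> list of k draft tokens (hypothetical interface).
    target_next(prefix) -> the target model's greedy next token.
    """
    proposal = draft_propose(prefix, k)
    accepted = []
    for token in proposal:
        expected = target_next(prefix + accepted)
        if token != expected:
            # First disagreement: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(token)
    # Every draft token matched, so the target adds one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted


# Toy models over integer tokens: the target counts upward modulo 10,
# and the draft is right for two tokens and then repeats itself.
def toy_target(prefix):
    return (prefix[-1] + 1) % 10


def toy_draft(prefix, k):
    tokens, current = [], prefix[-1]
    for i in range(k):
        if i < 2:
            current = (current + 1) % 10
        tokens.append(current)
    return tokens


print(speculative_step([0], toy_draft, toy_target, k=4))  # [1, 2, 3]
```

The output always matches what the target model would have produced on its own, so the speed-up depends on how many draft tokens are accepted per step. The abstract's point is that drafting itself is usually sequential, which is the part SpecDiff replaces with a discrete diffusion model.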

[NLP-44] Metacognitive Myopia in Large Language Models

Link: https://arxiv.org/abs/2408.05568
Authors: Florian Scholten, Tobias R. Rebholz, Mandy Hütter
Keywords: Large Language Models, cloud moral judgments, Large Language, exhibit potentially harmful, culturally inherent stereotypes
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Applications (stat.AP)
Comments:

Abstract:Large Language Models (LLMs) exhibit potentially harmful biases that reinforce culturally inherent stereotypes, cloud moral judgments, or amplify positive evaluations of majority groups. Previous explanations mainly attributed bias in LLMs to human annotators and the selection of training data. Consequently, they have typically been addressed with bottom-up approaches such as reinforcement learning or debiasing corpora. However, these methods only treat the effects of LLM biases by indirectly influencing the model architecture, but do not address the underlying causes in the computational process. Here, we propose metacognitive myopia as a cognitive-ecological framework that can account for a conglomerate of established and emerging LLM biases and provide a lever to address problems in powerful but vulnerable tools. Our theoretical framework posits that a lack of the two components of metacognition, monitoring and control, causes five symptoms of metacognitive myopia in LLMs: integration of invalid tokens and embeddings, susceptibility to redundant information, neglect of base rates in conditional computation, decision rules based on frequency, and inappropriate higher-order statistical inference for nested data structures. As a result, LLMs produce erroneous output that reaches into the daily high-stakes decisions of humans. By introducing metacognitive regulatory processes into LLMs, engineers and scientists can develop precise remedies for the underlying causes of these biases. Our theory sheds new light on flawed human-machine interactions and raises ethical concerns regarding the increasing, imprudent implementation of LLMs in organizational structures.

[NLP-45] Document-Level Event Extraction with Definition-Driven ICL

Link: https://arxiv.org/abs/2408.05566
Authors: Zhuoyuan Liu, Yilin Luo
Keywords: Natural Language Processing, Large Language Models, document-level event extraction, Language Processing, Large Language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Information Retrieval (cs.IR)
Comments:

Abstract:In the field of Natural Language Processing (NLP), Large Language Models (LLMs) have shown great potential in document-level event extraction tasks, but existing methods face challenges in the design of prompts. To address this issue, we propose an optimization strategy called “Definition-driven Document-level Event Extraction (DDEE).” By adjusting the length of the prompt and enhancing the clarity of heuristics, we have significantly improved the event extraction performance of LLMs. We used data balancing techniques to solve the long-tail effect problem, enhancing the model’s generalization ability for event types. At the same time, we refined the prompt to ensure it is both concise and comprehensive, adapting to the sensitivity of LLMs to the style of prompts. In addition, the introduction of structured heuristic methods and strict limiting conditions has improved the precision of event and argument role extraction. These strategies not only solve the prompt engineering problems of LLMs in document-level event extraction but also promote the development of event extraction technology, providing new research perspectives for other tasks in the NLP field.
摘要：在自然语言处理(NLP)领域，大语言模型(LLM)在文档级事件抽取任务中显示出了巨大的潜力，但现有方法在提示的设计方面面临挑战。为了解决这个问题，我们提出了一种名为“定义驱动的文档级事件抽取(DDEE)”的优化策略。通过调整提示长度和提高启发式规则的清晰度，我们显著提高了LLM的事件抽取性能。我们利用数据均衡技术解决了长尾效应问题，增强了模型对事件类型的泛化能力。同时，我们对提示进行了细化，使其既简洁又全面，以适应LLM对提示风格的敏感性。此外，结构化启发式方法和严格限制条件的引入提高了事件和论元角色抽取的精度。这些策略不仅解决了LLM在文档级事件抽取中的提示工程问题，而且促进了事件抽取技术的发展，为自然语言处理领域的其他任务提供了新的研究视角。

[NLP-46] Large Language Model-based Role-Playing for Personalized Medical Jargon Extraction
[NLP-46] 基于大语言模型的角色扮演个性化医疗行话提取

链接: https://arxiv.org/abs/2408.05555
作者: Jung Hoon Lim,Sunjae Kwon,Zonghai Yao,John P. Lalor,Hong Yu
关键词-EN: Electronic Health Records, Health Records, Electronic Health, personal medical information, reveal that Electronic
关键词-ZN: 电子健康记录、健康记录、电子健康、个人医疗信息，揭示了电子
类目: Computation and Language (cs.CL)
备注: 17 pages, 3 figures, 3 tables

Abstract:Previous studies reveal that Electronic Health Records (EHR), which have been widely adopted in the U.S. to allow patients to access their personal medical information, do not have high readability to patients due to the prevalence of medical jargon. Tailoring medical notes to individual comprehension by identifying jargon that is difficult for each person will enhance the utility of generative models. We present the first quantitative analysis to measure the impact of role-playing in LLM in medical term extraction. By comparing the results of Mechanical Turk workers over 20 sentences, our study demonstrates that LLM role-playing improves F1 scores in 95% of cases across 14 different socio-demographic backgrounds. Furthermore, applying role-playing with in-context learning outperformed the previous state-of-the-art models. Our research showed that ChatGPT can improve traditional medical term extraction systems by utilizing role-play to deliver personalized patient education, a potential that previous models had not achieved.
摘要：以往的研究表明，电子健康记录(EHR)在美国已被广泛采用，使患者能够访问其个人医疗信息，但由于医学行话的普遍存在，它对患者而言可读性不高。通过识别对每个人而言难以理解的行话，根据个人的理解水平定制医疗笔记，将提升生成模型的实用性。我们提出了首个量化分析，用于衡量LLM角色扮演对医学术语提取的影响。通过与Mechanical Turk众包工作者在20个句子上的结果进行比较，我们的研究表明，在14种不同的社会人口背景下，LLM角色扮演在95%的情况下提高了F1分数。此外，将角色扮演与上下文学习相结合的表现优于此前最先进的模型。我们的研究表明，ChatGPT可以利用角色扮演提供个性化的患者教育，从而改进传统的医学术语提取系统，这是以往模型未能实现的潜力。

[NLP-47] Multi-layer Sequence Labeling-based Joint Biomedical Event Extraction NLPCC2024
[NLP-47] 基于多层序列标记的联合生物医学事件提取

链接: https://arxiv.org/abs/2408.05545
作者: Gongchi Chen,Pengchao Wu,Jinghang Gu,Longhua Qian,Guodong Zhou
关键词-EN: recent years, dominated by complicated, complicated pipeline, biomedical event extraction, biomedical event
关键词-ZN: 近年来，以复杂、复杂的管道为主，生物医学事件提取，生物医学事件
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 13 pages, 3 figures, accepted by NLPCC2024

Abstract:In recent years, biomedical event extraction has been dominated by complicated pipeline and joint methods, which need to be simplified. In addition, existing work has not effectively utilized trigger word information explicitly. Hence, we propose MLSL, a method based on multi-layer sequence labeling for joint biomedical event extraction. MLSL does not introduce prior knowledge and complex structures. Moreover, it explicitly incorporates the information of candidate trigger words into the sequence labeling to learn the interaction relationships between trigger words and argument roles. Based on this, MLSL can learn well with just a simple workflow. Extensive experimentation demonstrates the superiority of MLSL in terms of extraction performance compared to other state-of-the-art methods.
摘要：近年来，生物医学事件提取以复杂的流水线方法和联合方法为主，需要加以简化。此外，现有工作尚未有效地显式利用触发词信息。因此，我们提出了MLSL，一种基于多层序列标注的联合生物医学事件提取方法。MLSL不引入先验知识和复杂结构。此外，它将候选触发词的信息显式地融入序列标注中，以学习触发词与论元角色之间的交互关系。基于此，MLSL只需简单的工作流程即可学得很好。大量实验表明，与其他最先进的方法相比，MLSL在提取性能方面具有优越性。

[NLP-48] P3: A Policy-Driven Pace-Adaptive and Diversity-Promoted Framework for Optimizing LLM Training
[NLP-48] P3：用于优化LLM训练的策略驱动、步调自适应和多样性促进框架

链接: https://arxiv.org/abs/2408.05541
作者: Yingxuan Yang,Huayi Wang,Muning Wen,Weinan Zhang
关键词-EN: rapidly evolving field, field of Large, Large Language Models, selecting high-quality data, Large Language
关键词-ZN: 快速发展的领域，大型领域，大型语言模型，选择高质量数据，大型语言
类目: Computation and Language (cs.CL)
备注:

Abstract:In the rapidly evolving field of Large Language Models (LLMs), selecting high-quality data for fine-tuning is essential. This paper focuses on task-specific data pruning and selection to enhance fine-tuning. We introduce an innovative framework, termed P3, which improves LLM performance through a dynamic, adaptive training strategy. Specifically, P3 comprises the following components: (1) Policy-driven Difficulty Measurement: we begin by measuring the difficulty of data based on the model’s real-time performance, transitioning from static, predefined metrics to more dynamic and adaptable ones. (2) Pace-adaptive Selection: we employ self-paced learning (SPL) to gradually select increasingly challenging data, thereby progressively enhancing the model’s performance. (3) Diversity Promotion: we integrate Determinantal Point Process (DPP) into the selection process to promote the diversity within and between samples, enriching the learning process. We have validated our method on two well-known LLM datasets, APPS and MATH, designed for logical reasoning scenarios. The results show that our P3 framework significantly improves training outcomes compared to traditional methods. By fundamentally refining data selection and utilization strategies, P3 not only advances theoretical understanding of dynamic training approaches but also provides a versatile framework that can revolutionize model training in natural language processing.
摘要：在快速发展的大型语言模型领域，选择高质量的数据进行微调是至关重要的。本文的重点是针对特定任务的数据剪枝和选择，以增强微调。我们引入了一个名为P3的创新框架，它通过动态、自适应的培训策略提高了LLM的性能。具体地说，P3由以下几个部分组成：(1)策略驱动的难度度量：我们首先基于模型的实时性能来度量数据的难度，从静态的、预定义的度量过渡到更动态和适应性更强的度量。(2)步长自适应选择：我们使用自步调学习(SPL)来逐步选择日益具有挑战性的数据，从而逐步提高模型的性能。(3)多样性促进：我们将决定点过程(DPP)融入到选择过程中，以促进样本内部和样本之间的多样性，丰富了学习过程。我们已经在两个著名的LLM数据集APPS和MATH上验证了我们的方法，这两个数据集是为逻辑推理场景设计的。结果表明，与传统方法相比，我们的P3框架显著改善了训练结果。通过从根本上改进数据选择和使用策略，P3不仅提高了对动态训练方法的理论理解，而且提供了一个通用的框架，可以彻底改变自然语言处理中的模型训练。
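
The three P3 components described above (difficulty measurement, pace-adaptive selection, diversity promotion) can be illustrated with a small sketch. This is not the authors' implementation: the difficulty scores, thresholds, and feature vectors are hypothetical inputs, and the diversity step uses a simple max-min distance heuristic as a stand-in for Determinantal Point Process sampling.

```python
from math import dist


def self_paced_select(difficulties, threshold):
    """Return indices of samples whose difficulty is at most `threshold`."""
    return [i for i, d in enumerate(difficulties) if d <= threshold]


def greedy_diverse_subset(features, candidates, k):
    """Pick up to k candidates, each time taking the one farthest from those chosen.

    This max-min heuristic is only a stand-in for DPP sampling, not a DPP.
    """
    if not candidates or k <= 0:
        return []
    chosen = [candidates[0]]
    remaining = list(candidates[1:])
    while remaining and len(chosen) < k:
        best = max(
            remaining,
            key=lambda i: min(dist(features[i], features[j]) for j in chosen),
        )
        chosen.append(best)
        remaining.remove(best)
    return chosen


def curriculum(difficulties, features, thresholds, k):
    """Yield one selected batch of indices per pace step (rising thresholds)."""
    for t in thresholds:
        pool = self_paced_select(difficulties, t)
        yield greedy_diverse_subset(features, pool, k)


# Hypothetical toy data: four samples with 2-D features.
difficulties = [0.1, 0.2, 0.9, 0.15]
features = [(0.0, 0.0), (0.0, 0.1), (5.0, 5.0), (3.0, 0.0)]
batches = list(curriculum(difficulties, features, thresholds=[0.2, 1.0], k=2))
```

In a real setting the difficulty scores would be recomputed from the model's current performance before each pace step, which is what makes the measurement policy-driven rather than static.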

[NLP-49] Context-Driven Index Trimming: A Data Quality Perspective to Enhancing Precision of RALMs
[NLP-49] 上下文驱动索引修剪：从数据质量角度提高RALM精确度

链接: https://arxiv.org/abs/2408.05524
作者: Kexin Ma,Ruochun Jin,Xi Wang,Huan Chen,Jing Ren,Yuhua Tang
关键词-EN: Large Language Models, Context Matching Dependencies, Retrieval-Augmented Large Language, Context-Driven Index Trimming, data quality issues
关键词-ZN: 大型语言模型、上下文匹配从属关系、检索增强大型语言、上下文驱动索引修剪、数据质量问题
类目: Computation and Language (cs.CL); Databases (cs.DB)
备注:

Abstract:Retrieval-Augmented Large Language Models (RALMs) have made significant strides in enhancing the accuracy of generated responses. However, existing research often overlooks the data quality issues within retrieval results, often caused by inaccurate existing vector-distance-based retrieval methods. We propose to boost the precision of RALMs’ answers from a data quality perspective through the Context-Driven Index Trimming (CDIT) framework, where Context Matching Dependencies (CMDs) are employed as logical data quality rules to capture and regulate the consistency between retrieved contexts. Based on the semantic comprehension capabilities of Large Language Models (LLMs), CDIT can effectively identify and discard retrieval results that are inconsistent with the query context and further modify indexes in the database, thereby improving answer quality. Experiments demonstrate on challenging question-answering tasks. Also, the flexibility of CDIT is verified through its compatibility with various language models and indexing methods, which offers a promising approach to bolster RALMs’ data quality and retrieval precision jointly.
摘要：检索增强的大语言模型(RALM)在提高生成回答的准确性方面取得了长足的进步。然而，现有的研究往往忽略了检索结果中的数据质量问题，这类问题往往是由现有的基于向量距离的检索方法不够准确造成的。我们提出通过上下文驱动的索引裁剪(CDIT)框架，从数据质量的角度提高RALM的答案精度，其中上下文匹配依赖(CMD)被用作逻辑数据质量规则，以捕获和规范检索到的上下文之间的一致性。基于大语言模型(LLM)的语义理解能力，CDIT可以有效地识别并丢弃与查询上下文不一致的检索结果，并进一步修改数据库中的索引，从而提高答案质量。我们在具有挑战性的问答任务上进行了实验。此外，CDIT与多种语言模型和索引方法兼容，验证了其灵活性，为联合提升RALM的数据质量和检索精度提供了一种有前景的方法。

[NLP-50] SWIFT: A Scalable lightWeight Infrastructure for Fine-Tuning
[NLP-50] SWIFT：用于微调的可扩展轻量级基础设施

链接: https://arxiv.org/abs/2408.05517
作者: Yuze Zhao,Jintao Huang,Jinghan Hu,Daoze Zhang,Zeyinzi Jiang,Zhikai Wu,Baole Ai,Ang Wang,Wenmeng Zhou,Yingda Chen
关键词-EN: leverage Attention-based Transformer, Large Language Models, Multi-modal Large Language, Attention-based Transformer architectures, Large Language
关键词-ZN: 利用基于注意力的Transformer、大型语言模型、多模式大型语言、基于注意力的Transformer架构、大型语言
类目: Computation and Language (cs.CL)
备注:

Abstract:Recent developments in Large Language Models (LLMs) and Multi-modal Large Language Models (MLLMs) have leveraged Attention-based Transformer architectures and achieved superior performance and generalization capabilities. They have since covered extensive areas of traditional learning tasks. For instance, text-based tasks such as text-classification and sequence-labeling, as well as multi-modal tasks like Visual Question Answering (VQA) and Optical Character Recognition (OCR), which were previously addressed using different models, can now be tackled based on one foundation model. Consequently, the training and lightweight fine-tuning of LLMs and MLLMs, especially those based on Transformer architecture, have become particularly important. In recognition of these overwhelming needs, we develop SWIFT, a customizable one-stop infrastructure for large models. With support for 300+ LLMs and 50+ MLLMs, SWIFT stands as the open-source framework that provides the most comprehensive support for fine-tuning large models. In particular, it is the first training framework that provides systematic support for MLLMs. In addition to the core functionalities of fine-tuning, SWIFT also integrates post-training processes such as inference, evaluation, and model quantization, to facilitate fast adoption of large models in various application scenarios. With a systematic integration of various training techniques, SWIFT offers helpful utilities such as benchmark comparisons among different training techniques for large models. For fine-tuning models specialized in agent framework, we show that notable improvements on the ToolBench leader-board can be achieved by training with a customized dataset on SWIFT, with an increase of 5.2%-21.8% in the Act.EM metric over various baseline models, a reduction in hallucination by 1.6%-14.1%, and an average performance improvement of 8%-17%.
摘要：大型语言模型(LLM)和多模态大型语言模型(MLLM)的最新发展利用了基于注意力的Transformer架构，并取得了优异的性能和泛化能力。此后，它们覆盖了传统学习任务的广泛领域。例如，文本分类和序列标注等基于文本的任务，以及视觉问答(VQA)和光学字符识别(OCR)等多模态任务，以前需要使用不同的模型来处理，现在可以基于一个基础模型来解决。因此，LLM和MLLM(特别是基于Transformer架构的模型)的训练和轻量级微调变得尤为重要。考虑到这些巨大的需求，我们开发了SWIFT，一个面向大模型的可定制一站式基础设施。SWIFT支持300多个LLM和50多个MLLM，是为大模型微调提供最全面支持的开源框架。特别是，它是第一个为MLLM提供系统支持的训练框架。除了微调的核心功能外，SWIFT还集成了推理、评估和模型量化等训练后流程，以促进大模型在各种应用场景中的快速采用。通过系统地集成各种训练技术，SWIFT提供了实用的工具，例如针对大模型的不同训练技术之间的基准比较。对于专门用于智能体框架的微调模型，我们表明，通过在SWIFT上使用定制数据集进行训练，可以在ToolBench排行榜上取得显著提升：与各种基线模型相比，Act.EM指标提高了5.2%-21.8%，幻觉减少了1.6%-14.1%，平均性能提高了8%-17%。

[NLP-51] Your Context Is Not an Array: Unveiling Random Access Limitations in Transformers
[NLP-51] 您的上下文不是数组：揭示Transformer中的随机访问限制

链接: https://arxiv.org/abs/2408.05506
作者: MohammadReza Ebrahimi,Sunny Panchal,Roland Memisevic
关键词-EN: Transformer-based large language, Transformer-based large, large language models, recent successes, surprising failure modes
关键词-ZN: 基于转换器的大型语言、基于转换器的大型语言模型、最近的成功、令人惊讶的失败模式
类目: Computation and Language (cs.CL)
备注: Published as a conference paper at COLM 2024

Abstract:Despite their recent successes, Transformer-based large language models show surprising failure modes. A well-known example of such failure modes is their inability to length-generalize: solving problem instances at inference time that are longer than those seen during training. In this work, we further explore the root cause of this failure by performing a detailed analysis of model behaviors on the simple parity task. Our analysis suggests that length generalization failures are intricately related to a model’s inability to perform random memory accesses within its context window. We present supporting evidence for this hypothesis by demonstrating the effectiveness of methodologies that circumvent the need for indexing or that enable random token access indirectly, through content-based addressing. We further show where and how the failure to perform random memory access manifests through attention map visualizations.
摘要：尽管最近取得了成功，但基于Transformer的大型语言模型仍表现出令人惊讶的失败模式。此类失败模式的一个众所周知的例子是它们无法进行长度泛化：在推理时解决比训练期间见过的实例更长的问题实例。在这项工作中，我们通过详细分析模型在简单的奇偶校验(parity)任务上的行为，进一步探讨了这种失败的根本原因。我们的分析表明，长度泛化失败与模型无法在其上下文窗口内执行随机内存访问密切相关。我们展示了若干方法的有效性，这些方法要么绕开了对索引的需求，要么通过基于内容的寻址间接实现随机词元访问，从而为这一假设提供了支持证据。我们还通过注意力图可视化进一步展示了随机内存访问失败在何处以及如何表现出来。
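
To make the length-generalization setup concrete, here is a minimal sketch of a parity dataset in which test sequences are strictly longer than training sequences. The length ranges, sample counts, and the `make_split` helper are illustrative assumptions, not the configuration used in the paper.

```python
import random


def parity(bits):
    """Return 1 if the number of ones in `bits` is odd, else 0."""
    result = 0
    for b in bits:
        result ^= b
    return result


def make_split(rng, n_examples, min_len, max_len):
    """Sample (bits, label) pairs with lengths in [min_len, max_len]."""
    data = []
    for _ in range(n_examples):
        length = rng.randint(min_len, max_len)
        bits = [rng.randint(0, 1) for _ in range(length)]
        data.append((bits, parity(bits)))
    return data


rng = random.Random(0)
train = make_split(rng, 100, 1, 20)   # lengths seen during training
test = make_split(rng, 100, 21, 40)   # strictly longer, probes length generalization
```

The running XOR in `parity` needs only the previous result and the next bit, which is why the task is a clean probe: a model that fails on longer inputs is not failing for lack of a simple algorithm.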

[NLP-52] GEM: Context-Aware Gaze EstiMation with Visual Search Behavior Matching for Chest Radiograph
[NLP-52] GEM：面向胸部X光片的上下文感知凝视估计与视觉搜索行为匹配

链接: https://arxiv.org/abs/2408.05502
作者: Shaonan Liu,Wenting Chen,Jie Liu,Xiaoling Luo,Linlin Shen
关键词-EN: scene comprehension tasks, human scene comprehension, medical diagnostic analysis, context-aware gaze estimation, medical image interpretation
关键词-ZN: 场景理解任务、人类场景理解、医学诊断分析、上下文感知凝视估计、医学图像解释
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: 9 figures

Abstract:Gaze estimation is pivotal in human scene comprehension tasks, particularly in medical diagnostic analysis. Eye-tracking technology facilitates the recording of physicians’ ocular movements during image interpretation, thereby elucidating their visual attention patterns and information-processing strategies. In this paper, we initially define the context-aware gaze estimation problem in medical radiology report settings. To understand the attention allocation and cognitive behavior of radiologists during the medical image interpretation process, we propose a context-aware Gaze EstiMation (GEM) network that utilizes eye gaze data collected from radiologists to simulate their visual search behavior patterns throughout the image interpretation process. It consists of a context-awareness module, visual behavior graph construction, and visual behavior matching. Within the context-awareness module, we achieve intricate multimodal registration by establishing connections between medical reports and images. Subsequently, for a more accurate simulation of genuine visual search behavior patterns, we introduce a visual behavior graph structure, capturing such behavior through high-order relationships (edges) between gaze points (nodes). To maintain the authenticity of visual behavior, we devise a visual behavior-matching approach, adjusting the high-order relationships between them by matching the graph constructed from real and estimated gaze points. Extensive experiments on four publicly available datasets demonstrate the superiority of GEM over existing methods and its strong generalizability, which also provides a new direction for the effective utilization of diverse modalities in medical image interpretation and enhances the interpretability of models in the field of medical imaging.
摘要：在人类场景理解任务中，尤其是在医学诊断分析中，凝视估计起着至关重要的作用。眼动追踪技术有助于记录医生在图像判读过程中的眼球运动，从而阐明他们的视觉注意模式和信息处理策略。在本文中，我们首先定义了医学放射学报告场景下的上下文感知凝视估计问题。为了了解放射科医生在医学图像判读过程中的注意力分配和认知行为，我们提出了一种上下文感知凝视估计(GEM)网络，该网络利用从放射科医生那里收集的眼动注视数据来模拟他们在整个图像判读过程中的视觉搜索行为模式。它由上下文感知模块、视觉行为图构建和视觉行为匹配组成。在上下文感知模块中，我们通过在医学报告和图像之间建立联系来实现精细的多模态配准。随后，为了更准确地模拟真实的视觉搜索行为模式，我们引入了一种视觉行为图结构，通过注视点(节点)之间的高阶关系(边)来捕捉这种行为。为了保持视觉行为的真实性，我们设计了一种视觉行为匹配方法，通过匹配由真实注视点和估计注视点构建的图来调整它们之间的高阶关系。在四个公开可用的数据集上的大量实验表明，GEM优于现有方法并具有较强的泛化能力，这也为在医学图像判读中有效利用多种模态提供了新的方向，并提高了医学成像领域模型的可解释性。

[NLP-53] MABR: A Multilayer Adversarial Bias Removal Approach Without Prior Bias Knowledge
[NLP-53] MABR：一种无需事先偏见知识的多层对抗偏见消除方法

链接: https://arxiv.org/abs/2408.05497
作者: Maxwell J. Yin,Boyu Wang,Charles Ling
关键词-EN: exacerbate existing social, existing social biases, real-world data, data often mirror, mirror and exacerbate
关键词-ZN: 加剧现有的社会、现有的社会偏见、现实世界的数据、数据经常反映、反映和加剧
类目: Computation and Language (cs.CL)
备注:

Abstract:Models trained on real-world data often mirror and exacerbate existing social biases. Traditional methods for mitigating these biases typically require prior knowledge of the specific biases to be addressed, such as gender or racial biases, and the social groups associated with each instance. In this paper, we introduce a novel adversarial training strategy that operates independently of prior bias-type knowledge and protected attribute labels. Our approach proactively identifies biases during model training by utilizing auxiliary models, which are trained concurrently by predicting the performance of the main model without relying on task labels. Additionally, we implement these auxiliary models at various levels of the feature maps of the main model, enabling the detection of a broader and more nuanced range of bias features. Through experiments on racial and gender biases in sentiment and occupation classification tasks, our method effectively reduces social biases without the need for demographic annotations. Moreover, our approach not only matches but often surpasses the efficacy of methods that require detailed demographic insights, marking a significant advancement in bias mitigation techniques.
摘要：根据真实世界数据训练的模型往往反映并加剧了现有的社会偏见。缓解这些偏见的传统方法通常需要事先了解需要解决的具体偏见，如性别或种族偏见，以及与每个实例相关联的社会群体。在本文中，我们介绍了一种新的对抗训练策略，该策略独立于先验偏见类型知识和受保护的属性标签。我们的方法通过利用辅助模型来主动识别模型训练过程中的偏差，这些辅助模型是通过预测主模型的性能来同时训练的，而不依赖于任务标签。此外，我们在主模型的特征映射的不同级别上实现了这些辅助模型，使得能够检测更广泛和更细微范围的偏差特征。通过对情绪和职业分类任务中的种族和性别偏见的实验，我们的方法有效地减少了社会偏见，而不需要人口统计注释。此外，我们的方法不仅与需要详细人口统计学见解的方法相匹配，而且往往超过这些方法的有效性，标志着偏差缓解技术的重大进步。

[NLP-54] Investigating Instruction Tuning Large Language Models on Graphs
[NLP-54] 探究大型语言模型在图上的指令微调

链接: https://arxiv.org/abs/2408.05457
作者: Kerui Zhu,Bo-Wei Huang,Bowen Jin,Yizhu Jiao,Ming Zhong,Kevin Chang,Shou-De Lin,Jiawei Han
关键词-EN: Large Language Models, advancements of Large, Language Models, Large Language, NLP tasks
关键词-ZN: 大型语言模型、大型、语言模型、大型语言、NLP任务的改进
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: COLM 2024

Abstract:Inspired by the recent advancements of Large Language Models (LLMs) in NLP tasks, there’s growing interest in applying LLMs to graph-related tasks. This study delves into the capabilities of instruction-following LLMs for engaging with real-world graphs, aiming to offer empirical insights into how LLMs can effectively interact with graphs and generalize across graph tasks. We begin by constructing a dataset designed for instruction tuning, which comprises a diverse collection of 79 graph-related tasks from academic and e-commerce domains, featuring 44,240 training instances and 18,960 test samples. Utilizing this benchmark, our initial investigation focuses on identifying the optimal graph representation that serves as a conduit for LLMs to understand complex graph structures. Our findings indicate that JSON format for graph representation consistently outperforms natural language and code formats across various LLMs and graph types. Furthermore, we examine the key factors that influence the generalization abilities of instruction-tuned LLMs by evaluating their performance on both in-domain and out-of-domain graph tasks.
摘要：受大语言模型(LLM)近期在自然语言处理任务中进展的启发，人们对将LLM应用于图相关任务的兴趣与日俱增。这项研究深入探讨了遵循指令的LLM处理真实世界图的能力，旨在为LLM如何有效地与图交互并在图任务间泛化提供经验性的见解。我们首先构建了一个用于指令微调的数据集，其中包含来自学术和电子商务领域的79个多样化的图相关任务，共有44,240个训练实例和18,960个测试样本。利用这个基准，我们的初步研究集中于确定最优的图表示，它是LLM理解复杂图结构的桥梁。我们的发现表明，在各种LLM和图类型上，JSON格式的图表示始终优于自然语言和代码格式。此外，我们通过评估指令微调后的LLM在域内和域外图任务上的性能，考察了影响其泛化能力的关键因素。
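
As a rough illustration of what a JSON graph representation for an LLM prompt might look like, the sketch below serializes a tiny graph and wraps it in a question. The schema (`nodes`, `edges`, `source`, `target`, `relation`) and the prompt wording are assumptions made for this example; the abstract does not specify the format used in the paper.

```python
import json


def graph_to_json(nodes, edges):
    """Serialize a graph as JSON text.

    `nodes` maps a node id to a dict of attributes; `edges` is a list of
    (source, target, relation) tuples. The schema is hypothetical.
    """
    payload = {
        "nodes": [{"id": node_id, **attrs} for node_id, attrs in nodes.items()],
        "edges": [
            {"source": s, "target": t, "relation": r} for s, t, r in edges
        ],
    }
    return json.dumps(payload, ensure_ascii=False, indent=2)


def build_prompt(graph_json, question):
    """Combine the serialized graph and a task question into one prompt string."""
    return f"Graph (JSON):\n{graph_json}\n\nQuestion: {question}\nAnswer:"


nodes = {
    "p1": {"type": "paper", "title": "Paper A"},
    "p2": {"type": "paper", "title": "Paper B"},
}
edges = [("p1", "p2", "cites")]
prompt = build_prompt(graph_to_json(nodes, edges), "Which paper does p1 cite?")
```

A structured format like this keeps node attributes and edge endpoints explicit, which is one plausible reason it could be easier for a model to parse than free-form prose.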

[NLP-55] Path-LLM: A Shortest-Path-based LLM Learning for Unified Graph Representation
[NLP-55] Path-LLM：基于最短路径的LLM学习，用于统一图表示

链接: https://arxiv.org/abs/2408.05456
作者: Wenbo Shang,Xuliang Zhu,Xin Huang
关键词-EN: multiple downstream applications, Unified graph representation, aims to produce, applied to multiple, Unified graph
关键词-ZN: 多个下游应用程序，统一图形表示，旨在生成、应用于多个统一图形
类目: Computation and Language (cs.CL)
备注: 12 pages, 8 figures

Abstract:Unified graph representation learning aims to produce node embeddings, which can be applied to multiple downstream applications. However, existing studies based on graph neural networks and language models either suffer from the limitations of numerous training needed toward specific downstream predictions or have shallow semantic features. In this work, we propose a novel Path-LLM model to learn unified graph representation, which leverages a powerful large language model (LLM) to incorporate our proposed path features. Our Path-LLM framework consists of several well-designed techniques. First, we develop a new mechanism of long-to-short shortest path (L2SP) selection, which covers essential connections between different dense groups. An in-depth comparison of different path selection plans is offered to illustrate the strength of our designed L2SP. Then, we design path textualization to obtain L2SP-based training texts. Next, we feed the texts into a self-supervised LLM training process to learn embeddings. Extensive experiments on benchmarks validate the superiority of Path-LLM against the state-of-the-art WalkLM method on two classical graph learning tasks (node classification and link prediction) and one NP-hard graph query processing task (keyword search), meanwhile saving more than 90% of training paths.
摘要：统一图表示学习的目的是产生可应用于多个下游应用的节点嵌入。然而，现有的基于图神经网络和语言模型的研究要么受限于针对特定下游预测所需的大量训练，要么只具有较浅的语义特征。在这项工作中，我们提出了一种新的Path-LLM模型来学习统一的图表示，它利用强大的大型语言模型(LLM)来融入我们提出的路径特征。我们的Path-LLM框架由几项精心设计的技术组成。首先，我们提出了一种新的长到短最短路径(L2SP)选择机制，该机制覆盖了不同稠密群组之间的关键连接。我们对不同的路径选择方案进行了深入的比较，以说明所设计的L2SP的优势。然后，我们设计了路径文本化来获得基于L2SP的训练文本。接下来，我们将这些文本输入自监督的LLM训练过程以学习嵌入。大量的基准实验验证了Path-LLM在两个经典图学习任务(节点分类和链接预测)和一个NP难的图查询处理任务(关键字搜索)上优于最先进的WalkLM方法，同时节省了90%以上的训练路径。
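
The two basic building blocks mentioned above, shortest paths and path textualization, can be sketched as follows. This covers only plain breadth-first shortest paths on an unweighted graph and a naive sentence template; the L2SP selection rule itself is not described in enough detail in the abstract to reproduce, and the template wording is an assumption.

```python
from collections import deque


def shortest_path(adj, src, dst):
    """BFS shortest path in an unweighted graph; returns a node list or None."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in adj.get(u, ()):
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return None


def textualize(path, node_text):
    """Turn a node path into one training sentence (hypothetical template)."""
    return " is connected to ".join(node_text[n] for n in path) + "."


adj = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"]}
node_text = {"a": "Paper A", "b": "Paper B", "d": "Paper D", "e": "Paper E"}
sentence = textualize(shortest_path(adj, "a", "e"), node_text)
```

Texts produced this way would then serve as input to the self-supervised LLM training step that the abstract describes.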

[NLP-56] Chain of Condition: Construct, Verify and Solve Conditions for Conditional Question Answering
[NLP-56] 条件链：构建、验证和解决条件问题回答

链接: https://arxiv.org/abs/2408.05442
作者: Jiuheng Lin,Yuxuan Lai,Yansong Feng
关键词-EN: find probable answers, Conditional question answering, important task, task that aims, aims to find
关键词-ZN: 找到可能的答案，有条件的问题回答，重要任务，目标任务，目标是找到
类目: Computation and Language (cs.CL)
备注:

Abstract:Conditional question answering (CQA) is an important task that aims to find probable answers and identify conditions that need to be satisfied to support the answer. Existing approaches struggle with CQA due to two main challenges: (1) precisely identifying conditions and their logical relationship, and (2) verifying and solving the conditions. To address these challenges, we propose Chain of Condition, a novel prompting approach that first identifies all conditions and constructs their logical relationships explicitly according to the document, then verifies whether these conditions are satisfied, and finally solves the logical expression with tools to indicate any missing conditions and generates the answer based on the resolved conditions. The experiments on two benchmark conditional question answering datasets show that Chain of Condition outperforms existing prompting baselines, establishing a new state of the art. Furthermore, with backbone models like GPT-3.5-Turbo or GPT-4, it surpasses all supervised baselines in only few-shot settings.
摘要：条件问答(CQA)是一项重要的任务，其目的是寻找可能的答案，并确定支持该答案所需满足的条件。现有方法在CQA上表现不佳，主要面临两个挑战：(1)准确地识别条件及其逻辑关系，(2)验证和求解这些条件。为了应对这些挑战，我们提出了条件链(Chain of Condition)，这是一种新颖的提示方法：首先识别所有条件并根据文档显式地构建它们的逻辑关系，然后验证这些条件是否满足，最后用工具求解逻辑表达式以指出任何缺失的条件，并根据已求解的条件生成答案。在两个基准条件问答数据集上的实验表明，条件链的性能优于现有的提示基线，达到了新的最先进水平。此外，使用GPT-3.5-Turbo或GPT-4等骨干模型时，它仅在少样本设置下就超过了所有有监督基线。
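
The final step described above, solving a logical expression over verified conditions and reporting what is still missing, can be sketched with three-valued logic. The expression encoding (nested `("and" | "or", [...])` tuples), the condition names, and both helper functions are assumptions for illustration; the paper's actual tools are not specified in the abstract.

```python
def evaluate(expr, status):
    """Three-valued evaluation of a condition expression.

    `expr` is a condition name or a tuple ("and" | "or", [sub-expressions]).
    `status` maps verified condition names to True or False; absent names
    are unverified. Returns True, False, or None when still undetermined.
    """
    if isinstance(expr, str):
        return status.get(expr)
    op, parts = expr
    values = [evaluate(p, status) for p in parts]
    if op == "and":
        if any(v is False for v in values):
            return False
        return True if all(v is True for v in values) else None
    if op == "or":
        if any(v is True for v in values):
            return True
        return False if all(v is False for v in values) else None
    raise ValueError(f"unknown operator: {op}")


def missing_conditions(expr, status):
    """List unverified conditions that still affect the overall result."""
    if evaluate(expr, status) is not None:
        return []
    if isinstance(expr, str):
        return [expr]
    _, parts = expr
    found = []
    for part in parts:
        for name in missing_conditions(part, status):
            if name not in found:
                found.append(name)
    return found


# Hypothetical rule: eligible if resident AND (over_65 OR disabled).
rule = ("and", ["resident", ("or", ["over_65", "disabled"])])
verified = {"resident": True, "over_65": False}
```

With `verified` as above, the rule is undetermined and only `disabled` is reported as missing, which matches the idea of answering "yes, provided that ..." instead of guessing.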

[NLP-57] LaiDA: Linguistics-aware In-context Learning with Data Augmentation for Metaphor Components Identification NLPCC2024
[NLP-57] LaiDA：结合数据增强的语言学感知上下文学习，用于隐喻成分识别

链接: https://arxiv.org/abs/2408.05404
作者: Hongde Liu,Chenyuan He,Feiyang Meng,Changyong Niu,Yuxiang Jia
关键词-EN: Metaphor Components Identification, Components Identification, Metaphor Components, enhancing machine understanding, advancing downstream natural
关键词-ZN: 隐喻组件识别，组件识别，隐喻组件，增强机器理解，推进下游自然
类目: Computation and Language (cs.CL)
备注: This paper has been accepted by NLPCC 2024 Shared Tasks

Abstract:Metaphor Components Identification (MCI) contributes to enhancing machine understanding of metaphors, thereby advancing downstream natural language processing tasks. However, the complexity, diversity, and dependency on context and background knowledge pose significant challenges for MCI. Large language models (LLMs) offer new avenues for accurate comprehension of complex natural language texts due to their strong semantic analysis and extensive commonsense knowledge. In this research, a new LLM-based framework is proposed, named Linguistics-aware In-context Learning with Data Augmentation (LaiDA). Specifically, ChatGPT and supervised fine-tuning are utilized to tailor a high-quality dataset. LaiDA incorporates a simile dataset for pre-training. A graph attention network encoder generates linguistically rich feature representations to retrieve similar examples. Subsequently, LLM is fine-tuned with prompts that integrate linguistically similar examples. LaiDA ranked 2nd in Subtask 2 of NLPCC2024 Shared Task 9, demonstrating its effectiveness. Code and data are available at this https URL.
摘要：隐喻成分识别(MCI)有助于提高机器对隐喻的理解，从而推进下游的自然语言处理任务。然而，复杂性、多样性以及对上下文和背景知识的依赖给MCI带来了巨大的挑战。大语言模型(LLM)以其强大的语义分析能力和丰富的常识知识，为准确理解复杂的自然语言文本提供了新的途径。在这项研究中，我们提出了一个新的基于LLM的框架，称为结合数据增强的语言学感知上下文学习(LaiDA)。具体地说，我们利用ChatGPT和有监督的微调来定制高质量的数据集。LaiDA引入了一个明喻数据集用于预训练。图注意力网络编码器生成语言学信息丰富的特征表示，用于检索相似的示例。随后，使用整合了语言学上相似示例的提示对LLM进行微调。LaiDA在NLPCC2024共享任务9的子任务2中排名第二，证明了其有效性。代码和数据可在此HTTPS URL上找到。

[NLP-58] FiST-Financial Style Transfer with Hallucination and Creativity Control Framework
[NLP-58] FiST：具有幻觉和创造力控制框架的金融风格迁移

链接: https://arxiv.org/abs/2408.05365
作者: Sohini Roychowdhury,Marko Krema,Brian Moore,Xingjian Lai,Dike Effedua,Bharat Jethwani
关键词-EN: general purpose large, purpose large language, Financial report generation, language models pose, large language models
关键词-ZN: 通用大型、目的大型语言、财务报告生成、语言模型姿势、大型语言模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注: 8 pages, 13 figures, 5 tables, conference

Abstract:Financial report generation using general purpose large language models poses two major challenges, including the lack of compound sentences and hallucinations. Advanced prompt engineering and retrieval augmented generation (RAG) techniques are incapable of curing the writing style discrepancies. In this work we propose a novel two-stage fine-tuning process wherein public domain financial reports are processed into prompt-completions and augmented using simple LLM prompts to then enable sectional financial report generation using minimal instructions and tabular data inputs. Our proposed fine-tuning framework doubles the number of correct question answers and reduces hallucinations by over 50%. Additionally, the two-stage fine-tuned models have lower perplexity, improved ROUGE, TER and BLEU scores, higher creativity and knowledge density with lower uncertainty and cross entropy.
摘要：使用通用大型语言模型生成财务报告存在两个主要挑战，即缺乏复合句和出现幻觉。先进的提示工程和检索增强生成（RAG）技术无法解决写作风格上的差异。在这项工作中，我们提出了一种新颖的两阶段微调流程，其中公共领域的财务报告被处理成提示-补全对，并使用简单的LLM提示进行扩充，从而能够使用最少的指令和表格数据输入来分节生成财务报告。我们提出的微调框架使回答正确的问题数量增加了一倍，并将幻觉减少了50%以上。此外，两阶段微调模型具有更低的困惑度，更好的ROUGE、TER和BLEU分数，更高的创造力和知识密度，以及更低的不确定性和交叉熵。

[NLP-59] DataNarrative: Automated Data-Driven Storytelling with Visualizations and Texts
[NLP-59] DataNarrative：通过可视化和文本自动化数据驱动的讲故事

链接: https://arxiv.org/abs/2408.05346
作者: Mohammed Saidul Islam,Enamul Hoque,Shafiq Joty,Md Tahmid Rahman Laskar,Md Rizwan Parvez
关键词-EN: combining narrative techniques, visualizations and text, Data-driven storytelling, powerful method, method for conveying
关键词-ZN: 结合叙事技术、可视化和文本、数据驱动的讲故事、强大的方法、传达方法
类目: Computation and Language (cs.CL)
备注:

Abstract:Data-driven storytelling is a powerful method for conveying insights by combining narrative techniques with visualizations and text. These stories integrate visual aids, such as highlighted bars and lines in charts, along with textual annotations explaining insights. However, creating such stories requires a deep understanding of the data and meticulous narrative planning, often necessitating human intervention, which can be time-consuming and mentally taxing. While Large Language Models (LLMs) excel in various NLP tasks, their ability to generate coherent and comprehensive data stories remains underexplored. In this work, we introduce a novel task for data story generation and a benchmark containing 1,449 stories from diverse sources. To address the challenges of crafting coherent data stories, we propose a multiagent framework employing two LLM agents designed to replicate the human storytelling process: one for understanding and describing the data (Reflection), generating the outline, and narration, and another for verification at each intermediary step. While our agentic framework generally outperforms non-agentic counterparts in both model-based and human evaluations, the results also reveal unique challenges in data story generation.
摘要：数据驱动的讲故事是一种通过将叙事技术与可视化和文本相结合来传达真知灼见的强大方法。这些故事集成了视觉辅助工具，如图表中突出显示的条形图和线条，以及解释洞察力的文本注释。然而，创作这样的故事需要对数据的深刻理解和细致的叙事规划，往往需要人工干预，这可能是耗时和精神负担的。虽然大型语言模型(LLM)在各种NLP任务中表现出色，但它们生成连贯和全面的数据故事的能力仍未得到充分开发。在这项工作中，我们介绍了一个新颖的数据故事生成任务和一个包含来自不同来源的1,449个故事的基准。为了解决制作连贯的数据故事的挑战，我们提出了一个多代理框架，该框架采用两个LLM代理，旨在复制人类的讲故事过程：一个用于理解和描述数据(反映)，生成大纲和叙述，另一个用于在每个中间步骤进行验证。虽然我们的代理框架在基于模型的评估和人类评估中通常都优于非代理框架，但结果也揭示了数据故事生成方面的独特挑战。

[NLP-60] Revisiting Multi-Modal LLM Evaluation
[NLP-60] 重新审视多模式LLM评估

链接: https://arxiv.org/abs/2408.05334
作者: Jian Lu,Shikhar Srivastava,Junyu Chen,Robik Shrestha,Manoj Acharya,Kushal Kafle,Christopher Kanan
关键词-EN: large language models, multi-modal large language, referring expression comprehension, visual question answering, language models
关键词-ZN: 大型语言模型、多模式大型语言、指代表达理解、视觉问答、语言模型
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

Abstract:With the advent of multi-modal large language models (MLLMs), datasets used for visual question answering (VQA) and referring expression comprehension have seen a resurgence. However, the most popular datasets used to evaluate MLLMs are some of the earliest ones created, and they have many known problems, including extreme bias, spurious correlations, and an inability to permit fine-grained analysis. In this paper, we pioneer evaluating recent MLLMs (LLaVA 1.5, LLaVA-NeXT, BLIP2, InstructBLIP, GPT-4V, and GPT-4o) on datasets designed to address weaknesses in earlier ones. We assess three VQA datasets: 1) TDIUC, which permits fine-grained analysis on 12 question types; 2) TallyQA, which has simple and complex counting questions; and 3) DVQA, which requires optical character recognition for chart understanding. We also study VQDv1, a dataset that requires identifying all image regions that satisfy a given query. Our experiments reveal the weaknesses of many MLLMs that have not previously been reported. Our code is integrated into the widely used LAVIS framework for MLLM evaluation, enabling the rapid assessment of future MLLMs. Project webpage: this https URL
摘要：随着多模态大型语言模型(MLLM)的出现，用于视觉问答(VQA)和指称表达理解的数据集重新受到关注。然而，用于评估MLLM的最受欢迎的数据集恰恰是一些最早创建的数据集，它们存在许多已知问题，包括极端偏差、虚假关联以及无法进行细粒度分析。在本文中，我们率先在旨在解决早期数据集弱点的数据集上评估近期的MLLM(LLaVA 1.5、LLaVA-NeXT、BLIP2、InstructBLIP、GPT-4V和GPT-4o)。我们评估了三个VQA数据集：1)TDIUC，它允许对12种问题类型进行细粒度分析；2)TallyQA，它包含简单和复杂的计数问题；以及3)DVQA，它需要光学字符识别来理解图表。我们还研究了VQDv1，这是一个需要识别满足给定查询的所有图像区域的数据集。我们的实验揭示了许多MLLM此前未被报道的弱点。我们的代码已集成到广泛使用的LAVIS框架中用于MLLM评估，从而能够对未来的MLLM进行快速评估。项目网页：此HTTPS URL

[NLP-61] From Text to Insight: Leveraging Large Language Models for Performance Evaluation in Management

Link: https://arxiv.org/abs/2408.05328
Authors: Ning Li,Huaikang Zhou,Mingze Xu
Keywords: Large Language Models, Language Models, Large Language, organizational task performance, enhance objectivity
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); General Economics (econ.GN)
Comments: 39 pages, 8 figures, 5 tables

Abstract:This study explores the potential of Large Language Models (LLMs), specifically GPT-4, to enhance objectivity in organizational task performance evaluations. Through comparative analyses across two studies, including various task performance outputs, we demonstrate that LLMs can serve as a reliable and even superior alternative to human raters in evaluating knowledge-based performance outputs, which are a key contribution of knowledge workers. Our results suggest that GPT ratings are comparable to human ratings but exhibit higher consistency and reliability. Additionally, combining multiple GPT ratings of the same performance output yields strong correlations with aggregated human performance ratings, akin to the consensus principle observed in performance evaluation literature. However, we also find that LLMs are prone to contextual biases, such as the halo effect, mirroring human evaluative biases. Our research suggests that while LLMs are capable of extracting meaningful constructs from text-based data, their scope is currently limited to specific forms of performance evaluation. By highlighting both the potential and limitations of LLMs, our study contributes to the discourse on AI's role in management studies and sets a foundation for future research to refine AI's theoretical and practical applications in management.

[NLP-62] A Psychology-based Unified Dynamic Framework for Curriculum Learning

Link: https://arxiv.org/abs/2408.05326
Authors: Guangyu Meng,Qingkai Zeng,John P. Lalor,Hong Yu
Keywords: Directly learning, random difficulty levels, Curriculum Learning, machine learning, learning
Categories: Computation and Language (cs.CL)

Abstract:Directly learning from examples of random difficulty levels is often challenging for both humans and machine learning models. A more effective strategy involves exposing learners to examples in a progressive order, from easy to difficult. Curriculum Learning (CL) has been proposed to implement this strategy in machine learning model training. However, two key challenges persist in CL framework design: defining the difficulty of training data and determining the appropriate amount of data to input at each training step. This paper presents a Psychology-based Unified Dynamic Framework for Curriculum Learning (PUDF), drawing inspiration from psychometrics. We quantify the difficulty of training data by applying Item Response Theory (IRT) to responses from Artificial Crowds (AC). This theory-driven IRT-AC approach leads to global (i.e., model-independent) and interpretable difficulty values. Leveraging IRT, we propose a Dynamic Data Selection via Model Ability Estimation (DDS-MAE) strategy to schedule the appropriate amount of data during model training. Since our difficulty labeling and model ability estimation are based on a consistent theory, namely IRT, their values are comparable within the same scope, potentially leading to a faster convergence compared to the other CL methods. Experimental results demonstrate that fine-tuning pre-trained language models with PUDF enhances their performance on the GLUE benchmark. Moreover, PUDF surpasses other state-of-the-art (SOTA) CL methods on the GLUE benchmark. We further explore the components of PUDF, namely the difficulty measurer (IRT-AC) and the training scheduler (DDS-MAE) qualitatively and quantitatively. Lastly, we conduct an ablation study to clarify which components of PUDF contribute to faster convergence and higher accuracy.
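
A minimal sketch of the idea, assuming the one-parameter (Rasch) IRT model and a simple selection rule in which an example is used once its difficulty no longer exceeds the model's estimated ability. The function names, the difficulty values, and the selection rule itself are illustrative assumptions and are not taken from the paper.

```python
import math


def p_correct(ability, difficulty):
    """Rasch (1PL) IRT: probability that a learner with this ability
    answers an item of this difficulty correctly."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def select_examples(difficulties, ability):
    """Hypothetical selection rule: keep the examples whose difficulty
    does not exceed the current ability estimate."""
    return sorted(name for name, b in difficulties.items() if b <= ability)


# Made-up difficulty values on the usual IRT logit scale.
difficulties = {"easy": -1.5, "medium": 0.0, "hard": 2.0}

early = select_examples(difficulties, ability=-0.5)  # early in training
late = select_examples(difficulties, ability=2.5)    # after ability has grown
```

With a low ability estimate only the easiest example is selected; as the estimate rises, the training pool grows to include every example.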

[NLP-63] MUSE: Multi-Knowledge Passing on the Edges Boosting Knowledge Graph Completion

Link: https://arxiv.org/abs/2408.05283
Authors: Pengjie Liu
Keywords: Knowledge Graph Completion, Graph Completion, Knowledge Graph, Deep Neural Networks, aims to predict
Categories: Computation and Language (cs.CL)

Abstract:Knowledge Graph Completion (KGC) aims to predict the missing information in the (head entity)-[relation]-(tail entity) triplet. Deep Neural Networks have achieved significant progress in the relation prediction task. However, most existing KGC methods focus on single features (e.g., entity IDs) and sub-graph aggregation, which cannot fully explore all the features in the Knowledge Graph (KG), and neglect the external semantic knowledge injection. To address these problems, we propose MUSE, a knowledge-aware reasoning model to learn a tailored embedding space in three dimensions for missing relation prediction through a multi-knowledge representation learning mechanism. Our MUSE consists of three parallel components: 1) Prior Knowledge Learning for enhancing the triplets’ semantic representation by fine-tuning BERT; 2) Context Message Passing for enhancing the context messages of KG; 3) Relational Path Aggregation for enhancing the path representation from the head entity to the tail entity. Our experimental results show that MUSE significantly outperforms other baselines on four public datasets, such as over 5.50% improvement in H@1 and 4.20% improvement in MRR on the NELL995 dataset. The code and all datasets will be released via this https URL.
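
For readers unfamiliar with the two reported metrics, the sketch below computes Hits@k (H@k) and Mean Reciprocal Rank (MRR) from the rank that a model assigns to the correct answer of each test query. The ranks are made-up illustrative values, not results from the paper.

```python
def hits_at_k(ranks, k):
    """Fraction of queries whose correct answer is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Average of 1/rank of the correct answer over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)


# Hypothetical 1-based ranks of the correct relation for four test triplets.
ranks = [1, 2, 4, 1]

h1 = hits_at_k(ranks, 1)            # two of four queries ranked first
mrr = mean_reciprocal_rank(ranks)   # (1 + 1/2 + 1/4 + 1) / 4
```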

[NLP-64] Large Model Strategic Thinking, Small Model Efficiency: Transferring Theory of Mind in Large Language Models

Link: https://arxiv.org/abs/2408.05241
Authors: Nunzio Lore,Alireza (Sepehr) Ilami,Babak Heydari
Keywords: models increases commensurately, art models increases, newer Large Language, Language Models continues, increases commensurately
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Emerging Technologies (cs.ET); Computer Science and Game Theory (cs.GT)
Comments: 18 pages, 6 figures

Abstract:As the performance of larger, newer Large Language Models continues to improve for strategic Theory of Mind (ToM) tasks, the demand for these state-of-the-art models increases commensurately. However, their deployment is costly both in terms of processing power and time. In this paper, we investigate the feasibility of creating smaller, simulation-ready agents by way of fine-tuning. To do this, we present a large pre-trained model with 20 unique scenarios that combine a social context with a social dilemma, record its answers, and use them for Q&A fine-tuning on a smaller model of the same family. Our focus is on in-context game-theoretic decision-making, the same domain within which human interaction occurs and that requires both a theory of mind (or a semblance thereof) and an understanding of social dynamics. We find that the fine-tuned smaller language model exhibited performance significantly closer to that of its larger relative, and that its improvements extended to areas and contexts beyond the ones provided in the training examples. On average for all games, through fine-tuning, the smaller model showed a 46% improvement in aligning with the behavior of the larger model, with 100% representing complete alignment. This suggests that our pipeline represents an efficient method to transmit some form of theory of mind to smaller models, creating improved and cheaply deployable algorithms in the process. Despite their simplicity and their associated shortcomings and limitations, our findings represent a stepping stone in the pursuit and training of specialized models for strategic and social decision making.

[NLP-65] Quantum Algorithms for Compositional Text Processing

Link: https://arxiv.org/abs/2408.06061
Authors: Tuomas Laakkonen (Quantinuum),Konstantinos Meichanetzidis (Quantinuum),Bob Coecke (Quantinuum)
Keywords: natural language processing, natural language, quantum natural language, Quantum, language processing
Categories: Quantum Physics (quant-ph); Computation and Language (cs.CL)
Comments: In Proceedings QPL 2024, arXiv:2408.05113

Abstract:Quantum computing and AI have found a fruitful intersection in the field of natural language processing. We focus on the recently proposed DisCoCirc framework for natural language, and propose a quantum adaptation, QDisCoCirc. This is motivated by a compositional approach to rendering AI interpretable: the behavior of the whole can be understood in terms of the behavior of parts, and the way they are put together. For the model-native primitive operation of text similarity, we derive quantum algorithms for fault-tolerant quantum computers to solve the task of question-answering within QDisCoCirc, and show that this is BQP-hard; note that we do not consider the complexity of question-answering in other natural language processing models. Assuming widely-held conjectures, implementing the proposed model classically would require super-polynomial resources. Therefore, it could provide a meaningful demonstration of the power of practical quantum processors. The model construction builds on previous work in compositional quantum natural language processing. Word embeddings are encoded as parameterized quantum circuits, and compositionality here means that the quantum circuits compose according to the linguistic structure of the text. We outline a method for evaluating the model on near-term quantum processors, and elsewhere we report on a recent implementation of this on quantum hardware. In addition, we adapt a quantum algorithm for the closest vector problem to obtain a Grover-like speedup in the fault-tolerant regime for our model. This provides an unconditional quadratic speedup over any classical algorithm in certain circumstances, which we will verify empirically in future work.

[NLP-66] VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing

Link: https://arxiv.org/abs/2408.05758
Authors: Chunyu Qiang,Wang Geng,Yi Zhao,Ruibo Fu,Tao Wang,Cheng Gong,Tianrui Wang,Qiuyu Liu,Jiangyan Yi,Zhengqi Wen,Chen Zhang,Hao Che,Longbiao Wang,Jianwu Dang,Jianhua Tao
Keywords: brought significant improvements, Deep learning, Vector Quantized Contrastive, cross-modal representation learning, cross-modal sequence representation
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)

Abstract:Deep learning has brought significant improvements to the field of cross-modal representation learning. For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired, emphasizing the semantic content of the text modality while de-emphasizing the paralinguistic information of the speech modality. We propose a method called “Vector Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP)”, which uses the cross-modal aligned sequence transcoder to bring text and speech into a joint multimodal space, learning how to connect text and speech at the frame level. The proposed VQ-CTAP is a paradigm for cross-modal sequence representation learning, offering a promising solution for fine-grained generation and recognition tasks in speech processing. The VQ-CTAP can be directly applied to VC and ASR tasks without fine-tuning or additional structures. We propose a sequence-aware semantic connector, which connects multiple frozen pre-trained modules for the TTS task, exhibiting a plug-and-play capability. We design a stepping optimization strategy to ensure effective model convergence by gradually injecting and adjusting the influence of various loss components. Furthermore, we propose a semantic-transfer-wise paralinguistic consistency loss to enhance representational capabilities, allowing the model to better generalize to unseen data and capture the nuances of paralinguistic information. In addition, VQ-CTAP achieves high-compression speech coding at a rate of 25Hz from 24kHz input waveforms, which is a 960-fold reduction in the sampling rate. The audio demo is available at this https URL
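
The 960-fold figure follows directly from the two rates quoted in the abstract. The short sketch below works through the arithmetic; the constant and function names are illustrative assumptions.

```python
SAMPLE_RATE_HZ = 24_000  # input waveform rate quoted in the abstract
FRAME_RATE_HZ = 25       # coded frame rate quoted in the abstract

# Each coded frame stands in for this many waveform samples.
samples_per_frame = SAMPLE_RATE_HZ // FRAME_RATE_HZ


def num_frames(duration_s):
    """Number of coded frames needed for a clip of the given length."""
    return int(duration_s * FRAME_RATE_HZ)


frames_for_4s = num_frames(4.0)
samples_for_4s = int(4.0 * SAMPLE_RATE_HZ)
```

A four-second clip therefore shrinks from 96,000 waveform samples to 100 frames.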

[NLP-67] Improving Whisper's Recognition Performance for Under-Represented Language Kazakh Leveraging Unpaired Speech and Text INTERSPEECH2024

Link: https://arxiv.org/abs/2408.05554
Authors: Jinpeng Li,Yu Pu,Qi Sun,Wei-Qiang Zhang
Keywords: made significant progress, large-scale automatic speech, large-scale automatic, made significant, significant progress
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Comments: Accepted by INTERSPEECH 2024; minor typo correction

Abstract:Whisper and other large-scale automatic speech recognition models have made significant progress in performance. However, their performance on many low-resource languages, such as Kazakh, is not satisfactory. It is worth researching how to utilize low-cost data to improve the performance of Whisper on under-represented languages. In this study, we utilized easily accessible unpaired speech and text data and combined the language model GPT with Whisper on Kazakh. We implemented end of transcript (EOT) judgment modification and hallucination penalty to improve the performance of speech recognition. Further, we employed the decoding average token log probability as a criterion to select samples from unlabeled speech data and used pseudo-labeled data to fine-tune the model to further improve its performance. Ultimately, we achieved more than 10% absolute WER reduction in multiple experiments, and the whole process has the potential to be generalized to other under-represented languages.
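
The pseudo-label selection step can be sketched as a simple filter on the average token log probability of each decoded hypothesis. The threshold, the example values, and the function names below are hypothetical; the abstract does not state the cut-off the authors used.

```python
def average_log_prob(token_log_probs):
    """Mean log probability over the tokens of one decoded hypothesis."""
    return sum(token_log_probs) / len(token_log_probs)


def select_pseudo_labels(hypotheses, threshold):
    """Keep transcripts whose average token log probability reaches the
    threshold. `hypotheses` is a list of (text, token_log_probs) pairs."""
    return [
        text
        for text, log_probs in hypotheses
        if log_probs and average_log_prob(log_probs) >= threshold
    ]


# Made-up decoder outputs for two unlabeled utterances.
hypotheses = [
    ("utterance a", [-0.1, -0.2, -0.3]),  # confident decode
    ("utterance b", [-1.5, -2.0]),        # uncertain decode
]

selected = select_pseudo_labels(hypotheses, threshold=-0.5)
```

Only the confident hypothesis survives the filter and would be kept as pseudo-labeled data for fine-tuning.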

Artificial Intelligence

[AI-0] LOLgorithm: Integrating Semantic, Syntactic and Contextual Elements for Humor Classification

Link: https://arxiv.org/abs/2408.06335
Authors: Tanisha Khurana,Kaushik Pillalamarri,Vikram Pande,Munindar Singh
Keywords: Natural Language Processing, Language Processing, Natural Language, paper explores humor, paper explores
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

[AI-1] VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents

Link: https://arxiv.org/abs/2408.06327
Authors: Xiao Liu,Tianjie Zhang,Yu Gu,Iat Long Iong,Yifan Xu,Xixuan Song,Shudan Zhang,Hanyu Lai,Xinyi Liu,Hanlin Zhao,Jiadai Sun,Xinyue Yang,Yu Yang,Zehan Qi,Shuntian Yao,Xueqiao Sun,Siyi Cheng,Qinkai Zheng,Hao Yu,Hanchen Zhang,Wenyi Hong,Ming Ding,Lihang Pan,Xiaotao Gu,Aohan Zeng,Zhengxiao Du,Chan Hee Song,Yu Su,Yuxiao Dong,Jie Tang
Keywords: Large Multimodal Models, Large Multimodal, form highly capable, highly capable Visual, Visual Foundation Agents
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

[AI-2] Can We Rely on LLM Agents to Draft Long-Horizon Plans? Let's Take TravelPlanner as an Example

Link: https://arxiv.org/abs/2408.06318
Authors: Yanan Chen,Ali Pesaranghader,Tanmana Sadhu,Dong Hoon Yi
Keywords: Large language models, artificial general intelligence, brought autonomous agents, autonomous agents closer, Large language
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 13 pages, 2 figures, 4 tables

Abstract:Large language models (LLMs) have brought autonomous agents closer to artificial general intelligence (AGI) due to their promising generalization and emergent capabilities. There is, however, a lack of studies on how LLM-based agents behave, why they could potentially fail, and how to improve them, particularly in demanding real-world planning tasks. In this paper, as an effort to fill the gap, we present our study using a realistic benchmark, TravelPlanner, where an agent must meet multiple constraints to generate accurate plans. We leverage this benchmark to address four key research questions: (1) are LLM agents robust enough to lengthy and noisy contexts when it comes to reasoning and planning? (2) can few-shot prompting adversely impact the performance of LLM agents in scenarios with long context? (3) can we rely on refinement to improve plans, and (4) can fine-tuning LLMs with both positive and negative feedback lead to further improvement? Our comprehensive experiments indicate that, firstly, LLMs often fail to attend to crucial parts of a long context, despite their ability to handle extensive reference information and few-shot examples; secondly, they still struggle with analyzing the long plans and cannot provide accurate feedback for refinement; thirdly, we propose Feedback-Aware Fine-Tuning (FAFT), which leverages both positive and negative feedback, resulting in substantial gains over Supervised Fine-Tuning (SFT). Our findings offer in-depth insights to the community on various aspects related to real-world planning applications.

[AI-3] Body Transformer: Leveraging Robot Embodiment for Policy Learning

Link: https://arxiv.org/abs/2408.06316
Authors: Carmelo Sferrazza,Dun-Ming Huang,Fangchen Liu,Jongmin Lee,Pieter Abbeel
Keywords: natural language processing, machine learning algorithms, learning algorithms applied, recent years, computer vision
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:In recent years, the transformer architecture has become the de facto standard for machine learning algorithms applied to natural language processing and computer vision. Despite notable evidence of successful deployment of this architecture in the context of robot learning, we claim that vanilla transformers do not fully exploit the structure of the robot learning problem. Therefore, we propose Body Transformer (BoT), an architecture that leverages the robot embodiment by providing an inductive bias that guides the learning process. We represent the robot body as a graph of sensors and actuators, and rely on masked attention to pool information throughout the architecture. The resulting architecture outperforms the vanilla transformer, as well as the classical multilayer perceptron, in terms of task completion, scaling properties, and computational efficiency when representing either imitation or reinforcement learning policies. Additional material including the open-source code is available at this https URL.
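
One way to picture masked attention over a body graph, as a sketch under stated assumptions: each sensor or actuator is a node, and a node may attend only to itself and to its direct neighbours. The three-node chain and the function name are hypothetical, and the abstract does not spell out the exact masking scheme the authors use.

```python
def attention_mask(num_nodes, edges):
    """Boolean mask where mask[i][j] is True when node i may attend to node j.
    Assumption: a node attends to itself and to its graph neighbours."""
    mask = [[i == j for j in range(num_nodes)] for i in range(num_nodes)]
    for a, b in edges:
        mask[a][b] = True
        mask[b][a] = True  # treat the body graph as undirected
    return mask


# Hypothetical three-link chain: torso (0), upper arm (1), hand (2).
mask = attention_mask(3, edges=[(0, 1), (1, 2)])
```

In this chain the torso and the hand cannot attend to each other directly, so information between them has to pass through the upper arm across successive layers.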

[AI-4] OWL2Vec4OA: Tailoring Knowledge Graph Embeddings for Ontology Alignment

Link: https://arxiv.org/abs/2408.06310
Authors: Sevinj Teymurova,Ernesto Jiménez-Ruiz,Tillman Weyde,Jiaoyan Chen
Keywords: achieving semantic interoperability, ontologies covering intersecting, covering intersecting domains, Ontology alignment, domains is increasing
Categories: Artificial Intelligence (cs.AI)
Comments: Submitted to a conference

Abstract:Ontology alignment is integral to achieving semantic interoperability as the number of available ontologies covering intersecting domains is increasing. This paper proposes OWL2Vec4OA, an extension of the ontology embedding system OWL2Vec*. While OWL2Vec* has emerged as a powerful technique for ontology embedding, it currently lacks a mechanism to tailor the embedding to the ontology alignment task. OWL2Vec4OA incorporates edge confidence values from seed mappings to guide the random walk strategy. We present the theoretical foundations, implementation details, and experimental evaluation of our proposed extension, demonstrating its potential effectiveness for ontology alignment tasks.
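
As a rough illustration of a confidence-guided random walk, the sketch below samples the next node with probability proportional to an edge weight. This proportional rule, the toy graph, and all names are assumptions made for the example; the abstract only says that edge confidence values from seed mappings guide the walk.

```python
import random


def confidence_walk(graph, start, length, rng):
    """Random walk in which each outgoing edge is chosen with probability
    proportional to its confidence weight (an assumed rule)."""
    walk = [start]
    while len(walk) < length:
        neighbours = graph.get(walk[-1], [])
        if not neighbours:
            break  # dead end: stop early
        nodes = [node for node, _ in neighbours]
        weights = [weight for _, weight in neighbours]
        walk.append(rng.choices(nodes, weights=weights, k=1)[0])
    return walk


# Hypothetical graph over two ontologies "a" and "b"; the a:Heart -> b:Heart
# edge stands for a seed mapping with confidence 0.9.
graph = {
    "a:Heart": [("b:Heart", 0.9), ("a:Organ", 0.1)],
    "b:Heart": [("b:Organ", 1.0)],
    "a:Organ": [("a:Heart", 1.0)],
}

walk = confidence_walk(graph, "a:Heart", length=4, rng=random.Random(0))
```

Walks generated this way cross between the two ontologies more often along high-confidence mappings, which is the kind of bias an alignment-oriented embedding would want in its training corpus.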

[AI-5] The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

Link: https://arxiv.org/abs/2408.06292
Authors: Chris Lu,Cong Lu,Robert Tjarko Lange,Jakob Foerster,Jeff Clune,David Ha
Keywords: artificial general intelligence, developing agents capable, discovering new knowledge, grand challenges, challenges of artificial
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

[AI-6] Synthetic Patient-Physician Dialogue Generation from Clinical Notes Using LLM

Link: https://arxiv.org/abs/2408.06285
Authors: Trisha Das,Dina Albassam,Jimeng Sun
Keywords: enhance patient-physician communication, improve healthcare accessibility, Medical dialogue systems, enhance patient-physician, patient-physician communication
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

[AI-7] MovieSum: An Abstractive Summarization Dataset for Movie Screenplays ACL2024

Link: https://arxiv.org/abs/2408.06281
Authors: Rohit Saxena,Frank Keller
Keywords: long input contexts, long input, input contexts, movie screenplays, Movie
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: ACL 2024 Findings

[AI-8] Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment

Link: https://arxiv.org/abs/2408.06266
Authors: Karel D’Oosterlinck,Winnie Xu,Chris Develder,Thomas Demeester,Amanpreet Singh,Christopher Potts,Douwe Kiela,Shikib Mehri
Keywords: Large Language Models, Large Language, Language Models, alignment objectives, Anchored Preference Optimization
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

[AI-9] Audio Enhancement for Computer Audition – An Iterative Training Paradigm Using Sample Importance

Link: https://arxiv.org/abs/2408.06264
Authors: Manuel Milling,Shuo Liu,Andreas Triantafyllopoulos,Ilhan Aslan,Björn W. Schuller
Keywords: Neural network models, acoustic scene classification, Neural network, scene classification, acoustic scene
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)

Abstract:Neural network models for audio tasks, such as automatic speech recognition (ASR) and acoustic scene classification (ASC), are susceptible to noise contamination for real-life applications. To improve audio quality, an enhancement module, which can be developed independently, is explicitly used at the front-end of the target audio applications. In this paper, we present an end-to-end learning solution to jointly optimise the models for audio enhancement (AE) and the subsequent applications. To guide the optimisation of the AE module towards a target application, and especially to overcome difficult samples, we make use of the sample-wise performance measure as an indication of sample importance. In experiments, we consider four representative applications to evaluate our training paradigm, i.e., ASR, speech command recognition (SCR), speech emotion recognition (SER), and ASC. These applications are associated with speech and non-speech tasks concerning semantic and non-semantic features, transient and global information, and the experimental results indicate that our proposed approach can considerably boost the noise robustness of the models, especially at low signal-to-noise ratios (SNRs), for a wide range of computer audition tasks in everyday-life noisy environments.

[AI-10] Open-Source Molecular Processing Pipeline for Generating Molecules

Link: https://arxiv.org/abs/2408.06261
Authors: Shreyas V,Jose Siguenza,Karan Bania,Bharath Ramsundar
Keywords: shown considerable promise, Cao and Kipf, Generative Adversarial Networks, computational chemistry, molecules have shown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
Comments: Presented at the 2024 Molecular Machine Learning Conference (MoML 2024)

Abstract:Generative models for molecules have shown considerable promise for use in computational chemistry, but remain difficult to use for non-experts. For this reason, we introduce open-source infrastructure for easily building generative molecular models into the widely used DeepChem [Ramsundar et al., 2019] library with the aim of creating a robust and reusable molecular generation pipeline. In particular, we add high quality PyTorch [Paszke et al., 2019] implementations of the Molecular Generative Adversarial Networks (MolGAN) [Cao and Kipf, 2022] and Normalizing Flows [Papamakarios et al., 2021]. Our implementations show strong performance comparable with past work [Kuznetsov and Polykovskiy, 2021, Cao and Kipf, 2022].

[AI-11] Decentralized Intelligence Health Network (DIHN)

Link: https://arxiv.org/abs/2408.06240
Authors: Abraham Nash
Keywords: Intelligence Health Network, Decentralized Intelligence Health, addressing significant challenges, sovereign health network, health data sovereignty
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Distributed, Parallel, and Cluster Computing (cs.DC); Emerging Technologies (cs.ET)
Comments: 17 pages, 7 figures. arXiv admin note: substantial text overlap with arXiv:2407.02461

Abstract:Decentralized Intelligence Health Network (DIHN) is a theoretical framework addressing significant challenges of health data sovereignty and AI utilization in healthcare caused by data fragmentation across providers and institutions. It establishes a sovereign architecture for healthcare provision as a prerequisite to a sovereign health network, then facilitates effective AI utilization by overcoming barriers to accessing diverse medical data sources. This comprehensive framework leverages: 1) self-sovereign identity architecture coupled with a personal health record (PHR) as a prerequisite for health data sovereignty; 2) a scalable federated learning (FL) protocol implemented on a public blockchain for decentralized AI training in healthcare, where health data remains with participants and only model parameter updates are shared; and 3) a scalable, trustless rewards mechanism to incentivize participation and ensure fair reward distribution. This framework ensures that no entity can prevent or control access to training on health data offered by participants or determine financial benefits, as these processes operate on a public blockchain with an immutable record and without a third party. It supports effective AI training in healthcare, allowing patients to maintain control over their health data, benefit financially, and contribute to a decentralized, scalable ecosystem that leverages collective AI to develop beneficial healthcare algorithms. Patients receive rewards into their digital wallets as an incentive to opt-in to the FL protocol, with a long-term roadmap to funding decentralized insurance solutions. This approach introduces a novel, self-financed healthcare model that adapts to individual needs, complements existing systems, and redefines universal coverage. It highlights the potential to transform healthcare data management and AI utilization while empowering patients.
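
The point that only model parameter updates leave a participant can be illustrated with federated averaging (FedAvg), a common aggregation rule in federated learning. This is a sketch that swaps in FedAvg as an assumption: the abstract does not say which aggregation rule the DIHN protocol uses, and the blockchain and reward layers are omitted entirely.

```python
def federated_average(updates):
    """Weighted average of parameter vectors, weighted by local sample count.
    `updates` is a list of (parameters, num_local_samples) pairs; the raw
    health records never appear here, only the locally trained parameters."""
    total = sum(count for _, count in updates)
    dim = len(updates[0][0])
    return [
        sum(params[i] * count for params, count in updates) / total
        for i in range(dim)
    ]


# Two hypothetical participants with made-up parameters and sample counts.
updates = [
    ([1.0, 2.0], 1),  # participant with 1 local record
    ([3.0, 4.0], 3),  # participant with 3 local records
]

global_params = federated_average(updates)
```

The participant with more local data pulls the global parameters further toward its own update.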

[AI-12] FLEURS-R: A Restored Multilingual Speech Corpus for Generation Tasks

Link: https://arxiv.org/abs/2408.06227
Authors: Min Ma,Yuma Koizumi,Shigeki Karita,Heiga Zen,Jason Riesa,Haruko Ishikawa,Michiel Bacchiani
Keywords: Few-shot Learning Evaluation, Few-shot Learning, Universal Representations, restoration applied version, paper introduces FLEURS-R
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)

[AI-13] A Large-Scale Study of Model Integration in ML-Enabled Software Systems

Link: https://arxiv.org/abs/2408.06226
Authors: Yorick Sens,Henriette Knopp,Sven Peldszus,Thorsten Berger
Keywords: machine learning, models, rise of machine, drastically changed, systems
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:The rise of machine learning (ML) and its embedding in systems has drastically changed the engineering of software-intensive systems. Traditionally, software engineering focuses on manually created artifacts such as source code and the process of creating them, as well as best practices for integrating them, i.e., software architectures. In contrast, the development of ML artifacts, i.e. ML models, comes from data science and focuses on the ML models and their training data. However, to deliver value to end users, these ML models must be embedded in traditional software, often forming complex topologies. In fact, ML-enabled software can easily incorporate many different ML models. While the challenges and practices of building ML-enabled systems have been studied to some extent, beyond isolated examples, little is known about the characteristics of real-world ML-enabled systems. Properly embedding ML models in systems so that they can be easily maintained or reused is far from trivial. We need to improve our empirical understanding of such systems, which we address by presenting the first large-scale study of real ML-enabled software systems, covering over 2,928 open source systems on GitHub. We classified and analyzed them to determine their characteristics, as well as their practices for reusing ML models and related code, and the architecture of these systems. Our findings provide practitioners and researchers with insight into practices for embedding and integrating ML models, bringing data science and software engineering closer together.

[AI-14] On Effects of Steering Latent Representation for Large Language Model Unlearning

Link: https://arxiv.org/abs/2408.06223
Authors: Dang Huu-Tien, Trung-Tin Pham, Hoang Thanh-Tung, Naoya Inoue
Keywords-EN: large language model, Representation Misdirection, large language, target random representation, steers model representation
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*Comments: 15 pages, 5 figures, 8 tables

Click to view abstract

[AI-15] Strategy Game-Playing with Size-Constrained State Abstraction

Link: https://arxiv.org/abs/2408.06202
Authors: Linjie Xu, Diego Perez-Liebana, Alexander Dockhorn
Keywords-EN: Playing strategy games, Playing strategy, state abstraction, artificial intelligence, challenging problem
Categories: Artificial Intelligence (cs.AI)
*Comments: 8 pages, to be published in Proceedings of the Conference on Games 2024, code is open-sourced at this https URL

Click to view abstract

Abstract:Playing strategy games is a challenging problem for artificial intelligence (AI). One of the major challenges is the large search space due to a diverse set of game components. In recent works, state abstraction has been applied to search-based game AI and has brought significant performance improvements. State abstraction techniques rely on reducing the search space, e.g., by aggregating similar states. However, the application of these abstractions is hindered because the quality of an abstraction is difficult to evaluate. Previous works hence abandon the abstraction in the middle of the search to not bias the search to a local optimum. This mechanism introduces a hyper-parameter to decide the time to abandon the current state abstraction. In this work, we propose a size-constrained state abstraction (SCSA), an approach that limits the maximum number of nodes being grouped together. We found that with SCSA, the abstraction is not required to be abandoned. Our empirical results on 3 strategy games show that the SCSA agent outperforms the previous methods and yields robust performance over different games. The code is open-sourced at this https URL.
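
The size constraint described in this abstract can be illustrated with a toy grouping routine. This is a minimal sketch and not the authors' implementation: `group_states`, the `key` function (a stand-in for whatever similarity criterion the abstraction uses) and `max_size` are hypothetical names chosen for the example.

```python
from collections import defaultdict

def group_states(states, key, max_size):
    """Group states that share an abstraction key, capping each group at max_size.

    `key` is a hypothetical similarity criterion. When the current group for a
    key is full, a new group with the same key is opened instead of growing it.
    """
    groups = defaultdict(list)  # key -> list of groups, each a list of states
    for state in states:
        buckets = groups[key(state)]
        if not buckets or len(buckets[-1]) >= max_size:
            buckets.append([])
        buckets[-1].append(state)
    return [group for buckets in groups.values() for group in buckets]

# Seven toy states, abstracted by parity, with at most two states per group.
example = group_states(range(7), key=lambda s: s % 2, max_size=2)
```

Here no group ever exceeds `max_size`, which is the only property the sketch is meant to show; how the paper picks which nodes to merge is not stated in the abstract.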

[AI-16] Dynamic Blocked Clause Elimination for Projected Model Counting

Link: https://arxiv.org/abs/2408.06199
Authors: Jean-Marie Lagniez, Pierre Marquis, Armin Biere
Keywords-EN: blocked clause elimination, blocked clause, clause elimination, clause, model counting
Categories: Artificial Intelligence (cs.AI)
*Comments: LIPIcs, Volume 305, SAT 2024

Click to view abstract

Abstract:In this paper, we explore the application of blocked clause elimination for projected model counting. This is the problem of determining the number of models $\|\exists X.\Sigma\|$ of a propositional formula $\Sigma$ after eliminating a given set $X$ of variables existentially. Although blocked clause elimination is a well-known technique for SAT solving, its direct application to model counting is challenging as in general it changes the number of models. However, we demonstrate, by focusing on projected variables during the blocked clause search, that blocked clause elimination can be leveraged while preserving the correct model count. To take advantage of blocked clause elimination in an efficient way during model counting, a novel data structure and associated algorithms are introduced. Our proposed approach is implemented in the model counter d4. Our experiments demonstrate the computational benefits of our new method of blocked clause elimination for projected model counting.
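
For readers unfamiliar with the term, the standard definition is that a clause is blocked on one of its literals if every resolvent on that literal with the rest of the formula is a tautology. The sketch below checks only this textbook condition; it does not model the paper's restriction to projected variables or its data structure. Literals are assumed to be non-zero integers in the DIMACS style, and the function names are illustrative.

```python
def is_tautology(clause):
    """A clause is tautological if it contains a literal and its negation."""
    return any(-lit in clause for lit in clause)

def is_blocked(clause, formula, lit):
    """Return True if `clause` is blocked on `lit` with respect to `formula`.

    Clauses are sets of non-zero ints; -x denotes the negation of x.
    """
    if lit not in clause:
        return False
    for other in formula:
        if -lit in other:
            # Resolve `clause` and `other` on `lit`.
            resolvent = (clause - {lit}) | (other - {-lit})
            if not is_tautology(resolvent):
                return False
    return True

# (a or b) and (not a or not b): the first clause is blocked on a.
small = [frozenset({1, 2}), frozenset({-1, -2})]
# Adding (not a or c) yields the non-tautological resolvent (b or c).
larger = small + [frozenset({-1, 3})]
```

With these inputs, `is_blocked(small[0], small, 1)` holds, while the same clause is no longer blocked on `1` in `larger`.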

[AI-17] Palantir: Towards Efficient Super Resolution for Ultra-high-definition Live Streaming

Link: https://arxiv.org/abs/2408.06152
Authors: Xinqi Jin, Zhui Zhu, Xikai Sun, Fan Dang, Jiangchuan Liu, Jingao Xu, Kebin Liu, Xinlei Chen, Yunhao Liu
Keywords-EN: super-resolution deep neural, deep neural networks, neural networks opens, Neural enhancement, deep neural
Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Networking and Internet Architecture (cs.NI)
*Comments:

Click to view abstract

Abstract:Neural enhancement through super-resolution deep neural networks opens up new possibilities for ultra-high-definition live streaming over existing encoding and networking infrastructure. Yet, the heavy SR DNN inference overhead leads to severe deployment challenges. To reduce the overhead, existing systems propose to apply DNN-based SR only on selected anchor frames while upscaling non-anchor frames via the lightweight reusing-based SR approach. However, frame-level scheduling is coarse-grained and fails to deliver optimal efficiency. In this work, we propose Palantir, the first neural-enhanced UHD live streaming system with fine-grained patch-level scheduling. In the presented solutions, two novel techniques are incorporated to make good scheduling decisions for inference overhead optimization and reduce the scheduling latency. Firstly, under the guidance of our pioneering and theoretical analysis, Palantir constructs a directed acyclic graph (DAG) for lightweight yet accurate quality estimation under any possible anchor patch set. Secondly, to further optimize the scheduling latency, Palantir improves parallelizability by refactoring the computation subprocedure of the estimation process into a sparse matrix-matrix multiplication operation. The evaluation results suggest that Palantir incurs a negligible scheduling latency accounting for less than 5.7% of the end-to-end latency requirement. When compared to the state-of-the-art real-time frame-level scheduling strategy, Palantir reduces the energy overhead of SR-integrated mobile clients by 38.1% at most (and 22.4% on average) and the monetary costs of cloud-based SR by 80.1% at most (and 38.4% on average).

[AI-18] Med42-v2: A Suite of Clinical LLMs

Link: https://arxiv.org/abs/2408.06142
Authors: Clément Christophe, Praveen K Kanithi, Tathagata Raha, Shadab Khan, Marco AF Pimentel
Keywords-EN: large language models, clinical large language, introduces a suite, designed to address, large language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

[AI-19] A Methodological Report on Anomaly Detection on Dynamic Knowledge Graphs

Link: https://arxiv.org/abs/2408.06121
Authors: Xiaohua Lu, Leshanshui Yang
Keywords-EN: dynamic knowledge graph, Kubernetes applications, environment for Kubernetes, dynamic knowledge, Knowledge Graph Anomaly
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:In this paper, we explore different approaches to anomaly detection on dynamic knowledge graphs, specifically in a microservices environment for Kubernetes applications. Our approach explores three dynamic knowledge graph representations: sequential data, one-hop graph structure, and two-hop graph structure, with each representation incorporating increasingly complex structural information. Each phase includes different machine learning and deep learning models. We empirically analyse their performance and propose an approach based on ensemble learning of these models. Our approach significantly outperforms the baseline on the ISWC 2024 Dynamic Knowledge Graph Anomaly Detection dataset, providing a robust solution for anomaly detection in dynamic complex data.

[AI-20] Generalization capabilities of MeshGraphNets to unseen geometries for fluid dynamics

Link: https://arxiv.org/abs/2408.06101
Authors: Robin Schmöcker, Alexander Henkes, Julian Roth, Thomas Wick
Keywords-EN: Learning Mesh-Based Simulation, Graph Networks, Simulation with Graph, Learning Mesh-Based, Mesh-Based Simulation
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA); Fluid Dynamics (physics.flu-dyn)
*Comments:

Click to view abstract

Abstract:This work investigates the generalization capabilities of MeshGraphNets (MGN) [Pfaff et al. Learning Mesh-Based Simulation with Graph Networks. ICML 2021] to unseen geometries for fluid dynamics, e.g., predicting the flow around a new obstacle that was not part of the training data. For this purpose, we create a new benchmark dataset for data-driven computational fluid dynamics (CFD) which extends DeepMind’s flow around a cylinder dataset by including different shapes and multiple objects. We then use this new dataset to extend the generalization experiments conducted by DeepMind on MGNs by testing how well an MGN can generalize to different shapes. In our numerical tests, we show that MGNs can sometimes generalize well to various shapes by training on a dataset of one obstacle shape and testing on a dataset of another obstacle shape.

[AI-21] Building Decision Making Models Through Language Model Regime

Link: https://arxiv.org/abs/2408.06087
Authors: Yu Zhang, Haoxiang Liu, Feijun Jiang, Weihua Luo, Kaifu Zhang
Keywords-EN: decision making, making problems leveraging, decision, making, problems leveraging
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments:

Click to view abstract

[AI-22] Fully Bayesian Differential Gaussian Processes through Stochastic Differential Equations

Link: https://arxiv.org/abs/2408.06069
Authors: Jian Xu, Zhiqi Lin, Min Chen, Junmei Yang, Delu Zeng, John Paisley
Keywords-EN: deep Gaussian process, infinitely deep Gaussian, Traditional deep Gaussian, deep Gaussian, infinitely deep
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Traditional deep Gaussian processes model the data evolution using a discrete hierarchy, whereas differential Gaussian processes (DIFFGPs) represent the evolution as an infinitely deep Gaussian process. However, prior DIFFGP methods often overlook the uncertainty of kernel hyperparameters and assume them to be fixed and time-invariant, failing to leverage the unique synergy between continuous-time models and approximate inference. In this work, we propose a fully Bayesian approach that treats the kernel hyperparameters as random variables and constructs coupled stochastic differential equations (SDEs) to learn their posterior distribution and that of inducing points. By incorporating estimation uncertainty on hyperparameters, our method enhances the model’s flexibility and adaptability to complex dynamics. Additionally, our approach provides a time-varying, comprehensive, and realistic posterior approximation through coupling variables using SDE methods. Experimental results demonstrate the advantages of our method over traditional approaches, showcasing its superior performance in terms of flexibility, accuracy, and other metrics. Our work opens up exciting research avenues for advancing Bayesian inference and offers a powerful modeling tool for continuous-time Gaussian processes.

[AI-23] Online Optimization of Curriculum Learning Schedules using Evolutionary Optimization

Link: https://arxiv.org/abs/2408.06068
Authors: Mohit Jiwatode, Leon Schlecht, Alexander Dockhorn
Keywords-EN: Rolling Horizon Evolutionary, Horizon Evolutionary Algorithms, Rolling Horizon, automatically produce effective, produce effective curricula
Categories: Artificial Intelligence (cs.AI)
*Comments: 8 pages including abstract, to be published in the Proceedings of the IEEE Conference on Games 2024

Click to view abstract

Abstract:We propose RHEA CL, which combines Curriculum Learning (CL) with Rolling Horizon Evolutionary Algorithms (RHEA) to automatically produce effective curricula during the training of a reinforcement learning agent. RHEA CL optimizes a population of curricula, using an evolutionary algorithm, and selects the best-performing curriculum as the starting point for the next training epoch. Performance evaluations are conducted after every curriculum step in all environments. We evaluate the algorithm on the DoorKey and DynamicObstacles environments within the Minigrid framework. It demonstrates adaptability and consistent improvement, particularly in the early stages, while reaching a stable performance later that is capable of outperforming other curriculum learners. In comparison to other curriculum schedules, RHEA CL has been shown to yield performance improvements for the final reinforcement learning (RL) agent at the cost of additional evaluation during training.

[AI-24] An Investigation Into Explainable Audio Hate Speech Detection SIGDIAL2024

Link: https://arxiv.org/abs/2408.06065
Authors: Jinmyeong An, Wonjun Lee, Yejin Jeon, Jungseul Ok, Yunsu Kim, Gary Geunbae Lee
Keywords-EN: hate speech, hate speech detection, speech, content largely unexplored, hate
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*Comments: Accepted to SIGDIAL 2024

Click to view abstract

[AI-25] Perceptual Similarity for Measuring Decision-Making Style and Policy Diversity in Games

Link: https://arxiv.org/abs/2408.06051
Authors: Chiu-Chou Lin, Wei-Chen Chiu, I-Chen Wu
Keywords-EN: Defining and measuring, crucial in gaming, measuring decision-making styles, reflect a broad, broad spectrum
Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*Comments: TMLR 08/2024 this https URL

Click to view abstract

Abstract:Defining and measuring decision-making styles, also known as playstyles, is crucial in gaming, where these styles reflect a broad spectrum of individuality and diversity. However, finding a universally applicable measure for these styles poses a challenge. Building on Playstyle Distance, the first unsupervised metric to measure playstyle similarity based on game screens and raw actions, we introduce three enhancements to increase accuracy: multiscale analysis with varied state granularity, a perceptual kernel rooted in psychology, and the utilization of the intersection-over-union method for efficient evaluation. These innovations not only advance measurement precision but also offer insights into human cognition of similarity. Across two racing games and seven Atari games, our techniques significantly improve the precision of zero-shot playstyle classification, achieving an accuracy exceeding 90 percent with fewer than 512 observation-action pairs, which is less than half an episode of these games. Furthermore, our experiments with 2048 and Go demonstrate the potential of discrete playstyle measures in puzzle and board games. We also develop an algorithm for assessing decision-making diversity using these measures. Our findings improve the measurement of end-to-end game analysis and the evolution of artificial intelligence for diverse playstyles.
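
The intersection-over-union idea mentioned in this abstract can be sketched on plain sets. This is an illustrative assumption about how discrete observations might be compared, not the paper's metric: the function name and the idea of treating each player's visited discrete states as a set are hypothetical.

```python
def intersection_over_union(a, b):
    """Jaccard index of two collections of hashable items."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Convention assumed here: two empty sets count as identical.
        return 1.0
    return len(a & b) / len(union)

# Hypothetical discrete state IDs visited by two players.
player_a_states = {1, 2, 3}
player_b_states = {2, 3, 4}
overlap = intersection_over_union(player_a_states, player_b_states)
```

Two of the four distinct states are shared, so `overlap` is 0.5.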

[AI-26] Understanding Byzantine Robustness in Federated Learning with A Black-box Server

Link: https://arxiv.org/abs/2408.06042
Authors: Fangyuan Zhao, Yuexiang Xie, Xuebin Ren, Bolin Ding, Shusen Yang, Yaliang Li
Keywords-EN: malicious model updates, Federated learning, Byzantine attacks, advanced Byzantine attack, Byzantine attack algorithms
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*Comments: We have released code on this https URL

Click to view abstract

Abstract:Federated learning (FL) is vulnerable to Byzantine attacks, in which some participants try to damage the utility or hinder the convergence of the learned model by sending malicious model updates. Previous works propose to apply robust rules to aggregate updates from participants against different types of Byzantine attacks, while at the same time, attackers can further design advanced Byzantine attack algorithms targeting a specific aggregation rule when it is known. In practice, FL systems can involve a black-box server that makes the adopted aggregation rule inaccessible to participants, which can naturally defend against or weaken some Byzantine attacks. In this paper, we provide an in-depth understanding of the Byzantine robustness of the FL system with a black-box server. Our investigation demonstrates the improved Byzantine robustness of a black-box server employing a dynamic defense strategy. We provide both empirical evidence and theoretical analysis to reveal that the black-box server can mitigate the worst-case attack impact from a maximum level to an expectation level, which is attributed to the inherent inaccessibility and randomness offered by a black-box server. The source code is available at this https URL to promote further research in the community.

[AI-27] Spacetime E(n)-Transformer: Equivariant Attention for Spatio-temporal Graphs

Link: https://arxiv.org/abs/2408.06039
Authors: Sergio G. Charles
Keywords-EN: equivariant Transformer architecture, equivariant Transformer, Transformer architecture, spatio-temporal graph data, graph data
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:We introduce an E(n)-equivariant Transformer architecture for spatio-temporal graph data. By imposing rotation, translation, and permutation equivariance inductive biases in both space and time, we show that the Spacetime E(n)-Transformer (SET) outperforms purely spatial and temporal models without symmetry-preserving properties. We benchmark SET against said models on the charged N-body problem, a simple physical system with complex dynamics. While existing spatio-temporal graph neural networks focus on sequential modeling, we empirically demonstrate that leveraging underlying domain symmetries yields considerable improvements for modeling dynamical systems on graphs.

[AI-28] Peaking into the Black-box: Prediction Intervals Give Insight into Data-driven Quadrotor Model Reliability

Link: https://arxiv.org/abs/2408.06036
Authors: Jasper van Beers, Coen de Visser
Keywords-EN: quadrotor aerodynamic models, Ensuring the reliability, validity of data-driven, accepted and practical, data-driven quadrotor model
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
*Comments: Presented at AIAA SciTech Forum 2023 in National Harbor, MD, USA

Click to view abstract

Abstract:Ensuring the reliability and validity of data-driven quadrotor model predictions is essential for their accepted and practical use. This is especially true for grey- and black-box models wherein the mapping of inputs to predictions is not transparent and subsequent reliability notoriously difficult to ascertain. Nonetheless, such techniques are frequently and successfully used to identify quadrotor models. Prediction intervals (PIs) may be employed to provide insight into the consistency and accuracy of model predictions. This paper estimates such PIs for polynomial and Artificial Neural Network (ANN) quadrotor aerodynamic models. Two existing ANN PI estimation techniques - the bootstrap method and the quality driven method - are validated numerically for quadrotor aerodynamic models using an existing high-fidelity quadrotor simulation. Quadrotor aerodynamic models are then identified on real quadrotor flight data to demonstrate their utility and explore their sensitivity to model interpolation and extrapolation. It is found that the ANN-based PIs widen considerably when extrapolating and remain constant, or shrink, when interpolating. While this behaviour also occurs for the polynomial PIs, it is of lower magnitude. The estimated PIs establish probabilistic bounds within which the quadrotor model outputs will likely lie, subject to modelling and measurement uncertainties that are reflected through the PI widths.

[AI-29] Controlling Surprisal in Music Generation via Information Content Curve Matching

Link: https://arxiv.org/abs/2408.06022
Authors: Mathias Rose Bjare, Stefan Lattner, Gerhard Widmer
Keywords-EN: music generation systems, IIC, recent years, encouraging research, Instantaneous Information Content
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
*Comments: 8 pages, 4 figures, 2 tables, accepted at the 25th Int. Society for Music Information Retrieval Conf., San Francisco, USA, 2024

Click to view abstract

[AI-30] Uncertainty-Informed Volume Visualization using Implicit Neural Representation IEEE-VIS2024

Link: https://arxiv.org/abs/2408.06018
Authors: Shanu Saklani, Chitwan Goel, Shrey Bansal, Zhe Wang, Soumya Dutta, Tushar M. Athawale, David Pugmire, Christopher R. Johnson
Keywords-EN: Deep Neural Networks, Neural Networks, increasing adoption, Networks, visualization tasks
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments: To appear in IEEE Workshop on Uncertainty Visualization in conjunction with IEEE VIS 2024, Florida, USA

Click to view abstract

Abstract:The increasing adoption of Deep Neural Networks (DNNs) has led to their application in many challenging scientific visualization tasks. While advanced DNNs offer impressive generalization capabilities, understanding factors such as model prediction quality, robustness, and uncertainty is crucial. These insights can enable domain scientists to make informed decisions about their data. However, DNNs inherently lack the ability to estimate prediction uncertainty, necessitating new research to construct robust uncertainty-aware visualization techniques tailored for various visualization tasks. In this work, we propose uncertainty-aware implicit neural representations to model scalar field data sets effectively and comprehensively study the efficacy and benefits of estimated uncertainty information for volume visualization tasks. We evaluate the effectiveness of two principled deep uncertainty estimation techniques: (1) Deep Ensemble and (2) Monte Carlo Dropout (MCDropout). These techniques enable uncertainty-informed volume visualization in scalar field data sets. Our extensive exploration across multiple data sets demonstrates that uncertainty-aware models produce informative volume visualization results. Moreover, integrating prediction uncertainty enhances the trustworthiness of our DNN model, making it suitable for robustly analyzing and visualizing real-world scientific volumetric data sets.
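
Both techniques named in this abstract produce several predictions for the same input: one per ensemble member for Deep Ensemble, or one per stochastic forward pass with dropout left active for MC Dropout. A common way to summarize them is a per-sample mean and spread. The sketch below shows only that summary step on hand-written numbers; the function name and data are hypothetical, and the uncertainty measure the paper actually uses is not stated in the abstract.

```python
from statistics import mean, pstdev

def predictive_mean_and_std(predictions):
    """Summarize repeated predictions.

    `predictions` is a list of runs (ensemble members or dropout passes),
    each a list with one predicted scalar per sample point.
    """
    per_point = list(zip(*predictions))  # regroup values by sample point
    means = [mean(values) for values in per_point]
    stds = [pstdev(values) for values in per_point]
    return means, stds

# Two hypothetical runs over two sample points of a scalar field.
runs = [[1.0, 2.0],
        [3.0, 2.0]]
means, stds = predictive_mean_and_std(runs)
```

The runs disagree on the first point and agree on the second, so the spread is non-zero only for the first point.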

[AI-31] Transfer learning of state-based potential games for process optimization in decentralized manufacturing systems

Link: https://arxiv.org/abs/2408.05992
Authors: Steve Yuwono, Dorothea Schwung, Andreas Schwung
Keywords-EN: state-based potential games, enhancing distributed self-optimization, potential games, transfer learning, paper presents
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
*Comments: This pre-print was submitted to Computers in Industry on May 02, 2024

Click to view abstract

Abstract:This paper presents a novel transfer learning approach in state-based potential games (TL-SbPGs) for enhancing distributed self-optimization in manufacturing systems. The approach focuses on the practically relevant industrial setting where sharing and transferring gained knowledge among similarly behaving players improves the self-learning mechanism in large-scale systems. With TL-SbPGs, the gained knowledge can be reused by other players to optimize their policies, thereby improving the learning outcomes of the players and accelerating the learning process. To accomplish this goal, we develop transfer learning concepts and similarity criteria for players, which offer two distinct settings: (a) predefined similarities between players and (b) dynamically inferred similarities between players during training. We formally prove the applicability of the SbPG framework in transfer learning. Additionally, we introduce an efficient method to determine the optimal timing and weighting of the transfer learning procedure during the training phase. Through experiments on a laboratory-scale testbed, we demonstrate that TL-SbPGs significantly boost production efficiency and reduce the power consumption of the production schedules, while also outperforming native SbPGs.

[AI-32] Exploring and Learning Structure: Active Inference Approach in Navigational Agents

Link: https://arxiv.org/abs/2408.05982
Authors: Daria de Tinguy, Tim Verbelen, Bart Dhoedt
Keywords-EN: biologically inspired principles, Drawing inspiration, Active Inference Framework, animal navigation strategies, rooted in biologically
Categories: Artificial Intelligence (cs.AI); Robotics (cs.RO)
*Comments: IWAI workshop 2024

Click to view abstract

Abstract:Drawing inspiration from animal navigation strategies, we introduce a novel computational model for navigation and mapping, rooted in biologically inspired principles. Animals exhibit remarkable navigation abilities by efficiently using memory, imagination, and strategic decision-making to navigate complex and aliased environments. Building on these insights, we integrate traditional cognitive mapping approaches with an Active Inference Framework (AIF) to learn an environment structure in a few steps. Through the incorporation of topological mapping for long-term memory and AIF for navigation planning and structure learning, our model can dynamically apprehend environmental structures and expand its internal map with predicted beliefs during exploration. Comparative experiments with the Clone-Structured Graph (CSCG) model highlight our model’s ability to rapidly learn environmental structures in a single episode, with minimal navigation overlap. This is achieved without prior knowledge of the dimensions of the environment or the type of observations, showcasing its robustness and effectiveness in navigating ambiguous environments.

[AI-33] Freehand Sketch Generation from Mechanical Components ACM-MM

Link: https://arxiv.org/abs/2408.05966
Authors: Zhichao Liao, Di Huang, Heming Fang, Yue Ma, Fengyuan Piao, Xinghui Li, Long Zeng, Pingfa Feng
Keywords-EN: Drawing freehand sketches, Drawing freehand, AI-based engineering modeling, multimedia devices, devices for AI-based
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Multimedia (cs.MM)
*Comments: Published at ACM Multimedia (ACM MM) 2024

Click to view abstract

Abstract:Drawing freehand sketches of mechanical components on multimedia devices for AI-based engineering modeling has become a new trend. However, its development is being impeded because existing works cannot produce suitable sketches for data-driven research. These works either generate sketches lacking a freehand style or utilize generative models not originally designed for this task, resulting in poor effectiveness. To address this issue, we design a two-stage generative framework mimicking the human sketching behavior pattern, called MSFormer, which is the first to produce human-like freehand sketches tailored for mechanical components. The first stage employs Open CASCADE technology to obtain multi-view contour sketches from mechanical components, filtering perturbing signals for the ensuing generation process. Meanwhile, we design a view selector to simulate viewpoint selection tasks during human sketching for picking out information-rich sketches. The second stage translates contour sketches into freehand sketches by a transformer-based generator. To retain essential modeling features as much as possible and rationalize stroke distribution, we introduce a novel edge-constraint stroke initialization. Furthermore, we utilize a CLIP vision encoder and a new loss function incorporating the Hausdorff distance to enhance the generalizability and robustness of the model. Extensive experiments demonstrate that our approach achieves state-of-the-art performance for generating freehand sketches in the mechanical domain. Project page: this https URL.

[AI-34] Match Point AI: A Novel AI Framework for Evaluating Data-Driven Tennis Strategies

Link: https://arxiv.org/abs/2408.05960
Authors: Carlo Nübel, Alexander Dockhorn, Sanaz Mostaghim
Keywords-EN: video games due, reimplementing their mechanics, artificial intelligence, focus on board, board or video
Categories: Artificial Intelligence (cs.AI)
*Comments: 4 pages, 1 page abstract, short paper, to be published in Proceedings of the IEEE Conference on Games 2024

Click to view abstract

Abstract:Many works in the domain of artificial intelligence in games focus on board or video games due to the ease of reimplementing their mechanics. Decision-making problems in real-world sports share many similarities to such domains. Nevertheless, not many frameworks on sports games exist. In this paper, we present the tennis match simulation environment Match Point AI, in which different agents can compete against real-world data-driven bot strategies. Besides presenting the framework, we highlight its capabilities by illustrating how MCTS can be used in Match Point AI to optimize the shot direction selection problem in tennis. While the framework will be extended in the future, first experiments already reveal that generated shot-by-shot data of simulated tennis matches show realistic characteristics when compared to real-world data. At the same time, reasonable shot placement strategies emerge, which share similarities to the ones found in real-world tennis matches.

[AI-35] Markov Senior – Learning Markov Junior Grammars to Generate User-specified Content

Link: https://arxiv.org/abs/2408.05959
Authors: Mehmet Kayra Oğuz, Alexander Dockhorn
Keywords-EN: probabilistic programming language, Markov Junior, programming language, Markov Senior, Markov
Categories: Artificial Intelligence (cs.AI)
*Comments: 8 pages, to be published in the Proceedings of the IEEE Conference on Games 2024, demo implementation can be found here: this https URL

Click to view abstract

Abstract:Markov Junior is a probabilistic programming language used for procedural content generation across various domains. However, its reliance on manually crafted and tuned probabilistic rule sets, also called grammars, presents a significant bottleneck, diverging from approaches that allow rule learning from examples. In this paper, we propose a novel solution to this challenge by introducing a genetic programming-based optimization framework for learning hierarchical rule sets automatically. Our proposed method "Markov Senior" focuses on extracting positional and distance relations from single input samples to construct probabilistic rules to be used by Markov Junior. Using a Kullback-Leibler divergence-based fitness measure, we search for grammars to generate content that is coherent with the given sample. To enhance scalability, we introduce a divide-and-conquer strategy that enables the efficient generation of large-scale content. We validate our approach through experiments in generating image-based content and Super Mario levels, demonstrating its flexibility and effectiveness. In this way, "Markov Senior" allows for the wider application of Markov Junior for tasks in which an example may be available, but the design of a generative rule set is infeasible.
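
A Kullback-Leibler divergence between two empirical distributions can be computed as in the sketch below. It assumes, purely for illustration, that content is compared through tile frequency counts; the features the paper actually compares, the smoothing constant `eps`, and all function names are assumptions rather than details taken from the abstract.

```python
import math
from collections import Counter

def normalize(counts, keys, eps):
    """Turn raw counts into a smoothed probability distribution over `keys`."""
    total = sum(counts.get(k, 0) + eps for k in keys)
    return {k: (counts.get(k, 0) + eps) / total for k in keys}

def kl_divergence(p_counts, q_counts, eps=1e-9):
    """KL(P || Q) for two count tables; eps avoids division by zero."""
    keys = set(p_counts) | set(q_counts)
    p = normalize(p_counts, keys, eps)
    q = normalize(q_counts, keys, eps)
    return sum(p[k] * math.log(p[k] / q[k]) for k in keys)

# Hypothetical tile strings: a reference sample and two generated candidates.
sample = Counter("aabb")
close_candidate = Counter("abab")
far_candidate = Counter("aaab")
```

A candidate with the same tile frequencies as the sample scores zero, and the score grows as the frequencies drift apart, so a search procedure could minimize it.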

[AI-36] Robust online reconstruction of continuous-time signals from a lean spike train ensemble code

Link: https://arxiv.org/abs/2408.05950
Authors: Anik Chattopadhyay, Arunava Banerjee
Keywords-EN: high temporal resolution, Sensory stimuli, spike trains, energy efficiency, temporal resolution
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*Comments: 22 pages, including a 9-page appendix, 8 figures. A GitHub link to the project implementation is embedded in the paper

Click to view abstract

Abstract:Sensory stimuli in animals are encoded into spike trains by neurons, offering advantages such as sparsity, energy efficiency, and high temporal resolution. This paper presents a signal processing framework that deterministically encodes continuous-time signals into biologically feasible spike trains, and addresses the questions about representable signal classes and reconstruction bounds. The framework considers encoding of a signal through spike trains generated by an ensemble of neurons using a convolve-then-threshold mechanism with various convolution kernels. A closed-form solution to the inverse problem, from spike trains to signal reconstruction, is derived in the Hilbert space of shifted kernel functions, ensuring sparse representation of a generalized Finite Rate of Innovation (FRI) class of signals. Additionally, inspired by real-time processing in biological systems, an efficient iterative version of the optimal reconstruction is formulated that considers only a finite window of past spikes, ensuring robustness of the technique to ill-conditioned encoding; convergence guarantees of the windowed reconstruction to the optimal solution are then provided. Experiments on a large audio dataset demonstrate excellent reconstruction accuracy at spike rates as low as one-fifth of the Nyquist rate, while showing clear competitive advantage in comparison to state-of-the-art sparse coding techniques in the low spike rate regime.

[AI-37] Multimodal Large Language Models for Phishing Webpage Detection and Identification

Link: https://arxiv.org/abs/2408.05941
Authors: Jehyun Lee,Peiyuan Lim,Bryan Hooi,Dinil Mon Divakaran
Keywords: developed numerous solutions, researchers have developed, numerous solutions, machine learning, detecting phishing webpages
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: To appear in eCrime 2024

Abstract:To address the challenging problem of detecting phishing webpages, researchers have developed numerous solutions, in particular those based on machine learning (ML) algorithms. Among these, brand-based phishing detection that uses models from Computer Vision to detect if a given webpage is imitating a well-known brand has received widespread attention. However, such models are costly and difficult to maintain, as they need to be retrained with labeled datasets that have to be regularly and continuously collected. Besides, they also need to maintain a good reference list of well-known websites and related meta-data for effective performance. In this work, we take steps to study the efficacy of large language models (LLMs), in particular the multimodal LLMs, in detecting phishing webpages. Given that the LLMs are pretrained on a large corpus of data, we aim to make use of their understanding of different aspects of a webpage (logo, theme, favicon, etc.) to identify the brand of a given webpage and compare the identified brand with the domain name in the URL to detect a phishing attack. We propose a two-phase system employing LLMs in both phases: the first phase focuses on brand identification, while the second verifies the domain. We carry out comprehensive evaluations on a newly collected dataset. Our experiments show that the LLM-based system achieves a high detection rate at high precision; importantly, it also provides interpretable evidence for the decisions. Our system also performs significantly better than a state-of-the-art brand-based phishing detection system while demonstrating robustness against two known adversarial attacks.

[AI-38] Spb3DTracker: A Robust LiDAR-Based Person Tracker for Noisy Environment

Link: https://arxiv.org/abs/2408.05940
Authors: Eunsoo Im,Changhyun Jee,Jung Kwon Lee
Keywords: autonomous vehicle field, vehicle field, leading to widespread, significant advancements, autonomous vehicle
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 17 pages, 5 figures

Abstract:Person detection and tracking (PDT) has seen significant advancements with 2D camera-based systems in the autonomous vehicle field, leading to widespread adoption of these algorithms. However, growing privacy concerns have recently emerged as a major issue, prompting a shift towards LiDAR-based PDT as a viable alternative. Within this domain, “Tracking-by-Detection” (TBD) has become a prominent methodology. Despite its effectiveness, LiDAR-based PDT has not yet achieved the same level of performance as camera-based PDT. This paper examines key components of the LiDAR-based PDT framework, including detection post-processing, data association, motion modeling, and lifecycle management. Building upon these insights, we introduce SpbTrack, a robust person tracker designed for diverse environments. Among LiDAR-based trackers, our method achieves superior performance on noisy datasets and state-of-the-art results on the KITTI dataset benchmarks and a custom indoor office dataset. Project page at anonymous.

[AI-39] Optimizing RAG Techniques for Automotive Industry PDF Chatbots: A Case Study with Locally Deployed Ollama Models

Link: https://arxiv.org/abs/2408.05933
Authors: Fei Liu,Zejun Kang,Xing Han
Keywords: large language models, automotive industry, automotive industry documents, Ollama local RAG, optimizing the deployment
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:

Abstract:With the growing demand for offline PDF chatbots in automotive industrial production environments, optimizing the deployment of large language models (LLMs) in local, low-performance settings has become increasingly important. This study focuses on enhancing Retrieval-Augmented Generation (RAG) techniques for processing complex automotive industry documents using locally deployed Ollama models. Based on the Langchain framework, we propose a multi-dimensional optimization approach for Ollama’s local RAG implementation. Our method addresses key challenges in automotive document processing, including multi-column layouts and technical specifications. We introduce improvements in PDF processing, retrieval mechanisms, and context compression, tailored to the unique characteristics of automotive industry documents. Additionally, we design custom classes supporting embedding pipelines and an agent supporting self-RAG based on LangGraph best practices. To evaluate our approach, we constructed a proprietary dataset comprising typical automotive industry documents, including technical reports and corporate regulations. We compared our optimized RAG model and self-RAG agent against a naive RAG baseline across three datasets: our automotive industry dataset, QReCC, and CoQA. Results demonstrate significant improvements in context precision, context recall, answer relevancy, and faithfulness, with particularly notable performance on the automotive industry dataset. Our optimization scheme provides an effective solution for deploying local RAG systems in the automotive sector, addressing the specific needs of PDF chatbots in industrial production environments. This research has important implications for advancing information processing and intelligent production in the automotive industry.

[AI-40] BI-MDRG: Bridging Image History in Multimodal Dialogue Response Generation ECCV2024

Link: https://arxiv.org/abs/2408.05926
Authors: Hee Suk Yoon,Eunseop Yoon,Joshua Tian Jin Tee,Kang Zhang,Yu-Jung Heo,Du-Seong Chang,Chang D. Yoo
Keywords: recently proposed task, Dialogue Response Generation, Multimodal Dialogue, image, Multimodal Dialogue Response
Categories: Artificial Intelligence (cs.AI)
Comments: ECCV 2024

Abstract:Multimodal Dialogue Response Generation (MDRG) is a recently proposed task where the model needs to generate responses in texts, images, or a blend of both based on the dialogue context. Due to the lack of a large-scale dataset specifically for this task and the benefits of leveraging powerful pre-trained models, previous work relies on the text modality as an intermediary step for both the image input and output of the model rather than adopting an end-to-end approach. However, this approach can overlook crucial information about the image, hindering 1) image-grounded text response and 2) consistency of objects in the image response. In this paper, we propose BI-MDRG that bridges the response generation path such that the image history information is utilized for enhanced relevance of text responses to the image content and the consistency of objects in sequential image responses. Through extensive experiments on the multimodal dialogue benchmark dataset, we show that BI-MDRG can effectively increase the quality of multimodal dialogue. Additionally, recognizing the gap in benchmark datasets for evaluating the image consistency in multimodal dialogue, we have created a curated set of 300 dialogues annotated to track object consistency across conversations.

[AI-41] Adapting a Foundation Model for Space-based Tasks

Link: https://arxiv.org/abs/2408.05924
Authors: Matthew Foutter,Praneet Bhoj,Rohan Sinha,Amine Elhafsi,Somrita Banerjee,Christopher Agia,Justin Kruger,Tommaso Guffanti,Daniele Gammelli,Simone D’Amico,Marco Pavone
Keywords: large language models, possess attributes, navigate complex, space-based applications, foundation model adapted
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Foundation models, e.g., large language models, possess attributes of intelligence which offer promise to endow a robot with the contextual understanding necessary to navigate complex, unstructured tasks in the wild. In the future of space robotics, we see three core challenges which motivate the use of a foundation model adapted to space-based applications: 1) Scalability of ground-in-the-loop operations; 2) Generalizing prior knowledge to novel environments; and 3) Multi-modality in tasks and sensor data. Therefore, as a first-step towards building a foundation model for space-based applications, we automatically label the AI4Mars dataset to curate a language annotated dataset of visual-question-answer tuples. We fine-tune a pretrained LLaVA checkpoint on this dataset to endow a vision-language model with the ability to perform spatial reasoning and navigation on Mars’ surface. In this work, we demonstrate that 1) existing vision-language models are deficient visual reasoners in space-based applications, and 2) fine-tuning a vision-language model on extraterrestrial data significantly improves the quality of responses even with a limited training dataset of only a few thousand samples.

[AI-42] Urban Region Pre-training and Prompting: A Graph-based Approach

Link: https://arxiv.org/abs/2408.05920
Authors: Jiahui Jin,Yifan Song,Dong Kan,Haojia Zhu,Xiangguo Sun,Zhicheng Li,Xigang Sun,Jinghui Zhang
Keywords: Urban region, Urban, region, tasks
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Urban region representation is crucial for various urban downstream tasks. However, despite the proliferation of methods and their success, acquiring general urban region knowledge and adapting to different tasks remains challenging. Previous work often neglects the spatial structures and functional layouts between entities, limiting their ability to capture transferable knowledge across regions. Further, these methods struggle to adapt effectively to specific downstream tasks, as they do not adequately address the unique features and relationships required for different downstream tasks. In this paper, we propose a Graph-based Urban Region Pre-training and Prompting framework (GURPP) for region representation learning. Specifically, we first construct an urban region graph that integrates detailed spatial entity data for more effective urban region representation. Then, we develop a subgraph-centric urban region pre-training model to capture the heterogeneous and transferable patterns of interactions among entities. To further enhance the adaptability of these embeddings to different tasks, we design two graph-based prompting methods to incorporate explicit/hidden task knowledge. Extensive experiments on various urban region prediction tasks and different cities demonstrate the superior performance of our GURPP framework. The implementation is available at this repository: https://anonymous.4open.science/r/GURPP.

[AI-43] Inverse design of Non-parameterized Ventilated Acoustic Resonator via Variational Autoencoder with Acoustic Response-encoded Latent Space

Link: https://arxiv.org/abs/2408.05917
Authors: Min Woo Cho,Seok Hyeon Hwang,Jun-Young Jang,Jin Yeong Song,Sun-kwang Hwang,Kyoung Je Cha,Dong Yong Park,Kyungjun Song,Sang Min Park
Keywords: Ventilated acoustic resonator, flexible shape adaptability, Ventilated acoustic, excellent low-frequency attenuation, low-frequency attenuation performance
Categories: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Ventilated acoustic resonators (VARs), a type of acoustic metamaterial, have emerged as an alternative for sound attenuation in environments that require ventilation, owing to their excellent low-frequency attenuation performance and flexible shape adaptability. However, due to the non-linear acoustic responses of VARs, VAR designs are generally obtained within a limited parametrized design space, and the design relies on iterating numerical simulations, which consumes a considerable amount of computational time and resources. This paper proposes an acoustic response-encoded variational autoencoder (AR-VAE), a novel variational autoencoder-based generative design model for the efficient and accurate inverse design of VARs even with non-parametrized designs. The AR-VAE matches the high-dimensional acoustic response with the VAR cross-section image in the dimension-reduced latent space, which enables the AR-VAE to generate various non-parametrized VAR cross-section images with the target acoustic response. AR-VAE generates non-parameterized VARs from target acoustic responses, which show a 25-fold reduction in mean squared error compared to conventional deep learning-based parameter searching methods while exhibiting lower average mean squared error and peak frequency variance. By combining the VARs inverse-designed by AR-VAE, a multi-cavity VAR was devised for broadband and multitarget peak frequency attenuation. The proposed design method presents a new approach for structural inverse design with a high-dimensional non-linear physical response.

[AI-44] A New Pipeline For Generating Instruction Dataset via RAG and Self Fine-Tuning

Link: https://arxiv.org/abs/2408.05911
Authors: Chih-Wei Song,Yu-Kai Lee,Yin-Te Tsai
Keywords: large language models, recent years, enterprises and organizations, specialized Agents rely, rapid development
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 5 pages, SCA 2024: The 7th IEEE International Workshop on Smart Computing Applications

[AI-45] Weakly Supervised Video Anomaly Detection and Localization with Spatio-Temporal Prompts

Link: https://arxiv.org/abs/2408.05905
Authors: Peng Wu,Xuerong Zhou,Guansong Pang,Zhiwei Yang,Qingsen Yan,Peng Wang,Yanning Zhang
Keywords: Current weakly supervised, video anomaly detection, supervised video anomaly, coarse video-level annotations, Current weakly
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by ACMMM2024

Abstract:The current weakly supervised video anomaly detection (WSVAD) task aims to achieve frame-level anomalous event detection with only coarse video-level annotations available. Existing works typically involve extracting global features from full-resolution video frames and training frame-level classifiers to detect anomalies in the temporal dimension. However, most anomalous events tend to occur in localized spatial regions rather than across entire video frames, which implies that existing works based on frame-level features may be misled by the dominant background information and cannot interpret the detected anomalies. To address this dilemma, this paper introduces a novel method called STPrompt that learns spatio-temporal prompt embeddings for weakly supervised video anomaly detection and localization (WSVADL) based on pre-trained vision-language models (VLMs). Our proposed method employs a two-stream network structure, with one stream focusing on the temporal dimension and the other primarily on the spatial dimension. By leveraging the learned knowledge from pre-trained VLMs and incorporating natural motion priors from raw videos, our model learns prompt embeddings that are aligned with spatio-temporal regions of videos (e.g., patches of individual frames) to identify specific local regions of anomalies, enabling accurate video anomaly detection while mitigating the influence of background information. Without relying on detailed spatio-temporal annotations or auxiliary object detection/tracking, our method achieves state-of-the-art performance on three public benchmarks for the WSVADL task.

[AI-46] Polyp SAM 2: Advancing Zero shot Polyp Segmentation in Colorectal Cancer Detection

Link: https://arxiv.org/abs/2408.05892
Authors: Mobina Mansoori,Sajjad Shahabodini,Jamshid Abouei,Konstantinos N. Plataniotis,Arash Mohammadi
Keywords: colorectal cancer, plays a crucial, crucial role, early detection, detection and diagnosis
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Polyp segmentation plays a crucial role in the early detection and diagnosis of colorectal cancer. However, obtaining accurate segmentations often requires labor-intensive annotations and specialized models. Recently, Meta AI Research released a general Segment Anything Model 2 (SAM 2), which has demonstrated promising performance in several segmentation tasks. In this work, we evaluate the performance of SAM 2 in segmenting polyps under various prompted settings. We hope this report will provide insights to advance the field of polyp segmentation and promote more interesting work in the future. This project is publicly available at sajjad-sh33/Polyp-SAM-2.

[AI-47] Integrative Approaches in Cybersecurity and AI

Link: https://arxiv.org/abs/2408.05888
Authors: Marwan Omar
Keywords: modern technological ecosystems, artificial intelligence, recent years, area of research, technological ecosystems
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:In recent years, the convergence of cybersecurity, artificial intelligence (AI), and data management has emerged as a critical area of research, driven by the increasing complexity and interdependence of modern technological ecosystems. This paper provides a comprehensive review and analysis of integrative approaches that harness AI techniques to enhance cybersecurity frameworks and optimize data management practices. By exploring the synergies between these domains, we identify key trends, challenges, and future directions that hold the potential to revolutionize the way organizations protect, analyze, and leverage their data. Our findings highlight the necessity of cross-disciplinary strategies that incorporate AI-driven automation, real-time threat detection, and advanced data analytics to build more resilient and adaptive security architectures.

[AI-48] LLM-Based Robust Product Classification in Commerce and Compliance

Link: https://arxiv.org/abs/2408.05874
Authors: Sina Gholamian,Gianfranco Romani,Bartosz Rudnikowicz,Laura Skylaki
Keywords: Product classification, crucial task, compliance regulations, regulations are verified, verified and taxes
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 11 pages

[AI-49] Leveraging Knowledge Graph-Based Human-Like Memory Systems to Solve Partially Observable Markov Decision Processes

Link: https://arxiv.org/abs/2408.05861
Authors: Taewoon Kim,Vincent François-Lavet,Michael Cochez
Keywords: long-term memory system, memory system, make complex, Markov decision processes, observe only part
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Humans observe only part of their environment at any moment but can still make complex, long-term decisions thanks to their long-term memory system. To test how an AI can learn and utilize its long-term memory system, we have developed a partially observable Markov decision process (POMDP) environment, where the agent has to answer questions while navigating a maze. The environment is completely knowledge graph (KG) based, where the hidden states are dynamic KGs. A KG is both human- and machine-readable, making it easy to see what the agents remember and forget. We train and compare agents with different memory systems, to shed light on how human brains work when it comes to managing their own memory systems. By repurposing the given learning objective as learning a memory management policy, we were able to capture the most likely belief state, which is not only interpretable but also reusable.

[AI-50] Root Cause Attribution of Delivery Risks via Causal Discovery with Reinforcement Learning

Link: https://arxiv.org/abs/2408.05860
Authors: Shi Bo,Minheng Xiao
Keywords: integrating causal discovery, paper presents, root cause attribution, supply chains, supply
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper presents a novel approach to root cause attribution of delivery risks within supply chains by integrating causal discovery with reinforcement learning. As supply chains become increasingly complex, traditional methods of root cause analysis struggle to capture the intricate interrelationships between various factors, often leading to spurious correlations and suboptimal decision-making. Our approach addresses these challenges by leveraging causal discovery to identify the true causal relationships between operational variables, and reinforcement learning to iteratively refine the causal graph. This method enables the accurate identification of key drivers of late deliveries, such as shipping mode and delivery status, and provides actionable insights for optimizing supply chain performance. We apply our approach to a real-world supply chain dataset, demonstrating its effectiveness in uncovering the underlying causes of delivery delays and offering strategies for mitigating these risks. The findings have significant implications for improving operational efficiency, customer satisfaction, and overall profitability within supply chains.

[AI-51] The Cognitive Revolution in Interpretability: From Explaining Behavior to Interpreting Representations and Algorithms

Link: https://arxiv.org/abs/2408.05859
Authors: Adam Davies,Ashkan Khakzar
Keywords: Artificial neural networks, Artificial neural, inherently interpretable, neural networks, computation graphs
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Artificial neural networks have long been understood as “black boxes”: though we know their computation graphs and learned parameters, the knowledge encoded by these weights and functions they perform are not inherently interpretable. As such, from the early days of deep learning, there have been efforts to explain these models’ behavior and understand them internally; and recently, mechanistic interpretability (MI) has emerged as a distinct research area studying the features and implicit algorithms learned by foundation models such as large language models. In this work, we aim to ground MI in the context of cognitive science, which has long struggled with analogous questions in studying and explaining the behavior of “black box” intelligent systems like the human brain. We leverage several important ideas and developments in the history of cognitive science to disentangle divergent objectives in MI and indicate a clear path forward. First, we argue that current methods are ripe to facilitate a transition in deep learning interpretation echoing the “cognitive revolution” in 20th-century psychology that shifted the study of human psychology from pure behaviorism toward mental representations and processing. Second, we propose a taxonomy mirroring key parallels in computational neuroscience to describe two broad categories of MI research, semantic interpretation (what latent representations are learned and used) and algorithmic interpretation (what operations are performed over representations) to elucidate their divergent goals and objects of study. Finally, we elaborate the parallels and distinctions between various approaches in both categories, analyze the respective strengths and weaknesses of representative works, clarify underlying assumptions, outline key challenges, and discuss the possibility of unifying these modes of interpretation under a common framework.

[AI-52] Scaling Virtual World with Delta-Engine

Link: https://arxiv.org/abs/2408.05842
Authors: Hongqiu Wu,Zekai Xu,Tianyang Xu,Jiale Hong,Weiqi Wu,Hai Zhao,Min Zhang,Zhezhi He
Keywords: virtual world, cyberspace where people, people can live, base engine, world
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:In this paper, we focus on the virtual world, a cyberspace where people can live. An ideal virtual world shares great similarity with our real world. One of the crucial aspects is its evolving nature, reflected by the individuals’ capacity to grow and thereby influence the objective world. Such dynamics are unpredictable and beyond the reach of existing systems. For this, we propose a special engine called Delta-Engine to drive this virtual world. Δ associates the world’s evolution to the engine’s expansion. A delta-engine consists of a base engine and a neural proxy. Given an observation, the proxy generates new code based on the base engine through the process of incremental prediction. This paper presents a full-stack introduction to the delta-engine. The key feature of the delta-engine is its scalability to unknown elements within the world. Technically, it derives from the perfect cooperation of the neural proxy and the base engine, and the alignment with high-quality data. We introduce an engine-oriented fine-tuning method that embeds the base engine into the proxy. We then discuss a human-AI collaborative design process to produce novel and interesting data efficiently. Eventually, we propose three evaluation principles to comprehensively assess the performance of a delta engine: naive evaluation, incremental evaluation, and adversarial evaluation. Our code, data, and models are open-sourced.

[AI-53] Real-Time Drowsiness Detection Using Eye Aspect Ratio and Facial Landmark Detection

Link: https://arxiv.org/abs/2408.05836
Authors: Varun Shiva Krishna Rupani,Velpooru Venkata Sai Thushar,Kondadi Tejith
Keywords: Eye Aspect Ratio, essential for improving, Aspect Ratio, EAR, Drowsiness
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Drowsiness detection is essential for improving safety in areas such as transportation and workplace health. This study presents a real-time system designed to detect drowsiness using the Eye Aspect Ratio (EAR) and facial landmark detection techniques. The system leverages Dlib’s pre-trained shape predictor model to accurately detect and monitor 68 facial landmarks, which are used to compute the EAR. By establishing a threshold for the EAR, the system identifies when eyes are closed, indicating potential drowsiness. The process involves capturing a live video stream, detecting faces in each frame, extracting eye landmarks, and calculating the EAR to assess alertness. Our experiments show that the system reliably detects drowsiness with high accuracy while maintaining low computational demands. This study offers a strong solution for real-time drowsiness detection, with promising applications in driver monitoring and workplace safety. Future research will investigate incorporating additional physiological and contextual data to further enhance detection accuracy and reliability.
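The EAR computation and thresholding described in the abstract can be sketched as follows. The abstract does not give the formula, so this uses the commonly cited six-landmark definition, EAR = (‖p2 − p6‖ + ‖p3 − p5‖) / (2‖p1 − p4‖), which may differ in detail from the paper's implementation. The threshold of 0.25, the three-frame persistence rule, and the function names are hypothetical choices for illustration. Landmark extraction (for example with Dlib) is left out, so the functions take plain coordinate tuples.

```python
import math

EAR_THRESHOLD = 0.25  # hypothetical value; the abstract does not state one
MIN_CLOSED_FRAMES = 3  # hypothetical persistence rule, not from the abstract


def eye_aspect_ratio(eye):
    """Compute EAR from six (x, y) eye landmarks ordered p1..p6.

    p1 and p4 are the eye corners; (p2, p6) and (p3, p5) are the
    upper/lower eyelid pairs.
    """
    p1, p2, p3, p4, p5, p6 = eye
    vertical = math.dist(p2, p6) + math.dist(p3, p5)
    horizontal = math.dist(p1, p4)
    return vertical / (2.0 * horizontal)


def is_drowsy(ear_values, threshold=EAR_THRESHOLD, min_frames=MIN_CLOSED_FRAMES):
    """Return True if EAR stays below the threshold for min_frames consecutive frames."""
    run = 0
    for ear in ear_values:
        run = run + 1 if ear < threshold else 0
        if run >= min_frames:
            return True
    return False


# Toy usage with hand-made landmarks for one open and one nearly closed eye.
open_eye = [(0, 0), (1, 1), (2, 1), (3, 0), (2, -1), (1, -1)]
closed_eye = [(0, 0), (1, 0.1), (2, 0.1), (3, 0), (2, -0.1), (1, -0.1)]
per_frame = [eye_aspect_ratio(e) for e in (open_eye, closed_eye, closed_eye, closed_eye)]
alert = is_drowsy(per_frame)
```

Requiring several consecutive low-EAR frames is one simple way to avoid flagging ordinary blinks; a real system would tune both constants on recorded data.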

[AI-54] Robust Domain Generalization for Multi-modal Object Recognition

Link: https://arxiv.org/abs/2408.05831
Authors: Yuxin Qiao,Keqin Li,Junhong Lin,Rong Wei,Chufeng Jiang,Yang Luo,Haoyu Yang
Keywords: machine learning encounters, multi-label classification, training data, encounters the challenge, handling tasks
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 6 pages, 2 figures. This is a preprint version of the article. The final version will be published in the proceedings of the IEEE conference

Abstract:In multi-label classification, machine learning encounters the challenge of domain generalization when handling tasks with distributions differing from the training data. Existing approaches primarily focus on vision object recognition and neglect the integration of natural language. Recent advancements in vision-language pre-training leverage supervision from extensive visual-language pairs, enabling learning across diverse domains and enhancing recognition in multi-modal scenarios. However, these approaches face limitations in loss function utilization, generality across backbones, and class-aware visual fusion. This paper proposes solutions to these limitations by inferring the actual loss, broadening evaluations to larger vision-language backbones, and introducing Mixup-CLIPood, which incorporates a novel mix-up loss for enhanced class-aware visual fusion. Our method demonstrates superior performance in domain generalization across multiple datasets.
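For readers unfamiliar with mix-up, the standard formulation (not the paper's novel Mixup-CLIPood loss, which the abstract does not specify) blends two training examples with a weight λ and applies the same weights to their losses. The sketch below assumes single-label cross-entropy and plain Python lists for simplicity; all function names are illustrative.

```python
import math
import random


def sample_lambda(alpha, rng=random):
    """Draw the mixing weight from Beta(alpha, alpha), as in standard mix-up."""
    return rng.betavariate(alpha, alpha)


def mixup_features(x_a, x_b, lam):
    """Convex combination of two feature vectors: lam * x_a + (1 - lam) * x_b."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]


def cross_entropy(probs, label):
    """Negative log-likelihood of the true class under predicted probabilities."""
    return -math.log(probs[label])


def mixup_loss(probs, label_a, label_b, lam):
    """Loss on a mixed example: the two labels' losses weighted by lam and 1 - lam."""
    return lam * cross_entropy(probs, label_a) + (1.0 - lam) * cross_entropy(probs, label_b)


# Toy usage: mix two one-hot-like feature vectors and score a uniform prediction.
mixed = mixup_features([1.0, 0.0], [0.0, 1.0], 0.25)
loss = mixup_loss([0.5, 0.5], label_a=0, label_b=1, lam=0.3)
```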

[AI-55] A Single Goal is All You Need: Skills and Exploration Emerge from Contrastive RL without Rewards Demonstrations or Subgoals

Link: https://arxiv.org/abs/2408.05804
Authors: Grace Liu,Michael Tang,Benjamin Eysenbach
Keywords: present empirical evidence, directed exploration emerging, trials are observed, present empirical, empirical evidence
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Code and videos: this https URL

Abstract:In this paper, we present empirical evidence of skills and directed exploration emerging from a simple RL algorithm long before any successful trials are observed. For example, in a manipulation task, the agent is given a single observation of the goal state and learns skills, first for moving its end-effector, then for pushing the block, and finally for picking up and placing the block. These skills emerge before the agent has ever successfully placed the block at the goal location and without the aid of any reward functions, demonstrations, or manually-specified distance metrics. Once the agent has learned to reach the goal state reliably, exploration is reduced. Implementing our method involves a simple modification of prior work and does not require density estimates, ensembles, or any additional hyperparameters. Intuitively, the proposed method seems like it should be terrible at exploration, and we lack a clear theoretical understanding of why it works so effectively, though our experiments provide some hints.

[AI-56] A Meta-Engine Framework for Interleaved Task and Motion Planning using Topological Refinements ECAI2024

Link: https://arxiv.org/abs/2408.05795
Authors: Elisa Tosello,Alessandro Valentini,Andrea Micheli
Keywords: includes discrete actions, discrete actions executable, automated planning problem, low-level continuous motions, automated planning
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: To appear in ECAI 2024

Abstract:Task And Motion Planning (TAMP) is the problem of finding a solution to an automated planning problem that includes discrete actions executable by low-level continuous motions. This field is gaining increasing interest within the robotics community, as it significantly enhances robots’ autonomy in real-world applications. Many solutions and formulations exist, but no clear standard representation has emerged. In this paper, we propose a general and open-source framework for modeling and benchmarking TAMP problems. Moreover, we introduce an innovative meta-technique to solve TAMP problems involving moving agents and multiple task-state-dependent obstacles. This approach enables using any off-the-shelf task planner and motion planner while leveraging a geometric analysis of the motion planner’s search space to prune the task planner’s exploration, enhancing its efficiency. We also show how to specialize this meta-engine for the case of an incremental SMT-based planner. We demonstrate the effectiveness of our approach across benchmark problems of increasing complexity, where robots must navigate environments with movable obstacles. Finally, we integrate state-of-the-art TAMP algorithms into our framework and compare their performance with our achievements.

[AI-57] HateSieve: A Contrastive Learning Framework for Detecting and Segmenting Hateful Content in Multimodal Memes ACL

Link: https://arxiv.org/abs/2408.05794
Authors: Xuanyu Su, Yansong Li, Diana Inkpen, Nathalie Japkowicz
Keywords: Large Multimodal Models, Multimodal Models, Large Multimodal, Amidst the rise, memes remains significant
Categories: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
Comments: 8 pages overall, the accepted paper at the 3rd Workshop on Advances in Language and Vision Research (ALVR 2024) ACL workshops

Abstract:Amidst the rise of Large Multimodal Models (LMMs) and their widespread application in generating and interpreting complex content, the risk of propagating biased and harmful memes remains significant. Current safety measures often fail to detect subtly integrated hateful content within "Confounder Memes". To address this, we introduce HateSieve, a new framework designed to enhance the detection and segmentation of hateful elements in memes. HateSieve features a novel Contrastive Meme Generator that creates semantically paired memes, a customized triplet dataset for contrastive learning, and an Image-Text Alignment module that produces context-aware embeddings for accurate meme segmentation. Empirical experiments on the Hateful Meme Dataset show that HateSieve not only surpasses existing LMMs in performance with fewer trainable parameters but also offers a robust mechanism for precisely identifying and isolating hateful content. Caution: Contains academic discussions of hate speech; viewer discretion advised.

[AI-58] Continual Learning of Nonlinear Independent Representations

Link: https://arxiv.org/abs/2408.05788
Authors: Boyang Sun, Ignavier Ng, Guangyi Chen, Yifan Shen, Qirong Ho, Kun Zhang
Keywords: interested variables plays, relations between interested, plays a pivotal, pivotal role, deep insights
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 9 pages, 5 Figures

Abstract:Identifying the causal relations between interested variables plays a pivotal role in representation learning as it provides deep insights into the dataset. Identifiability, as the central theme of this approach, normally hinges on leveraging data from multiple distributions (intervention, distribution shift, time series, etc.). Despite the exciting development in this field, a practical but often overlooked problem is: what if those distribution shifts happen sequentially? In contrast, any intelligence possesses the capacity to abstract and refine learned knowledge sequentially – lifelong learning. In this paper, with a particular focus on the nonlinear independent component analysis (ICA) framework, we move one step forward toward the question of enabling models to learn meaningful (identifiable) representations in a sequential manner, termed continual causal representation learning. We theoretically demonstrate that model identifiability progresses from a subspace level to a component-wise level as the number of distributions increases. Empirically, we show that our method achieves performance comparable to nonlinear ICA methods trained jointly on multiple offline distributions and, surprisingly, the incoming new distribution does not necessarily benefit the identification of all latent variables.

[AI-59] CURLing the Dream: Contrastive Representations for World Modeling in Reinforcement Learning

Link: https://arxiv.org/abs/2408.05781
Authors: Victor Augusto Kich, Jair Augusto Bottega, Raul Steinmetz, Ricardo Bedin Grando, Ayano Yorozu, Akihisa Ohya
Keywords: Control Suite tasks, DeepMind Control Suite, integrates contrastive learning, visual reinforcement learning, Control Suite
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Paper accepted for 24th International Conference on Control, Automation and Systems (ICCAS)

Abstract:In this work, we present Curled-Dreamer, a novel reinforcement learning algorithm that integrates contrastive learning into the DreamerV3 framework to enhance performance in visual reinforcement learning tasks. By incorporating the contrastive loss from the CURL algorithm and a reconstruction loss from autoencoder, Curled-Dreamer achieves significant improvements in various DeepMind Control Suite tasks. Our extensive experiments demonstrate that Curled-Dreamer consistently outperforms state-of-the-art algorithms, achieving higher mean and median scores across a diverse set of tasks. The results indicate that the proposed approach not only accelerates learning but also enhances the robustness of the learned policies. This work highlights the potential of combining different learning paradigms to achieve superior performance in reinforcement learning applications.
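The contrastive loss mentioned in this abstract is commonly written as an InfoNCE objective. The following is a minimal pure-Python sketch of that general objective, assuming a plain dot-product similarity and a hypothetical `temperature` value. It does not reproduce CURL's own similarity function or encoders, nor how Curled-Dreamer combines the loss with DreamerV3.

```python
import math


def info_nce_loss(queries, keys, temperature=0.1):
    """Mean InfoNCE loss; queries[i] and keys[i] are treated as the positive pair."""

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    total = 0.0
    for i, query in enumerate(queries):
        # Similarity of this query to every key; the other keys act as negatives.
        logits = [dot(query, key) / temperature for key in keys]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in logits))
        # Cross-entropy with the positive pair as the target class.
        total += log_sum_exp - logits[i]
    return total / len(queries)


# Toy embeddings (illustrative only): matched pairs should give a lower loss.
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_matched = info_nce_loss(aligned, aligned)
loss_mismatched = info_nce_loss(aligned, swapped)
```

The loss falls when each query is most similar to its own key, which is the property a contrastive representation objective relies on.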

[AI-60] Seg-CycleGAN: SAR-to-optical image translation guided by a downstream task

Link: https://arxiv.org/abs/2408.05777
Authors: Hannuo Zhang, Huihui Li, Jiarui Lin, Yujie Zhang, Jianghua Fan, Hang Liu
Keywords: Synthetic Aperture Radar, Aperture Radar, Optical remote sensing, Synthetic Aperture, offering complementary capabilities
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments: 8 pages, 5 figures

Abstract:Optical remote sensing and Synthetic Aperture Radar (SAR) remote sensing are crucial for earth observation, offering complementary capabilities. While optical sensors provide high-quality images, they are limited by weather and lighting conditions. In contrast, SAR sensors can operate effectively under adverse conditions. This letter proposes a GAN-based SAR-to-optical image translation method named Seg-CycleGAN, designed to enhance the accuracy of ship target translation by leveraging semantic information from a pre-trained semantic segmentation model. Our method utilizes the downstream task of ship target semantic segmentation to guide the training of the image translation network, improving the quality of the output optical-styled images. The potential of foundation-model-annotated datasets in SAR-to-optical translation tasks is revealed. This work suggests broader research and applications for downstream-task-guided frameworks. The code will be available at this https URL

[AI-61] Neurosymbolic Methods for Rule Mining

Link: https://arxiv.org/abs/2408.05773
Authors: Agnieszka Lawrynowicz, Luis Galarraga, Mehwish Alam, Berenice Jaulmes, Vaclav Zeman, Tomas Kliegr
Keywords: essential background information, beginning with essential, background information, including measures, address the problem
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:In this chapter, we address the problem of rule mining, beginning with essential background information, including measures of rule quality. We then explore various rule mining methodologies, categorized into three groups: inductive logic programming, path sampling and generalization, and linear programming. Following this, we delve into neurosymbolic methods, covering topics such as the integration of deep learning with rules, the use of embeddings for rule learning, and the application of large language models in rule learning.

[AI-62] An analysis of HOI: using a training-free method with multimodal visual foundation models when only the test set is available without the training set

Link: https://arxiv.org/abs/2408.05772
Authors: Chaoyi Ai
Keywords: Human-Object Interaction, ultimately forming, langle human, aims to identify, recognize their relationships
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Human-Object Interaction (HOI) aims to identify the pairs of humans and objects in images and to recognize their relationships, ultimately forming ⟨human, object, verb⟩ triplets. Under default settings, HOI performance is nearly saturated, with many studies focusing on long-tail distribution and zero-shot/few-shot scenarios. Let us consider an intriguing problem: "What if there is only a test dataset without a training dataset, and a multimodal visual foundation model is used in a training-free manner?" This study uses two experimental settings: grounding truth and random arbitrary combinations. We reach some interesting conclusions and find that the open vocabulary capabilities of the multimodal visual foundation model are not yet fully realized. Additionally, replacing the feature extraction with Grounding DINO further confirms these findings.

[AI-63] Reference-free Hallucination Detection for Large Vision-Language Models

Link: https://arxiv.org/abs/2408.05767
Authors: Qing Li, Chenyang Lyu, Jiahui Geng, Derui Zhu, Maxim Panov, Fakhri Karray
Keywords: Large vision-language models, made significant progress, Large vision-language, vision-language models, recent years
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

[AI-64] Low-Dimensional Federated Knowledge Graph Embedding via Knowledge Distillation

Link: https://arxiv.org/abs/2408.05748
Authors: Xiaoxiong Zhang, Zhiwei Zeng, Xin Zhou, Zhiqi Shen
Keywords: Federated Knowledge Graph, distributed Knowledge Graphs, Knowledge Graph Embedding, preserving data privacy, Knowledge Graph
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Federated Knowledge Graph Embedding (FKGE) aims to facilitate collaborative learning of entity and relation embeddings from distributed Knowledge Graphs (KGs) across multiple clients, while preserving data privacy. Training FKGE models with higher dimensions is typically favored due to their potential for achieving superior performance. However, high-dimensional embeddings present significant challenges in terms of storage resource and inference speed. Unlike traditional KG embedding methods, FKGE involves multiple client-server communication rounds, where communication efficiency is critical. Existing embedding compression methods for traditional KGs may not be directly applicable to FKGE as they often require multiple model trainings which potentially incur substantial communication costs. In this paper, we propose a light-weight component based on Knowledge Distillation (KD) which is titled FedKD and tailored specifically for FKGE methods. During client-side local training, FedKD facilitates the low-dimensional student model to mimic the score distribution of triples from the high-dimensional teacher model using KL divergence loss. Unlike traditional KD way, FedKD adaptively learns a temperature to scale the score of positive triples and separately adjusts the scores of corresponding negative triples using a predefined temperature, thereby mitigating teacher over-confidence issue. Furthermore, we dynamically adjust the weight of KD loss to optimize the training process. Extensive experiments on three datasets support the effectiveness of FedKD.
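The distillation step described above, where a student mimics the teacher's score distribution under a KL divergence loss, can be sketched with the conventional temperature-scaled formulation. This is a minimal illustration of standard knowledge distillation only. FedKD's adaptive temperature for positive triples and its separate handling of negative triples are not reproduced, and the scores and the `temperature` value are made-up assumptions.

```python
import math


def softmax(scores, temperature=1.0):
    """Temperature-scaled softmax over a list of triple scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    norm = sum(exps)
    return [e / norm for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(teacher_scores, student_scores, temperature=2.0):
    """KL divergence between softened teacher and student distributions."""
    teacher_probs = softmax(teacher_scores, temperature)
    student_probs = softmax(student_scores, temperature)
    return kl_divergence(teacher_probs, student_probs)


# Hypothetical scores for one positive triple followed by two negatives.
teacher = [4.0, 1.0, 0.5]
student = [2.5, 1.5, 1.0]
loss = distillation_loss(teacher, student)
```

A higher temperature flattens the teacher distribution, which is the usual lever for softening an over-confident teacher.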

[AI-65] MTSCI: A Conditional Diffusion Model for Multivariate Time Series Consistent Imputation CIKM2024

Link: https://arxiv.org/abs/2408.05740
Authors: Jianping Zhou, Junhao Li, Guanjie Zheng, Xinbing Wang, Chenghu Zhou
Keywords: multivariate time series, time series imputation, time series, multivariate time, Time Series Consistent
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 10 pages, 5 figures, accepted by CIKM2024

Abstract:Missing values are prevalent in multivariate time series, compromising the integrity of analyses and degrading the performance of downstream tasks. Consequently, research has focused on multivariate time series imputation, aiming to accurately impute the missing values based on available observations. A key research question is how to ensure imputation consistency, i.e., intra-consistency between observed and imputed values, and inter-consistency between adjacent windows after imputation. However, previous methods rely solely on the inductive bias of the imputation targets to guide the learning process, ignoring imputation consistency and ultimately resulting in poor performance. Diffusion models, known for their powerful generative abilities, prefer to generate consistent results based on available observations. Therefore, we propose a conditional diffusion model for Multivariate Time Series Consistent Imputation (MTSCI). Specifically, MTSCI employs a contrastive complementary mask to generate dual views during the forward noising process. Then, the intra contrastive loss is calculated to ensure intra-consistency between the imputed and observed values. Meanwhile, MTSCI utilizes a mixup mechanism to incorporate conditional information from adjacent windows during the denoising process, facilitating the inter-consistency between imputed samples. Extensive experiments on multiple real-world datasets demonstrate that our method achieves the state-of-the-art performance on multivariate time series imputation task under different missing scenarios. Code is available at this https URL.

[AI-66] Deformable Image Registration with Multi-scale Feature Fusion from Shared Encoder Auxiliary and Pyramid Decoders

Link: https://arxiv.org/abs/2408.05717
Authors: Hongchao Zhou, Shunbo Hu
Keywords: deformable convolutional pyramid, convolutional pyramid network, unsupervised image registration, deformable convolutional, pyramid network
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:In this work, we propose a novel deformable convolutional pyramid network for unsupervised image registration. Specifically, the proposed network enhances the traditional pyramid network by adding an additional shared auxiliary decoder for image pairs. This decoder provides multi-scale high-level feature information from unblended image pairs for the registration task. During the registration process, we also design a multi-scale feature fusion block to extract the most beneficial features for the registration task from both global and local contexts. Validation results indicate that this method can capture complex deformations while achieving higher registration accuracy and maintaining smooth and plausible deformations.

[AI-67] Top Pass: Improve Code Generation by Pass@k-Maximized Code Ranking

Link: https://arxiv.org/abs/2408.05715
Authors: Zhi-Cun Lyu, Xin-Ye Li, Zheng Xie, Ming Li
Keywords: Large Language Models, Large Language, Code generation, Language Models, greatly enhanced
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: Accepted by Frontier of Computer Science

Abstract:Code generation has been greatly enhanced by the profound advancements in Large Language Models (LLMs) recently. Nevertheless, such LLM-based code generation approaches still struggle to generate error-free code in a few tries when faced with complex problems. To address this, the prevailing strategy is to sample a huge number of candidate programs, with the hope of any one in them could work. However, users of code generation systems usually expect to find a correct program by reviewing or testing only a small number of code candidates. Otherwise, the system would be unhelpful. In this paper, we propose Top Pass, a code ranking approach that identifies potential correct solutions from a large number of candidates. Top Pass directly optimizes the pass@k loss function, enhancing the quality at the top of the candidate list. This enables the user to find the correct solution within as few tries as possible. Experimental results on four benchmarks indicate that our Top Pass method enhances the usability of code generation models by producing better ranking results, particularly achieving a 32.9% relative improvement in pass@1 on CodeContests when compared to the state-of-the-art ranking method.
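For readers unfamiliar with pass@k, the metric can be illustrated with the widely used unbiased estimator 1 - C(n-c, k) / C(n, k), where n candidates are sampled and c of them pass the tests. The sketch below shows that evaluation metric plus a simple top-k check on a ranked list. It is not the Top Pass loss function, which the abstract does not specify, and the helper names are hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Probability that at least one of k candidates drawn from n is correct,
    given that c of the n candidates are correct (unbiased estimator)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect candidates: any draw of k must contain a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def ranked_pass_at_k(passed_in_rank_order, k):
    """With a ranker, pass@k for one problem is whether any top-k candidate passes."""
    return any(passed_in_rank_order[:k])


# Hypothetical ranking where the only correct candidate sits in second place.
ranking = [False, True, False]
hit_at_1 = ranked_pass_at_k(ranking, 1)
hit_at_2 = ranked_pass_at_k(ranking, 2)
```

The second helper shows why ranking matters: moving a correct candidate from second to first place changes pass@1 for that problem from a miss to a hit without generating any new code.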

[AI-68] DeepAir: A Multi-Agent Deep Reinforcement Learning Based Scheme for an Unknown User Location Problem

Link: https://arxiv.org/abs/2408.05712
Authors: Baris Yamansavascilar, Atay Ozgovde, Cem Ersoy
Keywords: unmanned aerial vehicles, aerial vehicles, networking paradigms, deployment of unmanned, unmanned aerial
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments: 12 pages, 8 figures, 5 tables

Abstract:The deployment of unmanned aerial vehicles (UAVs) in many different settings has provided various solutions and strategies for networking paradigms. Therefore, it reduces the complexity of the developments for the existing problems, which otherwise require more sophisticated approaches. One of those existing problems is the unknown user locations in an infrastructure-less environment in which users cannot connect to any communication device or computation-providing server, which is essential to task offloading in order to achieve the required quality of service (QoS). Therefore, in this study, we investigate this problem thoroughly and propose a novel deep reinforcement learning (DRL) based scheme, DeepAir. DeepAir considers all of the necessary steps including sensing, localization, resource allocation, and multi-access edge computing (MEC) to achieve QoS requirements for the offloaded tasks without violating the maximum tolerable delay. To this end, we use two types of UAVs including detector UAVs, and serving UAVs. We utilize detector UAVs as DRL agents which ensure sensing, localization, and resource allocation. On the other hand, we utilize serving UAVs to provide MEC features. Our experiments show that DeepAir provides a high task success rate by deploying fewer detector UAVs in the environment, which includes different numbers of users and user attraction points, compared to benchmark methods.

[AI-69] A Novel Momentum-Based Deep Learning Techniques for Medical Image Classification and Segmentation

Link: https://arxiv.org/abs/2408.05692
Authors: Koushik Biswas, Ridal Pal, Shaswat Patel, Debesh Jha, Meghana Karri, Amit Reza, Gorkem Durak, Alpay Medetalibeyoglu, Matthew Antalek, Yury Velichko, Daniela Ladner, Amir Borhani, Ulas Bagci
Keywords: Accurately segmenting, intervention planning, critical prerequisite, prerequisite for computer-assisted, computer-assisted diagnosis
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: 8 pages

Abstract:Accurately segmenting different organs from medical images is a critical prerequisite for computer-assisted diagnosis and intervention planning. This study proposes a deep learning-based approach for segmenting various organs from CT and MRI scans and classifying diseases. Our study introduces a novel technique integrating momentum within residual blocks for enhanced training dynamics in medical image analysis. We applied our method in two distinct tasks: segmenting liver, lung, colon data and classifying abdominal pelvic CT and MRI scans. The proposed approach has shown promising results, outperforming state-of-the-art methods on publicly available benchmarking datasets. For instance, in the lung segmentation dataset, our approach yielded significant enhancements over the TransNetR model, including a 5.72% increase in dice score, a 5.04% improvement in mean Intersection over Union (mIoU), an 8.02% improvement in recall, and a 4.42% improvement in precision. Hence, incorporating momentum led to state-of-the-art performance in both segmentation and classification tasks, representing a significant advancement in the field of medical imaging.

[AI-70] Separate Generation and Evaluation for Parallel Greedy Best-First Search ICAPS-2024

Link: https://arxiv.org/abs/2408.05682
Authors: Takumi Shimoda, Alex Fukunaga
Keywords: Bench Transition System, sequential GBFS, Parallelization of Greedy, straightforward parallelization, GBFS
Categories: Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS)
Comments: In Proceedings of ICAPS-2024 Workshop on Heuristics and Search for Domain-Independent Planning (HSDIP-24) this https URL

Abstract:Parallelization of Greedy Best First Search (GBFS) has been difficult because straightforward parallelization can result in search behavior which differs significantly from sequential GBFS, exploring states which would not be explored by sequential GBFS with any tie-breaking strategy. Recent work has proposed a class of parallel GBFS algorithms which constrains search to exploration of the Bench Transition System (BTS), which is the set of states that can be expanded by GBFS under some tie-breaking policy. However, enforcing this constraint is costly, as such BTS-constrained algorithms are forced to spend much of the time waiting so that only states which are guaranteed to be in the BTS are expanded. We propose an improvement to parallel search which decouples state generation and state evaluation and significantly improves state evaluation rate, resulting in better search performance.
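As background for the sequential baseline this abstract refers to, here is a minimal sketch of sequential Greedy Best-First Search on a toy grid. The grid, the Manhattan heuristic, and the tie-breaking rule (lexicographic order of states, a side effect of comparing heap tuples) are all illustrative assumptions. The parallel, BTS-constrained algorithms discussed in the abstract are not shown.

```python
import heapq


def greedy_best_first_search(start, goal, neighbors, heuristic):
    """Always expand the open state with the smallest heuristic value."""
    open_list = [(heuristic(start), start)]
    parent = {start: None}  # Doubles as the set of generated states.
    while open_list:
        _, state = heapq.heappop(open_list)
        if state == goal:
            path = []
            while state is not None:
                path.append(state)
                state = parent[state]
            return path[::-1]
        for successor in neighbors(state):
            if successor not in parent:
                parent[successor] = state
                # Evaluation happens at generation time, right before insertion.
                heapq.heappush(open_list, (heuristic(successor), successor))
    return None


def grid_neighbors(state, size=4):
    """4-connected moves on a size x size grid without obstacles."""
    x, y = state
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < size and 0 <= ny < size:
            yield (nx, ny)


goal = (3, 3)


def manhattan(state):
    return abs(state[0] - goal[0]) + abs(state[1] - goal[1])


path = greedy_best_first_search((0, 0), goal, grid_neighbors, manhattan)
```

In this sequential form, generating a successor and evaluating its heuristic happen in the same step. That coupling is what the abstract proposes to separate in the parallel setting.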

[AI-71] SRTFD: Scalable Real-Time Fault Diagnosis through Online Continual Learning

Link: https://arxiv.org/abs/2408.05681
Authors: Dandan Zhao, Karthick Sharma, Hongpeng Yin, Yuxin Qi, Shuhao Zhang
Keywords: maintaining operational safety, minimizing economic losses, detecting system abnormalities, essential for maintaining, maintaining operational
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Fault diagnosis (FD) is essential for maintaining operational safety and minimizing economic losses by detecting system abnormalities. Recently, deep learning (DL)-driven FD methods have gained prominence, offering significant improvements in precision and adaptability through the utilization of extensive datasets and advanced DL models. Modern industrial environments, however, demand FD methods that can handle new fault types, dynamic conditions, large-scale data, and provide real-time responses with minimal prior information. Although online continual learning (OCL) demonstrates potential in addressing these requirements by enabling DL models to continuously learn from streaming data, it faces challenges such as data redundancy, imbalance, and limited labeled data. To overcome these limitations, we propose SRTFD, a scalable real-time fault diagnosis framework that enhances OCL with three critical methods: Retrospect Coreset Selection (RCS), which selects the most relevant data to reduce redundant training and improve efficiency; Global Balance Technique (GBT), which ensures balanced coreset selection and robust model performance; and Confidence and Uncertainty-driven Pseudo-label Learning (CUPL), which updates the model using unlabeled data for continuous adaptation. Extensive experiments on a real-world dataset and two public simulated datasets demonstrate SRTFD’s effectiveness and potential for providing advanced, scalable, and precise fault diagnosis in modern industrial systems.

[AI-72] Efficient Federated Learning Using Dynamic Update and Adaptive Pruning with Momentum on Shared Server Data

Link: https://arxiv.org/abs/2408.05678
Authors: Ji Liu, Juncheng Jia, Hong Zhang, Yuhui Yun, Leye Wang, Yang Zhou, Huaiyu Dai, Dejing Dou
Keywords: Federated Learning, limited computational resources, low training efficiency, achieving remarkable performance, shared insensitive data
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 27 pages, to appear in TIST

Abstract:Despite achieving remarkable performance, Federated Learning (FL) encounters two important problems, i.e., low training efficiency and limited computational resources. In this paper, we propose a new FL framework, i.e., FedDUMAP, with three original contributions, to leverage the shared insensitive data on the server in addition to the distributed data in edge devices so as to efficiently train a global model. First, we propose a simple dynamic server update algorithm, which takes advantage of the shared insensitive data on the server while dynamically adjusting the update steps on the server in order to speed up the convergence and improve the accuracy. Second, we propose an adaptive optimization method with the dynamic server update algorithm to exploit the global momentum on the server and each local device for superior accuracy. Third, we develop a layer-adaptive model pruning method to carry out specific pruning operations, which is adapted to the diverse features of each layer so as to attain an excellent trade-off between effectiveness and efficiency. Our proposed FL model, FedDUMAP, combines the three original techniques and has a significantly better performance compared with baseline approaches in terms of efficiency (up to 16.9 times faster), accuracy (up to 20.4% higher), and computational cost (up to 62.6% smaller).

[AI-73] StealthDiffusion: Towards Evading Diffusion Forensic Detection through Diffusion Model

Link: https://arxiv.org/abs/2408.05669
Authors: Ziyin Zhou, Ke Sun, Zhongxi Chen, Huafeng Kuang, Xiaoshuai Sun, Rongrong Ji
Keywords: AI-Generated Content Stealth, Content Stealth, AI-Generated Content, rapid progress, progress in generative
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:The rapid progress in generative models has given rise to the critical task of AI-Generated Content Stealth (AIGC-S), which aims to create AI-generated images that can evade both forensic detectors and human inspection. This task is crucial for understanding the vulnerabilities of existing detection methods and developing more robust techniques. However, current adversarial attacks often introduce visible noise, have poor transferability, and fail to address spectral differences between AI-generated and genuine images. To address this, we propose StealthDiffusion, a framework based on stable diffusion that modifies AI-generated images into high-quality, imperceptible adversarial examples capable of evading state-of-the-art forensic detectors. StealthDiffusion comprises two main components: Latent Adversarial Optimization, which generates adversarial perturbations in the latent space of stable diffusion, and Control-VAE, a module that reduces spectral differences between the generated adversarial images and genuine images without affecting the original diffusion model’s generation process. Extensive experiments show that StealthDiffusion is effective in both white-box and black-box settings, transforming AI-generated images into high-quality adversarial forgeries with frequency spectra similar to genuine images. These forgeries are classified as genuine by advanced forensic classifiers and are difficult for humans to distinguish.

[AI-74] Utilizing Large Language Models to Optimize the Detection and Explainability of Phishing Websites

Link: https://arxiv.org/abs/2408.05667
Authors: Sayak Saha Roy, Shirin Nilizadeh
Keywords: lightweight Large Language, Large Language Model, lightweight Large, Large Language, specifically designed
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Abstract:In this paper, we introduce PhishLang, an open-source, lightweight Large Language Model (LLM) specifically designed for phishing website detection through contextual analysis of the website. Unlike traditional heuristic or machine learning models that rely on static features and struggle to adapt to new threats and deep learning models that are computationally intensive, our model utilizes the advanced language processing capabilities of LLMs to learn granular features that are characteristic of phishing attacks. Furthermore, PhishLang operates with minimal data preprocessing and offers performance comparable to leading deep learning tools, while being significantly faster and less resource-intensive. Over a 3.5-month testing period, PhishLang successfully identified approximately 26K phishing URLs, many of which were undetected by popular antiphishing blocklists, thus demonstrating its potential to aid current detection measures. We also evaluate PhishLang against several realistic adversarial attacks and develop six patches that make it very robust against such threats. Furthermore, we integrate PhishLang with GPT-3.5 Turbo to create explainable blocklisting: warnings that provide users with contextual information about different features that led to a website being marked as phishing. Finally, we have open-sourced the PhishLang framework and developed a Chromium-based browser extension and URL scanner website, which implement explainable warnings for end-users.

[AI-75] Eigen Attention: Attention in Low-Rank Space for KV Cache Compression

Link: https://arxiv.org/abs/2408.05646
Authors: Utkarsh Saxena, Gobinda Saha, Sakshi Choudhary, Kaushik Roy
Keywords: impressive reasoning abilities, natural language processing, language processing due, Large language models, represent a groundbreaking
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 12 pages, 6 figures, 6 tables

Abstract:Large language models (LLMs) represent a groundbreaking advancement in the domain of natural language processing due to their impressive reasoning abilities. Recently, there has been considerable interest in increasing the context lengths for these models to enhance their applicability to complex tasks. However, at long context lengths and large batch sizes, the key-value (KV) cache, which stores the attention keys and values, emerges as the new bottleneck in memory usage during inference. To address this, we propose Eigen Attention, which performs the attention operation in a low-rank space, thereby reducing the KV cache memory overhead. Our proposed approach is orthogonal to existing KV cache compression techniques and can be used synergistically with them. Through extensive experiments over OPT, MPT, and Llama model families, we demonstrate that Eigen Attention results in up to 40% reduction in KV cache sizes and up to 60% reduction in attention operation latency with minimal drop in performance.
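To see why the KV cache becomes a memory bottleneck at long context lengths and large batch sizes, a back-of-the-envelope calculation helps. The sketch below assumes a hypothetical decoder configuration (32 layers, 32 heads, head dimension 128, 16-bit values) and assumes a low-rank method stores keys and values with a reduced per-head dimension. How Eigen Attention actually chooses its ranks is not stated in the abstract.

```python
def kv_cache_bytes(num_layers, num_heads, kv_dim, seq_len, batch_size, bytes_per_value=2):
    """Memory needed to cache keys and values for every layer, head, and token."""
    # The leading factor of 2 accounts for storing both keys and values.
    return 2 * num_layers * num_heads * kv_dim * seq_len * batch_size * bytes_per_value


# Hypothetical model and workload sizes, chosen only for illustration.
full_cache = kv_cache_bytes(num_layers=32, num_heads=32, kv_dim=128,
                            seq_len=4096, batch_size=8)
low_rank_cache = kv_cache_bytes(num_layers=32, num_heads=32, kv_dim=96,
                                seq_len=4096, batch_size=8)
reduction = 1 - low_rank_cache / full_cache
```

Under these assumptions the full cache takes 16 GiB, and shrinking the stored dimension from 128 to 96 saves a quarter of it. The saving scales linearly with the retained rank.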

[AI-76] Federated Smoothing Proximal Gradient for Quantile Regression with Non-Convex Penalties

Link: https://arxiv.org/abs/2408.05640
Authors: Reza Mirzaeifard, Diyako Ghaderyan, Stefan Werner
Keywords: generate vast amounts, Distributed sensors, generate vast, vast amounts, data
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Distributed sensors in the internet-of-things (IoT) generate vast amounts of sparse data. Analyzing this high-dimensional data and identifying relevant predictors pose substantial challenges, especially when data is preferred to remain on the device where it was collected for reasons such as data integrity, communication bandwidth, and privacy. This paper introduces a federated quantile regression algorithm to address these challenges. Quantile regression provides a more comprehensive view of the relationship between variables than mean regression models. However, traditional approaches face difficulties when dealing with nonconvex sparse penalties and the inherent non-smoothness of the loss function. For this purpose, we propose a federated smoothing proximal gradient (FSPG) algorithm that integrates a smoothing mechanism with the proximal gradient framework, thereby enhancing both precision and computational speed. This integration adeptly handles optimization over a network of devices, each holding local data samples, making it particularly effective in federated learning scenarios. The FSPG algorithm ensures steady progress and reliable convergence in each iteration by maintaining or reducing the value of the objective function. By leveraging nonconvex penalties, such as the minimax concave penalty (MCP) and smoothly clipped absolute deviation (SCAD), the proposed method can identify and preserve key predictors within sparse models. Comprehensive simulations validate the robust theoretical foundations of the proposed algorithm and demonstrate improved estimation precision and reliable convergence.
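Two ingredients named in this abstract have standard closed forms: the quantile (pinball) loss, which is non-smooth at zero, and the minimax concave penalty (MCP). The sketch below writes out those textbook definitions with illustrative parameter values. It does not implement the FSPG algorithm, its smoothing mechanism, or the federated updates.

```python
def pinball_loss(residual, tau):
    """Quantile loss rho_tau(u) = u * (tau - 1[u < 0]); it has a kink at u = 0."""
    return residual * (tau - (1.0 if residual < 0 else 0.0))


def mcp_penalty(beta, lam, gamma):
    """Minimax concave penalty: close to lam * |beta| near zero, constant beyond gamma * lam."""
    magnitude = abs(beta)
    if magnitude <= gamma * lam:
        return lam * magnitude - magnitude * magnitude / (2.0 * gamma)
    return gamma * lam * lam / 2.0


# Illustrative values: tau = 0.9 penalises under-prediction nine times more
# heavily than over-prediction of the same size.
under = pinball_loss(2.0, 0.9)
over = pinball_loss(-2.0, 0.9)
small_coefficient = mcp_penalty(0.5, lam=1.0, gamma=3.0)
large_coefficient = mcp_penalty(10.0, lam=1.0, gamma=3.0)
```

Because the MCP flattens out for large coefficients, strong predictors are not shrunk further, which is consistent with the abstract's claim that such penalties preserve key predictors in sparse models.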

[AI-77] Enhancing Computational Efficiency in Intensive Domains via Redundant Residue Number Systems SOCC

Link: https://arxiv.org/abs/2408.05639
Authors: Soudabeh Mousavi, Dara Rahmati, Saeid Gorgin, Jeong-A Lee
Keywords: digital signal processing, redundant number systems, number systems, Binary Number System, residue number systems
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments: This paper has been accepted by the 21st International SoC Conference (ISOCC), 2024, 2 pages

Abstract:In computation-intensive domains such as digital signal processing, encryption, and neural networks, the performance of arithmetic units, including adders and multipliers, is pivotal. Conventional numerical systems often fall short of meeting the efficiency requirements of these applications concerning area, time, and power consumption. Innovative approaches like residue number systems (RNS) and redundant number systems have been introduced to surmount this challenge, markedly elevating computational efficiency. This paper examines from multiple perspectives how the fusion of redundant number systems with RNS (termed R-RNS) can diminish latency and enhance circuit implementation, yielding substantial benefits in practical scenarios. We conduct a comparative analysis of four systems: RNS, the redundant number system, the Binary Number System (BNS), and the Signed-Digit Redundant Residue Number System (SD-RNS). We then appraise SD-RNS through an advanced Deep Neural Network (DNN) utilizing the CIFAR-10 dataset. Our findings are encouraging, demonstrating that SD-RNS attains computational speedups of 1.27 times and 2.25 times over RNS and BNS, respectively, and reduces energy consumption by 60% compared to BNS during sequential addition and multiplication tasks.
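For readers unfamiliar with residue number systems, the sketch below illustrates plain RNS arithmetic only. It does not model the redundant or signed-digit variants the paper evaluates. The moduli set `(7, 11, 13)` and all function names are assumptions chosen for illustration. Each residue channel is added or multiplied independently with no carries between channels, which is where RNS gets its parallelism.

```python
from math import prod

MODULI = (7, 11, 13)  # pairwise coprime; representable range is 0..1000

def to_rns(x, moduli=MODULI):
    """Encode an integer as its residues modulo each modulus."""
    return tuple(x % m for m in moduli)

def rns_add(a, b, moduli=MODULI):
    """Channel-wise addition; no carries cross between channels."""
    return tuple((x + y) % m for x, y, m in zip(a, b, moduli))

def rns_mul(a, b, moduli=MODULI):
    """Channel-wise multiplication."""
    return tuple((x * y) % m for x, y, m in zip(a, b, moduli))

def from_rns(residues, moduli=MODULI):
    """Decode with the Chinese Remainder Theorem (Python 3.8+ for pow(..., -1, m))."""
    big_m = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        partial = big_m // m
        total += r * partial * pow(partial, -1, m)
    return total % big_m
```

Results are only correct while the true value stays below the product of the moduli (1001 here). Larger values wrap around.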

[AI-78] PRTGaussian: Efficient Relighting Using 3D Gaussians with Precomputed Radiance Transfer

Link: https://arxiv.org/abs/2408.05631
Authors: Libo Zhang, Yuxuan Han, Wenbin Lin, Jingwang Ling, Feng Xu
Keywords: Precomputed Radiance Transfer, realtime relightable novel-view, relightable novel-view synthesis, novel-view synthesis method, synthesis method made
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:We present PRTGaussian, a realtime relightable novel-view synthesis method made possible by combining 3D Gaussians and Precomputed Radiance Transfer (PRT). By fitting relightable Gaussians to multi-view OLAT data, our method enables real-time, free-viewpoint relighting. By estimating the radiance transfer based on high-order spherical harmonics, we achieve a balance between capturing detailed relighting effects and maintaining computational efficiency. We utilize a two-stage process: in the first stage, we reconstruct a coarse geometry of the object from multi-view images. In the second stage, we initialize 3D Gaussians with the obtained point cloud, then simultaneously refine the coarse geometry and learn the light transport for each Gaussian. Extensive experiments on synthetic datasets show that our approach can achieve fast and high-quality relighting for general objects. Code and data are available at this https URL.

[AI-79] Forecasting Day-Ahead Electricity Prices in the Integrated Single Electricity Market: Addressing Volatility with Comparative Machine Learning Methods

Link: https://arxiv.org/abs/2408.05628
Authors: Ben Harkin, Xueqin Liu
Keywords: Irish Integrated Single, Integrated Single Electricity, Irish Integrated, Integrated Single, Single Electricity Market
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper undertakes a comprehensive investigation of electricity price forecasting methods, focused on the Irish Integrated Single Electricity Market, particularly on changes during recent periods of high volatility. The primary objective of this research is to evaluate and compare the performance of various forecasting models, ranging from traditional machine learning models to more complex neural networks, as well as the impact of different lengths of training periods. The performance metrics, mean absolute error, root mean square error, and relative mean absolute error, are utilized to assess and compare the accuracy of each model. A comprehensive set of input features was investigated and selected from data recorded between October 2018 and September 2022. The paper demonstrates that the daily EU Natural Gas price is a more useful feature for electricity price forecasting in Ireland than the daily Henry Hub Natural Gas price. This study also shows that the correlation of features to the day-ahead market price has changed in recent years. The price of natural gas on the day and the amount of wind energy on the grid that hour are significantly more important than any other features. More specifically, the input fuel for electricity has become a more important driver of its price than total generation or demand. In addition, it can be seen that System Non-Synchronous Penetration (SNSP) is highly correlated with the day-ahead market price, and that renewables are pushing down the price of electricity.
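The three metrics named in the abstract can be sketched as follows. MAE and RMSE are standard. For relative MAE the abstract gives no formula, so this sketch assumes the definition common in electricity price forecasting: the model's MAE divided by the MAE of a naive benchmark forecast. The function names and the `y_naive` argument are illustrative.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(diff)))

def rmse(y_true, y_pred):
    """Root mean square error."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def relative_mae(y_true, y_pred, y_naive):
    """Model MAE divided by the MAE of a naive benchmark (assumed definition).

    Values below 1 mean the model beats the benchmark.
    """
    return mae(y_true, y_pred) / mae(y_true, y_naive)
```

A typical naive benchmark for day-ahead prices reuses the price from the same hour on a previous day, though the abstract does not say which benchmark the authors used.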

[AI-80] UrFound: Towards Universal Retinal Foundation Models via Knowledge-Guided Masked Modeling

Link: https://arxiv.org/abs/2408.05618
Authors: Kai Yu, Yang Zhou, Yang Bai, Zhi Da Soh, Xinxing Xu, Rick Siow Mong Goh, Ching-Yu Cheng, Yong Liu
Keywords: Color Fundus Photography, Optical Coherence Tomography, label-efficient model adaptation, Retinal foundation, Retinal
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Retinal foundation models aim to learn generalizable representations from diverse retinal images, facilitating label-efficient model adaptation across various ophthalmic tasks. Despite their success, current retinal foundation models are generally restricted to a single imaging modality, such as Color Fundus Photography (CFP) or Optical Coherence Tomography (OCT), limiting their versatility. Moreover, these models may struggle to fully leverage expert annotations and overlook the valuable domain knowledge essential for domain-specific representation learning. To overcome these limitations, we introduce UrFound, a retinal foundation model designed to learn universal representations from both multimodal retinal images and domain knowledge. UrFound is equipped with a modality-agnostic image encoder and accepts either CFP or OCT images as inputs. To integrate domain knowledge into representation learning, we encode expert annotation in text supervision and propose a knowledge-guided masked modeling strategy for model pre-training. It involves reconstructing randomly masked patches of retinal images while predicting masked text tokens conditioned on the corresponding retinal image. This approach aligns multimodal images and textual expert annotations within a unified latent space, facilitating generalizable and domain-specific representation learning. Experimental results demonstrate that UrFound exhibits strong generalization ability and data efficiency when adapting to various tasks in retinal image analysis. By training on ~180k retinal images, UrFound significantly outperforms the state-of-the-art retinal foundation model trained on up to 1.6 million unlabelled images across 8 public retinal datasets. Our code and data are available at this https URL.

[AI-81] Residual-INR: Communication Efficient On-Device Learning Using Implicit Neural Representation

Link: https://arxiv.org/abs/2408.05617
Authors: Hanqiu Chen, Xuebin Yao, Pradeep Subedi, Cong Hao
Keywords: distributed computing paradigm, on-device learning, edge computing system, Edge computing, Edge
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Information Theory (cs.IT)
Comments: This paper has been accepted by ICCAD 2024

Abstract:Edge computing is a distributed computing paradigm that collects and processes data at or near the source of data generation. On-device learning at the edge relies on device-to-device wireless communication to facilitate real-time data sharing and collaborative decision-making among multiple devices. This significantly improves the adaptability of the edge computing system to changing environments. However, as the scale of the edge computing system grows, communication among devices is becoming the bottleneck because the limited bandwidth of wireless communication leads to large data transfer latency. To reduce the amount of device-to-device data transmission and accelerate on-device learning, in this paper, we propose Residual-INR, a fog computing-based communication-efficient on-device learning framework that utilizes implicit neural representation (INR) to compress images/videos into neural network weights. Residual-INR enhances data transfer efficiency by collecting JPEG images from edge devices, compressing them into INR format at the fog node, and redistributing them for on-device learning. By using a smaller INR for full image encoding and a separate object INR for high-quality object region reconstruction through residual encoding, our technique can reduce the encoding redundancy while maintaining the object quality. Residual-INR is a promising solution for edge on-device learning because it reduces data transmission by up to 5.16x across a network of 10 edge devices. It also facilitates CPU-free accelerated on-device learning, achieving up to 2.9x speedup without sacrificing accuracy. Our code is available at: this https URL.

[AI-82] Representation Alignment from Human Feedback for Cross-Embodiment Reward Learning from Mixed-Quality Demonstrations

Link: https://arxiv.org/abs/2408.05610
Authors: Connor Mattson, Anurag Aribandi, Daniel S. Brown
Keywords: inverse reinforcement learning, cross-embodiment inverse reinforcement, action space, inverse reinforcement, learning
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: First Two Authors Share Equal Contribution. 19 Pages, 4 Figures

Abstract:We study the problem of cross-embodiment inverse reinforcement learning, where we wish to learn a reward function from video demonstrations in one or more embodiments and then transfer the learned reward to a different embodiment (e.g., different action space, dynamics, size, shape, etc.). Learning reward functions that transfer across embodiments is important in settings such as teaching a robot a policy via human video demonstrations or teaching a robot to imitate a policy from another robot with a different embodiment. However, prior work has only focused on cases where near-optimal demonstrations are available, which is often difficult to ensure. By contrast, we study the setting of cross-embodiment reward learning from mixed-quality demonstrations. We demonstrate that prior work struggles to learn generalizable reward representations when learning from mixed-quality data. We then analyze several techniques that leverage human feedback for representation learning and alignment to enable effective cross-embodiment learning. Our results give insight into how different representation learning techniques lead to qualitatively different reward shaping behaviors and the importance of human feedback when learning from mixed-quality, mixed-embodiment data.

[AI-83] Mitigating Metropolitan Carbon Emissions with Dynamic Eco-driving at Scale

Link: https://arxiv.org/abs/2408.05609
Authors: Vindula Jayawardana, Baptiste Freydt, Ao Qu, Cameron Hickert, Edgar Sanchez, Catherine Tang, Mark Taylor, Blaine Leonard, Cathy Wu
Keywords: sector to decarbonize, sheer scale, scale and diversity, diversity of transportation, transportation make
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Robotics (cs.RO)
Comments: In review

Abstract:The sheer scale and diversity of transportation make it a formidable sector to decarbonize. Here, we consider an emerging opportunity to reduce carbon emissions: the growing adoption of semi-autonomous vehicles, which can be programmed to mitigate stop-and-go traffic through intelligent speed commands and, thus, reduce emissions. But would such dynamic eco-driving move the needle on climate change? A comprehensive impact analysis has been out of reach due to the vast array of traffic scenarios and the complexity of vehicle emissions. We address this challenge with large-scale scenario modeling efforts and by using multi-task deep reinforcement learning with a carefully designed network decomposition strategy. We perform an in-depth prospective impact assessment of dynamic eco-driving at 6,011 signalized intersections across three major US metropolitan cities, simulating a million traffic scenarios. Overall, we find that vehicle trajectories optimized for emissions can cut city-wide intersection carbon emissions by 11-22%, without harming throughput or safety, and with reasonable assumptions, equivalent to the national emissions of Israel and Nigeria, respectively. We find that 10% eco-driving adoption yields 25%-50% of the total reduction, and nearly 70% of the benefits come from 20% of intersections, suggesting near-term implementation pathways. However, the composition of this high-impact subset of intersections varies considerably across different adoption levels, with minimal overlap, calling for careful strategic planning for eco-driving deployments. Moreover, the impact of eco-driving, when considered jointly with projections of vehicle electrification and hybrid vehicle adoption, remains significant. More broadly, this work paves the way for large-scale analysis of traffic externalities, such as time, safety, and air quality, and the potential impact of solution strategies.

[AI-84] Exploring Applications of State Space Models and Advanced Training Techniques in Sequential Recommendations: A Comparative Study on Efficiency and Performance

Link: https://arxiv.org/abs/2408.05606
Authors: Mark Obozov, Makar Baderko, Stepan Kulibaba, Nikolay Kutuzov, Alexander Gasnikov
Keywords: Recommender systems aim, dynamically changing user, historical user behaviour, changing user preferences, Recommender systems
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: arXiv admin note: text overlap with arXiv:2403.07691 by other authors

Abstract:Recommender systems aim to estimate the dynamically changing user preferences and sequential dependencies between historical user behaviour and metadata. Although transformer-based models have proven to be effective in sequential recommendations, their state growth is proportional to the length of the sequence that is being processed, which makes them expensive in terms of memory and inference costs. Our research focused on three promising directions in sequential recommendations: enhancing speed through the use of State Space Models (SSM), as they can achieve SOTA results in the sequential recommendations domain with lower latency, memory, and inference costs, as proposed by arXiv:2403.03900; improving the quality of recommendations with Large Language Models (LLMs) via Monolithic Preference Optimization without Reference Model (ORPO); and implementing adaptive batch- and step-size algorithms to reduce costs and accelerate training processes.

[AI-85] Sequential Representation Learning via Static-Dynamic Conditional Disentanglement ECCV2024

Link: https://arxiv.org/abs/2408.05599
Authors: Mathieu Cyrille Simon, Pascal Frossard, Christophe De Vleeschouwer
Keywords: paper explores self-supervised, explores self-supervised disentangled, self-supervised disentangled representation, disentangled representation learning, additional Normalizing Flows
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ECCV 2024

Abstract:This paper explores self-supervised disentangled representation learning within sequential data, focusing on separating time-independent and time-varying factors in videos. We propose a new model that breaks the usual independence assumption between those factors by explicitly accounting for the causal relationship between the static/dynamic variables and that improves the model expressivity through additional Normalizing Flows. A formal definition of the factors is proposed. This formalism leads to the derivation of sufficient conditions for the ground truth factors to be identifiable, and to the introduction of a novel theoretically grounded disentanglement constraint that can be directly and efficiently incorporated into our new framework. The experiments show that the proposed approach outperforms previous complex state-of-the-art techniques in scenarios where the dynamics of a scene are influenced by its content.

[AI-86] In-Context Exploiter for Extensive-Form Games

Link: https://arxiv.org/abs/2408.05575
Authors: Shuxin Li, Chang Yang, Youzhi Zhang, Pengdeng Li, Xinrun Wang, Xiao Huang, Hau Chan, Bo An
Keywords: widely adopted solution, adopted solution concept, Nash equilibrium, game theory due, stability property
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Nash equilibrium (NE) is a widely adopted solution concept in game theory due to its stability property. However, we observe that the NE strategy might not always yield the best results, especially against opponents who do not adhere to NE strategies. Based on this observation, we pose a new game-solving question: Can we learn a model that can exploit any, even NE, opponent to maximize their own utility? In this work, we make the first attempt to investigate this problem through in-context learning. Specifically, we introduce a novel method, In-Context Exploiter (ICE), to train a single model that can act as any player in the game and adaptively exploit opponents entirely by in-context learning. Our ICE algorithm involves generating diverse opponent strategies, collecting interactive history training data by a reinforcement learning algorithm, and training a transformer-based agent within a well-designed curriculum learning framework. Finally, comprehensive experimental results validate the effectiveness of our ICE algorithm, showcasing its in-context learning ability to exploit any unknown opponent, thereby positively answering our initial game-solving question.

[AI-87] Metacognitive Myopia in Large Language Models

Link: https://arxiv.org/abs/2408.05568
Authors: Florian Scholten, Tobias R. Rebholz, Mandy Hütter
Keywords: Large Language Models, cloud moral judgments, Large Language, exhibit potentially harmful, culturally inherent stereotypes
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Applications (stat.AP)
Comments:

[AI-88] Document-Level Event Extraction with Definition-Driven ICL

Link: https://arxiv.org/abs/2408.05566
Authors: Zhuoyuan Liu, Yilin Luo
Keywords: Natural Language Processing, Large Language Models, document-level event extraction, Language Processing, Large Language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Information Retrieval (cs.IR)
Comments:

[AI-89] Impacts of Darwinian Evolution on Pre-trained Deep Neural Networks

Link: https://arxiv.org/abs/2408.05563
Authors: Guodong Du, Runhua Jiang, Senqiao Yang, Haoyang Li, Wei Chen, Keren Li, Sim Kuan Goh, Ho-Kin Tang
Keywords: deep neural networks, neural network optimization, lines of evidence, remain unclear, neural networks
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: This work has been submitted to the IEEE for possible publication

Abstract:Darwinian evolution of the biological brain is documented through multiple lines of evidence, although the modes of evolutionary changes remain unclear. Drawing inspiration from the evolved neural systems (e.g., visual cortex), deep learning models have demonstrated superior performance in visual tasks, among others. While the success of training deep neural networks has been relying on back-propagation (BP) and its variants to learn representations from data, BP does not incorporate the evolutionary processes that govern biological neural systems. This work proposes a neural network optimization framework based on evolutionary theory. Specifically, BP-trained deep neural networks for visual recognition tasks obtained from the ending epochs are considered the primordial ancestors (initial population). Subsequently, the population evolved with differential evolution. Extensive experiments are carried out to examine the relationships between Darwinian evolution and neural network optimization, including the correspondence between datasets, environment, models, and living species. The empirical results show that the proposed framework has positive impacts on the network, with reduced over-fitting and an order of magnitude lower time complexity compared to BP. Moreover, the experiments show that the proposed framework performs well on deep neural networks and big datasets.
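The abstract says a population of trained networks is evolved with differential evolution but does not name the variant. As a rough illustration only, the sketch below implements one generation of the classic DE/rand/1/bin scheme over flat parameter vectors. The function name, the hyperparameters `f` and `cr`, and the toy sphere objective standing in for a validation loss are all assumptions, not the paper's setup.

```python
import numpy as np

def de_generation(pop, fitness, objective, rng, f=0.5, cr=0.9):
    """One generation of DE/rand/1/bin minimizing `objective`.

    pop: (n, d) array of parameter vectors with n >= 4.
    fitness: (n,) array of objective values for pop.
    """
    n, d = pop.shape
    new_pop, new_fit = pop.copy(), fitness.copy()
    for i in range(n):
        # Pick three distinct individuals other than i.
        others = [j for j in range(n) if j != i]
        a, b, c = pop[rng.choice(others, size=3, replace=False)]
        mutant = a + f * (b - c)
        # Binomial crossover; force at least one gene from the mutant.
        cross = rng.random(d) < cr
        cross[rng.integers(d)] = True
        trial = np.where(cross, mutant, pop[i])
        trial_fit = objective(trial)
        # Greedy selection: keep the trial only if it is no worse.
        if trial_fit <= fitness[i]:
            new_pop[i], new_fit[i] = trial, trial_fit
    return new_pop, new_fit

# Toy usage: 5-dimensional "weights" and a sphere objective.
def sphere(w):
    return float(np.sum(w ** 2))

rng = np.random.default_rng(0)
pop = rng.normal(size=(8, 5))
fit = np.array([sphere(w) for w in pop])
for _ in range(50):
    pop, fit = de_generation(pop, fit, sphere, rng)
```

Because selection is greedy, no individual's fitness can get worse from one generation to the next. In the paper's setting the initial population would come from the final epochs of back-propagation training rather than from random draws.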

[AI-90] Evolutionary Neural Architecture Search for 3D Point Cloud Analysis

Link: https://arxiv.org/abs/2408.05556
Authors: Yisheng Yang, Guodong Du, Chean Khim Toa, Ho-Kin Tang, Sim Kuan Goh
Keywords: Interaction Dimension Search, Self-adaptive Differential Evolution, manual architecture design, automates neural network, neural network design
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: This work has been submitted to the IEEE for possible publication

Abstract:Neural architecture search (NAS) automates neural network design by using optimization algorithms to navigate architecture spaces, reducing the burden of manual architecture design. While NAS has achieved success, applying it to emerging domains, such as analyzing unstructured 3D point clouds, remains underexplored due to the data lying in non-Euclidean spaces, unlike images. This paper presents Success-History-based Self-adaptive Differential Evolution with a Joint Point Interaction Dimension Search (SHSADE-PIDS), an evolutionary NAS framework that encodes discrete deep neural network architectures to continuous spaces and performs searches in the continuous spaces for efficient point cloud neural architectures. Comprehensive experiments on challenging 3D segmentation and classification benchmarks demonstrate SHSADE-PIDS’s capabilities. It discovered highly efficient architectures with higher accuracy, significantly advancing prior NAS techniques. For segmentation on SemanticKITTI, SHSADE-PIDS attained 64.51% mean IoU using only 0.55M parameters and 4.5GMACs, reducing overhead by over 22-26X versus other top methods. For ModelNet40 classification, it achieved 93.4% accuracy with just 1.31M parameters, surpassing larger models. SHSADE-PIDS provided valuable insights into bridging evolutionary algorithms with neural architecture optimization, particularly for emerging frontiers like point cloud learning.

[AI-91] Multi-layer Sequence Labeling-based Joint Biomedical Event Extraction NLPCC2024

Link: https://arxiv.org/abs/2408.05545
Authors: Gongchi Chen, Pengchao Wu, Jinghang Gu, Longhua Qian, Guodong Zhou
Keywords: recent years, dominated by complicated, complicated pipeline, biomedical event extraction, biomedical event
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 13 pages, 3 figures, accepted by NLPCC2024

[AI-92] CryoBench: Diverse and challenging datasets for the heterogeneity problem in cryo-EM

Link: https://arxiv.org/abs/2408.05526
Authors: Minkyu Jeon, Rishwanth Raghu, Miro Astore, Geoffrey Woollard, Ryan Feathers, Alkin Kaz, Sonya M. Hanson, Pilar Cossio, Ellen D. Zhong
Keywords: Cryo-electron microscopy, determining high-resolution, imaging data, powerful technique, Cryo-electron
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Biomolecules (q-bio.BM)
Comments:

Abstract:Cryo-electron microscopy (cryo-EM) is a powerful technique for determining high-resolution 3D biomolecular structures from imaging data. As this technique can capture dynamic biomolecular complexes, 3D reconstruction methods are increasingly being developed to resolve this intrinsic structural heterogeneity. However, the absence of standardized benchmarks with ground truth structures and validation metrics limits the advancement of the field. Here, we propose CryoBench, a suite of datasets, metrics, and performance benchmarks for heterogeneous reconstruction in cryo-EM. We propose five datasets representing different sources of heterogeneity and degrees of difficulty. These include conformational heterogeneity generated from simple motions and random configurations of antibody complexes and from tens of thousands of structures sampled from a molecular dynamics simulation. We also design datasets containing compositional heterogeneity from mixtures of ribosome assembly states and 100 common complexes present in cells. We then perform a comprehensive analysis of state-of-the-art heterogeneous reconstruction tools including neural and non-neural methods and their sensitivity to noise, and propose new metrics for quantitative comparison of methods. We hope that this benchmark will be a foundational resource for analyzing existing methods and new algorithmic development in both the cryo-EM and machine learning communities.

[AI-93] Disentangled Noisy Correspondence Learning

Link: https://arxiv.org/abs/2408.05503
Authors: Zhuohang Dang, Minnan Luo, Jihong Wang, Chengyou Jia, Haochen Han, Herun Wan, Guang Dai, Xiaojun Chang, Jingdong Wang
Keywords: understanding latent correspondences, retrieval is crucial, crucial in understanding, understanding latent, MEI
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Cross-modal retrieval is crucial in understanding latent correspondences across modalities. However, existing methods implicitly assume well-matched training data, which is impractical as real-world data inevitably involves imperfect alignments, i.e., noisy correspondences. Although some works explore similarity-based strategies to address such noise, they suffer from sub-optimal similarity predictions influenced by modality-exclusive information (MEI), e.g., background noise in images and abstract definitions in texts. This issue arises as MEI is not shared across modalities, thus aligning it in training can markedly mislead similarity predictions. Moreover, although intuitive, directly applying previous cross-modal disentanglement methods suffers from limited noise tolerance and disentanglement efficacy. Inspired by the robustness of information bottlenecks against noise, we introduce DisNCL, a novel information-theoretic framework for feature Disentanglement in Noisy Correspondence Learning, to adaptively balance the extraction of modality-invariant information (MII) and MEI with certifiable optimal cross-modal disentanglement efficacy. DisNCL then enhances similarity predictions in the modality-invariant subspace, thereby greatly boosting similarity-based alleviation strategies for noisy correspondences. Furthermore, DisNCL introduces soft matching targets to model noisy many-to-many relationships inherent in multi-modal input for noise-robust and accurate cross-modal alignment. Extensive experiments confirm DisNCL’s efficacy by 2% average recall improvement. Mutual information estimation and visualization results show that DisNCL learns meaningful MII/MEI subspaces, validating our theoretical analyses.

[AI-94] PointNCBW: Towards Dataset Ownership Verification for Point Clouds via Negative Clean-label Backdoor Watermark

Link: https://arxiv.org/abs/2408.05500
Authors: Cheng Wei, Yang Wang, Kuofeng Gao, Shuo Shao, Yiming Li, Zhibo Wang, Zhan Qin
Keywords: point clouds, computer vision, time-consuming and expensive, collection is time-consuming, Recently
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 12 pages

Abstract:Recently, point clouds have been widely used in computer vision, whereas their collection is time-consuming and expensive. As such, point cloud datasets are the valuable intellectual property of their owners and deserve protection. To detect and prevent unauthorized use of these datasets, especially for commercial or open-sourced ones that cannot be sold again or used commercially without permission, we intend to identify whether a suspicious third-party model is trained on our protected dataset under the black-box setting. We achieve this goal by designing a scalable clean-label backdoor-based dataset watermark for point clouds that ensures both effectiveness and stealthiness. Unlike existing clean-label watermark schemes, which are susceptible to the number of categories, our method could watermark samples from all classes instead of only from the target one. Accordingly, it can still preserve high effectiveness even on large-scale datasets with many classes. Specifically, we perturb selected point clouds with non-target categories in both shape-wise and point-wise manners before inserting trigger patterns without changing their labels. The features of perturbed samples are similar to those of benign samples from the target class. As such, models trained on the watermarked dataset will have a distinctive yet stealthy backdoor behavior, i.e., misclassifying samples from the target class whenever triggers appear, since the trained DNNs will treat the inserted trigger pattern as a signal to deny predicting the target label. We also design a hypothesis-test-guided dataset ownership verification based on the proposed watermark. Extensive experiments on benchmark datasets are conducted, verifying the effectiveness of our method and its resistance to potential removal methods.

[AI-95] LLMServingSim: A HW/SW Co-Simulation Infrastructure for LLM Inference Serving at Scale

Link: https://arxiv.org/abs/2408.05499
Authors: Jaehong Cho, Minsu Kim, Hyunmin Choi, Guseul Heo, Jongse Park
Keywords: large language model, building efficient large, efficient large language, LLM serving, LLM serving systems
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: 15 pages, 11 figures

Abstract:Recently, there has been an extensive research effort in building efficient large language model (LLM) inference serving systems. These efforts not only include innovations in the algorithm and software domains but also constitute developments of various hardware acceleration techniques. Nevertheless, there is a lack of simulation infrastructure capable of accurately modeling versatile hardware-software behaviors in LLM serving systems without extensively extending the simulation time. This paper aims to develop an effective simulation tool, called LLMServingSim, to support future research in LLM serving systems. In designing LLMServingSim, we focus on two limitations of existing simulators: (1) they lack consideration of the dynamic workload variations of LLM inference serving due to its autoregressive nature, and (2) they incur repetitive simulations without leveraging algorithmic redundancies in LLMs. To address these limitations, LLMServingSim simulates LLM serving at the granularity of iterations, leveraging the computation redundancies across decoder blocks and reusing the simulation results from previous iterations. Additionally, LLMServingSim provides a flexible framework that allows users to plug in any accelerator compiler-and-simulation stacks for exploring various system designs with heterogeneous processors. Our experiments demonstrate that LLMServingSim produces simulation results closely following the performance behaviors of a real GPU-based LLM serving system with less than 14.7% error rate, while offering 91.5x faster simulation speed compared to existing accelerator simulators.

[AI-96] Structure and Reduction of MCTS for Explainable-AI ECAI2024

Link: https://arxiv.org/abs/2408.05488
Authors: Ronit Bustin, Claudia V. Goldman
Keywords: Monte Carlo Tree, Carlo Tree Search, covering infinite states’, infinite states’ space, Carlo Tree
Subjects: Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Comments: ECAI 2024

Abstract: Complex sequential decision-making planning problems covering infinite state spaces have been shown to be solvable by AlphaZero-type algorithms. Such an approach, which trains a neural model while simulating projections of futures with a Monte Carlo Tree Search algorithm, was shown to be applicable to real-life planning problems. As such, engineers and users interacting with the resulting policy of behavior might benefit from obtaining automated explanations about these planners’ decisions offline or online. This paper focuses on the information within the Monte Carlo Tree Search data structure. Given its construction, this information contains much of the reasoning of the sequential decision-making algorithm and is essential for its explainability. We show novel methods using information-theoretic tools for the simplification and reduction of the Monte Carlo Tree Search and the extraction of information. Such information can be directly used for the construction of human-understandable explanations. We show that basic explainability quantities can be calculated with limited additional computational cost, as an integrated part of the Monte Carlo Tree Search construction process. We focus on the theoretical and algorithmic aspects and provide examples of how the methods presented here can be used in the construction of human-understandable explanations.

[AI-97] Multi-agent Planning using Visual Language Models

Link: https://arxiv.org/abs/2408.05478
Authors: Michele Brienza, Francesco Argenziano, Vincenzo Suriani, Domenico D. Bloisi, Daniele Nardi
Keywords: Large Language Models, Visual Language Models, Large Language, Visual Language, attracting increasing interest
Subjects: Artificial Intelligence (cs.AI)

Abstract:Large Language Models (LLMs) and Visual Language Models (VLMs) are attracting increasing interest due to their improving performance and applications across various domains and tasks. However, LLMs and VLMs can produce erroneous results, especially when a deep understanding of the problem domain is required. For instance, when planning and perception are needed simultaneously, these models often struggle because of difficulties in merging multi-modal information. To address this issue, fine-tuned models are typically employed and trained on specialized data structures representing the environment. This approach has limited effectiveness, as it can overly complicate the context for processing. In this paper, we propose a multi-agent architecture for embodied task planning that operates without the need for specific data structures as input. Instead, it uses a single image of the environment, handling free-form domains by leveraging commonsense knowledge. We also introduce a novel, fully automatic evaluation procedure, PG2S, designed to better assess the quality of a plan. We validated our approach using the widely recognized ALFRED dataset, comparing PG2S to the existing KAS metric to further evaluate the quality of the generated plans.

[AI-98] Artworks Reimagined: Exploring Human-AI Co-Creation through Body Prompting

Link: https://arxiv.org/abs/2408.05476
Authors: Jonas Oppenlaender, Hannah Johnston, Johanna Silvennoinen, Helena Barranha
Keywords: generative artificial intelligence, popular activity, artificial intelligence, body prompting, Image generation
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments: 16 pages, 5 figures, 2 tables

Abstract:Image generation using generative artificial intelligence is a popular activity. However, it is almost exclusively performed in the privacy of an individual’s home via typing on a keyboard. In this article, we explore body prompting as input for image generation. Body prompting extends interaction with generative AI beyond textual inputs to reconnect the creative act of image generation with the physical act of creating artworks. We implement this concept in an interactive art installation, Artworks Reimagined, designed to transform artworks via body prompting. We deployed the installation at an event with hundreds of visitors in a public and private setting. Our results from a sample of visitors (N=79) show that body prompting was well-received and provides an engaging and fun experience. We identify three distinct patterns of embodied interaction with the generative AI and present insights into participants’ experience of body prompting and AI co-creation. We provide valuable recommendations for practitioners seeking to design interactive generative AI experiences in museums, galleries, and other public cultural spaces.

[AI-99] Investigating Instruction Tuning Large Language Models on Graphs

Link: https://arxiv.org/abs/2408.05457
Authors: Kerui Zhu, Bo-Wei Huang, Bowen Jin, Yizhu Jiao, Ming Zhong, Kevin Chang, Shou-De Lin, Jiawei Han
Keywords: Large Language Models, advancements of Large, Language Models, Large Language, NLP tasks
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: COLM 2024

[AI-100] Mathematical Models of Computation in Superposition ICML2024

Link: https://arxiv.org/abs/2408.05451
Authors: Kaarel Hänni, Jake Mendel, Dmitry Vaintrob, Lawrence Chan
Keywords: mechanistically interpreting current, Superposition, current AI systems, challenge to mechanistically
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 28 pages, 5 figures. Published at the ICML 2024 Mechanistic Interpretability (MI) Workshop

Abstract: Superposition, when a neural network represents more “features” than it has dimensions, seems to pose a serious challenge to mechanistically interpreting current AI systems. Existing theory work studies representational superposition, where superposition is only used when passing information through bottlenecks. In this work, we present mathematical models of computation in superposition, where superposition is actively helpful for efficiently accomplishing the task. We first construct a task of efficiently emulating a circuit that takes the AND of the $\binom{m}{2}$ pairs of each of $m$ features. We construct a 1-layer MLP that uses superposition to perform this task up to $\varepsilon$-error, where the network only requires $\tilde{O}(m^{\frac{2}{3}})$ neurons, even when the input features are themselves in superposition. We generalize this construction to arbitrary sparse boolean circuits of low depth, and then construct “error correction” layers that allow deep fully-connected networks of width $d$ to emulate circuits of width $\tilde{O}(d^{1.5})$ and any polynomial depth. We conclude by providing some potential applications of our work for interpreting neural networks that implement computation in superposition.

[AI-101] EPAM-Net: An Efficient Pose-driven Attention-guided Multimodal Network for Video Action Recognition

Link: https://arxiv.org/abs/2408.05421
Authors: Ahmed Abdelkawy, Asem Ali, Aly Farag
Keywords: Existing multimodal-based human, multiple data modalities, multimodal-based human action, Existing multimodal-based, human action recognition
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract: Existing multimodal-based human action recognition approaches are either computationally expensive, which limits their applicability in real-time scenarios, or fail to exploit the spatio-temporal information of multiple data modalities. In this work, we present an efficient pose-driven attention-guided multimodal network (EPAM-Net) for action recognition in videos. Specifically, we adapted X3D networks for both RGB and pose streams to capture spatio-temporal features from RGB videos and their skeleton sequences. The skeleton features are then used to help the visual network stream focus on key frames and their salient spatial regions using a spatio-temporal attention block. Finally, the scores of the two streams of the proposed network are fused for final classification. The experimental results show that our method achieves competitive performance on the NTU RGB-D 60 and NTU RGB-D 120 benchmark datasets. Moreover, our model provides a 6.2–9.9x reduction in FLOPs (floating-point operations, in number of multiply-adds) and a 9–9.6x reduction in the number of network parameters. The code will be available at this https URL.

[AI-102] High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model

Link: https://arxiv.org/abs/2408.05416
Authors: Weizhi Zhong, Junfan Lin, Peixin Chen, Liang Lin, Guanbin Li
Keywords: huge industrial potential, attracted increasing attention, Audio-driven talking face, increasing attention due, Audio-driven talking
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: Submitted to IEEE Transactions on Image Processing (TIP)

Abstract:Audio-driven talking face video generation has attracted increasing attention due to its huge industrial potential. Some previous methods focus on learning a direct mapping from audio to visual content. Despite progress, they often struggle with the ambiguity of the mapping process, leading to flawed results. An alternative strategy involves facial structural representations (e.g., facial landmarks) as intermediaries. This multi-stage approach better preserves the appearance details but suffers from error accumulation due to the independent optimization of different stages. Moreover, most previous methods rely on generative adversarial networks, prone to training instability and mode collapse. To address these challenges, our study proposes a novel landmark-based diffusion model for talking face generation, which leverages facial landmarks as intermediate representations while enabling end-to-end optimization. Specifically, we first establish the less ambiguous mapping from audio to landmark motion of lip and jaw. Then, we introduce an innovative conditioning module called TalkFormer to align the synthesized motion with the motion represented by landmarks via differentiable cross-attention, which enables end-to-end optimization for improved lip synchronization. Besides, TalkFormer employs implicit feature warping to align the reference image features with the target motion for preserving more appearance details. Extensive experiments demonstrate that our approach can synthesize high-fidelity and lip-synced talking face videos, preserving more subject appearance details from the reference image.

[AI-103] Style-Preserving Lip Sync via Audio-Aware Style Reference

Link: https://arxiv.org/abs/2408.05412
Authors: Weizhi Zhong, Jichang Li, Yinqi Cai, Liang Lin, Guanbin Li
Keywords: Audio-driven lip sync, recently drawn significant, drawn significant attention, lip sync, Audio-driven lip
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: Submitted to IEEE Transactions on Image Processing (TIP)

Abstract: Audio-driven lip sync has recently drawn significant attention due to its widespread application in the multimedia domain. Individuals exhibit distinct lip shapes when speaking the same utterance because of their unique speaking styles, which poses a notable challenge for audio-driven lip sync. Earlier methods for this task often bypassed the modeling of personalized speaking styles, resulting in sub-optimal lip sync that conforms to general styles. Recent lip sync techniques attempt to guide the lip sync for arbitrary audio by aggregating information from a style reference video, yet they cannot preserve the speaking styles well due to their inaccuracy in style aggregation. This work proposes an innovative audio-aware style reference scheme that effectively leverages the relationships between the input audio and the reference audio from the style reference video to address style-preserving audio-driven lip sync. Specifically, we first develop an advanced Transformer-based model adept at predicting lip motion corresponding to the input audio, augmented by the style information aggregated through cross-attention layers from the style reference video. Afterwards, to better render the lip motion into realistic talking face video, we devise a conditional latent diffusion model, integrating lip motion through modulated convolutional layers and fusing reference facial images via spatial cross-attention layers. Extensive experiments validate the efficacy of the proposed approach in achieving precise lip sync, preserving speaking styles, and generating high-fidelity, realistic talking face videos.

[AI-104] A Cost-Effective Eye-Tracker for Early Detection of Mild Cognitive Impairment

Link: https://arxiv.org/abs/2408.05369
Authors: Danilo Greco, Francesco Masulli, Stefano Rovetta, Alberto Cabri, Davide Daffonchio
Keywords: Mild Cognitive Impairment, Visual Paired Comparison, Paired Comparison protocol, low-cost eye-tracker aimed, Cognitive Impairment
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)

Abstract: This paper presents a low-cost eye-tracker aimed at carrying out tests based on a Visual Paired Comparison protocol for the early detection of Mild Cognitive Impairment. The proposed eye-tracking system is based on machine learning algorithms, a standard webcam, and two personal computers that constitute, respectively, the “Measurement Sub-System”, which performs the test on the patients, and the “Test Management Sub-System”, which medical staff use to configure the test protocol, record the patient data, monitor the test, and store the test results. The system also integrates a stress estimator based on the measurement of heart rate variability obtained with photoplethysmography.

[AI-105] FiST-Financial Style Transfer with Hallucination and Creativity Control Framework

Link: https://arxiv.org/abs/2408.05365
Authors: Sohini Roychowdhury, Marko Krema, Brian Moore, Xingjian Lai, Dike Effedua, Bharat Jethwani
Keywords: general purpose large, purpose large language, Financial report generation, language models pose, large language models
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments: 8 pages, 13 figures, 5 tables, conference

[AI-106] MindSpeech: Continuous Imagined Speech Decoding using High-Density fNIRS and Prompt Tuning for Advanced Human-AI Interaction

Link: https://arxiv.org/abs/2408.05362
Authors: Suyi Zhang, Ekram Alam, Jack Baber, Francesca Bianco, Edward Turner, Maysam Chamanzar, Hamid Dehghani
Keywords: artificial intelligence systems, coming decade, artificial intelligence, continue to improve, improve and revolutionise
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)

Abstract: In the coming decade, artificial intelligence systems will continue to improve and revolutionise every industry and facet of human life. Designing effective, seamless and symbiotic communication paradigms between humans and AI agents is increasingly important. This paper reports a novel method for human-AI interaction by developing a direct brain-AI interface. We discuss a novel AI model, called MindSpeech, which enables open-vocabulary, continuous decoding for imagined speech. This study focuses on enhancing human-AI communication by utilising high-density functional near-infrared spectroscopy (fNIRS) data to develop an AI model capable of decoding imagined speech non-invasively. We discuss a new word cloud paradigm for data collection, improving the quality and variety of imagined sentences generated by participants and covering a broad semantic space. Utilising a prompt tuning-based approach, we employed the Llama2 large language model (LLM) for text generation guided by brain signals. Our results show significant improvements in key metrics, such as BLEU-1 and BERT P scores, for three out of four participants, demonstrating the method’s effectiveness. Additionally, we demonstrate that combining data from multiple participants enhances the decoder performance, with statistically significant improvements in BERT scores for two participants. Furthermore, we demonstrate significantly above-chance decoding accuracy for imagined speech versus resting conditions. The activated brain regions identified during imagined speech tasks in our study are consistent with previous studies on brain regions involved in speech encoding. This study underscores the feasibility of continuous imagined speech decoding. By integrating high-density fNIRS with advanced AI techniques, we highlight the potential for non-invasive, accurate communication systems with AI in the near future.

[AI-107] MindGPT: Advancing Human-AI Interaction with Non-Invasive fNIRS-Based Imagined Speech Decoding

Link: https://arxiv.org/abs/2408.05361
Authors: Suyi Zhang, Ekram Alam, Jack Baber, Francesca Bianco, Edward Turner, Maysam Chamanzar, Hamid Dehghani
Keywords: artificial intelligence systems, coming decade, artificial intelligence, set to revolutionise, revolutionise every industry
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)

Abstract:In the coming decade, artificial intelligence systems are set to revolutionise every industry and facet of human life. Building communication systems that enable seamless and symbiotic communication between humans and AI agents is increasingly important. This research advances the field of human-AI interaction by developing an innovative approach to decode imagined speech using non-invasive high-density functional near-infrared spectroscopy (fNIRS). Notably, this study introduces MindGPT, the first thought-to-LLM (large language model) system in the world.

[AI-108] SHIELD: LLM-Driven Schema Induction for Predictive Analytics in EV Battery Supply Chain Disruptions

Link: https://arxiv.org/abs/2408.05357
Authors: Zhi-Qi Cheng, Yifei Dong, Aike Shi, Wei Liu, Yuzhi Hu, Jason O’Connor, Alexander Hauptmann, Kate Whitefoot
Keywords: advanced predictive analytics, necessitates advanced predictive, battery supply chain, Schema-based Hierarchical Induction, supply chain vulnerability
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 30 pages, 11 figures, Project: this https URL

Abstract:The electric vehicle (EV) battery supply chain’s vulnerability to disruptions necessitates advanced predictive analytics. We present SHIELD (Schema-based Hierarchical Induction for EV supply chain Disruption), a system integrating Large Language Models (LLMs) with domain expertise for EV battery supply chain risk assessment. SHIELD combines: (1) LLM-driven schema learning to construct a comprehensive knowledge library, (2) a disruption analysis system utilizing fine-tuned language models for event extraction, multi-dimensional similarity matching for schema matching, and Graph Convolutional Networks (GCNs) with logical constraints for prediction, and (3) an interactive interface for visualizing results and incorporating expert feedback to enhance decision-making. Evaluated on 12,070 paragraphs from 365 sources (2022-2023), SHIELD outperforms baseline GCNs and LLM+prompt methods (e.g., GPT-4o) in disruption prediction. These results demonstrate SHIELD’s effectiveness in combining LLM capabilities with domain expertise for enhanced supply chain risk assessment.

[AI-109] Trusting Your AI Agent Emotionally and Cognitively: Development and Validation of a Semantic Differential Scale for AI Trust

Link: https://arxiv.org/abs/2408.05354
Authors: Ruoxi Shang, Gary Hsieh, Chirag Shah
Keywords: research in human-AI, human-AI interactions, interactions has primarily, primarily focused, Trust
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)

Abstract: Trust is not just a cognitive issue but also an emotional one, yet research in human-AI interaction has primarily focused on the cognitive route of trust development. Recent work has highlighted the importance of studying affective trust towards AI, especially in the context of emerging human-like LLM-powered conversational agents. However, there is a lack of validated and generalizable measures for the two-dimensional construct of trust in AI agents. To address this gap, we developed and validated a set of 27-item semantic differential scales for affective and cognitive trust through a scenario-based survey study. We then further validated and applied the scale through an experimental study. Our empirical findings showed how the emotional and cognitive aspects of trust interact with each other and collectively shape a person’s overall trust in AI agents. Our study methodology and findings also provide insights into the capability of state-of-the-art LLMs to foster trust through different routes.

[AI-110] Explainable AI Reloaded: Challenging the XAI Status Quo in the Era of Large Language Models

Link: https://arxiv.org/abs/2408.05345
Authors: Upol Ehsan, Mark O. Riedl
Keywords: Large Language Models, vision of Explainable, initial vision, popular framing, Language Models
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: Accepted to ACM HTTF 2024

Abstract: When the initial vision of Explainable AI (XAI) was articulated, the most popular framing was to open the (proverbial) “black-box” of AI so that we could understand the inner workings. With the advent of Large Language Models (LLMs), the very ability to open the black-box is increasingly limited, especially when it comes to non-AI expert end-users. In this paper, we challenge the assumption of “opening” the black-box in the LLM era and argue for a shift in our XAI expectations. Highlighting the epistemic blind spots of an algorithm-centered XAI view, we argue that a human-centered perspective can be a path forward. We operationalize the argument by synthesizing XAI research along three dimensions: explainability outside the black-box, explainability around the edges of the black-box, and explainability that leverages infrastructural seams. We conclude with takeaways that reflexively inform XAI as a domain.

[AI-111] CAR: Contrast-Agnostic Deformable Medical Image Registration with Contrast-Invariant Latent Regularization

Link: https://arxiv.org/abs/2408.05341
Authors: Yinsong Wang, Siyi Du, Shaoming Zheng, Xinzhe Luo, Chen Qin
Keywords: complex intensity relationships, Multi-contrast image registration, challenging task due, Multi-contrast image, challenging task
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 12 pages, 3 figures, 3 tables, accepted by WBIR 2024

Abstract: Multi-contrast image registration is a challenging task due to the complex intensity relationships between different imaging contrasts. Conventional image registration methods are typically based on iterative optimizations for each input image pair, which is time-consuming and sensitive to contrast variations. While learning-based approaches are much faster during the inference stage, due to generalizability issues, they typically can only be applied to the fixed contrasts observed during the training stage. In this work, we propose a novel contrast-agnostic deformable image registration framework that can be generalized to arbitrary contrast images, without observing them during training. Particularly, we propose a random convolution-based contrast augmentation scheme, which simulates arbitrary contrasts of images over a single image contrast while preserving their inherent structural information. To ensure that the network can learn contrast-invariant representations for facilitating contrast-agnostic registration, we further introduce contrast-invariant latent regularization (CLR) that regularizes representation in latent space through a contrast invariance loss. Experiments show that CAR outperforms the baseline approaches regarding registration accuracy and also possesses better generalization ability to unseen imaging contrasts. Code is available at this https URL.

[AI-112] VACoDe: Visual Augmented Contrastive Decoding

Link: https://arxiv.org/abs/2408.05337
Authors: Sihyeon Kim, Boryeong Cho, Sangmin Bae, Sumyeong Ahn, Se-Young Yun
Keywords: generate inaccurate responses, recent Large Vision-Language, recent Large, Large Vision-Language Models, inaccurate responses
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages, 7 figures

Abstract: Despite the astonishing performance of recent Large Vision-Language Models (LVLMs), these models often generate inaccurate responses. To address this issue, previous studies have focused on mitigating hallucinations by employing contrastive decoding (CD) with augmented images, which amplifies the contrast with the original image. However, these methods have limitations, including reliance on a single augmentation, which is restrictive for certain tasks, as well as the high cost of using external knowledge. In this study, we address these limitations by exploring how to utilize multiple image augmentations. Through extensive experiments, we observed that different augmentations produce varying levels of contrast depending on the task. Based on this observation, we introduce a novel method called VACoDe, Visual Augmented Contrastive Decoding. This method adaptively selects the augmentation with the highest contrast for each task using the proposed softmax distance metric. Our empirical tests show that VACoDe outperforms previous methods and improves output quality in various vision-language tasks. Additionally, VACoDe can be universally applied across different model types and sizes without additional training or the use of external models and data.
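
The selection step described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not define the softmax distance metric, so the L2 distance between softmax outputs is used here as an assumption, and the contrastive step uses a commonly seen form, (1 + alpha) * original - alpha * contrast. All function names, the `alpha` parameter, and the toy logits are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def l2_distance(p, q):
    """Assumed stand-in for the paper's softmax distance metric."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def select_augmentation(original_logits, augmented_logits):
    """Pick the augmentation whose output distribution differs most from the original."""
    p = softmax(original_logits)
    return max(
        augmented_logits,
        key=lambda name: l2_distance(p, softmax(augmented_logits[name])),
    )

def contrastive_logits(original_logits, contrast_logits, alpha=1.0):
    """Amplify the original logits against the logits from the augmented input."""
    return [(1 + alpha) * o - alpha * c
            for o, c in zip(original_logits, contrast_logits)]

# Toy next-token logits over a three-token vocabulary (made-up numbers).
original = [2.0, 1.0, 0.5]
augmented = {"flip": [1.9, 1.1, 0.5], "noise": [0.5, 1.0, 2.0]}

best = select_augmentation(original, augmented)
adjusted = contrastive_logits(original, augmented[best])
print(best, adjusted)
```

In this toy case the "noise" augmentation changes the output distribution far more than "flip", so it is the one selected for contrast.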

[AI-113] Logically Constrained Robotics Transformers for Enhanced Perception-Action Planning

Link: https://arxiv.org/abs/2408.05336
Authors: Parv Kapoor, Sai Vemprala, Ashish Kapoor
Keywords: stakeholder intent, model based planning, advent of large, ensure their output, output aligns
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Robotics Science and Systems: Towards Safe Autonomy

Abstract: With the advent of large foundation model based planning, there is a dire need to ensure that their output aligns with the stakeholder’s intent. When these models are deployed in the real world, the need for alignment is magnified because unexpected failures can be costly to life and infrastructure. Temporal Logic specifications have long provided a way to constrain system behaviors and are a natural fit for these use cases. In this work, we propose a novel approach to factor in signal temporal logic specifications while using autoregressive transformer models for trajectory planning. We also provide a trajectory dataset for pretraining and evaluating foundation models. Our proposed technique achieves 74.3% higher specification satisfaction over the baselines.
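
The abstract does not say how specification satisfaction is scored. As a minimal sketch, assuming the standard quantitative (robustness) semantics of signal temporal logic, a positive score means the trajectory satisfies the formula and a negative score means it violates it. The function names and the example trajectory below are hypothetical and are not taken from the paper.

```python
def always_below(signal, bound):
    """Robustness of 'always (x < bound)': the worst-case margin over time."""
    return min(bound - x for x in signal)

def eventually_above(signal, threshold):
    """Robustness of 'eventually (x > threshold)': the best-case margin over time."""
    return max(x - threshold for x in signal)

def conjunction(*robustness_values):
    """Robustness of a conjunction is the minimum of its parts."""
    return min(robustness_values)

# Hypothetical speed profile of a planned trajectory (made-up numbers).
speed = [0.5, 1.0, 1.5]

safe = always_below(speed, 2.0)          # never exceed the speed limit
progress = eventually_above(speed, 1.0)  # reach cruising speed at some point
overall = conjunction(safe, progress)
print(safe, progress, overall)
```

A planner could use such a score to rank or filter candidate trajectories; how the paper integrates the specification into the transformer is not described in the abstract.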

[AI-114] Revisiting Multi-Modal LLM Evaluation

Link: https://arxiv.org/abs/2408.05334
Authors: Jian Lu, Shikhar Srivastava, Junyu Chen, Robik Shrestha, Manoj Acharya, Kushal Kafle, Christopher Kanan
Keywords: large language models, multi-modal large language, referring expression comprehension, visual question answering, language models
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

[AI-115] Neural Machine Unranking

Link: https://arxiv.org/abs/2408.05330
Authors: Jingrui Hou, Axel Finke, Georgina Cosma
Keywords: termed Neural Machine, Neural Machine UnRanking, neural information retrieval, information retrieval, Machine UnRanking
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract:We tackle the problem of machine unlearning within neural information retrieval, termed Neural Machine UnRanking (NuMuR) for short. Many of the mainstream task- or model-agnostic approaches for machine unlearning were designed for classification tasks. First, we demonstrate that these methods perform poorly on NuMuR tasks due to the unique challenges posed by neural information retrieval. Then, we develop a methodology for NuMuR named Contrastive and Consistent Loss (CoCoL), which effectively balances the objectives of data forgetting and model performance retention. Experimental results demonstrate that CoCoL facilitates more effective and controllable data removal than existing techniques.

[AI-116] From Text to Insight: Leveraging Large Language Models for Performance Evaluation in Management

Link: https://arxiv.org/abs/2408.05328
Authors: Ning Li, Huaikang Zhou, Mingze Xu
Keywords: Large Language Models, Language Models, Large Language, organizational task performance, enhance objectivity
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); General Economics (econ.GN)
Comments: 39 pages, 8 figures, 5 tables

[AI-117] A Recurrent YOLOv8-based framework for Event-Based Object Detection

Link: https://arxiv.org/abs/2408.05321
Authors: Diego A. Silva, Kamilya Smagulova, Ahmed Elsheikh, Mohammed E. Fouda, Ahmed M. Eltawil
Keywords: conventional frame-based RGB, frame-based RGB sensors, RGB sensors, primarily relying, frame-based RGB
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract: Object detection is crucial in various cutting-edge applications, such as autonomous vehicles and advanced robotics systems, primarily relying on data from conventional frame-based RGB sensors. However, these sensors often struggle with issues like motion blur and poor performance in challenging lighting conditions. In response to these challenges, event-based cameras have emerged as an innovative paradigm. These cameras, mimicking the human eye, demonstrate superior performance in environments with fast motion and extreme lighting conditions while consuming less power. This study introduces ReYOLOv8, an advanced object detection framework that enhances a leading frame-based detection system with spatiotemporal modeling capabilities. We implemented a low-latency, memory-efficient method for encoding event data to boost the system’s performance. We also developed a novel data augmentation technique tailored to leverage the unique attributes of event data, thus improving detection accuracy. Our models outperformed all comparable approaches on the GEN1 dataset, which focuses on automotive applications, achieving mean Average Precision (mAP) improvements of 5%, 2.8%, and 2.5% across nano, small, and medium scales, respectively. These enhancements were achieved while reducing the number of trainable parameters by an average of 4.43% and maintaining real-time processing speeds between 9.2ms and 15.5ms. On the PEDRo dataset, which targets robotics applications, our models showed mAP improvements ranging from 9% to 18%, with 14.5x and 3.8x smaller models and an average speed enhancement of 1.67x.

[AI-118] rule4ml: An Open-Source Tool for Resource Utilization and Latency Estimation for ML Models on FPGA

Link: https://arxiv.org/abs/2408.05314
Authors: Mohammad Mehdi Rahimifar, Hamza Ezzaoui Rahali, Audrey C. Therrien
Keywords: Implementing Machine Learning, Field-Programmable Gate Arrays, Implementing Machine, Machine Learning, Gate Arrays
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract: Implementing Machine Learning (ML) models on Field-Programmable Gate Arrays (FPGAs) is becoming increasingly popular across various domains as a low-latency and low-power solution that helps manage large data rates generated by continuously improving detectors. However, developing ML models for FPGAs is time-consuming, as optimization requires synthesis to evaluate FPGA area and latency, making the process slow and repetitive. This paper introduces a novel method to predict the resource utilization and inference latency of Neural Networks (NNs) before their synthesis and implementation on FPGA. We leverage HLS4ML, a tool-flow that helps translate NNs into high-level synthesis (HLS) code, to synthesize a diverse dataset of NN architectures and train resource utilization and inference latency predictors. While HLS4ML requires full synthesis to obtain resource and latency insights, our method uses trained regression models for immediate pre-synthesis predictions. The prediction models estimate the usage of Block RAM (BRAM), Digital Signal Processors (DSP), Flip-Flops (FF), and Look-Up Tables (LUT), as well as the inference clock cycles. The predictors were evaluated on both synthetic and existing benchmark architectures and demonstrated high accuracy, with R² scores ranging between 0.8 and 0.98 on the validation set and sMAPE values between 10% and 30%. Overall, our approach provides valuable preliminary insights, enabling users to quickly assess the feasibility and efficiency of NNs on FPGAs, accelerating the development and deployment processes. The open-source repository can be found at this https URL, while the datasets are publicly available at this https URL.
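
For readers unfamiliar with the two metrics quoted in this abstract, the sketch below computes the coefficient of determination and one common definition of sMAPE. sMAPE has several variants and the abstract does not say which one the authors use, so the formula here is an assumption; the resource counts are made-up numbers and not results from the paper.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, in percent (one common variant)."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        denom = (abs(t) + abs(p)) / 2
        if denom > 0:  # a pair of zeros contributes no error
            total += abs(p - t) / denom
    return 100 * total / len(y_true)

# Hypothetical LUT counts: values from synthesis vs. values from a predictor.
synthesized = [100, 200, 300]
predicted = [110, 190, 300]

r2 = r2_score(synthesized, predicted)
err = smape(synthesized, predicted)
print(round(r2, 3), round(err, 2))
```

An R² close to 1 means the predictor explains most of the variance, while sMAPE reports the typical relative error, so the two numbers answer different questions about the same predictions.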

[AI-119] The impact of internal variability on benchmarking deep learning climate emulators

Link: https://arxiv.org/abs/2408.05288
Authors: Björn Lütjens, Raffaele Ferrari, Duncan Watson-Parris, Noelle Selin
Keywords: Full-complexity Earth system, Full-complexity Earth, Earth system models, Earth system, computationally very expensive
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Full-complexity Earth system models (ESMs) are computationally very expensive, limiting their use in exploring the climate outcomes of multiple emission pathways. More efficient emulators that approximate ESMs can directly map emissions onto climate outcomes, and benchmarks are being used to evaluate their accuracy on standardized tasks and datasets. We investigate a popular benchmark in data-driven climate emulation, ClimateBench, on which deep learning-based emulators are currently achieving the best performance. We implement a linear regression-based emulator, akin to pattern scaling, and find that it outperforms the incumbent 100M-parameter deep learning foundation model, ClimaX, on 3 out of 4 regionally-resolved surface-level climate variables. While emulating surface temperature is expected to be predominantly linear, this result is surprising for emulating precipitation. We identify that this outcome is a result of high levels of internal variability in the benchmark targets. To address internal variability, we update the benchmark targets with ensemble averages from the MPI-ESM1.2-LR model that contain 50 instead of 3 climate simulations per emission pathway. Using the new targets, we show that linear pattern scaling continues to be more accurate on temperature, but can be outperformed by a deep learning-based model for emulating precipitation. We publish our code, data, and an interactive tutorial at this http URL.
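
As a rough illustration of what a linear, pattern-scaling-style emulator looks like, the sketch below fits an independent least-squares line per grid cell against a single global predictor. It assumes the classical form of pattern scaling (local value regressed on global mean temperature); the abstract only says the authors' emulator is "akin to pattern scaling", so the predictor choice, the function names, and the toy numbers are all assumptions.

```python
def fit_pattern(global_t, local_fields):
    """Fit local[c] ~ slope[c] * global_t + intercept[c] for every grid cell c.

    global_t: list of T global-mean values (one per time step)
    local_fields: list of T lists, each holding one value per grid cell
    """
    n = len(global_t)
    g_mean = sum(global_t) / n
    g_var = sum((g - g_mean) ** 2 for g in global_t)
    slopes, intercepts = [], []
    for c in range(len(local_fields[0])):
        y = [field[c] for field in local_fields]
        y_mean = sum(y) / n
        cov = sum((g - g_mean) * (v - y_mean) for g, v in zip(global_t, y))
        slope = cov / g_var
        slopes.append(slope)
        intercepts.append(y_mean - slope * g_mean)
    return slopes, intercepts


def predict(slopes, intercepts, g):
    """Emulate the local field for a new global-mean value g."""
    return [a * g + b for a, b in zip(slopes, intercepts)]


# Toy data: three time steps, two grid cells.
slopes, intercepts = fit_pattern([0.0, 1.0, 2.0],
                                 [[0.0, 1.0], [2.0, 1.5], [4.0, 2.0]])
print(predict(slopes, intercepts, 3.0))
```

Because each cell has only two parameters, noise in the targets (the internal variability discussed in the abstract) cannot be fitted, which is one plausible reading of why such a simple model scores well when targets are averaged over few ensemble members.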

[AI-120] Semi-Supervised One-Shot Imitation Learning

Link: https://arxiv.org/abs/2408.05285
Authors: Philipp Wu, Kourosh Hakhamaneshi, Yuqing Du, Igor Mordatch, Aravind Rajeswaran, Pieter Abbeel
Keywords: One-shot Imitation Learning, One-shot Imitation, Imitation Learning, OSIL, aims to imbue
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes:

Abstract:One-shot Imitation Learning (OSIL) aims to imbue AI agents with the ability to learn a new task from a single demonstration. To supervise the learning, OSIL typically requires a prohibitively large number of paired expert demonstrations – i.e. trajectories corresponding to different variations of the same semantic task. To overcome this limitation, we introduce the semi-supervised OSIL problem setting, where the learning agent is presented with a large dataset of trajectories with no task labels (i.e. an unpaired dataset), along with a small dataset of multiple demonstrations per semantic task (i.e. a paired dataset). This presents a more realistic and practical embodiment of few-shot learning and requires the agent to effectively leverage weak supervision from a large dataset of trajectories. Subsequently, we develop an algorithm specifically applicable to this semi-supervised OSIL setting. Our approach first learns an embedding space where different tasks cluster uniquely. We utilize this embedding space and the clustering it supports to self-generate pairings between trajectories in the large unpaired dataset. Through empirical results on simulated control tasks, we demonstrate that OSIL models trained on such self-generated pairings are competitive with OSIL models trained with ground-truth labels, presenting a major advancement in the label-efficiency of OSIL.

[AI-121] Can a Bayesian Oracle Prevent Harm from an Agent?

Link: https://arxiv.org/abs/2408.05284
Authors: Yoshua Bengio, Michael K. Cohen, Nikolay Malkin, Matt MacDermott, Damiano Fornasiere, Pietro Greiner, Younesse Kaddar
Keywords: machine learning methods, satisfy probabilistic safety, design powerful, powerful AI systems, systems based
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes:

Abstract:Is there a way to design powerful AI systems based on machine learning methods that would satisfy probabilistic safety guarantees? With the long-term goal of obtaining a probabilistic guarantee that would apply in every context, we consider estimating a context-dependent bound on the probability of violating a given safety specification. Such a risk evaluation would need to be performed at run-time to provide a guardrail against dangerous actions of an AI. Noting that different plausible hypotheses about the world could produce very different outcomes, and because we do not know which one is right, we derive bounds on the safety violation probability predicted under the true but unknown hypothesis. Such bounds could be used to reject potentially dangerous actions. Our main results involve searching for cautious but plausible hypotheses, obtained by a maximization that involves Bayesian posteriors over hypotheses. We consider two forms of this result, in the iid case and in the non-iid case, and conclude with open problems towards turning such theoretical results into practical AI guardrails.

[AI-122] A Systematic Literature Map on Big Data

Link: https://arxiv.org/abs/2408.05253
Authors: Rogerio Rossi, Kechi Hirama, Eduardo Ferreira Franco
Keywords: Big Data paradigm, Big Data, government services, solid field, Data paradigm
Categories: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Notes: 8 pages, 1 figure, 5 tables

Abstract:The paradigm of Big Data has been established as a solid field of study in many areas such as healthcare, science, transport, education, and government services, among others. Despite being widely discussed, the paradigm has no agreed definition, although many concepts have been proposed by academia and industry. This work aims to provide an analytical view of the studies conducted and published regarding the Big Data paradigm. The approach used is the systematic map of the literature, combining bibliometric analysis and content analysis to depict the panorama of research works, identifying patterns, trends, and gaps. The results indicate that there is still a long way to go, both in research and in concepts, such as building and defining adequate infrastructures and standards, to meet future challenges and for the paradigm to become effective and bring the expected benefits.

[AI-123] Advancing oncology with federated learning: transcending boundaries in breast, lung, and prostate cancer. A systematic review

Link: https://arxiv.org/abs/2408.05249
Authors: Anshu Ankolekar, Sebastian Boie, Maryam Abdollahyan, Emanuela Gadaleta, Seyed Alireza Hasheminasab, Guang Yang, Charles Beauville, Nikolaos Dikaios, George Anthony Kastis, Michael Bussmann, Sara Khalid, Hagen Kruger, Philippe Lambin, Giorgos Papanastasiou
Keywords: Federated Learning, centralised machine learning, machine learning, overcoming privacy concerns, promising solution
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Image and Video Processing (eess.IV)
Notes: 5 Figures, 3 Tables, 1 Supplementary Table

Abstract:Federated Learning (FL) has emerged as a promising solution to address the limitations of centralised machine learning (ML) in oncology, particularly in overcoming privacy concerns and harnessing the power of diverse, multi-center data. This systematic review synthesises current knowledge on the state-of-the-art FL in oncology, focusing on breast, lung, and prostate cancer. Distinct from previous surveys, our comprehensive review critically evaluates the real-world implementation and impact of FL on cancer care, demonstrating its effectiveness in enhancing ML generalisability, performance and data privacy in clinical settings and data. We evaluated state-of-the-art advances in FL, demonstrating its growing adoption amid tightening data privacy regulations. FL outperformed centralised ML in 15 out of the 25 studies reviewed, spanning diverse ML models and clinical applications, and facilitating integration of multi-modal information for precision medicine. Despite the current challenges identified in reproducibility, standardisation and methodology across studies, the demonstrable benefits of FL in harnessing real-world data and addressing clinical needs highlight its significant potential for advancing cancer research. We propose that future research should focus on addressing these limitations and investigating further advanced FL methods, to fully harness data diversity and realise the transformative power of cutting-edge FL in cancer care.

[AI-124] The Role and Applications of Airport Digital Twin in Cyberattack Protection during the Generative AI Era

Link: https://arxiv.org/abs/2408.05248
Authors: Abraham Itzhak Weinberg
Keywords: increasingly sophisticated cyberattacks, recent years, threat facing airports, growing and increasingly, increasingly sophisticated
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Notes:

Abstract:In recent years, the threat facing airports from growing and increasingly sophisticated cyberattacks has become evident. Airports are considered a strategic national asset, so protecting them from attacks, specifically cyberattacks, is a crucial mission. One way to increase airports’ security is by using Digital Twins (DTs). This paper shows and demonstrates how DTs can enhance the security mission. The integration of DTs with Generative AI (GenAI) algorithms can lead to synergy and new frontiers in fighting cyberattacks. The paper exemplifies ways to model cyberattack scenarios using simulations and generate synthetic data for testing defenses. It also discusses how DTs can be used as a crucial tool for vulnerability assessment by identifying weaknesses, prioritizing, and accelerating remediations in case of cyberattacks. Moreover, the paper demonstrates approaches for anomaly detection and threat hunting using Machine Learning (ML) and GenAI algorithms. Additionally, the paper provides impact prediction and recovery coordination methods that can be used by DT operators and stakeholders. It also introduces ways to harness the human factor by integrating training and simulation algorithms with Explainable AI (XAI) into the DT platforms. Lastly, the paper offers future applications and technologies that can be utilized in DT environments.

[AI-125] Early-Exit meets Model-Distributed Inference at Edge Networks

Link: https://arxiv.org/abs/2408.05247
Authors: Marco Colocrese, Erdem Koyuncu, Hulya Seferoglu
Keywords: Distributed inference techniques, broadly classified, data, layers, inference techniques
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Notes:

Abstract:Distributed inference techniques can be broadly classified into data-distributed and model-distributed schemes. In data-distributed inference (DDI), each worker carries the entire deep neural network (DNN) model but processes only a subset of the data. However, feeding the data to workers results in high communication costs, especially when the data is large. An emerging paradigm is model-distributed inference (MDI), where each worker carries only a subset of DNN layers. In MDI, a source device that has data processes a few layers of DNN and sends the output to a neighboring device, i.e., offloads the rest of the layers. This process ends when all layers are processed in a distributed manner. In this paper, we investigate the design and development of MDI with early-exit, which advocates that there is no need to process all the layers of a model for some data to reach the desired accuracy, i.e., we can exit the model without processing all the layers if target accuracy is reached. We design a framework MDI-Exit that adaptively determines early-exit and offloading policies as well as data admission at the source. Experimental results on a real-life testbed of NVIDIA Nano edge devices show that MDI-Exit processes more data when accuracy is fixed and results in higher accuracy for the fixed data rate.

[AI-126] Differentially Private Data Release on Graphs: Inefficiencies and Unfairness

Link: https://arxiv.org/abs/2408.05246
Authors: Ferdinando Fioretto, Diptangshu Sen, Juba Ziani
Keywords: transportation. The information carried, sensitive user data, Google location data, data, location data
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: 32 pages

Abstract:Networks are crucial components of many sectors, including telecommunications, healthcare, finance, energy, and transportation. The information carried in such networks often contains sensitive user data, like location data for commuters and packet data for online users. Therefore, when considering data release for networks, one must ensure that data release mechanisms do not leak information about individuals, quantified in a precise mathematical sense. Differential Privacy (DP) is the widely accepted, formal, state-of-the-art technique, which has found use in a variety of real-life settings including the 2020 U.S. Census, Apple users’ device data, or Google’s location data. Yet, the use of DP comes with new challenges, as the noise added for privacy introduces inaccuracies or biases and further, DP techniques can also distribute these biases disproportionately across different populations, inducing fairness issues. The goal of this paper is to characterize the impact of DP on bias and unfairness in the context of releasing information about networks, taking a departure from previous work which has studied these effects in the context of private population counts release (such as in the U.S. Census). To this end, we consider a network release problem where the network structure is known to all, but the weights on edges must be released privately. We consider the impact of this private release on a simple downstream decision-making task run by a third-party, which is to find the shortest path between any two pairs of nodes and recommend the best route to users. This setting is of highly practical relevance, mirroring scenarios in transportation networks, where preserving privacy while providing accurate routing information is crucial. Our work provides theoretical foundations and empirical evidence into the bias and unfairness arising due to privacy in these networked decision problems.
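
The setting in this abstract (public graph structure, privately released edge weights, shortest-path routing downstream) can be sketched as follows. The sketch assumes the Laplace mechanism with noise scale sensitivity/epsilon and clamps noisy weights at zero so Dijkstra's algorithm stays valid; the abstract does not name the mechanism the authors analyze, and `sensitivity=1.0`, the seed, and the toy graph are hypothetical.

```python
import heapq
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def privatize(weights, epsilon, sensitivity=1.0, seed=0):
    """Release noisy edge weights; clamping at 0 is post-processing."""
    rng = random.Random(seed)
    scale = sensitivity / epsilon
    return {e: max(0.0, w + laplace_noise(scale, rng)) for e, w in weights.items()}


def shortest_path(weights, source, target):
    """Dijkstra on an undirected graph given as {(u, v): weight}; assumes connectivity."""
    adj = {}
    for (u, v), w in weights.items():
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    dist, prev, heap = {source: 0.0}, {}, [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj.get(u, []):
            if d + w < dist.get(v, float("inf")):
                dist[v], prev[v] = d + w, u
                heapq.heappush(heap, (d + w, v))
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]


def path_cost(weights, path):
    """Cost of a path under the given (true or noisy) weights."""
    return sum(weights.get((u, v), weights.get((v, u))) for u, v in zip(path, path[1:]))


true_w = {("a", "b"): 1.0, ("b", "c"): 1.0, ("a", "c"): 5.0}
noisy_route = shortest_path(privatize(true_w, epsilon=0.5), "a", "c")
best_route = shortest_path(true_w, "a", "c")
# Extra true cost paid because routing used the private release.
print(path_cost(true_w, noisy_route) - path_cost(true_w, best_route))
```

The printed gap is one simple way to quantify the inefficiency the abstract refers to; comparing that gap across different source and target pairs would be a first step toward the fairness question.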

[AI-127] Improved Adaboost Algorithm for Web Advertisement Click Prediction Based on Long Short-Term Memory Networks

Link: https://arxiv.org/abs/2408.05245
Authors: Qixuan Yu, Xirui Tang, Feiyang Li, Zinan Cao
Keywords: Short-Term Memory Networks, Long Short-Term Memory, Memory Networks, web page advertisements, improved Adaboost algorithm
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Notes:

Abstract:This paper explores an improved Adaboost algorithm based on Long Short-Term Memory Networks (LSTMs), which aims to improve the prediction accuracy of user clicks on web page advertisements. By comparing it with several common machine learning algorithms, the paper analyses the advantages of the new model in ad click prediction. It is shown that the improved algorithm proposed in this paper performs well in user ad click prediction with an accuracy of 92%, which is an improvement of 13.6% compared to the highest of 78.4% among the other three base models. This significant improvement indicates that the algorithm is more capable of capturing user behavioural characteristics and time series patterns. In addition, this paper evaluates the model’s performance on other performance metrics, including accuracy, recall, and F1 score. The results show that the improved Adaboost algorithm based on LSTM is significantly ahead of the traditional model in all these metrics, which further validates its effectiveness and superiority. Especially when facing complex and dynamically changing user behaviours, the model is able to better adapt and make accurate predictions. In order to ensure the practicality and reliability of the model, this study also focuses on the accuracy difference between the training set and the test set. After validation, the accuracy of the proposed model on these two datasets only differs by 1.7%, which is a small difference indicating that the model has good generalisation ability and can be effectively applied to real-world scenarios.

[AI-128] Large Model Strategic Thinking, Small Model Efficiency: Transferring Theory of Mind in Large Language Models

Link: https://arxiv.org/abs/2408.05241
Authors: Nunzio Lore, Alireza (Sepehr) Ilami, Babak Heydari
Keywords: models increases commensurately, art models increases, newer Large Language, Language Models continues, increases commensurately
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Emerging Technologies (cs.ET); Computer Science and Game Theory (cs.GT)
Notes: 18 pages, 6 figures

[AI-129] The Literature Review Network: An Explainable Artificial Intelligence for Systematic Literature Reviews, Meta-analyses, and Method Development

Link: https://arxiv.org/abs/2408.05239
Authors: Joshua Morriss, Tod Brindle, Jessica Bah Rösman, Daniel Reibsamen, Andreas Enz
Keywords: LRN, Systematic literature reviews, Literature Review Network, quality of evidence, review
Categories: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI)
Notes: 12 pages, 4 figures, 10 tables

Abstract:Systematic literature reviews are the highest quality of evidence in research. However, the review process is hindered by significant resource and data constraints. The Literature Review Network (LRN) is the first of its kind explainable AI platform adhering to PRISMA 2020 standards, designed to automate the entire literature review process. LRN was evaluated in the domain of surgical glove practices using 3 search strings developed by experts to query PubMed. A non-expert trained all LRN models. Performance was benchmarked against an expert manual review. Explainability and performance metrics assessed LRN’s ability to replicate the experts’ review. Concordance was measured with the Jaccard index and confusion matrices. Researchers were blinded to the other’s results until study completion. Overlapping studies were integrated into an LRN-generated systematic review. LRN models demonstrated superior classification accuracy without expert training, achieving 84.78% and 85.71% accuracy. The highest performance model achieved high interrater reliability (k = 0.4953) and explainability metrics, linking ‘reduce’, ‘accident’, and ‘sharp’ with ‘double-gloving’. Another LRN model covered 91.51% of the relevant literature despite diverging from the non-expert’s judgments (k = 0.2174), with the terms ‘latex’, ‘double’ (gloves), and ‘indication’. LRN outperformed the manual review (19,920 minutes over 11 months), reducing the entire process to 288.6 minutes over 5 days. This study demonstrates that explainable AI does not require expert training to successfully conduct PRISMA-compliant systematic literature reviews like an expert. LRN summarized the results of surgical glove studies and identified themes that were nearly identical to the clinical researchers’ findings. Explainable AI can accurately expedite our understanding of clinical practices, potentially revolutionizing healthcare research.
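
For readers unfamiliar with the concordance measures mentioned in this abstract, here is a minimal sketch of the Jaccard index between two sets of included studies and of an agreement coefficient between two raters. It assumes the reported "k" is Cohen's kappa, which the abstract does not state explicitly; the study identifiers and include/exclude labels are made up.

```python
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| between two collections of study IDs."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two raters over the same items.

    Undefined (division by zero) when chance agreement p_e equals 1.
    """
    n = len(labels_a)
    p_o = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    p_e = sum((labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories)
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical screening decisions (1 = include, 0 = exclude).
model = [1, 1, 0, 0]
expert = [1, 0, 0, 0]
print(jaccard({"s1", "s2", "s3"}, {"s2", "s3", "s4"}), cohen_kappa(model, expert))
```

Kappa corrects raw agreement for the agreement expected by chance, so it can be much lower than plain accuracy when one label dominates.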

[AI-130] Biomimetic Machine Learning approach for prediction of mechanical properties of Additive Friction Stir Deposited Aluminum alloys based walled structures

Link: https://arxiv.org/abs/2408.05237
Authors: Akshansh Mishra
Keywords: Friction Stir Deposited, Additive Friction Stir, Stir Deposited, Additive Friction, Friction Stir
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Notes: 26 pages, 14 figures, 6 tables

Abstract:This study presents a novel approach to predicting mechanical properties of Additive Friction Stir Deposited (AFSD) aluminum alloy walled structures using biomimetic machine learning. The research combines numerical modeling of the AFSD process with genetic algorithm-optimized machine learning models to predict von Mises stress and logarithmic strain. Finite element analysis was employed to simulate the AFSD process for five aluminum alloys: AA2024, AA5083, AA5086, AA7075, and AA6061, capturing complex thermal and mechanical interactions. A dataset of 200 samples was generated from these simulations. Subsequently, Decision Tree (DT) and Random Forest (RF) regression models, optimized using genetic algorithms, were developed to predict key mechanical properties. The GA-RF model demonstrated superior performance in predicting both von Mises stress (R² = 0.9676) and logarithmic strain (R² = 0.7201). This innovative approach provides a powerful tool for understanding and optimizing the AFSD process across multiple aluminum alloys, offering insights into material behavior under various process parameters.

[AI-131] SLO-aware GPU Frequency Scaling for Energy Efficient LLM Inference Serving

Link: https://arxiv.org/abs/2408.05235
Authors: Andreas Kosmas Kakolyris, Dimosthenis Masouros, Petros Vavaroutsos, Sotirios Xydis, Dimitrios Soudris
Keywords: Large Language Models, Large Language, power-hungry GPUs places, GPUs places ever-increasing, ever-increasing energy demands
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Notes:

Abstract:As Large Language Models (LLMs) gain traction, their reliance on power-hungry GPUs places ever-increasing energy demands, raising environmental and monetary concerns. Inference dominates LLM workloads, presenting a critical challenge for providers: minimizing energy costs under Service-Level Objectives (SLOs) that ensure optimal user experience. In this paper, we present throttLL’eM, a framework that reduces energy consumption while meeting SLOs through the use of instance and GPU frequency scaling. throttLL’eM features mechanisms that project future KV cache usage and batch size. Leveraging a Machine-Learning (ML) model that receives these projections as inputs, throttLL’eM manages performance at the iteration level to satisfy SLOs with reduced frequencies and instance sizes. We show that the proposed ML model achieves R² scores greater than 0.97 and mispredicts performance by less than 1 iteration per second on average. Experimental results on LLM inference traces show that throttLL’eM achieves up to 43.8% lower energy consumption and an energy efficiency improvement of at least 1.71× under SLOs, when compared to NVIDIA’s Triton server.

[AI-132] Large Language Model based Agent Framework for Electric Vehicle Charging Behavior Simulation

Link: https://arxiv.org/abs/2408.05233
Authors: Junkang Feng, Chenggang Cui, Chuanlin Zhang, Zizhu Fan
Keywords: LLM based agent, simulating electric vehicle, integrating user preferences, based agent framework, LLM based
Categories: Artificial Intelligence (cs.AI)
Notes: 7 pages, 3 figures

Abstract:This paper introduces a new LLM based agent framework for simulating electric vehicle (EV) charging behavior, integrating user preferences, psychological characteristics, and environmental factors to optimize the charging process. The framework comprises several modules, enabling sophisticated, adaptive simulations. Dynamic decision making is supported by continuous reflection and memory updates, ensuring alignment with user expectations and enhanced efficiency. The framework’s ability to generate personalized user profiles and real-time decisions offers significant advancements for urban EV charging management. Future work could focus on incorporating more intricate scenarios and expanding data sources to enhance predictive accuracy and practical utility.

[AI-133] Large Language Models for cross-language code clone detection

Link: https://arxiv.org/abs/2408.04430
Authors: Micheline Bénédicte Moumoula, Abdoul Kader Kabore, Jacques Klein, Tegawendé Bissyande
Keywords: code clone detection, modern software development, cross-lingual code clone, code clone, Large Language Models
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes:

Abstract:With the involvement of multiple programming languages in modern software development, cross-lingual code clone detection has gained traction with the software engineering community. Numerous studies have explored this topic, proposing various promising approaches. Inspired by the significant advances in machine learning in recent years, particularly Large Language Models (LLMs), which have demonstrated their ability to tackle various tasks, this paper revisits cross-lingual code clone detection. We investigate the capabilities of four (04) LLMs and eight (08) prompts for the identification of cross-lingual code clones. Additionally, we evaluate a pre-trained embedding model to assess the effectiveness of the generated representations for classifying clone and non-clone pairs. Both studies (based on LLMs and Embedding models) are evaluated using two widely used cross-lingual datasets, XLCoST and CodeNet. Our results show that LLMs can achieve high F1 scores, up to 0.98, for straightforward programming examples (e.g., from XLCoST). However, they not only perform less well on programs associated with complex programming challenges but also do not necessarily understand the meaning of code clones in a cross-lingual setting. We show that embedding models used to represent code fragments from different programming languages in the same representation space enable the training of a basic classifier that outperforms all LLMs by ~2 and ~24 percentage points on the XLCoST and CodeNet datasets, respectively. This finding suggests that, despite the apparent capabilities of LLMs, embeddings provided by embedding models offer suitable representations to achieve state-of-the-art performance in cross-lingual code clone detection. 

[AI-134] Blockchain and Artificial Intelligence: Synergies and Conflicts

Link: https://arxiv.org/abs/2405.13462
Authors: Leon Witt, Armando Teles Fortes, Kentaroh Toyoda, Wojciech Samek, Dan Li
Keywords: Artificial Intelligence, technology and Artificial, respective domains, emerged as transformative, transformative forces
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Distributed, Parallel, and Cluster Computing (cs.DC)
Notes:

Abstract:Blockchain technology and Artificial Intelligence (AI) have emerged as transformative forces in their respective domains. This paper explores synergies and challenges between these two technologies. Our research analyses the biggest projects combining blockchain and AI, based on market capitalization, and derives a novel framework to categorize contemporary and future use cases. Despite the theoretical compatibility, current real-world applications combining blockchain and AI remain in their infancy.

[AI-135] Learning Delays in Spiking Neural Networks using Dilated Convolutions with Learnable Spacings

Link: https://arxiv.org/abs/2306.17670
Authors: Ilyass Hammouamri, Ismail Khalfaoui-Hassani, Timothée Masquelier
Keywords: Spiking Neural Networks, Neural Networks, promising research direction, building power-efficient information, Spiking Neural
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Notes:

Abstract:Spiking Neural Networks (SNNs) are a promising research direction for building power-efficient information processing systems, especially for temporal tasks such as speech recognition. In SNNs, delays refer to the time needed for one spike to travel from one neuron to another. These delays matter because they influence the spike arrival times, and it is well-known that spiking neurons respond more strongly to coincident input spikes. More formally, it has been shown theoretically that plastic delays greatly increase the expressivity in SNNs. Yet, efficient algorithms to learn these delays have been lacking. Here, we propose a new discrete-time algorithm that addresses this issue in deep feedforward SNNs using backpropagation, in an offline manner. To simulate delays between consecutive layers, we use 1D convolutions across time. The kernels contain only a few non-zero weights - one per synapse - whose positions correspond to the delays. These positions are learned together with the weights using the recently proposed Dilated Convolution with Learnable Spacings (DCLS). We evaluated our method on three datasets: the Spiking Heidelberg Dataset (SHD), the Spiking Speech Commands (SSC) and its non-spiking version Google Speech Commands v0.02 (GSC) benchmarks, which require detecting temporal patterns. We used feedforward SNNs with two or three hidden fully connected layers, and vanilla leaky integrate-and-fire neurons. We showed that fixed random delays help and that learning them helps even more. Furthermore, our method outperformed the state-of-the-art in the three datasets without using recurrent connections and with substantially fewer parameters. Our work demonstrates the potential of delay learning in developing accurate and precise models for temporal data processing. Our code is based on PyTorch / SpikingJelly and available at: this https URL
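
The idea of representing a synaptic delay as a 1D convolution over time, with a single non-zero kernel weight whose position encodes the delay, can be illustrated with a short sketch. It covers only a fixed integer delay for one synapse; the learnable, differentiable positions that DCLS provides are not implemented here, and the function names and `max_delay` parameter are illustrative assumptions.

```python
def delay_kernel(delay, weight, max_delay):
    """Kernel of length max_delay + 1 with one non-zero entry at index `delay`."""
    kernel = [0.0] * (max_delay + 1)
    kernel[delay] = weight
    return kernel


def causal_conv(spikes, kernel):
    """out[t] = sum_d kernel[d] * spikes[t - d], using only past inputs."""
    out = []
    for t in range(len(spikes)):
        out.append(sum(kernel[d] * spikes[t - d]
                       for d in range(len(kernel)) if t - d >= 0))
    return out


# A single input spike at t = 0 arrives two steps later, scaled by the weight.
spike_train = [1, 0, 0, 0, 0]
print(causal_conv(spike_train, delay_kernel(delay=2, weight=0.5, max_delay=3)))
```

Moving the non-zero entry within the kernel shifts the arrival time, which is why learning the entry's position amounts to learning the delay.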

[AI-136] ACCELERATION: Sequentially-scanning DECT Imaging Using High Temporal Resolution Image Reconstruction And Temporal Extrapolation

Link: https://arxiv.org/abs/2408.06163
Authors: Qiaoxin Li, Dong Liang, Yinsheng Li
Keywords: Dual-energy computed tomography, precise medical diagnosis, obtain quantitative elemental, quantitative elemental composition, Dual-energy computed
Categories: Medical Physics (physics.med-ph); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Instrumentation and Detectors (physics.ins-det)
Notes:

Abstract:Dual-energy computed tomography (DECT) has been widely used to obtain quantitative elemental composition of imaged subjects for personalized and precise medical diagnosis. Compared with existing high-end DECT leveraging advanced X-ray source and/or detector technologies, the use of the sequentially-scanning data acquisition scheme to implement DECT may make a broader impact on clinical practice because this scheme requires no specialized hardware designs. However, since the concentration of iodinated contrast agent in the imaged subject varies over time, sequentially-scanned data sets acquired at two tube potentials are temporally inconsistent. As existing material decomposition approaches for DECT assume that the data sets acquired at two tube potentials are temporally consistent, the violation of this assumption results in inaccurate quantification of iodine concentration. In this work, we developed a technique to achieve sequentially-scanning DECT imaging using high temporal resolution image reconstruction and temporal extrapolation, ACCELERATION in short, to address the technical challenge induced by temporal inconsistency of sequentially-scanned data sets and improve iodine quantification accuracy in sequentially-scanning DECT. ACCELERATION has been validated and evaluated using numerical simulation data sets generated from clinical human subject exams. Results demonstrated the improvement of iodine quantification accuracy using ACCELERATION.

[AI-137] Quantum Gradient Class Activation Map for Model Interpretability

Link: https://arxiv.org/abs/2408.05899
Authors: Hsin-Yi Lin, Huan-Hsin Tseng, Samuel Yen-Chi Chen, Shinjae Yoo
Keywords: Quantum machine learning, recently made significant, made significant advancements, machine learning, Variational Quantum Circuits
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: Submitted to IEEE SiPS 2024

Abstract:Quantum machine learning (QML) has recently made significant advancements in various topics. Despite the successes, the safety and interpretability of QML applications have not been thoroughly investigated. This work proposes using Variational Quantum Circuits (VQCs) for activation mapping to enhance model transparency, introducing the Quantum Gradient Class Activation Map (QGrad-CAM). This hybrid quantum-classical computing framework leverages both quantum and classical strengths and gives access to the derivation of an explicit formula of feature map importance. Experimental results demonstrate significant, fine-grained, class-discriminative visual explanations generated across both image and speech datasets.
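
For orientation, the classical Grad-CAM importance weights that this line of work builds on can be written as below. This is the standard classical formulation, not the quantum formula derived in the paper, which the abstract does not state. Here $A^k$ is the $k$-th feature map, $y^c$ the score for class $c$, and $Z$ the number of spatial positions.

```latex
\alpha_k^c = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial y^c}{\partial A^k_{ij}},
\qquad
L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\!\left( \sum_{k} \alpha_k^c \, A^k \right)
```

Each weight $\alpha_k^c$ averages the gradient of the class score over one feature map, and the ReLU keeps only the regions that push the score for class $c$ upward.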

[AI-138] Divide-and-Conquer Predictive Coding: a structured Bayesian inference algorithm NEURIPS

Link: https://arxiv.org/abs/2408.05834
Authors: Eli Sennesh, Hao Wu, Tommaso Salvatori
Keywords: Unexpected stimuli induce, Unexpected stimuli, predictive coding, stimuli induce, Unexpected
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Notes: 22 pages, 5 figures, submitted to Neural Information Processing Systems (NeurIPS) 2024

Abstract:Unexpected stimuli induce “error” or “surprise” signals in the brain. The theory of predictive coding promises to explain these observations in terms of Bayesian inference by suggesting that the cortex implements variational inference in a probabilistic graphical model. However, when applied to machine learning tasks, this family of algorithms has yet to perform on par with other variational approaches in high-dimensional, structured inference problems. To address this, we introduce a novel predictive coding algorithm for structured generative models, that we call divide-and-conquer predictive coding (DCPC). DCPC differs from other formulations of predictive coding, as it respects the correlation structure of the generative model and provably performs maximum-likelihood updates of model parameters, all without sacrificing biological plausibility. Empirically, DCPC achieves better numerical performance than competing algorithms and provides accurate inference in a number of problems not previously addressed with predictive coding. We provide an open implementation of DCPC in Pyro on Github.

[AI-139] Time Makes Space: Emergence of Place Fields in Networks Encoding Temporally Continuous Sensory Experiences

Link: https://arxiv.org/abs/2408.05798
Authors: Zhaoze Wang, Ronald W. Di Tullio, Spencer Rooke, Vijay Balasubramanian
Keywords: support episodic memory, place, place fields, episodic memory recall, partial cues
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Notes:

Abstract:The vertebrate hippocampus is believed to use recurrent connectivity in area CA3 to support episodic memory recall from partial cues. This brain area also contains place cells, whose location-selective firing fields implement maps supporting spatial memory. Here we show that place cells emerge in networks trained to remember temporally continuous sensory episodes. We model CA3 as a recurrent autoencoder that recalls and reconstructs sensory experiences from noisy and partially occluded observations by agents traversing simulated rooms. The agents move in realistic trajectories modeled from rodents and environments are modeled as high-dimensional sensory experience maps. Training our autoencoder to pattern-complete and reconstruct experiences with a constraint on total activity causes spatially localized firing fields, i.e., place cells, to emerge in the encoding layer. The emergent place fields reproduce key aspects of hippocampal phenomenology: a) remapping (maintenance of and reversion to distinct learned maps in different environments), implemented via repositioning of experience manifolds in the network’s hidden layer, b) orthogonality of spatial representations in different arenas, c) robust place field emergence in differently shaped rooms, with single units showing multiple place fields in large or complex spaces, and d) slow representational drift of place fields. We argue that these results arise because continuous traversal of space makes sensory experience temporally continuous. We make testable predictions: a) rapidly changing sensory context will disrupt place fields, b) place fields will form even if recurrent connections are blocked, but reversion to previously learned representations upon remapping will be abolished, c) the dimension of temporally smooth experience sets the dimensionality of place fields, including during virtual navigation of abstract spaces.

[AI-140] VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing

Link: https://arxiv.org/abs/2408.05758
Authors: Chunyu Qiang, Wang Geng, Yi Zhao, Ruibo Fu, Tao Wang, Cheng Gong, Tianrui Wang, Qiuyu Liu, Jiangyan Yi, Zhengqi Wen, Chen Zhang, Hao Che, Longbiao Wang, Jianwu Dang, Jianhua Tao
Keywords: brought significant improvements, Deep learning, Vector Quantized Contrastive, cross-modal representation learning, cross-modal sequence representation
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
Notes:

[AI-141] TC-KANRecon: High-Quality and Accelerated MRI Reconstruction via Adaptive KAN Mechanisms and Intelligent Feature Scaling

Link: https://arxiv.org/abs/2408.05705
Authors: Ruiquan Ge, Xiao Yu, Yifei Chen, Fan Jia, Shenghao Zhu, Guanyu Zhou, Yiyu Huang, Chenyan Zhang, Dong Zeng, Changmiao Wang, Qiegen Liu, Shanzhou Niu
Keywords: Magnetic Resonance Imaging, Magnetic Resonance, Resonance Imaging, clinical diagnosis due, multiple contrast mechanisms
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Notes: 10 pages, 3 figures

Abstract:Magnetic Resonance Imaging (MRI) has become essential in clinical diagnosis due to its high resolution and multiple contrast mechanisms. However, the relatively long acquisition time limits its broader application. To address this issue, this study presents an innovative conditional guided diffusion model, named as TC-KANRecon, which incorporates the Multi-Free U-KAN (MF-UKAN) module and a dynamic clipping strategy. TC-KANRecon model aims to accelerate the MRI reconstruction process through deep learning methods while maintaining the quality of the reconstructed images. The MF-UKAN module can effectively balance the tradeoff between image denoising and structure preservation. Specifically, it presents the multi-head attention mechanisms and scalar modulation factors, which significantly enhances the model’s robustness and structure preservation capabilities in complex noise environments. Moreover, the dynamic clipping strategy in TC-KANRecon adjusts the cropping interval according to the sampling steps, thereby mitigating image detail loss typically caused by traditional cropping methods and enriching the visual features of the images. Furthermore, the MC-Model module incorporates full-sampling k-space information, realizing efficient fusion of conditional information, enhancing the model’s ability to process complex data, and improving the realism and detail richness of reconstructed images. Experimental results demonstrate that the proposed method outperforms other MRI reconstruction methods in both qualitative and quantitative evaluations. Notably, TC-KANRecon method exhibits excellent reconstruction results when processing high-noise, low-sampling-rate MRI data. Our source code is available at this https URL.

[AI-142] Quantum-secure multiparty deep learning

Link: https://arxiv.org/abs/2408.05629
Authors: Kfir Sulimany, Sri Krishna Vadlamani, Ryan Hamerly, Prahlad Iyengar, Dirk Englund
Keywords:
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG); Optics (physics.optics)
Notes:

[AI-143] Evolutionary mechanisms that promote cooperation may not promote social welfare

Link: https://arxiv.org/abs/2408.05373
Authors: TheAnh Han, Manh Hong Duong, Matjaz Perc
Keywords:
Categories: Dynamical Systems (math.DS); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA); Adaptation and Self-Organizing Systems (nlin.AO)
Notes: 21 pages, 5 figures

[AI-144] scASDC: Attention Enhanced Structural Deep Clustering for Single-cell RNA-seq Data

Link: https://arxiv.org/abs/2408.05258
Authors: Wenwen Min, Zhen Wang, Fangfang Zhu, Taosheng Xu, Shunfang Wang
Keywords:
Categories: Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes:

[AI-145] Predictive maintenance solution for industrial systems – an unsupervised approach based on log periodic power law

Link: https://arxiv.org/abs/2408.05231
Authors: Bogdan Łobodziński
Keywords:
Categories: Applications (stat.AP); Artificial Intelligence (cs.AI); Data Analysis, Statistics and Probability (physics.data-an)
Notes: 14 pages, 4 figures, 1 table

Computer Vision

[CV-0] Moo-ving Beyond Tradition: Revolutionizing Cattle Behavioural Phenotyping with Pose Estimation Techniques

Link: https://arxiv.org/abs/2408.06336
Authors: Navid Ghassemi, Ali Goldani, Ian Q. Whishaw, Majid H. Mohajerani
Keywords: major contributor, Artificial Intelligence, pose estimation, Canada, pose
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:The cattle industry has been a major contributor to the economy of many countries, including the US and Canada. The integration of Artificial Intelligence (AI) has revolutionized this sector, mirroring its transformative impact across all industries by enabling scalable and automated monitoring and intervention practices. AI has also introduced tools and methods that automate many tasks previously performed by human labor with the help of computer vision, including health inspections. Among these methods, pose estimation has a special place; pose estimation is the process of finding the position of joints in an image of animals. Analyzing the pose of animal subjects enables precise identification and tracking of the animal’s movement and the movements of its body parts. By summarizing the video and imagery data into movement and joint location using pose estimation and then analyzing this information, we can address the scalability challenge in cattle management, focusing on health monitoring, behavioural phenotyping and welfare concerns. Our study reviews recent advancements in pose estimation methodologies, their applicability in improving the cattle industry, existing challenges, and gaps in this field. Furthermore, we propose an initiative to enhance open science frameworks within this field of study by launching a platform designed to connect industry and academia.

[CV-1] HeLiMOS: A Dataset for Moving Object Segmentation in 3D Point Clouds From Heterogeneous LiDAR Sensors IROS

Link: https://arxiv.org/abs/2408.06328
Authors: Hyungtae Lim, Seoyeon Jang, Benedikt Mersch, Jens Behley, Hyun Myung, Cyrill Stachniss
Keywords: Moving object segmentation, Moving object, LiDAR sensors, object segmentation, MOS
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Notes: Proc. IEEE/RSJ Int. Conf. Intell. Robot. Syst. (IROS) 2024

Abstract:Moving object segmentation (MOS) using a 3D light detection and ranging (LiDAR) sensor is crucial for scene understanding and identification of moving objects. Despite the availability of various types of 3D LiDAR sensors in the market, MOS research still predominantly focuses on 3D point clouds from mechanically spinning omnidirectional LiDAR sensors. Thus, we are, for example, lacking a dataset with MOS labels for point clouds from solid-state LiDAR sensors which have irregular scanning patterns. In this paper, we present a labeled dataset, called HeLiMOS, that enables testing of MOS approaches on four heterogeneous LiDAR sensors, including two solid-state LiDAR sensors. Furthermore, we introduce a novel automatic labeling method to substantially reduce the labeling effort required from human annotators. To this end, our framework exploits an instance-aware static map building approach and tracking-based false label filtering. Finally, we provide experimental results regarding the performance of commonly used state-of-the-art MOS approaches on HeLiMOS that suggest a new direction for a sensor-agnostic MOS, which generally works regardless of the type of LiDAR sensors used to capture 3D point clouds. Our dataset is available at this https URL.

[CV-2] VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents

Link: https://arxiv.org/abs/2408.06327
Authors: Xiao Liu, Tianjie Zhang, Yu Gu, Iat Long Iong, Yifan Xu, Xixuan Song, Shudan Zhang, Hanyu Lai, Xinyi Liu, Hanlin Zhao, Jiadai Sun, Xinyue Yang, Yu Yang, Zehan Qi, Shuntian Yao, Xueqiao Sun, Siyi Cheng, Qinkai Zheng, Hao Yu, Hanchen Zhang, Wenyi Hong, Ming Ding, Lihang Pan, Xiaotao Gu, Aohan Zeng, Zhengxiao Du, Chan Hee Song, Yu Su, Yuxiao Dong, Jie Tang
Keywords: Large Multimodal Models, Large Multimodal, form highly capable, highly capable Visual, Visual Foundation Agents
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Notes:

[CV-3] EqNIO: Subequivariant Neural Inertial Odometry

Link: https://arxiv.org/abs/2408.06321
Authors: Royina Karegoudra Jayanth, Yinshuang Xu, Ziyun Wang, Evangelos Chatzipantazis, Daniel Gehrig, Kostas Daniilidis
Keywords: Extended Kalman Filter, Extended Kalman, Inertial Measurement Unit, Kalman Filter, stochastic filter networks
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Notes: 26 pages

Abstract:Presently, neural networks are widely employed to accurately estimate 2D displacements and associated uncertainties from Inertial Measurement Unit (IMU) data that can be integrated into stochastic filter networks like the Extended Kalman Filter (EKF) as measurements and uncertainties for the update step in the filter. However, such neural approaches overlook symmetry which is a crucial inductive bias for model generalization. This oversight is notable because (i) physical laws adhere to symmetry principles when considering the gravity axis, meaning there exists the same transformation for both the physical entity and the resulting trajectory, and (ii) displacements should remain equivariant to frame transformations when the inertial frame changes. To address this, we propose a subequivariant framework by: (i) deriving fundamental layers such as linear and nonlinear layers for a subequivariant network, designed to handle sequences of vectors and scalars, (ii) employing the subequivariant network to predict an equivariant frame for the sequence of inertial measurements. This predicted frame can then be utilized for extracting invariant features through projection, which are integrated with arbitrary network architectures, (iii) transforming the invariant output by frame transformation to obtain equivariant displacements and covariances. We demonstrate the effectiveness and generalization of our Equivariant Framework on a filter-based approach with TLIO architecture for TLIO and Aria datasets, and an end-to-end deep learning approach with RONIN architecture for RONIN, RIDI and OxIOD datasets.
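
The frame-based recipe in steps (ii) and (iii) of the abstract can be illustrated with a toy sketch: choose a frame that rotates with the input, run any model on the frame-aligned (invariant) input, then rotate the output back. Everything here is an assumption made for illustration. The hand-crafted `yaw_frame` stands in for the learned subequivariant frame predictor, `arbitrary_model` stands in for TLIO/RONIN-style networks, and only rotations about the gravity (z) axis are handled, not reflections.

```python
import math

def rotate_z(v, angle):
    """Rotate a 3D vector about the z (gravity) axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    x, y, z = v
    return (c * x - s * y, s * x + c * y, z)

def yaw_frame(vectors):
    """Hypothetical hand-crafted frame: heading of the mean horizontal component.
    Degenerate when the mean horizontal component is zero."""
    mean_x = sum(v[0] for v in vectors) / len(vectors)
    mean_y = sum(v[1] for v in vectors) / len(vectors)
    return math.atan2(mean_y, mean_x)

def arbitrary_model(vectors):
    """Stand-in for any network; it is NOT rotation-equivariant by itself."""
    n = len(vectors)
    mx = sum(v[0] for v in vectors) / n
    my = sum(v[1] for v in vectors) / n
    return (math.tanh(mx) + 0.1 * my * my, 0.5 * my - 0.2 * mx, 0.0)

def equivariant_displacement(vectors):
    """Canonicalize the input, apply the model, map the output back."""
    theta = yaw_frame(vectors)
    canonical = [rotate_z(v, -theta) for v in vectors]  # invariant features
    local = arbitrary_model(canonical)                  # invariant output
    return rotate_z(local, theta)                       # equivariant output

imu = [(1.0, 0.2, 9.8), (0.8, 0.5, 9.7), (1.1, -0.1, 9.9)]
phi = 0.7  # yaw applied to the whole input sequence
rotated_out = equivariant_displacement([rotate_z(v, phi) for v in imu])
expected = rotate_z(equivariant_displacement(imu), phi)
print(all(abs(a - b) < 1e-9 for a, b in zip(rotated_out, expected)))
```

Because the frame angle shifts by exactly the applied yaw, the canonicalized input is unchanged, so the wrapped model is equivariant regardless of what `arbitrary_model` computes.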

[CV-4] From SAM to SAM 2: Exploring Improvements in Meta's Segment Anything Model

Link: https://arxiv.org/abs/2408.06305
Authors: Athulya Sundaresan Geetha, Muhammad Hussain
Keywords: Meta in April, community by Meta, bounding boxes, groundbreaking tool, based on prompts
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:The Segment Anything Model (SAM), introduced to the computer vision community by Meta in April 2023, is a groundbreaking tool that allows automated segmentation of objects in images based on prompts such as text, clicks, or bounding boxes. SAM excels in zero-shot performance, segmenting unseen objects without additional training, stimulated by a large dataset of over one billion image masks. SAM 2 expands this functionality to video, leveraging memory from preceding and subsequent frames to generate accurate segmentation across entire videos, enabling near real-time performance. This comparison shows how SAM has evolved to meet the growing need for precise and efficient segmentation in various applications. The study suggests that future advancements in models like SAM will be crucial for improving computer vision technology.

[CV-5] Long-Form Answers to Visual Questions from Blind and Low Vision People

Link: https://arxiv.org/abs/2408.06303
Authors: Mina Huh, Fangyuan Xu, Yi-Hao Peng, Chongyan Chen, Hansika Murugu, Danna Gurari, Eunsol Choi, Amy Pavel
Keywords: long-form answers, Vision language models, answers, generate long-form answers, Vision language
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Notes: COLM 2024

[CV-6] Finding Patterns in Ambiguity: Interpretable Stress Testing in the Decision Boundary CVPR

Link: https://arxiv.org/abs/2408.06302
Authors: Inês Gomes, Luís F. Teixeira, Jan N. van Rijn, Carlos Soares, André Restivo, Luís Cunha, Moisés Santos
Keywords: domains highlights, highlights the importance, importance of understanding, understanding the decision-making, decision-making processes
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Notes: To be published in the Responsible Generative AI workshop at CVPR

Abstract:The increasing use of deep learning across various domains highlights the importance of understanding the decision-making processes of these black-box models. Recent research focusing on the decision boundaries of deep classifiers relies on generated synthetic instances in areas of low confidence, uncovering samples that challenge both models and humans. We propose a novel approach to enhance the interpretability of deep binary classifiers by selecting representative samples from the decision boundary - prototypes - and applying post-model explanation algorithms. We evaluate the effectiveness of our approach through 2D visualizations and GradientSHAP analysis. Our experiments demonstrate the potential of the proposed method, revealing distinct and compact clusters and diverse prototypes that capture essential features that lead to low-confidence decisions. By offering a more aggregated view of deep classifiers’ decision boundaries, our work contributes to the responsible development and deployment of reliable machine learning systems.

[CV-7] Mipmap-GS: Let Gaussians Deform with Scale-specific Mipmap for Anti-aliasing Rendering

Link: https://arxiv.org/abs/2408.06286
Authors: Jiameng Li, Yue Shi, Jiezhang Cao, Bingbing Ni, Wenjun Zhang, Kai Zhang, Luc Van Gool
Keywords: attracted great attention, superior rendering efficiency, Gaussian Splatting, high fidelity, attracted great
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: 9 pages

Abstract:3D Gaussian Splatting (3DGS) has attracted great attention in novel view synthesis because of its superior rendering efficiency and high fidelity. However, the trained Gaussians suffer from severe zooming degradation due to non-adjustable representation derived from single-scale training. Though some methods attempt to tackle this problem via post-processing techniques such as selective rendering or filtering techniques towards primitives, the scale-specific information is not involved in Gaussians. In this paper, we propose a unified optimization method to make Gaussians adaptive for arbitrary scales by self-adjusting the primitive properties (e.g., color, shape and size) and distribution (e.g., position). Inspired by the mipmap technique, we design pseudo ground-truth for the target scale and propose a scale-consistency guidance loss to inject scale information into 3D Gaussians. Our method is a plug-in module, applicable for any 3DGS models to solve the zoom-in and zoom-out aliasing. Extensive experiments demonstrate the effectiveness of our method. Notably, our method outperforms 3DGS in PSNR by an average of 9.25 dB for zoom-in and 10.40 dB for zoom-out on the NeRF Synthetic dataset.
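
The mipmap technique that inspires this method is, in its classic form, a pyramid of progressively halved images. The sketch below shows only that classic construction, using 2x2 box averaging on a grayscale image stored as nested lists. It is not the paper's pseudo ground-truth procedure or its scale-consistency loss, neither of which the abstract specifies, and it assumes power-of-two image sizes.

```python
def downsample_2x(image):
    """Halve the resolution by averaging each non-overlapping 2x2 block."""
    height, width = len(image), len(image[0])
    if height % 2 or width % 2:
        raise ValueError("image height and width must both be even")
    return [
        [
            (image[y][x] + image[y][x + 1]
             + image[y + 1][x] + image[y + 1][x + 1]) / 4.0
            for x in range(0, width, 2)
        ]
        for y in range(0, height, 2)
    ]

def build_mipmap(image):
    """Return [level 0 (full resolution), level 1 (half), ...] down to 1 pixel."""
    levels = [image]
    while len(levels[-1]) > 1 and len(levels[-1][0]) > 1:
        levels.append(downsample_2x(levels[-1]))
    return levels

# A 4x4 ramp image with pixel values 0..15.
image = [[float(4 * row + col) for col in range(4)] for row in range(4)]
for level in build_mipmap(image):
    print(len(level), "x", len(level[0]))
```

Each level is a pre-filtered version of the scene at a coarser scale, which is the property that makes mipmaps useful as an anti-aliasing reference when zooming out.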

[CV-8] Context-aware Visual Storytelling with Visual Prefix Tuning and Contrastive Learning

Link: https://arxiv.org/abs/2408.06259
Authors: Yingjin Song, Denis Paperno, Albert Gatt
Keywords: storytelling systems generate, systems generate multi-sentence, Visual storytelling systems, generate multi-sentence stories, image sequences
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Notes: 18 pages, 12 figures, accepted by INLG 2024

[CV-9] Rethinking Video with a Universal Event-Based Representation

Link: https://arxiv.org/abs/2408.06248
Authors: Andrew Freeman
Keywords: discrete image frames, sequence of discrete, discrete image, Delta, Traditionally
Categories: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)
Notes: 137 pages. PhD dissertation at the University of North Carolina, Chapel Hill

Abstract:Traditionally, video is structured as a sequence of discrete image frames. Recently, however, a novel video sensing paradigm has emerged which eschews video frames entirely. These “event” sensors aim to mimic the human vision system with asynchronous sensing, where each pixel has an independent, sparse data stream. While these cameras enable high-speed and high-dynamic-range sensing, researchers often revert to a framed representation of the event data for existing applications, or build bespoke applications for a particular camera’s event data type. At the same time, classical video systems have significant computational redundancy at the application layer, since pixel samples are repeated across frames in the uncompressed domain. To address the shortcomings of existing systems, I introduce Address, Decimation, Δt Event Representation (ADΔER, pronounced “adder”), a novel intermediate video representation and system framework. The framework transcodes a variety of framed and event camera sources into a single event-based representation, which supports source-modeled lossy compression and backward compatibility with traditional frame-based applications. I demonstrate that ADΔER achieves state-of-the-art application speed and compression performance for scenes with high temporal redundancy. Crucially, I describe how ADΔER unlocks an entirely new control mechanism for computer vision: application speed can correlate with both the scene content and the level of lossy compression. Finally, I discuss the implications for event-based video on large-scale video surveillance and resource-constrained sensing.

[CV-10] Latent Disentanglement for Low Light Image Enhancement

Link: https://arxiv.org/abs/2408.06245
Authors: Zhihao Zheng, Mooi Choo Chuah
Keywords: Retinex theory, Disentangle-based Enhancement Network, Latent Disentangle-based Enhancement, learning-based low-light image, Retinex-based decomposition techniques
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Many learning-based low-light image enhancement (LLIE) algorithms are based on the Retinex theory. However, the Retinex-based decomposition techniques in such models introduce corruptions which limit their enhancement performance. In this paper, we propose a Latent Disentangle-based Enhancement Network (LDE-Net) for low light vision tasks. The latent disentanglement module disentangles the input image in latent space such that no corruption remains in the disentangled Content and Illumination components. For LLIE task, we design a Content-Aware Embedding (CAE) module that utilizes Content features to direct the enhancement of the Illumination component. For downstream tasks (e.g. nighttime UAV tracking and low-light object detection), we develop an effective light-weight enhancer based on the latent disentanglement framework. Comprehensive quantitative and qualitative experiments demonstrate that our LDE-Net significantly outperforms state-of-the-art methods on various LLIE benchmarks. In addition, the great results obtained by applying our framework on the downstream tasks also demonstrate the usefulness of our latent disentanglement design.

[CV-11] 3D Reconstruction of Protein Structures from Multi-view AFM Images using Neural Radiance Fields (NeRFs)

Link: https://arxiv.org/abs/2408.06244
Authors: Jaydeep Rade, Ethan Herron, Soumik Sarkar, Anwesha Sarkar, Adarsh Krishnamurthy
Keywords: Recent advancements, AFM images, AFM, shown promise, leveraging inputs
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Recent advancements in deep learning for predicting 3D protein structures have shown promise, particularly when leveraging inputs like protein sequences and Cryo-Electron microscopy (Cryo-EM) images. However, these techniques often fall short when predicting the structures of protein complexes (PCs), which involve multiple proteins. In our study, we investigate using atomic force microscopy (AFM) combined with deep learning to predict the 3D structures of PCs. AFM generates height maps that depict the PCs in various random orientations, providing rich information for training a neural network to predict the 3D structures. We then employ the pre-trained UpFusion model (which utilizes a conditional diffusion model for synthesizing novel views) to train an instance-specific NeRF model for 3D reconstruction. The performance of UpFusion is evaluated through zero-shot predictions of 3D protein structures using AFM images. The challenge, however, lies in the time-intensive and impractical nature of collecting actual AFM images. To address this, we use a virtual AFM imaging process that transforms a ‘PDB’ protein file into multi-view 2D virtual AFM images via volume rendering techniques. We extensively validate the UpFusion architecture using both virtual and actual multi-view AFM images. Our results include a comparison of structures predicted with varying numbers of views and different sets of views. This novel approach holds significant potential for enhancing the accuracy of protein complex structure predictions with further fine-tuning of the UpFusion network.

[CV-12] Correlation Weighted Prototype-based Self-Supervised One-Shot Segmentation of Medical Images ICPR2024

Link: https://arxiv.org/abs/2408.06235
Authors: Siladittya Manna, Saumik Bhattacharya, Umapada Pal
Keywords: sufficient annotated data, Medical image segmentation, Medical image, sufficient annotated, annotated data
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Accepted to ICPR 2024

Abstract:Medical image segmentation is one of the domains where sufficient annotated data is not available. This necessitates the application of low-data frameworks like few-shot learning. Contemporary prototype-based frameworks often do not account for the variation in features within the support and query images, giving rise to a large variance in prototype alignment. In this work, we adopt a prototype-based self-supervised one-way one-shot learning framework using pseudo-labels generated from superpixels to learn the semantic segmentation task itself. We use a correlation-based probability score to generate a dynamic prototype for each query pixel from the bag of prototypes obtained from the support feature map. This weighting scheme helps to give a higher weightage to contextually related prototypes. We also propose a quadrant masking strategy in the downstream segmentation task by utilizing prior domain information to discard unwanted false positives. We present extensive experimentations and evaluations on abdominal CT and MR datasets to show that the proposed simple but potent framework performs at par with the state-of-the-art methods.
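
The idea of building a dynamic prototype per query pixel from a bag of support prototypes can be sketched as a similarity-weighted average. The specifics below are assumptions for illustration only: cosine similarity as the correlation measure, a softmax with a hypothetical `temperature` parameter as the probability score, and plain Python lists as feature vectors. The abstract does not give the paper's exact scoring function.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def dynamic_prototype(query_feature, prototypes, temperature=0.1):
    """Weight each support prototype by its softmax-normalized similarity
    to the query pixel feature and return (prototype, weights)."""
    scores = [cosine(query_feature, p) / temperature for p in prototypes]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(query_feature)
    blended = [
        sum(w * p[d] for w, p in zip(weights, prototypes)) for d in range(dim)
    ]
    return blended, weights

bag = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # toy bag of support prototypes
prototype, weights = dynamic_prototype([1.0, 0.0], bag)
print(prototype, weights)
```

Prototypes that correlate strongly with the query pixel dominate the blend, which mirrors the abstract's point that contextually related prototypes receive higher weight.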

[CV-13] FruitNeRF: A Unified Neural Radiance Field based Fruit Counting Framework

Link: https://arxiv.org/abs/2408.06190
Authors: Lukas Meyer, Andreas Gilson, Ute Schmidt, Marc Stamminger
Keywords: view synthesis methods, introduce FruitNeRF, view synthesis, fruit, fruit type directly
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Project Page: this https URL

Abstract:We introduce FruitNeRF, a unified novel fruit counting framework that leverages state-of-the-art view synthesis methods to count any fruit type directly in 3D. Our framework takes an unordered set of posed images captured by a monocular camera and segments fruit in each image. To make our system independent of the fruit type, we employ a foundation model that generates binary segmentation masks for any fruit. Utilizing both modalities, RGB and semantic, we train a semantic neural radiance field. Through uniform volume sampling of the implicit Fruit Field, we obtain fruit-only point clouds. By applying cascaded clustering on the extracted point cloud, our approach achieves precise fruit count. The use of neural radiance fields provides significant advantages over conventional methods such as object tracking or optical flow, as the counting itself is lifted into 3D. Our method prevents double counting fruit and avoids counting irrelevant fruit. We evaluate our methodology using both real-world and synthetic datasets. The real-world dataset consists of three apple trees with manually counted ground truths, a benchmark apple dataset with one row and ground truth fruit location, while the synthetic dataset comprises various fruit types including apple, plum, lemon, pear, peach, and mango. Additionally, we assess the performance of fruit counting using the foundation model compared to a U-Net.

[CV-14] Blind-Match: Efficient Homomorphic Encryption-Based 1:N Matching for Privacy-Preserving Biometric Identification CIKM2024

Link: https://arxiv.org/abs/2408.06167
Authors: Hyunmin Choi, Jiwon Kim, Chiyoung Song, Simon S. Woo, Hyoungshick Kim
Keywords: leverages homomorphic encryption, homomorphic encryption, efficient and privacy-preserving, system that leverages, leverages homomorphic
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Notes: Accepted to CIKM 2024 (Applied Research Track)

Abstract:We present Blind-Match, a novel biometric identification system that leverages homomorphic encryption (HE) for efficient and privacy-preserving 1:N matching. Blind-Match introduces a HE-optimized cosine similarity computation method, where the key idea is to divide the feature vector into smaller parts for processing rather than computing the entire vector at once. By optimizing the number of these parts, Blind-Match minimizes execution time while ensuring data privacy through HE. Blind-Match achieves superior performance compared to state-of-the-art methods across various biometric datasets. On the LFW face dataset, Blind-Match attains a 99.63% Rank-1 accuracy with a 128-dimensional feature vector, demonstrating its robustness in face recognition tasks. For fingerprint identification, Blind-Match achieves a remarkable 99.55% Rank-1 accuracy on the PolyU dataset, even with a compact 16-dimensional feature vector, significantly outperforming the state-of-the-art method, Blind-Touch, which achieves only 59.17%. Furthermore, Blind-Match showcases practical efficiency in large-scale biometric identification scenarios, such as Naver Cloud’s FaceSign, by processing 6,144 biometric samples in 0.74 seconds using a 128-dimensional feature vector.
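
The key idea of splitting the feature vector into smaller parts can be seen in plaintext, without any encryption: the cosine similarity of two normalized vectors is the sum of per-chunk partial dot products. The sketch below shows only that arithmetic identity plus a simple 1:N search. It uses no homomorphic encryption library, and the function names and the `num_chunks` parameter are hypothetical, not Blind-Match's API.

```python
import math

def normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def chunked_cosine_similarity(a, b, num_chunks):
    """Cosine similarity computed as a sum of per-chunk partial dot products."""
    if len(a) != len(b) or len(a) % num_chunks != 0:
        raise ValueError("vectors must match and split evenly into chunks")
    a, b = normalize(a), normalize(b)
    size = len(a) // num_chunks
    partials = [
        sum(x * y for x, y in zip(a[start:start + size], b[start:start + size]))
        for start in range(0, len(a), size)
    ]
    return sum(partials)

def identify(query, gallery, num_chunks):
    """1:N matching: index of the enrolled template most similar to the query."""
    scores = [chunked_cosine_similarity(query, t, num_chunks) for t in gallery]
    return max(range(len(scores)), key=scores.__getitem__)

query = [1.0, 2.0, 3.0, 4.0]
gallery = [[4.0, 3.0, 2.0, 1.0], [1.0, 2.0, 3.0, 4.1], [0.0, 0.0, 0.0, 1.0]]
print(identify(query, gallery, num_chunks=2))
```

In an encrypted setting the number of chunks would trade off ciphertext operations against packing efficiency, which is presumably the quantity the paper optimizes. Here it changes nothing but the order of summation.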

[CV-15] OmniCLIP: Adapting CLIP for Video Recognition with Spatial-Temporal Omni-Scale Feature Learning ECAI-2024

Link: https://arxiv.org/abs/2408.06158
Authors: Mushui Liu,Bozheng Li,Yunlong Yu
Keywords: Recent Vision-Language Models, made great progress, Vision-Language Models, video recognition, made great
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ECAI-2024

Abstract:Recent Vision-Language Models (VLMs), e.g., CLIP, have made great progress in video recognition. Despite the improvement brought by the strong visual backbone in extracting spatial features, CLIP still falls short in capturing and integrating spatial-temporal features, which are essential for video recognition. In this paper, we propose OmniCLIP, a framework that adapts CLIP for video recognition by focusing on learning comprehensive features encompassing spatial, temporal, and dynamic spatial-temporal scales, which we refer to as omni-scale features. This is achieved through the design of spatial-temporal blocks that include parallel temporal adapters (PTA), enabling efficient temporal modeling. Additionally, we introduce a self-prompt generator (SPG) module to capture dynamic object spatial features. The synergy between PTA and SPG allows OmniCLIP to discern varying spatial information across frames and assess object scales over time. We have conducted extensive experiments in supervised video recognition, few-shot video recognition, and zero-shot recognition tasks. The results demonstrate the effectiveness of our method, especially with OmniCLIP achieving a top-1 accuracy of 74.30% on HMDB51 in a 16-shot setting, surpassing the recent MotionPrompt approach even with full training data. The code is available at this https URL.

[CV-16] Novel View Synthesis from a Single Image with Pretrained Diffusion Guidance

Link: https://arxiv.org/abs/2408.06157
Authors: Taewon Kang,Divya Kothandaraman,Dinesh Manocha,Ming C. Lin
Keywords: Recent, complex environments, view synthesis, Abstract, scenes
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 7 figures

Abstract:Recent 3D novel view synthesis (NVS) methods are limited to single-object-centric scenes generated from new viewpoints and struggle with complex environments. They often require extensive 3D data for training, lacking generalization beyond training distribution. Conversely, 3D-free methods can generate text-controlled views of complex, in-the-wild scenes using a pretrained stable diffusion model without tedious fine-tuning, but lack camera control. In this paper, we introduce HawkI++, a method capable of generating camera-controlled viewpoints from a single input image. HawkI++ excels in handling complex and diverse scenes without additional 3D data or extensive training. It leverages widely available pretrained NVS models for weak guidance, integrating this knowledge into a 3D-free view synthesis approach to achieve the desired results efficiently. Our experimental results demonstrate that HawkI++ outperforms existing models in both qualitative and quantitative evaluations, providing high-fidelity and consistent novel view synthesis at desired camera angles across a wide variety of scenes.

[CV-17] Palantir: Towards Efficient Super Resolution for Ultra-high-definition Live Streaming

[CV-18] Efficient and Scalable Point Cloud Generation with Sparse Point-Voxel Diffusion Models

Link: https://arxiv.org/abs/2408.06145
Authors: Ioannis Romanelis,Vlassios Fotis,Athanasios Kalogeras,Christos Alexakos,Konstantinos Moustakas,Adrian Munteanu
Keywords: maintaining fast generation, high-quality and diverse, point cloud, point cloud generative, fast generation times
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We propose a novel point cloud U-Net diffusion architecture for 3D generative modeling capable of generating high-quality and diverse 3D shapes while maintaining fast generation times. Our network employs a dual-branch architecture, combining the high-resolution representations of points with the computational efficiency of sparse voxels. Our fastest variant outperforms all non-diffusion generative approaches on unconditional shape generation, the most popular benchmark for evaluating point cloud generative models, while our largest model achieves state-of-the-art results among diffusion methods, with a runtime approximately 70% of the previously state-of-the-art PVD. Beyond unconditional generation, we perform extensive evaluations, including conditional generation on all categories of ShapeNet, demonstrating the scalability of our model to larger datasets, and implicit generation which allows our network to produce high quality point clouds on fewer timesteps, further decreasing the generation time. Finally, we evaluate the architecture’s performance in point cloud completion and super-resolution. Our model excels in all tasks, establishing it as a state-of-the-art diffusion U-Net for point cloud generative modeling. The code is publicly available at this https URL.

[CV-19] MR3D-Net: Dynamic Multi-Resolution 3D Sparse Voxel Grid Fusion for LiDAR-Based Collective Perception ITSC2024

Link: https://arxiv.org/abs/2408.06137
Authors: Sven Teufel,Jörg Gamerdinger,Georg Volk,Oliver Bringmann
Keywords: automated vehicles depends, safe operation, operation of automated, ability to perceive, automated vehicles
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at IEEE ITSC 2024

Abstract:The safe operation of automated vehicles depends on their ability to perceive the environment comprehensively. However, occlusion, sensor range, and environmental factors limit their perception capabilities. To overcome these limitations, collective perception enables vehicles to exchange information. However, fusing this exchanged information is a challenging task. Early fusion approaches require large amounts of bandwidth, while intermediate fusion approaches face interchangeability issues. Late fusion of shared detections is currently the only feasible approach. However, it often results in inferior performance due to information loss. To address this issue, we propose MR3D-Net, a dynamic multi-resolution 3D sparse voxel grid fusion backbone architecture for LiDAR-based collective perception. We show that sparse voxel grids at varying resolutions provide a meaningful and compact environment representation that can adapt to the communication bandwidth. MR3D-Net achieves state-of-the-art performance on the OPV2V 3D object detection benchmark while reducing the required bandwidth by up to 94% compared to early fusion. Code is available at this https URL
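
To make the bandwidth argument above concrete, here is a small sketch of sparse voxelization at several resolutions. It is an illustration under stated assumptions, not MR3D-Net itself: the function `voxelize`, the point list, and the voxel sizes are all hypothetical. The only point it makes is that a coarser grid needs fewer occupied voxels to describe the same points, so less data would have to be exchanged.

```python
import math


def voxelize(points, voxel_size):
    """Return the set of occupied integer voxel indices for a list of 3D points."""
    return {
        tuple(int(math.floor(c / voxel_size)) for c in p)
        for p in points
    }


# Toy LiDAR-like points in metres (assumed data, for illustration only).
points = [
    (0.1, 0.1, 0.1), (0.4, 0.2, 0.3), (0.9, 0.1, 0.2),
    (1.6, 1.7, 0.1), (1.9, 1.9, 0.4), (3.2, 0.5, 0.3),
]

# Occupied-voxel count per resolution: a proxy for how much would be transmitted.
occupied = {size: len(voxelize(points, size)) for size in (0.5, 1.0, 2.0)}
for size, count in occupied.items():
    print(f"voxel size {size} m -> {count} occupied voxels")
```

A sender with little bandwidth could pick the coarsest grid and a sender with more could pick a finer one; how the receiver fuses grids of different resolutions is the paper's contribution and is not sketched here.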

[CV-20] DPDETR: Decoupled Position Detection Transformer for Infrared-Visible Object Detection

Link: https://arxiv.org/abs/2408.06123
Authors: Junjie Guo,Chenqiang Gao,Fangcen Liu,Deyu Meng
Keywords: Infrared-visible object detection, visible image pairs, achieve robust object, Decoupled Position Detection, Position Detection Transformer
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments:

Abstract:Infrared-visible object detection aims to achieve robust object detection by leveraging the complementary information of infrared and visible image pairs. However, the common modality misalignment problem presents two challenges: fusing misaligned complementary features is difficult, and current methods cannot accurately locate objects in both modalities under misalignment conditions. In this paper, we propose a Decoupled Position Detection Transformer (DPDETR) to address these problems. Specifically, we explicitly formulate the object category, visible modality position, and infrared modality position to enable the network to learn the intrinsic relationships and output accurate positions of objects in both modalities. To fuse misaligned object features accurately, we propose a Decoupled Position Multispectral Cross-attention module that adaptively samples and aggregates multispectral complementary features with the constraint of infrared and visible reference positions. Additionally, we design a query-decoupled Multispectral Decoder structure to address the optimization gap among the three kinds of object information in our task and propose a Decoupled Position Contrastive DeNoising Training strategy to enhance the DPDETR’s ability to learn decoupled positions. Experiments on DroneVehicle and KAIST datasets demonstrate significant improvements compared to other state-of-the-art methods. The code will be released at this https URL.

[CV-21] RISurConv: Rotation Invariant Surface Attention-Augmented Convolutions for 3D Point Cloud Classification and Segmentation ECCV2024

Link: https://arxiv.org/abs/2408.06110
Authors: Zhiyuan Zhang,Licheng Yang,Zhiyu Xiang
Keywords: cloud deep learning, prior works focus, rotation invariant property, deep learning, point cloud deep
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ECCV 2024 (oral)

Abstract:Despite the progress on 3D point cloud deep learning, most prior works focus on learning features that are invariant to translation and point permutation, and very limited efforts have been devoted for rotation invariant property. Several recent studies achieve rotation invariance at the cost of lower accuracies. In this work, we close this gap by proposing a novel yet effective rotation invariant architecture for 3D point cloud classification and segmentation. Instead of traditional pointwise operations, we construct local triangle surfaces to capture more detailed surface structure, based on which we can extract highly expressive rotation invariant surface properties which are then integrated into an attention-augmented convolution operator named RISurConv to generate refined attention features via self-attention layers. Based on RISurConv we build an effective neural network for 3D point cloud analysis that is invariant to arbitrary rotations while maintaining high accuracy. We verify the performance on various benchmarks with supreme results obtained surpassing the previous state-of-the-art by a large margin. We achieve an overall accuracy of 96.0% (+4.7%) on ModelNet40, 93.1% (+12.8%) on ScanObjectNN, and class accuracies of 91.5% (+3.6%), 82.7% (+5.1%), and 78.5% (+9.2%) on the three categories of the FG3D dataset for the fine-grained classification task. Additionally, we achieve 81.5% (+1.0%) mIoU on ShapeNet for the segmentation task. Code is available here: this https URL
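
Why local triangles are a natural source of rotation-invariant features can be checked numerically. The sketch below is an assumption-laden illustration and not the RISurConv feature set: `rotate_z`, `triangle_features`, and the sample triangle are hypothetical, and edge lengths plus area are simply the most obvious quantities that depend only on pairwise distances between points.

```python
import math


def rotate_z(p, angle):
    """Rotate a 3D point about the z-axis by `angle` radians."""
    x, y, z = p
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)


def triangle_features(a, b, c):
    """Sorted edge lengths and area (Heron's formula) of the triangle a-b-c."""
    ab, bc, ca = math.dist(a, b), math.dist(b, c), math.dist(c, a)
    s = (ab + bc + ca) / 2.0
    area = math.sqrt(max(s * (s - ab) * (s - bc) * (s - ca), 0.0))
    return sorted([ab, bc, ca]) + [area]


# A local triangle built from three neighbouring points (assumed data).
triangle = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 1.0)]
rotated = [rotate_z(p, 0.7) for p in triangle]

before = triangle_features(*triangle)
after = triangle_features(*rotated)
print(before)
print(after)
```

The raw coordinates change under rotation while the two feature lists agree up to floating-point error, which is the property a rotation-invariant operator needs from its inputs.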

[CV-22] Towards Robust Monocular Depth Estimation in Non-Lambertian Surfaces

Link: https://arxiv.org/abs/2408.06083
Authors: Junrui Zhang,Jiaqi Li,Yachuan Huang,Yiran Wang,Jinghong Zheng,Liao Shen,Zhiguo Cao
Keywords: scenes emerge recently, general scenes emerge, monocular depth estimation, emerge recently, field of monocular
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In the field of monocular depth estimation (MDE), many models with excellent zero-shot performance in general scenes have emerged recently. However, these methods often fail in predicting non-Lambertian surfaces, such as transparent or mirror (ToM) surfaces, due to the unique reflective properties of these regions. Previous methods utilize externally provided ToM masks and aim to obtain correct depth maps through direct in-painting of RGB images. These methods highly depend on the accuracy of additional input masks, and the use of random colors during in-painting makes them insufficiently robust. We are committed to incrementally enabling the baseline model to directly learn the uniqueness of non-Lambertian surface regions for depth estimation through a well-designed training framework. Therefore, we propose non-Lambertian surface regional guidance, which constrains the predictions of the MDE model from the gradient domain to enhance its robustness. Noting the significant impact of lighting on this task, we employ random tone-mapping augmentation during training to ensure the network can predict correct results for varying lighting inputs. Additionally, we propose an optional novel lighting fusion module, which uses Variational Autoencoders to fuse multiple images and obtain the most advantageous input RGB image for depth estimation when multi-exposure images are available. Our method achieves accuracy improvements of 33.39% and 5.21% in zero-shot testing on the Booster and Mirror3D datasets for non-Lambertian surfaces, respectively, compared to Depth Anything V2. The state-of-the-art performance of 90.75 in delta1.05 within the ToM regions on the TRICKY2024 competition test set demonstrates the effectiveness of our approach.

[CV-23] Towards Adversarial Robustness via Debiased High-Confidence Logit Alignment

Link: https://arxiv.org/abs/2408.06079
Authors: Kejia Zhang,Juanjuan Weng,Zhiming Luo,Shaozi Li
Keywords: deep neural networks, adversarial, inverse adversarial attacks, inverse adversarial, adversarial training techniques
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Despite the significant advances that deep neural networks (DNNs) have achieved in various visual tasks, they still exhibit vulnerability to adversarial examples, leading to serious security concerns. Recent adversarial training techniques have utilized inverse adversarial attacks to generate high-confidence examples, aiming to align the distributions of adversarial examples with the high-confidence regions of their corresponding classes. However, in this paper, our investigation reveals that high-confidence outputs under inverse adversarial attacks are correlated with biased feature activation. Specifically, training with inverse adversarial examples causes the model’s attention to shift towards background features, introducing a spurious correlation bias. To address this bias, we propose Debiased High-Confidence Adversarial Training (DHAT), a novel approach that not only aligns the logits of adversarial examples with debiased high-confidence logits obtained from inverse adversarial examples, but also restores the model’s attention to its normal state by enhancing foreground logit orthogonality. Extensive experiments demonstrate that DHAT achieves state-of-the-art performance and exhibits robust generalization capabilities across various vision datasets. Additionally, DHAT can seamlessly integrate with existing advanced adversarial training techniques for improving the performance.

[CV-24] CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Link: https://arxiv.org/abs/2408.06072
Authors: Zhuoyi Yang,Jiayan Teng,Wendi Zheng,Ming Ding,Shiyu Huang,Jiazheng Xu,Yuanming Yang,Wenyi Hong,Xiaohan Zhang,Guanyu Feng,Da Yin,Xiaotao Gu,Yuxuan Zhang,Weihan Wang,Yean Cheng,Ting Liu,Bin Xu,Yuxiao Dong,Jie Tang
Keywords: large-scale diffusion transformer, generating videos based, text prompts, diffusion transformer model, transformer model designed
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We introduce CogVideoX, a large-scale diffusion transformer model designed for generating videos based on text prompts. To efficiently model video data, we propose to leverage a 3D Variational Autoencoder (VAE) to compress videos along both spatial and temporal dimensions. To improve the text-video alignment, we propose an expert transformer with the expert adaptive LayerNorm to facilitate the deep fusion between the two modalities. By employing a progressive training technique, CogVideoX is adept at producing coherent, long-duration videos characterized by significant motions. In addition, we develop an effective text-video data processing pipeline that includes various data preprocessing strategies and a video captioning method. It significantly helps enhance the performance of CogVideoX, improving both generation quality and semantic alignment. Results show that CogVideoX demonstrates state-of-the-art performance across both multiple machine metrics and human evaluations. The model weights of both the 3D Causal VAE and CogVideoX are publicly available at this https URL.

[CV-25] A-BDD: Leveraging Data Augmentations for Safe Autonomous Driving in Adverse Weather and Lighting

Link: https://arxiv.org/abs/2408.06071
Authors: Felix Assion,Florens Gressner,Nitin Augustine,Jona Klemenc,Ahmed Hammam,Alexandre Krattinger,Holger Trittenbach,Sascha Riemer
Keywords: High-autonomy vehicle functions, High-autonomy vehicle, vehicle functions rely, machine learning, understand the environment
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:High-autonomy vehicle functions rely on machine learning (ML) algorithms to understand the environment. Despite displaying remarkable performance in fair weather scenarios, perception algorithms are heavily affected by adverse weather and lighting conditions. To overcome these difficulties, ML engineers mainly rely on comprehensive real-world datasets. However, the difficulties in real-world data collection for critical areas of the operational design domain (ODD) often mean synthetic data is required for perception training and safety validation. Thus, we present A-BDD, a large set of over 60,000 synthetically augmented images based on BDD100K that are equipped with semantic segmentation and bounding box annotations (inherited from the BDD100K dataset). The dataset contains augmented data for rain, fog, overcast and sunglare/shadow with varying intensity levels. We further introduce novel strategies utilizing feature-based image quality metrics like FID and CMMD, which help identify useful augmented and real-world data for ML training and testing. By conducting experiments on A-BDD, we provide evidence that data augmentations can play a pivotal role in closing performance gaps in adverse weather and lighting conditions.

[CV-26] ControlNeXt: Powerful and Efficient Control for Image and Video Generation

Link: https://arxiv.org/abs/2408.06070
Authors: Bohao Peng,Jian Wang,Yuechen Zhang,Wenbo Li,Ming-Chang Yang,Jiaya Jia
Keywords: Diffusion models, demonstrated remarkable, remarkable and robust, robust abilities, video generation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: controllable generation

Abstract:Diffusion models have demonstrated remarkable and robust abilities in both image and video generation. To achieve greater control over generated results, researchers introduce additional architectures, such as ControlNet, Adapters and ReferenceNet, to integrate conditioning controls. However, current controllable generation methods often require substantial additional computational resources, especially for video generation, and face challenges in training or exhibit weak control. In this paper, we propose ControlNeXt: a powerful and efficient method for controllable image and video generation. We first design a more straightforward and efficient architecture, replacing heavy additional branches with minimal additional cost compared to the base model. Such a concise structure also allows our method to seamlessly integrate with other LoRA weights, enabling style alteration without the need for additional training. As for training, we reduce up to 90% of learnable parameters compared to the alternatives. Furthermore, we propose another method called Cross Normalization (CN) as a replacement for Zero-Convolution to achieve fast and stable training convergence. We have conducted various experiments with different base models across images and videos, demonstrating the robustness of our method.

[CV-27] Parallel transport on matrix manifolds and Exponential Action

Link: https://arxiv.org/abs/2408.06054
Authors: Du Nguyen,Stefan Sommer
Keywords: common matrix Lie, matrix Lie groups, express parallel transport, parallel transport, exponential actions
Categories: Numerical Analysis (math.NA); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We express parallel transport for several common matrix Lie groups with a family of pseudo-Riemannian metrics in terms of matrix exponential and exponential actions. The expression for parallel transport is preserved by taking the quotient under certain scenarios. In particular, for a Stiefel manifold of orthogonal matrices of size $n \times d$, we give an expression for parallel transport along a geodesic from time zero to $t$, that could be computed with time complexity of $O(nd^2)$ for small $t$, and of $O(td^3)$ for large $t$, contributing a step in a long-standing open problem in matrix manifolds. A similar result holds for flag manifolds with the canonical metric. We also show the parallel transport formulas for the generalized linear group, and the special orthogonal group under these metrics.

[CV-28] BooW-VTON: Boosting In-the-Wild Virtual Try-On via Mask-Free Pseudo Data Training

Link: https://arxiv.org/abs/2408.06047
Authors: Xuanpu Zhang,Dan Song,Pengxin Zhan,Qingguo Chen,Zhao Xu,Weihua Luo,Kaifu Zhang,Anan Liu
Keywords: Image-based virtual try-on, Image-based virtual, generate realistic try-on, increasingly popular, popular and important
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Image-based virtual try-on is an increasingly popular and important task to generate realistic try-on images of specific person. Existing methods always employ an accurate mask to remove the original garment in the source image, thus achieving realistic synthesized images in simple and conventional try-on scenarios based on powerful diffusion model. Therefore, acquiring suitable mask is vital to the try-on performance of these methods. However, obtaining precise inpainting masks, especially for complex wild try-on data containing diverse foreground occlusions and person poses, is not easy as Figure 1-Top shows. This difficulty often results in poor performance in more practical and challenging real-life scenarios, such as the selfie scene shown in Figure 1-Bottom. To this end, we propose a novel training paradigm combined with an efficient data augmentation method to acquire large-scale unpaired training data from wild scenarios, thereby significantly facilitating the try-on performance of our model without the need for additional inpainting masks. Besides, a try-on localization loss is designed to localize a more accurate try-on area to obtain more reasonable try-on results. It is noted that our method only needs the reference cloth image, source pose image and source person image as input, which is more cost-effective and user-friendly compared to existing methods. Extensive qualitative and quantitative experiments have demonstrated superior performance in wild scenarios with such a low-demand input.

[CV-29] ARPA: A Novel Hybrid Model for Advancing Visual Word Disambiguation Using Large Language Models and Transformers

Link: https://arxiv.org/abs/2408.06040
Authors: Aristi Papastavrou,Maria Lymperaiou,Giorgos Stamou
Keywords: Visual Word Sense, Word Sense Disambiguation, natural language processing, rapidly evolving fields, visual word disambiguation
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

[CV-30] Layer-Specific Optimization: Sensitivity Based Convolution Layers Basis Search

Link: https://arxiv.org/abs/2408.06024
Authors: Vasiliy Alekseev,Ilya Lukashevich,Ilia Zharikov,Ilya Vasiliev
Keywords: Deep neural network, Deep neural, neural network models, complex architecture, matrix decomposition
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments: A revived draft of an unpublished (and never-to-be-published) article. For the sake of history, memory, and old times

Abstract:Deep neural network models have a complex architecture and are overparameterized. The number of parameters is more than the whole dataset, which is highly resource-consuming. This complicates their application and limits its usage on different devices. Reduction in the number of network parameters helps to reduce the size of the model, but at the same time, thoughtlessly applied, can lead to a deterioration in the quality of the network. One way to reduce the number of model parameters is matrix decomposition, where a matrix is represented as a product of smaller matrices. In this paper, we propose a new way of applying the matrix decomposition with respect to the weights of convolutional layers. The essence of the method is to train not all convolutions, but only the subset of convolutions (basis convolutions), and represent the rest as linear combinations of the basis ones. Experiments on models from the ResNet family and the CIFAR-10 dataset demonstrate that basis convolutions can not only reduce the size of the model but also accelerate the forward and backward passes of the network. Another contribution of this work is that we propose a fast method for selecting a subset of network layers in which the use of matrix decomposition does not degrade the quality of the final model.
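
The basis-convolution idea above amounts to storing a few kernels plus mixing coefficients instead of every kernel. The following is a minimal sketch under assumptions of our own, not the authors' method: `combine`, the two 3x3 basis kernels, and the `mixing` matrix are hypothetical, and the selection of which layers to decompose is not modeled.

```python
def combine(basis, coeffs):
    """Build one flattened kernel as a linear combination of basis kernels."""
    return [
        sum(c * k[i] for c, k in zip(coeffs, basis))
        for i in range(len(basis[0]))
    ]


# Two trainable 3x3 basis kernels, flattened row by row (assumed values).
basis = [
    [1.0, 0.0, -1.0, 2.0, 0.0, -2.0, 1.0, 0.0, -1.0],
    [1.0, 2.0, 1.0, 0.0, 0.0, 0.0, -1.0, -2.0, -1.0],
]

# Each row gives the coefficients of one output kernel over the basis.
mixing = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [1.0, -1.0]]
kernels = [combine(basis, row) for row in mixing]

# Parameter count: all kernels stored densely versus basis plus coefficients.
dense_params = len(kernels) * len(basis[0])
basis_params = len(basis) * len(basis[0]) + len(mixing) * len(basis)
print(dense_params, basis_params)
```

Because convolution is linear in the kernel, the responses of the combined kernels can also be obtained by mixing the responses of the basis kernels, which is one plausible source of the speed-up the abstract reports.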

[CV-31] ClickAttention: Click Region Similarity Guided Interactive Segmentation

Link: https://arxiv.org/abs/2408.06021
Authors: Long Xu,Shanghong Li,Yongquan Chen,Junkang Chen,Rui Huang,Feng Wu
Keywords: specific target objects
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Interactive segmentation algorithms based on click points have garnered significant attention from researchers in recent years. However, existing studies typically use sparse click maps as model inputs to segment specific target objects, which primarily affect local regions and have limited abilities to focus on the whole target object, leading to increased times of [...]. In addition, most existing algorithms can not balance well between high performance and [...]. To address this issue, we propose a click attention algorithm that expands the influence range of positive clicks based on the similarity between positively-clicked regions and the whole input. We also propose a discriminative affinity loss to reduce the attention coupling between positive and negative click regions to avoid an accuracy decrease caused by mutual interference between positive and negative clicks. Extensive experiments demonstrate that our approach is superior to existing methods and achieves cutting-edge performance in fewer [...]. The interactive demo and all reproducible codes will be released at this https URL.

[CV-32] HeadGAP: Few-shot 3D Head Avatar via Generalizable Gaussian Priors

Link: https://arxiv.org/abs/2408.06019
Authors: Xiaozheng Zheng,Chao Wen,Zhaohu Li,Weiyi Zhang,Zhuo Su,Xu Chang,Yang Zhao,Zheng Lv,Xiaoyuan Zhang,Yongjie Zhang,Guidong Wang,Lan Xu
Keywords: avatar creation, data with high-fidelity, animatable robustness, avatar creation phase, capable of generalizing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:In this paper, we present a novel 3D head avatar creation approach capable of generalizing from few-shot in-the-wild data with high-fidelity and animatable robustness. Given the underconstrained nature of this problem, incorporating prior knowledge is essential. Therefore, we propose a framework comprising prior learning and avatar creation phases. The prior learning phase leverages 3D head priors derived from a large-scale multi-view dynamic dataset, and the avatar creation phase applies these priors for few-shot personalization. Our approach effectively captures these priors by utilizing a Gaussian Splatting-based auto-decoder network with part-based dynamic modeling. Our method employs identity-shared encoding with personalized latent codes for individual identities to learn the attributes of Gaussian primitives. During the avatar creation phase, we achieve fast head avatar personalization by leveraging inversion and fine-tuning strategies. Extensive experiments demonstrate that our model effectively exploits head priors and successfully generalizes them to few-shot personalization, achieving photo-realistic rendering quality, multi-view consistency, and stable animation.

[CV-33] Uncertainty-Informed Volume Visualization using Implicit Neural Representation IEEE-VIS2024

[CV-34] DEEPTalk: Dynamic Emotion Embedding for Probabilistic Speech-Driven 3D Face Animation

Link: https://arxiv.org/abs/2408.06010
Authors: Jisoo Kim,Jungbin Cho,Joonho Park,Soonmin Hwang,Da Eun Kim,Geon Kim,Youngjae Yu
Keywords: range of applications, facial, facial motion, garnered lots, lots of attention
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: First two authors contributed equally

Abstract:Speech-driven 3D facial animation has garnered lots of attention thanks to its broad range of applications. Despite recent advancements in achieving realistic lip motion, current methods fail to capture the nuanced emotional undertones conveyed through speech and produce monotonous facial motion. These limitations result in blunt and repetitive facial animations, reducing user engagement and hindering their applicability. To address these challenges, we introduce DEEPTalk, a novel approach that generates diverse and emotionally rich 3D facial expressions directly from speech inputs. To achieve this, we first train DEE (Dynamic Emotion Embedding), which employs probabilistic contrastive learning to forge a joint emotion embedding space for both speech and facial motion. This probabilistic framework captures the uncertainty in interpreting emotions from speech and facial motion, enabling the derivation of emotion vectors from its multifaceted space. Moreover, to generate dynamic facial motion, we design TH-VQVAE (Temporally Hierarchical VQ-VAE) as an expressive and robust motion prior overcoming limitations of VAEs and VQ-VAEs. Utilizing these strong priors, we develop DEEPTalk, a talking head generator that non-autoregressively predicts codebook indices to create dynamic facial motion, incorporating a novel emotion consistency loss. Extensive experiments on various datasets demonstrate the effectiveness of our approach in creating diverse, emotionally expressive talking faces that maintain accurate lip-sync. Source code will be made publicly available soon.

[CV-35] An Analysis for Image-to-Image Translation and Style Transfer

Link: https://arxiv.org/abs/2408.06000
Authors: Xiaoming Yu,Jie Tian,Zhenhua Hu
Keywords: style transfer models, deep learning, recent years, development of generative, models have emerged
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:With the development of generative technologies in deep learning, a large number of image-to-image translation and style transfer models have emerged at an explosive rate in recent years. These two technologies have made significant progress and can generate realistic images. However, many communities tend to confuse the two, because both generate the desired image based on the input image and both cover the two definitions of content and style. In fact, there are indeed significant differences between the two, and there is currently a lack of clear explanations to distinguish the two technologies, which is not conducive to the advancement of technology. We hope to serve the entire community by introducing the differences and connections between image-to-image translation and style transfer. The entire discussion process involves the concepts, forms, training modes, evaluation processes, and visualization results of the two technologies. Finally, we conclude that image-to-image translation divides images by domain, and the types of images in the domain are limited, and the scope involved is small, but the conversion ability is strong and can achieve strong semantic changes. Style transfer divides image types by single image, and the scope involved is large, but the transfer ability is limited, and it transfers more texture and color of the image.

[CV-36] Diffuse-UDA: Addressing Unsupervised Domain Adaptation in Medical Image Segmentation with Appearance and Structure Aligned Diffusion Models

Link: https://arxiv.org/abs/2408.05985
Authors: Haifan Gong, Yitao Wang, Yihan Wang, Jiashun Xiao, Xiang Wan, Haofeng Li
Keywords: present significant challenges, imaging present significant, labeled datasets, datasets from well-resourced, unlabeled datasets
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:The scarcity and complexity of voxel-level annotations in 3D medical imaging present significant challenges, particularly due to the domain gap between labeled datasets from well-resourced centers and unlabeled datasets from less-resourced centers. This disparity affects the fairness of artificial intelligence algorithms in healthcare. We introduce Diffuse-UDA, a novel method leveraging diffusion models to tackle Unsupervised Domain Adaptation (UDA) in medical image segmentation. Diffuse-UDA generates high-quality image-mask pairs with target domain characteristics and various structures, thereby enhancing UDA tasks. Initially, pseudo labels for target domain samples are generated. Subsequently, a specially tailored diffusion model, incorporating deformable augmentations, is trained on image-label or image-pseudo-label pairs from both domains. Finally, source domain labels guide the diffusion model to generate image-label pairs for the target domain. Comprehensive evaluations on several benchmarks demonstrate that Diffuse-UDA outperforms leading UDA and semi-supervised strategies, achieving performance close to or even surpassing the theoretical upper bound of models trained directly on target domain data. Diffuse-UDA offers a pathway to advance the development and deployment of AI systems in medical imaging, addressing disparities between healthcare environments. This approach enables the exploration of innovative AI-driven diagnostic tools, improves outcomes, saves time, and reduces human error.

[CV-37] Unseen No More: Unlocking the Potential of CLIP for Generative Zero-shot HOI Detection ACM-MM2024

Link: https://arxiv.org/abs/2408.05974
Authors: Yixin Guo, Yu Liu, Jianghao Li, Weimin Wang, Qi Jia
Keywords: Zero-shot human-object interaction, zero-shot HOI detection, HOI, human-object interaction, detector is capable
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ACM MM 2024

Abstract:A zero-shot human-object interaction (HOI) detector can generalize to HOI categories not encountered during training. Inspired by the impressive zero-shot capabilities offered by CLIP, the latest methods strive to leverage CLIP embeddings to improve zero-shot HOI detection. However, these embedding-based methods train the classifier on seen classes only, inevitably resulting in seen-unseen confusion during inference. Besides, we find that using prompt-tuning and adapters further increases the gap between seen and unseen accuracy. To tackle this challenge, we present the first generation-based model using CLIP for zero-shot HOI detection, coined HOIGen. It unlocks the potential of CLIP for feature generation instead of feature extraction only. To achieve this, we develop a CLIP-injected feature generator for the generation of human, object and union features. Then, we extract realistic features of seen samples and mix them with synthetic features, allowing the model to train on seen and unseen classes jointly. To enrich the HOI scores, we construct a generative prototype bank in a pairwise HOI recognition branch and a multi-knowledge prototype bank in an image-wise HOI recognition branch. Extensive experiments on the HICO-DET benchmark demonstrate that HOIGen achieves superior performance for both seen and unseen classes under various zero-shot settings, compared with other top-performing methods. Code is available at: this https URL

[CV-38] Freehand Sketch Generation from Mechanical Components ACM-MM

[CV-39] Target Detection of Safety Protective Gear Using the Improved YOLOv5

Link: https://arxiv.org/abs/2408.05964
Authors: Hao Liu, Xue Qin
Keywords: personal protective equipment, protective equipment monitoring, frequently obstructed targets, personal protective, protective equipment
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:In high-risk railway construction, personal protective equipment monitoring is critical but challenging due to small and frequently obstructed targets. We propose YOLO-EA, an innovative model that enhances safety measure detection by integrating ECA into its backbone’s convolutional layers, improving discernment of minuscule objects like hardhats. YOLO-EA further refines target recognition under occlusion by replacing GIoU with EIoU loss. YOLO-EA’s effectiveness was empirically substantiated using a dataset derived from real-world railway construction site surveillance footage. It outperforms YOLOv5, achieving 98.9% precision and 94.7% recall, up 2.5% and 0.5% respectively, while maintaining real-time performance at 70.774 fps. This highly efficient and precise YOLO-EA holds great promise for practical application in intricate construction scenarios, enforcing stringent safety compliance during complex railway construction projects.
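
The abstract above mentions replacing GIoU with EIoU loss. As a rough illustration, here is a minimal pure-Python sketch of EIoU as it is commonly defined in the literature (IoU term plus penalties on center distance, width difference, and height difference, each normalized by the smallest enclosing box). The function name `eiou_loss` and the `(x1, y1, x2, y2)` box layout are assumptions made for this example; the paper's actual implementation may differ.

```python
def eiou_loss(box, gt):
    """EIoU loss for two axis-aligned boxes given as (x1, y1, x2, y2).

    Assumes x1 < x2 and y1 < y2 for both boxes (illustrative sketch only).
    """
    # Intersection rectangle and its area (zero if the boxes do not overlap).
    ix1, iy1 = max(box[0], gt[0]), max(box[1], gt[1])
    ix2, iy2 = min(box[2], gt[2]), min(box[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    # Widths, heights, union area, and IoU.
    w, h = box[2] - box[0], box[3] - box[1]
    gw, gh = gt[2] - gt[0], gt[3] - gt[1]
    union = w * h + gw * gh - inter
    iou = inter / union

    # Width and height of the smallest box enclosing both inputs.
    cw = max(box[2], gt[2]) - min(box[0], gt[0])
    ch = max(box[3], gt[3]) - min(box[1], gt[1])

    # Squared distance between box centers, normalized by the enclosing diagonal.
    dx = (box[0] + box[2]) / 2 - (gt[0] + gt[2]) / 2
    dy = (box[1] + box[3]) / 2 - (gt[1] + gt[3]) / 2
    center_term = (dx * dx + dy * dy) / (cw * cw + ch * ch)

    # Separate penalties for width and height mismatch.
    width_term = (w - gw) ** 2 / (cw * cw)
    height_term = (h - gh) ** 2 / (ch * ch)

    return 1.0 - iou + center_term + width_term + height_term
```

Unlike GIoU, which only penalizes the empty area of the enclosing box, this form keeps a non-zero gradient signal for width and height mismatch even when one box contains the other, which is one plausible reason it may help with small or occluded targets.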

[CV-40] Boosting Adverse Weather Crowd Counting via Multi-queue Contrastive Learning

Link: https://arxiv.org/abs/2408.05956
Authors: Tianhang Pan, Zhuoran Zheng, Xiuyi Jia
Keywords: adverse weather, adverse weather conditions, weather conditions, normal weather conditions, weather
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 7 figures

Abstract:Currently, most crowd counting methods have outstanding performance under normal weather conditions. However, they often struggle to maintain their performance in extreme and adverse weather conditions due to significant differences in the domain and a lack of adverse weather images for training. To address this issue and enhance the model’s robustness in adverse weather, we propose a two-stage crowd counting method. Specifically, in the first stage, we introduce a multi-queue MoCo contrastive learning strategy to tackle the problem of weather class imbalance. This strategy facilitates the learning of weather-aware representations by the model. In the second stage, we propose to refine the representations under the guidance of contrastive learning, enabling the conversion of the weather-aware representations to the normal weather domain. While significantly improving the robustness, our method only marginally increases the weight of the model. In addition, we also create a new synthetic adverse weather dataset. Extensive experimental results show that our method achieves competitive performance.

[CV-41] Probabilistic Vision-Language Representation for Weakly Supervised Temporal Action Localization ACM-MM2024

Link: https://arxiv.org/abs/2408.05955
Authors: Geuntaek Lim, Hyunwoo Kim, Joonsoo Kim, Yukyung Choi
Keywords: Weakly supervised temporal, Weakly supervised, temporal action localization, supervised temporal action, detect action instances
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ACM MM 2024

Abstract:Weakly supervised temporal action localization (WTAL) aims to detect action instances in untrimmed videos using only video-level annotations. Since many existing works optimize WTAL models based on action classification labels, they encounter the task discrepancy problem (i.e., localization-by-classification). To tackle this issue, recent studies have attempted to utilize action category names as auxiliary semantic knowledge through vision-language pre-training (VLP). However, there are still areas where existing research falls short. Previous approaches primarily focused on leveraging textual information from language models but overlooked the alignment of dynamic human action and VLP knowledge in a joint space. Furthermore, the deterministic representation employed in previous studies struggles to capture fine-grained human motions. To address these problems, we propose a novel framework that aligns human action knowledge and VLP knowledge in a probabilistic embedding space. Moreover, we propose intra- and inter-distribution contrastive learning to enhance the probabilistic embedding space based on statistical similarities. Extensive experiments and ablation studies reveal that our method significantly outperforms all previous state-of-the-art methods. Code is available at this https URL.

[CV-42] A Simple Task-aware Contrastive Local Descriptor Selection Strategy for Few-shot Learning between inter class and intra class ICANN2024

Link: https://arxiv.org/abs/2408.05953
Authors: Qian Qiao, Yu Xie, Shaoyao Huang, Fanzhang Li
Keywords: Few-shot image classification, Few-shot image, image classification aims, labeled samples, aims to classify
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: Submitted to ICANN 2024

Abstract:Few-shot image classification aims to classify novel classes with few labeled samples. Recent research indicates that deep local descriptors have better representational capabilities. These studies recognize the impact of background noise on classification performance. They typically filter query descriptors using all local descriptors in the support classes or engage in bidirectional selection between local descriptors in support and query sets. However, they ignore the fact that background features may be useful for the classification performance of specific tasks. This paper proposes a novel task-aware contrastive local descriptor selection network (TCDSNet). First, we calculate the contrastive discriminative score for each local descriptor in the support class, and select discriminative local descriptors to form a support descriptor subset. Finally, we leverage support descriptor subsets to adaptively select discriminative query descriptors for specific tasks. Extensive experiments demonstrate that our method outperforms state-of-the-art methods on both general and fine-grained datasets.

[CV-43] Optimizing Vision Transformers with Data-Free Knowledge Transfer

Link: https://arxiv.org/abs/2408.05952
Authors: Gousia Habib, Damandeep Singh, Ishfaq Ahmad Malik, Brejesh Lall
Keywords: Natural Language Processing, Convolutional Neural Networks, traditional Convolutional Neural, Language Processing, Neural Networks
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:The groundbreaking performance of transformers in Natural Language Processing (NLP) tasks has led to their replacement of traditional Convolutional Neural Networks (CNNs), owing to the efficiency and accuracy achieved through the self-attention mechanism. This success has inspired researchers to explore the use of transformers in computer vision tasks to attain enhanced long-term semantic awareness. Vision transformers (ViTs) have excelled in various computer vision tasks due to their superior ability to capture long-distance dependencies using the self-attention mechanism. Contemporary ViTs like Data Efficient Transformers (DeiT) can effectively learn both global semantic information and local texture information from images, achieving performance comparable to traditional CNNs. However, their impressive performance comes with a high computational cost due to a very large number of parameters, hindering their deployment on devices with limited resources such as smartphones, cameras, and drones. Additionally, ViTs require a large amount of data for training to achieve performance comparable to benchmark CNN models. Therefore, we identified two key challenges in deploying ViTs on smaller form factor devices: the high computational requirements of large models and the need for extensive training data. As a solution to these challenges, we propose compressing large ViT models using Knowledge Distillation (KD), which is implemented data-free to circumvent limitations related to data availability. We also conducted experiments on object detection, in addition to classification tasks, within the same environment. Based on our analysis, we found that data-free knowledge distillation is an effective method to overcome both issues, enabling the deployment of ViTs on resource-constrained devices.
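
For readers unfamiliar with Knowledge Distillation, the following is a minimal pure-Python sketch of the classic temperature-scaled distillation loss (KL divergence between softened teacher and student distributions, scaled by the squared temperature). It illustrates the general KD objective only; it does not cover the data-free part of the paper, and the names `softmax`, `distillation_loss`, and the default temperature are assumptions chosen for this example.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on temperature-softened outputs, scaled by T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return temperature ** 2 * kl
```

A higher temperature flattens both distributions, so the student also learns from the teacher's relative ranking of the non-target classes. In a data-free setting the inputs fed to both networks would have to be synthesized, which this sketch leaves out.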

[CV-44] MV2DFusion: Leveraging Modality-Specific Object Semantics for Multi-Modal 3D Detection

Link: https://arxiv.org/abs/2408.05945
Authors: Zitian Wang, Zehao Huang, Yulu Gao, Naiyan Wang, Si Liu
Keywords: demand for robust, rise of autonomous, autonomous vehicles, vehicles has significantly, significantly increased
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:The rise of autonomous vehicles has significantly increased the demand for robust 3D object detection systems. While cameras and LiDAR sensors each offer unique advantages–cameras provide rich texture information and LiDAR offers precise 3D spatial data–relying on a single modality often leads to performance limitations. This paper introduces MV2DFusion, a multi-modal detection framework that integrates the strengths of both worlds through an advanced query-based fusion mechanism. By introducing an image query generator to align with image-specific attributes and a point cloud query generator, MV2DFusion effectively combines modality-specific object semantics without biasing toward one single modality. Then the sparse fusion process can be accomplished based on the valuable object semantics, ensuring efficient and accurate object detection across various scenarios. Our framework’s flexibility allows it to integrate with any image and point cloud-based detectors, showcasing its adaptability and potential for future advancements. Extensive evaluations on the nuScenes and Argoverse2 datasets demonstrate that MV2DFusion achieves state-of-the-art performance, particularly excelling in long-range detection scenarios.

[CV-45] Spb3DTracker: A Robust LiDAR-Based Person Tracker for Noisy Environment

[CV-46] UniPortrait: A Unified Framework for Identity-Preserving Single- and Multi-Human Image Personalization

Link: https://arxiv.org/abs/2408.05939
Authors: Junjie He, Yifeng Geng, Liefeng Bo
Keywords: free-form input description, high face fidelity, diverse layout generation, extensive facial editability, paper presents UniPortrait
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Tech report; Project page: this https URL

Abstract:This paper presents UniPortrait, an innovative human image personalization framework that unifies single- and multi-ID customization with high face fidelity, extensive facial editability, free-form input description, and diverse layout generation. UniPortrait consists of only two plug-and-play modules: an ID embedding module and an ID routing module. The ID embedding module extracts versatile editable facial features with a decoupling strategy for each ID and embeds them into the context space of diffusion models. The ID routing module then combines and distributes these embeddings adaptively to their respective regions within the synthesized image, achieving the customization of single and multiple IDs. With a carefully designed two-stage training scheme, UniPortrait achieves superior performance in both single- and multi-ID customization. Quantitative and qualitative experiments demonstrate the advantages of our method over existing approaches as well as its good scalability, e.g., the universal compatibility with existing generative control tools. The project page is at this https URL .

[CV-47] Deep Geometric Moments Promote Shape Consistency in Text-to-3D Generation

Link: https://arxiv.org/abs/2408.05938
Authors: Utkarsh Nath, Rajeev Goel, Eun Som Jeon, Changhoon Kim, Kyle Min, Yezhou Yang, Yingzhen Yang, Pavan Turaga
Keywords: Score Distillation Sampling, Distillation Sampling, Score Distillation, widely adopted practice, address the data
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 8 figures

Abstract:To address the data scarcity associated with 3D assets, 2D-lifting techniques such as Score Distillation Sampling (SDS) have become a widely adopted practice in text-to-3D generation pipelines. However, the diffusion models used in these techniques are prone to viewpoint bias and thus lead to geometric inconsistencies such as the Janus problem. To counter this, we introduce MT3D, a text-to-3D generative model that leverages a high-fidelity 3D object to overcome viewpoint bias and explicitly infuse geometric understanding into the generation pipeline. Firstly, we employ depth maps derived from a high-quality 3D model as control signals to guarantee that the generated 2D images preserve the fundamental shape and structure, thereby reducing the inherent viewpoint bias. Next, we utilize deep geometric moments to ensure geometric consistency in the 3D representation explicitly. By incorporating geometric details from a 3D asset, MT3D enables the creation of diverse and geometrically consistent objects, thereby improving the quality and usability of our 3D representations.

[CV-48] Multi-scale Contrastive Adaptor Learning for Segmenting Anything in Underperformed Scenes

Link: https://arxiv.org/abs/2408.05936
Authors: Ke Zhou, Zhongwei Qiu, Dongmei Fu
Keywords: Foundational vision models, achieved significant breakthroughs, Foundational vision, large-scale visual datasets, achieved significant
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Foundational vision models, such as the Segment Anything Model (SAM), have achieved significant breakthroughs through extensive pre-training on large-scale visual datasets. Despite their general success, these models may fall short in specialized tasks with limited data, and fine-tuning such large-scale models is often not feasible. Current strategies involve incorporating adaptors into the pre-trained SAM to facilitate downstream task performance with minimal model adjustment. However, these strategies can be hampered by suboptimal learning approaches for the adaptors. In this paper, we introduce a novel Multi-scale Contrastive Adaptor learning method named MCA-SAM, which enhances adaptor performance through a meticulously designed contrastive learning framework at both token and sample levels. Our Token-level Contrastive adaptor (TC-adaptor) focuses on refining local representations by improving the discriminability of patch tokens, while the Sample-level Contrastive adaptor (SC-adaptor) amplifies global understanding across different samples. Together, these adaptors synergistically enhance feature comparison within and across samples, bolstering the model’s representational strength and its ability to adapt to new tasks. Empirical results demonstrate that MCA-SAM sets new benchmarks, outperforming existing methods in three challenging domains: camouflage object detection, shadow segmentation, and polyp segmentation. Specifically, MCA-SAM exhibits substantial relative performance enhancements, achieving a 20.0% improvement in MAE on the COD10K dataset, a 6.0% improvement in MAE on the CAMO dataset, a 15.4% improvement in BER on the ISTD dataset, and a 7.9% improvement in mDice on the Kvasir-SEG dataset.

[CV-49] A Simple Early Exiting Framework for Accelerated Sampling in Diffusion Models ICML2024

Link: https://arxiv.org/abs/2408.05927
Authors: Taehong Moon, Moonseok Choi, EungGu Yun, Jongmin Yoon, Gayoung Lee, Jaewoong Cho, Juho Lee
Keywords: shown remarkable performance, Diffusion models, domains including images, score estimation, score estimation networks
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICML 2024

Abstract:Diffusion models have shown remarkable performance in generation problems over various domains including images, videos, text, and audio. A practical bottleneck of diffusion models is their sampling speed, due to the repeated evaluation of score estimation networks during the inference. In this work, we propose a novel framework capable of adaptively allocating compute required for the score estimation, thereby reducing the overall sampling time of diffusion models. We observe that the amount of computation required for the score estimation may vary along the time step for which the score is estimated. Based on this observation, we propose an early-exiting scheme, where we skip the subset of parameters in the score estimation network during the inference, based on a time-dependent exit schedule. Using the diffusion models for image synthesis, we show that our method could significantly improve the sampling throughput of the diffusion models without compromising image quality. Furthermore, we also demonstrate that our method seamlessly integrates with various types of solvers for faster sampling, capitalizing on their compatibility to enhance overall efficiency. The source code and our experiments are available at this https URL
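
To make the idea of a time-dependent exit schedule concrete, here is a small hypothetical sketch in Python. The abstract does not say which time steps can exit earlier or how the schedule is shaped, so the linear schedule, the direction of skipping, and the names `blocks_to_run` and `estimate_score` are all assumptions for illustration, not the authors' method.

```python
def blocks_to_run(t, num_blocks, min_blocks=2):
    """Hypothetical linear exit schedule for a normalized time step t in [0, 1].

    Runs every block at t = 0 and only `min_blocks` blocks at t = 1.
    The direction and shape of a real schedule would be chosen empirically.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return min_blocks + round((1.0 - t) * (num_blocks - min_blocks))


def estimate_score(x, t, blocks):
    """Apply only the first k blocks of the network, with k set by the schedule."""
    k = blocks_to_run(t, len(blocks))
    for block in blocks[:k]:
        x = block(x, t)
    return x
```

Because the schedule only depends on the time step, it can be computed once per sampling step and combined with any solver that calls the score network as a black box.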

[CV-50] PAFormer: Part Aware Transformer for Person Re-identification

Link: https://arxiv.org/abs/2408.05918
Authors: Hyeono Jung, Jangwon Lee, Jiwon Yoo, Dami Ko, Gyeonghwan Kim
Keywords: measure feature distances, body parts, Part Aware Transformer, body, body part
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 34 pages, 8 figures

Abstract:Within the domain of person re-identification (ReID), partial ReID methods are considered mainstream, aiming to measure feature distances through comparisons of body parts between samples. However, in practice, previous methods often lack sufficient awareness of the anatomical aspects of body parts, resulting in the failure to capture features of the same body parts across different samples. To address this issue, we introduce Part Aware Transformer (PAFormer), a pose estimation based ReID model which can perform precise part-to-part comparison. In order to inject part awareness to pose tokens, we introduce learnable parameters called "pose tokens" which estimate the correlation between each body part and partial regions of the image. Notably, at the inference phase, PAFormer operates without additional modules related to body part localization, which are commonly used in previous ReID methodologies leveraging pose estimation models. Additionally, leveraging the enhanced awareness of body parts, PAFormer suggests the use of a learning-based visibility predictor to estimate the degree of occlusion for each body part. Also, we introduce a teacher forcing technique using ground truth visibility scores which enables PAFormer to be trained only with visible parts. A set of extensive experiments show that our method outperforms existing approaches on well-known ReID benchmark datasets.

[CV-51] Deep Multimodal Collaborative Learning for Polyp Re-Identification

Link: https://arxiv.org/abs/2408.05914
Authors: Suncheng Xiang, Jincheng Li, Zhengjie Zhang, Shilun Cai, Jiale Guan, Dahong Qian
Keywords: Polyp Re-Identification aims, Colonoscopic Polyp Re-Identification, computer-aided diagnosis, aims to match, gallery with images
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Work in progress. arXiv admin note: text overlap with arXiv:2307.10625

Abstract:Colonoscopic Polyp Re-Identification aims to match the same polyp from a large gallery with images from different views taken using different cameras and plays an important role in the prevention and treatment of colorectal cancer in computer-aided diagnosis. However, traditional methods for object ReID directly adopting CNN models trained on the ImageNet dataset usually produce unsatisfactory retrieval performance on colonoscopic datasets due to the large domain gap. Worse, these solutions typically learn unimodal representations from visual samples alone, failing to exploit complementary information from different modalities. To address this challenge, we propose a novel Deep Multimodal Collaborative Learning framework named DMCL for polyp re-identification, which can effectively encourage modality collaboration and reinforce generalization capability in medical scenarios. Building on this framework, a dynamic multimodal feature fusion strategy is introduced to leverage the optimized multimodal representations for multimodal fusion via end-to-end training. Experiments on the standard benchmarks show the benefits of the multimodal setting over state-of-the-art unimodal ReID models, especially when combined with the specialized multimodal fusion strategy.

[CV-52] Weakly Supervised Video Anomaly Detection and Localization with Spatio-Temporal Prompts

[CV-53] HcNet: Image Modeling with Heat Conduction Equation

Link: https://arxiv.org/abs/2408.05901
Authors: Zhemin Zhang, Xun Gong
Keywords: heat conduction equation, heat conduction, Heat Conduction Layer, Heat Conduction Network, Foundation models
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Foundation models, such as CNNs and ViTs, have powered the development of image modeling. However, general guidance to model architecture design is still missing. The design of many modern model architectures, such as residual structures, multiplicative gating signal, and feed-forward networks, can be interpreted in terms of the heat conduction equation. This finding inspired us to model images by the heat conduction equation, where the essential idea is to conceptualize image features as temperatures and model their information interaction as the diffusion of thermal energy. We can take advantage of the rich knowledge in the heat conduction equation to guide us in designing new and more interpretable models. As an example, we propose Heat Conduction Layer and Refine Approximation Layer inspired by solving the heat conduction equation using Finite Difference Method and Fourier series, respectively. This paper does not aim to present a state-of-the-art model; instead, it seeks to integrate the overall architectural design of the model into the heat conduction theory framework. Nevertheless, our Heat Conduction Network (HcNet) still shows competitive performance. Code available at this https URL.
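
As background for the Finite Difference Method mentioned above, here is a minimal sketch of one explicit finite-difference step for the 1D heat equation, with feature values playing the role of temperatures. This is the textbook scheme, not the paper's Heat Conduction Layer; the function name `heat_step`, the insulated boundary handling, and the combined coefficient `r` (diffusivity times time step over squared grid spacing) are assumptions made for the example.

```python
def heat_step(u, r=0.25):
    """One explicit finite-difference step of the 1D heat equation.

    u: list of temperatures (or feature values) on a uniform grid.
    r: alpha * dt / dx**2; the explicit scheme is stable for 0 < r <= 0.5.
    Boundaries are insulated: each edge cell sees a copy of itself outside.
    """
    if not 0.0 < r <= 0.5:
        raise ValueError("r must be in (0, 0.5] for the explicit scheme")
    n = len(u)
    nxt = []
    for i in range(n):
        left = u[i - 1] if i > 0 else u[i]
        right = u[i + 1] if i < n - 1 else u[i]
        # Discrete Laplacian drives each value toward its neighbours' mean.
        nxt.append(u[i] + r * (left - 2.0 * u[i] + right))
    return nxt
```

Each step adds a scaled neighbourhood difference to the current value, which has the same "input plus update" shape as a residual connection. That resemblance is one way to read the abstract's remark that residual structures can be interpreted through the heat conduction equation.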

[CV-54] Classifier Guidance Enhances Diffusion-based Adversarial Purification by Preserving Predictive Information ECAI2024

Link: https://arxiv.org/abs/2408.05900
Authors: Mingkun Zhang, Jianing Li, Wei Chen, Jiafeng Guo, Xueqi Cheng
Keywords: defend neural networks, promising approaches, approaches to defend, defend neural, neural networks
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ECAI 2024

Abstract:Adversarial purification is one of the promising approaches to defend neural networks against adversarial attacks. Recently, methods utilizing diffusion probabilistic models have achieved great success for adversarial purification in image classification tasks. However, such methods fall into the dilemma of balancing the needs for noise removal and information preservation. This paper points out that existing adversarial purification methods based on diffusion models gradually lose sample information during the core denoising process, causing occasional label shift in subsequent classification tasks. As a remedy, we propose suppressing such information loss by introducing guidance from the classifier confidence. Specifically, we propose the Classifier-cOnfidence gUided Purification (COUP) algorithm, which purifies adversarial examples while keeping away from the classifier decision boundary. Experimental results show that COUP can achieve better adversarial robustness under strong attack methods.

[CV-55] GlyphPattern: An Abstract Pattern Recognition for Vision-Language Models

Link: https://arxiv.org/abs/2408.05894
Authors: Zixuan Wu, Yoolim Kim, Carolyn Jane Anderson
Keywords: made rapid progress, powerful large language, abstract pattern recognition, textual data, foundation of powerful
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

[CV-56] Polyp SAM 2: Advancing Zero shot Polyp Segmentation in Colorectal Cancer Detection

[CV-57] CMAB: A First National-Scale Multi-Attribute Building Dataset Derived from Open Source Data and GeoAI

Link: https://arxiv.org/abs/2408.05891
Authors: Yecheng Zhang, Huimin Zhao, Ying Long
Keywords: Rapidly acquiring three-dimensional, Rapidly acquiring, accurate urban analysis, including geometric attributes, acquiring three-dimensional
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 43 pages, 20 figures

Abstract:Rapidly acquiring three-dimensional (3D) building data, including geometric attributes like rooftop, height, and structure, as well as indicative attributes like function, quality, and age, is essential for accurate urban analysis, simulations, and policy updates. Existing large-scale building datasets lack accuracy, extensibility and indicative attributes. This paper presents a geospatial artificial intelligence (GeoAI) framework for large-scale building modeling, introducing the first Multi-Attribute Building dataset (CMAB) in China at a national scale. The dataset covers 3,667 natural cities with a total rooftop area of 21.3 billion square meters with an F1-Score of 89.93% in rooftop extraction through the OCRNet. We trained bootstrap aggregated XGBoost models with city administrative classifications, incorporating building features such as morphology, location, and function. Using multi-source data, including billions of high-resolution Google Earth imagery and 60 million street view images (SVI), we generated rooftop, height, function, age, and quality attributes for each building. Accuracy was validated through model benchmarks, existing similar products, and manual SVI validation. The results support urban planning and sustainable development.

[CV-58] Enhancing 3D Transformer Segmentation Model for Medical Image with Token-level Representation Learning

Link: https://arxiv.org/abs/2408.05889
Authors: Xinrong Hu, Dewen Zeng, Yawen Wu, Xueyang Li, Yiyu Shi
Keywords: find Swin Transformer, pixelwise dense prediction, high computational cost, Swin Transformer, works find Swin
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:In the field of medical images, although various works find that Swin Transformer is promisingly effective for pixelwise dense prediction, whether pre-training these models without using an extra dataset can further boost performance on downstream semantic segmentation remains unexplored. Applications of previous representation learning methods are hindered by the limited number of 3D volumes and high computational cost. In addition, most pretext tasks designed specifically for Transformers are not applicable to the hierarchical structure of Swin Transformer. Thus, this work proposes a token-level representation learning loss that maximizes agreement between token embeddings from different augmented views individually instead of volume-level global features. Moreover, we identify a potential representation collapse exclusively caused by this new loss. To prevent collapse, we invent a simple "rotate-and-restore" mechanism, which rotates and flips one augmented view of the input volume, and later restores the order of tokens in the feature maps. We also modify the contrastive loss to address the discrimination between tokens at the same position but from different volumes. We test our pre-training scheme on two public medical segmentation datasets, and the results on the downstream segmentation task show that our method yields larger improvements than other state-of-the-art pre-training methods.
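
The "rotate-and-restore" mechanism described above can be pictured with a toy example. The sketch below works on a 2D grid of tokens instead of a 3D volume and only uses a 90-degree rotation; a flip would be undone the same way. The helper names and the stand-in `encode` function are hypothetical, and this is an illustration of the token re-ordering idea only, not the authors' implementation.

```python
def rotate_ccw(grid):
    """Rotate a 2D token grid 90 degrees counterclockwise."""
    return [list(row) for row in zip(*grid)][::-1]


def rotate_cw(grid):
    """Inverse of rotate_ccw: rotate 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def encode(grid):
    """Hypothetical stand-in for an encoder that maps each token independently."""
    return [[10 * token for token in row] for row in grid]


view = [[1, 2, 3],
        [4, 5, 6]]

# Encode the view directly, and encode a rotated copy whose token order
# is restored afterwards so both feature maps line up position by position.
features_plain = encode(view)
features_restored = rotate_cw(encode(rotate_ccw(view)))
```

After restoration, the token at each grid position in `features_restored` corresponds to the same input location as in `features_plain`, which is what a token-level agreement loss between the two views would need.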

[CV-59] LaWa: Using Latent Space for In-Generation Image Watermarking

Link: https://arxiv.org/abs/2408.05868
Authors: Ahmad Rezaei, Mohammad Akbari, Saeed Ranjbar Alvar, Arezou Fatemi, Yong Zhang
Keywords: generative models producing, latent space, Latent Diffusion Models, indistinguishable from real, malicious usage
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:With generative models producing high quality images that are indistinguishable from real ones, there is growing concern regarding the malicious usage of AI-generated images. Imperceptible image watermarking is one viable solution towards such concerns. Prior watermarking methods map the image to a latent space for adding the watermark. Moreover, Latent Diffusion Models (LDM) generate the image in the latent space of a pre-trained autoencoder. We argue that this latent space can be used to integrate watermarking into the generation process. To this end, we present LaWa, an in-generation image watermarking method designed for LDMs. By using coarse-to-fine watermark embedding modules, LaWa modifies the latent space of pre-trained autoencoders and achieves high robustness against a wide range of image transformations while preserving perceptual quality of the image. We show that LaWa can also be used as a general image watermarking method. Through extensive experiments, we demonstrate that LaWa outperforms previous works in perceptual quality, robustness against attacks, and computational complexity, while having very low false positive rate. Code is available here.

[CV-60] SABER-6D: Shape Representation Based Implicit Object Pose Estimation

Link: https://arxiv.org/abs/2408.05867
Authors: Shishir Reddy Vutukur, Mengkejiergeli Ba, Benjamin Busam, Matthias Kayser, Gurprit Singh
Keywords: shape representation, encoder-decoder architecture, named SABER, learning shape representation, RGB image input
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:In this paper, we propose a novel encoder-decoder architecture, named SABER, to learn the 6D pose of the object in the embedding space by learning shape representation at a given pose. This model enables us to learn pose by performing shape representation at a target pose from RGB image input. We perform shape representation as an auxiliary task which helps us in learning rotations space for an object based on 2D images. An image encoder predicts the rotation in the embedding space and the DeepSDF based decoder learns to represent the object’s shape at the given pose. As our approach is shape based, the pipeline is suitable for any type of object irrespective of the symmetry. Moreover, we need only a CAD model of the objects to train SABER. Our pipeline is synthetic data based and can also handle symmetric objects without symmetry labels and, thus, no additional labeled training data is needed. The experimental evaluation shows that our method achieves close to benchmark results for both symmetric objects and asymmetric objects on Occlusion-LineMOD, and T-LESS datasets.

[CV-61] Real-Time Drowsiness Detection Using Eye Aspect Ratio and Facial Landmark Detection

[CV-62] Robust Domain Generalization for Multi-modal Object Recognition

[CV-63] Sampling Foundational Transformer: A Theoretical Perspective

Link: https://arxiv.org/abs/2408.05822
Authors: Viet Anh Nguyen, Minh Lenhat, Khoa Nguyen, Duong Duc Hieu, Dao Huu Hung, Truong Son Hy
Keywords: multiple data modalities, data modalities, earned transformers great, transformers great success, self-attention mechanism earned
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:The versatility of self-attention mechanism earned transformers great success in almost all data modalities, with limitations on the quadratic complexity and difficulty of training. To apply transformers across different data modalities, practitioners have to make specific clever data-modality-dependent constructions. In this paper, we propose Sampling Foundational Transformer (SFT) that can work on multiple data modalities (e.g., point cloud, graph, and sequence) and constraints (e.g., rotational-invariant). The existence of such model is important as contemporary foundational modeling requires operability on multiple data sources. For efficiency on large number of tokens, our model relies on our context aware sampling-without-replacement mechanism for both linear asymptotic computational complexity and real inference time gain. For efficiency, we rely on our newly discovered pseudoconvex formulation of transformer layer to increase model’s convergence rate. As a model working on multiple data modalities, SFT has achieved competitive results on many benchmarks, while being faster in inference, compared to other very specialized models.

[CV-64] HySparK: Hybrid Sparse Masking for Large Scale Medical Image Pre-Training MICCAI2024

Link: https://arxiv.org/abs/2408.05815
Authors: Fenghe Tang, Ronghao Xu, Qingsong Yao, Xueming Fu, Quan Quan, Heqin Zhu, Zaiyi Liu, S. Kevin Zhou
Keywords: learning representational capabilities, exhibits remarkable learning, remarkable learning representational, generative self-supervised learning, learning strategy exhibits
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Early accept at MICCAI 2024

Abstract:The generative self-supervised learning strategy exhibits remarkable learning representational capabilities. However, there is limited attention to end-to-end pre-training methods based on a hybrid architecture of CNN and Transformer, which can learn strong local and global representations simultaneously. To address this issue, we propose a generative pre-training strategy called Hybrid Sparse masKing (HySparK) based on masked image modeling and apply it to large-scale pre-training on medical images. First, we perform a bottom-up 3D hybrid masking strategy on the encoder to keep consistency masking. Then we utilize sparse convolution for the top CNNs and encode unmasked patches for the bottom vision Transformers. Second, we employ a simple hierarchical decoder with skip-connections to achieve dense multi-scale feature reconstruction. Third, we implement our pre-training method on a collection of multiple large-scale 3D medical imaging datasets. Extensive experiments indicate that our proposed pre-training strategy demonstrates robust transfer-ability in supervised downstream tasks and sheds light on HySparK’s promising prospects. The code is available at this https URL

[CV-65] Egocentric Vision Language Planning

Link: https://arxiv.org/abs/2408.05802
Authors: Zhirui Fang, Ming Yang, Weishuai Zeng, Boyu Li, Junpeng Yue, Ziluo Ding, Xiu Li, Zongqing Lu
Keywords: general embodied agent, explore leveraging large, leveraging large multi-modal, large multi-modal models, embodied agent
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:We explore leveraging large multi-modal models (LMMs) and text2image models to build a more general embodied agent. LMMs excel in planning long-horizon tasks over symbolic abstractions but struggle with grounding in the physical world, often failing to accurately identify object positions in images. A bridge is needed to connect LMMs to the physical world. The paper proposes a novel approach, egocentric vision language planning (EgoPlan), to handle long-horizon tasks from an egocentric perspective in varying household scenarios. This model leverages a diffusion model to simulate the fundamental dynamics between states and actions, integrating techniques like style transfer and optical flow to enhance generalization across different environmental dynamics. The LMM serves as a planner, breaking down instructions into sub-goals and selecting actions based on their alignment with these sub-goals, thus enabling more generalized and effective decision-making. Experiments show that EgoPlan improves long-horizon task success rates from the egocentric view compared to baselines across household scenarios.

[CV-66] CURLing the Dream: Contrastive Representations for World Modeling in Reinforcement Learning

[CV-67] U-DECN: End-to-End Underwater Object Detection ConvNet with Improved DeNoising Training

Link: https://arxiv.org/abs/2408.05780
Authors: Zhuoyan Liu, Bo Wang, Ye Li
Keywords: specific environmental challenges, underwater color cast, color cast noise, Underwater object detection, underwater color
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Underwater object detection has higher requirements of running speed and deployment efficiency for the detector due to its specific environmental challenges. NMS of two- or one-stage object detectors and transformer architecture of query-based end-to-end object detectors are not conducive to deployment on underwater embedded devices with limited processing power. As for the detrimental effect of underwater color cast noise, recent underwater object detectors make network architecture or training complex, which also hinders their application and deployment on underwater vehicle platforms. In this paper, we propose the Underwater DECO with improved deNoising training (U-DECN), the query-based end-to-end object detector (with ConvNet encoder-decoder architecture) for underwater color cast noise that addresses the above problems. We integrate advanced technologies from DETR variants into DECO and design optimization methods specifically for the ConvNet architecture, including Separate Contrastive DeNoising Forward and Deformable Convolution in SIM. To address the underwater color cast noise issue, we propose an underwater color denoising query to improve the generalization of the model for the biased object feature information by different color cast noise. Our U-DECN, with ResNet-50 backbone, achieves 61.4 AP (50 epochs), 63.3 AP (72 epochs), 64.0 AP (100 epochs) on DUO, and 21 FPS (5 times faster than Deformable DETR and DINO 4 FPS) on NVIDIA AGX Orin by TensorRT FP16, outperforming the other state-of-the-art query-based end-to-end object detectors. The code is available at this https URL.

[CV-68] Seg-CycleGAN: SAR-to-optical image translation guided by a downstream task

[CV-69] Efficient Test-Time Prompt Tuning for Vision-Language Models

Link: https://arxiv.org/abs/2408.05775
Authors: Yuhan Zhu, Guozhen Zhang, Chen Xu, Haocheng Shen, Xiaoxin Chen, Gangshan Wu, Limin Wang
Keywords: test-time prompt tuning, suitable text prompts, Vision-language models, prompt tuning, showcased impressive zero-shot
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Vision-language models have showcased impressive zero-shot classification capabilities when equipped with suitable text prompts. Previous studies have shown the effectiveness of test-time prompt tuning; however, these methods typically require per-image prompt adaptation during inference, which incurs high computational budgets and limits scalability and practical deployment. To overcome this issue, we introduce Self-TPT, a novel framework leveraging Self-supervised learning for efficient Test-time Prompt Tuning. The key aspect of Self-TPT is that it turns to efficient predefined class adaptation via self-supervised learning, thus avoiding computation-heavy per-image adaptation at inference. Self-TPT begins by co-training the self-supervised and the classification task using source data, then applies the self-supervised task exclusively for test-time new class adaptation. Specifically, we propose Contrastive Prompt Learning (CPT) as the key task for self-supervision. CPT is designed to minimize the intra-class distances while enhancing inter-class distinguishability via contrastive learning. Furthermore, empirical evidence suggests that CPT could closely mimic back-propagated gradients of the classification task, offering a plausible explanation for its effectiveness. Motivated by this finding, we further introduce a gradient matching loss to explicitly enhance the gradient similarity. We evaluated Self-TPT across three challenging zero-shot benchmarks. The results consistently demonstrate that Self-TPT not only significantly reduces inference costs but also achieves state-of-the-art performance, effectively balancing the efficiency-efficacy trade-off.

[CV-70] An analysis of HOI: using a training-free method with multimodal visual foundation models when only the test set is available without the training set

[CV-71] PRECISe: Prototype-Reservation for Explainable Classification under Imbalanced and Scarce-Data Settings

Link: https://arxiv.org/abs/2408.05754
Authors: Vaibhav Ganatra, Drishti Goel
Keywords: severe class imbalance, Deep learning models, Deep learning, class imbalance, limited amount
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Deep learning models used for medical image classification tasks are often constrained by the limited amount of training data along with severe class imbalance. Despite these problems, models should be explainable to enable human trust in the models’ decisions to ensure wider adoption in high-risk situations. In this paper, we propose PRECISe, an explainable-by-design model meticulously constructed to concurrently address all three challenges. Evaluation on 2 imbalanced medical image datasets reveals that PRECISe outperforms the current state-of-the-art methods on data efficient generalization to minority classes, achieving an accuracy of ~87% in detecting pneumonia in chest x-rays upon training on 60 images only. Additionally, a case study is presented to highlight the model’s ability to produce easily interpretable predictions, reinforcing its practical utility and reliability for medical imaging tasks.

[CV-72] RTF-Q: Unsupervised domain adaptation based retraining-free quantization network

Link: https://arxiv.org/abs/2408.05752
Authors: Nanyang Du, Chen Tang, Yuan Meng, Zhi Wang
Keywords: Performing unsupervised domain, resource-constrained edge devices, Performing unsupervised, edge devices, unsupervised domain adaptation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Performing unsupervised domain adaptation on resource-constrained edge devices is a significant task. Although existing research allows edge devices to use subnets with different computational budgets for inference, they often require expensive pre-training and do not consider the issues of parameter precision redundancy in the model, which is not conducive to the deployment of the model on edge devices. In this paper, we introduce a ReTraining-Free Quantized (RTF-Q) network based on unsupervised domain adaptation, featuring quantized subnets of varying computational costs that can operate on devices with dynamically changing computation budgets. Our network has three switchable dimensions: width (number of channels), input resolution, and quantization bit-width. Specifically, we choose subnet dimensions that have minimal impact on network performance and then directly load the official weight files without requiring expensive and time-consuming pre-training on Imagenet-1K. To further reduce the network’s computational load and memory usage, we use quantization-aware training, reducing the BitOPs of full-precision networks by at least 1/16. We propose a training method called SandwichQ for multiple quantization bit widths, which can efficiently train multiple quantization subnets. By training in multiple quantization bit-width spaces simultaneously and using the proposed SandwichQ rule, we achieve better network performance compared to using a single quantization bit-width alone. Experimental results show that our method achieves classification accuracy comparable to SOTA methods on various UDA tasks, significantly reducing network size and computational overhead. Code will be available at this https URL.
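
The 1/16 figure in the abstract above can be illustrated with a rough calculation. This is only a sketch under a commonly used convention, not the paper's own accounting: it assumes BitOPs are approximated as multiply-accumulate count times weight bits times activation bits, and it assumes 8-bit weights and activations as the widest quantized setting. The function name `bitops` and the MAC count are hypothetical.

```python
def bitops(macs, weight_bits, activation_bits):
    """Approximate BitOPs as MACs x weight bits x activation bits (assumed convention)."""
    return macs * weight_bits * activation_bits


# Hypothetical layer with one million multiply-accumulate operations.
MACS = 1_000_000

full_precision = bitops(MACS, 32, 32)  # 32-bit weights and activations
quantized = bitops(MACS, 8, 8)         # assumed 8-bit weights and activations

# Under these assumptions the quantized cost is 1/16 of the full-precision cost.
ratio = quantized / full_precision
print(ratio)
```

Narrower bit-widths shrink the ratio further under the same convention, which is consistent with the abstract's "at least" wording.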

[CV-73] Advancing Re-Ranking with Multimodal Fusion and Target-Oriented Auxiliary Tasks in E-Commerce Search

Link: https://arxiv.org/abs/2408.05751
Authors: Enqiang Xu, Xinhui Li, Zhigong Zhou, Jiahao Ji, Jinyuan Zhao, Dadong Miao, Songlin Wang, Lin Liu, Sulong Xu
Keywords: rapidly evolving field, driving conversion rates, enhancing user experience, rapidly evolving, evolving field
Categories: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:In the rapidly evolving field of e-commerce, the effectiveness of search re-ranking models is crucial for enhancing user experience and driving conversion rates. Despite significant advancements in feature representation and model architecture, the integration of multimodal information remains underexplored. This study addresses this gap by investigating the computation and fusion of textual and visual information in the context of re-ranking. We propose Advancing Re-Ranking with Multimodal Fusion and Target-Oriented Auxiliary Tasks (ARMMT), which integrates an attention-based multimodal fusion technique and an auxiliary ranking-aligned task to enhance item representation and improve targeting capabilities. This method not only enriches the understanding of product attributes but also enables more precise and personalized recommendations. Experimental evaluations on this http URL’s search platform demonstrate that ARMMT achieves state-of-the-art performance in multimodal information integration, evidenced by a 0.22% increase in the Conversion Rate (CVR), significantly contributing to Gross Merchandise Volume (GMV). This pioneering approach has the potential to revolutionize e-commerce re-ranking, leading to elevated user satisfaction and business growth.

[CV-74] FADE: A Dataset for Detecting Falling Objects around Buildings in Video

Link: https://arxiv.org/abs/2408.05750
Authors: Zhigang Tu, Zitao Gao, Zhengbo Zhang, Chunluan Zhou, Junsong Yuan, Bo Du
Keywords: great impact force, Falling objects, falling object detection, object detection, Falling
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: 11 pages, 10 figures

Abstract:Falling objects from buildings can cause severe injuries to pedestrians due to the great impact force they exert. Although surveillance cameras are installed around some buildings, it is challenging for humans to capture such events in surveillance videos due to the small size and fast motion of falling objects, as well as the complex background. Therefore, it is necessary to develop methods to automatically detect falling objects around buildings in surveillance videos. To facilitate the investigation of falling object detection, we propose a large, diverse video dataset called FADE (FAlling Object DEtection around Buildings) for the first time. FADE contains 1,881 videos from 18 scenes, featuring 8 falling object categories, 4 weather conditions, and 4 video resolutions. Additionally, we develop a new object detection method called FADE-Net, which effectively leverages motion information and produces small-sized but high-quality proposals for detecting falling objects around buildings. Importantly, our method is extensively evaluated and analyzed by comparing it with the previous approaches used for generic object detection, video object detection, and moving object detection on the FADE dataset. Experimental results show that the proposed FADE-Net significantly outperforms other methods, providing an effective baseline for future research. The dataset and code are publicly available at this https URL.

[CV-75] Efficient and Versatile Robust Fine-Tuning of Zero-shot Models ECCV2024

Link: https://arxiv.org/abs/2408.05749
Authors: Sungyeon Kim, Boseung Jeong, Donghyun Kim, Suha Kwak
Keywords: provide consistent accuracy, Large-scale image-text pre-trained, Large-scale image-text, provide consistent, consistent accuracy
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: Accepted to ECCV 2024

Abstract:Large-scale image-text pre-trained models enable zero-shot classification and provide consistent accuracy across various data distributions. Nonetheless, optimizing these models in downstream tasks typically requires fine-tuning, which reduces generalization to out-of-distribution (OOD) data and demands extensive computational resources. We introduce Robust Adapter (R-Adapter), a novel method for fine-tuning zero-shot models to downstream tasks while simultaneously addressing both these issues. Our method integrates lightweight modules into the pre-trained model and employs novel self-ensemble techniques to boost OOD robustness and reduce storage expenses substantially. Furthermore, we propose MPM-NCE loss designed for fine-tuning on vision-language downstream tasks. It ensures precise alignment of multiple image-text pairs and discriminative feature learning. By extending the benchmark for robust fine-tuning beyond classification to include diverse tasks such as cross-modal retrieval and open vocabulary segmentation, we demonstrate the broad applicability of R-Adapter. Our extensive experiments demonstrate that R-Adapter achieves state-of-the-art performance across a diverse set of tasks, tuning only 13% of the parameters of the CLIP encoders.

[CV-76] Improving Adversarial Transferability with Neighbourhood Gradient Information

Link: https://arxiv.org/abs/2408.05745
Authors: Haijing Guo, Jiafeng Wang, Zhaoyu Chen, Kaixun Jiang, Lingyi Hong, Pinxue Guo, Jinglun Li, Wenqiang Zhang
Keywords: Deep neural networks, Deep neural, significant performance degradation, Neighbourhood Gradient Information, gradient information
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Notes:

Abstract:Deep neural networks (DNNs) are known to be susceptible to adversarial examples, leading to significant performance degradation. In black-box attack scenarios, a considerable attack performance gap between the surrogate model and the target model persists. This work focuses on enhancing the transferability of adversarial examples to narrow this performance gap. We observe that the gradient information around the clean image, i.e., Neighbourhood Gradient Information, can offer high transferability. Leveraging this, we propose the NGI-Attack, which incorporates Example Backtracking and Multiplex Mask strategies to make full use of this gradient information and enhance transferability. Specifically, we first adopt Example Backtracking to accumulate Neighbourhood Gradient Information as the initial momentum term. Multiplex Mask, which forms a multi-way attack strategy, then forces the network to focus on non-discriminative regions, which yields richer gradient information within only a few iterations. Extensive experiments demonstrate that our approach significantly enhances adversarial transferability. In particular, when attacking numerous defense models, we achieve an average attack success rate of 95.8%. Notably, our method can be plugged into any off-the-shelf algorithm to improve its attack performance without additional time cost.

[CV-77] Neural Architecture Search based Global-local Vision Mamba for Palm-Vein Recognition

Link: https://arxiv.org/abs/2408.05743
Authors: Huafeng Qin, Yuming Fu, Jing Chen, Mounim A. El-Yacoubi, Xinbo Gao, Jun Wang
Keywords: Feature Iteration Unit, high security, high privacy, past years, Mamba
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Owing to advantages such as high security, high privacy, and liveness recognition, vein recognition has received more and more attention in recent years. Recently, deep learning models such as Mamba have shown robust feature representation with linear computational complexity and have been successfully applied to visual tasks. However, while vision Mamba can capture long-distance feature dependencies, it unfortunately degrades local feature details. Besides, manually designing a Mamba architecture based on human prior knowledge is very time-consuming and error-prone. In this paper, first, we propose a hybrid network structure named Global-local Vision Mamba (GLVM) to explicitly learn the local correlations in images and the global dependencies among tokens for vein feature representation. Secondly, we design a Multi-head Mamba to learn the dependencies along different directions, so as to improve the feature representation ability of vision Mamba. Thirdly, to learn complementary features, we propose a ConvMamba block consisting of three branches, named the Multi-head Mamba branch (MHMamba), the Feature Iteration Unit branch (FIU), and the Convolutional Neural Network (CNN) branch, where the Feature Iteration Unit branch aims to fuse convolutional local features with Mamba-based global representations. Finally, a Global-local Alternate Neural Architecture Search (GLNAS) method is proposed to search the optimal architecture of GLVM alternately with an evolutionary algorithm, thereby improving performance on vein recognition tasks. We conduct rigorous experiments on three public palm-vein databases to evaluate the performance. The experimental results demonstrate that the proposed method outperforms representative approaches and achieves state-of-the-art recognition accuracy.

[CV-78] A Training-Free Framework for Video License Plate Tracking and Recognition with Only One-Shot

Link: https://arxiv.org/abs/2408.05729
Authors: Haoxuan Ding, Qi Wang, Junyu Gao, Qiang Li
Keywords: license plate, license plate detection, license plate formats, trained on closed, license
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Traditional license plate detection and recognition models are often trained on closed datasets, limiting their ability to handle the diverse license plate formats across different regions. The emergence of large-scale pre-trained models has shown exceptional generalization capabilities, enabling few-shot and zero-shot learning. We propose OneShotLP, a training-free framework for video-based license plate detection and recognition, leveraging these advanced models. Starting with the license plate position in the first video frame, our method tracks this position across subsequent frames using a point tracking module, creating a trajectory of prompts. These prompts are input into a segmentation module that uses a promptable large segmentation model to generate local masks of the license plate regions. The segmented areas are then processed by multimodal large language models (MLLMs) for accurate license plate recognition. OneShotLP offers significant advantages, including the ability to function effectively without extensive training data and adaptability to various license plate styles. Experimental results on UFPR-ALPR and SSIG-SegPlate datasets demonstrate the superior accuracy of our approach compared to traditional methods. This highlights the potential of leveraging pre-trained models for diverse real-world applications in intelligent transportation systems. The code is available at this https URL.

[CV-79] Deep Learning with Data Privacy via Residual Perturbation

Link: https://arxiv.org/abs/2408.05723
Authors: Wenqi Tao, Huaming Ling, Zuoqiang Shi, Bao Wang
Keywords: Protecting data privacy, Protecting data, deep learning, crucial importance, Protecting
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Protecting data privacy in deep learning (DL) is of crucial importance. Several celebrated privacy notions have been established and used for privacy-preserving DL. However, many existing mechanisms achieve privacy at the cost of significant utility degradation and computational overhead. In this paper, we propose a stochastic differential equation-based residual perturbation for privacy-preserving DL, which injects Gaussian noise into each residual mapping of ResNets. Theoretically, we prove that residual perturbation guarantees differential privacy (DP) and reduces the generalization gap of DL. Empirically, we show that residual perturbation is computationally efficient and outperforms the state-of-the-art differentially private stochastic gradient descent (DPSGD) in utility maintenance without sacrificing membership privacy.
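
The idea of injecting Gaussian noise into a residual mapping can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the update x + F(x) + noise_scale * n with n drawn from a standard normal, the exact placement of the noise, and the names `perturbed_residual_step`, `toy_residual`, and `noise_scale` are all assumptions made for the example.

```python
import random


def perturbed_residual_step(x, residual_fn, noise_scale, rng):
    """One residual update with additive Gaussian noise: x + F(x) + noise_scale * n."""
    fx = residual_fn(x)
    return [xi + fi + noise_scale * rng.gauss(0.0, 1.0) for xi, fi in zip(x, fx)]


def toy_residual(x):
    # Hypothetical stand-in for a learned residual mapping F(x).
    return [0.5 * xi for xi in x]


rng = random.Random(0)  # seeded so the sketch is reproducible
x = [1.0, -2.0, 3.0]

clean = perturbed_residual_step(x, toy_residual, 0.0, rng)  # no noise: plain ResNet step
noisy = perturbed_residual_step(x, toy_residual, 0.1, rng)  # perturbed step
print(clean, noisy)
```

With `noise_scale` set to zero the step reduces to the ordinary residual update, which makes the perturbation easy to switch off when comparing utility.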

[CV-80] Deformable Image Registration with Multi-scale Feature Fusion from Shared Encoder Auxiliary and Pyramid Decoders

[CV-81] SSL: A Self-similarity Loss for Improving Generative Image Super-resolution ACM-MM2024

Link: https://arxiv.org/abs/2408.05713
Authors: Du Chen, Zhengqiang Zhang, Jie Liang, Lei Zhang
Keywords: Generative adversarial networks, real-world image super-resolution, image perceptual quality, generative diffusion models, generative Real-ISR models
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Accepted by ACM MM 2024

Abstract:Generative adversarial networks (GAN) and generative diffusion models (DM) have been widely used in real-world image super-resolution (Real-ISR) to enhance the image perceptual quality. However, these generative models are prone to generating visual artifacts and false image structures, resulting in unnatural Real-ISR results. Based on the fact that natural images exhibit high self-similarities, i.e., a local patch can have many similar patches to it in the whole image, in this work we propose a simple yet effective self-similarity loss (SSL) to improve the performance of generative Real-ISR models, enhancing the hallucination of structural and textural details while reducing the unpleasant visual artifacts. Specifically, we compute a self-similarity graph (SSG) of the ground-truth image, and enforce the SSG of Real-ISR output to be close to it. To reduce the training cost and focus on edge areas, we generate an edge mask from the ground-truth image, and compute the SSG only on the masked pixels. The proposed SSL serves as a general plug-and-play penalty, which could be easily applied to the off-the-shelf Real-ISR models. Our experiments demonstrate that, by coupling with SSL, the performance of many state-of-the-art Real-ISR models, including those GAN and DM based ones, can be largely improved, reproducing more perceptually realistic image details and eliminating many false reconstructions and visual artifacts. Codes and supplementary material can be found at this https URL
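
The self-similarity penalty described above can be sketched on a tiny grayscale image. This is a simplified illustration under several assumptions that are not taken from the paper: similarity is computed between single pixels rather than patches, it is normalised with a softmax over a small search window, and the two graphs are compared with an L1 distance. The names `self_similarity`, `ssl_loss`, `radius`, and `scale` are hypothetical.

```python
import math


def self_similarity(img, y, x, radius, scale):
    """Softmax-normalised similarity between pixel (y, x) and its window neighbours."""
    h, w = len(img), len(img[0])
    logits = []
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            ny, nx = y + dy, x + dx
            if (dy or dx) and 0 <= ny < h and 0 <= nx < w:
                logits.append(-scale * (img[y][x] - img[ny][nx]) ** 2)
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ssl_loss(output, target, mask, radius=1, scale=10.0):
    """Mean L1 gap between the similarity rows of output and target on masked pixels."""
    total, count = 0.0, 0
    for y, row in enumerate(mask):
        for x, keep in enumerate(row):
            if keep:
                p = self_similarity(target, y, x, radius, scale)
                q = self_similarity(output, y, x, radius, scale)
                total += sum(abs(a - b) for a, b in zip(p, q))
                count += 1
    return total / count if count else 0.0


target = [[0.0, 0.0, 1.0]] * 3   # sharp vertical edge
blurred = [[0.0, 0.5, 1.0]] * 3  # the same edge, smeared
edge_mask = [[0, 1, 1]] * 3      # hand-written stand-in for an edge mask

print(ssl_loss(target, target, edge_mask), ssl_loss(blurred, target, edge_mask))
```

Restricting the sum to masked pixels mirrors the abstract's point that the graph is computed only on edge areas to keep the cost down.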

[CV-82] Contrastive masked auto-encoders based self-supervised hashing for 2D image and 3D point cloud cross-modal retrieval ICME2024

Link: https://arxiv.org/abs/2408.05711
Authors: Rukai Wei, Heng Cui, Yu Liu, Yufeng Hou, Yanzhao Xie, Ke Zhou
Keywords: Implementing cross-modal hashing, real-world retrieval systems, Implementing cross-modal, growing concern, concern in real-world
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Accepted by ICME 2024

Abstract:Implementing cross-modal hashing between 2D images and 3D point-cloud data is a growing concern in real-world retrieval systems. Simply applying existing cross-modal approaches to this new task fails to adequately capture latent multi-modal semantics and effectively bridge the modality gap between 2D and 3D. To address these issues without relying on hand-crafted labels, we propose contrastive masked autoencoders based self-supervised hashing (CMAH) for retrieval between images and point-cloud data. We start by contrasting 2D-3D pairs and explicitly constraining them into a joint Hamming space. This contrastive learning process ensures robust discriminability for the generated hash codes and effectively reduces the modality gap. Moreover, we utilize multi-modal auto-encoders to enhance the model’s understanding of multi-modal semantics. By completing the masked image/point-cloud data modeling task, the model is encouraged to capture more localized clues. In addition, the proposed multi-modal fusion block facilitates fine-grained interactions among different modalities. Extensive experiments on three public datasets demonstrate that the proposed CMAH significantly outperforms all baseline methods.

[CV-83] Efficient Diffusion Transformer with Step-wise Dynamic Attention Mediators ECCV2024

Link: https://arxiv.org/abs/2408.05710
Authors: Yifan Pu, Zhuofan Xia, Jiayi Guo, Dongchen Han, Qixiu Li, Duo Li, Yuhui Yuan, Ji Li, Yizeng Han, Shiji Song, Gao Huang, Xiu Li
Keywords: paper identifies significant, identifies significant redundancy, denoising diffusion steps, diffusion transformer models, diffusion transformer framework
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: ECCV 2024

Abstract:This paper identifies significant redundancy in the query-key interactions within self-attention mechanisms of diffusion transformer models, particularly during the early stages of denoising diffusion steps. In response to this observation, we present a novel diffusion transformer framework incorporating an additional set of mediator tokens to engage with queries and keys separately. By modulating the number of mediator tokens during the denoising generation phases, our model initiates the denoising process with a precise, non-ambiguous stage and gradually transitions to a phase enriched with detail. Concurrently, integrating mediator tokens simplifies the attention module’s complexity to a linear scale, enhancing the efficiency of global attention processes. Additionally, we propose a time-step dynamic mediator token adjustment mechanism that further decreases the required computational FLOPs for generation, simultaneously facilitating the generation of high-quality images within the constraints of varied inference budgets. Extensive experiments demonstrate that the proposed method can improve the generated image quality while also reducing the inference cost of diffusion transformers. When integrated with the recent work SiT, our method achieves a state-of-the-art FID score of 2.01. The source code is available at this https URL.
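
One way to see why mediator tokens make attention linear in the number of tokens is the two-stage sketch below. It is an illustration under assumptions rather than the paper's exact formulation: mediators first attend to the keys to summarise the values, then queries attend to the mediators. How the mediators are produced and how their number changes across denoising steps is not modelled, and all function names are hypothetical.

```python
import math


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def mediator_attention(q, k, v, mediators):
    """Two-stage attention through m mediator tokens; cost grows as n * m * d."""
    scale = 1.0 / math.sqrt(len(q[0]))
    # Stage 1: each mediator attends to all keys and summarises the values (m x d).
    med_to_key = [softmax([scale * s for s in row])
                  for row in matmul(mediators, transpose(k))]
    med_values = matmul(med_to_key, v)
    # Stage 2: each query attends to the mediators only (n x m), giving n x d.
    query_to_med = [softmax([scale * s for s in row])
                    for row in matmul(q, transpose(mediators))]
    return matmul(query_to_med, med_values)


queries = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5], [-0.2, 0.4]]
keys = [[0.2, 0.1], [-0.3, 0.2], [0.4, 0.0], [0.1, -0.5]]
values = [[1.0, 2.0]] * 4               # identical rows make the result easy to check
mediators = [[0.5, -0.5], [-0.1, 0.3]]  # m = 2 hypothetical mediator tokens

out = mediator_attention(queries, keys, values, mediators)
print(out)
```

Neither stage forms the full n-by-n query-key matrix, so with a fixed number of mediators the cost scales linearly with the token count.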

[CV-84] Decoder Pre-Training with only Text for Scene Text Recognition ACM-MM2024

Link: https://arxiv.org/abs/2408.05706
Authors: Shuai Zhao, Yongkun Du, Zhineng Chen, Yu-Gang Jiang
Keywords: achieved remarkable progress, primarily relying, text, STR, real images
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ACM MM 2024

Abstract:Scene text recognition (STR) pre-training methods have achieved remarkable progress, primarily relying on synthetic datasets. However, the domain gap between synthetic and real images poses a challenge in acquiring feature representations that align well with images on real scenes, thereby limiting the performance of these methods. We note that vision-language models like CLIP, pre-trained on extensive real image-text pairs, effectively align images and text in a unified embedding space, suggesting the potential to derive the representations of real images from text alone. Building upon this premise, we introduce a novel method named Decoder Pre-training with only text for STR (DPTR). DPTR treats text embeddings produced by the CLIP text encoder as pseudo visual embeddings and uses them to pre-train the decoder. An Offline Randomized Perturbation (ORP) strategy is introduced. It enriches the diversity of text embeddings by incorporating natural image embeddings extracted from the CLIP image encoder, effectively directing the decoder to acquire the potential representations of real images. In addition, we introduce a Feature Merge Unit (FMU) that guides the extracted visual embeddings to focus on the character foreground within the text image, thereby enabling the pre-trained decoder to work more efficiently and accurately. Extensive experiments across various STR decoders and language recognition tasks underscore the broad applicability and remarkable performance of DPTR, providing a novel insight for STR pre-training. Code is available at this https URL

[CV-85] MacFormer: Semantic Segmentation with Fine Object Boundaries

Link: https://arxiv.org/abs/2408.05699
Authors: Guoan Xu, Wenfeng Huang, Tao Wu, Ligeng Chen, Wenjing Jia, Guangwei Gao, Xiatian Zhu, Stuart Perry
Keywords: segmentation involves assigning, Semantic segmentation involves, Semantic segmentation, Vision Transformer-based models, involves assigning
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 7 figures, submitted to TIP

Abstract:Semantic segmentation involves assigning a specific category to each pixel in an image. While Vision Transformer-based models have made significant progress, current semantic segmentation methods often struggle with precise predictions in localized areas like object boundaries. To tackle this challenge, we introduce a new semantic segmentation architecture, "MacFormer", which features two key components. Firstly, using learnable agent tokens, a Mutual Agent Cross-Attention (MACA) mechanism effectively facilitates the bidirectional integration of features across encoder and decoder layers. This enables better preservation of low-level features, such as elementary edges, during decoding. Secondly, a Frequency Enhancement Module (FEM) in the decoder leverages high-frequency and low-frequency components to boost features in the frequency domain, benefiting object boundaries with minimal computational complexity increase. MacFormer is demonstrated to be compatible with various network architectures and outperforms existing methods in both accuracy and efficiency on benchmark datasets ADE20K and Cityscapes under different computational constraints.

[CV-86] A Novel Momentum-Based Deep Learning Techniques for Medical Image Classification and Segmentation

[CV-87] Single Image Dehazing Using Scene Depth Ordering

Link: https://arxiv.org/abs/2408.05683
Authors: Pengyang Ling, Huaian Chen, Xiao Tan, Yimeng Shan, Yi Jin
Keywords: depth order, single image dehazing, hazy images, weather generally suffer, image dehazing problem
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: 14 pages, 15 figures

Abstract:Images captured in hazy weather generally suffer from quality degradation, and many dehazing methods have been developed to solve this problem. However, the single image dehazing problem remains challenging due to its ill-posed nature. In this paper, we propose a depth order guided single image dehazing method, which utilizes depth order in hazy images to guide the dehazing process to achieve a similar depth perception in corresponding dehazing results. The consistency of depth perception ensures that the regions that look farther or closer in hazy images also appear farther or closer in the corresponding dehazing results, and thus effectively avoids undesired visual effects. To achieve this goal, a simple yet effective strategy is proposed to extract the depth order in hazy images, which offers a reference for depth perception in hazy weather. Additionally, a depth order embedded transformation model is devised, which performs transmission estimation under the guidance of depth order to realize an unchanged depth order in the dehazing results. The extracted depth order provides a powerful global constraint for the dehazing process, which contributes to the efficient utilization of global information, thereby bringing an overall improvement in restoration quality. Extensive experiments demonstrate that the proposed method can better recover potential structure and vivid color with higher computational efficiency than the state-of-the-art dehazing methods.

[CV-88] PS-TTL: Prototype-based Soft-labels and Test-Time Learning for Few-shot Object Detection ACM-MM2024

Link: https://arxiv.org/abs/2408.05674
Authors: Yingjie Gao, Yanan Zhang, Ziyue Huang, Nanqing Liu, Di Huang
Keywords: Few-Shot Object Detection, Object Detection, gained widespread attention, made significant progress, significant progress due
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ACM MM 2024

Abstract:In recent years, Few-Shot Object Detection (FSOD) has gained widespread attention and made significant progress due to its ability to build models with good generalization power using extremely limited annotated data. The fine-tuning based paradigm currently dominates this field: detectors are first pre-trained on base classes with sufficient samples and then fine-tuned on novel ones with few samples. However, the scarcity of labeled samples of novel classes greatly interferes with precisely fitting their data distribution, thus hampering performance. To address this issue, we propose a new framework for FSOD, namely Prototype-based Soft-labels and Test-Time Learning (PS-TTL). Specifically, we design a Test-Time Learning (TTL) module that employs a mean-teacher network for self-training to discover novel instances from test data, allowing detectors to learn better representations and classifiers for novel classes. Furthermore, we notice that even though relatively low-confidence pseudo-labels exhibit classification confusion, they still tend to recall foreground. We thus develop a Prototype-based Soft-labels (PS) strategy that assesses similarities between low-confidence pseudo-labels and category prototypes and uses them as soft-labels to unleash their potential, which substantially mitigates the constraints posed by few-shot samples. Extensive experiments on both the VOC and COCO benchmarks show that PS-TTL achieves state-of-the-art performance, highlighting its effectiveness. The code and model are available at this https URL.
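
The two ingredients named in this abstract, a mean-teacher network and similarity-based soft labels, can be sketched in a few lines. This is only an illustration under stated assumptions: the abstract does not give the similarity measure, the normalization, or the update rule, so cosine similarity, a softmax with a hypothetical `temperature`, and an exponential moving average with a hypothetical `momentum` are stand-ins chosen here.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("vectors must be non-zero")
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def soft_label(feature: Sequence[float],
               prototypes: Dict[str, Sequence[float]],
               temperature: float = 0.1) -> Dict[str, float]:
    """Turn similarities to class prototypes into a probability-like soft label.

    The softmax and the temperature value are assumptions for illustration.
    """
    scores = {c: cosine_similarity(feature, p) / temperature
              for c, p in prototypes.items()}
    peak = max(scores.values())
    exps = {c: math.exp(s - peak) for c, s in scores.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}


def ema_update(teacher: List[float], student: List[float],
               momentum: float = 0.99) -> List[float]:
    """Mean-teacher style update: the teacher slowly tracks the student."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]
```

A low-confidence detection whose feature lies between two prototypes receives a label spread over both classes rather than a single hard class, which is the intuition behind using soft labels for ambiguous pseudo-labels.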

[CV-89] StealthDiffusion: Towards Evading Diffusion Forensic Detection through Diffusion Model

[CV-90] Performance Evaluation of YOLOv8 Model Configurations for Instance Segmentation of Strawberry Fruit Development Stages in an Open Field Environment

Link: https://arxiv.org/abs/2408.05661
Authors: Abdul-Razak Alhassan Gamani, Ibrahim Arhin, Adrena Kyeremateng Asamoah
Keywords: optimizing yield management, making informed decisions, informed decisions related, Accurate identification, milliseconds
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 18 figures

Abstract:Accurate identification of strawberries during their maturing stages is crucial for optimizing yield management and pest control, and for making informed decisions related to harvest and post-harvest logistics. This study evaluates the performance of YOLOv8 model configurations for instance segmentation of strawberries into ripe and unripe stages in an open field environment. The YOLOv8n model demonstrated superior segmentation accuracy with a mean Average Precision (mAP) of 80.9%, outperforming other YOLOv8 configurations. In terms of inference speed, YOLOv8n processed images at 12.9 milliseconds, while YOLOv8s, the least-performing model, processed at 22.2 milliseconds. Over 86 test images with 348 ground truth labels, YOLOv8n detected 235 ripe fruit classes and 51 unripe fruit classes out of 251 ground truth ripe fruits and 97 unripe ground truth labels, respectively. In comparison, YOLOv8s detected 204 ripe fruits and 37 unripe fruits. Overall, YOLOv8n achieved the fastest inference speed of 24.2 milliseconds, outperforming YOLOv8s, YOLOv8m, YOLOv8l, and YOLOv8x, which processed images at 33.0 milliseconds, 44.3 milliseconds, 53.6 milliseconds, and 62.5 milliseconds, respectively. These results underscore the potential of advanced object segmentation algorithms to address complex visual recognition tasks in open-field agriculture effectively.
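
The detection counts quoted in this abstract can be turned into per-class detection rates with simple arithmetic. The sketch below assumes that every reported detection corresponds to a correct match with a ground-truth label, which the abstract does not state; under that assumption the ratios can be read as recall. The function and variable names are illustrative.

```python
from typing import Dict

# Counts quoted in the abstract (86 test images, 348 ground-truth labels).
GROUND_TRUTH: Dict[str, int] = {"ripe": 251, "unripe": 97}
DETECTED: Dict[str, Dict[str, int]] = {
    "YOLOv8n": {"ripe": 235, "unripe": 51},
    "YOLOv8s": {"ripe": 204, "unripe": 37},
}


def recall(detected: int, ground_truth: int) -> float:
    """Fraction of ground-truth instances that were detected."""
    if ground_truth <= 0:
        raise ValueError("ground_truth must be positive")
    return detected / ground_truth


def summarize(model: str) -> Dict[str, float]:
    """Per-class and overall detection rate for one model.

    Assumes each detection is a true positive, which is not confirmed
    by the abstract.
    """
    counts = DETECTED[model]
    rates = {cls: recall(counts[cls], GROUND_TRUTH[cls]) for cls in GROUND_TRUTH}
    rates["overall"] = recall(sum(counts.values()), sum(GROUND_TRUTH.values()))
    return rates


if __name__ == "__main__":
    for name in DETECTED:
        rates = summarize(name)
        print(name, {cls: round(value, 3) for cls, value in rates.items()})
```

Under this reading, both models find ripe fruit far more reliably than unripe fruit, and the gap between YOLOv8n and YOLOv8s is larger for the unripe class.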

[CV-91] Advancing Pavement Distress Detection in Developing Countries: A Novel Deep Learning Approach with Locally-Collected Datasets

Link: https://arxiv.org/abs/2408.05649
Authors: Blessing Agyei Kyem, Eugene Kofi Okrah Denteh, Joshua Kofi Asamoah, Kenneth Adomako Tutu, Armstrong Aboah
Keywords: diverse environmental factors, Block Attention Module, Convolutional Block Attention, environmental factors, due to resource
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Road infrastructure maintenance in developing countries faces unique challenges due to resource constraints and diverse environmental factors. This study addresses the critical need for efficient, accurate, and locally-relevant pavement distress detection methods in these regions. We present a novel deep learning approach combining YOLO (You Only Look Once) object detection models with a Convolutional Block Attention Module (CBAM) to simultaneously detect and classify multiple pavement distress types. The model demonstrates robust performance in detecting and classifying potholes, longitudinal cracks, alligator cracks, and raveling, with confidence scores ranging from 0.46 to 0.93. While some misclassifications occur in complex scenarios, these provide insights into unique challenges of pavement assessment in developing countries. Additionally, we developed a web-based application for real-time distress detection from images and videos. This research advances automated pavement distress detection and provides a tailored solution for developing countries, potentially improving road safety, optimizing maintenance strategies, and contributing to sustainable transportation infrastructure development.

[CV-92] Visual SLAM with 3D Gaussian Primitives and Depth Priors Enabling Novel View Synthesis

Link: https://arxiv.org/abs/2408.05635
Authors: Zhongche Qu, Zhi Zhang, Cong Liu, Jianhua Yin
Keywords: Conventional geometry-based SLAM, geometry-based SLAM systems, SLAM systems lack, Conventional geometry-based, Gaussian Splatting
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Abstract:Conventional geometry-based SLAM systems lack dense 3D reconstruction capabilities since their data association usually relies on feature correspondences. Additionally, learning-based SLAM systems often fall short in terms of real-time performance and accuracy. Balancing real-time performance with dense 3D reconstruction capabilities is a challenging problem. In this paper, we propose a real-time RGB-D SLAM system that incorporates a novel view synthesis technique, 3D Gaussian Splatting, for 3D scene representation and pose estimation. This technique leverages the real-time rendering performance of 3D Gaussian Splatting with rasterization and allows for differentiable optimization in real time through CUDA implementation. We also enable mesh reconstruction from 3D Gaussians for explicit dense 3D reconstruction. To estimate accurate camera poses, we utilize a rotation-translation decoupled strategy with inverse optimization. This involves iteratively updating both in several iterations through gradient-based optimization. This process includes differentiably rendering RGB, depth, and silhouette maps and updating the camera parameters to minimize a combined loss of photometric loss, depth geometry loss, and visibility loss, given the existing 3D Gaussian map. However, 3D Gaussian Splatting (3DGS) struggles to accurately represent surfaces due to the multi-view inconsistency of 3D Gaussians, which can lead to reduced accuracy in both camera pose estimation and scene reconstruction. To address this, we utilize depth priors as additional regularization to enforce geometric constraints, thereby improving the accuracy of both pose estimation and 3D reconstruction. We also provide extensive experimental results on public benchmark datasets to demonstrate the effectiveness of our proposed methods in terms of pose accuracy, geometric accuracy, and rendering performance.

[CV-93] PRTGaussian: Efficient Relighting Using 3D Gaussians with Precomputed Radiance Transfer

[CV-94] UrFound: Towards Universal Retinal Foundation Models via Knowledge-Guided Masked Modeling

[CV-95] Residual-INR: Communication Efficient On-Device Learning Using Implicit Neural Representation

[CV-96] Sequential Representation Learning via Static-Dynamic Conditional Disentanglement ECCV2024

[CV-97] Non-Negative Reduced Biquaternion Matrix Factorization with Applications in Color Face Recognition

Link: https://arxiv.org/abs/2408.05582
Authors: Jifei Miao, Junjun Pan, Michael K. Ng
Keywords: four-dimensional algebra highly, algebra highly suitable, recently garnered significant, garnered significant attention, representing color pixels
Categories: Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)

Abstract:Reduced biquaternion (RB), as a four-dimensional algebra highly suitable for representing color pixels, has recently garnered significant attention from numerous scholars. In this paper, for color image processing problems, we introduce a concept of the non-negative RB matrix and then use the multiplication properties of RB to propose a non-negative RB matrix factorization (NRBMF) model. The NRBMF model is introduced to address the challenge of reasonably establishing a non-negative quaternion matrix factorization model, which is primarily hindered by the multiplication properties of traditional quaternions. Furthermore, this paper transforms the problem of solving the NRBMF model into an RB alternating non-negative least squares (RB-ANNLS) problem. Then, by introducing a method to compute the gradient of the real function with RB matrix variables, we solve the RB-ANNLS optimization problem using the RB projected gradient algorithm and conduct a convergence analysis of the algorithm. Finally, we validate the effectiveness and superiority of the proposed NRBMF model in color face recognition.

[CV-98] Camera Perspective Transformation to Birds Eye View via Spatial Transformer Model for Road Intersection Monitoring

Link: https://arxiv.org/abs/2408.05577
Authors: Rukesh Prajapati, Amr S. El-Wakeel
Keywords: bird eye view, utilize bird eye, Road intersection monitoring, eye view, research often utilize
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Road intersection monitoring and control research often utilizes bird’s eye view (BEV) simulators. In real traffic settings, achieving a BEV akin to that in a simulator necessitates the deployment of drones or specific sensor mounting, which is neither feasible nor practical. Consequently, traffic intersection management remains confined to simulation environments given these constraints. In this paper, we address the gap between simulated environments and real-world implementation by introducing a novel deep-learning model that converts a single camera’s perspective of a road intersection into a BEV. We created a simulation environment that closely resembles a real-world traffic junction. The proposed model transforms the vehicles into BEV images, facilitating road intersection monitoring and control model processing. Inspired by image transformation techniques, we propose a Spatial-Transformer Double Decoder-UNet (SDD-UNet) model that aims to eliminate the transformed image distortions. In addition, the model accurately estimates the vehicles’ positions and enables the direct application of simulation-trained models in real-world contexts. The SDD-UNet model achieves an average dice similarity coefficient (DSC) above 95%, which is 40% better than the original UNet model. The mean absolute error (MAE) is 0.102 and the centroid of the predicted mask is displaced by 0.14 meters on average, indicating high accuracy.
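
The two mask-level metrics reported in this abstract, the dice similarity coefficient and the centroid displacement, can be computed as in the sketch below. It operates on small binary masks stored as nested lists; the `meters_per_pixel` scale is a hypothetical parameter, since the abstract does not say how pixel offsets were converted to meters.

```python
import math
from typing import List, Tuple

Mask = List[List[int]]  # binary mask: 1 = vehicle pixel, 0 = background


def dice_coefficient(pred: Mask, target: Mask) -> float:
    """DSC = 2 * |A ∩ B| / (|A| + |B|); defined as 1.0 when both masks are empty."""
    intersection = sum(p * t for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    total = sum(map(sum, pred)) + sum(map(sum, target))
    return 1.0 if total == 0 else 2.0 * intersection / total


def centroid(mask: Mask) -> Tuple[float, float]:
    """Mean (row, column) position of the foreground pixels."""
    points = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not points:
        raise ValueError("mask has no foreground pixels")
    return (sum(r for r, _ in points) / len(points),
            sum(c for _, c in points) / len(points))


def centroid_displacement(pred: Mask, target: Mask,
                          meters_per_pixel: float = 1.0) -> float:
    """Euclidean distance between the two centroids, scaled to meters.

    meters_per_pixel is an assumed calibration factor for illustration.
    """
    (pr, pc), (tr, tc) = centroid(pred), centroid(target)
    return math.hypot(pr - tr, pc - tc) * meters_per_pixel
```

DSC measures how well the predicted vehicle region overlaps the reference, while the centroid displacement measures how far the estimated vehicle position is from the true one, so the two numbers capture shape and location error separately.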

[CV-99] Impacts of Darwinian Evolution on Pre-trained Deep Neural Networks

[CV-100] What Matters in Autonomous Driving Anomaly Detection: A Weakly Supervised Horizon

Link: https://arxiv.org/abs/2408.05562
Authors: Utkarsh Tiwari, Snehashis Majhi, Michal Balazia, François Brémond
Keywords: Video anomaly detection, Video anomaly, VAD, important task, weakly-supervised VAD methods
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Video anomaly detection (VAD) in autonomous driving scenarios is an important task; however, it involves several challenges due to the ego-centric views and moving camera. As a result, it remains largely under-explored. While recent developments in weakly-supervised VAD methods have shown remarkable progress in detecting critical real-world anomalies in static camera scenarios, the development and validation of such methods are yet to be explored for moving camera VAD. This is mainly due to existing datasets like DoTA not following the training pre-conditions of weakly-supervised learning. In this paper, we aim to promote weakly-supervised method development for autonomous driving VAD. We reorganize the DoTA dataset and aim to validate recent powerful weakly-supervised VAD methods on moving camera scenarios. Further, we provide a detailed analysis of what modifications on state-of-the-art methods can significantly improve the detection performance. Towards this, we propose a “feature transformation block” and show through experimentation that our propositions can significantly empower existing weakly-supervised VAD methods in improving VAD in autonomous driving. Our codes/dataset/demo will be released at this http URL

[CV-101] Object Re-identification via Spatial-temporal Fusion Networks and Causal Identity Matching

Link: https://arxiv.org/abs/2408.05558
Authors: Hye-Geun Kim, Yong-Hyuk Moon, Yeong-Jun Cho
Keywords: Object re-identification, large camera networks, ReID, objects degrade ReID, fusion network
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Object re-identification (ReID) in large camera networks has many challenges. First, the similar appearances of objects degrade ReID performances. This challenge cannot be addressed by existing appearance-based ReID methods. Second, most ReID studies are performed in laboratory settings and do not consider ReID problems in real-world scenarios. To overcome these challenges, we introduce a novel ReID framework that leverages a spatial-temporal fusion network and causal identity matching (CIM). The framework estimates camera network topology using the proposed adaptive Parzen window and combines appearance features with spatial-temporal cue within the Fusion Network. It achieved outstanding performance across several datasets, including VeRi776, Vehicle-3I, and Market-1501, achieving up to 99.70% rank-1 accuracy and 95.5% mAP. Furthermore, the proposed CIM approach, which dynamically assigns gallery sets based on the camera network topology, further improved ReID accuracy and robustness in real-world settings, evidenced by a 94.95% mAP and 95.19% F1 score on the Vehicle-3I dataset. The experimental results support the effectiveness of incorporating spatial-temporal information and CIM for real-world ReID scenarios regardless of the data domain (e.g., vehicle, person).

[CV-102] Evolutionary Neural Architecture Search for 3D Point Cloud Analysis

[CV-103] PixelFade: Privacy-preserving Person Re-identification with Noise-guided Progressive Replacement

Link: https://arxiv.org/abs/2408.05543
Authors: Delong Zhang, Yi-Xing Peng, Xiao-Ming Wu, Ancong Wu, Wei-Shi Zheng
Keywords: triggering public concern, Online person re-identification, potential data leakage, re-identification services face, exposing cloud-stored images
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ACMMM24

Abstract:Online person re-identification services face privacy breaches from potential data leakage and recovery attacks, exposing cloud-stored images to malicious attackers and triggering public concern. The privacy protection of pedestrian images is crucial. Previous privacy-preserving person re-identification methods are unable to resist recovery attacks and compromise accuracy. In this paper, we propose an iterative method (PixelFade) to optimize pedestrian images into noise-like images to resist recovery attacks. We first give an in-depth study of protected images from previous privacy methods, which reveals that the chaos of protected images can disrupt the learning of recovery models. Accordingly, we propose a Noise-guided Objective Function with the feature constraints of a specific authorization model, optimizing pedestrian images to normal-distributed noise images while preserving their original identity information as per the authorization model. To solve the above non-convex optimization problem, we propose a heuristic optimization algorithm that alternately performs the Constraint Operation and the Partial Replacement Operation. This strategy not only ensures that original pixels are replaced with noise to protect privacy, but also guides the images towards an improved optimization direction to effectively preserve discriminative features. Extensive experiments demonstrate that our PixelFade outperforms previous methods in resisting recovery attacks and Re-ID performance. The code is available at this https URL.

[CV-104] Radiance Field Learners As UAV First-Person Viewers ECCV2024

Link: https://arxiv.org/abs/2408.05533
Authors: Liqi Yan, Qifan Wang, Junhan Zhao, Qiang Guan, Zheng Tang, Jianhui Zhang, Dongfang Liu
Keywords: Unmanned Aerial Vehicles, Aerial Vehicles, Unmanned Aerial, holds immense potential, Neural Radiance Field
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ECCV 2024

Abstract:First-Person-View (FPV) holds immense potential for revolutionizing the trajectory of Unmanned Aerial Vehicles (UAVs), offering an exhilarating avenue for navigating complex building structures. Yet, traditional Neural Radiance Field (NeRF) methods face challenges such as sampling single points per iteration and requiring an extensive array of views for supervision. UAV videos exacerbate these issues with limited viewpoints and significant spatial scale variations, resulting in inadequate detail rendering across diverse scales. In response, we introduce FPV-NeRF, addressing these challenges through three key facets: (1) Temporal consistency. Leveraging spatio-temporal continuity ensures seamless coherence between frames; (2) Global structure. Incorporating various global features during point sampling preserves space integrity; (3) Local granularity. Employing a comprehensive framework and multi-resolution supervision for multi-scale scene feature representation tackles the intricacies of UAV video spatial scales. Additionally, due to the scarcity of publicly available FPV videos, we introduce an innovative view synthesis method using NeRF to generate FPV perspectives from UAV footage, enhancing spatial perception for drones. Our novel dataset spans diverse trajectories, from outdoor to indoor environments, in the UAV domain, differing significantly from traditional NeRF scenarios. Through extensive experiments encompassing both interior and exterior building structures, FPV-NeRF demonstrates a superior understanding of the UAV flying space, outperforming state-of-the-art methods in our curated UAV dataset. Explore our project page for further insights: this https URL.

[CV-105] CryoBench: Diverse and challenging datasets for the heterogeneity problem in cryo-EM

[CV-106] DeepFace-Attention: Multimodal Face Biometrics for Attention Estimation with Application to e-Learning

Link: https://arxiv.org/abs/2408.05523
Authors: Roberto Daza, Luis F. Gomez, Julian Fierrez, Aythami Morales, Ruben Tolosana, Javier Ortega-Garcia
Keywords: analysis techniques applied, webcam videos, work introduces, introduces an innovative, techniques applied
Categories: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
Comments: Article accepted in the IEEE Access journal. Accessible at this https URL

Abstract:This work introduces an innovative method for estimating attention levels (cognitive load) using an ensemble of facial analysis techniques applied to webcam videos. Our method is particularly useful, among others, in e-learning applications, so we trained, evaluated, and compared our approach on the mEBAL2 database, a public multi-modal database acquired in an e-learning environment. mEBAL2 comprises data from 60 users who performed 8 different tasks. These tasks varied in difficulty, leading to changes in their cognitive loads. Our approach adapts state-of-the-art facial analysis technologies to quantify the users’ cognitive load in the form of high or low attention. Several behavioral signals and physiological processes related to the cognitive load are used, such as eyeblink, heart rate, facial action units, and head pose, among others. Furthermore, we conduct a study to understand which individual features obtain better results, the most efficient combinations, explore local and global features, and how temporary time intervals affect attention level estimation, among other aspects. We find that global facial features are more appropriate for multimodal systems using score-level fusion, particularly as the temporal window increases. On the other hand, local features are more suitable for fusion through neural network training with score-level fusion approaches. Our method outperforms existing state-of-the-art accuracies using the public mEBAL2 benchmark.

[CV-107] Long working distance portable smartphone microscopy for metallic mesh defect detection

Link: https://arxiv.org/abs/2408.05518
Authors: Zhengang Lu, Hongsheng Qin, Jing Li, Ming Sun, Jiubin Tan
Keywords: metal line structure, transparent electromagnetic shielding, electromagnetic shielding film, fine metal line, line structure
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Metallic mesh is a transparent electromagnetic shielding film with a fine metal line structure. However, it can develop defects that affect the optoelectronic performance whether in the production preparation or in actual use. The development of in-situ non-destructive testing (NDT) devices for metallic mesh requires long working distances, reflective optical path design, and miniaturization. To address the limitations of existing smartphone microscopes, which feature short working distances and inadequate transmission imaging for industrial in-situ inspection, we propose a novel long-working distance reflective smartphone microscopy system (LD-RSM). LD-RSM builds a 4f optical imaging system with external optical components and a smartphone, utilizing a beam splitter to achieve reflective imaging with the illumination system and imaging system on the same side of the sample. It achieves an optical resolution of 4.92 μm and a working distance of up to 22.23 mm. Additionally, we introduce a dual prior weighted Robust Principal Component Analysis (DW-RPCA) for defect detection. This approach leverages spectral filter fusion and Hough transform to model different defect types, enhancing the accuracy and efficiency of defect identification. Coupled with an optimized threshold segmentation algorithm, DW-RPCA method achieves a pixel-level accuracy of 84.8%. Our work showcases strong potential for growth in the field of in-situ on-line inspection of industrial products.

[CV-108] Anticipation through Head Pose Estimation: a preliminary study

Link: https://arxiv.org/abs/2408.05516
Authors: Federico Figari Tomenotti, Nicoletta Noceti
Keywords: human-human social interaction, anticipate others’ goals, basis of human-human, human-human social, social interaction
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at the workshop on advancing Group Understanding and robots’ adaptive behavior (GROUND), held at the Robotics Science and Systems (RSS) Conference, 2024

Abstract:The ability to anticipate others’ goals and intentions is at the basis of human-human social interaction. Such ability, largely based on non-verbal communication, is also a key to having natural and pleasant interactions with artificial agents, like robots. In this work, we discuss a preliminary experiment on the use of head pose as a visual cue to understand and anticipate action goals, particularly reaching and transporting movements. By reasoning on the spatio-temporal connections between the head, hands and objects in the scene, we will show that short-range anticipation is possible, laying the foundations for future applications to human-robot interaction.

[CV-109] PointMT: Efficient Point Cloud Analysis with Hybrid MLP-Transformer Architecture

Link: https://arxiv.org/abs/2408.05508
Authors: Qiang Zheng, Chao Zhang, Jian Sun
Keywords: made significant progress, point cloud analysis, Transformer architecture hinder, Transformer architecture, virtual reality
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:In recent years, point cloud analysis methods based on the Transformer architecture have made significant progress, particularly in the context of multimedia applications such as 3D modeling, virtual reality, and autonomous systems. However, the high computational resource demands of the Transformer architecture hinder its scalability, real-time processing capabilities, and deployment on mobile devices and other platforms with limited computational resources. This limitation remains a significant obstacle to its practical application in scenarios requiring on-device intelligence and multimedia processing. To address this challenge, we propose an efficient point cloud analysis architecture, Point MLP-Transformer (PointMT). This study tackles the quadratic complexity of the self-attention mechanism by introducing a linear complexity local attention mechanism for effective feature aggregation. Additionally, to counter the Transformer’s focus on token differences while neglecting channel differences, we introduce a parameter-free channel temperature adaptation mechanism that adaptively adjusts the attention weight distribution in each channel, enhancing the precision of feature aggregation. To improve the Transformer’s slow convergence speed due to the limited scale of point cloud datasets, we propose an MLP-Transformer hybrid module, which significantly enhances the model’s convergence speed. Furthermore, to boost the feature representation capability of point tokens, we refine the classification head, enabling point tokens to directly participate in prediction. Experimental results on multiple evaluation benchmarks demonstrate that PointMT achieves performance comparable to state-of-the-art methods while maintaining an optimal balance between performance and accuracy.

[CV-110] Disentangled Noisy Correspondence Learning

[CV-111] GEM: Context-Aware Gaze EstiMation with Visual Search Behavior Matching for Chest Radiograph

Link: https://arxiv.org/abs/2408.05502
Authors: Shaonan Liu,Wenting Chen,Jie Liu,Xiaoling Luo,Linlin Shen
Keywords: scene comprehension tasks, human scene comprehension, medical diagnostic analysis, context-aware gaze estimation, medical image interpretation
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 9 figures

[CV-112] PointNCBW: Towards Dataset Ownership Verification for Point Clouds via Negative Clean-label Backdoor Watermark

[CV-113] ZePo: Zero-Shot Portrait Stylization with Faster Sampling ACM-MM2024

Link: https://arxiv.org/abs/2408.05492
Authors: Jin Liu,Huaibo Huang,Jie Cao,Ran He
Keywords: art content synthesis, Consistency Features, Latent Consistency Models, significantly advanced, advanced the field
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ACM MM 2024

Abstract:Diffusion-based text-to-image generation models have significantly advanced the field of art content synthesis. However, current portrait stylization methods generally require either model fine-tuning based on examples or the employment of DDIM Inversion to revert images to noise space, both of which substantially decelerate the image generation process. To overcome these limitations, this paper presents an inversion-free portrait stylization framework based on diffusion models that accomplishes content and style feature fusion in merely four sampling steps. We observed that Latent Consistency Models employing consistency distillation can effectively extract representative Consistency Features from noisy images. To blend the Consistency Features extracted from both content and style images, we introduce a Style Enhancement Attention Control technique that meticulously merges content and style features within the attention space of the target image. Moreover, we propose a feature merging strategy to amalgamate redundant features in Consistency Features, thereby reducing the computational load of attention control. Extensive experiments have validated the effectiveness of our proposed framework in enhancing stylization efficiency and fidelity. The code is available at this https URL.

[CV-114] ReToMe-VA: Recursive Token Merging for Video Diffusion-based Unrestricted Adversarial Attack

Link: https://arxiv.org/abs/2408.05479
Authors: Ziyi Gao,Kai Chen,Zhipeng Wei,Tingshu Mou,Jingjing Chen,Zhiyu Tan,Hao Li,Yu-Gang Jiang
Keywords: Recent diffusion-based unrestricted, diffusion-based unrestricted attacks, Recent diffusion-based, diffusion-based unrestricted, previous unrestricted attacks
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent diffusion-based unrestricted attacks generate imperceptible adversarial examples with high transferability compared to previous unrestricted attacks and restricted attacks. However, existing works on diffusion-based unrestricted attacks are mostly focused on images yet are seldom explored in videos. In this paper, we propose the Recursive Token Merging for Video Diffusion-based Unrestricted Adversarial Attack (ReToMe-VA), which is the first framework to generate imperceptible adversarial video clips with higher transferability. Specifically, to achieve spatial imperceptibility, ReToMe-VA adopts a Timestep-wise Adversarial Latent Optimization (TALO) strategy that optimizes perturbations in diffusion models’ latent space at each denoising step. TALO offers iterative and accurate updates to generate more powerful adversarial frames. TALO can further reduce memory consumption in gradient computation. Moreover, to achieve temporal imperceptibility, ReToMe-VA introduces a Recursive Token Merging (ReToMe) mechanism by matching and merging tokens across video frames in the self-attention module, resulting in temporally consistent adversarial videos. ReToMe concurrently facilitates inter-frame interactions into the attack process, inducing more diverse and robust gradients, thus leading to better adversarial transferability. Extensive experiments demonstrate the efficacy of ReToMe-VA, particularly in surpassing state-of-the-art attacks in adversarial transferability by more than 14.16% on average.

[CV-115] Scene123: One Prompt to 3D Scene Generation via Video-Assisted and Consistency-Enhanced MAE

Link: https://arxiv.org/abs/2408.05477
Authors: Yiying Yang,Fukun Yin,Jiayuan Fan,Xin Chen,Wanzhang Li,Gang Yu
Keywords: Artificial Intelligence Generated, Intelligence Generated Content, cognitive content creation, human-like cognitive content, Artificial Intelligence
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: arXiv admin note: text overlap with arXiv:2305.11588 by other authors

Abstract:As Artificial Intelligence Generated Content (AIGC) advances, a variety of methods have been developed to generate text, images, videos, and 3D objects from single or multimodal inputs, contributing efforts to emulate human-like cognitive content creation. However, generating realistic large-scale scenes from a single input presents a challenge due to the complexities involved in ensuring consistency across extrapolated views generated by models. Benefiting from recent video generation models and implicit neural representations, we propose Scene123, a 3D scene generation model that not only ensures realism and diversity through the video generation framework but also uses implicit neural fields combined with Masked Autoencoders (MAE) to effectively ensure the consistency of unseen areas across views. Specifically, we initially warp the input image (or an image generated from text) to simulate adjacent views, filling the invisible areas with the MAE model. However, these filled images usually fail to maintain view consistency, thus we utilize the produced views to optimize a neural radiance field, enhancing geometric consistency. Moreover, to further enhance the details and texture fidelity of generated views, we employ a GAN-based Loss against images derived from the input image through the video generation model. Extensive experiments demonstrate that our method can generate realistic and consistent scenes from a single prompt. Both qualitative and quantitative results indicate that our approach surpasses existing state-of-the-art methods. We show encouraging video examples at this https URL.

[CV-116] Cross-view image geo-localization with Panorama-BEV Co-Retrieval Network ECCV2024

Link: https://arxiv.org/abs/2408.05475
Authors: Junyan Ye,Zhutao Lv,Weijia Li,Jinhua Yu,Haote Yang,Huaping Zhong,Conghui He
Keywords: georeferenced satellite database, street view, Cross-view geolocalization identifies, view, street
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ECCV 2024

Abstract:Cross-view geolocalization identifies the geographic location of street view images by matching them with a georeferenced satellite database. Significant challenges arise due to the drastic appearance and geometry differences between views. In this paper, we propose a new approach for cross-view image geo-localization, i.e., the Panorama-BEV Co-Retrieval Network. Specifically, by utilizing the ground plane assumption and geometric relations, we convert street view panorama images into the BEV view, reducing the gap between street panoramas and satellite imagery. In the existing retrieval of street view panorama images and satellite images, we introduce BEV and satellite image retrieval branches for collaborative retrieval. By retaining the original street view retrieval branch, we overcome the limited perception range issue of BEV representation. Our network enables comprehensive perception of both the global layout and local details around the street view capture locations. Additionally, we introduce CVGlobal, a global cross-view dataset that is closer to real-world scenarios. This dataset adopts a more realistic setup, with street view directions not aligned with satellite images. CVGlobal also includes cross-regional, cross-temporal, and street view to map retrieval tests, enabling a comprehensive evaluation of algorithm performance. Our method excels in multiple tests on common cross-view datasets such as CVUSA, CVACT, VIGOR, and our newly introduced CVGlobal, surpassing the current state-of-the-art approaches. The code and datasets can be found at this https URL.
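
The ground plane assumption mentioned in the abstract can be illustrated with a little geometry. The sketch below is not the authors' implementation; it assumes an equirectangular panorama, a camera at a known height above flat ground, and a coordinate convention (x to the right, y forward) chosen here for illustration. The function name `bev_to_panorama_pixel` is hypothetical. For each ground point in the bird's-eye view (BEV), it computes which panorama pixel would be sampled.

```python
import math


def bev_to_panorama_pixel(x, y, cam_height, pano_w, pano_h):
    """Map a ground-plane point (x right, y forward, in metres) to a pixel
    (u, v) of an equirectangular panorama, assuming perfectly flat ground."""
    ground_dist = math.hypot(x, y)
    # Horizontal viewing angle: 0 means straight ahead, positive to the right.
    azimuth = math.atan2(x, y)
    # Vertical viewing angle: negative values look below the horizon.
    pitch = -math.atan2(cam_height, ground_dist)
    # Assumed layout: centre column = forward, centre row = horizon.
    u = (azimuth / (2 * math.pi) + 0.5) * pano_w
    v = (0.5 - pitch / math.pi) * pano_h
    return u, v


# A point 10 m straight ahead lands in the centre column, below the horizon.
u, v = bev_to_panorama_pixel(0.0, 10.0, 2.0, 1024, 512)
```

Filling a BEV image then amounts to evaluating this mapping for every BEV cell and sampling the panorama at the resulting coordinates. Points far from the camera map close to the horizon row, which is one way to see why a BEV built this way has a limited useful range.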

[CV-117] Multimodal generative semantic communication based on latent diffusion model

Link: https://arxiv.org/abs/2408.05455
Authors: Weiqi Fu,Lianming Xu,Xin Wu,Haoyang Wei,Li Wang
Keywords: accurately gather environmental, make timely decisions, gather environmental data, semantic communication frameworks, command information
Categories: Computer Vision and Pattern Recognition (cs.CV); Networking and Internet Architecture (cs.NI)
Comments:

Abstract:In emergencies, the ability to quickly and accurately gather environmental data and command information, and to make timely decisions, is particularly critical. Traditional semantic communication frameworks, primarily based on a single modality, are susceptible to complex environments and lighting conditions, thereby limiting decision accuracy. To this end, this paper introduces a multimodal generative semantic communication framework named mm-GESCO. The framework ingests streams of visible and infrared modal image data, generates fused semantic segmentation maps, and transmits them using a combination of one-hot encoding and zlib compression techniques to enhance data transmission efficiency. At the receiving end, the framework can reconstruct the original multimodal images based on the semantic maps. Additionally, a latent diffusion model based on contrastive learning is designed to align different modal data within the latent space, allowing mm-GESCO to reconstruct latent features of any modality presented at the input. Experimental results demonstrate that mm-GESCO achieves a compression ratio of up to 200 times, surpassing the performance of existing semantic communication frameworks and exhibiting excellent performance in downstream tasks such as object classification and detection.
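
The transmission step described above (one-hot encoding followed by zlib compression) can be sketched with the Python standard library. This is only an illustration under stated assumptions: the abstract does not give the bit layout, so the per-class bit planes, the bit packing, and the function names `encode_label_map` and `decode_label_map` are all hypothetical choices made here.

```python
import zlib


def encode_label_map(labels, num_classes):
    """One-hot encode a 2D label map into per-class bit planes,
    pack the bits into bytes, and compress the result with zlib."""
    flat = [c for row in labels for c in row]
    bits = []
    for k in range(num_classes):
        bits.extend(1 if c == k else 0 for c in flat)
    bits.extend([0] * (-len(bits) % 8))  # pad to a whole number of bytes
    packed = bytearray()
    for i in range(0, len(bits), 8):
        byte = 0
        for b in bits[i:i + 8]:
            byte = (byte << 1) | b
        packed.append(byte)
    return zlib.compress(bytes(packed), 9)


def decode_label_map(blob, height, width, num_classes):
    """Invert encode_label_map: decompress, unpack bits, read class planes."""
    bits = []
    for byte in zlib.decompress(blob):
        bits.extend((byte >> s) & 1 for s in range(7, -1, -1))
    n = height * width
    flat = [0] * n
    for k in range(num_classes):
        for i, b in enumerate(bits[k * n:(k + 1) * n]):
            if b:
                flat[i] = k
    return [flat[r * width:(r + 1) * width] for r in range(height)]


segmentation = [[0, 0, 1, 1], [0, 2, 2, 1], [0, 2, 2, 1]]
blob = encode_label_map(segmentation, num_classes=3)
restored = decode_label_map(blob, height=3, width=4, num_classes=3)
```

Segmentation maps tend to contain large uniform regions, so their bit planes are highly repetitive, which is the kind of input general-purpose compressors such as zlib handle well. The compression ratio quoted in the abstract refers to the authors' full pipeline, not to this toy example.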

[CV-118] EV-MGDispNet: Motion-Guided Event-Based Stereo Disparity Estimation Network with Left-Right Consistency

Link: https://arxiv.org/abs/2408.05452
Authors: Junjie Jiang,Hao Zhuang,Xinjie Huang,Delei Kong,Zheng Fang
Keywords: high dynamic range, stereo disparity estimation, high temporal resolution, disparity estimation, camera stereo disparity
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:Event cameras have the potential to revolutionize the field of robot vision, particularly in areas like stereo disparity estimation, owing to their high temporal resolution and high dynamic range. Many studies use deep learning for event camera stereo disparity estimation. However, these methods fail to fully exploit the temporal information in the event stream to acquire clear event representations. Additionally, there is room for further reduction in pixel shifts in the feature maps before constructing the cost volume. In this paper, we propose EV-MGDispNet, a novel event-based stereo disparity estimation method. Firstly, we propose an edge-aware aggregation (EAA) module, which fuses event frames and motion confidence maps to generate a novel clear event representation. Then, we propose a motion-guided attention (MGA) module, where motion confidence maps utilize deformable transformer encoders to enhance the feature map with more accurate edges. Finally, we also add a census left-right consistency loss function to enhance the left-right consistency of stereo event representation. Through conducting experiments within challenging real-world driving scenarios, we validate that our method outperforms currently known state-of-the-art methods in terms of mean absolute error (MAE) and root mean square error (RMSE) metrics.
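
For readers unfamiliar with the two metrics named at the end of the abstract, the following minimal sketch computes mean absolute error (MAE) and root mean square error (RMSE) between predicted and ground-truth disparities. The handling of missing ground truth (marked `None` here) and the function name `disparity_errors` are assumptions for illustration, not the paper's evaluation protocol.

```python
import math


def disparity_errors(pred, gt):
    """Return (MAE, RMSE) over pixels that have ground-truth disparity.
    Pixels whose ground truth is None are skipped."""
    errors = [abs(p - g) for p, g in zip(pred, gt) if g is not None]
    mae = sum(errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mae, rmse


# Three valid pixels with absolute errors 1, 0 and 2.
mae, rmse = disparity_errors([10, 12, 7, 3], [11, 12, None, 5])
```

RMSE squares each error before averaging, so it penalises large outliers more heavily than MAE; reporting both gives a fuller picture of the error distribution.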

[CV-119] Ensemble everything everywhere: Multi-scale aggregation for adversarial robustness

Link: https://arxiv.org/abs/2408.05446
Authors: Stanislav Fort,Balaji Lakshminarayanan
Keywords: deep neural networks, intermediate layer predictions, adversarial robustness, reliability and alignment, neural networks
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 34 pages, 25 figures, appendix

Abstract:Adversarial examples pose a significant challenge to the robustness, reliability and alignment of deep neural networks. We propose a novel, easy-to-use approach to achieving high-quality representations that lead to adversarial robustness through the use of multi-resolution input representations and dynamic self-ensembling of intermediate layer predictions. We demonstrate that intermediate layer predictions exhibit inherent robustness to adversarial attacks crafted to fool the full classifier, and propose a robust aggregation mechanism based on Vickrey auction that we call CrossMax to dynamically ensemble them. By combining multi-resolution inputs and robust ensembling, we achieve significant adversarial robustness on CIFAR-10 and CIFAR-100 datasets without any adversarial training or extra data, reaching an adversarial accuracy of $\approx 72\%$ (CIFAR-10) and $\approx 48\%$ (CIFAR-100) on the RobustBench AutoAttack suite ($L_\infty = 8/255$) with a finetuned ImageNet-pretrained ResNet152. This represents a result comparable with the top three models on CIFAR-10 and a +5% gain compared to the best current dedicated approach on CIFAR-100. Adding simple adversarial training on top, we get $\approx 78\%$ on CIFAR-10 and $\approx 51\%$ on CIFAR-100, improving SOTA by 5% and 9% respectively and seeing greater gains on the harder dataset. We validate our approach through extensive experiments and provide insights into the interplay between adversarial robustness and the hierarchical nature of deep representations. We show that simple gradient-based attacks against our model lead to human-interpretable images of the target classes as well as interpretable image changes. As a byproduct, using our multi-resolution prior, we turn pre-trained classifiers and CLIP models into controllable image generators and develop successful transferable attacks on large vision language models.
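
The abstract says only that the aggregation is "based on Vickrey auction", so the sketch below is one plausible reading rather than the authors' exact CrossMax procedure. It assumes three steps: normalise each predictor by its own maximum, normalise each class by its maximum across predictors, then take the k-th highest score per class, in the way a second-price auction ignores the top bid. The function name `vickrey_style_aggregate` and the default `k=2` are assumptions.

```python
def vickrey_style_aggregate(logits, k=2):
    """Aggregate per-predictor class scores robustly.
    logits: list of predictors, each a list of per-class scores."""
    # Step 1: subtract each predictor's own maximum so that no single
    # predictor dominates through sheer scale.
    z = [[s - max(row) for s in row] for row in logits]
    num_classes = len(z[0])
    # Step 2: subtract each class's maximum across predictors.
    col_max = [max(row[c] for row in z) for c in range(num_classes)]
    z = [[row[c] - col_max[c] for c in range(num_classes)] for row in z]
    # Step 3: per class, keep the k-th highest score instead of the top one.
    return [
        sorted((row[c] for row in z), reverse=True)[k - 1]
        for c in range(num_classes)
    ]


# Two predictors mildly prefer class 0; one outlier shouts for class 1.
layer_logits = [[2.0, 1.0, 0.0], [1.5, 1.4, 0.0], [0.0, 9.0, 0.0]]
scores = vickrey_style_aggregate(layer_logits)
predicted_class = scores.index(max(scores))
```

In this toy input, averaging the raw logits would follow the single outlier and pick class 1, while the k-th-highest rule sides with the two agreeing predictors and picks class 0. That is the intuition behind using an auction-style rule when an attacker may have compromised some, but not all, intermediate predictions.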

[CV-120] Content-decoupled Contrastive Learning-based Implicit Degradation Modeling for Blind Image Super-Resolution

Link: https://arxiv.org/abs/2408.05440
Authors: Jiang Yuan,Ji Ma,Bo Wang,Weiming Hu
Keywords: wide application range, complex degradation scenarios, Implicit degradation modeling-based, modeling-based blind super-resolution, degradation modeling-based blind
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:Implicit degradation modeling-based blind super-resolution (SR) has attracted increasing attention in the community due to its excellent generalization to complex degradation scenarios and wide application range. How to extract more discriminative degradation representations and fully adapt them to specific image features is the key to this task. In this paper, we propose a new Content-decoupled Contrastive Learning-based blind image super-resolution (CdCL) framework following the typical blind SR pipeline. This framework introduces a negative-free contrastive learning technique for the first time to model the implicit degradation representation, in which a new cyclic shift sampling strategy is designed to ensure decoupling between content features and degradation features from the data perspective, thereby improving the purity and discriminability of the learned implicit degradation space. In addition, to improve the efficiency and effectiveness of implicit degradation-based blind super-resolving, we design a detail-aware implicit degradation adaption module with lower complexity, which adapts degradation information to the specific LR image from both channel and spatial perspectives. Extensive experiments on synthetic and real data prove that the proposed CdCL comprehensively improves the quantitative and qualitative results of the contrastive learning-based implicit blind SR paradigm, and achieves SOTA PSNR in this field. Even if the number of parameters is halved, our method still achieves very competitive results.

[CV-121] A Methodological and Structural Review of Hand Gesture Recognition Across Diverse Data Modalities

Link: https://arxiv.org/abs/2408.05436
Authors: Jungpil Shin,Abu Saleh Musa Miah,Md. Humaun Kabir,Md. Abdur Rahim,Abdullah Al Shiam
Keywords: authentic human-computer interaction, developing Hand Gesture, Hand Gesture Recognition, hand gestures, developing Hand
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Researchers have been developing Hand Gesture Recognition (HGR) systems to enhance natural, efficient, and authentic human-computer interaction, especially benefiting those who rely solely on hand gestures for communication. Despite significant progress, the automatic and precise identification of hand gestures remains a considerable challenge in computer vision. Recent studies have focused on specific modalities like RGB images, skeleton data, and spatiotemporal interest points. This paper provides a comprehensive review of HGR techniques and data modalities from 2014 to 2024, exploring advancements in sensor technology and computer vision. We highlight accomplishments using various modalities, including RGB, Skeleton, Depth, Audio, EMG, EEG, and Multimodal approaches and identify areas needing further research. We reviewed over 200 articles from prominent databases, focusing on data collection, data settings, and gesture representation. Our review assesses the efficacy of HGR systems through their recognition accuracy and identifies a gap in research on continuous gesture recognition, indicating the need for improved vision-based gesture systems. The field has experienced steady research progress, including advancements in hand-crafted features and deep learning (DL) techniques. Additionally, we report on the promising developments in HGR methods and the area of multimodal approaches. We hope this survey will serve as a potential guideline for diverse data modality-based HGR research.

[CV-122] SAM-FNet: SAM-Guided Fusion Network for Laryngo-Pharyngeal Tumor Detection

Link: https://arxiv.org/abs/2408.05426
Authors: Jia Wei,Yun Li,Meiyu Qiu,Hongyu Chen,Xiaomao Fan,Wenbin Lei
Keywords: highly fatal malignant, fatal malignant disease, malignant disease affecting, highly fatal, fatal malignant
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Laryngo-pharyngeal cancer (LPC) is a highly fatal malignant disease affecting the head and neck region. Previous studies on endoscopic tumor detection, particularly those leveraging dual-branch network architectures, have shown significant advancements in tumor detection. These studies highlight the potential of dual-branch networks in improving diagnostic accuracy by effectively integrating global and local (lesion) feature extraction. However, they are still limited in their capabilities to accurately locate the lesion region and capture the discriminative feature information between the global and local branches. To address these issues, we propose a novel SAM-guided fusion network (SAM-FNet), a dual-branch network for laryngo-pharyngeal tumor detection. By leveraging the powerful object segmentation capabilities of the Segment Anything Model (SAM), we introduce the SAM into the SAM-FNet to accurately segment the lesion region. Furthermore, we propose a GAN-like feature optimization (GFO) module to capture the discriminative features between the global and local branches, enhancing the fusion feature complementarity. Additionally, we collect two LPC datasets from the First Affiliated Hospital (FAHSYSU) and the Sixth Affiliated Hospital (SAHSYSU) of Sun Yat-sen University. The FAHSYSU dataset is used as the internal dataset for training the model, while the SAHSYSU dataset is used as the external dataset for evaluating the model’s performance. Extensive experiments on both the FAHSYSU and SAHSYSU datasets demonstrate that the SAM-FNet can achieve competitive results, outperforming the state-of-the-art counterparts. The source code of SAM-FNet is available at this https URL.

[CV-123] EPAM-Net: An Efficient Pose-driven Attention-guided Multimodal Network for Video Action Recognition

[CV-124] High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model

[CV-125] Style-Preserving Lip Sync via Audio-Aware Style Reference

[CV-126] How Does Audio Influence Visual Attention in Omnidirectional Videos? Database and Model

Link: https://arxiv.org/abs/2408.05411
Authors: Yuxin Zhu,Huiyu Duan,Kaiwei Zhang,Yucheng Zhu,Xilei Zhu,Long Teng,Xiongkuo Min,Guangtao Zhai
Keywords: augmented reality applications, enhancing user engagement, Understanding and predicting, predicting viewer attention, audio-visual saliency
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Understanding and predicting viewer attention in omnidirectional videos (ODVs) is crucial for enhancing user engagement in virtual and augmented reality applications. Although both audio and visual modalities are essential for saliency prediction in ODVs, the joint exploitation of these two modalities has been limited, primarily due to the absence of large-scale audio-visual saliency databases and comprehensive analyses. This paper comprehensively investigates audio-visual attention in ODVs from both subjective and objective perspectives. Specifically, we first introduce a new audio-visual saliency database for omnidirectional videos, termed AVS-ODV database, containing 162 ODVs and corresponding eye movement data collected from 60 subjects under three audio modes including mute, mono, and ambisonics. Based on the constructed AVS-ODV database, we perform an in-depth analysis of how audio influences visual attention in ODVs. To advance the research on audio-visual saliency prediction for ODVs, we further establish a new benchmark based on the AVS-ODV database by testing numerous state-of-the-art saliency models, including visual-only models and audio-visual models. In addition, given the limitations of current models, we propose an innovative omnidirectional audio-visual saliency prediction network (OmniAVS), which is built based on the U-Net architecture, and hierarchically fuses audio and visual features from the multimodal aligned embedding space. Extensive experimental results demonstrate that the proposed OmniAVS model outperforms other state-of-the-art models on both ODV AVS prediction and traditional AVS prediction tasks. The AVS-ODV database and OmniAVS model will be released to facilitate future research.

[CV-127] RSL-BA: Rolling Shutter Line Bundle Adjustment

Link: https://arxiv.org/abs/2408.05409
Authors: Yongcong Zhang,Bangyan Liao,Yifei Xue,Chen Lu,Peidong Liu,Yizhen Lao
Keywords: inherently encoding spatial, spatial structural information, encoding spatial structural, inherently encoding, structural information
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The line is a prevalent element in man-made environments, inherently encoding spatial structural information, thus making it a more robust choice for feature representation in practical applications. Despite its apparent advantages, previous rolling shutter bundle adjustment (RSBA) methods have only supported sparse feature points, which lack robustness, particularly in degenerate environments. In this paper, we introduce the first rolling shutter line-based bundle adjustment solution, RSL-BA. Specifically, we initially establish the rolling shutter camera line projection theory utilizing Plücker line parameterization. Subsequently, we derive a series of reprojection error formulations which are stable and efficient. Finally, we theoretically and experimentally demonstrate that our method can prevent three common degeneracies, one of which is first discovered in this paper. Extensive synthetic and real data experiments demonstrate that our method achieves efficiency and accuracy comparable to existing point-based rolling shutter bundle adjustment solutions.
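
Plücker line parameterization, which the abstract builds on, represents a 3D line by a direction vector d and a moment vector m = p × q computed from any two points p and q on the line. The sketch below shows only this standard construction, not the paper's rolling shutter projection model; the helper names are chosen here for illustration.

```python
def cross(a, b):
    """Cross product of two 3-vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dot(a, b):
    """Dot product of two 3-vectors."""
    return sum(x * y for x, y in zip(a, b))


def plucker_from_points(p, q):
    """Return the Pluecker coordinates (d, m) of the line through p and q."""
    d = tuple(qi - pi for pi, qi in zip(p, q))  # direction of the line
    m = cross(p, q)                             # moment about the origin
    return d, m


# Two different point pairs on the same line x = 1, z = 0.
d1, m1 = plucker_from_points((1, 0, 0), (1, 1, 0))
d2, m2 = plucker_from_points((1, 2, 0), (1, 5, 0))
```

Two properties make this representation convenient for bundle adjustment. The direction and moment are always orthogonal (`dot(d, m) == 0`), and the coordinates depend on the line itself rather than on which points were picked: here `(d2, m2)` is simply three times `(d1, m1)`.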

[CV-128] Mesh deformation-based single-view 3D reconstruction of thin eyeglasses frames with differentiable rendering

Link: https://arxiv.org/abs/2408.05402
Authors: Fan Zhang,Ziyue Ji,Weiguang Kang,Weiqing Li,Zhiyong Su
Keywords: Augmented Reality, Virtual Reality, virtual eyeglasses try-on, eyeglasses try-on application, virtual eyeglasses
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments:

Abstract:With the support of Virtual Reality (VR) and Augmented Reality (AR) technologies, the 3D virtual eyeglasses try-on application is well on its way to becoming a new trending solution that offers a “try on” option to select the perfect pair of eyeglasses at the comfort of your own home. Reconstructing eyeglasses frames from a single image with traditional depth and image-based methods is extremely difficult due to their unique characteristics such as lack of sufficient texture features, thin elements, and severe self-occlusions. In this paper, we propose the first mesh deformation-based reconstruction framework for recovering high-precision 3D full-frame eyeglasses models from a single RGB image, leveraging prior and domain-specific knowledge. Specifically, based on the construction of a synthetic eyeglasses frame dataset, we first define a class-specific eyeglasses frame template with pre-defined keypoints. Then, given an input eyeglasses frame image with thin structure and few texture features, we design a keypoint detector and refiner to detect predefined keypoints in a coarse-to-fine manner to estimate the camera pose accurately. After that, using differentiable rendering, we propose a novel optimization approach for producing correct geometry by progressively performing free-form deformation (FFD) on the template mesh. We define a series of loss functions to enforce consistency between the rendered result and the corresponding RGB input, utilizing constraints from inherent structure, silhouettes, keypoints, per-pixel shading information, and so on. Experimental results on both the synthetic dataset and real images demonstrate the effectiveness of the proposed algorithm.

[CV-129] PersonViT: Large-scale Self-supervised Vision Transformer for Person Re-Identification

Link: https://arxiv.org/abs/2408.05398
Authors: Bin Hu,Xinggang Wang,Wenyu Liu
Keywords: retrieve relevant individuals, non-overlapping camera images, person ReID, Masked Image Modeling, aims to retrieve
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Person Re-Identification (ReID) aims to retrieve relevant individuals in non-overlapping camera images and has a wide range of applications in the field of public safety. In recent years, with the development of Vision Transformer (ViT) and self-supervised learning techniques, the performance of person ReID based on self-supervised pre-training has been greatly improved. Person ReID requires extracting highly discriminative local fine-grained features of the human body, while traditional ViT is good at extracting context-related global features, making it difficult to focus on local human body features. To this end, this article introduces the recently emerged Masked Image Modeling (MIM) self-supervised learning method into person ReID, and effectively extracts high-quality global and local features through large-scale unsupervised pre-training by combining masked image modeling and discriminative contrastive learning, and then conducts supervised fine-tuning training in the person ReID task. This person feature extraction method based on ViT with masked image modeling (PersonViT) is unsupervised and scalable, with strong generalization capabilities, overcoming the problem of difficult annotation in supervised person ReID, and achieves state-of-the-art results on publicly available benchmark datasets, including MSMT17, Market1501, DukeMTMC-reID, and Occluded-Duke. The code and pre-trained models of the PersonViT method are released at this https URL to promote further research in the person ReID field.

[CV-130] DeepSpeak Dataset v1.0

Link: https://arxiv.org/abs/2408.05366
Authors: Sarah Barrington,Matyas Bohacek,Hany Farid
Keywords: describe a large-scale, people talking, talking and gesturing, gesturing in front, large-scale dataset
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We describe a large-scale dataset, DeepSpeak, of real and deepfake footage of people talking and gesturing in front of their webcams. The real videos in this first version of the dataset consist of 9 hours of footage from 220 diverse individuals. Constituting more than 25 hours of footage, the fake videos consist of a range of different state-of-the-art face-swap and lip-sync deepfakes with natural and AI-generated voices. We expect to release future versions of this dataset with different and updated deepfake technologies. This dataset is made freely available for research and non-commercial uses; requests for commercial use will be considered.

[CV-131] Spherical World-Locking for Audio-Visual Localization in Egocentric Videos ECCV2024

Link: https://arxiv.org/abs/2408.05364
Authors: Heeseung Yun,Ruohan Gao,Ishwarya Ananthabhotla,Anurag Kumar,Jacob Donley,Chao Li,Gunhee Kim,Vamsi Krishna Ithapu,Calvin Murdock
Keywords: provide comprehensive contexts, spanning multisensory perception, videos provide comprehensive, behavioral interaction, Egocentric videos provide
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ECCV2024

Abstract:Egocentric videos provide comprehensive contexts for user and scene understanding, spanning multisensory perception to behavioral interaction. We propose Spherical World-Locking (SWL) as a general framework for egocentric scene representation, which implicitly transforms multisensory streams with respect to measurements of head orientation. Compared to conventional head-locked egocentric representations with a 2D planar field-of-view, SWL effectively offsets challenges posed by self-motion, allowing for improved spatial synchronization between input modalities. Using a set of multisensory embeddings on a world-locked sphere, we design a unified encoder-decoder transformer architecture that preserves the spherical structure of the scene representation, without requiring expensive projections between image and world coordinate systems. We evaluate the effectiveness of the proposed framework on multiple benchmark tasks for egocentric video understanding, including audio-visual active speaker localization, auditory spherical source localization, and behavior anticipation in everyday activities.

[CV-132] AyE-Edge: Automated Deployment Space Search Empowering Accuracy yet Efficient Real-Time Object Detection on the Edge

Link: https://arxiv.org/abs/2408.05363
Authors: Chao Wu,Yifan Gong,Liangkai Liu,Mengquan Li,Yushu Wu,Xuan Shen,Zhimin Li,Geng Yuan,Weisong Shi,Yanzhi Wang
Keywords: ever-broad application prospects, application prospects, growing demand, ever-broad application, Object detection
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Object detection on the edge (Edge-OD) is in growing demand thanks to its ever-broad application prospects. However, the development of this field is rigorously restricted by the deployment dilemma of simultaneously achieving high accuracy, excellent power efficiency, and meeting strict real-time requirements. To tackle this dilemma, we propose AyE-Edge, the first-of-this-kind development tool that explores automated algorithm-device deployment space search to realize Accurate yet power-Efficient real-time object detection on the Edge. Through a collaborative exploration of keyframe selection, CPU-GPU configuration, and DNN pruning strategy, AyE-Edge excels in extensive real-world experiments conducted on a mobile device. The results consistently demonstrate AyE-Edge’s effectiveness, realizing outstanding real-time performance, detection accuracy, and notably, a remarkable 96.7% reduction in power consumption, compared to state-of-the-art (SOTA) competitors.

[CV-133] Enabling Quick, Accurate Crowdsourced Annotation for Elevation-Aware Flood Extent Mapping

Link: https://arxiv.org/abs/2408.05350
Authors: Landon Dyken,Saugat Adhikari,Pravin Poudel,Steve Petruzza,Da Yan,Will Usher,Sidharth Kumar
Keywords: allocate relief efforts, properly allocate relief, relief efforts, disaster management, flood extent mappings
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:In order to assess damage and properly allocate relief efforts, mapping the extent of flood events is a necessary and important aspect of disaster management. In recent years, deep learning methods have evolved as an effective tool to quickly label high-resolution imagery and provide necessary flood extent mappings. These methods, though, require large amounts of annotated training data to create models that are accurate and robust to new flooded imagery. In this work, we provide FloodTrace, an application that enables effective crowdsourcing for flooded region annotation for machine learning training data, removing the requirement for annotation to be done solely by researchers. We accomplish this through two orthogonal methods within our application, informed by requirements from domain experts. First, we utilize elevation-guided annotation tools and 3D rendering to inform user annotation decisions with digital elevation model data, improving annotation accuracy. For this purpose, we provide a unique annotation method that uses topological data analysis to outperform the state-of-the-art elevation-guided annotation tool in efficiency. Second, we provide a framework for researchers to review aggregated crowdsourced annotations and correct inaccuracies using methods inspired by uncertainty visualization. We conducted a user study to confirm the application effectiveness in which 266 graduate students annotated high-resolution aerial imagery from Hurricane Matthew in North Carolina. Experimental results show the accuracy and efficiency benefits of our application apply even for untrained users. In addition, using our aggregation and correction framework, flood detection models trained on crowdsourced annotations were able to achieve performance equal to models trained on expert-labeled annotations, while requiring a fraction of the time on the part of the researcher.

[CV-134] CAR: Contrast-Agnostic Deformable Medical Image Registration with Contrast-Invariant Latent Regularization

[CV-135] VACoDe: Visual Augmented Contrastive Decoding

[CV-136] Revisiting Multi-Modal LLM Evaluation

Link: https://arxiv.org/abs/2408.05334
Authors: Jian Lu,Shikhar Srivastava,Junyu Chen,Robik Shrestha,Manoj Acharya,Kushal Kafle,Christopher Kanan
Keywords: large language models, multi-modal large language, referring expression comprehension, visual question answering, language models
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

[CV-137] A Recurrent YOLOv8-based framework for Event-Based Object Detection

[CV-138] The impact of internal variability on benchmarking deep learning climate emulators

[CV-139] Zero-shot 3D Segmentation of Abdominal Organs in CT Scans Using Segment Anything Model 2: Adapting Video Tracking Capabilities for 3D Medical Imaging

Link: https://arxiv.org/abs/2408.06170
Authors: Yosuke Yamagishi,Shouhei Hanaoka,Tomohiro Kikuchi,Takahiro Nakao,Yuta Nakamura,Yukihiro Nomura,Soichiro Miki,Takeharu Yoshikawa,Osamu Abe
Keywords: video tracking capabilities, leveraging its video, SAM, study aimed, aimed to evaluate
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 6 figures (including 1 supplemental figure), 3 tables

Abstract:Purpose: This study aimed to evaluate the zero-shot performance of Segment Anything Model 2 (SAM 2) in 3D segmentation of abdominal organs in CT scans, leveraging its video tracking capabilities for volumetric medical imaging. Materials and Methods: Using a subset of the TotalSegmentator CT dataset (n=123) from 8 different institutions, we assessed SAM 2’s ability to segment 8 abdominal organs. Segmentation was initiated from three different Z-coordinate levels (caudal, mid, and cranial levels) of each organ. Performance was measured using the Dice similarity coefficient (DSC). We also analyzed organ volumes to contextualize the results. Results: As a zero-shot approach, larger organs with clear boundaries demonstrated high segmentation performance, with mean (median) DSCs as follows: liver 0.821(0.898), left kidney 0.870(0.921), right kidney 0.862(0.935), and spleen 0.891(0.932). Smaller or less defined structures showed lower performance: gallbladder 0.531(0.590), pancreas 0.361(0.359), and adrenal glands 0.203-0.308(0.109-0.231). Significant differences in DSC were observed depending on the initial slice of segmentation for different organs. A moderate positive correlation was observed between volume size and DSCs (Spearman’s rs = 0.731, P < .001 at caudal-level). DSCs exhibited high variability within organs, ranging from near 0 to almost 1.0, indicating substantial inconsistency in segmentation performance between scans. Conclusion: SAM 2 demonstrated promising zero-shot performance in segmenting certain abdominal organs in CT scans, particularly larger organs with clear boundaries. The model’s ability to segment previously unseen targets without additional training highlights its potential for cross-domain generalization in medical imaging. However, improvements are needed for smaller and less defined structures.
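
The Dice similarity coefficient used in the abstract above can be sketched in a few lines. This is a minimal illustration on flattened binary masks, not the authors' evaluation code; the function name `dice_coefficient`, the toy masks, and the convention for two empty masks are assumptions made for the example. It also hints at one plausible reason (not a claim taken from the paper) why small structures tend to score lower: a single missed voxel costs a small mask far more than a large one.

```python
def dice_coefficient(pred, truth):
    """Dice similarity coefficient between two equal-length binary masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    # Voxels labeled foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Hypothetical "large organ": 100 foreground voxels, one of them missed.
large_truth = [1] * 100
large_pred = [1] * 99 + [0]

# Hypothetical "small organ": 2 foreground voxels, one of them missed.
small_truth = [1] * 2 + [0] * 98
small_pred = [1] * 1 + [0] * 99

print(dice_coefficient(large_pred, large_truth))  # close to 1
print(dice_coefficient(small_pred, small_truth))  # much lower
```

Real volumes would normally be NumPy arrays rather than Python lists; plain lists are used here only to keep the sketch dependency-free.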

[CV-140] ACCELERATION: Sequentially-scanning DECT Imaging Using High Temporal Resolution Image Reconstruction And Temporal Extrapolation

[CV-141] Five Pitfalls When Assessing Synthetic Medical Images with Reference Metrics MICCAI2024

Link: https://arxiv.org/abs/2408.06075
Authors: Melanie Dohmen,Tuan Truong,Ivo M. Baltruschat,Matthias Lenga
Keywords: developed to objectively, objectively and quantitatively, quantitatively compare, metrics, Reference metrics
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 5 figures, accepted at Deep Generative Models workshop @ MICCAI 2024

Abstract:Reference metrics have been developed to objectively and quantitatively compare two images. Especially for evaluating the quality of reconstructed or compressed images, these metrics have proven very useful. Extensive tests of such metrics on benchmarks of artificially distorted natural images have revealed which metrics correlate best with human perception of quality. Direct transfer of these metrics to the evaluation of generative models in medical imaging, however, can easily lead to pitfalls, because assumptions about image content, image data format and image interpretation are often very different. Also, the correlation of reference metrics and human perception of quality can vary strongly for different kinds of distortions, and commonly used metrics, such as SSIM, PSNR and MAE, are not the best choice for all situations. We selected five pitfalls that showcase unexpected and probably undesired reference metric scores and discuss strategies to avoid them.
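
To make the point about data-format assumptions concrete, here is a small sketch of MAE and PSNR on toy intensity values. It is an illustration only and is not claimed to be one of the paper's five pitfalls: the function names, the sample values, and the two candidate data ranges (255 for 8-bit, 4095 for 12-bit) are assumptions chosen for the example. The same pair of images gets a very different PSNR depending on which data range is assumed, while MAE is unchanged.

```python
import math


def mae(reference, test):
    """Mean absolute error between two equal-length intensity lists."""
    return sum(abs(r - t) for r, t in zip(reference, test)) / len(reference)


def psnr(reference, test, data_range):
    """Peak signal-to-noise ratio in dB for an assumed intensity range."""
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(data_range ** 2 / mse)


# Hypothetical intensities; every pixel is off by exactly 10 units.
reference = [100.0, 200.0, 300.0, 400.0]
synthetic = [110.0, 190.0, 310.0, 390.0]

print(mae(reference, synthetic))
print(psnr(reference, synthetic, data_range=255.0))   # 8-bit assumption
print(psnr(reference, synthetic, data_range=4095.0))  # 12-bit assumption
```

The gap between the two PSNR values is `20 * log10(4095 / 255)` dB, which comes purely from the assumed range and not from any change in image quality.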

[CV-142] A Sharpness Based Loss Function for Removing Out-of-Focus Blur

Link: https://arxiv.org/abs/2408.06014
Authors: Uditangshu Aurangabadkar,Darren Ramsook,Anil Kokaram
Keywords: Deep Neural Network, modern Deep Neural, Neural Network, Deep Neural, complex optimization criteria
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, IEEE MMSP

Abstract:The success of modern Deep Neural Network (DNN) approaches can be attributed to the use of complex optimization criteria beyond standard losses such as mean absolute error (MAE) or mean squared error (MSE). In this work, we propose a novel method of utilising a no-reference sharpness metric Q introduced by Zhu and Milanfar for removing out-of-focus blur from images. We also introduce a novel dataset of real-world out-of-focus images for assessing restoration models. Our fine-tuned method produces images with a 7.5 % increase in perceptual quality (LPIPS) as compared to a standard model trained only on MAE. Furthermore, we observe a 6.7 % increase in Q (reflecting sharper restorations) and 7.25 % increase in PSNR over most state-of-the-art (SOTA) methods.

[CV-143] Image Denoising Using Green Channel Prior

Link: https://arxiv.org/abs/2408.05923
Authors: Zhaoming Kong,Fangxi Deng,Xiaowei Yang
Keywords: green channel, appealing and challenging, observations may vary, vary with local, local image contents
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: arXiv admin note: text overlap with arXiv:2402.08235

Abstract:Image denoising is an appealing and challenging task, in that noise statistics of real-world observations may vary with local image contents and different image channels. Specifically, the green channel usually has twice the sampling rate in raw data. To handle noise variances and leverage such channel-wise prior information, we propose a simple and effective green channel prior-based image denoising (GCP-ID) method, which integrates GCP into the classic patch-based denoising framework. Briefly, we exploit the green channel to guide the search for similar patches, which aims to improve the patch grouping quality and encourage sparsity in the transform domain. The grouped image patches are then reformulated into RGGB arrays to explicitly characterize the density of green samples. Furthermore, to enhance the adaptivity of GCP-ID to various image contents, we cast the noise estimation problem into a classification task and train an effective estimator based on convolutional neural networks (CNNs). Experiments on real-world datasets demonstrate the competitive performance of the proposed GCP-ID method for image and video denoising applications in both raw and sRGB spaces. Our code is available at this https URL.
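
The remark that the green channel has twice the sampling rate in raw data follows from the Bayer mosaic layout. The sketch below counts color samples in an RGGB pattern; it only illustrates that layout and is not part of the GCP-ID method. The function names and the 4x6 sensor size are hypothetical.

```python
def bayer_rggb_color(row, col):
    """Color sampled at (row, col) in an RGGB Bayer mosaic."""
    if row % 2 == 0:
        # Even rows alternate R, G, R, G, ...
        return "R" if col % 2 == 0 else "G"
    # Odd rows alternate G, B, G, B, ...
    return "G" if col % 2 == 0 else "B"


def count_samples(height, width):
    """Count how many sensor sites sample each color."""
    counts = {"R": 0, "G": 0, "B": 0}
    for row in range(height):
        for col in range(width):
            counts[bayer_rggb_color(row, col)] += 1
    return counts


print(count_samples(4, 6))
```

For any sensor with even height and width, green receives half of all sites and red and blue a quarter each.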

[CV-144] Deep Learning in Medical Image Registration: Magic or Mirage?

Link: https://arxiv.org/abs/2408.05839
Authors: Rohit Jena,Deeksha Sethi,Pratik Chaudhari,James C. Gee
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

[CV-145] Prototype Learning Guided Hybrid Network for Breast Tumor Segmentation in DCE-MRI

Link: https://arxiv.org/abs/2408.05803
Authors: Lei Zhou,Yuzhong Zhang,Jiadong Zhang,Xuejun Qian,Chen Gong,Kun Sun,Zhongxiang Ding,Xing Wang,Zhenhui Li,Zaiyi Liu,Dinggang Shen
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

[CV-146] C-KANRecon: High-Quality and Accelerated MRI Reconstruction via Adaptive KAN Mechanisms and Intelligent Feature Scaling

[CV-147] Evaluating BM3D and NBNet: A Comprehensive Study of Image Denoising Across Multiple Datasets

Link: https://arxiv.org/abs/2408.05697
Authors: Ghazal Kaviani,Reza Marzban,Ghassan AlRegib
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

[CV-148] BeyondCT: A deep learning model for predicting pulmonary function from chest CT scans

Link: https://arxiv.org/abs/2408.05645
Authors: Kaiwen Geng,Zhiyi Shi,Xiaoyan Zhao,Alaa Ali,Jing Wang,Joseph Leader,Jiantao Pu
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 5 tables, 7 figures, 22 pages

[CV-149] Unidirectional imaging with partially coherent light

Link: https://arxiv.org/abs/2408.05449
Authors: Guangdong Ma,Che-Yung Shen,Jingxi Li,Luzhe Huang,Cagatay Isil,Fazil Onuralp Ardic,Xilin Yang,Yuhang Li,Yuntian Wang,Md Sadman Sakib Rahman,Aydogan Ozcan
Keywords: FOV, partially coherent, imagers form images, Unidirectional imagers form, image formation
Categories: Optics (physics.optics); Computer Vision and Pattern Recognition (cs.CV); Applied Physics (physics.app-ph)
Comments: 25 Pages, 8 Figures

Abstract:Unidirectional imagers form images of input objects only in one direction, e.g., from field-of-view (FOV) A to FOV B, while blocking the image formation in the reverse direction, from FOV B to FOV A. Here, we report unidirectional imaging under spatially partially coherent light and demonstrate high-quality imaging only in the forward direction (A-B) with high power efficiency while distorting the image formation in the backward direction (B-A) along with low power efficiency. Our reciprocal design features a set of spatially engineered linear diffractive layers that are statistically optimized for partially coherent illumination with a given phase correlation length. Our analyses reveal that when illuminated by a partially coherent beam with a correlation length of ~1.5 w or larger, where w is the wavelength of light, diffractive unidirectional imagers achieve robust performance, exhibiting asymmetric imaging performance between the forward and backward directions - as desired. A partially coherent unidirectional imager designed with a smaller correlation length of less than 1.5 w still supports unidirectional image transmission, but with a reduced figure of merit. These partially coherent diffractive unidirectional imagers are compact (axially spanning less than 75 w), polarization-independent, and compatible with various types of illumination sources, making them well-suited for applications in asymmetric visual information processing and communication.

[CV-150] PRISM Lite: A lightweight model for interactive 3D placenta segmentation in ultrasound

Link: https://arxiv.org/abs/2408.05372
Authors: Hao Li,Baris Oguz,Gabriel Arenas,Xing Yao,Jiacheng Wang,Alison Pouch,Brett Byram,Nadav Schwartz,Ipek Oguz
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

[CV-151] GesturePrint: Enabling User Identification for mmWave-based Gesture Recognition Systems

Link: https://arxiv.org/abs/2408.05358
Authors: Lilin Xu,Keyi Wang,Chaojie Gu,Xiuzhen Guo,Shibo He,Jiming Chen
Categories: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: Accepted to the 44th IEEE International Conference on Distributed Computing Systems (ICDCS 2024)

Machine Learning

[LG-0] LOLgorithm: Integrating Semantic, Syntactic and Contextual Elements for Humor Classification

Link: https://arxiv.org/abs/2408.06335
Authors: Tanisha Khurana,Kaushik Pillalamarri,Vikram Pande,Munindar Singh
Keywords: Natural Language Processing, Language Processing, Natural Language, paper explores humor, paper explores
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

[LG-1] Can We Rely on LLM Agents to Draft Long-Horizon Plans? Let's Take TravelPlanner as an Example

[LG-2] Body Transformer: Leveraging Robot Embodiment for Policy Learning

[LG-3] Finding Patterns in Ambiguity: Interpretable Stress Testing in the Decision Boundary CVPR

[LG-4] LEARN: An Invex Loss for Outlier Oblivious Robust Online Optimization

Link: https://arxiv.org/abs/2408.06297
Authors: Adarsh Barik,Anand Krishna,Vincent Y. F. Tan
Keywords: corrupting loss functions, Exponential Adjusted Robust, robust online convex, Log Exponential Adjusted, arbitrary number
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)

Abstract:We study a robust online convex optimization framework, where an adversary can introduce outliers by corrupting loss functions in an arbitrary number of rounds k, unknown to the learner. Our focus is on a novel setting allowing unbounded domains and large gradients for the losses without relying on a Lipschitz assumption. We introduce the Log Exponential Adjusted Robust and iNvex (LEARN) loss, a non-convex (invex) robust loss function to mitigate the effects of outliers and develop a robust variant of the online gradient descent algorithm by leveraging the LEARN loss. We establish tight regret guarantees (up to constants), in a dynamic setting, with respect to the uncorrupted rounds and conduct experiments to validate our theory. Furthermore, we present a unified analysis framework for developing online optimization algorithms for non-convex (invex) losses, utilizing it to provide regret bounds with respect to the LEARN loss, which may be of independent interest.

[LG-5] The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

Link: https://arxiv.org/abs/2408.06292
Authors: Chris Lu,Cong Lu,Robert Tjarko Lange,Jakob Foerster,Jeff Clune,David Ha
Keywords: artificial general intelligence, developing agents capable, discovering new knowledge, grand challenges, challenges of artificial
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

[LG-6] Mambular: A Sequential Model for Tabular Deep Learning

Link: https://arxiv.org/abs/2408.06291
Authors: Anton Frederik Thielmann,Manish Kumar,Christoph Weisser,Arik Reuter,Benjamin Säfken,Soheila Samiee
Keywords: gradient-boosted decision trees, decision trees, tabular data, traditionally been dominated, dominated by gradient-boosted
Categories: Machine Learning (cs.LG)

Abstract:The analysis of tabular data has traditionally been dominated by gradient-boosted decision trees (GBDTs), known for their proficiency with mixed categorical and numerical features. However, recent deep learning innovations are challenging this dominance. We introduce Mambular, an adaptation of the Mamba architecture optimized for tabular data. We extensively benchmark Mambular against state-of-the-art models, including neural networks and tree-based methods, and demonstrate its competitive performance across diverse datasets. Additionally, we explore various adaptations of Mambular to understand its effectiveness for tabular data. We investigate different pooling strategies, feature interaction mechanisms, and bi-directional processing. Our analysis shows that interpreting features as a sequence and passing them through Mamba layers results in surprisingly performant models. The results highlight Mambular's potential as a versatile and powerful architecture for tabular data analysis, expanding the scope of deep learning applications in this domain. The source code is available at this https URL.

[LG-7] Synthetic Patient-Physician Dialogue Generation from Clinical Notes Using LLM

Link: https://arxiv.org/abs/2408.06285
Authors: Trisha Das,Dina Albassam,Jimeng Sun
Keywords: enhance patient-physician communication, improve healthcare accessibility, Medical dialogue systems, enhance patient-physician, patient-physician communication
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

[LG-8] Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment

Link: https://arxiv.org/abs/2408.06266
Authors: Karel D’Oosterlinck,Winnie Xu,Chris Develder,Thomas Demeester,Amanpreet Singh,Christopher Potts,Douwe Kiela,Shikib Mehri
Keywords: Large Language Models, Large Language, Language Models, alignment objectives, Anchored Preference Optimization
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

[LG-9] DUNE: A Machine Learning Deep UNet based Ensemble Approach to Monthly, Seasonal and Annual Climate Forecasting

Link: https://arxiv.org/abs/2408.06262
Authors: Pratik Shukla,Milton Halem
Keywords: numerical weather predictions, climate fields based, deep-learning architectures offer, monthly averaged long-term, averaged long-term data
Categories: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
Comments: Excluding Appendix: 18 pages, 14 figures

Abstract:Capitalizing on the recent availability of ERA5 monthly averaged long-term data records of mean atmospheric and climate fields based on high-resolution reanalysis, deep-learning architectures offer an alternative to physics-based daily numerical weather predictions for subseasonal to seasonal (S2S) and annual means. A novel Deep UNet++-based Ensemble (DUNE) neural architecture is introduced, employing multi-encoder-decoder structures with residual blocks. When initialized from a prior month or year, this architecture produced the first AI-based global monthly, seasonal, or annual mean forecast of 2-meter temperatures (T2m) and sea surface temperatures (SST). ERA5 monthly mean data is used as input for T2m over land, SST over oceans, and solar radiation at the top of the atmosphere for each month of 40 years to train the model. Validation forecasts are performed for an additional two years, followed by five years of forecast evaluations to account for natural annual variability. AI-trained inference forecast weights generate forecasts in seconds, enabling ensemble seasonal forecasts. Root Mean Squared Error (RMSE), Anomaly Correlation Coefficient (ACC), and Heidke Skill Score (HSS) statistics are presented globally and over specific regions. These forecasts outperform persistence, climatology, and multiple linear regression for all domains. DUNE forecasts demonstrate comparable statistical accuracy to NOAA’s operational monthly and seasonal probabilistic outlook forecasts over the US but at significantly higher resolutions. RMSE and ACC error statistics for other recent AI-based daily forecasts also show superior performance for DUNE-based forecasts. The DUNE model’s application to an ensemble data assimilation cycle shows comparable forecast accuracy with a single high-resolution model, potentially eliminating the need for retraining on extrapolated datasets.
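
Two of the skill scores named in the abstract above, RMSE and the anomaly correlation coefficient, can be sketched as follows. This is a simplified illustration, not the authors' evaluation code: it uses the uncentered form of ACC, applies no latitude weighting, and the function names and toy temperature values are assumptions made for the example.

```python
import math


def rmse(forecast, observed):
    """Root mean squared error over paired grid-point values."""
    n = len(forecast)
    return math.sqrt(sum((f - o) ** 2 for f, o in zip(forecast, observed)) / n)


def anomaly_correlation(forecast, observed, climatology):
    """Uncentered anomaly correlation: both fields are compared as
    departures from the same climatological mean."""
    f_anom = [f - c for f, c in zip(forecast, climatology)]
    o_anom = [o - c for o, c in zip(observed, climatology)]
    numerator = sum(fa * oa for fa, oa in zip(f_anom, o_anom))
    denominator = math.sqrt(
        sum(fa * fa for fa in f_anom) * sum(oa * oa for oa in o_anom)
    )
    return numerator / denominator


# Hypothetical monthly-mean temperatures (degrees C) at four grid points.
climatology = [10.0, 12.0, 14.0, 16.0]
observed = [11.0, 11.0, 15.0, 18.0]
forecast = [12.0, 11.0, 15.0, 17.0]

print(rmse(forecast, observed))
print(anomaly_correlation(forecast, observed, climatology))
```

A forecast equal to climatology has all-zero anomalies, so this ACC is undefined for it; an operational evaluation would need to handle that case and would usually weight grid points by area.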

[LG-10] Open-Source Molecular Processing Pipeline for Generating Molecules

[LG-11] Deep Learning System Boundary Testing through Latent Space Style Mixing

Link: https://arxiv.org/abs/2408.06258
Authors: Amr Abdellatif,Xingcheng Chen,Vincenzo Riccio,Andrea Stocco
Keywords: deep learning, generalizability and robustness, crucial for understanding, understanding their generalizability, Evaluating
Categories: Software Engineering (cs.SE); Machine Learning (cs.LG)

Abstract:Evaluating the behavioral frontier of deep learning (DL) systems is crucial for understanding their generalizability and robustness. However, boundary testing is challenging due to their high-dimensional input space. Generative artificial intelligence offers a promising solution by modeling data distribution within compact latent space representations, thereby facilitating finer-grained explorations. In this work, we introduce MIMICRY, a novel black-box system-agnostic test generator that leverages these latent representations to generate frontier inputs for the DL systems under test. Specifically, MIMICRY uses style-based generative adversarial networks trained to learn the representation of inputs with disentangled features. This representation enables embedding style-mixing operations between a source and a target input, combining their features to explore the boundary between them. We evaluated the effectiveness of different MIMICRY configurations in generating boundary inputs for four popular DL image classification systems. Our results show that manipulating the latent space allows for effective and efficient exploration of behavioral frontiers. As opposed to a model-based baseline, MIMICRY generates a higher quality frontier of behaviors which includes more and closer inputs. Additionally, we assessed the validity of these inputs, revealing a high validity rate according to human assessors.

[LG-12] A Large-Scale Study of Model Integration in ML-Enabled Software Systems

[LG-13] A Digital Twin Framework Utilizing Machine Learning for Robust Predictive Maintenance: Enhancing Tire Health Monitoring FAST

Link: https://arxiv.org/abs/2408.06220
Authors: Vispi Karkaria,Jie Chen,Christopher Luey,Chase Siuta,Damien Lim,Robert Radulescu,Wei Chen
Keywords: digital twin framework, digital twin, Remaining Casing Potential, twin framework, predictive maintenance
Categories: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
Comments: Paper accepted at ASME IDETC 2024, and fast-tracked for ASME Journal of Computing and Information Science in Engineering

Abstract:We introduce a novel digital twin framework for predictive maintenance of long-term physical systems. Using monitoring tire health as an application, we show how the digital twin framework can be used to enhance automotive safety and efficiency, and how the technical challenges can be overcome using a three-step approach. Firstly, for managing the data complexity over a long operation span, we employ data reduction techniques to concisely represent physical tires using historical performance and usage data. Relying on these data, for fast real-time prediction, we train a transformer-based model offline on our concise dataset to predict future tire health over time, represented as Remaining Casing Potential (RCP). Based on our architecture, our model quantifies both epistemic and aleatoric uncertainty, providing reliable confidence intervals around predicted RCP. Secondly, to incorporate real-time data, we update the predictive model in the digital twin framework, ensuring its accuracy throughout its life span with the aid of hybrid modeling and the use of discrepancy function. Thirdly, to assist decision making in predictive maintenance, we implement a Tire State Decision Algorithm, which strategically determines the optimal timing for tire replacement based on RCP forecasted by our transformer model. This approach ensures our digital twin accurately predicts system health, continually refines its digital representation, and supports predictive maintenance decisions. Our framework effectively embodies a physical system, leveraging big data and machine learning for predictive maintenance, model updates, and decision-making.

[LG-14] Computability of Classification and Deep Learning: From Theoretical Limits to Practical Feasibility through Quantization

Link: https://arxiv.org/abs/2408.06212
Authors: Holger Boche,Vit Fojtik,Adalbert Fono,Gitta Kutyniok
Keywords: past decade led, deep learning, deep learning methods, deep learning applications, unwavering success
Categories: Machine Learning (cs.LG); Computational Complexity (cs.CC)

Abstract:The unwavering success of deep learning in the past decade led to the increasing prevalence of deep learning methods in various application fields. However, the downsides of deep learning, most prominently its lack of trustworthiness, may not be compatible with safety-critical or high-responsibility applications requiring stricter performance guarantees. Recently, several instances of deep learning applications have been shown to be subject to theoretical limitations of computability, undermining the feasibility of performance guarantees when employed on real-world computers. We extend the findings by studying computability in the deep learning framework from two perspectives: From an application viewpoint in the context of classification problems and a general limitation viewpoint in the context of training neural networks. In particular, we show restrictions on the algorithmic solvability of classification problems that also render the algorithmic detection of failure in computations in a general setting infeasible. Subsequently, we prove algorithmic limitations in training deep neural networks even in cases where the underlying problem is well-behaved. Finally, we end with a positive observation, showing that in quantized versions of classification and deep network training, computability restrictions do not arise or can be overcome to a certain degree.

[LG-15] Improving Structural Diversity of Blackbox LLMs via Chain-of-Specification Prompting

Link: https://arxiv.org/abs/2408.06186
Authors: Halley Young,Yimeng Zeng,Jacob Gardner,Osbert Bastani
Keywords: large language models, key challenge facing, challenge facing large, facing large language, diversity
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

[LG-16] Centralized and Federated Heart Disease Classification Models Using UCI Dataset and their Shapley-value Based Interpretability

Link: https://arxiv.org/abs/2408.06183
Authors: Mario Padilla Rodriguez,Mohamed Nafea
Keywords: accurate diagnostic methods, Cardiovascular diseases, mortality worldwide, diagnostic methods, Hungary and Switzerland
Categories: Machine Learning (cs.LG)

Abstract:Cardiovascular diseases are a leading cause of mortality worldwide, highlighting the need for accurate diagnostic methods. This study benchmarks centralized and federated machine learning algorithms for heart disease classification using the UCI dataset which includes 920 patient records from four hospitals in the USA, Hungary and Switzerland. Our benchmark is supported by Shapley-value interpretability analysis to quantify features’ importance for classification. In the centralized setup, various binary classification algorithms are trained on pooled data, with a support vector machine (SVM) achieving the highest testing accuracy of 83.3%, surpassing the established benchmark of 78.7% with logistic regression. Additionally, federated learning algorithms with four clients (hospitals) are explored, leveraging the dataset’s natural partition to enhance privacy without sacrificing accuracy. Federated SVM, an uncommon approach in the literature, achieves a top testing accuracy of 73.8%. Our interpretability analysis aligns with existing medical knowledge of heart disease indicators. Overall, this study establishes a benchmark for efficient and interpretable pre-screening tools for heart disease while maintaining patients’ privacy.

[LG-17] A Methodological Report on Anomaly Detection on Dynamic Knowledge Graphs

[LG-18] Contexts Matter: An Empirical Study on Contextual Influence in Fairness Testing for Deep Learning Systems

Link: https://arxiv.org/abs/2408.06102
Authors: Chengwen Du,Tao Chen
Keywords: deep learning systems, increasingly important, deep learning, learning systems, Fairness testing
Categories: Software Engineering (cs.SE); Machine Learning (cs.LG)
Comments: Received by ESEM 24

Abstract:Background: Fairness testing for deep learning systems has become increasingly important. However, much work assumes perfect context and conditions from the other parts: well-tuned hyperparameters for accuracy, rectified bias in data, and mitigated bias in the labeling. Yet, these are often difficult to achieve in practice due to their resource-/labour-intensive nature. Aims: In this paper, we aim to understand how varying contexts affect fairness testing outcomes. Method: We conduct an extensive empirical study, which covers 10,800 cases, to investigate how contexts can change the fairness testing result at the model level against the existing assumptions. We also study why the outcomes were observed from the lens of correlation/fitness landscape analysis. Results: Our results show that different context types and settings generally lead to a significant impact on the testing, which is mainly caused by the shifts of the fitness landscape under varying contexts. Conclusions: Our findings provide key insights for practitioners to evaluate the test generators and hint at future research directions.

[LG-19] Generalization capabilities of MeshGraphNets to unseen geometries for fluid dynamics

[LG-20] Approximating Discrimination Within Models When Faced With Several Non-Binary Sensitive Attributes

Link: https://arxiv.org/abs/2408.06099
Authors: Yijun Bian,Yujie Luo,Ping Xu
Keywords: sensitive attributes, multiple sensitive attributes, multiple, machine learning, hierarchically and historically
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY)
Comments: The first two authors contributed equally, listed in alphabetical order. arXiv admin note: substantial text overlap with arXiv:2405.09251

Abstract:Mitigating discrimination in machine learning (ML) models can be complicated because multiple factors may interweave with each other, both hierarchically and historically. Yet few existing fairness measures are able to capture the discrimination level within ML models in the face of multiple sensitive attributes. To bridge this gap, we propose a fairness measure based on distances between sets from a manifold perspective, named ‘harmonic fairness measure via manifolds (HFM)’ with two optional versions, which can deal with a fine-grained discrimination evaluation for several sensitive attributes of multiple values. To accelerate the computation of distances of sets, we further propose two approximation algorithms named ‘Approximation of distance between sets for one sensitive attribute with multiple values (ApproxDist)’ and ‘Approximation of extended distance between sets for several sensitive attributes with multiple values (ExtendDist)’ to respectively resolve bias evaluation of one single sensitive attribute with multiple values and that of several sensitive attributes with multiple values. Moreover, we provide an algorithmic effectiveness analysis for ApproxDist under certain assumptions to explain how well it could work. The empirical results demonstrate that our proposed fairness measure HFM is valid and that the approximation algorithms (i.e., ApproxDist and ExtendDist) are effective and efficient.

[LG-21] Building Decision Making Models Through Language Model Regime

Link: https://arxiv.org/abs/2408.06087
Authors: Yu Zhang, Haoxiang Liu, Feijun Jiang, Weihua Luo, Kaifu Zhang
Keywords: decision making, making problems leveraging, decision, making, problems leveraging
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

[LG-22] A-BDD: Leveraging Data Augmentations for Safe Autonomous Driving in Adverse Weather and Lighting

[LG-23] Fully Bayesian Differential Gaussian Processes through Stochastic Differential Equations

[LG-24] Don't You (Project Around Discs)? Neural Network Surrogate and Projected Gradient Descent for Calibrating an Intervertebral Disc Finite Element Model

Link: https://arxiv.org/abs/2408.06067
Authors: Matan Atad, Gabriel Gruber, Marx Ribeiro, Luis Fernando Nicolini, Robert Graf, Hendrik Möller, Kati Nispel, Ivan Ezhov, Daniel Rueckert, Jan S. Kirschke
Keywords: human intervertebral discs, finite element, intervertebral discs, human intervertebral, reliability and application
Subjects: Machine Learning (cs.LG)
Comments: Under submission. Project code: this https URL

Abstract:Accurate calibration of finite element (FE) models of human intervertebral discs (IVDs) is essential for their reliability and application in diagnosing and planning treatments for spinal conditions. Traditional calibration methods are computationally intensive, requiring iterative, derivative-free optimization algorithms that often take hours or days to converge. This study addresses these challenges by introducing a novel, efficient, and effective calibration method for an L4-L5 IVD FE model using a neural network (NN) surrogate. The NN surrogate predicts simulation outcomes with high accuracy, outperforming other machine learning models, and significantly reduces the computational cost associated with traditional FE simulations. Next, a Projected Gradient Descent (PGD) approach guided by gradients of the NN surrogate is proposed to efficiently calibrate FE models. Our method explicitly enforces feasibility with a projection step, thus maintaining material bounds throughout the optimization process. The proposed method is evaluated against state-of-the-art Genetic Algorithm (GA) and inverse model baselines on synthetic and in vitro experimental datasets. Our approach demonstrates superior performance on synthetic data, achieving a Mean Absolute Error (MAE) of 0.06 compared to the baselines’ MAE of 0.18 and 0.54, respectively. On experimental specimens, our method outperforms the baseline in 5 out of 6 cases. Most importantly, our approach reduces calibration time to under three seconds, compared to up to 8 days per sample required by traditional calibration. Such efficiency paves the way for applying more complex FE models, enabling accurate patient-specific simulations and advancing spinal treatment planning.
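
The idea of a projection step that keeps parameters inside material bounds can be illustrated with a minimal Python sketch. It is not the authors' implementation: the quadratic loss below is a hypothetical stand-in for the NN surrogate, and the function name, box bounds, step size, and iteration count are all assumptions chosen for illustration.

```python
import numpy as np

def projected_gradient_descent(grad_fn, x0, lower, upper, step=0.1, iters=200):
    """Minimize a differentiable loss subject to box bounds lower <= x <= upper."""
    x = np.clip(np.asarray(x0, dtype=float), lower, upper)
    for _ in range(iters):
        x = x - step * grad_fn(x)      # ordinary gradient step
        x = np.clip(x, lower, upper)   # projection back onto the feasible box
    return x

# Hypothetical stand-in for a surrogate loss: squared distance to a target
# whose second coordinate lies outside the feasible box.
target = np.array([0.5, 2.0])
grad = lambda x: 2.0 * (x - target)
lower, upper = np.array([0.0, 0.0]), np.array([1.0, 1.0])

x_star = projected_gradient_descent(grad, x0=[0.9, 0.1], lower=lower, upper=upper)
print(x_star)
```

In this toy problem the unconstrained minimizer violates the upper bound in the second coordinate, so the iterate ends up on the boundary there while the first coordinate converges freely.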

[LG-25] TruVRF: Towards Triple-Granularity Verification on Machine Unlearning

Link: https://arxiv.org/abs/2408.06063
Authors: Chunyi Zhou, Anmin Fu, Zhiyang Dai
Keywords: mislead data contributors, reliable validation methods, dishonest model providers, creating opportunities, forgotten has led
Subjects: Machine Learning (cs.LG)

Abstract:The concept of the right to be forgotten has led to growing interest in machine unlearning, but reliable validation methods are lacking, creating opportunities for dishonest model providers to mislead data contributors. Traditional invasive methods like backdoor injection are not feasible for legacy data. To address this, we introduce TruVRF, a non-invasive unlearning verification framework operating at class-, volume-, and sample-level granularities. TruVRF includes three Unlearning-Metrics designed to detect different types of dishonest servers: Neglecting, Lazy, and Deceiving. Unlearning-Metric-I checks class alignment, Unlearning-Metric-II verifies sample count, and Unlearning-Metric-III confirms specific sample deletion. Evaluations on three datasets show TruVRF’s robust performance, with over 90% accuracy for Metrics I and III, and a 4.8% to 8.2% inference deviation for Metric II. TruVRF also demonstrates generalizability and practicality across various conditions and with state-of-the-art unlearning frameworks like SISA and Amnesiac Unlearning.

[LG-26] Perceptual Similarity for Measuring Decision-Making Style and Policy Diversity in Games

[LG-27] What Ails Generative Structure-based Drug Design: Too Little or Too Much Expressivity?

Link: https://arxiv.org/abs/2408.06050
Authors: Rafał Karczewski, Samuel Kaski, Markus Heinonen, Vikas Garg
Keywords: structure-based drug design, accelerate structure-based drug, empirical performance turns, drug design, elaborate training
Subjects: Machine Learning (cs.LG)
Comments: 25 pages, 11 figures

Abstract:Several generative models with elaborate training and sampling procedures have been proposed recently to accelerate structure-based drug design (SBDD); however, perplexingly, their empirical performance turns out to be suboptimal. We seek to better understand this phenomenon from both theoretical and empirical perspectives. Since most of these models apply graph neural networks (GNNs), one may suspect that they inherit the representational limitations of GNNs. We analyze this aspect, establishing the first such results for protein-ligand complexes. A plausible counterview may attribute the underperformance of these models to their excessive parameterizations, inducing expressivity at the expense of generalization. We also investigate this possibility with a simple metric-aware approach that learns an economical surrogate for affinity to infer an unlabelled molecular graph and optimizes for labels conditioned on this graph and molecular properties. The resulting model achieves state-of-the-art results using 100x fewer trainable parameters and affords up to 1000x speedup. Collectively, our findings underscore the need to reassess and redirect the existing paradigm and efforts for SBDD.

[LG-28] Spacetime E(n)-Transformer: Equivariant Attention for Spatio-temporal Graphs

[LG-29] Graph Clustering with Cross-View Feature Propagation

Link: https://arxiv.org/abs/2408.06029
Authors: Zhixuan Duan, Zuo Wang, Fanghui Bi
Keywords: multi-view feature propagation, Cross-View Feature Propagation, latent feature propagation, feature propagation
Subjects: Machine Learning (cs.LG)

Abstract:Graph clustering is a fundamental and challenging learning task, which is conventionally approached by grouping similar vertices based on edge structure and feature similarity. In contrast to previous methods, in this paper, we investigate how multi-view feature propagation can influence cluster discovery in graph data. To this end, we present Graph Clustering With Cross-View Feature Propagation (GCCFP), a novel method that leverages multi-view feature propagation to enhance cluster identification in graph data. GCCFP employs a unified objective function that utilizes graph topology and multi-view vertex features to determine vertex cluster membership, regularized by a module that supports key latent feature propagation. We derive an iterative algorithm to optimize this function, prove model convergence within a finite number of iterations, and analyze its computational complexity. Our experiments on various real-world graphs demonstrate the superior clustering performance of GCCFP compared to well-established methods, manifesting its effectiveness across different scenarios.

[LG-30] Layer-Specific Optimization: Sensitivity Based Convolution Layers Basis Search

[LG-31] Uncertainty-Informed Volume Visualization using Implicit Neural Representation IEEE-VIS2024

[LG-32] LUT Tensor Core: Lookup Table Enables Efficient Low-Bit LLM Inference Acceleration

Link: https://arxiv.org/abs/2408.06003
Authors: Zhiwen Mo, Lei Wang, Jianyu Wei, Zhichen Zeng, Shijie Cao, Lingxiao Ma, Naifeng Jing, Ting Cao, Jilong Xue, Fan Yang, Mao Yang
Keywords: large language model, demands ever-greater resources, LUT Tensor Core, rapid growing trend, shrink memory usage
Subjects: Hardware Architecture (cs.AR); Machine Learning (cs.LG)

Abstract:As large language model (LLM) inference demands ever-greater resources, there is a rapidly growing trend of using low-bit weights to shrink memory usage and boost inference efficiency. However, these low-bit LLMs introduce the need for mixed-precision matrix multiplication (mpGEMM), which is a crucial yet under-explored operation that involves multiplying lower-precision weights with higher-precision activations. Unfortunately, current hardware does not natively support mpGEMM, resulting in indirect and inefficient dequantization-based implementations. To address the mpGEMM requirements in low-bit LLMs, we explored the lookup table (LUT)-based approach for mpGEMM. However, a conventional LUT implementation falls short of its potential. To fully harness the power of LUT-based mpGEMM, we introduce LUT Tensor Core, a software-hardware co-design optimized for low-bit LLM inference. Specifically, we introduce software-based operator fusion and table symmetrization techniques to optimize table precompute and table storage, respectively. Then, LUT Tensor Core proposes the hardware design featuring an elongated tiling shape design to enhance table reuse and a bit-serial design to support various precision combinations in mpGEMM. Moreover, we design an end-to-end compilation stack with new instructions for LUT-based mpGEMM, enabling efficient LLM compilation and optimizations. The evaluation on low-bit LLMs (e.g., BitNet, LLAMA) shows that LUT Tensor Core achieves more than an order of magnitude improvement in both compute density and energy efficiency.
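
The general lookup-table idea behind LUT-based mpGEMM can be shown with a small Python sketch. This is only an illustration under stated assumptions, not the paper's hardware or compiler design: weights are assumed to be 1-bit (bit 1 means +1, bit 0 means -1), activations are grouped four at a time, and the names `build_tables` and `lut_dot` are hypothetical.

```python
import numpy as np

G = 4  # activations per lookup group (an assumption for this sketch)

def build_tables(acts):
    """For each group of G activations, tabulate all 2**G signed partial sums."""
    groups = acts.reshape(-1, G)                          # (n_groups, G)
    bits = (np.arange(2 ** G)[:, None] >> np.arange(G)) & 1
    signs = 2 * bits - 1                                  # (2**G, G), entries in {-1, +1}
    return groups @ signs.T                               # (n_groups, 2**G)

def lut_dot(tables, weight_bits):
    """Dot product with 1-bit weights using table lookups instead of multiplies."""
    idx = weight_bits.reshape(-1, G) @ (1 << np.arange(G))  # bit pattern per group
    return tables[np.arange(len(idx)), idx].sum()

rng = np.random.default_rng(0)
acts = rng.standard_normal(16)            # higher-precision activations
bits = rng.integers(0, 2, size=16)        # 1-bit weights
direct = acts @ (2 * bits - 1)            # reference: dequantize, then multiply
via_lut = lut_dot(build_tables(acts), bits)
```

The tables depend only on the activations, so in a matrix product they could be built once and reused for every weight row, which is where the savings would come from.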

[LG-33] Transfer learning of state-based potential games for process optimization in decentralized manufacturing systems

[LG-34] Global-to-Local Support Spectrums for Language Model Explainability

Link: https://arxiv.org/abs/2408.05976
Authors: Lucas Agussurja, Xinyang Lu, Bryan Kian Hsiang Low
Keywords: points, Existing sample-based methods, influence functions, functions and representer, approximating the effect
Subjects: Machine Learning (cs.LG)

Abstract:Existing sample-based methods, like influence functions and representer points, measure the importance of a training point by approximating the effect of its removal from training. As such, they are skewed towards outliers and points that are very close to the decision boundaries. The explanations provided by these methods are often static and not specific enough for different test points. In this paper, we propose a method to generate an explanation in the form of support spectrums which are based on two main ideas: the support sets and a global-to-local importance measure. The support set is the set of training points, in the predicted class, that “lie in between” the test point and training points in the other classes. They indicate how well the test point can be distinguished from the points not in the predicted class. The global-to-local importance measure is obtained by decoupling existing methods into the global and local components which are then used to select the points in the support set. Using this method, we are able to generate explanations that are tailored to specific test points. In the experiments, we show the effectiveness of the method in image classification and text generation tasks.

[LG-35] Target Detection of Safety Protective Gear Using the Improved YOLOv5

[LG-36] ConvKGYarn: Spinning Configurable and Scalable Conversational Knowledge Graph QA datasets with Large Language Models

Link: https://arxiv.org/abs/2408.05948
Authors: Ronak Pradeep, Daniel Lee, Ali Mousavi, Jeff Pound, Yisi Sang, Jimmy Lin, Ihab Ilyas, Saloni Potdar, Mostafa Arefiyan, Yunyao Li
Keywords: Large Language Models, Large Language, advancement of Large, assistants necessitates dynamic, Language Models
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)

[LG-37] Inverse design of Non-parameterized Ventilated Acoustic Resonator via Variational Autoencoder with Acoustic Response-encoded Latent Space

[LG-38] Cluster-Segregate-Perturb (CSP): A Model-agnostic Explainability Pipeline for Spatiotemporal Land Surface Forecasting Models

Link: https://arxiv.org/abs/2408.05916
Authors: Tushar Verma, Sudipan Saha
Keywords: regional climate change, climate change effects, modelling regional climate, integrates satellite images, Satellite images
Subjects: Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Abstract:Satellite images have become increasingly valuable for modelling regional climate change effects. Earth surface forecasting represents one such task that integrates satellite images with meteorological data to capture the joint evolution of regional climate change effects. However, understanding the complex relationship between specific meteorological variables and land surface evolution poses a significant challenge. In light of this challenge, our paper introduces a pipeline that integrates principles from both perturbation-based explainability techniques like LIME and global marginal explainability techniques like PDP, besides addressing the constraints of using such techniques when applying them to high-dimensional spatiotemporal deep models. The proposed pipeline simplifies the undertaking of diverse investigative analyses, such as marginal sensitivity analysis, marginal correlation analysis, lag analysis, etc., on complex land surface forecasting models. In this study, we utilised Convolutional Long Short-Term Memory (ConvLSTM) as the surface forecasting model and performed analyses on the Normalized Difference Vegetation Index (NDVI) of the surface forecasts, since meteorological variables like temperature, pressure, and precipitation significantly influence it. The study area encompasses various regions in Europe. Our analyses show that precipitation exhibits the highest sensitivity in the study area, followed by temperature and pressure. Pressure has little to no direct effect on NDVI. Additionally, interesting nonlinear correlations between meteorological variables and NDVI have been uncovered.

[LG-39] Polyp SAM 2: Advancing Zero shot Polyp Segmentation in Colorectal Cancer Detection

[LG-40] Online-Score-Aided Federated Learning: Taming the Resource Constraints in Wireless Networks

Link: https://arxiv.org/abs/2408.05886
Authors: Md Ferdous Pervej, Minseok Choi, Andreas F. Molisch
Keywords: pose significant challenges, device pose significant, protects data privacy, time-varying wireless network, wireless device pose
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: Under review for possible publication in IEEE Transactions on Wireless Communications (TWC)

Abstract:While FL is a widely popular distributed ML strategy that protects data privacy, time-varying wireless network parameters and heterogeneous system configurations of the wireless device pose significant challenges. Although the limited radio and computational resources of the network and the clients, respectively, are widely acknowledged, two critical yet often ignored aspects are (a) wireless devices can only dedicate a small chunk of their limited storage for the FL task and (b) new training samples may arrive in an online manner in many practical wireless applications. Therefore, we propose a new FL algorithm called OSAFL, specifically designed to learn tasks relevant to wireless applications under these practical considerations. Since it has long been proven that under extreme resource constraints, clients may perform an arbitrary number of local training steps, which may lead to client drift under statistically heterogeneous data distributions, we leverage normalized gradient similarities and exploit weighting clients’ updates based on optimized scores that facilitate the convergence rate of the proposed OSAFL algorithm. Our extensive simulation results on two different tasks – each with three different datasets – with four popular ML models validate the effectiveness of OSAFL compared to six existing state-of-the-art FL baselines.

[LG-41] GFlowNet Training by Policy Gradients

Link: https://arxiv.org/abs/2408.05885
Authors: Puhua Niu, Shili Wu, Mingzhou Fan, Xiaoning Qian
Keywords: Generative Flow Networks, generate combinatorial objects, Generative Flow, Flow Networks, desired properties
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Generative Flow Networks (GFlowNets) have been shown to be effective at generating combinatorial objects with desired properties. Here we propose a new GFlowNet training framework, with policy-dependent rewards, that bridges keeping flow balance of GFlowNets to optimizing the expected accumulated reward in traditional Reinforcement-Learning (RL). This enables the derivation of new policy-based GFlowNet training methods, in contrast to existing ones resembling value-based RL. It is known that the design of backward policies in GFlowNet training affects efficiency. We further develop a coupled training strategy that jointly solves GFlowNet forward policy training and backward policy design. Performance analysis is provided with a theoretical guarantee of our policy-based GFlowNet training. Experiments on both simulated and real-world datasets verify that our policy-based strategies provide advanced RL perspectives for robust gradient estimation to improve GFlowNet performance.

[LG-42] Low-Rank Approximation Adaptation and Other Tales

Link: https://arxiv.org/abs/2408.05883
Authors: Jun Lu
Keywords: natural language processing, modern data analysis, Low-rank approximation, signal processing, language processing
Subjects: Machine Learning (cs.LG); Information Retrieval (cs.IR); Numerical Analysis (math.NA)

Abstract:Low-rank approximation is a fundamental technique in modern data analysis, widely utilized across various fields such as signal processing, machine learning, and natural language processing. Despite its ubiquity, the mechanics of low-rank approximation and its application in adaptation can sometimes be obscure, leaving practitioners and researchers with questions about its true capabilities and limitations. This paper seeks to clarify low-rank approximation and adaptation by offering a comprehensive guide that reveals their inner workings and explains their utility in a clear and accessible way. Our focus here is to develop a solid intuition for how low-rank approximation and adaptation operate, and why they are so effective. We begin with basic concepts and gradually build up to the mathematical underpinnings, ensuring that readers of all backgrounds can gain a deeper understanding of low-rank approximation and adaptation. We strive to strike a balance between informal explanations and rigorous mathematics, ensuring that both newcomers and experienced experts can benefit from this survey. Additionally, we introduce new low-rank decomposition and adaptation algorithms that have not yet been explored in the field, hoping that future researchers will investigate their potential applicability.
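
As a minimal illustration of low-rank approximation in general (not one of the new algorithms the survey introduces), the Python sketch below builds the best rank-k approximation of a matrix via truncated SVD and measures its Frobenius error, which by the Eckart-Young theorem equals the root of the sum of the squared discarded singular values. The function name `best_rank_k` and the random test matrix are assumptions made for this example.

```python
import numpy as np

def best_rank_k(A, k):
    """Return the rank-k truncated-SVD approximation of A and all singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_k = (U[:, :k] * s[:k]) @ Vt[:k]   # keep only the k leading singular triplets
    return A_k, s

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 6))
A_k, s = best_rank_k(A, k=2)

err = np.linalg.norm(A - A_k)               # Frobenius norm of the residual
predicted = np.sqrt(np.sum(s[2:] ** 2))     # error predicted by Eckart-Young
print(err, predicted)
```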

[LG-43] LLM-Based Robust Product Classification in Commerce and Compliance

Link: https://arxiv.org/abs/2408.05874
Authors: Sina Gholamian, Gianfranco Romani, Bartosz Rudnikowicz, Laura Skylaki
Keywords: Product classification, crucial task, compliance regulations, regulations are verified, verified and taxes
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 11 pages

[LG-44] Leveraging Knowledge Graph-Based Human-Like Memory Systems to Solve Partially Observable Markov Decision Processes

[LG-45] Comparative Evaluation of Memory Technologies for Synaptic Crossbar Arrays- Part 2: Design Knobs and DNN Accuracy Trends

Link: https://arxiv.org/abs/2408.05857
Authors: Jeffry Victor, Chunguang Wang, Sumeet K. Gupta
Keywords: Deep Neural Networks, Neural Networks, Deep Neural, hardware non-idealities limit, acceleration of Deep
Subjects: Emerging Technologies (cs.ET); Machine Learning (cs.LG)

Abstract:Crossbar memory arrays have been touted as the workhorse of in-memory computing (IMC)-based acceleration of Deep Neural Networks (DNNs), but the associated hardware non-idealities limit their efficacy. To address this, cross-layer design solutions that reduce the impact of hardware non-idealities on DNN accuracy are needed. In Part 1 of this paper, we established the co-optimization strategies for various memory technologies and their crossbar arrays, and conducted a comparative technology evaluation in the context of IMC robustness. In this part, we analyze various design knobs such as array size and bit-slice (number of bits per device) and their impact on the performance of 8T SRAM, ferroelectric transistor (FeFET), Resistive RAM (ReRAM) and spin-orbit-torque magnetic RAM (SOT-MRAM) in the context of inference accuracy at the 7 nm technology node. Further, we study the effect of circuit design solutions such as Partial Wordline Activation (PWA) and custom ADC reference levels that reduce the hardware non-idealities and comparatively analyze the response of each technology to such accuracy-enhancing techniques. Our results on ResNet-20 (with CIFAR-10) show that PWA increases accuracy by up to 32.56% while custom ADC reference levels yield up to 31.62% accuracy enhancement. We observe that compared to the other technologies, FeFET, by virtue of its small layout height and high distinguishability of its memory states, is best suited for large arrays. For higher bit-slices and a more complex dataset (ResNet-50 with CIFAR-100) we found that ReRAM matches the performance of FeFET.

[LG-46] Using Retriever Augmented Large Language Models for Attack Graph Generation

Link: https://arxiv.org/abs/2408.05855
Authors: Renascence Tarafder Prapty, Ashish Kundu, Arun Iyengar
Keywords: effective vulnerability management, modern systems increases, threat modeling techniques, modeling techniques, complexity of modern
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract:As the complexity of modern systems increases, so does the importance of assessing their security posture through effective vulnerability management and threat modeling techniques. One powerful tool in the arsenal of cybersecurity professionals is the attack graph, a representation of all potential attack paths within a system that an adversary might exploit to achieve a certain objective. Traditional methods of generating attack graphs involve expert knowledge, manual curation, and computational algorithms that might not cover the entire threat landscape due to the ever-evolving nature of vulnerabilities and exploits. This paper explores the approach of leveraging large language models (LLMs), such as ChatGPT, to automate the generation of attack graphs by intelligently chaining Common Vulnerabilities and Exposures (CVEs) based on their preconditions and effects. It also shows how to utilize LLMs to create attack graphs from threat reports.

[LG-47] An End-to-End Model for Time Series Classification In the Presence of Missing Values

Link: https://arxiv.org/abs/2408.05849
Authors: Pengshuai Yao, Mengna Liu, Xu Cheng, Fan Shi, Huan Li, Xiufeng Liu, Shengyong Chen
Keywords: Time series, practical applications, time series analysis, prevalent issue, Time series classification
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Time series classification with missing data is a prevalent issue in time series analysis, as temporal data often contain missing values in practical applications. The traditional two-stage approach, which handles imputation and classification separately, can result in sub-optimal performance as label information is not utilized in the imputation process. On the other hand, a one-stage approach can learn features under missing information, but feature representation is limited as imputed errors are propagated in the classification process. To overcome these challenges, this study proposes an end-to-end neural network that unifies data imputation and representation learning within a single framework, allowing the imputation process to take advantage of label information. Differing from previous methods, our approach places less emphasis on the accuracy of imputation data and instead prioritizes classification performance. A specifically designed multi-scale feature learning module is implemented to extract useful information from the noise-imputation data. The proposed model is evaluated on 68 univariate time series datasets from the UCR archive, as well as a multivariate time series dataset with various missing data ratios and 4 real-world datasets with missing information. The results indicate that the proposed model outperforms state-of-the-art approaches for incomplete time series classification, particularly in scenarios with high levels of missing data.

[LG-48] Online Matrix Completion: A Collaborative Approach with Hott Items

Link: https://arxiv.org/abs/2408.05843
Authors: Dheeraj Baby, Soumyabrata Pal
Keywords: low rank matrix, rank matrix completion, Delta, matrix completion problem, rank matrix
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Appeared at the Forty-first International Conference on Machine Learning, 2024

Abstract:We investigate the low rank matrix completion problem in an online setting with $M$ users, $N$ items, $T$ rounds, and an unknown rank-$r$ reward matrix $R \in \mathbb{R}^{M \times N}$. This problem has been well-studied in the literature and has several applications in practice. In each round, we recommend $S$ carefully chosen distinct items to every user and observe noisy rewards. In the regime where $M, N \gg T$, we propose two distinct computationally efficient algorithms for recommending items to users and analyze them under the benign hott items assumption. 1) First, for $S=1$, under additional incoherence/smoothness assumptions on $R$, we propose the phased algorithm PhasedClusterElim. Our algorithm obtains a near-optimal per-user regret of $\tilde{O}(NM^{-1}(\Delta^{-1}+\Delta_{\text{hott}}^{-2}))$ where $\Delta_{\text{hott}}, \Delta$ are problem-dependent gap parameters with $\Delta_{\text{hott}} \gg \Delta$ almost always. 2) Second, we consider a simplified setting with $S=r$ where we make significantly milder assumptions on $R$. Here, we introduce another phased algorithm, DeterminantElim, to derive a regret guarantee of $\widetilde{O}(NM^{-1/r}\Delta_{\text{det}}^{-1})$ where $\Delta_{\text{det}}$ is another problem-dependent gap. Both algorithms crucially use collaboration among users to jointly eliminate sub-optimal items for groups of users successively in phases, but with distinctive and novel approaches.

[LG-49] Sampling Foundational Transformer: A Theoretical Perspective

[LG-50] Kernel Density Estimators in Large Dimensions

Link: https://arxiv.org/abs/2408.05807
Authors: Giulio Biroli, Marc Mézard
Keywords: studies Kernel density, Central Limit Theorem, paper studies Kernel, Kernel density estimation, rho
Subjects: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistics Theory (math.ST); Machine Learning (stat.ML)

Abstract:This paper studies Kernel density estimation for a high-dimensional distribution $\rho(x)$. Traditional approaches have focused on the limit of large number of data points $n$ and fixed dimension $d$. We analyze instead the regime where both the number $n$ of data points $y_i$ and their dimensionality $d$ grow with a fixed ratio $\alpha=(\log n)/d$. Our study reveals three distinct statistical regimes for the kernel-based estimate of the density $\hat{\rho}_h^{\mathcal{D}}(x)=\frac{1}{n h^d}\sum_{i=1}^n K\left(\frac{x-y_i}{h}\right)$, depending on the bandwidth $h$: a classical regime for large bandwidth where the Central Limit Theorem (CLT) holds, which is akin to the one found in traditional approaches. Below a certain value of the bandwidth, $h_{CLT}(\alpha)$, we find that the CLT breaks down. The statistics of $\hat{\rho}_h^{\mathcal{D}}(x)$ for a fixed $x$ drawn from $\rho(x)$ is given by a heavy-tailed distribution (an alpha-stable distribution). In particular below a value $h_G(\alpha)$, we find that $\hat{\rho}_h^{\mathcal{D}}(x)$ is governed by extreme value statistics: only a few points in the database matter and give the dominant contribution to the density estimator. We provide a detailed analysis for high-dimensional multivariate Gaussian data. We show that the optimal bandwidth threshold based on Kullback-Leibler divergence lies in the new statistical regime identified in this paper. Our findings reveal limitations of classical approaches, show the relevance of these new statistical regimes, and offer new insights for Kernel density estimation in high-dimensional settings.
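
The estimator in the abstract, a sum of kernels centred on the data points and normalized by n times h to the power d, can be written out directly. The Python sketch below assumes a standard Gaussian kernel, which the abstract does not fix; the function name `kde` and the tiny one-dimensional dataset are hypothetical and serve only to check that the estimate behaves like a density.

```python
import numpy as np

def kde(x, data, h):
    """Kernel density estimate at point x: (1 / (n h^d)) * sum_i K((x - y_i) / h).

    Assumes a standard Gaussian kernel K in d dimensions.
    data has shape (n, d); x has shape (d,).
    """
    n, d = data.shape
    u = (x - data) / h
    K = np.exp(-0.5 * np.sum(u ** 2, axis=1)) / (2 * np.pi) ** (d / 2)
    return K.sum() / (n * h ** d)

# One-dimensional sanity check on two data points.
data = np.array([[0.0], [1.0]])
grid = np.linspace(-10.0, 11.0, 4201)
dx = grid[1] - grid[0]
total_mass = sum(kde(np.array([g]), data, h=0.5) for g in grid) * dx
print(total_mass)
```

With a single data point at the origin, h = 1 and d = 1, the estimate at x = 0 reduces to the Gaussian peak value 1/sqrt(2π).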

[LG-51] A Single Goal is All You Need: Skills and Exploration Emerge from Contrastive RL without Rewards Demonstrations or Subgoals

[LG-52] A Comparative Study of Convolutional and Recurrent Neural Networks for Storm Surge Prediction in Tampa Bay

Link: https://arxiv.org/abs/2408.05797
Authors: Mandana Farhang Ghahfarokhi, Seyed Hossein Sonbolestan, Mahta Zamanizadeh
Keywords: common deep learning, deep learning architectures, storm surge modeling, surrogate storm surge, common deep
Subjects: Machine Learning (cs.LG)

Abstract:In this paper, we compare the performance of three common deep learning architectures, CNN-LSTM, LSTM, and 3D-CNN, in the context of surrogate storm surge modeling. The study site for this paper is the Tampa Bay area in Florida. Using high-resolution atmospheric data from the reanalysis models and historical water level data from NOAA tide stations, we trained and tested these models to evaluate their performance. Our findings indicate that the CNN-LSTM model outperforms the other architectures, achieving a test loss of 0.010 and an R-squared (R2) score of 0.84. The LSTM model, although it achieved the lowest training loss of 0.007 and the highest training R2 of 0.88, exhibited poorer generalization with a test loss of 0.014 and an R2 of 0.77. The 3D-CNN model showed reasonable performance with a test loss of 0.011 and an R2 of 0.82 but displayed instability under extreme conditions. A case study on Hurricane Ian, which caused a significant negative surge of -1.5 meters in Tampa Bay, indicates the CNN-LSTM model’s robustness and accuracy in extreme scenarios.

[LG-53] Continual Learning of Nonlinear Independent Representations

[LG-54] On zero-shot learning in neural state estimation of power distribution systems

Link: https://arxiv.org/abs/2408.05787
Authors: Aleksandr Berezin, Stephan Balduin, Thomas Oberließen, Sebastian Peter, Eric MSP Veith
Keywords: power distribution systems, distribution systems, neural state estimation, paper addresses, addresses the challenge
Categories: Machine Learning (cs.LG)
Comments: 13 pages, 2 figures, associated source code available at this https URL

Abstract:This paper addresses the challenge of neural state estimation in power distribution systems. We identified a research gap in the current state of the art, which lies in the inability of models to adapt to changes in the power grid, such as loss of sensors and branch switching. Our experiments demonstrate that graph neural networks are the most promising models for this use case and that their performance can degrade with scale. We propose augmentations to remedy this issue and perform a comprehensive grid search of different model configurations for common zero-shot learning scenarios in neural state estimation.

[LG-55] CURLing the Dream: Contrastive Representations for World Modeling in Reinforcement Learning

[LG-56] Pareto Front Shape-Agnostic Pareto Set Learning in Multi-Objective Optimization

Link: https://arxiv.org/abs/2408.05778
Authors: Rongguang Ye, Longcan Chen, Wei-Bin Kou, Jinyuan Zhang, Hisao Ishibuchi
Keywords: Pareto set learning, Pareto set, Pareto front, Pareto front shape, Pareto
Categories: Machine Learning (cs.LG)
Comments: 7 pages

Abstract:Pareto set learning (PSL) is an emerging approach for acquiring the complete Pareto set of a multi-objective optimization problem. Existing methods primarily rely on the mapping of preference vectors in the objective space to Pareto optimal solutions in the decision space. However, the sampling of preference vectors theoretically requires prior knowledge of the Pareto front shape to ensure high performance of the PSL methods. Designing a sampling strategy of preference vectors is difficult since the Pareto front shape cannot be known in advance. To make Pareto set learning work effectively in any Pareto front shape, we propose a Pareto front shape-agnostic Pareto Set Learning (GPSL) that does not require the prior information about the Pareto front. The fundamental concept behind GPSL is to treat the learning of the Pareto set as a distribution transformation problem. Specifically, GPSL can transform an arbitrary distribution into the Pareto set distribution. We demonstrate that training a neural network by maximizing hypervolume enables the process of distribution transformation. Our proposed method can handle any shape of the Pareto front and learn the Pareto set without requiring prior knowledge. Experimental results show the high performance of our proposed method on diverse test problems compared with recent Pareto set learning algorithms.
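
The abstract names hypervolume as the training objective. As a rough illustration of that quantity only (not of GPSL itself), the sketch below computes the hypervolume of a set of two-objective points, assuming both objectives are minimised and a user-chosen reference point bounds the region. The function name and the two-objective restriction are assumptions made here for brevity.

```python
def hypervolume_2d(points, reference):
    """Area dominated by `points` and bounded by `reference` (both objectives minimised)."""
    # Keep only points that strictly improve on the reference in both objectives.
    inside = sorted(
        (p[0], p[1]) for p in points if p[0] < reference[0] and p[1] < reference[1]
    )
    area = 0.0
    best_f2 = reference[1]
    for f1, f2 in inside:  # sweep in order of increasing first objective
        if f2 < best_f2:  # the point is not dominated by any earlier point
            area += (reference[0] - f1) * (best_f2 - f2)
            best_f2 = f2
    return area
```

A larger value means the point set covers more of the objective space below the reference point, which is why it can serve as a quantity to maximise.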

[LG-57] Scalable and Adaptive Spectral Embedding for Attributed Graph Clustering CIKM2024

Link: https://arxiv.org/abs/2408.05765
Authors: Yunhui Liu, Tieke He, Qing Wu, Tao Zheng, Jianhua Zhao
Keywords: made promising advancements, recent years, Attributed graph clustering, Attributed graph, aims to group
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Accepted by CIKM 2024 (Short Paper)

Abstract:Attributed graph clustering, which aims to group the nodes of an attributed graph into disjoint clusters, has made promising advancements in recent years. However, most existing methods face challenges when applied to large graphs due to the expensive computational cost and high memory usage. In this paper, we introduce Scalable and Adaptive Spectral Embedding (SASE), a simple attributed graph clustering method devoid of parameter learning. SASE comprises three main components: node features smoothing via $k$-order simple graph convolution, scalable spectral clustering using random Fourier features, and adaptive order selection. With these designs, SASE not only effectively captures global cluster structures but also exhibits linear time and space complexity relative to the graph size. Empirical results demonstrate the superiority of SASE. For example, on the ArXiv dataset with 169K nodes and 1.17M edges, SASE achieves a 6.9% improvement in ACC and a $5.87\times$ speedup compared to the runner-up, S3GC.
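
The first SASE component, feature smoothing via repeated simple graph convolution, can be sketched as below. This assumes the common normalisation with self-loops, D^{-1/2} (A + I) D^{-1/2}, which the abstract does not spell out. It also uses dense lists for readability, so it does not reproduce the linear complexity claimed for SASE. Function names are hypothetical.

```python
import math


def normalized_adjacency(n, edges):
    """Dense symmetric-normalised adjacency with self-loops: D^{-1/2} (A + I) D^{-1/2}."""
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        a[i][i] = 1.0  # self-loop
    for u, v in edges:
        a[u][v] = 1.0
        a[v][u] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def smooth_features(n, edges, features, k):
    """Apply k rounds of normalised neighbourhood averaging to the node features."""
    s = normalized_adjacency(n, edges)
    x = [row[:] for row in features]
    for _ in range(k):
        x = [
            [sum(s[i][j] * x[j][c] for j in range(n)) for c in range(len(x[i]))]
            for i in range(n)
        ]
    return x
```

Larger `k` mixes information from more distant neighbours, which is presumably why the method pairs the smoothing with an adaptive choice of the order.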

[LG-58] Personalized Federated Learning for improving radar based precipitation nowcasting on heterogeneous areas

Link: https://arxiv.org/abs/2408.05761
Authors: Judith Sáinz-Pardo Díaz, María Castrillo, Juraj Bartok, Ignacio Heredia Cachá, Irina Malkin Ondík, Ivan Martynovskyi, Khadijeh Alibabaei, Lisana Berberi, Valentin Kozlov, Álvaro López García
Keywords: increasing generation, processing and exploiting, exploiting data, distributed weather radar, data
Categories: Machine Learning (cs.LG)
Comments: Accepted for publication in Earth Science Informatics

Abstract:The increasing generation of data in different areas of life, such as the environment, highlights the need to explore new techniques for processing and exploiting data for useful purposes. In this context, artificial intelligence techniques, especially through deep learning models, are key tools to be used on the large amount of data that can be obtained, for example, from weather radars. In many cases, the information collected by these radars is not open, or belongs to different institutions, thus needing to deal with the distributed nature of this data. In this work, the applicability of a personalized federated learning architecture, which has been called adapFL, on distributed weather radar images is addressed. To this end, given a single available radar covering 400 km in diameter, the captured images are divided in such a way that they are disjointly distributed into four different federated clients. The results obtained with adapFL are analyzed in each zone, as well as in a central area covering part of the surface of each of the previously distributed areas. The ultimate goal of this work is to study the generalization capability of this type of learning technique for its extrapolation to use cases in which a representative number of radars is available, whose data can not be centralized due to technical, legal or administrative concerns. The results of this preliminary study indicate that the performance obtained in each zone with the adapFL approach allows improving the results of the federated learning approach, the individual deep learning models and the classical Continuity Tracking Radar Echoes by Correlation approach.

[LG-59] Efficient and Versatile Robust Fine-Tuning of Zero-shot Models ECCV2024

[LG-60] MTSCI: A Conditional Diffusion Model for Multivariate Time Series Consistent Imputation CIKM2024

[LG-61] Deep Learning with Data Privacy via Residual Perturbation

[LG-62] Fast and Scalable Semi-Supervised Learning for Multi-View Subspace Clustering

Link: https://arxiv.org/abs/2408.05707
Authors: Huaming Ling, Chenglong Bao, Jiebo Song, Zuoqiang Shi
Keywords: Scalable Semi-supervised Multi-view, Semi-supervised Multi-view Subspace, Multi-view Subspace Clustering, Fast and Scalable, Scalable Semi-supervised
Categories: Machine Learning (cs.LG)
Comments: 40 pages, 7 figures

Abstract:In this paper, we introduce a Fast and Scalable Semi-supervised Multi-view Subspace Clustering (FSSMSC) method, a novel solution to the high computational complexity commonly found in existing approaches. FSSMSC features linear computational and space complexity relative to the size of the data. The method generates a consensus anchor graph across all views, representing each data point as a sparse linear combination of chosen landmarks. Unlike traditional methods that manage the anchor graph construction and the label propagation process separately, this paper proposes a unified optimization model that facilitates simultaneous learning of both. An effective alternating update algorithm with convergence guarantees is proposed to solve the unified optimization model. Additionally, the method employs the obtained anchor graph and landmarks’ low-dimensional representations to deduce low-dimensional representations for raw data. Following this, a straightforward clustering approach is conducted on these low-dimensional representations to achieve the final clustering results. The effectiveness and efficiency of FSSMSC are validated through extensive experiments on multiple benchmark datasets of varying scales.

[LG-63] Predicting Chaotic System Behavior using Machine Learning Techniques

Link: https://arxiv.org/abs/2408.05702
Authors: Huaiyuan Rao, Yichen Zhao, Qiang Lai
Keywords: machine learning techniques, Generation Reservoir Computing, Reservoir Computing, demonstrated superior performance, Long short-term Memory
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 8 pages, 15 figures

Abstract:Recently, machine learning techniques, particularly deep learning, have demonstrated superior performance over traditional time series forecasting methods across various applications, including both single-variable and multi-variable predictions. This study aims to investigate the capability of (i) Next Generation Reservoir Computing (NG-RC), (ii) Reservoir Computing (RC), and (iii) Long Short-Term Memory (LSTM) for predicting chaotic system behavior, and to compare their performance in terms of accuracy, efficiency, and robustness. These methods are applied to predict time series obtained from four representative chaotic systems: the Lorenz, Rössler, Chen, and Qi systems. In conclusion, we found that NG-RC is more computationally efficient and offers greater potential for predicting chaotic system behavior.

[LG-64] SMILES-Mamba: Chemical Mamba Foundation Models for Drug ADMET Prediction

Link: https://arxiv.org/abs/2408.05696
Authors: Bohao Xu, Yingzhou Lu, Chenhao Li, Ling Yue, Xiao Wang, Nan Hao, Tianfan Fu, Jim Chen
Keywords: safety and efficacy, critical for ensuring, ensuring safety, drug discovery, small-molecule drugs
Categories: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments:

Abstract:In drug discovery, predicting the absorption, distribution, metabolism, excretion, and toxicity (ADMET) properties of small-molecule drugs is critical for ensuring safety and efficacy. However, the process of accurately predicting these properties is often resource-intensive and requires extensive experimental data. To address this challenge, we propose SMILES-Mamba, a two-stage model that leverages both unlabeled and labeled data through a combination of self-supervised pretraining and fine-tuning strategies. The model first pre-trains on a large corpus of unlabeled SMILES strings to capture the underlying chemical structure and relationships, before being fine-tuned on smaller, labeled datasets specific to ADMET tasks. Our results demonstrate that SMILES-Mamba exhibits competitive performance across 22 ADMET datasets, achieving the highest score in 14 tasks, highlighting the potential of self-supervised learning in improving molecular property prediction. This approach not only enhances prediction accuracy but also reduces the dependence on large, labeled datasets, offering a promising direction for future research in drug discovery.

[LG-65] A Novel Momentum-Based Deep Learning Techniques for Medical Image Classification and Segmentation

[LG-66] The Bandit Whisperer: Communication Learning for Restless Bandits

Link: https://arxiv.org/abs/2408.05686
Authors: Yunfan Zhao, Tonghan Wang, Dheeraj Nagaraj, Aparna Taneja, Milind Tambe
Keywords: Restless Multi-Arm Bandits, Applying Reinforcement Learning, Applying Reinforcement, Multi-Arm Bandits, Restless Multi-Arm
Categories: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:

Abstract:Applying Reinforcement Learning (RL) to Restless Multi-Arm Bandits (RMABs) offers a promising avenue for addressing allocation problems with resource constraints and temporal dynamics. However, classic RMAB models largely overlook the challenges of (systematic) data errors - a common occurrence in real-world scenarios due to factors like varying data collection protocols and intentional noise for differential privacy. We demonstrate that conventional RL algorithms used to train RMABs can struggle to perform well in such settings. To solve this problem, we propose the first communication learning approach in RMABs, where we study which arms, when involved in communication, are most effective in mitigating the influence of such systematic data errors. In our setup, the arms receive Q-function parameters from similar arms as messages to guide behavioral policies, steering Q-function updates. We learn communication strategies by considering the joint utility of messages across all pairs of arms and using a Q-network architecture that decomposes the joint utility. Both theoretical and empirical evidence validate the effectiveness of our method in significantly improving RMAB performance across diverse problems.

[LG-67] SRTFD: Scalable Real-Time Fault Diagnosis through Online Continual Learning

[LG-68] Efficient Federated Learning Using Dynamic Update and Adaptive Pruning with Momentum on Shared Server Data

[LG-69] Tensor Decomposition Meets RKHS: Efficient Algorithms for Smooth and Misaligned Data

Link: https://arxiv.org/abs/2408.05677
Authors: Brett W. Larsen, Tamara G. Kolda, Anru R. Zhang, Alex H. Williams
Keywords: multidimensional data array, canonical polyadic, tensor decomposition decomposes, decomposes a multidimensional, sum of outer
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG)
Comments:

Abstract:The canonical polyadic (CP) tensor decomposition decomposes a multidimensional data array into a sum of outer products of finite-dimensional vectors. Instead, we can replace some or all of the vectors with continuous functions (infinite-dimensional vectors) from a reproducing kernel Hilbert space (RKHS). We refer to tensors with some infinite-dimensional modes as quasitensors, and the approach of decomposing a tensor with some continuous RKHS modes is referred to as CP-HiFi (hybrid infinite and finite dimensional) tensor decomposition. An advantage of CP-HiFi is that it can enforce smoothness in the infinite dimensional modes. Further, CP-HiFi does not require the observed data to lie on a regular and finite rectangular grid and naturally incorporates misaligned data. We detail the methodology and illustrate it on a synthetic example.
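
To make the "sum of outer products" concrete, here is a small sketch that rebuilds a three-way array from CP factors, plus a variant in which the third mode is a list of ordinary Python callables evaluated at an arbitrary coordinate `t`. The callables only stand in for the continuous modes described in the abstract; no RKHS machinery or fitting procedure is implemented, and all names are hypothetical.

```python
def cp_reconstruct(weights, factors_a, factors_b, factors_c):
    """Build T[i][j][k] = sum_r weights[r] * A[i][r] * B[j][r] * C[k][r]."""
    rank = len(weights)
    return [
        [
            [sum(weights[r] * a[r] * b[r] * c[r] for r in range(rank)) for c in factors_c]
            for b in factors_b
        ]
        for a in factors_a
    ]


def cp_continuous_entry(weights, a_row, b_row, mode_functions, t):
    """Entry for discrete indices (i, j) and a continuous coordinate t in the third mode."""
    return sum(
        w * a * b * f(t) for w, a, b, f in zip(weights, a_row, b_row, mode_functions)
    )
```

Because `cp_continuous_entry` accepts any `t`, observations do not need to sit on a shared grid, which is the sense in which a continuous mode can accommodate misaligned data.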

[LG-70] Utilizing Large Language Models to Optimize the Detection and Explainability of Phishing Websites

[LG-71] Controlling for discrete unmeasured confounding in nonlinear causal models

Link: https://arxiv.org/abs/2408.05647
Authors: Patrick Burauel, Frederick Eberhardt, Michel Besserve
Keywords: identifying causal relationships, major challenge, challenge for identifying, identifying causal, causal relationships
Categories: Machine Learning (cs.LG); Methodology (stat.ME)
Comments:

Abstract:Unmeasured confounding is a major challenge for identifying causal relationships from non-experimental data. Here, we propose a method that can accommodate unmeasured discrete confounding. Extending recent identifiability results in deep latent variable models, we show theoretically that confounding can be detected and corrected under the assumption that the observed data is a piecewise affine transformation of a latent Gaussian mixture model and that the identity of the mixture components is confounded. We provide a flow-based algorithm to estimate this model and perform deconfounding. Experimental results on synthetic and real-world data provide support for the effectiveness of our approach.

[LG-72] Eigen Attention: Attention in Low-Rank Space for KV Cache Compression

[LG-73] Federated Smoothing Proximal Gradient for Quantile Regression with Non-Convex Penalties

[LG-74] Speculative Diffusion Decoding: Accelerating Language Generation through Diffusion

Link: https://arxiv.org/abs/2408.05636
Authors: Jacob K Christopher, Brian R Bartoldson, Bhavya Kailkhura, Ferdinando Fioretto
Keywords: widely adopted method, accelerate large language, Speculative decoding, widely adopted, adopted method
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

[LG-75] Forecasting Day-Ahead Electricity Prices in the Integrated Single Electricity Market: Addressing Volatility with Comparative Machine Learning Methods

[LG-76] An Information-Theoretic Analysis of Temporal GNNs

Link: https://arxiv.org/abs/2408.05624
Authors: Amirmohammad Farzaneh
Keywords: Graph Neural Networks, Temporal Graph Neural, Graph Neural, Neural Networks, Temporal Graph
Categories: Information Theory (cs.IT); Machine Learning (cs.LG)
Comments: To be presented at Information Theory Workshop 2024

Abstract:Temporal Graph Neural Networks, a new and trending area of machine learning, suffer from a lack of formal analysis. In this paper, information theory is used as the primary tool to provide a framework for the analysis of temporal GNNs. For this reason, the concept of information bottleneck is used and adjusted to be suitable for a temporal analysis of such networks. To this end, a new definition for Mutual Information Rate is provided, and the potential use of this new metric in the analysis of temporal GNNs is studied.

[LG-77] Residual-INR: Communication Efficient On-Device Learning Using Implicit Neural Representation

[LG-78] Mitigating Metropolitan Carbon Emissions with Dynamic Eco-driving at Scale

[LG-79] Exploring Applications of State Space Models and Advanced Training Techniques in Sequential Recommendations: A Comparative Study on Efficiency and Performance

[LG-80] Safety Enhancement in Planetary Rovers: Early Detection of Tip-over Risks Using Autoencoders

Link: https://arxiv.org/abs/2408.05602
Authors: Mariela De Lucas Alvarez
Keywords: Autonomous robots consistently, robots consistently encounter, consistently encounter unforeseen, encounter unforeseen dangerous, unforeseen dangerous situations
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: 10 pages, 7 figures

Abstract:Autonomous robots consistently encounter unforeseen dangerous situations during exploration missions. The characteristic rimless wheels in the AsguardIV rover allow it to overcome challenging terrains. However, steep slopes or difficult maneuvers can cause the rover to tip over and threaten the completion of a mission. This work focuses on identifying early signs or initial stages for potential tip-over events to predict and detect these critical moments before they fully occur, possibly preventing accidents and enhancing the safety and stability of the rover during its exploration mission. Inertial Measurement Units (IMU) readings are used to develop compact, robust, and efficient Autoencoders that combine the power of sequence processing of Long Short-Term Memory Networks (LSTM). By leveraging LSTM-based Autoencoders, this work contributes predictive capabilities for detecting tip-over risks and developing safety measures for more reliable exploration missions.

[LG-81] Sequential Representation Learning via Static-Dynamic Conditional Disentanglement ECCV2024

[LG-82] Meta Clustering of Neural Bandits KDD2024

Link: https://arxiv.org/abs/2408.05586
Authors: Yikun Ban, Yunzhe Qi, Tianxin Wei, Lihui Liu, Jingrui He
Keywords: sequential decision-making process, decision-making process, Upper Confidence Bound, powerful framework, framework to formulate
Categories: Machine Learning (cs.LG)
Comments: KDD 2024

Abstract:The contextual bandit has been identified as a powerful framework to formulate the recommendation process as a sequential decision-making process, where each item is regarded as an arm and the objective is to minimize the regret of T rounds. In this paper, we study a new problem, Clustering of Neural Bandits, by extending previous work to the arbitrary reward function, to strike a balance between user heterogeneity and user correlations in the recommender system. To solve this problem, we propose a novel algorithm called M-CNB, which utilizes a meta-learner to represent and rapidly adapt to dynamic clusters, along with an informative Upper Confidence Bound (UCB)-based exploration strategy. We provide an instance-dependent performance guarantee for the proposed algorithm that withstands the adversarial context, and we further prove the guarantee is at least as good as state-of-the-art (SOTA) approaches under the same assumptions. In extensive experiments conducted in both recommendation and online classification scenarios, M-CNB outperforms SOTA baselines. This shows the effectiveness of the proposed approach in improving online recommendation and online classification performance.

[LG-83] Dynamical causality under invisible confounders

Link: https://arxiv.org/abs/2408.05584
Authors: Jinling Yan, Shao-Wu Zhang, Chihao Zhang, Weitian Huang, Jifan Shi, Luonan Chen
Keywords: spurious causal interactions, invisible confounders, CIC method, prone to spurious, confounders
Categories: Machine Learning (cs.LG); Methodology (stat.ME)
Comments: 23 pages, 5 figures

Abstract:Causality inference is prone to spurious causal interactions, due to the substantial confounders in a complex system. While many existing methods based on statistical or dynamical methods attempt to address misidentification challenges, there remains a notable lack of effective methods to infer causality, in particular in the presence of invisible/unobservable confounders. As a result, accurately inferring causation with invisible confounders remains a largely unexplored and outstanding issue in data science and AI fields. In this work, we propose a method to overcome such challenges to infer dynamical causality under invisible confounders (CIC method) and further reconstruct the invisible confounders from time-series data by developing an orthogonal decomposition theorem in a delay embedding space. The core of our CIC method lies in its ability to decompose the observed variables not in their original space but in their delay embedding space into the common and private subspaces respectively, thereby quantifying causality between those variables both theoretically and computationally. This theoretical foundation ensures the causal detection for any high-dimensional system even with only two observed variables under many invisible confounders, which is actually a long-standing problem in the field. In addition to the invisible confounder problem, such a decomposition actually makes the intertwined variables separable in the embedding space, thus also solving the non-separability problem of causal inference. Extensive validation of the CIC method is carried out using various real datasets, and the experimental results demonstrate its effectiveness in reconstructing real biological networks even with unobserved confounders.

[LG-84] Incremental Gauss-Newton Descent for Machine Learning

Link: https://arxiv.org/abs/2408.05560
Authors: Mikalai Korbit, Mario Zanon
Keywords: Stochastic Gradient Descent, Stochastic Gradient, Gradient Descent, solve problems arising, SGD
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments:

Abstract:Stochastic Gradient Descent (SGD) is a popular technique used to solve problems arising in machine learning. While very effective, SGD also has some weaknesses and various modifications of the basic algorithm have been proposed in order to at least partially tackle them, mostly yielding accelerated versions of SGD. Filling a gap in the literature, we present a modification of the SGD algorithm exploiting approximate second-order information based on the Gauss-Newton approach. The new method, which we call Incremental Gauss-Newton Descent (IGND), has essentially the same computational burden as standard SGD, appears to converge faster on certain classes of problems, and can also be accelerated. The key intuition making it possible to implement IGND efficiently is that, in the incremental case, approximate second-order information can be condensed into a scalar value that acts as a scaling constant of the update. We derive IGND starting from the theory supporting Gauss-Newton methods in a general setting and then explain how IGND can also be interpreted as a well-scaled version of SGD, which makes tuning the algorithm simpler, and provides increased robustness. Finally, we show how IGND can be used in practice by solving supervised learning tasks as well as reinforcement learning problems. The simulations show that IGND can significantly outperform SGD while performing at least as well as SGD in the worst case.
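
The abstract says that, in the incremental case, second-order information collapses to a scalar that rescales the update. One plausible reading, sketched here for a single-sample least-squares loss on a linear model, is to divide the SGD step by the squared norm of the prediction gradient. This is an illustrative assumption, not the authors' IGND algorithm. The function names and the `eps` safeguard are hypothetical.

```python
def sgd_step(theta, x, y, lr):
    """Plain SGD step on the loss 0.5 * (theta . x - y)**2 for one sample."""
    residual = sum(t * xi for t, xi in zip(theta, x)) - y
    return [t - lr * residual * xi for t, xi in zip(theta, x)]


def gauss_newton_scaled_step(theta, x, y, lr, eps=1e-8):
    """Same direction as SGD, rescaled by a scalar Gauss-Newton curvature estimate."""
    residual = sum(t * xi for t, xi in zip(theta, x)) - y
    grad = x  # gradient of the linear prediction with respect to theta
    scale = 1.0 / (sum(g * g for g in grad) + eps)  # 1 / (J J^T) for a single sample
    return [t - lr * scale * residual * g for t, g in zip(theta, grad)]
```

With `lr=1.0` the scaled step fits the current sample almost exactly regardless of the magnitude of `x`, whereas the plain step can overshoot when `x` is large. That gives a sense of why a well-scaled update could make tuning simpler.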

[LG-85] Evolutionary Neural Architecture Search for 3D Point Cloud Analysis

[LG-86] Convergence Analysis for Deep Sparse Coding via Convolutional Neural Networks

Link: https://arxiv.org/abs/2408.05540
Authors: Jianfei Li, Han Feng, Ding-Xuan Zhou
Keywords: sparse coding theory, sparse coding, Deep Sparse Coding, neural network architectures, advanced neural network
Categories: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Information Theory (cs.IT); Neural and Evolutionary Computing (cs.NE)
Comments:

Abstract:In this work, we explore the intersection of sparse coding theory and deep learning to enhance our understanding of feature extraction capabilities in advanced neural network architectures. We begin by introducing a novel class of Deep Sparse Coding (DSC) models and establish a thorough theoretical analysis of their uniqueness and stability properties. By applying iterative algorithms to these DSC models, we derive convergence rates for convolutional neural networks (CNNs) in their ability to extract sparse features. This provides a strong theoretical foundation for the use of CNNs in sparse feature learning tasks. We additionally extend this convergence analysis to more general neural network architectures, including those with diverse activation functions, as well as self-attention and transformer-based models. This broadens the applicability of our findings to a wide range of deep learning methods for deep sparse feature extraction. Inspired by the strong connection between sparse coding and CNNs, we also explore training strategies to encourage neural networks to learn more sparse features. Through numerical experiments, we demonstrate the effectiveness of these approaches, providing valuable insights for the design of efficient and interpretable deep learning models.

[LG-87] Can LLMs Replace Manual Annotation of Software Engineering Artifacts?

Link: https://arxiv.org/abs/2408.05534
Authors: Toufique Ahmed, Premkumar Devanbu, Christoph Treude, Michael Pradel
Keywords: obtain greater generalizability, include human-subject studies, human-subject studies, tools and processes, Experimental evaluations
Categories: Software Engineering (cs.SE); Machine Learning (cs.LG)
Comments:

Abstract:Experimental evaluations of software engineering innovations, e.g., tools and processes, often include human-subject studies as a component of a multi-pronged strategy to obtain greater generalizability of the findings. However, human-subject studies in our field are challenging, due to the cost and difficulty of finding and employing suitable subjects, ideally, professional programmers with varying degrees of experience. Meanwhile, large language models (LLMs) have recently started to demonstrate human-level performance in several areas. This paper explores the possibility of substituting costly human subjects with much cheaper LLM queries in evaluations of code and code-related artifacts. We study this idea by applying six state-of-the-art LLMs to ten annotation tasks from five datasets created by prior work, such as judging the accuracy of a natural language summary of a method or deciding whether a code change fixes a static analysis warning. Our results show that replacing some human annotation effort with LLMs can produce inter-rater agreements equal or close to human-rater agreement. To help decide when and how to use LLMs in human-subject studies, we propose model-model agreement as a predictor of whether a given task is suitable for LLMs at all, and model confidence as a means to select specific samples where LLMs can safely replace human annotators. Overall, our work is the first step toward mixed human-LLM evaluations in software engineering.

[LG-88] CryoBench: Diverse and challenging datasets for the heterogeneity problem in cryo-EM

[LG-89] PointNCBW: Towards Dataset Ownership Verification for Point Clouds via Negative Clean-label Backdoor Watermark

[LG-90] A Laplacian-based Quantum Graph Neural Network for Semi-Supervised Learning

Link: https://arxiv.org/abs/2408.05498
Authors: Hamed Gholipour, Farid Bozorgnia, Kailash Hambarde, Hamzeh MohammadGheymasi, Javier Mancilla, Andre Sequeira, Joao Neves
Keywords: remains largely unexplored, domain remains largely, Breast Cancer Wisconsin, classical graph-based semi-supervised, quantum domain remains
Categories: Machine Learning (cs.LG); Quantum Physics (quant-ph)
Comments:

Abstract:The Laplacian learning method is a well-established technique in classical graph-based semi-supervised learning, but its potential in the quantum domain remains largely unexplored. This study investigates the performance of the Laplacian-based Quantum Semi-Supervised Learning (QSSL) method across four benchmark datasets: Iris, Wine, Breast Cancer Wisconsin, and Heart Disease. Further analysis explores the impact of increasing qubit counts, revealing that adding more qubits to a quantum system doesn’t always improve performance. The effectiveness of additional qubits depends on the quantum algorithm and how well it matches the dataset. Additionally, we examine the effects of varying entangling layers on entanglement entropy and test accuracy. The performance of Laplacian learning is highly dependent on the number of entangling layers, with optimal configurations varying across different datasets. Typically, moderate levels of entanglement offer the best balance between model complexity and generalization capabilities. These observations highlight the crucial need for precise hyperparameter tuning tailored to each dataset to achieve optimal performance in Laplacian learning methods.

[LG-91] Variational Inference Failures Under Model Symmetries: Permutation Invariant Posteriors for Bayesian Neural Networks

Link: https://arxiv.org/abs/2408.05496
Authors: Yoav Gelberg, Tycho F.A. van der Ouderaa, Mark van der Wilk, Yarin Gal
Keywords: Bayesian neural network, neural network architectures, neural network, Bayesian neural, rise to Bayesian
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:Weight space symmetries in neural network architectures, such as permutation symmetries in MLPs, give rise to Bayesian neural network (BNN) posteriors with many equivalent modes. This multimodality poses a challenge for variational inference (VI) techniques, which typically rely on approximating the posterior with a unimodal distribution. In this work, we investigate the impact of weight space permutation symmetries on VI. We demonstrate, both theoretically and empirically, that these symmetries lead to biases in the approximate posterior, which degrade predictive performance and posterior fit if not explicitly accounted for. To mitigate this behavior, we leverage the symmetric structure of the posterior and devise a symmetrization mechanism for constructing permutation invariant variational posteriors. We show that the symmetrized distribution has a strictly better fit to the true posterior, and that it can be trained using the original ELBO objective with a modified KL regularization term. We demonstrate experimentally that our approach mitigates the aforementioned biases and results in improved predictions and a higher ELBO.
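
The permutation symmetry mentioned in this abstract is easy to check numerically: reordering the hidden units of an MLP, together with the matching weights, leaves the network function unchanged. The sketch below uses a one-hidden-layer network with tanh units and a scalar output, chosen only for illustration. It demonstrates the symmetry itself, not the paper's symmetrized variational posterior.

```python
import math


def mlp_forward(x, w1, b1, w2, b2):
    """One-hidden-layer MLP with tanh units and a scalar output."""
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(w1, b1)
    ]
    return sum(w * h for w, h in zip(w2, hidden)) + b2


def permute_hidden_units(w1, b1, w2, perm):
    """Reorder hidden units: rows of w1, entries of b1 and entries of w2 move together."""
    return [w1[p] for p in perm], [b1[p] for p in perm], [w2[p] for p in perm]
```

Each of the possible orderings of the hidden units gives a different weight vector with identical predictions, so the posterior over weights has that many equivalent modes, while a unimodal approximation can only sit on one of them.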

[LG-92] Topological Blind Spots: Understanding and Extending Topological Deep Learning Through the Lens of Expressivity

Link: https://arxiv.org/abs/2408.05486
Authors: Yam Eitan, Yoav Gelberg, Guy Bar-Shalom, Fabrizio Frasca, Michael Bronstein, Haggai Maron
Keywords: Topological deep learning, HOMP, facilitates learning, deep learning, Topological
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:Topological deep learning (TDL) facilitates learning from data represented by topological structures. The primary model utilized in this setting is higher-order message-passing (HOMP), which extends traditional graph message-passing neural networks (MPNN) to diverse topological domains. Given the significant expressivity limitations of MPNNs, our paper aims to explore both the strengths and weaknesses of HOMP’s expressive power and subsequently design novel architectures to address these limitations. We approach this from several perspectives: First, we demonstrate HOMP’s inability to distinguish between topological objects based on fundamental topological and metric properties such as diameter, orientability, planarity, and homology. Second, we show HOMP’s limitations in fully leveraging the topological structure of objects constructed using common lifting and pooling operators on graphs. Third, we compare HOMP’s expressive power to hypergraph networks, which are the most extensively studied TDL methods. We then develop two new classes of TDL models: multi-cellular networks (MCN) and scalable multi-cellular networks (SMCN). These models draw inspiration from expressive graph architectures. MCN can reach full expressivity but is highly unscalable, whereas SMCN offers a more scalable alternative that still mitigates many of HOMP’s expressivity limitations. Finally, we construct a synthetic dataset, where TDL models are tasked with separating pairs of topological objects based on basic topological properties. We demonstrate that while HOMP is unable to distinguish between any of the pairs in the dataset, SMCN successfully distinguishes all pairs, empirically validating our theoretical findings. Our work opens a new design space and new opportunities for TDL, paving the way for more expressive and versatile models.

[LG-93] A Structural Feature-Based Approach for Comprehensive Graph Classification

Link: https://arxiv.org/abs/2408.05474
Authors: Saiful Islam, Md. Nahid Hasan, Pitambar Khanra
Keywords: intensified greater interest, increasing prevalence, prevalence of graph-structured, graph-structured data, domains has intensified
Categories: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments: 25 pages, 6 figures

Abstract:The increasing prevalence of graph-structured data across various domains has intensified interest in graph classification tasks. While numerous sophisticated graph learning methods have emerged, their complexity often hinders practical implementation. In this article, we address this challenge by proposing a method that constructs feature vectors based on fundamental graph structural properties. We demonstrate that these features, despite their simplicity, are powerful enough to capture the intrinsic characteristics of graphs within the same class. We explore the efficacy of our approach using three distinct machine learning methods, highlighting how our feature-based classification leverages the inherent structural similarities of graphs within the same class to achieve accurate classification. A key advantage of our approach is its simplicity, which makes it accessible and adaptable to a broad range of applications, including social network analysis, bioinformatics, and cybersecurity. Furthermore, we conduct extensive experiments to validate the performance of our method, showing that it not only achieves competitive performance but in some cases surpasses the accuracy of more complex, state-of-the-art techniques. Our findings suggest that a focus on fundamental graph features can provide a robust and efficient alternative for graph classification, offering significant potential for both research and practical applications.
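
To make the idea of a structural feature vector concrete, here is a small sketch. The abstract does not list which properties the authors use, so the six features below (node count, edge count, density, mean degree, maximum degree, triangle count) and the function name `structural_features` are assumptions chosen for illustration only.

```python
def structural_features(n_nodes, edges):
    """Return [n_nodes, n_edges, density, mean degree, max degree, triangles].

    The graph is undirected and simple; nodes are labelled 0 .. n_nodes - 1.
    """
    adj = {v: set() for v in range(n_nodes)}
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)
    n_edges = sum(len(nbrs) for nbrs in adj.values()) // 2
    degrees = [len(adj[v]) for v in range(n_nodes)]
    density = 2 * n_edges / (n_nodes * (n_nodes - 1)) if n_nodes > 1 else 0.0
    # Each triangle is found once per edge, hence the division by 3.
    triangles = sum(len(adj[u] & adj[v])
                    for u in adj for v in adj[u] if u < v) // 3
    return [n_nodes, n_edges, density, sum(degrees) / n_nodes,
            max(degrees), triangles]

triangle = structural_features(3, [(0, 1), (1, 2), (0, 2)])
path = structural_features(4, [(0, 1), (1, 2), (2, 3)])
print(triangle)
print(path)
```

Vectors like these can be passed to any off-the-shelf classifier, which is the sense in which the approach stays simple.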

[LG-94] FuXi Weather: An end-to-end machine learning weather data assimilation and forecasting system

Link: https://arxiv.org/abs/2408.05472
Authors: Xiuyu Sun, Xiaohui Zhong, Xiaoze Xu, Yuanqing Huang, Hao Li, Jie Feng, Wei Han, Libo Wu, Yuan Qi
Keywords: Operational numerical weather, machine learning based, Operational numerical, learning based weather, numerical weather prediction
Categories: Machine Learning (cs.LG)
Comments: 34 pages, 4 figures

Abstract:Operational numerical weather prediction (NWP) systems consist of three fundamental components: the global observing system for data collection, data assimilation (DA) for generating initial conditions, and the forecasting model to predict future weather conditions. While NWP systems have undergone a quiet revolution, with forecast skills progressively improving over the past few decades, their advancement has slowed due to challenges such as high computational costs and the complexities associated with assimilating an increasing volume of observational data and managing finer spatial grids. Advances in machine learning offer an alternative path towards more efficient and accurate weather forecasts. The rise of machine learning based weather forecasting models has also spurred the development of machine learning based DA models or even purely machine learning based weather forecasting systems. This paper introduces FuXi Weather, an end-to-end machine learning based weather forecasting system. FuXi Weather employs specialized data preprocessing and multi-modal data fusion techniques to integrate information from diverse sources under all-sky conditions, including microwave sounders from three polar-orbiting satellites and radio occultation data from the Global Navigation Satellite System. Operating on a 6-hourly DA and forecasting cycle, FuXi Weather independently generates robust and accurate 10-day global weather forecasts at a spatial resolution of 0.25°. It surpasses the European Centre for Medium-range Weather Forecasts high-resolution forecasts in terms of predictability, extending the skillful forecast lead times for several key weather variables such as the geopotential height at 500 hPa from 9.25 days to 9.5 days. The system’s high computational efficiency and robust performance, even with limited observations, demonstrate its potential as a promising alternative to traditional NWP systems.

[LG-95] A Versatile Framework for Attributed Network Clustering via K-Nearest Neighbor Augmentation VLDB

Link: https://arxiv.org/abs/2408.05459
Authors: Yiran Li, Gongyao Guo, Jieming Shi, Renchi Yang, Shiqi Shen, Qing Li, Jun Luo
Keywords: modeling social networks, Attributed, Attributed networks, ubiquitous in modeling, modeling social
Categories: Social and Information Networks (cs.SI); Machine Learning (cs.LG)
Comments: 25 pages. Accepted by the VLDB Journal

Abstract:Attributed networks containing entity-specific information in node attributes are ubiquitous in modeling social networks, e-commerce, bioinformatics, etc. Their inherent network topology ranges from simple graphs to hypergraphs with high-order interactions and multiplex graphs with separate layers. An important graph mining task is node clustering, aiming to partition the nodes of an attributed network into k disjoint clusters such that intra-cluster nodes are closely connected and share similar attributes, while inter-cluster nodes are far apart and dissimilar. It is highly challenging to capture multi-hop connections via nodes or attributes for effective clustering on multiple types of attributed networks. In this paper, we first present AHCKA as an efficient approach to attributed hypergraph clustering (AHC). AHCKA includes a carefully-crafted K-nearest neighbor augmentation strategy for the optimized exploitation of attribute information on hypergraphs, a joint hypergraph random walk model to devise an effective AHC objective, and an efficient solver with speedup techniques for the objective optimization. The proposed techniques are extensible to various types of attributed networks, and thus, we develop ANCKA as a versatile attributed network clustering framework, capable of attributed graph clustering (AGC), attributed multiplex graph clustering (AMGC), and AHC. Moreover, we devise ANCKA with algorithmic designs tailored for GPU acceleration to boost efficiency. We have conducted extensive experiments to compare our methods with 19 competitors on 8 attributed hypergraphs, 16 competitors on 6 attributed graphs, and 16 competitors on 3 attributed multiplex graphs, all demonstrating the superb clustering quality and efficiency of our methods.

[LG-96] Mathematical Models of Computation in Superposition ICML2024

[LG-97] Ensemble everything everywhere: Multi-scale aggregation for adversarial robustness

[LG-98] Predicting Long-Term Allograft Survival in Liver Transplant Recipients

Link: https://arxiv.org/abs/2408.05437
Authors: Xiang Gao, Michael Cooper, Maryam Naghibzadeh, Amirhossein Azhie, Mamatha Bhat, Rahul G. Krishnan
Keywords: allograft failure occurs, liver transplant recipients, Liver allograft failure, occurs in approximately, leading to mortality
Categories: Machine Learning (cs.LG)
Comments: Accepted at MLHC 2024

Abstract:Liver allograft failure occurs in approximately 20% of liver transplant recipients within five years post-transplant, leading to mortality or the need for retransplantation. Providing an accurate and interpretable model for individualized risk estimation of graft failure is essential for improving post-transplant care. To this end, we introduce the Model for Allograft Survival (MAS), a simple linear risk score that outperforms other advanced survival models. Using longitudinal patient follow-up data from the United States (U.S.), we develop our models on 82,959 liver transplant recipients and conduct multi-site evaluations on 11 regions. Additionally, by testing on a separate non-U.S. cohort, we explore the out-of-distribution generalization performance of various models without additional fine-tuning, a crucial property for clinical deployment. We find that the most complex models are also the ones most vulnerable to distribution shifts despite achieving the best in-distribution performance. Our findings not only provide a strong risk score for predicting long-term graft failure but also suggest that the routine machine learning pipeline with only in-distribution held-out validation could create harmful consequences for patients at deployment.

[LG-99] Simple and Nearly-Optimal Sampling for Rank-1 Tensor Completion via Gauss-Jordan

Link: https://arxiv.org/abs/2408.05431
Authors: Alejandro Gomez-Leos, Oscar López
Keywords: uniformly sampled subset, uniformly sampled, sampled subset, Machine Learning, Omega
Categories: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
Comments:

Abstract:We revisit the sample and computational complexity of completing a rank-1 tensor in $\otimes_{i=1}^{N} \mathbb{R}^d$, given a uniformly sampled subset of its entries. We present a characterization of the problem (i.e. nonzero entries) which admits an algorithm amounting to Gauss-Jordan on a pair of random linear systems. For example, when $N = \Theta(1)$, we prove it uses no more than $m = O(d^2 \log d)$ samples and runs in $O(md^2)$ time. Moreover, we show any algorithm requires $\Omega(d \log d)$ samples. By contrast, existing upper bounds on the sample complexity are at least as large as $d^{1.5} \mu^{\Omega(1)} \log^{\Omega(1)} d$, where $\mu$ can be $\Theta(d)$ in the worst case. Prior work obtained these looser guarantees in higher rank versions of our problem, and tends to involve more complicated algorithms.
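
For intuition about why nonzero entries make rank-1 completion tractable, consider the matrix case (N = 2), where every entry factors as `M[i][j] = u[i] * v[j]`. The sketch below is not the authors' algorithm: it is an illustrative elimination that fixes `u[0] = 1` and solves for one unknown factor at a time from the observed entries. The function name `complete_rank1` and the dictionary input format are assumptions.

```python
from collections import deque

def complete_rank1(observed, n_rows, n_cols):
    """Recover M[i][j] = u[i] * v[j] from nonzero observed entries.

    `observed` maps (i, j) -> value. Returns None when the observed pattern
    does not link every row and column to row 0, so M is not determined.
    """
    by_row = {i: [] for i in range(n_rows)}
    by_col = {j: [] for j in range(n_cols)}
    for (i, j), val in observed.items():
        by_row[i].append((j, val))
        by_col[j].append((i, val))
    u, v = {0: 1.0}, {}  # fix the scale ambiguity by setting u[0] = 1
    queue = deque([("row", 0)])
    while queue:
        kind, idx = queue.popleft()
        if kind == "row":
            for j, val in by_row[idx]:
                if j not in v:
                    v[j] = val / u[idx]
                    queue.append(("col", j))
        else:
            for i, val in by_col[idx]:
                if i not in u:
                    u[i] = val / v[idx]
                    queue.append(("row", i))
    if len(u) < n_rows or len(v) < n_cols:
        return None
    return [[u[i] * v[j] for j in range(n_cols)] for i in range(n_rows)]

# Ground truth is the outer product of [2, 3, 5] and [1, 4, 0.5].
observed = {(0, 0): 2.0, (0, 1): 8.0, (1, 1): 12.0, (2, 1): 20.0, (2, 2): 2.5}
completed = complete_rank1(observed, 3, 3)
underdetermined = complete_rank1({(0, 0): 2.0}, 2, 2)
print(completed)
```

Five observed entries suffice here because they connect all three rows and all three columns; the second call shows the failure mode when the sample does not.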

[LG-100] HoME: Hierarchy of Multi-Gate Experts for Multi-Task Learning at Kuaishou

Link: https://arxiv.org/abs/2408.05430
Authors: Xu Wang, Jiangxia Cao, Zhiyi Fu, Kun Gai, Guorui Zhou
Keywords: present the practical, practical problems, lessons learned, learned at short-video, Kuaishou
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Work in progress

Abstract:In this paper, we present the practical problems and the lessons learned at short-video services from Kuaishou. In industry, a widely-used multi-task framework is the Mixture-of-Experts (MoE) paradigm, which always introduces some shared and specific experts for each task and then uses gate networks to measure related experts’ contributions. Although the MoE achieves remarkable improvements, we still observe three anomalies that seriously affect model performances in our iteration: (1) Expert Collapse: We found that experts’ output distributions are significantly different, and some experts have over 90% zero activations with ReLU, making it hard for gate networks to assign fair weights to balance experts. (2) Expert Degradation: Ideally, the shared-expert aims to provide predictive information for all tasks simultaneously. Nevertheless, we find that some shared-experts are occupied by only one task, which indicates that shared-experts lost their ability but degenerated into some specific-experts. (3) Expert Underfitting: In our services, we have dozens of behavior tasks that need to be predicted, but we find that some data-sparse prediction tasks tend to ignore their specific-experts and assign large weights to shared-experts. The reason might be that the shared-experts can perceive more gradient updates and knowledge from dense tasks, while specific-experts easily fall into underfitting due to their sparse behaviors. Motivated by those observations, we propose HoME to achieve a simple, efficient and balanced MoE system for multi-task learning.
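
The gating mechanism and the "over 90% zero activations" diagnostic described above can be sketched in a few lines. This is a generic Mixture-of-Experts forward pass, not HoME itself: the layer sizes, the random weights, and the helper names `moe_forward` and `zero_fraction` are assumptions made for illustration.

```python
import math
import random

def relu(vec):
    return [max(0.0, x) for x in vec]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, expert_weights, gate_weights):
    """Return the gate-weighted sum of ReLU expert outputs, the gates, and raw outputs."""
    outputs = [relu(matvec(W, x)) for W in expert_weights]
    gates = softmax(matvec(gate_weights, x))
    dim = len(outputs[0])
    mixed = [sum(g * out[k] for g, out in zip(gates, outputs)) for k in range(dim)]
    return mixed, gates, outputs

def zero_fraction(outputs_per_example):
    """Share of exactly-zero activations across a batch of one expert's outputs."""
    flat = [a for out in outputs_per_example for a in out]
    return sum(1 for a in flat if a == 0.0) / len(flat)

rng = random.Random(0)
n_experts, d_in, d_out = 3, 5, 4  # illustrative sizes
experts = [[[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
           for _ in range(n_experts)]
gate = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(n_experts)]
batch = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(20)]

per_expert = [[] for _ in range(n_experts)]
for x in batch:
    mixed, gates, outputs = moe_forward(x, experts, gate)
    for e, out in enumerate(outputs):
        per_expert[e].append(out)
zero_fractions = [zero_fraction(outs) for outs in per_expert]
print(zero_fractions)
```

Tracking `zero_fractions` per expert during training is one simple way to spot the collapse pattern the abstract reports.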

[LG-101] Generalized Encouragement-Based Instrumental Variables for Counterfactual Regression

Link: https://arxiv.org/abs/2408.05428
Authors: Anpeng Wu, Kun Kuang, Ruoxuan Xiong, Xiangwei Chen, Zexu Sun, Fei Wu, Kun Zhang
Keywords: randomized controlled trials, controlled trials, perfectly enforced, randomized controlled, impractical or compliance
Categories: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
Comments:

Abstract:In causal inference, encouragement designs (EDs) are widely used to analyze causal effects, when randomized controlled trials (RCTs) are impractical or compliance to treatment cannot be perfectly enforced. Unlike RCTs, which directly allocate treatments, EDs randomly assign encouragement policies that positively motivate individuals to engage in a specific treatment. These random encouragements act as instrumental variables (IVs), facilitating the identification of causal effects through leveraging exogenous perturbations in discrete treatment scenarios. However, real-world applications of encouragement designs often face challenges such as incomplete randomization, limited experimental data, and significantly fewer encouragements compared to treatments, hindering precise causal effect estimation. To address this, this paper introduces novel theories and algorithms for identifying the Conditional Average Treatment Effect (CATE) using variations in encouragement. Further, by leveraging both observational and encouragement data, we propose a generalized IV estimator, named Encouragement-based Counterfactual Regression (EnCounteR), to effectively estimate the causal effects. Extensive experiments on both synthetic and real-world datasets demonstrate the superiority of EnCounteR over existing methods.
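
As background for how a random encouragement acts as an instrumental variable, the sketch below applies the textbook Wald estimator to simulated data. It is not EnCounteR: the data-generating process, the constant treatment effect of 2.0, and all function names are assumptions made for this illustration.

```python
import random

def mean(values):
    return sum(values) / len(values)

def simulate(n, effect, seed=0):
    """Simulate (encouragement, treatment, outcome) with an unobserved confounder."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = 1 if rng.random() < 0.5 else 0   # randomized encouragement
        u = rng.gauss(0, 1)                  # unobserved confounder
        t = 1 if 1.5 * z + u > 0.75 else 0   # uptake depends on z and u
        y = effect * t + 2.0 * u + rng.gauss(0, 1)
        rows.append((z, t, y))
    return rows

def wald_estimate(rows):
    """Effect of encouragement on the outcome divided by its effect on uptake."""
    y1 = mean([y for z, _, y in rows if z == 1])
    y0 = mean([y for z, _, y in rows if z == 0])
    t1 = mean([t for z, t, _ in rows if z == 1])
    t0 = mean([t for z, t, _ in rows if z == 0])
    return (y1 - y0) / (t1 - t0)

def naive_estimate(rows):
    """Treated-minus-control difference in means, which ignores confounding."""
    treated = mean([y for _, t, y in rows if t == 1])
    control = mean([y for _, t, y in rows if t == 0])
    return treated - control

rows = simulate(n=20000, effect=2.0)
iv = wald_estimate(rows)
naive = naive_estimate(rows)
print(iv, naive)
```

The naive comparison is biased upward because the confounder raises both uptake and the outcome, while the encouragement-based ratio stays close to the simulated effect.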

[LG-102] Detecting Masquerade Attacks in Controller Area Networks Using Graph Machine Learning

Link: https://arxiv.org/abs/2408.05427
Authors: William Marfo, Pablo Moriano, Deepak K. Tosh, Shirley V. Moore
Keywords: Modern vehicles rely, electronic control units, controller area networks, Modern vehicles, control units
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Abstract:Modern vehicles rely on a myriad of electronic control units (ECUs) interconnected via controller area networks (CANs) for critical operations. Despite their ubiquitous use and reliability, CANs are susceptible to sophisticated cyberattacks, particularly masquerade attacks, which inject false data that mimic legitimate messages at the expected frequency. These attacks pose severe risks such as unintended acceleration, brake deactivation, and rogue steering. Traditional intrusion detection systems (IDS) often struggle to detect these subtle intrusions due to their seamless integration into normal traffic. This paper introduces a novel framework for detecting masquerade attacks in the CAN bus using graph machine learning (ML). We hypothesize that the integration of shallow graph embeddings with time series features derived from CAN frames enhances the detection of masquerade attacks. We show that by representing CAN bus frames as message sequence graphs (MSGs) and enriching each node with contextual statistical attributes from time series, we can enhance detection capabilities across various attack patterns compared to using only graph-based features. Our method ensures a comprehensive and dynamic analysis of CAN frame interactions, improving robustness and efficiency. Extensive experiments on the ROAD dataset validate the effectiveness of our approach, demonstrating statistically significant improvements in the detection rates of masquerade attacks compared to a baseline that uses only graph-based features, as confirmed by Mann-Whitney U and Kolmogorov-Smirnov tests (p < 0.05).
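
One plausible reading of a message sequence graph is sketched below: nodes are CAN arbitration IDs, a directed edge counts how often one ID is immediately followed by another, and each node carries summary statistics of its signal values. The exact construction and node attributes used in the paper are not given in the abstract, so this layout, the input format, and the function names are assumptions.

```python
from collections import Counter, defaultdict
from statistics import mean, pstdev

def message_sequence_graph(arbitration_ids):
    """Nodes are the distinct IDs; edge weights count consecutive ID pairs."""
    edges = Counter(zip(arbitration_ids, arbitration_ids[1:]))
    nodes = sorted(set(arbitration_ids))
    return nodes, edges

def node_attributes(frames):
    """Per-ID mean and population std of a signal, from (arbitration_id, value) pairs."""
    values = defaultdict(list)
    for arb_id, value in frames:
        values[arb_id].append(value)
    return {arb_id: (mean(v), pstdev(v)) for arb_id, v in values.items()}

# A toy window of frames: (arbitration ID, decoded signal value).
frames = [(0x10, 1.0), (0x20, 5.0), (0x10, 3.0), (0x20, 5.0), (0x30, 0.0)]
nodes, edges = message_sequence_graph([arb_id for arb_id, _ in frames])
attrs = node_attributes(frames)
print(nodes, dict(edges), attrs)
```

A masquerade attack keeps message timing intact, so the edge structure alone may look normal; the node statistics are what give a detector a chance to notice altered payloads.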

[LG-103] Modeling Multi-Step Scientific Processes with Graph Transformer Networks

Link: https://arxiv.org/abs/2408.05425
Authors: Amanda A. Volk, Robert W. Epps, Jeffrey G. Ethier, Luke A. Baldwin
Keywords: including material science, including material, material science, work presents, linear models
Categories: Machine Learning (cs.LG)
Comments:

Abstract:This work presents the use of graph learning for the prediction of multi-step experimental outcomes for applications across experimental research, including material science, chemistry, and biology. The viability of geometric learning for regression tasks was benchmarked against a collection of linear models through a combination of simulated and real-world data training studies. First, a selection of five arbitrarily designed multi-step surrogate functions were developed to reflect various features commonly found within experimental processes. A graph transformer network outperformed all tested linear models in scenarios that featured hidden interactions between process steps and sequence dependent features, while retaining equivalent performance in sequence agnostic scenarios. Then, a similar comparison was applied to real-world literature data on algorithm guided colloidal atomic layer deposition. Using the complete reaction sequence as training data, the graph neural network outperformed all linear models in predicting the three spectral properties for most training set sizes. Further implementation of graph neural networks and geometric representation of scientific processes for the prediction of experiment outcomes could lead to algorithm driven navigation of higher dimension parameter spaces and efficient exploration of more dynamic systems.

[LG-104] Interface Laplace Learning: Learnable Interface Term Helps Semi-Supervised Learning

Link: https://arxiv.org/abs/2408.05419
Authors: Tangjun Wang, Chenglong Bao, Zuoqiang Shi
Keywords: Interface Laplace learning, called Interface Laplace, Laplace learning model, Laplace learning, Interface Laplace
Categories: Machine Learning (cs.LG)
Comments:

Abstract:We introduce a novel framework, called Interface Laplace learning, for graph-based semi-supervised learning. Motivated by the observation that an interface should exist between different classes where the function value is non-smooth, we introduce a Laplace learning model that incorporates an interface term. This model challenges the long-standing assumption that functions are smooth at all unlabeled points. In the proposed approach, we add an interface term to the Laplace learning model at the interface positions. We provide a practical algorithm to approximate the interface positions using k-hop neighborhood indices, and to learn the interface term from labeled data without artificial design. Our method is efficient and effective, and we present extensive experiments demonstrating that Interface Laplace learning achieves better performance than other recent semi-supervised learning approaches at extremely low label rates on the MNIST, FashionMNIST, and CIFAR-10 datasets.
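
For context, classical Laplace learning (the baseline this paper modifies) assigns each unlabeled node the weighted average of its neighbors' values, which is exactly the smoothness assumption the abstract questions. The sketch below implements that baseline with Gauss-Seidel sweeps; it does not include the paper's interface term, and the graph format and function name are assumptions.

```python
def laplace_learning(weights, labels, n_iter=500):
    """Harmonic extension of known labels over a weighted graph.

    weights: dict mapping node -> {neighbor: edge weight}
    labels:  dict mapping labeled node -> known value
    """
    f = {node: labels.get(node, 0.0) for node in weights}
    for _ in range(n_iter):
        for node, nbrs in weights.items():
            if node in labels:
                continue  # labeled nodes stay fixed
            total = sum(nbrs.values())
            f[node] = sum(w * f[nb] for nb, w in nbrs.items()) / total
    return f

# Path graph 0 - 1 - 2 - 3 - 4 with the two endpoints labeled.
path = {
    0: {1: 1.0},
    1: {0: 1.0, 2: 1.0},
    2: {1: 1.0, 3: 1.0},
    3: {2: 1.0, 4: 1.0},
    4: {3: 1.0},
}
solution = laplace_learning(path, labels={0: 0.0, 4: 1.0})
print(solution)
```

On this path the solution interpolates linearly between the two labels, with no sharp jump anywhere; a class boundary between nodes would be smoothed away, which motivates adding an interface term.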

[LG-105] SAMSA: Efficient Transformer for Many Data Modalities

Link: https://arxiv.org/abs/2408.05391
Authors: Minh Lenhat, Viet Anh Nguyen, Khoa Nguyen, Duong Duc Hieu, Dao Huu Hung, Truong Son Hy
Keywords: earned transformers great, transformers great success, mechanism earned transformers, data modalities, difficulty of training
Categories: Machine Learning (cs.LG)
Comments:

Abstract:The versatility of the self-attention mechanism earned transformers great success in almost all data modalities, with limitations on the quadratic complexity and difficulty of training. Efficient transformers, on the other hand, often rely on clever data-modality-dependent construction to get over the quadratic complexity of transformers. This greatly hinders their applications on different data modalities, which is one of the pillars of contemporary foundational modeling. In this paper, we lay the groundwork for efficient foundational modeling by proposing SAMSA - SAMpling-Self-Attention, a context-aware linear complexity self-attention mechanism that works well on multiple data modalities. Our mechanism is based on a differentiable sampling without replacement method we discovered. This enables the self-attention module to attend to the most important token set, where the importance is defined by data. Moreover, as differentiability is not needed in inference, the sparse formulation of our method costs little time overhead, further lowering computational costs. In short, SAMSA achieved competitive or even SOTA results on many benchmarks, while being faster in inference, compared to other very specialized models. Against full self-attention, real inference time significantly decreases while performance ranges from negligible degradation to outperformance. We release our source code.

[LG-106] EclipseNETs: a differentiable description of irregular eclipse conditions

Link: https://arxiv.org/abs/2408.05387
Authors: Giacomo Acciarini, Francesco Biscani, Dario Izzo
Keywords: determining eclipse regions, critical challenge, frequent and critical, determining eclipse, Solar System bodies
Categories: Machine Learning (cs.LG); Instrumentation and Methods for Astrophysics (astro-ph.IM); Space Physics (physics.space-ph)
Comments:

Abstract:In the field of spaceflight mechanics and astrodynamics, determining eclipse regions is a frequent and critical challenge. This determination impacts various factors, including the acceleration induced by solar radiation pressure, the spacecraft power input, and its thermal state, all of which must be accounted for in various phases of mission design. This study leverages recent advances in neural image processing to develop fully differentiable models of eclipse regions for highly irregular celestial bodies. By utilizing test cases involving Solar System bodies previously visited by spacecraft, such as 433 Eros, 25143 Itokawa, 67P/Churyumov–Gerasimenko, and 101955 Bennu, we propose and study an implicit neural architecture defining the shape of the eclipse cone based on the Sun’s direction. Employing periodic activation functions, we achieve high precision in modeling eclipse conditions. Furthermore, we discuss the potential applications of these differentiable models in spaceflight mechanics computations.

[LG-107] IntentRec: Predicting User Session Intent with Hierarchical Multi-Task Learning

Link: https://arxiv.org/abs/2408.05353
Authors: Sejoon Oh, Moumita Bhattacharya, Yesu Feng, Sudarshan Lamkhede
Keywords: diverse digital services, Recommender systems, streaming media, systems have played, played a critical
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Abstract:Recommender systems have played a critical role in diverse digital services such as e-commerce, streaming media, and social networks. If we know what a user’s intent is in a given session (e.g. do they want to watch short videos or a movie or play games; are they shopping for a camping trip), it becomes easier to provide high-quality recommendations. In this paper, we introduce IntentRec, a novel recommendation framework based on a hierarchical multi-task neural network architecture that tries to estimate a user’s latent intent using their short- and long-term implicit signals as proxies and uses the intent prediction to predict the next item the user is likely to engage with. By directly leveraging the intent prediction, we can offer accurate and personalized recommendations to users. Our comprehensive experiments on Netflix user engagement data show that IntentRec outperforms the state-of-the-art next-item and next-intent predictors. We also share several findings and downstream applications of IntentRec.

[LG-108] Enabling Quick Accurate Crowdsourced Annotation for Elevation-Aware Flood Extent Mapping

[LG-109] Hybrid Efficient Unsupervised Anomaly Detection for Early Pandemic Case Identification

Link: https://arxiv.org/abs/2408.05347
Authors: Ghazal Ghajari, Mithun Kumar PK, Fathi Amsaad
Keywords: identifying unusual patterns, identifying unusual, unusual patterns, labeled training, anomaly detection
Categories: Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments:

Abstract:Unsupervised anomaly detection is a promising technique for identifying unusual patterns in data without the need for labeled training examples. This approach is particularly valuable for early case detection in epidemic management, especially when early-stage data are scarce. This research introduces a novel hybrid method for anomaly detection that combines distance and density measures, enhancing its applicability across various infectious diseases. Our method is especially relevant in pandemic situations, as demonstrated during the COVID-19 crisis, where traditional supervised classification methods fall short due to limited data. The efficacy of our method is evaluated using COVID-19 chest X-ray data, where it significantly outperforms established unsupervised techniques. It achieves an average AUC of 77.43%, surpassing the AUC of Isolation Forest at 73.66% and KNN at 52.93%. These results highlight the potential of our hybrid anomaly detection method to improve early detection capabilities in diverse epidemic scenarios, thereby facilitating more effective and timely responses.

[LG-110] AI-assisted Coding with Cody: Lessons from Context Retrieval and Evaluation for Code Recommendations

Link: https://arxiv.org/abs/2408.05344
Authors: Jan Hartman, Rishabh Mehrotra, Hitesh Sagtani, Dominic Cooney, Rafal Gajdulewicz, Beyang Liu, Julie Tibshirani, Quinn Slack
Keywords: recently popular type, LLM-based coding assistant, recently popular, popular type, type of recommender
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments:

Abstract:In this work, we discuss a recently popular type of recommender system: an LLM-based coding assistant. Connecting the task of providing code recommendations in multiple formats to traditional RecSys challenges, we outline several similarities and differences due to domain specifics. We emphasize the importance of providing relevant context to an LLM for this use case and discuss lessons learned from context enhancements and from offline and online evaluation of such AI-assisted coding systems.

[LG-111] rule4ml: An Open-Source Tool for Resource Utilization and Latency Estimation for ML Models on FPGA

[LG-112] Audio-visual cross-modality knowledge transfer for machine learning-based in-situ monitoring in laser additive manufacturing

Link: https://arxiv.org/abs/2408.05307
Authors: Jiarui Xie, Mutahar Safdar, Lequn Chen, Seung Ki Moon, Yaoyao Fiona Zhao
Keywords: laser additive manufacturing, detect laser additive, based in-situ monitoring, in-situ monitoring systems, LAM in-situ monitoring
Categories: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments: 36 pages, 12 figures, 6 tables

Abstract:Various machine learning (ML)-based in-situ monitoring systems have been developed to detect laser additive manufacturing (LAM) process anomalies and defects. Multimodal fusion can improve in-situ monitoring performance by acquiring and integrating data from multiple modalities, including visual and audio data. However, multimodal fusion employs multiple sensors of different types, which leads to higher hardware, computational, and operational costs. This paper proposes a cross-modality knowledge transfer (CMKT) methodology that transfers knowledge from a source to a target modality for LAM in-situ monitoring. CMKT enhances the usefulness of the features extracted from the target modality during the training phase and removes the sensors of the source modality during the prediction phase. This paper proposes three CMKT methods: semantic alignment, fully supervised mapping, and semi-supervised mapping. Semantic alignment establishes a shared encoded space between modalities to facilitate knowledge transfer. It utilizes a semantic alignment loss to align the distributions of the same classes (e.g., visual defective and audio defective classes) and a separation loss to separate the distributions of different classes (e.g., visual defective and audio defect-free classes). The two mapping methods transfer knowledge by deriving the features of one modality from the other modality using fully supervised and semi-supervised learning. The proposed CMKT methods were implemented and compared with multimodal audio-visual fusion in an LAM in-situ anomaly detection case study. The semantic alignment method achieves a 98.4% accuracy while removing the audio modality during the prediction phase, which is comparable to the accuracy of multimodal fusion (98.2%).

[LG-113] The impact of internal variability on benchmarking deep learning climate emulators

[LG-114] Semi-Supervised One-Shot Imitation Learning

[LG-115] Can a Bayesian Oracle Prevent Harm from an Agent?

[LG-116] Advancing oncology with federated learning: transcending boundaries in breast lung and prostate cancer. A systematic review

[LG-117] Early-Exit meets Model-Distributed Inference at Edge Networks

[LG-118] Differentially Private Data Release on Graphs: Inefficiencies and Unfairness

[LG-119] Improved Adaboost Algorithm for Web Advertisement Click Prediction Based on Long Short-Term Memory Networks

[LG-120] SocFedGPT: Federated GPT-based Adaptive Content Filtering System Leveraging User Interactions in Social Networks

Link: https://arxiv.org/abs/2408.05243
Authors: Sai Puppala, Ismail Hossain, Md Jahangir Alam, Sajedul Talukder
Keywords: federated learning framework, social media platforms, Context-based Social Media, Social Media LLM, social media
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
Comments: This research paper is submitted to ASONAM 2024 conference on Advances in Social Networks Analysis and Mining and going to be published in Springer

Abstract:Our study presents a multifaceted approach to enhancing user interaction and content relevance in social media platforms through a federated learning framework. We introduce personalized GPT and Context-based Social Media LLM models, utilizing federated learning for privacy and security. Four client entities receive a base GPT-2 model and locally collected social media data, with federated aggregation ensuring up-to-date model maintenance. Subsequent modules focus on categorizing user posts, computing user persona scores, and identifying relevant posts from friends’ lists. A quantifying social engagement approach, coupled with matrix factorization techniques, facilitates personalized content suggestions in real-time. An adaptive feedback loop and readability score algorithm also enhance the quality and relevance of content presented to users. Our system offers a comprehensive solution to content filtering and recommendation, fostering a tailored and engaging social media experience while safeguarding user privacy.
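
The abstract does not say how the federated aggregation step combines client models. A common choice is federated averaging, where each parameter is averaged across clients in proportion to local dataset size; the sketch below shows only that generic rule, on toy parameter vectors rather than GPT-2 weights. The function name and data layout are assumptions.

```python
def federated_average(client_params, client_sizes):
    """Average each named parameter vector, weighted by local dataset size.

    client_params: list of dicts mapping parameter name -> list of floats
    client_sizes:  number of local training examples per client
    """
    total = sum(client_sizes)
    averaged = {}
    for name in client_params[0]:
        length = len(client_params[0][name])
        averaged[name] = [
            sum(size * params[name][k]
                for params, size in zip(client_params, client_sizes)) / total
            for k in range(length)
        ]
    return averaged

# Two toy clients; the second holds three times as much data.
clients = [{"w": [1.0, 3.0]}, {"w": [3.0, 7.0]}]
global_params = federated_average(clients, client_sizes=[1, 3])
print(global_params)
```

Only parameters leave each client in this scheme, never the raw posts, which is the privacy argument the abstract relies on.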

[LG-121] FLASH: Federated Learning-Based LLMs for Advanced Query Processing in Social Networks through RAG

Link: https://arxiv.org/abs/2408.05242
Authors: Sai Puppala, Ismail Hossain, Md Jahangir Alam, Sajedul Talukder
Keywords: Federated Learning GPT, Federated Learning, Leveraging Federated Learning, Federated Learning techniques, chatbot system empowered
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
Notes: This research paper has been submitted to the ASONAM 2024 conference on Advances in Social Networks Analysis and Mining and is to be published by Springer

Abstract:Our paper introduces a novel approach to social network information retrieval and user engagement through a personalized chatbot system empowered by Federated Learning GPT. The system is designed to seamlessly aggregate and curate diverse social media data sources, including user posts, multimedia content, and trending news. Leveraging Federated Learning techniques, the GPT model is trained on decentralized data sources to ensure privacy and security while providing personalized insights and recommendations. Users interact with the chatbot through an intuitive interface, accessing tailored information and real-time updates on social media trends and user-generated content. The system’s innovative architecture enables efficient processing of input files, parsing and enriching text data with metadata, and generating relevant questions and answers using advanced language models. By facilitating interactive access to a wealth of social network information, this personalized chatbot system represents a significant advancement in social media communication and knowledge dissemination.

[LG-122] Biomimetic Machine Learning approach for prediction of mechanical properties of Additive Friction Stir Deposited Aluminum alloys based walled structures

[LG-123] SLO-aware GPU Frequency Scaling for Energy Efficient LLM Inference Serving

[LG-124] Beyond the Neural Fog: Interpretable Learning for AC Optimal Power Flow

Link: https://arxiv.org/abs/2408.05228
Authors: Salvador Pineda, Juan Pérez-Ruiz, Juan Miguel Morales
Keywords: non-convex nature makes, optimal power flow, power system operations, system operations, challenging to solve
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC)
Notes:

Abstract:The AC optimal power flow (AC-OPF) problem is essential for power system operations, but its non-convex nature makes it challenging to solve. A widely used simplification is the linearized DC optimal power flow (DC-OPF) problem, which can be solved to global optimality, but whose optimal solution is always infeasible in the original AC-OPF problem. Recently, neural networks (NN) have been introduced for solving the AC-OPF problem at significantly faster computation times. However, these methods necessitate extensive datasets, are difficult to train, and are often viewed as black boxes, leading to resistance from operators who prefer more transparent and interpretable solutions. In this paper, we introduce a novel learning-based approach that merges simplicity and interpretability, providing a bridge between traditional approximation methods and black-box learning techniques. Our approach not only provides transparency for operators but also achieves competitive accuracy. Numerical results across various power networks demonstrate that our method provides accuracy comparable to, and often surpassing, that of neural networks, particularly when training datasets are limited.

[LG-125] Large Language Models for cross-language code clone detection

[LG-126] Learning Delays in Spiking Neural Networks using Dilated Convolutions with Learnable Spacings

[LG-127] Inverse designing metamaterials with programmable nonlinear functional responses in graph space

Link: https://arxiv.org/abs/2408.06300
Authors: Marco Maurizi, Derek Xu, Yu-Tong Wang, Desheng Yao, David Hahn, Mourad Oudich, Anish Satpati, Mathieu Bauchy, Wei Wang, Yizhou Sun, Yun Jing, Xiaoyu Rayne Zheng
Keywords: impact protection, dynamic stimuli, represented as nonlinear, structural support, photonic bandgaps
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Notes: 19 pages, 5 figures

Abstract:Material responses to static and dynamic stimuli, represented as nonlinear curves, are design targets for engineering functionalities like structural support, impact protection, and acoustic and photonic bandgaps. Three-dimensional metamaterials offer significant tunability due to their internal structure, yet existing methods struggle to capture their complex behavior-to-structure relationships. We present GraphMetaMat, a graph-based framework capable of designing three-dimensional metamaterials with programmable responses and arbitrary manufacturing constraints. Integrating graph networks, physics biases, reinforcement learning, and tree search, GraphMetaMat can target stress-strain curves spanning four orders of magnitude and complex behaviors, as well as viscoelastic transmission responses with varying attenuation gaps. GraphMetaMat can create cushioning materials for protective equipment and vibration-damping panels for electric vehicles, outperforming commercial materials, and enabling the automatic design of materials with on-demand functionalities.

[LG-128] Multi-marginal Schrödinger Bridges with Iterative Reference

Link: https://arxiv.org/abs/2408.06277
Authors: Yunyi Shen, Renato Berlinghieri, Tamara Broderick
Keywords: Practitioners frequently aim, frequently aim, aim to infer, time points, sample snapshots
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Notes:

Abstract:Practitioners frequently aim to infer an unobserved population trajectory using sample snapshots at multiple time points. For instance, in single-cell sequencing, scientists would like to learn how gene expression evolves over time. But sequencing any cell destroys that cell. So we cannot access any cell’s full trajectory, but we can access snapshot samples from many cells. Stochastic differential equations are commonly used to analyze systems with full individual-trajectory access; since here we have only sample snapshots, these methods are inapplicable. The deep learning community has recently explored using Schrödinger bridges (SBs) and their extensions to estimate these dynamics. However, these methods either (1) interpolate between just two time points or (2) require a single fixed reference dynamic within the SB, which is often just set to be Brownian motion. But learning piecewise from adjacent time points can fail to capture long-term dependencies. And practitioners are typically able to specify a model class for the reference dynamic but not the exact values of the parameters within it. So we propose a new method that (1) learns the unobserved trajectories from sample snapshots across multiple time points and (2) requires specification only of a class of reference dynamics, not a single fixed one. In particular, we suggest an iterative projection method inspired by Schrödinger bridges; we alternate between learning a piecewise SB on the unobserved trajectories and using the learned SB to refine our best guess for the dynamics within the reference class. We demonstrate the advantages of our method via a well-known simulated parametric model from ecology, simulated and real data from systems biology, and real motion-capture data.

[LG-129] Reciprocal Learning

Link: https://arxiv.org/abs/2408.06257
Authors: Julian Rodemann, Christoph Jansen, Georg Schollmeyer
Keywords: single paradigm, wide array, array of machine, reciprocal learning, learning
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC); Statistics Theory (math.ST); Methodology (stat.ME)
Notes: 41 pages, 3 figures

Abstract:We demonstrate that a wide array of machine learning algorithms are specific instances of one single paradigm: reciprocal learning. These instances range from active learning over multi-armed bandits to self-training. We show that all these algorithms do not only learn parameters from data but also vice versa: They iteratively alter training data in a way that depends on the current model fit. We introduce reciprocal learning as a generalization of these algorithms using the language of decision theory. This allows us to study under what conditions they converge. The key is to guarantee that reciprocal learning contracts such that the Banach fixed-point theorem applies. In this way, we find that reciprocal learning algorithms converge at linear rates to an approximately optimal model under relatively mild assumptions on the loss function, if their predictions are probabilistic and the sample adaption is both non-greedy and either randomized or regularized. We interpret these findings and provide corollaries that relate them to specific active learning, self-training, and bandit algorithms.
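The convergence argument in this abstract rests on the Banach fixed-point theorem: if one update step is a contraction, repeated application converges to a unique fixed point at a linear rate. The sketch below illustrates only that mechanism on a toy one-dimensional map. The map `step` and the helper `fixed_point_iterate` are hypothetical stand-ins, not the paper's sample-adaption procedure.

```python
def fixed_point_iterate(step, x0, tol=1e-10, max_iter=1000):
    """Apply `step` repeatedly until successive iterates differ by less than `tol`."""
    x = x0
    errors = []
    for _ in range(max_iter):
        x_next = step(x)
        errors.append(abs(x_next - x))
        x = x_next
        if errors[-1] < tol:
            break
    return x, errors


def step(x):
    # Toy contraction with Lipschitz constant 0.5; its unique fixed point is x = 2.
    return 0.5 * x + 1.0


x_star, errors = fixed_point_iterate(step, x0=10.0)
# Ratio of successive step sizes; for a linear-rate method this stays bounded below 1.
ratios = [b / a for a, b in zip(errors, errors[1:]) if a > 0]
print(x_star, ratios[:3])
```

Here each step shrinks the distance between successive iterates by the contraction factor, which is what "converges at linear rates" means in this setting.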

[LG-130] A Comprehensive Case Study on the Performance of Machine Learning Methods on the Classification of Solar Panel Electroluminescence Images

Link: https://arxiv.org/abs/2408.06229
Authors: Xinyi Song, Kennedy Odongo, Francis G. Pascual, Yili Hong
Keywords: harvest solar energy, renewable energy, important form, form of renewable, solar energy
Categories: Applications (stat.AP); Machine Learning (cs.LG)
Notes: 30 pages, 14 figures

Abstract:Photovoltaics (PV) are widely used to harvest solar energy, an important form of renewable energy. Photovoltaic arrays consist of multiple solar panels constructed from solar cells. Solar cells in the field are vulnerable to various defects, and electroluminescence (EL) imaging provides effective and non-destructive diagnostics to detect those defects. We use multiple traditional machine learning and modern deep learning models to classify EL solar cell images into different functional/defective categories. Because of the asymmetry in the number of functional vs. defective cells, an imbalanced label problem arises in the EL image data. The current literature lacks insights on which methods and metrics to use for model training and prediction. In this paper, we comprehensively compare different machine learning and deep learning methods under different performance metrics on the classification of solar cell EL images from monocrystalline and polycrystalline modules. We provide a comprehensive discussion on different metrics. Our results provide insights and guidelines for practitioners in selecting prediction methods and performance metrics.

[LG-131] A Comprehensive Survey on EEG-Based Emotion Recognition: A Graph-Based Perspective

Link: https://arxiv.org/abs/2408.06027
Authors: Chenyu Liu, Xinliang Zhou, Yihao Wu, Yi Ding, Liming Zhai, Kun Wang, Ziyu Jia, Yang Liu
Keywords: emotion recognition, affective computing, brain region connectivity, brain regions, intuitively respond
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Notes:

Abstract:Compared to other modalities, electroencephalogram (EEG) based emotion recognition can intuitively respond to emotional patterns in the human brain and, therefore, has become one of the most focused tasks in affective computing. The nature of emotions is a physiological and psychological state change in response to brain region connectivity, making emotion recognition focus more on the dependency between brain regions instead of specific brain regions. A significant trend is the application of graphs to encapsulate such dependency as dynamic functional connections between nodes across temporal and spatial dimensions. Concurrently, the neuroscientific underpinnings behind this dependency endow the application of graphs in this field with a distinctive significance. However, there is neither a comprehensive review nor a tutorial for constructing emotion-relevant graphs in EEG-based emotion recognition. In this paper, we present a comprehensive survey of these studies, delivering a systematic review of graph-related methods in this field from a methodological perspective. We propose a unified framework for graph applications in this field and categorize these methods on this basis. Finally, based on previous studies, we also present several open challenges and future directions in this field.

[LG-132] Parameters Inference for Nonlinear Wave Equations with Markovian Switching

Link: https://arxiv.org/abs/2408.05990
Authors: Yi Zhang, Zhikun Zhang, Xiangjun Wang
Keywords: Traditional partial differential, Traditional partial, Markov switching models, partial differential equations, Markovian switching models
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Notes:

Abstract:Traditional partial differential equations with constant coefficients often struggle to capture abrupt changes in real-world phenomena, leading to the development of variable coefficient PDEs and Markovian switching models. Recently, research has introduced the concept of PDEs with Markov switching models, established their well-posedness and presented numerical methods. However, there has been limited discussion on parameter estimation for the jump coefficients in these models. This paper addresses this gap by focusing on parameter inference for the wave equation with Markovian switching. We propose a Bayesian statistical framework using discrete sparse Bayesian learning to establish its convergence and a uniform error bound. Our method requires fewer assumptions and enables independent parameter inference for each segment by allowing different underlying structures for the parameter estimation problem within each segmented time interval. The effectiveness of our approach is demonstrated through three numerical cases, which involve noisy spatiotemporal data from different wave equations with Markovian switching. The results show strong performance in parameter estimation for variable coefficient PDEs.

[LG-133] Quantum Gradient Class Activation Map for Model Interpretability

[LG-134] On the Robustness of Kernel Goodness-of-Fit Tests

Link: https://arxiv.org/abs/2408.05854
Authors: Xing Liu, François-Xavier Briol
Keywords:
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
Notes: 50 pages, 13 figures

[LG-135] Divide-and-Conquer Predictive Coding: a structured Bayesian inference algorithm NEURIPS

[LG-136] On the Convergence of a Federated Expectation-Maximization Algorithm

Link: https://arxiv.org/abs/2408.05819
Authors: Zhixu Tao, Rajita Chandak, Sanjeev Kulkarni
Keywords: Federated Learning algorithms, Data heterogeneity, Linear Regressions model, convergence rate, Federated Learning
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Notes:

Abstract:Data heterogeneity has been a long-standing bottleneck in studying the convergence rates of Federated Learning algorithms. In order to better understand the issue of data heterogeneity, we study the convergence rate of the Expectation-Maximization (EM) algorithm for the Federated Mixture of K Linear Regressions model. We fully characterize the convergence rate of the EM algorithm under all regimes of m/n where m is the number of clients and n is the number of data points per client. We show that with a signal-to-noise-ratio (SNR) of order $\Omega(\sqrt{K})$, the well-initialized EM algorithm converges within the minimax distance of the ground truth under each of the regimes. Interestingly, we identify that when m grows exponentially in n, the EM algorithm only requires a constant number of iterations to converge. We perform experiments on synthetic datasets to illustrate our results. Surprisingly, the results show that rather than being a bottleneck, data heterogeneity can accelerate the convergence of federated learning algorithms.

[LG-137] Time Makes Space: Emergence of Place Fields in Networks Encoding Temporally Continuous Sensory Experiences

[LG-138] BeyondCT: A deep learning model for predicting pulmonary function from chest CT scans

[LG-139] Quantum-secure multiparty deep learning

[LG-140] S-SIRUS: an explainability algorithm for spatial regression Random Forest

Link: https://arxiv.org/abs/2408.05537
Authors: Luca Patelli, Natalia Golini, Rosaria Ignaccolo, Michela Cameletti
Keywords:
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Notes:

[LG-141] Latent class analysis for multi-layer categorical data

Link: https://arxiv.org/abs/2408.05535
Authors: Huan Qing
Keywords:
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Notes:

[LG-142] SuperEncoder: Towards Universal Neural Approximate Quantum State Preparation

Link: https://arxiv.org/abs/2408.05435
Authors: Yilun Zhao, Bingmeng Wang, Wenle Jiang, Xiwei Pan, Bing Li, Yinhe Han, Ying Wang
Keywords:
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Notes:

[LG-143] Efficient Quantum Gradient and Higher-order Derivative Estimation via Generalized Hadamard Test

Link: https://arxiv.org/abs/2408.05406
Authors: Dantong Li, Dikshant Dulal, Mykhailo Ohorodnikov, Hanrui Wang, Yongshan Ding
Keywords: Noisy Intermediate-Scale Quantum, Hadamard Test, Direct Hadamard Test, Flexible Hadamard Test, near-term quantum hardware
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Notes:

Abstract:In the context of Noisy Intermediate-Scale Quantum (NISQ) computing, parameterized quantum circuits (PQCs) represent a promising paradigm for tackling challenges in quantum sensing, optimal control, optimization, and machine learning on near-term quantum hardware. Gradient-based methods are crucial for understanding the behavior of PQCs and have demonstrated substantial advantages in the convergence rates of Variational Quantum Algorithms (VQAs) compared to gradient-free methods. However, existing gradient estimation methods, such as Finite Difference, Parameter Shift Rule, Hadamard Test, and Direct Hadamard Test, often yield suboptimal gradient circuits for certain PQCs. To address these limitations, we introduce the Flexible Hadamard Test, which, when applied to first-order gradient estimation methods, can invert the roles of ansatz generators and observables. This inversion facilitates the use of measurement optimization techniques to efficiently compute PQC gradients. Additionally, to overcome the exponential cost of evaluating higher-order partial derivatives, we propose the $k$-fold Hadamard Test, which computes the $k^{\text{th}}$-order partial derivative using a single circuit. Furthermore, we introduce Quantum Automatic Differentiation (QAD), a unified gradient method that adaptively selects the best gradient estimation technique for individual parameters within a PQC. This represents the first implementation, to our knowledge, that departs from the conventional practice of uniformly applying a single method to all parameters. Through rigorous numerical experiments, we demonstrate the effectiveness of our proposed first-order gradient methods, showing up to an $O(N)$ factor improvement in circuit execution count for real PQC applications. Our research contributes to the acceleration of VQA computations, offering practical utility in the NISQ era of quantum computing.
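Two of the existing baselines named in the abstract, Finite Difference and the Parameter Shift Rule, can be compared on a one-parameter toy circuit. The sketch below assumes a single RX(theta) rotation applied to |0> and measured in Z, whose expectation value is cos(theta). It does not implement the Flexible or k-fold Hadamard Tests proposed in the paper, and the function names are hypothetical.

```python
import math


def expectation(theta):
    """<Z> after RX(theta) acts on |0>; analytically this equals cos(theta)."""
    return math.cos(theta)


def parameter_shift_gradient(f, theta):
    """Exact gradient for a gate generated by a Pauli operator, using shifts of +-pi/2."""
    return (f(theta + math.pi / 2) - f(theta - math.pi / 2)) / 2.0


def finite_difference_gradient(f, theta, eps=1e-3):
    """Central finite difference: an approximation whose error depends on eps."""
    return (f(theta + eps) - f(theta - eps)) / (2.0 * eps)


theta = 0.7
exact = -math.sin(theta)
ps = parameter_shift_gradient(expectation, theta)
fd = finite_difference_gradient(expectation, theta)
print(exact, ps, fd)
```

Both estimators need two circuit evaluations per parameter here; the parameter shift result matches the analytic derivative up to floating-point error, while the finite-difference result carries a truncation error that depends on `eps`.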

[LG-144] fastkqr: A Fast Algorithm for Kernel Quantile Regression

Link: https://arxiv.org/abs/2408.05393
Authors: Qian Tang, Yuwen Gu, Boxiang Wang
Keywords: applied areas, powerful tool, tool for robust, robust and heterogeneous, heterogeneous learning
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Notes:

Abstract:Quantile regression is a powerful tool for robust and heterogeneous learning that has seen applications in a diverse range of applied areas. However, its broader application is often hindered by the substantial computational demands arising from the non-smooth quantile loss function. In this paper, we introduce a novel algorithm named fastkqr, which significantly advances the computation of quantile regression in reproducing kernel Hilbert spaces. The core of fastkqr is a finite smoothing algorithm that magically produces exact regression quantiles, rather than approximations. To further accelerate the algorithm, we equip fastkqr with a novel spectral technique that carefully reutilizes matrix computations. In addition, we extend fastkqr to accommodate a flexible kernel quantile regression with a data-driven crossing penalty, addressing the interpretability challenges of crossing quantile curves at multiple levels. We have implemented fastkqr in a publicly available R package. Extensive simulations and real applications show that fastkqr matches the accuracy of state-of-the-art algorithms but can operate up to an order of magnitude faster.
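The "non-smooth quantile loss function" mentioned in the abstract is commonly written as the pinball (check) loss. As a minimal sketch, the code below defines that loss and shows that the constant minimizing it over a sample is a sample quantile. It does not implement the finite smoothing algorithm, the kernel machinery, or the R package described in the abstract, and the helper names are hypothetical.

```python
def pinball_loss(residual, tau):
    """Quantile (check) loss: tau * r if r >= 0, else (tau - 1) * r; kinked at r = 0."""
    return residual * (tau - (1.0 if residual < 0 else 0.0))


def best_constant(values, tau):
    """Return the sample value c that minimizes the total pinball loss of (v - c)."""
    return min(values, key=lambda c: sum(pinball_loss(v - c, tau) for v in values))


data = [1, 2, 3, 4, 5]
median_fit = best_constant(data, 0.5)
upper_fit = best_constant(data, 0.9)
print(median_fit, upper_fit)
```

The kink at zero residual is what makes gradient-based optimization awkward, and it is the reason smoothing techniques are used for this loss.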

[LG-145] Optimizing Portfolio with Two-Sided Transactions and Lending: A Reinforcement Learning Framework

Link: https://arxiv.org/abs/2408.05382
Authors: Ali Habibnia, Mahdi Soltanzadeh
Keywords:
Categories: Portfolio Management (q-fin.PM); Machine Learning (cs.LG)
Notes:

[LG-146] GesturePrint: Enabling User Identification for mmWave-based Gesture Recognition Systems

[LG-147] scASDC: Attention Enhanced Structural Deep Clustering for Single-cell RNA-seq Data

[LG-148] Advancing Thermodynamic Group-Contribution Methods by Machine Learning: UNIFAC 2.0

Link: https://arxiv.org/abs/2408.05220
Authors: Nicolas Hayer, Thorsten Wendel, Stephan Mandt, Hans Hasse, Fabian Jirasek
Keywords:
Categories: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
Notes:

[LG-149] Physics-Informed Weakly Supervised Learning for Interatomic Potentials

Link: https://arxiv.org/abs/2408.05215
Authors: Makoto Takamoto, Viktor Zaverkin, Mathias Niepert
Keywords:
Categories: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG); Biological Physics (physics.bio-ph); Computational Physics (physics.comp-ph)
Notes: 24 pages, 2 figures, 18 tables

Information Retrieval

[IR-0] Perceptual Similarity for Measuring Decision-Making Style and Policy Diversity in Games

[IR-1] The landscape of ontologies in materials science and engineering: A survey and evaluation

Link: https://arxiv.org/abs/2408.06034
Authors: Ebrahim Norouzi, Jörg Waitelonis, Harald Sack
Keywords: materials science, material properties, describe experiments, computational workflows, Science and Engineering
Categories: Information Retrieval (cs.IR)
Notes:

Abstract:Ontologies are widely used in materials science to describe experiments, processes, material properties, and experimental and computational workflows. Numerous online platforms are available for accessing and sharing ontologies in Materials Science and Engineering (MSE). Additionally, several surveys of these ontologies have been conducted. However, these studies often lack comprehensive analysis and quality control metrics. This paper provides an overview of ontologies used in Materials Science and Engineering to assist domain experts in selecting the most suitable ontology for a given purpose. Sixty selected ontologies are analyzed and compared based on the requirements outlined in this paper. Statistical data on ontology reuse and key metrics are also presented. The evaluation results provide valuable insights into the strengths and weaknesses of the investigated MSE ontologies. This enables domain experts to select suitable ontologies and to incorporate relevant terms from existing resources.

[IR-2] ConvKGYarn: Spinning Configurable and Scalable Conversational Knowledge Graph QA datasets with Large Language Models

Link: https://arxiv.org/abs/2408.05948
Authors: Ronak Pradeep, Daniel Lee, Ali Mousavi, Jeff Pound, Yisi Sang, Jimmy Lin, Ihab Ilyas, Saloni Potdar, Mostafa Arefiyan, Yunyao Li
Keywords: Large Language Models, Large Language, advancement of Large, assistants necessitates dynamic, Language Models
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Notes:

[IR-3] Optimizing RAG Techniques for Automotive Industry PDF Chatbots: A Case Study with Locally Deployed Ollama Models

[IR-4] Low-Rank Approximation Adaptation and Other Tales

[IR-5] Iterative Improvement of an Additively Regularized Topic Model

Link: https://arxiv.org/abs/2408.05840
Authors: Alex Gorbulev, Vasiliy Alekseev, Konstantin Vorontsov
Keywords: soft clustering problem, topic model, Topic, clustering problem, unknown clusters
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Probability (math.PR)
Notes: A full draft of the second version of the article

[IR-6] GraphTransfer: A Generic Feature Fusion Framework for Collaborative Filtering

Link: https://arxiv.org/abs/2408.05792
Authors: Jiafeng Xia, Dongsheng Li, Hansu Gu, Tun Lu, Ning Gu
Keywords: Graph Neural Networks, Neural Networks, extract powerful structural, Graph Neural, powerful structural features
Categories: Information Retrieval (cs.IR)
Notes:

Abstract:Graph Neural Networks (GNNs) have demonstrated effectiveness in collaborative filtering tasks due to their ability to extract powerful structural features. However, combining the graph features extracted from user-item interactions and auxiliary features extracted from user genres and item properties remains a challenge. Currently available fusion methods face two major issues: 1) simple methods such as concatenation and summation are generic, but not accurate in capturing feature relationships; 2) task-specific methods like attention mechanisms and meta paths may not be suitable for general feature fusion. To address these challenges, we present GraphTransfer, a simple but universal feature fusion framework for GNN-based collaborative filtering. Our method accurately fuses different types of features by first extracting graph features from the user-item interaction graph and auxiliary features from users and items using GCN. The proposed cross fusion module then effectively bridges the semantic gaps between the interaction scores of different features. Theoretical analysis and experiments on public datasets show that GraphTransfer outperforms other feature fusion methods in CF tasks. Additionally, we demonstrate the universality of our framework via empirical studies in three other scenarios, showing that GraphTransfer leads to significant improvements in the performance of CF algorithms.

[IR-7] Advancing Re-Ranking with Multimodal Fusion and Target-Oriented Auxiliary Tasks in E-Commerce Search

[IR-8] MomentCross: Next-Generation Real-Time Cross-Domain CTR Prediction for Live-Streaming Recommendation at Kuaishou

Link: https://arxiv.org/abs/2408.05709
Authors: Jiangxia Cao, Shen Wang, Yue Li, Shenghui Wang, Jian Tang, Shiyao Wang, Shuang Yang, Zhaojie Liu, Guorui Zhou
Keywords: temporarily-alive to distribution, feedback delay, long time, live-streaming, Kuaishou
Categories: Information Retrieval (cs.IR)
Notes: Work in progress

Abstract:Kuaishou is one of the largest short-video and live-streaming platforms. Compared with short-video recommendation, live-streaming recommendation is more complex because: (1) live streams are only temporarily alive for distribution, (2) users may watch for a long time, which delays feedback, and (3) content is unpredictable and changes over time. Even if a user is interested in the live-streaming author, the interaction may still be a negative watch (e.g., a short view of 3s) because the real-time content is not attractive enough. Therefore, live-streaming recommendation faces a challenging task: how do we recommend a live stream at the right moment for users? Additionally, our platform’s major exposure content is short video, and the amount of exposed short video is 9x more than that of exposed live-streaming. Thus users leave more behaviors on short videos, which leads to a serious data imbalance problem in which the live-streaming data cannot fully reflect user interests. This raises another challenging task: how do we utilize users’ short-video behaviors to improve live-streaming recommendation?

[IR-9] A Decoding Acceleration Framework for Industrial Deployable LLM-based Recommender Systems

Link: https://arxiv.org/abs/2408.05676
Authors: Yunjia Xi, Hangyu Wang, Bo Chen, Jianghao Lin, Menghui Zhu, Weiwen Liu, Ruiming Tang, Weinan Zhang, Yong Yu
Keywords: LLM-based recommender systems, increasing attention, recommender systems, Recently, LLMs
Categories: Information Retrieval (cs.IR)
Notes:

Abstract:Recently, increasing attention has been paid to LLM-based recommender systems, but their deployment is still under exploration in the industry. Most deployments utilize LLMs as feature enhancers, generating augmentation knowledge in the offline stage. However, in recommendation scenarios, involving numerous users and items, even offline generation with LLMs consumes considerable time and resources. This generation inefficiency stems from the autoregressive nature of LLMs, and a promising direction for acceleration is speculative decoding, a Draft-then-Verify paradigm that increases the number of generated tokens per decoding step. In this paper, we first identify that recommendation knowledge generation is suitable for retrieval-based speculative decoding. Then, we discern two characteristics: (1) extensive items and users in RSs bring retrieval inefficiency, and (2) RSs exhibit high diversity tolerance for text generated by LLMs. Based on the above insights, we propose a Decoding Acceleration Framework for LLM-based Recommendation (dubbed DARE), with Customized Retrieval Pool to improve retrieval efficiency and Relaxed Verification to increase the acceptance rate of draft tokens, respectively. Extensive experiments demonstrate that DARE achieves a 3-5x speedup and is compatible with various frameworks and backbone LLMs. DARE has also been deployed to online advertising scenarios within a large-scale commercial environment, achieving a 3.45x speedup while maintaining the downstream performance.
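The Draft-then-Verify paradigm described in this abstract can be sketched with a toy stand-in for the target model. The code below uses strict greedy verification: draft tokens are accepted while they match what the target model would have produced, and the first mismatch is replaced by the target's own token. This is an illustration under stated assumptions, not DARE itself: the draft is passed in directly rather than fetched from a retrieval pool, the Relaxed Verification described in the abstract is not implemented, and `target_next`, `verify_draft`, and `SEQ` are hypothetical names.

```python
SEQ = ["the", "cat", "sat", "on", "the", "mat"]


def target_next(context):
    """Toy target model: deterministically continues the fixed sentence SEQ."""
    return SEQ[len(context)] if len(context) < len(SEQ) else "<eos>"


def verify_draft(prefix, draft, next_token):
    """Return the tokens emitted in one decoding step under strict greedy verification."""
    context = list(prefix)
    emitted = []
    for token in draft:
        expected = next_token(context)
        if token != expected:
            # First mismatch: discard the rest of the draft and emit the target's token.
            emitted.append(expected)
            return emitted
        emitted.append(token)
        context.append(token)
    # The whole draft was accepted, so the target model contributes one extra token.
    emitted.append(next_token(context))
    return emitted


step_tokens = verify_draft(["the"], ["cat", "sat", "up"], target_next)
print(step_tokens)
```

A real system scores all draft positions in a single parallel forward pass of the target model; the loop above only simulates that check sequentially. The speedup comes from emitting several tokens per step whenever the draft agrees with the target.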

[IR-10] Utilizing Large Language Models to Optimize the Detection and Explainability of Phishing Websites

[IR-11] Exploring Applications of State Space Models and Advanced Training Techniques in Sequential Recommendations: A Comparative Study on Efficiency and Performance

[IR-12] Document-Level Event Extraction with Definition-Driven ICL

Link: https://arxiv.org/abs/2408.05566
Authors: Zhuoyuan Liu, Yilin Luo
Keywords: Natural Language Processing, Large Language Models, document-level event extraction, Language Processing, Large Language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Information Retrieval (cs.IR)
Notes:

[IR-13] HoME: Hierarchy of Multi-Gate Experts for Multi-Task Learning at Kuaishou

[IR-14] Report on the 1st Workshop on Large Language Model for Evaluation in Information Retrieval (LLM4Eval 2024) at SIGIR 2024

Link: https://arxiv.org/abs/2408.05388
Authors: Hossein A. Rahmani, Clemencia Siro, Mohammad Aliannejadi, Nick Craswell, Charles L. A. Clarke, Guglielmo Faggioli, Bhaskar Mitra, Paul Thomas, Emine Yilmaz
Keywords: ACM SIGIR Conference, Large Language Model, SIGIR Conference, ACM SIGIR, Information Retrieval
Categories: Information Retrieval (cs.IR)
Notes: LLM4Eval Workshop Report

Abstract:The first edition of the workshop on Large Language Model for Evaluation in Information Retrieval (LLM4Eval 2024) took place in July 2024, co-located with the ACM SIGIR Conference 2024 in the USA (SIGIR 2024). The aim was to bring information retrieval researchers together around the topic of LLMs for evaluation in information retrieval that gathered attention with the advancement of large language models and generative AI. Given the novelty of the topic, the workshop was focused around multi-sided discussions, namely panels and poster sessions of the accepted proceedings papers.

[IR-15] IntentRec: Predicting User Session Intent with Hierarchical Multi-Task Learning

[IR-16] Towards Scalable Topic Detection on Web via Simulating Levy Walks Nature of Topics in Similarity Space

Link: https://arxiv.org/abs/2408.05348
Authors: Junbiao Pang, Qingming Huang
Keywords: social media websites, popular topics, social media, media websites, key steps
Categories: Information Retrieval (cs.IR)
Notes:

Abstract:Organizing a few webpages from social media websites into popular topics is one of the key steps to understand trends on the web. Discovering popular topics from the web faces a sea of noise webpages that never evolve into popular topics. In this paper, we discover that the similarity values between webpages in a popular topic contain features statistically similar to those observed in Levy walks. Consequently, we present a simple, novel, yet very powerful Explore-Exploit (EE) approach to group topics by simulating the Levy walks nature in the similarity space. The proposed EE-based topic clustering is an effective and efficient method, which is a solid move towards handling a sea of noise webpages. Experiments on two public data sets demonstrate that our approach is not only comparable to the state-of-the-art methods in terms of effectiveness but also significantly outperforms the state-of-the-art methods in terms of efficiency.

[IR-17] AI-assisted Coding with Cody: Lessons from Context Retrieval and Evaluation for Code Recommendations

[IR-18] Neural Machine Unranking

[IR-19] SocFedGPT: Federated GPT-based Adaptive Content Filtering System Leveraging User Interactions in Social Networks

[IR-20] FLASH: Federated Learning-Based LLMs for Advanced Query Processing in Social Networks through RAG

Attachment Download

Click to download the full list of today's papers
