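This post lists the latest papers fetched from arXiv.org on 2024-08-20. It is updated automatically and grouped into five areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.

Note: the daily paper data is fetched from arXiv.org and updated automatically at around 10:30 every morning.

Tip: if you would like to receive the daily paper data by email, leave your email address in the comments; the email is likewise sent automatically at around 10:30 each day.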

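Overview (2024-08-20)

712 papers were updated today, including:

  • Natural Language Processing: 105 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 207 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 167 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 179 papers (Machine Learning (cs.LG))

Natural Language Processing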

[NLP-0] LongVILA: Scaling Long-Context Visual Language Models for Long Videos

Link: https://arxiv.org/abs/2408.10188
Authors: Fuzhao Xue, Yukang Chen, Dacheng Li, Qinghao Hu, Ligeng Zhu, Xiuyu Li, Yunhao Fang, Haotian Tang, Shang Yang, Zhijian Liu, Ethan He, Hongxu Yin, Pavlo Molchanov, Jan Kautz, Linxi Fan, Yuke Zhu, Yao Lu, Song Han
Keywords: Sequence Parallelism, Multi-Modal Sequence Parallelism, multi-modal foundation models, capability is critical, Hugging Face Transformers
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Code and models are available at this https URL

Abstract:Long-context capability is critical for multi-modal foundation models. We introduce LongVILA, a full-stack solution for long-context vision-language models, including system, model training, and dataset development. On the system side, we introduce the first Multi-Modal Sequence Parallelism (MM-SP) system that enables long-context training and inference, enabling 2M context length training on 256 GPUs. MM-SP is also efficient, being 2.1x - 5.7x faster than Ring-Style Sequence Parallelism and 1.1x - 1.4x faster than Megatron-LM in text-only settings. Moreover, it seamlessly integrates with Hugging Face Transformers. For model training, we propose a five-stage pipeline comprising alignment, pre-training, context extension, and long-short joint supervised fine-tuning. Regarding datasets, we meticulously construct large-scale visual language pre-training datasets and long video instruction-following datasets to support our multi-stage training process. The full-stack solution extends the feasible frame number of VILA by a factor of 128 (from 8 to 1024 frames) and improves long video captioning score from 2.00 to 3.26 (1.6x), achieving 99.5% accuracy in 1400-frames video (274k context length) needle in a haystack. LongVILA-8B also demonstrates a consistent improvement in performance on long videos within the VideoMME benchmark as the video frames increase.

[NLP-1] Multilingual Needle in a Haystack: Investigating Long-Context Behavior of Multilingual Large Language Models

Link: https://arxiv.org/abs/2408.10151
Authors: Amey Hengle, Prasoon Bajpai, Soham Dan, Tanmoy Chakraborty
Keywords: handle long multilingual, recent large language, demonstrate remarkable abilities, long multilingual contexts, recent large
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:While recent large language models (LLMs) demonstrate remarkable abilities in responding to queries in diverse languages, their ability to handle long multilingual contexts is unexplored. As such, a systematic evaluation of the long-context capabilities of LLMs in multilingual settings is crucial, specifically in the context of information retrieval. To address this gap, we introduce the MultiLingual Needle-in-a-Haystack (MLNeedle) test, designed to assess a model’s ability to retrieve relevant information (the needle) from a collection of multilingual distractor texts (the haystack). This test serves as an extension of the multilingual question-answering task, encompassing both monolingual and cross-lingual retrieval. We evaluate four state-of-the-art LLMs on MLNeedle. Our findings reveal that model performance can vary significantly with language and needle position. Specifically, we observe that model performance is the lowest when the needle is (i) in a language outside the English language family and (ii) located in the middle of the input context. Furthermore, although some models claim a context size of 8k tokens or greater, none demonstrate satisfactory cross-lingual retrieval performance as the context length increases. Our analysis provides key insights into the long-context behavior of LLMs in multilingual settings to guide future evaluation protocols. To our knowledge, this is the first study to investigate the multilingual long-context behavior of LLMs.
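To make the needle-in-a-haystack setup concrete, here is a minimal Python sketch of how such a test context could be assembled. It is not the MLNeedle implementation: the function name `build_haystack`, the `depth` parameter, and the sample sentences are all assumptions made for illustration.

```python
def build_haystack(needle: str, distractors: list[str], depth: float) -> tuple[str, int]:
    """Insert `needle` among `distractors` at a relative depth in [0, 1].

    depth=0.0 places the needle first, depth=1.0 places it last.
    Returns the joined context and the passage index of the needle.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    index = round(depth * len(distractors))
    passages = distractors[:index] + [needle] + distractors[index:]
    return "\n\n".join(passages), index


# Hypothetical multilingual distractor passages (the haystack).
distractors = [
    "Der Rhein fließt durch Köln.",      # German
    "El Amazonas atraviesa Brasil.",     # Spanish
    "The Thames flows through London.",  # English
    "La Seine traverse Paris.",          # French
]
# Hypothetical passage containing the answer (the needle).
needle = "The capital of Australia is Canberra."

# depth=0.5 puts the needle in the middle of the context, the position
# the abstract reports as hardest for the evaluated models.
context, index = build_haystack(needle, distractors, depth=0.5)
question = "What is the capital of Australia?"
prompt = f"{context}\n\nQuestion: {question}\nAnswer:"
```

Sweeping `depth` and the language of the needle over a grid would give the position-by-language comparison the abstract describes.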

[NLP-2] In-Context Learning with Representations: Contextual Generalization of Trained Transformers

Link: https://arxiv.org/abs/2408.10147
Authors: Tong Yang, Yu Huang, Yingbin Liang, Yuejie Chi
Keywords: pretrained large language, large language models, remarkable capability, capability of pretrained, pretrained large
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Abstract:In-context learning (ICL) refers to a remarkable capability of pretrained large language models, which can learn a new task given a few examples during inference. However, theoretical understanding of ICL is largely under-explored, particularly whether transformers can be trained to generalize to unseen examples in a prompt, which will require the model to acquire contextual knowledge of the prompt for generalization. This paper investigates the training dynamics of transformers by gradient descent through the lens of non-linear regression tasks. The contextual generalization here can be attained via learning the template function for each task in-context, where all template functions lie in a linear space with m basis functions. We analyze the training dynamics of one-layer multi-head transformers to in-contextly predict unlabeled inputs given partially labeled prompts, where the labels contain Gaussian noise and the number of examples in each prompt are not sufficient to determine the template. Under mild assumptions, we show that the training loss for a one-layer multi-head transformer converges linearly to a global minimum. Moreover, the transformer effectively learns to perform ridge regression over the basis functions. To our knowledge, this study is the first provable demonstration that transformers can learn contextual (i.e., template) information to generalize to both unseen examples and tasks when prompts contain only a small number of query-answer pairs.
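The abstract says the trained transformer ends up performing ridge regression over the basis functions. The sketch below shows only that target computation in plain Python, not the transformer or its training dynamics. The basis, the data, and the regularization strength `lam` are assumptions chosen for illustration.

```python
def solve(a, b):
    """Solve the linear system a @ x = b by Gauss-Jordan elimination
    with partial pivoting. `a` is a square list of lists."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def ridge_fit(xs, ys, basis, lam):
    """Return weights w minimizing sum_i (y_i - w . phi(x_i))^2 + lam * |w|^2,
    where phi(x) evaluates each of the m basis functions at x."""
    phi = [[f(x) for f in basis] for x in xs]
    m = len(basis)
    gram = [[sum(row[i] * row[j] for row in phi) + (lam if i == j else 0.0)
             for j in range(m)] for i in range(m)]
    rhs = [sum(row[i] * y for row, y in zip(phi, ys)) for i in range(m)]
    return solve(gram, rhs)


def predict(w, basis, x):
    """Evaluate the fitted template function at x."""
    return sum(wi * f(x) for wi, f in zip(w, basis))


# Assumed basis with m = 3 functions: 1, x, x^2.
basis = [lambda x: 1.0, lambda x: x, lambda x: x * x]

# Two noisy labeled examples: fewer than m, so the template is not
# determined by the prompt alone and the ridge term picks a solution.
xs, ys = [0.0, 1.0], [1.1, 2.9]
weights = ridge_fit(xs, ys, basis, lam=0.1)
query_prediction = predict(weights, basis, 0.5)
```

With `lam > 0` the regularized Gram matrix is positive definite, so the system stays solvable even when the prompt holds fewer examples than basis functions.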

[NLP-3] Instruction Finetuning for Leaderboard Generation from Empirical AI Research

Link: https://arxiv.org/abs/2408.10141
Authors: Salomon Kabongo, Jennifer D’Souza
Keywords: pretrained Large Language, Large Language Models, pretrained Large, quadruples from articles, Large Language
Categories: Computation and Language (cs.CL)

Abstract:This study demonstrates the application of instruction finetuning of pretrained Large Language Models (LLMs) to automate the generation of AI research leaderboards, extracting (Task, Dataset, Metric, Score) quadruples from articles. It aims to streamline the dissemination of advancements in AI research by transitioning from traditional, manual community curation, or otherwise taxonomy-constrained natural language inference (NLI) models, to an automated, generative LLM-based approach. Utilizing the FLAN-T5 model, this research enhances LLMs’ adaptability and reliability in information extraction, offering a novel method for structured knowledge representation.
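As an illustration of the data structure involved, the following Python sketch groups extracted (Task, Dataset, Metric, Score) quadruples into leaderboards. It covers only the aggregation step, not the FLAN-T5 extraction. The `paper` field, the sample values, and the higher-is-better ordering are assumptions, not taken from the paper.

```python
from collections import defaultdict
from typing import NamedTuple


class Result(NamedTuple):
    task: str
    dataset: str
    metric: str
    score: float
    paper: str  # hypothetical provenance field, not part of the quadruple


def build_leaderboards(results):
    """Group results by (task, dataset, metric) and rank them by score.

    Assumes higher scores are better, which does not hold for every
    metric (for example perplexity or error rate).
    """
    boards = defaultdict(list)
    for r in results:
        boards[(r.task, r.dataset, r.metric)].append((r.score, r.paper))
    return {key: sorted(rows, reverse=True) for key, rows in boards.items()}


# Made-up quadruples for illustration only.
results = [
    Result("Question Answering", "SQuAD", "F1", 88.5, "paper-A"),
    Result("Question Answering", "SQuAD", "F1", 91.2, "paper-B"),
    Result("Summarization", "Multi-News", "ROUGE-1", 44.0, "paper-C"),
]
boards = build_leaderboards(results)
top_score, top_paper = boards[("Question Answering", "SQuAD", "F1")][0]
```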

[NLP-4] Rhyme-aware Chinese lyric generator based on GPT

Link: https://arxiv.org/abs/2408.10130
Authors: Yixiao Yuan, Yangchen Huang, Yu Ma, Xinjin Li, Zhenglin Li, Yiming Shi, Huapeng Zhou
Keywords: Neural language representation, effectively capture rich, capture rich semantic, rich semantic patterns, consistently improve natural
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Neural language representation models such as GPT, pre-trained on large-scale corpora, can effectively capture rich semantic patterns from plain text and be fine-tuned to consistently improve natural language generation performance. However, existing pre-trained language models used to generate lyrics rarely consider rhyme information, which is crucial in lyrics. Using a pre-trained model directly results in poor performance. To enhance the rhyming quality of generated lyrics, we incorporate integrated rhyme information into our model, thereby improving lyric generation performance.

[NLP-5] GLIMMER: Incorporating Graph and Lexical Features in Unsupervised Multi-Document Summarization ECAI2024

Link: https://arxiv.org/abs/2408.10115
Authors: Ran Liu, Ming Liu, Min Yu, Jianguo Jiang, Gang Li, Dan Zhang, Jingyuan Li, Xiang Meng, Weiqing Huang
Keywords: Pre-trained language models, multi-document summarization tasks, Pre-trained language, summarization tasks, multi-document summarization
Categories: Computation and Language (cs.CL)
Comments: 19 pages, 7 figures. Accepted by ECAI 2024

Abstract:Pre-trained language models are increasingly being used in multi-document summarization tasks. However, these models need large-scale corpora for pre-training and are domain-dependent. Other non-neural unsupervised summarization approaches mostly rely on key sentence extraction, which can lead to information loss. To address these challenges, we propose a lightweight yet effective unsupervised approach called GLIMMER: a Graph and LexIcal features based unsupervised Multi-docuMEnt summaRization approach. It first constructs a sentence graph from the source documents, then automatically identifies semantic clusters by mining low-level features from raw texts, thereby improving intra-cluster correlation and the fluency of generated sentences. Finally, it summarizes clusters into natural sentences. Experiments conducted on Multi-News, Multi-XScience and DUC-2004 demonstrate that our approach outperforms existing unsupervised approaches. Furthermore, it surpasses state-of-the-art pre-trained multi-document summarization models (e.g. PEGASUS and PRIMERA) under zero-shot settings in terms of ROUGE scores. Additionally, human evaluations indicate that summaries generated by GLIMMER achieve high readability and informativeness scores. Our code is available at this https URL.

[NLP-6] Personalizing Reinforcement Learning from Human Feedback with Variational Preference Learning

Link: https://arxiv.org/abs/2408.10075
Authors: Sriyash Poddar, Yanming Wan, Hamish Ivison, Abhishek Gupta, Natasha Jaques
Keywords: Human Feedback, Reinforcement Learning, powerful paradigm, paradigm for aligning, RLHF
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO)
Comments: this http URL

Abstract:Reinforcement Learning from Human Feedback (RLHF) is a powerful paradigm for aligning foundation models to human values and preferences. However, current RLHF techniques cannot account for the naturally occurring differences in individual human preferences across a diverse population. When these differences arise, traditional RLHF frameworks simply average over them, leading to inaccurate rewards and poor performance for individual subgroups. To address the need for pluralistic alignment, we develop a class of multimodal RLHF methods. Our proposed techniques are based on a latent variable formulation - inferring a novel user-specific latent and learning reward models and policies conditioned on this latent without additional user-specific data. While conceptually simple, we show that in practice, this reward modeling requires careful algorithmic considerations around model architecture and reward scaling. To empirically validate our proposed technique, we first show that it can provide a way to combat underspecification in simulated control problems, inferring and optimizing user-specific reward functions. Next, we conduct experiments on pluralistic language datasets representing diverse user preferences and demonstrate improved reward function accuracy. We additionally show the benefits of this probabilistic framework in terms of measuring uncertainty, and actively learning user preferences. This work enables learning from diverse populations of users with divergent preferences, an important challenge that naturally occurs in problems from robot learning to foundation model alignment.

[NLP-7] Privacy Checklist: Privacy Violation Detection Grounding on Contextual Integrity Theory

Link: https://arxiv.org/abs/2408.10053
Authors: Haoran Li, Wei Fan, Yulin Chen, Jiayang Cheng, Tianshu Chu, Xuebing Zhou, Peizhao Hu, Yangqiu Song
Keywords: attracted wide attention, Privacy, Natural Language Processing, smart devices, attracted wide
Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR)

Abstract:Privacy research has attracted wide attention as individuals worry that their private data can be easily leaked during interactions with smart devices, social platforms, and AI applications. Computer science researchers, on the other hand, commonly study privacy issues through privacy attacks and defenses on segmented fields. Privacy research is conducted on various sub-fields, including Computer Vision (CV), Natural Language Processing (NLP), and Computer Networks. Within each field, privacy has its own formulation. Though pioneering works on attacks and defenses reveal sensitive privacy issues, they are narrowly trapped and cannot fully cover people’s actual privacy concerns. Consequently, the research on general and human-centric privacy research remains rather unexplored. In this paper, we formulate the privacy issue as a reasoning problem rather than simple pattern matching. We ground on the Contextual Integrity (CI) theory which posits that people’s perceptions of privacy are highly correlated with the corresponding social context. Based on such an assumption, we develop the first comprehensive checklist that covers social identities, private attributes, and existing privacy regulations. Unlike prior works on CI that either cover limited expert annotated norms or model incomplete social context, our proposed privacy checklist uses the whole Health Insurance Portability and Accountability Act of 1996 (HIPAA) as an example, to show that we can resort to large language models (LLMs) to completely cover the HIPAA’s regulations. Additionally, our checklist also gathers expert annotations across multiple ontologies to determine private information including but not limited to personally identifiable information (PII). We use our preliminary results on the HIPAA to shed light on future context-centric privacy research to cover more privacy regulations, social norms and standards.

[NLP-8] C2RL: Content and Context Representation Learning for Gloss-free Sign Language Translation and Retrieval

Link: https://arxiv.org/abs/2408.09949
Authors: Zhigang Chen, Benjia Zhou, Yiqing Huang, Jun Wan, Yibo Hu, Hailin Shi, Yanyan Liang, Zhen Lei, Du Zhang
Keywords: Sign Language Translation, Sign Language Retrieval, Language Translation, Language Retrieval, Sign Language
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Abstract:Sign Language Representation Learning (SLRL) is crucial for a range of sign language-related downstream tasks such as Sign Language Translation (SLT) and Sign Language Retrieval (SLRet). Recently, many gloss-based and gloss-free SLRL methods have been proposed, showing promising performance. Among them, the gloss-free approach shows promise for strong scalability without relying on gloss annotations. However, it currently faces suboptimal solutions due to challenges in encoding the intricate, context-sensitive characteristics of sign language videos, mainly struggling to discern essential sign features using a non-monotonic video-text alignment strategy. Therefore, we introduce an innovative pretraining paradigm for gloss-free SLRL, called C²RL, in this paper. Specifically, rather than merely incorporating a non-monotonic semantic alignment of video and text to learn language-oriented sign features, we emphasize two pivotal aspects of SLRL: Implicit Content Learning (ICL) and Explicit Context Learning (ECL). ICL delves into the content of communication, capturing the nuances, emphasis, timing, and rhythm of the signs. In contrast, ECL focuses on understanding the contextual meaning of signs and converting them into equivalent sentences. Despite its simplicity, extensive experiments confirm that the joint optimization of ICL and ECL results in robust sign language representation and significant performance gains in gloss-free SLT and SLRet tasks. Notably, C²RL improves the BLEU-4 score by +5.3 on P14T, +10.6 on CSL-daily, +6.2 on OpenASL, and +1.3 on How2Sign. It also boosts the R@1 score by +8.3 on P14T, +14.4 on CSL-daily, and +5.9 on How2Sign. Additionally, we set a new baseline for the OpenASL dataset in the SLRet task.

[NLP-9] Microscopic Analysis on LLM players via Social Deduction Game

Link: https://arxiv.org/abs/2408.09946
Authors: Byungjun Kim, Dayeon Seo, Bugeun Kim
Keywords: large language models, begun developing autonomous, Recent studies, developing autonomous game, social deduction games
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Under review, 10 pages

Abstract:Recent studies have begun developing autonomous game players for social deduction games using large language models (LLMs). When building LLM players, fine-grained evaluations are crucial for addressing weaknesses in game-playing abilities. However, existing studies have often overlooked such assessments. Specifically, we point out two issues with the evaluation methods employed. First, game-playing abilities have typically been assessed through game-level outcomes rather than specific event-level skills; Second, error analyses have lacked structured methodologies. To address these issues, we propose an approach utilizing a variant of the SpyFall game, named SpyGame. We conducted an experiment with four LLMs, analyzing their gameplay behavior in SpyGame both quantitatively and qualitatively. For the quantitative analysis, we introduced eight metrics to resolve the first issue, revealing that these metrics are more effective than existing ones for evaluating the two critical skills: intent identification and camouflage. In the qualitative analysis, we performed thematic analysis to resolve the second issue. This analysis identifies four major categories that affect gameplay of LLMs. Additionally, we demonstrate how these categories complement and support the findings from the quantitative analysis.

[NLP-10] Benchmarking LLMs for Translating Classical Chinese Poetry: Evaluating Adequacy, Fluency, and Elegance

Link: https://arxiv.org/abs/2408.09945
Authors: Andong Chen, Lianzhang Lou, Kehai Chen, Xuefeng Bai, Yang Xiang, Muyun Yang, Tiejun Zhao, Min Zhang
Keywords: Large language models, shown remarkable performance, Large language, language models, shown remarkable
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Work in progress

Abstract:Large language models (LLMs) have shown remarkable performance in general translation tasks. However, there is an increasing demand for high-quality translations that are not only adequate but also fluent and elegant. To assess the extent to which current LLMs can meet these demands, we introduce a suitable benchmark for translating classical Chinese poetry into English. This task requires not only adequacy in translating culturally and historically significant content but also a strict adherence to linguistic fluency and poetic elegance. Our study reveals that existing LLMs fall short of this task. To address these issues, we propose RAT, a Retrieval-Augmented machine Translation method that enhances the translation process by incorporating knowledge related to classical poetry. Additionally, we propose an automatic evaluation metric based on GPT-4, which better assesses translation quality in terms of adequacy, fluency, and elegance, overcoming the limitations of traditional metrics. Our dataset and code will be made available.

[NLP-11] “Image, Tell me your story!” Predicting the original meta-context of visual misinformation

Link: https://arxiv.org/abs/2408.09939
Authors: Jonathan Tonglet, Marie-Francine Moens, Iryna Gurevych
Keywords: developed automated approaches, visual misinformation detection, researchers have developed, assist human fact-checkers, image
Categories: Computation and Language (cs.CL)
Comments: Preprint. Code available at this https URL

Abstract:To assist human fact-checkers, researchers have developed automated approaches for visual misinformation detection. These methods assign veracity scores by identifying inconsistencies between the image and its caption, or by detecting forgeries in the image. However, they neglect a crucial point of the human fact-checking process: identifying the original meta-context of the image. By explaining what is actually true about the image, fact-checkers can better detect misinformation, focus their efforts on check-worthy visual content, engage in counter-messaging before misinformation spreads widely, and make their explanation more convincing. Here, we fill this gap by introducing the task of automated image contextualization. We create 5Pils, a dataset of 1,676 fact-checked images with question-answer pairs about their original meta-context. Annotations are based on the 5 Pillars fact-checking framework. We implement a first baseline that grounds the image in its original meta-context using the content of the image and textual evidence retrieved from the open web. Our experiments show promising results while highlighting several open challenges in retrieval and reasoning. We make our code and data publicly available.

[NLP-12] Attribution Analysis Meets Model Editing: Advancing Knowledge Correction in Vision Language Models with VisEdit

Link: https://arxiv.org/abs/2408.09916
Authors: Qizhou Chen, Taolin Zhang, Chengyu Wang, Xiaofeng He, Dakan Wang, Tingting Liu
Keywords: Large Language Model, Model editing aims, developed Large Language, Language Model, costly retraining
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Abstract:Model editing aims to correct outdated or erroneous knowledge in large models without costly retraining. Recent research discovered that the mid-layer representation of the subject’s final token in a prompt has a strong influence on factual predictions, and developed Large Language Model (LLM) editing techniques based on this observation. However, for Vision-LLMs (VLLMs), how visual representations impact the predictions from a decoder-only language model remains largely unexplored. To the best of our knowledge, model editing for VLLMs has not been extensively studied in the literature. In this work, we employ the contribution allocation and noise perturbation methods to measure the contributions of visual representations for token predictions. Our attribution analysis shows that visual representations in mid-to-later layers that are highly relevant to the prompt contribute significantly to predictions. Based on these insights, we propose VisEdit, a novel model editor for VLLMs that effectively corrects knowledge by editing intermediate visual representations in regions important to the edit prompt. We evaluated VisEdit using multiple VLLM backbones and public VLLM editing benchmark datasets. The results show the superiority of VisEdit over the strong baselines adapted from existing state-of-the-art editors for LLMs.

[NLP-13] Active Learning for Identifying Disaster-Related Tweets: A Comparison with Keyword Filtering and Generic Fine-Tuning

Link: https://arxiv.org/abs/2408.09914
Authors: David Hanny, Sebastian Schmidt, Bernd Resch
Keywords: provide essential information, essential information, emergency response, response during natural, natural disasters
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Submitted for the Intelligent Systems Conference (IntelliSys 2024). The version of record of this contribution is published in the Springer series Lecture Notes in Networks and Systems, and is available online at this https URL. This preprint has not undergone peer review or any post-submission improvements or corrections. 13 pages, 2 figures

Abstract:Information from social media can provide essential information for emergency response during natural disasters in near real-time. However, it is difficult to identify the disaster-related posts among the large amounts of unstructured data available. Previous methods often use keyword filtering, topic modelling or classification-based techniques to identify such posts. Active Learning (AL) presents a promising sub-field of Machine Learning (ML) that has not been used much in the field of text classification of social media content. This study therefore investigates the potential of AL for identifying disaster-related Tweets. We compare a keyword filtering approach, a RoBERTa model fine-tuned with generic data from CrisisLex, a base RoBERTa model trained with AL and a fine-tuned RoBERTa model trained with AL regarding classification performance. For testing, data from CrisisLex and manually labelled data from the 2021 flood in Germany and the 2023 Chile forest fires were considered. The results show that generic fine-tuning combined with 10 rounds of AL outperformed all other approaches. Consequently, a broadly applicable model for the identification of disaster-related Tweets could be trained with very little labelling effort. The model can be applied to use cases beyond this study and provides a useful tool for further research in social media analysis.
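The abstract does not say which query strategy the active learning rounds use. As one common option, the Python sketch below shows binary uncertainty sampling: in each round, the posts whose predicted probability is closest to 0.5 are sent for manual labelling. The function name and the probabilities are assumptions for illustration, not the authors' method.

```python
def select_most_uncertain(probabilities, k):
    """Return the indices of the k unlabeled posts whose predicted
    probability of being disaster-related is closest to 0.5."""
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: abs(probabilities[i] - 0.5))
    return ranked[:k]


# Hypothetical classifier outputs for five unlabeled tweets.
probs = [0.98, 0.51, 0.03, 0.45, 0.70]

# Tweets 1 and 3 sit nearest the decision boundary, so they would be
# labelled first and added to the training set before the next round.
to_label = select_most_uncertain(probs, k=2)
```

Repeating this selection after each retraining step, for example for the 10 rounds mentioned in the abstract, keeps the labelling effort focused on the posts the model is least sure about.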

[NLP-14] Performance Law of Large Language Models

Link: https://arxiv.org/abs/2408.09895
Authors: Chuhan Wu, Ruiming Tang
Keywords: large language models, achieved impressive performance, large language, achieved impressive, scaling law
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Personal opinions of the authors

Abstract:Guided by the belief of the scaling law, large language models (LLMs) have achieved impressive performance in recent years. However, scaling law only gives a qualitative estimation of loss, which is influenced by various factors such as model architectures, data distributions, tokenizers, and computation precision. Thus, estimating the real performance of LLMs with different training settings rather than loss may be quite useful in practical development. In this article, we present an empirical equation named “Performance Law” to directly predict the MMLU score of an LLM, which is a widely used metric to indicate the general capability of LLMs in real-world conversations and applications. Based on only a few key hyperparameters of the LLM architecture and the size of training data, we obtain a quite accurate MMLU prediction of various LLMs with diverse sizes and architectures developed by different organizations in different years. Performance law can be used to guide the choice of LLM architecture and the effective allocation of computational resources without extensive experiments.

[NLP-15] Docling Technical Report

Link: https://arxiv.org/abs/2408.09869
Authors: Christoph Auer, Maksym Lysak, Ahmed Nassar, Michele Dolfi, Nikolaos Livathinos, Panos Vagenas, Cesar Berrospi Ramis, Matteo Omenetti, Fabian Lindlbauer, Kasper Dinkla, Valery Weber, Lucas Morin, Ingmar Meijer, Viktor Kuropiatnyk, Peter W. J. Staar
Keywords: PDF document conversion, report introduces Docling, MIT-licensed open-source package, technical report introduces, introduces Docling
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE)
Comments: arXiv admin note: substantial text overlap with arXiv:2206.01062

Abstract:This technical report introduces Docling, an easy to use, self-contained, MIT-licensed open-source package for PDF document conversion. It is powered by state-of-the-art specialized AI models for layout analysis (DocLayNet) and table structure recognition (TableFormer), and runs efficiently on commodity hardware in a small resource budget. The code interface allows for easy extensibility and addition of new features and models.

[NLP-16] MAPLE: Enhancing Review Generation with Multi-Aspect Prompt LEarning in Explainable Recommendation

Link: https://arxiv.org/abs/2408.09865
Authors: Ching-Wen Yang, Che Wei Chen, Kun-da Wu, Hao Xu, Jui-Feng Yao, Hung-Yu Kao
Keywords: Explainable Recommendation task, Explainable Recommendation, Recommendation task, task is designed, designed to receive
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 8 main pages, 10 pages for appendix. Under review

Abstract:Explainable Recommendation task is designed to receive a pair of user and item and output explanations to justify why an item is recommended to a user. Many models treat review-generation as a proxy of explainable recommendation. Although they are able to generate fluent and grammatical sentences, they suffer from generality and hallucination issues. We propose a personalized, aspect-controlled model called Multi-Aspect Prompt LEarner (MAPLE), in which it integrates aspect category as another input dimension to facilitate the memorization of fine-grained aspect terms. Experiments on two real-world review datasets in restaurant domain show that MAPLE outperforms the baseline review-generation models in terms of text and feature diversity while maintaining excellent coherence and factual relevance. We further treat MAPLE as a retriever component in the retriever-reader framework and employ a Large-Language Model (LLM) as the reader, showing that MAPLE’s explanation along with the LLM’s comprehension ability leads to enriched and personalized explanation as a result. We will release the code and data in this http URL upon acceptance.

[NLP-17] TaSL: Continual Dialog State Tracking via Task Skill Localization and Consolidation ACL2024

Link: https://arxiv.org/abs/2408.09857
Authors: Yujie Feng, Xu Chu, Yongxin Xu, Guangyuan Shi, Bo Liu, Xiao-Ming Wu
Keywords: Dialogue State Tracking, Continual Dialogue State, practical dialogue system, dialogue system requires, ongoing skill acquisition
Categories: Computation and Language (cs.CL)
Comments: Accepted to ACL 2024 Main Conference

Abstract:A practical dialogue system requires the capacity for ongoing skill acquisition and adaptability to new tasks while preserving prior knowledge. However, current methods for Continual Dialogue State Tracking (DST), a crucial function of dialogue systems, struggle with the catastrophic forgetting issue and knowledge transfer between tasks. We present TaSL, a novel framework for task skill localization and consolidation that enables effective knowledge transfer without relying on memory replay. TaSL uses a novel group-wise technique to pinpoint task-specific and task-shared areas. Additionally, a fine-grained skill consolidation strategy protects task-specific knowledge from being forgotten while updating shared knowledge for bi-directional knowledge transfer. As a result, TaSL strikes a balance between preserving previous knowledge and excelling at new tasks. Comprehensive experiments on various backbones highlight the significant performance improvements of TaSL over existing state-of-the-art methods. The source code is provided for reproducibility.

[NLP-18] TeamLoRA: Boosting Low-Rank Adaptation with Expert Collaboration and Competition

Link: https://arxiv.org/abs/2408.09856
Authors: Tianwei Lin, Jiang Liu, Wenqiao Zhang, Zhaocheng Li, Yang Dai, Haoyuan Li, Zhelun Yu, Wanggui He, Juncheng Li, Hao Jiang, Siliang Tang, Yueting Zhuang
Keywords: effectively addressed GPU, addressed GPU memory, GPU memory constraints, addressed GPU, GPU memory
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:While Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA have effectively addressed GPU memory constraints during fine-tuning, their performance often falls short, especially in multidimensional task scenarios. To address this issue, one straightforward solution is to introduce task-specific LoRA modules as domain experts, leveraging the modeling of multiple experts’ capabilities and thus enhancing the general capability of multi-task learning. Despite promising, these additional components often add complexity to the training and inference process, contravening the efficient characterization of PEFT designed for. Considering this, we introduce an innovative PEFT method, TeamLoRA, consisting of a collaboration and competition module for experts, and thus achieving the right balance of effectiveness and efficiency: (i) For collaboration, a novel knowledge-sharing and -organizing mechanism is devised to appropriately reduce the scale of matrix operations, thereby boosting the training and inference speed. (ii) For competition, we propose leveraging a game-theoretic interaction mechanism for experts, encouraging experts to transfer their domain-specific knowledge while facing diverse downstream tasks, and thus enhancing the performance. By doing so, TeamLoRA elegantly connects the experts as a “Team” with internal collaboration and competition, enabling a faster and more accurate PEFT paradigm for multi-task learning. To validate the superiority of TeamLoRA, we curate a comprehensive multi-task evaluation(CME) benchmark to thoroughly assess the capability of multi-task learning. Experiments conducted on our CME and other benchmarks indicate the effectiveness and efficiency of TeamLoRA. Our project is available at this https URL.
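
For readers unfamiliar with the baseline this entry builds on, here is a small sketch of plain LoRA, not of TeamLoRA's collaboration and competition modules, which the abstract does not specify in enough detail to reproduce. It uses the standard formulation W' = W + (alpha / r) * B A with list-of-lists matrices; the helper names are illustrative only.

```python
def lora_trainable_params(d_out, d_in, rank):
    """Parameters in the two low-rank factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


def matmul(X, Y):
    """Multiply two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def lora_weight(W, A, B, alpha):
    """Return the adapted weight W + (alpha / rank) * B @ A."""
    rank = len(A)  # A has shape (rank x d_in)
    scale = alpha / rank
    delta = matmul(B, A)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]
```

For a 4096 x 4096 layer with rank 8, `lora_trainable_params(4096, 4096, 8)` gives 65,536 trainable parameters against 16,777,216 in the full matrix, which is the memory saving the abstract refers to.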

[NLP-19] Self-Directed Turing Test for Large Language Models

Link: https://arxiv.org/abs/2408.09853
Authors: Weiqi Wu, Hongqiu Wu, Hai Zhao
Keywords: Turing test examines, exhibit human-like behaviour, Traditional Turing tests, Turing tests adopt, Turing test
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:The Turing test examines whether AIs can exhibit human-like behaviour in natural language conversations. Traditional Turing tests adopt a rigid dialogue format where each participant sends only one message each time and require continuous human involvement to direct the entire interaction with the test subject. This fails to reflect a natural conversational style and hinders the evaluation of Large Language Models (LLMs) in complex and prolonged dialogues. This paper proposes the Self-Directed Turing Test, which extends the original test with a burst dialogue format, allowing more dynamic exchanges by multiple consecutive messages. It further efficiently reduces human workload by having the LLM self-direct the majority of the test process, iteratively generating dialogues that simulate its interaction with humans. With the pseudo-dialogue history, the model then engages in a shorter dialogue with a human, which is paired with a human-human conversation on the same topic to be judged using questionnaires. We introduce the X-Turn Pass-Rate metric to assess the human likeness of LLMs across varying durations. While LLMs like GPT-4 initially perform well, achieving pass rates of 51.9% and 38.9% during 3 turns and 10 turns of dialogues respectively, their performance drops as the dialogue progresses, which underscores the difficulty in maintaining consistency in the long term.
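
The abstract names the X-Turn Pass-Rate metric without defining it. One plausible reading, stated here as an assumption, is the fraction of judged dialogues of length X turns in which the model passes the judges' questionnaire. A minimal sketch under that assumption, with a hypothetical `(turns, passed)` input format:

```python
from collections import defaultdict


def x_turn_pass_rate(judgements):
    """Compute the pass rate per dialogue length.

    judgements: iterable of (turns, passed) pairs, one per judged dialogue.
    Returns {turns: fraction of dialogues of that length judged as passing}.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for turns, passed in judgements:
        totals[turns] += 1
        if passed:
            passes[turns] += 1
    return {turns: passes[turns] / totals[turns] for turns in sorted(totals)}
```

Reporting the rate separately for each X is what lets a drop over longer dialogues, such as the one the abstract describes between 3 and 10 turns, show up.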

[NLP-20] Importance Weighting Can Help Large Language Models Self-Improve

Link: https://arxiv.org/abs/2408.09849
Authors: Chunyang Jiang, Chi-min Chan, Wei Xue, Qifeng Liu, Yike Guo
Keywords: shown remarkable capability, Large language models, Large language, tasks and applications, LLM self-improvement
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Large language models (LLMs) have shown remarkable capability in numerous tasks and applications. However, fine-tuning LLMs using high-quality datasets under external supervision remains prohibitively expensive. In response, LLM self-improvement approaches have been vibrantly developed recently. The typical paradigm of LLM self-improvement involves training LLM on self-generated data, part of which may be detrimental and should be filtered out due to the unstable data quality. While current works primarily employs filtering strategies based on answer correctness, in this paper, we demonstrate that filtering out correct but with high distribution shift extent (DSE) samples could also benefit the results of self-improvement. Given that the actual sample distribution is usually inaccessible, we propose a new metric called DS weight to approximate DSE, inspired by the Importance Weighting methods. Consequently, we integrate DS weight with self-consistency to comprehensively filter the self-generated samples and fine-tune the language model. Experiments show that with only a tiny valid set (up to 5% size of the training set) to compute DS weight, our approach can notably promote the reasoning ability of current LLM self-improvement methods. The resulting performance is on par with methods that rely on external supervision from pre-trained reward models.
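
The two-part filter described above can be sketched as follows. Self-consistency is implemented as ordinary majority voting over sampled answers. The DS weight itself is not defined in the abstract, so the sketch takes a precomputed `shift_weight` per sample as given; the dictionary schema and the `keep_fraction` parameter are assumptions for illustration.

```python
from collections import Counter


def majority_answer(sampled_answers):
    """Self-consistency: return the most frequent answer and its vote share."""
    counts = Counter(sampled_answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(sampled_answers)


def filter_samples(samples, keep_fraction):
    """Keep self-consistent samples with the lowest distribution-shift weight.

    samples: list of dicts with 'answer', 'majority', and 'shift_weight' keys
    (a hypothetical schema; 'shift_weight' stands in for the paper's DS weight).
    """
    consistent = [s for s in samples if s["answer"] == s["majority"]]
    consistent.sort(key=lambda s: s["shift_weight"])
    n_keep = int(len(consistent) * keep_fraction)
    return consistent[:n_keep]
```

The surviving samples would then be used for fine-tuning; how the weight is computed from the small validation set is left to the paper.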

[NLP-21] Continual Dialogue State Tracking via Reason-of-Select Distillation ACL2024

Link: https://arxiv.org/abs/2408.09846
Authors: Yujie Feng, Bo Liu, Xiaoyu Dong, Zexin Lu, Li-Ming Zhan, Xiao-Ming Wu, Albert Y.S. Lam
Keywords: requires continuous skill, continuous skill acquisition, system requires continuous, Dialogue State Tracking, retaining prior knowledge
Categories: Computation and Language (cs.CL)
Comments: Accepted to ACL 2024 Findings

Abstract:An ideal dialogue system requires continuous skill acquisition and adaptation to new tasks while retaining prior knowledge. Dialogue State Tracking (DST), vital in these systems, often involves learning new services and confronting catastrophic forgetting, along with a critical capability loss termed the “Value Selection Quandary.” To address these challenges, we introduce the Reason-of-Select (RoS) distillation method by enhancing smaller models with a novel ‘meta-reasoning’ capability. Meta-reasoning employs an enhanced multi-domain perspective, combining fragments of meta-knowledge from domain-specific dialogues during continual learning. This transcends traditional single-perspective reasoning. The domain bootstrapping process enhances the model’s ability to dissect intricate dialogues from multiple possible values. Its domain-agnostic property aligns data distribution across different domains, effectively mitigating forgetting. Additionally, two novel improvements, “multi-value resolution” strategy and Semantic Contrastive Reasoning Selection method, significantly enhance RoS by generating DST-specific selection chains and mitigating hallucinations in teachers’ reasoning, ensuring effective and reliable knowledge transfer. Extensive experiments validate the exceptional performance and robust generalization capabilities of our method. The source code is provided for reproducibility.

[NLP-22] CMoralEval: A Moral Evaluation Benchmark for Chinese Large Language Models ACL2024

Link: https://arxiv.org/abs/2408.09819
Authors: Linhao Yu, Yongqi Leng, Yufei Huang, Shang Wu, Haixin Liu, Xinmeng Ji, Jiahui Zhao, Jinwang Song, Tingting Cui, Xiaoqing Cheng, Tao Liu, Deyi Xiong
Keywords: ethically relevant context, large language model, Chinese LLMs, Chinese, language model
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by ACL 2024 (Findings)

Abstract:What a large language model (LLM) would respond in ethically relevant context? In this paper, we curate a large benchmark CMoralEval for morality evaluation of Chinese LLMs. The data sources of CMoralEval are two-fold: 1) a Chinese TV program discussing Chinese moral norms with stories from the society and 2) a collection of Chinese moral anomies from various newspapers and academic papers on morality. With these sources, we aim to create a moral evaluation dataset characterized by diversity and authenticity. We develop a morality taxonomy and a set of fundamental moral principles that are not only rooted in traditional Chinese culture but also consistent with contemporary societal norms. To facilitate efficient construction and annotation of instances in CMoralEval, we establish a platform with AI-assisted instance generation to streamline the annotation process. These help us curate CMoralEval that encompasses both explicit moral scenarios (14,964 instances) and moral dilemma scenarios (15,424 instances), each with instances from different data sources. We conduct extensive experiments with CMoralEval to examine a variety of Chinese LLMs. Experiment results demonstrate that CMoralEval is a challenging benchmark for Chinese LLMs. The dataset is publicly available at this https URL.

[NLP-23] AutoML-guided Fusion of Entity and LLM-based representations

Link: https://arxiv.org/abs/2408.09794
Authors: Boshko Koloski, Senja Pollak, Roberto Navigli, Blaž Škrlj
Keywords: Large semantic knowledge, Large Language Model, grounded in factual, semantic knowledge bases, Large semantic
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Large semantic knowledge bases are grounded in factual knowledge. However, recent approaches to dense text representations (embeddings) do not efficiently exploit these resources. Dense and robust representations of documents are essential for effectively solving downstream classification and retrieval tasks. This work demonstrates that injecting embedded information from knowledge bases can augment the performance of contemporary Large Language Model (LLM)-based representations for the task of text classification. Further, by considering automated machine learning (AutoML) with the fused representation space, we demonstrate it is possible to improve classification accuracy even if we use low-dimensional projections of the original representation space obtained via efficient matrix factorization. This result shows that significantly faster classifiers can be achieved with minimal or no loss in predictive performance, as demonstrated using five strong LLM baselines on six diverse real-life datasets.

[NLP-24] Anim-Director: A Large Multimodal Model Powered Agent for Controllable Animation Video Generation SIGGRAPH

Link: https://arxiv.org/abs/2408.09787
Authors: Yunxin Li, Haoyuan Shi, Baotian Hu, Longyue Wang, Jiashun Zhu, Jinyi Xu, Zhen Zhao, Min Zhang
Keywords: high training costs, incurs high training, sophisticated multi-stage pipeline, demands substantial human, substantial human effort
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: Accepted by SIGGRAPH Asia 2024, Project and Codes: this https URL

Abstract:Traditional animation generation methods depend on training generative models with human-labelled data, entailing a sophisticated multi-stage pipeline that demands substantial human effort and incurs high training costs. Due to limited prompting plans, these methods typically produce brief, information-poor, and context-incoherent animations. To overcome these limitations and automate the animation process, we pioneer the introduction of large multimodal models (LMMs) as the core processor to build an autonomous animation-making agent, named Anim-Director. This agent mainly harnesses the advanced understanding and reasoning capabilities of LMMs and generative AI tools to create animated videos from concise narratives or simple instructions. Specifically, it operates in three main stages: Firstly, the Anim-Director generates a coherent storyline from user inputs, followed by a detailed director’s script that encompasses settings of character profiles and interior/exterior descriptions, and context-coherent scene descriptions that include appearing characters, interiors or exteriors, and scene events. Secondly, we employ LMMs with the image generation tool to produce visual images of settings and scenes. These images are designed to maintain visual consistency across different scenes using a visual-language prompting method that combines scene descriptions and images of the appearing character and setting. Thirdly, scene images serve as the foundation for producing animated videos, with LMMs generating prompts to guide this process. The whole process is notably autonomous without manual intervention, as the LMMs interact seamlessly with generative tools to generate prompts, evaluate visual quality, and select the best one to optimize the final output.

[NLP-25] GoNoGo: An Efficient LLM-based Multi-Agent System for Streamlining Automotive Software Release Decision-Making

Link: https://arxiv.org/abs/2408.09785
Authors: Arsham Gholamzadeh Khoee, Yinan Yu, Robert Feldt, Andris Freimanis, Patrick Andersson, Dhasarathy Parthasarathy
Keywords: industry typically rely, software test data, Traditional methods, tabular software test, automotive industry typically
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)

Abstract:Traditional methods for making software deployment decisions in the automotive industry typically rely on manual analysis of tabular software test data. These methods often lead to higher costs and delays in the software release cycle due to their labor-intensive nature. Large Language Models (LLMs) present a promising solution to these challenges. However, their application generally demands multiple rounds of human-driven prompt engineering, which limits their practical deployment, particularly for industrial end-users who need reliable and efficient results. In this paper, we propose GoNoGo, an LLM agent system designed to streamline automotive software deployment while meeting both functional requirements and practical industrial constraints. Unlike previous systems, GoNoGo is specifically tailored to address domain-specific and risk-sensitive systems. We evaluate GoNoGo’s performance across different task difficulties using zero-shot and few-shot examples taken from industrial practice. Our results show that GoNoGo achieves a 100% success rate for tasks up to Level 2 difficulty with 3-shot examples, and maintains high performance even for more complex tasks. We find that GoNoGo effectively automates decision-making for simpler tasks, significantly reducing the need for manual intervention. In summary, GoNoGo represents an efficient and user-friendly LLM-based solution currently employed in our industrial partner’s company to assist with software release decision-making, supporting more informed and timely decisions in the release process for risk-sensitive vehicle systems.

[NLP-26] Summarizing long regulatory documents with a multi-step pipeline

Link: https://arxiv.org/abs/2408.09777
Authors: Mika Sie, Ruby Beek, Michiel Bots, Sjaak Brinkkemper, Albert Gatt
Keywords: long regulatory texts, challenging to summarize, long regulatory, Due, regulatory texts
Categories: Computation and Language (cs.CL)
Comments: Under review

Abstract:Due to their length and complexity, long regulatory texts are challenging to summarize. To address this, a multi-step extractive-abstractive architecture is proposed to handle lengthy regulatory documents more effectively. In this paper, we show that the effectiveness of a two-step architecture for summarizing long regulatory texts varies significantly depending on the model used. Specifically, the two-step architecture improves the performance of decoder-only models. For abstractive encoder-decoder models with short context lengths, the effectiveness of an extractive step varies, whereas for long-context encoder-decoder models, the extractive step worsens their performance. This research also highlights the challenges of evaluating generated texts, as evidenced by the differing results from human and automated evaluations. Most notably, human evaluations favoured language models pretrained on legal text, while automated metrics rank general-purpose language models higher. The results underscore the importance of selecting the appropriate summarization strategy based on model architecture and context length.
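
The two-step extract-then-abstract idea can be sketched as below. The extractive scorer here is a simple word-frequency heuristic chosen for illustration; the abstract does not state which extractive method the authors use, and `abstractive_model` is a hypothetical callable standing in for whatever summarization model is plugged in.

```python
import re
from collections import Counter


def extract_key_sentences(text, max_words):
    """Extractive step: keep the highest-scoring sentences that fit a word budget."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"\w+", text.lower()))

    def score(sentence):
        words = re.findall(r"\w+", sentence.lower())
        return sum(freq[w] for w in words) / max(len(words), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen, used = [], 0
    for i in ranked:
        n_words = len(sentences[i].split())
        if used + n_words <= max_words:
            chosen.append(i)
            used += n_words
    # Restore document order so the shortened text still reads naturally.
    return " ".join(sentences[i] for i in sorted(chosen))


def summarize(text, abstractive_model, max_words=200):
    """Abstractive step: pass the shortened text to the summarization model."""
    return abstractive_model(extract_key_sentences(text, max_words))
```

The word budget is what lets a short-context model see a long document. The abstract's finding is that this helps decoder-only models and can hurt long-context encoder-decoder models, so the extractive step is best treated as optional.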

[NLP-27] Are Large Language Models More Honest in Their Probabilistic or Verbalized Confidence?

Link: https://arxiv.org/abs/2408.09773
Authors: Shiyu Ni, Keping Bi, Lulu Yu, Jiafeng Guo
Keywords: Large language models, knowledge boundaries, Large language, found to produce, produce hallucinations
Categories: Computation and Language (cs.CL)

Abstract:Large language models (LLMs) have been found to produce hallucinations when the question exceeds their internal knowledge boundaries. A reliable model should have a clear perception of its knowledge boundaries, providing correct answers within its scope and refusing to answer when it lacks knowledge. Existing research on LLMs’ perception of their knowledge boundaries typically uses either the probability of the generated tokens or the verbalized confidence as the model’s confidence in its response. However, these studies overlook the differences and connections between the two. In this paper, we conduct a comprehensive analysis and comparison of LLMs’ probabilistic perception and verbalized perception of their factual knowledge boundaries. First, we investigate the pros and cons of these two perceptions. Then, we study how they change under questions of varying frequencies. Finally, we measure the correlation between LLMs’ probabilistic confidence and verbalized confidence. Experimental results show that 1) LLMs’ probabilistic perception is generally more accurate than verbalized perception but requires an in-domain validation set to adjust the confidence threshold. 2) Both perceptions perform better on less frequent questions. 3) It is challenging for LLMs to accurately express their internal confidence in natural language.
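
The first finding says probabilistic confidence needs an in-domain validation set to tune a threshold. A minimal sketch of that step follows. Aggregating token probabilities as a length-normalised geometric mean is an assumption here, since the abstract does not say how token probabilities are combined, and the selection criterion (count of right answer-or-abstain decisions) is likewise illustrative.

```python
import math


def sequence_confidence(token_logprobs):
    """Length-normalised confidence: exp of the mean token log-probability."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def pick_threshold(validation):
    """Pick the confidence threshold with the most right decisions.

    validation: list of (confidence, is_correct) pairs from a held-out set.
    A decision is right when the model answers (confidence >= t) and is
    correct, or abstains (confidence < t) and would have been wrong.
    """
    best_t, best_score = 0.0, -1
    for t in sorted({conf for conf, _ in validation}):
        score = sum((conf >= t) == ok for conf, ok in validation)
        if score > best_score:
            best_t, best_score = t, score
    return best_t
```

At inference time the model would answer only when `sequence_confidence(...)` reaches the chosen threshold and refuse otherwise.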

[NLP-28] Strategic Demonstration Selection for Improved Fairness in LLM In-Context Learning

Link: https://arxiv.org/abs/2408.09757
Authors: Jingyu Hu, Weiru Liu, Mengnan Du
Keywords: Recent studies highlight, large language models, steer large language, Recent studies, processing tabular data
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computers and Society (cs.CY)

Abstract:Recent studies highlight the effectiveness of using in-context learning (ICL) to steer large language models (LLMs) in processing tabular data, a challenging task given the structured nature of such data. Despite advancements in performance, the fairness implications of these methods are less understood. This study investigates how varying demonstrations within ICL prompts influence the fairness outcomes of LLMs. Our findings reveal that deliberately including minority group samples in prompts significantly boosts fairness without sacrificing predictive accuracy. Further experiments demonstrate that the proportion of minority to majority samples in demonstrations affects the trade-off between fairness and prediction accuracy. Based on these insights, we introduce a mitigation technique that employs clustering and evolutionary strategies to curate a diverse and representative sample set from the training data. This approach aims to enhance both predictive performance and fairness in ICL applications. Experimental results validate that our proposed method dramatically improves fairness across various metrics, showing its efficacy in real-world scenarios.
摘要:最近的研究强调了使用上下文学习(ICL)引导大型语言模型(LLM)处理表格数据的有效性;考虑到此类数据的结构化性质,这是一项具有挑战性的任务。尽管性能有所进步,但人们对这些方法的公平性影响了解较少。本研究考察了ICL提示中不同的示例如何影响LLM的公平性结果。我们的发现表明,有意在提示中包含少数群体样本可以在不牺牲预测准确性的情况下显著提高公平性。进一步的实验表明,示例中少数群体样本与多数群体样本的比例会影响公平性和预测准确性之间的权衡。基于这些见解,我们引入了一种缓解技术,使用聚类和进化策略从训练数据中挑选出多样化且具有代表性的样本集。该方法旨在同时提高ICL应用中的预测性能和公平性。实验结果表明,我们提出的方法在多种指标上显著提高了公平性,显示了其在实际场景中的有效性。

[NLP-29] R2GenCSR: Retrieving Context Samples for Large Language Model based X-ray Medical Report Generation
[NLP-29] R2GenCSR:为基于大型语言模型的X射线医疗报告生成检索上下文样本

链接: https://arxiv.org/abs/2408.09743
作者: Xiao Wang,Yuehang Li,Fuling Wang,Shiao Wang,Chuanfu Li,Bo Jiang
关键词-EN: Large Language Models, leverage large models, Large Language, generation methods attempt, existing X-ray medical
关键词-ZH: 大型语言模型、利用大型模型、大型语言、生成方法尝试、现有X射线医疗
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: In Peer Review

Abstract:Inspired by the tremendous success of Large Language Models (LLMs), existing X-ray medical report generation methods attempt to leverage large models to achieve better performance. They usually adopt a Transformer to extract the visual features of a given X-ray image, and then feed them into the LLM for text generation. How to extract more effective information for the LLMs to help them improve final results is an urgent problem that needs to be solved. Additionally, the use of visual Transformer models also brings high computational complexity. To address these issues, this paper proposes a novel context-guided efficient X-ray medical report generation framework. Specifically, we introduce the Mamba as the vision backbone with linear complexity, and the performance obtained is comparable to that of the strong Transformer model. More importantly, we perform context retrieval from the training set for samples within each mini-batch during the training phase, utilizing both positively and negatively related samples to enhance feature representation and discriminative learning. Subsequently, we feed the vision tokens, context information, and prompt statements to invoke the LLM for generating high-quality medical reports. Extensive experiments on three X-ray report generation datasets (i.e., IU-Xray, MIMIC-CXR, CheXpert Plus) fully validated the effectiveness of our proposed model. The source code of this work will be released on this https URL.
摘要:受大型语言模型(LLM)巨大成功的启发,现有的X射线医学报告生成方法试图利用大模型来获得更好的性能。它们通常采用Transformer提取给定X射线图像的视觉特征,然后将其送入LLM进行文本生成。如何为LLM提取更有效的信息以帮助其改进最终结果,是一个亟待解决的问题。此外,视觉Transformer模型的使用也带来了很高的计算复杂度。针对这些问题,本文提出了一种新颖的上下文引导的高效X射线医学报告生成框架。具体地说,我们引入具有线性复杂度的Mamba作为视觉主干,所获得的性能与强大的Transformer模型相当。更重要的是,我们在训练阶段为每个小批次内的样本从训练集中执行上下文检索,利用正相关和负相关样本来增强特征表示和判别性学习。随后,我们将视觉词元、上下文信息和提示语句一并输入LLM,以生成高质量的医学报告。在三个X射线报告生成数据集(即IU-Xray、MIMIC-CXR、CheXpert Plus)上的大量实验充分验证了所提模型的有效性。本工作的源代码将在此https URL上发布。

[NLP-30] Paired Completion: Flexible Quantification of Issue-framing at Scale with LLMs
[NLP-30] 配对补全:利用LLM大规模灵活量化议题框架

链接: https://arxiv.org/abs/2408.09742
作者: Simon D Angus,Lachlan O’Neill
关键词-EN: Detecting and quantifying, quantifying issue framing, textual discourse, climate science, science vs. denialism
关键词-ZH: 检测和量化、量化问题框架、文本话语、气候科学、科学与否认主义
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); General Economics (econ.GN)
备注: 9 pages, 4 figures

Abstract:Detecting and quantifying issue framing in textual discourse - the perspective one takes to a given topic (e.g. climate science vs. denialism, misogyny vs. gender equality) - is highly valuable to a range of end-users from social and political scientists to program evaluators and policy analysts. However, conceptual framing is notoriously challenging for automated natural language processing (NLP) methods since the words and phrases used by either 'side' of an issue are often held in common, with only subtle stylistic flourishes separating their use. Here we develop and rigorously evaluate new detection methods for issue framing and narrative analysis within large text datasets. By introducing a novel application of next-token log probabilities derived from generative large language models (LLMs) we show that issue framing can be reliably and efficiently detected in large corpora with only a few examples of either perspective on a given issue, a method we call 'paired completion'. Through 192 independent experiments over three novel, synthetic datasets, we evaluate paired completion against prompt-based LLM methods and labelled methods using traditional NLP and recent LLM contextual embeddings. We additionally conduct a cost-based analysis to mark out the feasible set of performant methods at production-level scales, and a model bias analysis. Together, our work demonstrates a feasible path to scalable, accurate and low-bias issue-framing in large corpora.
摘要:检测和量化文本话语中的议题框架(即人们对给定话题所持的视角,例如气候科学与否定论、厌女与性别平等),对于从社会和政治科学家到项目评估员和政策分析师的一系列最终用户都非常有价值。然而,概念框架对自动自然语言处理(NLP)方法来说是出了名的困难,因为议题任何一“方”使用的单词和短语往往是相同的,只有细微的文体差别将其用法区分开来。在这里,我们开发并严格评估了用于大型文本数据集中议题框架和叙事分析的新检测方法。通过引入生成式大型语言模型(LLM)的下一词元对数概率的一种新应用,我们表明,只需给定议题每种视角的少量示例,就可以在大型语料库中可靠而高效地检测议题框架,我们称这种方法为“配对补全”。通过在三个新的合成数据集上进行的192个独立实验,我们将配对补全与基于提示的LLM方法,以及使用传统NLP和近期LLM上下文嵌入的有标注方法进行了比较评估。此外,我们还进行了基于成本的分析,以划定在生产级规模下可行的高性能方法集合,并进行了模型偏差分析。总之,我们的工作展示了一条在大型语料库中实现可扩展、准确且低偏差的议题框架分析的可行路径。

[NLP-31] Pedestrian Attribute Recognition: A New Benchmark Dataset and A Large Language Model Augmented Framework
[NLP-31] 行人属性识别:新的基准数据集和大型语言模型增强框架

链接: https://arxiv.org/abs/2408.09720
作者: Jiandong Jin,Xiao Wang,Qian Zhu,Haiyang Wang,Chenglong Li
关键词-EN: Pedestrian Attribute Recognition, human-centered research, indispensable tasks, tasks in human-centered, Attribute Recognition
关键词-ZH: 行人属性识别,以人为本的研究,不可或缺的任务,以人为本的任务,属性识别
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: MSP60K PAR Benchmark Dataset, LLM based PAR model, In Peer Review

Abstract:Pedestrian Attribute Recognition (PAR) is one of the indispensable tasks in human-centered research. However, existing datasets neglect different domains (e.g., environments, times, populations, and data sources), only conducting simple random splits, and the performance of these datasets has already approached saturation. In the past five years, no large-scale dataset has been opened to the public. To address this issue, this paper proposes a new large-scale, cross-domain pedestrian attribute recognition dataset to fill the data gap, termed MSP60K. It consists of 60,122 images and 57 attribute annotations across eight scenarios. Synthetic degradation is also conducted to further narrow the gap between the dataset and real-world challenging scenarios. To establish a more rigorous benchmark, we evaluate 17 representative PAR models under both random and cross-domain split protocols on our dataset. Additionally, we propose an innovative Large Language Model (LLM) augmented PAR framework, named LLM-PAR. This framework processes pedestrian images through a Vision Transformer (ViT) backbone to extract features and introduces a multi-embedding query Transformer to learn partial-aware features for attribute classification. Significantly, we enhance this framework with LLM for ensemble learning and visual feature augmentation. Comprehensive experiments across multiple PAR benchmark datasets have thoroughly validated the efficacy of our proposed framework. The dataset and source code accompanying this paper will be made publicly available at this https URL.
摘要:行人属性识别(PAR)是以人为中心的研究中不可或缺的任务之一。然而,现有数据集忽略了不同的域(如环境、时间、人群和数据来源),只进行简单的随机划分,并且这些数据集上的性能已经接近饱和。在过去五年中,没有大规模数据集向公众开放。针对这一问题,本文提出了一个新的大规模跨域行人属性识别数据集来填补数据空白,称为MSP60K。它由8个场景中的60,122张图像和57个属性标注组成。我们还进行了合成退化,以进一步缩小数据集与真实世界挑战性场景之间的差距。为了建立更严格的基准,我们在该数据集上评估了17个具有代表性的PAR模型在随机划分和跨域划分协议下的性能。此外,我们还提出了一个创新的大型语言模型(LLM)增强的PAR框架,称为LLM-PAR。该框架通过Vision Transformer(ViT)主干处理行人图像以提取特征,并引入多嵌入查询Transformer来学习用于属性分类的局部感知特征。值得注意的是,我们使用LLM增强了这一框架,用于集成学习和视觉特征增强。在多个PAR基准数据集上的综合实验充分验证了所提框架的有效性。本文附带的数据集和源代码将在此https URL上公开提供。

[NLP-32] SEMDR: A Semantic-Aware Dual Encoder Model for Legal Judgment Prediction with Legal Clue Tracing
[NLP-32] SEMDR:基于法律线索追踪的法律判决预测语义感知双编码器模型

链接: https://arxiv.org/abs/2408.09717
作者: Pengjie Liu,Wang Zhang,Yulong Ding,Xuefeng Zhang,Shuang-Hua Yang
关键词-EN: legal clue tracing, form legal judgments, legal judgments based, criminal facts, requires LJP models
关键词-ZH: 法律线索追踪、形成法律判决、基于法律判决、犯罪事实,需要LJP模型
类目: Computation and Language (cs.CL)
备注:

Abstract:Legal Judgment Prediction (LJP) aims to form legal judgments based on the criminal fact description. However, researchers struggle to classify confusing criminal cases, such as robbery and theft, which requires LJP models to distinguish the nuances between similar crimes. Existing methods usually design handcrafted features to pick up necessary semantic legal clues to make more accurate legal judgment predictions. In this paper, we propose a Semantic-Aware Dual Encoder Model (SEMDR), which designs a novel legal clue tracing mechanism to conduct fine-grained semantic reasoning between criminal facts and instruments. Our legal clue tracing mechanism is built from three reasoning levels: 1) Lexicon-Tracing, which aims to extract criminal facts from criminal descriptions; 2) Sentence Representation Learning, which contrastively trains language models to better represent confusing criminal facts; 3) Multi-Fact Reasoning, which builds a reasons graph to propagate semantic clues among fact nodes to capture the subtle difference among criminal facts. Our legal clue tracing mechanism helps SEMDR achieve state-of-the-art on the CAIL2018 dataset and shows its advance in few-shot scenarios. Our experiments show that SEMDR has a strong ability to learn more uniform and distinguished representations for criminal facts, which helps to make more accurate predictions on confusing criminal cases and reduces the model uncertainty during making judgments. All codes will be released via GitHub.
摘要:法律判决预测(LJP)旨在根据犯罪事实描述形成法律判决。然而,研究人员难以对易混淆的刑事案件(例如抢劫和盗窃)进行分类,这要求LJP模型能够区分相似罪名之间的细微差别。现有方法通常设计手工特征来提取必要的语义法律线索,以做出更准确的法律判决预测。本文提出了一种语义感知双编码器模型(SEMDR),该模型设计了一种新颖的法律线索追踪机制,在犯罪事实和工具之间进行细粒度的语义推理。我们的法律线索追踪机制建立在三个推理层次上:1)词汇追踪,旨在从犯罪描述中提取犯罪事实;2)句子表示学习,通过对比学习训练语言模型,以更好地表示易混淆的犯罪事实;3)多事实推理,构建原因图,在事实节点之间传播语义线索,以捕捉犯罪事实之间的细微差异。我们的法律线索追踪机制帮助SEMDR在CAIL2018数据集上达到了最先进的水平,并在少样本场景下展示了其优势。实验表明,SEMDR具有很强的能力为犯罪事实学习更均匀、更具区分度的表示,这有助于对易混淆的刑事案件做出更准确的预测,并降低模型在判决过程中的不确定性。所有代码都将通过GitHub发布。

[NLP-33] Bridging the Language Gap: Enhancing Multilingual Prompt-Based Code Generation in LLMs via Zero-Shot Cross-Lingual Transfer
[NLP-33] 弥合语言差距:通过零样本跨语言迁移增强LLM中基于提示的多语言代码生成

链接: https://arxiv.org/abs/2408.09701
作者: Mingda Li,Abhijit Mishra,Utkarsh Mujumdar
关键词-EN: Large Language Models, challenge global inclusivity, prompts challenge global, gained substantial attention, Language Models
关键词-ZH: 大型语言模型,挑战全球包容性,引发全球挑战,获得大量关注,语言模型
类目: Computation and Language (cs.CL)
备注: Under Review

Abstract:The use of Large Language Models (LLMs) for program code generation has gained substantial attention, but their biases and limitations with non-English prompts challenge global inclusivity. This paper investigates the complexities of multilingual prompt-based code generation. Our evaluations of LLMs, including CodeLLaMa and CodeGemma, reveal significant disparities in code quality for non-English prompts; we also demonstrate the inadequacy of simple approaches like prompt translation, bootstrapped data augmentation, and fine-tuning. To address this, we propose a zero-shot cross-lingual approach using a neural projection technique, integrating a cross-lingual encoder like LASER (Artetxe and Schwenk, 2019) to map multilingual embeddings from it into the LLM’s token space. This method requires training only on English data and scales effectively to other languages. Results on a translated and quality-checked MBPP dataset show substantial improvements in code quality. This research promotes a more inclusive code generation landscape by empowering LLMs with multilingual capabilities to support the diverse linguistic spectrum in programming.
摘要:使用大型语言模型(LLM)生成程序代码已经引起了广泛关注,但它们在非英语提示上的偏差和局限性对全球包容性构成了挑战。本文研究了基于多语言提示的代码生成的复杂性。我们对包括CodeLLaMa和CodeGemma在内的LLM的评估显示,非英语提示下的代码质量存在显著差距;我们还证明了提示翻译、自举式数据增强和微调等简单方法的不足。为了解决这一问题,我们提出了一种使用神经投影技术的零样本跨语言方法,集成LASER(Artetxe and Schwenk, 2019)这样的跨语言编码器,将其多语言嵌入映射到LLM的词元空间中。这种方法只需要在英语数据上训练,并能有效地扩展到其他语言。在经过翻译和质量检查的MBPP数据集上的结果显示,代码质量有了实质性的改善。这项研究通过赋予LLM多语言能力来支持编程中多样的语言谱系,从而促进了更具包容性的代码生成环境。

[NLP-34] Recording for Eyes Not Echoing to Ears: Contextualized Spoken-to-Written Conversion of ASR Transcripts
[NLP-34] 为眼睛记录,而非为耳朵回响:ASR转写文本的语境化口语到书面语转换

链接: https://arxiv.org/abs/2408.09688
作者: Jiaqing Liu,Chong Deng,Qinglin Zhang,Qian Chen,Hai Yu,Wen Wang
关键词-EN: Automatic Speech Recognition, exhibit recognition errors, Automatic Speech, transcripts exhibit recognition, Speech Recognition
关键词-ZH: 自动语音识别、展品识别错误、自动语音、笔录展品识别、语音识别
类目: Computation and Language (cs.CL)
备注: 7 pages, 3 figures

Abstract:Automatic Speech Recognition (ASR) transcripts exhibit recognition errors and various spoken language phenomena such as disfluencies, ungrammatical sentences, and incomplete sentences, hence suffering from poor readability. To improve readability, we propose a Contextualized Spoken-to-Written conversion (CoS2W) task to address ASR and grammar errors and also transfer the informal text into the formal style with content preserved, utilizing contexts and auxiliary information. This task naturally matches the in-context learning capabilities of Large Language Models (LLMs). To facilitate comprehensive comparisons of various LLMs, we construct a document-level Spoken-to-Written conversion of ASR Transcripts Benchmark (SWAB) dataset. Using SWAB, we study the impact of different granularity levels on the CoS2W performance, and propose methods to exploit contexts and auxiliary information to enhance the outputs. Experimental results reveal that LLMs have the potential to excel in the CoS2W task, particularly in grammaticality and formality; our methods achieve effective understanding of contexts and auxiliary information by LLMs. We further investigate the effectiveness of using LLMs as evaluators and find that LLM evaluators show strong correlations with human evaluations on rankings of faithfulness and formality, which validates the reliability of LLM evaluators for the CoS2W task.
摘要:自动语音识别(ASR)转写文本存在识别错误以及各种口语现象,如不流利、不合语法的句子和不完整的句子,因此可读性较差。为了提高可读性,我们提出了语境化口语到书面语转换(CoS2W)任务,利用上下文和辅助信息来纠正ASR错误和语法错误,并在保留内容的前提下将非正式文本转换为正式文体。这项任务与大型语言模型(LLM)的上下文学习能力自然契合。为了便于全面比较各种LLM,我们构建了一个文档级的ASR转写文本口语到书面语转换基准(SWAB)数据集。利用SWAB,我们研究了不同粒度级别对CoS2W性能的影响,并提出了利用上下文和辅助信息来改进输出的方法。实验结果表明,LLM有潜力在CoS2W任务中表现出色,特别是在语法性和正式性方面;我们的方法使LLM能够有效理解上下文和辅助信息。我们进一步考察了使用LLM作为评估者的有效性,发现LLM评估者在忠实度和正式性的排序上与人工评估显示出很强的相关性,这验证了LLM评估者在CoS2W任务中的可靠性。

[NLP-35] BLADE: Benchmarking Language Model Agents for Data-Driven Science
[NLP-35] BLADE:面向数据驱动科学的语言模型智能体基准

链接: https://arxiv.org/abs/2408.09667
作者: Ken Gu,Ruoxi Shang,Ruien Jiang,Keying Kuang,Richard-John Lin,Donghe Lyu,Yue Mao,Youran Pan,Teng Wu,Jiaqian Yu,Yikun Zhang,Tianmai M. Zhang,Lanyi Zhu,Mike A. Merrill,Jeffrey Heer,Tim Althoff
关键词-EN: scientific discovery requires, make nuanced analytical, Data-driven scientific discovery, statistical expertise, discovery requires
关键词-ZH: 科学发现需要,进行细致入微的分析,数据驱动的科学发现,统计专业知识,发现需要
类目: Computation and Language (cs.CL)
备注:

Abstract:Data-driven scientific discovery requires the iterative integration of scientific domain knowledge, statistical expertise, and an understanding of data semantics to make nuanced analytical decisions, e.g., about which variables, transformations, and statistical models to consider. LM-based agents equipped with planning, memory, and code execution capabilities have the potential to support data-driven science. However, evaluating agents on such open-ended tasks is challenging due to multiple valid approaches, partially correct steps, and different ways to express the same decisions. To address these challenges, we present BLADE, a benchmark to automatically evaluate agents’ multifaceted approaches to open-ended research questions. BLADE consists of 12 datasets and research questions drawn from existing scientific literature, with ground truth collected from independent analyses by expert data scientists and researchers. To automatically evaluate agent responses, we developed corresponding computational methods to match different representations of analyses to this ground truth. Though language models possess considerable world knowledge, our evaluation shows that they are often limited to basic analyses. However, agents capable of interacting with the underlying data demonstrate improved, but still non-optimal, diversity in their analytical decision making. Our work enables the evaluation of agents for data-driven science and provides researchers deeper insights into agents’ analysis approaches.
摘要:数据驱动的科学发现需要迭代地整合科学领域知识、统计专业知识和对数据语义的理解,以做出细致的分析决策,例如应考虑哪些变量、变换和统计模型。具备规划、记忆和代码执行能力的基于LM的智能体有潜力支持数据驱动的科学。然而,由于存在多种有效的方法、部分正确的步骤以及表达相同决策的不同方式,在此类开放式任务上评估智能体具有挑战性。为了应对这些挑战,我们提出了BLADE,这是一个自动评估智能体针对开放式研究问题的多方面方法的基准。BLADE由12个数据集和取自现有科学文献的研究问题组成,其基准真值来自数据科学专家和研究人员的独立分析。为了自动评估智能体的回答,我们开发了相应的计算方法,将分析的不同表示与该基准真值相匹配。尽管语言模型拥有相当多的世界知识,但我们的评估表明,它们往往局限于基本的分析。然而,能够与底层数据交互的智能体在其分析决策上表现出有所改善但仍非最优的多样性。我们的工作使评估面向数据驱动科学的智能体成为可能,并为研究人员提供了对智能体分析方法的更深入见解。

[NLP-36] A Comparison of Large Language Model and Human Performance on Random Number Generation Tasks
[NLP-36] 随机数生成任务中大语言模型和人类表现的比较

链接: https://arxiv.org/abs/2408.09656
作者: Rachel M. Harrison
关键词-EN: Number Generation Tasks, generate sequences devoid, Generation Tasks, psychology for examining, devoid of predictable
关键词-ZH: 数字生成任务,缺乏生成序列,生成任务,检查心理学,缺乏可预测
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Neurons and Cognition (q-bio.NC)
备注:

Abstract:Random Number Generation Tasks (RNGTs) are used in psychology for examining how humans generate sequences devoid of predictable patterns. By adapting an existing human RNGT for an LLM-compatible environment, this preliminary study tests whether ChatGPT-3.5, a large language model (LLM) trained on human-generated text, exhibits human-like cognitive biases when generating random number sequences. Initial findings indicate that ChatGPT-3.5 more effectively avoids repetitive and sequential patterns compared to humans, with notably lower repeat frequencies and adjacent number frequencies. Continued research into different models, parameters, and prompting methodologies will deepen our understanding of how LLMs can more closely mimic human random generation behaviors, while also broadening their applications in cognitive and behavioral science research.
摘要:随机数生成任务(RNGT)在心理学中用于研究人类如何生成没有可预测模式的序列。通过将现有的人类RNGT改编到与LLM兼容的环境中,这项初步研究测试了ChatGPT-3.5(一种在人类生成的文本上训练的大型语言模型(LLM))在生成随机数序列时是否表现出类似人类的认知偏差。初步研究结果表明,与人类相比,ChatGPT-3.5更有效地避免了重复和顺序模式,其重复频率和相邻数字频率明显更低。对不同模型、参数和提示方法的持续研究将加深我们对LLM如何更贴近地模拟人类随机生成行为的理解,同时也拓宽其在认知和行为科学研究中的应用。
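
The two pattern measures named in the abstract (repeat frequency and adjacent number frequency) can be made concrete with a small sketch. The definitions below are assumptions for illustration, not necessarily the study's exact scoring: a repeat is the same number twice in a row, and an adjacent pair is two consecutive numbers that differ by exactly 1. The `sample` sequence is made-up data.

```python
def repeat_frequency(seq):
    """Fraction of consecutive pairs in which the same number appears twice in a row."""
    pairs = list(zip(seq, seq[1:]))
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)


def adjacent_frequency(seq):
    """Fraction of consecutive pairs that differ by exactly 1 (e.g. 3 then 4, or 7 then 6)."""
    pairs = list(zip(seq, seq[1:]))
    if not pairs:
        return 0.0
    return sum(abs(a - b) == 1 for a, b in pairs) / len(pairs)


# Made-up sequence: 7 consecutive pairs, 2 repeats (3,3 and 2,2), 2 adjacent pairs (3,4 and 1,2).
sample = [3, 3, 4, 9, 1, 2, 2, 7]
print(repeat_frequency(sample), adjacent_frequency(sample))
```

Lower values on both measures would correspond to the abstract's claim that a generator avoids repetitive and sequential patterns more effectively.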

[NLP-37] Acquiring Bidirectionality via Large and Small Language Models
[NLP-37] 通过大型与小型语言模型获得双向性

链接: https://arxiv.org/abs/2408.09640
作者: Takumi Goto,Hiroyoshi Nagao,Yuta Koreeda
关键词-EN: widely used approach, approach for token-classification, bidirectional language models, token representation, BERT
关键词-ZH: 广泛使用的方法、标记分类方法、双向语言模型、标记表示、BERT
类目: Computation and Language (cs.CL)
备注:

Abstract:Using token representation from bidirectional language models (LMs) such as BERT is still a widely used approach for token-classification tasks. Even though there exist much larger unidirectional LMs such as Llama-2, they are rarely used to replace the token representation of bidirectional LMs. In this work, we hypothesize that their lack of bidirectionality is keeping them behind. To that end, we propose to newly train a small backward LM and concatenate its representations to those of existing LM for downstream tasks. Through experiments in named entity recognition, we demonstrate that introducing backward model improves the benchmark performance more than 10 points. Furthermore, we show that the proposed method is especially effective for rare domains and in few-shot learning settings.
摘要:使用BERT等双向语言模型(LM)的词元表示仍然是词元分类任务中广泛使用的方法。尽管存在Llama-2等规模大得多的单向LM,但它们很少被用来取代双向LM的词元表示。在这项工作中,我们假设是双向性的缺失使它们落后。为此,我们提出新训练一个小型的反向LM,并将其表示与现有LM的表示拼接起来,用于下游任务。通过命名实体识别实验,我们证明引入反向模型使基准性能提高了10分以上。此外,我们表明,所提出的方法对于罕见领域和少样本学习设置特别有效。
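
The concatenation idea in the abstract can be sketched with a toy stand-in for the language models. Everything here is hypothetical: `causal_states` is not a real LM, only a function whose state at position i depends on tokens 0..i, and the same stand-in plays both the large forward model and the small backward model. The point is the alignment step: the backward model reads the reversed sequence, and its states are flipped back before concatenation.

```python
def causal_states(token_ids):
    """Stand-in for a unidirectional LM: the state at position i depends only on tokens 0..i."""
    states, running = [], 0
    for i, tok in enumerate(token_ids):
        running += tok
        states.append([float(running), float(i + 1)])
    return states


def bidirectional_features(token_ids):
    """Concatenate left-context and right-context states for every token."""
    forward = causal_states(token_ids)
    # Run the backward model on the reversed input, then flip the result
    # so that index i again refers to token i.
    backward = causal_states(token_ids[::-1])[::-1]
    return [f + b for f, b in zip(forward, backward)]


feats = bidirectional_features([5, 1, 2])
for row in feats:
    print(row)
```

Each output row now carries information from both directions, which is what a token classifier for a task such as named entity recognition would consume.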

[NLP-38] How to Make the Most of LLMs' Grammatical Knowledge for Acceptability Judgments
[NLP-38] 如何充分利用LLM语法知识进行可接受性判断

链接: https://arxiv.org/abs/2408.09639
作者: Yusuke Ide,Yuto Nishida,Miyu Oba,Yusuke Sakai,Justin Vasselli,Hidetaka Kamigaito,Taro Watanabe
关键词-EN: linguistic minimal pairs, minimal pairs, benchmark of linguistic, linguistic minimal, required to judge
关键词-ZH: 语言最小对,最小对,语言基准,语言最小,需要判断
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:The grammatical knowledge of language models (LMs) is often measured using a benchmark of linguistic minimal pairs, where LMs are presented with a pair of acceptable and unacceptable sentences and required to judge which is acceptable. The existing dominant approach, however, naively calculates and compares the probabilities of paired sentences using LMs. Additionally, large language models (LLMs) have yet to be thoroughly examined in this field. We thus investigate how to make the most of LLMs’ grammatical knowledge to comprehensively evaluate it. Through extensive experiments of nine judgment methods in English and Chinese, we demonstrate that a probability readout method, in-template LP, and a prompting-based method, Yes/No probability computing, achieve particularly high performance, surpassing the conventional approach. Our analysis reveals their different strengths, e.g., Yes/No probability computing is robust against token-length bias, suggesting that they harness different aspects of LLMs’ grammatical knowledge. Consequently, we recommend using diverse judgment methods to evaluate LLMs comprehensively.
摘要:语言模型(LM)的语法知识通常使用语言最小对基准来衡量:向LM呈现一对可接受和不可接受的句子,并要求其判断哪一个是可接受的。然而,现有的主流方法只是简单地使用LM计算并比较成对句子的概率。此外,大型语言模型(LLM)在这一领域尚未得到充分研究。因此,我们研究如何充分利用LLM的语法知识来对其进行全面评估。通过对英语和中文中九种判断方法的大量实验,我们证明了一种概率读出方法(模板内LP)和一种基于提示的方法(Yes/No概率计算)取得了特别高的性能,超过了传统方法。我们的分析揭示了它们的不同优势,例如,Yes/No概率计算对词元长度偏差具有鲁棒性,这表明它们利用了LLM语法知识的不同方面。因此,我们建议使用多种判断方法来全面评估LLM。
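
The conventional minimal-pair readout described in the abstract (compare the probabilities of the two sentences) can be sketched as follows. The per-token log probabilities are made-up numbers, not model outputs, and the length-normalized variant is an illustrative assumption used to show what token-length bias looks like; it is not claimed to be one of the paper's nine methods.

```python
def sentence_logprob(token_logprobs):
    """Sentence log probability as the sum of per-token log probabilities."""
    return sum(token_logprobs)


def pick_acceptable(logprobs_a, logprobs_b, normalize=False):
    """Return 'a' or 'b', whichever sentence scores higher."""
    score_a = sentence_logprob(logprobs_a)
    score_b = sentence_logprob(logprobs_b)
    if normalize:
        # Per-token average, so a longer sentence is not penalized for length alone.
        score_a /= len(logprobs_a)
        score_b /= len(logprobs_b)
    return "a" if score_a > score_b else "b"


# Hypothetical scores: sentence a is longer but more likely per token.
sent_a = [-1.0, -2.0, -1.5, -1.0]   # sum -5.5, mean -1.375
sent_b = [-2.0, -2.5]               # sum -4.5, mean -2.25
print(pick_acceptable(sent_a, sent_b), pick_acceptable(sent_a, sent_b, normalize=True))
```

With these numbers the raw comparison favors the shorter sentence while the normalized one favors the longer, which is the kind of length sensitivity the abstract says some readouts are more robust to.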

[NLP-39] MoDeGPT: Modular Decomposition for Large Language Model Compression
[NLP-39] MoDeGPT:用于大型语言模型压缩的模块分解

链接: https://arxiv.org/abs/2408.09632
作者: Chi-Heng Lin,Shangqian Gao,James Seale Smith,Abhishek Patel,Shikhar Tuli,Yilin Shen,Hongxia Jin,Yen-Chang Hsu
关键词-EN: Large Language Models, Large Language, demonstrating exceptional performance, Language Models, reshaped the landscape
关键词-ZH: 大型语言模型,大型语言,展示卓越的性能,语言模型,重塑了格局
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注: 31 pages, 9 figures

Abstract:Large Language Models (LLMs) have reshaped the landscape of artificial intelligence by demonstrating exceptional performance across various tasks. However, substantial computational requirements make their deployment challenging on devices with limited resources. Recently, compression methods using low-rank matrix techniques have shown promise, yet these often lead to degraded accuracy or introduce significant overhead in parameters and inference latency. This paper introduces Modular Decomposition (MoDeGPT), a novel structured compression framework that does not need recovery fine-tuning while resolving the above drawbacks. MoDeGPT partitions the Transformer block into modules comprised of matrix pairs and reduces the hidden dimensions via reconstructing the module-level outputs. MoDeGPT is developed based on a theoretical framework that utilizes three well-established matrix decomposition algorithms – Nyström approximation, CR decomposition, and SVD – and applies them to our redefined transformer modules. Our comprehensive experiments show MoDeGPT, without backward propagation, matches or surpasses previous structured compression methods that rely on gradient information, and saves 98% of compute costs on compressing a 13B model. On Llama-2/3 and OPT models, MoDeGPT maintains 90-95% zero-shot performance with 25-30% compression rates. Moreover, the compression can be done on a single GPU within a few hours and increases the inference throughput by up to 46%.
摘要:大型语言模型(LLM)通过在各种任务中表现出卓越的性能,重塑了人工智能的格局。然而,巨大的计算需求使得它们在资源有限的设备上的部署具有挑战性。最近,使用低秩矩阵技术的压缩方法已经显示出前景,但这些方法通常会导致精度下降,或者在参数和推理延迟方面引入显著的开销。本文介绍了模块化分解(Modular Decomposition,MoDeGPT),这是一种新颖的结构化压缩框架,无需恢复性微调即可解决上述缺点。MoDeGPT将Transformer块划分为由矩阵对组成的模块,并通过重构模块级输出来降低隐藏维度。MoDeGPT基于一个理论框架开发,该框架利用三种成熟的矩阵分解算法(Nyström近似、CR分解和SVD),并将它们应用于我们重新定义的Transformer模块。我们的综合实验表明,MoDeGPT在不使用反向传播的情况下,达到或超过了以往依赖梯度信息的结构化压缩方法,并且在压缩13B模型时节省了98%的计算成本。在Llama-2/3和OPT模型上,MoDeGPT在25%-30%的压缩率下保持90%-95%的零样本性能。此外,压缩可以在单个GPU上于几个小时内完成,并将推理吞吐量提高多达46%。

[NLP-40] A Strategy to Combine 1stGen Transformers and Open LLMs for Automatic Text Classification
[NLP-40] 结合第一代Transformer与开放LLM进行自动文本分类的策略

链接: https://arxiv.org/abs/2408.09629
作者: Claudio M. V. de Andrade,Washington Cunha,Davi Reis,Adriana Silvina Pagano,Leonardo Rocha,Marcos André Gonçalves
关键词-EN: Large Language Models, Large Language, NLP tasks, first-generation transformers, Language Models
关键词-ZH: 大型语言模型、大型语言、NLP任务、第一代转换器、语言模型
类目: Computation and Language (cs.CL)
备注: 13 pages, 3 figures, 8 tables

Abstract:Transformer models have achieved state-of-the-art results, with Large Language Models (LLMs), an evolution of first-generation transformers (1stTR), being considered the cutting edge in several NLP tasks. However, the literature has yet to conclusively demonstrate that LLMs consistently outperform 1stTRs across all NLP tasks. This study compares three 1stTRs (BERT, RoBERTa, and BART) with two open LLMs (Llama 2 and Bloom) across 11 sentiment analysis datasets. The results indicate that open LLMs may moderately outperform or match 1stTRs in 8 out of 11 datasets but only when fine-tuned. Given this substantial cost for only moderate gains, the practical applicability of these models in cost-sensitive scenarios is questionable. In this context, a confidence-based strategy that seamlessly integrates 1stTRs with open LLMs based on prediction certainty is proposed. High-confidence documents are classified by the more cost-effective 1stTRs, while uncertain cases are handled by LLMs in zero-shot or few-shot modes, at a much lower cost than fine-tuned versions. Experiments in sentiment analysis demonstrate that our solution not only outperforms 1stTRs, zero-shot, and few-shot LLMs but also competes closely with fine-tuned LLMs at a fraction of the cost.
摘要:Transformer模型已经取得了最先进的成果,其中大型语言模型(LLM)作为第一代Transformer(1stTR)的演进,被认为是多个NLP任务的前沿。然而,文献尚未确凿地证明LLM在所有NLP任务中始终优于1stTR。本研究在11个情感分析数据集上比较了三个1stTR(BERT、RoBERTa和BART)与两个开放LLM(Llama 2和Bloom)。结果表明,开放LLM可能在11个数据集中的8个上适度优于或持平1stTR,但仅在经过微调时如此。考虑到为这种仅属适度的收益所付出的高昂成本,这些模型在成本敏感场景中的实际适用性值得怀疑。在此背景下,我们提出了一种基于置信度的策略,根据预测确定性将1stTR与开放LLM无缝集成。高置信度的文档由更具成本效益的1stTR分类,而不确定的情况则由LLM以零样本或少样本模式处理,成本远低于微调版本。情感分析实验表明,我们的方案不仅优于1stTR、零样本和少样本LLM,而且以很小一部分成本取得了与微调LLM相近的效果。
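
The confidence-based routing described in the abstract can be sketched as a simple threshold rule. This is a minimal illustration under stated assumptions: the threshold value, `toy_small_model`, and `toy_llm` are all hypothetical stand-ins, not the paper's models or its actual certainty measure.

```python
def route(documents, small_model, llm, threshold=0.9):
    """Classify each document with the cheap model; fall back to the LLM when unsure."""
    results = []
    for doc in documents:
        label, confidence = small_model(doc)
        if confidence >= threshold:
            results.append((doc, label, "1stTR"))
        else:
            results.append((doc, llm(doc), "LLM"))
    return results


def toy_small_model(doc):
    """Hypothetical first-generation classifier: confident only on obvious cue words."""
    if "great" in doc:
        return "positive", 0.97
    if "awful" in doc:
        return "negative", 0.95
    return "positive", 0.55


def toy_llm(doc):
    """Hypothetical zero-shot LLM call, replaced here by a trivial rule."""
    return "negative" if "not" in doc else "positive"


docs = ["great phone", "awful battery", "not what I expected"]
routed = route(docs, toy_small_model, toy_llm)
for item in routed:
    print(item)
```

Only the third document falls below the threshold and is sent to the more expensive model, which is where the cost saving in such a scheme would come from.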

[NLP-41] Refining Packing and Shuffling Strategies for Enhanced Performance in Generative Language Models
[NLP-41] 完善打包和洗牌策略以提高生成语言模型的性能

链接: https://arxiv.org/abs/2408.09621
作者: Yanbing Chen,Ruilin Wang,Zihao Yang,Lavender Yao Jiang,Eric Karl Oermann
关键词-EN: prevent overfitting, overfitting and improve, auto-regressive language models, MSL, Packing
关键词-ZH: 防止过度适应、过度适应和改进、自回归语言模型、MSL、Packing
类目: Computation and Language (cs.CL)
备注: 11 pages (include appendix), 26 figures, submitted to ACL ARR Aug 2024

Abstract:Packing and shuffling tokens is a common practice in training auto-regressive language models (LMs) to prevent overfitting and improve efficiency. Typically, documents are concatenated to chunks of maximum sequence length (MSL) and then shuffled. However, setting the atom size, the length for each data chunk accompanied by random shuffling, to MSL may lead to contextual incoherence due to tokens from different documents being packed into the same chunk. An alternative approach is to utilize padding, another common data packing strategy, to avoid contextual incoherence by only including one document in each shuffled chunk. To optimize both packing strategies (concatenation vs padding), we investigated the optimal atom size for shuffling and compared their performance and efficiency. We found that matching atom size to MSL optimizes performance for both packing methods (concatenation and padding), and padding yields lower final perplexity (higher performance) than concatenation at the cost of more training steps and lower compute efficiency. This trade-off informs the choice of packing methods in training language models.
摘要:对词元进行打包和洗牌是训练自回归语言模型(LM)时的常见做法,用于防止过拟合并提高效率。通常,文档被拼接成最大序列长度(MSL)的块,然后被打乱。然而,将原子大小(即随机洗牌时每个数据块的长度)设置为MSL可能会导致上下文不连贯,因为来自不同文档的词元被打包到同一个块中。另一种方法是使用填充(padding),这是另一种常见的数据打包策略,通过在每个被打乱的块中只包含一个文档来避免上下文不连贯。为了优化这两种打包策略(拼接与填充),我们研究了洗牌的最佳原子大小,并比较了它们的性能和效率。我们发现,将原子大小与MSL相匹配可以优化两种打包方法(拼接和填充)的性能,并且填充比拼接得到更低的最终困惑度(更高的性能),但代价是训练步数更多、计算效率更低。这种权衡为训练语言模型时打包方法的选择提供了依据。
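
The two packing strategies contrasted in the abstract can be sketched on toy token lists. This is a simplified illustration with the atom size equal to MSL; the choices to drop the trailing partial chunk, to truncate over-long documents, and to use `pad_id=0` are assumptions made here, not details taken from the paper.

```python
import random


def pack_concat(docs, msl):
    """Concatenate all documents into one stream, then cut it into chunks of exactly msl tokens."""
    stream = [tok for doc in docs for tok in doc]
    usable = len(stream) - len(stream) % msl   # drop the trailing partial chunk
    return [stream[i:i + msl] for i in range(0, usable, msl)]


def pack_pad(docs, msl, pad_id=0):
    """One document per chunk: truncate to msl tokens and right-pad shorter documents."""
    chunks = []
    for doc in docs:
        body = doc[:msl]
        chunks.append(body + [pad_id] * (msl - len(body)))
    return chunks


def shuffle_chunks(chunks, seed=0):
    """Shuffle whole chunks; tokens inside a chunk keep their order."""
    shuffled = list(chunks)
    random.Random(seed).shuffle(shuffled)
    return shuffled


docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]]
concat_chunks = pack_concat(docs, msl=4)   # chunks mix tokens from different documents
padded_chunks = pack_pad(docs, msl=4)      # one chunk per document, with padding
print(concat_chunks)
print(padded_chunks)
```

The concatenated chunks contain no padding but mix documents, while the padded chunks keep documents separate at the cost of wasted positions, which mirrors the efficiency versus coherence trade-off the abstract reports.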

[NLP-42] Grammatical Error Feedback: An Implicit Evaluation Approach
[NLP-42] 语法错误反馈:一种隐性评估方法

链接: https://arxiv.org/abs/2408.09565
作者: Stefano Bannò,Kate Knill,Mark J. F. Gales
关键词-EN: crucial for consolidating, feedback, grammatical error, Grammatical, computer-assisted language learning
关键词-ZH: 对于巩固、反馈、语法错误、语法、计算机辅助语言学习至关重要
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:Grammatical feedback is crucial for consolidating second language (L2) learning. Most research in computer-assisted language learning has focused on feedback through grammatical error correction (GEC) systems, rather than examining more holistic feedback that may be more useful for learners. This holistic feedback will be referred to as grammatical error feedback (GEF). In this paper, we present a novel implicit evaluation approach to GEF that eliminates the need for manual feedback annotations. Our method adopts a grammatical lineup approach where the task is to pair feedback and essay representations from a set of possible alternatives. This matching process can be performed by appropriately prompting a large language model (LLM). An important aspect of this process, explored here, is the form of the lineup, i.e., the selection of foils. This paper exploits this framework to examine the quality and need for GEC to generate feedback, as well as the system used to generate feedback, using essays from the Cambridge Learner Corpus.
摘要:语法反馈对于巩固第二语言(L2)学习至关重要。计算机辅助语言学习领域的大多数研究都集中在通过语法纠错(GEC)系统提供反馈,而没有考察对学习者可能更有用的更全面的反馈。这种整体性反馈将被称为语法错误反馈(GEF)。在本文中,我们提出了一种新颖的GEF隐式评估方法,该方法无需人工反馈标注。我们的方法采用一种语法“列队辨认”(lineup)方式,任务是从一组候选项中将反馈与作文表示配对。该匹配过程可以通过适当地提示大型语言模型(LLM)来执行。本文探讨了这一过程的一个重要方面,即列队的形式,也就是干扰项(foil)的选择。本文利用这一框架,使用剑桥学习者语料库中的作文,考察了生成反馈时GEC的质量和必要性,以及用于生成反馈的系统。

[NLP-43] HiAgent: Hierarchical Working Memory Management for Solving Long-Horizon Agent Tasks with Large Language Model
[NLP-43] HiAgent:面向大型语言模型求解长时程智能体任务的分层工作记忆管理

链接: https://arxiv.org/abs/2408.09559
作者: Mengkang Hu,Tianxing Chen,Qiguang Chen,Yao Mu,Wenqi Shao,Ping Luo
关键词-EN: Large Language Model, Large Language, Language Model, exhibit significant potential, based agents exhibit
关键词-ZH: 大型语言模型,大型语言,语言模型,展现出巨大的潜力,基于代理展示
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: Project Page: this https URL

Abstract:Large Language Model (LLM)-based agents exhibit significant potential across various domains, operating as interactive systems that process environmental observations to generate executable actions for target tasks. The effectiveness of these agents is significantly influenced by their memory mechanism, which records historical experiences as sequences of action-observation pairs. We categorize memory into two types: cross-trial memory, accumulated across multiple attempts, and in-trial memory (working memory), accumulated within a single attempt. While considerable research has optimized performance through cross-trial memory, the enhancement of agent performance through improved working memory utilization remains underexplored. Instead, existing approaches often involve directly inputting entire historical action-observation pairs into LLMs, leading to redundancy in long-horizon tasks. Inspired by human problem-solving strategies, this paper introduces HiAgent, a framework that leverages subgoals as memory chunks to manage the working memory of LLM-based agents hierarchically. Specifically, HiAgent prompts LLMs to formulate subgoals before generating executable actions and enables LLMs to decide proactively to replace previous subgoals with summarized observations, retaining only the action-observation pairs relevant to the current subgoal. Experimental results across five long-horizon tasks demonstrate that HiAgent achieves a twofold increase in success rate and reduces the average number of steps required by 3.8. Additionally, our analysis shows that HiAgent consistently improves performance across various steps, highlighting its robustness and generalizability. Project Page: this https URL .
摘要:基于大型语言模型(LLM)的智能体在各个领域显示出巨大的潜力,它们作为交互系统运行,处理环境观察以生成目标任务的可执行动作。这些智能体的有效性受到其记忆机制的显著影响,记忆机制将历史经验记录为动作-观察对序列。我们将记忆分为两种类型:在多次尝试中积累的跨试验记忆,和在单次尝试中积累的试验内记忆(工作记忆)。虽然已经有相当多的研究通过跨试验记忆来优化性能,但通过改善工作记忆的利用来增强智能体性能的研究仍然很少。相反,现有的方法通常将全部历史动作-观察对直接输入到LLM中,导致长程任务中的冗余。受人类问题求解策略的启发,本文提出了HiAgent,这是一个以子目标作为记忆块、分层管理LLM智能体工作记忆的框架。具体地说,HiAgent提示LLM在生成可执行动作之前制定子目标,并使LLM能够主动决定用汇总的观察替换以前的子目标,仅保留与当前子目标相关的动作-观察对。在五个长程任务上的实验结果表明,HiAgent的成功率提高了一倍,平均所需步骤减少了3.8步。此外,我们的分析表明,HiAgent在不同步数下都能持续提高性能,突出了其稳健性和通用性。项目页面:此HTTPS URL。
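
The subgoal-as-memory-chunk idea in the abstract above can be illustrated with a small sketch. This is not HiAgent's code: the class name `SubgoalMemory`, its methods, and the `summarize` callable are all hypothetical, and in a real agent the summary would presumably come from an LLM call rather than a plain function.

```python
class SubgoalMemory:
    """Toy working memory: finished subgoals keep only a summary."""

    def __init__(self, summarize):
        # Hypothetical summarizer: takes a list of (action, observation)
        # pairs and returns a short string.
        self.summarize = summarize
        self.finished = []          # list of (subgoal, summary)
        self.current_subgoal = None
        self.current_pairs = []     # pairs for the current subgoal only

    def start_subgoal(self, subgoal):
        # Closing a subgoal replaces its detailed pairs with a summary.
        if self.current_subgoal is not None:
            summary = self.summarize(self.current_pairs)
            self.finished.append((self.current_subgoal, summary))
        self.current_subgoal = subgoal
        self.current_pairs = []

    def record(self, action, observation):
        self.current_pairs.append((action, observation))

    def context(self):
        # Text that would be placed in the prompt for the next step.
        lines = [f"[done] {goal}: {summary}" for goal, summary in self.finished]
        lines.append(f"[current] {self.current_subgoal}")
        lines += [f"{action} -> {obs}" for action, obs in self.current_pairs]
        return "\n".join(lines)


memory = SubgoalMemory(lambda pairs: f"{len(pairs)} steps")
memory.start_subgoal("find key")
memory.record("open drawer", "key visible")
memory.record("take key", "holding key")
memory.start_subgoal("open door")
memory.record("use key", "door open")
print(memory.context())
```

The prompt grows with the number of subgoals plus the steps of the current subgoal, rather than with the full trajectory length.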

[NLP-44] No Such Thing as a General Learner: Language models and their dual optimization
[NLP-44] 不存在通用学习者:语言模型及其双重优化

链接: https://arxiv.org/abs/2408.09544
作者: Emmanuel Chemla,Ryan M. Nefdt
关键词-EN: Large Language Models, successful Large Language, successful Large, Language Models, Large Language
关键词-ZH: 大型语言模型,成功的大型语言,成功的大型,语言模型,大型语言
类目: Computation and Language (cs.CL)
备注: 11 pages, 4 figures

点击查看摘要

Abstract:What role can the otherwise successful Large Language Models (LLMs) play in the understanding of human cognition, and in particular in terms of informing language acquisition debates? To contribute to this question, we first argue that neither humans nor LLMs are general learners, in a variety of senses. We make a novel case for how in particular LLMs follow a dual-optimization process: they are optimized during their training (which is typically compared to language acquisition), and modern LLMs have also been selected, through a process akin to natural selection in a species. From this perspective, we argue that the performance of LLMs, whether similar or dissimilar to that of humans, does not weigh easily on important debates about the importance of human cognitive biases for language.
摘要:在其他方面表现成功的大型语言模型(LLM)在理解人类认知方面,特别是在为语言习得辩论提供信息方面可以发挥什么作用?为了回应这个问题,我们首先论证,从多种意义上来说,人类和LLM都不是通用学习者。我们提出了一个新颖的论点,说明LLM如何遵循双重优化过程:它们在训练期间得到优化(训练通常被类比为语言习得),并且现代LLM也经过了一种类似于物种自然选择的过程的筛选。从这个角度来看,我们认为,LLM的表现无论与人类相似还是不同,都难以轻易左右有关人类认知偏向对语言之重要性的重要辩论。

[NLP-45] Using ChatGPT to Score Essays and Short-Form Constructed Responses
[NLP-45] 使用ChatGPT对论文和简短的构建回答进行评分

链接: https://arxiv.org/abs/2408.09540
作者: Mark D. Shermis
关键词-EN: ASAP competition, large language models, ChatGPT large language, aimed to determine, large language
关键词-ZH: ASAP竞赛,大型语言模型,ChatGPT大型语言,旨在确定,大型语言
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 35 pages, 8 tables, 2 Figures, 27 references

点击查看摘要

Abstract:This study aimed to determine if ChatGPT’s large language models could match the scoring accuracy of human and machine scores from the ASAP competition. The investigation focused on various prediction models, including linear regression, random forest, gradient boost, and boost. ChatGPT’s performance was evaluated against human raters using quadratic weighted kappa (QWK) metrics. Results indicated that while ChatGPT’s gradient boost model achieved QWKs close to human raters for some data sets, its overall performance was inconsistent and often lower than human scores. The study highlighted the need for further refinement, particularly in handling biases and ensuring scoring fairness. Despite these challenges, ChatGPT demonstrated potential for scoring efficiency, especially with domain-specific fine-tuning. The study concludes that ChatGPT can complement human scoring but requires additional development to be reliable for high-stakes assessments. Future research should improve model accuracy, address ethical considerations, and explore hybrid models combining ChatGPT with empirical methods.
摘要:这项研究旨在确定ChatGPT的大型语言模型是否能够与ASAP比赛中人和机器得分的准确性相匹配。调查的重点是各种预测模型,包括线性回归、随机森林、梯度增强和增强。ChatGPT的表现是使用二次加权kappa(QWK)指标与人类评分者进行评估的。结果表明,尽管ChatGPT的梯度增强模型在某些数据集上获得了接近人类评分者的QWK,但其整体表现并不一致,而且往往低于人类的得分。这项研究强调了进一步改进的必要性,特别是在处理偏见和确保得分公平方面。尽管有这些挑战,但ChatGPT显示出了提高评分效率的潜力,特别是通过特定于领域的微调。研究得出结论,ChatGPT可以补充人类的评分,但需要进一步的开发才能可靠地进行高风险的评估。未来的研究应该提高模型的准确性,解决伦理方面的考虑,并探索将ChatGPT与经验方法相结合的混合模型。
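
For readers unfamiliar with the quadratic weighted kappa (QWK) metric named in the abstract above, here is a minimal pure-Python sketch of its standard definition. The function name and the toy score lists are illustrative assumptions; they are not taken from the paper or from the ASAP data.

```python
from collections import Counter


def quadratic_weighted_kappa(rater_a, rater_b, min_rating, max_rating):
    """QWK between two equally long lists of integer scores.

    Assumes max_rating > min_rating and that the two raters are not
    both constant (otherwise the denominator is zero).
    """
    n_levels = max_rating - min_rating + 1
    n_items = len(rater_a)
    observed = Counter(zip(rater_a, rater_b))  # joint counts
    hist_a = Counter(rater_a)                  # marginal counts
    hist_b = Counter(rater_b)

    numerator = 0.0
    denominator = 0.0
    for i in range(min_rating, max_rating + 1):
        for j in range(min_rating, max_rating + 1):
            # Disagreement is penalized by the squared score distance.
            weight = (i - j) ** 2 / (n_levels - 1) ** 2
            expected = hist_a[i] * hist_b[j] / n_items
            numerator += weight * observed[(i, j)]
            denominator += weight * expected
    return 1.0 - numerator / denominator


human = [1, 2, 3, 4]
print(quadratic_weighted_kappa(human, [1, 2, 3, 4], 1, 4))  # perfect agreement
print(quadratic_weighted_kappa(human, [4, 3, 2, 1], 1, 4))  # reversed scores
```

A value of 1 means perfect agreement, 0 means chance-level agreement, and negative values mean agreement worse than chance.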

[NLP-46] Revisiting the Graph Reasoning Ability of Large Language Models: Case Studies in Translation, Connectivity and Shortest Path
[NLP-46] 重新审视大型语言模型的图推理能力:翻译、连通性和最短路径的案例研究

链接: https://arxiv.org/abs/2408.09529
作者: Xinnan Dai,Qihao Wen,Yifei Shen,Hongzhi Wen,Dongsheng Li,Jiliang Tang,Caihua Shan
关键词-EN: Large Language Models, Large Language, Language Models, achieved great success, achieved great
关键词-ZH: 大型语言模型,大型语言,语言模型,取得了巨大的成功,取得了巨大的成功
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have achieved great success in various reasoning tasks. In this work, we focus on the graph reasoning ability of LLMs. Although theoretical studies proved that LLMs are capable of handling graph reasoning tasks, empirical evaluations reveal numerous failures. To deepen our understanding on this discrepancy, we revisit the ability of LLMs on three fundamental graph tasks: graph description translation, graph connectivity, and the shortest-path problem. Our findings suggest that LLMs can fail to understand graph structures through text descriptions and exhibit varying performance for all these three fundamental tasks. Meanwhile, we perform a real-world investigation on knowledge graphs and make consistent observations with our findings. The codes and datasets are available.
摘要:大型语言模型(LLM)在各种推理任务中取得了巨大成功。在这项工作中,我们重点关注LLM的图推理能力。尽管理论研究证明LLM能够处理图推理任务,但经验评估揭示了许多失败。为了加深我们对这种差异的理解,我们重新审视了LLM在三个基本图任务上的能力:图描述翻译、图连接性和最短路径问题。我们的研究结果表明,LLM可能无法通过文本描述理解图形结构,并且在所有这三项基本任务中表现出不同的性能。与此同时,我们对知识图谱进行现实世界的调查,并与我们的发现进行一致的观察。代码和数据集均可用。
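
To make the three tasks in the abstract above concrete, the sketch below renders an edge list as text (one possible description format, chosen here as an assumption) and computes ground-truth connectivity and shortest-path answers with breadth-first search. It only shows how reference answers for such an evaluation could be produced; it is not the paper's code or prompt format.

```python
from collections import deque


def describe_graph(edges):
    """Render an undirected edge list as a plain-text description."""
    return " ".join(f"Node {u} is connected to node {v}." for u, v in edges)


def shortest_path_length(edges, source, target):
    """BFS distance in an undirected graph, or None if unreachable."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    if source == target:
        return 0
    visited = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor == target:
                return dist + 1
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


def is_connected(edges, source, target):
    return shortest_path_length(edges, source, target) is not None


edges = [(0, 1), (1, 2), (2, 3), (4, 5)]
print(describe_graph(edges))
print(shortest_path_length(edges, 0, 3), is_connected(edges, 0, 4))
```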

[NLP-47] Out-of-distribution generalization via composition: a lens through induction heads in Transformers
[NLP-47] 通过组合实现分布外泛化:从Transformer中的归纳头视角

链接: https://arxiv.org/abs/2408.09503
作者: Jiajun Song,Zhuoyan Xu,Yiqiao Zhong
关键词-EN: Large language models, OOD generalization, Large language, OOD, Large
关键词-ZH: 大型语言模型、OOD概括、大型语言、OOD、大型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注: 41 pages, 25 figures

点击查看摘要

Abstract:Large language models (LLMs) such as GPT-4 sometimes appear to be creative, solving novel tasks often with a few demonstrations in the prompt. These tasks require the models to generalize on distributions different from those from training data – which is known as out-of-distribution (OOD) generalization. Despite the tremendous success of LLMs, how they approach OOD generalization remains an open and underexplored question. We examine OOD generalization in settings where instances are generated according to hidden rules, including in-context learning with symbolic reasoning. Models are required to infer the hidden rules behind input prompts without any fine-tuning. We empirically examined the training dynamics of Transformers on a synthetic example and conducted extensive experiments on a variety of pretrained LLMs, focusing on a type of components known as induction heads. We found that OOD generalization and composition are tied together – models can learn rules by composing two self-attention layers, thereby achieving OOD generalization. Furthermore, a shared latent subspace in the embedding (or feature) space acts as a bridge for composition by aligning early layers and later layers, which we refer to as the common bridge representation hypothesis.
摘要:大型语言模型(LLM),如GPT-4,有时看起来很有创造力,往往只需提示中的少量示例就能解决新任务。这些任务要求模型在不同于训练数据的分布上进行泛化,这称为分布外(OOD)泛化。尽管LLM取得了巨大的成功,但它们如何实现OOD泛化仍然是一个开放的、尚未被充分探索的问题。我们研究了在根据隐藏规则生成实例的环境中的OOD泛化,包括使用符号推理的上下文学习。模型需要在不进行任何微调的情况下推断输入提示背后的隐藏规则。我们在一个合成例子上对Transformer的训练动态进行了实证检验,并在各种预训练的LLM上进行了广泛的实验,重点关注一类称为归纳头(induction head)的组件。我们发现,OOD泛化和组合是紧密联系在一起的:模型可以通过组合两个自注意力层来学习规则,从而实现OOD泛化。此外,嵌入(或特征)空间中的共享潜在子空间通过对齐早期层和后期层来充当组合的桥梁,我们称之为公共桥表示假设。

[NLP-48] REFINE-LM: Mitigating Language Model Stereotypes via Reinforcement Learning
[NLP-48] REFINE-LM:通过强化学习缓解语言模型刻板印象

链接: https://arxiv.org/abs/2408.09489
作者: Rameez Qureshi,Naïm Es-Sebbani,Luis Galárraga,Yvette Graham,Miguel Couceiro,Zied Bouraoui
关键词-EN: unintended bias, biases, language models, significant concern, models
关键词-ZH: 无意的偏见、偏见、语言模型、重大担忧、模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:With the introduction of (large) language models, there has been significant concern about the unintended bias such models may inherit from their training data. A number of studies have shown that such models propagate gender stereotypes, as well as geographical and racial bias, among other biases. While existing works tackle this issue by preprocessing data and debiasing embeddings, the proposed methods require a lot of computational resources and annotation effort while being limited to certain types of biases. To address these issues, we introduce REFINE-LM, a debiasing method that uses reinforcement learning to handle different types of biases without any fine-tuning. By training a simple model on top of the word probability distribution of a LM, our bias agnostic reinforcement learning method enables model debiasing without human annotations or significant computational resources. Experiments conducted on a wide range of models, including several LMs, show that our method (i) significantly reduces stereotypical biases while preserving LMs performance; (ii) is applicable to different types of biases, generalizing across contexts such as gender, ethnicity, religion, and nationality-based biases; and (iii) it is not expensive to train.
摘要:随着(大型)语言模型的引入,人们一直非常担心这些模型可能从其训练数据中继承的意外偏差。一些研究表明,此类模型会传播性别刻板印象,以及地域和种族偏见等其他偏见。虽然现有的工作通过预处理数据和对嵌入去偏来解决这个问题,但所提出的方法需要大量的计算资源和标注工作,而且仅限于某些类型的偏见。为了解决这些问题,我们引入了REFINE-LM,这是一种去偏方法,它使用强化学习来处理不同类型的偏见,而不需要进行任何微调。通过在LM的词概率分布之上训练一个简单的模型,我们的偏见无关强化学习方法能够在不需要人工标注或大量计算资源的情况下实现模型去偏。在包括多个LM在内的广泛模型上进行的实验表明,我们的方法(i)在保持LM性能的同时显著减少了刻板印象偏见;(ii)适用于不同类型的偏见,可推广到性别、种族、宗教和国籍等方面的偏见;(iii)训练成本不高。

[NLP-49] Activated Parameter Locating via Causal Intervention for Model Merging
[NLP-49] 通过因果干预定位模型合并的激活参数

链接: https://arxiv.org/abs/2408.09485
作者: Fanshuang Kong,Richong Zhang,Ziqiao Wang
关键词-EN: achieving convincing generalization, combines multiple homologous, achieving convincing, additional training, multiple homologous models
关键词-ZH: 实现令人信服的概括,结合多个同类,实现令人信服的、额外的训练、多个同类模型
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Model merging combines multiple homologous models into one model, achieving convincing generalization without the necessity of additional training. A key challenge in this problem is resolving parameter redundancies and conflicts across multiple models. Existing models have demonstrated that dropping a portion of delta parameters can alleviate conflicts while maintaining performance. However, these methods often drop parameters either randomly or based on magnitude, overlooking task-specific information embedded in fine-tuned models. In this paper, we propose an Activated Parameter Locating (APL) method that utilizes causal intervention to estimate parameter importance, enabling more precise parameter drops and better conflict mitigation. Moreover, to reduce the computational complexity associated with a large number of parameter partitions, we also introduce a theoretically supported gradient approximation strategy for APL. Experiments on model merging within both in-domain and out-of-domain settings, along with associated analyses, showcase the effectiveness of APL.
摘要:模型合并将多个同源模型合并为一个模型,无需额外的训练即可实现令人信服的泛化。这个问题中的一个关键挑战是解决多个模型之间的参数冗余和冲突。现有的模型已经证明,丢弃部分增量参数可以在保持性能的同时缓解冲突。然而,这些方法经常随机或基于大小丢弃参数,忽略了嵌入在微调模型中的特定于任务的信息。在本文中,我们提出了一种激活参数定位(APL)方法,该方法利用因果干预来估计参数的重要性,从而能够更精确地丢弃参数和更好地缓解冲突。此外,为了减少与大量参数划分相关的计算复杂性,我们还引入了一种理论上支持的APL的梯度逼近策略。域内和域外环境下的模型合并实验,以及相关的分析,展示了APL的有效性。
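
The abstract above contrasts APL with existing approaches that drop delta parameters by magnitude. The sketch below illustrates only that magnitude-based baseline on flat Python lists, followed by a simple additive merge; it is not APL, which estimates importance via causal intervention instead. The function names, the `keep_ratio` parameter, and the additive merge rule are assumptions made for illustration.

```python
def drop_by_magnitude(delta, keep_ratio):
    """Zero all but the largest-magnitude entries of a delta vector."""
    k = max(1, round(len(delta) * keep_ratio))
    ranked = sorted(range(len(delta)), key=lambda i: abs(delta[i]), reverse=True)
    keep = set(ranked[:k])
    return [d if i in keep else 0.0 for i, d in enumerate(delta)]


def merge(base, finetuned_models, keep_ratio=0.5):
    """Add each model's sparsified delta (finetuned - base) onto the base."""
    merged = list(base)
    for weights in finetuned_models:
        delta = [w - b for w, b in zip(weights, base)]
        for i, d in enumerate(drop_by_magnitude(delta, keep_ratio)):
            merged[i] += d
    return merged


base = [0.0, 0.0, 0.0, 0.0]
model_a = [1.0, 0.1, 0.0, 0.0]   # toy "fine-tuned" weights
model_b = [0.0, 0.0, -0.2, 2.0]
print(merge(base, [model_a, model_b], keep_ratio=0.25))
```

With `keep_ratio=0.25` each model contributes only its single largest change, so the small deltas (0.1 and -0.2) are discarded regardless of whether they matter for the task, which is the limitation the abstract points at.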

[NLP-50] PanoSent: A Panoptic Sextuple Extraction Benchmark for Multimodal Conversational Aspect-based Sentiment Analysis ACM-MM2024
[NLP-50] PanoSent:面向多模态对话式方面级情感分析的全景六元组抽取基准

链接: https://arxiv.org/abs/2408.09481
作者: Meng Luo,Hao Fei,Bobo Li,Shengqiong Wu,Qian Liu,Soujanya Poria,Erik Cambria,Mong-Li Lee,Wynne Hsu
关键词-EN: Aspect-based Sentiment Analysis, existing Aspect-based Sentiment, holistic research target, research target seamlessly, target seamlessly integrating
关键词-ZH: 基于目标的情绪分析、现有的基于目标的情绪、整体研究目标、无缝研究目标、无缝集成目标
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by ACM MM 2024 (Oral)

点击查看摘要

Abstract:While existing Aspect-based Sentiment Analysis (ABSA) has received extensive effort and advancement, there are still gaps in defining a more holistic research target seamlessly integrating multimodality, conversation context, fine-granularity, and also covering the changing sentiment dynamics as well as cognitive causal rationales. This paper bridges the gaps by introducing a multimodal conversational ABSA, where two novel subtasks are proposed: 1) Panoptic Sentiment Sextuple Extraction, panoramically recognizing holder, target, aspect, opinion, sentiment, rationale from multi-turn multi-party multimodal dialogue. 2) Sentiment Flipping Analysis, detecting the dynamic sentiment transformation throughout the conversation with the causal reasons. To benchmark the tasks, we construct PanoSent, a dataset annotated both manually and automatically, featuring high quality, large scale, multimodality, multilingualism, multi-scenarios, and covering both implicit and explicit sentiment elements. To effectively address the tasks, we devise a novel Chain-of-Sentiment reasoning framework, together with a novel multimodal large language model (namely Sentica) and a paraphrase-based verification mechanism. Extensive evaluations demonstrate the superiority of our methods over strong baselines, validating the efficacy of all our proposed methods. The work is expected to open up a new era for the ABSA community, and thus all our codes and data are open at this https URL
摘要:虽然现有的基于方面的情感分析(ABSA)已经得到了大量的投入并取得了进展,但在定义一个更全面的研究目标方面仍存在差距,该目标应无缝地整合多模态、对话上下文、细粒度,并涵盖情感变化的动态以及认知因果依据。本文通过引入多模态对话式ABSA来弥合这些差距,并提出了两个新颖的子任务:1)全景情感六元组抽取,从多轮多方多模态对话中全景式地识别持有者、目标、方面、观点、情感和依据。2)情感翻转分析,检测整个对话过程中情感的动态变化及其因果原因。为了对任务进行基准测试,我们构建了PanoSent,一个人工和自动标注的数据集,具有高质量、大规模、多模态、多语言、多场景的特点,并涵盖隐性和显性情感元素。为了有效地处理这些任务,我们设计了一个新的情感链(Chain-of-Sentiment)推理框架,以及一个新的多模态大型语言模型(即Sentica)和基于释义的验证机制。广泛的评估表明,我们的方法优于强大的基线,验证了我们提出的所有方法的有效性。这项工作有望为ABSA社区开辟一个新纪元,因此我们所有的代码和数据都在这个HTTPS URL上开放

[NLP-51] Image-Based Geolocation Using Large Vision-Language Models
[NLP-51] 使用大型视觉语言模型的基于图像的地理定位

链接: https://arxiv.org/abs/2408.09474
作者: Yi Liu,Junchen Ding,Gelei Deng,Yuekang Li,Tianwei Zhang,Weisong Sun,Yaowen Zheng,Jingquan Ge,Yang Liu
关键词-EN: offering numerous benefits, modern life, offering numerous, vital aspect, aspect of modern
关键词-ZH: 提供现代生活的众多好处,提供现代生活的众多重要方面
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Geolocation is now a vital aspect of modern life, offering numerous benefits but also presenting serious privacy concerns. The advent of large vision-language models (LVLMs) with advanced image-processing capabilities introduces new risks, as these models can inadvertently reveal sensitive geolocation information. This paper presents the first in-depth study analyzing the challenges posed by traditional deep learning and LVLM-based geolocation methods. Our findings reveal that LVLMs can accurately determine geolocations from images, even without explicit geographic training. To address these challenges, we introduce \tool, an innovative framework that significantly enhances image-based geolocation accuracy. \tool employs a systematic chain-of-thought (CoT) approach, mimicking human geoguessing strategies by carefully analyzing visual and contextual cues such as vehicle types, architectural styles, natural landscapes, and cultural elements. Extensive testing on a dataset of 50,000 ground-truth data points shows that \tool outperforms both traditional models and human benchmarks in accuracy. It achieves an impressive average score of 4550.5 in the GeoGuessr game, with an 85.37% win rate, and delivers highly precise geolocation predictions, with the closest distances as accurate as 0.3 km. Furthermore, our study highlights issues related to dataset integrity, leading to the creation of a more robust dataset and a refined framework that leverages LVLMs’ cognitive capabilities to improve geolocation precision. These findings underscore \tool’s superior ability to interpret complex visual data, the urgent need to address emerging security vulnerabilities posed by LVLMs, and the importance of responsible AI development to ensure user privacy protection. 
摘要:地理定位现在是现代生活的一个重要方面,它提供了许多好处,但也带来了严重的隐私问题。具有高级图像处理能力的大型视觉语言模型(LVLM)的出现带来了新的风险,因为这些模型可能会无意中泄露敏感的地理位置信息。本文首次深入研究分析了传统的深度学习和基于LVLM的地理定位方法带来的挑战。我们的发现表明,即使没有明确的地理训练,LVLM也可以从图像中准确地确定地理位置。为了应对这些挑战,我们引入了\tool,这是一个创新的框架,可以显著提高基于图像的地理定位精度。\tool采用系统的思维链(CoT)方法,通过仔细分析视觉和背景线索(如车辆类型、建筑风格、自然景观和文化元素)来模仿人类的地理猜测(geoguessing)策略。在50,000个真值(ground-truth)数据点的数据集上的广泛测试表明,\tool在准确性方面优于传统模型和人类基准。它在GeoGuessr游戏中取得了令人印象深刻的平均得分4550.5分,胜率为85.37%,并提供高精度的地理位置预测,最近距离可精确到0.3公里。此外,我们的研究强调了与数据集完整性相关的问题,从而创建了更稳健的数据集和利用LVLM的认知能力来提高地理定位精度的精细化框架。这些发现突显了\tool解释复杂视觉数据的卓越能力、解决LVLM带来的新兴安全漏洞的迫切需要,以及负责任的人工智能开发对确保用户隐私保护的重要性。

[NLP-52] WPN: An Unlearning Method Based on N-pair Contrastive Learning in Language Models ECAI2024
[NLP-52] WPN:一种基于语言模型N对对比学习的去学习方法

链接: https://arxiv.org/abs/2408.09459
作者: Guitao Chen,Yunshen Wang,Hongye Sun,Guang Chen
关键词-EN: Generative language models, offer numerous advantages, Generative language, harmful knowledge acquired, harmful outputs due
关键词-ZH: 生成性语言模型提供了许多优点,生成性语言、获得的有害知识、产生的有害输出
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: ECAI 2024

点击查看摘要

Abstract:Generative language models (LMs) offer numerous advantages but may produce inappropriate or harmful outputs due to the harmful knowledge acquired during pre-training. This knowledge often manifests as undesirable correspondences, such as “harmful prompts” leading to “harmful outputs,” which our research aims to mitigate through unlearning techniques. However, existing unlearning methods based on gradient ascent can significantly impair the performance of LMs. To address this issue, we propose a novel approach called Weighted Positional N-pair (WPN) Learning, which leverages position-weighted mean pooling within an n-pair contrastive learning framework. WPN is designed to modify the output distribution of LMs by eliminating specific harmful outputs (e.g., replacing toxic responses with neutral ones), thereby transforming the model’s behavior from “harmful prompt-harmful output” to “harmful prompt-harmless response”. Experiments on OPT and GPT-NEO LMs show that WPN effectively reduces the proportion of harmful responses, achieving a harmless rate of up to 95.8% while maintaining stable performance on nine common benchmarks (with less than 2% degradation on average). Moreover, we provide empirical evidence to demonstrate WPN’s ability to weaken the harmful correspondences in terms of generalizability and robustness, as evaluated on out-of-distribution test sets and under adversarial attacks.
摘要:生成式语言模型(LM)有许多优点,但由于在预训练中获得了有害的知识,可能会产生不适当或有害的输出。这些知识经常表现为不希望看到的对应关系,例如“有害提示”导致“有害输出”,我们的研究旨在通过遗忘(unlearning)技术来缓解这种情况。然而,现有的基于梯度上升的遗忘方法会显著损害LM的性能。为了解决这个问题,我们提出了一种新的方法,称为加权位置N对(WPN)学习,它在n对对比学习框架中利用位置加权平均池化。WPN旨在通过消除特定的有害输出(例如,用中性回复替换有毒回复)来修改LM的输出分布,从而将模型的行为从“有害提示-有害输出”转变为“有害提示-无害回复”。在OPT和GPT-NEO LM上的实验表明,WPN有效地减少了有害回复的比例,无害率最高可达95.8%,同时在9个常用基准上保持了稳定的性能(平均下降不到2%)。此外,我们提供了经验证据,证明WPN在泛化能力和稳健性方面能够削弱有害的对应关系,这是在分布外测试集上和对抗攻击下评估得到的。

[NLP-53] Identifying Speakers and Addressees of Quotations in Novels with Prompt Learning NLPCC2024
[NLP-53] 基于提示学习识别小说引语的说话者和受话者

链接: https://arxiv.org/abs/2408.09452
作者: Yuchen Yan,Hanjie Zhao,Senbin Zhu,Hongde Liu,Zhihong Zhang,Yuxiang Jia
关键词-EN: drive plot development, reflect character relationships, reflect character, create characters, literary works
关键词-ZH: 推动情节发展,反映人物关系,反映人物,创造人物,文学作品
类目: Computation and Language (cs.CL)
备注: This paper has been accepted by NLPCC 2024

点击查看摘要

Abstract:Quotations in literary works, especially novels, are important to create characters, reflect character relationships, and drive plot development. Current research on quotation extraction in novels primarily focuses on quotation attribution, i.e., identifying the speaker of the quotation. However, the addressee of the quotation is also important to construct the relationship between the speaker and the addressee. To tackle the problem of dataset scarcity, we annotate the first Chinese quotation corpus with elements including speaker, addressee, speaking mode and linguistic cue. We propose prompt learning-based methods for speaker and addressee identification based on fine-tuned pre-trained models. Experiments on both Chinese and English datasets show the effectiveness of the proposed methods, which outperform methods based on zero-shot and few-shot large language models.
摘要:文学作品尤其是小说中的引语对于塑造人物、反映人物关系、推动情节发展具有重要意义。目前对小说引语抽取的研究主要集中在引语归属上,即识别引语的说话者。然而,引语的受话者对于构建说话者和受话者之间的关系也很重要。为了解决数据集稀缺的问题,我们标注了第一个中文引语语料库,标注元素包括说话者、受话者、说话方式和语言线索。我们基于微调的预训练模型,提出了基于提示学习的说话者和受话者识别方法。在中文和英语数据集上的实验表明了所提出方法的有效性,其性能优于基于零样本和少样本大语言模型的方法。

[NLP-54] Hindi-BEIR : A Large Scale Retrieval Benchmark in Hindi
[NLP-54] Hindi-BEIR:印地语大规模检索基准

链接: https://arxiv.org/abs/2408.09437
作者: Arkadeep Acharya,Rudra Murthy,Vishwajeet Kumar,Jaydeep Sen
关键词-EN: Hindi speakers worldwide, information retrieval systems, efficient information retrieval, Hindi, speakers worldwide
关键词-ZH: 全球印地语使用者、信息检索系统、高效信息检索、印地语、全球印地语使用者
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Given the large number of Hindi speakers worldwide, there is a pressing need for robust and efficient information retrieval systems for Hindi. Despite ongoing research, there is a lack of comprehensive benchmark for evaluating retrieval models in Hindi. To address this gap, we introduce the Hindi version of the BEIR benchmark, which includes a subset of English BEIR datasets translated to Hindi, existing Hindi retrieval datasets, and synthetically created datasets for retrieval. The benchmark is comprised of 15 datasets spanning across 8 distinct tasks. We evaluate state-of-the-art multilingual retrieval models on this benchmark to identify task and domain-specific challenges and their impact on retrieval performance. By releasing this benchmark and a set of relevant baselines, we enable researchers to understand the limitations and capabilities of current Hindi retrieval models, promoting advancements in this critical area. The datasets from Hindi-BEIR are publicly available.
摘要:由于世界各地有大量的印地语使用者,迫切需要稳健而高效的印地语信息检索系统。尽管相关研究正在进行,但仍缺乏评估印地语检索模型的全面基准。为了弥补这一差距,我们引入了BEIR基准的印地语版本,其中包括翻译成印地语的英语BEIR数据集的子集、现有的印地语检索数据集以及为检索合成创建的数据集。该基准由涵盖8个不同任务的15个数据集组成。我们在这个基准上评估了最先进的多语言检索模型,以确定特定于任务和领域的挑战及其对检索性能的影响。通过发布这一基准和一组相关基线,我们使研究人员能够了解当前印地语检索模型的局限性和能力,促进这一关键领域的进步。Hindi-BEIR的数据集是公开可用的。

[NLP-55] HySem: A context length optimized LLM pipeline for unstructured tabular extraction
[NLP-55] HySem:一个上下文长度优化的LLM管道,用于非结构化表格提取

链接: https://arxiv.org/abs/2408.09434
作者: Narayanan PP,Anantharaman Palacode Narayana Iyer
关键词-EN: Regulatory compliance reporting, Regulatory compliance, compliance reporting, relies on detailed, unstructured format
关键词-ZH: 监管合规性报告,监管合规性,合规性报告,依赖于详细的、非结构化的格式
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 pages, 4 tables, 3 figures, 1 algorithm

点击查看摘要

Abstract:Regulatory compliance reporting in the pharmaceutical industry relies on detailed tables, but these are often under-utilized beyond compliance due to their unstructured format and arbitrary content. Extracting and semantically representing tabular data is challenging due to diverse table presentations. Large Language Models (LLMs) demonstrate substantial potential for semantic representation, yet they encounter challenges related to accuracy and context size limitations, which are crucial considerations for the industry applications. We introduce HySem, a pipeline that employs a novel context length optimization technique to generate accurate semantic JSON representations from HTML tables. This approach utilizes a custom fine-tuned model specifically designed for cost- and privacy-sensitive small and medium pharmaceutical enterprises. Running on commodity hardware and leveraging open-source models, our auto-correcting agents rectify both syntax and semantic errors in LLM-generated content. HySem surpasses its peer open-source models in accuracy and provides competitive performance when benchmarked against OpenAI GPT-4o and effectively addresses context length limitations, which is a crucial factor for supporting larger tables.
摘要:制药行业的合规报告依赖于详细的表格,但由于其格式不结构化和内容随意性,这些表格往往在合规之外没有得到充分利用。由于表格表示的多样性,提取和语义表示表格数据是具有挑战性的。大型语言模型在语义表示方面显示出巨大的潜力,但它们遇到了与准确性和上下文大小限制相关的挑战,这是行业应用的关键考虑因素。我们介绍了HySem,这是一种管道,它采用了一种新的上下文长度优化技术来从HTML表中生成准确的语义JSON表示。这种方法利用了专门为对成本和隐私敏感的中小型制药企业设计的定制微调模型。我们的自动更正代理运行在商用硬件上,并利用开源模型,可以纠正LLM生成的内容中的语法和语义错误。HySem在精确度上超过了其同行的开源模型,在与OpenAI GPT-4o进行基准测试时提供了具有竞争力的性能,并有效地解决了上下文长度限制,这是支持更大表格的关键因素。

[NLP-56] FASST: Fast LLM-based Simultaneous Speech Translation
[NLP-56] FASST:基于LLM的快速同步语音翻译

链接: https://arxiv.org/abs/2408.09430
作者: Siqi Ouyang,Xi Xu,Chinmay Dandekar,Lei Li
关键词-EN: generates text translation, SST, generates text, streaming speech input, speech
关键词-ZH: 生成文本翻译、SST、生成文本、流语音输入、语音
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Simultaneous speech translation (SST) takes streaming speech input and generates text translation on the fly. Existing methods either have high latency due to recomputation of input representations, or fall behind offline ST in translation quality. In this paper, we propose FASST, a fast large language model based method for streaming speech translation. We propose blockwise-causal speech encoding and consistency mask, so that streaming speech input can be encoded incrementally without recomputation. Furthermore, we develop a two-stage training strategy to optimize FASST for simultaneous inference. We evaluate FASST and multiple strong prior models on the MuST-C dataset. Experiment results show that FASST achieves the best quality-latency trade-off. It outperforms the previous best model by an average of 1.5 BLEU under the same latency for English to Spanish translation.
摘要:同步语音翻译(SST)接受流式语音输入并实时生成文本翻译。现有方法要么由于输入表示的重新计算而具有高延迟,要么在翻译质量方面落后于离线ST。在本文中,我们提出了FASST,这是一种基于大语言模型的快速流式语音翻译方法。我们提出了块因果语音编码和一致性掩码,以便流式语音输入可以增量编码,而无需重新计算。此外,我们开发了一种两阶段训练策略来优化FASST以实现同步推理。我们在MuST-C数据集上评估FASST和多个强大的已有模型。实验结果表明,FASST实现了最佳的质量-延迟权衡。在英语到西班牙语翻译的相同延迟下,它比之前的最佳模型平均高出1.5 BLEU。
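
One plausible reading of "blockwise-causal" encoding in the abstract above is an attention mask in which each speech frame may attend to every frame in its own block and in earlier blocks, but never to later blocks. The sketch below builds such a mask; this interpretation, the function name, and the `block_size` parameter are assumptions for illustration and may differ from what FASST actually implements.

```python
def blockwise_causal_mask(num_frames, block_size):
    """mask[i][j] is True when frame i may attend to frame j.

    Frames are grouped into consecutive blocks of `block_size`; attention
    is full within a block and causal across blocks.
    """
    return [
        [(j // block_size) <= (i // block_size) for j in range(num_frames)]
        for i in range(num_frames)
    ]


for row in blockwise_causal_mask(num_frames=4, block_size=2):
    print(["x" if allowed else "." for allowed in row])
```

Because no frame ever depends on a later block, encoder states for blocks that have already arrived stay valid when new audio streams in, which is what allows incremental encoding without recomputation.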

[NLP-57] Distinguish Confusion in Legal Judgment Prediction via Revised Relation Knowledge
[NLP-57] 通过修改关系知识区分法律判断预测中的混乱

链接: https://arxiv.org/abs/2408.09422
作者: Nuo Xu,Pinghui Wang,Junzhou Zhao,Feiyang Sun,Lin Lan,Jing Tao,Li Pan,Xiaohong Guan
关键词-EN: Legal Judgment Prediction, case judgment results, Legal Judgment, Judgment Prediction, judgment results based
关键词-ZH: 法律判决预测、案件判决结果、法律判决、判决预测、判决结果基于
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by ACM TOIS

点击查看摘要

Abstract:Legal Judgment Prediction (LJP) aims to automatically predict a law case’s judgment results based on the text description of its facts. In practice, the confusing law articles (or charges) problem frequently occurs, reflecting that the law cases applicable to similar articles (or charges) tend to be misjudged. Although some recent works based on prior knowledge solve this issue well, they ignore that confusion also occurs between law articles with a high posterior semantic similarity due to the data imbalance problem instead of only between the prior highly similar ones, which is this work’s further finding. This paper proposes an end-to-end model named D-LADAN to solve the above challenges. On the one hand, D-LADAN constructs a graph among law articles based on their text definition and proposes a graph distillation operation (GDO) to distinguish the ones with a high prior semantic similarity. On the other hand, D-LADAN presents a novel momentum-updated memory mechanism to dynamically sense the posterior similarity between law articles (or charges) and a weighted GDO to adaptively capture the distinctions for revising the inductive bias caused by the data imbalance problem. We perform extensive experiments to demonstrate that D-LADAN significantly outperforms state-of-the-art methods in accuracy and robustness.
摘要:法律判决预测(LJP)旨在根据案件事实的文本描述自动预测案件的判决结果。在实践中,易混淆法条(或罪名)的问题时有发生,这反映出适用于相似法条(或罪名)的案件往往会被误判。虽然最近的一些基于先验知识的工作很好地解决了这个问题,但它们忽略了由于数据不平衡问题,在后验语义相似度较高的法条之间也会发生混淆,而不仅仅是在先验高度相似的法条之间,这是本工作的进一步发现。为了解决上述挑战,本文提出了一种名为D-LADAN的端到端模型。一方面,D-LADAN根据法条的文本定义构造法条之间的图,并提出一种图蒸馏操作(GDO)来区分先验语义相似度较高的法条。另一方面,D-LADAN提出了一种新颖的动量更新记忆机制来动态感知法条(或罪名)之间的后验相似性,并提出了一种加权GDO来自适应地捕捉差异,以修正数据不平衡问题造成的归纳偏差。我们通过大量的实验证明,D-LADAN在准确率和稳健性方面都明显优于最先进的方法。

[NLP-58] Challenges and Responses in the Practice of Large Language Models
[NLP-58] 大型语言模型实践中的挑战与应对

链接: https://arxiv.org/abs/2408.09416
作者: Hongyin Zhu
关键词-EN: carefully summarizes extensive, paper carefully summarizes, covering multiple dimensions, academic research, high-profile AI field
关键词-ZH: 仔细总结广泛,论文仔细总结,涵盖多维度、学术研究、备受瞩目的人工智能领域
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper carefully summarizes extensive and profound questions from all walks of life, focusing on the current high-profile AI field, covering multiple dimensions such as industry trends, academic research, technological innovation and business applications. This paper meticulously curates questions that are both thought-provoking and practically relevant, providing nuanced and insightful answers to each. To facilitate readers’ understanding and reference, this paper specifically classifies and organizes these questions systematically and meticulously from the five core dimensions of computing power infrastructure, software architecture, data resources, application scenarios, and brain science. This work aims to provide readers with a comprehensive, in-depth and cutting-edge AI knowledge framework to help people from all walks of life grasp the pulse of AI development, stimulate innovative thinking, and promote industrial progress.
摘要:本文认真总结了各行各业广泛而深刻的问题,重点关注当前备受瞩目的人工智能领域,涵盖行业趋势、学术研究、技术创新和商业应用等多个维度。本文精心策划了发人深省且具有实际意义的问题,为每个问题提供了细致入微且富有洞察力的答案。为了方便读者理解和参考,本文从计算能力基础设施、软件架构、数据资源、应用场景、脑科学五个核心维度对这些问题进行了具体的系统、细致的分类和组织。本作品旨在为读者提供全面、深入、前沿的人工智能知识框架,帮助各行各业的人们把握人工智能发展脉搏,激发创新思维,推动产业进步。

[NLP-59] Comparison between the Structures of Word Co-occurrence and Word Similarity Networks for Ill-formed and Well-formed Texts in Taiwan Mandarin
[NLP-59] 台湾普通话非规范文本与规范文本的词共现网络和词相似度网络结构比较

链接: https://arxiv.org/abs/2408.09404
作者: Po-Hsuan Huang,Hsuan-Lei Shao
关键词-EN: word co-occurrence networks, word co-occurrence, co-occurrence networks, co-occurrence networks built, co-occurrence
关键词-ZH: 词同现网络,词同现,同现网络,建立的同现网络,同现
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 4 pages, 1 figure, 5 tables

点击查看摘要

Abstract:The study of word co-occurrence networks has attracted the attention of researchers due to their potential significance as well as applications. Understanding the structure of word co-occurrence networks is therefore important to fully realize their significance and usages. In past studies, word co-occurrence networks built on well-formed texts have been found to possess certain characteristics, including being small-world, following a two-regime power law distribution, and being generally disassortative. On the flip side, past studies have found that word co-occurrence networks built from ill-formed texts such as microblog posts may behave differently from those built from well-formed documents. While both kinds of word co-occurrence networks are small-world and disassortative, word co-occurrence networks built from ill-formed texts are scale-free and follow the power law distribution instead of the two-regime power law distribution. However, since past studies on the behavior of word co-occurrence networks built from ill-formed texts only investigated English, the universality of such characteristics remains to be seen among different languages. In addition, it is yet to be investigated whether there could be possible similarities/differences between word co-occurrence networks and other potentially comparable networks. This study therefore investigates and compares the structure of word co-occurrence networks and word similarity networks based on Taiwan Mandarin ill-formed internet forum posts, compares them with those built from well-formed judicial judgments, and seeks to find out whether the three aforementioned properties (scale-free, small-world, and disassortative) for ill-formed and well-formed texts are universal among different languages and between word co-occurrence and word similarity networks.
摘要:词共现网络的研究因其潜在的意义和应用而引起了研究者的关注。因此,了解词共现网络的结构对于充分发挥其意义和用途十分重要。以往的研究发现,基于规范文本构建的词共现网络具有若干特征,包括小世界性、服从双段幂律分布,以及总体上呈异配性。另一方面,以往的研究也发现,基于不规范文本(如微博帖子)构建的词共现网络,其表现可能不同于基于规范文档构建的网络。虽然这两类词共现网络都具有小世界性和异配性,但基于不规范文本构建的词共现网络是无标度的,服从幂律分布,而非双段幂律分布。然而,由于以往针对不规范文本词共现网络的研究只考察了英语,这些特征在不同语言中是否具有普遍性仍有待观察。此外,词共现网络与其他可能具有可比性的网络之间是否存在相似或差异之处,也有待研究。因此,本研究基于台湾华语不规范的网络论坛帖子,考察并比较词共现网络与词相似度网络的结构,并将其与基于规范的司法判决书构建的网络进行比较,以探究上述三种性质(无标度、小世界和异配性)对于不规范文本和规范文本而言,在不同语言之间以及词共现网络与词相似度网络之间是否具有普遍性。

[NLP-60] Game Development as Human-LLM Interaction
[NLP-60] 将游戏开发视为人与LLM的交互

链接: https://arxiv.org/abs/2408.09386
作者: Jiale Hong,Hongqiu Wu,Hai Zhao
关键词-EN: highly specialized task, complex game engine, complex programming languages, Interaction-driven Game Engine, game engine powered
关键词-ZH: 高度专业化的任务、复杂的游戏引擎、复杂的编程语言、交互驱动的游戏引擎、游戏引擎驱动的
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:

Abstract:Game development is a highly specialized task that relies on a complex game engine powered by complex programming languages, preventing many gaming enthusiasts from handling it. This paper introduces the Interaction-driven Game Engine (IGE) powered by LLM, which allows everyone to develop a custom game using natural language through Human-LLM interaction. To enable an LLM to function as an IGE, we instruct it to perform the following processes in each turn: (1) P_script: configure the game script segment based on the user’s input; (2) P_code: generate the corresponding code snippet based on the game script segment; (3) P_utter: interact with the user, including guidance and feedback. We propose a data synthesis pipeline based on the LLM to generate game script-code pairs and interactions from a few manually crafted seed data. We propose a three-stage progressive training strategy to transfer the dialogue-based LLM to our IGE smoothly. We construct an IGE for poker games as a case study and comprehensively evaluate it from two perspectives: interaction quality and code correctness. The code and data are available at this https URL.
摘要:游戏开发是一项高度专业化的任务,它依赖于由复杂编程语言驱动的复杂游戏引擎,这使许多游戏爱好者望而却步。本文介绍了由LLM驱动的交互驱动游戏引擎(IGE),它让每个人都能通过人与LLM的交互,使用自然语言开发自定义游戏。为了使LLM能够充当IGE,我们指示它在每个回合中执行以下过程:(1)P_script:根据用户的输入配置游戏脚本片段;(2)P_code:根据游戏脚本片段生成相应的代码片段;(3)P_utter:与用户交互,包括引导和反馈。我们提出了一种基于LLM的数据合成流水线,用于从少量手工制作的种子数据中生成游戏脚本-代码对和交互数据。我们提出了一种三阶段渐进式训练策略,将基于对话的LLM平滑地迁移为我们的IGE。我们构建了一个用于扑克游戏的IGE作为案例研究,并从交互质量和代码正确性两个角度对其进行了全面评估。代码和数据可在此 https URL 获取。

[NLP-61] Offline RLHF Methods Need More Accurate Supervision Signals
[NLP-61] 离线RLHF方法需要更准确的监督信号

链接: https://arxiv.org/abs/2408.09385
作者: Shiqi Wang,Zhengze Zhang,Rui Zhao,Fei Tan,Cam Tu Nguyen
关键词-EN: Large Language Models, Large Language, advances in Large, Language Models, increasingly important
关键词-ZH: 大型语言模型,大型语言,大型语言模型的进步,越来越重要
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: under review

Abstract:With the rapid advances in Large Language Models (LLMs), aligning LLMs with human preferences becomes increasingly important. Although Reinforcement Learning with Human Feedback (RLHF) proves effective, it is complicated and highly resource-intensive. As such, offline RLHF has been introduced as an alternative solution, which directly optimizes LLMs with ranking losses on a fixed preference dataset. Current offline RLHF only captures the “ordinal relationship” between responses, overlooking the crucial aspect of “how much” one is preferred over the others. To address this issue, we propose a simple yet effective solution called Reward Difference Optimization, abbreviated as RDO. Specifically, we introduce reward difference coefficients to reweigh sample pairs in offline RLHF. We then develop a difference model involving rich interactions between a pair of responses for predicting these difference coefficients. Experiments with 7B LLMs on the HH and TL;DR datasets substantiate the effectiveness of our method in both automatic metrics and human evaluation, thereby highlighting its potential for aligning LLMs with human intent and values.
摘要:随着大语言模型(LLM)的快速发展,使LLM与人类偏好保持一致变得越来越重要。虽然基于人类反馈的强化学习(RLHF)被证明是有效的,但它较为复杂且高度消耗资源。因此,离线RLHF被引入作为一种替代方案,它在固定的偏好数据集上以排序损失直接优化LLM。目前的离线RLHF只捕获了回答之间的“顺序关系”,而忽略了一个回答比其他回答“好多少”这一关键方面。为了解决这个问题,我们提出了一种简单而有效的解决方案,称为奖励差异优化(Reward Difference Optimization),简称RDO。具体地说,我们引入奖励差异系数,对离线RLHF中的样本对重新加权。然后,我们构建了一个差异模型,利用一对回答之间丰富的交互来预测这些差异系数。在HH和TL;DR数据集上使用7B LLM进行的实验证明了我们的方法在自动指标和人工评估两方面的有效性,从而突出了其使LLM与人类意图和价值观保持一致的潜力。
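
The reweighting idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact loss, so the logistic pairwise ranking loss below, and the names `weighted_pairwise_loss`, `margins`, and `coefficients`, are assumptions chosen only to show how a per-pair coefficient could scale each pair's contribution.

```python
import math


def weighted_pairwise_loss(margins, coefficients):
    """Mean of -w * log(sigmoid(margin)) over preference pairs.

    margins: score(chosen) - score(rejected) for each pair (hypothetical input).
    coefficients: one non-negative weight per pair, standing in for the
        "reward difference coefficients" named in the abstract (assumed form).
    """
    if not margins or len(margins) != len(coefficients):
        raise ValueError("margins and coefficients must be non-empty and equal in length")
    total = 0.0
    for margin, weight in zip(margins, coefficients):
        # Numerically stable log(sigmoid(margin)).
        if margin >= 0:
            log_sigmoid = -math.log1p(math.exp(-margin))
        else:
            log_sigmoid = margin - math.log1p(math.exp(margin))
        total += -weight * log_sigmoid
    return total / len(margins)
```

With every coefficient set to 1.0 this reduces to an ordinary unweighted ranking loss; a larger coefficient makes a pair with a clearer preference gap count for more.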

[NLP-62] Improving and Assessing the Fidelity of Large Language Models Alignment to Online Communities
[NLP-62] 改进和评估大型语言模型与在线社区的一致性

链接: https://arxiv.org/abs/2408.09366
作者: Minh Duc Chu,Zihao He,Rebecca Dorn,Kristina Lerman
关键词-EN: Large language models, complex social dynamics, study complex social, Large language, shown promise
关键词-ZH: 大型语言模型,复杂的社会动态,研究复杂的社会,大型语言,显示出希望
类目: Computation and Language (cs.CL); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
备注:

Abstract:Large language models (LLMs) have shown promise in representing individuals and communities, offering new ways to study complex social dynamics. However, effectively aligning LLMs with specific human groups and systematically assessing the fidelity of the alignment remains a challenge. This paper presents a robust framework for aligning LLMs with online communities via instruction-tuning and comprehensively evaluating alignment across various aspects of language, including authenticity, emotional tone, toxicity, and harm. We demonstrate the utility of our approach by applying it to online communities centered on dieting and body image. We administer an eating disorder psychometric test to the aligned LLMs to reveal unhealthy beliefs and successfully differentiate communities with varying levels of eating disorder risk. Our results highlight the potential of LLMs in automated moderation and broader applications in public health and social science research.
摘要:大型语言模型(LLM)在代表个人和社区方面表现出了潜力,为研究复杂的社会动态提供了新的方法。然而,有效地将LLM与特定人类群体对齐并系统地评估对齐的保真度仍然是一个挑战。本文提出了一个稳健的框架,通过指令微调将LLM与在线社区对齐,并从语言的多个方面(包括真实性、情感基调、毒性和危害)全面评估对齐效果。我们通过将其应用于以节食和身体形象为中心的在线社区来证明我们方法的实用性。我们对对齐后的LLM进行饮食障碍心理测量测试,以揭示不健康的信念,并成功区分具有不同饮食障碍风险水平的社区。我们的结果凸显了LLM在自动化内容审核方面的潜力,以及在公共卫生和社会科学研究中更广泛的应用前景。

[NLP-63] Concept Distillation from Strong to Weak Models via Hypotheses-to-Theories Prompting
[NLP-63] 通过“假设到理论”提示实现从强模型到弱模型的概念蒸馏

链接: https://arxiv.org/abs/2408.09365
作者: Emmanuel Aboah Boateng,Cassiano O. Becker,Nabiha Asghar,Kabir Walia,Ashwin Srinivasan,Ehi Nosakhare,Victor Dibia,Soundar Srinivasan
关键词-EN: Hand-crafting high quality, Hand-crafting high, high quality prompts, labor-intensive process, high quality
关键词-ZH: 手工制作高品质,手工制作高,高品质提示,劳动密集型流程,高品质
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 13 pages, 8 figures, conference

Abstract:Hand-crafting high quality prompts to optimize the performance of language models is a complicated and labor-intensive process. Furthermore, when migrating to newer, smaller, or weaker models (possibly due to latency or cost gains), prompts need to be updated to re-optimize the task performance. We propose Concept Distillation (CD), an automatic prompt optimization technique for enhancing weaker models on complex tasks. CD involves: (1) collecting mistakes made by weak models with a base prompt (initialization), (2) using a strong model to generate reasons for these mistakes and create rules/concepts for weak models (induction), and (3) filtering these rules based on validation set performance and integrating them into the base prompt (deduction/verification). We evaluated CD on NL2Code and mathematical reasoning tasks, observing significant performance boosts for small and weaker language models. Notably, Mistral-7B’s accuracy on Multi-Arith increased by 20%, and Phi-3-mini-3.8B’s accuracy on HumanEval rose by 34%. Compared to other automated methods, CD offers an effective, cost-efficient strategy for improving weak models’ performance on complex tasks and enables seamless workload migration across different language models without compromising performance.
摘要:手工编写高质量的提示以优化语言模型的性能是一个复杂且劳动密集型的过程。此外,当迁移到更新、更小或更弱的模型时(可能是出于降低延迟或成本的考虑),需要更新提示以重新优化任务性能。我们提出了概念蒸馏(CD),这是一种自动提示优化技术,用于增强较弱模型在复杂任务上的表现。CD包括:(1)收集弱模型在基础提示下所犯的错误(初始化),(2)使用强模型生成这些错误的原因并为弱模型创建规则/概念(归纳),以及(3)基于验证集性能过滤这些规则并将其整合到基础提示中(演绎/验证)。我们在NL2Code和数学推理任务上对CD进行了评估,观察到小型和较弱语言模型的显著性能提升。值得注意的是,Mistral-7B在Multi-Arith上的准确率提高了20%,而Phi-3-mini-3.8B在HumanEval上的准确率提高了34%。与其他自动化方法相比,CD为提升弱模型在复杂任务上的性能提供了一种有效且经济的策略,并支持在不牺牲性能的情况下在不同语言模型之间无缝迁移工作负载。
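
The three CD steps listed in the abstract (initialization, induction, deduction/verification) can be sketched as a short loop. This is an illustration under stated assumptions, not the paper's code: `weak_model` and `induce_rule` are hypothetical callables standing in for the weak and strong models, and the greedy "keep a rule only if validation accuracy improves" filter is one possible reading of step (3).

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (input, expected output)


def accuracy(model: Callable[[str, str], str], prompt: str, data: List[Example]) -> float:
    """Fraction of examples the model answers exactly right under this prompt."""
    if not data:
        return 0.0
    return sum(1 for x, y in data if model(prompt, x) == y) / len(data)


def concept_distillation(
    weak_model: Callable[[str, str], str],         # hypothetical: (prompt, input) -> output
    induce_rule: Callable[[str, str, str], str],   # hypothetical strong model: (input, expected, wrong) -> rule
    base_prompt: str,
    train: List[Example],
    valid: List[Example],
) -> str:
    # (1) Initialization: collect the weak model's mistakes under the base prompt.
    outputs = [(x, y, weak_model(base_prompt, x)) for x, y in train]
    mistakes = [(x, y, out) for x, y, out in outputs if out != y]

    # (2) Induction: ask the strong model for one rule per mistake.
    candidate_rules = [induce_rule(x, y, out) for x, y, out in mistakes]

    # (3) Deduction/verification: keep a rule only if it raises validation accuracy.
    prompt = base_prompt
    best = accuracy(weak_model, prompt, valid)
    for rule in dict.fromkeys(candidate_rules):  # de-duplicate, keep order
        trial = prompt + "\n" + rule
        score = accuracy(weak_model, trial, valid)
        if score > best:
            prompt, best = trial, score
    return prompt
```

In practice both callables would wrap calls to real language models; exact-match scoring is also a simplification chosen to keep the sketch self-contained.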

[NLP-64] SkyScript-100M: 1000000000 Pairs of Scripts and Shooting Scripts for Short Drama
[NLP-64] SkyScript-100M:1,000,000,000对短剧剧本与拍摄脚本

链接: https://arxiv.org/abs/2408.09333
作者: Jing Tang,Quanlu Jia,Yuqiang Xie,Zeyu Gong,Xiang Wen,Jiayi Zhang,Yalong Guo,Guibin Chen,Jiangping Yang
关键词-EN: Generating high-quality shooting, Generating high-quality, short drama, short, scene and shot
关键词-ZH: 生成高质量的拍摄,生成高质量的短剧、短片、场景和镜头
类目: Computation and Language (cs.CL)
备注: 18 pages, 12 figures

Abstract:Generating high-quality shooting scripts containing information such as scene and shot language is essential for short drama script generation. We collect 6,660 popular short drama episodes from the Internet, each with an average of 100 short episodes, and the total number of short episodes is about 80,000, with a total duration of about 2,000 hours and totaling 10 terabytes (TB). We perform keyframe extraction and annotation on each episode to obtain about 10,000,000 shooting scripts. We perform 100 script restorations on the extracted shooting scripts based on our self-developed large short drama generation model SkyReels. This leads to a dataset containing 1,000,000,000 pairs of scripts and shooting scripts for short dramas, called SkyScript-100M. We compare SkyScript-100M with the existing dataset in detail and demonstrate some deeper insights that can be achieved based on SkyScript-100M. Based on SkyScript-100M, researchers can achieve several deeper and more far-reaching script optimization goals, which may drive a paradigm shift in the entire field of text-to-video and significantly advance the field of short drama video generation. The data and code are available at this https URL.
摘要:生成包含场景和镜头语言等信息的高质量拍摄脚本,对于短剧剧本生成至关重要。我们从互联网上收集了6,660部热门短剧,每部平均100集,短集总数约8万集,总时长约2,000小时,总计10TB。我们对每集进行关键帧提取和标注,获得约10,000,000个拍摄脚本。我们基于自研的大型短剧生成模型SkyReels,对提取的拍摄脚本进行了100次剧本还原。由此得到一个包含1,000,000,000对短剧剧本与拍摄脚本的数据集,称为SkyScript-100M。我们将SkyScript-100M与现有数据集进行了详细比较,并展示了基于SkyScript-100M可以获得的一些更深层次的见解。基于SkyScript-100M,研究人员可以实现若干更深入、更深远的剧本优化目标,这可能会推动整个文本到视频领域的范式转变,并显著推进短剧视频生成领域的发展。数据和代码可在此 https URL 获取。

[NLP-65] Fostering Natural Conversation in Large Language Models with NICO: a Natural Interactive COnversation dataset
[NLP-65] 利用NICO培养大型语言模型的自然对话能力:一个自然交互式对话数据集

链接: https://arxiv.org/abs/2408.09330
作者: Renliang Sun,Mengyuan Liu,Shiping Yang,Rui Wang,Junqing He,Jiaxing Zhang
关键词-EN: Large Language Models, contemporary Large Language, Language Models, Large Language, contemporary Large
关键词-ZH: 大型语言模型,当代大型语言,语言模型,大型语言,当代大型
类目: Computation and Language (cs.CL)
备注: 16 pages, 3 figures, 10 tables

Abstract:Benefiting from diverse instruction datasets, contemporary Large Language Models (LLMs) perform effectively as AI assistants in collaborating with humans. However, LLMs still struggle to generate natural and colloquial responses in real-world applications such as chatbots and psychological counseling that require more human-like interactions. To address these limitations, we introduce NICO, a Natural Interactive COnversation dataset in Chinese. We first use GPT-4-turbo to generate dialogue drafts and make them cover 20 daily-life topics and 5 types of social interactions. Then, we hire workers to revise these dialogues to ensure that they are free of grammatical errors and unnatural utterances. We define two dialogue-level natural conversation tasks and two sentence-level tasks for identifying and rewriting unnatural sentences. Multiple open-source and closed-source LLMs are tested and analyzed in detail. The experimental results highlight the challenge of the tasks and demonstrate how NICO can help foster the natural dialogue capabilities of LLMs. The dataset will be released.
摘要:得益于多样化的指令数据集,当代大型语言模型(LLM)在与人类协作时能够有效地充当AI助手。然而,在聊天机器人和心理咨询等需要更拟人化交互的真实应用中,LLM仍然难以生成自然且口语化的回复。为了解决这些局限,我们引入了NICO,这是一个中文自然交互对话(Natural Interactive COnversation)数据集。我们首先使用GPT-4-turbo生成对话草稿,使其涵盖20个日常生活主题和5种社交互动类型。然后,我们聘请人员修订这些对话,以确保其中没有语法错误和不自然的表达。我们定义了两个对话级的自然对话任务,以及两个用于识别和改写不自然句子的句子级任务。我们对多个开源和闭源LLM进行了详细的测试和分析。实验结果凸显了这些任务的挑战性,并展示了NICO如何帮助培养LLM的自然对话能力。数据集将会发布。

[NLP-66] Threshold Filtering Packing for Supervised Fine-Tuning: Training Related Samples within Packs
[NLP-66] 用于监督微调的阈值过滤打包:在包内训练相关样本

链接: https://arxiv.org/abs/2408.09327
作者: Jiancheng Dong,Lei Jiang,Wei Jin,Lu Cheng
关键词-EN: facilitate GPU processing, designed maximum length, concatenating data points, models involves concatenating, autoregressive models involves
关键词-ZH: 促进图形处理、设计最大长度、连接数据点、模型涉及连接、自回归模型涉及
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 13 pages, 4 figures

Abstract:Packing for Supervised Fine-Tuning (SFT) in autoregressive models involves concatenating data points of varying lengths until reaching the designed maximum length to facilitate GPU processing. However, randomly concatenating data points and feeding them into an autoregressive transformer can lead to cross-contamination of sequences due to the significant difference in their subject matter. The mainstream approaches in SFT ensure that each token in the attention calculation phase only focuses on tokens within its own short sequence, without providing additional learning signals for the preceding context. To address these challenges, we introduce Threshold Filtering Packing (TFP), a method that selects samples with related context while maintaining sufficient diversity within the same pack. Our experiments show that TFP offers a simple-to-implement and scalable approach that significantly enhances SFT performance, with observed improvements of up to 7% on GSM8K, 4% on HumanEval, and 15% on the adult-census-income dataset.
摘要:在自回归模型的监督微调(SFT)中,打包(packing)是指将不同长度的数据点拼接起来,直到达到设定的最大长度,以便于GPU处理。然而,随机拼接数据点并将其输入自回归Transformer,可能会因各数据点主题差异显著而导致序列间的交叉污染。SFT中的主流方法确保注意力计算阶段的每个token只关注其自身短序列中的token,而不为前面的上下文提供额外的学习信号。为了应对这些挑战,我们引入了阈值过滤打包(TFP),这种方法在同一个包内保持足够多样性的同时,选择上下文相关的样本。我们的实验表明,TFP是一种易于实现且可扩展的方法,能显著提高SFT的性能,观察到的改进在GSM8K上最高达7%,在HumanEval上达4%,在adult-census-income数据集上达15%。
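
The baseline packing step described at the start of this abstract (concatenating samples up to a maximum length) can be sketched as follows. This shows plain greedy packing only, not TFP itself, whose sample-selection rule the abstract does not specify. The function name `pack_sequences` and the per-token segment IDs, which a trainer could use to restrict attention to each token's own sample, are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def pack_sequences(
    sequences: Sequence[Sequence[int]], max_len: int
) -> List[Tuple[List[int], List[int]]]:
    """Greedily concatenate token sequences into packs of at most max_len tokens.

    Returns (tokens, segment_ids) per pack; segment_ids mark which original
    sample each token came from, so attention can be masked per sample.
    """
    packs: List[Tuple[List[int], List[int]]] = []
    tokens: List[int] = []
    segments: List[int] = []
    for seq in sequences:
        if len(seq) > max_len:
            raise ValueError("a single sequence exceeds max_len")
        if len(tokens) + len(seq) > max_len:
            # Current pack is full: emit it and start a new one.
            packs.append((tokens, segments))
            tokens, segments = [], []
        segment_id = segments[-1] + 1 if segments else 0
        tokens = tokens + list(seq)
        segments = segments + [segment_id] * len(seq)
    if tokens:
        packs.append((tokens, segments))
    return packs
```

A method like TFP would change which samples end up next to each other, for example by ordering `sequences` before packing; the concatenation mechanics stay the same.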

[NLP-67] Characterizing and Evaluating the Reliability of LLMs against Jailbreak Attacks
[NLP-67] 描述和评估LLM针对越狱攻击的可靠性

链接: https://arxiv.org/abs/2408.09326
作者: Kexin Chen,Yi Liu,Dongxia Wang,Jiaying Chen,Wenhai Wang
关键词-EN: Large Language Models, Large Language, notable societal impact, Language Models, societal impact
关键词-ZH: 大型语言模型,大型语言,显着的社会影响,语言模型,社会影响
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:

Abstract:Large Language Models (LLMs) have increasingly become pivotal in content generation with notable societal impact. These models hold the potential to generate content that could be deemed harmful. Efforts to mitigate this risk include implementing safeguards to ensure LLMs adhere to social ethics. However, despite such measures, the phenomenon of “jailbreaking”, where carefully crafted prompts elicit harmful responses from models, persists as a significant challenge. Recognizing the continuous threat posed by jailbreaking tactics and their repercussions for the trustworthy use of LLMs, a rigorous assessment of the models’ robustness against such attacks is essential. This study introduces a comprehensive evaluation framework and conducts a large-scale empirical experiment to address this need. We concentrate on 10 cutting-edge jailbreak strategies across three categories, 1525 questions from 61 specific harmful categories, and 13 popular LLMs. We adopt multi-dimensional metrics such as Attack Success Rate (ASR), Toxicity Score, Fluency, Token Length, and Grammatical Errors to thoroughly assess the LLMs’ outputs under jailbreak. By normalizing and aggregating these metrics, we present a detailed reliability score for different LLMs, coupled with strategic recommendations to reduce their susceptibility to such vulnerabilities. Additionally, we explore the relationships among the models, attack strategies, and types of harmful content, as well as the correlations between the evaluation metrics, which proves the validity of our multifaceted evaluation framework. Our extensive experimental results demonstrate a lack of resilience among all tested LLMs against certain strategies, and highlight the need to concentrate on the reliability facets of LLMs. We believe our study can provide valuable insights into enhancing the security evaluation of LLMs against jailbreak within the domain.
摘要:大型语言模型(LLM)日益成为内容生成的关键,并产生了显著的社会影响。这些模型有可能生成被认为有害的内容。缓解这种风险的措施包括实施防护机制,以确保LLM遵守社会伦理。然而,尽管采取了这些措施,“越狱”现象(即精心设计的提示诱使模型产生有害回应)仍然是一个重大挑战。鉴于越狱手段构成的持续威胁及其对LLM可信使用的影响,严格评估模型对此类攻击的鲁棒性至关重要。为了满足这一需求,本研究引入了一个综合评估框架,并进行了大规模的实证实验。我们聚焦于三个类别的10种前沿越狱策略、来自61个特定有害类别的1525个问题,以及13个流行的LLM。我们采用攻击成功率(ASR)、毒性分数、流畅度、token长度和语法错误等多维指标,全面评估越狱情况下LLM的输出。通过对这些指标进行归一化和聚合,我们为不同的LLM给出了详细的可靠性分数,并提出了降低其对此类漏洞敏感性的策略性建议。此外,我们还探讨了模型、攻击策略和有害内容类型之间的关系,以及各评估指标之间的相关性,从而证明了我们多维评估框架的有效性。我们大量的实验结果表明,所有被测LLM对某些策略都缺乏抵御能力,并凸显了关注LLM可靠性各个方面的必要性。我们相信,我们的研究可以为加强该领域内LLM抗越狱的安全评估提供有价值的见解。
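
The abstract mentions normalizing and aggregating several metrics into a reliability score but does not say how. The sketch below shows one common way to do that, as an assumption rather than the paper's formula: min-max normalize each metric across models, flip metrics where a higher value is worse, and take an unweighted mean. The function names and the example metric keys are hypothetical.

```python
from typing import Dict, List, Set


def min_max(values: List[float]) -> List[float]:
    """Scale values to [0, 1]; a constant metric maps to 0.5 for every model."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def reliability_scores(
    metrics: Dict[str, Dict[str, float]],  # metric name -> {model name: raw value}
    higher_is_worse: Set[str],             # metrics where a larger value means less reliable
) -> Dict[str, float]:
    """Unweighted mean of normalized metrics per model; higher means more reliable."""
    models = sorted(next(iter(metrics.values())))
    totals = {model: 0.0 for model in models}
    for name, per_model in metrics.items():
        normalized = min_max([per_model[model] for model in models])
        for model, value in zip(models, normalized):
            totals[model] += (1.0 - value) if name in higher_is_worse else value
    return {model: totals[model] / len(metrics) for model in models}
```

For example, with `attack_success_rate` and `toxicity` both listed in `higher_is_worse`, the model with the lowest values on both receives a score of 1.0.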

[NLP-68] An Open-Source American Sign Language Fingerspell Recognition and Semantic Pose Retrieval Interface
[NLP-68] 开源的美国手语指拼识别与语义姿态检索接口

链接: https://arxiv.org/abs/2408.09311
作者: Kevin Jose Thomas
关键词-EN: American Sign Language, Sign Language fingerspell, sign language translation, advanced sign language, Sign Language
关键词-ZH: 美国手语,手语手指拼写,手语翻译,高级手语,手语
类目: Computation and Language (cs.CL)
备注: 8 pages, 9 figures

Abstract:This paper introduces an open-source interface for American Sign Language fingerspell recognition and semantic pose retrieval, aimed to serve as a stepping stone towards more advanced sign language translation systems. Utilizing a combination of convolutional neural networks and pose estimation models, the interface provides two modular components: a recognition module for translating ASL fingerspelling into spoken English and a production module for converting spoken English into ASL pose sequences. The system is designed to be highly accessible, user-friendly, and capable of functioning in real-time under varying environmental conditions like backgrounds, lighting, skin tones, and hand sizes. We discuss the technical details of the model architecture, application in the wild, as well as potential future enhancements for real-world consumer applications.
摘要:本文介绍了一个用于美国手语指拼识别和语义姿态检索的开源接口,旨在作为迈向更先进手语翻译系统的垫脚石。该接口结合卷积神经网络和姿态估计模型,提供了两个模块化组件:用于将ASL指拼翻译为英语口语的识别模块,以及用于将英语口语转换为ASL姿态序列的生成模块。该系统的设计目标是高度易用、用户友好,并且能够在背景、光照、肤色和手部大小等不同环境条件下实时运行。我们讨论了模型架构的技术细节、在真实环境中的应用,以及面向现实世界消费级应用的潜在未来改进。

[NLP-69] CyberPal.AI: Empowering LLMs with Expert-Driven Cybersecurity Instructions
[NLP-69] CyberPal.AI:通过专家驱动的网络安全指令增强LLM

链接: https://arxiv.org/abs/2408.09304
作者: Matan Levi,Yair Alluouche,Daniel Ohayon,Anton Puzanov
关键词-EN: Large Language Models, natural language processing, advanced natural language, Large Language, providing versatile capabilities
关键词-ZH: 大型语言模型、自然语言处理、高级自然语言、大型语言,提供多功能
类目: Computation and Language (cs.CL)
备注:

Abstract:Large Language Models (LLMs) have significantly advanced natural language processing (NLP), providing versatile capabilities across various applications. However, their application to complex, domain-specific tasks, such as cyber-security, often faces substantial challenges. In this study, we introduce SecKnowledge and CyberPal.AI to address these challenges and train security-expert LLMs. SecKnowledge is a domain-knowledge-driven cyber-security instruction dataset, meticulously designed using years of accumulated expert knowledge in the domain through a multi-phase generation process. CyberPal.AI refers to a family of LLMs fine-tuned using SecKnowledge, aimed at building security-specialized LLMs capable of answering and following complex security-related instructions. Additionally, we introduce SecKnowledge-Eval, a comprehensive and diverse cyber-security evaluation benchmark, composed of an extensive set of cyber-security tasks we specifically developed to assess LLMs in the field of cyber-security, along with other publicly available security benchmarks. Our results show a significant average improvement of up to 24% over the baseline models, underscoring the benefits of our expert-driven instruction dataset generation process. These findings contribute to the advancement of AI-based cyber-security applications, paving the way for security-expert LLMs that can enhance threat-hunting and investigation processes.
摘要:大型语言模型(LLM)极大地推动了自然语言处理(NLP)的发展,为各种应用提供了多方面的能力。然而,它们在网络安全等复杂的特定领域任务中的应用往往面临着巨大的挑战。在这项研究中,我们引入了SecKnowledge和CyberPal.AI来应对这些挑战,并训练安全专家LLM。SecKnowledge是一个领域知识驱动的网络安全指令数据集,它利用该领域多年积累的专家知识,通过多阶段的生成过程精心设计而成。CyberPal.AI指的是使用SecKnowledge微调的一系列LLM,旨在构建能够回答和遵循复杂安全相关指令的安全专用LLM。此外,我们还引入了SecKnowledge-Eval,这是一个全面而多样的网络安全评估基准,由我们专门为评估LLM在网络安全领域的表现而开发的一套广泛的网络安全任务以及其他公开可用的安全基准组成。我们的结果显示,与基线模型相比,平均改进幅度最高达24%,这凸显了我们的专家驱动指令数据集生成过程的优势。这些发现有助于推进基于人工智能的网络安全应用,为能够增强威胁狩猎和调查流程的安全专家LLM铺平了道路。

[NLP-70] ConVerSum: A Contrastive Learning based Approach for Data-Scarce Solution of Cross-Lingual Summarization Beyond Direct Equivalents
[NLP-70] ConVerSum:一种基于对比学习的方法,用于超越直接等效的跨语言总结的数据稀缺解决方案

链接: https://arxiv.org/abs/2408.09273
作者: Sanzana Karim Lora,Rifat Shahriyar
关键词-EN: Natural Language Processing, branch in Natural, Processing that demands, Natural Language, Language Processing
关键词-ZH: 自然语言处理,自然分支,需要的处理,自然语言,语言处理
类目: Computation and Language (cs.CL)
备注:

Abstract:Cross-Lingual summarization (CLS) is a sophisticated branch in Natural Language Processing that requires models to accurately translate and summarize articles from different source languages. Despite the improvement of the subsequent studies, this area still needs data-efficient solutions along with effective training methodologies. To the best of our knowledge, there is no feasible solution for CLS when there is no available high-quality CLS data. In this paper, we propose a novel data-efficient approach, ConVerSum, for CLS leveraging the power of contrastive learning, generating versatile candidate summaries in different languages based on the given source document and contrasting these summaries with reference summaries concerning the given documents. After that, we train the model with a contrastive ranking loss. Then, we rigorously evaluate the proposed approach against current methodologies and compare it to powerful Large Language Models (LLMs), namely Gemini, GPT 3.5, and GPT 4, showing that our model performs better for low-resource languages’ CLS. These findings represent a substantial improvement in the area, opening the door to more efficient and accurate cross-lingual summarizing techniques.
摘要:跨语言摘要是自然语言处理中的一个复杂分支,它需要模型来准确地翻译和摘要不同来源语言的文章。尽管随后的研究有所改进,但这一领域仍然需要数据效率高的解决方案以及有效的培训方法。据我们所知,在没有可用的高质量CLS数据的情况下,CLS没有可行的解决方案。在本文中,我们提出了一种新的数据高效的方法ConVerSum,该方法利用对比学习的能力,基于给定的源文档生成不同语言的通用候选摘要,并将这些摘要与给定文档的参考摘要进行比较。之后,我们使用对比排名损失对模型进行训练。然后,我们对提出的方法进行了严格的评估,并将其与强大的大型语言模型(LLM)-Gemini、GPT 3.5和GPT 4-进行了比较,证明了我们的模型对于低资源语言的CLS具有更好的性能。这些发现代表着该领域的实质性改进,为更有效和准确的跨语言摘要技术打开了大门。

[NLP-71] Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form Text
[NLP-71] 参考引导的裁决:在自由形式文本自动评估中以LLM作为评委

链接: https://arxiv.org/abs/2408.09235
作者: Sher Badshah,Hassan Sajjad
关键词-EN: Large Language Models, Large Language, advancements in Large, Language Models, rapid advancements
关键词-ZH: 大型语言模型,大型语言,大型进步,语言模型,快速进步
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:The rapid advancements in Large Language Models (LLMs) have highlighted the critical need for robust evaluation methods that can accurately assess the quality of generated text, particularly in free-form tasks. Traditional metrics like BLEU and ROUGE, while useful, often fail to capture the semantic richness and contextual relevance of free-form text compared to reference answers. In this study, we introduce a reference-guided verdict method that leverages multiple LLMs-as-judges to provide a more reliable and accurate evaluation of open-ended LLM generations. By integrating diverse LLMs, our approach mitigates individual model biases and significantly improves alignment with human judgments, especially in challenging tasks where traditional metrics and single-model evaluations fall short. Through experiments across multiple question-answering tasks, we show that our method closely aligns with human evaluations, establishing it as a scalable, reproducible, and effective alternative to human evaluation. Our approach not only enhances evaluation reliability but also opens new avenues for refining automated assessment in generative AI.
摘要:大型语言模型(LLM)的快速发展凸显了对稳健评估方法的迫切需求,这种方法应能准确评估生成文本的质量,特别是在自由形式任务中。BLEU和ROUGE等传统指标虽然有用,但在与参考答案进行比较时,往往无法捕捉自由形式文本的语义丰富性和上下文相关性。在这项研究中,我们介绍了一种参考引导的裁决方法,该方法利用多个LLM作为评委,对开放式LLM生成内容进行更可靠、更准确的评估。通过集成不同的LLM,我们的方法减轻了单个模型的偏差,并显著提高了与人类判断的一致性,特别是在传统指标和单模型评估力有不逮的挑战性任务中。通过在多个问答任务上的实验,我们表明我们的方法与人类评估高度一致,从而确立了它作为一种可扩展、可复现且有效的人类评估替代方案的地位。我们的方法不仅提高了评估的可靠性,而且为改进生成式人工智能中的自动评估开辟了新的途径。
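
Combining several LLM judges, as this abstract describes, needs some rule for merging their verdicts. The abstract does not state the rule, so the sketch below assumes a simple majority vote; the function name `majority_verdict` and the string labels are hypothetical, and returning `None` on a tie is a design choice made here so that ambiguous cases can be escalated rather than silently resolved.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_verdict(verdicts: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the verdict given by the most judges, or None if the top spot is tied."""
    if not verdicts:
        raise ValueError("at least one verdict is required")
    counts = Counter(verdicts)
    top_label, top_count = counts.most_common(1)[0]
    # A tie for first place means the judges did not reach a majority.
    tied = [label for label, count in counts.items() if count == top_count]
    return top_label if len(tied) == 1 else None
```

Using an odd number of judges with binary labels avoids ties entirely.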

[NLP-72] Architectural Foundations and Strategic Considerations for the Large Language Model Infrastructures
[NLP-72] 大型语言模型基础结构的架构基础和战略考虑

链接: https://arxiv.org/abs/2408.09205
作者: Hongyin Zhu
关键词-EN: large language model, language model, artificial intelligence, large language, undertaking in artificial
关键词-ZH: 大语言模型,语言模型,人工智能,大语言,人工创业
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:The development of a large language model (LLM) infrastructure is a pivotal undertaking in artificial intelligence. This paper explores the intricate landscape of LLM infrastructure, software, and data management. By analyzing these core components, we emphasize the pivotal considerations and safeguards crucial for successful LLM development. This work presents a concise synthesis of the challenges and strategies inherent in constructing a robust and effective LLM infrastructure, offering valuable insights for researchers and practitioners alike.
摘要:大型语言模型(LLM)基础设施的开发是人工智能领域的一项关键任务。本文探讨了LLM基础设施、软件和数据管理的复杂环境。通过分析这些核心组件,我们强调了LLM成功开发至关重要的关键考虑因素和保障措施。这项工作简要地综合了构建强大而有效的LLM基础设施所固有的挑战和策略,为研究人员和从业者提供了宝贵的见解。

[NLP-73] AI Managed Emergency Documentation with a Pretrained Model
[NLP-73] 人工智能使用预训练模型管理紧急文档

链接: https://arxiv.org/abs/2408.09193
作者: David Menzies,Sean Kirwan,Ahmad Albarqawi
关键词-EN: large language model, discharge letter writing, study investigates, large language, improve efficiency
关键词-ZH: 大语言模型,出院信写作,研究调查,大语言,提高效率
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Ethical approval for the study was obtained from the University College Dublin, Human Research Ethics Committee (UCD HREC)

Abstract:This study investigates the use of a large language model system to improve efficiency and quality in emergency department (ED) discharge letter writing. Time constraints and infrastructural deficits make compliance with current discharge letter targets difficult. We explored potential efficiencies from an artificial intelligence software in the generation of ED discharge letters and the attitudes of doctors toward this technology. The evaluated system leverages advanced techniques to fine-tune a model to generate discharge summaries from short-hand inputs, including voice, text, and electronic health record data. Nineteen physicians with emergency medicine experience evaluated the system text and voice-to-text interfaces against manual typing. The results showed significant time savings with MedWrite LLM interfaces compared to manual methods.
摘要:本研究调查了使用大型语言模型系统来提高急诊科(ED)出院信写作的效率和质量。时间限制和基础设施缺陷使得达到当前的出院信目标变得困难。我们探索了人工智能软件在生成ED出院信方面的潜在效率提升,以及医生对这项技术的态度。被评估的系统利用先进技术微调模型,以从包括语音、文本和电子健康记录数据在内的简略输入生成出院摘要。19名具有急诊医学经验的医生对照手动打字评估了系统的文本界面和语音转文本界面。结果显示,与手动方法相比,MedWrite LLM界面可以显著节省时间。

[NLP-74] Chinese Metaphor Recognition Using a Multi-stage Prompting Large Language Model
[NLP-74] 基于多阶段提示大语言模型的中文隐喻识别

链接: https://arxiv.org/abs/2408.09177
作者: Jie Wang,Jin Wang,Xuejie Zhang
关键词-EN: Large Language Models, common in everyday, everyday language, Metaphors, identification and understanding
关键词-ZH: 大型语言模型,常见于日常语言、隐喻、识别和理解
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:Metaphors are common in everyday language, and the identification and understanding of metaphors are facilitated by models to achieve a better understanding of the text. Metaphors are mainly identified and generated by pre-trained models in existing research, but situations, where tenors or vehicles are not included in the metaphor, cannot be handled. The problem can be effectively solved by using Large Language Models (LLMs), but significant room for exploration remains in this early-stage research area. A multi-stage generative heuristic-enhanced prompt framework is proposed in this study to enhance the ability of LLMs to recognize tenors, vehicles, and grounds in Chinese metaphors. In the first stage, a small model is trained to obtain the required confidence score for answer candidate generation. In the second stage, questions are clustered and sampled according to specific rules. Finally, the heuristic-enhanced prompt needed is formed by combining the generated answer candidates and demonstrations. The proposed model achieved 3rd place in Track 1 of Subtask 1, 1st place in Track 2 of Subtask 1, and 1st place in both tracks of Subtask 2 at the NLPCC-2024 Shared Task 9.
摘要:隐喻在日常语言中十分常见,借助模型识别和理解隐喻有助于更好地理解文本。在现有研究中,隐喻主要通过预训练模型来识别和生成,但无法处理隐喻中不包含本体或喻体的情况。使用大型语言模型(LLM)可以有效地解决这个问题,但在这个尚处早期的研究领域仍有很大的探索空间。本研究提出了一个多阶段生成式启发式增强提示框架,以提高LLM识别中文隐喻中本体、喻体和喻底的能力。在第一阶段,训练一个小模型以获得生成候选答案所需的置信度分数。在第二阶段,根据特定的规则对问题进行聚类和抽样。最后,将生成的候选答案与示例相结合,形成所需的启发式增强提示。该模型在NLPCC-2024共享任务9中获得了子任务1赛道1的第3名、子任务1赛道2的第1名,以及子任务2两个赛道的第1名。

[NLP-75] Cognitive LLMs: Towards Integrating Cognitive Architectures and Large Language Models for Manufacturing Decision-making
[NLP-75] 认知LLM:面向制造决策的认知架构与大型语言模型融合

链接: https://arxiv.org/abs/2408.09176
作者: Siyu Wu,Alessandro Oltramari,Jonathan Francis,C. Lee Giles,Frank E. Ritter
关键词-EN: Large Language Models, Language Models, Large Language, enabling reliable machine, Resolving the dichotomy
关键词-ZH: 大型语言模型,语言模型,大型语言,实现可靠的机器,解决二分法
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Symbolic Computation (cs.SC)
备注: 20 pages, 8 figures, 2 tables

Abstract:Resolving the dichotomy between the human-like yet constrained reasoning processes of Cognitive Architectures and the broad but often noisy inference behavior of Large Language Models (LLMs) remains a challenging but exciting pursuit, for enabling reliable machine reasoning capabilities in production systems. Because Cognitive Architectures are famously developed for the purpose of modeling the internal mechanisms of human cognitive decision-making at a computational level, new investigations consider the goal of informing LLMs with the knowledge necessary for replicating such processes, e.g., guided perception, memory, goal-setting, and action. Previous approaches that use LLMs for grounded decision-making struggle with complex reasoning tasks that require slower, deliberate cognition over fast and intuitive inference – reporting issues related to the lack of sufficient grounding, as in hallucination. To resolve these challenges, we introduce LLM-ACTR, a novel neuro-symbolic architecture that provides human-aligned and versatile decision-making by integrating the ACT-R Cognitive Architecture with LLMs. Our framework extracts and embeds knowledge of ACT-R’s internal decision-making process as latent neural representations, injects this information into trainable LLM adapter layers, and fine-tunes the LLMs for downstream prediction. Our experiments on novel Design for Manufacturing tasks show both improved task performance as well as improved grounded decision-making capability of our approach, compared to LLM-only baselines that leverage chain-of-thought reasoning strategies.
摘要:要在生产系统中实现可靠的机器推理能力,调和认知体系结构中类似人类但受限的推理过程与大型语言模型(LLM)广泛但往往有噪声的推理行为之间的二分法,仍然是一个具有挑战性但令人兴奋的追求。由于认知体系结构的开发目的是在计算层面上对人类认知决策的内部机制进行建模,新的研究考虑的目标是向LLM提供复制此类过程所需的知识,例如引导感知、记忆、目标设定和行动。以前使用LLM进行有依据的决策的方法难以应对复杂的推理任务,这些任务需要较慢的、深思熟虑的认知,而不是快速和直观的推理,并报告了与缺乏足够依据有关的问题,如幻觉。为了解决这些挑战,我们引入了LLM-ACTR,这是一种新型的神经符号体系结构,通过将ACT-R认知体系结构与LLM相结合来提供与人类一致的通用决策。我们的框架提取ACT-R内部决策过程的知识并将其嵌入为潜在的神经表示,将这些信息注入可训练的LLM适配器层,并微调LLM以用于下游预测。我们在新颖的面向制造的设计任务上的实验表明,与利用思维链推理策略的仅LLM基线相比,我们的方法既提高了任务性能,也提高了有依据的决策能力。

[NLP-76] TableBench: A Comprehensive and Complex Benchmark for Table Question Answering
[NLP-76] TableBench:表格问答的全面而复杂的基准

链接: https://arxiv.org/abs/2408.09174
作者: Xianjie Wu,Jian Yang,Linzheng Chai,Ge Zhang,Jiaheng Liu,Xinrun Du,Di Liang,Daixin Shu,Xianfu Cheng,Tianzhen Sun,Guanglin Niu,Tongliang Li,Zhoujun Li
关键词-EN: Large Language Models, introducing previously unimaginable, Large Language, previously unimaginable capabilities, Recent advancements
关键词-ZH: 大型语言模型,引入以前难以想象的,大型语言,以前难以想象的能力,最近的进步
类目: Computation and Language (cs.CL)
备注: 12 pages

点击查看摘要

Abstract:Recent advancements in Large Language Models (LLMs) have markedly enhanced the interpretation and processing of tabular data, introducing previously unimaginable capabilities. Despite these achievements, LLMs still encounter significant challenges when applied in industrial scenarios, particularly due to the increased complexity of reasoning required with real-world tabular data, underscoring a notable disparity between academic benchmarks and practical applications. To address this discrepancy, we conduct a detailed investigation into the application of tabular data in industrial scenarios and propose a comprehensive and complex benchmark TableBench, including 18 fields within four major categories of table question answering (TableQA) capabilities. Furthermore, we introduce TableLLM, trained on our meticulously constructed training set TableInstruct, achieving comparable performance with GPT-3.5. Massive experiments conducted on TableBench indicate that both open-source and proprietary LLMs still have significant room for improvement to meet real-world demands, where the most advanced model, GPT-4, achieves only a modest score compared to humans.
摘要:大型语言模型(LLM)的最新进展显著增强了对表格数据的解释和处理,引入了以前无法想象的能力。尽管取得了这些成就,但LLM在工业场景中应用时仍面临重大挑战,特别是由于真实世界的表格数据所需的推理复杂性增加,这突显了学术基准与实际应用之间的显著差距。为了解决这一差异,我们对表格数据在工业场景中的应用进行了详细的调查,并提出了一个全面而复杂的基准TableBench,涵盖表格问答(TableQA)能力四大类别下的18个领域。此外,我们引入了TableLLM,它在我们精心构建的训练集TableInstruct上训练,取得了与GPT-3.5相当的性能。在TableBench上进行的大规模实验表明,开源和专有LLM仍有很大的改进空间才能满足现实世界的需求,其中最先进的模型GPT-4与人类相比也只取得了中等的分数。

[NLP-77] Unc-TTP: A Method for Classifying LLM Uncertainty to Improve In-Context Example Selection
[NLP-77] Unc-TTP:一种对LLM不确定性进行分类以改进上下文示例选择的方法

链接: https://arxiv.org/abs/2408.09172
作者: Hsiu-Yuan Huang,Zichen Wu,Yutong Yang,Junzhao Zhang,Yunfang Wu
关键词-EN: Large Language Models, Large Language, Language Models, demonstrated exceptional performance, downstream tasks
关键词-ZH: 大型语言模型、大型语言、语言模型、展示了出色的性能、下游任务
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 7 pages, long paper

点击查看摘要

Abstract:Nowadays, Large Language Models (LLMs) have demonstrated exceptional performance across various downstream tasks. However, it is challenging for users to discern whether the responses are generated with certainty or are fabricated to meet user expectations. Estimating the uncertainty of LLMs is particularly challenging due to their vast scale and the lack of white-box access. In this work, we propose a novel Uncertainty Tripartite Testing Paradigm (Unc-TTP) to classify LLM uncertainty, via evaluating the consistency of LLM outputs when incorporating label interference into the sampling-based approach. Based on Unc-TTP outputs, we aggregate instances into certain and uncertain categories. Further, we conduct a detailed analysis of the uncertainty properties of LLMs and show Unc-TTP’s superiority over the existing sampling-based methods. In addition, we leverage the obtained uncertainty information to guide in-context example selection, demonstrating that Unc-TTP obviously outperforms retrieval-based and sampling-based approaches in selecting more informative examples. Our work paves a new way to classify the uncertainty of both open- and closed-source LLMs, and introduces a practical approach to exploit this uncertainty to improve LLMs performance.
摘要:如今,大型语言模型(LLM)在各种下游任务中表现出了优异的性能。然而,用户很难辨别这些回复是模型在确定的情况下生成的,还是为了满足用户期望而捏造的。由于LLM的规模巨大,而且缺乏白盒访问权限,估计LLM的不确定性尤其具有挑战性。在这项工作中,我们提出了一种新的不确定性三方测试范式(Unc-TTP),通过评估在基于采样的方法中加入标签干扰时LLM输出的一致性,来对LLM不确定性进行分类。基于Unc-TTP的输出,我们将实例聚合为确定的和不确定的类别。此外,我们对LLM的不确定性特性进行了详细的分析,并展示了Unc-TTP相对于现有的基于采样的方法的优越性。此外,我们利用获得的不确定性信息来指导上下文示例选择,表明Unc-TTP在选择信息更丰富的示例方面明显优于基于检索和基于采样的方法。我们的工作为开源和闭源LLM的不确定性分类提供了一种新的方法,并介绍了一种利用这种不确定性来提高LLM性能的实用方法。

[NLP-78] Automatic Metrics in Natural Language Generation: A Survey of Current Evaluation Practices
[NLP-78] 自然语言生成中的自动指标:当前评估实践的调查

链接: https://arxiv.org/abs/2408.09169
作者: Patrícia Schmidtová,Saad Mahamood,Simone Balloccu,Ondřej Dušek,Albert Gatt,Dimitra Gkatzia,David M. Howcroft,Ondřej Plátek,Adarsa Sivaprasad
关键词-EN: language processing systems, processing systems, evaluate natural language, natural language processing, Automatic metrics
关键词-ZH: 语言处理系统,处理系统,评估自然语言,自然语言处理,自动指标
类目: Computation and Language (cs.CL)
备注: Accepted to INLG 2024

点击查看摘要

Abstract:Automatic metrics are extensively used to evaluate natural language processing systems. However, there has been increasing focus on how they are used and reported by practitioners within the field. In this paper, we have conducted a survey on the use of automatic metrics, focusing particularly on natural language generation (NLG) tasks. We inspect which metrics are used as well as why they are chosen and how their use is reported. Our findings from this survey reveal significant shortcomings, including inappropriate metric usage, lack of implementation details and missing correlations with human judgements. We conclude with recommendations that we believe authors should follow to enable more rigour within the field.
摘要:自动指标被广泛用于评估自然语言处理系统。然而,人们越来越关注该领域从业者如何使用和报告它们。在本文中,我们对自动指标的使用进行了调查,特别关注自然语言生成(NLG)任务。我们检查使用了哪些指标以及选择它们的原因以及如何报告它们的使用情况。我们的调查结果揭示了重大缺陷,包括不恰当的指标使用、缺乏实施细节以及与人类判断的相关性缺失。我们最后提出了我们认为作者应该遵循的建议,以使该领域更加严格。

[NLP-79] CogLM: Tracking Cognitive Development of Large Language Models
[NLP-79] CogLM:跟踪大型语言模型的认知发展

链接: https://arxiv.org/abs/2408.09150
作者: Xinglin Wang,Peiwen Yuan,Shaoxiong Feng,Yiwei Li,Boyuan Pan,Heda Wang,Yao Hu,Kan Li
关键词-EN: Piaget Theory, Large Language Models, cognitive levels, cognitive levels forms, Cognitive
关键词-ZH: 皮亚杰理论,大语言模型,认知水平,认知水平形式,认知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: under review

点击查看摘要

Abstract:Piaget’s Theory of Cognitive Development (PTC) posits that the development of cognitive levels forms the foundation for human learning across various abilities. As Large Language Models (LLMs) have recently shown remarkable abilities across a wide variety of tasks, we are curious about the cognitive levels of current LLMs: to what extent they have developed and how this development has been achieved. To this end, we construct a benchmark CogLM (Cognitive Ability Evaluation for Language Model) based on PTC to assess the cognitive levels of LLMs. CogLM comprises 1,220 questions spanning 10 cognitive abilities crafted by more than 20 human experts, providing a comprehensive testbed for the cognitive levels of LLMs. Through extensive experiments across multiple mainstream LLMs with CogLM, we find that: (1) Human-like cognitive abilities have emerged in advanced LLMs (GPT-4), comparable to those of a 20-year-old human. (2) The parameter size and optimization objective are two key factors affecting the cognitive levels of LLMs. (3) The performance on downstream tasks is positively correlated with the level of cognitive abilities. These findings fill the gap in research on the cognitive abilities of LLMs, tracing the development of LLMs from a cognitive perspective and guiding the future direction of their evolution.
摘要:皮亚杰的认知发展理论(PTC)认为,认知水平的发展构成了人类学习各种能力的基础。由于大型语言模型(LLM)最近在各种任务中表现出了非凡的能力,我们对当前LLM的认知水平感到好奇:它们发展到了什么程度,这种发展又是如何实现的?为此,我们基于PTC构建了基准CogLM(Cognitive Ability Evaluation for Language Model)来评估LLM的认知水平。CogLM包括1220个问题,涵盖10种认知能力,由20多名人类专家精心设计,为LLM的认知水平提供了一个全面的试验台。通过使用CogLM对多个主流LLM进行广泛的实验,我们发现:(1)高级LLM(GPT-4)已经出现了类似人类的认知能力,与20岁的人相当。(2)参数规模和优化目标是影响LLM认知水平的两个关键因素。(3)下游任务的表现与认知能力水平呈正相关。这些发现填补了LLM认知能力研究的空白,从认知的角度追踪LLM的发展,并指导其未来的演进方向。

[NLP-80] Selective Prompt Anchoring for Code Generation
[NLP-80] 代码生成的选择性提示锚定

链接: https://arxiv.org/abs/2408.09121
作者: Yuan Tian,Tianyi Zhang
关键词-EN: large language models, automating coding tasks, transformed software development, Recent advances, Copilot and ChatGPT
关键词-ZH: 大型语言模型、自动化编码任务、转型后的软件开发、最新进展、Copilot和ChatGPT
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)
备注: Under review

点击查看摘要

Abstract:Recent advances in large language models (LLMs) such as Copilot and ChatGPT have transformed software development by automating coding tasks. Despite these advancements, challenges remain in reducing error rates and fully meeting user expectations. Our empirical study reveals LLMs tend to dilute their self-attention on the initial prompt as more code tokens are generated. We hypothesize this self-attention dilution issue is one of the root causes of inaccuracies in LLM-generated code. To mitigate this issue, we propose Selective Prompt Anchoring (SPA). SPA amplifies the influence of the selected parts in the initial prompt, which we refer to as “anchored text”, during code generation. Specifically, SPA calculates the logit distribution difference with and without the anchored text. We prove this difference approximates the anchored text’s contextual contribution to the output logits. SPA creates an augmented logit distribution by linearly combining the original logit distribution and the logit difference. We evaluate SPA with five LLMs on four benchmarks. Our results demonstrate that using SPA can consistently improve Pass@1 rates by up to 9.7% in all settings. Notably, with selective text anchoring, a small version of DeepSeek-Coder (6.7B) can achieve better performance than an original much larger version (33B). Our code is available at this https URL.
摘要:大型语言模型(LLM)的最新进展,如Copilot和ChatGPT,通过自动化编码任务改变了软件开发。尽管取得了这些进步,但在降低错误率和充分满足用户期望方面仍然存在挑战。我们的实证研究表明,随着生成的代码token增多,LLM对初始提示的自注意力往往会被稀释。我们假设这种自注意力稀释问题是导致LLM生成的代码不准确的根本原因之一。为了缓解这一问题,我们提出了选择性提示锚定(SPA)。在代码生成过程中,SPA放大了初始提示中选定部分的影响,我们将这些部分称为“锚定文本”。具体地说,SPA计算有锚定文本和没有锚定文本时的logit分布差异。我们证明了这种差异近似于锚定文本对输出logit的上下文贡献。SPA通过线性组合原始logit分布和logit差异来创建增强的logit分布。我们在四个基准上使用五个LLM来评估SPA。我们的结果表明,在所有设置下,使用SPA都能稳定地提高Pass@1,提升幅度最高可达9.7%。值得注意的是,使用选择性文本锚定,DeepSeek-Coder的小版本(6.7B)可以获得比原始的大得多的版本(33B)更好的性能。我们的代码可以在this https URL上找到。
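
The linear combination described in the abstract can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the function name `spa_augment`, the parameter name `weight`, and the toy logit values are all hypothetical, and the abstract does not say how the combination weight is chosen.

```python
def spa_augment(logits_with_anchor, logits_without_anchor, weight):
    """Combine the original logits with the with/without-anchor difference.

    Hypothetical sketch: `weight` scales how strongly the anchored text's
    estimated contribution is amplified.
    """
    if len(logits_with_anchor) != len(logits_without_anchor):
        raise ValueError("logit vectors must cover the same vocabulary")
    return [
        orig + weight * (orig - masked)
        for orig, masked in zip(logits_with_anchor, logits_without_anchor)
    ]


def argmax(values):
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


# Toy three-token vocabulary; the numbers are made up for illustration.
original = [2.0, 1.5, 0.1]   # logits computed with the anchored text present
masked = [2.0, 0.5, 0.1]     # logits computed with the anchored text removed
augmented = spa_augment(original, masked, weight=1.0)
```

In this toy case only the second token depends on the anchored text, so amplifying the difference moves the top-ranked token from index 0 to index 1.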

[NLP-81] Measuring Visual Sycophancy in Multimodal Models
[NLP-81] 测量多模态模型中的视觉谄媚

链接: https://arxiv.org/abs/2408.09111
作者: Jaehyuk Lim,Bruce W. Lee
关键词-EN: disproportionately favor visually, favor visually presented, multimodal language models, visually presented information, paper introduces
关键词-ZH: 不成比例地支持视觉化,支持视觉呈现,多模式语言模型,视觉呈现信息,论文介绍
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:This paper introduces and examines the phenomenon of “visual sycophancy” in multimodal language models, a term we propose to describe these models’ tendency to disproportionately favor visually presented information, even when it contradicts their prior knowledge or responses. Our study employs a systematic methodology to investigate this phenomenon: we present models with images of multiple-choice questions, which they initially answer correctly, then expose the same model to versions with visually pre-marked options. Our findings reveal a significant shift in the models’ responses towards the pre-marked option despite their previous correct answers. Comprehensive evaluations demonstrate that visual sycophancy is a consistent and quantifiable behavior across various model architectures. Our findings highlight potential limitations in the reliability of these models when processing potentially misleading visual information, raising important questions about their application in critical decision-making contexts.
摘要:本文介绍并考察了多模态语言模型中的“视觉谄媚”现象,我们提出这个术语来描述这些模型不成比例地偏向视觉呈现信息的倾向,即使这些信息与模型的先验知识或先前的回答相矛盾。我们的研究采用了一种系统的方法来研究这一现象:我们先向模型呈现多项选择题的图像,模型最初能正确回答这些问题,然后让同一模型面对选项已被视觉预先标记的版本。我们的发现显示,尽管模型之前的答案是正确的,其回答仍显著转向预先标记的选项。综合评估表明,视觉谄媚是一种跨各种模型架构的一致且可量化的行为。我们的发现突出了这些模型在处理可能具有误导性的视觉信息时可靠性方面的潜在局限,这引发了关于它们在关键决策环境中应用的重要问题。

[NLP-82] Improving Rare Word Translation With Dictionaries and Attention Masking
[NLP-82] 用词典和注意力掩蔽改进稀有词翻译

链接: https://arxiv.org/abs/2408.09075
作者: Kenneth J. Sible,David Chiang
关键词-EN: dominant encoder-decoder architecture, encoder-decoder architecture, translation settings, dominant encoder-decoder, machine translation
关键词-ZH: 主导编码器-解码器架构、编码器-解码器架构、翻译设置、主导编码器-解码器、机器翻译
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In machine translation, rare words continue to be a problem for the dominant encoder-decoder architecture, especially in low-resource and out-of-domain translation settings. Human translators solve this problem with monolingual or bilingual dictionaries. In this paper, we propose appending definitions from a bilingual dictionary to source sentences and using attention masking to link together rare words with their definitions. We find that including definitions for rare words improves performance by up to 1.0 BLEU and 1.6 MacroF1.
摘要:在机器翻译中,稀有词仍然是占主导地位的编码器-解码器架构的一个问题,尤其是在低资源和域外翻译设置中。人类译者借助单语或双语词典解决这个问题。在本文中,我们建议将双语词典中的定义附加到源句子之后,并使用注意力掩蔽将稀有词与其定义联系起来。我们发现,加入稀有词的定义最多可以将性能提高1.0 BLEU和1.6 MacroF1。
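
One way to picture "attention masking to link rare words with their definitions" is as a boolean matrix over the source sentence plus the appended definition tokens. The sketch below is an assumption-heavy illustration: the abstract does not specify the masking scheme, so the rule used here (source tokens see each other, a rare word and its definition see each other, definition tokens see their own definition) and the helper name `build_definition_mask` are hypothetical.

```python
def build_definition_mask(num_source, definitions):
    """Return mask[i][j] == True when position i may attend to position j.

    `definitions` is a list of (headword_index, definition_length) pairs;
    definition tokens are assumed to be appended after the source tokens
    in the order given. The masking rule is an illustrative assumption.
    """
    total = num_source + sum(length for _, length in definitions)
    mask = [[False] * total for _ in range(total)]

    # Source tokens attend to every source token.
    for i in range(num_source):
        for j in range(num_source):
            mask[i][j] = True

    start = num_source
    for head, length in definitions:
        if not 0 <= head < num_source:
            raise ValueError("headword index must point into the source")
        span = range(start, start + length)
        for d in span:
            mask[head][d] = True   # rare word sees its definition
            mask[d][head] = True   # definition sees its rare word
            for e in span:
                mask[d][e] = True  # definition tokens see each other
        start += length
    return mask


# Four source tokens; token 2 is rare and has a three-token definition.
mask = build_definition_mask(4, [(2, 3)])
```

With this rule, the other source tokens never attend directly to the definition tokens, so the definition can reach the rest of the sentence only through the rare word it explains.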

[NLP-83] CodeTaxo: Enhancing Taxonomy Expansion with Limited Examples via Code Language Prompts
[NLP-83] CodeTaxo:通过代码语言提示在有限示例下增强分类体系扩展

链接: https://arxiv.org/abs/2408.09070
作者: Qingkai Zeng,Yuyang Bai,Zhaoxuan Tan,Zhenyu Wu,Shangbin Feng,Meng Jiang
关键词-EN: representation of knowledge, play a crucial, crucial role, applications by providing, providing a structural
关键词-ZH: 知识的表示,通过提供结构性的应用程序发挥着至关重要的作用
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Taxonomies play a crucial role in various applications by providing a structural representation of knowledge. The task of taxonomy expansion involves integrating emerging concepts into existing taxonomies by identifying appropriate parent concepts for these new query concepts. Previous approaches typically relied on self-supervised methods that generate annotation data from existing taxonomies. However, these methods are less effective when the existing taxonomy is small (fewer than 100 entities). In this work, we introduce CodeTaxo, a novel approach that leverages large language models through code language prompts to capture the taxonomic structure. Extensive experiments on five real-world benchmarks from different domains demonstrate that CodeTaxo consistently achieves superior performance across all evaluation metrics, significantly outperforming previous state-of-the-art methods. The code and data are available at this https URL.
摘要:分类体系通过提供知识的结构性表示,在各种应用中发挥着至关重要的作用。分类体系扩展的任务是通过为新的查询概念识别适当的父概念,将新兴概念集成到现有的分类体系中。以前的方法通常依赖于从现有分类体系生成标注数据的自监督方法。然而,当现有分类体系很小(少于100个实体)时,这些方法就不太有效。在这项工作中,我们引入了CodeTaxo,这是一种新颖的方法,通过代码语言提示利用大型语言模型来捕获分类结构。对来自不同领域的五个现实世界基准进行的广泛实验表明,CodeTaxo在所有评估指标上始终实现卓越的性能,显著优于之前的最先进方法。代码和数据可在this https URL上获取。

[NLP-84] Learning to Route for Dynamic Adapter Composition in Continual Learning with Language Models
[NLP-84] 在使用语言模型的持续学习中学习动态适配器组合的路径

链接: https://arxiv.org/abs/2408.09053
作者: Vladimir Araujo,Marie-Francine Moens,Tinne Tuytelaars
关键词-EN: pre-trained language models, Parameter-efficient fine-tuning, language models, continual learning, PEFT
关键词-ZH: 预训练语言模型、参数高效微调、语言模型、持续学习、PEFT
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Parameter-efficient fine-tuning (PEFT) methods are increasingly used with pre-trained language models (PLMs) for continual learning (CL). These methods involve training a PEFT module for each new task and using similarity-based selection to route modules during inference. However, they face two major limitations: 1) interference with already learned modules and 2) suboptimal routing when composing modules. In this paper, we introduce a method that isolates the training of PEFT modules for task specialization. Then, before evaluation, it learns to compose the previously learned modules by training a router that leverages samples from a small memory. We evaluate our method in two CL setups using several benchmarks. Our results show that our method provides a better composition of PEFT modules, leading to better generalization and performance compared to previous methods.
摘要:参数高效微调(PEFT)方法越来越多地与预训练语言模型(PLM)一起用于持续学习(CL)。这些方法涉及为每个新任务训练PEFT模块,并在推理期间使用基于相似性的选择来路由模块。然而,它们面临着两个主要限制:1)干扰已经学习的模块; 2)组成模块时的次优路由。本文中,我们介绍了一种隔离PEFT模块训练以实现任务专业化的方法。然后,在评估之前,它通过训练利用小内存中的样本的路由器来学习组合之前学习的模块。我们使用多个基准在两个CL设置中评估我们的方法。我们的结果表明,我们的方法提供了更好的PEFT模块组合,与以前的方法相比,具有更好的概括性和性能。
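
To make "composing previously learned modules with a router" concrete, the sketch below shows only the composition step at inference time: router scores are turned into weights and used to mix the outputs of frozen modules. It assumes a softmax-weighted sum, which the abstract does not confirm; the names `softmax` and `compose_modules` and the toy vectors are hypothetical, and training the router on memory samples is left out.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def compose_modules(module_outputs, router_scores):
    """Mix per-module output vectors using softmax-normalised router scores.

    Assumption: composition is a weighted sum; the paper may use a
    different rule.
    """
    if len(module_outputs) != len(router_scores):
        raise ValueError("need one router score per module")
    weights = softmax(router_scores)
    dim = len(module_outputs[0])
    return [
        sum(w * out[k] for w, out in zip(weights, module_outputs))
        for k in range(dim)
    ]


# Two task-specific modules with toy two-dimensional outputs.
outputs = [[1.0, 0.0], [0.0, 1.0]]
mixed = compose_modules(outputs, [0.0, 0.0])  # equal scores, equal weights
```

Because the modules themselves are not updated in this step, changing the router cannot interfere with what each module learned for its own task, which is the isolation property the abstract emphasises.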

[NLP-85] Language Models Show Stable Value Orientations Across Diverse Role-Plays
[NLP-85] 语言模型在不同的角色扮演中显示稳定的价值取向

链接: https://arxiv.org/abs/2408.09049
作者: Bruce W. Lee,Yeongheon Lee,Hyunsoo Cho
关键词-EN: adopting diverse personas, large language models, revealing a persistent, prompted to assume, large language
关键词-ZH: 采用多样化的角色、大型语言模型,揭示持久的、促使假设的大型语言
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:We demonstrate that large language models (LLMs) exhibit consistent value orientations despite adopting diverse personas, revealing a persistent inertia in their responses that remains stable across the variety of roles they are prompted to assume. To systematically explore this phenomenon, we introduce the role-play-at-scale methodology, which involves prompting LLMs with randomized, diverse personas and analyzing the macroscopic trend of their responses. Unlike previous works that simply feed these questions to LLMs as if testing human subjects, our role-play-at-scale methodology diagnoses inherent tendencies in a systematic and scalable manner by: (1) prompting the model to act in different random personas and (2) asking the same question multiple times for each random persona. This approach reveals consistent patterns in LLM responses across diverse role-play scenarios, indicating deeply encoded inherent tendencies. Our findings contribute to the discourse on value alignment in foundation models and demonstrate the efficacy of role-play-at-scale as a diagnostic tool for uncovering encoded biases in LLMs.
摘要:我们证明,大型语言模型(LLM)尽管采用了不同的人物角色,但仍表现出一致的价值取向,揭示了其回答中存在持久的惯性,这种惯性在它们被提示扮演的各种角色中保持稳定。为了系统地探索这一现象,我们引入了大规模角色扮演(role-play-at-scale)方法,即用随机的、多样化的人物角色提示LLM,并分析其回答的宏观趋势。以前的工作只是像测试人类受试者一样把这些问题直接交给LLM,与之不同,我们的大规模角色扮演方法通过以下方式以系统且可扩展的方式诊断内在倾向:(1)提示模型以不同的随机角色行事;(2)对每个随机角色多次提出相同的问题。这种方法揭示了LLM在不同角色扮演场景中回答的一致模式,表明存在深层编码的内在倾向。我们的发现有助于关于基础模型价值对齐的讨论,并证明了大规模角色扮演作为一种诊断工具在揭示LLM中编码的偏见方面的有效性。
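
The two steps of role-play-at-scale, sampling random personas and repeating the same question for each, amount to a small sampling loop. The sketch below is a minimal illustration rather than the authors' code: the prompt template, the persona list, and `stub_model` (a stand-in that always returns the same answer) are hypothetical, and a real study would call an actual LLM in place of the stub.

```python
import random
from collections import Counter


def role_play_at_scale(ask, personas, question, num_personas, repeats, seed=0):
    """Tally answers to one question across randomly drawn personas.

    `ask` is any callable that maps a prompt string to an answer string.
    """
    rng = random.Random(seed)
    tally = Counter()
    for _ in range(num_personas):
        persona = rng.choice(personas)               # step 1: random persona
        prompt = f"Act as {persona}. {question}"     # hypothetical template
        for _ in range(repeats):                     # step 2: repeat question
            tally[ask(prompt)] += 1
    return tally


def stub_model(prompt):
    """Hypothetical stand-in for an LLM call; always gives the same answer."""
    return "agree"


tally = role_play_at_scale(
    stub_model,
    personas=["a teacher", "a pilot"],
    question="Is honesty important?",
    num_personas=3,
    repeats=2,
)
```

The resulting tally is the "macroscopic trend" to inspect: if one answer dominates regardless of persona, that suggests a tendency of the model rather than of the role.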

[NLP-86] Studying the Effects of Collaboration in Interactive Theme Discovery Systems
[NLP-86] 研究交互式主题发现系统中协作的影响

链接: https://arxiv.org/abs/2408.09030
作者: Alvin Po-Chun Chen,Dananjay Srinivas,Alexandra Barry,Maksim Seniw,Maria Leonor Pacheco
关键词-EN: gained considerable traction, support qualitative data, solutions have gained, gained considerable, considerable traction
关键词-ZH: 获得了相当大的吸引力,支持定性数据,解决方案已经获得,获得了相当大的吸引力
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:NLP-assisted solutions have gained considerable traction to support qualitative data analysis. However, there does not exist a unified evaluation framework that can account for the many different settings in which qualitative researchers may employ them. In this paper, we take a first step in this direction by proposing an evaluation framework to study the way in which different tools may result in different outcomes depending on the collaboration strategy employed. Specifically, we study the impact of synchronous vs. asynchronous collaboration using two different NLP-assisted qualitative research tools and present a comprehensive analysis of significant differences in the consistency, cohesiveness, and correctness of their outputs.
摘要:NLP辅助解决方案在支持定性数据分析方面获得了相当大的关注。然而,目前还没有一个统一的评估框架能够涵盖定性研究人员可能使用这些工具的多种不同场景。在本文中,我们向这个方向迈出了第一步,提出了一个评估框架,以研究不同工具如何因所采用的协作策略不同而产生不同的结果。具体来说,我们使用两种不同的NLP辅助定性研究工具来研究同步协作与异步协作的影响,并对其输出在一致性、凝聚性和正确性方面的显著差异进行全面分析。

[NLP-87] From Lazy to Prolific: Tackling Missing Labels in Open Vocabulary Extreme Classification by Positive-Unlabeled Sequence Learning
[NLP-87] 从懒惰到多产:通过正无标签序列学习解决开放词汇极端分类中的缺失标签

链接: https://arxiv.org/abs/2408.08981
作者: Haoran Ranran Zhang,Bensu Uçar,Soumik Dey,Hansi Wu,Binbin Li,Rui Zhang
关键词-EN: Extreme Multi-label Classification, Open-vocabulary Extreme Multi-label, extends traditional XMC, Open-vocabulary Extreme, Multi-label Classification
关键词-ZH: 极限多标签分类,开放词汇极限多标签,扩展传统XMC,开放词汇极限,多标签分类
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Open-vocabulary Extreme Multi-label Classification (OXMC) extends traditional XMC by allowing prediction beyond an extremely large, predefined label set (typically 10^3 to 10^12 labels), addressing the dynamic nature of real-world labeling tasks. However, self-selection bias in data annotation leads to significant missing labels in both training and test data, particularly for less popular inputs. This creates two critical challenges: generation models learn to be “lazy” by under-generating labels, and evaluation becomes unreliable due to insufficient annotation in the test set. In this work, we introduce Positive-Unlabeled Sequence Learning (PUSL), which reframes OXMC as an infinite keyphrase generation task, addressing the generation model’s laziness. Additionally, we propose to adopt a suite of evaluation metrics, F1@O and the newly proposed B@k, to reliably assess OXMC models with incomplete ground truths. In a highly imbalanced e-commerce dataset with substantial missing labels, PUSL generates 30% more unique labels, and 72% of its predictions align with actual user queries. On the less skewed EURLex-4.3k dataset, PUSL demonstrates superior F1 scores, especially as label counts increase from 15 to 30. Our approach effectively tackles both the modeling and evaluation challenges in OXMC with missing labels.
摘要:开放词汇极限多标签分类(OXMC)扩展了传统的XMC,允许在超大的预定义标签集(通常为10^3到10^12个标签)之外进行预测,从而应对现实世界标注任务的动态特性。然而,数据标注中的自我选择偏差导致训练和测试数据中都有大量标签缺失,特别是对于不太受欢迎的输入。这就产生了两个关键的挑战:生成模型因生成的标签过少而学会“偷懒”,而评估由于测试集中的标注不足而变得不可靠。在这项工作中,我们引入了正无标签序列学习(PUSL),它将OXMC重新表述为一个无限的关键短语生成任务,解决了生成模型的偷懒问题。此外,我们建议采用一套评估指标,即F1@O和新提出的B@k,来可靠地评估真实标注不完整情况下的OXMC模型。在一个标签大量缺失、高度不平衡的电子商务数据集中,PUSL生成的唯一标签多30%,其预测的72%与实际用户查询一致。在偏斜程度较低的EURLex-4.3k数据集上,PUSL显示出更优的F1分数,特别是当标签数量从15增加到30时。我们的方法有效地解决了存在标签缺失的OXMC中的建模和评估挑战。

[NLP-88] See What LLMs Cannot Answer: A Self-Challenge Framework for Uncovering LLM Weaknesses
[NLP-88] 看看LLM无法回答的问题:揭露LLM弱点的自我挑战框架

链接: https://arxiv.org/abs/2408.08978
作者: Yulong Chen,Yang Liu,Jianhao Yan,Xuefeng Bai,Ming Zhong,Yinghao Yang,Ziyi Yang,Chenguang Zhu,Yue Zhang
关键词-EN: Large Language Models, Language Models, Large Language, consistently surpassed numerous, surpassed numerous human-designed
关键词-ZH: 大型语言模型,语言模型,大型语言,不断超越无数,超越无数人类设计的
类目: Computation and Language (cs.CL)
备注: COLM 2024

点击查看摘要

Abstract:The impressive performance of Large Language Models (LLMs) has consistently surpassed numerous human-designed benchmarks, presenting new challenges in assessing the shortcomings of LLMs. Designing tasks and finding LLMs’ limitations are becoming increasingly important. In this paper, we investigate the question of whether an LLM can discover its own limitations from the errors it makes. To this end, we propose a Self-Challenge evaluation framework with human-in-the-loop. Starting from seed instances that GPT-4 fails to answer, we prompt GPT-4 to summarize error patterns that can be used to generate new instances and incorporate human feedback on them to refine these patterns for generating more challenging data, iteratively. We end up with 8 diverse patterns, such as text manipulation and questions with assumptions. We then build a benchmark, SC-G4, consisting of 1,835 instances generated by GPT-4 using these patterns, with human-annotated gold responses. The SC-G4 serves as a challenging benchmark that allows for a detailed assessment of LLMs’ abilities. Our results show that only 44.96% of instances in SC-G4 can be answered correctly by GPT-4. Interestingly, our pilot study indicates that these error patterns also challenge other LLMs, such as Claude-3 and Llama-3, and cannot be fully resolved through fine-tuning. Our work takes the first step to demonstrate that LLMs can autonomously identify their inherent flaws and provide insights for future dynamic and automatic evaluation.
摘要:大型语言模型(LLM)令人印象深刻的表现一直超过了许多人类设计的基准,这给评估LLM的缺点提出了新的挑战。设计任务并找出LLM的局限性正变得越来越重要。在本文中,我们研究了LLM是否能从它所犯的错误中发现自己的局限性的问题。为此,我们提出了一种人在回路(human-in-the-loop)的自我挑战评估框架。从GPT-4未能回答的种子实例开始,我们提示GPT-4总结可用于生成新实例的错误模式,并结合人类对这些模式的反馈来迭代地改进它们,以生成更具挑战性的数据。我们最终得到了8种不同的模式,比如文本操作和带有假设的问题。然后,我们构建一个基准SC-G4,它由GPT-4使用这些模式生成的1,835个实例组成,并带有人工标注的标准答案。SC-G4是一个具有挑战性的基准,可以对LLM的能力进行详细评估。结果表明,在SC-G4中只有44.96%的实例可以被GPT-4正确回答。有趣的是,我们的初步研究表明,这些错误模式同样对其他LLM(如Claude-3和Llama-3)构成挑战,并且不能通过微调完全解决。我们的工作迈出了第一步,证明了LLM可以自主地识别其固有的缺陷,并为未来的动态和自动评估提供见解。

[NLP-89] A Multi-Task and Multi-Label Classification Model for Implicit Discourse Relation Recognition
[NLP-89] 隐式话语关系识别的多任务多标签分类模型

链接: https://arxiv.org/abs/2408.08971
作者: Nelson Filipe Costa,Leila Kosseim
关键词-EN: Discourse Relation Recognition, Implicit Discourse Relation, Relation Recognition, Implicit Discourse, Discourse Relation
关键词-ZH: 话语关系识别、隐性话语关系、关系识别、隐性话语、话语关系
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In this work, we address the inherent ambiguity in Implicit Discourse Relation Recognition (IDRR) by introducing a novel multi-task classification model capable of learning both multi-label and single-label representations of discourse relations. Leveraging the DiscoGeM corpus, we train and evaluate our model on both multi-label and traditional single-label classification tasks. To the best of our knowledge, our work presents the first truly multi-label classifier in IDRR, establishing a benchmark for multi-label classification and achieving SOTA results in single-label classification on DiscoGeM. Additionally, we evaluate our model on the PDTB 3.0 corpus for single-label classification without any prior exposure to its data. While the performance is below the current SOTA, our model demonstrates promising results indicating potential for effective transfer learning across both corpora.
摘要:在这项工作中,我们通过引入一种新型的多任务分类模型来解决隐式话语关系识别(IDRR)中固有的模糊性,该模型能够学习话语关系的多标签和单标签表示。利用DiscoGeM语料库,我们在多标签和传统单标签分类任务上训练和评估我们的模型。据我们所知,我们的工作提出了IDRR中第一个真正的多标签分类器,建立了多标签分类的基准,并在DiscoGeM上的单标签分类中取得了SOTA结果。此外,我们在事先未接触其数据的情况下,在PDTB 3.0语料库上评估了我们的模型的单标签分类性能。虽然性能低于当前的SOTA,但我们的模型显示出有希望的结果,表明在两个语料库之间进行有效迁移学习的潜力。

[NLP-90] BnSentMix: A Diverse Bengali-English Code-Mixed Dataset for Sentiment Analysis
[NLP-90] BnSentMix:用于情绪分析的多样化孟加拉语-英语代码混合数据集

链接: https://arxiv.org/abs/2408.08964
作者: Sadia Alam,Md Farhan Ishmam,Navid Hasin Alvee,Md Shahnewaz Siddique,Md Azam Hossain,Abu Raihan Mostofa Kamal
关键词-EN: provide valuable insights, widespread availability, provide valuable, valuable insights, insights into low-resource
关键词-ZH: 提供有价值的见解、广泛可用性、提供有价值的见解、对低资源的见解
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The widespread availability of code-mixed data can provide valuable insights into low-resource languages like Bengali, which have limited datasets. Sentiment analysis has been a fundamental text classification task across several languages for code-mixed data. However, there has yet to be a large-scale and diverse sentiment analysis dataset on code-mixed Bengali. We address this limitation by introducing BnSentMix, a sentiment analysis dataset on code-mixed Bengali consisting of 20,000 samples with 4 sentiment labels from Facebook, YouTube, and e-commerce sites. We ensure diversity in data sources to replicate realistic code-mixed scenarios. Additionally, we propose 14 baseline methods including novel transformer encoders further pre-trained on code-mixed Bengali-English, achieving an overall accuracy of 69.8% and an F1 score of 69.1% on sentiment classification tasks. Detailed analyses reveal variations in performance across different sentiment labels and text types, highlighting areas for future improvement.
摘要:代码混合数据的广泛可得性可以为像孟加拉语这样数据集有限的低资源语言提供有价值的见解。对于代码混合的数据,情感分析一直是跨多种语言的基本文本分类任务。然而,目前还没有关于代码混合孟加拉语的大规模、多样化的情感分析数据集。我们通过引入BnSentMix来解决这个限制,BnSentMix是一个代码混合孟加拉语的情感分析数据集,由来自Facebook、YouTube和电子商务网站的20,000个样本组成,带有4种情感标签。我们确保数据来源的多样性,以复现真实的代码混合场景。此外,我们提出了14种基线方法,包括在代码混合的孟加拉语-英语上进一步预训练的新型Transformer编码器,在情感分类任务上获得了69.8%的总体准确率和69.1%的F1得分。详细的分析揭示了不同情感标签和文本类型上的性能差异,突出了未来需要改进的领域。

[NLP-91] Adaptive Guardrails For Large Language Models via Trust Modeling and In-Context Learning
[NLP-91] 通过信任建模和上下文内学习为大型语言模型提供自适应护栏

链接: https://arxiv.org/abs/2408.08959
作者: Jinwei Hu,Yi Dong,Xiaowei Huang
关键词-EN: Large language models, maintain LLMs’ alignment, part of Large, Large language, language models
关键词-ZH: 大型语言模型,维护LLM的一致性,大型、大型语言、语言模型的一部分
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Under Review

点击查看摘要

Abstract:Guardrails have become an integral part of Large language models (LLMs), by moderating harmful or toxic response in order to maintain LLMs’ alignment to human expectations. However, the existing guardrail methods do not consider different needs and access rights of individual users, and treat all the users with the same rule. This study introduces an adaptive guardrail mechanism, supported by trust modeling and enhanced with in-context learning, to dynamically modulate access to sensitive content based on user trust metrics. By leveraging a combination of direct interaction trust and authority-verified trust, the system precisely tailors the strictness of content moderation to align with the user’s credibility and the specific context of their inquiries. Our empirical evaluations demonstrate that the adaptive guardrail effectively meets diverse user needs, outperforming existing guardrails in practicality while securing sensitive information and precisely managing potentially hazardous content through a context-aware knowledge base. This work is the first to introduce trust-oriented concept within a guardrail system, offering a scalable solution that enriches the discourse on ethical deployment for next-generation LLMs.
摘要:护栏已经成为大型语言模型(LLM)不可或缺的一部分,它通过过滤有害或有毒的回复来保持LLM与人类预期的一致性。然而,现有的护栏方法没有考虑单个用户的不同需求和访问权限,对所有用户都采用相同的规则。该研究引入了一种自适应护栏机制,该机制以信任建模为支持,并通过上下文学习进行增强,以基于用户信任度量动态调整对敏感内容的访问。通过结合直接交互信任和权威验证信任,该系统精确地调整内容审核的严格程度,使之与用户的可信度及其查询的具体上下文相符。我们的实证评估表明,自适应护栏有效地满足了不同的用户需求,在实用性上优于现有的护栏,同时通过上下文感知知识库保护敏感信息并精确管理潜在危险内容。这项工作首次在护栏系统中引入面向信任的概念,提供了一个可扩展的解决方案,丰富了关于下一代LLM伦理部署的讨论。

[NLP-92] DePrompt: Desensitization and Evaluation of Personal Identifiable Information in Large Language Model Prompts
[NLP-92] DePrompt:大型语言模型提示中个人可识别信息的脱敏与评估

链接: https://arxiv.org/abs/2408.08930
作者: Xiongtao Sun,Gan Liu,Zhipeng He,Hui Li,Xiaoguang Li
关键词-EN: widely impacting, impacting the accuracy, accuracy and interpretability, large language models, Prompt
关键词-ZH: 广泛影响,影响准确性、准确性和可解释性,大型语言模型,提示
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Prompts serve as a crucial link in interacting with large language models (LLMs), widely impacting the accuracy and interpretability of model outputs. However, acquiring accurate and high-quality responses necessitates precise prompts, which inevitably pose significant risks of personal identifiable information (PII) leakage. Therefore, this paper proposes DePrompt, a desensitization protection and effectiveness evaluation framework for prompts, enabling users to safely and transparently utilize LLMs. Specifically, by leveraging large model fine-tuning techniques as the underlying privacy protection method, we integrate contextual attributes to define privacy types, achieving high-precision PII entity identification. Additionally, through the analysis of key features in prompt desensitization scenarios, we devise adversarial generative desensitization methods that retain important semantic content while disrupting the link between identifiers and privacy attributes. Furthermore, we present utility evaluation metrics for prompts to better gauge and balance privacy and usability. Our framework is adaptable to prompts and can be extended to text usability-dependent scenarios. Through comparison with benchmarks and other model methods, experimental evaluations demonstrate that our desensitized prompts exhibit superior privacy protection utility and model inference results.

[NLP-93] VerilogCoder: Autonomous Verilog Coding Agents with Graph-based Planning and Abstract Syntax Tree (AST)-based Waveform Tracing Tool AAAI2025

Link: https://arxiv.org/abs/2408.08927
Authors: Chia-Tung Ho, Haoxing Ren, Brucek Khailany
Keywords: automating hardware design, modern Integrated Circuits, Circuit Relation Graph, growing complexity, complexity of modern
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Main paper 7 pages, references 1 page, appendix 22 pages. Under review at AAAI 2025

Abstract:Due to the growing complexity of modern Integrated Circuits (ICs), automating hardware design can remove a significant amount of human error from the engineering process and result in fewer errors. Verilog is a popular hardware description language for designing and modeling digital systems; thus, Verilog generation is one of the emerging areas of research to facilitate the design process. In this work, we propose VerilogCoder, a system of multiple Artificial Intelligence (AI) agents for Verilog code generation, to autonomously write Verilog code and fix syntax and functional errors using collaborative Verilog tools (i.e., syntax checker, simulator, and waveform tracer). Firstly, we propose a task planner that utilizes a novel Task and Circuit Relation Graph retrieval method to construct a holistic plan based on module descriptions. To debug and fix functional errors, we develop a novel and efficient abstract syntax tree (AST)-based waveform tracing tool, which is integrated within the autonomous Verilog completion flow. The proposed methodology successfully generates 94.2% syntactically and functionally correct Verilog code, surpassing the state-of-the-art methods by 33.9% on the VerilogEval-Human v2 benchmark.

[NLP-94] Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risk of Language Models

Link: https://arxiv.org/abs/2408.08926
Authors: Andy K. Zhang, Neil Perry, Riya Dulepet, Eliot Jones, Justin W. Lin, Joey Ji, Celeste Menders, Gashon Hussein, Samantha Liu, Donovan Jasper, Pura Peetathawatchai, Ari Glenn, Vikram Sivashankar, Daniel Zamoshchin, Leo Glikbarg, Derek Askaryar, Mike Yang, Teddy Zhang, Rishi Alluri, Nathan Tran, Rinnara Sangpisit, Polycarpos Yiorkadjis, Kenny Osele, Gautham Raghupathi, Dan Boneh, Daniel E. Ho, Percy Liang
Keywords: autonomously identifying vulnerabilities, Language Model, real-world impact, capable of autonomously, autonomously identifying
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 86 pages, 7 figures

Abstract:Language Model (LM) agents for cybersecurity that are capable of autonomously identifying vulnerabilities and executing exploits have the potential to cause real-world impact. Policymakers, model providers, and other researchers in the AI and cybersecurity communities are interested in quantifying the capabilities of such agents to help mitigate cyberrisk and investigate opportunities for penetration testing. Toward that end, we introduce Cybench, a framework for specifying cybersecurity tasks and evaluating agents on those tasks. We include 40 professional-level Capture the Flag (CTF) tasks from 4 distinct CTF competitions, chosen to be recent, meaningful, and spanning a wide range of difficulties. Each task includes its own description, starter files, and is initialized in an environment where an agent can execute bash commands and observe outputs. Since many tasks are beyond the capabilities of existing LM agents, we introduce subtasks, which break down a task into intermediary steps for more gradated evaluation; we add subtasks for 17 of the 40 tasks. To evaluate agent capabilities, we construct a cybersecurity agent and evaluate 7 models: GPT-4o, Claude 3 Opus, Claude 3.5 Sonnet, Mixtral 8x22b Instruct, Gemini 1.5 Pro, Llama 3 70B Chat, and Llama 3.1 405B Instruct. Without guidance, we find that agents are able to solve only the easiest complete tasks that took human teams up to 11 minutes to solve, with Claude 3.5 Sonnet and GPT-4o having the highest success rates. Finally, subtasks provide more signal for measuring performance compared to unguided runs, with models achieving a 3.2% higher success rate on complete tasks with subtask-guidance than without subtask-guidance. All code and data are publicly available at this https URL

[NLP-95] Retail-GPT: leveraging Retrieval Augmented Generation (RAG) for building E-commerce Chat Assistants

Link: https://arxiv.org/abs/2408.08925
Authors: Bruno Amaral Teixeira de Freitas, Roberto de Alencar Lotufo
Keywords: open-source RAG-based chatbot, RAG-based chatbot designed, work presents Retail-GPT, enhance user engagement, work presents
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: 5 pages, 4 figures

Abstract:This work presents Retail-GPT, an open-source RAG-based chatbot designed to enhance user engagement in retail e-commerce by guiding users through product recommendations and assisting with cart operations. The system is cross-platform and adaptable to various e-commerce domains, avoiding reliance on specific chat applications or commercial activities. Retail-GPT engages in human-like conversations, interprets user demands, checks product availability, and manages cart operations, aiming to serve as a virtual sales agent and test the viability of such assistants across different retail businesses.

[NLP-96] Prefix Guidance: A Steering Wheel for Large Language Models to Defend Against Jailbreak Attacks

Link: https://arxiv.org/abs/2408.08924
Authors: Jiawei Zhao, Kejiang Chen, Xiaojian Yuan, Weiming Zhang
Keywords: large language models, achieved remarkable performance, recent years, rapid development, development of large
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:In recent years, the rapid development of large language models (LLMs) has achieved remarkable performance across various tasks. However, research indicates that LLMs are vulnerable to jailbreak attacks, where adversaries can induce the generation of harmful content through meticulously crafted prompts. This vulnerability poses significant challenges to the secure use and promotion of LLMs. Existing defense methods offer protection from different perspectives but often suffer from insufficient effectiveness or a significant impact on the model’s capabilities. In this paper, we propose a plug-and-play and easy-to-deploy jailbreak defense framework, namely Prefix Guidance (PG), which guides the model to identify harmful prompts by directly setting the first few tokens of the model’s output. This approach combines the model’s inherent security capabilities with an external classifier to defend against jailbreak attacks. We demonstrate the effectiveness of PG across three models and five attack methods. Compared to baselines, our approach is generally more effective on average. Additionally, results on the Just-Eval benchmark further confirm PG’s superiority in preserving the model’s performance.
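
The idea of fixing the first output tokens and then consulting a classifier can be sketched roughly as follows. This is a toy illustration, not the authors' implementation: `toy_model`, `toy_classifier`, `GUIDANCE_PREFIX`, and the keyword check are all hypothetical stand-ins, since the abstract gives neither the prefix text nor the classifier design.

```python
# Hypothetical forced prefix; the actual prefix text is not given in the abstract.
GUIDANCE_PREFIX = "Before answering, I will judge whether this request is harmful:"
REFUSAL = "Sorry, I can't help with that."


def toy_model(prompt, forced_prefix=""):
    """Stand-in for an LLM: returns the forced prefix plus a canned continuation."""
    harmful = any(word in prompt.lower() for word in ("bomb", "malware"))
    if forced_prefix:
        verdict = " it is harmful." if harmful else " it is harmless."
        return forced_prefix + verdict
    return f"Answer to: {prompt}"


def toy_classifier(continuation):
    """Stand-in for the external classifier: True means 'treat as harmful'."""
    return "is harmful" in continuation


def guarded_reply(prompt, model=toy_model, classifier=toy_classifier):
    # Step 1: force the first tokens of the output so the model states a judgment.
    judged = model(prompt, forced_prefix=GUIDANCE_PREFIX)
    continuation = judged[len(GUIDANCE_PREFIX):]
    # Step 2: the classifier decides whether to refuse or to answer normally.
    if classifier(continuation):
        return REFUSAL
    return model(prompt)
```

In a real deployment the stand-ins would be replaced by an actual model call that accepts a pre-filled response prefix and by a trained classifier; the control flow is the only part this sketch tries to convey.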

[NLP-97] Graph Retrieval-Augmented Generation: A Survey

Link: https://arxiv.org/abs/2408.08921
Authors: Boci Peng, Yun Zhu, Yongchao Liu, Xiaohe Bo, Haizhou Shi, Chuntao Hong, Yan Zhang, Siliang Tang
Keywords: Large Language Models, Language Models, Large Language, achieved remarkable success, RAG refines LLM
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Ongoing work

Abstract:Recently, Retrieval-Augmented Generation (RAG) has achieved remarkable success in addressing the challenges of Large Language Models (LLMs) without necessitating retraining. By referencing an external knowledge base, RAG refines LLM outputs, effectively mitigating issues such as "hallucination", lack of domain-specific knowledge, and outdated information. However, the complex structure of relationships among different entities in databases presents challenges for RAG systems. In response, GraphRAG leverages structural information across entities to enable more precise and comprehensive retrieval, capturing relational knowledge and facilitating more accurate, context-aware responses. Given the novelty and potential of GraphRAG, a systematic review of current technologies is imperative. This paper provides the first comprehensive overview of GraphRAG methodologies. We formalize the GraphRAG workflow, encompassing Graph-Based Indexing, Graph-Guided Retrieval, and Graph-Enhanced Generation. We then outline the core technologies and training methods at each stage. Additionally, we examine downstream tasks, application domains, evaluation methodologies, and industrial use cases of GraphRAG. Finally, we explore future research directions to inspire further inquiries and advance progress in the field.
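
The three stages named in the abstract (Graph-Based Indexing, Graph-Guided Retrieval, Graph-Enhanced Generation) can be pictured with a very small sketch. Everything here is an assumption made for illustration: the triples are made up, entity matching is plain substring search, and the final step only builds a prompt string rather than calling a model.

```python
from collections import defaultdict

# Toy knowledge base as (head, relation, tail) triples; the entities are made up.
TRIPLES = [
    ("aspirin", "treats", "headache"),
    ("aspirin", "interacts_with", "warfarin"),
    ("warfarin", "treats", "thrombosis"),
]


def build_index(triples):
    """Graph-based indexing: adjacency list keyed by entity."""
    graph = defaultdict(list)
    for head, rel, tail in triples:
        graph[head].append((rel, tail))
        graph[tail].append((f"inverse_{rel}", head))
    return graph


def retrieve(graph, query, hops=1):
    """Graph-guided retrieval: collect edges within `hops` of entities named in the query."""
    frontier = {e for e in graph if e in query.lower()}
    seen, facts = set(frontier), []
    for _ in range(hops):
        next_frontier = set()
        for entity in sorted(frontier):
            for rel, neighbor in graph[entity]:
                facts.append(f"{entity} {rel} {neighbor}")
                if neighbor not in seen:
                    seen.add(neighbor)
                    next_frontier.add(neighbor)
        frontier = next_frontier
    return facts


def build_prompt(query, facts):
    """Graph-enhanced generation: serialize retrieved facts as context for an LLM."""
    context = "\n".join(f"- {fact}" for fact in facts)
    return f"Context:\n{context}\n\nQuestion: {query}"
```

Raising `hops` widens the retrieved neighborhood, which is the kind of relational context that flat text retrieval would not surface on its own.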

[NLP-98] What should I wear to a party in a Greek taverna? Evaluation for Conversational Agents in the Fashion Domain KDD

Link: https://arxiv.org/abs/2408.08907
Authors: Antonis Maronikolakis, Ana Peleteiro Ramallo, Weiwei Cheng, Thomas Kober
Keywords: online fashion retail, enhancing customer experience, Large language models, poised to revolutionize, revolutionize the domain
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: Accepted at KDD workshop on Evaluation and Trustworthiness of Generative AI Models

Abstract:Large language models (LLMs) are poised to revolutionize the domain of online fashion retail, enhancing customer experience and discovery of fashion online. LLM-powered conversational agents introduce a new way of discovery by directly interacting with customers, enabling them to express themselves in their own ways, refine their needs, and obtain fashion and shopping advice that is relevant to their taste and intent. For many tasks in e-commerce, such as finding a specific product, conversational agents need to convert their interactions with a customer to a specific call to different backend systems, e.g., a search system to showcase a relevant set of products. Therefore, evaluating the capabilities of LLMs to perform those tasks related to calling other services is vital. However, those evaluations are generally complex, due to the lack of relevant and high quality datasets, and do not align seamlessly with business needs, amongst others. To this end, we created a multilingual evaluation dataset of 4k conversations between customers and a fashion assistant in a large e-commerce fashion platform to measure the capabilities of LLMs to serve as an assistant between customers and a backend engine. We evaluate a range of models, showcasing how our dataset scales to business needs and facilitates iterative development of tools.

[NLP-99] Kov: Transferable and Naturalistic Black-Box LLM Attacks using Markov Decision Processes and Tree Search

Link: https://arxiv.org/abs/2408.08899
Authors: Robert J. Moss
Keywords: Eliciting harmful behavior, Eliciting harmful, important task, task to ensure, ensure the proper
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:Eliciting harmful behavior from large language models (LLMs) is an important task to ensure the proper alignment and safety of the models. Often when training LLMs, ethical guidelines are followed yet alignment failures may still be uncovered through red teaming adversarial attacks. This work frames the red-teaming problem as a Markov decision process (MDP) and uses Monte Carlo tree search to find harmful behaviors of black-box, closed-source LLMs. We optimize token-level prompt suffixes towards targeted harmful behaviors on white-box LLMs and include a naturalistic loss term, log-perplexity, to generate more natural language attacks for better interpretability. The proposed algorithm, Kov, trains on white-box LLMs to optimize the adversarial attacks and periodically evaluates responses from the black-box LLM to guide the search towards more harmful black-box behaviors. In our preliminary study, results indicate that we can jailbreak black-box models, such as GPT-3.5, in only 10 queries, yet fail on GPT-4 - which may indicate that newer models are more robust to token-level attacks. All work to reproduce these results is open sourced (this https URL).
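
The "naturalistic loss term, log-perplexity" mentioned in the abstract has a standard definition: the negative mean log-probability that a language model assigns to the tokens. A minimal sketch follows; the additive combination and the `weight` value in `attack_loss` are assumptions for illustration, since the abstract does not say how the term is weighted.

```python
import math


def log_perplexity(token_probs):
    """Log-perplexity: negative mean log-probability of the tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def attack_loss(target_loss, suffix_token_probs, weight=0.1):
    """Hypothetical combined objective: target loss plus a weighted naturalness penalty."""
    return target_loss + weight * log_perplexity(suffix_token_probs)
```

A suffix made of likely tokens gets a small penalty, while a garbled suffix made of very unlikely tokens gets a large one, which is what pushes the search toward more natural-looking text.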

[NLP-100] Enhancing Exploratory Learning through Exploratory Search with the Emergence of Large Language Models

Link: https://arxiv.org/abs/2408.08894
Authors: Yiming Luo, Patrick Cheong-Iao, Shanton Chang
Keywords: large language models, learners find, challenging issue, confused learners, large language
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 11 pages, 7 figures

Abstract:In the information era, how learners find, evaluate, and effectively use information has become a challenging issue, especially with the added complexity of large language models (LLMs) that have further confused learners in their information retrieval and search activities. This study attempts to unpack this complexity by combining exploratory search strategies with the theories of exploratory learning to form a new theoretical model of exploratory learning from the perspective of students’ learning. Our work adapts Kolb’s learning model by incorporating high-frequency exploration and feedback loops, aiming to promote deep cognitive and higher-order cognitive skill development in students. Additionally, this paper discusses and suggests how advanced LLMs integrated into information retrieval and information theory can support students in their exploratory searches, contributing theoretically to promoting student-computer interaction and supporting their learning journeys in the new era with LLMs.

[NLP-101] LEGENT: Open Platform for Embodied Agents ACL2024

Link: https://arxiv.org/abs/2404.18243
Authors: Zhili Cheng, Zhitong Wang, Jinyi Hu, Shengding Hu, An Liu, Yuge Tu, Pengkai Li, Lei Shi, Zhiyuan Liu, Maosong Sun
Keywords: Large Language Models, Large Multimodal Models, hindering complex real-life, Large Language, Large Multimodal
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: ACL 2024 System Demonstration

Abstract:Despite advancements in Large Language Models (LLMs) and Large Multimodal Models (LMMs), their integration into language-grounded, human-like embodied agents remains incomplete, hindering complex real-life task performance in physical environments. Existing integrations often feature limited open sourcing, challenging collective progress in this field. We introduce LEGENT, an open, scalable platform for developing embodied agents using LLMs and LMMs. LEGENT offers a dual approach: a rich, interactive 3D environment with communicable and actionable agents, paired with a user-friendly interface, and a sophisticated data generation pipeline utilizing advanced algorithms to exploit supervision from simulated worlds at scale. In our experiments, an embryonic vision-language-action model trained on LEGENT-generated data surpasses GPT-4V in embodied tasks, showcasing promising generalization capabilities.

[NLP-102] PhysBERT: A Text Embedding Model for Physics Scientific Literature

Link: https://arxiv.org/abs/2408.09574
Authors: Thorsten Hellert, João Montenegro, Andrea Pollastro
Keywords: Natural Language Processing, Language Processing, pose significant challenges, Natural Language, extraction through Natural
Categories: Computational Physics (physics.comp-ph); Computation and Language (cs.CL)

Abstract:The specialized language and complex concepts in physics pose significant challenges for information extraction through Natural Language Processing (NLP). Central to effective NLP applications is the text embedding model, which converts text into dense vector representations for efficient information retrieval and semantic analysis. In this work, we introduce PhysBERT, the first physics-specific text embedding model. Pre-trained on a curated corpus of 1.2 million arXiv physics papers and fine-tuned with supervised data, PhysBERT outperforms leading general-purpose models on physics-specific tasks including the effectiveness in fine-tuning for specific physics subdomains.
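
To make concrete what "converts text into dense vector representations for efficient information retrieval" means in practice, here is a small retrieval sketch. The `toy_embed` function is a bag-of-words stand-in and is not PhysBERT; a real embedding model would return a dense float vector, but the ranking step by cosine similarity would look much the same.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in embedding: bag-of-words counts. A real model returns a dense vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def search(query, documents):
    """Rank documents by similarity to the query, most similar first."""
    q = toy_embed(query)
    return sorted(documents, key=lambda d: cosine(q, toy_embed(d)), reverse=True)
```

The weakness of the stand-in is also the motivation for a domain-specific model: word counts only match identical tokens, whereas a trained embedding can place related physics terms near each other even when they share no words.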

[NLP-103] Enhancing Startup Success Predictions in Venture Capital: A GraphRAG Augmented Multivariate Time Series Method

Link: https://arxiv.org/abs/2408.09420
Authors: Gao Zitian, Xiao Yihao
Keywords: subjective revenue forecasts, limited financial data, revenue forecasts, Venture Capital, challenging due
Categories: Computational Finance (q-fin.CP); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:In the Venture Capital (VC) industry, predicting the success of startups is challenging due to limited financial data and the need for subjective revenue forecasts. Previous methods based on time series analysis or deep learning often fall short as they fail to incorporate crucial inter-company relationships such as competition and collaboration. To address these issues, we propose a novel approach using a GraphRAG-augmented time series model. With GraphRAG, time series predictive methods are enhanced by integrating these vital relationships into the analysis framework, allowing for a more dynamic understanding of the startup ecosystem in venture capital. Our experimental results demonstrate that our model significantly outperforms previous models in startup success predictions. To the best of our knowledge, our work is the first application work of GraphRAG.

[NLP-104] Generating Data with Text-to-Speech and Large-Language Models for Conversational Speech Recognition

Link: https://arxiv.org/abs/2408.09215
Authors: Samuele Cornell, Jordan Darefsky, Zhiyao Duan, Shinji Watanabe
Keywords: leverage large scale, large scale pre-trained, scale pre-trained models, speech processing tasks, processing tasks
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Comments: To appear at SynData4GenAI 2024 workshop

Abstract:Currently, a common approach in many speech processing tasks is to leverage large scale pre-trained models by fine-tuning them on in-domain data for a particular application. Yet obtaining even a small amount of such data can be problematic, especially for sensitive domains and conversational speech scenarios, due to both privacy issues and annotation costs. To address this, synthetic data generation using single speaker datasets has been employed. Yet, for multi-speaker cases, such an approach often requires extensive manual effort and is prone to domain mismatches. In this work, we propose a synthetic data generation pipeline for multi-speaker conversational ASR, leveraging a large language model (LLM) for content creation and a conversational multi-speaker text-to-speech (TTS) model for speech synthesis. We conduct evaluation by fine-tuning the Whisper ASR model for telephone and distant conversational speech settings, using both in-domain data and generated synthetic data. Our results show that the proposed method is able to significantly outperform classical multi-speaker generation approaches that use external, non-conversational speech datasets.

Artificial Intelligence

[AI-0] KAN 2.0: Kolmogorov-Arnold Networks Meet Science

Link: https://arxiv.org/abs/2408.10205
Authors: Ziming Liu, Pingchuan Ma, Yixuan Wang, Wojciech Matusik, Max Tegmark
Keywords: inherent incompatibility, depends on symbolism, Science, Science lies, science depends
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph); Data Analysis, Statistics and Probability (physics.data-an)
Comments: 27 pages, 14 figures

Abstract:A major challenge of AI + Science lies in their inherent incompatibility: today’s AI is primarily based on connectionism, while science depends on symbolism. To bridge the two worlds, we propose a framework to seamlessly synergize Kolmogorov-Arnold Networks (KANs) and science. The framework highlights KANs’ usage for three aspects of scientific discovery: identifying relevant features, revealing modular structures, and discovering symbolic formulas. The synergy is bidirectional: science to KAN (incorporating scientific knowledge into KANs), and KAN to science (extracting scientific insights from KANs). We highlight major new functionalities in the pykan package: (1) MultKAN: KANs with multiplication nodes. (2) kanpiler: a KAN compiler that compiles symbolic formulas into KANs. (3) tree converter: convert KANs (or any neural networks) to tree graphs. Based on these tools, we demonstrate KANs’ capability to discover various types of physical laws, including conserved quantities, Lagrangians, symmetries, and constitutive laws.

[AI-1] Demystifying the Communication Characteristics for Distributed Transformer Models

Link: https://arxiv.org/abs/2408.10197
Authors: Quentin Anthony, Benjamin Michalowicz, Jacob Hatef, Lang Xu, Mustafa Abduljabbar, Aamir Shafi, Hari Subramoni, Dhabaleswar Panda
Keywords: time series prediction, Deep learning, audio generation, series prediction, time series
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)

Abstract:Deep learning (DL) models based on the transformer architecture have revolutionized many DL applications such as large language models (LLMs), vision transformers, audio generation, and time series prediction. Much of this progress has been fueled by distributed training, yet distributed communication remains a substantial bottleneck to training progress. This paper examines the communication behavior of transformer models - that is, how different parallelism schemes used in multi-node/multi-GPU DL Training communicate data in the context of transformers. We use GPT-based language models as a case study of the transformer architecture due to their ubiquity. We validate the empirical results obtained from our communication logs using analytical models. At a high level, our analysis reveals a need to optimize small message point-to-point communication further, correlations between sequence length, per-GPU throughput, model size, and optimizations used, and where to potentially guide further optimizations in framework and HPC middleware design and optimization.

[AI-2] SpaRP: Fast 3D Object Reconstruction and Pose Estimation from Sparse Views ECCV2024

Link: https://arxiv.org/abs/2408.10195
Authors: Chao Xu, Ang Li, Linghao Chen, Yulin Liu, Ruoxi Shi, Hao Su, Minghua Liu
Keywords: attracted considerable attention, recently attracted considerable, generation has recently, considerable attention, recently attracted
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments: ECCV 2024

Abstract:Open-world 3D generation has recently attracted considerable attention. While many single-image-to-3D methods have yielded visually appealing outcomes, they often lack sufficient controllability and tend to produce hallucinated regions that may not align with users’ expectations. In this paper, we explore an important scenario in which the input consists of one or a few unposed 2D images of a single object, with little or no overlap. We propose a novel method, SpaRP, to reconstruct a 3D textured mesh and estimate the relative camera poses for these sparse-view images. SpaRP distills knowledge from 2D diffusion models and finetunes them to implicitly deduce the 3D spatial relationships between the sparse views. The diffusion model is trained to jointly predict surrogate representations for camera poses and multi-view images of the object under known poses, integrating all information from the input sparse views. These predictions are then leveraged to accomplish 3D reconstruction and pose estimation, and the reconstructed 3D model can be used to further refine the camera poses of input views. Through extensive experiments on three datasets, we demonstrate that our method not only significantly outperforms baseline methods in terms of 3D reconstruction quality and pose prediction accuracy but also exhibits strong efficiency. It requires only about 20 seconds to produce a textured mesh and camera poses for the input views. Project page: this https URL.

[AI-3] Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Models

Link: https://arxiv.org/abs/2408.10189
Authors: Aviv Bick, Kevin Y. Li, Eric P. Xing, J. Zico Kolter, Albert Gu
Keywords: inference settings due, quadratic-time self-attention, dominant paradigm, paradigm for domains, domains like language
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Transformer architectures have become a dominant paradigm for domains like language modeling but suffer in many inference settings due to their quadratic-time self-attention. Recently proposed subquadratic architectures, such as Mamba, have shown promise, but have been pretrained with substantially less computational resources than the strongest Transformer models. In this work, we present a method that is able to distill a pretrained Transformer architecture into alternative architectures such as state space models (SSMs). The key idea to our approach is that we can view both Transformers and SSMs as applying different forms of mixing matrices over the token sequences. We can thus progressively distill the Transformer architecture by matching different degrees of granularity in the SSM: first matching the mixing matrices themselves, then the hidden units at each block, and finally the end-to-end predictions. Our method, called MOHAWK, is able to distill a Mamba-2 variant based on the Phi-1.5 architecture (Phi-Mamba) using only 3B tokens and a hybrid version (Hybrid Phi-Mamba) using 5B tokens. Despite using less than 1% of the training data typically used to train models from scratch, Phi-Mamba boasts substantially stronger performance compared to all past open-source non-Transformer models. MOHAWK allows models like SSMs to leverage computational resources invested in training Transformer-based architectures, highlighting a new avenue for building such models.
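
The "mixing matrix" view in the abstract can be written down for the attention side using the standard causal self-attention formula. The notation below is an assumption for illustration: $X$ is the token sequence, $M$ the matrix that mixes information across positions, and the Frobenius norm in the last line is only one plausible choice of distance, since the abstract says the matrices are matched but not how.

```latex
% Sequence mixing as a matrix acting on the token sequence X (one row per token)
Y = M X

% Standard causal self-attention as one such mixing matrix (d = head dimension);
% the causal mask sets entries above the diagonal to -infinity before the softmax
M_{\text{attn}} = \operatorname{softmax}\!\left( \frac{Q K^{\top}}{\sqrt{d}} + \text{mask}_{\text{causal}} \right)

% Hypothetical first-stage objective: make the student's mixing matrix
% (parameters \theta) close to the teacher's
\min_{\theta} \; \bigl\lVert M_{\text{teacher}} - M_{\text{student}}(\theta) \bigr\rVert_{F}
```

The later stages described in the abstract then move the comparison outward, first to the hidden units at each block and finally to the end-to-end predictions.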

[AI-4] Imbalance-Aware Culvert-Sewer Defect Segmentation Using an Enhanced Feature Pyramid Network

Link: https://arxiv.org/abs/2408.10181
Authors: Rasha Alshawi, Md Meftahul Ferdaus, Mahdi Abdelguerfi, Kendall Niles, Ken Pathak, Steve Sloan
Keywords: Feature Pyramid Network, Enhanced Feature Pyramid, significant challenge, Imbalanced datasets, Pyramid Network
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Imbalanced datasets are a significant challenge in real-world scenarios. They lead to models that underperform on underrepresented classes, which is a critical issue in infrastructure inspection. This paper introduces the Enhanced Feature Pyramid Network (E-FPN), a deep learning model for the semantic segmentation of culverts and sewer pipes within imbalanced datasets. The E-FPN incorporates architectural innovations like sparsely connected blocks and depth-wise separable convolutions to improve feature extraction and handle object variations. To address dataset imbalance, the model employs strategies like class decomposition and data augmentation. Experimental results on the culvert-sewer defects dataset and a benchmark aerial semantic segmentation drone dataset show that the E-FPN outperforms state-of-the-art methods, achieving an average Intersection over Union (IoU) improvement of 13.8% and 27.2%, respectively. Additionally, class decomposition and data augmentation together boost the model’s performance by approximately 6.9% IoU. The proposed E-FPN presents a promising solution for enhancing object segmentation in challenging, multi-class real-world datasets, with potential applications extending beyond culvert-sewer defect detection.
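
The Intersection over Union (IoU) figures quoted in this abstract can be made concrete with a small sketch. It is not the authors' evaluation code: the function names `per_class_iou` and `mean_iou`, the integer label-map layout, and the choice to skip classes absent from both maps are all assumptions made for illustration.

```python
from typing import Dict, List

LabelMap = List[List[int]]  # assumed layout: one integer class id per pixel


def per_class_iou(pred: LabelMap, target: LabelMap, num_classes: int) -> Dict[int, float]:
    """Return the IoU of every class that occurs in pred or target."""
    intersection = [0] * num_classes
    union = [0] * num_classes
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p == t:
                # The pixel is in both the predicted and the true region of class p.
                intersection[p] += 1
                union[p] += 1
            else:
                # The pixel is in the predicted region of p and the true region of t.
                union[p] += 1
                union[t] += 1
    return {c: intersection[c] / union[c] for c in range(num_classes) if union[c] > 0}


def mean_iou(pred: LabelMap, target: LabelMap, num_classes: int) -> float:
    """Average IoU over the classes that are present (an assumed convention)."""
    ious = per_class_iou(pred, target, num_classes)
    return sum(ious.values()) / len(ious)


if __name__ == "__main__":
    pred = [[0, 0, 1], [0, 2, 1]]
    target = [[0, 1, 1], [0, 2, 2]]
    print(per_class_iou(pred, target, 3))
    print(mean_iou(pred, target, 3))
```

Because the mean is taken per class rather than per pixel, a rare defect class counts as much as the background class, which is why the metric is sensitive to the dataset imbalance the paper targets.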

[AI-5] NeuRodin: A Two-stage Framework for High-Fidelity Neural Surface Reconstruction

Link: https://arxiv.org/abs/2408.10178
Authors: Yifan Wang,Di Huang,Weicai Ye,Guofeng Zhang,Wanli Ouyang,Tong He
Keywords-EN: Signed Distance Function, Signed Distance, Distance Function, based volume rendering, demonstrated significant capabilities
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Signed Distance Function (SDF)-based volume rendering has demonstrated significant capabilities in surface reconstruction. Although promising, SDF-based methods often fail to capture detailed geometric structures, resulting in visible defects. By comparing SDF-based volume rendering to density-based volume rendering, we identify two main factors within the SDF-based approach that degrade surface quality: SDF-to-density representation and geometric regularization. These factors introduce challenges that hinder the optimization of the SDF field. To address these issues, we introduce NeuRodin, a novel two-stage neural surface reconstruction framework that not only achieves high-fidelity surface reconstruction but also retains the flexible optimization characteristics of density-based methods. NeuRodin incorporates innovative strategies that facilitate transformation of arbitrary topologies and reduce artifacts associated with density bias. Extensive evaluations on the Tanks and Temples and ScanNet++ datasets demonstrate the superiority of NeuRodin, showing strong reconstruction capabilities for both indoor and outdoor environments using solely posed RGB captures. Project website: this https URL

[AI-6] Fairness Under Cover: Evaluating the Impact of Occlusions on Demographic Bias in Facial Recognition ECCV

Link: https://arxiv.org/abs/2408.10175
Authors: Rafael M. Mamede,Pedro C. Neto,Ana F. Sequeira
Keywords-EN: face recognition systems, Fairness Discrepancy Rate, face recognition, Face Occlusion Impact, face recognition models
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at ECCV Workshop FAILED

Abstract:This study investigates the effects of occlusions on the fairness of face recognition systems, particularly focusing on demographic biases. Using the Racial Faces in the Wild (RFW) dataset and synthetically added realistic occlusions, we evaluate their effect on the performance of face recognition models trained on the BUPT-Balanced and BUPT-GlobalFace datasets. We note increases in the dispersion of FMR, FNMR, and accuracy alongside decreases in fairness according to Equalized Odds, Demographic Parity, STD of Accuracy, and Fairness Discrepancy Rate. Additionally, we utilize a pixel attribution method to understand the importance of occlusions in model predictions, proposing a new metric, Face Occlusion Impact Ratio (FOIR), that quantifies the extent to which occlusions affect model performance across different demographic groups. Our results indicate that occlusions exacerbate existing demographic biases, with models placing higher importance on occlusions in an unequal fashion, particularly affecting African individuals more severely.
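
One of the fairness measures named above, the standard deviation (STD) of accuracy across demographic groups, is simple enough to sketch. This is an illustrative reading of the metric, not the paper's code: the record format, the group labels `group_a` and `group_b`, and the use of the population (rather than sample) standard deviation are assumptions.

```python
from statistics import pstdev
from typing import Dict, Iterable, Tuple


def group_accuracies(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute accuracy per group from (group, prediction_is_correct) pairs."""
    totals: Dict[str, int] = {}
    correct: Dict[str, int] = {}
    for group, is_correct in records:
        totals[group] = totals.get(group, 0) + 1
        correct[group] = correct.get(group, 0) + int(is_correct)
    return {group: correct[group] / totals[group] for group in totals}


def accuracy_std(records: Iterable[Tuple[str, bool]]) -> float:
    """Dispersion of per-group accuracy; larger values suggest less equal treatment."""
    return pstdev(list(group_accuracies(records).values()))


if __name__ == "__main__":
    # Hypothetical verification outcomes for two groups.
    records = [("group_a", True)] * 4 + [("group_b", True)] * 2 + [("group_b", False)] * 2
    print(group_accuracies(records))
    print(accuracy_std(records))
```

Running the same computation on occluded and unoccluded test sets would show whether occlusions widen the gap between groups, which is the comparison the abstract describes.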

[AI-7] SMILE: Zero-Shot Sparse Mixture of Low-Rank Experts Construction From Pre-Trained Foundation Models

Link: https://arxiv.org/abs/2408.10174
Authors: Anke Tang,Li Shen,Yong Luo,Shuai Xie,Han Hu,Lefei Zhang,Bo Du,Dacheng Tao
Keywords-EN: deep model fusion, Deep model, Deep model training, model fusion techniques, model fusion
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Code is available at this https URL

Abstract:Deep model training on extensive datasets is increasingly becoming cost-prohibitive, prompting the widespread adoption of deep model fusion techniques to leverage knowledge from pre-existing models. From simple weight averaging to more sophisticated methods like AdaMerging, model fusion effectively improves model performance and accelerates the development of new models. However, potential interference between parameters of individual models and the lack of interpretability in the fusion progress remain significant challenges. Existing methods often try to resolve the parameter interference issue by evaluating attributes of parameters, such as their magnitude or sign, or by parameter pruning. In this study, we begin by examining the fine-tuning of linear layers through the lens of subspace analysis and explicitly define parameter interference as an optimization problem to shed light on this subject. Subsequently, we introduce an innovative approach to model fusion called zero-shot Sparse MIxture of Low-rank Experts (SMILE) construction, which allows for the upscaling of source models into an MoE model without extra data or further training. Our approach relies on the observation that fine-tuning mostly keeps the important parts from the pre-training, but it uses less significant or unused areas to adapt to new tasks. Also, the issue of parameter interference, which is intrinsically intractable in the original parameter space, can be managed by expanding the dimensions. We conduct extensive experiments across diverse scenarios, such as image classification and text generalization tasks, using full fine-tuning and LoRA fine-tuning, and we apply our method to large language models (CLIP models, Flan-T5 models, and Mistral-7B models), highlighting the adaptability and scalability of SMILE. Code is available at this https URL

[AI-8] Customizing Language Models with Instance-wise LoRA for Sequential Recommendation

Link: https://arxiv.org/abs/2408.10159
Authors: Xiaoyu Kong,Jiancan Wu,An Zhang,Leheng Sheng,Hui Lin,Xiang Wang,Xiangnan He
Keywords-EN: Large Language Models, Sequential recommendation systems, recommendation systems predict, analyzing past interactions, Sequential recommendation
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract:Sequential recommendation systems predict a user’s next item of interest by analyzing past interactions, aligning recommendations with individual preferences. Leveraging the strengths of Large Language Models (LLMs) in knowledge comprehension and reasoning, recent approaches have applied LLMs to sequential recommendation through language generation paradigms. These methods convert user behavior sequences into prompts for LLM fine-tuning, utilizing Low-Rank Adaptation (LoRA) modules to refine recommendations. However, the uniform application of LoRA across diverse user behaviors sometimes fails to capture individual variability, leading to suboptimal performance and negative transfer between disparate sequences. To address these challenges, we propose Instance-wise LoRA (iLoRA), integrating LoRA with the Mixture of Experts (MoE) framework. iLoRA creates a diverse array of experts, each capturing specific aspects of user preferences, and introduces a sequence representation guided gate function. This gate function processes historical interaction sequences to generate enriched representations, guiding the gating network to output customized expert participation weights. This tailored approach mitigates negative transfer and dynamically adjusts to diverse behavior patterns. Extensive experiments on three benchmark datasets demonstrate the effectiveness of iLoRA, highlighting its superior performance compared to existing methods in capturing user-specific preferences and improving recommendation accuracy.
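
The idea of a gate that turns a sequence representation into per-expert weights over LoRA modules can be sketched as follows. This is a generic mixture-of-LoRA-experts forward pass written for illustration, not the iLoRA implementation: the function `gated_lora_forward`, the linear gate `gate_w`, and the way each expert is stored as a pair of low-rank matrices `(a, b)` are assumptions.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def softmax(logits: Vector) -> Vector:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gated_lora_forward(
    x: Vector,
    seq_repr: Vector,
    base_w: Matrix,
    experts: List[Tuple[Matrix, Matrix]],
    gate_w: Matrix,
) -> Tuple[Vector, Vector]:
    """Frozen base layer plus a gate-weighted sum of low-rank expert updates."""
    # The gate sees the sequence representation, so weights differ per instance.
    weights = softmax(matvec(gate_w, seq_repr))
    out = matvec(base_w, x)
    for weight, (a, b) in zip(weights, experts):
        delta = matvec(b, matvec(a, x))  # low-rank update B(Ax)
        out = [o + weight * d for o, d in zip(out, delta)]
    return out, weights


if __name__ == "__main__":
    base_w = [[1.0, 0.0], [0.0, 1.0]]
    experts = [
        ([[1.0, 0.0]], [[1.0], [0.0]]),  # rank-1 expert acting on feature 0
        ([[0.0, 1.0]], [[0.0], [1.0]]),  # rank-1 expert acting on feature 1
    ]
    gate_w = [[1.0, 0.0], [0.0, 1.0]]
    out, weights = gated_lora_forward([2.0, 4.0], [0.0, 0.0], base_w, experts, gate_w)
    print(out, weights)
```

Because the weights depend on `seq_repr`, two users with different histories receive different blends of the same experts, which is the mechanism the abstract credits with reducing negative transfer.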

[AI-9] Rhyme-aware Chinese lyric generator based on GPT

Link: https://arxiv.org/abs/2408.10130
Authors: Yixiao Yuan,Yangchen Huang,Yu Ma,Xinjin Li,Zhenglin Li,Yiming Shi,Huapeng Zhou
Keywords-EN: Neural language representation, effectively capture rich, capture rich semantic, rich semantic patterns, consistently improve natural
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Neural language representation models such as GPT, pre-trained on large-scale corpora, can effectively capture rich semantic patterns from plain text and be fine-tuned to consistently improve natural language generation performance. However, existing pre-trained language models used to generate lyrics rarely consider rhyme information, which is crucial in lyrics. Using a pre-trained model directly results in poor performance. To enhance the rhyming quality of generated lyrics, we incorporate integrated rhyme information into our model, thereby improving lyric generation performance.

[AI-10] Advancing Voice Cloning for Nepali: Leveraging Transfer Learning in a Low-Resource Language

Link: https://arxiv.org/abs/2408.10128
Authors: Manjil Karki,Pratik Shakya,Sandesh Acharya,Ravi Pandit,Dinesh Gothe
Keywords-EN: personalized speech interfaces, prominent feature, feature in personalized, Voice cloning, speaker
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: 7 pages, 10 figures

Abstract:Voice cloning is a prominent feature in personalized speech interfaces. A neural vocal cloning system can mimic someone’s voice using just a few audio samples. Both speaker encoding and speaker adaptation are topics of research in the field of voice cloning. Speaker adaptation relies on fine-tuning a multi-speaker generative model, which involves training a separate model to infer a new speaker embedding used for speaker encoding. Both methods can achieve excellent performance, even with a small number of cloning audios, in terms of the speech’s naturalness and similarity to the original speaker. Speaker encoding approaches are more appropriate for low-resource deployment since they require significantly less memory and have a faster cloning time than speaker adaptation, which can offer slightly greater naturalness and similarity. The main goal is to create a vocal cloning system that produces audio output with a Nepali accent or that sounds like Nepali. For the further advancement of TTS, the idea of transfer learning was effectively used to address several issues that were encountered in the development of this system, including the poor audio quality and the lack of available data.

[AI-11] Learning Brave Assumption-Based Argumentation Frameworks via ASP WWW ECAI2024

Link: https://arxiv.org/abs/2408.10126
Authors: Emanuele De Angelis(1),Maurizio Proietti(1),Francesca Toni(2) ((1) CNR-IASI, Rome, Italy, (2) Imperial, London, UK)
Keywords-EN: Assumption-based Argumentation, including logic programming, including logic, unifying formalism, forms of non-monotonic
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Comments: Extended version of the paper accepted at the 27th European Conference on Artificial Intelligence (ECAI 2024); Paper ID: M1488 ( this https URL )

Abstract:Assumption-based Argumentation (ABA) is advocated as a unifying formalism for various forms of non-monotonic reasoning, including logic programming. It allows capturing defeasible knowledge, subject to argumentative debate. While, in much existing work, ABA frameworks are given up-front, in this paper we focus on the problem of automating their learning from background knowledge and positive/negative examples. Unlike prior work, we newly frame the problem in terms of brave reasoning under stable extensions for ABA. We present a novel algorithm based on transformation rules (such as Rote Learning, Folding, Assumption Introduction and Fact Subsumption) and an implementation thereof that makes use of Answer Set Programming. Finally, we compare our technique to state-of-the-art ILP systems that learn defeasible knowledge.

[AI-12] Molecular Graph Representation Learning Integrating Large Language Models with Domain-specific Small Models

Link: https://arxiv.org/abs/2408.10124
Authors: Tianyu Zhang,Yuxiang Ren,Chengbin Hou,Hairong Lv,Xuegong Zhang
Keywords-EN: drug discovery, crucial foundation, foundation for drug, Domain-specific Small Models, Large Language Models
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Chemical Physics (physics.chem-ph); Biomolecules (q-bio.BM)

Abstract:Molecular property prediction is a crucial foundation for drug discovery. In recent years, pre-trained deep learning models have been widely applied to this task. Some approaches that incorporate prior biological domain knowledge into the pre-training framework have achieved impressive results. However, these methods heavily rely on biochemical experts, and retrieving and summarizing vast amounts of domain knowledge literature is both time-consuming and expensive. Large Language Models (LLMs) have demonstrated remarkable performance in understanding and efficiently providing general knowledge. Nevertheless, they occasionally exhibit hallucinations and lack precision in generating domain-specific knowledge. Conversely, Domain-specific Small Models (DSMs) possess rich domain knowledge and can accurately calculate molecular domain-related metrics. However, due to their limited model size and singular functionality, they lack the breadth of knowledge necessary for comprehensive representation learning. To leverage the advantages of both approaches in molecular property prediction, we propose a novel Molecular Graph representation learning framework that integrates Large language models and Domain-specific small models (MolGraph-LarDo). Technically, we design a two-stage prompt strategy where DSMs are introduced to calibrate the knowledge provided by LLMs, enhancing the accuracy of domain-specific information and thus enabling LLMs to generate more precise textual descriptions for molecular samples. Subsequently, we employ a multi-modal alignment method to coordinate various modalities, including molecular graphs and their corresponding descriptive texts, to guide the pre-training of molecular representations. Extensive experiments demonstrate the effectiveness of the proposed method.

[AI-13] Geometry Informed Tokenization of Molecules for Language Model Generation

Link: https://arxiv.org/abs/2408.10120
Authors: Xiner Li,Limei Wang,Youzhi Luo,Carl Edwards,Shurui Gui,Yuchao Lin,Heng Ji,Shuiwang Ji
Keywords-EN: space using language, language models, requires discrete tokenization, Abstract, requires discrete
Categories: Artificial Intelligence (cs.AI)

Abstract:We consider molecule generation in 3D space using language models (LMs), which requires discrete tokenization of 3D molecular geometries. Although tokenization of molecular graphs exists, that for 3D geometries is largely unexplored. Here, we attempt to bridge this gap by proposing the Geo2Seq, which converts molecular geometries into SE(3)-invariant 1D discrete sequences. Geo2Seq consists of canonical labeling and invariant spherical representation steps, which together maintain geometric and atomic fidelity in a format conducive to LMs. Our experiments show that, when coupled with Geo2Seq, various LMs excel in molecular geometry generation, especially in controlled generation tasks.

[AI-14] Factorized-Dreamer: Training A High-Quality Video Generator with Limited and Low-Quality Data

Link: https://arxiv.org/abs/2408.10119
Authors: Tao Yang,Yangming Shi,Yunwen Huang,Feng Chen,Yin Zheng,Lei Zhang
Keywords-EN: gained significant attention, significant attention due, enhancement and translation, gained significant, wide applications
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Text-to-video (T2V) generation has gained significant attention due to its wide applications to video generation, editing, enhancement and translation, etc. However, high-quality (HQ) video synthesis is extremely challenging because of the diverse and complex motions that exist in the real world. Most existing works struggle to address this problem by collecting large-scale HQ videos, which are inaccessible to the community. In this work, we show that publicly available limited and low-quality (LQ) data are sufficient to train a HQ video generator without recaptioning or finetuning. We factorize the whole T2V generation process into two steps: generating an image conditioned on a highly descriptive caption, and synthesizing the video conditioned on the generated image and a concise caption of motion details. Specifically, we present Factorized-Dreamer, a factorized spatiotemporal framework with several critical designs for T2V generation, including an adapter to combine text and image embeddings, a pixel-aware cross attention module to capture pixel-level image information, a T5 text encoder to better understand motion description, and a PredictNet to supervise optical flows. We further present a noise schedule, which plays a key role in ensuring the quality and stability of video generation. Our model lowers the requirements in detailed captions and HQ videos, and can be directly trained on limited LQ datasets with noisy and brief captions such as WebVid-10M, largely alleviating the cost to collect large-scale HQ video-text pairs. Extensive experiments in a variety of T2V and image-to-video generation tasks demonstrate the effectiveness of our proposed Factorized-Dreamer. Our source codes are available at this https URL.

[AI-15] Enhancing Reinforcement Learning Through Guided Search ECAI2024

Link: https://arxiv.org/abs/2408.10113
Authors: Jérôme Arjonilla,Abdallah Saffidine,Tristan Cazenave
Keywords-EN: Markov Decision Problem, Offline Reinforcement Learning, suggest taking inspiration, Markov Decision, Decision Problem
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted Paper at ECAI 2024; Extended Version

Abstract:With the aim of improving performance in Markov Decision Problem in an Off-Policy setting, we suggest taking inspiration from what is done in Offline Reinforcement Learning (RL). In Offline RL, it is a common practice during policy learning to maintain proximity to a reference policy to mitigate uncertainty, reduce potential policy errors, and help improve performance. We find ourselves in a different setting, yet it raises questions about whether a similar concept can be applied to enhance performance, i.e., whether it is possible to find a guiding policy capable of contributing to performance improvement, and how to incorporate it into our RL agent. Our attention is particularly focused on algorithms based on Monte Carlo Tree Search (MCTS) as a guide. MCTS, renowned for its state-of-the-art capabilities across various domains, catches our interest due to its ability to converge to equilibrium in single-player and two-player contexts. By harnessing the power of MCTS as a guide for our RL agent, we observed a significant performance improvement, surpassing the outcomes achieved by utilizing each method in isolation. Our experiments were carried out on the Atari 100k benchmark.

[AI-16] PLUTUS: A Well Pre-trained Large Unified Transformer can Unveil Financial Time Series Regularities

Link: https://arxiv.org/abs/2408.10111
Authors: Yuanjian Xu,Anxian Liu,Jianing Hao,Zhenzhuo Li,Shichang Meng,Guang Zhang
Keywords-EN: high noise levels, predicting market behaviors, noise levels, Financial time series
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Financial time series modeling is crucial for understanding and predicting market behaviors but faces challenges such as non-linearity, non-stationarity, and high noise levels. Traditional models struggle to capture complex patterns due to these issues, compounded by limitations in computational resources and model capacity. Inspired by the success of large language models in NLP, we introduce PLUTUS, a Pre-trained Large Unified Transformer-based model that Unveils regularities in financial time Series. PLUTUS uses an invertible embedding module with contrastive learning and autoencoder techniques to create an approximate one-to-one mapping between raw data and patch embeddings. TimeFormer, an attention-based architecture, forms the core of PLUTUS, effectively modeling high-noise time series. We incorporate novel attention mechanisms to capture features across both variable and temporal dimensions. PLUTUS is pre-trained on an unprecedented dataset of 100 billion observations, designed to thrive in noisy financial environments. To our knowledge, PLUTUS is the first open-source, large-scale, pre-trained financial time series model with over one billion parameters. It achieves state-of-the-art performance in various tasks, demonstrating strong transferability and establishing a robust foundational model for finance. Our research provides technical guidance for pre-training financial time series data, setting a new standard in the field.

[AI-17] Envisioning Possibilities and Challenges of AI for Personalized Cancer Care

Link: https://arxiv.org/abs/2408.10108
Authors: Elaine Kong,Kuo-Ting (Tim) Huang,Aakash Gautam
Keywords-EN: Artificial Intelligence, gained significant interest, including in caring, significant interest, gained significant
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 7 pages, 1 table, short paper at CSCW 2024

Abstract:The use of Artificial Intelligence (AI) in healthcare, including in caring for cancer survivors, has gained significant interest. However, gaps remain in our understanding of how such AI systems can provide care, especially for ethnic and racial minority groups who continue to face care disparities. Through interviews with six cancer survivors, we identify critical gaps in current healthcare systems such as a lack of personalized care and insufficient cultural and linguistic accommodation. AI, when applied to care, was seen as a way to address these issues by enabling real-time, culturally aligned, and linguistically appropriate interactions. We also uncovered concerns about the implications of AI-driven personalization, such as data privacy, loss of human touch in caregiving, and the risk of echo chambers that limit exposure to diverse information. We conclude by discussing the trade-offs between AI-enhanced personalization and the need for structural changes in healthcare that go beyond technological solutions, leading us to argue that we should begin by asking, "Why personalization?"

[AI-18] Perturb-and-Compare Approach for Detecting Out-of-Distribution Samples in Constrained Access Environments ECAI

Link: https://arxiv.org/abs/2408.10107
Authors: Heeyoung Lee,Hoyoon Byun,Changdae Oh,JinYeong Bak,Kyungwoo Song
Keywords-EN: Accessing machine learning, Accessing machine, machine learning models, OOD detection, machine learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: Accepted to European Conference on Artificial Intelligence (ECAI) 2024

Abstract:Accessing machine learning models through remote APIs has been gaining prevalence following the recent trend of scaling up model parameters for increased performance. Even though these models exhibit remarkable ability, detecting out-of-distribution (OOD) samples remains a crucial safety concern for end users as these samples may induce unreliable outputs from the model. In this work, we propose an OOD detection framework, MixDiff, that is applicable even when the model’s parameters or its activations are not accessible to the end user. To bypass the access restriction, MixDiff applies an identical input-level perturbation to a given target sample and a similar in-distribution (ID) sample, then compares the relative difference in the model outputs of these two samples. MixDiff is model-agnostic and compatible with existing output-based OOD detection methods. We provide theoretical analysis to illustrate MixDiff’s effectiveness in discerning OOD samples that induce overconfident outputs from the model and empirically demonstrate that MixDiff consistently enhances the OOD detection performance on various datasets in vision and text domains.
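
The perturb-and-compare step described above can be sketched against a black-box model that exposes only output probabilities. This is a rough illustration, not the MixDiff code: the mixup-style perturbation in `mix`, the negative-max-probability score in `ood_score`, the stand-in `toy_model`, and the plain averaging over auxiliary samples are all assumptions.

```python
import math
from typing import Callable, List

Vector = List[float]


def toy_model(x: Vector) -> Vector:
    """Stand-in for a remote API: returns class probabilities and nothing else."""
    peak = max(x)
    exps = [math.exp(v - peak) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def ood_score(probs: Vector) -> float:
    """Output-based score: higher means less confident, so more likely OOD."""
    return -max(probs)


def mix(x: Vector, aux: Vector, lam: float) -> Vector:
    """Input-level perturbation: convex combination with an auxiliary sample."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(x, aux)]


def perturb_and_compare(
    model: Callable[[Vector], Vector],
    target: Vector,
    id_sample: Vector,
    aux_samples: List[Vector],
    lam: float = 0.5,
) -> float:
    """Average score gap between the perturbed target and the perturbed ID sample."""
    gaps = []
    for aux in aux_samples:
        # The same perturbation is applied to both inputs before comparing outputs.
        target_score = ood_score(model(mix(target, aux, lam)))
        id_score = ood_score(model(mix(id_sample, aux, lam)))
        gaps.append(target_score - id_score)
    return sum(gaps) / len(gaps)


if __name__ == "__main__":
    gap = perturb_and_compare(toy_model, [0.0, 0.0], [4.0, 0.0], [[2.0, 0.0]])
    print(gap)
```

Only model outputs are used, so the sketch respects the constrained-access setting; a positive gap means the target reacts to the perturbation with less confidence than the in-distribution reference does.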

[AI-19] Convert and Speak: Zero-shot Accent Conversion with Minimum Supervision

Link: https://arxiv.org/abs/2408.10096
Authors: Zhijun Jia,Huaying Xue,Xiulian Peng,Yan Lu
Keywords-EN: Low resource, converted semantic token, key challenge, pronunciation units, units and prosody
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: 9 pages, 4 figures, conference

Abstract:Low resource of parallel data is the key challenge of the accent conversion (AC) problem, in which both the pronunciation units and prosody pattern need to be converted. We propose a two-stage generative framework “convert-and-speak” in which the conversion is only operated on the semantic token level and the speech is synthesized conditioned on the converted semantic token with a speech generative model in target accent domain. The decoupling design enables the “speaking” module to use massive amounts of target accent speech and relieves the parallel data required for the “conversion” module. Conversion with the bridge of semantic token also relieves the requirement for the data with text transcriptions and unlocks the usage of language pre-training technology to further efficiently reduce the need of parallel accent speech data. To reduce the complexity and latency of “speaking”, a single-stage AR generative model is designed to achieve good quality as well as lower computation cost. Experiments on Indian-English to general American-English conversion show that the proposed framework achieves state-of-the-art performance in accent similarity, speech quality, and speaker maintenance with only 15 minutes of weakly parallel data which is not constrained to the same speaker. Extensive experimentation with diverse accent types suggests that this framework possesses a high degree of adaptability, making it readily scalable to accommodate other accents with low-resource data. Audio samples are available at this https URL.

[AI-20] ARMADA: Attribute-Based Multimodal Data Augmentation

Link: https://arxiv.org/abs/2408.10086
Authors: Xiaomeng Jin,Jeonghwan Kim,Yu Zhou,Kuan-Hao Huang,Te-Lin Wu,Nanyun Peng,Heng Ji
Keywords-EN: multimodal data augmentation, multimodal data, Attribute-based Multimodal Data, Multimodal Language Models, manually annotating high-quality
Categories: Artificial Intelligence (cs.AI)

Abstract:In Multimodal Language Models (MLMs), the cost of manually annotating high-quality image-text pair data for fine-tuning and alignment is extremely high. While existing multimodal data augmentation frameworks propose ways to augment image-text pairs, they either suffer from semantic inconsistency between texts and images, or generate unrealistic images, causing knowledge gap with real world examples. To address these issues, we propose Attribute-based Multimodal Data Augmentation (ARMADA), a novel multimodal data augmentation method via knowledge-guided manipulation of visual attributes of the mentioned entities. Specifically, we extract entities and their visual attributes from the original text data, then search for alternative values for the visual attributes under the guidance of knowledge bases (KBs) and large language models (LLMs). We then utilize an image-editing model to edit the images with the extracted attributes. ARMADA is a novel multimodal data generation framework that: (i) extracts knowledge-grounded attributes from symbolic KBs for semantically consistent yet distinctive image-text pair generation, (ii) generates visually similar images of disparate categories using neighboring entities in the KB hierarchy, and (iii) uses the commonsense knowledge of LLMs to modulate auxiliary visual attributes such as backgrounds for more robust representation of original entities. Our empirical results over four downstream tasks demonstrate the efficacy of our framework to produce high-quality data and enhance the model performance. This also highlights the need to leverage external knowledge proxies for enhanced interpretability and real-world grounding.

[AI-21] Personalizing Reinforcement Learning from Human Feedback with Variational Preference Learning

Link: https://arxiv.org/abs/2408.10075
Authors: Sriyash Poddar,Yanming Wan,Hamish Ivison,Abhishek Gupta,Natasha Jaques
Keywords-EN: Human Feedback, Reinforcement Learning, powerful paradigm, paradigm for aligning, RLHF
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO)
Comments: this http URL

Abstract:Reinforcement Learning from Human Feedback (RLHF) is a powerful paradigm for aligning foundation models to human values and preferences. However, current RLHF techniques cannot account for the naturally occurring differences in individual human preferences across a diverse population. When these differences arise, traditional RLHF frameworks simply average over them, leading to inaccurate rewards and poor performance for individual subgroups. To address the need for pluralistic alignment, we develop a class of multimodal RLHF methods. Our proposed techniques are based on a latent variable formulation - inferring a novel user-specific latent and learning reward models and policies conditioned on this latent without additional user-specific data. While conceptually simple, we show that in practice, this reward modeling requires careful algorithmic considerations around model architecture and reward scaling. To empirically validate our proposed technique, we first show that it can provide a way to combat underspecification in simulated control problems, inferring and optimizing user-specific reward functions. Next, we conduct experiments on pluralistic language datasets representing diverse user preferences and demonstrate improved reward function accuracy. We additionally show the benefits of this probabilistic framework in terms of measuring uncertainty, and actively learning user preferences. This work enables learning from diverse populations of users with divergent preferences, an important challenge that naturally occurs in problems from robot learning to foundation model alignment.

[AI-22] Synthesis of Reward Machines for Multi-Agent Equilibrium Design (Full Version)

Link: https://arxiv.org/abs/2408.10074
Authors: Muhammad Najib,Giuseppe Perelli
Keywords-EN: well-established game-theoretic paradigm, Mechanism design, equilibrium design, achieve desired outcomes, Unlike mechanism design
Categories: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)

Abstract:Mechanism design is a well-established game-theoretic paradigm for designing games to achieve desired outcomes. This paper addresses a closely related but distinct concept, equilibrium design. Unlike mechanism design, the designer’s authority in equilibrium design is more constrained; she can only modify the incentive structures in a given game to achieve certain outcomes without the ability to create the game from scratch. We study the problem of equilibrium design using dynamic incentive structures, known as reward machines. We use weighted concurrent game structures for the game model, with goals (for the players and the designer) defined as mean-payoff objectives. We show how reward machines can be used to represent dynamic incentives that allocate rewards in a manner that optimises the designer’s goal. We also introduce the main decision problem within our framework, the payoff improvement problem. This problem essentially asks whether there exists a dynamic incentive (represented by some reward machine) that can improve the designer’s payoff by more than a given threshold value. We present two variants of the problem: strong and weak. We demonstrate that both can be solved in polynomial time using a Turing machine equipped with an NP oracle. Furthermore, we also establish that these variants are either NP-hard or coNP-hard. Finally, we show how to synthesise the corresponding reward machine if it exists.

[AI-23] FFAA: Multimodal Large Language Model based Explainable Open-World Face Forgery Analysis Assistant

Link: https://arxiv.org/abs/2408.10072
Authors: Zhengchao Huang, Bin Xia, Zicheng Lin, Zhun Mou, Wenming Yang
Keywords: widespread public concern, sparked widespread public, public information security, face forgery analysis, face forgery
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 17 pages, 18 figures; project page: this https URL

Abstract:The rapid advancement of deepfake technologies has sparked widespread public concern, particularly as face forgery poses a serious threat to public information security. However, the unknown and diverse forgery techniques, varied facial features and complex environmental factors pose significant challenges for face forgery analysis. Existing datasets lack descriptions of these aspects, making it difficult for models to distinguish between real and forged faces using only visual information amid various confounding factors. In addition, existing methods do not yield user-friendly and explainable results, complicating the understanding of the model’s decision-making process. To address these challenges, we introduce a novel Open-World Face Forgery Analysis VQA (OW-FFA-VQA) task and the corresponding benchmark. To tackle this task, we first establish a dataset featuring a diverse collection of real and forged face images with essential descriptions and reliable forgery reasoning. Based on this dataset, we introduce FFAA: Face Forgery Analysis Assistant, consisting of a fine-tuned Multimodal Large Language Model (MLLM) and Multi-answer Intelligent Decision System (MIDS). By integrating hypothetical prompts with MIDS, the impact of fuzzy classification boundaries is effectively mitigated, enhancing the model’s robustness. Extensive experiments demonstrate that our method not only provides user-friendly explainable results but also significantly boosts accuracy and robustness compared to previous methods.

[AI-24] Facial Wrinkle Segmentation for Cosmetic Dermatology: Pretraining with Texture Map-Based Weak Supervision

Link: https://arxiv.org/abs/2408.10060
Authors: Junho Moon, Haejun Chung, Ikbeom Jang
Keywords: cosmetic dermatology, plays a crucial, crucial role, role in cosmetic, Facial wrinkle
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Facial wrinkle detection plays a crucial role in cosmetic dermatology. Precise manual segmentation of facial wrinkles is challenging and time-consuming, with inherent subjectivity leading to inconsistent results among graders. To address this issue, we propose two solutions. First, we build and release the first public facial wrinkle dataset, ‘FFHQ-Wrinkle’, an extension of the NVIDIA FFHQ dataset. This dataset includes 1,000 images with human labels and 50,000 images with automatically generated weak labels. This dataset can help the research community develop advanced wrinkle detection algorithms. Second, we introduce a training strategy for U-Net-like encoder-decoder models to detect wrinkles across the face automatically. Our method employs a two-stage training strategy: texture map pretraining and finetuning on human-labeled data. Initially, we pretrain models on a large dataset with weak labels (N=50k) or masked texture maps generated through computer vision techniques, without human intervention. Subsequently, we finetune the models using human-labeled data (N=1k), which consists of manually labeled wrinkle masks. During finetuning, the network inputs a combination of RGB and masked texture maps, comprising four channels. We effectively combine labels from multiple annotators to minimize subjectivity in manual labeling. Our strategies demonstrate improved segmentation performance in facial wrinkle segmentation both quantitatively and visually compared to existing pretraining methods.

[AI-25] The Practimum-Optimum Algorithm for Manufacturing Scheduling: A Paradigm Shift Leading to Breakthroughs in Scale and Performance

Link: https://arxiv.org/abs/2408.10040
Authors: Moshe BenBassat
Keywords: developing automatic optimization, automatic optimization products, real-life business problems, represents a paradigm, paradigm shift
Categories: Artificial Intelligence (cs.AI)

Abstract:The Practimum-Optimum (P-O) algorithm represents a paradigm shift in developing automatic optimization products for complex real-life business problems such as large-scale manufacturing scheduling. It leverages deep business domain expertise to create a group of virtual human expert (VHE) agents with different “schools of thought” on how to create high-quality schedules. By computerizing them into algorithms, P-O generates many valid schedules at far higher speeds than human schedulers are capable of. Initially, these schedules can also be local optimum peaks far away from high-quality schedules. By submitting these schedules to a reinforced machine learning algorithm (RL), P-O learns the weaknesses and strengths of each VHE schedule, and accordingly derives reward and punishment changes in the Demand Set that will modify the relative priorities for time and resource allocation that jobs received in the prior iteration that led to the current state of the schedule. These cause the core logic of the VHE algorithms to explore, in the subsequent iteration, substantially different parts of the schedules universe and potentially find higher-quality schedules. Using the hill climbing analogy, this may be viewed as a big jump, shifting from a given local peak to a faraway promising start point equipped with knowledge embedded in the demand set for future iterations. This is a fundamental difference from most contemporary algorithms, which spend considerable time on local micro-steps restricted to the neighbourhoods of local peaks they visit. This difference enables a breakthrough in scale and performance for fully automatic manufacturing scheduling in complex organizations. The P-O algorithm is at the heart of Plataine Scheduler that, in one click, routinely schedules 30,000-50,000 tasks for real-life complex manufacturing operations.

[AI-26] MSDiagnosis: An EMR-based Dataset for Clinical Multi-Step Diagnosis

Link: https://arxiv.org/abs/2408.10039
Authors: Ruihui Hou, Shencheng Chen, Yongqi Fan, Lifeng Zhu, Jing Sun, Jingping Liu, Tong Ruan
Keywords: medical practice, typically requiring, includes primary diagnosis, critical in medical, requiring a continuous
Categories: Artificial Intelligence (cs.AI)

Abstract:Clinical diagnosis is critical in medical practice, typically requiring a continuous and evolving process that includes primary diagnosis, differential diagnosis, and final diagnosis. However, most existing clinical diagnostic tasks are single-step processes, which do not align with the complex multi-step diagnostic procedures found in real-world clinical settings. In this paper, we propose a multi-step diagnostic task and annotate a clinical diagnostic dataset (MSDiagnosis). This dataset includes primary diagnosis, differential diagnosis, and final diagnosis questions. Additionally, we propose a novel and effective framework. This framework combines forward inference, backward inference, reflection, and refinement, enabling the LLM to self-evaluate and adjust its diagnostic results. To assess the effectiveness of our proposed method, we design and conduct extensive experiments. The experimental results demonstrate the effectiveness of the proposed method. We also provide a comprehensive experimental analysis and suggest future research directions for this task.

[AI-27] Deterministic Policy Gradient Primal-Dual Methods for Continuous-Space Constrained MDPs

Link: https://arxiv.org/abs/2408.10015
Authors: Sergio Rozada, Dongsheng Ding, Antonio G. Marques, Alejandro Ribeiro
Keywords: Markov decision processes, constrained Markov decision, deterministic policy gradient, Markov decision, deterministic policy
Categories: Artificial Intelligence (cs.AI); Optimization and Control (math.OC)

Abstract:We study the problem of computing deterministic optimal policies for constrained Markov decision processes (MDPs) with continuous state and action spaces, which are widely encountered in constrained dynamical systems. Designing deterministic policy gradient methods in continuous state and action spaces is particularly challenging due to the lack of enumerable state-action pairs and the adoption of deterministic policies, hindering the application of existing policy gradient methods for constrained MDPs. To this end, we develop a deterministic policy gradient primal-dual method to find an optimal deterministic policy with non-asymptotic convergence. Specifically, we leverage regularization of the Lagrangian of the constrained MDP to propose a deterministic policy gradient primal-dual (D-PGPD) algorithm that updates the deterministic policy via a quadratic-regularized gradient ascent step and the dual variable via a quadratic-regularized gradient descent step. We prove that the primal-dual iterates of D-PGPD converge at a sub-linear rate to an optimal regularized primal-dual pair. We instantiate D-PGPD with function approximation and prove that the primal-dual iterates of D-PGPD converge at a sub-linear rate to an optimal regularized primal-dual pair, up to a function approximation error. Furthermore, we demonstrate the effectiveness of our method in two continuous control problems: robot navigation and fluid control. To the best of our knowledge, this appears to be the first work that proposes a deterministic policy search method for continuous-space constrained MDPs.
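
The primal-dual update pattern described in this abstract can be illustrated on a toy problem. The sketch below is not the paper's D-PGPD algorithm: it replaces the policy with a single scalar action `a`, and the reward `r(a)`, constraint `g(a)`, step size `eta`, and regularization weight `tau` are all assumptions chosen for illustration. It only shows the general shape of a gradient ascent step on the primal variable paired with a projected, regularized gradient descent step on the dual variable.

```python
def dpgpd_toy(eta=0.05, tau=0.1, steps=2000):
    """Toy primal-dual iteration on a 1-D problem (illustrative only).

    maximize  r(a) = -(a - 2)^2   subject to  g(a) = 1 - a >= 0
    Regularized saddle problem (an assumption, not the paper's Lagrangian):
        max_a  min_{lam >= 0}  r(a) + lam * g(a) + (tau / 2) * lam^2
    """
    a, lam = 0.0, 0.0
    for _ in range(steps):
        # Gradient of r(a) + lam * g(a) with respect to a.
        grad_a = -2.0 * (a - 2.0) - lam
        # Gradient of lam * g(a) + (tau / 2) * lam^2 with respect to lam.
        grad_lam = (1.0 - a) + tau * lam
        # a + eta * grad_a is the closed form of the quadratic-regularized
        # ascent step argmax_x { grad_a * x - (x - a)^2 / (2 * eta) }.
        a_next = a + eta * grad_a
        # Descent step on the dual variable, projected onto lam >= 0.
        lam_next = max(0.0, lam - eta * grad_lam)
        a, lam = a_next, lam_next
    return a, lam


a_star, lam_star = dpgpd_toy()
```

For this toy problem the regularized saddle point can be solved by hand: setting both gradients to zero with `tau = 0.1` gives `a = 7/6` and `lam = 5/3`, so the iterates settle slightly outside the feasible region `a <= 1`. That gap is the price of the dual regularization in this sketch and shrinks as `tau` goes to zero.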

[AI-28] Towards a Knowledge Graph for Models and Algorithms in Applied Mathematics

Link: https://arxiv.org/abs/2408.10003
Authors: Björn Schembera, Frank Wübbeling, Hendrik Kleikamp, Burkhard Schmidt, Aurela Shehu, Marco Reidelbach, Christine Biedinger, Jochen Fiedler, Thomas Koprucki, Dorothea Iglezakis, Dominik Göddeke
Keywords: epistemically grounding numerical, grounding numerical data, research data FAIR, essential part, epistemically grounding
Categories: Artificial Intelligence (cs.AI); Digital Libraries (cs.DL)
Comments: Preprint submitted to the 18th International Conference on Metadata and Semantics Research 2024

Abstract:Mathematical models and algorithms are an essential part of mathematical research data, as they are epistemically grounding numerical data. In order to represent models and algorithms as well as their relationship semantically to make this research data FAIR, two previously distinct ontologies were merged and extended, becoming a living knowledge graph. The link between the two ontologies is established by introducing computational tasks, as they occur in modeling, corresponding to algorithmic tasks. Moreover, controlled vocabularies are incorporated and a new class, distinguishing base quantities from specific use case quantities, was introduced. Also, both models and algorithms can now be enriched with metadata. Subject-specific metadata is particularly relevant here, such as the symmetry of a matrix or the linearity of a mathematical model. This is the only way to express specific workflows with concrete models and algorithms, as the feasible solution algorithm can only be determined if the mathematical properties of a model are known. We demonstrate this using two examples from different application areas of applied mathematics. In addition, we have already integrated over 250 research assets from applied mathematics into our knowledge graph.

[AI-29] Edge-Cloud Collaborative Motion Planning for Autonomous Driving with Large Language Models

Link: https://arxiv.org/abs/2408.09972
Authors: Jiao Chen, Suyan Dai, Fangfang Chen, Zuohong Lv, Jianhua Tang
Keywords: Integrating large language, Integrating large, large language models, driving enhances personalization, open-world scenarios
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)

Abstract:Integrating large language models (LLMs) into autonomous driving enhances personalization and adaptability in open-world scenarios. However, traditional edge computing models still face significant challenges in processing complex driving data, particularly regarding real-time performance and system efficiency. To address these challenges, this study introduces EC-Drive, a novel edge-cloud collaborative autonomous driving system with data drift detection capabilities. EC-Drive utilizes drift detection algorithms to selectively upload critical data, including new obstacles and traffic pattern changes, to the cloud for processing by GPT-4, while routine data is efficiently managed by smaller LLMs on edge devices. This approach not only reduces inference latency but also improves system efficiency by optimizing communication resource use. Experimental validation confirms the system’s robust processing capabilities and practical applicability in real-world driving conditions, demonstrating the effectiveness of this edge-cloud collaboration framework. Our data and system demonstration will be released at this https URL.

[AI-30] Unsupervised Machine Learning Hybrid Approach Integrating Linear Programming in Loss Function: A Robust Optimization Technique

Link: https://arxiv.org/abs/2408.09967
Authors: Andrew Kiruluta, Andreas Lemos
Keywords: machine learning model, integrates linear programming, machine learning, paper presents, unsupervised machine learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)

Abstract:This paper presents a novel hybrid approach that integrates linear programming (LP) within the loss function of an unsupervised machine learning model. By leveraging the strengths of both optimization techniques and machine learning, this method introduces a robust framework for solving complex optimization problems where traditional methods may fall short. The proposed approach encapsulates the constraints and objectives of a linear programming problem directly into the loss function, guiding the learning process to adhere to these constraints while optimizing the desired outcomes. This technique not only preserves the interpretability of linear programming but also benefits from the flexibility and adaptability of machine learning, making it particularly well-suited for unsupervised or semi-supervised learning scenarios.
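
One common way to fold a linear program into a loss function is a penalty formulation. The abstract does not give the authors' exact loss, so the sketch below is an assumption: it takes an LP of the form "minimize c·x subject to A x <= b" and adds a quadratic penalty, weighted by a hypothetical `rho`, for each violated constraint.

```python
def lp_penalty_loss(x, c, A, b, rho=10.0):
    """LP objective c.x plus a quadratic penalty on violations of A x <= b.

    The quadratic penalty and the weight rho are illustrative choices,
    not taken from the paper.
    """
    objective = sum(ci * xi for ci, xi in zip(c, x))
    penalty = 0.0
    for row, bi in zip(A, b):
        # Positive only when the constraint row . x <= bi is violated.
        violation = sum(aij * xj for aij, xj in zip(row, x)) - bi
        penalty += max(0.0, violation) ** 2
    return objective + rho * penalty


# Maximize x0 + x1 (minimize its negative) subject to x0 + x1 <= 4.
c = [-1.0, -1.0]
A = [[1.0, 1.0]]
b = [4.0]

feasible_loss = lp_penalty_loss([1.0, 1.0], c, A, b)    # no penalty applied
infeasible_loss = lp_penalty_loss([3.0, 3.0], c, A, b)  # violates the constraint by 2
```

In this example the infeasible point has the better raw objective (-6 versus -2), but the penalty term makes its total loss larger, which is how such a loss steers a learner back toward the feasible region.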

[AI-31] AdaResNet: Enhancing Residual Networks with Dynamic Weight Adjustment for Improved Feature Integration

Link: https://arxiv.org/abs/2408.09958
Authors: Hong Su
Keywords: deep neural networks, Residual Network, making it challenging, early layers, deep neural
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:In very deep neural networks, gradients can become extremely small during backpropagation, making it challenging to train the early layers. ResNet (Residual Network) addresses this issue by enabling gradients to flow directly through the network via skip connections, facilitating the training of much deeper networks. However, in these skip connections, the input ipd is directly added to the transformed data tfd, treating ipd and tfd equally, without adapting to different scenarios. In this paper, we propose AdaResNet (Auto-Adapting Residual Network), which automatically adjusts the ratio between ipd and tfd based on the training data. We introduce a variable, weight_tfd^ipd, to represent this ratio. This variable is dynamically adjusted during backpropagation, allowing it to adapt to the training data rather than remaining fixed. Experimental results demonstrate that AdaResNet achieves a maximum accuracy improvement of over 50% compared to traditional ResNet.
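
The idea of a learnable ratio on the skip connection can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the AdaResNet implementation: the skip output is taken to be `weight * ipd + tfd`, the loss is mean squared error, and the function names `adaptive_skip` and `fit_skip_weight` are hypothetical. A standard ResNet corresponds to fixing `weight = 1`.

```python
def adaptive_skip(ipd, tfd, weight):
    """Combine input data (ipd) and transformed data (tfd) with a scalar weight."""
    return [weight * i + t for i, t in zip(ipd, tfd)]


def fit_skip_weight(ipd, tfd, target, weight=1.0, lr=0.05, steps=200):
    """Learn the skip weight by gradient descent on a mean squared error loss."""
    n = len(ipd)
    for _ in range(steps):
        out = adaptive_skip(ipd, tfd, weight)
        # d/dweight of mean((out - target)^2), using d(out_k)/dweight = ipd_k.
        grad = sum(2.0 * (o - y) * i for o, y, i in zip(out, target, ipd)) / n
        weight -= lr * grad
    return weight


ipd = [1.0, 2.0, 3.0]
tfd = [0.1, 0.2, 0.3]
# Synthetic target generated with a "true" skip weight of 0.5.
target = [0.5 * i + t for i, t in zip(ipd, tfd)]

learned_weight = fit_skip_weight(ipd, tfd, target)
```

Starting from the ResNet default of 1.0, the weight moves to the value that generated the synthetic target, which is the behavior the abstract describes as adapting the ratio to the training data.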

[AI-32] Contextual Importance and Utility in Python: New Functionality and Insights with the py-ciu Package IJCAI2024

Link: https://arxiv.org/abs/2408.09957
Authors: Kary Främling
Keywords: reliable software implementations, industry to test, reliable software, important for allowing, allowing researchers
Categories: Artificial Intelligence (cs.AI)
Comments: In Proceedings of XAI 2024 Workshop of 33rd International Joint Conference on Artificial Intelligence (IJCAI 2024), Jeju, South Korea

Abstract:The availability of easy-to-use and reliable software implementations is important for allowing researchers in academia and industry to test, assess and take into use eXplainable AI (XAI) methods. This paper describes the py-ciu Python implementation of the Contextual Importance and Utility (CIU) model-agnostic, post-hoc explanation method and illustrates capabilities of CIU that go beyond the current state-of-the-art that could be useful for XAI practitioners in general.

[AI-33] Weakly Supervised Pretraining and Multi-Annotator Supervised Finetuning for Facial Wrinkle Detection

Link: https://arxiv.org/abs/2408.09952
Authors: Ik Jun Moon, Junho Moon, Ikbeom Jang
Keywords: Abstract, facial, facial wrinkles, Research question, skin
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:1. Research question: With the growing interest in skin diseases and skin aesthetics, the ability to predict facial wrinkles is becoming increasingly important. This study aims to evaluate whether a computational model, convolutional neural networks (CNN), can be trained for automated facial wrinkle segmentation. 2. Findings: Our study presents an effective technique for integrating data from multiple annotators and illustrates that transfer learning can enhance performance, resulting in dependable segmentation of facial wrinkles. 3. Meaning: This approach automates intricate and time-consuming tasks of wrinkle analysis with a deep learning framework. It could be used to facilitate skin treatments and diagnostics.

[AI-34] Principle Driven Parameterized Fiber Model based on GPT-PINN Neural Network

Link: https://arxiv.org/abs/2408.09951
Authors: Yubin Zang, Boyu Hua, Zhenzhou Tang, Zhipeng Lin, Fangzheng Zhang, Simin Li, Zuxing Zhang, Hongwei Chen
Keywords: utilize artificial intelligence, artificial intelligence based, artificial intelligence regression, driven artificial intelligence, intelligence regression ability
Categories: Artificial Intelligence (cs.AI); Signal Processing (eess.SP)

Abstract:To meet the needs of Beyond 5G communications, many data-driven, artificial-intelligence-based fiber models have been put forward; they use the regression ability of artificial intelligence to predict pulse evolution in fiber transmission much faster than the traditional split-step Fourier method. To increase physical interpretability, principle-driven fiber models have been proposed, which insert the Nonlinear Schrödinger Equation (NLSE) into their loss functions. However, both principle-driven and data-driven models must be retrained in full whenever the transmission conditions change. Unfortunately, this situation can be unavoidable in fiber communication optimization work. If the number of different transmission conditions is large, the whole model, with its relatively large number of parameters, must be retrained many times, which increases time costs and drags down computing efficiency. To address this problem, we propose the principle-driven parameterized fiber model in this manuscript. Using the reduced basis method, this model decomposes the predicted NLSE solution for one set of transmission conditions into a linear combination of several eigen-solutions, each output by a pre-trained principle-driven fiber model. The model can therefore greatly alleviate the heavy burden of retraining, since only the linear combination coefficients need to be found when the transmission conditions change. The model thus offers both strong physical interpretability and higher computing efficiency. In our demonstration, the model’s computational complexity is 0.0113% of that of the split-step Fourier method and 1% of that of the previously proposed principle-driven fiber model.

[AI-35] Caption-Driven Explorations: Aligning Image and Text Embeddings through Human-Inspired Foveated Vision

Link: https://arxiv.org/abs/2408.09948
Authors: Dario Zanca, Andrea Zugarini, Simon Dietz, Thomas R. Altstidl, Mark A. Turban Ndjeuha, Leo Schwinn, Bjoern Eskofier
Keywords: crucial for vision, vision science, human attention, human, attention
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Understanding human attention is crucial for vision science and AI. While many models exist for free-viewing, less is known about task-driven image exploration. To address this, we introduce CapMIT1003, a dataset with captions and click-contingent image explorations, to study human attention during the captioning task. We also present NevaClip, a zero-shot method for predicting visual scanpaths by combining CLIP models with NeVA algorithms. NevaClip generates fixations to align the representations of foveated visual stimuli and captions. The simulated scanpaths outperform existing human attention models in plausibility for captioning and free-viewing tasks. This research enhances the understanding of human attention and advances scanpath prediction models.

[AI-36] Fiber Transmission Model with Parameterized Inputs based on GPT-PINN Neural Network

Link: https://arxiv.org/abs/2408.09947
Authors: Yubin Zang, Boyu Hua, Zhipeng Lin, Fangzheng Zhang, Simin Li, Zuxing Zhang, Hongwei Chen
Keywords: principle driven fiber, driven fiber transmission, novelty principle driven, Nonlinear Schrodinger Equations, fiber transmission model
Categories: Artificial Intelligence (cs.AI); Signal Processing (eess.SP)

Abstract:In this manuscript, a novel principle-driven fiber transmission model for short-distance transmission with parameterized inputs is put forward. By combining the previously proposed principle-driven fiber model with the reduced basis expansion method and transforming the parameterized inputs into parameterized coefficients of the Nonlinear Schrödinger Equations, universal solutions for inputs corresponding to different bit rates can all be obtained without retraining the whole model. This model, once adopted, can have prominent advantages in both computational efficiency and physical background. Besides, this model can still be trained effectively without transmitted signals collected in advance. Tasks with on-off keying signals at bit rates ranging from 2 Gbps to 50 Gbps are used to demonstrate the fidelity of the model.

[AI-37] Microscopic Analysis on LLM players via Social Deduction Game

Link: https://arxiv.org/abs/2408.09946
Authors: Byungjun Kim, Dayeon Seo, Bugeun Kim
Keywords: large language models, begun developing autonomous, Recent studies, developing autonomous game, social deduction games
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Under review, 10 pages

Abstract:Recent studies have begun developing autonomous game players for social deduction games using large language models (LLMs). When building LLM players, fine-grained evaluations are crucial for addressing weaknesses in game-playing abilities. However, existing studies have often overlooked such assessments. Specifically, we point out two issues with the evaluation methods employed. First, game-playing abilities have typically been assessed through game-level outcomes rather than specific event-level skills; Second, error analyses have lacked structured methodologies. To address these issues, we propose an approach utilizing a variant of the SpyFall game, named SpyGame. We conducted an experiment with four LLMs, analyzing their gameplay behavior in SpyGame both quantitatively and qualitatively. For the quantitative analysis, we introduced eight metrics to resolve the first issue, revealing that these metrics are more effective than existing ones for evaluating the two critical skills: intent identification and camouflage. In the qualitative analysis, we performed thematic analysis to resolve the second issue. This analysis identifies four major categories that affect gameplay of LLMs. Additionally, we demonstrate how these categories complement and support the findings from the quantitative analysis.

[AI-38] Benchmarking LLMs for Translating Classical Chinese Poetry: Evaluating Adequacy, Fluency, and Elegance

Link: https://arxiv.org/abs/2408.09945
Authors: Andong Chen, Lianzhang Lou, Kehai Chen, Xuefeng Bai, Yang Xiang, Muyun Yang, Tiejun Zhao, Min Zhang
Keywords: Large language models, shown remarkable performance, Large language, language models, shown remarkable
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Work in progress

Abstract:Large language models (LLMs) have shown remarkable performance in general translation tasks. However, there is an increasing demand for high-quality translations that are not only adequate but also fluent and elegant. To assess the extent to which current LLMs can meet these demands, we introduce a suitable benchmark for translating classical Chinese poetry into English. This task requires not only adequacy in translating culturally and historically significant content but also a strict adherence to linguistic fluency and poetic elegance. Our study reveals that existing LLMs fall short of this task. To address these issues, we propose RAT, a Retrieval-Augmented machine Translation method that enhances the translation process by incorporating knowledge related to classical poetry. Additionally, we propose an automatic evaluation metric based on GPT-4, which better assesses translation quality in terms of adequacy, fluency, and elegance, overcoming the limitations of traditional metrics. Our dataset and code will be made available.

[AI-39] SZU-AFS Antispoofing System for the ASVspoof 5 Challenge INTERSPEECH2024

Link: https://arxiv.org/abs/2408.09933
Authors: Yuxiong Xu, Jiafeng Zhong, Sengui Zheng, Zefeng Liu, Bin Li
Keywords: SZU-AFS anti-spoofing system, Challenge under open, open conditions, paper presents, presents the SZU-AFS
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: 8 pages, 2 figures, ASVspoof 5 Workshop (Interspeech2024 Satellite)

Abstract:This paper presents the SZU-AFS anti-spoofing system, designed for Track 1 of the ASVspoof 5 Challenge under open conditions. The system is built with four stages: selecting a baseline model, exploring effective data augmentation (DA) methods for fine-tuning, applying a co-enhancement strategy based on gradient norm aware minimization (GAM) for secondary fine-tuning, and fusing logits scores from the two best-performing fine-tuned models. The system utilizes the Wav2Vec2 front-end feature extractor and the AASIST back-end classifier as the baseline model. During model fine-tuning, three distinct DA policies have been investigated: single-DA, random-DA, and cascade-DA. Moreover, the employed GAM-based co-enhancement strategy, designed to fine-tune the augmented model at both data and optimizer levels, helps the Adam optimizer find flatter minima, thereby boosting model generalization. Overall, the final fusion system achieves a minDCF of 0.115 and an EER of 4.04% on the evaluation set.

[AI-40] LCE: A Framework for Explainability of DNNs for Ultrasound Image Based on Concept Discovery

Link: https://arxiv.org/abs/2408.09899
Authors: Weiji Kong, Xun Gong, Juan Wang
Keywords: Deep Neural Networks, Neural Networks, Deep Neural, decisions of Deep, increasingly important
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

Abstract:Explaining the decisions of Deep Neural Networks (DNNs) for medical images has become increasingly important. Existing attribution methods have difficulty explaining the meaning of pixels while existing concept-based methods are limited by additional annotations or specific model structures that are difficult to apply to ultrasound images. In this paper, we propose the Lesion Concept Explainer (LCE) framework, which combines attribution methods with concept-based methods. We introduce the Segment Anything Model (SAM), fine-tuned on a large number of medical images, for concept discovery to enable a meaningful explanation of ultrasound image DNNs. The proposed framework is evaluated in terms of both faithfulness and understandability. We point out deficiencies in the popular faithfulness evaluation metrics and propose a new evaluation metric. Our evaluation of public and private breast ultrasound datasets (BUSI and FG-US-B) shows that LCE performs well compared to commonly-used explainability methods. Finally, we also validate that LCE can consistently provide reliable explanations for more meaningful fine-grained diagnostic tasks in breast ultrasound.

[AI-41] Uncertainty Quantification of Pre-Trained and Fine-Tuned Surrogate Models using Conformal Prediction

Link: https://arxiv.org/abs/2408.09881
Authors: Vignesh Gopakumar, Ander Gray, Joel Oskarsson, Lorenzo Zanisi, Stanislas Pamela, Daniel Giles, Matt Kusner, Marc Peter Deisenroth
Keywords: experimental modelling tasks, shown immense potential, Data-driven surrogate models, Data-driven surrogate, potential as quick
Categories: Artificial Intelligence (cs.AI); Atmospheric and Oceanic Physics (physics.ao-ph); Plasma Physics (physics.plasm-ph)

Abstract:Data-driven surrogate models have shown immense potential as quick, inexpensive approximations to complex numerical and experimental modelling tasks. However, most surrogate models characterising physical systems do not quantify their uncertainty, rendering their predictions unreliable, and needing further validation. Though Bayesian approximations offer some solace in estimating the error associated with these models, they cannot provide guarantees, and the quality of their inferences depends on the availability of prior information and good approximations to posteriors for complex problems. This is particularly pertinent to multi-variable or spatio-temporal problems. Our work constructs and formalises a conformal prediction framework that satisfies marginal coverage for spatio-temporal predictions in a model-agnostic manner, requiring near-zero computational costs. The paper provides an extensive empirical study of the application of the framework to ascertain valid error bars that provide guaranteed coverage across the surrogate model’s domain of operation. The application scope of our work extends across a large range of spatio-temporal models, ranging from solving partial differential equations to weather forecasting. Through the applications, the paper looks at providing statistically valid error bars for deterministic models, as well as crafting guarantees to the error bars of probabilistic models. The paper concludes with a viable conformal prediction formalisation that provides guaranteed coverage of the surrogate model, regardless of model architecture, and its training regime and is unbothered by the curse of dimensionality.
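
The marginal coverage guarantee mentioned here is the hallmark of split conformal prediction, which can be sketched for a scalar output as follows. This is the textbook procedure with absolute-residual scores, not the paper's spatio-temporal formulation; the function names and calibration data are illustrative assumptions. For a gridded output, the same calibration would typically be repeated for each output cell.

```python
import math


def conformal_quantile(cal_preds, cal_targets, alpha):
    """Calibrated score quantile giving marginal coverage of at least 1 - alpha.

    Uses absolute residuals as nonconformity scores and the standard
    finite-sample rank k = ceil((n + 1) * (1 - alpha)).
    """
    scores = sorted(abs(y - p) for p, y in zip(cal_preds, cal_targets))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: the interval is unbounded.
        return math.inf
    return scores[k - 1]


def predict_interval(pred, q):
    """Symmetric error bar around a point prediction."""
    return (pred - q, pred + q)


# Made-up calibration set: the model always predicts 0, targets are 1..10.
cal_preds = [0] * 10
cal_targets = list(range(1, 11))

q = conformal_quantile(cal_preds, cal_targets, alpha=0.2)
interval = predict_interval(5, q)
```

The only cost beyond the model's own predictions is a sort over the calibration residuals, which is consistent with the abstract's point that the framework is model-agnostic and adds very little computation.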

[AI-42] New spectral imaging biomarkers for sepsis and mortality in intensive care

Link: https://arxiv.org/abs/2408.09873
Authors: Silvia Seidlitz, Katharina Hölzl, Ayca von Garrel, Jan Sellner, Stephan Katzenschlager, Tobias Hölle, Dania Fischer, Maik von der Forst, Felix C.F. Schmitt, Markus A. Weigand, Lena Maier-Hein, Maximilian Dietrich
Keywords: high socioeconomic importance, high risk, high socioeconomic, early identification, socioeconomic importance
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Markus A. Weigand, Lena Maier-Hein and Maximilian Dietrich contributed equally

Abstract:With sepsis remaining a leading cause of mortality, early identification of septic patients and those at high risk of death is a challenge of high socioeconomic importance. The driving hypothesis of this study was that hyperspectral imaging (HSI) could provide novel biomarkers for sepsis diagnosis and treatment management due to its potential to monitor microcirculatory alterations. We conducted a comprehensive study involving HSI data of the palm and fingers from more than 480 patients on the day of their intensive care unit (ICU) admission. The findings demonstrate that HSI measurements can predict sepsis with an area under the receiver operating characteristic curve (AUROC) of 0.80 (95 % confidence interval (CI) [0.76; 0.84]) and mortality with an AUROC of 0.72 (95 % CI [0.65; 0.79]). The predictive performance improves substantially when additional clinical data is incorporated, leading to an AUROC of up to 0.94 (95 % CI [0.92; 0.96]) for sepsis and 0.84 (95 % CI [0.78; 0.89]) for mortality. We conclude that HSI presents novel imaging biomarkers for the rapid, non-invasive prediction of sepsis and mortality, suggesting its potential as an important modality for guiding diagnosis and treatment.
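
For readers unfamiliar with the AUROC figures quoted in this abstract, the metric equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it directly from that definition; the labels and scores are made-up values, not data from the study.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Illustrative data only: 1 = septic, 0 = not septic.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]

example_auroc = auroc(labels, scores)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives context for the reported 0.80 (imaging alone) and 0.94 (with clinical data) for sepsis.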

[AI-43] 3D-Aware Instance Segmentation and Tracking in Egocentric Videos

Link: https://arxiv.org/abs/2408.09860
Authors: Yash Bhalgat,Vadim Tschernezki,Iro Laina,João F. Henriques,Andrea Vedaldi,Andrew Zisserman
Keywords: rapid camera motion, present unique challenges, videos present unique, scene understanding due, frequent object occlusions
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Egocentric videos present unique challenges for 3D scene understanding due to rapid camera motion, frequent object occlusions, and limited object visibility. This paper introduces a novel approach to instance segmentation and tracking in first-person video that leverages 3D awareness to overcome these obstacles. Our method integrates scene geometry, 3D object centroid tracking, and instance segmentation to create a robust framework for analyzing dynamic egocentric scenes. By incorporating spatial and temporal cues, we achieve superior performance compared to state-of-the-art 2D approaches. Extensive evaluations on the challenging EPIC Fields dataset demonstrate significant improvements across a range of tracking and segmentation consistency metrics. Specifically, our method outperforms the next best performing approach by 7 points in Association Accuracy (AssA) and 4.5 points in IDF1 score, while reducing the number of ID switches by 73% to 80% across various object categories. Leveraging our tracked instance segmentations, we showcase downstream applications in 3D object reconstruction and amodal video object segmentation in these egocentric settings.

[AI-44] TeamLoRA: Boosting Low-Rank Adaptation with Expert Collaboration and Competition

Link: https://arxiv.org/abs/2408.09856
Authors: Tianwei Lin,Jiang Liu,Wenqiao Zhang,Zhaocheng Li,Yang Dai,Haoyuan Li,Zhelun Yu,Wanggui He,Juncheng Li,Hao Jiang,Siliang Tang,Yueting Zhuang
Keywords: effectively addressed GPU, addressed GPU memory, GPU memory constraints, addressed GPU, GPU memory
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:While Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA have effectively addressed GPU memory constraints during fine-tuning, their performance often falls short, especially in multidimensional task scenarios. To address this issue, one straightforward solution is to introduce task-specific LoRA modules as domain experts, leveraging the modeling of multiple experts’ capabilities and thus enhancing the general capability of multi-task learning. Although promising, these additional components often add complexity to the training and inference process, contravening the efficiency that PEFT was designed for. Considering this, we introduce an innovative PEFT method, TeamLoRA, which consists of a collaboration and competition module for experts and thus achieves the right balance of effectiveness and efficiency: (i) For collaboration, a novel knowledge-sharing and -organizing mechanism is devised to appropriately reduce the scale of matrix operations, thereby boosting the training and inference speed. (ii) For competition, we propose leveraging a game-theoretic interaction mechanism for experts, encouraging experts to transfer their domain-specific knowledge while facing diverse downstream tasks, and thus enhancing the performance. By doing so, TeamLoRA elegantly connects the experts as a “Team” with internal collaboration and competition, enabling a faster and more accurate PEFT paradigm for multi-task learning. To validate the superiority of TeamLoRA, we curate a comprehensive multi-task evaluation (CME) benchmark to thoroughly assess the capability of multi-task learning. Experiments conducted on our CME and other benchmarks indicate the effectiveness and efficiency of TeamLoRA. Our project is available at this https URL.

[AI-45] Self-Directed Turing Test for Large Language Models

Link: https://arxiv.org/abs/2408.09853
Authors: Weiqi Wu,Hongqiu Wu,Hai Zhao
Keywords: Turing test examines, exhibit human-like behaviour, Traditional Turing tests, Turing tests adopt, Turing test
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:The Turing test examines whether AIs can exhibit human-like behaviour in natural language conversations. Traditional Turing tests adopt a rigid dialogue format where each participant sends only one message each time and require continuous human involvement to direct the entire interaction with the test subject. This fails to reflect a natural conversational style and hinders the evaluation of Large Language Models (LLMs) in complex and prolonged dialogues. This paper proposes the Self-Directed Turing Test, which extends the original test with a burst dialogue format, allowing more dynamic exchanges by multiple consecutive messages. It further efficiently reduces human workload by having the LLM self-direct the majority of the test process, iteratively generating dialogues that simulate its interaction with humans. With the pseudo-dialogue history, the model then engages in a shorter dialogue with a human, which is paired with a human-human conversation on the same topic to be judged using questionnaires. We introduce the X-Turn Pass-Rate metric to assess the human likeness of LLMs across varying durations. While LLMs like GPT-4 initially perform well, achieving pass rates of 51.9% and 38.9% during 3 turns and 10 turns of dialogues respectively, their performance drops as the dialogue progresses, which underscores the difficulty in maintaining consistency in the long term.

[AI-46] Importance Weighting Can Help Large Language Models Self-Improve

Link: https://arxiv.org/abs/2408.09849
Authors: Chunyang Jiang,Chi-min Chan,Wei Xue,Qifeng Liu,Yike Guo
Keywords: shown remarkable capability, Large language models, Large language, tasks and applications, LLM self-improvement
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Large language models (LLMs) have shown remarkable capability in numerous tasks and applications. However, fine-tuning LLMs using high-quality datasets under external supervision remains prohibitively expensive. In response, LLM self-improvement approaches have been actively developed recently. The typical paradigm of LLM self-improvement involves training an LLM on self-generated data, part of which may be detrimental and should be filtered out due to the unstable data quality. While current works primarily employ filtering strategies based on answer correctness, in this paper we demonstrate that filtering out samples that are correct but have a high distribution shift extent (DSE) could also benefit the results of self-improvement. Given that the actual sample distribution is usually inaccessible, we propose a new metric called DS weight to approximate DSE, inspired by the Importance Weighting methods. Consequently, we integrate DS weight with self-consistency to comprehensively filter the self-generated samples and fine-tune the language model. Experiments show that with only a tiny valid set (up to 5% size of the training set) to compute DS weight, our approach can notably promote the reasoning ability of current LLM self-improvement methods. The resulting performance is on par with methods that rely on external supervision from pre-trained reward models.

[AI-47] Demystifying Reinforcement Learning in Production Scheduling via Explainable AI

Link: https://arxiv.org/abs/2408.09841
Authors: Daniel Fischer,Hannah M. Hüsener,Felix Grumbach,Lukas Vollenkemper,Arthur Müller,Pascal Reusch
Keywords: Deep Reinforcement Learning, Deep Reinforcement, Reinforcement Learning, frequently employed technique, solve scheduling problems
Categories: Artificial Intelligence (cs.AI)

Abstract:Deep Reinforcement Learning (DRL) is a frequently employed technique to solve scheduling problems. Although DRL agents excel at delivering viable results in short computing times, their reasoning remains opaque. We conduct a case study where we systematically apply two explainable AI (xAI) frameworks, namely SHAP (DeepSHAP) and Captum (Input x Gradient), to describe the reasoning behind scheduling decisions of a specialized DRL agent in a flow production. We find that methods in the xAI literature lack falsifiability and consistent terminology, do not adequately consider domain knowledge, the target audience or real-world scenarios, and typically provide simple input-output explanations rather than causal interpretations. To resolve this issue, we introduce a hypotheses-based workflow. This approach enables us to inspect whether explanations align with domain knowledge and match the reward hypotheses of the agent. We furthermore tackle the challenge of communicating these insights to third parties by tailoring hypotheses to the target audience, which can serve as interpretations of the agent’s behavior after verification. Our proposed workflow emphasizes the repeated verification of explanations and may be applicable to various DRL-based scheduling use cases.

[AI-48] Segment-Anything Models Achieve Zero-shot Robustness in Autonomous Driving

Link: https://arxiv.org/abs/2408.09839
Authors: Jun Yan,Pengyu Wang,Danni Wang,Weiquan Huang,Daniel Watzenig,Huilin Yin
Keywords: SAM, Semantic segmentation, adversarial robustness, significant perception task, zero-shot adversarial robustness
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to IAVVC 2024

Abstract:Semantic segmentation is a significant perception task in autonomous driving, but it is exposed to the risk of adversarial examples. In the past few years, deep learning has gradually transitioned from convolutional neural network (CNN) models with a relatively small number of parameters to foundation models with a huge number of parameters. The segment-anything model (SAM) is a generalized image segmentation framework that can handle various types of images and can recognize and segment arbitrary objects in an image without the need to train on a specific object. It is a unified model that can handle diverse downstream tasks, including semantic segmentation, object detection, and tracking. In the task of semantic segmentation for autonomous driving, it is important to study the zero-shot adversarial robustness of SAM. Therefore, we deliver a systematic empirical study on the robustness of SAM without additional training. Based on the experimental results, the zero-shot adversarial robustness of SAM under black-box corruptions and white-box adversarial attacks is acceptable, even without the need for additional training. The finding of this study is insightful in that the gigantic model parameters and huge amounts of training data lead to the phenomenon of emergence, which builds a guarantee of adversarial robustness. SAM is a vision foundation model that can be regarded as an early prototype of an artificial general intelligence (AGI) pipeline. In such a pipeline, a unified model can handle diverse tasks. Therefore, this research not only inspects the impact of vision foundation models on safe autonomous driving but also provides a perspective on developing trustworthy AGI. The code is available at: this https URL.

[AI-49] Minor DPO reject penalty to increase training robustness

Link: https://arxiv.org/abs/2408.09834
Authors: Shiming Xie,Hong Chen,Fred Yu,Zeye Sun,Xiuyu Wu,Yingfan Hu
Keywords: align pretrained LLM, large-scale language model, pretrained LLM, Learning from human, fine-tuning step
Categories: Artificial Intelligence (cs.AI)
Comments: 8 pages, 19 figures

Abstract:Learning from human preference is a paradigm used in the fine-tuning step of large-scale language models (LLMs) to better align a pretrained LLM with human preferences for downstream tasks. In the past, it used the reinforcement learning from human feedback (RLHF) algorithm to optimize the LLM policy to align with these preferences without drifting too far from the original model. Recently, Direct Preference Optimization (DPO) has been proposed to solve the alignment problem with a simplified RL-free method. Using preference pairs of chosen and rejected data, DPO models the relative log probability as an implicit reward function and optimizes the LLM policy directly using a simple binary cross-entropy objective. DPO is straightforward and easy to understand, and it performs efficiently and well in most cases. In this article, we analyze the working mechanism of β in DPO, disclose its syntax difference between the RL algorithm and DPO, and understand the potential shortcomings brought by the DPO simplification. With these insights, we propose MinorDPO, which is better aligned with the original RL algorithm and increases the stability of the preference optimization process.
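
As a reference point for the discussion of β, here is a minimal sketch of the standard DPO objective for a single preference pair, written from the general description above. It is not MinorDPO, whose modification is not specified in the abstract. The function name `dpo_loss` and the example log-probabilities are assumptions for illustration.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the policy and under
    the frozen reference model (hypothetical values in this sketch).
    """
    # Implicit rewards: beta-scaled log-ratio of policy to reference.
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward
    # Binary cross-entropy: -log(sigmoid(margin)), computed stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# When the policy equals the reference, the margin is 0 and the loss is log(2).
loss_at_init = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Raising the chosen log-probability and lowering the rejected one reduces the loss.
loss_improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)
```

Note that the loss depends only on the difference between the two implicit rewards, so it can decrease either by raising the chosen response or by pushing down the rejected one; β scales how strongly that margin is rewarded.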

[AI-50] TDNetGen: Empowering Complex Network Resilience Prediction with Generative Augmentation of Topology and Dynamics

Link: https://arxiv.org/abs/2408.09825
Authors: Chang Liu,Jingtao Ding,Yiwen Song,Yong Li
Keywords: retain fundamental functionality, fundamental functionality amidst, functionality amidst external, amidst external perturbations, improving real-world complex
Categories: Artificial Intelligence (cs.AI)

Abstract:Predicting the resilience of complex networks, which represents the ability to retain fundamental functionality amidst external perturbations or internal failures, plays a critical role in understanding and improving real-world complex systems. Traditional theoretical approaches grounded in nonlinear dynamical systems rely on prior knowledge of network dynamics. On the other hand, data-driven approaches frequently encounter the challenge of insufficient labeled data, a predicament commonly observed in real-world scenarios. In this paper, we introduce a novel resilience prediction framework for complex networks, designed to tackle this issue through generative data augmentation of network topology and dynamics. The core idea is the strategic utilization of the inherent joint distribution present in unlabeled network data, facilitating the learning process of the resilience predictor by illuminating the relationship between network topology and dynamics. Experimental results on three network datasets demonstrate that our proposed framework TDNetGen can achieve a high prediction accuracy of up to 85%-95%. Furthermore, the framework still demonstrates a pronounced augmentation capability in extreme low-data regimes, thereby underscoring its utility and robustness in enhancing the prediction of network resilience. We have open-sourced our code at this https URL.

[AI-51] CMoralEval: A Moral Evaluation Benchmark for Chinese Large Language Models ACL2024

Link: https://arxiv.org/abs/2408.09819
Authors: Linhao Yu,Yongqi Leng,Yufei Huang,Shang Wu,Haixin Liu,Xinmeng Ji,Jiahui Zhao,Jinwang Song,Tingting Cui,Xiaoqing Cheng,Tao Liu,Deyi Xiong
Keywords: ethically relevant context, large language model, Chinese LLMs, Chinese, language model
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by ACL 2024 (Findings)

Abstract:How would a large language model (LLM) respond in an ethically relevant context? In this paper, we curate a large benchmark, CMoralEval, for morality evaluation of Chinese LLMs. The data sources of CMoralEval are two-fold: 1) a Chinese TV program discussing Chinese moral norms with stories from the society and 2) a collection of Chinese moral anomies from various newspapers and academic papers on morality. With these sources, we aim to create a moral evaluation dataset characterized by diversity and authenticity. We develop a morality taxonomy and a set of fundamental moral principles that are not only rooted in traditional Chinese culture but also consistent with contemporary societal norms. To facilitate efficient construction and annotation of instances in CMoralEval, we establish a platform with AI-assisted instance generation to streamline the annotation process. These help us curate CMoralEval, which encompasses both explicit moral scenarios (14,964 instances) and moral dilemma scenarios (15,424 instances), each with instances from different data sources. We conduct extensive experiments with CMoralEval to examine a variety of Chinese LLMs. Experimental results demonstrate that CMoralEval is a challenging benchmark for Chinese LLMs. The dataset is publicly available at this https URL.

[AI-52] Contextual Dual Learning Algorithm with Listwise Distillation for Unbiased Learning to Rank

Link: https://arxiv.org/abs/2408.09817
Authors: Lulu Yu,Keping Bi,Shiyu Ni,Jiafeng Guo
Keywords: implicit user feedback, leverage biased implicit, biased implicit user, unbiased ranking model, Dual Learning Algorithm
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 12 pages, 2 figures

Abstract:Unbiased Learning to Rank (ULTR) aims to leverage biased implicit user feedback (e.g., clicks) to optimize an unbiased ranking model. The effectiveness of the existing ULTR methods has primarily been validated on synthetic datasets. However, their performance on real-world click data remains unclear. Recently, Baidu released a large publicly available dataset of their web search logs. Subsequently, the NTCIR-17 ULTRE-2 task released a subset dataset extracted from it. We conduct experiments on commonly used or effective ULTR methods on this subset to determine whether they maintain their effectiveness. In this paper, we propose a Contextual Dual Learning Algorithm with Listwise Distillation (CDLA-LD) to simultaneously address both position bias and contextual bias. We utilize a listwise-input ranking model to obtain reconstructed feature vectors incorporating local contextual information and employ the Dual Learning Algorithm (DLA) method to jointly train this ranking model and a propensity model to address position bias. As this ranking model learns the interaction information within the document lists of the training set, to enhance the ranking model’s generalization ability, we additionally train a pointwise-input ranking model to learn the listwise-input ranking model’s capability for relevance judgment in a listwise manner. Extensive experiments and analysis confirm the effectiveness of our approach.

[AI-53] World Models Increase Autonomy in Reinforcement Learning

Link: https://arxiv.org/abs/2408.09807
Authors: Zhao Yang,Thomas M. Moerland,Mike Preuss,Edward S. Hu
Keywords: autonomously acquired experience, enabling policy acquisition, training intelligent agents, Reinforcement learning, acquired experience
Categories: Artificial Intelligence (cs.AI)

Abstract:Reinforcement learning (RL) is an appealing paradigm for training intelligent agents, enabling policy acquisition from the agent’s own autonomously acquired experience. However, the training process of RL is far from automatic, requiring extensive human effort to reset the agent and environments. To tackle the challenging reset-free setting, we first demonstrate the superiority of model-based (MB) RL methods in such a setting, showing that a straightforward adaptation of MBRL can outperform all prior state-of-the-art methods while requiring less supervision. We then identify limitations inherent to this direct extension and propose a solution called model-based reset-free (MoReFree) agent, which further enhances the performance. MoReFree adapts two key mechanisms, exploration and policy learning, to handle reset-free tasks by prioritizing task-relevant states. It exhibits superior data-efficiency across various reset-free tasks without access to environmental reward or demonstrations while significantly outperforming privileged baselines that require supervision. Our findings suggest model-based methods hold significant promise for reducing human effort in RL. Website: this https URL

[AI-54] AutoML-guided Fusion of Entity and LLM-based representations

Link: https://arxiv.org/abs/2408.09794
Authors: Boshko Koloski,Senja Pollak,Roberto Navigli,Blaž Škrlj
Keywords: Large semantic knowledge, Large Language Model, grounded in factual, semantic knowledge bases, Large semantic
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Large semantic knowledge bases are grounded in factual knowledge. However, recent approaches to dense text representations (embeddings) do not efficiently exploit these resources. Dense and robust representations of documents are essential for effectively solving downstream classification and retrieval tasks. This work demonstrates that injecting embedded information from knowledge bases can augment the performance of contemporary Large Language Model (LLM)-based representations for the task of text classification. Further, by considering automated machine learning (AutoML) with the fused representation space, we demonstrate it is possible to improve classification accuracy even if we use low-dimensional projections of the original representation space obtained via efficient matrix factorization. This result shows that significantly faster classifiers can be achieved with minimal or no loss in predictive performance, as demonstrated using five strong LLM baselines on six diverse real-life datasets.
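
The idea of fusing two representation spaces and then working in a low-dimensional projection can be illustrated with a short sketch. Everything specific here is an assumption: fusion by simple concatenation, projection by truncated SVD, the function name `fuse_and_project`, and the random stand-in embeddings. The paper's actual fusion operator, factorization method, and AutoML search are not described in the abstract and are not reproduced.

```python
import numpy as np

def fuse_and_project(llm_emb, kb_emb, k):
    """Concatenate two embedding matrices and keep the top-k SVD components.

    llm_emb: (n_docs, d1) document embeddings from an LLM (hypothetical).
    kb_emb:  (n_docs, d2) knowledge-base entity embeddings (hypothetical).
    """
    fused = np.hstack([llm_emb, kb_emb])      # fused representation space
    centered = fused - fused.mean(axis=0)     # center before factorizing
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :k] * s[:k]                   # (n_docs, k) projection

# Random stand-ins for real embeddings.
rng = np.random.default_rng(0)
llm = rng.normal(size=(50, 32))
kb = rng.normal(size=(50, 16))
low_dim = fuse_and_project(llm, kb, k=8)
```

A downstream classifier trained on `low_dim` sees 8 features instead of 48, which is where the speed-up from low-dimensional projections would come from.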

[AI-55] GoNoGo: An Efficient LLM-based Multi-Agent System for Streamlining Automotive Software Release Decision-Making

Link: https://arxiv.org/abs/2408.09785
Authors: Arsham Gholamzadeh Khoee,Yinan Yu,Robert Feldt,Andris Freimanis,Patrick Andersson,Dhasarathy Parthasarathy
Keywords: industry typically rely, software test data, Traditional methods, tabular software test, automotive industry typically
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)

Abstract:Traditional methods for making software deployment decisions in the automotive industry typically rely on manual analysis of tabular software test data. These methods often lead to higher costs and delays in the software release cycle due to their labor-intensive nature. Large Language Models (LLMs) present a promising solution to these challenges. However, their application generally demands multiple rounds of human-driven prompt engineering, which limits their practical deployment, particularly for industrial end-users who need reliable and efficient results. In this paper, we propose GoNoGo, an LLM agent system designed to streamline automotive software deployment while meeting both functional requirements and practical industrial constraints. Unlike previous systems, GoNoGo is specifically tailored to address domain-specific and risk-sensitive systems. We evaluate GoNoGo’s performance across different task difficulties using zero-shot and few-shot examples taken from industrial practice. Our results show that GoNoGo achieves a 100% success rate for tasks up to Level 2 difficulty with 3-shot examples, and maintains high performance even for more complex tasks. We find that GoNoGo effectively automates decision-making for simpler tasks, significantly reducing the need for manual intervention. In summary, GoNoGo represents an efficient and user-friendly LLM-based solution currently employed in our industrial partner’s company to assist with software release decision-making, supporting more informed and timely decisions in the release process for risk-sensitive vehicle systems.

[AI-56] MalLight: Influence-Aware Coordinated Traffic Signal Control for Traffic Signal Malfunctions CIKM24

Link: https://arxiv.org/abs/2408.09768
Authors: Qinchen Yang,Zejun Xie,Hua Wei,Desheng Zhang,Yu Yang
Keywords: extended waiting time, traffic signal malfunction, signal malfunction, Urban traffic, traffic signal
Categories: Artificial Intelligence (cs.AI)
Comments: Paper accepted to CIKM24 Full Research track

Abstract:Urban traffic is subject to disruptions that cause extended waiting time and safety issues at signalized intersections. While numerous studies have addressed the issue of intelligent traffic systems in the context of various disturbances, traffic signal malfunction, a common real-world occurrence with significant repercussions, has received comparatively limited attention. The primary objective of this research is to mitigate the adverse effects of traffic signal malfunction, such as traffic congestion and collision, by optimizing the control of neighboring functioning signals. To achieve this goal, this paper presents a novel traffic signal control framework (MalLight), which leverages an Influence-aware State Aggregation Module (ISAM) and an Influence-aware Reward Aggregation Module (IRAM) to achieve coordinated control of surrounding traffic signals. To the best of our knowledge, this study pioneers the application of a Reinforcement Learning (RL)-based approach to address the challenges posed by traffic signal malfunction. Empirical investigations conducted on real-world datasets substantiate the superior performance of our proposed methodology over conventional and deep learning-based alternatives in the presence of signal malfunction, with the reduction in throughput alleviated by as much as 48.6%.

[AI-57] Event Stream based Human Action Recognition: A High-Definition Benchmark Dataset and Algorithms

Link: https://arxiv.org/abs/2408.09764
Authors: Xiao Wang,Shiao Wang,Pengpeng Shao,Bo Jiang,Lin Zhu,Yonghong Tian
Keywords: pivotal research domain, RGB cameras dominating, Human Action Recognition, RGB cameras, RGB cameras encounter
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: In Peer Review

Abstract:Human Action Recognition (HAR) stands as a pivotal research domain in both computer vision and artificial intelligence, with RGB cameras dominating as the preferred tool for investigation and innovation in this field. However, in real-world applications, RGB cameras encounter numerous challenges, including light conditions, fast motion, and privacy concerns. Consequently, bio-inspired event cameras have garnered increasing attention due to their advantages of low energy consumption, high dynamic range, etc. Nevertheless, most existing event-based HAR datasets are low resolution (346 × 260). In this paper, we propose a large-scale, high-definition (1280 × 800) human action recognition dataset based on the CeleX-V event camera, termed CeleX-HAR. It encompasses 150 commonly occurring action categories, comprising a total of 124,625 video sequences. Various factors such as multi-view, illumination, action speed, and occlusion are considered when recording these data. To build a more comprehensive benchmark dataset, we report results for over 20 mainstream HAR models for future works to compare against. In addition, we also propose a novel Mamba vision backbone network for event stream based HAR, termed EVMamba, which is equipped with spatial plane multi-directional scanning and a novel voxel temporal scanning mechanism. By encoding and mining the spatio-temporal information of event streams, our EVMamba has achieved favorable results across multiple datasets. Both the dataset and source code will be released on this https URL

[AI-58] Revisiting Reciprocal Recommender Systems: Metrics Formulation and Method KDD2024

Link: https://arxiv.org/abs/2408.09748
Authors: Chen Yang,Sunhao Dai,Yupeng Hou,Wayne Xin Zhao,Jun Xu,Yang Song,Hengshu Zhu
Keywords: gained increasing attention, enhancing matching efficiency, conducting bilateral recommendations, Reciprocal recommender systems, involved parties
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: KDD 2024

Abstract:Reciprocal recommender systems (RRS), conducting bilateral recommendations between two involved parties, have gained increasing attention for enhancing matching efficiency. However, the majority of existing methods in the literature still reuse conventional ranking metrics to separately assess the performance on each side of the recommendation process. These methods overlook the fact that the ranking outcomes of both sides collectively influence the effectiveness of the RRS, neglecting the necessity of a more holistic evaluation and a capable systemic solution. In this paper, we systematically revisit the task of reciprocal recommendation by introducing new metrics, a new formulation, and a new method. Firstly, we propose five new evaluation metrics that comprehensively and accurately assess the performance of RRS from three distinct perspectives: overall coverage, bilateral stability, and balanced ranking. These metrics provide a more holistic understanding of the system’s effectiveness and enable a comprehensive evaluation. Furthermore, we formulate the RRS from a causal perspective, formulating recommendations as bilateral interventions, which can better model the decoupled effects of potential influencing factors. By utilizing the potential outcome framework, we further develop a model-agnostic causal reciprocal recommendation method that considers the causal effects of recommendations. Additionally, we introduce a reranking strategy to maximize matching outcomes, as measured by the proposed metrics. Extensive experiments on two real-world datasets from recruitment and dating scenarios demonstrate the effectiveness of our proposed metrics and approach. The code and dataset are available at: this https URL.
DOI: https://doi.org/10.48550/arXiv.2408.09748 (arXiv-issued, pending registration); related DOI: https://doi.org/10.1145/3637528.3671734

[AI-59] Enhanced Cascade Prostate Cancer Classifier in mp-MRI Utilizing Recall Feedback Adaptive Loss and Prior Knowledge-Based Feature Extraction

Link: https://arxiv.org/abs/2408.09746
Authors: Kun Luo,Bowen Zheng,Shidong Lv,Jie Tao,Qiang Wei
Keywords: males worldwide, Prostate cancer, mpMRI, Cascade Prostate Cancer, Prostate Cancer Classifier
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Prostate cancer is the second most common cancer in males worldwide, and mpMRI is commonly used for diagnosis. However, interpreting mpMRI is challenging and requires expertise from radiologists. This highlights the urgent need for automated grading in mpMRI. Existing studies lack integration of clinical prior information and suffer from uneven training sample distribution due to prevalence. Therefore, we propose a solution that incorporates prior knowledge, addresses the issue of uneven medical sample distribution, and maintains high interpretability in mpMRI. Firstly, we introduce Prior Knowledge-Based Feature Extraction, which mathematically models the PI-RADS criteria for prostate cancer as diagnostic information and incorporates it into model training. Secondly, we propose Adaptive Recall Feedback Loss to address the extremely imbalanced data problem. This method adjusts the training dynamically based on accuracy and recall in the validation set, resulting in high accuracy and recall simultaneously in the testing set. Thirdly, we design an Enhanced Cascade Prostate Cancer Classifier that classifies prostate cancer into different levels in an interpretable way, which refines the classification results and helps with clinical intervention. Our method is validated through experiments on the PI-CAI dataset and outperforms other methods with a more balanced result in both accuracy and recall rate.

[AI-60] R2GenCSR: Retrieving Context Samples for Large Language Model based X-ray Medical Report Generation

Link: https://arxiv.org/abs/2408.09743
Authors: Xiao Wang,Yuehang Li,Fuling Wang,Shiao Wang,Chuanfu Li,Bo Jiang
Keywords: Large Language Models, leverage large models, Large Language, generation methods attempt, existing X-ray medical
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: In Peer Review

Abstract:Inspired by the tremendous success of Large Language Models (LLMs), existing X-ray medical report generation methods attempt to leverage large models to achieve better performance. They usually adopt a Transformer to extract the visual features of a given X-ray image and then feed them into the LLM for text generation. How to extract more effective information for the LLMs to help them improve final results is an urgent problem that needs to be solved. Additionally, the use of visual Transformer models also brings high computational complexity. To address these issues, this paper proposes a novel context-guided efficient X-ray medical report generation framework. Specifically, we introduce Mamba as the vision backbone with linear complexity, and the performance obtained is comparable to that of the strong Transformer model. More importantly, we perform context retrieval from the training set for samples within each mini-batch during the training phase, utilizing both positively and negatively related samples to enhance feature representation and discriminative learning. Subsequently, we feed the vision tokens, context information, and prompt statements to invoke the LLM for generating high-quality medical reports. Extensive experiments on three X-ray report generation datasets (i.e., IU-Xray, MIMIC-CXR, CheXpert Plus) fully validated the effectiveness of our proposed model. The source code of this work will be released on this https URL.

[AI-61] Paired Completion: Flexible Quantification of Issue-framing at Scale with LLMs

Link: https://arxiv.org/abs/2408.09742
Authors: Simon D Angus, Lachlan O’Neill
Keywords (EN): Detecting and quantifying, quantifying issue framing, textual discourse, climate science, science vs. denialism
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); General Economics (econ.GN)
Comments: 9 pages, 4 figures

Abstract:Detecting and quantifying issue framing in textual discourse - the perspective one takes to a given topic (e.g. climate science vs. denialism, misogyny vs. gender equality) - is highly valuable to a range of end-users from social and political scientists to program evaluators and policy analysts. However, conceptual framing is notoriously challenging for automated natural language processing (NLP) methods since the words and phrases used by either ‘side’ of an issue are often held in common, with only subtle stylistic flourishes separating their use. Here we develop and rigorously evaluate new detection methods for issue framing and narrative analysis within large text datasets. By introducing a novel application of next-token log probabilities derived from generative large language models (LLMs) we show that issue framing can be reliably and efficiently detected in large corpora with only a few examples of either perspective on a given issue, a method we call ‘paired completion’. Through 192 independent experiments over three novel, synthetic datasets, we evaluate paired completion against prompt-based LLM methods and labelled methods using traditional NLP and recent LLM contextual embeddings. We additionally conduct a cost-based analysis to mark out the feasible set of performant methods at production-level scales, and a model bias analysis. Together, our work demonstrates a feasible path to scalable, accurate and low-bias issue-framing in large corpora.
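
To make the log-probability idea in this abstract concrete, here is a minimal sketch. The abstract does not give the exact formulation, so everything below is an assumption for illustration: the function names are hypothetical, and a toy add-one-smoothed unigram model stands in for the LLM that would normally supply next-token log probabilities. The sketch scores a text by how much more likely it is as a continuation of side-A primers than of side-B primers.

```python
import math
from collections import Counter
from typing import Callable, Sequence

# Signature of any scorer returning log P(continuation | context).
LogProbFn = Callable[[str, str], float]


def toy_conditional_logprob(context: str, continuation: str) -> float:
    """Toy stand-in for an LLM's log P(continuation | context).

    Assumption: an add-one-smoothed unigram model fitted on the context words.
    A real setup would sum next-token log probabilities from a generative LLM.
    """
    ctx_words = context.lower().split()
    cont_words = continuation.lower().split()
    counts = Counter(ctx_words)
    vocab_size = len(set(ctx_words) | set(cont_words))
    total = len(ctx_words) + vocab_size  # normaliser for add-one smoothing
    return sum(math.log((counts[w] + 1) / total) for w in cont_words)


def paired_completion_score(
    text: str,
    primers_a: Sequence[str],
    primers_b: Sequence[str],
    logprob: LogProbFn = toy_conditional_logprob,
) -> float:
    """Mean log-prob of `text` after side-A primers minus the same after side-B primers."""
    mean_a = sum(logprob(p, text) for p in primers_a) / len(primers_a)
    mean_b = sum(logprob(p, text) for p in primers_b) / len(primers_b)
    return mean_a - mean_b


# Hypothetical few-shot examples of each perspective.
primers_a = ["the science shows warming is caused by human emissions"]
primers_b = ["the climate has always changed and models are unreliable"]

score = paired_completion_score("human emissions cause warming", primers_a, primers_b)
print(score > 0)  # a positive score means the text sits closer to side A
```

Swapping `toy_conditional_logprob` for a function backed by a real model would keep the rest of the sketch unchanged, since only the `LogProbFn` signature is relied upon.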

[AI-62] Mutually-Aware Feature Learning for Few-Shot Object Counting

Link: https://arxiv.org/abs/2408.09734
Authors: Yerim Jeon, Subeen Lee, Jihwan Kim, Jae-Pil Heo
Keywords (EN): Few-shot object counting, garnered significant attention, Few-shot object, query image based, additional training
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Submitted to Pattern Recognition

Abstract:Few-shot object counting has garnered significant attention for its practicality as it aims to count target objects in a query image based on given exemplars without the need for additional training. However, there is a shortcoming in the prevailing extract-and-match approach: query and exemplar features lack interaction during feature extraction since they are extracted unaware of each other and later correlated based on similarity. This can lead to insufficient target awareness of the extracted features, resulting in confusion about the actual target when objects of multiple classes coexist. To address this limitation, we propose a novel framework, Mutually-Aware FEAture learning (MAFEA), which encodes query and exemplar features mutually aware of each other from the outset. By encouraging interaction between query and exemplar features throughout the entire pipeline, we can obtain target-aware features that are robust to a multi-category scenario. Furthermore, we introduce a background token to effectively associate the target region of the query with exemplars and decouple its background region from them. Our extensive experiments demonstrate that our model reaches a new state-of-the-art performance on the two challenging benchmarks, FSCD-LVIS and FSC-147, with a remarkably reduced degree of the target confusion problem.

[AI-63] Pedestrian Attribute Recognition: A New Benchmark Dataset and A Large Language Model Augmented Framework

Link: https://arxiv.org/abs/2408.09720
Authors: Jiandong Jin, Xiao Wang, Qian Zhu, Haiyang Wang, Chenglong Li
Keywords (EN): Pedestrian Attribute Recognition, human-centered research, indispensable tasks, tasks in human-centered, Attribute Recognition
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: MSP60K PAR Benchmark Dataset, LLM based PAR model, In Peer Review

Abstract:Pedestrian Attribute Recognition (PAR) is one of the indispensable tasks in human-centered research. However, existing datasets neglect different domains (e.g., environments, times, populations, and data sources), only conducting simple random splits, and performance on these datasets has already approached saturation. In the past five years, no large-scale dataset has been opened to the public. To address this issue, this paper proposes a new large-scale, cross-domain pedestrian attribute recognition dataset to fill the data gap, termed MSP60K. It consists of 60,122 images and 57 attribute annotations across eight scenarios. Synthetic degradation is also conducted to further narrow the gap between the dataset and real-world challenging scenarios. To establish a more rigorous benchmark, we evaluate 17 representative PAR models under both random and cross-domain split protocols on our dataset. Additionally, we propose an innovative Large Language Model (LLM) augmented PAR framework, named LLM-PAR. This framework processes pedestrian images through a Vision Transformer (ViT) backbone to extract features and introduces a multi-embedding query Transformer to learn partial-aware features for attribute classification. Significantly, we enhance this framework with LLM for ensemble learning and visual feature augmentation. Comprehensive experiments across multiple PAR benchmark datasets have thoroughly validated the efficacy of our proposed framework. The dataset and source code accompanying this paper will be made publicly available at this https URL.

[AI-64] HYDEN: Hyperbolic Density Representations for Medical Images and Reports

Link: https://arxiv.org/abs/2408.09715
Authors: Zhi Qiao, Linbin Han, Xiantong Zhen, Jia-Hong Gao, Zhen Qian
Keywords (EN): hierarchical modeling advantages, inherent entailment relations, point vector embeddings, visual semantic representation, point vector
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Abstract:In light of the inherent entailment relations between images and text, hyperbolic point vector embeddings, leveraging the hierarchical modeling advantages of hyperbolic space, have been utilized for visual semantic representation learning. However, point vector embedding approaches fail to address the issue of semantic uncertainty, where an image may have multiple interpretations, and text may refer to different images, a phenomenon particularly prevalent in the medical domain. Therefore, we propose HYDEN, a novel hyperbolic density embedding based image-text representation learning approach tailored for specific medical domain data. This method integrates text-aware local features alongside global features from images, mapping image-text features to density features in hyperbolic space using hyperbolic pseudo-Gaussian distributions. An encapsulation loss function is employed to model the partial order relations between image-text density distributions. Experimental results demonstrate the interpretability of our approach and its superior performance compared to the baseline methods across various zero-shot tasks and different datasets.

[AI-65] Partial-Multivariate Model for Forecasting

Link: https://arxiv.org/abs/2408.09703
Authors: Jaehoon Lee, Hankook Lee, Sungik Choi, Sungjun Cho, Moontae Lee
Keywords (EN): problems including multiple, including multiple time-series, utilize inter-feature information, multiple time-series features, complete-multivariate models
Categories: Artificial Intelligence (cs.AI)
Comments: 25 pages

Abstract:When solving forecasting problems including multiple time-series features, existing approaches often fall into two extreme categories, depending on whether to utilize inter-feature information: univariate and complete-multivariate models. Unlike univariate cases which ignore the information, complete-multivariate models compute relationships among a complete set of features. However, despite the potential advantage of leveraging the additional information, complete-multivariate models sometimes underperform univariate ones. Therefore, our research aims to explore a middle ground between these two by introducing what we term Partial-Multivariate models where a neural network captures only partial relationships, that is, dependencies within subsets of all features. To this end, we propose PMformer, a Transformer-based partial-multivariate model, with its training algorithm. We demonstrate that PMformer outperforms various univariate and complete-multivariate models, providing a theoretical rationale and empirical analysis for its superiority. Additionally, by proposing an inference technique for PMformer, the forecasting accuracy is further enhanced. Finally, we highlight other advantages of PMformer: efficiency and robustness under missing features.
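
The "partial relationships" idea can be illustrated with a small sketch of how features might be grouped. The abstract does not say how PMformer chooses its subsets, so the random partition below, the function name, and the parameters are all assumptions made for illustration. A model would then capture dependencies only among features that land in the same subset.

```python
import random
from typing import List


def sample_feature_subsets(num_features: int, subset_size: int, seed: int = 0) -> List[List[int]]:
    """Randomly partition feature indices into groups of at most `subset_size`.

    Assumption: a uniform random partition; the paper's sampling scheme may differ.
    """
    rng = random.Random(seed)
    indices = list(range(num_features))
    rng.shuffle(indices)
    return [
        sorted(indices[start:start + subset_size])
        for start in range(0, num_features, subset_size)
    ]


# Seven time-series features, modelled jointly in groups of at most three.
subsets = sample_feature_subsets(num_features=7, subset_size=3)
print(subsets)
```

Under this framing, `subset_size=1` corresponds to the univariate extreme and `subset_size=num_features` to the complete-multivariate extreme described in the abstract.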

[AI-66] Photorealistic Object Insertion with Diffusion-Guided Inverse Rendering ECCV2024

Link: https://arxiv.org/abs/2408.09702
Authors: Ruofan Liang, Zan Gojcic, Merlin Nimier-David, David Acuna, Nandita Vijaykumar, Sanja Fidler, Zian Wang
Keywords (EN): image formation process, real-world scenes requires, image formation, correct insertion, requires a deep
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments: ECCV 2024, Project page: this https URL

Abstract:The correct insertion of virtual objects in images of real-world scenes requires a deep understanding of the scene’s lighting, geometry and materials, as well as the image formation process. While recent large-scale diffusion models have shown strong generative and inpainting capabilities, we find that current models do not sufficiently “understand” the scene shown in a single picture to generate consistent lighting effects (shadows, bright reflections, etc.) while preserving the identity and details of the composited object. We propose using a personalized large diffusion model as guidance to a physically based inverse rendering process. Our method recovers scene lighting and tone-mapping parameters, allowing the photorealistic composition of arbitrary virtual objects in single frames or videos of indoor or outdoor scenes. Our physically based pipeline further enables automatic materials and tone-mapping refinement.

[AI-67] Harnessing Multimodal Large Language Models for Multimodal Sequential Recommendation

Link: https://arxiv.org/abs/2408.09698
Authors: Yuyang Ye, Zhi Zheng, Yishan Shen, Tianshu Wang, Hengruo Zhang, Peijun Zhu, Runlong Yu, Kai Zhang, Hui Xiong
Keywords (EN): Large Language Models, Multimodal Large Language, demonstrated significant potential, Recent advances, Large Language
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract:Recent advances in Large Language Models (LLMs) have demonstrated significant potential in the field of Recommendation Systems (RSs). Most existing studies have focused on converting user behavior logs into textual prompts and leveraging techniques such as prompt tuning to enable LLMs for recommendation tasks. Meanwhile, research interest has recently grown in multimodal recommendation systems that integrate data from images, text, and other sources using modality fusion techniques. This introduces new challenges to the existing LLM-based recommendation paradigm which relies solely on text modality information. Moreover, although Multimodal Large Language Models (MLLMs) capable of processing multi-modal inputs have emerged, how to equip MLLMs with multi-modal recommendation capabilities remains largely unexplored. To this end, in this paper, we propose the Multimodal Large Language Model-enhanced Sequential Multimodal Recommendation (MLLM-MSR) model. To capture the dynamic user preference, we design a two-stage user preference summarization method. Specifically, we first utilize an MLLM-based item-summarizer to extract image features for a given item and convert the image into text. Then, we employ a recurrent user preference summarization generation paradigm to capture the dynamic changes in user preferences based on an LLM-based user-summarizer. Finally, to enable the MLLM for the multi-modal recommendation task, we propose to fine-tune an MLLM-based recommender using Supervised Fine-Tuning (SFT) techniques. Extensive evaluations across various datasets validate the effectiveness of MLLM-MSR, showcasing its superior ability to capture and adapt to the evolving dynamics of user preferences.

[AI-68] LightWeather: Harnessing Absolute Positional Encoding to Efficient and Scalable Global Weather Forecasting

Link: https://arxiv.org/abs/2408.09695
Authors: Yisong Fu, Fei Wang, Zezhi Shao, Chengqing Yu, Yujie Li, Zhao Chen, Zhulin An, Yongjun Xu
Keywords (EN): capture long-term spatial-temporal, weather forecasting, gained traction, capability to capture, capture long-term
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Atmospheric and Oceanic Physics (physics.ao-ph)

Abstract:Recently, Transformers have gained traction in weather forecasting for their capability to capture long-term spatial-temporal correlations. However, their complex architectures result in large parameter counts and extended training times, limiting their practical application and scalability to global-scale forecasting. This paper aims to explore the key factor for accurate weather forecasting and design more efficient solutions. Interestingly, our empirical findings reveal that absolute positional encoding is what really works in Transformer-based weather forecasting models, which can explicitly model the spatial-temporal correlations even without attention mechanisms. We theoretically prove that its effectiveness stems from the integration of geographical coordinates and real-world time features, which are intrinsically related to the dynamics of weather. Based on this, we propose LightWeather, a lightweight and effective model for station-based global weather forecasting. We employ absolute positional encoding and a simple MLP in place of other components of Transformer. With under 30k parameters and less than one hour of training time, LightWeather achieves state-of-the-art performance on global weather datasets compared to other advanced DL methods. The results underscore the superiority of integrating spatial-temporal knowledge over complex architectures, providing novel insights for DL in weather forecasting.
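
As a rough illustration of what "absolute positional encoding" built from geographical coordinates and real-world time features could look like, the sketch below turns a station's latitude and longitude and a timestamp into a fixed-length vector. The abstract does not specify the encoding, so the sine and cosine mapping, the function name, and the chosen features are assumptions. The MLP that would consume this vector is omitted.

```python
import math
from typing import List


def absolute_position_features(
    lat_deg: float, lon_deg: float, hour_of_day: float, day_of_year: float
) -> List[float]:
    """Encode where and when an observation was taken as a 7-dimensional vector.

    Assumption: cyclic quantities (longitude, hour, day of year) are mapped to
    sine/cosine pairs so that, for example, hour 24 lands on the same point as hour 0.
    """
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    hour_angle = 2.0 * math.pi * hour_of_day / 24.0
    day_angle = 2.0 * math.pi * day_of_year / 365.25
    return [
        math.sin(lat),
        math.sin(lon), math.cos(lon),
        math.sin(hour_angle), math.cos(hour_angle),
        math.sin(day_angle), math.cos(day_angle),
    ]


# A hypothetical station at 52.5 N, 13.4 E, observed at 06:00 on day 232 of the year.
features = absolute_position_features(52.5, 13.4, hour_of_day=6.0, day_of_year=232.0)
print(len(features))
```

In a station-based model of the kind the abstract describes, a vector like this would be combined with the station's recent observations before being passed to the MLP.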

[AI-69] Simulating Field Experiments with Large Language Models

Link: https://arxiv.org/abs/2408.09682
Authors: Yaoyu Chen, Yuheng Hu, Yingda Lu
Keywords (EN): unprecedented content generation, Prevailing large language, human responses simulation, field experiments, Prevailing large
Categories: Artificial Intelligence (cs.AI)
Comments: 17 pages, 5 figures, 6 tables

Abstract:Prevailing large language models (LLMs) are capable of simulating human responses through their unprecedented content generation and reasoning abilities. However, it is not clear whether and how to leverage LLMs to simulate field experiments. In this paper, we propose and evaluate two prompting strategies: the observer mode that allows a direct prediction on main conclusions and the participant mode that simulates distributions of responses from participants. Using this approach, we examine fifteen well cited field experimental papers published in INFORMS and MISQ, finding encouraging alignments between simulated experimental results and the actual results in certain scenarios. We further identify topics on which LLMs underperform, including gender difference and social norms related research. Additionally, the automatic and standardized workflow proposed in this paper enables the possibility of a large-scale screening of more papers with field experiments. This paper pioneers the utilization of large language models (LLMs) for simulating field experiments, presenting a significant extension to previous work which focused solely on lab environments. By introducing two novel prompting strategies, observer and participant modes, we demonstrate the ability of LLMs to both predict outcomes and replicate participant responses within complex field settings. Our findings indicate a promising alignment with actual experimental results in certain scenarios, achieving a simulation accuracy of 66% in observer mode. This study expands the scope of potential applications for LLMs and illustrates their utility in assisting researchers prior to engaging in expensive field experiments. Moreover, it sheds light on the boundaries of LLMs when used in simulating field experiments, serving as a cautionary note for researchers considering the integration of LLMs into their experimental toolkit.

[AI-70] MambaLoc: Efficient Camera Localisation via State Space Model

Link: https://arxiv.org/abs/2408.09680
Authors: Jialu Wang, Kaichen Zhou, Andrew Markham, Niki Trigoni
Keywords (EN): edge-cloud IoT systems, Location information, augmented reality, automation and intelligence, intelligence of terminal
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Location information is pivotal for the automation and intelligence of terminal devices and edge-cloud IoT systems, such as autonomous vehicles and augmented reality. However, achieving reliable positioning across diverse IoT applications remains challenging due to significant training costs and the necessity of densely collected data. To tackle these issues, we have innovatively applied the selective state space (SSM) model to visual localization, introducing a new model named MambaLoc. The proposed model demonstrates exceptional training efficiency by capitalizing on the SSM model’s strengths in efficient feature extraction, rapid computation, and memory optimization, and it further ensures robustness in sparse data environments due to its parameter sparsity. Additionally, we propose the Global Information Selector (GIS), which leverages selective SSM to implicitly achieve the efficient global feature extraction capabilities of Non-local Neural Networks. This design leverages the computational efficiency of the SSM model alongside the Non-local Neural Networks’ capacity to capture long-range dependencies with minimal layers. Consequently, the GIS enables effective global information capture while significantly accelerating convergence. Our extensive experimental validation using public indoor and outdoor datasets first demonstrates our model’s effectiveness, followed by evidence of its versatility with various existing localization models.

[AI-71] Multi-Agent Reinforcement Learning for Autonomous Driving: A Survey

Link: https://arxiv.org/abs/2408.09675
Authors: Ruiqi Zhang, Jing Hou, Florian Walter, Shangding Gu, Jiayi Guan, Florian Röhrbein, Yali Du, Panpan Cai, Guang Chen, Alois Knoll
Keywords (EN): challenging real-world tasks, achieved performance surpassing, performance surpassing human, surpassing human capabilities, Reinforcement Learning
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Robotics (cs.RO)
Comments: 23 pages, 6 figures and 2 tables. Submitted to IEEE Journal

Abstract:Reinforcement Learning (RL) is a potent tool for sequential decision-making and has achieved performance surpassing human capabilities across many challenging real-world tasks. As the extension of RL to the multi-agent system domain, multi-agent RL (MARL) not only needs to learn the control policy but also must account for interactions with all other agents in the environment, mutual influences among different system components, and the distribution of computational resources. This augments the complexity of algorithmic design and poses higher requirements on computational resources. Simultaneously, simulators are crucial to obtain realistic data, which is fundamental to RL. In this paper, we first propose a series of metrics of simulators and summarize the features of existing benchmarks. Second, to ease comprehension, we recall the foundational knowledge and then synthesize the recently advanced studies of MARL-related autonomous driving and intelligent transportation systems. Specifically, we examine their environmental modeling, state representation, perception units, and algorithm design. Finally, we discuss open challenges as well as prospects and opportunities. We hope this paper can help researchers integrate MARL technologies and trigger more insightful ideas toward intelligent and autonomous driving.

[AI-72] A Comparison of Large Language Model and Human Performance on Random Number Generation Tasks

Link: https://arxiv.org/abs/2408.09656
Authors: Rachel M. Harrison
Keywords (EN): Number Generation Tasks, generate sequences devoid, Generation Tasks, psychology for examining, devoid of predictable
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Neurons and Cognition (q-bio.NC)

Abstract:Random Number Generation Tasks (RNGTs) are used in psychology for examining how humans generate sequences devoid of predictable patterns. By adapting an existing human RNGT for an LLM-compatible environment, this preliminary study tests whether ChatGPT-3.5, a large language model (LLM) trained on human-generated text, exhibits human-like cognitive biases when generating random number sequences. Initial findings indicate that ChatGPT-3.5 more effectively avoids repetitive and sequential patterns compared to humans, with notably lower repeat frequencies and adjacent number frequencies. Continued research into different models, parameters, and prompting methodologies will deepen our understanding of how LLMs can more closely mimic human random generation behaviors, while also broadening their applications in cognitive and behavioral science research.
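
The two statistics named in this abstract, repeat frequency and adjacent number frequency, can be sketched as follows. The study's exact definitions are not given in the abstract, so the versions below are assumptions: each is taken as the share of consecutive pairs in the sequence that repeat the same number or differ by exactly one. The function names and the sample sequence are hypothetical.

```python
from typing import Sequence


def repeat_frequency(seq: Sequence[int]) -> float:
    """Share of consecutive pairs in which the same number appears twice in a row."""
    pairs = list(zip(seq, seq[1:]))
    return sum(a == b for a, b in pairs) / len(pairs)


def adjacent_frequency(seq: Sequence[int]) -> float:
    """Share of consecutive pairs whose values differ by exactly one (e.g. 4 then 5)."""
    pairs = list(zip(seq, seq[1:]))
    return sum(abs(a - b) == 1 for a, b in pairs) / len(pairs)


# A made-up response sequence; real data would come from participants or a model.
sequence = [1, 1, 2, 5, 4, 4, 9]
print(repeat_frequency(sequence), adjacent_frequency(sequence))
```

Both functions expect at least two numbers; for a uniformly random sequence over ten digits, the expected repeat frequency would be 0.1, which gives a baseline for comparing human and model output.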

[AI-73] Data-driven Conditional Instrumental Variables for Debiasing Recommender Systems

Link: https://arxiv.org/abs/2408.09651
Authors: Zhirong Huang, Shichao Zhang, Debo Cheng, Jiuyong Li, Lin Liu, Guangquan Lu
Keywords (EN): true user preferences, latent variables, deviate from true, recommender systems, user-item interaction data
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract:In recommender systems, latent variables can cause user-item interaction data to deviate from true user preferences. This biased data is then used to train recommendation models, further amplifying the bias and ultimately compromising both recommendation accuracy and user satisfaction. Instrumental Variable (IV) methods are effective tools for addressing the confounding bias introduced by latent variables; however, identifying a valid IV is often challenging. To overcome this issue, we propose a novel data-driven conditional IV (CIV) debiasing method for recommender systems, called CIV4Rec. CIV4Rec automatically generates valid CIVs and their corresponding conditioning sets directly from interaction data, significantly reducing the complexity of IV selection while effectively mitigating the confounding bias caused by latent variables in recommender systems. Specifically, CIV4Rec leverages a variational autoencoder (VAE) to generate the representations of the CIV and its conditional set from interaction data, followed by the application of least squares to derive causal representations for click prediction. Extensive experiments on two real-world datasets, Movielens-10M and Douban-Movie, demonstrate that our CIV4Rec successfully identifies valid CIVs, effectively reduces bias, and consequently improves recommendation accuracy.
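
To show why an instrumental variable helps against latent confounding, here is a minimal worked example. It is not CIV4Rec: it uses the classical single-instrument estimator (the ratio of covariances, equivalent to two-stage least squares with one instrument) on hand-made numbers, with no conditioning set and no VAE. All variable names and data are hypothetical.

```python
from typing import Sequence


def covariance(a: Sequence[float], b: Sequence[float]) -> float:
    """Population covariance of two equally long sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)


def ols_slope(x: Sequence[float], y: Sequence[float]) -> float:
    """Ordinary least-squares slope of y on x; biased when a confounder drives both."""
    return covariance(x, y) / covariance(x, x)


def iv_slope(z: Sequence[float], x: Sequence[float], y: Sequence[float]) -> float:
    """Instrumental-variable slope: cov(z, y) / cov(z, x)."""
    return covariance(z, y) / covariance(z, x)


z = [0.0, 0.0, 1.0, 1.0]    # instrument: affects x, unrelated to the confounder
u = [1.0, -1.0, 1.0, -1.0]  # latent confounder: affects both x and y
x = [zi + ui for zi, ui in zip(z, u)]              # treatment (e.g. exposure)
y = [2.0 * xi + 3.0 * ui for xi, ui in zip(x, u)]  # outcome; true effect of x is 2

print(ols_slope(x, y), iv_slope(z, x, y))
```

In this constructed data the naive regression overstates the effect because `u` pushes `x` and `y` in the same direction, while the instrument-based estimate recovers the true coefficient of 2.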

[AI-74] ExpoMamba: Exploiting Frequency SSM Blocks for Efficient and Effective Image Enhancement

Link: https://arxiv.org/abs/2408.09650
Authors: Eashan Adhikarla, Kai Zhang, John Nicholson, Brian D. Davison
Keywords (EN): handling high-resolution images, computer vision, remains a challenging, challenging task, task in computer
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Image and Video Processing (eess.IV)

Abstract:Low-light image enhancement remains a challenging task in computer vision, with existing state-of-the-art models often limited by hardware constraints and computational inefficiencies, particularly in handling high-resolution images. Recent foundation models, such as transformers and diffusion models, despite their efficacy in various domains, are limited in use on edge devices due to their computational complexity and slow inference times. We introduce ExpoMamba, a novel architecture that integrates components of the frequency state space within a modified U-Net, offering a blend of efficiency and effectiveness. This model is specifically optimized to address mixed exposure challenges, a common issue in low-light image enhancement, while ensuring computational efficiency. Our experiments demonstrate that ExpoMamba enhances low-light images up to 2-3x faster than traditional models with an inference time of 36.6 ms and achieves a PSNR improvement of approximately 15-20% over competing models, making it highly suitable for real-time image processing applications.

[AI-75] Debiased Contrastive Representation Learning for Mitigating Dual Biases in Recommender Systems

Link: https://arxiv.org/abs/2408.09646
Authors: Zhirong Huang, Shichao Zhang, Debo Cheng, Jiuyong Li, Lin Liu, Guixian Zhang
Keywords (EN): undermine recommender effectiveness, disproportionately favouring popular, biases undermine recommender, user-item historical data, favouring popular items
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract:In recommender systems, popularity and conformity biases undermine recommender effectiveness by disproportionately favouring popular items, leading to their over-representation in recommendation lists and causing an unbalanced distribution of user-item historical data. We construct a causal graph to address both biases and describe the abstract data generation mechanism. Then, we use it as a guide to develop a novel Debiased Contrastive Learning framework for Mitigating Dual Biases, called DCLMDB. In DCLMDB, both popularity bias and conformity bias are handled in the model training process by contrastive learning to ensure that user choices and recommended items are not unduly influenced by conformity and popularity. Extensive experiments on two real-world datasets, Movielens-10M and Netflix, show that DCLMDB can effectively reduce the dual biases, as well as significantly enhance the accuracy and diversity of recommendations.

[AI-76] How to Make the Most of LLMs’ Grammatical Knowledge for Acceptability Judgments

Link: https://arxiv.org/abs/2408.09639
Authors: Yusuke Ide, Yuto Nishida, Miyu Oba, Yusuke Sakai, Justin Vasselli, Hidetaka Kamigaito, Taro Watanabe
Keywords (EN): linguistic minimal pairs, minimal pairs, benchmark of linguistic, linguistic minimal, required to judge
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:The grammatical knowledge of language models (LMs) is often measured using a benchmark of linguistic minimal pairs, where LMs are presented with a pair of acceptable and unacceptable sentences and required to judge which is acceptable. The existing dominant approach, however, naively calculates and compares the probabilities of paired sentences using LMs. Additionally, large language models (LLMs) have yet to be thoroughly examined in this field. We thus investigate how to make the most of LLMs’ grammatical knowledge to comprehensively evaluate it. Through extensive experiments of nine judgment methods in English and Chinese, we demonstrate that a probability readout method, in-template LP, and a prompting-based method, Yes/No probability computing, achieve particularly high performance, surpassing the conventional approach. Our analysis reveals their different strengths, e.g., Yes/No probability computing is robust against token-length bias, suggesting that they harness different aspects of LLMs’ grammatical knowledge. Consequently, we recommend using diverse judgment methods to evaluate LLMs comprehensively.
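
The conventional minimal-pair judgment and the token-length bias mentioned in this abstract can be illustrated with a small sketch. The per-token log probabilities below are made up, and the length-normalised variant is a generic remedy shown for contrast; it is not claimed to be one of the paper's nine methods. Function names are hypothetical.

```python
from typing import Sequence


def total_logprob(token_logprobs: Sequence[float]) -> float:
    """Sentence log probability: the sum of per-token log probabilities."""
    return sum(token_logprobs)


def mean_logprob(token_logprobs: Sequence[float]) -> float:
    """Length-normalised score: the average log probability per token."""
    return sum(token_logprobs) / len(token_logprobs)


def prefers_first(
    first: Sequence[float], second: Sequence[float], length_normalised: bool = False
) -> bool:
    """Return True if the first sentence of the pair is judged more acceptable."""
    score = mean_logprob if length_normalised else total_logprob
    return score(first) > score(second)


# Made-up per-token log probabilities for a minimal pair.
acceptable = [-1.0] * 6    # longer sentence, each token fairly likely
unacceptable = [-1.3] * 4  # shorter sentence, each token less likely

print(prefers_first(acceptable, unacceptable))                          # raw sum
print(prefers_first(acceptable, unacceptable, length_normalised=True))  # per-token mean
```

With these numbers the raw sum favours the shorter, unacceptable sentence simply because it has fewer tokens, while the per-token mean favours the acceptable one, which is the kind of length effect the abstract refers to.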

[AI-77] Meta-Learning on Augmented Gene Expression Profiles for Enhanced Lung Cancer Detection

Link: https://arxiv.org/abs/2408.09635
Authors: Arya Hadizadeh Moghaddam, Mohsen Nayebi Kerdabadi, Cuncong Zhong, Zijun Yao
Keywords (EN): providing critical information, obtained through DNA, DNA microarray, cancer detection classifiers, Gene expression profiles
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Genomics (q-bio.GN)
Comments: Accepted to AMIA 2024 Annual Symposium

Abstract:Gene expression profiles obtained through DNA microarray have proven successful in providing critical information for cancer detection classifiers. However, the limited number of samples in these datasets makes it challenging to employ complex methodologies such as deep neural networks for sophisticated analysis. To address this “small data” dilemma, Meta-Learning has been introduced as a solution to enhance the optimization of machine learning models by utilizing similar datasets, thereby facilitating a quicker adaptation to target datasets without requiring large numbers of samples. In this study, we present a meta-learning-based approach for predicting lung cancer from gene expression profiles. We apply this framework to well-established deep learning methodologies and employ four distinct datasets for the meta-learning tasks, with one serving as the target dataset and the rest as source datasets. Our approach is evaluated against both traditional and deep learning methodologies, and the results show the superior performance of meta-learning on augmented source data compared to the baselines trained on single datasets. Moreover, we conduct a comparative analysis between meta-learning and transfer learning methodologies to highlight the efficiency of the proposed approach in addressing the challenges associated with limited sample sizes. Finally, we incorporate an explainability study to illustrate the distinctiveness of decisions made by meta-learning.

[AI-78] On the Foundations of Conflict-Driven Solving for Hybrid MKNF Knowledge Bases

Link: https://arxiv.org/abs/2408.09626
Authors: Riley Kinahan, Spencer Killen, Kevin Wan, Jia-Huai You
Keywords (EN): MKNF Knowledge Bases, Hybrid MKNF Knowledge, Knowledge Bases, tightly integrated reasoning, Hybrid MKNF
Categories: Artificial Intelligence (cs.AI)

Abstract:Hybrid MKNF Knowledge Bases (HMKNF-KBs) constitute a formalism for tightly integrated reasoning over closed-world rules and open-world ontologies. This approach allows for accurate modeling of real-world systems, which often rely on both categorical and normative reasoning. Conflict-driven solving is the leading approach for computationally hard problems, such as satisfiability (SAT) and answer set programming (ASP), in which MKNF is rooted. This paper investigates the theoretical underpinnings required for a conflict-driven solver of HMKNF-KBs. The approach defines a set of completion and loop formulas, whose satisfaction characterizes MKNF models. This forms the basis for a set of nogoods, which in turn can be used as the backbone for a conflict-driven solver.

[AI-79] Attention is a smoothed cubic spline

Link: https://arxiv.org/abs/2408.09624
Authors: Zehua Lai,Lek-Heng Lim,Yucong Liu
Keywords: hitherto unobserved insight, important but hitherto, hitherto unobserved, transformer, splines
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments: 20 pages, 2 figures

Abstract:We highlight a perhaps important but hitherto unobserved insight: The attention module in a transformer is a smoothed cubic spline. Viewed in this manner, this mysterious but critical component of a transformer becomes a natural development of an old notion deeply entrenched in classical approximation theory. More precisely, we show that with ReLU-activation, attention, masked attention, encoder-decoder attention are all cubic splines. As every component in a transformer is constructed out of compositions of various attention modules (= cubic splines) and feed forward neural networks (= linear splines), all its components – encoder, decoder, and encoder-decoder blocks; multilayered encoders and decoders; the transformer itself – are cubic or higher-order splines. If we assume the Pierce–Birkhoff conjecture, then the converse also holds, i.e., every spline is a ReLU-activated encoder. Since a spline is generally just C^2, one way to obtain a smoothed C^\infty-version is by replacing ReLU with a smooth activation; and if this activation is chosen to be SoftMax, we recover the original transformer as proposed by Vaswani et al. This insight sheds light on the nature of the transformer by casting it entirely in terms of splines, one of the best known and thoroughly understood objects in applied mathematics.

[AI-80] Does Thought Require Sensory Grounding? From Pure Thinkers to Large Language Models

Link: https://arxiv.org/abs/2408.09605
Authors: David J. Chalmers
Keywords: capacity to sense, require the capacity, capacity, sense, Abstract
Categories: Artificial Intelligence (cs.AI)

Abstract:Does the capacity to think require the capacity to sense? A lively debate on this topic runs throughout the history of philosophy and now animates discussions of artificial intelligence. I argue that in principle, there can be pure thinkers: thinkers that lack the capacity to sense altogether. I also argue for significant limitations in just what sort of thought is possible in the absence of the capacity to sense. Regarding AI, I do not argue directly that large language models can think or understand, but I rebut one important argument (the argument from sensory grounding) that they cannot. I also use recent results regarding language models to address the question of whether or how sensory grounding enhances cognitive capacities.

[AI-81] Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning

Link: https://arxiv.org/abs/2408.09600
Authors: Tiansheng Huang,Gautam Bhattacharya,Pratik Joshi,Josh Kimball,Ling Liu
Keywords: aligned Large Language, LLMs safety alignment, Safety aligned Large, Large Language Models, LLMs safety
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)

Abstract:Safety aligned Large Language Models (LLMs) are vulnerable to harmful fine-tuning attacks [qi2023fine]: a few harmful data mixed in the fine-tuning dataset can break the LLMs' safety alignment. Existing mitigation strategies include alignment stage solutions [huang2024vaccine, rosati2024representation] and fine-tuning stage solutions [huang2024lazy, mukhoti2023fine]. However, our evaluation shows that both categories of defenses fail when some specific training hyper-parameters are chosen: a large learning rate or a large number of training epochs in the fine-tuning stage can easily invalidate the defense, yet such settings are necessary to guarantee fine-tuning performance. To this end, we propose Antidote, a post-fine-tuning stage solution, which remains agnostic to the training hyper-parameters in the fine-tuning stage. Antidote relies on the philosophy that by removing the harmful parameters, the harmful model can be recovered from the harmful behaviors, regardless of how those harmful parameters are formed in the fine-tuning stage. With this philosophy, we introduce a one-shot pruning stage after harmful fine-tuning to remove the harmful weights that are responsible for the generation of harmful content. Despite its embarrassing simplicity, empirical results show that Antidote can reduce harmful score while maintaining accuracy on downstream tasks.

[AI-82] Moonshine: Distilling Game Content Generators into Steerable Generative Models

Link: https://arxiv.org/abs/2408.09594
Authors: Yuhe Nie,Michael Middleton,Tim Merino,Nidhushan Kanagaraja,Ashutosh Kumar,Zhan Zhuang,Julian Togelius
Keywords: Machine Learning, training data persist, Procedural Content Generation, limited training data, game content creation
Categories: Artificial Intelligence (cs.AI)

Abstract:Procedural Content Generation via Machine Learning (PCGML) has enhanced game content creation, yet challenges in controllability and limited training data persist. This study addresses these issues by distilling a constructive PCG algorithm into a controllable PCGML model. We first generate a large amount of content with a constructive algorithm and label it using a Large Language Model (LLM). We use these synthetic labels to condition two PCGML models for content-specific generation, a diffusion model and the five-dollar model. This neural network distillation process ensures that the generation aligns with the original algorithm while introducing controllability through plain text. We define this text-conditioned PCGML as a Text-to-game-Map (T2M) task, offering an alternative to prevalent text-to-image multi-modal tasks. We compare our distilled models with the baseline constructive algorithm. Our analysis of the variety, accuracy, and quality of our generation demonstrates the efficacy of distilling constructive methods into controllable text-conditioned PCGML models.

[AI-83] SynTraC: A Synthetic Dataset for Traffic Signal Control from Traffic Monitoring Cameras ITSC2024

Link: https://arxiv.org/abs/2408.09588
Authors: Tiejin Chen,Prithvi Shirke,Bharatesh Chakravarthi,Arpitsinh Vaghela,Longchao Da,Duo Lu,Yezhou Yang,Hua Wei
Keywords: traffic signal control, traffic signal, signal control, image-based traffic signal, aimed at bridging
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted to IEEE ITSC2024

Abstract:This paper introduces SynTraC, the first public image-based traffic signal control dataset, aimed at bridging the gap between simulated environments and real-world traffic management challenges. Unlike traditional datasets for traffic signal control which aim to provide simplified feature vectors like vehicle counts from traffic simulators, SynTraC provides real-style images from the CARLA simulator with annotated features, along with traffic signal states. This image-based dataset comes with diverse real-world scenarios, including varying weather and times of day. Additionally, SynTraC provides different reward values for advanced traffic signal control algorithms like reinforcement learning. Experiments with SynTraC demonstrate that image-based traffic signal control remains an open challenge compared with feature-based control methods, indicating our dataset can further guide the development of future algorithms. The code for this paper can be found at this https URL.

[AI-84] Say My Name: a Model's Bias Discovery Framework

Link: https://arxiv.org/abs/2408.09570
Authors: Massimiliano Ciranni,Luca Molinaro,Carlo Alberto Barbano,Attilio Fiandrotti,Vittorio Murino,Vito Paolo Pastore,Enzo Tartaglione
Keywords: increasingly more concerns, non-representative patterns, learning to downstream, downstream tasks, concerns about potential
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Abstract:In the last few years, due to the broad applicability of deep learning to downstream tasks and end-to-end training capabilities, increasingly more concerns about potential biases to specific, non-representative patterns have been raised. Many works focusing on unsupervised debiasing usually leverage the tendency of deep models to learn “easier” samples, for example by clustering the latent space to obtain bias pseudo-labels. However, the interpretation of such pseudo-labels is not trivial, especially for a non-expert end user, as it does not provide semantic information about the bias features. To address this issue, we introduce “Say My Name” (SaMyNa), the first tool to identify biases within deep models semantically. Unlike existing methods, our approach focuses on biases learned by the model. Our text-based pipeline enhances explainability and supports debiasing efforts: applicable during either training or post-hoc validation, our method can disentangle task-related information and proposes itself as a tool to analyze biases. Evaluation on traditional benchmarks demonstrates its effectiveness in detecting biases and even disclaiming them, showcasing its broad applicability for model diagnosis.

[AI-85] MergeRepair: An Exploratory Study on Merging Task-Specific Adapters in Code LLMs for Automated Program Repair

Link: https://arxiv.org/abs/2408.09568
Authors: Meghdad Dehghan,Jie JW Wu,Fatemeh H. Fard,Ali Ouni
Keywords: Large Language Models, Adapters, Automated Program Repair, merged adapters, program repair
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

Abstract:[Context] Large Language Models (LLMs) have shown good performance in several software development-related tasks such as program repair, documentation, code refactoring, debugging, and testing. Adapters are specialized, small modules designed for parameter efficient fine-tuning of LLMs for specific tasks, domains, or applications without requiring extensive retraining of the entire model. These adapters offer a more efficient way to customize LLMs for particular needs, leveraging the pre-existing capabilities of the large model. Merging LLMs and adapters has shown promising results for various natural language domains and tasks, enabling the use of the learned models and adapters without additional training for a new task. [Objective] This research proposes continual merging and empirically studies the capabilities of merged adapters in Code LLMs, specifically for the Automated Program Repair (APR) task. The goal is to gain insights into whether and how merging task-specific adapters can affect the performance of APR. [Method] In our framework, MergeRepair, we plan to merge multiple task-specific adapters using three different merging methods and evaluate the performance of the merged adapter for the APR task. In particular, we will employ two main merging scenarios for all three techniques: (i) merging using equal-weight averaging applied on parameters of different adapters, where all adapters are of equal importance; and (ii) our proposed approach, continual merging, in which we sequentially merge the task-specific adapters and the order and weight of merged adapters matter. Through an exploratory study of merging techniques, we will investigate the improvement and generalizability of merged adapters for APR. Through continual merging, we will explore the capability of merged adapters and the effect of task order, as it occurs in real-world software projects.
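
The two merging scenarios named in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the update rule for continual merging, so the linear interpolation with weight `alpha` is an assumption, as are the function names and the representation of an adapter as a dictionary of flat parameter lists.

```python
def equal_weight_merge(adapters):
    """Average each parameter across adapters; all adapters count equally."""
    if not adapters:
        raise ValueError("need at least one adapter")
    count = len(adapters)
    return {
        name: [sum(values) / count for values in zip(*(a[name] for a in adapters))]
        for name in adapters[0]
    }


def continual_merge(adapters, alpha=0.5):
    """Merge adapters one after another (hypothetical rule).

    Each new adapter is interpolated into the running result with weight
    `alpha`, so both the order and the weight influence the outcome.
    """
    if not adapters:
        raise ValueError("need at least one adapter")
    merged = {name: list(values) for name, values in adapters[0].items()}
    for adapter in adapters[1:]:
        for name, values in adapter.items():
            merged[name] = [
                (1 - alpha) * old + alpha * new
                for old, new in zip(merged[name], values)
            ]
    return merged


# Toy adapters with a single two-element parameter each.
a = {"w": [1.0, 2.0]}
b = {"w": [3.0, 4.0]}
c = {"w": [5.0, 6.0]}

averaged = equal_weight_merge([a, b, c])
forward = continual_merge([a, b, c])
backward = continual_merge([c, b, a])
```

Under this assumed rule, equal-weight averaging is order-independent, while `forward` and `backward` differ, which mirrors the abstract's point that task order matters in continual merging.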

[AI-86] Grammatical Error Feedback: An Implicit Evaluation Approach

Link: https://arxiv.org/abs/2408.09565
Authors: Stefano Bannò,Kate Knill,Mark J. F. Gales
Keywords: crucial for consolidating, feedback, grammatical error, Grammatical, computer-assisted language learning
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Grammatical feedback is crucial for consolidating second language (L2) learning. Most research in computer-assisted language learning has focused on feedback through grammatical error correction (GEC) systems, rather than examining more holistic feedback that may be more useful for learners. This holistic feedback will be referred to as grammatical error feedback (GEF). In this paper, we present a novel implicit evaluation approach to GEF that eliminates the need for manual feedback annotations. Our method adopts a grammatical lineup approach where the task is to pair feedback and essay representations from a set of possible alternatives. This matching process can be performed by appropriately prompting a large language model (LLM). An important aspect of this process, explored here, is the form of the lineup, i.e., the selection of foils. This paper exploits this framework to examine the quality and need for GEC to generate feedback, as well as the system used to generate feedback, using essays from the Cambridge Learner Corpus.

[AI-87] HiAgent: Hierarchical Working Memory Management for Solving Long-Horizon Agent Tasks with Large Language Model

Link: https://arxiv.org/abs/2408.09559
Authors: Mengkang Hu,Tianxing Chen,Qiguang Chen,Yao Mu,Wenqi Shao,Ping Luo
Keywords: Large Language Model, Large Language, Language Model, exhibit significant potential, based agents exhibit
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: Project Page: this https URL

Abstract:Large Language Model (LLM)-based agents exhibit significant potential across various domains, operating as interactive systems that process environmental observations to generate executable actions for target tasks. The effectiveness of these agents is significantly influenced by their memory mechanism, which records historical experiences as sequences of action-observation pairs. We categorize memory into two types: cross-trial memory, accumulated across multiple attempts, and in-trial memory (working memory), accumulated within a single attempt. While considerable research has optimized performance through cross-trial memory, the enhancement of agent performance through improved working memory utilization remains underexplored. Instead, existing approaches often involve directly inputting entire historical action-observation pairs into LLMs, leading to redundancy in long-horizon tasks. Inspired by human problem-solving strategies, this paper introduces HiAgent, a framework that leverages subgoals as memory chunks to manage the working memory of LLM-based agents hierarchically. Specifically, HiAgent prompts LLMs to formulate subgoals before generating executable actions and enables LLMs to decide proactively to replace previous subgoals with summarized observations, retaining only the action-observation pairs relevant to the current subgoal. Experimental results across five long-horizon tasks demonstrate that HiAgent achieves a twofold increase in success rate and reduces the average number of steps required by 3.8. Additionally, our analysis shows that HiAgent consistently improves performance across various steps, highlighting its robustness and generalizability. Project Page: this https URL.
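
The subgoal-as-memory-chunk idea described in this abstract can be sketched as a small data structure. This is only an illustration under assumptions: the class and method names are hypothetical, and the summary of a finished subgoal is passed in by the caller, whereas in the paper's setting an LLM would decide when to summarize and produce the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Subgoal:
    description: str
    steps: List[Tuple[str, str]] = field(default_factory=list)  # (action, observation)
    summary: Optional[str] = None


class HierarchicalWorkingMemory:
    """Keeps full action-observation pairs only for the current subgoal."""

    def __init__(self) -> None:
        self.subgoals: List[Subgoal] = []

    def start_subgoal(self, description: str, summary_of_previous: Optional[str] = None) -> None:
        # Replace the previous subgoal's detailed steps with a short summary.
        if self.subgoals:
            previous = self.subgoals[-1]
            previous.summary = summary_of_previous or f"{len(previous.steps)} step(s) taken"
            previous.steps = []
        self.subgoals.append(Subgoal(description))

    def record(self, action: str, observation: str) -> None:
        if not self.subgoals:
            raise RuntimeError("start a subgoal before recording steps")
        self.subgoals[-1].steps.append((action, observation))

    def context(self) -> List[str]:
        # Lines that would be placed in the prompt: summaries for finished
        # subgoals, full detail for the current one.
        lines = [f"[done] {sg.description}: {sg.summary}" for sg in self.subgoals[:-1]]
        if self.subgoals:
            current = self.subgoals[-1]
            lines.append(f"[current] {current.description}")
            lines.extend(f"  {action} -> {obs}" for action, obs in current.steps)
        return lines


memory = HierarchicalWorkingMemory()
memory.start_subgoal("find key")
memory.record("open drawer", "key visible")
memory.record("take key", "holding key")
memory.start_subgoal("unlock door", summary_of_previous="key obtained")
memory.record("use key", "door open")
prompt_lines = memory.context()
```

After the second subgoal starts, the two detailed steps of the first are replaced by one summary line, so the prompt grows with the number of subgoals rather than with the total number of steps.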

[AI-88] Addressing Heterogeneity in Federated Learning: Challenges and Solutions for a Shared Production Environment

Link: https://arxiv.org/abs/2408.09556
Authors: Tatjana Legler,Vinit Hegiste,Ahmed Anwar,Martin Ruskowski
Keywords: Federated learning, preserving data privacy, machine learning models, training machine learning, shared production environments
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Federated learning (FL) has emerged as a promising approach to training machine learning models across decentralized data sources while preserving data privacy, particularly in manufacturing and shared production environments. However, the presence of data heterogeneity (variations in data distribution, quality, and volume across different clients and production sites) poses significant challenges to the effectiveness and efficiency of FL. This paper provides a comprehensive overview of heterogeneity in FL within the context of manufacturing, detailing the types and sources of heterogeneity, including non-independent and identically distributed (non-IID) data, unbalanced data, variable data quality, and statistical heterogeneity. We discuss the impact of these types of heterogeneity on model training and review current methodologies for mitigating their adverse effects. These methodologies include personalized and customized models, robust aggregation techniques, and client selection techniques. By synthesizing existing research and proposing new strategies, this paper aims to provide insight for effectively managing data heterogeneity in FL, enhancing model robustness, and ensuring fair and efficient training across diverse environments. Future research directions are also identified, highlighting the need for adaptive and scalable solutions to further improve the FL paradigm in the context of Industry 4.0.

[AI-89] Using ChatGPT to Score Essays and Short-Form Constructed Responses

Link: https://arxiv.org/abs/2408.09540
Authors: Mark D. Shermis
Keywords: ASAP competition, large language models, ChatGPT large language, aimed to determine, large language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 35 pages, 8 tables, 2 Figures, 27 references

Abstract:This study aimed to determine if ChatGPT’s large language models could match the scoring accuracy of human and machine scores from the ASAP competition. The investigation focused on various prediction models, including linear regression, random forest, gradient boost, and boost. ChatGPT’s performance was evaluated against human raters using quadratic weighted kappa (QWK) metrics. Results indicated that while ChatGPT’s gradient boost model achieved QWKs close to human raters for some data sets, its overall performance was inconsistent and often lower than human scores. The study highlighted the need for further refinement, particularly in handling biases and ensuring scoring fairness. Despite these challenges, ChatGPT demonstrated potential for scoring efficiency, especially with domain-specific fine-tuning. The study concludes that ChatGPT can complement human scoring but requires additional development to be reliable for high-stakes assessments. Future research should improve model accuracy, address ethical considerations, and explore hybrid models combining ChatGPT with empirical methods.
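
For readers unfamiliar with the metric named in this abstract, quadratic weighted kappa (QWK) compares two sets of ordinal scores and penalizes each disagreement by the squared distance between the ratings. The sketch below uses the standard definition; the function name and the example ratings are illustrative and are not taken from the study.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_rating, max_rating):
    """QWK between two equal-length lists of integer ratings."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = max_rating - min_rating + 1
    if n < 2:
        raise ValueError("need at least two rating levels")
    total = len(rater_a)

    # Observed agreement matrix: observed[i][j] counts items rated i by A and j by B.
    observed = [[0] * n for _ in range(n)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_rating][b - min_rating] += 1

    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2
            expected = hist_a[i] * hist_b[j] / total  # agreement expected by chance
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected

    if weighted_expected == 0:
        return 1.0  # both raters used one identical rating throughout
    return 1.0 - weighted_observed / weighted_expected


perfect = quadratic_weighted_kappa([1, 2, 3, 4], [1, 2, 3, 4], 1, 4)
reversed_scores = quadratic_weighted_kappa([1, 2, 3], [3, 2, 1], 1, 3)
```

A value of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement.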

[AI-90] PA-LLaVA: A Large Language-Vision Assistant for Human Pathology Image Understanding

Link: https://arxiv.org/abs/2408.09530
Authors: Dawei Dai,Yuanhui Zhang,Long Xu,Qianlan Yang,Xiaojing Shen,Shuyin Xia,Guoyin Wang
Keywords: primarily involved developing, understanding primarily involved, image understanding primarily, pathology image understanding, involved developing models
Categories: Artificial Intelligence (cs.AI)
Comments: 8 pages, 4 figs

Abstract:Previous advancements in pathology image understanding primarily involved developing models tailored to specific tasks. Recent studies have demonstrated that large vision-language models can enhance the performance of various downstream tasks in medical image understanding. In this study, we developed a domain-specific large language-vision assistant (PA-LLaVA) for pathology image understanding. Specifically, (1) we first construct a human pathology image-text dataset by cleaning the public medical image-text data for domain-specific alignment; (2) using the proposed image-text data, we first train a pathology language-image pretraining (PLIP) model as the specialized visual encoder for pathology images, and then we develop a scale-invariant connector to avoid the information loss caused by image scaling; (3) we adopt two-stage learning to train PA-LLaVA, with the first stage for domain alignment and the second stage for the end-to-end visual question answering (VQA) task. In experiments, we evaluate our PA-LLaVA on both supervised and zero-shot VQA datasets, where our model achieved the best overall performance among multimodal models of similar scale. The ablation experiments also confirmed the effectiveness of our design. We posit that our PA-LLaVA model and the datasets presented in this work can promote research in the field of computational pathology. All codes are available at: this https URL

[AI-91] Revisiting the Graph Reasoning Ability of Large Language Models: Case Studies in Translation, Connectivity and Shortest Path

Link: https://arxiv.org/abs/2408.09529
Authors: Xinnan Dai,Qihao Wen,Yifei Shen,Hongzhi Wen,Dongsheng Li,Jiliang Tang,Caihua Shan
Keywords: Large Language Models, Large Language, Language Models, achieved great success, achieved great
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Large Language Models (LLMs) have achieved great success in various reasoning tasks. In this work, we focus on the graph reasoning ability of LLMs. Although theoretical studies proved that LLMs are capable of handling graph reasoning tasks, empirical evaluations reveal numerous failures. To deepen our understanding of this discrepancy, we revisit the ability of LLMs on three fundamental graph tasks: graph description translation, graph connectivity, and the shortest-path problem. Our findings suggest that LLMs can fail to understand graph structures through text descriptions and exhibit varying performance across all three fundamental tasks. Meanwhile, we perform a real-world investigation on knowledge graphs and make observations consistent with our findings. The codes and datasets are available.

[AI-92] ALS-HAR: Harnessing Wearable Ambient Light Sensors to Enhance IMU-based HAR

Link: https://arxiv.org/abs/2408.09527
Authors: Lala Shakti Swarup Ray,Daniel Geißler,Mengxi Liu,Bo Zhou,Sungho Suh,Paul Lukowicz
Keywords: screen brightness adaptation, smart devices commonly, brightness adaptation, primarily through body-worn, largely unexplored
Categories: Artificial Intelligence (cs.AI)

Abstract:Despite the widespread integration of ambient light sensors (ALS) in smart devices commonly used for screen brightness adaptation, their application in human activity recognition (HAR), primarily through body-worn ALS, is largely unexplored. In this work, we developed ALS-HAR, a robust wearable light-based motion activity classifier. Although ALS-HAR achieves comparable accuracy to other modalities, its natural sensitivity to external disturbances, such as changes in ambient light, weather conditions, or indoor lighting, makes it challenging for daily use. To address such drawbacks, we introduce strategies to enhance environment-invariant IMU-based activity classifications through augmented multi-modal and contrastive classifications by transferring the knowledge extracted from the ALS. Our experiments on a real-world activity dataset for three different scenarios demonstrate that while ALS-HAR’s accuracy strongly relies on external lighting conditions, cross-modal information can still improve other HAR systems, such as IMU-based classifiers. Even in scenarios where ALS performs insufficiently, the additional knowledge enables improved accuracy and macro F1 score by up to 4.2% and 6.4%, respectively, for IMU-based classifiers and even surpasses multi-modal sensor fusion models in two of our three experiment scenarios. Our research highlights the untapped potential of ALS integration in advancing sensor-based HAR technology, paving the way for practical and efficient wearable ALS-based activity recognition systems with potential applications in healthcare, sports monitoring, and smart indoor environments.

[AI-93] A Unified Framework for Interpretable Transformers Using PDEs and Information Theory

Link: https://arxiv.org/abs/2408.09523
Authors: Yukun Zhang
Keywords: Partial Differential Equations, Information Bottleneck Theory, integrating Partial Differential, Bottleneck Theory, Information Flow Theory
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)

Abstract:This paper presents a novel unified theoretical framework for understanding Transformer architectures by integrating Partial Differential Equations (PDEs), Neural Information Flow Theory, and Information Bottleneck Theory. We model Transformer information dynamics as a continuous PDE process, encompassing diffusion, self-attention, and nonlinear residual components. Our comprehensive experiments across image and text modalities demonstrate that the PDE model effectively captures key aspects of Transformer behavior, achieving high similarity (cosine similarity 0.98) with Transformer attention distributions across all layers. While the model excels in replicating general information flow patterns, it shows limitations in fully capturing complex, non-linear transformations. This work provides crucial theoretical insights into Transformer mechanisms, offering a foundation for future optimizations in deep learning architectural design. We discuss the implications of our findings, potential applications in model interpretability and efficiency, and outline directions for enhancing PDE models to better mimic the intricate behaviors observed in Transformers, paving the way for more transparent and optimized AI systems.

[AI-94] A Logic for Policy Based Resource Exchanges in Multiagent Systems

Link: https://arxiv.org/abs/2408.09516
Authors: Lorenzo Ceragioli,Pierpaolo Degano,Letterio Galletta,Luca Viganò
Keywords: multiagent systems autonomous, systems autonomous agents, autonomous agents interact, collective goals, multiagent systems
Categories: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)

Abstract:In multiagent systems autonomous agents interact with each other to achieve individual and collective goals. Typical interactions concern negotiation and agreement on resource exchanges. Modeling and formalizing these agreements pose significant challenges, particularly in capturing the dynamic behaviour of agents, while ensuring that resources are correctly handled. Here, we propose exchange environments as a formal setting where agents specify and obey exchange policies, which are declarative statements about what resources they offer and what they require in return. Furthermore, we introduce a decidable extension of the computational fragment of linear logic as a fundamental tool for representing exchange environments and studying their dynamics in terms of provability.

[AI-95] Out-of-distribution generalization via composition: a lens through induction heads in Transformers

Link: https://arxiv.org/abs/2408.09503
Authors: Jiajun Song,Zhuoyan Xu,Yiqiao Zhong
Keywords: Large language models, OOD generalization, Large language, OOD, Large
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 41 pages, 25 figures

Abstract:Large language models (LLMs) such as GPT-4 sometimes appear to be creative, solving novel tasks often with a few demonstrations in the prompt. These tasks require the models to generalize on distributions different from those of the training data, which is known as out-of-distribution (OOD) generalization. Despite the tremendous success of LLMs, how they approach OOD generalization remains an open and underexplored question. We examine OOD generalization in settings where instances are generated according to hidden rules, including in-context learning with symbolic reasoning. Models are required to infer the hidden rules behind input prompts without any fine-tuning. We empirically examined the training dynamics of Transformers on a synthetic example and conducted extensive experiments on a variety of pretrained LLMs, focusing on a type of component known as induction heads. We found that OOD generalization and composition are tied together: models can learn rules by composing two self-attention layers, thereby achieving OOD generalization. Furthermore, a shared latent subspace in the embedding (or feature) space acts as a bridge for composition by aligning early layers and later layers, which we refer to as the common bridge representation hypothesis.

[AI-96] Beyond Local Views: Global State Inference with Diffusion Models for Cooperative Multi-Agent Reinforcement Learning

Link: https://arxiv.org/abs/2408.09501
Authors: Zhiwei Xu,Hangyu Mao,Nianmin Zhang,Xin Xin,Pengjie Ren,Dapeng Li,Bin Zhang,Guoliang Fan,Zhumin Chen,Changwei Wang,Jiangjin Yin
Keywords: observable multi-agent systems, partially observable multi-agent, local observations, partially observable, Diffusion Models
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: 15 pages, 12 figures

Abstract:In partially observable multi-agent systems, agents typically only have access to local observations. This severely hinders their ability to make precise decisions, particularly during decentralized execution. To alleviate this problem and inspired by image outpainting, we propose State Inference with Diffusion Models (SIDIFF), which uses diffusion models to reconstruct the original global state based solely on local observations. SIDIFF consists of a state generator and a state extractor, which allow agents to choose suitable actions by considering both the reconstructed global state and local observations. In addition, SIDIFF can be effortlessly incorporated into current multi-agent reinforcement learning algorithms to improve their performance. Finally, we evaluated SIDIFF on different experimental platforms, including Multi-Agent Battle City (MABC), a novel and flexible multi-agent reinforcement learning environment we developed. SIDIFF achieved desirable results and outperformed other popular algorithms.

[AI-97] Leveraging Invariant Principle for Heterophilic Graph Structure Distribution Shifts

Link: https://arxiv.org/abs/2408.09490
Authors: Jinluan Yang,Zhengyu Chen,Teng Xiao,Wenqiao Zhang,Yong Lin,Kun Kuang
Keywords: Graph Neural Networks, Neural Networks, Heterophilic Graph Neural, shown promising results, Graph Neural
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 20 pages, 7 figures

Abstract:Heterophilic Graph Neural Networks (HGNNs) have shown promising results for semi-supervised learning tasks on graphs. Notably, most real-world heterophilic graphs are composed of a mixture of nodes with different neighbor patterns, exhibiting local node-level homophilic and heterophilic structures. However, existing works are only devoted to designing better HGNN backbones or architectures for node classification tasks on heterophilic and homophilic graph benchmarks simultaneously, and their analyses of HGNN performance with respect to nodes are only based on the determined data distribution without exploring the effect caused by this structural difference between training and testing nodes. How to learn invariant node representations on heterophilic graphs to handle this structure difference or distribution shifts remains unexplored. In this paper, we first discuss the limitations of previous graph-based invariant learning methods from the perspective of data augmentation. Then, we propose HEI, a framework capable of generating invariant node representations through incorporating heterophily information to infer latent environments without augmentation, which are then used for invariant prediction, under heterophilic graph structure distribution shifts. We theoretically show that our proposed method can achieve guaranteed performance under heterophilic graph structure distribution shifts. Extensive experiments on various benchmarks and backbones also demonstrate the effectiveness of our method compared with existing state-of-the-art baselines.

[AI-98] REFINE-LM: Mitigating Language Model Stereotypes via Reinforcement Learning

Link: https://arxiv.org/abs/2408.09489
Authors: Rameez Qureshi, Naïm Es-Sebbani, Luis Galárraga, Yvette Graham, Miguel Couceiro, Zied Bouraoui
Keywords (EN): unintended bias, biases, language models, significant concern, models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:With the introduction of (large) language models, there has been significant concern about the unintended bias such models may inherit from their training data. A number of studies have shown that such models propagate gender stereotypes, as well as geographical and racial bias, among other biases. While existing works tackle this issue by preprocessing data and debiasing embeddings, the proposed methods require a lot of computational resources and annotation effort while being limited to certain types of biases. To address these issues, we introduce REFINE-LM, a debiasing method that uses reinforcement learning to handle different types of biases without any fine-tuning. By training a simple model on top of the word probability distribution of a LM, our bias agnostic reinforcement learning method enables model debiasing without human annotations or significant computational resources. Experiments conducted on a wide range of models, including several LMs, show that our method (i) significantly reduces stereotypical biases while preserving LMs performance; (ii) is applicable to different types of biases, generalizing across contexts such as gender, ethnicity, religion, and nationality-based biases; and (iii) it is not expensive to train.

[AI-99] PanoSent: A Panoptic Sextuple Extraction Benchmark for Multimodal Conversational Aspect-based Sentiment Analysis ACM-MM2024

Link: https://arxiv.org/abs/2408.09481
Authors: Meng Luo, Hao Fei, Bobo Li, Shengqiong Wu, Qian Liu, Soujanya Poria, Erik Cambria, Mong-Li Lee, Wynne Hsu
Keywords (EN): Aspect-based Sentiment Analysis, existing Aspect-based Sentiment, holistic research target, research target seamlessly, target seamlessly integrating
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by ACM MM 2024 (Oral)

Abstract:While existing Aspect-based Sentiment Analysis (ABSA) has received extensive effort and advancement, there are still gaps in defining a more holistic research target seamlessly integrating multimodality, conversation context, fine-granularity, and also covering the changing sentiment dynamics as well as cognitive causal rationales. This paper bridges the gaps by introducing a multimodal conversational ABSA, where two novel subtasks are proposed: 1) Panoptic Sentiment Sextuple Extraction, panoramically recognizing holder, target, aspect, opinion, sentiment, rationale from multi-turn multi-party multimodal dialogue. 2) Sentiment Flipping Analysis, detecting the dynamic sentiment transformation throughout the conversation with the causal reasons. To benchmark the tasks, we construct PanoSent, a dataset annotated both manually and automatically, featuring high quality, large scale, multimodality, multilingualism, multi-scenarios, and covering both implicit and explicit sentiment elements. To effectively address the tasks, we devise a novel Chain-of-Sentiment reasoning framework, together with a novel multimodal large language model (namely Sentica) and a paraphrase-based verification mechanism. Extensive evaluations demonstrate the superiority of our methods over strong baselines, validating the efficacy of all our proposed methods. The work is expected to open up a new era for the ABSA community, and thus all our codes and data are open at this https URL

[AI-100] MedMAP: Promoting Incomplete Multi-modal Brain Tumor Segmentation with Alignment

Link: https://arxiv.org/abs/2408.09465
Authors: Tianyi Liu, Zhaorui Tan, Muyin Chen, Xi Yang, Haochuan Jiang, Kaizhu Huang
Keywords (EN): magnetic resonance imaging, Brain tumor segmentation, multiple magnetic resonance, Brain tumor, resonance imaging
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Brain tumor segmentation is often based on multiple magnetic resonance imaging (MRI). However, in clinical practice, certain modalities of MRI may be missing, which presents a more difficult scenario. To cope with this challenge, Knowledge Distillation, Domain Adaption, and Shared Latent Space have emerged as commonly promising strategies. However, recent efforts typically overlook the modality gaps and thus fail to learn important invariant feature representations across different modalities. Such drawback consequently leads to limited performance for missing modality models. To ameliorate these problems, pre-trained models are used in natural visual segmentation tasks to minimize the gaps. However, promising pre-trained models are often unavailable in medical image segmentation tasks. Along this line, in this paper, we propose a novel paradigm that aligns latent features of involved modalities to a well-defined distribution anchor as the substitution of the pre-trained model. As a major contribution, we prove that our novel training paradigm ensures a tight evidence lower bound, thus theoretically certifying its effectiveness. Extensive experiments on different backbones validate that the proposed paradigm can enable invariant feature representations and produce models with narrowed modality gaps. Models with our alignment paradigm show their superior performance on both BraTS2018 and BraTS2020 datasets.

[AI-101] In-Memory Learning Automata Architecture using Y-Flash Cell

Link: https://arxiv.org/abs/2408.09456
Authors: Omar Ghazal, Tian Lan, Shalman Ojukwu, Komal Krishnamurthy, Alex Yakovlev, Rishad Shafik
Keywords (EN): faces significant challenges, significant challenges due, frequent data transfer, architectures faces significant, faces significant
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)

Abstract:The modern implementation of machine learning architectures faces significant challenges due to frequent data transfer between memory and processing units. In-memory computing, primarily through memristor-based analog computing, offers a promising solution to overcome this von Neumann bottleneck. In this technology, data processing and storage are located inside the memory. Here, we introduce a novel approach that utilizes floating-gate Y-Flash memristive devices manufactured with a standard 180 nm CMOS process. These devices offer attractive features, including analog tunability and moderate device-to-device variation; such characteristics are essential for reliable decision-making in ML applications. This paper uses a new machine learning algorithm, the Tsetlin Machine (TM), for in-memory processing architecture. The TM’s learning element, Automaton, is mapped into a single Y-Flash cell, where the Automaton’s range is transferred into the Y-Flash’s conductance scope. Through comprehensive simulations, the proposed hardware implementation of the learning automata, particularly for Tsetlin machines, has demonstrated enhanced scalability and on-edge learning capabilities.
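
To make the state-to-conductance mapping described above concrete, here is a minimal sketch of a standard two-action Tsetlin automaton whose integer state is mapped onto a conductance range. The linear mapping, the class and function names, and the `g_min`/`g_max` values are illustrative assumptions; the abstract does not specify how the paper performs the mapping.

```python
class TsetlinAutomaton:
    """Two-action Tsetlin automaton with states 1..2n.

    States 1..n select action 0, states n+1..2n select action 1.
    """

    def __init__(self, n_states_per_action=4):
        self.n = n_states_per_action
        self.state = self.n  # start at the decision boundary, on the action-0 side

    def action(self):
        return 0 if self.state <= self.n else 1

    def reward(self):
        # Reinforce the current action: move away from the boundary (saturating).
        if self.action() == 0:
            self.state = max(1, self.state - 1)
        else:
            self.state = min(2 * self.n, self.state + 1)

    def penalize(self):
        # Weaken the current action: move one step toward (or across) the boundary.
        self.state += 1 if self.action() == 0 else -1


def state_to_conductance(state, n_states, g_min, g_max):
    """Assumed linear map from state index 1..n_states to [g_min, g_max]."""
    return g_min + (state - 1) * (g_max - g_min) / (n_states - 1)


ta = TsetlinAutomaton(n_states_per_action=4)
ta.penalize()  # crosses the boundary, so the automaton now selects action 1
print(ta.action(), state_to_conductance(ta.state, 8, g_min=1.0, g_max=8.0))
```

Under this assumed scheme a single analog cell holds the whole automaton, and reading the action amounts to comparing the conductance against the midpoint of the range.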

[AI-102] GraphSPNs: Sum-Product Networks Benefit From Canonical Orderings

Link: https://arxiv.org/abs/2408.09451
Authors: Milan Papež, Martin Rektoris, Václav Šmídl, Tomáš Pevný
Keywords (EN): capturing complex probability, complex probability distributions, Deep generative, recently made, made a remarkable
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Deep generative models have recently made remarkable progress in capturing complex probability distributions over graphs. However, they are intractable and thus unable to answer even the most basic probabilistic inference queries without resorting to approximations. Therefore, we propose graph sum-product networks (GraphSPNs), a tractable deep generative model which provides exact and efficient inference over (arbitrary parts of) graphs. We investigate different principles to make SPNs permutation invariant. We demonstrate that GraphSPNs are able to (conditionally) generate novel and chemically valid molecular graphs, being competitive with, and sometimes even better than, existing intractable models. We find that (Graph)SPNs benefit from ensuring permutation invariance via canonical ordering.

[AI-103] Parallel Sampling via Counting

Link: https://arxiv.org/abs/2408.09442
Authors: Nima Anari, Ruiquan Gao, Aviad Rubinstein
Keywords (EN): product space, sigma, parallelization to speed, autoregressive models, arbitrary distribution
Categories: Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Probability (math.PR)

Abstract:We show how to use parallelization to speed up sampling from an arbitrary distribution $\mu$ on a product space $[q]^n$, given oracle access to counting queries: $\mathbb{P}_{X\sim \mu}[X_S=\sigma_S]$ for any $S\subseteq [n]$ and $\sigma_S \in [q]^S$. Our algorithm takes $O(n^{2/3}\cdot \operatorname{polylog}(n,q))$ parallel time, to the best of our knowledge, the first sublinear in $n$ runtime for arbitrary distributions. Our results have implications for sampling in autoregressive models. Our algorithm directly works with an equivalent oracle that answers conditional marginal queries $\mathbb{P}_{X\sim \mu}[X_i=\sigma_i \mid X_S=\sigma_S]$, whose role is played by a trained neural network in autoregressive models. This suggests a roughly $n^{1/3}$-factor speedup is possible for sampling in any-order autoregressive models. We complement our positive result by showing a lower bound of $\widetilde{\Omega}(n^{1/3})$ for the runtime of any parallel sampling algorithm making at most $\operatorname{poly}(n)$ queries to the counting oracle, even for $q=2$.
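
For context, the sketch below shows the sequential baseline that the abstract's result improves on: sampling one coordinate per round from a conditional marginal oracle, which takes n rounds. It is not the paper's parallel algorithm. The brute-force oracle built from an explicit table `mu` and all function names are hypothetical, chosen only to make the example self-contained.

```python
import random


def make_conditional_oracle(mu, q):
    """Build a toy conditional-marginal oracle from an explicit distribution.

    mu maps n-tuples over range(q) to probabilities. The returned function
    oracle(i, partial) gives [P(X_i = v | X_S = partial) for v in range(q)],
    where partial is a dict {index: value} of already fixed coordinates.
    """
    def oracle(i, partial):
        weights = [0.0] * q
        for x, p in mu.items():
            if all(x[j] == v for j, v in partial.items()):
                weights[x[i]] += p
        total = sum(weights)
        return [w / total for w in weights]
    return oracle


def sequential_sample(oracle, n, q, rng):
    """Sample X ~ mu coordinate by coordinate: n adaptive oracle rounds."""
    partial = {}
    for i in range(n):
        probs = oracle(i, partial)
        partial[i] = rng.choices(range(q), weights=probs)[0]
    return tuple(partial[i] for i in range(n))


# Toy distribution on [2]^2 in which the two coordinates always agree.
mu = {(0, 0): 0.5, (1, 1): 0.5}
oracle = make_conditional_oracle(mu, q=2)
print(sequential_sample(oracle, n=2, q=2, rng=random.Random(0)))
```

Each round depends on the values fixed in earlier rounds, which is why the naive procedure cannot be parallelized directly; the abstract's contribution is reducing the number of adaptive rounds below n.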

[AI-104] Towards Boosting LLMs-driven Relevance Modeling with Progressive Retrieved Behavior-augmented Prompting

Link: https://arxiv.org/abs/2408.09439
Authors: Zeyuan Chen, Haiyan Wu, Kaixin Wu, Wei Chen, Mingjie Zhong, Jia Xu, Zhongyi Liu, Wei Zhang
Keywords (EN): enhancing user experience, Relevance modeling, Relevance, critical component, component for enhancing
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract:Relevance modeling is a critical component for enhancing user experience in search engines, with the primary objective of identifying items that align with users’ queries. Traditional models only rely on the semantic congruence between queries and items to ascertain relevance. However, this approach represents merely one aspect of the relevance judgement, and is insufficient in isolation. Even powerful Large Language Models (LLMs) still cannot accurately judge the relevance of a query and an item from a semantic perspective. To augment LLMs-driven relevance modeling, this study proposes leveraging user interactions recorded in search logs to yield insights into users’ implicit search intentions. The challenge lies in the effective prompting of LLMs to capture dynamic search intentions, which poses several obstacles in real-world relevance scenarios, i.e., the absence of domain-specific knowledge, the inadequacy of an isolated prompt, and the prohibitive costs associated with deploying LLMs. In response, we propose ProRBP, a novel Progressive Retrieved Behavior-augmented Prompting framework for integrating search scenario-oriented knowledge with LLMs effectively. Specifically, we perform the user-driven behavior neighbors retrieval from the daily search logs to obtain domain-specific knowledge in time, retrieving candidates that users consider to meet their expectations. Then, we guide LLMs for relevance modeling by employing advanced prompting techniques that progressively improve the outputs of the LLMs, followed by a progressive aggregation with comprehensive consideration of diverse aspects. For online serving, we have developed an industrial application framework tailored for the deployment of LLMs in relevance modeling. Experiments on real-world industry data and online A/B testing demonstrate our proposal achieves promising performance.

[AI-105] Enhancing Modal Fusion by Alignment and Label Matching for Multimodal Emotion Recognition INTERSPEECH2024

Link: https://arxiv.org/abs/2408.09438
Authors: Qifei Li, Yingming Gao, Yuhua Wen, Cong Wang, Ya Li
Keywords (EN): MER framework based, multimodal emotion recognition, MER framework, inter-modal information fusion, performance arising
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: The paper has been accepted by INTERSPEECH 2024

Abstract:To address the limitation in multimodal emotion recognition (MER) performance arising from inter-modal information fusion, we propose a novel MER framework based on multitask learning where fusion occurs after alignment, called Foal-Net. The framework is designed to enhance the effectiveness of modality fusion and includes two auxiliary tasks: audio-video emotion alignment (AVEL) and cross-modal emotion label matching (MEM). First, AVEL achieves alignment of emotional information in audio-video representations through contrastive learning. Then, a modal fusion network integrates the aligned features. Meanwhile, MEM assesses whether the emotions of the current sample pair are the same, providing assistance for modal information fusion and guiding the model to focus more on emotional information. The experimental results conducted on IEMOCAP corpus show that Foal-Net outperforms the state-of-the-art methods and emotion alignment is necessary before modal fusion.

[AI-106] HySem: A context length optimized LLM pipeline for unstructured tabular extraction

Link: https://arxiv.org/abs/2408.09434
Authors: Narayanan PP, Anantharaman Palacode Narayana Iyer
Keywords (EN): Regulatory compliance reporting, Regulatory compliance, compliance reporting, relies on detailed, unstructured format
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 9 pages, 4 tables, 3 figures, 1 algorithm

Abstract:Regulatory compliance reporting in the pharmaceutical industry relies on detailed tables, but these are often under-utilized beyond compliance due to their unstructured format and arbitrary content. Extracting and semantically representing tabular data is challenging due to diverse table presentations. Large Language Models (LLMs) demonstrate substantial potential for semantic representation, yet they encounter challenges related to accuracy and context size limitations, which are crucial considerations for the industry applications. We introduce HySem, a pipeline that employs a novel context length optimization technique to generate accurate semantic JSON representations from HTML tables. This approach utilizes a custom fine-tuned model specifically designed for cost- and privacy-sensitive small and medium pharmaceutical enterprises. Running on commodity hardware and leveraging open-source models, our auto-correcting agents rectify both syntax and semantic errors in LLM-generated content. HySem surpasses its peer open-source models in accuracy and provides competitive performance when benchmarked against OpenAI GPT-4o and effectively addresses context length limitations, which is a crucial factor for supporting larger tables.

[AI-107] FASST: Fast LLM-based Simultaneous Speech Translation

Link: https://arxiv.org/abs/2408.09430
Authors: Siqi Ouyang, Xi Xu, Chinmay Dandekar, Lei Li
Keywords (EN): generates text translation, SST, generates text, streaming speech input, speech
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Simultaneous speech translation (SST) takes streaming speech input and generates text translation on the fly. Existing methods either have high latency due to recomputation of input representations, or fall behind offline ST in translation quality. In this paper, we propose FASST, a fast large language model based method for streaming speech translation. We propose blockwise-causal speech encoding and consistency mask, so that streaming speech input can be encoded incrementally without recomputation. Furthermore, we develop a two-stage training strategy to optimize FASST for simultaneous inference. We evaluate FASST and multiple strong prior models on the MuST-C dataset. Experiment results show that FASST achieves the best quality-latency trade-off. It outperforms the previous best model by an average of 1.5 BLEU under the same latency for English to Spanish translation.

[AI-108] A Robust Algorithm for Contactless Fingerprint Enhancement and Matching

Link: https://arxiv.org/abs/2408.09426
Authors: Mahrukh Siddiqui, Shahzaib Iqbal, Bandar AlShammari, Bandar Alhaqbani, Tariq M. Khan, Imran Razzak
Keywords (EN): elastic deformation caused, contactless fingerprint images, contact fingerprint images, fingerprint images, contactless fingerprint
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Compared to contact fingerprint images, contactless fingerprint images exhibit four distinct characteristics: (1) they contain less noise; (2) they have fewer discontinuities in ridge patterns; (3) the ridge-valley pattern is less distinct; and (4) they pose an interoperability problem, as they lack the elastic deformation caused by pressing the finger against the capture device. These properties present significant challenges for the enhancement of contactless fingerprint images. In this study, we propose a novel contactless fingerprint identification solution that enhances the accuracy of minutiae detection through improved frequency estimation and a new region-quality-based minutia extraction algorithm. In addition, we introduce an efficient and highly accurate minutiae-based encoding and matching algorithm. We validate the effectiveness of our approach through extensive experimental testing. Our method achieves a minimum Equal Error Rate (EER) of 2.84% on the PolyU contactless fingerprint dataset, demonstrating its superior performance compared to existing state-of-the-art techniques. The proposed fingerprint identification method exhibits notable precision and resilience, proving to be an effective and feasible solution for contactless fingerprint-based identification systems.
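
The Equal Error Rate quoted above is the operating point where the false accept rate equals the false reject rate. As a rough illustration of how such a figure is computed from matcher scores, the sketch below sweeps thresholds over toy similarity scores. The function name, the score convention (higher means more likely the same finger), and the sample scores are assumptions for illustration, not data or code from the paper.

```python
def equal_error_rate(genuine, impostor):
    """Approximate the EER by sweeping every observed score as a threshold.

    genuine:  similarity scores for same-finger comparisons
    impostor: similarity scores for different-finger comparisons
    A comparison is accepted when its score is >= the threshold.
    """
    best_gap, eer = float("inf"), None
    for t in sorted(set(genuine) | set(impostor)):
        frr = sum(s < t for s in genuine) / len(genuine)      # genuine pairs rejected
        far = sum(s >= t for s in impostor) / len(impostor)   # impostor pairs accepted
        gap = abs(far - frr)
        if gap < best_gap:
            # Report the midpoint of FAR and FRR where the two curves are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


genuine_scores = [0.9, 0.8, 0.7, 0.4]
impostor_scores = [0.1, 0.2, 0.3, 0.6]
print(equal_error_rate(genuine_scores, impostor_scores))
```

With a finite score set the two error curves rarely cross exactly, so taking the closest threshold is one common approximation; interpolating between adjacent thresholds is another.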

[AI-109] Distinguish Confusion in Legal Judgment Prediction via Revised Relation Knowledge

Link: https://arxiv.org/abs/2408.09422
Authors: Nuo Xu, Pinghui Wang, Junzhou Zhao, Feiyang Sun, Lin Lan, Jing Tao, Li Pan, Xiaohong Guan
Keywords (EN): Legal Judgment Prediction, case judgment results, Legal Judgment, Judgment Prediction, judgment results based
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by ACM TOIS

Abstract:Legal Judgment Prediction (LJP) aims to automatically predict a law case’s judgment results based on the text description of its facts. In practice, the confusing law articles (or charges) problem frequently occurs, reflecting that the law cases applicable to similar articles (or charges) tend to be misjudged. Although some recent works based on prior knowledge solve this issue well, they ignore that confusion also occurs between law articles with a high posterior semantic similarity due to the data imbalance problem instead of only between the prior highly similar ones, which is this work’s further finding. This paper proposes an end-to-end model named D-LADAN to solve the above challenges. On the one hand, D-LADAN constructs a graph among law articles based on their text definition and proposes a graph distillation operation (GDO) to distinguish the ones with a high prior semantic similarity. On the other hand, D-LADAN presents a novel momentum-updated memory mechanism to dynamically sense the posterior similarity between law articles (or charges) and a weighted GDO to adaptively capture the distinctions for revising the inductive bias caused by the data imbalance problem. We perform extensive experiments to demonstrate that D-LADAN significantly outperforms state-of-the-art methods in accuracy and robustness.

[AI-110] Challenges and Responses in the Practice of Large Language Models

Link: https://arxiv.org/abs/2408.09416
Authors: Hongyin Zhu
Keywords (EN): carefully summarizes extensive, paper carefully summarizes, covering multiple dimensions, academic research, high-profile AI field
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:This paper carefully summarizes extensive and profound questions from all walks of life, focusing on the current high-profile AI field, covering multiple dimensions such as industry trends, academic research, technological innovation and business applications. This paper meticulously curates questions that are both thought-provoking and practically relevant, providing nuanced and insightful answers to each. To facilitate readers’ understanding and reference, this paper specifically classifies and organizes these questions systematically and meticulously from the five core dimensions of computing power infrastructure, software architecture, data resources, application scenarios, and brain science. This work aims to provide readers with a comprehensive, in-depth and cutting-edge AI knowledge framework to help people from all walks of life grasp the pulse of AI development, stimulate innovative thinking, and promote industrial progress.

[AI-111] BEHRNOULLI: A Binary EHR Data-Oriented Medication Recommendation System

Link: https://arxiv.org/abs/2408.09410
Authors: Xihao Piao, Pei Gao, Zheng Chen, Lingwei Zhu, Yasuko Matsubara, Yasushi Sakurai
Keywords (EN): binary EHR medical, medical event outcomes, binary EHR, binary medical event, EHR data
Categories: Artificial Intelligence (cs.AI)

Abstract:The medical community believes binary medical event outcomes in EHR data contain sufficient information for making a sensible recommendation. However, there are two challenges to effectively utilizing such data: (1) modeling the relationship between massive {0,1} event outcomes is difficult, even with expert knowledge; (2) in practice, learning can be stalled by the binary values since the equally important 0 entries propagate no learning signals. Currently, there is a large gap between the assumed sufficient information and the reality that no promising results have been shown by utilizing solely the binary data: visiting or secondary information is often necessary to reach acceptable performance. In this paper, we attempt to build the first successful binary EHR data-oriented drug recommendation system by tackling the two difficulties, making sensible drug recommendations solely using the binary EHR medical records. To this end, we take a statistical perspective to view the EHR data as a sample from its cohorts and transform them into continuous Bernoulli probabilities. The transformed entries not only model a deterministic binary event with a distribution but also allow reflecting event-event relationship by conditional probability. A graph neural network is learned on top of the transformation. It captures event-event correlations while emphasizing event-to-patient features. Extensive results demonstrate that the proposed method achieves state-of-the-art performance on large-scale databases, outperforming baseline methods that use secondary information by a large margin. The source code is available at this https URL
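
One plausible reading of "reflecting event-event relationship by conditional probability" is an empirical estimate over the patient cohort, sketched below. This is an assumption made for illustration: the abstract does not give the paper's actual transformation, and the function name and toy records are hypothetical.

```python
def conditional_probabilities(records):
    """Estimate P(event j = 1 | event i = 1) from binary patient records.

    records: list of equal-length 0/1 lists, one row per patient and one
    column per medical event. Returns a square matrix P with P[i][j] equal
    to the fraction of patients with event i who also have event j.
    """
    n_events = len(records[0])
    count_i = [sum(r[i] for r in records) for i in range(n_events)]
    P = [[0.0] * n_events for _ in range(n_events)]
    for i in range(n_events):
        for j in range(n_events):
            both = sum(1 for r in records if r[i] and r[j])
            # Events never observed in the cohort get probability 0 by convention.
            P[i][j] = both / count_i[i] if count_i[i] else 0.0
    return P


cohort = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
    [1, 1, 1],
]
for row in conditional_probabilities(cohort):
    print([round(p, 3) for p in row])
```

Unlike the raw 0/1 entries, these values are continuous and nonzero for co-occurring events, which illustrates why a probabilistic view can carry signal where a binary 0 would not.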

[AI-112] Comparison between the Structures of Word Co-occurrence and Word Similarity Networks for Ill-formed and Well-formed Texts in Taiwan Mandarin

Link: https://arxiv.org/abs/2408.09404
Authors: Po-Hsuan Huang, Hsuan-Lei Shao
Keywords (EN): word co-occurrence networks, word co-occurrence, co-occurrence networks, co-occurrence networks built, co-occurrence
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 4 pages, 1 figure, 5 tables

Abstract:The study of word co-occurrence networks has attracted the attention of researchers due to their potential significance as well as applications. Understanding the structure of word co-occurrence networks is therefore important to fully realize their significance and usages. In past studies, word co-occurrence networks built on well-formed texts have been found to possess certain characteristics, including being small-world, following a two-regime power law distribution, and being generally disassortative. On the flip side, past studies have found that word co-occurrence networks built from ill-formed texts such as microblog posts may behave differently from those built from well-formed documents. While both kinds of word co-occurrence networks are small-world and disassortative, word co-occurrence networks built from ill-formed texts are scale-free and follow the power law distribution instead of the two-regime power law distribution. However, since past studies on the behavior of word co-occurrence networks built from ill-formed texts only investigated English, the universality of such characteristics remains to be seen among different languages. In addition, it is yet to be investigated whether there are similarities or differences between word co-occurrence networks and other potentially comparable networks. This study therefore investigates and compares the structure of word co-occurrence networks and word similarity networks based on Taiwan Mandarin ill-formed internet forum posts, compares them with those built from well-formed judicial judgments, and seeks to find out whether the three aforementioned properties (scale-free, small-world, and disassortative) for ill-formed and well-formed texts are universal among different languages and between word co-occurrence and word similarity networks.

[AI-113] Obtaining Optimal Spiking Neural Network in Sequence Learning via CRNN-SNN Conversion

Link: https://arxiv.org/abs/2408.09403
Authors: Jiahao Su, Kang You, Zekai Xu, Weizhi Xu, Zhezhi He
Keywords (EN): energy-efficient neuromorphic chips, Spiking neural networks, conventional artificial neural, rich neural dynamics, artificial neural networks
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by 33rd International Conference on Artificial Neural Networks

Abstract:Spiking neural networks (SNNs) are becoming a promising alternative to conventional artificial neural networks (ANNs) due to their rich neural dynamics and the implementation of energy-efficient neuromorphic chips. However, the non-differentiable binary communication mechanism makes it hard for SNNs to converge to ANN-level accuracy. When SNNs encounter sequence learning, the situation becomes worse due to the difficulties in modeling long-range dependencies. To overcome these difficulties, researchers developed variants of LIF neurons and different surrogate gradients but still failed to obtain good results when the sequence became longer (e.g., 500). Unlike them, we obtain an optimal SNN in sequence learning by directly mapping parameters from a quantized CRNN. We design two sub-pipelines to support the end-to-end conversion of different structures in neural networks, which are called CNN-Morph (CNN $\rightarrow$ QCNN $\rightarrow$ BIFSNN) and RNN-Morph (RNN $\rightarrow$ QRNN $\rightarrow$ RBIFSNN). Using conversion pipelines and the s-analog encoding method, the conversion error of our framework is zero. Furthermore, we give the theoretical and experimental demonstration of the lossless CRNN-SNN conversion. Our results show the effectiveness of our method over short and long timescales tasks compared with the state-of-the-art learning- and conversion-based methods. We reach the highest accuracy of 99.16% (0.46 $\uparrow$) on S-MNIST and 94.95% (3.95 $\uparrow$) on PS-MNIST (sequence length of 784), respectively, and the lowest loss of 0.057 (0.013 $\downarrow$) within 8 time-steps on the collision avoidance dataset.

[AI-114] Federated Graph Learning with Structure Proxy Alignment KDD2024

Link: https://arxiv.org/abs/2408.09393
Authors: Xingbo Fu, Zihan Chen, Binchi Zhang, Chen Chen, Jundong Li
Keywords (EN): Federated Graph Learning, financial fraud detection, multiple data owners, generic Federated Learning, graph data distributed
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: Accepted by KDD 2024

Abstract:Federated Graph Learning (FGL) aims to learn graph learning models over graph data distributed in multiple data owners, which has been applied in various applications such as social recommendation and financial fraud detection. Inherited from generic Federated Learning (FL), FGL similarly has the data heterogeneity issue where the label distribution may vary significantly for distributed graph data across clients. For instance, a client can have the majority of nodes from a class, while another client may have only a few nodes from the same class. This issue results in divergent local objectives and impairs FGL convergence for node-level tasks, especially for node classification. Moreover, FGL also encounters a unique challenge for the node classification task: the nodes from a minority class in a client are more likely to have biased neighboring information, which prevents FGL from learning expressive node embeddings with Graph Neural Networks (GNNs). To grapple with the challenge, we propose FedSpray, a novel FGL framework that learns local class-wise structure proxies in the latent space and aligns them to obtain global structure proxies in the server. Our goal is to obtain the aligned structure proxies that can serve as reliable, unbiased neighboring information for node classification. To achieve this, FedSpray trains a global feature-structure encoder and generates unbiased soft targets with structure proxies to regularize local training of GNN models in a personalized way. We conduct extensive experiments over four datasets, and experiment results validate the superiority of FedSpray compared with other baselines. Our code is available at this https URL.

[AI-115] Game Development as Human-LLM Interaction

Link: https://arxiv.org/abs/2408.09386
Authors: Jiale Hong, Hongqiu Wu, Hai Zhao
Keywords (EN): highly specialized task, complex game engine, complex programming languages, Interaction-driven Game Engine, game engine powered
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

Abstract:Game development is a highly specialized task that relies on a complex game engine powered by complex programming languages, preventing many gaming enthusiasts from handling it. This paper introduces the Interaction-driven Game Engine (IGE) powered by LLM, which allows everyone to develop a custom game using natural language through Human-LLM interaction. To enable an LLM to function as an IGE, we instruct it to perform the following processes in each turn: (1) $P_{script}$: configure the game script segment based on the user’s input; (2) $P_{code}$: generate the corresponding code snippet based on the game script segment; (3) $P_{utter}$: interact with the user, including guidance and feedback. We propose a data synthesis pipeline based on the LLM to generate game script-code pairs and interactions from a few manually crafted seed data. We propose a three-stage progressive training strategy to transfer the dialogue-based LLM to our IGE smoothly. We construct an IGE for poker games as a case study and comprehensively evaluate it from two perspectives: interaction quality and code correctness. The code and data are available at this https URL.

[AI-116] Offline RLHF Methods Need More Accurate Supervision Signals

Link: https://arxiv.org/abs/2408.09385
Authors: Shiqi Wang, Zhengze Zhang, Rui Zhao, Fei Tan, Cam Tu Nguyen
Keywords (EN): Large Language Models, Large Language, advances in Large, Language Models, increasingly important
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: under review

Abstract:With the rapid advances in Large Language Models (LLMs), aligning LLMs with human preferences has become increasingly important. Although Reinforcement Learning with Human Feedback (RLHF) proves effective, it is complicated and highly resource-intensive. As such, offline RLHF has been introduced as an alternative solution, which directly optimizes LLMs with ranking losses on a fixed preference dataset. Current offline RLHF only captures the "ordinal relationship" between responses, overlooking the crucial aspect of "how much" one is preferred over the others. To address this issue, we propose a simple yet effective solution called Reward Difference Optimization (RDO). Specifically, we introduce reward difference coefficients to reweigh sample pairs in offline RLHF. We then develop a difference model involving rich interactions between a pair of responses for predicting these difference coefficients. Experiments with 7B LLMs on the HH and TL;DR datasets substantiate the effectiveness of our method in both automatic metrics and human evaluation, thereby highlighting its potential for aligning LLMs with human intent and values.
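
The idea of reweighing sample pairs can be sketched with a generic pairwise logistic ranking loss in which each pair carries its own coefficient. This is an assumption made for illustration: the abstract does not state which ranking loss RDO modifies or how the coefficients enter it, and the function name and example numbers are hypothetical.

```python
import math


def weighted_ranking_loss(pairs):
    """Mean pairwise logistic loss with a per-pair weight.

    pairs: list of (score_chosen, score_rejected, coeff) tuples, where the
    scores are the model's scalar scores for the preferred and dispreferred
    responses and coeff is the pair's reward difference coefficient.
    """
    total = 0.0
    for score_chosen, score_rejected, coeff in pairs:
        margin = score_chosen - score_rejected
        # log(1 + exp(-margin)) equals -log(sigmoid(margin)).
        total += coeff * math.log1p(math.exp(-margin))
    return total / len(pairs)


# Two pairs with the same margin: the strongly preferred pair (coeff 2.0)
# contributes twice as much to the loss as the weakly preferred one.
print(weighted_ranking_loss([(0.2, 0.0, 2.0), (0.2, 0.0, 1.0)]))
```

With all coefficients set to 1 the function reduces to an ordinary unweighted ranking loss, which corresponds to the purely ordinal setting the abstract criticizes.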

[AI-117] VRCopilot: Authoring 3D Layouts with Generative AI Models in VR

Link: https://arxiv.org/abs/2408.09382
Authors: Lei Zhang,Jin Pan,Jacob Gettig,Steve Oney,Anhong Guo
Keywords: Virtual Reality, manipulation in Virtual, Immersive authoring, scenes via direct, intuitive medium
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments: UIST 2024

Abstract:Immersive authoring provides an intuitive medium for users to create 3D scenes via direct manipulation in Virtual Reality (VR). Recent advances in generative AI have enabled the automatic creation of realistic 3D layouts. However, it is unclear how capabilities of generative AI can be used in immersive authoring to support fluid interactions, user agency, and creativity. We introduce VRCopilot, a mixed-initiative system that integrates pre-trained generative AI models into immersive authoring to facilitate human-AI co-creation in VR. VRCopilot presents multimodal interactions to support rapid prototyping and iterations with AI, and intermediate representations such as wireframes to augment user controllability over the created content. Through a series of user studies, we evaluated the potential and challenges in manual, scaffolded, and automatic creation in immersive authoring. We found that scaffolded creation using wireframes enhanced the user agency compared to automatic creation. We also found that manual creation via multimodal specification offers the highest sense of creativity and agency.

[AI-118] ELASTIC: Efficient Linear Attention for Sequential Interest Compression AAAI2025

Link: https://arxiv.org/abs/2408.09380
Authors: Jiaxin Deng,Shiyao Wang,Song Lu,Yinfeng Li,Xinchen Luo,Yuanjun Liu,Peixing Xu,Guorui Zhou
Keywords: models heavily rely, transformer attention mechanism, linear dispatcher attention, Efficient Linear Attention, dispatcher attention mechanism
Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: Submitted to AAAI 2025

Abstract:State-of-the-art sequential recommendation models heavily rely on the transformer’s attention mechanism. However, the quadratic computational and memory complexities of self-attention have limited its scalability for modeling users’ long-range behaviour sequences. To address this problem, we propose ELASTIC, an Efficient Linear Attention for SequenTial Interest Compression, requiring only linear time complexity and decoupling model capacity from computational cost. Specifically, ELASTIC introduces fixed-length interest experts with a linear dispatcher attention mechanism, which compresses long-term behaviour sequences into a significantly more compact representation, reducing GPU memory usage by up to 90% with a 2.7x inference speed-up. The proposed linear dispatcher attention mechanism significantly reduces the quadratic complexity and makes the model feasible for adequately modeling extremely long sequences. Moreover, in order to retain the capacity for modeling various user interests, ELASTIC initializes a vast learnable interest memory bank and sparsely retrieves compressed user interests from the memory with negligible computational overhead. The proposed interest memory retrieval technique significantly expands the cardinality of the available interest space while keeping the same computational cost, thereby striking a trade-off between recommendation accuracy and efficiency. To validate the effectiveness of our proposed ELASTIC, we conduct extensive experiments on various public datasets and compare it with several strong sequential recommenders. Experimental results demonstrate that ELASTIC consistently outperforms baselines by a significant margin and also highlight the computational efficiency of ELASTIC when modeling long sequences. We will make our implementation code publicly available.
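
To see why a fixed number of experts yields linear cost, here is a rough sketch of the compression step only. It assumes plain scaled dot-product cross-attention from k expert vectors over n sequence items; the function names and this exact form are illustrative guesses, not the dispatcher mechanism as defined in the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def compress(experts, sequence):
    """Compress n item vectors into k vectors, one per expert.

    Each of the k experts attends over all n items, so the cost is
    O(k * n) dot products: linear in n when k is fixed.
    """
    dim = len(experts[0])
    compressed = []
    for expert in experts:
        scores = [
            sum(a * b for a, b in zip(expert, item)) / math.sqrt(dim)
            for item in sequence
        ]
        weights = softmax(scores)
        compressed.append(
            [sum(w * item[j] for w, item in zip(weights, sequence)) for j in range(dim)]
        )
    return compressed
```

The output length depends only on the number of experts, so downstream layers see the same size input however long the behaviour sequence grows.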

[AI-119] Detecting the Undetectable: Combining Kolmogorov-Arnold Networks and MLP for AI-Generated Image Detection

Link: https://arxiv.org/abs/2408.09371
Authors: Taharim Rahman Anon,Jakaria Islam Emon
Keywords: artificial intelligence progresses, intelligence progresses, artificial intelligence, task of distinguishing, increasingly complicated
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 8 Pages, IEEE Transactions

Abstract:As artificial intelligence progresses, the task of distinguishing between real and AI-generated images is increasingly complicated by sophisticated generative models. This paper presents a novel detection framework adept at robustly identifying images produced by cutting-edge generative AI models, such as DALL-E 3, MidJourney, and Stable Diffusion 3. We introduce a comprehensive dataset, tailored to include images from these advanced generators, which serves as the foundation for extensive evaluation. We propose a classification system that integrates semantic image embeddings with a traditional Multilayer Perceptron (MLP). This baseline system is designed to effectively differentiate between real and AI-generated images under various challenging conditions. Enhancing this approach, we introduce a hybrid architecture that combines Kolmogorov-Arnold Networks (KAN) with the MLP. This hybrid model leverages the adaptive, high-resolution feature transformation capabilities of KAN, enabling our system to capture and analyze complex patterns in AI-generated images that are typically overlooked by conventional models. In out-of-distribution testing, our proposed model consistently outperformed the standard MLP across three out-of-distribution test datasets, demonstrating superior performance and robustness in classifying real images from AI-generated images with impressive F1 scores.

[AI-120] Concept Distillation from Strong to Weak Models via Hypotheses-to-Theories Prompting

Link: https://arxiv.org/abs/2408.09365
Authors: Emmanuel Aboah Boateng,Cassiano O. Becker,Nabiha Asghar,Kabir Walia,Ashwin Srinivasan,Ehi Nosakhare,Victor Dibia,Soundar Srinivasan
Keywords: Hand-crafting high quality, Hand-crafting high, high quality prompts, labor-intensive process, high quality
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 13 pages, 8 figures, conference

Abstract:Hand-crafting high quality prompts to optimize the performance of language models is a complicated and labor-intensive process. Furthermore, when migrating to newer, smaller, or weaker models (possibly due to latency or cost gains), prompts need to be updated to re-optimize the task performance. We propose Concept Distillation (CD), an automatic prompt optimization technique for enhancing weaker models on complex tasks. CD involves: (1) collecting mistakes made by weak models with a base prompt (initialization), (2) using a strong model to generate reasons for these mistakes and create rules/concepts for weak models (induction), and (3) filtering these rules based on validation set performance and integrating them into the base prompt (deduction/verification). We evaluated CD on NL2Code and mathematical reasoning tasks, observing significant performance boosts for small and weaker language models. Notably, Mistral-7B’s accuracy on Multi-Arith increased by 20%, and Phi-3-mini-3.8B’s accuracy on HumanEval rose by 34%. Compared to other automated methods, CD offers an effective, cost-efficient strategy for improving weak models’ performance on complex tasks and enables seamless workload migration across different language models without compromising performance.
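
The three CD steps listed in the abstract can be sketched as a small loop. Everything concrete here is an assumption for illustration: the `weak_model` and `strong_model` callables, the exact-match accuracy, the "Rule:" prompt format, and the greedy keep-if-better filter are hypothetical stand-ins, since the abstract does not give these details.

```python
def accuracy(prompt, data, weak_model):
    """Fraction of (input, expected) pairs the weak model answers exactly."""
    if not data:
        return 0.0
    correct = sum(1 for x, y in data if weak_model(prompt, x) == y)
    return correct / len(data)


def concept_distillation(base_prompt, train, val, weak_model, strong_model):
    """Return a prompt extended with rules that help on the validation set.

    weak_model(prompt, x) -> answer                  (hypothetical callable)
    strong_model(mistakes) -> list of rule strings   (hypothetical callable)
    """
    # (1) Initialization: collect the weak model's mistakes with the base prompt.
    mistakes = []
    for x, expected in train:
        answer = weak_model(base_prompt, x)
        if answer != expected:
            mistakes.append((x, expected, answer))

    # (2) Induction: the strong model turns mistakes into candidate rules.
    rules = strong_model(mistakes)

    # (3) Deduction/verification: keep a rule only if validation accuracy rises.
    prompt = base_prompt
    best = accuracy(prompt, val, weak_model)
    for rule in rules:
        candidate = prompt + "\nRule: " + rule
        score = accuracy(candidate, val, weak_model)
        if score > best:
            prompt, best = candidate, score
    return prompt
```

In practice both callables would wrap LLM API calls; plain functions make the control flow easy to test with stubs.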

[AI-121] Panorama Tomosynthesis from Head CBCT with Simulated Projection Geometry

Link: https://arxiv.org/abs/2408.09358
Authors: Anusree P.S.,Bikram Keshari Parida,Seong Yong Moon,Wonsang You
Keywords: Beam Computed Tomography, Cone Beam Computed, Cone Beam, Computed Tomography, Beam Computed
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 12 pages, 6 figures, 1 table, Journal submission planned

Abstract:Cone Beam Computed Tomography (CBCT) and Panoramic X-rays are the most commonly used imaging modalities in dental health care. CBCT can produce three-dimensional views of a patient’s head, providing clinicians with better diagnostic capability, whereas Panoramic X-ray can capture the entire maxillofacial region in a single image. If the CBCT is already available, it can be beneficial to synthesize a Panoramic X-ray, thereby avoiding an immediate additional scan and extra radiation exposure. Existing methods focus on delineating an approximate dental arch and creating orthogonal projections along this arch. However, no golden standard is available for such dental arch extractions, and this choice can affect the quality of synthesized X-rays. To avoid such issues, we propose a novel method for synthesizing Panoramic X-rays from diverse head CBCTs, employing a simulated projection geometry and dynamic rotation centers. Our method effectively synthesized panoramic views from CBCT, even for patients with missing or nonexistent teeth and in the presence of severe metal implants. Our results demonstrate that this method can generate high-quality panoramic images irrespective of the CBCT scanner geometry.

[AI-122] Meta-Learning Empowered Meta-Face: Personalized Speaking Style Adaptation for Audio-Driven 3D Talking Face Animation

Link: https://arxiv.org/abs/2408.09357
Authors: Xukun Zhou,Fengxin Li,Ziqiao Peng,Kejian Wu,Jun He,Biao Qin,Zhaoxin Fan,Hongyan Liu
Keywords: augmented reality applications, face animation, reality applications, speaking style adaptation, animation is increasingly
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Abstract:Audio-driven 3D face animation is increasingly vital in live streaming and augmented reality applications. While remarkable progress has been observed, most existing approaches are designed for specific individuals with predefined speaking styles, thus neglecting the adaptability to varied speaking styles. To address this limitation, this paper introduces MetaFace, a novel methodology meticulously crafted for speaking style adaptation. Grounded in the novel concept of meta-learning, MetaFace is composed of several key components: the Robust Meta Initialization Stage (RMIS) for fundamental speaking style adaptation, the Dynamic Relation Mining Neural Process (DRMN) for forging connections between observed and unobserved speaking styles, and the Low-rank Matrix Memory Reduction Approach to enhance the efficiency of model optimization as well as learning style details. Leveraging these novel designs, MetaFace not only significantly outperforms robust existing baselines but also establishes a new state-of-the-art, as substantiated by our experimental results.

[AI-123] E-CGL: An Efficient Continual Graph Learner

Link: https://arxiv.org/abs/2408.09350
Authors: Jianhao Guo,Zixuan Ni,Yun Zhu,Siliang Tang
Keywords: preserving previous knowledge, continual graph learning, continual graph, graph data, graph
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Continual learning has emerged as a crucial paradigm for learning from sequential data while preserving previous knowledge. In the realm of continual graph learning, where graphs continuously evolve based on streaming graph data, continual graph learning presents unique challenges that require adaptive and efficient graph learning methods in addition to the problem of catastrophic forgetting. The first challenge arises from the interdependencies between different graph data, where previous graphs can influence new data distributions. The second challenge lies in the efficiency concern when dealing with large graphs. To address these two problems, we propose an Efficient Continual Graph Learner (E-CGL) in this paper. We tackle the interdependencies issue by demonstrating the effectiveness of replay strategies and introducing a combined sampling strategy that considers both node importance and diversity. To overcome the limitation of efficiency, E-CGL leverages a simple yet effective MLP model that shares weights with a GCN during training, achieving acceleration by circumventing the computationally expensive message passing process. Our method comprehensively surpasses nine baselines on four graph continual learning datasets under two settings, while largely reducing catastrophic forgetting to an average of -1.1%. Additionally, E-CGL achieves an average of 15.83x training time acceleration and 4.89x inference time acceleration across the four datasets. These results indicate that E-CGL not only effectively manages the correlation between different graph data during continual training but also enhances the efficiency of continual learning on large graphs. The code is publicly available at this https URL.
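
The weight-sharing idea can be pictured with a single linear layer. This sketch assumes the standard GCN propagation form (normalized adjacency times features times weights) and omits activations; the function names and the one-layer setup are illustrative assumptions, not the E-CGL architecture.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer(adj_norm, features, weight):
    """GCN-style layer: aggregate neighbours (message passing), then project."""
    return matmul(adj_norm, matmul(features, weight))


def mlp_layer(features, weight):
    """MLP layer reusing the same weight, with no neighbour aggregation."""
    return matmul(features, weight)
```

Because both layers use the same `weight`, parameters updated through the cheap `mlp_layer` path can be plugged into `gcn_layer` unchanged; the two coincide exactly when `adj_norm` is the identity matrix.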

[AI-124] Characterizing and Evaluating the Reliability of LLMs against Jailbreak Attacks

Link: https://arxiv.org/abs/2408.09326
Authors: Kexin Chen,Yi Liu,Dongxia Wang,Jiaying Chen,Wenhai Wang
Keywords: Large Language Models, Large Language, notable societal impact, Language Models, societal impact
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)

Abstract:Large Language Models (LLMs) have increasingly become pivotal in content generation with notable societal impact. These models hold the potential to generate content that could be deemed harmful. Efforts to mitigate this risk include implementing safeguards to ensure LLMs adhere to social ethics. However, despite such measures, the phenomenon of “jailbreaking”, where carefully crafted prompts elicit harmful responses from models, persists as a significant challenge. Recognizing the continuous threat posed by jailbreaking tactics and their repercussions for the trustworthy use of LLMs, a rigorous assessment of the models’ robustness against such attacks is essential. This study introduces a comprehensive evaluation framework and conducts a large-scale empirical experiment to address this need. We concentrate on 10 cutting-edge jailbreak strategies across three categories, 1525 questions from 61 specific harmful categories, and 13 popular LLMs. We adopt multi-dimensional metrics such as Attack Success Rate (ASR), Toxicity Score, Fluency, Token Length, and Grammatical Errors to thoroughly assess the LLMs’ outputs under jailbreak. By normalizing and aggregating these metrics, we present a detailed reliability score for different LLMs, coupled with strategic recommendations to reduce their susceptibility to such vulnerabilities. Additionally, we explore the relationships among the models, attack strategies, and types of harmful content, as well as the correlations between the evaluation metrics, which proves the validity of our multifaceted evaluation framework. Our extensive experimental results demonstrate a lack of resilience among all tested LLMs against certain strategies, and highlight the need to concentrate on the reliability facets of LLMs. We believe our study can provide valuable insights into enhancing the security evaluation of LLMs against jailbreak within the domain.
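
One plausible way to turn several metrics into a single reliability score is min-max normalization followed by averaging. The equal weighting, the direction flags, and the function names below are assumptions for illustration; the abstract does not state the actual normalization or aggregation rule.

```python
def min_max(values):
    """Scale values to [0, 1]; a constant column maps to all zeros."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def reliability_scores(metrics, higher_is_worse):
    """Aggregate per-model metrics into one score in [0, 1] (higher is better).

    metrics         : {model_name: {metric_name: value}}
    higher_is_worse : {metric_name: bool}, e.g. True for attack success rate
    """
    models = list(metrics)
    names = list(higher_is_worse)
    totals = {model: 0.0 for model in models}
    for name in names:
        normed = min_max([metrics[model][name] for model in models])
        for model, value in zip(models, normed):
            # Flip metrics where a larger raw value means less reliable.
            totals[model] += (1.0 - value) if higher_is_worse[name] else value
    return {model: totals[model] / len(names) for model in models}
```

Normalizing per metric keeps quantities on very different scales, such as token length and a rate in [0, 1], from dominating the average.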

[AI-125] Learning Fair Invariant Representations under Covariate and Correlation Shifts Simultaneously CIKM2024

Link: https://arxiv.org/abs/2408.09312
Authors: Dong Li,Chen Zhao,Minglai Shao,Wenjun Wang
Keywords: invariant classifier, substantial and complex, complex challenge, challenge in machine, shifted test domains
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: CIKM 2024

Abstract:Achieving the generalization of an invariant classifier from training domains to shifted test domains while simultaneously considering model fairness is a substantial and complex challenge in machine learning. Existing methods address the problem of fairness-aware domain generalization, focusing on either covariate shift or correlation shift, but rarely consider both at the same time. In this paper, we introduce a novel approach that focuses on learning a fairness-aware domain-invariant predictor within a framework addressing both covariate and correlation shifts simultaneously, ensuring its generalization to unknown test domains inaccessible during training. In our approach, data are first disentangled into content and style factors in latent spaces. Furthermore, fairness-aware domain-invariant content representations can be learned by mitigating sensitive information and retaining as much other information as possible. Extensive empirical studies on benchmark datasets demonstrate that our approach surpasses state-of-the-art methods with respect to model accuracy as well as both group and individual fairness.

[AI-126] A Benchmark Time Series Dataset for Semiconductor Fabrication Manufacturing Constructed using Component-based Discrete-Event Simulation Models

Link: https://arxiv.org/abs/2408.09307
Authors: Vamsi Krishna Pendyala,Hessam S. Sarjoughian,Bala Potineni,Edward J. Yellig
Keywords: high-computing devices increase, smart manufacturing factories, Advancements in high-computing, models, high-computing devices
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Advancements in high-computing devices increase the necessity for improved and new understanding and development of smart manufacturing factories. Discrete-event models with simulators have been shown to be critical to architecting, designing, building, and operating the manufacturing of semiconductor chips. The diffusion, implantation, and lithography machines have intricate processes due to their feedforward and feedback connectivity. The dataset collected from simulations of the factory models holds the promise of generating valuable machine-learning models. As surrogate data-based models, their executions are highly efficient compared to the physics-based counterpart models. For the development of surrogate models, it is beneficial to have publicly available benchmark simulation models that are grounded in factory models that have concise structures and accurate behaviors. Hence, in this research, a dataset is devised and constructed based on a benchmark model of an Intel semiconductor fabrication factory. The model is formalized using the Parallel Discrete-Event System Specification and executed using the DEVS-Suite simulator. The time series dataset is constructed using discrete-event time trajectories. This dataset is further analyzed and used to develop baseline univariate and multivariate machine learning models. The dataset can also be utilized in the machine learning community for behavioral analysis based on formalized and scalable component-based discrete-event models and simulations.

[AI-127] Evaluating Usability and Engagement of Large Language Models in Virtual Reality for Traditional Scottish Curling

Link: https://arxiv.org/abs/2408.09285
Authors: Ka Hei Carrie Lau,Efe Bozkir,Hong Gao,Enkelejda Kasneci
Keywords: Large Language Models, traditional Scottish curling, Scottish curling presented, Language Models, Virtual Reality
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)

Abstract:This paper explores the innovative application of Large Language Models (LLMs) in Virtual Reality (VR) environments to promote heritage education, focusing on traditional Scottish curling presented in the game “Scottish Bonspiel VR”. Our study compares the effectiveness of LLM-based chatbots with pre-defined scripted chatbots, evaluating key criteria such as usability, user engagement, and learning outcomes. The results show that LLM-based chatbots significantly improve interactivity and engagement, creating a more dynamic and immersive learning environment. This integration helps document and preserve cultural heritage and enhances dissemination processes, which are crucial for safeguarding intangible cultural heritage (ICH) amid environmental changes. Furthermore, the study highlights the potential of novel technologies in education to provide immersive experiences that foster a deeper appreciation of cultural heritage. These findings support the wider application of LLMs and VR in cultural education to address global challenges and promote sustainable practices to preserve and enhance cultural heritage.

[AI-128] PREMAP: A Unifying PREiMage APproximation Framework for Neural Networks

Link: https://arxiv.org/abs/2408.09262
Authors: Xiyue Zhang,Benjie Wang,Marta Kwiatkowska,Huan Zhang
Keywords: neural network, network verification focus, focus on bounding, neural network predictions, neural network verification
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: arXiv admin note: text overlap with arXiv:2305.03686

Abstract:Most methods for neural network verification focus on bounding the image, i.e., set of outputs for a given input set. This can be used to, for example, check the robustness of neural network predictions to bounded perturbations of an input. However, verifying properties concerning the preimage, i.e., the set of inputs satisfying an output property, requires abstractions in the input space. We present a general framework for preimage abstraction that produces under- and over-approximations of any polyhedral output set. Our framework employs cheap parameterised linear relaxations of the neural network, together with an anytime refinement procedure that iteratively partitions the input region by splitting on input features and neurons. The effectiveness of our approach relies on carefully designed heuristics and optimization objectives to achieve rapid improvements in the approximation volume. We evaluate our method on a range of tasks, demonstrating significant improvement in efficiency and scalability to high-input-dimensional image classification tasks compared to state-of-the-art techniques. Further, we showcase the application to quantitative verification and robustness analysis, presenting a sound and complete algorithm for the former and providing sound quantitative results for the latter.

[AI-129] V2X-VLM: End-to-End V2X Cooperative Autonomous Driving Through Large Vision-Language Models

Link: https://arxiv.org/abs/2408.09251
Authors: Junwei You,Haotian Shi,Zhuoyu Jiang,Zilin Huang,Rui Gan,Keshu Wu,Xi Cheng,Xiaopeng Li,Bin Ran
Keywords: systems that manage, navigation and control, increasingly focused, manage the full, full spectrum
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Advancements in autonomous driving have increasingly focused on end-to-end (E2E) systems that manage the full spectrum of driving tasks, from environmental perception to vehicle navigation and control. This paper introduces V2X-VLM, an innovative E2E vehicle-infrastructure cooperative autonomous driving (VICAD) framework with large vision-language models (VLMs). V2X-VLM is designed to enhance situational awareness, decision-making, and ultimate trajectory planning by integrating data from vehicle-mounted cameras, infrastructure sensors, and textual information. The strength of the comprehensive multimodal data fusion of the VLM enables precise and safe E2E trajectory planning in complex and dynamic driving scenarios. Validation on the DAIR-V2X dataset demonstrates that V2X-VLM outperforms existing state-of-the-art methods in cooperative autonomous driving.

[AI-130] Towards Effective Top-N Hamming Search via Bipartite Graph Contrastive Hashing

Link: https://arxiv.org/abs/2408.09239
Authors: Yankai Chen,Yixiang Fang,Yifei Zhang,Chenhao Ma,Yang Hong,Irwin King
Keywords: recommendation systems, document querying, bipartite graphs serves, fundamental task, Searching
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract:Searching on bipartite graphs serves as a fundamental task for various real-world applications, such as recommendation systems, database retrieval, and document querying. Conventional approaches rely on similarity matching in continuous Euclidean space of vectorized node embeddings. To handle intensive similarity computation efficiently, hashing techniques for graph-structured data have emerged as a prominent research direction. However, despite the retrieval efficiency in Hamming space, previous studies have encountered catastrophic performance decay. To address this challenge, we investigate the problem of hashing with Graph Convolutional Network for effective Top-N search. Our findings indicate the learning effectiveness of incorporating hashing techniques within the exploration of bipartite graph reception fields, as opposed to simply treating hashing as post-processing to output embeddings. To further enhance the model performance, we advance upon these findings and propose Bipartite Graph Contrastive Hashing (BGCH+). BGCH+ introduces a novel dual augmentation approach to both intermediate information and hash code outputs in the latent feature spaces, thereby producing more expressive and robust hash codes within a dual self-supervised learning paradigm. Comprehensive empirical analyses on six real-world benchmarks validate the effectiveness of our dual feature contrastive learning in boosting the performance of BGCH+ compared to existing approaches.
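
For readers unfamiliar with Hamming-space retrieval, the sketch below shows only the generic lookup step: sign-binarize embeddings into bit codes and rank items by Hamming distance. This is the simple post-processing style of hashing that the abstract contrasts BGCH+ against, not the learned contrastive hashing itself; all function names and the toy inputs are hypothetical.

```python
def to_hash_code(embedding):
    """Pack the signs of an embedding into an integer bit code."""
    code = 0
    for value in embedding:
        code = (code << 1) | (1 if value >= 0 else 0)
    return code


def hamming(a, b):
    """Number of differing bits between two integer codes."""
    return bin(a ^ b).count("1")


def top_n(query, items, n):
    """Return the names of the n items closest to the query in Hamming space.

    items: {name: embedding}; ties keep the dictionary's insertion order.
    """
    query_code = to_hash_code(query)
    ranked = sorted(
        items, key=lambda name: hamming(query_code, to_hash_code(items[name]))
    )
    return ranked[:n]
```

Distances reduce to an XOR plus a bit count, which is what makes Hamming search cheap compared with floating-point similarity in Euclidean space.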

[AI-131] Hybrid Semantic Search: Unveiling User Intent Beyond Keywords

Link: https://arxiv.org/abs/2408.09236
Authors: Aman Ahluwalia,Bishwajit Sutradhar,Karishma Ghosh
Keywords: Large Language Models, Large Language, non-semantic search engines, understanding user intent, traditional keyword-based search
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract:This paper addresses the limitations of traditional keyword-based search in understanding user intent and introduces a novel hybrid search approach that leverages the strengths of non-semantic search engines, Large Language Models (LLMs), and embedding models. The proposed system integrates keyword matching, semantic vector embeddings, and LLM-generated structured queries to deliver highly relevant and contextually appropriate search results. By combining these complementary methods, the hybrid approach effectively captures both explicit and implicit user intent. The paper further explores techniques to optimize query execution for faster response times and demonstrates the effectiveness of this hybrid search model in producing comprehensive and accurate search outcomes.
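
A common way to combine keyword matching with embedding similarity is a weighted sum of the two scores. The sketch below assumes that form, with a hypothetical `alpha` weight, a token-overlap keyword score, and precomputed vectors; it leaves out the LLM-generated structured queries and is not the system described in the paper.

```python
import math


def keyword_score(query, text):
    """Fraction of query tokens that also appear in the document text."""
    query_tokens = set(query.lower().split())
    text_tokens = set(text.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & text_tokens) / len(query_tokens)


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def hybrid_rank(query, query_vec, docs, alpha=0.5):
    """Rank (doc_id, text, vector) triples by a blended score.

    alpha weights the keyword score; (1 - alpha) weights the semantic score.
    """
    scored = [
        (alpha * keyword_score(query, text) + (1 - alpha) * cosine(query_vec, vec), doc_id)
        for doc_id, text, vec in docs
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored]
```

A low `alpha` lets a document with no shared keywords still rank first when its embedding matches the query, which is the "implicit intent" case the abstract refers to.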

[AI-132] Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form Text

Link: https://arxiv.org/abs/2408.09235
Authors: Sher Badshah,Hassan Sajjad
Keywords: Large Language Models, Large Language, advancements in Large, Language Models, rapid advancements
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:The rapid advancements in Large Language Models (LLMs) have highlighted the critical need for robust evaluation methods that can accurately assess the quality of generated text, particularly in free-form tasks. Traditional metrics like BLEU and ROUGE, while useful, often fail to capture the semantic richness and contextual relevance of free-form text compared to reference answers. In this study, we introduce a reference-guided verdict method that leverages multiple LLMs-as-judges to provide a more reliable and accurate evaluation of open-ended LLM generations. By integrating diverse LLMs, our approach mitigates individual model biases and significantly improves alignment with human judgments, especially in challenging tasks where traditional metrics and single-model evaluations fall short. Through experiments across multiple question-answering tasks, we show that our method closely aligns with human evaluations, establishing it as a scalable, reproducible, and effective alternative to human evaluation. Our approach not only enhances evaluation reliability but also opens new avenues for refining automated assessment in generative AI.
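
Combining several judges can be as simple as a majority vote over their verdicts. The sketch below assumes that aggregation rule and models each judge as a plain callable returning a label; the function name, the labels, and the tie behaviour are illustrative assumptions rather than the paper's protocol.

```python
from collections import Counter


def majority_verdict(question, reference, candidate, judges):
    """Ask every judge for a verdict and return the most common one.

    judges: callables taking (question, reference, candidate) and returning
            a label such as "correct" or "incorrect" (hypothetical interface).
    On a tie, the verdict that was returned first wins.
    """
    votes = [judge(question, reference, candidate) for judge in judges]
    verdict, _ = Counter(votes).most_common(1)[0]
    return verdict, votes
```

Returning the raw votes alongside the verdict makes it easy to measure inter-judge agreement or to route split decisions to a human reviewer.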

[AI-133] Siamese Multiple Attention Temporal Convolution Networks for Human Mobility Signature Identification ITSC2024

Link: https://arxiv.org/abs/2408.09230
Authors: Zhipeng Zheng,Yuchen Jiang,Shiyao Zhang,Xuetao Wei
Keywords: Mobility Signature Identification, Human Mobility Signature, Signature Identification, Human Mobility, Mobility Signature
Categories: Artificial Intelligence (cs.AI)
Comments: 27th IEEE International Conference on Intelligent Transportation Systems (ITSC) (ITSC 2024)

Abstract:The Human Mobility Signature Identification (HuMID) problem stands as a fundamental task within the realm of driving style representation, dedicated to discerning latent driving behaviors and preferences from diverse driver trajectories for driver identification. Its solutions hold significant implications across various domains (e.g., ride-hailing, insurance), wherein their application serves to safeguard users and mitigate potential fraudulent activities. Present HuMID solutions often exhibit limitations in adaptability when confronted with lengthy trajectories, consequently incurring substantial computational overhead. Furthermore, their inability to effectively extract crucial local information further impedes their performance. To address this problem, we propose a Siamese Multiple Attention Temporal Convolutional Network (Siamese MA-TCN) to capitalize on the strengths of both TCN architecture and multi-head self-attention, enabling the proficient extraction of both local and long-term dependencies. Additionally, we devise a novel attention mechanism tailored for the efficient aggregation of multi-scale representations derived from our model. Experimental evaluations conducted on two real-world taxi trajectory datasets reveal that our proposed model effectively extracts both local key information and long-term dependencies. These findings highlight the model’s outstanding generalization capabilities, demonstrating its robustness and adaptability across datasets of varying sizes.

[AI-134] FEDMEKI: A Benchmark for Scaling Medical Foundation Models via Federated Knowledge Injection NEURIPS2024

Link: https://arxiv.org/abs/2408.09227
Authors: Jiaqi Wang,Xiaochen Wang,Lingjuan Lyu,Jinghui Chen,Fenglong Ma
Keywords: Medical Knowledge Injection, Health Insurance Portability, Knowledge Injection, Federated Medical Knowledge, integrating medical knowledge
Categories: Artificial Intelligence (cs.AI)
Comments: Submitted to Neurips 2024 DB Track

Abstract:This study introduces the Federated Medical Knowledge Injection (FEDMEKI) platform, a new benchmark designed to address the unique challenges of integrating medical knowledge into foundation models under privacy constraints. By leveraging a cross-silo federated learning approach, FEDMEKI circumvents the issues associated with centralized data collection, which is often prohibited under health regulations like the Health Insurance Portability and Accountability Act (HIPAA) in the USA. The platform is meticulously designed to handle multi-site, multi-modal, and multi-task medical data, which includes 7 medical modalities, including images, signals, texts, laboratory test results, vital signs, input variables, and output variables. The curated dataset to validate FEDMEKI covers 8 medical tasks, including 6 classification tasks (lung opacity detection, COVID-19 detection, electrocardiogram (ECG) abnormal detection, mortality prediction, sepsis prediction, and enlarged cardiomediastinum detection) and 2 generation tasks (medical visual question answering (MedVQA) and ECG noise clarification). This comprehensive dataset is partitioned across several clients to facilitate the decentralized training process under 16 benchmark approaches. FEDMEKI not only preserves data privacy but also enhances the capability of medical foundation models by allowing them to learn from a broader spectrum of medical knowledge without direct data exposure, thereby setting a new benchmark in the application of foundation models within the healthcare sector.

[AI-135] Neuro-Symbolic AI for Military Applications

Link: https://arxiv.org/abs/2408.09224
Authors: Desta Haileselassie Hagos,Danda B. Rawat
Keywords: Artificial Intelligence, plays a significant, significant role, role in enhancing, shaping the future
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted at IEEE Transactions on Artificial Intelligence (TAI)

Abstract:Artificial Intelligence (AI) plays a significant role in enhancing the capabilities of defense systems, revolutionizing strategic decision-making, and shaping the future landscape of military operations. Neuro-Symbolic AI is an emerging approach that leverages and augments the strengths of neural networks and symbolic reasoning. These systems have the potential to be more impactful and flexible than traditional AI systems, making them well-suited for military applications. This paper comprehensively explores the diverse dimensions and capabilities of Neuro-Symbolic AI, aiming to shed light on its potential applications in military contexts. We investigate its capacity to improve decision-making, automate complex intelligence analysis, and strengthen autonomous systems. We further explore its potential to solve complex tasks in various domains, in addition to its applications in military contexts. Through this exploration, we address ethical, strategic, and technical considerations crucial to the development and deployment of Neuro-Symbolic AI in military and civilian applications. Contributing to the growing body of research, this study represents a comprehensive exploration of the extensive possibilities offered by Neuro-Symbolic AI.

[AI-136] Flatten: Video Action Recognition is an Image Classification task

Link: https://arxiv.org/abs/2408.09220
Authors: Junlin Chen,Chengcheng Xu,Yangfan Xu,Jian Yang,Jun Li,Zhiping Shi
Keywords: video action recognition, subsequently leveraging prevalent, numerous researchers.Most traditional, typically involve converting, methods typically involve
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 13 pages, 6 figures

Abstract:In recent years, video action recognition, a fundamental task in video understanding, has been explored extensively. Most traditional video action recognition methods convert videos into three-dimensional data that encapsulates both spatial and temporal information, and then use prevalent image understanding models to model and analyze these data. However, these methods have significant drawbacks. First, image understanding models often need to be adapted in terms of model architecture and preprocessing for these spatiotemporal tasks. Second, dealing with high-dimensional data often poses greater challenges and incurs higher time costs compared to its lower-dimensional counterparts. To bridge the gap between image-understanding and video-understanding tasks while simplifying the complexity of video comprehension, we introduce a novel video representation architecture, Flatten, which serves as a plug-and-play module that can be seamlessly integrated into any image-understanding network for efficient and effective 3D temporal data modeling. Specifically, by applying specific flattening operations (e.g., row-major transform), 3D spatiotemporal data is transformed into 2D spatial information, and then ordinary image understanding models are used to capture temporal dynamic and spatial semantic information, which in turn accomplishes effective and efficient video action recognition. Extensive experiments on commonly used datasets (Kinetics-400, Something-Something v2, and HMDB-51) and three classical image classification models (Uniformer, SwinV2, and ResNet) demonstrate that embedding Flatten provides significant performance improvements over the original models.
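The abstract does not spell out the flattening operation, so the following is only a minimal sketch of one plausible reading of a "row-major transform": the frames of a clip are tiled, in temporal order, into a single 2D grid that an ordinary image model could consume. The function name `flatten_row_major` and the `grid_cols` parameter are hypothetical and do not come from the paper.

```python
def flatten_row_major(video, grid_cols):
    """Tile T frames (each an H x W nested list) into one 2D image.

    Frames are placed left to right, then top to bottom (row-major order).
    This is an illustrative assumption, not the paper's exact operation.
    """
    num_frames = len(video)
    height = len(video[0])
    width = len(video[0][0])
    grid_rows = -(-num_frames // grid_cols)  # ceiling division

    # Unused grid cells (if T is not a multiple of grid_cols) stay zero.
    image = [[0] * (width * grid_cols) for _ in range(height * grid_rows)]
    for idx, frame in enumerate(video):
        row_offset = (idx // grid_cols) * height
        col_offset = (idx % grid_cols) * width
        for i in range(height):
            for j in range(width):
                image[row_offset + i][col_offset + j] = frame[i][j]
    return image


# Four 2x2 frames, frame k filled with the value k, tiled into a 2x2 grid.
clip = [[[k, k], [k, k]] for k in range(4)]
tiled = flatten_row_major(clip, grid_cols=2)
```

Under this reading, temporal neighbours become spatial neighbours in the tiled image, which is what would let a 2D backbone pick up motion cues without architectural changes.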

[AI-137] On the Improvement of Generalization and Stability of Forward-Only Learning via Neural Polarization ECAI2024

Link: https://arxiv.org/abs/2408.09210
Authors: Erik B. Terres-Escudero,Javier Del Ser,Pablo Garcia-Bringas
Keywords: contrastive forward pass, recently gained attention, additional contrastive forward, Forward-only learning algorithms, replacing the backward
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: To be published in ECAI 2024

Abstract:Forward-only learning algorithms have recently gained attention as alternatives to gradient backpropagation, replacing the backward step of this latter solver with an additional contrastive forward pass. Among these approaches, the so-called Forward-Forward Algorithm (FFA) has been shown to achieve competitive levels of performance in terms of generalization and complexity. Networks trained using FFA learn to contrastively maximize a layer-wise defined goodness score when presented with real data (denoted as positive samples) and to minimize it when processing synthetic data (correspondingly, negative samples). However, this algorithm still faces weaknesses that negatively affect the model accuracy and training stability, primarily due to a gradient imbalance between positive and negative samples. To overcome this issue, in this work we propose a novel implementation of the FFA algorithm, denoted as Polar-FFA, which extends the original formulation by introducing a neural division (polarization) between positive and negative instances. Neurons in each of these groups aim to maximize their goodness when presented with their respective data type, thereby creating a symmetric gradient behavior. To empirically gauge the improved learning capabilities of our proposed Polar-FFA, we perform several systematic experiments using different activation and goodness functions over image classification datasets. Our results demonstrate that Polar-FFA outperforms FFA in terms of accuracy and convergence speed. Furthermore, its lower reliance on hyperparameters reduces the need for hyperparameter tuning to guarantee optimal generalization capabilities, thereby allowing for a broader range of neural network configurations.
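To make the idea of polarization concrete, here is a rough sketch under stated assumptions. The goodness score is taken to be the sum of squared activations, as in the original Forward-Forward formulation. How the abstract's two neuron groups are compared is not specified, so the sigmoid of the goodness difference used below is an assumption, as are the names `goodness`, `polar_probability` and `n_positive`.

```python
import math


def goodness(activations):
    """Layer-wise goodness: sum of squared activations."""
    return sum(a * a for a in activations)


def polar_probability(activations, n_positive):
    """Probability that an input is a positive sample.

    The first n_positive neurons form the positive group and the rest form
    the negative group. Each group should become active on its own data
    type, so the two groups are compared symmetrically (assumed form).
    """
    g_pos = goodness(activations[:n_positive])
    g_neg = goodness(activations[n_positive:])
    return 1.0 / (1.0 + math.exp(-(g_pos - g_neg)))


# A layer whose positive group fires looks "positive", and vice versa.
p_real = polar_probability([1.0, 1.0, 0.0, 0.0], n_positive=2)
p_fake = polar_probability([0.0, 0.0, 1.0, 1.0], n_positive=2)
```

Because swapping the two groups maps the probability `p` to `1 - p`, positive and negative samples contribute mirror-image gradients in this sketch, which is one way to read the "symmetric gradient behavior" the abstract describes.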

[AI-138] Architectural Foundations and Strategic Considerations for the Large Language Model Infrastructures

Link: https://arxiv.org/abs/2408.09205
Authors: Hongyin Zhu
Keywords: large language model, language model, artificial intelligence, large language, undertaking in artificial
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:The development of a large language model (LLM) infrastructure is a pivotal undertaking in artificial intelligence. This paper explores the intricate landscape of LLM infrastructure, software, and data management. By analyzing these core components, we emphasize the pivotal considerations and safeguards crucial for successful LLM development. This work presents a concise synthesis of the challenges and strategies inherent in constructing a robust and effective LLM infrastructure, offering valuable insights for researchers and practitioners alike.

[AI-139] Maintainability Challenges in ML: A Systematic Literature Review

Link: https://arxiv.org/abs/2408.09196
Authors: Karthik Shivashankar,Antonio Martini
Keywords: Machine Learning, maintainability challenges, advances rapidly, businesses alike, adopted by academics
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)

Abstract:Background: As Machine Learning (ML) advances rapidly in many fields, it is being adopted by academics and businesses alike. However, ML has a number of different challenges in terms of maintenance not found in traditional software projects. Identifying what causes these maintainability challenges can help mitigate them early and continue delivering value in the long run without degrading ML performance. Aim: This study aims to identify and synthesise the maintainability challenges in different stages of the ML workflow and understand how these stages are interdependent and impact each other’s maintainability. Method: Using a systematic literature review, we screened more than 13000 papers, then selected and qualitatively analysed 56 of them. Results: (i) a catalogue of maintainability challenges in different stages of Data Engineering, Model Engineering workflows and the current challenges when building ML systems are discussed; (ii) a map of 13 maintainability challenges to different interdependent stages of ML that impact the overall workflow; (iii) Provided insights to developers of ML tools and researchers. Conclusions: In this study, practitioners and organisations will learn about maintainability challenges and their impact at different stages of ML workflow. This will enable them to avoid pitfalls and help to build a maintainable ML system. The implications and challenges will also serve as a basis for future research to strengthen our understanding of the ML system’s maintainability.

[AI-140] AI Managed Emergency Documentation with a Pretrained Model

Link: https://arxiv.org/abs/2408.09193
Authors: David Menzies,Sean Kirwan,Ahmad Albarqawi
Keywords: large language model, discharge letter writing, study investigates, large language, improve efficiency
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Ethical approval for the study was obtained from the University College Dublin, Human Research Ethics Committee (UCD HREC)

Abstract:This study investigates the use of a large language model system to improve efficiency and quality in emergency department (ED) discharge letter writing. Time constraints and infrastructural deficits make compliance with current discharge letter targets difficult. We explored potential efficiencies from an artificial intelligence software in the generation of ED discharge letters and the attitudes of doctors toward this technology. The evaluated system leverages advanced techniques to fine-tune a model to generate discharge summaries from short-hand inputs, including voice, text, and electronic health record data. Nineteen physicians with emergency medicine experience evaluated the system text and voice-to-text interfaces against manual typing. The results showed significant time savings with MedWrite LLM interfaces compared to manual methods.

[AI-141] SA-GDA: Spectral Augmentation for Graph Domain Adaptation

Link: https://arxiv.org/abs/2408.09189
Authors: Jinhui Pang,Zixuan Wang,Jiliang Tang,Mingyan Xiao,Nan Yin
Keywords: achieved impressive impressions, domain, graph-related tasks, achieved impressive, impressive impressions
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Graph neural networks (GNNs) have achieved impressive results on graph-related tasks. However, most GNNs are studied in a single-domain setting with supervised training, which requires abundant task-specific labels and is difficult to transfer to other domains. Few works focus on domain adaptation for graph node classification. They mainly align the feature spaces of the source and target domains without considering the feature alignment between different categories, which may lead to classification confusion in the target domain. Moreover, because labels are scarce in the target domain, categories from different domains cannot be aligned directly, which makes the problem more challenging. In this paper, we present Spectral Augmentation for Graph Domain Adaptation (SA-GDA) for graph node classification. First, we observe that nodes of the same category in different domains exhibit similar characteristics in the spectral domain, while different classes are quite different. Following this observation, we align the category feature space of different domains in the spectral domain instead of aligning the whole feature space, and we theoretically prove the stability of the proposed SA-GDA. Then, we develop a dual graph convolutional network to jointly exploit local and global consistency for feature aggregation. Last, we utilize a domain classifier with an adversarial learning submodule to facilitate knowledge transfer between different domain graphs. Experimental results on a variety of publicly available datasets reveal the effectiveness of SA-GDA.

[AI-142] EEG-SCMM: Soft Contrastive Masked Modeling for Cross-Corpus EEG-Based Emotion Recognition AAAI2025

Link: https://arxiv.org/abs/2408.09186
Authors: Qile Liu,Weishan Ye,Yulu Liu,Zhen Liang
Keywords: garnered widespread attention, signals has garnered, recent years, garnered widespread, widespread attention
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 16 pages, 8 figures, 15 tables, submitted to AAAI 2025

Abstract:Emotion recognition using electroencephalography (EEG) signals has garnered widespread attention in recent years. However, existing studies have struggled to develop a sufficiently generalized model suitable for different datasets without re-training (cross-corpus). This difficulty arises because distribution differences across datasets far exceed the intra-dataset variability. To solve this problem, we propose a novel Soft Contrastive Masked Modeling (SCMM) framework. Inspired by emotional continuity, SCMM integrates soft contrastive learning with a new hybrid masking strategy to effectively mine the “short-term continuity” characteristics inherent in human emotions. During the self-supervised learning process, soft weights are assigned to sample pairs, enabling adaptive learning of similarity relationships across samples. Furthermore, we introduce an aggregator that weightedly aggregates complementary information from multiple close samples based on pairwise similarities among samples to enhance fine-grained feature representation, which is then used for original sample reconstruction. Extensive experiments on the SEED, SEED-IV and DEAP datasets show that SCMM achieves state-of-the-art (SOTA) performance, outperforming the second-best method by an average accuracy of 4.26% under two types of cross-corpus conditions (same-class and different-class) for EEG-based emotion recognition.

[AI-143] Chinese Metaphor Recognition Using a Multi-stage Prompting Large Language Model

Link: https://arxiv.org/abs/2408.09177
Authors: Jie Wang,Jin Wang,Xuejie Zhang
Keywords: Large Language Models, common in everyday, everyday language, Metaphors, identification and understanding
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Metaphors are common in everyday language, and models that identify and understand them enable a better understanding of text. Existing research mainly uses pre-trained models to identify and generate metaphors, but these cannot handle cases where the tenor or vehicle is not included in the metaphor. The problem can be effectively solved by using Large Language Models (LLMs), but significant room for exploration remains in this early-stage research area. A multi-stage generative heuristic-enhanced prompt framework is proposed in this study to enhance the ability of LLMs to recognize tenors, vehicles, and grounds in Chinese metaphors. In the first stage, a small model is trained to obtain the required confidence score for answer candidate generation. In the second stage, questions are clustered and sampled according to specific rules. Finally, the required heuristic-enhanced prompt is formed by combining the generated answer candidates and demonstrations. The proposed model achieved 3rd place in Track 1 of Subtask 1, 1st place in Track 2 of Subtask 1, and 1st place in both tracks of Subtask 2 at the NLPCC-2024 Shared Task 9.

[AI-144] Cognitive LLMs: Towards Integrating Cognitive Architectures and Large Language Models for Manufacturing Decision-making

Link: https://arxiv.org/abs/2408.09176
Authors: Siyu Wu,Alessandro Oltramari,Jonathan Francis,C. Lee Giles,Frank E. Ritter
Keywords: Large Language Models, Language Models, Large Language, enabling reliable machine, Resolving the dichotomy
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Symbolic Computation (cs.SC)
Comments: 20 pages, 8 figures, 2 tables

Abstract:Resolving the dichotomy between the human-like yet constrained reasoning processes of Cognitive Architectures and the broad but often noisy inference behavior of Large Language Models (LLMs) remains a challenging but exciting pursuit, for enabling reliable machine reasoning capabilities in production systems. Because Cognitive Architectures are famously developed for the purpose of modeling the internal mechanisms of human cognitive decision-making at a computational level, new investigations consider the goal of informing LLMs with the knowledge necessary for replicating such processes, e.g., guided perception, memory, goal-setting, and action. Previous approaches that use LLMs for grounded decision-making struggle with complex reasoning tasks that require slower, deliberate cognition over fast and intuitive inference – reporting issues related to the lack of sufficient grounding, as in hallucination. To resolve these challenges, we introduce LLM-ACTR, a novel neuro-symbolic architecture that provides human-aligned and versatile decision-making by integrating the ACT-R Cognitive Architecture with LLMs. Our framework extracts and embeds knowledge of ACT-R’s internal decision-making process as latent neural representations, injects this information into trainable LLM adapter layers, and fine-tunes the LLMs for downstream prediction. Our experiments on novel Design for Manufacturing tasks show both improved task performance as well as improved grounded decision-making capability of our approach, compared to LLM-only baselines that leverage chain-of-thought reasoning strategies.

[AI-145] Unc-TTP: A Method for Classifying LLM Uncertainty to Improve In-Context Example Selection

Link: https://arxiv.org/abs/2408.09172
Authors: Hsiu-Yuan Huang,Zichen Wu,Yutong Yang,Junzhao Zhang,Yunfang Wu
Keywords: Large Language Models, Large Language, Language Models, demonstrated exceptional performance, downstream tasks
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 7 pages, long paper

Abstract:Large Language Models (LLMs) have demonstrated exceptional performance across various downstream tasks. However, it is challenging for users to discern whether the responses are generated with certainty or are fabricated to meet user expectations. Estimating the uncertainty of LLMs is particularly challenging due to their vast scale and the lack of white-box access. In this work, we propose a novel Uncertainty Tripartite Testing Paradigm (Unc-TTP) to classify LLM uncertainty by evaluating the consistency of LLM outputs when incorporating label interference into the sampling-based approach. Based on Unc-TTP outputs, we aggregate instances into certain and uncertain categories. Further, we conduct a detailed analysis of the uncertainty properties of LLMs and show Unc-TTP’s superiority over the existing sampling-based methods. In addition, we leverage the obtained uncertainty information to guide in-context example selection, demonstrating that Unc-TTP clearly outperforms retrieval-based and sampling-based approaches in selecting more informative examples. Our work paves a new way to classify the uncertainty of both open- and closed-source LLMs, and introduces a practical approach to exploit this uncertainty to improve LLM performance.

[AI-146] Ranking Across Different Content Types: The Robust Beauty of Multinomial Blending RECSYS24

Link: https://arxiv.org/abs/2408.09168
Authors: Jan Malte Lichtenberg,Giuseppe Di Benedetto,Matteo Ruffini
Keywords: media streaming services, multiple content types, content types, increasing number, number of media
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: To appear in 18th ACM Conference on Recommender Systems (RecSys24), Bari, Italy. ACM, New York, NY, USA, 3 pages

Abstract:An increasing number of media streaming services have expanded their offerings to include entities of multiple content types. For instance, audio streaming services that started by offering music only, now also offer podcasts, merchandise items, and videos. Ranking items across different content types into a single slate poses a significant challenge for traditional learning-to-rank (LTR) algorithms due to differing user engagement patterns for different content types. We explore a simple method for cross-content-type ranking, called multinomial blending (MB), which can be used in conjunction with most existing LTR algorithms. We compare MB to existing baselines not only in terms of ranking quality but also from other industry-relevant perspectives such as interpretability, ease-of-use, and stability in dynamic environments with changing user behavior and ranking model retraining. Finally, we report the results of an A/B test from an Amazon Music ranking use-case.

[AI-147] Linear Attention is Enough in Spatial-Temporal Forecasting

Link: https://arxiv.org/abs/2408.09158
Authors: Xinyu Ning
Keywords: forecasting task attracted, task attracted numerous, attracted numerous attention, spatial-temporal forecasting tasks, traffic forecasting task
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:As the most representative scenario of spatial-temporal forecasting, traffic forecasting has attracted considerable attention from the machine learning community due to its intricate correlations in both space and time. Existing methods often treat road networks over time as spatial-temporal graphs, addressing spatial and temporal representations independently. However, these approaches struggle to capture the dynamic topology of road networks, encounter issues with message passing mechanisms and over-smoothing, and face challenges in learning spatial and temporal relationships separately. To address these limitations, we propose treating nodes in road networks at different time steps as independent spatial-temporal tokens and feeding them into a vanilla Transformer to learn complex spatial-temporal patterns; the resulting model, STformer, achieves SOTA performance. Given its quadratic complexity, we introduce a variant, NSTformer, which uses the Nyström method to approximate self-attention with linear complexity and, surprisingly, even performs slightly better than STformer in a few cases. Extensive experimental results on traffic datasets demonstrate that the proposed method achieves state-of-the-art performance at an affordable computational cost. Our code will be made available.

[AI-148] CogLM: Tracking Cognitive Development of Large Language Models

Link: https://arxiv.org/abs/2408.09150
Authors: Xinglin Wang,Peiwen Yuan,Shaoxiong Feng,Yiwei Li,Boyuan Pan,Heda Wang,Yao Hu,Kan Li
Keywords: Piaget Theory, Large Language Models, cognitive levels, cognitive levels forms, Cognitive
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: under review

Abstract:Piaget’s Theory of Cognitive Development (PTC) posits that the development of cognitive levels forms the foundation for human learning across various abilities. As Large Language Models (LLMs) have recently shown remarkable abilities across a wide variety of tasks, we are curious about the cognitive levels of current LLMs: to what extent they have developed and how this development has been achieved. To this end, we construct a benchmark CogLM (Cognitive Ability Evaluation for Language Model) based on PTC to assess the cognitive levels of LLMs. CogLM comprises 1,220 questions spanning 10 cognitive abilities crafted by more than 20 human experts, providing a comprehensive testbed for the cognitive levels of LLMs. Through extensive experiments across multiple mainstream LLMs with CogLM, we find that: (1) Human-like cognitive abilities have emerged in advanced LLMs (GPT-4), comparable to those of a 20-year-old human. (2) The parameter size and optimization objective are two key factors affecting the cognitive levels of LLMs. (3) The performance on downstream tasks is positively correlated with the level of cognitive abilities. These findings fill the gap in research on the cognitive abilities of LLMs, tracing the development of LLMs from a cognitive perspective and guiding the future direction of their evolution.

[AI-149] Learning to Explore for Stochastic Gradient MCMC

Link: https://arxiv.org/abs/2408.09140
Authors: SeungHyun Kim,Seohyeon Jung,Seonghyeon Kim,Juho Lee
Keywords: Bayesian Neural Networks, Bayesian Neural, Neural Networks, high-dimensional parameters pose, posterior inference due
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Bayesian Neural Networks (BNNs) with high-dimensional parameters pose a challenge for posterior inference due to the multi-modality of the posterior distributions. Stochastic Gradient MCMC (SGMCMC) with cyclical learning rate scheduling is a promising solution, but it requires a large number of sampling steps to explore high-dimensional multi-modal posteriors, making it computationally expensive. In this paper, we propose a meta-learning strategy to build SGMCMC which can efficiently explore the multi-modal target distributions. Our algorithm allows the learned SGMCMC to quickly explore the high-density region of the posterior landscape. Also, we show that this exploration property is transferable to various tasks, even for the ones unseen during a meta-training stage. Using popular image classification benchmarks and a variety of downstream tasks, we demonstrate that our method significantly improves the sampling efficiency, achieving better performance than vanilla SGMCMC without incurring significant computational overhead.

[AI-150] Vanilla Gradient Descent for Oblique Decision Trees ECAI2024

Link: https://arxiv.org/abs/2408.09135
Authors: Subrat Prasad Panda,Blaise Genest,Arvind Easwaran,Ponnuthurai Nagaratnam Suganthan
Keywords: major highly non-linear, DTs, textit, non-linear AI models, tabular data
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Published in ECAI2024. This version includes supplementary material

Abstract:Decision Trees (DTs) constitute one of the major highly non-linear AI models, valued, e.g., for their efficiency on tabular data. Learning accurate DTs is, however, complicated, especially for oblique DTs, and does take a significant training time. Further, DTs suffer from overfitting, e.g., they proverbially “do not generalize” in regression tasks. Recently, some works proposed ways to make (oblique) DTs differentiable. This enables highly efficient gradient-descent algorithms to be used to learn DTs. It also enables generalizing capabilities by learning regressors at the leaves simultaneously with the decisions in the tree. Prior approaches to making DTs differentiable rely either on probabilistic approximations at the tree’s internal nodes (soft DTs) or on approximations in gradient computation at the internal node (quantized gradient descent). In this work, we propose DTSemNet, a novel semantically equivalent and invertible encoding for (hard, oblique) DTs as Neural Networks (NNs), that uses standard vanilla gradient descent. Experiments across various classification and regression benchmarks show that oblique DTs learned using DTSemNet are more accurate than oblique DTs of similar size learned using state-of-the-art techniques. Further, DT training time is significantly reduced. We also experimentally demonstrate that DTSemNet can learn DT policies as efficiently as NN policies in the Reinforcement Learning (RL) setup with physical inputs (dimensions ≤ 32). The code is available at this https URL.

[AI-151] Better Python Programming for all: With the focus on Maintainability

Link: https://arxiv.org/abs/2408.09134
Authors: Karthik Shivashankar,Antonio Martini
Keywords: Python programming language, Large Language Models, Large Language, Python programming, programming language
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

Abstract:This study aims to enhance the maintainability of code generated by Large Language Models (LLMs), with a focus on the Python programming language. As the use of LLMs for coding assistance grows, so do concerns about the maintainability of the code they produce. Previous research has mainly concentrated on the functional accuracy and testing success of generated code, overlooking aspects of maintainability. Our approach involves the use of a specifically designed dataset for training and evaluating the model, ensuring a thorough assessment of code maintainability. At the heart of our work is the fine-tuning of an LLM for code refactoring, aimed at enhancing code readability, reducing complexity, and improving overall maintainability. After fine-tuning an LLM to prioritize code maintainability, our evaluations indicate that this model significantly improves code maintainability standards, suggesting a promising direction for the future of AI-assisted software development.

[AI-152] Identifying Technical Debt and Its Types Across Diverse Software Projects Issues

Link: https://arxiv.org/abs/2408.09128
Authors: Karthik Shivashankar,Mili Orucevic,Maren Maritsdatter Kruke,Antonio Martini
Keywords: maintaining code quality, reducing long-term maintenance, long-term maintenance costs, Technical Debt, code quality
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

Abstract:Technical Debt (TD) identification in software projects issues is crucial for maintaining code quality, reducing long-term maintenance costs, and improving overall project health. This study advances TD classification using transformer-based models, addressing the critical need for accurate and efficient TD identification in large-scale software development. Our methodology employs multiple binary classifiers for TD and its type, combined through ensemble learning, to enhance accuracy and robustness in detecting various forms of TD. We train and evaluate these models on a comprehensive dataset from GitHub Archive Issues (2015-2024), supplemented with industrial data validation. We demonstrate that in-project fine-tuned transformer models significantly outperform task-specific fine-tuned models in TD classification, highlighting the importance of project-specific context in accurate TD identification. Our research also reveals the superiority of specialized binary classifiers over multi-class models for TD and its type identification, enabling more targeted debt resolution strategies. A comparative analysis shows that the smaller DistilRoBERTa model is more effective than larger language models like GPTs for TD classification tasks, especially after fine-tuning, offering insights into efficient model selection for specific TD detection tasks. The study also assesses generalization capabilities using metrics such as MCC, AUC ROC, Recall, and F1 score, focusing on model effectiveness, fine-tuning impact, and relative performance. By validating our approach on out-of-distribution and real-world industrial datasets, we ensure practical applicability, addressing the diverse nature of software projects. 

[AI-153] Markov Balance Satisfaction Improves Performance in Strictly Batch Offline Imitation Learning

Link: https://arxiv.org/abs/2408.09125
Authors: Rishabh Agrawal,Nathan Dahlin,Rahul Jain,Ashutosh Nayyar
Keywords: directly programming behaviors, defining optimal control, optimal control costs, costs is challenging, notably effective
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Abstract:Imitation learning (IL) is notably effective for robotic tasks where directly programming behaviors or defining optimal control costs is challenging. In this work, we address a scenario where the imitator relies solely on observed behavior and cannot make environmental interactions during learning. It does not have additional supplementary datasets beyond the expert’s dataset nor any information about the transition dynamics. Unlike state-of-the-art (SOTA) IL methods, this approach tackles the limitations of conventional IL by operating in a more constrained and realistic setting. Our method uses the Markov balance equation and introduces a novel conditional density estimation-based imitation learning framework. It employs conditional normalizing flows for transition dynamics estimation and aims at satisfying a balance equation for the environment. Through a series of numerical experiments on Classic Control and MuJoCo environments, we demonstrate consistently superior empirical performance compared to many SOTA IL algorithms.

[AI-154] Selective Prompt Anchoring for Code Generation

Link: https://arxiv.org/abs/2408.09121
Authors: Yuan Tian,Tianyi Zhang
Keywords: large language models, automating coding tasks, transformed software development, Recent advances, Copilot and ChatGPT
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)
*Comments: Under review

Abstract:Recent advances in large language models (LLMs) such as Copilot and ChatGPT have transformed software development by automating coding tasks. Despite these advancements, challenges remain in reducing error rates and fully meeting user expectations. Our empirical study reveals LLMs tend to dilute their self-attention on the initial prompt as more code tokens are generated. We hypothesize this self-attention dilution issue is one of the root causes of inaccuracies in LLM-generated code. To mitigate this issue, we propose Selective Prompt Anchoring (SPA). SPA amplifies the influence of the selected parts in the initial prompt, which we refer to as “anchored text”, during code generation. Specifically, SPA calculates the logit distribution difference with and without the anchored text. We prove this difference approximates the anchored text’s contextual contribution to the output logits. SPA creates an augmented logit distribution by linearly combining the original logit distribution and the logit difference. We evaluate SPA with five LLMs on four benchmarks. Our results demonstrate that using SPA can consistently improve Pass@1 rates by up to 9.7% in all settings. Notably, with selective text anchoring, a small version of DeepSeek-Coder (6.7B) can achieve better performance than an original much larger version (33B). Our code is available at this https URL.
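
The linear combination described in this abstract can be sketched roughly as below. This is a minimal illustration, not the authors' implementation: the function name `spa_augment`, the `weight` parameter, and the use of plain Python lists for logits are all assumptions made for the example.

```python
def spa_augment(logits_full, logits_masked, weight):
    """Sketch of an SPA-style augmented logit vector (hypothetical helper).

    logits_full:   logits computed with the anchored text present
    logits_masked: logits computed with the anchored text removed or masked
    weight:        assumed scalar controlling how strongly the anchor is amplified
    """
    if len(logits_full) != len(logits_masked):
        raise ValueError("logit vectors must have the same length")
    # The difference approximates the anchored text's contribution;
    # adding a scaled copy of it back amplifies that contribution.
    return [f + weight * (f - m) for f, m in zip(logits_full, logits_masked)]
```

With `weight = 0` the original logits are returned unchanged; larger values push the distribution further toward tokens that the anchored text favors.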

[AI-155] Measuring Visual Sycophancy in Multimodal Models

Link: https://arxiv.org/abs/2408.09111
Authors: Jaehyuk Lim,Bruce W. Lee
Keywords: disproportionately favor visually, favor visually presented, multimodal language models, visually presented information, paper introduces
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
*Comments:

Abstract:This paper introduces and examines the phenomenon of “visual sycophancy” in multimodal language models, a term we propose to describe these models’ tendency to disproportionately favor visually presented information, even when it contradicts their prior knowledge or responses. Our study employs a systematic methodology to investigate this phenomenon: we present models with images of multiple-choice questions, which they initially answer correctly, then expose the same model to versions with visually pre-marked options. Our findings reveal a significant shift in the models’ responses towards the pre-marked option despite their previous correct answers. Comprehensive evaluations demonstrate that visual sycophancy is a consistent and quantifiable behavior across various model architectures. Our findings highlight potential limitations in the reliability of these models when processing potentially misleading visual information, raising important questions about their application in critical decision-making contexts.

[AI-156] Temporal Reversed Training for Spiking Neural Networks with Generalized Spatio-Temporal Representation

Link: https://arxiv.org/abs/2408.09108
Authors: Lin Zuo,Yongqi Ding,Wenwei Luo,Mengmeng Jing,Xianlong Tian,Kunshan Yang
Keywords: energy computing paradigm, received widespread attention, ultra-low energy computing, Spiking neural networks, neural networks
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*Comments: 15 pages, 8 figures

Abstract:Spiking neural networks (SNNs) have received widespread attention as an ultra-low energy computing paradigm. Recent studies have focused on improving the feature extraction capability of SNNs, but they suffer from inefficient inference and suboptimal performance. In this paper, we propose a simple yet effective temporal reversed training (TRT) method to optimize the spatio-temporal performance of SNNs and circumvent these problems. We perturb the input temporal data by temporal reversal, prompting the SNN to produce original-reversed consistent output logits and to learn perturbation-invariant representations. For static data without temporal dimension, we generalize this strategy by exploiting the inherent temporal property of spiking neurons for spike feature temporal reversal. In addition, we utilize the lightweight “star operation” (element-wise multiplication) to hybridize the original and temporally reversed spike firing rates and expand the implicit dimensions, which serves as spatio-temporal regularization to further enhance the generalization of the SNN. Our method involves only an additional temporal reversal operation and element-wise multiplication during training, thus incurring negligible training overhead and not affecting the inference efficiency at all. Extensive experiments on static/neuromorphic object/action recognition, and 3D point cloud classification tasks demonstrate the effectiveness and generalizability of our method. In particular, with only two timesteps, our method achieves 74.77% and 90.57% accuracy on ImageNet and ModelNet40, respectively.
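
The two operations named in this abstract, temporal reversal and the element-wise "star operation", can be illustrated with a toy sketch. It works on plain nested lists shaped `[timestep][neuron]` rather than on a real SNN, and the helper names `temporal_reverse`, `firing_rate`, and `star_mix` are hypothetical; how the paper wires these into training is not shown here.

```python
def temporal_reverse(sequence):
    """Return the sequence with its time axis (outer list) reversed."""
    return [list(step) for step in reversed(sequence)]


def firing_rate(spikes):
    """Average spike count per neuron over all timesteps."""
    timesteps = len(spikes)
    neurons = len(spikes[0])
    return [sum(step[n] for step in spikes) / timesteps for n in range(neurons)]


def star_mix(rates_original, rates_reversed):
    """Element-wise multiplication of two firing-rate vectors.

    In the abstract these would be the rates produced for the original
    and the temporally reversed input; here they are just two lists.
    """
    return [a * b for a, b in zip(rates_original, rates_reversed)]
```

For example, `temporal_reverse([[1, 0], [0, 1], [1, 1]])` yields `[[1, 1], [0, 1], [1, 0]]`. Both helpers are cheap relative to a forward pass, which is consistent with the abstract's claim of negligible training overhead.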

[AI-157] Depth-guided Texture Diffusion for Image Semantic Segmentation

Link: https://arxiv.org/abs/2408.09097
Authors: Wei Sun,Yuan Li,Qixiang Ye,Jianbin Jiao,Yanzhao Zhou
Keywords: Depth-guided Texture Diffusion, semantic segmentation, Depth, valuable insights, utilized to improve
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*Comments:

Abstract:Depth information provides valuable insights into the 3D structure, especially the outlines of objects, which can be utilized to improve semantic segmentation tasks. However, a naive fusion of depth information can disrupt features and compromise accuracy due to the modality gap between the depth and the vision. In this work, we introduce a Depth-guided Texture Diffusion approach that effectively tackles the outlined challenge. Our method extracts low-level features from edges and textures to create a texture image. This image is then selectively diffused across the depth map, enhancing structural information vital for precisely extracting object outlines. By integrating this enriched depth map with the original RGB image into a joint feature embedding, our method effectively bridges the disparity between the depth map and the image, enabling more accurate semantic segmentation. We conduct comprehensive experiments across diverse, commonly-used datasets spanning a wide range of semantic segmentation tasks, including Camouflaged Object Detection (COD), Salient Object Detection (SOD), and indoor semantic segmentation. With source-free estimated depth or depth captured by depth cameras, our method consistently outperforms existing baselines and achieves new state-of-the-art results, demonstrating the effectiveness of our Depth-guided Texture Diffusion for image semantic segmentation.

[AI-158] Research on color recipe recommendation based on unstructured data using TENN

Link: https://arxiv.org/abs/2408.09094
Authors: Seongsu Jhang,Donghwi Yoo,Jaeyong Kown
Keywords: Google BARD, language preprocessing methods, business models based, Microsoft copilot, OpenAI Chatgpt
Categories: Artificial Intelligence (cs.AI)
*Comments:

Abstract:Recently, services and business models based on large language models, such as OpenAI ChatGPT, Google Bard, and Microsoft Copilot, have been introduced, and applications utilizing natural language processing with deep learning are increasing. Conversion to machine language through tokenization, one of the natural language preprocessing methods, and processing of unstructured data are also increasing. Although algorithms that can understand and apply human language are becoming increasingly sophisticated, it is difficult to apply them to processes that rely on human emotions and senses in industries that still mainly deal with standardized data. In particular, in processes where brightness, saturation, and color information are essential, such as painting and injection molding, most small and medium-sized companies, excluding large corporations, rely on the tacit knowledge and sensibility of color mixers, and even customer companies often present non-standardized requirements. In this paper, we propose TENN to infer color recipes based on unstructured data with emotional natural language, and demonstrate it.

[AI-159] Linking Robustness and Generalization: A k* Distribution Analysis of Concept Clustering in Latent Space for Vision Models

Link: https://arxiv.org/abs/2408.09065
Authors: Shashank Kotyan,Pin-Yu Chen,Danilo Vasconcellos Vargas
Keywords: latent space, latent, space, vision models, assess latent space
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments:

Abstract:Most evaluations of vision models use indirect methods to assess latent space quality. These methods often involve adding extra layers to project the latent space into a new one. This projection makes it difficult to analyze and compare the original latent space. This article uses the k* Distribution, a local neighborhood analysis method, to examine the learned latent space at the level of individual concepts, which can be extended to examine the entire latent space. We introduce skewness-based true and approximate metrics for interpreting individual concepts to assess the overall quality of vision models’ latent space. Our findings indicate that current vision models frequently fracture the distributions of individual concepts within the latent space. Nevertheless, as these models improve in generalization across multiple datasets, the degree of fracturing diminishes. A similar trend is observed in robust vision models, where increased robustness correlates with reduced fracturing. Ultimately, this approach enables a direct interpretation and comparison of the latent spaces of different vision models and reveals a relationship between a model’s generalizability and robustness. Results show that as a model becomes more general and robust, it tends to learn features that result in better clustering of concepts. Project Website is available online at this https URL

[AI-160] Learning to Route for Dynamic Adapter Composition in Continual Learning with Language Models

Link: https://arxiv.org/abs/2408.09053
Authors: Vladimir Araujo,Marie-Francine Moens,Tinne Tuytelaars
Keywords: pre-trained language models, Parameter-efficient fine-tuning, language models, continual learning, PEFT
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*Comments:

Abstract:Parameter-efficient fine-tuning (PEFT) methods are increasingly used with pre-trained language models (PLMs) for continual learning (CL). These methods involve training a PEFT module for each new task and using similarity-based selection to route modules during inference. However, they face two major limitations: 1) interference with already learned modules and 2) suboptimal routing when composing modules. In this paper, we introduce a method that isolates the training of PEFT modules for task specialization. Then, before evaluation, it learns to compose the previously learned modules by training a router that leverages samples from a small memory. We evaluate our method in two CL setups using several benchmarks. Our results show that our method provides a better composition of PEFT modules, leading to better generalization and performance compared to previous methods.

[AI-161] Language Models Show Stable Value Orientations Across Diverse Role-Plays

Link: https://arxiv.org/abs/2408.09049
Authors: Bruce W. Lee,Yeongheon Lee,Hyunsoo Cho
Keywords: adopting diverse personas, large language models, revealing a persistent, prompted to assume, large language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*Comments:

Abstract:We demonstrate that large language models (LLMs) exhibit consistent value orientations despite adopting diverse personas, revealing a persistent inertia in their responses that remains stable across the variety of roles they are prompted to assume. To systematically explore this phenomenon, we introduce the role-play-at-scale methodology, which involves prompting LLMs with randomized, diverse personas and analyzing the macroscopic trend of their responses. Unlike previous works that simply feed these questions to LLMs as if testing human subjects, our role-play-at-scale methodology diagnoses inherent tendencies in a systematic and scalable manner by: (1) prompting the model to act in different random personas and (2) asking the same question multiple times for each random persona. This approach reveals consistent patterns in LLM responses across diverse role-play scenarios, indicating deeply encoded inherent tendencies. Our findings contribute to the discourse on value alignment in foundation models and demonstrate the efficacy of role-play-at-scale as a diagnostic tool for uncovering encoded biases in LLMs.

[AI-162] Keep Calm and Relax – HMI for Autonomous Vehicles

Link: https://arxiv.org/abs/2408.09046
Authors: Tima M. Yekta,Julius Schöning
Keywords: so-called autonomous vehicles, enhance passenger trust, so-called autonomous, human-machine interfaces, trust and comfort
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
*Comments: 14 pages, 3 figures, 1 table

Abstract:The growing popularity of self-driving, so-called autonomous vehicles has increased the need for human-machine interfaces (HMI) and user interaction (UI) to enhance passenger trust and comfort. While fallback drivers significantly influence the perceived trustfulness of self-driving vehicles, fallback drivers are an expensive solution that may not even improve vehicle safety in emergency situations. Based on a comprehensive literature review, this work delves into the potential of HMI and UI in enhancing trustfulness and emotion regulation in driverless vehicles. By analyzing the impact of various HMI and UI on passenger emotions, innovative and cost-effective concepts for improving human-vehicle interaction are conceptualized. To enable a trustful, highly comfortable, and safe ride, this work concludes by discussing whether HMI and UI are suitable for calming passengers down in emergencies, leading to smarter mobility for all.

[AI-163] Improving VTE Identification through Language Models from Radiology Reports: A Comparative Study of Mamba Phi-3 Mini and BERT

Link: https://arxiv.org/abs/2408.09043
Authors: Jamie Deng,Yusen Wu,Yelena Yesha,Phuong Nguyen
Keywords: critical cardiovascular condition, Venous thromboembolism, encompassing deep vein, deep vein thrombosis, cardiovascular condition
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Abstract:Venous thromboembolism (VTE) is a critical cardiovascular condition, encompassing deep vein thrombosis (DVT) and pulmonary embolism (PE). Accurate and timely identification of VTE is essential for effective medical care. This study builds upon our previous work, which addressed VTE detection using deep learning methods for DVT and a hybrid approach combining deep learning and rule-based classification for PE. Our earlier approaches, while effective, had two major limitations: they were complex and required expert involvement for feature engineering of the rule set. To overcome these challenges, we utilize the Mamba architecture-based classifier. This model achieves remarkable results, with a 97% accuracy and F1 score on the DVT dataset and a 98% accuracy and F1 score on the PE dataset. In contrast to the previous hybrid method on PE identification, the Mamba classifier eliminates the need for hand-engineered rules, significantly reducing model complexity while maintaining comparable performance. Additionally, we evaluated a lightweight Large Language Model (LLM), Phi-3 Mini, in detecting VTE. While this model delivers competitive results, outperforming the baseline BERT models, it proves to be computationally intensive due to its larger parameter set. Our evaluation shows that the Mamba-based model demonstrates superior performance and efficiency in VTE identification, offering an effective solution to the limitations of previous approaches.

[AI-164] On the Completeness of Conflict-Based Search: Temporally-Relative Duplicate Pruning

Link: https://arxiv.org/abs/2408.09028
Authors: Thayne T Walker,Nathan R Sturtevant
Keywords: Conflict-Based Search, multi-agent pathfinding, unsolvable problem instance, TRDP, run forever
Categories: Artificial Intelligence (cs.AI); Robotics (cs.RO)
*Comments: 9 pages, 4 figures, 2 tables

Abstract:The Conflict-Based Search (CBS) algorithm for the multi-agent pathfinding (MAPF) problem is incomplete for problems which have no solution; if no mitigating procedure is run in parallel, CBS will run forever when given an unsolvable problem instance. In this work, we introduce Temporally-Relative Duplicate Pruning (TRDP), a technique for duplicate detection and removal in both classic and continuous-time MAPF domains. TRDP is a simple procedure which closes the long-standing theoretic loophole of incompleteness for CBS by detecting and avoiding the expansion of duplicate states. TRDP is shown both theoretically and empirically to ensure termination without a significant impact on runtime in the majority of problem instances. In certain cases, TRDP is shown to increase performance significantly.

[AI-165] Efficient Autoregressive Audio Modeling via Next-Scale Prediction

Link: https://arxiv.org/abs/2408.09027
Authors: Kai Qiu,Xiang Li,Hao Chen,Jie Sun,Jinglu Wang,Zhe Lin,Marios Savvides,Bhiksha Raj
Keywords: sophisticated generative models, achieved remarkable progress, Audio generation, Fréchet Audio Distance
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*Comments: 7 pages, 6 figures, 7 tables

Abstract:Audio generation has achieved remarkable progress with the advance of sophisticated generative models, such as diffusion models (DMs) and autoregressive (AR) models. However, due to the naturally significant sequence length of audio, the efficiency of audio generation remains an essential issue to be addressed, especially for AR models that are incorporated in large language models (LLMs). In this paper, we analyze the token length of audio tokenization and propose a novel Scale-level Audio Tokenizer (SAT), with improved residual quantization. Based on SAT, a scale-level Acoustic AutoRegressive (AAR) modeling framework is further proposed, which shifts the next-token AR prediction to next-scale AR prediction, significantly reducing the training cost and inference time. To validate the effectiveness of the proposed approach, we comprehensively analyze design choices and demonstrate the proposed AAR framework achieves a remarkable 35× faster inference speed and +1.33 Fréchet Audio Distance (FAD) against baselines on the AudioSet benchmark. Code: this https URL.

[AI-166] Classifier-Free Guidance is a Predictor-Corrector

Link: https://arxiv.org/abs/2408.09000
Authors: Arwen Bradley,Preetum Nakkiran
Keywords: CFG, foundations of classifier-free, theoretical foundations, classifier-free guidance, shaky theoretical footing
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*Comments: AB and PN contributed equally

Abstract:We investigate the theoretical foundations of classifier-free guidance (CFG). CFG is the dominant method of conditional sampling for text-to-image diffusion models, yet unlike other aspects of diffusion, it remains on shaky theoretical footing. In this paper, we disprove common misconceptions, by showing that CFG interacts differently with DDPM (Ho et al., 2020) and DDIM (Song et al., 2021), and neither sampler with CFG generates the gamma-powered distribution $p(x|c)^{\gamma} p(x)^{1-\gamma}$. Then, we clarify the behavior of CFG by showing that it is a kind of predictor-corrector method (Song et al., 2020) that alternates between denoising and sharpening, which we call predictor-corrector guidance (PCG). We prove that in the SDE limit, CFG is actually equivalent to combining a DDIM predictor for the conditional distribution together with a Langevin dynamics corrector for a gamma-powered distribution (with a carefully chosen gamma). Our work thus provides a lens to theoretically understand CFG by embedding it in a broader design space of principled sampling methods.
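
For readers unfamiliar with CFG, the guided score that implementations commonly use can be sketched as follows. This is the standard linear combination of conditional and unconditional score (or noise) estimates, not code from the paper; the name `cfg_score` and the list-of-floats representation are illustrative assumptions. For exact scores of the clean data distribution, this combination equals the score of the gamma-powered density $p(x|c)^{\gamma} p(x)^{1-\gamma}$; the abstract's point is that running DDPM or DDIM with it across noise levels does not sample from that distribution.

```python
def cfg_score(score_uncond, score_cond, gamma):
    """Standard CFG combination: (1 - gamma) * uncond + gamma * cond.

    score_uncond: model estimate without the condition c
    score_cond:   model estimate with the condition c
    gamma:        guidance strength (gamma = 1 recovers the conditional estimate)
    """
    return [u + gamma * (c - u) for u, c in zip(score_uncond, score_cond)]
```

Setting `gamma` above 1 extrapolates past the conditional estimate, which is the "sharpening" regime typically used in text-to-image sampling.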

[AI-167] On the Undecidability of Artificial Intelligence Alignment: Machines that Halt

Link: https://arxiv.org/abs/2408.08995
Authors: Gabriel Adriano de Melo,Marcos Ricardo Omena De Albuquerque Maximo,Nei Yoshihiro Soma,Paulo Andre Lima de Castro
Keywords: arbitrary artificial intelligence, Turing Halting Problem, artificial intelligence, satisfices a non-trivial, non-trivial alignment function
Categories: Artificial Intelligence (cs.AI)
*Comments: Submitted for the Scientific Reports AI Alignment Collection

Abstract:The inner alignment problem, which asserts whether an arbitrary artificial intelligence (AI) model satisfices a non-trivial alignment function of its outputs given its inputs, is undecidable. This is rigorously proved by Rice’s theorem, which is also equivalent to a reduction to Turing’s Halting Problem, whose proof sketch is presented in this work. Nevertheless, there is an enumerable set of provenly aligned AIs that are constructed from a finite set of provenly aligned operations. Therefore, we argue that the alignment should be a guaranteed property from the AI architecture rather than a characteristic imposed post-hoc on an arbitrary AI model. Furthermore, while the outer alignment problem is the definition of a judge function that captures human values and preferences, we propose that such a function must also impose a halting constraint that guarantees that the AI model always reaches a terminal state in finite execution steps. Our work presents examples and models that illustrate this constraint and the intricate challenges involved, advancing a compelling case for adopting an intrinsically hard-aligned approach to AI systems architectures that ensures halting.

[AI-168] Ask Attend Attack: A Effective Decision-Based Black-Box Targeted Attack for Image-to-Text Models

Link: https://arxiv.org/abs/2408.08989
Authors: Qingyuan Zeng,Zhenzhong Wang,Yiu-ming Cheung,Min Jiang
Keywords: demonstrated significant advancements, attacks, targeted attacks, vision-language tasks
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Abstract:While image-to-text models have demonstrated significant advancements in various vision-language tasks, they remain susceptible to adversarial attacks. Existing white-box attacks on image-to-text models require access to the architecture, gradients, and parameters of the target model, resulting in low practicality. Although the recently proposed gray-box attacks have improved practicality, they suffer from semantic loss during the training process, which limits their targeted attack performance. To advance adversarial attacks of image-to-text models, this paper focuses on a challenging scenario: decision-based black-box targeted attacks where the attackers only have access to the final output text and aim to perform targeted attacks. Specifically, we formulate the decision-based black-box targeted attack as a large-scale optimization problem. To efficiently solve the optimization problem, a three-stage process Ask, Attend, Attack, called AAA, is proposed to coordinate with the solver. Ask guides attackers to create target texts that satisfy the specific semantics. Attend identifies the crucial regions of the image for attacking, thus reducing the search space for the subsequent Attack. Attack uses an evolutionary algorithm to attack the crucial regions, where the attacks are semantically related to the target texts of Ask, thus achieving targeted attacks without semantic loss. Experimental results on transformer-based and CNN+RNN-based image-to-text models confirmed the effectiveness of our proposed AAA.

[AI-169] ASGM-KG: Unveiling Alluvial Gold Mining Through Knowledge Graphs

Link: https://arxiv.org/abs/2408.08972
Authors: Debashis Gupta,Aditi Golder,Luis Fernendez,Miles Silman,Greg Lersen,Fan Yang,Bob Plemmons,Sarra Alqahtani,Paul Victor Pauca
Keywords: Small-Scale Gold Mining, highly destructive mining, destructive mining practice, Gold Mining, Artisanal and Small-Scale
Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*Comments:

Abstract:Artisanal and Small-Scale Gold Mining (ASGM) is a low-cost yet highly destructive mining practice, leading to environmental disasters across the world’s tropical watersheds. The topic of ASGM spans multiple domains of research and information, including natural and social systems, and knowledge is often atomized across a diversity of media and documents. We therefore introduce a knowledge graph (ASGM-KG) that consolidates and provides crucial information about ASGM practices and their environmental effects. The current version of ASGM-KG consists of 1,899 triples extracted using a large language model (LLM) from documents and reports published by both non-governmental and governmental organizations. These documents were carefully selected by a group of tropical ecologists with expertise in ASGM. This knowledge graph was validated using two methods. First, a small team of ASGM experts reviewed and labeled triples as factual or non-factual. Second, we devised and applied an automated factual reduction framework that relies on a search engine and an LLM for labeling triples. Our framework performs as well as five baselines on a publicly available knowledge graph and achieves over 90% accuracy on our ASGM-KG validated by domain experts. ASGM-KG demonstrates an advancement in knowledge aggregation and representation for complex, interdisciplinary environmental crises such as ASGM.

[AI-170] Differentiable Edge-based OPC

Link: https://arxiv.org/abs/2408.08969
Authors: Guojin Chen,Haoyu Yang,Haoxing Ren,Bei Yu,David Z. Pan
Keywords: Optical proximity correction, Optical proximity, proximity correction, integrated circuits, OPC
Categories: Artificial Intelligence (cs.AI); Optics (physics.optics)
*Comments: Accepted by ICCAD24

Abstract:Optical proximity correction (OPC) is crucial for pushing the boundaries of semiconductor manufacturing and enabling the continued scaling of integrated circuits. While pixel-based OPC, termed inverse lithography technology (ILT), has gained research interest due to its flexibility and precision, its complexity and intricate features can lead to challenges in mask writing, increased defects, and higher costs, hence hindering widespread industrial adoption. In this paper, we propose DiffOPC, a differentiable OPC framework that enjoys the virtue of both edge-based OPC and ILT. By employing a mask rule-aware gradient-based optimization approach, DiffOPC efficiently guides mask edge segment movement during mask optimization, minimizing wafer error by propagating true gradients from the cost function back to the mask edges. Our approach achieves lower edge placement error while reducing manufacturing cost by half compared to state-of-the-art OPC techniques, bridging the gap between the high accuracy of pixel-based OPC and the practicality required for industrial adoption, thus offering a promising solution for advanced semiconductor manufacturing.

[AI-171] Online SLA Decomposition: Enabling Real-Time Adaptation to Evolving Systems

Link: https://arxiv.org/abs/2408.08968
Authors: Cyril Shih-Huan Hsu,Danny De Vleeschauwer,Chrysa Papagianni
Keywords: Service Level Agreement, Service Level, Level Agreement, slice spans multiple, spans multiple domains
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments: The paper has been submitted to IEEE Networking Letters

Abstract:When a network slice spans multiple domains, each domain must uphold the End-to-End (E2E) Service Level Agreement (SLA). This requires decomposing the E2E SLA into partial SLAs for each domain. In a two-level network slicing management system with an E2E orchestrator and local controllers, we propose an online learning-decomposition framework that dynamically updates risk models using recent feedback. This approach utilizes online gradient descent and FIFO memory buffers to enhance stability and robustness. Our empirical study shows the proposed framework outperforms state-of-the-art static methods, offering more accurate and resilient SLA decomposition under varying conditions and sparse data.
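
The combination of online gradient descent with a FIFO memory buffer mentioned in this abstract can be illustrated with a toy risk model. Everything specific here is an assumption for the sake of the example: the class name `OnlineRiskModel`, the one-dimensional logistic model of acceptance probability, the buffer size, and the learning rate are not taken from the paper.

```python
import math
from collections import deque


class OnlineRiskModel:
    """Toy per-domain risk model: P(accept | x) = sigmoid(w * x + b)."""

    def __init__(self, buffer_size=32, learning_rate=0.5):
        self.w = 0.0
        self.b = 0.0
        self.learning_rate = learning_rate
        # FIFO buffer: once full, the oldest feedback is discarded automatically.
        self.buffer = deque(maxlen=buffer_size)

    def predict(self, x):
        return 1.0 / (1.0 + math.exp(-(self.w * x + self.b)))

    def update(self, x, accepted):
        """Store one feedback sample, then take one gradient step on the buffer."""
        self.buffer.append((x, 1.0 if accepted else 0.0))
        grad_w = grad_b = 0.0
        for xi, yi in self.buffer:
            error = self.predict(xi) - yi  # gradient of log-loss w.r.t. the logit
            grad_w += error * xi
            grad_b += error
        n = len(self.buffer)
        self.w -= self.learning_rate * grad_w / n
        self.b -= self.learning_rate * grad_b / n
```

Averaging the gradient over a bounded window of recent feedback, rather than using only the newest sample, is one plausible way such a buffer could add stability while still letting the model track a changing system.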

[AI-172] Adaptive Guardrails For Large Language Models via Trust Modeling and In-Context Learning

Link: https://arxiv.org/abs/2408.08959
Authors: Jinwei Hu,Yi Dong,Xiaowei Huang
Keywords: Large language models, maintain LLMs’ alignment, part of Large, Large language, language models
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*Comments: Under Review

Abstract:Guardrails have become an integral part of large language models (LLMs), moderating harmful or toxic responses in order to maintain LLMs’ alignment with human expectations. However, existing guardrail methods do not consider the different needs and access rights of individual users, and treat all users with the same rule. This study introduces an adaptive guardrail mechanism, supported by trust modeling and enhanced with in-context learning, to dynamically modulate access to sensitive content based on user trust metrics. By leveraging a combination of direct interaction trust and authority-verified trust, the system precisely tailors the strictness of content moderation to align with the user’s credibility and the specific context of their inquiries. Our empirical evaluations demonstrate that the adaptive guardrail effectively meets diverse user needs, outperforming existing guardrails in practicality while securing sensitive information and precisely managing potentially hazardous content through a context-aware knowledge base. This work is the first to introduce a trust-oriented concept within a guardrail system, offering a scalable solution that enriches the discourse on ethical deployment for next-generation LLMs.

[AI-173] A Factored MDP Approach To Moving Target Defense With Dynamic Threat Modeling and Cost Efficiency

Link: https://arxiv.org/abs/2408.08934
Authors: Megha Bose, Praveen Paruchuri, Akshat Kumar
Keywords: Moving Target Defense, Moving Target, evolving cyber threats, counteract evolving cyber, cyber threats
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Moving Target Defense (MTD) has emerged as a proactive and dynamic framework to counteract evolving cyber threats. Traditional MTD approaches often rely on assumptions about the attacker's knowledge and behavior. However, real-world scenarios are inherently more complex, with adaptive attackers and limited prior knowledge of their payoffs and intentions. This paper introduces a novel approach to MTD using a Markov Decision Process (MDP) model that does not rely on predefined attacker payoffs. Our framework integrates the attacker's real-time responses into the defender's MDP using a dynamic Bayesian Network. By employing a factored MDP model, we provide a comprehensive and realistic system representation. We also incorporate incremental updates to an attack response predictor as new data emerges. This ensures an adaptive and robust defense mechanism. Additionally, we consider the costs of switching configurations in MTD, integrating them into the reward structure to balance execution and defense costs. We first highlight the challenges of the problem through a theoretical negative result on regret. However, empirical evaluations demonstrate the framework's effectiveness in scenarios marked by high uncertainty and dynamically changing attack landscapes.

[AI-174] RoarGraph: A Projected Bipartite Graph for Efficient Cross-Modal Approximate Nearest Neighbor Search VLDB

Link: https://arxiv.org/abs/2408.08933
Authors: Meng Chen, Kai Zhang, Zhenying He, Yinan Jing, X. Sean Wang
Keywords: Approximate Nearest Neighbor, Approximate Nearest, language model-based applications, including recommendation systems, large language model-based
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments: to be published in PVLDB

Abstract:Approximate Nearest Neighbor Search (ANNS) is a fundamental and critical component in many applications, including recommendation systems and large language model-based applications. With the advancement of multimodal neural models, which transform data from different modalities into a shared high-dimensional space as feature vectors, cross-modal ANNS aims to use the data vector from one modality (e.g., texts) as the query to retrieve the most similar items from another (e.g., images or videos). However, there is an inherent distribution gap between embeddings from different modalities, and cross-modal queries become Out-of-Distribution (OOD) to the base data. Consequently, state-of-the-art ANNS approaches suffer poor performance for OOD workloads. In this paper, we quantitatively analyze the properties of the OOD workloads to gain an understanding of their ANNS efficiency. Unlike single-modal workloads, we reveal OOD queries spatially deviate from base data, and the k-nearest neighbors of an OOD query are distant from each other in the embedding space. The property breaks the assumptions of existing ANNS approaches and mismatches their design for efficient search. With insights from the OOD workloads, we propose pRojected bipartite Graph (RoarGraph), an efficient ANNS graph index built under the guidance of query distribution. Extensive experiments show that RoarGraph significantly outperforms state-of-the-art approaches on modern cross-modal datasets, achieving up to 3.56x faster search speed at a 90% recall rate for OOD queries.
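
The "90% recall rate" quoted in this abstract is the usual quality measure for ANNS: the fraction of the true k nearest neighbors that the approximate index returns. The sketch below shows how that number is typically computed against a brute-force ground truth. It is a generic illustration, not RoarGraph itself; the function names and the choice of squared Euclidean distance are assumptions.

```python
def squared_l2(a, b):
    # Squared Euclidean distance between two equal-length vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))


def exact_knn(query, base, k):
    # Brute-force ground truth: indices of the k base vectors closest to the query.
    order = sorted(range(len(base)), key=lambda i: squared_l2(query, base[i]))
    return order[:k]


def recall_at_k(approx_ids, exact_ids):
    # Fraction of the true neighbors that the approximate result recovered.
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)
```

For example, if the true top-2 neighbors of a query are items 0 and 1 and an index returns items 0 and 3, `recall_at_k([0, 3], [0, 1])` is 0.5. Speedups such as the one reported are only comparable when measured at the same recall level.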

[AI-175] Personalized Federated Collaborative Filtering: A Variational AutoEncoder Approach

Link: https://arxiv.org/abs/2408.08931
Authors: Zhiwei Li, Guodong Long, Tianyi Zhou, Jing Jiang, Chengqi Zhang
Keywords: Federated Collaborative Filtering, distributed Collaborative Filtering, Collaborative Filtering, emerging field focused, Federated Collaborative
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10 pages, 3 figures, 4 tables, conference

Abstract:Federated Collaborative Filtering (FedCF) is an emerging field focused on developing a new recommendation framework that preserves privacy in a federated setting. Existing FedCF methods typically combine distributed Collaborative Filtering (CF) algorithms with privacy-preserving mechanisms, and then preserve personalized information in a user embedding vector. However, the user embedding is usually insufficient to preserve the rich information of the fine-grained personalization across heterogeneous clients. This paper proposes a novel personalized FedCF method that preserves users’ personalized information in a latent variable and a neural model simultaneously. Specifically, we decompose the modeling of user knowledge into two encoders, each designed to capture shared knowledge and personalized knowledge separately. A personalized gating network is then applied to balance personalization and generalization between the global and local encoders. Moreover, to effectively train the proposed framework, we model the CF problem as a specialized Variational AutoEncoder (VAE) task by integrating user interaction vector reconstruction with missing value prediction. The decoder is trained to reconstruct the implicit feedback from items the user has interacted with, while also predicting items the user might be interested in but has not yet interacted with. Experimental results on benchmark datasets demonstrate that the proposed method outperforms other baseline methods, showcasing superior performance.

[AI-176] DePrompt: Desensitization and Evaluation of Personal Identifiable Information in Large Language Model Prompts

Link: https://arxiv.org/abs/2408.08930
Authors: Xiongtao Sun, Gan Liu, Zhipeng He, Hui Li, Xiaoguang Li
Keywords: widely impacting, impacting the accuracy, accuracy and interpretability, large language models, Prompt
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Prompts serve as a crucial link in interacting with large language models (LLMs), widely impacting the accuracy and interpretability of model outputs. However, acquiring accurate and high-quality responses necessitates precise prompts, which inevitably pose significant risks of personal identifiable information (PII) leakage. Therefore, this paper proposes DePrompt, a desensitization protection and effectiveness evaluation framework for prompts, enabling users to safely and transparently utilize LLMs. Specifically, by leveraging large model fine-tuning techniques as the underlying privacy protection method, we integrate contextual attributes to define privacy types, achieving high-precision PII entity identification. Additionally, through the analysis of key features in prompt desensitization scenarios, we devise adversarial generative desensitization methods that retain important semantic content while disrupting the link between identifiers and privacy attributes. Furthermore, we present utility evaluation metrics for prompts to better gauge and balance privacy and usability. Our framework is adaptable to prompts and can be extended to text usability-dependent scenarios. Through comparison with benchmarks and other model methods, experimental evaluations demonstrate that our desensitized prompts exhibit superior privacy protection utility and model inference results.

[AI-177] Imprecise Belief Fusion Facing a DST benchmark problem

Link: https://arxiv.org/abs/2408.08928
Authors: Francisco Aragão, João Alcântara
Keywords: Dempster-Shafer Theory, anomalous behavior, agents with equal, faced with anomalous, equal expertise
Categories: Artificial Intelligence (cs.AI)
Comments: 12 pages

Abstract:When we merge information in Dempster-Shafer Theory (DST), we are faced with anomalous behavior: agents with equal expertise and credibility can have their opinion disregarded after resorting to the belief combination rule of this theory. This problem is interesting because belief fusion is an inherent part of dealing with situations where available information is imprecise, as often occurs in Artificial Intelligence. We managed to identify an isomorphism between the DST formal apparatus and that of a Probabilistic Logic. Thus, we solved the problematic inputs affair by replacing the DST combination rule with a new fusion process aiming at eliminating anomalies proposed by that rule. We apply the new fusion method to the DST paradox Problem.
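
The kind of anomaly this abstract describes can be reproduced with Dempster's rule of combination, which multiplies masses of intersecting focal sets and renormalizes by the non-conflicting mass. The sketch below applies it to Zadeh's classic example; the abstract does not say which benchmark problem the paper uses, so the choice of example is an assumption, and the code shows only the standard rule, not the paper's replacement fusion process.

```python
def dempster_combine(m1, m2):
    """Combine two mass functions (dict: frozenset of hypotheses -> mass)."""
    combined = {}
    for set1, mass1 in m1.items():
        for set2, mass2 in m2.items():
            common = set1 & set2
            if common:  # an empty intersection counts as conflict and is dropped
                combined[common] = combined.get(common, 0.0) + mass1 * mass2
    agreement = sum(combined.values())  # equals 1 - K, where K is the conflict mass
    if agreement == 0.0:
        raise ValueError("total conflict: Dempster's rule is undefined")
    return {focal: mass / agreement for focal, mass in combined.items()}


# Zadeh-style inputs: two equally reliable agents who almost entirely disagree.
A, B, C = frozenset("A"), frozenset("B"), frozenset("C")
agent1 = {A: 0.99, B: 0.01}
agent2 = {C: 0.99, B: 0.01}
fused = dempster_combine(agent1, agent2)
```

Here the only non-empty intersection is `B` with `B`, so after normalization `fused` assigns all belief to `B`, the hypothesis both agents considered nearly impossible, while each agent's preferred hypothesis is discarded.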

[AI-178] VerilogCoder: Autonomous Verilog Coding Agents with Graph-based Planning and Abstract Syntax Tree (AST)-based Waveform Tracing Tool AAAI2025

Link: https://arxiv.org/abs/2408.08927
Authors: Chia-Tung Ho, Haoxing Ren, Brucek Khailany
Keywords: automating hardware design, modern Integrated Circuits, Circuit Relation Graph, growing complexity, complexity of modern
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: main paper 7 pages, reference 1 page, appendix 22 pages. It is under review of AAAI 2025

Abstract:Due to the growing complexity of modern Integrated Circuits (ICs), automating hardware design can remove a significant amount of human error from the engineering process and result in fewer errors. Verilog is a popular hardware description language for designing and modeling digital systems; thus, Verilog generation is one of the emerging areas of research to facilitate the design process. In this work, we propose VerilogCoder, a system of multiple Artificial Intelligence (AI) agents for Verilog code generation, to autonomously write Verilog code and fix syntax and functional errors using collaborative Verilog tools (i.e., syntax checker, simulator, and waveform tracer). Firstly, we propose a task planner that utilizes a novel Task and Circuit Relation Graph retrieval method to construct a holistic plan based on module descriptions. To debug and fix functional errors, we develop a novel and efficient abstract syntax tree (AST)-based waveform tracing tool, which is integrated within the autonomous Verilog completion flow. The proposed methodology successfully generates 94.2% syntactically and functionally correct Verilog code, surpassing the state-of-the-art methods by 33.9% on the VerilogEval-Human v2 benchmark.

[AI-179] Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risk of Language Models

Link: https://arxiv.org/abs/2408.08926
Authors: Andy K. Zhang, Neil Perry, Riya Dulepet, Eliot Jones, Justin W. Lin, Joey Ji, Celeste Menders, Gashon Hussein, Samantha Liu, Donovan Jasper, Pura Peetathawatchai, Ari Glenn, Vikram Sivashankar, Daniel Zamoshchin, Leo Glikbarg, Derek Askaryar, Mike Yang, Teddy Zhang, Rishi Alluri, Nathan Tran, Rinnara Sangpisit, Polycarpos Yiorkadjis, Kenny Osele, Gautham Raghupathi, Dan Boneh, Daniel E. Ho, Percy Liang
Keywords: autonomously identifying vulnerabilities, Language Model, real-world impact, capable of autonomously, autonomously identifying
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 86 pages, 7 figures

Abstract:Language Model (LM) agents for cybersecurity that are capable of autonomously identifying vulnerabilities and executing exploits have the potential to cause real-world impact. Policymakers, model providers, and other researchers in the AI and cybersecurity communities are interested in quantifying the capabilities of such agents to help mitigate cyberrisk and investigate opportunities for penetration testing. Toward that end, we introduce Cybench, a framework for specifying cybersecurity tasks and evaluating agents on those tasks. We include 40 professional-level Capture the Flag (CTF) tasks from 4 distinct CTF competitions, chosen to be recent, meaningful, and spanning a wide range of difficulties. Each task includes its own description, starter files, and is initialized in an environment where an agent can execute bash commands and observe outputs. Since many tasks are beyond the capabilities of existing LM agents, we introduce subtasks, which break down a task into intermediary steps for more gradated evaluation; we add subtasks for 17 of the 40 tasks. To evaluate agent capabilities, we construct a cybersecurity agent and evaluate 7 models: GPT-4o, Claude 3 Opus, Claude 3.5 Sonnet, Mixtral 8x22b Instruct, Gemini 1.5 Pro, Llama 3 70B Chat, and Llama 3.1 405B Instruct. Without guidance, we find that agents are able to solve only the easiest complete tasks that took human teams up to 11 minutes to solve, with Claude 3.5 Sonnet and GPT-4o having the highest success rates. Finally, subtasks provide more signal for measuring performance compared to unguided runs, with models achieving a 3.2% higher success rate on complete tasks with subtask-guidance than without subtask-guidance. All code and data are publicly available at this https URL

[AI-180] Retail-GPT: leveraging Retrieval Augmented Generation (RAG) for building E-commerce Chat Assistants

Link: https://arxiv.org/abs/2408.08925
Authors: Bruno Amaral Teixeira de Freitas, Roberto de Alencar Lotufo
Keywords: open-source RAG-based chatbot, RAG-based chatbot designed, work presents Retail-GPT, enhance user engagement, work presents
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: 5 pages, 4 figures

Abstract:This work presents Retail-GPT, an open-source RAG-based chatbot designed to enhance user engagement in retail e-commerce by guiding users through product recommendations and assisting with cart operations. The system is cross-platform and adaptable to various e-commerce domains, avoiding reliance on specific chat applications or commercial activities. Retail-GPT engages in human-like conversations, interprets user demands, checks product availability, and manages cart operations, aiming to serve as a virtual sales agent and test the viability of such assistants across different retail businesses.

[AI-181] Prefix Guidance: A Steering Wheel for Large Language Models to Defend Against Jailbreak Attacks

Link: https://arxiv.org/abs/2408.08924
Authors: Jiawei Zhao, Kejiang Chen, Xiaojian Yuan, Weiming Zhang
Keywords: large language models, achieved remarkable performance, recent years, rapid development, development of large
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:In recent years, large language models (LLMs) have developed rapidly and achieved remarkable performance across various tasks. However, research indicates that LLMs are vulnerable to jailbreak attacks, where adversaries can induce the generation of harmful content through meticulously crafted prompts. This vulnerability poses significant challenges to the secure use and promotion of LLMs. Existing defense methods offer protection from different perspectives but often suffer from insufficient effectiveness or a significant impact on the model’s capabilities. In this paper, we propose a plug-and-play and easy-to-deploy jailbreak defense framework, namely Prefix Guidance (PG), which guides the model to identify harmful prompts by directly setting the first few tokens of the model’s output. This approach combines the model’s inherent security capabilities with an external classifier to defend against jailbreak attacks. We demonstrate the effectiveness of PG across three models and five attack methods. Compared to baselines, our approach is generally more effective on average. Additionally, results on the Just-Eval benchmark further confirm PG’s superiority in preserving the model’s performance.

[AI-182] Graph Retrieval-Augmented Generation: A Survey

Link: https://arxiv.org/abs/2408.08921
Authors: Boci Peng, Yun Zhu, Yongchao Liu, Xiaohe Bo, Haizhou Shi, Chuntao Hong, Yan Zhang, Siliang Tang
Keywords: Large Language Models, Language Models, Large Language, achieved remarkable success, RAG refines LLM
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Ongoing work

Abstract:Recently, Retrieval-Augmented Generation (RAG) has achieved remarkable success in addressing the challenges of Large Language Models (LLMs) without necessitating retraining. By referencing an external knowledge base, RAG refines LLM outputs, effectively mitigating issues such as "hallucination", lack of domain-specific knowledge, and outdated information. However, the complex structure of relationships among different entities in databases presents challenges for RAG systems. In response, GraphRAG leverages structural information across entities to enable more precise and comprehensive retrieval, capturing relational knowledge and facilitating more accurate, context-aware responses. Given the novelty and potential of GraphRAG, a systematic review of current technologies is imperative. This paper provides the first comprehensive overview of GraphRAG methodologies. We formalize the GraphRAG workflow, encompassing Graph-Based Indexing, Graph-Guided Retrieval, and Graph-Enhanced Generation. We then outline the core technologies and training methods at each stage. Additionally, we examine downstream tasks, application domains, evaluation methodologies, and industrial use cases of GraphRAG. Finally, we explore future research directions to inspire further inquiries and advance progress in the field.

[AI-183] Supervised and Unsupervised Alignments for Spoofing Behavioral Biometrics

Link: https://arxiv.org/abs/2408.08918
Authors: Thomas Thebaud, Gaël Le Lan, Anthony Larcher
Keywords: high dimension representations, dimension representations called, Biometric recognition systems, representations called embeddings, behavioral biometric systems
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 11 pages, 4 figures, 5 tables, submission in progress

Abstract:Biometric recognition systems are security systems based on intrinsic properties of their users, usually encoded in high-dimensional representations called embeddings, whose potential theft would represent a greater threat than that of a temporary password or a replaceable key. To study the threat of embedding theft, we perform spoofing attacks on two behavioral biometric systems (an automatic speaker verification system and a handwritten digit analysis system) using a set of alignment techniques. Biometric recognition systems based on embeddings work in two phases: enrollment, where embeddings are collected and stored, and then authentication, when new embeddings are compared to the stored ones. The threat of stolen enrollment embeddings has been explored by the template reconstruction attack literature: reconstructing the original data to spoof an authentication system is doable with black-box access to their encoder. In this document, we explore the options available to perform template reconstruction attacks without any access to the encoder. To perform those attacks, we suppose general rules over the distribution of embeddings across encoders and use supervised and unsupervised algorithms to align an unlabeled set of embeddings with a set from a known encoder. The use of an alignment algorithm from the unsupervised translation literature gives promising results on spoofing two behavioral biometric systems.

[AI-184] Cyclic Supports in Recursive Bipolar Argumentation Frameworks: Semantics and LP Mapping

Link: https://arxiv.org/abs/2408.08916
Authors: Gianvincenzo Alfano, Sergio Greco, Francesco Parisi, Irina Trubitsyna
Keywords: Dung Abstract Argumentation, Abstract Argumentation Framework, Artificial Intelligence, Dung Abstract, Bipolar Argumentation Framework
Categories: Artificial Intelligence (cs.AI)
Comments: Paper presented at the 40th International Conference on Logic Programming (ICLP 2024), University of Texas at Dallas, USA, October 2024

Abstract:Dung’s Abstract Argumentation Framework (AF) has emerged as a key formalism for argumentation in Artificial Intelligence. It has been extended in several directions, including the possibility to express supports, leading to the development of the Bipolar Argumentation Framework (BAF), and recursive attacks and supports, resulting in the Recursive BAF (Rec-BAF). Different interpretations of supports have been proposed, whereas for Rec-BAF (where the target of attacks and supports may also be attacks and supports) even different semantics for attacks have been defined. However, the semantics of these frameworks have either not been defined in the presence of support cycles, or are often quite intricate in terms of the involved definitions. We encompass this limitation and present classical semantics for general BAF and Rec-BAF and show that the semantics for specific BAF and Rec-BAF frameworks can be defined by very simple and intuitive modifications of that defined for the case of AF. This is achieved by providing a modular definition of the sets of defeated and acceptable elements for each AF-based framework. We also characterize, in an elegant and uniform way, the semantics of general BAF and Rec-BAF in terms of logic programming and partial stable model semantics.

[AI-185] A Survey on Blockchain-based Supply Chain Finance with Progress and Future directions

Link: https://arxiv.org/abs/2408.08915
Authors: Zhengdong Luo
Keywords: Supply Chain Finance, Supply Chain, Chain Finance, Blockchain-based Supply Chain, Supply Chain Finance-related
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)

Abstract:Supply Chain Finance is very important for supply chain competition, as it is an important tool to activate the capital flow in the supply chain. Supply Chain Finance-related research can support multiple applications and services, such as providing accounts receivable financing, enhancing risk management, and optimizing supply chain management. For more than a decade, the development of Blockchain has attracted wide attention in various fields, especially in finance. With the characteristics of data tamper-proof, forgery-proof, cryptography, consensus verification, and decentralization, Blockchain fits well with the realistic needs of Supply Chain Finance, which requires data integrity, authenticity, privacy, and information sharing. Therefore, it is time to summarize the applications of Blockchain technology in the field of Supply Chain Finance. What Blockchain technology brings to Supply Chain Finance is not only to alleviate the problems of information asymmetry, credit disassembly, and financing cost, but also to improve Supply Chain Finance operations through smart contracts to intelligent Supply Chain Finance and in combination with other technologies, such as artificial intelligence, cloud computing, and data mining, jointly. So there has been some work in Blockchain-based Supply Chain Finance research for different Supply Chain Finance oriented applications, but most of this work is at the management level, proposing conceptual frameworks or simply using Blockchain without exploiting its deep applications. Moreover, there are few systematic reviews providing a comprehensive summary of current work in the area of Blockchain-based Supply Chain Finance. In this paper, we …

[AI-186] An Adaptive Differential Privacy Method Based on Federated Learning

Link: https://arxiv.org/abs/2408.08909
Authors: Zhiqiang Wang, Xinyue Yu, Qianli Huang, Yongguang Gong
Keywords: privacy budget, differential privacy method, privacy, Differential privacy, solve the problem
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)

Abstract:Differential privacy is one of the methods to solve the problem of privacy protection in federated learning. Setting the same privacy budget for each round will result in reduced accuracy in training. The existing methods for adjusting the privacy budget consider fewer influencing factors and tend to ignore the boundaries, resulting in unreasonable privacy budgets. Therefore, we propose an adaptive differential privacy method based on federated learning. The method sets the adjustment coefficient and scoring function according to accuracy, loss, training rounds, and the number of datasets and clients, and the privacy budget is adjusted based on them. Then the local model update is processed according to the scaling factor and the noise. Finally, the server aggregates the noised local model update and distributes the noised global model. The range of parameters and the privacy of the method are analyzed. Through the experimental evaluation, it can reduce the privacy budget by about 16%, while the accuracy remains roughly the same.
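
The step "the local model update is processed according to the scaling factor and the noise" can be illustrated with the common clip-then-add-Gaussian-noise pattern. This is a generic sketch and not the paper's method: the adaptive scoring function is not specified in the abstract, so the per-round budget `epsilon` is simply an input here, and the function names are hypothetical. The noise scale uses the classical Gaussian mechanism calibration, which is stated for `epsilon` below 1.

```python
import math
import random


def gaussian_sigma(clip_norm, epsilon, delta):
    # Classical Gaussian mechanism calibration for L2 sensitivity clip_norm.
    return clip_norm * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


def clip_update(update, clip_norm):
    # Scale the update down so that its L2 norm is at most clip_norm.
    norm = math.sqrt(sum(u * u for u in update))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [u * scale for u in update]


def privatize_update(update, clip_norm, epsilon, delta, rng):
    # Clip first (this bounds the sensitivity), then add per-coordinate noise.
    sigma = gaussian_sigma(clip_norm, epsilon, delta)
    clipped = clip_update(update, clip_norm)
    return [u + rng.gauss(0.0, sigma) for u in clipped]


# Usage: a client privatizes its update with this round's (assumed) budget.
rng = random.Random(0)
noisy = privatize_update([3.0, 4.0], clip_norm=1.0, epsilon=0.5, delta=1e-5, rng=rng)
```

Because the noise scale is inversely proportional to `epsilon`, spending a smaller budget in a given round means noisier updates, which is the accuracy trade-off that an adaptive schedule tries to manage.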

[AI-187] What should I wear to a party in a Greek taverna? Evaluation for Conversational Agents in the Fashion Domain KDD

Link: https://arxiv.org/abs/2408.08907
Authors: Antonis Maronikolakis, Ana Peleteiro Ramallo, Weiwei Cheng, Thomas Kober
Keywords: online fashion retail, enhancing customer experience, Large language models, poised to revolutionize, revolutionize the domain
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: Accepted at KDD workshop on Evaluation and Trustworthiness of Generative AI Models

Abstract:Large language models (LLMs) are poised to revolutionize the domain of online fashion retail, enhancing customer experience and discovery of fashion online. LLM-powered conversational agents introduce a new way of discovery by directly interacting with customers, enabling them to express themselves in their own ways, refine their needs, and obtain fashion and shopping advice that is relevant to their taste and intent. For many tasks in e-commerce, such as finding a specific product, conversational agents need to convert their interactions with a customer to a specific call to different backend systems, e.g., a search system to showcase a relevant set of products. Therefore, evaluating the capabilities of LLMs to perform those tasks related to calling other services is vital. However, those evaluations are generally complex, due to the lack of relevant and high quality datasets, and do not align seamlessly with business needs, amongst others. To this end, we created a multilingual evaluation dataset of 4k conversations between customers and a fashion assistant in a large e-commerce fashion platform to measure the capabilities of LLMs to serve as an assistant between customers and a backend engine. We evaluate a range of models, showcasing how our dataset scales to business needs and facilitates iterative development of tools.

[AI-188] Bundle Recommendation with Item-level Causation-enhanced Multi-view Learning

Link: https://arxiv.org/abs/2408.08906
Authors: Huy-Son Nguyen, Tuan-Nghia Bui, Long-Hai Nguyen, Hoang Manh-Hung, Cam-Van Thi Nguyen, Hoang-Quynh Le, Duc-Trong Le
Keywords: enhance business profitability, Bundle recommendation aims, aims to enhance, enhance business, business profitability
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract:Bundle recommendation aims to enhance business profitability and user convenience by suggesting a set of interconnected items. In real-world scenarios, leveraging the impact of asymmetric item affiliations is crucial for effective bundle modeling and understanding user preferences. To address this, we present BunCa, a novel bundle recommendation approach employing item-level causation-enhanced multi-view learning. BunCa provides comprehensive representations of users and bundles through two views: the Coherent View, leveraging the Multi-Prospect Causation Network for causation-sensitive relations among items, and the Cohesive View, employing LightGCN for information propagation among users and bundles. Modeling user preferences and bundle construction combined from both views ensures rigorous cohesion in direct user-bundle interactions through the Cohesive View and captures explicit intents through the Coherent View. Simultaneously, the integration of concrete and discrete contrastive learning optimizes the consistency and self-discrimination of multi-view representations. Extensive experiments with BunCa on three benchmark datasets demonstrate the effectiveness of this novel research and validate our hypothesis.

[AI-189] Audit-LLM: Multi-Agent Collaboration for Log-based Insider Threat Detection

Link: https://arxiv.org/abs/2408.08902
Authors: Chengyu Song, Linru Ma, Jianming Zheng, Jinzhi Liao, Hongyu Kuang, Lin Yang
Keywords: auditing log entries, Log-based insider threat, detects malicious user, insider threat detection, ITD
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 12 pages, 5 figures

Abstract:Log-based insider threat detection (ITD) detects malicious user activities by auditing log entries. Recently, large language models (LLMs) with strong common sense knowledge have emerged in the domain of ITD. Nevertheless, diverse activity types and overlong log files pose a significant challenge for LLMs in directly discerning malicious activities among myriads of normal ones. Furthermore, the faithfulness hallucination issue of LLMs makes them harder to apply in ITD, as the generated conclusion may not align with user commands and activity context. In response to these challenges, we introduce Audit-LLM, a multi-agent log-based insider threat detection framework comprising three collaborative agents: (i) the Decomposer agent, breaking down the complex ITD task into manageable sub-tasks using Chain-of-Thought (COT) reasoning; (ii) the Tool Builder agent, creating reusable tools for sub-tasks to overcome context length limitations in LLMs; and (iii) the Executor agent, generating the final detection conclusion by invoking constructed tools. To enhance conclusion accuracy, we propose a pair-wise Evidence-based Multi-agent Debate (EMAD) mechanism, where two independent Executors iteratively refine their conclusions through reasoning exchange to reach a consensus. Comprehensive experiments conducted on three publicly available ITD datasets (CERT r4.2, CERT r5.2, and PicoDomain) demonstrate the superiority of our method over existing baselines and show that the proposed EMAD significantly improves the faithfulness of explanations generated by LLMs.

[AI-190] Kov: Transferable and Naturalistic Black-Box LLM Attacks using Markov Decision Processes and Tree Search

Link: https://arxiv.org/abs/2408.08899
Authors: Robert J. Moss
Keywords: Eliciting harmful behavior, Eliciting harmful, important task, task to ensure, ensure the proper
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:Eliciting harmful behavior from large language models (LLMs) is an important task to ensure the proper alignment and safety of the models. Often when training LLMs, ethical guidelines are followed yet alignment failures may still be uncovered through red teaming adversarial attacks. This work frames the red-teaming problem as a Markov decision process (MDP) and uses Monte Carlo tree search to find harmful behaviors of black-box, closed-source LLMs. We optimize token-level prompt suffixes towards targeted harmful behaviors on white-box LLMs and include a naturalistic loss term, log-perplexity, to generate more natural language attacks for better interpretability. The proposed algorithm, Kov, trains on white-box LLMs to optimize the adversarial attacks and periodically evaluates responses from the black-box LLM to guide the search towards more harmful black-box behaviors. In our preliminary study, results indicate that we can jailbreak black-box models, such as GPT-3.5, in only 10 queries, yet fail on GPT-4 - which may indicate that newer models are more robust to token-level attacks. All work to reproduce these results is open sourced (this https URL).

[AI-191] Enhancing Exploratory Learning through Exploratory Search with the Emergence of Large Language Models

Link: https://arxiv.org/abs/2408.08894
Authors: Yiming Luo, Patrick Cheong-Iao, Shanton Chang
Keywords: large language models, learners find, challenging issue, confused learners, large language
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 11 pages, 7 figures

Abstract:In the information era, how learners find, evaluate, and effectively use information has become a challenging issue, especially with the added complexity of large language models (LLMs) that have further confused learners in their information retrieval and search activities. This study attempts to unpack this complexity by combining exploratory search strategies with the theories of exploratory learning to form a new theoretical model of exploratory learning from the perspective of students’ learning. Our work adapts Kolb’s learning model by incorporating high-frequency exploration and feedback loops, aiming to promote deep cognitive and higher-order cognitive skill development in students. Additionally, this paper discusses and suggests how advanced LLMs integrated into information retrieval and information theory can support students in their exploratory searches, contributing theoretically to promoting student-computer interaction and supporting their learning journeys in the new era with LLMs.

[AI-192] Leveraging Large Language Models for Enhanced Process Model Comprehension

Link: https://arxiv.org/abs/2408.08892
Authors: Humam Kourani, Alessandro Berti, Jasmin Henrich, Wolfgang Kratsch, Robin Weidlich, Chiao-Yun Li, Ahmad Arslan, Daniel Schuster, Wil M.P. van der Aalst
Keywords: Business Process Management, poses significant challenges, effectively comprehending process, Large Language Models, Process Management
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI)

Abstract:In Business Process Management (BPM), effectively comprehending process models is crucial yet poses significant challenges, particularly as organizations scale and processes become more complex. This paper introduces a novel framework utilizing the advanced capabilities of Large Language Models (LLMs) to enhance the interpretability of complex process models. We present different methods for abstracting business process models into a format accessible to LLMs, and we implement advanced prompting strategies specifically designed to optimize LLM performance within our framework. Additionally, we present a tool, AIPA, that implements our proposed framework and allows for conversational process querying. We evaluate our framework and tool by i) an automatic evaluation comparing different LLMs, model abstractions, and prompting strategies and ii) a user study designed to assess AIPA’s effectiveness comprehensively. Results demonstrate our framework’s ability to improve the accessibility and interpretability of process models, pioneering new pathways for integrating AI technologies into the BPM field.

[AI-193] SHARP-Net: A Refined Pyramid Network for Deficiency Segmentation in Culverts and Sewer Pipes

Link: https://arxiv.org/abs/2408.08879
Authors: Rasha Alshawi, Md Meftahul Ferdaus, Md Tamjidul Hoque, Kendall Niles, Ken Pathak, Steve Sloan, Mahdi Abdelguerfi
Keywords: Haar-Adaptive Refined Pyramid, Refined Pyramid Network, Refined Pyramid, Semantic Haar-Adaptive Refined, Haar-Adaptive Refined
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:This paper introduces Semantic Haar-Adaptive Refined Pyramid Network (SHARP-Net), a novel architecture for semantic segmentation. SHARP-Net integrates a bottom-up pathway featuring Inception-like blocks with varying filter sizes (3x3 and 5x5), parallel max-pooling, and additional spatial detection layers. This design captures multi-scale features and fine structural details. Throughout the network, depth-wise separable convolutions are used to reduce complexity. The top-down pathway of SHARP-Net focuses on generating high-resolution features through upsampling and information fusion using 1x1 and 3x3 depth-wise separable convolutions. We evaluated our model using our developed challenging Culvert-Sewer Defects dataset and the benchmark DeepGlobe Land Cover dataset. Our experimental evaluation demonstrated the base model’s (excluding Haar-like features) effectiveness in handling irregular defect shapes, occlusions, and class imbalances. It outperformed state-of-the-art methods, including U-Net, CBAM U-Net, ASCU-Net, FPN, and SegFormer, achieving average improvements of 14.4% and 12.1% on the Culvert-Sewer Defects and DeepGlobe Land Cover datasets, respectively, with IoU scores of 77.2% and 70.6%. Additionally, the training time was reduced. Furthermore, the integration of carefully selected and fine-tuned Haar-like features enhanced the performance of deep learning models by at least 20%. The proposed SHARP-Net, incorporating Haar-like features, achieved an impressive IoU of 94.75%, representing a 22.74% improvement over the base model. These features were also applied to other deep learning models, showing a 35.0% improvement, proving their versatility and effectiveness. SHARP-Net thus provides a powerful and efficient solution for accurate semantic segmentation in challenging real-world scenarios.
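
The abstract says depth-wise separable convolutions are used to reduce complexity. The short Python sketch below illustrates why, by counting weights for a standard convolution versus a depth-wise separable one. It is a generic illustration, not code from SHARP-Net: the channel sizes (64 in, 128 out) and the function names are assumptions chosen for the example, and bias terms are ignored.

```python
def standard_conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def separable_conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a depth-wise separable convolution (bias ignored)."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


if __name__ == "__main__":
    # Hypothetical layer shape: 64 input channels, 128 output channels.
    for k in (3, 5):
        std = standard_conv_params(k, 64, 128)
        sep = separable_conv_params(k, 64, 128)
        print(f"{k}x{k}: standard={std}, separable={sep}, ratio={std / sep:.1f}")
```

For the assumed 3x3 layer, the separable variant needs roughly 8.4 times fewer weights, and the saving grows with the filter size.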

[AI-194] Confronting the Reproducibility Crisis: A Case Study of Challenges in Cybersecurity AI

Link: https://arxiv.org/abs/2405.18753
Authors: Richard H. Moulton, Gary A. McCully, John D. Hastings
Keywords: rapidly evolving field, rapidly evolving, evolving field, maintaining the reliability, reliability and integrity
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 8 pages, 0 figures, 2 tables, updated to incorporate feedback and improvements

Abstract:In the rapidly evolving field of cybersecurity, ensuring the reproducibility of AI-driven research is critical to maintaining the reliability and integrity of security systems. This paper addresses the reproducibility crisis within the domain of adversarial robustness – a key area in AI-based cybersecurity that focuses on defending deep neural networks against malicious perturbations. Through a detailed case study, we attempt to validate results from prior work on certified robustness using the VeriGauge toolkit, revealing significant challenges due to software and hardware incompatibilities, version conflicts, and obsolescence. Our findings underscore the urgent need for standardized methodologies, containerization, and comprehensive documentation to ensure the reproducibility of AI models deployed in critical cybersecurity applications. By tackling these reproducibility challenges, we aim to contribute to the broader discourse on securing AI systems against advanced persistent threats, enhancing network and IoT security, and protecting critical infrastructure. This work advocates for a concerted effort within the research community to prioritize reproducibility, thereby strengthening the foundation upon which future cybersecurity advancements are built.

[AI-195] LEGENT: Open Platform for Embodied Agents (ACL 2024)

Link: https://arxiv.org/abs/2404.18243
Authors: Zhili Cheng, Zhitong Wang, Jinyi Hu, Shengding Hu, An Liu, Yuge Tu, Pengkai Li, Lei Shi, Zhiyuan Liu, Maosong Sun
Keywords: Large Language Models, Large Multimodal Models, hindering complex real-life, Large Language, Large Multimodal
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: ACL 2024 System Demonstration

Abstract:Despite advancements in Large Language Models (LLMs) and Large Multimodal Models (LMMs), their integration into language-grounded, human-like embodied agents remains incomplete, hindering complex real-life task performance in physical environments. Existing integrations often feature limited open sourcing, challenging collective progress in this field. We introduce LEGENT, an open, scalable platform for developing embodied agents using LLMs and LMMs. LEGENT offers a dual approach: a rich, interactive 3D environment with communicable and actionable agents, paired with a user-friendly interface, and a sophisticated data generation pipeline utilizing advanced algorithms to exploit supervision from simulated worlds at scale. In our experiments, an embryonic vision-language-action model trained on LEGENT-generated data surpasses GPT-4V in embodied tasks, showcasing promising generalization capabilities.

[AI-196] No Screening is More Efficient with Multiple Objects

Link: https://arxiv.org/abs/2408.10077
Authors: Shunya Noda, Genta Okada
Keywords: multiple heterogeneous objects, allocating multiple heterogeneous, heterogeneous objects, allocating multiple, multiple heterogeneous
Categories: Theoretical Economics (econ.TH); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)

Abstract:We study efficient mechanism design for allocating multiple heterogeneous objects. We aim to maximize the residual surplus, the total value generated from an allocation minus the costs for screening agents’ values. We discover a robust trend indicating that no-screening mechanisms such as serial dictatorship with exogenous priority order tend to perform better as the variety of goods increases. We analyze the underlying reasons by characterizing efficient mechanisms in a stylized environment. We also apply an automated mechanism design approach to numerically derive efficient mechanisms and validate the trend in general environments. Building on this implication, we propose the register-invite-book system (RIB) as an efficient system for scheduling vaccination against pandemic diseases.
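
The abstract names serial dictatorship with an exogenous priority order as a typical no-screening mechanism. A minimal Python sketch of that mechanism follows: agents are processed in a fixed order, and each takes their most preferred object that still has capacity. The agent names, slot names, and capacities are hypothetical, and the sketch does not model the paper's residual-surplus analysis or its RIB system.

```python
def serial_dictatorship(priority, preferences, capacity):
    """Allocate objects by letting agents pick in a fixed priority order.

    priority:    list of agents, highest priority first (exogenous order).
    preferences: dict mapping each agent to a list of objects, best first.
    capacity:    dict mapping each object to the number of available units.
    Returns a dict mapping each agent to an object, or None if unassigned.
    """
    remaining = dict(capacity)  # copy so the caller's dict is not mutated
    allocation = {}
    for agent in priority:
        allocation[agent] = None
        for obj in preferences[agent]:
            if remaining.get(obj, 0) > 0:
                remaining[obj] -= 1
                allocation[agent] = obj
                break
    return allocation


if __name__ == "__main__":
    # Hypothetical example: three agents compete for two appointment slots.
    result = serial_dictatorship(
        priority=["ann", "bob", "cat"],
        preferences={
            "ann": ["morning", "evening"],
            "bob": ["morning", "evening"],
            "cat": ["morning"],
        },
        capacity={"morning": 1, "evening": 1},
    )
    print(result)
```

No agent reports a valuation or pays a screening cost here; the order alone decides who chooses first.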

[AI-197] Preoperative Rotator Cuff Tear Prediction from Shoulder Radiographs using a Convolutional Block Attention Module-Integrated Neural Network

Link: https://arxiv.org/abs/2408.09894
Authors: Chris Hyunchul Jo, Jiwoong Yang, Byunghwan Jeon, Hackjoon Shim, Ikbeom Jang
Keywords: Research question, rotator cuff tears, plain shoulder radiograph, standard of care, rotator cuff
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Research question: We test whether a plain shoulder radiograph can be used together with deep learning methods to identify patients with rotator cuff tears, as opposed to using an MRI as in the standard of care. Findings: By integrating convolutional block attention modules into a deep neural network, our model demonstrates high accuracy in detecting patients with rotator cuff tears, achieving an average AUC of 0.889 and an accuracy of 0.831. Meaning: This study validates the efficacy of our deep learning model to accurately detect rotator cuff tears from radiographs, offering a viable pre-assessment or alternative to more expensive imaging techniques such as MRI.

[AI-198] Propagating the prior from shallow to deep with a pre-trained velocity-model Generative Transformer network

Link: https://arxiv.org/abs/2408.09767
Authors: Randy Harsuko, Shijun Cheng, Tariq Alkhalifah
Keywords: velocity model, velocity, discovery and exploration, model, goals in utilizing
Categories: Geophysics (physics.geo-ph); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)

Abstract:Building subsurface velocity models is essential to our goals in utilizing seismic data for Earth discovery and exploration, as well as monitoring. With the dawn of machine learning, these velocity models (or, more precisely, their distribution) can be stored accurately and efficiently in a generative model. These stored velocity model distributions can be utilized to regularize or quantify uncertainties in inverse problems, like full waveform inversion. However, most generators, like normalizing flows or diffusion models, treat the image (velocity model) uniformly, disregarding spatial dependencies and resolution changes with respect to the observation locations. To address this weakness, we introduce VelocityGPT, a novel implementation that utilizes Transformer decoders trained autoregressively to generate a velocity model from shallow subsurface to deep. Owing to the fact that seismic data are often recorded on the Earth’s surface, a top-down generator can utilize the inverted information in the shallow as guidance (prior) to generating the deep. To facilitate the implementation, we use an additional network to compress the velocity model. We also inject prior information, like well or structure (represented by a migration image) to generate the velocity model. Using synthetic data, we demonstrate the effectiveness of VelocityGPT as a promising approach in generative model applications for seismic velocity model building.

[AI-199] Deep Learning-based Machine Condition Diagnosis using Short-time Fourier Transformation Variants

Link: https://arxiv.org/abs/2408.09649
Authors: Eduardo Jr Piedad, Zherish Galvin Mayordo, Eduardo Prieto-Araujo, Oriol Gomis-Bellmunt
Keywords: vibration-based sensor data, electrical current signature, current signature serves, motor condition diagnosis, Short-time Fourier Transform
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
Comments: 4 pages, 6 images, submitted to 2024 International Conference on Diagnostics in Electrical Engineering (Diagnostika)

Abstract:In motor condition diagnosis, electrical current signature serves as an alternative feature to vibration-based sensor data, which is a more expensive and invasive method. Machine learning (ML) techniques have been emerging in diagnosing motor conditions using only motor phase current signals. This study converts time-series motor current signals to time-frequency 2D plots using Short-time Fourier Transform (STFT) methods. The motor current signal dataset consists of 3,750 sample points with five classes - one healthy and four synthetically-applied motor fault conditions, and with five loading conditions: 0, 25, 50, 75, and 100%. Five transformation methods are used on the dataset: non-overlap and overlap STFTs, non-overlap and overlap realigned STFTs, and synchrosqueezed STFT. Then, deep learning (DL) models based on the previous Convolutional Neural Network (CNN) architecture are trained and validated from generated plots of each method. The DL models of overlap-STFT, overlap R-STFT, non-overlap STFT, non-overlap R-STFT, and synchrosqueezed-STFT performed exceptionally with an average accuracy of 97.65, 96.03, 96.08, 96.32, and 88.27%, respectively. Four methods outperformed the previous best ML method with 93.20% accuracy, while all five outperformed previous 2D-plot-based methods with accuracy of 80.25, 74.80, and 82.80%, respectively, using the same dataset, same DL architecture, and validation steps.
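
To make the difference between non-overlap and overlap STFTs concrete, here is a minimal pure-Python sketch that computes STFT magnitudes with a Hann window and a naive DFT. It is an illustration only, not the study's pipeline: the test tone, frame length, and hop sizes are assumptions, and the realigned and synchrosqueezed variants mentioned in the abstract are not covered.

```python
import cmath
import math


def stft_magnitude(signal, frame_len, hop):
    """Return one magnitude spectrum (bins 0..frame_len/2) per frame.

    hop == frame_len gives non-overlapping frames;
    hop < frame_len gives overlapping frames.
    """
    # Periodic Hann window to reduce spectral leakage at frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        spectrum = []
        for k in range(frame_len // 2 + 1):
            # Naive DFT of the windowed frame at frequency bin k.
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames


if __name__ == "__main__":
    # Hypothetical test tone: 8 cycles per 64-sample frame, 256 samples long.
    tone = [math.sin(2 * math.pi * 8 * n / 64) for n in range(256)]
    non_overlap = stft_magnitude(tone, frame_len=64, hop=64)
    overlap = stft_magnitude(tone, frame_len=64, hop=32)
    print(len(non_overlap), "frames without overlap")
    print(len(overlap), "frames with 50% overlap")
```

Stacking the per-frame spectra as columns gives the time-frequency 2D plot; overlapping frames yield more columns for the same signal, at the cost of redundant computation.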

[AI-200] Exploring Wavelet Transformations for Deep Learning-based Machine Condition Diagnosis

Link: https://arxiv.org/abs/2408.09644
Authors: Eduardo Jr Piedad, Christian Ainsley Del Rosario, Eduardo Prieto-Araujo, Oriol Gomis-Bellmunt
Keywords: phase current signals, simply analyzing motor, analyzing motor phase, motor phase current, current signals
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments: 4 pages, 6 figures, submitted to 2024 International Conference on Diagnostics in Electrical Engineering (Diagnostika)

Abstract:Deep learning (DL) strategies have recently been utilized to diagnose motor faults by simply analyzing motor phase current signals, offering a less costly and non-intrusive alternative to vibration sensors. This research transforms these time-series current signals into time-frequency 2D representations via Wavelet Transform (WT). The dataset for motor current signals includes 3,750 data points across five categories: one representing normal conditions and four representing artificially induced faults, each under five different load conditions: 0, 25, 50, 75, and 100%. The study employs five WT-based techniques: WT-Amor, WT-Bump, WT-Morse, WSST-Amor, and WSST-Bump. Subsequently, five DL models adopting prior Convolutional Neural Network (CNN) architecture were developed and tested using the transformed 2D plots from each method. The DL models for WT-Amor, WT-Bump, and WT-Morse showed remarkable effectiveness with peak model accuracy of 90.93, 89.20, and 93.73%, respectively, surpassing previous 2D-image-based methods that recorded accuracy of 80.25, 74.80, and 82.80% respectively using the identical dataset and validation protocol. Notably, the WT-Morse approach slightly exceeded the formerly highest ML technique, achieving a 93.20% accuracy. However, the two WSST methods that utilized synchrosqueezing techniques faced difficulty accurately classifying motor faults. The performance of Wavelet-based deep learning methods offers a compelling alternative for machine condition monitoring.

[AI-201] Deformation-aware GAN for Medical Image Synthesis with Substantially Misaligned Pairs

Link: https://arxiv.org/abs/2408.09432
Authors: Bowen Xin, Tony Young, Claire E Wainwright, Tamara Blake, Leo Lebrat, Thomas Gaass, Thomas Benkert, Alto Stemmer, David Coman, Jason Dowling
Keywords: generates additional imaging, additional imaging modalities, synthesis generates additional, Medical image synthesis, image synthesis generates
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by MIDL 2024

Abstract:Medical image synthesis generates additional imaging modalities that are costly, invasive or harmful to acquire, which helps to facilitate the clinical workflow. When training pairs are substantially misaligned (e.g., lung MRI-CT pairs with respiratory motion), accurate image synthesis remains a critical challenge. Recent works explored the directional registration module to adjust misalignment in generative adversarial networks (GANs); however, substantial misalignment will lead to 1) suboptimal data mapping caused by correspondence ambiguity, and 2) degraded image fidelity caused by morphology influence on discriminators. To address the challenges, we propose a novel Deformation-aware GAN (DA-GAN) to dynamically correct the misalignment during the image synthesis based on multi-objective inverse consistency. Specifically, in the generative process, three levels of inverse consistency cohesively optimise symmetric registration and image generation for improved correspondence. In the adversarial process, to further improve image fidelity under misalignment, we design deformation-aware discriminators to disentangle the mismatched spatial morphology from the judgement of image fidelity. Experimental results show that DA-GAN achieved superior performance on a public dataset with simulated misalignments and a real-world lung MRI-CT dataset with respiratory motion misalignment. The results indicate the potential for a wide range of medical image synthesis tasks such as radiotherapy planning.

[AI-202] Fragment-Masked Molecular Optimization

Link: https://arxiv.org/abs/2408.09106
Authors: Kun Li, Xiantao Cai, Jia Wu, Bo Du, Wenbin Hu
Keywords: minimize side effects, drug discovery, aimed at refining, side effects, ultimately accelerating
Categories: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI)
Comments: 11 pages, 5 figures, 2 tables

Abstract:Molecular optimization is a crucial aspect of drug discovery, aimed at refining molecular structures to enhance drug efficacy and minimize side effects, ultimately accelerating the overall drug development process. Many target-based molecular optimization methods have been proposed, significantly advancing drug discovery. These methods rely primarily on understanding the specific drug target structures or their hypothesized roles in combating diseases. However, challenges such as a limited number of available targets and difficulty capturing clear structures hinder innovative drug development. In contrast, phenotypic drug discovery (PDD) does not depend on clear target structures and can identify hits with novel and unbiased polypharmacology signatures. As a result, PDD-based molecular optimization can reduce potential safety risks while optimizing phenotypic activity, thereby increasing the likelihood of clinical success. Therefore, we propose a fragment-masked molecular optimization method based on PDD (FMOP). FMOP employs a regression-free diffusion model to conditionally optimize the molecular masked regions without training, effectively generating new molecules with similar scaffolds. On the large-scale drug response dataset GDSCv2, we optimize the potential molecules across all 945 cell lines. The overall experiments demonstrate that the in-silico optimization success rate reaches 94.4%, with an average efficacy increase of 5.3%. Additionally, we conduct extensive ablation and visualization experiments, confirming that FMOP is an effective and robust molecular optimization method. The code is available at: https://anonymous.4open.science/r/FMOP-98C2.

[AI-203] mRNA2vec: mRNA Embedding with Language Model in the 5'UTR-CDS for mRNA Design

Link: https://arxiv.org/abs/2408.09048
Authors: Honggen Zhang, Xiangrui Gao, June Zhang, Lipeng Lai
Keywords: Messenger RNA, pharmaceutical industry, accelerating the discovery, drugs and revolutionizing, revolutionizing the pharmaceutical
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Messenger RNA (mRNA)-based vaccines are accelerating the discovery of new drugs and revolutionizing the pharmaceutical industry. However, selecting particular mRNA sequences for vaccines and therapeutics from extensive mRNA libraries is costly. Effective mRNA therapeutics require carefully designed sequences with optimized expression levels and stability. This paper proposes a novel contextual language model (LM)-based embedding method: mRNA2vec. In contrast to existing mRNA embedding approaches, our method is based on the self-supervised teacher-student learning framework of data2vec. We jointly use the 5’ untranslated region (UTR) and coding sequence (CDS) region as the input sequences. We adapt our LM-based approach specifically to mRNA by 1) considering the importance of location on the mRNA sequence with probabilistic masking, 2) using Minimum Free Energy (MFE) prediction and Secondary Structure (SS) classification as additional pretext tasks. mRNA2vec demonstrates significant improvements in translation efficiency (TE) and expression level (EL) prediction tasks in UTR compared to SOTA methods such as UTR-LM. It also gives a competitive performance in mRNA stability and protein production level tasks in CDS such as CodonBERT.

[AI-204] Adaptive Uncertainty Quantification for Generative AI

Link: https://arxiv.org/abs/2408.08990
Authors: Jungeum Kim, Sean O’Hagan, Veronika Rockova
Keywords: including generative, work is concerned, concerned with conformal, trained on data, black-box model
Categories: Methodology (stat.ME); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:This work is concerned with conformal prediction in contemporary applications (including generative AI) where a black-box model has been trained on data that are not accessible to the user. Mirroring split-conformal inference, we design a wrapper around a black-box algorithm which calibrates conformity scores. This calibration is local and proceeds in two stages by first adaptively partitioning the predictor space into groups and then calibrating sectionally group by group. Adaptive partitioning (self-grouping) is achieved by fitting a robust regression tree to the conformity scores on the calibration set. This new tree variant is designed in such a way that adding a single new observation does not change the tree fit with overwhelmingly large probability. This add-one-in robustness property allows us to conclude a finite sample group-conditional coverage guarantee, a refinement of the marginal guarantee. In addition, unlike traditional split-conformal inference, adaptive splitting and within-group calibration yields adaptive bands which can stretch and shrink locally. We demonstrate benefits of local tightening on several simulated as well as real examples using non-parametric regression. Finally, we consider two contemporary classification applications for obtaining uncertainty quantification around GPT-4o predictions. We conformalize skin disease diagnoses based on self-reported symptoms as well as predicted states of U.S. legislators based on summaries of their ideology. We demonstrate substantial local tightening of the uncertainty sets while attaining similar marginal coverage.
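
The two-stage idea in the abstract (partition the predictor space, then calibrate group by group) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the groups are given in advance rather than learned by the paper's robust regression tree, the conformity score is assumed to be an absolute residual, and all function names are hypothetical.

```python
import math


def conformal_quantile(scores, alpha):
    """Split-conformal threshold: the ceil((n + 1) * (1 - alpha))-th smallest score."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this coverage level
    return sorted(scores)[rank - 1]


def groupwise_thresholds(cal_scores, cal_groups, alpha):
    """Calibrate one threshold per group instead of a single global one."""
    by_group = {}
    for score, group in zip(cal_scores, cal_groups):
        by_group.setdefault(group, []).append(score)
    return {g: conformal_quantile(s, alpha) for g, s in by_group.items()}


def prediction_interval(point_prediction, group, thresholds):
    """Band around a black-box prediction; its width depends on the group."""
    q = thresholds[group]
    return point_prediction - q, point_prediction + q


if __name__ == "__main__":
    # Hypothetical calibration residuals |y - f(x)| for an easy and a hard group.
    scores = [0.1, 0.2, 0.3, 0.4, 1.0, 2.0, 3.0]
    groups = ["easy"] * 4 + ["hard"] * 3
    thresholds = groupwise_thresholds(scores, groups, alpha=0.5)
    print(prediction_interval(10.0, "easy", thresholds))
    print(prediction_interval(10.0, "hard", thresholds))
```

Because each group gets its own threshold, the band is narrow where calibration residuals are small and wide where they are large, which is the local stretching and shrinking the abstract describes.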

[AI-205] Why Do Experts Favor Solar and Wind as Renewable Energies Despite their Intermittency?

Link: https://arxiv.org/abs/2408.08910
Authors: Steven P. Reinhardt
Keywords: renewable energy generation, renewable energy, humanity accelerates, accelerates its shift, shift to renewable
Categories: Physics and Society (physics.soc-ph); Artificial Intelligence (cs.AI)
Comments: Shifted references from hyperlinks to academic style

Abstract:As humanity accelerates its shift to renewable energy generation, people who are not experts in renewable energy are learning about energy technologies and the energy market, which are complex. The answers to some questions will be obvious to expert practitioners but not to non-experts. One such question is why solar and wind generation are expected to supply the bulk of future energy when they are intermittent. We learn here that once the baseline hurdles of scalability to utility scale and the underlying resources being widely available globally are satisfied, the forecasted cost of solar and wind is 2-4X lower than competing technologies, even those that are not as scalable and available. The market views intermittency as surmountable.

[AI-206] U-MedSAM: Uncertainty-aware MedSAM for Medical Image Segmentation

Link: https://arxiv.org/abs/2408.08881
Authors: Xin Wang, Xiaoyu Liu, Peng Huang, Pu Huang, Shu Hu, Hongtu Zhu
Keywords: Medical Image Foundation, Image Foundation Models, Medical Image, Image Foundation, Foundation Models
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Medical Image Foundation Models have proven to be powerful tools for mask prediction across various datasets. However, accurately assessing the uncertainty of their predictions remains a significant challenge. To address this, we propose a new model, U-MedSAM, which integrates the MedSAM model with an uncertainty-aware loss function and the Sharpness-Aware Minimization (SharpMin) optimizer. The uncertainty-aware loss function automatically combines region-based, distribution-based, and pixel-based loss designs to enhance segmentation accuracy and robustness. SharpMin improves generalization by finding flat minima in the loss landscape, thereby reducing overfitting. Our method was evaluated in the CVPR24 MedSAM on Laptop challenge, where U-MedSAM demonstrated promising performance.
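
As a rough illustration of how a sharpness-aware update seeks flat minima, the Python sketch below applies the commonly described two-step Sharpness-Aware Minimization rule: move to nearby perturbed weights along the normalized gradient, then descend using the gradient measured there. This is a generic sketch, not U-MedSAM's optimizer code; the toy quadratic loss, learning rate, and radius `rho` are assumptions.

```python
import math


def sam_step(w, grad_fn, lr=0.05, rho=0.05):
    """One sharpness-aware update on a list of weights.

    1. Ascent: perturb w by rho along the normalized gradient.
    2. Descent: update the original w with the gradient at the perturbed point.
    """
    g = grad_fn(w)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return list(w)  # already at a stationary point
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    g_adv = grad_fn(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]


def toy_loss(w):
    """Hypothetical quadratic loss with one steep and one shallow direction."""
    return w[0] ** 2 + 10.0 * w[1] ** 2


def toy_grad(w):
    """Analytic gradient of toy_loss."""
    return [2.0 * w[0], 20.0 * w[1]]


if __name__ == "__main__":
    w = [1.0, 1.0]
    for _ in range(50):
        w = sam_step(w, toy_grad)
    print("final loss:", toy_loss(w))
```

Each step costs two gradient evaluations instead of one, which is the usual price of this family of optimizers.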

Computer Vision

[CV-0] Criticality Leveraged Adversarial Training (CLAT) for Boosted Performance via Parameter Efficiency

Link: https://arxiv.org/abs/2408.10204
Authors: Bhavna Gopal, Huanrui Yang, Jingyang Zhang, Mark Horton, Yiran Chen
Keywords: increased generalization errors, enhances neural network, neural network robustness, training enhances neural, Adversarial training
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages + appendix/additional experiments

Abstract:Adversarial training enhances neural network robustness but suffers from a tendency to overfit and increased generalization errors on clean data. This work introduces CLAT, an innovative approach that mitigates adversarial overfitting by introducing parameter efficiency into the adversarial training process, improving both clean accuracy and adversarial robustness. Instead of tuning the entire model, CLAT identifies and fine-tunes robustness-critical layers - those predominantly learning non-robust features - while freezing the remaining model to enhance robustness. It employs dynamic critical layer selection to adapt to changes in layer criticality throughout the fine-tuning process. Empirically, CLAT can be applied on top of existing adversarial training methods, significantly reduces the number of trainable parameters by approximately 95%, and achieves more than a 2% improvement in adversarial robustness compared to baseline methods.

[CV-1] SANER: Annotation-free Societal Attribute Neutralizer for Debiasing CLIP

Link: https://arxiv.org/abs/2408.10202
Authors: Yusuke Hirota, Min-Hung Chen, Chien-Yi Wang, Yuta Nakashima, Yu-Chiang Frank Wang, Ryo Hachiuma
Keywords: Large-scale vision-language models, Large-scale vision-language, gender and age, harmful societal bias, societal bias
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Large-scale vision-language models, such as CLIP, are known to contain harmful societal bias regarding protected attributes (e.g., gender and age). In this paper, we aim to address the problems of societal bias in CLIP. Although previous studies have proposed to mitigate societal bias through adversarial learning or test-time projection, our comprehensive study of these works identifies two critical limitations: 1) loss of attribute information when it is explicitly disclosed in the input and 2) use of the attribute annotations during the debiasing process. To mitigate societal bias in CLIP and overcome these limitations simultaneously, we introduce a simple-yet-effective debiasing method called SANER (societal attribute neutralizer) that eliminates attribute information from CLIP text features only for attribute-neutral descriptions. Experimental results show that SANER, which does not require attribute annotations and preserves original information for attribute-specific descriptions, demonstrates debiasing ability superior to that of existing methods.

[CV-2] MeshFormer: High-Quality Mesh Generation with 3D-Guided Reconstruction Model

Link: https://arxiv.org/abs/2408.10198
Authors: Minghua Liu, Chong Zeng, Xinyue Wei, Ruoxi Shi, Linghao Chen, Chao Xu, Mengqi Zhang, Zhaoning Wang, Xiaoshuai Zhang, Isabella Liu, Hongzhi Wu, Hao Su
Keywords: garnered significant attention, recently garnered significant, significant attention, recently garnered, garnered significant
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: 20 pages, 9 figures

Abstract:Open-world 3D reconstruction models have recently garnered significant attention. However, without sufficient 3D inductive bias, existing methods typically entail expensive training costs and struggle to extract high-quality 3D meshes. In this work, we introduce MeshFormer, a sparse-view reconstruction model that explicitly leverages 3D native structure, input guidance, and training supervision. Specifically, instead of using a triplane representation, we store features in 3D sparse voxels and combine transformers with 3D convolutions to leverage an explicit 3D structure and projective bias. In addition to sparse-view RGB input, we require the network to take input and generate corresponding normal maps. The input normal maps can be predicted by 2D diffusion models, significantly aiding in the guidance and refinement of the geometry’s learning. Moreover, by combining Signed Distance Function (SDF) supervision with surface rendering, we directly learn to generate high-quality meshes without the need for complex multi-stage training processes. By incorporating these explicit 3D biases, MeshFormer can be trained efficiently and deliver high-quality textured meshes with fine-grained geometric details. It can also be integrated with 2D diffusion models to enable fast single-image-to-3D and text-to-3D tasks. Project page: this https URL

[CV-3] SpaRP: Fast 3D Object Reconstruction and Pose Estimation from Sparse Views (ECCV 2024)

Link: https://arxiv.org/abs/2408.10195
Authors: Chao Xu,Ang Li,Linghao Chen,Yulin Liu,Ruoxi Shi,Hao Su,Minghua Liu
Keywords: attracted considerable attention, recently attracted considerable, generation has recently, considerable attention, recently attracted
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments: ECCV 2024

Abstract:Open-world 3D generation has recently attracted considerable attention. While many single-image-to-3D methods have yielded visually appealing outcomes, they often lack sufficient controllability and tend to produce hallucinated regions that may not align with users’ expectations. In this paper, we explore an important scenario in which the input consists of one or a few unposed 2D images of a single object, with little or no overlap. We propose a novel method, SpaRP, to reconstruct a 3D textured mesh and estimate the relative camera poses for these sparse-view images. SpaRP distills knowledge from 2D diffusion models and finetunes them to implicitly deduce the 3D spatial relationships between the sparse views. The diffusion model is trained to jointly predict surrogate representations for camera poses and multi-view images of the object under known poses, integrating all information from the input sparse views. These predictions are then leveraged to accomplish 3D reconstruction and pose estimation, and the reconstructed 3D model can be used to further refine the camera poses of input views. Through extensive experiments on three datasets, we demonstrate that our method not only significantly outperforms baseline methods in terms of 3D reconstruction quality and pose prediction accuracy but also exhibits strong efficiency. It requires only about 20 seconds to produce a textured mesh and camera poses for the input views. Project page: this https URL.

[CV-4] LongVILA: Scaling Long-Context Visual Language Models for Long Videos

Link: https://arxiv.org/abs/2408.10188
Authors: Fuzhao Xue,Yukang Chen,Dacheng Li,Qinghao Hu,Ligeng Zhu,Xiuyu Li,Yunhao Fang,Haotian Tang,Shang Yang,Zhijian Liu,Ethan He,Hongxu Yin,Pavlo Molchanov,Jan Kautz,Linxi Fan,Yuke Zhu,Yao Lu,Song Han
Keywords: Sequence Parallelism, Multi-Modal Sequence Parallelism, multi-modal foundation models, capability is critical, Hugging Face Transformers
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Code and models are available at this https URL

Abstract:Long-context capability is critical for multi-modal foundation models. We introduce LongVILA, a full-stack solution for long-context vision-language models, including system, model training, and dataset development. On the system side, we introduce the first Multi-Modal Sequence Parallelism (MM-SP) system that enables long-context training and inference, enabling 2M context length training on 256 GPUs. MM-SP is also efficient, being 2.1x - 5.7x faster than Ring-Style Sequence Parallelism and 1.1x - 1.4x faster than Megatron-LM in text-only settings. Moreover, it seamlessly integrates with Hugging Face Transformers. For model training, we propose a five-stage pipeline comprising alignment, pre-training, context extension, and long-short joint supervised fine-tuning. Regarding datasets, we meticulously construct large-scale visual language pre-training datasets and long video instruction-following datasets to support our multi-stage training process. The full-stack solution extends the feasible frame number of VILA by a factor of 128 (from 8 to 1024 frames) and improves long video captioning score from 2.00 to 3.26 (1.6x), achieving 99.5% accuracy in 1400-frames video (274k context length) needle in a haystack. LongVILA-8B also demonstrates a consistent improvement in performance on long videos within the VideoMME benchmark as the video frames increase.

[CV-5] Assessment of Spectral based Solutions for the Detection of Floating Marine Debris

Link: https://arxiv.org/abs/2408.10187
Authors: Muhammad Alì,Francesca Razzano,Sergio Vitale,Giampaolo Ferraioli,Vito Pascazio,Gilda Schirinzi,Silvia Ullo
Keywords: limited spatial coverage, huge human effort, marine debris relies, Marine Debris Archive, Marine Plastic Debris
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 3 figures, submitted and accepted for 2024 Second International Conference on Networks, Multimedia and Information Technology (NMITCON)

Abstract: Typically, the detection of marine debris relies on in-situ campaigns that are characterized by huge human effort and limited spatial coverage. Following the need for a rapid solution for the detection of floating plastic, methods based on remote sensing data have been proposed recently. Their main limitation is the lack of a general reference for evaluating performance. Recently, the Marine Debris Archive (MARIDA) has been released as a standard dataset to develop and evaluate Machine Learning (ML) algorithms for detection of Marine Plastic Debris. The MARIDA dataset was created to simplify the comparison between detection solutions, with the aim of stimulating research in the field of marine environment preservation. In this work, an assessment of spectral based solutions is proposed by evaluating performance on the MARIDA dataset. The outcome highlights the need for a precise reference for fair evaluation.

[CV-6] Imbalance-Aware Culvert-Sewer Defect Segmentation Using an Enhanced Feature Pyramid Network

Link: https://arxiv.org/abs/2408.10181
Authors: Rasha Alshawi,Md Meftahul Ferdaus,Mahdi Abdelguerfi,Kendall Niles,Ken Pathak,Steve Sloan
Keywords: Feature Pyramid Network, Enhanced Feature Pyramid, significant challenge, Imbalanced datasets, Pyramid Network
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Imbalanced datasets are a significant challenge in real-world scenarios. They lead to models that underperform on underrepresented classes, which is a critical issue in infrastructure inspection. This paper introduces the Enhanced Feature Pyramid Network (E-FPN), a deep learning model for the semantic segmentation of culverts and sewer pipes within imbalanced datasets. The E-FPN incorporates architectural innovations like sparsely connected blocks and depth-wise separable convolutions to improve feature extraction and handle object variations. To address dataset imbalance, the model employs strategies like class decomposition and data augmentation. Experimental results on the culvert-sewer defects dataset and a benchmark aerial semantic segmentation drone dataset show that the E-FPN outperforms state-of-the-art methods, achieving an average Intersection over Union (IoU) improvement of 13.8% and 27.2%, respectively. Additionally, class decomposition and data augmentation together boost the model’s performance by approximately 6.9% IoU. The proposed E-FPN presents a promising solution for enhancing object segmentation in challenging, multi-class real-world datasets, with potential applications extending beyond culvert-sewer defect detection.
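
The abstract above reports its gains in Intersection over Union (IoU). As a reading aid, here is a minimal sketch of per-class and mean IoU on flattened label maps. It is not the authors' evaluation code: the function names `per_class_iou` and `mean_iou` and the toy labels are assumptions made for illustration. The toy case shows why a rare class can pull mean IoU well below pixel accuracy on an imbalanced dataset.

```python
def per_class_iou(pred, target, num_classes):
    """Return one IoU per class; None when a class appears in neither map."""
    ious = []
    for c in range(num_classes):
        # Pixels labelled c in both maps (intersection) or in either map (union).
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        ious.append(inter / union if union else None)
    return ious


def mean_iou(pred, target, num_classes):
    """Average IoU over the classes that actually occur."""
    vals = [v for v in per_class_iou(pred, target, num_classes) if v is not None]
    return sum(vals) / len(vals)


# Hypothetical imbalanced example: class 1 covers only 2 of 10 pixels.
target = [0] * 8 + [1] * 2
pred = [0] * 9 + [1] * 1

pixel_accuracy = sum(p == t for p, t in zip(pred, target)) / len(target)
print(pixel_accuracy)                 # 9 of 10 pixels are correct
print(per_class_iou(pred, target, 2)) # the rare class scores far lower
print(mean_iou(pred, target, 2))
```

In this toy case one missed pixel of the rare class halves its IoU, while pixel accuracy barely moves.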

[CV-7] NeuRodin: A Two-stage Framework for High-Fidelity Neural Surface Reconstruction

Link: https://arxiv.org/abs/2408.10178
Authors: Yifan Wang,Di Huang,Weicai Ye,Guofeng Zhang,Wanli Ouyang,Tong He
Keywords: Signed Distance Function, Signed Distance, Distance Function, based volume rendering, demonstrated significant capabilities
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Signed Distance Function (SDF)-based volume rendering has demonstrated significant capabilities in surface reconstruction. Although promising, SDF-based methods often fail to capture detailed geometric structures, resulting in visible defects. By comparing SDF-based volume rendering to density-based volume rendering, we identify two main factors within the SDF-based approach that degrade surface quality: SDF-to-density representation and geometric regularization. These factors introduce challenges that hinder the optimization of the SDF field. To address these issues, we introduce NeuRodin, a novel two-stage neural surface reconstruction framework that not only achieves high-fidelity surface reconstruction but also retains the flexible optimization characteristics of density-based methods. NeuRodin incorporates innovative strategies that facilitate transformation of arbitrary topologies and reduce artifacts associated with density bias. Extensive evaluations on the Tanks and Temples and ScanNet++ datasets demonstrate the superiority of NeuRodin, showing strong reconstruction capabilities for both indoor and outdoor environments using solely posed RGB captures. Project website: this https URL

[CV-8] Fairness Under Cover: Evaluating the Impact of Occlusions on Demographic Bias in Facial Recognition (ECCV)

Link: https://arxiv.org/abs/2408.10175
Authors: Rafael M. Mamede,Pedro C. Neto,Ana F. Sequeira
Keywords: face recognition systems, Fairness Discrepancy Rate, face recognition, Face Occlusion Impact, face recognition models
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at ECCV Workshop FAILED

Abstract: This study investigates the effects of occlusions on the fairness of face recognition systems, particularly focusing on demographic biases. Using the Racial Faces in the Wild (RFW) dataset and synthetically added realistic occlusions, we evaluate their effect on the performance of face recognition models trained on the BUPT-Balanced and BUPT-GlobalFace datasets. We note increases in the dispersion of FMR, FNMR, and accuracy alongside decreases in fairness according to Equalized Odds, Demographic Parity, STD of Accuracy, and Fairness Discrepancy Rate. Additionally, we utilize a pixel attribution method to understand the importance of occlusions in model predictions, proposing a new metric, Face Occlusion Impact Ratio (FOIR), that quantifies the extent to which occlusions affect model performance across different demographic groups. Our results indicate that occlusions exacerbate existing demographic biases, with models placing higher importance on occlusions in an unequal fashion, particularly affecting African individuals more severely.
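
The abstract above refers to the dispersion of FMR (false match rate) and FNMR (false non-match rate) across demographic groups. The sketch below shows one plausible way to compute these quantities, assuming similarity scores with binary genuine/impostor labels and a fixed decision threshold. It is not the paper's protocol: the helper `fmr_fnmr`, the group names, the scores and the threshold are all hypothetical, and the paper's other fairness metrics (including FOIR) are not reproduced here.

```python
from statistics import pstdev


def fmr_fnmr(scores, labels, threshold):
    """FMR: share of impostor pairs accepted; FNMR: share of genuine pairs rejected."""
    impostor = [s for s, y in zip(scores, labels) if y == 0]
    genuine = [s for s, y in zip(scores, labels) if y == 1]
    fmr = sum(s >= threshold for s in impostor) / len(impostor)
    fnmr = sum(s < threshold for s in genuine) / len(genuine)
    return fmr, fnmr


# Hypothetical verification scores per group (label 1 = same identity).
groups = {
    "group_a": ([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]),
    "group_b": ([0.9, 0.4, 0.6, 0.2], [1, 1, 0, 0]),
}

rates = {name: fmr_fnmr(s, y, threshold=0.5) for name, (s, y) in groups.items()}
fmr_dispersion = pstdev(r[0] for r in rates.values())
fnmr_dispersion = pstdev(r[1] for r in rates.values())
print(rates)
print(fmr_dispersion, fnmr_dispersion)
```

A larger standard deviation across groups means the error rates differ more between groups at the same threshold, which is the kind of dispersion the abstract reports as increasing under occlusion.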

[CV-9] NeuFlow v2: High-Efficiency Optical Flow Estimation on Edge Devices

Link: https://arxiv.org/abs/2408.10161
Authors: Zhiyong Zhang,Aniket Gupta,Huaizu Jiang,Hanumant Singh
Keywords: Real-time high-accuracy optical, Real-time high-accuracy, high-accuracy optical flow, optical flow estimation, optical flow
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract: Real-time high-accuracy optical flow estimation is crucial for various real-world applications. While recent learning-based optical flow methods have achieved high accuracy, they often come with significant computational costs. In this paper, we propose a highly efficient optical flow method that balances high accuracy with reduced computational demands. Building upon NeuFlow v1, we introduce new components including a much more lightweight backbone and a fast refinement module. Both of these modules help keep the computational demands light while providing close to state-of-the-art accuracy. Compared to other state-of-the-art methods, our model achieves a 10x-70x speedup while maintaining comparable performance on both synthetic and real-world data. It is capable of running at over 20 FPS on 512x384 resolution images on a Jetson Orin Nano. The full training and evaluation code is available at this https URL.

[CV-10] LoopSplat: Loop Closure by Registering 3D Gaussian Splats

Link: https://arxiv.org/abs/2408.10154
Authors: Liyuan Zhu,Yue Li,Erik Sandström,Konrad Schindler,Iro Armeni
Keywords: Gaussian Splats, Simultaneous Localization, recently shown promise, recently shown, shown promise
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Project page: this https URL

Abstract: Simultaneous Localization and Mapping (SLAM) based on 3D Gaussian Splats (3DGS) has recently shown promise towards more accurate, dense 3D scene maps. However, existing 3DGS-based methods fail to address the global consistency of the scene via loop closure and/or global bundle adjustment. To this end, we propose LoopSplat, which takes RGB-D images as input and performs dense mapping with 3DGS submaps and frame-to-model tracking. LoopSplat triggers loop closure online and computes relative loop edge constraints between submaps directly via 3DGS registration, leading to improvements in efficiency and accuracy over traditional global-to-local point cloud registration. It uses a robust pose graph optimization formulation and rigidly aligns the submaps to achieve global consistency. Evaluation on the synthetic Replica and real-world TUM-RGBD, ScanNet, and ScanNet++ datasets demonstrates competitive or superior tracking, mapping, and rendering compared to existing methods for dense RGB-D SLAM. Code is available at this https URL.

[CV-11] Structure-preserving Image Translation for Depth Estimation in Colonoscopy Video (MICCAI 2024)

Link: https://arxiv.org/abs/2408.10153
Authors: Shuxian Wang,Akshay Paruchuri,Zhaoxi Zhang,Sarah McGill,Roni Sengupta
Keywords: colonoscopy video aims, unusual lighting properties, Monocular depth estimation, Monocular depth, colonoscopic environment
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 7 figures, accepted at MICCAI 2024

Abstract:Monocular depth estimation in colonoscopy video aims to overcome the unusual lighting properties of the colonoscopic environment. One of the major challenges in this area is the domain gap between annotated but unrealistic synthetic data and unannotated but realistic clinical data. Previous attempts to bridge this domain gap directly target the depth estimation task itself. We propose a general pipeline of structure-preserving synthetic-to-real (sim2real) image translation (producing a modified version of the input image) to retain depth geometry through the translation process. This allows us to generate large quantities of realistic-looking synthetic images for supervised depth estimation with improved generalization to the clinical domain. We also propose a dataset of hand-picked sequences from clinical colonoscopies to improve the image translation process. We demonstrate the simultaneous realism of the translated images and preservation of depth maps via the performance of downstream depth estimation on various datasets.

[CV-12] Multi-Scale Representation Learning for Image Restoration with State-Space Model

Link: https://arxiv.org/abs/2408.10145
Authors: Yuhong He,Long Peng,Qiaosi Yi,Chen Wu,Lu Wang
Keywords: computer vision systems, reconstruct a high-quality, degraded counterpart, vision systems, endeavors to reconstruct
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract: Image restoration endeavors to reconstruct a high-quality, detail-rich image from a degraded counterpart, which is a pivotal process in photography and various computer vision systems. In real-world scenarios, different types of degradation can cause the loss of image details at various scales and degrade image contrast. Existing methods predominantly rely on CNNs and Transformers to capture multi-scale representations. However, these methods are often limited by the high computational complexity of Transformers and the constrained receptive field of CNNs, which hinder them from achieving superior performance and efficiency in image restoration. To address these challenges, we propose a novel Multi-Scale State-Space Model-based network (MS-Mamba) for efficient image restoration that enhances the capacity for multi-scale representation learning through our proposed global and regional SSM modules. Additionally, an Adaptive Gradient Block (AGB) and a Residual Fourier Block (RFB) are proposed to improve the network’s detail extraction capabilities by capturing gradients in various directions and facilitating the learning of details in the frequency domain. Extensive experiments on nine public benchmarks across four classic image restoration tasks (image deraining, dehazing, denoising, and low-light enhancement) demonstrate that our proposed method achieves new state-of-the-art performance while maintaining low computational complexity. The source code will be publicly available.

[CV-13] R2-Mesh: Reinforcement Learning Powered Mesh Reconstruction via Geometry and Appearance Refinement

Link: https://arxiv.org/abs/2408.10135
Authors: Haoyang Wang,Liming Liu,Quanlu Jia,Jiangkai Wu,Haodan Zhang,Peiheng Wang,Xinggong Zhang
Keywords: Neural Radiance Fields, Neural Radiance, medical imaging due, facilitating real-time rendering, complex geometric structures
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Mesh reconstruction based on Neural Radiance Fields (NeRF) is popular in a variety of applications such as computer graphics, virtual reality, and medical imaging due to its efficiency in handling complex geometric structures and facilitating real-time rendering. However, existing works often fail to capture fine geometric details accurately and struggle with optimizing rendering quality. To address these challenges, we propose a novel algorithm that progressively generates and optimizes meshes from multi-view images. Our approach initiates with the training of a NeRF model to establish an initial Signed Distance Field (SDF) and a view-dependent appearance field. Subsequently, we iteratively refine the SDF through a differentiable mesh extraction method, continuously updating both the vertex positions and their connectivity based on the loss from mesh differentiable rasterization, while also optimizing the appearance representation. To further leverage high-fidelity and detail-rich representations from NeRF, we propose an online-learning strategy based on Upper Confidence Bound (UCB) to enhance viewpoints by adaptively incorporating images rendered by the initial NeRF model into the training dataset. Through extensive experiments, we demonstrate that our method delivers highly competitive and robust performance in both mesh rendering quality and geometric quality.
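
The abstract above mentions an online-learning strategy based on Upper Confidence Bound (UCB) for choosing additional viewpoints. The abstract does not give the exact reward or formulation, so the sketch below is only the textbook UCB1 rule applied to a hypothetical set of candidate viewpoints. The names `ucb1_select` and `update`, the exploration constant `c`, and the reward values are assumptions for illustration.

```python
import math


def ucb1_select(counts, means, c=math.sqrt(2)):
    """Pick the arm with the highest mean-plus-exploration-bonus score (UCB1)."""
    # Try every arm once before trusting any estimate.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    total = sum(counts)
    scores = [m + c * math.sqrt(math.log(total) / n) for m, n in zip(means, counts)]
    return max(range(len(scores)), key=scores.__getitem__)


def update(counts, means, arm, reward):
    """Update the pull count and running mean reward of one arm in place."""
    counts[arm] += 1
    means[arm] += (reward - means[arm]) / counts[arm]


# Hypothetical use: each arm is a candidate viewpoint, and the reward is some
# measure of how much a rendered image from that viewpoint helped training.
counts = [0, 0, 0]
means = [0.0, 0.0, 0.0]
fake_rewards = [0.2, 0.8, 0.5]  # stand-in for a real, noisy training signal
for _ in range(30):
    arm = ucb1_select(counts, means)
    update(counts, means, arm, fake_rewards[arm])
print(counts, means)
```

The bonus term shrinks as a viewpoint is sampled more often, so the rule keeps revisiting under-sampled viewpoints while favouring those that have paid off so far.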

[CV-14] Perceptual Depth Quality Assessment of Stereoscopic Omnidirectional Images

Link: https://arxiv.org/abs/2408.10134
Authors: Wei Zhou,Zhou Wang
Keywords: immersive virtual reality, Depth perception plays, depth quality, depth quality assessment, quality
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Image and Video Processing (eess.IV)
Comments: Accepted by IEEE TCSVT

Abstract:Depth perception plays an essential role in the viewer experience for immersive virtual reality (VR) visual environments. However, previous research investigations in the depth quality of 3D/stereoscopic images are rather limited, and in particular, are largely lacking for 3D viewing of 360-degree omnidirectional content. In this work, we make one of the first attempts to develop an objective quality assessment model named depth quality index (DQI) for efficient no-reference (NR) depth quality assessment of stereoscopic omnidirectional images. Motivated by the perceptual characteristics of the human visual system (HVS), the proposed DQI is built upon multi-color-channel, adaptive viewport selection, and interocular discrepancy features. Experimental results demonstrate that the proposed method outperforms state-of-the-art image quality assessment (IQA) and depth quality assessment (DQA) approaches in predicting the perceptual depth quality when tested using both single-viewport and omnidirectional stereoscopic image databases. Furthermore, we demonstrate that combining the proposed depth quality model with existing IQA methods significantly boosts the performance in predicting the overall quality of 3D omnidirectional images.

[CV-15] UNINEXT-Cutie: The 1st Solution for LSVOS Challenge RVOS Track

Link: https://arxiv.org/abs/2408.10129
Authors: Hao Fang,Feiyu Pan,Xiankai Lu,Wei Zhang,Runmin Cong
Keywords: LSVOS Challenge RVOS, Challenge RVOS Track, natural language expressions, video object segmentation, RVOS
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract: Referring video object segmentation (RVOS) relies on natural language expressions to segment target objects in video. This year, the LSVOS Challenge RVOS Track replaced the original YouTube-RVOS benchmark with MeViS. MeViS focuses on referring to the target object in a video through its motion descriptions instead of static attributes, posing a greater challenge to the RVOS task. In this work, we integrate the strengths of leading RVOS and VOS models to build a simple and effective pipeline for RVOS. First, we finetune the state-of-the-art RVOS model to obtain mask sequences that are correlated with language descriptions. Second, based on reliable and high-quality key frames, we leverage a VOS model to enhance the quality and temporal consistency of the mask results. Finally, we further improve the performance of the RVOS model using semi-supervised learning. Our solution achieved 62.57 J&F on the MeViS test set and ranked 1st in the 6th LSVOS Challenge RVOS Track.

[CV-16] Video Object Segmentation via SAM 2: The 4th Solution for LSVOS Challenge VOS Track

Link: https://arxiv.org/abs/2408.10125
Authors: Feiyu Pan,Hao Fang,Runmin Cong,Wei Zhang,Xiankai Lu
Keywords: Video Object Segmentation, entire video sequence, object instance, object mask, Object Segmentation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: arXiv admin note: substantial text overlap with arXiv:2408.00714

Abstract: The Video Object Segmentation (VOS) task aims to segment a particular object instance throughout the entire video sequence given only the object mask of the first frame. Recently, Segment Anything Model 2 (SAM 2) was proposed, a foundation model towards solving promptable visual segmentation in images and videos. SAM 2 builds a data engine, which improves model and data via user interaction, to collect the largest video segmentation dataset to date. SAM 2 is a simple transformer architecture with streaming memory for real-time video processing, which, trained on this data, provides strong performance across a wide range of tasks. In this work, we evaluate the zero-shot performance of SAM 2 on the more challenging VOS datasets MOSE and LVOS. Without fine-tuning on the training set, SAM 2 achieved 75.79 J&F on the test set and ranked 4th in the 6th LSVOS Challenge VOS Track.

[CV-17] Learning Precise Affordances from Egocentric Videos for Robotic Manipulation

Link: https://arxiv.org/abs/2408.10123
Authors: Gen Li,Nikolaos Tsagkas,Jifei Song,Ruaridh Mon-Williams,Sethu Vijayakumar,Kun Shao,Laura Sevilla-Lara
Keywords: potential actions, crucial for robotic, Affordance, GKT, Depth Feature Injector
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Affordance, defined as the potential actions that an object offers, is crucial for robotic manipulation tasks. A deep understanding of affordance can lead to more intelligent AI systems. For example, such knowledge directs an agent to grasp a knife by the handle for cutting and by the blade when passing it to someone. In this paper, we present a streamlined affordance learning system that encompasses data collection, effective model training, and robot deployment. First, we collect training data from egocentric videos in an automatic manner. Different from previous methods that focus only on the object graspable affordance and represent it as coarse heatmaps, we cover both graspable (e.g., object handles) and functional affordances (e.g., knife blades, hammer heads) and extract data with precise segmentation masks. We then propose an effective model, termed Geometry-guided Affordance Transformer (GKT), to train on the collected data. GKT integrates an innovative Depth Feature Injector (DFI) to incorporate 3D shape and geometric priors, enhancing the model’s understanding of affordances. To enable affordance-oriented manipulation, we further introduce Aff-Grasp, a framework that combines GKT with a grasp generation model. For comprehensive evaluation, we create an affordance evaluation dataset with pixel-wise annotations, and design real-world tasks for robot experiments. The results show that GKT surpasses the state-of-the-art by 15.9% in mIoU, and Aff-Grasp achieves high success rates of 95.5% in affordance prediction and 77.1% in successful grasping among 179 trials, including evaluations with seen, unseen objects, and cluttered scenes.

[CV-18] Factorized-Dreamer: Training A High-Quality Video Generator with Limited and Low-Quality Data

Link: https://arxiv.org/abs/2408.10119
Authors: Tao Yang,Yangming Shi,Yunwen Huang,Feng Chen,Yin Zheng,Lei Zhang
Keywords: gained significant attention, significant attention due, enhancement and translation, gained significant, wide applications
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract: Text-to-video (T2V) generation has gained significant attention due to its wide applications to video generation, editing, enhancement and translation, etc. However, high-quality (HQ) video synthesis is extremely challenging because of the diverse and complex motions that exist in the real world. Most existing works struggle to address this problem by collecting large-scale HQ videos, which are inaccessible to the community. In this work, we show that publicly available limited and low-quality (LQ) data are sufficient to train an HQ video generator without recaptioning or finetuning. We factorize the whole T2V generation process into two steps: generating an image conditioned on a highly descriptive caption, and synthesizing the video conditioned on the generated image and a concise caption of motion details. Specifically, we present Factorized-Dreamer, a factorized spatiotemporal framework with several critical designs for T2V generation, including an adapter to combine text and image embeddings, a pixel-aware cross attention module to capture pixel-level image information, a T5 text encoder to better understand motion description, and a PredictNet to supervise optical flows. We further present a noise schedule, which plays a key role in ensuring the quality and stability of video generation. Our model lowers the requirements for detailed captions and HQ videos, and can be directly trained on limited LQ datasets with noisy and brief captions such as WebVid-10M, largely alleviating the cost of collecting large-scale HQ video-text pairs. Extensive experiments in a variety of T2V and image-to-video generation tasks demonstrate the effectiveness of our proposed Factorized-Dreamer. Our source code is available at this https URL.

[CV-19] Modelling the Distribution of Human Motion for Sign Language Assessment (ECCV 2024)

Link: https://arxiv.org/abs/2408.10073
Authors: Oliver Cory,Ozge Mercanoglu Sincan,Matthew Vowels,Alessia Battisti,Franz Holzknecht,Katja Tissi,Sandra Sidler-Miserez,Tobias Haug,Sarah Ebling,Richard Bowden
Keywords: Sign Language Assessment, assess Sign Languages, Sign Language, Language Assessment, Sign
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to Twelfth International Workshop on Assistive Computer Vision and Robotics at ECCV 2024

Abstract: Sign Language Assessment (SLA) tools are useful to aid in language learning and are underdeveloped. Previous work has focused on isolated signs or comparison against a single reference video to assess Sign Languages (SL). This paper introduces a novel SLA tool designed to evaluate the comprehensibility of SL by modelling the natural distribution of human motion. We train our pipeline on data from native signers and evaluate it using SL learners. We compare our results to ratings from a human rater study and find a strong correlation between human ratings and our tool. We visually demonstrate our tool's ability to detect anomalous results spatio-temporally, providing actionable feedback to aid in SL learning and assessment.

[CV-20] FFAA: Multimodal Large Language Model based Explainable Open-World Face Forgery Analysis Assistant

Link: https://arxiv.org/abs/2408.10072
Authors: Zhengchao Huang,Bin Xia,Zicheng Lin,Zhun Mou,Wenming Yang
Keywords: widespread public concern, sparked widespread public, public information security, face forgery analysis, face forgery
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 17 pages, 18 figures; project page: this https URL

Abstract: The rapid advancement of deepfake technologies has sparked widespread public concern, particularly as face forgery poses a serious threat to public information security. However, the unknown and diverse forgery techniques, varied facial features and complex environmental factors pose significant challenges for face forgery analysis. Existing datasets lack descriptions of these aspects, making it difficult for models to distinguish between real and forged faces using only visual information amid various confounding factors. In addition, existing methods do not yield user-friendly and explainable results, complicating the understanding of the model’s decision-making process. To address these challenges, we introduce a novel Open-World Face Forgery Analysis VQA (OW-FFA-VQA) task and the corresponding benchmark. To tackle this task, we first establish a dataset featuring a diverse collection of real and forged face images with essential descriptions and reliable forgery reasoning. Based on this dataset, we introduce FFAA: Face Forgery Analysis Assistant, consisting of a fine-tuned Multimodal Large Language Model (MLLM) and Multi-answer Intelligent Decision System (MIDS). By integrating hypothetical prompts with MIDS, the impact of fuzzy classification boundaries is effectively mitigated, enhancing the model’s robustness. Extensive experiments demonstrate that our method not only provides user-friendly explainable results but also significantly boosts accuracy and robustness compared to previous methods.

[CV-21] LNQ 2023 challenge: Benchmark of weakly-supervised techniques for mediastinal lymph node quantification

Link: https://arxiv.org/abs/2408.10069
Authors: Reuben Dorent,Roya Khajavi,Tagwa Idris,Erik Ziegler,Bhanusupriya Somarouthu,Heather Jacene,Ann LaCasce,Jonathan Deissler,Jan Ehrhardt,Sofija Engelson,Stefan M. Fischer,Yun Gu,Heinz Handels,Satoshi Kasai,Satoshi Kondo,Klaus Maier-Hein,Julia A. Schnabel,Guotai Wang,Litingyu Wang,Tassilo Wald,Guang-Zhong Yang,Hanxiao Zhang,Minghui Zhang,Steve Pieper,Gordon Harris,Ron Kikinis,Tina Kapur
Keywords: monitoring treatment response, Accurate assessment, lymph node size, lymph node, Lymph Node Quantification
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to MELBA

Abstract: Accurate assessment of lymph node size in 3D CT scans is crucial for cancer staging, therapeutic management, and monitoring treatment response. Existing state-of-the-art segmentation frameworks in medical imaging often rely on fully annotated datasets. However, for lymph node segmentation, these datasets are typically small due to the extensive time and expertise required to annotate the numerous lymph nodes in 3D CT scans. Weakly-supervised learning, which leverages incomplete or noisy annotations, has recently gained interest in the medical imaging community as a potential solution. Despite the variety of weakly-supervised techniques proposed, most have been validated only on private datasets or small publicly available datasets. To address this limitation, the Mediastinal Lymph Node Quantification (LNQ) challenge was organized in conjunction with the 26th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2023). This challenge aimed to advance weakly-supervised segmentation methods by providing a new, partially annotated dataset and a robust evaluation framework. A total of 16 teams from 5 countries submitted predictions to the validation leaderboard, and 6 teams from 3 countries participated in the evaluation phase. The results highlighted both the potential and the current limitations of weakly-supervised approaches. On one hand, weakly-supervised approaches obtained relatively good performance with a median Dice score of 61.0%. On the other hand, top-ranked teams, with a median Dice score exceeding 70%, boosted their performance by leveraging smaller but fully annotated datasets to combine weak supervision and full supervision. This highlights both the promise of weakly-supervised methods and the ongoing need for high-quality, fully annotated data to achieve higher segmentation performance.
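
The abstract above summarises results as median Dice scores. As a reading aid, here is a minimal sketch of the Dice coefficient for binary masks and of a median over several cases. It is not the challenge's evaluation code: the function `dice_score`, the flattened toy masks, and the convention of returning 1.0 when both masks are empty are assumptions for illustration.

```python
from statistics import median


def dice_score(pred, target):
    """Dice = 2 * |A and B| / (|A| + |B|) for two binary masks of equal length."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    size = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if size == 0 else 2 * inter / size


# Hypothetical flattened masks for three scans (1 = lymph node voxel).
cases = [
    ([1, 1, 0, 0], [1, 1, 0, 0]),  # perfect overlap
    ([1, 1, 0, 0], [1, 0, 0, 0]),  # one extra predicted voxel
    ([0, 0, 1, 0], [1, 0, 0, 0]),  # no overlap
]

scores = [dice_score(p, t) for p, t in cases]
print(scores)
print(median(scores))
```

Reporting the median rather than the mean makes the summary less sensitive to a few scans on which a method fails completely.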

[CV-22] Facial Wrinkle Segmentation for Cosmetic Dermatology: Pretraining with Texture Map-Based Weak Supervision

Link: https://arxiv.org/abs/2408.10060
Authors: Junho Moon,Haejun Chung,Ikbeom Jang
Keywords: cosmetic dermatology, plays a crucial, crucial role, role in cosmetic, Facial wrinkle
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Facial wrinkle detection plays a crucial role in cosmetic dermatology. Precise manual segmentation of facial wrinkles is challenging and time-consuming, with inherent subjectivity leading to inconsistent results among graders. To address this issue, we propose two solutions. First, we build and release the first public facial wrinkle dataset, 'FFHQ-Wrinkle', an extension of the NVIDIA FFHQ dataset. This dataset includes 1,000 images with human labels and 50,000 images with automatically generated weak labels. This dataset can help the research community develop advanced wrinkle detection algorithms. Second, we introduce a training strategy for U-Net-like encoder-decoder models to detect wrinkles across the face automatically. Our method employs a two-stage training strategy: texture map pretraining and finetuning on human-labeled data. Initially, we pretrain models on a large dataset with weak labels (N=50k) or masked texture maps generated through computer vision techniques, without human intervention. Subsequently, we finetune the models using human-labeled data (N=1k), which consists of manually labeled wrinkle masks. During finetuning, the network inputs a combination of RGB and masked texture maps, comprising four channels. We effectively combine labels from multiple annotators to minimize subjectivity in manual labeling. Our strategies demonstrate improved segmentation performance in facial wrinkle segmentation both quantitatively and visually compared to existing pretraining methods.

[CV-23] Exploiting Fine-Grained Prototype Distribution for Boosting Unsupervised Class Incremental Learning

Link: https://arxiv.org/abs/2408.10046
Authors: Jiaming Liu,Hongyuan Liu,Zhili Qin,Wei Han,Yulu Fan,Qinli Yang,Junming Shao
Keywords: class incremental learning, dynamic nature, nature of open-world, open-world scenarios, scenarios has attracted
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The dynamic nature of open-world scenarios has attracted more attention to class incremental learning (CIL). However, existing CIL methods typically presume the availability of complete ground-truth labels throughout the training process, an assumption rarely met in practical applications. Consequently, this paper explores a more challenging problem of unsupervised class incremental learning (UCIL). The essence of addressing this problem lies in effectively capturing comprehensive feature representations and discovering unknown novel classes. To achieve this, we first model the knowledge of class distribution by exploiting fine-grained prototypes. Subsequently, a granularity alignment technique is introduced to enhance the unsupervised class discovery. Additionally, we propose a strategy to minimize overlap between novel and existing classes, thereby preserving historical knowledge and mitigating the phenomenon of catastrophic forgetting. Extensive experiments on five datasets demonstrate that our approach significantly outperforms current state-of-the-art methods, indicating the effectiveness of the proposed method.

[CV-24] Implicit Gaussian Splatting with Efficient Multi-Level Tri-Plane Representation

Link: https://arxiv.org/abs/2408.10041
Authors: Minye Wu,Tinne Tuytelaars
Keywords: Recent advancements, Implicit Gaussian Splatting, Gaussian Splatting, advancements in photo-realistic, photo-realistic novel view
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advancements in photo-realistic novel view synthesis have been significantly driven by Gaussian Splatting (3DGS). Nevertheless, the explicit nature of 3DGS data entails considerable storage requirements, highlighting a pressing need for more efficient data representations. To address this, we present Implicit Gaussian Splatting (IGS), an innovative hybrid model that integrates explicit point clouds with implicit feature embeddings through a multi-level tri-plane architecture. This architecture features 2D feature grids at various resolutions across different levels, facilitating continuous spatial domain representation and enhancing spatial correlations among Gaussian primitives. Building upon this foundation, we introduce a level-based progressive training scheme, which incorporates explicit spatial regularization. This method capitalizes on spatial correlations to enhance both the rendering quality and the compactness of the IGS representation. Furthermore, we propose a novel compression pipeline tailored for both point clouds and 2D feature grids, considering the entropy variations across different levels. Extensive experimental evaluations demonstrate that our algorithm can deliver high-quality rendering using only a few MBs, effectively balancing storage efficiency and rendering fidelity, and yielding results that are competitive with the state-of-the-art.

[CV-25] SHARP: Segmentation of Hands and Arms by Range using Pseudo-Depth for Enhanced Egocentric 3D Hand Pose Estimation and Action Recognition ICPR

Link: https://arxiv.org/abs/2408.10037
Authors: Wiktor Mucha,Michael Wray,Martin Kampel
Keywords: Hand pose, pose represents key, Hand pose represents, represents key information, hand pose estimation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at 27th International Conference on Pattern Recognition (ICPR)

Abstract:Hand pose represents key information for action recognition in the egocentric perspective, where the user is interacting with objects. We propose to improve egocentric 3D hand pose estimation based on RGB frames only by using pseudo-depth images. Incorporating state-of-the-art single RGB image depth estimation techniques, we generate pseudo-depth representations of the frames and use distance knowledge to segment irrelevant parts of the scene. The resulting depth maps are then used as segmentation masks for the RGB frames. Experimental results on H2O Dataset confirm the high accuracy of the estimated pose with our method in an action recognition task. The 3D hand pose, together with information from object detection, is processed by a transformer-based action recognition network, resulting in an accuracy of 91.73%, outperforming all state-of-the-art methods. Estimations of 3D hand pose result in competitive performance with existing methods with a mean pose error of 28.66 mm. This method opens up new possibilities for employing distance information in egocentric 3D hand pose estimation without relying on depth sensors.
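The masking step described above, where pseudo-depth is used to drop scene regions outside the range of the hands and arms, can be illustrated roughly as follows. This is a sketch under stated assumptions: `mask_by_depth`, the nested-list image layout, and the `max_range` threshold are hypothetical, and the depth values would in practice come from a monocular depth estimator that is not shown here.

```python
def mask_by_depth(rgb, depth, max_range):
    """Keep RGB pixels whose pseudo-depth is within max_range; zero the rest.

    rgb:   H x W nested list of (r, g, b) tuples
    depth: H x W nested list of pseudo-depth values (same units as max_range)
    """
    return [
        [pixel if d <= max_range else (0, 0, 0)
         for pixel, d in zip(rgb_row, depth_row)]
        for rgb_row, depth_row in zip(rgb, depth)
    ]


rgb = [[(10, 20, 30), (40, 50, 60)]]
depth = [[0.4, 2.0]]  # second pixel is beyond the assumed arm's-reach range
print(mask_by_depth(rgb, depth, max_range=1.0))
```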

[CV-26] Dynamic Label Injection for Imbalanced Industrial Defect Segmentation ECCV2024

Link: https://arxiv.org/abs/2408.10031
Authors: Emanuele Caruso,Francesco Pelosin,Alessandro Simoni,Marco Boschetti
Keywords: multi-class semantic segmentation, deep learning systems, imbalanced multi-class semantic, simple yet effective, effective method
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ECCV 2024 VISION Workshop

Abstract:In this work, we propose a simple yet effective method to tackle the problem of imbalanced multi-class semantic segmentation in deep learning systems. One of the key properties for a good training set is the balancing among the classes. When the input distribution is heavily imbalanced in the number of instances, the learning process could be hindered or difficult to carry on. To this end, we propose a Dynamic Label Injection (DLI) algorithm to impose a uniform distribution in the input batch. Our algorithm computes the current batch defect distribution and re-balances it by transferring defects using a combination of Poisson-based seamless image cloning and cut-paste techniques. A thorough experimental section on the Magnetic Tiles dataset shows better results of DLI compared to other balancing loss approaches also in the challenging weakly-supervised setup. The code is available at this https URL

[CV-27] Towards Robust Federated Image Classification: An Empirical Study of Weight Selection Strategies in Manufacturing

Link: https://arxiv.org/abs/2408.10024
Authors: Vinit Hegiste,Tatjana Legler,Martin Ruskowski
Keywords: Epoch Weight Selection, Final Epoch Weight, Optimal Epoch Weight, weight selection strategies, weight selection
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to The 2nd IEEE International Conference on Federated Learning Technologies and Applications (FLTA24)

Abstract:In the realm of Federated Learning (FL), particularly within the manufacturing sector, the strategy for selecting client weights for server aggregation is pivotal for model performance. This study investigates the comparative effectiveness of two weight selection strategies: Final Epoch Weight Selection (FEWS) and Optimal Epoch Weight Selection (OEWS). Designed for manufacturing contexts where collaboration typically involves a limited number of partners (two to four clients), our research focuses on federated image classification tasks. We employ various neural network architectures, including EfficientNet, ResNet, and VGG, to assess the impact of these weight selection strategies on model convergence and robustness. Our research aims to determine whether FEWS or OEWS enhances the global FL model’s performance across communication rounds (CRs). Through empirical analysis and rigorous experimentation, we seek to provide valuable insights for optimizing FL implementations in manufacturing, ensuring that collaborative efforts yield the most effective and reliable models with a limited number of participating clients. The findings from this study are expected to refine FL practices significantly in manufacturing, thereby enhancing the efficiency and performance of collaborative machine learning endeavors in this vital sector.
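The abstract names the two strategies without defining them. Reading the names literally, FEWS would send the weights from a client's last local epoch, while OEWS would send the weights from whichever local epoch scored best. The sketch below encodes that reading only; the history format, the `val_acc` field, and both function names are assumptions, and the paper's actual selection criterion may differ.

```python
def select_fews(history):
    """Final Epoch Weight Selection (assumed meaning): last epoch's weights."""
    return history[-1]["weights"]


def select_oews(history):
    """Optimal Epoch Weight Selection (assumed meaning): weights from the
    epoch with the highest validation accuracy (first one wins on ties)."""
    best_epoch = max(history, key=lambda epoch: epoch["val_acc"])
    return best_epoch["weights"]


# Hypothetical per-epoch record of one client's local training round.
history = [
    {"weights": "w_epoch1", "val_acc": 0.81},
    {"weights": "w_epoch2", "val_acc": 0.88},
    {"weights": "w_epoch3", "val_acc": 0.85},
]
print(select_fews(history), select_oews(history))
```

In this toy record the two strategies disagree, because accuracy peaks at the second epoch and drops at the third.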

[CV-28] Detecting Adversarial Attacks in Semantic Segmentation via Uncertainty Estimation: A Deep Analysis

Link: https://arxiv.org/abs/2408.10021
Authors: Kira Maag,Roman Resner,Asja Fischer
Keywords: Deep neural networks, demonstrated remarkable effectiveness, Deep neural, demonstrated remarkable, wide range
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments:

Abstract:Deep neural networks have demonstrated remarkable effectiveness across a wide range of tasks such as semantic segmentation. Nevertheless, these networks are vulnerable to adversarial attacks that add imperceptible perturbations to the input image, leading to false predictions. This vulnerability is particularly dangerous in safety-critical applications like automated driving. While adversarial examples and defense strategies are well-researched in the context of image classification, there is comparatively less research focused on semantic segmentation. Recently, we have proposed an uncertainty-based method for detecting adversarial attacks on neural networks for semantic segmentation. We observed that uncertainty, as measured by the entropy of the output distribution, behaves differently on clean versus adversely perturbed images, and we utilize this property to differentiate between the two. In this extended version of our work, we conduct a detailed analysis of uncertainty-based detection of adversarial attacks including a diverse set of adversarial attacks and various state-of-the-art neural networks. Our numerical experiments show the effectiveness of the proposed uncertainty-based detection method, which is lightweight and operates as a post-processing step, i.e., no model modifications or knowledge of the adversarial example generation process are required.
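The uncertainty measure mentioned here, the entropy of the per-pixel output distribution, is easy to compute as a post-processing step. The sketch below averages that entropy over an image and compares it with a threshold. The function names, the nested-list layout, and the rule "flag when mean entropy exceeds a threshold" are assumptions for illustration; the abstract only states that entropy behaves differently on clean and perturbed inputs.

```python
import math


def pixel_entropy(probs):
    """Shannon entropy (in nats) of one pixel's class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def mean_entropy(prob_map):
    """Average per-pixel entropy over an H x W map of class distributions."""
    values = [pixel_entropy(pixel) for row in prob_map for pixel in row]
    return sum(values) / len(values)


def is_flagged(prob_map, threshold):
    """Assumed decision rule: flag the image if mean entropy is above threshold."""
    return mean_entropy(prob_map) > threshold


confident = [[[1.0, 0.0], [1.0, 0.0]]]   # one row, two pixels, two classes
uncertain = [[[0.5, 0.5], [0.5, 0.5]]]
print(mean_entropy(confident), mean_entropy(uncertain))
```

A fully confident prediction has zero entropy, and a uniform two-class prediction has entropy ln 2. In practice the threshold would be chosen on held-out clean data.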

[CV-29] CLIPCleaner: Cleaning Noisy Labels with CLIP

Link: https://arxiv.org/abs/2408.10012
Authors: Chen Feng,Georgios Tzimiropoulos,Ioannis Patras
Keywords: Machine Learning community, Machine Learning, Noisy labels, sample selection, poses a significant
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ACMMM2024

Abstract:Learning with Noisy Labels (LNL) poses a significant challenge for the Machine Learning community. Some of the most widely used approaches select as clean those samples for which the model itself (the in-training model) has high confidence, e.g., 'small loss', and can suffer from the so-called 'self-confirmation' bias. This bias arises because the in-training model is at least partially trained on the noisy labels. Furthermore, in the classification case, an additional challenge arises because some of the label noise is between classes that are visually very similar ('hard noise'). This paper addresses these challenges by proposing a method (CLIPCleaner) that leverages CLIP, a powerful Vision-Language (VL) model, for constructing a zero-shot classifier for efficient, offline, clean sample selection. This has the advantage that the sample selection is decoupled from the in-training model and that the sample selection is aware of the semantic and visual similarities between the classes due to the way that CLIP is trained. We provide theoretical justifications and empirical evidence to demonstrate the advantages of CLIP for LNL compared to conventional pre-trained models. Compared to current methods that combine iterative sample selection with various techniques, CLIPCleaner offers a simple, single-step approach that achieves competitive or superior performance on benchmark datasets. To the best of our knowledge, this is the first time a VL model has been used for sample selection to address the problem of Learning with Noisy Labels (LNL), highlighting their potential in the domain.

[CV-30] P3P: Pseudo-3D Pre-training for Scaling 3D Masked Autoencoders

Link: https://arxiv.org/abs/2408.10007
Authors: Xuechao Chen,Ying Chen,Jialin Li,Qiang Nie,Yong Liu,Qixing Huang,Yang Li
Keywords: perception tasks, data, pre-training, large amount, Abstract
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review. Pre-print

Abstract:3D pre-training is crucial to 3D perception tasks. However, limited by the difficulties in collecting clean 3D data, 3D pre-training has consistently faced data scaling challenges. Inspired by semi-supervised learning leveraging limited labeled data and a large amount of unlabeled data, in this work, we propose a novel self-supervised pre-training framework utilizing the real 3D data and the pseudo-3D data lifted from images by a large depth estimation model. Another challenge lies in the efficiency. Previous methods, such as Point-BERT and Point-MAE, employ k nearest neighbors to embed 3D tokens, requiring quadratic time complexity. To efficiently pre-train on such a large amount of data, we propose a linear-time-complexity token embedding strategy and a training-efficient 2D reconstruction target. Our method achieves state-of-the-art performance in 3D classification and few-shot learning while maintaining high pre-training and downstream fine-tuning efficiency.

[CV-31] Boosting Open-Domain Continual Learning via Leveraging Intra-domain Category-aware Prototype

Link: https://arxiv.org/abs/2408.09984
Authors: Yadong Lu,Shitian Zhao,Boxiang Yun,Dongsheng Jiang,Yin Li,Qingli Li,Yan Wang
Keywords: Open-Domain Continual Learning, maintaining zero-shot capabilities, Continual Learning, Vision-Language Models, Open-Domain Continual
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Despite recent progress in enhancing the efficacy of Open-Domain Continual Learning (ODCL) in Vision-Language Models (VLM), failing to (1) correctly identify the Task-ID of a test image and (2) use only the category set corresponding to the Task-ID, while preserving the knowledge related to each domain, cannot address the two primary challenges of ODCL: forgetting old knowledge and maintaining zero-shot capabilities, as well as the confusions caused by category-relatedness between domains. In this paper, we propose a simple yet effective solution: leveraging intra-domain category-aware prototypes for ODCL in CLIP (DPeCLIP), where the prototype is the key to bridging the above two processes. Concretely, we propose a training-free Task-ID discriminator method, by utilizing prototypes as classifiers for identifying Task-IDs. Furthermore, to maintain the knowledge corresponding to each domain, we incorporate intra-domain category-aware prototypes as domain prior prompts into the training process. Extensive experiments conducted on 11 different datasets demonstrate the effectiveness of our approach, achieving 2.37% and 1.14% average improvement in class-incremental and task-incremental settings, respectively.
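Using prototypes as a training-free classifier for Task-IDs amounts to a nearest-prototype lookup. The sketch below assigns a test feature to the task that owns its most similar prototype under cosine similarity. The names, the dictionary layout, and the choice of cosine similarity are assumptions for illustration, and feature extraction (for example, from a CLIP image encoder) is out of scope.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def predict_task_id(feature, prototypes_by_task):
    """Return the task whose category prototype is most similar to feature.

    prototypes_by_task: dict mapping task id -> list of prototype vectors,
    one per category of that task (a hypothetical layout).
    """
    best_task, best_sim = None, -2.0  # cosine similarity is always >= -1
    for task_id, prototypes in prototypes_by_task.items():
        for prototype in prototypes:
            sim = cosine(feature, prototype)
            if sim > best_sim:
                best_task, best_sim = task_id, sim
    return best_task


prototypes = {0: [[1.0, 0.0], [0.9, 0.1]], 1: [[0.0, 1.0]]}
print(predict_task_id([0.2, 0.8], prototypes))
```

Once the task is identified, classification can be restricted to that task's category set, which is the second requirement the abstract lists.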

[CV-32] Weakly Supervised Pretraining and Multi-Annotator Supervised Finetuning for Facial Wrinkle Detection

Link: https://arxiv.org/abs/2408.09952
Authors: Ik Jun Moon,Junho Moon,Ikbeom Jang
Keywords: Abstract, facial, facial wrinkles, Research question, skin
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:1. Research question: With the growing interest in skin diseases and skin aesthetics, the ability to predict facial wrinkles is becoming increasingly important. This study aims to evaluate whether a computational model, convolutional neural networks (CNN), can be trained for automated facial wrinkle segmentation. 2. Findings: Our study presents an effective technique for integrating data from multiple annotators and illustrates that transfer learning can enhance performance, resulting in dependable segmentation of facial wrinkles. 3. Meaning: This approach automates intricate and time-consuming tasks of wrinkle analysis with a deep learning framework. It could be used to facilitate skin treatments and diagnostics.

[CV-33] C2RL: Content and Context Representation Learning for Gloss-free Sign Language Translation and Retrieval

Link: https://arxiv.org/abs/2408.09949
Authors: Zhigang Chen,Benjia Zhou,Yiqing Huang,Jun Wan,Yibo Hu,Hailin Shi,Yanyan Liang,Zhen Lei,Du Zhang
Keywords: Sign Language Translation, Sign Language Retrieval, Language Translation, Language Retrieval, Sign Language
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract:Sign Language Representation Learning (SLRL) is crucial for a range of sign language-related downstream tasks such as Sign Language Translation (SLT) and Sign Language Retrieval (SLRet). Recently, many gloss-based and gloss-free SLRL methods have been proposed, showing promising performance. Among them, the gloss-free approach shows promise for strong scalability without relying on gloss annotations. However, it currently faces suboptimal solutions due to challenges in encoding the intricate, context-sensitive characteristics of sign language videos, mainly struggling to discern essential sign features using a non-monotonic video-text alignment strategy. Therefore, we introduce an innovative pretraining paradigm for gloss-free SLRL, called C²RL, in this paper. Specifically, rather than merely incorporating a non-monotonic semantic alignment of video and text to learn language-oriented sign features, we emphasize two pivotal aspects of SLRL: Implicit Content Learning (ICL) and Explicit Context Learning (ECL). ICL delves into the content of communication, capturing the nuances, emphasis, timing, and rhythm of the signs. In contrast, ECL focuses on understanding the contextual meaning of signs and converting them into equivalent sentences. Despite its simplicity, extensive experiments confirm that the joint optimization of ICL and ECL results in robust sign language representation and significant performance gains in gloss-free SLT and SLRet tasks. Notably, C²RL improves the BLEU-4 score by +5.3 on P14T, +10.6 on CSL-daily, +6.2 on OpenASL, and +1.3 on How2Sign. It also boosts the R@1 score by +8.3 on P14T, +14.4 on CSL-daily, and +5.9 on How2Sign. Additionally, we set a new baseline for the OpenASL dataset in the SLRet task.

[CV-34] Caption-Driven Explorations: Aligning Image and Text Embeddings through Human-Inspired Foveated Vision

Link: https://arxiv.org/abs/2408.09948
Authors: Dario Zanca,Andrea Zugarini,Simon Dietz,Thomas R. Altstidl,Mark A. Turban Ndjeuha,Leo Schwinn,Bjoern Eskofier
Keywords: crucial for vision, vision science, human attention, human, attention
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Understanding human attention is crucial for vision science and AI. While many models exist for free-viewing, less is known about task-driven image exploration. To address this, we introduce CapMIT1003, a dataset with captions and click-contingent image explorations, to study human attention during the captioning task. We also present NevaClip, a zero-shot method for predicting visual scanpaths by combining CLIP models with NeVA algorithms. NevaClip generates fixations to align the representations of foveated visual stimuli and captions. The simulated scanpaths outperform existing human attention models in plausibility for captioning and free-viewing tasks. This research enhances the understanding of human attention and advances scanpath prediction models.

[CV-35] ML-CrAIST: Multi-scale Low-high Frequency Information-based Cross black Attention with Image Super-resolving Transformer

Link: https://arxiv.org/abs/2408.09940
Authors: Alik Pramanick,Utsav Bheda,Arijit Sur
Keywords: captured significant interest, single-image super-resolution tasks, demonstrating substantial gains, transformers have captured, demonstrating substantial
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:Recently, transformers have captured significant interest in the area of single-image super-resolution tasks, demonstrating substantial gains in performance. Current models heavily depend on the network’s extensive ability to extract high-level semantic details from images while overlooking the effective utilization of multi-scale image details and intermediate information within the network. Furthermore, it has been observed that high-frequency areas in images present significant complexity for super-resolution compared to low-frequency areas. This work proposes a transformer-based super-resolution architecture called ML-CrAIST that addresses this gap by utilizing low-high frequency information in multiple scales. Unlike most previous work (which uses either spatial or channel attention), we apply both spatial and channel self-attention, which concurrently model pixel interaction from both spatial and channel dimensions, exploiting the inherent correlations across the spatial and channel axes. Further, we devise a cross-attention block for super-resolution, which explores the correlations between low and high-frequency information. Quantitative and qualitative assessments indicate that our proposed ML-CrAIST surpasses state-of-the-art super-resolution methods (e.g., 0.15 dB gain @Manga109 ×4). Code is available on: this https URL.

[CV-36] Data Augmentation of Contrastive Learning is Estimating Positive-incentive Noise

Link: https://arxiv.org/abs/2408.09929
Authors: Hongyuan Zhang,Yanchen Xu,Sida Huang,Xuelong Li
Keywords: idea of Positive-incentive, Positive-incentive Noise, reliable noise beneficial, contrastive learning, Noise
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Inspired by the idea of Positive-incentive Noise (Pi-Noise or π-Noise) that aims at learning the reliable noise beneficial to tasks, we scientifically investigate the connection between contrastive learning and π-noise in this paper. By converting the contrastive loss to an auxiliary Gaussian distribution to quantitatively measure the difficulty of the specific contrastive model under the information theory framework, we properly define the task entropy, the core concept of π-noise, of contrastive learning. It is further proved that the predefined data augmentation in the standard contrastive learning paradigm can be regarded as a kind of point estimation of π-noise. Inspired by the theoretical study, a framework that develops a π-noise generator to learn the beneficial noise (instead of estimation) as data augmentations for contrast is proposed. The designed framework can be applied to diverse types of data and is also completely compatible with the existing contrastive models. From the visualization, we surprisingly find that the proposed method successfully learns effective augmentations.

[CV-37] DiscoNeRF: Class-Agnostic Object Field for 3D Object Discovery

Link: https://arxiv.org/abs/2408.09928
Authors: Corentin Dumery,Aoxiang Fan,Ren Li,Nicolas Talabot,Pascal Fua
Keywords: Neural Radiance Fields, Neural Radiance, Radiance Fields, tool for modeling, multiple images
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments:

Abstract:Neural Radiance Fields (NeRFs) have become a powerful tool for modeling 3D scenes from multiple images. However, NeRFs remain difficult to segment into semantically meaningful regions. Previous approaches to 3D segmentation of NeRFs either require user interaction to isolate a single object, or they rely on 2D semantic masks with a limited number of classes for supervision. As a consequence, they generalize poorly to class-agnostic masks automatically generated in real scenes. This is attributable to the ambiguity arising from zero-shot segmentation, yielding inconsistent masks across views. In contrast, we propose a method that is robust to inconsistent segmentations and successfully decomposes the scene into a set of objects of any class. By introducing a limited number of competing object slots against which masks are matched, a meaningful object representation emerges that best explains the 2D supervision and minimizes an additional regularization term. Our experiments demonstrate the ability of our method to generate 3D panoptic segmentations on complex scenes, and extract high-quality 3D assets from NeRFs that can then be used in virtual 3D environments.

[CV-38] Sliced Maximal Information Coefficient: A Training-Free Approach for Image Quality Assessment Enhancement ICME2024

Link: https://arxiv.org/abs/2408.09920
Authors: Kang Xiao,Xu Wang,Yulin He,Baoliang Chen,Xuelin Shen
Keywords: Full-reference image quality, models generally operate, Full-reference image, existing IQA models, image quality assessment
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Image and Video Processing (eess.IV)
Comments: 6 pages, 5 figures, accepted by ICME2024

Abstract:Full-reference image quality assessment (FR-IQA) models generally operate by measuring the visual differences between a degraded image and its reference. However, existing FR-IQA models, including both the classical ones (e.g., PSNR and SSIM) and deep-learning based measures (e.g., LPIPS and DISTS), still exhibit limitations in capturing the full perception characteristics of the human visual system (HVS). In this paper, instead of designing a new FR-IQA measure, we aim to explore a generalized human visual attention estimation strategy to mimic the process of human quality rating and enhance existing IQA models. In particular, we model human attention generation by measuring the statistical dependency between the degraded image and the reference image. The dependency is captured in a training-free manner by our proposed sliced maximal information coefficient and exhibits surprising generalization in different IQA measures. Experimental results verify that the performance of existing IQA models can be consistently improved when our attention module is incorporated. The source code is available at this https URL.

[CV-39] Long-Tail Temporal Action Segmentation with Group-wise Temporal Logit Adjustment ECCV2024

Link: https://arxiv.org/abs/2408.09919
Authors: Zhanzhong Pang,Fadime Sener,Shrinivas Ramasubramanian,Angela Yao
Keywords: Procedural activity videos, long-tailed action distribution, action distribution due, varying action frequencies, Procedural activity
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ECCV 2024

Abstract:Procedural activity videos often exhibit a long-tailed action distribution due to varying action frequencies and durations. However, state-of-the-art temporal action segmentation methods overlook the long tail and fail to recognize tail actions. Existing long-tail methods make class-independent assumptions and struggle to identify tail classes when applied to temporal segmentation frameworks. This work proposes a novel group-wise temporal logit adjustment (G-TLA) framework that combines a group-wise softmax formulation while leveraging activity information and action ordering for logit adjustment. The proposed framework significantly improves the segmentation of tail actions without any performance loss on head actions.
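For background, the generic class-frequency logit adjustment that long-tail methods build on can be sketched in a few lines. This is not the paper's group-wise temporal variant: it is the plain post-hoc form, which subtracts the scaled log class prior from each logit so that rare classes are not drowned out. The function names, `tau`, and the toy counts are assumptions for illustration.

```python
import math


def adjust_logits(logits, class_counts, tau=1.0):
    """Post-hoc logit adjustment: logit_c - tau * log(prior_c)."""
    total = sum(class_counts)
    priors = [count / total for count in class_counts]
    return [z - tau * math.log(p) for z, p in zip(logits, priors)]


def argmax(values):
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


logits = [2.0, 1.5]        # raw scores for a head class and a tail class
class_counts = [90, 10]    # hypothetical training frequencies
adjusted = adjust_logits(logits, class_counts)
print(argmax(logits), argmax(adjusted))
```

The raw logits favor the head class, while the adjusted ones favor the tail class, because subtracting log 0.1 adds far more than subtracting log 0.9.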

[CV-40] Attribution Analysis Meets Model Editing: Advancing Knowledge Correction in Vision Language Models with VisEdit

Link: https://arxiv.org/abs/2408.09916
Authors: Qizhou Chen,Taolin Zhang,Chengyu Wang,Xiaofeng He,Dakan Wang,Tingting Liu
Keywords: Large Language Model, Model editing aims, developed Large Language, Language Model, costly retraining
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract:Model editing aims to correct outdated or erroneous knowledge in large models without costly retraining. Recent research discovered that the mid-layer representation of the subject’s final token in a prompt has a strong influence on factual predictions, and developed Large Language Model (LLM) editing techniques based on this observation. However, for Vision-LLMs (VLLMs), how visual representations impact the predictions from a decoder-only language model remains largely unexplored. To the best of our knowledge, model editing for VLLMs has not been extensively studied in the literature. In this work, we employ the contribution allocation and noise perturbation methods to measure the contributions of visual representations for token predictions. Our attribution analysis shows that visual representations in mid-to-later layers that are highly relevant to the prompt contribute significantly to predictions. Based on these insights, we propose VisEdit, a novel model editor for VLLMs that effectively corrects knowledge by editing intermediate visual representations in regions important to the edit prompt. We evaluated VisEdit using multiple VLLM backbones and public VLLM editing benchmark datasets. The results show the superiority of VisEdit over the strong baselines adapted from existing state-of-the-art editors for LLMs.

[CV-41] Harnessing Multi-resolution and Multi-scale Attention for Underwater Image Restoration

Link: https://arxiv.org/abs/2408.09912
Authors: Alik Pramanick,Arijit Sur,V. Vijaya Saradhi
Keywords: high-level vision tasks, posing challenges, vision tasks, compromised by factors, challenges for high-level
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:Underwater imagery is often compromised by factors such as color distortion and low contrast, posing challenges for high-level vision tasks. Recent underwater image restoration (UIR) methods either analyze the input image at full resolution, resulting in spatial richness but contextual weakness, or progressively from high to low resolution, yielding reliable semantic information but reduced spatial accuracy. Here, we propose a lightweight multi-stage network called Lit-Net that focuses on multi-resolution and multi-scale image analysis for restoring underwater images while retaining original resolution during the first stage, refining features in the second, and focusing on reconstruction in the final stage. Our novel encoder block utilizes parallel 1×1 convolution layers to capture local information and speed up operations. Further, we incorporate a modified weighted color channel-specific l_1 loss (cl_1) function to recover color and detail information. Extensive experiments on publicly available datasets suggest our model’s superiority over recent state-of-the-art methods, with significant improvement in qualitative and quantitative measures, such as 29.477 dB PSNR (1.92% improvement) and 0.851 SSIM (2.87% improvement) on the EUVP dataset. The contributions of Lit-Net offer a more robust approach to underwater image enhancement and super-resolution, which is of considerable importance for underwater autonomous vehicles and surveillance. The code is available at: this https URL.
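A weighted, color channel-specific l_1 loss can be read as a per-channel absolute error scaled by a channel weight and averaged over pixels. The sketch below follows that reading only; the function name, the flat pixel-list layout, and the example weights are assumptions, since the abstract does not give the exact form or weights the authors use.

```python
def channel_weighted_l1(pred, target, weights=(1.0, 1.0, 1.0)):
    """Mean over pixels of sum_c weights[c] * |pred_c - target_c|.

    pred, target: equal-length lists of (r, g, b) tuples with values in [0, 1].
    weights: hypothetical per-channel weights, e.g. larger for red, which
    attenuates fastest underwater.
    """
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and the same length")
    total = 0.0
    for p, t in zip(pred, target):
        total += sum(w * abs(pc - tc) for w, pc, tc in zip(weights, p, t))
    return total / len(pred)


pred = [(1.0, 0.0, 0.0), (0.5, 0.5, 0.5)]
target = [(0.0, 0.0, 0.0), (0.5, 0.5, 0.5)]
print(channel_weighted_l1(pred, target, weights=(2.0, 1.0, 1.0)))
```

With the red channel weighted twice as heavily, a red error of 1.0 on one of the two pixels gives a loss of 2.0 / 2 = 1.0.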

[CV-42] LCE: A Framework for Explainability of DNNs for Ultrasound Image Based on Concept Discovery

Link: https://arxiv.org/abs/2408.09899
Authors: Weiji Kong, Xun Gong, Juan Wang
Keywords: Deep Neural Networks, Neural Networks, Deep Neural, decisions of Deep, increasingly important
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

Abstract:Explaining the decisions of Deep Neural Networks (DNNs) for medical images has become increasingly important. Existing attribution methods have difficulty explaining the meaning of pixels while existing concept-based methods are limited by additional annotations or specific model structures that are difficult to apply to ultrasound images. In this paper, we propose the Lesion Concept Explainer (LCE) framework, which combines attribution methods with concept-based methods. We introduce the Segment Anything Model (SAM), fine-tuned on a large number of medical images, for concept discovery to enable a meaningful explanation of ultrasound image DNNs. The proposed framework is evaluated in terms of both faithfulness and understandability. We point out deficiencies in the popular faithfulness evaluation metrics and propose a new evaluation metric. Our evaluation of public and private breast ultrasound datasets (BUSI and FG-US-B) shows that LCE performs well compared to commonly-used explainability methods. Finally, we also validate that LCE can consistently provide reliable explanations for more meaningful fine-grained diagnostic tasks in breast ultrasound.

[CV-43] SAM-UNet: Enhancing Zero-Shot Segmentation of SAM for Universal Medical Images

Link: https://arxiv.org/abs/2408.09886
Authors: Sihan Yang, Haixia Bi, Hai Zhang, Jian Sun
Keywords: demonstrated impressive performance, demonstrated impressive, wide range, medical, medical image segmentation
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:The Segment Anything Model (SAM) has demonstrated impressive performance on a wide range of natural image segmentation tasks. However, its performance deteriorates significantly when it is applied directly to the medical domain, due to the remarkable differences between natural images and medical images. Some researchers have attempted to train SAM on large-scale medical datasets, but the experimental results show poor zero-shot performance. In this context, inspired by the superior performance of U-Net-like models in medical image segmentation, we propose SAM-UNet, a new foundation model which incorporates U-Net into the original SAM to fully leverage the powerful contextual modeling ability of convolutions. Specifically, we add a parallel convolutional branch to the image encoder, which is trained independently with the vision Transformer branch frozen. Additionally, we employ multi-scale fusion in the mask decoder to facilitate accurate segmentation of objects at different scales. We train SAM-UNet on SA-Med2D-16M, the largest 2-dimensional medical image segmentation dataset to date, yielding a universal pretrained model for medical images. Extensive experiments are conducted to evaluate the performance of the model, and a state-of-the-art result is achieved, with a dice similarity coefficient score of 0.883 on the SA-Med2D-16M dataset. In zero-shot segmentation experiments, our model not only significantly outperforms previous large medical SAM models across all modalities, but also substantially mitigates the performance degradation seen on unseen modalities. SAM-UNet is an efficient and extensible foundation model, which can be further fine-tuned for other downstream tasks in the medical community. The code is available at this https URL.
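
For readers unfamiliar with the metric quoted above, the Dice similarity coefficient of two binary masks A and B is 2|A ∩ B| / (|A| + |B|). The sketch below is a minimal illustration on flattened masks; the function name and the convention of returning 1.0 for two empty masks are choices made for this example, not details from the paper.

```python
def dice_coefficient(pred_mask, true_mask):
    """Dice similarity of two flattened binary masks (sequences of 0/1)."""
    if len(pred_mask) != len(true_mask):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred_mask, true_mask) if p and t)
    total = sum(1 for p in pred_mask if p) + sum(1 for t in true_mask if t)
    if total == 0:
        # Both masks empty: treated as perfect agreement (a convention).
        return 1.0
    return 2.0 * intersection / total


# One overlapping pixel, two foreground pixels in each mask.
score = dice_coefficient([1, 1, 0, 0], [1, 0, 1, 0])
```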

[CV-44] New spectral imaging biomarkers for sepsis and mortality in intensive care

Link: https://arxiv.org/abs/2408.09873
Authors: Silvia Seidlitz, Katharina Hölzl, Ayca von Garrel, Jan Sellner, Stephan Katzenschlager, Tobias Hölle, Dania Fischer, Maik von der Forst, Felix C.F. Schmitt, Markus A. Weigand, Lena Maier-Hein, Maximilian Dietrich
Keywords: high socioeconomic importance, high risk, high socioeconomic, early identification, socioeconomic importance
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Markus A. Weigand, Lena Maier-Hein and Maximilian Dietrich contributed equally

Abstract:With sepsis remaining a leading cause of mortality, early identification of septic patients and those at high risk of death is a challenge of high socioeconomic importance. The driving hypothesis of this study was that hyperspectral imaging (HSI) could provide novel biomarkers for sepsis diagnosis and treatment management due to its potential to monitor microcirculatory alterations. We conducted a comprehensive study involving HSI data of the palm and fingers from more than 480 patients on the day of their intensive care unit (ICU) admission. The findings demonstrate that HSI measurements can predict sepsis with an area under the receiver operating characteristic curve (AUROC) of 0.80 (95 % confidence interval (CI) [0.76; 0.84]) and mortality with an AUROC of 0.72 (95 % CI [0.65; 0.79]). The predictive performance improves substantially when additional clinical data is incorporated, leading to an AUROC of up to 0.94 (95 % CI [0.92; 0.96]) for sepsis and 0.84 (95 % CI [0.78; 0.89]) for mortality. We conclude that HSI presents novel imaging biomarkers for the rapid, non-invasive prediction of sepsis and mortality, suggesting its potential as an important modality for guiding diagnosis and treatment.
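
The AUROC values reported above can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes that quantity directly by pairwise comparison; it is a generic illustration with made-up scores, not the study's evaluation code, and it does not compute confidence intervals.

```python
def auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical risk scores; three of the four positive/negative pairs
# are ordered correctly.
value = auroc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

The pairwise form is quadratic in the number of samples, which is fine for an illustration; rank-based implementations compute the same value more efficiently.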

[CV-45] Docling Technical Report

Link: https://arxiv.org/abs/2408.09869
Authors: Christoph Auer, Maksym Lysak, Ahmed Nassar, Michele Dolfi, Nikolaos Livathinos, Panos Vagenas, Cesar Berrospi Ramis, Matteo Omenetti, Fabian Lindlbauer, Kasper Dinkla, Valery Weber, Lucas Morin, Ingmar Meijer, Viktor Kuropiatnyk, Peter W. J. Staar
Keywords: PDF document conversion, report introduces Docling, MIT-licensed open-source package, technical report introduces, introduces Docling
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE)
Comments: arXiv admin note: substantial text overlap with arXiv:2206.01062

Abstract:This technical report introduces Docling, an easy to use, self-contained, MIT-licensed open-source package for PDF document conversion. It is powered by state-of-the-art specialized AI models for layout analysis (DocLayNet) and table structure recognition (TableFormer), and runs efficiently on commodity hardware in a small resource budget. The code interface allows for easy extensibility and addition of new features and models.

[CV-46] 3D-Aware Instance Segmentation and Tracking in Egocentric Videos

Link: https://arxiv.org/abs/2408.09860
Authors: Yash Bhalgat, Vadim Tschernezki, Iro Laina, João F. Henriques, Andrea Vedaldi, Andrew Zisserman
Keywords: rapid camera motion, present unique challenges, videos present unique, scene understanding due, frequent object occlusions
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Egocentric videos present unique challenges for 3D scene understanding due to rapid camera motion, frequent object occlusions, and limited object visibility. This paper introduces a novel approach to instance segmentation and tracking in first-person video that leverages 3D awareness to overcome these obstacles. Our method integrates scene geometry, 3D object centroid tracking, and instance segmentation to create a robust framework for analyzing dynamic egocentric scenes. By incorporating spatial and temporal cues, we achieve superior performance compared to state-of-the-art 2D approaches. Extensive evaluations on the challenging EPIC Fields dataset demonstrate significant improvements across a range of tracking and segmentation consistency metrics. Specifically, our method outperforms the next best performing approach by 7 points in Association Accuracy (AssA) and 4.5 points in IDF1 score, while reducing the number of ID switches by 73% to 80% across various object categories. Leveraging our tracked instance segmentations, we showcase downstream applications in 3D object reconstruction and amodal video object segmentation in these egocentric settings.

[CV-47] OccMamba: Semantic Occupancy Prediction with State Space Models

Link: https://arxiv.org/abs/2408.09859
Authors: Heng Li, Yuenan Hou, Xiaohan Xing, Xiao Sun, Yanyong Zhang
Keywords: Training deep learning, limited visual cues, complicated driving scenarios, deep learning models, Training deep
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 4 figures

Abstract:Training deep learning models for semantic occupancy prediction is challenging due to factors such as a large number of occupancy cells, severe occlusion, limited visual cues, complicated driving scenarios, etc. Recent methods often adopt transformer-based architectures given their strong capability in learning input-conditioned weights and long-range relationships. However, transformer-based networks are notorious for their quadratic computation complexity, seriously undermining their efficacy and deployment in semantic occupancy prediction. Inspired by the global modeling and linear computation complexity of the Mamba architecture, we present the first Mamba-based network for semantic occupancy prediction, termed OccMamba. However, directly applying the Mamba architecture to the occupancy prediction task yields unsatisfactory performance due to the inherent domain gap between the linguistic and 3D domains. To relieve this problem, we present a simple yet effective 3D-to-1D reordering operation, i.e., height-prioritized 2D Hilbert expansion. It can maximally retain the spatial structure of point clouds as well as facilitate the processing of Mamba blocks. Our OccMamba achieves state-of-the-art performance on three prevalent occupancy prediction benchmarks, including OpenOccupancy, SemanticKITTI and SemanticPOSS. Notably, on OpenOccupancy, our OccMamba outperforms the previous state-of-the-art Co-Occ by 3.1% IoU and 3.2% mIoU, respectively. Codes will be released upon publication.
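
The abstract names a "height-prioritized 2D Hilbert expansion" for turning a 3D grid into a 1D sequence but gives no details. The sketch below shows one plausible reading, stated here as an assumption: visit the (x, y) ground-plane cells in Hilbert-curve order and, at each cell, emit the whole column of height values before moving on. The Hilbert index uses the standard iterative conversion for a square grid whose side is a power of two; the function names are hypothetical, and the paper's actual scheme may differ.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) along a Hilbert curve on an n x n grid.

    n must be a power of two.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve is oriented correctly.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


def height_prioritized_order(n, height):
    """Assumed reading: Hilbert order over (x, y), full z column per cell."""
    cells = sorted(
        ((x, y) for x in range(n) for y in range(n)),
        key=lambda c: hilbert_index(n, c[0], c[1]),
    )
    return [(x, y, z) for (x, y) in cells for z in range(height)]


order = height_prioritized_order(4, 2)
```

A Hilbert ordering keeps cells that are neighbors in the sequence adjacent in the plane, which is the usual motivation for using it to serialize spatial data for a sequence model.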

[CV-48] Segment-Anything Models Achieve Zero-shot Robustness in Autonomous Driving

Link: https://arxiv.org/abs/2408.09839
Authors: Jun Yan, Pengyu Wang, Danni Wang, Weiquan Huang, Daniel Watzenig, Huilin Yin
Keywords: SAM, Semantic segmentation, adversarial robustness, significant perception task, zero-shot adversarial robustness
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to IAVVC 2024

Abstract:Semantic segmentation is a significant perception task in autonomous driving. It suffers from the risks of adversarial examples. In the past few years, deep learning has gradually transitioned from convolutional neural network (CNN) models with a relatively small number of parameters to foundation models with a huge number of parameters. The segment-anything model (SAM) is a generalized image segmentation framework that is capable of handling various types of images and is able to recognize and segment arbitrary objects in an image without the need to train on a specific object. It is a unified model that can handle diverse downstream tasks, including semantic segmentation, object detection, and tracking. In the task of semantic segmentation for autonomous driving, it is significant to study the zero-shot adversarial robustness of SAM. Therefore, we deliver a systematic empirical study on the robustness of SAM without additional training. Based on the experimental results, the zero-shot adversarial robustness of the SAM under the black-box corruptions and white-box adversarial attacks is acceptable, even without the need for additional training. The finding of this study is insightful in that the gigantic model parameters and huge amounts of training data lead to the phenomenon of emergence, which builds a guarantee of adversarial robustness. SAM is a vision foundation model that can be regarded as an early prototype of an artificial general intelligence (AGI) pipeline. In such a pipeline, a unified model can handle diverse tasks. Therefore, this research not only inspects the impact of vision foundation models on safe autonomous driving but also provides a perspective on developing trustworthy AGI. The code is available at: this https URL.

[CV-49] SurgicaL-CD: Generating Surgical Images via Unpaired Image Translation with Latent Consistency Diffusion Models

Link: https://arxiv.org/abs/2408.09822
Authors: Danush Kumar Venkatesh, Dominik Rivoir, Micha Pfeiffer, Stefanie Speidel
Keywords: Computer-assisted surgery, enhancing patient care, surgeons during procedures, designed to assist, assist surgeons
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Computer-assisted surgery (CAS) systems are designed to assist surgeons during procedures, thereby reducing complications and enhancing patient care. Training machine learning models for these systems requires a large corpus of annotated datasets, which is challenging to obtain in the surgical domain due to patient privacy concerns and the significant labeling effort required from doctors. Previous methods have explored unpaired image translation using generative models to create realistic surgical images from simulations. However, these approaches have struggled to produce high-quality, diverse surgical images. In this work, we introduce SurgicaL-CD, a consistency-distilled diffusion method to generate realistic surgical images with only a few sampling steps without paired data. We evaluate our approach on three datasets, assessing the generated images in terms of quality and utility as downstream training datasets. Our results demonstrate that our method outperforms GANs and diffusion-based approaches. Our code is available at this https URL.

[CV-50] Hear Your Face: Face-based voice conversion with F0 estimation INTERSPEECH2024

Link: https://arxiv.org/abs/2408.09802
Authors: Jaejun Lee, Yoori Oh, Injune Hwang, Kyogu Lee
Keywords: face-based voice conversion, individual facial features, leveraging the unique, paper delves, emerging field
Categories: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
Comments: Interspeech 2024

Abstract:This paper delves into the emerging field of face-based voice conversion, leveraging the unique relationship between an individual’s facial features and their vocal characteristics. We present a novel face-based voice conversion framework that particularly utilizes the average fundamental frequency of the target speaker, derived solely from their facial images. Through extensive analysis, our framework demonstrates superior speech generation quality and the ability to align facial features with voice characteristics, including tracking of the target speaker’s fundamental frequency.

[CV-51] Latent Diffusion for Guided Document Table Generation ICDAR2024

Link: https://arxiv.org/abs/2408.09800
Authors: Syed Jawwad Haider Hamdani, Saifullah Saifullah, Stefan Agne, Andreas Dengel, Sheraz Ahmed
Keywords: Obtaining annotated table, Obtaining annotated, table structure, challenging task due, table
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in ICDAR 2024

Abstract:Obtaining annotated table structure data for complex tables is a challenging task due to the inherent diversity and complexity of real-world document layouts. The scarcity of publicly available datasets with comprehensive annotations for intricate table structures hinders the development and evaluation of models designed for such scenarios. This research paper introduces a novel approach for generating annotated images for table structure by leveraging conditioned mask images of rows and columns through the application of latent diffusion models. The proposed method aims to enhance the quality of synthetic data used for training object detection models. Specifically, the study employs a conditioning mechanism to guide the generation of complex document table images, ensuring a realistic representation of table layouts. To evaluate the effectiveness of the generated data, we employ the popular YOLOv5 object detection model for training. The generated table images serve as valuable training samples, enriching the dataset with diverse table structures. The model is subsequently tested on the challenging pubtables-1m testset, a benchmark for table structure recognition in complex document layouts. Experimental results demonstrate that the introduced approach significantly improves the quality of synthetic data for training, leading to YOLOv5 models with enhanced performance. The mean Average Precision (mAP) values obtained on the pubtables-1m testset showcase results closely aligned with state-of-the-art methods. Furthermore, low FID results obtained on the synthetic data further validate the efficacy of the proposed methodology in generating annotated images for table structure.

[CV-52] Anim-Director: A Large Multimodal Model Powered Agent for Controllable Animation Video Generation SIGGRAPH

Link: https://arxiv.org/abs/2408.09787
Authors: Yunxin Li, Haoyuan Shi, Baotian Hu, Longyue Wang, Jiashun Zhu, Jinyi Xu, Zhen Zhao, Min Zhang
Keywords: high training costs, incurs high training, sophisticated multi-stage pipeline, demands substantial human, substantial human effort
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: Accepted by SIGGRAPH Asia 2024, Project and Codes: this https URL

Abstract:Traditional animation generation methods depend on training generative models with human-labelled data, entailing a sophisticated multi-stage pipeline that demands substantial human effort and incurs high training costs. Due to limited prompting plans, these methods typically produce brief, information-poor, and context-incoherent animations. To overcome these limitations and automate the animation process, we pioneer the introduction of large multimodal models (LMMs) as the core processor to build an autonomous animation-making agent, named Anim-Director. This agent mainly harnesses the advanced understanding and reasoning capabilities of LMMs and generative AI tools to create animated videos from concise narratives or simple instructions. Specifically, it operates in three main stages: Firstly, the Anim-Director generates a coherent storyline from user inputs, followed by a detailed director’s script that encompasses settings of character profiles and interior/exterior descriptions, and context-coherent scene descriptions that include appearing characters, interiors or exteriors, and scene events. Secondly, we employ LMMs with the image generation tool to produce visual images of settings and scenes. These images are designed to maintain visual consistency across different scenes using a visual-language prompting method that combines scene descriptions and images of the appearing character and setting. Thirdly, scene images serve as the foundation for producing animated videos, with LMMs generating prompts to guide this process. The whole process is notably autonomous without manual intervention, as the LMMs interact seamlessly with generative tools to generate prompts, evaluate visual quality, and select the best one to optimize the final output.

[CV-53] Cross-composition Feature Disentanglement for Compositional Zero-shot Learning

Link: https://arxiv.org/abs/2408.09786
Authors: Yuxia Geng, Runkai Zhu, Jiaoyan Chen, Jintai Chen, Zhuo Chen, Xiang Chen, Can Xu, Yuxiang Wang, Xiaoliang Xu
Keywords: Compositional Zero-shot Learning, Zero-shot Learning, shown exceptional results, Compositional Zero-shot, shown exceptional
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: work in progress

Abstract:Disentanglement of visual features of primitives (i.e., attributes and objects) has shown exceptional results in Compositional Zero-shot Learning (CZSL). However, due to the feature divergence of an attribute (resp. object) when combined with different objects (resp. attributes), it is challenging to learn disentangled primitive features that are general across different compositions. To this end, we propose the solution of cross-composition feature disentanglement, which takes multiple primitive-sharing compositions as inputs and constrains the disentangled primitive features to be general across these compositions. More specifically, we leverage a compositional graph to define the overall primitive-sharing relationships between compositions, and build a task-specific architecture upon the recently successful large pre-trained vision-language model (VLM) CLIP, with dual cross-composition disentangling adapters (called L-Adapter and V-Adapter) inserted into CLIP’s frozen text and image encoders, respectively. Evaluation on three popular CZSL benchmarks shows that our proposed solution significantly improves the performance of CZSL, and its components have been verified by solid ablation studies.

[CV-54] Event Stream based Human Action Recognition: A High-Definition Benchmark Dataset and Algorithms

Link: https://arxiv.org/abs/2408.09764
Authors: Xiao Wang, Shiao Wang, Pengpeng Shao, Bo Jiang, Lin Zhu, Yonghong Tian
Keywords: pivotal research domain, RGB cameras dominating, Human Action Recognition, RGB cameras, RGB cameras encounter
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: In Peer Review

Abstract:Human Action Recognition (HAR) stands as a pivotal research domain in both computer vision and artificial intelligence, with RGB cameras dominating as the preferred tool for investigation and innovation in this field. However, in real-world applications, RGB cameras encounter numerous challenges, including light conditions, fast motion, and privacy concerns. Consequently, bio-inspired event cameras have garnered increasing attention due to their advantages of low energy consumption, high dynamic range, etc. Nevertheless, most existing event-based HAR datasets are low resolution (346×260). In this paper, we propose a large-scale, high-definition (1280×800) human action recognition dataset based on the CeleX-V event camera, termed CeleX-HAR. It encompasses 150 commonly occurring action categories, comprising a total of 124,625 video sequences. Various factors such as multi-view, illumination, action speed, and occlusion are considered when recording these data. To build a more comprehensive benchmark dataset, we report over 20 mainstream HAR models for future works to compare. In addition, we also propose a novel Mamba vision backbone network for event stream based HAR, termed EVMamba, which equips the spatial plane multi-directional scanning and novel voxel temporal scanning mechanism. By encoding and mining the spatio-temporal information of event streams, our EVMamba has achieved favorable results across multiple datasets. Both the dataset and source code will be released on this https URL.

[CV-55] A Unified Framework for Iris Anti-Spoofing: Introducing IrisGeneral Dataset and Masked-MoE Method

Link: https://arxiv.org/abs/2408.09752
Authors: Hang Zou, Chenxi Du, Ajian Liu, Yuan Zhang, Jing Liu, Mingchuan Yang, Jun Wan, Hui Zhang
Keywords: high-security scenarios due, stability and distinctiveness, iris anti-spoofing, recognition is widely, high-security scenarios
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Iris recognition is widely used in high-security scenarios due to its stability and distinctiveness. However, the acquisition of iris images typically requires near-infrared illumination and near-infrared band filters, leading to significant and consistent differences in imaging across devices. This underscores the importance of developing cross-domain capabilities in iris anti-spoofing methods. Despite this need, there is no dataset available that comprehensively evaluates the generalization ability of the iris anti-spoofing task. To address this gap, we propose the IrisGeneral dataset, which includes 10 subsets, belonging to 7 databases, published by 4 institutions, collected with 6 types of devices. IrisGeneral is designed with three protocols, aimed at evaluating average performance, cross-racial generalization, and cross-device generalization of iris anti-spoofing models. To tackle the challenge of integrating multiple sub-datasets in IrisGeneral, we employ multiple parameter sets to learn from the various subsets. Specifically, we utilize the Mixture of Experts (MoE) to fit complex data distributions using multiple sub-neural networks. To further enhance the generalization capabilities, we introduce a novel method Masked-MoE (MMoE). It randomly masks a portion of tokens for some experts and requires their outputs to be similar to the unmasked experts, which improves the generalization ability and effectively mitigates the overfitting issue produced by MoE. We selected ResNet50, VIT-B/16, CLIP, and FLIP as representative models and benchmarked them on the IrisGeneral dataset. Experimental results demonstrate that our proposed MMoE with CLIP achieves the best performance on IrisGeneral.

[CV-56] Enhanced Cascade Prostate Cancer Classifier in mp-MRI Utilizing Recall Feedback Adaptive Loss and Prior Knowledge-Based Feature Extraction

Link: https://arxiv.org/abs/2408.09746
Authors: Kun Luo, Bowen Zheng, Shidong Lv, Jie Tao, Qiang Wei
Keywords: males worldwide, Prostate cancer, mpMRI, Cascade Prostate Cancer, Prostate Cancer Classifier
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Prostate cancer is the second most common cancer in males worldwide, and mpMRI is commonly used for diagnosis. However, interpreting mpMRI is challenging and requires expertise from radiologists. This highlights the urgent need for automated grading in mpMRI. Existing studies lack integration of clinical prior information and suffer from uneven training sample distribution due to prevalence. Therefore, we propose a solution that incorporates prior knowledge, addresses the issue of uneven medical sample distribution, and maintains high interpretability in mpMRI. Firstly, we introduce Prior Knowledge-Based Feature Extraction, which mathematically models the PI-RADS criteria for prostate cancer as diagnostic information into model training. Secondly, we propose Adaptive Recall Feedback Loss to address the extremely imbalanced data problem. This method adjusts the training dynamically based on accuracy and recall in the validation set, resulting in high accuracy and recall simultaneously in the testing set. Thirdly, we design an Enhanced Cascade Prostate Cancer Classifier that classifies prostate cancer into different levels in an interpretable way, which refines the classification results and helps with clinical intervention. Our method is validated through experiments on the PI-CAI dataset and outperforms other methods with a more balanced result in both accuracy and recall rate.

[CV-57] RealCustom++: Representing Images as Real-Word for Real-Time Customization

Link: https://arxiv.org/abs/2408.09744
Authors: Zhendong Mao, Mengqi Huang, Fei Ding, Mingcong Liu, Qian He, Xiaojun Chang, Yongdong Zhang
Keywords: images depicting, aims to synthesize, subject appearance, synthesize new images, images that align
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages

Abstract:Text-to-image customization, which takes given texts and images depicting given subjects as inputs, aims to synthesize new images that align with both text semantics and subject appearance. This task provides precise control over details that text alone cannot capture and is fundamental for various real-world applications, garnering significant interest from academia and industry. Existing works follow the pseudo-word paradigm, which involves representing given subjects as pseudo-words and combining them with given texts to collectively guide the generation. However, the inherent conflict and entanglement between the pseudo-words and texts result in a dual-optimum paradox, where subject similarity and text controllability cannot be optimal simultaneously. We propose a novel real-words paradigm termed RealCustom++ that instead represents subjects as non-conflict real words, thereby disentangling subject similarity from text controllability and allowing both to be optimized simultaneously. Specifically, RealCustom++ introduces a novel “train-inference” decoupled framework: (1) During training, RealCustom++ learns the alignment between vision conditions and all real words in the text, ensuring high subject-similarity generation in open domains. This is achieved by the cross-layer cross-scale projector to robustly and finely extract subject features, and a curriculum training recipe that adapts the generated subject to diverse poses and sizes. (2) During inference, leveraging the learned general alignment, an adaptive mask guidance is proposed to only customize the generation of the specific target real word, keeping other subject-irrelevant regions uncontaminated to ensure high text-controllability in real-time.

[CV-58] R2GenCSR: Retrieving Context Samples for Large Language Model based X-ray Medical Report Generation

Link: https://arxiv.org/abs/2408.09743
Authors: Xiao Wang, Yuehang Li, Fuling Wang, Shiao Wang, Chuanfu Li, Bo Jiang
Keywords: Large Language Models, leverage large models, Large Language, generation methods attempt, existing X-ray medical
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: In Peer Review

Abstract:Inspired by the tremendous success of Large Language Models (LLMs), existing X-ray medical report generation methods attempt to leverage large models to achieve better performance. They usually adopt a Transformer to extract the visual features of a given X-ray image and then feed them into the LLM for text generation. How to extract more effective information for the LLMs to improve the final results remains an urgent open problem. Additionally, the use of visual Transformer models brings high computational complexity. To address these issues, this paper proposes a novel context-guided efficient X-ray medical report generation framework. Specifically, we introduce Mamba as the vision backbone with linear complexity, and the performance obtained is comparable to that of the strong Transformer model. More importantly, we perform context retrieval from the training set for samples within each mini-batch during the training phase, utilizing both positively and negatively related samples to enhance feature representation and discriminative learning. Subsequently, we feed the vision tokens, context information, and prompt statements to invoke the LLM for generating high-quality medical reports. Extensive experiments on three X-ray report generation datasets (i.e., IU-Xray, MIMIC-CXR, CheXpert Plus) fully validated the effectiveness of our proposed model. The source code of this work will be released on this https URL.

[CV-59] TraDiffusion: Trajectory-Based Training-Free Image Generation

Link: https://arxiv.org/abs/2408.09739
Authors: Mingrui Wu, Oucheng Huang, Jiayi Ji, Jiale Li, Xinyue Cai, Huafeng Kuang, Jianzhuang Liu, Xiaoshuai Sun, Rongrong Ji
Keywords: trajectory-based controllable, propose a training-free, termed TraDiffusion, energy function, guide image generation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The code: this https URL

Abstract:In this work, we propose a training-free, trajectory-based controllable T2I approach, termed TraDiffusion. This novel method allows users to effortlessly guide image generation via mouse trajectories. To achieve precise control, we design a distance awareness energy function to effectively guide latent variables, ensuring that the focus of generation is within the areas defined by the trajectory. The energy function encompasses a control function to draw the generation closer to the specified trajectory and a movement function to diminish activity in areas distant from the trajectory. Through extensive experiments and qualitative assessments on the COCO dataset, the results reveal that TraDiffusion facilitates simpler, more natural image control. Moreover, it showcases the ability to manipulate salient regions, attributes, and relationships within the generated images, alongside visual input based on arbitrary or enhanced trajectories.

[CV-60] Mutually-Aware Feature Learning for Few-Shot Object Counting

Link: https://arxiv.org/abs/2408.09734
Authors: Yerim Jeon, Subeen Lee, Jihwan Kim, Jae-Pil Heo
Keywords: Few-shot object counting, garnered significant attention, Few-shot object, query image based, additional training
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Submitted to Pattern Recognition

Abstract:Few-shot object counting has garnered significant attention for its practicality, as it aims to count target objects in a query image based on given exemplars without the need for additional training. However, there is a shortcoming in the prevailing extract-and-match approach: query and exemplar features lack interaction during feature extraction since they are extracted unaware of each other and later correlated based on similarity. This can lead to insufficient target awareness of the extracted features, resulting in target confusion when identifying the actual target among coexisting objects of multiple classes. To address this limitation, we propose a novel framework, Mutually-Aware FEAture learning (MAFEA), which encodes query and exemplar features mutually aware of each other from the outset. By encouraging interaction between query and exemplar features throughout the entire pipeline, we can obtain target-aware features that are robust to a multi-category scenario. Furthermore, we introduce a background token to effectively associate the target region of the query with exemplars and decouple its background region from them. Our extensive experiments demonstrate that our model reaches a new state-of-the-art performance on the two challenging benchmarks, FSCD-LVIS and FSC-147, with a remarkably reduced degree of the target confusion problem.

[CV-61] Pedestrian Attribute Recognition: A New Benchmark Dataset and A Large Language Model Augmented Framework

Link: https://arxiv.org/abs/2408.09720
Authors: Jiandong Jin, Xiao Wang, Qian Zhu, Haiyang Wang, Chenglong Li
Keywords: Pedestrian Attribute Recognition, human-centered research, indispensable tasks, tasks in human-centered, Attribute Recognition
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: MSP60K PAR Benchmark Dataset, LLM based PAR model, In Peer Review

Abstract:Pedestrian Attribute Recognition (PAR) is one of the indispensable tasks in human-centered research. However, existing datasets neglect different domains (e.g., environments, times, populations, and data sources), only conducting simple random splits, and the performance of these datasets has already approached saturation. In the past five years, no large-scale dataset has been opened to the public. To address this issue, this paper proposes a new large-scale, cross-domain pedestrian attribute recognition dataset to fill the data gap, termed MSP60K. It consists of 60,122 images and 57 attribute annotations across eight scenarios. Synthetic degradation is also conducted to further narrow the gap between the dataset and real-world challenging scenarios. To establish a more rigorous benchmark, we evaluate 17 representative PAR models under both random and cross-domain split protocols on our dataset. Additionally, we propose an innovative Large Language Model (LLM) augmented PAR framework, named LLM-PAR. This framework processes pedestrian images through a Vision Transformer (ViT) backbone to extract features and introduces a multi-embedding query Transformer to learn partial-aware features for attribute classification. Significantly, we enhance this framework with LLM for ensemble learning and visual feature augmentation. Comprehensive experiments across multiple PAR benchmark datasets have thoroughly validated the efficacy of our proposed framework. The dataset and source code accompanying this paper will be made publicly available at this https URL.

[CV-62] HYDEN: Hyperbolic Density Representations for Medical Images and Reports

Link: https://arxiv.org/abs/2408.09715
Authors: Zhi Qiao, Linbin Han, Xiantong Zhen, Jia-Hong Gao, Zhen Qian
Keywords: hierarchical modeling advantages, inherent entailment relations, point vector embeddings, visual semantic representation, point vector
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Abstract:In light of the inherent entailment relations between images and text, hyperbolic point vector embeddings, leveraging the hierarchical modeling advantages of hyperbolic space, have been utilized for visual semantic representation learning. However, point vector embedding approaches fail to address the issue of semantic uncertainty, where an image may have multiple interpretations, and text may refer to different images, a phenomenon particularly prevalent in the medical domain. Therefore, we propose HYDEN, a novel hyperbolic density embedding based image-text representation learning approach tailored for specific medical domain data. This method integrates text-aware local features alongside global features from images, mapping image-text features to density features in hyperbolic space using hyperbolic pseudo-Gaussian distributions. An encapsulation loss function is employed to model the partial order relations between image-text density distributions. Experimental results demonstrate the interpretability of our approach and its superior performance compared to the baseline methods across various zero-shot tasks and different datasets.

[CV-63] Dataset Distillation for Histopathology Image Classification

Link: https://arxiv.org/abs/2408.09709
Authors: Cong Cong, Shiyu Xuan, Sidong Liu, Maurice Pagnucco, Shiliang Zhang, Yang Song
Keywords: Deep neural networks, exhibited remarkable success, Deep neural, histopathology image analysis, neural networks
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Deep neural networks (DNNs) have exhibited remarkable success in the field of histopathology image analysis. On the other hand, the contemporary trend of employing large models and extensive datasets has underscored the significance of dataset distillation, which involves compressing large-scale datasets into a condensed set of synthetic samples, offering distinct advantages in improving training efficiency and streamlining downstream applications. In this work, we introduce a novel dataset distillation algorithm tailored for histopathology image datasets (Histo-DD), which integrates stain normalisation and model augmentation into the distillation process. Such integration can substantially enhance the compatibility with histopathology images that are often characterised by high colour heterogeneity. We conduct a comprehensive evaluation of the effectiveness of the proposed algorithm and the generated histopathology samples in both patch-level and slide-level classification tasks. The experimental results, carried out on three publicly available WSI datasets, including Camelyon16, TCGA-IDH, and UniToPath, demonstrate that the proposed Histo-DD can generate more informative synthetic patches than previous coreset selection and patch sampling methods. Moreover, the synthetic samples can preserve discriminative information, substantially reduce training efforts, and exhibit architecture-agnostic properties. These advantages indicate that synthetic samples can serve as an alternative to large-scale datasets.

[CV-64] MePT: Multi-Representation Guided Prompt Tuning for Vision-Language Model

Link: https://arxiv.org/abs/2408.09706
Authors: Xinyang Wang, Yi Yang, Minfeng Zhu, Kecheng Zheng, Shi Liu, Wei Chen
Keywords: pre-trained Vision-Language Models, Recent advancements, Guided Prompt Tuning, prompt tuning, existing prompt tuning
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Recent advancements in pre-trained Vision-Language Models (VLMs) have highlighted the significant potential of prompt tuning for adapting these models to a wide range of downstream tasks. However, existing prompt tuning methods typically map an image to a single representation, limiting the model’s ability to capture the diverse ways an image can be described. To address this limitation, we investigate the impact of visual prompts on the model’s generalization capability and introduce a novel method termed Multi-Representation Guided Prompt Tuning (MePT). Specifically, MePT employs a three-branch framework that focuses on diverse salient regions, uncovering the inherent knowledge within images which is crucial for robust generalization. Further, we employ efficient self-ensemble techniques to integrate these versatile image representations, allowing MePT to learn all conditional, marginal, and fine-grained distributions effectively. We validate the effectiveness of MePT through extensive experiments, demonstrating significant improvements on both base-to-novel class prediction and domain generalization tasks.

[CV-65] Photorealistic Object Insertion with Diffusion-Guided Inverse Rendering ECCV2024

Link: https://arxiv.org/abs/2408.09702
Authors: Ruofan Liang, Zan Gojcic, Merlin Nimier-David, David Acuna, Nandita Vijaykumar, Sanja Fidler, Zian Wang
Keywords: image formation process, real-world scenes requires, image formation, correct insertion, requires a deep
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments: ECCV 2024, Project page: this https URL

Abstract:The correct insertion of virtual objects in images of real-world scenes requires a deep understanding of the scene’s lighting, geometry and materials, as well as the image formation process. While recent large-scale diffusion models have shown strong generative and inpainting capabilities, we find that current models do not sufficiently “understand” the scene shown in a single picture to generate consistent lighting effects (shadows, bright reflections, etc.) while preserving the identity and details of the composited object. We propose using a personalized large diffusion model as guidance to a physically based inverse rendering process. Our method recovers scene lighting and tone-mapping parameters, allowing the photorealistic composition of arbitrary virtual objects in single frames or videos of indoor or outdoor scenes. Our physically based pipeline further enables automatic materials and tone-mapping refinement.

[CV-66] MambaLoc: Efficient Camera Localisation via State Space Model

Link: https://arxiv.org/abs/2408.09680
Authors: Jialu Wang, Kaichen Zhou, Andrew Markham, Niki Trigoni
Keywords: edge-cloud IoT systems, Location information, augmented reality, automation and intelligence, intelligence of terminal
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Location information is pivotal for the automation and intelligence of terminal devices and edge-cloud IoT systems, such as autonomous vehicles and augmented reality. However, achieving reliable positioning across diverse IoT applications remains challenging due to significant training costs and the necessity of densely collected data. To tackle these issues, we have innovatively applied the selective state space (SSM) model to visual localization, introducing a new model named MambaLoc. The proposed model demonstrates exceptional training efficiency by capitalizing on the SSM model’s strengths in efficient feature extraction, rapid computation, and memory optimization, and it further ensures robustness in sparse data environments due to its parameter sparsity. Additionally, we propose the Global Information Selector (GIS), which leverages selective SSM to implicitly achieve the efficient global feature extraction capabilities of Non-local Neural Networks. This design leverages the computational efficiency of the SSM model alongside the Non-local Neural Networks’ capacity to capture long-range dependencies with minimal layers. Consequently, the GIS enables effective global information capture while significantly accelerating convergence. Our extensive experimental validation using public indoor and outdoor datasets first demonstrates our model’s effectiveness, followed by evidence of its versatility with various existing localization models.

[CV-67] Image-based Freeform Handwriting Authentication with Energy-oriented Self-Supervised Learning

Link: https://arxiv.org/abs/2408.09676
Authors: Jingyao Wang, Luntian Mou, Changwen Zheng, Wen Gao
Keywords: Freeform handwriting authentication, handwriting authentication verifies, verifies a person, person identity, writing style
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by TMM

Abstract:Freeform handwriting authentication verifies a person’s identity from their writing style and habits in messy handwriting data. This technique has gained widespread attention in recent years as a valuable tool for various fields, e.g., fraud prevention and cultural heritage protection. However, it still remains a challenging task in reality due to three reasons: (i) severe damage, (ii) complex high-dimensional features, and (iii) lack of supervision. To address these issues, we propose SherlockNet, an energy-oriented two-branch contrastive self-supervised learning framework for robust and fast freeform handwriting authentication. It consists of four stages: (i) pre-processing: converting manuscripts into energy distributions using a novel plug-and-play energy-oriented operator to eliminate the influence of noise; (ii) generalized pre-training: learning general representation through two-branch momentum-based adaptive contrastive learning with the energy distributions, which handles the high-dimensional features and spatial dependencies of handwriting; (iii) personalized fine-tuning: calibrating the learned knowledge using a small amount of labeled data from downstream tasks; and (iv) practical application: identifying individual handwriting from scrambled, missing, or forged data efficiently and conveniently. Considering the practicality, we construct EN-HA, a novel dataset that simulates data forgery and severe damage in real applications. Finally, we conduct extensive experiments on six benchmark datasets including our EN-HA, and the results prove the robustness and efficiency of SherlockNet.

[CV-68] Implicit Grid Convolution for Multi-Scale Image Super-Resolution

Link: https://arxiv.org/abs/2408.09674
Authors: Dongheon Lee, Seokju Yun, Youngmin Ro
Keywords: employing neural networks, neural networks, employing neural, Recently, Implicit Grid Convolution
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Recently, Super-Resolution (SR) has achieved significant performance improvement by employing neural networks. Most SR methods conventionally train a single model for each targeted scale, which increases redundancy in training and deployment in proportion to the number of scales targeted. This paper challenges this conventional fixed-scale approach. Our preliminary analysis reveals that, surprisingly, encoders trained at different scales extract similar features from images. Furthermore, the commonly used scale-specific upsampler, Sub-Pixel Convolution (SPConv), exhibits significant inter-scale correlations. Based on these observations, we propose a framework for training multiple integer scales simultaneously with a single model. We use a single encoder to extract features and introduce a novel upsampler, Implicit Grid Convolution (IGConv), which integrates SPConv at all scales within a single module to predict multiple scales. Our extensive experiments demonstrate that training multiple scales with a single model reduces the training budget and stored parameters by one-third while achieving equivalent inference latency and comparable performance. Furthermore, we propose IGConv+, which addresses spectral bias and input-independent upsampling and uses ensemble prediction to improve performance. As a result, SRFormer-IGConv+ achieves a remarkable 0.25dB improvement in PSNR at Urban100 (×4) while reducing the training budget, stored parameters, and inference cost compared to the existing SRFormer.
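
The abstract refers to Sub-Pixel Convolution (SPConv) as the usual scale-specific upsampler. As a rough illustration only, the sketch below shows the rearrangement step that follows the convolution in a sub-pixel upsampler: channels are folded into spatial positions. It is not IGConv and not the authors' code; the function name `pixel_shuffle`, the nested-list layout, and the channel ordering are assumptions made for this example.

```python
def pixel_shuffle(feature_maps, scale):
    """Rearrange (C * scale^2, H, W) nested lists into (C, H * scale, W * scale).

    Hypothetical helper for illustration; the channel ordering is an assumption.
    """
    channels_in = len(feature_maps)
    if channels_in % (scale * scale) != 0:
        raise ValueError("channel count must be divisible by scale**2")
    channels_out = channels_in // (scale * scale)
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    output = [
        [[0.0] * (width * scale) for _ in range(height * scale)]
        for _ in range(channels_out)
    ]
    for c in range(channels_out):
        for i in range(scale):        # vertical sub-pixel offset
            for j in range(scale):    # horizontal sub-pixel offset
                source = feature_maps[c * scale * scale + i * scale + j]
                for h in range(height):
                    for w in range(width):
                        output[c][h * scale + i][w * scale + j] = source[h][w]
    return output


# Four 1x1 channels become one 2x2 image when scale = 2.
low_res = [[[1.0]], [[2.0]], [[3.0]], [[4.0]]]
high_res = pixel_shuffle(low_res, scale=2)
```

Because the number of input channels depends on `scale`, a conventional upsampler of this kind is tied to one scale, which is the redundancy the abstract says a single multi-scale module is meant to avoid.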

[CV-69] SG-GS: Photo-realistic Animatable Human Avatars with Semantically-Guided Gaussian Splatting

Link: https://arxiv.org/abs/2408.09665
Authors: Haoyu Zhao, Chen Yang, Hao Wang, Xingyue Zhao, Wei Shen
Keywords: Reconstructing photo-realistic animatable, videos remains challenging, Reconstructing photo-realistic, animatable human avatars, photo-realistic animatable human
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 5 figures

Abstract:Reconstructing photo-realistic animatable human avatars from monocular videos remains challenging in computer vision and graphics. Recently, methods using 3D Gaussians to represent the human body have emerged, offering faster optimization and real-time rendering. However, due to ignoring the crucial role of human body semantic information which represents the intrinsic structure and connections within the human body, they fail to achieve fine-detail reconstruction of dynamic human avatars. To address this issue, we propose SG-GS, which uses semantics-embedded 3D Gaussians, skeleton-driven rigid deformation, and non-rigid cloth dynamics deformation to create photo-realistic animatable human avatars from monocular videos. We then design a Semantic Human-Body Annotator (SHA) which utilizes SMPL’s semantic prior for efficient body part semantic labeling. The generated labels are used to guide the optimization of Gaussian semantic attributes. To address the limited receptive field of point-level MLPs for local features, we also propose a 3D network that integrates geometric and semantic associations for human avatar deformation. We further implement three key strategies to enhance the semantic accuracy of 3D Gaussians and rendering quality: semantic projection with 2D regularization, semantic-guided density regularization and semantic-aware regularization with neighborhood consistency. Extensive experiments demonstrate that SG-GS achieves state-of-the-art geometry and appearance reconstruction performance.

[CV-70] CHASE: 3D-Consistent Human Avatars with Sparse Inputs via Gaussian Splatting and Contrastive Learning

Link: https://arxiv.org/abs/2408.09663
Authors: Haoyu Zhao, Hao Wang, Chen Yang, Wei Shen
Keywords: photo-realistic animatable human, utilized radiance fields, reconstruct photo-realistic animatable, Recent advancements, human avatar synthesis
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 6 figures

Abstract:Recent advancements in human avatar synthesis have utilized radiance fields to reconstruct photo-realistic animatable human avatars. However, both NeRF-based and 3DGS-based methods struggle with maintaining 3D consistency and exhibit suboptimal detail reconstruction, especially with sparse inputs. To address this challenge, we propose CHASE, which introduces supervision from intrinsic 3D consistency across poses and 3D geometry contrastive learning, achieving performance with sparse inputs comparable to that with full inputs. Following previous work, we first integrate a skeleton-driven rigid deformation and a non-rigid cloth dynamics deformation to coordinate the movements of individual Gaussians during animation, reconstructing a basic avatar with coarse 3D consistency. To improve 3D consistency under sparse inputs, we design Dynamic Avatar Adjustment (DAA) to adjust deformed Gaussians based on a selected similar pose/image from the dataset. Minimizing the difference between the image rendered by adjusted Gaussians and the image with the similar pose serves as an additional form of supervision for the avatar. Furthermore, we propose a 3D geometry contrastive learning strategy to maintain the 3D global consistency of generated avatars. Though CHASE is designed for sparse inputs, it surprisingly outperforms current SOTA methods in both full and sparse settings on the ZJU-MoCap and H36M datasets, demonstrating that CHASE successfully maintains the avatar’s 3D consistency, hence improving rendering quality.

[CV-71] ExpoMamba: Exploiting Frequency SSM Blocks for Efficient and Effective Image Enhancement

Link: https://arxiv.org/abs/2408.09650
Authors: Eashan Adhikarla, Kai Zhang, John Nicholson, Brian D. Davison
Keywords: handling high-resolution images, computer vision, remains a challenging, challenging task, task in computer
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Image and Video Processing (eess.IV)

Abstract:Low-light image enhancement remains a challenging task in computer vision, with existing state-of-the-art models often limited by hardware constraints and computational inefficiencies, particularly in handling high-resolution images. Recent foundation models, such as transformers and diffusion models, despite their efficacy in various domains, are limited in use on edge devices due to their computational complexity and slow inference times. We introduce ExpoMamba, a novel architecture that integrates components of the frequency state space within a modified U-Net, offering a blend of efficiency and effectiveness. This model is specifically optimized to address mixed exposure challenges, a common issue in low-light image enhancement, while ensuring computational efficiency. Our experiments demonstrate that ExpoMamba enhances low-light images up to 2-3x faster than traditional models with an inference time of 36.6 ms and achieves a PSNR improvement of approximately 15-20% over competing models, making it highly suitable for real-time image processing applications.

[CV-72] C2P-CLIP: Injecting Category Common Prompt in CLIP to Enhance Generalization in Deepfake Detection

Link: https://arxiv.org/abs/2408.09647
Authors: Chuangchuang Tan, Renshuai Tao, Huan Liu, Guanghua Gu, Baoyuan Wu, Yao Zhao, Yunchao Wei
Keywords: develop universal detectors, universal detectors capable, focuses on AIGC, AIGC detection, CLIP
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 5 figures

Abstract:This work focuses on AIGC detection to develop universal detectors capable of identifying various types of forgery images. Recent studies have found large pre-trained models, such as CLIP, are effective for generalizable deepfake detection along with linear classifiers. However, two critical issues remain unresolved: 1) understanding why CLIP features are effective on deepfake detection through a linear classifier; and 2) exploring the detection potential of CLIP. In this study, we delve into the underlying mechanisms of CLIP’s detection capabilities by decoding its detection features into text and performing word frequency analysis. Our findings indicate that CLIP detects deepfakes by recognizing similar concepts (Fig. 1a). Building on this insight, we introduce Category Common Prompt CLIP, called C2P-CLIP, which integrates the category common prompt into the text encoder to inject category-related concepts into the image encoder, thereby enhancing detection performance (Fig. 1b). Our method achieves a 12.41% improvement in detection accuracy compared to the original CLIP, without introducing additional parameters during testing. Comprehensive experiments conducted on two widely-used datasets, encompassing 20 generation models, validate the efficacy of the proposed method, demonstrating state-of-the-art performance. The code is available at this https URL.

[CV-73] The First Competition on Resource-Limited Infrared Small Target Detection Challenge: Methods and Results

Link: https://arxiv.org/abs/2408.09615
Authors: Boyang Li, Xinyi Ying, Ruojing Li, Yongxian Liu, Yangsi Shi, Miao Li
Keywords: small target detection, infrared small target, target detection, small target, infrared small
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:In this paper, we briefly summarize the first competition on resource-limited infrared small target detection (namely, LimitIRSTD). This competition has two tracks: weakly-supervised infrared small target detection (Track 1) and lightweight infrared small target detection (Track 2). 46 and 60 teams successfully registered and took part in Track 1 and Track 2, respectively. The top-performing methods and their results in each track are described in detail. This competition inspires the community to explore the tough problems in the application of infrared small target detection, and ultimately promotes the deployment of this technology under limited resources.

[CV-74] Enhancing ASL Recognition with GCNs and Successive Residual Connections

Link: https://arxiv.org/abs/2408.09567
Authors: Ushnish Sarkar, Archisman Chakraborti, Tapas Samanta, Sarbajit Pal, Amitabha Das
Keywords: American Sign Language, enhancing American Sign, Sign Language, American Sign, Graph Convolutional Networks
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: To be submitted in G2-SP CV 2024. Contains 7 pages, 5 figures

Abstract:This study presents a novel approach for enhancing American Sign Language (ASL) recognition using Graph Convolutional Networks (GCNs) integrated with successive residual connections. The method leverages the MediaPipe framework to extract key landmarks from each hand gesture, which are then used to construct graph representations. A robust preprocessing pipeline, including translational and scale normalization techniques, ensures consistency across the dataset. The constructed graphs are fed into a GCN-based neural architecture with residual connections to improve network stability. The architecture achieves state-of-the-art results, demonstrating superior generalization capabilities with a validation accuracy of 99.14%.
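
The preprocessing described above (translational and scale normalization of hand landmarks) could look roughly like the following. This is a minimal sketch under stated assumptions, not the paper's pipeline: landmarks are taken as 2D `(x, y)` pairs, the point at `origin_index` is used as the origin (index 0 is the wrist in MediaPipe Hands), and the scale is the largest distance from that origin. The function name `normalize_landmarks` is hypothetical.

```python
import math


def normalize_landmarks(landmarks, origin_index=0):
    """Translate landmarks so one point is the origin, then scale to unit size."""
    ox, oy = landmarks[origin_index]
    centered = [(x - ox, y - oy) for x, y in landmarks]
    # Largest distance from the origin; zero only if all points coincide.
    scale = max(math.hypot(x, y) for x, y in centered)
    if scale == 0.0:
        return centered
    return [(x / scale, y / scale) for x, y in centered]


hand = [(2.0, 2.0), (2.0, 5.0), (6.0, 2.0)]
normalized = normalize_landmarks(hand)
```

After this step the same gesture yields the same coordinates regardless of where the hand sits in the frame or how large it appears, which is the consistency the abstract attributes to its preprocessing.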

[CV-75] Generating Automatically Print/Scan Textures for Morphing Attack Detection Applications

Link: https://arxiv.org/abs/2408.09558
Authors: Juan E. Tapia, Maximilian Russo, Christoph Busch
Keywords: Morphing Attack Detection, relevant topic, topic that aims, aims to detect, detect attempts
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Paper under revision process in Journal

Abstract:Morphing Attack Detection (MAD) is a relevant topic that aims to detect attempts by unauthorised individuals to access a “valid” identity. One of the main scenarios is printing morphed images and submitting the respective print in a passport application process. Today, only small datasets are available to train MAD algorithms because of privacy concerns and the effort associated with printing and scanning images in large numbers. In order to improve the detection capabilities and spot such morphing attacks, it will be necessary to have a larger and more realistic dataset representing the passport application scenario with the diversity of devices and the resulting printed, scanned, or compressed images. Creating training data representing the diversity of attacks is a very demanding task because the training material is developed manually. This paper proposes two different methods based on transfer-transfer for automatically creating digital print/scan face images and using such images in the training of a Morphing Attack Detection algorithm. Our proposed method can reach an Equal Error Rate (EER) of 3.84% and 1.92% on the FRGC/FERET database when including our synthetic and texture-transfer print/scan with 600 dpi to handcrafted images, respectively.
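
The results above are reported as an Equal Error Rate (EER), the operating point at which the two error rates of a detector are equal. As a hedged illustration of the metric only (not the paper's evaluation code), the sketch below sweeps thresholds over detector scores and returns the error rate where the two curves are closest. It assumes that higher scores mean "more likely an attack"; the function name and the sample scores are made up for the example.

```python
def equal_error_rate(bona_fide_scores, attack_scores):
    """Return the EER, assuming higher scores mean 'more likely an attack'."""
    thresholds = sorted(set(bona_fide_scores) | set(attack_scores))
    best_gap, best_eer = None, None
    for threshold in thresholds:
        # Fraction of bona fide samples wrongly flagged as attacks.
        false_reject = sum(s >= threshold for s in bona_fide_scores) / len(bona_fide_scores)
        # Fraction of attacks wrongly accepted as bona fide.
        false_accept = sum(s < threshold for s in attack_scores) / len(attack_scores)
        gap = abs(false_reject - false_accept)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (false_reject + false_accept) / 2
    return best_eer


# Illustrative scores only; not taken from any dataset.
bona_fide = [0.1, 0.2, 0.3, 0.6]
attacks = [0.4, 0.7, 0.8, 0.9]
eer = equal_error_rate(bona_fide, attacks)
```

With these toy scores, one of four bona fide samples and one of four attacks end up on the wrong side of the best threshold, so the sketch returns an EER of 25%.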

[CV-76] AnomalyFactory: Regard Anomaly Generation as Unsupervised Anomaly Localization ECCV2024

Link: https://arxiv.org/abs/2408.09533
Authors: Ying Zhao
Keywords: Recent advances, generation approaches alleviate, anomaly generation approaches, approaches alleviate, alleviate the effect
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to the 2nd workshop on Vision-based InduStrial InspectiON (VISION) at ECCV 2024

Abstract:Recent advances in anomaly generation approaches alleviate the effect of data insufficiency on the task of anomaly localization. While effective, most of them learn multiple large generative models on different datasets and cumbersome anomaly prediction models for different classes. To address the limitations, we propose a novel scalable framework, named AnomalyFactory, that unifies unsupervised anomaly generation and localization with the same network architecture. It starts with a BootGenerator that combines the structure of a target edge map and the appearance of a reference color image with the guidance of a learned heatmap. Then, it proceeds with a FlareGenerator that receives supervision signals from the BootGenerator and reforms the heatmap to indicate anomaly locations in the generated image. Finally, it easily transforms the same network architecture to a BlazeDetector that localizes anomaly pixels with the learned heatmap by converting the anomaly images generated by the FlareGenerator to normal images. By manipulating the target edge maps and combining them with various reference images, AnomalyFactory generates authentic and diverse samples across domains. Comprehensive experiments carried out on 5 datasets, including MVTecAD, VisA, MVTecLOCO, MADSim and RealIAD, demonstrate that our approach is superior to competitors in generation capability and scalability.

[CV-77] NAVERO: Unlocking Fine-Grained Semantics for Video-Language Compositionality

Link: https://arxiv.org/abs/2408.09511
Authors: Chaofan Tao, Gukyeong Kwon, Varad Gunjal, Hao Yang, Zhaowei Cai, Yonatan Dukler, Ashwin Swaminathan, R. Manmatha, Colin Jon Taylor, Stefano Soatto
Keywords: study the capability, Composition understanding, understanding, Composition, NAVERO
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:We study the capability of Video-Language (VidL) models in understanding compositions between objects, attributes, actions and their relations. Composition understanding becomes particularly challenging for video data since the compositional relations rapidly change over time in videos. We first build a benchmark named AARO to evaluate composition understanding related to actions on top of spatial concepts. The benchmark is constructed by generating negative texts with incorrect action descriptions for a given video and the model is expected to pair a positive text with its corresponding video. Furthermore, we propose a training method called NAVERO which utilizes video-text data augmented with negative texts to enhance composition understanding. We also develop a negative-augmented visual-language matching loss which is used explicitly to benefit from the generated negative text. We compare NAVERO with other state-of-the-art methods in terms of compositional understanding as well as video-text retrieval performance. NAVERO achieves significant improvement over other methods for both video-language and image-language composition understanding, while maintaining strong performance on traditional text-video retrieval tasks.

[CV-78] StyleBrush: Style Extraction and Transfer from a Single Image

Link: https://arxiv.org/abs/2408.09496
Authors: Wancheng Feng, Wanquan Feng, Dawei Huang, Jiaming Pei, Guangliang Cheng, Lukun Wang
Keywords: add specific style, specific style patterns, original structural features, visual content aims, aims to add
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 6 figures, Under Review

Abstract:Stylization for visual content aims to add specific style patterns at the pixel level while preserving the original structural features. Compared with using predefined styles, stylization guided by reference style images is more challenging, where the main difficulty is to effectively separate style from structural elements. In this paper, we propose StyleBrush, a method that accurately captures styles from a reference image and “brushes” the extracted style onto other input visual content. Specifically, our architecture consists of two branches: ReferenceNet, which extracts style from the reference image, and Structure Guider, which extracts structural features from the input image, thus enabling image-guided stylization. We utilize LLM and T2I models to create a dataset comprising 100K high-quality style images, encompassing a diverse range of styles and contents with high aesthetic scores. To construct training pairs, we crop different regions of the same training image. Experiments show that our approach achieves state-of-the-art results through both qualitative and quantitative analyses. We will release our code and dataset upon acceptance of the paper.

[CV-79] Source-Free Test-Time Adaptation For Online Surface-Defect Detection ICPR2024

Link: https://arxiv.org/abs/2408.09494
Authors: Yiran Song, Qianyu Zhou, Lizhuang Ma
Keywords: Surface defect detection, Surface defect, industrial production, significant in industrial, Surface
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICPR 2024

Abstract:Surface defect detection is significant in industrial production. However, detecting defects with varying textures and anomaly classes during the test time is challenging. This arises due to the differences in data distributions between source and target domains. Collecting and annotating new data from the target domain and retraining the model is time-consuming and costly. In this paper, we propose a novel test-time adaptation surface-defect detection approach that adapts pre-trained models to new domains and classes during inference. Our approach involves two core ideas. Firstly, we introduce a supervisor to filter samples and select only those with high confidence to update the model. This ensures that the model is not excessively biased by incorrect data. Secondly, we propose the augmented mean prediction to generate robust pseudo labels and a dynamically-balancing loss to facilitate the model in effectively integrating classification and segmentation results to improve surface-defect detection accuracy. Our approach is real-time and does not require additional offline retraining. Experiments demonstrate it outperforms state-of-the-art techniques.
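The two ideas in this abstract, averaging predictions over augmentations and updating only on high-confidence samples, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not specify how the supervisor or the augmented mean prediction are computed, so a plain confidence threshold stands in for the supervisor, and the function names and the `min_confidence` parameter are hypothetical.

```python
from typing import List, Optional


def augmented_mean_prediction(prob_per_aug: List[List[float]]) -> List[float]:
    """Average class probabilities over several augmented views of one sample."""
    n_aug = len(prob_per_aug)
    n_cls = len(prob_per_aug[0])
    return [sum(p[c] for p in prob_per_aug) / n_aug for c in range(n_cls)]


def pseudo_label(prob_per_aug: List[List[float]],
                 min_confidence: float = 0.8) -> Optional[int]:
    """Return the argmax class of the mean prediction, or None if it is
    not confident enough to be used for a model update (hypothetical rule)."""
    mean = augmented_mean_prediction(prob_per_aug)
    label = max(range(len(mean)), key=mean.__getitem__)
    return label if mean[label] >= min_confidence else None


# Hypothetical softmax outputs for three augmented views of two samples.
confident = [[0.90, 0.10], [0.95, 0.05], [0.85, 0.15]]
uncertain = [[0.60, 0.40], [0.40, 0.60], [0.55, 0.45]]
print(pseudo_label(confident), pseudo_label(uncertain))
```

Samples for which `pseudo_label` returns `None` would simply be skipped during the test-time update in this sketch.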

[CV-80] Advances in Multiple Instance Learning for Whole Slide Image Analysis: Techniques, Challenges, and Future Directions

Link: https://arxiv.org/abs/2408.09476
Authors: Jun Wang, Yu Mao, Nan Guan, Chun Jason Xue
Keywords: E-stained tissue samples, gigapixel-scale digital images, tissue samples widely, E-stained tissue, slide images
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Whole slide images (WSIs) are gigapixel-scale digital images of H&E-stained tissue samples widely used in pathology. The substantial size and complexity of WSIs pose unique analytical challenges. Multiple Instance Learning (MIL) has emerged as a powerful approach for addressing these challenges, particularly in cancer classification and detection. This survey provides a comprehensive overview of the challenges and methodologies associated with applying MIL to WSI analysis, including attention mechanisms, pseudo-labeling, transformers, pooling functions, and graph neural networks. Additionally, it explores the potential of MIL in discovering cancer cell morphology, constructing interpretable machine learning models, and quantifying cancer grading. By summarizing the current challenges, methodologies, and potential applications of MIL in WSI analysis, this survey aims to inform researchers about the state of the field and inspire future research directions.

[CV-81] Image-Based Geolocation Using Large Vision-Language Models

Link: https://arxiv.org/abs/2408.09474
Authors: Yi Liu, Junchen Ding, Gelei Deng, Yuekang Li, Tianwei Zhang, Weisong Sun, Yaowen Zheng, Jingquan Ge, Yang Liu
Keywords: offering numerous benefits, modern life, offering numerous, vital aspect, aspect of modern
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Geolocation is now a vital aspect of modern life, offering numerous benefits but also presenting serious privacy concerns. The advent of large vision-language models (LVLMs) with advanced image-processing capabilities introduces new risks, as these models can inadvertently reveal sensitive geolocation information. This paper presents the first in-depth study analyzing the challenges posed by traditional deep learning and LVLM-based geolocation methods. Our findings reveal that LVLMs can accurately determine geolocations from images, even without explicit geographic training. To address these challenges, we introduce \tool, an innovative framework that significantly enhances image-based geolocation accuracy. \tool employs a systematic chain-of-thought (CoT) approach, mimicking human geoguessing strategies by carefully analyzing visual and contextual cues such as vehicle types, architectural styles, natural landscapes, and cultural elements. Extensive testing on a dataset of 50,000 ground-truth data points shows that \tool outperforms both traditional models and human benchmarks in accuracy. It achieves an impressive average score of 4550.5 in the GeoGuessr game, with an 85.37% win rate, and delivers highly precise geolocation predictions, with the closest distances as accurate as 0.3 km. Furthermore, our study highlights issues related to dataset integrity, leading to the creation of a more robust dataset and a refined framework that leverages LVLMs’ cognitive capabilities to improve geolocation precision. These findings underscore \tool’s superior ability to interpret complex visual data, the urgent need to address emerging security vulnerabilities posed by LVLMs, and the importance of responsible AI development to ensure user privacy protection. 

[CV-82] MedMAP: Promoting Incomplete Multi-modal Brain Tumor Segmentation with Alignment

Link: https://arxiv.org/abs/2408.09465
Authors: Tianyi Liu, Zhaorui Tan, Muyin Chen, Xi Yang, Haochuan Jiang, Kaizhu Huang
Keywords: magnetic resonance imaging, Brain tumor segmentation, multiple magnetic resonance, Brain tumor, resonance imaging
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Brain tumor segmentation is often based on multiple magnetic resonance imaging (MRI) modalities. However, in clinical practice, certain MRI modalities may be missing, which presents a more difficult scenario. To cope with this challenge, Knowledge Distillation, Domain Adaptation, and Shared Latent Space have emerged as common and promising strategies. However, recent efforts typically overlook the modality gaps and thus fail to learn important invariant feature representations across different modalities. This drawback consequently leads to limited performance for missing-modality models. To ameliorate these problems, pre-trained models are used in natural visual segmentation tasks to minimize the gaps. However, promising pre-trained models are often unavailable in medical image segmentation tasks. Along this line, in this paper, we propose a novel paradigm that aligns latent features of the involved modalities to a well-defined distribution anchor as a substitute for the pre-trained model. As a major contribution, we prove that our novel training paradigm ensures a tight evidence lower bound, thus theoretically certifying its effectiveness. Extensive experiments on different backbones validate that the proposed paradigm can enable invariant feature representations and produce models with narrowed modality gaps. Models with our alignment paradigm show superior performance on both the BraTS2018 and BraTS2020 datasets.

[CV-83] 3C: Confidence-Guided Clustering and Contrastive Learning for Unsupervised Person Re-Identification

Link: https://arxiv.org/abs/2408.09464
Authors: Mingxiao Zheng, Yanpeng Qu, Changjing Shang, Longzhi Yang, Qiang Shen
Keywords: cross-camera retrieval capability, Unsupervised person re-identification, unsupervised person Re-ID, aims to learn, Unsupervised person
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Unsupervised person re-identification (Re-ID) aims to learn a feature network with cross-camera retrieval capability in unlabelled datasets. Although pseudo-label based methods have achieved great progress in Re-ID, their performance in complex scenarios still needs improvement. In order to reduce potential misguidance accumulated during the learning process, including feature bias, noisy pseudo-labels and invalid hard samples, in this paper a confidence-guided clustering and contrastive learning (3C) framework is proposed for unsupervised person Re-ID. This 3C framework presents three confidence degrees: i) In the clustering stage, the confidence of the discrepancy between samples and clusters is proposed to implement a harmonic discrepancy clustering algorithm (HDC). ii) In the forward-propagation training stage, the confidence of the camera diversity of a cluster is evaluated via a novel camera information entropy (CIE). Then, the clusters with high CIE values will play leading roles in training the model. iii) In the back-propagation training stage, the confidence of the hard sample in each cluster is designed and further used in a confidence integrated harmonic discrepancy (CHD), to select the informative sample for updating the memory in contrastive learning. Extensive experiments on three popular Re-ID benchmarks demonstrate the superiority of the proposed framework. Particularly, the 3C framework achieves state-of-the-art results: 86.7%/94.7%, 45.3%/73.1% and 47.1%/90.6% in terms of mAP/Rank-1 accuracy on Market-1501, the complex datasets MSMT17 and VeRi-776, respectively. Code is available at this https URL.
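The abstract does not define the camera information entropy (CIE), so the following is only a sketch under a stated assumption: CIE is taken here to be the plain Shannon entropy of the camera IDs within one cluster, so that clusters whose samples come from many cameras score higher. The function name and the example camera IDs are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def camera_entropy(camera_ids: Sequence[Hashable]) -> float:
    """Shannon entropy (in nats) of the camera-ID distribution in a cluster.

    Assumption: this stands in for the paper's CIE, which the abstract
    names but does not define.
    """
    n = len(camera_ids)
    counts = Counter(camera_ids)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


# Hypothetical clusters: one seen by a single camera, one spread over four.
single_camera_cluster = [0, 0, 0, 0]
multi_camera_cluster = [0, 1, 2, 3]
print(camera_entropy(single_camera_cluster), camera_entropy(multi_camera_cluster))
```

Under this reading, a cluster seen by only one camera has entropy zero and a cluster spread evenly over k cameras has entropy log k.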

[CV-84] Fine-Grained Building Function Recognition from Street-View Images via Geometry-Aware Semi-Supervised Learning

Link: https://arxiv.org/abs/2408.09460
Authors: Weijia Li, Jinhua Yu, Dairong Chen, Yi Lin, Runming Dong, Xiang Zhang, Conghui He, Haohuan Fu
Keywords: building function recognition, function recognition, building function, fine-grained building function, fine-grained function recognition
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper is currently under review

Abstract:In this work, we propose a geometry-aware semi-supervised method for fine-grained building function recognition. This method leverages the geometric relationships between multi-source data to improve the accuracy of pseudo labels in semi-supervised learning, extending the task’s scope and making it applicable to cross-categorization systems of building function recognition. Firstly, we design an online semi-supervised pre-training stage, which facilitates the precise acquisition of building facade location information in street-view images. In the second stage, we propose a geometry-aware coarse annotation generation module. This module effectively combines GIS data and street-view data based on the geometric relationships, improving the accuracy of pseudo annotations. In the third stage, we combine the newly generated coarse annotations with the existing labeled dataset to achieve fine-grained functional recognition of buildings across multiple cities at a large scale. Extensive experiments demonstrate that our proposed framework exhibits superior performance in fine-grained functional recognition of buildings. Within the same categorization system, it achieves improvements of 7.6% and 4.8% compared to fully-supervised methods and state-of-the-art semi-supervised methods, respectively. Additionally, our method also performs well in cross-city tasks, i.e., extending the model trained on OmniCity (New York) to new areas (i.e., Los Angeles and Boston). This study provides a novel solution for the fine-grained function recognition of large-scale buildings across multiple cities, offering essential data for understanding urban infrastructure planning, human activity patterns, and the interactions between humans and buildings.

[CV-85] G2Face: High-Fidelity Reversible Face Anonymization via Generative and Geometric Priors

Link: https://arxiv.org/abs/2408.09458
Authors: Haoxin Yang, Xuemiao Xu, Cheng Xu, Huaidong Zhang, Jing Qin, Yi Wang, Pheng-Ann Heng, Shengfeng He
Keywords: sacrificing image clarity, replace sensitive identity, image clarity, unlike traditional face, traditional face pixelization
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Reversible face anonymization, unlike traditional face pixelization, seeks to replace sensitive identity information in facial images with synthesized alternatives, preserving privacy without sacrificing image clarity. Traditional methods, such as encoder-decoder networks, often result in significant loss of facial details due to their limited learning capacity. Additionally, relying on latent manipulation in pre-trained GANs can lead to changes in ID-irrelevant attributes, adversely affecting data utility due to GAN inversion inaccuracies. This paper introduces G²Face, which leverages both generative and geometric priors to enhance identity manipulation, achieving high-quality reversible face anonymization without compromising data utility. We utilize a 3D face model to extract geometric information from the input face, integrating it with a pre-trained GAN-based decoder. This synergy of generative and geometric priors allows the decoder to produce realistic anonymized faces with consistent geometry. Moreover, multi-scale facial features are extracted from the original face and combined with the decoder using our novel identity-aware feature fusion blocks (IFF). This integration enables precise blending of the generated facial patterns with the original ID-irrelevant features, resulting in accurate identity manipulation. Extensive experiments demonstrate that our method outperforms existing state-of-the-art techniques in face anonymization and recovery, while preserving high data utility. Code is available at this https URL.

[CV-86] Retina-inspired Object Motion Segmentation

Link: https://arxiv.org/abs/2408.09454
Authors: Victoria Clerico (1), Shay Snyder (1), Arya Lohia (1), Md Abdullah-Al Kaiser (2), Gregory Schwartz (3), Akhilesh Jaiswal (2), Maryam Parsa (1) ((1) George Mason University, (2) University of Southern California, (3) Northwestern University)
Keywords: surpasses RGB cameras, high temporal resolution, Dynamic Vision Sensors, RGB cameras, surpasses RGB
Categories: Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE); Image and Video Processing (eess.IV)

Abstract:Dynamic Vision Sensors (DVS) have emerged as a revolutionary technology with a high temporal resolution that far surpasses RGB cameras. DVS technology draws biological inspiration from photoreceptors and the initial retinal synapse. Our research showcases the potential of additional retinal functionalities to extract visual features. We provide a domain-agnostic and efficient algorithm for ego-motion compensation based on Object Motion Sensitivity (OMS), one of the multiple robust features computed within the mammalian retina. We develop a framework based on experimental neuroscience that translates OMS’ biological circuitry to a low-overhead algorithm. OMS processes DVS data from dynamic scenes to perform pixel-wise object motion segmentation. Using a real and a synthetic dataset, we highlight OMS’ ability to differentiate object motion from ego-motion, bypassing the need for deep networks. This paper introduces a bio-inspired computer vision method that dramatically reduces the number of parameters by a factor of 1000 compared to prior works. Our work paves the way for robust, high-speed, and low-bandwidth decision-making for in-sensor computations.

[CV-87] Attention Is Not What You Need: Revisiting Multi-Instance Learning for Whole Slide Image Classification

Link: https://arxiv.org/abs/2408.09449
Authors: Xin Liu, Weijia Zhang, Min-Ling Zhang
Keywords: achieved impressive performances, multi-instance learning algorithms, attention-based multi-instance learning, standard MIL assumptions, standard MIL
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Although attention-based multi-instance learning algorithms have achieved impressive performances on slide-level whole slide image (WSI) classification tasks, they are prone to mistakenly focus on irrelevant patterns such as staining conditions and tissue morphology, leading to incorrect patch-level predictions and unreliable interpretability. Moreover, these attention-based MIL algorithms tend to focus on salient instances and struggle to recognize hard-to-classify instances. In this paper, we first demonstrate that attention-based WSI classification methods do not adhere to the standard MIL assumptions. From the standard MIL assumptions, we propose a surprisingly simple yet effective instance-based MIL method for WSI classification (FocusMIL) based on max-pooling and forward amortized variational inference. We argue that synergizing the standard MIL assumption with variational inference encourages the model to focus on tumour morphology instead of spurious correlations. Our experimental evaluations show that FocusMIL significantly outperforms the baselines in patch-level classification tasks on the Camelyon16 and TCGA-NSCLC benchmarks. Visualization results show that our method also achieves better classification boundaries for identifying hard instances and mitigates the effect of spurious correlations between bags and labels.
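The standard MIL assumption referred to above states that a bag (here, a slide) is positive if and only if at least one of its instances (patches) is positive. A minimal sketch of instance-based max-pooling under that assumption follows. The function names, threshold, and scores are illustrative and are not taken from FocusMIL, and the variational-inference component of the method is omitted entirely.

```python
from typing import List


def bag_score_max_pool(instance_scores: List[float]) -> float:
    """Bag-level score is the maximum instance score (standard MIL assumption)."""
    if not instance_scores:
        raise ValueError("a bag must contain at least one instance")
    return max(instance_scores)


def predict_bag(instance_scores: List[float], threshold: float = 0.5) -> int:
    """A bag is predicted positive iff some instance score exceeds the threshold."""
    return int(bag_score_max_pool(instance_scores) > threshold)


# Hypothetical patch-level tumour probabilities for two slides.
negative_slide = [0.05, 0.10, 0.20]
positive_slide = [0.05, 0.92, 0.20]
print(predict_bag(negative_slide), predict_bag(positive_slide))
```

Because the bag score is driven by a single instance, the instance scores themselves remain directly interpretable as patch-level predictions, which is the property the abstract contrasts with attention weights.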

[CV-88] CLIP-CID: Efficient CLIP Distillation via Cluster-Instance Discrimination

Link: https://arxiv.org/abs/2408.09441
Authors: Kaicheng Yang, Tiancheng Gu, Xiang An, Haiqiang Jiang, Xiangzi Dai, Ziyong Feng, Weidong Cai, Jiankang Deng
Keywords: Contrastive Language-Image Pre-training, Contrastive Language-Image, achieved excellent performance, achieved excellent, wide range
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 8 figures

Abstract:Contrastive Language-Image Pre-training (CLIP) has achieved excellent performance over a wide range of tasks. However, the effectiveness of CLIP heavily relies on a substantial corpus of pre-training data, resulting in notable consumption of computational resources. Although knowledge distillation has been widely applied in single modality models, how to efficiently expand knowledge distillation to vision-language foundation models with extensive data remains relatively unexplored. In this paper, we introduce CLIP-CID, a novel distillation mechanism that effectively transfers knowledge from a large vision-language foundation model to a smaller model. We initially propose a simple but efficient image semantic balance method to reduce transfer learning bias and improve distillation efficiency. This method filters out 43.7% of image-text pairs from the LAION400M while maintaining superior performance. After that, we leverage cluster-instance discrimination to facilitate knowledge transfer from the teacher model to the student model, thereby empowering the student model to acquire a holistic semantic comprehension of the pre-training data. Experimental results demonstrate that CLIP-CID achieves state-of-the-art performance on various downstream tasks including linear probe and zero-shot classification.

[CV-89] Adversarial Attacked Teacher for Unsupervised Domain Adaptive Object Detection

Link: https://arxiv.org/abs/2408.09431
Authors: Kaiwen Wang, Yinzhe Shen, Martin Lauer
Keywords: detectors encounter challenges, handling domain shifts, Object detectors encounter, detectors encounter, encounter challenges
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Object detectors encounter challenges in handling domain shifts. Cutting-edge domain adaptive object detection methods use the teacher-student framework and domain adversarial learning to generate domain-invariant pseudo-labels for self-training. However, the pseudo-labels generated by the teacher model tend to be biased towards the majority class and often mistakenly include overconfident false positives and underconfident false negatives. We reveal that pseudo-labels vulnerable to adversarial attacks are more likely to be low-quality. To address this, we propose a simple yet effective framework named Adversarial Attacked Teacher (AAT) to improve the quality of pseudo-labels. Specifically, we apply adversarial attacks to the teacher model, prompting it to generate adversarial pseudo-labels to correct bias, suppress overconfidence, and encourage underconfident proposals. An adaptive pseudo-label regularization is introduced to emphasize the influence of pseudo-labels with high certainty and reduce the negative impacts of uncertain predictions. Moreover, robust minority objects verified by pseudo-label regularization are oversampled to minimize dataset imbalance without introducing false positives. Extensive experiments conducted on various datasets demonstrate that AAT achieves superior performance, reaching 52.6 mAP on Clipart1k, surpassing the previous state-of-the-art by 6.7%.

[CV-90] A Robust Algorithm for Contactless Fingerprint Enhancement and Matching

Link: https://arxiv.org/abs/2408.09426
Authors: Mahrukh Siddiqui, Shahzaib Iqbal, Bandar AlShammari, Bandar Alhaqbani, Tariq M. Khan, Imran Razzak
Keywords: elastic deformation caused, contactless fingerprint images, contact fingerprint images, fingerprint images, contactless fingerprint
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Compared to contact fingerprint images, contactless fingerprint images exhibit four distinct characteristics: (1) they contain less noise; (2) they have fewer discontinuities in ridge patterns; (3) the ridge-valley pattern is less distinct; and (4) they pose an interoperability problem, as they lack the elastic deformation caused by pressing the finger against the capture device. These properties present significant challenges for the enhancement of contactless fingerprint images. In this study, we propose a novel contactless fingerprint identification solution that enhances the accuracy of minutiae detection through improved frequency estimation and a new region-quality-based minutia extraction algorithm. In addition, we introduce an efficient and highly accurate minutiae-based encoding and matching algorithm. We validate the effectiveness of our approach through extensive experimental testing. Our method achieves a minimum Equal Error Rate (EER) of 2.84% on the PolyU contactless fingerprint dataset, demonstrating its superior performance compared to existing state-of-the-art techniques. The proposed fingerprint identification method exhibits notable precision and resilience, proving to be an effective and feasible solution for contactless fingerprint-based identification systems.
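The Equal Error Rate quoted above is the operating point at which the false acceptance rate (impostor pairs accepted) equals the false rejection rate (genuine pairs rejected). The sketch below estimates it by sweeping a threshold over the observed match scores. It is a generic illustration, not the paper's evaluation code, and the scores are made up.

```python
from typing import Sequence


def equal_error_rate(genuine: Sequence[float], impostor: Sequence[float]) -> float:
    """Estimate the EER from similarity scores (higher means more similar).

    Sweeps every observed score as a threshold and returns the mean of
    FAR and FRR at the threshold where the two rates are closest.
    """
    best_gap, eer = float("inf"), 1.0
    for t in sorted(set(genuine) | set(impostor)):
        far = sum(s >= t for s in impostor) / len(impostor)  # impostors accepted
        frr = sum(s < t for s in genuine) / len(genuine)     # genuines rejected
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Hypothetical match scores with one overlapping pair on each side.
genuine_scores = [0.9, 0.8, 0.7, 0.4]
impostor_scores = [0.1, 0.2, 0.3, 0.6]
print(equal_error_rate(genuine_scores, impostor_scores))
```

With so few scores the estimate is coarse; evaluation on a real dataset would typically interpolate between thresholds.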

[CV-91] OVOSE: Open-Vocabulary Semantic Segmentation in Event-Based Cameras

Link: https://arxiv.org/abs/2408.09424
Authors: Muhammad Rameez Ur Rahman, Jhony H. Giraldo, Indro Spinelli, Stéphane Lathuilière, Fabio Galasso
Keywords: challenging lighting conditions, sensitive computer vision, computer vision tasks, semantic segmentation, lighting conditions
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: conference

Abstract:Event cameras, known for low-latency operation and superior performance in challenging lighting conditions, are suitable for sensitive computer vision tasks such as semantic segmentation in autonomous driving. However, challenges arise due to limited event-based data and the absence of large-scale segmentation benchmarks. Current works are confined to closed-set semantic segmentation, limiting their adaptability to other applications. In this paper, we introduce OVOSE, the first Open-Vocabulary Semantic Segmentation algorithm for Event cameras. OVOSE leverages synthetic event data and knowledge distillation from a pre-trained image-based foundation model to an event-based counterpart, effectively preserving spatial context and transferring open-vocabulary semantic segmentation capabilities. We evaluate the performance of OVOSE on two driving semantic segmentation datasets, DDD17 and DSEC-Semantic, comparing it with existing conventional image open-vocabulary models adapted for event-based data. We also compare OVOSE with state-of-the-art methods designed for closed-set settings in unsupervised domain adaptation for event-based semantic segmentation. OVOSE demonstrates superior performance, showcasing its potential for real-world applications. The code is available at this https URL.

[CV-92] Weakly Supervised Lymph Nodes Segmentation Based on Partial Instance Annotations with Pre-trained Dual-branch Network and Pseudo Label Learning

Link: https://arxiv.org/abs/2408.09411
Authors: Litingyu Wang (1), Yijie Qu (1), Xiangde Luo (1 and 2), Wenjun Liao (1 and 3), Shichuan Zhang (1 and 3), Guotai Wang (1 and 2) ((1) University of Electronic Science and Technology of China, Chengdu, China, (2) ShangAI Laboratory, Shanghai, China, (3) Department of Radiation Oncology, Sichuan Cancer Hospital & Institute, Sichuan Cancer Center, Chengdu, China)
Keywords: estimating cancer progression, identifying surrounding benign, determining potential metastatic, potential metastatic pathways, potentially malignant lymph
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) this https URL

Abstract:Assessing the presence of potentially malignant lymph nodes aids in estimating cancer progression, and identifying surrounding benign lymph nodes can assist in determining potential metastatic pathways for cancer. For quantitative analysis, automatic segmentation of lymph nodes is crucial. However, due to the labor-intensive and time-consuming manual annotation process required for a large number of lymph nodes, it is more practical to annotate only a subset of the lymph node instances to reduce annotation costs. In this study, we propose a pre-trained Dual-Branch network with Dynamically Mixed Pseudo label (DBDMP) to learn from partial instance annotations for lymph nodes segmentation. To obtain reliable pseudo labels for lymph nodes that are not annotated, we employ a dual-decoder network to generate different outputs that are then dynamically mixed. We integrate the original weak partial annotations with the mixed pseudo labels to supervise the network. To further leverage the extensive amount of unannotated voxels, we apply a self-supervised pre-training strategy to enhance the model’s feature extraction capability. Experiments on the mediastinal Lymph Node Quantification (LNQ) dataset demonstrate that our method, compared to directly learning from partial instance annotations, significantly improves the Dice Similarity Coefficient (DSC) from 11.04% to 54.10% and reduces the Average Symmetric Surface Distance (ASSD) from 20.83 mm to 8.72 mm. The code is available at this https URL.
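For readers unfamiliar with the headline metric, the Dice Similarity Coefficient between a predicted mask A and a reference mask B is 2|A ∩ B| / (|A| + |B|). The sketch below computes it on flattened binary masks. It is a generic illustration with made-up masks, not the paper's evaluation code, and returning 1.0 when both masks are empty is a convention assumed here.

```python
from typing import Sequence


def dice_coefficient(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice = 2|A ∩ B| / (|A| + |B|) for two flattened binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total


# Hypothetical 4-voxel masks: the prediction over-segments by one voxel.
predicted = [1, 1, 0, 0]
reference = [1, 0, 0, 0]
print(dice_coefficient(predicted, reference))
```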

[CV-93] OPPH: A Vision-Based Operator for Measuring Body Movements for Personal Healthcare

Link: https://arxiv.org/abs/2408.09409
Authors: Chen Long-fei, Subramanian Ramamoorthy, Robert B Fisher
Keywords: healthcare purposes, methods show promise, Vision-based motion estimation, body, motion estimation
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Vision-based motion estimation methods show promise in accurately and unobtrusively estimating human body motion for healthcare purposes. However, these methods are not specifically designed for healthcare purposes and face challenges in real-world applications. Human pose estimation methods often lack the accuracy needed for detecting fine-grained, subtle body movements, while optical flow-based methods struggle with poor lighting conditions and unseen real-world data. These issues result in human body motion estimation errors, particularly during critical medical situations where the body is motionless, such as during unconsciousness. To address these challenges and improve the accuracy of human body motion estimation for healthcare purposes, we propose the OPPH operator designed to enhance current vision-based motion estimation methods. This operator, which considers human body movement and noise properties, functions as a multi-stage filter. Results tested on two real-world and one synthetic human motion dataset demonstrate that the operator effectively removes real-world noise, significantly enhances the detection of motionless states, maintains the accuracy of estimating active body movements, and maintains long-term body movement trends. This method could be beneficial for analyzing both critical medical events and chronic medical conditions.

[CV-94] VrdONE: One-stage Video Visual Relation Detection

Link: https://arxiv.org/abs/2408.09408
Authors: Xinjie Jiang, Chenxi Zheng, Xuemiao Xu, Bangzhen Liu, Weiying Zheng, Huaidong Zhang, Shengfeng He
Keywords: Video Visual Relation, basic visual tasks, gaining deeper insights, Video Visual, Visual Relation Detection
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 8 figures, accepted by ACM Multimedia 2024

Abstract:Video Visual Relation Detection (VidVRD) focuses on understanding how entities interact over time and space in videos, a key step for gaining deeper insights into video scenes beyond basic visual tasks. Traditional methods for VidVRD, challenged by its complexity, typically split the task into two parts: one for identifying what relation categories are present and another for determining their temporal boundaries. This split overlooks the inherent connection between these elements. Addressing the need to recognize entity pairs’ spatiotemporal interactions across a range of durations, we propose VrdONE, a streamlined yet efficacious one-stage model. VrdONE combines the features of subjects and objects, turning predicate detection into 1D instance segmentation on their combined representations. This setup allows for both relation category identification and binary mask generation in one go, eliminating the need for extra steps like proposal generation or post-processing. VrdONE facilitates the interaction of features across various frames, adeptly capturing both short-lived and enduring relations. Additionally, we introduce the Subject-Object Synergy (SOS) module, enhancing how subjects and objects perceive each other before combining. VrdONE achieves state-of-the-art performances on the VidOR benchmark and ImageNet-VidVRD, showcasing its superior capability in discerning relations across different temporal scales. The code is available at this https URL.

[CV-95] Obtaining Optimal Spiking Neural Network in Sequence Learning via CRNN-SNN Conversion

Link: https://arxiv.org/abs/2408.09403
Authors: Jiahao Su, Kang You, Zekai Xu, Weizhi Xu, Zhezhi He
Keywords: energy-efficient neuromorphic chips, Spiking neural networks, conventional artificial neural, rich neural dynamics, artificial neural networks
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by 33rd International Conference on Artificial Neural Networks

Abstract:Spiking neural networks (SNNs) are becoming a promising alternative to conventional artificial neural networks (ANNs) due to their rich neural dynamics and the implementation of energy-efficient neuromorphic chips. However, the non-differentiable binary communication mechanism makes it hard for SNNs to converge to ANN-level accuracy. When SNNs encounter sequence learning, the situation becomes worse due to the difficulties in modeling long-range dependencies. To overcome these difficulties, researchers developed variants of LIF neurons and different surrogate gradients but still failed to obtain good results when the sequence became longer (e.g., 500). Unlike them, we obtain an optimal SNN in sequence learning by directly mapping parameters from a quantized CRNN. We design two sub-pipelines to support the end-to-end conversion of different structures in neural networks, which are called CNN-Morph (CNN → QCNN → BIFSNN) and RNN-Morph (RNN → QRNN → RBIFSNN). Using conversion pipelines and the s-analog encoding method, the conversion error of our framework is zero. Furthermore, we give the theoretical and experimental demonstration of the lossless CRNN-SNN conversion. Our results show the effectiveness of our method over short- and long-timescale tasks compared with the state-of-the-art learning- and conversion-based methods. We reach the highest accuracy of 99.16% (0.46↑) on S-MNIST and 94.95% (3.95↑) on PS-MNIST (sequence length of 784), respectively, and the lowest loss of 0.057 (0.013↓) within 8 time-steps on the collision avoidance dataset.

[CV-96] Combo: Co-speech holistic 3D human motion generation and efficient customizable adaptation in harmony

Link: https://arxiv.org/abs/2408.09397
Authors: Chao Xu,Mingze Sun,Zhi-Qi Cheng,Fei Wang,Yang Liu,Baigui Sun,Ruqi Huang,Alexander Hauptmann
Keywords: harmonious co-speech holistic, harmonious co-speech, efficient customizable adaption, holistic human motions, human motion generation
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract: In this paper, we propose a novel framework, Combo, for harmonious co-speech holistic 3D human motion generation and efficient customizable adaption. In particular, we identify one fundamental challenge as the multiple-input-multiple-output (MIMO) nature of the generative model of interest. More concretely, on the input end, the model typically consumes both speech signals and character guidance (e.g., identity and emotion), which not only poses a challenge to learning capacity but also hinders further adaptation to varying guidance; on the output end, holistic human motions mainly consist of facial expressions and body movements, which are inherently correlated but non-trivial to coordinate in the current data-driven generation process. In response to the above challenge, we propose tailored designs for both ends. For the former, we propose to pre-train on data regarding a fixed identity with neutral emotion, and defer the incorporation of customizable conditions (identity and emotion) to the fine-tuning stage, which is boosted by our novel X-Adapter for parameter-efficient fine-tuning. For the latter, we propose a simple yet effective transformer design, DU-Trans, which first divides into two branches to learn individual features of facial expressions and body movements, and then unites those to learn a joint bi-directional distribution and directly predict combined coefficients. Evaluated on the BEAT2 and SHOW datasets, Combo is not only highly effective in generating high-quality motions but also efficient in transferring identity and emotion. Project website: this https URL.

[CV-97] OU-CoViT: Copula-Enhanced Bi-Channel Multi-Task Vision Transformers with Dual Adaptation for OU-UWF Images

Link: https://arxiv.org/abs/2408.09395
Authors: Yang Li,Jianing Deng,Chong Zhong,Danjuan Yang,Meiyan Li,A.H. Welsh,Aiyi Liu,Xingtao Zhou,Catherine C. Liu,Bo Fu
Keywords: Myopia screening, cutting-edge ultra-widefield, fundus imaging, small medical datasets, screening using cutting-edge
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract: Myopia screening using cutting-edge ultra-widefield (UWF) fundus imaging and joint modeling of multiple discrete and continuous clinical scores presents a promising new paradigm for multi-task problems in Ophthalmology. The bi-channel framework that arises from the Ophthalmic phenomenon of "interocular asymmetries" of both eyes (OU) calls for new ways of employing SOTA transformer-based models. However, the application of copula models for multiple mixed discrete-continuous labels in deep learning (DL) is challenging. Moreover, the application of advanced large transformer-based models to small medical datasets is challenging due to overfitting and computational resource constraints. To resolve these challenges, we propose OU-CoViT: a novel Copula-Enhanced Bi-Channel Multi-Task Vision Transformers with Dual Adaptation for OU-UWF images, which can i) incorporate conditional correlation information across multiple discrete and continuous labels within a deep learning framework (by deriving the closed form of a novel Copula Loss); ii) take OU inputs subject to both high correlation and interocular asymmetries using a bi-channel model with dual adaptation; and iii) enable the adaptation of large vision transformer (ViT) models to small medical datasets. Solid experiments demonstrate that OU-CoViT significantly improves prediction performance compared to single-channel baseline models with empirical loss. Furthermore, the novel architecture of OU-CoViT allows generalizability and extensions of our dual adaptation and Copula Loss to various ViT variants and large DL models on small medical datasets. Our approach opens up new possibilities for joint modeling of heterogeneous multi-channel input and mixed discrete-continuous clinical scores in medical practice and has the potential to advance AI-assisted clinical decision-making in various medical domains beyond Ophthalmology.

[CV-98] FD2Talk: Towards Generalized Talking Head Generation with Facial Decoupled Diffusion Model

Link: https://arxiv.org/abs/2408.09384
Authors: Ziyu Yao,Xuxin Cheng,Zhiqi Huang
Keywords: faces numerous challenges, significant research topic, Talking head generation, Talking head, diffusion models
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: Accepted by ACM Multimedia 2024

Abstract: Talking head generation is a significant research topic that still faces numerous challenges. Previous works often adopt generative adversarial networks or regression models, which are plagued by poor generation quality and the average facial shape problem. Although diffusion models show impressive generative ability, their exploration in talking head generation remains unsatisfactory. This is because they either solely use the diffusion model to obtain an intermediate representation and then employ another pre-trained renderer, or they overlook the feature decoupling of complex facial details, such as expressions, head poses and appearance textures. Therefore, we propose a Facial Decoupled Diffusion model for Talking head generation called FD2Talk, which fully leverages the advantages of diffusion models and decouples the complex facial details through multiple stages. Specifically, we separate facial details into motion and appearance. In the initial phase, we design the Diffusion Transformer to accurately predict motion coefficients from raw audio. These motions are highly decoupled from appearance, making them easier for the network to learn compared to high-dimensional RGB images. Subsequently, in the second phase, we encode the reference image to capture appearance textures. The predicted facial and head motions and encoded appearance then serve as the conditions for the Diffusion UNet, guiding the frame generation. Benefiting from decoupling facial details and fully leveraging diffusion models, extensive experiments substantiate that our approach excels in enhancing image quality and generating more accurate and diverse results compared to previous state-of-the-art methods.

[CV-99] Detecting the Undetectable: Combining Kolmogorov-Arnold Networks and MLP for AI-Generated Image Detection

Link: https://arxiv.org/abs/2408.09371
Authors: Taharim Rahman Anon,Jakaria Islam Emon
Keywords: artificial intelligence progresses, intelligence progresses, artificial intelligence, task of distinguishing, increasingly complicated
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 8 pages, IEEE Transactions

Abstract: As artificial intelligence progresses, the task of distinguishing between real and AI-generated images is increasingly complicated by sophisticated generative models. This paper presents a novel detection framework adept at robustly identifying images produced by cutting-edge generative AI models, such as DALL-E 3, MidJourney, and Stable Diffusion 3. We introduce a comprehensive dataset, tailored to include images from these advanced generators, which serves as the foundation for extensive evaluation. We propose a classification system that integrates semantic image embeddings with a traditional Multilayer Perceptron (MLP). This baseline system is designed to effectively differentiate between real and AI-generated images under various challenging conditions. Enhancing this approach, we introduce a hybrid architecture that combines Kolmogorov-Arnold Networks (KAN) with the MLP. This hybrid model leverages the adaptive, high-resolution feature transformation capabilities of KAN, enabling our system to capture and analyze complex patterns in AI-generated images that are typically overlooked by conventional models. In out-of-distribution testing, our proposed model consistently outperformed the standard MLP across three out-of-distribution test datasets, demonstrating superior performance and robustness in distinguishing real images from AI-generated images with impressive F1 scores.

[CV-100] Angle of Arrival Estimation with Transformer: A Sparse and Gridless Method with Zero-Shot Capability

Link: https://arxiv.org/abs/2408.09362
Authors: Zhaoxuan Zhu,Chulong Chen,Bo Yang
Keywords: Advanced Driver Assistance, Driver Assistance Systems, Automotive Multiple-Input Multiple-Output, Autonomous Vehicles, Advanced Driver
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 8 figures

Abstract: Automotive Multiple-Input Multiple-Output (MIMO) radars have gained significant traction in Advanced Driver Assistance Systems (ADAS) and Autonomous Vehicles (AV) due to their cost-effectiveness, resilience to challenging operating conditions, and extended detection range. To fully leverage the advantages of MIMO radars, it is crucial to develop an Angle of Arrival (AOA) algorithm that delivers high performance with a reasonable computational workload. This work introduces AAETR (Angle of Arrival Estimation with TRansformer) for high-performance gridless AOA estimation. Comprehensive evaluations across various signal-to-noise ratios (SNRs) and multi-target scenarios demonstrate AAETR’s superior performance compared to super-resolution AOA algorithms such as the Iterative Adaptive Approach (IAA). The proposed architecture features efficient, scalable, sparse and gridless angle-finding capability, overcoming the issues of high computational cost and straddling loss in SNR associated with grid-based IAA. AAETR requires fewer tunable hyper-parameters and is end-to-end trainable in a deep learning radar perception pipeline. When trained on large-scale simulated datasets and then evaluated on a real dataset, AAETR exhibits remarkable zero-shot sim-to-real transferability and emergent sidelobe suppression capability. This highlights the effectiveness of the proposed approach and its potential as a drop-in module in practical systems.

[CV-101] Panorama Tomosynthesis from Head CBCT with Simulated Projection Geometry

Link: https://arxiv.org/abs/2408.09358
Authors: Anusree P.S.,Bikram Keshari Parida,Seong Yong Moon,Wonsang You
Keywords: Beam Computed Tomography, Cone Beam Computed, Cone Beam, Computed Tomography, Beam Computed
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 12 pages, 6 figures, 1 table, Journal submission planned

Abstract: Cone Beam Computed Tomography (CBCT) and Panoramic X-rays are the most commonly used imaging modalities in dental health care. CBCT can produce three-dimensional views of a patient’s head, providing clinicians with better diagnostic capability, whereas a Panoramic X-ray can capture the entire maxillofacial region in a single image. If a CBCT is already available, it can be beneficial to synthesize a Panoramic X-ray from it, thereby avoiding an immediate additional scan and extra radiation exposure. Existing methods focus on delineating an approximate dental arch and creating orthogonal projections along this arch. However, no gold standard is available for such dental arch extractions, and this choice can affect the quality of synthesized X-rays. To avoid such issues, we propose a novel method for synthesizing Panoramic X-rays from diverse head CBCTs, employing a simulated projection geometry and dynamic rotation centers. Our method effectively synthesized panoramic views from CBCT, even for patients with missing or nonexistent teeth and in the presence of severe metal implants. Our results demonstrate that this method can generate high-quality panoramic images irrespective of the CBCT scanner geometry.

[CV-102] Joint Temporal Pooling for Improving Skeleton-based Action Recognition

Link: https://arxiv.org/abs/2408.09356
Authors: Shanaka Ramesh Gunasekara,Wanqing Li,Jack Yang,Philip Ogunbona
Keywords: capturing spatiotemporal relationship, human action recognition, critical step, step for capturing, capturing spatiotemporal
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract: In skeleton-based human action recognition, temporal pooling is a critical step for capturing the spatiotemporal relationships of joint dynamics. Conventional pooling methods overlook the preservation of motion information and treat each frame equally. However, in an action sequence, only a few segments of frames carry discriminative information related to the action. This paper presents a novel Joint Motion Adaptive Temporal Pooling (JMAP) method for improving skeleton-based action recognition. Two variants of JMAP, frame-wise pooling and joint-wise pooling, are introduced. The efficacy of JMAP has been validated through experiments on the popular NTU RGB+D 120 and PKU-MMD datasets.

[CV-103] Boundary-Recovering Network for Temporal Action Detection

Link: https://arxiv.org/abs/2408.09354
Authors: Jihwan Kim,Jaehyun Choi,Yerim Jeon,Jae-Pil Heo
Keywords: real-world video applications, video applications, vanishing boundary problem, fundamental for real-world, real-world video
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to Pattern Recognition Journal

Abstract: Temporal action detection (TAD) is challenging, yet fundamental for real-world video applications. Large temporal scale variation of actions is one of the primary difficulties in TAD. Naturally, multi-scale features, as widely used in object detection, have potential in localizing actions of diverse lengths. Nevertheless, unlike objects in images, actions have more ambiguity in their boundaries. That is, small neighboring objects are not considered as a large one, while short adjoining actions can be misunderstood as a long one. In the coarse-to-fine feature pyramid built via pooling, these vague action boundaries can fade out, which we call the "vanishing boundary problem". To address it, we propose the Boundary-Recovering Network (BRN). BRN constructs scale-time features by interpolating multi-scale features to the same temporal length, which introduces a new axis called the scale dimension. On top of the scale-time features, scale-time blocks learn to exchange features across scale levels, which can effectively resolve the issue. Our extensive experiments demonstrate that our model outperforms the state of the art on two challenging benchmarks, ActivityNet-v1.3 and THUMOS14, with a remarkably reduced degree of the vanishing boundary problem.
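The construction of scale-time features described above can be sketched in a few lines. This is only an illustration under assumptions: each pyramid level is a plain list of scalars rather than a feature tensor, linear interpolation stands in for whatever interpolation the authors use, and the function names are hypothetical.

```python
def resize_linear(seq, length):
    """Linearly interpolate a 1-D feature sequence to `length` samples."""
    if length == 1 or len(seq) == 1:
        return [seq[0]] * length
    scale = (len(seq) - 1) / (length - 1)
    out = []
    for i in range(length):
        pos = i * scale
        lo = min(int(pos), len(seq) - 2)   # left neighbour index
        frac = pos - lo                    # weight of the right neighbour
        out.append(seq[lo] * (1 - frac) + seq[lo + 1] * frac)
    return out


def scale_time_features(pyramid):
    """Stack all pyramid levels on a new scale axis at the finest temporal length."""
    target = len(pyramid[0])
    return [resize_linear(level, target) for level in pyramid]


# Three levels of a toy temporal pyramid (lengths 5, 3, 2).
pyramid = [[0.0, 1.0, 2.0, 3.0, 4.0], [0.0, 2.0, 4.0], [0.0, 4.0]]
grid = scale_time_features(pyramid)
print(len(grid), "scales x", len(grid[0]), "time steps")
```

The result is a rectangular scale-by-time grid, so a later block can mix information across scale levels at each time step.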

[CV-104] Hyperstroke: A Novel High-quality Stroke Representation for Assistive Artistic Drawing

Link: https://arxiv.org/abs/2408.09348
Authors: Haoyun Qin,Jian Lin,Hanyuan Liu,Xueting Liu,Chengze Li
Keywords: providing intelligent guidance, guidance to artists, aims to facilitate, facilitate the creative, creative process
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 10 figures

Abstract: Assistive drawing aims to facilitate the creative process by providing intelligent guidance to artists. Existing solutions often fail to effectively model intricate stroke details or adequately address the temporal aspects of drawing. We introduce hyperstroke, a novel stroke representation designed to capture precise fine stroke details, including RGB appearance and alpha-channel opacity. Using a Vector Quantization approach, hyperstroke learns compact tokenized representations of strokes from real-life videos of artistic drawing. With hyperstroke, we propose to model assistive drawing via a transformer-based architecture to enable intuitive and user-friendly drawing applications, which we examine in an exploratory evaluation.
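The tokenization step of vector quantization mentioned above can be sketched as a nearest-neighbour lookup. This is a minimal illustration, not the hyperstroke model: the codebook here is a hand-written list of RGBA vectors (a learned codebook is assumed in practice), and `quantize` is a hypothetical helper name.

```python
def quantize(vector, codebook):
    """Return the index of the codebook entry nearest to `vector`."""
    def dist2(a, b):
        # Squared Euclidean distance between two equal-length vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: dist2(vector, codebook[k]))


# Toy RGBA codebook: transparent, opaque black, opaque red.
codebook = [
    (0.0, 0.0, 0.0, 0.0),
    (0.0, 0.0, 0.0, 1.0),
    (1.0, 0.0, 0.0, 1.0),
]
patches = [(0.9, 0.1, 0.0, 0.8), (0.1, 0.0, 0.1, 0.9), (0.0, 0.1, 0.0, 0.1)]
tokens = [quantize(p, codebook) for p in patches]
print(tokens)
```

Each continuous stroke feature is replaced by a discrete token index, which is the form a transformer can then model as a sequence.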

[CV-105] S3D-NeRF: Single-Shot Speech-Driven Neural Radiance Field for High Fidelity Talking Head Synthesis ECCV2024

Link: https://arxiv.org/abs/2408.09347
Authors: Dongze Li,Kang Zhao,Wei Wang,Yifeng Ma,Bo Peng,Yingya Zhang,Jing Dong
Keywords: Neural Radiance Field, Talking head synthesis, one-shot talking heads, Current Neural Radiance, Neural Radiance
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ECCV 2024

Abstract: Talking head synthesis is a practical technique with wide applications. Current Neural Radiance Field (NeRF) based approaches have shown their superiority in driving one-shot talking heads with videos or signals regressed from audio. However, most of them fail to use audio directly as the driving signal and therefore cannot benefit from the flexibility and availability of speech. Since mapping audio signals to face deformation is non-trivial, we design a Single-Shot Speech-Driven Neural Radiance Field (S^3D-NeRF) method in this paper to tackle the following three difficulties: learning a representative appearance feature for each identity, modeling the motion of different face regions with audio, and keeping the temporal consistency of the lip area. To this end, we introduce a Hierarchical Facial Appearance Encoder to learn multi-scale representations that capture the appearance of different speakers, and devise a Cross-modal Facial Deformation Field to perform speech animation according to the relationship between the audio signal and different face regions. Moreover, to enhance the temporal consistency of the important lip area, we introduce a lip-sync discriminator to penalize out-of-sync audio-visual sequences. Extensive experiments have shown that our S^3D-NeRF surpasses prior art in both video fidelity and audio-lip synchronization.

[CV-106] Elite360M: Efficient 360 Multi-task Learning via Bi-projection Fusion and Cross-task Collaboration

Link: https://arxiv.org/abs/2408.09336
Authors: Hao Ai,Lin Wang
Keywords: exhibiting comprehensive visual, entire surrounding environment, comprehensive visual information, exhibiting comprehensive, entire surrounding
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages main paper

Abstract: 360° cameras capture the entire surrounding environment with a large FoV, exhibiting comprehensive visual information to directly infer the 3D structures, e.g., depth and surface normal, and semantic information simultaneously. Existing works predominantly specialize in a single task, leaving multi-task learning of 3D geometry and semantics largely unexplored. Achieving such an objective is, however, challenging due to: 1) the inherent spherical distortion of planar equirectangular projection (ERP) and insufficient global perception induced by the ultra-wide FoV of 360° images; 2) the non-trivial problem of effectively merging geometry and semantics among different tasks to achieve mutual benefits. In this paper, we propose a novel end-to-end multi-task learning framework, named Elite360M, capable of inferring 3D structures via depth and surface normal estimation, and semantics via semantic segmentation simultaneously. Our key idea is to build a representation with strong global perception and less distortion while exploring the inter- and cross-task relationships between geometry and semantics. We incorporate the distortion-free and spatially continuous icosahedron projection (ICOSAP) points and combine them with ERP to enhance global perception. With a negligible cost, a Bi-projection Bi-attention Fusion module is thus designed to capture the semantic- and distance-aware dependencies between each pixel of the region-aware ERP feature and the ICOSAP point feature set. Moreover, we propose a novel Cross-task Collaboration module to explicitly extract task-specific geometric and semantic information from the learned representation to achieve preliminary predictions. It then integrates the spatial contextual information among tasks to realize cross-task fusion. Extensive experiments demonstrate the effectiveness and efficacy of Elite360M.

[CV-107] YOLOv1 to YOLOv10: The fastest and most accurate real-time object detection systems

Link: https://arxiv.org/abs/2408.09332
Authors: Chien-Yao Wang,Hong-Yuan Mark Liao
Keywords: YOLO series, YOLO series continued, YOLO, comprehensive review, review article re-examines
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 14 figures

Abstract: This is a comprehensive review of the YOLO series of systems. Different from previous literature surveys, this review article re-examines the characteristics of the YOLO series from the latest technical point of view. At the same time, we also analyze how the YOLO series has continued to influence and promote real-time computer vision-related research and led to the subsequent development of computer vision and language models. We take a closer look at how the methods proposed by the YOLO series in the past ten years have affected the development of subsequent technologies and show the applications of YOLO in various fields. We hope this article can play a good guiding role in subsequent real-time computer vision development.

[CV-108] Multi-Camera Multi-Person Association using Transformer-Based Dense Pixel Correspondence Estimation and Detection-Based Masking

Link: https://arxiv.org/abs/2408.09295
Authors: Daniel Kathein,Byron Hernandez,Henry Medeiros
Keywords: active research topic, multi-camera multi-target association, research topic, applications across robotics, task of identifying
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 6 figures

Abstract: Multi-camera Association (MCA) is the task of identifying objects and individuals across camera views and is an active research topic, given its numerous applications across robotics, surveillance, and agriculture. We investigate a novel multi-camera multi-target association algorithm based on dense pixel correspondence estimation with a Transformer-based architecture and underlying detection-based masking. After the algorithm generates a set of corresponding keypoints and computes their respective confidence levels between every pair of detections in the camera views, an affinity matrix is determined containing the probabilities of matches between each pair. Finally, the Hungarian algorithm is applied to generate an optimal assignment matrix with all the predicted associations between the camera views. Our method is evaluated on the WILDTRACK Seven-Camera HD Dataset, a high-resolution dataset containing footage of walking pedestrians as well as precise annotations and camera calibrations. Our results show that the algorithm performs exceptionally well at associating pedestrians on camera pairs that are positioned close to each other and observe the scene from similar perspectives. On camera pairs whose positions differ drastically in distance or angle, there is still significant room for improvement.
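The final assignment step can be sketched as follows. This is a compact textbook implementation of the Hungarian (Kuhn-Munkres) algorithm in pure Python, applied to a made-up affinity matrix; the function name `match_detections`, the toy numbers, and the choice to maximize total affinity by negating it into a cost are assumptions for illustration, not details taken from the paper. It assumes the matrix has no more rows than columns.

```python
def match_detections(affinity):
    """Return (row, col) pairs maximizing total affinity (rows <= cols)."""
    n, m = len(affinity), len(affinity[0])
    cost = [[-a for a in row] for row in affinity]   # maximize -> minimize
    inf = float("inf")
    u = [0.0] * (n + 1)          # row potentials
    v = [0.0] * (m + 1)          # column potentials
    p = [0] * (m + 1)            # p[j]: row currently matched to column j
    way = [0] * (m + 1)          # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:       # reached a free column
                break
        while j0:                # flip the matching along the path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    return sorted((p[j] - 1, j - 1) for j in range(1, m + 1) if p[j])


# Toy affinities between 2 detections in view A (rows) and 2 in view B (cols).
print(match_detections([[0.9, 0.8], [0.8, 0.1]]))
```

In the toy matrix a greedy pick of the single best pair (0, 0) would force the weak pair (1, 1); the optimal assignment instead pairs 0 with 1 and 1 with 0, which is why a global solver is used.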

[CV-109] Adaptify: A Refined Adaptation Scheme for Frame Classification in Atrophic Gastritis Videos

Link: https://arxiv.org/abs/2408.09261
Authors: Zinan Xiong,Shuijiao Chen,Yizhe Zhang,Yu Cao,Benyuan Liu,Xiaowei Liu
Keywords: developing gastric cancer, significant risk factor, detecting atrophic gastritis, Atrophic gastritis, gastric cancer
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ISBI 2024 Proceeding

Abstract: Atrophic gastritis is a significant risk factor for developing gastric cancer. Incorporating machine learning algorithms can improve the likelihood of accurately detecting atrophic gastritis. Nevertheless, when the trained model is applied in real-life circumstances, its output is often not consistently reliable. In this paper, we propose Adaptify, an adaptation scheme in which the model assimilates knowledge from its own classification decisions. Our approach keeps the primary model constant while simultaneously running and updating an auxiliary model. By integrating the knowledge gleaned by the auxiliary model into the primary model and merging their outputs, we have observed a notable improvement in output stability and consistency compared to relying solely on either the primary model or the auxiliary model.

[CV-110] MagicID: Flexible ID Fidelity Generation System

Link: https://arxiv.org/abs/2408.09248
Authors: Zhaoli Deng,Wen Liu,Fanyi Wang,Junkang Zhang,Fan Chen,Wendong Zhang,Zhenpeng Mi
Keywords: prominent research area, Portrait Fidelity Generation, Portrait Fidelity, generative models, prominent research
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract: Portrait Fidelity Generation is a prominent research area in generative models, with a primary focus on enhancing both controllability and fidelity. Current methods face challenges in generating high-fidelity portrait results when faces occupy a small portion of the image at low resolution, especially in multi-person group photo settings. To tackle these issues, we propose a systematic solution called MagicID, based on a self-constructed million-level multi-modal dataset named IDZoom. MagicID consists of a Multi-Mode Fusion training strategy (MMF) and a DDIM Inversion based ID Restoration inference framework (DIIR). During training, MMF iteratively uses the skeleton and landmark modalities from IDZoom as conditional guidance. By introducing Clone Face Tuning in the training stage and Mask Guided Multi-ID Cross Attention (MGMICA) in the inference stage, explicit constraints on face positional features are achieved for multi-ID group photo generation. DIIR aims to address the issue of artifacts. DDIM Inversion is used in conjunction with face landmarks and global and local face features to achieve face restoration while keeping the background unchanged. Additionally, DIIR is plug-and-play and can be applied to any diffusion-based portrait generation method. To validate the effectiveness of MagicID, we conducted extensive comparative and ablation experiments. The experimental results demonstrate that MagicID has significant advantages in both subjective and objective metrics, and achieves controllable generation in multi-person scenarios.

[CV-111] Re-boosting Self-Collaboration Parallel Prompt GAN for Unsupervised Image Restoration ICCV2023 ICCV

Link: https://arxiv.org/abs/2408.09241
Authors: Xin Lin,Yuyan Zhou,Jingtong Yue,Chao Ren,Kelvin C.K. Chan,Lu Qi,Ming-Hsuan Yang
Keywords: generative adversarial networks, requiring paired datasets, restoration approaches based, Unsupervised restoration approaches, Res
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: This paper is an extended and revised version of our previous work “Unsupervised Image Denoising in Real-World Scenarios via Self-Collaboration Parallel Generative Adversarial Branches” (this https URL)

Abstract: Unsupervised restoration approaches based on generative adversarial networks (GANs) offer a promising solution without requiring paired datasets. Yet, these GAN-based approaches struggle to surpass the performance of conventional unsupervised GAN-based frameworks without significantly modifying model structures or increasing the computational complexity. To address these issues, we propose a self-collaboration (SC) strategy for existing restoration models. This strategy utilizes information from the previous stage as feedback to guide subsequent stages, achieving significant performance improvement without increasing the framework’s inference complexity. The SC strategy comprises a prompt learning (PL) module and a restorer (Res). It iteratively replaces the previous, less powerful fixed restorer $\overline{Res}$ in the PL module with a more powerful Res. The enhanced PL module generates better pseudo-degraded/clean image pairs, leading to a more powerful Res for the next iteration. Our SC can significantly improve the Res's performance by over 1.5 dB without adding extra parameters or computational complexity during inference. Meanwhile, existing self-ensemble (SE) and our SC strategies enhance the performance of pre-trained restorers from different perspectives. As SE increases computational complexity during inference, we propose a re-boosting module for the SC (Reb-SC) to improve the SC strategy further by incorporating SE into SC without increasing inference time. This approach further enhances the restorer’s performance by approximately 0.3 dB. Extensive experimental results on restoration tasks demonstrate that the proposed model performs favorably against existing state-of-the-art unsupervised restoration methods. Source code and trained models are publicly available at: this https URL.

[CV-112] RepControlNet: ControlNet Reparameterization

Link: https://arxiv.org/abs/2408.09240
Authors: Zhaoli Deng,Kaibin Zhou,Fanyi Wang,Zhenpeng Mi
Keywords: original diffusion model, diffusion model, wide application, universal application, high cost
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract: With the wide application of diffusion models, the high cost of inference resources has become an important bottleneck for their universal application. Controllable generation, such as ControlNet, is one of the key research directions for diffusion models, which makes research on inference acceleration and model compression all the more important. To address this problem, this paper proposes a modal reparameterization method, RepControlNet, to realize the controllable generation of diffusion models without increasing computation. In the training process, RepControlNet uses an adapter to modulate the modal information into the feature space, copies the CNN and MLP learnable layers of the original diffusion model as the modal network, and initializes these weights based on the original weights and coefficients. The training process only optimizes the parameters of the modal network. In the inference process, the weights of the modal network are merged with those of the original diffusion model through reparameterization, achieving results comparable to, or even surpassing, methods such as ControlNet that use additional parameters and computation, without increasing the number of parameters. We have carried out a large number of experiments on both SD1.5 and SDXL, and the experimental results show the effectiveness and efficiency of the proposed RepControlNet.
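The general idea behind weight reparameterization can be shown with a toy linear layer. The sketch below is not RepControlNet's actual formula: it assumes the original and modal branches are parallel linear maps on the same input whose outputs are summed with a hypothetical mixing coefficient `alpha`, and shows that they fold into a single weight matrix with identical output and no extra inference cost.

```python
def linear(weight, x):
    """Apply a bias-free linear layer: y = W x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def merge_branches(w_orig, w_modal, alpha):
    """Fold two parallel linear branches into one weight matrix."""
    return [[a + alpha * b for a, b in zip(r1, r2)]
            for r1, r2 in zip(w_orig, w_modal)]


w_orig = [[1.0, 2.0], [3.0, 4.0]]     # frozen original layer (toy values)
w_modal = [[0.5, 0.0], [0.0, -0.5]]   # trained modal copy (toy values)
alpha = 0.5                           # hypothetical mixing coefficient
x = [2.0, -1.0]

# Training-time view: two branches evaluated separately, then summed.
two_branch = [a + alpha * b
              for a, b in zip(linear(w_orig, x), linear(w_modal, x))]
# Inference-time view: one merged layer.
merged = linear(merge_branches(w_orig, w_modal, alpha), x)
print(two_branch, merged)
```

Because both branches are linear in the same input, the merged layer reproduces the two-branch output exactly; the same identity does not hold across a nonlinearity, which is why such merging is done layer by layer.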

[CV-113] Flatten: Video Action Recognition is an Image Classification task

Link: https://arxiv.org/abs/2408.09220
Authors: Junlin Chen,Chengcheng Xu,Yangfan Xu,Jian Yang,Jun Li,Zhiping Shi
Keywords: video action recognition, subsequently leveraging prevalent, numerous researchers.Most traditional, typically involve converting, methods typically involve
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 13 pages, 6 figures

Abstract: In recent years, video action recognition, as a fundamental task in the field of video understanding, has been deeply explored by numerous researchers. Most traditional video action recognition methods typically involve converting videos into three-dimensional data that encapsulates both spatial and temporal information, subsequently leveraging prevalent image understanding models to model and analyze these data. However, these methods have significant drawbacks. Firstly, when delving into video action recognition tasks, image understanding models often need to be adapted accordingly in terms of model architecture and preprocessing for these spatiotemporal tasks. Secondly, dealing with high-dimensional data often poses greater challenges and incurs higher time costs compared to its lower-dimensional counterparts. To bridge the gap between image-understanding and video-understanding tasks while simplifying the complexity of video comprehension, we introduce a novel video representation architecture, Flatten, which serves as a plug-and-play module that can be seamlessly integrated into any image-understanding network for efficient and effective 3D temporal data modeling. Specifically, by applying specific flattening operations (e.g., row-major transform), 3D spatiotemporal data is transformed into 2D spatial information, and then ordinary image understanding models are used to capture temporal dynamic and spatial semantic information, which in turn accomplishes effective and efficient video action recognition. Extensive experiments on commonly used datasets (Kinetics-400, Something-Something v2, and HMDB-51) and three classical image classification models (Uniformer, SwinV2, and ResNet) have demonstrated that embedding Flatten provides significant performance improvements over the original models.
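One plausible reading of the row-major flattening mentioned above is to tile the frames of a clip into a single large image, filling the grid row by row. The sketch below illustrates that reading only; the function name, the nested-list representation of frames, and the zero padding of unused grid cells are assumptions, and the paper's actual transform may differ.

```python
def flatten_row_major(video, cols):
    """Tile T frames (each H x W) into one 2-D image, filling the grid row by row."""
    t, h, w = len(video), len(video[0]), len(video[0][0])
    rows = -(-t // cols)                       # ceiling division
    image = [[0] * (cols * w) for _ in range(rows * h)]
    for k, frame in enumerate(video):
        r, c = divmod(k, cols)                 # grid cell of frame k
        for y in range(h):
            for x in range(w):
                image[r * h + y][c * w + x] = frame[y][x]
    return image


# Four 1x2 frames whose pixels equal the frame index.
video = [[[k, k]] for k in range(4)]
image = flatten_row_major(video, cols=2)
for row in image:
    print(row)
```

After this step the clip is an ordinary 2-D array, so an unmodified image classifier can consume it, with temporal order encoded in the spatial layout of the tiles.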

[CV-114] DRL-Based Resource Allocation for Motion Blur Resistant Federated Self-Supervised Learning in IoV

Link: https://arxiv.org/abs/2408.09194
Authors: Xueying Gu,Qiong Wu,Pingyi Fan,Qiang Fan,Nan Cheng,Wen Chen,Khaled B. Letaief
Keywords (EN): Federated Self-Supervised Learning, privacy-preserving solution, Learning, Internet of Vehicles, Federated Learning
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*Comments: This paper has been submitted to IEEE Journal. The source code has been released at: this https URL

Click to view abstract

Abstract:In the Internet of Vehicles (IoV), Federated Learning (FL) provides a privacy-preserving solution by aggregating local models without sharing data. Traditional supervised learning requires image data with labels, but data labeling involves significant manual effort. Federated Self-Supervised Learning (FSSL) utilizes Self-Supervised Learning (SSL) for local training in FL, eliminating the need for labels while protecting privacy. Compared to other SSL methods, Momentum Contrast (MoCo) reduces the demand for computing resources and storage space by creating a dictionary. However, using MoCo in FSSL requires uploading the local dictionary from vehicles to Base Station (BS), which poses a risk of privacy leakage. Simplified Contrast (SimCo) addresses the privacy leakage issue in MoCo-based FSSL by using dual temperature instead of a dictionary to control sample distribution. Additionally, considering the negative impact of motion blur on model aggregation, and based on SimCo, we propose a motion blur-resistant FSSL method, referred to as BFSSL. Furthermore, we address energy consumption and delay in the BFSSL process by proposing a Deep Reinforcement Learning (DRL)-based resource allocation scheme, called DRL-BFSSL. In this scheme, BS allocates the Central Processing Unit (CPU) frequency and transmission power of vehicles to minimize energy consumption and latency, while aggregating received models based on the motion blur level. Simulation results validate the effectiveness of our proposed aggregation and resource allocation methods.

[CV-115] GSLAMOT: A Tracklet and Query Graph-based Simultaneous Locating Mapping and Multiple Object Tracking System ACM-MM2024

Link: https://arxiv.org/abs/2408.09191
Authors: Shuo Wang,Yongcai Wang,Zhimin Xu,Yongyu Guo,Wanting Li,Zhe Huang,Xuewei Bai,Deying Li
Keywords (EN): Tracklet Graph, Query Graph-based framework, Multi-criteria Star Graph, Star Graph Association, maintained Tracklet Graph
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments: 11 pages, 9 figures, ACM MM 2024

Click to view abstract

Abstract:For interacting with mobile objects in unfamiliar environments, simultaneously locating, mapping, and tracking the 3D poses of multiple objects is crucial. This paper proposes a Tracklet Graph and Query Graph-based framework, i.e., GSLAMOT, to address this challenge. GSLAMOT utilizes camera and LiDAR multimodal information as inputs and divides the representation of the dynamic scene into a semantic map for representing the static environment, a trajectory of the ego-agent, and an online maintained Tracklet Graph (TG) for tracking and predicting the 3D poses of the detected mobile objects. A Query Graph (QG) is constructed in each frame by object detection to query and update TG. For accurate object association, a Multi-criteria Star Graph Association (MSGA) method is proposed to find matched objects between the detections in QG and the predicted tracklets in TG. Then, an Object-centric Graph Optimization (OGO) method is proposed to simultaneously optimize the TG, the semantic map, and the agent trajectory. It triangulates the detected objects into the map to enrich the map’s semantic information. We address the efficiency issues to handle the three tightly coupled tasks in parallel. Experiments are conducted on KITTI, Waymo, and an emulated Traffic Congestion dataset that highlights challenging scenarios. Experiments show that GSLAMOT enables accurate crowded object tracking while conducting SLAM accurately in challenging scenarios, outperforming the state-of-the-art methods. The code and dataset are at this https URL.

[CV-116] PADetBench: Towards Benchmarking Physical Attacks against Object Detection

Link: https://arxiv.org/abs/2408.09181
Authors: Jiawei Lian,Jianhong Pan,Lefan Wang,Yi Wang,Lap-Pui Chau,Shaohui Mei
Keywords (EN): significant practical implications, gained increasing attention, increasing attention due, Physical
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Physical attacks against object detection have gained increasing attention due to their significant practical implications. However, conducting physical experiments is extremely time-consuming and labor-intensive. Moreover, physical dynamics and cross-domain transformation are challenging to strictly regulate in the real world, leading to unaligned evaluation and comparison, severely hindering the development of physically robust models. To accommodate these challenges, we explore utilizing realistic simulation to thoroughly and rigorously benchmark physical attacks with fairness under controlled physical dynamics and cross-domain transformation. This resolves the problem of capturing identical adversarial images that cannot be achieved in the real world. Our benchmark includes 20 physical attack methods, 48 object detectors, comprehensive physical dynamics, and evaluation metrics. We also provide end-to-end pipelines for dataset generation, detection, evaluation, and further analysis. In addition, we perform 8064 groups of evaluation based on our benchmark, which includes both overall evaluation and further detailed ablation studies for controlled physical dynamics. Through these experiments, we provide in-depth analyses of physical attack performance and physical adversarial robustness, draw valuable observations, and discuss potential directions for future research. 
Codebase: this https URL

[CV-117] MambaTrack: A Simple Baseline for Multiple Object Tracking with State Space Model

Link: https://arxiv.org/abs/2408.09178
Authors: Changcheng Xiao,Qiong Cao,Zhigang Luo,Long Lan
Keywords (EN): field of Multi-object, Multi-object Tracking, motion, Kalman Filter, prevailing paradigm
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments: Accepted by ACM Multimedia 2024

Click to view abstract

Abstract:Tracking by detection has been the prevailing paradigm in the field of Multi-object Tracking (MOT). These methods typically rely on the Kalman Filter to estimate the future locations of objects, assuming linear object motion. However, they fall short when tracking objects exhibiting nonlinear and diverse motion in scenarios like dancing and sports. In addition, there has been limited focus on utilizing learning-based motion predictors in MOT. To address these challenges, we resort to exploring data-driven motion prediction methods. Inspired by the promise of state space models (SSMs), such as Mamba, in long-term sequence modeling with near-linear complexity, we introduce a Mamba-based motion model named Mamba moTion Predictor (MTP). MTP is designed to model the complex motion patterns of objects like dancers and athletes. Specifically, MTP takes the spatial-temporal location dynamics of objects as input, captures the motion pattern using a bi-Mamba encoding layer, and predicts the next motion. In real-world scenarios, objects may be missed due to occlusion or motion blur, leading to premature termination of their trajectories. To tackle this challenge, we further expand the application of MTP. We employ it in an autoregressive way to compensate for missing observations by utilizing its own predictions as inputs, thereby contributing to more consistent trajectories. Our proposed tracker, MambaTrack, demonstrates advanced performance on benchmarks such as DanceTrack and SportsMOT, which are characterized by complex motion and severe occlusion.
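
The autoregressive compensation described above, where a predictor's own outputs are fed back as inputs while detections are missing, can be sketched as follows. This is a minimal illustration under stated assumptions: the learned MTP model is replaced by a hypothetical constant-velocity stand-in (`constant_velocity`), and `fill_gaps` is an invented helper name, not part of MambaTrack.

```python
def constant_velocity(history):
    """Stand-in predictor (assumption): extrapolate from the last two positions.

    A learned motion model would take this role in a real tracker.
    """
    (x0, y0), (x1, y1) = history[-2], history[-1]
    return (2 * x1 - x0, 2 * y1 - y0)


def fill_gaps(observations, predict=constant_velocity):
    """Replace missing observations (None) with predictions.

    Each prediction is appended to the track, so it becomes part of the
    input for the next prediction (autoregressive rollout).
    """
    track = []
    for obs in observations:
        if obs is None:
            if len(track) < 2:
                raise ValueError("need two positions before the first gap")
            obs = predict(track)
        track.append(obs)
    return track


# Two frames are missed (e.g. due to occlusion) before the object reappears.
observations = [(0, 0), (1, 2), None, None, (4, 8)]
track = fill_gaps(observations)
print(track)
```

Because the rollout keeps the trajectory alive through the gap, the reappearing detection can be matched to the same track instead of starting a new one.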

[CV-118] Zero-Shot Object-Centric Representation Learning

Link: https://arxiv.org/abs/2408.09162
Authors: Aniket Didolkar,Andrii Zadaianchuk,Anirudh Goyal,Mike Mozer,Yoshua Bengio,Georg Martius,Maximilian Seitzer
Keywords (EN): decompose visual scenes, object-centric representation learning, isolates the entities, representation learning, decompose visual
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:The goal of object-centric representation learning is to decompose visual scenes into a structured representation that isolates the entities. Recent successes have shown that object-centric representation learning can be scaled to real-world scenes by utilizing pre-trained self-supervised features. However, so far, object-centric methods have mostly been applied in-distribution, with models trained and evaluated on the same dataset. This is in contrast to the wider trend in machine learning towards general-purpose models directly applicable to unseen data and tasks. Thus, in this work, we study current object-centric methods through the lens of zero-shot generalization by introducing a benchmark comprising eight different synthetic and real-world datasets. We analyze the factors influencing zero-shot performance and find that training on diverse real-world images improves transferability to unseen scenarios. Furthermore, inspired by the success of task-specific fine-tuning in foundation models, we introduce a novel fine-tuning strategy to adapt pre-trained vision encoders for the task of object discovery. We find that the proposed approach results in state-of-the-art performance for unsupervised object discovery, exhibiting strong zero-shot transfer to unseen datasets.

[CV-119] DSReLU: A Novel Dynamic Slope Function for Superior Model Training ICPR

Link: https://arxiv.org/abs/2408.09156
Authors: Archisman Chakraborti,Bidyut B Chaudhuri
Keywords (EN): computer vision tasks, deep neural networks, aimed at enhancing, study introduces, enhancing adaptability
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments: Under peer review at ICPR, 2024

Click to view abstract

Abstract:This study introduces a novel activation function, characterized by a dynamic slope that adjusts throughout the training process, aimed at enhancing adaptability and performance in deep neural networks for computer vision tasks. The rationale behind this approach is to overcome limitations associated with traditional activation functions, such as ReLU, by providing a more flexible mechanism that can adapt to different stages of the learning process. Evaluated on the Mini-ImageNet, CIFAR-100, and MIT-BIH datasets, our method demonstrated improvements in classification metrics and generalization capabilities. These results suggest that our dynamic slope activation function could offer a new tool for improving the performance of deep learning models in various image recognition tasks.

[CV-120] Are CLIP features all you need for Universal Synthetic Image Origin Attribution? ECCV2024

Link: https://arxiv.org/abs/2408.09153
Authors: Dario Cioni,Christos Tzelepis,Lorenzo Seidenari,Ioannis Patras
Keywords (EN): significant societal threats, poses significant societal, Diffusion Models, potential abuse, societal threats
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments: Accepted at ECCV 2024 TWYN workshop

Click to view abstract

Abstract:The steady improvement of Diffusion Models for visual synthesis has given rise to many new and interesting use cases of synthetic images but has also raised concerns about their potential abuse, which poses significant societal threats. To address this, fake images need to be detected and attributed to their source model, and given the frequent release of new generators, realistic applications need to consider an Open-Set scenario where some models are unseen at training time. Existing forensic techniques are either limited to Closed-Set settings or to GAN-generated images, relying on fragile frequency-based “fingerprint” features. By contrast, we propose a simple yet effective framework that incorporates features from large pre-trained foundation models to perform Open-Set origin attribution of synthetic images produced by various generative models, including Diffusion Models. We show that our method leads to remarkable attribution performance, even in the low-data regime, exceeds the performance of existing methods, and generalizes better on images obtained from a diverse set of architectures. We make the code publicly available at: this https URL.

[CV-121] Realistic Extreme Image Rescaling via Generative Latent Space Learning

Link: https://arxiv.org/abs/2408.09151
Authors: Ce Wang,Wanjie Sun,Zhenzhong Chen
Keywords (EN): optimal downscaled low-resolution, Image rescaling aims, Image rescaling, Image, downscaled low-resolution
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*Comments:

Click to view abstract

Abstract:Image rescaling aims to learn the optimal downscaled low-resolution (LR) image that can be accurately reconstructed to its original high-resolution (HR) counterpart. This process is crucial for efficient image processing and storage, especially in the era of ultra-high definition media. However, extreme downscaling factors pose significant challenges due to the highly ill-posed nature of the inverse upscaling process, causing existing methods to struggle in generating semantically plausible structures and perceptually rich textures. In this work, we propose a novel framework called Latent Space Based Image Rescaling (LSBIR) for extreme image rescaling tasks. LSBIR effectively leverages powerful natural image priors learned by a pre-trained text-to-image diffusion model to generate realistic HR images. The rescaling is performed in the latent space of a pre-trained image encoder and decoder, which offers better perceptual reconstruction quality due to its stronger sparsity and richer semantics. LSBIR adopts a two-stage training strategy. In the first stage, a pseudo-invertible encoder-decoder models the bidirectional mapping between the latent features of the HR image and the target-sized LR image. In the second stage, the reconstructed features from the first stage are refined by a pre-trained diffusion model to generate more faithful and visually pleasing details. Extensive experiments demonstrate the superiority of LSBIR over previous methods in both quantitative and qualitative evaluations. The code will be available at: this https URL.

[CV-122] SSNeRF: Sparse View Semi-supervised Neural Radiance Fields with Augmentation

Link: https://arxiv.org/abs/2408.09144
Authors: Xiao Cao,Beibei Lin,Bo Wang,Zhiyong Huang,Robby T. Tan
Keywords (EN): constrained optimization problem, Sparse view, Sparse, sparse view degradation, view
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Click to view abstract

Abstract:Sparse-view NeRF is challenging because limited input images lead to an under-constrained optimization problem for volume rendering. Existing methods address this issue by relying on supplementary information, such as depth maps. However, generating this supplementary information accurately remains problematic and often leads to NeRF producing images with undesired artifacts. To address these artifacts and enhance robustness, we propose SSNeRF, a sparse-view semi-supervised NeRF method based on a teacher-student framework. Our key idea is to challenge the NeRF module with progressively severe sparse-view degradation while providing high-confidence pseudo labels. This approach helps the NeRF model become aware of noise and incomplete information associated with sparse views, thus improving its robustness. The novelty of SSNeRF lies in its sparse-view-specific augmentations and semi-supervised learning mechanism. In this approach, the teacher NeRF generates novel views along with confidence scores, while the student NeRF, perturbed by the augmented input, learns from the high-confidence pseudo labels. Our sparse-view degradation augmentation progressively injects noise into volume rendering weights, perturbs feature maps in vulnerable layers, and simulates sparse-view blurriness. These augmentation strategies force the student NeRF to recognize degradation and produce clearer rendered views. By transferring the student’s parameters to the teacher, the teacher gains increased robustness in subsequent training iterations. Extensive experiments demonstrate the effectiveness of our SSNeRF in generating novel views with less sparse-view degradation. We will release code upon acceptance.

[CV-123] Learning to Explore for Stochastic Gradient MCMC

Link: https://arxiv.org/abs/2408.09140
Authors: SeungHyun Kim,Seohyeon Jung,Seonghyeon Kim,Juho Lee
Keywords (EN): Bayesian Neural Networks, Bayesian Neural, Neural Networks, high-dimensional parameters pose, posterior inference due
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Click to view abstract

Abstract:Bayesian Neural Networks (BNNs) with high-dimensional parameters pose a challenge for posterior inference due to the multi-modality of the posterior distributions. Stochastic Gradient MCMC (SGMCMC) with cyclical learning rate scheduling is a promising solution, but it requires a large number of sampling steps to explore high-dimensional multi-modal posteriors, making it computationally expensive. In this paper, we propose a meta-learning strategy to build SGMCMC which can efficiently explore the multi-modal target distributions. Our algorithm allows the learned SGMCMC to quickly explore the high-density region of the posterior landscape. Also, we show that this exploration property is transferable to various tasks, even for the ones unseen during a meta-training stage. Using popular image classification benchmarks and a variety of downstream tasks, we demonstrate that our method significantly improves the sampling efficiency, achieving better performance than vanilla SGMCMC without incurring significant computational overhead.
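
For readers unfamiliar with the baseline this abstract starts from, the sketch below shows a Langevin-type sampler with a cosine cyclical step size on a toy one-dimensional target. It illustrates the vanilla baseline only, not the paper's meta-learned sampler. The toy target (a standard normal, so the energy gradient is simply `theta`), the exact gradient in place of a minibatch estimate, and all function names and hyperparameters are assumptions chosen for illustration.

```python
import math
import random


def cyclical_step_size(k, total_steps, num_cycles, eps0):
    """Cosine cyclical schedule: restarts at eps0 and decays toward 0 each cycle."""
    cycle_len = math.ceil(total_steps / num_cycles)
    pos = (k % cycle_len) / cycle_len  # position within the current cycle, in [0, 1)
    return 0.5 * eps0 * (math.cos(math.pi * pos) + 1.0)


def sgld_step(theta, grad_u, eps, rng):
    """One Langevin update: a gradient step on the energy U plus Gaussian noise."""
    noise = rng.gauss(0.0, 1.0)
    return theta - eps * grad_u(theta) + math.sqrt(2.0 * eps) * noise


def grad_u(theta):
    # Toy target (assumption): standard normal, U(theta) = theta^2 / 2.
    return theta


rng = random.Random(0)
total_steps, num_cycles, eps0 = 1000, 4, 0.1
theta, samples = 3.0, []
for k in range(total_steps):
    eps = cyclical_step_size(k, total_steps, num_cycles, eps0)
    theta = sgld_step(theta, grad_u, eps, rng)
    samples.append(theta)
```

Large steps at the start of each cycle encourage moving between modes, while the small steps at the end sample locally; the cost noted in the abstract comes from needing many such cycles in high dimensions.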

[CV-124] StylePrompter: Enhancing Domain Generalization with Test-Time Style Priors

Link: https://arxiv.org/abs/2408.09138
Authors: Jiao Zhang,Jian Xu,Xu-Yao Zhang,Cheng-Lin Liu
Keywords (EN): causing performance degradation, trained deep models, inference stage, training stage, real-world applications
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Click to view abstract

Abstract:In real-world applications, the sample distribution at the inference stage often differs from the one at the training stage, causing performance degradation of trained deep models. The research on domain generalization (DG) aims to develop robust algorithms that can improve the generalized performance in unseen domains by training on a few domains. However, the domain-agnostic vision model, trained on a limited number of domains using traditional domain generalization methods, cannot guarantee its effectiveness in dealing with unseen domains. The introduction of language can break the closed cognition space of the vision model, providing additional semantic information that cannot be inferred from vision-only datasets. In this paper, we propose to overcome the challenge in previous DG methods by introducing the style prompt in the language modality to adapt the trained model dynamically. In particular, we train a style prompter to extract style information of the current image into an embedding in the token embedding space and place it in front of the candidate category words as prior knowledge to prompt the model. Our open space partition of the style token embedding space and the hand-crafted style regularization enable the trained style prompter to handle data from unknown domains effectively. Extensive experiments verify the effectiveness of our method and demonstrate state-of-the-art performances on multiple public datasets. Codes will be available after the acceptance of this paper.

[CV-125] Thin-Plate Spline-based Interpolation for Animation Line Inbetweening

Link: https://arxiv.org/abs/2408.09131
Authors: Tianyi Zhu,Wei Shang,Dongwei Ren,Wangmeng Zuo
Keywords (EN): predicting intermediate line, animation production aimed, intermediate line arts, crucial step, production aimed
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Click to view abstract

Abstract:Animation line inbetweening is a crucial step in animation production aimed at enhancing animation fluidity by predicting intermediate line arts between two key frames. However, existing methods face challenges in effectively addressing sparse pixels and significant motion in line art key frames. In the literature, Chamfer Distance (CD) is commonly adopted for evaluating inbetweening performance. Despite achieving favorable CD values, existing methods often generate interpolated frames with line disconnections, especially for scenarios involving large motion. Motivated by this observation, we propose a simple yet effective interpolation method for animation line inbetweening that adopts thin-plate spline-based transformation to estimate coarse motion more accurately by modeling the keypoint correspondence between two key frames, particularly for large motion scenarios. Building upon the coarse estimation, a motion refine module is employed to further enhance motion details before final frame interpolation using a simple UNet model. Furthermore, to more accurately assess the performance of animation line inbetweening, we refine the CD metric and introduce a novel metric termed Weighted Chamfer Distance, which demonstrates a higher consistency with visual perception quality. Additionally, we incorporate Earth Mover’s Distance and conduct a user study to provide a more comprehensive evaluation. Our method outperforms existing approaches by delivering high-quality interpolation results with enhanced fluidity. The code is available at this https URL.
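
The observation that favorable Chamfer Distance values can coexist with line disconnections is easy to reproduce on a toy example. The sketch below uses one common convention for CD (the sum of the two mean nearest-neighbour Euclidean distances); conventions differ between papers, and this is not the paper's Weighted Chamfer Distance. The point sets are made up for illustration.

```python
import math


def chamfer_distance(a, b):
    """Symmetric Chamfer Distance between two non-empty 2D point sets.

    Convention (assumption): mean nearest-neighbour Euclidean distance
    from a to b, plus the same from b to a.
    """
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(a, b) + one_way(b, a)


# A ground-truth line, a shifted copy, and a copy with a missing middle point.
line = [(0, 0), (1, 0), (2, 0)]
shifted = [(0, 1), (1, 1), (2, 1)]
broken = [(0, 0), (2, 0)]

cd_shifted = chamfer_distance(line, shifted)
cd_broken = chamfer_distance(line, broken)
print(cd_shifted, cd_broken)
```

Here the broken line scores far better than the intact but shifted one, even though only the broken line has a visible gap, which is the kind of mismatch with visual quality that motivates a refined metric.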

[CV-126] Gaussian in the Dark: Real-Time View Synthesis From Inconsistent Dark Images Using Gaussian Splatting

Link: https://arxiv.org/abs/2408.09130
Authors: Sheng Ye,Zhen-Hui Dong,Yubin Hu,Yu-Hui Wen,Yong-Jin Liu
Keywords (EN): Gaussian Splatting, recently emerged, powerful representation, remarkable novel views, Splatting has recently
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments: accepted by PG 2024

Click to view abstract

Abstract:3D Gaussian Splatting has recently emerged as a powerful representation that can synthesize remarkable novel views using consistent multi-view images as input. However, we notice that images captured in dark environments where the scenes are not fully illuminated can exhibit considerable brightness variations and multi-view inconsistency, which poses great challenges to 3D Gaussian Splatting and severely degrades its performance. To tackle this problem, we propose Gaussian-DK. Observing that inconsistencies are mainly caused by camera imaging, we represent a consistent radiance field of the physical world using a set of anisotropic 3D Gaussians, and design a camera response module to compensate for multi-view inconsistencies. We also introduce a step-based gradient scaling strategy to constrain Gaussians near the camera, which turn out to be floaters, from splitting and cloning. Experiments on our proposed benchmark dataset demonstrate that Gaussian-DK produces high-quality renderings without ghosting and floater artifacts and significantly outperforms existing methods. Furthermore, we can also synthesize light-up images by controlling exposure levels that clearly show details in shadow areas.

[CV-127] Barbie: Text to Barbie-Style 3D Avatars

Link: https://arxiv.org/abs/2408.09126
Authors: Xiaokun Sun,Zhenyu Zhang,Ying Tai,Qian Wang,Hao Tang,Zili Yi,Jian Yang
Keywords (EN): made substantial progress, Recent advances, advances in text-guided, made substantial, substantial progress
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments: 9 pages, 7 figures

Click to view abstract

Abstract:Recent advances in text-guided 3D avatar generation have made substantial progress by distilling knowledge from diffusion models. Despite the plausible generated appearance, existing methods cannot achieve fine-grained disentanglement or high-fidelity modeling between inner body and outfit. In this paper, we propose Barbie, a novel framework for generating 3D avatars that can be dressed in diverse and high-quality Barbie-like garments and accessories. Instead of relying on a holistic model, Barbie achieves fine-grained disentanglement on avatars by semantic-aligned separated models for human body and outfits. These disentangled 3D representations are then optimized by different expert models to guarantee the domain-specific fidelity. To balance geometry diversity and reasonableness, we propose a series of losses for template-preserving and human-prior evolving. The final avatar is enhanced by unified texture refinement for superior texture consistency. Extensive experiments demonstrate that Barbie outperforms existing methods in both dressed human and outfit generation, supporting flexible apparel combination and animation. The code will be released for research purposes. Our project page is: this https URL.

[CV-128] MaskBEV: Towards A Unified Framework for BEV Detection and Map Segmentation ACM-MM2024

Link: https://arxiv.org/abs/2408.09122
Authors: Xiao Zhao,Xukun Zhang,Dingkang Yang,Mingyang Sun,Mingcheng Li,Shunli Wang,Lihua Zhang
Keywords (EN): autonomous driving systems, modern autonomous driving, Accurate and robust, robust multimodal multi-task, BEV map segmentation
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments: Accepted to ACM MM 2024

Click to view abstract

Abstract:Accurate and robust multimodal multi-task perception is crucial for modern autonomous driving systems. However, current multimodal perception research follows independent paradigms designed for specific perception tasks, leading to a lack of complementary learning among tasks and decreased performance in multi-task learning (MTL) due to joint training. In this paper, we propose MaskBEV, a masked attention-based MTL paradigm that unifies 3D object detection and bird’s eye view (BEV) map segmentation. MaskBEV introduces a task-agnostic Transformer decoder to process these diverse tasks, enabling MTL to be completed in a unified decoder without requiring additional design of specific task heads. To fully exploit the complementary information between BEV map segmentation and 3D object detection tasks in BEV space, we propose spatial modulation and scene-level context aggregation strategies. These strategies consider the inherent dependencies between BEV segmentation and 3D detection, naturally boosting MTL performance. Extensive experiments on the nuScenes dataset show that, compared with previous state-of-the-art MTL methods, MaskBEV achieves a 1.3 NDS improvement in 3D object detection and a 2.7 mIoU improvement in BEV map segmentation, while also demonstrating slightly faster inference speed.

[CV-129] LOID: Lane Occlusion Inpainting and Detection for Enhanced Autonomous Driving Systems

Link: https://arxiv.org/abs/2408.09117
Authors: Aayush Agrawal,Ashmitha Jaysi Sivakumar,Ibrahim Kaif,Chayan Banerjee
Keywords (EN): effective path planning, Accurate lane detection, Accurate lane, lane detection, autonomous driving
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*Comments: 8 pages, 6 figures and 4 tables

Click to view abstract

Abstract:Accurate lane detection is essential for effective path planning and lane following in autonomous driving, especially in scenarios with significant occlusion from vehicles and pedestrians. Existing models often struggle under such conditions, leading to unreliable navigation and safety risks. We propose two innovative approaches to enhance lane detection in these challenging environments, each showing notable improvements over current methods. The first approach, aug-Segment, improves conventional lane detection models by augmenting the training dataset of CULanes with simulated occlusions and training a segmentation model. This method achieves a 12% improvement over a number of SOTA models on the CULanes dataset, demonstrating that enriched training data can better handle occlusions. However, since this model lacked robustness to certain settings, our main contribution is the second approach, LOID (Lane Occlusion Inpainting and Detection). LOID introduces an advanced lane detection network that uses an image processing pipeline to identify and mask occlusions. It then employs inpainting models to reconstruct the road environment in the occluded areas. The enhanced image is processed by a lane detection algorithm, resulting in 20% and 24% improvements over several SOTA models on the BDD100K and CULanes datasets, respectively, highlighting the effectiveness of this novel technique.

[CV-130] GoodSAM: Bridging Domain and Capacity Gaps via Segment Anything Model for Panoramic Semantic Segmentation

Link: https://arxiv.org/abs/2408.09115
Authors: Weiming Zhang,Yexin Liu,Xu Zheng,Lin Wang
Keywords (EN): zero-shot instance segmentation, instance segmentation capability, powerful zero-shot instance, paper presents GoodSAM, panoramic semantic segmentation
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments: 15 pages, under review

Click to view abstract

Abstract:This paper presents GoodSAM++, a novel framework utilizing the powerful zero-shot instance segmentation capability of SAM (i.e., teacher) to learn a compact panoramic semantic segmentation model, i.e., student, without requiring any labeled data. GoodSAM++ addresses two critical challenges: 1) SAM's inability to provide semantic labels and inherent distortion problems of panoramic images; 2) the significant capacity disparity between SAM and the student. The "out-of-the-box" insight of GoodSAM++ is to introduce a teacher assistant (TA) to provide semantic information for SAM, integrated with SAM to obtain reliable pseudo semantic maps to bridge both domain and capacity gaps. To make this possible, we first propose a Distortion-Aware Rectification (DARv2) module to address the domain gap. It effectively mitigates the object deformation and distortion problem in panoramic images to obtain pseudo semantic maps. We then introduce a Multi-level Knowledge Adaptation (MKA) module to efficiently transfer the semantic information from the TA and pseudo semantic maps to our compact student model, addressing the significant capacity gap. We conduct extensive experiments on both outdoor and indoor benchmark datasets, showing that our GoodSAM++ achieves a remarkable performance improvement over the state-of-the-art (SOTA) domain adaptation methods. Moreover, diverse open-world scenarios demonstrate the generalization capacity of our GoodSAM++. Last but not least, our most lightweight student model achieves comparable performance to the SOTA models with only 3.7 million parameters.

[CV-131] Measuring Visual Sycophancy in Multimodal Models

Link: https://arxiv.org/abs/2408.09111
Authors: Jaehyuk Lim,Bruce W. Lee
Keywords: disproportionately favor visually, favor visually presented, multimodal language models, visually presented information, paper introduces
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

Abstract:This paper introduces and examines the phenomenon of “visual sycophancy” in multimodal language models, a term we propose to describe these models’ tendency to disproportionately favor visually presented information, even when it contradicts their prior knowledge or responses. Our study employs a systematic methodology to investigate this phenomenon: we present models with images of multiple-choice questions, which they initially answer correctly, then expose the same model to versions with visually pre-marked options. Our findings reveal a significant shift in the models’ responses towards the pre-marked option despite their previous correct answers. Comprehensive evaluations demonstrate that visual sycophancy is a consistent and quantifiable behavior across various model architectures. Our findings highlight potential limitations in the reliability of these models when processing potentially misleading visual information, raising important questions about their application in critical decision-making contexts.
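The evaluation protocol described above can be turned into a simple number. The following is a minimal sketch, not the authors' metric: it assumes each trial records the correct option, the model's answer on the unmarked image, the pre-marked option, and the answer on the marked image. The `Trial` type and the `sycophancy_rate` function are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One multiple-choice item shown twice (hypothetical record layout)."""
    correct: str  # ground-truth option
    initial: str  # model's answer on the unmarked image
    marked: str   # option that was visually pre-marked
    after: str    # model's answer on the pre-marked image


def sycophancy_rate(trials):
    """Share of initially correct trials in which the model switches to a
    wrong pre-marked option. Returns 0.0 when no trial is eligible."""
    eligible = [
        t for t in trials
        if t.initial == t.correct and t.marked != t.correct
    ]
    if not eligible:
        return 0.0
    flipped = sum(1 for t in eligible if t.after == t.marked)
    return flipped / len(eligible)
```

Restricting the denominator to initially correct answers mirrors the setup in the abstract, where models are first shown to answer correctly before the marked version is presented.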

[CV-132] Locate Anything on Earth: Advancing Open-Vocabulary Object Detection for Remote Sensing Community

Link: https://arxiv.org/abs/2408.09110
Authors: Jiancheng Pan,Yanxing Liu,Yuqian Fu,Muyuan Ma,Jiaohao Li,Danda Pani Paudel,Luc Van Gool,Xiaomeng Huang
Keywords: natural disaster assessment, open-vocabulary object detection, remote sensing, Object detection, Earth sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 5 figures

Abstract:Object detection, particularly open-vocabulary object detection, plays a crucial role in Earth sciences, such as environmental monitoring, natural disaster assessment, and land-use planning. However, existing open-vocabulary detectors, primarily trained on natural-world images, struggle to generalize to remote sensing images due to a significant data domain gap. Thus, this paper aims to advance the development of open-vocabulary object detection in the remote sensing community. To achieve this, we first reformulate the task as Locate Anything on Earth (LAE) with the goal of detecting any novel concepts on Earth. We then develop the LAE-Label Engine, which collects, auto-annotates, and unifies up to 10 remote sensing datasets to create LAE-1M, the first large-scale remote sensing object detection dataset with broad category coverage. Using LAE-1M, we further propose and train the novel LAE-DINO Model, the first open-vocabulary foundation object detector for the LAE task, featuring Dynamic Vocabulary Construction (DVC) and Visual-Guided Text Prompt Learning (VisGT) modules. DVC dynamically constructs vocabulary for each training batch, while VisGT maps visual features to semantic space, enhancing text features. We conduct comprehensive experiments on the established remote sensing benchmarks DIOR and DOTAv2.0, as well as our newly introduced 80-class LAE-80C benchmark. Results demonstrate the advantages of the LAE-1M dataset and the effectiveness of the LAE-DINO method.

[CV-133] Temporal Reversed Training for Spiking Neural Networks with Generalized Spatio-Temporal Representation

Link: https://arxiv.org/abs/2408.09108
Authors: Lin Zuo,Yongqi Ding,Wenwei Luo,Mengmeng Jing,Xianlong Tian,Kunshan Yang
Keywords: energy computing paradigm, received widespread attention, ultra-low energy computing, Spiking neural networks, neural networks
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 8 figures

Abstract:Spiking neural networks (SNNs) have received widespread attention as an ultra-low energy computing paradigm. Recent studies have focused on improving the feature extraction capability of SNNs, but they suffer from inefficient inference and suboptimal performance. In this paper, we propose a simple yet effective temporal reversed training (TRT) method to optimize the spatio-temporal performance of SNNs and circumvent these problems. We perturb the input temporal data by temporal reversal, prompting the SNN to produce original-reversed consistent output logits and to learn perturbation-invariant representations. For static data without temporal dimension, we generalize this strategy by exploiting the inherent temporal property of spiking neurons for spike feature temporal reversal. In addition, we utilize the lightweight "star operation" (element-wise multiplication) to hybridize the original and temporally reversed spike firing rates and expand the implicit dimensions, which serves as spatio-temporal regularization to further enhance the generalization of the SNN. Our method involves only an additional temporal reversal operation and element-wise multiplication during training, thus incurring negligible training overhead and not affecting the inference efficiency at all. Extensive experiments on static/neuromorphic object/action recognition, and 3D point cloud classification tasks demonstrate the effectiveness and generalizability of our method. In particular, with only two timesteps, our method achieves 74.77% and 90.57% accuracy on ImageNet and ModelNet40, respectively.
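To make the two training-time operations in this abstract concrete, here is a toy sketch in plain Python. It is not the authors' implementation: the leaky integrate-and-fire neuron, its `decay` and `threshold` values, and the mean-squared `consistency_loss` are assumptions chosen for illustration, and per-neuron firing rates stand in for the output logits that the paper constrains.

```python
def temporal_reverse(seq):
    """Reverse a sequence along its time axis; seq[t] is the input at step t."""
    return seq[::-1]


def lif_spikes(currents, decay=0.5, threshold=1.0):
    """Simulate one leaky integrate-and-fire neuron (assumed toy model)
    and return its 0/1 spike train."""
    v, spikes = 0.0, []
    for x in currents:
        v = decay * v + x          # leaky integration of the input current
        if v >= threshold:
            spikes.append(1)
            v = 0.0                # hard reset after a spike
        else:
            spikes.append(0)
    return spikes


def firing_rates(inputs):
    """inputs: list over time of lists over neurons -> per-neuron firing rate."""
    n_neurons = len(inputs[0])
    return [
        sum(lif_spikes([step[i] for step in inputs])) / len(inputs)
        for i in range(n_neurons)
    ]


def star(a, b):
    """The 'star operation': element-wise multiplication of two rate vectors."""
    return [x * y for x, y in zip(a, b)]


def consistency_loss(a, b):
    """Mean squared difference, an assumed stand-in for the consistency term."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Four timesteps, two neurons.
inputs = [[1.0, 0.6], [0.2, 0.6], [0.2, 0.6], [0.9, 0.6]]
rates = firing_rates(inputs)
rates_rev = firing_rates(temporal_reverse(inputs))
hybrid = star(rates, rates_rev)
loss = consistency_loss(rates, rates_rev)
```

Because the neuron is stateful, the first neuron fires at a different rate on the reversed input even though the input values are the same, which is what gives the reversal a regularizing effect in this toy setting.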

[CV-134] HybridOcc: NeRF Enhanced Transformer-based Multi-Camera 3D Occupancy Prediction

Link: https://arxiv.org/abs/2408.09104
Authors: Xiao Zhao,Bo Chen,Mingyang Sun,Dingkang Yang,Youxing Wang,Xukun Zhang,Mingcheng Li,Dongliang Kou,Xiaoyi Wei,Lihua Zhang
Keywords: describes autonomous driving, semantic scene completion, autonomous driving scenes, describes autonomous, SSC
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to IEEE RAL

Abstract:Vision-based 3D semantic scene completion (SSC) describes autonomous driving scenes through 3D volume representations. However, the occlusion of invisible voxels by scene surfaces poses challenges to current SSC methods in hallucinating refined 3D geometry. This paper proposes HybridOcc, a hybrid 3D volume query proposal method generated by Transformer framework and NeRF representation and refined in a coarse-to-fine SSC prediction framework. HybridOcc aggregates contextual features through the Transformer paradigm based on hybrid query proposals while combining it with NeRF representation to obtain depth supervision. The Transformer branch contains multiple scales and uses spatial cross-attention for 2D to 3D transformation. The newly designed NeRF branch implicitly infers scene occupancy through volume rendering, including visible and invisible voxels, and explicitly captures scene depth rather than generating RGB color. Furthermore, we present an innovative occupancy-aware ray sampling method to orient the SSC task instead of focusing on the scene surface, further improving the overall performance. Extensive experiments on nuScenes and SemanticKITTI datasets demonstrate the effectiveness of our HybridOcc on the SSC task.

[CV-135] Depth-guided Texture Diffusion for Image Semantic Segmentation

Link: https://arxiv.org/abs/2408.09097
Authors: Wei Sun,Yuan Li,Qixiang Ye,Jianbin Jiao,Yanzhao Zhou
Keywords: Depth-guided Texture Diffusion, semantic segmentation, Depth, valuable insights, utilized to improve
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Depth information provides valuable insights into 3D structure, especially the outlines of objects, which can be utilized to improve semantic segmentation. However, a naive fusion of depth information can disrupt features and compromise accuracy due to the modality gap between depth and vision. In this work, we introduce a Depth-guided Texture Diffusion approach that effectively tackles the outlined challenge. Our method extracts low-level features from edges and textures to create a texture image. This image is then selectively diffused across the depth map, enhancing structural information vital for precisely extracting object outlines. By integrating this enriched depth map with the original RGB image into a joint feature embedding, our method effectively bridges the disparity between the depth map and the image, enabling more accurate semantic segmentation. We conduct comprehensive experiments across diverse, commonly-used datasets spanning a wide range of semantic segmentation tasks, including Camouflaged Object Detection (COD), Salient Object Detection (SOD), and indoor semantic segmentation. With source-free estimated depth or depth captured by depth cameras, our method consistently outperforms existing baselines and achieves new state-of-the-art results, demonstrating the effectiveness of our Depth-guided Texture Diffusion for image semantic segmentation.

[CV-136] Segment Anything with Multiple Modalities

Link: https://arxiv.org/abs/2408.09085
Authors: Aoran Xiao,Weihao Xuan,Heli Qi,Yun Xing,Naoto Yokoya,Shijian Lu
Keywords: core functionality, visual recognition, recognition and navigation, RGB, Robust and accurate
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Robust and accurate segmentation of scenes has become one core functionality in various visual recognition and navigation tasks. This has inspired the recent development of Segment Anything Model (SAM), a foundation model for general mask segmentation. However, SAM is largely tailored for single-modal RGB images, limiting its applicability to multi-modal data captured with widely-adopted sensor suites, such as LiDAR plus RGB, depth plus RGB, thermal plus RGB, etc. We develop MM-SAM, an extension and expansion of SAM that supports cross-modal and multi-modal processing for robust and enhanced segmentation with different sensor suites. MM-SAM features two key designs, namely, unsupervised cross-modal transfer and weakly-supervised multi-modal fusion, enabling label-efficient and parameter-efficient adaptation toward various sensor modalities. It addresses three main challenges: 1) adaptation toward diverse non-RGB sensors for single-modal processing, 2) synergistic processing of multi-modal data via sensor fusion, and 3) mask-free training for different downstream tasks. Extensive experiments show that MM-SAM consistently outperforms SAM by large margins, demonstrating its effectiveness and robustness across various sensors and data modalities.

[CV-137] Linking Robustness and Generalization: A k* Distribution Analysis of Concept Clustering in Latent Space for Vision Models

Link: https://arxiv.org/abs/2408.09065
Authors: Shashank Kotyan,Pin-Yu Chen,Danilo Vasconcellos Vargas
Keywords: latent space, latent, space, vision models, assess latent space
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Most evaluations of vision models use indirect methods to assess latent space quality. These methods often involve adding extra layers to project the latent space into a new one. This projection makes it difficult to analyze and compare the original latent space. This article uses the k* Distribution, a local neighborhood analysis method, to examine the learned latent space at the level of individual concepts, which can be extended to examine the entire latent space. We introduce skewness-based true and approximate metrics for interpreting individual concepts to assess the overall quality of vision models’ latent space. Our findings indicate that current vision models frequently fracture the distributions of individual concepts within the latent space. Nevertheless, as these models improve in generalization across multiple datasets, the degree of fracturing diminishes. A similar trend is observed in robust vision models, where increased robustness correlates with reduced fracturing. Ultimately, this approach enables a direct interpretation and comparison of the latent spaces of different vision models and reveals a relationship between a model’s generalizability and robustness. Results show that as a model becomes more general and robust, it tends to learn features that result in better clustering of concepts. Project Website is available online at this https URL

[CV-138] MoRA: LoRA Guided Multi-Modal Disease Diagnosis with Missing Modality MICCAI2024

Link: https://arxiv.org/abs/2408.09064
Authors: Zhiyi Shi,Junsik Kim,Wanhua Li,Yicong Li,Hanspeter Pfister
Keywords: Multi-modal pre-trained models, Multi-modal pre-trained, low memory requirements, models efficiently extract, efficiently extract
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted by MICCAI 2024

Abstract:Multi-modal pre-trained models efficiently extract and fuse features from different modalities with low memory requirements for fine-tuning. Despite this efficiency, their application in disease diagnosis is under-explored. A significant challenge is the frequent occurrence of missing modalities, which impairs performance. Additionally, fine-tuning the entire pre-trained model demands substantial computational resources. To address these issues, we introduce Modality-aware Low-Rank Adaptation (MoRA), a computationally efficient method. MoRA projects each input to a low intrinsic dimension but uses different modality-aware up-projections for modality-specific adaptation in cases of missing modalities. Practically, MoRA integrates into the first block of the model, significantly improving performance when a modality is missing. It requires minimal computational resources, with less than 1.6% of the trainable parameters needed compared to training the entire model. Experimental results show that MoRA outperforms existing techniques in disease diagnosis, demonstrating superior performance, robustness, and training efficiency.
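The core idea, a shared low-rank down-projection followed by an up-projection chosen according to which modality is available, can be sketched in a few lines. This is an illustrative assumption about the structure, not the authors' code: the class name `ModalityAwareLowRank`, the residual connection, and the dictionary keys such as `"image_only"` are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


class ModalityAwareLowRank:
    """Toy low-rank adapter with one shared down-projection and one
    up-projection per missing-modality case (hypothetical layout)."""

    def __init__(self, down, ups):
        self.down = down  # r x d matrix shared by all cases
        self.ups = ups    # dict: case name -> d x r matrix

    def __call__(self, x, case):
        z = matvec(self.down, x)            # project to the low rank r
        delta = matvec(self.ups[case], z)   # case-specific update of size d
        return [xi + di for xi, di in zip(x, delta)]  # residual connection


# d = 2 features, rank r = 1, two missing-modality cases.
adapter = ModalityAwareLowRank(
    down=[[1.0, 1.0]],
    ups={
        "image_only": [[1.0], [0.0]],
        "text_only": [[0.0], [2.0]],
    },
)
out_image_only = adapter([1.0, 2.0], "image_only")
out_text_only = adapter([1.0, 2.0], "text_only")
```

Only the small `down` and `ups` matrices would be trained in such a scheme, which is consistent with the abstract's claim that the method needs a small fraction of the trainable parameters of full fine-tuning.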

[CV-139] ADen: Adaptive Density Representations for Sparse-view Camera Pose Estimation ECCV2024

Link: https://arxiv.org/abs/2408.09042
Authors: Hao Tang,Weiyao Wang,Pierre Gleize,Matt Feiszli
Keywords: powers key applications, Recovering camera poses, Recovering camera, computer vision, object reconstructions
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ECCV 2024, Oral

Abstract:Recovering camera poses from a set of images is a foundational task in 3D computer vision, which powers key applications such as 3D scene/object reconstructions. Classic methods often depend on feature correspondence, such as keypoints, which require the input images to have large overlap and small viewpoint changes. Such requirements present considerable challenges in scenarios with sparse views. Recent data-driven approaches aim to directly output camera poses, either through regressing the 6DoF camera poses or formulating rotation as a probability distribution. However, each approach has its limitations. On one hand, directly regressing the camera poses can be ill-posed, since it assumes a single mode, which is not true under symmetry and leads to sub-optimal solutions. On the other hand, probabilistic approaches are capable of modeling the symmetry ambiguity, yet they sample the entire space of rotation uniformly by brute-force. This leads to an inevitable trade-off between high sample density, which improves model precision, and sample efficiency that determines the runtime. In this paper, we propose ADen to unify the two frameworks by employing a generator and a discriminator: the generator is trained to output multiple hypotheses of 6DoF camera pose to represent a distribution and handle multi-mode ambiguity, and the discriminator is trained to identify the hypothesis that best explains the data. This allows ADen to combine the best of both worlds, achieving substantially higher precision as well as lower runtime than previous methods in empirical evaluations.

[CV-140] Multi Teacher Privileged Knowledge Distillation for Multimodal Expression Recognition

Link: https://arxiv.org/abs/2408.09035
Authors: Muhammad Haseeb Aslam,Marco Pedersoli,Alessandro Lameiras Koerich,Eric Granger
Keywords: complex phenomenon conveyed, Human emotion, vocal tones, body language, facial expressions
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Human emotion is a complex phenomenon conveyed and perceived through facial expressions, vocal tones, body language, and physiological signals. Multimodal emotion recognition systems can perform well because they can learn complementary and redundant semantic information from diverse sensors. In real-world scenarios, only a subset of the modalities employed for training may be available at test time. Learning privileged information allows a model to exploit data from additional modalities that are only available during training. SOTA methods for privileged knowledge distillation (PKD) have been proposed to distill information from a teacher model (with privileged modalities) to a student model (without privileged modalities). However, such PKD methods utilize point-to-point matching and do not explicitly capture the relational information. Recently, methods have been proposed to distill the structural information. However, PKD methods based on structural similarity are primarily confined to learning from a single joint teacher representation, which limits their robustness, accuracy, and ability to learn from diverse multimodal sources. In this paper, a multi-teacher PKD (MT-PKDOT) method with self-distillation is introduced to align diverse teacher representations before distilling them to the student. MT-PKDOT employs a structural similarity KD mechanism based on a regularized optimal transport (OT) for distillation. The proposed MT-PKDOT method was validated on the Affwild2 and Biovid datasets. Results indicate that our proposed method can outperform SOTA PKD methods. It improves the visual-only baseline on Biovid data by 5.5%. On the Affwild2 dataset, the proposed method improves over the visual-only baseline by 3% and 5% for valence and arousal, respectively. Allowing the student to learn from multiple diverse sources is shown to increase the accuracy and implicitly avoids negative transfer to the student model.

[CV-141] Comparative Performance Analysis of Transformer-Based Pre-Trained Models for Detecting Keratoconus Disease

Link: https://arxiv.org/abs/2408.09005
Authors: Nayeem Ahmed,Md Maruf Rahman,Md Fatin Ishrak,Md Imran Kabir Joy,Md Sanowar Hossain Sabuj,Md. Sadekur Rahman
Keywords: degenerative eye disease, eye disease, degenerative eye, keratoconus, pre-trained CNNs
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 3 tables, 27 figures

Abstract:This study compares eight pre-trained CNNs for diagnosing keratoconus, a degenerative eye disease. A carefully selected dataset of keratoconus, normal, and suspicious cases was used. The models tested include DenseNet121, EfficientNetB0, InceptionResNetV2, InceptionV3, MobileNetV2, ResNet50, VGG16, and VGG19. To improve training, the data were preprocessed by removing bad samples, resizing, rescaling, and augmenting. To allow a fair comparison, the models were trained with similar parameters, activation functions, classification functions, and optimizers. To determine class separation effectiveness, each model was evaluated on accuracy, precision, recall, and F1-score. MobileNetV2 was the most accurate model, identifying keratoconus and normal cases with few misclassifications. InceptionV3 and DenseNet121 both performed well in keratoconus detection, but they had trouble with suspicious cases. In contrast, EfficientNetB0, ResNet50, and VGG19 had more difficulty distinguishing suspicious cases from normal ones, indicating the need for model refinement and development. A detailed comparison of state-of-the-art CNN architectures for automated keratoconus identification reveals each model's benefits and weaknesses. This study shows that advanced deep learning models can enhance keratoconus diagnosis and treatment planning. Future research should explore hybrid models and integrate clinical parameters to improve diagnostic accuracy and robustness in real-world clinical applications, paving the way for more effective AI-driven ophthalmology tools.

[CV-142] Classifier-Free Guidance is a Predictor-Corrector

Link: https://arxiv.org/abs/2408.09000
Authors: Arwen Bradley,Preetum Nakkiran
Keywords: CFG, foundations of classifier-free, theoretical foundations, classifier-free guidance, shaky theoretical footing
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: AB and PN contributed equally

Abstract:We investigate the theoretical foundations of classifier-free guidance (CFG). CFG is the dominant method of conditional sampling for text-to-image diffusion models, yet unlike other aspects of diffusion, it remains on shaky theoretical footing. In this paper, we disprove common misconceptions, by showing that CFG interacts differently with DDPM (Ho et al., 2020) and DDIM (Song et al., 2021), and neither sampler with CFG generates the gamma-powered distribution p(x|c)^\gamma p(x)^{1-\gamma}. Then, we clarify the behavior of CFG by showing that it is a kind of predictor-corrector method (Song et al., 2020) that alternates between denoising and sharpening, which we call predictor-corrector guidance (PCG). We prove that in the SDE limit, CFG is actually equivalent to combining a DDIM predictor for the conditional distribution together with a Langevin dynamics corrector for a gamma-powered distribution (with a carefully chosen gamma). Our work thus provides a lens to theoretically understand CFG by embedding it in a broader design space of principled sampling methods.
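For readers unfamiliar with the notation, the gamma-powered distribution and the usual CFG score can be written out as follows. This is the standard textbook form of CFG rather than a formula quoted from the paper; the subscript $t$ for the noise level and the symbol $\tilde{s}_t$ are notational assumptions.

$$
p_{\gamma}(x \mid c) \;\propto\; p(x \mid c)^{\gamma}\, p(x)^{1-\gamma},
\qquad
\tilde{s}_t(x, c) \;=\; \gamma\, \nabla_x \log p_t(x \mid c) \;+\; (1-\gamma)\, \nabla_x \log p_t(x).
$$

At $t = 0$ the guided score $\tilde{s}_0$ equals $\nabla_x \log p_{\gamma}(x \mid c)$. At $t > 0$, however, adding noise to $p_{\gamma}$ does not in general give the same distribution as taking the gamma-power of the noised distributions $p_t$, which is one way to see why a sampler driven by $\tilde{s}_t$ need not produce $p_{\gamma}$.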

[CV-143] Ask Attend Attack: A Effective Decision-Based Black-Box Targeted Attack for Image-to-Text Models

Link: https://arxiv.org/abs/2408.08989
Authors: Qingyuan Zeng,Zhenzhong Wang,Yiu-ming Cheung,Min Jiang
Keywords: demonstrated significant advancements, textit, attacks, targeted attacks, vision-language tasks
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:While image-to-text models have demonstrated significant advancements in various vision-language tasks, they remain susceptible to adversarial attacks. Existing white-box attacks on image-to-text models require access to the architecture, gradients, and parameters of the target model, resulting in low practicality. Although the recently proposed gray-box attacks have improved practicality, they suffer from semantic loss during the training process, which limits their targeted attack performance. To advance adversarial attacks of image-to-text models, this paper focuses on a challenging scenario: decision-based black-box targeted attacks where the attackers only have access to the final output text and aim to perform targeted attacks. Specifically, we formulate the decision-based black-box targeted attack as a large-scale optimization problem. To efficiently solve the optimization problem, a three-stage process, Ask, Attend, Attack (AAA), is proposed to coordinate with the solver. Ask guides attackers to create target texts that satisfy the specific semantics. Attend identifies the crucial regions of the image for attacking, thus reducing the search space for the subsequent Attack. Attack uses an evolutionary algorithm to attack the crucial regions, where the attacks are semantically related to the target texts of Ask, thus achieving targeted attacks without semantic loss. Experimental results on transformer-based and CNN+RNN-based image-to-text models confirmed the effectiveness of our proposed AAA.

[CV-144] Fire Dynamic Vision: Image Segmentation and Tracking for Multi-Scale Fire and Plume Behavior

Link: https://arxiv.org/abs/2408.08984
Authors: Daryn Sagel,Bryan Quaife
Keywords: plume spread models, increasing frequency, frequency and severity, severity of wildfires, wildfires highlight
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:The increasing frequency and severity of wildfires highlight the need for accurate fire and plume spread models. We introduce an approach that effectively isolates and tracks fire and plume behavior across various spatial and temporal scales and image types, identifying physical phenomena in the system and providing insights useful for developing and validating models. Our method combines image segmentation and graph theory to delineate fire fronts and plume boundaries. We demonstrate that the method effectively distinguishes fires and plumes from visually similar objects. Results demonstrate the successful isolation and tracking of fire and plume dynamics across various image sources, ranging from synoptic-scale (10^4 to 10^5 m) satellite images to sub-microscale (10^0 to 10^1 m) images captured close to the fire environment. Furthermore, the methodology leverages image inpainting and spatio-temporal dataset generation for use in statistical and machine learning models.

[CV-145] Deep Generative Classification of Blood Cell Morphology

Link: https://arxiv.org/abs/2408.08982
Authors: Simon Deltadahl,Julian Gilbey,Christine Van Laer,Nancy Boeckx,Mathie Leers,Tanya Freeman,Laura Aiken,Timothy Farren,Matthew Smith,Mohamad Zeina,BloodCounts! consortium,Concetta Piazzese,Joseph Taylor,Nicholas Gleadall,Carola-Bibiane Schönlieb,Suthesh Sivapalaratnam,Michael Roberts,Parashkev Nachev
Keywords: presents significant challenges, machine automation owing, cell type frequencies, diagnosing blood disorders, heterogeneities of biological
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Accurate classification of haematological cells is critical for diagnosing blood disorders, but presents significant challenges for machine automation owing to the complexity of cell morphology, heterogeneities of biological, pathological, and imaging characteristics, and the imbalance of cell type frequencies. We introduce CytoDiffusion, a diffusion-based classifier that effectively models blood cell morphology, combining accurate classification with robust anomaly detection, resistance to distributional shifts, interpretability, data efficiency, and superhuman uncertainty quantification. Our approach outperforms state-of-the-art discriminative models in anomaly detection (AUC 0.976 vs. 0.919), resistance to domain shifts (85.85% vs. 74.38% balanced accuracy), and performance in low-data regimes (95.88% vs. 94.95% balanced accuracy). Notably, our model generates synthetic blood cell images that are nearly indistinguishable from real images, as demonstrated by a Turing test in which expert haematologists achieved only 52.3% accuracy (95% CI: [50.5%, 54.2%]). Furthermore, we enhance model explainability through the generation of directly interpretable counterfactual heatmaps. Our comprehensive evaluation framework, encompassing these multiple performance dimensions, establishes a new benchmark for medical image analysis in haematology, ultimately enabling improved diagnostic accuracy in clinical settings. Our code is available at this https URL.

[CV-146] Enhancing Object Detection with Hybrid dataset in Manufacturing Environments: Comparing Federated Learning to Conventional Techniques

Link: https://arxiv.org/abs/2408.08974
Authors: Vinit Hegiste,Snehal Walunj,Jibinraj Antony,Tatjana Legler,Martin Ruskowski
Keywords: garnered significant attention, Federated Learning, privacy-preserving capabilities, garnered significant, significant attention
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Submitted and Presented at the IEEE International Conference on Innovative Engineering Sciences and Technological Research (ICIESTR-2024)

Abstract:Federated Learning (FL) has garnered significant attention in manufacturing for its robust model development and privacy-preserving capabilities. This paper contributes to research on the robustness of FL models in object detection by presenting a comparative study with conventional techniques using a hybrid dataset for small object detection. Our findings demonstrate the superior performance of FL over centralized training models and different deep learning techniques when evaluated on data recorded in a different environment with a variety of object viewpoints, lighting conditions, cluttered backgrounds, etc. These results highlight the potential of FL in achieving robust global models that perform efficiently even in unseen environments. The study provides valuable insights for deploying resilient object detection models in manufacturing environments.

[CV-147] Image Class Translation Distance: A Novel Interpretable Feature for Image Classification

Link: https://arxiv.org/abs/2408.08973
Authors: Mikyla K. Bowen,Jesse W. Wilson
Keywords: conventional black box, black box classification, box classification networks, interpretable alternative, black box
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, 18 figures, submitted to Computational Intelligence

Abstract:We propose a novel application of image translation networks for image classification and demonstrate its potential as a more interpretable alternative to conventional black box classification networks. We train a network to translate images between possible classes, and then quantify translation distance, i.e., the degree of alteration needed to conform an image to one class or another. These translation distances can then be examined for clusters and trends, and can be fed directly to a simple classifier (e.g., a support vector machine, SVM), providing accuracy comparable to that of a conventional end-to-end convolutional neural network classifier. In addition, visual inspection of translated images can reveal class-specific characteristics and biases in the training sets, such as visual artifacts that are more frequently observed in one class or another. We demonstrate the approach on a toy 2-class scenario, apples versus oranges, and then apply it to two medical imaging tasks: detecting melanoma from photographs of pigmented lesions and classifying 6 cell types in a bone marrow biopsy smear. This novel application of image-to-image networks shows the potential of the technology to go beyond imagining different stylistic changes and to provide greater insight into image classification and medical imaging datasets.
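A rough sketch of the translation-distance idea follows. It rests on several assumptions that the abstract does not specify: images are flat lists of pixel intensities, the distance is a mean absolute pixel change, the trained translation networks are replaced by hypothetical stand-in callables, and a pick-the-smallest rule substitutes for the SVM mentioned in the abstract.

```python
def translation_distance(image, translated):
    """Mean absolute pixel change between an image and its translation
    (an assumed metric; the abstract does not name one)."""
    return sum(abs(a - b) for a, b in zip(image, translated)) / len(image)


def distance_features(image, translators):
    """Map each class name to the distance needed to translate the image
    toward that class. `translators` maps class name -> callable."""
    return {
        name: translation_distance(image, translate(image))
        for name, translate in translators.items()
    }


def nearest_class(features):
    """Pick the class needing the least alteration (stand-in for an SVM)."""
    return min(features, key=features.get)


# Hypothetical stand-ins for trained image translation networks.
translators = {
    "apple": lambda img: [0.9 for _ in img],
    "orange": lambda img: [0.5 for _ in img],
}
features = distance_features([0.8, 1.0], translators)
predicted = nearest_class(features)
```

The feature dictionary is the interpretable part: each entry says how much an image had to change to look like a given class, which can be inspected directly before any classifier is applied.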

[CV-148] A Survey of Trojan Attacks and Defenses to Deep Neural Networks

Link: https://arxiv.org/abs/2408.08920
Authors: Lingxin Jin, Xianyu Wen, Wei Jiang, Jinyu Zhan
Keywords: Deep Neural Networks, artificial intelligence systems, Neural Network Trojans, safety-critical artificial intelligence, Deep Neural
Subjects: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Deep Neural Networks (DNNs) have found extensive applications in safety-critical artificial intelligence systems, such as autonomous driving and facial recognition systems. However, recent research has revealed their susceptibility to Neural Network Trojans (NN Trojans) maliciously injected by adversaries. This vulnerability arises due to the intricate architecture and opacity of DNNs, resulting in numerous redundant neurons embedded within the models. Adversaries exploit these vulnerabilities to conceal malicious Trojans within DNNs, thereby causing erroneous outputs and posing substantial threats to the efficacy of DNN-based applications. This article presents a comprehensive survey of Trojan attacks against DNNs and the countermeasure methods employed to mitigate them. Initially, we trace the evolution of the concept from traditional Trojans to NN Trojans, highlighting the feasibility and practicality of generating NN Trojans. Subsequently, we provide an overview of notable works encompassing various attack and defense strategies, facilitating a comparative analysis of their approaches. Through these discussions, we offer constructive insights aimed at refining these techniques. In recognition of the gravity and immediacy of this subject matter, we also assess the feasibility of deploying such attacks in real-world scenarios as opposed to controlled ideal datasets. The potential real-world implications underscore the urgency of addressing this issue effectively.

[CV-149] SHARP-Net: A Refined Pyramid Network for Deficiency Segmentation in Culverts and Sewer Pipes

Link: https://arxiv.org/abs/2408.08879
Authors: Rasha Alshawi, Md Meftahul Ferdaus, Md Tamjidul Hoque, Kendall Niles, Ken Pathak, Steve Sloan, Mahdi Abdelguerfi
Keywords: Haar-Adaptive Refined Pyramid, Refined Pyramid Network, Refined Pyramid, Semantic Haar-Adaptive Refined, Haar-Adaptive Refined
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:This paper introduces Semantic Haar-Adaptive Refined Pyramid Network (SHARP-Net), a novel architecture for semantic segmentation. SHARP-Net integrates a bottom-up pathway featuring Inception-like blocks with varying filter sizes (3x3 and 5x5), parallel max-pooling, and additional spatial detection layers. This design captures multi-scale features and fine structural details. Throughout the network, depth-wise separable convolutions are used to reduce complexity. The top-down pathway of SHARP-Net focuses on generating high-resolution features through upsampling and information fusion using 1x1 and 3x3 depth-wise separable convolutions. We evaluated our model using our developed challenging Culvert-Sewer Defects dataset and the benchmark DeepGlobe Land Cover dataset. Our experimental evaluation demonstrated the base model’s (excluding Haar-like features) effectiveness in handling irregular defect shapes, occlusions, and class imbalances. It outperformed state-of-the-art methods, including U-Net, CBAM U-Net, ASCU-Net, FPN, and SegFormer, achieving average improvements of 14.4% and 12.1% on the Culvert-Sewer Defects and DeepGlobe Land Cover datasets, respectively, with IoU scores of 77.2% and 70.6%. Additionally, the training time was reduced. Furthermore, the integration of carefully selected and fine-tuned Haar-like features enhanced the performance of deep learning models by at least 20%. The proposed SHARP-Net, incorporating Haar-like features, achieved an impressive IoU of 94.75%, representing a 22.74% improvement over the base model. These features were also applied to other deep learning models, showing a 35.0% improvement, proving their versatility and effectiveness. SHARP-Net thus provides a powerful and efficient solution for accurate semantic segmentation in challenging real-world scenarios.
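
The abstract says depth-wise separable convolutions are used to reduce complexity. As a rough illustration of why, the sketch below counts weights for a standard convolution versus a depth-wise convolution followed by a 1x1 point-wise convolution. The channel sizes (64 in, 128 out) are assumptions picked for the example, not values from the paper, and biases are ignored.

```python
def conv_params(kernel, c_in, c_out):
    """Weights in a standard kernel x kernel convolution (bias ignored)."""
    return kernel * kernel * c_in * c_out


def separable_conv_params(kernel, c_in, c_out):
    """Weights in a depth-wise kernel x kernel convolution plus a 1x1 point-wise one."""
    depthwise = kernel * kernel * c_in  # one spatial filter per input channel
    pointwise = c_in * c_out            # 1x1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer: 3x3 kernel, 64 input channels, 128 output channels.
standard = conv_params(3, 64, 128)
separable = separable_conv_params(3, 64, 128)
print(standard, separable, round(standard / separable, 2))  # 73728 8768 8.41
```

For this hypothetical layer the separable form needs roughly one eighth of the weights; the actual savings in SHARP-Net depend on its real layer widths, which the abstract does not give.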

[CV-150] LEGENT: Open Platform for Embodied Agents ACL2024

Link: https://arxiv.org/abs/2404.18243
Authors: Zhili Cheng, Zhitong Wang, Jinyi Hu, Shengding Hu, An Liu, Yuge Tu, Pengkai Li, Lei Shi, Zhiyuan Liu, Maosong Sun
Keywords: Large Language Models, Large Multimodal Models, hindering complex real-life, Large Language, Large Multimodal
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: ACL 2024 System Demonstration

Abstract:Despite advancements in Large Language Models (LLMs) and Large Multimodal Models (LMMs), their integration into language-grounded, human-like embodied agents remains incomplete, hindering complex real-life task performance in physical environments. Existing integrations often feature limited open sourcing, challenging collective progress in this field. We introduce LEGENT, an open, scalable platform for developing embodied agents using LLMs and LMMs. LEGENT offers a dual approach: a rich, interactive 3D environment with communicable and actionable agents, paired with a user-friendly interface, and a sophisticated data generation pipeline utilizing advanced algorithms to exploit supervision from simulated worlds at scale. In our experiments, an embryonic vision-language-action model trained on LEGENT-generated data surpasses GPT-4V in embodied tasks, showcasing promising generalization capabilities.

[CV-151] Latency-Aware Generative Semantic Communications with Pre-Trained Diffusion Models

Link: https://arxiv.org/abs/2403.17256
Authors: Li Qiao, Mahdi Boloursaz Mashhadi, Zhen Gao, Chuan Heng Foh, Pei Xiao, Mehdi Bennis
Keywords: recently shown great, shown great success, synthesizing natural signals, generation process, recently shown
Subjects: Information Theory (cs.IT); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Signal Processing (eess.SP)
Comments: Accepted for publication in IEEE Wireless Communication Letters

Abstract:Generative foundation AI models have recently shown great success in synthesizing natural signals with high perceptual quality using only textual prompts and conditioning signals to guide the generation process. This enables semantic communications at extremely low data rates in future wireless networks. In this paper, we develop a latency-aware semantic communications framework with pre-trained generative models. The transmitter performs multi-modal semantic decomposition on the input signal and transmits each semantic stream with the appropriate coding and communication schemes based on the intent. For the prompt, we adopt a re-transmission-based scheme to ensure reliable transmission, and for the other semantic modalities we use an adaptive modulation/coding scheme to achieve robustness to the changing wireless channel. Furthermore, we design a semantic and latency-aware scheme to allocate transmission power to different semantic modalities based on their importance subjected to semantic quality constraints. At the receiver, a pre-trained generative model synthesizes a high fidelity signal using the received multi-stream semantics. Simulation results demonstrate ultra-low-rate, low-latency, and channel-adaptive semantic communications.

[CV-152] Towards a Benchmark for Colorectal Cancer Segmentation in Endorectal Ultrasound Videos: Dataset and Model Development

Link: https://arxiv.org/abs/2408.10067
Authors: Yuncheng Jiang, Yiwen Hu, Zixun Zhang, Jun Wei, Chun-Mei Feng, Xuemei Tang, Xiang Wan, Yong Liu, Shuguang Cui, Zhen Li
Keywords: important imaging modality, Endorectal ultrasound, colorectal cancer segmentation, cancer segmentation, important imaging
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Endorectal ultrasound (ERUS) is an important imaging modality that provides high reliability for diagnosing the depth and boundary of invasion in colorectal cancer. However, the lack of a large-scale ERUS dataset with high-quality annotations hinders the development of automatic ultrasound diagnostics. In this paper, we collected and annotated the first benchmark dataset that covers diverse ERUS scenarios, i.e. colorectal cancer segmentation, detection, and infiltration depth staging. Our ERUS-10K dataset comprises 77 videos and 10,000 high-resolution annotated frames. Based on this dataset, we further introduce a benchmark model for colorectal cancer segmentation, named the Adaptive Sparse-context TRansformer (ASTR). ASTR is designed based on three considerations: scanning mode discrepancy, temporal information, and low computational complexity. For generalizing to different scanning modes, the adaptive scanning-mode augmentation is proposed to convert between raw sector images and linear scan ones. For mining temporal information, the sparse-context transformer is incorporated to integrate inter-frame local and global features. For reducing computational complexity, the sparse-context block is introduced to extract contextual features from auxiliary frames. Finally, on the benchmark dataset, the proposed ASTR model achieves a 77.6% Dice score in rectal cancer segmentation, largely outperforming previous state-of-the-art methods.

[CV-153] Pose-GuideNet: Automatic Scanning Guidance for Fetal Head Ultrasound from Pose Estimation MICCAI2024

Link: https://arxiv.org/abs/2408.09931
Authors: Qianhui Men, Xiaoqing Guo, Aris T. Papageorghiou, J. Alison Noble
Keywords: image-guided radiology applications, enables healthcare professionals, techniques initiate automatic, initiate automatic guidance, cross-sectional view enables
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by MICCAI2024

Abstract:3D pose estimation from a 2D cross-sectional view enables healthcare professionals to navigate through the 3D space, and such techniques initiate automatic guidance in many image-guided radiology applications. In this work, we investigate how estimating 3D fetal pose from freehand 2D ultrasound scanning can guide a sonographer to locate a head standard plane. Fetal head pose is estimated by the proposed Pose-GuideNet, a novel 2D/3D registration approach to align freehand 2D ultrasound to a 3D anatomical atlas without the acquisition of 3D ultrasound. To facilitate the 2D to 3D cross-dimensional projection, we exploit the prior knowledge in the atlas to align the standard plane frame in a freehand scan. A semantic-aware contrastive-based approach is further proposed to align the frames that are off standard planes based on their anatomical similarity. In the experiment, we enhance the existing assessment of freehand image localization by comparing the transformation of its estimated pose towards standard plane with the corresponding probe motion, which reflects the actual view change in 3D anatomy. Extensive results on two clinical head biometry tasks show that Pose-GuideNet not only accurately predicts pose but also successfully predicts the direction of the fetal head. Evaluations with probe motions further demonstrate the feasibility of adopting Pose-GuideNet for freehand ultrasound-assisted navigation in a sensor-free environment.

[CV-154] Preoperative Rotator Cuff Tear Prediction from Shoulder Radiographs using a Convolutional Block Attention Module-Integrated Neural Network

Link: https://arxiv.org/abs/2408.09894
Authors: Chris Hyunchul Jo, Jiwoong Yang, Byunghwan Jeon, Hackjoon Shim, Ikbeom Jang
Keywords: Research question, rotator cuff tears, plain shoulder radiograph, standard of care, rotator cuff
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Research question: We test whether a plain shoulder radiograph can be used together with deep learning methods to identify patients with rotator cuff tears, as opposed to using an MRI in standard of care. Findings: By integrating convolutional block attention modules into a deep neural network, our model demonstrates high accuracy in detecting patients with rotator cuff tears, achieving an average AUC of 0.889 and an accuracy of 0.831. Meaning: This study validates the efficacy of our deep learning model to accurately detect rotator cuff tears from radiographs, offering a viable pre-assessment or alternative to more expensive imaging techniques such as MRI.

[CV-155] Coarse-Fine View Attention Alignment-Based GAN for CT Reconstruction from Biplanar X-Rays

Link: https://arxiv.org/abs/2408.09736
Authors: Zhi Qiao, Hanqiang Ouyang, Dongheng Chu, Huishu Yuan, Xiantong Zhen, Pei Dong, Zhen Qian
Keywords: intra-operation imaging, surgical planning, planning and intra-operation, important alternative, biplanar X-rays
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Abstract:For surgical planning and intra-operation imaging, CT reconstruction using X-ray images can potentially be an important alternative when CT imaging is not available or not feasible. In this paper, we aim to use biplanar X-rays to reconstruct a 3D CT image, because biplanar X-rays convey richer information than single-view X-rays and are more commonly used by surgeons. Different from previous studies in which the two X-ray views were treated indifferently when fusing the cross-view data, we propose a novel attention-informed coarse-to-fine cross-view fusion method to combine the features extracted from the orthogonal biplanar views. This method consists of a view attention alignment sub-module and a fine-distillation sub-module that are designed to work together to highlight the unique or complementary information from each of the views. Experiments have demonstrated the superiority of our proposed method over the SOTA methods.

[CV-156] Diff2CT: Diffusion Learning to Reconstruct Spine CT from Biplanar X-Rays

Link: https://arxiv.org/abs/2408.09731
Authors: Zhi Qiao, Xuhui Liu, Xiaopeng Wang, Runkun Liu, Xiantong Zhen, Pei Dong, Zhen Qian
Keywords: Intraoperative CT imaging, surgical guidance, practical to implement, crucial resource, resource for surgical
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Intraoperative CT imaging serves as a crucial resource for surgical guidance; however, it may not always be readily accessible or practical to implement. In scenarios where CT imaging is not an option, reconstructing CT scans from X-rays can offer a viable alternative. In this paper, we introduce an innovative method for 3D CT reconstruction utilizing biplanar X-rays. Distinct from previous research that relies on conventional image generation techniques, our approach leverages a conditional diffusion process to tackle the task of reconstruction. More precisely, we employ a diffusion-based probabilistic model trained to produce 3D CT images based on orthogonal biplanar X-rays. To improve the structural integrity of the reconstructed images, we incorporate a novel projection loss function. Experimental results validate that our proposed method surpasses existing state-of-the-art benchmarks in both visual image quality and multiple evaluative metrics. Specifically, our technique achieves a higher Structural Similarity Index (SSIM) of 0.83, a relative increase of 10%, and a lower Fréchet Inception Distance (FID) of 83.43, which represents a relative decrease of 25%.

[CV-157] TESL-Net: A Transformer-Enhanced CNN for Accurate Skin Lesion Segmentation

Link: https://arxiv.org/abs/2408.09687
Authors: Shahzaib Iqbal, Muhammad Zeeshan, Mehwish Mehmood, Tariq M. Khan, Imran Razzak
Keywords: skin cancer relies, Early detection, skin lesions, cancer relies, relies on precise
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Early detection of skin cancer relies on precise segmentation of dermoscopic images of skin lesions. However, this task is challenging due to the irregular shape of the lesion, the lack of sharp borders, and the presence of artefacts such as marker colours and hair follicles. Recent methods for melanoma segmentation are U-Nets and fully connected networks (FCNs). As the depth of these neural network models increases, they can face issues like the vanishing gradient problem and parameter redundancy, potentially leading to a decrease in the Jaccard index of the segmentation model. In this study, we introduced a novel network named TESL-Net for the segmentation of skin lesions. The proposed TESL-Net involves a hybrid network that combines the local features of a CNN encoder-decoder architecture with long-range and temporal dependencies using bi-convolutional long-short-term memory (Bi-ConvLSTM) networks and a Swin transformer. This enables the model to account for the uncertainty of segmentation over time and capture contextual channel relationships in the data. We evaluated the efficacy of TESL-Net in three commonly used datasets (ISIC 2016, ISIC 2017, and ISIC 2018) for the segmentation of skin lesions. The proposed TESL-Net achieves state-of-the-art performance, as evidenced by a significantly elevated Jaccard index demonstrated by empirical results.

[CV-158] Screen Them All: High-Throughput Pan-Cancer Genetic and Phenotypic Biomarker Screening from H&E Whole Slide Images

Link: https://arxiv.org/abs/2408.09554
Authors: Yi Kan Wang, Ludmila Tydlitatova, Jeremy D. Kunz, Gerard Oakley, Ran A. Godrich, Matthew C. H. Lee, Chad Vanderbilt, Razik Yousfi, Thomas Fuchs, David S. Klimstra, Siqi Liu
Keywords: molecular alterations serve, typically detected, alterations serve, prognostic or therapy-predictive, detected using single
Subjects: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Abstract:Many molecular alterations serve as clinically prognostic or therapy-predictive biomarkers, typically detected using single or multi-gene molecular assays. However, these assays are expensive, tissue destructive and often take weeks to complete. Using AI on routine H&E WSIs offers a fast and economical approach to screen for multiple molecular biomarkers. We present a high-throughput AI-based system leveraging Virchow2, a foundation model pre-trained on 3 million slides, to interrogate genomic features previously determined by a next-generation sequencing (NGS) assay, using 47,960 scanned hematoxylin and eosin (H&E) whole slide images (WSIs) from 38,984 cancer patients. Unlike traditional methods that train individual models for each biomarker or cancer type, our system employs a unified model to simultaneously predict a wide range of clinically relevant molecular biomarkers across cancer types. By training the network to replicate the MSK-IMPACT targeted biomarker panel of 505 genes, it identified 80 high performing biomarkers with a mean AU-ROC of 0.89 in the 15 most common cancer types. In addition, 40 biomarkers demonstrated strong associations with specific cancer histologic subtypes. Furthermore, 58 biomarkers were associated with targets frequently assayed clinically for therapy selection and response prediction. The model can also predict the activity of five canonical signaling pathways, identify defects in DNA repair mechanisms, and predict genomic instability measured by tumor mutation burden, microsatellite instability (MSI), and chromosomal instability (CIN). The proposed model can offer potential to guide therapy selection, improve treatment efficacy, accelerate patient screening for clinical trials and provoke the interrogation of new therapeutic targets.

[CV-159] Deformation-aware GAN for Medical Image Synthesis with Substantially Misaligned Pairs

Link: https://arxiv.org/abs/2408.09432
Authors: Bowen Xin, Tony Young, Claire E Wainwright, Tamara Blake, Leo Lebrat, Thomas Gaass, Thomas Benkert, Alto Stemmer, David Coman, Jason Dowling
Keywords: generates additional imaging, additional imaging modalities, synthesis generates additional, Medical image synthesis, image synthesis generates
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by MIDL2024

Abstract:Medical image synthesis generates additional imaging modalities that are costly, invasive or harmful to acquire, which helps to facilitate the clinical workflow. When training pairs are substantially misaligned (e.g., lung MRI-CT pairs with respiratory motion), accurate image synthesis remains a critical challenge. Recent works explored the directional registration module to adjust misalignment in generative adversarial networks (GANs); however, substantial misalignment will lead to 1) suboptimal data mapping caused by correspondence ambiguity, and 2) degraded image fidelity caused by morphology influence on discriminators. To address the challenges, we propose a novel Deformation-aware GAN (DA-GAN) to dynamically correct the misalignment during the image synthesis based on multi-objective inverse consistency. Specifically, in the generative process, three levels of inverse consistency cohesively optimise symmetric registration and image generation for improved correspondence. In the adversarial process, to further improve image fidelity under misalignment, we design deformation-aware discriminators to disentangle the mismatched spatial morphology from the judgement of image fidelity. Experimental results show that DA-GAN achieved superior performance on a public dataset with simulated misalignments and a real-world lung MRI-CT dataset with respiratory motion misalignment. The results indicate the potential for a wide range of medical image synthesis tasks such as radiotherapy planning.

[CV-160] Flemme: A Flexible and Modular Learning Platform for Medical Images

Link: https://arxiv.org/abs/2408.09369
Authors: Guoqing Zhang, Jingyun Yang, Yang Li
Keywords: powerful network backbones, increasingly significant, rapid development, development of computer, computer vision
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 6 figures

Abstract:With the rapid development of computer vision and the emergence of powerful network backbones and architectures, the application of deep learning in medical imaging has become increasingly significant. Unlike natural images, medical images lack huge volumes of data but feature more modalities, making it difficult to train a general model that has satisfactory performance across various datasets. In practice, practitioners often suffer from manually creating and testing models combining independent backbones and architectures, which is a laborious and time-consuming process. We propose Flemme, a FLExible and Modular learning platform for MEdical images. Our platform separates encoders from the model architectures so that different models can be constructed via various combinations of supported encoders and architectures. We construct encoders using building blocks based on convolution, transformer, and state-space model (SSM) to process both 2D and 3D image patches. A base architecture is implemented following an encoder-decoder style, with several derived architectures for image segmentation, reconstruction, and generation tasks. In addition, we propose a general hierarchical architecture incorporating a pyramid loss to optimize and fuse vertical features. Experiments demonstrate that this simple design leads to an average improvement of 5.60% in Dice score and 7.81% in mean intersection over union (mIoU) for segmentation models, as well as an enhancement of 5.57% in peak signal-to-noise ratio (PSNR) and 8.22% in structural similarity (SSIM) for reconstruction models. We further utilize Flemme as an analytical tool to assess the effectiveness and efficiency of various encoders across different tasks. Code is available at this https URL.
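
Several abstracts in this section report Dice and IoU scores for segmentation. For readers unfamiliar with the two metrics, here is a minimal sketch of how they are commonly computed for a pair of binary masks. The function name and the flat-list mask representation are assumptions for illustration; this is not code from Flemme or any paper listed here.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two binary masks given as flat sequences of 0/1."""
    if len(pred) != len(target):
        raise ValueError("masks must contain the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - intersection
    if union == 0:
        return 1.0, 1.0  # convention assumed here: two empty masks agree perfectly
    dice = 2 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou


# Toy 6-pixel masks that overlap on two pixels.
pred = [1, 1, 1, 0, 0, 0]
target = [0, 1, 1, 1, 0, 0]
dice, iou = dice_and_iou(pred, target)
print(round(dice, 4), iou)  # 0.6667 0.5
```

For a single pair of masks the two are related by Dice = 2·IoU / (1 + IoU), so Dice is never lower than IoU. A mean IoU (mIoU) is typically the average of per-class IoU values.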

[CV-161] Improving Lung Cancer Diagnosis and Survival Prediction with Deep Learning and CT Imaging

Link: https://arxiv.org/abs/2408.09367
Authors: Xiawei Wang, James Sharpnack, Thomas C.M. Lee
Keywords: Lung cancer, patients’ survival outcomes, cancer-related deaths, Lung, National Lung Screening
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Lung cancer is a major cause of cancer-related deaths, and early diagnosis and treatment are crucial for improving patients’ survival outcomes. In this paper, we propose to employ convolutional neural networks to model the non-linear relationship between the risk of lung cancer and the lungs’ morphology revealed in the CT images. We apply a mini-batched loss that extends the Cox proportional hazards model to handle the non-convexity induced by neural networks, which also enables the training of large data sets. Additionally, we propose to combine mini-batched loss and binary cross-entropy to predict both lung cancer occurrence and the risk of mortality. Simulation results demonstrate the effectiveness of both the mini-batched loss with and without the censoring mechanism, as well as its combination with binary cross-entropy. We evaluate our approach on the National Lung Screening Trial data set with several 3D convolutional neural network architectures, achieving high AUC and C-index scores for lung cancer classification and survival prediction. These results, obtained from simulations and real data experiments, highlight the potential of our approach to improving the diagnosis and treatment of lung cancer.
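
To make the idea of a mini-batched Cox loss more concrete, the sketch below evaluates the standard Cox negative log partial likelihood on a single mini-batch, with risk sets formed only from samples inside that batch. This is an assumption about the general shape of such a loss, not the authors' exact formulation; the function name and the averaging over observed events are illustrative choices, and ties are handled in the simplest (Breslow-style) way.

```python
import math


def cox_minibatch_loss(risk_scores, times, events):
    """Negative log partial likelihood on one mini-batch.

    risk_scores: model outputs (log-risk) per sample.
    times: observed follow-up time per sample.
    events: 1 if the event was observed, 0 if the sample is censored.
    """
    total = 0.0
    n_events = 0
    for score_i, time_i, event_i in zip(risk_scores, times, events):
        if not event_i:
            continue  # censored samples contribute only through risk sets
        # Risk set: batch members still under observation at time_i.
        at_risk = [s for s, t in zip(risk_scores, times) if t >= time_i]
        log_denominator = math.log(sum(math.exp(s) for s in at_risk))
        total += score_i - log_denominator
        n_events += 1
    return -total / n_events if n_events else 0.0


# Two samples with equal scores and both events observed.
loss = cox_minibatch_loss([0.0, 0.0], [1.0, 2.0], [1, 1])
print(round(loss, 4))  # 0.3466
```

In a real training loop the same expression would be written with a differentiable tensor library and a log-sum-exp for numerical stability; the plain-Python version above is only meant to show which terms enter the sum.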

[CV-162] Unpaired Volumetric Harmonization of Brain MRI with Conditional Latent Diffusion

Link: https://arxiv.org/abs/2408.09315
Authors: Mengqi Wu, Minhui Yu, Shuaiming Jing, Pew-Thian Yap, Zhengwu Zhang, Mingxia Liu
Keywords: Multi-site structural MRI, diversify subject cohorts, Multi-site structural, subject cohorts, neuroimaging studies
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Multi-site structural MRI is increasingly used in neuroimaging studies to diversify subject cohorts. However, combining MR images acquired from various sites/centers may introduce site-related non-biological variations. Retrospective image harmonization helps address this issue, but current methods usually perform harmonization on pre-extracted hand-crafted radiomic features, limiting downstream applicability. Several image-level approaches focus on 2D slices, disregarding inherent volumetric information, leading to suboptimal outcomes. To this end, we propose a novel 3D MRI Harmonization framework through Conditional Latent Diffusion (HCLD) by explicitly considering image style and brain anatomy. It comprises a generalizable 3D autoencoder that encodes and decodes MRIs through a 4D latent space, and a conditional latent diffusion model that learns the latent distribution and generates harmonized MRIs with anatomical information from source MRIs while conditioned on target image style. This enables efficient volume-level MRI harmonization through latent style translation, without requiring paired images from target and source domains during training. The HCLD is trained and evaluated on 4,158 T1-weighted brain MRIs from three datasets in three tasks, assessing its ability to remove site-related variations while retaining essential biological features. Qualitative and quantitative experiments suggest the effectiveness of HCLD over several state-of-the-art methods.

[CV-163] Cross-Species Data Integration for Enhanced Layer Segmentation in Kidney Pathology

Link: https://arxiv.org/abs/2408.09278
Authors: Junchao Zhu, Mengmeng Yin, Ruining Deng, Yitian Long, Yu Wang, Yaohong Wang, Shilin Zhao, Haichun Yang, Yuankai Huo
Keywords: Accurate delineation, subsequent functional structural, functional structural analysis, disease diagnosis, crucial for subsequent
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Accurate delineation of the boundaries between the renal cortex and medulla is crucial for subsequent functional structural analysis and disease diagnosis. Training high-quality deep-learning models for layer segmentation relies on the availability of large amounts of annotated data. However, due to the patient’s privacy of medical data and scarce clinical cases, constructing pathological datasets from clinical sources is relatively difficult and expensive. Moreover, using external natural image datasets introduces noise during the domain generalization process. Cross-species homologous data, such as mouse kidney data, which exhibits high structural and feature similarity to human kidneys, has the potential to enhance model performance on human datasets. In this study, we incorporated the collected private Periodic Acid-Schiff (PAS) stained mouse kidney dataset into the human kidney dataset for joint training. The results showed that after introducing cross-species homologous data, the semantic segmentation models based on CNN and Transformer architectures achieved an average increase of 1.77% and 1.24% in mIoU, and 1.76% and 0.89% in Dice score for the human renal cortex and medulla datasets, respectively. This approach is also capable of enhancing the model’s generalization ability. This indicates that cross-species homologous data, as a low-noise trainable data source, can help improve model performance under conditions of limited clinical samples. Code is available at this https URL.

[CV-164] A Fast and Computationally Inexpensive Method For Image Translation of 3D Volume Patient Data

Link: https://arxiv.org/abs/2408.09218
Authors: Cho Yang
Keywords: Grand Challenge Dataset, SynthRAD Grand Challenge, Grand Challenge, Challenge Dataset, SynthRAD Grand
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:CycleGAN was trained on the SynthRAD Grand Challenge Dataset using the single-epoch modification (SEM) method proposed in this paper (CycleGAN-single) and compared with the usual method of training CycleGAN for around 200 epochs (CycleGAN-multi). Model performance was evaluated qualitatively and quantitatively, with quantitative performance metrics such as PSNR, SSIM, MAE and MSE. The consideration of both quantitative and qualitative performance when evaluating a model is unique to certain image-translation tasks like medical imaging, as detailed in this paper. Also, this paper shows that good quantitative performance does not always imply good qualitative performance, and the converse is also not always true (i.e. good qualitative performance does not always imply good quantitative performance). This paper also proposes the FQGA (Fast Paired Image-to-Image Translation Quarter-Generator Adversary) model, which has 1/4 the number of parameters compared to CycleGAN (when comparing their generator models). FQGA outperforms CycleGAN qualitatively and quantitatively even after training for only 20 epochs. Finally, using the SEM method on FQGA allowed it to again outperform CycleGAN both quantitatively and qualitatively. These performance gains with fewer model parameters and time savings from running fewer epochs may also be applicable to other image-to-image translation tasks in machine learning apart from the medical image-translation task discussed in this paper between Cone Beam Computed Tomography (CBCT) and Computed Tomography (CT) images.
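
Three of the quantitative metrics named in this abstract (MSE, MAE and PSNR) have simple closed forms. The sketch below computes them for two images flattened to lists of intensities. The default `max_value=255.0` assumes 8-bit images and is an illustrative choice; CT data is usually stored in other intensity ranges, so the peak value would need to match the data actually used.

```python
import math


def mse(reference, predicted):
    """Mean squared error between two equal-length intensity sequences."""
    return sum((r - p) ** 2 for r, p in zip(reference, predicted)) / len(reference)


def mae(reference, predicted):
    """Mean absolute error between two equal-length intensity sequences."""
    return sum(abs(r - p) for r, p in zip(reference, predicted)) / len(reference)


def psnr(reference, predicted, max_value=255.0):
    """Peak signal-to-noise ratio in dB; max_value is the assumed peak intensity."""
    error = mse(reference, predicted)
    if error == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / error)


# Toy 4-pixel example where a single pixel differs by 10 intensity levels.
reference = [0, 255, 128, 64]
predicted = [0, 245, 128, 64]
print(mse(reference, predicted), mae(reference, predicted))  # 25.0 2.5
print(round(psnr(reference, predicted), 2))
```

All three are pixel-wise averages, which is one reason a model can score well on them while still producing images that look wrong to a reader, the mismatch the abstract points out.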

[CV-165] Tree species classification at the pixel-level using deep learning and multispectral time series in an imbalanced context

Link: https://arxiv.org/abs/2408.08887
Authors: Florian Mouret (CESBIO, UO), David Morin (CESBIO), Milena Planells (CESBIO), Cécile Vincent-Barbaroux
Keywords: multispectral satellite image, satellite image time-series, paper investigates tree, multispectral satellite, image time-series
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

点击查看摘要

Abstract: This paper investigates tree species classification using Sentinel-2 multispectral satellite image time-series. Despite their critical importance for many applications, tree species maps are often unavailable, outdated, or inaccurate for large areas. The interest of using remote sensing time series to produce these maps has been highlighted in many studies. However, many methods proposed in the literature still rely on a standard classification algorithm, usually the Random Forest (RF) algorithm with vegetation indices. This study shows that the use of deep learning models can lead to a significant improvement in classification results, especially in an imbalanced context where the RF algorithm tends to favor the majority class. In our use case in the center of France with 10 tree species, we obtain an overall accuracy (OA) around 95% and an F1-macro score around 80% using three different benchmark deep learning architectures. In contrast, using the RF algorithm yields an OA of 93% and an F1 of 60%, indicating that the minority classes are not classified with sufficient accuracy. Therefore, the proposed framework is a strong baseline that can be easily implemented in most scenarios, even with a limited amount of reference data. Our results highlight that a standard multilayer perceptron can be competitive with batch normalization and a sufficient number of parameters. Other architectures (convolutional or attention-based) can also achieve strong results when tuned properly. Furthermore, our results show that DL models are naturally robust to imbalanced data, although similar results can be obtained using dedicated techniques.
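The gap this abstract reports between overall accuracy and macro F1 is typical of imbalanced data. The sketch below shows it with a toy classifier that always predicts the majority class. The labels, class proportions, and function names are hypothetical and unrelated to the paper's dataset.

```python
def overall_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the reference label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores over the classes present in y_true."""
    scores = []
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denominator = 2 * tp + fp + fn
        # A class that is never predicted correctly contributes an F1 of 0.
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


# Hypothetical imbalanced labels: 95 majority-class and 5 minority-class samples.
y_true = ["oak"] * 95 + ["pine"] * 5
# A degenerate classifier that always predicts the majority class.
y_pred = ["oak"] * 100

print(overall_accuracy(y_true, y_pred))
print(round(macro_f1(y_true, y_pred), 3))
```

Because macro F1 weights every class equally, the missed minority class pulls the score down even though overall accuracy stays high.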

[CV-166] U-MedSAM: Uncertainty-aware MedSAM for Medical Image Segmentation

Link: https://arxiv.org/abs/2408.08881
Authors: Xin Wang,Xiaoyu Liu,Peng Huang,Pu Huang,Shu Hu,Hongtu Zhu
Keywords: Medical Image Foundation, Image Foundation Models, Medical Image, Image Foundation, Foundation Models
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Medical Image Foundation Models have proven to be powerful tools for mask prediction across various datasets. However, accurately assessing the uncertainty of their predictions remains a significant challenge. To address this, we propose a new model, U-MedSAM, which integrates the MedSAM model with an uncertainty-aware loss function and the Sharpness-Aware Minimization (SharpMin) optimizer. The uncertainty-aware loss function automatically combines region-based, distribution-based, and pixel-based loss designs to enhance segmentation accuracy and robustness. SharpMin improves generalization by finding flat minima in the loss landscape, thereby reducing overfitting. Our method was evaluated in the CVPR24 MedSAM on Laptop challenge, where U-MedSAM demonstrated promising performance.

Machine Learning

[LG-0] KAN 2.0: Kolmogorov-Arnold Networks Meet Science

Link: https://arxiv.org/abs/2408.10205
Authors: Ziming Liu,Pingchuan Ma,Yixuan Wang,Wojciech Matusik,Max Tegmark
Keywords: inherent incompatibility, depends on symbolism, Science, Science lies, science depends
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph); Data Analysis, Statistics and Probability (physics.data-an)
Comments: 27 pages, 14 figures

Abstract:A major challenge of AI + Science lies in their inherent incompatibility: today’s AI is primarily based on connectionism, while science depends on symbolism. To bridge the two worlds, we propose a framework to seamlessly synergize Kolmogorov-Arnold Networks (KANs) and science. The framework highlights KANs’ usage for three aspects of scientific discovery: identifying relevant features, revealing modular structures, and discovering symbolic formulas. The synergy is bidirectional: science to KAN (incorporating scientific knowledge into KANs), and KAN to science (extracting scientific insights from KANs). We highlight major new functionalities in the pykan package: (1) MultKAN: KANs with multiplication nodes. (2) kanpiler: a KAN compiler that compiles symbolic formulas into KANs. (3) tree converter: convert KANs (or any neural networks) to tree graphs. Based on these tools, we demonstrate KANs’ capability to discover various types of physical laws, including conserved quantities, Lagrangians, symmetries, and constitutive laws.

[LG-1] Criticality Leveraged Adversarial Training (CLAT) for Boosted Performance via Parameter Efficiency

Link: https://arxiv.org/abs/2408.10204
Authors: Bhavna Gopal,Huanrui Yang,Jingyang Zhang,Mark Horton,Yiran Chen
Keywords: increased generalization errors, enhances neural network, neural network robustness, training enhances neural, Adversarial training
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages + appendix/additional experiments

Abstract:Adversarial training enhances neural network robustness but suffers from a tendency to overfit and increased generalization errors on clean data. This work introduces CLAT, an innovative approach that mitigates adversarial overfitting by introducing parameter efficiency into the adversarial training process, improving both clean accuracy and adversarial robustness. Instead of tuning the entire model, CLAT identifies and fine-tunes robustness-critical layers - those predominantly learning non-robust features - while freezing the remaining model to enhance robustness. It employs dynamic critical layer selection to adapt to changes in layer criticality throughout the fine-tuning process. Empirically, CLAT can be applied on top of existing adversarial training methods, significantly reduces the number of trainable parameters by approximately 95%, and achieves more than a 2% improvement in adversarial robustness compared to baseline methods.

[LG-2] Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Models

Link: https://arxiv.org/abs/2408.10189
Authors: Aviv Bick,Kevin Y. Li,Eric P. Xing,J. Zico Kolter,Albert Gu
Keywords: inference settings due, quadratic-time self-attention, dominant paradigm, paradigm for domains, domains like language
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Transformer architectures have become a dominant paradigm for domains like language modeling but suffer in many inference settings due to their quadratic-time self-attention. Recently proposed subquadratic architectures, such as Mamba, have shown promise, but have been pretrained with substantially less computational resources than the strongest Transformer models. In this work, we present a method that is able to distill a pretrained Transformer architecture into alternative architectures such as state space models (SSMs). The key idea to our approach is that we can view both Transformers and SSMs as applying different forms of mixing matrices over the token sequences. We can thus progressively distill the Transformer architecture by matching different degrees of granularity in the SSM: first matching the mixing matrices themselves, then the hidden units at each block, and finally the end-to-end predictions. Our method, called MOHAWK, is able to distill a Mamba-2 variant based on the Phi-1.5 architecture (Phi-Mamba) using only 3B tokens and a hybrid version (Hybrid Phi-Mamba) using 5B tokens. Despite using less than 1% of the training data typically used to train models from scratch, Phi-Mamba boasts substantially stronger performance compared to all past open-source non-Transformer models. MOHAWK allows models like SSMs to leverage computational resources invested in training Transformer-based architectures, highlighting a new avenue for building such models.
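The "mixing matrix" view mentioned in this abstract can be made concrete for causal softmax attention: the layer output is a lower-triangular, row-stochastic matrix applied to the value sequence. The sketch below builds that matrix and computes a Frobenius distance between two such matrices. The distance is only an assumed stand-in for a matrix-matching objective, not the MOHAWK loss, and all names and inputs are hypothetical.

```python
import math


def causal_attention_matrix(queries, keys):
    """Return the n x n causal softmax attention matrix as nested lists."""
    n, dim = len(queries), len(queries[0])
    matrix = []
    for i in range(n):
        # Token i may only attend to positions 0..i.
        scores = [
            sum(q * k for q, k in zip(queries[i], keys[j])) / math.sqrt(dim)
            for j in range(i + 1)
        ]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        matrix.append([w / total for w in weights] + [0.0] * (n - i - 1))
    return matrix


def mix(matrix, values):
    """Apply a sequence-mixing matrix to per-token value vectors."""
    width = len(values[0])
    return [
        [sum(row[j] * values[j][c] for j in range(len(values))) for c in range(width)]
        for row in matrix
    ]


def frobenius_distance(a, b):
    """Frobenius norm of the difference between two equally sized matrices."""
    return math.sqrt(sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)))


# Hypothetical 3-token sequence with 2-dimensional queries, keys, and values.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
teacher_matrix = causal_attention_matrix(tokens, tokens)
mixed_values = mix(teacher_matrix, tokens)
```

A student sequence mixer that produces its own n x n matrix could be compared with `teacher_matrix` through `frobenius_distance`, which is the spirit of the first matching stage the abstract describes.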

[LG-3] SMILE: Zero-Shot Sparse Mixture of Low-Rank Experts Construction From Pre-Trained Foundation Models

Link: https://arxiv.org/abs/2408.10174
Authors: Anke Tang,Li Shen,Yong Luo,Shuai Xie,Han Hu,Lefei Zhang,Bo Du,Dacheng Tao
Keywords: deep model fusion, Deep model, Deep model training, model fusion techniques, model fusion
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Code is available at this https URL

Abstract:Deep model training on extensive datasets is increasingly becoming cost-prohibitive, prompting the widespread adoption of deep model fusion techniques to leverage knowledge from pre-existing models. From simple weight averaging to more sophisticated methods like AdaMerging, model fusion effectively improves model performance and accelerates the development of new models. However, potential interference between parameters of individual models and the lack of interpretability in the fusion progress remain significant challenges. Existing methods often try to resolve the parameter interference issue by evaluating attributes of parameters, such as their magnitude or sign, or by parameter pruning. In this study, we begin by examining the fine-tuning of linear layers through the lens of subspace analysis and explicitly define parameter interference as an optimization problem to shed light on this subject. Subsequently, we introduce an innovative approach to model fusion called zero-shot Sparse MIxture of Low-rank Experts (SMILE) construction, which allows for the upscaling of source models into an MoE model without extra data or further training. Our approach relies on the observation that fine-tuning mostly keeps the important parts from the pre-training, but it uses less significant or unused areas to adapt to new tasks. Also, the issue of parameter interference, which is intrinsically intractable in the original parameter space, can be managed by expanding the dimensions. We conduct extensive experiments across diverse scenarios, such as image classification and text generalization tasks, using full fine-tuning and LoRA fine-tuning, and we apply our method to large language models (CLIP models, Flan-T5 models, and Mistral-7B models), highlighting the adaptability and scalability of SMILE. Code is available at this https URL

[LG-4] Physics-Aware Combinatorial Assembly Planning using Deep Reinforcement Learning

Link: https://arxiv.org/abs/2408.10162
Authors: Ruixuan Liu,Alan Chen,Weiye Zhao,Changliu Liu
Keywords: satisfy user specifications, Combinatorial assembly, standardized unit primitives, unit primitives, user specifications
Categories: Robotics (cs.RO); Machine Learning (cs.LG)

Abstract: Combinatorial assembly uses standardized unit primitives to build objects that satisfy user specifications. Lego is a widely used platform for combinatorial assembly, in which people use unit primitives (i.e., Lego bricks) to build highly customizable 3D objects. This paper studies sequence planning for physical combinatorial assembly using Lego. Given the shape of the desired object, we want to find a sequence of actions for placing Lego bricks to build the target object. In particular, we aim to ensure the planned assembly sequence is physically executable. However, assembly sequence planning (ASP) for combinatorial assembly is particularly challenging due to its combinatorial nature, i.e., the vast number of possible combinations and complex constraints. To address the challenges, we employ deep reinforcement learning to learn a construction policy for placing unit primitives sequentially to build the desired object. Specifically, we design an online physics-aware action mask that efficiently filters out invalid actions and guides policy learning. In the end, we demonstrate that the proposed method successfully plans physically valid assembly sequences for constructing different Lego structures. The generated construction plan can be executed in the real world.

[LG-5] Multilingual Needle in a Haystack: Investigating Long-Context Behavior of Multilingual Large Language Models

Link: https://arxiv.org/abs/2408.10151
Authors: Amey Hengle,Prasoon Bajpai,Soham Dan,Tanmoy Chakraborty
Keywords: handle long multilingual, recent large language, demonstrate remarkable abilities, long multilingual contexts, recent large
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:While recent large language models (LLMs) demonstrate remarkable abilities in responding to queries in diverse languages, their ability to handle long multilingual contexts is unexplored. As such, a systematic evaluation of the long-context capabilities of LLMs in multilingual settings is crucial, specifically in the context of information retrieval. To address this gap, we introduce the MultiLingual Needle-in-a-Haystack (MLNeedle) test, designed to assess a model’s ability to retrieve relevant information (the needle) from a collection of multilingual distractor texts (the haystack). This test serves as an extension of the multilingual question-answering task, encompassing both monolingual and cross-lingual retrieval. We evaluate four state-of-the-art LLMs on MLNeedle. Our findings reveal that model performance can vary significantly with language and needle position. Specifically, we observe that model performance is the lowest when the needle is (i) in a language outside the English language family and (ii) located in the middle of the input context. Furthermore, although some models claim a context size of 8k tokens or greater, none demonstrate satisfactory cross-lingual retrieval performance as the context length increases. Our analysis provides key insights into the long-context behavior of LLMs in multilingual settings to guide future evaluation protocols. To our knowledge, this is the first study to investigate the multilingual long-context behavior of LLMs.

[LG-6] In-Context Learning with Representations: Contextual Generalization of Trained Transformers

Link: https://arxiv.org/abs/2408.10147
Authors: Tong Yang,Yu Huang,Yingbin Liang,Yuejie Chi
Keywords: pretrained large language, large language models, remarkable capability, capability of pretrained, pretrained large
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Abstract:In-context learning (ICL) refers to a remarkable capability of pretrained large language models, which can learn a new task given a few examples during inference. However, theoretical understanding of ICL is largely under-explored, particularly whether transformers can be trained to generalize to unseen examples in a prompt, which will require the model to acquire contextual knowledge of the prompt for generalization. This paper investigates the training dynamics of transformers by gradient descent through the lens of non-linear regression tasks. The contextual generalization here can be attained via learning the template function for each task in-context, where all template functions lie in a linear space with m basis functions. We analyze the training dynamics of one-layer multi-head transformers to in-contextly predict unlabeled inputs given partially labeled prompts, where the labels contain Gaussian noise and the number of examples in each prompt are not sufficient to determine the template. Under mild assumptions, we show that the training loss for a one-layer multi-head transformer converges linearly to a global minimum. Moreover, the transformer effectively learns to perform ridge regression over the basis functions. To our knowledge, this study is the first provable demonstration that transformers can learn contextual (i.e., template) information to generalize to both unseen examples and tasks when prompts contain only a small number of query-answer pairs.

[LG-7] Advancing Voice Cloning for Nepali: Leveraging Transfer Learning in a Low-Resource Language

Link: https://arxiv.org/abs/2408.10128
Authors: Manjil Karki,Pratik Shakya,Sandesh Acharya,Ravi Pandit,Dinesh Gothe
Keywords: personalized speech interfaces, prominent feature, feature in personalized, Voice cloning, speaker
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: 7 pages, 10 figures

Abstract:Voice cloning is a prominent feature in personalized speech interfaces. A neural vocal cloning system can mimic someone’s voice using just a few audio samples. Both speaker encoding and speaker adaptation are topics of research in the field of voice cloning. Speaker adaptation relies on fine-tuning a multi-speaker generative model, which involves training a separate model to infer a new speaker embedding used for speaker encoding. Both methods can achieve excellent performance, even with a small number of cloning audios, in terms of the speech’s naturalness and similarity to the original speaker. Speaker encoding approaches are more appropriate for low-resource deployment since they require significantly less memory and have a faster cloning time than speaker adaption, which can offer slightly greater naturalness and similarity. The main goal is to create a vocal cloning system that produces audio output with a Nepali accent or that sounds like Nepali. For the further advancement of TTS, the idea of transfer learning was effectively used to address several issues that were encountered in the development of this system, including the poor audio quality and the lack of available data.

[LG-8] Learning Brave Assumption-Based Argumentation Frameworks via ASP

Link: https://arxiv.org/abs/2408.10126
Authors: Emanuele De Angelis(1),Maurizio Proietti(1),Francesca Toni(2) ((1) CNR-IASI, Rome, Italy, (2) Imperial, London, UK)
Keywords: Assumption-based Argumentation, including logic programming, including logic, unifying formalism, forms of non-monotonic
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Comments: Extended version of the paper accepted at the 27th European Conference on Artificial Intelligence (ECAI 2024); Paper ID: M1488 (this https URL)

Abstract:Assumption-based Argumentation (ABA) is advocated as a unifying formalism for various forms of non-monotonic reasoning, including logic programming. It allows capturing defeasible knowledge, subject to argumentative debate. While, in much existing work, ABA frameworks are given up-front, in this paper we focus on the problem of automating their learning from background knowledge and positive/negative examples. Unlike prior work, we newly frame the problem in terms of brave reasoning under stable extensions for ABA. We present a novel algorithm based on transformation rules (such as Rote Learning, Folding, Assumption Introduction and Fact Subsumption) and an implementation thereof that makes use of Answer Set Programming. Finally, we compare our technique to state-of-the-art ILP systems that learn defeasible knowledge.

[LG-9] Molecular Graph Representation Learning Integrating Large Language Models with Domain-specific Small Models

Link: https://arxiv.org/abs/2408.10124
Authors: Tianyu Zhang,Yuxiang Ren,Chengbin Hou,Hairong Lv,Xuegong Zhang
Keywords: drug discovery, crucial foundation, foundation for drug, Domain-specific Small Models, Large Language Models
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Chemical Physics (physics.chem-ph); Biomolecules (q-bio.BM)

Abstract:Molecular property prediction is a crucial foundation for drug discovery. In recent years, pre-trained deep learning models have been widely applied to this task. Some approaches that incorporate prior biological domain knowledge into the pre-training framework have achieved impressive results. However, these methods heavily rely on biochemical experts, and retrieving and summarizing vast amounts of domain knowledge literature is both time-consuming and expensive. Large Language Models (LLMs) have demonstrated remarkable performance in understanding and efficiently providing general knowledge. Nevertheless, they occasionally exhibit hallucinations and lack precision in generating domain-specific knowledge. Conversely, Domain-specific Small Models (DSMs) possess rich domain knowledge and can accurately calculate molecular domain-related metrics. However, due to their limited model size and singular functionality, they lack the breadth of knowledge necessary for comprehensive representation learning. To leverage the advantages of both approaches in molecular property prediction, we propose a novel Molecular Graph representation learning framework that integrates Large language models and Domain-specific small models (MolGraph-LarDo). Technically, we design a two-stage prompt strategy where DSMs are introduced to calibrate the knowledge provided by LLMs, enhancing the accuracy of domain-specific information and thus enabling LLMs to generate more precise textual descriptions for molecular samples. Subsequently, we employ a multi-modal alignment method to coordinate various modalities, including molecular graphs and their corresponding descriptive texts, to guide the pre-training of molecular representations. Extensive experiments demonstrate the effectiveness of the proposed method.

[LG-10] PLUTUS: A Well Pre-trained Large Unified Transformer can Unveil Financial Time Series Regularities

Link: https://arxiv.org/abs/2408.10111
Authors: Yuanjian Xu,Anxian Liu,Jianing Hao,Zhenzhuo Li,Shichang Meng,Guang Zhang
Keywords: high noise levels, predicting market behaviors, noise levels, Financial time series
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract: Financial time series modeling is crucial for understanding and predicting market behaviors but faces challenges such as non-linearity, non-stationarity, and high noise levels. Traditional models struggle to capture complex patterns due to these issues, compounded by limitations in computational resources and model capacity. Inspired by the success of large language models in NLP, we introduce PLUTUS, a Pre-trained Large Unified Transformer-based model that Unveils regularities in financial time Series. PLUTUS uses an invertible embedding module with contrastive learning and autoencoder techniques to create an approximate one-to-one mapping between raw data and patch embeddings. TimeFormer, an attention-based architecture, forms the core of PLUTUS, effectively modeling high-noise time series. We incorporate novel attention mechanisms to capture features across both variable and temporal dimensions. PLUTUS is pre-trained on an unprecedented dataset of 100 billion observations, designed to thrive in noisy financial environments. To our knowledge, PLUTUS is the first open-source, large-scale, pre-trained financial time series model with over one billion parameters. It achieves state-of-the-art performance in various tasks, demonstrating strong transferability and establishing a robust foundational model for finance. Our research provides technical guidance for pre-training financial time series data, setting a new standard in the field.

[LG-11] Perturb-and-Compare Approach for Detecting Out-of-Distribution Samples in Constrained Access Environments

Link: https://arxiv.org/abs/2408.10107
Authors: Heeyoung Lee,Hoyoon Byun,Changdae Oh,JinYeong Bak,Kyungwoo Song
Keywords: Accessing machine learning, Accessing machine, machine learning models, OOD detection, machine learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: Accepted to European Conference on Artificial Intelligence (ECAI) 2024

Abstract:Accessing machine learning models through remote APIs has been gaining prevalence following the recent trend of scaling up model parameters for increased performance. Even though these models exhibit remarkable ability, detecting out-of-distribution (OOD) samples remains a crucial safety concern for end users as these samples may induce unreliable outputs from the model. In this work, we propose an OOD detection framework, MixDiff, that is applicable even when the model’s parameters or its activations are not accessible to the end user. To bypass the access restriction, MixDiff applies an identical input-level perturbation to a given target sample and a similar in-distribution (ID) sample, then compares the relative difference in the model outputs of these two samples. MixDiff is model-agnostic and compatible with existing output-based OOD detection methods. We provide theoretical analysis to illustrate MixDiff’s effectiveness in discerning OOD samples that induce overconfident outputs from the model and empirically demonstrate that MixDiff consistently enhances the OOD detection performance on various datasets in vision and text domains.

[LG-12] Federated Frank-Wolfe Algorithm

Link: https://arxiv.org/abs/2408.10090
Authors: Ali Dadras,Sourasekhar Banerjee,Karthik Prakhya,Alp Yurtsever
Keywords: building privacy-preserving collaborative, collaborative learning systems, privacy-preserving collaborative learning, gained a lot, lot of attention
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases

Abstract: Federated learning (FL) has gained a lot of attention in recent years for building privacy-preserving collaborative learning systems. However, FL algorithms for constrained machine learning problems are still limited, particularly when the projection step is costly. To this end, we propose a Federated Frank-Wolfe Algorithm (FedFW). FedFW features data privacy, low per-iteration cost, and communication of sparse signals. In the deterministic setting, FedFW achieves an $\varepsilon$-suboptimal solution within $O(\varepsilon^{-2})$ iterations for smooth and convex objectives, and $O(\varepsilon^{-3})$ iterations for smooth but non-convex objectives. Furthermore, we present a stochastic variant of FedFW and show that it finds a solution within $O(\varepsilon^{-3})$ iterations in the convex setting. We demonstrate the empirical performance of FedFW on several machine learning tasks.
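As background for the projection-free idea in this abstract, here is a minimal sketch of the classic single-machine Frank-Wolfe method on the probability simplex. It is not FedFW. The quadratic objective, the `target` vector, and the standard step size 2/(t+2) are assumptions chosen for illustration.

```python
def frank_wolfe_simplex(gradient, dim, iterations=2000):
    """Minimize a smooth convex function over the probability simplex."""
    x = [1.0 / dim] * dim  # start at the simplex barycenter
    for t in range(iterations):
        g = gradient(x)
        # Linear minimization oracle: over the simplex, the minimizer of <g, s>
        # is the vertex at the smallest gradient coordinate, so no projection is needed.
        best = min(range(dim), key=lambda j: g[j])
        step = 2.0 / (t + 2.0)
        # Convex combination of the current iterate and the chosen vertex.
        x = [(1.0 - step) * xj for xj in x]
        x[best] += step
    return x


# Hypothetical objective f(x) = 0.5 * ||x - target||^2 with target inside the simplex.
target = [0.2, 0.5, 0.3]


def quadratic_gradient(x):
    return [xj - tj for xj, tj in zip(x, target)]


solution = frank_wolfe_simplex(quadratic_gradient, dim=3)
print([round(v, 2) for v in solution])
```

Each update adds a single vertex, which is why Frank-Wolfe iterates are sparse combinations and why the method suits constraint sets where projection is expensive.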

[LG-13] MASALA: Model-Agnostic Surrogate Explanations by Locality Adaptation

Link: https://arxiv.org/abs/2408.10085
Authors: Saif Anwar,Nathan Griffiths,Abhir Bhalerao,Thomas Popham
Keywords: Existing local Explainable, local Explainable, interpretable surrogate model, model, impactful model behaviour
Categories: Machine Learning (cs.LG)

Abstract:Existing local Explainable AI (XAI) methods, such as LIME, select a region of the input space in the vicinity of a given input instance, for which they approximate the behaviour of a model using a simpler and more interpretable surrogate model. The size of this region is often controlled by a user-defined locality hyperparameter. In this paper, we demonstrate the difficulties associated with defining a suitable locality size to capture impactful model behaviour, as well as the inadequacy of using a single locality size to explain all predictions. We propose a novel method, MASALA, for generating explanations, which automatically determines the appropriate local region of impactful model behaviour for each individual instance being explained. MASALA approximates the local behaviour used by a complex model to make a prediction by fitting a linear surrogate model to a set of points which experience similar model behaviour. These points are found by clustering the input space into regions of linear behavioural trends exhibited by the model. We compare the fidelity and consistency of explanations generated by our method with existing local XAI methods, namely LIME and CHILLI. Experiments on the PHM08 and MIDAS datasets show that our method produces more faithful and consistent explanations than existing methods, without the need to define any sensitive locality hyperparameters.

[LG-14] TANGO: Clustering with Typicality-Aware Nonlocal Mode-Seeking and Graph-Cut Optimization

Link: https://arxiv.org/abs/2408.10084
Authors: Haowen Ma,Zhiguo Long,Hua Meng
Keywords: mine structural information, Density-based clustering methods, Density-based clustering, structural information, higher neighbors
Categories: Machine Learning (cs.LG)

Abstract: Density-based clustering methods by mode-seeking usually achieve clustering by using local density estimation to mine structural information, such as local dependencies from lower density points to higher neighbors. However, they often rely too heavily on local structures and neglect global characteristics, which can lead to significant errors in peak selection and dependency establishment. Although introducing more hyperparameters that revise dependencies can help mitigate this issue, tuning them is challenging and even impossible on real-world datasets. In this paper, we propose a new algorithm (TANGO) to establish local dependencies by exploiting a global-view typicality of points, which is obtained by mining further the density distributions and initial dependencies. TANGO then obtains sub-clusters with the help of the adjusted dependencies, and characterizes the similarity between sub-clusters by incorporating path-based connectivity. It achieves final clustering by employing graph-cut on sub-clusters, thus avoiding the challenging selection of cluster centers. Moreover, this paper provides theoretical analysis and an efficient method for the calculation of typicality. Experimental results on several synthetic and 16 real-world datasets demonstrate the effectiveness and superiority of TANGO.

[LG-15] Personalizing Reinforcement Learning from Human Feedback with Variational Preference Learning

Link: https://arxiv.org/abs/2408.10075
Authors: Sriyash Poddar,Yanming Wan,Hamish Ivison,Abhishek Gupta,Natasha Jaques
Keywords: Human Feedback, Reinforcement Learning, powerful paradigm, paradigm for aligning, RLHF
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO)
Comments: this http URL

Abstract:Reinforcement Learning from Human Feedback (RLHF) is a powerful paradigm for aligning foundation models to human values and preferences. However, current RLHF techniques cannot account for the naturally occurring differences in individual human preferences across a diverse population. When these differences arise, traditional RLHF frameworks simply average over them, leading to inaccurate rewards and poor performance for individual subgroups. To address the need for pluralistic alignment, we develop a class of multimodal RLHF methods. Our proposed techniques are based on a latent variable formulation - inferring a novel user-specific latent and learning reward models and policies conditioned on this latent without additional user-specific data. While conceptually simple, we show that in practice, this reward modeling requires careful algorithmic considerations around model architecture and reward scaling. To empirically validate our proposed technique, we first show that it can provide a way to combat underspecification in simulated control problems, inferring and optimizing user-specific reward functions. Next, we conduct experiments on pluralistic language datasets representing diverse user preferences and demonstrate improved reward function accuracy. We additionally show the benefits of this probabilistic framework in terms of measuring uncertainty, and actively learning user preferences. This work enables learning from diverse populations of users with divergent preferences, an important challenge that naturally occurs in problems from robot learning to foundation model alignment.

[LG-16] Facial Wrinkle Segmentation for Cosmetic Dermatology: Pretraining with Texture Map-Based Weak Supervision

Link: https://arxiv.org/abs/2408.10060
Authors: Junho Moon,Haejun Chung,Ikbeom Jang
Keywords: cosmetic dermatology, plays a crucial, crucial role, role in cosmetic, Facial wrinkle
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract: Facial wrinkle detection plays a crucial role in cosmetic dermatology. Precise manual segmentation of facial wrinkles is challenging and time-consuming, with inherent subjectivity leading to inconsistent results among graders. To address this issue, we propose two solutions. First, we build and release the first public facial wrinkle dataset, 'FFHQ-Wrinkle', an extension of the NVIDIA FFHQ dataset. It includes 1,000 images with human labels and 50,000 images with automatically generated weak labels, and it can help the research community develop advanced wrinkle detection algorithms. Second, we introduce a training strategy for U-Net-like encoder-decoder models to detect wrinkles across the face automatically. Our method employs a two-stage training strategy: texture map pretraining and finetuning on human-labeled data. Initially, we pretrain models on a large dataset with weak labels (N=50k) or masked texture maps generated through computer vision techniques, without human intervention. Subsequently, we finetune the models using human-labeled data (N=1k), which consists of manually labeled wrinkle masks. During finetuning, the network takes as input a four-channel combination of RGB and masked texture maps. We effectively combine labels from multiple annotators to minimize subjectivity in manual labeling. Our strategies demonstrate improved segmentation performance in facial wrinkle segmentation both quantitatively and visually compared to existing pretraining methods.

[LG-17] Efficient Exploration in Deep Reinforcement Learning: A Novel Bayesian Actor-Critic Algorithm

Link: https://arxiv.org/abs/2408.10055
Authors: Nikolai Rozanov
Keywords: Deep Reinforcement Learning, Reinforcement learning, Deep Reinforcement, potential to disrupt, Reinforcement
Categories: Machine Learning (cs.LG)
Comments: 74 pages, MRes Thesis in Computer Science, UCL

Abstract:Reinforcement learning (RL) and Deep Reinforcement Learning (DRL), in particular, have the potential to disrupt and are already changing the way we interact with the world. One of the key indicators of their applicability is their ability to scale and work in real-world scenarios, that is in large-scale problems. This scale can be achieved via a combination of factors, the algorithm’s ability to make use of large amounts of data and computational resources and the efficient exploration of the environment for viable solutions (i.e. policies). In this work, we investigate and motivate some theoretical foundations for deep reinforcement learning. We start with exact dynamic programming and work our way up to stochastic approximations and stochastic approximations for a model-free scenario, which forms the theoretical basis of modern reinforcement learning. We present an overview of this highly varied and rapidly changing field from the perspective of Approximate Dynamic Programming. We then focus our study on the short-comings with respect to exploration of the cornerstone approaches (i.e. DQN, DDQN, A2C) in deep reinforcement learning. On the theory side, our main contribution is the proposal of a novel Bayesian actor-critic algorithm. On the empirical side, we evaluate Bayesian exploration as well as actor-critic algorithms on standard benchmarks as well as state-of-the-art evaluation suites and show the benefits of both of these approaches over current state-of-the-art deep RL methods. We release all the implementations and provide a full python library that is easy to install and hopefully will serve the reinforcement learning community in a meaningful way, and provide a strong foundation for future work. 

[LG-18] Exploiting Fine-Grained Prototype Distribution for Boosting Unsupervised Class Incremental Learning

Link: https://arxiv.org/abs/2408.10046
Authors: Jiaming Liu,Hongyuan Liu,Zhili Qin,Wei Han,Yulu Fan,Qinli Yang,Junming Shao
Keywords: class incremental learning, dynamic nature, nature of open-world, open-world scenarios, scenarios has attracted
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The dynamic nature of open-world scenarios has attracted more attention to class incremental learning (CIL). However, existing CIL methods typically presume the availability of complete ground-truth labels throughout the training process, an assumption rarely met in practical applications. Consequently, this paper explores a more challenging problem of unsupervised class incremental learning (UCIL). The essence of addressing this problem lies in effectively capturing comprehensive feature representations and discovering unknown novel classes. To achieve this, we first model the knowledge of class distribution by exploiting fine-grained prototypes. Subsequently, a granularity alignment technique is introduced to enhance the unsupervised class discovery. Additionally, we proposed a strategy to minimize overlap between novel and existing classes, thereby preserving historical knowledge and mitigating the phenomenon of catastrophic forgetting. Extensive experiments on the five datasets demonstrate that our approach significantly outperforms current state-of-the-art methods, indicating the effectiveness of the proposed method.

[LG-19] PinnDE: Physics-Informed Neural Networks for Solving Differential Equations

Link: https://arxiv.org/abs/2408.10011
Authors: Jason Matthews,Alex Bihlo
Keywords: solving differential equations, grown substantially, recent years, years the study, solving differential
Categories: Machine Learning (cs.LG)
Comments:

Abstract:In recent years the study of deep learning for solving differential equations has grown substantially. The use of physics-informed neural networks (PINNs) and deep operator networks (DeepONets) have emerged as two of the most useful approaches in approximating differential equation solutions using machine learning. Here, we propose PinnDE, an open-source python library for solving differential equations with both PINNs and DeepONets. We give a brief review of both PINNs and DeepONets, introduce PinnDE along with the structure and usage of the package, and present worked examples to show PinnDE’s effectiveness in approximating solutions with both PINNs and DeepONets.

[LG-20] Unlocking the Power of LSTM for Long Term Time Series Forecasting

Link: https://arxiv.org/abs/2408.10006
Authors: Yaxuan Kong,Zepu Wang,Yuqi Nie,Tian Zhou,Stefan Zohren,Yuxuan Liang,Peng Sun,Qingsong Wen
Keywords: Traditional recurrent neural, neural network architectures, recurrent neural network, time series forecasting, memory neural networks
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Traditional recurrent neural network architectures, such as long short-term memory neural networks (LSTM), have historically held a prominent role in time series forecasting (TSF) tasks. While the recently introduced sLSTM for Natural Language Processing (NLP) introduces exponential gating and memory mixing that are beneficial for long term sequential learning, its potential short memory issue is a barrier to applying sLSTM directly in TSF. To address this, we propose a simple yet efficient algorithm named P-sLSTM, which is built upon sLSTM by incorporating patching and channel independence. These modifications substantially enhance sLSTM’s performance in TSF, achieving state-of-the-art results. Furthermore, we provide theoretical justifications for our design, and conduct extensive comparative and analytical experiments to fully validate the efficiency and superior performance of our model.
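
The two preprocessing ideas named in this abstract, patching and channel independence, can be illustrated with a minimal sketch. The function names, the stride parameter, and the plain-list representation are assumptions made for illustration only; the abstract does not describe the authors' implementation, and the sLSTM itself is not shown.

```python
from typing import List


def make_patches(series: List[float], patch_len: int, stride: int) -> List[List[float]]:
    """Split one univariate series into fixed-length, possibly overlapping patches."""
    if patch_len <= 0 or stride <= 0:
        raise ValueError("patch_len and stride must be positive")
    last_start = len(series) - patch_len
    # Any trailing values that do not fill a whole patch are dropped.
    return [series[i:i + patch_len] for i in range(0, last_start + 1, stride)]


def channel_independent_patches(
    channels: List[List[float]], patch_len: int, stride: int
) -> List[List[List[float]]]:
    """Patch every channel separately, so each one is treated as its own univariate series."""
    return [make_patches(channel, patch_len, stride) for channel in channels]
```

Each patch, rather than each time step, would then be fed to the recurrent model as one token, which shortens the sequence the memory has to span.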

[LG-21] The Fairness-Quality Trade-off in Clustering

Link: https://arxiv.org/abs/2408.10002
Authors: Rashida Hakim,Ana-Andreea Stoica,Christos H. Papadimitriou,Mihalis Yannakakis
Keywords: significantly increase fairness, Pareto front, considered extensively, significantly increase, Fairness
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY)
Comments:

Abstract:Fairness in clustering has been considered extensively in the past; however, the trade-off between the two objectives – e.g., can we sacrifice just a little in the quality of the clustering to significantly increase fairness, or vice-versa? – has rarely been addressed. We introduce novel algorithms for tracing the complete trade-off curve, or Pareto front, between quality and fairness in clustering problems; that is, computing all clusterings that are not dominated in both objectives by other clusterings. Unlike previous work that deals with specific objectives for quality and fairness, we deal with all objectives for fairness and quality in two general classes encompassing most of the special cases addressed in previous work. Our algorithm must take exponential time in the worst case as the Pareto front itself can be exponential. Even when the Pareto front is polynomial, our algorithm may take exponential time, and we prove that this is inevitable unless P = NP. However, we also present a new polynomial-time algorithm for computing the entire Pareto front when the cluster centers are fixed, and for perhaps the most natural fairness objective: minimizing the sum, over all clusters, of the imbalance between the two groups in each cluster.
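
To make the notion of a Pareto front concrete, here is a minimal brute-force sketch that filters a given list of candidate clusterings, each summarized by a hypothetical pair (clustering cost, unfairness) where both values are to be minimized. This is not the algorithm from the paper, which traces the front without enumerating candidates; it only illustrates what "not dominated" means.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (clustering cost, unfairness), both minimized


def dominates(q: Point, p: Point) -> bool:
    """q dominates p if q is no worse in both objectives and strictly better in at least one."""
    return q[0] <= p[0] and q[1] <= p[1] and (q[0] < p[0] or q[1] < p[1])


def pareto_front(points: List[Point]) -> List[Point]:
    """Return the non-dominated points, deduplicated and sorted by the first objective."""
    front = {
        p
        for i, p in enumerate(points)
        if not any(dominates(q, p) for j, q in enumerate(points) if j != i)
    }
    return sorted(front)
```

The pairwise check costs quadratic time in the number of candidates, which is acceptable for a small illustration but says nothing about the complexity results stated in the abstract.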

[LG-22] Uniting contrastive and generative learning for event sequences models

Link: https://arxiv.org/abs/2408.09995
Authors: Aleksandr Yugay,Alexey Zaytsev
Keywords: including risk management, modern banking applications, personalized customer offers, High-quality representation, banking applications
Categories: Machine Learning (cs.LG)
Comments:

Abstract:High-quality representation of transactional sequences is vital for modern banking applications, including risk management, churn prediction, and personalized customer offers. Different tasks require distinct representation properties: local tasks benefit from capturing the client’s current state, while global tasks rely on general behavioral patterns. Previous research has demonstrated that various self-supervised approaches yield representations that better capture either global or local qualities. This study investigates the integration of two self-supervised learning techniques - instance-wise contrastive learning and a generative approach based on restoring masked events in latent space. The combined approach creates representations that balance local and global transactional data characteristics. Experiments conducted on several public datasets, focusing on sequence classification and next-event type prediction, show that the integrated method achieves superior performance compared to individual approaches and demonstrates synergistic effects. These findings suggest that the proposed approach offers a robust framework for advancing event sequences representation learning in the financial sector.

[LG-23] Preference-Optimized Pareto Set Learning for Blackbox Optimization

Link: https://arxiv.org/abs/2408.09976
Authors: Zhang Haishan,Diptesh Das,Koji Tsuda
Keywords: Pareto set, Pareto, Pareto front, MOO, set
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:

Abstract:Multi-Objective Optimization (MOO) is an important problem in real-world applications. However, for a non-trivial problem, no single solution exists that can optimize all the objectives simultaneously. In a typical MOO problem, the goal is to find a set of optimum solutions (Pareto set) that trades off the preferences among objectives. Scalarization in MOO is a well-established method for finding a finite set approximation of the whole Pareto set (PS). However, in real-world experimental design scenarios, it’s beneficial to obtain the whole PS for flexible exploration of the design space. Recently Pareto set learning (PSL) has been introduced to approximate the whole PS. PSL involves creating a manifold representing the Pareto front of a multi-objective optimization problem. A naive approach includes finding discrete points on the Pareto front through randomly generated preference vectors and connecting them by regression. However, this approach is computationally expensive and leads to a poor PS approximation. We propose to optimize the preference points to be distributed evenly on the Pareto front. Our formulation leads to a bilevel optimization problem that can be solved by e.g. differentiable cross-entropy methods. We demonstrated the efficacy of our method for complex and difficult black-box MOO problems using both synthetic and real-world benchmark data.

[LG-24] The Exploration-Exploitation Dilemma Revisited: An Entropy Perspective

Link: https://arxiv.org/abs/2408.09974
Authors: Renye Yan,Yaozhong Gan,You Wu,Ling Liang,Junliang Xing,Yimao Cai,Ru Huang
Keywords: significant challenge, challenge in reinforcement, exploration reduces learning, reinforcement learning, reduces learning efficiency
Categories: Machine Learning (cs.LG)
Comments:

Abstract:The imbalance of exploration and exploitation has long been a significant challenge in reinforcement learning. In policy optimization, excessive reliance on exploration reduces learning efficiency, while over-dependence on exploitation might trap agents in local optima. This paper revisits the exploration-exploitation dilemma from the perspective of entropy by revealing the relationship between entropy and the dynamic adaptive process of exploration and exploitation. Based on this theoretical insight, we establish an end-to-end adaptive framework called AdaZero, which automatically determines whether to explore or to exploit as well as their balance of strength. Experiments show that AdaZero significantly outperforms baseline models across various Atari and MuJoCo environments with only a single setting. Especially in the challenging environment of Montezuma, AdaZero boosts the final returns by up to fifteen times. Moreover, we conduct a series of visualization analyses to reveal the dynamics of our self-adaptive mechanism, demonstrating how entropy reflects and changes with respect to the agent’s performance and adaptive process.

[LG-25] Unsupervised Machine Learning Hybrid Approach Integrating Linear Programming in Loss Function: A Robust Optimization Technique

Link: https://arxiv.org/abs/2408.09967
Authors: Andrew Kiruluta,Andreas Lemos
Keywords: machine learning model, integrates linear programming, machine learning, paper presents, unsupervised machine learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Comments:

Abstract:This paper presents a novel hybrid approach that integrates linear programming (LP) within the loss function of an unsupervised machine learning model. By leveraging the strengths of both optimization techniques and machine learning, this method introduces a robust framework for solving complex optimization problems where traditional methods may fall short. The proposed approach encapsulates the constraints and objectives of a linear programming problem directly into the loss function, guiding the learning process to adhere to these constraints while optimizing the desired outcomes. This technique not only preserves the interpretability of linear programming but also benefits from the flexibility and adaptability of machine learning, making it particularly well-suited for unsupervised or semi-supervised learning scenarios.
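
One common way to fold a linear program into a loss is a quadratic penalty on constraint violations. The sketch below assumes the LP has the form "minimize c·x subject to A x <= b and x >= 0" and assumes a penalty weight `lam`; both the penalty form and all names are illustrative guesses, since the abstract does not state how the authors encode the constraints.

```python
from typing import List


def lp_penalty_loss(
    x: List[float],
    c: List[float],
    A: List[List[float]],
    b: List[float],
    lam: float = 10.0,
) -> float:
    """LP objective plus a squared penalty for violated constraints (assumed form)."""
    objective = sum(ci * xi for ci, xi in zip(c, x))
    violation = 0.0
    # Penalize each row of A x <= b by the squared amount it is exceeded.
    for row, bi in zip(A, b):
        excess = sum(a * xi for a, xi in zip(row, x)) - bi
        violation += max(0.0, excess) ** 2
    # Penalize negative entries to encode x >= 0.
    violation += sum(max(0.0, -xi) ** 2 for xi in x)
    return objective + lam * violation
```

A feasible point pays only the objective value, while an infeasible one is pushed back toward the feasible region by the gradient of the penalty term.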

[LG-26] Mask in the Mirror: Implicit Sparsification

Link: https://arxiv.org/abs/2408.09966
Authors: Tom Jacobs,Rebekka Burkholz
Keywords: Sparsifying deep neural, Sparsifying deep, reduce their inference, inference cost, NP-hard problem
Categories: Machine Learning (cs.LG)
Comments: 20 pages, 5 figures

Abstract:Sparsifying deep neural networks to reduce their inference cost is an NP-hard problem and difficult to optimize due to its mixed discrete and continuous nature. Yet, as we prove, continuous sparsification has already an implicit bias towards sparsity that would not require common projections of relaxed mask variables. While implicit rather than explicit regularization induces benefits, it usually does not provide enough flexibility in practice, as only a specific target sparsity is obtainable. To exploit its potential for continuous sparsification, we propose a way to control the strength of the implicit bias. Based on the mirror flow framework, we derive resulting convergence and optimality guarantees in the context of underdetermined linear regression and demonstrate the utility of our insights in more general neural network sparsification experiments, achieving significant performance gains, particularly in the high-sparsity regime. Our theoretical contribution might be of independent interest, as we highlight a way to enter the rich regime and show that implicit bias is controllable by a time-dependent Bregman potential.

[LG-27] AdaResNet: Enhancing Residual Networks with Dynamic Weight Adjustment for Improved Feature Integration

Link: https://arxiv.org/abs/2408.09958
Authors: Hong Su
Keywords: deep neural networks, Residual Network, making it challenging, early layers, deep neural
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:In very deep neural networks, gradients can become extremely small during backpropagation, making it challenging to train the early layers. ResNet (Residual Network) addresses this issue by enabling gradients to flow directly through the network via skip connections, facilitating the training of much deeper networks. However, in these skip connections, the input ipd is directly added to the transformed data tfd, treating ipd and tfd equally, without adapting to different scenarios. In this paper, we propose AdaResNet (Auto-Adapting Residual Network), which automatically adjusts the ratio between ipd and tfd based on the training data. We introduce a variable, $weight_{tfd}^{ipd}$, to represent this ratio. This variable is dynamically adjusted during backpropagation, allowing it to adapt to the training data rather than remaining fixed. Experimental results demonstrate that AdaResNet achieves a maximum accuracy improvement of over 50% compared to traditional ResNet.
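
The idea of a trainable ratio on the skip connection can be sketched with a scalar toy example. The sketch assumes the weight multiplies the input `ipd` before it is added to `tfd`, and it uses a mean squared error with plain gradient descent; the placement of the weight, the loss, and all function names are assumptions for illustration, not the paper's architecture.

```python
from typing import List


def ada_residual(ipd: List[float], tfd: List[float], weight: float) -> List[float]:
    """Weighted skip connection: weight * ipd + tfd (weight = 1.0 gives a standard residual add)."""
    return [weight * i + t for i, t in zip(ipd, tfd)]


def grad_weight(ipd: List[float], tfd: List[float], weight: float, target: List[float]) -> float:
    """Gradient of the mean squared error with respect to the skip weight."""
    out = ada_residual(ipd, tfd, weight)
    return sum(2.0 * (o - y) * i for o, y, i in zip(out, target, ipd)) / len(ipd)


def fit_weight(
    ipd: List[float],
    tfd: List[float],
    target: List[float],
    weight: float = 1.0,
    lr: float = 0.1,
    steps: int = 200,
) -> float:
    """Adjust the skip weight by gradient descent, as backpropagation would for this one parameter."""
    for _ in range(steps):
        weight -= lr * grad_weight(ipd, tfd, weight, target)
    return weight
```

In this toy setting, when the target is generated with a skip weight of 0.3, the fitted weight moves from its initial value of 1.0 toward 0.3, which is the kind of data-driven adjustment the abstract describes.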

[LG-28] Weakly Supervised Pretraining and Multi-Annotator Supervised Finetuning for Facial Wrinkle Detection

Link: https://arxiv.org/abs/2408.09952
Authors: Ik Jun Moon,Junho Moon,Ikbeom Jang
Keywords: Abstract, facial, facial wrinkles, Research question, skin
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:1. Research question: With the growing interest in skin diseases and skin aesthetics, the ability to predict facial wrinkles is becoming increasingly important. This study aims to evaluate whether a computational model, convolutional neural networks (CNN), can be trained for automated facial wrinkle segmentation. 2. Findings: Our study presents an effective technique for integrating data from multiple annotators and illustrates that transfer learning can enhance performance, resulting in dependable segmentation of facial wrinkles. 3. Meaning: This approach automates intricate and time-consuming tasks of wrinkle analysis with a deep learning framework. It could be used to facilitate skin treatments and diagnostics.

[LG-29] Data Augmentation of Contrastive Learning is Estimating Positive-incentive Noise

Link: https://arxiv.org/abs/2408.09929
Authors: Hongyuan Zhang,Yanchen Xu,Sida Huang,Xuelong Li
Keywords: idea of Positive-incentive, Positive-incentive Noise, reliable noise beneficial, contrastive learning, Noise
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Inspired by the idea of Positive-incentive Noise (Pi-Noise or $\pi$-Noise) that aims at learning the reliable noise beneficial to tasks, we scientifically investigate the connection between contrastive learning and $\pi$-noise in this paper. By converting the contrastive loss to an auxiliary Gaussian distribution to quantitatively measure the difficulty of the specific contrastive model under the information theory framework, we properly define the task entropy, the core concept of $\pi$-noise, of contrastive learning. It is further proved that the predefined data augmentation in the standard contrastive learning paradigm can be regarded as a kind of point estimation of $\pi$-noise. Inspired by the theoretical study, a framework that develops a $\pi$-noise generator to learn the beneficial noise (instead of estimation) as data augmentations for contrast is proposed. The designed framework can be applied to diverse types of data and is also completely compatible with the existing contrastive models. From the visualization, we surprisingly find that the proposed method successfully learns effective augmentations.

[LG-30] Expressive Power of Temporal Message Passing

Link: https://arxiv.org/abs/2408.09918
Authors: Przemysław Andrzej Wałęga,Michael Rawson
Keywords: Graph neural networks, employing temporal versions, neural networks, recently been adapted, temporal
Categories: Machine Learning (cs.LG)
Comments: 18 pages

Abstract:Graph neural networks (GNNs) have recently been adapted to temporal settings, often employing temporal versions of the message-passing mechanism known from GNNs. We divide temporal message passing mechanisms from literature into two main types: global and local, and establish Weisfeiler-Leman characterisations for both. This allows us to formally analyse expressive power of temporal message-passing models. We show that global and local temporal message-passing mechanisms have incomparable expressive power when applied to arbitrary temporal graphs. However, the local mechanism is strictly more expressive than the global mechanism when applied to colour-persistent temporal graphs, whose node colours are initially the same in all time points. Our theoretical findings are supported by experimental evidence, underlining practical implications of our analysis.

[LG-31] Active Learning for Identifying Disaster-Related Tweets: A Comparison with Keyword Filtering and Generic Fine-Tuning

Link: https://arxiv.org/abs/2408.09914
Authors: David Hanny,Sebastian Schmidt,Bernd Resch
Keywords: provide essential information, essential information, emergency response, response during natural, natural disasters
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Submitted for the Intelligent Systems Conference (IntelliSys 2024). The version of record of this contribution is published in the Springer series Lecture Notes in Networks and Systems, and is available online at this https URL. This preprint has not undergone peer review or any post-submission improvements or corrections. 13 pages, 2 figures

Abstract:Information from social media can provide essential information for emergency response during natural disasters in near real-time. However, it is difficult to identify the disaster-related posts among the large amounts of unstructured data available. Previous methods often use keyword filtering, topic modelling or classification-based techniques to identify such posts. Active Learning (AL) presents a promising sub-field of Machine Learning (ML) that has not been used much in the field of text classification of social media content. This study therefore investigates the potential of AL for identifying disaster-related Tweets. We compare a keyword filtering approach, a RoBERTa model fine-tuned with generic data from CrisisLex, a base RoBERTa model trained with AL and a fine-tuned RoBERTa model trained with AL regarding classification performance. For testing, data from CrisisLex and manually labelled data from the 2021 flood in Germany and the 2023 Chile forest fires were considered. The results show that generic fine-tuning combined with 10 rounds of AL outperformed all other approaches. Consequently, a broadly applicable model for the identification of disaster-related Tweets could be trained with very little labelling effort. The model can be applied to use cases beyond this study and provides a useful tool for further research in social media analysis.

[LG-32] pSVM: Soft-margin SVMs with p-norm Hinge Loss

Link: https://arxiv.org/abs/2408.09908
Authors: Haoxiang Sun
Keywords: Support Vector Machines, Support Vector, Vector Machines, hinge loss, extensively discussed
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Support Vector Machines (SVMs) based on hinge loss have been extensively discussed and applied to various binary classification tasks. These SVMs achieve a balance between margin maximization and the minimization of slack due to outliers. Although many efforts have been dedicated to enhancing the performance of SVMs with hinge loss, studies on $p$SVMs, soft-margin SVMs with $p$-norm hinge loss, remain relatively scarce. In this paper, we explore the properties, performance, and training algorithms of $p$SVMs. We first derive the generalization bound of $p$SVMs, then formulate the dual optimization problem, comparing it with the traditional approach. Furthermore, we discuss a generalized version of the Sequential Minimal Optimization (SMO) algorithm, $p$SMO, to train our $p$SVM model. Comparative experiments on various datasets, including binary and multi-class classification tasks, demonstrate the effectiveness and advantages of our $p$SVM model and the $p$SMO method.
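
A soft-margin objective with a p-norm hinge loss can be sketched as follows. The sketch assumes the primal form 0.5·||w||² + C·Σ max(0, 1 − y(w·x + b))^p, which is a natural reading of "p-norm hinge loss" but is not confirmed by the abstract; the function name and parameters are illustrative. With p = 1 it reduces to the standard hinge-loss SVM objective.

```python
from typing import List


def p_hinge_objective(
    w: List[float],
    b: float,
    X: List[List[float]],
    y: List[int],
    C: float = 1.0,
    p: float = 2.0,
) -> float:
    """Assumed primal objective: margin regularizer plus C times the sum of slacks raised to p."""
    regularizer = 0.5 * sum(wi * wi for wi in w)
    slack_sum = 0.0
    for xi, yi in zip(X, y):
        margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
        # Slack is zero for points beyond the margin and grows with the violation.
        slack_sum += max(0.0, 1.0 - margin) ** p
    return regularizer + C * slack_sum
```

Larger values of p penalize big margin violations more heavily than small ones, which changes how strongly outliers influence the solution.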

[LG-33] Instruction-Based Molecular Graph Generation with Unified Text-Graph Diffusion Model

Link: https://arxiv.org/abs/2408.09896
Authors: Yuran Xiang,Haiteng Zhao,Chang Ma,Zhi-Hong Deng
Keywords: Recent advancements, synthesizing molecules based, advancements in computational, computational chemistry, chemistry have increasingly
Categories: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph); Biomolecules (q-bio.BM)
Comments:

Abstract:Recent advancements in computational chemistry have increasingly focused on synthesizing molecules based on textual instructions. Integrating graph generation with these instructions is complex, leading most current methods to use molecular sequences with pre-trained large language models. In response to this challenge, we propose a novel framework, named UTGDiff (Unified Text-Graph Diffusion Model), which utilizes language models for discrete graph diffusion to generate molecular graphs from instructions. UTGDiff features a unified text-graph transformer as the denoising network, derived from pre-trained language models and minimally modified to process graph data through attention bias. Our experimental results demonstrate that UTGDiff consistently outperforms sequence-based baselines in tasks involving instruction-based molecule generation and editing, achieving superior performance with fewer parameters given an equivalent level of pretraining corpus. Our code is available at this https URL.

[LG-34] Performance Law of Large Language Models

Link: https://arxiv.org/abs/2408.09895
Authors: Chuhan Wu,Ruiming Tang
Keywords: large language models, achieved impressive performance, large language, achieved impressive, scaling law
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Personal opinions of the authors

Abstract:Guided by the belief of the scaling law, large language models (LLMs) have achieved impressive performance in recent years. However, scaling law only gives a qualitative estimation of loss, which is influenced by various factors such as model architectures, data distributions, tokenizers, and computation precision. Thus, estimating the real performance of LLMs with different training settings rather than loss may be quite useful in practical development. In this article, we present an empirical equation named “Performance Law” to directly predict the MMLU score of an LLM, which is a widely used metric to indicate the general capability of LLMs in real-world conversations and applications. Based on only a few key hyperparameters of the LLM architecture and the size of training data, we obtain a quite accurate MMLU prediction of various LLMs with diverse sizes and architectures developed by different organizations in different years. Performance law can be used to guide the choice of LLM architecture and the effective allocation of computational resources without extensive experiments.

[LG-35] Differential Private Stochastic Optimization with Heavy-tailed Data: Towards Optimal Rates

Link: https://arxiv.org/abs/2408.09891
Authors: Puning Zhao,Jiafei Wu,Zhe Liu,Chong Wang,Rongfei Fan,Qingming Li
Keywords: differential privacy, problems under differential, sqrt, epsilon, heavy-tailed gradients
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Data Structures and Algorithms (cs.DS)
Comments:

Abstract:We study convex optimization problems under differential privacy (DP). With heavy-tailed gradients, existing works achieve suboptimal rates. The main obstacle is that existing gradient estimators have suboptimal tail properties, resulting in a superfluous factor of $d$ in the union bound. In this paper, we explore algorithms achieving optimal rates of DP optimization with heavy-tailed gradients. Our first method is a simple clipping approach. Under bounded $p$-th order moments of gradients, with $n$ samples, it achieves $\tilde{O}(\sqrt{d/n}+\sqrt{d}(\sqrt{d}/n\epsilon)^{1-1/p})$ population risk with $\epsilon\leq 1/\sqrt{d}$. We then propose an iterative updating method, which is more complex but achieves this rate for all $\epsilon\leq 1$. The results significantly improve over existing methods. Such improvement relies on a careful treatment of the tail behavior of gradient estimators. Our results match the minimax lower bound in \cite{kamath2022improved}, indicating that the theoretical limit of stochastic convex optimization under DP is achievable.
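
The "simple clipping approach" mentioned in this abstract belongs to a familiar family: bound each per-sample gradient's norm, average, and add noise. The sketch below shows only that generic pattern. The clipping radius, the Gaussian noise, and the parameter `noise_std` are assumptions; the sketch does not calibrate the noise to any privacy budget and is not the estimator analyzed in the paper.

```python
import math
import random
from typing import List, Optional


def clip(gradient: List[float], radius: float) -> List[float]:
    """Rescale a gradient so that its Euclidean norm is at most `radius`."""
    norm = math.sqrt(sum(v * v for v in gradient))
    if norm <= radius:
        return list(gradient)
    scale = radius / norm
    return [v * scale for v in gradient]


def noisy_clipped_mean(
    gradients: List[List[float]],
    radius: float,
    noise_std: float,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Average the clipped per-sample gradients and add Gaussian noise to each coordinate.

    `noise_std` must be chosen by the caller; this function gives no privacy guarantee by itself.
    """
    rng = rng or random.Random(0)
    n = len(gradients)
    dim = len(gradients[0])
    clipped = [clip(g, radius) for g in gradients]
    mean = [sum(g[k] for g in clipped) / n for k in range(dim)]
    return [m + rng.gauss(0.0, noise_std) for m in mean]
```

Clipping limits how much any single heavy-tailed sample can move the average, at the price of bias when many gradients exceed the radius; that trade-off is what the choice of radius controls.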

[LG-36] GINO-Q: Learning an Asymptotically Optimal Index Policy for Restless Multi-armed Bandits

Link: https://arxiv.org/abs/2408.09882
Authors: Gongpu Chen,Soung Chang Liew,Deniz Gunduz
Keywords: restless multi-armed bandit, multi-armed bandit, variety of fields, restless multi-armed, popular model
Categories: Machine Learning (cs.LG)
Comments: 9 pages, 11 figures

Abstract:The restless multi-armed bandit (RMAB) framework is a popular model with applications across a wide variety of fields. However, its solution is hindered by the exponentially growing state space (with respect to the number of arms) and the combinatorial action space, making traditional reinforcement learning methods infeasible for large-scale instances. In this paper, we propose GINO-Q, a three-timescale stochastic approximation algorithm designed to learn an asymptotically optimal index policy for RMABs. GINO-Q mitigates the curse of dimensionality by decomposing the RMAB into a series of subproblems, each with the same dimension as a single arm, ensuring that complexity increases linearly with the number of arms. Unlike recently developed Whittle-index-based algorithms, GINO-Q does not require RMABs to be indexable, enhancing its flexibility and applicability. Our experimental results demonstrate that GINO-Q consistently learns near-optimal policies, even for non-indexable RMABs where Whittle-index-based algorithms perform poorly, and it converges significantly faster than existing baselines.

[LG-37] New spectral imaging biomarkers for sepsis and mortality in intensive care

Link: https://arxiv.org/abs/2408.09873
Authors: Silvia Seidlitz,Katharina Hölzl,Ayca von Garrel,Jan Sellner,Stephan Katzenschlager,Tobias Hölle,Dania Fischer,Maik von der Forst,Felix C.F. Schmitt,Markus A. Weigand,Lena Maier-Hein,Maximilian Dietrich
Keywords: high socioeconomic importance, high risk, high socioeconomic, early identification, socioeconomic importance
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Markus A. Weigand, Lena Maier-Hein and Maximilian Dietrich contributed equally

Abstract:With sepsis remaining a leading cause of mortality, early identification of septic patients and those at high risk of death is a challenge of high socioeconomic importance. The driving hypothesis of this study was that hyperspectral imaging (HSI) could provide novel biomarkers for sepsis diagnosis and treatment management due to its potential to monitor microcirculatory alterations. We conducted a comprehensive study involving HSI data of the palm and fingers from more than 480 patients on the day of their intensive care unit (ICU) admission. The findings demonstrate that HSI measurements can predict sepsis with an area under the receiver operating characteristic curve (AUROC) of 0.80 (95 % confidence interval (CI) [0.76; 0.84]) and mortality with an AUROC of 0.72 (95 % CI [0.65; 0.79]). The predictive performance improves substantially when additional clinical data is incorporated, leading to an AUROC of up to 0.94 (95 % CI [0.92; 0.96]) for sepsis and 0.84 (95 % CI [0.78; 0.89]) for mortality. We conclude that HSI presents novel imaging biomarkers for the rapid, non-invasive prediction of sepsis and mortality, suggesting its potential as an important modality for guiding diagnosis and treatment.

[LG-38] MAPLE: Enhancing Review Generation with Multi-Aspect Prompt LEarning in Explainable Recommendation

Link: https://arxiv.org/abs/2408.09865
Authors: Ching-Wen Yang,Che Wei Chen,Kun-da Wu,Hao Xu,Jui-Feng Yao,Hung-Yu Kao
Keywords: Explainable Recommendation task, Explainable Recommendation, Recommendation task, task is designed, designed to receive
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 8 main pages, 10 pages for appendix. Under review

Abstract:The explainable recommendation task takes a user-item pair as input and outputs explanations that justify why the item is recommended to the user. Many models treat review generation as a proxy for explainable recommendation. Although they are able to generate fluent and grammatical sentences, they suffer from generality and hallucination issues. We propose a personalized, aspect-controlled model called Multi-Aspect Prompt LEarner (MAPLE), which integrates aspect category as another input dimension to facilitate the memorization of fine-grained aspect terms. Experiments on two real-world review datasets in the restaurant domain show that MAPLE outperforms the baseline review-generation models in terms of text and feature diversity while maintaining excellent coherence and factual relevance. We further treat MAPLE as a retriever component in the retriever-reader framework and employ a Large Language Model (LLM) as the reader, showing that MAPLE’s explanations, combined with the LLM’s comprehension ability, lead to enriched and personalized explanations. We will release the code and data at this http URL upon acceptance.

[LG-39] 3D-Aware Instance Segmentation and Tracking in Egocentric Videos

Link: https://arxiv.org/abs/2408.09860
Authors: Yash Bhalgat,Vadim Tschernezki,Iro Laina,João F. Henriques,Andrea Vedaldi,Andrew Zisserman
Keywords: rapid camera motion, present unique challenges, videos present unique, scene understanding due, frequent object occlusions
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Egocentric videos present unique challenges for 3D scene understanding due to rapid camera motion, frequent object occlusions, and limited object visibility. This paper introduces a novel approach to instance segmentation and tracking in first-person video that leverages 3D awareness to overcome these obstacles. Our method integrates scene geometry, 3D object centroid tracking, and instance segmentation to create a robust framework for analyzing dynamic egocentric scenes. By incorporating spatial and temporal cues, we achieve superior performance compared to state-of-the-art 2D approaches. Extensive evaluations on the challenging EPIC Fields dataset demonstrate significant improvements across a range of tracking and segmentation consistency metrics. Specifically, our method outperforms the next best performing approach by 7 points in Association Accuracy (AssA) and 4.5 points in IDF1 score, while reducing the number of ID switches by 73% to 80% across various object categories. Leveraging our tracked instance segmentations, we showcase downstream applications in 3D object reconstruction and amodal video object segmentation in these egocentric settings.

[LG-40] ShortCircuit: AlphaZero-Driven Circuit Design

Link: https://arxiv.org/abs/2408.09858
Authors: Dimitrios Tsaras,Antoine Grosnit,Lei Chen,Zhiyao Xie,Haitham Bou-Ammar,Mingxuan Yuan
Keywords: Chip design relies, generating Boolean circuits, AND-Inverter Graphs, design relies heavily, generating Boolean
Categories: Machine Learning (cs.LG); Hardware Architecture (cs.AR)

Abstract:Chip design relies heavily on generating Boolean circuits, such as AND-Inverter Graphs (AIGs), from functional descriptions like truth tables. While recent advances in deep learning have aimed to accelerate circuit design, these efforts have mostly focused on tasks other than synthesis, and traditional heuristic methods have plateaued. In this paper, we introduce ShortCircuit, a novel transformer-based architecture that leverages the structural properties of AIGs and performs efficient space exploration. Contrary to prior approaches attempting end-to-end generation of logic circuits using deep networks, ShortCircuit employs a two-phase process combining supervised with reinforcement learning to enhance generalization to unseen truth tables. We also propose an AlphaZero variant to handle the doubly exponentially large state space and the sparsity of the rewards, enabling the discovery of near-optimal designs. To evaluate the generative performance of our trained model, we extract 500 truth tables from a benchmark set of 20 real-world circuits. ShortCircuit successfully generates AIGs for 84.6% of the 8-input test truth tables, and outperforms the state-of-the-art logic synthesis tool, ABC, by 14.61% in terms of circuit size.

[LG-41] Machine Learning with Physics Knowledge for Prediction: A Survey

Link: https://arxiv.org/abs/2408.09840
Authors: Joe Watson,Chen Song,Oliver Weeger,Theo Gruner,An T. Le,Kay Hansel,Ahmed Hendawy,Oleg Arenz,Will Trojak,Miles Cranmer,Carlo D’Eramo,Fabian Bülow,Tanmay Goyal,Jan Peters,Martin W. Hoffman
Keywords: partial differential equations, prediction and forecast, differential equations, examines the broad, broad suite
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA); Computational Physics (physics.comp-ph)
Comments: 56 pages, 8 figures, 2 tables

Abstract:This survey examines the broad suite of methods and models for combining machine learning with physics knowledge for prediction and forecast, with a focus on partial differential equations. These methods have attracted significant interest due to their potential impact on advancing scientific research and industrial practices by improving predictive models with small- or large-scale datasets and expressive predictive models with useful inductive biases. The survey has two parts. The first considers incorporating physics knowledge on an architectural level through objective functions, structured predictive models, and data augmentation. The second considers data as physics knowledge, which motivates looking at multi-task, meta, and contextual learning as an alternative approach to incorporating physics knowledge in a data-driven fashion. Finally, we also provide an industrial perspective on the application of these methods and a survey of the open-source ecosystem for physics-informed machine learning.

[LG-42] Mitigating the Stability-Plasticity Dilemma in Adaptive Train Scheduling with Curriculum-Driven Continual DQN Expansion

Link: https://arxiv.org/abs/2408.09838
Authors: Achref Jaziri,Etienne Künzel,Visvanathan Ramesh
Keywords: previously acquired knowledge, preserving previously acquired, develop increasingly complex, increasingly complex behaviors, acquired knowledge
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: 9 Pages, 2 Figures

Abstract:A continual learning agent builds on previous experiences to develop increasingly complex behaviors by adapting to non-stationary and dynamic environments while preserving previously acquired knowledge. However, scaling these systems presents significant challenges, particularly in balancing the preservation of previous policies with the adaptation of new ones to current environments. This balance, known as the stability-plasticity dilemma, is especially pronounced in complex multi-agent domains such as the train scheduling problem, where environmental and agent behaviors are constantly changing, and the search space is vast. In this work, we propose addressing these challenges in the train scheduling problem using curriculum learning. We design a curriculum with adjacent skills that build on each other to improve generalization performance. Introducing a curriculum with distinct tasks introduces non-stationarity, which we address by proposing a new algorithm: Continual Deep Q-Network (DQN) Expansion (CDE). Our approach dynamically generates and adjusts Q-function subspaces to handle environmental changes and task requirements. CDE mitigates catastrophic forgetting through EWC while ensuring high plasticity using adaptive rational activation functions. Experimental results demonstrate significant improvements in learning efficiency and adaptability compared to RL baselines and other adapted methods for continual learning, highlighting the potential of our method in managing the stability-plasticity dilemma in the adaptive train scheduling setting.
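
The abstract names EWC (elastic weight consolidation) as the component that guards against forgetting. As a rough sketch of the standard EWC regulariser only, assuming a diagonal Fisher-information estimate, the penalty adds (lam / 2) * sum_i F_i * (theta_i - theta_star_i)^2 to the loss of the new task. How CDE combines this with Q-function subspace expansion and adaptive rational activations is not shown here; the names `ewc_penalty`, `total_loss`, `lam`, and the toy values are hypothetical.

```python
def ewc_penalty(theta, theta_star, fisher, lam=1.0):
    """Quadratic EWC penalty with a diagonal Fisher estimate.

    theta      -- current parameters (flat list of floats)
    theta_star -- parameters learned on the previous task
    fisher     -- per-parameter importance weights (diagonal Fisher information)
    lam        -- strength of the pull back towards theta_star (assumed hyperparameter)
    """
    if not (len(theta) == len(theta_star) == len(fisher)):
        raise ValueError("theta, theta_star and fisher must have the same length")
    return 0.5 * lam * sum(
        f * (t - ts) ** 2 for t, ts, f in zip(theta, theta_star, fisher)
    )


def total_loss(new_task_loss, theta, theta_star, fisher, lam=1.0):
    """Loss on the new task plus the EWC penalty that protects old-task weights."""
    return new_task_loss + ewc_penalty(theta, theta_star, fisher, lam)


# Toy values: moving the high-importance second weight costs more than
# moving the low-importance first weight by the same amount.
theta_star = [0.0, 2.0]
fisher = [2.0, 5.0]
print(ewc_penalty([1.0, 2.0], theta_star, fisher))  # first weight moved by 1
print(ewc_penalty([0.0, 3.0], theta_star, fisher))  # second weight moved by 1
```

The Fisher weights are what make the penalty selective: parameters that mattered little for earlier tasks stay free to change.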

[LG-43] Symplectic Neural Networks Based on Dynamical Systems

Link: https://arxiv.org/abs/2408.09821
Authors: Benjamin K Tapley
Keywords: Hamiltonian differential equations, symplectic neural networks, designing symplectic neural, neural networks, based on geometric
Categories: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Numerical Analysis (math.NA); Computational Physics (physics.comp-ph)
Comments: 33 pages including appendices but not references, 7 figures

Abstract:We present and analyze a framework for designing symplectic neural networks (SympNets) based on geometric integrators for Hamiltonian differential equations. The SympNets are universal approximators in the space of Hamiltonian diffeomorphisms, interpretable and have a non-vanishing gradient property. We also give a representation theory for linear systems, meaning the proposed P-SympNets can exactly parameterize any symplectic map corresponding to quadratic Hamiltonians. Extensive numerical tests demonstrate increased expressiveness and accuracy – often several orders of magnitude better – for lower training cost over existing architectures. Lastly, we show how to perform symbolic Hamiltonian regression with SympNets for polynomial systems using backward error analysis.
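
To make the notion of a geometric integrator concrete, the sketch below (an illustration under stated assumptions, not the SympNet architecture) applies one Störmer-Verlet (leapfrog) step to a pendulum with H(q, p) = p^2 / 2 - cos(q). It then checks numerically that the step map has Jacobian determinant 1, which for one degree of freedom is equivalent to being symplectic. An explicit Euler step is included for contrast. All function names and the step size are illustrative choices.

```python
import math


def leapfrog_step(q, p, h, grad_V):
    """One Stormer-Verlet step for H(q, p) = p**2 / 2 + V(q)."""
    p_half = p - 0.5 * h * grad_V(q)
    q_new = q + h * p_half
    p_new = p_half - 0.5 * h * grad_V(q_new)
    return q_new, p_new


def euler_step(q, p, h, grad_V):
    """One explicit Euler step (not symplectic), shown for comparison."""
    return q + h * p, p - h * grad_V(q)


def jacobian_det(step, q, p, eps=1e-6):
    """Central finite-difference determinant of the Jacobian of a map on (q, p)."""
    qa, pa = step(q + eps, p)
    qb, pb = step(q - eps, p)
    qc, pc = step(q, p + eps)
    qd, pd = step(q, p - eps)
    dq_dq, dp_dq = (qa - qb) / (2 * eps), (pa - pb) / (2 * eps)
    dq_dp, dp_dp = (qc - qd) / (2 * eps), (pc - pd) / (2 * eps)
    return dq_dq * dp_dp - dq_dp * dp_dq


h = 0.1  # assumed step size
pendulum_leapfrog = lambda q, p: leapfrog_step(q, p, h, math.sin)
pendulum_euler = lambda q, p: euler_step(q, p, h, math.sin)

print(jacobian_det(pendulum_leapfrog, 1.0, 0.5))  # close to 1: area preserving
print(jacobian_det(pendulum_euler, 1.0, 0.5))     # about 1 + h**2 * cos(1): area grows
```

Leapfrog is a composition of shear maps, each with unit determinant, which is the same building-block idea that symplectic network layers rely on.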

[LG-44] Liquid Fourier Latent Dynamics Networks for fast GPU-based numerical simulations in computational cardiology

Link: https://arxiv.org/abs/2408.09818
Authors: Matteo Salvador,Alison L. Marsden
Keywords: Scientific Machine Learning, Machine Learning, Scientific Machine, Ordinary Differential Equations, Partial Differential Equations
Categories: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Neural and Evolutionary Computing (cs.NE)

Abstract:Scientific Machine Learning (ML) is gaining momentum as a cost-effective alternative to physics-based numerical solvers in many engineering applications. In fact, scientific ML is currently being used to build accurate and efficient surrogate models starting from high-fidelity numerical simulations, effectively encoding the parameterized temporal dynamics underlying Ordinary Differential Equations (ODEs), or even the spatio-temporal behavior underlying Partial Differential Equations (PDEs), in appropriately designed neural networks. We propose an extension of Latent Dynamics Networks (LDNets), namely Liquid Fourier LDNets (LFLDNets), to create parameterized space-time surrogate models for multiscale and multiphysics sets of highly nonlinear differential equations on complex geometries. LFLDNets employ a neurologically-inspired, sparse, liquid neural network for temporal dynamics, relaxing the requirement of a numerical solver for time advancement and leading to superior performance in terms of tunable parameters, accuracy, efficiency and learned trajectories with respect to neural ODEs based on feedforward fully-connected neural networks. Furthermore, in our implementation of LFLDNets, we use a Fourier embedding with a tunable kernel in the reconstruction network to learn high-frequency functions better and faster than using space coordinates directly as input. We challenge LFLDNets in the framework of computational cardiology and evaluate their capabilities on two 3-dimensional test cases arising from multiscale cardiac electrophysiology and cardiovascular hemodynamics. This paper illustrates the capability to run Artificial Intelligence-based numerical simulations on single or multiple GPUs in a matter of minutes and represents a significant step forward in the development of physics-informed digital twins.

[LG-45] A Population-to-individual Tuning Framework for Adapting Pretrained LM to On-device User Intent Prediction KDD2024

Link: https://arxiv.org/abs/2408.09815
Authors: Jiahui Gong,Jingtao Ding,Fanjin Meng,Guilong Chen,Hong Chen,Shen Zhao,Haisheng Lu,Yong Li
Keywords: support rich functions, Mobile devices, daily life, support rich, rich functions
Categories: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
Comments: accepted by KDD 2024

Abstract:Mobile devices, especially smartphones, can support rich functions and have developed into indispensable tools in daily life. With the rise of generative AI services, smartphones can potentially transform into personalized assistants, anticipating user needs and scheduling services accordingly. Predicting user intents on smartphones, and reflecting anticipated activities based on past interactions and context, remains a pivotal step towards this vision. Existing research predominantly focuses on specific domains, neglecting the challenge of modeling diverse event sequences across dynamic contexts. Leveraging pre-trained language models (PLMs) offers a promising avenue, yet adapting PLMs to on-device user intent prediction presents significant challenges. To address these challenges, we propose PITuning, a Population-to-Individual Tuning framework. PITuning enhances common pattern extraction through dynamic event-to-intent transition modeling and addresses long-tailed preferences via adaptive unlearning strategies. Experimental results on real-world datasets demonstrate PITuning’s superior intent prediction performance, highlighting its ability to capture long-tailed preferences and its practicality for on-device prediction scenarios.

[LG-46] Enhance Modality Robustness in Text-Centric Multimodal Alignment with Adversarial Prompting

Link: https://arxiv.org/abs/2408.09798
Authors: Yun-Da Tsai,Ting-Yu Yen,Keng-Te Liao,Shou-De Lin
Keywords: large language models, data is limited, prompts for large, large language, pairwise data
Categories: Machine Learning (cs.LG)

Abstract:Converting different modalities into generalized text, which then serves as input prompts for large language models (LLMs), is a common approach for aligning multimodal models, particularly when pairwise data is limited. Text-centric alignment methods leverage the unique properties of text as a modality space, transforming diverse inputs into a unified textual representation, thereby enabling downstream models to effectively interpret various modal inputs. This study evaluates the quality and robustness of multimodal representations in the face of noise imperfections, dynamic input order permutations, and missing modalities, revealing that current text-centric alignment methods can compromise downstream robustness. To address this issue, we propose a new text-centric adversarial training approach that significantly enhances robustness compared to traditional robust training methods and pre-trained multimodal foundation models. Our findings underscore the potential of this approach to improve the robustness and adaptability of multimodal representations, offering a promising solution for dynamic and real-world applications.

[LG-47] Unsupervised Composable Representations for Audio

Link: https://arxiv.org/abs/2408.09792
Authors: Giovanni Bindi,Philippe Esling
Keywords: generate complex structures, generate high-quality artefacts, ability to generate, generate complex, simpler elements
Categories: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: ISMIR 2024

Abstract:Current generative models are able to generate high-quality artefacts but have been shown to struggle with compositional reasoning, which can be defined as the ability to generate complex structures from simpler elements. In this paper, we focus on the problem of compositional representation learning for music data, specifically targeting the fully-unsupervised setting. We propose a simple and extensible framework that leverages an explicit compositional inductive bias, defined by a flexible auto-encoding objective that can leverage any of the current state-of-art generative models. We demonstrate that our framework, used with diffusion models, naturally addresses the task of unsupervised audio source separation, showing that our model is able to perform high-quality separation. Our findings reveal that our proposal achieves comparable or superior performance with respect to other blind source separation methods and, furthermore, it even surpasses current state-of-art supervised baselines on signal-to-interference ratio metrics. Additionally, by learning an a-posteriori masking diffusion model in the space of composable representations, we achieve a system capable of seamlessly performing unsupervised source separation, unconditional generation, and variation generation. Finally, as our proposal works in the latent space of pre-trained neural audio codecs, it also provides a lower computational cost with respect to other neural baselines.

[LG-48] Structure-enhanced Contrastive Learning for Graph Clustering

Link: https://arxiv.org/abs/2408.09790
Authors: Xunlian Wu,Jingqi Hu,Anqi Zhang,Yining Quan,Qiguang Miao,Peng Gang Sun
Keywords: stronger intra-group connections, contrastive learning, widespread applications, focusing on partitioning, Graph clustering
Categories: Machine Learning (cs.LG)

Abstract:Graph clustering is a crucial task in network analysis with widespread applications, focusing on partitioning nodes into distinct groups with stronger intra-group connections than inter-group ones. Recently, contrastive learning has achieved significant progress in graph clustering. However, most methods suffer from the following issues: 1) an over-reliance on meticulously designed data augmentation strategies, which can undermine the potential of contrastive learning; 2) overlooking cluster-oriented structural information, particularly the higher-order cluster (community) structure information, which could unveil the mesoscopic cluster structure of the network. In this study, Structure-enhanced Contrastive Learning (SECL) is introduced to address these issues by leveraging inherent network structures. SECL utilizes a cross-view contrastive learning mechanism to enhance node embeddings without elaborate data augmentations, a structural contrastive learning module for ensuring structural consistency, and a modularity maximization strategy for harnessing clustering-oriented information. This comprehensive approach results in robust node representations that greatly enhance clustering performance. Extensive experiments on six datasets confirm SECL’s superiority over current state-of-the-art methods, indicating a substantial improvement in the domain of graph clustering.
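
The modularity objective mentioned in the abstract can be illustrated with the standard Newman modularity Q of a hard partition, computed per community as (intra-community edges / m) - (total community degree / 2m)^2. This is a minimal sketch for an undirected, unweighted edge list; it is an assumption that SECL optimises a differentiable relaxation of this quantity, and the function name and toy graph are hypothetical.

```python
from collections import defaultdict


def modularity(edges, community):
    """Newman modularity of a partition of an undirected, unweighted graph.

    edges     -- list of (u, v) pairs, each undirected edge listed once
    community -- dict mapping every node to its community label
    """
    m = len(edges)
    if m == 0:
        raise ValueError("graph has no edges")
    intra = defaultdict(int)   # number of edges inside each community
    degree = defaultdict(int)  # sum of node degrees per community
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            intra[cu] += 1
    return sum(intra[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Toy graph: two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {node: "all" for node in range(6)}

print(modularity(edges, two_groups))  # 5/14, about 0.357
print(modularity(edges, one_group))   # 0.0: a single community is never rewarded
```

Splitting along the bridge scores higher than lumping everything together, which is the signal a modularity-maximisation term feeds back into the embeddings.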

[LG-49] Faster Adaptive Decentralized Learning Algorithms ICML2024

Link: https://arxiv.org/abs/2408.09775
Authors: Feihu Huang,Jianyu Zhao
Keywords: received increasing attention, data privacy, machine learning due, system robustness, Decentralized learning recently
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: ICML 2024 (Spotlight)

Abstract:Decentralized learning has recently received increasing attention in machine learning due to its advantages in implementation simplicity, system robustness, and data privacy. Meanwhile, adaptive gradient methods show superior performance in many machine learning tasks such as training neural networks. Although some works study decentralized optimization algorithms with adaptive learning rates, these adaptive decentralized algorithms still suffer from high sample complexity. To fill this gap, we propose a class of faster adaptive decentralized algorithms (i.e., AdaMDOS and AdaMDOF) for distributed nonconvex stochastic and finite-sum optimization, respectively. Moreover, we provide a solid convergence analysis framework for our methods. In particular, we prove that our AdaMDOS obtains a near-optimal sample complexity of $\tilde{O}(\epsilon^{-3})$ for finding an $\epsilon$-stationary solution of nonconvex stochastic optimization. Meanwhile, our AdaMDOF obtains a near-optimal sample complexity of $O(\sqrt{n}\epsilon^{-2})$ for finding an $\epsilon$-stationary solution of nonconvex finite-sum optimization, where $n$ denotes the sample size. To the best of our knowledge, our AdaMDOF algorithm is the first adaptive decentralized algorithm for nonconvex finite-sum optimization. Experimental results demonstrate the efficiency of our algorithms.

[LG-50] Baby Bear: Seeking a Just Right Rating Scale for Scalar Annotations

Link: https://arxiv.org/abs/2408.09765
Authors: Xu Han,Felix Yu,Joao Sedoc,Benjamin Van Durme
Keywords: efficiently assigning scalar, assigning scalar ratings, set of elements, mechanism for efficiently, efficiently assigning
Categories: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)

Abstract:Our goal is a mechanism for efficiently assigning scalar ratings to each of a large set of elements. For example, “what percent positive or negative is this product review?” When sample sizes are small, prior work has advocated for methods such as Best Worst Scaling (BWS) as being more robust than direct ordinal annotation (“Likert scales”). Here we first introduce IBWS, which iteratively collects annotations through Best-Worst Scaling, resulting in robustly ranked crowd-sourced data. While effective, IBWS is too expensive for large-scale tasks. Using the results of IBWS as a best-desired outcome, we evaluate various direct assessment methods to determine what is both cost-efficient and best correlating to a large scale BWS annotation strategy. Finally, we illustrate in the domains of dialogue and sentiment how these annotations can support robust learning-to-rank models.
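
As background for Best-Worst Scaling, a common way to turn BWS judgments into scalar ratings is the counting score: for each item, (times chosen best - times chosen worst) / times shown. The sketch below implements only that basic scoring rule; the iterative collection procedure of IBWS is not reproduced, and the data layout and function name are assumptions made for illustration.

```python
from collections import Counter


def bws_scores(trials):
    """Counting scores in [-1, 1] from Best-Worst Scaling trials.

    trials -- iterable of (items, best, worst), where `items` is the tuple
              shown to the annotator and `best` / `worst` are their picks.
    """
    shown, best_count, worst_count = Counter(), Counter(), Counter()
    for items, best, worst in trials:
        if best not in items or worst not in items:
            raise ValueError("best and worst must be among the shown items")
        shown.update(items)
        best_count[best] += 1
        worst_count[worst] += 1
    return {
        item: (best_count[item] - worst_count[item]) / shown[item]
        for item in shown
    }


# Two hypothetical annotations of the same 4-tuple of reviews.
trials = [
    (("a", "b", "c", "d"), "a", "d"),
    (("a", "b", "c", "d"), "a", "c"),
]
print(bws_scores(trials))  # {'a': 1.0, 'b': 0.0, 'c': -0.5, 'd': -0.5}
```

Each trial yields information about several pairwise orderings at once, which is one reason BWS is often considered more robust than direct Likert ratings at small sample sizes.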

[LG-51] Sequential Federated Learning in Hierarchical Architecture on Non-IID Datasets

Link: https://arxiv.org/abs/2408.09762
Authors: Xingrun Yan,Shiyuan Zuo,Rongfei Fan,Han Hu,Li Shen,Puning Zhao,Yong Luo
Keywords: real federated learning, Hierarchical federated learning, federated learning, passing model parameters, model parameters
Categories: Machine Learning (cs.LG)

Abstract:In a real federated learning (FL) system, communication overhead for passing model parameters between the clients and the parameter server (PS) is often a bottleneck. Hierarchical federated learning (HFL), which places multiple edge servers (ESs) between clients and the PS, can partially alleviate communication pressure but still needs the aggregation of model parameters from multiple ESs at the PS. To further reduce communication overhead, we bring sequential FL (SFL) into HFL for the first time, which removes the central PS and enables model training to be completed solely by passing the global model between two adjacent ESs in each iteration, and propose a novel algorithm adapted to this combined framework, referred to as Fed-CHS. Convergence results are derived for strongly convex and non-convex loss functions under various data heterogeneity setups, and show convergence performance comparable to algorithms for HFL or SFL alone. Experimental results provide evidence of the superiority of our proposed Fed-CHS in both communication overhead saving and test accuracy over baseline methods.

[LG-52] Strategic Demonstration Selection for Improved Fairness in LLM In-Context Learning

Link: https://arxiv.org/abs/2408.09757
Authors: Jingyu Hu,Weiru Liu,Mengnan Du
Keywords: Recent studies highlight, large language models, steer large language, Recent studies, processing tabular data
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computers and Society (cs.CY)

Abstract:Recent studies highlight the effectiveness of using in-context learning (ICL) to steer large language models (LLMs) in processing tabular data, a challenging task given the structured nature of such data. Despite advancements in performance, the fairness implications of these methods are less understood. This study investigates how varying demonstrations within ICL prompts influence the fairness outcomes of LLMs. Our findings reveal that deliberately including minority group samples in prompts significantly boosts fairness without sacrificing predictive accuracy. Further experiments demonstrate that the proportion of minority to majority samples in demonstrations affects the trade-off between fairness and prediction accuracy. Based on these insights, we introduce a mitigation technique that employs clustering and evolutionary strategies to curate a diverse and representative sample set from the training data. This approach aims to enhance both predictive performance and fairness in ICL applications. Experimental results validate that our proposed method dramatically improves fairness across various metrics, showing its efficacy in real-world scenarios.

[LG-53] Parallel-in-Time Solutions with Random Projection Neural Networks

Link: https://arxiv.org/abs/2408.09756
Authors: Marta M. Betcke,Lisa Maria Kreusser,Davide Murari
Keywords: ordinary differential equations, Projection Neural Networks, coarse propagator, Random Projection Neural, solution of ordinary
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG)

Abstract:This paper considers one of the fundamental parallel-in-time methods for the solution of ordinary differential equations, Parareal, and extends it by adopting a neural network as a coarse propagator. We provide a theoretical analysis of the convergence properties of the proposed algorithm and show its effectiveness for several examples, including Lorenz and Burgers’ equations. In our numerical simulations, we further specialize the underpinning neural architecture to Random Projection Neural Networks (RPNNs), a 2-layer neural network where the first layer weights are drawn at random rather than optimized. This restriction substantially increases the efficiency of fitting RPNN’s weights in comparison to a standard feedforward network without negatively impacting the accuracy, as demonstrated in the SIR system example.
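
The Parareal update that the paper builds on is U[n+1] at iteration k+1 = G(U[n] at k+1) + F(U[n] at k) - G(U[n] at k), where G is a cheap coarse propagator and F an accurate fine one. The sketch below is a plain-Python illustration in which both propagators are explicit Euler with different step counts. That choice is an assumption for illustration only; the paper replaces G with a neural network (an RPNN), which is not reproduced here. All function names and parameters are hypothetical.

```python
def euler(f, y, t0, t1, steps):
    """Explicit Euler from t0 to t1 for the scalar ODE y' = f(t, y)."""
    h = (t1 - t0) / steps
    t = t0
    for _ in range(steps):
        y = y + h * f(t, y)
        t += h
    return y


def time_grid(t0, t1, n_slices):
    return [t0 + (t1 - t0) * n / n_slices for n in range(n_slices + 1)]


def serial_fine(f, y0, t0, t1, n_slices, fine_steps=100):
    """Reference solution: the fine propagator applied slice after slice."""
    ts = time_grid(t0, t1, n_slices)
    U = [y0]
    for n in range(n_slices):
        U.append(euler(f, U[n], ts[n], ts[n + 1], fine_steps))
    return U


def parareal(f, y0, t0, t1, n_slices, n_iters, fine_steps=100):
    ts = time_grid(t0, t1, n_slices)
    G = lambda y, n: euler(f, y, ts[n], ts[n + 1], 1)           # coarse: one step
    F = lambda y, n: euler(f, y, ts[n], ts[n + 1], fine_steps)  # fine: many steps

    # Initial guess from a serial coarse sweep.
    U = [y0]
    for n in range(n_slices):
        U.append(G(U[n], n))

    for _ in range(n_iters):
        # These fine solves are independent and could run in parallel.
        F_old = [F(U[n], n) for n in range(n_slices)]
        G_old = [G(U[n], n) for n in range(n_slices)]
        # Cheap serial correction sweep.
        U_new = [y0]
        for n in range(n_slices):
            U_new.append(G(U_new[n], n) + F_old[n] - G_old[n])
        U = U_new
    return U


decay = lambda t, y: -y  # toy problem y' = -y, y(0) = 1
print(parareal(decay, 1.0, 0.0, 2.0, n_slices=4, n_iters=2))
print(serial_fine(decay, 1.0, 0.0, 2.0, n_slices=4))
```

After k iterations the first k slices agree with the serial fine solution up to rounding, so the method pays off only when it converges in far fewer iterations than there are slices; a more accurate coarse propagator is what makes that possible.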

[LG-54] Icing on the Cake: Automatic Code Summarization at Ericsson

Link: https://arxiv.org/abs/2408.09735
Authors: Giriprasad Sridhara,Sujoy Roychowdhury,Sumit Soman,Ranjani H G,Ricardo Britto
Keywords: Automatic Semantic Augmentation, Large Language Model, global telecommunications company, ASAP method, called Automatic Semantic
Categories: Software Engineering (cs.SE); Machine Learning (cs.LG)
Comments: 16 pages, 6 tables, 4 figures. Accepted at the 2024 International Conference on Software Maintenance and Evolution (ICSME) 2024 - Industry Track

Abstract:This paper presents our findings on the automatic summarization of Java methods within Ericsson, a global telecommunications company. We evaluate the performance of an approach called Automatic Semantic Augmentation of Prompts (ASAP), which uses a Large Language Model (LLM) to generate leading summary comments for Java methods. ASAP enhances the LLM’s prompt context by integrating static program analysis and information retrieval techniques to identify similar exemplar methods along with their developer-written Javadocs, and serves as the baseline in our study. In contrast, we explore and compare the performance of four simpler approaches that do not require static program analysis, information retrieval, or the presence of exemplars as in the ASAP method. Our methods rely solely on the Java method body as input, making them lightweight and more suitable for rapid deployment in commercial software development environments. We conducted experiments on an Ericsson software project and replicated the study using two widely-used open-source Java projects, Guava and Elasticsearch, to ensure the reliability of our results. Performance was measured across eight metrics that capture various aspects of similarity. Notably, one of our simpler approaches performed as well as or better than the ASAP method on both the Ericsson project and the open-source projects. Additionally, we performed an ablation study to examine the impact of method names on Javadoc summary generation across our four proposed approaches and the ASAP method. By masking the method names and observing the generated summaries, we found that our approaches were statistically significantly less influenced by the absence of method names compared to the baseline. This suggests that our methods are more robust to variations in method names and may derive summaries more comprehensively from the method body than the ASAP approach.

[LG-55] sTransformer: A Modular Approach for Extracting Inter-Sequential and Temporal Information for Time-Series Forecasting

Link: https://arxiv.org/abs/2408.09723
Authors: Jiaheng Yin,Zhengxin Shi,Jianshen Zhang,Xiaomin Lin,Yulin Huang,Yongzhi Qi,Wei Qi
Keywords: numerous Transformer-based models, Transformer-based models, numerous Transformer-based, Transformer-based, LTSF
Categories: Machine Learning (cs.LG)

Abstract:In recent years, numerous Transformer-based models have been applied to long-term time-series forecasting (LTSF) tasks. However, recent studies with linear models have questioned their effectiveness, demonstrating that simple linear layers can outperform sophisticated Transformer-based models. In this work, we review and categorize existing Transformer-based models into two main types: (1) modifications to the model structure and (2) modifications to the input data. The former offers scalability but falls short in capturing inter-sequential information, while the latter preprocesses time-series data but is challenging to use as a scalable module. We propose sTransformer, which introduces the Sequence and Temporal Convolutional Network (STCN) to fully capture both sequential and temporal information. Additionally, we introduce a Sequence-guided Mask Attention mechanism to capture global feature information. Our approach ensures the capture of inter-sequential information while maintaining module scalability. We compare our model with linear models and existing forecasting models on long-term time-series forecasting, achieving new state-of-the-art results. We also conducted experiments on other time-series tasks, achieving strong performance. These demonstrate that Transformer-based structures remain effective and our model can serve as a viable baseline for time-series tasks.

[LG-56] Towards Few-Shot Learning in the Open World: A Review and Beyond

Link: https://arxiv.org/abs/2408.09722
Authors: Hui Xue,Yuexuan An,Yongchun Qin,Wenqian Li,Yixin Wu,Yongjuan Che,Pengfei Fang,Minling Zhang
Keywords: apply knowledge, prior knowledge, underpinned by prior, ability to absorb, absorb and apply
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Human intelligence is characterized by our ability to absorb and apply knowledge from the world around us, especially in rapidly acquiring new concepts from minimal examples, underpinned by prior knowledge. Few-shot learning (FSL) aims to mimic this capacity by enabling significant generalizations and transferability. However, traditional FSL frameworks often rely on assumptions of clean, complete, and static data, conditions that are seldom met in real-world environments. Such assumptions falter in the inherently uncertain, incomplete, and dynamic contexts of the open world. This paper presents a comprehensive review of recent advancements designed to adapt FSL for use in open-world settings. We categorize existing methods into three distinct types of open-world few-shot learning: those involving varying instances, varying classes, and varying distributions. Each category is discussed in terms of its specific challenges and methods, as well as its strengths and weaknesses. We standardize experimental settings and metric benchmarks across scenarios, and provide a comparative analysis of the performance of various methods. In conclusion, we outline potential future research directions for this evolving field. It is our hope that this review will catalyze further development of effective solutions to these complex challenges, thereby advancing the field of artificial intelligence.

[LG-57] HYDEN: Hyperbolic Density Representations for Medical Images and Reports

Link: https://arxiv.org/abs/2408.09715
Authors: Zhi Qiao,Linbin Han,Xiantong Zhen,Jia-Hong Gao,Zhen Qian
Keywords: hierarchical modeling advantages, inherent entailment relations, point vector embeddings, visual semantic representation, point vector
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Abstract:In light of the inherent entailment relations between images and text, hyperbolic point vector embeddings, leveraging the hierarchical modeling advantages of hyperbolic space, have been utilized for visual semantic representation learning. However, point vector embedding approaches fail to address the issue of semantic uncertainty, where an image may have multiple interpretations, and text may refer to different images, a phenomenon particularly prevalent in the medical domain. Therefore, we propose HYDEN, a novel hyperbolic density embedding based image-text representation learning approach tailored for specific medical domain data. This method integrates text-aware local features alongside global features from images, mapping image-text features to density features in hyperbolic space using hyperbolic pseudo-Gaussian distributions. An encapsulation loss function is employed to model the partial order relations between image-text density distributions. Experimental results demonstrate the interpretability of our approach and its superior performance compared to the baseline methods across various zero-shot tasks and different datasets.

[LG-58] Community-Centric Graph Unlearning

Link: https://arxiv.org/abs/2408.09705
Authors: Yi Li,Shichao Zhang,Guixian Zhang,Debo Cheng
Keywords: Graph unlearning technology, Graph unlearning, artificial intelligence, Community-centric Graph Eraser, Graph
Categories: Machine Learning (cs.LG); Social and Information Networks (cs.SI)

Abstract:Graph unlearning technology has become increasingly important since the advent of the 'right to be forgotten' and the growing concerns about the privacy and security of artificial intelligence. Graph unlearning aims to quickly eliminate the effects of specific data on graph neural networks (GNNs). However, most existing deterministic graph unlearning frameworks follow a balanced partition-submodel training-aggregation paradigm, resulting in a lack of structural information between subgraph neighborhoods and redundant unlearning parameter calculations. To address this issue, we propose a novel Graph Structure Mapping Unlearning paradigm (GSMU) and a novel method based on it named Community-centric Graph Eraser (CGE). CGE maps community subgraphs to nodes, thereby enabling the reconstruction of a node-level unlearning operation within a reduced mapped graph. CGE exponentially reduces both the amount of training data and the number of unlearning parameters. Extensive experiments conducted on five real-world datasets and three widely used GNN backbones have verified the high performance and efficiency of our CGE method, highlighting its potential in the field of graph unlearning.
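
The mapping step named in this abstract, collapsing each community subgraph into a single node of a smaller graph, can be illustrated with a minimal Python sketch. The function name `map_communities_to_nodes`, the edge-list input, and the `community_of` dictionary are assumptions made for illustration. The abstract does not say how communities are detected or how CGE builds the mapped graph, so this is not the authors' implementation.

```python
def map_communities_to_nodes(edges, community_of):
    """Collapse each community into one super-node (illustrative only).

    edges: iterable of (u, v) pairs of an undirected graph.
    community_of: dict mapping each original node to a community id
        (assumed to be given; community detection is out of scope here).
    Returns the sorted, de-duplicated edges between distinct communities.
    """
    mapped_edges = set()
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        if cu != cv:  # edges inside a community vanish in the mapped graph
            mapped_edges.add((min(cu, cv), max(cu, cv)))
    return sorted(mapped_edges)


# Toy graph: two triangles joined by the single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
community_of = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
mapped = map_communities_to_nodes(edges, community_of)
print(mapped)  # [('A', 'B')]
```

On this toy input, seven edges over six nodes reduce to one edge between two super-nodes, which is the sense in which the mapped graph is smaller than the original.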

[LG-59] LightWeather: Harnessing Absolute Positional Encoding to Efficient and Scalable Global Weather Forecasting

Link: https://arxiv.org/abs/2408.09695
Authors: Yisong Fu,Fei Wang,Zezhi Shao,Chengqing Yu,Yujie Li,Zhao Chen,Zhulin An,Yongjun Xu
Keywords: capture long-term spatial-temporal, weather forecasting, gained traction, capability to capture, capture long-term
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Atmospheric and Oceanic Physics (physics.ao-ph)

Abstract:Recently, Transformers have gained traction in weather forecasting for their capability to capture long-term spatial-temporal correlations. However, their complex architectures result in large parameter counts and extended training times, limiting their practical application and scalability to global-scale forecasting. This paper aims to explore the key factor for accurate weather forecasting and design more efficient solutions. Interestingly, our empirical findings reveal that absolute positional encoding is what really works in Transformer-based weather forecasting models, which can explicitly model the spatial-temporal correlations even without attention mechanisms. We theoretically prove that its effectiveness stems from the integration of geographical coordinates and real-world time features, which are intrinsically related to the dynamics of weather. Based on this, we propose LightWeather, a lightweight and effective model for station-based global weather forecasting. We employ absolute positional encoding and a simple MLP in place of other components of Transformer. With under 30k parameters and less than one hour of training time, LightWeather achieves state-of-the-art performance on global weather datasets compared to other advanced DL methods. The results underscore the superiority of integrating spatial-temporal knowledge over complex architectures, providing novel insights for DL in weather forecasting.

[LG-60] Regularization for Adversarial Robust Learning

Link: https://arxiv.org/abs/2408.09672
Authors: Jie Wang,Rui Gao,Yao Xie
Keywords: artificial neural networks, machine learning models, distributionally robust risk, real-world applications, significant concern
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments: 51 pages, 5 figures

Abstract:Despite the growing prevalence of artificial neural networks in real-world applications, their vulnerability to adversarial attacks remains a significant concern, which motivates us to investigate the robustness of machine learning models. While various heuristics aim to optimize the distributionally robust risk using the $\infty$-Wasserstein metric, such a notion of robustness frequently encounters computational intractability. To tackle the computational challenge, we develop a novel approach to adversarial training that integrates $\phi$-divergence regularization into the distributionally robust risk function. This regularization brings a notable improvement in computation compared with the original formulation. We develop stochastic gradient methods with biased oracles to solve this problem efficiently, achieving the near-optimal sample complexity. Moreover, we establish its regularization effects and demonstrate its asymptotic equivalence to a regularized empirical risk minimization (ERM) framework, by considering various scaling regimes of the regularization parameter $\eta$ and robustness level $\rho$. These regimes yield gradient norm regularization, variance regularization, or a smoothed gradient norm regularization that interpolates between these extremes. We numerically validate our proposed method in supervised learning, reinforcement learning, and contextual learning and showcase its state-of-the-art performance against various adversarial attacks.

[LG-61] Contextual Bandits for Unbounded Context Distributions

Link: https://arxiv.org/abs/2408.09655
Authors: Puning Zhao,Jiafei Wu,Zhe Liu,Huiwen Wu
Keywords: sequential decision making, decision making problems, Nonparametric contextual bandit, important model, model of sequential
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:The nonparametric contextual bandit is an important model of sequential decision making problems. Under the $\alpha$-Tsybakov margin condition, existing research has established a regret bound of $\tilde{O}\left(T^{1-\frac{\alpha+1}{d+2}}\right)$ for bounded supports. However, the optimal regret with unbounded contexts has not been analyzed. The challenge of solving contextual bandit problems with unbounded support is to achieve both exploration-exploitation tradeoff and bias-variance tradeoff simultaneously. In this paper, we solve the nonparametric contextual bandit problem with unbounded contexts. We propose two nearest neighbor methods combined with UCB exploration. The first method uses a fixed $k$. Our analysis shows that this method achieves minimax optimal regret under a weak margin condition and relatively light-tailed context distributions. The second method uses adaptive $k$. By a proper data-driven selection of $k$, this method achieves an expected regret of $\tilde{O}\left(T^{1-\frac{(\alpha+1)\beta}{\alpha+(d+2)\beta}}+T^{1-\beta}\right)$, in which $\beta$ is a parameter describing the tail strength. This bound matches the minimax lower bound up to logarithmic factors, indicating that the second method is approximately optimal.
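
To get a feel for the two regret bounds quoted in this abstract, the short Python sketch below evaluates their exponents of T exactly. The function names and the parameter values are illustrative assumptions, not taken from the paper, and the sketch ignores logarithmic factors and any conditions the paper may place on the parameters.

```python
from fractions import Fraction


def bounded_regret_exponent(alpha, d):
    """Exponent of T in the bound quoted for bounded supports: 1 - (alpha+1)/(d+2)."""
    alpha, d = Fraction(alpha), Fraction(d)
    return 1 - (alpha + 1) / (d + 2)


def unbounded_regret_exponent(alpha, beta, d):
    """Dominant exponent of T in the adaptive-k bound for unbounded contexts.

    The bound is a sum of two powers of T, so for large T the larger
    of the two exponents dominates.
    """
    alpha, beta, d = Fraction(alpha), Fraction(beta), Fraction(d)
    first = 1 - (alpha + 1) * beta / (alpha + (d + 2) * beta)
    second = 1 - beta
    return max(first, second)


# Illustrative parameter values (assumed, not from the paper).
print(bounded_regret_exponent(1, 2))                    # 1/2
print(unbounded_regret_exponent(1, Fraction(1, 2), 2))  # 2/3
```

With these assumed values, the exponent rises from 1/2 in the bounded case to 2/3 in the unbounded case, meaning regret grows faster with T when the contexts have unbounded support.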

[LG-62] Meta-Learning on Augmented Gene Expression Profiles for Enhanced Lung Cancer Detection

Link: https://arxiv.org/abs/2408.09635
Authors: Arya Hadizadeh Moghaddam,Mohsen Nayebi Kerdabadi,Cuncong Zhong,Zijun Yao
Keywords: providing critical information, obtained through DNA, DNA microarray, cancer detection classifiers, Gene expression profiles
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Genomics (q-bio.GN)
Comments: Accepted to AMIA 2024 Annual Symposium

Abstract:Gene expression profiles obtained through DNA microarray have proven successful in providing critical information for cancer detection classifiers. However, the limited number of samples in these datasets makes it challenging to employ complex methodologies such as deep neural networks for sophisticated analysis. To address this “small data” dilemma, Meta-Learning has been introduced as a solution to enhance the optimization of machine learning models by utilizing similar datasets, thereby facilitating a quicker adaptation to target datasets without requiring a large number of samples. In this study, we present a meta-learning-based approach for predicting lung cancer from gene expression profiles. We apply this framework to well-established deep learning methodologies and employ four distinct datasets for the meta-learning tasks, where one serves as the target dataset and the rest as source datasets. Our approach is evaluated against both traditional and deep learning methodologies, and the results show the superior performance of meta-learning on augmented source data compared to the baselines trained on single datasets. Moreover, we conduct a comparative analysis between meta-learning and transfer learning methodologies to highlight the efficiency of the proposed approach in addressing the challenges associated with limited sample sizes. Finally, we incorporate an explainability study to illustrate the distinctiveness of decisions made by meta-learning.

[LG-63] MoDeGPT: Modular Decomposition for Large Language Model Compression

Link: https://arxiv.org/abs/2408.09632
Authors: Chi-Heng Lin,Shangqian Gao,James Seale Smith,Abhishek Patel,Shikhar Tuli,Yilin Shen,Hongxia Jin,Yen-Chang Hsu
Keywords: Large Language Models, Large Language, demonstrating exceptional performance, Language Models, reshaped the landscape
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments: 31 pages, 9 figures

Abstract:Large Language Models (LLMs) have reshaped the landscape of artificial intelligence by demonstrating exceptional performance across various tasks. However, substantial computational requirements make their deployment challenging on devices with limited resources. Recently, compression methods using low-rank matrix techniques have shown promise, yet these often lead to degraded accuracy or introduce significant overhead in parameters and inference latency. This paper introduces Modular Decomposition (MoDeGPT), a novel structured compression framework that does not need recovery fine-tuning while resolving the above drawbacks. MoDeGPT partitions the Transformer block into modules comprised of matrix pairs and reduces the hidden dimensions via reconstructing the module-level outputs. MoDeGPT is developed based on a theoretical framework that utilizes three well-established matrix decomposition algorithms – Nyström approximation, CR decomposition, and SVD – and applies them to our redefined transformer modules. Our comprehensive experiments show MoDeGPT, without backward propagation, matches or surpasses previous structured compression methods that rely on gradient information, and saves 98% of compute costs on compressing a 13B model. On Llama-2/3 and OPT models, MoDeGPT maintains 90-95% zero-shot performance with 25-30% compression rates. Moreover, the compression can be done on a single GPU within a few hours and increases the inference throughput by up to 46%.

[LG-64] Attention is a smoothed cubic spline

Link: https://arxiv.org/abs/2408.09624
Authors: Zehua Lai,Lek-Heng Lim,Yucong Liu
Keywords: hitherto unobserved insight, important but hitherto, hitherto unobserved, transformer, splines
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments: 20 pages, 2 figures

Abstract:We highlight a perhaps important but hitherto unobserved insight: The attention module in a transformer is a smoothed cubic spline. Viewed in this manner, this mysterious but critical component of a transformer becomes a natural development of an old notion deeply entrenched in classical approximation theory. More precisely, we show that with ReLU-activation, attention, masked attention, encoder-decoder attention are all cubic splines. As every component in a transformer is constructed out of compositions of various attention modules (= cubic splines) and feed forward neural networks (= linear splines), all its components – encoder, decoder, and encoder-decoder blocks; multilayered encoders and decoders; the transformer itself – are cubic or higher-order splines. If we assume the Pierce-Birkhoff conjecture, then the converse also holds, i.e., every spline is a ReLU-activated encoder. Since a spline is generally just $C^2$, one way to obtain a smoothed $C^\infty$-version is by replacing ReLU with a smooth activation; and if this activation is chosen to be SoftMax, we recover the original transformer as proposed by Vaswani et al. This insight sheds light on the nature of the transformer by casting it entirely in terms of splines, one of the best known and thoroughly understood objects in applied mathematics.
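
The contrast this abstract draws between ReLU-activated and SoftMax-activated attention can be made concrete with a minimal single-query sketch in plain Python. The function names and the unscaled dot-product scores are assumptions for illustration; the paper's exact definitions may differ. In the ReLU variant each output coordinate is a sum of terms of the form max(0, q·k)·v, which is piecewise polynomial of degree at most three in the entries of q, k and v.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def relu(scores):
    """Element-wise ReLU; the weights are not normalised."""
    return [max(0.0, s) for s in scores]


def softmax(scores):
    """Numerically stable softmax; the weights sum to one."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values, activation):
    """Single-query attention: activation(q . k_i) weights applied to the values."""
    weights = activation([dot(query, k) for k in keys])
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


query = [1.0, 0.0]
keys = [[1.0, 0.0], [-1.0, 0.0]]
values = [[2.0, 0.0], [0.0, 3.0]]

relu_out = attention(query, keys, values, relu)
soft_out = attention(query, keys, values, softmax)
print(relu_out)  # [2.0, 0.0]
print(soft_out)
```

Only the `activation` argument changes between the two calls, which mirrors the abstract's point that swapping ReLU for a smooth activation turns the piecewise-polynomial map into a smooth one.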

[LG-65] On the Necessity of World Knowledge for Mitigating Missing Labels in Extreme Classification

Link: https://arxiv.org/abs/2408.09585
Authors: Jatin Prakash,Anirudh Buvanesh,Bishal Santra,Deepak Saini,Sachin Yadav,Jian Jiao,Yashoteja Prabhu,Amit Sharma,Manik Varma
Keywords: Extreme Classification, aims to map, map a query, missing labels, missing
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR)
Comments: Preprint, 23 pages

Abstract:Extreme Classification (XC) aims to map a query to the most relevant documents from a very large document set. XC algorithms used in real-world applications learn this mapping from datasets curated from implicit feedback, such as user clicks. However, these datasets inevitably suffer from missing labels. In this work, we observe that systematic missing labels lead to missing knowledge, which is critical for accurately modelling relevance between queries and documents. We formally show that this absence of knowledge cannot be recovered using existing methods such as propensity weighting and data imputation strategies that solely rely on the training dataset. While LLMs provide an attractive solution to augment the missing knowledge, leveraging them in applications with low latency requirements and large document sets is challenging. To incorporate missing knowledge at scale, we propose SKIM (Scalable Knowledge Infusion for Missing Labels), an algorithm that leverages a combination of small LM and abundant unstructured meta-data to effectively mitigate the missing label problem. We show the efficacy of our method on large-scale public datasets through exhaustive unbiased evaluation ranging from human annotations to simulations inspired by industrial settings. SKIM outperforms existing methods on Recall@100 by more than 10 absolute points. Additionally, SKIM scales to proprietary query-ad retrieval datasets containing 10 million documents, outperforming contemporary methods by 12% in offline evaluation and increasing ad click-yield by 1.23% in an online A/B test conducted on a popular search engine. We release our code, prompts, trained XC models and finetuned SLMs at: this https URL

[LG-66] A Markov Random Field Multi-Modal Variational AutoEncoder

Link: https://arxiv.org/abs/2408.09576
Authors: Fouad Oubari,Mohamed El Baha,Raphael Meunier,Rodrigue Décatoire,Mathilde Mougeot
Keywords: multimodal Variational AutoEncoders, Variational AutoEncoders, Recent advancements, Markov Random Field, multimodal Variational
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Recent advancements in multimodal Variational AutoEncoders (VAEs) have highlighted their potential for modeling complex data from multiple modalities. However, many existing approaches use relatively straightforward aggregating schemes that may not fully capture the complex dynamics present between different modalities. This work introduces a novel multimodal VAE that incorporates a Markov Random Field (MRF) into both the prior and posterior distributions. This integration aims to capture complex intermodal interactions more effectively. Unlike previous models, our approach is specifically designed to model and leverage the intricacies of these relationships, enabling a more faithful representation of multimodal data. Our experiments demonstrate that our model performs competitively on the standard PolyMNIST dataset and shows superior performance in managing complex intermodal dependencies in a specially designed synthetic dataset, intended to test intricate relationships.

[LG-67] Say My Name: a Model's Bias Discovery Framework

Link: https://arxiv.org/abs/2408.09570
Authors: Massimiliano Ciranni,Luca Molinaro,Carlo Alberto Barbano,Attilio Fiandrotti,Vittorio Murino,Vito Paolo Pastore,Enzo Tartaglione
Keywords: increasingly more concerns, non-representative patterns, learning to downstream, downstream tasks, concerns about potential
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Abstract:In the last few years, due to the broad applicability of deep learning to downstream tasks and end-to-end training capabilities, increasingly more concerns about potential biases to specific, non-representative patterns have been raised. Many works focusing on unsupervised debiasing usually leverage the tendency of deep models to learn "easier" samples, for example by clustering the latent space to obtain bias pseudo-labels. However, the interpretation of such pseudo-labels is not trivial, especially for a non-expert end user, as it does not provide semantic information about the bias features. To address this issue, we introduce "Say My Name" (SaMyNa), the first tool to identify biases within deep models semantically. Unlike existing methods, our approach focuses on biases learned by the model. Our text-based pipeline enhances explainability and supports debiasing efforts: applicable during either training or post-hoc validation, our method can disentangle task-related information and proposes itself as a tool to analyze biases. Evaluation on traditional benchmarks demonstrates its effectiveness in detecting biases and even disclaiming them, showcasing its broad applicability for model diagnosis.

[LG-68] Addressing Heterogeneity in Federated Learning: Challenges and Solutions for a Shared Production Environment

Link: https://arxiv.org/abs/2408.09556
Authors: Tatjana Legler,Vinit Hegiste,Ahmed Anwar,Martin Ruskowski
Keywords: Federated learning, preserving data privacy, machine learning models, training machine learning, shared production environments
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Federated learning (FL) has emerged as a promising approach to training machine learning models across decentralized data sources while preserving data privacy, particularly in manufacturing and shared production environments. However, the presence of data heterogeneity, i.e., variations in data distribution, quality, and volume across different clients and production sites, poses significant challenges to the effectiveness and efficiency of FL. This paper provides a comprehensive overview of heterogeneity in FL within the context of manufacturing, detailing the types and sources of heterogeneity, including non-independent and identically distributed (non-IID) data, unbalanced data, variable data quality, and statistical heterogeneity. We discuss the impact of these types of heterogeneity on model training and review current methodologies for mitigating their adverse effects. These methodologies include personalized and customized models, robust aggregation techniques, and client selection techniques. By synthesizing existing research and proposing new strategies, this paper aims to provide insight for effectively managing data heterogeneity in FL, enhancing model robustness, and ensuring fair and efficient training across diverse environments. Future research directions are also identified, highlighting the need for adaptive and scalable solutions to further improve the FL paradigm in the context of Industry 4.0.

[LG-69] Seamless Integration: Sampling Strategies in Federated Learning Systems

Link: https://arxiv.org/abs/2408.09545
Authors: Tatjana Legler,Vinit Hegiste,Martin Ruskowski
Keywords: Federated Learning, distributed learning networks, represents a paradigm, offering an approach, Learning
Categories: Machine Learning (cs.LG)

Abstract:Federated Learning (FL) represents a paradigm shift in the field of machine learning, offering an approach for decentralized training of models across a multitude of devices while maintaining the privacy of local data. However, the dynamic nature of FL systems, characterized by the ongoing incorporation of new clients with potentially diverse data distributions and computational capabilities, poses a significant challenge to the stability and efficiency of these distributed learning networks. The seamless integration of new clients is imperative to sustain and enhance the performance and robustness of FL systems. This paper looks into the complexities of integrating new clients into existing FL systems and explores how data heterogeneity and varying data distribution (not independent and identically distributed) among them can affect model training, system efficiency, scalability and stability. Despite these challenges, the integration of new clients into FL systems presents opportunities to enhance data diversity, improve learning performance, and leverage distributed computational power. In contrast to other fields of application such as the distributed optimization of word predictions on Gboard (where federated learning once originated), there are usually only a few clients in the production environment, which is why information from each new client becomes all the more valuable. This paper outlines effective client selection strategies and solutions for ensuring system scalability and stability. Using the example of images from optical quality inspection, it offers insights into practical approaches. In conclusion, this paper proposes that addressing the challenges presented by new client integration is crucial to the advancement and efficiency of distributed learning networks, thus paving the way for the adoption of Federated Learning in production environments.

[LG-70] Byzantine-resilient Federated Learning Employing Normalized Gradients on Non-IID Datasets

Link: https://arxiv.org/abs/2408.09539
Authors: Shiyuan Zuo,Xingrun Yan,Rongfei Fan,Li Shen,Puning Zhao,Jie Xu,Han Hu
Keywords: malicious Byzantine attacks, malicious Byzantine, Byzantine attacks, practical federated learning, existing Byzantine-robust methods
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)

Abstract:In practical federated learning (FL) systems, the presence of malicious Byzantine attacks and data heterogeneity often introduces biases into the learning process. However, existing Byzantine-robust methods typically only achieve a compromise between adaptability to different loss function types (including both strongly convex and non-convex) and robustness to heterogeneous datasets, but with non-zero optimality gap. Moreover, this compromise often comes at the cost of high computational complexity for aggregation, which significantly slows down the training speed. To address this challenge, we propose a federated learning approach called Federated Normalized Gradients Algorithm (Fed-NGA). Fed-NGA simply normalizes the uploaded local gradients to be unit vectors before aggregation, achieving a time complexity of $\mathcal{O}(pM)$, where $p$ represents the dimension of model parameters and $M$ is the number of participating clients. This complexity scale achieves the best level among all the existing Byzantine-robust methods. Furthermore, through rigorous proof, we demonstrate that Fed-NGA transcends the trade-off between adaptability to loss function type and data heterogeneity and the limitation of non-zero optimality gap in existing literature. Specifically, Fed-NGA can adapt to both non-convex loss functions and non-IID datasets simultaneously, with zero optimality gap at a rate of $\mathcal{O}(1/T^{\frac{1}{2}-\delta})$, where $T$ is the iteration number and $\delta \in (0,\frac{1}{2})$. In cases where the loss function is strongly convex, the zero optimality gap achieving rate can be improved to be linear. Experimental results provide evidence of the superiority of our proposed Fed-NGA on time complexity and convergence performance over baseline methods.
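
The aggregation rule this abstract describes, scaling each uploaded gradient to a unit vector before averaging, can be sketched in a few lines of plain Python. The function names, the uniform default weights, and the toy gradients are assumptions for illustration; the paper's actual weighting, step sizes, and convergence conditions are not reproduced here.

```python
import math


def normalize(grad):
    """Scale a gradient to unit L2 norm; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return list(grad)
    return [g / norm for g in grad]


def aggregate_normalized(grads, weights=None):
    """Weighted average of unit-normalised client gradients.

    One pass over M gradients of dimension p, i.e. O(pM) work.
    Uniform weights are an assumption of this sketch.
    """
    m = len(grads)
    if weights is None:
        weights = [1.0 / m] * m
    p = len(grads[0])
    agg = [0.0] * p
    for w, grad in zip(weights, grads):
        unit = normalize(grad)
        for i in range(p):
            agg[i] += w * unit[i]
    return agg


# Three honest clients roughly agree; one Byzantine client sends a huge
# gradient pointing the opposite way (toy values, assumed).
honest = [[1.0, 0.0], [0.9, 0.1], [1.1, -0.1]]
byzantine = [[-1000.0, 0.0]]
all_grads = honest + byzantine

update = aggregate_normalized(all_grads)
plain_mean_x = sum(g[0] for g in all_grads) / len(all_grads)
print(update[0] > 0, plain_mean_x < 0)  # True True
```

In this toy case the plain mean is dragged to the Byzantine side by the sheer magnitude of one vector, whereas after normalisation every client contributes at most unit length, so the three honest directions outweigh the single adversarial one.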

[LG-71] Fine-gained air quality inference based on low-quality sensing data using self-supervised learning

Link: https://arxiv.org/abs/2408.09526
Authors: Meng Xu,Ke Han,Weijian Hu,Wen Ji
Keywords: Fine-grained air quality, Fine-grained air, mapping is made, cheap AQ micro-stations, proliferation of cheap
Categories: Machine Learning (cs.LG)
Comments: 17 pages

Abstract:Fine-grained air quality (AQ) mapping is made possible by the proliferation of cheap AQ micro-stations (MSs). However, their measurements are often inaccurate and sensitive to local disturbances, in contrast to standardized stations (SSs) that provide accurate readings but fall short in number. To simultaneously address the issues of low data quality (MSs) and high label sparsity (SSs), a multi-task spatio-temporal network (MTSTN) is proposed, which employs self-supervised learning to utilize massive unlabeled data, aided by seasonal and trend decomposition of MS data offering reliable information as features. The MTSTN is applied to infer NO$_2$, O$_3$ and PM$_{2.5}$ concentrations in a 250 km$^2$ area in Chengdu, China, at a resolution of 500 m $\times$ 500 m $\times$ 1 hr. Data from 55 SSs and 323 MSs were used, along with meteorological, traffic, geographic and timestamp data as features. The MTSTN excels in accuracy compared to several benchmarks, and its performance is greatly enhanced by utilizing low-quality MS data. A series of ablation and pressure tests demonstrate the results’ robustness and interpretability, showcasing the MTSTN’s practical value for accurate and affordable AQ inference.

[LG-72] A Unified Framework for Interpretable Transformers Using PDEs and Information Theory

Link: https://arxiv.org/abs/2408.09523
Authors: Yukun Zhang
Keywords: Partial Differential Equations, Information Bottleneck Theory, integrating Partial Differential, Bottleneck Theory, Information Flow Theory
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)

Abstract:This paper presents a novel unified theoretical framework for understanding Transformer architectures by integrating Partial Differential Equations (PDEs), Neural Information Flow Theory, and Information Bottleneck Theory. We model Transformer information dynamics as a continuous PDE process, encompassing diffusion, self-attention, and nonlinear residual components. Our comprehensive experiments across image and text modalities demonstrate that the PDE model effectively captures key aspects of Transformer behavior, achieving high similarity (cosine similarity 0.98) with Transformer attention distributions across all layers. While the model excels in replicating general information flow patterns, it shows limitations in fully capturing complex, non-linear transformations. This work provides crucial theoretical insights into Transformer mechanisms, offering a foundation for future optimizations in deep learning architectural design. We discuss the implications of our findings, potential applications in model interpretability and efficiency, and outline directions for enhancing PDE models to better mimic the intricate behaviors observed in Transformers, paving the way for more transparent and optimized AI systems.

[LG-73] Out-of-distribution generalization via composition: a lens through induction heads in Transformers

Link: https://arxiv.org/abs/2408.09503
Authors: Jiajun Song,Zhuoyan Xu,Yiqiao Zhong
Keywords: Large language models, OOD generalization, Large language, OOD, Large
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 41 pages, 25 figures

Abstract:Large language models (LLMs) such as GPT-4 sometimes appear to be creative, solving novel tasks often with a few demonstrations in the prompt. These tasks require the models to generalize on distributions different from those from training data – which is known as out-of-distribution (OOD) generalization. Despite the tremendous success of LLMs, how they approach OOD generalization remains an open and underexplored question. We examine OOD generalization in settings where instances are generated according to hidden rules, including in-context learning with symbolic reasoning. Models are required to infer the hidden rules behind input prompts without any fine-tuning. We empirically examined the training dynamics of Transformers on a synthetic example and conducted extensive experiments on a variety of pretrained LLMs, focusing on a type of components known as induction heads. We found that OOD generalization and composition are tied together – models can learn rules by composing two self-attention layers, thereby achieving OOD generalization. Furthermore, a shared latent subspace in the embedding (or feature) space acts as a bridge for composition by aligning early layers and later layers, which we refer to as the common bridge representation hypothesis.

[LG-74] Directed Exploration in Reinforcement Learning from Linear Temporal Logic

Link: https://arxiv.org/abs/2408.09495
Authors: Marco Bagatella,Andreas Krause,Georg Martius
Keywords: Linear temporal logic, discounted return formulations, conventional discounted return, Linear temporal, temporal logic
Categories: Machine Learning (cs.LG)

Abstract:Linear temporal logic (LTL) is a powerful language for task specification in reinforcement learning, as it allows describing objectives beyond the expressivity of conventional discounted return formulations. Nonetheless, recent works have shown that LTL formulas can be translated into a variable rewarding and discounting scheme, whose optimization produces a policy maximizing a lower bound on the probability of formula satisfaction. However, the synthesized reward signal remains fundamentally sparse, making exploration challenging. We aim to overcome this limitation, which can prevent current algorithms from scaling beyond low-dimensional, short-horizon problems. We show how better exploration can be achieved by further leveraging the LTL specification and casting its corresponding Limit Deterministic Büchi Automaton (LDBA) as a Markov reward process, thus enabling a form of high-level value estimation. By taking a Bayesian perspective over LDBA dynamics and proposing a suitable prior distribution, we show that the values estimated through this procedure can be treated as a shaping potential and mapped to informative intrinsic rewards. Empirically, we demonstrate applications of our method from tabular settings to high-dimensional continuous systems, which have so far represented a significant challenge for LTL-based reinforcement learning algorithms.

[LG-75] Ancestral Reinforcement Learning: Unifying Zeroth-Order Optimization and Genetic Algorithms for Reinforcement Learning

Link: https://arxiv.org/abs/2408.09493
Authors: So Nakashima, Tetsuya J. Kobayashi
Keywords-EN: discovering optimal action, optimal action strategies, Ancestral Reinforcement Learning, Reinforcement Learning, offers a fundamental
Categories: Machine Learning (cs.LG)
Comments: 16 pages, 3 figures

Click to view abstract

Abstract:Reinforcement Learning (RL) offers a fundamental framework for discovering optimal action strategies through interactions within unknown environments. Recent advancements have shown that the performance and applicability of RL can be significantly enhanced by exploiting a population of agents in various ways. Zeroth-Order Optimization (ZOO) leverages an agent population to estimate the gradient of the objective function, enabling robust policy refinement even in non-differentiable scenarios. As another application, Genetic Algorithms (GA) boost the exploration of policy landscapes by mutational generation of policy diversity in an agent population and its refinement by selection. A natural question is whether an agent population can offer the best of both worlds. In this work, we propose Ancestral Reinforcement Learning (ARL), which synergistically combines the robust gradient estimation of ZOO with the exploratory power of GA. The key idea in ARL is that each agent within a population infers the gradient by exploiting the history of its ancestors, i.e., the ancestor population in the past, while maintaining the diversity of policies in the current population as in GA. We also theoretically reveal that the populational search in ARL implicitly induces KL-regularization of the objective function, resulting in enhanced exploration. Our results extend the applicability of populational algorithms for RL.
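
The zeroth-order half of this combination can be sketched in a few lines. The code below is a generic antithetic zeroth-order gradient estimator, not ARL itself: the ancestor-history mechanism and the GA selection step are not reproduced, and `toy_return`, the step size, and the population size are hypothetical choices made only for illustration.

```python
import random


def zoo_gradient(f, theta, sigma, population, rng):
    """Antithetic zeroth-order estimate of the gradient of f at theta.

    Each population member evaluates f at theta + sigma * eps and at
    theta - sigma * eps; no derivative of f is ever computed.
    """
    dim = len(theta)
    grad = [0.0] * dim
    for _ in range(population):
        eps = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        plus = f([t + sigma * e for t, e in zip(theta, eps)])
        minus = f([t - sigma * e for t, e in zip(theta, eps)])
        scale = (plus - minus) / (2.0 * sigma * population)
        for i in range(dim):
            grad[i] += scale * eps[i]
    return grad


def ascend(f, theta, steps=100, lr=0.05, sigma=0.1, population=50, seed=0):
    """Gradient ascent on f using only the zeroth-order estimate."""
    rng = random.Random(seed)
    theta = list(theta)
    for _ in range(steps):
        g = zoo_gradient(f, theta, sigma, population, rng)
        theta = [t + lr * gi for t, gi in zip(theta, g)]
    return theta


def toy_return(theta):
    """Hypothetical objective with its maximum at theta = (1, ..., 1)."""
    return -sum((t - 1.0) ** 2 for t in theta)


if __name__ == "__main__":
    start = [-1.0, 3.0]
    final = ascend(toy_return, start)
    print(toy_return(start), toy_return(final))
```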

[LG-76] Leveraging Invariant Principle for Heterophilic Graph Structure Distribution Shifts

Link: https://arxiv.org/abs/2408.09490
Authors: Jinluan Yang, Zhengyu Chen, Teng Xiao, Wenqiao Zhang, Yong Lin, Kun Kuang
Keywords-EN: Graph Neural Networks, Neural Networks, Heterophilic Graph Neural, shown promising results, Graph Neural
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 20 pages, 7 figures

Click to view abstract

Abstract:Heterophilic Graph Neural Networks (HGNNs) have shown promising results for semi-supervised learning tasks on graphs. Notably, most real-world heterophilic graphs are composed of a mixture of nodes with different neighbor patterns, exhibiting local node-level homophilic and heterophilic structures. However, existing works are only devoted to designing better HGNN backbones or architectures for node classification tasks on heterophilic and homophilic graph benchmarks simultaneously, and their analyses of HGNN performance with respect to nodes are only based on the determined data distribution without exploring the effect caused by this structural difference between training and testing nodes. How to learn invariant node representations on heterophilic graphs to handle this structure difference or distribution shifts remains unexplored. In this paper, we first discuss the limitations of previous graph-based invariant learning methods from the perspective of data augmentation. Then, we propose HEI, a framework capable of generating invariant node representations through incorporating heterophily information to infer latent environments without augmentation, which are then used for invariant prediction, under heterophilic graph structure distribution shifts. We theoretically show that our proposed method can achieve guaranteed performance under heterophilic graph structure distribution shifts. Extensive experiments on various benchmarks and backbones can also demonstrate the effectiveness of our method compared with existing state-of-the-art baselines.

[LG-77] Mitigating Noise Detriment in Differentially Private Federated Learning with Model Pre-training

Link: https://arxiv.org/abs/2408.09478
Authors: Huitong Jin, Yipeng Zhou, Laizhong Cui, Quan Z. Sheng
Keywords-EN: Pre-training exploits public, advanced machine learning, exploits public datasets, downstream tasks, exploits public
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments:

Click to view abstract

Abstract:Pre-training exploits public datasets to pre-train an advanced machine learning model, so that the model can be easily tuned to adapt to various downstream tasks. Pre-training has been extensively explored to mitigate computation and communication resource consumption. Inspired by these advantages, we are the first to explore how model pre-training can mitigate noise detriment in differentially private federated learning (DPFL). DPFL is upgraded from federated learning (FL), the de-facto standard for privacy preservation when training the model across multiple clients owning private data. DPFL introduces differentially private (DP) noises to obfuscate model gradients exposed in FL, which however can considerably impair model accuracy. In our work, we compare head fine-tuning (HT) and full fine-tuning (FT), which are based on pre-training, with scratch training (ST) in DPFL through a comprehensive empirical study. Our experiments tune pre-trained models (obtained by pre-training on ImageNet-1K) with CIFAR-10, CHMNIST and Fashion-MNIST (FMNIST) datasets, respectively. The results demonstrate that HT and FT can significantly mitigate noise influence by diminishing gradient exposure times. In particular, HT outperforms FT when the privacy budget is tight or the model size is large. Visualization and explanation study further substantiates our findings. Our pioneering study introduces a new perspective on enhancing DPFL and expanding its practical applications.
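
To make "DP noise on exposed gradients" concrete, here is a minimal sketch of the usual clip-then-add-Gaussian-noise step applied by each client before upload. It assumes a generic Gaussian mechanism; the abstract does not specify the paper's exact mechanism, noise calibration, or privacy accounting, and the names `clip_gradient`, `privatize`, and `average` are illustrative.

```python
import math
import random


def clip_gradient(grad, max_norm):
    """Rescale grad so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0.0 else 1.0
    return [g * scale for g in grad]


def privatize(grad, max_norm, noise_multiplier, rng):
    """Clip a client's gradient, then add Gaussian noise to every coordinate."""
    std = noise_multiplier * max_norm
    return [g + rng.gauss(0.0, std) for g in clip_gradient(grad, max_norm)]


def average(vectors):
    """Server-side mean of the uploaded (noisy) gradients."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


if __name__ == "__main__":
    rng = random.Random(0)
    client_grads = [[3.0, 4.0], [0.3, -0.4], [-1.0, 2.0]]
    uploads = [privatize(g, max_norm=1.0, noise_multiplier=0.5, rng=rng)
               for g in client_grads]
    print(average(uploads))
```

Every training round repeats this noisy upload, which is one way to read the abstract's remark that fine-tuning helps by reducing how many times gradients are exposed.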

[LG-78] Advances in Multiple Instance Learning for Whole Slide Image Analysis: Techniques, Challenges, and Future Directions

Link: https://arxiv.org/abs/2408.09476
Authors: Jun Wang, Yu Mao, Nan Guan, Chun Jason Xue
Keywords-EN: H&E-stained tissue samples, gigapixel-scale digital images, tissue samples widely, H&E-stained tissue, slide images
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Whole slide images (WSIs) are gigapixel-scale digital images of H&E-stained tissue samples widely used in pathology. The substantial size and complexity of WSIs pose unique analytical challenges. Multiple Instance Learning (MIL) has emerged as a powerful approach for addressing these challenges, particularly in cancer classification and detection. This survey provides a comprehensive overview of the challenges and methodologies associated with applying MIL to WSI analysis, including attention mechanisms, pseudo-labeling, transformers, pooling functions, and graph neural networks. Additionally, it explores the potential of MIL in discovering cancer cell morphology, constructing interpretable machine learning models, and quantifying cancer grading. By summarizing the current challenges, methodologies, and potential applications of MIL in WSI analysis, this survey aims to inform researchers about the state of the field and inspire future research directions.

[LG-79] Advancements in Molecular Property Prediction: A Survey of Single and Multimodal Approaches

Link: https://arxiv.org/abs/2408.09461
Authors: Tanya Liyaqat, Tanvir Ahmad, Chandni Saxena
Keywords-EN: Molecular Property Prediction, Property Prediction, spanning drug discovery, Molecular Property, material science
Categories: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Chemical Physics (physics.chem-ph); Biomolecules (q-bio.BM)
Comments: Submitted to the journal

Click to view abstract

Abstract:Molecular Property Prediction (MPP) plays a pivotal role across diverse domains, spanning drug discovery, material science, and environmental chemistry. Fueled by the exponential growth of chemical data and the evolution of artificial intelligence, recent years have witnessed remarkable strides in MPP. However, the multifaceted nature of molecular data, such as molecular structures, SMILES notation, and molecular images, continues to pose a fundamental challenge in its effective representation. To address this, representation learning techniques are instrumental as they acquire informative and interpretable representations of molecular data. This article explores recent AI-based approaches in MPP, focusing on both single and multiple modality representation techniques. It provides an overview of various molecule representations and encoding schemes, categorizes MPP methods by their use of modalities, and outlines datasets and tools available for feature generation. The article also analyzes the performance of recent methods and suggests future research directions to advance the field of MPP.

[LG-80] In-Memory Learning Automata Architecture using Y-Flash Cell

Link: https://arxiv.org/abs/2408.09456
Authors: Omar Ghazal, Tian Lan, Shalman Ojukwu, Komal Krishnamurthy, Alex Yakovlev, Rishad Shafik
Keywords-EN: faces significant challenges, significant challenges due, frequent data transfer, architectures faces significant, faces significant
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:The modern implementation of machine learning architectures faces significant challenges due to frequent data transfer between memory and processing units. In-memory computing, primarily through memristor-based analog computing, offers a promising solution to overcome this von Neumann bottleneck. In this technology, data processing and storage are located inside the memory. Here, we introduce a novel approach that utilizes floating-gate Y-Flash memristive devices manufactured with a standard 180 nm CMOS process. These devices offer attractive features, including analog tunability and moderate device-to-device variation; such characteristics are essential for reliable decision-making in ML applications. This paper uses a new machine learning algorithm, the Tsetlin Machine (TM), for in-memory processing architecture. The TM’s learning element, Automaton, is mapped into a single Y-Flash cell, where the Automaton’s range is transferred into the Y-Flash’s conductance scope. Through comprehensive simulations, the proposed hardware implementation of the learning automata, particularly for Tsetlin machines, has demonstrated enhanced scalability and on-edge learning capabilities.

[LG-81] Reparameterized Multi-Resolution Convolutions for Long Sequence Modelling

Link: https://arxiv.org/abs/2408.09453
Authors: Harry Jake Cunningham, Giorgio Giannone, Mingtian Zhang, Marc Peter Deisenroth
Keywords-EN: shown increasing promise, powerful general-purpose sequence, general-purpose sequence models, shown increasing, increasing promise
Categories: Machine Learning (cs.LG)
Comments: 22 pages, 7 figures

Click to view abstract

Abstract:Global convolutions have shown increasing promise as powerful general-purpose sequence models. However, training long convolutions is challenging, and kernel parameterizations must be able to learn long-range dependencies without overfitting. This work introduces reparameterized multi-resolution convolutions (MRConv), a novel approach to parameterizing global convolutional kernels for long-sequence modelling. By leveraging multi-resolution convolutions, incorporating structural reparameterization and introducing learnable kernel decay, MRConv learns expressive long-range kernels that perform well across various data modalities. Our experiments demonstrate state-of-the-art performance on the Long Range Arena, Sequential CIFAR, and Speech Commands tasks among convolution models and linear-time transformers. Moreover, we report improved performance on ImageNet classification by replacing 2D convolutions with 1D MRConv layers.
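
Structural reparameterization rests on the linearity of convolution: several parallel branches with kernels of different lengths can be folded into one kernel after training. The sketch below checks that identity on a toy signal. It is not MRConv: how the paper builds, normalizes, or upsamples its sub-kernels is not reproduced, and `decayed` with a fixed rate is only a stand-in for a learnable kernel decay.

```python
def causal_conv(x, kernel):
    """y[t] = sum_j kernel[j] * x[t - j], treating x as zero before t = 0."""
    return [
        sum(kernel[j] * x[t - j] for j in range(len(kernel)) if t - j >= 0)
        for t in range(len(x))
    ]


def merge_kernels(kernels):
    """Zero-pad all kernels to the longest length and add them tap by tap."""
    length = max(len(k) for k in kernels)
    merged = [0.0] * length
    for k in kernels:
        for j, w in enumerate(k):
            merged[j] += w
    return merged


def decayed(kernel, rate):
    """Scale tap j by rate**j (fixed here; learnable in a real model)."""
    return [w * rate ** j for j, w in enumerate(kernel)]


if __name__ == "__main__":
    x = [1.0, 2.0, 0.5, -1.0, 3.0, 0.0, 2.0, 1.0]
    # Three hypothetical branches covering 2, 4, and 8 time steps.
    kernels = [
        decayed([1.0, -1.0], 0.9),
        decayed([0.5] * 4, 0.8),
        decayed([0.25] * 8, 0.7),
    ]
    branch_sum = [sum(v) for v in zip(*(causal_conv(x, k) for k in kernels))]
    merged_out = causal_conv(x, merge_kernels(kernels))
    print(max(abs(a - b) for a, b in zip(branch_sum, merged_out)))
```

Training with separate branches and deploying the single merged kernel gives the same output at the cost of one convolution instead of three.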

[LG-82] GraphSPNs: Sum-Product Networks Benefit From Canonical Orderings

Link: https://arxiv.org/abs/2408.09451
Authors: Milan Papež, Martin Rektoris, Václav Šmídl, Tomáš Pevný
Keywords-EN: capturing complex probability, complex probability distributions, Deep generative, recently made, made a remarkable
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Deep generative models have recently made remarkable progress in capturing complex probability distributions over graphs. However, they are intractable and thus unable to answer even the most basic probabilistic inference queries without resorting to approximations. Therefore, we propose graph sum-product networks (GraphSPNs), a tractable deep generative model which provides exact and efficient inference over (arbitrary parts of) graphs. We investigate different principles to make SPNs permutation invariant. We demonstrate that GraphSPNs are able to (conditionally) generate novel and chemically valid molecular graphs, being competitive with, and sometimes even better than, existing intractable models. We find that (Graph)SPNs benefit from ensuring permutation invariance via canonical ordering.

[LG-83] Attention Is Not What You Need: Revisiting Multi-Instance Learning for Whole Slide Image Classification

Link: https://arxiv.org/abs/2408.09449
Authors: Xin Liu, Weijia Zhang, Min-Ling Zhang
Keywords-EN: achieved impressive performances, multi-instance learning algorithms, attention-based multi-instance learning, standard MIL assumptions, standard MIL
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Although attention-based multi-instance learning algorithms have achieved impressive performances on slide-level whole slide image (WSI) classification tasks, they are prone to mistakenly focus on irrelevant patterns such as staining conditions and tissue morphology, leading to incorrect patch-level predictions and unreliable interpretability. Moreover, these attention-based MIL algorithms tend to focus on salient instances and struggle to recognize hard-to-classify instances. In this paper, we first demonstrate that attention-based WSI classification methods do not adhere to the standard MIL assumptions. From the standard MIL assumptions, we propose a surprisingly simple yet effective instance-based MIL method for WSI classification (FocusMIL) based on max-pooling and forward amortized variational inference. We argue that synergizing the standard MIL assumption with variational inference encourages the model to focus on tumour morphology instead of spurious correlations. Our experimental evaluations show that FocusMIL significantly outperforms the baselines in patch-level classification tasks on the Camelyon16 and TCGA-NSCLC benchmarks. Visualization results show that our method also achieves better classification boundaries for identifying hard instances and mitigates the effect of spurious correlations between bags and labels.
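
The standard MIL assumption mentioned here says a bag (slide) is positive exactly when at least one of its instances (patches) is positive, which is what max-pooling over instance scores encodes. The sketch below shows only that pooling rule; FocusMIL's variational inference component is not reproduced, and the function names and the 0.5 threshold are illustrative.

```python
def bag_score(instance_scores):
    """Max-pooling: the bag is as positive as its most positive instance."""
    return max(instance_scores)


def bag_label(instance_scores, threshold=0.5):
    """Predict 1 if any instance score reaches the threshold, else 0."""
    return int(bag_score(instance_scores) >= threshold)


def key_instance(instance_scores):
    """Index of the instance that determines the bag score."""
    return max(range(len(instance_scores)), key=lambda i: instance_scores[i])


if __name__ == "__main__":
    patch_scores = [0.05, 0.10, 0.92, 0.20]  # hypothetical patch-level outputs
    print(bag_score(patch_scores), bag_label(patch_scores), key_instance(patch_scores))
```

Unlike attention pooling, the bag prediction here is tied directly to a patch-level prediction, so patch-level outputs stay interpretable by construction.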

[LG-84] Parameterized Physics-informed Neural Networks for Parameterized PDEs

Link: https://arxiv.org/abs/2408.09446
Authors: Woojin Cho, Minju Jo, Haksoo Lim, Kookjin Lee, Dongeun Lee, Sanghyun Hong, Noseong Park
Keywords-EN: Complex physical systems, partial differential equations, Complex physical, Reynolds number, differential equations
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA); Computational Physics (physics.comp-ph)
Comments:

Click to view abstract

Abstract:Complex physical systems are often described by partial differential equations (PDEs) that depend on parameters such as the Reynolds number in fluid mechanics. In applications such as design optimization or uncertainty quantification, solutions of those PDEs need to be evaluated at numerous points in the parameter space. While physics-informed neural networks (PINNs) have emerged as a new strong competitor as a surrogate, their usage in this scenario remains underexplored due to the inherent need for repetitive and time-consuming training. In this paper, we address this problem by proposing a novel extension, parameterized physics-informed neural networks (P²INNs). P²INNs enable modeling the solutions of parameterized PDEs via explicitly encoding a latent representation of PDE parameters. With the extensive empirical evaluation, we demonstrate that P²INNs outperform the baselines both in accuracy and parameter efficiency on benchmark 1D and 2D parameterized PDEs and are also effective in overcoming the known “failure modes”.

[LG-85] Parallel Sampling via Counting

Link: https://arxiv.org/abs/2408.09442
Authors: Nima Anari, Ruiquan Gao, Aviad Rubinstein
Keywords-EN: product space, sigma, parallelization to speed, autoregressive models, arbitrary distribution
Categories: Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Probability (math.PR)
Comments:

Click to view abstract

Abstract:We show how to use parallelization to speed up sampling from an arbitrary distribution $\mu$ on a product space $[q]^n$, given oracle access to counting queries: $\mathbb{P}_{X\sim \mu}[X_S=\sigma_S]$ for any $S\subseteq [n]$ and $\sigma_S \in [q]^S$. Our algorithm takes $O(n^{2/3}\cdot \operatorname{polylog}(n,q))$ parallel time, to the best of our knowledge, the first sublinear in $n$ runtime for arbitrary distributions. Our results have implications for sampling in autoregressive models. Our algorithm directly works with an equivalent oracle that answers conditional marginal queries $\mathbb{P}_{X\sim \mu}[X_i=\sigma_i \mid X_S=\sigma_S]$, whose role is played by a trained neural network in autoregressive models. This suggests a roughly $n^{1/3}$-factor speedup is possible for sampling in any-order autoregressive models. We complement our positive result by showing a lower bound of $\widetilde{\Omega}(n^{1/3})$ for the runtime of any parallel sampling algorithm making at most $\operatorname{poly}(n)$ queries to the counting oracle, even for $q=2$.
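
For context, the sketch below shows the plain sequential baseline that such a result improves on, not the paper's parallel algorithm: it draws one coordinate per round, so it needs n rounds. The distribution is a hypothetical explicit table `mu`, the counting oracle is answered by brute force, and conditional marginals are obtained as ratios of two counting queries.

```python
import random


def count(mu, partial):
    """Counting oracle: P[X_S = sigma_S] for the partial assignment `partial`
    (a dict mapping coordinate index to value)."""
    return sum(
        p for x, p in mu.items()
        if all(x[i] == v for i, v in partial.items())
    )


def conditional_marginal(mu, i, value, partial):
    """P[X_i = value | X_S = sigma_S] as a ratio of two counting queries."""
    extended = dict(partial)
    extended[i] = value
    return count(mu, extended) / count(mu, partial)


def sample_sequential(mu, n, q, rng):
    """Autoregressive sampling: n rounds, one coordinate fixed per round."""
    partial = {}
    for i in range(n):
        weights = [conditional_marginal(mu, i, a, partial) for a in range(q)]
        partial[i] = rng.choices(range(q), weights=weights)[0]
    return tuple(partial[i] for i in range(n))


if __name__ == "__main__":
    # Hypothetical distribution on {0, 1}^3 with two equally likely outcomes.
    mu = {(0, 0, 1): 0.5, (1, 1, 0): 0.5}
    rng = random.Random(0)
    print([sample_sequential(mu, n=3, q=2, rng=rng) for _ in range(5)])
```

Each round depends on the values fixed in earlier rounds, which is the sequential bottleneck the abstract's sublinear parallel runtime addresses.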

[LG-86] Reefknot: A Comprehensive Benchmark for Relation Hallucination Evaluation, Analysis and Mitigation in Multimodal Large Language Models

Link: https://arxiv.org/abs/2408.09429
Authors: Kening Zheng, Junkai Chen, Yibo Yan, Xin Zou, Xuming Hu
Keywords-EN: large language models, issues persistently plagued, Hallucination issues persistently, persistently plagued current, relation hallucinations
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Hallucination issues continue to plague current multimodal large language models (MLLMs). Existing research primarily focuses on object-level or attribute-level hallucinations, sidelining the more sophisticated relation hallucinations that necessitate advanced reasoning abilities from MLLMs. Besides, recent benchmarks regarding relation hallucinations lack in-depth evaluation and effective mitigation. Moreover, their datasets are typically derived from a systematic annotation process, which could introduce inherent biases due to the predefined process. To handle the aforementioned challenges, we introduce Reefknot, a comprehensive benchmark specifically targeting relation hallucinations, consisting of over 20,000 samples derived from real-world scenarios. Specifically, we first provide a systematic definition of relation hallucinations, integrating perspectives from perceptive and cognitive domains. Furthermore, we construct the relation-based corpus utilizing the representative scene graph dataset Visual Genome (VG), from which semantic triplets follow real-world distributions. Our comparative evaluation across three distinct tasks revealed a substantial shortcoming in the capabilities of current MLLMs to mitigate relation hallucinations. Finally, we advance a novel confidence-based mitigation strategy tailored to tackle the relation hallucinations problem. Across three datasets, including Reefknot, we observed an average reduction of 9.75% in the hallucination rate. We believe our paper offers valuable insights into achieving trustworthy multimodal intelligence. Our dataset and code will be released upon paper acceptance.

[LG-87] Clustering and Alignment: Understanding the Training Dynamics in Modular Addition

Link: https://arxiv.org/abs/2408.09414
Authors: Tiberiu Musat
Keywords-EN: neural networks learn, networks learn interpretable, learn interpretable algorithms, Recent studies, studies have revealed
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Recent studies have revealed that neural networks learn interpretable algorithms for many simple problems. However, little is known about how these algorithms emerge during training. In this article, we study the training dynamics of a simplified transformer with 2-dimensional embeddings on the problem of modular addition. We observe that embedding vectors tend to organize into two types of structures: grids and circles. We study these structures and explain their emergence as a result of two simple tendencies exhibited by pairs of embeddings: clustering and alignment. We propose explicit formulae for these tendencies as interaction forces between different pairs of embeddings. To show that our formulae can fully account for the emergence of these structures, we construct an equivalent particle simulation where we find that identical structures emerge. We use our insights to discuss the role of weight decay and reveal a new mechanism that links regularization and training dynamics. We also release an interactive demo to support our findings: https://modular-addition.vercel.app/.

[LG-88] GRLinQ: An Intelligent Spectrum Sharing Mechanism for Device-to-Device Communications with Graph Reinforcement Learning

Link: https://arxiv.org/abs/2408.09394
Authors: Zhiwei Shan, Xinping Yi, Le Liang, Chung-Shou Liao, Shi Jin
Keywords-EN: combinatorial optimization problem, challenging non-convex combinatorial, non-convex combinatorial optimization, involving entangled link, entangled link scheduling
Categories: Information Theory (cs.IT); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Device-to-device (D2D) spectrum sharing in wireless communications is a challenging non-convex combinatorial optimization problem, involving entangled link scheduling and power control in a large-scale network. The state-of-the-art methods, either from a model-based or a data-driven perspective, exhibit certain limitations such as the critical need for channel state information (CSI) and/or a large number of (solved) instances (e.g., network layouts) as training samples. To advance this line of research, we propose a novel hybrid model/data-driven spectrum sharing mechanism with graph reinforcement learning for link scheduling (GRLinQ), injecting information theoretical insights into machine learning models, in such a way that link scheduling and power control can be solved in an intelligent yet explainable manner. Through an extensive set of experiments, GRLinQ demonstrates superior performance to the existing model-based and data-driven link scheduling and/or power control methods, with a relaxed requirement for CSI, a substantially reduced number of unsolved instances as training samples, a possible distributed deployment, reduced online/offline computational complexity, and, more remarkably, excellent scalability and generalizability over different network scenarios and system configurations.

[LG-89] Federated Graph Learning with Structure Proxy Alignment KDD2024

Link: https://arxiv.org/abs/2408.09393
Authors: Xingbo Fu, Zihan Chen, Binchi Zhang, Chen Chen, Jundong Li
Keywords-EN: Federated Graph Learning, financial fraud detection, multiple data owners, generic Federated Learning, graph data distributed
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: Accepted by KDD 2024

Click to view abstract

Abstract:Federated Graph Learning (FGL) aims to learn graph learning models over graph data distributed in multiple data owners, which has been applied in various applications such as social recommendation and financial fraud detection. Inherited from generic Federated Learning (FL), FGL similarly has the data heterogeneity issue where the label distribution may vary significantly for distributed graph data across clients. For instance, a client can have the majority of nodes from a class, while another client may have only a few nodes from the same class. This issue results in divergent local objectives and impairs FGL convergence for node-level tasks, especially for node classification. Moreover, FGL also encounters a unique challenge for the node classification task: the nodes from a minority class in a client are more likely to have biased neighboring information, which prevents FGL from learning expressive node embeddings with Graph Neural Networks (GNNs). To grapple with the challenge, we propose FedSpray, a novel FGL framework that learns local class-wise structure proxies in the latent space and aligns them to obtain global structure proxies in the server. Our goal is to obtain the aligned structure proxies that can serve as reliable, unbiased neighboring information for node classification. To achieve this, FedSpray trains a global feature-structure encoder and generates unbiased soft targets with structure proxies to regularize local training of GNN models in a personalized way. We conduct extensive experiments over four datasets, and experiment results validate the superiority of FedSpray compared with other baselines. Our code is available at this https URL.

[LG-90] Mutual Information Multinomial Estimation

Link: https://arxiv.org/abs/2408.09377
Authors: Yanzhi Chen, Zijing Ou, Adrian Weller, Yingzhen Li
Keywords-EN: Estimating mutual information, Estimating mutual, mutual information, machine learning, fundamental yet challenging
Categories: Machine Learning (cs.LG); Information Theory (cs.IT); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:Estimating mutual information (MI) is a fundamental yet challenging task in data science and machine learning. This work proposes a new estimator for mutual information. Our main discovery is that a preliminary estimate of the data distribution can dramatically improve the MI estimate. This preliminary estimate serves as a bridge between the joint and the marginal distribution, and by comparing with this bridge distribution we can easily obtain the true difference between the joint distributions and the marginal distributions. Experiments on diverse tasks including non-Gaussian synthetic problems with known ground-truth and real-world applications demonstrate the advantages of our method.

[LG-91] Detecting the Undetectable: Combining Kolmogorov-Arnold Networks and MLP for AI-Generated Image Detection

Link: https://arxiv.org/abs/2408.09371
Authors: Taharim Rahman Anon, Jakaria Islam Emon
Keywords-EN: artificial intelligence progresses, intelligence progresses, artificial intelligence, task of distinguishing, increasingly complicated
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 8 Pages, IEEE Transactions

Click to view abstract

Abstract:As artificial intelligence progresses, the task of distinguishing between real and AI-generated images is increasingly complicated by sophisticated generative models. This paper presents a novel detection framework adept at robustly identifying images produced by cutting-edge generative AI models, such as DALL-E 3, MidJourney, and Stable Diffusion 3. We introduce a comprehensive dataset, tailored to include images from these advanced generators, which serves as the foundation for extensive evaluation. We propose a classification system that integrates semantic image embeddings with a traditional Multilayer Perceptron (MLP). This baseline system is designed to effectively differentiate between real and AI-generated images under various challenging conditions. Enhancing this approach, we introduce a hybrid architecture that combines Kolmogorov-Arnold Networks (KAN) with the MLP. This hybrid model leverages the adaptive, high-resolution feature transformation capabilities of KAN, enabling our system to capture and analyze complex patterns in AI-generated images that are typically overlooked by conventional models. In out-of-distribution testing, our proposed model consistently outperformed the standard MLP across three out-of-distribution test datasets, demonstrating superior performance and robustness in distinguishing real images from AI-generated images with impressive F1 scores.

[LG-92] Behavioral Learning of Dish Rinsing and Scrubbing based on Interruptive Direct Teaching Considering Assistance Rate

Link: https://arxiv.org/abs/2408.09360
Authors: Shumpei Wakabayashi, Kento Kawaharazuka, Kei Okada, Masayuki Inaba
Keywords-EN: expected to manipulate, dishes, human assistance, manipulate objects, robot
Categories: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: Accepted at Advanced Robotics

Click to view abstract

Abstract:Robots are expected to manipulate objects in a safe and dexterous way. For example, washing dishes is a dexterous operation that involves scrubbing the dishes with a sponge and rinsing them with water. It is necessary to learn it safely without splashing water and without dropping the dishes. In this study, we propose a safe and dexterous manipulation system. The robot learns a dynamics model of the object by estimating the state of the object and the robot itself, the control input, and the amount of human assistance required (assistance rate) after the human corrects the initial trajectory of the robot’s hands by interruptive direct teaching. By backpropagating the error between the estimated and the reference value using the acquired dynamics model, the robot can generate a control input that approaches the reference value, for example, so that human assistance is not required and the dish does not move excessively. This allows for adaptive rinsing and scrubbing of dishes with unknown shapes and properties. As a result, it is possible to generate safe actions that require less human assistance.

[LG-93] E-CGL: An Efficient Continual Graph Learner

Link: https://arxiv.org/abs/2408.09350
Authors: Jianhao Guo, Zixuan Ni, Yun Zhu, Siliang Tang
Keywords-EN: preserving previous knowledge, continual graph learning, continual graph, graph data, graph
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Continual learning has emerged as a crucial paradigm for learning from sequential data while preserving previous knowledge. In the realm of continual graph learning, where graphs continuously evolve based on streaming graph data, continual graph learning presents unique challenges that require adaptive and efficient graph learning methods in addition to the problem of catastrophic forgetting. The first challenge arises from the interdependencies between different graph data, where previous graphs can influence new data distributions. The second challenge lies in the efficiency concern when dealing with large graphs. To address these two problems, we propose an Efficient Continual Graph Learner (E-CGL) in this paper. We tackle the interdependencies issue by demonstrating the effectiveness of replay strategies and introducing a combined sampling strategy that considers both node importance and diversity. To overcome the limitation of efficiency, E-CGL leverages a simple yet effective MLP model that shares weights with a GCN during training, achieving acceleration by circumventing the computationally expensive message passing process. Our method comprehensively surpasses nine baselines on four graph continual learning datasets under two settings, while E-CGL largely reduces the catastrophic forgetting problem down to an average of -1.1%. Additionally, E-CGL achieves an average of 15.83x training time acceleration and 4.89x inference time acceleration across the four datasets. These results indicate that E-CGL not only effectively manages the correlation between different graph data during continual training but also enhances the efficiency of continual learning on large graphs. The code is publicly available at this https URL.
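
One way to picture an MLP sharing weights with a GCN: a graph convolution layer applies a weight matrix and then averages over neighbours, while the MLP applies the same weight matrix and skips the neighbour averaging. The sketch below shows a single linear layer under that reading. It assumes row-normalised adjacency with self-loops as a simplification, omits nonlinearities, and is not the E-CGL implementation; all function names are illustrative.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def normalized_adjacency(edges, num_nodes):
    """Row-normalised adjacency matrix with self-loops (undirected edges)."""
    adj = [[1.0 if i == j else 0.0 for j in range(num_nodes)]
           for i in range(num_nodes)]
    for u, v in edges:
        adj[u][v] = 1.0
        adj[v][u] = 1.0
    return [[a / sum(row) for a in row] for row in adj]


def mlp_layer(features, weight):
    """Feature transform only: no message passing."""
    return matmul(features, weight)


def gcn_layer(features, weight, adj_norm):
    """Same transform, followed by averaging over each node's neighbourhood."""
    return matmul(adj_norm, matmul(features, weight))


if __name__ == "__main__":
    features = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
    weight = [[1.0, 2.0], [3.0, 4.0]]  # shared by both layers
    adj = normalized_adjacency([(0, 1)], num_nodes=3)
    print(mlp_layer(features, weight))
    print(gcn_layer(features, weight, adj))
```

The MLP path never touches the adjacency matrix, which is where the saving from skipping message passing comes from.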

[LG-94] Improvement of Bayesian PINN Training Convergence in Solving Multi-scale PDEs with Noise

Link: https://arxiv.org/abs/2408.09340
Authors: Yilong Hou, Xi’an Li, Jinran Wu
Keywords-EN: Bayesian Physics Informed, Physics Informed Neural, Informed Neural Networks, Physics Informed, received considerable attention
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Bayesian Physics Informed Neural Networks (BPINN) have received considerable attention for inferring differential equations’ system states and physical parameters according to noisy observations. However, in practice, Hamiltonian Monte Carlo (HMC) used to estimate the internal parameters of BPINN often encounters difficulties, including poor performance and poor convergence for a given step size used to adjust the momentum of those parameters. To improve the efficacy of HMC convergence for the BPINN method and extend its application scope to multi-scale partial differential equations (PDE), we developed a robust multi-scale Bayesian PINN (dubbed MBPINN) method by integrating multi-scale deep neural networks (MscaleDNN) and Bayesian inference. In this newly proposed MBPINN method, we reframe HMC with Stochastic Gradient Descent (SGD) to ensure the most "likely" estimation is always provided, and we configure its solver as a Fourier feature mapping-induced MscaleDNN. The MBPINN method offers several key advantages: (1) it is more robust than HMC, (2) it incurs less computational cost than HMC, and (3) it is more flexible for complex problems. We demonstrate the applicability and performance of the proposed method through general Poisson and multi-scale elliptic problems in one- to three-dimensional spaces. Our findings indicate that the proposed method can avoid HMC failures and provide valid results. Additionally, our method can handle complex PDE and produce comparable results for general PDE. These findings suggest that our proposed approach has excellent potential for physics-informed machine learning for parameter estimation and solution recovery in the case of ill-posed problems.
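As a rough illustration of what a Fourier feature mapping does, the sketch below lifts a scalar input into cosine and sine features at several frequencies, so that a network fed these features can represent components at different scales. The function name and the choice of frequencies are assumptions for illustration; the abstract does not specify how MBPINN selects its scales.

```python
import math
from typing import List


def fourier_features(x: float, freqs: List[float]) -> List[float]:
    """Map a scalar x to [cos(f*x), sin(f*x)] for each frequency f."""
    feats: List[float] = []
    for f in freqs:
        feats.append(math.cos(f * x))
        feats.append(math.sin(f * x))
    return feats


# Hypothetical frequencies covering three scales.
scales = [1.0, 2.0, 4.0]
at_zero = fourier_features(0.0, scales)
at_quarter_turn = fourier_features(math.pi / 2, [1.0])
```

Each frequency contributes two features, so the input dimension grows from 1 to `2 * len(freqs)`; higher frequencies let the downstream network resolve finer oscillations in the solution.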

[LG-95] Threshold Filtering Packing for Supervised Fine-Tuning: Training Related Samples within Packs

Link: https://arxiv.org/abs/2408.09327
Authors: Jiancheng Dong, Lei Jiang, Wei Jin, Lu Cheng
Keywords: facilitate GPU processing, designed maximum length, concatenating data points, models involves concatenating, autoregressive models involves
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 13 pages, 4 figures

Abstract:Packing for Supervised Fine-Tuning (SFT) in autoregressive models involves concatenating data points of varying lengths until reaching the designed maximum length to facilitate GPU processing. However, randomly concatenating data points and feeding them into an autoregressive transformer can lead to cross-contamination of sequences due to the significant difference in their subject matter. The mainstream approaches in SFT ensure that each token in the attention calculation phase only focuses on tokens within its own short sequence, without providing additional learning signals for the preceding context. To address these challenges, we introduce Threshold Filtering Packing (TFP), a method that selects samples with related context while maintaining sufficient diversity within the same pack. Our experiments show that TFP offers a simple-to-implement and scalable approach that significantly enhances SFT performance, with observed improvements of up to 7% on GSM8K, 4% on HumanEval, and 15% on the adult-census-income dataset.
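The baseline that the abstract describes, plain packing with per-sequence attention, can be sketched as follows. This is not TFP itself: the sample selection by related context is not shown, and the function names are assumptions for illustration. `greedy_pack` concatenates samples in order until the maximum length would be exceeded, and `block_causal_mask` builds the mask that restricts each token to earlier tokens of its own sequence.

```python
from typing import List


def greedy_pack(lengths: List[int], max_len: int) -> List[List[int]]:
    """Group sample indices, in order, into packs of at most max_len tokens."""
    packs: List[List[int]] = []
    current: List[int] = []
    used = 0
    for idx, n in enumerate(lengths):
        if n > max_len:
            raise ValueError(f"sample {idx} has {n} tokens, more than max_len={max_len}")
        if used + n > max_len:
            # The current pack is full: close it and start a new one.
            packs.append(current)
            current, used = [], 0
        current.append(idx)
        used += n
    if current:
        packs.append(current)
    return packs


def block_causal_mask(seq_lens: List[int]) -> List[List[bool]]:
    """mask[i][j] is True when token i may attend to token j.

    Tokens only see earlier tokens (and themselves) inside their own sequence.
    """
    total = sum(seq_lens)
    mask = [[False] * total for _ in range(total)]
    start = 0
    for n in seq_lens:
        for i in range(start, start + n):
            for j in range(start, i + 1):
                mask[i][j] = True
        start += n
    return mask


packs = greedy_pack([3, 4, 2, 5, 1], max_len=8)
mask = block_causal_mask([2, 1])
```

With this mask, a packed sequence gains nothing from the samples that precede it in the pack, which is the missing learning signal the abstract refers to.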

[LG-96] A Probabilistic Framework for Adapting to Changing and Recurring Concepts in Data Streams

Link: https://arxiv.org/abs/2408.09324
Authors: Ben Halstead, Yun Sing Koh, Patricia Riddle, Mykola Pechenizkiy, Albert Bifet
Keywords: experience, concept drift, concept, conditions, streaming data
Categories: Machine Learning (cs.LG)

Abstract:The distribution of streaming data often changes over time as conditions change, a phenomenon known as concept drift. Only a subset of previous experience, collected in similar conditions, is relevant to learning an accurate classifier for current data. Learning from irrelevant experience describing a different concept can degrade performance. A system learning from streaming data must identify which recent experience is irrelevant when conditions change and which past experience is relevant when concepts reoccur, e.g., when weather events or financial patterns repeat. Existing streaming approaches either do not consider experience to change in relevance over time and thus cannot handle concept drift, or only consider the recency of experience and thus cannot handle recurring concepts, or only sparsely evaluate relevance and thus fail when concept drift is missed. To enable learning in changing conditions, we propose SELeCT, a probabilistic method for continuously evaluating the relevance of past experience. SELeCT maintains a distinct internal state for each concept, representing relevant experience with a unique classifier. We propose a Bayesian algorithm for estimating state relevance, combining the likelihood of drawing recent observations from a given state with a transition pattern prior based on the system’s current state.
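The combination of a likelihood with a transition prior is an application of Bayes' rule, which can be sketched in a few lines. This is a generic illustration rather than the SELeCT algorithm; the function name, the state labels, and the numbers are assumptions. Each state's relevance is proportional to the likelihood of the recent observations under that state times the prior probability of transitioning into it.

```python
from typing import Dict


def state_posterior(
    likelihoods: Dict[str, float], transition_prior: Dict[str, float]
) -> Dict[str, float]:
    """Posterior over states: likelihood times prior, normalized to sum to 1."""
    unnormalized = {s: likelihoods[s] * transition_prior[s] for s in likelihoods}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("all states have zero posterior mass")
    return {s: v / total for s, v in unnormalized.items()}


# Hypothetical example: recent data fits state "B" three times better than "A",
# and the prior considers both transitions equally plausible.
posterior = state_posterior(
    likelihoods={"A": 0.2, "B": 0.6},
    transition_prior={"A": 0.5, "B": 0.5},
)
```

A prior that favors staying in the current state would make the system slower to switch, which trades responsiveness to drift against robustness to noise.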

[LG-97] Predicting travel demand of a bike sharing system using graph convolutional neural networks

Link: https://arxiv.org/abs/2408.09317
Authors: Ali Behroozi, Ali Edrisi
Keywords: meet public demands, meet public, Public transportation systems, business operations, public demands
Categories: Machine Learning (cs.LG)

Abstract:Public transportation systems play a crucial role in daily commutes, business operations, and leisure activities, emphasizing the need for effective management to meet public demands. One approach to achieving this goal is to predict demand at the station level. Bike-sharing systems, as a form of transit service, contribute to the reduction of air and noise pollution, as well as traffic congestion. This study focuses on predicting travel demand within a bike-sharing system. A novel hybrid deep learning model called the gate graph convolutional neural network is introduced. This model enables prediction of the travel demand at station level. By integrating trajectory data, weather data, access data, and leveraging gate graph convolution networks, the accuracy of travel demand forecasting is significantly improved. The Chicago city bike-sharing system is chosen as the case study. In this investigation, the proposed model is compared to the base models used in previous literature to evaluate their performance, demonstrating that the main model exhibits better performance than the base models. By utilizing this framework, transportation planners can make informed decisions on resource allocation and rebalancing management.

[LG-98] Learning Fair Invariant Representations under Covariate and Correlation Shifts Simultaneously CIKM2024

Link: https://arxiv.org/abs/2408.09312
Authors: Dong Li, Chen Zhao, Minglai Shao, Wenjun Wang
Keywords: invariant classifier, substantial and complex, complex challenge, challenge in machine, shifted test domains
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: CIKM 2024

Abstract:Achieving the generalization of an invariant classifier from training domains to shifted test domains while simultaneously considering model fairness is a substantial and complex challenge in machine learning. Existing methods address the problem of fairness-aware domain generalization, focusing on either covariate shift or correlation shift, but rarely consider both at the same time. In this paper, we introduce a novel approach that focuses on learning a fairness-aware domain-invariant predictor within a framework addressing both covariate and correlation shifts simultaneously, ensuring its generalization to unknown test domains inaccessible during training. In our approach, data are first disentangled into content and style factors in latent spaces. Furthermore, fairness-aware domain-invariant content representations can be learned by mitigating sensitive information and retaining as much other information as possible. Extensive empirical studies on benchmark datasets demonstrate that our approach surpasses state-of-the-art methods with respect to model accuracy as well as both group and individual fairness.

[LG-99] Narrowing the Focus: Learned Optimizers for Pretrained Models

Link: https://arxiv.org/abs/2408.09310
Authors: Gus Kristiansen, Mark Sandler, Andrey Zhmoginov, Nolan Miller, Anirudh Goyal, Jihwan Lee, Max Vladymyrov
Keywords: modern deep learning, applying gradient updates, modern deep, applying gradient, deep learning
Categories: Machine Learning (cs.LG)

Abstract:In modern deep learning, the models are learned by applying gradient updates using an optimizer, which transforms the updates based on various statistics. Optimizers are often hand-designed and tuning their hyperparameters is a big part of the training process. Learned optimizers have shown some initial promise, but are generally unsuccessful as a general optimization mechanism applicable to every problem. In this work we explore a different direction: instead of learning general optimizers, we instead specialize them to a specific training environment. We propose a novel optimizer technique that learns a layer-specific linear combination of update directions provided by a set of base optimizers, effectively adapting its strategy to the specific model and dataset. When evaluated on image classification tasks, this specialized optimizer significantly outperforms both traditional off-the-shelf methods such as Adam, as well as existing general learned optimizers. Moreover, it demonstrates robust generalization with respect to model initialization, evaluating on unseen datasets, and training durations beyond its meta-training horizon.
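The idea of a layer-specific linear combination of update directions can be sketched as below. This is only an illustration of the combination step: in the paper the coefficients are learned, whereas here they are fixed, and the function name, the layer name, and the numbers are all assumptions.

```python
from typing import Dict, List


def combine_updates(
    base_updates: Dict[str, List[List[float]]],
    layer_coeffs: Dict[str, List[float]],
) -> Dict[str, List[float]]:
    """For each layer, mix the update vectors proposed by the base optimizers.

    base_updates[layer][m] is the update vector from base optimizer m.
    layer_coeffs[layer][m] is that optimizer's coefficient for this layer.
    """
    combined: Dict[str, List[float]] = {}
    for layer, updates in base_updates.items():
        coeffs = layer_coeffs[layer]
        if len(coeffs) != len(updates):
            raise ValueError(f"layer {layer!r}: need one coefficient per base optimizer")
        dim = len(updates[0])
        combined[layer] = [
            sum(c * u[k] for c, u in zip(coeffs, updates)) for k in range(dim)
        ]
    return combined


# Hypothetical example: one layer, two base optimizers, two parameters.
step = combine_updates(
    base_updates={"fc": [[1.0, 2.0], [0.0, 4.0]]},
    layer_coeffs={"fc": [0.5, 0.25]},
)
```

Keeping a separate coefficient vector per layer lets early and late layers rely on different base optimizers, at the cost of only a handful of extra values per layer.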

[LG-100] A Benchmark Time Series Dataset for Semiconductor Fabrication Manufacturing Constructed using Component-based Discrete-Event Simulation Models

Link: https://arxiv.org/abs/2408.09307
Authors: Vamsi Krishna Pendyala, Hessam S. Sarjoughian, Bala Potineni, Edward J. Yellig
Keywords: high-computing devices increase, smart manufacturing factories, Advancements in high-computing, models, high-computing devices
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Advancements in high-computing devices increase the necessity for improved and new understanding and development of smart manufacturing factories. Discrete-event models with simulators have been shown to be critical to architecting, designing, building, and operating the manufacturing of semiconductor chips. The diffusion, implantation, and lithography machines have intricate processes due to their feedforward and feedback connectivity. The dataset collected from simulations of the factory models holds the promise of generating valuable machine-learning models. As surrogate data-based models, their executions are highly efficient compared to the physics-based counterpart models. For the development of surrogate models, it is beneficial to have publicly available benchmark simulation models that are grounded in factory models that have concise structures and accurate behaviors. Hence, in this research, a dataset is devised and constructed based on a benchmark model of an Intel semiconductor fabrication factory. The model is formalized using the Parallel Discrete-Event System Specification and executed using the DEVS-Suite simulator. The time series dataset is constructed using discrete-event time trajectories. This dataset is further analyzed and used to develop baseline univariate and multivariate machine learning models. The dataset can also be utilized in the machine learning community for behavioral analysis based on formalized and scalable component-based discrete-event models and simulations.

[LG-101] Enhancing Audio-Language Models through Self-Supervised Post-Training with Text-Audio Pairs

Link: https://arxiv.org/abs/2408.09269
Authors: Anshuman Sinha, Camille Migozzi, Aubin Rey, Chao Zhang
Keywords: rapidly gained interest, contrastive learning strategies, gained interest, learning strategies, rapidly gained
Categories: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: 31 pages, 11 figures

Abstract:Research on multi-modal contrastive learning strategies for audio and text has rapidly gained interest. Contrastively trained Audio-Language Models (ALMs), such as CLAP, which establish a unified representation across audio and language modalities, have enhanced the efficacy in various subsequent tasks by providing good text-aligned audio encoders and vice versa. These improvements are evident in areas like zero-shot audio classification and audio retrieval, among others. However, the ability of these models to understand natural language and temporal relations is still a largely unexplored and open field for research. In this paper, we propose to equip the multi-modal ALMs with temporal understanding without losing their inherent prior capabilities of audio-language tasks with a temporal instillation method TeminAL. We implement a two-stage training scheme TeminAL A & B, where the model first learns to differentiate between multiple sounds in TeminAL A, followed by a phase that instills a sense of time, thereby enhancing its temporal understanding in TeminAL B. This approach results in an average performance gain of 5.28% in temporal understanding on the ESC-50 dataset, while the model remains competitive in zero-shot retrieval and classification tasks on the AudioCap/Clotho datasets. We also note the lack of proper evaluation techniques for contrastive ALMs and propose a strategy for evaluating ALMs in zero-shot settings. The general-purpose zero-shot model evaluation strategy ZSTE is used to evaluate various prior models. ZSTE demonstrates a general strategy to evaluate all ZS contrastive models. The model trained with TeminAL successfully outperforms current models on most downstream tasks.

[LG-102] Graph Classification with GNNs: Optimisation Representation and Inductive Bias

Link: https://arxiv.org/abs/2408.09266
Authors: P. Krishna Kumar a, Harish G. Ramaswamy
Keywords: detecting graph isomorphism, Theoretical studies, centered around understanding, WL-Tests for detecting, GNN learning process
Categories: Machine Learning (cs.LG)

Abstract:Theoretical studies on the representation power of GNNs have been centered around understanding the equivalence of GNNs, using WL-Tests for detecting graph isomorphism. In this paper, we argue that such equivalence ignores the accompanying optimization issues and does not provide a holistic view of the GNN learning process. We illustrate these gaps between representation and optimization with examples and experiments. We also explore the existence of an implicit inductive bias (e.g. fully connected networks prefer to learn low frequency functions in their input space) in GNNs, in the context of graph classification tasks. We further prove theoretically that the message-passing layers in the graph have a tendency to search for either discriminative subgraphs, or a collection of discriminative nodes dispersed across the graph, depending on the different global pooling layers used. We empirically verify this bias through experiments over real-world and synthetic datasets. Finally, we show how our work can help in incorporating domain knowledge via attention based architectures, and can evince their capability to discriminate coherent subgraphs.

[LG-103] ByCAN: Reverse Engineering Controller Area Network (CAN) Messages from Bit to Byte Level

Link: https://arxiv.org/abs/2408.09265
Authors: Xiaojie Lin, Baihe Ma, Xu Wang, Guangsheng Yu, Ying He, Ren Ping Liu, Wei Ni
Keywords: Controller Area Network, Area Network, Controller Area, primary standard protocol, automotive cybersecurity threats
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI); Systems and Control (eess.SY)
Comments: Accepted by IEEE Internet of Things Journal, 15 pages, 5 figures, 6 tables

Abstract:As the primary standard protocol for modern cars, the Controller Area Network (CAN) is a critical research target for automotive cybersecurity threats and autonomous applications. As the decoding specification of CAN is a proprietary black-box maintained by Original Equipment Manufacturers (OEMs), conducting related research and industry developments can be challenging without a comprehensive understanding of the meaning of CAN messages. In this paper, we propose a fully automated reverse-engineering system, named ByCAN, to reverse engineer CAN messages. ByCAN outperforms existing research by introducing byte-level clusters and integrating multiple features at both byte and bit levels. ByCAN employs the clustering and template matching algorithms to automatically decode the specifications of CAN frames without the need for prior knowledge. Experimental results demonstrate that ByCAN achieves high accuracy in slicing and labeling performance, i.e., the identification of CAN signal boundaries and labels. In the experiments, ByCAN achieves slicing accuracy of 80.21%, slicing coverage of 95.21%, and labeling accuracy of 68.72% for general labels when analyzing the real-world CAN frames.

[LG-104] PREMAP: A Unifying PREiMage APproximation Framework for Neural Networks

Link: https://arxiv.org/abs/2408.09262
Authors: Xiyue Zhang, Benjie Wang, Marta Kwiatkowska, Huan Zhang
Keywords: neural network, network verification focus, focus on bounding, neural network predictions, neural network verification
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: arXiv admin note: text overlap with arXiv:2305.03686

Abstract:Most methods for neural network verification focus on bounding the image, i.e., set of outputs for a given input set. This can be used to, for example, check the robustness of neural network predictions to bounded perturbations of an input. However, verifying properties concerning the preimage, i.e., the set of inputs satisfying an output property, requires abstractions in the input space. We present a general framework for preimage abstraction that produces under- and over-approximations of any polyhedral output set. Our framework employs cheap parameterised linear relaxations of the neural network, together with an anytime refinement procedure that iteratively partitions the input region by splitting on input features and neurons. The effectiveness of our approach relies on carefully designed heuristics and optimization objectives to achieve rapid improvements in the approximation volume. We evaluate our method on a range of tasks, demonstrating significant improvement in efficiency and scalability to high-input-dimensional image classification tasks compared to state-of-the-art techniques. Further, we showcase the application to quantitative verification and robustness analysis, presenting a sound and complete algorithm for the former and providing sound quantitative results for the latter.

[LG-105] V2X-VLM: End-to-End V2X Cooperative Autonomous Driving Through Large Vision-Language Models

Link: https://arxiv.org/abs/2408.09251
Authors: Junwei You, Haotian Shi, Zhuoyu Jiang, Zilin Huang, Rui Gan, Keshu Wu, Xi Cheng, Xiaopeng Li, Bin Ran
Keywords: systems that manage, navigation and control, increasingly focused, manage the full, full spectrum
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Advancements in autonomous driving have increasingly focused on end-to-end (E2E) systems that manage the full spectrum of driving tasks, from environmental perception to vehicle navigation and control. This paper introduces V2X-VLM, an innovative E2E vehicle-infrastructure cooperative autonomous driving (VICAD) framework with large vision-language models (VLMs). V2X-VLM is designed to enhance situational awareness, decision-making, and ultimate trajectory planning by integrating data from vehicle-mounted cameras, infrastructure sensors, and textual information. The strength of the comprehensive multimodal data fusion of the VLM enables precise and safe E2E trajectory planning in complex and dynamic driving scenarios. Validation on the DAIR-V2X dataset demonstrates that V2X-VLM outperforms existing state-of-the-art methods in cooperative autonomous driving.

[LG-106] QEDCartographer: Automating Formal Verification Using Reward-Free Reinforcement Learning ICSE

Link: https://arxiv.org/abs/2408.09237
Authors: Alex Sanchez-Stern, Abhishek Varghese, Zhanna Kaufman, Dylan Zhang, Talia Ringer, Yuriy Brun
Keywords: producing reliable software, proofs severely limits, manually writing verification, reliable software, utility in practice
Categories: Software Engineering (cs.SE); Machine Learning (cs.LG); Programming Languages (cs.PL)
Comments: Published in the International Conference on Software Engineering (ICSE) 2025: Alex Sanchez-Stern, Abhishek Varghese, Zhanna Kaufman, Dylan Zhang, Talia Ringer, and Yuriy Brun, QEDCartographer: Automating Formal Verification Using Reward-Free Reinforcement Learning, in Proceedings of the 47th International Conference on Software Engineering (ICSE), 2025

Abstract:Formal verification is a promising method for producing reliable software, but the difficulty of manually writing verification proofs severely limits its utility in practice. Recent methods have automated some proof synthesis by guiding a search through the proof space using a theorem prover. Unfortunately, the theorem prover provides only the crudest estimate of progress, resulting in effectively undirected search. To address this problem, we create QEDCartographer, an automated proof-synthesis tool that combines supervised and reinforcement learning to more effectively explore the proof space. QEDCartographer incorporates the proofs’ branching structure, enabling reward-free search and overcoming the sparse reward problem inherent to formal verification. We evaluate QEDCartographer using the CoqGym benchmark of 68.5K theorems from 124 open-source Coq projects. QEDCartographer fully automatically proves 21.4% of the test-set theorems. Previous search-based proof-synthesis tools Tok, Tac, ASTactic, Passport, and Proverbot9001, which rely only on supervised learning, prove 9.6%, 9.8%, 10.9%, 12.5%, and 19.8%, respectively. Diva, which combines 62 tools, proves 19.2%. Comparing to the most effective prior tool, Proverbot9001, QEDCartographer produces 26% shorter proofs 27% faster, on average over the theorems both tools prove. Together, QEDCartographer and non-learning-based CoqHammer prove 31.8% of the theorems, while CoqHammer alone proves 26.6%. Our work demonstrates that reinforcement learning is a fruitful research direction for improving proof-synthesis tools’ search mechanisms.

[LG-107] Scalable and Certifiable Graph Unlearning via Lazy Local Propagation

Link: https://arxiv.org/abs/2408.09212
Authors: Lu Yi, Zhewei Wei
Keywords: Graph Neural Networks, Neural Networks, crucial research area, Networks for modeling, modeling graph-structured data
Categories: Machine Learning (cs.LG)

Abstract:With the recent adoption of laws supporting the "right to be forgotten" and the widespread use of Graph Neural Networks for modeling graph-structured data, graph unlearning has emerged as a crucial research area. Current studies focus on the efficient update of model parameters. However, they often overlook the time-consuming re-computation of graph propagation required for each removal, significantly limiting their scalability on large graphs. In this paper, we present ScaleGUN, the first certifiable graph unlearning mechanism that scales to billion-edge graphs. ScaleGUN employs a lazy local propagation method to facilitate efficient updates of the embedding matrix during data removal. Such lazy local propagation can be proven to ensure certified unlearning under all three graph unlearning scenarios, including node feature, edge, and node unlearning. Extensive experiments on real-world datasets demonstrate the efficiency and efficacy of ScaleGUN. Remarkably, ScaleGUN accomplishes (ε, δ) = (1, 10⁻⁴) certified unlearning on the billion-edge graph ogbn-papers100M in 20 seconds for a 5K-random-edge removal request (of which only 5 seconds are required for updating the embedding matrix), compared to 1.91 hours for retraining and 1.89 hours for re-propagation. Our code is available online.

[LG-108] On the Improvement of Generalization and Stability of Forward-Only Learning via Neural Polarization ECAI2024

Link: https://arxiv.org/abs/2408.09210
Authors: Erik B. Terres-Escudero, Javier Del Ser, Pablo Garcia-Bringas
Keywords: contrastive forward pass, recently gained attention, additional contrastive forward, Forward-only learning algorithms, replacing the backward
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: To be published in ECAI 2024

Abstract:Forward-only learning algorithms have recently gained attention as alternatives to gradient backpropagation, replacing the backward step of this latter solver with an additional contrastive forward pass. Among these approaches, the so-called Forward-Forward Algorithm (FFA) has been shown to achieve competitive levels of performance in terms of generalization and complexity. Networks trained using FFA learn to contrastively maximize a layer-wise defined goodness score when presented with real data (denoted as positive samples) and to minimize it when processing synthetic data (respectively, negative samples). However, this algorithm still faces weaknesses that negatively affect the model accuracy and training stability, primarily due to a gradient imbalance between positive and negative samples. To overcome this issue, in this work we propose a novel implementation of the FFA algorithm, denoted as Polar-FFA, which extends the original formulation by introducing a neural division (polarization) between positive and negative instances. Neurons in each of these groups aim to maximize their goodness when presented with their respective data type, thereby creating a symmetric gradient behavior. To empirically gauge the improved learning capabilities of our proposed Polar-FFA, we perform several systematic experiments using different activation and goodness functions over image classification datasets. Our results demonstrate that Polar-FFA outperforms FFA in terms of accuracy and convergence speed. Furthermore, its lower reliance on hyperparameters reduces the need for hyperparameter tuning to guarantee optimal generalization capabilities, thereby allowing for a broader range of neural network configurations.
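The neural division can be pictured with a small sketch that splits one layer's activations into a positive group and a negative group and scores each with a goodness function. Several choices here are assumptions, not details from the abstract: the sum-of-squares goodness (the abstract says multiple goodness functions were tried), the split by index, and turning the two scores into a probability with a sigmoid of their difference.

```python
import math
from typing import List, Tuple


def goodness(acts: List[float]) -> float:
    """Sum of squared activations, one common goodness choice (assumed here)."""
    return sum(a * a for a in acts)


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def polar_scores(acts: List[float], n_pos: int) -> Tuple[float, float, float]:
    """Score the first n_pos neurons as the positive group, the rest as negative.

    Returns (g_pos, g_neg, p_positive), where p_positive is an assumed
    symmetric readout: sigmoid(g_pos - g_neg).
    """
    g_pos = goodness(acts[:n_pos])
    g_neg = goodness(acts[n_pos:])
    return g_pos, g_neg, sigmoid(g_pos - g_neg)


# Hypothetical activations: the positive group fires more strongly.
g_pos, g_neg, p_positive = polar_scores([1.0, 2.0, 0.5, 0.5], n_pos=2)
```

Because both groups are trained to increase their own goodness on their own data type, swapping the roles of the groups mirrors the readout around 0.5, which is the kind of symmetry the abstract attributes to the polarized formulation.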

[LG-109] H2PIPE: High throughput CNN Inference on FPGAs with High-Bandwidth Memory

Link: https://arxiv.org/abs/2408.09209
Authors: Mario Doumet, Marius Stan, Mathew Hall, Vaughn Betz
Keywords: Convolutional Neural Networks, Convolutional Neural, Programmable Gate Arrays, Neural Networks, frequent memory access
Categories: Hardware Architecture (cs.AR); Machine Learning (cs.LG)

Abstract:Convolutional Neural Networks (CNNs) combine large amounts of parallelizable computation with frequent memory access. Field Programmable Gate Arrays (FPGAs) can achieve low latency and high throughput CNN inference by implementing dataflow accelerators that pipeline layer-specific hardware to implement an entire network. By implementing a different processing element for each CNN layer, these layer-pipelined accelerators can achieve high compute density, but having all layers processing in parallel requires high memory bandwidth. Traditionally this has been satisfied by storing all weights on chip, but this is infeasible for the largest CNNs, which are often those most in need of acceleration. In this work we augment a state-of-the-art dataflow accelerator (HPIPE) to leverage both High-Bandwidth Memory (HBM) and on-chip storage, enabling high performance layer-pipelined dataflow acceleration of large CNNs. Based on profiling results of HBM’s latency and throughput against expected address patterns, we develop an algorithm to choose which weight buffers should be moved off chip and how deep the on-chip FIFOs to HBM should be to minimize compute unit stalling. We integrate the new hardware generation within the HPIPE domain-specific CNN compiler and demonstrate good bandwidth efficiency against theoretical limits. Compared to the best prior work we obtain speed-ups of at least 19.4x, 5.1x and 10.5x on ResNet-18, ResNet-50 and VGG-16 respectively.

[LG-110] NDDEs: A Deep Neural Network Framework for Solving Forward and Inverse Problems in Delay Differential Equations

Link: https://arxiv.org/abs/2408.09202
Authors: Housen Wang, Yuxing Chen, Sirong Cao, Xiaoli Wang, Qiang Liu
Keywords: delay differential equations, delay differential, differential equations, neural delay differential, deep neural networks
Categories: Machine Learning (cs.LG)

Abstract:This article proposes a solution framework for delay differential equations (DDEs) based on deep neural networks (DNNs) - the neural delay differential equations (NDDEs), aimed at solving the forward and inverse problems of delay differential equations. This framework embeds the delay differential equations into the neural networks to accommodate the diverse requirements of DDEs in terms of initial conditions, control equations, and known data. NDDEs adjust the network parameters through automatic differentiation and optimization algorithms to minimize the loss function, thereby obtaining numerical solutions to the delay differential equations without the grid dependence and discretization errors typical of traditional numerical methods. In addressing inverse problems, the NDDE framework can utilize observational data to perform precise estimation of single or multiple delay parameters. The results of multiple numerical experiments have shown that NDDEs demonstrate high precision in both forward and inverse problems, proving their effectiveness and promising potential in dealing with delayed differential equation issues.

[LG-111] DRL-Based Resource Allocation for Motion Blur Resistant Federated Self-Supervised Learning in IoV

Link: https://arxiv.org/abs/2408.09194
Authors: Xueying Gu, Qiong Wu, Pingyi Fan, Qiang Fan, Nan Cheng, Wen Chen, Khaled B. Letaief
Keywords: Federated Self-Supervised Learning, privacy-preserving solution, Learning, Internet of Vehicles, Federated Learning
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments: This paper has been submitted to IEEE Journal. The source code has been released at: this https URL

Abstract:In the Internet of Vehicles (IoV), Federated Learning (FL) provides a privacy-preserving solution by aggregating local models without sharing data. Traditional supervised learning requires image data with labels, but data labeling involves significant manual effort. Federated Self-Supervised Learning (FSSL) utilizes Self-Supervised Learning (SSL) for local training in FL, eliminating the need for labels while protecting privacy. Compared to other SSL methods, Momentum Contrast (MoCo) reduces the demand for computing resources and storage space by creating a dictionary. However, using MoCo in FSSL requires uploading the local dictionary from vehicles to Base Station (BS), which poses a risk of privacy leakage. Simplified Contrast (SimCo) addresses the privacy leakage issue in MoCo-based FSSL by using dual temperature instead of a dictionary to control sample distribution. Additionally, considering the negative impact of motion blur on model aggregation, and based on SimCo, we propose a motion blur-resistant FSSL method, referred to as BFSSL. Furthermore, we address energy consumption and delay in the BFSSL process by proposing a Deep Reinforcement Learning (DRL)-based resource allocation scheme, called DRL-BFSSL. In this scheme, BS allocates the Central Processing Unit (CPU) frequency and transmission power of vehicles to minimize energy consumption and latency, while aggregating received models based on the motion blur level. Simulation results validate the effectiveness of our proposed aggregation and resource allocation methods.

[LG-112] SA-GDA: Spectral Augmentation for Graph Domain Adaptation

Link: https://arxiv.org/abs/2408.09189
Authors: Jinhui Pang,Zixuan Wang,Jiliang Tang,Mingyan Xiao,Nan Yin
Keywords: achieved impressive impressions, domain, graph-related tasks, achieved impressive, impressive impressions
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Graph neural networks (GNNs) have achieved impressive performance on graph-related tasks. However, most GNNs are primarily studied in the single-domain setting with supervised training, which requires abundant task-specific labels and is difficult to transfer to other domains. Few works focus on domain adaptation for graph node classification. They mainly focus on aligning the feature space of the source and target domains, without considering the feature alignment between different categories, which may lead to confusion of classification in the target domain. However, due to the scarcity of labels in the target domain, we cannot directly perform effective alignment of categories from different domains, which makes the problem more challenging. In this paper, we present Spectral Augmentation for Graph Domain Adaptation (SA-GDA) for graph node classification. First, we observe that nodes with the same category in different domains exhibit similar characteristics in the spectral domain, while different classes are quite different. Following this observation, we align the category feature space of different domains in the spectral domain instead of aligning the whole feature space, and we theoretically prove the stability of the proposed SA-GDA. Then, we develop a dual graph convolutional network to jointly exploit local and global consistency for feature aggregation. Last, we utilize a domain classifier with an adversarial learning submodule to facilitate knowledge transfer between different domain graphs. Experimental results on a variety of publicly available datasets reveal the effectiveness of our SA-GDA.

[LG-113] PADetBench: Towards Benchmarking Physical Attacks against Object Detection

Link: https://arxiv.org/abs/2408.09181
Authors: Jiawei Lian,Jianhong Pan,Lefan Wang,Yi Wang,Lap-Pui Chau,Shaohui Mei
Keywords: significant practical implications, gained increasing attention, increasing attention due, Toggle, Physical
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Physical attacks against object detection have gained increasing attention due to their significant practical implications. However, conducting physical experiments is extremely time-consuming and labor-intensive. Moreover, physical dynamics and cross-domain transformation are challenging to strictly regulate in the real world, leading to unaligned evaluation and comparison, severely hindering the development of physically robust models. To accommodate these challenges, we explore utilizing realistic simulation to thoroughly and rigorously benchmark physical attacks with fairness under controlled physical dynamics and cross-domain transformation. This resolves the problem of capturing identical adversarial images that cannot be achieved in the real world. Our benchmark includes 20 physical attack methods, 48 object detectors, comprehensive physical dynamics, and evaluation metrics. We also provide end-to-end pipelines for dataset generation, detection, evaluation, and further analysis. In addition, we perform 8064 groups of evaluation based on our benchmark, which includes both overall evaluation and further detailed ablation studies for controlled physical dynamics. Through these experiments, we provide in-depth analyses of physical attack performance and physical adversarial robustness, draw valuable observations, and discuss potential directions for future research. 
Codebase: this https URL

[LG-114] Ranking Across Different Content Types: The Robust Beauty of Multinomial Blending RECSYS24

Link: https://arxiv.org/abs/2408.09168
Authors: Jan Malte Lichtenberg,Giuseppe Di Benedetto,Matteo Ruffini
Keywords: media streaming services, multiple content types, content types, increasing number, number of media
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: To appear in 18th ACM Conference on Recommender Systems (RecSys24), Bari, Italy. ACM, New York, NY, USA, 3 pages

Click to view abstract

Abstract:An increasing number of media streaming services have expanded their offerings to include entities of multiple content types. For instance, audio streaming services that started by offering music only, now also offer podcasts, merchandise items, and videos. Ranking items across different content types into a single slate poses a significant challenge for traditional learning-to-rank (LTR) algorithms due to differing user engagement patterns for different content types. We explore a simple method for cross-content-type ranking, called multinomial blending (MB), which can be used in conjunction with most existing LTR algorithms. We compare MB to existing baselines not only in terms of ranking quality but also from other industry-relevant perspectives such as interpretability, ease-of-use, and stability in dynamic environments with changing user behavior and ranking model retraining. Finally, we report the results of an A/B test from an Amazon Music ranking use-case.

[LG-115] Zero-Shot Object-Centric Representation Learning

Link: https://arxiv.org/abs/2408.09162
Authors: Aniket Didolkar,Andrii Zadaianchuk,Anirudh Goyal,Mike Mozer,Yoshua Bengio,Georg Martius,Maximilian Seitzer
Keywords: decompose visual scenes, object-centric representation learning, isolates the entities, representation learning, decompose visual
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:The goal of object-centric representation learning is to decompose visual scenes into a structured representation that isolates the entities. Recent successes have shown that object-centric representation learning can be scaled to real-world scenes by utilizing pre-trained self-supervised features. However, so far, object-centric methods have mostly been applied in-distribution, with models trained and evaluated on the same dataset. This is in contrast to the wider trend in machine learning towards general-purpose models directly applicable to unseen data and tasks. Thus, in this work, we study current object-centric methods through the lens of zero-shot generalization by introducing a benchmark comprising eight different synthetic and real-world datasets. We analyze the factors influencing zero-shot performance and find that training on diverse real-world images improves transferability to unseen scenarios. Furthermore, inspired by the success of task-specific fine-tuning in foundation models, we introduce a novel fine-tuning strategy to adapt pre-trained vision encoders for the task of object discovery. We find that the proposed approach results in state-of-the-art performance for unsupervised object discovery, exhibiting strong zero-shot transfer to unseen datasets.

[LG-116] Linear Attention is Enough in Spatial-Temporal Forecasting

Link: https://arxiv.org/abs/2408.09158
Authors: Xinyu Ning
Keywords: forecasting task attracted, task attracted numerous, attracted numerous attention, spatial-temporal forecasting tasks, traffic forecasting task
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:As the most representative scenario of spatial-temporal forecasting tasks, traffic forecasting has attracted considerable attention from the machine learning community due to its intricate correlations in both the space and time dimensions. Existing methods often treat road networks over time as spatial-temporal graphs, addressing spatial and temporal representations independently. However, these approaches struggle to capture the dynamic topology of road networks, encounter issues with message passing mechanisms and over-smoothing, and face challenges in learning spatial and temporal relationships separately. To address these limitations, we propose treating nodes in road networks at different time steps as independent spatial-temporal tokens and feeding them into a vanilla Transformer to learn complex spatial-temporal patterns, and we design STformer, which achieves SOTA performance. Given its quadratic complexity, we introduce a variant, NSTformer, based on the Nyström method, which approximates self-attention with linear complexity and, surprisingly, performs even slightly better than the former in a few cases. Extensive experimental results on traffic datasets demonstrate that the proposed method achieves state-of-the-art performance at an affordable computational cost. Our code will be made available.

[LG-117] On the KL-Divergence-based Robust Satisficing Model

Link: https://arxiv.org/abs/2408.09157
Authors: Haojie Yan,Minglong Zhou,Jiayi Guo
Keywords: Optimizer Curse stemming, Empirical risk minimization, Optimizer Curse, Curse stemming
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Empirical risk minimization, a cornerstone in machine learning, is often hindered by the Optimizer’s Curse stemming from discrepancies between the empirical and true data-generating distributions. To address this challenge, the robust satisficing framework has emerged recently to mitigate ambiguity in the true distribution. Distinguished by its interpretable hyperparameter and enhanced performance guarantees, this approach has attracted increasing attention from academia. However, its applicability in tackling general machine learning problems, notably deep neural networks, remains largely unexplored due to the computational challenges in solving this model efficiently across general loss functions. In this study, we delve into the Kullback-Leibler divergence based robust satisficing model under a general loss function, presenting analytical interpretations, diverse performance guarantees, efficient and stable numerical methods, convergence analysis, and an extension tailored for hierarchical data structures. Through extensive numerical experiments across three distinct machine learning tasks, we demonstrate the superior performance of our model compared to state-of-the-art benchmarks.

[LG-118] Point Source Identification Using Singularity Enriched Neural Networks

Link: https://arxiv.org/abs/2408.09143
Authors: Tianhao Hu,Bangti Jin,Zhi Zhou
Keywords: recovering point sources, applied inverse problems, point sources, point sources represents, point source identification
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG)
Comments: 22 pages

Click to view abstract

Abstract:The inverse problem of recovering point sources represents an important class of applied inverse problems. However, there is still a lack of neural network-based methods for point source identification, mainly due to the inherent solution singularity. In this work, we develop a novel algorithm to identify point sources, utilizing a neural network combined with a singularity enrichment technique. We employ the fundamental solution and neural networks to represent the singular and regular parts, respectively, and then minimize an empirical loss involving the intensities and locations of the unknown point sources, as well as the parameters of the neural network. Moreover, by combining the conditional stability argument of the inverse problem with the generalization error of the empirical loss, we conduct a rigorous error analysis of the algorithm. We demonstrate the effectiveness of the method with several challenging experiments.

[LG-119] Learning to Explore for Stochastic Gradient MCMC

Link: https://arxiv.org/abs/2408.09140
Authors: SeungHyun Kim,Seohyeon Jung,Seonghyeon Kim,Juho Lee
Keywords: Bayesian Neural Networks, Bayesian Neural, Neural Networks, high-dimensional parameters pose, posterior inference due
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Bayesian Neural Networks (BNNs) with high-dimensional parameters pose a challenge for posterior inference due to the multi-modality of the posterior distributions. Stochastic Gradient MCMC (SGMCMC) with cyclical learning rate scheduling is a promising solution, but it requires a large number of sampling steps to explore high-dimensional multi-modal posteriors, making it computationally expensive. In this paper, we propose a meta-learning strategy to build SGMCMC which can efficiently explore the multi-modal target distributions. Our algorithm allows the learned SGMCMC to quickly explore the high-density region of the posterior landscape. Also, we show that this exploration property is transferable to various tasks, even for the ones unseen during a meta-training stage. Using popular image classification benchmarks and a variety of downstream tasks, we demonstrate that our method significantly improves the sampling efficiency, achieving better performance than vanilla SGMCMC without incurring significant computational overhead.

[LG-120] Vanilla Gradient Descent for Oblique Decision Trees ECAI2024

Link: https://arxiv.org/abs/2408.09135
Authors: Subrat Prasad Panda,Blaise Genest,Arvind Easwaran,Ponnuthurai Nagaratnam Suganthan
Keywords: major highly non-linear, DTs, non-linear AI models, tabular data
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Published in ECAI2024. This version includes supplementary material

Click to view abstract

Abstract:Decision Trees (DTs) constitute one of the major highly non-linear AI models, valued, e.g., for their efficiency on tabular data. Learning accurate DTs is, however, complicated, especially for oblique DTs, and does take a significant training time. Further, DTs suffer from overfitting, e.g., they proverbially “do not generalize” in regression tasks. Recently, some works proposed ways to make (oblique) DTs differentiable. This enables highly efficient gradient-descent algorithms to be used to learn DTs. It also enables generalizing capabilities by learning regressors at the leaves simultaneously with the decisions in the tree. Prior approaches to making DTs differentiable rely either on probabilistic approximations at the tree’s internal nodes (soft DTs) or on approximations in gradient computation at the internal node (quantized gradient descent). In this work, we propose DTSemNet, a novel semantically equivalent and invertible encoding for (hard, oblique) DTs as Neural Networks (NNs), that uses standard vanilla gradient descent. Experiments across various classification and regression benchmarks show that oblique DTs learned using DTSemNet are more accurate than oblique DTs of similar size learned using state-of-the-art techniques. Further, DT training time is significantly reduced. We also experimentally demonstrate that DTSemNet can learn DT policies as efficiently as NN policies in the Reinforcement Learning (RL) setup with physical inputs (dimensions ≤ 32). The code is available at this https URL.

[LG-121] Markov Balance Satisfaction Improves Performance in Strictly Batch Offline Imitation Learning

Link: https://arxiv.org/abs/2408.09125
Authors: Rishabh Agrawal,Nathan Dahlin,Rahul Jain,Ashutosh Nayyar
Keywords: directly programming behaviors, defining optimal control, optimal control costs, costs is challenging, notably effective
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Imitation learning (IL) is notably effective for robotic tasks where directly programming behaviors or defining optimal control costs is challenging. In this work, we address a scenario where the imitator relies solely on observed behavior and cannot make environmental interactions during learning. It does not have additional supplementary datasets beyond the expert’s dataset nor any information about the transition dynamics. Unlike state-of-the-art (SOTA) IL methods, this approach tackles the limitations of conventional IL by operating in a more constrained and realistic setting. Our method uses the Markov balance equation and introduces a novel conditional density estimation-based imitation learning framework. It employs conditional normalizing flows for transition dynamics estimation and aims at satisfying a balance equation for the environment. Through a series of numerical experiments on Classic Control and MuJoCo environments, we demonstrate consistently superior empirical performance compared to many SOTA IL algorithms.

[LG-122] Dynamic Neural Dowker Network: Approximating Persistent Homology in Dynamic Directed Graphs KDD2024

Link: https://arxiv.org/abs/2408.09123
Authors: Hao Li,Hao Jiang,Jiajun Fan,Dongsheng Ye,Liang Du
Keywords: Topological Data Analysis, Data Analysis, encounters computational difficulties, Persistent homology, Topological Data
Categories: Machine Learning (cs.LG)
Comments: KDD 2024

Click to view abstract

Abstract:Persistent homology, a fundamental technique within Topological Data Analysis (TDA), captures structural and shape characteristics of graphs, yet encounters computational difficulties when applied to dynamic directed graphs. This paper introduces the Dynamic Neural Dowker Network (DNDN), a novel framework specifically designed to approximate the results of dynamic Dowker filtration, aiming to capture the high-order topological features of dynamic directed graphs. Our approach creatively uses line graph transformations to produce both source and sink line graphs, highlighting the shared neighbor structures that Dowker complexes focus on. The DNDN incorporates a Source-Sink Line Graph Neural Network (SSLGNN) layer to effectively capture the neighborhood relationships among dynamic edges. Additionally, we introduce an innovative duality edge fusion mechanism, ensuring that the results for both the sink and source line graphs adhere to the duality principle intrinsic to Dowker complexes. Our approach is validated through comprehensive experiments on real-world datasets, demonstrating DNDN’s capability not only to effectively approximate dynamic Dowker filtration results but also to perform exceptionally in dynamic graph classification tasks.

[LG-123] Selective Prompt Anchoring for Code Generation

Link: https://arxiv.org/abs/2408.09121
Authors: Yuan Tian,Tianyi Zhang
Keywords: large language models, automating coding tasks, transformed software development, Recent advances, Copilot and ChatGPT
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments: Under review

Click to view abstract

Abstract:Recent advances in large language models (LLMs) such as Copilot and ChatGPT have transformed software development by automating coding tasks. Despite these advancements, challenges remain in reducing error rates and fully meeting user expectations. Our empirical study reveals LLMs tend to dilute their self-attention on the initial prompt as more code tokens are generated. We hypothesize this self-attention dilution issue is one of the root causes of inaccuracies in LLM-generated code. To mitigate this issue, we propose Selective Prompt Anchoring (SPA). SPA amplifies the influence of the selected parts in the initial prompt, which we refer to as “anchored text”, during code generation. Specifically, SPA calculates the logit distribution difference with and without the anchored text. We prove this difference approximates the anchored text’s contextual contribution to the output logits. SPA creates an augmented logit distribution by linearly combining the original logit distribution and the logit difference. We evaluate SPA with five LLMs on four benchmarks. Our results demonstrate that using SPA can consistently improve Pass@1 rates by up to 9.7% in all settings. Notably, with selective text anchoring, a small version of DeepSeek-Coder (6.7B) can achieve better performance than an original much larger version (33B). Our code is available at this https URL.
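
The linear combination described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `anchored_logits`, the parameter `weight`, and the toy logit values are all assumptions made for illustration, and the abstract does not say how the combination weight is chosen.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def anchored_logits(original, without_anchor, weight):
    """Linearly combine the original logits with the logit difference.

    original:       logits computed with the full prompt.
    without_anchor: logits computed with the anchored text masked out.
    weight:         hypothetical anchoring strength; 0 leaves the logits unchanged.
    """
    if len(original) != len(without_anchor):
        raise ValueError("logit vectors must have the same length")
    return [o + weight * (o - m) for o, m in zip(original, without_anchor)]


# Toy vocabulary of three tokens; the values are made up for illustration.
original = [2.0, 1.0, 0.5]
without_anchor = [2.0, 0.2, 0.5]  # token 1 depends on the anchored text
augmented = anchored_logits(original, without_anchor, weight=1.5)
probs = softmax(augmented)
```

In this toy example only the second token's logit changes when the anchored text is masked, so it is the only one that gets amplified.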

[LG-124] Training Verifiably Robust Agents Using Set-Based Reinforcement Learning

Link: https://arxiv.org/abs/2408.09112
Authors: Manuel Wendl,Lukas Koller,Tobias Ladner,Matthias Althoff
Keywords: complex control tasks, solve complex control, control tasks, neural networks, solve complex
Categories: Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY)
Comments:

Click to view abstract

Abstract:Reinforcement learning often uses neural networks to solve complex control tasks. However, neural networks are sensitive to input perturbations, which makes their deployment in safety-critical environments challenging. This work lifts recent results from formally verifying neural networks against such disturbances to reinforcement learning in continuous state and action spaces using reachability analysis. While previous work mainly focuses on adversarial attacks for robust reinforcement learning, we train neural networks utilizing entire sets of perturbed inputs and maximize the worst-case reward. The obtained agents are verifiably more robust than agents obtained by related work, making them more applicable in safety-critical environments. This is demonstrated with an extensive empirical evaluation of four different benchmarks.

[LG-125] Improved Q-learning based Multi-hop Routing for UAV-Assisted Communication

Link: https://arxiv.org/abs/2408.09109
Authors: N P Sharvari,Dibakar Das,Jyotsna Bapat,Debabrata Das
Keywords: Designing effective Unmanned, Unmanned Aerial Vehicle, effective Unmanned Aerial, limited battery capacity, Designing effective
Categories: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
Comments: arXiv admin note: substantial text overlap with arXiv:2308.16719

Click to view abstract

Abstract:Designing effective Unmanned Aerial Vehicle (UAV)-assisted routing protocols is challenging due to changing topology, limited battery capacity, and the dynamic nature of communication environments. Current protocols prioritize optimizing individual network parameters, overlooking the necessity for a nuanced approach in scenarios with intermittent connectivity, fluctuating signal strength, and varying network densities, ultimately failing to address aerial network requirements comprehensively. This paper proposes a novel, Improved Q-learning-based Multi-hop Routing (IQMR) algorithm for optimal UAV-assisted communication systems. Using Q(λ) learning for routing decisions, IQMR substantially enhances energy efficiency and network data throughput. IQMR improves system resilience by prioritizing reliable connectivity and inter-UAV collision avoidance while integrating real-time network status information, all in the absence of predefined UAV path planning, thus ensuring dynamic adaptability to evolving network conditions. The results validate IQMR’s adaptability to changing system conditions and superiority over the current techniques. IQMR showcases 36.35% and 32.05% improvements in energy efficiency and data throughput over the existing methods.
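
For readers unfamiliar with Q(λ) learning, the following is a minimal tabular sketch, assuming Watkins's variant with accumulating eligibility traces. The abstract does not specify which variant IQMR uses or how its states, actions, and rewards are defined, so the node names, rewards, and the helper `q_lambda_update` below are hypothetical.

```python
from collections import defaultdict


def q_lambda_update(q, traces, state, action, reward, next_state,
                    next_actions, next_action,
                    alpha=0.1, gamma=0.9, lam=0.8):
    """Apply one step of Watkins's Q(lambda) with accumulating traces.

    q and traces map (state, action) pairs to floats and are updated in place.
    next_actions lists the actions available in next_state (empty if terminal);
    next_action is the action the behaviour policy will actually take there.
    """
    next_values = [q[(next_state, a)] for a in next_actions]
    best_next = max(next_values, default=0.0)
    # Traces survive only while the behaviour policy stays greedy.
    is_greedy = bool(next_actions) and q[(next_state, next_action)] == best_next

    delta = reward + gamma * best_next - q[(state, action)]
    traces[(state, action)] += 1.0
    for key in list(traces):
        q[key] += alpha * delta * traces[key]
        traces[key] = gamma * lam * traces[key] if is_greedy else 0.0


# Hypothetical two-hop route A -> B -> C, where an action is the next hop.
q = defaultdict(float)
traces = defaultdict(float)
q_lambda_update(q, traces, "A", "B", reward=-1.0,
                next_state="B", next_actions=["C", "D"], next_action="C")
q_lambda_update(q, traces, "B", "C", reward=10.0,
                next_state="C", next_actions=[], next_action=None)
```

The second update also raises the value of the earlier hop (A to B) through its decayed trace, which is the practical difference from one-step Q-learning.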

[LG-126] Dynamic Graph Representation Learning for Passenger Behavior Prediction

Link: https://arxiv.org/abs/2408.09092
Authors: Mingxuan Xie,Tao Zou,Junchen Ye,Bowen Du,Runhe Huang
Keywords: timely risk management, track passenger travel, urban station passenger, station passenger flow, alighting data
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Passenger behavior prediction aims to track passenger travel patterns through historical boarding and alighting data, enabling the analysis of urban station passenger flow and timely risk management. This is crucial for smart city development and public transportation planning. Existing research primarily relies on statistical methods and sequential models to learn from individual historical interactions, which ignores the correlations between passengers and stations. To address these issues, this paper proposes DyGPP, which leverages dynamic graphs to capture the intricate evolution of passenger behavior. First, we formalize passengers and stations as heterogeneous vertices in a dynamic graph, with connections between vertices representing interactions between passengers and stations. Then, we sample the historical interaction sequences for passengers and stations separately. We capture the temporal patterns from individual sequences and correlate the temporal behavior between the two sequences. Finally, we use an MLP-based encoder to learn the temporal patterns in the interactions and generate real-time representations of passengers and stations. Experiments on real-world datasets confirmed that DyGPP outperformed current models in the behavior prediction task, demonstrating the superiority of our model.

[LG-127] Twin Sorting Dynamic Programming Assisted User Association and Wireless Bandwidth Allocation for Hierarchical Federated Learning

Link: https://arxiv.org/abs/2408.09076
Authors: Rung-Hung Gau,Ting-Yu Wang,Chun-Hung Liu
Keywords: hierarchical federated learning, federated learning system, wireless bandwidth allocation, bandwidth allocation, hierarchical federated
Categories: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments: 14 pages

Click to view abstract

Abstract:In this paper, we study user association and wireless bandwidth allocation for a hierarchical federated learning system that consists of mobile users, edge servers, and a cloud server. To minimize the length of a global round in hierarchical federated learning with equal bandwidth allocation, we formulate a combinatorial optimization problem. We design the twin sorting dynamic programming (TSDP) algorithm that obtains a globally optimal solution in polynomial time when there are two edge servers. In addition, we put forward the TSDP-assisted algorithm for user association when there are three or more edge servers. Furthermore, given a user association matrix, we formulate and solve a convex optimization problem for optimal wireless bandwidth allocation. Simulation results show that the proposed approach outperforms a number of alternative schemes.

[LG-128] Improving Rare Word Translation With Dictionaries and Attention Masking

Link: https://arxiv.org/abs/2408.09075
Authors: Kenneth J. Sible,David Chiang
Keywords: dominant encoder-decoder architecture, encoder-decoder architecture, translation settings, dominant encoder-decoder, machine translation
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:In machine translation, rare words continue to be a problem for the dominant encoder-decoder architecture, especially in low-resource and out-of-domain translation settings. Human translators solve this problem with monolingual or bilingual dictionaries. In this paper, we propose appending definitions from a bilingual dictionary to source sentences and using attention masking to link together rare words with their definitions. We find that including definitions for rare words improves performance by up to 1.0 BLEU and 1.6 MacroF1.

[LG-129] Gradient-Variation Online Learning under Generalized Smoothness

Link: https://arxiv.org/abs/2408.09074
Authors: Yan-Feng Xie,Peng Zhao,Zhi-Hua Zhou
Keywords: receiving increased attention, attaining fast convergence, increased attention, aims to achieve, crucial for attaining
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:

Click to view abstract

Abstract:Gradient-variation online learning aims to achieve regret guarantees that scale with the variations in the gradients of online functions, which has been shown to be crucial for attaining fast convergence in games and robustness in stochastic optimization, hence receiving increased attention. Existing results often require the smoothness condition by imposing a fixed bound on the gradient Lipschitzness, but this may not hold in practice. Recent efforts in neural network optimization suggest a generalized smoothness condition, allowing smoothness to correlate with gradient norms. In this paper, we systematically study gradient-variation online learning under generalized smoothness. To this end, we extend the classic optimistic mirror descent algorithm to derive gradient-variation bounds by conducting stability analysis over the optimization trajectory and exploiting smoothness locally. Furthermore, we explore universal online learning, designing a single algorithm enjoying optimal gradient-variation regrets for convex and strongly convex functions simultaneously without knowing curvature information. The algorithm adopts a two-layer structure with a meta-algorithm running over a group of base-learners. To ensure favorable guarantees, we have designed a new meta-algorithm that is Lipschitz-adaptive to handle potentially unbounded gradients and meanwhile ensures second-order regret to cooperate with base-learners. Finally, we provide implications of our findings and obtain new results in fast-rate games and stochastic extended adversarial optimization.

[LG-130] Enhancing Community Detection in Networks: A Comparative Analysis of Local Metrics and Hierarchical Algorithms

Link: https://arxiv.org/abs/2408.09072
Authors: Julio-Omar Palacio-Niño,Fernando Berzal
Keywords: understanding social behavior, social behavior, increasingly relevant, relevant for understanding, understanding social
Categories: Social and Information Networks (cs.SI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:The analysis and detection of communities in network structures are becoming increasingly relevant for understanding social behavior. One of the principal challenges in this field is the complexity of existing algorithms. The Girvan-Newman algorithm, which uses the betweenness metric as a measure of node similarity, is one of the most representative algorithms in this area. This study employs the same method to evaluate the relevance of using local similarity metrics for community detection. A series of local metrics were tested on a set of networks constructed using the Girvan-Newman basic algorithm. The efficacy of these metrics was evaluated by applying the base algorithm to several real networks with varying community sizes, using modularity and NMI. The results indicate that approaches based on local similarity metrics have significant potential for community detection.
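
One way to picture swapping betweenness for a local similarity metric in a Girvan-Newman-style divisive procedure is sketched below. This is an illustration under stated assumptions, not the authors' code: it uses Jaccard similarity over closed neighbourhoods as the local metric, and the helpers `jaccard`, `components`, and `split_once` are hypothetical names. The abstract does not list which metrics the study actually tested.

```python
def jaccard(adj, u, v):
    """Jaccard similarity of the closed neighbourhoods of u and v."""
    a = adj[u] | {u}
    b = adj[v] | {v}
    return len(a & b) / len(a | b)


def components(adj):
    """Return the connected components of an adjacency-set graph."""
    seen, parts = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, part = [start], set()
        while stack:
            node = stack.pop()
            if node in part:
                continue
            part.add(node)
            stack.extend(adj[node] - part)
        seen |= part
        parts.append(part)
    return parts


def split_once(edges):
    """Remove the least-similar edge repeatedly until the graph splits.

    Similarities are recomputed after every removal, mirroring how
    Girvan-Newman recomputes betweenness after each edge removal.
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    start = len(components(adj))
    while len(components(adj)) == start and any(adj.values()):
        u, v = min(((a, b) for a in adj for b in adj[a] if a < b),
                   key=lambda edge: jaccard(adj, *edge))
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)


# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
parts = split_once(edges)
```

The appeal of a local metric is cost: each edge score depends only on the two endpoints' neighbourhoods, whereas betweenness requires shortest paths over the whole graph.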

[LG-131] Linking Robustness and Generalization: A k* Distribution Analysis of Concept Clustering in Latent Space for Vision Models

Link: https://arxiv.org/abs/2408.09065
Authors: Shashank Kotyan, Pin-Yu Chen, Danilo Vasconcellos Vargas
Keywords: latent space, latent, space, vision models, assess latent space
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Most evaluations of vision models use indirect methods to assess latent space quality. These methods often involve adding extra layers to project the latent space into a new one. This projection makes it difficult to analyze and compare the original latent space. This article uses the k* Distribution, a local neighborhood analysis method, to examine the learned latent space at the level of individual concepts, which can be extended to examine the entire latent space. We introduce skewness-based true and approximate metrics for interpreting individual concepts to assess the overall quality of vision models’ latent space. Our findings indicate that current vision models frequently fracture the distributions of individual concepts within the latent space. Nevertheless, as these models improve in generalization across multiple datasets, the degree of fracturing diminishes. A similar trend is observed in robust vision models, where increased robustness correlates with reduced fracturing. Ultimately, this approach enables a direct interpretation and comparison of the latent spaces of different vision models and reveals a relationship between a model’s generalizability and robustness. Results show that as a model becomes more general and robust, it tends to learn features that result in better clustering of concepts. Project Website is available online at this https URL

[LG-132] MoRA: LoRA Guided Multi-Modal Disease Diagnosis with Missing Modality MICCAI2024

Link: https://arxiv.org/abs/2408.09064
Authors: Zhiyi Shi, Junsik Kim, Wanhua Li, Yicong Li, Hanspeter Pfister
Keywords: Multi-modal pre-trained models, Multi-modal pre-trained, low memory requirements, models efficiently extract, efficiently extract
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted by MICCAI 2024

Abstract:Multi-modal pre-trained models efficiently extract and fuse features from different modalities with low memory requirements for fine-tuning. Despite this efficiency, their application in disease diagnosis is under-explored. A significant challenge is the frequent occurrence of missing modalities, which impairs performance. Additionally, fine-tuning the entire pre-trained model demands substantial computational resources. To address these issues, we introduce Modality-aware Low-Rank Adaptation (MoRA), a computationally efficient method. MoRA projects each input to a low intrinsic dimension but uses different modality-aware up-projections for modality-specific adaptation in cases of missing modalities. Practically, MoRA integrates into the first block of the model, significantly improving performance when a modality is missing. It requires minimal computational resources, with less than 1.6% of the trainable parameters needed compared to training the entire model. Experimental results show that MoRA outperforms existing techniques in disease diagnosis, demonstrating superior performance, robustness, and training efficiency.
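One way to picture the "shared low-rank projection, modality-aware up-projection" idea is the sketch below. It is an illustration under stated assumptions rather than the MoRA implementation: the class name `ModalityAwareLowRank`, the case labels, and the LoRA-style zero initialization of the up-projections are all assumptions not given in the abstract.

```python
import random

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

class ModalityAwareLowRank:
    """Frozen weight W plus a low-rank update B[case] @ A @ x.

    A (rank x d_in) is shared across inputs; B[case] (d_out x rank) is selected
    by which modality is missing. Only A and B would be trained.
    """

    def __init__(self, W, rank, cases, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(W), len(W[0])
        self.W = W  # frozen pre-trained weight
        self.A = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(rank)]
        # Zero-initialized up-projections (assumed, as in LoRA): no change at start.
        self.B = {c: [[0.0] * rank for _ in range(d_out)] for c in cases}

    def forward(self, x, case):
        low = matvec(self.A, x)            # project to the low intrinsic dimension
        delta = matvec(self.B[case], low)  # case-specific up-projection
        base = matvec(self.W, x)
        return [b + d for b, d in zip(base, delta)]
```

With zero-initialized up-projections the layer reproduces the frozen model exactly, and changing the up-projection for one missing-modality case leaves the other cases untouched.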

[LG-133] Learning to Route for Dynamic Adapter Composition in Continual Learning with Language Models

Link: https://arxiv.org/abs/2408.09053
Authors: Vladimir Araujo, Marie-Francine Moens, Tinne Tuytelaars
Keywords: pre-trained language models, Parameter-efficient fine-tuning, language models, continual learning, PEFT
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Parameter-efficient fine-tuning (PEFT) methods are increasingly used with pre-trained language models (PLMs) for continual learning (CL). These methods involve training a PEFT module for each new task and using similarity-based selection to route modules during inference. However, they face two major limitations: 1) interference with already learned modules and 2) suboptimal routing when composing modules. In this paper, we introduce a method that isolates the training of PEFT modules for task specialization. Then, before evaluation, it learns to compose the previously learned modules by training a router that leverages samples from a small memory. We evaluate our method in two CL setups using several benchmarks. Our results show that our method provides a better composition of PEFT modules, leading to better generalization and performance compared to previous methods.

[LG-134] Improving VTE Identification through Language Models from Radiology Reports: A Comparative Study of Mamba, Phi-3 Mini, and BERT

Link: https://arxiv.org/abs/2408.09043
Authors: Jamie Deng, Yusen Wu, Yelena Yesha, Phuong Nguyen
Keywords: critical cardiovascular condition, Venous thromboembolism, encompassing deep vein, deep vein thrombosis, cardiovascular condition
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Venous thromboembolism (VTE) is a critical cardiovascular condition, encompassing deep vein thrombosis (DVT) and pulmonary embolism (PE). Accurate and timely identification of VTE is essential for effective medical care. This study builds upon our previous work, which addressed VTE detection using deep learning methods for DVT and a hybrid approach combining deep learning and rule-based classification for PE. Our earlier approaches, while effective, had two major limitations: they were complex and required expert involvement for feature engineering of the rule set. To overcome these challenges, we utilize the Mamba architecture-based classifier. This model achieves remarkable results, with a 97% accuracy and F1 score on the DVT dataset and a 98% accuracy and F1 score on the PE dataset. In contrast to the previous hybrid method on PE identification, the Mamba classifier eliminates the need for hand-engineered rules, significantly reducing model complexity while maintaining comparable performance. Additionally, we evaluated a lightweight Large Language Model (LLM), Phi-3 Mini, in detecting VTE. While this model delivers competitive results, outperforming the baseline BERT models, it proves to be computationally intensive due to its larger parameter set. Our evaluation shows that the Mamba-based model demonstrates superior performance and efficiency in VTE identification, offering an effective solution to the limitations of previous approaches.

[LG-135] Error Bounds For Gaussian Process Regression Under Bounded Support Noise With Applications To Safety Certification

Link: https://arxiv.org/abs/2408.09033
Authors: Robert Reed, Luca Laurenti, Morteza Lahijanian
Keywords: Gaussian Process Regression, Gaussian Process, Process Regression, powerful and elegant, elegant method
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 11 pages, 5 figures, 6 tables

Abstract:Gaussian Process Regression (GPR) is a powerful and elegant method for learning complex functions from noisy data with a wide range of applications, including in safety-critical domains. Such applications have two key features: (i) they require rigorous error quantification, and (ii) the noise is often bounded and non-Gaussian due to, e.g., physical constraints. While error bounds for applying GPR in the presence of non-Gaussian noise exist, they tend to be overly restrictive and conservative in practice. In this paper, we provide novel error bounds for GPR under bounded support noise. Specifically, by relying on concentration inequalities and assuming that the latent function has low complexity in the reproducing kernel Hilbert space (RKHS) corresponding to the GP kernel, we derive both probabilistic and deterministic bounds on the error of the GPR. We show that these errors are substantially tighter than existing state-of-the-art bounds and are particularly well-suited for GPR with neural network kernels, i.e., Deep Kernel Learning (DKL). Furthermore, motivated by applications in safety-critical domains, we illustrate how these bounds can be combined with stochastic barrier functions to successfully quantify the safety probability of an unknown dynamical system from finite data. We validate the efficacy of our approach through several benchmarks and comparisons against existing bounds. The results show that our bounds are consistently smaller, and that DKLs can produce error bounds tighter than sample noise, significantly improving the safety probability of control systems.

[LG-136] AdaRank: Disagreement Based Module Rank Prediction for Low-rank Adaptation

Link: https://arxiv.org/abs/2408.09015
Authors: Yihe Dong
Keywords: general-purpose foundational model, common practice, rise of language, language and multimodal, general-purpose foundational
Categories: Machine Learning (cs.LG)

Abstract:With the rise of language and multimodal models of ever-increasing size, pretraining a general-purpose foundational model and adapting it to downstream tasks has become common practice. To this end, adaptation efficiency can be a critical bottleneck given the large model sizes, hence efficient finetuning methods such as LoRA have become prevalent. However, LoRA is typically applied with the same rank across all model layers, despite mounting evidence from transfer learning literature that during finetuning, later layers diverge more from pretrained weights. Inspired by the theory and observations around feature learning and module criticality, we develop a simple model disagreement based technique to predict the rank of a given module relative to the other modules. Empirically, AdaRank generalizes notably better on unseen data than using uniform ranks with the same number of parameters. Compared to prior work, AdaRank has the unique advantage of leaving the pretraining and adaptation stages completely intact: no need for any additional objectives or regularizers, which can hinder adaptation accuracy and performance. Our code is publicly available at this https URL.

[LG-137] An optimal pairwise merge algorithm improves the quality and consistency of nonnegative matrix factorization

Link: https://arxiv.org/abs/2408.09013
Authors: Youdong Guo, Timothy E. Holy
Keywords: Non-negative matrix factorization, Non-negative matrix, matrix factorization, source separation, key technique
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)

Abstract:Non-negative matrix factorization (NMF) is a key technique for feature extraction and widely used in source separation. However, existing algorithms may converge to poor local minima, or to one of several minima with similar objective value but differing feature parametrizations. Additionally, the performance of NMF greatly depends on the number of components, but choosing the optimal count remains a challenge. Here we show that some of these weaknesses may be mitigated by performing NMF in a higher-dimensional feature space and then iteratively combining components with an analytically-solvable pairwise merge strategy. Experimental results demonstrate our method helps NMF achieve better local optima and greater consistency of the solutions. Iterative merging also provides an efficient and informative framework for choosing the number of components. Surprisingly, despite these extra steps, our approach often improves computational performance by reducing the occurrence of "convergence stalling" near saddle points. This can be recommended as a preferred approach for most applications of NMF.

[LG-138] Classifier-Free Guidance is a Predictor-Corrector

Link: https://arxiv.org/abs/2408.09000
Authors: Arwen Bradley, Preetum Nakkiran
Keywords: CFG, foundations of classifier-free, theoretical foundations, classifier-free guidance, shaky theoretical footing
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: AB and PN contributed equally

Abstract:We investigate the theoretical foundations of classifier-free guidance (CFG). CFG is the dominant method of conditional sampling for text-to-image diffusion models, yet unlike other aspects of diffusion, it remains on shaky theoretical footing. In this paper, we disprove common misconceptions, by showing that CFG interacts differently with DDPM (Ho et al., 2020) and DDIM (Song et al., 2021), and neither sampler with CFG generates the gamma-powered distribution $p(x|c)^{\gamma} p(x)^{1-\gamma}$. Then, we clarify the behavior of CFG by showing that it is a kind of predictor-corrector method (Song et al., 2020) that alternates between denoising and sharpening, which we call predictor-corrector guidance (PCG). We prove that in the SDE limit, CFG is actually equivalent to combining a DDIM predictor for the conditional distribution together with a Langevin dynamics corrector for a gamma-powered distribution (with a carefully chosen gamma). Our work thus provides a lens to theoretically understand CFG by embedding it in a broader design space of principled sampling methods.
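As background for the gamma-powered distribution mentioned in this abstract, the identity below shows why the usual CFG score looks like the score of that distribution at a fixed noise level. The notation is an assumption for illustration: $p_t$ denotes the data distribution noised to level $t$, and $\gamma$ the guidance strength.

```latex
% Assumed notation: p_t = distribution at noise level t, \gamma = guidance strength.
\begin{aligned}
\nabla_x \log\!\left[ p_t(x \mid c)^{\gamma}\, p_t(x)^{1-\gamma} \right]
  &= \gamma\, \nabla_x \log p_t(x \mid c) + (1-\gamma)\, \nabla_x \log p_t(x) \\
  &= \nabla_x \log p_t(x) + \gamma \left[ \nabla_x \log p_t(x \mid c) - \nabla_x \log p_t(x) \right]
\end{aligned}
```

The second line is the familiar "unconditional score plus scaled guidance term" form. One way to read the abstract's negative result is that this identity holds separately at each noise level, whereas noising the gamma-powered data distribution does not in general give the gamma-powered product of the noised distributions, so a sampler driven by this score need not end at $p(x|c)^{\gamma} p(x)^{1-\gamma}$.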

[LG-139] Model-based RL as a Minimalist Approach to Horizon-Free and Second-Order Bounds

Link: https://arxiv.org/abs/2408.08994
Authors: Zhiyong Wang, Dongruo Zhou, John C.S. Lui, Wen Sun
Keywords: Maximum Likelihood Estimation, simplest Model-based Reinforcement, Model-based Reinforcement Learning, Likelihood Estimation, Maximum Likelihood
Categories: Machine Learning (cs.LG)

Abstract:Learning a transition model via Maximum Likelihood Estimation (MLE) followed by planning inside the learned model is perhaps the most standard and simplest Model-based Reinforcement Learning (RL) framework. In this work, we show that such a simple Model-based RL scheme, when equipped with optimistic and pessimistic planning procedures, achieves strong regret and sample complexity bounds in online and offline RL settings. In particular, we demonstrate that when the trajectory-wise reward is normalized between zero and one and the transition is time-homogeneous, it achieves horizon-free and second-order bounds. Horizon-free means that our bounds have no polynomial dependence on the horizon of the Markov Decision Process. A second-order bound is a type of instance-dependent bound that scales with the variances of the returns of the policies, which can be small when the system is nearly deterministic and/or the optimal policy has small values. We highlight that our algorithms are simple, fairly standard, and indeed have been extensively studied in the RL literature: they learn a model via MLE, build a version space around the MLE solution, and perform optimistic or pessimistic planning depending on whether operating in the online or offline mode. These algorithms do not rely on additional specialized algorithmic designs such as learning variances and performing variance-weighted learning and thus can leverage rich function approximations that are significantly beyond linear or tabular structures. The simplicity of the algorithms also implies that our horizon-free and second-order regret analysis is actually standard and mainly follows the general framework of optimism/pessimism in the face of uncertainty.

[LG-140] Electroencephalogram Emotion Recognition via AUC Maximization

Link: https://arxiv.org/abs/2408.08979
Authors: Minheng Xiao, Shi Bo
Keywords: pose significant challenges, datasets pose significant, robust model performance, Imbalanced datasets pose, cognitive science
Categories: Machine Learning (cs.LG)

Abstract:Imbalanced datasets pose significant challenges in areas including neuroscience, cognitive science, and medical diagnostics, where accurately detecting minority classes is essential for robust model performance. This study addresses the issue of class imbalance, using the "Liking" label in the DEAP dataset as an example. Such imbalances are often overlooked by prior research, which typically focuses on the more balanced arousal and valence labels and predominantly uses accuracy metrics to measure model performance. To tackle this issue, we adopt numerical optimization techniques aimed at maximizing the area under the curve (AUC), thus enhancing the detection of underrepresented classes. Our approach, which begins with a linear classifier, is compared against traditional linear classifiers, including logistic regression and support vector machines (SVM). Our method significantly outperforms these models, increasing recall from 41.6% to 79.7% and improving the F1-score from 0.506 to 0.632. These results highlight the efficacy of AUC maximization via numerical optimization in managing imbalanced datasets, providing an effective solution for enhancing predictive accuracy in detecting minority but crucial classes in out-of-sample datasets.
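To make "AUC maximization with a linear classifier" concrete, here is a minimal sketch. It assumes a pairwise logistic surrogate for the AUC and plain gradient descent. The abstract does not state which surrogate or optimizer the authors use, so `fit_auc_linear` and its parameters are illustrative only.

```python
import math

def auc(scores, labels):
    """Empirical AUC: fraction of (positive, negative) pairs ranked correctly (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def fit_auc_linear(X, y, lr=0.1, epochs=200):
    """Fit weights w so that w.x ranks positives above negatives.

    Minimizes the mean of log(1 + exp(-(w.x_pos - w.x_neg))) over all
    positive/negative pairs, a smooth stand-in for 1 - AUC (assumed surrogate).
    """
    d = len(X[0])
    w = [0.0] * d
    pos = [x for x, t in zip(X, y) if t == 1]
    neg = [x for x, t in zip(X, y) if t == 0]
    n_pairs = len(pos) * len(neg)
    for _ in range(epochs):
        grad = [0.0] * d
        for p in pos:
            for n in neg:
                diff = [a - b for a, b in zip(p, n)]
                margin = sum(wi * di for wi, di in zip(w, diff))
                # Gradient of log(1 + exp(-margin)) w.r.t. w is -sigmoid(-margin) * diff.
                coeff = -1.0 / (1.0 + math.exp(margin))
                for j in range(d):
                    grad[j] += coeff * diff[j]
        w = [wi - lr * g / n_pairs for wi, g in zip(w, grad)]
    return w
```

Because the loss is defined on positive/negative pairs, every minority-class example contributes to as many terms as there are majority-class examples, so the minority class is not swamped the way it can be under a per-sample accuracy objective.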

[LG-141] Enhancing Object Detection with Hybrid dataset in Manufacturing Environments: Comparing Federated Learning to Conventional Techniques

Link: https://arxiv.org/abs/2408.08974
Authors: Vinit Hegiste, Snehal Walunj, Jibinraj Antony, Tatjana Legler, Martin Ruskowski
Keywords: garnered significant attention, Federated Learning, privacy-preserving capabilities, garnered significant, significant attention
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Submitted and Presented at the IEEE International Conference on Innovative Engineering Sciences and Technological Research (ICIESTR-2024)

Abstract:Federated Learning (FL) has garnered significant attention in manufacturing for its robust model development and privacy-preserving capabilities. This paper contributes to research focused on the robustness of FL models in object detection, hereby presenting a comparative study with conventional techniques using a hybrid dataset for small object detection. Our findings demonstrate the superior performance of FL over centralized training models and different deep learning techniques when tested on test data recorded in a different environment with a variety of object viewpoints, lighting conditions, cluttered backgrounds, etc. These results highlight the potential of FL in achieving robust global models that perform efficiently even in unseen environments. The study provides valuable insights for deploying resilient object detection models in manufacturing environments.

[LG-142] ASGM-KG: Unveiling Alluvial Gold Mining Through Knowledge Graphs

Link: https://arxiv.org/abs/2408.08972
Authors: Debashis Gupta, Aditi Golder, Luis Fernendez, Miles Silman, Greg Lersen, Fan Yang, Bob Plemmons, Sarra Alqahtani, Paul Victor Pauca
Keywords: Small-Scale Gold Mining, highly destructive mining, destructive mining practice, Gold Mining, Artisanal and Small-Scale
Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

Abstract:Artisanal and Small-Scale Gold Mining (ASGM) is a low-cost yet highly destructive mining practice, leading to environmental disasters across the world’s tropical watersheds. The topic of ASGM spans multiple domains of research and information, including natural and social systems, and knowledge is often atomized across a diversity of media and documents. We therefore introduce a knowledge graph (ASGM-KG) that consolidates and provides crucial information about ASGM practices and their environmental effects. The current version of ASGM-KG consists of 1,899 triples extracted using a large language model (LLM) from documents and reports published by both non-governmental and governmental organizations. These documents were carefully selected by a group of tropical ecologists with expertise in ASGM. This knowledge graph was validated using two methods. First, a small team of ASGM experts reviewed and labeled triples as factual or non-factual. Second, we devised and applied an automated factual reduction framework that relies on a search engine and an LLM for labeling triples. Our framework performs as well as five baselines on a publicly available knowledge graph and achieves over 90% accuracy on our ASGM-KG validated by domain experts. ASGM-KG demonstrates an advancement in knowledge aggregation and representation for complex, interdisciplinary environmental crises such as ASGM.

[LG-143] Online SLA Decomposition: Enabling Real-Time Adaptation to Evolving Systems

Link: https://arxiv.org/abs/2408.08968
Authors: Cyril Shih-Huan Hsu, Danny De Vleeschauwer, Chrysa Papagianni
Keywords: Service Level Agreement, Service Level, Level Agreement, slice spans multiple, spans multiple domains
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: The paper has been submitted to IEEE Networking Letters

Abstract:When a network slice spans multiple domains, each domain must uphold the End-to-End (E2E) Service Level Agreement (SLA). This requires decomposing the E2E SLA into partial SLAs for each domain. In a two-level network slicing management system with an E2E orchestrator and local controllers, we propose an online learning-decomposition framework that dynamically updates risk models using recent feedback. This approach utilizes online gradient descent and FIFO memory buffers to enhance stability and robustness. Our empirical study shows the proposed framework outperforms state-of-the-art static methods, offering more accurate and resilient SLA decomposition under varying conditions and sparse data.
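The combination of online gradient descent with a FIFO memory buffer can be sketched as below. This is only an illustration of the general mechanism: the one-feature logistic risk model, the class name `OnlineRiskModel`, and the meaning of the input `x` (for example, a normalized partial-SLA budget offered to a domain) are assumptions, not details from the paper.

```python
import math
from collections import deque

class OnlineRiskModel:
    """Logistic model of a domain's acceptance probability, updated online.

    Each new feedback sample is pushed into a bounded FIFO buffer, then one
    gradient step is taken on the average log-loss over the buffer.
    """

    def __init__(self, buffer_size=50, lr=0.5):
        self.w, self.b = 0.0, 0.0
        self.lr = lr
        # deque(maxlen=...) evicts the oldest sample automatically (FIFO).
        self.buffer = deque(maxlen=buffer_size)

    def predict(self, x):
        return 1.0 / (1.0 + math.exp(-(self.w * x + self.b)))

    def update(self, x, accepted):
        self.buffer.append((x, float(accepted)))
        gw = gb = 0.0
        for xi, yi in self.buffer:
            err = self.predict(xi) - yi  # derivative of log-loss w.r.t. the logit
            gw += err * xi
            gb += err
        n = len(self.buffer)
        self.w -= self.lr * gw / n
        self.b -= self.lr * gb / n
```

Averaging the gradient over a buffer of recent samples smooths out single noisy feedback points, while the fixed buffer length lets old observations age out as the underlying system changes.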

[LG-144] Information-Theoretic Progress Measures reveal Grokking is an Emergent Phase Transition ICML2024

Link: https://arxiv.org/abs/2408.08944
Authors: Kenzo Clauw, Sebastiano Stramaglia, Daniele Marinazzo
Keywords: models suddenly generalize, paper studies emergent, studies emergent phenomena, delayed memorization, paper studies
Categories: Machine Learning (cs.LG); Information Theory (cs.IT)
Comments: ICML 2024 MI workshop

Abstract:This paper studies emergent phenomena in neural networks by focusing on grokking where models suddenly generalize after delayed memorization. To understand this phase transition, we utilize higher-order mutual information to analyze the collective behavior (synergy) and shared properties (redundancy) between neurons during training. We identify distinct phases before grokking allowing us to anticipate when it occurs. We attribute grokking to an emergent phase transition caused by the synergistic interactions between neurons as a whole. We show that weight decay and weight initialization can enhance the emergent phase.

[LG-145] A Factored MDP Approach To Moving Target Defense With Dynamic Threat Modeling and Cost Efficiency

Link: https://arxiv.org/abs/2408.08934
Authors: Megha Bose, Praveen Paruchuri, Akshat Kumar
Keywords: Moving Target Defense, Moving Target, evolving cyber threats, counteract evolving cyber, cyber threats
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Moving Target Defense (MTD) has emerged as a proactive and dynamic framework to counteract evolving cyber threats. Traditional MTD approaches often rely on assumptions about the attacker's knowledge and behavior. However, real-world scenarios are inherently more complex, with adaptive attackers and limited prior knowledge of their payoffs and intentions. This paper introduces a novel approach to MTD using a Markov Decision Process (MDP) model that does not rely on predefined attacker payoffs. Our framework integrates the attacker's real-time responses into the defender's MDP using a dynamic Bayesian Network. By employing a factored MDP model, we provide a comprehensive and realistic system representation. We also incorporate incremental updates to an attack response predictor as new data emerges. This ensures an adaptive and robust defense mechanism. Additionally, we consider the costs of switching configurations in MTD, integrating them into the reward structure to balance execution and defense costs. We first highlight the challenges of the problem through a theoretical negative result on regret. However, empirical evaluations demonstrate the framework's effectiveness in scenarios marked by high uncertainty and dynamically changing attack landscapes.

[LG-146] Personalized Federated Collaborative Filtering: A Variational AutoEncoder Approach

Link: https://arxiv.org/abs/2408.08931
Authors: Zhiwei Li, Guodong Long, Tianyi Zhou, Jing Jiang, Chengqi Zhang
Keywords: Federated Collaborative Filtering, distributed Collaborative Filtering, Collaborative Filtering, emerging field focused, Federated Collaborative
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10 pages, 3 figures, 4 tables, conference

Abstract:Federated Collaborative Filtering (FedCF) is an emerging field focused on developing a new privacy-preserving recommendation framework in a federated setting. Existing FedCF methods typically combine distributed Collaborative Filtering (CF) algorithms with privacy-preserving mechanisms, and then preserve personalized information in a user embedding vector. However, the user embedding is usually insufficient to preserve the rich information of the fine-grained personalization across heterogeneous clients. This paper proposes a novel personalized FedCF method that preserves users’ personalized information in a latent variable and a neural model simultaneously. Specifically, we decompose the modeling of user knowledge into two encoders, one designed to capture shared knowledge and the other personalized knowledge. A personalized gating network is then applied to balance personalization and generalization between the global and local encoders. Moreover, to effectively train the proposed framework, we model the CF problem as a specialized Variational AutoEncoder (VAE) task by integrating user interaction vector reconstruction with missing value prediction. The decoder is trained to reconstruct the implicit feedback from items the user has interacted with, while also predicting items the user might be interested in but has not yet interacted with. Experimental results on benchmark datasets demonstrate that the proposed method outperforms other baseline methods.

[LG-147] Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risk of Language Models

Link: https://arxiv.org/abs/2408.08926
Authors: Andy K. Zhang, Neil Perry, Riya Dulepet, Eliot Jones, Justin W. Lin, Joey Ji, Celeste Menders, Gashon Hussein, Samantha Liu, Donovan Jasper, Pura Peetathawatchai, Ari Glenn, Vikram Sivashankar, Daniel Zamoshchin, Leo Glikbarg, Derek Askaryar, Mike Yang, Teddy Zhang, Rishi Alluri, Nathan Tran, Rinnara Sangpisit, Polycarpos Yiorkadjis, Kenny Osele, Gautham Raghupathi, Dan Boneh, Daniel E. Ho, Percy Liang
Keywords: autonomously identifying vulnerabilities, Language Model, real-world impact, capable of autonomously, autonomously identifying
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 86 pages, 7 figures

Abstract:Language Model (LM) agents for cybersecurity that are capable of autonomously identifying vulnerabilities and executing exploits have the potential to cause real-world impact. Policymakers, model providers, and other researchers in the AI and cybersecurity communities are interested in quantifying the capabilities of such agents to help mitigate cyberrisk and investigate opportunities for penetration testing. Toward that end, we introduce Cybench, a framework for specifying cybersecurity tasks and evaluating agents on those tasks. We include 40 professional-level Capture the Flag (CTF) tasks from 4 distinct CTF competitions, chosen to be recent, meaningful, and spanning a wide range of difficulties. Each task includes its own description, starter files, and is initialized in an environment where an agent can execute bash commands and observe outputs. Since many tasks are beyond the capabilities of existing LM agents, we introduce subtasks, which break down a task into intermediary steps for more gradated evaluation; we add subtasks for 17 of the 40 tasks. To evaluate agent capabilities, we construct a cybersecurity agent and evaluate 7 models: GPT-4o, Claude 3 Opus, Claude 3.5 Sonnet, Mixtral 8x22b Instruct, Gemini 1.5 Pro, Llama 3 70B Chat, and Llama 3.1 405B Instruct. Without guidance, we find that agents are able to solve only the easiest complete tasks that took human teams up to 11 minutes to solve, with Claude 3.5 Sonnet and GPT-4o having the highest success rates. Finally, subtasks provide more signal for measuring performance compared to unguided runs, with models achieving a 3.2% higher success rate on complete tasks with subtask-guidance than without subtask-guidance. All code and data are publicly available at this https URL

[LG-148] PATopics: An automatic framework to extract useful information from pharmaceutical patents documents

Link: https://arxiv.org/abs/2408.08905
Authors: Pablo Cecilio, Antônio Perreira, Juliana Santos Rosa Viegas, Washington Cunha, Felipe Viegas, Elisa Tuler, Fabiana Testa Moura de Carvalho Vicentini, Leonardo Rocha
Keywords: disruptive innovations focusing, promote disruptive innovations, Pharmaceutical patents play, disruptive innovations, innovations focusing
Categories: Digital Libraries (cs.DL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 17 pages, 5 figures, 5 tables

Abstract:Pharmaceutical patents play an important role: they protect innovations from being copied and also drive researchers to innovate, create new products, and promote disruptive innovations focusing on collective health. The study of patent management usually involves an exhaustive manual search. This happens because patent documents are complex, with a lot of detail regarding the claims and the methodology/results explanation of the invention. To reduce the manual search, we propose PATopics, a framework specially designed to extract relevant information from pharmaceutical patents. PATopics is composed of four building blocks that extract textual information from the patents, build relevant topics that are capable of summarizing the patents, correlate these topics with useful patent characteristics, and then summarize the information in a friendly web interface for end users. The general contributions of PATopics are its ability to centralize patents and to organize patents into groups based on their similarities. We extensively analyzed the framework using 4,832 pharmaceutical patents concerning 809 molecules patented by 478 companies. In our analysis, we evaluate the use of the framework considering the demands of three user profiles: researchers, chemists, and companies. We also designed four real-world use cases to evaluate the framework’s applicability. Our analysis showed how practical and helpful PATopics is in the pharmaceutical scenario.

[LG-149] Kov: Transferable and Naturalistic Black-Box LLM Attacks using Markov Decision Processes and Tree Search

Link: https://arxiv.org/abs/2408.08899
Authors: Robert J. Moss
Keywords: Eliciting harmful behavior, Eliciting harmful, important task, task to ensure, ensure the proper
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:Eliciting harmful behavior from large language models (LLMs) is an important task to ensure the proper alignment and safety of the models. Often when training LLMs, ethical guidelines are followed yet alignment failures may still be uncovered through red teaming adversarial attacks. This work frames the red-teaming problem as a Markov decision process (MDP) and uses Monte Carlo tree search to find harmful behaviors of black-box, closed-source LLMs. We optimize token-level prompt suffixes towards targeted harmful behaviors on white-box LLMs and include a naturalistic loss term, log-perplexity, to generate more natural language attacks for better interpretability. The proposed algorithm, Kov, trains on white-box LLMs to optimize the adversarial attacks and periodically evaluates responses from the black-box LLM to guide the search towards more harmful black-box behaviors. In our preliminary study, results indicate that we can jailbreak black-box models, such as GPT-3.5, in only 10 queries, yet fail on GPT-4 - which may indicate that newer models are more robust to token-level attacks. All work to reproduce these results is open sourced (this https URL).

[LG-150] A Classifier-Based Approach to Multi-Class Anomaly Detection Applied to Astronomical Time-Series ICML2024

Link: https://arxiv.org/abs/2408.08888
Authors: Rithwik Gupta,Daniel Muthukrishna,Michelle Lochner
Keywords-EN: Automating anomaly detection, modern telescopes generate, telescopes generate millions, anomaly detection, Automating anomaly
Categories: Machine Learning (cs.LG); High Energy Astrophysical Phenomena (astro-ph.HE); Instrumentation and Methods for Astrophysics (astro-ph.IM)
Comments: Accepted in the ICML 2024 AI for Science Workshop. 15 pages, 10 figures. this https URL

Click to view abstract

Abstract:Automating anomaly detection is an open problem in many scientific fields, particularly in time-domain astronomy, where modern telescopes generate millions of alerts per night. Currently, most anomaly detection algorithms for astronomical time-series rely either on hand-crafted features or on features generated through unsupervised representation learning, coupled with standard anomaly detection algorithms. In this work, we introduce a novel approach that leverages the latent space of a neural network classifier for anomaly detection. We then propose a new method called Multi-Class Isolation Forests (MCIF), which trains separate isolation forests for each class to derive an anomaly score for an object based on its latent space representation. This approach significantly outperforms a standard isolation forest when distinct clusters exist in the latent space. Using a simulated dataset emulating the Zwicky Transient Facility (54 anomalies and 12,040 common), our anomaly detection pipeline discovered 46 ± 3 anomalies (~85% recall) after following up the top 2,000 (~15%) ranked objects. Furthermore, our classifier-based approach outperforms or approaches the performance of other state-of-the-art anomaly detection pipelines. Our novel method demonstrates that existing and new classifiers can be effectively repurposed for real-time anomaly detection. The code used in this work, including a Python package, is publicly available, this https URL.

[LG-151] Confronting the Reproducibility Crisis: A Case Study of Challenges in Cybersecurity AI

Link: https://arxiv.org/abs/2405.18753
Authors: Richard H. Moulton,Gary A. McCully,John D. Hastings
Keywords-EN: rapidly evolving field, rapidly evolving, evolving field, maintaining the reliability, reliability and integrity
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 8 pages, 0 figures, 2 tables, updated to incorporate feedback and improvements

Click to view abstract

Abstract:In the rapidly evolving field of cybersecurity, ensuring the reproducibility of AI-driven research is critical to maintaining the reliability and integrity of security systems. This paper addresses the reproducibility crisis within the domain of adversarial robustness – a key area in AI-based cybersecurity that focuses on defending deep neural networks against malicious perturbations. Through a detailed case study, we attempt to validate results from prior work on certified robustness using the VeriGauge toolkit, revealing significant challenges due to software and hardware incompatibilities, version conflicts, and obsolescence. Our findings underscore the urgent need for standardized methodologies, containerization, and comprehensive documentation to ensure the reproducibility of AI models deployed in critical cybersecurity applications. By tackling these reproducibility challenges, we aim to contribute to the broader discourse on securing AI systems against advanced persistent threats, enhancing network and IoT security, and protecting critical infrastructure. This work advocates for a concerted effort within the research community to prioritize reproducibility, thereby strengthening the foundation upon which future cybersecurity advancements are built.

[LG-152] LEGENT: Open Platform for Embodied Agents ACL2024

Link: https://arxiv.org/abs/2404.18243
Authors: Zhili Cheng,Zhitong Wang,Jinyi Hu,Shengding Hu,An Liu,Yuge Tu,Pengkai Li,Lei Shi,Zhiyuan Liu,Maosong Sun
Keywords-EN: Large Language Models, Large Multimodal Models, hindering complex real-life, Large Language, Large Multimodal
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: ACL 2024 System Demonstration

Click to view abstract

Abstract:Despite advancements in Large Language Models (LLMs) and Large Multimodal Models (LMMs), their integration into language-grounded, human-like embodied agents remains incomplete, hindering complex real-life task performance in physical environments. Existing integrations often feature limited open sourcing, challenging collective progress in this field. We introduce LEGENT, an open, scalable platform for developing embodied agents using LLMs and LMMs. LEGENT offers a dual approach: a rich, interactive 3D environment with communicable and actionable agents, paired with a user-friendly interface, and a sophisticated data generation pipeline utilizing advanced algorithms to exploit supervision from simulated worlds at scale. In our experiments, an embryonic vision-language-action model trained on LEGENT-generated data surpasses GPT-4V in embodied tasks, showcasing promising generalization capabilities.

[LG-153] Area under the ROC Curve has the Most Consistent Evaluation for Binary Classification

Link: https://arxiv.org/abs/2408.10193
Authors: Jing Li
Keywords-EN: binary classification tasks, Evaluation Metrics, model evaluation metrics, classification tasks, model evaluation
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:The choice of evaluation metrics is an important question for model evaluation and model selection in binary classification tasks. This study investigates how consistent metrics are at evaluating different models under different data scenarios. Analyzing over 150 data scenarios and 18 model evaluation metrics using statistical simulation, I find that for binary classification tasks, evaluation metrics that are less influenced by prevalence offer a more consistent ranking of a set of different models. In particular, Area Under the ROC Curve (AUC) has the smallest variance in the ranking of different models. Matthews correlation coefficient, as a stricter measure of model performance, has the second smallest variance. These patterns hold across a rich set of data scenarios and five commonly used machine learning models as well as a naive random guess model. The results have significant implications for model evaluation and model selection in binary classification tasks.
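
As a reading aid, the two metrics singled out in this abstract can be computed directly from their textbook definitions. The sketch below is not the paper's simulation code; the function names `auc` and `mcc` and the toy labels and scores are illustrative assumptions. It also shows one sense in which AUC is insensitive to prevalence: replicating every negative example leaves it unchanged.

```python
import math


def auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def mcc(y_true, y_pred):
    """Matthews correlation coefficient from the 2x2 confusion matrix."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Toy data (hypothetical): two negatives, two positives.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# Lower the prevalence by repeating each negative three times.
labels_rare = [0] * 6 + [1, 1]
scores_rare = [0.1, 0.4] * 3 + [0.35, 0.8]

auc_balanced = auc(labels, scores)
auc_rare = auc(labels_rare, scores_rare)
```

Here `auc_balanced` and `auc_rare` coincide because AUC depends only on how positives rank against negatives, not on how many of each there are.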

[LG-154] Robust spectral clustering with rank statistics

Link: https://arxiv.org/abs/2408.10136
Authors: Joshua Cape,Xianshi Yu,Jonquil Z. Liao
Keywords-EN: spectral clustering method, robust spectral clustering, paper analyzes, spectral clustering, rank statistics
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
Comments: 82 pages, 8 figures, 1 table

Click to view abstract

Abstract:This paper analyzes the statistical performance of a robust spectral clustering method for latent structure recovery in noisy data matrices. We consider eigenvector-based clustering applied to a matrix of nonparametric rank statistics that is derived entrywise from the raw, original data matrix. This approach is robust in the sense that, unlike traditional spectral clustering procedures, it can provably recover population-level latent block structure even when the observed data matrix includes heavy-tailed entries and has a heterogeneous variance profile. Our main theoretical contributions are threefold and hold under flexible data generating conditions. First, we establish that robust spectral clustering with rank statistics can consistently recover latent block structure, viewed as communities of nodes in a graph, in the sense that unobserved community memberships for all but a vanishing fraction of nodes are correctly recovered with high probability when the data matrix is large. Second, we refine the former result and further establish that, under certain conditions, the community membership of any individual, specified node of interest can be asymptotically exactly recovered with probability tending to one in the large-data limit. Third, we establish asymptotic normality results associated with the truncated eigenstructure of matrices whose entries are rank statistics, made possible by synthesizing contemporary entrywise matrix perturbation analysis with the classical nonparametric theory of so-called simple linear rank statistics. Collectively, these results demonstrate the statistical utility of rank-based data transformations when paired with spectral techniques for dimensionality reduction. Additionally, for a dataset of human connectomes, our approach yields parsimonious dimensionality reduction and improved recovery of ground-truth neuroanatomical cluster structure. 
MSC classes: 62H12, 62H30, 62G35
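
The preprocessing step described in this abstract, replacing each matrix entry by a rank statistic before any eigenvector computation, can be illustrated in a few lines. This is only a sketch of the entrywise rank transform, not the authors' procedure: the function name `rank_transform`, the use of average ranks for ties, and the normalization by `n + 1` are assumptions made for illustration, and the spectral clustering step itself is omitted.

```python
def rank_transform(matrix):
    """Replace every entry by its normalized rank among all entries.

    Ties receive the average of the ranks they span; ranks are divided by
    (number of entries + 1) so the result lies strictly between 0 and 1.
    """
    flat = sorted(v for row in matrix for v in row)
    n = len(flat)
    rank_of = {}
    i = 0
    while i < n:
        j = i
        while j < n and flat[j] == flat[i]:
            j += 1
        # Sorted positions i..j-1 hold 1-based ranks i+1..j; use their average.
        rank_of[flat[i]] = (i + 1 + j) / 2.0
        i = j
    return [[rank_of[v] / (n + 1) for v in row] for row in matrix]


# A heavy-tailed entry (1e9) dominates the raw matrix but not its rank transform.
raw = [[1.0, 1e9],
       [2.0, 3.0]]
ranked = rank_transform(raw)
```

After the transform, the extreme entry is just the largest of four ranks, which is the intuition behind the robustness to heavy tails that the abstract describes.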

[LG-155] No Screening is More Efficient with Multiple Objects

Link: https://arxiv.org/abs/2408.10077
Authors: Shunya Noda,Genta Okada
Keywords-EN: multiple heterogeneous objects, allocating multiple heterogeneous, heterogeneous objects, allocating multiple, multiple heterogeneous
Categories: Theoretical Economics (econ.TH); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:We study efficient mechanism design for allocating multiple heterogeneous objects. We aim to maximize the residual surplus, the total value generated from an allocation minus the costs for screening agents’ values. We discover a robust trend indicating that no-screening mechanisms such as serial dictatorship with exogenous priority order tend to perform better as the variety of goods increases. We analyze the underlying reasons by characterizing efficient mechanisms in a stylized environment. We also apply an automated mechanism design approach to numerically derive efficient mechanisms and validate the trend in general environments. Building on this implication, we propose the register-invite-book system (RIB) as an efficient system for scheduling vaccination against pandemic diseases.

[LG-156] Parseval Convolution Operators and Neural Networks

Link: https://arxiv.org/abs/2408.09981
Authors: Michael Unser,Stanislas Ducotterd
Keywords-EN: discrete multicomponent signals, Parseval convolution operators, linear shift-invariant, multicomponent signals, establish a kernel
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG); Functional Analysis (math.FA); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:We first establish a kernel theorem that characterizes all linear shift-invariant (LSI) operators acting on discrete multicomponent signals. This result naturally leads to the identification of the Parseval convolution operators as the class of energy-preserving filterbanks. We then present a constructive approach for the design/specification of such filterbanks via the chaining of elementary Parseval modules, each of which is parameterized by an orthogonal matrix or a 1-tight frame. Our analysis is complemented with explicit formulas for the Lipschitz constant of all the components of a convolutional neural network (CNN), which gives us a handle on their stability. Finally, we demonstrate the usage of those tools with the design of a CNN-based algorithm for the iterative reconstruction of biomedical images. Our algorithm falls within the plug-and-play framework for the resolution of inverse problems. It yields better-quality results than the sparsity-based methods used in compressed sensing, while offering essentially the same convergence and robustness guarantees.
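
To make "energy-preserving filterbank" concrete, here is a minimal sketch of one such operator: the undecimated two-channel Haar pair with filters [1/2, 1/2] and [1/2, -1/2]. It is a standard textbook example, not a construction taken from the paper; the function names and the circular boundary handling on a finite signal are assumptions made for illustration.

```python
def haar_parseval_bank(x):
    """Map a 1-channel signal to 2 channels with the undecimated Haar pair.

    Circular indexing is used, so x[k - 1] wraps around at k = 0.
    """
    n = len(x)
    low = [(x[k] + x[k - 1]) / 2.0 for k in range(n)]   # filter h0 = [1/2, 1/2]
    high = [(x[k] - x[k - 1]) / 2.0 for k in range(n)]  # filter h1 = [1/2, -1/2]
    return low, high


def energy(*channels):
    """Sum of squared samples over all given channels."""
    return sum(v * v for ch in channels for v in ch)


signal = [3.0, -1.0, 4.0, 1.0, -5.0, 9.0]
low, high = haar_parseval_bank(signal)

energy_in = energy(signal)
energy_out = energy(low, high)
```

At each position the two outputs contribute (x[k]^2 + x[k-1]^2) / 2, so summing over the whole cycle recovers the input energy. An operator with this property has Lipschitz constant 1, which is the kind of stability handle the abstract refers to.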

[LG-157] The curse of random quantum data

Link: https://arxiv.org/abs/2408.09937
Authors: Kaining Zhang,Junyu Liu,Liu Liu,Liang Jiang,Min-Hsiu Hsieh,Dacheng Tao
Keywords-EN: Quantum machine learning, significant flagship applications, involves running machine, machine learning, running machine learning
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: 40 pages, 8 figures

Click to view abstract

Abstract:Quantum machine learning, which involves running machine learning algorithms on quantum devices, may be one of the most significant flagship applications for these devices. Unlike its classical counterparts, the role of data in quantum machine learning has not been fully understood. In this work, we quantify the performance of quantum machine learning in the landscape of quantum data. Provided that the encoding of quantum data is sufficiently random, we find that the training efficiency and generalization capabilities in quantum machine learning will be exponentially suppressed as the number of qubits increases, which we call “the curse of random quantum data”. Our findings apply to both the quantum kernel method and the large-width limit of quantum neural networks. Conversely, we highlight that through meticulous design of quantum datasets, it is possible to avoid these curses, thereby achieving efficient convergence and robust generalization. Our conclusions are corroborated by extensive numerical simulations.

[LG-158] Electron-nucleus cross sections from transfer learning

Link: https://arxiv.org/abs/2408.09936
Authors: Krzysztof M. Graczyk,Beata E. Kowal,Artur M. Ankowski,Rwik Dharmapal Banerjee,Jose Luis Bonilla,Hemant Prasad,Jan T. Sobczyk
Keywords-EN: deep neural network, Transfer learning, neural network, limited information, deep neural
Categories: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); Nuclear Experiment (nucl-ex)
Comments: 4 pages, 2 figures

Click to view abstract

Abstract:Transfer learning (TL) allows a deep neural network (DNN) trained on one type of data to be adapted for new problems with limited information. We propose to use the TL technique in physics. The DNN learns the physics of one process, and after fine-tuning, it makes predictions for related processes. We consider the DNNs, trained on inclusive electron-carbon scattering data, and show that after fine-tuning, they accurately predict cross sections for electron interactions with nuclear targets ranging from lithium to iron. The method works even when the DNN is fine-tuned on a small dataset.

[LG-159] ALTBI: Constructing Improved Outlier Detection Models via Optimization of Inlier-Memorization Effect

Link: https://arxiv.org/abs/2408.09791
Authors: Seoyoung Cho,Jaesung Hwang,Kwan-Young Bak,Dongha Kim
Keywords-EN: learning unique patterns, identifying unusual observations, unique patterns, patterns of normal, truncated loss function
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 24 pages in total

Click to view abstract

Abstract:Outlier detection (OD) is the task of identifying unusual observations (or outliers) from given or upcoming data by learning unique patterns of normal observations (or inliers). Recently, a study introduced a powerful unsupervised OD (UOD) solver based on a new observation of deep generative models, called inlier-memorization (IM) effect, which suggests that generative models memorize inliers before outliers in early learning stages. In this study, we aim to develop a theoretically principled method to address UOD tasks by maximally utilizing the IM effect. We begin by noting that the IM effect is observed more clearly when the given training data contain fewer outliers. This finding indicates a potential for enhancing the IM effect in UOD regimes if we can effectively exclude outliers from mini-batches when designing the loss function. To this end, we introduce two main techniques: 1) increasing the mini-batch size as the model training proceeds and 2) using an adaptive threshold to calculate the truncated loss function. We theoretically show that these two techniques effectively filter out outliers from the truncated loss function, allowing us to utilize the IM effect to the fullest. Coupled with an additional ensemble strategy, we propose our method and term it Adaptive Loss Truncation with Batch Increment (ALTBI). We provide extensive experimental results to demonstrate that ALTBI achieves state-of-the-art performance in identifying outliers compared to other recent methods, even with significantly lower computation costs. Additionally, we show that our method yields robust performances when combined with privacy-preserving algorithms.
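
The two techniques named in this abstract, a growing mini-batch size and a truncated loss, can be sketched generically as follows. This is not the ALTBI implementation: the quantile-style rule in `truncated_loss`, the geometric schedule in `batch_size_at`, and every default value are hypothetical choices made only to illustrate the idea of dropping the highest-loss samples from each batch.

```python
import math


def truncated_loss(per_sample_losses, keep_fraction):
    """Average only the smallest `keep_fraction` share of the per-sample losses.

    Samples with the largest losses are treated as suspected outliers and
    excluded from the batch loss. (Assumed rule, not the paper's threshold.)
    """
    if not per_sample_losses:
        raise ValueError("need at least one loss value")
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    k = max(1, math.floor(keep_fraction * len(per_sample_losses)))
    kept = sorted(per_sample_losses)[:k]
    return sum(kept) / len(kept)


def batch_size_at(step, initial=128, growth=1.03, maximum=4096):
    """Hypothetical schedule: grow the mini-batch size geometrically, capped."""
    return min(maximum, int(initial * growth ** step))


# One suspected outlier (loss 100.0) is excluded from the averaged loss.
batch_losses = [1.0, 2.0, 3.0, 100.0]
loss_value = truncated_loss(batch_losses, keep_fraction=0.75)
```

In a real training loop the per-sample losses would come from the generative model at each step, and the kept fraction would follow whatever adaptive threshold the method prescribes.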

[LG-160] Confirmation Bias in Gaussian Mixture Models

Link: https://arxiv.org/abs/2408.09718
Authors: Amnon Balanov,Tamir Bendory,Wasim Huleihel
Keywords-EN: impact scientific research, profoundly impact scientific, Gaussian mixture, Confirmation bias, Gaussian mixture models
Categories: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments:

Click to view abstract

Abstract:Confirmation bias, the tendency to interpret information in a way that aligns with one’s preconceptions, can profoundly impact scientific research, leading to conclusions that reflect the researcher’s hypotheses even when the observational data do not support them. This issue is especially critical in scientific fields involving highly noisy observations, such as cryo-electron microscopy. This study investigates confirmation bias in Gaussian mixture models. We consider the following experiment: A team of scientists assumes they are analyzing data drawn from a Gaussian mixture model with known signals (hypotheses) as centroids. However, in reality, the observations consist entirely of noise without any informative structure. The researchers use a single iteration of the K-means or expectation-maximization algorithms, two popular algorithms to estimate the centroids. Despite the observations being pure noise, we show that these algorithms yield biased estimates that resemble the initial hypotheses, contradicting the unbiased expectation that averaging these noise observations would converge to zero. Namely, the algorithms generate estimates that mirror the postulated model, although the hypotheses (the presumed centroids of the Gaussian mixture) are not evident in the observations. Specifically, among other results, we prove a positive correlation between the estimates produced by the algorithms and the corresponding hypotheses. We also derive explicit closed-form expressions of the estimates for a finite and infinite number of hypotheses. This study underscores the risks of confirmation bias in low signal-to-noise environments, provides insights into potential pitfalls in scientific methodologies, and highlights the importance of prudent data interpretation. 
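
The experiment described in this abstract is easy to reproduce in miniature. The sketch below runs a single K-means update on pure Gaussian noise, starting from three assumed "hypotheses"; the dimension, sample size, seed, and the hypotheses themselves are arbitrary illustrative choices, not values from the paper.

```python
import random


def one_kmeans_step(observations, hypotheses):
    """Assign each observation to its nearest hypothesis, then average per cluster."""
    dim = len(hypotheses[0])
    sums = [[0.0] * dim for _ in hypotheses]
    counts = [0] * len(hypotheses)
    for x in observations:
        nearest = min(
            range(len(hypotheses)),
            key=lambda c: sum((xi - hi) ** 2 for xi, hi in zip(x, hypotheses[c])),
        )
        counts[nearest] += 1
        for i in range(dim):
            sums[nearest][i] += x[i]
    # Empty clusters (not expected here) fall back to the zero vector.
    return [
        [s / counts[j] if counts[j] else 0.0 for s in sums[j]]
        for j in range(len(hypotheses))
    ]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


rng = random.Random(0)
# Pure noise: there is no signal at all in these observations.
noise = [[rng.gauss(0.0, 1.0) for _ in range(2)] for _ in range(5000)]
hypotheses = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]]

estimates = one_kmeans_step(noise, hypotheses)
correlations = [dot(e, h) for e, h in zip(estimates, hypotheses)]
```

Every entry of `correlations` comes out positive: each noise point is assigned to the hypothesis it happens to lean toward, so the cluster averages inherit the direction of the hypotheses even though the data contain nothing but noise.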

[LG-161] Branch and Bound to Assess Stability of Regression Coefficients in Uncertain Models

Link: https://arxiv.org/abs/2408.09634
Authors: Brian Knaeble,R. Mitchell Hughes,George Rudolph,Mark A. Abramson,Daniel Razo
Keywords-EN: difficult to interpret, model, slope coefficient, coefficient
Categories: Methodology (stat.ME); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:

Click to view abstract

Abstract:It can be difficult to interpret a coefficient of an uncertain model. A slope coefficient of a regression model may change as covariates are added or removed from the model. In the context of high-dimensional data, there are too many model extensions to check. However, as we show here, it is possible to efficiently search, with a branch and bound algorithm, for maximum and minimum values of that adjusted slope coefficient over a discrete space of regularized regression models. Here we introduce our algorithm, along with supporting mathematical results, an example application, and a link to our computer code, to help researchers summarize high-dimensional data and assess the stability of regression coefficients in uncertain models.

[LG-162] Circuit design in biology and machine learning. I. Random networks and dimensional reduction

Link: https://arxiv.org/abs/2408.09604
Authors: Steven A. Frank
Keywords-EN: biological circuits, circuits, biological, biochemical cascade, taking inputs
Categories: Populations and Evolution (q-bio.PE); Machine Learning (cs.LG); Biological Physics (physics.bio-ph)
Comments:

Click to view abstract

Abstract:A biological circuit is a neural or biochemical cascade, taking inputs and producing outputs. How have biological circuits learned to solve environmental challenges over the history of life? The answer certainly follows Dobzhansky’s famous quote that “nothing in biology makes sense except in the light of evolution.” But that quote leaves out the mechanistic basis by which natural selection’s trial-and-error learning happens, which is exactly what we have to understand. How does the learning process that designs biological circuits actually work? How much insight can we gain about the form and function of biological circuits by studying the processes that have made those circuits? Because life’s circuits must often solve the same problems as those faced by machine learning, such as environmental tracking, homeostatic control, dimensional reduction, or classification, we can begin by considering how machine learning designs computational circuits to solve problems. We can then ask: How much insight do those computational circuits provide about the design of biological circuits? How much does biology differ from computers in the particular circuit designs that it uses to solve problems? This article steps through two classic machine learning models to set the foundation for analyzing broad questions about the design of biological circuits. One insight is the surprising power of randomly connected networks. Another is the central role of internal models of the environment embedded within biological circuits, illustrated by a model of dimensional reduction and trend prediction. Overall, many challenges in biology have machine learning analogs, suggesting hypotheses about how biology’s circuits are designed.

[LG-163] Convolutional Conditional Neural Processes

Link: https://arxiv.org/abs/2408.09583
Authors: Wessel P. Bruinsma
Keywords-EN: Neural processes, Neural, neural networks, processes
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: PhD thesis, 226 pages

Click to view abstract

Abstract:Neural processes are a family of models which use neural networks to directly parametrise a map from data sets to predictions. Directly parametrising this map enables the use of expressive neural networks in small-data problems where neural networks would traditionally overfit. Neural processes can produce well-calibrated uncertainties, effectively deal with missing data, and are simple to train. These properties make this family of models appealing for a breadth of application areas, such as healthcare or environmental sciences. This thesis advances neural processes in three ways. First, we propose convolutional neural processes (ConvNPs). ConvNPs improve data efficiency of neural processes by building in a symmetry called translation equivariance. ConvNPs rely on convolutional neural networks rather than multi-layer perceptrons. Second, we propose Gaussian neural processes (GNPs). GNPs directly parametrise dependencies in the predictions of a neural process. Current approaches to modelling dependencies in the predictions depend on a latent variable, which consequently requires approximate inference, undermining the simplicity of the approach. Third, we propose autoregressive conditional neural processes (AR CNPs). AR CNPs train a neural process without any modifications to the model or training procedure and, at test time, roll out the model in an autoregressive fashion. AR CNPs equip the neural process framework with a new knob where modelling complexity and computational expense at training time can be traded for computational expense at test time. In addition to methodological advancements, this thesis also proposes a software abstraction that enables a compositional approach to implementing neural processes. This approach allows the user to rapidly explore the space of neural process models by putting together elementary building blocks in different ways.
Related DOI: https://doi.org/10.17863/CAM.100216

[LG-164] Security Concerns in Quantum Machine Learning as a Service

Link: https://arxiv.org/abs/2408.09562
Authors: Satwik Kundu,Swaroop Ghosh
Keywords-EN: Quantum machine learning, variational quantum circuits, machine learning tasks, employ variational quantum, tackle machine learning
Categories: Quantum Physics (quant-ph); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 9 pages, 3 figures

Click to view abstract

Abstract:Quantum machine learning (QML) is a category of algorithms that employ variational quantum circuits (VQCs) to tackle machine learning tasks. Recent discoveries have shown that QML models can effectively generalize from limited training data samples. This capability has sparked increased interest in deploying these models to address practical, real-world challenges, resulting in the emergence of Quantum Machine Learning as a Service (QMLaaS). QMLaaS represents a hybrid model that utilizes both classical and quantum computing resources. Classical computers play a crucial role in this setup, handling initial pre-processing and subsequent post-processing of data to compensate for the current limitations of quantum hardware. Since this is a new area, very little work exists to paint the whole picture of QMLaaS in the context of known security threats in the domain of classical and quantum machine learning. This SoK paper aims to bridge this gap by outlining the complete QMLaaS workflow, which encompasses both the training and inference phases, and by highlighting significant security concerns involving untrusted classical or quantum providers. QML models contain several sensitive assets, such as the model architecture, training/testing data, encoding techniques, and trained parameters. Unauthorized access to these components could compromise the model’s integrity and lead to intellectual property (IP) theft. We pinpoint the critical security issues that must be considered to pave the way for a secure QMLaaS deployment.

[LG-165] Sample-Optimal Large-Scale Optimal Subset Selection

Link: https://arxiv.org/abs/2408.09537
Authors: Zaile Li,Weiwei Fan,L. Jeff Hong
Keywords-EN: large-scale OSS problem, conventionally aims, aims to select, select the unique, finite set
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Comments:

Click to view abstract

Abstract:Ranking and selection (RS) conventionally aims to select the unique best alternative with the largest mean performance from a finite set of alternatives. However, to better support decision making, it may be more informative to deliver a small menu of alternatives whose mean performances are among the top m. Such a problem, called optimal subset selection (OSS), is generally more challenging to address than conventional RS. This challenge becomes even more significant when the number of alternatives is considerably large. Thus, the focus of this paper is on addressing the large-scale OSS problem. To achieve this goal, we design a top-m greedy selection mechanism that keeps sampling the current top m alternatives, that is, those with the top m running sample means, and propose the explore-first top-m greedy (EFG-m) procedure. Through an extended boundary-crossing framework, we prove that the EFG-m procedure is both sample optimal and consistent in terms of the probability of good selection, confirming its effectiveness in solving the large-scale OSS problem. Surprisingly, we also demonstrate that the EFG-m procedure achieves an indifference-based ranking within the selected subset of alternatives at no extra cost. This is highly beneficial as it delivers deeper insights to decision-makers, enabling more informed decision making. Lastly, numerical experiments validate our results and demonstrate the efficiency of our procedures.
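
The explore-first, then greedy structure described in this abstract can be sketched as below. This is a loose reading of the abstract rather than the authors' EFG-m procedure: the function `efg_m`, its parameters `n0` and `budget`, and the choice to draw one extra sample from each current top-m alternative per round are all assumptions made for illustration.

```python
import random


def efg_m(sample, k, m, n0, budget):
    """Explore-first top-m greedy selection (illustrative sketch).

    sample(i) must return one noisy observation of alternative i.
    Requires n0 >= 1 so that every running mean is defined.
    """
    totals = [0.0] * k
    counts = [0] * k

    # Exploration phase: n0 observations of every alternative.
    for i in range(k):
        for _ in range(n0):
            totals[i] += sample(i)
            counts[i] += 1
    used = k * n0

    def current_top_m():
        by_mean = sorted(range(k), key=lambda i: totals[i] / counts[i], reverse=True)
        return by_mean[:m]

    # Greedy phase: keep sampling whichever alternatives currently rank in the top m.
    while used + m <= budget:
        for i in current_top_m():
            totals[i] += sample(i)
            counts[i] += 1
        used += m

    return sorted(current_top_m())


# Toy problem (hypothetical): 10 alternatives with well-separated true means.
rng = random.Random(1)
true_means = [float(i) for i in range(10)]


def noisy(i):
    return rng.gauss(true_means[i], 0.1)


selected = efg_m(noisy, k=10, m=3, n0=5, budget=200)
```

With means this well separated, `selected` contains the three alternatives with the largest true means; the interesting regime in the paper is the large-scale one, where the sampling budget per alternative is small.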

[LG-166] Deep Limit Model-free Prediction in Regression

Link: https://arxiv.org/abs/2408.09532
Authors: Kejin Wu,Dimitris N. Politis
Keywords-EN: Deep Neural Network, Neural Network, Deep Neural, general regression setting, based on Deep
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Comments:

Click to view abstract

Abstract:In this paper, we provide a novel Model-free approach based on Deep Neural Network (DNN) to accomplish point prediction and prediction interval under a general regression setting. Usually, people rely on parametric or non-parametric models to bridge dependent and independent variables (Y and X). However, this classical method relies heavily on the correct model specification. Even for the non-parametric approach, some additive form is often assumed. A newly proposed Model-free prediction principle sheds light on a prediction procedure without any model assumption. Previous work regarding this principle has shown better performance than other standard alternatives. Recently, DNN, one of the machine learning methods, has received increasing attention due to its great performance in practice. Guided by the Model-free prediction idea, we attempt to apply a fully connected forward DNN to map X and some appropriate reference random variable Z to Y. The targeted DNN is trained by minimizing a specially designed loss function so that the randomness of Y conditional on X is outsourced to Z through the trained DNN. Our method is more stable and accurate compared to other DNN-based counterparts, especially for optimal point predictions. With a specific prediction procedure, our prediction interval can capture the estimation variability so that it can render a better coverage rate for finite sample cases. The superior performance of our method is verified by simulation and empirical studies.

[LG-167] Enhancing Quantum Memory Lifetime with Measurement-Free Local Error Correction and Reinforcement Learning

Link: https://arxiv.org/abs/2408.09524
Authors: Mincheol Park, Nishad Maskara, Marcin Kalinowski, Mikhail D. Lukin
Keywords: Reliable quantum computation, computation requires systematic, requires systematic identification, quantum computation requires, Reliable quantum
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments: 15 + 11 pages, 16 figures

Abstract:Reliable quantum computation requires systematic identification and correction of errors that occur and accumulate in quantum hardware. To diagnose and correct such errors, standard quantum error-correcting protocols utilize global error information across the system obtained by mid-circuit readout of ancillary qubits. We investigate circuit-level error-correcting protocols that are measurement-free and based on local error information. Such a local error correction (LEC) circuit consists of faulty multi-qubit gates to perform both syndrome extraction and ancilla-controlled error removal. We develop and implement a reinforcement learning framework that takes a fixed set of faulty gates as inputs and outputs an optimized LEC circuit. To evaluate this approach, we quantitatively characterize an extension of logical qubit lifetime by a noisy LEC circuit. For the 2D classical Ising model and 4D toric code, our optimized LEC circuit performs better at extending a memory lifetime compared to a conventional LEC circuit based on Toom’s rule in a sub-threshold gate error regime. We further show that such circuits can be used to reduce the rate of mid-circuit readouts to preserve a 2D toric code memory. Finally, we discuss the application of the LEC protocol on dissipative preparation of quantum states with topological phases.

[LG-168] Enhancing Startup Success Predictions in Venture Capital: A GraphRAG Augmented Multivariate Time Series Method

Link: https://arxiv.org/abs/2408.09420
Authors: Gao Zitian, Xiao Yihao
Keywords: subjective revenue forecasts, limited financial data, revenue forecasts, Venture Capital, challenging due
Categories: Computational Finance (q-fin.CP); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:In the Venture Capital (VC) industry, predicting the success of startups is challenging due to limited financial data and the need for subjective revenue forecasts. Previous methods based on time series analysis or deep learning often fall short as they fail to incorporate crucial inter-company relationships such as competition and collaboration. To address these issues, we propose a novel approach using a GraphRAG-augmented time series model. With GraphRAG, time series predictive methods are enhanced by integrating these vital relationships into the analysis framework, allowing for a more dynamic understanding of the startup ecosystem in venture capital. Our experimental results demonstrate that our model significantly outperforms previous models in startup success predictions. To the best of our knowledge, our work is the first application of GraphRAG.

[LG-169] Exploratory Optimal Stopping: A Singular Control Formulation

Link: https://arxiv.org/abs/2408.09335
Authors: Jodi Dianetti, Giorgio Ferrari, Renyuan Xu
Keywords: paper explores continuous-time, state-space optimal stopping, randomized stopping times, reinforcement learning perspective, paper explores
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Mathematical Finance (q-fin.MF); Machine Learning (stat.ML)
Comments: 49 pages, 3 figures

Abstract:This paper explores continuous-time and state-space optimal stopping problems from a reinforcement learning perspective. We begin by formulating the stopping problem using randomized stopping times, where the decision maker’s control is represented by the probability of stopping within a given time; specifically, it is a bounded, non-decreasing, càdlàg control process. To encourage exploration and facilitate learning, we introduce a regularized version of the problem by penalizing it with the cumulative residual entropy of the randomized stopping time. The regularized problem takes the form of an (n+1)-dimensional degenerate singular stochastic control with finite-fuel. We address this through the dynamic programming principle, which enables us to identify the unique optimal exploratory strategy. For the specific case of a real option problem, we derive a semi-explicit solution to the regularized problem, allowing us to assess the impact of entropy regularization and analyze the vanishing entropy limit. Finally, we propose a reinforcement learning algorithm based on policy iteration. We show both policy improvement and policy convergence results for our proposed algorithm.

[LG-170] Malacopula: adversarial automatic speaker verification attacks using a neural-based generalised Hammerstein model

Link: https://arxiv.org/abs/2408.09300
Authors: Massimiliano Todisco, Michele Panariello, Xin Wang, Héctor Delgado, Kong Aik Lee, Nicholas Evans
Keywords: neural-based generalised Hammerstein, generalised Hammerstein model, Hammerstein model designed, automatic speaker verification, generalised Hammerstein
Categories: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: Accepted at ASVspoof Workshop 2024

Abstract:We present Malacopula, a neural-based generalised Hammerstein model designed to introduce adversarial perturbations to spoofed speech utterances so that they better deceive automatic speaker verification (ASV) systems. Using non-linear processes to modify speech utterances, Malacopula enhances the effectiveness of spoofing attacks. The model comprises parallel branches of polynomial functions followed by linear time-invariant filters. The adversarial optimisation procedure acts to minimise the cosine distance between speaker embeddings extracted from spoofed and bona fide utterances. Experiments, performed using three recent ASV systems and the ASVspoof 2019 dataset, show that Malacopula increases vulnerabilities by a substantial margin. However, speech quality is reduced and attacks can be detected effectively under controlled conditions. The findings emphasise the need to identify new vulnerabilities and design defences to protect ASV systems from adversarial attacks in the wild.
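
The structure named in this abstract, parallel polynomial branches each followed by a linear time-invariant filter, can be sketched in a few lines. This is only a generic generalised Hammerstein model with fixed, hand-picked FIR taps; in the paper the filters are learned through adversarial optimisation, and the function names below are hypothetical.

```python
def fir_filter(signal, taps):
    """Causal FIR filtering: y[n] = sum_j taps[j] * signal[n - j]."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for j, tap in enumerate(taps):
            if n - j >= 0:
                acc += tap * signal[n - j]
        out.append(acc)
    return out


def generalised_hammerstein(signal, branch_taps):
    """Sum of parallel branches.

    Branch k (1-based) raises the input to the k-th power and then
    passes it through its own FIR filter, given by branch_taps[k - 1].
    """
    output = [0.0] * len(signal)
    for power, taps in enumerate(branch_taps, start=1):
        branch = fir_filter([x ** power for x in signal], taps)
        output = [o + b for o, b in zip(output, branch)]
    return output


x = [1.0, 2.0, -1.0]
# Two branches: a linear one with taps [1.0, 0.5] and a quadratic one with tap [0.25].
y = generalised_hammerstein(x, branch_taps=[[1.0, 0.5], [0.25]])
print(y)  # [1.25, 3.5, 0.25]
```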

[LG-171] Out-of-distribution materials property prediction using adversarial learning based fine-tuning

Link: https://arxiv.org/abs/2408.09297
Authors: Qinyang Li, Nicholas Miklaucic, Jianjun Hu
Keywords: engineering disciplines, material property prediction, wide range, range of scientific, scientific and engineering
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)

Abstract:The accurate prediction of material properties is crucial in a wide range of scientific and engineering disciplines. Machine learning (ML) has advanced the state of the art in this field, enabling scientists to discover novel materials and design materials with specific desired properties. However, one major challenge that persists in material property prediction is the generalization of models to out-of-distribution (OOD) samples, i.e., samples that differ significantly from those encountered during training. In this paper, we explore the application of advancements in OOD learning approaches to enhance the robustness and reliability of material property prediction models. We propose and apply the Crystal Adversarial Learning (CAL) algorithm for OOD materials property prediction, which generates synthetic data during training to bias the training towards those samples with high prediction uncertainty. We further propose an adversarial learning based targeted fine-tuning approach to adapt the model to a particular OOD dataset, as an alternative to traditional fine-tuning. Our experiments demonstrate the success of our CAL algorithm and its high effectiveness in ML with limited samples, a situation that commonly occurs in materials science. Our work represents a promising direction toward better OOD learning and materials property prediction.

[LG-172] A Fast and Computationally Inexpensive Method For Image Translation of 3D Volume Patient Data

Link: https://arxiv.org/abs/2408.09218
Authors: Cho Yang
Keywords: Grand Challenge Dataset, SynthRAD Grand Challenge, Grand Challenge, Challenge Dataset, SynthRAD Grand
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:CycleGAN was trained on the SynthRAD Grand Challenge Dataset using the single-epoch modification (SEM) method proposed in this paper (referred to as CycleGAN-single) and compared to the usual method of training CycleGAN for around 200 epochs (CycleGAN-multi). Model performance was evaluated qualitatively and quantitatively, with quantitative performance metrics such as PSNR, SSIM, MAE and MSE. The consideration of both quantitative and qualitative performance when evaluating a model is unique to certain image-translation tasks like medical imaging, as detailed in this paper. Also, this paper shows that good quantitative performance does not always imply good qualitative performance and the converse is also not always true (i.e. good qualitative performance does not always imply good quantitative performance). This paper also proposes the FQGA (Fast Paired Image-to-Image Translation Quarter-Generator Adversary) model, which has 1/4 the number of parameters of CycleGAN (when comparing their generator models). FQGA outperforms CycleGAN qualitatively and quantitatively even after training for only 20 epochs. Finally, using the SEM method on FQGA allowed it to again outperform CycleGAN both quantitatively and qualitatively. These performance gains with fewer model parameters and time savings from running fewer epochs may also be applicable to other image-to-image translation tasks in machine learning apart from the medical image-translation task discussed in this paper between Cone Beam Computed Tomography (CBCT) and Computed Tomography (CT) images.

[LG-173] Time Series Analysis by State Space Learning

Link: https://arxiv.org/abs/2408.09120
Authors: André Ramos, Davi Valladão, Alexandre Street
Keywords: Time series, Time series analysis, explanatory variables, extracting unobservable components, Time
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 22 pages, 28 figures

Abstract:Time series analysis by state-space models is widely used in forecasting and extracting unobservable components like level, slope, and seasonality, along with explanatory variables. However, their reliance on traditional Kalman filtering frequently hampers their effectiveness, primarily due to Gaussian assumptions and the absence of efficient subset selection methods to accommodate the multitude of potential explanatory variables in today’s big-data applications. Our research introduces State Space Learning (SSL), a novel framework and paradigm that leverages the capabilities of statistical learning to construct a comprehensive framework for time series modeling and forecasting. By utilizing a regularized high-dimensional regression framework, our approach jointly extracts typical time series unobservable components, detects and addresses outliers, and selects the influence of exogenous variables within a high-dimensional space, in polynomial time and with global optimality guarantees. Through a controlled numerical experiment, we demonstrate the superiority of our approach in terms of the accuracy of subset selection of explanatory variables compared to relevant benchmarks. We also present an intuitive forecasting scheme and showcase superior performance relative to traditional time series models using a dataset of 48,000 monthly time series from the M4 competition. We extend the applicability of our approach to reformulate any linear state space formulation featuring time-varying coefficients into high-dimensional regularized regressions, expanding the impact of our research to other engineering applications beyond time series analysis. Finally, our proposed methodology is implemented within the Julia open-source package "StateSpaceLearning.jl".

[LG-174] mRNA2vec: mRNA Embedding with Language Model in the 5'UTR-CDS for mRNA Design

Link: https://arxiv.org/abs/2408.09048
Authors: Honggen Zhang, Xiangrui Gao, June Zhang, Lipeng Lai
Keywords: Messenger RNA, pharmaceutical industry, accelerating the discovery, drugs and revolutionizing, revolutionizing the pharmaceutical
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Messenger RNA (mRNA)-based vaccines are accelerating the discovery of new drugs and revolutionizing the pharmaceutical industry. However, selecting particular mRNA sequences for vaccines and therapeutics from extensive mRNA libraries is costly. Effective mRNA therapeutics require carefully designed sequences with optimized expression levels and stability. This paper proposes a novel contextual language model (LM)-based embedding method: mRNA2vec. In contrast to existing mRNA embedding approaches, our method is based on the self-supervised teacher-student learning framework of data2vec. We jointly use the 5’ untranslated region (UTR) and coding sequence (CDS) region as the input sequences. We adapt our LM-based approach specifically to mRNA by 1) considering the importance of location on the mRNA sequence with probabilistic masking, 2) using Minimum Free Energy (MFE) prediction and Secondary Structure (SS) classification as additional pretext tasks. mRNA2vec demonstrates significant improvements in translation efficiency (TE) and expression level (EL) prediction tasks in UTR compared to SOTA methods such as UTR-LM. It also gives a competitive performance in mRNA stability and protein production level tasks in CDS such as CodonBERT.

[LG-175] Error Bounds for Learning Fourier Linear Operators

Link: https://arxiv.org/abs/2408.09004
Authors: Unique Subedi, Ambuj Tewari
Keywords: Fourier Neural Operator, Fourier Neural, Discrete Fourier Transform, Neural Operator, function spaces
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 30 pages

Abstract:We investigate the problem of learning operators between function spaces, focusing on the linear layer of the Fourier Neural Operator. First, we identify three main errors that occur during the learning process: statistical error due to finite sample size, truncation error from finite rank approximation of the operator, and discretization error from handling functional data on a finite grid of domain points. Finally, we analyze a Discrete Fourier Transform (DFT) based least squares estimator, establishing both upper and lower bounds on the aforementioned errors.

[LG-176] A Confidence Interval for the \ell_2 Expected Calibration Error

Link: https://arxiv.org/abs/2408.08998
Authors: Yan Sun, Pratik Chaudhari, Ian J. Barnett, Edgar Dobriban
Keywords: Recent advances, significantly improved prediction, improved prediction accuracy, advances in machine, machine learning
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Recent advances in machine learning have significantly improved prediction accuracy in various applications. However, ensuring the calibration of probabilistic predictions remains a significant challenge. Despite efforts to enhance model calibration, the rigorous statistical evaluation of model calibration remains less explored. In this work, we develop confidence intervals for the \ell_2 Expected Calibration Error (ECE). We consider top-1-to-k calibration, which includes both the popular notion of confidence calibration as well as full calibration. For a debiased estimator of the ECE, we show asymptotic normality, but with different convergence rates and asymptotic variances for calibrated and miscalibrated models. We develop methods to construct asymptotically valid confidence intervals for the ECE, accounting for this behavior as well as non-negativity. Our theoretical findings are supported through extensive experiments, showing that our methods produce valid confidence intervals with shorter lengths compared to those obtained by resampling-based methods.

[LG-177] Adaptive Uncertainty Quantification for Generative AI

Link: https://arxiv.org/abs/2408.08990
Authors: Jungeum Kim, Sean O’Hagan, Veronika Rockova
Keywords: including generative, work is concerned, concerned with conformal, trained on data, black-box model
Categories: Methodology (stat.ME); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:This work is concerned with conformal prediction in contemporary applications (including generative AI) where a black-box model has been trained on data that are not accessible to the user. Mirroring split-conformal inference, we design a wrapper around a black-box algorithm which calibrates conformity scores. This calibration is local and proceeds in two stages by first adaptively partitioning the predictor space into groups and then calibrating sectionally group by group. Adaptive partitioning (self-grouping) is achieved by fitting a robust regression tree to the conformity scores on the calibration set. This new tree variant is designed in such a way that adding a single new observation does not change the tree fit with overwhelmingly large probability. This add-one-in robustness property allows us to conclude a finite sample group-conditional coverage guarantee, a refinement of the marginal guarantee. In addition, unlike traditional split-conformal inference, adaptive splitting and within-group calibration yields adaptive bands which can stretch and shrink locally. We demonstrate benefits of local tightening on several simulated as well as real examples using non-parametric regression. Finally, we consider two contemporary classification applications for obtaining uncertainty quantification around GPT-4o predictions. We conformalize skin disease diagnoses based on self-reported symptoms as well as predicted states of U.S. legislators based on summaries of their ideology. We demonstrate substantial local tightening of the uncertainty sets while attaining similar marginal coverage.
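
The second stage described in this abstract, calibrating conformity scores group by group, can be sketched as below. The grouping is taken as given here, whereas the paper obtains it adaptively from a robust regression tree; the function names and toy scores are assumptions for illustration only.

```python
import math


def conformal_quantile(scores, alpha):
    """Split-conformal threshold: the ceil((n + 1) * (1 - alpha))-th smallest score."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points to certify this coverage level.
        return math.inf
    return sorted(scores)[rank - 1]


def groupwise_thresholds(scores, groups, alpha):
    """Calibrate one threshold per group instead of a single global one."""
    by_group = {}
    for score, group in zip(scores, groups):
        by_group.setdefault(group, []).append(score)
    return {g: conformal_quantile(v, alpha) for g, v in by_group.items()}


# Toy calibration set: group "a" has small conformity scores, group "b" large ones.
scores = [0.1, 0.2, 0.3, 1.0, 2.0, 3.0]
groups = ["a", "a", "a", "b", "b", "b"]
thresholds = groupwise_thresholds(scores, groups, alpha=0.25)
print(thresholds)  # {'a': 0.3, 'b': 3.0}
```

A prediction set for a new point then uses the threshold of the group the point falls into, which is what lets the bands stretch and shrink locally.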

[LG-178] Optimal transport natural gradient for statistical manifolds with continuous sample space

Link: https://arxiv.org/abs/1805.08380
Authors: Yifan Chen, Wuchen Li
Keywords: continuous sample spaces, Wasserstein statistical manifold, Wasserstein natural gradient, Wasserstein metric tensor, Wasserstein
Categories: Optimization and Control (math.OC); Information Theory (cs.IT); Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)

Abstract:We study the Wasserstein natural gradient in parametric statistical models with continuous sample spaces. Our approach is to pull back the L^2-Wasserstein metric tensor in the probability density space to a parameter space, equipping the latter with a positive definite metric tensor, under which it becomes a Riemannian manifold, named the Wasserstein statistical manifold. In general, it is not a totally geodesic sub-manifold of the density space, and therefore its geodesics will differ from the Wasserstein geodesics, except for the well-known Gaussian distribution case, a fact which can also be validated under our framework. We use the sub-manifold geometry to derive a gradient flow and natural gradient descent method in the parameter space. When parametrized densities lie in \mathbb{R}, the induced metric tensor admits an explicit formula. In optimization problems, we observe that the natural gradient descent outperforms the standard gradient descent when the Wasserstein distance is the objective function. In such a case, we prove that the resulting algorithm behaves similarly to the Newton method in the asymptotic regime. The proof calculates the exact Hessian formula for the Wasserstein distance, which further motivates another preconditioner for the optimization process. Finally, we present examples to illustrate the effectiveness of the natural gradient in several parametric statistical models, including the Gaussian measure, Gaussian mixture, Gamma distribution, and Laplace distribution.
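
As a small worked illustration of the Gaussian case mentioned in this abstract, consider univariate Gaussians parametrized by mean and standard deviation. The closed form below is the standard one for the L^2-Wasserstein distance between one-dimensional Gaussians; the symbols \theta, G_W, F and the step size \eta are notation introduced here, not taken from the paper.

```latex
% Univariate Gaussians p_\theta = N(\mu, \sigma^2), with \theta = (\mu, \sigma), \sigma > 0.
\[
  W_2^2\bigl(p_\theta, p_{\theta'}\bigr) = (\mu - \mu')^2 + (\sigma - \sigma')^2 .
\]
% The distance is Euclidean in (\mu, \sigma), so the pulled-back metric tensor
% is the identity in these coordinates:
\[
  G_W(\theta) = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}.
\]
% A natural gradient step for an objective F with step size \eta is
\[
  \theta_{k+1} = \theta_k - \eta \, G_W(\theta_k)^{-1} \nabla_\theta F(\theta_k),
\]
% which here coincides with ordinary gradient descent in (\mu, \sigma).
```

For other families (Gaussian mixtures, Gamma, Laplace) the metric tensor is no longer the identity, and the preconditioning by G_W^{-1} is what distinguishes the natural gradient from the standard one.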

Information Retrieval

[IR-0] Customizing Language Models with Instance-wise LoRA for Sequential Recommendation

Link: https://arxiv.org/abs/2408.10159
Authors: Xiaoyu Kong, Jiancan Wu, An Zhang, Leheng Sheng, Hui Lin, Xiang Wang, Xiangnan He
Keywords: Large Language Models, Sequential recommendation systems, recommendation systems predict, analyzing past interactions, Sequential recommendation
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract:Sequential recommendation systems predict a user’s next item of interest by analyzing past interactions, aligning recommendations with individual preferences. Leveraging the strengths of Large Language Models (LLMs) in knowledge comprehension and reasoning, recent approaches have applied LLMs to sequential recommendation through language generation paradigms. These methods convert user behavior sequences into prompts for LLM fine-tuning, utilizing Low-Rank Adaptation (LoRA) modules to refine recommendations. However, the uniform application of LoRA across diverse user behaviors sometimes fails to capture individual variability, leading to suboptimal performance and negative transfer between disparate sequences. To address these challenges, we propose Instance-wise LoRA (iLoRA), integrating LoRA with the Mixture of Experts (MoE) framework. iLoRA creates a diverse array of experts, each capturing specific aspects of user preferences, and introduces a sequence representation guided gate function. This gate function processes historical interaction sequences to generate enriched representations, guiding the gating network to output customized expert participation weights. This tailored approach mitigates negative transfer and dynamically adjusts to diverse behavior patterns. Extensive experiments on three benchmark datasets demonstrate the effectiveness of iLoRA, highlighting its superior performance compared to existing methods in capturing user-specific preferences and improving recommendation accuracy.

[IR-1] Molecular Graph Representation Learning Integrating Large Language Models with Domain-specific Small Models

Link: https://arxiv.org/abs/2408.10124
Authors: Tianyu Zhang, Yuxiang Ren, Chengbin Hou, Hairong Lv, Xuegong Zhang
Keywords: drug discovery, crucial foundation, foundation for drug, Domain-specific Small Models, Large Language Models
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Chemical Physics (physics.chem-ph); Biomolecules (q-bio.BM)

Abstract:Molecular property prediction is a crucial foundation for drug discovery. In recent years, pre-trained deep learning models have been widely applied to this task. Some approaches that incorporate prior biological domain knowledge into the pre-training framework have achieved impressive results. However, these methods heavily rely on biochemical experts, and retrieving and summarizing vast amounts of domain knowledge literature is both time-consuming and expensive. Large Language Models (LLMs) have demonstrated remarkable performance in understanding and efficiently providing general knowledge. Nevertheless, they occasionally exhibit hallucinations and lack precision in generating domain-specific knowledge. Conversely, Domain-specific Small Models (DSMs) possess rich domain knowledge and can accurately calculate molecular domain-related metrics. However, due to their limited model size and singular functionality, they lack the breadth of knowledge necessary for comprehensive representation learning. To leverage the advantages of both approaches in molecular property prediction, we propose a novel Molecular Graph representation learning framework that integrates Large language models and Domain-specific small models (MolGraph-LarDo). Technically, we design a two-stage prompt strategy where DSMs are introduced to calibrate the knowledge provided by LLMs, enhancing the accuracy of domain-specific information and thus enabling LLMs to generate more precise textual descriptions for molecular samples. Subsequently, we employ a multi-modal alignment method to coordinate various modalities, including molecular graphs and their corresponding descriptive texts, to guide the pre-training of molecular representations. Extensive experiments demonstrate the effectiveness of the proposed method.

[IR-2] Efficient Inference of Sub-Item Id-based Sequential Recommendation Models with Millions of Items RECSYS2024

Link: https://arxiv.org/abs/2408.09992
Authors: Aleksandr V. Petrov, Craig Macdonald, Nicola Tonellotto
Keywords: Transformer-based recommender systems, recommender systems, results in sequential, sequential recommendation, memory consumption
Categories: Information Retrieval (cs.IR); Data Structures and Algorithms (cs.DS)
Comments: Accepted by RecSys 2024

Abstract:Transformer-based recommender systems, such as BERT4Rec or SASRec, achieve state-of-the-art results in sequential recommendation. However, it is challenging to use these models in production environments with catalogues of millions of items: scaling Transformers beyond a few thousand items is problematic for several reasons, including high model memory consumption and slow inference. In this respect, RecJPQ is a state-of-the-art method of reducing the models’ memory consumption; RecJPQ compresses item catalogues by decomposing item IDs into a small number of shared sub-item IDs. Despite reporting the reduction of memory consumption by a factor of up to 50x, the original RecJPQ paper did not report inference efficiency improvements over the baseline Transformer-based models. Upon analysing RecJPQ’s scoring algorithm, we find that its efficiency is limited by its use of score accumulators for each item, which prevents parallelisation. In contrast, LightRec (a non-sequential method that uses a similar idea of sub-ids) reported large inference efficiency improvements using an algorithm we call PQTopK. We show that it is also possible to improve RecJPQ-based models’ inference efficiency using the PQTopK algorithm. In particular, we speed up RecJPQ-enhanced SASRec by a factor of 4.5x compared to the original SASRec’s inference method and by a factor of 1.56x compared to the method implemented in RecJPQ code on a large-scale Gowalla dataset with more than a million items. Further, using simulated data, we show that PQTopK remains efficient with catalogues of up to tens of millions of items, removing one of the last obstacles to using Transformer-based models in production environments with large catalogues.
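
The idea of scoring items through shared sub-item IDs can be sketched as follows. This is a plain-Python illustration, not the paper's PQTopK implementation: it assumes that an item's score is the sum of the scores of its sub-item IDs (one per split), and the names `sub_scores` and `item_codes` are hypothetical.

```python
import heapq


def pq_top_k(sub_scores, item_codes, k):
    """Return the indices of the k highest-scoring items, best first.

    sub_scores[s][c] is the score of sub-item c in split s.
    item_codes[i][s] is the sub-item ID assigned to item i in split s.
    """
    # Each item's score is computed independently of the others,
    # so this step could be vectorised or parallelised.
    item_scores = [
        sum(sub_scores[s][code] for s, code in enumerate(codes))
        for codes in item_codes
    ]
    return heapq.nlargest(k, range(len(item_codes)), key=item_scores.__getitem__)


# Three splits with two sub-item IDs each, four catalogue items.
sub_scores = [[0.1, 0.9], [0.5, 0.2], [0.0, 0.3]]
item_codes = [(0, 0, 0), (1, 0, 1), (1, 1, 0), (0, 1, 1)]
top = pq_top_k(sub_scores, item_codes, k=2)
print(top)  # [1, 2]
```

Only the small `sub_scores` table depends on the user's sequence; the item codes are fixed, which is why the catalogue can grow without a matching growth in per-item parameters.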

[IR-3] MAPLE: Enhancing Review Generation with Multi-Aspect Prompt LEarning in Explainable Recommendation

Link: https://arxiv.org/abs/2408.09865
Authors: Ching-Wen Yang, Che Wei Chen, Kun-da Wu, Hao Xu, Jui-Feng Yao, Hung-Yu Kao
Keywords: Explainable Recommendation task, Explainable Recommendation, Recommendation task, task is designed, designed to receive
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 8 main pages, 10 pages for appendix. Under review

Abstract:The Explainable Recommendation task is designed to receive a user-item pair and output explanations that justify why the item is recommended to the user. Many models treat review generation as a proxy for explainable recommendation. Although they are able to generate fluent and grammatical sentences, they suffer from generality and hallucination issues. We propose a personalized, aspect-controlled model called Multi-Aspect Prompt LEarner (MAPLE), which integrates aspect category as another input dimension to facilitate the memorization of fine-grained aspect terms. Experiments on two real-world review datasets in the restaurant domain show that MAPLE outperforms the baseline review-generation models in terms of text and feature diversity while maintaining excellent coherence and factual relevance. We further treat MAPLE as a retriever component in the retriever-reader framework and employ a Large Language Model (LLM) as the reader, showing that MAPLE’s explanation, combined with the LLM’s comprehension ability, leads to enriched and personalized explanations. We will release the code and data upon acceptance.

[IR-4] Fashion Image-to-Image Translation for Complementary Item Retrieval

Link: https://arxiv.org/abs/2408.09847
Authors: Matteo Attimonelli, Claudio Pomo, Dietmar Jannach, Tommaso Di Noia
Keywords: matching user queries, Bayesian Personalized Ranking, compatibility modeling, online fashion retail, focusing on matching
Categories: Information Retrieval (cs.IR)

Abstract:The increasing demand for online fashion retail has boosted research in fashion compatibility modeling and item retrieval, focusing on matching user queries (textual descriptions or reference images) with compatible fashion items. A key challenge is top-bottom retrieval, where precise compatibility modeling is essential. Traditional methods, often based on Bayesian Personalized Ranking (BPR), have shown limited performance. Recent efforts have explored using generative models in compatibility modeling and item retrieval, where generated images serve as additional inputs. However, these approaches often overlook the quality of generated images, which could be crucial for model performance. Additionally, generative models typically require large datasets, posing challenges when such data is scarce. To address these issues, we introduce the Generative Compatibility Model (GeCo), a two-stage approach that improves fashion image retrieval through paired image-to-image translation. First, the Complementary Item Generation Model (CIGM), built on Conditional Generative Adversarial Networks (GANs), generates target item images (e.g., bottoms) from seed items (e.g., tops), offering conditioning signals for retrieval. These generated samples are then integrated into GeCo, enhancing compatibility modeling and retrieval accuracy. Evaluations on three datasets show that GeCo outperforms state-of-the-art baselines. Key contributions include: (i) the GeCo model utilizing paired image-to-image translation within the Composed Image Retrieval framework, (ii) comprehensive evaluations on benchmark datasets, and (iii) the release of a new Fashion Taobao dataset designed for top-bottom retrieval, promoting further research.

[IR-5] Ranking Generated Answers: On the Agreement of Retrieval Models with Humans on Consumer Health Questions

Link: https://arxiv.org/abs/2408.09831
Authors: Sebastian Heineking, Jonas Probst, Daniel Steinbach, Martin Potthast, Harrisen Scells
Keywords: generative large language, large language models, difficult to scale, output of generative, generative large
Categories: Information Retrieval (cs.IR)

Abstract:Evaluating the output of generative large language models (LLMs) is challenging and difficult to scale. Most evaluations of LLMs focus on tasks such as single-choice question-answering or text classification. These tasks are not suitable for assessing open-ended question-answering capabilities, which are critical in domains where expertise is required, such as health, and where misleading or incorrect answers can have a significant impact on a user’s health. Using human experts to evaluate the quality of LLM answers is generally considered the gold standard, but expert annotation is costly and slow. We present a method for evaluating LLM answers that uses ranking signals as a substitute for explicit relevance judgements. Our scoring method correlates with the preferences of human experts. We validate it by investigating the well-known fact that the quality of generated answers improves with the size of the model as well as with more sophisticated prompting strategies.

[IR-6] Contextual Dual Learning Algorithm with Listwise Distillation for Unbiased Learning to Rank

Link: https://arxiv.org/abs/2408.09817
Authors: Lulu Yu, Keping Bi, Shiyu Ni, Jiafeng Guo
Keywords: implicit user feedback, leverage biased implicit, biased implicit user, unbiased ranking model, Dual Learning Algorithm
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 12 pages, 2 figures

Abstract:Unbiased Learning to Rank (ULTR) aims to leverage biased implicit user feedback (e.g., click) to optimize an unbiased ranking model. The effectiveness of the existing ULTR methods has primarily been validated on synthetic datasets. However, their performance on real-world click data remains unclear. Recently, Baidu released a large publicly available dataset of their web search logs. Subsequently, the NTCIR-17 ULTRE-2 task released a subset dataset extracted from it. We conduct experiments on commonly used or effective ULTR methods on this subset to determine whether they maintain their effectiveness. In this paper, we propose a Contextual Dual Learning Algorithm with Listwise Distillation (CDLA-LD) to simultaneously address both position bias and contextual bias. We utilize a listwise-input ranking model to obtain reconstructed feature vectors incorporating local contextual information and employ the Dual Learning Algorithm (DLA) method to jointly train this ranking model and a propensity model to address position bias. As this ranking model learns the interaction information within the documents list of the training set, to enhance the ranking model’s generalization ability, we additionally train a pointwise-input ranking model to learn the listwise-input ranking model’s capability for relevance judgment in a listwise manner. Extensive experiments and analysis confirm the effectiveness of our approach.

[IR-7] Revisiting Reciprocal Recommender Systems: Metrics Formulation and Method KDD2024

Link: https://arxiv.org/abs/2408.09748
Authors: Chen Yang, Sunhao Dai, Yupeng Hou, Wayne Xin Zhao, Jun Xu, Yang Song, Hengshu Zhu
Keywords: gained increasing attention, enhancing matching efficiency, conducting bilateral recommendations, Reciprocal recommender systems, involved parties
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: KDD 2024

Abstract:Reciprocal recommender systems (RRS), conducting bilateral recommendations between two involved parties, have gained increasing attention for enhancing matching efficiency. However, the majority of existing methods in the literature still reuse conventional ranking metrics to separately assess the performance on each side of the recommendation process. These methods overlook the fact that the ranking outcomes of both sides collectively influence the effectiveness of the RRS, neglecting the necessity of a more holistic evaluation and a capable systemic solution. In this paper, we systematically revisit the task of reciprocal recommendation by introducing new metrics, a new formulation, and a new method. Firstly, we propose five new evaluation metrics that comprehensively and accurately assess the performance of RRS from three distinct perspectives: overall coverage, bilateral stability, and balanced ranking. These metrics provide a more holistic understanding of the system’s effectiveness and enable a comprehensive evaluation. Furthermore, we formulate the RRS from a causal perspective, formulating recommendations as bilateral interventions, which can better model the decoupled effects of potential influencing factors. By utilizing the potential outcome framework, we further develop a model-agnostic causal reciprocal recommendation method that considers the causal effects of recommendations. Additionally, we introduce a reranking strategy to maximize matching outcomes, as measured by the proposed metrics. Extensive experiments on two real-world datasets from recruitment and dating scenarios demonstrate the effectiveness of our proposed metrics and approach. The code and dataset are available at: this https URL.
Related DOI: https://doi.org/10.1145/3637528.3671734

[IR-8] Carbon Footprint Accounting Driven by Large Language Models and Retrieval-augmented Generation

Link: https://arxiv.org/abs/2408.09713
Authors: Haijin Wang, Zheng Chen, Nan Shang, Shangheng Yao, Zibin Pan, Fushuan Wen, Junhua Zhao
Keywords: quantifying greenhouse gas, neutrality.The dynamic nature, supply structures necessitates, carbon neutrality.The dynamic, structures necessitates real-time
Categories: Information Retrieval (cs.IR)

Abstract:Carbon footprint accounting is crucial for quantifying greenhouse gas emissions and achieving carbon neutrality. The dynamic nature of processes, accounting rules, carbon-related policies, and energy supply structures necessitates real-time updates of CFA. Traditional life cycle assessment methods rely heavily on human expertise, making near-real-time updates challenging. This paper introduces a novel approach integrating large language models (LLMs) with retrieval-augmented generation technology to enhance the real-time, professional, and economical aspects of carbon footprint information retrieval and analysis. By leveraging LLMs’ logical and language understanding abilities and RAG’s efficient retrieval capabilities, the proposed method LLMs-RAG-CFA can retrieve more relevant professional information to assist LLMs, enhancing the model’s generative abilities. This method offers broad professional coverage, efficient real-time carbon footprint information acquisition and accounting, and cost-effective automation without frequent LLMs’ parameter updates. Experimental results across five industries (primary aluminum, lithium battery, photovoltaic, new energy vehicles, and transformers) demonstrate that the LLMs-RAG-CFA method outperforms traditional methods and other LLMs, achieving higher information retrieval rates and significantly lower information deviations and carbon footprint accounting deviations. The economically viable design utilizes RAG technology to balance real-time updates with cost-effectiveness, providing an efficient, reliable, and cost-saving solution for real-time carbon emission management, thereby enhancing environmental sustainability practices.

[IR-9] Harnessing Multimodal Large Language Models for Multimodal Sequential Recommendation

Link: https://arxiv.org/abs/2408.09698
Authors: Yuyang Ye, Zhi Zheng, Yishan Shen, Tianshu Wang, Hengruo Zhang, Peijun Zhu, Runlong Yu, Kai Zhang, Hui Xiong
Keywords: Large Language Models, Multimodal Large Language, demonstrated significant potential, Recent advances, Large Language
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract:Recent advances in Large Language Models (LLMs) have demonstrated significant potential in the field of Recommendation Systems (RSs). Most existing studies have focused on converting user behavior logs into textual prompts and leveraging techniques such as prompt tuning to enable LLMs for recommendation tasks. Meanwhile, research interest has recently grown in multimodal recommendation systems that integrate data from images, text, and other sources using modality fusion techniques. This introduces new challenges to the existing LLM-based recommendation paradigm, which relies solely on text modality information. Moreover, although Multimodal Large Language Models (MLLMs) capable of processing multi-modal inputs have emerged, how to equip MLLMs with multi-modal recommendation capabilities remains largely unexplored. To this end, in this paper, we propose the Multimodal Large Language Model-enhanced Sequential Multimodal Recommendation (MLLM-MSR) model. To capture the dynamic user preference, we design a two-stage user preference summarization method. Specifically, we first utilize an MLLM-based item-summarizer to extract image features given an item and convert the image into text. Then, we employ a recurrent user preference summarization generation paradigm to capture the dynamic changes in user preferences based on an LLM-based user-summarizer. Finally, to enable the MLLM for the multi-modal recommendation task, we propose to fine-tune an MLLM-based recommender using Supervised Fine-Tuning (SFT) techniques. Extensive evaluations across various datasets validate the effectiveness of MLLM-MSR, showcasing its superior ability to capture and adapt to the evolving dynamics of user preferences.

[IR-10] GANPrompt: Enhancing Robustness in LLM-Based Recommendations with GAN-Enhanced Diversity Prompts

Link: https://arxiv.org/abs/2408.09671
Authors: Xinyu Li, Chuang Zhao, Hongke Zhao, Likang Wu, Ming HE
Keywords: demonstrated remarkable proficiency, generating natural language, recent years, demonstrated remarkable, remarkable proficiency
Categories: Information Retrieval (cs.IR)

Abstract:In recent years, LLM has demonstrated remarkable proficiency in comprehending and generating natural language, with a growing prevalence in the domain of recommender systems. However, LLM continues to face a significant challenge in that it is highly susceptible to the influence of prompt words. This inconsistency in response to minor alterations in prompt input may compromise the accuracy and resilience of recommendation models. To address this issue, this paper proposes GANPrompt, a multi-dimensional large language model prompt diversity framework based on Generative Adversarial Networks (GANs). The framework enhances the model’s adaptability and stability to diverse prompts by integrating GAN generation techniques with the deep semantic understanding capabilities of LLMs. GANPrompt first trains a generator capable of producing diverse prompts by analysing multidimensional user behavioural data. These diverse prompts are then used to train the LLM to improve its performance in the face of unseen prompts. Furthermore, to ensure a high degree of diversity and relevance of the prompts, this study introduces a mathematical theory-based diversity constraint mechanism that optimises the generated prompts to ensure that they are not only superficially distinct, but also semantically cover a wide range of user intentions. Through extensive experiments on multiple datasets, we demonstrate the effectiveness of the proposed framework, especially in improving the adaptability and robustness of recommender systems in complex and dynamic environments. The experimental results demonstrate that GANPrompt yields substantial enhancements in accuracy and robustness relative to existing state-of-the-art methodologies.

[IR-11] Data-driven Conditional Instrumental Variables for Debiasing Recommender Systems

Link: https://arxiv.org/abs/2408.09651
Authors: Zhirong Huang, Shichao Zhang, Debo Cheng, Jiuyong Li, Lin Liu, Guangquan Lu
Keywords: true user preferences, latent variables, deviate from true, recommender systems, user-item interaction data
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract:In recommender systems, latent variables can cause user-item interaction data to deviate from true user preferences. This biased data is then used to train recommendation models, further amplifying the bias and ultimately compromising both recommendation accuracy and user satisfaction. Instrumental Variable (IV) methods are effective tools for addressing the confounding bias introduced by latent variables; however, identifying a valid IV is often challenging. To overcome this issue, we propose a novel data-driven conditional IV (CIV) debiasing method for recommender systems, called CIV4Rec. CIV4Rec automatically generates valid CIVs and their corresponding conditioning sets directly from interaction data, significantly reducing the complexity of IV selection while effectively mitigating the confounding bias caused by latent variables in recommender systems. Specifically, CIV4Rec leverages a variational autoencoder (VAE) to generate the representations of the CIV and its conditional set from interaction data, followed by the application of least squares to derive causal representations for click prediction. Extensive experiments on two real-world datasets, Movielens-10M and Douban-Movie, demonstrate that our CIV4Rec successfully identifies valid CIVs, effectively reduces bias, and consequently improves recommendation accuracy.
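
To make the instrumental-variable idea in the abstract above more concrete, here is a minimal sketch of classic two-stage least squares on scalar toy data. It is not CIV4Rec: there is no VAE, no conditioning set, and no learned representation. The names `ols` and `two_stage_least_squares` and the toy numbers are assumptions chosen purely for illustration.

```python
from typing import List, Sequence, Tuple


def ols(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var = sum((a - mx) ** 2 for a in x)
    slope = cov / var
    return slope, my - slope * mx


def two_stage_least_squares(z: Sequence[float], x: Sequence[float], y: Sequence[float]) -> float:
    """Estimate the effect of x on y using z as an instrument."""
    # Stage 1: keep only the part of x that is explained by the instrument.
    slope_zx, intercept_zx = ols(z, x)
    x_hat: List[float] = [slope_zx * zi + intercept_zx for zi in z]
    # Stage 2: regress the outcome on the instrument-driven part of x.
    return ols(x_hat, y)[0]


# Toy data (hypothetical): u is an unobserved confounder, the true effect of x on y is 2.
z = [0.0, 0.0, 1.0, 1.0]                      # instrument, independent of u
u = [0.0, 1.0, 0.0, 1.0]                      # latent confounder
x = [zi + ui for zi, ui in zip(z, u)]         # exposure
y = [2 * xi + 3 * ui for xi, ui in zip(x, u)]  # outcome

naive_effect = ols(x, y)[0]                   # biased upward by the confounder
iv_effect = two_stage_least_squares(z, x, y)  # recovers the true effect on this toy data
```

On this toy data the naive regression overestimates the effect because `u` drives both `x` and `y`, while the two-stage estimate uses only the variation in `x` that comes from `z`.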

[IR-12] Debiased Contrastive Representation Learning for Mitigating Dual Biases in Recommender Systems

Link: https://arxiv.org/abs/2408.09646
Authors: Zhirong Huang, Shichao Zhang, Debo Cheng, Jiuyong Li, Lin Liu, Guixian Zhang
Keywords: undermine recommender effectiveness, disproportionately favouring popular, biases undermine recommender, user-item historical data, favouring popular items
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract:In recommender systems, popularity and conformity biases undermine recommender effectiveness by disproportionately favouring popular items, leading to their over-representation in recommendation lists and causing an unbalanced distribution of user-item historical data. We construct a causal graph to address both biases and describe the abstract data generation mechanism. Then, we use it as a guide to develop a novel Debiased Contrastive Learning framework for Mitigating Dual Biases, called DCLMDB. In DCLMDB, both popularity bias and conformity bias are handled in the model training process by contrastive learning to ensure that user choices and recommended items are not unduly influenced by conformity and popularity. Extensive experiments on two real-world datasets, Movielens-10M and Netflix, show that DCLMDB can effectively reduce the dual biases, as well as significantly enhance the accuracy and diversity of recommendations.

[IR-13] On the Necessity of World Knowledge for Mitigating Missing Labels in Extreme Classification

Link: https://arxiv.org/abs/2408.09585
Authors: Jatin Prakash, Anirudh Buvanesh, Bishal Santra, Deepak Saini, Sachin Yadav, Jian Jiao, Yashoteja Prabhu, Amit Sharma, Manik Varma
Keywords: Extreme Classification, aims to map, map a query, missing labels, missing
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR)
Comments: Preprint, 23 pages

Abstract:Extreme Classification (XC) aims to map a query to the most relevant documents from a very large document set. XC algorithms used in real-world applications learn this mapping from datasets curated from implicit feedback, such as user clicks. However, these datasets inevitably suffer from missing labels. In this work, we observe that systematic missing labels lead to missing knowledge, which is critical for accurately modelling relevance between queries and documents. We formally show that this absence of knowledge cannot be recovered using existing methods such as propensity weighting and data imputation strategies that solely rely on the training dataset. While LLMs provide an attractive solution to augment the missing knowledge, leveraging them in applications with low latency requirements and large document sets is challenging. To incorporate missing knowledge at scale, we propose SKIM (Scalable Knowledge Infusion for Missing Labels), an algorithm that leverages a combination of small LM and abundant unstructured meta-data to effectively mitigate the missing label problem. We show the efficacy of our method on large-scale public datasets through exhaustive unbiased evaluation ranging from human annotations to simulations inspired from industrial settings. SKIM outperforms existing methods on Recall@100 by more than 10 absolute points. Additionally, SKIM scales to proprietary query-ad retrieval datasets containing 10 million documents, outperforming contemporary methods by 12% in offline evaluation and increased ad click-yield by 1.23% in an online A/B test conducted on a popular search engine. We release our code, prompts, trained XC models and finetuned SLMs at: this https URL

[IR-14] WPN: An Unlearning Method Based on N-pair Contrastive Learning in Language Models ECAI2024

Link: https://arxiv.org/abs/2408.09459
Authors: Guitao Chen, Yunshen Wang, Hongye Sun, Guang Chen
Keywords: Generative language models, offer numerous advantages, Generative language, harmful knowledge acquired, harmful outputs due
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: ECAI 2024

Abstract:Generative language models (LMs) offer numerous advantages but may produce inappropriate or harmful outputs due to the harmful knowledge acquired during pre-training. This knowledge often manifests as undesirable correspondences, such as “harmful prompts” leading to “harmful outputs,” which our research aims to mitigate through unlearning techniques. However, existing unlearning methods based on gradient ascent can significantly impair the performance of LMs. To address this issue, we propose a novel approach called Weighted Positional N-pair (WPN) Learning, which leverages position-weighted mean pooling within an n-pair contrastive learning framework. WPN is designed to modify the output distribution of LMs by eliminating specific harmful outputs (e.g., replacing toxic responses with neutral ones), thereby transforming the model’s behavior from “harmful prompt-harmful output” to “harmful prompt-harmless response”. Experiments on OPT and GPT-NEO LMs show that WPN effectively reduces the proportion of harmful responses, achieving a harmless rate of up to 95.8% while maintaining stable performance on nine common benchmarks (with less than 2% degradation on average). Moreover, we provide empirical evidence to demonstrate WPN’s ability to weaken the harmful correspondences in terms of generalizability and robustness, as evaluated on out-of-distribution test sets and under adversarial attacks.

[IR-15] Towards Boosting LLMs-driven Relevance Modeling with Progressive Retrieved Behavior-augmented Prompting

Link: https://arxiv.org/abs/2408.09439
Authors: Zeyuan Chen, Haiyan Wu, Kaixin Wu, Wei Chen, Mingjie Zhong, Jia Xu, Zhongyi Liu, Wei Zhang
Keywords: enhancing user experience, Relevance modeling, Relevance, critical component, component for enhancing
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract:Relevance modeling is a critical component for enhancing user experience in search engines, with the primary objective of identifying items that align with users’ queries. Traditional models only rely on the semantic congruence between queries and items to ascertain relevance. However, this approach represents merely one aspect of the relevance judgement, and is insufficient in isolation. Even powerful Large Language Models (LLMs) still cannot accurately judge the relevance of a query and an item from a semantic perspective. To augment LLMs-driven relevance modeling, this study proposes leveraging user interactions recorded in search logs to yield insights into users’ implicit search intentions. The challenge lies in the effective prompting of LLMs to capture dynamic search intentions, which poses several obstacles in real-world relevance scenarios, i.e., the absence of domain-specific knowledge, the inadequacy of an isolated prompt, and the prohibitive costs associated with deploying LLMs. In response, we propose ProRBP, a novel Progressive Retrieved Behavior-augmented Prompting framework for integrating search scenario-oriented knowledge with LLMs effectively. Specifically, we perform the user-driven behavior neighbors retrieval from the daily search logs to obtain domain-specific knowledge in time, retrieving candidates that users consider to meet their expectations. Then, we guide LLMs for relevance modeling by employing advanced prompting techniques that progressively improve the outputs of the LLMs, followed by a progressive aggregation with comprehensive consideration of diverse aspects. For online serving, we have developed an industrial application framework tailored for the deployment of LLMs in relevance modeling. Experiments on real-world industry data and online A/B testing demonstrate our proposal achieves promising performance.

[IR-16] Hindi-BEIR : A Large Scale Retrieval Benchmark in Hindi

Link: https://arxiv.org/abs/2408.09437
Authors: Arkadeep Acharya, Rudra Murthy, Vishwajeet Kumar, Jaydeep Sen
Keywords: Hindi speakers worldwide, information retrieval systems, efficient information retrieval, Hindi, speakers worldwide
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Abstract:Given the large number of Hindi speakers worldwide, there is a pressing need for robust and efficient information retrieval systems for Hindi. Despite ongoing research, there is a lack of a comprehensive benchmark for evaluating retrieval models in Hindi. To address this gap, we introduce the Hindi version of the BEIR benchmark, which includes a subset of English BEIR datasets translated to Hindi, existing Hindi retrieval datasets, and synthetically created datasets for retrieval. The benchmark comprises 15 datasets spanning 8 distinct tasks. We evaluate state-of-the-art multilingual retrieval models on this benchmark to identify task and domain-specific challenges and their impact on retrieval performance. By releasing this benchmark and a set of relevant baselines, we enable researchers to understand the limitations and capabilities of current Hindi retrieval models, promoting advancements in this critical area. The datasets from Hindi-BEIR are publicly available.

[IR-17] ELASTIC: Efficient Linear Attention for Sequential Interest Compression AAAI2025

Link: https://arxiv.org/abs/2408.09380
Authors: Jiaxin Deng, Shiyao Wang, Song Lu, Yinfeng Li, Xinchen Luo, Yuanjun Liu, Peixing Xu, Guorui Zhou
Keywords: models heavily rely, transformer attention mechanism, linear dispatcher attention, Efficient Linear Attention, dispatcher attention mechanism
Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: Submitted to AAAI 2025

Abstract:State-of-the-art sequential recommendation models rely heavily on the transformer’s attention mechanism. However, the quadratic computational and memory complexities of self-attention have limited its scalability for modeling users’ long-range behaviour sequences. To address this problem, we propose ELASTIC, an Efficient Linear Attention for SequenTial Interest Compression, requiring only linear time complexity and decoupling model capacity from computational cost. Specifically, ELASTIC introduces fixed-length interest experts with a linear dispatcher attention mechanism, which compresses the long-term behaviour sequences into a significantly more compact representation, reducing GPU memory usage by up to 90% with a 2.7x inference speed-up. The proposed linear dispatcher attention mechanism significantly reduces the quadratic complexity and makes the model feasible for adequately modeling extremely long sequences. Moreover, in order to retain the capacity for modeling various user interests, ELASTIC initializes a vast learnable interest memory bank and sparsely retrieves compressed user interests from the memory with a negligible computational overhead. The proposed interest memory retrieval technique significantly expands the cardinality of the available interest space while keeping the same computational cost, thereby striking a trade-off between recommendation accuracy and efficiency. To validate the effectiveness of our proposed ELASTIC, we conduct extensive experiments on various public datasets and compare it with several strong sequential recommenders. Experimental results demonstrate that ELASTIC consistently outperforms baselines by a significant margin and also highlight the computational efficiency of ELASTIC when modeling long sequences. We will make our implementation code publicly available.

[IR-18] Gender Dynamics in Russian Online Political Discourse

Link: https://arxiv.org/abs/2408.09378
Authors: Elizaveta Savchenko, Michael Raphael Freedman
Keywords: military events Notably, events Notably females, conflict periods Contrary, Russian-Ukrainian war analyzing, study examines YouTube
Categories: Information Retrieval (cs.IR); Social and Information Networks (cs.SI)

Abstract:The digital landscape provides a dynamic platform for political discourse, crucial for understanding shifts in public opinion and engagement, especially under authoritarian governments. This study examines YouTube user behavior during the Russian-Ukrainian war, analyzing 2,168 videos with over 36,000 comments from January 2022 to February 2024. We observe distinct patterns of participation and gender dynamics that correlate with major political and military events. Notably, females were more active in anti-government channels, especially during peak conflict periods. Contrary to assumptions about online engagement in authoritarian contexts, our findings suggest a complex interplay where women emerge as pivotal digital communicators. This highlights online platforms’ role in facilitating political expression under authoritarian regimes, demonstrating their potential as a barometer for public sentiment.

[IR-19] Deep Code Search with Naming-Agnostic Contrastive Multi-View Learning

Link: https://arxiv.org/abs/2408.09345
Authors: Jiadong Feng, Wei Li, Zhao Wei, Yong Xu, Juhong Wang, Hui Li
Keywords: Code search, Software development, software development process, based code search, Code
Categories: Information Retrieval (cs.IR); Software Engineering (cs.SE)

Abstract:Software development is a repetitive task, as developers usually reuse or get inspiration from existing implementations. Code search, which refers to the retrieval of relevant code snippets from a codebase according to the developer’s intent that has been expressed as a query, has become increasingly important in the software development process. Due to the success of deep learning in various applications, a great number of deep learning based code search approaches have sprung up and achieved promising results. However, developers may not follow the same naming conventions and the same variable may have different variable names in different implementations, bringing a challenge to deep learning based code search methods that rely on explicit variable correspondences to understand source code. To overcome this challenge, we propose a naming-agnostic code search method (NACS) based on contrastive multi-view code representation learning. NACS strips information bound to variable names from Abstract Syntax Tree (AST), the representation of the abstract syntactic structure of source code, and focuses on capturing intrinsic properties solely from AST structures. We use semantic-level and syntax-level augmentation techniques to prepare realistically rational data and adopt contrastive learning to design a graph-view modeling component in NACS to enhance the understanding of code snippets. We further model ASTs in a path view to strengthen the graph-view modeling component through multi-view learning. Extensive experiments show that NACS provides superior code search performance compared to baselines and NACS can be adapted to help existing code search methods overcome the impact of different naming conventions.
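
The core idea of discarding identifier names while keeping AST structure can be sketched with Python's standard `ast` module. This is only an illustration under stated assumptions, not the NACS implementation: the class `StripNames`, the helper `structure`, the `_` placeholder, and the choice of Python as the target language are all hypothetical, and no contrastive or multi-view learning is involved.

```python
import ast


class StripNames(ast.NodeTransformer):
    """Replace every identifier with one placeholder so only structure remains."""

    def visit_Name(self, node: ast.Name) -> ast.AST:
        node.id = "_"  # variable references
        return node

    def visit_arg(self, node: ast.arg) -> ast.AST:
        node.arg = "_"  # function parameters
        return node

    def visit_FunctionDef(self, node: ast.FunctionDef) -> ast.AST:
        node.name = "_"  # function names
        self.generic_visit(node)  # keep walking into arguments and body
        return node


def structure(source: str) -> str:
    """Return a textual dump of the name-stripped AST."""
    tree = StripNames().visit(ast.parse(source))
    return ast.dump(tree)


# Two snippets that differ only in naming, and one that differs in structure.
snippet_a = "def total(values):\n    acc = 0\n    for v in values:\n        acc += v\n    return acc\n"
snippet_b = "def sum_all(xs):\n    s = 0\n    for item in xs:\n        s += item\n    return s\n"
snippet_c = "def first(xs):\n    return xs[0]\n"

same_structure = structure(snippet_a) == structure(snippet_b)
different_structure = structure(snippet_a) != structure(snippet_c)
```

Collapsing every identifier to a single placeholder is the crudest possible choice: it also erases which occurrences refer to the same variable, so a practical system would likely need a finer-grained scheme.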

[IR-20] A Study of PHOC Spatial Region Configurations for Math Formula Retrieval

Link: https://arxiv.org/abs/2408.09283
Authors: Matt Langsenkamp, Bryan Amador, Richard Zanibbi
Keywords: Histogram Of Characters, Pyramidal Histogram, PHOC, represents the spatial, spatial location
Categories: Information Retrieval (cs.IR)

Abstract:A Pyramidal Histogram Of Characters (PHOC) represents the spatial location of symbols as binary vectors. The vectors are composed of levels that split a formula into equal-sized regions of one or more types (e.g., rectangles or ellipses). For each region type, this produces a pyramid of overlapping regions, where the first level contains the entire formula, and the final level the finest-grained regions. In this work, we introduce concentric rectangles for regions, and analyze whether subsequent PHOC levels encode redundant information by omitting levels from PHOC configurations. As a baseline, we include a bag of words PHOC containing only the first whole-formula level. Finally, using the ARQMath-3 formula retrieval benchmark, we demonstrate that some levels encoded in the original PHOC configurations are redundant, that PHOC models with rectangular regions outperform earlier PHOC models, and that despite their simplicity, PHOC models are surprisingly competitive with the state-of-the-art. PHOC is not math-specific, and might be used for chemical diagrams, charts, or other graphics.
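
As a rough illustration of the representation described above, the following sketch builds a PHOC-style binary vector using vertical strips only. The function `phoc_vector`, the normalized symbol positions, and the rule of assigning a symbol to a region by its centre point are assumptions made for this example; the paper's region types (including the concentric rectangles it introduces) and its membership test may differ.

```python
from typing import List, Sequence, Tuple

# A symbol is (label, x_center), with x_center normalized to [0, 1).
Symbol = Tuple[str, float]


def phoc_vector(symbols: Sequence[Symbol], vocab: Sequence[str], levels: Sequence[int]) -> List[int]:
    """Concatenate one binary bag of symbols per region, level by level.

    A level with n regions splits the formula width into n equal vertical strips;
    the level with 1 region covers the whole formula (a plain bag of symbols).
    """
    vector: List[int] = []
    for n_regions in levels:
        for region in range(n_regions):
            lo, hi = region / n_regions, (region + 1) / n_regions
            present = {label for label, x in symbols if lo <= x < hi}
            vector.extend(1 if v in present else 0 for v in vocab)
    return vector


# Hypothetical formula "x + y" laid out left to right.
formula = [("x", 0.1), ("+", 0.5), ("y", 0.9)]
vocab = ["x", "y", "+"]

full = phoc_vector(formula, vocab, levels=[1, 2])  # whole formula plus two halves
bag_only = phoc_vector(formula, vocab, levels=[1])  # baseline: first level only
```

Omitting a level, as in the redundancy analysis the abstract mentions, corresponds here to dropping an entry from `levels`.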

[IR-21] Towards Effective Top-N Hamming Search via Bipartite Graph Contrastive Hashing

Link: https://arxiv.org/abs/2408.09239
Authors: Yankai Chen, Yixiang Fang, Yifei Zhang, Chenhao Ma, Yang Hong, Irwin King
Keywords: recommendation systems, document querying, bipartite graphs serves, fundamental task, Searching
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract:Searching on bipartite graphs serves as a fundamental task for various real-world applications, such as recommendation systems, database retrieval, and document querying. Conventional approaches rely on similarity matching in continuous Euclidean space of vectorized node embeddings. To handle intensive similarity computation efficiently, hashing techniques for graph-structured data have emerged as a prominent research direction. However, despite the retrieval efficiency in Hamming space, previous studies have encountered catastrophic performance decay. To address this challenge, we investigate the problem of hashing with Graph Convolutional Network for effective Top-N search. Our findings indicate the learning effectiveness of incorporating hashing techniques within the exploration of bipartite graph reception fields, as opposed to simply treating hashing as post-processing to output embeddings. To further enhance the model performance, we advance upon these findings and propose Bipartite Graph Contrastive Hashing (BGCH+). BGCH+ introduces a novel dual augmentation approach to both intermediate information and hash code outputs in the latent feature spaces, thereby producing more expressive and robust hash codes within a dual self-supervised learning paradigm. Comprehensive empirical analyses on six real-world benchmarks validate the effectiveness of our dual feature contrastive learning in boosting the performance of BGCH+ compared to existing approaches.
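
For readers unfamiliar with Hamming-space retrieval, here is a minimal sketch of the simple baseline that the abstract contrasts against: hashing applied as post-processing to continuous embeddings, followed by Top-N search by Hamming distance. It is not BGCH+ and involves no graph convolution or contrastive learning; the function names and toy embeddings are assumptions for illustration.

```python
from typing import Dict, List, Sequence


def to_hash_code(embedding: Sequence[float]) -> int:
    """Binarize an embedding by sign and pack the bits into an integer."""
    code = 0
    for value in embedding:
        code = (code << 1) | (1 if value >= 0 else 0)
    return code


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two hash codes differ."""
    return bin(a ^ b).count("1")


def top_n(query_code: int, item_codes: Dict[str, int], n: int) -> List[str]:
    """Return the n item ids closest to the query in Hamming space (ties broken by id)."""
    ranked = sorted(item_codes, key=lambda item: (hamming(query_code, item_codes[item]), item))
    return ranked[:n]


# Hypothetical 4-dimensional embeddings for one user and three items.
user = to_hash_code([0.3, -1.2, 0.8, 0.1])
items = {
    "a": to_hash_code([0.5, -0.4, 0.9, 0.2]),
    "b": to_hash_code([-0.5, -0.4, 0.9, 0.2]),
    "c": to_hash_code([-0.1, 0.7, -0.3, -0.9]),
}
recommended = top_n(user, items, n=2)
```

Sign binarization throws away all magnitude information, which is one intuition for the performance decay the abstract attributes to earlier hashing approaches.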

[IR-22] Hybrid Semantic Search: Unveiling User Intent Beyond Keywords

Link: https://arxiv.org/abs/2408.09236
Authors: Aman Ahluwalia, Bishwajit Sutradhar, Karishma Ghosh
Keywords: Large Language Models, Large Language, non-semantic search engines, understanding user intent, traditional keyword-based search
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract:This paper addresses the limitations of traditional keyword-based search in understanding user intent and introduces a novel hybrid search approach that leverages the strengths of non-semantic search engines, Large Language Models (LLMs), and embedding models. The proposed system integrates keyword matching, semantic vector embeddings, and LLM-generated structured queries to deliver highly relevant and contextually appropriate search results. By combining these complementary methods, the hybrid approach effectively captures both explicit and implicit user intent. The paper further explores techniques to optimize query execution for faster response times and demonstrates the effectiveness of this hybrid search model in producing comprehensive and accurate search outcomes.
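
One common way to combine keyword and embedding signals is a weighted sum of the two scores; the sketch below shows that pattern. The abstract does not say how the paper fuses its signals, so the weighting scheme, the `alpha` parameter, the function names, and the toy vectors are all assumptions, and the LLM-generated structured queries are left out entirely.

```python
import math
from typing import Dict, List, Sequence, Tuple


def keyword_score(query: str, doc: str) -> float:
    """Fraction of query terms that literally appear in the document."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_rank(
    query: str,
    query_vec: Sequence[float],
    docs: Dict[str, str],
    doc_vecs: Dict[str, Sequence[float]],
    alpha: float = 0.5,
) -> List[Tuple[str, float]]:
    """Rank documents by alpha * keyword score + (1 - alpha) * embedding similarity."""
    scored = []
    for doc_id, text in docs.items():
        score = alpha * keyword_score(query, text) + (1 - alpha) * cosine(query_vec, doc_vecs[doc_id])
        scored.append((doc_id, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical corpus; the 2-d vectors stand in for real embedding-model output.
docs = {"d1": "cheap flights to paris", "d2": "budget airfare deals"}
doc_vecs = {"d1": [0.6, 0.8], "d2": [1.0, 0.0]}
query, query_vec = "cheap flights", [1.0, 0.0]

keyword_only = hybrid_rank(query, query_vec, docs, doc_vecs, alpha=1.0)
semantic_only = hybrid_rank(query, query_vec, docs, doc_vecs, alpha=0.0)
```

With `alpha=1.0` the literal match `d1` ranks first, while with `alpha=0.0` the paraphrase `d2` wins; intermediate values trade the two signals off.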

[IR-23] FabricQA-Extractor: A Question Answering System to Extract Information from Documents using Natural Language Questions

Link: https://arxiv.org/abs/2408.09226
Authors: Qiming Wang,Raul Castro Fernandez
Keywords: Reading comprehension models, answer questions posed, Reading comprehension, comprehension models answer, models answer questions
Categories: Information Retrieval (cs.IR)

Abstract:Reading comprehension models answer questions posed in natural language when provided with a short passage of text. They present an opportunity to address a long-standing challenge in data management: the extraction of structured data from unstructured text. Consequently, several approaches are using these models to perform information extraction. However, these modern approaches leave an opportunity behind because they do not exploit the relational structure of the target extraction table. In this paper, we introduce a new model, Relation Coherence, that exploits knowledge of the relational structure to improve the extraction quality. We incorporate the Relation Coherence model as part of FabricQA-Extractor, an end-to-end system we built from scratch to conduct large scale extraction tasks over millions of documents. We demonstrate on two datasets with millions of passages that Relation Coherence boosts extraction performance and evaluate FabricQA-Extractor on large scale datasets.

[IR-24] TC-RAG: Turing-Complete RAGs Case study on Medical LLM Systems

Link: https://arxiv.org/abs/2408.09199
Authors: Xinke Jiang,Yue Fang,Rihong Qiu,Haoyu Zhang,Yongxin Xu,Hao Chen,Wentao Zhang,Ruizhe Zhang,Yuchen Fang,Xu Chu,Junfeng Zhao,Yasha Wang
Keywords: Large Language Models, domain-specific Large Language, enhancing domain-specific Large, Language Models, highly specialized queries
Categories: Information Retrieval (cs.IR)
Comments: version 1.0

Abstract:In the pursuit of enhancing domain-specific Large Language Models (LLMs), Retrieval-Augmented Generation (RAG) emerges as a promising solution to mitigate issues such as hallucinations, outdated knowledge, and limited expertise in highly specialized queries. However, existing approaches to RAG fall short by neglecting system state variables, which are crucial for ensuring adaptive control, retrieval halting, and system convergence. In this paper, we introduce the TC-RAG through rigorous proof, a novel framework that addresses these challenges by incorporating a Turing Complete System to manage state variables, thereby enabling more efficient and accurate knowledge retrieval. By leveraging a memory stack system with adaptive retrieval, reasoning, and planning capabilities, TC-RAG not only ensures the controlled halting of retrieval processes but also mitigates the accumulation of erroneous knowledge via Push and Pop actions. In the case study of the medical domain, our extensive experiments on real-world healthcare datasets demonstrate the superiority of TC-RAG over existing methods in accuracy by over 7.20%. Our dataset and code are available at https://github.com/Artessay/SAMA.git.
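
The Push and Pop actions and the controlled halting mentioned in the abstract can be pictured with a small stack loop. This is only an illustrative sketch and does not reproduce the paper's formal construction: the scripted `steps` stand in for decisions an LLM would make, and `halt_threshold`, `max_steps`, and the single confidence value used as a state variable are hypothetical.

```python
def run_stack_loop(steps, halt_threshold=0.8, max_steps=10):
    """Apply (action, payload, confidence) steps to a memory stack.

    "push" adds a retrieved note; "pop" discards the most recent note, for
    example because it was judged erroneous. The loop halts once the
    confidence state variable reaches halt_threshold or max_steps is hit.
    """
    stack = []
    for step_count, (action, payload, confidence) in enumerate(steps, start=1):
        if action == "push":
            stack.append(payload)
        elif action == "pop" and stack:
            stack.pop()
        if confidence >= halt_threshold or step_count >= max_steps:
            break
    return stack

memory = run_stack_loop([
    ("push", "note A", 0.3),
    ("push", "wrong note", 0.2),
    ("pop", None, 0.4),           # remove the erroneous note
    ("push", "note B", 0.9),      # confidence crosses the threshold: halt
    ("push", "never reached", 0.1),
])
```

Here the erroneous note is removed before it can contaminate later reasoning, and the final step is never executed because the loop has already halted.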

[IR-25] Ranking Across Different Content Types: The Robust Beauty of Multinomial Blending RECSYS24

Link: https://arxiv.org/abs/2408.09168
Authors: Jan Malte Lichtenberg,Giuseppe Di Benedetto,Matteo Ruffini
Keywords: media streaming services, multiple content types, content types, increasing number, number of media
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: To appear in 18th ACM Conference on Recommender Systems (RecSys24), Bari, Italy. ACM, New York, NY, USA, 3 pages

Abstract:An increasing number of media streaming services have expanded their offerings to include entities of multiple content types. For instance, audio streaming services that started by offering music only, now also offer podcasts, merchandise items, and videos. Ranking items across different content types into a single slate poses a significant challenge for traditional learning-to-rank (LTR) algorithms due to differing user engagement patterns for different content types. We explore a simple method for cross-content-type ranking, called multinomial blending (MB), which can be used in conjunction with most existing LTR algorithms. We compare MB to existing baselines not only in terms of ranking quality but also from other industry-relevant perspectives such as interpretability, ease-of-use, and stability in dynamic environments with changing user behavior and ranking model retraining. Finally, we report the results of an A/B test from an Amazon Music ranking use-case.
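
The abstract does not spell out how multinomial blending works. One plausible reading, stated here as an assumption rather than the authors' algorithm, is that each slot of the slate first samples a content type from a multinomial distribution and then takes the next-best item from that type's own ranked list. The function name, the probabilities, and the item identifiers below are all hypothetical.

```python
import random

def multinomial_blend(ranked_by_type, type_probs, slate_size, seed=None):
    """Build a slate by sampling a content type per slot.

    ranked_by_type maps a content type to its items, best first (for example
    the output of a per-type learning-to-rank model). type_probs maps a
    content type to its sampling weight.
    """
    rng = random.Random(seed)
    queues = {t: list(items) for t, items in ranked_by_type.items()}
    slate = []
    while len(slate) < slate_size:
        # Only types that still have items and a positive weight can be drawn.
        available = [t for t in queues if queues[t] and type_probs.get(t, 0) > 0]
        if not available:
            break
        weights = [type_probs[t] for t in available]
        chosen = rng.choices(available, weights=weights, k=1)[0]
        slate.append(queues[chosen].pop(0))
    return slate

ranked = {"music": ["m1", "m2", "m3"], "podcast": ["p1", "p2"]}
slate = multinomial_blend(ranked, {"music": 0.7, "podcast": 0.3}, slate_size=4, seed=0)
```

Under this reading, the within-type order produced by each ranker is preserved, and the type probabilities are the only knob controlling the mix, which would fit the interpretability and ease-of-use angle the abstract emphasizes.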

[IR-26] CodeTaxo: Enhancing Taxonomy Expansion with Limited Examples via Code Language Prompts

Link: https://arxiv.org/abs/2408.09070
Authors: Qingkai Zeng,Yuyang Bai,Zhaoxuan Tan,Zhenyu Wu,Shangbin Feng,Meng Jiang
Keywords: representation of knowledge, play a crucial, crucial role, applications by providing, providing a structural
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Abstract:Taxonomies play a crucial role in various applications by providing a structural representation of knowledge. The task of taxonomy expansion involves integrating emerging concepts into existing taxonomies by identifying appropriate parent concepts for these new query concepts. Previous approaches typically relied on self-supervised methods that generate annotation data from existing taxonomies. However, these methods are less effective when the existing taxonomy is small (fewer than 100 entities). In this work, we introduce CodeTaxo, a novel approach that leverages large language models through code language prompts to capture the taxonomic structure. Extensive experiments on five real-world benchmarks from different domains demonstrate that CodeTaxo consistently achieves superior performance across all evaluation metrics, significantly outperforming previous state-of-the-art methods. The code and data are available at this https URL.

[IR-27] Meta Knowledge for Retrieval Augmented Large Language Models KDD2024

Link: https://arxiv.org/abs/2408.09017
Authors: Laurent Mombaerts,Terry Ding,Adi Banerjee,Florian Felice,Jonathan Taws,Tarik Borogovac
Keywords: Retrieval Augmented Generation, underlying model parameters, augment Large Language, Augmented Generation, contextually relevant
Categories: Information Retrieval (cs.IR)
Comments: Accepted in Workshop on Generative AI for Recommender Systems and Personalization, KDD 2024

Abstract:Retrieval Augmented Generation (RAG) is a technique used to augment Large Language Models (LLMs) with contextually relevant, time-critical, or domain-specific information without altering the underlying model parameters. However, constructing RAG systems that can effectively synthesize information from a large and diverse set of documents remains a significant challenge. We introduce a novel data-centric RAG workflow for LLMs, transforming the traditional retrieve-then-read system into a more advanced prepare-then-rewrite-then-retrieve-then-read framework, to achieve higher domain expert-level understanding of the knowledge base. Our methodology relies on generating metadata and synthetic Questions and Answers (QA) for each document, as well as introducing the new concept of Meta Knowledge Summary (MK Summary) for metadata-based clusters of documents. The proposed innovations enable personalized user-query augmentation and in-depth information retrieval across the knowledge base. Our research makes two significant contributions: using LLMs as evaluators and employing new comparative performance metrics, we demonstrate that (1) using augmented queries with synthetic question matching significantly outperforms traditional RAG pipelines that rely on document chunking (p < 0.01), and (2) meta knowledge-augmented queries additionally significantly improve retrieval precision and recall, as well as the final answer's breadth, depth, relevancy, and specificity. Our methodology is cost-effective, costing less than $20 per 2000 research papers using Claude 3 Haiku, and can be adapted with any fine-tuning of either the language or embedding models to further enhance the performance of end-to-end RAG pipelines.

[IR-28] From Lazy to Prolific: Tackling Missing Labels in Open Vocabulary Extreme Classification by Positive-Unlabeled Sequence Learning

Link: https://arxiv.org/abs/2408.08981
Authors: Haoran Ranran Zhang,Bensu Uçar,Soumik Dey,Hansi Wu,Binbin Li,Rui Zhang
Keywords: Extreme Multi-label Classification, Open-vocabulary Extreme Multi-label, extends traditional XMC, Open-vocabulary Extreme, Multi-label Classification
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Abstract:Open-vocabulary Extreme Multi-label Classification (OXMC) extends traditional XMC by allowing prediction beyond an extremely large, predefined label set (typically $10^3$ to $10^{12}$ labels), addressing the dynamic nature of real-world labeling tasks. However, self-selection bias in data annotation leads to significant missing labels in both training and test data, particularly for less popular inputs. This creates two critical challenges: generation models learn to be “lazy” by under-generating labels, and evaluation becomes unreliable due to insufficient annotation in the test set. In this work, we introduce Positive-Unlabeled Sequence Learning (PUSL), which reframes OXMC as an infinite keyphrase generation task, addressing the generation model’s laziness. Additionally, we propose to adopt a suite of evaluation metrics, F1@$\mathcal{O}$ and newly proposed B@$k$, to reliably assess OXMC models with incomplete ground truths. In a highly imbalanced e-commerce dataset with substantial missing labels, PUSL generates 30% more unique labels, and 72% of its predictions align with actual user queries. On the less skewed EURLex-4.3k dataset, PUSL demonstrates superior F1 scores, especially as label counts increase from 15 to 30. Our approach effectively tackles both the modeling and evaluation challenges in OXMC with missing labels.

[IR-29] ASGM-KG: Unveiling Alluvial Gold Mining Through Knowledge Graphs

Link: https://arxiv.org/abs/2408.08972
Authors: Debashis Gupta,Aditi Golder,Luis Fernendez,Miles Silman,Greg Lersen,Fan Yang,Bob Plemmons,Sarra Alqahtani,Paul Victor Pauca
Keywords: Small-Scale Gold Mining, highly destructive mining, destructive mining practice, Gold Mining, Artisanal and Small-Scale
Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

Abstract:Artisanal and Small-Scale Gold Mining (ASGM) is a low-cost yet highly destructive mining practice, leading to environmental disasters across the world’s tropical watersheds. The topic of ASGM spans multiple domains of research and information, including natural and social systems, and knowledge is often atomized across a diversity of media and documents. We therefore introduce a knowledge graph (ASGM-KG) that consolidates and provides crucial information about ASGM practices and their environmental effects. The current version of ASGM-KG consists of 1,899 triples extracted using a large language model (LLM) from documents and reports published by both non-governmental and governmental organizations. These documents were carefully selected by a group of tropical ecologists with expertise in ASGM. This knowledge graph was validated using two methods. First, a small team of ASGM experts reviewed and labeled triples as factual or non-factual. Second, we devised and applied an automated factual reduction framework that relies on a search engine and an LLM for labeling triples. Our framework performs as well as five baselines on a publicly available knowledge graph and achieves over 90% accuracy on our ASGM-KG validated by domain experts. ASGM-KG demonstrates an advancement in knowledge aggregation and representation for complex, interdisciplinary environmental crises such as ASGM.

[IR-30] RoarGraph: A Projected Bipartite Graph for Efficient Cross-Modal Approximate Nearest Neighbor Search VLDB

Link: https://arxiv.org/abs/2408.08933
Authors: Meng Chen,Kai Zhang,Zhenying He,Yinan Jing,X. Sean Wang
Keywords: Approximate Nearest Neighbor, Approximate Nearest, language model-based applications, including recommendation systems, large language model-based
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments: to be published in PVLDB

Abstract:Approximate Nearest Neighbor Search (ANNS) is a fundamental and critical component in many applications, including recommendation systems and large language model-based applications. With the advancement of multimodal neural models, which transform data from different modalities into a shared high-dimensional space as feature vectors, cross-modal ANNS aims to use the data vector from one modality (e.g., texts) as the query to retrieve the most similar items from another (e.g., images or videos). However, there is an inherent distribution gap between embeddings from different modalities, and cross-modal queries become Out-of-Distribution (OOD) to the base data. Consequently, state-of-the-art ANNS approaches suffer poor performance for OOD workloads. In this paper, we quantitatively analyze the properties of the OOD workloads to gain an understanding of their ANNS efficiency. Unlike single-modal workloads, we reveal OOD queries spatially deviate from base data, and the k-nearest neighbors of an OOD query are distant from each other in the embedding space. The property breaks the assumptions of existing ANNS approaches and mismatches their design for efficient search. With insights from the OOD workloads, we propose pRojected bipartite Graph (RoarGraph), an efficient ANNS graph index built under the guidance of query distribution. Extensive experiments show that RoarGraph significantly outperforms state-of-the-art approaches on modern cross-modal datasets, achieving up to 3.56x faster search speed at a 90% recall rate for OOD queries.
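
The speedup above is quoted "at a 90% recall rate". In ANNS evaluation, recall@k is commonly the fraction of the true k nearest neighbors that the index returns. The sketch below assumes that common definition; the paper's exact evaluation protocol is not given in the abstract, and the function name and sample ids are illustrative.

```python
def recall_at_k(retrieved_ids, true_neighbor_ids, k):
    """Fraction of the true k nearest neighbors found in the top-k retrieved ids."""
    truth = set(true_neighbor_ids[:k])
    if not truth:
        return 0.0
    found = truth & set(retrieved_ids[:k])
    return len(found) / len(truth)

# Ground truth from exact search versus an approximate index that misses one item.
exact = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
approx = [1, 2, 3, 4, 5, 6, 7, 8, 9, 99]
score = recall_at_k(approx, exact, k=10)
```

Comparing search speed at a fixed recall level, rather than at fixed index settings, keeps the comparison between indexes fair because each one is tuned until it reaches the same answer quality.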

[IR-31] Personalized Federated Collaborative Filtering: A Variational AutoEncoder Approach

Link: https://arxiv.org/abs/2408.08931
Authors: Zhiwei Li,Guodong Long,Tianyi Zhou,Jing Jiang,Chengqi Zhang
Keywords: Federated Collaborative Filtering, distributed Collaborative Filtering, Collaborative Filtering, emerging field focused, Federated Collaborative
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10 pages, 3 figures, 4 tables, conference

Abstract:Federated Collaborative Filtering (FedCF) is an emerging field focused on developing a new recommendation framework with preserving privacy in a federated setting. Existing FedCF methods typically combine distributed Collaborative Filtering (CF) algorithms with privacy-preserving mechanisms, and then preserve personalized information into a user embedding vector. However, the user embedding is usually insufficient to preserve the rich information of the fine-grained personalization across heterogeneous clients. This paper proposes a novel personalized FedCF method by preserving users’ personalized information into a latent variable and a neural model simultaneously. Specifically, we decompose the modeling of user knowledge into two encoders, each designed to capture shared knowledge and personalized knowledge separately. A personalized gating network is then applied to balance personalization and generalization between the global and local encoders. Moreover, to effectively train the proposed framework, we model the CF problem as a specialized Variational AutoEncoder (VAE) task by integrating user interaction vector reconstruction with missing value prediction. The decoder is trained to reconstruct the implicit feedback from items the user has interacted with, while also predicting items the user might be interested in but has not yet interacted with. Experimental results on benchmark datasets demonstrate that the proposed method outperforms other baseline methods, showcasing superior performance.

[IR-32] Retail-GPT: leveraging Retrieval Augmented Generation (RAG) for building E-commerce Chat Assistants

Link: https://arxiv.org/abs/2408.08925
Authors: Bruno Amaral Teixeira de Freitas,Roberto de Alencar Lotufo
Keywords: open-source RAG-based chatbot, RAG-based chatbot designed, work presents Retail-GPT, enhance user engagement, work presents
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: 5 pages, 4 figures

Abstract:This work presents Retail-GPT, an open-source RAG-based chatbot designed to enhance user engagement in retail e-commerce by guiding users through product recommendations and assisting with cart operations. The system is cross-platform and adaptable to various e-commerce domains, avoiding reliance on specific chat applications or commercial activities. Retail-GPT engages in human-like conversations, interprets user demands, checks product availability, and manages cart operations, aiming to serve as a virtual sales agent and test the viability of such assistants across different retail businesses.

[IR-33] Graph Retrieval-Augmented Generation: A Survey

Link: https://arxiv.org/abs/2408.08921
Authors: Boci Peng,Yun Zhu,Yongchao Liu,Xiaohe Bo,Haizhou Shi,Chuntao Hong,Yan Zhang,Siliang Tang
Keywords: Large Language Models, Language Models, Large Language, achieved remarkable success, RAG refines LLM
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Ongoing work

Abstract:Recently, Retrieval-Augmented Generation (RAG) has achieved remarkable success in addressing the challenges of Large Language Models (LLMs) without necessitating retraining. By referencing an external knowledge base, RAG refines LLM outputs, effectively mitigating issues such as “hallucination”, lack of domain-specific knowledge, and outdated information. However, the complex structure of relationships among different entities in databases presents challenges for RAG systems. In response, GraphRAG leverages structural information across entities to enable more precise and comprehensive retrieval, capturing relational knowledge and facilitating more accurate, context-aware responses. Given the novelty and potential of GraphRAG, a systematic review of current technologies is imperative. This paper provides the first comprehensive overview of GraphRAG methodologies. We formalize the GraphRAG workflow, encompassing Graph-Based Indexing, Graph-Guided Retrieval, and Graph-Enhanced Generation. We then outline the core technologies and training methods at each stage. Additionally, we examine downstream tasks, application domains, evaluation methodologies, and industrial use cases of GraphRAG. Finally, we explore future research directions to inspire further inquiries and advance progress in the field.

[IR-34] MLoRA: Multi-Domain Low-Rank Adaptive Network for CTR Prediction RECSYS’2024

Link: https://arxiv.org/abs/2408.08913
Authors: Zhiming Yang,Haining Gao,Dehong Gao,Luwei Yang,Libin Yang,Xiaoyan Cai,Wei Ning,Guannan Zhang
Keywords: Click-through rate, social media, streaming media, CTR prediction, media
Categories: Information Retrieval (cs.IR)
Comments: 11 pages. Accepted by RecSys’2024, full paper

Abstract:Click-through rate (CTR) prediction is one of the fundamental tasks in the industry, especially in e-commerce, social media, and streaming media. It directly impacts website revenues, user satisfaction, and user retention. However, real-world production platforms often encompass various domains to cater for diverse customer needs. Traditional CTR prediction models struggle in multi-domain recommendation scenarios, facing challenges of data sparsity and disparate data distributions across domains. Existing multi-domain recommendation approaches introduce specific-domain modules for each domain, which partially address these issues but often significantly increase model parameters and lead to insufficient training. In this paper, we propose a Multi-domain Low-Rank Adaptive network (MLoRA) for CTR prediction, where we introduce a specialized LoRA module for each domain. This approach enhances the model’s performance in multi-domain CTR prediction tasks and is able to be applied to various deep-learning models. We evaluate the proposed method on several multi-domain datasets. Experimental results demonstrate our MLoRA approach achieves a significant improvement compared with state-of-the-art baselines. Furthermore, we deploy it in the production environment of the this http URL. The online A/B testing results indicate the superiority and flexibility in real-world production environments. The code of our MLoRA is publicly available.
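
For readers unfamiliar with LoRA, the standard low-rank adaptation of a linear layer can be written as below. Indexing the adapter matrices by a domain $d$ is an assumption about how "a specialized LoRA module for each domain" might look; the symbols $m$, $n$, $r$, $A_d$, and $B_d$ are introduced here for illustration and the paper's exact formulation may differ.

```latex
h = W x + B_d A_d x, \qquad
W \in \mathbb{R}^{m \times n},\;
B_d \in \mathbb{R}^{m \times r},\;
A_d \in \mathbb{R}^{r \times n},\;
r \ll \min(m, n)
```

Under this formulation the shared weight $W$ has $mn$ parameters, while each domain adds only $r(m + n)$, which is one way to avoid the large parameter growth the abstract attributes to separate per-domain modules.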

[IR-35] What should I wear to a party in a Greek taverna? Evaluation for Conversational Agents in the Fashion Domain KDD

Link: https://arxiv.org/abs/2408.08907
Authors: Antonis Maronikolakis,Ana Peleteiro Ramallo,Weiwei Cheng,Thomas Kober
Keywords: online fashion retail, enhancing customer experience, Large language models, poised to revolutionize, revolutionize the domain
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: Accepted at KDD workshop on Evaluation and Trustworthiness of Generative AI Models

Abstract:Large language models (LLMs) are poised to revolutionize the domain of online fashion retail, enhancing customer experience and discovery of fashion online. LLM-powered conversational agents introduce a new way of discovery by directly interacting with customers, enabling them to express in their own ways, refine their needs, obtain fashion and shopping advice that is relevant to their taste and intent. For many tasks in e-commerce, such as finding a specific product, conversational agents need to convert their interactions with a customer to a specific call to different backend systems, e.g., a search system to showcase a relevant set of products. Therefore, evaluating the capabilities of LLMs to perform those tasks related to calling other services is vital. However, those evaluations are generally complex, due to the lack of relevant and high quality datasets, and do not align seamlessly with business needs, amongst others. To this end, we created a multilingual evaluation dataset of 4k conversations between customers and a fashion assistant in a large e-commerce fashion platform to measure the capabilities of LLMs to serve as an assistant between customers and a backend engine. We evaluate a range of models, showcasing how our dataset scales to business needs and facilitates iterative development of tools.

[IR-36] Bundle Recommendation with Item-level Causation-enhanced Multi-view Learning

Link: https://arxiv.org/abs/2408.08906
Authors: Huy-Son Nguyen,Tuan-Nghia Bui,Long-Hai Nguyen,Hoang Manh-Hung,Cam-Van Thi Nguyen,Hoang-Quynh Le,Duc-Trong Le
Keywords: enhance business profitability, Bundle recommendation aims, aims to enhance, enhance business, business profitability
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract:Bundle recommendation aims to enhance business profitability and user convenience by suggesting a set of interconnected items. In real-world scenarios, leveraging the impact of asymmetric item affiliations is crucial for effective bundle modeling and understanding user preferences. To address this, we present BunCa, a novel bundle recommendation approach employing item-level causation-enhanced multi-view learning. BunCa provides comprehensive representations of users and bundles through two views: the Coherent View, leveraging the Multi-Prospect Causation Network for causation-sensitive relations among items, and the Cohesive View, employing LightGCN for information propagation among users and bundles. Modeling user preferences and bundle construction combined from both views ensures rigorous cohesion in direct user-bundle interactions through the Cohesive View and captures explicit intents through the Coherent View. Simultaneously, the integration of concrete and discrete contrastive learning optimizes the consistency and self-discrimination of multi-view representations. Extensive experiments with BunCa on three benchmark datasets demonstrate the effectiveness of this novel research and validate our hypothesis.

[IR-37] PATopics: An automatic framework to extract useful information from pharmaceutical patents documents

Link: https://arxiv.org/abs/2408.08905
Authors: Pablo Cecilio,Antônio Perreira,Juliana Santos Rosa Viegas,Washington Cunha,Felipe Viegas,Elisa Tuler,Fabiana Testa Moura de Carvalho Vicentini,Leonardo Rocha
Keywords: disruptive innovations focusing, promote disruptive innovations, Pharmaceutical patents play, disruptive innovations, innovations focusing
Categories: Digital Libraries (cs.DL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 17 pages, 5 figures, 5 tables

Abstract:Pharmaceutical patents play an important role by protecting innovations from being copied, and they also drive researchers to innovate, create new products, and promote disruptive innovations focusing on collective health. The study of patent management usually involves an exhaustive manual search. This happens because patent documents are complex, with many details regarding the claims and the explanation of the invention's methodology and results. To mitigate the manual search, we propose PATopics, a framework specially designed to extract relevant information from pharmaceutical patents. PATopics is composed of four building blocks that extract textual information from the patents, build relevant topics capable of summarizing the patents, correlate these topics with useful patent characteristics, and then summarize the information in a friendly web interface for end users. The general contributions of PATopics are its ability to centralize patents and to organize patents into groups based on their similarities. We extensively analyzed the framework using 4,832 pharmaceutical patents concerning 809 molecules patented by 478 companies. In our analysis, we evaluate the use of the framework considering the demands of three user profiles: researchers, chemists, and companies. We also designed four real-world use cases to evaluate the framework's applicability. Our analysis showed how practical and helpful PATopics is in the pharmaceutical scenario.

[IR-38] Bayesian inference to improve quality of Retrieval Augmented Generation

Link: https://arxiv.org/abs/2408.08901
Authors: Dattaraj Rao
Keywords: Retrieval Augmented Generation, modern Large Language, Retrieval Augmented, Augmented Generation, Large Language Model
Categories: Information Retrieval (cs.IR)

Abstract:Retrieval Augmented Generation (RAG) is the most popular pattern for modern Large Language Model (LLM) applications. RAG involves taking a user query and finding relevant paragraphs of context in a large corpus, typically captured in a vector database. Once the first level of search happens over a vector database, the top n chunks of relevant text are included directly in the context and sent as a prompt to the LLM. The problem with this approach is that the quality of the text chunks depends on the effectiveness of the search. There is no strong post-processing after search to determine whether a chunk holds enough information to include in the prompt. Also, there are often chunks that have conflicting information on the same subject, and the model has no prior experience of which chunk to prioritize when making a decision. Often, this leads to the model stating that there are conflicting statements and that it cannot produce an answer. In this research we propose a Bayesian approach to verify the quality of text chunks from the search results. Bayes' theorem relates conditional probabilities of the hypothesis with evidence and prior probabilities. We propose that finding the likelihood of text chunks to give a quality answer, and using the prior probability of the quality of text chunks, can help us improve the overall quality of the responses from RAG systems. We can use the LLM itself to get a likelihood of relevance of a context paragraph. For the prior probability of the text chunk, we use the page number in the documents parsed. The assumption is that paragraphs in earlier pages have a better probability of being findings and more relevant to generalizing an answer.
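
The reranking idea in this abstract follows Bayes' rule: the posterior for a chunk is proportional to its likelihood times its prior. The sketch below is a rough illustration under stated assumptions, not the author's implementation. The likelihood values stand in for relevance scores an LLM would supply, and the exponential page-number prior with its `decay` parameter is hypothetical, since the abstract says only that the page number is used.

```python
def page_prior(page_number, decay=0.9):
    """Hypothetical prior: earlier pages get more weight (decay ** (page - 1))."""
    return decay ** (page_number - 1)

def rerank_chunks(chunks, decay=0.9):
    """Rerank chunks given as (chunk_id, likelihood, page_number) tuples.

    Returns (chunk_id, posterior) pairs sorted by posterior, where the
    posterior is likelihood * prior, normalized to sum to 1 over the chunks.
    """
    unnormalized = [(cid, lik * page_prior(page, decay)) for cid, lik, page in chunks]
    total = sum(score for _, score in unnormalized)
    if total == 0:
        return [(cid, 0.0) for cid, _ in unnormalized]
    ranked = sorted(unnormalized, key=lambda t: -t[1])
    return [(cid, score / total) for cid, score in ranked]

# Chunk "a" has the highest likelihood but sits on page 10, so the prior demotes it.
result = rerank_chunks([("a", 0.6, 10), ("b", 0.5, 1), ("c", 0.1, 2)])
```

With these made-up numbers, the chunk from the first page overtakes a slightly more relevant chunk from page 10, showing how the prior can act as a tie-breaker between conflicting chunks.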

[IR-39] Towards Effective Authorship Attribution: Integrating Class-Incremental Learning

Link: https://arxiv.org/abs/2408.08900
Authors: Mostafa Rahgouy,Hamed Babaei Giglou,Mehnaz Tabassum,Dongji Feng,Amit Das,Taher Rahgooy,Gerry Dozier,Cheryl D. Seals
Keywords: possessing multiple samples, multiple samples, process of attributing, attributing an unidentified, unidentified document
Categories: Information Retrieval (cs.IR); Digital Libraries (cs.DL)
Comments: Submitted to IEEE CogMI 2024 Conference

Abstract:Authorship Attribution (AA) is the process of attributing an unidentified document to its true author from a predefined group of known candidates, each possessing multiple samples. The nature of AA necessitates accommodating emerging new authors, as each individual must be considered unique. This uniqueness can be attributed to various factors, including their stylistic preferences, areas of expertise, gender, cultural background, and other personal characteristics that influence their writing. These diverse attributes contribute to the distinctiveness of each author, making it essential for AA systems to recognize and account for these variations. However, current AA benchmarks commonly overlook this uniqueness and frame the problem as a closed-world classification, assuming a fixed number of authors throughout the system’s lifespan and neglecting the inclusion of emerging new authors. This oversight renders the majority of existing approaches ineffective for real-world applications of AA, where continuous learning is essential. These inefficiencies manifest as current models either resist learning new authors or experience catastrophic forgetting, where the introduction of new data causes the models to lose previously acquired knowledge. To address these inefficiencies, we propose redefining AA as Class-Incremental Learning (CIL), where new authors are introduced incrementally after the initial training phase, allowing the system to adapt and learn continuously. To achieve this, we briefly examine subsequent CIL approaches introduced in other domains. Moreover, we have adopted several well-known CIL methods, along with an examination of their strengths and weaknesses in the context of AA. Additionally, we outline potential future directions for advancing CIL AA systems. As a result, our paper can serve as a starting point for evolving AA systems from closed-world models to continual learning through CIL paradigms.

[IR-40] LLMJudge: LLMs for Relevance Judgments

Link: https://arxiv.org/abs/2408.08896
Authors: Hossein A. Rahmani,Emine Yilmaz,Nick Craswell,Bhaskar Mitra,Paul Thomas,Charles L. A. Clarke,Mohammad Aliannejadi,Clemencia Siro,Guglielmo Faggioli
Keywords: workshop at SIGIR, organized as part, SIGIR, relevance, relevance judgments
Categories: Information Retrieval (cs.IR)
Comments: LLMJudge Challenge Overview, 3 pages

Abstract:The LLMJudge challenge is organized as part of the LLM4Eval workshop at SIGIR 2024. Test collections are essential for evaluating information retrieval (IR) systems. The evaluation and tuning of a search system is largely based on relevance labels, which indicate whether a document is useful for a specific search and user. However, collecting relevance judgments on a large scale is costly and resource-intensive. Consequently, typical experiments rely on third-party labelers who may not always produce accurate annotations. The LLMJudge challenge aims to explore an alternative approach by using LLMs to generate relevance judgments. Recent studies have shown that LLMs can generate reliable relevance judgments for search systems. However, it remains unclear which LLMs can match the accuracy of human labelers, which prompts are most effective, how fine-tuned open-source LLMs compare to closed-source LLMs like GPT-4, whether there are biases in synthetically generated data, and if data leakage affects the quality of generated labels. This challenge will investigate these questions, and the collected data will be released as a package to support automatic relevance judgment research in information retrieval and search.

[IR-41] Enhancing Exploratory Learning through Exploratory Search with the Emergence of Large Language Models

Link: https://arxiv.org/abs/2408.08894
Authors: Yiming Luo,Patrick Cheong-Iao,Shanton Chang
Keywords: large language models, learners find, challenging issue, confused learners, large language
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 11 pages, 7 figures

Abstract:In the information era, how learners find, evaluate, and effectively use information has become a challenging issue, especially with the added complexity of large language models (LLMs) that have further confused learners in their information retrieval and search activities. This study attempts to unpack this complexity by combining exploratory search strategies with the theories of exploratory learning to form a new theoretical model of exploratory learning from the perspective of students’ learning. Our work adapts Kolb’s learning model by incorporating high-frequency exploration and feedback loops, aiming to promote deep cognitive and higher-order cognitive skill development in students. Additionally, this paper discusses and suggests how advanced LLMs integrated into information retrieval and information theory can support students in their exploratory searches, contributing theoretically to promoting student-computer interaction and supporting their learning journeys in the new era with LLMs.
