本篇博文主要展示 2024-08-28 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR五个大方向区分,若需要邮件定时接收,请在评论区留下你的邮箱号。

说明:每日论文数据从Arxiv.org获取,每天早上10:30左右定时自动更新。

友情提示: 如果您需要通过邮箱接收每日论文数据,请在评论处留下你的邮箱,同样会在每天10:30左右定时自动发送邮件。

概览 (2024-08-28)

今日共更新378篇论文,其中:

  • 自然语言处理50篇(Computation and Language (cs.CL))
  • 人工智能96篇(Artificial Intelligence (cs.AI))
  • 计算机视觉98篇(Computer Vision and Pattern Recognition (cs.CV))
  • 机器学习116篇(Machine Learning (cs.LG))

自然语言处理

[NLP-0] Into the Unknown Unknowns: Engaged Human Learning through Participation in Language Model Agent Conversations
[NLP-0] 进入未知的未知:通过参与语言模型代理对话来促进人类学习

链接: https://arxiv.org/abs/2408.15232
作者: Yucheng Jiang,Yijia Shao,Dekun Ma,Sina J. Semnani,Monica S. Lam
关键词-EN: answering concrete queries, unknowns remains challenging, create Collaborative STORM, unknown unknowns remains, language model
关键词-ZH: 回答具体的询问,未知数仍然具有挑战性,创建协作STORM,未知的未知数仍然,语言模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

Abstract:While language model (LM)-powered chatbots and generative search engines excel at answering concrete queries, discovering information in the terrain of unknown unknowns remains challenging for users. To emulate the common educational scenario where children/students learn by listening to and participating in conversations of their parents/teachers, we create Collaborative STORM (Co-STORM). Unlike QA systems that require users to ask all the questions, Co-STORM lets users observe and occasionally steer the discourse among several LM agents. The agents ask questions on the user’s behalf, allowing the user to discover unknown unknowns serendipitously. To facilitate user interaction, Co-STORM assists users in tracking the discourse by organizing the uncovered information into a dynamic mind map, ultimately generating a comprehensive report as takeaways. For automatic evaluation, we construct the WildSeek dataset by collecting real information-seeking records with user goals. Co-STORM outperforms baseline methods on both discourse trace and report quality. In a further human evaluation, 70% of participants prefer Co-STORM over a search engine, and 78% favor it over a RAG chatbot.
摘要:虽然由语言模型(LM)驱动的聊天机器人和生成式搜索引擎擅长回答具体的问题,但在“未知的未知”领域中发现信息对用户来说仍然是一个挑战。为了模拟儿童/学生通过倾听并参与家长/老师的对话来学习这一常见教育场景,我们创建了Collaborative STORM(Co-STORM)。与要求用户提出所有问题的问答系统不同,Co-STORM允许用户观察多个LM代理之间的讨论并偶尔加以引导。代理会代表用户提出问题,让用户得以不期而遇地发现“未知的未知”。为了方便用户交互,Co-STORM将挖掘出的信息组织成动态思维导图以帮助用户跟踪讨论,并最终生成一份全面的报告作为要点总结。在自动评估方面,我们通过收集带有用户目标的真实信息检索记录构建了WildSeek数据集。在话语轨迹和报告质量上,Co-STORM均优于基线方法。在进一步的人工评估中,70%的参与者相比搜索引擎更喜欢Co-STORM,78%的人相比RAG聊天机器人更喜欢它。

[NLP-1] LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet
[NLP-1] LLM防御尚无法抵御多轮人类越狱

链接: https://arxiv.org/abs/2408.15221
作者: Nathaniel Li,Ziwen Han,Ian Steneker,Willow Primack,Riley Goodside,Hugh Zhang,Zifan Wang,Cristina Menghini,Summer Yue
关键词-EN: Recent large language, refuse harmful queries, greatly improved models’, improved models’ ability, Recent large
关键词-ZH: 最近的大型语言,拒绝有害查询,大大改进了模型,提高了模型的能力,最近的大型
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Computers and Society (cs.CY)
备注:

Abstract:Recent large language model (LLM) defenses have greatly improved models’ ability to refuse harmful queries, even when adversarially attacked. However, LLM defenses are primarily evaluated against automated adversarial attacks in a single turn of conversation, an insufficient threat model for real-world malicious use. We demonstrate that multi-turn human jailbreaks uncover significant vulnerabilities, exceeding 70% attack success rate (ASR) on HarmBench against defenses that report single-digit ASRs with automated single-turn attacks. Human jailbreaks also reveal vulnerabilities in machine unlearning defenses, successfully recovering dual-use biosecurity knowledge from unlearned models. We compile these results into Multi-Turn Human Jailbreaks (MHJ), a dataset of 2,912 prompts across 537 multi-turn jailbreaks. We publicly release MHJ alongside a compendium of jailbreak tactics developed across dozens of commercial red teaming engagements, supporting research towards stronger LLM defenses.
摘要:最近的大型语言模型(LLM)防御极大地提高了模型拒绝有害查询的能力,即使在遭受对抗性攻击时也是如此。然而,LLM防御主要是在单轮对话中针对自动化对抗攻击进行评估的,这对于现实世界的恶意使用而言是一个不充分的威胁模型。我们证明,多轮人类越狱能够揭示重大漏洞:针对那些在自动化单轮攻击下报告个位数攻击成功率(ASR)的防御,在HarmBench上的ASR超过70%。人类越狱还暴露了机器遗忘(machine unlearning)防御的漏洞,成功地从经过遗忘处理的模型中恢复了军民两用的生物安全知识。我们将这些结果汇编成多轮人类越狱(MHJ)数据集,包含537次多轮越狱中的2912条提示。我们公开发布了MHJ,以及在数十个商业红队测试项目中积累的越狱战术汇编,以支持面向更强LLM防御的研究。

[NLP-2] Classifying populist language in American presidential and governor speeches using automatic text analysis
[NLP-2] 使用自动文本分析对美国总统和州长演讲中的民粹主义语言进行分类

链接: https://arxiv.org/abs/2408.15213
作者: Olaf van der Veen,Semir Dzebo,Levi Littvay,Kirk Hawkins,Oren Dar
关键词-EN: difficult to measure, notoriously difficult, Populism, populist, speeches
关键词-ZH: 难以衡量,出了名的困难,民粹主义,民粹主义,演讲
类目: Computation and Language (cs.CL)
备注:

Abstract:Populism is a concept that is often used but notoriously difficult to measure. Common qualitative measurements like holistic grading or content analysis require great amounts of time and labour, making it difficult to quickly scope out which politicians should be classified as populist and which should not, while quantitative methods show mixed results when it comes to classifying populist rhetoric. In this paper, we develop a pipeline to train and validate an automated classification model to estimate the use of populist language. We train models based on sentences that were identified as populist and pluralist in 300 US governors’ speeches from 2010 to 2018 and in 45 speeches of presidential candidates in 2016. We find that these models classify most speeches correctly, including 84% of governor speeches and 89% of presidential speeches. These results extend to different time periods (with 92% accuracy on more recent American governors), different amounts of data (with as few as 70 training sentences per category achieving similar results), and when classifying politicians instead of individual speeches. This pipeline is thus an effective tool that can optimise the systematic and swift classification of the use of populist language in politicians’ speeches.
摘要:民粹主义是一个经常被使用但出了名难以衡量的概念。整体评分或内容分析等常见的定性衡量方法需要大量时间和人力,这使得很难快速划定哪些政客应被归类为民粹主义者、哪些不应被归类;而定量方法在对民粹主义言论进行分类时结果好坏参半。在本文中,我们开发了一条流水线来训练并验证一个自动分类模型,以估计民粹主义语言的使用。我们基于2010年至2018年300篇美国州长演讲和2016年45篇总统候选人演讲中被标注为民粹主义或多元主义的句子来训练模型。我们发现,这些模型对大多数演讲进行了正确分类,包括84%的州长演讲和89%的总统演讲。这些结果可以推广到不同的时间段(对较近期的美国州长的准确率为92%)、不同的数据量(每个类别仅需70个训练句子即可获得类似结果),以及对政治家本人而非单篇演讲进行分类的场景。因此,这条流水线是一个有效的工具,可以优化对政客演讲中民粹主义语言使用的系统而快速的分类。

[NLP-3] Can Unconfident LLM Annotations Be Used for Confident Conclusions?
[NLP-3] 低置信度的LLM标注能否用于得出高置信度的结论?

链接: https://arxiv.org/abs/2408.15204
作者: Kristina Gligorić,Tijana Zrnic,Cinoo Lee,Emmanuel J. Candès,Dan Jurafsky
关键词-EN: Large language models, shown high agreement, Large language, human data collection, LLM annotations
关键词-ZH: 大型语言模型,显示出高一致性,大型语言,人工数据收集,LLM注释
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

Abstract:Large language models (LLMs) have shown high agreement with human raters across a variety of tasks, demonstrating potential to ease the challenges of human data collection. In computational social science (CSS), researchers are increasingly leveraging LLM annotations to complement slow and expensive human annotations. Still, guidelines for collecting and using LLM annotations, without compromising the validity of downstream conclusions, remain limited. We introduce Confidence-Driven Inference: a method that combines LLM annotations and LLM confidence indicators to strategically select which human annotations should be collected, with the goal of producing accurate statistical estimates and provably valid confidence intervals while reducing the number of human annotations needed. Our approach comes with safeguards against LLM annotations of poor quality, guaranteeing that the conclusions will be both valid and no less accurate than if we only relied on human annotations. We demonstrate the effectiveness of Confidence-Driven Inference over baselines in statistical estimation tasks across three CSS settings–text politeness, stance, and bias–reducing the needed number of human annotations by over 25% in each. Although we use CSS settings for demonstration, Confidence-Driven Inference can be used to estimate most standard quantities across a broad range of NLP problems.
摘要:大型语言模型(LLM)在各种任务中与人类评分者表现出高度一致,显示出缓解人类数据收集难题的潜力。在计算社会科学(CSS)中,研究人员越来越多地利用LLM标注来补充缓慢且昂贵的人工标注。尽管如此,如何在不损害下游结论有效性的前提下收集和使用LLM标注,相关指导仍然有限。我们提出置信度驱动推断(Confidence-Driven Inference):一种结合LLM标注与LLM置信度指标、策略性地选择应收集哪些人工标注的方法,其目标是在减少所需人工标注数量的同时,产生准确的统计估计和可证有效的置信区间。我们的方法带有针对低质量LLM标注的保护措施,保证结论既有效,又不低于仅依赖人工标注时的准确性。我们在三个CSS场景(文本礼貌、立场和偏见)的统计估计任务中展示了置信度驱动推断相对于基线方法的有效性,在每个场景中都将所需人工标注数量减少了超过25%。尽管我们以CSS场景作演示,置信度驱动推断可用于估计广泛NLP问题中的大多数标准统计量。
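
为了直观理解这一思路,下面给出一段玩具级 Python 示意:对低置信度样本收集人工标注、其余沿用 LLM 标注,再合并估计总体比例。其中模拟数据与 0.6 的阈值均为假设,论文中的正式估计量还包含可证有效的置信区间构造,细节以原文为准。

```python
# 置信度驱动标注选择的玩具示意(假设性实现,非论文的正式估计量)
import numpy as np

rng = np.random.default_rng(0)
n = 1000
truth = rng.binomial(1, 0.4, size=n)                      # 不可观测的真实标签
llm = np.where(rng.random(n) < 0.85, truth, 1 - truth)    # 带噪声的 LLM 标注
conf = np.where(llm == truth,                             # 置信度与正确性弱相关
                rng.uniform(0.6, 1.0, n), rng.uniform(0.2, 0.8, n))

ask_human = conf < 0.6                    # 策略:只为低置信度样本收集人工标注
combined = llm.astype(float).copy()
combined[ask_human] = truth[ask_human]    # 这里假设人工标注即真值

print(f"人工标注比例: {ask_human.mean():.2%}")
print(f"仅用LLM: {llm.mean():.3f}  组合估计: {combined.mean():.3f}  真值: {truth.mean():.3f}")
```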

[NLP-4] Unlocking Potential in Pre-Trained Music Language Models for Versatile Multi-Track Music Arrangement AAAI2025
[NLP-4] 释放预训练音乐语言模型在多样化多轨音乐编配中的潜力

链接: https://arxiv.org/abs/2408.15176
作者: Longshen Ou,Jingwei Zhao,Ziyu Wang,Gus Xia,Ye Wang
关键词-EN: shown significant capabilities, Large language models, Large language, symbolic music generation, shown significant
关键词-ZH: 表现出显着的能力,大型语言模型,大型语言,符号音乐生成,表现出显着
类目: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Submitted to AAAI 2025

Abstract:Large language models have shown significant capabilities across various domains, including symbolic music generation. However, leveraging these pre-trained models for controllable music arrangement tasks, each requiring different forms of musical information as control, remains a novel challenge. In this paper, we propose a unified sequence-to-sequence framework that enables the fine-tuning of a symbolic music language model for multiple multi-track arrangement tasks, including band arrangement, piano reduction, drum arrangement, and voice separation. Our experiments demonstrate that the proposed approach consistently achieves higher musical quality compared to task-specific baselines across all four tasks. Furthermore, through additional experiments on probing analysis, we show the pre-training phase equips the model with essential knowledge to understand musical conditions, which is hard to acquired solely through task-specific fine-tuning.
摘要:大型语言模型在包括符号音乐生成在内的各个领域都表现出了显著的能力。然而,利用这些预训练模型完成可控的音乐编配任务(每项任务都需要不同形式的音乐信息作为控制条件)仍然是一个新的挑战。在本文中,我们提出了一个统一的序列到序列框架,能够针对多种多轨编配任务(包括乐队编配、钢琴缩编、鼓编配和声部分离)对符号音乐语言模型进行微调。我们的实验表明,在全部四项任务上,所提出的方法都比面向特定任务的基线取得了更高的音乐质量。此外,通过额外的探针分析实验,我们表明预训练阶段为模型提供了理解音乐条件所必需的知识,而这类知识很难仅靠特定任务的微调获得。

[NLP-5] X-Reflect: Cross-Reflection Prompting for Multimodal Recommendation
[NLP-5] X-Reflect:面向多模态推荐的交叉反思提示

链接: https://arxiv.org/abs/2408.15172
作者: Hanjia Lyu,Ryan Rossi,Xiang Chen,Md Mehrab Tanjim,Stefano Petrangeli,Somdeb Sarkhel,Jiebo Luo
关键词-EN: Large Language Models, Large Multimodal Models, Language Models, Large Language, Multimodal Models
关键词-ZH: 大型语言模型、大型多模式模型、语言模型、大型语言、多模式模型
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

Abstract:Large Language Models (LLMs) and Large Multimodal Models (LMMs) have been shown to enhance the effectiveness of enriching item descriptions, thereby improving the accuracy of recommendation systems. However, most existing approaches either rely on text-only prompting or employ basic multimodal strategies that do not fully exploit the complementary information available from both textual and visual modalities. This paper introduces a novel framework, Cross-Reflection Prompting, termed X-Reflect, designed to address these limitations by prompting LMMs to explicitly identify and reconcile supportive and conflicting information between text and images. By capturing nuanced insights from both modalities, this approach generates more comprehensive and contextually richer item representations. Extensive experiments conducted on two widely used benchmarks demonstrate that our method outperforms existing prompting baselines in downstream recommendation accuracy. Additionally, we evaluate the generalizability of our framework across different LMM backbones and the robustness of the prompting strategies, offering insights for optimization. This work underscores the importance of integrating multimodal information and presents a novel solution for improving item understanding in multimodal recommendation systems.
摘要:大型语言模型(LLM)和大型多模态模型(LMM)已被证明能提升丰富商品描述的有效性,从而提高推荐系统的准确率。然而,大多数现有方法要么依赖纯文本提示,要么采用基础的多模态策略,没有充分利用文本和视觉两种模态中互补的信息。本文提出了一种新的框架:交叉反思提示(Cross-Reflection Prompting),称为X-Reflect,旨在通过促使LMM显式地识别并调和文本和图像之间相互支持和相互冲突的信息来解决这些局限。通过捕获来自两种模态的细微洞察,该方法能生成更全面、上下文更丰富的商品表示。在两个广泛使用的基准上进行的大量实验表明,我们的方法在下游推荐准确率方面优于现有的提示基线。此外,我们评估了该框架在不同LMM主干上的通用性以及提示策略的鲁棒性,为优化提供了见解。这项工作强调了整合多模态信息的重要性,并为提高多模态推荐系统中的商品理解提供了一种新的解决方案。

[NLP-6] Measuring text summarization factuality using atomic facts entailment metrics in the context of retrieval augmented generation
[NLP-6] 在检索增强生成的背景下使用原子事实蕴含指标衡量文本摘要真实性

链接: https://arxiv.org/abs/2408.15171
作者: N. E. Kriman
关键词-EN: large language models, language models, large language, significantly increased, introduction of ChatGPT
关键词-ZH: 大型语言模型,语言模型,大型语言,显着增加,引入ChatGPT
类目: Computation and Language (cs.CL)
备注: 12 pages

Abstract:The use of large language models (LLMs) has significantly increased since the introduction of ChatGPT in 2022, demonstrating their value across various applications. However, a major challenge for enterprise and commercial adoption of LLMs is their tendency to generate inaccurate information, a phenomenon known as “hallucination.” This project proposes a method for estimating the factuality of a summary generated by LLMs when compared to a source text. Our approach utilizes Naive Bayes classification to assess the accuracy of the content produced.
摘要:自2022年ChatGPT推出以来,大型语言模型(LLM)的使用显著增加,证明了它们在各类应用中的价值。然而,企业和商业采用LLM面临的一个主要挑战是它们倾向于生成不准确的信息,这种现象被称为“幻觉”。本项目提出了一种方法,用于评估LLM生成的摘要相对于源文本的事实性。我们的方法利用朴素贝叶斯分类来评估所生成内容的准确性。
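
摘要只点到“用朴素贝叶斯分类评估内容准确性”。下面是一个假设性的最小示意(训练样本、"[SEP]" 拼接格式与字符 n-gram 特征均为演示用,并非论文的原始实现):

```python
# 极简示意:用朴素贝叶斯判断“原子事实是否被源文本蕴含”(假设性实现)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# 假设的训练样本:格式为 "原子事实 [SEP] 源文本",标签 1=蕴含,0=不蕴含
train_texts = [
    "巴黎是法国首都 [SEP] 法国的首都是巴黎,位于塞纳河畔。",
    "公司2023年亏损 [SEP] 该公司2023年实现盈利,同比增长10%。",
]
train_labels = [1, 0]

clf = make_pipeline(TfidfVectorizer(analyzer="char", ngram_range=(1, 3)),
                    MultinomialNB())
clf.fit(train_texts, train_labels)

# 摘要的事实性得分可取其原子事实被判“蕴含”的比例(具体定义以论文为准)
source = "法国的首都是巴黎,位于塞纳河畔。"
facts = ["巴黎是法国首都", "巴黎位于塞纳河畔"]
score = clf.predict([f"{f} [SEP] {source}" for f in facts]).mean()
print(f"factuality score: {score:.2f}")
```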

[NLP-7] How transformers learn structured data: insights from hierarchical filtering
[NLP-7] Transformer如何学习结构化数据:来自分层过滤的见解

链接: https://arxiv.org/abs/2408.15138
作者: Jerome Garnier-Brun,Marc Mézard,Emanuele Moscato,Luca Saglietti
关键词-EN: optimal Belief Propagation, sequences on trees, enabling control, Belief Propagation algorithm, procedure for generative
关键词-ZH: 最佳信念传播,树上的序列,启用控制,信念传播算法,生成程序
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Computation and Language (cs.CL)
备注: 18 pages, 9 figures

Abstract:We introduce a hierarchical filtering procedure for generative models of sequences on trees, enabling control over the range of positional correlations in the data. Leveraging this controlled setting, we provide evidence that vanilla encoder-only transformer architectures can implement the optimal Belief Propagation algorithm on both root classification and masked language modeling tasks. Correlations at larger distances corresponding to increasing layers of the hierarchy are sequentially included as the network is trained. We analyze how the transformer layers succeed by focusing on attention maps from models trained with varying degrees of filtering. These attention maps show clear evidence for iterative hierarchical reconstruction of correlations, and we can relate these observations to a plausible implementation of the exact inference algorithm for the network sizes considered.
摘要:我们为树上序列的生成模型引入了一种分层过滤过程,从而能够控制数据中位置相关性的范围。利用这种可控设定,我们提供的证据表明,标准的仅编码器(encoder-only)Transformer架构可以在根节点分类和掩码语言建模两类任务上实现最优的信念传播(Belief Propagation)算法。随着网络训练的进行,对应于层级结构中更高层次、距离更远的相关性被依次纳入。我们通过分析在不同过滤程度下训练的模型的注意力图,来考察Transformer各层是如何做到这一点的。这些注意力图为相关性的迭代式分层重建提供了明确证据,并且在所考察的网络规模下,我们可以将这些观察与精确推断算法的一种合理实现联系起来。
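
摘要的核心对照对象是树上的最优信念传播(sum-product)算法。下面在一棵“根节点 + 两个叶子”的二元玩具树上演示由叶子观测推断根节点类别的上行消息传递(转移矩阵与先验均为假设参数,与论文的生成模型无关):

```python
# 树上 sum-product 信念传播的极简示意:一层二叉树、二元状态(玩具参数)
import numpy as np

M = np.array([[0.9, 0.1],    # M[i, j] = P(子节点状态 j | 父节点状态 i)
              [0.2, 0.8]])
prior = np.array([0.5, 0.5]) # 根节点先验
leaf_obs = [0, 0]            # 两个叶子的观测符号

# 上行消息:msg(父状态) = P(该叶子观测 | 父状态)
msgs = [M[:, o] for o in leaf_obs]
posterior = prior * np.prod(msgs, axis=0)
posterior /= posterior.sum()
print("P(root | leaves):", posterior)    # 偏向与观测一致的根状态
```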

[NLP-8] Relation Also Knows: Rethinking the Recall and Editing of Factual Associations in Auto-Regressive Transformer Language Models
[NLP-8] 关系也知道:重新思考自回归Transformer语言模型中事实关联的回忆和编辑

链接: https://arxiv.org/abs/2408.15091
作者: Xiyu Liu,Zhengxiao Liu,Naibin Gu,Zheng Lin,Wanli Ma,Ji Xiang,Weiping Wang
关键词-EN: located model weights, transformer language models, auto-regressive transformer language, model weights, deal of attention
关键词-ZH: 定位模型权重、Transformer语言模型、自回归Transformer语言、模型权重、注意力处理
类目: Computation and Language (cs.CL)
备注:

Abstract:The storage and recall of factual associations in auto-regressive transformer language models (LMs) have drawn a great deal of attention, inspiring knowledge editing by directly modifying the located model weights. Most editing works achieve knowledge editing under the guidance of existing interpretations of knowledge recall that mainly focus on subject knowledge. However, these interpretations are seriously flawed, neglecting relation information and leading to the over-generalizing problem for editing. In this work, we discover a novel relation-focused perspective to interpret the knowledge recall of transformer LMs during inference and apply it on knowledge editing to avoid over-generalizing. Experimental results on the dataset supplemented with a new R-Specificity criterion demonstrate that our editing approach significantly alleviates over-generalizing while remaining competitive on other criteria, breaking the domination of subject-focused editing for future research.
摘要:自回归Transformer语言模型(LM)中事实关联的存储和回忆引起了广泛关注,启发了通过直接修改定位到的模型权重来进行知识编辑。大多数编辑工作都是在现有的、主要关注主语(subject)知识的知识回忆解释的指导下实现知识编辑的。然而,这些解释存在严重缺陷,忽视了关系信息,导致编辑中的过度泛化问题。在这项工作中,我们发现了一种新颖的、以关系为中心的视角来解释Transformer LM在推理过程中的知识回忆,并将其应用于知识编辑以避免过度泛化。在补充了新的R-Specificity标准的数据集上的实验结果表明,我们的编辑方法显著缓解了过度泛化,同时在其他标准上保持竞争力,打破了以主语为中心的编辑对未来研究的主导。

[NLP-9] BaichuanSEED: Sharing the Potential of ExtensivE Data Collection and Deduplication by Introducing a Competitive Large Language Model Baseline
[NLP-9] BaichuanSEED:通过引入有竞争力的大型语言模型基线,展示大规模数据收集与去重的潜力

链接: https://arxiv.org/abs/2408.15079
作者: Guosheng Dong,Da Pan,Yiding Sun,Shusen Zhang,Zheng Liang,Xin Wu,Yanjun Shen,Fan Yang,Haoze Sun,Tianpeng Li,Mingan Lin,Jianhua Xu,Yufan Zhang,Xiaonan Nie,Lei Su,Bingning Wang,Wentao Zhang,Jiaxin Mao,Zenan Zhou,Weipeng Chen
关键词-EN: extensive pretraining datasets, Large Language Models, highly rely, pretraining datasets, data processing pipeline
关键词-ZH: 广泛的预训练数据集、大型语言模型、高度依赖、预训练数据集、数据处理管道
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 19 pages, 6 figures

Abstract:The general capabilities of Large Language Models (LLM) highly rely on the composition and selection on extensive pretraining datasets, treated as commercial secrets by several institutions. To mitigate this issue, we open-source the details of a universally applicable data processing pipeline and validate its effectiveness and potential by introducing a competitive LLM baseline. Specifically, the data processing pipeline consists of broad collection to scale up and reweighting to improve quality. We then pretrain a 7B model BaichuanSEED with 3T tokens processed by our pipeline without any deliberate downstream task-related optimization, followed by an easy but effective supervised fine-tuning stage. BaichuanSEED demonstrates consistency and predictability throughout training and achieves comparable performance on comprehensive benchmarks with several commercial advanced large language models, such as Qwen1.5 and Llama3. We also conduct several heuristic experiments to discuss the potential for further optimization of downstream tasks, such as mathematics and coding.
摘要:大型语言模型(LLM)的通用能力高度依赖于对大规模预训练数据集的构成与选择,而这些被若干机构视为商业机密。为了缓解这一问题,我们开源了一条普遍适用的数据处理流水线的细节,并通过引入一个有竞争力的LLM基线来验证其有效性和潜力。具体来说,该数据处理流水线包括为扩大规模而进行的广泛收集,以及为提高质量而进行的重加权。然后,我们用流水线处理出的3T个token预训练了一个7B模型BaichuanSEED,未做任何刻意的下游任务相关优化,随后进行了简单而有效的有监督微调阶段。BaichuanSEED在整个训练过程中表现出一致性和可预测性,并在综合基准上取得了与Qwen1.5和Llama3等若干商业先进大型语言模型相当的性能。我们还进行了若干启发式实验,讨论针对数学和代码等下游任务进一步优化的潜力。
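
标题与摘要强调的“收集、去重、重加权”流水线可以用如下纯 Python 骨架来直观说明(哈希方式与质量信号均为假设性简化;实际系统通常采用 MinHash/SimHash 等近似去重,BaichuanSEED 的具体实现以原文为准):

```python
# 数据处理流水线中“精确去重 + 简单质量重加权”的玩具示意(假设性实现)
import hashlib
from collections import Counter

docs = ["今天天气不错。", "今天天气不错。",
        "Transformer 是一种神经网络架构。", "aaa aaa aaa aaa"]

def norm_hash(text: str) -> str:
    # 规范化(去空白)后做哈希,用于精确去重
    return hashlib.md5("".join(text.split()).encode("utf-8")).hexdigest()

seen, unique_docs = set(), []
for d in docs:
    h = norm_hash(d)
    if h not in seen:
        seen.add(h)
        unique_docs.append(d)

def quality_weight(text: str) -> float:
    # 极简质量信号:最高频字符占比越高(高度重复),采样权重越低
    chars = list(text)
    top = Counter(chars).most_common(1)[0][1] / max(len(chars), 1)
    return max(0.0, 1.0 - top)

print({d: round(quality_weight(d), 2) for d in unique_docs})
```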

[NLP-10] Self-supervised Topic Taxonomy Discovery in the Box Embedding Space ACL
[NLP-10] 盒子嵌入空间中的自监督主题分类体系发现

链接: https://arxiv.org/abs/2408.15050
作者: Yuyin Lu,Hegang Chen,Pengbo Mao,Yanghui Rao,Haoran Xie,Fu Lee Wang,Qing Li
关键词-EN: taxonomy discovery aims, constructing hierarchical relations, hierarchical relations, discovery aims, aims at uncovering
关键词-ZH: 分类学发现的目标,构建层次关系,层次关系,发现的目标,旨在发现
类目: Computation and Language (cs.CL)
备注: to be published in TACL

Abstract:Topic taxonomy discovery aims at uncovering topics of different abstraction levels and constructing hierarchical relations between them. Unfortunately, most of prior work can hardly model semantic scopes of words and topics by holding the Euclidean embedding space assumption. What’s worse, they infer asymmetric hierarchical relations by symmetric distances between topic embeddings. As a result, existing methods suffer from problems of low-quality topics at high abstraction levels and inaccurate hierarchical relations. To alleviate these problems, this paper develops a Box embedding-based Topic Model (BoxTM) that maps words and topics into the box embedding space, where the asymmetric metric is defined to properly infer hierarchical relations among topics. Additionally, our BoxTM explicitly infers upper-level topics based on correlation between specific topics through recursive clustering on topic boxes. Finally, extensive experiments validate high-quality of the topic taxonomy learned by BoxTM.
摘要:主题分类体系发现旨在发掘不同抽象层次的主题,并构建它们之间的层次关系。遗憾的是,以往的大多数工作因坚持欧几里得嵌入空间的假设,难以对词语和主题的语义范围建模。更糟的是,它们用主题嵌入之间的对称距离来推断不对称的层次关系。因此,现有方法存在高抽象层次上主题质量低、层次关系不准确的问题。为了缓解这些问题,本文提出了一种基于盒子嵌入的主题模型(BoxTM),它将词语和主题映射到盒子嵌入空间中,并在其中定义不对称度量,以正确推断主题之间的层次关系。此外,我们的BoxTM通过对主题盒子进行递归聚类,基于具体主题之间的相关性显式推断上层主题。最后,大量实验验证了BoxTM所学主题分类体系的高质量。
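
盒子嵌入的关键在于:交集体积与自身体积之比是天然的不对称度量,适合刻画“子主题包含于父主题”这类层次关系。下面是一个与 BoxTM 无关的通用小示例(玩具坐标;BoxTM 的实际参数化与训练目标以原文为准):

```python
# 盒子嵌入的基本运算示意(通用写法,非 BoxTM 原实现)
import torch

def box_volume(lo: torch.Tensor, hi: torch.Tensor) -> torch.Tensor:
    # 体积 = 各维边长(截断为非负)之积
    return torch.clamp(hi - lo, min=0).prod(dim=-1)

def containment(lo_a, hi_a, lo_b, hi_b) -> torch.Tensor:
    # P(A ⊆ B) ≈ vol(A ∩ B) / vol(A):不对称度量,
    # 子主题盒子应大部分落在父主题盒子内,反之不成立
    inter_lo = torch.maximum(lo_a, lo_b)
    inter_hi = torch.minimum(hi_a, hi_b)
    return box_volume(inter_lo, inter_hi) / box_volume(lo_a, hi_a).clamp(min=1e-9)

# 玩具示例:父主题“体育”的盒子较大,子主题“足球”的盒子嵌于其中
sports_lo, sports_hi = torch.tensor([0.0, 0.0]), torch.tensor([4.0, 4.0])
soccer_lo, soccer_hi = torch.tensor([1.0, 1.0]), torch.tensor([2.0, 2.0])
print(containment(soccer_lo, soccer_hi, sports_lo, sports_hi))  # ≈ 1.0
print(containment(sports_lo, sports_hi, soccer_lo, soccer_hi))  # ≈ 0.06
```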

[NLP-11] A Survey of Large Language Models for European Languages
[NLP-11] 欧洲语言大型语言模型概览

链接: https://arxiv.org/abs/2408.15040
作者: Wazir Ali,Sampo Pyysalo
关键词-EN: Large Language Models, gained significant attention, significant attention due, natural language tasks, Large Language
关键词-ZH: 大型语言模型,获得了极大的关注,应有的极大关注,自然语言任务,大型语言
类目: Computation and Language (cs.CL)
备注:

Abstract:Large Language Models (LLMs) have gained significant attention due to their high performance on a wide range of natural language tasks since the release of ChatGPT. The LLMs learn to understand and generate language by training billions of model parameters on vast volumes of text data. Despite being a relatively new field, LLM research is rapidly advancing in various directions. In this paper, we present an overview of LLM families, including LLaMA, PaLM, GPT, and MoE, and the methods developed to create and enhance LLMs for official European Union (EU) languages. We provide a comprehensive summary of common monolingual and multilingual datasets used for pretraining LLMs.
摘要:自ChatGPT发布以来,大型语言模型(LLM)因其在各种自然语言任务上的高性能而受到广泛关注。LLM通过在海量文本数据上训练数十亿个模型参数来学习理解和生成语言。尽管是一个相对较新的领域,LLM研究正在向各个方向快速推进。在本文中,我们概述了包括LLaMA、PaLM、GPT和MoE在内的LLM家族,以及为欧盟(EU)官方语言创建和增强LLM而开发的方法。我们还对用于预训练LLM的常见单语和多语数据集进行了全面总结。

[NLP-12] Evidence-Enhanced Triplet Generation Framework for Hallucination Alleviation in Generative Question Answering
[NLP-12] 用于减少生成性问题回答中幻觉的证据增强三重组生成框架

链接: https://arxiv.org/abs/2408.15037
作者: Haowei Du,Huishuai Zhang,Dongyan Zhao
关键词-EN: generative question answering, evidence-enhanced triplet generation, generative question, triplet generation framework, question answering
关键词-ZH: 生成式问答、证据增强三重生成、生成式问题、三重生成框架、问答
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:To address the hallucination in generative question answering (GQA) where the answer can not be derived from the document, we propose a novel evidence-enhanced triplet generation framework, EATQA, encouraging the model to predict all the combinations of (Question, Evidence, Answer) triplet by flipping the source pair and the target label to understand their logical relationships, i.e., predict Answer(A), Question(Q), and Evidence(E) given a QE, EA, and QA pairs, respectively. Furthermore, we bridge the distribution gap to distill the knowledge from evidence in inference stage. Our framework ensures the model to learn the logical relation between query, evidence and answer, which simultaneously improves the evidence generation and query answering. In this paper, we apply EATQA to LLama and it outperforms other LLMs-based methods and hallucination mitigation approaches on two challenging GQA benchmarks. Further analysis shows that our method not only keeps prior knowledge within LLM, but also mitigates hallucination and generates faithful answers.
摘要:为了解决生成式问答(GQA)中答案无法从文档中推出的幻觉问题,我们提出了一种新的证据增强三元组生成框架EATQA,它通过翻转源对和目标标签,鼓励模型预测(问题、证据、答案)三元组的所有组合以理解其逻辑关系,即分别在给定QE、EA和QA对的情况下预测答案(A)、问题(Q)和证据(E)。此外,我们在推理阶段弥合分布差距,从证据中蒸馏知识。我们的框架确保模型学习查询、证据和答案之间的逻辑关系,同时改进证据生成和查询回答。在本文中,我们将EATQA应用于LLama,在两个具有挑战性的GQA基准上,它优于其他基于LLM的方法和幻觉缓解方法。进一步的分析表明,我们的方法不仅在LLM内保留了先验知识,还减轻了幻觉并生成了忠实的答案。
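
EATQA 的“翻转三元组”数据构造可以用几行代码说明:同一条 (Q, E, A) 产生 QE→A、EA→Q、QA→E 三个训练样本。下面的字段名与拼接格式均为假设,仅演示构造逻辑:

```python
# 示意:由 (Q, E, A) 三元组构造三个方向的训练样本(格式为假设)
def make_triplet_examples(q: str, e: str, a: str) -> list[dict]:
    return [
        {"source": f"question: {q} evidence: {e}", "target": a},  # QE -> A
        {"source": f"evidence: {e} answer: {a}", "target": q},    # EA -> Q
        {"source": f"question: {q} answer: {a}", "target": e},    # QA -> E
    ]

for ex in make_triplet_examples("谁写了《红楼梦》?", "《红楼梦》的作者是曹雪芹。", "曹雪芹"):
    print(ex)
```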

[NLP-13] Speech Recognition Transformers: Topological-lingualism Perspective
[NLP-13] 语音识别Transformer:拓扑语言主义视角

链接: https://arxiv.org/abs/2408.14991
作者: Shruti Singh,Muskaan Singh,Virender Kadyan
关键词-EN: artificial intelligence tasks, evolved with great, great success, artificial intelligence, intelligence tasks
关键词-ZH: 人工智能任务,发展取得了巨大的成功,人工智能,智能任务
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

Abstract:Transformers have evolved with great success in various artificial intelligence tasks. Thanks to our recent prevalence of self-attention mechanisms, which capture long-term dependency, phenomenal outcomes in speech processing and recognition tasks have been produced. The paper presents a comprehensive survey of transformer techniques oriented in speech modality. The main contents of this survey include (1) background of traditional ASR, end-to-end transformer ecosystem, and speech transformers (2) foundational models in a speech via lingualism paradigm, i.e., monolingual, bilingual, multilingual, and cross-lingual (3) dataset and languages, acoustic features, architecture, decoding, and evaluation metric from a specific topological lingualism perspective (4) popular speech transformer toolkit for building end-to-end ASR systems. Finally, highlight the discussion of open challenges and potential research directions for the community to conduct further research in this domain.
摘要:Transformer在各类人工智能任务中发展并取得了巨大成功。得益于近来流行的、能捕获长期依赖的自注意力机制,语音处理与识别任务产生了非凡的成果。本文对面向语音模态的Transformer技术进行了全面综述。本综述的主要内容包括:(1)传统ASR、端到端Transformer生态系统和语音Transformer的背景;(2)按语言范式(即单语、双语、多语和跨语言)划分的语音基础模型;(3)从特定的拓扑语言学视角考察数据集和语言、声学特征、架构、解码与评估指标;(4)用于构建端到端ASR系统的流行语音Transformer工具包。最后,重点讨论开放挑战和潜在研究方向,供学界在该领域进一步研究。

[NLP-14] AgentMonitor: A Plug-and-Play Framework for Predictive and Secure Multi-Agent Systems
[NLP-14] AgentMonitor:用于预测性和安全多代理系统的即插即用框架

链接: https://arxiv.org/abs/2408.14972
作者: Chi-Min Chan,Jianxuan Yu,Weize Chen,Chunyang Jiang,Xinyu Liu,Weijie Shi,Zhiyuan Liu,Wei Xue,Yike Guo
关键词-EN: large language models, rapid advancement, advancement of large, large language, rise of LLM-based
关键词-ZH: 大型语言模型,快速发展,大型语言的进步,基于LLM的兴起
类目: Computation and Language (cs.CL)
备注:

Abstract:The rapid advancement of large language models (LLMs) has led to the rise of LLM-based agents. Recent research shows that multi-agent systems (MAS), where each agent plays a specific role, can outperform individual LLMs. However, configuring an MAS for a task remains challenging, with performance only observable post-execution. Inspired by scaling laws in LLM development, we investigate whether MAS performance can be predicted beforehand. We introduce AgentMonitor, a framework that integrates at the agent level to capture inputs and outputs, transforming them into statistics for training a regression model to predict task performance. Additionally, it can further apply real-time corrections to address security risks posed by malicious agents, mitigating negative impacts and enhancing MAS security. Experiments demonstrate that an XGBoost model achieves a Spearman correlation of 0.89 in-domain and 0.58 in more challenging scenarios. Furthermore, using AgentMonitor reduces harmful content by 6.2% and increases helpful content by 1.8% on average, enhancing safety and reliability. Code is available at this https URL.
摘要:大语言模型(LLM)的快速发展催生了基于LLM的智能体。最近的研究表明,每个智能体扮演特定角色的多智能体系统(MAS)可以超越单个LLM。然而,为某项任务配置MAS仍然具有挑战性,其性能只有在执行后才能观察到。受LLM发展中缩放定律的启发,我们研究了MAS的性能能否被预先预测。我们提出了AgentMonitor,这是一个在智能体层面集成的框架,它捕获输入和输出并将其转换为统计量,用于训练一个预测任务性能的回归模型。此外,它还能进一步进行实时纠正,以应对恶意智能体带来的安全风险,减轻负面影响并增强MAS的安全性。实验表明,XGBoost模型在域内取得了0.89的Spearman相关性,在更具挑战性的场景下为0.58。此外,使用AgentMonitor平均减少了6.2%的有害内容并增加了1.8%的有益内容,提升了安全性和可靠性。代码见 this https URL。
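
AgentMonitor 的预测部分本质上是“用运行时统计特征回归任务表现,再看排序相关性”。下面用合成数据给出一个假设性的最小复刻(特征定义纯属演示,并非原框架代码):

```python
# 示意:用 XGBoost 从多智能体运行统计预测任务表现(合成数据,非原框架代码)
import numpy as np
from scipy.stats import spearmanr
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
# 假设特征:[平均回复长度, 对话轮数, 智能体数量, 工具调用次数]
X = rng.normal(size=(200, 4))
y = 0.6 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(scale=0.1, size=200)  # 玩具性能分

model = XGBRegressor(n_estimators=200, max_depth=3, learning_rate=0.1)
model.fit(X[:150], y[:150])
pred = model.predict(X[150:])
print("Spearman:", spearmanr(pred, y[150:])[0])
```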

[NLP-15] MRSE: An Efficient Multi-modality Retrieval System for Large Scale E-commerce
[NLP-15] MRSE:一个面向大规模电子商务的高效多模态检索系统

链接: https://arxiv.org/abs/2408.14968
作者: Hao Jiang,Haoxiang Zhang,Qingshan Hou,Chaofeng Chen,Weisi Lin,Jingchang Zhang,Annan Wang
关键词-EN: Providing high-quality item, e-commerce search systems, Providing high-quality, high-quality item recall, Embedding-based Retrieval Systems
关键词-ZH: 提供高质量的物品、电子商务搜索系统、提供高质量、高质量的物品召回、基于嵌入的检索系统
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:

Abstract:Providing high-quality item recall for text queries is crucial in large-scale e-commerce search systems. Current Embedding-based Retrieval Systems (ERS) embed queries and items into a shared low-dimensional space, but uni-modality ERS rely too heavily on textual features, making them unreliable in complex contexts. While multi-modality ERS incorporate various data sources, they often overlook individual preferences for different modalities, leading to suboptimal results. To address these issues, we propose MRSE, a Multi-modality Retrieval System that integrates text, item images, and user preferences through lightweight mixture-of-expert (LMoE) modules to better align features across and within modalities. MRSE also builds user profiles at a multi-modality level and introduces a novel hybrid loss function that enhances consistency and robustness using hard negative sampling. Experiments on a large-scale dataset from Shopee and online A/B testing show that MRSE achieves an 18.9% improvement in offline relevance and a 3.7% gain in online core metrics compared to Shopee’s state-of-the-art uni-modality system.
摘要:在大规模电子商务搜索系统中,为文本查询提供高质量的商品召回至关重要。目前基于嵌入的检索系统(ERS)将查询和商品嵌入到共享的低维空间中,但单模态ERS过于依赖文本特征,使其在复杂场景下不可靠。虽然多模态ERS结合了多种数据源,但它们往往忽略了用户对不同模态的个体偏好,导致结果欠佳。为了解决这些问题,我们提出了多模态检索系统MRSE,它通过轻量级专家混合(LMoE)模块整合文本、商品图像和用户偏好,以更好地对齐模态间和模态内的特征。MRSE还在多模态层面构建用户画像,并引入一种新的混合损失函数,利用难负例采样增强一致性和鲁棒性。在来自Shopee的大规模数据集上的实验和在线A/B测试表明,与Shopee最先进的单模态系统相比,MRSE在离线相关性上提升了18.9%,在线核心指标提升了3.7%。
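
摘要提到的“利用难负例采样的混合损失”可以借助带难负例的 InfoNCE 形式来直观理解。下面给出一种常见写法(温度系数 0.07、张量维度均为假设;MRSE 实际的混合损失定义以原文为准):

```python
# 带难负例的 InfoNCE 损失示意(假设性写法;MRSE 的混合损失细节以原文为准)
import torch
import torch.nn.functional as F

def info_nce_with_hard_negs(q, pos, hard_negs, tau: float = 0.07):
    """q: (B, D) 查询向量;pos: (B, D) 正例商品;hard_negs: (B, K, D) 难负例。"""
    q, pos = F.normalize(q, dim=-1), F.normalize(pos, dim=-1)
    hard_negs = F.normalize(hard_negs, dim=-1)
    pos_sim = (q * pos).sum(-1, keepdim=True)           # (B, 1)
    neg_sim = torch.einsum("bd,bkd->bk", q, hard_negs)  # (B, K)
    logits = torch.cat([pos_sim, neg_sim], dim=1) / tau
    labels = torch.zeros(q.size(0), dtype=torch.long)   # 正例位于第 0 列
    return F.cross_entropy(logits, labels)

loss = info_nce_with_hard_negs(torch.randn(4, 16), torch.randn(4, 16),
                               torch.randn(4, 8, 16))
print(loss.item())
```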

[NLP-16] Multilingual Arbitrage: Optimizing Data Pools to Accelerate Multilingual Progress
[NLP-16] 多语言套利:优化数据池以加速多语言进展

链接: https://arxiv.org/abs/2408.14960
作者: Ayomide Odumakinde,Daniel D’souza,Pat Verga,Beyza Ermis,Sara Hooker
关键词-EN: role in recent, played a critical, critical role, synthetic data, data has played
关键词-ZH: 最近的作用,发挥了关键、关键的作用,合成数据,数据发挥了作用
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:The use of synthetic data has played a critical role in recent state-of-the-art breakthroughs. However, overly relying on a single oracle teacher model to generate data has been shown to lead to model collapse and invite propagation of biases. These limitations are particularly evident in multilingual settings, where the absence of a universally effective teacher model that excels across all languages presents significant challenges. In this work, we address these extreme differences by introducing “multilingual arbitrage”, which capitalizes on performance variations between multiple models for a given language. To do so, we strategically route samples through a diverse pool of models, each with unique strengths in different languages. Across exhaustive experiments on state-of-the-art models, our work suggests that arbitrage techniques allow for spectacular gains in performance that far outperform relying on a single teacher. In particular, compared to the best single teacher, we observe gains of up to 56.5% improvement in win rates averaged across all languages when switching to multilingual arbitrage. We observe the most significant gains for the least resourced languages in our pool.
摘要:合成数据的使用在近期最先进的技术突破中发挥了关键作用。然而,过度依赖单一的oracle教师模型来生成数据已被证明会导致模型崩溃并助长偏见的传播。这些局限在多语言环境中尤其明显,因为不存在一个在所有语言上都表现出色的普遍有效的教师模型,这带来了重大挑战。在这项工作中,我们通过引入“多语言套利”来应对这些极端差异,它利用了多个模型在给定语言上的性能差异。为此,我们策略性地将样本路由到一个多样化的模型池中,其中每个模型在不同语言上各有独特优势。在最先进模型上的详尽实验表明,套利技术能带来显著的性能提升,远超依赖单一教师模型的做法。特别是,与最佳的单一教师相比,我们观察到切换到多语言套利后,所有语言的平均胜率最高提升了56.5%。我们观察到,模型池中资源最匮乏的语言收益最为显著。
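
“多语言套利”的路由机制可以压缩为几行代码:按样本语言查表,挑选该语言上得分最高的教师模型。下面的模型名与分数均为虚构占位,仅演示选择逻辑:

```python
# “多语言套利”路由示意:模型名与分数均为虚构,仅演示选择逻辑
PER_LANG_SCORES = {
    "sw": {"teacher_a": 0.41, "teacher_b": 0.63},
    "fr": {"teacher_a": 0.78, "teacher_b": 0.70},
}

def route(sample: dict) -> str:
    scores = PER_LANG_SCORES[sample["lang"]]
    return max(scores, key=scores.get)  # 为该语言挑选得分最高的教师模型

print(route({"lang": "sw", "text": "..."}))  # -> teacher_b
```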

[NLP-17] SpikingSSMs: Learning Long Sequences with Sparse and Parallel Spiking State Space Models
[NLP-17] SpikingSSMs:使用稀疏并行的脉冲状态空间模型学习长序列

链接: https://arxiv.org/abs/2408.14909
作者: Shuaijie Shen,Chao Wang,Renzhuo Huang,Yan Zhong,Qinghai Guo,Zhichao Lu,Jianguo Zhang,Luziwei Leng
关键词-EN: energy consumption networks, low energy consumption, past decades, energy consumption, gained a lot
关键词-ZH: 能源消费网络,低能源消耗,过去几十年,能源消耗,收获了很多
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注:

Abstract:Known as low energy consumption networks, spiking neural networks (SNNs) have gained a lot of attention within the past decades. While SNNs are increasing competitive with artificial neural networks (ANNs) for vision tasks, they are rarely used for long sequence tasks, despite their intrinsic temporal dynamics. In this work, we develop spiking state space models (SpikingSSMs) for long sequence learning by leveraging on the sequence learning abilities of state space models (SSMs). Inspired by dendritic neuron structure, we hierarchically integrate neuronal dynamics with the original SSM block, meanwhile realizing sparse synaptic computation. Furthermore, to solve the conflict of event-driven neuronal dynamics with parallel computing, we propose a light-weight surrogate dynamic network which accurately predicts the after-reset membrane potential and compatible to learnable thresholds, enabling orders of acceleration in training speed compared with conventional iterative methods. On the long range arena benchmark task, SpikingSSM achieves competitive performance to state-of-the-art SSMs meanwhile realizing on average 90% of network sparsity. On language modeling, our network significantly surpasses existing spiking large language models (spikingLLMs) on the WikiText-103 dataset with only a third of the model size, demonstrating its potential as backbone architecture for low computation cost LLMs.
摘要:作为低能耗网络,脉冲神经网络(SNN)在过去几十年里受到了广泛关注。尽管SNN在视觉任务上与人工神经网络(ANN)的竞争力日益增强,但它们很少被用于长序列任务,尽管其本身具有内在的时间动力学。在这项工作中,我们利用状态空间模型(SSM)的序列学习能力,开发了用于长序列学习的脉冲状态空间模型(SpikingSSMs)。受树突神经元结构的启发,我们将神经元动力学与原始SSM模块分层集成,同时实现稀疏的突触计算。此外,为了解决事件驱动的神经元动力学与并行计算的冲突,我们提出了一种轻量级代理动态网络,它能准确预测重置后的膜电位并与可学习阈值兼容,与传统迭代方法相比将训练速度提升了若干数量级。在长程竞技场(Long Range Arena)基准任务上,SpikingSSM取得了与最先进SSM相当的性能,同时实现了平均90%的网络稀疏度。在语言建模方面,我们的网络在WikiText-103数据集上以仅三分之一的模型规模大幅超越了现有的脉冲大型语言模型(spikingLLMs),展示了其作为低计算成本LLM骨干架构的潜力。

[NLP-18] ripl`etoile: Extraction of Knowledge from Microblogging Text
[NLP-18] ripl ’ etoile:从微博文本中提取知识

链接: https://arxiv.org/abs/2408.14908
作者: Vanni Zavarella,Sergio Consoli,Diego Reforgiato Recupero,Gianni Fenu,Simone Angioni,Davide Buscaldi,Danilo Dessì,Francesco Osborne
关键词-EN: Numerous methods, publications and patents, recently emerged, scientific publications, Numerous
关键词-ZH: 最近出现的众多方法、出版物和专利、科学出版物、众多
类目: Information Retrieval (cs.IR); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL)
备注: 42 pages, 6 figures

Abstract:Numerous methods and pipelines have recently emerged for the automatic extraction of knowledge graphs from documents such as scientific publications and patents. However, adapting these methods to incorporate alternative text sources like micro-blogging posts and news has proven challenging as they struggle to model open-domain entities and relations, typically found in these sources. In this paper, we propose an enhanced information extraction pipeline tailored to the extraction of a knowledge graph comprising open-domain entities from micro-blogging posts on social media platforms. Our pipeline leverages dependency parsing and classifies entity relations in an unsupervised manner through hierarchical clustering over word embeddings. We provide a use case on extracting semantic triples from a corpus of 100 thousand tweets about digital transformation and publicly release the generated knowledge graph. On the same dataset, we conduct two experimental evaluations, showing that the system produces triples with precision over 95% and outperforms similar pipelines of around 5% in terms of precision, while generating a comparatively higher number of triples.
摘要:最近出现了许多用于从科学出版物和专利等文档中自动抽取知识图谱的方法和流水线。然而,将这些方法调整为能处理微博帖子和新闻等替代文本来源已被证明具有挑战性,因为它们难以对这类来源中常见的开放域实体和关系进行建模。在本文中,我们提出了一条增强的信息抽取流水线,专门用于从社交媒体平台的微博帖子中抽取包含开放域实体的知识图谱。我们的流水线利用依存句法分析,并通过对词嵌入进行层次聚类,以无监督方式对实体关系进行分类。我们提供了一个从10万条有关数字化转型的推文语料中抽取语义三元组的用例,并公开发布了生成的知识图谱。在同一数据集上,我们进行了两次实验评估,结果表明该系统生成三元组的精度超过95%,在精度上比类似流水线高出约5%,同时生成的三元组数量也相对更多。
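
流水线中“对词嵌入做层次聚类、无监督归并关系类型”这一步可以用 scikit-learn 勾勒如下(随机向量仅为占位,实际应替换为预训练词向量;簇数等参数亦为假设,metric 参数需要 scikit-learn 1.2 及以上):

```python
# 示意:对关系短语的词向量做层次聚类,无监督归并关系类型(占位向量)
import numpy as np
from sklearn.cluster import AgglomerativeClustering

relation_phrases = ["acquired", "bought", "purchased",
                    "partnered with", "collaborated with"]
rng = np.random.default_rng(0)
# 占位:实际应替换为预训练词向量(如 word2vec / fastText)
X = np.stack([rng.normal(size=32) for _ in relation_phrases])

clusterer = AgglomerativeClustering(n_clusters=2, linkage="average",
                                    metric="cosine")  # metric 需 sklearn>=1.2
for phrase, label in zip(relation_phrases, clusterer.fit_predict(X)):
    print(label, phrase)
```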

[NLP-19] Writing in the Margins: Better Inference Pattern for Long Context Retrieval
[NLP-19] 页边写作:面向长上下文检索的更优推理模式

链接: https://arxiv.org/abs/2408.14906
作者: Melisa Russak,Umar Jamil,Christopher Bryant,Kiran Kamble,Axel Magnuson,Mateusz Russak,Waseem AlShikh
关键词-EN: Large Language Models, Language Models designed, long input sequences, Large Language, introduce Writing
关键词-ZH: 大型语言模型,设计的语言模型,长输入序列,大型语言,介绍写作
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

Abstract:In this paper, we introduce Writing in the Margins (WiM), a new inference pattern for Large Language Models designed to optimize the handling of long input sequences in retrieval-oriented tasks. This approach leverages the chunked prefill of the key-value cache to perform segment-wise inference, which enables efficient processing of extensive contexts along with the generation and classification of intermediate information (“margins”) that guide the model towards specific tasks. This method increases computational overhead marginally while significantly enhancing the performance of off-the-shelf models without the need for fine-tuning. Specifically, we observe that WiM provides an average enhancement of 7.5% in accuracy for reasoning skills (HotpotQA, MultiHop-RAG) and more than a 30.0% increase in the F1-score for aggregation tasks (CWE). Additionally, we show how the proposed pattern fits into an interactive retrieval design that provides end-users with ongoing updates about the progress of context processing, and pinpoints the integration of relevant information into the final response. We release our implementation of WiM using Hugging Face Transformers library at this https URL.
摘要:在本文中,我们提出了页边写作(Writing in the Margins, WiM),一种面向大型语言模型的新推理模式,旨在优化面向检索任务中长输入序列的处理。该方法利用键值缓存(KV cache)的分块预填充来执行分段推理,从而既能高效处理超长上下文,又能生成并分类引导模型完成特定任务的中间信息(“边注”)。这种方法只带来很小的额外计算开销,却能在无需微调的情况下显著提升现成模型的性能。具体来说,我们观察到WiM在推理任务(HotpotQA、MultiHop-RAG)上平均提升7.5%的准确率,在聚合任务(CWE)上F1分数提升超过30.0%。此外,我们展示了该模式如何融入交互式检索设计,向最终用户持续提供上下文处理进度的更新,并明确指出相关信息被整合进最终回答的位置。我们基于Hugging Face Transformers库实现的WiM发布于 this https URL。
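
WiM 的控制流可以用占位函数勾勒如下(llm_* 均为假设的桩函数,仅展示“分块预填充、生成边注、筛选、汇总作答”的流程;真实实现基于 KV cache 的分块预填充,见论文开源代码):

```python
# WiM 控制流示意:llm_* 均为假设的桩函数,真实实现基于 KV cache 分块预填充
def llm_generate_margin(segment: str, query: str) -> str:
    return f"[margin for segment starting with {segment[:8]!r}]"  # 桩:生成边注

def llm_is_relevant(margin: str, query: str) -> bool:
    return True                                   # 桩:对边注做相关性分类

def llm_answer(query: str, margins: list[str]) -> str:
    return f"answer({query}) using {len(margins)} margins"  # 桩:汇总边注作答

def writing_in_the_margins(long_context: str, query: str, chunk: int = 1000) -> str:
    margins = []
    for i in range(0, len(long_context), chunk):  # 逐块“预填充”长上下文
        margin = llm_generate_margin(long_context[i:i + chunk], query)
        if llm_is_relevant(margin, query):        # 只保留与任务相关的边注
            margins.append(margin)
    return llm_answer(query, margins)

print(writing_in_the_margins("x" * 3000, "What is X?"))
```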

[NLP-20] VHAKG: A Multi-modal Knowledge Graph Based on Synchronized Multi-view Videos of Daily Activities CIKM2024
[NLP-20] VHAKG:基于同步多视图日常活动视频的多模式知识图谱

链接: https://arxiv.org/abs/2408.14895
作者: Shusaku Egami,Takahiro Ugai,Ken Fukuda
关键词-EN: Multi-modal knowledge graphs, resources enabling knowledge, enabling knowledge processing, Multi-modal knowledge, non-symbolic data
关键词-ZH: 多模式知识图、支持知识的资源、支持知识处理、多模式知识、非符号数据
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 4 figures, accepted by CIKM2024 Resource Track

Abstract:Multi-modal knowledge graphs (MMKGs), which ground various non-symbolic data (e.g., images and videos) into symbols, have attracted attention as resources enabling knowledge processing and machine learning across modalities. However, the construction of MMKGs for videos consisting of multiple events, such as daily activities, is still in the early stages. In this paper, we construct an MMKG based on synchronized multi-view simulated videos of daily activities. Besides representing the content of daily life videos as event-centric knowledge, our MMKG also includes frame-by-frame fine-grained changes, such as bounding boxes within video frames. In addition, we provide support tools for querying our MMKG. As an application example, we demonstrate that our MMKG facilitates benchmarking vision-language models by providing the necessary vision-language datasets for a tailored task.
摘要:多模态知识图谱(MMKG)将各种非符号数据(例如图像和视频)锚定到符号上,作为支持跨模态知识处理和机器学习的资源而受到关注。然而,针对由日常活动等多个事件组成的视频构建MMKG仍处于早期阶段。在本文中,我们基于日常活动的同步多视角仿真视频构建了一个MMKG。除了将日常生活视频的内容表示为以事件为中心的知识之外,我们的MMKG还包括逐帧的细粒度变化,例如视频帧内的边界框。此外,我们提供了查询该MMKG的支持工具。作为应用示例,我们展示了我们的MMKG可以通过为定制任务提供所需的视觉语言数据集,来促进视觉语言模型的基准测试。

[NLP-21] A Functional Trade-off between Prosodic and Semantic Cues in Conveying Sarcasm INTERSPEECH2024
[NLP-21] 讽刺传达中韵律线索与语义线索的功能权衡

链接: https://arxiv.org/abs/2408.14892
作者: Zhu Li,Xiyuan Gao,Yuqing Zhang,Shekhar Nayak,Matt Coler
关键词-EN: cues signaling sarcasm, prosodic cues signaling, study investigates, investigates the acoustic, disentangles the interplay
关键词-ZH: 暗示讽刺,韵律暗示,研究调查,调查声学,解开相互作用
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: accepted at Interspeech 2024

Abstract:This study investigates the acoustic features of sarcasm and disentangles the interplay between the propensity of an utterance being used sarcastically and the presence of prosodic cues signaling sarcasm. Using a dataset of sarcastic utterances compiled from television shows, we analyze the prosodic features within utterances and key phrases belonging to three distinct sarcasm categories (embedded, propositional, and illocutionary), which vary in the degree of semantic cues present, and compare them to neutral expressions. Results show that in phrases where the sarcastic meaning is salient from the semantics, the prosodic cues are less relevant than when the sarcastic meaning is not evident from the semantics, suggesting a trade-off between prosodic and semantic cues of sarcasm at the phrase level. These findings highlight a lessened reliance on prosodic modulation in semantically dense sarcastic expressions and a nuanced interaction that shapes the communication of sarcastic intent.
摘要:本研究考察讽刺的声学特征,并厘清“话语被用作讽刺的倾向”与“标示讽刺的韵律线索的出现”之间的相互作用。利用从电视节目中整理的讽刺话语数据集,我们分析了属于三种不同讽刺类别(嵌入式、命题式和言外式,其语义线索的明显程度各不相同)的话语及关键短语中的韵律特征,并与中性表达进行比较。结果表明,在讽刺含义可由语义凸显的短语中,韵律线索的作用小于讽刺含义无法由语义看出的情形,这表明在短语层面,讽刺的韵律线索与语义线索之间存在一种权衡。这些发现突显了语义密集的讽刺表达对韵律调节的依赖减弱,以及塑造讽刺意图交流的一种微妙交互。
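
这类研究依赖的话语级韵律特征通常包括基频(F0)轨迹与能量。下面用 librosa 在一段合成正弦波上演示提取方式(合成信号仅为演示,实际应加载真实语音;论文采用的具体特征集以原文为准):

```python
# 示意:提取话语级韵律特征(F0 与能量);合成正弦波仅为演示
import numpy as np
import librosa

sr = 22050
t = np.linspace(0, 1.0, sr, endpoint=False)
y = 0.5 * np.sin(2 * np.pi * 220 * t)        # 220 Hz 的合成“语音”

f0, voiced_flag, voiced_prob = librosa.pyin(y, fmin=80, fmax=400, sr=sr)
rms = librosa.feature.rms(y=y)[0]

print("mean F0 (Hz):", np.nanmean(f0))        # 约 220
print("F0 range (Hz):", np.nanmax(f0) - np.nanmin(f0))
print("mean RMS energy:", rms.mean())
```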

[NLP-22] Inverse-Q*: Token Level Reinforcement Learning for Aligning Large Language Models Without Preference Data
[NLP-22] Inverse-Q*:用于在没有偏好数据的情况下对齐大型语言模型的令牌级强化学习

链接: https://arxiv.org/abs/2408.14874
作者: Han Xia,Songyang Gao,Qiming Ge,Zhiheng Xi,Qi Zhang,Xuanjing Huang
关键词-EN: Proximal Policy Optimization, aligning large language, methodologies like Proximal, Reinforcement Learning, token-level reinforcement learning
关键词-ZH: Proximal策略优化、调整大型语言、Proximal等方法论、强化学习、代币级强化学习
类目: Computation and Language (cs.CL)
备注:

Abstract:Reinforcement Learning from Human Feedback (RLHF) has proven effective in aligning large language models with human intentions, yet it often relies on complex methodologies like Proximal Policy Optimization (PPO) that require extensive hyper-parameter tuning and present challenges in sample efficiency and stability. In this paper, we introduce Inverse-Q*, an innovative framework that transcends traditional RL methods by optimizing token-level reinforcement learning without the need for additional reward or value models. Inverse-Q* leverages direct preference optimization techniques but extends them by estimating the conditionally optimal policy directly from the model’s responses, facilitating more granular and flexible policy shaping. Our approach reduces reliance on human annotation and external supervision, making it especially suitable for low-resource settings. We present extensive experimental results demonstrating that Inverse-Q* not only matches but potentially exceeds the effectiveness of PPO in terms of convergence speed and the alignment of model responses with human preferences. Our findings suggest that Inverse-Q* offers a practical and robust alternative to conventional RLHF approaches, paving the way for more efficient and adaptable model training approaches.
摘要:基于人类反馈的强化学习(RLHF)已被证明能有效地使大型语言模型与人类意图对齐,但它通常依赖近端策略优化(PPO)等复杂方法,这些方法需要大量超参数调优,并在样本效率和稳定性方面存在挑战。在本文中,我们提出了Inverse-Q*,一个创新框架,它通过优化token级强化学习而超越传统RL方法,且不需要额外的奖励或价值模型。Inverse-Q*利用直接偏好优化技术,但对其加以扩展:直接从模型的响应中估计条件最优策略,从而实现更细粒度、更灵活的策略塑造。我们的方法减少了对人工标注和外部监督的依赖,使其特别适合低资源环境。我们给出的大量实验结果表明,Inverse-Q*在收敛速度以及模型响应与人类偏好的对齐程度上不仅可与PPO相当,甚至可能更优。我们的发现表明,Inverse-Q*为传统RLHF方法提供了一种实用而稳健的替代方案,为更高效、更具适应性的模型训练方法铺平了道路。

[NLP-23] Advancing Adversarial Suffix Transfer Learning on Aligned Large Language Models
[NLP-23] 在对齐的大型语言模型上推进对抗性后缀迁移学习

链接: https://arxiv.org/abs/2408.14866
作者: Hongfu Liu,Yuxi Xie,Ye Wang,Michael Shieh
关键词-EN: Language Language Models, Language Language, face safety concerns, safety concerns due, Greedy Coordinate Gradient
关键词-ZH: 语言语言模型,语言语言,面临安全问题,应有的安全问题,贪婪协调梯度
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: 11 pages, 4 figures

Abstract:Large Language Models (LLMs) face safety concerns due to potential misuse by malicious users. Recent red-teaming efforts have identified adversarial suffixes capable of jailbreaking LLMs using the gradient-based search algorithm Greedy Coordinate Gradient (GCG). However, GCG struggles with computational inefficiency, limiting further investigations regarding suffix transferability and scalability across models and data. In this work, we bridge the connection between search efficiency and suffix transferability. We propose a two-stage transfer learning framework, DeGCG, which decouples the search process into behavior-agnostic pre-searching and behavior-relevant post-searching. Specifically, we employ direct first target token optimization in pre-searching to facilitate the search process. We apply our approach to cross-model, cross-data, and self-transfer scenarios. Furthermore, we introduce an interleaved variant of our approach, i-DeGCG, which iteratively leverages self-transferability to accelerate the search process. Experiments on HarmBench demonstrate the efficiency of our approach across various models and domains. Notably, our i-DeGCG outperforms the baseline on Llama2-chat-7b with ASRs of 43.9 (+22.2) and 39.0 (+19.5) on valid and test sets, respectively. Further analysis on cross-model transfer indicates the pivotal role of first target token optimization in leveraging suffix transferability for efficient searching.
摘要:大型语言模型(LLM)由于可能被恶意用户滥用而面临安全隐患。最近的红队工作已经发现了能够借助基于梯度的搜索算法“贪婪坐标梯度”(GCG)越狱LLM的对抗性后缀。然而,GCG存在计算效率低的问题,限制了对后缀在模型和数据间的可迁移性与可扩展性的进一步研究。在这项工作中,我们建立了搜索效率与后缀可迁移性之间的联系。我们提出了一个两阶段迁移学习框架DeGCG,它将搜索过程解耦为与行为无关的预搜索和与行为相关的后搜索。具体来说,我们在预搜索中采用直接的首目标token优化来简化搜索过程。我们将该方法应用于跨模型、跨数据和自迁移场景。此外,我们提出了该方法的交错变体i-DeGCG,它迭代地利用自迁移性来加速搜索。在HarmBench上的实验证明了我们的方法在各种模型和领域中的有效性。值得注意的是,我们的i-DeGCG在Llama2-chat-7b上超过了基线,在验证集和测试集上的ASR分别为43.9(+22.2)和39.0(+19.5)。对跨模型迁移的进一步分析表明,首目标token优化在利用后缀可迁移性进行高效搜索方面起着关键作用。

[NLP-24] Detecting AI Flaws: Target-Driven Attacks on Internal Faults in Language Models
[NLP-24] 检测人工智能缺陷:对语言模型内部故障的目标驱动攻击

链接: https://arxiv.org/abs/2408.14853
作者: Yuhao Du,Zhuo Li,Pengyu Cheng,Xiang Wan,Anningzhe Gao
关键词-EN: Large Language Models, Large Language, rapidly evolving field, Language Models, artificial intelligence
关键词-ZH: 大型语言模型、大型语言、快速发展的领域、语言模型、人工智能
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

Abstract:Large Language Models (LLMs) have become a focal point in the rapidly evolving field of artificial intelligence. However, a critical concern is the presence of toxic content within the pre-training corpus of these models, which can lead to the generation of inappropriate outputs. Investigating methods for detecting internal faults in LLMs can help us understand their limitations and improve their security. Existing methods primarily focus on jailbreaking attacks, which involve manually or automatically constructing adversarial content to prompt the target LLM to generate unexpected responses. These methods rely heavily on prompt engineering, which is time-consuming and usually requires specially designed questions. To address these challenges, this paper proposes a target-driven attack paradigm that focuses on directly eliciting the target response instead of optimizing the prompts. We introduce the use of another LLM as the detector for toxic content, referred to as ToxDet. Given a target toxic response, ToxDet can generate a possible question and a preliminary answer to provoke the target model into producing desired toxic responses with meanings equivalent to the provided one. ToxDet is trained by interacting with the target LLM and receiving reward signals from it, utilizing reinforcement learning for the optimization process. While the primary focus of the target models is on open-source LLMs, the fine-tuned ToxDet can also be transferred to attack black-box models such as GPT-4o, achieving notable results. Experimental results on AdvBench and HH-Harmless datasets demonstrate the effectiveness of our methods in detecting the tendencies of target LLMs to generate harmful responses. This algorithm not only exposes vulnerabilities but also provides a valuable resource for researchers to strengthen their models against such attacks.
摘要:大型语言模型(LLM)已成为快速发展的人工智能领域的焦点。然而,一个严重的隐患是这些模型的预训练语料库中存在有害内容,可能导致生成不当输出。研究检测LLM内部缺陷的方法可以帮助我们理解其局限并提高其安全性。现有方法主要集中于越狱攻击,即手动或自动构建对抗性内容,促使目标LLM生成出乎意料的响应。这些方法严重依赖提示词工程,既耗时又通常需要专门设计的问题。为应对这些挑战,本文提出一种目标驱动的攻击范式,专注于直接诱出目标响应而不是优化提示词。我们引入另一个LLM作为有害内容的检测器,称为ToxDet。给定目标有害响应,ToxDet可以生成一个可能的问题和初步回答,以诱使目标模型产生含义与给定响应等价的有害响应。ToxDet通过与目标LLM交互并从中接收奖励信号来训练,利用强化学习完成优化过程。虽然目标模型主要是开源LLM,但微调后的ToxDet也可以迁移用于攻击GPT-4o等黑盒模型,并取得显著效果。在AdvBench和HH-Harmless数据集上的实验结果证明了我们的方法在检测目标LLM产生有害响应倾向方面的有效性。该算法不仅暴露了漏洞,也为研究人员加固模型以抵御此类攻击提供了宝贵资源。

[NLP-25] Project SHADOW: Symbolic Higher-order Associative Deductive reasoning On Wikidata using LM probing
[NLP-25] Project SHADOW:使用LM探测在维基数据上进行符号化高阶联想演绎推理

链接: https://arxiv.org/abs/2408.14849
作者: Hanna Abi Akl
关键词-EN: Wikidata triple completion, associative deductive reasoning, base construction task, fine-tuned language model, language model trained
关键词-ZH: 维基数据三重完成、联想演绎推理、基础构建任务、微调语言模型、训练语言模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 6 pages, 1 figure

Abstract:We introduce SHADOW, a fine-tuned language model trained on an intermediate task using associative deductive reasoning, and measure its performance on a knowledge base construction task using Wikidata triple completion. We evaluate SHADOW on the LM-KBC 2024 challenge and show that it outperforms the baseline solution by 20% with a F1 score of 68.72%.
摘要:我们提出SHADOW,一个借助联想演绎推理在中间任务上训练的微调语言模型,并在基于维基数据三元组补全的知识库构建任务上衡量其性能。我们在LM-KBC 2024挑战赛上评估了SHADOW,结果显示它以68.72%的F1分数比基线解决方案高出20%。

[NLP-26] AAVENUE: Detecting LLM Biases on NLU Tasks in AAVE via a Novel Benchmark
[NLP-26] AAVENUE:通过新型基准检测AAVE中NLU任务的LLM偏差

链接: https://arxiv.org/abs/2408.14845
作者: Abhay Gupta,Philip Meng,Ece Yurtseven,Sean O’Brien,Kevin Zhu
关键词-EN: African American Vernacular, American Vernacular English, Standard American English, natural language understanding, natural language processing
关键词-ZH: 非裔美国人白话、美国白话英语、标准美式英语、自然语言理解、自然语言处理
类目: Computation and Language (cs.CL)
备注:

Abstract:Detecting biases in natural language understanding (NLU) for African American Vernacular English (AAVE) is crucial to developing inclusive natural language processing (NLP) systems. To address dialect-induced performance discrepancies, we introduce AAVENUE (AAVE Natural Language Understanding Evaluation), a benchmark for evaluating large language model (LLM) performance on NLU tasks in AAVE and Standard American English (SAE). AAVENUE builds upon and extends existing benchmarks like VALUE, replacing deterministic syntactic and morphological transformations with a more flexible methodology leveraging LLM-based translation with few-shot prompting, improving performance across our evaluation metrics when translating key tasks from the GLUE and SuperGLUE benchmarks. We compare AAVENUE and VALUE translations using five popular LLMs and a comprehensive set of metrics including fluency, BARTScore, quality, coherence, and understandability. Additionally, we recruit fluent AAVE speakers to validate our translations for authenticity. Our evaluations reveal that LLMs consistently perform better on SAE tasks than AAVE-translated versions, underscoring inherent biases and highlighting the need for more inclusive NLP models. We have open-sourced our source code on GitHub and created a website to showcase our work at https://aavenue.live.
摘要:检测面向非裔美国人白话英语(AAVE)的自然语言理解(NLU)中的偏见,对于开发具有包容性的自然语言处理(NLP)系统至关重要。为了解决方言引起的性能差异,我们提出了AAVENUE(AAVE Natural Language Understanding Evaluation),一个评估大型语言模型(LLM)在AAVE和标准美式英语(SAE)的NLU任务上表现的基准。AAVENUE在VALUE等现有基准的基础上构建并加以扩展,用一种更灵活的方法取代确定性的句法和形态变换:利用基于LLM的翻译与少样本提示,在翻译来自GLUE和SuperGLUE基准的关键任务时,在我们的各项评估指标上取得了更好的表现。我们使用五个流行的LLM和一组综合指标(包括流畅度、BARTScore、质量、连贯性和可理解性)来比较AAVENUE与VALUE的翻译。此外,我们招募流利的AAVE使用者来校验我们翻译的地道性。我们的评估显示,LLM在SAE任务上的表现始终优于AAVE翻译版本,凸显了固有偏见,并强调了对更具包容性的NLP模型的需求。我们已在GitHub上开源了源代码,并创建了一个网站展示我们的工作:https://aavenue.live。
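
摘要中“基于 LLM 的翻译 + 少样本提示”可以概括为一个简单的提示词模板。下面是假设性示意(示例句对为虚构占位,真实基准的译文经流利 AAVE 使用者校验):

```python
# 示意:构造 SAE -> AAVE 的少样本翻译提示词(示例句对为虚构占位)
FEW_SHOT = [
    ("He is going to the store.", "He finna go to the store."),
]

def build_prompt(sae_sentence: str) -> str:
    shots = "\n".join(f"SAE: {s}\nAAVE: {t}" for s, t in FEW_SHOT)
    return (f"Translate Standard American English to AAVE.\n"
            f"{shots}\nSAE: {sae_sentence}\nAAVE:")

print(build_prompt("They were talking about it yesterday."))
```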

[NLP-27] CL4KGE: A Curriculum Learning Method for Knowledge Graph Embedding
[NLP-27] CL4KGE:一种知识图谱嵌入的课程学习方法

链接: https://arxiv.org/abs/2408.14840
作者: Yang Liu,Chuan Zhou,Peng Zhang,Yanan Cao,Yongchao Liu,Zhao Li,Hongyang Chen
关键词-EN: Knowledge graph embedding, crafting representations comprehensive, Knowledge graph, KGE models, graph embedding
关键词-ZH: 知识图嵌入、全面制作表示、知识图、KGE模型、图嵌入
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 16 pages, 3 figures

Abstract:Knowledge graph embedding (KGE) constitutes a foundational task, directed towards learning representations for entities and relations within knowledge graphs (KGs), with the objective of crafting representations comprehensive enough to approximate the logical and symbolic interconnections among entities. In this paper, we define a metric Z-counts to measure the difficulty of training each triple (head entity, relation, tail entity) in KGs with theoretical analysis. Based on this metric, we propose CL4KGE, an efficient Curriculum Learning based training strategy for KGE. This method includes a difficulty measurer and a training scheduler that aids in the training of KGE models. Our approach possesses the flexibility to act as a plugin within a wide range of KGE models, with the added advantage of adaptability to the majority of KGs in existence. The proposed method has been evaluated on popular KGE models, and the results demonstrate that it enhances the state-of-the-art methods. The use of Z-counts as a metric has enabled the identification of challenging triples in KGs, which helps in devising effective training strategies.
摘要:知识图谱嵌入(KGE)是一项基础性任务,旨在学习知识图谱(KG)中实体与关系的表示,目标是构造足够全面的表示,以逼近实体之间的逻辑与符号关联。在本文中,我们定义了度量Z-counts来衡量KG中每个三元组(头实体、关系、尾实体)的训练难度,并给出理论分析。基于该度量,我们提出了CL4KGE,一种高效的、基于课程学习的KGE训练策略。该方法包括一个难度度量器和一个辅助KGE模型训练的训练调度器。我们的方法可以灵活地作为插件嵌入各类KGE模型,并且能够适配现存的大多数KG。所提出的方法已在流行的KGE模型上进行了评估,结果表明它能提升最先进方法的性能。以Z-counts为度量使我们能够识别KG中具有挑战性的三元组,这有助于设计有效的训练策略。
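
课程学习训练策略的骨架是:按难度度量排序,再由易到难逐步扩大训练子集。下面以一个占位难度函数代替论文中的 Z-counts(其正式定义与理论分析见原文):

```python
# 课程学习调度骨架:按难度排序,由易到难逐步扩大训练子集
# (difficulty 为占位;论文中的难度度量是 Z-counts)
def difficulty(triple) -> float:
    head, relation, tail = triple
    return float(len(relation))               # 占位难度,仅作演示

triples = [("A", "born_in", "B"), ("C", "capital_of", "D"), ("E", "is_a", "F")]
curriculum = sorted(triples, key=difficulty)  # 由易到难

for epoch, frac in enumerate([0.3, 0.6, 1.0], start=1):
    visible = curriculum[: max(1, int(len(curriculum) * frac))]
    print(f"epoch {epoch}: 训练子集 {visible}")
```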

[NLP-28] PolicyLR: A Logic Representation For Privacy Policies
[NLP-28] PolicyLR:隐私政策的逻辑表示

链接: https://arxiv.org/abs/2408.14830
作者: Ashish Hooda,Rishabh Khandelwal,Prasad Chalasani,Kassem Fawaz,Somesh Jha
关键词-EN: GDPR and CCPA, services handle user, handle user data, online ecosystem, defining how services
关键词-ZH: GDPR和CCPA、服务处理用户、处理用户数据、在线生态系统、定义服务如何
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注:

Abstract:Privacy policies are crucial in the online ecosystem, defining how services handle user data and adhere to regulations such as GDPR and CCPA. However, their complexity and frequent updates often make them difficult for stakeholders to understand and analyze. Current automated analysis methods, which utilize natural language processing, have limitations. They typically focus on individual tasks and fail to capture the full context of the policies. We propose PolicyLR, a new paradigm that offers a comprehensive machine-readable representation of privacy policies, serving as an all-in-one solution for multiple downstream tasks. PolicyLR converts privacy policies into a machine-readable format using valuations of atomic formulae, allowing for formal definitions of tasks like compliance and consistency. We have developed a compiler that transforms unstructured policy text into this format using off-the-shelf Large Language Models (LLMs). This compiler breaks down the transformation task into a two-stage translation and entailment procedure. This procedure considers the full context of the privacy policy to infer a complex formula, where each formula consists of simpler atomic formulae. The advantage of this model is that PolicyLR is interpretable by design and grounded in segments of the privacy policy. We evaluated the compiler using ToS;DR, a community-annotated privacy policy entailment dataset. Utilizing open-source LLMs, our compiler achieves precision and recall values of 0.91 and 0.88, respectively. Finally, we demonstrate the utility of PolicyLR in three privacy tasks: Policy Compliance, Inconsistency Detection, and Privacy Comparison Shopping.
摘要:隐私政策在在线生态系统中至关重要,它定义了服务如何处理用户数据,并遵守GDPR和CCPA等法规。然而,它们的复杂性和频繁更新往往使利益相关者难以理解和分析。目前利用自然语言处理的自动分析方法具有局限性。它们通常专注于单个任务,而不能捕获策略的完整上下文。我们提出了PolicyLR,这是一种新的范例,它提供了一种全面的机器可读的隐私策略表示法,作为多个下游任务的一体化解决方案。PolicyLR使用原子公式的赋值将隐私策略转换为机器可读的格式,从而允许对合规性和一致性等任务进行正式定义。我们已经开发了一个编译器,可以使用现成的大型语言模型(LLM)将非结构化策略文本转换为这种格式。该编译器将转换任务分解为两个阶段的翻译和蕴涵过程。此过程考虑隐私策略的完整上下文来推断复杂的公式,其中每个公式由更简单的原子公式组成。此模型的优势在于PolicyLR在设计上是可解释的,并基于隐私政策的相应段落。我们使用ToS;DR(一个由社区注释的隐私策略蕴含数据集)对编译器进行了评估。利用开源LLM,我们的编译器分别实现了0.91的查准率和0.88的查全率。最后,我们展示了PolicyLR在三个隐私任务中的效用:策略遵从性、不一致性检测和隐私比较购物。
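
下面用一小段Python示意"翻译 + 蕴含"两阶段编译的接口形态;原子公式列表、提示词模板与 llm 可调用对象均为假设,并非论文实现。

```python
# 最小示意:用两阶段(翻译 + 蕴含)把隐私政策文本映射为原子公式的真值。
from typing import Callable, Dict

ATOMIC_FORMULAE = [
    "collects_location_data",
    "shares_data_with_third_parties",
    "allows_data_deletion",
]

def compile_policy(policy_text: str, llm: Callable[[str], str]) -> Dict[str, bool]:
    valuation = {}
    for formula in ATOMIC_FORMULAE:
        # 阶段一:把政策文本"翻译"成与该原子公式相关的简明陈述
        statement = llm(f"Summarize what this policy says about {formula}:\n{policy_text}")
        # 阶段二:判断蕴含关系,得到该原子公式的真值
        answer = llm(f"Does the following entail '{formula}'? Answer yes/no.\n{statement}")
        valuation[formula] = answer.strip().lower().startswith("yes")
    return valuation

# 用一个假 LLM 演示接口(真实使用时替换为任意 LLM 调用)
fake_llm = lambda prompt: "yes" if "location" in prompt else "no"
print(compile_policy("We collect location data.", fake_llm))
```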

[NLP-29] From Rule-Based Models to Deep Learning Transformers Architectures for Natural Language Processing and Sign Language Translation Systems: Survey Taxonomy and Performance Evaluation
[NLP-29] 从基于规则的模型到自然语言处理和手语翻译系统的深度学习变形者架构:调查分类和性能评估

链接: https://arxiv.org/abs/2408.14825
作者: Nada Shahin,Leila Ismail
关键词-EN: Hearing population worldwide, Deaf and Hard, Hard of Hearing, growing Deaf, Hearing population
关键词-ZH: 全球听力人口,聋哑人和重听人,聋哑人,听力人口不断增长
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:With the growing Deaf and Hard of Hearing population worldwide and the persistent shortage of certified sign language interpreters, there is a pressing need for an efficient, signs-driven, integrated end-to-end translation system, from sign to gloss to text and vice-versa. There has been a wealth of research on machine translations and related reviews. However, there are few works on sign language machine translation considering the particularity of the language being continuous and dynamic. This paper aims to address this void, providing a retrospective analysis of the temporal evolution of sign language machine translation algorithms and a taxonomy of the Transformers architectures, the most used approach in language translation. We also present the requirements of a real-time Quality-of-Service sign language ma-chine translation system underpinned by accurate deep learning algorithms. We propose future research directions for sign language translation systems.
摘要:随着全球聋人和重听人口的不断增长,以及经过认证的手语翻译员的持续短缺,迫切需要一个高效的、手语驱动的、集成的端到端翻译系统,实现从手语到注释(gloss)再到文本的转换,反之亦然。关于机器翻译及相关综述已有大量研究。然而,考虑到手语连续性和动态性的特殊性,关于手语机器翻译的研究却很少。本文旨在填补这一空白,对手语机器翻译算法的时间演变进行了回顾性分析,并对语言翻译中最常用的方法Transformer架构进行了分类。我们还提出了以准确的深度学习算法为基础的实时服务质量手语机器翻译系统的要求。我们为手语翻译系统提出了未来的研究方向。

[NLP-30] GSIFN: A Graph-Structured and Interlaced-Masked Multimodal Transformer Based Fusion Network for Multimodal Sentiment Analysis
[NLP-30] GSIFN:一种基于图形结构和交错屏蔽多模式Transformer的融合网络,用于多模式情绪分析

链接: https://arxiv.org/abs/2408.14809
作者: Yijie Jin
关键词-EN: Multimodal Sentiment Analysis, Sentiment Analysis, leverages multiple modals, analyze sentiments, leverages multiple
关键词-ZH: 多模式情绪分析,情绪分析,利用多种模式,分析情绪,利用多种
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Multimodal Sentiment Analysis (MSA) leverages multiple modals to analyze sentiments. Typically, advanced fusion methods and representation learning-based methods are designed to tackle it. Our proposed GSIFN solves two key problems to be solved in MSA: (i) In multimodal fusion, the decoupling of modal combinations and tremendous parameter redundancy in existing fusion methods, which lead to poor fusion performance and efficiency. (ii) The trade-off between representation capability and computation overhead of the unimodal feature extractors and enhancers. GSIFN incorporates two main components to solve these problems: (i) Graph-Structured and Interlaced-Masked Multimodal Transformer. It adopts the Interlaced Mask mechanism to construct robust multimodal graph embedding, achieve all-modal-in-one Transformer-based fusion, and greatly reduce the computation overhead. (ii) A self-supervised learning framework with low computation overhead and high performance, which utilizes a parallelized LSTM with matrix memory to enhance non-verbal modal feature for unimodal label generation. Evaluated on the MSA datasets CMU-MOSI, CMU-MOSEI, and CH-SIMS, GSIFN demonstrates superior performance with significantly lower computation overhead compared with state-of-the-art methods.
摘要:多模态情感分析(MSA)利用多种模态来分析情感,通常采用先进的融合方法和基于表示学习的方法来解决。我们提出的GSIFN解决了MSA中有待解决的两个关键问题:(1)在多模态融合中,现有融合方法存在模态组合解耦和巨大的参数冗余,导致融合性能和效率不佳;(2)单模态特征提取器和增强器在表示能力与计算开销之间的权衡。GSIFN包含两个主要组件来解决这些问题:(1)图结构与交错掩码多模态Transformer,采用交错掩码机制构造鲁棒的多模态图嵌入,实现基于Transformer的全模态合一融合,并大幅降低计算开销;(2)一种低计算开销、高性能的自监督学习框架,利用带矩阵记忆的并行化LSTM来增强用于单模态标签生成的非语言模态特征。在MSA数据集CMU-MOSI、CMU-MOSEI和CH-SIMS上的评估表明,与最先进的方法相比,GSIFN以显著更低的计算开销取得了更优的性能。

[NLP-31] Instruct-SkillMix: A Powerful Pipeline for LLM Instruction Tuning
[NLP-31] Instruct-SkillMix:LLM指令调优的强大管道

链接: https://arxiv.org/abs/2408.14774
作者: Simran Kaur,Simon Park,Anirudh Goyal,Sanjeev Arora
关键词-EN: powerful LLM, existing powerful LLM, high quality SFT, automated approach, LLM
关键词-ZH: 强大的LLM,现有强大的LLM,高质量SFT,自动化方法,LLM
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We introduce Instruct-SkillMix, an automated approach for creating diverse, high quality SFT data. The Instruct-SkillMix pipeline involves two stages, each leveraging an existing powerful LLM: (1) Skill extraction: uses the LLM to extract core “skills” for instruction-following, either from existing datasets, or by directly prompting the model; (2) Data generation: uses the powerful LLM to generate (instruction, response) data that exhibit a randomly chosen pair of these skills. Here, the use of random skill combinations promotes diversity and difficulty. Vanilla SFT (i.e., no PPO, DPO, or RL methods) on data generated from Instruct-SkillMix leads to strong gains on instruction following benchmarks such as AlpacaEval 2.0, MT-Bench, and WildBench. With just 4K examples, LLaMA-3-8B-Base achieves 42.76% length-controlled win rate on AlpacaEval 2.0. To our knowledge, this achieves state-of-the-art performance among all models that have only undergone SFT (no RL methods) and competes with proprietary models such as Claude 3 Opus and LLaMA-3.1-405B-Instruct. Ablation studies also suggest plausible reasons for why creating open instruction-tuning datasets via naive crowd-sourcing has proved difficult. Introducing low quality answers (“shirkers”) in 20% of Instruct-SkillMix examples causes performance to plummet, sometimes catastrophically. The Instruct-SkillMix pipeline is flexible and is adaptable to other settings.
摘要:我们介绍了Instruct-SkillMix,一种用于创建多样化、高质量SFT数据的自动化方法。Instruct-SkillMix流水线包括两个阶段,每个阶段都利用现有的强大LLM:(1)技能提取:使用LLM从现有数据集或通过直接提示模型,提取指令遵循的核心"技能";(2)数据生成:使用强大的LLM生成体现随机选取的一对技能的(指令、响应)数据。在这里,随机技能组合的使用提高了多样性和难度。在Instruct-SkillMix生成的数据上进行普通SFT(即不使用PPO、DPO或RL方法),可在AlpacaEval 2.0、MT-Bench和WildBench等指令遵循基准上获得强劲收益。仅用4K个示例,LLaMA-3-8B-Base就在AlpacaEval 2.0上实现了42.76%的长度控制胜率。据我们所知,这在所有仅经过SFT(无RL方法)的模型中达到了最先进的性能,并可与Claude 3 Opus和LLaMA-3.1-405B-Instruct等专有模型竞争。消融研究还为为什么通过简单众包创建开放指令调优数据集被证明是困难的提供了合理解释。在20%的Instruct-SkillMix示例中引入低质量答案("shirkers")会导致性能直线下降,有时甚至是灾难性的。Instruct-SkillMix流水线非常灵活,可以适应其他设置。
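
下面用一个最小Python示意勾勒该两阶段流水线(技能提取、随机技能对数据生成);其中的提示词与 fake_llm 均为演示用假设。

```python
# 最小示意:Instruct-SkillMix 式的两阶段数据生成,
# 先抽取技能,再随机组合技能对生成 (instruction, response)。
import random

def extract_skills(llm, seed_instructions):
    """阶段一:让 LLM 从种子指令中提炼核心"技能"标签。"""
    prompt = "List the core instruction-following skills in:\n" + "\n".join(seed_instructions)
    return [s.strip() for s in llm(prompt).split(",") if s.strip()]

def generate_sft_pair(llm, skills):
    """阶段二:随机抽取一对技能,要求 LLM 生成同时体现二者的样本。"""
    a, b = random.sample(skills, 2)  # 随机组合提升多样性与难度
    instruction = llm(f"Write a hard instruction that requires both '{a}' and '{b}'.")
    response = llm(f"Answer this instruction well:\n{instruction}")
    return {"skills": (a, b), "instruction": instruction, "response": response}

# 演示用的假 LLM(真实使用时替换为任意 LLM 调用)
fake_llm = lambda p: "role-play, step-by-step reasoning, formatting" if p.startswith("List") else "..."
skills = extract_skills(fake_llm, ["Explain X", "Format as JSON"])
print(generate_sft_pair(fake_llm, skills))
```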

[NLP-32] A global AI community requires language-diverse publishing ICLR
[NLP-32] 全球人工智能社区需要多种语言的出版

链接: https://arxiv.org/abs/2408.14772
作者: Haley Lepp,Parth Sarin
关键词-EN: reinforces broader regimes, English language publishing, English dominance, language publishing upholds, English language
关键词-ZH: 加强更广泛的政权、英语出版、英语主导地位、语言出版维护、英语
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Translations by Michael Hardy (Guarani), Vandana Sarin and Vivek Sarin (Hindi), Roshna Omer Abdulrahman (Soranî Kurdish), Gabriel Poesia (Portuguese), and Matías Grinberg (Spanish). In the proceedings of the Global AI Cultures Workshop at the Twelfth International Conference on Learning Representations (ICLR) 2024, Vienna, Austria, May 7-11, 2024

点击查看摘要

Abstract:In this provocation, we discuss the English dominance of the AI research community, arguing that the requirement for English language publishing upholds and reinforces broader regimes of extraction in AI. While large language models and machine translation have been celebrated as a way to break down barriers, we regard their use as a symptom of linguistic exclusion of scientists and potential readers. We propose alternative futures for a healthier publishing culture, organized around three themes: administering conferences in the languages of the country in which they are held, instructing peer reviewers not to adjudicate the language appropriateness of papers, and offering opportunities to publish and present in multiple languages. We welcome new translations of this piece. Please contact the authors if you would like to contribute one.
摘要:在这篇意在引发论辩的文章中,我们讨论了人工智能研究界的英语主导地位,认为英语出版的要求维护并加强了人工智能领域更广泛的攫取体制。虽然大型语言模型和机器翻译被誉为打破障碍的一种方式,但我们认为它们的使用正是科学家和潜在读者遭受语言排斥的症状。我们提出了一个更健康的出版文化的替代未来,围绕三个主题组织:以会议所在国的语言管理会议,指示同行评审员不要评判论文的语言适当性,并提供以多种语言出版和展示的机会。我们欢迎这篇文章的新译本;如果您愿意贡献一份翻译,请联系作者。

[NLP-33] LyCon: Lyrics Reconstruction from the Bag-of-Words Using Large Language Models
[NLP-33] LyCon:使用大型语言模型从词袋中重建歌词

链接: https://arxiv.org/abs/2408.14750
作者: Haven Kim,Kahyun Choi
关键词-EN: paper addresses, addresses the unique, unique challenge, challenge of conducting, conducting research
关键词-ZH: 论文解决了独特的、独特的挑战,进行研究的挑战
类目: Computation and Language (cs.CL); Digital Libraries (cs.DL)
备注: Dataset downlodable at this https URL

点击查看摘要

Abstract:This paper addresses the unique challenge of conducting research in lyric studies, where direct use of lyrics is often restricted due to copyright concerns. Unlike typical data, internet-sourced lyrics are frequently protected under copyright law, necessitating alternative approaches. Our study introduces a novel method for generating copyright-free lyrics from publicly available Bag-of-Words (BoW) datasets, which contain the vocabulary of lyrics but not the lyrics themselves. Utilizing metadata associated with BoW datasets and large language models, we successfully reconstructed lyrics. We have compiled and made available a dataset of reconstructed lyrics, LyCon, aligned with metadata from renowned sources including the Million Song Dataset, Deezer Mood Detection Dataset, and AllMusic Genre Dataset, available for public access. We believe that the integration of metadata such as mood annotations or genres enables a variety of academic experiments on lyrics, such as conditional lyric generation.
摘要:本文讨论了歌词研究中的独特挑战:由于版权问题,歌词的直接使用往往受到限制。与典型数据不同,来自互联网的歌词通常受版权法保护,因此需要替代方法。我们的研究介绍了一种从公开可用的词袋(BoW)数据集生成无版权歌词的新方法,这些数据集包含歌词的词汇,但不包含歌词本身。利用与BoW数据集相关联的元数据和大型语言模型,我们成功地重建了歌词。我们汇编并发布了重建歌词数据集LyCon,其与百万歌曲数据集(Million Song Dataset)、Deezer情绪检测数据集和AllMusic流派数据集等知名来源的元数据对齐,可供公众访问。我们认为,情绪注释或流派等元数据的整合使各种关于歌词的学术实验成为可能,例如条件歌词生成。
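
下面是一个从BoW与元数据拼装重建提示词的最小示意;字段名与提示模板均为假设,仅演示论文描述的思路。

```python
# 最小示意:把词袋(BoW)与元数据组合成一个歌词重建提示词。
def build_reconstruction_prompt(bow_counts, metadata):
    vocab = ", ".join(f"{w}(x{c})" for w, c in sorted(bow_counts.items(), key=lambda x: -x[1]))
    return (
        f"Reconstruct plausible song lyrics.\n"
        f"Genre: {metadata.get('genre', 'unknown')}; Mood: {metadata.get('mood', 'unknown')}\n"
        f"Use (roughly with these frequencies) only this vocabulary: {vocab}\n"
    )

prompt = build_reconstruction_prompt(
    {"rain": 4, "night": 3, "alone": 2}, {"genre": "blues", "mood": "sad"}
)
print(prompt)  # 再将 prompt 交给任意大语言模型生成无版权歌词
```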

[NLP-34] PAT: Pruning-Aware Tuning for Large Language Models
[NLP-34] PAT:大型语言模型的剪枝感知调优

链接: https://arxiv.org/abs/2408.14721
作者: Yijiang Liu,Huanrui Yang,Youxin Chen,Rongyu Zhang,Miao Wang,Yuan Du,Li Du
关键词-EN: Large language models, Large language, language tasks, language models, Structural pruning
关键词-ZH: 大型语言模型、大型语言、语言任务、语言模型、结构修剪
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) excel in language tasks, especially with supervised fine-tuning after pre-training. However, their substantial memory and computational requirements hinder practical applications. Structural pruning, which reduces less significant weight dimensions, is one solution. Yet, traditional post-hoc pruning often leads to significant performance loss, with limited recovery from further fine-tuning due to reduced capacity. Since the model fine-tuning refines the general and chaotic knowledge in pre-trained models, we aim to incorporate structural pruning with the fine-tuning, and propose the Pruning-Aware Tuning (PAT) paradigm to eliminate model redundancy while preserving the model performance to the maximum extend. Specifically, we insert the innovative Hybrid Sparsification Modules (HSMs) between the Attention and FFN components to accordingly sparsify the upstream and downstream linear modules. The HSM comprises a lightweight operator and a globally shared trainable mask. The lightweight operator maintains a training overhead comparable to that of LoRA, while the trainable mask unifies the channels to be sparsified, ensuring structural pruning. Additionally, we propose the Identity Loss which decouples the transformation and scaling properties of the HSMs to enhance training robustness. Extensive experiments demonstrate that PAT excels in both performance and efficiency. For example, our Llama2-7b model with a 25% pruning ratio achieves 1.33 \times speedup while outperforming the LoRA-finetuned model by up to 1.26% in accuracy with a similar training cost. Code: this https URL
摘要:大型语言模型(LLM)在语言任务中表现出色,尤其是在预训练后经过有监督微调的情况下。然而,它们巨大的内存和计算需求阻碍了实际应用。结构化剪枝通过削减不太重要的权重维度提供了一种解决方案。然而,传统的事后(post-hoc)剪枝通常会导致显著的性能损失,并且由于容量减少,进一步微调的恢复也很有限。由于模型微调会提炼预训练模型中泛化而混杂的知识,我们的目标是将结构化剪枝与微调结合起来,并提出剪枝感知调优(PAT)范式,在最大限度保持模型性能的同时消除模型冗余。具体地说,我们在Attention和FFN组件之间插入创新的混合稀疏化模块(HSM),以相应地稀疏化上游和下游的线性模块。HSM包括一个轻量级算子和一个全局共享的可训练掩码。轻量级算子保持了与LoRA相当的训练开销,而可训练掩码统一了要稀疏化的通道,确保了结构化剪枝。此外,为了增强训练的稳健性,我们提出了将HSM的变换和缩放特性解耦的恒等损失(Identity Loss)。大量实验表明,PAT在性能和效率上都表现出色。例如,我们的剪枝比为25%的Llama2-7b模型获得了1.33倍的加速,同时在相近的训练成本下,精度比LoRA微调模型最高高出1.26%。代码:此HTTPS URL
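
下面用PyTorch给出一个HSM式"可训练掩码"的最小示意;模块的具体结构与论文实现可能不同,仅供理解摘要所述思路。

```python
# 最小示意(PyTorch):在线性层输出上乘一个全局共享的可训练掩码,
# 训练后把接近 0 的通道整列裁掉,得到结构化剪枝。模块结构为假设。
import torch
import torch.nn as nn

class HybridSparsificationModule(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.mask_logits = nn.Parameter(torch.zeros(dim))  # 轻量:每通道一个参数

    def forward(self, x):
        # sigmoid 软掩码在微调期间逐步把冗余通道压向 0
        return x * torch.sigmoid(self.mask_logits)

    def channels_to_keep(self, threshold=0.5):
        return (torch.sigmoid(self.mask_logits) > threshold).nonzero().flatten()

hsm = HybridSparsificationModule(dim=8)
y = hsm(torch.randn(2, 8))           # 插在 Attention 与 FFN 之间使用
print(y.shape, hsm.channels_to_keep())
```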

[NLP-35] Smart Multi-Modal Search: Contextual Sparse and Dense Embedding Integration in Adobe Express
[NLP-35] 智能多模式搜索:Adobe Express中的上下文稀疏和密集嵌入集成

链接: https://arxiv.org/abs/2408.14698
作者: Cherag Aroraa,Tracy Holloway King,Jayant Kumar,Yi Lu,Sanat Sharma,Arvind Srikantan,David Uvalle,Josep Valls-Vargas,Harsha Vardhan
关键词-EN: multi-modal search systems, effective multi-modal search, multi-modal search, search systems, search
关键词-ZH: 多模式搜索系统,有效的多模式搜索,多模式搜索,搜索系统,搜索
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:As user content and queries become increasingly multi-modal, the need for effective multi-modal search systems has grown. Traditional search systems often rely on textual and metadata annotations for indexed images, while multi-modal embeddings like CLIP enable direct search using text and image embeddings. However, embedding-based approaches face challenges in integrating contextual features such as user locale and recency. Building a scalable multi-modal search system requires fine-tuning several components. This paper presents a multi-modal search architecture and a series of AB tests that optimize embeddings and multi-modal technologies in Adobe Express template search. We address considerations such as embedding model selection, the roles of embeddings in matching and ranking, and the balance between dense and sparse embeddings. Our iterative approach demonstrates how utilizing sparse, dense, and contextual features enhances short and long query search, significantly reduces null rates (over 70%), and increases click-through rates (CTR). Our findings provide insights into developing robust multi-modal search systems, thereby enhancing relevance for complex queries.
摘要:随着用户内容和查询变得越来越多模态,对有效的多模态搜索系统的需求也在增长。传统的搜索系统通常依赖于索引图像的文本和元数据注释,而像CLIP这样的多模态嵌入允许使用文本和图像嵌入进行直接搜索。然而,基于嵌入的方法在集成用户区域设置和新近性等上下文特征方面面临挑战。构建一个可扩展的多模态搜索系统需要对多个组件进行微调。本文提出了一种多模态搜索架构和一系列AB测试,以优化Adobe Express模板搜索中的嵌入和多模态技术。我们讨论了嵌入模型的选择、嵌入在匹配和排序中的作用,以及稠密和稀疏嵌入之间的平衡等考虑因素。我们的迭代方法展示了如何利用稀疏、稠密和上下文特征来增强短查询和长查询搜索,显著降低空结果率(超过70%),并提高点击率(CTR)。我们的发现为开发健壮的多模态搜索系统提供了见解,从而提高了对复杂查询的相关性。
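
下面的Python片段示意稠密分、稀疏分与新近度等上下文特征的线性加权融合;权重与衰减函数均为假设,并非该系统的实际参数。

```python
# 最小示意:稠密(向量)分与稀疏(关键词)分的线性加权融合,
# 并用"新近度"这类上下文特征参与重排。
import math

def hybrid_score(dense_sim, sparse_sim, doc_age_days, w_dense=0.6, w_sparse=0.3, w_recency=0.1):
    recency = math.exp(-doc_age_days / 30.0)  # 越新分越高(示意用的指数衰减)
    return w_dense * dense_sim + w_sparse * sparse_sim + w_recency * recency

docs = [
    {"id": "t1", "dense": 0.82, "sparse": 0.40, "age": 2},
    {"id": "t2", "dense": 0.78, "sparse": 0.75, "age": 90},
]
ranked = sorted(docs, key=lambda d: hybrid_score(d["dense"], d["sparse"], d["age"]), reverse=True)
print([d["id"] for d in ranked])
```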

[NLP-36] Training-Free Activation Sparsity in Large Language Models
[NLP-36] 大型语言模型中的免训练激活稀疏性

链接: https://arxiv.org/abs/2408.14690
作者: James Liu,Pragaash Ponnusamy,Tianle Cai,Han Guo,Yoon Kim,Ben Athiwaratkun
关键词-EN: enable practical inference, practical inference speedups, large language models, forward pass, enable practical
关键词-ZH: 启用实用推理、实用推理加速、大型语言模型、前向传递、启用实用
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Activation sparsity can enable practical inference speedups in large language models (LLMs) by reducing the compute and memory-movement required for matrix multiplications during the forward pass. However, existing methods face limitations that inhibit widespread adoption. Some approaches are tailored towards older models with ReLU-based sparsity, while others require extensive continued pre-training on up to hundreds of billions of tokens. This paper describes TEAL, a simple training-free method that applies magnitude-based activation sparsity to hidden states throughout the entire model. TEAL achieves 40-50% model-wide sparsity with minimal performance degradation across Llama-2, Llama-3, and Mistral families, with sizes varying from 7B to 70B. We improve existing sparse kernels and demonstrate wall-clock decoding speed-ups of up to 1.53× and 1.8× at 40% and 50% model-wide sparsity. TEAL is compatible with weight quantization, enabling further efficiency gains.
摘要:激活稀疏性可以通过减少前向传播中矩阵乘法所需的计算和内存搬运,在大型语言模型(LLM)中实现实际的推理加速。然而,现有方法面临着阻碍其广泛采用的限制。有些方法是针对具有基于ReLU稀疏性的旧模型量身定制的,而另一些方法则需要在多达数千亿个token上进行大量持续预训练。本文描述了TEAL,一种简单的免训练方法,它将基于幅值的激活稀疏化应用于整个模型的隐藏状态。TEAL在Llama-2、Llama-3和Mistral系列(规模从7B到70B不等)上实现了40-50%的全模型稀疏度,且性能下降极小。我们改进了现有的稀疏核,并在40%和50%的全模型稀疏度下展示了高达1.53倍和1.8倍的墙钟解码加速。TEAL与权重量化兼容,可进一步提高效率。
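
摘要中的做法可以用如下几行PyTorch近似示意:按幅值把隐藏状态中最小的 p% 置零;阈值的选取方式在这里做了简化假设,并非TEAL的原实现。

```python
# 最小示意(PyTorch):按幅值把隐藏状态最小的 p% 置零,
# 训练无关、逐张量的激活稀疏化。
import torch

def magnitude_sparsify(hidden, sparsity=0.5):
    k = int(hidden.numel() * sparsity)
    if k == 0:
        return hidden
    threshold = hidden.abs().flatten().kthvalue(k).values
    return hidden * (hidden.abs() > threshold)

h = torch.randn(4, 16)
h_sparse = magnitude_sparsify(h, sparsity=0.5)
print((h_sparse == 0).float().mean())  # 约 0.5 的零占比
```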

[NLP-37] Relationships are Complicated! An Analysis of Relationships Between Datasets on the Web
[NLP-37] 关系很复杂!Web数据集之间关系分析

链接: https://arxiv.org/abs/2408.14636
作者: Kate Lin,Tarfah Alrashed,Natasha Noy
关键词-EN: rapid pace, relationships, datasets, today has millions, continues to grow
关键词-ZH: 快节奏、关系、数据集,今天有数百万个,持续增长
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The Web today has millions of datasets, and the number of datasets continues to grow at a rapid pace. These datasets are not standalone entities; rather, they are intricately connected through complex relationships. Semantic relationships between datasets provide critical insights for research and decision-making processes. In this paper, we study dataset relationships from the perspective of users who discover, use, and share datasets on the Web: what relationships are important for different tasks? What contextual information might users want to know? We first present a comprehensive taxonomy of relationships between datasets on the Web and map these relationships to user tasks performed during dataset discovery. We develop a series of methods to identify these relationships and compare their performance on a large corpus of datasets generated from Web pages with this http URL markup. We demonstrate that machine-learning based methods that use dataset metadata achieve multi-class classification accuracy of 90%. Finally, we highlight gaps in available semantic markup for datasets and discuss how incorporating comprehensive semantics can facilitate the identification of dataset relationships. By providing a comprehensive overview of dataset relationships at scale, this paper sets a benchmark for future research.
摘要:今天的Web拥有数以百万计的数据集,并且数据集的数量持续快速增长。这些数据集不是独立的实体;相反,它们通过复杂的关系错综复杂地联系在一起。数据集之间的语义关系为研究和决策过程提供了重要的见解。在本文中,我们从在Web上发现、使用和共享数据集的用户的角度来研究数据集关系:对于不同的任务,哪些关系是重要的?用户可能想知道哪些上下文信息?我们首先介绍Web上数据集之间的关系的全面分类,并将这些关系映射到在数据集发现期间执行的用户任务。我们开发了一系列方法来识别这些关系,并在使用此http URL标记从Web页面生成的大型数据集上比较它们的性能。我们证明了使用数据集元数据的基于机器学习的方法获得了90%的多类分类准确率。最后,我们强调了数据集可用的语义标记中的差距,并讨论了如何结合全面的语义来促进数据集关系的识别。通过提供大规模数据集关系的全面概述,本文为未来的研究设定了一个基准。

[NLP-38] MODOC: A Modular Interface for Flexible Interlinking of Text Retrieval and Text Generation Functions
[NLP-38] MODOC:文本检索和文本生成功能灵活互连的模块化接口

链接: https://arxiv.org/abs/2408.14623
作者: Yingqiang Gao,Jhony Prada,Nianlong Gu,Jessica Lam,Richard H.R. Hahnloser
关键词-EN: Large Language Models, Large Language, Language Models, produce eloquent texts, produce eloquent
关键词-ZH: 大型语言模型,大型语言,语言模型,产生雄辩的文本,产生雄辩的
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Digital Libraries (cs.DL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) produce eloquent texts but often the content they generate needs to be verified. Traditional information retrieval systems can assist with this task, but most systems have not been designed with LLM-generated queries in mind. As such, there is a compelling need for integrated systems that provide both retrieval and generation functionality within a single user interface. We present MODOC, a modular user interface that leverages the capabilities of LLMs and provides assistance with detecting their confabulations, promoting integrity in scientific writing. MODOC represents a significant step forward in scientific writing assistance. Its modular architecture supports flexible functions for retrieving information and for writing and generating text in a single, user-friendly interface.
摘要:大型语言模型(LLM)可以生成雄辩的文本,但它们所生成的内容通常需要验证。传统的信息检索系统可以帮助完成这项任务,但大多数系统在设计时都没有考虑到LLM生成的查询。因此,迫切需要在单个用户界面内同时提供检索和生成功能的集成系统。我们介绍了MODOC,这是一个模块化的用户界面,它利用LLM的能力,并帮助检测其虚构内容,促进科学写作的诚信。MODOC代表着科学写作辅助向前迈出的重要一步。其模块化体系结构支持灵活的功能,可在一个用户友好的统一界面中检索信息以及编写和生成文本。

[NLP-39] What Makes a Good Story and How Can We Measure It? A Comprehensive Survey of Story Evaluation
[NLP-39] 什么是好故事?我们如何衡量它?故事评价综合调查

链接: https://arxiv.org/abs/2408.14622
作者: Dingyi Yang,Qin Jin
关键词-EN: Large Language Models, Language Models, Large Language, success of Large, automatically generated stories
关键词-ZH: 大型语言模型、语言模型、大型语言、自动生成故事的成功
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:With the development of artificial intelligence, particularly the success of Large Language Models (LLMs), the quantity and quality of automatically generated stories have significantly increased. This has led to the need for automatic story evaluation to assess the generative capabilities of computing systems and analyze the quality of both automatic-generated and human-written stories. Evaluating a story can be more challenging than other generation evaluation tasks. While tasks like machine translation primarily focus on assessing the aspects of fluency and accuracy, story evaluation demands complex additional measures such as overall coherence, character development, interestingness, etc. This requires a thorough review of relevant research. In this survey, we first summarize existing storytelling tasks, including text-to-text, visual-to-text, and text-to-visual. We highlight their evaluation challenges, identify various human criteria to measure stories, and present existing benchmark datasets. Then, we propose a taxonomy to organize evaluation metrics that have been developed or can be adopted for story evaluation. We also provide descriptions of these metrics, along with the discussion of their merits and limitations. Later, we discuss the human-AI collaboration for story evaluation and generation. Finally, we suggest potential future research directions, extending from story evaluation to general evaluations.
摘要:随着人工智能的发展,特别是大语言模型(LLM)的成功,自动生成故事的数量和质量都有了显著提高。这就需要自动故事评估来衡量计算系统的生成能力,并分析自动生成和人工撰写故事的质量。评估一个故事可能比其他生成类评估任务更具挑战性。机器翻译等任务主要侧重于评估流畅性和准确性,而故事评估则需要整体连贯性、角色塑造、趣味性等复杂的额外衡量标准,这需要对相关研究进行全面梳理。在这项综述中,我们首先总结了现有的讲故事任务,包括文本到文本、视觉到文本和文本到视觉。我们强调了它们的评估挑战,梳理了衡量故事的各种人类标准,并介绍了现有的基准数据集。然后,我们提出了一种分类法,用于组织已经开发或可被用于故事评估的评估指标。我们还对这些指标进行了描述,并讨论了它们的优点和局限性。随后,我们讨论了用于故事评估和生成的人机协作。最后,我们提出了从故事评估延伸到一般评估的潜在未来研究方向。

[NLP-40] Surprisingly Fragile: Assessing and Addressing Prompt Instability in Multimodal Foundation Models
[NLP-40] 令人惊讶的脆弱性:评估和解决多模式基础模型中的即时不稳定性

链接: https://arxiv.org/abs/2408.14595
作者: Ian Stewart,Sameera Horawalavithana,Brendan Kennedy,Sai Munikoti,Karl Pazdernik
关键词-EN: Multimodal foundation models, Multimodal foundation, OFASys show, show the potential, potential to unlock
关键词-ZH: 多模式基金会模型,多模式基金会,OFASys展示,展示潜力,释放潜力
类目: Computation and Language (cs.CL)
备注: in submission

点击查看摘要

Abstract:Multimodal foundation models (MFMs) such as OFASys show the potential to unlock analysis of complex data such as images, videos, and audio data via text prompts alone. However, their performance may suffer in the face of text input that differs even slightly from their training distribution, which is surprising considering the use of modality-specific data to “ground” the text input. This study demonstrates that prompt instability is a major concern for MFMs, leading to a consistent drop in performance across all modalities, but that instability can be mitigated with additional training with augmented data. We evaluate several methods for grounded prompt perturbation, where we generate perturbations and filter based on similarity to text and/or modality data. After re-training the models on the augmented data, we find improved accuracy and more stable performance on the perturbed test data regardless of perturbation condition, suggesting that the data augmentation strategy helps the models handle domain shifts more effectively. In error analysis, we find consistent patterns of performance improvement across domains, suggesting that retraining on prompt perturbations tends to help general reasoning capabilities in MFMs.
摘要:OFASys等多模态基础模型(MFM)显示了仅通过文本提示就可以解锁对图像、视频和音频等复杂数据的分析的潜力。然而,在文本输入与其训练分布略有不同的情况下,它们的性能可能会受到影响;考虑到模型使用特定模态的数据来对文本输入进行"接地",这是令人惊讶的。这项研究表明,提示不稳定性(prompt instability)是MFM的一个主要问题,会导致所有模态的性能持续下降,但这种不稳定可以通过使用增强数据的额外训练来缓解。我们评估了几种接地提示扰动的方法:生成扰动,并基于与文本和/或模态数据的相似性进行过滤。在增强后的数据上重新训练模型后,我们发现无论在何种扰动条件下,模型在扰动测试数据上的准确率都有所提高,性能也更加稳定,这表明数据增强策略有助于模型更有效地处理域迁移。在错误分析中,我们发现了跨域性能提升的一致模式,这表明在提示扰动上的再训练往往有助于提升MFM的一般推理能力。
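
下面用Python示意"生成扰动并按与原文相似度过滤"的接地提示扰动流程;相似度此处用词重叠近似,实际工作可替换为文本/模态嵌入相似度。

```python
# 最小示意:生成提示词扰动并按与原提示的相似度过滤,保留"接地"的改写。
def token_overlap(a, b):
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / max(1, len(sa | sb))

def grounded_perturbations(prompt, candidates, min_sim=0.4):
    return [c for c in candidates if token_overlap(prompt, c) >= min_sim]

prompt = "describe the main object in the image"
candidates = [
    "describe the primary object in the image",
    "write a poem about winter",
]
print(grounded_perturbations(prompt, candidates))  # 只留下第一条
```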

[NLP-41] CURLoRA: Stable LLM Continual Fine-Tuning and Catastrophic Forgetting Mitigation
[NLP-41] CURLoRA:稳定的LLM持续微调和灾难性遗忘缓解

链接: https://arxiv.org/abs/2408.14572
作者: Muhammad Fawi
关键词-EN: Low-Rank Adaptation, leverages CUR matrix, fine-tuning large language, paper introduces CURLoRA, large language models
关键词-ZH: 低等级自适应,利用CUR矩阵,微调大型语言,论文介绍了CROLoRA,大型语言模型
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Code available at this https URL

点击查看摘要

Abstract:This paper introduces CURLoRA, a novel approach to fine-tuning large language models (LLMs) that leverages CUR matrix decomposition in the context of Low-Rank Adaptation (LoRA). Our method addresses two critical challenges in LLM fine-tuning: mitigating catastrophic forgetting during continual learning and reducing the number of trainable parameters. We propose a unique modification to the CUR decomposition process, utilizing inverted probabilities for column and row selection which acts as an implicit regularization, and initializing the U matrix as a zero matrix, and only fine-tuning it. We demonstrate through experiments on multiple datasets that CURLoRA outperforms standard LoRA in mitigating catastrophic forgetting. It maintains model stability and performance across tasks while significantly reducing the number of trainable parameters. Our results show that CURLoRA achieves very good and stable task accuracy while maintaining base model’s perplexity scores fixed compared to LoRA upon continual fine-tuning, particularly in scenarios with limited data.
摘要:本文介绍了CURLoRA,这是一种在低阶适应(LORA)环境下利用CUR矩阵分解来微调大型语言模型(LLMS)的新方法。我们的方法解决了LLM微调中的两个关键挑战:减轻持续学习过程中的灾难性遗忘和减少可训练参数的数量。我们对CUR分解过程提出了一种独特的改进,利用倒置概率进行列和行选择,作为隐式正则化,并将U矩阵初始化为零矩阵,并且仅对其进行微调。我们通过在多个数据集上的实验证明,CURLoRA在缓解灾难性遗忘方面优于标准LORA。它保持了模型的稳定性和跨任务的性能,同时显著减少了可训练参数的数量。我们的结果表明,CURLoRA在持续微调的情况下,特别是在数据有限的场景中,与LORA相比,在保持基本模型的困惑分数固定的同时,获得了非常好和稳定的任务精度。
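
下面用NumPy给出CURLoRA初始化思路的最小示意:按倒置概率选列/行、U 零初始化且仅训练 U;倒置概率的具体形式为笔者的假设,细节以论文为准。

```python
# 最小示意(NumPy):倒置概率的 CUR 选择 + 零初始化的 U。
import numpy as np

def inverted_selection_probs(scores):
    inv = 1.0 / (scores + 1e-8)       # 假设:范数越大,被选中的概率越小(隐式正则)
    return inv / inv.sum()

def curlora_init(W, k, rng=np.random.default_rng(0)):
    col_p = inverted_selection_probs(np.linalg.norm(W, axis=0))
    row_p = inverted_selection_probs(np.linalg.norm(W, axis=1))
    C = W[:, rng.choice(W.shape[1], k, replace=False, p=col_p)]
    R = W[rng.choice(W.shape[0], k, replace=False, p=row_p), :]
    U = np.zeros((k, k))              # 零初始化:微调起点不扰动原模型
    return C, U, R                    # 前向增量为 C @ U @ R,只训练 U

C, U, R = curlora_init(np.random.randn(16, 12), k=4)
print(C.shape, U.shape, R.shape)
```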

[NLP-42] Improving Clinical Note Generation from Complex Doctor-Patient Conversation
[NLP-42] 改善从复杂的医患对话中生成临床笔记

链接: https://arxiv.org/abs/2408.14568
作者: Yizhan Li,Sifan Wu,Christopher Smith,Thomas Lo,Bang Liu
关键词-EN: patient care documentation, documenting medical exams, clinical note generation, Writing clinical notes, healthcare professionals
关键词-ZH: 患者护理记录、记录体检、生成临床笔记、撰写临床笔记、医疗保健专业人员
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Writing clinical notes and documenting medical exams is a critical task for healthcare professionals, serving as a vital component of patient care documentation. However, manually writing these notes is time-consuming and can impact the amount of time clinicians can spend on direct patient interaction and other tasks. Consequently, the development of automated clinical note generation systems has emerged as a clinically meaningful area of research within AI for health. In this paper, we present three key contributions to the field of clinical note generation using large language models (LLMs). First, we introduce CliniKnote, a comprehensive dataset consisting of 1,200 complex doctor-patient conversations paired with their full clinical notes. This dataset, created and curated by medical experts with the help of modern neural networks, provides a valuable resource for training and evaluating models in clinical note generation tasks. Second, we propose the K-SOAP (Keyword, Subjective, Objective, Assessment, and Plan) note format, which enhances traditional SOAP (Subjective, Objective, Assessment, and Plan) notes (Podder et al., 2023) by adding a keyword section at the top, allowing for quick identification of essential information. Third, we develop an automatic pipeline to generate K-SOAP notes from doctor-patient conversations and benchmark various modern LLMs using various metrics. Our results demonstrate significant improvements in efficiency and performance compared to standard LLM finetuning methods.
摘要:撰写临床笔记和记录体检是医疗专业人员的一项重要任务,是患者护理记录的重要组成部分。然而,手动编写这些笔记非常耗时,并可能影响临床医生在患者直接互动和其他任务上花费的时间。因此,临床笔记自动生成系统的开发已经成为人工智能健康领域中一个具有临床意义的研究领域。在本文中,我们提出了使用大语言模型(LLM)生成临床病历领域的三个关键贡献。首先,我们介绍CliniKnote,这是一个全面的数据集,包含1200个复杂的医患对话和他们的完整临床笔记。该数据集由医学专家在现代神经网络的帮助下创建和管理,为临床病历生成任务中的培训和评估模型提供了宝贵的资源。其次,我们提出了K-SOAP(关键词、主观、客观、评估与计划)笔记格式,对传统的SOAP(主观、客观、评估与计划)笔记进行了改进,在顶部增加了一个关键词部分,允许快速识别基本信息。第三,我们开发了一个自动管道来从医患对话中生成K-SOAP笔记,并使用各种度量标准对各种现代LLM进行基准测试。我们的结果表明,与标准的LLM微调方法相比,我们在效率和性能上都有了显著的改进。

[NLP-43] Revisiting Image Captioning Training Paradigm via Direct CLIP-based Optimization BMVC2024
[NLP-43] 通过基于直接CLIP的优化重新审视图像字幕培训范式

链接: https://arxiv.org/abs/2408.14547
作者: Nicholas Moratelli,Davide Caffagni,Marcella Cornia,Lorenzo Baraldi,Rita Cucchiara
关键词-EN: Self-Critical Sequence Training, Self-Critical Sequence, image captioning involves, captioning involves pre-training, maximize hand-crafted captioning
关键词-ZH: 自我批评序列训练,自我批评序列,图像字幕涉及,字幕涉及预训练,最大限度地使用手工字幕
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
备注: BMVC 2024

点击查看摘要

Abstract:The conventional training approach for image captioning involves pre-training a network using teacher forcing and subsequent fine-tuning with Self-Critical Sequence Training to maximize hand-crafted captioning metrics. However, when attempting to optimize modern and higher-quality metrics like CLIP-Score and PAC-Score, this training method often encounters instability and fails to acquire the genuine descriptive capabilities needed to produce fluent and informative captions. In this paper, we propose a new training paradigm termed Direct CLIP-Based Optimization (DiCO). Our approach jointly learns and optimizes a reward model that is distilled from a learnable captioning evaluator with high human correlation. This is done by solving a weighted classification problem directly inside the captioner. At the same time, DiCO prevents divergence from the original model, ensuring that fluency is maintained. DiCO not only exhibits improved stability and enhanced quality in the generated captions but also aligns more closely with human preferences compared to existing methods, especially in modern metrics. Additionally, it maintains competitive performance in traditional metrics. Our source code and trained models are publicly available at this https URL.
摘要:传统的图像字幕训练方法包括使用教师强制(teacher forcing)预训练网络,随后通过自临界序列训练进行微调,以最大化手工设计的字幕度量。然而,当试图优化像CLIP-Score和PAC-Score这样更现代、更高质量的指标时,这种训练方法经常遇到不稳定的问题,并且无法获得产生流畅和信息量大的字幕所需的真正的描述能力。在本文中,我们提出了一种新的训练范式,称为基于CLIP的直接优化(DiCO)。我们的方法联合学习和优化了一个奖励模型,该模型是从具有高度人类相关性的可学习字幕评估器中蒸馏而来的。这是通过直接在字幕生成器内部求解一个加权分类问题来实现的。同时,DiCO防止了与原始模型的偏离,确保了流畅性。与现有的方法相比,DiCO不仅在生成的字幕中表现出更好的稳定性和更高的质量,而且更符合人类的偏好,特别是在现代度量中。此外,它在传统指标中保持了具有竞争力的表现。我们的源代码和经过训练的模型可在此HTTPS URL上公开获得。

[NLP-44] LLMs as Zero-shot Graph Learners: Alignment of GNN Representations with LLM Token Embeddings
[NLP-44] LLM作为零镜头图学习者:GNN表示与LLM令牌嵌入的一致性

链接: https://arxiv.org/abs/2408.14512
作者: Duo Wang,Yuan Zuo,Fengzhi Li,Junjie Wu
关键词-EN: scarce labeled data, garnered significant interest, significant interest due, graph neural networks, neural networks
关键词-ZH: 稀缺的标签数据,获得了显着的兴趣,显着的兴趣,图神经网络,神经网络
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Zero-shot graph machine learning, especially with graph neural networks (GNNs), has garnered significant interest due to the challenge of scarce labeled data. While methods like self-supervised learning and graph prompt learning have been extensively explored, they often rely on fine-tuning with task-specific labels, limiting their effectiveness in zero-shot scenarios. Inspired by the zero-shot capabilities of instruction-fine-tuned large language models (LLMs), we introduce a novel framework named Token Embedding-Aligned Graph Language Model (TEA-GLM) that leverages LLMs as cross-dataset and cross-task zero-shot learners for graph machine learning. Concretely, we pretrain a GNN, aligning its representations with token embeddings of an LLM. We then train a linear projector that transforms the GNN’s representations into a fixed number of graph token embeddings without tuning the LLM. A unified instruction is designed for various graph tasks at different levels, such as node classification (node-level) and link prediction (edge-level). These design choices collectively enhance our method’s effectiveness in zero-shot learning, setting it apart from existing methods. Experiments show that our graph token embeddings help the LLM predictor achieve state-of-the-art performance on unseen datasets and tasks compared to other methods using LLMs as predictors.
摘要:零样本图机器学习,尤其是基于图神经网络(GNN)的方法,由于标注数据稀缺的挑战而受到极大关注。虽然自监督学习和图提示学习等方法已被广泛探索,但它们往往依赖于带有特定任务标签的微调,限制了它们在零样本场景中的有效性。受经过指令微调的大语言模型(LLM)零样本能力的启发,我们提出了一种名为Token嵌入对齐图语言模型(TEA-GLM)的新框架,它利用LLM作为跨数据集和跨任务的零样本学习器进行图机器学习。具体地说,我们预训练一个GNN,将其表示与LLM的token嵌入对齐。然后,我们训练一个线性投影器,在不调整LLM的情况下,将GNN的表示转换为固定数量的图token嵌入。我们为不同层次的各种图任务(如节点分类(节点级)和链接预测(边级))设计了统一的指令。这些设计选择共同提高了我们的方法在零样本学习中的有效性,使其有别于现有方法。实验表明,与其他使用LLM作为预测器的方法相比,我们的图token嵌入帮助LLM预测器在未见过的数据集和任务上取得了最先进的性能。
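
下面用PyTorch示意"线性投影器把GNN表示映射为固定数量图token"的核心结构;维度与token数均为假设。

```python
# 最小示意(PyTorch):把 GNN 节点表示线性投影成固定数量的"图 token",
# 维度对齐 LLM 的词嵌入空间;LLM 本身保持冻结。
import torch
import torch.nn as nn

class GraphTokenProjector(nn.Module):
    def __init__(self, gnn_dim, llm_dim, num_graph_tokens=8):
        super().__init__()
        self.proj = nn.Linear(gnn_dim, num_graph_tokens * llm_dim)
        self.num_graph_tokens, self.llm_dim = num_graph_tokens, llm_dim

    def forward(self, node_repr):                  # [batch, gnn_dim]
        out = self.proj(node_repr)                 # [batch, T*llm_dim]
        return out.view(-1, self.num_graph_tokens, self.llm_dim)

proj = GraphTokenProjector(gnn_dim=128, llm_dim=4096, num_graph_tokens=8)
graph_tokens = proj(torch.randn(2, 128))           # 拼到 LLM 的输入嵌入序列前
print(graph_tokens.shape)                          # torch.Size([2, 8, 4096])
```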

[NLP-45] Unveiling the Statistical Foundations of Chain-of-Thought Prompting Methods
[NLP-45] 揭示思维链提示方法的统计基础

链接: https://arxiv.org/abs/2408.14511
作者: Xinyang Hu,Fengzhuo Zhang,Siyu Chen,Zhuoran Yang
关键词-EN: solving multi-step reasoning, multi-step reasoning problem, large language models, multi-step reasoning, gained popularity
关键词-ZH: 解决多步推理、多步推理问题、大型语言模型、多步推理,受到欢迎
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
备注: 150 pages, 18 figures, 3 tables

点击查看摘要

Abstract:Chain-of-Thought (CoT) prompting and its variants have gained popularity as effective methods for solving multi-step reasoning problems using pretrained large language models (LLMs). In this work, we analyze CoT prompting from a statistical estimation perspective, providing a comprehensive characterization of its sample complexity. To this end, we introduce a multi-step latent variable model that encapsulates the reasoning process, where the latent variable encodes the task information. Under this framework, we demonstrate that when the pretraining dataset is sufficiently large, the estimator formed by CoT prompting is equivalent to a Bayesian estimator. This estimator effectively solves the multi-step reasoning problem by aggregating a posterior distribution inferred from the demonstration examples in the prompt. Moreover, we prove that the statistical error of the CoT estimator can be decomposed into two main components: (i) a prompting error, which arises from inferring the true task using CoT prompts, and (ii) the statistical error of the pretrained LLM. We establish that, under appropriate assumptions, the prompting error decays exponentially to zero as the number of demonstrations increases. Additionally, we explicitly characterize the approximation and generalization errors of the pretrained LLM. Notably, we construct a transformer model that approximates the target distribution of the multi-step reasoning problem with an error that decreases exponentially in the number of transformer blocks. Our analysis extends to other variants of CoT, including Self-Consistent CoT, Tree-of-Thought, and Selection-Inference, offering a broad perspective on the efficacy of these methods. We also provide numerical experiments to validate the theoretical findings.
摘要:思维链(CoT)提示及其变体作为一种使用预训练大语言模型(LLM)解决多步骤推理问题的有效方法而受到广泛欢迎。在这项工作中,我们从统计估计的角度分析了CoT提示,提供了其样本复杂性的全面表征。为此,我们引入了一种封装推理过程的多步隐变量模型,其中隐变量对任务信息进行了编码。在此框架下,我们证明了当预训练数据集足够大时,由CoT提示形成的估计量等价于贝叶斯估计量。该估计器通过聚合从提示中的演示示例推断出的后验分布,有效地解决了多步骤推理问题。此外,我们还证明了CoT估计器的统计误差可以分解为两个主要部分:(i)使用CoT提示推理真实任务时产生的提示误差;(ii)预训练LLM的统计误差。我们证明,在适当的假设下,提示误差随着演示次数的增加而指数衰减到零。此外,我们还明确地刻画了预训练LLM的逼近误差和泛化误差。值得注意的是,我们构建了一个Transformer模型,该模型近似于多步推理问题的目标分布,误差随Transformer块的数量呈指数下降。我们的分析扩展到CoT的其他变体,包括自洽CoT、思维树和选择推理,为这些方法的有效性提供了一个广阔的视角。我们还提供了数值实验来验证理论研究结果。
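
摘要中的误差分解可以用如下示意性的记号写出;符号为笔者自拟,具体定义与不等式形式以原文为准。

```latex
% 示意性记号:\hat{f}_{\mathrm{CoT}} 表示由 CoT 提示构成的估计量,
% 误差分解为"提示误差 + 预训练 LLM 的统计误差"两部分
\operatorname{Err}\big(\hat{f}_{\mathrm{CoT}}\big)
  \;\lesssim\;
  \underbrace{\varepsilon_{\mathrm{prompt}}(n)}_{\text{提示误差,随演示数 } n \text{ 指数衰减至零}}
  \;+\;
  \underbrace{\varepsilon_{\mathrm{LLM}}}_{\text{预训练 LLM 的统计误差}}
```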

[NLP-46] Empowering Pre-Trained Language Models for Spatio-Temporal Forecasting via Decoupling Enhanced Discrete Reprogramming
[NLP-46] 通过解耦增强的离散重编程赋能预训练语言模型进行时空预测

链接: https://arxiv.org/abs/2408.14505
作者: Hao Wang,Jindong Han,Wei Fan,Hao Liu
关键词-EN: time series forecasting, series forecasting plays, Pre-trained Language Models, time series, energy management
关键词-ZH: 时间序列预测、系列预测戏剧、预训练语言模型、时间序列、能源管理
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Spatio-temporal time series forecasting plays a critical role in various real-world applications, such as transportation optimization, energy management, and climate analysis. The recent advancements in Pre-trained Language Models (PLMs) have inspired efforts to reprogram these models for time series forecasting tasks, by leveraging their superior reasoning and generalization capabilities. However, existing approaches fall short in handling complex spatial inter-series dependencies and intrinsic intra-series frequency components, limiting their spatio-temporal forecasting performance. Moreover, the linear mapping of continuous time series to a compressed subset vocabulary in reprogramming constrains the spatio-temporal semantic expressivity of PLMs and may lead to potential information bottleneck. To overcome the above limitations, we propose RePST, a tailored PLM reprogramming framework for spatio-temporal forecasting. The key insight of RePST is to decouple the spatio-temporal dynamics in the frequency domain, allowing better alignment with the PLM text space. Specifically, we first decouple spatio-temporal data in Fourier space and devise a structural diffusion operator to obtain temporal intrinsic and spatial diffusion signals, making the dynamics more comprehensible and predictable for PLMs. To avoid information bottleneck from a limited vocabulary, we further propose a discrete reprogramming strategy that selects relevant discrete textual information from an expanded vocabulary space in a differentiable manner. Extensive experiments on four real-world datasets show that our proposed approach significantly outperforms state-of-the-art spatio-temporal forecasting models, particularly in data-scarce scenarios.
摘要:时空时间序列预测在交通优化、能源管理、气候分析等实际应用中起着至关重要的作用。最近预训练语言模型(PLM)的进步激发了人们的努力,通过利用它们优越的推理和泛化能力,为时间序列预测任务重编程这些模型。然而,现有的方法在处理复杂的空间序列间相关性和固有的序列内频率分量方面存在不足,限制了它们的时空预测性能。此外,在重编程中,连续时间序列到压缩子集词汇的线性映射限制了PLM的时空语义表达能力,并可能导致潜在的信息瓶颈。为了克服上述局限性,我们提出了RePST,一种用于时空预测的定制PLM重编程框架。RePST的关键见解是在频域中解耦时空动力学,从而更好地与PLM文本空间对齐。具体地说,我们首先在傅立叶空间解耦时空数据,并设计了一个结构扩散算子来获得时间本征和空间扩散信号,使得PLM的动力学更容易理解和预测。为了避免有限词汇量带来的信息瓶颈,我们进一步提出了一种离散重编程策略,它以可微分的方式从扩展的词汇表空间中选择相关的离散文本信息。在四个真实世界数据集上的大量实验表明,我们提出的方法显著优于最先进的时空预测模型,特别是在数据稀缺的情况下。
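
下面用NumPy示意"在傅里叶域解耦时间序列"这一步的可能形态:低频近似趋势、高频保留细节;频率阈值与拆分方式均为假设,仅演示思路。

```python
# 最小示意(NumPy):在傅里叶域把序列拆成低频"趋势"和高频分量,
# 作为送入 PLM 前的解耦预处理。
import numpy as np

def frequency_decouple(series, keep_low=4):
    spec = np.fft.rfft(series)
    low = spec.copy();  low[keep_low:] = 0        # 低频:整体趋势
    high = spec - low                              # 高频:残差细节
    return np.fft.irfft(low, n=len(series)), np.fft.irfft(high, n=len(series))

t = np.arange(96)
x = 0.05 * t + np.sin(2 * np.pi * t / 12) + 0.1 * np.random.randn(96)
trend, detail = frequency_decouple(x)
print(trend.shape, detail.shape, np.allclose(trend + detail, x))
```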

[NLP-47] A New Era in Computational Pathology: A Survey on Foundation and Vision-Language Models
[NLP-47] 计算病理学的新时代:基础和视觉语言模型调查

链接: https://arxiv.org/abs/2408.14496
作者: Dibaloke Chanda,Milan Aryal,Nasim Yahya Soltani,Masoud Ganji
关键词-EN: integrating foundation models, existing deep learning, deep learning approaches, deep learning, decision-making process
关键词-ZH: 集成基础模型、现有深度学习、深度学习方法、深度学习、决策流程
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Image and Video Processing (eess.IV)
备注: Initial Version

点击查看摘要

Abstract:Recent advances in deep learning have completely transformed the domain of computational pathology (CPath), which in turn altered the diagnostic workflow of pathologists by integrating foundation models (FMs) and vision-language models (VLMs) in their assessment and decision-making process. FMs overcome the limitations of existing deep learning approaches in CPath by learning a representation space that can be adapted to a wide variety of downstream tasks without explicit supervision. VLMs allow pathology reports written in natural language to be used as a rich semantic information source to improve existing models as well as generate predictions in natural language form. In this survey, a holistic and systematic overview of recent innovations in FMs and VLMs in CPath is presented. Furthermore, the tools, datasets and training schemes for these models are summarized in addition to categorizing them into distinct groups. This extensive survey highlights the current trends in CPath and the way it is going to be transformed through FMs and VLMs in the future.
摘要:深度学习的最新进展彻底改变了计算病理学(CPATH)的领域,通过在评估和决策过程中整合基础模型(FM)和视觉语言模型(VLM),进而改变了病理学家的诊断工作流程。FMS克服了CPATH中现有深度学习方法的局限性,通过学习一个表示空间,该空间可以在没有显式监督的情况下适应各种下游任务。VLM允许以自然语言编写的病理报告被用作丰富的语义信息源,以改进现有的模型并以自然语言的形式生成预测。在这项调查中,全面和系统地概述了CPATH在FM和VLM方面的最新创新。此外,还总结了这些模型的工具、数据集和训练方案,并将它们归类为不同的组。这项广泛的调查突出了CPATH的当前趋势以及未来通过FM和VLM进行改造的方式。

[NLP-48] Agentic Retrieval-Augmented Generation for Time Series Analysis KDD2024
[NLP-48] 用于时间序列分析的代理式检索增强生成

链接: https://arxiv.org/abs/2408.14484
作者: Chidaksh Ravuru,Sagar Srinivas Sakhinana,Venkataramana Runkana
关键词-EN: predict task-specific outcomes, complex spatio-temporal dependencies, Time series modeling, Time series, time series tasks
关键词-ZH: 预测特定任务的结果、复杂的时空依赖性、时间序列建模、时间序列、时间序列任务
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Paper was accepted for Undergraduate Consortium at ACM KDD, 2024. Please find the link: this https URL

点击查看摘要

Abstract:Time series modeling is crucial for many applications, however, it faces challenges such as complex spatio-temporal dependencies and distribution shifts in learning from historical context to predict task-specific outcomes. To address these challenges, we propose a novel approach using an agentic Retrieval-Augmented Generation (RAG) framework for time series analysis. The framework leverages a hierarchical, multi-agent architecture where the master agent orchestrates specialized sub-agents and delegates the end-user request to the relevant sub-agent. The sub-agents utilize smaller, pre-trained language models (SLMs) customized for specific time series tasks through fine-tuning using instruction tuning and direct preference optimization, and retrieve relevant prompts from a shared repository of prompt pools containing distilled knowledge about historical patterns and trends to improve predictions on new data. Our proposed modular, multi-agent RAG approach offers flexibility and achieves state-of-the-art performance across major time series tasks by tackling complex challenges more effectively than task-specific customized methods across benchmark datasets.
摘要:时间序列建模对于许多应用来说是至关重要的,但它在从历史背景中学习以预测特定任务的结果方面面临着复杂的时空依赖和分布变化等挑战。为了应对这些挑战,我们提出了一种新的方法,使用代理检索-增强生成(RAG)框架来进行时间序列分析。该框架利用分层的多代理体系结构,其中主代理协调专用子代理,并将最终用户请求委托给相关子代理。子代理通过使用指令调整和直接偏好优化进行微调,利用为特定时间序列任务定制的较小的预先训练的语言模型(SLM),并从包含关于历史模式和趋势的提取知识的提示池的共享储存库中检索相关提示,以改进对新数据的预测。我们建议的模块化、多代理RAG方法提供了灵活性,并通过比基准数据集上的特定于任务的定制方法更有效地应对复杂挑战,在主要时间序列任务中实现了最先进的性能。
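
下面用几行Python勾勒"主代理按任务路由到子代理、子代理先查共享提示池再作答"的层次结构;接口与字段均为示意性假设,并非论文实现。

```python
# 最小示意:主代理按任务类型把请求分派给挂载小模型(SLM)的子代理,
# 子代理先从共享提示池检索相关提示再作答。
def make_sub_agent(slm, prompt_pool):
    def run(request):
        hints = [p for p in prompt_pool if p["task"] == request["task"]]
        context = "\n".join(h["prompt"] for h in hints)
        return slm(f"{context}\n\nRequest: {request['query']}")
    return run

def master_agent(request, sub_agents):
    return sub_agents[request["task"]](request)     # 按任务路由到相关子代理

pool = [{"task": "forecast", "prompt": "Recall weekly seasonality patterns."}]
agents = {"forecast": make_sub_agent(lambda p: f"[forecast answer for]\n{p}", pool)}
print(master_agent({"task": "forecast", "query": "next-week traffic"}, agents))
```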

[NLP-49] Infusing Acoustic Pause Context into Text-Based Dementia Assessment INTERSPEECH2024
[NLP-49] 将声学停顿上下文融入基于文本的痴呆症评估中

链接: https://arxiv.org/abs/2408.15188
作者: Franziska Braun,Sebastian P. Bayerl,Florian Hönig,Hartmut Lehfeld,Thomas Hillemacher,Tobias Bocklet,Korbinian Riedhammer
关键词-EN: alongside content, content and structure, offer a valuable, valuable and non-invasive, non-invasive biomarker
关键词-ZH: 除了内容、内容和结构之外,还提供了有价值、有价值且非侵入性、非侵入性的生物标志物
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
备注: Accepted at INTERSPEECH 2024

点击查看摘要

Abstract:Speech pauses, alongside content and structure, offer a valuable and non-invasive biomarker for detecting dementia. This work investigates the use of pause-enriched transcripts in transformer-based language models to differentiate the cognitive states of subjects with no cognitive impairment, mild cognitive impairment, and Alzheimer’s dementia based on their speech from a clinical assessment. We address three binary classification tasks: Onset, monitoring, and dementia exclusion. The performance is evaluated through experiments on a German Verbal Fluency Test and a Picture Description Test, comparing the model’s effectiveness across different speech production contexts. Starting from a textual baseline, we investigate the effect of incorporation of pause information and acoustic context. We show the test should be chosen depending on the task, and similarly, lexical pause information and acoustic cross-attention contribute differently.
摘要:言语停顿以及内容和结构为检测痴呆症提供了有价值的非侵入性生物标志物。这项工作研究了在基于转换器的语言模型中使用停顿丰富的转录本,根据言语与临床评估区分无认知障碍、轻度认知障碍和阿尔茨海默氏痴呆的受试者的认知状态。我们处理三个二元分类任务:发病、监测和痴呆症排除。通过德语言语流利度测试和图片描述测试的实验来评估性能,比较模型在不同语音产生上下文中的有效性。我们从文本基线开始,研究暂停信息和声学背景的结合的影响。我们表明应该根据任务来选择测试,同样,词汇停顿信息和声学交叉注意的贡献不同。
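
下面的Python片段示意如何把停顿信息以特殊标记的形式注入转写文本,供基于Transformer的文本模型使用;标记名与时长阈值均为假设。

```python
# 最小示意:按停顿时长把 <pause> 标记插入转写文本。
def enrich_with_pauses(words, pause_after_sec, short=0.5, long=2.0):
    out = []
    for word, pause in zip(words, pause_after_sec):
        out.append(word)
        if pause >= long:
            out.append("<long_pause>")
        elif pause >= short:
            out.append("<pause>")
    return " ".join(out)

print(enrich_with_pauses(["the", "animal", "is", "a"], [0.1, 1.2, 0.2, 3.0]))
# the animal <pause> is a <long_pause>
```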

人工智能

[AI-0] he Mamba in the Llama: Distilling and Accelerating Hybrid Models

链接: https://arxiv.org/abs/2408.15237
作者: Junxiong Wang,Daniele Paliotta,Avner May,Alexander M. Rush,Tri Dao
关键词-EN: advantageous deployment characteristics, Linear RNN architectures, Linear RNN, language modeling, RNN architectures
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Code is open-sourced at this https URL

点击查看摘要

Abstract:Linear RNN architectures, like Mamba, can be competitive with Transformer models in language modeling while having advantageous deployment characteristics. Given the focus on training large-scale Transformer models, we consider the challenge of converting these pretrained models for deployment. We demonstrate that it is feasible to distill large Transformers into linear RNNs by reusing the linear projection weights from attention layers with academic GPU resources. The resulting hybrid model, which incorporates a quarter of the attention layers, achieves performance comparable to the original Transformer in chat benchmarks and outperforms open-source hybrid Mamba models trained from scratch with trillions of tokens in both chat benchmarks and general benchmarks. Moreover, we introduce a hardware-aware speculative decoding algorithm that accelerates the inference speed of Mamba and hybrid models. Overall we show how, with limited computation resources, we can remove many of the original attention layers and generate from the resulting model more efficiently. Our top-performing model, distilled from Llama3-8B-Instruct, achieves a 29.61 length-controlled win rate on AlpacaEval 2 against GPT-4 and 7.35 on MT-Bench, surpassing the best instruction-tuned linear RNN model.

[AI-1] Into the Unknown Unknowns: Engaged Human Learning through Participation in Language Model Agent Conversations

链接: https://arxiv.org/abs/2408.15232
作者: Yucheng Jiang,Yijia Shao,Dekun Ma,Sina J. Semnani,Monica S. Lam
关键词-EN: answering concrete queries, unknowns remains challenging, create Collaborative STORM, unknown unknowns remains, language model
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:While language model (LM)-powered chatbots and generative search engines excel at answering concrete queries, discovering information in the terrain of unknown unknowns remains challenging for users. To emulate the common educational scenario where children/students learn by listening to and participating in conversations of their parents/teachers, we create Collaborative STORM (Co-STORM). Unlike QA systems that require users to ask all the questions, Co-STORM lets users observe and occasionally steer the discourse among several LM agents. The agents ask questions on the user’s behalf, allowing the user to discover unknown unknowns serendipitously. To facilitate user interaction, Co-STORM assists users in tracking the discourse by organizing the uncovered information into a dynamic mind map, ultimately generating a comprehensive report as takeaways. For automatic evaluation, we construct the WildSeek dataset by collecting real information-seeking records with user goals. Co-STORM outperforms baseline methods on both discourse trace and report quality. In a further human evaluation, 70% of participants prefer Co-STORM over a search engine, and 78% favor it over a RAG chatbot.

[AI-2] Can Unconfident LLM Annotations Be Used for Confident Conclusions?

链接: https://arxiv.org/abs/2408.15204
作者: Kristina Gligorić,Tijana Zrnic,Cinoo Lee,Emmanuel J. Candès,Dan Jurafsky
关键词-EN: Large language models, shown high agreement, Large language, human data collection, LLM annotations
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) have shown high agreement with human raters across a variety of tasks, demonstrating potential to ease the challenges of human data collection. In computational social science (CSS), researchers are increasingly leveraging LLM annotations to complement slow and expensive human annotations. Still, guidelines for collecting and using LLM annotations, without compromising the validity of downstream conclusions, remain limited. We introduce Confidence-Driven Inference: a method that combines LLM annotations and LLM confidence indicators to strategically select which human annotations should be collected, with the goal of producing accurate statistical estimates and provably valid confidence intervals while reducing the number of human annotations needed. Our approach comes with safeguards against LLM annotations of poor quality, guaranteeing that the conclusions will be both valid and no less accurate than if we only relied on human annotations. We demonstrate the effectiveness of Confidence-Driven Inference over baselines in statistical estimation tasks across three CSS settings–text politeness, stance, and bias–reducing the needed number of human annotations by over 25% in each. Although we use CSS settings for demonstration, Confidence-Driven Inference can be used to estimate most standard quantities across a broad range of NLP problems.
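
下面用Python给出一个大幅简化的示意:按LLM置信度决定哪些样本交给人工、其余沿用LLM标注,再合并估计;该选取规则只体现思路,并不等同于论文中带有效置信区间的完整方法。

```python
# 最小示意:置信度驱动的标注分配与合并估计(阈值规则为简化假设)。
def confidence_driven_estimate(items, human_label, conf_threshold=0.8):
    labels = []
    for it in items:
        if it["llm_conf"] < conf_threshold:
            labels.append(human_label(it))        # 低置信度 -> 交给人工标注
        else:
            labels.append(it["llm_label"])        # 高置信度 -> 采用 LLM 标注
    return sum(labels) / len(labels)

items = [
    {"text": "thanks!", "llm_label": 1, "llm_conf": 0.95},
    {"text": "hmm",     "llm_label": 1, "llm_conf": 0.55},
]
print(confidence_driven_estimate(items, human_label=lambda it: 0))
```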

[AI-3] PoseWatch: A Transformer-based Architecture for Human-centric Video Anomaly Detection Using Spatio-temporal Pose Tokenization

链接: https://arxiv.org/abs/2408.15185
作者: Ghazal Alinezhad Noghre,Armin Danesh Pazho,Hamed Tabkhi
关键词-EN: Video Anomaly Detection, pose-based VAD, Video Anomaly, Anomaly Detection, VAD
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Video Anomaly Detection (VAD) presents a significant challenge in computer vision, particularly due to the unpredictable and infrequent nature of anomalous events, coupled with the diverse and dynamic environments in which they occur. Human-centric VAD, a specialized area within this domain, faces additional complexities, including variations in human behavior, potential biases in data, and substantial privacy concerns related to human subjects. These issues complicate the development of models that are both robust and generalizable. To address these challenges, recent advancements have focused on pose-based VAD, which leverages human pose as a high-level feature to mitigate privacy concerns, reduce appearance biases, and minimize background interference. In this paper, we introduce PoseWatch, a novel transformer-based architecture designed specifically for human-centric pose-based VAD. PoseWatch features an innovative Spatio-Temporal Pose and Relative Pose (ST-PRP) tokenization method that enhances the representation of human motion over time, which is also beneficial for broader human behavior analysis tasks. The architecture’s core, a Unified Encoder Twin Decoders (UETD) transformer, significantly improves the detection of anomalous behaviors in video data. Extensive evaluations across multiple benchmark datasets demonstrate that PoseWatch consistently outperforms existing methods, establishing a new state-of-the-art in pose-based VAD. This work not only demonstrates the efficacy of PoseWatch but also highlights the potential of integrating Natural Language Processing techniques with computer vision to advance human behavior analysis.

[AI-4] Evaluating the Energy Consumption of Machine Learning: Systematic Literature Review and Experiments

链接: https://arxiv.org/abs/2408.15128
作者: Charlotte Rodriguez,Laura Degioanni,Laetitia Kameni,Richard Vidal,Giovanni Neglia
关键词-EN: energy consumption, evaluate energy consumption, Machine Learning, energy, consumption
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注: 52 pages,

点击查看摘要

Abstract:Monitoring, understanding, and optimizing the energy consumption of Machine Learning (ML) are various reasons why it is necessary to evaluate the energy usage of ML. However, there exists no universal tool that can answer this question for all use cases, and there may even be disagreement on how to evaluate energy consumption for a specific use case. Tools and methods are based on different approaches, each with their own advantages and drawbacks, and they need to be mapped out and explained in order to select the most suitable one for a given situation. We address this challenge through two approaches. First, we conduct a systematic literature review of all tools and methods that permit to evaluate the energy consumption of ML (both at training and at inference), irrespective of whether they were originally designed for machine learning or general software. Second, we develop and use an experimental protocol to compare a selection of these tools and methods. The comparison is both qualitative and quantitative on a range of ML tasks of different nature (vision, language) and computational complexity. The systematic literature review serves as a comprehensive guide for understanding the array of tools and methods used in evaluating energy consumption of ML, for various use cases going from basic energy monitoring to consumption optimization. Two open-source repositories are provided for further exploration. The first one contains tools that can be used to replicate this work or extend the current review. The second repository houses the experimental protocol, allowing users to augment the protocol with new ML computing tasks and additional energy evaluation tools.

[AI-5] Aligning XAI with EU Regulations for Smart Biomedical Devices: A Methodology for Compliance Analysis ECAI2024

链接: https://arxiv.org/abs/2408.15121
作者: Francesco Sovrano,Michael Lognoul,Giulia Vilone
关键词-EN: integrating Artificial Intelligence, Artificial Intelligence, Significant investment, integrating Artificial, investment and development
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注: Accepted for publication at ECAI 2024, main-track

点击查看摘要

Abstract:Significant investment and development have gone into integrating Artificial Intelligence (AI) in medical and healthcare applications, leading to advanced control systems in medical technology. However, the opacity of AI systems raises concerns about essential characteristics needed in such sensitive applications, like transparency and trustworthiness. Our study addresses these concerns by investigating a process for selecting the most adequate Explainable AI (XAI) methods to comply with the explanation requirements of key EU regulations in the context of smart bioelectronics for medical devices. The adopted methodology starts with categorising smart devices by their control mechanisms (open-loop, closed-loop, and semi-closed-loop systems) and delving into their technology. Then, we analyse these regulations to define their explainability requirements for the various devices and related goals. Simultaneously, we classify XAI methods by their explanatory objectives. This allows for matching legal explainability requirements with XAI explanatory goals and determining the suitable XAI algorithms for achieving them. Our findings provide a nuanced understanding of which XAI algorithms align better with EU regulations for different types of medical devices. We demonstrate this through practical case studies on different neural implants, from chronic disease management to advanced prosthetics. This study fills a crucial gap in aligning XAI applications in bioelectronics with stringent provisions of EU regulations. It provides a practical framework for developers and researchers, ensuring their AI innovations advance healthcare technology and adhere to legal and ethical standards.

[AI-6] Urdu Digital Text Word Optical Character Recognition Using Permuted Auto Regressive Sequence Modeling

链接: https://arxiv.org/abs/2408.15119
作者: Ahmed Mustafa,Ijlal Baig,Hasan Sajid
关键词-EN: innovative word-level Optical, word-level Optical Character, Optical Character Recognition, word-level Optical, Optical Character
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This research paper introduces an innovative word-level Optical Character Recognition (OCR) model specifically designed for digital Urdu text recognition. Utilizing transformer-based architectures and attention mechanisms, the model was trained on a comprehensive dataset of approximately 160,000 Urdu text images, achieving a character error rate (CER) of 0.178, which highlights its superior accuracy in recognizing Urdu characters. The model’s strength lies in its unique architecture, incorporating the permuted autoregressive sequence (PARSeq) model, which allows for context-aware inference and iterative refinement by leveraging bidirectional context information to enhance recognition accuracy. Furthermore, its capability to handle a diverse range of Urdu text styles, fonts, and variations enhances its applicability in real-world scenarios. Despite its promising results, the model has some limitations, such as difficulty with blurred images, non-horizontal orientations, and overlays of patterns, lines, or other text, which can occasionally lead to suboptimal performance. Additionally, trailing or following punctuation marks can introduce noise into the recognition process. Addressing these challenges will be a focus of future research, aiming to refine the model further, explore data augmentation techniques, optimize hyperparameters, and integrate contextual improvements for more accurate and efficient Urdu text recognition.
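
The character error rate (CER) quoted above is conventionally defined as the character-level edit distance divided by the length of the reference transcription; a minimal Python sketch of that standard metric (not the authors' evaluation code) is shown below:

```python
def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = Levenshtein (edit) distance / number of reference characters."""
    m, n = len(reference), len(hypothesis)
    # dp[i][j] = edit distance between reference[:i] and hypothesis[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n] / max(m, 1)

print(character_error_rate("hello", "hellp"))  # one substitution -> 0.2
```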

[AI-7] Evaluating Stability of Unreflective Alignment

链接: https://arxiv.org/abs/2408.15116
作者: James Lucassen,Mark Henry,Philippa Wright,Owen Yeung
关键词-EN: designing alignment mechanisms, Counterfactual Priority Change, reflective stability, reflective stability problems, designing alignment
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Many theoretical obstacles to AI alignment are consequences of reflective stability - the problem of designing alignment mechanisms that the AI would not disable if given the option. However, problems stemming from reflective stability are not obviously present in current LLMs, leading to disagreement over whether they will need to be solved to enable safe delegation of cognitive labor. In this paper, we propose Counterfactual Priority Change (CPC) destabilization as a mechanism by which reflective stability problems may arise in future LLMs. We describe two risk factors for CPC-destabilization: 1) CPC-based stepping back and 2) preference instability. We develop preliminary evaluations for each of these risk factors, and apply them to frontier LLMs. Our findings indicate that in current LLMs, increased scale and capability are associated with increases in both CPC-based stepping back and preference instability, suggesting that CPC-destabilization may cause reflective stability problems in future LLMs.

[AI-8] Few-Shot Unsupervised Implicit Neural Shape Representation Learning with Spatial Adversaries ICML2024

链接: https://arxiv.org/abs/2408.15114
作者: Amine Ouasfi,Adnane Boukhayma
关键词-EN: Implicit Neural Representations, Neural Signed Distance, Implicit Neural, Signed Distance Functions, complex data modalities
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
*备注: ICML 2024

点击查看摘要

Abstract:Implicit Neural Representations have gained prominence as a powerful framework for capturing complex data modalities, encompassing a wide range from 3D shapes to images and audio. Within the realm of 3D shape representation, Neural Signed Distance Functions (SDF) have demonstrated remarkable potential in faithfully encoding intricate shape geometry. However, learning SDFs from sparse 3D point clouds in the absence of ground truth supervision remains a very challenging task. While recent methods rely on smoothness priors to regularize the learning, our method introduces a regularization term that leverages adversarial samples around the shape to improve the learned SDFs. Through extensive experiments and evaluations, we illustrate the efficacy of our proposed method, highlighting its capacity to improve SDF learning with respect to baselines and the state-of-the-art using synthetic and real data.

[AI-9] MTMamba: Enhancing Multi-Task Dense Scene Understanding via Mamba-Based Decoders

链接: https://arxiv.org/abs/2408.15101
作者: Baijiong Lin,Weisen Jiang,Pengguang Chen,Shu Liu,Ying-Cong Chen
关键词-EN: multiple dense prediction, multi-task dense prediction, Multi-task dense scene, application scenarios, Multi-task dense
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: arXiv admin note: text overlap with arXiv:2407.02228

点击查看摘要

Abstract:Multi-task dense scene understanding, which trains a model for multiple dense prediction tasks, has a wide range of application scenarios. Capturing long-range dependency and enhancing cross-task interactions are crucial to multi-task dense prediction. In this paper, we propose MTMamba++, a novel architecture for multi-task scene understanding featuring a Mamba-based decoder. It contains two types of core blocks: the self-task Mamba (STM) block and the cross-task Mamba (CTM) block. STM handles long-range dependency by leveraging state-space models, while CTM explicitly models task interactions to facilitate information exchange across tasks. We design two types of CTM block, namely F-CTM and S-CTM, to enhance cross-task interaction from the feature and semantic perspectives, respectively. Experiments on NYUDv2, PASCAL-Context, and Cityscapes datasets demonstrate the superior performance of MTMamba++ over CNN-based and Transformer-based methods. The code is available at this https URL.

[AI-10] No Regrets: Investigating and Improving Regret Approximations for Curriculum Discovery

链接: https://arxiv.org/abs/2408.15099
作者: Alexander Rutherford,Michael Beukman,Timon Willi,Bruno Lacerda,Nick Hawes,Jakob Foerster
关键词-EN: Unsupervised Environment Design, improve downstream performance, reinforcement learning, downstream performance, topical question
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:What data or environments to use for training to improve downstream performance is a longstanding and very topical question in reinforcement learning. In particular, Unsupervised Environment Design (UED) methods have gained recent attention as their adaptive curricula enable agents to be robust to in- and out-of-distribution tasks. We ask to what extent these methods are themselves robust when applied to a novel setting, closely inspired by a real-world robotics problem. Surprisingly, we find that the state-of-the-art UED methods either do not improve upon the naïve baseline of Domain Randomisation (DR), or require substantial hyperparameter tuning to do so. Our analysis shows that this is due to their underlying scoring functions failing to predict intuitive measures of "learnability", i.e., finding the settings that the agent sometimes solves, but not always. Based on this, we instead directly train on levels with high learnability and find that this simple and intuitive approach outperforms UED methods and DR in several binary-outcome environments, including on our domain and the standard UED domain of Minigrid. We further introduce a new adversarial evaluation procedure for directly measuring robustness, closely mirroring the conditional value at risk (CVaR). We open-source all our code and present visualisations of final policies here: this https URL.
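
The "learnability" idea above (levels the agent sometimes solves, but not always) can be made concrete with a simple proxy such as p(1-p), where p is the per-level success rate; this proxy is an illustrative assumption, not necessarily the paper's exact scoring function. A minimal Python sketch:

```python
import numpy as np

def learnability(success_rates: np.ndarray) -> np.ndarray:
    """Proxy score p * (1 - p): zero for levels the agent always or never
    solves, and maximal at p = 0.5 (solved sometimes, but not always)."""
    return success_rates * (1.0 - success_rates)

def select_levels(success_rates: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k levels with the highest learnability score."""
    return np.argsort(-learnability(success_rates))[:k]

rates = np.array([0.0, 0.1, 0.5, 0.9, 1.0])
print(select_levels(rates, 2))  # the p=0.5 level first, then one of the 0.09 ties
```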

[AI-11] Post-processing fairness with minimal changes

链接: https://arxiv.org/abs/2408.15096
作者: Federico Di Gennaro,Thibault Laugel,Vincent Grari,Xavier Renard,Marcin Detyniecki
关键词-EN: test time, require the sensitive, sensitive attribute, attribute at test, post-processing algorithm
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In this paper, we introduce a novel post-processing algorithm that is both model-agnostic and does not require the sensitive attribute at test time. In addition, our algorithm is explicitly designed to enforce minimal changes between biased and debiased predictions; a property that, while highly desirable, is rarely prioritized as an explicit objective in the fairness literature. Our approach leverages a multiplicative factor applied to the logit value of probability scores produced by a black-box classifier. We demonstrate the efficacy of our method through empirical evaluations, comparing its performance against four other debiasing algorithms on two widely used datasets in fairness research.
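
The core transformation described above, a multiplicative factor applied in logit space to a black-box classifier's scores, takes only a few lines; how the factor is fit (per group, under which fairness criterion) is left abstract here as an assumption. A minimal sketch:

```python
import numpy as np

def logit(p: np.ndarray) -> np.ndarray:
    return np.log(p / (1.0 - p))

def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))

def debias(scores: np.ndarray, w: float) -> np.ndarray:
    """Rescale black-box probability scores by a multiplicative factor w
    in logit space; in practice w would be chosen to satisfy a fairness
    criterion while keeping debiased predictions close to the originals."""
    p = np.clip(scores, 1e-6, 1.0 - 1e-6)  # avoid infinite logits
    return sigmoid(w * logit(p))

print(debias(np.array([0.2, 0.5, 0.9]), w=0.8))  # w < 1 pulls scores toward 0.5
```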

[AI-12] BaichuanSEED: Sharing the Potential of ExtensivE Data Collection and Deduplication by Introducing a Competitive Large Language Model Baseline

链接: https://arxiv.org/abs/2408.15079
作者: Guosheng Dong,Da Pan,Yiding Sun,Shusen Zhang,Zheng Liang,Xin Wu,Yanjun Shen,Fan Yang,Haoze Sun,Tianpeng Li,Mingan Lin,Jianhua Xu,Yufan Zhang,Xiaonan Nie,Lei Su,Bingning Wang,Wentao Zhang,Jiaxin Mao,Zenan Zhou,Weipeng Chen
关键词-EN: extensive pretraining datasets, Large Language Models, highly rely, pretraining datasets, data processing pipeline
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 19 pages, 6 figures

点击查看摘要

Abstract:The general capabilities of Large Language Models (LLM) highly rely on the composition and selection of extensive pretraining datasets, treated as commercial secrets by several institutions. To mitigate this issue, we open-source the details of a universally applicable data processing pipeline and validate its effectiveness and potential by introducing a competitive LLM baseline. Specifically, the data processing pipeline consists of broad collection to scale up and reweighting to improve quality. We then pretrain a 7B model BaichuanSEED with 3T tokens processed by our pipeline without any deliberate downstream task-related optimization, followed by an easy but effective supervised fine-tuning stage. BaichuanSEED demonstrates consistency and predictability throughout training and achieves comparable performance on comprehensive benchmarks with several commercial advanced large language models, such as Qwen1.5 and Llama3. We also conduct several heuristic experiments to discuss the potential for further optimization of downstream tasks, such as mathematics and coding.

[AI-13] MMASD: A Novel Dataset for Privacy-Preserving Behavior Analysis of Children with Autism Spectrum Disorder

链接: https://arxiv.org/abs/2408.15077
作者: Pavan Uttej Ravva,Behdokht Kiafar,Pinar Kullu,Jicheng Li,Anjana Bhat,Roghayeh Leila Barmaki
关键词-EN: comprehending communication signals, Autism spectrum disorder, spectrum disorder, communication signals, social interaction
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Autism spectrum disorder (ASD) is characterized by significant challenges in social interaction and comprehending communication signals. Recently, therapeutic interventions for ASD have increasingly utilized deep learning-powered computer vision techniques to monitor individual progress over time. These models are trained on private, non-public datasets from the autism community, creating challenges in comparing results across different models due to privacy-preserving data-sharing issues. This work introduces MMASD+, which consists of diverse data modalities, including 3D-Skeleton, 3D Body Mesh, and Optical Flow data. It integrates the capabilities of Yolov8 and Deep SORT algorithms to distinguish between the therapist and children, addressing a significant barrier in the original dataset. Additionally, a Multimodal Transformer framework is proposed to predict 11 action types and the presence of ASD. This framework achieves an accuracy of 95.03% for predicting action types and 96.42% for predicting ASD presence, demonstrating over a 10% improvement compared to models trained on single data modalities. These findings highlight the advantages of integrating multiple data modalities within the Multimodal Transformer framework.

[AI-14] MiWaves Reinforcement Learning Algorithm

链接: https://arxiv.org/abs/2408.15076
作者: Susobhan Ghosh,Yongyi Guo,Pei-Yao Hung,Lara Coughlin,Erin Bonar,Inbal Nahum-Shani,Maureen Walton,Susan Murphy
关键词-EN: health challenge globally, significant public health, public health challenge, challenge globally, escalating prevalence
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: arXiv admin note: substantial text overlap with arXiv:2402.17739

点击查看摘要

Abstract:The escalating prevalence of cannabis use poses a significant public health challenge globally. In the U.S., cannabis use is more prevalent among emerging adults (EAs) (ages 18-25) than any other age group, with legalization in multiple states contributing to a public perception that cannabis is less risky than in prior decades. To address this growing concern, we developed MiWaves, a reinforcement learning (RL) algorithm designed to optimize the delivery of personalized intervention prompts to reduce cannabis use among EAs. MiWaves leverages domain expertise and prior data to tailor the likelihood of delivery of intervention messages. This paper presents a comprehensive overview of the algorithm’s design, including key decisions and experimental outcomes. The finalized MiWaves RL algorithm was deployed in a clinical trial from March to May 2024.

[AI-15] Interactive dense pixel visualizations for time series and model attribution explanations

链接: https://arxiv.org/abs/2408.15073
作者: Udo Schlegel,Daniel A. Keim
关键词-EN: Explainable Artificial Intelligence, Artificial Intelligence, Explainable Artificial, offering numerous techniques, Deep Neural Network
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: 5 pages, 2 figures, accepted at MLVIS 2023

点击查看摘要

Abstract:The field of Explainable Artificial Intelligence (XAI) for Deep Neural Network models has developed significantly, offering numerous techniques to extract explanations from models. However, evaluating explanations is often not trivial, and differences in applied metrics can be subtle, especially with non-intelligible data. Thus, there is a need for visualizations tailored to explore explanations for domains with such data, e.g., time series. We propose DAVOTS, an interactive visual analytics approach to explore raw time series data, activations of neural networks, and attributions in a dense-pixel visualization to gain insights into the data, models’ decisions, and explanations. To further support users in exploring large datasets, we apply clustering approaches to the visualized data domains to highlight groups and present ordering strategies for individual and combined data exploration to facilitate finding patterns. We visualize a CNN trained on the FordA dataset to demonstrate the approach.

[AI-16] Causal Rule Forest: Toward Interpretable and Precise Treatment Effect Estimation

链接: https://arxiv.org/abs/2408.15055
作者: Chan Hsu,Jun-Ting Wu,Yihuang Kang
关键词-EN: Heterogeneous Treatment Effects, Average Treatment Effects, Conditional Average Treatment, inferencing Heterogeneous Treatment, Conditional Average
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: The 25th IEEE International Conference on Information Reuse and Integration for Data Science (IRI 2024)

点击查看摘要

Abstract:Understanding and inferring Heterogeneous Treatment Effects (HTE) and Conditional Average Treatment Effects (CATE) are vital for developing personalized treatment recommendations. Many state-of-the-art approaches achieve inspiring performance in estimating HTE on benchmark datasets or simulation studies. However, the indirect predicting manner and complex model architecture reduce the interpretability of these approaches. To mitigate the gap between predictive performance and heterogeneity interpretability, we introduce the Causal Rule Forest (CRF), a novel approach to learning hidden patterns from data and transforming the patterns into interpretable multi-level Boolean rules. By training other interpretable causal inference models with the data representation learned by CRF, we can reduce the predictive errors of these models in estimating HTE and CATE, while keeping their interpretability for identifying subgroups for which a treatment is more effective. Our experiments underscore the potential of CRF to advance personalized interventions and policies, paving the way for future research to enhance its scalability and application across complex causal inference challenges.

[AI-17] Earth Observation Satellite Scheduling with Graph Neural Networks

链接: https://arxiv.org/abs/2408.15041
作者: Antoine Jacquet,Guillaume Infantes,Nicolas Meuleau,Emmanuel Benazera,Stéphanie Roussel,Vincent Baudoui,Jonathan Guerra
关键词-EN: Observation Satellite Planning, Earth Observation Satellite, considerable practical interest, Satellite Planning, difficult optimization problem
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: Accepted at 17th European Workshop on Reinforcement Learning (EWRL 2024)

点击查看摘要

Abstract:The Earth Observation Satellite Planning (EOSP) is a difficult optimization problem with considerable practical interest. A set of requested observations must be scheduled on an agile Earth observation satellite while respecting constraints on their visibility window, as well as maneuver constraints that impose varying delays between successive observations. In addition, the problem is largely oversubscribed: there are many more candidate observations than can possibly be achieved. Therefore, one must select the set of observations that will be performed while maximizing their weighted cumulative benefit, and propose a feasible schedule for these observations. As previous work mostly focused on heuristic and iterative search algorithms, this paper presents a new technique for selecting and scheduling observations based on Graph Neural Networks (GNNs) and Deep Reinforcement Learning (DRL). GNNs are used to extract relevant information from the graphs representing instances of the EOSP, and DRL drives the search for optimal schedules. Our simulations show that it is able to learn on small problem instances and generalize to larger real-world instances, with very competitive performance compared to traditional approaches.

[AI-18] Evidence-Enhanced Triplet Generation Framework for Hallucination Alleviation in Generative Question Answering

链接: https://arxiv.org/abs/2408.15037
作者: Haowei Du,Huishuai Zhang,Dongyan Zhao
关键词-EN: generative question answering, evidence-enhanced triplet generation, generative question, triplet generation framework, question answering
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:To address hallucination in generative question answering (GQA), where the answer cannot be derived from the document, we propose a novel evidence-enhanced triplet generation framework, EATQA, encouraging the model to predict all the combinations of the (Question, Evidence, Answer) triplet by flipping the source pair and the target label to understand their logical relationships, i.e., predicting the Answer (A), Question (Q), and Evidence (E) given QE, EA, and QA pairs, respectively. Furthermore, we bridge the distribution gap to distill the knowledge from evidence in the inference stage. Our framework ensures that the model learns the logical relations between query, evidence and answer, which simultaneously improves evidence generation and query answering. In this paper, we apply EATQA to LLama and it outperforms other LLM-based methods and hallucination mitigation approaches on two challenging GQA benchmarks. Further analysis shows that our method not only keeps prior knowledge within the LLM, but also mitigates hallucination and generates faithful answers.
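
The "flipping" of source pair and target label described above amounts to expanding each (Q, E, A) triple into three training instances. The sketch below illustrates the idea; the prompt format and field names are assumptions for illustration, not the paper's templates:

```python
def triplet_instances(question: str, evidence: str, answer: str):
    """Expand one (Q, E, A) triple into the three prediction tasks the
    abstract describes: A from (Q, E), Q from (E, A), and E from (Q, A)."""
    return [
        {"source": f"question: {question} evidence: {evidence}", "target": answer},
        {"source": f"evidence: {evidence} answer: {answer}", "target": question},
        {"source": f"question: {question} answer: {answer}", "target": evidence},
    ]

for ex in triplet_instances(
    "Who wrote Hamlet?",
    "Hamlet is a tragedy written by William Shakespeare.",
    "William Shakespeare",
):
    print(ex["source"], "->", ex["target"])
```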

[AI-19] Mamba2MIL: State Space Duality Based Multiple Instance Learning for Computational Pathology

链接: https://arxiv.org/abs/2408.15032
作者: Yuqi Zhang,Xiaoqian Zhang,Jiakai Wang,Yuancheng Yang,Taiying Peng,Chao Tong
关键词-EN: Computational pathology, Multiple Instance Learning, Convolutional Neural Networks, significantly advanced, advanced the clinical
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Computational pathology (CPath) has significantly advanced the clinical practice of pathology. Despite the progress made, Multiple Instance Learning (MIL), a promising paradigm within CPath, continues to face challenges, particularly related to incomplete information utilization. Existing frameworks, such as those based on Convolutional Neural Networks (CNNs), attention, and the selective scan state space sequential model (SSM), lack sufficient flexibility and scalability to fuse diverse features effectively. Additionally, current approaches do not adequately exploit order-related and order-independent features, resulting in suboptimal utilization of sequence information. To address these limitations, we propose a novel MIL framework called Mamba2MIL. Our framework utilizes the state space duality model (SSD) to model long sequences of patches of whole slide images (WSIs), which, combined with weighted feature selection, supports the fusion processing of more branching features and can be extended according to specific application needs. Moreover, we introduce a sequence transformation method tailored to varying WSI sizes, which enhances sequence-independent features while preserving local sequence information, thereby improving sequence information utilization. Extensive experiments across multiple datasets demonstrate that Mamba2MIL surpasses state-of-the-art MIL methods, achieving improvements in nearly all performance metrics. Specifically, on the NSCLC dataset, Mamba2MIL achieves a binary tumor classification AUC of 0.9533 and an accuracy of 0.8794. On the BRACS dataset, it achieves a multiclass classification AUC of 0.7986 and an accuracy of 0.4981. The code is available at this https URL.

[AI-20] Sequence-aware Pre-training for Echocardiography Probe Guidance

链接: https://arxiv.org/abs/2408.15026
作者: Haojun Jiang,Zhenguo Sun,Yu Sun,Ning Jia,Meng Li,Shaqi Luo,Shiji Song,Gao Huang
关键词-EN: obtain high-quality sectional, high-quality sectional images, Cardiac ultrasound, pose to obtain, obtain high-quality
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Tech Report

点击查看摘要

Abstract:Cardiac ultrasound probe guidance aims to help novices adjust the 6-DOF probe pose to obtain high-quality sectional images. Cardiac ultrasound faces two major challenges: (1) the inherently complex structure of the heart, and (2) significant individual variations. Previous works have only learned the population-averaged 2D and 3D structures of the heart rather than personalized cardiac structural features, leading to a performance bottleneck. Clinically, we observed that sonographers adjust their understanding of a patient’s cardiac structure based on prior scanning sequences, thereby modifying their scanning strategies. Inspired by this, we propose a sequence-aware self-supervised pre-training method. Specifically, our approach learns personalized 2D and 3D cardiac structural features by predicting the masked-out images and actions in a scanning sequence. We hypothesize that if the model can predict the missing content, it has acquired a good understanding of the personalized cardiac structure. In the downstream probe guidance task, we also introduced a sequence modeling approach that models individual cardiac structural information based on the images and actions from historical scan data, enabling more accurate navigation decisions. Experiments on a large-scale dataset with 1.36 million samples demonstrated that our proposed sequence-aware paradigm can significantly reduce navigation errors, with translation errors decreasing by 15.90% to 36.87% and rotation errors decreasing by 11.13% to 20.77%, compared to state-of-the-art methods.

[AI-21] Cross-subject Brain Functional Connectivity Analysis for Multi-task Cognitive State Evaluation

链接: https://arxiv.org/abs/2408.15018
作者: Jun Chen,Anqi Chen,Bingkun Jiang,Mohammad S. Obaidat,Ni Li,Xinyu Zhang
关键词-EN: fundamental psychological essence, perception and processing, function of information, information perception, fundamental psychological
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Cognition refers to the function of information perception and processing, which is the fundamental psychological essence of human beings. It is responsible for reasoning and decision-making, while its evaluation is significant for the aviation domain in mitigating potential safety risks. Existing studies tend to use varied methods for cognitive state evaluation yet have limitations in timeliness, generalisation, and interpretability. Accordingly, this study adopts brain functional connectivity with electroencephalography signals to capture associations in brain regions across multiple subjects for evaluating real-time cognitive states. Specifically, a virtual reality-based flight platform with embedded multi-screen displays is constructed. Three distinctive cognitive tasks are designed and each has three degrees of difficulty. Thirty subjects are recruited for analysis and evaluation. The results are interpreted through different perspectives, including within-subject and cross-subject views of task-wise and gender-wise underlying brain functional connectivity. Additionally, this study incorporates questionnaire-based, task performance-based, and physiological measure-based approaches to fairly label the trials. A multi-class cognitive state evaluation is further conducted with the active brain connections. Benchmarking results demonstrate that the identified brain regions have considerable influences in cognition, with a multi-class accuracy rate of 95.83% surpassing existing studies. The derived findings bring significance to understanding the dynamic relationships among human brain functional regions, cross-subject cognitive behaviours, and decision-making, which have promising practical application values.

[AI-22] Flexible categorization using formal concept analysis and Dempster-Shafer theory

链接: https://arxiv.org/abs/2408.15012
作者: Marcel Boersma,Krishna Manoorkar,Alessandra Palmigiano,Mattia Panettiere,Apostolos Tzimoulis,Nachoem Wijnberg
关键词-EN: important part, financial accounts, bipartite graphs, business processes, weighted bipartite graphs
类目: Artificial Intelligence (cs.AI)
*备注: arXiv admin note: substantial text overlap with arXiv:2210.17330

点击查看摘要

Abstract:Categorization of business processes is an important part of auditing. Large amounts of transactional data in auditing can be represented as transactions between financial accounts using weighted bipartite graphs. We view such bipartite graphs as many-valued formal contexts, which we use to obtain explainable categorization of these business processes in terms of financial accounts involved in a business process by using methods in formal concept analysis. We use Dempster-Shafer mass functions to represent agendas showing different interests in different sets of financial accounts. We also model some possible deliberation scenarios between agents with different interrogative agendas to reach an aggregated agenda and categorization. The framework developed in this paper provides a formal ground to obtain and study explainable categorizations from the data represented as bipartite graphs according to the agendas of different agents in an organization (e.g. an audit firm), and interaction between these through deliberation. We use this framework to describe a machine-learning meta-algorithm for outlier detection and classification which can provide local and global explanations of its result and demonstrate it through an outlier detection algorithm.

[AI-23] Prior-free Balanced Replay: Uncertainty-guided Reservoir Sampling for Long-Tailed Continual Learning

链接: https://arxiv.org/abs/2408.14976
作者: Lei Liu,Li Liu,Yawen Cui
关键词-EN: continual data stream, Long-Tailed Continual Learning, continual learning, data stream exhibits, continual data
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Even in the era of large models, one of the well-known issues in continual learning (CL) is catastrophic forgetting, which is significantly challenging when the continual data stream exhibits a long-tailed distribution, termed as Long-Tailed Continual Learning (LTCL). Existing LTCL solutions generally require the label distribution of the data stream to achieve re-balance training. However, obtaining such prior information is often infeasible in real scenarios since the model should learn without pre-identifying the majority and minority classes. To this end, we propose a novel Prior-free Balanced Replay (PBR) framework to learn from long-tailed data stream with less forgetting. Concretely, motivated by our experimental finding that the minority classes are more likely to be forgotten due to the higher uncertainty, we newly design an uncertainty-guided reservoir sampling strategy to prioritize rehearsing minority data without using any prior information, which is based on the mutual dependence between the model and samples. Additionally, we incorporate two prior-free components to further reduce the forgetting issue: (1) Boundary constraint is to preserve uncertain boundary supporting samples for continually re-estimating task boundaries. (2) Prototype constraint is to maintain the consistency of learned class prototypes along with training. Our approach is evaluated on three standard long-tailed benchmarks, demonstrating superior performance to existing CL methods and previous SOTA LTCL approach in both task- and class-incremental learning settings, as well as ordered- and shuffled-LTCL settings.
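
The uncertainty-guided reservoir sampling described above could be realized with a weighted reservoir scheme such as Efraimidis-Spirakis A-Res, using each sample's uncertainty as its weight; this is a sketch under that assumption, not the paper's exact strategy:

```python
import heapq
import random

class UncertaintyReservoir:
    """Fixed-size buffer over a data stream; items with higher uncertainty
    (weight) are more likely to be retained (Efraimidis-Spirakis A-Res)."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.heap = []  # min-heap of (key, item); smallest key is evicted first

    def add(self, item, uncertainty: float):
        key = random.random() ** (1.0 / max(uncertainty, 1e-8))
        if len(self.heap) < self.capacity:
            heapq.heappush(self.heap, (key, item))
        elif key > self.heap[0][0]:
            heapq.heapreplace(self.heap, (key, item))

    def samples(self):
        return [item for _, item in self.heap]

buf = UncertaintyReservoir(capacity=3)
for i, u in enumerate([0.1, 0.9, 0.5, 0.95, 0.05]):
    buf.add(f"sample_{i}", u)
print(buf.samples())  # high-uncertainty samples dominate the buffer
```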

[AI-24] CVPT: Cross-Attention help Visual Prompt Tuning adapt visual task

链接: https://arxiv.org/abs/2408.14961
作者: Lingyun Huang,Jianxu Mao,Yaonan Wang,Junfei Yi,Ziming Tao
关键词-EN: demonstrating remarkable capabilities, models demonstrating remarkable, Visual Prompt Tuning, pre-trained models demonstrating, recent years
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In recent years, the rapid expansion of model sizes has led to large-scale pre-trained models demonstrating remarkable capabilities. Consequently, there has been a trend towards increasing the scale of models. However, this trend introduces significant challenges, including substantial computational costs of training and transfer to downstream tasks. To address these issues, Parameter-Efficient Fine-Tuning (PEFT) methods have been introduced. These methods optimize large-scale pre-trained models for specific tasks by fine-tuning a select group of parameters. Among these PEFT methods, adapter-based and prompt-based methods are the primary techniques. Specifically, in the field of visual fine-tuning, adapters gain prominence over prompts because of the latter’s relatively weaker performance and efficiency. Under the circumstances, we refine the widely-used Visual Prompt Tuning (VPT) method, proposing Cross Visual Prompt Tuning (CVPT). CVPT calculates cross-attention between the prompt tokens and the embedded tokens, which allows us to compute the semantic relationship between them and conduct the fine-tuning of models exactly to adapt visual tasks better. Furthermore, we introduce the weight-sharing mechanism to initialize the parameters of cross-attention, which avoids massive learnable parameters from cross-attention and enhances the representative capability of cross-attention. We conduct comprehensive testing across 25 datasets and the result indicates that CVPT significantly improves VPT’s performance and efficiency in visual tasks. For example, on the VTAB-1K benchmark, CVPT outperforms VPT over 4% in average accuracy, rivaling the advanced adapter-based methods in performance and efficiency. Our experiments confirm that prompt-based methods can achieve exceptional results in visual fine-tuning.
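
The central operation in CVPT as described above is cross-attention where prompt tokens query the embedded patch tokens. The following PyTorch sketch illustrates that wiring; the dimensions, head count, and module layout are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

class PromptCrossAttention(nn.Module):
    """Cross-attention in the spirit of the CVPT description: prompt tokens
    act as queries, patch-embedding tokens provide keys and values."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, prompts: torch.Tensor, patches: torch.Tensor) -> torch.Tensor:
        # prompts: (B, P, D), patches: (B, N, D)
        out, _ = self.attn(query=prompts, key=patches, value=patches)
        return out

layer = PromptCrossAttention(dim=768)
prompts = torch.randn(2, 10, 768)   # 10 learnable prompt tokens
patches = torch.randn(2, 196, 768)  # 14x14 patch embeddings
print(layer(prompts, patches).shape)  # torch.Size([2, 10, 768])
```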

[AI-25] Multilingual Arbitrage: Optimizing Data Pools to Accelerate Multilingual Progress

链接: https://arxiv.org/abs/2408.14960
作者: Ayomide Odumakinde,Daniel D’souza,Pat Verga,Beyza Ermis,Sara Hooker
关键词-EN: role in recent, played a critical, critical role, synthetic data, data has played
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The use of synthetic data has played a critical role in recent state-of-the-art breakthroughs. However, overly relying on a single oracle teacher model to generate data has been shown to lead to model collapse and invite propagation of biases. These limitations are particularly evident in multilingual settings, where the absence of a universally effective teacher model that excels across all languages presents significant challenges. In this work, we address these extreme differences by introducing “multilingual arbitrage”, which capitalizes on performance variations between multiple models for a given language. To do so, we strategically route samples through a diverse pool of models, each with unique strengths in different languages. Across exhaustive experiments on state-of-the-art models, our work suggests that arbitrage techniques allow for spectacular gains in performance that far outperform relying on a single teacher. In particular, compared to the best single teacher, we observe gains of up to 56.5% improvement in win rates averaged across all languages when switching to multilingual arbitrage. We observe the most significant gains for the least resourced languages in our pool.
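
The routing step described above, sending each sample to the teacher with the best estimated performance for its language, is simple to sketch; the model names, languages, and win-rate numbers below are entirely hypothetical:

```python
def route(sample_language: str, win_rates: dict) -> str:
    """Pick the teacher model with the highest estimated win rate for
    the sample's language (hypothetical names and scores)."""
    scores = {model: rates.get(sample_language, 0.0)
              for model, rates in win_rates.items()}
    return max(scores, key=scores.get)

win_rates = {
    "teacher_a": {"en": 0.71, "sw": 0.40},
    "teacher_b": {"en": 0.65, "sw": 0.62},
}
print(route("sw", win_rates))  # teacher_b: stronger on the low-resource language
```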

[AI-26] NeuralOOD: Improving Out-of-Distribution Generalization Performance with Brain-machine Fusion Learning Framework

链接: https://arxiv.org/abs/2408.14950
作者: Shuangchen Zhao,Changde Du,Hui Li,Huiguang He
关键词-EN: Deep Neural Networks, traditional computer vision, demonstrated exceptional recognition, exceptional recognition capabilities, Deep Neural
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Deep Neural Networks (DNNs) have demonstrated exceptional recognition capabilities in traditional computer vision (CV) tasks. However, existing CV models often suffer a significant decrease in accuracy when confronted with out-of-distribution (OOD) data. In contrast to these DNN models, humans can maintain a consistently low error rate when facing OOD scenes, partly attributed to the rich prior cognitive knowledge stored in the human brain. Previous OOD generalization research has focused only on the single modality, overlooking the advantages of multimodal learning methods. In this paper, we utilize the multimodal learning method to improve OOD generalization and propose a novel Brain-machine Fusion Learning (BMFL) framework. We adopt the cross-attention mechanism to fuse the visual knowledge from the CV model and prior cognitive knowledge from the human brain. Specifically, we employ a pre-trained visual neural encoding model to predict functional Magnetic Resonance Imaging (fMRI) responses from visual features, which eliminates the need for fMRI data collection and pre-processing, effectively reducing the workload associated with conventional BMFL methods. Furthermore, we construct a brain transformer to facilitate the extraction of knowledge inside the fMRI data. Moreover, we introduce the Pearson correlation coefficient maximization regularization method into the training process, which improves the fusion capability with better constraints. Our model outperforms the DINOv2 and baseline models on the ImageNet-1k validation dataset as well as six curated OOD datasets, showcasing its superior performance in diverse scenarios.
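
The Pearson correlation maximization regularizer mentioned above is simply the negative Pearson correlation coefficient used as a loss term; a minimal sketch of the statistic itself (in NumPy for clarity; the paper would compute it with differentiable tensors during training):

```python
import numpy as np

def pearson_loss(pred: np.ndarray, target: np.ndarray) -> float:
    """Negative Pearson correlation coefficient: minimizing this term
    maximizes the linear correlation between predictions and targets."""
    p = pred - pred.mean()
    t = target - target.mean()
    r = (p * t).sum() / (np.sqrt((p ** 2).sum()) * np.sqrt((t ** 2).sum()) + 1e-8)
    return -r

# Close to -1 when the prediction tracks the target almost linearly
print(pearson_loss(np.array([1.0, 2.0, 3.0]), np.array([1.1, 1.9, 3.2])))
```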

[AI-27] Quotient Normalized Maximum Likelihood Criterion for Learning Bayesian Network Structures AISTATS2018

链接: https://arxiv.org/abs/2408.14935
作者: Tomi Silander,Janne Leppä-aho,Elias Jääsaari,Teemu Roos
关键词-EN: Bayesian network structure, network structure learning, normalized maximum likelihood, Bayesian network, call quotient normalized
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted to AISTATS 2018

点击查看摘要

Abstract:We introduce an information theoretic criterion for Bayesian network structure learning which we call quotient normalized maximum likelihood (qNML). In contrast to the closely related factorized normalized maximum likelihood criterion, qNML satisfies the property of score equivalence. It is also decomposable and completely free of adjustable hyperparameters. For practical computations, we identify a remarkably accurate approximation proposed earlier by Szpankowski and Weinberger. Experiments on both simulated and real data demonstrate that the new criterion leads to parsimonious models with good predictive accuracy.
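
For readers unfamiliar with the quotient construction named above, the qNML score is, roughly, a sum over variables of log-ratios of normalized maximum likelihood (NML) distributions for each family versus its parent set; the notation below is a reconstruction from the abstract's description, not quoted from the paper:

```latex
% Sketch of the quotient NML (qNML) score for a DAG G over X_1, ..., X_n.
% Each term compares the NML distribution of the data columns for the
% family (X_i together with its parents Pa_i) with that of Pa_i alone.
s_{\mathrm{qNML}}(G; D) \;=\; \sum_{i=1}^{n} \log
  \frac{P_{\mathrm{NML}}\!\left(D_{X_i,\,\mathrm{Pa}_i}\right)}
       {P_{\mathrm{NML}}\!\left(D_{\mathrm{Pa}_i}\right)}
```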

[AI-28] Distance-Forward Learning: Enhancing the Forward-Forward Algorithm Towards High-Performance On-Chip Learning

链接: https://arxiv.org/abs/2408.14925
作者: Yujie Wu,Siyuan Xu,Jibin Wu,Lei Deng,Mingkun Xu,Qinghao Wen,Guoqi Li
关键词-EN: offering biological plausibility, parallelized computational benefits, highly parallelized computational, limitations of backpropagation, offering biological
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The Forward-Forward (FF) algorithm was recently proposed as a local learning method to address the limitations of backpropagation (BP), offering biological plausibility along with memory-efficient and highly parallelized computational benefits. However, it suffers from suboptimal performance and poor generalization, largely due to inadequate theoretical support and a lack of effective learning strategies. In this work, we reformulate FF using distance metric learning and propose a distance-forward algorithm (DF) to improve FF performance in supervised vision tasks while preserving its local computational properties, making it competitive for efficient on-chip learning. To achieve this, we reinterpret FF through the lens of centroid-based metric learning and develop a goodness-based N-pair margin loss to facilitate the learning of discriminative features. Furthermore, we integrate layer-collaboration local update strategies to reduce information loss caused by greedy local parameter updates. Our method surpasses existing FF models and other advanced local learning approaches, with accuracies of 99.7% on MNIST, 88.2% on CIFAR-10, 59% on CIFAR-100, 95.9% on SVHN, and 82.5% on ImageNette, respectively. Moreover, it achieves comparable performance with less than 40% memory cost compared to BP training, while exhibiting stronger robustness to multiple types of hardware-related noise, demonstrating its potential for online learning and energy-efficient computation on neuromorphic chips.

[AI-29] VHAKG: A Multi-modal Knowledge Graph Based on Synchronized Multi-view Videos of Daily Activities CIKM2024

链接: https://arxiv.org/abs/2408.14895
作者: Shusaku Egami,Takahiro Ugai,Ken Fukuda
关键词-EN: Multi-modal knowledge graphs, resources enabling knowledge, enabling knowledge processing, Multi-modal knowledge, non-symbolic data
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注: 5 pages,4 figures, accepted by CIKM2024 Resource Track

点击查看摘要

Abstract:Multi-modal knowledge graphs (MMKGs), which ground various non-symbolic data (e.g., images and videos) into symbols, have attracted attention as resources enabling knowledge processing and machine learning across modalities. However, the construction of MMKGs for videos consisting of multiple events, such as daily activities, is still in the early stages. In this paper, we construct an MMKG based on synchronized multi-view simulated videos of daily activities. Besides representing the content of daily life videos as event-centric knowledge, our MMKG also includes frame-by-frame fine-grained changes, such as bounding boxes within video frames. In addition, we provide support tools for querying our MMKG. As an application example, we demonstrate that our MMKG facilitates benchmarking vision-language models by providing the necessary vision-language datasets for a tailored task.

[AI-30] he VoxCeleb Speaker Recognition Challenge: A Retrospective

链接: https://arxiv.org/abs/2408.14886
作者: Jaesung Huh,Joon Son Chung,Arsha Nagrani,Andrew Brown,Jee-weon Jung,Daniel Garcia-Romero,Andrew Zisserman
关键词-EN: VoxCeleb Speaker Recognition, Speaker Recognition, Speaker Recognition Challenges, workshops that ran, ran annually
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注: TASLP 2024

点击查看摘要

Abstract:The VoxCeleb Speaker Recognition Challenges (VoxSRC) were a series of challenges and workshops that ran annually from 2019 to 2023. The challenges primarily evaluated the tasks of speaker recognition and diarisation under various settings including: closed and open training data; as well as supervised, self-supervised, and semi-supervised training for domain adaptation. The challenges also provided publicly available training and evaluation datasets for each task and setting, with new test sets released each year. In this paper, we provide a review of these challenges that covers: what they explored; the methods developed by the challenge participants and how these evolved; and also the current state of the field for speaker verification and diarisation. We chart the progress in performance over the five installments of the challenge on a common evaluation dataset and provide a detailed analysis of how each year’s special focus affected participants’ performance. This paper is aimed both at researchers who want an overview of the speaker recognition and diarisation field, and also at challenge organisers who want to benefit from the successes and avoid the mistakes of the VoxSRC challenges. We end with a discussion of the current strengths of the field and open challenges. Project page : this https URL

[AI-31] Adversarial Attacks and Defenses in Multivariate Time-Series Forecasting for Smart and Connected Infrastructures

链接: https://arxiv.org/abs/2408.14875
作者: Pooja Krishan,Rohan Mohapatra,Saptarshi Sengupta
关键词-EN: deep learning models, devices and infrastructures, Gradient Sign Method, Basic Iterative Method, emergence of deep
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Performance (cs.PF)
*备注: 17 pages, 32 figures

点击查看摘要

Abstract:The emergence of deep learning models has revolutionized various industries over the last decade, leading to a surge in connected devices and infrastructures. However, these models can be tricked into making incorrect predictions with high confidence, leading to disastrous failures and security concerns. To this end, we explore the impact of adversarial attacks on multivariate time-series forecasting and investigate methods to counter them. Specifically, we employ untargeted white-box attacks, namely the Fast Gradient Sign Method (FGSM) and the Basic Iterative Method (BIM), to poison the inputs to the training process, effectively misleading the model. We also illustrate the subtle modifications to the inputs after the attack, which makes detecting the attack using the naked eye quite difficult. Having demonstrated the feasibility of these attacks, we develop robust models through adversarial training and model hardening. We are among the first to showcase the transferability of these attacks and defenses by extrapolating our work from the benchmark electricity data to a larger, 10-year real-world data used for predicting the time-to-failure of hard disks. Our experimental results confirm that the attacks and defenses achieve the desired security thresholds, leading to a 72.41% and 94.81% decrease in RMSE for the electricity and hard disk datasets respectively after implementing the adversarial defenses.
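
FGSM and BIM are standard published attacks (Goodfellow et al., 2015; Kurakin et al., 2017), so their generic form can be shown directly; note this PyTorch sketch perturbs arbitrary inputs and omits the paper's specific time-series poisoning setup:

```python
import torch

def fgsm(model, x, y, loss_fn, eps: float):
    """Fast Gradient Sign Method: one step of size eps in the direction
    of the sign of the input gradient."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x_adv), y)
    loss.backward()
    return (x_adv + eps * x_adv.grad.sign()).detach()

def bim(model, x, y, loss_fn, eps: float, alpha: float, steps: int):
    """Basic Iterative Method: repeated FGSM steps of size alpha, with the
    total perturbation clipped back into an eps-ball around the input x."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv = fgsm(model, x_adv, y, loss_fn, alpha)
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)
    return x_adv
```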

[AI-32] Learning Robust Reward Machines from Noisy Labels KR2024

链接: https://arxiv.org/abs/2408.14871
作者: Roko Parac,Lorenzo Nodari,Leo Ardon,Daniel Furelos-Blanco,Federico Cerutti,Alessandra Russo
关键词-EN: paper presents PROB-IRM, noisy execution traces, paper presents, noisy traces, robust reward machines
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Preprint accepted for publication to the 21st International Conference on Principles of Knowledge Representation and Reasoning (KR 2024)

点击查看摘要

Abstract:This paper presents PROB-IRM, an approach that learns robust reward machines (RMs) for reinforcement learning (RL) agents from noisy execution traces. The key aspect of RM-driven RL is the exploitation of a finite-state machine that decomposes the agent’s task into different subtasks. PROB-IRM uses a state-of-the-art inductive logic programming framework robust to noisy examples to learn RMs from noisy traces using the Bayesian posterior degree of beliefs, thus ensuring robustness against inconsistencies. Pivotal for the results is the interleaving between RM learning and policy learning: a new RM is learned whenever the RL agent generates a trace that is believed not to be accepted by the current RM. To speed up the training of the RL agent, PROB-IRM employs a probabilistic formulation of reward shaping that uses the posterior Bayesian beliefs derived from the traces. Our experimental analysis shows that PROB-IRM can learn (potentially imperfect) RMs from noisy traces and exploit them to train an RL agent to solve its tasks successfully. Despite the complexity of learning the RM from noisy traces, agents trained with PROB-IRM perform comparably to agents provided with handcrafted RMs.

[AI-33] Enhancing Analogical Reasoning in the Abstraction and Reasoning Corpus via Model-Based RL IJCAI2024

链接: https://arxiv.org/abs/2408.14855
作者: Jihwan Lee,Woochang Sim,Sejin Kim,Sundong Kim
关键词-EN: Proximal Policy Optimization, model-based reinforcement learning, paper demonstrates, suitable approach, analogical reasoning
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
*备注: Accepted to IJCAI 2024 IARML Workshop

点击查看摘要

Abstract:This paper demonstrates that model-based reinforcement learning (model-based RL) is a suitable approach for the task of analogical reasoning. We hypothesize that model-based RL can solve analogical reasoning tasks more efficiently through the creation of internal models. To test this, we compared DreamerV3, a model-based RL method, with Proximal Policy Optimization, a model-free RL method, on the Abstraction and Reasoning Corpus (ARC) tasks. Our results indicate that model-based RL not only outperforms model-free RL in learning and generalizing from single tasks but also shows significant advantages in reasoning across similar tasks.

[AI-34] Detecting AI Flaws: Target-Driven Attacks on Internal Faults in Language Models

链接: https://arxiv.org/abs/2408.14853
作者: Yuhao Du,Zhuo Li,Pengyu Cheng,Xiang Wan,Anningzhe Gao
关键词-EN: Large Language Models, Large Language, rapidly evolving field, Language Models, artificial intelligence
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have become a focal point in the rapidly evolving field of artificial intelligence. However, a critical concern is the presence of toxic content within the pre-training corpus of these models, which can lead to the generation of inappropriate outputs. Investigating methods for detecting internal faults in LLMs can help us understand their limitations and improve their security. Existing methods primarily focus on jailbreaking attacks, which involve manually or automatically constructing adversarial content to prompt the target LLM to generate unexpected responses. These methods rely heavily on prompt engineering, which is time-consuming and usually requires specially designed questions. To address these challenges, this paper proposes a target-driven attack paradigm that focuses on directly eliciting the target response instead of optimizing the prompts. We introduce the use of another LLM as the detector for toxic content, referred to as ToxDet. Given a target toxic response, ToxDet can generate a possible question and a preliminary answer to provoke the target model into producing desired toxic responses with meanings equivalent to the provided one. ToxDet is trained by interacting with the target LLM and receiving reward signals from it, utilizing reinforcement learning for the optimization process. While the primary focus of the target models is on open-source LLMs, the fine-tuned ToxDet can also be transferred to attack black-box models such as GPT-4o, achieving notable results. Experimental results on AdvBench and HH-Harmless datasets demonstrate the effectiveness of our methods in detecting the tendencies of target LLMs to generate harmful responses. This algorithm not only exposes vulnerabilities but also provides a valuable resource for researchers to strengthen their models against such attacks.

[AI-35] Project SHADOW: Symbolic Higher-order Associative Deductive reasoning On Wikidata using LM probing

链接: https://arxiv.org/abs/2408.14849
作者: Hanna Abi Akl
关键词-EN: Wikidata triple completion, associative deductive reasoning, base construction task, fine-tuned language model, language model trained
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 6 pages, 1 figure

点击查看摘要

Abstract:We introduce SHADOW, a fine-tuned language model trained on an intermediate task using associative deductive reasoning, and measure its performance on a knowledge base construction task using Wikidata triple completion. We evaluate SHADOW on the LM-KBC 2024 challenge and show that it outperforms the baseline solution by 20% with an F1 score of 68.72%.

[AI-36] Diffusion based Semantic Outlier Generation via Nuisance Awareness for Out-of-Distribution Detection

链接: https://arxiv.org/abs/2408.14841
作者: Suhee Yoon,Sanghyu Yoon,Hankook Lee,Ye Seul Sim,Sungik Choi,Kyungeun Lee,Hye-Seung Cho,Woohyung Lim
关键词-EN: recently shown promising, shown promising results, synthetic OOD datasets, recently shown, shown promising
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Out-of-distribution (OOD) detection, which determines whether a given sample is part of the in-distribution (ID), has recently shown promising results through training with synthetic OOD datasets. Nonetheless, existing methods often produce outliers that are considerably distant from the ID, showing limited efficacy for capturing subtle distinctions between ID and OOD. To address these issues, we propose a novel framework, Semantic Outlier generation via Nuisance Awareness (SONA), which notably produces challenging outliers by directly leveraging pixel-space ID samples through diffusion models. Our approach incorporates SONA guidance, providing separate control over semantic and nuisance regions of ID samples. Thereby, the generated outliers achieve two crucial properties: (i) they present explicit semantic-discrepant information, while (ii) maintaining various levels of nuisance resemblance with ID. Furthermore, the improved OOD detector training with SONA outliers facilitates learning with a focus on semantic distinctions. Extensive experiments demonstrate the effectiveness of our framework, achieving an impressive AUROC of 88% on near-OOD datasets, which surpasses the performance of baseline methods by a significant margin of approximately 6%.

[AI-37] CL4KGE: A Curriculum Learning Method for Knowledge Graph Embedding

链接: https://arxiv.org/abs/2408.14840
作者: Yang Liu,Chuan Zhou,Peng Zhang,Yanan Cao,Yongchao Liu,Zhao Li,Hongyang Chen
关键词-EN: Knowledge graph embedding, crafting representations comprehensive, Knowledge graph, KGE models, graph embedding
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 16 pages, 3 figures

点击查看摘要

Abstract:Knowledge graph embedding (KGE) constitutes a foundational task, directed towards learning representations for entities and relations within knowledge graphs (KGs), with the objective of crafting representations comprehensive enough to approximate the logical and symbolic interconnections among entities. In this paper, we define a metric, Z-counts, to measure the difficulty of training each triple (head entity, relation, tail entity) in KGs, with theoretical analysis. Based on this metric, we propose CL4KGE, an efficient Curriculum Learning based training strategy for KGE. This method includes a difficulty measurer and a training scheduler that aids in the training of KGE models. Our approach possesses the flexibility to act as a plugin within a wide range of KGE models, with the added advantage of adaptability to the majority of KGs in existence. The proposed method has been evaluated on popular KGE models, and the results demonstrate that it enhances the state-of-the-art methods. The use of Z-counts as a metric has enabled the identification of challenging triples in KGs, which helps in devising effective training strategies.
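
The difficulty measurer plus training scheduler combination described above follows the usual curriculum learning template; the sketch below shows that template with the difficulty score (standing in for the paper's Z-counts) treated as a black box, so it is an illustration rather than the paper's algorithm:

```python
def curriculum_stages(triples, difficulty, num_stages: int):
    """Sort triples by a difficulty score and release them in stages from
    easiest to hardest; each stage trains on everything released so far."""
    ranked = sorted(triples, key=difficulty)
    stage_size = max(1, len(ranked) // num_stages)
    for stage in range(1, num_stages + 1):
        yield ranked[: stage_size * stage]

triples = [("a", "r", "b"), ("c", "r", "d"), ("e", "r", "f"), ("g", "r", "h")]
fake_difficulty = {t: i for i, t in enumerate(reversed(triples))}  # made-up scores
for batch in curriculum_stages(triples, fake_difficulty.get, num_stages=2):
    print(len(batch), "triples")  # 2 triples, then 4 triples
```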

[AI-38] Diffusion Models Are Real-Time Game Engines

链接: https://arxiv.org/abs/2408.14837
作者: Dani Valevski,Yaniv Leviathan,Moab Arar,Shlomi Fruchter
关键词-EN: game engine powered, enables real-time interaction, high quality, engine powered, real-time interaction
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Project page: this https URL

点击查看摘要

Abstract:We present GameNGen, the first game engine powered entirely by a neural model that enables real-time interaction with a complex environment over long trajectories at high quality. GameNGen can interactively simulate the classic game DOOM at over 20 frames per second on a single TPU. Next frame prediction achieves a PSNR of 29.4, comparable to lossy JPEG compression. Human raters are only slightly better than random chance at distinguishing short clips of the game from clips of the simulation. GameNGen is trained in two phases: (1) an RL-agent learns to play the game and the training sessions are recorded, and (2) a diffusion model is trained to produce the next frame, conditioned on the sequence of past frames and actions. Conditioning augmentations enable stable auto-regressive generation over long trajectories.
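
As context for the PSNR of 29.4 quoted above: peak signal-to-noise ratio is a standard fidelity measure derived from the mean squared error between a predicted and a reference frame. A standard computation (not the paper's evaluation code):

```python
import numpy as np

def psnr(reference: np.ndarray, prediction: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB; higher means the predicted frame
    is closer to the reference frame."""
    mse = np.mean((reference.astype(np.float64) - prediction.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

ref = np.random.randint(0, 256, (240, 320, 3))
noisy = np.clip(ref + np.random.normal(0, 5, ref.shape), 0, 255)
print(psnr(ref, noisy))  # roughly 34 dB for Gaussian noise with std 5
```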

[AI-39] Strategic Optimization and Challenges of Large Language Models in Object-Oriented Programming

链接: https://arxiv.org/abs/2408.14834
作者: Zinan Wang
关键词-EN: crafting individual functions, developing class-level method, integrates contextual information, emphasis has transitioned, transitioned from crafting
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: 10 pages

点击查看摘要

Abstract:In the area of code generation research, the emphasis has transitioned from crafting individual functions to developing class-level method code that integrates contextual information. This shift has brought several benchmarks such as ClassEval and CoderEval, which consider class-level contexts. Nevertheless, the influence of specific contextual factors at the method level remains less explored. This research focused on method-level code generation within the Object-Oriented Programming (OOP) framework. Based on CoderEval, we devised experiments that varied the extent of contextual information in the prompts, ranging from method-specific to project-wide details. We introduced the innovative metric of “Prompt-Token Cost-Effectiveness” to evaluate the economic viability of incorporating additional contextual layers. Our findings indicate that prompts enriched with method invocation details yield the highest cost-effectiveness. Additionally, our study revealed disparities among Large Language Models (LLMs) regarding error type distributions and the level of assistance they provide to developers. Notably, larger LLMs do not invariably perform better. We also observed that tasks with higher degrees of coupling present more substantial challenges, suggesting that the choice of LLM should be tailored to the task’s coupling degree. For example, GPT-4 exhibited improved performance in low-coupling scenarios, whereas GPT-3.5 seemed better suited for tasks with high coupling. By meticulously curating prompt content and selecting the appropriate LLM, developers can optimize code quality while maximizing cost-efficiency during the development process.

[AI-40] From Rule-Based Models to Deep Learning Transformers Architectures for Natural Language Processing and Sign Language Translation Systems: Survey Taxonomy and Performance Evaluation

链接: https://arxiv.org/abs/2408.14825
作者: Nada Shahin,Leila Ismail
关键词-EN: Hearing population worldwide, Deaf and Hard, Hard of Hearing, growing Deaf, Hearing population
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:With the growing Deaf and Hard of Hearing population worldwide and the persistent shortage of certified sign language interpreters, there is a pressing need for an efficient, signs-driven, integrated end-to-end translation system, from sign to gloss to text and vice-versa. There has been a wealth of research on machine translation and related reviews. However, there are few works on sign language machine translation considering the particularity of the language being continuous and dynamic. This paper aims to address this void, providing a retrospective analysis of the temporal evolution of sign language machine translation algorithms and a taxonomy of the Transformers architectures, the most used approach in language translation. We also present the requirements of a real-time Quality-of-Service sign language machine translation system underpinned by accurate deep learning algorithms. We propose future research directions for sign language translation systems.

[AI-41] A Comprehensive Benchmark of Machine and Deep Learning Across Diverse Tabular Datasets

链接: https://arxiv.org/abs/2408.14817
作者: Assaf Shmuel,Oren Glickman,Teddy Lazebnik
关键词-EN: Gradient Boosting Machines, Deep Learning, Machine Learning, highly prevalent, scientific research
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The analysis of tabular datasets is highly prevalent both in scientific research and real-world applications of Machine Learning (ML). Unlike many other ML tasks, Deep Learning (DL) models often do not outperform traditional methods in this area. Previous comparative benchmarks have shown that DL performance is frequently equivalent or even inferior to models such as Gradient Boosting Machines (GBMs). In this study, we introduce a comprehensive benchmark aimed at better characterizing the types of datasets where DL models excel. Although several important benchmarks for tabular datasets already exist, our contribution lies in the variety and depth of our comparison: we evaluate 111 datasets with 20 different models, including both regression and classification tasks. These datasets vary in scale and include both those with and without categorical variables. Importantly, our benchmark contains a sufficient number of datasets where DL models perform best, allowing for a thorough analysis of the conditions under which DL models excel. Building on the results of this benchmark, we train a model that predicts scenarios where DL models outperform alternative methods with 86.1% accuracy (AUC 0.78). We present insights derived from this characterization and compare these findings to previous benchmarks.

[AI-42] Brain-inspired Artificial Intelligence: A Comprehensive Review

链接: https://arxiv.org/abs/2408.14811
作者: Jing Ren,Feng Xia
关键词-EN: meticulous parameter tuning, optimization techniques, Current artificial intelligence, focus on enhancing, enhancing performance
类目: Artificial Intelligence (cs.AI)
*备注: 35 pages, 4 figures

点击查看摘要

Abstract:Current artificial intelligence (AI) models often focus on enhancing performance through meticulous parameter tuning and optimization techniques. However, the fundamental design principles behind these models receive comparatively less attention, which can limit our understanding of their potential and constraints. This comprehensive review explores the diverse design inspirations that have shaped modern AI models, i.e., brain-inspired artificial intelligence (BIAI). We present a classification framework that categorizes BIAI approaches into physical structure-inspired and human behavior-inspired models. We also examine the real-world applications where different BIAI models excel, highlighting their practical benefits and deployment challenges. By delving into these areas, we provide new insights and propose future research directions to drive innovation and address current gaps in the field. This review offers researchers and practitioners a comprehensive overview of the BIAI landscape, helping them harness its potential and expedite advancements in AI development.

[AI-43] Poly2Vec: Polymorphic Encoding of Geospatial Objects for Spatial Reasoning with Deep Neural Networks

链接: https://arxiv.org/abs/2408.14806
作者: Maria Despoina Siampou,Jialiang Li,John Krumm,Cyrus Shahabi,Hua Lu
关键词-EN: enabling machine learning, machine learning, crucial for enabling, enabling machine, identifying the topological
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Encoding geospatial data is crucial for enabling machine learning (ML) models to perform tasks that require spatial reasoning, such as identifying the topological relationships between two different geospatial objects. However, existing encoding methods are limited as they are typically customized to handle only specific types of spatial data, which impedes their applicability across different downstream tasks where multiple data types coexist. To address this, we introduce Poly2Vec, an encoding framework that unifies the modeling of different geospatial objects, including 2D points, polylines, and polygons, irrespective of the downstream task. We leverage the power of the 2D Fourier transform to encode useful spatial properties, such as shape and location, from geospatial objects into fixed-length vectors. These vectors are then inputted into neural network models for spatial reasoning tasks. This unified approach eliminates the need to develop and train separate models for each distinct spatial type. We evaluate Poly2Vec on both synthetic and real datasets of mixed geometry types and verify its consistent performance across several downstream spatial reasoning tasks.
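
A toy version of the core idea, sketched under assumptions: random 2D frequency vectors stand in for Poly2Vec's principled frequency selection, and Fourier features are simply averaged over a geometry's points, so a point and a polyline both map to vectors of the same length. The real method integrates the 2D Fourier transform over the geometry itself rather than sampling its vertices.

```python
import numpy as np

def fourier_encode_2d(points: np.ndarray, freqs: np.ndarray) -> np.ndarray:
    """Encode an (N, 2) array of 2D points into a fixed-length vector of
    Fourier features, pooled over the points. `freqs` is an (F, 2) array
    of frequency vectors -- a free design choice in this toy version."""
    phase = points @ freqs.T                                        # (N, F)
    feats = np.concatenate([np.cos(phase), np.sin(phase)], axis=1)  # (N, 2F)
    return feats.mean(axis=0)                                       # (2F,)

rng = np.random.default_rng(0)
freqs = rng.normal(size=(16, 2))
point = np.array([[0.3, 0.7]])
polyline = np.array([[0.0, 0.0], [0.5, 0.2], [1.0, 1.0]])
print(fourier_encode_2d(point, freqs).shape)     # (32,)
print(fourier_encode_2d(polyline, freqs).shape)  # (32,) -- same length, any geometry
```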

[AI-44] Optimizing Structured Data Processing through Robotic Process Automation

链接: https://arxiv.org/abs/2408.14791
作者: Vivek Bhardwaj,Ajit Noonia,Sandeep Chaurasia,Mukesh Kumar,Abdulnaser Rashid,Mohamed Tahar Ben Othman
关键词-EN: Robotic Process Automation, analyze large volumes, Robotic Process, Process Automation, data extraction
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: This manuscript has been accepted for publication in the journal Revue d’Intelligence Artificielle

点击查看摘要

Abstract:Robotic Process Automation (RPA) has emerged as a game-changing technology in data extraction, revolutionizing the way organizations process and analyze large volumes of documents such as invoices, purchase orders, and payment advices. This study investigates the use of RPA for structured data extraction and evaluates its advantages over manual processes. By comparing human-performed tasks with those executed by RPA software bots, we assess efficiency and accuracy in data extraction from invoices, focusing on the effectiveness of the RPA system. Through four distinct scenarios involving varying numbers of invoices, we measure efficiency in terms of time and effort required for task completion, as well as accuracy by comparing error rates between manual and RPA processes. Our findings highlight the significant efficiency gains achieved by RPA, with bots completing tasks in significantly less time compared to manual efforts across all cases. Moreover, the RPA system consistently achieves perfect accuracy, mitigating the risk of errors and enhancing process reliability. These results underscore the transformative potential of RPA in optimizing operational efficiency, reducing human labor costs, and improving overall business performance.

[AI-45] GINN-KAN: Interpretability pipelining with applications in Physics Informed Neural Networks

链接: https://arxiv.org/abs/2408.14780
作者: Nisal Ranasinghe,Yu Xia,Sachith Seneviratne,Saman Halgamuge
关键词-EN: powerful function approximators, interpretable neural network, neural network, interpretable neural, Neural
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Neural networks are powerful function approximators, yet their "black-box" nature often renders them opaque and difficult to interpret. While many post-hoc explanation methods exist, they typically fail to capture the underlying reasoning processes of the networks. A truly interpretable neural network would be trained similarly to conventional models using techniques such as backpropagation, but additionally provide insights into the learned input-output relationships. In this work, we introduce the concept of interpretability pipelining, which combines multiple interpretability techniques to outperform each individual technique. To this end, we first evaluate several architectures that promise such interpretability, with a particular focus on two recent models selected for their potential to incorporate interpretability into standard neural network architectures while still leveraging backpropagation: the Growing Interpretable Neural Network (GINN) and Kolmogorov Arnold Networks (KAN). We analyze the limitations and strengths of each and introduce a novel interpretable neural network GINN-KAN that synthesizes the advantages of both models. When tested on the Feynman symbolic regression benchmark datasets, GINN-KAN outperforms both GINN and KAN. To highlight the capabilities and the generalizability of this approach, we position GINN-KAN as an alternative to conventional black-box networks in Physics-Informed Neural Networks (PINNs). We expect this to have far-reaching implications in the application of deep learning pipelines in the natural sciences. Our experiments with this interpretable PINN on 15 different partial differential equations demonstrate that GINN-KAN augmented PINNs outperform PINNs with black-box networks in solving differential equations and surpass the capabilities of both GINN and KAN.

[AI-46] MROVSeg: Breaking the Resolution Curse of Vision-Language Models in Open-Vocabulary Semantic Segmentation

链接: https://arxiv.org/abs/2408.14776
作者: Yuanbing Zhu,Bingke Zhu,Zhen Chen,Huan Xu,Ming Tang,Jinqiao Wang
关键词-EN: recognize semantically meaningful, Open-vocabulary semantic segmentation, semantically meaningful regions, meaningful regions based, descriptions during inference
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Technical report

点击查看摘要

Abstract:Open-vocabulary semantic segmentation aims to segment and recognize semantically meaningful regions based on text-based descriptions during inference. A typical solution to address this task is to leverage powerful vision-language models (VLMs), such as CLIP, to bridge the gap between open- and close-vocabulary recognition. As VLMs are usually pretrained with low-resolution images (e.g. 224×224), most previous methods operate only on downscaled images. We question this design, as low-resolution features often fail to preserve fine details. Although employing additional image backbones for high-resolution inputs can mitigate this issue, it may also introduce significant computation overhead. Therefore, we propose MROVSeg, a multi-resolution training framework for open-vocabulary semantic segmentation with a single pretrained CLIP backbone, that uses sliding windows to slice the high-resolution input into uniform patches, each matching the input size of the well-trained image encoder. Its key components include a Multi-Res Adapter, which restores the spatial geometry and grasps local-global correspondences across patches by learnable convolutional and scale attention layers. To achieve accurate segmentation, we introduce a Multi-grained Masked Attention scheme to aggregate multi-grained semantics by performing cross-attention between object queries and multi-resolution CLIP features within the regions of interest. Through comprehensive experiments, we demonstrate the superiority of MROVSeg on well-established open-vocabulary semantic segmentation benchmarks, particularly for high-resolution inputs, establishing new standards for open-vocabulary semantic segmentation.

[AI-47] A global AI community requires language-diverse publishing ICLR

链接: https://arxiv.org/abs/2408.14772
作者: Haley Lepp,Parth Sarin
关键词-EN: reinforces broader regimes, English language publishing, English dominance, language publishing upholds, English language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Translations by Michael Hardy (Guarani), Vandana Sarin and Vivek Sarin (Hindi), Roshna Omer Abdulrahman (Soranî Kurdish), Gabriel Poesia (Portuguese), and Matías Grinberg (Spanish). In the proceedings of the Global AI Cultures Workshop at the Twelfth International Conference on Learning Representations (ICLR) 2024, Vienna, Austria, May 7-11, 2024

点击查看摘要

Abstract:In this provocation, we discuss the English dominance of the AI research community, arguing that the requirement for English language publishing upholds and reinforces broader regimes of extraction in AI. While large language models and machine translation have been celebrated as a way to break down barriers, we regard their use as a symptom of linguistic exclusion of scientists and potential readers. We propose alternative futures for a healthier publishing culture, organized around three themes: administering conferences in the languages of the country in which they are held, instructing peer reviewers not to adjudicate the language appropriateness of papers, and offering opportunities to publish and present in multiple languages. We welcome new translations of this piece. Please contact the authors if you would like to contribute one.

[AI-48] CoopASD: Cooperative Machine Anomalous Sound Detection with Privacy Concerns

链接: https://arxiv.org/abs/2408.14753
作者: Anbai Jiang,Yuchen Shi,Pingyi Fan,Wei-Qiang Zhang,Jia Liu
关键词-EN: Internet of Things, Industrial Internet, anomalous sound detection, promoting production efficiency, Machine anomalous sound
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Audio and Speech Processing (eess.AS)
*备注: Accepted by GLOBECOM 2024

点击查看摘要

Abstract:Machine anomalous sound detection (ASD) has emerged as one of the most promising applications in the Industrial Internet of Things (IIoT) due to its unprecedented efficacy in mitigating risks of malfunctions and promoting production efficiency. Previous works mainly investigated the machine ASD task under centralized settings. However, developing the ASD system under decentralized settings is crucial in practice, since the machine data are dispersed in various factories and the data should not be explicitly shared due to privacy concerns. To enable these factories to cooperatively develop a scalable ASD model while preserving their privacy, we propose a novel framework named CoopASD, where each factory trains an ASD model on its local dataset, and a central server aggregates these local models periodically. We employ a pre-trained model as the backbone of the ASD model to improve its robustness and develop specialized techniques to stabilize the model under a completely non-iid and domain shift setting. Compared with previous state-of-the-art (SOTA) models trained in centralized settings, CoopASD showcases competitive results with negligible degradation of 0.08%. We also conduct extensive ablation studies to demonstrate the effectiveness of CoopASD.
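
The cooperative training loop resembles classic federated averaging; the sketch below shows that pattern under the assumption that CoopASD's server-side aggregation is plain weight averaging (the paper's exact rule and its non-iid stabilization tricks may differ).

```python
import copy

import torch

def aggregate(local_models: list) -> dict:
    """Server-side step, assumed FedAvg-style: only model weights are shared,
    so raw audio never leaves a factory."""
    states = [m.state_dict() for m in local_models]
    avg = copy.deepcopy(states[0])
    for key in avg:
        avg[key] = torch.stack([s[key].float() for s in states]).mean(dim=0)
    return avg

# One communication round (sketch):
# for rnd in range(num_rounds):
#     for factory in factories:
#         factory.train_locally(epochs=1)        # non-iid data stays on site
#     global_state = aggregate([f.model for f in factories])
#     for factory in factories:
#         factory.model.load_state_dict(global_state)
```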

[AI-49] Benchmarking Reinforcement Learning Methods for Dexterous Robotic Manipulation with a Three-Fingered Gripper

链接: https://arxiv.org/abs/2408.14747
作者: Elizabeth Cutler,Yuning Xing,Tony Cui,Brendan Zhou,Koen van Rijnsoever,Ben Hart,David Valencia,Lee Violet C. Ong,Trevor Gee,Minas Liarokapis,Henry Williams
关键词-EN: Reinforcement Learning, controlled simulation environments, simulation environments, predominantly conducted, conducted in cost-effective
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reinforcement Learning (RL) training is predominantly conducted in cost-effective and controlled simulation environments. However, the transfer of these trained models to real-world tasks often presents unavoidable challenges. This research explores the direct training of RL algorithms in controlled yet realistic real-world settings for the execution of dexterous manipulation. The benchmarking results of three RL algorithms trained on intricate in-hand manipulation tasks within practical real-world contexts are presented. Our study not only demonstrates the practicality of RL training in authentic real-world scenarios, facilitating direct real-world applications, but also provides insights into the associated challenges and considerations. Additionally, our experiences with the employed experimental methods are shared, with the aim of empowering and engaging fellow researchers and practitioners in this dynamic field of robotics.

[AI-50] RSTeller: Scaling Up Visual Language Modeling in Remote Sensing with Rich Linguistic Semantics from Openly Available Data and Large Language Models

链接: https://arxiv.org/abs/2408.14744
作者: Junyao Ge,Yang Zheng,Kaitai Guo,Jimin Liang
关键词-EN: Google Earth Engine, aligning complex visual, enabling the development, interpretation tasks, pivotal for aligning
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Submitted to ISPRS

点击查看摘要

Abstract:Abundant, well-annotated multimodal data in remote sensing are pivotal for aligning complex visual remote sensing (RS) scenes with human language, enabling the development of specialized vision language models across diverse RS interpretation tasks. However, annotating RS images with rich linguistic semantics at scale demands expertise in RS and substantial human labor, making it costly and often impractical. In this study, we propose a workflow that leverages large language models (LLMs) to generate multimodal datasets with semantically rich captions at scale from plain OpenStreetMap (OSM) data for images sourced from the Google Earth Engine (GEE) platform. This approach facilitates the generation of paired remote sensing data and can be readily scaled up using openly available data. Within this framework, we present RSTeller, a multimodal dataset comprising over 1 million RS images, each accompanied by multiple descriptive captions. Extensive experiments demonstrate that RSTeller enhances the performance of multiple existing vision language models for RS scene understanding through continual pre-training. Our methodology significantly reduces the manual effort and expertise needed for annotating remote sensing imagery while democratizing access to high-quality annotated data. This advancement fosters progress in visual language modeling and encourages broader participation in remote sensing research and applications. The RSTeller dataset is available at this https URL.

[AI-51] TART: Boosting Clean Accuracy Through Tangent Direction Guided Adversarial Training

链接: https://arxiv.org/abs/2408.14728
作者: Bongsoo Yi,Rongjie Lai,Yao Li
关键词-EN: deep neural networks, Guided Adversarial Training, Adversarial training, Adversarial, Direction Guided Adversarial
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Adversarial training has been shown to be successful in enhancing the robustness of deep neural networks against adversarial attacks. However, this robustness is accompanied by a significant decline in accuracy on clean data. In this paper, we propose a novel method, called Tangent Direction Guided Adversarial Training (TART), that leverages the tangent space of the data manifold to ameliorate the existing adversarial defense algorithms. We argue that training with adversarial examples having large normal components significantly alters the decision boundary and hurts accuracy. TART mitigates this issue by estimating the tangent direction of adversarial examples and allocating an adaptive perturbation limit according to the norm of their tangential component. To the best of our knowledge, our paper is the first work to consider the concept of tangent space and direction in the context of adversarial defense. We validate the effectiveness of TART through extensive experiments on both simulated and benchmark datasets. The results demonstrate that TART consistently boosts clean accuracy while retaining a high level of robustness against adversarial attacks. Our findings suggest that incorporating the geometric properties of data can lead to more effective and efficient adversarial training methods.
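
A hedged sketch of the adaptive budget idea: given an estimated orthonormal basis for the local tangent space (obtainable, for instance, from PCA over nearest neighbours), the per-example perturbation limit grows with the norm of the perturbation's tangential component. The linear scaling rule and all hyperparameters shown are illustrative assumptions, not the paper's exact formula.

```python
import torch

def adaptive_eps(delta: torch.Tensor, tangent_basis: torch.Tensor,
                 eps_base: float = 0.03, alpha: float = 0.1) -> float:
    """Allocate a perturbation budget that grows with the norm of the
    perturbation's tangential component. `tangent_basis` is a (k, d)
    orthonormal estimate of the local tangent space."""
    tangential = tangent_basis.T @ (tangent_basis @ delta)  # project onto tangent space
    return eps_base + alpha * tangential.norm().item()

# Example with a 2-D tangent plane inside a 10-D input space.
basis = torch.linalg.qr(torch.randn(10, 2)).Q.T   # (2, 10) orthonormal rows
delta = torch.randn(10)
print(adaptive_eps(delta, basis))
```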

[AI-52] PAT: Pruning-Aware Tuning for Large Language Models

链接: https://arxiv.org/abs/2408.14721
作者: Yijiang Liu,Huanrui Yang,Youxin Chen,Rongyu Zhang,Miao Wang,Yuan Du,Li Du
关键词-EN: Large language models, Large language, language tasks, language models, Structural pruning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) excel in language tasks, especially with supervised fine-tuning after pre-training. However, their substantial memory and computational requirements hinder practical applications. Structural pruning, which reduces less significant weight dimensions, is one solution. Yet, traditional post-hoc pruning often leads to significant performance loss, with limited recovery from further fine-tuning due to reduced capacity. Since the model fine-tuning refines the general and chaotic knowledge in pre-trained models, we aim to incorporate structural pruning with the fine-tuning, and propose the Pruning-Aware Tuning (PAT) paradigm to eliminate model redundancy while preserving the model performance to the maximum extent. Specifically, we insert the innovative Hybrid Sparsification Modules (HSMs) between the Attention and FFN components to accordingly sparsify the upstream and downstream linear modules. The HSM comprises a lightweight operator and a globally shared trainable mask. The lightweight operator maintains a training overhead comparable to that of LoRA, while the trainable mask unifies the channels to be sparsified, ensuring structural pruning. Additionally, we propose the Identity Loss which decouples the transformation and scaling properties of the HSMs to enhance training robustness. Extensive experiments demonstrate that PAT excels in both performance and efficiency. For example, our Llama2-7b model with a 25% pruning ratio achieves a 1.33× speedup while outperforming the LoRA-finetuned model by up to 1.26% in accuracy with a similar training cost. Code: this https URL

[AI-53] Residual-based Adaptive Huber Loss (RAHL) – Design of an improved Huber loss for CQI prediction in 5G networks

链接: https://arxiv.org/abs/2408.14718
作者: Mina Kaviani,Jurandy Almeida,Fabio L. Verdi
关键词-EN: Channel Quality Indicator, ensure high Quality, Quality Indicator, Channel Quality, optimizing infrastructure dynamically
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
*备注: this https URL

点击查看摘要

Abstract:The Channel Quality Indicator (CQI) plays a pivotal role in 5G networks, optimizing infrastructure dynamically to ensure high Quality of Service (QoS). Recent research has focused on improving CQI estimation in 5G networks using machine learning. In this field, the selection of the proper loss function is critical for training an accurate model. Two commonly used loss functions are Mean Squared Error (MSE) and Mean Absolute Error (MAE). Roughly speaking, MSE puts more weight on outliers, MAE on the majority. Here, we argue that the Huber loss function is more suitable for CQI prediction, since it combines the benefits of both MSE and MAE. To achieve this, the Huber loss transitions smoothly between MSE and MAE, controlled by a user-defined hyperparameter called delta. However, finding the right balance between sensitivity to small errors (MAE) and robustness to outliers (MSE) by manually choosing the optimal delta is challenging. To address this issue, we propose a novel loss function, named Residual-based Adaptive Huber Loss (RAHL). In RAHL, a learnable residual is added to the delta, enabling the model to adapt based on the distribution of errors in the data. Our approach effectively balances model robustness against outliers while preserving inlier data precision. The widely recognized Long Short-Term Memory (LSTM) model is employed in conjunction with RAHL, showcasing significantly improved results compared to the aforementioned loss functions. The obtained results affirm the superiority of RAHL, offering a promising avenue for enhanced CQI prediction in 5G networks.
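
A minimal PyTorch sketch of the loss, assuming the learnable residual is a single scalar added to the user-set delta; how RAHL constrains or conditions the residual is not specified here, so the clamping below is an assumption.

```python
import torch
import torch.nn as nn

class RAHL(nn.Module):
    """Sketch of a Residual-based Adaptive Huber Loss: a learnable residual
    shifts the user-set delta, so the transition point between the MSE-like
    and MAE-like regimes adapts during training."""
    def __init__(self, delta: float = 1.0):
        super().__init__()
        self.delta = delta
        self.residual = nn.Parameter(torch.zeros(1))  # learned with the model

    def forward(self, pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        d = torch.clamp(self.delta + self.residual, min=1e-3)  # keep delta positive
        err = (pred - target).abs()
        quad = 0.5 * err ** 2              # MSE-like region (small errors)
        lin = d * err - 0.5 * d ** 2       # MAE-like region (outliers)
        return torch.where(err <= d, quad, lin).mean()

# Usage: loss_fn = RAHL(delta=1.0); its `residual` is optimized jointly
# with, e.g., an LSTM's weights by the same optimizer.
```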

[AI-54] Text2SQL is Not Enough: Unifying AI and Databases with TAG

链接: https://arxiv.org/abs/2408.14717
作者: Asim Biswal,Liana Patel,Siddarth Jha,Amog Kamsetty,Shu Liu,Joseph E. Gonzalez,Carlos Guestrin,Matei Zaharia
关键词-EN: natural language questions, language questions, serve natural language, natural language, promise to unlock
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:AI systems that serve natural language questions over databases promise to unlock tremendous value. Such systems would allow users to leverage the powerful reasoning and knowledge capabilities of language models (LMs) alongside the scalable computational power of data management systems. These combined capabilities would empower users to ask arbitrary natural language questions over custom data sources. However, existing methods and benchmarks insufficiently explore this setting. Text2SQL methods focus solely on natural language questions that can be expressed in relational algebra, representing a small subset of the questions real users wish to ask. Likewise, Retrieval-Augmented Generation (RAG) considers the limited subset of queries that can be answered with point lookups to one or a few data records within the database. We propose Table-Augmented Generation (TAG), a unified and general-purpose paradigm for answering natural language questions over databases. The TAG model represents a wide range of interactions between the LM and database that have been previously unexplored and creates exciting research opportunities for leveraging the world knowledge and reasoning capabilities of LMs over data. We systematically develop benchmarks to study the TAG problem and find that standard methods answer no more than 20% of queries correctly, confirming the need for further research in this area. We release code for the benchmark at this https URL.

[AI-55] StyleSpeech: Parameter-efficient Fine Tuning for Pre-trained Controllable Text-to-Speech

链接: https://arxiv.org/abs/2408.14713
作者: Haowei Lou,Helen Paik,Wen Hu,Lina Yao
关键词-EN: Lower Rank Adaptation, paper introduces StyleSpeech, TTS system, enhances the naturalness, naturalness and accuracy
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:This paper introduces StyleSpeech, a novel Text-to-Speech (TTS) system that enhances the naturalness and accuracy of synthesized speech. Building upon existing TTS technologies, StyleSpeech incorporates a unique Style Decorator structure that enables deep learning models to simultaneously learn style and phoneme features, improving adaptability and efficiency through the principles of Lower Rank Adaptation (LoRA). LoRA allows efficient adaptation of style features in pre-trained models. Additionally, we introduce a novel automatic evaluation metric, the LLM-Guided Mean Opinion Score (LLM-MOS), which employs large language models to offer an objective and robust protocol for automatically assessing TTS system performance. Extensive testing on benchmark datasets shows that our approach markedly outperforms existing state-of-the-art baseline methods in producing natural, accurate, and high-quality speech. These advancements not only push the boundaries of current TTS system capabilities, but also facilitate the application of TTS systems in more dynamic and specialized settings, such as interactive virtual assistants, adaptive audiobooks, and customized voices for gaming. Speech samples can be found at https://style-speech.vercel.app

[AI-56] Artificial Intelligence in Landscape Architecture: A Survey

链接: https://arxiv.org/abs/2408.14700
作者: Yue Xing,Wensheng Gan,Qidi Chen
关键词-EN: landscape architecture, history of landscape, extend human intelligence, human pursuit, ecological balance
类目: Artificial Intelligence (cs.AI)
*备注: Preprint. 3 figures, 2 tables

点击查看摘要

Abstract:The development history of landscape architecture (LA) reflects the human pursuit of environmental beautification and ecological balance. With the advancement of artificial intelligence (AI) technologies that simulate and extend human intelligence, immense opportunities have been provided for LA, offering scientific and technological support throughout the entire workflow. In this article, we comprehensively review the applications of AI technology in the field of LA. First, we introduce the many potential benefits that AI brings to the design, planning, and management aspects of LA. Secondly, we discuss how AI can assist the LA field in solving its current development problems, including urbanization, environmental degradation and ecological decline, irrational planning, insufficient management and maintenance, and lack of public participation. Furthermore, we summarize the key technologies and practical cases of applying AI in the LA domain, from design assistance to intelligent management, all of which provide innovative solutions for the planning, design, and maintenance of LA. Finally, we look ahead to the problems and opportunities in LA, emphasizing the need to combine human expertise and judgment for rational decision-making. This article provides both theoretical and practical guidance for LA designers, researchers, and technology developers. The successful integration of AI technology into LA holds great promise for enhancing the field’s capabilities and achieving more sustainable, efficient, and user-friendly outcomes.

[AI-57] Smart Multi-Modal Search: Contextual Sparse and Dense Embedding Integration in Adobe Express

链接: https://arxiv.org/abs/2408.14698
作者: Cherag Aroraa,Tracy Holloway King,Jayant Kumar,Yi Lu,Sanat Sharma,Arvind Srikantan,David Uvalle,Josep Valls-Vargas,Harsha Vardhan
关键词-EN: multi-modal search systems, effective multi-modal search, multi-modal search, search systems, search
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:As user content and queries become increasingly multi-modal, the need for effective multi-modal search systems has grown. Traditional search systems often rely on textual and metadata annotations for indexed images, while multi-modal embeddings like CLIP enable direct search using text and image embeddings. However, embedding-based approaches face challenges in integrating contextual features such as user locale and recency. Building a scalable multi-modal search system requires fine-tuning several components. This paper presents a multi-modal search architecture and a series of A/B tests that optimize embeddings and multi-modal technologies in Adobe Express template search. We address considerations such as embedding model selection, the roles of embeddings in matching and ranking, and the balance between dense and sparse embeddings. Our iterative approach demonstrates how utilizing sparse, dense, and contextual features enhances short and long query search, significantly reduces null rates (over 70%), and increases click-through rates (CTR). Our findings provide insights into developing robust multi-modal search systems, thereby enhancing relevance for complex queries.

[AI-58] Training-Free Activation Sparsity in Large Language Models

链接: https://arxiv.org/abs/2408.14690
作者: James Liu,Pragaash Ponnusamy,Tianle Cai,Han Guo,Yoon Kim,Ben Athiwaratkun
关键词-EN: enable practical inference, practical inference speedups, large language models, forward pass, enable practical
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Activation sparsity can enable practical inference speedups in large language models (LLMs) by reducing the compute and memory-movement required for matrix multiplications during the forward pass. However, existing methods face limitations that inhibit widespread adoption. Some approaches are tailored towards older models with ReLU-based sparsity, while others require extensive continued pre-training on up to hundreds of billions of tokens. This paper describes TEAL, a simple training-free method that applies magnitude-based activation sparsity to hidden states throughout the entire model. TEAL achieves 40-50% model-wide sparsity with minimal performance degradation across Llama-2, Llama-3, and Mistral families, with sizes varying from 7B to 70B. We improve existing sparse kernels and demonstrate wall-clock decoding speed-ups of up to 1.53× and 1.8× at 40% and 50% model-wide sparsity. TEAL is compatible with weight quantization, enabling further efficiency gains.
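
The core operation is simple enough to sketch in a few lines: zero the smallest-magnitude entries of a hidden state at a target sparsity level. TEAL calibrates its thresholds offline per layer; the per-tensor quantile below is a simplification assumed for illustration.

```python
import torch

def teal_sparsify(hidden: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    """Training-free magnitude-based activation sparsity: zero the fraction
    `sparsity` of entries with the smallest magnitudes before the tensor
    feeds a linear layer, so sparse kernels can skip that work."""
    threshold = hidden.abs().flatten().quantile(sparsity)
    return hidden * (hidden.abs() > threshold)

x = torch.randn(4, 4096)
x_sparse = teal_sparsify(x, sparsity=0.5)
print((x_sparse == 0).float().mean())  # ~0.5 of entries zeroed
```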

[AI-59] Bridging the Gap: Unpacking the Hidden Challenges in Knowledge Distillation for Online Ranking Systems

链接: https://arxiv.org/abs/2408.14678
作者: Nikhil Khani,Shuo Yang,Aniruddh Nath,Yang Liu,Pendo Abbo,Li Wei,Shawn Andrews,Maciej Kula,Jarrod Kahn,Zhe Zhao,Lichan Hong,Ed Chi
关键词-EN: Knowledge Distillation, powerful approach, approach for compressing, compressing a large, beneficial for latency-sensitive
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Knowledge Distillation (KD) is a powerful approach for compressing a large model into a smaller, more efficient model, particularly beneficial for latency-sensitive applications like recommender systems. However, current KD research predominantly focuses on Computer Vision (CV) and NLP tasks, overlooking unique data characteristics and challenges inherent to recommender systems. This paper addresses these overlooked challenges, specifically: (1) mitigating data distribution shifts between teacher and student models, (2) efficiently identifying optimal teacher configurations within time and budgetary constraints, and (3) enabling computationally efficient and rapid sharing of teacher labels to support multiple students. We present a robust KD system developed and rigorously evaluated on multiple large-scale personalized video recommendation systems within Google. Our live experiment results demonstrate significant improvements in student model performance while ensuring consistent and reliable generation of high-quality teacher labels from a continuous stream of data.
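
For context, the generic softened-label distillation objective that systems like this build on is shown below; the paper's ranking-specific losses and label-sharing infrastructure are not public here, so this is the textbook formulation, not Google's implementation.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      T: float = 2.0, alpha: float = 0.5) -> torch.Tensor:
    """Classic softened-label KD: KL between temperature-scaled teacher and
    student distributions, blended with the hard-label cross-entropy."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Example shapes: a batch of 8 items scored over 100 candidates.
s, t = torch.randn(8, 100), torch.randn(8, 100)
y = torch.randint(0, 100, (8,))
print(distillation_loss(s, t, y))
```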

[AI-60] KGPrune: a Web Application to Extract Subgraphs of Interest from Wikidata with Analogical Pruning ECAI2024

链接: https://arxiv.org/abs/2408.14658
作者: Pierre Monnin,Cherif-Hassan Nousradine,Lucas Jarnac,Laurel Zuckerman,Miguel Couceiro
关键词-EN: array of domains, ubiquitous publicly, nowadays covering, Knowledge graphs, knowledge sources
类目: Artificial Intelligence (cs.AI); Databases (cs.DB); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: Accepted as a demo paper at ECAI 2024

点击查看摘要

Abstract:Knowledge graphs (KGs) have become ubiquitous publicly available knowledge sources, and are nowadays covering an ever-increasing array of domains. However, not all knowledge represented is useful or pertinent when considering a new application or specific task. Also, due to their increasing size, handling large KGs in their entirety entails scalability issues. These two aspects ask for efficient methods to extract subgraphs of interest from existing KGs. To this aim, we introduce KGPrune, a Web Application that, given seed entities of interest and properties to traverse, extracts their neighboring subgraphs from Wikidata. To avoid topical drift, KGPrune relies on a frugal pruning algorithm based on analogical reasoning to only keep relevant neighbors while pruning irrelevant ones. The interest of KGPrune is illustrated by two concrete applications, namely, bootstrapping an enterprise KG and extracting knowledge related to looted artworks.

[AI-61] Emergent Language in Open-Ended Environments

链接: https://arxiv.org/abs/2408.14649
作者: Cornelius Wolff,Julius Mayer,Elia Bruni,Xenia Ohmer
关键词-EN: made significant progress, Emergent language research, Emergent language, situated multi-agent systems, recent years
类目: Artificial Intelligence (cs.AI)
*备注: 10 pages, 4 figures, 4 tables, preprint

点击查看摘要

Abstract:Emergent language research has made significant progress in recent years, but still largely fails to explore how communication emerges in more complex and situated multi-agent systems. Existing setups often employ a reference game, which limits the range of language emergence phenomena that can be studied, as the game consists of a single, purely language-based interaction between the agents. In this paper, we address these limitations and explore the emergence and utility of token-based communication in open-ended multi-agent environments, where situated agents interact with the environment through movement and communication over multiple time-steps. Specifically, we introduce two novel cooperative environments: Multi-Agent Pong and Collectors. These environments are interesting because optimal performance requires the emergence of a communication protocol, but moderate success can be achieved without one. By employing various methods from explainable AI research, such as saliency maps, perturbation, and diagnostic classifiers, we are able to track and interpret the agents’ language channel use over time. We find that the emerging communication is sparse, with the agents only generating meaningful messages and acting upon incoming messages in states where they cannot succeed without coordination.

[AI-62] Visions of Destruction: Exploring a Potential of Generative AI in Interactive Art

链接: https://arxiv.org/abs/2408.14644
作者: Mar Canet Sola,Varvara Guljajeva
关键词-EN: practice-based research approach, employing a practice-based, research approach, interactive art, practice-based research
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper explores the potential of generative AI within interactive art, employing a practice-based research approach. It presents the interactive artwork “Visions of Destruction” as a detailed case study, highlighting its innovative use of generative AI to create a dynamic, audience-responsive experience. This artwork applies gaze-based interaction to dynamically alter digital landscapes, symbolizing the impact of human activities on the environment by generating contemporary collages created with AI, trained on data about human damage to nature, and guided by audience interaction. The transformation of pristine natural scenes into human-made and industrialized landscapes through viewer interaction serves as a stark reminder of environmental degradation. The paper thoroughly explores the technical challenges and artistic innovations involved in creating such an interactive art installation, emphasizing the potential of generative AI to revolutionize artistic expression, audience engagement, and especially the opportunities for the interactive art field. It offers insights into the conceptual framework behind the artwork, aiming to evoke a deeper understanding and reflection on the Anthropocene era and human-induced climate change. This study contributes significantly to the field of creative AI and interactive art, blending technology and environmental consciousness in a compelling, thought-provoking manner.

[AI-63] Effect of Adaptation Rate and Cost Display in a Human-AI Interaction Game

链接: https://arxiv.org/abs/2408.14640
作者: Jason T. Isa,Bohan Wu,Qirui Wang,Yilin Zhang,Samuel A. Burden,Lillian J. Ratliff,Benjamin J. Chasnov
关键词-EN: joint action vector, current joint action, action vector, joint action, human behavior
类目: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:As interactions between humans and AI become more prevalent, it is critical to have better predictors of human behavior in these interactions. We investigated how changes in the AI's adaptive algorithm impact behavior predictions in two-player continuous games. In our experiments, the AI adapted its actions using a gradient descent algorithm under different adaptation rates while human participants were provided cost feedback. The cost feedback was provided by one of two types of visual displays: (a) cost at the current joint action vector, or (b) cost in a local neighborhood of the current joint action vector. Our results demonstrate that AI adaptation rate can significantly affect human behavior, having the ability to shift the outcome between two game-theoretic equilibria. We observed that slow adaptation rates shift the outcome towards the Nash equilibrium, while fast rates shift the outcome towards the human-led Stackelberg equilibrium. The addition of localized cost information had the effect of shifting outcomes towards Nash, compared to the outcomes from cost information at only the current joint action vector. Future work will investigate other effects that influence the convergence of gradient descent games.

[AI-64] Hybrid Deep Convolutional Neural Networks Combined with Autoencoders And Augmented Data To Predict The Look-Up Table 2006

链接: https://arxiv.org/abs/2408.14626
作者: Messaoud Djeddou,Aouatef Hellal,Ibrahim A. Hameed,Xingang Zhao,Djehad Al Dallal
关键词-EN: convolutional neural network, critical heat flux, predict critical heat, deep convolutional neural, data augmentation techniques
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 11 pages, 6 figures

点击查看摘要

Abstract:This study explores the development of a hybrid deep convolutional neural network (DCNN) model enhanced by autoencoders and data augmentation techniques to predict critical heat flux (CHF) with high accuracy. By augmenting the original input features using three different autoencoder configurations, the model's predictive capabilities were significantly improved. The hybrid models were trained and tested on a dataset of 7225 samples, with performance metrics including the coefficient of determination (R²), Nash-Sutcliffe efficiency (NSE), mean absolute error (MAE), and normalized root-mean-squared error (NRMSE) used for evaluation. Among the tested models, the DCNN_3F-A2 configuration demonstrated the highest accuracy, achieving an R² of 0.9908 during training and 0.9826 during testing, outperforming the base model and other augmented versions. These results suggest that the proposed hybrid approach, combining deep learning with feature augmentation, offers a robust solution for CHF prediction, with the potential to generalize across a wider range of conditions.

[AI-65] On Centralized Critics in Multi-Agent Reinforcement Learning

链接: https://arxiv.org/abs/2408.14597
作者: Xueguang Lyu,Andrea Baisero,Yuchen Xiao,Brett Daley,Christopher Amato
关键词-EN: Centralized Training, Multi-Agent Reinforcement Learning, Reinforcement Learning, Centralized, Multi-Agent Reinforcement
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Centralized Training for Decentralized Execution, where agents are trained offline in a centralized fashion and execute online in a decentralized manner, has become a popular approach in Multi-Agent Reinforcement Learning (MARL). In particular, it has become popular to develop actor-critic methods that train decentralized actors with a centralized critic, where the centralized critic is allowed access to global information of the entire system, including the true system state. Such centralized critics are possible given offline information and are not used for online execution. While these methods perform well in a number of domains and have become a de facto standard in MARL, using a centralized critic in this context has yet to be sufficiently analyzed theoretically or empirically. In this paper, we therefore formally analyze centralized and decentralized critic approaches, and analyze the effect of using state-based critics in partially observable environments. We derive theories contrary to the common intuition: critic centralization is not strictly beneficial, and using state values can be harmful. We further prove that, in particular, state-based critics can introduce unexpected bias and variance compared to history-based critics. Finally, we demonstrate how the theory applies in practice by comparing different forms of critics on a wide range of common multi-agent benchmarks. The experiments show practical issues such as the difficulty of representation learning with partial observability, which highlights why the theoretical problems are often overlooked in the literature.

[AI-66] How to build trust in answers given by Generative AI for specific and vague financial questions

链接: https://arxiv.org/abs/2408.14593
作者: Alex Zarifis,Xusen Cheng
关键词-EN: Generative artificial intelligence, Generative artificial, artificial intelligence, growth in adoption, explosive growth
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Purpose: Generative artificial intelligence (GenAI) has progressed in its ability and has seen explosive growth in adoption. However, the consumer's perspective on its use, particularly in specific scenarios such as financial advice, is unclear. This research develops a model of how to build trust in the advice given by GenAI when answering financial questions. Design/methodology/approach: The model is tested with survey data using structural equation modelling (SEM) and multi-group analysis (MGA). The MGA compares two scenarios, one where the consumer asks a specific question and one where a vague question is asked. Findings: This research identifies that building trust for consumers is different when they ask a specific financial question in comparison to a vague one. Humanness has a different effect in the two scenarios. When a financial question is specific, human-like interaction does not strengthen trust, whereas (1) when a question is vague, humanness builds trust. The remaining four ways to build trust, which apply in both scenarios, are (2) human oversight and being in the loop, (3) transparency and control, (4) accuracy and usefulness, and finally (5) ease of use and support. Originality/value: This research contributes to a better understanding of the consumer's perspective when using GenAI for financial questions and highlights the importance of understanding GenAI in specific contexts from specific stakeholders.

[AI-67] DIAGen: Diverse Image Augmentation with Generative Models

链接: https://arxiv.org/abs/2408.14584
作者: Tobias Lingenberg,Markus Reuter,Gopika Sudhakaran,Dominik Gojny,Stefan Roth,Simone Schaub-Meyer
关键词-EN: Simple data augmentation, Simple data, computer vision models, data augmentation techniques, rotations and flips
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted for publication in GCPR 2024

点击查看摘要

Abstract:Simple data augmentation techniques, such as rotations and flips, are widely used to enhance the generalization power of computer vision models. However, these techniques often fail to modify high-level semantic attributes of a class. To address this limitation, researchers have explored generative augmentation methods like the recently proposed DA-Fusion. Despite some progress, the variations are still largely limited to textural changes, thus falling short on aspects like varied viewpoints, environment, weather conditions, or even class-level semantic attributes (e.g., variations in a dog's breed). To overcome this challenge, we propose DIAGen, building upon DA-Fusion. First, we apply Gaussian noise to the embeddings of an object learned with Textual Inversion to diversify generations using a pre-trained diffusion model's knowledge. Second, we exploit the general knowledge of a text-to-text generative model to guide the image generation of the diffusion model with varied class-specific prompts. Finally, we introduce a weighting mechanism to mitigate the impact of poorly generated samples. Experimental results across various datasets show that DIAGen not only enhances semantic diversity but also improves the performance of subsequent classifiers. The advantages of DIAGen over standard augmentations and the DA-Fusion baseline are particularly pronounced with out-of-distribution samples.
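
The first step, noising a Textual Inversion embedding, is straightforward to sketch; the noise scale `sigma` and the number of variants are free hyperparameters assumed here, not values from the paper.

```python
import torch

def diversify_embedding(token_embedding: torch.Tensor, sigma: float = 0.05,
                        n_variants: int = 4) -> torch.Tensor:
    """Perturb a learned Textual Inversion pseudo-token embedding with
    Gaussian noise so each generation starts from a slightly different
    concept vector."""
    noise = torch.randn(n_variants, *token_embedding.shape) * sigma
    return token_embedding.unsqueeze(0) + noise

emb = torch.randn(768)              # stand-in for a learned pseudo-token
variants = diversify_embedding(emb)
print(variants.shape)               # torch.Size([4, 768]) -- one per generation
```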

[AI-68] EVINCE: Optimizing Adversarial LLM Dialogues via Conditional Statistics and Information Theory

链接: https://arxiv.org/abs/2408.14575
作者: Edward Y. Chang
关键词-EN: Artificial General Intelligence, advancing Artificial General, large language models, framework advancing Artificial, Conditional Exchanges
类目: Artificial Intelligence (cs.AI)
*备注: 19 pages, 7 figures, four tables

点击查看摘要

Abstract:This paper introduces EVINCE (Entropy and Variation IN Conditional Exchanges), a dialogue framework advancing Artificial General Intelligence (AGI) by enhancing versatility, adaptivity, and reasoning in large language models (LLMs). Leveraging adversarial debate and a novel dual entropy theory, EVINCE improves prediction accuracy, robustness, and stability in LLMs by integrating statistical modeling, information theory, and machine learning to balance diverse perspective exploration with strong prior exploitation. The framework’s effectiveness is demonstrated through consistent convergence of information-theoretic metrics, particularly improved mutual information, fostering productive LLM collaboration. We apply EVINCE to healthcare, showing improved disease diagnosis, and discuss its broader implications for decision-making across domains. This work provides theoretical foundations and empirical validation for EVINCE, paving the way for advancements in LLM collaboration and AGI development.

[AI-69] CURLoRA: Stable LLM Continual Fine-Tuning and Catastrophic Forgetting Mitigation

链接: https://arxiv.org/abs/2408.14572
作者: Muhammad Fawi
关键词-EN: Low-Rank Adaptation, leverages CUR matrix, fine-tuning large language, paper introduces CURLoRA, large language models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: Code available at this https URL

点击查看摘要

Abstract:This paper introduces CURLoRA, a novel approach to fine-tuning large language models (LLMs) that leverages CUR matrix decomposition in the context of Low-Rank Adaptation (LoRA). Our method addresses two critical challenges in LLM fine-tuning: mitigating catastrophic forgetting during continual learning and reducing the number of trainable parameters. We propose a unique modification to the CUR decomposition process, utilizing inverted probabilities for column and row selection, which acts as an implicit regularization, initializing the U matrix as a zero matrix, and fine-tuning only it. We demonstrate through experiments on multiple datasets that CURLoRA outperforms standard LoRA in mitigating catastrophic forgetting. It maintains model stability and performance across tasks while significantly reducing the number of trainable parameters. Our results show that CURLoRA achieves very good and stable task accuracy while keeping the base model's perplexity scores fixed, compared to LoRA, upon continual fine-tuning, particularly in scenarios with limited data.
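
A sketch of the adapter under the stated design: columns and rows are sampled from the frozen weight with inverted magnitude-based probabilities, and only the zero-initialized U matrix is trained. Sampling details, such as the exact probability normalization, are assumptions here.

```python
import torch
import torch.nn as nn

class CURLoRALinear(nn.Module):
    """C and R are sampled from the frozen weight W with inverted
    magnitude-based probabilities (low-norm columns/rows are picked more
    often, acting as implicit regularization); only U is trained."""
    def __init__(self, weight: torch.Tensor, rank: int = 8):
        super().__init__()
        self.register_buffer("weight", weight)              # frozen W, (out, in)
        inv_col = 1.0 / (weight.pow(2).sum(dim=0) + 1e-8)   # inverted column norms
        inv_row = 1.0 / (weight.pow(2).sum(dim=1) + 1e-8)   # inverted row norms
        cols = torch.multinomial(inv_col / inv_col.sum(), rank, replacement=False)
        rows = torch.multinomial(inv_row / inv_row.sum(), rank, replacement=False)
        self.register_buffer("C", weight[:, cols])          # (out, r), frozen
        self.register_buffer("R", weight[rows, :])          # (r, in), frozen
        self.U = nn.Parameter(torch.zeros(rank, rank))      # zero init, trainable

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ (self.weight + self.C @ self.U @ self.R).T

layer = CURLoRALinear(torch.randn(512, 256), rank=8)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # 64
```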

[AI-70] Improving Clinical Note Generation from Complex Doctor-Patient Conversation

链接: https://arxiv.org/abs/2408.14568
作者: Yizhan Li,Sifan Wu,Christopher Smith,Thomas Lo,Bang Liu
关键词-EN: patient care documentation, documenting medical exams, clinical note generation, Writing clinical notes, healthcare professionals
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Writing clinical notes and documenting medical exams is a critical task for healthcare professionals, serving as a vital component of patient care documentation. However, manually writing these notes is time-consuming and can impact the amount of time clinicians can spend on direct patient interaction and other tasks. Consequently, the development of automated clinical note generation systems has emerged as a clinically meaningful area of research within AI for health. In this paper, we present three key contributions to the field of clinical note generation using large language models (LLMs). First, we introduce CliniKnote, a comprehensive dataset consisting of 1,200 complex doctor-patient conversations paired with their full clinical notes. This dataset, created and curated by medical experts with the help of modern neural networks, provides a valuable resource for training and evaluating models in clinical note generation tasks. Second, we propose the K-SOAP (Keyword, Subjective, Objective, Assessment, and Plan) note format, which enhances traditional SOAP (Subjective, Objective, Assessment, and Plan) notes by adding a keyword section at the top, allowing for quick identification of essential information. Third, we develop an automatic pipeline to generate K-SOAP notes from doctor-patient conversations and benchmark various modern LLMs using various metrics. Our results demonstrate significant improvements in efficiency and performance compared to standard LLM finetuning methods.

[AI-71] A Survey of Camouflaged Object Detection and Beyond

链接: https://arxiv.org/abs/2408.14562
作者: Fengyang Xiao,Sujie Hu,Yuqi Shen,Chengyu Fang,Jinfa Huang,Chunming He,Longxiang Tang,Ziyun Yang,Xiu Li
关键词-EN: computer vision systems, Camouflaged Object Detection, segmenting objects, Object Detection, posing a significant
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 26 pages, 10 figures, 8 tables

点击查看摘要

Abstract:Camouflaged Object Detection (COD) refers to the task of identifying and segmenting objects that blend seamlessly into their surroundings, posing a significant challenge for computer vision systems. In recent years, COD has garnered widespread attention due to its potential applications in surveillance, wildlife conservation, autonomous systems, and more. While several surveys on COD exist, they often have limitations in terms of the number and scope of papers covered, particularly regarding the rapid advancements made in the field since mid-2023. To address this void, we present the most comprehensive review of COD to date, encompassing both theoretical frameworks and practical contributions to the field. This paper explores various COD methods across four domains, including both image-level and video-level solutions, from the perspectives of traditional and deep learning approaches. We thoroughly investigate the correlations between COD and other camouflaged scenario methods, thereby laying the theoretical foundation for subsequent analyses. Beyond object-level detection, we also summarize extended methods for instance-level tasks, including camouflaged instance segmentation, counting, and ranking. Additionally, we provide an overview of commonly used benchmarks and evaluation metrics in COD tasks, conducting a comprehensive evaluation of deep learning-based techniques in both image and video domains, considering both qualitative and quantitative performance. Finally, we discuss the limitations of current COD models and propose 9 promising directions for future research, focusing on addressing inherent challenges and exploring novel, meaningful technologies. For those interested, a curated list of COD-related techniques, datasets, and additional resources can be found at this https URL

[AI-72] Revisiting Image Captioning Training Paradigm via Direct CLIP-based Optimization BMVC2024

链接: https://arxiv.org/abs/2408.14547
作者: Nicholas Moratelli,Davide Caffagni,Marcella Cornia,Lorenzo Baraldi,Rita Cucchiara
关键词-EN: Self-Critical Sequence Training, Self-Critical Sequence, image captioning involves, captioning involves pre-training, maximize hand-crafted captioning
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
*备注: BMVC 2024

点击查看摘要

Abstract:The conventional training approach for image captioning involves pre-training a network using teacher forcing and subsequent fine-tuning with Self-Critical Sequence Training to maximize hand-crafted captioning metrics. However, when attempting to optimize modern and higher-quality metrics like CLIP-Score and PAC-Score, this training method often encounters instability and fails to acquire the genuine descriptive capabilities needed to produce fluent and informative captions. In this paper, we propose a new training paradigm termed Direct CLIP-Based Optimization (DiCO). Our approach jointly learns and optimizes a reward model that is distilled from a learnable captioning evaluator with high human correlation. This is done by solving a weighted classification problem directly inside the captioner. At the same time, DiCO prevents divergence from the original model, ensuring that fluency is maintained. DiCO not only exhibits improved stability and enhanced quality in the generated captions but also aligns more closely with human preferences compared to existing methods, especially in modern metrics. Additionally, it maintains competitive performance in traditional metrics. Our source code and trained models are publicly available at this https URL.

[AI-73] Multi-Agent Path Finding with Real Robot Dynamics and Interdependent Tasks for Automated Warehouses ECAI-2024

链接: https://arxiv.org/abs/2408.14527
作者: Vassilissa Lehoux-Lebacque,Tomi Silander,Christelle Loiodice,Seungjoon Lee,Albert Wang,Sofia Michel
关键词-EN: Multi-Agent Path Finding, Path Finding, Multi-Agent Path, important optimization problem, optimization problem underlying
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注: Accepted to ECAI-2024. For related videos, see this https URL

点击查看摘要

Abstract:Multi-Agent Path Finding (MAPF) is an important optimization problem underlying the deployment of robots in automated warehouses and factories. Despite the large body of work on this topic, most approaches make heavy simplifications, both on the environment and the agents, which make the resulting algorithms impractical for real-life scenarios. In this paper, we consider a realistic problem of online order delivery in a warehouse, where a fleet of robots brings the products belonging to each order from shelves to workstations. This creates a stream of inter-dependent pickup and delivery tasks, and the associated MAPF problem consists of computing realistic collision-free robot trajectories fulfilling these tasks. To solve this MAPF problem, we propose an extension of the standard Prioritized Planning algorithm to deal with the inter-dependent tasks (Interleaved Prioritized Planning) and a novel Via-Point Star (VP*) algorithm to compute an optimal dynamics-compliant robot trajectory to visit a sequence of goal locations while avoiding moving obstacles. We prove the completeness of our approach and evaluate it in simulation as well as in a real warehouse.

[AI-74] Estimating Uncertainty with Implicit Quantile Network

链接: https://arxiv.org/abs/2408.14525
作者: Yi Hung Lim
关键词-EN: performance critical applications, Implicit Quantile Network, performance critical, bayesian neural networks, Uncertainty quantification
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注: This method is simple to implement and offers important information for performance critical applications

点击查看摘要

Abstract:Uncertainty quantification is an important part of many performance-critical applications. This paper provides a simple alternative to existing approaches such as ensemble learning and Bayesian neural networks. By directly modeling the loss distribution with an Implicit Quantile Network, we get an estimate of how uncertain the model is about its predictions. In experiments with the MNIST and CIFAR datasets, the mean of the estimated loss distribution is 2x higher for incorrect predictions. When data with high estimated uncertainty is removed from the test dataset, the accuracy of the model goes up by as much as 10%. This method is simple to implement while offering important information to applications where the user has to know when the model could be wrong (e.g. deep learning for healthcare).
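As an illustration of the core idea, the sketch below pairs an implicit quantile head with the standard pinball (quantile regression) loss; the cosine embedding of tau follows the original IQN recipe of Dabney et al., and all layer sizes are placeholders, not the paper's configuration:

```python
import math
import torch
import torch.nn as nn

class ImplicitQuantileHead(nn.Module):
    """Maps a feature vector and a quantile level tau to an estimate
    of that quantile of the loss distribution."""
    def __init__(self, feat_dim, n_cos=64):
        super().__init__()
        self.register_buffer("freqs", torch.arange(1, n_cos + 1).float() * math.pi)
        self.tau_embed = nn.Linear(n_cos, feat_dim)
        self.out = nn.Linear(feat_dim, 1)

    def forward(self, feats, tau):
        cos = torch.cos(tau.unsqueeze(-1) * self.freqs)  # [B, n_cos]
        h = feats * torch.relu(self.tau_embed(cos))      # [B, feat_dim]
        return self.out(h).squeeze(-1)                   # predicted loss quantile

def pinball_loss(pred_q, target, tau):
    """Quantile regression loss; tau is sampled from U(0, 1) each step."""
    diff = target - pred_q
    return torch.maximum(tau * diff, (tau - 1.0) * diff).mean()

# At test time, the spread between e.g. the 0.9 and 0.1 quantile
# estimates serves as the per-sample uncertainty score.
```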

[AI-75] Retrieval Augmented Generation for Dynamic Graph Modeling

链接: https://arxiv.org/abs/2408.14523
作者: Yuxia Wu,Yuan Fang,Lizi Liao
关键词-EN: Dynamic graph modeling, Dynamic graph, graph modeling, analyzing evolving patterns, graph
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Under review

点击查看摘要

Abstract:Dynamic graph modeling is crucial for analyzing evolving patterns in various applications. Existing approaches often integrate graph neural networks with temporal modules or redefine dynamic graph modeling as a generative sequence task. However, these methods typically rely on isolated historical contexts of the target nodes from a narrow perspective, neglecting occurrences of similar patterns or relevant cases associated with other nodes. In this work, we introduce the Retrieval-Augmented Generation for Dynamic Graph Modeling (RAG4DyG) framework, which leverages guidance from contextually and temporally analogous examples to broaden the perspective of each node. This approach presents two critical challenges: (1) How to identify and retrieve high-quality demonstrations that are contextually and temporally analogous to dynamic graph samples? (2) How can these demonstrations be effectively integrated to improve dynamic graph modeling? To address these challenges, we propose RAG4DyG, which enriches the understanding of historical contexts by retrieving and learning from contextually and temporally pertinent demonstrations. Specifically, we employ a time- and context-aware contrastive learning module to identify and retrieve relevant cases for each query sequence. Moreover, we design a graph fusion strategy to integrate the retrieved cases, thereby augmenting the inherent historical contexts for improved prediction. Extensive experiments on real-world datasets across different domains demonstrate the effectiveness of RAG4DyG for dynamic graph modeling.

[AI-76] Towards Graph Prompt Learning: A Survey and Beyond

链接: https://arxiv.org/abs/2408.14520
作者: Qingqing Long,Yuchen Yan,Peiyan Zhang,Chen Fang,Wentao Cui,Zhiyuan Ning,Meng Xiao,Ning Cao,Xiao Luo,Lingjun Xu,Shiyue Jiang,Zheng Fang,Chong Chen,Xian-Sheng Hua,Yuanchun Zhou
关键词-EN: demonstrated remarkable adaptability, enabling broad applications, image recognition, remarkable adaptability, enabling broad
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
*备注: 19 pages, 2 figures

点击查看摘要

Abstract:Large-scale “pre-train and prompt learning” paradigms have demonstrated remarkable adaptability, enabling broad applications across diverse domains such as question answering, image recognition, and multimodal retrieval. This approach fully leverages the potential of large-scale pre-trained models, reducing downstream data requirements and computational costs while enhancing model applicability across various tasks. Graphs, as versatile data structures that capture relationships between entities, play pivotal roles in fields such as social network analysis, recommender systems, and biological graphs. Despite the success of pre-train and prompt learning paradigms in Natural Language Processing (NLP) and Computer Vision (CV), their application in graph domains remains nascent. In graph-structured data, not only do the node and edge features often have disparate distributions, but the topological structures also differ significantly. This diversity in graph data can lead to incompatible patterns or gaps between pre-training and fine-tuning on downstream graphs. We aim to bridge this gap by summarizing methods for alleviating these disparities. This includes exploring prompt design methodologies, comparing related techniques, assessing application scenarios and datasets, and identifying unresolved problems and challenges. This survey categorizes over 100 relevant works in this field, summarizing general design principles and the latest applications, including text-attributed graphs, molecules, proteins, and recommendation systems. Through this extensive review, we provide a foundational understanding of graph prompt learning, aiming to impact not only the graph mining community but also the broader Artificial General Intelligence (AGI) community.

[AI-77] A Joint Learning Model with Variational Interaction for Multilingual Program Translation

链接: https://arxiv.org/abs/2408.14515
作者: Yali Du,Hui Sun,Ming Li
关键词-EN: program translation, Multilingual Program Translation, program, translation, Multilingual Program
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted by the 39th IEEE/ACM International Conference on Automated Software Engineering (ASE 2024)

点击查看摘要

Abstract:Programs implemented in various programming languages form the foundation of software applications. To alleviate the burden of program migration and facilitate the development of software systems, automated program translation across languages has garnered significant attention. Previous approaches primarily focus on pairwise translation paradigms, learning translation between pairs of languages using bilingual parallel data. However, parallel data is difficult to collect for some language pairs, and the distribution of program semantics across languages can shift, posing challenges for pairwise program translation. In this paper, we argue that jointly learning a unified model to translate code across multiple programming languages is superior to separately learning from bilingual parallel data. We propose Variational Interaction for Multilingual Program Translation (VIM-PT), a disentanglement-based generative approach that jointly trains a unified model for multilingual program translation across multiple languages. VIM-PT disentangles code into language-shared and language-specific features, using variational inference and interaction information with a novel lower bound, then achieves program translation through conditional generation. VIM-PT demonstrates four advantages: 1) captures language-shared information more accurately from various implementations and improves the quality of multilingual program translation, 2) mines and leverages the capability of non-parallel data, 3) addresses the distribution shift of program semantics across languages, and 4) serves as a unified model, reducing deployment complexity.

[AI-78] Variational autoencoder-based neural network model compression

链接: https://arxiv.org/abs/2408.14513
作者: Liang Cheng,Peiyuan Guan,Amir Taherkordi,Lei Liu,Dapeng Lan
关键词-EN: Variational Autoencoders, shown great great, great great peformance, neural network, Convolutional Neural Network
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Variational Autoencoders (VAEs), as a form of deep generative model, have been widely used in recent years and have shown great performance in a number of different domains, including image generation and anomaly detection. This paper aims to explore a neural network model compression method based on VAEs. The experiment uses different neural network models for MNIST recognition as compression targets, including the Feedforward Neural Network (FNN), Convolutional Neural Network (CNN), Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM). These models are the most basic models in deep learning, and other more complex and advanced models are based on them or inherit their features and evolve from them. In the experiment, the first step is to train the models mentioned above; each trained model will have a different accuracy and number of total parameters. The parameter variants of each model are then processed as training data in VAEs separately, and the trained VAEs are tested against the true model parameters. The experimental results show that using the latent space as a representation for model compression can improve the compression rate compared to some traditional methods such as pruning and quantization, while accuracy is not greatly affected when using the model parameters reconstructed from the latent space. In the future, a variety of different large-scale deep learning models will be used more widely, so exploring different ways to save time and space when saving or transferring models will become necessary, and the use of VAEs in this paper can provide a basis for these further explorations.
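A minimal sketch of the pipeline's core step, assuming the simplest setup (one flattened parameter vector per trained model, fully connected encoder and decoder; all sizes are placeholders):

```python
import torch
import torch.nn as nn

def flatten_params(model):
    """Concatenate all weights of a trained model into one vector,
    which becomes a single training sample for the compressor VAE."""
    return torch.cat([p.detach().flatten() for p in model.parameters()])

class ParamVAE(nn.Module):
    def __init__(self, dim, latent=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim, 512), nn.ReLU())
        self.mu, self.logvar = nn.Linear(512, latent), nn.Linear(512, latent)
        self.dec = nn.Sequential(nn.Linear(latent, 512), nn.ReLU(), nn.Linear(512, dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        return self.dec(z), mu, logvar

def vae_loss(recon, x, mu, logvar):
    """Reconstruction + KL divergence; the latent code z is the
    compressed representation of the model's parameters."""
    rec = nn.functional.mse_loss(recon, x, reduction="sum")
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kld
```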

[AI-79] LLMs as Zero-shot Graph Learners: Alignment of GNN Representations with LLM Token Embeddings

链接: https://arxiv.org/abs/2408.14512
作者: Duo Wang,Yuan Zuo,Fengzhi Li,Junjie Wu
关键词-EN: scarce labeled data, garnered significant interest, significant interest due, graph neural networks, neural networks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Zero-shot graph machine learning, especially with graph neural networks (GNNs), has garnered significant interest due to the challenge of scarce labeled data. While methods like self-supervised learning and graph prompt learning have been extensively explored, they often rely on fine-tuning with task-specific labels, limiting their effectiveness in zero-shot scenarios. Inspired by the zero-shot capabilities of instruction-fine-tuned large language models (LLMs), we introduce a novel framework named Token Embedding-Aligned Graph Language Model (TEA-GLM) that leverages LLMs as cross-dataset and cross-task zero-shot learners for graph machine learning. Concretely, we pretrain a GNN, aligning its representations with token embeddings of an LLM. We then train a linear projector that transforms the GNN’s representations into a fixed number of graph token embeddings without tuning the LLM. A unified instruction is designed for various graph tasks at different levels, such as node classification (node-level) and link prediction (edge-level). These design choices collectively enhance our method’s effectiveness in zero-shot learning, setting it apart from existing methods. Experiments show that our graph token embeddings help the LLM predictor achieve state-of-the-art performance on unseen datasets and tasks compared to other methods using LLMs as predictors.
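The trainable part of the method is essentially a linear projector between the two representation spaces; a minimal sketch, with all dimensions and the token count as assumptions:

```python
import torch
import torch.nn as nn

class GraphTokenProjector(nn.Module):
    """Projects a (frozen) GNN representation into a fixed number of
    pseudo token embeddings consumed by a frozen LLM."""
    def __init__(self, gnn_dim, llm_dim, n_tokens=8):
        super().__init__()
        self.proj = nn.Linear(gnn_dim, n_tokens * llm_dim)
        self.n_tokens, self.llm_dim = n_tokens, llm_dim

    def forward(self, gnn_repr):           # [B, gnn_dim]
        out = self.proj(gnn_repr)          # [B, n_tokens * llm_dim]
        return out.view(-1, self.n_tokens, self.llm_dim)

# The resulting graph tokens are prepended to the embedded instruction
# text; only self.proj is trained, while the GNN and LLM stay frozen.
```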

[AI-80] Unveiling the Statistical Foundations of Chain-of-Thought Prompting Methods

链接: https://arxiv.org/abs/2408.14511
作者: Xinyang Hu,Fengzhuo Zhang,Siyu Chen,Zhuoran Yang
关键词-EN: solving multi-step reasoning, multi-step reasoning problem, large language models, multi-step reasoning, gained popularity
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注: 150 pages, 18 figures, 3 tables

点击查看摘要

Abstract:Chain-of-Thought (CoT) prompting and its variants have gained popularity as effective methods for solving multi-step reasoning problems using pretrained large language models (LLMs). In this work, we analyze CoT prompting from a statistical estimation perspective, providing a comprehensive characterization of its sample complexity. To this end, we introduce a multi-step latent variable model that encapsulates the reasoning process, where the latent variable encodes the task information. Under this framework, we demonstrate that when the pretraining dataset is sufficiently large, the estimator formed by CoT prompting is equivalent to a Bayesian estimator. This estimator effectively solves the multi-step reasoning problem by aggregating a posterior distribution inferred from the demonstration examples in the prompt. Moreover, we prove that the statistical error of the CoT estimator can be decomposed into two main components: (i) a prompting error, which arises from inferring the true task using CoT prompts, and (ii) the statistical error of the pretrained LLM. We establish that, under appropriate assumptions, the prompting error decays exponentially to zero as the number of demonstrations increases. Additionally, we explicitly characterize the approximation and generalization errors of the pretrained LLM. Notably, we construct a transformer model that approximates the target distribution of the multi-step reasoning problem with an error that decreases exponentially in the number of transformer blocks. Our analysis extends to other variants of CoT, including Self-Consistent CoT, Tree-of-Thought, and Selection-Inference, offering a broad perspective on the efficacy of these methods. We also provide numerical experiments to validate the theoretical findings.
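In schematic form, the decomposition described above can be written as follows; constants and exact regularity conditions are in the paper, so this is only an illustrative shape of the bound:

```latex
\[
\mathrm{err}\big(\widehat{f}_{\mathrm{CoT}}\big)
\;\lesssim\;
\underbrace{C_{1}\, e^{-c\, n}}_{\text{prompting error, } n \text{ demonstrations}}
\;+\;
\underbrace{\mathrm{err}_{\mathrm{pretrain}}(\mathrm{LLM})}_{\text{approximation + generalization of the pretrained LLM}}
\]
```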

[AI-81] Artificial intelligence for science: The easy and hard problems

链接: https://arxiv.org/abs/2408.14508
作者: Ruairidh M. Battleday,Samuel J. Gershman
关键词-EN: impressive scientific discoveries, artificial intelligence, suite of impressive, driven by recent, recent advances
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注: 16 pages, 3 boxes, 4 figures

点击查看摘要

Abstract:A suite of impressive scientific discoveries have been driven by recent advances in artificial intelligence. These almost all result from training flexible algorithms to solve difficult optimization problems specified in advance by teams of domain scientists and engineers with access to large amounts of data. Although extremely useful, this kind of problem solving only corresponds to one part of science - the “easy problem.” The other part of scientific research is coming up with the problem itself - the “hard problem.” Solving the hard problem is beyond the capacities of current algorithms for scientific discovery because it requires continual conceptual revision based on poorly defined constraints. We can make progress on understanding how humans solve the hard problem by studying the cognitive science of scientists, and then use the results to design new computational agents that automatically infer and update their scientific paradigms.

[AI-82] Cost-Aware Uncertainty Reduction in Schema Matching with GPT-4: The Prompt-Matcher Framework

链接: https://arxiv.org/abs/2408.14507
作者: Longyu Feng,Huahang Li,Chen Jason Zhang
关键词-EN: database management systems, data warehousing, schema matching algorithms, Schema matching, current schema matching
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Schema matching is the process of identifying correspondences between the elements of two given schemata, essential for database management systems, data integration, and data warehousing. The inherent uncertainty of current schema matching algorithms leads to the generation of a set of candidate matches. Storing these results necessitates the use of databases and systems capable of handling probabilistic queries. This complicates the querying process and increases the associated storage costs. Motivated by GPT-4's outstanding performance, we explore its potential to reduce this uncertainty. Our proposal is to supplant the role of crowdworkers with GPT-4 for querying the set of candidate matches. To get more precise correspondence verification responses from GPT-4, we have crafted Semantic-match and Abbreviation-match prompts for GPT-4, achieving state-of-the-art recall rates on two benchmark datasets: DeepMDatasets 100% (+0.0) and Fabricated-Datasets 91.8% (+2.2). To optimise budget utilisation, we have devised a cost-aware solution. Within the constraints of the budget, our solution delivers favourable outcomes with minimal time expenditure. We introduce a novel framework, Prompt-Matcher, to reduce the uncertainty in the process of integrating multiple automatic schema matching algorithms and selecting complex parameterizations. It assists users in diminishing the uncertainty associated with candidate schema match results and in optimally ranking the most promising matches. We formally define the Correspondence Selection Problem (CSP), aiming to optimise the revenue within the confines of the GPT-4 budget. We demonstrate that CSP is NP-hard and propose an approximation algorithm with minimal time expenditure. Ultimately, we demonstrate the efficacy of Prompt-Matcher through rigorous experiments.
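The verification step can be pictured as a single yes/no call per candidate correspondence; the prompt wording below is our paraphrase, not the paper's exact Semantic-match prompt:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def verify_match(attr_a: str, attr_b: str) -> bool:
    """Ask GPT-4 whether two schema attributes correspond.
    The prompt text is an illustrative stand-in."""
    prompt = (
        f"Schema A attribute: '{attr_a}'. Schema B attribute: '{attr_b}'. "
        "Do these two attributes refer to the same real-world concept? "
        "Consider synonyms and common abbreviations. Answer 'yes' or 'no'."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")
```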

[AI-83] Empowering Pre-Trained Language Models for Spatio-Temporal Forecasting via Decoupling Enhanced Discrete Reprogramming

链接: https://arxiv.org/abs/2408.14505
作者: Hao Wang,Jindong Han,Wei Fan,Hao Liu
关键词-EN: time series forecasting, series forecasting plays, Pre-trained Language Models, time series, energy management
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Spatio-temporal time series forecasting plays a critical role in various real-world applications, such as transportation optimization, energy management, and climate analysis. The recent advancements in Pre-trained Language Models (PLMs) have inspired efforts to reprogram these models for time series forecasting tasks, by leveraging their superior reasoning and generalization capabilities. However, existing approaches fall short in handling complex spatial inter-series dependencies and intrinsic intra-series frequency components, limiting their spatio-temporal forecasting performance. Moreover, the linear mapping of continuous time series to a compressed subset vocabulary in reprogramming constrains the spatio-temporal semantic expressivity of PLMs and may lead to potential information bottleneck. To overcome the above limitations, we propose RePST, a tailored PLM reprogramming framework for spatio-temporal forecasting. The key insight of RePST is to decouple the spatio-temporal dynamics in the frequency domain, allowing better alignment with the PLM text space. Specifically, we first decouple spatio-temporal data in Fourier space and devise a structural diffusion operator to obtain temporal intrinsic and spatial diffusion signals, making the dynamics more comprehensible and predictable for PLMs. To avoid information bottleneck from a limited vocabulary, we further propose a discrete reprogramming strategy that selects relevant discrete textual information from an expanded vocabulary space in a differentiable manner. Extensive experiments on four real-world datasets show that our proposed approach significantly outperforms state-of-the-art spatio-temporal forecasting models, particularly in data-scarce scenarios.

[AI-84] Is Functional Correctness Enough to Evaluate Code Language Models? Exploring Diversity of Generated Codes

链接: https://arxiv.org/abs/2408.14504
作者: Heejae Chon,Seonghyeon Lee,Jinyoung Yeo,Dongha Lee
关键词-EN: natural language requirements, exhibited impressive abilities, language requirements, natural language, exhibited impressive
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: 15pages, 6 figures, 8 tables

点击查看摘要

Abstract:Language models (LMs) have exhibited impressive abilities in generating codes from natural language requirements. In this work, we highlight the diversity of code generated by LMs as a critical criterion for evaluating their code generation capabilities, in addition to functional correctness. Despite its practical implications, there is a lack of studies focused on assessing the diversity of generated code, which overlooks its importance in the development of code LMs. We propose a systematic approach to evaluate the diversity of generated code, utilizing various metrics for inter-code similarity as well as functional correctness. Specifically, we introduce a pairwise code similarity measure that leverages large LMs’ capabilities in code understanding and reasoning, demonstrating the highest correlation with human judgment. We extensively investigate the impact of various factors on the quality of generated code, including model sizes, temperatures, training approaches, prompting strategies, and the difficulty of input problems. Our consistent observation of a positive correlation between the test pass score and the inter-code similarity score indicates that current LMs tend to produce functionally correct code with limited diversity.
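The diversity criterion itself reduces to averaging a similarity function over all pairs of sampled programs; a minimal sketch, where `sim` stands in for whatever inter-code similarity is chosen (e.g., the LLM-judged measure the paper proposes):

```python
import itertools

def mean_pairwise_similarity(codes, sim):
    """Average inter-code similarity over all pairs of generations;
    lower values indicate a more diverse sample set."""
    pairs = list(itertools.combinations(codes, 2))
    if not pairs:
        raise ValueError("need at least two generated programs")
    return sum(sim(a, b) for a, b in pairs) / len(pairs)

# Example with a trivial token-overlap similarity (a placeholder for
# the paper's LLM-based measure):
jaccard = lambda a, b: len(set(a.split()) & set(b.split())) / len(set(a.split()) | set(b.split()))
print(mean_pairwise_similarity(["def f(x): return x", "def g(y): return y"], jaccard))
```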

[AI-85] Applying graph neural network to SupplyGraph for supply chain network

链接: https://arxiv.org/abs/2408.14501
作者: Kihwan Han
关键词-EN: Supply chain, Supply chain networks, networks describe interactions, supply chain dataset, chain networks describe
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 8 pages, 5 figures

点击查看摘要

Abstract:Supply chain networks describe interactions between products, manufacturing facilities, and storage in the context of supply and demand of the products. Supply chain data are inherently graph-structured; thus, they can be fertile ground for applications of graph neural networks (GNNs). Very recently, a supply chain dataset, SupplyGraph, has been released to the public. Though the SupplyGraph dataset is valuable given the scarcity of publicly available data, there was little clarity on the description of the dataset, the data quality assurance process, and the hyperparameters of the selected models. Further, for generalizability of findings, it would be more convincing to present the findings by performing statistical analyses on the distribution of errors rather than showing the average value of the errors. Therefore, this study assessed the supply chain dataset, SupplyGraph, with better clarity on the analysis processes, data quality assurance, and machine learning (ML) model specifications. After data quality assurance procedures, this study compared the performance of Multilayer Perceptrons (MLP), Graph Convolution Network (GCN), and Graph Attention Network (GAT) on a demand forecasting task while matching hyperparameters as closely as feasible. The analyses revealed that GAT performed best, followed by GCN and MLP. Those performance improvements were statistically significant at α = 0.05 after correction for multiple comparisons. This study also discussed several considerations in applying GNNs to supply chain networks. The current study reinforces the previous study on the supply chain benchmark dataset with respect to the description of the dataset and methodology, so that future research in applications of GNNs to supply chains becomes more reproducible.
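For reference, the compared architectures differ only in the graph convolution operator; a minimal PyTorch Geometric sketch for node-level demand forecasting (layer sizes and depth are placeholders, not the study's exact configuration):

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv, GATConv

class DemandForecaster(nn.Module):
    """Two-layer graph model; swap conv='gcn'/'gat' to reproduce the
    kind of comparison the study runs."""
    def __init__(self, in_dim, hidden=64, conv="gat"):
        super().__init__()
        Conv = GATConv if conv == "gat" else GCNConv
        self.c1, self.c2 = Conv(in_dim, hidden), Conv(hidden, hidden)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x, edge_index):
        h = torch.relu(self.c1(x, edge_index))
        h = torch.relu(self.c2(h, edge_index))
        return self.head(h).squeeze(-1)  # one demand value per node
```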

[AI-86] SHEDAD: SNN-Enhanced District Heating Anomaly Detection for Urban Substations

链接: https://arxiv.org/abs/2408.14499
作者: Jonne van Dreven,Abbas Cheddad,Sadi Alawadi,Ahmad Nauman Ghazi,Jad Al Koussa,Dirk Vanhoudt
关键词-EN: energy-efficient urban heating, Enhanced District Heating, District Heating Anomaly, Neighbor Enhanced District, District Heating
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 12 pages, 5 figures, FMEC2024

点击查看摘要

Abstract:District Heating (DH) systems are essential for energy-efficient urban heating. However, despite the advancements in automated fault detection and diagnosis (FDD), DH still faces challenges in operational faults that impact efficiency. This study introduces the Shared Nearest Neighbor Enhanced District Heating Anomaly Detection (SHEDAD) approach, designed to approximate the DH network topology and allow for local anomaly detection without disclosing sensitive information, such as substation locations. The approach leverages a multi-adaptive k-Nearest Neighbor (k-NN) graph to improve the initial neighborhood creation. Moreover, it introduces a merging technique that reduces noise and eliminates trivial edges. We use the Median Absolute Deviation (MAD) and modified z-scores to flag anomalous substations. The results reveal that SHEDAD outperforms traditional clustering methods, achieving significantly lower intra-cluster variance and distance. Additionally, SHEDAD effectively isolates and identifies two distinct categories of anomalies: supply temperatures and substation performance. We identified 30 anomalous substations and reached a sensitivity of approximately 65% and specificity of approximately 97%. By focusing on this subset of poor-performing substations in the network, SHEDAD enables more targeted and effective maintenance interventions, which can reduce energy usage while optimizing network performance.
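The anomaly flagging step uses standard robust statistics; a minimal sketch of MAD-based modified z-scores (the 3.5 cutoff is a common convention from Iglewicz & Hoaglin, not necessarily the paper's threshold):

```python
import numpy as np

def modified_zscores(x):
    """Modified z-scores based on the median absolute deviation (MAD);
    the 0.6745 factor makes MAD consistent with the standard deviation
    under normality."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    return 0.6745 * (x - med) / (mad + 1e-12)  # guard against MAD == 0

scores = modified_zscores([60.1, 59.8, 60.3, 45.2, 60.0])  # e.g. supply temperatures
anomalous = np.abs(scores) > 3.5  # flags the 45.2 reading
```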

[AI-87] A New Era in Computational Pathology: A Survey on Foundation and Vision-Language Models

链接: https://arxiv.org/abs/2408.14496
作者: Dibaloke Chanda,Milan Aryal,Nasim Yahya Soltani,Masoud Ganji
关键词-EN: integrating foundation models, existing deep learning, deep learning approaches, deep learning, decision-making process
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Image and Video Processing (eess.IV)
*备注: Initial Version

点击查看摘要

Abstract:Recent advances in deep learning have completely transformed the domain of computational pathology (CPath), which in turn altered the diagnostic workflow of pathologists by integrating foundation models (FMs) and vision-language models (VLMs) in their assessment and decision-making process. FMs overcome the limitations of existing deep learning approaches in CPath by learning a representation space that can be adapted to a wide variety of downstream tasks without explicit supervision. VLMs allow pathology reports written in natural language to be used as a rich semantic information source to improve existing models as well as generate predictions in natural language form. In this survey, a holistic and systematic overview of recent innovations in FMs and VLMs in CPath is presented. Furthermore, the tools, datasets and training schemes for these models are summarized in addition to categorizing them into distinct groups. This extensive survey highlights the current trends in CPath and the way it is going to be transformed through FMs and VLMs in the future.

[AI-88] Knowledge Graph Modeling-Driven Large Language Model Operating System (LLM OS) for Task Automation in Process Engineering Problem-Solving

链接: https://arxiv.org/abs/2408.14494
作者: Sakhinana Sagar Srinivas,Vijay Sri Vaikunth,Venkataramana Runkana
关键词-EN: Engineering Operations Assistant, Operations Assistant, Process Engineering Operations, AI-driven framework designed, solve complex problems
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted for Publication by Association for the Advancement of Artificial Intelligence, Fall Symposium Series

点击查看摘要

Abstract:We present the Process Engineering Operations Assistant (PEOA), an AI-driven framework designed to solve complex problems in the chemical and process industries. The framework employs a modular architecture orchestrated by a meta-agent, which serves as the central coordinator, managing an action generator and instruction-tuned small-scale language models (expert models). The action generator decomposes complex problems into sub-tasks and identifies suitable expert models to execute each, delivering precise solutions for multi-step problem-solving. Key techniques include advanced knowledge modeling using property graphs for improved information retrieval, facilitating more accurate and contextually relevant solutions. Additionally, the framework utilizes a teacher-student transfer-learning approach with GPT-4 (Omni) to fine-tune the action generator and expert models for domain adaptation, alongside an iterative problem-solving mechanism with sophisticated error handling. Custom datasets were developed to evaluate the framework against leading proprietary language models on various engineering tasks. The results demonstrate the framework's effectiveness in automating calculations, accelerating prototyping, and providing AI-augmented decision support for industrial processes, marking a significant advancement in process engineering capabilities.

[AI-89] Active learning of digenic functions with boolean matrix logic programming

链接: https://arxiv.org/abs/2408.14487
作者: Lun Ai,Stephen H. Muggleton,Shi-shun Liang,Geoff S. Baldwin
关键词-EN: drive biological discovery, apply logic-based machine, processes called genome-scale, logic-based machine learning, machine learning techniques
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Symbolic Computation (cs.SC); Molecular Networks (q-bio.MN)
*备注: arXiv admin note: substantial text overlap with arXiv:2405.06724

点击查看摘要

Abstract:We apply logic-based machine learning techniques to facilitate cellular engineering and drive biological discovery, based on comprehensive databases of metabolic processes called genome-scale metabolic network models (GEMs). Predicted host behaviours are not always correctly described by GEMs. Learning the intricate genetic interactions within GEMs presents computational and empirical challenges. To address these, we describe a novel approach called Boolean Matrix Logic Programming (BMLP) by leveraging boolean matrices to evaluate large logic programs. We introduce a new system, BMLP_active, which efficiently explores the genomic hypothesis space by guiding informative experimentation through active learning. In contrast to sub-symbolic methods, BMLP_active encodes a state-of-the-art GEM of a widely accepted bacterial host in an interpretable and logical representation using datalog logic programs. Notably, BMLP_active can successfully learn the interaction between a gene pair with fewer training examples than random experimentation, overcoming the increase in experimental design space. BMLP_active enables rapid optimisation of metabolic models and offers a realistic approach to a self-driving lab for microbial engineering.
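The core trick of BMLP, evaluating a recursive logic program with boolean matrices, can be illustrated on the classic reachability program; this toy sketch is ours, not the BMLP_active system:

```python
import numpy as np

def transitive_closure(adj):
    """Evaluate the recursive datalog rules
        path(X,Z) :- edge(X,Z).
        path(X,Z) :- edge(X,Y), path(Y,Z).
    by iterating boolean matrix products to a fixpoint."""
    reach = adj.astype(bool)
    while True:
        step = (reach.astype(int) @ adj.astype(int)) > 0  # one rule application
        nxt = reach | step
        if (nxt == reach).all():
            return nxt
        reach = nxt

edges = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]])
print(transitive_closure(edges).astype(int))  # node 0 reaches both 1 and 2
```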

[AI-90] Agentic Retrieval-Augmented Generation for Time Series Analysis KDD2024

链接: https://arxiv.org/abs/2408.14484
作者: Chidaksh Ravuru,Sagar Srinivas Sakhinana,Venkataramana Runkana
关键词-EN: predict task-specific outcomes, complex spatio-temporal dependencies, Time series modeling, Time series, time series tasks
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Paper was accepted for Undergraduate Consortium at ACM KDD, 2024. Please find the link: this https URL

点击查看摘要

Abstract:Time series modeling is crucial for many applications; however, it faces challenges such as complex spatio-temporal dependencies and distribution shifts in learning from historical context to predict task-specific outcomes. To address these challenges, we propose a novel approach using an agentic Retrieval-Augmented Generation (RAG) framework for time series analysis. The framework leverages a hierarchical, multi-agent architecture where the master agent orchestrates specialized sub-agents and delegates the end-user request to the relevant sub-agent. The sub-agents utilize smaller, pre-trained language models (SLMs) customized for specific time series tasks through fine-tuning using instruction tuning and direct preference optimization, and retrieve relevant prompts from a shared repository of prompt pools containing distilled knowledge about historical patterns and trends to improve predictions on new data. Our proposed modular, multi-agent RAG approach offers flexibility and achieves state-of-the-art performance across major time series tasks by tackling complex challenges more effectively than task-specific customized methods across benchmark datasets.

[AI-91] Handling abort commands for household kitchen robots

链接: https://arxiv.org/abs/2408.14480
作者: Darius Has,Adrian Groza,Mihai Pomarlan
关键词-EN: handling abort commands, handling abort, Domain Definition Language, abort commands, Planning Domain Definition
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We propose a solution for handling abort commands given to robots. The solution is exemplified with a running scenario with household kitchen robots. The robot uses planning to find sequences of actions that must be performed in order to gracefully cancel a previously received command. The Planning Domain Definition Language (PDDL) is used to write a domain to model kitchen activities and behaviours, and this domain is enriched with knowledge from online ontologies and knowledge graphs, like DBPedia. We discuss the results obtained in different scenarios.

[AI-92] Fundus2Video: Cross-Modal Angiography Video Generation from Static Fundus Photography with Clinical Knowledge Guidance MICCAI

链接: https://arxiv.org/abs/2408.15217
作者: Weiyi Zhang,Siyu Huang,Jiancheng Yang,Ruoyu Chen,Zongyuan Ge,Yingfeng Zheng,Danli Shi,Mingguang He
关键词-EN: Fundus Fluorescein Angiography, Fluorescein Angiography, assessing retinal vascular, Fundus Fluorescein, retinal vascular dynamics
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: The paper has been accepted by Medical Image Computing and Computer Assisted Intervention Society (MICCAI) 2024

点击查看摘要

Abstract:Fundus Fluorescein Angiography (FFA) is a critical tool for assessing retinal vascular dynamics and aiding in the diagnosis of eye diseases. However, its invasive nature and lower accessibility compared to Color Fundus (CF) images pose significant challenges. Current CF to FFA translation methods are limited to static generation. In this work, we pioneer dynamic FFA video generation from static CF images. We introduce an autoregressive GAN for smooth, memory-saving frame-by-frame FFA synthesis. To enhance the focus on dynamic lesion changes in FFA regions, we design a knowledge mask based on clinical experience. Leveraging this mask, our approach integrates innovative knowledge mask-guided techniques, including knowledge-boosted attention, knowledge-aware discriminators, and mask-enhanced patchNCE loss, aimed at refining generation in critical areas and addressing the pixel misalignment challenge. Our method achieves the best FVD of 1503.21 and PSNR of 11.81 compared to other common video generation approaches. Human assessment by an ophthalmologist confirms its high generation quality. Notably, our knowledge mask surpasses supervised lesion segmentation masks, offering a promising non-invasive alternative to traditional FFA for research and clinical applications. The code is available at this https URL.

[AI-93] Automatic 8-tissue Segmentation for 6-month Infant Brains MICCAI

链接: https://arxiv.org/abs/2408.15198
作者: Yilan Dong(1 and 2),Vanessa Kyriakopoulou(1 and 2),Irina Grigorescu(1),Grainne McAlonan(2),Dafnis Batalle(1 and 2),Maria Deprez(1) ((1) School of Biomedical Engineering & Imaging Sciences, King’s College London, London, United Kingdom, (2) Department of Forensic and Neurodevelopmental Science, Institute of Psychiatry, Psychology & Neuroscience, King’s College London, London, United Kingdom)
关键词-EN: atypical brain development, numerous infant studies, infancy and toddlerhood, neurodevelopmental condition, highlighted that atypical
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 11 pages, 4 figures, to be published in MICCAI PIPPI workshop

点击查看摘要

Abstract:Numerous studies have highlighted that atypical brain development, particularly during infancy and toddlerhood, is linked to an increased likelihood of being diagnosed with a neurodevelopmental condition, such as autism. Accurate brain tissue segmentations for morphological analysis are essential in numerous infant studies. However, due to ongoing white matter (WM) myelination changing tissue contrast in T1- and T2-weighted images, automatic tissue segmentation in 6-month infants is particularly difficult. On the other hand, manual labelling by experts is time-consuming and labor-intensive. In this study, we propose the first 8-tissue segmentation pipeline for six-month-old infant brains. This pipeline utilizes domain adaptation (DA) techniques to leverage our longitudinal data, including neonatal images segmented with the neonatal Developing Human Connectome Project structural pipeline. Our pipeline takes raw 6-month images as inputs and generates the 8-tissue segmentation as outputs, forming an end-to-end segmentation pipeline. The segmented tissues include WM, gray matter (GM), cerebrospinal fluid (CSF), ventricles, cerebellum, basal ganglia, brainstem, and hippocampus/amygdala. Cycle-Consistent Generative Adversarial Network (CycleGAN) and Attention U-Net were employed to achieve the image contrast transformation between neonatal and 6-month images and perform tissue segmentation on the synthesized 6-month images (neonatal images with 6-month intensity contrast), respectively. Moreover, we incorporated the segmentation outputs from Infant Brain Extraction and Analysis Toolbox (iBEAT) and another Attention U-Net to further enhance the performance and construct the end-to-end segmentation pipeline. Our evaluation with real 6-month images achieved a DICE score of 0.92, an HD95 of 1.6, and an ASSD of 0.42.

[AI-94] Sequential-Scanning Dual-Energy CT Imaging Using High Temporal Resolution Image Reconstruction and Error-Compensated Material Basis Image Generation

链接: https://arxiv.org/abs/2408.14754
作者: Qiaoxin Li,Ruifeng Chen,Peng Wang,Guotao Quan,Yanfeng Du,Dong Liang,Yinsheng Li
关键词-EN: Dual-energy computed tomography, precise medical diagnosis, obtain quantitative elemental, quantitative elemental composition, Dual-energy computed
类目: Medical Physics (physics.med-ph); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Instrumentation and Detectors (physics.ins-det)
*备注:

点击查看摘要

Abstract:Dual-energy computed tomography (DECT) has been widely used to obtain quantitative elemental composition of imaged subjects for personalized and precise medical diagnosis. Compared with DECT leveraging advanced X-ray source and/or detector technologies, the use of the sequential-scanning data acquisition scheme to implement DECT may make a broader impact on clinical practice because this scheme requires no specialized hardware designs and can be directly implemented into conventional CT systems. However, since the concentration of iodinated contrast agent in the imaged subject varies over time, sequentially scanned data sets acquired at two tube potentials are temporally inconsistent. As existing material basis image reconstruction approaches assume that the data sets acquired at two tube potentials are temporally consistent, the violation of this assumption results in inaccurate quantification of material concentration. In this work, we developed sequential-scanning DECT imaging using high temporal resolution image reconstruction and error-compensated material basis image generation, ACCELERATION for short, to address the technical challenge induced by temporal inconsistency of sequentially scanned data sets and improve quantification accuracy of material concentration in sequential-scanning DECT. ACCELERATION has been validated and evaluated using numerical simulation data sets generated from clinical human subject exams and experimental human subject studies. Results demonstrated the improvement of quantification accuracy and image quality using ACCELERATION.

[AI-95] Uncertainty Quantification in Alzheimer's Disease Progression Modeling

链接: https://arxiv.org/abs/2408.14478
作者: Wael Mobeirek,Shirley Mao
关键词-EN: early disease detection, Alzheimer Disease, diagnosed with Alzheimer, disease detection, early disease
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Information Theory (cs.IT)
*备注: This work was done as part of degree requirements for the authors in 2021-2022

点击查看摘要

Abstract:With the increasing number of patients diagnosed with Alzheimer's Disease, prognosis models have the potential to aid in early disease detection. However, current approaches raise dependability concerns as they do not account for uncertainty. In this work, we compare the performance of Monte Carlo Dropout, Variational Inference, Markov Chain Monte Carlo, and Ensemble Learning, trained on 512 patients, to predict 4-year cognitive score trajectories with confidence bounds. We show that MC Dropout and MCMC are able to produce well-calibrated and accurate predictions under noisy training data.
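Of the compared methods, MC Dropout is the simplest to reproduce; a minimal sketch (the 50-sample count is an assumption):

```python
import torch

def mc_dropout_predict(model, x, n_samples=50):
    """Sample repeated stochastic forward passes with dropout active;
    the mean is the point prediction and the std a confidence bound.
    Note: model.train() also switches batch-norm layers to training
    mode, so in practice only dropout modules should be toggled."""
    model.train()
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_samples)])
    return preds.mean(dim=0), preds.std(dim=0)
```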

计算机视觉

[CV-0] Drone-assisted Road Gaussian Splatting with Cross-view Uncertainty BMVC2024

链接: https://arxiv.org/abs/2408.15242
作者: Saining Zhang,Baijun Ye,Xiaoxue Chen,Yuantao Chen,Zongzheng Zhang,Cheng Peng,Yongliang Shi,Hao Zhao
关键词-EN: autonomous driving simulation, Robust and realistic, large-scale road scenes, road scene renderings, driving simulation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: BMVC2024 Project Page: this https URL Code: this https URL

点击查看摘要

Abstract:Robust and realistic rendering for large-scale road scenes is essential in autonomous driving simulation. Recently, 3D Gaussian Splatting (3D-GS) has made groundbreaking progress in neural rendering, but the general fidelity of large-scale road scene renderings is often limited by the input imagery, which usually has a narrow field of view and focuses mainly on the street-level local area. Intuitively, the data from the drone’s perspective can provide a complementary viewpoint for the data from the ground vehicle’s perspective, enhancing the completeness of scene reconstruction and rendering. However, training naively with aerial and ground images, which exhibit large view disparity, poses a significant convergence challenge for 3D-GS, and does not demonstrate remarkable improvements in performance on road views. In order to enhance the novel view synthesis of road views and to effectively use the aerial information, we design an uncertainty-aware training method that allows aerial images to assist in the synthesis of areas where ground images have poor learning outcomes instead of weighting all pixels equally in 3D-GS training like prior work did. We are the first to introduce the cross-view uncertainty to 3D-GS by matching the car-view ensemble-based rendering uncertainty to aerial images, weighting the contribution of each pixel to the training process. Additionally, to systematically quantify evaluation metrics, we assemble a high-quality synthesized dataset comprising both aerial and ground images for road scenes.
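The weighting idea can be sketched as an uncertainty-modulated photometric loss on aerial pixels; the normalized form below is our assumption, while the paper defines the exact cross-view weighting:

```python
import torch

def uncertainty_weighted_l1(render, target, uncertainty):
    """Down-weight aerial-image pixels where the ground-view ensemble
    is already confident, so aerial data mainly supervises regions
    with poor learning outcomes (a sketch, not the paper's exact rule)."""
    w = uncertainty / (uncertainty.mean() + 1e-8)  # per-pixel weights, mean 1
    return (w * (render - target).abs()).mean()
```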

[CV-1] GenRec: Unifying Video Generation and Recognition with Diffusion Models

链接: https://arxiv.org/abs/2408.15241
作者: Zejia Weng,Xitong Yang,Zhen Xing,Zuxuan Wu,Yu-Gang Jiang
关键词-EN: generate high-quality videos, Video diffusion models, learning strong spatial-temporal, Stable Video Diffusion, strong spatial-temporal priors
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 17 pages, 6 figures, 7 tables

点击查看摘要

Abstract:Video diffusion models are able to generate high-quality videos by learning strong spatial-temporal priors on large-scale datasets. In this paper, we aim to investigate whether such priors derived from a generative process are suitable for video recognition, and eventually joint optimization of generation and recognition. Building upon Stable Video Diffusion, we introduce GenRec, the first unified framework trained with a random-frame conditioning process so as to learn generalized spatial-temporal representations. The resulting framework naturally supports generation and recognition and, more importantly, is robust even when visual inputs contain limited information. Extensive experiments demonstrate the efficacy of GenRec for both recognition and generation. In particular, GenRec achieves competitive recognition performance, offering 75.8% and 87.2% accuracy on SSV2 and K400, respectively. GenRec also achieves the best class-conditioned image-to-video generation results, with 46.5 and 49.3 FVD scores on the SSV2 and EK-100 datasets. Furthermore, GenRec demonstrates extraordinary robustness in scenarios where only limited frames can be observed.

[CV-2] Generative Inbetweening: Adapting Image-to-Video Models for Keyframe Interpolation

链接: https://arxiv.org/abs/2408.15239
作者: Xiaojuan Wang,Boyang Zhou,Brian Curless,Ira Kemelmacher-Shlizerman,Aleksander Holynski,Steven M. Seitz
关键词-EN: generating video sequences, single input image, input key frames, key frame interpolation, sequences with coherent
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: project page: this https URL

点击查看摘要

Abstract:We present a method for generating video sequences with coherent motion between a pair of input key frames. We adapt a pretrained large-scale image-to-video diffusion model (originally trained to generate videos moving forward in time from a single input image) for key frame interpolation, i.e., to produce a video in between two input frames. We accomplish this adaptation through a lightweight fine-tuning technique that produces a version of the model that instead predicts videos moving backwards in time from a single input image. This model (along with the original forward-moving model) is subsequently used in a dual-directional diffusion sampling process that combines the overlapping model estimates starting from each of the two keyframes. Our experiments show that our method outperforms both existing diffusion-based methods and traditional frame interpolation techniques.

[CV-3] Learning-based Multi-View Stereo: A Survey

链接: https://arxiv.org/abs/2408.15235
作者: Fangjinhua Wang,Qingtian Zhu,Di Chang,Quankai Gao,Junlin Han,Tong Zhang,Richard Hartley,Marc Pollefeys
关键词-EN: recover the dense, Virtual Reality, aims to recover, MVS, methods
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:3D reconstruction aims to recover the dense 3D structure of a scene. It plays an essential role in various applications such as Augmented/Virtual Reality (AR/VR), autonomous driving and robotics. Leveraging multiple views of a scene captured from different viewpoints, Multi-View Stereo (MVS) algorithms synthesize a comprehensive 3D representation, enabling precise reconstruction in complex environments. Due to its efficiency and effectiveness, MVS has become a pivotal method for image-based 3D reconstruction. Recently, with the success of deep learning, many learning-based MVS methods have been proposed, achieving impressive performance against traditional methods. We categorize these learning-based methods as: depth map-based, voxel-based, NeRF-based, 3D Gaussian Splatting-based, and large feed-forward methods. Among these, we focus significantly on depth map-based methods, which are the main family of MVS due to their conciseness, flexibility and scalability. In this survey, we provide a comprehensive review of the literature at the time of this writing. We investigate these learning-based methods, summarize their performances on popular benchmarks, and discuss promising future research directions in this area.

[CV-4] DCT-CryptoNets: Scaling Private Inference in the Frequency Domain

链接: https://arxiv.org/abs/2408.15231
作者: Arjun Roy,Kaushik Roy
关键词-EN: offers unprecedented opportunities, fully homomorphic encryption, learning offers unprecedented, machine learning offers, FHE enables computation
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Under Review; 10 pages content, 3 pages appendix, 4 figures, 8 tables; Code TBD

点击查看摘要

Abstract:The convergence of fully homomorphic encryption (FHE) and machine learning offers unprecedented opportunities for private inference of sensitive data. FHE enables computation directly on encrypted data, safeguarding the entire machine learning pipeline, including data and model confidentiality. However, existing FHE-based implementations for deep neural networks face significant challenges in computational cost, latency, and scalability, limiting their practical deployment. This paper introduces DCT-CryptoNets, a novel approach that leverages frequency-domain learning to tackle these issues. Our method operates directly in the frequency domain, utilizing the discrete cosine transform (DCT) commonly employed in JPEG compression. This approach is inherently compatible with remote computing services, where images are usually transmitted and stored in compressed formats. DCT-CryptoNets reduces the computational burden of homomorphic operations by focusing on perceptually relevant low-frequency components. This is demonstrated by substantial latency reduction of up to 5.3× compared to prior work on image classification tasks, including a novel demonstration of ImageNet inference within 2.5 hours, down from 12.5 hours compared to prior work on equivalent compute resources. Moreover, DCT-CryptoNets improves the reliability of encrypted accuracy by reducing variability (e.g., from ±2.5% to ±1.0% on ImageNet). This study demonstrates a promising avenue for achieving efficient and practical privacy-preserving deep learning on high resolution images seen in real-world applications.
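The frequency-domain preprocessing amounts to a DCT followed by low-frequency truncation; a minimal sketch using SciPy (the block size `keep` is a placeholder):

```python
import numpy as np
from scipy.fft import dctn, idctn

def dct_lowfreq(image, keep=16):
    """Transform an image to the DCT domain (as in JPEG) and keep only
    the top-left keep x keep low-frequency block, the perceptually
    dominant components a DCT-CryptoNets-style model operates on."""
    coeffs = dctn(image, norm="ortho")
    low = coeffs[:keep, :keep]          # low-frequency block fed to the network
    recon = np.zeros_like(coeffs)
    recon[:keep, :keep] = low
    return low, idctn(recon, norm="ortho")  # also return a visual check

low, approx = dct_lowfreq(np.random.rand(64, 64), keep=16)
```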

[CV-5] Leveraging Hallucinations to Reduce Manual Prompt Dependency in Promptable Segmentation DATE

链接: https://arxiv.org/abs/2408.15205
作者: Jian Hu,Jiayi Lin,Junchi Yan,Shaogang Gong
关键词-EN: Promptable segmentation typically, segmentation typically requires, typically requires instance-specific, requires instance-specific manual, Multimodal Large Language
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: We propose using hallucinations as prior knowledge to extract and validate task-related information, which helps generate instance-specific prompts for reducing reliance on manual prompts in promptable segmentation

点击查看摘要

Abstract:Promptable segmentation typically requires instance-specific manual prompts to guide the segmentation of each desired object. To minimize such a need, task-generic promptable segmentation has been introduced, which employs a single task-generic prompt to segment various images of different objects in the same task. Current methods use Multimodal Large Language Models (MLLMs) to reason detailed instance-specific prompts from a task-generic prompt for improving segmentation accuracy. The effectiveness of this segmentation heavily depends on the precision of these derived prompts. However, MLLMs often suffer hallucinations during reasoning, resulting in inaccurate prompting. While existing methods focus on eliminating hallucinations to improve a model, we argue that MLLM hallucinations can reveal valuable contextual insights when leveraged correctly, as they represent pre-trained large-scale knowledge beyond individual images. In this paper, we utilize hallucinations to mine task-related information from images and verify its accuracy for enhancing the precision of the generated prompts. Specifically, we introduce an iterative Prompt-Mask Cycle generation framework (ProMaC) with a prompt generator and a mask generator. The prompt generator uses multi-scale chain-of-thought prompting, initially exploring hallucinations to extract extended contextual knowledge from a test image. These hallucinations are then reduced to formulate precise instance-specific prompts, directing the mask generator to produce masks that are consistent with task semantics via mask semantic alignment. The generated masks iteratively induce the prompt generator to focus more on task-relevant image areas and reduce irrelevant hallucinations, jointly resulting in better prompts and masks. Experiments on 5 benchmarks demonstrate the effectiveness of ProMaC. Code given in this https URL.
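The cycle itself can be summarized in a few lines of pseudocode; the interfaces (`mllm.propose`, `mllm.verify`, `segmenter`) are illustrative stand-ins for ProMaC's prompt and mask generators, not the released API:

```python
def promac_loop(image, task_prompt, mllm, segmenter, n_iters=3):
    """Sketch of the prompt-mask cycle: hallucinated candidates are
    verified against the image and distilled into instance-specific
    prompts, whose masks in turn focus the next round."""
    mask = None
    for _ in range(n_iters):
        candidates = mllm.propose(image, task_prompt, focus=mask)   # may hallucinate
        prompts = [c for c in candidates if mllm.verify(image, c)]  # keep grounded ones
        mask = segmenter(image, prompts)                            # mask generation
    return mask
```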

[CV-6] An Investigation on The Position Encoding in Vision-Based Dynamics Prediction ECCV2024

链接: https://arxiv.org/abs/2408.15201
作者: Jiageng Zhu,Hanchen Xie,Jiazhi Li,Mahyar Khayatkhoei,Wael AbdAlmageed
关键词-EN: utilizing RGB images, simple object descriptions, predict object states, utilizing RGB, RGB images
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 13 pages, 4 tables, and 3 figures. Accepted to ECCV2024 eXCV workshop

点击查看摘要

Abstract:Despite the success of vision-based dynamics prediction models, which predict object states by utilizing RGB images and simple object descriptions, they are challenged by environment misalignments. The literature has demonstrated that unifying visual domains with both environment context and object abstracts, such as semantic segmentation and bounding boxes, can effectively mitigate the visual domain misalignment challenge; however, those discussions focused on the abstraction of the environment context, and the insight of using bounding boxes as the object abstract remains under-explored. Furthermore, we notice that, as empirical results in the literature show, even when the visual appearance of objects is removed, object bounding boxes alone, instead of being directly fed into the network, can indirectly provide sufficient position information via the Region of Interest Pooling operation for dynamics prediction. However, previous literature overlooked discussions regarding how such position information is implicitly encoded in the dynamics prediction model. Thus, in this paper, we provide detailed studies to investigate the process and necessary conditions for encoding position information, via using the bounding box as the object abstract, into output features. Furthermore, we study the limitation of solely using object abstracts, showing that dynamics prediction performance is jeopardized when the environment context varies.

[CV-7] PoseWatch: A Transformer-based Architecture for Human-centric Video Anomaly Detection Using Spatio-temporal Pose Tokenization

链接: https://arxiv.org/abs/2408.15185
作者: Ghazal Alinezhad Noghre,Armin Danesh Pazho,Hamed Tabkhi
关键词-EN: Video Anomaly Detection, pose-based VAD, Video Anomaly, Anomaly Detection, VAD
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Video Anomaly Detection (VAD) presents a significant challenge in computer vision, particularly due to the unpredictable and infrequent nature of anomalous events, coupled with the diverse and dynamic environments in which they occur. Human-centric VAD, a specialized area within this domain, faces additional complexities, including variations in human behavior, potential biases in data, and substantial privacy concerns related to human subjects. These issues complicate the development of models that are both robust and generalizable. To address these challenges, recent advancements have focused on pose-based VAD, which leverages human pose as a high-level feature to mitigate privacy concerns, reduce appearance biases, and minimize background interference. In this paper, we introduce PoseWatch, a novel transformer-based architecture designed specifically for human-centric pose-based VAD. PoseWatch features an innovative Spatio-Temporal Pose and Relative Pose (ST-PRP) tokenization method that enhances the representation of human motion over time, which is also beneficial for broader human behavior analysis tasks. The architecture’s core, a Unified Encoder Twin Decoders (UETD) transformer, significantly improves the detection of anomalous behaviors in video data. Extensive evaluations across multiple benchmark datasets demonstrate that PoseWatch consistently outperforms existing methods, establishing a new state-of-the-art in pose-based VAD. This work not only demonstrates the efficacy of PoseWatch but also highlights the potential of integrating Natural Language Processing techniques with computer vision to advance human behavior analysis.

[CV-8] A Review of Transformer-Based Models for Computer Vision Tasks: Capturing Global Context and Spatial Relationships

链接: https://arxiv.org/abs/2408.15178
作者: Gracile Astlin Pereira,Muhammad Hussain
关键词-EN: natural language processing, computer vision tasks, computer vision, language processing, remarkable success
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Transformer-based models have transformed the landscape of natural language processing (NLP) and are increasingly applied to computer vision tasks with remarkable success. These models, renowned for their ability to capture long-range dependencies and contextual information, offer a promising alternative to traditional convolutional neural networks (CNNs) in computer vision. In this review paper, we provide an extensive overview of various transformer architectures adapted for computer vision tasks. We delve into how these models capture global context and spatial relationships in images, empowering them to excel in tasks such as image classification, object detection, and segmentation. Analyzing the key components, training methodologies, and performance metrics of transformer-based models, we highlight their strengths, limitations, and recent advancements. Additionally, we discuss potential research directions and applications of transformer-based models in computer vision, offering insights into their implications for future advancements in the field.

[CV-9] X-Reflect: Cross-Reflection Prompting for Multimodal Recommendation

链接: https://arxiv.org/abs/2408.15172
作者: Hanjia Lyu,Ryan Rossi,Xiang Chen,Md Mehrab Tanjim,Stefano Petrangeli,Somdeb Sarkhel,Jiebo Luo
关键词-EN: Large Language Models, Large Multimodal Models, Language Models, Large Language, Multimodal Models
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) and Large Multimodal Models (LMMs) have been shown to enhance the effectiveness of enriching item descriptions, thereby improving the accuracy of recommendation systems. However, most existing approaches either rely on text-only prompting or employ basic multimodal strategies that do not fully exploit the complementary information available from both textual and visual modalities. This paper introduces a novel framework, Cross-Reflection Prompting, termed X-Reflect, designed to address these limitations by prompting LMMs to explicitly identify and reconcile supportive and conflicting information between text and images. By capturing nuanced insights from both modalities, this approach generates more comprehensive and contextually richer item representations. Extensive experiments conducted on two widely used benchmarks demonstrate that our method outperforms existing prompting baselines in downstream recommendation accuracy. Additionally, we evaluate the generalizability of our framework across different LMM backbones and the robustness of the prompting strategies, offering insights for optimization. This work underscores the importance of integrating multimodal information and presents a novel solution for improving item understanding in multimodal recommendation systems.

[CV-10] Empowering Sign Language Communication: Integrating Sentiment and Semantics for Facial Expression Synthesis

链接: https://arxiv.org/abs/2408.15159
作者: Rafael Azevedo,Thiago Coutinho,João Ferreira,Thiago Gomes,Erickson Nascimento
关键词-EN: Translating written sentences, Sign Language Production, Translating written, non-manual gestures plays, Sign Language
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Translating written sentences from oral languages to a sequence of manual and non-manual gestures plays a crucial role in building a more inclusive society for deaf and hard-of-hearing people. Facial expressions (non-manual), in particular, are responsible for encoding the grammar of the sentence to be spoken, applying punctuation, pronouns, or emphasizing signs. These non-manual gestures are closely related to the semantics of the sentence being spoken and also to the utterance of the speaker's emotions. However, most Sign Language Production (SLP) approaches are centered on synthesizing manual gestures and do not focus on modeling the speaker's expression. This paper introduces a new method focused on synthesizing facial expressions for sign language. Our goal is to improve sign language production by integrating sentiment information in facial expression generation. The approach leverages sentence sentiment and semantic features to sample from a meaningful representation space, integrating the bias of the non-manual components into the sign language production process. To evaluate our method, we extend the Fréchet Gesture Distance (FGD), propose a new metric called the Fréchet Expression Distance (FED), and apply an extensive set of metrics to assess the quality of specific regions of the face. The experimental results show that our method achieves state-of-the-art results, outperforming the competitors on the How2Sign and PHOENIX14T datasets. Moreover, our architecture is based on a carefully designed graph pyramid that makes it simpler, easier to train, and capable of leveraging emotions to produce facial expressions.
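
For reference, the Fréchet-style distance underlying both FGD and the proposed FED can be computed as below; this is the standard Gaussian-fit Fréchet distance, and the choice of feature extractor (expression features for FED) is what distinguishes the metrics. The feature sets here are synthetic placeholders.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_a, feats_b):
    mu_a, mu_b = feats_a.mean(0), feats_b.mean(0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):          # numerical noise from sqrtm
        covmean = covmean.real
    diff = mu_a - mu_b
    return diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean)

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(500, 16))   # e.g. features of real expressions
fake = rng.normal(0.3, 1.1, size=(500, 16))   # e.g. features of synthesized ones
print(frechet_distance(real, fake))
```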

[CV-11] A Preliminary Exploration Towards General Image Restoration

链接: https://arxiv.org/abs/2408.15143
作者: Xiangtao Kong,Jinjin Gu,Yihao Liu,Wenlong Zhang,Xiangyu Chen,Yu Qiao,Chao Dong
关键词-EN: image restoration tasks, individual image restoration, image restoration, restoration tasks, major technical challenges
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Despite the tremendous success of deep models in various individual image restoration tasks, there are at least two major technical challenges preventing these works from being applied to real-world usages: (1) the lack of generalization ability and (2) the complex and unknown degradations in real-world scenarios. Existing deep models, tailored for specific individual image restoration tasks, often fall short in effectively addressing these challenges. In this paper, we present a new problem called general image restoration (GIR) which aims to address these challenges within a unified model. GIR covers most individual image restoration tasks (e.g., image denoising, deblurring, deraining, and super-resolution) and their combinations for general purposes. This paper proceeds to delineate the essential aspects of GIR, including problem definition and the overarching significance of generalization performance. Moreover, the establishment of new datasets and a thorough evaluation framework for GIR models is discussed. We conduct a comprehensive evaluation of existing approaches for tackling the GIR challenge, illuminating their strengths and pragmatic challenges. By analyzing these approaches, we not only underscore the effectiveness of GIR but also highlight the difficulties in its practical implementation. Finally, we also try to understand and interpret these models' behaviors to inspire future directions. Our work can open up valuable new research directions and contribute to research on general vision.

[CV-12] T-FAKE: Synthesizing Thermal Images for Facial Landmarking

链接: https://arxiv.org/abs/2408.15127
作者: Philipp Flotho(1),Moritz Piening(2),Anna Kukleva(3),Gabriele Steidl(2) ((1) Systems Neuroscience & Neurotechnology Unit, Faculty of Medicine, Saarland University & htw saar, (2) Institute of Mathematics, Technische Universität Berlin, (3) Max Planck Institute for Informatics, Saarland Informatics Campus)
关键词-EN: facial RGB datasets, autonomous driving, key component, wide range, range of applications
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 22 pages, 12 figures, Philipp Flotho and Moritz Piening share equal contribution

点击查看摘要

Abstract:Facial analysis is a key component in a wide range of applications such as security, autonomous driving, entertainment, and healthcare. Despite the availability of various facial RGB datasets, the thermal modality, which plays a crucial role in life sciences, medicine, and biometrics, has been largely overlooked. To address this gap, we introduce the T-FAKE dataset, a new large-scale synthetic thermal dataset with sparse and dense landmarks. To facilitate the creation of the dataset, we propose a novel RGB2Thermal loss function, which enables the transfer of thermal style to RGB faces. By utilizing the Wasserstein distance between thermal and RGB patches and the statistical analysis of clinical temperature distributions on faces, we ensure that the generated thermal images closely resemble real samples. Applying RGB2Thermal style transfer based on this loss function, we create the T-FAKE dataset of synthetic thermal faces. Leveraging our novel T-FAKE dataset, probabilistic landmark prediction, and label adaptation networks, we demonstrate significant improvements in landmark detection methods on thermal images across different landmark conventions. Our models show excellent performance with both sparse 70-point landmarks and dense 478-point landmark annotations. Our code and models are available at this https URL.

[CV-13] Machine Learning for Methane Detection and Quantification from Space – A survey

链接: https://arxiv.org/abs/2408.15122
作者: Enno Tiemann,Shanyu Zhou,Alexander Kläser,Konrad Heidler,Rochelle Schneider,Xiao Xiang Zhu
关键词-EN: Carbon Dioxide, anthropogenic greenhouse gas, potent anthropogenic greenhouse, warming than Carbon, greenhouse gas
类目: Computer Vision and Pattern Recognition (cs.CV); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注:

点击查看摘要

Abstract:Methane (CH₄) is a potent anthropogenic greenhouse gas, contributing 86 times more to global warming than Carbon Dioxide (CO₂) over 20 years, and it also acts as an air pollutant. Given its high radiative forcing potential and relatively short atmospheric lifetime (9±1 years), methane has important implications for climate change; therefore, cutting methane emissions is crucial for effective climate change mitigation. This work expands existing information on operational methane point source detection sensors in the Short-Wave Infrared (SWIR) bands. It reviews the state-of-the-art for traditional as well as Machine Learning (ML) approaches. The architecture and data used in such ML models are discussed separately for methane plume segmentation and emission rate estimation. Traditionally, experts rely on labor-intensive, manually adjusted methods for methane detection. However, ML approaches offer greater scalability. Our analysis reveals that ML models outperform traditional methods, particularly convolutional neural network (CNN) models built on the U-Net and transformer architectures. These ML models extract valuable information from methane-sensitive spectral data, enabling more accurate detection. Challenges arise when comparing these methods due to variations in data, sensor specifications, and evaluation metrics. To address this, we discuss existing datasets and metrics, providing an overview of available resources and identifying open research problems. Finally, we explore potential future advances in ML, emphasizing approaches for model comparability, large dataset creation, and the European Union's forthcoming methane strategy.

[CV-14] Urdu Digital Text Word Optical Character Recognition Using Permuted Auto Regressive Sequence Modeling

链接: https://arxiv.org/abs/2408.15119
作者: Ahmed Mustafa,Ijlal Baig,Hasan Sajid
关键词-EN: innovative word-level Optical, word-level Optical Character, Optical Character Recognition, word-level Optical, Optical Character
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This research paper introduces an innovative word-level Optical Character Recognition (OCR) model specifically designed for digital Urdu text recognition. Utilizing transformer-based architectures and attention mechanisms, the model was trained on a comprehensive dataset of approximately 160,000 Urdu text images, achieving a character error rate (CER) of 0.178, which highlights its superior accuracy in recognizing Urdu characters. The model’s strength lies in its unique architecture, incorporating the permuted autoregressive sequence (PARSeq) model, which allows for context-aware inference and iterative refinement by leveraging bidirectional context information to enhance recognition accuracy. Furthermore, its capability to handle a diverse range of Urdu text styles, fonts, and variations enhances its applicability in real-world scenarios. Despite its promising results, the model has some limitations, such as difficulty with blurred images, non-horizontal orientations, and overlays of patterns, lines, or other text, which can occasionally lead to suboptimal performance. Additionally, trailing or following punctuation marks can introduce noise into the recognition process. Addressing these challenges will be a focus of future research, aiming to refine the model further, explore data augmentation techniques, optimize hyperparameters, and integrate contextual improvements for more accurate and efficient Urdu text recognition.

[CV-15] Few-Shot Unsupervised Implicit Neural Shape Representation Learning with Spatial Adversaries ICML2024

链接: https://arxiv.org/abs/2408.15114
作者: Amine Ouasfi,Adnane Boukhayma
关键词-EN: Implicit Neural Representations, Neural Signed Distance, Implicit Neural, Signed Distance Functions, complex data modalities
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
*备注: ICML 2024

点击查看摘要

Abstract:Implicit Neural Representations have gained prominence as a powerful framework for capturing complex data modalities, encompassing a wide range from 3D shapes to images and audio. Within the realm of 3D shape representation, Neural Signed Distance Functions (SDF) have demonstrated remarkable potential in faithfully encoding intricate shape geometry. However, learning SDFs from sparse 3D point clouds in the absence of ground truth supervision remains a very challenging task. While recent methods rely on smoothness priors to regularize the learning, our method introduces a regularization term that leverages adversarial samples around the shape to improve the learned SDFs. Through extensive experiments and evaluations, we illustrate the efficacy of our proposed method, highlighting its capacity to improve SDF learning with respect to baselines and the state-of-the-art using synthetic and real data.
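
A minimal sketch of the idea, not the authors' exact formulation: perturb query points adversarially in the direction that most changes the predicted signed distance, then penalize instability of the SDF there. The toy MLP and FGSM-style step size are assumptions.

```python
import torch
import torch.nn as nn

sdf = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 1))
points = torch.rand(128, 3, requires_grad=True)        # sparse cloud (toy)

pred = sdf(points)
data_loss = pred.abs().mean()                          # surface points -> sdf ~ 0

# FGSM-style spatial adversary around the shape.
grad = torch.autograd.grad(pred.sum(), points, create_graph=True)[0]
eps = 0.01
adv_points = points + eps * grad.sign()
adv_loss = (sdf(adv_points) - pred).abs().mean()       # stability under perturbation

loss = data_loss + 0.1 * adv_loss
loss.backward()
print(float(loss))
```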

[CV-16] AnomalousPatchCore: Exploring the Use of Anomalous Samples in Industrial Anomaly Detection ECCV

链接: https://arxiv.org/abs/2408.15113
作者: Mykhailo Koshil,Tilman Wegener,Detlef Mentrup,Simone Frintrop,Christian Wilms
关键词-EN: quality control types, Visual inspection, industrial anomaly detection, anomaly detection, common quality control
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at the 2nd workshop on Vision-based InduStrial InspectiON (VISION) @ ECCV

点击查看摘要

Abstract:Visual inspection, or industrial anomaly detection, is one of the most common quality control types in manufacturing. The task is to identify the presence of an anomaly given an image, e.g., a missing component on an image of a circuit board, for subsequent manual inspection. While industrial anomaly detection has seen a surge in recent years, most anomaly detection methods still utilize knowledge only from normal samples, failing to leverage the information from the frequently available anomalous samples. Additionally, they heavily rely on very general feature extractors pre-trained on common image classification datasets. In this paper, we address these shortcomings and propose the new anomaly detection system AnomalousPatchCore (APC) based on a feature extractor fine-tuned with normal and anomalous in-domain samples and a subsequent memory bank for identifying unusual features. To fine-tune the feature extractor in APC, we propose three auxiliary tasks that address the different aspects of anomaly detection (classification vs. localization) and mitigate the effect of the imbalance between normal and anomalous samples. Our extensive evaluation on the MVTec dataset shows that APC outperforms state-of-the-art systems in detecting anomalies, which is especially important in industrial anomaly detection given the subsequent manual inspection. In detailed ablation studies, we further investigate the properties of our APC.
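
APC builds on a memory bank of the PatchCore kind; the scoring step can be sketched as below (a sketch only; APC's contribution is fine-tuning the feature extractor with normal and anomalous in-domain samples before filling the bank). Features here are random placeholders.

```python
import torch

torch.manual_seed(0)
memory_bank = torch.randn(1000, 128)          # features of normal patches
test_patches = torch.randn(64, 128)           # features from a test image

dists = torch.cdist(test_patches, memory_bank)   # (64, 1000) pairwise distances
patch_scores = dists.min(dim=1).values           # nearest-neighbor distance per patch
image_score = patch_scores.max()                 # image-level anomaly score
print(float(image_score))
```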

[CV-17] Enhancing License Plate Super-Resolution: A Layout-Aware and Character-Driven Approach

链接: https://arxiv.org/abs/2408.15103
作者: Valfride Nascimento,Rayson Laroca,Rafael O. Ribeiro,William Robson Schwartz,David Menotti
关键词-EN: License Plate Recognition, License Plate, advancements in License, Plate Recognition, rely on high-resolution
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted for presentation at the Conference on Graphics, Patterns and Images (SIBGRAPI) 2024

点击查看摘要

Abstract:Despite significant advancements in License Plate Recognition (LPR) through deep learning, most improvements rely on high-resolution images with clear characters. This scenario does not reflect real-world conditions where traffic surveillance often captures low-resolution and blurry images. Under these conditions, characters tend to blend with the background or neighboring characters, making accurate LPR challenging. To address this issue, we introduce a novel loss function, Layout and Character Oriented Focal Loss (LCOFL), which considers factors such as resolution, texture, and structural details, as well as the performance of the LPR task itself. We enhance character feature learning using deformable convolutions and shared weights in an attention module and employ a GAN-based training approach with an Optical Character Recognition (OCR) model as the discriminator to guide the super-resolution process. Our experimental results show significant improvements in character reconstruction quality, outperforming two state-of-the-art methods in both quantitative and qualitative measures. Our code is publicly available at this https URL

[CV-18] MTMamba: Enhancing Multi-Task Dense Scene Understanding via Mamba-Based Decoders

链接: https://arxiv.org/abs/2408.15101
作者: Baijiong Lin,Weisen Jiang,Pengguang Chen,Shu Liu,Ying-Cong Chen
关键词-EN: multiple dense prediction, multi-task dense prediction, Multi-task dense scene, application scenarios, Multi-task dense
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: arXiv admin note: text overlap with arXiv:2407.02228

点击查看摘要

Abstract:Multi-task dense scene understanding, which trains a model for multiple dense prediction tasks, has a wide range of application scenarios. Capturing long-range dependency and enhancing cross-task interactions are crucial to multi-task dense prediction. In this paper, we propose MTMamba++, a novel architecture for multi-task scene understanding featuring a Mamba-based decoder. It contains two types of core blocks: the self-task Mamba (STM) block and the cross-task Mamba (CTM) block. STM handles long-range dependency by leveraging state-space models, while CTM explicitly models task interactions to facilitate information exchange across tasks. We design two types of CTM block, namely F-CTM and S-CTM, to enhance cross-task interaction from feature and semantic perspectives, respectively. Experiments on NYUDv2, PASCAL-Context, and Cityscapes datasets demonstrate the superior performance of MTMamba++ over CNN-based and Transformer-based methods. The code is available at this https URL.

[CV-19] CLIP-AGIQA: Boosting the Performance of AI-Generated Image Quality Assessment with CLIP ICPR2024

链接: https://arxiv.org/abs/2408.15098
作者: Zhenchen Tang,Zichuan Wang,Bo Peng,Jing Dong
关键词-EN: generated images, quality, image quality assessment, generated, quality assessment
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: accepted by ICPR2024

点击查看摘要

Abstract:With the rapid development of generative technologies, AI-Generated Images (AIGIs) have been widely applied in various aspects of daily life. However, due to the immaturity of the technology, the quality of the generated images varies, so it is important to develop quality assessment techniques for the generated images. Although some models have been proposed to assess the quality of generated images, they are inadequate when faced with the ever-increasing and diverse categories of generated images. Consequently, the development of more advanced and effective models for evaluating the quality of generated images is urgently needed. Recent research has explored the significant potential of the visual language model CLIP in image quality assessment, finding that it performs well in evaluating the quality of natural images. However, its application to generated images has not been thoroughly investigated. In this paper, we build on this idea and further explore the potential of CLIP in evaluating the quality of generated images. We design CLIP-AGIQA, a CLIP-based regression model for quality assessment of generated images, leveraging rich visual and textual knowledge encapsulated in CLIP. Particularly, we implement multi-category learnable prompts to fully utilize the textual knowledge in CLIP for quality assessment. Extensive experiments on several generated image quality assessment benchmarks, including AGIQA-3K and AIGCIQA2023, demonstrate that CLIP-AGIQA outperforms existing IQA models, achieving excellent results in evaluating the quality of generated images.

[CV-20] Constrained Diffusion Models via Dual Training

链接: https://arxiv.org/abs/2408.15094
作者: Shervin Khalafi,Dongsheng Ding,Alejandro Ribeiro
关键词-EN: Diffusion, Diffusion models, constrained diffusion models, constrained diffusion, high fidelity
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)
*备注: 41 pages, 4 figures, 2 tables

点击查看摘要

Abstract:Diffusion models have attained prominence for their ability to synthesize a probability distribution for a given dataset via a diffusion process, enabling the generation of new data points with high fidelity. However, diffusion processes are prone to generating biased data based on the training dataset. To address this issue, we develop constrained diffusion models by imposing diffusion constraints based on desired distributions that are informed by requirements. Specifically, we cast the training of diffusion models under requirements as a constrained distribution optimization problem that aims to reduce the distribution difference between original and generated data while obeying constraints on the distribution of generated data. We show that our constrained diffusion models generate new data from a mixture data distribution that achieves the optimal trade-off among objective and constraints. To train constrained diffusion models, we develop a dual training algorithm and characterize the optimality of the trained constrained diffusion model. We empirically demonstrate the effectiveness of our constrained models in two constrained generation tasks: (i) we consider a dataset with one or more underrepresented classes where we train the model with constraints to ensure fairly sampling from all classes during inference; (ii) we fine-tune a pre-trained diffusion model to sample from a new dataset while avoiding overfitting.
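
The dual training idea can be illustrated on a toy constrained problem: alternate gradient descent on the Lagrangian with projected ascent on the multiplier. The objective and constraint below are placeholders, not the paper's diffusion losses.

```python
import torch

theta = torch.tensor([2.0], requires_grad=True)
lam = torch.tensor(0.0)                   # dual variable (multiplier)
opt = torch.optim.SGD([theta], lr=0.1)

def objective(t):  return (t - 1.0) ** 2   # stand-in for the training loss
def constraint(t): return 1.5 - t          # requires theta >= 1.5 (active)

for _ in range(200):
    opt.zero_grad()
    lagrangian = objective(theta) + lam * constraint(theta)
    lagrangian.backward()
    opt.step()                              # primal descent on the Lagrangian
    with torch.no_grad():
        lam = torch.clamp(lam + 0.05 * constraint(theta), min=0.0)  # dual ascent

print(float(theta), float(lam))  # theta -> 1.5 (constraint boundary), lam -> ~1.0
```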

[CV-21] MMASD: A Novel Dataset for Privacy-Preserving Behavior Analysis of Children with Autism Spectrum Disorder

链接: https://arxiv.org/abs/2408.15077
作者: Pavan Uttej Ravva,Behdokht Kiafar,Pinar Kullu,Jicheng Li,Anjana Bhat,Roghayeh Leila Barmaki
关键词-EN: comprehending communication signals, Autism spectrum disorder, spectrum disorder, communication signals, social interaction
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Autism spectrum disorder (ASD) is characterized by significant challenges in social interaction and comprehending communication signals. Recently, therapeutic interventions for ASD have increasingly utilized deep learning-powered computer vision techniques to monitor individual progress over time. These models are trained on private, non-public datasets from the autism community, creating challenges in comparing results across different models due to privacy-preserving data-sharing issues. This work introduces MMASD+. MMASD+ consists of diverse data modalities, including 3D-Skeleton, 3D Body Mesh, and Optical Flow data. It integrates the capabilities of Yolov8 and Deep SORT algorithms to distinguish between the therapist and children, addressing a significant barrier in the original dataset. Additionally, a Multimodal Transformer framework is proposed to predict 11 action types and the presence of ASD. This framework achieves an accuracy of 95.03% for predicting action types and 96.42% for predicting ASD presence, demonstrating over a 10% improvement compared to models trained on single data modalities. These findings highlight the advantages of integrating multiple data modalities within the Multimodal Transformer framework.

[CV-22] Geometric Artifact Correction for Symmetric Multi-Linear Trajectory CT: Theory Method and Generalization

链接: https://arxiv.org/abs/2408.15069
作者: Zhisheng Wang(1 and 2),Yanxu Sun(1 and 2),Shangyu Li(1 and 2),Legeng Lin(1 and 2),Shunli Wang(1 and 2),Junning Cui(1 and 2) ((1) Center of Ultra-precision Optoelectronic Instrument engineering, Harbin Institute of Technology, (2) Key Lab of Ultra-precision Intelligent Instrumentation, Harbin Institute of Technology)
关键词-EN: trajectory Computed Tomography, Multi-Linear trajectory Computed, Computed Tomography, perform non-destructive testing, trajectory Computed
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Instrumentation and Detectors (physics.ins-det)
*备注: 15 pages, 10 figures

点击查看摘要

Abstract:For extending CT field-of-view to perform non-destructive testing, the Symmetric Multi-Linear trajectory Computed Tomography (SMLCT) has been developed as a successful example of non-standard CT scanning modes. However, inevitable geometric errors can cause severe artifacts in the reconstructed images. The existing calibration method for SMLCT is both crude and inefficient. It involves reconstructing hundreds of images by exhaustively substituting each potential error, and then manually identifying the images with the fewest geometric artifacts to estimate the final geometric errors for calibration. In this paper, we comprehensively and efficiently address the challenging geometric artifacts in SMLCT; the corresponding work mainly involves theory, method, and generalization. In particular, after identifying sensitive parameters and conducting some theoretical analysis of geometric artifacts, we summarize several key properties between sensitive geometric parameters and artifact characteristics. Then, we further construct mathematical relationships that relate sensitive geometric errors to the pixel offsets of reconstruction images with artifact characteristics. To accurately extract pixel bias, we innovatively adapt the Generalized Cross-Correlation with Phase Transform (GCC-PHAT) algorithm, commonly used in sound processing, to our image registration task for each symmetric LCT pair. This adaptation leads to the design of a highly efficient rigid translation registration method. Simulation and physical experiments have validated the excellent performance of this work. Additionally, our results demonstrate significant generalization to common rotated CT and a variant of SMLCT.
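
For intuition, the GCC-PHAT adaptation amounts to phase correlation: whiten the cross-power spectrum so only phase remains, invert the FFT, and read the rigid translation off the correlation peak. A runnable 2D sketch with a known shift:

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((64, 64))
shifted = np.roll(img, shift=(5, -3), axis=(0, 1))   # known ground-truth offset

F1, F2 = np.fft.fft2(img), np.fft.fft2(shifted)
cross = F1 * np.conj(F2)
cross /= np.abs(cross) + 1e-12                        # PHAT: keep phase only
corr = np.fft.ifft2(cross).real

dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
# Wrap to signed offsets.
dy = dy - 64 if dy > 32 else dy
dx = dx - 64 if dx > 32 else dx
print(dy, dx)   # -> (-5, 3): the applied (5, -3) roll, recovered with opposite sign
```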

[CV-23] Adapting Segment Anything Model to Multi-modal Salient Object Detection with Semantic Feature Fusion Guidance

链接: https://arxiv.org/abs/2408.15063
作者: Kunpeng Wang,Keke Chen,Chenglong Li,Zhengzheng Tu,Bin Luo
关键词-EN: methods demonstrate effectiveness, SAM, salient object detection, methods demonstrate, reaching optimality
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages, 9 figures

点击查看摘要

Abstract:Although most existing multi-modal salient object detection (SOD) methods demonstrate effectiveness through training models from scratch, the limited multi-modal data hinders these methods from reaching optimality. In this paper, we propose a novel framework to explore and exploit the powerful feature representation and zero-shot generalization ability of the pre-trained Segment Anything Model (SAM) for multi-modal SOD. Despite serving as a recent vision foundation model, driving the class-agnostic SAM to comprehend and detect salient objects accurately is non-trivial, especially in challenging scenes. To this end, we develop SAM with semantic feature fusion guidance (Sammese), which incorporates multi-modal saliency-specific knowledge into SAM to adapt SAM to multi-modal SOD tasks. However, it is difficult for SAM trained on single-modal data to directly mine the complementary benefits of multi-modal inputs and comprehensively utilize them to achieve accurate saliency prediction. To address these issues, we first design a multi-modal complementary fusion module to extract robust multi-modal semantic features by integrating information from visible and thermal or depth image pairs. Then, we feed the extracted multi-modal semantic features into both the SAM image encoder and mask decoder for fine-tuning and prompting, respectively. Specifically, in the image encoder, a multi-modal adapter is proposed to adapt the single-modal SAM to multi-modal information. In the mask decoder, a semantic-geometric prompt generation strategy is proposed to produce corresponding embeddings with various saliency cues. Extensive experiments on both RGB-D and RGB-T SOD benchmarks show the effectiveness of the proposed framework.

[CV-24] DocLayLLM: An Efficient and Effective Multi-modal Extension of Large Language Models for Text-rich Document Understanding

链接: https://arxiv.org/abs/2408.15045
作者: Wenhui Liao,Jiapeng Wang,Hongliang Li,Chengyu Wang,Jun Huang,Lianwen Jin
关键词-EN: Text-rich document understanding, substantial textual content, Text-rich document, refers to analyzing, analyzing and comprehending
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Text-rich document understanding (TDU) refers to analyzing and comprehending documents containing substantial textual content. With the rapid evolution of large language models (LLMs), they have been widely leveraged for TDU due to their remarkable versatility and generalization. In this paper, we introduce DocLayLLM, an efficient and effective multi-modal extension of LLMs specifically designed for TDU. By integrating visual patch tokens and 2D positional tokens into LLMs and encoding the document content using the LLMs themselves, we fully take advantage of the document comprehension capability of LLMs and enhance their perception of OCR information. We have also deeply considered the role of the chain-of-thought (CoT) and innovatively proposed the techniques of CoT Pre-training and CoT Annealing. Our DocLayLLM can achieve remarkable performances with lightweight training settings, showcasing its efficiency and effectiveness. Experimental results demonstrate that our DocLayLLM surpasses existing OCR-dependent methods and also outperforms OCR-free competitors.

[CV-25] Interactive Occlusion Boundary Estimation through Exploitation of Synthetic Data

链接: https://arxiv.org/abs/2408.15038
作者: Lintao Xu,Chaohui Wang
关键词-EN: scene understanding problems, Occlusion boundaries, occlusion events, understanding problems, information for addressing
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Occlusion boundaries (OBs) geometrically localize the occlusion events in a 2D image, and contain useful information for addressing various scene understanding problems. To advance their study, we have led the investigation in the following three aspects. Firstly, we have studied interactive estimation of OBs, the first such study in the literature, and proposed an efficient deep-network-based method using multiple-scribble intervention, named DNMMSI, which significantly improves the performance over the state-of-the-art fully-automatic methods. Secondly, we propose to exploit the synthetic benchmark for the training process, thanks to the particularity that OBs are determined geometrically and unambiguously from the 3D scene. To this end, we have developed an efficient tool, named Mesh2OB, for the automatic generation of 2D images together with their ground-truth OBs, using which we have constructed a synthetic benchmark, named OB-FUTURE. Abundant experimental results demonstrate that leveraging such a synthetic benchmark for training achieves promising performance, even without the use of domain adaptation techniques. Finally, to achieve a more compelling and robust evaluation in OB-related research, we have created a real benchmark, named OB-LabName, consisting of 120 high-resolution images together with their ground-truth OBs, with precision surpassing that of previous benchmarks. We will release DNMMSI with pre-trained parameters, Mesh2OB, OB-FUTURE, and OB-LabName to support further research.

[CV-26] Mamba2MIL: State Space Duality Based Multiple Instance Learning for Computational Pathology

链接: https://arxiv.org/abs/2408.15032
作者: Yuqi Zhang,Xiaoqian Zhang,Jiakai Wang,Yuancheng Yang,Taiying Peng,Chao Tong
关键词-EN: Computational pathology, Multiple Instance Learning, Convolutional Neural Networks, significantly advanced, advanced the clinical
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Computational pathology (CPath) has significantly advanced the clinical practice of pathology. Despite the progress made, Multiple Instance Learning (MIL), a promising paradigm within CPath, continues to face challenges, particularly related to incomplete information utilization. Existing frameworks, such as those based on Convolutional Neural Networks (CNNs), attention, and selective scan state space sequential models (SSMs), lack sufficient flexibility and scalability to fuse diverse features effectively. Additionally, current approaches do not adequately exploit order-related and order-independent features, resulting in suboptimal utilization of sequence information. To address these limitations, we propose a novel MIL framework called Mamba2MIL. Our framework utilizes the state space duality model (SSD) to model long sequences of patches of whole slide images (WSIs), which, combined with weighted feature selection, supports the fusion processing of more branching features and can be extended according to specific application needs. Moreover, we introduce a sequence transformation method tailored to varying WSI sizes, which enhances sequence-independent features while preserving local sequence information, thereby improving sequence information utilization. Extensive experiments demonstrate that Mamba2MIL surpasses state-of-the-art MIL methods. We conducted extensive experiments across multiple datasets, achieving improvements in nearly all performance metrics. Specifically, on the NSCLC dataset, Mamba2MIL achieves a binary tumor classification AUC of 0.9533 and an accuracy of 0.8794. On the BRACS dataset, it achieves a multiclass classification AUC of 0.7986 and an accuracy of 0.4981. The code is available at this https URL.

[CV-27] Sequence-aware Pre-training for Echocardiography Probe Guidance

链接: https://arxiv.org/abs/2408.15026
作者: Haojun Jiang,Zhenguo Sun,Yu Sun,Ning Jia,Meng Li,Shaqi Luo,Shiji Song,Gao Huang
关键词-EN: obtain high-quality sectional, high-quality sectional images, Cardiac ultrasound, pose to obtain, obtain high-quality
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Tech Report

点击查看摘要

Abstract:Cardiac ultrasound probe guidance aims to help novices adjust the 6-DOF probe pose to obtain high-quality sectional images. Cardiac ultrasound faces two major challenges: (1) the inherently complex structure of the heart, and (2) significant individual variations. Previous works have only learned the population-averaged 2D and 3D structures of the heart rather than personalized cardiac structural features, leading to a performance bottleneck. Clinically, we observed that sonographers adjust their understanding of a patient's cardiac structure based on prior scanning sequences, thereby modifying their scanning strategies. Inspired by this, we propose a sequence-aware self-supervised pre-training method. Specifically, our approach learns personalized 2D and 3D cardiac structural features by predicting the masked-out images and actions in a scanning sequence. We hypothesize that if the model can predict the missing content, it has acquired a good understanding of the personalized cardiac structure. In the downstream probe guidance task, we also introduced a sequence modeling approach that models individual cardiac structural information based on the images and actions from historical scan data, enabling more accurate navigation decisions. Experiments on a large-scale dataset with 1.36 million samples demonstrated that our proposed sequence-aware paradigm can significantly reduce navigation errors, with translation errors decreasing by 15.90% to 36.87% and rotation errors decreasing by 11.13% to 20.77%, compared to state-of-the-art methods.

[CV-28] Hierarchical Graph Interaction Transformer with Dynamic Token Clustering for Camouflaged Object Detection

链接: https://arxiv.org/abs/2408.15020
作者: Siyuan Yao,Hao Sun,Tian-Zhu Xiang,Xiao Wang,Xiaochun Cao
关键词-EN: Camouflaged object detection, Camouflaged object, object detection, aims to identify, seamlessly blend
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Submitted to IEEE Transactions on Image Processing

点击查看摘要

Abstract:Camouflaged object detection (COD) aims to identify the objects that seamlessly blend into the surrounding backgrounds. Due to the intrinsic similarity between the camouflaged objects and the background region, it is extremely challenging to precisely distinguish the camouflaged objects by existing approaches. In this paper, we propose a hierarchical graph interaction network termed HGINet for camouflaged object detection, which is capable of discovering imperceptible objects via effective graph interaction among the hierarchical tokenized features. Specifically, we first design a region-aware token focusing attention (RTFA) with dynamic token clustering to excavate the potentially distinguishable tokens in the local region. Afterwards, a hierarchical graph interaction transformer (HGIT) is proposed to construct bi-directional aligned communication between hierarchical features in the latent interaction space for visual semantics enhancement. Furthermore, we propose a decoder network with confidence aggregated feature fusion (CAFF) modules, which progressively fuses the hierarchical interacted features to refine the local detail in ambiguous regions. Extensive experiments conducted on the prevalent datasets, i.e. COD10K, CAMO, NC4K and CHAMELEON demonstrate the superior performance of HGINet compared to existing state-of-the-art methods. Our code is available at this https URL.

[CV-29] Alternating Minimization Schemes for Computing Rate-Distortion-Perception Functions with f-Divergence Perception Constraints

链接: https://arxiv.org/abs/2408.15015
作者: Giuseppe Serra,Photios A. Stavrou,Marios Kountouris
关键词-EN: discrete memoryless sources, memoryless sources subject, alternating minimization, Optimal Alternating Minimization, single-letter average distortion
类目: Information Theory (cs.IT); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
*备注: This work has been submitted for possible publication

点击查看摘要

Abstract:We study the computation of the rate-distortion-perception function (RDPF) for discrete memoryless sources subject to a single-letter average distortion constraint and a perception constraint that belongs to the family of f -divergences. In this setting, the RDPF forms a convex programming problem for which we characterize the optimal parametric solutions. We employ the developed solutions in an alternating minimization scheme, namely Optimal Alternating Minimization (OAM), for which we provide convergence guarantees. Nevertheless, the OAM scheme does not lead to a direct implementation of a generalized Blahut-Arimoto (BA) type of algorithm due to the presence of implicit equations in the structure of the iteration. To overcome this difficulty, we propose two alternative minimization approaches whose applicability depends on the smoothness of the used perception metric: a Newton-based Alternating Minimization (NAM) scheme, relying on Newton’s root-finding method for the approximation of the optimal iteration solution, and a Relaxed Alternating Minimization (RAM) scheme, based on a relaxation of the OAM iterates. Both schemes are shown, via the derivation of necessary and sufficient conditions, to guarantee convergence to a globally optimal solution. We also provide sufficient conditions on the distortion and the perception constraints which guarantee that the proposed algorithms converge exponentially fast in the number of iteration steps. We corroborate our theoretical results with numerical simulations and draw connections with existing results.
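
For readers new to the object being computed, the single-letter RDPF studied here takes the standard constrained form below (notation assumed from the usual rate-distortion-perception literature; see the paper for the precise setup):

```latex
R(D, P) \;=\; \min_{p_{\hat{X} \mid X}} \; I(X; \hat{X})
\quad \text{s.t.} \quad
\mathbb{E}\!\left[ d(X, \hat{X}) \right] \le D,
\qquad
D_f\!\left( p_{\hat{X}} \,\middle\|\, p_X \right) \le P .
```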

[CV-30] Pre-training Everywhere: Parameter-Efficient Fine-Tuning for Medical Image Analysis via Target Parameter Pre-training

链接: https://arxiv.org/abs/2408.15011
作者: Xingliang Lei,Yiwen Ye,Ziyang Chen,Minglei Shu,Yong Xia
关键词-EN: high computational costs, target parameters, Parameter-efficient fine-tuning, parameters, PEFT
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Parameter-efficient fine-tuning (PEFT) techniques have emerged to address issues of overfitting and the high computational costs associated with full fine-tuning in the paradigm of self-supervised learning. Mainstream methods based on PEFT involve adding a few trainable parameters while keeping the pre-trained parameters of the backbone fixed. These methods achieve comparable, and often superior, performance to full fine-tuning, demonstrating the powerful representation ability of the pre-trained backbone. Despite its success, these methods typically ignore the initialization of the new parameters, often relying solely on random initialization. We argue that if pre-training is significantly beneficial, it should be applied to all parameters requiring representational capacity. Motivated by this insight, we propose a simple yet effective fine-tuning framework based on Target Parameter Pre-training (TPP). The target parameters refer to the new parameters introduced during fine-tuning. TPP includes an additional stage before PEFT to pre-train these target parameters. During this stage, the pre-trained backbone parameters are frozen, and only the target parameters are trainable. A defined pretext task is used to encourage the target parameters to learn specific representations of downstream data. When PEFT is subsequently employed, the pre-trained target parameters are loaded to enhance fine-tuning efficiency. The proposed TPP framework is versatile, allowing for the integration of various pretext tasks for pre-training and supporting different PEFT methods as backbones. We evaluated the fine-tuning performance of our method using five public datasets, including three modalities and two task types. The results demonstrate that the proposed TPP can be easily integrated into existing PEFT methods, significantly improving performance.

[CV-31] Knowledge Discovery in Optical Music Recognition: Enhancing Information Retrieval with Instance Segmentation

链接: https://arxiv.org/abs/2408.15002
作者: Elona Shatri,George Fazekas
关键词-EN: Optical Character Recognition, Optical Music Recognition, Western Music Notation, Common Western Music, manual transcription
类目: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
*备注: 8 pages content and one references, accepted version at the International Conference on Knowledge Discovery and Information Retrieval 2024, Porto, Portugal

点击查看摘要

Abstract:Optical Music Recognition (OMR) automates the transcription of musical notation from images into machine-readable formats like MusicXML, MEI, or MIDI, significantly reducing the costs and time of manual transcription. This study explores knowledge discovery in OMR by applying instance segmentation using Mask R-CNN to enhance the detection and delineation of musical symbols in sheet music. Unlike Optical Character Recognition (OCR), OMR must handle the intricate semantics of Common Western Music Notation (CWMN), where symbol meanings depend on shape, position, and context. Our approach leverages instance segmentation to manage the density and overlap of musical symbols, facilitating more precise information retrieval from music scores. Evaluations on the DoReMi and MUSCIMA++ datasets demonstrate substantial improvements, with our method achieving a mean Average Precision (mAP) of up to 59.70% in dense symbol environments, achieving comparable results to object detection. Furthermore, using traditional computer vision techniques, we add a parallel step for staff detection to infer the pitch for the recognised symbols. This study emphasises the role of pixel-wise segmentation in advancing accurate music symbol recognition, contributing to knowledge discovery in OMR. Our findings indicate that instance segmentation provides more precise representations of musical symbols, particularly in densely populated scores, advancing OMR technology. We make our implementation, pre-processing scripts, trained models, and evaluation results publicly available to support further research and development.

[CV-32] FastTextSpotter: A High-Efficiency Transformer for Multilingual Scene Text Spotting ICPR2024

链接: https://arxiv.org/abs/2408.14998
作者: Alloy Das,Sanket Biswas,Umapada Pal,Josep Lladós,Saumik Bhattacharya
关键词-EN: optical character recognition, unstructured environments presents, environments presents significant, presents significant challenges, character recognition
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted in ICPR 2024

点击查看摘要

Abstract:The proliferation of scene text in both structured and unstructured environments presents significant challenges in optical character recognition (OCR), necessitating more efficient and robust text spotting solutions. This paper presents FastTextSpotter, a framework that integrates a Swin Transformer visual backbone with a Transformer Encoder-Decoder architecture, enhanced by a novel, faster self-attention unit, SAC2, to improve processing speeds while maintaining accuracy. FastTextSpotter has been validated across multiple datasets, including ICDAR2015 for regular texts and CTW1500 and TotalText for arbitrary-shaped texts, benchmarking against current state-of-the-art models. Our results indicate that FastTextSpotter not only achieves superior accuracy in detecting and recognizing multilingual scene text (English and Vietnamese) but also improves model efficiency, thereby setting new benchmarks in the field. This study underscores the potential of advanced transformer architectures in improving the adaptability and speed of text spotting applications in diverse real-world settings. The dataset, code, and pre-trained models have been released in our Github.

[CV-33] Depth Restoration of Hand-Held Transparent Objects for Human-to-Robot Handover

链接: https://arxiv.org/abs/2408.14997
作者: Ran Yu,Haixin Yu,Huang Yan,Ziwu Song,Shoujie Li,Wenbo Ding
关键词-EN: unique optical properties, optical properties pose, properties pose challenges, capture accurate depth, Transparent objects
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注: 7 pages, 7 figures, conference

点击查看摘要

Abstract:Transparent objects are common in daily life, while their unique optical properties pose challenges for RGB-D cameras, which struggle to capture accurate depth information. For assistant robots, accurately perceiving transparent objects held by humans is essential for effective human-robot interaction. This paper presents a Hand-Aware Depth Restoration (HADR) method for hand-held transparent objects based on creating an implicit neural representation function from a single RGB-D image. The proposed method introduces the hand posture as an important guidance to leverage semantic and geometric information. To train and evaluate the proposed method, we create a high-fidelity synthetic dataset called TransHand-14K with a real-to-sim data generation scheme. Experiments show that our method has a better performance and generalization ability compared with existing methods. We further develop a real-world human-to-robot handover system based on the proposed depth restoration method, demonstrating its application value in human-robot interaction.

[CV-34] Prior-free Balanced Replay: Uncertainty-guided Reservoir Sampling for Long-Tailed Continual Learning

链接: https://arxiv.org/abs/2408.14976
作者: Lei Liu,Li Liu,Yawen Cui
关键词-EN: continual data stream, Long-Tailed Continual Learning, continual learning, data stream exhibits, continual data
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Even in the era of large models, one of the well-known issues in continual learning (CL) is catastrophic forgetting, which is significantly challenging when the continual data stream exhibits a long-tailed distribution, termed Long-Tailed Continual Learning (LTCL). Existing LTCL solutions generally require the label distribution of the data stream to achieve re-balance training. However, obtaining such prior information is often infeasible in real scenarios since the model should learn without pre-identifying the majority and minority classes. To this end, we propose a novel Prior-free Balanced Replay (PBR) framework to learn from a long-tailed data stream with less forgetting. Concretely, motivated by our experimental finding that the minority classes are more likely to be forgotten due to their higher uncertainty, we newly design an uncertainty-guided reservoir sampling strategy to prioritize rehearsing minority data without using any prior information, which is based on the mutual dependence between the model and samples. Additionally, we incorporate two prior-free components to further reduce the forgetting issue: (1) Boundary constraint is to preserve uncertain boundary supporting samples for continually re-estimating task boundaries. (2) Prototype constraint is to maintain the consistency of learned class prototypes along with training. Our approach is evaluated on three standard long-tailed benchmarks, demonstrating superior performance to existing CL methods and the previous SOTA LTCL approach in both task- and class-incremental learning settings, as well as ordered- and shuffled-LTCL settings.
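
Uncertainty-guided reservoir sampling can be sketched with weighted reservoir keys (Efraimidis-Spirakis); note the paper derives its weights from the mutual dependence between model and samples, whereas the uncertainty values below are toy placeholders:

```python
import heapq, random

def stream_reservoir(stream, capacity):
    heap = []  # (key, sample); smallest key is evicted first
    for sample, uncertainty in stream:
        w = max(uncertainty, 1e-12)
        key = random.random() ** (1.0 / w)     # larger weight -> larger key
        if len(heap) < capacity:
            heapq.heappush(heap, (key, sample))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, sample))
    return [s for _, s in heap]

random.seed(0)
# Every 10th sample mimics a high-uncertainty minority-class example.
data = [(f"sample_{i}", 5.0 if i % 10 == 0 else 1.0) for i in range(1000)]
buffer = stream_reservoir(data, capacity=20)
print(buffer)  # high-uncertainty samples are over-represented in the buffer
```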

[CV-35] MegActor-Sigma: Unlocking Flexible Mixed-Modal Control in Portrait Animation with Diffusion Transformer

链接: https://arxiv.org/abs/2408.14975
作者: Shurong Yang,Huadong Li,Juhao Wu,Minhao Jing,Linze Li,Renhe Ji,Jiajun Liang,Haoqiang Fan,Jin Wang
关键词-EN: demonstrated superior performance, control, demonstrated superior, superior performance, control strength
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Diffusion models have demonstrated superior performance in the field of portrait animation. However, current approaches have relied on either the visual or audio modality to control character movements, failing to exploit the potential of mixed-modal control. This challenge arises from the difficulty in balancing the weak control strength of the audio modality and the strong control strength of the visual modality. To address this issue, we introduce MegActor-Σ: a mixed-modal conditional diffusion transformer (DiT), which can flexibly inject audio and visual modality control signals into portrait animation. Specifically, we make substantial advancements over its predecessor, MegActor, by leveraging the promising model structure of DiT and integrating audio and visual conditions through advanced modules within the DiT framework. To further achieve flexible combinations of mixed-modal control signals, we propose a "Modality Decoupling Control" training strategy to balance the control strength between visual and audio modalities, along with an "Amplitude Adjustment" inference strategy to freely regulate the motion amplitude of each modality. Finally, to facilitate extensive studies in this field, we design several dataset evaluation metrics to filter out public datasets and solely use this filtered dataset to train MegActor-Σ. Extensive experiments demonstrate the superiority of our approach in generating vivid portrait animations, outperforming previous methods trained on private datasets.

[CV-36] Deep Learning-based Average Shear Wave Velocity Prediction using Accelerometer Records

链接: https://arxiv.org/abs/2408.14962
作者: Barış Yılmaz,Melek Türkmen,Sanem Meral,Erdem Akagündüz,Salih Tileylioglu
关键词-EN: designing earthquake-resilient structures, evaluating structural damage, Assessing seismic hazards, Assessing seismic, ground motion records
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 12 pages, 14 figures, Accepted by 18th World Conference on Earthquake Engineering WCEE2024

点击查看摘要

Abstract:Assessing seismic hazards and thereby designing earthquake-resilient structures or evaluating structural damage that has been incurred after an earthquake are important objectives in earthquake engineering. Both tasks require critical evaluation of strong ground motion records, and the knowledge of site conditions at the earthquake stations plays a major role in achieving the aforementioned objectives. Site conditions are generally represented by the time-averaged shear wave velocity in the upper 30 meters of the geological materials (Vs30). Several strong motion stations lack Vs30 measurements resulting in potentially inaccurate assessment of seismic hazards and evaluation of ground motion records. In this study, we present a deep learning-based approach for predicting Vs30 at strong motion station locations using three-channel earthquake records. For this purpose, Convolutional Neural Networks (CNNs) with dilated and causal convolutional layers are used to extract deep features from accelerometer records collected from over 700 stations located in Turkey. In order to overcome the limited availability of labeled data, we propose a two-phase training approach. In the first phase, a CNN is trained to estimate the epicenters, for which ground truth is available for all records. After the CNN is trained, the pre-trained encoder is fine-tuned based on the Vs30 ground truth. The performance of the proposed method is compared with machine learning models that utilize hand-crafted features. The results demonstrate that the deep convolutional encoder based Vs30 prediction model outperforms the machine learning models that rely on hand-crafted features.
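
The dilated, causal convolutions mentioned above can be implemented by left-padding before an ordinary Conv1d, so each output sample depends only on current and past accelerometer readings. A minimal PyTorch sketch (layer sizes are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    def __init__(self, c_in, c_out, kernel_size, dilation):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(c_in, c_out, kernel_size, dilation=dilation)

    def forward(self, x):                      # x: (batch, channels, time)
        x = F.pad(x, (self.pad, 0))            # pad on the left only -> causal
        return self.conv(x)

record = torch.randn(1, 3, 1000)               # 3-channel accelerometer record
layer = CausalConv1d(3, 16, kernel_size=3, dilation=4)
print(layer(record).shape)                     # torch.Size([1, 16, 1000])
```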

[CV-37] CVPT: Cross-Attention help Visual Prompt Tuning adapt visual task

Link: https://arxiv.org/abs/2408.14961
Authors: Lingyun Huang, Jianxu Mao, Yaonan Wang, Junfei Yi, Ziming Tao
Keywords-EN: demonstrating remarkable capabilities, models demonstrating remarkable, Visual Prompt Tuning, pre-trained models demonstrating, recent years
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:In recent years, the rapid expansion of model sizes has led to large-scale pre-trained models demonstrating remarkable capabilities. Consequently, there has been a trend towards increasing the scale of models. However, this trend introduces significant challenges, including substantial computational costs of training and transfer to downstream tasks. To address these issues, Parameter-Efficient Fine-Tuning (PEFT) methods have been introduced. These methods optimize large-scale pre-trained models for specific tasks by fine-tuning a select group of parameters. Among these PEFT methods, adapter-based and prompt-based methods are the primary techniques. Specifically, in the field of visual fine-tuning, adapters gain prominence over prompts because of the latter's relatively weaker performance and efficiency. Under these circumstances, we refine the widely-used Visual Prompt Tuning (VPT) method, proposing Cross Visual Prompt Tuning (CVPT). CVPT calculates cross-attention between the prompt tokens and the embedded tokens, which allows us to compute the semantic relationship between them and fine-tune the model to better adapt to visual tasks. Furthermore, we introduce a weight-sharing mechanism to initialize the parameters of the cross-attention, which avoids a massive number of learnable parameters in the cross-attention and enhances its representative capability. We conduct comprehensive testing across 25 datasets and the results indicate that CVPT significantly improves VPT's performance and efficiency in visual tasks. For example, on the VTAB-1K benchmark, CVPT outperforms VPT by over 4% in average accuracy, rivaling the advanced adapter-based methods in performance and efficiency. Our experiments confirm that prompt-based methods can achieve exceptional results in visual fine-tuning.
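
As a rough illustration of the core mechanism, the sketch below computes cross-attention from learnable prompt tokens (queries) to the embedded patch tokens (keys/values), with the option to initialize the attention weights from a backbone attention module in the spirit of the weight-sharing scheme. All names and dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn

class CrossPromptAttention(nn.Module):
    """Cross-attention from learnable prompt tokens to patch embeddings."""
    def __init__(self, dim=768, num_prompts=10, num_heads=12, backbone_attn=None):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(1, num_prompts, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        if backbone_attn is not None:
            # Weight-sharing initialization: reuse the frozen backbone's attention
            # weights (assumes the backbone block is also an nn.MultiheadAttention).
            self.attn.load_state_dict(backbone_attn.state_dict())

    def forward(self, tokens):               # tokens: (batch, n_patches, dim)
        q = self.prompts.expand(tokens.size(0), -1, -1)
        out, _ = self.attn(query=q, key=tokens, value=tokens)
        return out                           # updated prompt tokens

layer = CrossPromptAttention()
prompts = layer(torch.randn(2, 196, 768))    # -> (2, 10, 768)
```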

[CV-38] Applying ViT in Generalized Few-shot Semantic Segmentation

Link: https://arxiv.org/abs/2408.14957
Authors: Liyuan Geng, Jinhong Xia, Yuanhe Guo
Keywords-EN: generalized few-shot semantic, pretrained Vision Transformer, few-shot semantic segmentation, paper explores, explores the capability
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 4 figures

Click to view abstract

Abstract:This paper explores the capability of ViT-based models under the generalized few-shot semantic segmentation (GFSS) framework. We conduct experiments with various combinations of backbone models, including ResNets and pretrained Vision Transformer (ViT)-based models, along with decoders featuring a linear classifier, UPerNet, and Mask Transformer. The structure made of DINOv2 and a linear classifier takes the lead on the popular few-shot segmentation benchmark PASCAL-5^i, substantially outperforming the best ResNet structure by 116% in the one-shot scenario. We demonstrate the great potential of large pretrained ViT-based models on the GFSS task, and expect further improvement on testing benchmarks. However, a potential caveat is that when applying a pure ViT-based model with a large-scale ViT decoder, the model is prone to overfitting.

[CV-39] NeuralOOD: Improving Out-of-Distribution Generalization Performance with Brain-machine Fusion Learning Framework

Link: https://arxiv.org/abs/2408.14950
Authors: Shuangchen Zhao, Changde Du, Hui Li, Huiguang He
Keywords-EN: Deep Neural Networks, traditional computer vision, demonstrated exceptional recognition, exceptional recognition capabilities, Deep Neural
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Deep Neural Networks (DNNs) have demonstrated exceptional recognition capabilities in traditional computer vision (CV) tasks. However, existing CV models often suffer a significant decrease in accuracy when confronted with out-of-distribution (OOD) data. In contrast to these DNN models, humans can maintain a consistently low error rate when facing OOD scenes, partly attributed to the rich prior cognitive knowledge stored in the human brain. Previous OOD generalization research focuses only on a single modality, overlooking the advantages of multimodal learning. In this paper, we utilize multimodal learning to improve OOD generalization and propose a novel Brain-machine Fusion Learning (BMFL) framework. We adopt the cross-attention mechanism to fuse the visual knowledge from a CV model and prior cognitive knowledge from the human brain. Specifically, we employ a pre-trained visual neural encoding model to predict functional Magnetic Resonance Imaging (fMRI) responses from visual features, which eliminates the need for fMRI data collection and pre-processing and effectively reduces the workload associated with conventional BMFL methods. Furthermore, we construct a brain transformer to facilitate the extraction of knowledge inside the fMRI data. Moreover, we introduce a Pearson correlation coefficient maximization regularization method into the training process, which improves the fusion capability with better constraints. Our model outperforms the DINOv2 and baseline models on the ImageNet-1k validation dataset as well as six curated OOD datasets, showcasing its superior performance in diverse scenarios.
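
The Pearson correlation coefficient maximization regularizer is compact enough to write out directly; a minimal sketch with illustrative tensor names:

```python
import torch

def pearson_regularizer(pred_fmri, fused_feat):
    """Negative Pearson correlation between two batched feature vectors;
    adding this term to the loss maximizes their correlation during training."""
    x = pred_fmri - pred_fmri.mean(dim=-1, keepdim=True)
    y = fused_feat - fused_feat.mean(dim=-1, keepdim=True)
    r = (x * y).sum(dim=-1) / (x.norm(dim=-1) * y.norm(dim=-1) + 1e-8)
    return -r.mean()

reg = pearson_regularizer(torch.randn(4, 512), torch.randn(4, 512))
```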

[CV-40] BOX3D: Lightweight Camera-LiDAR Fusion for 3D Object Detection and Localization

Link: https://arxiv.org/abs/2408.14941
Authors: Mario A.V. Saucedo, Nikolaos Stathoulopoulos, Vidya Sumathy, Christoforos Kanellakis, George Nikolakopoulos
Keywords-EN: semantic scene understanding, Scene Graphs, global localization play, Graphs for semantic, scene understanding
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Presented in MED 2024

Click to view abstract

Abstract:Object detection and global localization play a crucial role in robotics, spanning a great spectrum of applications from autonomous cars to multi-layered 3D Scene Graphs for semantic scene understanding. This article proposes BOX3D, a novel multi-modal and lightweight scheme for localizing objects of interest by fusing information from an RGB camera and a 3D LiDAR. BOX3D is structured around a three-layered architecture, building up from the local perception of the incoming sequential sensor data to a global perception refinement that accounts for outliers and for the general consistency of each object's observation. More specifically, the first layer handles the low-level fusion of camera and LiDAR data for initial 3D bounding box extraction. The second layer converts the 3D bounding boxes of each LiDAR scan to the world coordinate frame and applies a spatial pairing and merging mechanism to maintain the uniqueness of objects observed from different viewpoints. Finally, BOX3D integrates a third layer that iteratively supervises the consistency of the results on the global map, using a point-to-voxel comparison to identify all points in the global map that belong to the object. Benchmarking results of the proposed novel architecture are showcased in multiple experimental trials on a public state-of-the-art large-scale dataset of urban environments.

[CV-41] Cross-Modal Temporal Alignment for Event-guided Video Deblurring ECCV2024

Link: https://arxiv.org/abs/2408.14930
Authors: Taewoo Kim, Hoonhee Cho, Kuk-Jin Yoon
Keywords-EN: effectively gathering information, adjacent video frames, single blurred frame, Video deblurring, enhance the quality
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in ECCV2024

Click to view abstract

Abstract:Video deblurring aims to enhance the quality of restored results in motion-blurred videos by effectively gathering information from adjacent video frames to compensate for the insufficient data in a single blurred frame. However, when faced with consecutively severe motion blur situations, frame-based video deblurring methods often fail to find accurate temporal correspondence among neighboring video frames, leading to diminished performance. To address this limitation, we aim to solve the video deblurring task by leveraging an event camera with micro-second temporal resolution. To fully exploit the dense temporal resolution of the event camera, we propose two modules: 1) Intra-frame feature enhancement operates within the exposure time of a single blurred frame, iteratively enhancing cross-modality features in a recurrent manner to better utilize the rich temporal information of events, 2) Inter-frame temporal feature alignment gathers valuable long-range temporal information to target frames, aggregating sharp features leveraging the advantages of the events. In addition, we present a novel dataset composed of real-world blurred RGB videos, corresponding sharp videos, and event data. This dataset serves as a valuable resource for evaluating event-guided deblurring methods. We demonstrate that our proposed methods outperform state-of-the-art frame-based and event-based motion deblurring methods through extensive experiments conducted on both synthetic and real-world deblurring datasets. The code and dataset are available at this https URL.

[CV-42] Towards Real-world Event-guided Low-light Video Enhancement and Deblurring ECCV2024

Link: https://arxiv.org/abs/2408.14916
Authors: Taewoo Kim, Jaeseok Jeong, Hoonhee Cho, Yuhwan Jeong, Kuk-Jin Yoon
Keywords-EN: reduced visibility, long exposure times, requires long exposure, low-light enhancement, low-light conditions
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in ECCV2024

Click to view abstract

Abstract:In low-light conditions, capturing videos with frame-based cameras often requires long exposure times, resulting in motion blur and reduced visibility. While frame-based motion deblurring and low-light enhancement have been studied, they still pose significant challenges. Event cameras have emerged as a promising solution for improving image quality in low-light environments and addressing motion blur. They provide two key advantages: capturing scene details well even in low light due to their high dynamic range, and effectively capturing motion information during long exposures due to their high temporal resolution. Despite efforts to tackle low-light enhancement and motion deblurring using event cameras separately, previous work has not addressed both simultaneously. To explore the joint task, we first establish real-world datasets for event-guided low-light enhancement and deblurring using a hybrid camera system based on beam splitters. Subsequently, we introduce an end-to-end framework to effectively handle these tasks. Our framework incorporates a module to efficiently leverage temporal information from events and frames. Furthermore, we propose a module to utilize cross-modal feature information to employ a low-pass filter for noise suppression while enhancing the main structural information. Our proposed method significantly outperforms existing approaches in addressing the joint task. Our project pages are available at this https URL.

[CV-43] MeshUp: Multi-Target Mesh Deformation via Blended Score Distillation

Link: https://arxiv.org/abs/2408.14899
Authors: Hyunwoo Kim, Itai Lang, Noam Aigerman, Thibault Groueix, Vladimir G. Kim, Rana Hanocka
Keywords-EN: multiple target concepts, propose MeshUp, Blended Score Distillation, target concepts, score distillation
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments:

Click to view abstract

Abstract:We propose MeshUp, a technique that deforms a 3D mesh towards multiple target concepts, and intuitively controls the region where each concept is expressed. Conveniently, the concepts can be defined as either text queries, e.g., “a dog” and “a turtle,” or inspirational images, and the local regions can be selected as any number of vertices on the mesh. We can effectively control the influence of the concepts and mix them together using a novel score distillation approach, referred to as the Blended Score Distillation (BSD). BSD operates on each attention layer of the denoising U-Net of a diffusion model as it extracts and injects the per-objective activations into a unified denoising pipeline from which the deformation gradients are calculated. To localize the expression of these activations, we create a probabilistic Region of Interest (ROI) map on the surface of the mesh, and turn it into 3D-consistent masks that we use to control the expression of these activations. We demonstrate the effectiveness of BSD empirically and show that it can deform various meshes towards multiple objectives.

[CV-44] VHAKG: A Multi-modal Knowledge Graph Based on Synchronized Multi-view Videos of Daily Activities CIKM2024

Link: https://arxiv.org/abs/2408.14895
Authors: Shusaku Egami, Takahiro Ugai, Ken Fukuda
Keywords-EN: Multi-modal knowledge graphs, resources enabling knowledge, enabling knowledge processing, Multi-modal knowledge, non-symbolic data
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 4 figures, accepted by CIKM2024 Resource Track

Click to view abstract

Abstract:Multi-modal knowledge graphs (MMKGs), which ground various non-symbolic data (e.g., images and videos) into symbols, have attracted attention as resources enabling knowledge processing and machine learning across modalities. However, the construction of MMKGs for videos consisting of multiple events, such as daily activities, is still in the early stages. In this paper, we construct an MMKG based on synchronized multi-view simulated videos of daily activities. Besides representing the content of daily life videos as event-centric knowledge, our MMKG also includes frame-by-frame fine-grained changes, such as bounding boxes within video frames. In addition, we provide support tools for querying our MMKG. As an application example, we demonstrate that our MMKG facilitates benchmarking vision-language models by providing the necessary vision-language datasets for a tailored task.

[CV-45] Adversarial Manhole: Challenging Monocular Depth Estimation and Semantic Segmentation Models with Patch Attack

Link: https://arxiv.org/abs/2408.14879
Authors: Naufal Suryanto, Andro Aprila Adiputra, Ahmada Yusril Kadiptya, Yongsu Kim, Howon Kim
Keywords-EN: Monocular depth estimation, autonomous driving systems, Monocular depth, semantic segmentation, navigation and environmental
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for WISA 2024. Code and dataset: this https URL

Click to view abstract

Abstract:Monocular depth estimation (MDE) and semantic segmentation (SS) are crucial for the navigation and environmental interpretation of many autonomous driving systems. However, their vulnerability to practical adversarial attacks is a significant concern. This paper presents a novel adversarial attack using practical patches that mimic manhole covers to deceive MDE and SS models. The goal is to cause these systems to misinterpret scenes, leading to false detections of near obstacles or non-passable objects. We use Depth Planar Mapping to precisely position these patches on road surfaces, enhancing the attack’s effectiveness. Our experiments show that these adversarial patches cause a 43% relative error in MDE and achieve a 96% attack success rate in SS. These patches create affected error regions over twice their size in MDE and approximately equal to their size in SS. Our studies also confirm the patch’s effectiveness in physical simulations, the adaptability of the patches across different target models, and the effectiveness of our proposed modules, highlighting their practical implications.

[CV-46] ZeroMamba: Exploring Visual State Space Model for Zero-Shot Learning

Link: https://arxiv.org/abs/2408.14868
Authors: Wenjin Hou, Dingjie Fu, Kun Li, Shiming Chen, Hehe Fan, Yi Yang
Keywords-EN: recognize unseen classes, Convolutional Neural Networks, Zero-shot learning, transferring semantic knowledge, unseen classes
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Zero-shot learning (ZSL) aims to recognize unseen classes by transferring semantic knowledge from seen classes to unseen ones, guided by semantic information. To this end, existing works have demonstrated remarkable performance by utilizing global visual features from Convolutional Neural Networks (CNNs) or Vision Transformers (ViTs) for visual-semantic interactions. Due to the limited receptive fields of CNNs and the quadratic complexity of ViTs, however, these visual backbones achieve suboptimal visual-semantic interactions. In this paper, motivated by the visual state space model (i.e., Vision Mamba), which is capable of capturing long-range dependencies and modeling complex visual dynamics, we propose a parameter-efficient ZSL framework called ZeroMamba to advance ZSL. Our ZeroMamba comprises three key components: Semantic-aware Local Projection (SLP), Global Representation Learning (GRL), and Semantic Fusion (SeF). Specifically, SLP integrates semantic embeddings to map visual features to local semantic-related representations, while GRL encourages the model to learn global semantic representations. SeF combines these two semantic representations to enhance the discriminability of semantic features. We incorporate these designs into Vision Mamba, forming an end-to-end ZSL framework. As a result, the learned semantic representations are better suited for classification. Through extensive experiments on four prominent ZSL benchmarks, ZeroMamba demonstrates superior performance, significantly outperforming the state-of-the-art (i.e., CNN-based and ViT-based) methods under both conventional ZSL (CZSL) and generalized ZSL (GZSL) settings. Code is available at: https://anonymous.4open.science/r/ZeroMamba.

[CV-47] DiffSurf: A Transformer-based Diffusion Model for Generating and Reconstructing 3D Surfaces in Pose ECCV2024

Link: https://arxiv.org/abs/2408.14860
Authors: Yusuke Yoshiyasu, Leyuan Sun
Keywords-EN: transformer-based denoising diffusion, paper presents DiffSurf, generating and reconstructing, denoising diffusion model, paper presents
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ECCV2024

Click to view abstract

Abstract:This paper presents DiffSurf, a transformer-based denoising diffusion model for generating and reconstructing 3D surfaces. Specifically, we design a diffusion transformer architecture that predicts noise from noisy 3D surface vertices and normals. With this architecture, DiffSurf is able to generate 3D surfaces in various poses and shapes, such as human bodies, hands, animals and man-made objects. Further, DiffSurf is versatile in that it can address various 3D downstream tasks including morphing, body shape variation and 3D human mesh fitting to 2D keypoints. Experimental results on 3D human model benchmarks demonstrate that DiffSurf can generate shapes with greater diversity and higher quality than previous generative models. Furthermore, when applied to the task of single-image 3D human mesh recovery, DiffSurf achieves accuracy comparable to prior techniques at a near real-time rate.

[CV-48] Diffusion-Occ: 3D Point Cloud Completion via Occupancy Diffusion

Link: https://arxiv.org/abs/2408.14846
Authors: Guoqing Zhang, Jian Liu
Keywords-EN: capturing three-dimensional data, point cloud completion, point cloud, cloud completion, resolution and occlusion
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Point clouds are crucial for capturing three-dimensional data but often suffer from incompleteness due to limitations such as resolution and occlusion. Traditional methods typically rely on point-based approaches within discriminative frameworks for point cloud completion. In this paper, we introduce Diffusion-Occ, a novel framework for Diffusion Point Cloud Completion. Diffusion-Occ utilizes a two-stage coarse-to-fine approach. In the first stage, the Coarse Density Voxel Prediction Network (CDNet) processes partial points to predict coarse density voxels, streamlining global feature extraction through voxel classification, as opposed to previous regression-based methods. In the second stage, we introduce the Occupancy Generation Network (OccGen), a conditional occupancy diffusion model based on a transformer architecture and enhanced by our Point-Voxel Fuse (PVF) block. This block integrates coarse density voxels with partial points to leverage both global and local features for comprehensive completion. By thresholding the occupancy field, we convert it into a complete point cloud. Additionally, our method employs diverse training mixtures and efficient diffusion parameterization to enable effective one-step sampling during both training and inference. Experimental results demonstrate that Diffusion-Occ outperforms existing discriminative and generative methods.
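
The final step, thresholding the occupancy field into a complete point cloud, is conceptually simple; a sketch assuming a dense (D, H, W) occupancy grid over the unit cube:

```python
import torch

def occupancy_to_points(occ, threshold=0.5, voxel_size=1.0 / 64):
    """Return the centers of occupied voxels as an (N, 3) point cloud."""
    idx = torch.nonzero(occ > threshold).float()  # (N, 3) integer voxel indices
    return (idx + 0.5) * voxel_size               # voxel centers in [0, 1]^3

points = occupancy_to_points(torch.rand(64, 64, 64))
```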

[CV-49] From Bias to Balance: Detecting Facial Expression Recognition Biases in Large Multimodal Foundation Models

Link: https://arxiv.org/abs/2408.14842
Authors: Kaylee Chhua, Zhoujinyi Wen, Vedant Hathalia, Kevin Zhu, Sean O’Brien
Keywords-EN: Large Multimodal Foundation, Large Multimodal, Multimodal Foundation Models, Multimodal Foundation, traditional FER models
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:This study addresses the racial biases in facial expression recognition (FER) systems within Large Multimodal Foundation Models (LMFMs). Despite advances in deep learning and the availability of diverse datasets, FER systems often exhibit higher error rates for individuals with darker skin tones. Existing research predominantly focuses on traditional FER models (CNNs, RNNs, ViTs), leaving a gap in understanding racial biases in LMFMs. We benchmark four leading LMFMs: GPT-4o, PaliGemma, Gemini, and CLIP to assess their performance in facial emotion detection across different racial demographics. A linear classifier trained on CLIP embeddings obtains accuracies of 95.9% for RADIATE, 90.3% for Tarr, and 99.5% for Chicago Face. Furthermore, we identify that Anger is misclassified as Disgust 2.1 times more often in Black Females than White Females. This study highlights the need for fairer FER systems and establishes a foundation for developing unbiased, accurate FER technologies. Visit this https URL for further information regarding the biases within facial expression recognition.
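
A linear probe over frozen CLIP embeddings of the kind reported above takes only a few lines to set up. The sketch below uses OpenAI's clip package as one possible feature extractor; the placeholder images and labels stand in for the actual face crops and emotion annotations.

```python
import numpy as np
import torch
import clip                                   # OpenAI CLIP
from PIL import Image
from sklearn.linear_model import LogisticRegression

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder stand-ins for face crops and their integer emotion labels.
images = [Image.new("RGB", (224, 224), c) for c in ("black", "white")]
labels = np.array([0, 1])

with torch.no_grad():
    feats = torch.cat([model.encode_image(preprocess(im).unsqueeze(0).to(device))
                       for im in images]).float().cpu().numpy()

# Linear probe on the frozen embeddings, as in the evaluation above.
probe = LogisticRegression(max_iter=1000).fit(feats, labels)
print("train accuracy:", probe.score(feats, labels))
```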

[CV-50] Diffusion based Semantic Outlier Generation via Nuisance Awareness for Out-of-Distribution Detection

Link: https://arxiv.org/abs/2408.14841
Authors: Suhee Yoon, Sanghyu Yoon, Hankook Lee, Ye Seul Sim, Sungik Choi, Kyungeun Lee, Hye-Seung Cho, Woohyung Lim
Keywords-EN: recently shown promising, shown promising results, synthetic OOD datasets, recently shown, shown promising
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Out-of-distribution (OOD) detection, which determines whether a given sample is part of the in-distribution (ID), has recently shown promising results through training with synthetic OOD datasets. Nonetheless, existing methods often produce outliers that are considerably distant from the ID, showing limited efficacy for capturing subtle distinctions between ID and OOD. To address these issues, we propose a novel framework, Semantic Outlier generation via Nuisance Awareness (SONA), which notably produces challenging outliers by directly leveraging pixel-space ID samples through diffusion models. Our approach incorporates SONA guidance, providing separate control over semantic and nuisance regions of ID samples. Thereby, the generated outliers achieve two crucial properties: (i) they present explicit semantic-discrepant information, while (ii) maintaining various levels of nuisance resemblance with ID. Furthermore, the improved OOD detector training with SONA outliers facilitates learning with a focus on semantic distinctions. Extensive experiments demonstrate the effectiveness of our framework, achieving an impressive AUROC of 88% on near-OOD datasets, which surpasses the performance of baseline methods by a significant margin of approximately 6%.

[CV-51] Diffusion Models Are Real-Time Game Engines

Link: https://arxiv.org/abs/2408.14837
Authors: Dani Valevski, Yaniv Leviathan, Moab Arar, Shlomi Fruchter
Keywords-EN: game engine powered, enables real-time interaction, high quality, engine powered, real-time interaction
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Click to view abstract

Abstract:We present GameNGen, the first game engine powered entirely by a neural model that enables real-time interaction with a complex environment over long trajectories at high quality. GameNGen can interactively simulate the classic game DOOM at over 20 frames per second on a single TPU. Next frame prediction achieves a PSNR of 29.4, comparable to lossy JPEG compression. Human raters are only slightly better than random chance at distinguishing short clips of the game from clips of the simulation. GameNGen is trained in two phases: (1) an RL-agent learns to play the game and the training sessions are recorded, and (2) a diffusion model is trained to produce the next frame, conditioned on the sequence of past frames and actions. Conditioning augmentations enable stable auto-regressive generation over long trajectories.

[CV-52] Time-Aware Face Anti-Spoofing with Rotation Invariant Local Binary Patterns and Deep Learning

Link: https://arxiv.org/abs/2408.14829
Authors: Moritz Finke, Alexandra Dmitrienko
Keywords-EN: modern world, integral part, Machine Learning, Abstract, Learning
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Facial recognition systems have become an integral part of the modern world. These methods accomplish the task of human identification in an automatic, fast, and non-interfering way. Past research has uncovered high vulnerability to simple imitation attacks that could lead to erroneous identification and subsequent authentication of attackers. Similar to face recognition, imitation attacks can also be detected with Machine Learning. Attack detection systems use a variety of facial features and advanced machine learning models for uncovering the presence of attacks. In this work, we assess existing work on liveness detection and propose a novel approach that promises high classification accuracy by combining previously unused features with time-aware deep learning strategies.
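
For the rotation-invariant local binary pattern features the approach builds on, scikit-image offers an off-the-shelf implementation. The helper below is an illustrative sketch (not the authors' pipeline); per-frame histograms like these could then feed a time-aware classifier.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def ri_lbp_histogram(gray_face, P=8, R=1.0):
    """Rotation-invariant uniform LBP histogram for one grayscale face crop."""
    codes = local_binary_pattern(gray_face, P, R, method="uniform")
    hist, _ = np.histogram(codes, bins=np.arange(P + 3), density=True)
    return hist                               # P + 2 bins for uniform codes

hist = ri_lbp_histogram(np.random.rand(64, 64))
```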

[CV-53] Alfie: Democratising RGBA Image Generation With No $$$

Link: https://arxiv.org/abs/2408.14826
Authors: Fabio Quattrini, Vittorio Pippi, Silvia Cascianelli, Rita Cucchiara
Keywords-EN: requiring graphic design, graphic design skills, artworks are ubiquitous, skills and dedicated, dedicated software
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: Accepted at ECCV AI for Visual Arts Workshop and Challenges

Click to view abstract

Abstract:Designs and artworks are ubiquitous across various creative fields, requiring graphic design skills and dedicated software to create compositions that include many graphical elements, such as logos, icons, symbols, and art scenes, which are integral to visual storytelling. Automating the generation of such visual elements improves graphic designers’ productivity, democratizes and innovates the creative industry, and helps generate more realistic synthetic data for related tasks. These illustration elements are mostly RGBA images with irregular shapes and cutouts, facilitating blending and scene composition. However, most image generation models are incapable of generating such images and achieving this capability requires expensive computational resources, specific training recipes, or post-processing solutions. In this work, we propose a fully-automated approach for obtaining RGBA illustrations by modifying the inference-time behavior of a pre-trained Diffusion Transformer model, exploiting the prompt-guided controllability and visual quality offered by such models with no additional computational cost. We force the generation of entire subjects without sharp croppings, whose background is easily removed for seamless integration into design projects or artistic scenes. We show with a user study that, in most cases, users prefer our solution over generating and then matting an image, and we show that our generated illustrations yield good results when used as inputs for composite scene generation pipelines. We release the code at this https URL.

[CV-54] From Rule-Based Models to Deep Learning Transformers Architectures for Natural Language Processing and Sign Language Translation Systems: Survey, Taxonomy and Performance Evaluation

Link: https://arxiv.org/abs/2408.14825
Authors: Nada Shahin, Leila Ismail
Keywords-EN: Hearing population worldwide, Deaf and Hard, Hard of Hearing, growing Deaf, Hearing population
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:With the growing Deaf and Hard of Hearing population worldwide and the persistent shortage of certified sign language interpreters, there is a pressing need for an efficient, signs-driven, integrated end-to-end translation system, from sign to gloss to text and vice-versa. There has been a wealth of research on machine translation and related reviews. However, there are few works on sign language machine translation that consider the particularity of the language being continuous and dynamic. This paper aims to address this void, providing a retrospective analysis of the temporal evolution of sign language machine translation algorithms and a taxonomy of Transformer architectures, the most used approach in language translation. We also present the requirements of a real-time Quality-of-Service sign language machine translation system underpinned by accurate deep learning algorithms. We propose future research directions for sign language translation systems.

[CV-55] LapisGS: Layered Progressive 3D Gaussian Splatting for Adaptive Streaming

Link: https://arxiv.org/abs/2408.14823
Authors: Yuang Shi, Simone Gasparini, Géraldine Morin, Wei Tsang Ooi
Keywords-EN: Extended Reality, rise of Extended, requires efficient streaming, online worlds, challenging current
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments:

Click to view abstract

Abstract:The rise of Extended Reality (XR) requires efficient streaming of 3D online worlds, challenging current 3DGS representations to adapt to bandwidth-constrained environments. This paper proposes LapisGS, a layered 3DGS that supports adaptive streaming and progressive rendering. Our method constructs a layered structure for cumulative representation, incorporates dynamic opacity optimization to maintain visual fidelity, and utilizes occupancy maps to efficiently manage Gaussian splats. This proposed model offers a progressive representation supporting a continuous rendering quality adapted for bandwidth-aware streaming. Extensive experiments validate the effectiveness of our approach in balancing visual fidelity with the compactness of the model, with up to 50.71% improvement in SSIM, 286.53% improvement in LPIPS, and 318.41% reduction in model size, and shows its potential for bandwidth-adapted 3D streaming and rendering applications.

[CV-56] Build-A-Scene: Interactive 3D Layout Control for Diffusion-Based Image Generation

Link: https://arxiv.org/abs/2408.14819
Authors: Abdelrahman Eldesokey, Peter Wonka
Keywords-EN: layout control, layout, control, propose a diffusion-based, diffusion-based approach
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL

Click to view abstract

Abstract:We propose a diffusion-based approach for Text-to-Image (T2I) generation with interactive 3D layout control. Layout control has been widely studied to alleviate the shortcomings of T2I diffusion models in understanding objects’ placement and relationships from text descriptions. Nevertheless, existing approaches for layout control are limited to 2D layouts, require the user to provide a static layout beforehand, and fail to preserve generated images under layout changes. This makes these approaches unsuitable for applications that require 3D object-wise control and iterative refinements, e.g., interior design and complex scene generation. To this end, we leverage the recent advancements in depth-conditioned T2I models and propose a novel approach for interactive 3D layout control. We replace the traditional 2D boxes used in layout control with 3D boxes. Furthermore, we revamp the T2I task as a multi-stage generation process, where at each stage, the user can insert, change, and move an object in 3D while preserving objects from earlier stages. We achieve this through our proposed Dynamic Self-Attention (DSA) module and the consistent 3D object translation strategy. Experiments show that our approach can generate complicated scenes based on 3D layouts, boosting the object generation success rate over the standard depth-conditioned T2I methods by 2x. Moreover, it outperforms other methods in comparison in preserving objects under layout changes. Project Page: this https URL

[CV-57] HPT: Hierarchically Prompting Vision-Language Models with Multi-Granularity Knowledge Generation and Improved Structure Modeling

Link: https://arxiv.org/abs/2408.14812
Authors: Yubin Wang, Xinyang Jiang, De Cheng, Wenli Sun, Dongsheng Li, Cairong Zhao
Keywords-EN: adapting vision-language foundation, CLIP to downstream, vision-language foundation models, downstream tasks, prevalent strategy
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 7 figures, 7 tables. arXiv admin note: substantial text overlap with arXiv:2312.06323

Click to view abstract

Abstract:Prompt learning has become a prevalent strategy for adapting vision-language foundation models (VLMs) such as CLIP to downstream tasks. With the emergence of large language models (LLMs), recent studies have explored the potential of using category-related descriptions to enhance prompt effectiveness. However, conventional descriptions lack explicit structured information necessary to represent the interconnections among key elements like entities or attributes with relation to a particular category. Since existing prompt tuning methods give little consideration to managing structured knowledge, this paper advocates leveraging LLMs to construct a graph for each description to prioritize such structured knowledge. Consequently, we propose a novel approach called Hierarchical Prompt Tuning (HPT), enabling simultaneous modeling of both structured and conventional linguistic knowledge. Specifically, we introduce a relationship-guided attention module to capture pair-wise associations among entities and attributes for low-level prompt learning. In addition, by incorporating high-level and global-level prompts modeling overall semantics, the proposed hierarchical structure forges cross-level interlinks and empowers the model to handle more complex and long-term relationships. Finally, by enhancing multi-granularity knowledge generation, redesigning the relationship-driven attention re-weighting module, and incorporating consistent constraints on the hierarchical text encoder, we propose HPT++, which further improves the performance of HPT. Our experiments are conducted across a wide range of evaluation settings, including base-to-new generalization, cross-dataset evaluation, and domain generalization. Extensive results and ablation studies demonstrate the effectiveness of our methods, which consistently outperform existing SOTA methods.

[CV-58] Platypus: A Generalized Specialist Model for Reading Text in Various Forms ECCV2024

Link: https://arxiv.org/abs/2408.14805
Authors: Peng Wang, Zhaohai Li, Jun Tang, Humen Zhong, Fei Huang, Zhibo Yang, Cong Yao
Keywords-EN: wide application range, long-standing research topic, high technical challenge, topic for decades, application range
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ECCV2024

Click to view abstract

Abstract:Reading text from images (either natural scenes or documents) has been a long-standing research topic for decades, due to the high technical challenge and wide application range. Previously, individual specialist models are developed to tackle the sub-tasks of text reading (e.g., scene text recognition, handwritten text recognition and mathematical expression recognition). However, such specialist models usually cannot effectively generalize across different sub-tasks. Recently, generalist models (such as GPT-4V), trained on tremendous data in a unified way, have shown enormous potential in reading text in various scenarios, but with the drawbacks of limited accuracy and low efficiency. In this work, we propose Platypus, a generalized specialist model for text reading. Specifically, Platypus combines the best of both worlds: being able to recognize text of various forms with a single unified architecture, while achieving excellent accuracy and high efficiency. To better exploit the advantage of Platypus, we also construct a text reading dataset (called Worms), the images of which are curated from previous datasets and partially re-labeled. Experiments on standard benchmarks demonstrate the effectiveness and superiority of the proposed Platypus model. Model and data will be made publicly available at this https URL.

[CV-59] RAW-Adapter: Adapting Pre-trained Visual Model to Camera RAW Images ECCV2024

Link: https://arxiv.org/abs/2408.14802
Authors: Ziteng Cui, Tatsuya Harada
Keywords-EN: pre-training visual models, camera RAW data, efficient storage, ISP stages, predominant choice
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: ECCV 2024, code link: this https URL

Click to view abstract

Abstract:sRGB images are now the predominant choice for pre-training visual models in computer vision research, owing to their ease of acquisition and efficient storage. Meanwhile, the advantage of RAW images lies in their rich physical information under variable real-world challenging lighting conditions. For computer vision tasks directly based on camera RAW data, most existing studies adopt methods of integrating image signal processor (ISP) with backend networks, yet often overlook the interaction capabilities between the ISP stages and subsequent networks. Drawing inspiration from ongoing adapter research in NLP and CV areas, we introduce RAW-Adapter, a novel approach aimed at adapting sRGB pre-trained models to camera RAW data. RAW-Adapter comprises input-level adapters that employ learnable ISP stages to adjust RAW inputs, as well as model-level adapters to build connections between ISP stages and subsequent high-level networks. Additionally, RAW-Adapter is a general framework that could be used in various computer vision frameworks. Abundant experiments under different lighting conditions have shown our algorithm’s state-of-the-art (SOTA) performance, demonstrating its effectiveness and efficiency across a range of real-world and synthetic datasets.

[CV-60] Revisiting Surgical Instrument Segmentation Without Human Intervention: A Graph Partitioning View

Link: https://arxiv.org/abs/2408.14789
Authors: Mingyu Sheng, Jianan Fan, Dongnan Liu, Ron Kikinis, Weidong Cai
Keywords-EN: minimally invasive surgery, boosting minimally invasive, endoscopic images stands, Surgical instrument segmentation, surgical video frames
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Surgical instrument segmentation (SIS) on endoscopic images stands as a long-standing and essential task in the context of computer-assisted interventions for boosting minimally invasive surgery. Given the recent surge of deep learning methodologies and their data-hungry nature, training a neural predictive model based on massive expert-curated annotations has been dominating and has served as an off-the-shelf approach in the field, which could, however, impose a prohibitive burden on clinicians for preparing fine-grained pixel-wise labels corresponding to the collected surgical video frames. In this work, we propose an unsupervised method by reframing video frame segmentation as a graph partitioning problem and regarding image pixels as graph nodes, which is significantly different from previous efforts. A self-supervised pre-trained model is first leveraged as a feature extractor to capture high-level semantic features. Then, Laplacian matrices are computed from the features and eigendecomposed for graph partitioning. On the “deep” eigenvectors, a surgical video frame is meaningfully segmented into different modules such as tools and tissues, providing distinguishable semantic information like locations, classes, and relations. The segmentation problem can then be naturally tackled by applying clustering or thresholding to the eigenvectors. Extensive experiments are conducted on various datasets (e.g., EndoVis2017, EndoVis2018, UCL, etc.) for different clinical endpoints. Across all the challenging scenarios, our method demonstrates outstanding performance and robustness exceeding unsupervised state-of-the-art (SOTA) methods. The code is released at this https URL.
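
The graph-partitioning view translates almost directly into code: build a pixel affinity graph from self-supervised features, eigendecompose the Laplacian, and cluster the "deep" eigenvectors. A simplified sketch (dense affinities and an unnormalized Laplacian are simplifying assumptions):

```python
import torch
from sklearn.cluster import KMeans

def spectral_segment(feats, k=4):
    """Partition one frame's pixels into k segments. `feats` is an (H*W, D)
    matrix of per-pixel self-supervised features (an assumed input format)."""
    f = torch.nn.functional.normalize(feats, dim=1)
    W = (f @ f.T).clamp(min=0)                # cosine affinities as graph edges
    L = torch.diag(W.sum(dim=1)) - W          # unnormalized graph Laplacian
    _, eigvecs = torch.linalg.eigh(L)         # eigenvectors, ascending eigenvalues
    deep = eigvecs[:, 1:k]                    # drop the trivial constant eigenvector
    return KMeans(n_clusters=k, n_init=10).fit_predict(deep.numpy())

segments = spectral_segment(torch.randn(256, 64), k=4)
```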

[CV-61] MROVSeg: Breaking the Resolution Curse of Vision-Language Models in Open-Vocabulary Semantic Segmentation

Link: https://arxiv.org/abs/2408.14776
Authors: Yuanbing Zhu, Bingke Zhu, Zhen Chen, Huan Xu, Ming Tang, Jinqiao Wang
Keywords-EN: recognize semantically meaningful, Open-vocabulary semantic segmentation, semantically meaningful regions, meaningful regions based, descriptions during inference
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Technical report

Click to view abstract

Abstract:Open-vocabulary semantic segmentation aims to segment and recognize semantically meaningful regions based on text-based descriptions during inference. A typical solution to address this task is to leverage powerful vision-language models (VLMs), such as CLIP, to bridge the gap between open- and close-vocabulary recognition. As VLMs are usually pretrained with low-resolution images (e.g. 224×224), most previous methods operate only on downscaled images. We question this design, as low-resolution features often fail to preserve fine details. Although employing additional image backbones for high-resolution inputs can mitigate this issue, it may also introduce significant computation overhead. Therefore, we propose MROVSeg, a multi-resolution training framework for open-vocabulary semantic segmentation with a single pretrained CLIP backbone, which uses sliding windows to slice the high-resolution input into uniform patches, each matching the input size of the well-trained image encoder. Its key components include a Multi-Res Adapter, which restores the spatial geometry and grasps local-global correspondences across patches via learnable convolutional and scale attention layers. To achieve accurate segmentation, we introduce a Multi-grained Masked Attention scheme to aggregate multi-grained semantics by performing cross-attention between object queries and multi-resolution CLIP features within the regions of interest. Through comprehensive experiments, we demonstrate the superiority of MROVSeg on well-established open-vocabulary semantic segmentation benchmarks, particularly for high-resolution inputs, establishing new standards for open-vocabulary semantic segmentation.
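
The sliding-window slicing that feeds a high-resolution input to the low-resolution CLIP encoder can be sketched as follows (non-overlapping windows and zero padding are simplifying assumptions):

```python
import torch
import torch.nn.functional as F

def slice_into_windows(image, win=224):
    """Slice a (B, C, H, W) batch into win x win patches matching the encoder."""
    b, c, h, w = image.shape
    image = F.pad(image, (0, (-w) % win, 0, (-h) % win))      # pad to multiples
    patches = image.unfold(2, win, win).unfold(3, win, win)   # (B, C, nH, nW, win, win)
    return patches.permute(0, 2, 3, 1, 4, 5).reshape(-1, c, win, win)

windows = slice_into_windows(torch.randn(1, 3, 448, 672))     # -> (6, 3, 224, 224)
```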

[CV-62] Text-guided Foundation Model Adaptation for Long-Tailed Medical Image Classification

Link: https://arxiv.org/abs/2408.14770
Authors: Sirui Li, Li Lin, Yijin Huang, Pujin Cheng, Xiaoying Tang
Keywords-EN: imbalanced data distribution, due to scarce, rare diseases, greatly impairs, deep learning models
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IEEE ISBI 2024

Click to view abstract

Abstract:In medical contexts, the imbalanced data distribution in long-tailed datasets, due to scarce labels for rare diseases, greatly impairs the diagnostic accuracy of deep learning models. Recent multimodal text-image supervised foundation models offer new solutions to data scarcity through effective representation learning. However, their limited medical-specific pretraining hinders their performance in medical image classification relative to natural images. To address this issue, we propose a novel Text-guided Foundation model Adaptation for Long-Tailed medical image classification (TFA-LT). We adopt a two-stage training strategy, integrating representations from the foundation model using just two linear adapters and a single ensembler for balanced outcomes. Experimental results on two long-tailed medical image datasets validate the simplicity, light weight, and efficiency of our approach: requiring only 6.1% of the GPU memory used by the current best-performing algorithm, our method achieves an accuracy improvement of up to 27.1%, highlighting the substantial potential of foundation model adaptation in this area.
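
A sketch of the "two linear adapters plus a single ensembler" idea; the exact feature sources and the form of the ensembler in the paper are assumptions here.

```python
import torch
import torch.nn as nn

class TFALTHead(nn.Module):
    """Two linear adapters over frozen foundation-model features, mixed by a
    single learnable ensembling weight (illustrative wiring, not the paper's)."""
    def __init__(self, img_dim, txt_dim, n_classes):
        super().__init__()
        self.img_adapter = nn.Linear(img_dim, n_classes)
        self.txt_adapter = nn.Linear(txt_dim, n_classes)
        self.ensembler = nn.Parameter(torch.tensor(0.5))

    def forward(self, img_feat, txt_feat):
        w = torch.sigmoid(self.ensembler)     # balanced mixing weight
        return w * self.img_adapter(img_feat) + (1 - w) * self.txt_adapter(txt_feat)

head = TFALTHead(img_dim=512, txt_dim=512, n_classes=7)
logits = head(torch.randn(4, 512), torch.randn(4, 512))
```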

[CV-63] CrossViewDiff: A Cross-View Diffusion Model for Satellite-to-Street View Synthesis

Link: https://arxiv.org/abs/2408.14765
Authors: Weijia Li, Jun He, Junyan Ye, Huaping Zhong, Zhimeng Zheng, Zilong Huang, Dahua Lin, Conghui He
Keywords-EN: view synthesis aims, satellite-view image, cross-view, synthesis, image
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages, 11 figures

Click to view abstract

Abstract:Satellite-to-street view synthesis aims at generating a realistic street-view image from its corresponding satellite-view image. Although stable diffusion models have exhibited remarkable performance in a variety of image generation applications, their reliance on similar-view inputs to control the generated structure or texture restricts their application to the challenging cross-view synthesis task. In this work, we propose CrossViewDiff, a cross-view diffusion model for satellite-to-street view synthesis. To address the challenges posed by the large discrepancy across views, we design the satellite scene structure estimation and cross-view texture mapping modules to construct the structural and textural controls for street-view image synthesis. We further design a cross-view control guided denoising process that incorporates the above controls via an enhanced cross-view attention module. To achieve a more comprehensive evaluation of the synthesis results, we additionally design a GPT-based scoring method as a supplement to standard evaluation metrics. We also explore the effect of different data sources (e.g., text, maps, building heights, and multi-temporal satellite imagery) on this task. Results on three public cross-view datasets show that CrossViewDiff outperforms the current state-of-the-art on both standard and GPT-based evaluation metrics, generating high-quality street-view panoramas with more realistic structures and textures across rural, suburban, and urban scenes. The code and models of this work will be released at this https URL.

[CV-64] SynthDoc: Bilingual Documents Synthesis for Visual Document Understanding

Link: https://arxiv.org/abs/2408.14764
Authors: Chuanghao Ding, Xuejing Liu, Wei Tang, Juan Li, Xiaoliang Wang, Rui Zhao, Cam-Tu Nguyen, Fei Tan
Keywords-EN: Visual Document Understanding, enhance Visual Document, paper introduces SynthDoc, enhance Visual, generation pipeline designed
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments:

Click to view abstract

Abstract:This paper introduces SynthDoc, a novel synthetic document generation pipeline designed to enhance Visual Document Understanding (VDU) by generating high-quality, diverse datasets that include text, images, tables, and charts. Addressing the challenges of data acquisition and the limitations of existing datasets, SynthDoc leverages publicly available corpora and advanced rendering tools to create a comprehensive and versatile dataset. Our experiments, conducted using the Donut model, demonstrate that models trained with SynthDoc’s data achieve superior performance in pre-training read tasks and maintain robustness in downstream tasks, despite language inconsistencies. The release of a benchmark dataset comprising 5,000 image-text pairs not only showcases the pipeline’s capabilities but also provides a valuable resource for the VDU community to advance research and development in document image recognition. This work significantly contributes to the field by offering a scalable solution to data scarcity and by validating the efficacy of end-to-end models in parsing complex, real-world documents.

[CV-65] Learning effective pruning at initialization from iterative pruning

Link: https://arxiv.org/abs/2408.14757
Authors: Shengkai Liu, Yaofeng Cheng, Fusheng Zha, Wei Guo, Lining Sun, Zhenshan Bing, Chenguang Yang
Keywords-EN: reduces training costs, growing network size, costs by removing, removing weights, increasingly crucial
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Pruning at initialization (PaI) reduces training costs by removing weights before training, which becomes increasingly crucial with growing network sizes. However, current PaI methods still have a large accuracy gap with iterative pruning, especially at high sparsity levels. This raises an intriguing question: can we get inspiration from iterative pruning to improve PaI performance? In the lottery ticket hypothesis, iterative rewind pruning (IRP) finds subnetworks retroactively by rewinding the parameters to the original initialization in every pruning iteration, which means all the subnetworks are based on the initial state. Here, we hypothesise that the surviving subnetworks are more important and use the link between their initial features and their survival scores as the PaI criterion. We employ an end-to-end neural network (AutoSparse) to learn this correlation: it takes the model’s initial features as input, outputs their scores, and then prunes the lowest-scoring parameters before training. To validate the accuracy and generalization of our method, we performed PaI across various models. Results show that our approach outperforms existing methods in high-sparsity settings. Notably, as the underlying logic of model pruning is consistent across different models, only a one-time IRP on one model is needed (e.g., once IRP is done on ResNet-18/CIFAR-10, AutoSparse can be generalized to VGG-16/CIFAR-10, ResNet-18/TinyImageNet, et al.). As the first neural network-based PaI method, we conduct extensive experiments to validate the factors influencing this approach. These results reveal the learning tendencies of neural networks and provide new insights into our understanding and research of PaI from a practical perspective. Our code is available at: this https URL.
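
The resulting pruning-at-initialization procedure boils down to scoring the initial weights and masking the lowest-scoring ones before training. In the sketch below, scorer stands in for the trained AutoSparse network; its interface is an assumption.

```python
import torch

def prune_at_init(model, scorer, sparsity=0.9):
    """Mask the lowest-scoring weights before training. `scorer` maps a weight
    tensor to per-weight survival scores of the same shape (assumed interface)."""
    masks = {}
    for name, p in model.named_parameters():
        if p.dim() < 2:                       # skip biases / norm parameters
            continue
        scores = scorer(p.detach())
        k = max(1, int(p.numel() * sparsity))
        threshold = scores.flatten().kthvalue(k).values
        masks[name] = (scores > threshold).float()
        p.data.mul_(masks[name])              # zero out pruned weights
    return masks                              # reapply after each optimizer step
```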

[CV-66] RSTeller: Scaling Up Visual Language Modeling in Remote Sensing with Rich Linguistic Semantics from Openly Available Data and Large Language Models

Link: https://arxiv.org/abs/2408.14744
Authors: Junyao Ge, Yang Zheng, Kaitai Guo, Jimin Liang
Keywords-EN: Google Earth Engine, aligning complex visual, enabling the development, interpretation tasks, pivotal for aligning
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Submitted to ISPRS

Click to view abstract

Abstract:Abundant, well-annotated multimodal data in remote sensing are pivotal for aligning complex visual remote sensing (RS) scenes with human language, enabling the development of specialized vision language models across diverse RS interpretation tasks. However, annotating RS images with rich linguistic semantics at scale demands expertise in RS and substantial human labor, making it costly and often impractical. In this study, we propose a workflow that leverages large language models (LLMs) to generate multimodal datasets with semantically rich captions at scale from plain OpenStreetMap (OSM) data for images sourced from the Google Earth Engine (GEE) platform. This approach facilitates the generation of paired remote sensing data and can be readily scaled up using openly available data. Within this framework, we present RSTeller, a multimodal dataset comprising over 1 million RS images, each accompanied by multiple descriptive captions. Extensive experiments demonstrate that RSTeller enhances the performance of multiple existing vision language models for RS scene understanding through continual pre-training. Our methodology significantly reduces the manual effort and expertise needed for annotating remote sensing imagery while democratizing access to high-quality annotated data. This advancement fosters progress in visual language modeling and encourages broader participation in remote sensing research and applications. The RSTeller dataset is available at this https URL.

[CV-67] Personalized Video Summarization using Text-Based Queries and Conditional Modeling

Link: https://arxiv.org/abs/2408.14743
Authors: Jia-Hong Huang
Keywords-EN: Vimeo presents significant, presents significant challenges, efficiently locating relevant, YouTube and Vimeo, Vimeo presents
Subjects: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Comments: Ph.D. thesis, 137 pages

Click to view abstract

Abstract:The proliferation of video content on platforms like YouTube and Vimeo presents significant challenges in efficiently locating relevant information. Automatic video summarization aims to address this by extracting and presenting key content in a condensed form. This thesis explores enhancing video summarization by integrating text-based queries and conditional modeling to tailor summaries to user needs. Traditional methods often produce fixed summaries that may not align with individual requirements. To overcome this, we propose a multi-modal deep learning approach that incorporates both textual queries and visual information, fusing them at different levels of the model architecture. Evaluation metrics such as accuracy and F1-score assess the quality of the generated summaries. The thesis also investigates improving text-based query representations using contextualized word embeddings and specialized attention networks. This enhances the semantic understanding of queries, leading to better video summaries. To emulate human-like summarization, which accounts for both visual coherence and abstract factors like storyline consistency, we introduce a conditional modeling approach. This method uses multiple random variables and joint distributions to capture key summarization components, resulting in more human-like and explainable summaries. Addressing data scarcity in fully supervised learning, the thesis proposes a segment-level pseudo-labeling approach. This self-supervised method generates additional data, improving model performance even with limited human-labeled datasets. In summary, this research aims to enhance automatic video summarization by incorporating text-based queries, improving query representations, introducing conditional modeling, and addressing data scarcity, thereby creating more effective and personalized video summaries.

[CV-68] Learning Differentially Private Diffusion Models via Stochastic Adversarial Distillation ECCV2024

Link: https://arxiv.org/abs/2408.14738
Authors: Bochao Liu, Pengju Wang, Shiming Ge
Keywords-EN: deep learning relies, privacy-sensitive domains, relies on large, large amounts, generative model learning
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ECCV 2024

Click to view abstract

Abstract:While the success of deep learning relies on large amounts of training datasets, data is often limited in privacy-sensitive domains. To address this challenge, generative model learning with differential privacy has emerged as a solution to train private generative models for desensitized data generation. However, the quality of the images generated by existing methods is limited due to the complexity of modeling data distribution. We build on the success of diffusion models and introduce DP-SAD, which trains a private diffusion model by a stochastic adversarial distillation method. Specifically, we first train a diffusion model as a teacher and then train a student by distillation, in which we achieve differential privacy by adding noise to the gradients from other models to the student. For better generation quality, we introduce a discriminator to distinguish whether an image is from the teacher or the student, which forms the adversarial training. Extensive experiments and analysis clearly demonstrate the effectiveness of our proposed method.
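
The privacy mechanism, clipping gradients and adding calibrated Gaussian noise, follows the familiar DP-SGD recipe. A sketch under that assumption (DP-SAD itself injects the noise into the gradients coming from the teacher and discriminator, and the constants here are illustrative):

```python
import torch

def dp_noisy_step(student, clip_norm=1.0, sigma=1.0):
    """Call after loss.backward(): clip the student's gradients in global norm,
    then perturb them with Gaussian noise before the optimizer step."""
    torch.nn.utils.clip_grad_norm_(student.parameters(), clip_norm)
    for p in student.parameters():
        if p.grad is not None:
            p.grad += torch.randn_like(p.grad) * sigma * clip_norm
```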

[CV-69] OctFusion: Octree-based Diffusion Models for 3D Shape Generation

Link: https://arxiv.org/abs/2408.14732
Authors: Bojun Xiong, Si-Tong Wei, Xin-Yang Zheng, Yan-Pei Cao, Zhouhui Lian, Peng-Shuai Wang
Keywords-EN: Diffusion models, popular method, Diffusion, accompanying diffusion models, proposed diffusion model
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Technical Report

Click to view abstract

Abstract:Diffusion models have emerged as a popular method for 3D generation. However, it is still challenging for diffusion models to efficiently generate diverse and high-quality 3D shapes. In this paper, we introduce OctFusion, which can generate 3D shapes with arbitrary resolutions in 2.5 seconds on a single Nvidia 4090 GPU, and the extracted meshes are guaranteed to be continuous and manifold. The key components of OctFusion are the octree-based latent representation and the accompanying diffusion models. The representation combines the benefits of both implicit neural representations and explicit spatial octrees and is learned with an octree-based variational autoencoder. The proposed diffusion model is a unified multi-scale U-Net that enables weights and computation sharing across different octree levels and avoids the complexity of widely used cascaded diffusion schemes. We verify the effectiveness of OctFusion on the ShapeNet and Objaverse datasets and achieve state-of-the-art performances on shape generation tasks. We demonstrate that OctFusion is extendable and flexible by generating high-quality color fields for textured mesh generation and high-quality 3D shapes conditioned on text prompts, sketches, or category labels. Our code and pre-trained models are available at this https URL.

[CV-70] GeoTransfer: Generalizable Few-Shot Multi-View Reconstruction via Transfer Learning

Link: https://arxiv.org/abs/2408.14724
Authors: Shubhendu Jena,Franck Multon,Adnane Boukhayma
Keywords-EN: Neural Radiance Fields, power of Neural, Neural Radiance, accurate occupancy fields, occupancy fields
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Click to view abstract

Abstract:This paper presents a novel approach for sparse 3D reconstruction by leveraging the expressive power of Neural Radiance Fields (NeRFs) and fast transfer of their features to learn accurate occupancy fields. Existing 3D reconstruction methods from sparse inputs still struggle with capturing intricate geometric details and can suffer from limitations in handling occluded regions. On the other hand, NeRFs excel in modeling complex scenes but do not offer means to extract meaningful geometry. Our proposed method offers the best of both worlds by transferring the information encoded in NeRF features to derive an accurate occupancy field representation. We utilize a pre-trained, generalizable state-of-the-art NeRF network to capture detailed scene radiance information, and rapidly transfer this knowledge to train a generalizable implicit occupancy network. This process helps in leveraging the knowledge of the scene geometry encoded in the generalizable NeRF prior and refining it to learn occupancy fields, facilitating a more precise generalizable representation of 3D space. The transfer learning approach leads to a dramatic reduction in training time, by orders of magnitude (i.e. from several days to 3.5 hrs), obviating the need to train generalizable sparse surface reconstruction methods from scratch. Additionally, we introduce a novel loss on volumetric rendering weights that helps in the learning of accurate occupancy fields, along with a normal loss that helps in global smoothing of the occupancy fields. We evaluate our approach on the DTU dataset and demonstrate state-of-the-art performance in terms of reconstruction accuracy, especially in challenging scenarios with sparse input data and occluded regions. We furthermore demonstrate the generalization capabilities of our method by showing qualitative results on the Blended MVS dataset without any retraining.

[CV-71] Snap and Diagnose: An Advanced Multimodal Retrieval System for Identifying Plant Diseases in the Wild

Link: https://arxiv.org/abs/2408.14723
Authors: Tianqi Wei,Zhi Chen,Xin Yu
Keywords-EN: ensures crop health, Plant disease recognition, Plant disease, disease, critical task
Subjects: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
*Comments:

Click to view abstract

Abstract:Plant disease recognition is a critical task that ensures crop health and mitigates the damage caused by diseases. A handy tool that enables farmers to receive a diagnosis based on query pictures or the text description of suspicious plants is in high demand for initiating treatment before potential diseases spread further. In this paper, we develop a multimodal plant disease image retrieval system to support disease search based on either image or text prompts. Specifically, we utilize the largest in-the-wild plant disease dataset PlantWild, which includes over 18,000 images across 89 categories, to provide a comprehensive view of potential diseases relating to the query. Furthermore, cross-modal retrieval is achieved in the developed system, facilitated by a novel CLIP-based vision-language model that encodes both disease descriptions and disease images into the same latent space. Built on top of the retriever, our retrieval system allows users to upload either plant disease images or disease descriptions to retrieve the corresponding images with similar characteristics from the disease dataset to suggest candidate diseases for end users’ consideration.
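
For readers wanting a feel for the retrieval mechanics, here is a minimal sketch using an off-the-shelf CLIP checkpoint from Hugging Face `transformers`; the paper trains its own CLIP-based vision-language model on PlantWild, which this does not reproduce.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def retrieve(query_text, gallery_images, top_k=5):
    """Rank gallery images by similarity to a text query in CLIP space."""
    inputs = processor(text=[query_text], images=gallery_images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = model(**inputs).logits_per_text[0]  # (num_images,)
    return scores.topk(min(top_k, len(gallery_images))).indices.tolist()

# Usage (image_paths is a hypothetical list of disease image files):
# retrieve("yellow spots on tomato leaves",
#          [Image.open(p) for p in image_paths])
```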

[CV-72] Smart Multi-Modal Search: Contextual Sparse and Dense Embedding Integration in Adobe Express

Link: https://arxiv.org/abs/2408.14698
Authors: Cherag Aroraa,Tracy Holloway King,Jayant Kumar,Yi Lu,Sanat Sharma,Arvind Srikantan,David Uvalle,Josep Valls-Vargas,Harsha Vardhan
Keywords-EN: multi-modal search systems, effective multi-modal search, multi-modal search, search systems, search
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Click to view abstract

Abstract:As user content and queries become increasingly multi-modal, the need for effective multi-modal search systems has grown. Traditional search systems often rely on textual and metadata annotations for indexed images, while multi-modal embeddings like CLIP enable direct search using text and image embeddings. However, embedding-based approaches face challenges in integrating contextual features such as user locale and recency. Building a scalable multi-modal search system requires fine-tuning several components. This paper presents a multi-modal search architecture and a series of A/B tests that optimize embeddings and multi-modal technologies in Adobe Express template search. We address considerations such as embedding model selection, the roles of embeddings in matching and ranking, and the balance between dense and sparse embeddings. Our iterative approach demonstrates how utilizing sparse, dense, and contextual features enhances short and long query search, significantly reduces null rates (over 70%), and increases click-through rates (CTR). Our findings provide insights into developing robust multi-modal search systems, thereby enhancing relevance for complex queries.
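
The sparse/dense/contextual blending could look roughly like the following sketch; the min-max normalization, the mixing weight, and the multiplicative contextual boost are assumptions for illustration, not Adobe's production logic.

```python
import numpy as np

def hybrid_score(sparse_scores, dense_scores, contextual_boost, alpha=0.6):
    """Blend lexical (e.g. BM25), embedding (cosine), and contextual signals.

    alpha is an illustrative mixing weight; contextual_boost could encode
    locale or recency multipliers per candidate.
    """
    def minmax(x):
        x = np.asarray(x, dtype=float)
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x)

    blended = alpha * minmax(dense_scores) + (1 - alpha) * minmax(sparse_scores)
    return blended * np.asarray(contextual_boost, dtype=float)

# Usage: rank two candidates, the second boosted by recency.
ranking = np.argsort(-hybrid_score([12.1, 3.4], [0.82, 0.75], [1.0, 1.2]))
```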

[CV-73] Enhancing Neural Network Interpretability Through Conductance-Based Information Plane Analysis

Link: https://arxiv.org/abs/2408.14681
Authors: Jaouad Dabounou,Amine Baazzouz
Keywords-EN: Information Plane, Information Plane analysis, conductance-based Information Plane, Information, traditional methods based
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*Comments: 16 pages, 10 figures

Click to view abstract

Abstract:The Information Plane is a conceptual framework used to analyze the flow of information in neural networks, but traditional methods based on activations may not fully capture the dynamics of information processing. This paper introduces a new approach that uses layer conductance, a measure of sensitivity to input features, to enhance the Information Plane analysis. By incorporating gradient-based contributions, we provide a more precise characterization of information dynamics within the network. The proposed conductance-based Information Plane and a new Information Transformation Efficiency (ITE) metric are evaluated on pretrained ResNet50 and VGG16 models using the ImageNet dataset. Our results demonstrate the ability to identify critical hidden layers that contribute significantly to model performance and interpretability, giving insights into information compression, preservation, and utilization across layers. The conductance-based approach offers a granular perspective on feature attribution, enhancing our understanding of the decision-making processes within neural networks. Furthermore, our empirical findings challenge certain theoretical predictions of the Information Bottleneck theory, highlighting the complexities of information dynamics in real-world data scenarios. The proposed method not only advances our understanding of information dynamics in neural networks but also has the potential to significantly impact the broader field of Artificial Intelligence by enabling the development of more interpretable, efficient, and robust models.
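
Layer conductance of the kind used here can be computed with Captum's `LayerConductance`; a minimal sketch on a pretrained ResNet50 follows. The paper's Information Plane construction and ITE metric are built on top of such values, which this snippet does not implement.

```python
import torch
from captum.attr import LayerConductance
from torchvision.models import resnet50, ResNet50_Weights

model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1).eval()
x = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed ImageNet image

# Conductance of one hidden layer with respect to the predicted class.
target = model(x).argmax(dim=1).item()
lc = LayerConductance(model, model.layer3)
attributions = lc.attribute(x, target=target, n_steps=32)
print(attributions.shape)  # per-activation conductance for that layer
```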

[CV-74] gWaveNet: Classification of Gravity Waves from Noisy Satellite Data using Custom Kernel Integrated Deep Learning Method ICPR

Link: https://arxiv.org/abs/2408.14674
Authors: Seraj Al Mahmud Mostafa,Omar Faruque,Chenxi Wang,Jia Yue,Sanjay Purushotham,Jianwu Wang
Keywords-EN: Earths atmosphere caused, Earths atmosphere, gravity waves, buoyancy forces, waves
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments: This paper has been accepted at the 27th International Conference on Pattern Recognition (ICPR) 2024

Click to view abstract

Abstract:Atmospheric gravity waves occur in the Earth's atmosphere, caused by an interplay between gravity and buoyancy forces. These waves have profound impacts on various aspects of the atmosphere, including the patterns of precipitation, cloud formation, ozone distribution, aerosols, and pollutant dispersion. Therefore, understanding gravity waves is essential to comprehend and monitor changes in a wide range of atmospheric behaviors. Few studies have identified gravity waves from satellite data using machine learning techniques; in particular, detection without prior noise removal remains an underexplored area of research. This study presents a novel kernel design aimed at identifying gravity waves within satellite images. The proposed kernel is seamlessly integrated into a deep convolutional neural network, denoted as gWaveNet. Our proposed model exhibits impressive proficiency in detecting images containing gravity waves from noisy satellite data without any feature engineering. The empirical results show our model outperforms related approaches by achieving over 98% training accuracy and over 94% test accuracy, the best reported result for gravity-wave detection at the time of this work. We open-sourced our code at this https URL.
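
A schematic of how a hand-designed kernel can be seeded into a CNN stem; the 5x5 checkerboard below is purely illustrative and is not the paper's actual kernel for gravity-wave textures.

```python
import torch
import torch.nn as nn

class FixedKernelStem(nn.Module):
    """CNN stem initialized with a hand-designed kernel (illustrative)."""
    def __init__(self, trainable=False):
        super().__init__()
        self.conv = nn.Conv2d(1, 1, kernel_size=5, padding=2, bias=False)
        checkerboard = torch.tensor(
            [[(-1.0) ** (i + j) for j in range(5)] for i in range(5)])
        with torch.no_grad():
            self.conv.weight.copy_(checkerboard.view(1, 1, 5, 5))
        # Optionally freeze the custom kernel or let training refine it.
        self.conv.weight.requires_grad = trainable

    def forward(self, x):  # x: (batch, 1, H, W) satellite patch
        return torch.relu(self.conv(x))
```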

[CV-75] Physically Feasible Semantic Segmentation

Link: https://arxiv.org/abs/2408.14672
Authors: Shamik Basu,Christos Sakaridis,Luc Van Gool
Keywords-EN: minimizing solely per-pixel, solely per-pixel classification, per-pixel classification objectives, semantic segmentation, minimizing solely
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Click to view abstract

Abstract:State-of-the-art semantic segmentation models are typically optimized in a data-driven fashion, minimizing solely per-pixel classification objectives on their training data. This purely data-driven paradigm often leads to absurd segmentations, especially when the domain of input images is shifted from the one encountered during training. For instance, state-of-the-art models may assign the label "road" to a segment which is located above a segment that is respectively labeled as "sky", although our knowledge of the physical world dictates that such a configuration is not feasible for images captured by forward-facing upright cameras. Our method, Physically Feasible Semantic Segmentation (PhyFea), extracts explicit physical constraints that govern spatial class relations from the training sets of semantic segmentation datasets and enforces a differentiable loss function that penalizes violations of these constraints to promote prediction feasibility. PhyFea yields significant performance improvements in mIoU over each state-of-the-art network we use as baseline across ADE20K, Cityscapes and ACDC, notably a 1.5% improvement on ADE20K and a 2.1% improvement on ACDC.

[CV-76] Comparative Analysis: Violence Recognition from Videos using Transfer Learning

Link: https://arxiv.org/abs/2408.14659
Authors: Dursun Dashdamirov
Keywords-EN: hot topic, computer vision, Action recognition, Abstract, simple actions
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments: 6 pages, 5 figures, The paper will be published in IEEE AICT 2024 Conference

Click to view abstract

Abstract:Action recognition has become a hot topic in computer vision. However, the main applications of computer vision in video processing have focused on the detection of relatively simple actions, while complex events such as violence detection have been comparatively less investigated. This study focuses on benchmarking various deep learning techniques on a complex dataset. Next, a larger dataset is utilized to test the uplift from an increased volume of data. The dataset size increase from 500 to 1,600 videos resulted in a notable average accuracy improvement of 6% across four models.
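
A minimal frame-level transfer-learning baseline of the kind benchmarked in such studies; the backbone choice, the freezing strategy, and the two-class head are illustrative assumptions rather than the paper's exact recipe.

```python
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

def build_frame_classifier(num_classes=2):
    """Reuse ImageNet features; retrain only the classification head."""
    model = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1)
    for p in model.parameters():
        p.requires_grad = False  # freeze the pretrained backbone
    # New head: violent / non-violent logits per frame.
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model
```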

[CV-77] 3D Point Cloud Network Pruning: When Some Weights Do not Matter BMVC2024

Link: https://arxiv.org/abs/2408.14601
Authors: Amrijit Biswas,Md. Ismail Hossain,M M Lutfe Elahi,Ali Cheraghian,Fuad Rahman,Nabeel Mohammed,Shafin Rahman
Keywords-EN: data structure utilized, Point Cloud Neural, geometric data structure, Cloud Neural Networks, point cloud
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments: Accepted in BMVC 2024

Click to view abstract

Abstract:A point cloud is a crucial geometric data structure utilized in numerous applications. The adoption of deep neural networks, referred to as Point Cloud Neural Networks (PCNNs), for processing 3D point clouds has significantly advanced fields that rely on 3D geometric data to enhance the efficiency of tasks. Expanding the size of both neural network models and 3D point clouds introduces significant challenges in minimizing computational and memory requirements. This is essential for meeting the demanding requirements of real-world applications, which prioritize minimal energy consumption and low latency. Therefore, investigating redundancy in PCNNs is crucial yet challenging due to their sensitivity to parameters. Additionally, traditional pruning methods face difficulties as these networks rely heavily on weights and points. Nonetheless, our research reveals a promising phenomenon that could refine standard PCNN pruning techniques. Our findings suggest that preserving only the top p% of the highest magnitude weights is crucial for accuracy preservation. For example, pruning 99% of the weights from the PointNet model still results in accuracy close to the base level. Specifically, in the ModelNet40 dataset, where the base accuracy with the PointNet model was 87.5%, preserving only 1% of the weights still achieves an accuracy of 86.8%. Codes are available at: this https URL
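
Global magnitude pruning of the kind the abstract describes, keeping only the top p% of weights, can be sketched as follows; this is a generic implementation, not the authors' code.

```python
import torch

def global_magnitude_prune(model, keep_percent=1.0):
    """Keep only the top keep_percent% of weights by magnitude, globally.

    keep_percent=1.0 mirrors the quoted PointNet result (99% pruned).
    Only weight tensors (dim > 1) are pruned; biases are left intact.
    """
    all_weights = torch.cat([p.detach().abs().flatten()
                             for p in model.parameters() if p.dim() > 1])
    k = max(1, int(all_weights.numel() * keep_percent / 100))
    threshold = torch.topk(all_weights, k).values.min()
    with torch.no_grad():
        for p in model.parameters():
            if p.dim() > 1:
                p.mul_((p.abs() >= threshold).float())  # zero small weights
```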

[CV-78] PVAFN: Point-Voxel Attention Fusion Network with Multi-Pooling Enhancing for 3D Object Detection

Link: https://arxiv.org/abs/2408.14600
Authors: Yidi Li,Jiahao Wen,Bin Ren,Wenhao Li,Zhenhuan Xu,Hao Guo,Hong Liu,Nicu Sebe
Keywords-EN: common in LiDAR-based, Attention Fusion Network, information effectively, voxel representations, PVAFN
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments: 3D Object Detection

Click to view abstract

Abstract:The integration of point and voxel representations is becoming more common in LiDAR-based 3D object detection. However, this combination often struggles with capturing semantic information effectively. Moreover, relying solely on point features within regions of interest can lead to information loss and limitations in local feature representation. To tackle these challenges, we propose a novel two-stage 3D object detector, called Point-Voxel Attention Fusion Network (PVAFN). PVAFN leverages an attention mechanism to improve multi-modal feature fusion during the feature extraction phase. In the refinement stage, it utilizes a multi-pooling strategy to integrate both multi-scale and region-specific information effectively. The point-voxel attention mechanism adaptively combines point cloud and voxel-based Bird’s-Eye-View (BEV) features, resulting in richer object representations that help to reduce false detections. Additionally, a multi-pooling enhancement module is introduced to boost the model’s perception capabilities. This module employs cluster pooling and pyramid pooling techniques to efficiently capture key geometric details and fine-grained shape structures, thereby enhancing the integration of local and global features. Extensive experiments on the KITTI and Waymo datasets demonstrate that the proposed PVAFN achieves competitive performance. The code and models will be available.

[CV-79] MMR: Evaluating Reading Ability of Large Multimodal Models

Link: https://arxiv.org/abs/2408.14594
Authors: Jian Chen,Ruiyi Zhang,Yufan Zhou,Ryan Rossi,Jiuxiang Gu,Changyou Chen
Keywords-EN: Large multimodal models, Large multimodal, demonstrated impressive capabilities, demonstrated impressive, text-rich image
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Click to view abstract

Abstract:Large multimodal models (LMMs) have demonstrated impressive capabilities in understanding various types of images, including text-rich images. Most existing text-rich image benchmarks are simple extraction-based question answering, and many LMMs now easily achieve high scores. This means that current benchmarks fail to accurately reflect the performance of different models, and a natural idea is to build a new benchmark to evaluate their complex reasoning and spatial understanding abilities. In this work, we propose the Multi-Modal Reading (MMR) benchmark in 11 diverse tasks to evaluate LMMs for text-rich image understanding. MMR is the first text-rich image benchmark built on human annotations with the help of language models. By evaluating several state-of-the-art LMMs, including GPT-4o, it reveals the limited capabilities of existing LMMs, underscoring the value of our benchmark.

[CV-80] Global-Local Distillation Network-Based Audio-Visual Speaker Tracking with Incomplete Modalities

Link: https://arxiv.org/abs/2408.14585
Authors: Yidi Li,Yihan Li,Yixin Guo,Bin Ren,Zhenhuan Xu,Hao Guo,Hong Liu,Nicu Sebe
Keywords-EN: speaker tracking research, complementing multi-modal data, integrating and complementing, crucial strategy, strategy for improving
Subjects: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*Comments: Audio-Visual Speaker Tracking with Incomplete Modalities

Click to view abstract

Abstract:In speaker tracking research, integrating and complementing multi-modal data is a crucial strategy for improving the accuracy and robustness of tracking systems. However, tracking with incomplete modalities remains a challenging issue due to noisy observations caused by occlusion, acoustic noise, and sensor failures. Especially when there is missing data in multiple modalities, the performance of existing multi-modal fusion methods tends to decrease. To this end, we propose a Global-Local Distillation-based Tracker (GLDTracker) for robust audio-visual speaker tracking. GLDTracker is driven by a teacher-student distillation model, enabling the flexible fusion of incomplete information from each modality. The teacher network processes global signals captured by camera and microphone arrays, and the student network handles local information subject to visual occlusion and missing audio channels. By transferring knowledge from teacher to student, the student network can better adapt to complex dynamic scenes with incomplete observations. In the student network, a global feature reconstruction module based on the generative adversarial network is constructed to reconstruct global features from feature embedding with missing local information. Furthermore, a multi-modal multi-level fusion attention is introduced to integrate the incomplete feature and the reconstructed feature, leveraging the complementarity and consistency of audio-visual and global-local features. Experimental results on the AV16.3 dataset demonstrate that the proposed GLDTracker outperforms existing state-of-the-art audio-visual trackers and achieves leading performance on both standard and incomplete modalities datasets, highlighting its superiority and robustness in complex conditions. The code and models will be available.

[CV-81] DIAGen: Diverse Image Augmentation with Generative Models

Link: https://arxiv.org/abs/2408.14584
Authors: Tobias Lingenberg,Markus Reuter,Gopika Sudhakaran,Dominik Gojny,Stefan Roth,Simone Schaub-Meyer
Keywords-EN: Simple data augmentation, Simple data, computer vision models, data augmentation techniques, rotations and flips
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*Comments: Accepted for publication in GCPR 2024

Click to view abstract

Abstract:Simple data augmentation techniques, such as rotations and flips, are widely used to enhance the generalization power of computer vision models. However, these techniques often fail to modify high-level semantic attributes of a class. To address this limitation, researchers have explored generative augmentation methods like the recently proposed DA-Fusion. Despite some progress, the variations are still largely limited to textural changes, thus falling short on aspects like varied viewpoints, environment, weather conditions, or even class-level semantic attributes (e.g., variations in a dog's breed). To overcome this challenge, we propose DIAGen, building upon DA-Fusion. First, we apply Gaussian noise to the embeddings of an object learned with Textual Inversion to diversify generations using a pre-trained diffusion model's knowledge. Second, we exploit the general knowledge of a text-to-text generative model to guide the image generation of the diffusion model with varied class-specific prompts. Finally, we introduce a weighting mechanism to mitigate the impact of poorly generated samples. Experimental results across various datasets show that DIAGen not only enhances semantic diversity but also improves the performance of subsequent classifiers. The advantages of DIAGen over standard augmentations and the DA-Fusion baseline are particularly pronounced with out-of-distribution samples.
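
The first step, perturbing a learned concept embedding with Gaussian noise, can be sketched as follows; the noise scale and function name are illustrative, and each noisy copy would condition a pretrained diffusion model.

```python
import torch

def perturb_concept_embedding(embedding, sigma=0.05, num_variants=4):
    """Return noisy copies of a learned Textual Inversion embedding.

    embedding: (dim,) concept token vector; sigma is an assumed noise
    scale, not a value from the paper.
    """
    return [embedding + sigma * torch.randn_like(embedding)
            for _ in range(num_variants)]
```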

[CV-82] A Survey of Camouflaged Object Detection and Beyond

Link: https://arxiv.org/abs/2408.14562
Authors: Fengyang Xiao,Sujie Hu,Yuqi Shen,Chengyu Fang,Jinfa Huang,Chunming He,Longxiang Tang,Ziyun Yang,Xiu Li
Keywords-EN: computer vision systems, Camouflaged Object Detection, segmenting objects, Object Detection, posing a significant
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*Comments: 26 pages, 10 figures, 8 tables

Click to view abstract

Abstract:Camouflaged Object Detection (COD) refers to the task of identifying and segmenting objects that blend seamlessly into their surroundings, posing a significant challenge for computer vision systems. In recent years, COD has garnered widespread attention due to its potential applications in surveillance, wildlife conservation, autonomous systems, and more. While several surveys on COD exist, they often have limitations in terms of the number and scope of papers covered, particularly regarding the rapid advancements made in the field since mid-2023. To address this void, we present the most comprehensive review of COD to date, encompassing both theoretical frameworks and practical contributions to the field. This paper explores various COD methods across four domains, including both image-level and video-level solutions, from the perspectives of traditional and deep learning approaches. We thoroughly investigate the correlations between COD and other camouflaged scenario methods, thereby laying the theoretical foundation for subsequent analyses. Beyond object-level detection, we also summarize extended methods for instance-level tasks, including camouflaged instance segmentation, counting, and ranking. Additionally, we provide an overview of commonly used benchmarks and evaluation metrics in COD tasks, conducting a comprehensive evaluation of deep learning-based techniques in both image and video domains, considering both qualitative and quantitative performance. Finally, we discuss the limitations of current COD models and propose 9 promising directions for future research, focusing on addressing inherent challenges and exploring novel, meaningful technologies. For those interested, a curated list of COD-related techniques, datasets, and additional resources can be found at this https URL

[CV-83] Exploring the Potential of Synthetic Data to Replace Real Data ICIP2024

Link: https://arxiv.org/abs/2408.14559
Authors: Hyungtae Lee,Yan Zhang,Heesung Kwon,Shuvra S. Bhattacharrya
Keywords-EN: synthetic data, replace real data, real data creates, synthetic, data
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments: ICIP 2024

Click to view abstract

Abstract:The potential of synthetic data to replace real data creates a huge demand for synthetic data in data-hungry AI. This potential is even greater when synthetic data is used for training along with a small number of real images from domains other than the test domain. We find that this potential varies depending on (i) the number of cross-domain real images and (ii) the test set on which the trained model is evaluated. We introduce two new metrics, the train2test distance and $\text{AP}_{\text{t2t}}$, to evaluate the ability of a cross-domain training set using synthetic data to represent the characteristics of test instances in relation to training performance. Using these metrics, we delve deeper into the factors that influence the potential of synthetic data and uncover some interesting dynamics about how synthetic data impacts training performance. We hope these discoveries will encourage more widespread use of synthetic data.

[CV-84] Revisiting Image Captioning Training Paradigm via Direct CLIP-based Optimization BMVC2024

Link: https://arxiv.org/abs/2408.14547
Authors: Nicholas Moratelli,Davide Caffagni,Marcella Cornia,Lorenzo Baraldi,Rita Cucchiara
Keywords-EN: Self-Critical Sequence Training, Self-Critical Sequence, image captioning involves, captioning involves pre-training, maximize hand-crafted captioning
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
*Comments: BMVC 2024

Click to view abstract

Abstract:The conventional training approach for image captioning involves pre-training a network using teacher forcing and subsequent fine-tuning with Self-Critical Sequence Training to maximize hand-crafted captioning metrics. However, when attempting to optimize modern and higher-quality metrics like CLIP-Score and PAC-Score, this training method often encounters instability and fails to acquire the genuine descriptive capabilities needed to produce fluent and informative captions. In this paper, we propose a new training paradigm termed Direct CLIP-Based Optimization (DiCO). Our approach jointly learns and optimizes a reward model that is distilled from a learnable captioning evaluator with high human correlation. This is done by solving a weighted classification problem directly inside the captioner. At the same time, DiCO prevents divergence from the original model, ensuring that fluency is maintained. DiCO not only exhibits improved stability and enhanced quality in the generated captions but also aligns more closely with human preferences compared to existing methods, especially in modern metrics. Additionally, it maintains competitive performance in traditional metrics. Our source code and trained models are publicly available at this https URL.

[CV-85] Improving Nonlinear Projection Heads using Pretrained Autoencoder Embeddings

Link: https://arxiv.org/abs/2408.14514
Authors: Andreas Schliebitz,Heiko Tapken,Martin Atzmueller
Keywords-EN: empirical study aims, MLP projection head, MLP projection, empirical study, study aims
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*Comments: 15 pages, 1 figure

Click to view abstract

Abstract:This empirical study aims at improving the effectiveness of the standard 2-layer MLP projection head $g(\cdot)$ featured in the SimCLR framework through the use of pretrained autoencoder embeddings. Given a contrastive learning task with a largely unlabeled image classification dataset, we first train a shallow autoencoder architecture and extract its compressed representations contained in the encoder's embedding layer. After freezing the weights within this pretrained layer, we use it as a drop-in replacement for the input layer of SimCLR's default projector. Additionally, we also apply further architectural changes to the projector by decreasing its width and changing its activation function. The different projection heads are then used to contrastively train and evaluate a feature extractor $f(\cdot)$ following the SimCLR protocol, while also examining the performance impact of Z-score normalized datasets. Our experiments indicate that using a pretrained autoencoder embedding in the projector can not only increase classification accuracy by up to 2.9% or 1.7% on average but can also significantly decrease the dimensionality of the projection space. Our results also suggest that using the sigmoid and tanh activation functions within the projector can outperform ReLU in terms of peak and average classification accuracy. When applying our presented projectors, not applying Z-score normalization to datasets often increases peak performance. In contrast, the default projection head can benefit more from normalization. All experiments involving our pretrained projectors are conducted with frozen embeddings, since our test results indicate an advantage compared to using their non-frozen counterparts.
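
A sketch of the modified projector, assuming the autoencoder's embedding layer is available as an `nn.Module`; the output width is illustrative, and the tanh choice reflects the reported results rather than a fixed prescription.

```python
import torch.nn as nn

def build_projector(pretrained_embedding_layer, embed_dim, out_dim=128):
    """SimCLR-style projection head whose input layer is a frozen,
    pretrained autoencoder embedding layer."""
    for p in pretrained_embedding_layer.parameters():
        p.requires_grad = False  # keep the pretrained weights frozen
    return nn.Sequential(
        pretrained_embedding_layer,  # drop-in replacement for g's input layer
        nn.Tanh(),
        nn.Linear(embed_dim, out_dim),
    )
```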

[CV-86] Satellite Sunroof: High-res Digital Surface Models and Roof Segmentation for Global Solar Mapping

Link: https://arxiv.org/abs/2408.14400
Authors: Vishal Batchu,Alex Wilson,Betty Peng,Carl Elkin,Umangi Jain,Christopher Van Arsdale,Ross Goroshin,Varun Gulshan
Keywords-EN: mitigating climate change, renewable energy, climate change, key to mitigating, mitigating climate
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments: 14 pages

Click to view abstract

Abstract:The transition to renewable energy, particularly solar, is key to mitigating climate change. Google's Solar API aids this transition by estimating solar potential from aerial imagery, but its impact is constrained by geographical coverage. This paper proposes expanding the API's reach using satellite imagery, enabling global solar potential assessment. We tackle challenges involved in building a Digital Surface Model (DSM) and roof instance segmentation from lower resolution and single oblique views using deep learning models. Our models, trained on aligned satellite and aerial datasets, produce 25 cm DSMs and roof segments. With ~1 m DSM MAE on buildings, ~5° roof pitch error and ~56% IoU on roof segmentation, they significantly enhance the Solar API's potential to promote solar adoption.

[CV-87] SAM & SAM 2 in 3D Slicer: SegmentWithSAM Extension for Annotating Medical Images

Link: https://arxiv.org/abs/2408.15224
Authors: Zafer Yildiz,Yuwen Chen,Maciej A. Mazurowski
Keywords-EN: highly specialized expertise, requires highly specialized, specialized expertise, Creating annotations, data is time-consuming
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE)
*Comments: Future work: support for box and mask inputs for the video predictor of SAM 2

Click to view abstract

Abstract:Creating annotations for 3D medical data is time-consuming and often requires highly specialized expertise. Various tools have been implemented to aid this process. Segment Anything Model 2 (SAM 2) offers a general-purpose prompt-based segmentation algorithm designed to annotate videos. In this paper, we adapt this model to the annotation of 3D medical images and offer our implementation in the form of an extension to the popular annotation software: 3D Slicer. Our extension allows users to place point prompts on 2D slices to generate annotation masks and propagate these annotations across entire volumes in either single-directional or bi-directional manners. Our code is publicly available on this https URL and can be easily installed directly from the Extension Manager of 3D Slicer as well.

[CV-88] Histo-Diffusion: A Diffusion Super-Resolution Method for Digital Pathology with Comprehensive Quality Assessment

Link: https://arxiv.org/abs/2408.15218
Authors: Xuan Xu,Saarthak Kapse,Prateek Prasanna
Keywords-EN: encompassing vast amounts, accurate disease diagnosis, Generative Adversarial Networks, encompassing vast, advanced significantly
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*Comments: We have submitted our paper to Medical Image Analysis and are currently awaiting feedback

Click to view abstract

Abstract:Digital pathology has advanced significantly over the last decade, with Whole Slide Images (WSIs) encompassing vast amounts of data essential for accurate disease diagnosis. High-resolution WSIs are essential for precise diagnosis but technical limitations in scanning equipment and variability in slide preparation can hinder obtaining these images. Super-resolution techniques can enhance low-resolution images; while Generative Adversarial Networks (GANs) have been effective in natural image super-resolution tasks, they often struggle with histopathology due to overfitting and mode collapse. Traditional evaluation metrics fall short in assessing the complex characteristics of histopathology images, necessitating robust histology-specific evaluation methods. We introduce Histo-Diffusion, a novel diffusion-based method specially designed for generating and evaluating super-resolution images in digital pathology. It includes a restoration module for histopathology prior and a controllable diffusion module for generating high-quality images. We have curated two histopathology datasets and proposed a comprehensive evaluation strategy which incorporates both full-reference and no-reference metrics to thoroughly assess the quality of digital pathology images. Comparative analyses on multiple datasets with state-of-the-art methods reveal that Histo-Diffusion outperforms GANs. Our method offers a versatile solution for histopathology image super-resolution, capable of handling multi-resolution generation from varied input sizes, providing valuable support in diagnostic processes.

[CV-89] Fundus2Video: Cross-Modal Angiography Video Generation from Static Fundus Photography with Clinical Knowledge Guidance MICCAI

Link: https://arxiv.org/abs/2408.15217
Authors: Weiyi Zhang,Siyu Huang,Jiancheng Yang,Ruoyu Chen,Zongyuan Ge,Yingfeng Zheng,Danli Shi,Mingguang He
Keywords-EN: Fundus Fluorescein Angiography, Fluorescein Angiography, assessing retinal vascular, Fundus Fluorescein, retinal vascular dynamics
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*Comments: The paper has been accepted by Medical Image Computing and Computer Assisted Intervention Society (MICCAI) 2024

Click to view abstract

Abstract:Fundus Fluorescein Angiography (FFA) is a critical tool for assessing retinal vascular dynamics and aiding in the diagnosis of eye diseases. However, its invasive nature and lower accessibility compared to Color Fundus (CF) images pose significant challenges. Current CF to FFA translation methods are limited to static generation. In this work, we pioneer dynamic FFA video generation from static CF images. We introduce an autoregressive GAN for smooth, memory-saving frame-by-frame FFA synthesis. To enhance the focus on dynamic lesion changes in FFA regions, we design a knowledge mask based on clinical experience. Leveraging this mask, our approach integrates innovative knowledge mask-guided techniques, including knowledge-boosted attention, knowledge-aware discriminators, and mask-enhanced patchNCE loss, aimed at refining generation in critical areas and addressing the pixel misalignment challenge. Our method achieves the best FVD of 1503.21 and PSNR of 11.81 compared to other common video generation approaches. Human assessment by an ophthalmologist confirms its high generation quality. Notably, our knowledge mask surpasses supervised lesion segmentation masks, offering a promising non-invasive alternative to traditional FFA for research and clinical applications. The code is available at this https URL.

[CV-90] DIFR3CT: Latent Diffusion for Probabilistic 3D CT Reconstruction from Few Planar X-Rays

Link: https://arxiv.org/abs/2408.15118
Authors: Yiran Sun,Hana Baroudi,Tucker Netherton,Laurence Court,Osama Mawlawi,Ashok Veeraraghavan,Guha Balakrishnan
Keywords-EN: Computed Tomography, external beam radiotherapy, clinical ailments, visualization and diagnosis, external beam
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*Comments: 11 pages, 9 figures

Click to view abstract

Abstract:Computed Tomography (CT) scans are the standard-of-care for the visualization and diagnosis of many clinical ailments, and are needed for the treatment planning of external beam radiotherapy. Unfortunately, the availability of CT scanners in low- and mid-resource settings is highly variable. Planar x-ray radiography units, in comparison, are far more prevalent, but can only provide limited 2D observations of the 3D anatomy. In this work we propose DIFR3CT, a 3D latent diffusion model, that can generate a distribution of plausible CT volumes from one or a few (<10) planar x-ray observations. DIFR3CT works by fusing 2D features from each x-ray into a joint 3D space, and performing diffusion conditioned on these fused features in a low-dimensional latent space. We conduct extensive experiments demonstrating that DIFR3CT is better than recent sparse CT reconstruction baselines in terms of standard pixel-level metrics (PSNR, SSIM) on both the public LIDC and in-house post-mastectomy CT datasets. We also show that DIFR3CT supports uncertainty quantification via Monte Carlo sampling, which provides an opportunity to measure reconstruction reliability. Finally, we perform a preliminary pilot study evaluating DIFR3CT for automated breast radiotherapy contouring and planning, and demonstrate promising feasibility. Our code is available at this https URL.
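
The Monte Carlo uncertainty quantification mentioned above reduces to repeated sampling; a minimal sketch follows, where `sample_fn` is a hypothetical stand-in for the trained conditional sampler.

```python
import torch

def mc_uncertainty(sample_fn, xrays, num_samples=16):
    """Monte Carlo reconstruction uncertainty for a conditional sampler.

    sample_fn(xrays) -> one CT volume per call. The voxel-wise standard
    deviation over repeated draws serves as a reliability map.
    """
    volumes = torch.stack([sample_fn(xrays) for _ in range(num_samples)])
    return volumes.mean(dim=0), volumes.std(dim=0)  # estimate, uncertainty
```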

[CV-91] LN-Gen: Rectal Lymph Nodes Generation via Anatomical Features

Link: https://arxiv.org/abs/2408.14977
Authors: Weidong Guo,Hantao Zhang,Shouhong Wan,Bingbing Zou,Wanqin Wang,Peiquan Jin
Keywords-EN: Accurate segmentation, rectal lymph nodes, rectal cancer, rectal lymph, lymph nodes
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*Comments: 8 pages

Click to view abstract

Abstract:Accurate segmentation of rectal lymph nodes is crucial for the staging and treatment planning of rectal cancer. However, the complexity of the surrounding anatomical structures and the scarcity of annotated data pose significant challenges. This study introduces a novel lymph node synthesis technique aimed at generating diverse and realistic synthetic rectal lymph node samples to mitigate the reliance on manual annotation. Unlike direct diffusion methods, which often produce masks that are discontinuous and of suboptimal quality, our approach leverages an implicit SDF-based method for mask generation, ensuring the production of continuous, stable, and morphologically diverse masks. Experimental results demonstrate that our synthetic data significantly improves segmentation performance. Our work highlights the potential of diffusion models for accurately synthesizing structurally complex lesions, such as lymph nodes in rectal cancer, alleviating the challenge of limited annotated data in this field and aiding in advancements in rectal cancer diagnosis and treatment.

[CV-92] ERX: A Fast Real-Time Anomaly Detection Algorithm for Hyperspectral Line-Scanning

Link: https://arxiv.org/abs/2408.14947
Authors: Samuel Garske,Bradley Evans,Christopher Artlett,KC Wong
Keywords-EN: Detecting unexpected objects, Detecting unexpected, potential for monitoring, protecting the environment, great potential
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*Comments: 10 pages, 9 figures, 3 tables, code and datasets accessible at this https URL

Click to view abstract

Abstract:Detecting unexpected objects (anomalies) in real-time has great potential for monitoring, managing, and protecting the environment. Hyperspectral line-scan cameras are a low-cost solution that enhance confidence in anomaly detection over RGB and multispectral imagery. However, real-time algorithms for these cameras must be fast when using small computers (e.g., those onboard a drone or small satellite), scalable to high dimensions, adaptable to changing scenery, and robust against geometric and radiometric distortions. This paper introduces the Exponentially moving RX algorithm (ERX) and compares it to existing RX-based anomaly detection methods for real-time line-scanning. ERX was tested using a Jetson Xavier NX compute module, achieving the best combination of speed and detection across three novel datasets compared to the other algorithms. This research paves the way for future studies in grouping and locating anomalous objects, adaptive and automatic threshold selection, and real-time field tests. The Python code for the algorithms and experiments is available at this https URL.
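
A generic exponentially moving RX detector that sketches the idea behind ERX, not the authors' implementation; the momentum and regularization values are illustrative.

```python
import numpy as np

class MovingRX:
    """RX anomaly scores with exponentially moving background statistics."""
    def __init__(self, num_bands, momentum=0.1, eps=1e-6):
        self.mu = np.zeros(num_bands)
        self.cov = np.eye(num_bands)
        self.m, self.eps = momentum, eps

    def score_line(self, line):  # line: (num_pixels, num_bands)
        # Update running mean/covariance from the newest scan line.
        self.mu = (1 - self.m) * self.mu + self.m * line.mean(axis=0)
        centered = line - self.mu
        self.cov = (1 - self.m) * self.cov \
            + self.m * (centered.T @ centered) / len(line)
        inv = np.linalg.inv(self.cov + self.eps * np.eye(len(self.mu)))
        # Mahalanobis distance of each pixel to the moving background.
        return np.einsum("ij,jk,ik->i", centered, inv, centered)
```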

[CV-93] Automatic Detection of COVID-19 from Chest X-ray Images Using Deep Learning Model

Link: https://arxiv.org/abs/2408.14927
Authors: Alloy Das,Rohit Agarwal,Rituparna Singh,Arindam Chowdhury,Debashis Nandi
Keywords-EN: infectious disease caused, entire world, widely spreading, shaken the entire, infectious disease
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*Comments: Accepted in AIP Conference Proceedings (Vol. 2424, No. 1)

Click to view abstract

Abstract:The infectious disease caused by the novel coronavirus (2019-nCoV) has been spreading widely since last year and has shaken the entire world. It has caused an unprecedented effect on daily life, global economy and public health. Hence the detection of this disease has life-saving importance for both patients and doctors. Due to limited test kits, it is also a daunting task to test every patient with severe respiratory problems using conventional techniques (RT-PCR). Thus implementing an automatic diagnosis system is urgently required to overcome the scarcity of Covid-19 test kits at hospitals and health care systems. The diagnostic approach is mainly classified into two categories: laboratory-based and chest radiography approaches. In this paper, a novel approach for computerized corona virus (2019-nCoV) detection from lung x-ray images is presented. Here, we propose models using deep learning to show the effectiveness of diagnostic systems. In the experimental results, we evaluate the proposed models on a publicly available dataset, where they exhibit satisfactory performance and promising results compared with other previous existing methods.

[CV-94] Intraoperative Glioma Segmentation with YOLO + SAM for Improved Accuracy in Tumor Resection

Link: https://arxiv.org/abs/2408.14847
Authors: Samir Kassam,Angelo Markham,Katie Vo,Yashas Revanakara,Michael Lam,Kevin Zhu
Keywords-EN: Magnetic Resonance Imaging, malignant brain tumor, significant surgical challenges, healthy tissue, common type
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Gliomas, a common type of malignant brain tumor, present significant surgical challenges due to their similarity to healthy tissue. Preoperative Magnetic Resonance Imaging (MRI) images are often ineffective during surgery due to factors such as brain shift, which alters the position of brain structures and tumors. This makes real-time intraoperative MRI (ioMRI) crucial, as it provides updated imaging that accounts for these shifts, ensuring more accurate tumor localization and safer resections. This paper presents a deep learning pipeline combining You Only Look Once Version 8 (YOLOv8) and Segment Anything Model Vision Transformer-base (SAM ViT-b) to enhance glioma detection and segmentation during ioMRI. Our model was trained using the Brain Tumor Segmentation 2021 (BraTS 2021) dataset, which includes standard magnetic resonance imaging (MRI) images, and noise-augmented MRI images that simulate ioMRI images. Noised MRI images are harder for a deep learning pipeline to segment, but they are more representative of surgical conditions. Achieving a Dice Similarity Coefficient (DICE) score of 0.79, our model performs comparably to state-of-the-art segmentation models tested on noiseless data. This performance demonstrates the model’s potential to assist surgeons in maximizing tumor resection and improving surgical outcomes.
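 
A sketch of the detection-then-segmentation handoff using the public `ultralytics` and `segment-anything` packages: YOLOv8 boxes become SAM box prompts. The checkpoint names are placeholders, and the paper fine-tunes SAM ViT-b on noise-augmented BraTS-style data rather than using the models off the shelf.

```python
from ultralytics import YOLO
from segment_anything import SamPredictor, sam_model_registry

def detect_then_segment(image, yolo_weights="glioma_yolov8.pt",
                        sam_ckpt="sam_vit_b.pth"):
    """Run YOLOv8 detection, then prompt SAM with each detected box.

    image: HxWx3 uint8 array (e.g. an MRI slice rendered as RGB).
    """
    boxes = YOLO(yolo_weights)(image)[0].boxes.xyxy.cpu().numpy()
    predictor = SamPredictor(sam_model_registry["vit_b"](checkpoint=sam_ckpt))
    predictor.set_image(image)
    return [predictor.predict(box=box, multimask_output=False)[0]
            for box in boxes]
```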

[CV-95] Generalist Segmentation Algorithm for Photoreceptors Analysis in Adaptive Optics Imaging

Link: https://arxiv.org/abs/2408.14810
Authors: Mikhail Kulyabin,Aline Sindel,Hilde Pedersen,Stuart Gilson,Rigmor Baraas,Andreas Maier
Keywords-EN: living human retina, cone photoreceptor pattern, circ, eye conditions, living human
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Click to view abstract

Abstract:Analyzing the cone photoreceptor pattern in images obtained from the living human retina using quantitative methods can be crucial for the early detection and management of various eye conditions. Confocal adaptive optics scanning light ophthalmoscope (AOSLO) imaging enables visualization of the cones from reflections of waveguiding cone photoreceptors. While there have been significant improvements in automated algorithms for segmenting cones in confocal AOSLO images, the process of labelling data remains labor-intensive and manual. This paper introduces a method based on deep learning (DL) for detecting and segmenting cones in AOSLO images. The models were trained on a semi-automatically labelled dataset of 20 AOSLO batches of images of 18 participants for 0°, 1°, and 2° from the foveal center. F1 scores were 0.968, 0.958, and 0.954 for 0°, 1°, and 2°, respectively, which is better than previously reported DL approaches. Our method minimizes the need for labelled data by only necessitating a fraction of labelled cones, which is especially beneficial in the field of ophthalmology, where labelled data can often be limited.

[CV-96] Sequential-Scanning Dual-Energy CT Imaging Using High Temporal Resolution Image Reconstruction and Error-Compensated Material Basis Image Generation

Link: https://arxiv.org/abs/2408.14754
Authors: Qiaoxin Li,Ruifeng Chen,Peng Wang,Guotao Quan,Yanfeng Du,Dong Liang,Yinsheng Li
Keywords-EN: Dual-energy computed tomography, precise medical diagnosis, obtain quantitative elemental, quantitative elemental composition, Dual-energy computed
Subjects: Medical Physics (physics.med-ph); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Instrumentation and Detectors (physics.ins-det)
*Comments:

Click to view abstract

Abstract:Dual-energy computed tomography (DECT) has been widely used to obtain quantitative elemental composition of imaged subjects for personalized and precise medical diagnosis. Compared with DECT leveraging advanced X-ray source and/or detector technologies, the use of the sequential-scanning data acquisition scheme to implement DECT may make a broader impact on clinical practice because this scheme requires no specialized hardware designs and can be directly implemented into conventional CT systems. However, since the concentration of iodinated contrast agent in the imaged subject varies over time, sequentially scanned data sets acquired at two tube potentials are temporally inconsistent. As existing material basis image reconstruction approaches assume that the data sets acquired at two tube potentials are temporally consistent, the violation of this assumption results in inaccurate quantification of material concentration. In this work, we developed sequential-scanning DECT imaging using high temporal resolution image reconstruction and error-compensated material basis image generation, ACCELERATION in short, to address the technical challenge induced by temporal inconsistency of sequentially scanned data sets and improve quantification accuracy of material concentration in sequential-scanning DECT. ACCELERATION has been validated and evaluated using numerical simulation data sets generated from clinical human subject exams and experimental human subject studies. Results demonstrated the improvement of quantification accuracy and image quality using ACCELERATION.

[CV-97] BreakNet: Discontinuity-Resilient Multi-Scale Transformer Segmentation of Retinal Layers

Link: https://arxiv.org/abs/2408.14606
Authors: Razieh Ganjee,Bingjie Wang,Lingyun Wang,Chengcheng Zhao,José-Alain Sahel,Shaohua Pi
Keywords-EN: optical coherence tomography, Visible light optical, light optical coherence, retinal imaging due, Visible light
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Click to view abstract

Abstract:Visible light optical coherence tomography (vis-OCT) is gaining traction for retinal imaging due to its high resolution and functional capabilities. However, the significant absorption of hemoglobin in the visible light range leads to pronounced shadow artifacts from retinal blood vessels, posing challenges for accurate layer segmentation. In this study, we present BreakNet, a multi-scale Transformer-based segmentation model designed to address boundary discontinuities caused by these shadow artifacts. BreakNet utilizes hierarchical Transformer and convolutional blocks to extract multi-scale global and local feature maps, capturing essential contextual, textural, and edge characteristics. The model incorporates decoder blocks that expand pathways to enhance the extraction of fine details and semantic information, ensuring precise segmentation. Evaluated on rodent retinal images acquired with prototype vis-OCT, BreakNet demonstrated superior performance over state-of-the-art segmentation models, such as TCCT-BP and U-Net, even when faced with limited-quality ground truth data. Our findings indicate that BreakNet has the potential to significantly improve retinal quantification and analysis.

Machine Learning

[LG-0] Generative Verifiers: Reward Modeling as Next-Token Prediction

Link: https://arxiv.org/abs/2408.15240
Authors: Lunjun Zhang,Arian Hosseini,Hritik Bansal,Mehran Kazemi,Aviral Kumar,Rishabh Agarwal
Keywords-EN: large language models, performance of large, large language, Verifiers, LLMs
Subjects: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Verifiers or reward models are often used to enhance the reasoning performance of large language models (LLMs). A common approach is the Best-of-N method, where N candidate solutions generated by the LLM are ranked by a verifier, and the best one is selected. While LLM-based verifiers are typically trained as discriminative classifiers to score solutions, they do not utilize the text generation capabilities of pretrained LLMs. To overcome this limitation, we instead propose training verifiers using the ubiquitous next-token prediction objective, jointly on verification and solution generation. Compared to standard verifiers, such generative verifiers (GenRM) can benefit from several advantages of LLMs: they integrate seamlessly with instruction tuning, enable chain-of-thought reasoning, and can utilize additional inference-time compute via majority voting for better verification. We demonstrate that when using Gemma-based verifiers on algorithmic and grade-school math reasoning tasks, GenRM outperforms discriminative verifiers and LLM-as-a-Judge, showing a 16-64% improvement in the percentage of problems solved with Best-of-N. Furthermore, we show that GenRM scales favorably across dataset size, model capacity, and inference-time compute.
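
A minimal sketch of Best-of-N with a generative verifier, scoring each candidate by the probability of a "Yes" next token under a correctness question; the prompt wording is illustrative, and GenRM's chain-of-thought and majority-voting variants are omitted.

```python
import torch

def best_of_n(problem, candidates, tokenizer, verifier):
    """Pick the candidate with the highest P('Yes') under the verifier.

    tokenizer/verifier: any Hugging Face causal LM pair fine-tuned for
    verification (hypothetical here).
    """
    yes_id = tokenizer.encode("Yes", add_special_tokens=False)[0]
    scores = []
    for solution in candidates:
        prompt = f"{problem}\nSolution: {solution}\nIs the solution correct? "
        ids = tokenizer(prompt, return_tensors="pt").input_ids
        with torch.no_grad():
            next_token_logits = verifier(ids).logits[0, -1]
        scores.append(next_token_logits.softmax(-1)[yes_id].item())
    return candidates[max(range(len(candidates)), key=scores.__getitem__)]
```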

[LG-1] The Mamba in the Llama: Distilling and Accelerating Hybrid Models

Link: https://arxiv.org/abs/2408.15237
Authors: Junxiong Wang,Daniele Paliotta,Avner May,Alexander M. Rush,Tri Dao
Keywords-EN: advantageous deployment characteristics, Linear RNN architectures, Linear RNN, language modeling, RNN architectures
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments: Code is open-sourced at this https URL

Click to view abstract

Abstract:Linear RNN architectures, like Mamba, can be competitive with Transformer models in language modeling while having advantageous deployment characteristics. Given the focus on training large-scale Transformer models, we consider the challenge of converting these pretrained models for deployment. We demonstrate that it is feasible to distill large Transformers into linear RNNs by reusing the linear projection weights from attention layers with academic GPU resources. The resulting hybrid model, which incorporates a quarter of the attention layers, achieves performance comparable to the original Transformer in chat benchmarks and outperforms open-source hybrid Mamba models trained from scratch with trillions of tokens in both chat benchmarks and general benchmarks. Moreover, we introduce a hardware-aware speculative decoding algorithm that accelerates the inference speed of Mamba and hybrid models. Overall we show how, with limited computation resources, we can remove many of the original attention layers and generate from the resulting model more efficiently. Our top-performing model, distilled from Llama3-8B-Instruct, achieves a 29.61 length-controlled win rate on AlpacaEval 2 against GPT-4 and 7.35 on MT-Bench, surpassing the best instruction-tuned linear RNN model.

[LG-2] DCT-CryptoNets: Scaling Private Inference in the Frequency Domain

Link: https://arxiv.org/abs/2408.15231
Authors: Arjun Roy,Kaushik Roy
Keywords-EN: offers unprecedented opportunities, fully homomorphic encryption, learning offers unprecedented, machine learning offers, FHE enables computation
Subjects: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments: Under Review; 10 pages content, 3 pages appendix, 4 figures, 8 tables; Code TBD

Click to view abstract

Abstract:The convergence of fully homomorphic encryption (FHE) and machine learning offers unprecedented opportunities for private inference of sensitive data. FHE enables computation directly on encrypted data, safeguarding the entire machine learning pipeline, including data and model confidentiality. However, existing FHE-based implementations for deep neural networks face significant challenges in computational cost, latency, and scalability, limiting their practical deployment. This paper introduces DCT-CryptoNets, a novel approach that leverages frequency-domain learning to tackle these issues. Our method operates directly in the frequency domain, utilizing the discrete cosine transform (DCT) commonly employed in JPEG compression. This approach is inherently compatible with remote computing services, where images are usually transmitted and stored in compressed formats. DCT-CryptoNets reduces the computational burden of homomorphic operations by focusing on perceptually relevant low-frequency components. This is demonstrated by a substantial latency reduction of up to 5.3× compared to prior work on image classification tasks, including a novel demonstration of ImageNet inference within 2.5 hours, down from 12.5 hours compared to prior work on equivalent compute resources. Moreover, DCT-CryptoNets improves the reliability of encrypted accuracy by reducing variability (e.g., from ±2.5% to ±1.0% on ImageNet). This study demonstrates a promising avenue for achieving efficient and practical privacy-preserving deep learning on high resolution images seen in real-world applications.
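
The low-frequency truncation idea can be sketched with SciPy's DCT; the block size is illustrative, and this omits the encrypted-inference side entirely.

```python
import numpy as np
from scipy.fft import dctn

def low_freq_dct_block(image, keep=8):
    """Keep the top-left (low-frequency) keep x keep block of the 2D DCT.

    image: (H, W, C) array. Discarding high frequencies is what shrinks
    the homomorphic workload in a DCT-CryptoNets-style pipeline.
    """
    coeffs = dctn(image, axes=(0, 1), norm="ortho")
    return coeffs[:keep, :keep, ...]
```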

[LG-3] LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet

Link: https://arxiv.org/abs/2408.15221
Authors: Nathaniel Li,Ziwen Han,Ian Steneker,Willow Primack,Riley Goodside,Hugh Zhang,Zifan Wang,Cristina Menghini,Summer Yue
Keywords-EN: Recent large language, refuse harmful queries, greatly improved models', improved models' ability, Recent large
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Computers and Society (cs.CY)
*Comments:

Click to view abstract

Abstract:Recent large language model (LLM) defenses have greatly improved models’ ability to refuse harmful queries, even when adversarially attacked. However, LLM defenses are primarily evaluated against automated adversarial attacks in a single turn of conversation, an insufficient threat model for real-world malicious use. We demonstrate that multi-turn human jailbreaks uncover significant vulnerabilities, exceeding 70% attack success rate (ASR) on HarmBench against defenses that report single-digit ASRs with automated single-turn attacks. Human jailbreaks also reveal vulnerabilities in machine unlearning defenses, successfully recovering dual-use biosecurity knowledge from unlearned models. We compile these results into Multi-Turn Human Jailbreaks (MHJ), a dataset of 2,912 prompts across 537 multi-turn jailbreaks. We publicly release MHJ alongside a compendium of jailbreak tactics developed across dozens of commercial red teaming engagements, supporting research towards stronger LLM defenses.

[LG-4] On latent dynamics learning in nonlinear reduced order modeling

链接: https://arxiv.org/abs/2408.15183
作者: Nicola Farenga,Stefania Fresca,Simone Brivio,Andrea Manzoni
关键词-EN: latent dynamics models, LDM approximation, reduced order modeling, LDM, latent dynamics
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注: 43 pages

点击查看摘要

Abstract:In this work, we present the novel mathematical framework of latent dynamics models (LDMs) for reduced order modeling of parameterized nonlinear time-dependent PDEs. Our framework casts this latter task as a nonlinear dimensionality reduction problem, while constraining the latent state to evolve according to an (unknown) dynamical system. A time-continuous setting is employed to derive error and stability estimates for the LDM approximation of the full order model (FOM) solution. We analyze the impact of using an explicit Runge-Kutta scheme in the time-discrete setting, resulting in the \Delta\text{LDM} formulation, and further explore the learnable setting, \Delta\text{LDM}_\theta, where deep neural networks approximate the discrete LDM components, while providing a bounded approximation error with respect to the FOM. Moreover, we extend the concept of parameterized Neural ODE - recently proposed as a possible way to build data-driven dynamical systems with varying input parameters - to a convolutional architecture, where the input parameters information is injected by means of an affine modulation mechanism, while designing a convolutional autoencoder neural network able to retain spatial coherence, thus enhancing interpretability at the latent level. Numerical experiments, including the Burgers’ and the advection-reaction-diffusion equations, demonstrate the framework’s ability to obtain, in a multi-query context, a time-continuous approximation of the FOM solution, thus being able to query the LDM approximation at any given time instance while retaining a prescribed level of accuracy. Our findings highlight the remarkable potential of the proposed LDMs, representing a mathematically rigorous framework to enhance the accuracy and approximation capabilities of reduced order modeling for time-dependent parameterized PDEs.

[LG-5] Exploiting Approximate Symmetry for Efficient Multi-Agent Reinforcement Learning

链接: https://arxiv.org/abs/2408.15173
作者: Batuhan Yardim,Niao He
关键词-EN: solving large-scale multi-agent, large-scale multi-agent reinforcement, Mean-field games, multi-agent reinforcement learning, reinforcement learning problems
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 5 figures

点击查看摘要

Abstract:Mean-field games (MFG) have become significant tools for solving large-scale multi-agent reinforcement learning problems under symmetry. However, the assumption of exact symmetry limits the applicability of MFGs, as real-world scenarios often feature inherent heterogeneity. Furthermore, most works on MFG assume access to a known MFG model, which might not be readily available for real-world finite-agent games. In this work, we broaden the applicability of MFGs by providing a methodology to extend any finite-player, possibly asymmetric, game to an “induced MFG”. First, we prove that N-player dynamic games can be symmetrized and smoothly extended to the infinite-player continuum via explicit Kirszbraun extensions. Next, we propose the notion of (\alpha,\beta)-symmetric games, a new class of dynamic population games that incorporate approximate permutation invariance. For (\alpha,\beta)-symmetric games, we establish explicit approximation bounds, demonstrating that a Nash policy of the induced MFG is an approximate Nash of the N-player dynamic game. We show that TD learning converges up to a small bias using trajectories of the N-player game with finite-sample guarantees, permitting symmetrized learning without building an explicit MFG model. Finally, for certain games satisfying monotonicity, we prove a sample complexity of \widetilde{\mathcal{O}}(\varepsilon^{-6}) for the N-agent game to learn an \varepsilon-Nash up to symmetrization bias. Our theory is supported by evaluations on MARL benchmarks with thousands of agents.

[LG-6] Latent Ewald summation for machine learning of long-range interactions

链接: https://arxiv.org/abs/2408.15165
作者: Bingqing Cheng
关键词-EN: Machine learning interatomic, learning interatomic potentials, Machine learning, neglect long-range interactions, interatomic potentials
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Chemical Physics (physics.chem-ph); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:Machine learning interatomic potentials (MLIPs) often neglect long-range interactions, such as electrostatic and dispersion forces. In this work, we introduce a straightforward and efficient method to account for long-range interactions by learning a latent variable from local atomic descriptors and applying an Ewald summation to this variable. We demonstrate that in systems including charged, polar, or apolar molecular dimers, bulk water, and water-vapor interface, standard short-ranged MLIPs can lead to unphysical predictions even when employing message passing. The long-range models effectively eliminate these artifacts, with only about twice the computational cost of short-range MLIPs.
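
For intuition, the sketch below evaluates a textbook reciprocal-space Ewald sum over per-atom latent "charges" that would come from learned local descriptors. The cubic box and Gaussian screening width are simplifying assumptions, not the paper's exact formulation.

```python
import numpy as np

def ewald_reciprocal_energy(positions, latent_q, cell_length, sigma=1.0, kmax=4):
    """Reciprocal-space Ewald sum over learned latent 'charges' latent_q
    for N atoms in a cubic box of side cell_length (schematic version).
    positions: (N, 3) array; latent_q: (N,) array."""
    volume = cell_length ** 3
    energy = 0.0
    for nx in range(-kmax, kmax + 1):
        for ny in range(-kmax, kmax + 1):
            for nz in range(-kmax, kmax + 1):
                if nx == ny == nz == 0:
                    continue  # the k = 0 term is excluded as usual
                k = 2.0 * np.pi * np.array([nx, ny, nz]) / cell_length
                k2 = k @ k
                s_k = np.sum(latent_q * np.exp(1j * positions @ k))  # structure factor
                energy += (4.0 * np.pi / k2) * np.exp(-k2 * sigma**2 / 2.0) * np.abs(s_k) ** 2
    return energy / (2.0 * volume)
```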

[LG-7] Delay as Payoff in MAB

链接: https://arxiv.org/abs/2408.15158
作者: Ofir Schlisselberg,Ido Cohen,Tal Lancewicki,Yishay Mansour
关键词-EN: stochastic Multi-armed Bandit, classical stochastic Multi-armed, Multi-armed Bandit, stochastic Multi-armed, Delta
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, we investigate a variant of the classical stochastic Multi-armed Bandit (MAB) problem, where the payoff received by an agent (either cost or reward) is both delayed, and directly corresponds to the magnitude of the delay. This setting models faithfully many real-world scenarios such as the time it takes for a data packet to traverse a network given a choice of route (where delay serves as the agent’s cost); or a user’s time spent on a web page given a choice of content (where delay serves as the agent’s reward). Our main contributions are tight upper and lower bounds for both the cost and reward settings. For the case that delays serve as costs, which we are the first to consider, we prove optimal regret that scales as \sum_{i:\Delta_i>0}\frac{\log T}{\Delta_i} + d^*, where T is the maximal number of steps, \Delta_i are the sub-optimality gaps and d^* is the minimal expected delay amongst arms. For the case that delays serve as rewards, we show optimal regret of \sum_{i:\Delta_i>0}\frac{\log T}{\Delta_i} + \bar{d}, where \bar{d} is the second maximal expected delay. These improve over the regret in the general delay-dependent payoff setting, which scales as \sum_{i:\Delta_i>0}\frac{\log T}{\Delta_i} + D, where D is the maximum possible delay. Our regret bounds highlight the difference between the cost and reward scenarios, showing that the improvement in the cost scenario is more significant than for the reward. Finally, we accompany our theoretical results with an empirical evaluation.
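
To make the delay-as-cost setting concrete, here is a toy simulation: arms draw exponential delays, the realized delay is both the cost and the feedback lag. The round-robin warm-up and lower-confidence-bound rule are an illustrative baseline, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_arms = 10_000, 3
mean_delay = np.array([2.0, 3.5, 5.0])        # expected delay of each arm
counts, cost_sums = np.zeros(n_arms), np.zeros(n_arms)
pending = []                                   # (arrival_time, arm, delay)

for t in range(T):
    # Feedback for a pull arrives only once its own delay has elapsed.
    arrived = [p for p in pending if p[0] <= t]
    pending = [p for p in pending if p[0] > t]
    for _, arm, delay in arrived:
        counts[arm] += 1
        cost_sums[arm] += delay                # the delay itself is the cost
    if t < 10 * n_arms:                        # naive round-robin warm-up
        a = t % n_arms
    else:
        lcb = cost_sums / np.maximum(counts, 1) - np.sqrt(
            2 * np.log(t + 1) / np.maximum(counts, 1))
        a = int(np.argmin(lcb))                # optimism for a cost minimizer
    d = rng.exponential(mean_delay[a])
    pending.append((t + d, a, d))
```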

[LG-8] How transformers learn structured data: insights from hierarchical filtering

链接: https://arxiv.org/abs/2408.15138
作者: Jerome Garnier-Brun,Marc Mézard,Emanuele Moscato,Luca Saglietti
关键词-EN: optimal Belief Propagation, sequences on trees, enabling control, Belief Propagation algorithm, procedure for generative
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Computation and Language (cs.CL)
*备注: 18 pages, 9 figures

点击查看摘要

Abstract:We introduce a hierarchical filtering procedure for generative models of sequences on trees, enabling control over the range of positional correlations in the data. Leveraging this controlled setting, we provide evidence that vanilla encoder-only transformer architectures can implement the optimal Belief Propagation algorithm on both root classification and masked language modeling tasks. Correlations at larger distances corresponding to increasing layers of the hierarchy are sequentially included as the network is trained. We analyze how the transformer layers succeed by focusing on attention maps from models trained with varying degrees of filtering. These attention maps show clear evidence for iterative hierarchical reconstruction of correlations, and we can relate these observations to a plausible implementation of the exact inference algorithm for the network sizes considered.

[LG-9] Using LLMs for Explaining Sets of Counterfactual Examples to Final Users KDD2024

链接: https://arxiv.org/abs/2408.15133
作者: Arturo Fredes,Jordi Vitria
关键词-EN: Causality is vital, field of Explainable, understanding true, relationships between variables, mere correlations
类目: Machine Learning (cs.LG)
*备注: Presented as a poster in the 2nd Workshop on Causal Inference and Machine Learning in Practice at KDD 2024

点击查看摘要

Abstract:Causality is vital for understanding true cause-and-effect relationships between variables within predictive models, rather than relying on mere correlations, making it highly relevant in the field of Explainable AI. In an automated decision-making scenario, causal inference methods can analyze the underlying data-generation process, enabling explanations of a model’s decision by manipulating features and creating counterfactual examples. These counterfactuals explore hypothetical scenarios where a minimal number of factors are altered, providing end-users with valuable information on how to change their situation. However, interpreting a set of multiple counterfactuals can be challenging for end-users who are not used to analyzing raw data records. In our work, we propose a novel multi-step pipeline that uses counterfactuals to generate natural language explanations of actions that will lead to a change in outcome in classifiers of tabular data using LLMs. This pipeline is designed to guide the LLM through smaller tasks that mimic human reasoning when explaining a decision based on counterfactual cases. We conducted various experiments using a public dataset and proposed a method of closed-loop evaluation to assess the coherence of the final explanation with the counterfactuals, as well as the quality of the content. Results are promising, although further experiments with other datasets and human evaluations should be carried out.

[LG-10] Evaluating the Energy Consumption of Machine Learning: Systematic Literature Review and Experiments

链接: https://arxiv.org/abs/2408.15128
作者: Charlotte Rodriguez,Laura Degioanni,Laetitia Kameni,Richard Vidal,Giovanni Neglia
关键词-EN: energy consumption, evaluate energy consumption, Machine Learning, energy, consumption
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注: 52 pages,

点击查看摘要

Abstract:Monitoring, understanding, and optimizing the energy consumption of Machine Learning (ML) are various reasons why it is necessary to evaluate the energy usage of ML. However, there exists no universal tool that can answer this question for all use cases, and there may even be disagreement on how to evaluate energy consumption for a specific use case. Tools and methods are based on different approaches, each with their own advantages and drawbacks, and they need to be mapped out and explained in order to select the most suitable one for a given situation. We address this challenge through two approaches. First, we conduct a systematic literature review of all tools and methods that permit to evaluate the energy consumption of ML (both at training and at inference), irrespective of whether they were originally designed for machine learning or general software. Second, we develop and use an experimental protocol to compare a selection of these tools and methods. The comparison is both qualitative and quantitative on a range of ML tasks of different nature (vision, language) and computational complexity. The systematic literature review serves as a comprehensive guide for understanding the array of tools and methods used in evaluating energy consumption of ML, for various use cases going from basic energy monitoring to consumption optimization. Two open-source repositories are provided for further exploration. The first one contains tools that can be used to replicate this work or extend the current review. The second repository houses the experimental protocol, allowing users to augment the protocol with new ML computing tasks and additional energy evaluation tools.
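
As a concrete example of the kind of software-level energy monitoring tool surveyed here, CodeCarbon exposes a simple tracker API; the snippet below is generic usage and is not claimed to reproduce the paper's experimental protocol.

```python
from codecarbon import EmissionsTracker  # one open-source monitoring tool of this kind

tracker = EmissionsTracker(project_name="ml-energy-demo")
tracker.start()
# ... run a training or inference workload here ...
emissions_kg = tracker.stop()  # estimated kg CO2-eq; energy details are logged to emissions.csv
print(f"Estimated emissions: {emissions_kg:.6f} kg CO2-eq")
```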

[LG-11] Few-Shot Unsupervised Implicit Neural Shape Representation Learning with Spatial Adversaries ICML2024

链接: https://arxiv.org/abs/2408.15114
作者: Amine Ouasfi,Adnane Boukhayma
关键词-EN: Implicit Neural Representations, Neural Signed Distance, Implicit Neural, Signed Distance Functions, complex data modalities
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
*备注: ICML 2024

点击查看摘要

Abstract:Implicit Neural Representations have gained prominence as a powerful framework for capturing complex data modalities, encompassing a wide range from 3D shapes to images and audio. Within the realm of 3D shape representation, Neural Signed Distance Functions (SDF) have demonstrated remarkable potential in faithfully encoding intricate shape geometry. However, learning SDFs from sparse 3D point clouds in the absence of ground truth supervision remains a very challenging task. While recent methods rely on smoothness priors to regularize the learning, our method introduces a regularization term that leverages adversarial samples around the shape to improve the learned SDFs. Through extensive experiments and evaluations, we illustrate the efficacy of our proposed method, highlighting its capacity to improve SDF learning with respect to baselines and the state-of-the-art using synthetic and real data.
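
One way to read the adversarial regularization idea is the following PyTorch sketch: take samples near the shape, step them in the direction that most increases the predicted SDF magnitude, and penalize the network there. This is our schematic interpretation, not the authors' exact formulation.

```python
import torch

def adversarial_sdf_regularizer(sdf, surface_points, step=0.01, radius=0.05):
    """Schematic minimax-style regularizer: perturb samples near the shape in
    the direction that maximally increases |f|, then penalize the SDF there."""
    x = surface_points.clone().requires_grad_(True)
    loss = sdf(x).abs().sum()
    (grad,) = torch.autograd.grad(loss, x)
    # One signed ascent step, kept inside a small ball around the samples.
    delta = (step * grad.sign()).clamp(-radius, radius)
    x_adv = (surface_points + delta).detach()
    return sdf(x_adv).abs().mean()
```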

[LG-12] No Regrets: Investigating and Improving Regret Approximations for Curriculum Discovery

链接: https://arxiv.org/abs/2408.15099
作者: Alexander Rutherford,Michael Beukman,Timon Willi,Bruno Lacerda,Nick Hawes,Jakob Foerster
关键词-EN: Unsupervised Environment Design, improve downstream performance, reinforcement learning, downstream performance, topical question
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:What data or environments to use for training to improve downstream performance is a longstanding and very topical question in reinforcement learning. In particular, Unsupervised Environment Design (UED) methods have gained recent attention as their adaptive curricula enable agents to be robust to in- and out-of-distribution tasks. We ask to what extent these methods are themselves robust when applied to a novel setting, closely inspired by a real-world robotics problem. Surprisingly, we find that the state-of-the-art UED methods either do not improve upon the naïve baseline of Domain Randomisation (DR), or require substantial hyperparameter tuning to do so. Our analysis shows that this is due to their underlying scoring functions failing to predict intuitive measures of “learnability”, i.e., finding the settings that the agent sometimes solves, but not always. Based on this, we instead directly train on levels with high learnability and find that this simple and intuitive approach outperforms UED methods and DR in several binary-outcome environments, including on our domain and the standard UED domain of Minigrid. We further introduce a new adversarial evaluation procedure for directly measuring robustness, closely mirroring the conditional value at risk (CVaR). We open-source all our code and present visualisations of final policies here: this https URL.

[LG-13] Data-Driven Nonlinear Deformation Design of 3D-Printable Shells

链接: https://arxiv.org/abs/2408.15097
作者: Samuel Silverman,Kelsey L. Snapp,Keith A. Brown,Emily Whiting
关键词-EN: specific mechanical properties, mechanical properties requires, properties requires understanding, Designing and fabricating, parameters and performance
类目: Graphics (cs.GR); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注: Submitted to 3D Printing and Additive Manufacturing

点击查看摘要

Abstract:Designing and fabricating structures with specific mechanical properties requires understanding the intricate relationship between design parameters and performance. Understanding the design-performance relationship becomes increasingly complicated for nonlinear deformations. Though successful at modeling elastic deformations, simulation-based techniques struggle to model large elastoplastic deformations exhibiting plasticity and densification. We propose a neural network trained on experimental data to learn the design-performance relationship between 3D-printable shells and their compressive force-displacement behavior. Trained on thousands of physical experiments, our network aids in both forward and inverse design to generate shells exhibiting desired elastoplastic and hyperelastic deformations. We validate a subset of generated designs through fabrication and testing. Furthermore, we demonstrate the network’s inverse design efficacy in generating custom shells for several applications.

[LG-14] Post-processing fairness with minimal changes

链接: https://arxiv.org/abs/2408.15096
作者: Federico Di Gennaro,Thibault Laugel,Vincent Grari,Xavier Renard,Marcin Detyniecki
关键词-EN: test time, require the sensitive, sensitive attribute, attribute at test, post-processing algorithm
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In this paper, we introduce a novel post-processing algorithm that is both model-agnostic and does not require the sensitive attribute at test time. In addition, our algorithm is explicitly designed to enforce minimal changes between biased and debiased predictions; a property that, while highly desirable, is rarely prioritized as an explicit objective in the fairness literature. Our approach leverages a multiplicative factor applied to the logit value of probability scores produced by a black-box classifier. We demonstrate the efficacy of our method through empirical evaluations, comparing its performance against four other debiasing algorithms on two widely used datasets in fairness research.
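
The multiplicative logit adjustment can be sketched in a few lines. Here `w` is a single scalar fitted offline to a fairness objective, which is an illustrative simplification of whatever form the paper's factor actually takes.

```python
import numpy as np

def debias_scores(p: np.ndarray, w: float) -> np.ndarray:
    """Apply a multiplicative factor w to the logit of black-box scores p,
    then map back to probabilities (w would be fitted to a fairness objective)."""
    logit = np.log(p) - np.log1p(-p)
    return 1.0 / (1.0 + np.exp(-w * logit))

scores = np.array([0.2, 0.55, 0.9])
print(debias_scores(scores, w=0.8))
```

With w < 1 the scores shrink toward 0.5, so most predicted labels are preserved, which fits the minimal-change objective.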

[LG-15] Constrained Diffusion Models via Dual Training

链接: https://arxiv.org/abs/2408.15094
作者: Shervin Khalafi,Dongsheng Ding,Alejandro Ribeiro
关键词-EN: Diffusion, Diffusion models, constrained diffusion models, constrained diffusion, high fidelity
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)
*备注: 41 pages, 4 figures, 2 tables

点击查看摘要

Abstract:Diffusion models have attained prominence for their ability to synthesize a probability distribution for a given dataset via a diffusion process, enabling the generation of new data points with high fidelity. However, diffusion processes are prone to generating biased data based on the training dataset. To address this issue, we develop constrained diffusion models by imposing diffusion constraints based on desired distributions that are informed by requirements. Specifically, we cast the training of diffusion models under requirements as a constrained distribution optimization problem that aims to reduce the distribution difference between original and generated data while obeying constraints on the distribution of generated data. We show that our constrained diffusion models generate new data from a mixture data distribution that achieves the optimal trade-off among objective and constraints. To train constrained diffusion models, we develop a dual training algorithm and characterize the optimality of the trained constrained diffusion model. We empirically demonstrate the effectiveness of our constrained models in two constrained generation tasks: (i) we consider a dataset with one or more underrepresented classes where we train the model with constraints to ensure fairly sampling from all classes during inference; (ii) we fine-tune a pre-trained diffusion model to sample from a new dataset while avoiding overfitting.
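
Read as a generic primal-dual scheme, the dual training idea alternates a gradient step on the Lagrangian with dual ascent on the multiplier. The objective and constraint below are toy stand-ins (so the snippet runs), not the paper's diffusion losses.

```python
import torch

theta = torch.randn(8, requires_grad=True)       # stand-in for model parameters
opt = torch.optim.SGD([theta], lr=0.1)
lam, eta_dual, tol = 0.0, 0.05, 0.1              # dual variable, dual step size, slack

def objective(p):   # stand-in for the diffusion matching loss
    return (p ** 2).mean()

def constraint(p):  # stand-in for a constraint on the generated distribution
    return (p.mean() - 1.0) ** 2

for step in range(200):
    obj, cons = objective(theta), constraint(theta)
    lagrangian = obj + lam * cons                         # primal update
    opt.zero_grad(); lagrangian.backward(); opt.step()
    lam = max(0.0, lam + eta_dual * (cons.item() - tol))  # dual ascent
```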

[LG-16] SiHGNN: Leveraging Properties of Semantic Graphs for Efficient HGNN Acceleration

链接: https://arxiv.org/abs/2408.15089
作者: Runzhen Xue,Mingyu Yan,Dengke Han,Zhimin Tang,Xiaochun Ye,Dongrui Fan
关键词-EN: Heterogeneous Graph Neural, heterogeneous graph fields, Graph Neural Networks, Heterogeneous Graph, semantic graph
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注: 12 pages, 18 figures. arXiv admin note: text overlap with arXiv:2404.04792

点击查看摘要

Abstract:Heterogeneous Graph Neural Networks (HGNNs) have expanded graph representation learning to heterogeneous graph fields. Recent studies have demonstrated their superior performance across various applications, including medical analysis and recommendation systems, often surpassing existing methods. However, GPUs often experience inefficiencies when executing HGNNs due to their unique and complex execution patterns. Compared to traditional Graph Neural Networks, these patterns further exacerbate irregularities in memory access. To tackle these challenges, recent studies have focused on developing domain-specific accelerators for HGNNs. Nonetheless, most of these efforts have concentrated on optimizing the datapath or scheduling data accesses, while largely overlooking the potential benefits that could be gained from leveraging the inherent properties of the semantic graph, such as its topology, layout, and generation. In this work, we focus on leveraging the properties of semantic graphs to enhance HGNN performance. First, we analyze the Semantic Graph Build (SGB) stage and identify significant opportunities for data reuse during semantic graph generation. Next, we uncover the phenomenon of buffer thrashing during the Graph Feature Processing (GFP) stage, revealing potential optimization opportunities in semantic graph layout. Furthermore, we propose a lightweight hardware accelerator frontend for HGNNs, called SiHGNN. This accelerator frontend incorporates a tree-based Semantic Graph Builder for efficient semantic graph generation and features a novel Graph Restructurer for optimizing semantic graph layouts. Experimental results show that SiHGNN enables the state-of-the-art HGNN accelerator to achieve an average performance improvement of 2.95 \times.

[LG-17] MMASD: A Novel Dataset for Privacy-Preserving Behavior Analysis of Children with Autism Spectrum Disorder

链接: https://arxiv.org/abs/2408.15077
作者: Pavan Uttej Ravva,Behdokht Kiafar,Pinar Kullu,Jicheng Li,Anjana Bhat,Roghayeh Leila Barmaki
关键词-EN: comprehending communication signals, Autism spectrum disorder, spectrum disorder, communication signals, social interaction
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Autism spectrum disorder (ASD) is characterized by significant challenges in social interaction and comprehending communication signals. Recently, therapeutic interventions for ASD have increasingly utilized deep learning-powered computer vision techniques to monitor individual progress over time. These models are trained on private, non-public datasets from the autism community, creating challenges in comparing results across different models due to privacy-preserving data-sharing issues. This work introduces MMASD+. MMASD+ consists of diverse data modalities, including 3D-Skeleton, 3D Body Mesh, and Optical Flow data. It integrates the capabilities of Yolov8 and Deep SORT algorithms to distinguish between the therapist and children, addressing a significant barrier in the original dataset. Additionally, a Multimodal Transformer framework is proposed to predict 11 action types and the presence of ASD. This framework achieves an accuracy of 95.03% for predicting action types and 96.42% for predicting ASD presence, demonstrating over a 10% improvement compared to models trained on single data modalities. These findings highlight the advantages of integrating multiple data modalities within the Multimodal Transformer framework.

[LG-18] MiWaves Reinforcement Learning Algorithm

链接: https://arxiv.org/abs/2408.15076
作者: Susobhan Ghosh,Yongyi Guo,Pei-Yao Hung,Lara Coughlin,Erin Bonar,Inbal Nahum-Shani,Maureen Walton,Susan Murphy
关键词-EN: health challenge globally, significant public health, public health challenge, challenge globally, escalating prevalence
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: arXiv admin note: substantial text overlap with arXiv:2402.17739

点击查看摘要

Abstract:The escalating prevalence of cannabis use poses a significant public health challenge globally. In the U.S., cannabis use is more prevalent among emerging adults (EAs) (ages 18-25) than any other age group, with legalization in multiple states contributing to a public perception that cannabis is less risky than in prior decades. To address this growing concern, we developed MiWaves, a reinforcement learning (RL) algorithm designed to optimize the delivery of personalized intervention prompts to reduce cannabis use among EAs. MiWaves leverages domain expertise and prior data to tailor the likelihood of delivery of intervention messages. This paper presents a comprehensive overview of the algorithm’s design, including key decisions and experimental outcomes. The finalized MiWaves RL algorithm was deployed in a clinical trial from March to May 2024.

[LG-19] Interactive dense pixel visualizations for time series and model attribution explanations

链接: https://arxiv.org/abs/2408.15073
作者: Udo Schlegel,Daniel A. Keim
关键词-EN: Explainable Artificial Intelligence, Artificial Intelligence, Explainable Artificial, offering numerous techniques, Deep Neural Network
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: 5 pages, 2 figures, accepted at MLVIS 2023

点击查看摘要

Abstract:The field of Explainable Artificial Intelligence (XAI) for Deep Neural Network models has developed significantly, offering numerous techniques to extract explanations from models. However, evaluating explanations is often not trivial, and differences in applied metrics can be subtle, especially with non-intelligible data. Thus, there is a need for visualizations tailored to explore explanations for domains with such data, e.g., time series. We propose DAVOTS, an interactive visual analytics approach to explore raw time series data, activations of neural networks, and attributions in a dense-pixel visualization to gain insights into the data, models’ decisions, and explanations. To further support users in exploring large datasets, we apply clustering approaches to the visualized data domains to highlight groups and present ordering strategies for individual and combined data exploration to facilitate finding patterns. We visualize a CNN trained on the FordA dataset to demonstrate the approach.

[LG-20] Subgroup Analysis via Model-based Rule Forest

链接: https://arxiv.org/abs/2408.15057
作者: I-Ling Cheng,Chan Hsu,Chantung Ku,Pei-Ju Lee,Yihuang Kang
关键词-EN: critical decision-making scenarios, black-box nature, raising concerns, decision-making scenarios, Deep Rule Forests
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Machine learning models are often criticized for their black-box nature, raising concerns about their applicability in critical decision-making scenarios. Consequently, there is a growing demand for interpretable models in such contexts. In this study, we introduce Model-based Deep Rule Forests (mobDRF), an interpretable representation learning algorithm designed to extract transparent models from data. By leveraging IF-THEN rules with multi-level logic expressions, mobDRF enhances the interpretability of existing models without compromising accuracy. We apply mobDRF to identify key risk factors for cognitive decline in an elderly population, demonstrating its effectiveness in subgroup analysis and local model optimization. Our method offers a promising solution for developing trustworthy and interpretable machine learning models, particularly valuable in fields like healthcare, where understanding differential effects across patient subgroups can lead to more personalized and effective treatments.

[LG-21] Causal Rule Forest: Toward Interpretable and Precise Treatment Effect Estimation

链接: https://arxiv.org/abs/2408.15055
作者: Chan Hsu,Jun-Ting Wu,Yihuang Kang
关键词-EN: Heterogeneous Treatment Effects, Average Treatment Effects, Conditional Average Treatment, inferencing Heterogeneous Treatment, Conditional Average
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: The 25th IEEE International Conference on Information Reuse and Integration for Data Science (IRI 2024)

点击查看摘要

Abstract:Understanding and inferring Heterogeneous Treatment Effects (HTE) and Conditional Average Treatment Effects (CATE) are vital for developing personalized treatment recommendations. Many state-of-the-art approaches achieve impressive performance in estimating HTE on benchmark datasets or simulation studies. However, the indirect prediction manner and complex model architectures reduce the interpretability of these approaches. To mitigate the gap between predictive performance and heterogeneity interpretability, we introduce the Causal Rule Forest (CRF), a novel approach to learning hidden patterns from data and transforming the patterns into interpretable multi-level Boolean rules. By training other interpretable causal inference models with data representations learned by CRF, we can reduce the predictive errors of these models in estimating HTE and CATE, while keeping their interpretability for identifying subgroups for which a treatment is more effective. Our experiments underscore the potential of CRF to advance personalized interventions and policies, paving the way for future research to enhance its scalability and application across complex causal inference challenges.

[LG-22] Earth Observation Satellite Scheduling with Graph Neural Networks

链接: https://arxiv.org/abs/2408.15041
作者: Antoine Jacquet,Guillaume Infantes,Nicolas Meuleau,Emmanuel Benazera,Stéphanie Roussel,Vincent Baudoui,Jonathan Guerra
关键词-EN: Observation Satellite Planning, Earth Observation Satellite, considerable practical interest, Satellite Planning, difficult optimization problem
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: Accepted at 17th European Workshop on Reinforcement Learning (EWRL 2024)

点击查看摘要

Abstract:Earth Observation Satellite Planning (EOSP) is a difficult optimization problem with considerable practical interest. A set of requested observations must be scheduled on an agile Earth observation satellite while respecting constraints on their visibility window, as well as maneuver constraints that impose varying delays between successive observations. In addition, the problem is largely oversubscribed: there are many more candidate observations than can possibly be achieved. Therefore, one must select the set of observations that will be performed while maximizing their weighted cumulative benefit, and propose a feasible schedule for these observations. As previous work mostly focused on heuristic and iterative search algorithms, this paper presents a new technique for selecting and scheduling observations based on Graph Neural Networks (GNNs) and Deep Reinforcement Learning (DRL). GNNs are used to extract relevant information from the graphs representing instances of the EOSP, and DRL drives the search for optimal schedules. Our simulations show that the approach is able to learn on small problem instances and generalize to larger real-world instances, with very competitive performance compared to traditional approaches.

[LG-23] MONAS: Efficient Zero-Shot Neural Architecture Search for MCUs

链接: https://arxiv.org/abs/2408.15034
作者: Ye Qiao,Haocheng Xu,Yifan Zhang,Sitao Huang
关键词-EN: Convolutional Neural Network, discovering new Convolutional, Convolutional Neural, accuracy optimization goals, optimization goals
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Neural Architecture Search (NAS) has proven effective in discovering new Convolutional Neural Network (CNN) architectures, particularly for scenarios with well-defined accuracy optimization goals. However, previous approaches often involve time-consuming training on super networks or intensive architecture sampling and evaluations. Although various zero-cost proxies correlated with CNN model accuracy have been proposed for efficient architecture search without training, their lack of hardware consideration makes it challenging to target highly resource-constrained edge devices such as microcontroller units (MCUs). To address these challenges, we introduce MONAS, a novel hardware-aware zero-shot NAS framework specifically designed for MCUs in edge computing. MONAS incorporates hardware optimality considerations into the search process through our proposed MCU hardware latency estimation model. By combining this with specialized performance indicators (proxies), MONAS identifies optimal neural architectures without incurring heavy training and evaluation costs, optimizing for both hardware latency and accuracy under resource constraints. MONAS achieves up to a 1104x improvement in search efficiency over previous work targeting MCUs and can discover CNN models with over 3.23x faster inference on MCUs while maintaining similar accuracy compared to more general NAS approaches.
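
A hardware-aware zero-shot search score of the kind described can be sketched as follows; `proxy_fn` and `latency_fn` are hypothetical placeholders rather than MONAS's actual interface.

```python
def hardware_aware_score(arch, proxy_fn, latency_fn, latency_budget_ms, alpha=1.0):
    """Rank an architecture by a training-free quality proxy, penalized by a
    hardware latency estimate (placeholder names, not MONAS's API)."""
    quality = proxy_fn(arch)              # e.g. a zero-cost accuracy proxy
    latency = latency_fn(arch)            # e.g. an MCU latency model, in ms
    over_budget = max(0.0, latency - latency_budget_ms)
    return quality - alpha * over_budget  # the search keeps the highest-scoring archs
```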

[LG-24] Prior-free Balanced Replay: Uncertainty-guided Reservoir Sampling for Long-Tailed Continual Learning

链接: https://arxiv.org/abs/2408.14976
作者: Lei Liu,Li Liu,Yawen Cui
关键词-EN: continual data stream, Long-Tailed Continual Learning, continual learning, data stream exhibits, continual data
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Even in the era of large models, one of the well-known issues in continual learning (CL) is catastrophic forgetting, which is especially challenging when the continual data stream exhibits a long-tailed distribution, termed Long-Tailed Continual Learning (LTCL). Existing LTCL solutions generally require the label distribution of the data stream to achieve re-balanced training. However, obtaining such prior information is often infeasible in real scenarios since the model should learn without pre-identifying the majority and minority classes. To this end, we propose a novel Prior-free Balanced Replay (PBR) framework to learn from a long-tailed data stream with less forgetting. Concretely, motivated by our experimental finding that the minority classes are more likely to be forgotten due to their higher uncertainty, we design an uncertainty-guided reservoir sampling strategy to prioritize rehearsing minority data without using any prior information, which is based on the mutual dependence between the model and samples. Additionally, we incorporate two prior-free components to further reduce the forgetting issue: (1) a boundary constraint preserves uncertain boundary-supporting samples for continually re-estimating task boundaries, and (2) a prototype constraint maintains the consistency of learned class prototypes during training. Our approach is evaluated on three standard long-tailed benchmarks, demonstrating superior performance to existing CL methods and the previous SOTA LTCL approach in both task- and class-incremental learning settings, as well as ordered- and shuffled-LTCL settings.
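
One classical way to realize an uncertainty-weighted reservoir is exponential-key weighted reservoir sampling (Efraimidis-Spirakis); the sketch below uses that scheme as an illustration and is not necessarily the paper's exact sampler.

```python
import heapq, random

def uncertainty_reservoir(stream, k):
    """Weighted reservoir sampling: items with higher uncertainty get larger
    random keys and thus survive in the buffer more often.
    `stream` yields (sample, uncertainty > 0) pairs; returns k kept samples."""
    heap = []  # min-heap of (key, tiebreak, sample)
    for i, (sample, uncertainty) in enumerate(stream):
        key = random.random() ** (1.0 / uncertainty)
        if len(heap) < k:
            heapq.heappush(heap, (key, i, sample))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, i, sample))  # evict the smallest key
    return [s for _, _, s in heap]
```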

[LG-25] Cross-Modal Learning for Chemistry Property Prediction: Large Language Models Meet Graph Machine Learning NEURIPS2023

链接: https://arxiv.org/abs/2408.14964
作者: Sakhinana Sagar Srinivas,Venkataramana Runkana
关键词-EN: facilitating accurate property, accurate property predictions, Graph Neural Networks, Large Language Models, property prediction tasks
类目: Machine Learning (cs.LG)
*备注: Paper Accepted at Workshop on Robustness of Few-shot and Zero-shot Learning in Foundation Models at NeurIPS 2023

点击查看摘要

Abstract:In the field of chemistry, the objective is to create novel molecules with desired properties, facilitating accurate property predictions for applications such as material design and drug screening. However, existing graph deep learning methods face limitations that curb their expressive power. To address this, we explore the integration of vast molecular domain knowledge from Large Language Models (LLMs) with the complementary strengths of Graph Neural Networks (GNNs) to enhance performance in property prediction tasks. We introduce a Multi-Modal Fusion (MMF) framework that synergistically harnesses the analytical prowess of GNNs and the linguistic generative and predictive abilities of LLMs, thereby improving accuracy and robustness in predicting molecular properties. Our framework combines the effectiveness of GNNs in modeling graph-structured data with the zero-shot and few-shot learning capabilities of LLMs, enabling improved predictions while reducing the risk of overfitting. Furthermore, our approach effectively addresses distributional shifts, a common challenge in real-world applications, and showcases the efficacy of learning cross-modal representations, surpassing state-of-the-art baselines on benchmark datasets for property prediction tasks.

[LG-26] Domain-decoupled Physics-informed Neural Networks with Closed-form Gradients for Fast Model Learning of Dynamical Systems

链接: https://arxiv.org/abs/2408.14951
作者: Henrik Krauss,Tim-Lukas Habich,Max Bartholdt,Thomas Seel,Moritz Schappler
关键词-EN: incorporate unmodeled effects, Physics-informed neural networks, Physics-informed neural, trained using physical, physical equations
类目: ystems and Control (eess.SY); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: Accepted to International Conference on Informatics in Control, Automation and Robotics (ICINCO) 2024

点击查看摘要

Abstract:Physics-informed neural networks (PINNs) are trained using physical equations and can also incorporate unmodeled effects by learning from data. PINNs for control (PINCs) of dynamical systems are gaining interest due to their prediction speed compared to classical numerical integration methods for nonlinear state-space models, making them suitable for real-time control applications. We introduce the domain-decoupled physics-informed neural network (DD-PINN) to address current limitations of PINC in handling large and complex nonlinear dynamic systems. The time domain is decoupled from the feed-forward neural network to construct an Ansatz function, allowing for calculation of gradients in closed form. This approach significantly reduces training times, especially for large dynamical systems, compared to PINC, which relies on graph-based automatic differentiation. Additionally, the DD-PINN inherently fulfills the initial condition and supports higher-order excitation inputs, simplifying the training process and enabling improved prediction accuracy. Validation on three systems - a nonlinear mass-spring-damper, a five-mass-chain, and a two-link robot - demonstrates that the DD-PINN achieves significantly shorter training times. In cases where the PINC’s prediction diverges, the DD-PINN’s prediction remains stable and accurate due to higher physics loss reduction or use of a higher-order excitation input. The DD-PINN allows for fast and accurate learning of large dynamical systems previously out of reach for the PINC.
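
The decoupling can be illustrated with a toy ansatz x(t) = x0 + phi(t) * g(x0), where the time basis phi is analytic, so dx/dt needs no automatic differentiation through time; the specific phi below is our illustrative choice, not the paper's.

```python
import torch

class AnsatzPINN(torch.nn.Module):
    """Sketch of a time-decoupled ansatz: the network g never sees t, and
    the analytic basis phi carries all time dependence, so the time
    derivative is available in closed form."""
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.g = torch.nn.Sequential(
            torch.nn.Linear(state_dim, hidden), torch.nn.Tanh(),
            torch.nn.Linear(hidden, state_dim))

    def forward(self, t, x0):          # t: (B, 1), x0: (B, state_dim)
        phi = 1.0 - torch.exp(-t)      # phi(0) = 0 enforces x(0) = x0
        dphi = torch.exp(-t)           # closed-form time derivative of phi
        gx = self.g(x0)
        return x0 + phi * gx, dphi * gx  # x(t) and dx/dt
```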

[LG-27] Quotient Normalized Maximum Likelihood Criterion for Learning Bayesian Network Structures AISTATS2018

链接: https://arxiv.org/abs/2408.14935
作者: Tomi Silander,Janne Leppä-aho,Elias Jääsaari,Teemu Roos
关键词-EN: Bayesian network structure, network structure learning, normalized maximum likelihood, Bayesian network, call quotient normalized
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted to AISTATS 2018

点击查看摘要

Abstract:We introduce an information theoretic criterion for Bayesian network structure learning which we call quotient normalized maximum likelihood (qNML). In contrast to the closely related factorized normalized maximum likelihood criterion, qNML satisfies the property of score equivalence. It is also decomposable and completely free of adjustable hyperparameters. For practical computations, we identify a remarkably accurate approximation proposed earlier by Szpankowski and Weinberger. Experiments on both simulated and real data demonstrate that the new criterion leads to parsimonious models with good predictive accuracy.

[LG-28] Can Transformers Do Enumerative Geometry?

链接: https://arxiv.org/abs/2408.14915
作者: Baran Hashemi,Roderic G. Corominas,Alessandro Giacchetto
关键词-EN: class intersection numbers, intersection numbers, learn enumerative geometry, class intersection, intersection
类目: Machine Learning (cs.LG); Algebraic Geometry (math.AG)
*备注:

点击查看摘要

Abstract:How can Transformers model and learn enumerative geometry? What is a robust procedure for using Transformers in abductive knowledge discovery within a mathematician-machine collaboration? In this work, we introduce a new paradigm in computational enumerative geometry in analyzing the \psi-class intersection numbers on the moduli space of curves. By formulating the enumerative problem as a continuous optimization task, we develop a Transformer-based model for computing \psi-class intersection numbers based on the underlying quantum Airy structure. For a finite range of genera, our model is capable of regressing intersection numbers that span an extremely wide range of values, from 10^{-45} to 10^{45}. To provide a proper inductive bias for capturing the recursive behavior of intersection numbers, we propose a new activation function, Dynamic Range Activator (DRA). Moreover, given the severe heteroscedasticity of \psi-class intersections and the required precision, we quantify the uncertainty of the predictions using Conformal Prediction with a dynamic sliding window that is aware of the number of marked points. Next, we go beyond merely computing intersection numbers and explore the enumerative “world-model” of the Transformers. Through a series of causal inference and correlational interpretability analyses, we demonstrate that Transformers are actually modeling Virasoro constraints in a purely data-driven manner. Additionally, we provide evidence for the comprehension of several values appearing in the large genus asymptotic of \psi-class intersection numbers through abductive hypothesis testing.

[LG-29] SpikingSSMs: Learning Long Sequences with Sparse and Parallel Spiking State Space Models

链接: https://arxiv.org/abs/2408.14909
作者: Shuaijie Shen,Chao Wang,Renzhuo Huang,Yan Zhong,Qinghai Guo,Zhichao Lu,Jianguo Zhang,Luziwei Leng
关键词-EN: energy consumption networks, low energy consumption, past decades, energy consumption, gained a lot
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Known as low-energy-consumption networks, spiking neural networks (SNNs) have gained a lot of attention over the past decades. While SNNs are increasingly competitive with artificial neural networks (ANNs) for vision tasks, they are rarely used for long-sequence tasks, despite their intrinsic temporal dynamics. In this work, we develop spiking state space models (SpikingSSMs) for long-sequence learning by leveraging the sequence-learning abilities of state space models (SSMs). Inspired by dendritic neuron structure, we hierarchically integrate neuronal dynamics with the original SSM block, while realizing sparse synaptic computation. Furthermore, to resolve the conflict between event-driven neuronal dynamics and parallel computing, we propose a lightweight surrogate dynamic network which accurately predicts the after-reset membrane potential and is compatible with learnable thresholds, enabling orders-of-magnitude acceleration in training speed compared with conventional iterative methods. On the Long Range Arena benchmark, SpikingSSM achieves performance competitive with state-of-the-art SSMs while realizing on average 90% network sparsity. On language modeling, our network significantly surpasses existing spiking large language models (spikingLLMs) on the WikiText-103 dataset with only a third of the model size, demonstrating its potential as a backbone architecture for low-computation-cost LLMs.

[LG-30] Adversarial Attacks and Defenses in Multivariate Time-Series Forecasting for Smart and Connected Infrastructures

链接: https://arxiv.org/abs/2408.14875
作者: Pooja Krishan,Rohan Mohapatra,Saptarshi Sengupta
关键词-EN: deep learning models, devices and infrastructures, Gradient Sign Method, Basic Iterative Method, emergence of deep
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Performance (cs.PF)
*备注: 17 pages, 32 figures

点击查看摘要

Abstract:The emergence of deep learning models has revolutionized various industries over the last decade, leading to a surge in connected devices and infrastructures. However, these models can be tricked into making incorrect predictions with high confidence, leading to disastrous failures and security concerns. To this end, we explore the impact of adversarial attacks on multivariate time-series forecasting and investigate methods to counter them. Specifically, we employ untargeted white-box attacks, namely the Fast Gradient Sign Method (FGSM) and the Basic Iterative Method (BIM), to poison the inputs to the training process, effectively misleading the model. We also illustrate the subtle modifications to the inputs after the attack, which make detecting the attack with the naked eye quite difficult. Having demonstrated the feasibility of these attacks, we develop robust models through adversarial training and model hardening. We are among the first to showcase the transferability of these attacks and defenses by extrapolating our work from the benchmark electricity data to a larger, 10-year real-world dataset used for predicting the time-to-failure of hard disks. Our experimental results confirm that the attacks and defenses achieve the desired security thresholds, leading to a 72.41% and 94.81% decrease in RMSE for the electricity and hard disk datasets respectively after implementing the adversarial defenses.
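
FGSM itself is standard: a single signed-gradient step on the input. A minimal PyTorch sketch for a multivariate time-series window follows; BIM is essentially this step iterated with clipping to the epsilon-ball.

```python
import torch

def fgsm_perturb(model, x, y, loss_fn, eps=0.01):
    """Fast Gradient Sign Method: one signed-gradient step on the input,
    here applied to a time-series window x of shape (batch, time, features)."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x_adv), y)
    loss.backward()
    return (x_adv + eps * x_adv.grad.sign()).detach()
```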

[LG-31] Learning Robust Reward Machines from Noisy Labels KR2024

链接: https://arxiv.org/abs/2408.14871
作者: Roko Parac,Lorenzo Nodari,Leo Ardon,Daniel Furelos-Blanco,Federico Cerutti,Alessandra Russo
关键词-EN: paper presents PROB-IRM, noisy execution traces, paper presents, noisy traces, robust reward machines
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Preprint accepted for publication to the 21st International Conference on Principles of Knowledge Representation and Reasoning (KR 2024)

点击查看摘要

Abstract:This paper presents PROB-IRM, an approach that learns robust reward machines (RMs) for reinforcement learning (RL) agents from noisy execution traces. The key aspect of RM-driven RL is the exploitation of a finite-state machine that decomposes the agent’s task into different subtasks. PROB-IRM uses a state-of-the-art inductive logic programming framework robust to noisy examples to learn RMs from noisy traces using the Bayesian posterior degree of beliefs, thus ensuring robustness against inconsistencies. Pivotal for the results is the interleaving between RM learning and policy learning: a new RM is learned whenever the RL agent generates a trace that is believed not to be accepted by the current RM. To speed up the training of the RL agent, PROB-IRM employs a probabilistic formulation of reward shaping that uses the posterior Bayesian beliefs derived from the traces. Our experimental analysis shows that PROB-IRM can learn (potentially imperfect) RMs from noisy traces and exploit them to train an RL agent to solve its tasks successfully. Despite the complexity of learning the RM from noisy traces, agents trained with PROB-IRM perform comparably to agents provided with handcrafted RMs.

[LG-32] Advancing Adversarial Suffix Transfer Learning on Aligned Large Language Models

链接: https://arxiv.org/abs/2408.14866
作者: Hongfu Liu,Yuxi Xie,Ye Wang,Michael Shieh
关键词-EN: Language Language Models, Language Language, face safety concerns, safety concerns due, Greedy Coordinate Gradient
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 11 pages, 4 figures

点击查看摘要

Abstract:Large Language Models (LLMs) face safety concerns due to potential misuse by malicious users. Recent red-teaming efforts have identified adversarial suffixes capable of jailbreaking LLMs using the gradient-based search algorithm Greedy Coordinate Gradient (GCG). However, GCG struggles with computational inefficiency, limiting further investigations regarding suffix transferability and scalability across models and data. In this work, we bridge the connection between search efficiency and suffix transferability. We propose a two-stage transfer learning framework, DeGCG, which decouples the search process into behavior-agnostic pre-searching and behavior-relevant post-searching. Specifically, we employ direct first target token optimization in pre-searching to facilitate the search process. We apply our approach to cross-model, cross-data, and self-transfer scenarios. Furthermore, we introduce an interleaved variant of our approach, i-DeGCG, which iteratively leverages self-transferability to accelerate the search process. Experiments on HarmBench demonstrate the efficiency of our approach across various models and domains. Notably, our i-DeGCG outperforms the baseline on Llama2-chat-7b with ASRs of 43.9 (+22.2) and 39.0 (+19.5) on valid and test sets, respectively. Further analysis on cross-model transfer indicates the pivotal role of first target token optimization in leveraging suffix transferability for efficient searching.

[LG-33] Dynamic operator management in meta-heuristics using reinforcement learning: an application to permutation flowshop scheduling problems

链接: https://arxiv.org/abs/2408.14864
作者: Maryam Karimi Mamaghan,Mehrdad Mohammadi,Wout Dullaert,Daniele Vigo,Amir Pirayesh
关键词-EN: search operators, study develops, based on reinforcement, reinforcement learning, manage a large
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This study develops a framework based on reinforcement learning to dynamically manage a large portfolio of search operators within meta-heuristics. Using the idea of tabu search, the framework allows for continuous adaptation by temporarily excluding less efficient operators and updating the portfolio composition during the search. A Q-learning-based adaptive operator selection mechanism is used to select the most suitable operator from the dynamically updated portfolio at each stage. Unlike traditional approaches, the proposed framework requires no input from the experts regarding the search operators, allowing domain-specific non-experts to effectively use the framework. The performance of the proposed framework is analyzed through an application to the permutation flowshop scheduling problem. The results demonstrate the superior performance of the proposed framework against state-of-the-art algorithms in terms of optimality gap and convergence speed.
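
The Q-learning-based selection with tabu exclusion can be sketched generically; the state encoding and reward signal are left abstract, so this is a schematic of the mechanism rather than the paper's implementation.

```python
import random

def select_operator(q, portfolio, tabu, state, epsilon=0.1):
    """Epsilon-greedy choice among currently non-tabu operators, driven by
    a Q-table q[(state, op)] (schematic of the described mechanism)."""
    candidates = [op for op in portfolio if op not in tabu]
    if random.random() < epsilon:
        return random.choice(candidates)
    return max(candidates, key=lambda op: q.get((state, op), 0.0))

def update_q(q, state, op, reward, next_state, portfolio, alpha=0.1, gamma=0.9):
    """Standard one-step Q-learning update after observing the operator's reward."""
    best_next = max(q.get((next_state, o), 0.0) for o in portfolio)
    old = q.get((state, op), 0.0)
    q[(state, op)] = old + alpha * (reward + gamma * best_next - old)
```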

[LG-34] Correntropy-Based Improper Likelihood Model for Robust Electrophysiological Source Imaging

链接: https://arxiv.org/abs/2408.14843
作者: Yuanhao Li,Badong Chen,Zhongxu Hu,Keita Suzuki,Wenjun Bai,Yasuharu Koike,Okito Yamashita
关键词-EN: Bayesian source imaging, source imaging task, electrophysiological source imaging, source imaging, Bayesian learning
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Bayesian learning provides a unified skeleton to solve the electrophysiological source imaging task. From this perspective, existing source imaging algorithms utilize the Gaussian assumption for the observation noise to build the likelihood function for Bayesian inference. However, the electromagnetic measurements of brain activity are usually affected by miscellaneous artifacts, leading to a potentially non-Gaussian distribution for the observation noise. Hence the conventional Gaussian likelihood model is a suboptimal choice for the real-world source imaging task. In this study, we aim to solve this problem by proposing a new likelihood model which is robust with respect to non-Gaussian noises. Motivated by the robust maximum correntropy criterion, we propose a new improper distribution model concerning the noise assumption. This new noise distribution is leveraged to structure a robust likelihood function and integrated with hierarchical prior distributions to estimate source activities by variational inference. In particular, score matching is adopted to determine the hyperparameters for the improper likelihood model. A comprehensive performance evaluation is performed to compare the proposed noise assumption to the conventional Gaussian model. Simulation results show that the proposed method can realize more precise source reconstruction on simulations with known ground truth. A real-world dataset from a visual perception task also demonstrates the superiority of our new method. This study provides a new backbone for Bayesian source imaging, which would facilitate its application to real-world noisy brain signals.
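
The correntropy-induced (Welsch) loss behind the maximum correntropy criterion is quadratic near zero but saturates for large residuals, which is what confers robustness to non-Gaussian noise. A minimal sketch, assuming a Gaussian kernel of width sigma:

```python
import numpy as np

def correntropy_loss(residual: np.ndarray, sigma: float = 1.0) -> float:
    """Welsch/correntropy-induced loss: approximately quadratic for small
    residuals, but bounded by sigma^2 for outliers."""
    return float(np.mean(sigma**2 * (1.0 - np.exp(-residual**2 / (2.0 * sigma**2)))))

e = np.array([0.1, 0.2, 8.0])   # the gross outlier contributes at most sigma^2
print(correntropy_loss(e))
```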

[LG-35] From Bias to Balance: Detecting Facial Expression Recognition Biases in Large Multimodal Foundation Models

链接: https://arxiv.org/abs/2408.14842
作者: Kaylee Chhua,Zhoujinyi Wen,Vedant Hathalia,Kevin Zhu,Sean O’Brien
关键词-EN: Large Multimodal Foundation, Large Multimodal, Multimodal Foundation Models, Multimodal Foundation, traditional FER models
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This study addresses the racial biases in facial expression recognition (FER) systems within Large Multimodal Foundation Models (LMFMs). Despite advances in deep learning and the availability of diverse datasets, FER systems often exhibit higher error rates for individuals with darker skin tones. Existing research predominantly focuses on traditional FER models (CNNs, RNNs, ViTs), leaving a gap in understanding racial biases in LMFMs. We benchmark four leading LMFMs: GPT-4o, PaliGemma, Gemini, and CLIP to assess their performance in facial emotion detection across different racial demographics. A linear classifier trained on CLIP embeddings obtains accuracies of 95.9% for RADIATE, 90.3% for Tarr, and 99.5% for Chicago Face. Furthermore, we identify that Anger is misclassified as Disgust 2.1 times more often in Black Females than White Females. This study highlights the need for fairer FER systems and establishes a foundation for developing unbiased, accurate FER technologies. Visit this https URL for further information regarding the biases within facial expression recognition.
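
The linear-probe setup on frozen CLIP embeddings is standard; the sketch below uses the open-source `clip` package, with dataset loading omitted, and is not the authors' exact evaluation script (`train_x`/`train_y` are hypothetical placeholders).

```python
import clip
import torch
from sklearn.linear_model import LogisticRegression

# Linear probe on frozen CLIP image embeddings (dataset loading omitted).
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def embed(images):  # images: preprocessed tensor batch of shape (B, 3, 224, 224)
    with torch.no_grad():
        return model.encode_image(images.to(device)).float().cpu().numpy()

# train_x, train_y would come from a FER dataset such as RADIATE (not shown):
# clf = LogisticRegression(max_iter=1000).fit(embed(train_x), train_y)
```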

[LG-36] CL4KGE: A Curriculum Learning Method for Knowledge Graph Embedding

链接: https://arxiv.org/abs/2408.14840
作者: Yang Liu,Chuan Zhou,Peng Zhang,Yanan Cao,Yongchao Liu,Zhao Li,Hongyang Chen
关键词-EN: Knowledge graph embedding, crafting representations comprehensive, Knowledge graph, KGE models, graph embedding
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 16 pages, 3 figures

点击查看摘要

Abstract:Knowledge graph embedding (KGE) constitutes a foundational task, directed towards learning representations for entities and relations within knowledge graphs (KGs), with the objective of crafting representations comprehensive enough to approximate the logical and symbolic interconnections among entities. In this paper, we define a metric, Z-counts, to measure the difficulty of training each triple (head entity, relation, tail entity) in KGs, with theoretical analysis. Based on this metric, we propose CL4KGE, an efficient Curriculum Learning based training strategy for KGE. This method includes a difficulty measurer and a training scheduler that aids in the training of KGE models. Our approach possesses the flexibility to act as a plugin within a wide range of KGE models, with the added advantage of adaptability to the majority of KGs in existence. The proposed method has been evaluated on popular KGE models, and the results demonstrate that it enhances the state-of-the-art methods. The use of Z-counts as a metric has enabled the identification of challenging triples in KGs, which helps in devising effective training strategies.

[LG-37] Diffusion Models Are Real-Time Game Engines

链接: https://arxiv.org/abs/2408.14837
作者: Dani Valevski,Yaniv Leviathan,Moab Arar,Shlomi Fruchter
关键词-EN: game engine powered, enables real-time interaction, high quality, engine powered, real-time interaction
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Project page: this https URL

点击查看摘要

Abstract:We present GameNGen, the first game engine powered entirely by a neural model that enables real-time interaction with a complex environment over long trajectories at high quality. GameNGen can interactively simulate the classic game DOOM at over 20 frames per second on a single TPU. Next frame prediction achieves a PSNR of 29.4, comparable to lossy JPEG compression. Human raters are only slightly better than random chance at distinguishing short clips of the game from clips of the simulation. GameNGen is trained in two phases: (1) an RL-agent learns to play the game and the training sessions are recorded, and (2) a diffusion model is trained to produce the next frame, conditioned on the sequence of past frames and actions. Conditioning augmentations enable stable auto-regressive generation over long trajectories.

[LG-38] DRL-Based Federated Self-Supervised Learning for Task Offloading and Resource Allocation in ISAC-Enabled Vehicle Edge Computing

链接: https://arxiv.org/abs/2408.14831
作者: Xueying Gu,Qiong Wu,Pingyi Fan,Nan Cheng,Wen Chen,Khaled B. Letaief
关键词-EN: leverage Integrated Sensing, Intelligent Transportation Systems, Intelligent Transportation, Sensing and Communications, Integrated Sensing
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Networking and Internet Architecture (cs.NI)
*备注: This paper has been submitted to Digital Communications and Networks. The source code has been released at: this https URL

点击查看摘要

Abstract:Intelligent Transportation Systems (ITS) leverage Integrated Sensing and Communications (ISAC) to enhance data exchange between vehicles and infrastructure in the Internet of Vehicles (IoV). This integration inevitably increases computing demands, risking real-time system stability. Vehicle Edge Computing (VEC) addresses this by offloading tasks to Road Side Units (RSUs), ensuring timely services. Our previous work, the FLSimCo algorithm, uses only local resources for Federated Self-Supervised Learning (SSL), yet vehicles often cannot complete all iterations of the task. Our improved algorithm offloads part of the task to the RSU and optimizes energy consumption by adjusting transmission power, CPU frequency, and task assignment ratios, balancing local and RSU-based training. Meanwhile, setting an offloading threshold further prevents inefficiencies. Simulation results show that the enhanced algorithm reduces energy consumption and improves both offloading efficiency and the accuracy of Federated SSL.

[LG-39] From Rule-Based Models to Deep Learning Transformers Architectures for Natural Language Processing and Sign Language Translation Systems: Survey Taxonomy and Performance Evaluation

链接: https://arxiv.org/abs/2408.14825
作者: Nada Shahin,Leila Ismail
关键词-EN: Hearing population worldwide, Deaf and Hard, Hard of Hearing, growing Deaf, Hearing population
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:With the growing Deaf and Hard of Hearing population worldwide and the persistent shortage of certified sign language interpreters, there is a pressing need for an efficient, signs-driven, integrated end-to-end translation system, from sign to gloss to text and vice-versa. There has been a wealth of research on machine translation and related reviews. However, there are few works on sign language machine translation that consider the particularity of the language, which is continuous and dynamic. This paper aims to address this void, providing a retrospective analysis of the temporal evolution of sign language machine translation algorithms and a taxonomy of Transformer architectures, the most widely used approach in language translation. We also present the requirements of a real-time Quality-of-Service sign language machine translation system underpinned by accurate deep learning algorithms. We propose future research directions for sign language translation systems.

[LG-40] Data-driven Effective Modeling of Multiscale Stochastic Dynamical Systems

链接: https://arxiv.org/abs/2408.14821
作者: Yuan Chen,Dongbin Xiu
关键词-EN: multiscale stochastic dynamical, unknown multiscale stochastic, stochastic dynamical systems, slow components, unknown multiscale
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
*备注: arXiv admin note: text overlap with arXiv:2406.15747

点击查看摘要

Abstract:We present a numerical method for learning the dynamics of slow components of unknown multiscale stochastic dynamical systems. While the governing equations of the systems are unknown, bursts of observation data of the slow variables are available. By utilizing the observation data, our proposed method is capable of constructing a generative stochastic model that can accurately capture the effective dynamics of the slow variables in distribution. We present a comprehensive set of numerical examples to demonstrate the performance of the proposed method.

[LG-41] A Comprehensive Benchmark of Machine and Deep Learning Across Diverse Tabular Datasets

链接: https://arxiv.org/abs/2408.14817
作者: Assaf Shmuel,Oren Glickman,Teddy Lazebnik
关键词-EN: Gradient Boosting Machines, Deep Learning, Machine Learning, highly prevalent, scientific research
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The analysis of tabular datasets is highly prevalent both in scientific research and real-world applications of Machine Learning (ML). Unlike many other ML tasks, Deep Learning (DL) models often do not outperform traditional methods in this area. Previous comparative benchmarks have shown that DL performance is frequently equivalent or even inferior to models such as Gradient Boosting Machines (GBMs). In this study, we introduce a comprehensive benchmark aimed at better characterizing the types of datasets where DL models excel. Although several important benchmarks for tabular datasets already exist, our contribution lies in the variety and depth of our comparison: we evaluate 111 datasets with 20 different models, including both regression and classification tasks. These datasets vary in scale and include both those with and without categorical variables. Importantly, our benchmark contains a sufficient number of datasets where DL models perform best, allowing for a thorough analysis of the conditions under which DL models excel. Building on the results of this benchmark, we train a model that predicts scenarios where DL models outperform alternative methods with 86.1% accuracy (AUC 0.78). We present insights derived from this characterization and compare these findings to previous benchmarks.
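
A toy version of the head-to-head comparison such a benchmark runs, on one small dataset instead of 111; the model choices and hyperparameters are illustrative only:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
models = {
    "GBM": HistGradientBoostingClassifier(random_state=0),
    "MLP": make_pipeline(StandardScaler(),
                         MLPClassifier(hidden_layer_sizes=(64, 64),
                                       max_iter=500, random_state=0)),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: AUC = {scores.mean():.3f} +/- {scores.std():.3f}")
```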

[LG-42] Poly2Vec: Polymorphic Encoding of Geospatial Objects for Spatial Reasoning with Deep Neural Networks

链接: https://arxiv.org/abs/2408.14806
作者: Maria Despoina Siampou,Jialiang Li,John Krumm,Cyrus Shahabi,Hua Lu
关键词-EN: enabling machine learning, machine learning, crucial for enabling, enabling machine, identifying the topological
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Encoding geospatial data is crucial for enabling machine learning (ML) models to perform tasks that require spatial reasoning, such as identifying the topological relationships between two different geospatial objects. However, existing encoding methods are limited as they are typically customized to handle only specific types of spatial data, which impedes their applicability across different downstream tasks where multiple data types coexist. To address this, we introduce Poly2Vec, an encoding framework that unifies the modeling of different geospatial objects, including 2D points, polylines, and polygons, irrespective of the downstream task. We leverage the power of the 2D Fourier transform to encode useful spatial properties, such as shape and location, from geospatial objects into fixed-length vectors. These vectors are then inputted into neural network models for spatial reasoning tasks. This unified approach eliminates the need to develop and train separate models for each distinct spatial type. We evaluate Poly2Vec on both synthetic and real datasets of mixed geometry types and verify its consistent performance across several downstream spatial reasoning tasks.
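
A rough sketch of the core idea: turn a variable-size 2D geometry into one fixed-length vector by sampling its 2D Fourier transform at a shared set of frequencies. Poly2Vec's actual per-geometry encodings are more elaborate; the uniform point-mass view and random frequency sampling here are assumptions:

```python
import numpy as np

def fourier_encode(points, freqs):
    # points: (n, 2) vertices; freqs: (m, 2) sampled 2D frequencies.
    # Treat the geometry as a uniform point mass and sample its 2D Fourier
    # transform, then stack real/imag parts into a (2m,) vector.
    phase = points @ freqs.T                                 # u*x + v*y per freq
    spectrum = np.exp(-2j * np.pi * phase).mean(axis=0)      # (m,) complex
    return np.concatenate([spectrum.real, spectrum.imag])

rng = np.random.default_rng(0)
freqs = rng.normal(scale=0.5, size=(16, 2))                  # shared frequency set
square = np.array([[0, 0], [0, 1], [1, 1], [1, 0]], dtype=float)
shifted = square + 2.0                                       # same shape, new location

z1, z2 = fourier_encode(square, freqs), fourier_encode(shifted, freqs)
print(z1.shape, np.linalg.norm(z1 - z2))   # fixed length; location changes the code
```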

[LG-43] Learning from Complementary Features

链接: https://arxiv.org/abs/2408.14788
作者: Kosuke Sugiyama,Masato Uchida
关键词-EN: insufficient observation accuracy, high collection costs, observation accuracy, insufficient observation, complementary information indicating
类目: Machine Learning (cs.LG)
*备注: 16 pages, 7 figures

点击查看摘要

Abstract:While precise data observation is essential for the learning processes of predictive models, it can be challenging owing to factors such as insufficient observation accuracy, high collection costs, and privacy constraints. In this paper, we examine cases where some qualitative features are unavailable as precise information indicating “what it is,” but rather as complementary information indicating “what it is not.” We refer to features defined by precise information as ordinary features (OFs) and those defined by complementary information as complementary features (CFs). We then formulate a new learning scenario termed Complementary Feature Learning (CFL), where predictive models are constructed using instances consisting of OFs and CFs. The simplest formalization of CFL applies conventional supervised learning directly using the observed values of CFs. However, this approach does not resolve the ambiguity associated with CFs, making learning challenging and complicating the interpretation of the predictive model’s specific predictions. Therefore, we derive an objective function from an information-theoretic perspective to estimate the OF values corresponding to CFs and to predict output labels based on these estimations. Based on this objective function, we propose a theoretically guaranteed graph-based estimation method, along with its practical approximation, for estimating OF values corresponding to CFs. The results of numerical experiments conducted with real-world data demonstrate that our proposed method effectively estimates OF values corresponding to CFs and predicts output labels.

[LG-44] Unsupervised-to-Online Reinforcement Learning

链接: https://arxiv.org/abs/2408.14785
作者: Junsu Kim,Seohong Park,Sergey Levine
关键词-EN: reinforcement learning, data-driven decision-making, considered a promising, domain-specific supervised offline, requires domain-specific offline
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Offline-to-online reinforcement learning (RL), a framework that trains a policy with offline RL and then further fine-tunes it with online RL, has been considered a promising recipe for data-driven decision-making. While sensible, this framework has drawbacks: it requires domain-specific offline RL pre-training for each task, and is often brittle in practice. In this work, we propose unsupervised-to-online RL (U2O RL), which replaces domain-specific supervised offline RL with unsupervised offline RL, as a better alternative to offline-to-online RL. U2O RL not only enables reusing a single pre-trained model for multiple downstream tasks, but also learns better representations, which often result in even better performance and stability than supervised offline-to-online RL. To instantiate U2O RL in practice, we propose a general recipe for U2O RL to bridge task-agnostic unsupervised offline skill-based policy pre-training and supervised online fine-tuning. Throughout our experiments in nine state-based and pixel-based environments, we empirically demonstrate that U2O RL achieves strong performance that matches or even outperforms previous offline-to-online RL approaches, while being able to reuse a single pre-trained model for a number of different downstream tasks.

[LG-45] GINN-KAN: Interpretability pipelining with applications in Physics Informed Neural Networks

链接: https://arxiv.org/abs/2408.14780
作者: Nisal Ranasinghe,Yu Xia,Sachith Seneviratne,Saman Halgamuge
关键词-EN: powerful function approximators, interpretable neural network, neural network, interpretable neural, Neural
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Neural networks are powerful function approximators, yet their “black-box” nature often renders them opaque and difficult to interpret. While many post-hoc explanation methods exist, they typically fail to capture the underlying reasoning processes of the networks. A truly interpretable neural network would be trained similarly to conventional models using techniques such as backpropagation, but additionally provide insights into the learned input-output relationships. In this work, we introduce the concept of interpretability pipelining, which combines multiple interpretability techniques so that the pipeline outperforms each individual technique. To this end, we first evaluate several architectures that promise such interpretability, with a particular focus on two recent models selected for their potential to incorporate interpretability into standard neural network architectures while still leveraging backpropagation: the Growing Interpretable Neural Network (GINN) and Kolmogorov Arnold Networks (KAN). We analyze the limitations and strengths of each and introduce a novel interpretable neural network GINN-KAN that synthesizes the advantages of both models. When tested on the Feynman symbolic regression benchmark datasets, GINN-KAN outperforms both GINN and KAN. To highlight the capabilities and the generalizability of this approach, we position GINN-KAN as an alternative to conventional black-box networks in Physics-Informed Neural Networks (PINNs). We expect this to have far-reaching implications in the application of deep learning pipelines in the natural sciences. Our experiments with this interpretable PINN on 15 different partial differential equations demonstrate that GINN-KAN augmented PINNs outperform PINNs with black-box networks in solving differential equations and surpass the capabilities of both GINN and KAN.

[LG-46] GPU-Accelerated Counterfactual Regret Minimization

链接: https://arxiv.org/abs/2408.14778
作者: Juho Kim
关键词-EN: Counterfactual regret minimization, no-regret learning dynamics, learning dynamics capable, solving large-scale imperfect, large-scale imperfect information
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Counterfactual regret minimization (CFR) is a family of algorithms of no-regret learning dynamics capable of solving large-scale imperfect information games. There has been a notable lack of work on making CFR more computationally efficient. We propose implementing this algorithm as a series of dense and sparse matrix and vector operations, thereby making it highly parallelizable for a graphical processing unit. Our experiments show that our implementation performs up to about 352.5 times faster than OpenSpiel’s Python implementation and up to about 22.2 times faster than OpenSpiel’s C++ implementation and the speedup becomes more pronounced as the size of the game being solved grows.
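
For context, the per-iteration strategy update at the heart of CFR, regret matching, is exactly the kind of dense array operation that parallelizes well. A minimal NumPy sketch of that update (not the paper's or OpenSpiel's implementation):

```python
import numpy as np

def regret_matching(cum_regrets):
    # Map cumulative counterfactual regrets (n_infosets, n_actions) to a
    # strategy: play positive regrets proportionally, else play uniformly.
    pos = np.maximum(cum_regrets, 0.0)
    norm = pos.sum(axis=1, keepdims=True)
    uniform = np.full_like(cum_regrets, 1.0 / cum_regrets.shape[1])
    return np.where(norm > 0, pos / np.where(norm > 0, norm, 1.0), uniform)

rng = np.random.default_rng(0)
R = rng.normal(size=(5, 3))     # toy cumulative regrets: 5 infosets, 3 actions
print(regret_matching(R))       # each row sums to 1
```

Batched like this, thousands of information sets update in one vectorized call, which is what a GPU implementation exploits.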

[LG-47] Instruct-SkillMix: A Powerful Pipeline for LLM Instruction Tuning

链接: https://arxiv.org/abs/2408.14774
作者: Simran Kaur,Simon Park,Anirudh Goyal,Sanjeev Arora
关键词-EN: powerful LLM, existing powerful LLM, high quality SFT, automated approach, LLM
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:We introduce Instruct-SkillMix, an automated approach for creating diverse, high quality SFT data. The Instruct-SkillMix pipeline involves two stages, each leveraging an existing powerful LLM: (1) Skill extraction: uses the LLM to extract core “skills” for instruction-following, either from existing datasets, or by directly prompting the model; (2) Data generation: uses the powerful LLM to generate (instruction, response) data that exhibit a randomly chosen pair of these skills. Here, the use of random skill combinations promotes diversity and difficulty. Vanilla SFT (i.e., no PPO, DPO, or RL methods) on data generated from Instruct-SkillMix leads to strong gains on instruction following benchmarks such as AlpacaEval 2.0, MT-Bench, and WildBench. With just 4K examples, LLaMA-3-8B-Base achieves 42.76% length-controlled win rate on AlpacaEval 2.0. To our knowledge, this achieves state-of-the-art performance among all models that have only undergone SFT (no RL methods) and competes with proprietary models such as Claude 3 Opus and LLaMA-3.1-405B-Instruct. Ablation studies also suggest plausible reasons for why creating open instruction-tuning datasets via naive crowd-sourcing has proved difficult. Introducing low quality answers (“shirkers”) in 20% of Instruct-SkillMix examples causes performance to plummet, sometimes catastrophically. The Instruct-SkillMix pipeline is flexible and is adaptable to other settings.
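
A schematic sketch of the two-stage pipeline as the abstract describes it; call_llm, the prompt wording, and the seed list are hypothetical placeholders, not the paper's code:

```python
import json
import random

def call_llm(prompt: str) -> str:
    # Placeholder for a call to an existing powerful instruction-following LLM.
    raise NotImplementedError

def extract_skills(seed_instructions):
    # Stage 1: ask the LLM to name the core instruction-following skills.
    prompt = ("List the core skills needed to answer these instructions, "
              "as a JSON array of short names:\n" + "\n".join(seed_instructions))
    return json.loads(call_llm(prompt))

def generate_pair(skills, k=2):
    # Stage 2: a random skill combination drives diversity and difficulty.
    combo = random.sample(skills, k)
    prompt = (f"Write one challenging instruction requiring the skills {combo}, "
              f"then a high-quality response. Return JSON with keys "
              f"'instruction' and 'response'.")
    return json.loads(call_llm(prompt))

# Given a list of seed instructions:
# skills = extract_skills(seed_instructions)
# sft_data = [generate_pair(skills) for _ in range(4000)]
```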

[LG-48] Channel-wise Influence: Estimating Data Influence for Multivariate Time Series

链接: https://arxiv.org/abs/2408.14763
作者: Muyao Wang,Zeke Xie,Bo Chen
关键词-EN: MTS, influence function, influence, MTS analysis tasks, Multivariate Time Series
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The influence function, a technique from robust statistics, measures the impact on model parameters or related functions when training data is removed or modified. This effective and valuable post-hoc method allows for studying the interpretability of machine learning models without requiring costly model retraining. It enables extensions such as increasing model performance, improving model generalization, and offering interpretability. Recently, Multivariate Time Series (MTS) analysis has become an important yet challenging task, attracting significant attention. However, there is no preceding research on the influence functions of MTS to shed light on the effects of modifying the channels of training MTS. Given that each channel in an MTS plays a crucial role in its analysis, it is essential to characterize the influence of different channels. To fill this gap, we propose a channel-wise influence function, which is the first method that can estimate the influence of different channels in MTS, utilizing a first-order gradient approximation that leverages the more informative average gradient of the data set. Additionally, we demonstrate how this influence function can be used to estimate the impact of a channel in MTS. Finally, we validated the accuracy and effectiveness of our influence estimation function in critical MTS analysis tasks, such as MTS anomaly detection and MTS forecasting. According to abundant experiments on real-world datasets, the original influence function performs worse than our method and even fails for the channel pruning problem, which demonstrates the superiority and necessity of the channel-wise influence function in MTS analysis tasks.

[LG-49] Explainable Hierarchical Urban Representation Learning for Commuting Flow Prediction

链接: https://arxiv.org/abs/2408.14762
作者: Mingfei Cai,Yanbo Pang,Yoshihide Sekimoto
关键词-EN: Commuting flow prediction, real world, municipal operations, essential task, commuting origin-destination
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注: 11 pages, 6 figures

点击查看摘要

Abstract:Commuting flow prediction is an essential task for municipal operations in the real world. Previous studies have revealed that it is feasible to estimate the commuting origin-destination (OD) demand within a city using multiple auxiliary data. However, most existing methods are not suitable to deal with a similar task at a large scale, namely within a prefecture or the whole nation, owing to the increased number of geographical units that need to be maintained. In addition, region representation learning is a universal approach for gaining urban knowledge for diverse metropolitan downstream tasks. Although many researchers have developed comprehensive frameworks to describe urban units from multi-source data, they have not clarified the relationship between the selected geographical elements. Furthermore, metropolitan areas naturally preserve ranked structures, like cities and their inclusive districts, which makes elucidating relations between cross-level urban units necessary. Therefore, we develop a heterogeneous graph-based model to generate meaningful region embeddings at multiple spatial resolutions for predicting different types of inter-level OD flows. To demonstrate the effectiveness of the proposed method, extensive experiments were conducted using real-world aggregated mobile phone datasets collected from Shizuoka Prefecture, Japan. The results indicate that our proposed model outperforms existing models in terms of a uniform urban structure. We extend the understanding of predicted results using reasonable explanations to enhance the credibility of the model.

[LG-50] Learning effective pruning at initialization from iterative pruning

链接: https://arxiv.org/abs/2408.14757
作者: Shengkai Liu,Yaofeng Cheng,Fusheng Zha,Wei Guo,Lining Sun,Zhenshan Bing,Chenguang Yang
关键词-EN: reduces training costs, growing network size, costs by removing, removing weights, increasingly crucial
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Pruning at initialization (PaI) reduces training costs by removing weights before training, which becomes increasingly crucial with the growing network size. However, current PaI methods still have a large accuracy gap with iterative pruning, especially at high sparsity levels. This raises an intriguing question: can we get inspiration from iterative pruning to improve the PaI performance? In the lottery ticket hypothesis, the iterative rewind pruning (IRP) finds subnetworks retroactively by rewinding the parameters to the original initialization in every pruning iteration, which means all the subnetworks are based on the initial state. Here, we hypothesise that the surviving subnetworks are more important and bridge the initial features and their survival scores as the PaI criterion. We employ an end-to-end neural network (AutoSparse) to learn this correlation: it takes the model’s initial features as input and outputs their scores, and the lowest-scoring parameters are then pruned before training. To validate the accuracy and generalization of our method, we performed PaI across various models. Results show that our approach outperforms existing methods in high-sparsity settings. Notably, as the underlying logic of model pruning is consistent across different models, only a one-time IRP on one model is needed (e.g., after one IRP run on ResNet-18/CIFAR-10, AutoSparse can be generalized to VGG-16/CIFAR-10, ResNet-18/TinyImageNet, etc.). As the first neural network-based PaI method, we conduct extensive experiments to validate the factors influencing this approach. These results reveal the learning tendencies of neural networks and provide new insights into our understanding and research of PaI from a practical perspective. Our code is available at: this https URL.

[LG-51] Training-Free Time-Series Anomaly Detection: Leveraging Image Foundation Models

链接: https://arxiv.org/abs/2408.14756
作者: Nobuo Namura,Yuma Ichikawa
关键词-EN: Recent advancements, handle the diverse, diverse behaviors, Recent, anomaly detection
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent advancements in time-series anomaly detection have relied on deep learning models to handle the diverse behaviors of time-series data. However, these models often suffer from unstable training and require extensive hyperparameter tuning, leading to practical limitations. Although foundation models present a potential solution, their use in time series is limited. To overcome these issues, we propose an innovative image-based, training-free time-series anomaly detection (ITF-TAD) approach. ITF-TAD converts time-series data into images using wavelet transform and compresses them into a single representation, leveraging image foundation models for anomaly detection. This approach achieves high-performance anomaly detection without unstable neural network training or hyperparameter tuning. Furthermore, ITF-TAD identifies anomalies across different frequencies, providing users with a detailed visualization of anomalies and their corresponding frequencies. Comprehensive experiments on five benchmark datasets, including univariate and multivariate time series, demonstrate that ITF-TAD offers a practical and effective solution with performance exceeding or comparable to that of deep models.
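
A toy sketch of the wavelet-to-image step, with a naive column-distance score standing in for the image foundation model; the wavelet choice, scales, and scoring rule are all assumptions:

```python
import numpy as np
import pywt

rng = np.random.default_rng(0)
t = np.arange(2048)
x = np.sin(2 * np.pi * t / 64)
x[1200:1230] += 3.0 * rng.normal(size=30)   # injected anomaly burst

# Continuous wavelet transform -> a 2D time-frequency "image" of the series.
coef, freqs = pywt.cwt(x, scales=np.arange(1, 65), wavelet="morl")
image = np.abs(coef)                         # (n_scales, n_timesteps)

# Naive stand-in for the foundation-model scoring: distance of each time
# step's column of wavelet magnitudes from the typical column.
score = np.linalg.norm(image - image.mean(axis=1, keepdims=True), axis=0)
print("top anomaly index:", int(score.argmax()))   # typically inside the burst
```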

[LG-52] Benchmarking Reinforcement Learning Methods for Dexterous Robotic Manipulation with a Three-Fingered Gripper

链接: https://arxiv.org/abs/2408.14747
作者: Elizabeth Cutler,Yuning Xing,Tony Cui,Brendan Zhou,Koen van Rijnsoever,Ben Hart,David Valencia,Lee Violet C. Ong,Trevor Gee,Minas Liarokapis,Henry Williams
关键词-EN: Reinforcement Learning, controlled simulation environments, simulation environments, predominantly conducted, conducted in cost-effective
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reinforcement Learning (RL) training is predominantly conducted in cost-effective and controlled simulation environments. However, the transfer of these trained models to real-world tasks often presents unavoidable challenges. This research explores the direct training of RL algorithms in controlled yet realistic real-world settings for the execution of dexterous manipulation. The benchmarking results of three RL algorithms trained on intricate in-hand manipulation tasks within practical real-world contexts are presented. Our study not only demonstrates the practicality of RL training in authentic real-world scenarios, facilitating direct real-world applications, but also provides insights into the associated challenges and considerations. Additionally, our experiences with the employed experimental methods are shared, with the aim of empowering and engaging fellow researchers and practitioners in this dynamic field of robotics.

[LG-53] Learning Differentially Private Diffusion Models via Stochastic Adversarial Distillation ECCV2024

链接: https://arxiv.org/abs/2408.14738
作者: Bochao Liu,Pengju Wang,Shiming Ge
关键词-EN: deep learning relies, privacy-sensitive domains, relies on large, large amounts, generative model learning
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
*备注: accepted by ECCV 2024

点击查看摘要

Abstract:While the success of deep learning relies on large amounts of training data, data is often limited in privacy-sensitive domains. To address this challenge, generative model learning with differential privacy has emerged as a solution to train private generative models for desensitized data generation. However, the quality of the images generated by existing methods is limited due to the complexity of modeling data distribution. We build on the success of diffusion models and introduce DP-SAD, which trains a private diffusion model by a stochastic adversarial distillation method. Specifically, we first train a diffusion model as a teacher and then train a student by distillation, in which we achieve differential privacy by adding noise to the gradients that flow from the other models to the student. For better generation quality, we introduce a discriminator to distinguish whether an image is from the teacher or the student, which forms the adversarial training. Extensive experiments and analysis clearly demonstrate the effectiveness of our proposed method.
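
The privacy mechanism described, adding noise to gradients flowing into the student, is in the spirit of DP-SGD. A simplified sketch with batch-level (rather than per-example) clipping and illustrative constants:

```python
import torch

def privatize_gradients(model, clip_norm=1.0, noise_multiplier=1.1):
    # Clip the global gradient norm, then add calibrated Gaussian noise.
    total = torch.sqrt(sum((p.grad ** 2).sum()
                           for p in model.parameters() if p.grad is not None))
    scale = torch.clamp(clip_norm / (total + 1e-12), max=1.0)
    for p in model.parameters():
        if p.grad is not None:
            p.grad.mul_(scale)
            p.grad.add_(torch.randn_like(p.grad) * noise_multiplier * clip_norm)

model = torch.nn.Linear(8, 2)
loss = model(torch.randn(16, 8)).pow(2).mean()
loss.backward()
privatize_gradients(model)   # gradients are now clipped and noised
```

Proper DP accounting would additionally clip per example and track the privacy budget; this only shows where the noise enters.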

[LG-54] Bandwidth-Aware and Overlap-Weighted Compression for Communication-Efficient Federated Learning

链接: https://arxiv.org/abs/2408.14736
作者: Zichen Tang,Junlin Huang,Rudan Yan,Yuxin Wang,Zhenheng Tang,Shaohuai Shi,Amelie Chi Zhou,Xiaowen Chu
关键词-EN: Federated Averaging, Current data compression, Federated Learning, Current data, sparsification in Federated
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Current data compression methods, such as sparsification in Federated Averaging (FedAvg), effectively enhance the communication efficiency of Federated Learning (FL). However, these methods encounter challenges such as the straggler problem and diminished model performance due to heterogeneous bandwidth and non-IID (Independently and Identically Distributed) data. To address these issues, we introduce a bandwidth-aware compression framework for FL, aimed at improving communication efficiency while mitigating the problems associated with non-IID data. First, our strategy dynamically adjusts compression ratios according to bandwidth, enabling clients to upload their models at a close pace, thus exploiting the otherwise wasted time to transmit more data. Second, we identify the non-overlapped pattern of retained parameters after compression, which results in diminished client update signals due to uniformly averaged weights. Based on this finding, we propose a parameter mask to adjust the client-averaging coefficients at the parameter level, thereby more closely approximating the original updates, and improving the training convergence under heterogeneous environments. Our evaluations reveal that our method significantly boosts model accuracy, with a maximum improvement of 13% over the uncompressed FedAvg. Moreover, it achieves a 3.37× speedup in reaching the target accuracy compared to FedAvg with a Top-K compressor, demonstrating its effectiveness in accelerating convergence with compression. The integration of common compression techniques into our framework further establishes its potential as a versatile foundation for future cross-device, communication-efficient FL research, addressing critical challenges in FL and advancing the field of distributed machine learning.
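
A minimal sketch of the dynamic-ratio idea: choose a top-k sparsification ratio from the client's current bandwidth so that uploads finish at a similar pace. All constants and function names are illustrative, not the paper's protocol:

```python
import torch

def topk_sparsify(tensor, ratio):
    # Keep only the `ratio` fraction of entries with the largest magnitude.
    k = max(1, int(tensor.numel() * ratio))
    flat = tensor.flatten()
    idx = flat.abs().topk(k).indices
    out = torch.zeros_like(flat)
    out[idx] = flat[idx]
    return out.view_as(tensor), idx

def bandwidth_to_ratio(bw_mbps, deadline_s=1.0, bits_per_param=32, n_params=1_000_000):
    # Pick the largest ratio whose upload roughly fits the round deadline.
    budget = bw_mbps * 1e6 * deadline_s / bits_per_param
    return min(1.0, budget / n_params)

update = torch.randn(1_000_000)
for bw in (1.0, 8.0, 64.0):
    ratio = bandwidth_to_ratio(bw)
    _, idx = topk_sparsify(update, ratio)
    print(f"{bw:5.1f} Mbps -> keep ratio {ratio:.3f} ({idx.numel()} values)")
```

The paper's parameter mask would then reweight the averaging per coordinate by which clients actually retained it, rather than dividing by the full client count.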

[LG-55] General-Kindred Physics-Informed Neural Network to the Solutions of Singularly Perturbed Differential Equations

链接: https://arxiv.org/abs/2408.14734
作者: Sen Wang,Peizhi Zhao,Qinglong Ma,Tao Song
关键词-EN: Partial Differential Equations, solving Partial Differential, Partial Differential, Singular Perturbation Differential, singular perturbation problems
类目: Machine Learning (cs.LG); Mathematical Physics (math-ph); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:Physics-Informed Neural Networks (PINNs) have become a promising research direction in the field of solving Partial Differential Equations (PDEs). Dealing with singular perturbation problems continues to be a difficult challenge in the field of PINN. The solution of singular perturbation problems often exhibits sharp boundary layers and steep gradients, and traditional PINN cannot achieve approximation of boundary layers. In this manuscript, we propose the General-Kindred Physics-Informed Neural Network (GKPINN) for solving Singular Perturbation Differential Equations (SPDEs). This approach utilizes asymptotic analysis to acquire prior knowledge of the boundary layer from the equation and establishes a novel network to assist PINN in approximating the boundary layer. It is compared with traditional PINN by solving examples of one-dimensional, two-dimensional, and time-varying SPDE equations. The research findings underscore the exceptional performance of our novel approach, GKPINN, which delivers a remarkable enhancement in reducing the L2 error by two to four orders of magnitude compared to the established PINN methodology. This significant improvement is accompanied by a substantial acceleration in convergence rates, without compromising the high precision that is critical for our applications. Furthermore, GKPINN still performs well in extreme cases with perturbation parameters of 1×10^-38, demonstrating its excellent generalization ability.

[LG-56] ART: Boosting Clean Accuracy Through Tangent Direction Guided Adversarial Training

链接: https://arxiv.org/abs/2408.14728
作者: Bongsoo Yi,Rongjie Lai,Yao Li
关键词-EN: deep neural networks, Guided Adversarial Training, Adversarial training, Adversarial, Direction Guided Adversarial
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Adversarial training has been shown to be successful in enhancing the robustness of deep neural networks against adversarial attacks. However, this robustness is accompanied by a significant decline in accuracy on clean data. In this paper, we propose a novel method, called Tangent Direction Guided Adversarial Training (TART), that leverages the tangent space of the data manifold to ameliorate the existing adversarial defense algorithms. We argue that training with adversarial examples having large normal components significantly alters the decision boundary and hurts accuracy. TART mitigates this issue by estimating the tangent direction of adversarial examples and allocating an adaptive perturbation limit according to the norm of their tangential component. To the best of our knowledge, our paper is the first work to consider the concept of tangent space and direction in the context of adversarial defense. We validate the effectiveness of TART through extensive experiments on both simulated and benchmark datasets. The results demonstrate that TART consistently boosts clean accuracy while retaining a high level of robustness against adversarial attacks. Our findings suggest that incorporating the geometric properties of data can lead to more effective and efficient adversarial training methods.

[LG-57] PAT: Pruning-Aware Tuning for Large Language Models

链接: https://arxiv.org/abs/2408.14721
作者: Yijiang Liu,Huanrui Yang,Youxin Chen,Rongyu Zhang,Miao Wang,Yuan Du,Li Du
关键词-EN: Large language models, Large language, language tasks, language models, Structural pruning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) excel in language tasks, especially with supervised fine-tuning after pre-training. However, their substantial memory and computational requirements hinder practical applications. Structural pruning, which reduces less significant weight dimensions, is one solution. Yet, traditional post-hoc pruning often leads to significant performance loss, with limited recovery from further fine-tuning due to reduced capacity. Since the model fine-tuning refines the general and chaotic knowledge in pre-trained models, we aim to incorporate structural pruning with the fine-tuning, and propose the Pruning-Aware Tuning (PAT) paradigm to eliminate model redundancy while preserving the model performance to the maximum extent. Specifically, we insert the innovative Hybrid Sparsification Modules (HSMs) between the Attention and FFN components to accordingly sparsify the upstream and downstream linear modules. The HSM comprises a lightweight operator and a globally shared trainable mask. The lightweight operator maintains a training overhead comparable to that of LoRA, while the trainable mask unifies the channels to be sparsified, ensuring structural pruning. Additionally, we propose the Identity Loss which decouples the transformation and scaling properties of the HSMs to enhance training robustness. Extensive experiments demonstrate that PAT excels in both performance and efficiency. For example, our Llama2-7b model with a 25% pruning ratio achieves a 1.33× speedup while outperforming the LoRA-finetuned model by up to 1.26% in accuracy with a similar training cost. Code: this https URL

[LG-58] A Synthetic Benchmark to Explore Limitations of Localized Drift Detections KDD2024

链接: https://arxiv.org/abs/2408.14687
作者: Flavio Giobergia,Eliana Pastor,Luca de Alfaro,Elena Baralis
关键词-EN: target variable change, drift, common phenomenon, statistical properties, target variable
类目: Machine Learning (cs.LG)
*备注: Paper accepted at DELTA Workshop @ KDD 2024

点击查看摘要

Abstract:Concept drift is a common phenomenon in data streams where the statistical properties of the target variable change over time. Traditionally, drift is assumed to occur globally, affecting the entire dataset uniformly. However, this assumption does not always hold true in real-world scenarios where only specific subpopulations within the data may experience drift. This paper explores the concept of localized drift and evaluates the performance of several drift detection techniques in identifying such localized changes. We introduce a synthetic dataset based on the Agrawal generator, where drift is induced in a randomly chosen subgroup. Our experiments demonstrate that commonly adopted drift detection methods may fail to detect drift when it is confined to a small subpopulation. We propose and test various drift detection approaches to quantify their effectiveness in this localized drift scenario. We make the source code for the generation of the synthetic benchmark available at this https URL.
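
A toy analogue of that experiment (with a plain Gaussian generator instead of Agrawal): when the concept flips only inside a small subgroup, the global accuracy dips modestly while the subgroup's accuracy collapses, which is what global monitors tend to miss:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)

def batch(n, drifted_subgroup=False):
    X = rng.normal(size=(n, 5))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    sub = X[:, 4] > 1.0                  # subgroup, roughly 16% of the stream
    if drifted_subgroup:
        y[sub] = 1 - y[sub]              # concept flips only inside the subgroup
    return X, y, sub

model = SGDClassifier(loss="log_loss", random_state=0)
X0, y0, _ = batch(2000)
model.partial_fit(X0, y0, classes=[0, 1])

for step, drift in enumerate([False, False, True, True]):
    X, y, sub = batch(2000, drifted_subgroup=drift)
    print(f"t={step} drift={drift}  global acc={model.score(X, y):.3f}  "
          f"subgroup acc={model.score(X[sub], y[sub]):.3f}")
```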

[LG-59] Detecting Interpretable Subgroup Drifts

链接: https://arxiv.org/abs/2408.14682
作者: Flavio Giobergia,Eliana Pastor,Luca de Alfaro,Elena Baralis
关键词-EN: machine learning models, ability to detect, detect and adapt, distributions is crucial, crucial to maintain
类目: Machine Learning (cs.LG)
*备注: Currently under submission

点击查看摘要

Abstract:The ability to detect and adapt to changes in data distributions is crucial to maintain the accuracy and reliability of machine learning models. Detection is generally approached by observing the drift of model performance from a global point of view. However, drifts occurring in (fine-grained) data subgroups may go unnoticed when monitoring global drift. We take a different perspective, and introduce methods for observing drift at the finer granularity of subgroups. Relevant data subgroups are identified during training and monitored efficiently throughout the model’s life. Performance drifts in any subgroup are detected, quantified and characterized so as to provide an interpretable summary of the model behavior over time. Experimental results confirm that our subgroup-level drift analysis identifies drifts that do not show at the (coarser) global dataset level. The proposed approach provides a valuable tool for monitoring model performance in dynamic real-world applications, offering insights into the evolving nature of data and ultimately contributing to more robust and adaptive models.

[LG-60] Enhancing Neural Network Interpretability Through Conductance-Based Information Plane Analysis

链接: https://arxiv.org/abs/2408.14681
作者: Jaouad Dabounou,Amine Baazzouz
关键词-EN: Information Plane, Information Plane analysis, conductance-based Information Plane, Information, traditional methods based
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: 16 pages, 10 figures

点击查看摘要

Abstract:The Information Plane is a conceptual framework used to analyze the flow of information in neural networks, but traditional methods based on activations may not fully capture the dynamics of information processing. This paper introduces a new approach that uses layer conductance, a measure of sensitivity to input features, to enhance the Information Plane analysis. By incorporating gradient-based contributions, we provide a more precise characterization of information dynamics within the network. The proposed conductance-based Information Plane and a new Information Transformation Efficiency (ITE) metric are evaluated on pretrained ResNet50 and VGG16 models using the ImageNet dataset. Our results demonstrate the ability to identify critical hidden layers that contribute significantly to model performance and interpretability, giving insights into information compression, preservation, and utilization across layers. The conductance-based approach offers a granular perspective on feature attribution, enhancing our understanding of the decision-making processes within neural networks. Furthermore, our empirical findings challenge certain theoretical predictions of the Information Bottleneck theory, highlighting the complexities of information dynamics in real-world data scenarios. The proposed method not only advances our understanding of information dynamics in neural networks but also has the potential to significantly impact the broader field of Artificial Intelligence by enabling the development of more interpretable, efficient, and robust models.

[LG-61] On-Chip Learning with Memristor-Based Neural Networks: Assessing Accuracy and Efficiency Under Device Variations Conductance Errors and Input Noise

链接: https://arxiv.org/abs/2408.14680
作者: M. Reza Eslami,Dhiman Biswas,Soheib Takhtardeshir,Sarah S. Sharif,Yaser M. Banad
关键词-EN: paper presents, Utilizing realistic SPICE, device variations, realistic SPICE models, training
类目: Neural and Evolutionary Computing (cs.NE); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper presents a memristor-based compute-in-memory hardware accelerator for on-chip training and inference, focusing on its accuracy and efficiency against device variations, conductance errors, and input noise. Utilizing realistic SPICE models of commercially available silver-based metal self-directed channel (M-SDC) memristors, the study incorporates inherent device non-idealities into the circuit simulations. The hardware, consisting of 30 memristors and 4 neurons, utilizes three different M-SDC structures with tungsten, chromium, and carbon media to perform binary image classification tasks. An on-chip training algorithm precisely tunes memristor conductance to achieve target weights. Results show that incorporating moderate noise (15%) during training enhances robustness to device variations and noisy input data, achieving up to 97% accuracy despite conductance variations and input noises. The network tolerates a 10% conductance error without significant accuracy loss. Notably, omitting the initial memristor reset pulse during training considerably reduces training time and energy consumption. The hardware designed with chromium-based memristors exhibits superior performance, achieving a training time of 2.4 seconds and an energy consumption of 18.9 mJ. This research provides insights for developing robust and energy-efficient memristor-based neural networks for on-chip learning in edge applications.

[LG-62] Bridging the Gap: Unpacking the Hidden Challenges in Knowledge Distillation for Online Ranking Systems

链接: https://arxiv.org/abs/2408.14678
作者: Nikhil Khani,Shuo Yang,Aniruddh Nath,Yang Liu,Pendo Abbo,Li Wei,Shawn Andrews,Maciej Kula,Jarrod Kahn,Zhe Zhao,Lichan Hong,Ed Chi
关键词-EN: Knowledge Distillation, powerful approach, approach for compressing, compressing a large, beneficial for latency-sensitive
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Knowledge Distillation (KD) is a powerful approach for compressing a large model into a smaller, more efficient model, particularly beneficial for latency-sensitive applications like recommender systems. However, current KD research predominantly focuses on Computer Vision (CV) and NLP tasks, overlooking unique data characteristics and challenges inherent to recommender systems. This paper addresses these overlooked challenges, specifically: (1) mitigating data distribution shifts between teacher and student models, (2) efficiently identifying optimal teacher configurations within time and budgetary constraints, and (3) enabling computationally efficient and rapid sharing of teacher labels to support multiple students. We present a robust KD system developed and rigorously evaluated on multiple large-scale personalized video recommendation systems within Google. Our live experiment results demonstrate significant improvements in student model performance while ensuring consistent and reliable generation of high quality teacher labels from a continuous stream of data.
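
For reference, the basic objective such a KD system builds on, blending hard labels with the temperature-softened teacher distribution; the ranking-specific machinery from the paper (label sharing, teacher selection) is not shown:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # KL to the temperature-softened teacher, plus hard-label cross-entropy.
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * T * T
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

s = torch.randn(32, 10, requires_grad=True)   # student logits
t = torch.randn(32, 10)                       # teacher logits (fixed)
y = torch.randint(0, 10, (32,))
print(distillation_loss(s, t, y).item())
```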

[LG-63] Can Optimization Trajectories Explain Multi-Task Transfer?

链接: https://arxiv.org/abs/2408.14677
作者: David Mueller,Mark Dredze,Nicholas Andrews
关键词-EN: deep learning, MTL, widespread adoption, generalization, multi-task
类目: Machine Learning (cs.LG)
*备注: Pre-print

点击查看摘要

Abstract:Despite the widespread adoption of multi-task training in deep learning, little is understood about how multi-task learning (MTL) affects generalization. Prior work has conjectured that the negative effects of MTL are due to optimization challenges that arise during training, and many optimization methods have been proposed to improve multi-task performance. However, recent work has shown that these methods fail to consistently improve multi-task generalization. In this work, we seek to improve our understanding of these failures by empirically studying how MTL impacts the optimization of tasks, and whether this impact can explain the effects of MTL on generalization. We show that MTL results in a generalization gap (a gap in generalization at comparable training loss) between single-task and multi-task trajectories early in training. However, we find that factors of the optimization trajectory previously proposed to explain generalization gaps in single-task settings cannot explain the generalization gaps between single-task and multi-task models. Moreover, we show that the amount of gradient conflict between tasks is correlated with negative effects on task optimization, but is not predictive of generalization. Our work sheds light on the underlying causes for failures in MTL and, importantly, raises questions about the role of general purpose multi-task optimization algorithms.
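
The gradient-conflict quantity studied here is just the cosine similarity between per-task gradients on shared parameters. A minimal sketch with two toy losses on one model:

```python
import torch

model = torch.nn.Linear(16, 1)
x = torch.randn(64, 16)
loss_a = model(x).pow(2).mean()          # toy task A
loss_b = (model(x) - 1.0).abs().mean()   # toy task B

def flat_grad(loss, model):
    grads = torch.autograd.grad(loss, model.parameters(), retain_graph=True)
    return torch.cat([g.flatten() for g in grads])

ga, gb = flat_grad(loss_a, model), flat_grad(loss_b, model)
cos = torch.nn.functional.cosine_similarity(ga, gb, dim=0)
print(f"gradient cosine similarity: {cos.item():+.3f}  (negative = conflict)")
```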

[LG-64] KGPrune: a Web Application to Extract Subgraphs of Interest from Wikidata with Analogical Pruning ECAI2024

链接: https://arxiv.org/abs/2408.14658
作者: Pierre Monnin,Cherif-Hassan Nousradine,Lucas Jarnac,Laurel Zuckerman,Miguel Couceiro
关键词-EN: array of domains, ubiquitous publicly, nowadays covering, Knowledge graphs, knowledge sources
类目: Artificial Intelligence (cs.AI); Databases (cs.DB); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: Accepted as a demo paper at ECAI 2024

点击查看摘要

Abstract:Knowledge graphs (KGs) have become ubiquitous publicly available knowledge sources, and are nowadays covering an ever increasing array of domains. However, not all knowledge represented is useful or pertinent when considering a new application or specific task. Also, due to their increasing size, handling large KGs in their entirety entails scalability issues. These two aspects call for efficient methods to extract subgraphs of interest from existing KGs. To this end, we introduce KGPrune, a Web Application that, given seed entities of interest and properties to traverse, extracts their neighboring subgraphs from Wikidata. To avoid topical drift, KGPrune relies on a frugal pruning algorithm based on analogical reasoning to only keep relevant neighbors while pruning irrelevant ones. The interest of KGPrune is illustrated by two concrete applications, namely, bootstrapping an enterprise KG and extracting knowledge related to looted artworks.

[LG-65] Relationships are Complicated! An Analysis of Relationships Between Datasets on the Web

链接: https://arxiv.org/abs/2408.14636
作者: Kate Lin,Tarfah Alrashed,Natasha Noy
关键词-EN: rapid pace, relationships, datasets, today has millions, continues to grow
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The Web today has millions of datasets, and the number of datasets continues to grow at a rapid pace. These datasets are not standalone entities; rather, they are intricately connected through complex relationships. Semantic relationships between datasets provide critical insights for research and decision-making processes. In this paper, we study dataset relationships from the perspective of users who discover, use, and share datasets on the Web: what relationships are important for different tasks? What contextual information might users want to know? We first present a comprehensive taxonomy of relationships between datasets on the Web and map these relationships to user tasks performed during dataset discovery. We develop a series of methods to identify these relationships and compare their performance on a large corpus of datasets generated from Web pages with this http URL markup. We demonstrate that machine-learning based methods that use dataset metadata achieve multi-class classification accuracy of 90%. Finally, we highlight gaps in available semantic markup for datasets and discuss how incorporating comprehensive semantics can facilitate the identification of dataset relationships. By providing a comprehensive overview of dataset relationships at scale, this paper sets a benchmark for future research.

[LG-66] Hybrid Deep Convolutional Neural Networks Combined with Autoencoders And Augmented Data To Predict The Look-Up Table 2006

链接: https://arxiv.org/abs/2408.14626
作者: Messaoud Djeddou,Aouatef Hellal,Ibrahim A. Hameed,Xingang Zhao,Djehad Al Dallal
关键词-EN: convolutional neural network, critical heat flux, predict critical heat, deep convolutional neural, data augmentation techniques
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 11 pages, 6 figures

点击查看摘要

Abstract:This study explores the development of a hybrid deep convolutional neural network (DCNN) model enhanced by autoencoders and data augmentation techniques to predict critical heat flux (CHF) with high accuracy. By augmenting the original input features using three different autoencoder configurations, the model’s predictive capabilities were significantly improved. The hybrid models were trained and tested on a dataset of 7225 samples, with performance metrics including the coefficient of determination (R2), Nash-Sutcliffe efficiency (NSE), mean absolute error (MAE), and normalized root-mean-squared error (NRMSE) used for evaluation. Among the tested models, the DCNN_3F-A2 configuration demonstrated the highest accuracy, achieving an R2 of 0.9908 during training and 0.9826 during testing, outperforming the base model and other augmented versions. These results suggest that the proposed hybrid approach, combining deep learning with feature augmentation, offers a robust solution for CHF prediction, with the potential to generalize across a wider range of conditions.

[LG-67] Meta Flow Matching: Integrating Vector Fields on the Wasserstein Manifold

链接: https://arxiv.org/abs/2408.14608
作者: Lazar Atanackovic,Xi Zhang,Brandon Amos,Mathieu Blanchette,Leo J. Lee,Yoshua Bengio,Alexander Tong,Kirill Neklyudov
关键词-EN: interacting entities evolving, entities evolving continuously, Numerous biological, physical particles, interacting entities
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Numerous biological and physical processes can be modeled as systems of interacting entities evolving continuously over time, e.g. the dynamics of communicating cells or physical particles. Learning the dynamics of such systems is essential for predicting the temporal evolution of populations across novel samples and unseen environments. Flow-based models allow for learning these dynamics at the population level - they model the evolution of the entire distribution of samples. However, current flow-based models are limited to a single initial population and a set of predefined conditions which describe different dynamics. We argue that multiple processes in natural sciences have to be represented as vector fields on the Wasserstein manifold of probability densities. That is, the change of the population at any moment in time depends on the population itself due to the interactions between samples. In particular, this is crucial for personalized medicine where the development of diseases and their respective treatment response depends on the microenvironment of cells specific to each patient. We propose Meta Flow Matching (MFM), a practical approach to integrating along these vector fields on the Wasserstein manifold by amortizing the flow model over the initial populations. Namely, we embed the population of samples using a Graph Neural Network (GNN) and use these embeddings to train a Flow Matching model. This gives MFM the ability to generalize over the initial distributions unlike previously proposed methods. We demonstrate the ability of MFM to improve prediction of individual treatment responses on a large scale multi-patient single-cell drug screen dataset.
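
MFM's contribution is to condition the flow model on a GNN embedding of the whole initial population; the objective underneath is flow matching, sketched below in its basic unconditional form with toy source and target populations:

```python
import torch
import torch.nn as nn

# Regress a velocity field v(x_t, t) onto the straight-line velocity (x1 - x0)
# between paired source and target samples.
v = nn.Sequential(nn.Linear(3, 64), nn.SiLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(v.parameters(), lr=1e-3)

for step in range(200):
    x0 = torch.randn(256, 2)                 # source population
    x1 = torch.randn(256, 2) * 0.3 + 2.0     # target population
    t = torch.rand(256, 1)
    xt = (1 - t) * x0 + t * x1               # point on the interpolation path
    pred = v(torch.cat([xt, t], dim=1))
    loss = (pred - (x1 - x0)).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

print(f"final flow-matching loss: {loss.item():.3f}")
```

In MFM proper, v would take an extra input, a population embedding produced by a GNN over the initial samples, which is what lets it generalize across initial distributions.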

[LG-68] Biased Dueling Bandits with Stochastic Delayed Feedback

链接: https://arxiv.org/abs/2408.14603
作者: Bongsoo Yi,Yue Kang,Yao Li
关键词-EN: prominent recently due, traditional multi-armed bandit, dueling bandit problem, significantly prominent recently, multi-armed bandit problem
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The dueling bandit problem, an essential variation of the traditional multi-armed bandit problem, has become particularly prominent recently due to its broad applications in online advertising, recommendation systems, information retrieval, and more. However, in many real-world applications, the feedback for actions is often subject to unavoidable delays and is not immediately available to the agent. This partially observable issue poses a significant challenge to existing dueling bandit literature, as it significantly affects how quickly and accurately the agent can update their policy on the fly. In this paper, we introduce and examine the biased dueling bandit problem with stochastic delayed feedback, showing that this new practical problem leads to a more realistic and intriguing scenario involving a preference bias between the selections. We present two algorithms designed to handle situations involving delay. Our first algorithm, requiring complete delay distribution information, achieves the optimal regret bound for the dueling bandit problem when there is no delay. The second algorithm is tailored for situations where the distribution is unknown, but only the expected value of delay is available. We provide a comprehensive regret analysis for the two proposed algorithms and then evaluate their empirical performance on both synthetic and real datasets.

[LG-69] Efficient fine-tuning of 37-level GraphCast with the Canadian global deterministic analysis

链接: https://arxiv.org/abs/2408.14587
作者: Christopher Subich
关键词-EN: Climate Change Canada, Deterministic Prediction System, Global Deterministic Prediction, Change Canada, Prediction System
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注:

点击查看摘要

Abstract:This work describes a process for efficiently fine-tuning the GraphCast data-driven forecast model to simulate another analysis system, here the Global Deterministic Prediction System (GDPS) of Environment and Climate Change Canada (ECCC). Using two years of training data (July 2019 – December 2021) and 37 GPU-days of computation to tune the 37-level, quarter-degree version of GraphCast, the resulting model significantly outperforms both the unmodified GraphCast and the operational forecast, showing substantial forecast skill in the troposphere over lead times from 1 to 10 days. This fine-tuning is accomplished by abbreviating DeepMind’s original training curriculum for GraphCast, relying on a shorter single-step forecast stage to accomplish the bulk of the adaptation work and consolidating the autoregressive stages into separate 12hr, 1d, 2d, and 3d stages with larger learning rates. Additionally, training over 3d forecasts is split into two sub-steps to conserve host memory while maintaining a strong correlation with training over the full period.

[LG-70] CURLoRA: Stable LLM Continual Fine-Tuning and Catastrophic Forgetting Mitigation

链接: https://arxiv.org/abs/2408.14572
作者: Muhammad Fawi
关键词-EN: Low-Rank Adaptation, leverages CUR matrix, fine-tuning large language, paper introduces CURLoRA, large language models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: Code available at this https URL

点击查看摘要

Abstract:This paper introduces CURLoRA, a novel approach to fine-tuning large language models (LLMs) that leverages CUR matrix decomposition in the context of Low-Rank Adaptation (LoRA). Our method addresses two critical challenges in LLM fine-tuning: mitigating catastrophic forgetting during continual learning and reducing the number of trainable parameters. We propose a unique modification to the CUR decomposition process, utilizing inverted probabilities for column and row selection, which acts as an implicit regularization, and initializing the U matrix as a zero matrix and fine-tuning only that matrix. We demonstrate through experiments on multiple datasets that CURLoRA outperforms standard LoRA in mitigating catastrophic forgetting. It maintains model stability and performance across tasks while significantly reducing the number of trainable parameters. Our results show that CURLoRA achieves strong, stable task accuracy while keeping the base model’s perplexity scores fixed upon continual fine-tuning, in contrast to LoRA, particularly in scenarios with limited data.
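
A minimal sketch of the mechanics described in the abstract, assuming PyTorch; the sampling details and adapter placement are simplified, and `CURLoRALinear` is a hypothetical name rather than the authors' implementation.

```python
# Sketch: C and R are sampled from the frozen weight with *inverted*
# column/row probabilities; only the zero-initialized U matrix is trained.
import torch
import torch.nn as nn

class CURLoRALinear(nn.Module):
    def __init__(self, weight, rank):
        super().__init__()
        W = weight.detach().clone()                       # frozen (out, in)
        col_p = W.pow(2).sum(0); col_p = col_p / col_p.sum()
        row_p = W.pow(2).sum(1); row_p = row_p / row_p.sum()
        inv_col = 1.0 / (col_p + 1e-8); inv_col = inv_col / inv_col.sum()
        inv_row = 1.0 / (row_p + 1e-8); inv_row = inv_row / inv_row.sum()
        cols = torch.multinomial(inv_col, rank, replacement=False)
        rows = torch.multinomial(inv_row, rank, replacement=False)
        self.register_buffer("W", W)
        self.register_buffer("C", W[:, cols].clone())     # (out, rank), frozen
        self.register_buffer("R", W[rows, :].clone())     # (rank, in), frozen
        self.U = nn.Parameter(torch.zeros(rank, rank))    # the only trainable part

    def forward(self, x):                                 # x: (batch, in)
        return x @ (self.W + self.C @ self.U @ self.R).T
```

Because U starts at zero, the adapted layer is exactly the pretrained layer at initialization, which is the same "no-op start" property LoRA relies on.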

[LG-71] Exploring the Potential of Synthetic Data to Replace Real Data ICIP2024

链接: https://arxiv.org/abs/2408.14559
作者: Hyungtae Lee,Yan Zhang,Heesung Kwon,Shuvra S. Bhattacharrya
关键词-EN: synthetic data, replace real data, real data creates, synthetic, data
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: ICIP 2024

点击查看摘要

Abstract:The potential of synthetic data to replace real data creates a huge demand for synthetic data in data-hungry AI. This potential is even greater when synthetic data is used for training along with a small number of real images from domains other than the test domain. We find that this potential varies depending on (i) the number of cross-domain real images and (ii) the test set on which the trained model is evaluated. We introduce two new metrics, the train2test distance and $\mathrm{AP}_{\mathrm{t2t}}$, to evaluate the ability of a cross-domain training set using synthetic data to represent the characteristics of test instances in relation to training performance. Using these metrics, we delve deeper into the factors that influence the potential of synthetic data and uncover some interesting dynamics about how synthetic data impacts training performance. We hope these discoveries will encourage more widespread use of synthetic data.

[LG-72] Aiding Humans in Financial Fraud Decision Making: Toward an XAI-Visualization Framework IEEE-VIS’24

链接: https://arxiv.org/abs/2408.14552
作者: Angelos Chatzimparmpas,Evanthia Dimara
关键词-EN: financial fraud detection, decision making, financial fraud, Current Visual Analytics, fraud detection
类目: Machine Learning (cs.LG); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
*备注: Accepted poster at IEEE VIS '24, Florida, USA, 13-18 October, 2024

点击查看摘要

Abstract:AI prevails in financial fraud detection and decision making. Yet, due to concerns about biased automated decision making or profiling, regulations mandate that final decisions are made by humans. Financial fraud investigators face the challenge of manually synthesizing vast amounts of unstructured information, including AI alerts, transaction histories, social media insights, and governmental laws. Current Visual Analytics (VA) systems primarily support isolated aspects of this process, such as explaining binary AI alerts and visualizing transaction patterns, thus adding yet another layer of information to the overall complexity. In this work, we propose a framework where the VA system supports decision makers throughout all stages of financial fraud investigation, including data collection, information synthesis, and human criteria iteration. We illustrate how VA can claim a central role in AI-aided decision making, ensuring that human judgment remains in control while minimizing potential biases and labor-intensive tasks.

[LG-73] Adaptive Resolution Inference (ARI): Energy-Efficient Machine Learning for Internet of Things

链接: https://arxiv.org/abs/2408.14528
作者: Ziheng Wang,Pedro Reviriego,Farzad Niknia,Javier Conde,Shanshan Liu,Fabrizio Lombardi
关键词-EN: Things devices poses, Internet of Things, devices poses significant, poses significant operational, significant operational challenges
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:The implementation of machine learning in Internet of Things devices poses significant operational challenges due to limited energy and computation resources. In recent years, significant efforts have been made to implement simplified ML models that can achieve reasonable performance while reducing computation and energy, for example by pruning weights in neural networks, or using reduced precision for the parameters and arithmetic operations. However, this type of approach is limited by the performance of the ML implementation, i.e., by the loss in accuracy, for example, due to the model simplification. In this article, we present adaptive resolution inference (ARI), a novel approach that enables the evaluation of new tradeoffs between energy dissipation and model performance in ML implementations. The main principle of the proposed approach is to run inferences with reduced precision (quantization) and use the margin over the decision threshold to determine whether the result is reliable or the inference must be rerun with the full model. The rationale is that quantization only introduces small deviations in the inference scores, such that if the scores have a sufficient margin over the decision threshold, it is unlikely that the full model would produce a different result. Therefore, we can run the quantized model first, and only when the scores do not have a sufficient margin is the full model run. This enables most inferences to run with the reduced precision model, with only a small fraction requiring the full model, thereby significantly reducing computation and energy while not affecting model performance. The proposed ARI approach is presented, analyzed in detail, and evaluated using different data sets for floating-point and stochastic computing implementations. The results show that ARI can significantly reduce the energy for inference in different configurations, with savings between 40% and 85%.
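
The decision rule is simple enough to sketch directly; `quant_model` and `full_model` are hypothetical callables returning a score in [0, 1], and the margin value is an assumption, not a figure from the paper.

```python
def adaptive_resolution_infer(x, quant_model, full_model, margin, threshold=0.5):
    """Run the cheap quantized model first; fall back to the full-precision
    model only when the score is too close to the decision threshold."""
    score = quant_model(x)
    if abs(score - threshold) >= margin:
        return score              # confident: the cheap inference suffices
    return full_model(x)          # ambiguous: rerun at full precision
```

The energy savings then follow directly from how rarely the fallback branch is taken, which is the quantity the paper's evaluation measures across configurations.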

[LG-74] Estimating Uncertainty with Implicit Quantile Network

链接: https://arxiv.org/abs/2408.14525
作者: Yi Hung Lim
关键词-EN: performance critical applications, Implicit Quantile Network, performance critical, bayesian neural networks, Uncertainty quantification
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注: This method is simple to implement and offers important information for performance critical applications

点击查看摘要

Abstract:Uncertainty quantification is an important part of many performance-critical applications. This paper provides a simple alternative to existing approaches such as ensemble learning and Bayesian neural networks. By directly modeling the loss distribution with an Implicit Quantile Network, we get an estimate of how uncertain the model is about its predictions. For experiments with the MNIST and CIFAR datasets, the mean of the estimated loss distribution is 2x higher for incorrect predictions. When data with high estimated uncertainty is removed from the test dataset, the accuracy of the model goes up by as much as 10%. This method is simple to implement while offering important information to applications where the user has to know when the model could be wrong (e.g. deep learning for healthcare).
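
A minimal sketch of the idea, assuming PyTorch: a small network predicts quantiles of the base model's per-sample loss, is trained with the pinball loss, and the mean of the predicted loss distribution serves as the uncertainty score. Names and architecture are illustrative.

```python
import torch
import torch.nn as nn

class LossIQN(nn.Module):
    """Predicts quantiles of the loss distribution given an input
    representation and a quantile level tau."""
    def __init__(self, feat_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim + 1, 64), nn.ReLU(),
                                 nn.Linear(64, 1))
    def forward(self, feats, tau):               # feats: (b, d), tau: (b, 1)
        return self.net(torch.cat([feats, tau], dim=-1))

def pinball_loss(pred, target, tau):
    """Quantile-regression loss against the realized per-sample loss."""
    diff = target - pred
    return torch.max(tau * diff, (tau - 1) * diff).mean()

def uncertainty(model, feats, n_taus=32):
    """Mean of the estimated loss distribution; higher means more uncertain."""
    taus = torch.rand(n_taus, 1)
    preds = torch.stack([model(feats, t.expand(feats.size(0), 1)) for t in taus])
    return preds.mean(dim=0)                     # (b, 1)
```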

[LG-75] Retrieval Augmented Generation for Dynamic Graph Modeling

链接: https://arxiv.org/abs/2408.14523
作者: Yuxia Wu,Yuan Fang,Lizi Liao
关键词-EN: Dynamic graph modeling, Dynamic graph, graph modeling, analyzing evolving patterns, graph
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Under review

点击查看摘要

Abstract:Dynamic graph modeling is crucial for analyzing evolving patterns in various applications. Existing approaches often integrate graph neural networks with temporal modules or redefine dynamic graph modeling as a generative sequence task. However, these methods typically rely on isolated historical contexts of the target nodes from a narrow perspective, neglecting occurrences of similar patterns or relevant cases associated with other nodes. In this work, we introduce the Retrieval-Augmented Generation for Dynamic Graph Modeling (RAG4DyG) framework, which leverages guidance from contextually and temporally analogous examples to broaden the perspective of each node. This approach presents two critical challenges: (1) How to identify and retrieve high-quality demonstrations that are contextually and temporally analogous to dynamic graph samples? (2) How can these demonstrations be effectively integrated to improve dynamic graph modeling? To address these challenges, we propose RAG4DyG, which enriches the understanding of historical contexts by retrieving and learning from contextually and temporally pertinent demonstrations. Specifically, we employ a time- and context-aware contrastive learning module to identify and retrieve relevant cases for each query sequence. Moreover, we design a graph fusion strategy to integrate the retrieved cases, thereby augmenting the inherent historical contexts for improved prediction. Extensive experiments on real-world datasets across different domains demonstrate the effectiveness of RAG4DyG for dynamic graph modeling.
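
Of the two stages, the retrieval step is concrete enough to sketch. The snippet below assumes the paper's contrastively trained encoder has already produced embeddings for the query sequence and the demonstration pool; the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def retrieve_demonstrations(query_emb, pool_embs, k=5):
    """Top-k retrieval of contextually/temporally analogous demonstrations by
    cosine similarity in the learned embedding space (a simplifying stand-in
    for the paper's time- and context-aware retrieval module)."""
    sims = F.cosine_similarity(query_emb.unsqueeze(0), pool_embs, dim=-1)
    return sims.topk(k).indices                  # indices into the pool
```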

[LG-76] Towards Graph Prompt Learning: A Survey and Beyond

链接: https://arxiv.org/abs/2408.14520
作者: Qingqing Long,Yuchen Yan,Peiyan Zhang,Chen Fang,Wentao Cui,Zhiyuan Ning,Meng Xiao,Ning Cao,Xiao Luo,Lingjun Xu,Shiyue Jiang,Zheng Fang,Chong Chen,Xian-Sheng Hua,Yuanchun Zhou
关键词-EN: demonstrated remarkable adaptability, enabling broad applications, image recognition, remarkable adaptability, enabling broad
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
*备注: 19 pages, 2 figures

点击查看摘要

Abstract:Large-scale “pre-train and prompt learning” paradigms have demonstrated remarkable adaptability, enabling broad applications across diverse domains such as question answering, image recognition, and multimodal retrieval. This approach fully leverages the potential of large-scale pre-trained models, reducing downstream data requirements and computational costs while enhancing model applicability across various tasks. Graphs, as versatile data structures that capture relationships between entities, play pivotal roles in fields such as social network analysis, recommender systems, and biological graphs. Despite the success of pre-train and prompt learning paradigms in Natural Language Processing (NLP) and Computer Vision (CV), their application in graph domains remains nascent. In graph-structured data, not only do the node and edge features often have disparate distributions, but the topological structures also differ significantly. This diversity in graph data can lead to incompatible patterns or gaps between pre-training and fine-tuning on downstream graphs. We aim to bridge this gap by summarizing methods for alleviating these disparities. This includes exploring prompt design methodologies, comparing related techniques, assessing application scenarios and datasets, and identifying unresolved problems and challenges. This survey categorizes over 100 relevant works in this field, summarizing general design principles and the latest applications, including text-attributed graphs, molecules, proteins, and recommendation systems. Through this extensive review, we provide a foundational understanding of graph prompt learning, aiming to impact not only the graph mining community but also the broader Artificial General Intelligence (AGI) community.

[LG-77] A Multilateral Attention-enhanced Deep Neural Network for Disease Outbreak Forecasting: A Case Study on COVID-19

链接: https://arxiv.org/abs/2408.14519
作者: Ashutosh Anshul,Jhalak Gupta,Mohammad Zia Ur Rehman,Nagendra Kumar
关键词-EN: accurate forecasting models, necessitating the development, Multilateral Attention-enhanced GRU, development of accurate, Attention-enhanced GRU model
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The worldwide impact of the recent COVID-19 pandemic has been substantial, necessitating the development of accurate forecasting models to predict the spread and course of a pandemic. Previous methods for outbreak forecasting have faced limitations by not utilizing multiple sources of input and yielding suboptimal performance due to the limited availability of data. In this study, we propose a novel approach to address the challenges of infectious disease forecasting. We introduce a Multilateral Attention-enhanced GRU model that leverages information from multiple sources, thus enabling a comprehensive analysis of factors influencing the spread of a pandemic. By incorporating attention mechanisms within a GRU framework, our model can effectively capture complex relationships and temporal dependencies in the data, leading to improved forecasting performance. Further, we have curated a well-structured multi-source dataset for the recent COVID-19 pandemic that the research community can utilize as a great resource to conduct experiments and analysis on time-series forecasting. We evaluated the proposed model on our COVID-19 dataset and reported the output in terms of RMSE and MAE. The experimental results provide evidence that our proposed model surpasses existing techniques in terms of performance. We also performed performance gain and qualitative analysis on our dataset to evaluate the impact of the attention mechanism and show that the proposed model closely follows the trajectory of the pandemic.
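
A minimal sketch of the general layout implied by the abstract, assuming PyTorch: one GRU per input source with attention pooling over the per-source summaries. The paper's exact architecture may differ, and all names are illustrative.

```python
import torch
import torch.nn as nn

class MultiSourceAttnGRU(nn.Module):
    """Per-source GRU encoders whose summaries are fused by attention before
    a forecasting head."""
    def __init__(self, n_sources, in_dim, hid=32):
        super().__init__()
        self.grus = nn.ModuleList(nn.GRU(in_dim, hid, batch_first=True)
                                  for _ in range(n_sources))
        self.attn = nn.Linear(hid, 1)
        self.head = nn.Linear(hid, 1)

    def forward(self, sources):                  # list of (b, seq, in_dim)
        h = torch.stack([g(s)[0][:, -1] for g, s in zip(self.grus, sources)],
                        dim=1)                   # (b, n_sources, hid)
        w = torch.softmax(self.attn(h), dim=1)   # attention weights per source
        return self.head((w * h).sum(dim=1))     # (b, 1) next-step forecast
```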

[LG-78] A Survey on Reinforcement Learning Applications in SLAM

链接: https://arxiv.org/abs/2408.14518
作者: Mohammad Dehghani Tezerjani,Mohammad Khoshnazar,Mohammadhamed Tangestanizadeh,Qing Yang
关键词-EN: enriched user experiences, complex navigation challenges, reinforcement learning, mobile robotics, automotive industry
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The emergence of mobile robotics, particularly in the automotive industry, introduces a promising era of enriched user experiences and adept handling of complex navigation challenges. The realization of these advancements necessitates a focused technological effort and the successful execution of numerous intricate tasks, particularly in the critical domain of Simultaneous Localization and Mapping (SLAM). Various artificial intelligence (AI) methodologies, such as deep learning and reinforcement learning, present viable solutions to address the challenges in SLAM. This study specifically explores the application of reinforcement learning in the context of SLAM. By enabling the agent (the robot) to iteratively interact with and receive feedback from its environment, reinforcement learning facilitates the acquisition of navigation and mapping skills, thereby enhancing the robot’s decision-making capabilities. This approach offers several advantages, including improved navigation proficiency, increased resilience, reduced dependence on sensor precision, and refinement of the decision-making process. The findings of this study, which provide an overview of reinforcement learning’s utilization in SLAM, reveal significant advancements in the field. The investigation also highlights the evolution and innovative integration of these techniques.

[LG-79] A Joint Learning Model with Variational Interaction for Multilingual Program Translation

链接: https://arxiv.org/abs/2408.14515
作者: Yali Du,Hui Sun,Ming Li
关键词-EN: program translation, Multilingual Program Translation, program, translation, Multilingual Program
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted by the 39th IEEE/ACM International Conference on Automated Software Engineering (ASE 2024)

点击查看摘要

Abstract:Programs implemented in various programming languages form the foundation of software applications. To alleviate the burden of program migration and facilitate the development of software systems, automated program translation across languages has garnered significant attention. Previous approaches primarily focus on pairwise translation paradigms, learning translation between pairs of languages using bilingual parallel data. However, parallel data is difficult to collect for some language pairs, and the distribution of program semantics across languages can shift, posing challenges for pairwise program translation. In this paper, we argue that jointly learning a unified model to translate code across multiple programming languages is superior to separately learning from bilingual parallel data. We propose Variational Interaction for Multilingual Program Translation (VIM-PT), a disentanglement-based generative approach that jointly trains a unified model for multilingual program translation across multiple languages. VIM-PT disentangles code into language-shared and language-specific features, using variational inference and interaction information with a novel lower bound, then achieves program translation through conditional generation. VIM-PT demonstrates four advantages: 1) captures language-shared information more accurately from various implementations and improves the quality of multilingual program translation, 2) mines and leverages the capability of non-parallel data, 3) addresses the distribution shift of program semantics across languages, 4) and serves as a unified model, reducing deployment complexity.

[LG-80] Improving Nonlinear Projection Heads using Pretrained Autoencoder Embeddings

链接: https://arxiv.org/abs/2408.14514
作者: Andreas Schliebitz,Heiko Tapken,Martin Atzmueller
关键词-EN: empirical study aims, MLP projection head, MLP projection, empirical study, study aims
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: 15 pages, 1 figure

点击查看摘要

Abstract:This empirical study aims at improving the effectiveness of the standard 2-layer MLP projection head $g(\cdot)$ featured in the SimCLR framework through the use of pretrained autoencoder embeddings. Given a contrastive learning task with a largely unlabeled image classification dataset, we first train a shallow autoencoder architecture and extract its compressed representations contained in the encoder’s embedding layer. After freezing the weights within this pretrained layer, we use it as a drop-in replacement for the input layer of SimCLR’s default projector. Additionally, we apply further architectural changes to the projector by decreasing its width and changing its activation function. The different projection heads are then used to contrastively train and evaluate a feature extractor $f(\cdot)$ following the SimCLR protocol, while also examining the performance impact of Z-score normalized datasets. Our experiments indicate that using a pretrained autoencoder embedding in the projector can not only increase classification accuracy by up to 2.9%, or 1.7% on average, but can also significantly decrease the dimensionality of the projection space. Our results also suggest that using the sigmoid and tanh activation functions within the projector can outperform ReLU in terms of peak and average classification accuracy. When our presented projectors are applied, omitting Z-score normalization of the datasets often increases peak performance. In contrast, the default projection head can benefit more from normalization. All experiments involving our pretrained projectors are conducted with frozen embeddings, since our test results indicate an advantage compared to using their non-frozen counterparts.
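
The construction is simple enough to sketch, assuming PyTorch; `pretrained_encoder_layer` is a hypothetical handle to the autoencoder's embedding layer, and the sigmoid activation reflects one of the variants the abstract mentions.

```python
import torch.nn as nn

def make_projector(pretrained_encoder_layer, emb_dim, out_dim):
    """Sketch: replace the input layer of SimCLR's 2-layer MLP projector with
    a frozen, pretrained autoencoder embedding layer."""
    for p in pretrained_encoder_layer.parameters():
        p.requires_grad = False                  # keep embeddings frozen
    return nn.Sequential(
        pretrained_encoder_layer,                # drop-in replacement layer
        nn.Sigmoid(),                            # one of the explored variants
        nn.Linear(emb_dim, out_dim),
    )
```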

[LG-81] Variational autoencoder-based neural network model compression

链接: https://arxiv.org/abs/2408.14513
作者: Liang Cheng,Peiyuan Guan,Amir Taherkordi,Lei Liu,Dapeng Lan
关键词-EN: Variational Autoencoders, shown great great, great great peformance, neural network, Convolutional Neural Network
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Variational Autoencoders (VAEs), as a form of deep generative model, have been widely used in recent years and have shown strong performance in a number of different domains, including image generation and anomaly detection. This paper aims to explore a neural network model compression method based on VAE. The experiment uses different neural network models for MNIST recognition as compression targets, including the Feedforward Neural Network (FNN), Convolutional Neural Network (CNN), Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM). These models are the most basic models in deep learning; other more complex and advanced models are based on them or inherit their features and evolve from them. In the experiment, the first step is to train the models mentioned above; each trained model has a different accuracy and total number of parameters. The parameters of each model are then processed as training data in a separate VAE, and each trained VAE is tested against the true model parameters. The experimental results show that using the latent space as a representation of the model compression can improve the compression rate compared to some traditional methods such as pruning and quantization, while the accuracy is not greatly affected when using the model parameters reconstructed from the latent space. In the future, a variety of large-scale deep learning models will be used more widely, so exploring different ways to save time and space when saving or transferring models will become necessary, and the use of VAE in this paper can provide a basis for these further explorations.

[LG-82] LLMs as Zero-shot Graph Learners: Alignment of GNN Representations with LLM Token Embeddings

链接: https://arxiv.org/abs/2408.14512
作者: Duo Wang,Yuan Zuo,Fengzhi Li,Junjie Wu
关键词-EN: scarce labeled data, garnered significant interest, significant interest due, graph neural networks, neural networks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Zero-shot graph machine learning, especially with graph neural networks (GNNs), has garnered significant interest due to the challenge of scarce labeled data. While methods like self-supervised learning and graph prompt learning have been extensively explored, they often rely on fine-tuning with task-specific labels, limiting their effectiveness in zero-shot scenarios. Inspired by the zero-shot capabilities of instruction-fine-tuned large language models (LLMs), we introduce a novel framework named Token Embedding-Aligned Graph Language Model (TEA-GLM) that leverages LLMs as cross-dataset and cross-task zero-shot learners for graph machine learning. Concretely, we pretrain a GNN, aligning its representations with token embeddings of an LLM. We then train a linear projector that transforms the GNN’s representations into a fixed number of graph token embeddings without tuning the LLM. A unified instruction is designed for various graph tasks at different levels, such as node classification (node-level) and link prediction (edge-level). These design choices collectively enhance our method’s effectiveness in zero-shot learning, setting it apart from existing methods. Experiments show that our graph token embeddings help the LLM predictor achieve state-of-the-art performance on unseen datasets and tasks compared to other methods using LLMs as predictors.
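
The trainable component described here, a linear projector from GNN representations to a fixed number of graph tokens, is small enough to sketch directly, assuming PyTorch; the names and the token count are illustrative, not from the paper.

```python
import torch
import torch.nn as nn

class GraphTokenProjector(nn.Module):
    """Maps a GNN representation to a fixed number of 'graph token'
    embeddings in the LLM's embedding space; in the TEA-GLM setup both the
    GNN and the LLM stay frozen, and only this projector is trained."""
    def __init__(self, gnn_dim, llm_dim, n_tokens=8):
        super().__init__()
        self.n_tokens = n_tokens
        self.proj = nn.Linear(gnn_dim, n_tokens * llm_dim)

    def forward(self, gnn_repr):                 # (b, gnn_dim)
        out = self.proj(gnn_repr)                # (b, n_tokens * llm_dim)
        return out.view(gnn_repr.size(0), self.n_tokens, -1)
```

The resulting token embeddings can be prepended to the LLM's input embeddings alongside the unified task instruction, which is what lets the same frozen LLM handle node-level and edge-level tasks.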

[LG-83] Unveiling the Statistical Foundations of Chain-of-Thought Prompting Methods

链接: https://arxiv.org/abs/2408.14511
作者: Xinyang Hu,Fengzhuo Zhang,Siyu Chen,Zhuoran Yang
关键词-EN: solving multi-step reasoning, multi-step reasoning problem, large language models, multi-step reasoning, gained popularity
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注: 150 pages, 18 figures, 3 tables

点击查看摘要

Abstract:Chain-of-Thought (CoT) prompting and its variants have gained popularity as effective methods for solving multi-step reasoning problems using pretrained large language models (LLMs). In this work, we analyze CoT prompting from a statistical estimation perspective, providing a comprehensive characterization of its sample complexity. To this end, we introduce a multi-step latent variable model that encapsulates the reasoning process, where the latent variable encodes the task information. Under this framework, we demonstrate that when the pretraining dataset is sufficiently large, the estimator formed by CoT prompting is equivalent to a Bayesian estimator. This estimator effectively solves the multi-step reasoning problem by aggregating a posterior distribution inferred from the demonstration examples in the prompt. Moreover, we prove that the statistical error of the CoT estimator can be decomposed into two main components: (i) a prompting error, which arises from inferring the true task using CoT prompts, and (ii) the statistical error of the pretrained LLM. We establish that, under appropriate assumptions, the prompting error decays exponentially to zero as the number of demonstrations increases. Additionally, we explicitly characterize the approximation and generalization errors of the pretrained LLM. Notably, we construct a transformer model that approximates the target distribution of the multi-step reasoning problem with an error that decreases exponentially in the number of transformer blocks. Our analysis extends to other variants of CoT, including Self-Consistent CoT, Tree-of-Thought, and Selection-Inference, offering a broad perspective on the efficacy of these methods. We also provide numerical experiments to validate the theoretical findings.
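
Schematically, the stated decomposition can be written as follows; the constants and exact form are illustrative, not taken from the paper.

```latex
\underbrace{\mathrm{err}\big(\widehat{f}_{\mathrm{CoT}}\big)}_{\text{total statistical error}}
\;\lesssim\;
\underbrace{C\, e^{-c\,n}}_{\substack{\text{prompting error,}\\ n \,=\, \#\text{demonstrations}}}
\;+\;
\underbrace{\mathrm{err}_{\mathrm{LLM}}}_{\text{pretrained LLM error}}
```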

[LG-84] Artificial intelligence for science: The easy and hard problems

链接: https://arxiv.org/abs/2408.14508
作者: Ruairidh M. Battleday,Samuel J. Gershman
关键词-EN: impressive scientific discoveries, artificial intelligence, suite of impressive, driven by recent, recent advances
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注: 16 pages, 3 boxes, 4 figures

点击查看摘要

Abstract:A suite of impressive scientific discoveries have been driven by recent advances in artificial intelligence. These almost all result from training flexible algorithms to solve difficult optimization problems specified in advance by teams of domain scientists and engineers with access to large amounts of data. Although extremely useful, this kind of problem solving only corresponds to one part of science - the “easy problem.” The other part of scientific research is coming up with the problem itself - the “hard problem.” Solving the hard problem is beyond the capacities of current algorithms for scientific discovery because it requires continual conceptual revision based on poorly defined constraints. We can make progress on understanding how humans solve the hard problem by studying the cognitive science of scientists, and then use the results to design new computational agents that automatically infer and update their scientific paradigms.

[LG-85] Distilling Long-tailed Datasets

链接: https://arxiv.org/abs/2408.14506
作者: Zhenghao Zhao,Haoxuan Wang,Yuzhang Shang,Kai Wang,Yan Yan
关键词-EN: neural network training, efficient neural network, long-tailed dataset distillation, Dataset, Dataset distillation
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Dataset distillation (DD) aims to distill a small, information-rich dataset from a larger one for efficient neural network training. However, existing DD methods struggle with long-tailed datasets, which are prevalent in real-world scenarios. By investigating the reasons behind this unexpected result, we identified two main causes: 1) Expert networks trained on imbalanced data develop biased gradients, leading to the synthesis of similarly imbalanced distilled datasets. Parameter matching, a common technique in DD, involves aligning the learning parameters of the distilled dataset with that of the original dataset. However, in the context of long-tailed datasets, matching biased experts leads to inheriting the imbalance present in the original data, causing the distilled dataset to inadequately represent tail classes. 2) The experts trained on such datasets perform suboptimally on tail classes, resulting in misguided distillation supervision and poor-quality soft-label initialization. To address these issues, we propose a novel long-tailed dataset distillation method, Long-tailed Aware Dataset distillation (LAD). Specifically, we propose Weight Mismatch Avoidance to avoid directly matching the biased expert trajectories. It reduces the distance between the student and the biased expert trajectories and prevents the tail class bias from being distilled to the synthetic dataset. Moreover, we propose Adaptive Decoupled Matching, which jointly matches the decoupled backbone and classifier to improve the tail class performance and initialize reliable soft labels. This work pioneers the field of long-tailed dataset distillation (LTDD), marking the first effective effort to distill long-tailed datasets.

[LG-86] Empowering Pre-Trained Language Models for Spatio-Temporal Forecasting via Decoupling Enhanced Discrete Reprogramming

链接: https://arxiv.org/abs/2408.14505
作者: Hao Wang,Jindong Han,Wei Fan,Hao Liu
关键词-EN: time series forecasting, series forecasting plays, Pre-trained Language Models, time series, energy management
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Spatio-temporal time series forecasting plays a critical role in various real-world applications, such as transportation optimization, energy management, and climate analysis. The recent advancements in Pre-trained Language Models (PLMs) have inspired efforts to reprogram these models for time series forecasting tasks, by leveraging their superior reasoning and generalization capabilities. However, existing approaches fall short in handling complex spatial inter-series dependencies and intrinsic intra-series frequency components, limiting their spatio-temporal forecasting performance. Moreover, the linear mapping of continuous time series to a compressed subset vocabulary in reprogramming constrains the spatio-temporal semantic expressivity of PLMs and may lead to a potential information bottleneck. To overcome the above limitations, we propose RePST, a tailored PLM reprogramming framework for spatio-temporal forecasting. The key insight of RePST is to decouple the spatio-temporal dynamics in the frequency domain, allowing better alignment with the PLM text space. Specifically, we first decouple spatio-temporal data in Fourier space and devise a structural diffusion operator to obtain temporal intrinsic and spatial diffusion signals, making the dynamics more comprehensible and predictable for PLMs. To avoid an information bottleneck from a limited vocabulary, we further propose a discrete reprogramming strategy that selects relevant discrete textual information from an expanded vocabulary space in a differentiable manner. Extensive experiments on four real-world datasets show that our proposed approach significantly outperforms state-of-the-art spatio-temporal forecasting models, particularly in data-scarce scenarios.

[LG-87] Physics-Informed Neural Network for Concrete Manufacturing Process Optimization

链接: https://arxiv.org/abs/2408.14502
作者: Sam Varghese,Mr. Rahul Anand,Gaurav Paliwal
关键词-EN: Deep Neural Network, Concrete manufacturing projects, Informed Neural Networks, Physics Informed Neural, Neural Network
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Concrete manufacturing projects are among the most common ones for consulting agencies. Because of the highly non-linear dependency of input materials like ash, water, cement, superplasticizer, etc., on the resultant strength of concrete, it is difficult for machine learning models to successfully capture this relation and perform cost optimizations. This paper highlights how PINNs (Physics Informed Neural Networks) can be useful in this situation. This state-of-the-art model is also compared with traditional models like Linear Regression, Random Forest, Gradient Boosting, and a Deep Neural Network. The results of the research highlight how well PINNs performed even with a reduced dataset, thus addressing one of the biggest issues for ML models: limited data availability. On average, PINN reduced the loss value by 26.3% even with 40% less data compared to the Deep Neural Network. In addition to predicting the strength of the concrete given the quantities of raw materials, the paper also highlights the use of a heuristic optimization method, Particle Swarm Optimization (PSO), to predict the quantities of raw materials required to manufacture concrete of a given strength at the least cost.
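
The PSO step admits a compact sketch: search over raw-material quantities to minimize cost, with a penalty whenever a trained strength predictor falls below the target. Here `predict_strength` is a stand-in for the fitted PINN, and all hyperparameters are illustrative assumptions.

```python
import numpy as np

def pso_mix_design(predict_strength, costs, target, n_particles=30, iters=200,
                   lo=0.0, hi=1.0, w=0.7, c1=1.5, c2=1.5, penalty=1e3):
    """Particle Swarm Optimization over material quantities: minimize cost
    while keeping predicted strength >= target (soft constraint)."""
    d = len(costs)
    x = np.random.uniform(lo, hi, (n_particles, d))
    v = np.zeros_like(x)

    def objective(q):
        shortfall = max(0.0, target - predict_strength(q))
        return q @ costs + penalty * shortfall    # cost + constraint penalty

    pbest = x.copy()
    pbest_f = np.array([objective(q) for q in x])
    gbest = pbest[pbest_f.argmin()].copy()
    for _ in range(iters):
        r1 = np.random.rand(n_particles, d)
        r2 = np.random.rand(n_particles, d)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        x = np.clip(x + v, lo, hi)                # keep quantities in bounds
        f = np.array([objective(q) for q in x])
        improved = f < pbest_f
        pbest[improved], pbest_f[improved] = x[improved], f[improved]
        gbest = pbest[pbest_f.argmin()].copy()
    return gbest                                  # cheapest feasible mix found
```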

[LG-88] Applying graph neural network to SupplyGraph for supply chain network

链接: https://arxiv.org/abs/2408.14501
作者: Kihwan Han
关键词-EN: Supply chain, Supply chain networks, networks describe interactions, supply chain dataset, chain networks describe
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 8 pages, 5 figures

点击查看摘要

Abstract:Supply chain networks describe interactions between products, manufacturing facilities, and storage locations in the context of supply and demand for products. Supply chain data are inherently graph-structured; thus, they can be fertile ground for applications of graph neural networks (GNN). Very recently, a supply chain dataset, SupplyGraph, has been released to the public. Though the SupplyGraph dataset is valuable given the scarcity of publicly available data, it offered limited clarity on the description of the dataset, the data quality assurance process, and the hyperparameters of the selected models. Further, for generalizability of findings, it would be more convincing to present the findings by performing statistical analyses on the distribution of errors rather than showing the average value of the errors. Therefore, this study assessed the supply chain dataset, SupplyGraph, with better clarity on the analysis processes, data quality assurance, and machine learning (ML) model specifications. After data quality assurance procedures, this study compared the performance of Multilayer Perceptrons (MLP), Graph Convolution Network (GCN), and Graph Attention Network (GAT) on a demand forecasting task while matching hyperparameters as closely as feasible. The analyses revealed that GAT performed best, followed by GCN and MLP. Those performance improvements were statistically significant at $\alpha = 0.05$ after correction for multiple comparisons. This study also discussed several considerations in applying GNN to supply chain networks. The current study reinforces the previous study of the supply chain benchmark dataset with respect to the description of the dataset and methodology, so that future research on applications of GNN to supply chains becomes more reproducible.
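
The statistical comparison the abstract advocates can be sketched as follows; the exact test used in the paper is not stated, so a Wilcoxon signed-rank test with Bonferroni correction is an assumption here, and the function name is illustrative.

```python
import numpy as np
from scipy import stats

def compare_error_distributions(errors, alpha=0.05):
    """Pairwise tests on per-sample error distributions of the models, with
    Bonferroni correction for multiple comparisons.
    `errors` maps model name -> array of per-sample errors (paired samples)."""
    names = list(errors)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    corrected_alpha = alpha / len(pairs)          # Bonferroni correction
    for a, b in pairs:
        _, p = stats.wilcoxon(errors[a], errors[b])
        verdict = "significant" if p < corrected_alpha else "n.s."
        print(f"{a} vs {b}: p={p:.4g} ({verdict} at corrected alpha)")

# Usage sketch: compare_error_distributions({"MLP": e_mlp, "GCN": e_gcn, "GAT": e_gat})
```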

[LG-89] SHEDAD: SNN-Enhanced District Heating Anomaly Detection for Urban Substations

链接: https://arxiv.org/abs/2408.14499
作者: Jonne van Dreven,Abbas Cheddad,Sadi Alawadi,Ahmad Nauman Ghazi,Jad Al Koussa,Dirk Vanhoudt
关键词-EN: energy-efficient urban heating, Enhanced District Heating, District Heating Anomaly, Neighbor Enhanced District, District Heating
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 12 pages, 5 figures, FMEC2024

点击查看摘要

Abstract:District Heating (DH) systems are essential for energy-efficient urban heating. However, despite the advancements in automated fault detection and diagnosis (FDD), DH still faces challenges in operational faults that impact efficiency. This study introduces the Shared Nearest Neighbor Enhanced District Heating Anomaly Detection (SHEDAD) approach, designed to approximate the DH network topology and allow for local anomaly detection without disclosing sensitive information, such as substation locations. The approach leverages a multi-adaptive k-Nearest Neighbor (k-NN) graph to improve the initial neighborhood creation. Moreover, it introduces a merging technique that reduces noise and eliminates trivial edges. We use the Median Absolute Deviation (MAD) and modified z-scores to flag anomalous substations. The results reveal that SHEDAD outperforms traditional clustering methods, achieving significantly lower intra-cluster variance and distance. Additionally, SHEDAD effectively isolates and identifies two distinct categories of anomalies: supply temperatures and substation performance. We identified 30 anomalous substations and reached a sensitivity of approximately 65% and specificity of approximately 97%. By focusing on this subset of poor-performing substations in the network, SHEDAD enables more targeted and effective maintenance interventions, which can reduce energy usage while optimizing network performance.
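
The flagging rule based on MAD and modified z-scores is standard and easy to sketch; the cutoff of 3.5 below is the conventional choice for modified z-scores, not necessarily the paper's value.

```python
import numpy as np

def flag_anomalies(values, cutoff=3.5):
    """MAD-based anomaly flagging with modified z-scores; the 0.6745 factor
    makes the MAD consistent with the standard deviation under normality."""
    values = np.asarray(values, dtype=float)
    med = np.median(values)
    mad = max(np.median(np.abs(values - med)), 1e-9)  # guard against MAD = 0
    mz = 0.6745 * (values - med) / mad                # modified z-scores
    return np.abs(mz) > cutoff                        # boolean anomaly mask
```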

[LG-90] A New Era in Computational Pathology: A Survey on Foundation and Vision-Language Models

链接: https://arxiv.org/abs/2408.14496
作者: Dibaloke Chanda,Milan Aryal,Nasim Yahya Soltani,Masoud Ganji
关键词-EN: integrating foundation models, existing deep learning, deep learning approaches, deep learning, decision-making process
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Image and Video Processing (eess.IV)
*备注: Initial Version

点击查看摘要

Abstract:Recent advances in deep learning have completely transformed the domain of computational pathology (CPath), which in turn altered the diagnostic workflow of pathologists by integrating foundation models (FMs) and vision-language models (VLMs) in their assessment and decision-making process. FMs overcome the limitations of existing deep learning approaches in CPath by learning a representation space that can be adapted to a wide variety of downstream tasks without explicit supervision. VLMs allow pathology reports written in natural language to be used as a rich semantic information source to improve existing models as well as generate predictions in natural language form. In this survey, a holistic and systematic overview of recent innovations in FMs and VLMs in CPath is presented. Furthermore, the tools, datasets and training schemes for these models are summarized in addition to categorizing them into distinct groups. This extensive survey highlights the current trends in CPath and the way it is going to be transformed through FMs and VLMs in the future.

[LG-91] Knowledge Graph Modeling-Driven Large Language Model Operating System (LLM OS) for Task Automation in Process Engineering Problem-Solving

链接: https://arxiv.org/abs/2408.14494
作者: Sakhinana Sagar Srinivas,Vijay Sri Vaikunth,Venkataramana Runkana
关键词-EN: Engineering Operations Assistant, Operations Assistant, Process Engineering Operations, AI-driven framework designed, solve complex problems
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted for Publication by Association for the Advancement of Artificial Intelligence, Fall Symposium Series

点击查看摘要

Abstract:We present the Process Engineering Operations Assistant (PEOA), an AI-driven framework designed to solve complex problems in the chemical and process industries. The framework employs a modular architecture orchestrated by a meta-agent, which serves as the central coordinator, managing an action generator and instruction-tuned small-scale language models (expert models). The action generator decomposes complex problems into sub-tasks and identifies suitable expert models to execute each, delivering precise solutions for multi-step problem-solving. Key techniques include advanced knowledge modeling using property graphs for improved information retrieval, facilitating more accurate and contextually relevant solutions. Additionally, the framework utilizes a teacher-student transfer-learning approach with GPT-4 (Omni) to fine-tune the action generator and expert models for domain adaptation, alongside an iterative problem-solving mechanism with sophisticated error handling. Custom datasets were developed to evaluate the framework against leading proprietary language models on various engineering tasks. The results demonstrate the framework's effectiveness in automating calculations, accelerating prototyping, and providing AI-augmented decision support for industrial processes, marking a significant advancement in process engineering capabilities.

[LG-92] Extraction of Typical Operating Scenarios of New Power System Based on Deep Time Series Aggregation

链接: https://arxiv.org/abs/2408.14493
作者: Zhaoyang Qu,Zhenming Zhang,Nan Qu,Yuguang Zhou,Yang Li,Tao Jiang,Min Li,Chao Long
关键词-EN: Extracting typical operational, typical operational scenarios, making flexible decisions, typical operational, operational scenarios
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: Accepted by CAAI Transactions on Intelligence Technology

点击查看摘要

Abstract:Extracting typical operational scenarios is essential for making flexible decisions in the dispatch of a new power system. This study proposed a novel deep time series aggregation scheme (DTSAs) to generate typical operational scenarios, considering the large amount of historical operational snapshot data. Specifically, DTSAs analyze the intrinsic mechanisms of different scheduling operational scenario switching to mathematically represent typical operational scenarios. A Gramian angular summation field (GASF)-based operational scenario image encoder was designed to convert operational scenario sequences into high-dimensional spaces. This enables DTSAs to fully capture the spatiotemporal characteristics of new power systems using deep feature iterative aggregation models. The encoder also facilitates the generation of typical operational scenarios that conform to historical data distributions while ensuring the integrity of grid operational snapshots. Case studies demonstrate that the proposed method extracted new fine-grained power system dispatch schemes and outperformed the latest high-dimensional feature-screening methods. In addition, experiments with different new energy access ratios were conducted to verify the robustness of the proposed method. DTSAs enable dispatchers to master the operational experience of the power system in advance and actively respond to dynamic changes in operational scenarios under a high access rate of new energy.
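
The GASF encoding itself is a standard construction and can be sketched compactly: rescale the series to [-1, 1], map values to angles, and build the pairwise sum-of-angles image that serves as the encoder's input. The paper's surrounding encoder architecture is not reproduced here.

```python
import numpy as np

def gasf(series):
    """Gramian Angular Summation Field: encodes a 1-D series as a 2-D image."""
    x = np.asarray(series, dtype=float)
    x = 2 * (x - x.min()) / (x.max() - x.min()) - 1   # rescale to [-1, 1]
    phi = np.arccos(np.clip(x, -1, 1))                # angular encoding
    return np.cos(phi[:, None] + phi[None, :])        # (T, T) GASF image
```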

[LG-93] Evolvable Psychology Informed Neural Network for Memory Behavior Modeling

链接: https://arxiv.org/abs/2408.14492
作者: Xiaoxuan Shen,Zhihai Hu,Qirong Chen,Shengyingjie Liu,Ruxia Liang,Jianwen Sun
关键词-EN: Memory behavior modeling, Memory behavior, behavior modeling, Memory, behavior
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Memory behavior modeling is a core issue in cognitive psychology and education. Classical psychological theories typically use memory equations to describe memory behavior, which exhibit insufficient accuracy and remain controversial, while data-driven memory modeling methods often require large amounts of training data and lack interpretability. Knowledge-informed neural network models have shown excellent performance in fields like physics, but there have been few attempts in the domain of behavior modeling. This paper proposes a psychology-theory-informed neural network for memory behavior modeling named PsyINN, which combines a neural network with differentiating sparse regression in a single framework to achieve joint optimization. Specifically, to address the controversies and ambiguity of descriptors in memory equations, a descriptor evolution method based on differentiating operators is proposed to achieve precise characterization of descriptors and the evolution of memory theoretical equations. Additionally, a buffering mechanism for the sparse regression and a multi-module alternating iterative optimization method are proposed, effectively mitigating gradient instability and local optima issues. On four large-scale real-world memory behavior datasets, the proposed method surpasses state-of-the-art methods in prediction accuracy. An ablation study demonstrates the effectiveness of the proposed refinements, and application experiments showcase its potential in inspiring psychological research.

[LG-94] Multimodal Methods for Analyzing Learning and Training Environments: A Systematic Literature Review

链接: https://arxiv.org/abs/2408.14491
作者: Clayton Cohn,Eduardo Davalos,Caleb Vatral,Joyce Horn Fonteles,Hanchen David Wang,Meiyi Ma,Gautam Biswas
关键词-EN: Recent technological advancements, analyze rich multimodal, rich multimodal data, learning and training, eye gaze
类目: Machine Learning (cs.LG); Multimedia (cs.MM)
*备注: Submitted to ACM Computing Surveys. Currently under review

点击查看摘要

Abstract:Recent technological advancements have enhanced our ability to collect and analyze rich multimodal data (e.g., speech, video, and eye gaze) to better inform learning and training experiences. While previous reviews have focused on parts of the multimodal pipeline (e.g., conceptual models and data fusion), a comprehensive literature review on the methods informing multimodal learning and training environments has not been conducted. This literature review provides an in-depth analysis of research methods in these environments, proposing a taxonomy and framework that encapsulates recent methodological advances in this field and characterizes the multimodal domain in terms of five modality groups: Natural Language, Video, Sensors, Human-Centered, and Environment Logs. We introduce a novel data fusion category – mid fusion – and a graph-based technique for refining literature reviews, termed citation graph pruning. Our analysis reveals that leveraging multiple modalities offers a more holistic understanding of the behaviors and outcomes of learners and trainees. Even when multimodality does not enhance predictive accuracy, it often uncovers patterns that contextualize and elucidate unimodal data, revealing subtleties that a single modality may miss. However, there remains a need for further research to bridge the divide between multimodal learning and training studies and foundational AI research.

[LG-95] Multi-Task Multi-Fidelity Learning of Properties for Energetic Materials

链接: https://arxiv.org/abs/2408.14488
作者: Robert J. Appleton,Daniel Klinger,Brian H. Lee,Michael Taylor,Sohee Kim,Samuel Blankenship,Brian C. Barnes,Steven F. Son,Alejandro Strachan
关键词-EN: increasingly important role, physical sciences, artificial intelligence, intelligence are playing, playing an increasingly
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*备注: 16 pages, 4 figures, 2 tables

点击查看摘要

Abstract:Data science and artificial intelligence are playing an increasingly important role in the physical sciences. Unfortunately, in the field of energetic materials, data scarcity limits the accuracy and even applicability of ML tools. To address data limitations, we compiled multi-modal data: both experimental and computational results for several properties. We find that multi-task neural networks can learn from multi-modal data and outperform single-task models trained for specific properties. As expected, the improvement is more significant for data-scarce properties. These models are trained using descriptors built from simple molecular information and can be readily applied for large-scale materials screening to explore multiple properties simultaneously. This approach is widely applicable to fields outside energetic materials.
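
A common way to realize such multi-task learning over incomplete multi-modal labels is a shared trunk with per-property heads and a masked loss, so each sample contributes only to the properties it actually has. The sketch below (PyTorch, illustrative names) shows that pattern, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MultiTaskPropertyNet(nn.Module):
    """Shared trunk over molecular descriptors with one head per property."""
    def __init__(self, in_dim, n_props, hid=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, hid), nn.ReLU())
        self.heads = nn.ModuleList(nn.Linear(hid, 1) for _ in range(n_props))

    def forward(self, x):                         # x: (b, in_dim)
        h = self.trunk(x)
        return torch.cat([head(h) for head in self.heads], dim=-1)

def masked_mse(pred, target, mask):
    """MSE over observed property entries only (mask: 1 = label present)."""
    return ((pred - target) ** 2 * mask).sum() / mask.sum().clamp(min=1)
```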

[LG-96] Active learning of digenic functions with boolean matrix logic programming

链接: https://arxiv.org/abs/2408.14487
作者: Lun Ai,Stephen H. Muggleton,Shi-shun Liang,Geoff S. Baldwin
关键词-EN: drive biological discovery, apply logic-based machine, processes called genome-scale, logic-based machine learning, machine learning techniques
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Symbolic Computation (cs.SC); Molecular Networks (q-bio.MN)
*备注: arXiv admin note: substantial text overlap with arXiv:2405.06724

点击查看摘要

Abstract:We apply logic-based machine learning techniques to facilitate cellular engineering and drive biological discovery, based on comprehensive databases of metabolic processes called genome-scale metabolic network models (GEMs). Predicted host behaviours are not always correctly described by GEMs. Learning the intricate genetic interactions within GEMs presents computational and empirical challenges. To address these, we describe a novel approach called Boolean Matrix Logic Programming (BMLP) by leveraging boolean matrices to evaluate large logic programs. We introduce a new system, BMLP_active, which efficiently explores the genomic hypothesis space by guiding informative experimentation through active learning. In contrast to sub-symbolic methods, BMLP_active encodes a state-of-the-art GEM of a widely accepted bacterial host in an interpretable and logical representation using datalog logic programs. Notably, BMLP_active can successfully learn the interaction between a gene pair with fewer training examples than random experimentation, overcoming the increase in experimental design space. BMLP_active enables rapid optimisation of metabolic models and offers a realistic approach to a self-driving lab for microbial engineering.
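
The core boolean-matrix idea can be illustrated on the textbook case: evaluating a recursive datalog rule such as reachability by iterating boolean matrix products to a fixpoint. This is a sketch of the general technique, not the BMLP system itself.

```python
import numpy as np

def transitive_closure(adj):
    """Evaluate the recursive rule reach(X,Z) :- reach(X,Y), edge(Y,Z) by
    repeated boolean matrix products until no new facts are derived."""
    r = np.asarray(adj, dtype=bool)
    while True:
        step = (r.astype(np.int64) @ r.astype(np.int64)) > 0  # compose relation
        new = r | step
        if (new == r).all():              # fixpoint reached
            return new
        r = new
```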

[LG-97] Agentic Retrieval-Augmented Generation for Time Series Analysis KDD2024

链接: https://arxiv.org/abs/2408.14484
作者: Chidaksh Ravuru,Sagar Srinivas Sakhinana,Venkataramana Runkana
关键词-EN: predict task-specific outcomes, complex spatio-temporal dependencies, Time series modeling, Time series, time series tasks
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Paper was accepted for Undergraduate Consortium at ACM KDD, 2024. Please find the link: this https URL

点击查看摘要

Abstract:Time series modeling is crucial for many applications; however, it faces challenges such as complex spatio-temporal dependencies and distribution shifts in learning from historical context to predict task-specific outcomes. To address these challenges, we propose a novel approach using an agentic Retrieval-Augmented Generation (RAG) framework for time series analysis. The framework leverages a hierarchical, multi-agent architecture where the master agent orchestrates specialized sub-agents and delegates the end-user request to the relevant sub-agent. The sub-agents utilize smaller, pre-trained language models (SLMs) customized for specific time series tasks through fine-tuning using instruction tuning and direct preference optimization, and retrieve relevant prompts from a shared repository of prompt pools containing distilled knowledge about historical patterns and trends to improve predictions on new data. Our proposed modular, multi-agent RAG approach offers flexibility and achieves state-of-the-art performance across major time series tasks by tackling complex challenges more effectively than task-specific customized methods across benchmark datasets.

[LG-98] Gravix: Active Learning for Gravitational Waves Classification Algorithms

链接: https://arxiv.org/abs/2408.14483
作者: Raja Vavekanand,Kira Sam,Vavek Bharwani
关键词-EN: Convolutional Neural Networks, specifically Convolutional Neural, Neural Networks, Convolutional Neural, Bayesian Optimization
类目: Machine Learning (cs.LG); General Relativity and Quantum Cosmology (gr-qc)
*备注: 5 figures

点击查看摘要

Abstract:This project explores the integration of Bayesian Optimization (BO) algorithms into a base machine learning model, specifically Convolutional Neural Networks (CNNs), for classifying gravitational waves among background noise. The primary objective is to evaluate whether optimizing hyperparameters using Bayesian Optimization enhances the base model’s performance. For this purpose, a Kaggle [1] dataset that comprises real background noise (labeled 0) and simulated gravitational wave signals with noise (labeled 1) is used. Data with real noise is collected from three detectors: LIGO Livingston, LIGO Hanford, and Virgo. Through data preprocessing and training, the models effectively classify testing data, predicting the presence of gravitational wave signals with a remarkable score of 83.61%. The BO model demonstrates comparable accuracy to the base model, but its performance improvement is not very significant (84.34%). However, it is worth noting that the BO model requires additional computational resources and time for the iterations of hyperparameter optimization, each of which involves retraining on the entire dataset. For this reason, the BO model is less resource-efficient than the base model in gravitational wave classification.

[LG-99] Satellite Sunroof: High-res Digital Surface Models and Roof Segmentation for Global Solar Mapping

链接: https://arxiv.org/abs/2408.14400
作者: Vishal Batchu,Alex Wilson,Betty Peng,Carl Elkin,Umangi Jain,Christopher Van Arsdale,Ross Goroshin,Varun Gulshan
关键词-EN: mitigating climate change, renewable energy, climate change, key to mitigating, mitigating climate
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 14 pages

点击查看摘要

Abstract:The transition to renewable energy, particularly solar, is key to mitigating climate change. Google’s Solar API aids this transition by estimating solar potential from aerial imagery, but its impact is constrained by geographical coverage. This paper proposes expanding the API’s reach using satellite imagery, enabling global solar potential assessment. We tackle challenges involved in building a Digital Surface Model (DSM) and roof instance segmentation from lower resolution and single oblique views using deep learning models. Our models, trained on aligned satellite and aerial datasets, produce 25cm DSMs and roof segments. With ~1m DSM MAE on buildings, ~5deg roof pitch error and ~56% IOU on roof segmentation, they significantly enhance the Solar API’s potential to promote solar adoption.

[LG-100] Automatic 8-tissue Segmentation for 6-month Infant Brains MICCAI

链接: https://arxiv.org/abs/2408.15198
作者: Yilan Dong(1 and 2),Vanessa Kyriakopoulou(1 and 2),Irina Grigorescu(1),Grainne McAlonan(2),Dafnis Batalle(1 and 2),Maria Deprez(1) ((1) School of Biomedical Engineering & Imaging Sciences, King’s College London, London, United Kingdom, (2) Department of Forensic and Neurodevelopmental Science, Institute of Psychiatry, Psychology & Neuroscience, King’s College London, London, United Kingdom)
关键词-EN: atypical brain development, numerous infant studies, infancy and toddlerhood, neurodevelopmental condition, highlighted that atypical
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 11 pages, 4 figures, to be published in MICCAI PIPPI workshop

点击查看摘要

Abstract:Numerous studies have highlighted that atypical brain development, particularly during infancy and toddlerhood, is linked to an increased likelihood of being diagnosed with a neurodevelopmental condition, such as autism. Accurate brain tissue segmentations for morphological analysis are essential in numerous infant studies. However, due to ongoing white matter (WM) myelination changing tissue contrast in T1- and T2-weighted images, automatic tissue segmentation in 6-month infants is particularly difficult. On the other hand, manual labelling by experts is time-consuming and labor-intensive. In this study, we propose the first 8-tissue segmentation pipeline for six-month-old infant brains. This pipeline utilizes domain adaptation (DA) techniques to leverage our longitudinal data, including neonatal images segmented with the neonatal Developing Human Connectome Project structural pipeline. Our pipeline takes raw 6-month images as inputs and generates the 8-tissue segmentation as outputs, forming an end-to-end segmentation pipeline. The segmented tissues include WM, gray matter (GM), cerebrospinal fluid (CSF), ventricles, cerebellum, basal ganglia, brainstem, and hippocampus/amygdala. Cycle-Consistent Generative Adversarial Network (CycleGAN) and Attention U-Net were employed to achieve the image contrast transformation between neonatal and 6-month images and perform tissue segmentation on the synthesized 6-month images (neonatal images with 6-month intensity contrast), respectively. Moreover, we incorporated the segmentation outputs from Infant Brain Extraction and Analysis Toolbox (iBEAT) and another Attention U-Net to further enhance the performance and construct the end-to-end segmentation pipeline. Our evaluation with real 6-month images achieved a DICE score of 0.92, an HD95 of 1.6, and an ASSD of 0.42.

[LG-101] Low-Budget Simulation-Based Inference with Bayesian Neural Networks

链接: https://arxiv.org/abs/2408.15136
作者: Arnaud Delaunoy,Maxence de la Brassinne Bonardeaux,Siddharth Mishra-Sharma,Gilles Louppe
关键词-EN: Bayesian neural networks, Simulation-based inference methods, Bayesian neural, data-poor regime, Simulation-based inference
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Simulation-based inference methods have been shown to be inaccurate in the data-poor regime, when training simulations are limited or expensive. Under these circumstances, the inference network is particularly prone to overfitting, and using it without accounting for the computational uncertainty arising from the lack of identifiability of the network weights can lead to unreliable results. To address this issue, we propose using Bayesian neural networks in low-budget simulation-based inference, thereby explicitly accounting for the computational uncertainty of the posterior approximation. We design a family of Bayesian neural network priors that are tailored for inference and show that they lead to well-calibrated posteriors on tested benchmarks, even when as few as O(10) simulations are available. This opens up the possibility of performing reliable simulation-based inference using very expensive simulators, as we demonstrate on a problem from the field of cosmology where single simulations are computationally expensive. We show that Bayesian neural networks produce informative and well-calibrated posterior estimates with only a few hundred simulations.
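
The sketch below illustrates the low-budget setting with deep ensembles standing in as a crude substitute for the paper's Bayesian neural networks: each member fits a Gaussian posterior head on only ten simulations, and the spread across members exposes the computational uncertainty. The toy simulator and network sizes are invented for illustration.

```python
# Deep-ensemble stand-in for BNN-based simulation-based inference.
import torch
import torch.nn as nn

def make_net():
    return nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 2))  # -> (mean, log_std)

def gaussian_nll(out, theta):
    """Negative log-likelihood of theta under the predicted Gaussian (constants dropped)."""
    mean, log_std = out[:, :1], out[:, 1:]
    return (log_std + 0.5 * ((theta - mean) / log_std.exp()) ** 2).mean()

# Tiny toy simulator: x = theta + noise; only 10 simulations, the "low-budget" regime.
torch.manual_seed(0)
theta = torch.rand(10, 1) * 2 - 1
x = theta + 0.1 * torch.randn(10, 1)

ensemble = []
for seed in range(5):
    torch.manual_seed(seed)
    net = make_net()
    opt = torch.optim.Adam(net.parameters(), lr=1e-2)
    for _ in range(500):
        opt.zero_grad()
        loss = gaussian_nll(net(x), theta)
        loss.backward()
        opt.step()
    ensemble.append(net)

x_obs = torch.tensor([[0.3]])
outs = torch.stack([net(x_obs) for net in ensemble])   # (5, 1, 2)
print("posterior means:", outs[:, 0, 0].tolist())      # spread reflects computational uncertainty
```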

[LG-102] Force-Guided Bridge Matching for Full-Atom Time-Coarsened Dynamics of Peptides

链接: https://arxiv.org/abs/2408.15126
作者: Ziyang Yu,Wenbing Huang,Yang Liu
关键词-EN: Molecular Dynamics, materials science, irreplaceable and ubiquitous, Molecular, time-coarsened dynamics
类目: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG); Computational Physics (physics.comp-ph); Biomolecules (q-bio.BM)
*备注:

点击查看摘要

Abstract:Molecular Dynamics (MD) simulations are irreplaceable and ubiquitous in fields such as materials science, chemistry, and pharmacology, to name just a few. Conventional MD simulations are plagued by numerical stability and long equilibration time issues, which limits their broader application. Recently, a surge of deep learning approaches has been devised for time-coarsened dynamics, which learn the state transition mechanism over much larger time scales to overcome these limitations. However, only a few methods target the underlying Boltzmann distribution, and they do so via resampling techniques in which proposals are rarely accepted as new states, resulting in low efficiency. In this work, we propose a force-guided bridge matching model, FBM, a novel framework that first incorporates physical priors into bridge matching for full-atom time-coarsened dynamics. With the guidance of our well-designed intermediate force field, FBM can target the Boltzmann-like distribution by direct inference, without extra steps. Experiments on small peptides verify our superiority in terms of comprehensive metrics and demonstrate transferability to unseen peptide systems.

[LG-103] The Benefits of Balance: From Information Projections to Variance Reduction

链接: https://arxiv.org/abs/2408.15065
作者: Lang Liu,Ronak Mehta,Soumik Pal,Zaid Harchaoui
关键词-EN: CLIP and DINO, achieving universal representation, multiple modalities, foundation models, achieving universal
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:Data balancing across multiple modalities/sources appears in various forms in several foundation models (e.g., CLIP and DINO) achieving universal representation learning. We show that this iterative algorithm, usually used to avoid representation collapse, enjoys an unsuspected benefit: reducing the variance of estimators that are functionals of the empirical distribution over these sources. We provide non-asymptotic bounds quantifying this variance reduction effect and relate them to the eigendecays of appropriately defined Markov operators. We explain how various forms of data balancing in contrastive multimodal learning and self-supervised clustering can be interpreted as instances of this variance reduction scheme.
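
The balancing iteration mentioned above can be made concrete in a few lines of NumPy: alternating row/column rescaling (iterative proportional fitting, the Sinkhorn-style scheme analyzed in this line of work) drives a joint matrix toward prescribed marginals. The uniform targets below are an assumption for illustration.

```python
# Iterative proportional fitting: balance a joint matrix toward uniform marginals.
import numpy as np

def balance(counts, n_iters=100, tol=1e-8):
    """Rescale a non-negative matrix so that both marginals become uniform."""
    p = counts / counts.sum()
    row_target = np.full(p.shape[0], 1.0 / p.shape[0])
    col_target = np.full(p.shape[1], 1.0 / p.shape[1])
    for _ in range(n_iters):
        p *= (row_target / p.sum(axis=1))[:, None]   # match row marginal
        p *= (col_target / p.sum(axis=0))[None, :]   # match column marginal
        if np.allclose(p.sum(axis=1), row_target, atol=tol):
            break
    return p

rng = np.random.default_rng(0)
p_balanced = balance(rng.random((4, 6)))
print(p_balanced.sum(axis=0), p_balanced.sum(axis=1))  # both approximately uniform
```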

[LG-104] Targeting the partition function of chemically disordered materials with a generative approach based on inverse variational autoencoders

链接: https://arxiv.org/abs/2408.14928
作者: Maciej J. Karcz,Luca Messina,Eiji Kawasaki,Emeric Bourasseau
关键词-EN: Special Quasirandom Structures, vast configuration space, efficient exploration, configuration space, Monte Carlo
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Computing atomic-scale properties of chemically disordered materials requires an efficient exploration of their vast configuration space. Traditional approaches such as Monte Carlo or Special Quasirandom Structures either entail sampling an excessive amount of configurations or do not ensure that the configuration space has been properly covered. In this work, we propose a novel approach where generative machine learning is used to yield a representative set of configurations for accurate property evaluation and provide accurate estimations of atomic-scale properties with minimal computational cost. Our method employs a specific type of variational autoencoder with inverse roles for the encoder and decoder, enabling the application of an unsupervised active learning scheme that does not require any initial training database. The model iteratively generates configuration batches, whose properties are computed with conventional atomic-scale methods. These results are then fed back into the model to estimate the partition function, repeating the process until convergence. We illustrate our approach by computing point-defect formation energies and concentrations in (U, Pu)O2 mixed-oxide fuels. In addition, the ML model provides valuable insights into the physical factors influencing the target property. Our method is generally applicable to explore other properties, such as atomic-scale diffusion coefficients, in ideally or non-ideally disordered materials like high-entropy alloys.

[LG-105] Development of Large Annotated Music Datasets using HMM-based Forced Viterbi Alignment

链接: https://arxiv.org/abs/2408.14890
作者: S. Johanan Joysingh,P. Vijayalakshmi,T. Nagarajan
关键词-EN: machine learning task, machine learning, Automatic Music Transcription, learning task, transcriptions
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注: submitted to TENCON 2019

点击查看摘要

Abstract:Datasets are essential for any machine learning task. Automatic Music Transcription (AMT) is one such task, where a considerable amount of data is required, depending on the way the solution is achieved. Considering that a music dataset, complete with audio and its time-aligned transcriptions, requires the effort of people with musical experience, the task becomes even more challenging. Musical experience is required in playing the musical instrument(s), and in annotating and verifying the transcriptions. We propose a method that helps streamline this process, making the task of obtaining a dataset from a particular instrument easy and efficient. We use predefined guitar exercises and hidden Markov model (HMM)-based forced Viterbi alignment to accomplish this. The guitar exercises are designed to be simple. Since the note sequences are already defined, HMM-based forced Viterbi alignment provides time-aligned transcriptions of these audio files. The onsets of the transcriptions are manually verified, and the labels are accurate up to 10 ms, averaging at 5 ms. The contributions of the proposed work are twofold: i) a well-streamlined and efficient method for generating datasets for any instrument, especially monophonic ones, and ii) an acoustic plectrum guitar dataset containing wave files and transcriptions in the form of label files. This method serves as a preliminary step towards building concrete datasets for AMT systems for different instruments.
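
A minimal sketch of the forced-alignment core is given below: with the note order fixed by the exercise, a Viterbi pass over a stay-or-advance topology recovers the note boundaries. The frame-level log-likelihoods are random stand-ins for real HMM emission scores.

```python
# Forced Viterbi alignment over a fixed, ordered state (note) sequence.
import numpy as np

def forced_align(log_lik):
    """log_lik: (T frames, S ordered states). Returns one state index per frame,
    constrained to move left-to-right through the states (stay or advance)."""
    T, S = log_lik.shape
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0, 0] = log_lik[0, 0]
    for t in range(1, T):
        for s in range(S):
            stay = score[t - 1, s]
            advance = score[t - 1, s - 1] if s > 0 else -np.inf
            back[t, s] = s if stay >= advance else s - 1
            score[t, s] = max(stay, advance) + log_lik[t, s]
    path = [S - 1]                      # must end on the last note
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]

rng = np.random.default_rng(0)
print(forced_align(rng.normal(size=(20, 4))))  # frame-to-note assignment
```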

[LG-106] Towards turbine-location-aware multi-decadal wind power predictions with CMIP6

链接: https://arxiv.org/abs/2408.14889
作者: Nina Effenberger,Nicole Ludwig
关键词-EN: increasing amount, amount of renewable, renewable energy, multiple decades, wind power
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注: 4 pages, pre-print

点击查看摘要

Abstract:With the increasing amount of renewable energy in the grid, long-term wind power forecasting for multiple decades becomes more critical. In these long-term forecasts, climate data is essential as it allows us to account for climate change. Yet the resolution of climate models is often very coarse. In this paper, we show that by including turbine locations when downscaling with Gaussian Processes, we can generate valuable aggregate wind power predictions despite the low resolution of the CMIP6 climate models. This work is a first step towards multi-decadal turbine-location-aware wind power forecasting using global climate model output.
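
The sketch below shows the downscaling pattern with scikit-learn's Gaussian process regressor: fit on coarse grid values, then predict at turbine coordinates with uncertainty. The grid, wind values, and turbine locations are synthetic stand-ins for CMIP6 output and real sites.

```python
# GP-based downscaling of coarse wind fields to turbine locations.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
grid_xy = rng.uniform(0, 100, size=(64, 2))                        # coarse cell centers (km)
wind = 8 + np.sin(grid_xy[:, 0] / 20) + 0.1 * rng.normal(size=64)  # coarse wind speed (m/s)

gp = GaussianProcessRegressor(kernel=RBF(length_scale=20.0) + WhiteKernel(noise_level=0.01),
                              normalize_y=True)
gp.fit(grid_xy, wind)

turbines = np.array([[12.3, 45.6], [70.1, 22.9]])  # hypothetical turbine coordinates
mean, std = gp.predict(turbines, return_std=True)
print(mean, std)  # downscaled wind speed with uncertainty at each turbine
```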

[LG-107] Literary and Colloquial Dialect Identification for Tamil using Acoustic Features

链接: https://arxiv.org/abs/2408.14887
作者: M. Nanmalar,P. Vijayalakshmi,T. Nagarajan
关键词-EN: automatic speech recognition, automatic speech, evolution and diversity, speech recognition, dialects
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
*备注: submitted to TENCON 2019

点击查看摘要

Abstract:The evolution and diversity of a language is evident from its various dialects. If these dialects are not addressed in technological advancements like automatic speech recognition and speech synthesis, there is a chance that they may disappear. Speech technology plays a role in preserving various dialects of a language from going extinct. In order to build a full-fledged automatic speech recognition system that addresses various dialects, an Automatic Dialect Identification (ADI) system acting as the front end is required. This is similar to how language identification systems act as front ends to automatic speech recognition systems that handle multiple languages. The current work proposes a way to identify two popular and broadly classified Tamil dialects, namely literary and colloquial Tamil. Acoustic characteristics rather than phonetics and phonotactics are used, alleviating the requirement of language-dependent linguistic tools. Hence one major advantage of the proposed method is that it does not require an annotated corpus, and it can therefore be easily adapted to other languages. Gaussian Mixture Models (GMM) using Mel Frequency Cepstral Coefficient (MFCC) features are used to perform the classification task. The experiments yielded an error rate of 12%. Vowel nasalization is discussed as the reason for this good performance. The number of mixture models for the GMM is varied and the performance is analysed.
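
A compact version of the described classifier is sketched below: MFCC frames feed one GMM per dialect, and an utterance is assigned to the dialect whose model gives the higher average log-likelihood. The synthetic waveforms and mixture size are illustrative assumptions.

```python
# MFCC + per-dialect GMM classification, in the spirit of the paper.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_frames(y, sr=16000):
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T  # (frames, 13)

def train_gmm(waveforms, sr=16000, n_components=8):
    X = np.vstack([mfcc_frames(y, sr) for y in waveforms])
    return GaussianMixture(n_components=n_components, covariance_type="diag",
                           random_state=0).fit(X)

def classify(y, gmm_literary, gmm_colloquial, sr=16000):
    X = mfcc_frames(y, sr)
    # average frame log-likelihood under each dialect model decides the label
    return "literary" if gmm_literary.score(X) > gmm_colloquial.score(X) else "colloquial"

# Usage (synthetic audio standing in for real recordings):
rng = np.random.default_rng(0)
lit = [rng.normal(size=16000).astype(np.float32) for _ in range(3)]
col = [rng.normal(size=16000).astype(np.float32) for _ in range(3)]
gmm_l, gmm_c = train_gmm(lit), train_gmm(col)
print(classify(lit[0], gmm_l, gmm_c))
```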

[LG-108] Data downlink prioritization using image classification on-board a 6U CubeSat

链接: https://arxiv.org/abs/2408.14865
作者: Keenan A. A. Chatar,Ezra Fielding,Kei Sano,Kentaro Kitamura
关键词-EN: low-cost dedicated sensing, dedicated sensing systems, lean development cycles, development cycles, proliferating as low-cost
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注: 14 pages

点击查看摘要

Abstract:Nanosatellites are proliferating as low-cost dedicated sensing systems with lean development cycles. Kyushu Institute of Technology and collaborators have launched a joint venture for a nanosatellite mission, VERTECS. The primary mission is to elucidate the formation history of stars by observing the optical-wavelength cosmic background radiation. The VERTECS satellite will be equipped with a small-aperture telescope and a high-precision attitude control system to capture the cosmic data for analysis on the ground. However, nanosatellites are limited by their onboard memory resources and downlink speed capabilities. Additionally, due to a limited number of ground stations, the satellite mission will face issues meeting the required data budget for mission success. To alleviate this issue, we propose an on-orbit system to autonomously classify and then compress desirable image data for data downlink prioritization and optimization. The system comprises a prototype Camera Controller Board (CCB) carrying a Raspberry Pi Compute Module 4, which is used for classification and compression. The system uses a lightweight Convolutional Neural Network (CNN) model to classify and determine the desirability of captured image data. The model is designed to be lean and robust to reduce the computational and memory load on the satellite. The model is trained and tested on a novel star field dataset consisting of data captured by the Sloan Digital Sky Survey (SDSS). The dataset is meant to simulate the expected data produced by the 6U satellite. The compression step implements GZip, RICE or HCOMPRESS compression, which are standards for astronomical data. Preliminary testing of the proposed CNN model yields a classification accuracy of about 100% on the star field dataset, with compression ratios of 3.99, 5.16 and 5.43 achieved by GZip, RICE and HCOMPRESS respectively on the tested FITS image data.
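
The classify-then-compress logic can be sketched in a few lines; the `desirability` function below is a hypothetical stand-in for the onboard CNN, and GZip stands in for the RICE/HCOMPRESS options (which would typically go through astropy's FITS compression instead).

```python
# Downlink prioritization: score frames, keep desirable ones, compress before queueing.
import gzip
import numpy as np

def desirability(image):
    """Hypothetical stand-in for the onboard CNN's 'desirable' probability."""
    return float(image.mean() > 0.5)  # replace with model inference

def prioritize(images, threshold=0.5):
    queue = []
    for img in images:
        if desirability(img) >= threshold:
            raw = img.astype(np.float32).tobytes()
            queue.append(gzip.compress(raw))   # GZip shown; RICE/HCOMPRESS via astropy
    return queue

frames = [np.random.rand(256, 256) for _ in range(5)]
downlink_queue = prioritize(frames)
print(len(downlink_queue), "frames queued,",
      sum(len(b) for b in downlink_queue), "bytes after compression")
```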

[LG-109] Intraoperative Glioma Segmentation with YOLO + SAM for Improved Accuracy in Tumor Resection

链接: https://arxiv.org/abs/2408.14847
作者: Samir Kassam,Angelo Markham,Katie Vo,Yashas Revanakara,Michael Lam,Kevin Zhu
关键词-EN: Magnetic Resonance Imaging, malignant brain tumor, significant surgical challenges, healthy tissue, common type
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Gliomas, a common type of malignant brain tumor, present significant surgical challenges due to their similarity to healthy tissue. Preoperative Magnetic Resonance Imaging (MRI) images are often ineffective during surgery due to factors such as brain shift, which alters the position of brain structures and tumors. This makes real-time intraoperative MRI (ioMRI) crucial, as it provides updated imaging that accounts for these shifts, ensuring more accurate tumor localization and safer resections. This paper presents a deep learning pipeline combining You Only Look Once Version 8 (YOLOv8) and Segment Anything Model Vision Transformer-base (SAM ViT-b) to enhance glioma detection and segmentation during ioMRI. Our model was trained using the Brain Tumor Segmentation 2021 (BraTS 2021) dataset, which includes standard magnetic resonance imaging (MRI) images, and noise-augmented MRI images that simulate ioMRI images. Noisy MRI images are harder for a deep learning pipeline to segment, but they are more representative of surgical conditions. Achieving a Dice Similarity Coefficient (DICE) score of 0.79, our model performs comparably to state-of-the-art segmentation models tested on noiseless data. This performance demonstrates the model's potential to assist surgeons in maximizing tumor resection and improving surgical outcomes.
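
A minimal sketch of the detect-then-segment pipeline is shown below, assuming the `ultralytics` and `segment_anything` packages and locally available checkpoint files; the COCO-pretrained YOLOv8 weights stand in for a model fine-tuned on BraTS.

```python
# Detect candidate regions with YOLO, then refine each box into a mask with SAM.
import numpy as np
from ultralytics import YOLO
from segment_anything import sam_model_registry, SamPredictor

detector = YOLO("yolov8n.pt")  # stand-in weights; the paper fine-tunes on BraTS
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # local SAM checkpoint
predictor = SamPredictor(sam)

def segment_tumors(image_rgb):
    """YOLO proposes boxes; SAM turns each box prompt into a pixel mask."""
    predictor.set_image(image_rgb)
    masks = []
    for box in detector(image_rgb)[0].boxes.xyxy.cpu().numpy():
        m, scores, _ = predictor.predict(box=box, multimask_output=False)
        masks.append(m[0])  # boolean mask for this detection
    return masks

image = (np.random.rand(640, 640, 3) * 255).astype(np.uint8)  # stand-in for an ioMRI slice
print(len(segment_tumors(image)), "masks")
```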

[LG-110] MaskCycleGAN-based Whisper to Normal Speech Conversion

链接: https://arxiv.org/abs/2408.14797
作者: K. Rohith Gupta,K. Ramnath,S. Johanan Joysingh,P. Vijayalakshmi,T. Nagarajan
关键词-EN: Whisper to normal, area of research, active area, generative adversarial networks, generative adversarial
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
*备注: submitted to TENCON 2024

点击查看摘要

Abstract:Whisper to normal speech conversion is an active area of research. Various architectures based on generative adversarial networks have been proposed in the recent past. In particular, a recent study shows that MaskCycleGAN, a mask-guided generative adversarial network that preserves cycle consistency, performs very well for voice conversion from spectrogram representations. In the current work we present a MaskCycleGAN approach for the conversion of whispered speech to normal speech. We find that tuning the mask parameters and pre-processing the signal with a voice activity detector provides superior performance when compared to the existing approach. The wTIMIT dataset is used for evaluation. Objective metrics such as PESQ and G-Loss are used to evaluate the converted speech, along with subjective evaluation using mean opinion score. The results show that the proposed approach offers considerable benefits.

[LG-111] Quartered Chirp Spectral Envelope for Whispered vs Normal Speech Classification

链接: https://arxiv.org/abs/2408.14777
作者: S. Johanan Joysingh,P. Vijayalakshmi,T. Nagarajan
关键词-EN: gaining traction, acceptable form, form of human-computer, human-computer interaction, interaction is gaining
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: submitted to TENCON 2024

点击查看摘要

Abstract:Whispered speech as an acceptable form of human-computer interaction is gaining traction. Systems that address multiple modes of speech require a robust front-end speech classifier. The performance of whispered vs normal speech classification drops in the presence of additive white Gaussian noise, since normal speech takes on some of the characteristics of whispered speech. In this work, we propose a new feature named the quartered chirp spectral envelope, a combination of the chirp spectrum and the quartered spectral envelope, to classify whispered and normal speech. The chirp spectrum can be fine-tuned to obtain customized features for a given task, and the quartered spectral envelope has been proven to work especially well for the current task. The feature is trained on a one-dimensional convolutional neural network that captures the trends in the spectral envelope. The proposed system performs better than the state of the art in the presence of white noise.

[LG-112] Model-Based Reinforcement Learning for Control of Strongly-Disturbed Unsteady Aerodynamic Flows

链接: https://arxiv.org/abs/2408.14685
作者: Zhecheng Liu(1),Diederik Beckers(2),Jeff D. Eldredge(1) ((1) University of California, Los Angeles, (2) California Institute of Technology)
关键词-EN: intrinsic high dimension, reinforcement learning, flow nonlinear response, strong disturbances, dimension of fluid
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The intrinsic high dimension of fluid dynamics is an inherent challenge to control of aerodynamic flows, and this is further complicated by a flow’s nonlinear response to strong disturbances. Deep reinforcement learning, which takes advantage of the exploratory aspects of reinforcement learning (RL) and the rich nonlinearity of a deep neural network, provides a promising approach to discover feasible control strategies. However, the typical model-free approach to reinforcement learning requires a significant amount of interaction between the flow environment and the RL agent during training, and this high training cost impedes its development and application. In this work, we propose a model-based reinforcement learning (MBRL) approach by incorporating a novel reduced-order model as a surrogate for the full environment. The model consists of a physics-augmented autoencoder, which compresses high-dimensional CFD flow field snapshots into a three-dimensional latent space, and a latent dynamics model that is trained to accurately predict the long-time dynamics of trajectories in the latent space in response to action sequences. The robustness and generalizability of the model are demonstrated in two distinct flow environments, a pitching airfoil in a highly disturbed environment and a vertical-axis wind turbine in a disturbance-free environment. Based on the trained model in the first problem, we realize an MBRL strategy to mitigate lift variation during gust-airfoil encounters. We demonstrate that the policy learned in the reduced-order environment translates to an effective control strategy in the full CFD environment.
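
The surrogate's structure can be sketched as an autoencoder plus a residual latent dynamics model rolled forward under an action sequence, as below; all dimensions are illustrative, and the real model uses a physics-augmented autoencoder over CFD snapshots.

```python
# Autoencoder + latent dynamics surrogate, rolled out under a control sequence.
import torch
import torch.nn as nn

class Surrogate(nn.Module):
    def __init__(self, snapshot_dim=4096, latent_dim=3, action_dim=1):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(snapshot_dim, 256), nn.ReLU(),
                                     nn.Linear(256, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, snapshot_dim))
        self.dynamics = nn.Sequential(nn.Linear(latent_dim + action_dim, 64), nn.ReLU(),
                                      nn.Linear(64, latent_dim))

    def rollout(self, snapshot, actions):
        z = self.encoder(snapshot)
        states = []
        for a in actions:  # advance the latent state one action at a time
            z = z + self.dynamics(torch.cat([z, a], dim=-1))  # residual latent update
            states.append(self.decoder(z))                    # reconstruct the flow field
        return torch.stack(states)

model = Surrogate()
traj = model.rollout(torch.randn(1, 4096), [torch.randn(1, 1) for _ in range(10)])
print(traj.shape)  # (10, 1, 4096) predicted snapshots
```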

[LG-113] General targeted machine learning for modern causal mediation analysis

链接: https://arxiv.org/abs/2408.14620
作者: Richard Liu,Nicholas T. Williams,Kara E. Rudolph,Iván Díaz
关键词-EN: mediation analyses investigate, Causal mediation analyses, analyses investigate, investigate the mechanisms, central to scientific
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Causal mediation analyses investigate the mechanisms through which causes exert their effects, and are therefore central to scientific progress. The literature on the non-parametric definition and identification of mediational effects in rigorous causal models has grown significantly in recent years, and there has been important progress to address challenges in the interpretation and identification of such effects. Despite great progress on the causal inference front, statistical methodology for non-parametric estimation has lagged behind, with few or no methods available for tackling non-parametric estimation in the presence of multiple, continuous, or high-dimensional mediators. In this paper we show that the identification formulas for six popular non-parametric approaches to mediation analysis proposed in recent years can be recovered from just two statistical estimands. We leverage this finding to propose an all-purpose one-step estimation algorithm that can be coupled with machine learning in any mediation study that uses any of these six definitions of mediation. The estimators have desirable properties, such as √n-convergence and asymptotic normality. Estimating the first-order correction for the one-step estimator requires estimation of complex density ratios on the potentially high-dimensional mediators, a challenge that is solved using recent advancements in so-called Riesz learning. We illustrate the properties of our methods in a simulation study and illustrate their use on real data to estimate the extent to which pain management practices mediate the total effect of having a chronic pain disorder on opioid use disorder.

[LG-114] Reconstruction-based Multi-Normal Prototypes Learning for Weakly Supervised Anomaly Detection

链接: https://arxiv.org/abs/2408.14498
作者: Zhijin Dong,Hongzhi Liu,Boyuan Ren,Weimin Xiong,Zhonghai Wu
关键词-EN: crucial task, normal sample data, Anomaly detection, unlabeled data, data
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Anomaly detection is a crucial task in various domains. Most existing methods assume that normal sample data cluster around a single central prototype, while real data may consist of multiple categories or subgroups. In addition, existing methods always assume that all unlabeled data are normal, while the unlabeled data inevitably contain some anomalous samples. To address these issues, we propose a reconstruction-based multi-normal prototypes learning framework that leverages limited labeled anomalies in conjunction with abundant unlabeled data for anomaly detection. Specifically, we assume the normal sample data may follow a multi-modal distribution, and utilize deep embedding clustering and contrastive learning to learn multiple normal prototypes to represent it. Additionally, we estimate the likelihood of each unlabeled sample being normal based on the multi-normal prototypes, guiding the training process to mitigate the impact of contaminated anomalies in the unlabeled data. Extensive experiments on various datasets demonstrate the superior performance of our method compared to state-of-the-art techniques.
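
The multi-prototype idea can be illustrated with k-means standing in for the paper's deep embedding clustering: score a sample by its distance to the nearest "normal" prototype. The two-cluster toy data below is an assumption for illustration.

```python
# Multi-prototype anomaly scoring: distance to the nearest normal prototype.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
normal = np.vstack([rng.normal(loc=c, size=(200, 2)) for c in ([0, 0], [5, 5])])
prototypes = KMeans(n_clusters=2, n_init=10, random_state=0).fit(normal).cluster_centers_

def anomaly_score(x):
    """Distance to the nearest normal prototype; larger means more anomalous."""
    return np.min(np.linalg.norm(prototypes - x, axis=1))

print(anomaly_score(np.array([0.1, -0.2])))   # near a prototype -> low score
print(anomaly_score(np.array([10.0, -8.0])))  # far from all prototypes -> high score
```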

[LG-115] RISE-iEEG: Robust to Inter-Subject Electrodes Implantation Variability iEEG Classifier

链接: https://arxiv.org/abs/2408.14477
作者: Maryam Ostadsharif Memar,Navid Ziaei,Behzad Nazari,Ali Yousefi
关键词-EN: brain-computer interface applications, Utilization of intracranial, electrode implantation variability, inter-subject electrode implantation, electrode implantation
类目: Neurons and Cognition (q-bio.NC); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Utilization of intracranial electroencephalography (iEEG) is rapidly increasing for clinical and brain-computer interface applications. iEEG facilitates the recording of neural activity with high spatial and temporal resolution, making it a desirable neuroimaging modality for studying neural dynamics. Despite its benefits, iEEG faces challenges such as inter-subject variability in electrode implantation, which makes the development of unified neural decoder models across different patients difficult. In this research, we introduce a novel decoder model that is robust to inter-subject electrode implantation variability. We call this model RISE-iEEG, which stands for Robust Inter-Subject Electrode Implantation Variability iEEG Classifier. RISE-iEEG employs a deep neural network structure preceded by a patient-specific projection network. The projection network maps the neural data of individual patients onto a common low-dimensional space, compensating for the implantation variability. In other words, we developed an iEEG decoder model that can be applied across multiple patients’ data without requiring the electrode coordinates of each patient. The performance of RISE-iEEG across multiple datasets, including the Audio-Visual dataset, Music Reconstruction dataset, and Upper-Limb Movement dataset, surpasses that of state-of-the-art iEEG decoder models such as HTNet and EEGNet. Our analysis shows that the performance of RISE-iEEG is 10% higher than that of HTNet and EEGNet in terms of F1 score, with an average F1 score of 83%, the highest among the evaluated methods. Furthermore, the analysis of projection network weights in the Music Reconstruction dataset across patients suggests that the Superior Temporal lobe serves as the primary encoding neural node. This finding aligns with the auditory processing physiology.
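
The architectural pattern, one private projection per patient feeding a shared decoder, can be sketched in PyTorch as below; the electrode counts and layer sizes are invented for illustration.

```python
# Patient-specific projections onto a shared latent space, shared classifier on top.
import torch
import torch.nn as nn

class ProjectionDecoder(nn.Module):
    def __init__(self, electrode_counts, latent_dim=64, n_classes=2):
        super().__init__()
        # one projection per patient, mapping that patient's electrodes to latent_dim
        self.projections = nn.ModuleDict(
            {pid: nn.Linear(n_elec, latent_dim) for pid, n_elec in electrode_counts.items()}
        )
        self.shared = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, n_classes)
        )

    def forward(self, x, patient_id):
        z = self.projections[patient_id](x)  # patient-specific alignment
        return self.shared(z)                # shared decoder across patients

model = ProjectionDecoder({"p01": 92, "p02": 118})  # hypothetical electrode counts
logits = model(torch.randn(8, 92), "p01")           # batch of 8 samples from patient p01
print(logits.shape)
```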

信息检索

[IR-0] Into the Unknown Unknowns: Engaged Human Learning through Participation in Language Model Agent Conversations

链接: https://arxiv.org/abs/2408.15232
作者: Yucheng Jiang,Yijia Shao,Dekun Ma,Sina J. Semnani,Monica S. Lam
关键词-EN: answering concrete queries, unknowns remains challenging, create Collaborative STORM, unknown unknowns remains, language model
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:While language model (LM)-powered chatbots and generative search engines excel at answering concrete queries, discovering information in the terrain of unknown unknowns remains challenging for users. To emulate the common educational scenario where children/students learn by listening to and participating in conversations of their parents/teachers, we create Collaborative STORM (Co-STORM). Unlike QA systems that require users to ask all the questions, Co-STORM lets users observe and occasionally steer the discourse among several LM agents. The agents ask questions on the user’s behalf, allowing the user to discover unknown unknowns serendipitously. To facilitate user interaction, Co-STORM assists users in tracking the discourse by organizing the uncovered information into a dynamic mind map, ultimately generating a comprehensive report as takeaways. For automatic evaluation, we construct the WildSeek dataset by collecting real information-seeking records with user goals. Co-STORM outperforms baseline methods on both discourse trace and report quality. In a further human evaluation, 70% of participants prefer Co-STORM over a search engine, and 78% favor it over a RAG chatbot.

[IR-1] X-Reflect: Cross-Reflection Prompting for Multimodal Recommendation

链接: https://arxiv.org/abs/2408.15172
作者: Hanjia Lyu,Ryan Rossi,Xiang Chen,Md Mehrab Tanjim,Stefano Petrangeli,Somdeb Sarkhel,Jiebo Luo
关键词-EN: Large Language Models, Large Multimodal Models, Language Models, Large Language, Multimodal Models
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) and Large Multimodal Models (LMMs) have been shown to enhance the effectiveness of enriching item descriptions, thereby improving the accuracy of recommendation systems. However, most existing approaches either rely on text-only prompting or employ basic multimodal strategies that do not fully exploit the complementary information available from both textual and visual modalities. This paper introduces a novel framework, Cross-Reflection Prompting, termed X-Reflect, designed to address these limitations by prompting LMMs to explicitly identify and reconcile supportive and conflicting information between text and images. By capturing nuanced insights from both modalities, this approach generates more comprehensive and contextually richer item representations. Extensive experiments conducted on two widely used benchmarks demonstrate that our method outperforms existing prompting baselines in downstream recommendation accuracy. Additionally, we evaluate the generalizability of our framework across different LMM backbones and the robustness of the prompting strategies, offering insights for optimization. This work underscores the importance of integrating multimodal information and presents a novel solution for improving item understanding in multimodal recommendation systems.

[IR-2] Measuring publication relatedness using controlled vocabularies

链接: https://arxiv.org/abs/2408.15004
作者: Emil Dolmer Alnor
关键词-EN: scientific publications, publications has important, important applications, areas of bibliometrics, measuring relatedness
类目: Information Retrieval (cs.IR); Information Theory (cs.IT); Social and Information Networks (cs.SI)
*备注: Accepted for presentation at the 28th International Conference on Science, Technology and Innovation Indicators, 2024

点击查看摘要

Abstract:Measuring the relatedness between scientific publications has important applications in many areas of bibliometrics and science policy. Controlled vocabularies provide a promising basis for measuring relatedness because they address issues that arise when using citation or textual similarity to measure relatedness. While several controlled-vocabulary-based relatedness measures have been developed, there exists no comprehensive and direct test of their accuracy and suitability for different types of research questions. This paper reviews existing measures, develops a new measure, and benchmarks the measures using TREC Genomics data as a ground truth of topics. The benchmark tests show that the new measure and the measure proposed by Ahlgren et al. (2020) have differing strengths and weaknesses. These results inform a discussion of which method to choose when studying interdisciplinarity, information retrieval, clustering of science, and researcher topic switching.
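
One simple member of the controlled-vocabulary family can be sketched as IDF-weighted overlap of descriptor sets (e.g., MeSH terms), as below; this generic weighted Jaccard is an illustration, not the specific measure proposed in the paper.

```python
# IDF-weighted Jaccard relatedness over controlled-vocabulary descriptor sets.
import math

corpus = [
    {"Genomics", "Humans", "Sequence Analysis"},
    {"Genomics", "Mice", "Gene Expression"},
    {"Humans", "Neoplasms", "Gene Expression"},
]
idf = {t: math.log(len(corpus) / sum(t in doc for doc in corpus))
       for doc in corpus for t in doc}

def relatedness(a, b):
    shared = sum(idf[t] for t in a & b)
    total = sum(idf[t] for t in a | b)
    return shared / total if total else 0.0  # weighted Jaccard over descriptors

print(relatedness(corpus[0], corpus[1]))
print(relatedness(corpus[0], corpus[2]))
```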

[IR-3] Knowledge Discovery in Optical Music Recognition: Enhancing Information Retrieval with Instance Segmentation

链接: https://arxiv.org/abs/2408.15002
作者: Elona Shatri,George Fazekas
关键词-EN: Optical Character Recognition, Optical Music Recognition, Western Music Notation, Common Western Music, manual transcription
类目: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
*备注: 8 pages content and one references, accepted version at the International Conference on Knowledge Discovery and Information Retrieval 2024, Porto, Portugal

点击查看摘要

Abstract:Optical Music Recognition (OMR) automates the transcription of musical notation from images into machine-readable formats like MusicXML, MEI, or MIDI, significantly reducing the costs and time of manual transcription. This study explores knowledge discovery in OMR by applying instance segmentation using Mask R-CNN to enhance the detection and delineation of musical symbols in sheet music. Unlike Optical Character Recognition (OCR), OMR must handle the intricate semantics of Common Western Music Notation (CWMN), where symbol meanings depend on shape, position, and context. Our approach leverages instance segmentation to manage the density and overlap of musical symbols, facilitating more precise information retrieval from music scores. Evaluations on the DoReMi and MUSCIMA++ datasets demonstrate substantial improvements, with our method achieving a mean Average Precision (mAP) of up to 59.70% in dense symbol environments, comparable to object detection results. Furthermore, using traditional computer vision techniques, we add a parallel step for staff detection to infer the pitch for the recognised symbols. This study emphasises the role of pixel-wise segmentation in advancing accurate music symbol recognition, contributing to knowledge discovery in OMR. Our findings indicate that instance segmentation provides more precise representations of musical symbols, particularly in densely populated scores, advancing OMR technology. We make our implementation, pre-processing scripts, trained models, and evaluation results publicly available to support further research and development.
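
Instance-segmentation inference of the kind used here can be sketched with torchvision's Mask R-CNN, as below; the COCO-pretrained weights stand in for a model fine-tuned on DoReMi or MUSCIMA++.

```python
# Mask R-CNN inference: per-instance masks and labels from one image.
import torch
import torchvision

model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 512, 512)  # a score page would be loaded and normalized here
with torch.no_grad():
    pred = model([image])[0]

keep = pred["scores"] > 0.5      # confidence threshold
masks = pred["masks"][keep]      # (N, 1, H, W) soft masks, one per detected symbol
labels = pred["labels"][keep]
print(masks.shape, labels.shape)
```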

[IR-4] MRSE: An Efficient Multi-modality Retrieval System for Large Scale E-commerce

链接: https://arxiv.org/abs/2408.14968
作者: Hao Jiang,Haoxiang Zhang,Qingshan Hou,Chaofeng Chen,Weisi Lin,Jingchang Zhang,Annan Wang
关键词-EN: Providing high-quality item, e-commerce search systems, Providing high-quality, high-quality item recall, Embedding-based Retrieval Systems
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Providing high-quality item recall for text queries is crucial in large-scale e-commerce search systems. Current Embedding-based Retrieval Systems (ERS) embed queries and items into a shared low-dimensional space, but uni-modality ERS rely too heavily on textual features, making them unreliable in complex contexts. While multi-modality ERS incorporate various data sources, they often overlook individual preferences for different modalities, leading to suboptimal results. To address these issues, we propose MRSE, a Multi-modality Retrieval System that integrates text, item images, and user preferences through lightweight mixture-of-expert (LMoE) modules to better align features across and within modalities. MRSE also builds user profiles at a multi-modality level and introduces a novel hybrid loss function that enhances consistency and robustness using hard negative sampling. Experiments on a large-scale dataset from Shopee and online A/B testing show that MRSE achieves an 18.9% improvement in offline relevance and a 3.7% gain in online core metrics compared to Shopee’s state-of-the-art uni-modality system.

[IR-5] Triplètoile: Extraction of Knowledge from Microblogging Text

链接: https://arxiv.org/abs/2408.14908
作者: Vanni Zavarella,Sergio Consoli,Diego Reforgiato Recupero,Gianni Fenu,Simone Angioni,Davide Buscaldi,Danilo Dessì,Francesco Osborne
关键词-EN: Numerous methods, publications and patents, recently emerged, scientific publications, Numerous
类目: Information Retrieval (cs.IR); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL)
*备注: 42 pages, 6 figures

点击查看摘要

Abstract:Numerous methods and pipelines have recently emerged for the automatic extraction of knowledge graphs from documents such as scientific publications and patents. However, adapting these methods to incorporate alternative text sources like micro-blogging posts and news has proven challenging, as they struggle to model the open-domain entities and relations typically found in these sources. In this paper, we propose an enhanced information extraction pipeline tailored to the extraction of a knowledge graph comprising open-domain entities from micro-blogging posts on social media platforms. Our pipeline leverages dependency parsing and classifies entity relations in an unsupervised manner through hierarchical clustering over word embeddings. We provide a use case on extracting semantic triples from a corpus of 100 thousand tweets about digital transformation and publicly release the generated knowledge graph. On the same dataset, we conduct two experimental evaluations, showing that the system produces triples with precision over 95% and outperforms similar pipelines by around 5% in terms of precision, while generating a comparatively higher number of triples.
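
The unsupervised relation-typing step can be illustrated with SciPy's hierarchical clustering over embedding vectors, as below; the phrases, random embeddings, and distance threshold are stand-ins for the pipeline's real word embeddings.

```python
# Hierarchical clustering of relation-phrase embeddings into relation types.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

phrases = ["acquired", "bought", "merged with", "hired", "recruited", "appointed"]
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(phrases), 50))  # replace with real word embeddings

Z = linkage(embeddings, method="average", metric="cosine")
cluster_ids = fcluster(Z, t=0.8, criterion="distance")
for phrase, cid in zip(phrases, cluster_ids):
    print(cid, phrase)  # phrases sharing an id are treated as one relation type
```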

[IR-6] Writing in the Margins: Better Inference Pattern for Long Context Retrieval

链接: https://arxiv.org/abs/2408.14906
作者: Melisa Russak,Umar Jamil,Christopher Bryant,Kiran Kamble,Axel Magnuson,Mateusz Russak,Waseem AlShikh
关键词-EN: Large Language Models, Language Models designed, long input sequences, Large Language, introduce Writing
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:In this paper, we introduce Writing in the Margins (WiM), a new inference pattern for Large Language Models designed to optimize the handling of long input sequences in retrieval-oriented tasks. This approach leverages the chunked prefill of the key-value cache to perform segment-wise inference, which enables efficient processing of extensive contexts along with the generation and classification of intermediate information (“margins”) that guide the model towards specific tasks. This method increases computational overhead marginally while significantly enhancing the performance of off-the-shelf models without the need for fine-tuning. Specifically, we observe that WiM provides an average enhancement of 7.5% in accuracy for reasoning skills (HotpotQA, MultiHop-RAG) and more than a 30.0% increase in the F1-score for aggregation tasks (CWE). Additionally, we show how the proposed pattern fits into an interactive retrieval design that provides end-users with ongoing updates about the progress of context processing, and pinpoints the integration of relevant information into the final response. We release our implementation of WiM using Hugging Face Transformers library at this https URL.

[IR-7] Graph and Sequential Neural Networks in Session-based Recommendation: A Survey

链接: https://arxiv.org/abs/2408.14851
作者: Zihao Li,Chao Yang,Yakun Chen,Xianzhi Wang,Hongxu Chen,Guandong Xu,Lina Yao,Quan Z. Sheng
关键词-EN: information overload problem, overload problem, years have witnessed, witnessed the remarkable, remarkable success
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Recent years have witnessed the remarkable success of recommendation systems (RSs) in alleviating the information overload problem. As a new paradigm of RSs, session-based recommendation (SR) specializes in users’ short-term preference capture and aims to provide a more dynamic and timely recommendation based on the ongoing interacted actions. In this survey, we will give a comprehensive overview of the recent works on SR. First, we clarify the definitions of various SR tasks and introduce the characteristics of session-based recommendation against other recommendation tasks. Then, we summarize the existing methods in two categories: sequential neural network based methods and graph neural network (GNN) based methods. The standard frameworks and technical details are also introduced. Finally, we discuss the challenges of SR and new research directions in this area.

[IR-8] Personalized Video Summarization using Text-Based Queries and Conditional Modeling

链接: https://arxiv.org/abs/2408.14743
作者: Jia-Hong Huang
关键词-EN: Vimeo presents significant, presents significant challenges, efficiently locating relevant, YouTube and Vimeo, Vimeo presents
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
*备注: Ph.D. thesis, 137 pages

点击查看摘要

Abstract:The proliferation of video content on platforms like YouTube and Vimeo presents significant challenges in efficiently locating relevant information. Automatic video summarization aims to address this by extracting and presenting key content in a condensed form. This thesis explores enhancing video summarization by integrating text-based queries and conditional modeling to tailor summaries to user needs. Traditional methods often produce fixed summaries that may not align with individual requirements. To overcome this, we propose a multi-modal deep learning approach that incorporates both textual queries and visual information, fusing them at different levels of the model architecture. Evaluation metrics such as accuracy and F1-score assess the quality of the generated summaries. The thesis also investigates improving text-based query representations using contextualized word embeddings and specialized attention networks. This enhances the semantic understanding of queries, leading to better video summaries. To emulate human-like summarization, which accounts for both visual coherence and abstract factors like storyline consistency, we introduce a conditional modeling approach. This method uses multiple random variables and joint distributions to capture key summarization components, resulting in more human-like and explainable summaries. Addressing data scarcity in fully supervised learning, the thesis proposes a segment-level pseudo-labeling approach. This self-supervised method generates additional data, improving model performance even with limited human-labeled datasets. In summary, this research aims to enhance automatic video summarization by incorporating text-based queries, improving query representations, introducing conditional modeling, and addressing data scarcity, thereby creating more effective and personalized video summaries.

[IR-9] Snap and Diagnose: An Advanced Multimodal Retrieval System for Identifying Plant Diseases in the Wild

链接: https://arxiv.org/abs/2408.14723
作者: Tianqi Wei,Zhi Chen,Xin Yu
关键词-EN: ensures crop health, Plant disease recognition, Plant disease, disease, critical task
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Plant disease recognition is a critical task that ensures crop health and mitigates the damage caused by diseases. A handy tool that enables farmers to receive a diagnosis based on query pictures or the text description of suspicious plants is in high demand for initiating treatment before potential diseases spread further. In this paper, we develop a multimodal plant disease image retrieval system to support disease search based on either image or text prompts. Specifically, we utilize the largest in-the-wild plant disease dataset PlantWild, which includes over 18,000 images across 89 categories, to provide a comprehensive view of potential diseases relating to the query. Furthermore, cross-modal retrieval is achieved in the developed system, facilitated by a novel CLIP-based vision-language model that encodes both disease descriptions and disease images into the same latent space. Built on top of the retriever, our retrieval system allows users to upload either plant disease images or disease descriptions to retrieve the corresponding images with similar characteristics from the disease dataset to suggest candidate diseases for end users’ consideration.
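
The cross-modal core can be sketched with the public CLIP checkpoint from Hugging Face transformers, as below; the blank images and query text are stand-ins for the paper's fine-tuned plant-disease model and real data.

```python
# CLIP-style cross-modal retrieval: rank images by similarity to a text query.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

images = [Image.new("RGB", (224, 224)) for _ in range(3)]  # stand-ins for disease photos
with torch.no_grad():
    img_emb = model.get_image_features(**processor(images=images, return_tensors="pt"))
    txt_emb = model.get_text_features(**processor(
        text=["leaf with brown rust spots"], return_tensors="pt", padding=True))

img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
scores = (txt_emb @ img_emb.T).squeeze(0)        # cosine similarity per image
print(scores.argsort(descending=True))           # ranked image indices
```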

[IR-10] Smart Multi-Modal Search: Contextual Sparse and Dense Embedding Integration in Adobe Express

链接: https://arxiv.org/abs/2408.14698
作者: Cherag Aroraa,Tracy Holloway King,Jayant Kumar,Yi Lu,Sanat Sharma,Arvind Srikantan,David Uvalle,Josep Valls-Vargas,Harsha Vardhan
关键词-EN: multi-modal search systems, effective multi-modal search, multi-modal search, search systems, search
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:As user content and queries become increasingly multi-modal, the need for effective multi-modal search systems has grown. Traditional search systems often rely on textual and metadata annotations for indexed images, while multi-modal embeddings like CLIP enable direct search using text and image embeddings. However, embedding-based approaches face challenges in integrating contextual features such as user locale and recency. Building a scalable multi-modal search system requires fine-tuning several components. This paper presents a multi-modal search architecture and a series of AB tests that optimize embeddings and multi-modal technologies in Adobe Express template search. We address considerations such as embedding model selection, the roles of embeddings in matching and ranking, and the balance between dense and sparse embeddings. Our iterative approach demonstrates how utilizing sparse, dense, and contextual features enhances short and long query search, significantly reduces null rates (over 70%), and increases click-through rates (CTR). Our findings provide insights into developing robust multi-modal search systems, thereby enhancing relevance for complex queries.
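
A minimal sketch of hybrid scoring is given below: normalize a sparse (lexical) signal and a dense (embedding) signal per candidate, then blend them with a tunable weight, the kind of parameter the AB tests above would optimize. The scores are invented.

```python
# Hybrid retrieval: weighted blend of normalized sparse and dense scores.
import numpy as np

def hybrid_scores(sparse, dense, alpha=0.4):
    """Min-max normalize each signal, then blend; alpha weights the sparse side."""
    def norm(x):
        x = np.asarray(x, dtype=float)
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x)
    return alpha * norm(sparse) + (1 - alpha) * norm(dense)

bm25 = [12.1, 3.4, 8.8, 0.5]       # sparse scores for 4 candidate templates
cosine = [0.62, 0.81, 0.55, 0.70]  # dense embedding similarities
print(np.argsort(-hybrid_scores(bm25, cosine)))  # final ranking
```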

[IR-11] Federated User Preference Modeling for Privacy-Preserving Cross-Domain Recommendation

链接: https://arxiv.org/abs/2408.14689
作者: Li Wang,Shoujin Wang,Quangui Zhang,Qiang Wu,Min Xu
关键词-EN: Cross-domain recommendation, transferring knowledge, CDR, Cross-domain, user-item interaction
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Cross-domain recommendation (CDR) aims to address the data-sparsity problem by transferring knowledge across domains. Existing CDR methods generally assume that the user-item interaction data is shareable between domains, which leads to privacy leakage. Recently, some privacy-preserving CDR (PPCDR) models have been proposed to solve this problem. However, they primarily transfer simple representations learned only from user-item interaction histories, overlooking other useful side information, leading to inaccurate user preferences. Additionally, they transfer differentially private user-item interaction matrices or embeddings across domains to protect privacy. However, these methods offer limited privacy protection, as attackers may exploit external information to infer the original data. To address these challenges, we propose a novel Federated User Preference Modeling (FUPM) framework. In FUPM, first, a novel comprehensive preference exploration module is proposed to learn users’ comprehensive preferences from both interaction data and additional data including review texts and potentially positive items. Next, a private preference transfer module is designed to first learn differentially private local and global prototypes, and then privately transfer the global prototypes using a federated learning strategy. These prototypes are generalized representations of user groups, making it difficult for attackers to infer individual information. Extensive experiments on four CDR tasks conducted on the Amazon and Douban datasets validate the superiority of FUPM over SOTA baselines. Code is available at this https URL.

[IR-12] Bridging the Gap: Unpacking the Hidden Challenges in Knowledge Distillation for Online Ranking Systems

链接: https://arxiv.org/abs/2408.14678
作者: Nikhil Khani,Shuo Yang,Aniruddh Nath,Yang Liu,Pendo Abbo,Li Wei,Shawn Andrews,Maciej Kula,Jarrod Kahn,Zhe Zhao,Lichan Hong,Ed Chi
关键词-EN: Knowledge Distillation, powerful approach, approach for compressing, compressing a large, beneficial for latency-sensitive
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Knowledge Distillation (KD) is a powerful approach for compressing a large model into a smaller, more efficient model, particularly beneficial for latency-sensitive applications like recommender systems. However, current KD research predominantly focuses on Computer Vision (CV) and NLP tasks, overlooking unique data characteristics and challenges inherent to recommender systems. This paper addresses these overlooked challenges, specifically: (1) mitigating data distribution shifts between teacher and student models, (2) efficiently identifying optimal teacher configurations within time and budgetary constraints, and (3) enabling computationally efficient and rapid sharing of teacher labels to support multiple students. We present a robust KD system developed and rigorously evaluated on multiple large-scale personalized video recommendation systems within Google. Our live experiment results demonstrate significant improvements in student model performance while ensuring consistent and reliable generation of high quality teacher labels from a continuous data stream of data.
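
The core distillation objective can be sketched as a temperature-softened KL term mixed with the supervised loss, as below; the temperature and mixing weight are illustrative, and a production ranking system would adapt the loss to its label type.

```python
# Standard knowledge-distillation loss: softened KL plus cross-entropy.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                  # standard temperature scaling
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

s = torch.randn(16, 10, requires_grad=True)  # student logits
t = torch.randn(16, 10)                      # teacher logits
y = torch.randint(0, 10, (16,))
print(distillation_loss(s, t, y))
```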

[IR-13] KGPrune: a Web Application to Extract Subgraphs of Interest from Wikidata with Analogical Pruning ECAI2024

链接: https://arxiv.org/abs/2408.14658
作者: Pierre Monnin,Cherif-Hassan Nousradine,Lucas Jarnac,Laurel Zuckerman,Miguel Couceiro
关键词-EN: array of domains, ubiquitous publicly, nowadays covering, Knowledge graphs, knowledge sources
类目: Artificial Intelligence (cs.AI); Databases (cs.DB); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: Accepted as a demo paper at ECAI 2024

点击查看摘要

Abstract:Knowledge graphs (KGs) have become ubiquitous publicly available knowledge sources, and nowadays cover an ever-increasing array of domains. However, not all of the knowledge represented is useful or pertinent when considering a new application or specific task. Also, due to their increasing size, handling large KGs in their entirety entails scalability issues. These two aspects call for efficient methods to extract subgraphs of interest from existing KGs. To this aim, we introduce KGPrune, a Web Application that, given seed entities of interest and properties to traverse, extracts their neighboring subgraphs from Wikidata. To avoid topical drift, KGPrune relies on a frugal pruning algorithm based on analogical reasoning to only keep relevant neighbors while pruning irrelevant ones. The interest of KGPrune is illustrated by two concrete applications, namely, bootstrapping an enterprise KG and extracting knowledge related to looted artworks.

[IR-14] Relationships are Complicated! An Analysis of Relationships Between Datasets on the Web

链接: https://arxiv.org/abs/2408.14636
作者: Kate Lin,Tarfah Alrashed,Natasha Noy
关键词-EN: rapid pace, relationships, datasets, today has millions, continues to grow
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The Web today has millions of datasets, and the number of datasets continues to grow at a rapid pace. These datasets are not standalone entities; rather, they are intricately connected through complex relationships. Semantic relationships between datasets provide critical insights for research and decision-making processes. In this paper, we study dataset relationships from the perspective of users who discover, use, and share datasets on the Web: what relationships are important for different tasks? What contextual information might users want to know? We first present a comprehensive taxonomy of relationships between datasets on the Web and map these relationships to user tasks performed during dataset discovery. We develop a series of methods to identify these relationships and compare their performance on a large corpus of datasets generated from Web pages with this http URL markup. We demonstrate that machine-learning based methods that use dataset metadata achieve multi-class classification accuracy of 90%. Finally, we highlight gaps in available semantic markup for datasets and discuss how incorporating comprehensive semantics can facilitate the identification of dataset relationships. By providing a comprehensive overview of dataset relationships at scale, this paper sets a benchmark for future research.
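
A toy version of metadata-based relationship classification is sketched below, with TF-IDF features and logistic regression as a simple stand-in for the paper's models; the dataset titles and relationship labels are invented.

```python
# Multi-class dataset-relationship classification from metadata text pairs.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

pairs = [
    "daily temperatures v2 || daily temperatures v3",
    "census 2020 state table || census 2020 county table",
    "air quality raw || air quality cleaned subset",
]
labels = ["version", "sibling", "subset"]  # relationship class for each pair

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(pairs, labels)
print(clf.predict(["census 2020 tract table || census 2020 block table"]))
```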

[IR-15] MODOC: A Modular Interface for Flexible Interlinking of Text Retrieval and Text Generation Functions

链接: https://arxiv.org/abs/2408.14623
作者: Yingqiang Gao,Jhony Prada,Nianlong Gu,Jessica Lam,Richard H.R. Hahnloser
关键词-EN: Large Language Models, Large Language, Language Models, produce eloquent texts, produce eloquent
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Digital Libraries (cs.DL); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) produce eloquent texts but often the content they generate needs to be verified. Traditional information retrieval systems can assist with this task, but most systems have not been designed with LLM-generated queries in mind. As such, there is a compelling need for integrated systems that provide both retrieval and generation functionality within a single user interface. We present MODOC, a modular user interface that leverages the capabilities of LLMs and provides assistance with detecting their confabulations, promoting integrity in scientific writing. MODOC represents a significant step forward in scientific writing assistance. Its modular architecture supports flexible functions for retrieving information and for writing and generating text in a single, user-friendly interface.

附件下载

点击下载今日全部论文列表