本篇博文主要展示 2024-09-12 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR五个大方向区分,若需要邮件定时接收,请在评论区留下你的邮箱号。

说明:每日论文数据从Arxiv.org获取,每天早上10:30左右定时自动更新。

友情提示: 如果您需要邮箱接收每日论文数据,请在评论处留下你的邮箱,同样每天10:30左右邮件定时自动发送。

目录

概览 (2024-09-12)

今日共更新375篇论文,其中:

  • 自然语言处理38篇(Computation and Language (cs.CL))
  • 人工智能83篇(Artificial Intelligence (cs.AI))
  • 计算机视觉100篇(Computer Vision and Pattern Recognition (cs.CV))
  • 机器学习115篇(Machine Learning (cs.LG))

自然语言处理

[NLP-0] SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories
[NLP-0] SUPER:评估智能体在研究存储库中设置和执行任务的能力

链接: https://arxiv.org/abs/2409.07440
作者: Ben Bogin,Kejuan Yang,Shashank Gupta,Kyle Richardson,Erin Bransom,Peter Clark,Ashish Sabharwal,Tushar Khot
关键词-EN: Large Language Models, autonomously reproduce results, Large Language, made significant progress, research repositories
关键词-ZH: 大型语言模型,自主复制结果,大型语言,取得重大进展,研究库
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Given that Large Language Models (LLMs) have made significant progress in writing code, can they now be used to autonomously reproduce results from research repositories? Such a capability would be a boon to the research community, helping researchers validate, understand, and extend prior work. To advance towards this goal, we introduce SUPER, the first benchmark designed to evaluate the capability of LLMs in setting up and executing tasks from research repositories. SUPER aims to capture the realistic challenges faced by researchers working with Machine Learning (ML) and Natural Language Processing (NLP) research repositories. Our benchmark comprises three distinct problem sets: 45 end-to-end problems with annotated expert solutions, 152 sub problems derived from the expert set that focus on specific challenges (e.g., configuring a trainer), and 602 automatically generated problems for larger-scale development. We introduce various evaluation measures to assess both task success and progress, utilizing gold solutions when available or approximations otherwise. We show that state-of-the-art approaches struggle to solve these problems with the best model (GPT-4o) solving only 16.3% of the end-to-end set, and 46.1% of the scenarios. This illustrates the challenge of this task, and suggests that SUPER can serve as a valuable resource for the community to make and measure progress.
摘要:鉴于大型语言模型(LLM)在编写代码方面取得了重大进展,它们现在是否可以用于自主复现研究存储库中的结果?这样的能力将是研究界的福音,帮助研究人员验证、理解和扩展先前的工作。为了朝这一目标迈进,我们引入了SUPER,这是第一个旨在评估LLM在搭建并执行研究存储库任务方面能力的基准。SUPER旨在捕捉使用机器学习(ML)和自然语言处理(NLP)研究存储库的研究人员所面临的现实挑战。我们的基准包括三个不同的问题集:45个带有专家标注解决方案的端到端问题,152个源自专家集、聚焦特定挑战(例如,配置训练器)的子问题,以及602个用于更大规模开发的自动生成问题。我们引入了多种评估指标来衡量任务的成功与进展,在有标准(gold)解决方案时直接使用,否则使用近似方案。我们表明,最先进的方法难以解决这些问题,最好的模型(GPT-4o)仅解决了16.3%的端到端问题集和46.1%的场景。这说明了这项任务的挑战性,并表明SUPER可以作为社区取得并衡量进展的宝贵资源。

[NLP-1] A Suite for Acoustic Language Model Evaluation
[NLP-1] 声学语言模型评估套件

链接: https://arxiv.org/abs/2409.07437
作者: Gallil Maimon,Amit Roth,Yossi Adi
关键词-EN: recently demonstrated great, demonstrated great potential, speech processing systems, universal speech processing, processing systems
关键词-ZH: 最近表现出巨大的、巨大的潜力,语音处理系统,通用语音处理,处理系统
类目: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Speech language models have recently demonstrated great potential as universal speech processing systems. Such models have the ability to model the rich acoustic information existing in audio signals, beyond spoken content, such as emotion, background noise, etc. Despite this, evaluation benchmarks which evaluate awareness to a wide range of acoustic aspects, are lacking. To help bridge this gap, we introduce SALMon, a novel evaluation suite encompassing background noise, emotion, speaker identity and room impulse response. The proposed benchmarks both evaluate the consistency of the inspected element and how much it matches the spoken text. We follow a modelling based approach, measuring whether a model gives correct samples higher scores than incorrect ones. This approach makes the benchmark fast to compute even for large models. We evaluated several speech language models on SALMon, thus highlighting the strengths and weaknesses of each evaluated method. Code and data are publicly available at this https URL .
摘要:语音语言模型近年来显示出作为通用语音处理系统的巨大潜力。这类模型能够对音频信号中存在的、超出口语内容之外的丰富声学信息进行建模,如情感、背景噪声等。尽管如此,目前仍缺乏能够评估模型对广泛声学方面感知能力的评估基准。为了帮助弥合这一差距,我们引入了SALMon,这是一个新的评估套件,涵盖背景噪声、情感、说话人身份和房间脉冲响应。所提出的基准既评估被考察要素的一致性,也评估其与口语文本的匹配程度。我们采用基于建模的方法,衡量模型是否给正确样本打出比错误样本更高的分数。这种方法使得即使对于大型模型,基准也能快速计算。我们在SALMon上评估了多种语音语言模型,从而突出了每种被评估方法的优缺点。代码和数据在此 https URL 上公开可用。

[NLP-2] Synthetic continued pretraining
[NLP-2] 合成持续预训练

链接: https://arxiv.org/abs/2409.07431
作者: Zitong Yang,Neil Band,Shuangping Li,Emmanuel Candès,Tatsunori Hashimoto
关键词-EN: unstructured internet text, enabled language models, unstructured internet, synthetic continued pretraining, acquire a significant
关键词-ZH: 非结构化互联网文本、启用语言模型、非结构化互联网、合成持续预训练、获得重要的
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Pretraining on large-scale, unstructured internet text has enabled language models to acquire a significant amount of world knowledge. However, this knowledge acquisition is data-inefficient – to learn a given fact, models must be trained on hundreds to thousands of diverse representations of it. This poses a challenge when adapting a pretrained model to a small corpus of domain-specific documents, where each fact may appear rarely or only once. We propose to bridge this gap with synthetic continued pretraining: using the small domain-specific corpus to synthesize a large corpus more amenable to learning, and then performing continued pretraining on the synthesized corpus. We instantiate this proposal with EntiGraph, a synthetic data augmentation algorithm that extracts salient entities from the source documents and then generates diverse text by drawing connections between the sampled entities. Synthetic continued pretraining using EntiGraph enables a language model to answer questions and follow generic instructions related to the source documents without access to them. If instead, the source documents are available at inference time, we show that the knowledge acquired through our approach compounds with retrieval-augmented generation. To better understand these results, we build a simple mathematical model of EntiGraph, and show how synthetic data augmentation can “rearrange” knowledge to enable more data-efficient learning.
摘要:对大规模、非结构化的互联网文本进行预训练,使语言模型能够获得大量的世界知识。然而,这种知识获取的数据效率很低:要学习一个给定的事实,模型必须在成百上千种不同的表述上进行训练。这给将预训练模型适配到领域特定文档的小型语料库带来了挑战,因为在这种语料库中,每个事实可能很少出现甚至只出现一次。我们建议用合成持续预训练来弥合这一差距:使用小的领域特定语料库合成一个更适合学习的大型语料库,然后在合成语料库上进行持续预训练。我们用EntiGraph实例化了这一方案,EntiGraph是一种合成数据增强算法,它从源文档中提取显著实体,然后通过建立所采样实体之间的联系来生成多样化文本。使用EntiGraph进行合成持续预训练,使语言模型能够在无法访问源文档的情况下回答相关问题并遵循与之相关的通用指令。反之,如果源文档在推理时可用,我们表明通过我们的方法获得的知识可以与检索增强生成相叠加。为了更好地理解这些结果,我们构建了一个简单的EntiGraph数学模型,并展示了合成数据增强如何"重新排列"知识,以实现数据效率更高的学习。
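下面用一段极简的 Python 示意"抽取实体、连接实体、生成新文本"这一合成增强流程(并非 EntiGraph 的官方实现;llm 为假设的模型调用接口,提示词也仅作说明):

```python
import itertools
import random

def synthesize_corpus(documents, llm, pairs_per_doc=20):
    """EntiGraph 式数据增强的极简示意: 抽实体 -> 采样实体对 -> 生成连接二者的新文本。"""
    synthetic_texts = []
    for doc in documents:
        # 1. 让 LLM 从源文档中抽取显著实体(假设其按行返回实体名)
        entities = llm(f"List the salient entities in the text, one per line:\n{doc}").splitlines()
        # 2. 采样实体对, 让 LLM 基于源文档写一段分析二者关系的文本
        pairs = list(itertools.combinations(entities, 2))
        for e1, e2 in random.sample(pairs, min(pairs_per_doc, len(pairs))):
            prompt = (f"Based on the source text below, write a paragraph analyzing the "
                      f"relation between '{e1}' and '{e2}'.\n\n{doc}")
            synthetic_texts.append(llm(prompt))
    return synthetic_texts  # 之后在该合成语料上做持续预训练
```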

[NLP-3] Agent Workflow Memory
[NLP-3] 智能体工作流记忆

链接: https://arxiv.org/abs/2409.07429
作者: Zora Zhiruo Wang,Jiayuan Mao,Daniel Fried,Graham Neubig
关键词-EN: language model-based agents, complex action trajectories, potential of language, language model-based, struggle with long-horizon
关键词-ZH: 基于语言模型的主体,复杂的动作轨迹,语言的潜力,基于语言模型,与长期的斗争
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Despite the potential of language model-based agents to solve real-world tasks such as web navigation, current methods still struggle with long-horizon tasks with complex action trajectories. In contrast, humans can flexibly solve complex tasks by learning reusable task workflows from past experiences and using them to guide future actions. To build agents that can similarly benefit from this process, we introduce Agent Workflow Memory (AWM), a method for inducing commonly reused routines, i.e., workflows, and selectively providing workflows to the agent to guide subsequent generations. AWM flexibly applies to both offline and online scenarios, where agents induce workflows from training examples beforehand or from test queries on the fly. We experiment on two major web navigation benchmarks – Mind2Web and WebArena – that collectively cover 1000+ tasks from 200+ domains across travel, shopping, and social media, among others. AWM substantially improves the baseline results by 24.6% and 51.1% relative success rate on Mind2Web and WebArena while reducing the number of steps taken to solve WebArena tasks successfully. Furthermore, online AWM robustly generalizes in cross-task, website, and domain evaluations, surpassing baselines from 8.9 to 14.0 absolute points as train-test task distribution gaps widen.
摘要:尽管基于语言模型的智能体具有解决网页导航等现实世界任务的潜力,但目前的方法仍然难以处理具有复杂动作轨迹的长程任务。相比之下,人类可以从过去的经验中学习可复用的任务工作流,并用其指导未来的行动,从而灵活地解决复杂任务。为了构建同样能从这一过程中受益的智能体,我们引入了智能体工作流记忆(AWM),这是一种归纳常用例程(即工作流)并有选择地将其提供给智能体以指导后续生成的方法。AWM可灵活应用于离线和在线两种场景:前者事先从训练示例中归纳工作流,后者则在运行中从测试查询中归纳。我们在两个主要的网页导航基准(Mind2Web和WebArena)上进行了实验,它们总共涵盖旅游、购物和社交媒体等200多个领域的1000多项任务。AWM在Mind2Web和WebArena上将基线的相对成功率分别显著提升了24.6%和51.1%,同时减少了成功解决WebArena任务所需的步骤数。此外,随着训练与测试任务分布差距的扩大,在线AWM在跨任务、跨网站和跨领域评估中稳健泛化,以8.9到14.0个绝对百分点的优势超过基线。

[NLP-4] Towards Fairer Health Recommendations: finding informative unbiased samples via Word Sense Disambiguation RECSYS2024
[NLP-4] 迈向更公平的健康建议:通过词义消歧寻找信息丰富的无偏样本

链接: https://arxiv.org/abs/2409.07424
作者: Gavin Butts,Pegah Emdad,Jethro Lee,Shannon Song,Chiman Salavati,Willmar Sosa Diaz,Shiri Dori-Hacohen,Fabricio Murai
关键词-EN: produce biased predictions, growing concerns, concerns around high-stake, biased predictions, produce biased
关键词-ZH: 产生偏见的预测,日益增长的担忧,对高风险的担忧,有偏见的预测,产生偏见
类目: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: Accepted for long presentation at the FAcctRec @ Recsys 2024

点击查看摘要

Abstract:There have been growing concerns around high-stake applications that rely on models trained with biased data, which consequently produce biased predictions, often harming the most vulnerable. In particular, biased medical data could cause health-related applications and recommender systems to create outputs that jeopardize patient care and widen disparities in health outcomes. A recent framework titled Fairness via AI posits that, instead of attempting to correct model biases, researchers must focus on their root causes by using AI to debias data. Inspired by this framework, we tackle bias detection in medical curricula using NLP models, including LLMs, and evaluate them on a gold standard dataset containing 4,105 excerpts annotated by medical experts for bias from a large corpus. We build on previous work by coauthors which augments the set of negative samples with non-annotated text containing social identifier terms. However, some of these terms, especially those related to race and ethnicity, can carry different meanings (e.g., “white matter of spinal cord”). To address this issue, we propose the use of Word Sense Disambiguation models to refine dataset quality by removing irrelevant sentences. We then evaluate fine-tuned variations of BERT models as well as GPT models with zero- and few-shot prompting. We found LLMs, considered SOTA on many NLP tasks, unsuitable for bias detection, while fine-tuned BERT models generally perform well across all evaluated metrics.
摘要:人们越来越担心依赖有偏数据训练模型的高风险应用,这类模型会产生有偏的预测,往往伤害最脆弱的人群。特别是,有偏的医疗数据可能导致健康相关应用和推荐系统产生危及患者护理并扩大健康结果差距的输出。最近一个名为"通过人工智能实现公平"(Fairness via AI)的框架提出,研究人员不应试图纠正模型偏差,而应通过使用人工智能为数据去偏来关注其根本原因。受该框架启发,我们使用包括LLM在内的NLP模型来处理医学课程材料中的偏见检测,并在一个黄金标准数据集上对它们进行评估,该数据集包含从大型语料库中摘出的、由医学专家标注偏见的4,105个片段。我们在合著者先前工作的基础上,用包含社会身份术语的未标注文本扩充负样本集。然而,其中一些术语,特别是与种族和族裔有关的术语,可能带有不同的含义(例如,"脊髓白质")。为了解决这个问题,我们提出使用词义消歧模型,通过删除不相关的句子来提升数据集质量。然后,我们评估了经过微调的各种BERT模型,以及零样本和少样本提示下的GPT模型。我们发现,在许多NLP任务上被视为SOTA的LLM并不适合偏见检测,而微调后的BERT模型通常在所有评估指标上表现良好。

[NLP-5] Enhancing adversarial robustness in Natural Language Inference using explanations
[NLP-5] 使用解释增强自然语言推理中的对抗稳健性

链接: https://arxiv.org/abs/2409.07423
作者: Alexandros Koulakos,Maria Lymperaiou,Giorgos Filandrianos,Giorgos Stamou
关键词-EN: NLP model performance, limits of NLP, Natural Language Inference, Transformer-based models, NLP model
关键词-ZH: NLP模型性能、NLP的局限性、自然语言推理、基于转换器的模型、NLP模型
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The surge of state-of-the-art Transformer-based models has undoubtedly pushed the limits of NLP model performance, excelling in a variety of tasks. We cast the spotlight on the underexplored task of Natural Language Inference (NLI), since models trained on popular well-suited datasets are susceptible to adversarial attacks, allowing subtle input interventions to mislead the model. In this work, we validate the usage of natural language explanation as a model-agnostic defence strategy through extensive experimentation: only by fine-tuning a classifier on the explanation rather than premise-hypothesis inputs, robustness under various adversarial attacks is achieved in comparison to explanation-free baselines. Moreover, since there is no standard strategy of testing the semantic validity of the generated explanations, we research the correlation of widely used language generation metrics with human perception, in order for them to serve as a proxy towards robust NLI models. Our approach is resource-efficient and reproducible without significant computational limitations.
摘要:最先进的基于Transformer的模型的兴起无疑突破了NLP模型的性能极限,在各类任务中表现出色。我们将注意力集中在探索不足的自然语言推理(NLI)任务上,因为在流行且匹配良好的数据集上训练的模型容易受到对抗性攻击,细微的输入干预即可误导模型。在这项工作中,我们通过大量实验验证了将自然语言解释用作与模型无关的防御策略:仅通过在解释(而非前提-假设输入)上微调分类器,即可获得相比无解释基线在各种对抗性攻击下的鲁棒性。此外,由于没有检验生成解释语义有效性的标准策略,我们研究了常用语言生成指标与人类感知的相关性,以便它们能够作为构建鲁棒NLI模型的代理。我们的方法资源高效、可复现,且没有明显的计算限制。

[NLP-6] What to align in multimodal contrastive learning?
[NLP-6] 多模态对比学习中应当对齐什么?

链接: https://arxiv.org/abs/2409.07402
作者: Benoit Dufumier,Javiera Castillo-Navarro,Devis Tuia,Jean-Philippe Thiran
关键词-EN: Humans perceive, multisensory integration, adapt their behavior, perceive the world, world through multisensory
关键词-ZH: 人类感知,多感官融合,调整自己的行为,通过多感官感知世界,世界
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: 22 pages

点击查看摘要

Abstract:Humans perceive the world through multisensory integration, blending the information of different modalities to adapt their behavior. Contrastive learning offers an appealing solution for multimodal self-supervised learning. Indeed, by considering each modality as a different view of the same entity, it learns to align features of different modalities in a shared representation space. However, this approach is intrinsically limited as it only learns shared or redundant information between modalities, while multimodal interactions can arise in other ways. In this work, we introduce CoMM, a Contrastive MultiModal learning strategy that enables the communication between modalities in a single multimodal space. Instead of imposing cross- or intra- modality constraints, we propose to align multimodal representations by maximizing the mutual information between augmented versions of these multimodal features. Our theoretical analysis shows that shared, synergistic and unique terms of information naturally emerge from this formulation, allowing us to estimate multimodal interactions beyond redundancy. We test CoMM both in a controlled and in a series of real-world settings: in the former, we demonstrate that CoMM effectively captures redundant, unique and synergistic information between modalities. In the latter, CoMM learns complex multimodal interactions and achieves state-of-the-art results on the six multimodal benchmarks.
摘要:人类通过多感官整合来感知世界,融合不同模态的信息以调整自身行为。对比学习为多模态自监督学习提供了一种有吸引力的解决方案。事实上,通过将每个模态视为同一实体的不同视图,它学习在共享表示空间中对齐不同模态的特征。然而,这种方法本质上是受限的,因为它只学习模态之间共享或冗余的信息,而多模态交互可能以其他方式出现。在这项工作中,我们引入了CoMM,这是一种对比式多模态学习策略,使各模态能够在单一多模态空间中进行交流。我们不施加跨模态或模态内约束,而是通过最大化这些多模态特征的增强版本之间的互信息来对齐多模态表示。我们的理论分析表明,共享、协同和独特的信息项自然地从这一表述中产生,使我们能够估计冗余之外的多模态交互。我们在受控环境和一系列真实场景中测试了CoMM:在前者中,我们证明CoMM有效地捕获了模态之间冗余、独特和协同的信息;在后者中,CoMM学习复杂的多模态交互,并在六个多模态基准上取得最先进的结果。

[NLP-7] AdaCAD: Adaptively Decoding to Balance Conflicts between Contextual and Parametric Knowledge
[NLP-7] AdaCAD:适应性解码以平衡上下文知识和参数知识之间的冲突

链接: https://arxiv.org/abs/2409.07394
作者: Han Wang,Archiki Prasad,Elias Stengel-Eskin,Mohit Bansal
关键词-EN: large language model, Knowledge conflict arises, arises from discrepancies, discrepancies between information, large language
关键词-ZH: 大语言模型,知识冲突出现,源于差异,信息之间的差异,大语言
类目: Computation and Language (cs.CL)
备注: 16 pages, Code: this https URL

点击查看摘要

Abstract:Knowledge conflict arises from discrepancies between information in the context of a large language model (LLM) and the knowledge stored in its parameters. This can hurt performance when using standard decoding techniques, which tend to ignore the context. Existing test-time contrastive methods seek to address this by comparing the LLM’s output distribution with and without the context and adjust the model according to the contrast between them. However, we find that these methods frequently misjudge the degree of conflict and struggle to handle instances that vary in their amount of conflict, with static methods over-adjusting when conflict is absent. We propose a fine-grained, instance-level approach called AdaCAD, which dynamically infers the weight of adjustment based on the degree of conflict, as measured by the Jensen-Shannon divergence between distributions representing contextual and parametric knowledge. Our experiments across four models on six diverse question-answering (QA) datasets and three summarization tasks demonstrate that our training-free adaptive method consistently outperforms other decoding methods on QA, with average accuracy gains of 14.21% (absolute) over a static contrastive baseline, and improves the factuality of summaries by 5.59 (AlignScore). Furthermore, our analysis shows that while decoding with contrastive baselines hurts performance when conflict is absent, AdaCAD mitigates these losses, making it more applicable to real-world datasets in which some examples have conflict and others do not.
摘要:知识冲突源于大语言模型(LLM)上下文中的信息与其参数中存储的知识之间的差异。在使用标准解码技术时,这可能会损害性能,因为标准解码往往会忽略上下文。现有的测试时对比方法试图通过比较LLM在有无上下文两种情况下的输出分布来解决这一问题,并根据二者的对比调整模型。然而,我们发现这些方法经常误判冲突程度,难以处理冲突程度各异的实例,静态方法在不存在冲突时还会过度调整。我们提出了一种称为AdaCAD的细粒度、实例级方法,它根据冲突程度动态推断调整权重,冲突程度由表示上下文知识与参数知识的两个分布之间的Jensen-Shannon散度来度量。我们在四个模型上、针对六个不同的问答(QA)数据集和三个摘要任务进行的实验表明,这种无需训练的自适应方法在QA上始终优于其他解码方法,相比静态对比基线平均准确率(绝对值)提高14.21%,并将摘要的事实性提高5.59(AlignScore)。此外,我们的分析表明,虽然在不存在冲突时使用对比基线解码会损害性能,但AdaCAD减轻了这些损失,使其更适用于部分样例存在冲突而其他样例不存在冲突的真实数据集。
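为便于理解这一自适应对比解码的思路,下面给出一段极简的示意代码(非论文官方实现;对比调整公式参照常见的 context-aware decoding 形式,函数名与变量均为假设):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def jsd(p, q, eps=1e-12):
    # Jensen-Shannon 散度(以 2 为底, 取值范围 [0, 1])
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * (np.log2(a + eps) - np.log2(b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def adaptive_contrast(logits_with_ctx, logits_no_ctx):
    """按冲突程度(JSD)动态决定对比解码的调整力度 alpha(示意)。"""
    p, q = softmax(logits_with_ctx), softmax(logits_no_ctx)
    alpha = jsd(p, q)                       # 冲突越大, alpha 越大; 无冲突时接近 0
    return (1 + alpha) * logits_with_ctx - alpha * logits_no_ctx
```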

[NLP-8] Recent Trends of Multimodal Affective Computing: A Survey from NLP Perspective
[NLP-8] 多模式情感计算的最新趋势:NLP视角的调查

链接: https://arxiv.org/abs/2409.07388
作者: Guimin Hu,Yi Xin,Weimin Lyu,Haojian Huang,Chang Sun,Zhihong Zhu,Lin Gui,Ruichu Cai
关键词-EN: Multimodal affective computing, Multimodal affective, affective computing, garnered increasing attention, increasing attention due
关键词-ZH: 多模式情感计算,多模式情感,情感计算,引起了越来越多的关注,越来越多的关注,
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Multimodal affective computing (MAC) has garnered increasing attention due to its broad applications in analyzing human behaviors and intentions, especially in text-dominated multimodal affective computing field. This survey presents the recent trends of multimodal affective computing from NLP perspective through four hot tasks: multimodal sentiment analysis, multimodal emotion recognition in conversation, multimodal aspect-based sentiment analysis and multimodal multi-label emotion recognition. The goal of this survey is to explore the current landscape of multimodal affective research, identify development trends, and highlight the similarities and differences across various tasks, offering a comprehensive report on the recent progress in multimodal affective computing from an NLP perspective. This survey covers the formalization of tasks, provides an overview of relevant works, describes benchmark datasets, and details the evaluation metrics for each task. Additionally, it briefly discusses research in multimodal affective computing involving facial expressions, acoustic signals, physiological signals, and emotion causes. Additionally, we discuss the technical approaches, challenges, and future directions in multimodal affective computing. To support further research, we released a repository that compiles related works in multimodal affective computing, providing detailed resources and references for the community.
摘要:多模态情感计算(MAC)因其在分析人类行为和意图方面的广泛应用而受到越来越多的关注,尤其是在以文本为主导的多模态情感计算领域。本综述从NLP的视角,通过多模态情感分析、对话中的多模态情绪识别、基于方面的多模态情感分析和多模态多标签情绪识别这四个热点任务,介绍多模态情感计算的最新趋势。本综述的目标是探索多模态情感研究的现状,识别发展趋势,并突出不同任务之间的异同,从NLP视角全面报告多模态情感计算的最新进展。本综述涵盖任务的形式化定义,概述相关工作,描述基准数据集,并详细介绍每项任务的评估指标。此外,还简要讨论了涉及面部表情、声学信号、生理信号和情绪原因的多模态情感计算研究。我们还讨论了多模态情感计算的技术途径、挑战和未来方向。为了支持进一步的研究,我们发布了一个汇集多模态情感计算相关工作的资源库,为社区提供详细的资源和参考。

[NLP-9] Awaking the Slides: A Tuning-free and Knowledge-regulated AI Tutoring System via Language Model Coordination
[NLP-9] 唤醒幻灯片:通过语言模型协同实现的无需微调、知识受控的人工智能辅导系统

链接: https://arxiv.org/abs/2409.07372
作者: Daniel Zhang-Li,Zheyuan Zhang,Jifan Yu,Joy Lim Jia Yin,Shangqing Tu,Linlu Gong,Haohua Wang,Zhiyuan Liu,Huiqin Liu,Lei Hou,Juanzi Li
关键词-EN: carry lecture knowledge, vast pre-existing slides, pre-existing slides serve, heterogeneous teaching actions, teaching actions
关键词-ZH: 承载讲座知识、大量预先存在的幻灯片、预先存在的幻灯片服务、异类教学行动、教学行动
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:The vast pre-existing slides serve as rich and important materials to carry lecture knowledge. However, effectively leveraging lecture slides to serve students is difficult due to the multi-modal nature of slide content and the heterogeneous teaching actions. We study the problem of discovering effective designs that convert a slide into an interactive lecture. We develop Slide2Lecture, a tuning-free and knowledge-regulated intelligent tutoring system that can (1) effectively convert an input lecture slide into a structured teaching agenda consisting of a set of heterogeneous teaching actions; (2) create and manage an interactive lecture that generates responsive interactions catering to student learning demands while regulating the interactions to follow teaching actions. Slide2Lecture contains a complete pipeline for learners to obtain an interactive classroom experience to learn the slide. For teachers and developers, Slide2Lecture enables customization to cater to personalized demands. The evaluation rated by annotators and students shows that Slide2Lecture is effective in outperforming the remaining implementation. Slide2Lecture’s online deployment has made more than 200K interaction with students in the 3K lecture sessions. We open source Slide2Lecture’s implementation in https://anonymous.4open.science/r/slide2lecture-4210/.
摘要:大量已有的幻灯片是承载授课知识的丰富而重要的素材。然而,由于幻灯片内容的多模态特性和教学行为的异质性,有效利用讲义幻灯片服务学生并不容易。我们研究如何发现有效的设计,将一份幻灯片转化为一场交互式课程。我们开发了Slide2Lecture,这是一个无需微调、受知识约束的智能辅导系统,它可以(1)将输入的课程幻灯片有效转换为由一组异质教学动作组成的结构化教学议程;(2)创建并管理交互式课程,生成满足学生学习需求的响应式互动,同时约束这些互动遵循教学动作。Slide2Lecture包含一条完整的流程,供学习者获得学习幻灯片的交互式课堂体验。对于教师和开发者,Slide2Lecture支持定制以满足个性化需求。标注者和学生的评估表明,Slide2Lecture的效果优于其余实现。Slide2Lecture的在线部署已经在3K场次的课程中与学生进行了超过200K次互动。我们在 https://anonymous.4open.science/r/slide2lecture-4210/ 开源了Slide2Lecture的实现。

[NLP-10] Think Together and Work Better: Combining Humans and LLMs Think-Aloud Outcomes for Effective Text Evaluation
[NLP-10] 共同思考,更好工作:结合人类与LLM的有声思维结果以实现有效文本评估

链接: https://arxiv.org/abs/2409.07355
作者: SeongYeub Chu,JongWoo Kim,MunYong Yi
关键词-EN: Large Language Models, Language Models, Large Language, expertise and Large, integrates human expertise
关键词-ZH: 大型语言模型、语言模型、大型语言、专业知识和大型,集成人类专业知识
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This study introduces InteractEval, a framework that integrates human expertise and Large Language Models (LLMs) using the Think-Aloud (TA) method to generate attributes for checklist-based text evaluation. By combining human flexibility and reasoning with LLM consistency, InteractEval outperforms traditional non-LLM-based and LLM-based baselines across four distinct dimensions, consisting of Coherence, Fluency, Consistency, and Relevance. The experiment also investigates the effectiveness of the TA method, showing that it promotes divergent thinking in both humans and LLMs, leading to the generation of a wider range of relevant attributes and enhance text evaluation performance. Comparative analysis reveals that humans excel at identifying attributes related to internal quality (Coherence and Fluency), but LLMs perform better at those attributes related to external alignment (Consistency and Relevance). Consequently, leveraging both humans and LLMs together produces the best evaluation outcomes. In other words, this study emphasizes the necessity of effectively combining humans and LLMs in an automated checklist-based text evaluation framework. The code is available at this https URL.
摘要:本研究介绍了InteractEval,这是一个利用有声思维(Think-Aloud, TA)方法整合人类专业知识与大型语言模型(LLM)、为基于清单的文本评估生成属性的框架。通过将人类的灵活性和推理能力与LLM的一致性相结合,InteractEval在连贯性、流畅性、一致性和相关性四个不同维度上均优于传统的非LLM基线和LLM基线。实验还考察了TA方法的有效性,表明它能促进人类和LLM的发散思维,从而生成更广泛的相关属性并提升文本评估性能。对比分析表明,人类擅长识别与内部质量相关的属性(连贯性和流畅性),而LLM在与外部对齐相关的属性(一致性和相关性)上表现更好。因此,将人类与LLM结合使用可以产生最佳的评估结果。换言之,本研究强调了在基于清单的自动文本评估框架中有效结合人类与LLM的必要性。代码可在此 https URL 获取。

[NLP-11] Explanation Debate Align: A Weak-to-Strong Framework for Language Model Generalization
[NLP-11] 解释、辩论、对齐:语言模型泛化的从弱到强框架

链接: https://arxiv.org/abs/2409.07335
作者: Mehrdad Zakershahrak,Samira Ghodratnama
关键词-EN: artificial intelligence systems, forefront of research, task execution, rapid advancement, advancement of artificial
关键词-ZH: 人工智能系统,研究前沿,任务执行,快速进步,人工的进步
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The rapid advancement of artificial intelligence systems has brought the challenge of AI alignment to the forefront of research, particularly in complex decision-making and task execution. As these systems surpass human-level performance in sophisticated problems, ensuring their alignment with human values, intentions, and ethical guidelines becomes crucial. Building on previous work in explanation generation for human-agent alignment, we address the more complex dynamics of multi-agent systems and human-AI teams. This paper introduces a novel approach to model alignment through weak-to-strong generalization in the context of language models. We present a framework where a strong model facilitates the improvement of a weaker model, bridging the gap between explanation generation and model alignment. Our method, formalized as a facilitation function, allows for the transfer of capabilities from advanced models to less capable ones without direct access to extensive training data. Our results suggest that this facilitation-based approach not only enhances model performance but also provides insights into the nature of model alignment and the potential for scalable oversight of AI systems.
摘要:人工智能系统的快速发展将AI对齐的挑战推到了研究前沿,特别是在复杂决策和任务执行方面。随着这些系统在复杂问题上超越人类水平的表现,确保它们与人类的价值观、意图和道德准则保持一致变得至关重要。在先前关于面向人机对齐的解释生成工作的基础上,我们处理多智能体系统和人类-AI团队中更复杂的动态。本文提出了一种在语言模型背景下通过从弱到强泛化实现模型对齐的新方法。我们提出了一个框架,其中强模型促进弱模型的改进,弥合解释生成与模型对齐之间的差距。我们的方法被形式化为一个促进函数,允许在无需直接访问大量训练数据的情况下,将能力从先进模型迁移到能力较弱的模型。我们的结果表明,这种基于促进的方法不仅提高了模型性能,还为模型对齐的本质以及对AI系统进行可扩展监督的潜力提供了见解。

[NLP-12] MEDIC: Towards a Comprehensive Framework for Evaluating LLMs in Clinical Applications
[NLP-12] MEDIC:迈向评估临床应用中LLM的综合框架

链接: https://arxiv.org/abs/2409.07314
作者: Praveen K Kanithi,Clément Christophe,Marco AF Pimentel,Tathagata Raha,Nada Saadi,Hamza Javed,Svetlana Maslenkova,Nasir Hayat,Ronnie Rajan,Shadab Khan
关键词-EN: benchmarks like USMLE, Large Language Models, development of Large, Large Language, rapid development
关键词-ZH: USMLE等基准测试、大型语言模型、大型开发、大型语言、快速开发
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Technical report

点击查看摘要

Abstract:The rapid development of Large Language Models (LLMs) for healthcare applications has spurred calls for holistic evaluation beyond frequently-cited benchmarks like USMLE, to better reflect real-world performance. While real-world assessments are valuable indicators of utility, they often lag behind the pace of LLM evolution, likely rendering findings obsolete upon deployment. This temporal disconnect necessitates a comprehensive upfront evaluation that can guide model selection for specific clinical applications. We introduce MEDIC, a framework assessing LLMs across five critical dimensions of clinical competence: medical reasoning, ethics and bias, data and language understanding, in-context learning, and clinical safety. MEDIC features a novel cross-examination framework quantifying LLM performance across areas like coverage and hallucination detection, without requiring reference outputs. We apply MEDIC to evaluate LLMs on medical question-answering, safety, summarization, note generation, and other tasks. Our results show performance disparities across model sizes, baseline vs medically finetuned models, and have implications on model selection for applications requiring specific model strengths, such as low hallucination or lower cost of inference. MEDIC’s multifaceted evaluation reveals these performance trade-offs, bridging the gap between theoretical capabilities and practical implementation in healthcare settings, ensuring that the most promising models are identified and adapted for diverse healthcare applications.
摘要:面向医疗应用的大型语言模型(LLM)的快速发展,促使人们呼吁超越USMLE等经常被引用的基准进行整体评估,以更好地反映真实世界的性能。虽然真实世界评估是有价值的效用指标,但它们往往落后于LLM演进的步伐,可能使研究结果在部署时就已过时。这种时间上的脱节需要一个全面的前期评估,以指导特定临床应用的模型选择。我们介绍了MEDIC,这是一个从临床能力的五个关键维度评估LLM的框架:医学推理、伦理与偏见、数据与语言理解、上下文学习和临床安全性。MEDIC包含一个新颖的交叉检验框架,可在不需要参考输出的情况下量化LLM在覆盖度和幻觉检测等方面的表现。我们使用MEDIC评估LLM在医学问答、安全性、摘要、病历生成及其他任务上的性能。结果显示了不同模型规模、基线模型与医学微调模型之间的性能差异,并对需要特定模型优势(如低幻觉或较低推理成本)的应用的模型选择具有指导意义。MEDIC的多方面评估揭示了这些性能权衡,弥合了医疗环境中理论能力与实际落地之间的差距,确保为多样的医疗应用识别并适配最有前途的模型。

[NLP-13] Using Generative Agents to Create Tip Sheets for Investigative Data Reporting
[NLP-13] 使用生成代理创建调查数据报告提示表

链接: https://arxiv.org/abs/2409.07286
作者: Joris Veerbeek,Nicholas Diakopoulos
关键词-EN: create tip sheets, investigative data reporting, paper introduces, data reporting, create tip
关键词-ZH: 创建提示表、调查数据报告、论文介绍、数据报告、创建提示
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Short paper to be presented at Computation + Journalism 2024

点击查看摘要

Abstract:This paper introduces a system using generative AI agents to create tip sheets for investigative data reporting. Our system employs three specialized agents–an analyst, a reporter, and an editor–to collaboratively generate and refine tips from datasets. We validate this approach using real-world investigative stories, demonstrating that our agent-based system generally generates more newsworthy and accurate insights compared to a baseline model without agents, although some variability was noted between different stories. Our findings highlight the potential of generative AI to provide leads for investigative data reporting.
摘要:本文介绍了一个使用生成式AI智能体为调查性数据报道创建选题提示单(tip sheet)的系统。我们的系统采用三个专门的智能体(分析师、记者和编辑)来协作地从数据集中生成并完善提示。我们使用真实世界的调查报道来验证这种方法,结果表明,与不使用智能体的基线模型相比,我们基于智能体的系统通常能产生更具新闻价值且更准确的洞见,尽管不同报道之间存在一定差异。我们的发现突出了生成式AI为调查性数据报道提供线索的潜力。
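下面用几行 Python 示意"分析师、记者、编辑"三个角色串联的流水线(llm 为假设的模型调用接口,角色提示词并非论文原文,仅作说明):

```python
def generate_tip_sheet(dataset_summary, llm):
    """三个角色依次加工同一份数据摘要, 产出选题提示单(示意)。"""
    analysis = llm("You are a data analyst. List notable patterns, outliers and "
                   "comparisons in this dataset:\n" + dataset_summary)
    tips = llm("You are an investigative reporter. Turn these findings into "
               "newsworthy story tips with suggested angles:\n" + analysis)
    tip_sheet = llm("You are an editor. Check each tip against the dataset summary, "
                    "cut weak or inaccurate items and rank the rest.\n\nSummary:\n"
                    + dataset_summary + "\n\nTips:\n" + tips)
    return tip_sheet
```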

[NLP-14] Cross-Dialect Text-To-Speech in Pitch-Accent Language Incorporating Multi-Dialect Phoneme-Level BERT
[NLP-14] 结合多方言音素级BERT的音高重音语言跨方言语音合成

链接: https://arxiv.org/abs/2409.07265
作者: Kazuki Yamauchi,Yuki Saito,Hiroshi Saruwatari
关键词-EN: learned speakers’ voices, synthesize learned speakers’, explore cross-dialect, pitch-accent languages, learned speakers’
关键词-ZH: 学习到的说话人声音,合成学习到的说话人的,探索跨方言,音高重音语言,学习到的说话人的
类目: ound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Accepted by IEEE SLT 2024

点击查看摘要

Abstract:We explore cross-dialect text-to-speech (CD-TTS), a task to synthesize learned speakers’ voices in non-native dialects, especially in pitch-accent languages. CD-TTS is important for developing voice agents that naturally communicate with people across regions. We present a novel TTS model comprising three sub-modules to perform competitively at this task. We first train a backbone TTS model to synthesize dialect speech from a text conditioned on phoneme-level accent latent variables (ALVs) extracted from speech by a reference encoder. Then, we train an ALV predictor to predict ALVs tailored to a target dialect from input text leveraging our novel multi-dialect phoneme-level BERT. We conduct multi-dialect TTS experiments and evaluate the effectiveness of our model by comparing it with a baseline derived from conventional dialect TTS methods. The results show that our model improves the dialectal naturalness of synthetic speech in CD-TTS.
摘要:我们探索跨方言语音合成(CD-TTS),这是一项在非母语方言(尤其是音高重音语言)中合成已学习说话人声音的任务。CD-TTS对于开发能够与各地区人们自然交流的语音智能体非常重要。我们提出了一种由三个子模块组成的新型TTS模型,以在这项任务上取得有竞争力的表现。我们首先训练一个主干TTS模型,根据参考编码器从语音中提取的音素级重音潜变量(ALV),从文本合成方言语音。然后,我们利用新颖的多方言音素级BERT训练一个ALV预测器,从输入文本预测针对目标方言定制的ALV。我们进行了多方言TTS实验,并通过与源自传统方言TTS方法的基线进行比较来评估模型的有效性。结果表明,我们的模型提高了CD-TTS中合成语音的方言自然度。

[NLP-15] Propaganda to Hate: A Multimodal Analysis of Arabic Memes with Multi-Agent LLMs
[NLP-15] 从宣传到仇恨:基于多智能体LLM的阿拉伯语模因多模态分析

链接: https://arxiv.org/abs/2409.07246
作者: Firoj Alam,Md. Rafiul Biswas,Uzair Shah,Wajdi Zaghouani,Georgios Mikros
关键词-EN: social media platforms, past decade, social media, dissemination and consumption, media platforms
关键词-ZH: 社交媒体平台,过去十年,社交媒体,传播和消费,媒体平台
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: propaganda, hate-speech, disinformation, misinformation, fake news, LLMs, GPT-4, multimodality, multimodal LLMs

点击查看摘要

Abstract:In the past decade, social media platforms have been used for information dissemination and consumption. While a major portion of the content is posted to promote citizen journalism and public awareness, some content is posted to mislead users. Among different content types such as text, images, and videos, memes (text overlaid on images) are particularly prevalent and can serve as powerful vehicles for propaganda, hate, and humor. In the current literature, there have been efforts to individually detect such content in memes. However, the study of their intersection is very limited. In this study, we explore the intersection between propaganda and hate in memes using a multi-agent LLM-based approach. We extend the propagandistic meme dataset with coarse and fine-grained hate labels. Our finding suggests that there is an association between propaganda and hate in memes. We provide detailed experimental results that can serve as a baseline for future studies. We will make the experimental resources publicly available to the community.
摘要:在过去的十年里,社交媒体平台被用于信息传播和消费。虽然大部分内容是为了促进公民新闻和公众意识而发布的,但也有一些内容是为了误导用户。在文本、图像和视频等不同的内容类型中,模因(覆盖在图像上的文本)尤其普遍,可以作为宣传、仇恨和幽默的强大工具。在目前的文献中,已经有人努力在模因中单独检测这样的内容。然而,对它们交集的研究却非常有限。在这项研究中,我们使用基于多智能体LLM的方法来探索模因中宣传和仇恨之间的交集。我们使用粗粒度和细粒度的仇恨标签来扩展宣传模因数据集。我们的发现表明,在模因中,宣传和仇恨之间存在联系。我们提供了详细的实验结果,可以作为未来研究的基线。我们将把实验资源向社会公开。

[NLP-16] Learning Efficient Recursive Numeral Systems via Reinforcement Learning
[NLP-16] 通过强化学习学习高效的递归数字系统

链接: https://arxiv.org/abs/2409.07170
作者: Jonathan D. Thomas,Andrea Silvi,Devdatt Dubhashi,Emil Carlsson,Moa Johansson
关键词-EN: mathematical concepts, mathematics and reasoning, understudied area, emergence of mathematical, numeral systems
关键词-ZH: 数学概念、数学和推理、研究不足的领域、数学、数字系统的出现
类目: Computation and Language (cs.CL)
备注: 8 pages, 6 figures

点击查看摘要

Abstract:The emergence of mathematical concepts, such as number systems, is an understudied area in AI for mathematics and reasoning. It has previously been shown Carlsson et al. (2021) that by using reinforcement learning (RL), agents can derive simple approximate and exact-restricted numeral systems. However, it is a major challenge to show how more complex recursive numeral systems, similar to the one utilised in English, could arise via a simple learning mechanism such as RL. Here, we introduce an approach towards deriving a mechanistic explanation of the emergence of recursive number systems where we consider an RL agent which directly optimizes a lexicon under a given meta-grammar. Utilising a slightly modified version of the seminal meta-grammar of Hurford (1975), we demonstrate that our RL agent can effectively modify the lexicon towards Pareto-optimal configurations which are comparable to those observed within human numeral systems.
摘要:数字系统等数学概念的涌现,是数学与推理人工智能中一个研究不足的领域。Carlsson等人(2021)此前已经表明,通过使用强化学习(RL),智能体可以推导出简单的近似数字系统和精确受限数字系统。然而,要展示类似英语中使用的更复杂的递归数字系统如何通过RL这样简单的学习机制产生,是一项重大挑战。在这里,我们提出了一种对递归数字系统的出现进行机制性解释的方法,其中我们考虑一个在给定元语法下直接优化词典的RL智能体。利用Hurford(1975)开创性元语法的稍作修改版本,我们证明了我们的RL智能体可以有效地将词典修改为帕累托最优配置,其与人类数字系统中观察到的配置相当。

[NLP-17] A Fine-grained Sentiment Analysis of App Reviews using Large Language Models : An Evaluation Study
[NLP-17] 使用大型语言模型对应用程序评论进行细粒度情绪分析:评估研究

链接: https://arxiv.org/abs/2409.07162
作者: Faiz Ali Shah,Ahmed Sabir,Rajesh Sharma
关键词-EN: Analyzing user reviews, provide valuable insights, Analyzing user, user reviews, user
关键词-ZH: 分析用户评论,提供有价值的见解,分析用户,用户评论,用户
类目: Computation and Language (cs.CL); Software Engineering (cs.SE)
备注: The summary of the project is available at this https URL

点击查看摘要

Abstract:Analyzing user reviews for sentiment towards app features can provide valuable insights into users’ perceptions of app functionality and their evolving needs. Given the volume of user reviews received daily, an automated mechanism to generate feature-level sentiment summaries of user reviews is needed. Recent advances in Large Language Models (LLMs) such as ChatGPT have shown impressive performance on several new tasks without updating the model’s parameters i.e. using zero or a few labeled examples. Despite these advancements, LLMs’ capabilities to perform feature-specific sentiment analysis of user reviews remain unexplored. This study compares the performance of state-of-the-art LLMs, including GPT-4, ChatGPT, and LLama-2-chat variants, for extracting app features and associated sentiments under 0-shot, 1-shot, and 5-shot scenarios. Results indicate the best-performing GPT-4 model outperforms rule-based approaches by 23.6% in f1-score with zero-shot feature extraction; 5-shot further improving it by 6%. GPT-4 achieves a 74% f1-score for predicting positive sentiment towards correctly predicted app features, with 5-shot enhancing it by 7%. Our study suggests that LLM models are promising for generating feature-specific sentiment summaries of user reviews.
摘要:分析用户评论中针对应用功能的情感,可以为了解用户对应用功能的看法及其不断变化的需求提供有价值的见解。鉴于每天收到的用户评论数量庞大,需要一种自动机制来生成功能级的用户评论情感摘要。ChatGPT等大型语言模型(LLM)的最新进展表明,它们无需更新模型参数(即只使用零个或少量标注示例)就能在多项新任务上取得令人印象深刻的表现。尽管如此,LLM对用户评论进行特定功能情感分析的能力仍未被探索。本研究比较了最先进的LLM(包括GPT-4、ChatGPT和LLama-2-chat变体)在0-shot、1-shot和5-shot场景下提取应用功能及相关情感的性能。结果表明,表现最好的GPT-4模型在零样本特征提取上的F1得分比基于规则的方法高23.6%;5-shot进一步将其提高6%。GPT-4在预测对正确抽取的应用功能的积极情感方面达到74%的F1得分,5-shot将其提升7%。我们的研究表明,LLM模型在生成用户评论的功能级情感摘要方面很有前景。
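下面给出一个构造零样本/少样本提示词的极简示意(输出格式与措辞均为假设,并非论文使用的原始提示):

```python
def build_prompt(review, examples=()):
    """构造提取「应用功能: 情感」对的提示词; examples 为空即零样本, 否则为少样本。"""
    demos = "\n".join(f"Review: {r}\nFeatures: {f}" for r, f in examples)
    return ("Extract the app features mentioned in the review and the sentiment "
            "(positive/negative/neutral) towards each, as `feature: sentiment` pairs.\n"
            f"{demos}\nReview: {review}\nFeatures:")

# 用法示意(1-shot)
print(build_prompt("Love the new dark mode, but sync keeps failing.",
                   examples=[("Great camera, battery drains fast.",
                              "camera: positive; battery: negative")]))
```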

[NLP-18] Gated Slot Attention for Efficient Linear-Time Sequence Modeling
[NLP-18] 有效线性时间序列建模的门控槽注意力

链接: https://arxiv.org/abs/2409.07146
作者: Yu Zhang,Songlin Yang,Ruijie Zhu,Yue Zhang,Leyang Cui,Yiqiao Wang,Bolun Wang,Freda Shi,Bailin Wang,Wei Bi,Peng Zhou,Guohong Fu
关键词-EN: Gated Linear Attention, Linear attention Transformers, Gated Slot Attention, recall-intensive tasks compared, demand significant resources
关键词-ZH: 门控线性注意力、线性注意力变形金刚、门控插槽注意力、回忆密集型任务相比,需要大量资源
类目: Computation and Language (cs.CL)
备注: Preprint

点击查看摘要

Abstract:Linear attention Transformers and their gated variants, celebrated for enabling parallel training and efficient recurrent inference, still fall short in recall-intensive tasks compared to traditional Transformers and demand significant resources for training from scratch. This paper introduces Gated Slot Attention (GSA), which enhances Attention with Bounded-memory-Control (ABC) by incorporating a gating mechanism inspired by Gated Linear Attention (GLA). Essentially, GSA comprises a two-layer GLA linked via softmax, utilizing context-aware memory reading and adaptive forgetting to improve memory capacity while maintaining compact recurrent state size. This design greatly enhances both training and inference efficiency through GLA’s hardware-efficient training algorithm and reduced state size. Additionally, retaining the softmax operation is particularly beneficial in “finetuning pretrained Transformers to RNNs” (T2R) settings, reducing the need for extensive training from scratch. Extensive experiments confirm GSA’s superior performance in scenarios requiring in-context recall and in T2R settings.
摘要:线性注意力Transformer及其门控变体以支持并行训练和高效循环推理而著称,但与传统Transformer相比,在回忆密集型任务上仍有不足,而且从头训练需要大量资源。本文提出门控槽注意力(GSA),它通过引入受门控线性注意力(GLA)启发的门控机制,增强了有界记忆控制注意力(ABC)。本质上,GSA由通过softmax连接的两层GLA组成,利用上下文感知的记忆读取和自适应遗忘来提高记忆容量,同时保持紧凑的循环状态规模。该设计借助GLA硬件高效的训练算法和更小的状态规模,大大提高了训练和推理效率。此外,保留softmax操作在"将预训练Transformer微调为RNN"(T2R)的设置中尤其有益,减少了从头进行大规模训练的需要。大量实验证实了GSA在需要上下文内回忆的场景和T2R设置中的优越性能。

[NLP-19] Leveraging Unstructured Text Data for Federated Instruction Tuning of Large Language Models
[NLP-19] 利用非结构化文本数据进行大型语言模型的联邦指令调优

链接: https://arxiv.org/abs/2409.07136
作者: Rui Ye,Rui Ge,Yuchi Fengting,Jingyi Chai,Yanfeng Wang,Siheng Chen
关键词-EN: large language model, shared large language, directly sharing raw, Federated instruction tuning, follow humans’ instructions
关键词-ZH: 大型语言模型、共享大型语言、直接共享原始语言、联邦指令调优、遵循人类指令
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 11 pages, work in progress

点击查看摘要

Abstract:Federated instruction tuning enables multiple clients to collaboratively fine-tune a shared large language model (LLM) that can follow humans’ instructions without directly sharing raw data. However, existing literature impractically requires that all the clients readily hold instruction-tuning data (i.e., structured instruction-response pairs), which necessitates massive human annotations since clients’ data is usually unstructured text instead. Addressing this, we propose a novel and flexible framework FedIT-U2S, which can automatically transform unstructured corpus into structured data for federated instruction tuning. FedIT-U2S consists two key steps: (1) few-shot instruction-tuning data generation, where each unstructured data piece together with several examples is combined to prompt an LLM in generating an instruction-response pair. To further enhance the flexibility, a retrieval-based example selection technique is proposed, where the examples are automatically selected based on the relatedness between the client’s data piece and example pool, bypassing the need of determining examples in advance. (2) A typical federated instruction tuning process based on the generated data. Overall, FedIT-U2S can be applied to diverse scenarios as long as the client holds valuable text corpus, broadening the application scope of federated instruction tuning. We conduct a series of experiments on three domains (medicine, knowledge, and math), showing that our proposed FedIT-U2S can consistently and significantly brings improvement over the base LLM.
摘要:联邦指令微调使多个客户端能够协作微调一个共享的大型语言模型(LLM),使其能够遵循人类指令,而无需直接共享原始数据。然而,现有文献不切实际地要求所有客户端都已持有指令微调数据(即结构化的指令-回复对),由于客户端的数据通常是非结构化文本,这需要大量人工标注。针对这一问题,我们提出了一个新颖而灵活的框架FedIT-U2S,它可以自动将非结构化语料转换为用于联邦指令微调的结构化数据。FedIT-U2S包含两个关键步骤:(1)少样本指令微调数据生成,即将每个非结构化数据片段与若干示例组合,提示LLM生成指令-回复对。为进一步提高灵活性,提出了一种基于检索的示例选择技术,根据客户端数据片段与示例池之间的相关性自动选择示例,从而无需预先确定示例。(2)基于生成数据的典型联邦指令微调过程。总体而言,只要客户端拥有有价值的文本语料,FedIT-U2S就可以应用于多种场景,拓宽了联邦指令微调的应用范围。我们在三个领域(医学、知识和数学)上进行了一系列实验,表明所提出的FedIT-U2S能够持续且显著地改进基础LLM。
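下面示意其中第(1)步"少样本指令数据生成 + 基于检索的示例选择"在客户端本地的大致做法(embed、llm 均为假设接口,并非论文官方实现):

```python
import numpy as np

def generate_instruction_pairs(chunks, example_pool, llm, embed, k=3):
    """将客户端的非结构化文本片段转成(指令, 回复)对的示意流程。"""
    pool_vecs = [embed(ex) for ex in example_pool]
    pairs = []
    for chunk in chunks:
        v = embed(chunk)
        # 基于余弦相似度, 检索与当前片段最相关的 k 条示例
        sims = [float(np.dot(v, p) / (np.linalg.norm(v) * np.linalg.norm(p) + 1e-12))
                for p in pool_vecs]
        demos = [example_pool[i] for i in np.argsort(sims)[-k:]]
        prompt = ("Following the examples, write one instruction-response pair "
                  "grounded in the passage.\n\n" + "\n\n".join(demos)
                  + f"\n\nPassage:\n{chunk}\nPair:")
        pairs.append(llm(prompt))
    return pairs  # 之后再进入常规的联邦指令微调流程(本地训练 + 服务器聚合)
```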

[NLP-20] LLM-based feature generation from text for interpretable machine learning
[NLP-20] 基于LLM从文本生成特征以实现可解释机器学习

链接: https://arxiv.org/abs/2409.07132
作者: Vojtěch Balek,Lukáš Sýkora,Vilém Sklenák,Tomáš Kliegr
关键词-EN: Existing text representations, questionable feature-level interpretability, rule learning due, Existing text, feature-level interpretability
关键词-ZH: 现有文本表示、可疑的特征级可解释性、规则学习到期、现有文本、特征级可解释性
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Existing text representations such as embeddings and bag-of-words are not suitable for rule learning due to their high dimensionality and absent or questionable feature-level interpretability. This article explores whether large language models (LLMs) could address this by extracting a small number of interpretable features from text. We demonstrate this process on two datasets (CORD-19 and M17+) containing several thousand scientific articles from multiple disciplines and a target being a proxy for research impact. An evaluation based on testing for the statistically significant correlation with research impact has shown that LLama 2-generated features are semantically meaningful. We consequently used these generated features in text classification to predict the binary target variable representing the citation rate for the CORD-19 dataset and the ordinal 5-class target representing an expert-awarded grade in the M17+ dataset. Machine-learning models trained on the LLM-generated features provided similar predictive performance to the state-of-the-art embedding model SciBERT for scientific text. The LLM used only 62 features compared to 768 features in SciBERT embeddings, and these features were directly interpretable, corresponding to notions such as article methodological rigor, novelty, or grammatical correctness. As the final step, we extract a small number of well-interpretable action rules. Consistently competitive results obtained with the same LLM feature set across both thematically diverse datasets show that this approach generalizes across domains.
摘要:现有的文本表示(如嵌入和词袋)由于维度高、特征级可解释性缺失或存疑,不适合规则学习。本文探讨大型语言模型(LLM)是否可以通过从文本中提取少量可解释特征来解决这个问题。我们在两个数据集(CORD-19和M17+)上演示了这一过程,它们包含来自多个学科的数千篇科学文章,目标变量是研究影响力的代理指标。基于与研究影响力的统计显著相关性检验的评估表明,LLama 2生成的特征在语义上是有意义的。因此,我们在文本分类中使用这些生成的特征,来预测表示CORD-19数据集引用率的二元目标变量,以及表示M17+数据集中专家评定等级的五类有序目标。在LLM生成特征上训练的机器学习模型,对科学文本取得了与最先进的嵌入模型SciBERT相近的预测性能。LLM只使用了62个特征,而SciBERT嵌入有768个特征,并且这些特征可直接解释,对应于文章方法的严谨性、新颖性或语法正确性等概念。最后一步,我们提取了少量可解释性良好的动作规则。同一LLM特征集在两个主题各异的数据集上都取得了具有竞争力的结果,表明该方法可以跨领域推广。

[NLP-21] Reranking Laws for Language Generation: A Communication-Theoretic Perspective
[NLP-21] 语言生成的重排序定律:通信理论视角

链接: https://arxiv.org/abs/2409.07131
作者: António Farinhas,Haau-Sing Li,André F. T. Martins
关键词-EN: ensure large language, large language models, generate unacceptable answers, LLM generate multiple, ensure large
关键词-ZH: 确保大型语言、大型语言模型,生成不可接受的答案,LLM生成多个,确保大型
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注: Preprint

点击查看摘要

Abstract:To ensure large language models (LLMs) are used safely, one must reduce their propensity to hallucinate or to generate unacceptable answers. A simple and often used strategy is to first let the LLM generate multiple hypotheses and then employ a reranker to choose the best one. In this paper, we draw a parallel between this strategy and the use of redundancy to decrease the error rate in noisy communication channels. We conceptualize the generator as a sender transmitting multiple descriptions of a message through parallel noisy channels. The receiver decodes the message by ranking the (potentially corrupted) descriptions and selecting the one found to be most reliable. We provide conditions under which this protocol is asymptotically error-free (i.e., yields an acceptable answer almost surely) even in scenarios where the reranker is imperfect (governed by Mallows or Zipf-Mandelbrot models) and the channel distributions are statistically dependent. We use our framework to obtain reranking laws which we validate empirically on two real-world tasks using LLMs: text-to-code generation with DeepSeek-Coder 7B and machine translation of medical data with TowerInstruct 13B.
摘要:为了确保大型语言模型(LLM)被安全使用,必须降低其产生幻觉或给出不可接受答案的倾向。一种简单且常用的策略是,先让LLM生成多个假设,再用重排序器选出最好的一个。本文将这一策略与利用冗余来降低噪声通信信道误码率的做法进行了类比。我们将生成器概念化为通过并行噪声信道传输一条消息的多个描述的发送方。接收方通过对(可能被破坏的)描述进行排序并选出最可靠的一个来解码消息。我们给出了即使在重排序器不完美(由Mallows或Zipf-Mandelbrot模型刻画)且信道分布统计相关的情形下,该协议仍渐近无错(即几乎必然产生可接受答案)的条件。我们利用该框架得到了重排序定律,并使用LLM在两个真实任务上进行了经验验证:使用DeepSeek-Coder 7B的文本到代码生成,以及使用TowerInstruct 13B的医学数据机器翻译。
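这一"先采样多个候选、再重排序选优"的流程可以用几行代码示意(llm_sample 与 reranker_score 为假设接口,与论文中的具体生成器和重排序器无关):

```python
def rerank_generate(prompt, llm_sample, reranker_score, n=8):
    """采样 N 个候选假设, 用重排序器打分并返回得分最高者(示意)。"""
    hypotheses = [llm_sample(prompt) for _ in range(n)]      # N 条"并行信道": N 次独立采样
    scores = [reranker_score(prompt, h) for h in hypotheses]
    return max(zip(scores, hypotheses))[1]                   # 选出被认为最"可靠"的候选
```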

[NLP-22] Cross-Refine: Improving Natural Language Explanation Generation by Learning in Tandem
[NLP-22] 交叉细化:通过串联学习改进自然语言解释生成

链接: https://arxiv.org/abs/2409.07123
作者: Qianli Wang,Tatiana Anikina,Nils Feldhus,Simon Ostermann,Sebastian Möller,Vera Schmitt
关键词-EN: large language model, Natural language explanations, Natural language, language model, large language
关键词-ZH: 大型语言模型,自然语言解释,自然语言,语言模型,大型语言
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 17 pages; under review

点击查看摘要

Abstract:Natural language explanations (NLEs) are vital for elucidating the reasoning behind large language model (LLM) decisions. Many techniques have been developed to generate NLEs using LLMs. However, like humans, LLMs might not always produce optimal NLEs on first attempt. Inspired by human learning processes, we introduce Cross-Refine, which employs role modeling by deploying two LLMs as generator and critic, respectively. The generator outputs a first NLE and then refines this initial explanation using feedback and suggestions provided by the critic. Cross-Refine does not require any supervised training data or additional training. We validate Cross-Refine across three NLP tasks using three state-of-the-art open-source LLMs through automatic and human evaluation. We select Self-Refine (Madaan et al., 2023) as the baseline, which only utilizes self-feedback to refine the explanations. Our findings from automatic evaluation and a user study indicate that Cross-Refine outperforms Self-Refine. Meanwhile, Cross-Refine can perform effectively with less powerful LLMs, whereas Self-Refine only yields strong results with ChatGPT. Additionally, we conduct an ablation study to assess the importance of feedback and suggestions. Both of them play an important role in refining explanations. We further evaluate Cross-Refine on a bilingual dataset in English and German.
摘要:自然语言解释(NLE)对于阐明大语言模型(LLM)决策背后的推理至关重要。目前已有许多利用LLM生成NLE的技术。然而,与人类一样,LLM未必总能在第一次尝试时产生最优的NLE。受人类学习过程的启发,我们提出了Cross-Refine,它通过部署两个LLM分别扮演生成器和批评者来进行角色建模。生成器先输出初始NLE,然后利用批评者提供的反馈和建议来完善这一初始解释。Cross-Refine不需要任何有监督的训练数据或额外训练。我们通过自动评估和人工评估,使用三个最先进的开源LLM在三个NLP任务上验证了Cross-Refine。我们选择Self-Refine(Madaan等,2023)作为基线,它仅利用自我反馈来完善解释。自动评估和用户研究的结果表明,Cross-Refine优于Self-Refine。同时,Cross-Refine在能力较弱的LLM上也能有效工作,而Self-Refine只有在ChatGPT上才能取得强劲的结果。此外,我们进行了消融研究,以评估反馈和建议的重要性,二者在完善解释方面都发挥着重要作用。我们还在英语和德语双语数据集上进一步评估了Cross-Refine。
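生成器与批评者的协作可以用如下极简循环示意(generator、critic 为假设的两个LLM调用接口,提示词仅作说明,并非论文原文):

```python
def cross_refine(question, answer, generator, critic, rounds=1):
    """生成器给出初始解释, 批评者提供反馈与建议, 生成器据此修订(示意)。"""
    explanation = generator(f"Question: {question}\nAnswer: {answer}\n"
                            "Explain the reasoning behind this answer.")
    for _ in range(rounds):
        feedback = critic("Point out flaws in the explanation below and suggest improvements.\n"
                          f"Question: {question}\nAnswer: {answer}\nExplanation: {explanation}")
        explanation = generator("Rewrite the explanation using the feedback.\n"
                                f"Question: {question}\nAnswer: {answer}\n"
                                f"Explanation: {explanation}\nFeedback: {feedback}")
    return explanation
```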

[NLP-23] Ontology-Free General-Domain Knowledge Graph-to-Text Generation Dataset Synthesis using Large Language Model
[NLP-23] 使用大型语言模型的无本体通用领域知识图谱到文本生成数据集合成

链接: https://arxiv.org/abs/2409.07088
作者: Daehee Kim,Deokhyung Kang,Sangwon Ryu,Gary Geunbae Lee
关键词-EN: verbalizing structured knowledge, structured knowledge graphs, natural language text, Pretrained Language Models, involves verbalizing structured
关键词-ZH: 语言化结构化知识、结构化知识图、自然语言文本、预训练语言模型,涉及语言化结构化
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 18 pages, 9 figures

点击查看摘要

Abstract:Knowledge Graph-to-Text (G2T) generation involves verbalizing structured knowledge graphs into natural language text. Recent advancements in Pretrained Language Models (PLMs) have improved G2T performance, but their effectiveness depends on datasets with precise graph-text alignment. However, the scarcity of high-quality, general-domain G2T generation datasets restricts progress in the general-domain G2T generation research. To address this issue, we introduce Wikipedia Ontology-Free Graph-text dataset (WikiOFGraph), a new large-scale G2T dataset generated using a novel method that leverages Large Language Model (LLM) and Data-QuestEval. Our new dataset, which contains 5.85M general-domain graph-text pairs, offers high graph-text consistency without relying on external ontologies. Experimental results demonstrate that PLM fine-tuned on WikiOFGraph outperforms those trained on other datasets across various evaluation metrics. Our method proves to be a scalable and effective solution for generating high-quality G2T data, significantly advancing the field of G2T generation.
摘要:知识图到文本(G2T)的生成涉及将结构化知识图描述为自然语言文本。最近在预训练语言模型(PLM)中的进步提高了G2T的性能,但它们的有效性取决于具有精确图文对齐的数据集。然而,高质量、通用域G2T生成数据集的稀缺限制了通用域G2T生成研究的进展。为了解决这个问题,我们引入了Wikipedia Ontology-Free Graph-Text DataSet(WikiOFGraph),这是一个新的大规模G2T数据集,使用了一种新的方法,利用了大型语言模型(LLM)和Data-QuestEval。我们的新数据集包含5.85M个通用领域图文对,在不依赖外部本体的情况下提供了高图文一致性。实验结果表明,在WikiOFGraph上微调的PLM在各种评价指标上都优于在其他数据集上训练的PLM。我们的方法被证明是一种可扩展的有效解决方案,用于生成高质量的G2T数据,显著推动了G2T生成领域的发展。

[NLP-24] Understanding Knowledge Drift in LLMs through Misinformation KDD2024
[NLP-24] 通过错误信息了解LLM的知识漂移

链接: https://arxiv.org/abs/2409.07085
作者: Alina Fastowski,Gjergji Kasneci
关键词-EN: Large Language Models, Large Language, revolutionized numerous applications, Language Models, digital ecosystem
关键词-ZH: 大型语言模型,大型语言,彻底改变了众多应用程序,语言模型,数字生态系统
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 13 pages, 3 figures. Accepted at DELTA workshop at KDD 2024

点击查看摘要

Abstract:Large Language Models (LLMs) have revolutionized numerous applications, making them an integral part of our digital ecosystem. However, their reliability becomes critical, especially when these models are exposed to misinformation. We primarily analyze the susceptibility of state-of-the-art LLMs to factual inaccuracies when they encounter false information in a QnA scenario, an issue that can lead to a phenomenon we refer to as knowledge drift, which significantly undermines the trustworthiness of these models. We evaluate the factuality and the uncertainty of the models’ responses relying on Entropy, Perplexity, and Token Probability metrics. Our experiments reveal that an LLM’s uncertainty can increase up to 56.6% when the question is answered incorrectly due to the exposure to false information. At the same time, repeated exposure to the same false information can decrease the models uncertainty again (-52.8% w.r.t. the answers on the untainted prompts), potentially manipulating the underlying model’s beliefs and introducing a drift from its original knowledge. These findings provide insights into LLMs’ robustness and vulnerability to adversarial inputs, paving the way for developing more reliable LLM applications across various domains. The code is available at this https URL.
摘要:大型语言模型(LLM)使众多应用发生了革命性变化,成为我们数字生态系统中不可或缺的一部分。然而,它们的可靠性变得至关重要,尤其是当这些模型接触到错误信息时。我们主要分析最先进的LLM在问答(QnA)场景中遇到虚假信息时对事实性错误的敏感性,这一问题可能导致我们称为"知识漂移"的现象,显著削弱这些模型的可信度。我们基于熵、困惑度和token概率指标来评估模型回答的事实性和不确定性。实验表明,当模型因接触虚假信息而错误回答问题时,其不确定性最多可增加56.6%。与此同时,反复接触同一虚假信息又会再次降低模型的不确定性(相对于未受污染提示上的答案下降52.8%),这可能操纵底层模型的信念,使其偏离原有知识。这些发现为LLM对对抗性输入的鲁棒性和脆弱性提供了洞见,为在各个领域开发更可靠的LLM应用铺平了道路。代码可在此 https URL 获取。
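论文用熵、困惑度和 token 概率来量化回答的不确定性;下面给出前两者的极简计算示意(假设已能拿到模型的输出概率,与论文的具体实现无关):

```python
import math

def distribution_entropy(probs):
    """单步输出分布的熵 H(p) = -sum(p * log p)(probs 为词表上的概率分布)。"""
    return -sum(p * math.log(p) for p in probs if p > 0)

def sequence_perplexity(token_probs):
    """困惑度: 对已生成各 token 概率的平均负对数似然取指数。"""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# 用法示意: 不确定性越高, 熵与困惑度越大
print(distribution_entropy([0.7, 0.2, 0.1]))
print(sequence_perplexity([0.9, 0.8, 0.95]))
```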

[NLP-25] Latent Space Interpretation for Stylistic Analysis and Explainable Authorship Attribution
[NLP-25] 文体分析和可解释作者归因的潜在空间解释

链接: https://arxiv.org/abs/2409.07072
作者: Milad Alshomary,Narutatsu Ri,Marianna Apidianaki,Ajay Patel,Smaranda Muresan,Kathleen McKeown
关键词-EN: authorship attribution methods, methods learn authorship, learn authorship representations, attribution methods learn, hindering their usability
关键词-ZH: 作者归因方法,学习作者身份的表示,归因方法学习,阻碍其可用性
类目: Computation and Language (cs.CL)
备注: 8 pages, 8 figures, under review

点击查看摘要

Abstract:Recent state-of-the-art authorship attribution methods learn authorship representations of texts in a latent, non-interpretable space, hindering their usability in real-world applications. Our work proposes a novel approach to interpreting these learned embeddings by identifying representative points in the latent space and utilizing LLMs to generate informative natural language descriptions of the writing style of each point. We evaluate the alignment of our interpretable space with the latent one and find that it achieves the best prediction agreement compared to other baselines. Additionally, we conduct a human evaluation to assess the quality of these style descriptions, validating their utility as explanations for the latent space. Finally, we investigate whether human performance on the challenging AA task improves when aided by our system’s explanations, finding an average improvement of around +20% in accuracy.
摘要:最近最先进的作者归属方法在潜在的、不可解释的空间中学习文本的作者归属表示,从而阻碍了它们在现实世界应用程序中的可用性。我们的工作提出了一种新颖的方法来解释这些习得的嵌入,通过识别潜在空间中的代表点并利用LLM生成每个点写作风格的信息丰富的自然语言描述。我们评估了可解释空间与潜在空间的一致性,并发现与其他基线相比,它实现了最好的预测一致性。此外,我们还进行了人为评估,以评估这些风格描述的质量,验证它们作为潜在空间解释的实用性。最后,我们调查了在系统解释的帮助下,人类在具有挑战性的AA任务中的表现是否有所提高,发现准确性平均提高了+20%左右。

[NLP-26] Automated Speaking Assessment of Conversation Tests with Novel Graph-based Modeling on Spoken Response Coherence
[NLP-26] 使用基于图形的新型口语响应一致性建模对对话测试进行自动口语评估

链接: https://arxiv.org/abs/2409.07064
作者: Jiun-Ting Li,Bi-Cheng Yan,Tien-Hong Lo,Yi-Cheng Wang,Yung-Chang Hsu,Berlin Chen
关键词-EN: Automated speaking assessment, Automated speaking, aims to evaluate, speaking proficiency, interlocutor interacts
关键词-ZH: 自动演讲评估,自动演讲,旨在评估、演讲能力、对话者互动
类目: Computation and Language (cs.CL)
备注: Accepted by IEEE SLT 2024

点击查看摘要

Abstract:Automated speaking assessment in conversation tests (ASAC) aims to evaluate the overall speaking proficiency of an L2 (second-language) speaker in a setting where an interlocutor interacts with one or more candidates. Although prior ASAC approaches have shown promising performance on their respective datasets, there is still a dearth of research specifically focused on incorporating the coherence of the logical flow within a conversation into the grading model. To address this critical challenge, we propose a hierarchical graph model that aptly incorporates both broad inter-response interactions (e.g., discourse relations) and nuanced semantic information (e.g., semantic words and speaker intents), which is subsequently fused with contextual information for the final prediction. Extensive experimental results on the NICT-JLE benchmark dataset suggest that our proposed modeling approach can yield considerable improvements in prediction accuracy with respect to various assessment metrics, as compared to some strong baselines. This also sheds light on the importance of investigating coherence-related facets of spoken responses in ASAC.
摘要:会话测试中的自动口语评估(ASAC)旨在在对话者与一名或多名考生互动的情境下,评估二语(第二语言)说话人的整体口语水平。虽然以前的ASAC方法在各自的数据集上表现出了良好的性能,但仍然缺乏专门关注将对话中逻辑流的连贯性纳入评分模型的研究。为了解决这一关键挑战,我们提出了一个层次图模型,该模型恰当地结合了宏观的回应间交互(例如,话语关系)和细微的语义信息(例如,语义词和说话人意图),随后再将其与上下文信息融合以进行最终预测。在NICT-JLE基准数据集上的大量实验结果表明,与一些强基线相比,我们提出的建模方法可以在各种评估指标上显著提高预测精度。这也说明了研究ASAC口语回答中连贯性相关方面的重要性。

[NLP-27] Legal Fact Prediction: Task Definition and Dataset Construction
[NLP-27] 法律事实预测:任务定义和数据集构建

链接: https://arxiv.org/abs/2409.07055
作者: Junkai Liu,Yujie Tong,Hui Huang,Shuyuan Zheng,Muyun Yang,Peicheng Wu,Makoto Onizuka,Chuan Xiao
关键词-EN: proven by acknowledged, Legal facts refer, Legal, Legal facts, facts
关键词-ZH: 经承认证明,法律事实参考,法律,法律事实,事实
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Legal facts refer to the facts that can be proven by acknowledged evidence in a trial. They form the basis for the determination of court judgments. This paper introduces a novel NLP task: legal fact prediction, which aims to predict the legal fact based on a list of evidence. The predicted facts can instruct the parties and their lawyers involved in a trial to strengthen their submissions and optimize their strategies during the trial. Moreover, since real legal facts are difficult to obtain before the final judgment, the predicted facts also serve as an important basis for legal judgment prediction. We construct a benchmark dataset consisting of evidence lists and ground-truth legal facts for real civil loan cases, LFPLoan. Our experiments on this dataset show that this task is non-trivial and requires further considerable research efforts.
摘要:法律事实是指在审判中能够被公认的证据证明的事实。它们构成了法院判决的基础。本文介绍了一种新颖的NLP任务:法律事实预测,旨在根据证据列表预测法律事实。预测的事实可以指导参与审判的当事人及其律师加强提交材料并优化审判期间的策略。而且,由于在最终判决前很难获得真实的法律事实,因此预测的事实也是法律判决预测的重要依据。我们构建了一个基准数据集LFPLoan,由真实民事贷款案件的证据列表和真实(ground-truth)法律事实组成。我们对该数据集的实验表明,这项任务并非易事,需要进一步做出大量的研究努力。

[NLP-28] Native vs Non-Native Language Prompting: A Comparative Analysis
[NLP-28] 母语与非母语预算:比较分析

链接: https://arxiv.org/abs/2409.07054
作者: Mohamed Bayan Kmainasi,Rakif Khan,Ali Ezzat Shahroor,Boushra Bendou,Maram Hasanain,Firoj Alam
关键词-EN: Natural Language Processing, including standard Natural, shown remarkable abilities, Large language models, standard Natural Language
关键词-ZH: 自然语言处理,包括标准自然语言,表现出非凡的能力,大型语言模型,标准自然语言
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Foundation Models, Large Language Models, Arabic NLP, LLMs, Native, Contextual Understanding, Arabic LLM

点击查看摘要

Abstract:Large language models (LLMs) have shown remarkable abilities in different fields, including standard Natural Language Processing (NLP) tasks. To elicit knowledge from LLMs, prompts play a key role, consisting of natural language instructions. Most open and closed source LLMs are trained on available labeled and unlabeled resources–digital content such as text, images, audio, and videos. Hence, these models have better knowledge for high-resourced languages but struggle with low-resourced languages. Since prompts play a crucial role in understanding their capabilities, the language used for prompts remains an important research question. Although there has been significant research in this area, it is still limited, and less has been explored for medium to low-resourced languages. In this study, we investigate different prompting strategies (native vs. non-native) on 11 different NLP tasks associated with 12 different Arabic datasets (9.7K data points). In total, we conducted 197 experiments involving 3 LLMs, 12 datasets, and 3 prompting strategies. Our findings suggest that, on average, the non-native prompt performs the best, followed by mixed and native prompts.
摘要:大型语言模型在包括标准自然语言处理(NLP)任务在内的不同领域都表现出了卓越的能力。要从LLMS获取知识,提示起着关键作用,由自然语言指令组成。大多数开放源代码和封闭源代码的LLM都接受了关于可用的已标记和未标记资源的培训–数字内容,如文本、图像、音频和视频。因此,这些模型对资源丰富的语言有更好的了解,但在资源匮乏的语言中却举步维艰。由于提示在理解其能力方面起着至关重要的作用,提示所使用的语言仍然是一个重要的研究问题。虽然在这一领域已经有了重要的研究,但仍然是有限的,对中到低资源语言的探索更少。在这项研究中,我们考察了不同的提示策略(母语和非母语)在11个不同的自然语言处理任务上与12个不同的阿拉伯语数据集(9.7K数据点)相关联。我们总共进行了197个实验,涉及3个LLM、12个数据集和3个提示策略。我们的研究结果表明,平均而言,非母语提示的表现最好,其次是混合提示和母语提示。

[NLP-29] Beyond IID: Optimizing Instruction Learning from the Perspective of Instruction Interaction and Dependency
[NLP-29] 超越IID:从指令交互与依赖的角度优化指令学习

链接: https://arxiv.org/abs/2409.07045
作者: Hanyu Zhao,Li Du,Yiming Ju,Chengwei Wu,Tengfei Pan
关键词-EN: large language models, fine-tune large language, language models, pivotal challenge, effectively select
关键词-ZH: 大型语言模型,微调大型语言,语言模型,关键挑战,有效选择
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:With the availability of various instruction datasets, a pivotal challenge is how to effectively select and integrate these instructions to fine-tune large language models (LLMs). Previous research mainly focuses on selecting individual high-quality instructions. However, these works overlooked the joint interactions and dependencies between different categories of instructions, leading to suboptimal selection strategies. Moreover, the nature of these interaction patterns remains largely unexplored, let alone optimize the instruction set with regard to them. To fill these gaps, in this paper, we: (1) systemically investigate interaction and dependency patterns between different categories of instructions, (2) manage to optimize the instruction set concerning the interaction patterns using a linear programming-based method, and optimize the learning schema of SFT using an instruction dependency taxonomy guided curriculum learning. Experimental results across different LLMs demonstrate improved performance over strong baselines on widely adopted benchmarks.
摘要:随着各种指令数据集的出现,如何有效地选择和集成这些指令来微调大型语言模型(LLM)是一个关键的挑战。以往的研究主要集中在选择单条高质量指令。然而,这些工作忽略了不同类别指令之间的联合交互和依赖关系,导致了次优选择策略。此外,这些交互模式的性质在很大程度上仍未得到探索,更不用说针对它们优化指令集了。为了填补这些空白,在本文中,我们:(1)系统地研究不同类别指令之间的交互和依赖模式;(2)使用基于线性规划的方法来优化与交互模式相关的指令集,并使用指令依赖分类指导的课程学习来优化SFT的学习模式。在不同LLM上的实验结果表明,在广泛采用的基准上,性能优于强基线。
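
A minimal illustration of how a linear program can allocate an instruction-selection budget across categories, in the spirit of the abstract. The utility scores, availability counts, and budget below are invented, and the paper's actual objective (which encodes interaction patterns between categories) is not reproduced here.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical per-category utility scores (e.g., estimated benefit of one more
# instruction from each category, possibly adjusted for interaction effects).
utility = np.array([0.8, 0.5, 0.9, 0.3, 0.6])
available = np.array([4000, 6000, 2000, 5000, 3000])  # instructions available per category
budget = 10000                                         # total instructions to select

# linprog minimizes, so negate the utility to maximize it.
res = linprog(
    c=-utility,
    A_ub=np.ones((1, len(utility))), b_ub=[budget],    # total-budget constraint
    bounds=[(0, a) for a in available],                # cannot exceed availability
    method="highs",
)
mix = np.round(res.x).astype(int)
print("selected instructions per category:", mix, "total:", mix.sum())
```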

[NLP-30] You Have Thirteen Hours in Which to Solve the Labyrinth: Enhancing AI Game Masters with Function Calling ACL2024
[NLP-30] 你有十三个小时来解决迷宫:用函数调用增强人工智能游戏大师

链接: https://arxiv.org/abs/2409.06949
作者: Jaewoo Song,Andrew Zhu,Chris Callison-Burch
关键词-EN: large language models, challenging task due, Jim Henson Labyrinth, game master role, game master
关键词-ZH: 大型语言模型,具有挑战性的任务,吉姆·汉森·拉布瑞斯,游戏大师角色,游戏大师
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Wordplay Workshop @ ACL 2024

点击查看摘要

Abstract:Developing a consistent and reliable AI game master for text-based games is a challenging task due to the limitations of large language models (LLMs) and the complexity of the game master’s role. This paper presents a novel approach to enhance AI game masters by leveraging function calling in the context of the table-top role-playing game “Jim Henson’s Labyrinth: The Adventure Game.” Our methodology involves integrating game-specific controls through functions, which we show improves the narrative quality and state update consistency of the AI game master. The experimental results, based on human evaluations and unit tests, demonstrate the effectiveness of our approach in enhancing gameplay experience and maintaining coherence with the game state. This work contributes to the advancement of game AI and interactive storytelling, offering insights into the design of more engaging and consistent AI-driven game masters.
摘要:由于大型语言模型(LLM)的局限性和游戏大师角色的复杂性,为基于文本的游戏开发一致且可靠的人工智能游戏大师是一项具有挑战性的任务。本文提出了一种新颖的方法,通过在桌面角色扮演游戏“吉姆·汉森的迷宫:冒险游戏”的背景下利用函数调用来增强人工智能游戏大师。“我们的方法涉及通过功能集成特定于游戏的控制,我们证明这可以提高人工智能游戏大师的叙事质量和状态更新一致性。基于人类评估和单元测试的实验结果证明了我们的方法在增强游戏体验和保持与游戏状态一致性方面的有效性。这项工作有助于游戏人工智能和交互式讲故事的进步,为设计更引人入胜、更一致的人工智能驱动游戏大师提供了见解。
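
A rough sketch of the function-calling idea for a game master: the language model proposes a call, and host code executes it against game state. The JSON schema, tool names, and game-state fields are illustrative assumptions, not the paper's implementation.

```python
import json, random

# Hypothetical game-state functions the AI game master is allowed to call.
GAME_STATE = {"hours_left": 13, "inventory": ["lantern"]}

def advance_time(hours: int) -> str:
    GAME_STATE["hours_left"] -= hours
    return f"{GAME_STATE['hours_left']} hours remain to solve the Labyrinth."

def roll_check(difficulty: int) -> str:
    roll = random.randint(1, 20)
    return f"Rolled {roll} vs difficulty {difficulty}: {'success' if roll >= difficulty else 'failure'}."

TOOLS = {"advance_time": advance_time, "roll_check": roll_check}

def dispatch(tool_call_json: str) -> str:
    """Execute a function call proposed by the language model.
    The JSON format {'name': ..., 'arguments': {...}} is an assumption;
    real function-calling APIs define their own schemas."""
    call = json.loads(tool_call_json)
    return TOOLS[call["name"]](**call["arguments"])

# Example: the model decided the party spent two hours exploring.
print(dispatch('{"name": "advance_time", "arguments": {"hours": 2}}'))
print(dispatch('{"name": "roll_check", "arguments": {"difficulty": 12}}'))
```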

[NLP-31] Representation Tuning
[NLP-31] 表示调整

链接: https://arxiv.org/abs/2409.06927
作者: Christopher M. Ackerman
关键词-EN: large language models, increasingly popular, large language, vectors, online control
关键词-ZH: 大型语言模型,日益流行,大型语言,载体,在线控制
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 9 pages, 6 figures, 6 tables

点击查看摘要

Abstract:Activation engineering is becoming increasingly popular as a means of online control of large language models (LLMs). In this work, I extend the idea of active steering with vectors that represent a behavioral direction of interest to tuning those vectors directly into the model, obviating the need for online control. First, I identify activation vectors related to honesty in an open-source LLM (Llama- 2-13b-chat). Next, I demonstrate that model output can be made more or less honest by adding positive or negative multiples of these vectors to residual stream activations during generation. Then, I show that a similar effect can be achieved by fine-tuning the vectors directly into the model, by use of a dual loss function based on the cosine similarity of residual stream activations to the vectors combined with a standard token-based loss (“representation tuning”). Finally, I compare the generations in response to honesty-probing prompts from the resulting models to those from models fine-tuned with a token-based loss alone, and to those from the untuned model subjected to online steering. Overall, fine-tuning the vectors into the models using the cosine similarity plus token loss showed a stronger effect than online steering, and generalized better than using the standard loss, suggesting the potential utility of this approach as a safety measure. Code and data are available at this https URL tuned models are available at this https URL representation-tuning-66da1e5ab41cd1b824687d9f.
摘要:激活工程作为在线控制大型语言模型(LLM)的一种手段正变得越来越流行。在这项工作中,我扩展了使用代表目标行为方向的向量进行主动引导的想法,将这些向量直接微调进模型,从而消除了在线控制的需要。首先,我在开源LLM(Llama-2-13b-chat)中识别与诚实相关的激活向量。接下来,我演示了通过在生成过程中将这些向量的正负倍数添加到残差流激活中,可以使模型输出变得更诚实或更不诚实。然后,我展示了通过使用双重损失函数(即“表示调优”),也就是基于残差流激活与向量的余弦相似度的损失加上标准的基于令牌的损失,将向量直接微调进模型也能达到类似的效果。最后,我比较了所得模型对诚实探测提示的生成结果,与仅用基于令牌的损失微调的模型以及接受在线引导的未调优模型的生成结果。总体而言,使用余弦相似度加令牌损失将向量微调进模型的效果比在线引导更强,泛化效果也优于仅使用标准损失,这表明该方法作为一种安全措施具有潜在的实用价值。代码和数据可在此HTTPS URL获得;微调后的模型可在此HTTPS URL(representation-tuning-66da1e5ab41cd1b824687d9f)获得。
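
The dual loss can be pictured as token cross-entropy plus a cosine-similarity term on residual-stream activations. The PyTorch sketch below is a simplified, single-layer reading of the abstract with made-up tensor shapes; it is not the released code linked above.

```python
import torch
import torch.nn.functional as F

def representation_tuning_loss(logits, labels, resid_acts, steering_vec, lam=1.0):
    """Dual-loss sketch: standard token cross-entropy plus a term pushing
    residual-stream activations toward a behavior (e.g., honesty) direction
    via cosine similarity. Shapes and the single-layer simplification are
    assumptions, not the paper's exact setup."""
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))
    # Mean cosine similarity between each position's residual activation and
    # the behavior direction; maximizing it means minimizing (1 - cos).
    cos = F.cosine_similarity(resid_acts, steering_vec.expand_as(resid_acts), dim=-1)
    return ce + lam * (1.0 - cos.mean())

# Toy tensors: batch of 2, sequence of 5, vocab of 11, hidden size 8.
logits = torch.randn(2, 5, 11)
labels = torch.randint(0, 11, (2, 5))
resid = torch.randn(2, 5, 8)
direction = torch.randn(8)
print(representation_tuning_loss(logits, labels, resid, direction).item())
```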

[NLP-32] A Dataset for Evaluating LLM-based Evaluation Functions for Research Question Extraction Task
[NLP-32] 用于评估研究问题提取任务基于LLM的评估函数的数据集

链接: https://arxiv.org/abs/2409.06883
作者: Yuya Fujisaki,Shiro Takagi,Hideki Asoh,Wataru Kumagai
关键词-EN: text summarization techniques, progress in text, task, text summarization, summarization techniques
关键词-ZH: 文本摘要技术、文本进展、任务、文本摘要、摘要技术
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The progress in text summarization techniques has been remarkable. However the task of accurately extracting and summarizing necessary information from highly specialized documents such as research papers has not been sufficiently investigated. We are focusing on the task of extracting research questions (RQ) from research papers and construct a new dataset consisting of machine learning papers, RQ extracted from these papers by GPT-4, and human evaluations of the extracted RQ from multiple perspectives. Using this dataset, we systematically compared recently proposed LLM-based evaluation functions for summarizations, and found that none of the functions showed sufficiently high correlations with human evaluations. We expect our dataset provides a foundation for further research on developing better evaluation functions tailored to the RQ extraction task, and contribute to enhance the performance of the task. The dataset is available at this https URL.
摘要:文本摘要技术取得了显著的进步。然而,从研究论文等高度专业化的文件中准确提取和总结必要信息的任务尚未得到充分的研究。我们专注于从研究论文中提取研究问题(RQ)的任务,并构建一个新的数据集,该数据集由机器学习论文、GPT-4从这些论文中提取的RQ以及从多个角度对提取的RQ进行的人类评估组成。使用该数据集,我们系统地比较了最近提出的用于摘要的基于LLM的评估函数,发现这些函数都没有表现出与人类评估足够高的相关性。我们希望我们的数据集为进一步研究开发针对RQ提取任务量身定制的更好评估函数提供基础,并有助于提高该任务的性能。该数据集可在此HTTPS URL中获取。
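
Checking how well an LLM-based evaluation function tracks human judgments typically reduces to a correlation computation. The scores below are invented purely for illustration.

```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical scores for ten papers: human ratings of extracted RQs versus
# scores produced by an LLM-based evaluation function.
human = [4, 5, 3, 2, 5, 1, 4, 3, 2, 4]
llm_eval = [3.8, 4.2, 3.1, 2.9, 4.5, 2.0, 3.5, 3.3, 2.4, 3.6]

r, _ = pearsonr(human, llm_eval)      # linear agreement
rho, _ = spearmanr(human, llm_eval)   # rank agreement
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```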

[NLP-33] NSP: A Neuro-Symbolic Natural Language Navigational Planner
[NLP-33] NSP:神经符号自然语言导航规划者

链接: https://arxiv.org/abs/2409.06859
作者: William English,Dominic Simon,Rickard Ewetz,Sumit Jha
关键词-EN: instructions hold promise, language instructions hold, natural language inputs, free-form natural language, natural language instructions
关键词-ZH: 指令信守承诺,语言指令信守,自然语言输入,自由形式自然语言,自然语言指令
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: 8 pages

点击查看摘要

Abstract:Path planners that can interpret free-form natural language instructions hold promise to automate a wide range of robotics applications. These planners simplify user interactions and enable intuitive control over complex semi-autonomous systems. While existing symbolic approaches offer guarantees on the correctness and efficiency, they struggle to parse free-form natural language inputs. Conversely, neural approaches based on pre-trained Large Language Models (LLMs) can manage natural language inputs but lack performance guarantees. In this paper, we propose a neuro-symbolic framework for path planning from natural language inputs called NSP. The framework leverages the neural reasoning abilities of LLMs to i) craft symbolic representations of the environment and ii) a symbolic path planning algorithm. Next, a solution to the path planning problem is obtained by executing the algorithm on the environment representation. The framework uses a feedback loop from the symbolic execution environment to the neural generation process to self-correct syntax errors and satisfy execution time constraints. We evaluate our neuro-symbolic approach using a benchmark suite with 1500 path-planning problems. The experimental evaluation shows that our neuro-symbolic approach produces 90.1% valid paths that are on average 19-77% shorter than state-of-the-art neural approaches.
摘要:能够解释自由格式自然语言指令的路径规划器有望实现广泛的机器人应用自动化。这些规划器简化了用户交互,并允许对复杂的半自主系统进行直观控制。虽然现有的符号方法提供了对正确性和效率的保证,但它们很难解析自由格式的自然语言输入。相反,基于预先训练的大语言模型(LLM)的神经方法可以处理自然语言输入,但缺乏性能保证。本文提出了一种基于自然语言输入的神经符号路径规划框架NSP。该框架利用LLM的神经推理能力来 i) 构建环境的符号表示,以及 ii) 构建符号路径规划算法。然后,通过在该环境表示上执行该算法,得到路径规划问题的解。该框架使用从符号执行环境到神经生成过程的反馈循环来自我纠正语法错误并满足执行时间约束。我们使用一个包含1500个路径规划问题的基准测试套件来评估我们的神经符号方法。实验评估表明,我们的神经符号方法产生了90.1%的有效路径,比最先进的神经方法平均缩短了19%-77%。
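
A toy illustration of the neuro-symbolic split: the LLM's output is mocked as a small symbolic graph, and a classical shortest-path search does the planning. The node names, the "avoid" field, and the graph format are assumptions for illustration, not NSP's actual interface.

```python
import networkx as nx

# The neural step (not shown) would prompt an LLM to turn free-form
# instructions such as "go from the lobby to the lab, avoid the atrium"
# into a symbolic graph plus start/goal nodes. Here that output is mocked.
llm_environment = {
    "edges": [("lobby", "hallway", 1), ("hallway", "lab", 2),
              ("lobby", "atrium", 1), ("atrium", "lab", 1)],
    "avoid": ["atrium"], "start": "lobby", "goal": "lab",
}

def symbolic_plan(env):
    """Symbolic step: run a classical shortest-path search on the graph the
    language model produced, honoring its 'avoid' constraints."""
    g = nx.Graph()
    g.add_weighted_edges_from(env["edges"])
    g.remove_nodes_from(env["avoid"])
    return nx.shortest_path(g, env["start"], env["goal"], weight="weight")

print(symbolic_plan(llm_environment))  # ['lobby', 'hallway', 'lab']
```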

[NLP-34] What is the Role of Small Models in the LLM Era: A Survey
[NLP-34] LLM时代小模型的作用是什么:一项调查

链接: https://arxiv.org/abs/2409.06857
作者: Lihu Chen,Gaël Varoquaux
关键词-EN: Large Language Models, artificial general intelligence, Large Language, increasingly large models, made significant progress
关键词-ZH: 大语言模型、人工通用智能、大语言、越来越大的模型,取得了重大进展
类目: Computation and Language (cs.CL)
备注: a survey paper of small models

点击查看摘要

Abstract:Large Language Models (LLMs) have made significant progress in advancing artificial general intelligence (AGI), leading to the development of increasingly large models such as GPT-4 and LLaMA-405B. However, scaling up model sizes results in exponentially higher computational costs and energy consumption, making these models impractical for academic researchers and businesses with limited resources. At the same time, Small Models (SMs) are frequently used in practical settings, although their significance is currently underestimated. This raises important questions about the role of small models in the era of LLMs, a topic that has received limited attention in prior research. In this work, we systematically examine the relationship between LLMs and SMs from two key perspectives: Collaboration and Competition. We hope this survey provides valuable insights for practitioners, fostering a deeper understanding of the contribution of small models and promoting more efficient use of computational resources. The code is available at this https URL
摘要:大语言模型(LLM)在推进通用人工智能(AGI)方面取得了重大进展,催生了GPT-4和LLaMA-405B等越来越大的模型。然而,扩大模型规模会导致计算成本和能源消耗成倍增加,这使得这些模型对于资源有限的学术研究人员和企业来说并不现实。与此同时,小模型(SM)在实际场景中被频繁使用,尽管它们的重要性目前被低估了。这引发了关于小模型在LLM时代所扮演角色的重要问题,而这一主题在以前的研究中得到的关注有限。在这项工作中,我们从合作与竞争两个关键视角系统地考察了LLM与SM之间的关系。我们希望这项综述为从业者提供有价值的见解,促进更深入地了解小模型的贡献,并促进更有效地使用计算资源。代码可在此HTTPS URL中找到

[NLP-35] PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation
[NLP-35] PingPong:具有用户模拟和多模型评估的角色扮演语言模型基准

链接: https://arxiv.org/abs/2409.06820
作者: Ilya Gusev
关键词-EN: language models, approach leverages language, leverages language models, role-playing capabilities, model
关键词-ZH: 语言模型,方法利用语言,利用语言模型,角色扮演能力,模型
类目: Computation and Language (cs.CL)
备注: 4 main pages

点击查看摘要

Abstract:We introduce a novel benchmark for evaluating the role-playing capabilities of language models. Our approach leverages language models themselves to emulate users in dynamic, multi-turn conversations and to assess the resulting dialogues. The framework consists of three main components: a player model assuming a specific character role, an interrogator model simulating user behavior, and a judge model evaluating conversation quality. We conducted experiments comparing automated evaluations with human annotations to validate our approach, demonstrating strong correlations across multiple criteria. This work provides a foundation for a robust and dynamic evaluation of model capabilities in interactive scenarios.
摘要:我们引入了一种新颖的基准来评估语言模型的角色扮演能力。我们的方法利用语言模型本身来模拟动态、多回合对话中的用户,并评估最终的对话。该框架由三个主要组件组成:承担特定人物角色的玩家模型、模拟用户行为的询问者模型以及评估对话质量的评判模型。我们进行了将自动评估与人类注释进行比较的实验以验证我们的方法,结果表明二者在多个标准上具有很强的相关性。这项工作为交互场景中模型能力的稳健和动态评估提供了基础。

[NLP-36] Decomposition of surprisal: Unified computational model of ERP components in language processing
[NLP-36] 惊异度的分解:语言处理中ERP成分的统一计算模型

链接: https://arxiv.org/abs/2409.06803
作者: Jiaxuan Li,Richard Futrell
关键词-EN: psycholinguistics for decades, language-related ERP components, central debate, debate in psycholinguistics, language-related ERP
关键词-ZH: 几十年来的心理语言学、语言相关的ERP成分、核心争论、心理语言学中的争论、语言相关的ERP
类目: Computation and Language (cs.CL); Information Theory (cs.IT)
备注:

点击查看摘要

Abstract:The functional interpretation of language-related ERP components has been a central debate in psycholinguistics for decades. We advance an information-theoretic model of human language processing in the brain in which incoming linguistic input is processed at first shallowly and later with more depth, with these two kinds of information processing corresponding to distinct electroencephalographic signatures. Formally, we show that the information content (surprisal) of a word in context can be decomposed into two quantities: (A) heuristic surprise, which signals shallow processing difficulty for a word, and corresponds with the N400 signal; and (B) discrepancy signal, which reflects the discrepancy between shallow and deep interpretations, and corresponds to the P600 signal. Both of these quantities can be estimated straightforwardly using modern NLP models. We validate our theory by successfully simulating ERP patterns elicited by a variety of linguistic manipulations in previously-reported experimental data from six experiments, with successful novel qualitative and quantitative predictions. Our theory is compatible with traditional cognitive theories assuming a `good-enough’ heuristic interpretation stage, but with a precise information-theoretic formulation. The model provides an information-theoretic model of ERP components grounded on cognitive processes, and brings us closer to a fully-specified neuro-computational model of language processing.
摘要:语言相关ERP(事件相关电位)成分的功能解释几十年来一直是心理语言学中的核心争论。我们提出了一个关于人脑语言处理的信息论模型:输入的语言信息先被浅层处理,随后再进行更深入的处理,这两种信息处理分别对应不同的脑电特征。形式上,我们证明了一个词在上下文中的信息量(惊异度)可以分解为两个量:(A)启发式惊异,它标志着一个词的浅层处理困难,对应于N400信号;(B)差异信号,它反映浅层解释与深层解释之间的差异,对应于P600信号。这两个量都可以使用现代NLP模型直接估计。我们通过在先前报道的六个实验的数据中成功模拟各种语言操纵引发的ERP模式来验证我们的理论,并成功做出了新的定性和定量预测。我们的理论与假设存在“足够好”启发式解释阶段的传统认知理论相容,但具有精确的信息论表述。该模型提供了一个基于认知过程的ERP成分信息论模型,使我们更接近一个完全明确的语言处理神经计算模型。
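
The decomposition stated in the abstract is an exact identity once a shallow model q and a deep model p are fixed; the toy probabilities below stand in for model estimates and are not from the paper.

```python
import math

def decompose_surprisal(p_deep, q_shallow):
    """Decompose the full surprisal -log p(w|c) of a word into
    (A) heuristic surprise -log q(w|c) under a shallow model q (linked to N400) and
    (B) a discrepancy term log q(w|c)/p(w|c) between shallow and deep
    interpretations (linked to P600). The two terms sum back to the surprisal."""
    surprisal = -math.log(p_deep)
    heuristic_surprise = -math.log(q_shallow)
    discrepancy = math.log(q_shallow / p_deep)
    assert abs(surprisal - (heuristic_surprise + discrepancy)) < 1e-9
    return heuristic_surprise, discrepancy

# Toy case: a word the deep parse finds plausible (p=0.2) but a shallow,
# associative reading finds surprising (q=0.02) -> large N400-like term.
print(decompose_surprisal(p_deep=0.2, q_shallow=0.02))
```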

[NLP-37] Translating Step-by-Step: Decomposing the Translation Process for Improved Translation Quality of Long-Form Texts
[NLP-37] 逐步翻译:分解翻译过程以提高长篇文本的翻译质量

链接: https://arxiv.org/abs/2409.06790
作者: Eleftheria Briakou,Jiaming Luo,Colin Cherry,Markus Freitag
关键词-EN: long-form text translation, approach to long-form, drawing on established, paper we present, long-form text
关键词-ZH: 长篇文本翻译,长篇方法,借鉴既定的,我们呈现的论文,长篇文本
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In this paper we present a step-by-step approach to long-form text translation, drawing on established processes in translation studies. Instead of viewing machine translation as a single, monolithic task, we propose a framework that engages language models in a multi-turn interaction, encompassing pre-translation research, drafting, refining, and proofreading, resulting in progressively improved translations. Extensive automatic evaluations using Gemini 1.5 Pro across ten language pairs show that translating step-by-step yields large translation quality improvements over conventional zero-shot prompting approaches and earlier human-like baseline strategies, resulting in state-of-the-art results on WMT2024.
摘要:在本文中,我们借鉴翻译研究中的既定流程,提出了一种逐步进行长篇文本翻译的方法。我们没有将机器翻译视为一项单一的整体任务,而是提出了一个框架,让语言模型进行多轮交互,包括翻译前研究、起草、精炼和校对,从而逐步改进翻译。使用Gemini 1.5 Pro对十种语言对进行的广泛自动评估表明,与传统的零样本提示方法和早期的类人基线策略相比,逐步翻译可以大幅提高翻译质量,从而在WMT 2024上获得最先进的结果。
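
A pipeline-shaped sketch of the step-by-step idea, with an unimplemented llm() placeholder standing in for whatever model client is used. The stage order follows the abstract; the prompt wording is illustrative, not the paper's.

```python
def llm(prompt: str) -> str:
    """Placeholder for a chat-model call (wire up your own client here)."""
    raise NotImplementedError("connect this to an actual language model")

def translate_step_by_step(source_text: str, tgt_lang: str = "English") -> str:
    """Multi-turn pipeline mirroring the stages named in the abstract:
    pre-translation research, drafting, refining, proofreading."""
    notes = llm(f"List terminology, idioms and context a translator should "
                f"research before translating into {tgt_lang}:\n{source_text}")
    draft = llm(f"Using these notes:\n{notes}\nProduce a first {tgt_lang} "
                f"draft of:\n{source_text}")
    refined = llm(f"Refine this draft for accuracy and fluency, keeping the "
                  f"source meaning:\nSOURCE:\n{source_text}\nDRAFT:\n{draft}")
    final = llm(f"Proofread and lightly polish, changing as little as "
                f"possible:\n{refined}")
    return final
```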

人工智能

[AI-0] “My Grade is Wrong!”: A Contestable AI Framework for Interactive Feedback in Evaluating Student Essays

链接: https://arxiv.org/abs/2409.07453
作者: Shengxin Hong,Chang Cai,Sixuan Du,Haiyue Feng,Siyuan Liu,Xiuyi Fan
关键词-EN: Large Language Models, traditional one-way feedback, effective than traditional, traditional one-way, feedback
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:Interactive feedback, where feedback flows in both directions between teacher and student, is more effective than traditional one-way feedback. However, it is often too time-consuming for widespread use in educational practice. While Large Language Models (LLMs) have potential for automating feedback, they struggle with reasoning and interaction in an interactive setting. This paper introduces CAELF, a Contestable AI Empowered LLM Framework for automating interactive feedback. CAELF allows students to query, challenge, and clarify their feedback by integrating a multi-agent system with computational argumentation. Essays are first assessed by multiple Teaching-Assistant Agents (TA Agents), and then a Teacher Agent aggregates the evaluations through formal reasoning to generate feedback and grades. Students can further engage with the feedback to refine their understanding. A case study on 500 critical thinking essays with user studies demonstrates that CAELF significantly improves interactive feedback, enhancing the reasoning and interaction capabilities of LLMs. This approach offers a promising solution to overcoming the time and resource barriers that have limited the adoption of interactive feedback in educational settings.

[AI-1] Introducing Perturb-ability Score (PS) to Enhance Robustness Against Evasion Adversarial Attacks on ML-NIDS

链接: https://arxiv.org/abs/2409.07448
作者: Mohamed elShehaby,Ashraf Matrawy
关键词-EN: identify Network Intrusion, Intrusion Detection Systems, Network Intrusion Detection, Perturb-ability Score, Network Intrusion
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper proposes a novel Perturb-ability Score (PS) that can be used to identify Network Intrusion Detection Systems (NIDS) features that can be easily manipulated by attackers in the problem-space. We demonstrate that using PS to select only non-perturb-able features for ML-based NIDS maintains detection performance while enhancing robustness against adversarial attacks.

[AI-2] SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories

链接: https://arxiv.org/abs/2409.07440
作者: Ben Bogin,Kejuan Yang,Shashank Gupta,Kyle Richardson,Erin Bransom,Peter Clark,Ashish Sabharwal,Tushar Khot
关键词-EN: Large Language Models, autonomously reproduce results, Large Language, made significant progress, research repositories
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)
*备注:

点击查看摘要

Abstract:Given that Large Language Models (LLMs) have made significant progress in writing code, can they now be used to autonomously reproduce results from research repositories? Such a capability would be a boon to the research community, helping researchers validate, understand, and extend prior work. To advance towards this goal, we introduce SUPER, the first benchmark designed to evaluate the capability of LLMs in setting up and executing tasks from research repositories. SUPERaims to capture the realistic challenges faced by researchers working with Machine Learning (ML) and Natural Language Processing (NLP) research repositories. Our benchmark comprises three distinct problem sets: 45 end-to-end problems with annotated expert solutions, 152 sub problems derived from the expert set that focus on specific challenges (e.g., configuring a trainer), and 602 automatically generated problems for larger-scale development. We introduce various evaluation measures to assess both task success and progress, utilizing gold solutions when available or approximations otherwise. We show that state-of-the-art approaches struggle to solve these problems with the best model (GPT-4o) solving only 16.3% of the end-to-end set, and 46.1% of the scenarios. This illustrates the challenge of this task, and suggests that SUPER can serve as a valuable resource for the community to make and measure progress.

[AI-3] Synthetic continued pretraining

链接: https://arxiv.org/abs/2409.07431
作者: Zitong Yang,Neil Band,Shuangping Li,Emmanuel Candès,Tatsunori Hashimoto
关键词-EN: unstructured internet text, enabled language models, unstructured internet, synthetic continued pretraining, acquire a significant
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Pretraining on large-scale, unstructured internet text has enabled language models to acquire a significant amount of world knowledge. However, this knowledge acquisition is data-inefficient – to learn a given fact, models must be trained on hundreds to thousands of diverse representations of it. This poses a challenge when adapting a pretrained model to a small corpus of domain-specific documents, where each fact may appear rarely or only once. We propose to bridge this gap with synthetic continued pretraining: using the small domain-specific corpus to synthesize a large corpus more amenable to learning, and then performing continued pretraining on the synthesized corpus. We instantiate this proposal with EntiGraph, a synthetic data augmentation algorithm that extracts salient entities from the source documents and then generates diverse text by drawing connections between the sampled entities. Synthetic continued pretraining using EntiGraph enables a language model to answer questions and follow generic instructions related to the source documents without access to them. If instead, the source documents are available at inference time, we show that the knowledge acquired through our approach compounds with retrieval-augmented generation. To better understand these results, we build a simple mathematical model of EntiGraph, and show how synthetic data augmentation can “rearrange” knowledge to enable more data-efficient learning.
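
One way to picture the EntiGraph-style augmentation step described in the abstract: sample entity pairs from a document's extracted entities and ask a model to write text connecting them, grounded in the source. The prompt template and the entity list below are assumptions, not the paper's templates.

```python
import itertools, random

def entigraph_style_prompts(doc_entities, k_pairs=3, seed=0):
    """Sketch of the augmentation idea: take salient entities extracted from a
    source document (extraction step not shown) and build prompts asking an
    LLM to write passages connecting sampled entity pairs."""
    random.seed(seed)
    pairs = list(itertools.combinations(doc_entities, 2))
    sampled = random.sample(pairs, min(k_pairs, len(pairs)))
    return [
        f"Based only on the source document, write a short passage explaining "
        f"how '{a}' relates to '{b}'." for a, b in sampled
    ]

entities = ["continued pretraining", "EntiGraph", "domain corpus", "retrieval-augmented generation"]
for p in entigraph_style_prompts(entities):
    print(p)
```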

[AI-4] Hierarchical Reinforcement Learning for Temporal Abstraction of Listwise Recommendation

链接: https://arxiv.org/abs/2409.07416
作者: Luo Ji,Gao Liu,Mingyang Yin,Hongxia Yang,Jingren Zhou
关键词-EN: short-term interest shifts, Modern listwise recommendation, listwise recommendation systems, Modern listwise, long-term user perceptions
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 18 pages, 4 figures

点击查看摘要

Abstract:Modern listwise recommendation systems need to consider both long-term user perceptions and short-term interest shifts. Reinforcement learning can be applied on recommendation to study such a problem but is also subject to large search space, sparse user feedback and long interactive latency. Motivated by recent progress in hierarchical reinforcement learning, we propose a novel framework called mccHRL to provide different levels of temporal abstraction on listwise recommendation. Within the hierarchical framework, the high-level agent studies the evolution of user perception, while the low-level agent produces the item selection policy by modeling the process as a sequential decision-making problem. We argue that such framework has a well-defined decomposition of the outra-session context and the intra-session context, which are encoded by the high-level and low-level agents, respectively. To verify this argument, we implement both a simulator-based environment and an industrial dataset-based experiment. Results observe significant performance improvement by our method, compared with several well-known baselines. Data and codes have been made public.

[AI-5] SoK: Security and Privacy Risks of Medical AI

链接: https://arxiv.org/abs/2409.07415
作者: Yuanhaur Chang,Han Liu,Evin Jaff,Chenyang Lu,Ning Zhang
关键词-EN: powered by artificial, machine learning, products and services, era where software, artificial intelligence
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The integration of technology and healthcare has ushered in a new era where software systems, powered by artificial intelligence and machine learning, have become essential components of medical products and services. While these advancements hold great promise for enhancing patient care and healthcare delivery efficiency, they also expose sensitive medical data and system integrity to potential cyberattacks. This paper explores the security and privacy threats posed by AI/ML applications in healthcare. Through a thorough examination of existing research across a range of medical domains, we have identified significant gaps in understanding the adversarial attacks targeting medical AI systems. By outlining specific adversarial threat models for medical settings and identifying vulnerable application domains, we lay the groundwork for future research that investigates the security and resilience of AI-driven medical systems. Through our analysis of different threat models and feasibility studies on adversarial attacks in different medical domains, we provide compelling insights into the pressing need for cybersecurity research in the rapidly evolving field of AI healthcare technology.

[AI-6] Robust Robot Walker: Learning Agile Locomotion over Tiny Traps

链接: https://arxiv.org/abs/2409.07409
作者: Shaoting Zhu,Runhan Huang,Linzhan Mou,Hang Zhao
关键词-EN: exhibit robust walking, robust walking capabilities, Quadruped robots, enables quadruped robots, practical applications
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: 10 pages, 17 figures

点击查看摘要

Abstract:Quadruped robots must exhibit robust walking capabilities in practical applications. In this work, we propose a novel approach that enables quadruped robots to pass various small obstacles, or “tiny traps”. Existing methods often rely on exteroceptive sensors, which can be unreliable for detecting such tiny traps. To overcome this limitation, our approach focuses solely on proprioceptive inputs. We introduce a two-stage training framework incorporating a contact encoder and a classification head to learn implicit representations of different traps. Additionally, we design a set of tailored reward functions to improve both the stability of training and the ease of deployment for goal-tracking tasks. To benefit further research, we design a new benchmark for tiny trap task. Extensive experiments in both simulation and real-world settings demonstrate the effectiveness and robustness of our method. Project Page: this https URL

[AI-7] CLNX: Bridging Code and Natural Language for C/C++ Vulnerability-Contributing Commits Identification

链接: https://arxiv.org/abs/2409.07407
作者: Zeqing Qin,Yiwei Wu,Lansheng Han
关键词-EN: Large Language Models, Large Language, Language Models, shown great promise, vulnerability identification
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: 8 pages, 2 figures, conference

点击查看摘要

Abstract:Large Language Models (LLMs) have shown great promise in vulnerability identification. As C/C++ comprises half of the Open-Source Software (OSS) vulnerabilities over the past decade and updates in OSS mainly occur through commits, enhancing LLMs’ ability to identify C/C++ Vulnerability-Contributing Commits (VCCs) is essential. However, current studies primarily focus on further pre-training LLMs on massive code datasets, which is resource-intensive and poses efficiency challenges. In this paper, we enhance the ability of BERT-based LLMs to identify C/C++ VCCs in a lightweight manner. We propose CodeLinguaNexus (CLNX) as a bridge facilitating communication between C/C++ programs and LLMs. Based on commits, CLNX efficiently converts the source code into a more natural representation while preserving key details. Specifically, CLNX first applies structure-level naturalization to decompose complex programs, followed by token-level naturalization to interpret complex symbols. We evaluate CLNX on public datasets of 25,872 C/C++ functions with their commits. The results show that CLNX significantly enhances the performance of LLMs on identifying C/C++ VCCs. Moreover, CLNX-equipped CodeBERT achieves new state-of-the-art and identifies 38 OSS vulnerabilities in the real world.

[AI-8] What to align in multimodal contrastive learning?

链接: https://arxiv.org/abs/2409.07402
作者: Benoit Dufumier,Javiera Castillo-Navarro,Devis Tuia,Jean-Philippe Thiran
关键词-EN: Humans perceive, multisensory integration, adapt their behavior, perceive the world, world through multisensory
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注: 22 pages

点击查看摘要

Abstract:Humans perceive the world through multisensory integration, blending the information of different modalities to adapt their behavior. Contrastive learning offers an appealing solution for multimodal self-supervised learning. Indeed, by considering each modality as a different view of the same entity, it learns to align features of different modalities in a shared representation space. However, this approach is intrinsically limited as it only learns shared or redundant information between modalities, while multimodal interactions can arise in other ways. In this work, we introduce CoMM, a Contrastive MultiModal learning strategy that enables the communication between modalities in a single multimodal space. Instead of imposing cross- or intra- modality constraints, we propose to align multimodal representations by maximizing the mutual information between augmented versions of these multimodal features. Our theoretical analysis shows that shared, synergistic and unique terms of information naturally emerge from this formulation, allowing us to estimate multimodal interactions beyond redundancy. We test CoMM both in a controlled and in a series of real-world settings: in the former, we demonstrate that CoMM effectively captures redundant, unique and synergistic information between modalities. In the latter, CoMM learns complex multimodal interactions and achieves state-of-the-art results on the six multimodal benchmarks.
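
Maximizing mutual information between augmented multimodal views is commonly approximated with an InfoNCE-style contrastive loss. The sketch below is that generic estimator applied to fused representations of two augmentations, not CoMM's exact objective; the fusion network itself is omitted.

```python
import torch
import torch.nn.functional as F

def infonce(z1, z2, temperature=0.1):
    """InfoNCE-style lower bound on mutual information between two views.
    Here z1 and z2 are fused multimodal representations (batch, dim) of the
    same samples under different augmentations."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature          # similarity of every pair
    targets = torch.arange(z1.size(0))          # positives sit on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Toy batch of 4 fused multimodal embeddings in a 16-d shared space.
z_a, z_b = torch.randn(4, 16), torch.randn(4, 16)
print(infonce(z_a, z_b).item())
```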

[AI-9] Awaking the Slides: A Tuning-free and Knowledge-regulated AI Tutoring System via Language Model Coordination

链接: https://arxiv.org/abs/2409.07372
作者: Daniel Zhang-Li,Zheyuan Zhang,Jifan Yu,Joy Lim Jia Yin,Shangqing Tu,Linlu Gong,Haohua Wang,Zhiyuan Liu,Huiqin Liu,Lei Hou,Juanzi Li
关键词-EN: carry lecture knowledge, vast pre-existing slides, pre-existing slides serve, heterogeneous teaching actions, teaching actions
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:The vast pre-existing slides serve as rich and important materials to carry lecture knowledge. However, effectively leveraging lecture slides to serve students is difficult due to the multi-modal nature of slide content and the heterogeneous teaching actions. We study the problem of discovering effective designs that convert a slide into an interactive lecture. We develop Slide2Lecture, a tuning-free and knowledge-regulated intelligent tutoring system that can (1) effectively convert an input lecture slide into a structured teaching agenda consisting of a set of heterogeneous teaching actions; (2) create and manage an interactive lecture that generates responsive interactions catering to student learning demands while regulating the interactions to follow teaching actions. Slide2Lecture contains a complete pipeline for learners to obtain an interactive classroom experience to learn the slide. For teachers and developers, Slide2Lecture enables customization to cater to personalized demands. The evaluation rated by annotators and students shows that Slide2Lecture is effective in outperforming the remaining implementation. Slide2Lecture’s online deployment has made more than 200K interaction with students in the 3K lecture sessions. We open source Slide2Lecture’s implementation in https://anonymous.4open.science/r/slide2lecture-4210/.

[AI-10] Demo: SGCode: A Flexible Prompt-Optimizing System for Secure Generation of Code

链接: https://arxiv.org/abs/2409.07368
作者: Khiem Ton,Nhi Nguyen,Mahmoud Nazzal,Abdallah Khreishah,Cristian Borcea,NhatHai Phan,Ruoming Jin,Issa Khalil,Yelong Shen
关键词-EN: generate secure code, flexible prompt-optimizing system, large language models, paper introduces SGCode, secure code
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper introduces SGCode, a flexible prompt-optimizing system to generate secure code with large language models (LLMs). SGCode integrates recent prompt-optimization approaches with LLMs in a unified system accessible through front-end and back-end APIs, enabling users to 1) generate secure code, which is free of vulnerabilities, 2) review and share security analysis, and 3) easily switch from one prompt optimization approach to another, while providing insights on model and system performance. We populated SGCode on an AWS server with PromSec, an approach that optimizes prompts by combining an LLM and security tools with a lightweight generative adversarial graph neural network to detect and fix security vulnerabilities in the generated code. Extensive experiments show that SGCode is practical as a public tool to gain insights into the trade-offs between model utility, secure code generation, and system cost. SGCode has only a marginal cost compared with prompting LLMs. SGCode is available at: this http URL.

[AI-11] Securing Vision-Language Models with a Robust Encoder Against Jailbreak and Adversarial Attacks

链接: https://arxiv.org/abs/2409.07353
作者: Md Zarif Hossain,Ahmed Imteaj
关键词-EN: Large Vision-Language Models, Large Vision-Language, multimodal big datasets, vision-language tasks, Vision-Language Models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs), trained on multimodal big datasets, have significantly advanced AI by excelling in vision-language tasks. However, these models remain vulnerable to adversarial attacks, particularly jailbreak attacks, which bypass safety protocols and cause the model to generate misleading or harmful responses. This vulnerability stems from both the inherent susceptibilities of LLMs and the expanded attack surface introduced by the visual modality. We propose Sim-CLIP+, a novel defense mechanism that adversarially fine-tunes the CLIP vision encoder by leveraging a Siamese architecture. This approach maximizes cosine similarity between perturbed and clean samples, facilitating resilience against adversarial manipulations. Sim-CLIP+ offers a plug-and-play solution, allowing seamless integration into existing LVLM architectures as a robust vision encoder. Unlike previous defenses, our method requires no structural modifications to the LVLM and incurs minimal computational overhead. Sim-CLIP+ demonstrates effectiveness against both gradient-based adversarial attacks and various jailbreak techniques. We evaluate Sim-CLIP+ against three distinct jailbreak attack strategies and perform clean evaluations using standard downstream datasets, including COCO for image captioning and OKVQA for visual question answering. Extensive experiments demonstrate that Sim-CLIP+ maintains high clean accuracy while substantially improving robustness against both gradient-based adversarial attacks and jailbreak techniques. Our code and robust vision encoders are available at this https URL.
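
The Siamese objective can be sketched as maximizing cosine similarity between embeddings of clean and perturbed images. The toy encoder and the noise-based "perturbation" below are placeholders for CLIP's vision tower and a real adversarial attack (e.g., PGD), so this is a reading of the abstract rather than Sim-CLIP+'s implementation.

```python
import torch
import torch.nn.functional as F

def siamese_cosine_loss(encoder, clean_images, perturbed_images):
    """Fine-tuning objective sketch: keep embeddings of perturbed images close
    (in cosine similarity) to embeddings of their clean counterparts."""
    z_clean = F.normalize(encoder(clean_images), dim=-1)
    z_adv = F.normalize(encoder(perturbed_images), dim=-1)
    # Maximize similarity of matched pairs => minimize (1 - cosine similarity).
    return (1.0 - (z_clean * z_adv).sum(dim=-1)).mean()

# Toy stand-in encoder: flatten + linear projection of 3x32x32 images.
encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 64))
clean = torch.rand(8, 3, 32, 32)
perturbed = (clean + 0.03 * torch.randn_like(clean)).clamp(0, 1)  # noise stands in for an attack
print(siamese_cosine_loss(encoder, clean, perturbed).item())
```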

[AI-12] Federated Impression for Learning with Distributed Heterogeneous Data

链接: https://arxiv.org/abs/2409.07351
作者: Sana Ayromlou,Atrin Arya,Armin Saadat,Purang Abolmaesumi,Xiaoxiao Li
关键词-EN: Standard deep learning-based, real-world clinical applications, Standard deep, deep learning-based classification, learning-based classification approaches
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Standard deep learning-based classification approaches may not always be practical in real-world clinical applications, as they require a centralized collection of all samples. Federated learning (FL) provides a paradigm that can learn from distributed datasets across clients without requiring them to share data, which can help mitigate privacy and data ownership issues. In FL, sub-optimal convergence caused by data heterogeneity is common among data from different health centers due to the variety in data collection protocols and patient demographics across centers. Through experimentation in this study, we show that data heterogeneity leads to the phenomenon of catastrophic forgetting during local training. We propose FedImpres which alleviates catastrophic forgetting by restoring synthetic data that represents the global information as federated impression. To achieve this, we distill the global model resulting from each communication round. Subsequently, we use the synthetic data alongside the local data to enhance the generalization of local training. Extensive experiments show that the proposed method achieves state-of-the-art performance on both the BloodMNIST and Retina datasets, which contain label imbalance and domain shift, with an improvement in classification accuracy of up to 20%.

[AI-13] Online Decision MetaMorphFormer: A Casual Transformer-Based Reinforcement Learning Framework of Universal Embodied Intelligence

链接: https://arxiv.org/abs/2409.07341
作者: Luo Ji,Runji Lin
关键词-EN: motion control field, Interactive artificial intelligence, interesting topic, Interactive artificial, motion control
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: 12 pages, 6 figures

点击查看摘要

Abstract:Interactive artificial intelligence in the motion control field is an interesting topic, especially when universal knowledge is adaptive to multiple tasks and universal environments. Despite there being increasing efforts in the field of Reinforcement Learning (RL) with the aid of transformers, most of them might be limited by the offline training pipeline, which prohibits exploration and generalization abilities. To address this limitation, we propose the framework of Online Decision MetaMorphFormer (ODM) which aims to achieve self-awareness, environment recognition, and action planning through a unified model architecture. Motivated by cognitive and behavioral psychology, an ODM agent is able to learn from others, recognize the world, and practice itself based on its own experience. ODM can also be applied to any arbitrary agent with a multi-joint body, located in different environments, and trained with different types of tasks using large-scale pre-trained datasets. Through the use of pre-trained datasets, ODM can quickly warm up and learn the necessary knowledge to perform the desired task, while the target environment continues to reinforce the universal policy. Extensive online experiments as well as few-shot and zero-shot environmental tests are used to verify ODM’s performance and generalization ability. The results of our study contribute to the study of general artificial intelligence in embodied and cognitive fields. Code, results, and video examples can be found on the website \urlthis https URL.

[AI-14] A Framework for Predicting the Impact of Game Balance Changes through Meta Discovery

链接: https://arxiv.org/abs/2409.07340
作者: Akash Saravanan,Matthew Guzdial
关键词-EN: League of Legends, collection of knowledge, balance, Meta Discovery framework, Pokémon
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 11 pages, 1 figure, IEEE Transactions on Games

点击查看摘要

Abstract:A metagame is a collection of knowledge that goes beyond the rules of a game. In competitive, team-based games like Pokémon or League of Legends, it refers to the set of current dominant characters and/or strategies within the player base. Developer changes to the balance of the game can have drastic and unforeseen consequences on these sets of meta characters. A framework for predicting the impact of balance changes could aid developers in making more informed balance decisions. In this paper we present such a Meta Discovery framework, leveraging Reinforcement Learning for automated testing of balance changes. Our results demonstrate the ability to predict the outcome of balance changes in Pokémon Showdown, a collection of competitive Pokémon tiers, with high accuracy.

[AI-15] Explanation Debate Align: A Weak-to-Strong Framework for Language Model Generalization

链接: https://arxiv.org/abs/2409.07335
作者: Mehrdad Zakershahrak,Samira Ghodratnama
关键词-EN: artificial intelligence systems, forefront of research, task execution, rapid advancement, advancement of artificial
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:The rapid advancement of artificial intelligence systems has brought the challenge of AI alignment to the forefront of research, particularly in complex decision-making and task execution. As these systems surpass human-level performance in sophisticated problems, ensuring their alignment with human values, intentions, and ethical guidelines becomes crucial. Building on previous work in explanation generation for human-agent alignment, we address the more complex dynamics of multi-agent systems and human-AI teams. This paper introduces a novel approach to model alignment through weak-to-strong generalization in the context of language models. We present a framework where a strong model facilitates the improvement of a weaker model, bridging the gap between explanation generation and model alignment. Our method, formalized as a facilitation function, allows for the transfer of capabilities from advanced models to less capable ones without direct access to extensive training data. Our results suggest that this facilitation-based approach not only enhances model performance but also provides insights into the nature of model alignment and the potential for scalable oversight of AI systems.

[AI-16] Module-wise Adaptive Adversarial Training for End-to-end Autonomous Driving

链接: https://arxiv.org/abs/2409.07321
作者: Tianyuan Zhang,Lu Wang,Jiaqi Kang,Xinwei Zhang,Siyuan Liang,Yuwei Chen,Aishan Liu,Xianglong Liu
关键词-EN: Recent advances, improved autonomous driving, markedly improved autonomous, autonomous driving, systems that integrate
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 14 pages

点击查看摘要

Abstract:Recent advances in deep learning have markedly improved autonomous driving (AD) models, particularly end-to-end systems that integrate perception, prediction, and planning stages, achieving state-of-the-art performance. However, these models remain vulnerable to adversarial attacks, where human-imperceptible perturbations can disrupt decision-making processes. While adversarial training is an effective method for enhancing model robustness against such attacks, no prior studies have focused on its application to end-to-end AD models. In this paper, we take the first step in adversarial training for end-to-end AD models and present a novel Module-wise Adaptive Adversarial Training (MA2T). However, extending conventional adversarial training to this context is highly non-trivial, as different stages within the model have distinct objectives and are strongly interconnected. To address these challenges, MA2T first introduces Module-wise Noise Injection, which injects noise before the input of different modules, targeting training models with the guidance of overall objectives rather than each independent module loss. Additionally, we introduce Dynamic Weight Accumulation Adaptation, which incorporates accumulated weight changes to adaptively learn and adjust the loss weights of each module based on their contributions (accumulated reduction rates) for better balance and robust training. To demonstrate the efficacy of our defense, we conduct extensive experiments on the widely-used nuScenes dataset across several end-to-end AD models under both white-box and black-box attacks, where our method outperforms other baselines by large margins (+5-10%). Moreover, we validate the robustness of our defense through closed-loop evaluation in the CARLA simulation environment, showing improved resilience even against natural corruption.

[AI-17] MEDIC: Towards a Comprehensive Framework for Evaluating LLMs in Clinical Applications

链接: https://arxiv.org/abs/2409.07314
作者: Praveen K Kanithi,Clément Christophe,Marco AF Pimentel,Tathagata Raha,Nada Saadi,Hamza Javed,Svetlana Maslenkova,Nasir Hayat,Ronnie Rajan,Shadab Khan
关键词-EN: benchmarks like USMLE, Large Language Models, development of Large, Large Language, rapid development
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Technical report

点击查看摘要

Abstract:The rapid development of Large Language Models (LLMs) for healthcare applications has spurred calls for holistic evaluation beyond frequently-cited benchmarks like USMLE, to better reflect real-world performance. While real-world assessments are valuable indicators of utility, they often lag behind the pace of LLM evolution, likely rendering findings obsolete upon deployment. This temporal disconnect necessitates a comprehensive upfront evaluation that can guide model selection for specific clinical applications. We introduce MEDIC, a framework assessing LLMs across five critical dimensions of clinical competence: medical reasoning, ethics and bias, data and language understanding, in-context learning, and clinical safety. MEDIC features a novel cross-examination framework quantifying LLM performance across areas like coverage and hallucination detection, without requiring reference outputs. We apply MEDIC to evaluate LLMs on medical question-answering, safety, summarization, note generation, and other tasks. Our results show performance disparities across model sizes, baseline vs medically finetuned models, and have implications on model selection for applications requiring specific model strengths, such as low hallucination or lower cost of inference. MEDIC’s multifaceted evaluation reveals these performance trade-offs, bridging the gap between theoretical capabilities and practical implementation in healthcare settings, ensuring that the most promising models are identified and adapted for diverse healthcare applications.

[AI-18] Exploring User-level Gradient Inversion with a Diffusion Prior NEURIPS2023

链接: https://arxiv.org/abs/2409.07291
作者: Zhuohang Li,Andrew Lowy,Jing Liu,Toshiaki Koike-Akino,Bradley Malin,Kieran Parsons,Ye Wang
关键词-EN: explore user-level gradient, user-level gradient inversion, distributed learning, explore user-level, surface in distributed
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
*备注: Presented at the International Workshop on Federated Learning in the Age of Foundation Models in conjunction with NeurIPS 2023

点击查看摘要

Abstract:We explore user-level gradient inversion as a new attack surface in distributed learning. We first investigate existing attacks on their ability to make inferences about private information beyond training data reconstruction. Motivated by the low reconstruction quality of existing methods, we propose a novel gradient inversion attack that applies a denoising diffusion model as a strong image prior in order to enhance recovery in the large batch setting. Unlike traditional attacks, which aim to reconstruct individual samples and suffer at large batch and image sizes, our approach instead aims to recover a representative image that captures the sensitive shared semantic information corresponding to the underlying user. Our experiments with face images demonstrate the ability of our methods to recover realistic facial images along with private user attributes.

[AI-19] Using Generative Agents to Create Tip Sheets for Investigative Data Reporting

链接: https://arxiv.org/abs/2409.07286
作者: Joris Veerbeek,Nicholas Diakopoulos
关键词-EN: create tip sheets, investigative data reporting, paper introduces, data reporting, create tip
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Short paper to be presented at Computation + Journalism 2024

点击查看摘要

Abstract:This paper introduces a system using generative AI agents to create tip sheets for investigative data reporting. Our system employs three specialized agents–an analyst, a reporter, and an editor–to collaboratively generate and refine tips from datasets. We validate this approach using real-world investigative stories, demonstrating that our agent-based system generally generates more newsworthy and accurate insights compared to a baseline model without agents, although some variability was noted between different stories. Our findings highlight the potential of generative AI to provide leads for investigative data reporting.

[AI-20] Propaganda to Hate: A Multimodal Analysis of Arabic Memes with Multi-Agent LLMs

链接: https://arxiv.org/abs/2409.07246
作者: Firoj Alam,Md. Rafiul Biswas,Uzair Shah,Wajdi Zaghouani,Georgios Mikros
关键词-EN: social media platforms, past decade, social media, dissemination and consumption, media platforms
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: propaganda, hate-speech, disinformation, misinformation, fake news, LLMs, GPT-4, multimodality, multimodal LLMs

点击查看摘要

Abstract:In the past decade, social media platforms have been used for information dissemination and consumption. While a major portion of the content is posted to promote citizen journalism and public awareness, some content is posted to mislead users. Among different content types such as text, images, and videos, memes (text overlaid on images) are particularly prevalent and can serve as powerful vehicles for propaganda, hate, and humor. In the current literature, there have been efforts to individually detect such content in memes. However, the study of their intersection is very limited. In this study, we explore the intersection between propaganda and hate in memes using a multi-agent LLM-based approach. We extend the propagandistic meme dataset with coarse and fine-grained hate labels. Our finding suggests that there is an association between propaganda and hate in memes. We provide detailed experimental results that can serve as a baseline for future studies. We will make the experimental resources publicly available to the community.

[AI-21] Behavioral Cloning Models Reality Check for Autonomous Driving

链接: https://arxiv.org/abs/2409.07218
作者: Mustafa Yildirim,Barkin Dagda,Vinal Asodia,Saber Fallah
关键词-EN: autonomous vehicle, effective are recent, recent advancements, autonomous vehicle control, utilize Behavior Cloning
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:How effective are recent advancements in autonomous vehicle perception systems when applied to real-world autonomous vehicle control? While numerous vision-based autonomous vehicle systems have been trained and evaluated in simulated environments, there is a notable lack of real-world validation for these systems. This paper addresses this gap by presenting the real-world validation of state-of-the-art perception systems that utilize Behavior Cloning (BC) for lateral control, processing raw image data to predict steering commands. The dataset was collected using a scaled research vehicle and tested on various track setups. Experimental results demonstrate that these methods predict steering angles with low error margins in real-time, indicating promising potential for real-world applications.
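
As a rough illustration of the behavior-cloning setup for lateral control, the sketch below regresses a steering angle from an RGB frame with a small CNN trained on recorded steering commands. The architecture, input size, and dummy data are assumptions for illustration, not the models evaluated in the paper.

```python
# Minimal behavior-cloning sketch: a small CNN regresses a steering angle from an
# RGB frame and is trained with MSE against recorded human steering commands.
import torch
import torch.nn as nn

class SteeringNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 24, 5, stride=2), nn.ReLU(),
            nn.Conv2d(24, 36, 5, stride=2), nn.ReLU(),
            nn.Conv2d(36, 48, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(48, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x):  # x: (B, 3, H, W) normalized camera frames
        return self.head(self.features(x)).squeeze(-1)  # predicted steering angle

model = SteeringNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
frames = torch.randn(8, 3, 66, 200)   # dummy batch standing in for camera frames
steering = torch.randn(8)             # recorded steering commands
loss = nn.functional.mse_loss(model(frames), steering)
loss.backward(); optimizer.step()
```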

[AI-22] Heterogeneity-Aware Coordination for Federated Learning via Stitching Pre-trained blocks

链接: https://arxiv.org/abs/2409.07202
作者: Shichen Zhan,Yebo Wu,Chunlin Tian,Yan Zhao,Li Li
关键词-EN: coordinates multiple devices, preserving data privacy, global model, Federated learning, coordinates multiple
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Federated learning (FL) coordinates multiple devices to collaboratively train a shared model while preserving data privacy. However, the large memory footprint and high energy consumption during the training process exclude low-end devices from contributing to the global model with their own data, which severely deteriorates the model performance in real-world scenarios. In this paper, we propose FedStitch, a hierarchical coordination framework for heterogeneous federated learning with pre-trained blocks. Unlike the traditional approaches that train the global model from scratch, for a new task, FedStitch composes the global model via stitching pre-trained blocks. Specifically, each participating client selects the most suitable block based on their local data from the candidate pool composed of blocks from pre-trained models. The server then aggregates the optimal block for stitching. This process iterates until a new stitched network is generated. Beyond this new training paradigm, FedStitch consists of the following three core components: 1) an RL-weighted aggregator, 2) a search space optimizer deployed on the server side, and 3) a local energy optimizer deployed on each participating client. The RL-weighted aggregator helps to select the right block in the non-IID scenario, while the search space optimizer continuously reduces the size of the candidate block pool during stitching. Meanwhile, the local energy optimizer is designed to minimize the energy consumption of each client while guaranteeing the overall training progress. The results demonstrate that compared to existing approaches, FedStitch improves the model accuracy by up to 20.93%. At the same time, it achieves up to 8.12% speedup, reduces the memory footprint by up to 79.5%, and achieves up to 89.41% energy saving during the learning procedure.
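
A toy sketch of the per-position block-selection step follows: each client scores candidate pre-trained blocks on its local data and the server picks the block most clients preferred. The majority vote and the hypothetical block names stand in for the paper's RL-weighted aggregator and shrinking search space; this is an illustration, not FedStitch itself.

```python
# Toy sketch: clients score candidate blocks locally; the server aggregates the choice
# for the next stitching position (majority vote as a stand-in for RL-weighted aggregation).
from collections import Counter

def client_select(candidate_blocks, score_on_local_data):
    # score_on_local_data: callable mapping a block id to a local fitness score.
    return max(candidate_blocks, key=score_on_local_data)

def server_aggregate(client_choices):
    # Majority vote over clients' preferred blocks for this stitching position.
    return Counter(client_choices).most_common(1)[0][0]

candidate_blocks = ["resnet_block_3", "mobilenet_block_5", "vit_block_2"]
# Hypothetical local scores per client (e.g., probe accuracy on local data).
local_scores = [
    {"resnet_block_3": 0.71, "mobilenet_block_5": 0.69, "vit_block_2": 0.64},
    {"resnet_block_3": 0.66, "mobilenet_block_5": 0.73, "vit_block_2": 0.61},
    {"resnet_block_3": 0.74, "mobilenet_block_5": 0.70, "vit_block_2": 0.63},
]
choices = [client_select(candidate_blocks, s.get) for s in local_scores]
print(server_aggregate(choices))   # 'resnet_block_3' wins the vote in this toy example
```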

[AI-23] ThermalGaussian: Thermal 3D Gaussian Splatting

链接: https://arxiv.org/abs/2409.07200
作者: Rongfeng Lu,Hangyu Chen,Zunjie Zhu,Yuhang Qin,Ming Lu,Le Zhang,Chenggang Yan,Anke Xue
关键词-EN: Neural Radiance Fields, Radiance Fields, users of surveillance, Neural Radiance, thermal
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 10 pages, 7 figures

点击查看摘要

Abstract:Thermography is especially valuable for the military and other users of surveillance cameras. Some recent methods based on Neural Radiance Fields (NeRF) have been proposed to reconstruct the thermal scenes in 3D from a set of thermal and RGB images. However, unlike NeRF, 3D Gaussian splatting (3DGS) prevails due to its rapid training and real-time rendering. In this work, we propose ThermalGaussian, the first thermal 3DGS approach capable of rendering high-quality images in RGB and thermal modalities. We first calibrate the RGB camera and the thermal camera to ensure that both modalities are accurately aligned. Subsequently, we use the registered images to learn the multimodal 3D Gaussians. To prevent the overfitting of any single modality, we introduce several multimodal regularization constraints. We also develop smoothing constraints tailored to the physical characteristics of the thermal modality. Besides, we contribute a real-world dataset named RGBT-Scenes, captured by a hand-held thermal-infrared camera, facilitating future research on thermal scene reconstruction. We conduct comprehensive experiments to show that ThermalGaussian achieves photorealistic rendering of thermal images and improves the rendering quality of RGB images. With the proposed multimodal regularization constraints, we also reduced the model’s storage cost by 90%. The code and dataset will be released.

[AI-24] Cyber Deception: State of the art Trends and Open challenges

链接: https://arxiv.org/abs/2409.07194
作者: Pedro Beltrán López,Manuel Gil Pérez,Pantaleone Nespoli
关键词-EN: Cyber Deception, significantly increased articles, increased articles designing, growing interest, interest in cybersecurity
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
*备注: 38 pages

点击查看摘要

Abstract:The growing interest in cybersecurity has significantly increased articles designing and implementing various Cyber Deception (CYDEC) mechanisms. This trend reflects the urgent need for new strategies to address cyber threats effectively. Since its emergence, CYDEC has established itself as an innovative defense against attackers, thanks to its proactive and reactive capabilities, finding applications in numerous real-life scenarios. Despite the considerable work devoted to CYDEC, the literature still presents significant gaps. In particular, there has not been (i) a comprehensive analysis of the main components characterizing CYDEC, (ii) a generic classification covering all types of solutions, nor (iii) a survey of the current state of the literature in various contexts. This article aims to fill these gaps through a detailed review of the main features that comprise CYDEC, developing a comprehensive classification taxonomy. In addition, the different frameworks used to generate CYDEC are reviewed, presenting a more comprehensive one. Existing solutions in the literature using CYDEC, both without Artificial Intelligence (AI) and with AI, are studied and compared. Finally, the most salient trends of the current state of the art are discussed, offering a list of pending challenges for future research.

[AI-25] How Mature is Requirements Engineering for AI-based Systems? A Systematic Mapping Study on Practices Challenges and Future Research Directions

链接: https://arxiv.org/abs/2409.07192
作者: Umm-e-Habiba,Markus Haug,Justus Bogner,Stefan Wagner
关键词-EN: Artificial intelligence, emerging ethical implications, quality requirements due, ethical implications, engineering for artificial
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: Accepted in Requirements Engineering Journal, 2024

点击查看摘要

Abstract:Artificial intelligence (AI) permeates all fields of life, which resulted in new challenges in requirements engineering for artificial intelligence (RE4AI), e.g., the difficulty in specifying and validating requirements for AI or considering new quality requirements due to emerging ethical implications. It is currently unclear if existing RE methods are sufficient or if new ones are needed to address these challenges. Therefore, our goal is to provide a comprehensive overview of RE4AI to researchers and practitioners. What has been achieved so far, i.e., what practices are available, and what research gaps and challenges still need to be addressed? To achieve this, we conducted a systematic mapping study combining query string search and extensive snowballing. The extracted data was aggregated, and results were synthesized using thematic analysis. Our selection process led to the inclusion of 126 primary studies. Existing RE4AI research focuses mainly on requirements analysis and elicitation, with most practices applied in these areas. Furthermore, we identified requirements specification, explainability, and the gap between machine learning engineers and end-users as the most prevalent challenges, along with a few others. Additionally, we proposed seven potential research directions to address these challenges. Practitioners can use our results to identify and select suitable RE methods for working on their AI-based systems, while researchers can build on the identified gaps and research directions to push the field forward.

[AI-26] A Perspective on AI-Guided Molecular Simulations in VR: Exploring Strategies for Imitation Learning in Hyperdimensional Molecular Systems ECAI24

链接: https://arxiv.org/abs/2409.07189
作者: Mohamed Dhouioui,Jonathan Barnoud,Rhoslyn Roebuck Williams,Harry J. Stroud,Phil Bates,David R. Glowacki
关键词-EN: crucial computational tool, engineer molecular structure, Molecular dynamics simulations, crucial computational, computational tool
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Biomolecules (q-bio.BM)
*备注: (Accepted for presentation at the First Workshop on “eXtended Reality & Intelligent Agents” (XRIA24) @ ECAI24, Santiago De Compostela (Spain), 20 October 2024)

点击查看摘要

Abstract:Molecular dynamics simulations are a crucial computational tool for researchers to understand and engineer molecular structure and function in areas such as drug discovery, protein engineering, and material design. Despite their utility, MD simulations are expensive, owing to the high dimensionality of molecular systems. Interactive molecular dynamics in virtual reality (iMD-VR) has recently been developed as a ‘human-in-the-loop’ strategy, which leverages high-performance computing to accelerate the researcher’s ability to solve the hyperdimensional sampling problem. By providing an immersive 3D environment that enables visualization and manipulation of real-time molecular motion, iMD-VR enables researchers and students to efficiently and intuitively explore and navigate these complex, high-dimensional systems. iMD-VR platforms offer a unique opportunity to quickly generate rich datasets that capture human experts’ spatial insight regarding molecular structure and function. This paper explores the possibility of employing user-generated iMD-VR datasets to train AI agents via imitation learning (IL). IL is an important technique in robotics that enables agents to mimic complex behaviors from expert demonstrations, thus circumventing the need for explicit programming or intricate reward design. We review the utilization of IL for manipulation tasks in robotics and discuss how iMD-VR recordings could be used to train IL models for solving specific molecular ‘tasks’. We then investigate how such approaches could be applied to the data captured from iMD-VR recordings. Finally, we outline the future research directions and potential challenges of using AI agents to augment human expertise to efficiently navigate conformational spaces, highlighting how this approach could provide valuable insight across domains such as materials science, protein engineering, and computer-aided drug design.

[AI-27] Enhancing Angular Resolution via Directionality Encoding and Geometric Constraints in Brain Diffusion Tensor Imaging ICONIP2024

链接: https://arxiv.org/abs/2409.07186
作者: Sheng Chen,Zihao Tang,Mariano Cabezas,Xinyi Wang,Arkiev D’Souza,Michael Barnett,Fernando Calamante,Weidong Cai,Chenyu Wang
关键词-EN: Magnetic Resonance Imaging, Magnetic Resonance, fiber tracts non-invasively, reconstruct white matter, white matter fiber
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted to ICONIP2024, Diffusion Weighted Imaging, Diffusion Tensor Imaging, Angular Resolution Enhancement, Fractional Anisotropy

点击查看摘要

Abstract:Diffusion-weighted imaging (DWI) is a type of Magnetic Resonance Imaging (MRI) technique sensitised to the diffusivity of water molecules, offering the capability to inspect tissue microstructures and is the only in-vivo method to reconstruct white matter fiber tracts non-invasively. The DWI signal can be analysed with the diffusion tensor imaging (DTI) model to estimate the directionality of water diffusion within voxels. Several scalar metrics, including axial diffusivity (AD), mean diffusivity (MD), radial diffusivity (RD), and fractional anisotropy (FA), can be further derived from DTI to quantitatively summarise the microstructural integrity of brain tissue. These scalar metrics have played an important role in understanding the organisation and health of brain tissue at a microscopic level in clinical studies. However, reliable DTI metrics rely on DWI acquisitions with high gradient directions, which often go beyond the commonly used clinical protocols. To enhance the utility of clinically acquired DWI and save scanning time for robust DTI analysis, this work proposes DirGeo-DTI, a deep learning-based method to estimate reliable DTI metrics even from a set of DWIs acquired with the minimum theoretical number (6) of gradient directions. DirGeo-DTI leverages directional encoding and geometric constraints to facilitate the training process. Two public DWI datasets were used for evaluation, demonstrating the effectiveness of the proposed method. Extensive experimental results show that the proposed method achieves the best performance compared to existing DTI enhancement methods and potentially reveals further clinical insights with routine clinical DWI scans.

[AI-28] Linear Time Complexity Conformers with SummaryMixing for Streaming Speech Recognition

链接: https://arxiv.org/abs/2409.07165
作者: Titouan Parcollet,Rogier van Dalen,Shucong Zhang,Sourav Batthacharya
关键词-EN: Automatic speech recognition, Automatic speech, Automatic, speech recognition, ASR
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Automatic speech recognition (ASR) with an encoder equipped with self-attention, whether streaming or non-streaming, takes quadratic time in the length of the speech utterance. This slows down training and decoding, increases their cost, and limits the deployment of ASR on constrained devices. SummaryMixing is a promising linear-time complexity alternative to self-attention for non-streaming speech recognition that, for the first time, preserves or outperforms the accuracy of self-attention models. Unfortunately, the original definition of SummaryMixing is not suited to streaming speech recognition. Hence, this work extends SummaryMixing to a Conformer Transducer that works in both a streaming and an offline mode. It shows that this new linear-time complexity speech encoder outperforms self-attention in both scenarios while requiring less compute and memory during training and decoding.
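
The linear-time intuition can be sketched as follows: each frame is combined with a running (causal) mean of transformed past frames instead of attending over every frame. This is a rough approximation written for this post, assuming a cumulative-mean summary; it is not the authors' Conformer Transducer implementation.

```python
# Sketch of the linear-time "summary" idea: each frame mixes its local transform with
# a causal running mean of summary projections -> O(T) time and memory.
import torch
import torch.nn as nn

class CausalSummaryMixing(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.local = nn.Linear(dim, dim)     # per-frame ("local") transformation
        self.summary = nn.Linear(dim, dim)   # contribution to the shared summary
        self.out = nn.Linear(2 * dim, dim)

    def forward(self, x):                    # x: (B, T, D)
        s = self.summary(x)                  # (B, T, D)
        csum = torch.cumsum(s, dim=1)        # running sum over past frames (causal)
        counts = torch.arange(1, x.size(1) + 1, device=x.device).view(1, -1, 1)
        summary = csum / counts              # running mean up to each frame
        return self.out(torch.cat([self.local(x), summary], dim=-1))

layer = CausalSummaryMixing(dim=256)
print(layer(torch.randn(2, 100, 256)).shape)   # torch.Size([2, 100, 256])
```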

[AI-29] Recurrent Aggregators in Neural Algorithmic Reasoning

链接: https://arxiv.org/abs/2409.07154
作者: Kaijia Xu,Petar Veličković
关键词-EN: classical algorithmic computations, mimic classical algorithmic, Neural algorithmic, neural algorithmic reasoners, emerging field
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Neural algorithmic reasoning (NAR) is an emerging field that seeks to design neural networks that mimic classical algorithmic computations. Today, graph neural networks (GNNs) are widely used in neural algorithmic reasoners due to their message passing framework and permutation equivariance. In this extended abstract, we challenge this design choice, and replace the equivariant aggregation function with a recurrent neural network. While seemingly counter-intuitive, this approach has appropriate grounding when nodes have a natural ordering – and this is the case frequently in established reasoning benchmarks like CLRS-30. Indeed, our recurrent NAR (RNAR) model performs very strongly on such tasks, while handling many others gracefully. A notable achievement of RNAR is its decisive state-of-the-art result on the Heapsort and Quickselect tasks, both deemed as a significant challenge for contemporary neural algorithmic reasoners – especially the latter, where RNAR achieves a mean micro-F1 score of 87%.
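
The core design change is easy to sketch: replace a permutation-invariant aggregation (e.g., sum or max over neighbour messages) with a recurrent network that consumes messages in the nodes' natural order. The snippet below is illustrative only; the RNAR processor, featurization, and training setup will differ.

```python
# Sketch of a recurrent aggregator: an LSTM consumes neighbour messages in node order
# instead of a permutation-invariant sum/max.
import torch
import torch.nn as nn

class RecurrentAggregator(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.rnn = nn.LSTM(dim, dim, batch_first=True)

    def forward(self, messages):              # messages: (num_neighbours, dim), in node order
        _, (h, _) = self.rnn(messages.unsqueeze(0))
        return h[-1].squeeze(0)               # aggregated neighbourhood embedding, shape (dim,)

agg = RecurrentAggregator(dim=128)
neighbour_msgs = torch.randn(5, 128)          # messages from 5 ordered neighbours
print(agg(neighbour_msgs).shape)              # torch.Size([128])
```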

[AI-30] Leveraging Unstructured Text Data for Federated Instruction Tuning of Large Language Models

链接: https://arxiv.org/abs/2409.07136
作者: Rui Ye,Rui Ge,Yuchi Fengting,Jingyi Chai,Yanfeng Wang,Siheng Chen
关键词-EN: large language model, shared large language, directly sharing raw, Federated instruction tuning, follow humans’ instructions
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注: 11 pages, work in progress

点击查看摘要

Abstract:Federated instruction tuning enables multiple clients to collaboratively fine-tune a shared large language model (LLM) that can follow humans’ instructions without directly sharing raw data. However, existing literature impractically requires that all the clients readily hold instruction-tuning data (i.e., structured instruction-response pairs), which necessitates massive human annotations since clients’ data is usually unstructured text instead. Addressing this, we propose a novel and flexible framework FedIT-U2S, which can automatically transform an unstructured corpus into structured data for federated instruction tuning. FedIT-U2S consists of two key steps: (1) few-shot instruction-tuning data generation, where each unstructured data piece together with several examples is combined to prompt an LLM in generating an instruction-response pair. To further enhance the flexibility, a retrieval-based example selection technique is proposed, where the examples are automatically selected based on the relatedness between the client’s data piece and example pool, bypassing the need to determine examples in advance. (2) A typical federated instruction tuning process based on the generated data. Overall, FedIT-U2S can be applied to diverse scenarios as long as the client holds a valuable text corpus, broadening the application scope of federated instruction tuning. We conduct a series of experiments on three domains (medicine, knowledge, and math), showing that our proposed FedIT-U2S can consistently and significantly bring improvements over the base LLM.
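
The retrieval-based example selection step can be illustrated with a few lines: score each pool example against the client's text piece and keep the most related ones to build the few-shot prompt. TF-IDF cosine similarity is an assumed retriever chosen for the sketch, not the paper's actual method.

```python
# Sketch: pick the few-shot examples most related to a client's unstructured text
# piece before prompting the LLM to generate an instruction-response pair.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def select_examples(text_piece, example_pool, k=3):
    vec = TfidfVectorizer().fit(example_pool + [text_piece])
    sims = cosine_similarity(vec.transform([text_piece]), vec.transform(example_pool))[0]
    top = sims.argsort()[::-1][:k]             # indices of the k most related examples
    return [example_pool[i] for i in top]

pool = ["Instruction: summarize the report... Response: ...",
        "Instruction: extract the symptoms... Response: ...",
        "Instruction: solve the equation... Response: ..."]
examples = select_examples("Patient notes describing symptoms and medication.", pool, k=2)
prompt = "\n\n".join(examples) + "\n\nGenerate an instruction-response pair for:\n..."
```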

[AI-31] DCMAC: Demand-aware Customized Multi-Agent Communication via Upper Bound Training

链接: https://arxiv.org/abs/2409.07127
作者: Dongkun Huo,Huateng Zhang,Yixue Hao,Yuanlin Ye,Long Hu,Rui Wang,Min Chen
关键词-EN: multi-agent reinforcement learning, collaborative multi-agent reinforcement, Efficient communication, reinforcement learning, performance of collaborative
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:Efficient communication can enhance the overall performance of collaborative multi-agent reinforcement learning. A common approach is to share observations through full communication, leading to significant communication overhead. Existing work attempts to perceive the global state by constructing teammate models based on local information. However, it ignores that the uncertainty introduced by prediction may make training difficult. To address this problem, we propose a Demand-aware Customized Multi-Agent Communication (DCMAC) protocol, which uses upper-bound training to obtain the ideal policy. By utilizing the demand parsing module, an agent can interpret the gain of sending a local message to a teammate, and generate customized messages by computing the correlation between demands and local observations with a cross-attention mechanism. Moreover, our method can adapt to the communication resources of agents and accelerate training by approximating the ideal policy, which is trained with joint observations. Experimental results reveal that DCMAC significantly outperforms the baseline algorithms in both unconstrained and communication constrained scenarios.
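
The customized-message step can be sketched as a cross-attention where teammate demands act as queries over the sender's local observation features. Dimensions, token counts, and the exact wiring are assumptions for illustration; DCMAC's actual modules may differ.

```python
# Sketch of demand-aware message customization via cross-attention:
# query = teammate demand embeddings, key/value = sender's local observation features.
import torch
import torch.nn as nn

dim = 64
cross_attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)

local_obs = torch.randn(1, 10, dim)     # sender's encoded local observation (10 tokens)
demands = torch.randn(1, 3, dim)        # parsed demand embeddings of 3 teammates

messages, attn_weights = cross_attn(query=demands, key=local_obs, value=local_obs)
print(messages.shape)                   # torch.Size([1, 3, 64]): one customized message per teammate
```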

[AI-32] Credibility-Limited Revision for Epistemic Spaces

链接: https://arxiv.org/abs/2409.07119
作者: Kai Sauerwald
关键词-EN: permitting inconsistent belief, credibility-limited revision operators, inconsistent belief sets, credibility-limited revision, AGM revision operators
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
*备注:

点击查看摘要

Abstract:We consider credibility-limited revision in the framework of belief change for epistemic spaces, permitting inconsistent belief sets and inconsistent beliefs. In this unrestricted setting, the class of credibility-limited revision operators does not include any AGM revision operators. We extend the class of credibility-limited revision operators in a way that all AGM revision operators are included while keeping the original spirit of credibility-limited revision. Extended credibility-limited revision operators are defined axiomatically. A semantic characterization of extended credibility-limited revision operators that employ total preorders on possible worlds is presented.

[AI-33] A Continual and Incremental Learning Approach for TinyML On-device Training Using Dataset Distillation and Model Size Adaption

链接: https://arxiv.org/abs/2409.07114
作者: Marcus Rüb,Philipp Tuchel,Axel Sikora,Daniel Mueller-Gritschneder
关键词-EN: Tiny Machine learning, machine learning models, Tiny Machine, context of Tiny, Machine learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:A new algorithm for incremental learning in the context of Tiny Machine learning (TinyML) is presented, which is optimized for low-performance and energy efficient embedded devices. TinyML is an emerging field that deploys machine learning models on resource-constrained devices such as microcontrollers, enabling intelligent applications like voice recognition, anomaly detection, predictive maintenance, and sensor data processing in environments where traditional machine learning models are not feasible. The algorithm solves the challenge of catastrophic forgetting through the use of knowledge distillation to create a small, distilled dataset. The novelty of the method is that the size of the model can be adjusted dynamically, so that the complexity of the model can be adapted to the requirements of the task. This offers a solution for incremental learning in resource-constrained environments, where both model size and computational efficiency are critical factors. Results show that the proposed algorithm offers a promising approach for TinyML incremental learning on embedded devices. The algorithm was tested on five datasets: CIFAR10, MNIST, CORE50, HAR, and Speech Commands. The findings indicated that, despite using only 43% of Floating Point Operations (FLOPs) compared to a larger fixed model, the algorithm experienced a negligible accuracy loss of just 1%. In addition, the presented method is memory efficient. While state-of-the-art incremental learning is usually very memory intensive, the method requires only 1% of the original data set.

[AI-34] Advancing On-Device Neural Network Training with TinyPropv2: Dynamic Sparse and Efficient Backpropagation IJCNN

链接: https://arxiv.org/abs/2409.07109
作者: Marcus Rüb,Axel Sikora,Daniel Mueller-Gritschneder
关键词-EN: deep neural networks, low-power microcontroller units, innovative algorithm optimized, study introduces, neural networks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 2024 International Joint Conference on Neural Networks (IJCNN)

点击查看摘要

Abstract:This study introduces TinyPropv2, an innovative algorithm optimized for on-device learning in deep neural networks, specifically designed for low-power microcontroller units. TinyPropv2 refines sparse backpropagation by dynamically adjusting the level of sparsity, including the ability to selectively skip training steps. This feature significantly lowers computational effort without substantially compromising accuracy. Our comprehensive evaluation across diverse datasets CIFAR 10, CIFAR100, Flower, Food, Speech Command, MNIST, HAR, and DCASE2020 reveals that TinyPropv2 achieves near-parity with full training methods, with an average accuracy drop of only around 1 percent in most cases. For instance, against full training, TinyPropv2’s accuracy drop is minimal, for example, only 0.82 percent on CIFAR 10 and 1.07 percent on CIFAR100. In terms of computational effort, TinyPropv2 shows a marked reduction, requiring as little as 10 percent of the computational effort needed for full training in some scenarios, and consistently outperforms other sparse training methodologies. These findings underscore TinyPropv2’s capacity to efficiently manage computational resources while maintaining high accuracy, positioning it as an advantageous solution for advanced embedded device applications in the IoT ecosystem.

[AI-35] Redundancy-Aware Camera Selection for Indoor Scene Neural Rendering

链接: https://arxiv.org/abs/2409.07098
作者: Zehao Wang,Han Zhou,Matthew B. Blaschko,Tinne Tuytelaars,Minye Wu
关键词-EN: monocular video sequence, view synthesis, achieved by capturing, capturing a monocular, monocular video
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Novel view synthesis of indoor scenes can be achieved by capturing a monocular video sequence of the environment. However, redundant information caused by artificial movements in the input video data reduces the efficiency of scene modeling. In this work, we tackle this challenge from the perspective of camera selection. We begin by constructing a similarity matrix that incorporates both the spatial diversity of the cameras and the semantic variation of the images. Based on this matrix, we use the Intra-List Diversity (ILD) metric to assess camera redundancy, formulating the camera selection task as an optimization problem. Then we apply a diversity-based sampling algorithm to optimize the camera selection. We also develop a new dataset, IndoorTraj, which includes long and complex camera movements captured by humans in virtual indoor environments, closely mimicking real-world scenarios. Experimental results demonstrate that our strategy outperforms other approaches under time and memory constraints. Remarkably, our method achieves performance comparable to models trained on the full dataset, while using only an average of 15% of the frames and 75% of the allotted time.
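
A greedy stand-in for the selection idea is sketched below: given a pairwise similarity matrix over frames, repeatedly keep the frame least similar to those already selected. This is an illustration of diversity-based sampling under an assumed greedy rule, not the paper's ILD-guided optimization.

```python
# Greedy sketch of diversity-based camera (frame) selection from a similarity matrix.
import numpy as np

def select_cameras(similarity, budget):
    n = similarity.shape[0]
    selected = [int(np.argmin(similarity.sum(axis=1)))]   # start from the most "unique" frame
    while len(selected) < budget:
        remaining = [i for i in range(n) if i not in selected]
        # Pick the frame whose maximum similarity to the selected set is smallest.
        scores = [similarity[i, selected].max() for i in remaining]
        selected.append(remaining[int(np.argmin(scores))])
    return sorted(selected)

rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 32))                        # per-frame descriptors (placeholder)
norms = np.linalg.norm(feats, axis=1, keepdims=True)
sim = (feats @ feats.T) / (norms * norms.T)               # cosine similarity matrix
print(select_cameras(sim, budget=15))
```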

[AI-36] Ontology-Free General-Domain Knowledge Graph-to-Text Generation Dataset Synthesis using Large Language Model

链接: https://arxiv.org/abs/2409.07088
作者: Daehee Kim,Deokhyung Kang,Sangwon Ryu,Gary Geunbae Lee
关键词-EN: verbalizing structured knowledge, structured knowledge graphs, natural language text, Pretrained Language Models, involves verbalizing structured
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 18 pages, 9 figures

点击查看摘要

Abstract:Knowledge Graph-to-Text (G2T) generation involves verbalizing structured knowledge graphs into natural language text. Recent advancements in Pretrained Language Models (PLMs) have improved G2T performance, but their effectiveness depends on datasets with precise graph-text alignment. However, the scarcity of high-quality, general-domain G2T generation datasets restricts progress in general-domain G2T generation research. To address this issue, we introduce the Wikipedia Ontology-Free Graph-text dataset (WikiOFGraph), a new large-scale G2T dataset generated using a novel method that leverages a Large Language Model (LLM) and Data-QuestEval. Our new dataset, which contains 5.85M general-domain graph-text pairs, offers high graph-text consistency without relying on external ontologies. Experimental results demonstrate that PLMs fine-tuned on WikiOFGraph outperform those trained on other datasets across various evaluation metrics. Our method proves to be a scalable and effective solution for generating high-quality G2T data, significantly advancing the field of G2T generation.

[AI-37] Multimodal Emotion Recognition with Vision-language Prompting and Modality Dropout

链接: https://arxiv.org/abs/2409.07078
作者: Anbin QI,Zhongliang Liu,Xinyong Zhou,Jinba Xiao,Fengrun Zhang,Qi Gan,Ming Tao,Gaozheng Zhang,Lu Zhang
关键词-EN: Emotion Recognition Challenge, Multimodal Emotion Recognition, Recognition Challenge Track, Emotion Recognition, Recognition Challenge
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In this paper, we present our solution for the Second Multimodal Emotion Recognition Challenge Track 1 (MER2024-SEMI). To enhance the accuracy and generalization performance of emotion recognition, we propose several methods for Multimodal Emotion Recognition. Firstly, we introduce EmoVCLIP, a model fine-tuned based on CLIP using vision-language prompt learning, designed for video-based emotion recognition tasks. By leveraging prompt learning on CLIP, EmoVCLIP improves the performance of pre-trained CLIP on emotional videos. Additionally, to address the issue of modality dependence in multimodal fusion, we employ modality dropout for robust information fusion. Furthermore, to aid Baichuan in better extracting emotional information, we suggest using GPT-4 as the prompt for Baichuan. Lastly, we utilize a self-training strategy to leverage unlabeled videos. In this process, we use unlabeled videos with high-confidence pseudo-labels generated by our model and incorporate them into the training set. Experimental results demonstrate that our model ranks 1st in the MER2024-SEMI track, achieving an accuracy of 90.15% on the test set.
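
Modality dropout itself is a small, standard mechanism and can be sketched directly: during training, each modality's feature vector is zeroed out with some probability so the fusion head cannot over-rely on any single stream. The dimensions, drop probability, and fusion layer here are assumptions, not the team's exact setup.

```python
# Sketch of modality dropout for robust multimodal fusion.
import torch
import torch.nn as nn

class ModalityDropoutFusion(nn.Module):
    def __init__(self, dims, out_dim, p_drop=0.3):
        super().__init__()
        self.p_drop = p_drop
        self.proj = nn.Linear(sum(dims), out_dim)

    def forward(self, feats):                     # feats: list of (B, d_i) tensors
        if self.training:
            kept = [f * (torch.rand(()) > self.p_drop).float() for f in feats]
            # Guarantee at least one modality survives the dropout.
            if all(bool(k.abs().sum() == 0) for k in kept):
                kept[0] = feats[0]
            feats = kept
        return self.proj(torch.cat(feats, dim=-1))

fusion = ModalityDropoutFusion(dims=[512, 768, 128], out_dim=256)
video, audio, text = torch.randn(4, 512), torch.randn(4, 768), torch.randn(4, 128)
print(fusion([video, audio, text]).shape)          # torch.Size([4, 256])
```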

[AI-38] Legal Fact Prediction: Task Definition and Dataset Construction

链接: https://arxiv.org/abs/2409.07055
作者: Junkai Liu,Yujie Tong,Hui Huang,Shuyuan Zheng,Muyun Yang,Peicheng Wu,Makoto Onizuka,Chuan Xiao
关键词-EN: proven by acknowledged, Legal facts refer, Legal, Legal facts, facts
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:Legal facts refer to the facts that can be proven by acknowledged evidence in a trial. They form the basis for the determination of court judgments. This paper introduces a novel NLP task: legal fact prediction, which aims to predict the legal fact based on a list of evidence. The predicted facts can instruct the parties and their lawyers involved in a trial to strengthen their submissions and optimize their strategies during the trial. Moreover, since real legal facts are difficult to obtain before the final judgment, the predicted facts also serve as an important basis for legal judgment prediction. We construct a benchmark dataset consisting of evidence lists and ground-truth legal facts for real civil loan cases, LFPLoan. Our experiments on this dataset show that this task is non-trivial and requires further considerable research efforts.

[AI-39] Native vs Non-Native Language Prompting: A Comparative Analysis

链接: https://arxiv.org/abs/2409.07054
作者: Mohamed Bayan Kmainasi,Rakif Khan,Ali Ezzat Shahroor,Boushra Bendou,Maram Hasanain,Firoj Alam
关键词-EN: Natural Language Processing, including standard Natural, shown remarkable abilities, Large language models, standard Natural Language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Foundation Models, Large Language Models, Arabic NLP, LLMs, Native, Contextual Understanding, Arabic LLM

点击查看摘要

Abstract:Large language models (LLMs) have shown remarkable abilities in different fields, including standard Natural Language Processing (NLP) tasks. To elicit knowledge from LLMs, prompts play a key role, consisting of natural language instructions. Most open and closed source LLMs are trained on available labeled and unlabeled resources–digital content such as text, images, audio, and videos. Hence, these models have better knowledge for high-resourced languages but struggle with low-resourced languages. Since prompts play a crucial role in understanding their capabilities, the language used for prompts remains an important research question. Although there has been significant research in this area, it is still limited, and less has been explored for medium to low-resourced languages. In this study, we investigate different prompting strategies (native vs. non-native) on 11 different NLP tasks associated with 12 different Arabic datasets (9.7K data points). In total, we conducted 197 experiments involving 3 LLMs, 12 datasets, and 3 prompting strategies. Our findings suggest that, on average, the non-native prompt performs the best, followed by mixed and native prompts.

[AI-40] Beyond IID: Optimizing Instruction Learning from the Perspective of Instruction Interaction and Dependency

链接: https://arxiv.org/abs/2409.07045
作者: Hanyu Zhao,Li Du,Yiming Ju,Chengwei Wu,Tengfei Pan
关键词-EN: large language models, fine-tune large language, language models, pivotal challenge, effectively select
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:With the availability of various instruction datasets, a pivotal challenge is how to effectively select and integrate these instructions to fine-tune large language models (LLMs). Previous research mainly focuses on selecting individual high-quality instructions. However, these works overlooked the joint interactions and dependencies between different categories of instructions, leading to suboptimal selection strategies. Moreover, the nature of these interaction patterns remains largely unexplored, let alone optimizing the instruction set with regard to them. To fill these gaps, in this paper, we: (1) systematically investigate interaction and dependency patterns between different categories of instructions, and (2) optimize the instruction set with respect to these interaction patterns using a linear programming-based method, and optimize the SFT learning schema using curriculum learning guided by an instruction dependency taxonomy. Experimental results across different LLMs demonstrate improved performance over strong baselines on widely adopted benchmarks.

[AI-41] E-commerce Webpage Recommendation Scheme Base on Semantic Mining and Neural Networks

链接: https://arxiv.org/abs/2409.07033
作者: Wenchao Zhao,Xiaoyi Liu,Ruilin Xu,Lingxi Xiao,Muqing Li
关键词-EN: page recommendation technology, web page recommendation, web, web page, page recommendation
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注: arXiv admin note: text overlap with arXiv:2409.01137

点击查看摘要

Abstract:On e-commerce websites, web page recommendation technology based on web mining has been widely used. However, recommendation solutions often cannot meet the actual application needs of online shopping users. To address this problem, this paper proposes an e-commerce web page recommendation solution that combines semantic web mining and BP neural networks. First, the web logs of user searches are processed, and five features are extracted: content priority, time consumption priority, online shopping users’ explicit/implicit feedback on the website, recommendation semantics and input deviation amount. Then, these features are used as input features of the BP neural network to classify and identify the priority of the final output web page. Finally, the web pages are sorted according to priority and recommended to users. This project uses book sales webpages as samples for experiments. The results show that this solution can quickly and accurately identify the webpages required by users.
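
The classification step is essentially a small feed-forward ("BP") network over the five extracted features. The sketch below uses synthetic feature values and priority labels purely as placeholders; the real pipeline extracts these features from web logs.

```python
# Sketch: a small feed-forward network maps the five log-derived features to a
# page-priority class, and candidate pages are ranked by predicted priority.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
# Features per page: [content priority, time-consumption priority,
#                     explicit/implicit feedback, recommendation semantics, input deviation]
X = rng.random((500, 5))                   # synthetic placeholder features
y = rng.integers(0, 3, size=500)           # priority level: 0 = low, 1 = medium, 2 = high

clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0).fit(X, y)

candidate_pages = rng.random((10, 5))
priorities = clf.predict(candidate_pages)
ranking = np.argsort(-priorities)          # recommend highest-priority pages first
print(ranking)
```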

[AI-42] Improving Anomalous Sound Detection via Low-Rank Adaptation Fine-Tuning of Pre-Trained Audio Models

链接: https://arxiv.org/abs/2409.07016
作者: Xinhu Zheng,Anbai Jiang,Bing Han,Yanmin Qian,Pingyi Fan,Jia Liu,Wei-Qiang Zhang
关键词-EN: Anomalous Sound Detection, Anomalous Sound, Sound Detection, Artificial Intelligence, gained significant interest
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Anomalous Sound Detection (ASD) has gained significant interest through the application of various Artificial Intelligence (AI) technologies in industrial settings. Though possessing great potential, ASD systems can hardly be readily deployed in real production sites due to the generalization problem, which is primarily caused by the difficulty of data collection and the complexity of environmental factors. This paper introduces a robust ASD model that leverages audio pre-trained models. Specifically, we fine-tune these models using machine operation data, employing SpecAug as a data augmentation strategy. Additionally, we investigate the impact of utilizing Low-Rank Adaptation (LoRA) tuning instead of full fine-tuning to address the problem of limited data for fine-tuning. Our experiments on the DCASE2023 Task 2 dataset establish a new benchmark of 77.75% on the evaluation set, with a significant improvement of 6.48% compared with previous state-of-the-art (SOTA) models, including top-tier traditional convolutional networks and speech pre-trained models, which demonstrates the effectiveness of audio pre-trained models with LoRA tuning. Ablation studies are also conducted to showcase the efficacy of the proposed scheme.
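
LoRA itself is easy to sketch: a frozen pretrained linear layer is augmented with a trainable low-rank update, so only a small number of parameters are tuned. The rank, scaling, and the layer chosen for injection below are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal LoRA sketch: output = frozen W x + (alpha/r) * B A x, with only A and B trainable.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # freeze the pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

pretrained = nn.Linear(768, 768)                         # stands in for an attention projection
lora_layer = LoRALinear(pretrained)
print(lora_layer(torch.randn(4, 768)).shape)             # torch.Size([4, 768])
trainable = sum(p.numel() for p in lora_layer.parameters() if p.requires_grad)
print(trainable)                                         # only the low-rank A and B are trained
```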

[AI-43] What is the Right Notion of Distance between Predict-then-Optimize Tasks?

链接: https://arxiv.org/abs/2409.06997
作者: Paula Rodriguez-Diaz,Lingkai Kong,Kai Wang,David Alvarez-Melis,Milind Tambe
关键词-EN: detecting data drift, Comparing datasets, learning paradigms, data drift, machine learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Comparing datasets is a fundamental task in machine learning, essential for various learning paradigms; from evaluating train and test datasets for model generalization to using dataset similarity for detecting data drift. While traditional notions of dataset distances offer principled measures of similarity, their utility has largely been assessed through prediction error minimization. However, in Predict-then-Optimize (PtO) frameworks, where predictions serve as inputs for downstream optimization tasks, model performance is measured through decision regret minimization rather than prediction error minimization. In this work, we (i) show that traditional dataset distances, which rely solely on feature and label dimensions, lack informativeness in the PtO context, and (ii) propose a new dataset distance that incorporates the impacts of downstream decisions. Our results show that this decision-aware dataset distance effectively captures adaptation success in PtO contexts, providing a PtO adaptation bound in terms of dataset distance. Empirically, we show that our proposed distance measure accurately predicts transferability across three different PtO tasks from the literature.

[AI-44] Large Language Models and the Extended Church-Turing Thesis

链接: https://arxiv.org/abs/2409.06978
作者: Jiří Wiedermann,Jan van Leeuwen
关键词-EN: Extended Church-Turing Thesis, effective information processing, Church-Turing Thesis, Extended Church-Turing, non-uniform interactive computations
类目: Formal Languages and Automata Theory (cs.FL); Artificial Intelligence (cs.AI)
*备注: In Proceedings NCMA 2024, arXiv:2409.06120

点击查看摘要

Abstract:The Extended Church-Turing Thesis (ECTT) posits that all effective information processing, including unbounded and non-uniform interactive computations, can be described in terms of interactive Turing machines with advice. Does this assertion also apply to the abilities of contemporary large language models (LLMs)? From a broader perspective, this question calls for an investigation of the computational power of LLMs by the classical means of computability and computational complexity theory, especially the theory of automata. Along these lines, we establish a number of fundamental results. Firstly, we argue that any fixed (non-adaptive) LLM is computationally equivalent to a, possibly very large, deterministic finite-state transducer. This characterizes the base level of LLMs. We extend this to a key result concerning the simulation of space-bounded Turing machines by LLMs. Secondly, we show that lineages of evolving LLMs are computationally equivalent to interactive Turing machines with advice. The latter finding confirms the validity of the ECTT for lineages of LLMs. From a computability viewpoint, it also suggests that lineages of LLMs possess super-Turing computational power. Consequently, in our computational model knowledge generation is in general a non-algorithmic process realized by lineages of LLMs. Finally, we discuss the merits of our findings in the broader context of several related disciplines and philosophies.

[AI-45] Policy Filtration in RLHF to Fine-Tune LLM for Code Generation

链接: https://arxiv.org/abs/2409.06957
作者: Wei Shen,Chuheng Zhang
关键词-EN: large language models, Reinforcement learning, human feedback, key techniques, large language
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Reinforcement learning from human feedback (RLHF) is one of the key techniques that helps large language models (LLMs) to follow instructions and provide helpful and harmless responses. While direct policy optimization methods exist, state-of-the-art LLMs adopt RL-based methods (usually PPO) in RLHF to train the policy to generate good responses guided by a reward model learned from preference data. The main challenge of these methods is the inaccuracy of the intermediate reward model, especially in code generation tasks that require long and complex reasoning to score a response. We find that the reliability of the reward model varies across responses assigned different rewards. This motivates us to filter the samples whose rewards may be unreliable to improve the signal-to-noise ratio during policy learning, resulting in Policy Filtration for Proximal Policy Optimization (PF-PPO). To choose a proper policy filtration strategy for a given reward model, the coefficient of determination (R^2) between rewards and actual scores on filtered samples serves as a good metric and helps us find several promising strategies. We provide extensive experiments to validate the effectiveness of PF-PPO in code generation tasks, and find that some variants of PF-PPO are highly effective and achieve new state-of-the-art performance across 7-billion-parameter models on HumanEval, MBPP, and a new and more challenging LeetCode Contest benchmark.
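
The strategy-selection idea can be illustrated with synthetic numbers: score how well reward-model rewards track ground-truth scores (R^2) under different candidate filtering rules, then keep the rule with the best trade-off. The data, quantile rules, and noise level below are invented for the sketch; this is not PF-PPO's training code.

```python
# Toy sketch: compare candidate filtering rules by the R^2 between reward-model scores
# and ground-truth scores (e.g., unit-test pass rates) on the samples each rule keeps.
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
rewards = rng.normal(size=1000)                              # reward-model scores for responses
true_scores = rewards + rng.normal(scale=0.8, size=1000)     # synthetic ground-truth scores

def filtered_r2(rewards, true_scores, low_q, high_q):
    lo, hi = np.quantile(rewards, [low_q, high_q])
    mask = (rewards >= lo) & (rewards <= hi)                 # keep samples in this reward band
    return r2_score(true_scores[mask], rewards[mask]), mask

# Candidate strategies: keep everything, keep the top half, keep the middle 80%.
for name, (lo, hi) in {"all": (0.0, 1.0), "top-50%": (0.5, 1.0), "middle-80%": (0.1, 0.9)}.items():
    r2, mask = filtered_r2(rewards, true_scores, lo, hi)
    print(f"{name:10s} R^2={r2:.3f} kept={mask.mean():.0%}")
```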

[AI-46] Neural Algorithmic Reasoning with Multiple Correct Solutions

链接: https://arxiv.org/abs/2409.06953
作者: Zeno Kujawa,John Poole,Dobrik Georgiev,Danilo Numeroso,Pietro Liò
关键词-EN: Neural Algorithmic Reasoning, aims to optimize, Neural Algorithmic, NAR train neural, NAR
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Neural Algorithmic Reasoning (NAR) aims to optimize classical algorithms. However, canonical implementations of NAR train neural networks to return only a single solution, even when there are multiple correct solutions to a problem, such as single-source shortest paths. For some applications, it is desirable to recover more than one correct solution. To that end, we give the first method for NAR with multiple solutions. We demonstrate our method on two classical algorithms: Bellman-Ford (BF) and Depth-First Search (DFS), favouring deeper insight into two algorithms over a broader survey of algorithms. This method involves generating appropriate training data as well as sampling and validating solutions from model output. Each step of our method, which can serve as a framework for neural algorithmic reasoning beyond the tasks presented in this paper, might be of independent interest to the field and our results represent the first attempt at this task in the NAR literature.

[AI-47] You Have Thirteen Hours in Which to Solve the Labyrinth: Enhancing AI Game Masters with Function Calling ACL2024

链接: https://arxiv.org/abs/2409.06949
作者: Jaewoo Song,Andrew Zhu,Chris Callison-Burch
关键词-EN: large language models, challenging task due, Jim Henson Labyrinth, game master role, game master
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Wordplay Workshop @ ACL 2024

点击查看摘要

Abstract:Developing a consistent and reliable AI game master for text-based games is a challenging task due to the limitations of large language models (LLMs) and the complexity of the game master’s role. This paper presents a novel approach to enhance AI game masters by leveraging function calling in the context of the table-top role-playing game “Jim Henson’s Labyrinth: The Adventure Game.” Our methodology involves integrating game-specific controls through functions, which we show improves the narrative quality and state update consistency of the AI game master. The experimental results, based on human evaluations and unit tests, demonstrate the effectiveness of our approach in enhancing gameplay experience and maintaining coherence with the game state. This work contributes to the advancement of game AI and interactive storytelling, offering insights into the design of more engaging and consistent AI-driven game masters.
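
The function-calling pattern can be reduced to a small dispatcher: game-state updates are exposed as typed functions, the model returns which function to call with which arguments, and the dispatcher applies it. The game-state functions and the JSON call format below are hypothetical, and any provider-specific tool-calling API is deliberately abstracted away.

```python
# Sketch of function calling for an AI game master: the model chooses a state-update
# function; the dispatcher applies it to the game state.
import json

GAME_STATE = {"clock_hours": 13, "location": "gates", "inventory": []}

def advance_clock(hours: int):
    GAME_STATE["clock_hours"] = max(0, GAME_STATE["clock_hours"] - hours)

def move_to(location: str):
    GAME_STATE["location"] = location

def add_item(item: str):
    GAME_STATE["inventory"].append(item)

FUNCTIONS = {"advance_clock": advance_clock, "move_to": move_to, "add_item": add_item}

def dispatch(tool_call_json: str):
    """Apply a function call chosen by the model, given as JSON with a name and args."""
    call = json.loads(tool_call_json)
    FUNCTIONS[call["name"]](**call["args"])

dispatch('{"name": "move_to", "args": {"location": "hedge maze"}}')
dispatch('{"name": "advance_clock", "args": {"hours": 1}}')
print(GAME_STATE)   # {'clock_hours': 12, 'location': 'hedge maze', 'inventory': []}
```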

[AI-48] FSMDet: Vision-guided feature diffusion for fully sparse 3D detector ECCV

链接: https://arxiv.org/abs/2409.06945
作者: Tianran Liu,Morteza Mousa Pasandi,Robert Laganiere
关键词-EN: Fully sparse, fully sparse models, Fully Sparse Multi-modal, Sparse Multi-modal Detection, fully sparse works
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted by European Conference on Computer Vision (ECCV) 2024 workshop on VCAD

点击查看摘要

Abstract:Fully sparse 3D detection has attracted increasing interest in recent years. However, the sparsity of the features in these frameworks challenges the generation of proposals because of the limited diffusion process. In addition, the quest for efficiency has led to only a few works on vision-assisted fully sparse models. In this paper, we propose FSMDet (Fully Sparse Multi-modal Detection), which uses visual information to guide the LiDAR feature diffusion process while still maintaining the efficiency of the pipeline. Specifically, most fully sparse works focus on complex customized center fusion diffusion/regression operators. However, we observed that if adequate object completion is performed, even the simplest interpolation operator leads to satisfactory results. Inspired by this observation, we split the vision-guided diffusion process into two modules: a Shape Recover Layer (SRLayer) and a Self Diffusion Layer (SDLayer). The former uses RGB information to recover the shape of the visible part of an object, and the latter uses a visual prior to further spread the features to the center region. Experiments demonstrate that our approach successfully improves the performance of previous fully sparse models that use LiDAR only and reaches SOTA performance in multimodal models. At the same time, thanks to the sparse architecture, our method can be up to 5 times more efficient than previous SOTA methods in the inference process.

[AI-49] FreeRide: Harvesting Bubbles in Pipeline Parallelism

链接: https://arxiv.org/abs/2409.06941
作者: Jiashu Zhang,Zihan Pan,Molly (Yiming) Xu,Khuzaima Daudjee,Sihang Liu
关键词-EN: side tasks, GPU resources, large language model, GPU, side
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The occurrence of bubbles in pipeline parallelism is an inherent limitation that can account for more than 40% of the large language model (LLM) training time and is one of the main reasons for the underutilization of GPU resources in LLM training. Harvesting these bubbles for GPU side tasks can increase resource utilization and reduce training costs but comes with challenges. First, because bubbles are discontinuous with various shapes, programming side tasks becomes difficult while requiring excessive engineering effort. Second, a side task can compete with pipeline training for GPU resources and incur significant overhead. To address these challenges, we propose FreeRide, a system designed to harvest bubbles in pipeline parallelism for side tasks. FreeRide provides programmers with interfaces to implement side tasks easily, manages bubbles and side tasks during pipeline training, and controls access to GPU resources by side tasks to reduce overhead. We demonstrate that FreeRide achieves 7.8% average cost savings with a negligible overhead of about 1% in training LLMs while serving model training, graph analytics, and image processing side tasks.

[AI-50] Intrapartum Ultrasound Image Segmentation of Pubic Symphysis and Fetal Head Using Dual Student-Teacher Framework with CNN-ViT Collaborative Learning

链接: https://arxiv.org/abs/2409.06928
作者: Jianmei Jiang,Huijin Wang,Jieyun Bai,Shun Long,Shuangping Chen,Victor M. Campello,Karim Lekadir
关键词-EN: potential delivery complications, monitoring labor progression, identifying potential delivery, PSFH Segmentation Grand, Convolutional Neural Networks
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The segmentation of the pubic symphysis and fetal head (PSFH) constitutes a pivotal step in monitoring labor progression and identifying potential delivery complications. Despite the advances in deep learning, the lack of annotated medical images hinders the training of segmentation models. Traditional semi-supervised learning approaches primarily utilize a unified network model based on Convolutional Neural Networks (CNNs) and apply consistency regularization to mitigate the reliance on extensive annotated data. However, these methods often fall short in capturing the discriminative features of unlabeled data and in delineating the long-range dependencies inherent in the ambiguous boundaries of PSFH within ultrasound images. To address these limitations, we introduce a novel framework, the Dual-Student and Teacher Combining CNN and Transformer (DSTCT), which synergistically integrates the capabilities of CNNs and Transformers. Our framework comprises a Vision Transformer (ViT) as the teacher and two student models: one ViT and one CNN. This dual-student setup enables mutual supervision through the generation of both hard and soft pseudo-labels, with the consistency in their predictions being refined by minimizing the classifier determinacy discrepancy. The teacher model further reinforces learning within this architecture through the imposition of consistency regularization constraints. To augment the generalization abilities of our approach, we employ a blend of data and model perturbation techniques. Comprehensive evaluations on the benchmark dataset of the PSFH Segmentation Grand Challenge at MICCAI 2023 demonstrate that our DSTCT framework outperformed ten contemporary semi-supervised segmentation methods. Code available at this https URL.

[AI-51] Interactive Counterfactual Exploration of Algorithmic Harms in Recommender Systems

链接: https://arxiv.org/abs/2409.06916
作者: Yongsu Ahn,Quinn K Wolter,Jonilyn Dick,Janet Dick,Yu-Ru Lin
关键词-EN: shaping user interactions, integral to digital, interactions and preferences, Recommender systems, digital experiences
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:Recommender systems have become integral to digital experiences, shaping user interactions and preferences across various platforms. Despite their widespread use, these systems often suffer from algorithmic biases that can lead to unfair and unsatisfactory user experiences. This study introduces an interactive tool designed to help users comprehend and explore the impacts of algorithmic harms in recommender systems. By leveraging visualizations, counterfactual explanations, and interactive modules, the tool allows users to investigate how biases such as miscalibration, stereotypes, and filter bubbles affect their recommendations. Informed by in-depth user interviews, this tool benefits both general users and researchers by increasing transparency and offering personalized impact assessments, ultimately fostering a better understanding of algorithmic biases and contributing to more equitable recommendation outcomes. This work provides valuable insights for future research and practical applications in mitigating bias and enhancing fairness in machine learning algorithms.

[AI-52] A Bayesian framework for active object recognition pose estimation and shape transfer learning through touch

链接: https://arxiv.org/abs/2409.06912
作者: Haodong Zheng,Andrei Jalba,Raymond H. Cuijpers,Wijnand IJsselsteijn,Sanne Schoenmakers
关键词-EN: sense of touch, tactile sensing, robotic perception, understand the world, important aspect
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:As humans can explore and understand the world through the sense of touch, tactile sensing is also an important aspect of robotic perception. In unstructured environments, robots can encounter both known and novel objects; this calls for a method that can address both. In this study, we combine a particle filter (PF) and Gaussian process implicit surface (GPIS) in a unified Bayesian framework. The framework can differentiate between known and novel objects, perform object recognition, estimate pose for known objects, and reconstruct shapes for unknown objects, in an active learning fashion. By grounding the selection of the GPIS prior with the maximum-likelihood-estimation (MLE) shape from the PF, the knowledge about known objects’ shapes can be transferred to learn novel shapes. An exploration procedure with global shape estimation is proposed to guide active data acquisition and conclude the exploration when sufficient information is obtained. The performance of the proposed Bayesian framework is evaluated through simulations on known and novel objects, initialized with random poses, and is compared with a rapidly-exploring random tree (RRT). The results show that the proposed exploration procedure, utilizing global shape estimation, achieves faster exploration than the RRT-based local exploration procedure. Overall, results indicate that the proposed framework is effective and efficient in object recognition, pose estimation and shape reconstruction. Moreover, we show that a learned shape can be included as a new prior and used effectively for future object recognition and pose estimation of novel objects.

[AI-53] Applied Federated Model Personalisation in the Industrial Domain: A Comparative Study

链接: https://arxiv.org/abs/2409.06904
作者: Ilias Siniosoglou,Vasileios Argyriou,George Fragulis,Panagiotis Fouliras,Georgios Th. Papadopoulos,Anastasios Lytos,Panagiotis Sarigiannidis
关键词-EN: deploying complicated Machine, Machine and Deep, complicated Machine, Machine Learning, Deep Learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Performance (cs.PF)
*备注:

点击查看摘要

Abstract:The time-consuming nature of training and deploying complicated Machine and Deep Learning (DL) models for a variety of applications continues to pose significant challenges in the field of Machine Learning (ML). These challenges are particularly pronounced in the federated domain, where optimizing models for individual nodes poses significant difficulty. Many methods have been developed to tackle this problem, aiming to reduce training expenses and time while maintaining efficient optimisation. Three suggested strategies to tackle this challenge include Active Learning, Knowledge Distillation, and Local Memorization. These methods enable the adoption of smaller models that require fewer computational resources and allow for model personalization with local insights, thereby improving the effectiveness of current models. The present study delves into the fundamental principles of these three approaches and proposes an advanced Federated Learning System that utilises different Personalisation methods towards improving the accuracy of AI models and enhancing user experience in real-time NG-IoT applications, investigating the efficacy of these techniques in the local and federated domain. The results of the original and optimised models are then compared in both local and federated contexts using a comparison analysis. The post-analysis shows encouraging outcomes when it comes to optimising and personalising the models with the suggested techniques.

[AI-54] Formative Study for AI-assisted Data Visualization

链接: https://arxiv.org/abs/2409.06892
作者: Rania Saber,Anna Fariha
关键词-EN: uncleaned datasets influence, formative study investigates, investigates the impact, influence the outcomes, uncleaned datasets
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This formative study investigates the impact of data quality on AI-assisted data visualizations, focusing on how uncleaned datasets influence the outcomes of these tools. By generating visualizations from datasets with inherent quality issues, the research aims to identify and categorize the specific visualization problems that arise. The study further explores potential methods and tools to address these visualization challenges efficiently and effectively. Although tool development has not yet been undertaken, the findings emphasize enhancing AI visualization tools to handle flawed data better. This research underscores the critical need for more robust, user-friendly solutions that facilitate quicker and easier correction of data and visualization errors, thereby improving the overall reliability and usability of AI-assisted data visualization processes.

[AI-55] A Dataset for Evaluating LLM-based Evaluation Functions for Research Question Extraction Task

链接: https://arxiv.org/abs/2409.06883
作者: Yuya Fujisaki,Shiro Takagi,Hideki Asoh,Wataru Kumagai
关键词-EN: text summarization techniques, progress in text, task, text summarization, summarization techniques
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The progress in text summarization techniques has been remarkable. However, the task of accurately extracting and summarizing necessary information from highly specialized documents such as research papers has not been sufficiently investigated. We focus on the task of extracting research questions (RQ) from research papers and construct a new dataset consisting of machine learning papers, RQ extracted from these papers by GPT-4, and human evaluations of the extracted RQ from multiple perspectives. Using this dataset, we systematically compared recently proposed LLM-based evaluation functions for summarization, and found that none of the functions showed sufficiently high correlations with human evaluations. We expect our dataset to provide a foundation for further research on developing better evaluation functions tailored to the RQ extraction task, and to contribute to enhancing performance on the task. The dataset is available at this https URL.
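
Checking whether an LLM-based evaluation function agrees with human judgments is essentially a correlation analysis; the snippet below (with made-up scores) shows one way to compute Spearman and Pearson correlations with SciPy. It illustrates the protocol in general, not the authors' exact setup.

```python
from scipy.stats import spearmanr, pearsonr

human_scores = [4, 2, 5, 3, 1, 4, 5, 2]                     # hypothetical human ratings
metric_scores = [0.8, 0.4, 0.7, 0.5, 0.3, 0.9, 0.6, 0.5]    # hypothetical LLM-metric scores

rho, p_rho = spearmanr(human_scores, metric_scores)
r, p_r = pearsonr(human_scores, metric_scores)
print(f"Spearman rho={rho:.2f} (p={p_rho:.3f}), Pearson r={r:.2f} (p={p_r:.3f})")
```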

[AI-56] NSP: A Neuro-Symbolic Natural Language Navigational Planner

链接: https://arxiv.org/abs/2409.06859
作者: William English,Dominic Simon,Rickard Ewetz,Sumit Jha
关键词-EN: instructions hold promise, language instructions hold, natural language inputs, free-form natural language, natural language instructions
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
*备注: 8 pages

点击查看摘要

Abstract:Path planners that can interpret free-form natural language instructions hold promise to automate a wide range of robotics applications. These planners simplify user interactions and enable intuitive control over complex semi-autonomous systems. While existing symbolic approaches offer guarantees on the correctness and efficiency, they struggle to parse free-form natural language inputs. Conversely, neural approaches based on pre-trained Large Language Models (LLMs) can manage natural language inputs but lack performance guarantees. In this paper, we propose a neuro-symbolic framework for path planning from natural language inputs called NSP. The framework leverages the neural reasoning abilities of LLMs to i) craft symbolic representations of the environment and ii) generate a symbolic path planning algorithm. Next, a solution to the path planning problem is obtained by executing the algorithm on the environment representation. The framework uses a feedback loop from the symbolic execution environment to the neural generation process to self-correct syntax errors and satisfy execution time constraints. We evaluate our neuro-symbolic approach using a benchmark suite with 1500 path-planning problems. The experimental evaluation shows that our neuro-symbolic approach produces 90.1% valid paths that are on average 19-77% shorter than state-of-the-art neural approaches.
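
The feedback loop from the symbolic execution environment back to the neural generator can be pictured as a small retry loop. In the schematic sketch below, `llm_generate_code` is a hypothetical stand-in for the LLM call and the generated planner is deliberately trivial; it only illustrates the self-correction pattern, not NSP itself.

```python
import traceback

def llm_generate_code(prompt, error=None):
    # Stand-in for an LLM call that returns Python source for a planner.
    # A real system would include `error` in the prompt to enable self-correction.
    return "def plan(env):\n    return ['start'] + sorted(env['waypoints']) + ['goal']"

def neuro_symbolic_plan(env, max_attempts=3):
    error = None
    for _ in range(max_attempts):
        code = llm_generate_code("navigate the environment", error)
        try:
            namespace = {}
            exec(code, namespace)              # execute the generated symbolic program
            return namespace["plan"](env)      # run the planner on the environment
        except Exception:
            error = traceback.format_exc()     # feed the error back for the next attempt
    raise RuntimeError("no valid plan generated")

print(neuro_symbolic_plan({"waypoints": ["b", "a", "c"]}))
```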

[AI-57] LIME-M: Less Is More for Evaluation of MLLMs

链接: https://arxiv.org/abs/2409.06851
作者: Kang Zhu,Qianbo Zang,Shian Jia,Siwei Wu,Feiteng Fang,Yizhi Li,Shuyue Guo,Tianyu Zheng,Bo Li,Haoning Wu,Xingwei Qu,Jian Yang,Zachary Liu,Xiang Yue,J.H. Liu,Chenghua Lin,Min Yang,Shiwen Ni,Wenhao Huang,Ge Zhang
关键词-EN: Multimodal Large Language, Large Language Models, Large Language, visual question answering, remarkable success achieved
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:With the remarkable success achieved by Multimodal Large Language Models (MLLMs), numerous benchmarks have been designed to assess MLLMs’ ability to guide their development in image perception tasks (e.g., image captioning and visual question answering). However, the existence of numerous benchmarks results in a substantial computational burden when evaluating model performance across all of them. Moreover, these benchmarks contain many overly simple problems or challenging samples, which do not effectively differentiate the capabilities among various MLLMs. To address these challenges, we propose a pipeline to process the existing benchmarks, which consists of two modules: (1) Semi-Automated Screening Process and (2) Eliminating Answer Leakage. The Semi-Automated Screening Process filters out samples that cannot distinguish the model’s capabilities by synthesizing various MLLMs and manually evaluating them. The Eliminate Answer Leakage module filters samples whose answers can be inferred without images. Finally, we curate the LIME-M: Less Is More for Evaluation of Multimodal LLMs, a lightweight Multimodal benchmark that can more effectively evaluate the performance of different models. Our experiments demonstrate that: LIME-M can better distinguish the performance of different MLLMs with fewer samples (24% of the original) and reduced time (23% of the original); LIME-M eliminates answer leakage, focusing mainly on the information within images; The current automatic metric (i.e., CIDEr) is insufficient for evaluating MLLMs’ capabilities in captioning. Moreover, removing the caption task score when calculating the overall score provides a more accurate reflection of model performance differences. All our codes and data are released at this https URL.

[AI-58] Bifurcation Identification for Ultrasound-driven Robotic Cannulation

链接: https://arxiv.org/abs/2409.06817
作者: Cecilia G. Morales,Dhruv Srikanth,Jack H. Good,Keith A. Dufendach,Artur Dubrawski
关键词-EN: critical care settings, precise intravascular access, care settings, rapid and precise, patients’ survival
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In trauma and critical care settings, rapid and precise intravascular access is key to patients’ survival. Our research aims at ensuring this access, even when skilled medical personnel are not readily available. Vessel bifurcations are anatomical landmarks that can guide the safe placement of catheters or needles during medical procedures. Although ultrasound is advantageous in navigating anatomical landmarks in emergency scenarios due to its portability and safety, to our knowledge no existing algorithm can autonomously extract vessel bifurcations using ultrasound images. This is primarily due to the limited availability of ground truth data, in particular, data from live subjects, needed for training and validating reliable models. Researchers often resort to using data from anatomical phantoms or simulations. We introduce BIFURC, Bifurcation Identification for Ultrasound-driven Robot Cannulation, a novel algorithm that identifies vessel bifurcations and provides optimal needle insertion sites for an autonomous robotic cannulation system. BIFURC integrates expert knowledge with deep learning techniques to efficiently detect vessel bifurcations within the femoral region and can be trained on a limited amount of in-vivo data. We evaluated our algorithm using a medical phantom as well as real-world experiments involving live pigs. In all cases, BIFURC consistently identified bifurcation points and needle insertion locations in alignment with those identified by expert clinicians.

[AI-59] Personalized Federated Learning Techniques: Empirical Analysis

链接: https://arxiv.org/abs/2409.06805
作者: Azal Ahmad Khan,Ahmad Faraz Khan,Haider Ali,Ali Anwar
关键词-EN: holds immense promise, Personalized Federated Learning, tailoring machine learning, preserving data privacy, Personalized Federated
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Personalized Federated Learning (pFL) holds immense promise for tailoring machine learning models to individual users while preserving data privacy. However, achieving optimal performance in pFL often requires a careful balancing act between memory overhead costs and model accuracy. This paper delves into the trade-offs inherent in pFL, offering valuable insights for selecting the right algorithms for diverse real-world scenarios. We empirically evaluate ten prominent pFL techniques across various datasets and data splits, uncovering significant differences in their performance. Our study reveals interesting insights into how pFL methods that utilize personalized (local) aggregation exhibit the fastest convergence due to their efficiency in communication and computation. Conversely, fine-tuning methods face limitations in handling data heterogeneity and potential adversarial attacks while multi-objective learning methods achieve higher accuracy at the cost of additional training and resource consumption. Our study emphasizes the critical role of communication efficiency in scaling pFL, demonstrating how it can significantly affect resource usage in real-world deployments.
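
The difference between global aggregation and the personalized (local) aggregation highlighted above can be shown with toy NumPy weights: the shared backbone is averaged across clients while each client keeps its own head. This is a generic sketch, not one of the ten evaluated algorithms.

```python
import numpy as np

clients = {
    "c1": {"backbone": np.ones(4) * 1.0, "head": np.array([0.1, 0.2])},
    "c2": {"backbone": np.ones(4) * 3.0, "head": np.array([0.9, 0.8])},
}
sizes = {"c1": 100, "c2": 300}
total = sum(sizes.values())

# FedAvg-style global aggregation: size-weighted average of the shared parameters.
global_backbone = sum(sizes[c] / total * w["backbone"] for c, w in clients.items())

# Personalized aggregation: broadcast the shared backbone, keep each client's head local.
for w in clients.values():
    w["backbone"] = global_backbone.copy()     # the head is intentionally left untouched

print(global_backbone, clients["c1"]["head"], clients["c2"]["head"])
```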

[AI-60] Adaptive Meta-Domain Transfer Learning (AMDTL): A Novel Approach for Knowledge Transfer in AI

链接: https://arxiv.org/abs/2409.06800
作者: Michele Laurelli
关键词-EN: paper presents Adaptive, presents Adaptive Meta-Domain, Adaptive Meta-Domain Transfer, artificial intelligence models, Meta-Domain Transfer Learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper presents Adaptive Meta-Domain Transfer Learning (AMDTL), a novel methodology that combines principles of meta-learning with domain-specific adaptations to enhance the transferability of artificial intelligence models across diverse and unknown domains. AMDTL aims to address the main challenges of transfer learning, such as domain misalignment, negative transfer, and catastrophic forgetting, through a hybrid framework that emphasizes both generalization and contextual specialization. The framework integrates a meta-learner trained on a diverse distribution of tasks, adversarial training techniques for aligning domain feature distributions, and dynamic feature regulation mechanisms based on contextual domain embeddings. Experimental results on benchmark datasets demonstrate that AMDTL outperforms existing transfer learning methodologies in terms of accuracy, adaptation efficiency, and robustness. This research provides a solid theoretical and practical foundation for the application of AMDTL in various fields, opening new perspectives for the development of more adaptable and inclusive AI systems.

[AI-61] Modeling Image Tone Dichotomy with the Power Function

链接: https://arxiv.org/abs/2409.06764
作者: Axel Martinez,Gustavo Olague,Emilio Hernandez
关键词-EN: illumination modeling based, image illumination modeling, power function, primary purpose, present the concept
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: 49 pages, 11 figures and 36 references

点击查看摘要

Abstract:The primary purpose of this paper is to present the concept of dichotomy in image illumination modeling based on the power function. In particular, we review several mathematical properties of the power function to identify the limitations and propose a new mathematical model capable of abstracting illumination dichotomy. The simplicity of the equation opens new avenues for classical and modern image analysis and processing. The article provides practical and illustrative image examples to explain how the new model manages dichotomy in image perception. The article shows dichotomy image space as a viable way to extract rich information from images despite poor contrast linked to tone, lightness, and color perception. Moreover, a comparison with state-of-the-art methods in image enhancement provides evidence of the method’s value.
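
For reference, the classical power-law (gamma) transform that the proposed dichotomy model builds on is easy to state in code. The sketch below applies a gamma and its reciprocal to a random image; it shows only the standard building block, not the paper's model.

```python
import numpy as np

def power_tone(image, gamma, c=1.0):
    """Classic power-law transform s = c * r**gamma on an image normalized to [0, 1]."""
    img = np.clip(image, 0.0, 1.0)
    return np.clip(c * img ** gamma, 0.0, 1.0)

img = np.random.rand(4, 4)                     # stand-in for a grayscale image
darker = power_tone(img, gamma=2.2)            # compresses highlights, darkens midtones
brighter = power_tone(img, gamma=1 / 2.2)      # expands shadows, brightens midtones
print(img.mean(), darker.mean(), brighter.mean())
```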

[AI-62] Beyond designers knowledge: Generating materials design hypotheses via large language models

链接: https://arxiv.org/abs/2409.06756
作者: Quanliang Liu,Maciej P. Polak,So Yeon Kim,MD Al Amin Shuvo,Hrishikesh Shridhar Deodhar,Jeongsoo Han,Dane Morgan,Hyunseok Oh
关键词-EN: process inherently limited, extract knowledge implications, expertise is required, inherently limited, relies on human-generated
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Materials design often relies on human-generated hypotheses, a process inherently limited by cognitive constraints such as knowledge gaps and limited ability to integrate and extract knowledge implications, particularly when multidisciplinary expertise is required. This work demonstrates that large language models (LLMs), coupled with prompt engineering, can effectively generate non-trivial materials hypotheses by integrating scientific principles from diverse sources without explicit design guidance by human experts. These include design ideas for high-entropy alloys with superior cryogenic properties and halide solid electrolytes with enhanced ionic conductivity and formability. These design ideas have been experimentally validated in high-impact publications in 2023 not available in the LLM training data, demonstrating the LLM’s ability to generate highly valuable and realizable innovative ideas not established in the literature. Our approach primarily leverages materials system charts encoding processing-structure-property relationships, enabling more effective data integration by condensing key information from numerous papers, and evaluation and categorization of numerous hypotheses for human cognition, both through the LLM. This LLM-driven approach opens the door to new avenues of artificial intelligence-driven materials discovery by accelerating design, democratizing innovation, and expanding capabilities beyond the designer’s direct knowledge.

[AI-63] Scaling Law Hypothesis for Multimodal Model

链接: https://arxiv.org/abs/2409.06754
作者: Qingyun Sun,Zhen Guo
关键词-EN: models processing text, scaling law hypothesis, processing text, embedding space, multimodal models processing
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We propose a scaling law hypothesis for multimodal models processing text, audio, images, and video within a shared token and embedding space. Our framework predicts model performance based on modality-specific compression and tokenization efficiency, extending established scaling laws from text-based decoder models to mixed-modality systems. We explore whether leveraging more training data in multiple modalities can reduce the size of the multimodal model, enabling efficient deployment on resource-constrained devices.
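
To make the hypothesis concrete, the sketch below assumes a Chinchilla-style loss in which each modality contributes "effective tokens" scaled by a tokenization-efficiency factor. Both the functional form and every constant are illustrative assumptions, not the paper's fitted law.

```python
def predicted_loss(n_params, tokens_per_modality, efficiency,
                   E=1.7, A=400.0, B=410.0, alpha=0.34, beta=0.28):
    # Effective training tokens: raw tokens per modality scaled by its efficiency factor.
    effective_tokens = sum(efficiency[m] * t for m, t in tokens_per_modality.items())
    return E + A / n_params**alpha + B / effective_tokens**beta

tokens = {"text": 1e12, "image": 5e11, "audio": 2e11}      # hypothetical token counts
eff = {"text": 1.0, "image": 0.3, "audio": 0.5}            # hypothetical efficiencies
print(predicted_loss(7e9, tokens, eff))
```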

[AI-64] Can Agents Spontaneously Form a Society? Introducing a Novel Architecture for Generative Multi-Agents to Elicit Social Emergence

链接: https://arxiv.org/abs/2409.06750
作者: H. Zhang,J. Yin,M. Jiang,C. Su
关键词-EN: demonstrated impressive capabilities, social interactions, specific tasks, independent tasks, framework called LTRHA
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: 13 pages, 8 figures

点击查看摘要

Abstract:Generative agents have demonstrated impressive capabilities in specific tasks, but most of these frameworks focus on independent tasks and lack attention to social interactions. We introduce a generative agent architecture called ITCMA-S, which includes a basic framework for individual agents and a framework called LTRHA that supports social interactions among multi-agents. This architecture enables agents to identify and filter out behaviors that are detrimental to social interactions, guiding them to choose more favorable actions. We designed a sandbox environment to simulate the natural evolution of social relationships among multiple identity-less agents for experimental evaluation. The results showed that ITCMA-S performed well on multiple evaluation indicators, demonstrating its ability to actively explore the environment, recognize new agents, and acquire new information through continuous actions and dialogue. Observations show that as agents establish connections with each other, they spontaneously form cliques with internal hierarchies around a selected leader and organize collective activities.

[AI-65] EasyST: A Simple Framework for Spatio-Temporal Prediction CIKM’2024

链接: https://arxiv.org/abs/2409.06748
作者: Jiabin Tang,Wei Wei,Lianghao Xia,Chao Huang
关键词-EN: crucial research area, public safety, Graph Neural Networks, implications for transportation, environmental monitoring
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted by CIKM’2024, full paper

点击查看摘要

Abstract:Spatio-temporal prediction is a crucial research area in data-driven urban computing, with implications for transportation, public safety, and environmental monitoring. However, scalability and generalization challenges remain significant obstacles. Advanced models often rely on Graph Neural Networks to encode spatial and temporal correlations, but struggle with the increased complexity of large-scale datasets. The recursive GNN-based message passing schemes used in these models hinder their training and deployment in real-life urban sensing scenarios. Moreover, long-spanning large-scale spatio-temporal data introduce distribution shifts, necessitating improved generalization performance. To address these challenges, we propose a simple framework for spatio-temporal prediction - EasyST paradigm. It learns lightweight and robust Multi-Layer Perceptrons (MLPs) by effectively distilling knowledge from complex spatio-temporal GNNs. We ensure robust knowledge distillation by integrating the spatio-temporal information bottleneck with teacher-bounded regression loss, filtering out task-irrelevant noise and avoiding erroneous guidance. We further enhance the generalization ability of the student model by incorporating spatial and temporal prompts to provide downstream task contexts. Evaluation on three spatio-temporal datasets for urban computing tasks demonstrates that EasyST surpasses state-of-the-art approaches in terms of efficiency and accuracy. The implementation code is available at: this https URL.
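
The teacher-bounded regression loss can be sketched as distilling from the GNN teacher only where the teacher is closer to the ground truth than the student. The PyTorch snippet below uses an assumed formulation and toy tensors rather than the authors' code; `teacher_pred` stands in for frozen spatio-temporal GNN outputs.

```python
import torch
import torch.nn.functional as F

def teacher_bounded_distill(student_pred, teacher_pred, target, margin=0.0):
    student_err = F.mse_loss(student_pred, target, reduction="none").mean(dim=-1)
    teacher_err = F.mse_loss(teacher_pred, target, reduction="none").mean(dim=-1)
    imitation = F.mse_loss(student_pred, teacher_pred, reduction="none").mean(dim=-1)
    # Only imitate the teacher where it beats the student by at least `margin`,
    # which filters out erroneous guidance from the teacher.
    mask = (student_err > teacher_err + margin).float()
    return (mask * imitation).mean() + student_err.mean()

student = torch.nn.Sequential(torch.nn.Linear(8, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1))
x, y = torch.randn(16, 8), torch.randn(16, 1)
teacher_pred = y + 0.1 * torch.randn_like(y)     # stand-in for frozen ST-GNN predictions
loss = teacher_bounded_distill(student(x), teacher_pred, y)
loss.backward()
print(loss.item())
```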

[AI-66] Personalized Knowledge Tracing through Student Representation Reconstruction and Class Imbalance Mitigation

链接: https://arxiv.org/abs/2409.06745
作者: Zhiyu Chen,Wei Ji,Jing Xiao,Zitao Liu
关键词-EN: predicts students’ future, students’ future performance, enabling a precise, students’ future, analyzing their learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:Knowledge tracing is a technique that predicts students’ future performance by analyzing their learning process through historical interactions with intelligent educational platforms, enabling a precise evaluation of their knowledge mastery. Recent studies have achieved significant progress by leveraging powerful deep neural networks. These models construct complex input representations using questions, skills, and other auxiliary information but overlook individual student characteristics, which limits the capability for personalized assessment. Additionally, the available datasets in the field exhibit class imbalance issues. The models that simply predict all responses as correct without substantial effort can yield impressive accuracy. In this paper, we propose PKT, a novel approach for personalized knowledge tracing. PKT reconstructs representations from sequences of interactions with a tutoring platform to capture latent information about the students. Moreover, PKT incorporates focal loss to better prioritize minority classes, thereby achieving more balanced predictions. Extensive experimental results on four publicly available educational datasets demonstrate the advanced predictive performance of PKT in comparison with 16 state-of-the-art models. To ensure the reproducibility of our research, the code is publicly available at https://anonymous.4open.science/r/PKT.
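
The focal-loss component referenced above is a standard remedy for class imbalance; a binary variant suitable for knowledge-tracing labels is sketched below with illustrative hyperparameters (this is not the authors' implementation).

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)                  # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()           # down-weights easy examples

logits = torch.randn(32)
targets = (torch.rand(32) > 0.2).float()   # imbalanced labels: mostly correct responses
print(binary_focal_loss(logits, targets).item())
```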

[AI-67] Generative AI for Requirements Engineering: A Systematic Literature Review

链接: https://arxiv.org/abs/2409.06741
作者: Haowei Cheng,Jati H. Husen,Sien Reeve Peralta,Bowen Jiang,Nobukazu Yoshioka,Naoyasu Ubayashi,Hironori Washizaki
关键词-EN: software engineering, actively exploring, transformative tool, tool in software, requirements engineering
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Context: Generative AI (GenAI) has emerged as a transformative tool in software engineering, with requirements engineering (RE) actively exploring its potential to revolutionize processes and outcomes. The integration of GenAI into RE presents both promising opportunities and significant challenges that necessitate systematic analysis and evaluation. Objective: This paper presents a comprehensive systematic literature review (SLR) analyzing state-of-the-art applications and innovative proposals leveraging GenAI in RE. It surveys studies focusing on the utilization of GenAI to enhance RE processes while identifying key challenges and opportunities in this rapidly evolving field. Method: A rigorous SLR methodology was used to analyze 27 carefully selected primary studies in-depth. The review examined research questions pertaining to the application of GenAI across various RE phases, the models and techniques used, and the challenges encountered in implementation and adoption. Results: The most salient findings include i) a predominant focus on the early stages of RE, particularly the elicitation and analysis of requirements, indicating potential for expansion into later phases; ii) the dominance of large language models, especially the GPT series, highlighting the need for diverse AI approaches; and iii) persistent challenges in domain-specific applications and the interpretability of AI-generated outputs, underscoring areas requiring further research and development. Conclusions: The results highlight the critical need for comprehensive evaluation frameworks, improved human-AI collaboration models, and thorough consideration of ethical implications in GenAI-assisted RE. Future research should prioritize extending GenAI applications across the entire RE lifecycle, enhancing domain-specific capabilities, and developing strategies for responsible AI integration in RE practices.

[AI-68] How will advanced AI systems impact democracy?

链接: https://arxiv.org/abs/2409.06729
作者: Christopher Summerfield,Lisa Argyle,Michiel Bakker,Teddy Collins,Esin Durmus,Tyna Eloundou,Iason Gabriel,Deep Ganguli,Kobi Hackenburg,Gillian Hadfield,Luke Hewitt,Saffron Huang,Helene Landemore,Nahema Marchal,Aviv Ovadya,Ariel Procaccia,Mathias Risse,Bruce Schneier,Elizabeth Seger,Divya Siddarth,Henrik Skaug Sætra,MH Tessler,Matthew Botvinick
关键词-EN: generating humanlike text, Advanced AI systems, capable of generating, generating humanlike, humanlike text
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: 25 pages

点击查看摘要

Abstract:Advanced AI systems capable of generating humanlike text and multimodal content are now widely available. In this paper, we discuss the impacts that generative artificial intelligence may have on democratic processes. We consider the consequences of AI for citizens’ ability to make informed choices about political representatives and issues (epistemic impacts). We ask how AI might be used to destabilise or support democratic mechanisms like elections (material impacts). Finally, we discuss whether AI will strengthen or weaken democratic principles (foundational impacts). It is widely acknowledged that new AI systems could pose significant challenges for democracy. However, it has also been argued that generative AI offers new opportunities to educate and learn from citizens, strengthen public discourse, help people find common ground, and to reimagine how democracies might work better.

[AI-69] Feedback-based Modal Mutual Search for Attacking Vision-Language Pre-training Models

链接: https://arxiv.org/abs/2409.06726
作者: Renhua Ding,Xinze Zhang,Xiao Yang,Kun He
关键词-EN: achieved remarkable progress, attacking VLP models, vision-language pre-training, achieved remarkable, remarkable progress
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 9 pages

点击查看摘要

Abstract:Although vision-language pre-training (VLP) models have achieved remarkable progress on cross-modal tasks, they remain vulnerable to adversarial attacks. Using data augmentation and cross-modal interactions to generate transferable adversarial examples on surrogate models, transfer-based black-box attacks have become the mainstream methods in attacking VLP models, as they are more practical in real-world scenarios. However, their transferability may be limited due to the differences on feature representation across different models. To this end, we propose a new attack paradigm called Feedback-based Modal Mutual Search (FMMS). FMMS introduces a novel modal mutual loss (MML), aiming to push away the matched image-text pairs while randomly drawing mismatched pairs closer in feature space, guiding the update directions of the adversarial examples. Additionally, FMMS leverages the target model feedback to iteratively refine adversarial examples, driving them into the adversarial region. To our knowledge, this is the first work to exploit target model feedback to explore multi-modality adversarial boundaries. Extensive empirical evaluations on Flickr30K and MSCOCO datasets for image-text matching tasks show that FMMS significantly outperforms the state-of-the-art baselines.
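
A hedged sketch of a "modal mutual" style objective is given below: minimizing it with respect to the adversarial image embedding lowers similarity to the matched caption while raising similarity to randomly drawn mismatched captions. The exact MML formulation and the feedback refinement in FMMS may differ; the embeddings here are random placeholders.

```python
import torch
import torch.nn.functional as F

def modal_mutual_loss(img_emb, txt_emb):
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    matched_sim = (img_emb * txt_emb).sum(dim=-1)            # same index = matched pair
    perm = torch.randperm(txt_emb.size(0))
    mismatched_sim = (img_emb * txt_emb[perm]).sum(dim=-1)   # randomly drawn mismatches
    # Minimizing this pushes matched pairs apart and draws mismatched pairs closer.
    return matched_sim.mean() - mismatched_sim.mean()

img = torch.randn(8, 512, requires_grad=True)    # adversarial image embeddings (placeholder)
txt = torch.randn(8, 512)                        # caption embeddings (placeholder)
loss = modal_mutual_loss(img, txt)
loss.backward()                                  # gradient guides the adversarial update
print(loss.item(), img.grad.abs().mean().item())
```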

[AI-70] Elementary School Students and Teachers Perceptions Towards Creative Mathematical Writing with Generative AI

链接: https://arxiv.org/abs/2409.06723
作者: Yukyeong Song,Jinhee Kim,Wanli Xing,Zifeng Liu,Chenglu Li,Hyunju Oh
关键词-EN: expressing mathematical ideas, potentially engage students, school-age students struggle, potentially engage, creative writing activities
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:While mathematical creative writing can potentially engage students in expressing mathematical ideas in an imaginative way, some elementary school-age students struggle in this process. Generative AI (GenAI) offers possibilities for supporting creative writing activities, such as providing story generation. However, the design of GenAI-powered learning technologies requires careful consideration of the technology reception in the actual classrooms. This study explores students’ and teachers’ perceptions of creative mathematical writing with the developed GenAI-powered technology. The study adopted a qualitative thematic analysis of the interviews, triangulated with open-ended survey responses and classroom observation of 79 elementary school students, resulting in six themes and 19 subthemes. This study contributes by investigating the lived experience of GenAI-supported learning and the design considerations for GenAI-powered learning technologies and instructions.

[AI-71] Students Perceived Roles Opportunities and Challenges of a Generative AI-powered Teachable Agent : A Case of Middle School Math Class

链接: https://arxiv.org/abs/2409.06721
作者: Yukyeong Song,Jinhee Kim,Zifeng Liu,Chenglu Li,Wanli Xing
关键词-EN: Ongoing advancements, advancements in Generative, applying long-standing, teachable agent, boosted the potential
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Ongoing advancements in Generative AI (GenAI) have boosted the potential of applying long-standing learning-by-teaching practices in the form of a teachable agent (TA). Despite the recognized roles and opportunities of TAs, less is known about how GenAI could create synergy or introduce challenges in TAs and how students perceived the application of GenAI in TAs. This study explored middle school students’ perceived roles, benefits, and challenges of GenAI-powered TAs in an authentic mathematics classroom. Through classroom observation, focus-group interviews, and open-ended surveys of 108 sixth-grade students, we found that students expected the GenAI-powered TA to serve as a learning companion, facilitator, and collaborative problem-solver. Students also expressed the benefits and challenges of GenAI-powered TAs. This study provides implications for the design of educational AI and AI-assisted instruction.

[AI-72] Evolutionary Game Dynamics Applied to Strategic Adoption of Immersive Technologies in Cultural Heritage and Tourism

链接: https://arxiv.org/abs/2409.06720
作者: Gioacchino Fazio,Stefano Fricano,Claudio Pirrone
关键词-EN: potential sectors interested, interested in integration, actors pondering, sectors interested, Immersive technologies
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Theoretical Economics (econ.TH)
*备注:

点击查看摘要

Abstract:Immersive technologies such as Metaverse, AR, and VR are at a crossroads, with many actors pondering their adoption and potential sectors interested in integration. The cultural and tourism industries are particularly impacted, facing significant pressure to make decisions that could shape their future landscapes. Stakeholders’ perceptions play a crucial role in this process, influencing the speed and extent of technology adoption. As immersive technologies promise to revolutionize experiences, stakeholders in these fields weigh the benefits and challenges of embracing such innovations. The current choices will likely determine the trajectory of cultural preservation and tourism enhancement, potentially transforming how we engage with history, art, and travel. Starting from a decomposition of stakeholders’ perceptions into principal components using Q-methodology, this article employs an evolutionary game model to attempt to map possible scenarios and highlight potential decision-making trajectories. The proposed approach highlights how evolutionary dynamics lead to identifying a dominant long-term strategy that emerges from the complex system of coexistence among various stakeholders.

[AI-73] Unveiling Visual Biases in Audio-Visual Localization Benchmarks ECCV24

链接: https://arxiv.org/abs/2409.06709
作者: Liangyu Chen,Zihao Yue,Boshen Xu,Qin Jin
关键词-EN: Audio-Visual Source Localization, Source Localization, aims to localize, Source, Localization
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注: Accepted by ECCV24 AVGenL Workshop

点击查看摘要

Abstract:Audio-Visual Source Localization (AVSL) aims to localize the source of sound within a video. In this paper, we identify a significant issue in existing benchmarks: the sounding objects are often easily recognized based solely on visual cues, which we refer to as visual bias. Such biases hinder these benchmarks from effectively evaluating AVSL models. To further validate our hypothesis regarding visual biases, we examine two representative AVSL benchmarks, VGG-SS and EpicSounding-Object, where the vision-only models outperform all audiovisual baselines. Our findings suggest that existing AVSL benchmarks need further refinement to facilitate audio-visual learning.

[AI-74] Ensuring Fairness with Transparent Auditing of Quantitative Bias in AI Systems

链接: https://arxiv.org/abs/2409.06708
作者: Chih-Cheng Rex Yuan,Bow-Yaw Wang
关键词-EN: decision-making processes, rapid advancement, growing trend, trend to integrate, American justice system
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:With the rapid advancement of AI, there is a growing trend to integrate AI into decision-making processes. However, AI systems may exhibit biases that lead decision-makers to draw unfair conclusions. Notably, the COMPAS system used in the American justice system to evaluate recidivism was found to favor racial majority groups; specifically, it violates a fairness standard called equalized odds. Various measures have been proposed to assess AI fairness. We present a framework for auditing AI fairness, involving third-party auditors and AI system providers, and we have created a tool to facilitate systematic examination of AI systems. The tool is open-sourced and publicly available. Unlike traditional AI systems, we advocate a transparent white-box and statistics-based approach. It can be utilized by third-party auditors, AI developers, or the general public for reference when judging the fairness criterion of AI systems.
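
Equalized odds, the fairness criterion cited above, requires that true-positive and false-positive rates match across protected groups. The snippet below checks this on synthetic data and is unrelated to the authors' auditing tool.

```python
import numpy as np

def rates(y_true, y_pred):
    tpr = np.mean(y_pred[y_true == 1] == 1)   # true-positive rate
    fpr = np.mean(y_pred[y_true == 0] == 1)   # false-positive rate
    return tpr, fpr

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 1000)
group = rng.integers(0, 2, 1000)
# Hypothetical biased classifier: noisy positive-leaning predictions for group 1 only.
y_pred = np.where(group == 1, (rng.random(1000) < 0.6).astype(int), y_true)

for g in (0, 1):
    tpr, fpr = rates(y_true[group == g], y_pred[group == g])
    print(f"group {g}: TPR={tpr:.2f}, FPR={fpr:.2f}")
```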

[AI-75] Discovering Long-Term Effects on Parameter Efficient Fine-tuning

链接: https://arxiv.org/abs/2409.06706
作者: Gaole Dai,Yiming Tang,Chunkai Fan,Qizhe Zhang,Zhi Zhang,Yulu Gan,Chengqing Zeng,Shanghang Zhang,Tiejun Huang
关键词-EN: Pre-trained Artificial Neural, specifically Biological Neural, Artificial Neural Networks, Biological Neural Networks, Artificial Neural
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Pre-trained Artificial Neural Networks (ANNs) exhibit robust pattern recognition capabilities and share extensive similarities with the human brain, specifically Biological Neural Networks (BNNs). We are particularly intrigued by these models’ ability to acquire new knowledge through fine-tuning. In this regard, Parameter-efficient Fine-tuning (PEFT) has gained widespread adoption as a substitute for full fine-tuning due to its cost reduction in training and mitigation of over-fitting risks by limiting the number of trainable parameters during adaptation. Since both ANNs and BNNs propagate information layer-by-layer, a common analogy can be drawn: weights in ANNs represent synapses in BNNs, while features (also known as latent variables or logits) in ANNs represent neurotransmitters released by neurons in BNNs. Mainstream PEFT methods aim to adjust feature or parameter values using only a limited number of trainable parameters (usually less than 1% of the total parameters), yet achieve surprisingly good results. Building upon this clue, we delve deeper into exploring the connections between feature adjustment and parameter adjustment, resulting in our proposed method Synapses Neurons (SAN) that learns scaling matrices for features and propagates their effects towards posterior weight matrices. Our approach draws strong inspiration from well-known neuroscience phenomena - Long-term Potentiation (LTP) and Long-term Depression (LTD), which also reveal the relationship between synapse development and neurotransmitter release levels. We conducted extensive comparisons of PEFT on 26 datasets using attention-based networks as well as convolution-based networks, leading to significant improvements compared to other tuning methods (+8.5% over fully-finetune, +7% over Visual Prompt Tuning, and +3.2% over LoRA). The codes would be released.
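
The core PEFT idea of learning a per-channel scaling of features on top of a frozen backbone can be sketched in a few lines of PyTorch. The propagation of the learned scaling into posterior weight matrices, which is specific to SAN, is not reproduced here; the module below is only a generic illustration.

```python
import torch
import torch.nn as nn

class ScaledBlock(nn.Module):
    """Frozen linear layer followed by a learnable per-channel feature scaling."""
    def __init__(self, frozen_layer: nn.Linear):
        super().__init__()
        self.layer = frozen_layer
        for p in self.layer.parameters():
            p.requires_grad = False                           # backbone stays frozen
        self.scale = nn.Parameter(torch.ones(frozen_layer.out_features))

    def forward(self, x):
        return self.layer(x) * self.scale                     # only `scale` is trainable

block = ScaledBlock(nn.Linear(16, 16))
block(torch.randn(4, 16)).sum().backward()
trainable = sum(p.numel() for p in block.parameters() if p.requires_grad)
total = sum(p.numel() for p in block.parameters())
print(f"trainable params: {trainable} / {total}")
```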

[AI-76] Zero-Shot Text-to-Speech as Golden Speech Generator: A Systematic Framework and its Applicability in Automatic Pronunciation Assessment

链接: https://arxiv.org/abs/2409.07151
作者: Tien-Hong Lo,Meng-Ting Tsai,Berlin Chen
关键词-EN: respective speech characteristics, imitating golden speech, golden speech, speech, learner-specific golden speech
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
*备注: 11 pages, 4 figures, 4 tables

点击查看摘要

Abstract:Second language (L2) learners can improve their pronunciation by imitating golden speech, especially when the speech aligns with their respective speech characteristics. This study explores the hypothesis that learner-specific golden speech generated with zero-shot text-to-speech (ZS-TTS) techniques can be harnessed as an effective metric for measuring the pronunciation proficiency of L2 learners. Building on this exploration, the contributions of this study are at least two-fold: 1) design and development of a systematic framework for assessing the ability of a synthesis model to generate golden speech, and 2) in-depth investigations of the effectiveness of using golden speech in automatic pronunciation assessment (APA). Comprehensive experiments conducted on the L2-ARCTIC and Speechocean762 benchmark datasets suggest that our proposed modeling can yield significant performance improvements with respect to various assessment metrics in relation to some prior arts. To our knowledge, this study is the first to explore the role of golden speech in both ZS-TTS and APA, offering a promising regime for computer-assisted pronunciation training (CAPT).

[AI-77] Deep Learning Techniques for Hand Vein Biometrics: A Comprehensive Review

链接: https://arxiv.org/abs/2409.07128
作者: Mustapha Hemis,Hamza Kheddar,Sami Bourouis,Nasir Saleem
关键词-EN: garnered significant attention, hand vein biometrics, hand vein, hand vein recognition, vein
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Biometric authentication has garnered significant attention as a secure and efficient method of identity verification. Among the various modalities, hand vein biometrics, including finger vein, palm vein, and dorsal hand vein recognition, offer unique advantages due to their high accuracy, low susceptibility to forgery, and non-intrusiveness. The vein patterns within the hand are highly complex and distinct for each individual, making them an ideal biometric identifier. Additionally, hand vein recognition is contactless, enhancing user convenience and hygiene compared to other modalities such as fingerprint or iris recognition. Furthermore, the veins are internally located, rendering them less susceptible to damage or alteration, thus enhancing the security and reliability of the biometric system. The combination of these factors makes hand vein biometrics a highly effective and secure method for identity verification. This review paper delves into the latest advancements in deep learning techniques applied to finger vein, palm vein, and dorsal hand vein recognition. It encompasses all essential fundamentals of hand vein biometrics, summarizes publicly available datasets, and discusses state-of-the-art metrics used for evaluating the three modes. Moreover, it provides a comprehensive overview of suggested approaches for finger, palm, dorsal, and multimodal vein techniques, offering insights into the best performance achieved, data augmentation techniques, and effective transfer learning methods, along with associated pretrained deep learning models. Additionally, the review addresses research challenges faced and outlines future directions and perspectives, encouraging researchers to enhance existing methods and propose innovative techniques.

[AI-78] Attention Down-Sampling Transformer Relative Ranking and Self-Consistency for Blind Image Quality Assessment ICIP

链接: https://arxiv.org/abs/2409.07115
作者: Mohammed Alsaafin,Musab Alsheikh,Saeed Anwar,Muhammad Usman
关键词-EN: image quality assessment, addresses estimating image, no-reference image quality, image quality, quality assessment
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted in International Conference on Image Processing (ICIP)

点击查看摘要

Abstract:The no-reference image quality assessment is a challenging domain that addresses estimating image quality without the original reference. We introduce an improved mechanism to extract local and non-local information from images via different transformer encoders and CNNs. The utilization of Transformer encoders aims to mitigate locality bias and generate a non-local representation by sequentially processing CNN features, which inherently capture local visual structures. Establishing a stronger connection between subjective and objective assessments is achieved through sorting within batches of images based on relative distance information. A self-consistency approach to self-supervision is presented, explicitly addressing the degradation of no-reference image quality assessment (NR-IQA) models under equivariant transformations. Our approach ensures model robustness by maintaining consistency between an image and its horizontally flipped equivalent. Through empirical evaluation of five popular image quality assessment datasets, the proposed model outperforms alternative algorithms in the context of no-reference image quality assessment datasets, especially on smaller datasets. Codes are available at this https URL
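
The self-consistency idea is straightforward to sketch: quality predictions for an image and its horizontal flip are encouraged to agree. The toy model below is a placeholder, not the proposed transformer/CNN architecture.

```python
import torch
import torch.nn.functional as F

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 1))
images = torch.rand(8, 3, 32, 32)
flipped = torch.flip(images, dims=[-1])      # horizontal flip (equivariant transformation)

q, q_flip = model(images), model(flipped)
consistency_loss = F.mse_loss(q, q_flip)     # self-supervised consistency term
print(consistency_loss.item())
```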

[AI-79] CWT-Net: Super-resolution of Histopathology Images Using a Cross-scale Wavelet-based Transformer

链接: https://arxiv.org/abs/2409.07092
作者: Feiyang Jia,Zhineng Chen,Ziying Song,Lin Liu,Caiyan Jia
关键词-EN: medical imaging, quality of low-resolution, widely applied, applied in medical, low-resolution images
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Super-resolution (SR) aims to enhance the quality of low-resolution images and has been widely applied in medical imaging. We found that the design principles of most existing methods are influenced by SR tasks based on real-world images and do not take into account the significance of the multi-level structure in pathological images, even if they can achieve respectable objective metric evaluations. In this work, we delve into two super-resolution working paradigms and propose a novel network called CWT-Net, which leverages cross-scale image wavelet transform and Transformer architecture. Our network consists of two branches: one dedicated to learning super-resolution and the other to high-frequency wavelet features. To generate high-resolution histopathology images, the Transformer module shares and fuses features from both branches at various stages. Notably, we have designed a specialized wavelet reconstruction module to effectively enhance the wavelet domain features and enable the network to operate in different modes, allowing for the introduction of additional relevant information from cross-scale images. Our experimental results demonstrate that our model significantly outperforms state-of-the-art methods in both performance and visualization evaluations and can substantially boost the accuracy of image diagnostic networks.

[AI-80] Towards Predicting Temporal Changes in a Patient’s Chest X-ray Images based on Electronic Health Records

链接: https://arxiv.org/abs/2409.07012
作者: Daeun Kyung,Junu Kim,Tackeun Kim,Edward Choi
关键词-EN: Chest X-ray imaging, Chest X-ray, X-ray imaging, important diagnostic tool, assess patient conditions
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Chest X-ray imaging (CXR) is an important diagnostic tool used in hospitals to assess patient conditions and monitor changes over time. Generative models, specifically diffusion-based models, have shown promise in generating realistic synthetic X-rays. However, these models mainly focus on conditional generation using single-time-point data, i.e., typically CXRs taken at a specific time with their corresponding reports, limiting their clinical utility, particularly for capturing temporal changes. To address this limitation, we propose a novel framework, EHRXDiff, which predicts future CXR images by integrating previous CXRs with subsequent medical events, e.g., prescriptions, lab measures, etc. Our framework dynamically tracks and predicts disease progression based on a latent diffusion model, conditioned on the previous CXR image and a history of medical events. We comprehensively evaluate the performance of our framework across three key aspects, including clinical consistency, demographic consistency, and visual realism. We demonstrate that our framework generates high-quality, realistic future images that capture potential temporal changes, suggesting its potential for further development as a clinical simulation tool. This could offer valuable insights for patient monitoring and treatment planning in the medical field.

[AI-81] Generative Hierarchical Materials Search

链接: https://arxiv.org/abs/2409.06762
作者: Sherry Yang,Simon Batzner,Ruiqi Gao,Muratahan Aykol,Alexander L. Gaunt,Brendan McMorrow,Danilo J. Rezende,Dale Schuurmans,Igor Mordatch,Ekin D. Cubuk
关键词-EN: Generative Hierarchical Materials, Generative models trained, crystal structures, produce text, scientific data
类目: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
*备注: this https URL

点击查看摘要

Abstract:Generative models trained at scale can now produce text, video, and more recently, scientific data such as crystal structures. In applications of generative approaches to materials science, and in particular to crystal structures, the guidance from the domain expert in the form of high-level instructions can be essential for an automated system to output candidate crystals that are viable for downstream research. In this work, we formulate end-to-end language-to-structure generation as a multi-objective optimization problem, and propose Generative Hierarchical Materials Search (GenMS) for controllable generation of crystal structures. GenMS consists of (1) a language model that takes high-level natural language as input and generates intermediate textual information about a crystal (e.g., chemical formulae), and (2) a diffusion model that takes intermediate information as input and generates low-level continuous value crystal structures. GenMS additionally uses a graph neural network to predict properties (e.g., formation energy) from the generated crystal structures. During inference, GenMS leverages all three components to conduct a forward tree search over the space of possible structures. Experiments show that GenMS outperforms other alternatives of directly using language models to generate structures both in satisfying user request and in generating low-energy structures. We confirm that GenMS is able to generate common crystal structures such as double perovskites, or spinels, solely from natural language input, and hence can form the foundation for more complex structure generation in near future.

[AI-82] ProteinBench: A Holistic Evaluation of Protein Foundation Models

链接: https://arxiv.org/abs/2409.06744
作者: Fei Ye,Zaixiang Zheng,Dongyu Xue,Yuning Shen,Lihao Wang,Yiming Ma,Yan Wang,Xinyou Wang,Xiangxin Zhou,Quanquan Gu
关键词-EN: Recent years, protein foundation models, significantly improving performance, generative tasks ranging, structure prediction
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注: 29 pages, 1 figure and 11 tables

点击查看摘要

Abstract:Recent years have witnessed a surge in the development of protein foundation models, significantly improving performance in protein prediction and generative tasks ranging from 3D structure prediction and protein design to conformational dynamics. However, the capabilities and limitations associated with these models remain poorly understood due to the absence of a unified evaluation framework. To fill this gap, we introduce ProteinBench, a holistic evaluation framework designed to enhance the transparency of protein foundation models. Our approach consists of three key components: (i) A taxonomic classification of tasks that broadly encompass the main challenges in the protein domain, based on the relationships between different protein modalities; (ii) A multi-metric evaluation approach that assesses performance across four key dimensions: quality, novelty, diversity, and robustness; and (iii) In-depth analyses from various user objectives, providing a holistic view of model performance. Our comprehensive evaluation of protein foundation models reveals several key findings that shed light on their current capabilities and limitations. To promote transparency and facilitate further research, we release the evaluation dataset, code, and a public leaderboard publicly for further analysis and a general modular toolkit. We intend for ProteinBench to be a living benchmark for establishing a standardized, in-depth evaluation framework for protein foundation models, driving their development and application while fostering collaboration within the field.

计算机视觉

[CV-0] Self-Evolving Depth-Supervised 3D Gaussian Splatting from Rendered Stereo Pairs BMVC2024

链接: https://arxiv.org/abs/2409.07456
作者: Sadra Safadoust,Fabio Tosi,Fatma Güney,Matteo Poggi
关键词-EN: Gaussian Splatting, rendering depth maps, significantly struggles, represent the underlying, resulting in inaccuracies
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: BMVC 2024. Project page: this https URL

点击查看摘要

Abstract:3D Gaussian Splatting (GS) significantly struggles to accurately represent the underlying 3D scene geometry, resulting in inaccuracies and floating artifacts when rendering depth maps. In this paper, we address this limitation, undertaking a comprehensive analysis of the integration of depth priors throughout the optimization process of Gaussian primitives, and present a novel strategy for this purpose. This latter dynamically exploits depth cues from a readily available stereo network, processing virtual stereo pairs rendered by the GS model itself during training and achieving consistent self-improvement of the scene representation. Experimental results on three popular datasets, breaking ground as the first to assess depth accuracy for these models, validate our findings.

[CV-1] DreamMesh: Jointly Manipulating and Texturing Triangle Meshes for Text-to-3D Generation ECCV2024

链接: https://arxiv.org/abs/2409.07454
作者: Haibo Yang,Yang Chen,Yingwei Pan,Ting Yao,Zhineng Chen,Zuxuan Wu,Yu-Gang Jiang,Tao Mei
关键词-EN: Learning radiance fields, Learning radiance, radiance fields, garnered popularity, Learning
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*备注: ECCV 2024. Project page is available at \url{ this https URL }

点击查看摘要

Abstract:Learning radiance fields (NeRF) with powerful 2D diffusion models has garnered popularity for text-to-3D generation. Nevertheless, the implicit 3D representations of NeRF lack explicit modeling of meshes and textures over surfaces, and such surface-undefined way may suffer from the issues, e.g., noisy surfaces with ambiguous texture details or cross-view inconsistency. To alleviate this, we present DreamMesh, a novel text-to-3D architecture that pivots on well-defined surfaces (triangle meshes) to generate high-fidelity explicit 3D model. Technically, DreamMesh capitalizes on a distinctive coarse-to-fine scheme. In the coarse stage, the mesh is first deformed by text-guided Jacobians and then DreamMesh textures the mesh with an interlaced use of 2D diffusion models in a tuning free manner from multiple viewpoints. In the fine stage, DreamMesh jointly manipulates the mesh and refines the texture map, leading to high-quality triangle meshes with high-fidelity textured materials. Extensive experiments demonstrate that DreamMesh significantly outperforms state-of-the-art text-to-3D methods in faithfully generating 3D content with richer textual details and enhanced geometry. Our project page is available at this https URL.

[CV-2] Hi3D: Pursuing High-Resolution Image-to-3D Generation with Video Diffusion Models

链接: https://arxiv.org/abs/2409.07452
作者: Haibo Yang,Yang Chen,Yingwei Pan,Ting Yao,Zhineng Chen,Chong-Wah Ngo,Tao Mei
关键词-EN: existing methods, video diffusion model, tremendous progress, methods still struggle, video diffusion
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*备注: ACM Multimedia 2024. Source code is available at \url{ this https URL }

点击查看摘要

Abstract:Despite having tremendous progress in image-to-3D generation, existing methods still struggle to produce multi-view consistent images with high-resolution textures in detail, especially in the paradigm of 2D diffusion that lacks 3D awareness. In this work, we present High-resolution Image-to-3D model (Hi3D), a new video diffusion based paradigm that redefines a single image to multi-view images as 3D-aware sequential image generation (i.e., orbital video generation). This methodology delves into the underlying temporal consistency knowledge in video diffusion model that generalizes well to geometry consistency across multiple views in 3D generation. Technically, Hi3D first empowers the pre-trained video diffusion model with 3D-aware prior (camera pose condition), yielding multi-view images with low-resolution texture details. A 3D-aware video-to-video refiner is learnt to further scale up the multi-view images with high-resolution texture details. Such high-resolution multi-view images are further augmented with novel views through 3D Gaussian Splatting, which are finally leveraged to obtain high-fidelity meshes via 3D reconstruction. Extensive experiments on both novel view synthesis and single view reconstruction demonstrate that our Hi3D manages to produce superior multi-view consistency images with highly-detailed textures. Source code and data are available at this https URL.

[CV-3] FreeEnhance: Tuning-Free Image Enhancement via Content-Consistent Noising-and-Denoising Process

链接: https://arxiv.org/abs/2409.07451
作者: Yang Luo,Yiheng Zhang,Zhaofan Qiu,Ting Yao,Zhineng Chen,Yu-Gang Jiang,Tao Mei
关键词-EN: diffusion models, Latent Diffusion Models, image diffusion models, image, image enhancement
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*备注: ACM Multimedia 2024

点击查看摘要

Abstract:The emergence of text-to-image generation models has led to the recognition that image enhancement, performed as post-processing, would significantly improve the visual quality of the generated images. Exploring diffusion models to enhance the generated images nevertheless is not trivial and necessitates to delicately enrich plentiful details while preserving the visual appearance of key content in the original image. In this paper, we propose a novel framework, namely FreeEnhance, for content-consistent image enhancement using the off-the-shelf image diffusion models. Technically, FreeEnhance is a two-stage process that firstly adds random noise to the input image and then capitalizes on a pre-trained image diffusion model (i.e., Latent Diffusion Models) to denoise and enhance the image details. In the noising stage, FreeEnhance is devised to add lighter noise to the region with higher frequency to preserve the high-frequent patterns (e.g., edge, corner) in the original image. In the denoising stage, we present three target properties as constraints to regularize the predicted noise, enhancing images with high acutance and high visual quality. Extensive experiments conducted on the HPDv2 dataset demonstrate that our FreeEnhance outperforms the state-of-the-art image enhancement models in terms of quantitative metrics and human preference. More remarkably, FreeEnhance also shows higher human preference compared to the commercial image enhancement solution of Magnific AI.
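
The frequency-aware noising stage can be approximated by scaling Gaussian noise with an inverse high-frequency map (a Laplacian response is used below). This is a rough sketch of the idea with an assumed noise schedule, not FreeEnhance's actual noising-and-denoising pipeline.

```python
import torch
import torch.nn.functional as F

def frequency_aware_noise(image, base_sigma=0.3):
    lap = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]]).view(1, 1, 3, 3)
    gray = image.mean(dim=1, keepdim=True)
    high_freq = F.conv2d(gray, lap, padding=1).abs()
    high_freq = high_freq / (high_freq.amax(dim=(-2, -1), keepdim=True) + 1e-8)
    sigma = base_sigma * (1.0 - high_freq)     # lighter noise on edges and corners
    return image + sigma * torch.randn_like(image)

img = torch.rand(1, 3, 64, 64)                 # stand-in for a generated image
noised = frequency_aware_noise(img)            # would then be denoised by a diffusion model
print((noised - img).std().item())
```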

[CV-4] VMAS: Video-to-Music Generation via Semantic Alignment in Web Music Videos

链接: https://arxiv.org/abs/2409.07450
作者: Yan-Bo Lin,Yu Tian,Linjie Yang,Gedas Bertasius,Heng Wang
关键词-EN: music, background music, generate background music, video, background
类目: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注: Project Page: this https URL

点击查看摘要

Abstract:We present a framework for learning to generate background music from video inputs. Unlike existing works that rely on symbolic musical annotations, which are limited in quantity and diversity, our method leverages large-scale web videos accompanied by background music. This enables our model to learn to generate realistic and diverse music. To accomplish this goal, we develop a generative video-music Transformer with a novel semantic video-music alignment scheme. Our model uses a joint autoregressive and contrastive learning objective, which encourages the generation of music aligned with high-level video content. We also introduce a novel video-beat alignment scheme to match the generated music beats with the low-level motions in the video. Lastly, to capture fine-grained visual cues in a video needed for realistic background music generation, we introduce a new temporal video encoder architecture, allowing us to efficiently process videos consisting of many densely sampled frames. We train our framework on our newly curated DISCO-MV dataset, consisting of 2.2M video-music samples, which is orders of magnitude larger than any prior datasets used for video music generation. Our method outperforms existing approaches on the DISCO-MV and MusicCaps datasets according to various music generation evaluation metrics, including human evaluation. Results are available at this https URL

[CV-5] StereoCrafter: Diffusion-based Generation of Long and High-fidelity Stereoscopic 3D from Monocular Videos

链接: https://arxiv.org/abs/2409.07447
作者: Sijie Zhao,Wenbo Hu,Xiaodong Cun,Yong Zhang,Xiaoyu Li,Zhe Kong,Xiangjun Gao,Muyao Niu,Ying Shan
关键词-EN: addressing the growing, paper presents, growing demand, stereo video inpainting, video
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注: 11 pages, 10 figures

点击查看摘要

Abstract:This paper presents a novel framework for converting 2D videos to immersive stereoscopic 3D, addressing the growing demand for 3D content in immersive experience. Leveraging foundation models as priors, our approach overcomes the limitations of traditional methods and boosts the performance to ensure the high-fidelity generation required by the display devices. The proposed system consists of two main steps: depth-based video splatting for warping and extracting occlusion mask, and stereo video inpainting. We utilize pre-trained stable video diffusion as the backbone and introduce a fine-tuning protocol for the stereo video inpainting task. To handle input video with varying lengths and resolutions, we explore auto-regressive strategies and tiled processing. Finally, a sophisticated data processing pipeline has been developed to reconstruct a large-scale and high-quality dataset to support our training. Our framework demonstrates significant improvements in 2D-to-3D video conversion, offering a practical solution for creating immersive content for 3D devices like Apple Vision Pro and 3D displays. In summary, this work contributes to the field by presenting an effective method for generating high-quality stereoscopic videos from monocular input, potentially transforming how we experience digital media.

[CV-6] Adaptive Adapter Routing for Long-Tailed Class-Incremental Learning

链接: https://arxiv.org/abs/2409.07446
作者: Zhi-Hong Qi,Da-Wei Zhou,Yiran Yao,Han-Jia Ye,De-Chuan Zhan
关键词-EN: e-commerce platform reviews, ever-evolving world, platform reviews, e-commerce platform, long-tailed distribution
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to Machine Learning Journal. Code is available at: this https URL

点击查看摘要

Abstract:In our ever-evolving world, new data exhibits a long-tailed distribution, such as e-commerce platform reviews. This necessitates continually learning on imbalanced data without forgetting, i.e., addressing the challenge of long-tailed class-incremental learning (LTCIL). Existing methods often rely on retraining linear classifiers with former data, which is impractical in real-world settings. In this paper, we harness the potent representation capabilities of pre-trained models and introduce AdaPtive Adapter RouTing (APART) as an exemplar-free solution for LTCIL. To counteract forgetting, we train inserted adapters with frozen pre-trained weights for deeper adaptation and maintain a pool of adapters for selection during sequential model updates. Additionally, we present an auxiliary adapter pool designed for effective generalization, especially on minority classes. Adaptive instance routing across these pools captures crucial correlations, facilitating a comprehensive representation of all classes. Consequently, APART tackles the imbalance problem as well as catastrophic forgetting in a unified framework. Extensive benchmark experiments validate the effectiveness of APART. Code is available at: this https URL
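
To make the adapter-pool idea concrete, here is a small PyTorch sketch of per-instance routing over a pool of bottleneck adapters on top of frozen features. The routing rule (cosine similarity to learned keys), pool size, and module names are illustrative assumptions, not the APART implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Adapter(nn.Module):
    """A small bottleneck adapter applied on top of frozen features."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.down, self.up = nn.Linear(dim, hidden), nn.Linear(hidden, dim)

    def forward(self, x):
        return x + self.up(F.relu(self.down(x)))  # residual adaptation

class AdapterPool(nn.Module):
    """A pool of adapters with per-instance routing via learned keys."""
    def __init__(self, dim, num_adapters=4):
        super().__init__()
        self.adapters = nn.ModuleList(Adapter(dim) for _ in range(num_adapters))
        self.keys = nn.Parameter(torch.randn(num_adapters, dim))

    def forward(self, feats):
        # Route each instance to the adapter whose key is most similar.
        sims = F.normalize(feats, dim=-1) @ F.normalize(self.keys, dim=-1).t()
        idx = sims.argmax(dim=-1)
        out = torch.stack([self.adapters[int(i)](f) for f, i in zip(feats, idx)])
        return out, idx

# Usage: frozen backbone features -> routed adapter -> classifier.
backbone_dim, num_classes = 768, 100
pool = AdapterPool(backbone_dim)
classifier = nn.Linear(backbone_dim, num_classes)
feats = torch.randn(8, backbone_dim)        # stand-in for frozen ViT features
adapted, chosen = pool(feats)
logits = classifier(adapted)
print(logits.shape, chosen.tolist())
```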

[CV-7] Deep Neural Network-Based Sign Language Recognition: A Comprehensive Approach Using Transfer Learning with Explainability

链接: https://arxiv.org/abs/2409.07426
作者: A. E. M Ridwan,Mushfiqul Islam Chowdhury,Mekhala Mariam Mary,Md Tahmid Chowdhury Abir
关键词-EN: sign language recognition, language recognition, sign language, ensuring effective communication, language
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:To promote inclusion and ensure effective communication for those who rely on sign language as their main form of communication, sign language recognition (SLR) is crucial. SLR integrates seamlessly with diverse technologies, enhancing accessibility for the deaf community by facilitating their use of digital platforms, video calls, and communication devices. To effectively solve this problem, we suggest a novel solution that uses a deep neural network to fully automate sign language recognition. This methodology integrates sophisticated preprocessing methodologies to optimise the overall performance. The architectures ResNet, Inception, Xception, and VGG are utilised to selectively categorise images of sign language. We prepared a DNN architecture and merged it with the pre-processing architectures. In the post-processing phase, we utilised the SHAP deep explainer, which is based on cooperative game theory, to quantify the influence of specific features on the output of a machine learning model. The Bhutanese-Sign-Language (BSL) dataset was used for training and testing the suggested technique. When training on the BSL dataset, ResNet50 combined with the DNN model achieved the best overall accuracy of 98.90%. Our model’s ability to provide informational clarity was assessed using the SHAP (SHapley Additive exPlanations) method. Owing in part to its considerable robustness and reliability, the proposed methodological approach can be used to develop a fully automated system for sign language recognition.

[CV-8] NVRC: Neural Video Representation Compression

链接: https://arxiv.org/abs/2409.07414
作者: Ho Man Kwan,Ge Gao,Fan Zhang,Andrew Gower,David Bull
关键词-EN: Recent advances, implicit neural representation, Neural Video Representation, VVC VTM, Video Representation Compression
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Recent advances in implicit neural representation (INR)-based video coding have demonstrated its potential to compete with both conventional and other learning-based approaches. With INR methods, a neural network is trained to overfit a video sequence, with its parameters compressed to obtain a compact representation of the video content. However, although promising results have been achieved, the best INR-based methods are still out-performed by the latest standard codecs, such as VVC VTM, partially due to the simple model compression techniques employed. In this paper, rather than focusing on representation architectures as in many existing works, we propose a novel INR-based video compression framework, Neural Video Representation Compression (NVRC), targeting compression of the representation. Based on the novel entropy coding and quantization models proposed, NVRC, for the first time, is able to optimize an INR-based video codec in a fully end-to-end manner. To further minimize the additional bitrate overhead introduced by the entropy models, we have also proposed a new model compression framework for coding all the network, quantization and entropy model parameters hierarchically. Our experiments show that NVRC outperforms many conventional and learning-based benchmark codecs, with a 24% average coding gain over VVC VTM (Random Access) on the UVG dataset, measured in PSNR. As far as we are aware, this is the first time an INR-based video codec achieving such performance. The implementation of NVRC will be released at this http URL.

[CV-9] What to align in multimodal contrastive learning?

链接: https://arxiv.org/abs/2409.07402
作者: Benoit Dufumier,Javiera Castillo-Navarro,Devis Tuia,Jean-Philippe Thiran
关键词-EN: Humans perceive, multisensory integration, adapt their behavior, perceive the world, world through multisensory
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注: 22 pages

点击查看摘要

Abstract:Humans perceive the world through multisensory integration, blending the information of different modalities to adapt their behavior. Contrastive learning offers an appealing solution for multimodal self-supervised learning. Indeed, by considering each modality as a different view of the same entity, it learns to align features of different modalities in a shared representation space. However, this approach is intrinsically limited as it only learns shared or redundant information between modalities, while multimodal interactions can arise in other ways. In this work, we introduce CoMM, a Contrastive MultiModal learning strategy that enables the communication between modalities in a single multimodal space. Instead of imposing cross- or intra- modality constraints, we propose to align multimodal representations by maximizing the mutual information between augmented versions of these multimodal features. Our theoretical analysis shows that shared, synergistic and unique terms of information naturally emerge from this formulation, allowing us to estimate multimodal interactions beyond redundancy. We test CoMM both in a controlled and in a series of real-world settings: in the former, we demonstrate that CoMM effectively captures redundant, unique and synergistic information between modalities. In the latter, CoMM learns complex multimodal interactions and achieves state-of-the-art results on the six multimodal benchmarks.
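
The objective sketched below illustrates the core mechanism: fuse the modalities into one multimodal embedding, create two augmented views of it, and maximize their agreement with an InfoNCE-style loss. The toy fusion network and dropout-based "augmentations" are placeholders for illustration, not the CoMM architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Fusion(nn.Module):
    """Toy multimodal fusion: concatenate modality features and project."""
    def __init__(self, dims, out_dim=128):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(sum(dims), 256), nn.ReLU(),
                                  nn.Linear(256, out_dim))

    def forward(self, feats):
        return self.proj(torch.cat(feats, dim=-1))

def info_nce(z1, z2, temperature=0.1):
    """InfoNCE between two augmented views of the fused representation."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature          # (B, B) similarity matrix
    labels = torch.arange(z1.size(0))           # positives on the diagonal
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

# Two "augmented" multimodal views (feature dropout as a stand-in augmentation).
image_feats, text_feats = torch.randn(16, 512), torch.randn(16, 256)
fusion = Fusion([512, 256])
view1 = fusion([F.dropout(image_feats, 0.2), F.dropout(text_feats, 0.2)])
view2 = fusion([F.dropout(image_feats, 0.2), F.dropout(text_feats, 0.2)])
loss = info_nce(view1, view2)
loss.backward()
print(float(loss))
```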

[CV-10] FIRAL: An Active Learning Algorithm for Multinomial Logistic Regression NEURIPS2023

链接: https://arxiv.org/abs/2409.07379
作者: Youguang Chen,George Biros
关键词-EN: Fisher Information Ratio, pool-based active learning, investigate theory, Information Ratio, multinomial logistic regression
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at the 37th Conference on Neural Information Processing Systems (NeurIPS 2023)

点击查看摘要

Abstract:We investigate theory and algorithms for pool-based active learning for multiclass classification using multinomial logistic regression. Using finite sample analysis, we prove that the Fisher Information Ratio (FIR) lower and upper bounds the excess risk. Based on our theoretical analysis, we propose an active learning algorithm that employs regret minimization to minimize the FIR. To verify our derived excess risk bounds, we conduct experiments on synthetic datasets. Furthermore, we compare FIRAL with five other methods and found that our scheme outperforms them: it consistently produces the smallest classification error in the multiclass logistic regression setting, as demonstrated through experiments on MNIST, CIFAR-10, and 50-class ImageNet.
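
For intuition, the sketch below writes out the per-sample Fisher information of multinomial logistic regression and a simple greedy selection that reduces an empirical Fisher Information Ratio trace. The ridge term and the greedy rule are simplifications for illustration, not the regret-minimization algorithm of FIRAL.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def sample_fisher(x, W):
    """Per-sample Fisher information of multinomial logistic regression.

    With class probabilities p = softmax(W x), the Fisher information of the
    flattened weight vector is (diag(p) - p p^T) kron (x x^T).
    """
    p = softmax(W @ x)
    return np.kron(np.diag(p) - np.outer(p, p), np.outer(x, x))

def greedy_fir_selection(X, W, budget, ridge=1e-3):
    """Greedily pick points that reduce tr(I_pool @ inv(I_selected))."""
    d = X.shape[1] * W.shape[0]
    I_pool = sum(sample_fisher(x, W) for x in X) / len(X)
    I_sel = ridge * np.eye(d)
    chosen = []
    for _ in range(budget):
        best, best_score = None, np.inf
        for i in range(len(X)):
            if i in chosen:
                continue
            cand = I_sel + sample_fisher(X[i], W)
            score = np.trace(I_pool @ np.linalg.inv(cand))
            if score < best_score:
                best, best_score = i, score
        chosen.append(best)
        I_sel += sample_fisher(X[best], W)
    return chosen

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))             # pool of 40 unlabeled points
W = rng.normal(size=(4, 3))              # current 4-class model
print(greedy_fir_selection(X, W, budget=5))
```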

[CV-11] Event-based Mosaicing Bundle Adjustment

链接: https://arxiv.org/abs/2409.07365
作者: Shuang Guo,Guillermo Gallego
关键词-EN: mosaicing bundle adjustment, purely rotating event, rotating event camera, bundle adjustment, simultaneous refinement
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
*备注: 14+11 pages, 11 figures, 10 tables, this https URL

点击查看摘要

Abstract:We tackle the problem of mosaicing bundle adjustment (i.e., simultaneous refinement of camera orientations and scene map) for a purely rotating event camera. We formulate the problem as a regularized non-linear least squares optimization. The objective function is defined using the linearized event generation model in the camera orientations and the panoramic gradient map of the scene. We show that this BA optimization has an exploitable block-diagonal sparsity structure, so that the problem can be solved efficiently. To the best of our knowledge, this is the first work to leverage such sparsity to speed up the optimization in the context of event-based cameras, without the need to convert events into image-like representations. We evaluate our method, called EMBA, on both synthetic and real-world datasets to show its effectiveness (50% photometric error decrease), yielding results of unprecedented quality. In addition, we demonstrate EMBA using high spatial resolution event cameras, yielding delicate panoramas in the wild, even without an initial map. Project page: this https URL
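
The speed-up from a block-diagonal structure comes from solving the normal equations block by block instead of factorizing one large dense system. The numpy sketch below demonstrates this with synthetic blocks; it is a generic illustration of the idea, not the EMBA solver.

```python
import numpy as np

def solve_block_diagonal(blocks, rhs):
    """Solve H x = b where H is block-diagonal with the given square blocks.

    Each block is solved independently, which is the source of the speed-up
    that a block-diagonal sparsity structure offers in bundle adjustment.
    """
    solution, offset = [], 0
    for H_i in blocks:
        n = H_i.shape[0]
        solution.append(np.linalg.solve(H_i, rhs[offset:offset + n]))
        offset += n
    return np.concatenate(solution)

rng = np.random.default_rng(1)
# Three independent parameter blocks (e.g., groups of orientation variables).
blocks = []
for n in (4, 4, 6):
    A = rng.normal(size=(n, n))
    blocks.append(A @ A.T + n * np.eye(n))   # symmetric positive definite
b = rng.normal(size=sum(B.shape[0] for B in blocks))

x_blockwise = solve_block_diagonal(blocks, b)

# Cross-check against solving the assembled dense system.
N = sum(B.shape[0] for B in blocks)
H = np.zeros((N, N))
off = 0
for B in blocks:
    n = B.shape[0]
    H[off:off + n, off:off + n] = B
    off += n
assert np.allclose(x_blockwise, np.linalg.solve(H, b))
print("block-wise and dense solutions agree")
```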

[CV-12] Securing Vision-Language Models with a Robust Encoder Against Jailbreak and Adversarial Attacks

链接: https://arxiv.org/abs/2409.07353
作者: Md Zarif Hossain,Ahmed Imteaj
关键词-EN: Large Vision-Language Models, Large Vision-Language, multimodal big datasets, vision-language tasks, Vision-Language Models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs), trained on multimodal big datasets, have significantly advanced AI by excelling in vision-language tasks. However, these models remain vulnerable to adversarial attacks, particularly jailbreak attacks, which bypass safety protocols and cause the model to generate misleading or harmful responses. This vulnerability stems from both the inherent susceptibilities of LLMs and the expanded attack surface introduced by the visual modality. We propose Sim-CLIP+, a novel defense mechanism that adversarially fine-tunes the CLIP vision encoder by leveraging a Siamese architecture. This approach maximizes cosine similarity between perturbed and clean samples, facilitating resilience against adversarial manipulations. Sim-CLIP+ offers a plug-and-play solution, allowing seamless integration into existing LVLM architectures as a robust vision encoder. Unlike previous defenses, our method requires no structural modifications to the LVLM and incurs minimal computational overhead. Sim-CLIP+ demonstrates effectiveness against both gradient-based adversarial attacks and various jailbreak techniques. We evaluate Sim-CLIP+ against three distinct jailbreak attack strategies and perform clean evaluations using standard downstream datasets, including COCO for image captioning and OKVQA for visual question answering. Extensive experiments demonstrate that Sim-CLIP+ maintains high clean accuracy while substantially improving robustness against both gradient-based adversarial attacks and jailbreak techniques. Our code and robust vision encoders are available at this https URL.
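
A compact sketch of Siamese adversarial fine-tuning of a vision encoder follows: generate an L-infinity PGD perturbation that pushes the embedding away from the clean one, then update the encoder to maximize the cosine similarity between clean and perturbed embeddings. The tiny encoder and attack budget are placeholders, not the Sim-CLIP+ setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                        nn.Linear(16, 64))        # stand-in vision encoder
opt = torch.optim.Adam(encoder.parameters(), lr=1e-4)

def pgd_perturb(model, x, eps=4 / 255, alpha=1 / 255, steps=3):
    """L_inf PGD that pushes the embedding away from the clean embedding."""
    with torch.no_grad():
        clean = F.normalize(model(x), dim=-1)
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        adv = F.normalize(model(x + delta), dim=-1)
        loss = -(adv * clean).sum(dim=-1).mean()   # drive cosine similarity down
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()
            delta.clamp_(-eps, eps)
        delta.grad.zero_()
    return delta.detach()

images = torch.rand(8, 3, 32, 32)
delta = pgd_perturb(encoder, images)

# Siamese fine-tuning step: pull perturbed embeddings back to clean ones.
clean_z = F.normalize(encoder(images), dim=-1)
adv_z = F.normalize(encoder(images + delta), dim=-1)
loss = 1.0 - (clean_z * adv_z).sum(dim=-1).mean()   # 1 - cosine similarity
opt.zero_grad()
loss.backward()
opt.step()
print(float(loss))
```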

[CV-13] Federated Impression for Learning with Distributed Heterogeneous Data

链接: https://arxiv.org/abs/2409.07351
作者: Sana Ayromlou,Atrin Arya,Armin Saadat,Purang Abolmaesumi,Xiaoxiao Li
关键词-EN: Standard deep learning-based, real-world clinical applications, Standard deep, deep learning-based classification, learning-based classification approaches
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Standard deep learning-based classification approaches may not always be practical in real-world clinical applications, as they require a centralized collection of all samples. Federated learning (FL) provides a paradigm that can learn from distributed datasets across clients without requiring them to share data, which can help mitigate privacy and data ownership issues. In FL, sub-optimal convergence caused by data heterogeneity is common among data from different health centers due to the variety in data collection protocols and patient demographics across centers. Through experimentation in this study, we show that data heterogeneity leads to the phenomenon of catastrophic forgetting during local training. We propose FedImpres which alleviates catastrophic forgetting by restoring synthetic data that represents the global information as federated impression. To achieve this, we distill the global model resulting from each communication round. Subsequently, we use the synthetic data alongside the local data to enhance the generalization of local training. Extensive experiments show that the proposed method achieves state-of-the-art performance on both the BloodMNIST and Retina datasets, which contain label imbalance and domain shift, with an improvement in classification accuracy of up to 20%.

[CV-14] Benchmarking 2D Egocentric Hand Pose Datasets

链接: https://arxiv.org/abs/2409.07337
作者: Olga Taran,Damian M. Manzone,Jose Zariffa
关键词-EN: including human-computer interaction, significant research interest, Hand pose estimation, Hand pose, pose estimation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Hand pose estimation from egocentric video has broad implications across various domains, including human-computer interaction, assistive technologies, activity recognition, and robotics, making it a topic of significant research interest. The efficacy of modern machine learning models depends on the quality of data used for their training. Thus, this work is devoted to the analysis of state-of-the-art egocentric datasets suitable for 2D hand pose estimation. We propose a novel protocol for dataset evaluation, which encompasses not only the analysis of stated dataset characteristics and assessment of data quality, but also the identification of dataset shortcomings through the evaluation of state-of-the-art hand pose estimation models. Our study reveals that despite the availability of numerous egocentric databases intended for 2D hand pose estimation, the majority are tailored for specific use cases. There is no ideal benchmark dataset yet; however, H2O and GANerated Hands datasets emerge as the most promising real and synthetic datasets, respectively.

[CV-15] Learning to Compress Contexts for Efficient Knowledge-based Visual Question Answering

链接: https://arxiv.org/abs/2409.07331
作者: Weixi Weng,Jieming Zhu,Hao Zhang,Xiaojun Meng,Rui Zhang,Chun Yuan
关键词-EN: Large Language Models, Multimodal Large Language, Language Models, Large Language, demonstrated great zero-shot
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have demonstrated great zero-shot performance on visual question answering (VQA). However, when it comes to knowledge-based VQA (KB-VQA), MLLMs may lack human commonsense or specialized domain knowledge to answer such questions and require obtaining necessary information from external knowledge sources. Previous works like Retrieval-Augmented VQA-v2 (RAVQA-v2) focus on utilizing as much input information, such as image-based textual descriptions and retrieved knowledge, as possible to improve performance, but they all overlook the issue that with the number of input tokens increasing, inference efficiency significantly decreases, which contradicts the demands of practical applications. To address this issue, we propose Retrieval-Augmented MLLM with Compressed Contexts (RACC). RACC learns to compress and aggregate retrieved contexts, from which it generates a compact modulation in the form of Key-Value (KV) cache. This modulation is then used to adapt the downstream frozen MLLM, thereby achieving effective and efficient inference. RACC achieves a state-of-the-art (SOTA) performance of 62.9% on OK-VQA. Moreover, it significantly reduces inference latency by 22.0%-59.7% compared to the prominent RAVQA-v2. Abundant experiments show RACC’s broad applicability. It is compatible with various off-the-shelf MLLMs and can also handle different knowledge sources including textual and multimodal documents.

[CV-16] Current Symmetry Group Equivariant Convolution Frameworks for Representation Learning

链接: https://arxiv.org/abs/2409.07327
作者: Ramzan Basheer,Deepak Mishra
关键词-EN: addressing real-world signals, Euclidean deep learning, Euclidean deep, complex topologies, inadequate for addressing
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 31 pages, 4 figures

点击查看摘要

Abstract:Euclidean deep learning is often inadequate for addressing real-world signals where the representation space is irregular and curved with complex topologies. Interpreting the geometric properties of such feature spaces has become paramount in obtaining robust and compact feature representations that remain unaffected by nontrivial geometric transformations, which vanilla CNNs cannot effectively handle. Recognizing rotation, translation, permutation, or scale symmetries can lead to equivariance properties in the learned representations. This has led to notable advancements in computer vision and machine learning tasks under the framework of geometric deep learning, as compared to their invariant counterparts. In this report, we emphasize the importance of symmetry group equivariant deep learning models and their realization of convolution-like operations on graphs, 3D shapes, and non-Euclidean spaces by leveraging group theory and symmetry. We categorize them as regular, steerable, and PDE-based convolutions and thoroughly examine the inherent symmetries of their input spaces and ensuing representations. We also outline the mathematical link between group convolutions or message aggregation operations and the concept of equivariance. The report also highlights various datasets, their application scopes, limitations, and insightful observations on future directions to serve as a valuable reference and stimulate further research in this emerging discipline.

[CV-17] Module-wise Adaptive Adversarial Training for End-to-end Autonomous Driving

链接: https://arxiv.org/abs/2409.07321
作者: Tianyuan Zhang,Lu Wang,Jiaqi Kang,Xinwei Zhang,Siyuan Liang,Yuwei Chen,Aishan Liu,Xianglong Liu
关键词-EN: Recent advances, improved autonomous driving, markedly improved autonomous, autonomous driving, systems that integrate
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 14 pages

点击查看摘要

Abstract:Recent advances in deep learning have markedly improved autonomous driving (AD) models, particularly end-to-end systems that integrate perception, prediction, and planning stages, achieving state-of-the-art performance. However, these models remain vulnerable to adversarial attacks, where human-imperceptible perturbations can disrupt decision-making processes. While adversarial training is an effective method for enhancing model robustness against such attacks, no prior studies have focused on its application to end-to-end AD models. In this paper, we take the first step in adversarial training for end-to-end AD models and present a novel Module-wise Adaptive Adversarial Training (MA2T). However, extending conventional adversarial training to this context is highly non-trivial, as different stages within the model have distinct objectives and are strongly interconnected. To address these challenges, MA2T first introduces Module-wise Noise Injection, which injects noise before the input of different modules, targeting training models with the guidance of overall objectives rather than each independent module loss. Additionally, we introduce Dynamic Weight Accumulation Adaptation, which incorporates accumulated weight changes to adaptively learn and adjust the loss weights of each module based on their contributions (accumulated reduction rates) for better balance and robust training. To demonstrate the efficacy of our defense, we conduct extensive experiments on the widely-used nuScenes dataset across several end-to-end AD models under both white-box and black-box attacks, where our method outperforms other baselines by large margins (+5-10%). Moreover, we validate the robustness of our defense through closed-loop evaluation in the CARLA simulation environment, showing improved resilience even against natural corruption.
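
The adaptive weighting described above can be pictured with a small sketch: track each module's accumulated loss reduction and give modules that have reduced their loss the least a larger weight in the next step. The softmax reweighting rule below is an illustrative stand-in, not the MA2T formula.

```python
import numpy as np

class AdaptiveLossWeights:
    """Reweight per-module losses from their accumulated reduction rates.

    Modules whose loss has dropped the least (i.e., that adapt most slowly
    under adversarial training) receive larger weights in the next step.
    """
    def __init__(self, module_names, temperature=1.0):
        self.names = module_names
        self.initial = {}
        self.temperature = temperature

    def __call__(self, losses):
        for name, value in losses.items():
            self.initial.setdefault(name, value)
        # Accumulated reduction rate: how much each module's loss has shrunk.
        reduction = np.array([1.0 - losses[n] / self.initial[n] for n in self.names])
        scores = -reduction / self.temperature      # less reduction -> larger score
        weights = np.exp(scores - scores.max())
        weights = weights / weights.sum()
        total = sum(w * losses[n] for w, n in zip(weights, self.names))
        return total, dict(zip(self.names, weights))

weighter = AdaptiveLossWeights(["perception", "prediction", "planning"])
step1 = {"perception": 2.0, "prediction": 1.0, "planning": 0.5}
step2 = {"perception": 1.0, "prediction": 0.9, "planning": 0.5}
print(weighter(step1))   # equal weights on the first call
print(weighter(step2))   # planning (no reduction yet) gets the largest weight
```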

[CV-18] Data Augmentation via Latent Diffusion for Saliency Prediction ECCV2024

链接: https://arxiv.org/abs/2409.07307
作者: Bahar Aydemir,Deblina Bhattacharjee,Tong Zhang,Mathieu Salzmann,Sabine Süsstrunk
关键词-EN: limited diversity, diversity and quantity, quantity of labeled, Saliency, data augmentation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 18 pages, published in ECCV 2024

点击查看摘要

Abstract:Saliency prediction models are constrained by the limited diversity and quantity of labeled data. Standard data augmentation techniques such as rotating and cropping alter scene composition, affecting saliency. We propose a novel data augmentation method for deep saliency prediction that edits natural images while preserving the complexity and variability of real-world scenes. Since saliency depends on high-level and low-level features, our approach involves learning both by incorporating photometric and semantic attributes such as color, contrast, brightness, and class. To that end, we introduce a saliency-guided cross-attention mechanism that enables targeted edits on the photometric properties, thereby enhancing saliency within specific image regions. Experimental results show that our data augmentation method consistently improves the performance of various saliency models. Moreover, leveraging the augmentation features for saliency prediction yields superior performance on publicly available saliency benchmarks. Our predictions align closely with human visual attention patterns in the edited images, as validated by a user study.

[CV-19] PaveSAM Segment Anything for Pavement Distress

链接: https://arxiv.org/abs/2409.07295
作者: Neema Jakisa Owor,Yaw Adu-Gyamfi,Armstrong Aboah,Mark Amo-Boateng
关键词-EN: Automated pavement monitoring, Automated pavement, segmentation, manual methods, analyze pavement conditions
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Automated pavement monitoring using computer vision can analyze pavement conditions more efficiently and accurately than manual methods. Accurate segmentation is essential for quantifying the severity and extent of pavement defects and consequently, the overall condition index used for prioritizing rehabilitation and maintenance activities. Deep learning-based segmentation models are however, often supervised and require pixel-level annotations, which can be costly and time-consuming. While the recent evolution of zero-shot segmentation models can generate pixel-wise labels for unseen classes without any training data, they struggle with irregularities of cracks and textured pavement backgrounds. This research proposes a zero-shot segmentation model, PaveSAM, that can segment pavement distresses using bounding box prompts. By retraining SAM’s mask decoder with just 180 images, pavement distress segmentation is revolutionized, enabling efficient distress segmentation using bounding box prompts, a capability not found in current segmentation models. This not only drastically reduces labeling efforts and costs but also showcases our model’s high performance with minimal input, establishing the pioneering use of SAM in pavement distress segmentation. Furthermore, researchers can use existing open-source pavement distress images annotated with bounding boxes to create segmentation masks, which increases the availability and diversity of segmentation pavement distress datasets.

[CV-20] A Unified Contrastive Loss for Self-Training

链接: https://arxiv.org/abs/2409.07292
作者: Aurelien Gauffre,Julien Horvat,Massih-Reza Amini
关键词-EN: exploiting abundant unlabeled, abundant unlabeled data, exploiting abundant, abundant unlabeled, loss function
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Self-training methods have proven to be effective in exploiting abundant unlabeled data in semi-supervised learning, particularly when labeled data is scarce. While many of these approaches rely on a cross-entropy loss function (CE), recent advances have shown that the supervised contrastive loss function (SupCon) can be more effective. Additionally, unsupervised contrastive learning approaches have also been shown to capture high quality data representations in the unsupervised setting. To benefit from these advantages in a semi-supervised setting, we propose a general framework to enhance self-training methods, which replaces all instances of CE losses with a unique contrastive loss. By using class prototypes, which are a set of class-wise trainable parameters, we recover the probability distributions of the CE setting and show a theoretical equivalence with it. Our framework, when applied to popular self-training methods, results in significant performance improvements across three different datasets with a limited number of labeled data. Additionally, we demonstrate further improvements in convergence speed, transfer ability, and hyperparameter stability. The code is available at this https URL.
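
The prototype idea can be made concrete with a short sketch: class prototypes act like a classifier's weight rows, so a temperature-scaled softmax over feature-prototype similarities recovers cross-entropy-like class probabilities. Dimensions and temperature below are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeContrastiveLoss(nn.Module):
    """Contrast normalized features against trainable class prototypes.

    The softmax over feature-prototype similarities plays the role of the
    class-probability distribution of a cross-entropy classifier, which is
    how a contrastive formulation can recover the CE setting.
    """
    def __init__(self, num_classes, dim, temperature=0.1):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_classes, dim))
        self.temperature = temperature

    def forward(self, features, labels):
        f = F.normalize(features, dim=-1)
        p = F.normalize(self.prototypes, dim=-1)
        logits = f @ p.t() / self.temperature      # (B, num_classes)
        return F.cross_entropy(logits, labels)

loss_fn = PrototypeContrastiveLoss(num_classes=10, dim=128)
features = torch.randn(32, 128)                    # encoder outputs
labels = torch.randint(0, 10, (32,))
loss = loss_fn(features, labels)
loss.backward()
print(float(loss))
```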

[CV-21] Exploring User-level Gradient Inversion with a Diffusion Prior NEURIPS2023

链接: https://arxiv.org/abs/2409.07291
作者: Zhuohang Li,Andrew Lowy,Jing Liu,Toshiaki Koike-Akino,Bradley Malin,Kieran Parsons,Ye Wang
关键词-EN: explore user-level gradient, user-level gradient inversion, distributed learning, explore user-level, surface in distributed
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
*备注: Presented at the International Workshop on Federated Learning in the Age of Foundation Models in conjunction with NeurIPS 2023

点击查看摘要

Abstract:We explore user-level gradient inversion as a new attack surface in distributed learning. We first investigate existing attacks on their ability to make inferences about private information beyond training data reconstruction. Motivated by the low reconstruction quality of existing methods, we propose a novel gradient inversion attack that applies a denoising diffusion model as a strong image prior in order to enhance recovery in the large batch setting. Unlike traditional attacks, which aim to reconstruct individual samples and suffer at large batch and image sizes, our approach instead aims to recover a representative image that captures the sensitive shared semantic information corresponding to the underlying user. Our experiments with face images demonstrate the ability of our methods to recover realistic facial images along with private user attributes.

[CV-22] TLD-READY: Traffic Light Detection – Relevance Estimation and Deployment Analysis

链接: https://arxiv.org/abs/2409.07284
作者: Nikolai Polley,Svetlana Pavlitska,Yacin Boualili,Patrick Rohrbeck,Paul Stiller,Ashok Kumar Bangaru,J. Marius Zöllner
关键词-EN: Effective traffic light, traffic light detection, Small Traffic Lights, Traffic Lights Dataset, Effective traffic
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Effective traffic light detection is a critical component of the perception stack in autonomous vehicles. This work introduces a novel deep-learning detection system while addressing the challenges of previous work. Utilizing a comprehensive dataset amalgamation, including the Bosch Small Traffic Lights Dataset, LISA, the DriveU Traffic Light Dataset, and a proprietary dataset from Karlsruhe, we ensure a robust evaluation across varied scenarios. Furthermore, we propose a relevance estimation system that innovatively uses directional arrow markings on the road, eliminating the need for prior map creation. On the DriveU dataset, this approach results in 96% accuracy in relevance estimation. Finally, a real-world evaluation is performed to evaluate the deployment and generalizing abilities of these models. For reproducibility and to facilitate further research, we provide the model weights and code: this https URL.

[CV-23] Tuning-Free Online Robust Principal Component Analysis through Implicit Regularization

链接: https://arxiv.org/abs/2409.07275
作者: Lakshmi Jayalal,Gokularam Muthukrishnan,Sheetal Kalyani
关键词-EN: Principal Component Analysis, Online Robust Principal, Robust Principal Component, standard Online Robust, Component Analysis
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The performance of the standard Online Robust Principal Component Analysis (OR-PCA) technique depends on the optimum tuning of the explicit regularizers and this tuning is dataset sensitive. We aim to remove the dependency on these tuning parameters by using implicit regularization. We propose to use the implicit regularization effect of various modified gradient descents to make OR-PCA tuning free. Our method incorporates three different versions of modified gradient descent that separately but naturally encourage sparsity and low-rank structures in the data. The proposed method performs comparable or better than the tuned OR-PCA for both simulated and real-world datasets. Tuning-free ORPCA makes it more scalable for large datasets since we do not require dataset-dependent parameter tuning.

[CV-24] CCFExp: Facial Image Synthesis with Cycle Cross-Fusion Diffusion Model for Facial Paralysis Individuals

链接: https://arxiv.org/abs/2409.07271
作者: Weixiang Gao,Yifan Xia
关键词-EN: Facial paralysis, Facial, paralysis, facial paralysis datasets, debilitating condition
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Facial paralysis is a debilitating condition that affects the movement of facial muscles, leading to a significant loss of facial expressions. Currently, the diagnosis of facial paralysis remains a challenging task, often relying heavily on the subjective judgment and experience of clinicians, which can introduce variability and uncertainty in the assessment process. One promising application in real-life situations is the automatic estimation of facial paralysis. However, the scarcity of facial paralysis datasets limits the development of robust machine learning models for automated diagnosis and therapeutic interventions. To this end, this study aims to synthesize a high-quality facial paralysis dataset to address this gap, enabling more accurate and efficient algorithm training. Specifically, a novel Cycle Cross-Fusion Expression Generative Model (CCFExp) based on the diffusion model is proposed to combine different features of facial information and enhance the visual details of facial appearance and texture in facial regions, thus creating synthetic facial images that accurately represent various degrees and types of facial paralysis. We have qualitatively and quantitatively evaluated the proposed method on the commonly used public clinical datasets of facial paralysis to demonstrate its effectiveness. Experimental results indicate that the proposed method surpasses state-of-the-art methods, generating more realistic facial images and maintaining identity consistency.

[CV-25] Realistic and Efficient Face Swapping: A Unified Approach with Diffusion Models WACV2025

链接: https://arxiv.org/abs/2409.07269
作者: Sanoojan Baliah,Qinliang Lin,Shengcai Liao,Xiaodan Liang,Muhammad Haris Khan
关键词-EN: scenarios involving high, images remain elusive, high pose variation, swapped images remain, involving high pose
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted as a conference paper at WACV 2025

点击查看摘要

Abstract:Despite promising progress in the face-swapping task, realistic swapped images remain elusive, often marred by artifacts, particularly in scenarios involving high pose variation, color differences, and occlusion. To address these issues, we propose a novel approach that better harnesses diffusion models for face-swapping by making the following core contributions. (a) We propose to re-frame the face-swapping task as a self-supervised, train-time inpainting problem, enhancing the identity transfer while blending with the target image. (b) We introduce a multi-step Denoising Diffusion Implicit Model (DDIM) sampling during training, reinforcing identity and perceptual similarities. (c) We introduce CLIP feature disentanglement to extract pose, expression, and lighting information from the target image, improving fidelity. (d) Further, we introduce a mask shuffling technique during inpainting training, which allows us to create a so-called universal model for swapping, with an additional feature of head swapping. Ours can swap hair and even accessories, beyond traditional face swapping. Unlike prior works reliant on multiple off-the-shelf models, ours is a relatively unified approach and so it is resilient to errors in other off-the-shelf models. Extensive experiments on FFHQ and CelebA datasets validate the efficacy and robustness of our approach, showcasing high-fidelity, realistic face-swapping with minimal inference time. Our code is available at this https URL.

[CV-26] MiniDrive: More Efficient Vision-Language Models with Multi-Level 2D Features as Text Tokens for Autonomous Driving

链接: https://arxiv.org/abs/2409.07267
作者: Enming Zhang,Xingyuan Dai,Yisheng Lv,Qianghai Miao
关键词-EN: Vision-language models, serve as general-purpose, performing subtasks, autonomous driving, visual token embeddings
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Vision-language models (VLMs) serve as general-purpose end-to-end models in autonomous driving, performing subtasks such as prediction, planning, and perception through question-and-answer interactions. However, most existing methods rely on computationally expensive visual encoders and large language models (LLMs), making them difficult to deploy in real-world scenarios and real-time applications. Meanwhile, most existing VLMs lack the ability to process multiple images, making it difficult to adapt to multi-camera perception in autonomous driving. To address these issues, we propose a novel framework called MiniDrive, which incorporates our proposed Feature Engineering Mixture of Experts (FE-MoE) module and Dynamic Instruction Adapter (DI-Adapter). The FE-MoE effectively maps 2D features into visual token embeddings before being input into the language model. The DI-Adapter enables the visual token embeddings to dynamically change with the instruction text embeddings, resolving the issue of static visual token embeddings for the same image in previous approaches. Compared to previous works, MiniDrive achieves state-of-the-art performance in terms of parameter size, floating point operations, and response efficiency, with the smallest version containing only 83M parameters.
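
A sketch of instruction-conditioned visual tokens follows: the pooled instruction embedding predicts a per-channel scale and shift that modulates each visual token before it enters the language model. This FiLM-style module is a stand-in used to illustrate the Dynamic Instruction Adapter idea; the names and sizes are assumptions, not the MiniDrive code.

```python
import torch
import torch.nn as nn

class DynamicInstructionAdapter(nn.Module):
    """Modulate visual token embeddings with the instruction embedding.

    A FiLM-style stand-in: the pooled instruction embedding predicts a
    per-channel scale and shift, so the same image yields different visual
    tokens under different instructions.
    """
    def __init__(self, token_dim, text_dim):
        super().__init__()
        self.to_scale_shift = nn.Linear(text_dim, 2 * token_dim)

    def forward(self, visual_tokens, instruction_embedding):
        # visual_tokens: (B, N, D); instruction_embedding: (B, text_dim)
        scale, shift = self.to_scale_shift(instruction_embedding).chunk(2, dim=-1)
        return visual_tokens * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

adapter = DynamicInstructionAdapter(token_dim=256, text_dim=512)
visual_tokens = torch.randn(2, 49, 256)        # e.g., mapped 2D CNN features
instruction = torch.randn(2, 512)              # pooled instruction text embedding
tokens_for_llm = adapter(visual_tokens, instruction)
print(tokens_for_llm.shape)                    # torch.Size([2, 49, 256])
```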

[CV-27] TopoMap: A faster and more space efficient technique to compute projections with topological guarantees

链接: https://arxiv.org/abs/2409.07257
作者: Vitoria Guardieiro,Felipe Inagaki de Oliveira,Harish Doraiswamy,Luis Gustavo Nonato,Claudio Silva
关键词-EN: High-dimensional data, visualize effectively, difficult to visualize, data, visualizing high-dimensional data
类目: Graphics (cs.GR); Computational Geometry (cs.CG); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: This is the author’s version of the article that has been accepted for publication in IEEE Transactions on Visualization and Computer Graphics (TVCG)

点击查看摘要

Abstract:High-dimensional data, characterized by many features, can be difficult to visualize effectively. Dimensionality reduction techniques, such as PCA, UMAP, and t-SNE, address this challenge by projecting the data into a lower-dimensional space while preserving important relationships. TopoMap is another technique that excels at preserving the underlying structure of the data, leading to interpretable visualizations. In particular, TopoMap maps the high-dimensional data into a visual space, guaranteeing that the 0-dimensional persistence diagram of the Rips filtration of the visual space matches the one from the high-dimensional data. However, the original TopoMap algorithm can be slow and its layout can be too sparse for large and complex datasets. In this paper, we propose three improvements to TopoMap: 1) a more space-efficient layout, 2) a significantly faster implementation, and 3) a novel TreeMap-based representation that makes use of the topological hierarchy to aid the exploration of the projections. These advancements make TopoMap, now referred to as TopoMap++, a more powerful tool for visualizing high-dimensional data which we demonstrate through different use case scenarios.

[CV-28] MRAC Track 1: 2nd Workshop on Multimodal Generative and Responsible Affective Computing ACM-MM

链接: https://arxiv.org/abs/2409.07256
作者: Shreya Ghosh,Zhixi Cai,Abhinav Dhall,Dimitrios Kollias,Roland Goecke,Tom Gedeon
关键词-EN: Affective Computing research, Affective Computing, Affective Computing involves, generative affective computing, affective computing requires
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ACM MM Workshop 2024. Workshop webpage: this https URL

点击查看摘要

Abstract:With the rapid advancements in multimodal generative technology, Affective Computing research has provoked discussion about the potential consequences of AI systems equipped with emotional intelligence. Affective Computing involves the design, evaluation, and implementation of Emotion AI and related technologies aimed at improving people’s lives. Designing a computational model in affective computing requires vast amounts of multimodal data, including RGB images, video, audio, text, and physiological signals. Moreover, Affective Computing research is deeply engaged with ethical considerations at various stages-from training emotionally intelligent models on large-scale human data to deploying these models in specific applications. Fundamentally, the development of any AI system must prioritize its impact on humans, aiming to augment and enhance human abilities rather than replace them, while drawing inspiration from human intelligence in a safe and responsible manner. The MRAC 2024 Track 1 workshop seeks to extend these principles from controlled, small-scale lab environments to real-world, large-scale contexts, emphasizing responsible development. The workshop also aims to highlight the potential implications of generative technology, along with the ethical consequences of its use, to researchers and industry professionals. To the best of our knowledge, this is the first workshop series to comprehensively address the full spectrum of multimodal, generative affective computing from a responsible AI perspective, and this is the second iteration of this workshop. Webpage: this https URL

[CV-29] EMOdiffhead: Continuously Emotional Control in Talking Head Generation via Diffusion

链接: https://arxiv.org/abs/2409.07255
作者: Jian Zhang,Weijian Mai,Zhijun Zhang
关键词-EN: talking head video, talking head, animation involves generating, track of speech, emotion-driven talking head
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 12 pages, 7 figures

点击查看摘要

Abstract:The task of audio-driven portrait animation involves generating a talking head video using an identity image and an audio track of speech. While many existing approaches focus on lip synchronization and video quality, few tackle the challenge of generating emotion-driven talking head videos. The ability to control and edit emotions is essential for producing expressive and realistic animations. In response to this challenge, we propose EMOdiffhead, a novel method for emotional talking head video generation that not only enables fine-grained control of emotion categories and intensities but also enables one-shot generation. Given the FLAME 3D model’s linearity in expression modeling, we utilize the DECA method to extract expression vectors, that are combined with audio to guide a diffusion model in generating videos with precise lip synchronization and rich emotional expressiveness. This approach not only enables the learning of rich facial information from emotion-irrelevant data but also facilitates the generation of emotional videos. It effectively overcomes the limitations of emotional data, such as the lack of diversity in facial and background information, and addresses the absence of emotional details in emotion-irrelevant data. Extensive experiments and user studies demonstrate that our approach achieves state-of-the-art performance compared to other emotion portrait animation methods.

[CV-30] Alignment of Diffusion Models: Fundamentals Challenges and Future

链接: https://arxiv.org/abs/2409.07253
作者: Buhua Liu,Shitong Shao,Bao Li,Lichen Bai,Haoyi Xiong,James Kwok,Sumi Helal,Zeke Xie
关键词-EN: Diffusion models, Diffusion, generative modeling, models, leading paradigm
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: 35 pages, 5 figures, 3 tables

点击查看摘要

Abstract:Diffusion models have emerged as the leading paradigm in generative modeling, excelling in various applications. Despite their success, these models often misalign with human intentions, generating outputs that may not match text prompts or possess desired properties. Inspired by the success of alignment in tuning large language models, recent studies have investigated aligning diffusion models with human expectations and preferences. This work mainly reviews alignment of diffusion models, covering advancements in fundamentals of alignment, alignment techniques of diffusion models, preference benchmarks, and evaluation for diffusion models. Moreover, we discuss key perspectives on current challenges and promising future directions on solving the remaining challenges in alignment of diffusion models. To the best of our knowledge, our work is the first comprehensive review paper for researchers and engineers to comprehend, practice, and research alignment of diffusion models.

[CV-31] Single-View 3D Reconstruction via SO(2)-Equivariant Gaussian Sculpting Networks

链接: https://arxiv.org/abs/2409.07245
作者: Ruihan Xu,Anthony Opipari,Joshua Mah,Stanley Lewis,Haoran Zhang,Hanzhe Guo,Odest Chadwicke Jenkins
关键词-EN: Equivariant Gaussian Sculpting, Gaussian Sculpting Networks, Sculpting Networks, Equivariant Gaussian, single-view image observations
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: Accepted to RSS 2024 Workshop on Geometric and Algebraic Structure in Robot Learning

点击查看摘要

Abstract:This paper introduces SO(2)-Equivariant Gaussian Sculpting Networks (GSNs) as an approach for SO(2)-Equivariant 3D object reconstruction from single-view image observations. GSNs take a single observation as input to generate a Gaussian splat representation describing the observed object’s geometry and texture. By using a shared feature extractor before decoding Gaussian colors, covariances, positions, and opacities, GSNs achieve extremely high throughput (150FPS). Experiments demonstrate that GSNs can be trained efficiently using a multi-view rendering loss and are competitive, in quality, with expensive diffusion-based reconstruction algorithms. The GSN model is validated on multiple benchmark experiments. Moreover, we demonstrate the potential for GSNs to be used within a robotic manipulation pipeline for object-centric grasping.

[CV-32] PiTe: Pixel-Temporal Alignment for Large Video-Language Model

链接: https://arxiv.org/abs/2409.07239
作者: Yang Liu,Pengxiang Ding,Siteng Huang,Min Zhang,Han Zhao,Donglin Wang
关键词-EN: Large Visual-Language Models, pivotal advancement, bridging the gap, Large Language Models, Large Visual-Language
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Fueled by the Large Language Models (LLMs) wave, Large Visual-Language Models (LVLMs) have emerged as a pivotal advancement, bridging the gap between image and text. However, video remains challenging for LVLMs due to the complexity of the relationship between language and the spatial-temporal data structure. Recent Large Video-Language Models (LVidLMs) align features of static visual data such as images into the latent space of language features through general multi-modal tasks, so as to leverage the abilities of LLMs sufficiently. In this paper, we explore a fine-grained alignment approach via object trajectories for different modalities across both spatial and temporal dimensions simultaneously. Thus, we propose a novel LVidLM with trajectory-guided Pixel-Temporal Alignment, dubbed PiTe, that exhibits promising applicable model properties. To achieve fine-grained video-language alignment, we curate a multi-modal pre-training dataset, PiTe-143k, which provides pixel-level moving trajectories for all individual objects that appear and are mentioned in both the video and the caption, produced by our automatic annotation pipeline. Meanwhile, PiTe demonstrates astounding capabilities on myriad video-related multi-modal tasks, beating the state-of-the-art methods by a large margin.

[CV-33] Diff-VPS: Video Polyp Segmentation via a Multi-task Diffusion Network with Adversarial Temporal Reasoning

链接: https://arxiv.org/abs/2409.07238
作者: Yingling Lu,Yijun Yang,Zhaohu Xing,Qiong Wang,Lei Zhu
关键词-EN: Diffusion Probabilistic Models, recently attracted significant, attracted significant attention, computer vision due, Probabilistic Models
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Diffusion Probabilistic Models have recently attracted significant attention in the community of computer vision due to their outstanding performance. However, while a substantial amount of diffusion-based research has focused on generative tasks, no work introduces diffusion models to advance the results of polyp segmentation in videos, which is frequently challenged by polyps’ high camouflage and redundant temporal cues. In this paper, we present a novel diffusion-based network for the video polyp segmentation task, dubbed Diff-VPS. We incorporate multi-task supervision into diffusion models to promote the discrimination of diffusion models on pixel-by-pixel segmentation. This integrates the contextual high-level information achieved by the joint classification and detection tasks. To explore the temporal dependency, a Temporal Reasoning Module (TRM) is devised via reasoning and reconstructing the target frame from the previous frames. We further equip TRM with a generative adversarial self-supervised strategy to produce more realistic frames and thus capture better dynamic cues. Extensive experiments are conducted on SUN-SEG, and the results indicate that our proposed Diff-VPS significantly achieves state-of-the-art performance. Code is available at this https URL.

[CV-34] Watchlist Challenge: 3rd Open-set Face Detection and Identification

链接: https://arxiv.org/abs/2409.07220
作者: Furkan Kasım,Terrance E. Boult,Rensso Mora,Bernardo Biesseck,Rafael Ribeiro,Jan Schlueter,Tomáš Repák,Rafael Henrique Vareto,David Menotti,William Robson Schwartz,Manuel Günther
关键词-EN: accurately recognize faces, Watchlist Challenge addresses, settings is paramount, current landscape, landscape of biometrics
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted for presentation at IJCB 2024

点击查看摘要

Abstract:In the current landscape of biometrics and surveillance, the ability to accurately recognize faces in uncontrolled settings is paramount. The Watchlist Challenge addresses this critical need by focusing on face detection and open-set identification in real-world surveillance scenarios. This paper presents a comprehensive evaluation of participating algorithms, using the enhanced UnConstrained College Students (UCCS) dataset with new evaluation protocols. In total, four participants submitted four face detection and nine open-set face recognition systems. The evaluation demonstrates that while detection capabilities are generally robust, closed-set identification performance varies significantly, with models pre-trained on large-scale datasets showing superior performance. However, open-set scenarios require further improvement, especially at higher true positive identification rates, i.e., lower thresholds.

[CV-35] Behavioral Cloning Models Reality Check for Autonomous Driving

链接: https://arxiv.org/abs/2409.07218
作者: Mustafa Yildirim,Barkin Dagda,Vinal Asodia,Saber Fallah
关键词-EN: autonomous vehicle, effective are recent, recent advancements, autonomous vehicle control, utilize Behavior Cloning
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:How effective are recent advancements in autonomous vehicle perception systems when applied to real-world autonomous vehicle control? While numerous vision-based autonomous vehicle systems have been trained and evaluated in simulated environments, there is a notable lack of real-world validation for these systems. This paper addresses this gap by presenting the real-world validation of state-of-the-art perception systems that utilize Behavior Cloning (BC) for lateral control, processing raw image data to predict steering commands. The dataset was collected using a scaled research vehicle and tested on various track setups. Experimental results demonstrate that these methods predict steering angles with low error margins in real-time, indicating promising potential for real-world applications.
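
As a minimal illustration of the Behavior Cloning setup for lateral control, the sketch below maps a front-camera frame to a steering command and trains with a regression loss on recorded (image, steering) pairs. The architecture and data are placeholders, not the models evaluated in the paper.

```python
import torch
import torch.nn as nn

class SteeringNet(nn.Module):
    """Tiny CNN that regresses a steering angle from a camera frame."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Linear(32, 1)

    def forward(self, x):
        return self.head(self.features(x)).squeeze(-1)

model = SteeringNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Stand-in for a batch of recorded driving data: images and steering angles.
images = torch.rand(16, 3, 66, 200)
steering = torch.empty(16).uniform_(-0.5, 0.5)

for _ in range(5):                      # a few behavior-cloning updates
    pred = model(images)
    loss = nn.functional.mse_loss(pred, steering)
    opt.zero_grad()
    loss.backward()
    opt.step()
print(float(loss))
```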

[CV-36] Enhancing CTC-Based Visual Speech Recognition

链接: https://arxiv.org/abs/2409.07210
作者: Hendrik Laux,Anke Schmeink
关键词-EN: Visual Speech Recognition, Automatic Speech Recognition, previously introduced efficient, Speech Recognition, Visual Speech
类目: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:This paper presents LiteVSR2, an enhanced version of our previously introduced efficient approach to Visual Speech Recognition (VSR). Building upon our knowledge distillation framework from a pre-trained Automatic Speech Recognition (ASR) model, we introduce two key improvements: a stabilized video preprocessing technique and feature normalization in the distillation process. These improvements yield substantial performance gains on the LRS2 and LRS3 benchmarks, positioning LiteVSR2 as the current best CTC-based VSR model without increasing the volume of training data or computational resources utilized. Furthermore, we explore the scalability of our approach by examining performance metrics across varying model complexities and training data volumes. LiteVSR2 maintains the efficiency of its predecessor while significantly enhancing accuracy, thereby demonstrating the potential for resource-efficient advancements in VSR technology.
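
A sketch of feature-level distillation with normalization follows: both teacher (ASR) and student (VSR) features are normalized per frame before the matching loss, which keeps the loss scale independent of feature magnitude. Feature sizes are illustrative; this is not the LiteVSR2 code.

```python
import torch
import torch.nn.functional as F

def normalized_distillation_loss(student_feats, teacher_feats, eps=1e-6):
    """MSE between per-frame normalized student and teacher features.

    Normalizing each frame's feature vector (zero mean, unit variance) before
    matching removes scale differences between the visual student and the
    frozen audio teacher, stabilizing the distillation target.
    """
    def norm(x):
        return (x - x.mean(dim=-1, keepdim=True)) / (x.std(dim=-1, keepdim=True) + eps)
    return F.mse_loss(norm(student_feats), norm(teacher_feats))

# (batch, time, feature) activations from the two encoders.
student = torch.randn(4, 75, 256, requires_grad=True)   # visual front-end output
teacher = torch.randn(4, 75, 256)                        # frozen ASR features
loss = normalized_distillation_loss(student, teacher)
loss.backward()
print(float(loss))
```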

[CV-37] ThermalGaussian: Thermal 3D Gaussian Splatting

链接: https://arxiv.org/abs/2409.07200
作者: Rongfeng Lu,Hangyu Chen,Zunjie Zhu,Yuhang Qin,Ming Lu,Le Zhang,Chenggang Yan,Anke Xue
关键词-EN: Neural Radiance Fields, Radiance Fields, users of surveillance, Neural Radiance, thermal
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 10 pages, 7 figures

点击查看摘要

Abstract:Thermography is especially valuable for the military and other users of surveillance cameras. Some recent methods based on Neural Radiance Fields (NeRF) are proposed to reconstruct the thermal scenes in 3D from a set of thermal and RGB images. However, unlike NeRF, 3D Gaussian splatting (3DGS) prevails due to its rapid training and real-time rendering. In this work, we propose ThermalGaussian, the first thermal 3DGS approach capable of rendering high-quality images in RGB and thermal modalities. We first calibrate the RGB camera and the thermal camera to ensure that both modalities are accurately aligned. Subsequently, we use the registered images to learn the multimodal 3D Gaussians. To prevent the overfitting of any single modality, we introduce several multimodal regularization constraints. We also develop smoothing constraints tailored to the physical characteristics of the thermal modality. Besides, we contribute a real-world dataset named RGBT-Scenes, captured by a hand-hold thermal-infrared camera, facilitating future research on thermal scene reconstruction. We conduct comprehensive experiments to show that ThermalGaussian achieves photorealistic rendering of thermal images and improves the rendering quality of RGB images. With the proposed multimodal regularization constraints, we also reduced the model’s storage cost by 90%. The code and dataset will be released.
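
One plausible form of the thermal-specific smoothing constraint mentioned above is a total-variation penalty on the rendered thermal image, added to per-modality photometric losses. The weighting and the TV choice below are illustrative assumptions, not the paper's exact regularizers.

```python
import torch
import torch.nn.functional as F

def total_variation(img):
    """Total-variation penalty encouraging locally smooth thermal renderings."""
    dh = (img[..., 1:, :] - img[..., :-1, :]).abs().mean()
    dw = (img[..., :, 1:] - img[..., :, :-1]).abs().mean()
    return dh + dw

def multimodal_loss(rgb_pred, rgb_gt, thermal_pred, thermal_gt, tv_weight=0.1):
    """Joint RGB + thermal photometric loss with a thermal smoothness term."""
    rgb_term = F.l1_loss(rgb_pred, rgb_gt)
    thermal_term = F.l1_loss(thermal_pred, thermal_gt)
    return rgb_term + thermal_term + tv_weight * total_variation(thermal_pred)

rgb_pred = torch.rand(3, 64, 64, requires_grad=True)
thermal_pred = torch.rand(1, 64, 64, requires_grad=True)
loss = multimodal_loss(rgb_pred, torch.rand(3, 64, 64),
                       thermal_pred, torch.rand(1, 64, 64))
loss.backward()
print(float(loss))
```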

[CV-38] Enhancing Angular Resolution via Directionality Encoding and Geometric Constraints in Brain Diffusion Tensor Imaging ICONIP2024

链接: https://arxiv.org/abs/2409.07186
作者: Sheng Chen,Zihao Tang,Mariano Cabezas,Xinyi Wang,Arkiev D’Souza,Michael Barnett,Fernando Calamante,Weidong Cai,Chenyu Wang
关键词-EN: Magnetic Resonance Imaging, Magnetic Resonance, fiber tracts non-invasively, reconstruct white matter, white matter fiber
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted to ICONIP2024, Diffusion Weighted Imaging, Diffusion Tensor Imaging, Angular Resolution Enhancement, Fractional Anisotropy

点击查看摘要

Abstract:Diffusion-weighted imaging (DWI) is a Magnetic Resonance Imaging (MRI) technique sensitised to the diffusivity of water molecules; it offers the capability to inspect tissue microstructure and is the only in-vivo method to reconstruct white matter fiber tracts non-invasively. The DWI signal can be analysed with the diffusion tensor imaging (DTI) model to estimate the directionality of water diffusion within voxels. Several scalar metrics, including axial diffusivity (AD), mean diffusivity (MD), radial diffusivity (RD), and fractional anisotropy (FA), can be further derived from DTI to quantitatively summarise the microstructural integrity of brain tissue. These scalar metrics have played an important role in understanding the organisation and health of brain tissue at a microscopic level in clinical studies. However, reliable DTI metrics rely on DWI acquisitions with a high number of gradient directions, which often goes beyond the commonly used clinical protocols. To enhance the utility of clinically acquired DWI and save scanning time for robust DTI analysis, this work proposes DirGeo-DTI, a deep learning-based method to estimate reliable DTI metrics even from a set of DWIs acquired with the minimum theoretical number (6) of gradient directions. DirGeo-DTI leverages directional encoding and geometric constraints to facilitate the training process. Two public DWI datasets were used for evaluation, demonstrating the effectiveness of the proposed method. Extensive experimental results show that the proposed method achieves the best performance compared to existing DTI enhancement methods and potentially reveals further clinical insights with routine clinical DWI scans.
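
The scalar metrics listed in the abstract have standard closed-form definitions in terms of the eigenvalues of the fitted diffusion tensor. The snippet below computes them for a single voxel; it illustrates the textbook formulas only and is not part of the DirGeo-DTI pipeline.

```python
import numpy as np

def dti_scalar_metrics(eigenvalues):
    """Standard DTI scalars from the diffusion tensor eigenvalues (l1 >= l2 >= l3)."""
    l1, l2, l3 = np.sort(eigenvalues)[::-1]
    md = (l1 + l2 + l3) / 3.0                      # mean diffusivity
    ad = l1                                        # axial diffusivity
    rd = (l2 + l3) / 2.0                           # radial diffusivity
    num = np.sqrt((l1 - md) ** 2 + (l2 - md) ** 2 + (l3 - md) ** 2)
    den = np.sqrt(l1 ** 2 + l2 ** 2 + l3 ** 2)
    fa = np.sqrt(1.5) * num / den                  # fractional anisotropy in [0, 1]
    return dict(FA=fa, MD=md, AD=ad, RD=rd)

print(dti_scalar_metrics([1.7e-3, 0.4e-3, 0.3e-3]))  # typical white-matter values (mm^2/s)
```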

[CV-39] Phy124: Fast Physics-Driven 4D Content Generation from a Single Image

链接: https://arxiv.org/abs/2409.07179
作者: Jiajing Lin,Zhenzhong Wang,Yongjie Hou,Yuzhou Tang,Min Jiang
关键词-EN: objects that change, diffusion models, video diffusion models, focuses on creating, content generation focuses
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:4D content generation focuses on creating dynamic 3D objects that change over time. Existing methods primarily rely on pre-trained video diffusion models, utilizing sampling processes or reference videos. However, these approaches face significant challenges. Firstly, the generated 4D content often fails to adhere to real-world physics since video diffusion models do not incorporate physical priors. Secondly, the extensive sampling process and the large number of parameters in diffusion models result in exceedingly time-consuming generation processes. To address these issues, we introduce Phy124, a novel, fast, and physics-driven method for controllable 4D content generation from a single image. Phy124 integrates physical simulation directly into the 4D generation process, ensuring that the resulting 4D content adheres to natural physical laws. Phy124 also eliminates the use of diffusion models during the 4D dynamics generation phase, significantly speeding up the process. Phy124 allows for the control of 4D dynamics, including movement speed and direction, by manipulating external forces. Extensive experiments demonstrate that Phy124 generates high-fidelity 4D content with significantly reduced inference times, achieving state-of-the-art performance. The code and generated 4D content are available at the provided link: https://anonymous.4open.science/r/BBF2/.

[CV-40] Swin-LiteMedSAM: A Lightweight Box-Based Segment Anything Model for Large-Scale Medical Image Datasets

链接: https://arxiv.org/abs/2409.07172
作者: Ruochen Gao,Donghang Lyu,Marius Staring
关键词-EN: medical image segmentation, receiving high attention, subtask receiving high, automatic medical image, medical image
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 13 pages

点击查看摘要

Abstract:Medical imaging is essential for the diagnosis and treatment of diseases, with medical image segmentation as a subtask receiving high attention. However, automatic medical image segmentation models are typically task-specific and struggle to handle multiple scenarios, such as different imaging modalities and regions of interest. With the introduction of the Segment Anything Model (SAM), training a universal model for various clinical scenarios has become feasible. Recently, several Medical SAM (MedSAM) methods have been proposed, but these models often rely on heavy image encoders to achieve high performance, which may not be practical for real-world applications due to their high computational demands and slow inference speed. To address this issue, a lightweight version of the MedSAM (LiteMedSAM) can provide a viable solution, achieving high performance while requiring fewer resources and less time. In this work, we introduce Swin-LiteMedSAM, a new variant of LiteMedSAM. This model integrates the tiny Swin Transformer as the image encoder, incorporates multiple types of prompts, including box-based points and scribbles generated from a given bounding box, and establishes skip connections between the image encoder and the mask decoder. In the Segment Anything in Medical Images on Laptop challenge (CVPR 2024), our approach strikes a good balance between segmentation performance and speed, demonstrating significantly improved overall results across multiple modalities compared to the LiteMedSAM baseline provided by the challenge organizers. Our proposed model achieved a DSC score of 0.8678 and an NSD score of 0.8844 on the validation set. On the final test set, it attained a DSC score of 0.8193 and an NSD score of 0.8461, securing fourth place in the challenge.
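
For reference, the DSC reported above is the standard Dice overlap between predicted and ground-truth masks; a minimal sketch is given below (the NSD additionally requires a boundary-tolerance parameter and is omitted). The masks and sizes are toy values.

```python
import numpy as np

def dice_score(pred, gt):
    """Dice similarity coefficient (DSC) between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    total = pred.sum() + gt.sum()
    return 2.0 * inter / total if total > 0 else 1.0  # both empty -> perfect agreement

pred = np.zeros((64, 64), dtype=np.uint8); pred[10:40, 10:40] = 1
gt   = np.zeros((64, 64), dtype=np.uint8); gt[15:45, 15:45] = 1
print(f"DSC = {dice_score(pred, gt):.3f}")
```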

[CV-41] Mamba Policy: Towards Efficient 3D Diffusion Policy with Hybrid Selective State Models

链接: https://arxiv.org/abs/2409.07163
作者: Jiahang Cao,Qiang Zhang,Jingkai Sun,Jiaxu Wang,Hao Cheng,Yulin Li,Jun Ma,Yecheng Shao,Wen Zhao,Gang Han,Yijie Guo,Renjing Xu
关键词-EN: Mamba Policy, Diffusion models, diffusion models typically, Mamba, manipulation due
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注: 7 pages, 5 figures

点击查看摘要

Abstract:Diffusion models have been widely employed in the field of 3D manipulation due to their efficient capability to learn distributions, allowing for precise prediction of action trajectories. However, diffusion models typically rely on large parameter UNet backbones as policy networks, which can be challenging to deploy on resource-constrained devices. Recently, the Mamba model has emerged as a promising solution for efficient modeling, offering low computational complexity and strong performance in sequence modeling. In this work, we propose the Mamba Policy, a lighter but stronger policy that reduces the parameter count by over 80% compared to the original policy network while achieving superior performance. Specifically, we introduce the XMamba Block, which effectively integrates input information with conditional features and leverages a combination of Mamba and Attention mechanisms for deep feature extraction. Extensive experiments demonstrate that the Mamba Policy excels on the Adroit, Dexart, and MetaWorld datasets, requiring significantly fewer computational resources. Additionally, we highlight the Mamba Policy’s enhanced robustness in long-horizon scenarios compared to baseline methods and explore the performance of various Mamba variants within the Mamba Policy framework. Our project page is in this https URL.

[CV-42] MVLLaVA: An Intelligent Agent for Unified and Flexible Novel View Synthesis

链接: https://arxiv.org/abs/2409.07129
作者: Hanyu Jiang,Jian Xue,Xing Lan,Guohong Hu,Ke Lu
关键词-EN: intelligent agent designed, paper introduces MVLLaVA, paper introduces, intelligent agent, agent designed
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: project page: this https URL

点击查看摘要

Abstract:This paper introduces MVLLaVA, an intelligent agent designed for novel view synthesis tasks. MVLLaVA integrates multiple multi-view diffusion models with a large multimodal model, LLaVA, enabling it to handle a wide range of tasks efficiently. MVLLaVA represents a versatile and unified platform that adapts to diverse input types, including a single image, a descriptive caption, or a specific change in viewing azimuth, guided by language instructions for viewpoint generation. We carefully craft task-specific instruction templates, which are subsequently used to fine-tune LLaVA. As a result, MVLLaVA acquires the capability to generate novel view images based on user instructions, demonstrating its flexibility across diverse tasks. Experiments are conducted to validate the effectiveness of MVLLaVA, demonstrating its robust performance and versatility in tackling diverse novel view synthesis challenges.

[CV-43] Redundancy-Aware Camera Selection for Indoor Scene Neural Rendering

链接: https://arxiv.org/abs/2409.07098
作者: Zehao Wang,Han Zhou,Matthew B. Blaschko,Tinne Tuytelaars,Minye Wu
关键词-EN: monocular video sequence, view synthesis, achieved by capturing, capturing a monocular, monocular video
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Novel view synthesis of indoor scenes can be achieved by capturing a monocular video sequence of the environment. However, redundant information caused by artificial movements in the input video data reduces the efficiency of scene modeling. In this work, we tackle this challenge from the perspective of camera selection. We begin by constructing a similarity matrix that incorporates both the spatial diversity of the cameras and the semantic variation of the images. Based on this matrix, we use the Intra-List Diversity (ILD) metric to assess camera redundancy, formulating the camera selection task as an optimization problem. Then we apply a diversity-based sampling algorithm to optimize the camera selection. We also develop a new dataset, IndoorTraj, which includes long and complex camera movements captured by humans in virtual indoor environments, closely mimicking real-world scenarios. Experimental results demonstrate that our strategy outperforms other approaches under time and memory constraints. Remarkably, our method achieves performance comparable to models trained on the full dataset, while using only an average of 15% of the frames and 75% of the allotted time.
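
The camera-selection idea can be illustrated with a simple greedy sampler over a precomputed similarity matrix: repeatedly add the view least similar to the already selected set. This is a hypothetical sketch of diversity-based sampling, not the paper's exact ILD-driven optimization; the feature dimensions and budget below are made up.

```python
import numpy as np

def greedy_diverse_selection(similarity, k):
    """Greedily pick k cameras so that each new pick is least similar
    to those already selected (a simple diversity-based sampler)."""
    n = similarity.shape[0]
    selected = [int(np.argmin(similarity.sum(axis=1)))]  # start from the most "unique" view
    while len(selected) < k:
        remaining = [i for i in range(n) if i not in selected]
        # for each candidate, its worst-case (max) similarity to the current set
        scores = [similarity[i, selected].max() for i in remaining]
        selected.append(remaining[int(np.argmin(scores))])
    return selected

rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 32))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
sim = feats @ feats.T                      # cosine similarity between camera views
print(greedy_diverse_selection(sim, k=15))
```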

[CV-44] Multimodal Emotion Recognition with Vision-language Prompting and Modality Dropout

链接: https://arxiv.org/abs/2409.07078
作者: Anbin QI,Zhongliang Liu,Xinyong Zhou,Jinba Xiao,Fengrun Zhang,Qi Gan,Ming Tao,Gaozheng Zhang,Lu Zhang
关键词-EN: Emotion Recognition Challenge, Multimodal Emotion Recognition, Recognition Challenge Track, Emotion Recognition, Recognition Challenge
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In this paper, we present our solution for the Second Multimodal Emotion Recognition Challenge Track 1 (MER2024-SEMI). To enhance the accuracy and generalization performance of emotion recognition, we propose several methods for Multimodal Emotion Recognition. Firstly, we introduce EmoVCLIP, a model fine-tuned based on CLIP using vision-language prompt learning, designed for video-based emotion recognition tasks. By leveraging prompt learning on CLIP, EmoVCLIP improves the performance of pre-trained CLIP on emotional videos. Additionally, to address the issue of modality dependence in multimodal fusion, we employ modality dropout for robust information fusion. Furthermore, to aid Baichuan in better extracting emotional information, we suggest using GPT-4 as the prompt for Baichuan. Lastly, we utilize a self-training strategy to leverage unlabeled videos. In this process, we use unlabeled videos with high-confidence pseudo-labels generated by our model and incorporate them into the training set. Experimental results demonstrate that our model ranks 1st in the MER2024-SEMI track, achieving an accuracy of 90.15% on the test set.
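
The self-training step described above amounts to admitting unlabeled samples whose predicted class confidence clears a threshold. A minimal sketch, with an assumed 6-class output and an illustrative threshold of 0.95 (not the values used in the submission):

```python
import torch

def select_pseudo_labels(logits, threshold=0.95):
    """Keep only unlabeled samples whose top predicted class probability exceeds
    a confidence threshold; returns their indices and hard pseudo-labels."""
    probs = torch.softmax(logits, dim=1)
    conf, labels = probs.max(dim=1)
    keep = conf >= threshold
    return keep.nonzero(as_tuple=True)[0], labels[keep]

logits = torch.randn(1000, 6) * 3          # model outputs on unlabeled clips (assumed 6 emotions)
idx, pseudo = select_pseudo_labels(logits)
print(f"{len(idx)} of 1000 unlabeled samples admitted to the training set")
```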

[CV-45] Edge Modeling Activation Free Fourier Network for Spacecraft Image Denoising

链接: https://arxiv.org/abs/2409.07067
作者: Jingfan Yang,Hu Gao,Ying Zhang,Bowen Ma,Depeng Dang
关键词-EN: crucial basic technology, basic technology closely, technology closely related, spacecraft noise image, Spacecraft image denoising
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Spacecraft image denoising is a crucial basic technology closely related to aerospace research. However, existing deep learning-based image denoising methods lack deep consideration of the characteristics of spacecraft images. To address the aforementioned shortcomings, we analyse spacecraft noise images and identify two main characteristics. One is that there are a large number of low-light images in the obtained spacecraft noise image dataset. Another is that spacecraft images contain many repetitive periodic features. According to the above-mentioned characteristics, we propose an Edge Modeling Activation Free Fourier Network (EAFFN), an efficient spacecraft image denoising method that includes an Edge Modeling Block (EMB) and an Activation Free Fourier Block (AFFB). We present EMB to effectively model edges, extract structural information, and better identify spacecraft components in the dark regions of noisy spacecraft images. We present AFFB and utilize an improved fast Fourier block to extract repetitive periodic features and long-range information in noisy spacecraft images. In addition, a Simple Gate is designed in our AFFB to reduce the computational complexity. Extensive experimental results demonstrate that our EAFFN performs competitively with the state-of-the-art on spacecraft noise image datasets.

[CV-46] Pushing the Limits of Vision-Language Models in Remote Sensing without Human Annotations

链接: https://arxiv.org/abs/2409.07048
作者: Keumgang Cha,Donggeun Yu,Junghoon Seo
关键词-EN: witnessed a surge, multifarious applications, generalized foundation models, prominence of generalized, integration has witnessed
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: This study was primarily conducted during the latter half of 2023

点击查看摘要

Abstract:The prominence of generalized foundation models in vision-language integration has witnessed a surge, given their multifarious applications. Within the natural domain, the procurement of vision-language datasets to construct these foundation models is facilitated by their abundant availability and the ease of web crawling. Conversely, in the remote sensing domain, although vision-language datasets exist, their volume is suboptimal for constructing robust foundation models. This study introduces an approach to curate vision-language datasets by employing an image decoding machine learning model, negating the need for human-annotated labels. Utilizing this methodology, we amassed approximately 9.6 million vision-language paired datasets in VHR imagery. The resultant model outperformed counterparts that did not leverage publicly available vision-language datasets, particularly in downstream tasks such as zero-shot classification, semantic localization, and image-text retrieval. Moreover, in tasks exclusively employing vision encoders, such as linear probing and k-NN classification, our model demonstrated superior efficacy compared to those relying on domain-specific vision-language datasets.

[CV-47] SoftShadow: Leveraging Penumbra-Aware Soft Masks for Shadow Removal

链接: https://arxiv.org/abs/2409.07041
作者: Xinrui Wang,Lanqing Guo,Xiyu Wang,Siyu Huang,Bihan Wen
关键词-EN: yielded promising results, Recent advancements, shadow removal, shadow, shadow removal task
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recent advancements in deep learning have yielded promising results for the image shadow removal task. However, most existing methods rely on binary pre-generated shadow masks. The binary nature of such masks could potentially lead to artifacts near the boundary between shadow and non-shadow areas. In view of this, inspired by the physical model of shadow formation, we introduce novel soft shadow masks specifically designed for shadow removal. To achieve such soft masks, we propose a SoftShadow framework by leveraging the prior knowledge of pretrained SAM and integrating physical constraints. Specifically, we jointly tune the SAM and the subsequent shadow removal network using penumbra formation constraint loss and shadow removal loss. This framework enables accurate predictions of penumbra (partially shaded regions) and umbra (fully shaded regions) areas while simultaneously facilitating end-to-end shadow removal. Through extensive experiments on popular datasets, we found that our SoftShadow framework, which generates soft masks, can better restore boundary artifacts, achieve state-of-the-art performance, and demonstrate superior generalizability.

[CV-48] Retinex-RAWMamba: Bridging Demosaicing and Denoising for Low-Light RAW Image Enhancement

链接: https://arxiv.org/abs/2409.07040
作者: Xianmin Chen,Peiliang Huang,Xiaoxu Feng,Dingwen Zhang,Longfei Han,Junwei Han
关键词-EN: remains a significant, significant challenge, Image Signal Processing, sRGB domain, Retinex Decomposition Module
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Low-light image enhancement, particularly in cross-domain tasks such as mapping from the raw domain to the sRGB domain, remains a significant challenge. Many deep learning-based methods have been developed to address this issue and have shown promising results in recent years. However, single-stage methods, which attempt to unify the complex mapping across both domains, tend to deliver limited denoising performance. In contrast, two-stage approaches typically decompose a raw image with color filter arrays (CFA) into a four-channel RGGB format before feeding it into a neural network. However, this strategy overlooks the critical role of demosaicing within the Image Signal Processing (ISP) pipeline, leading to color distortions under varying lighting conditions, especially in low-light scenarios. To address these issues, we design a novel Mamba scanning mechanism, called RAWMamba, to effectively handle raw images with different CFAs. Furthermore, we present a Retinex Decomposition Module (RDM) grounded in Retinex prior, which decouples illumination from reflectance to facilitate more effective denoising and automatic non-linear exposure correction. By bridging demosaicing and denoising, better raw image enhancement is achieved. Experimental evaluations conducted on public datasets SID and MCR demonstrate that our proposed RAWMamba achieves state-of-the-art performance on cross-domain mapping.
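
The Retinex prior referenced by the RDM is the classical relation I = R ⊙ L between an observed image, its reflectance, and its illumination. The sketch below shows a crude max-channel illumination estimate often used to initialize such decompositions; it is an illustration of the prior only, not the learned module in the paper.

```python
import numpy as np

def retinex_init(img, eps=1e-4):
    """Classic Retinex relation I = R * L: take the per-pixel channel maximum
    as a crude illumination estimate L and recover reflectance R = I / L."""
    illumination = img.max(axis=2, keepdims=True)           # H x W x 1
    reflectance = img / (illumination + eps)                 # H x W x 3, roughly in [0, 1]
    return reflectance, illumination

img = np.random.rand(256, 256, 3).astype(np.float32) * 0.2   # a toy dark image
R, L = retinex_init(img)
print(R.max(), L.mean())
```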

[CV-49] SCLNet: A Scale-Robust Complementary Learning Network for Object Detection in UAV Images

链接: https://arxiv.org/abs/2409.07024
作者: Xuexue Li
关键词-EN: Unmanned Aerial Vehicle, Unmanned Aerial, Aerial Vehicle, detectors focus primarily, recent UAV
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Most recent UAV (Unmanned Aerial Vehicle) detectors focus primarily on general challenges such as uneven distribution and occlusion. However, the neglect of scale challenges, which encompass scale variation and small objects, continues to hinder object detection in UAV images. Although existing works propose solutions, they model the problem implicitly and involve redundant steps, so detection performance remains limited; a method specifically addressing these scale challenges can therefore help improve the performance of UAV image detectors. Compared to natural scenes, scale challenges in UAV images come with limited perception across comprehensive scales and poor robustness to small objects. We found that complementary learning is beneficial for the detection model to address the scale challenges. Therefore, the paper introduces it to form our scale-robust complementary learning network (SCLNet), used in conjunction with the object detection model. The SCLNet consists of two implementations and a cooperation method. In detail, one implementation is based on our proposed scale-complementary decoder and scale-complementary loss function to explicitly extract complementary information as a complement, named comprehensive-scale complementary learning (CSCL). Another implementation is based on our proposed contrastive complement network and contrastive complement loss function to explicitly guide the learning of small objects with the rich texture detail information of the large objects, named inter-scale contrastive complementary learning (ICCL). In addition, an end-to-end cooperation (ECoop) between the two implementations and with the detection model is proposed to exploit the potential of each.

[CV-50] Insight Any Instance: Promptable Instance Segmentation for Remote Sensing Images

链接: https://arxiv.org/abs/2409.07022
作者: Xuexue Li
关键词-EN: Instance segmentation, instance segmentation model, remote sensing images, segmentation, Instance
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Instance segmentation of remote sensing images (RSIs) is an essential task for a wide range of applications such as land planning and intelligent transport. Instance segmentation of RSIs is constantly plagued by the unbalanced ratio of foreground to background and limited instance size. Moreover, most instance segmentation models are based on deep feature learning and contain operations such as repeated downsampling that are harmful to instance segmentation of RSIs, so performance is still limited. Inspired by the recent superior performance of prompt learning in visual tasks, we propose a new prompt paradigm to address the above issues. Based on an existing instance segmentation model, firstly, a local prompt module is designed to mine local prompt information from original local tokens for specific instances; secondly, a global-to-local prompt module is designed to model the contextual information from the global tokens to the local tokens where the instances are located for specific instances. Finally, a proposal's area loss function is designed to add a decoupling dimension for proposals on the scale to better exploit the potential of the above two prompt modules. It is worth mentioning that our proposed approach can extend the instance segmentation model to a promptable instance segmentation model, i.e., one that segments instances given specific box prompts. The time consumption for each promptable instance segmentation process is only 40 ms. The paper evaluates the effectiveness of our proposed approach based on several existing models on four instance segmentation datasets of RSIs, and thorough experiments prove that our proposed approach is effective for addressing the above issues and is competitive for instance segmentation of RSIs.

[CV-51] ODYSSEE: Oyster Detection Yielded by Sensor Systems on Edge Electronics

链接: https://arxiv.org/abs/2409.07003
作者: Xiaomin Lin,Vivek Mange,Arjun Suresh,Bernhard Neuberger,Aadi Palnitkar,Brendan Campbell,Alan Williams,Kleio Baxevani,Jeremy Mallette,Alhim Vera,Markus Vincze,Ioannis Rekleitis,Herbert G. Tanner,Yiannis Aloimonos
关键词-EN: offering significant economic, coastal ecosystems, offering significant, significant economic, cultural benefits
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Oysters are a keystone species in coastal ecosystems, offering significant economic, environmental, and cultural benefits. However, current monitoring systems are often destructive, typically involving dredging to physically collect and count oysters. A nondestructive alternative is manual identification from video footage collected by divers, which is time-consuming and labor-intensive with expert input. An alternative to human monitoring is the deployment of a system with trained object detection models that performs real-time, on edge oyster detection in the field. One such platform is the Aqua2 robot. Effective training of these models requires extensive high-quality data, which is difficult to obtain in marine settings. To address these complications, we introduce a novel method that leverages stable diffusion to generate high-quality synthetic data for the marine domain. We exploit diffusion models to create photorealistic marine imagery, using ControlNet inputs to ensure consistency with the segmentation ground-truth mask, the geometry of the scene, and the target domain of real underwater images for oysters. The resulting dataset is used to train a YOLOv10-based vision model, achieving a state-of-the-art 0.657 mAP@50 for oyster detection on the Aqua2 platform. The system we introduce not only improves oyster habitat monitoring, but also paves the way to autonomous surveillance for various tasks in marine contexts, improving aquaculture and conservation efforts.

[CV-52] AdvLogo: Adversarial Patch Attack against Object Detectors based on Diffusion Models

链接: https://arxiv.org/abs/2409.07002
作者: Boming Miao,Chunxiao Li,Yao Zhu,Weixiang Sun,Zizhe Wang,Xiaoyi Wang,Chuanlong Xie
关键词-EN: demonstrated impressive performance, deep learning, rapid development, development of deep, demonstrated impressive
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:With the rapid development of deep learning, object detectors have demonstrated impressive performance; however, vulnerabilities still exist in certain scenarios. Current research exploring the vulnerabilities using adversarial patches often struggles to balance the trade-off between attack effectiveness and visual quality. To address this problem, we propose a novel framework of patch attack from a semantic perspective, which we refer to as AdvLogo. Based on the hypothesis that every semantic space contains an adversarial subspace where images can cause detectors to fail in recognizing objects, we leverage the semantic understanding of the diffusion denoising process and drive the process to adversarial subareas by perturbing the latent and unconditional embeddings at the last timestep. To mitigate the distribution shift that negatively impacts image quality, we apply perturbation to the latent in the frequency domain with the Fourier Transform. Experimental results demonstrate that AdvLogo achieves strong attack performance while maintaining high visual quality.
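
The frequency-domain perturbation described above can be pictured as: transform the latent with an FFT, add a perturbation to the spectrum, and transform back. The sketch below is a generic illustration of that round trip in PyTorch, not the actual AdvLogo optimization loop; the latent shape and step size are made up.

```python
import torch

def perturb_in_frequency_domain(latent, delta):
    """Apply a perturbation to a latent tensor in the Fourier domain:
    FFT -> add complex perturbation -> inverse FFT."""
    spectrum = torch.fft.fft2(latent)          # complex spectrum of the latent
    spectrum = spectrum + delta                # adversarial update lives in frequency space
    return torch.fft.ifft2(spectrum).real      # back to the spatial latent

latent = torch.randn(1, 4, 64, 64)             # e.g. a diffusion latent
delta = torch.complex(0.01 * torch.randn_like(latent),
                      0.01 * torch.randn_like(latent))  # small complex perturbation
adv_latent = perturb_in_frequency_domain(latent, delta)
print(adv_latent.shape)
```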

[CV-53] 1M-Deepfakes Detection Challenge ACM-MM2024

链接: https://arxiv.org/abs/2409.06991
作者: Zhixi Cai,Abhinav Dhall,Shreya Ghosh,Munawar Hayat,Dimitrios Kollias,Kalin Stefanov,Usman Tariq
关键词-EN: digital media security, small fake segments, remains a significant, media security, small fake
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ACM MM 2024. Challenge webpage: this https URL

点击查看摘要

Abstract:The detection and localization of deepfake content, particularly when small fake segments are seamlessly mixed with real videos, remains a significant challenge in the field of digital media security. Based on the recently released AV-Deepfake1M dataset, which contains more than 1 million manipulated videos across more than 2,000 subjects, we introduce the 1M-Deepfakes Detection Challenge. This challenge is designed to engage the research community in developing advanced methods for detecting and localizing deepfake manipulations within the large-scale high-realistic audio-visual dataset. The participants can access the AV-Deepfake1M dataset and are required to submit their inference results for evaluation across the metrics for detection or localization tasks. The methodologies developed through the challenge will contribute to the development of next-generation deepfake detection and localization systems. Evaluation scripts, baseline models, and accompanying code will be available on this https URL.

[CV-54] PanAdapter: Two-Stage Fine-Tuning with Spatial-Spectral Priors Injecting for Pansharpening

链接: https://arxiv.org/abs/2409.06980
作者: RuoCheng Wu,ZiEn Zhang,ShangQi Deng,YuLe Duan,LiangJian Deng
关键词-EN: low-resolution multispectral images, involves restoring images, challenging image fusion, low-resolution multispectral, high-resolution panchromatic
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Pansharpening is a challenging image fusion task that involves restoring images using two different modalities: low-resolution multispectral (LRMS) and high-resolution panchromatic (PAN) images. Many end-to-end specialized models based on deep learning (DL) have been proposed, yet the scale and performance of these models are limited by the size of the dataset. Given the superior parameter scales and feature representations of pre-trained models, they exhibit outstanding performance when transferred to downstream tasks with small datasets. Therefore, we propose an efficient fine-tuning method, namely PanAdapter, which utilizes additional advanced semantic information from pre-trained models to alleviate the issue of small-scale datasets in pansharpening tasks. Specifically, targeting the large domain discrepancy between image restoration and pansharpening tasks, the PanAdapter adopts a two-stage training strategy for progressively adapting to the downstream task. In the first stage, we fine-tune the pre-trained CNN model and extract task-specific priors at two scales by the proposed Local Prior Extraction (LPE) module. In the second stage, we feed the extracted two-scale priors into two branches of cascaded adapters respectively. At each adapter, we design two parameter-efficient modules that allow the two branches to interact and be injected into the frozen pre-trained Vision Transformer (ViT) blocks. We demonstrate that by only training the proposed LPE modules and adapters with a small number of parameters, our approach can benefit from pre-trained image restoration models and achieve state-of-the-art performance in several benchmark pansharpening datasets. The code will be available soon.
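
As a point of reference, parameter-efficient adapters of the kind injected into frozen ViT blocks are usually small bottleneck MLPs added residually to the token stream. The sketch below shows such a generic adapter; the dimensions and initialization are assumptions, not PanAdapter's exact modules.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Generic parameter-efficient adapter: down-project, non-linearity,
    up-project, residual add. Only the adapter parameters are trained."""
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)          # start as an identity mapping
        nn.init.zeros_(self.up.bias)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

tokens = torch.randn(2, 197, 768)               # frozen ViT token features
adapter = BottleneckAdapter(768)
out = adapter(tokens)                           # only the adapter's ~100k params are trained
print(out.shape)
```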

[CV-55] Brain-Inspired Stepwise Patch Merging for Vision Transformers

链接: https://arxiv.org/abs/2409.06963
作者: Yonghao Yu,Dongcheng Zhao,Guobin Shen,Yiting Dong,Yi Zeng
关键词-EN: Patch Merging serving, Stepwise Patch Merging, mainstream design paradigm, Patch Merging, called Stepwise Patch
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The hierarchical architecture has become a mainstream design paradigm for Vision Transformers (ViTs), with Patch Merging serving as the pivotal component that transforms a columnar architecture into a hierarchical one. Drawing inspiration from the brain’s ability to integrate global and local information for comprehensive visual understanding, we propose a novel technique called Stepwise Patch Merging (SPM), which enhances the subsequent attention mechanism’s ability to ‘see’ better. SPM comprises two critical modules: Multi-Scale Aggregation (MSA) and Guided Local Enhancement (GLE). The MSA module integrates multi-scale features to enrich feature representation, while the GLE module focuses on refining local detail extraction, thus achieving an optimal balance between long-range dependency modeling and local feature enhancement. Extensive experiments conducted on benchmark datasets, including ImageNet-1K, COCO, and ADE20K, demonstrate that SPM significantly improves the performance of various models, particularly in dense prediction tasks such as object detection and semantic segmentation. These results underscore the efficacy of SPM in enhancing model accuracy and robustness across a wide range of computer vision tasks.
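
For context, the baseline Patch Merging step that SPM refines is typically implemented as in Swin: gather each 2x2 window of tokens, concatenate the channels, and linearly project them, halving the spatial resolution. A standard sketch of that baseline step (not the SPM module itself):

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Standard 2x2 patch merging: gather each 2x2 neighborhood, concatenate
    channels (C -> 4C) and project to 2C, halving the spatial resolution."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):                  # x: (B, H, W, C) with even H, W
        x = torch.cat([x[:, 0::2, 0::2], x[:, 1::2, 0::2],
                       x[:, 0::2, 1::2], x[:, 1::2, 1::2]], dim=-1)
        return self.reduction(self.norm(x))

x = torch.randn(1, 56, 56, 96)
print(PatchMerging(96)(x).shape)           # torch.Size([1, 28, 28, 192])
```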

[CV-56] Bridging Domain Gap of Point Cloud Representations via Self-Supervised Geometric Augmentation

链接: https://arxiv.org/abs/2409.06956
作者: Li Yu,Hongchao Zhong,Longkun Zou,Ke Chen,Pan Gao
关键词-EN: Recent progress, point clouds, synthetic point clouds, semantic point clouds, point clouds analysis
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages, 6 figures, 5 tables

点击查看摘要

Abstract:Recent progress in semantic point cloud analysis is largely driven by synthetic data (e.g., ModelNet and ShapeNet), which are typically complete, well-aligned and noise free. Therefore, representations of those ideal synthetic point clouds have limited variations in the geometric perspective and can gain good performance on a number of 3D vision tasks such as point cloud classification. In the context of unsupervised domain adaptation (UDA), representation learning designed for synthetic point clouds can hardly capture domain invariant geometric patterns from incomplete and noisy point clouds. To address such a problem, we introduce a novel scheme for induced geometric invariance of point cloud representations across domains, via regularizing representation learning with two self-supervised geometric augmentation tasks. On one hand, a novel pretext task of predicting translation distances of augmented samples is proposed to alleviate centroid shift of point clouds due to occlusion and noises. On the other hand, we pioneer an integration of the relational self-supervised learning on geometrically-augmented point clouds in a cascade manner, utilizing the intrinsic relationship of augmented variants and other samples as extra constraints of cross-domain geometric features. Experiments on the PointDA-10 dataset demonstrate the effectiveness of the proposed method, achieving the state-of-the-art performance.
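
The translation-prediction pretext task can be illustrated very simply: rigidly shift a point cloud by a random offset and regress the shift distance. The sketch below builds one such training pair; the shift range is an arbitrary choice for illustration, not the paper's exact formulation.

```python
import numpy as np

def translation_pretext_sample(points, max_shift=0.2):
    """Build one training pair for a translation-distance pretext task:
    randomly translate a point cloud and use the shift distance as the target."""
    shift = np.random.uniform(-max_shift, max_shift, size=3)
    augmented = points + shift                     # rigid translation of every point
    target = np.linalg.norm(shift)                 # distance the centroid moved
    return augmented.astype(np.float32), np.float32(target)

cloud = np.random.randn(1024, 3).astype(np.float32)
aug, dist = translation_pretext_sample(cloud)
print(aug.shape, dist)
```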

[CV-57] FSMDet: Vision-guided feature diffusion for fully sparse 3D detector ECCV

链接: https://arxiv.org/abs/2409.06945
作者: Tianran Liu,Morteza Mousa Pasandi,Robert Laganiere
关键词-EN: Fully sparse, fully sparse models, Fully Sparse Multi-modal, Sparse Multi-modal Detection, fully sparse works
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted by European Conference on Computer Vision (ECCV) 2024 workshop on VCAD

点击查看摘要

Abstract:Fully sparse 3D detection has attracted increasing interest in recent years. However, the sparsity of the features in these frameworks challenges the generation of proposals because of the limited diffusion process. In addition, the quest for efficiency has led to only a few works on vision-assisted fully sparse models. In this paper, we propose FSMDet (Fully Sparse Multi-modal Detection), which uses visual information to guide the LiDAR feature diffusion process while still maintaining the efficiency of the pipeline. Specifically, most fully sparse works focus on complex customized center fusion diffusion/regression operators. However, we observed that if adequate object completion is performed, even the simplest interpolation operator leads to satisfactory results. Inspired by this observation, we split the vision-guided diffusion process into two modules: a Shape Recover Layer (SRLayer) and a Self Diffusion Layer (SDLayer). The former uses RGB information to recover the shape of the visible part of an object, and the latter uses a visual prior to further spread the features to the center region. Experiments demonstrate that our approach successfully improves the performance of previous fully sparse models that use LiDAR only and reaches SOTA performance in multimodal models. At the same time, thanks to the sparse architecture, our method can be up to 5 times more efficient than previous SOTA methods in the inference process.

[CV-58] Automated Body Composition Analysis Using DAFS Express on 2D MRI Slices at L3 Vertebral Level

链接: https://arxiv.org/abs/2409.06942
作者: Varun Akella,Razeyeh Bagherinasab,Jia Ming Li,Long Nguyen,Vincent Tze Yang Chow,Hyunwoo Lee,Karteek Popuri,Mirza Faisal Beg
关键词-EN: assessing health conditions, Body composition analysis, VAT, SAT, Analysis Facilitation Suite
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Body composition analysis is vital in assessing health conditions such as obesity, sarcopenia, and metabolic syndromes. MRI provides detailed images of skeletal muscle (SKM), visceral adipose tissue (VAT), and subcutaneous adipose tissue (SAT), but their manual segmentation is labor-intensive and limits clinical applicability. This study validates an automated tool for MRI-based 2D body composition analysis, Data Analysis Facilitation Suite (DAFS) Express, comparing its automated measurements with expert manual segmentations using UK Biobank data. A cohort of 399 participants from the UK Biobank dataset was selected, yielding 423 single L3 slices for analysis. DAFS Express performed automated segmentations of SKM, VAT, and SAT, which were then manually corrected by expert raters for validation. Evaluation metrics included Jaccard coefficients, Dice scores, Intraclass Correlation Coefficients (ICCs), and Bland-Altman plots to assess segmentation agreement and reliability. High agreement was observed between automated and manual segmentations with mean Jaccard scores: SKM 99.03%, VAT 95.25%, and SAT 99.57%; and mean Dice scores: SKM 99.51%, VAT 97.41%, and SAT 99.78%. Cross-sectional area comparisons showed consistent measurements with automated methods closely matching manual measurements for SKM and SAT, and slightly higher values for VAT (SKM: Auto 132.51 cm^2, Manual 132.36 cm^2; VAT: Auto 137.07 cm^2, Manual 134.46 cm^2; SAT: Auto 203.39 cm^2, Manual 202.85 cm^2). ICCs confirmed strong reliability (SKM: 0.998, VAT: 0.994, SAT: 0.994). Bland-Altman plots revealed minimal biases, and boxplots illustrated distribution similarities across SKM, VAT, and SAT areas. On average DAFS Express took 18 seconds per DICOM. This underscores its potential to streamline image analysis processes in research and clinical settings, enhancing diagnostic accuracy and efficiency.
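
The Bland-Altman statistics quoted above are straightforward to compute: the bias is the mean difference between automated and manual measurements, and the 95% limits of agreement are bias ± 1.96 times the standard deviation of the differences. A small sketch with toy numbers (not study data):

```python
import numpy as np

def bland_altman(auto, manual):
    """Bland-Altman agreement statistics: mean bias and 95% limits of agreement."""
    auto, manual = np.asarray(auto, float), np.asarray(manual, float)
    diff = auto - manual
    bias = diff.mean()
    loa = 1.96 * diff.std(ddof=1)
    return bias, (bias - loa, bias + loa)

auto_vat = [137.2, 140.5, 128.9, 135.0]      # toy VAT areas in cm^2, not study data
manual_vat = [134.6, 139.0, 127.5, 133.2]
bias, limits = bland_altman(auto_vat, manual_vat)
print(f"bias = {bias:.2f} cm^2, 95% limits of agreement = {limits}")
```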

[CV-59] Intrapartum Ultrasound Image Segmentation of Pubic Symphysis and Fetal Head Using Dual Student-Teacher Framework with CNN-ViT Collaborative Learning

链接: https://arxiv.org/abs/2409.06928
作者: Jianmei Jiang,Huijin Wang,Jieyun Bai,Shun Long,Shuangping Chen,Victor M. Campello,Karim Lekadir
关键词-EN: potential delivery complications, monitoring labor progression, identifying potential delivery, PSFH Segmentation Grand, Convolutional Neural Networks
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The segmentation of the pubic symphysis and fetal head (PSFH) constitutes a pivotal step in monitoring labor progression and identifying potential delivery complications. Despite the advances in deep learning, the lack of annotated medical images hinders the training of segmentation. Traditional semi-supervised learning approaches primarily utilize a unified network model based on Convolutional Neural Networks (CNNs) and apply consistency regularization to mitigate the reliance on extensive annotated data. However, these methods often fall short in capturing the discriminative features of unlabeled data and in delineating the long-range dependencies inherent in the ambiguous boundaries of PSFH within ultrasound images. To address these limitations, we introduce a novel framework, the Dual-Student and Teacher Combining CNN and Transformer (DSTCT), which synergistically integrates the capabilities of CNNs and Transformers. Our framework comprises a Vision Transformer (ViT) as the teacher and two student models: one ViT and one CNN. This dual-student setup enables mutual supervision through the generation of both hard and soft pseudo-labels, with the consistency in their predictions being refined by minimizing the classifier determinacy discrepancy. The teacher model further reinforces learning within this architecture through the imposition of consistency regularization constraints. To augment the generalization abilities of our approach, we employ a blend of data and model perturbation techniques. Comprehensive evaluations on the benchmark dataset of the PSFH Segmentation Grand Challenge at MICCAI 2023 demonstrate our DSTCT framework outperformed ten contemporary semi-supervised segmentation methods. Code available at this https URL.

[CV-60] Rethinking Directional Parameterization in Neural Implicit Surface Reconstruction ECCV2024

链接: https://arxiv.org/abs/2409.06923
作者: Zijie Jiang,Tianhan Xu,Hiroharu Kato
关键词-EN: made notable progress, view-dependent radiance fields, neural implicit representations, hybrid directional parameterization, view-dependent radiance
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to ECCV 2024

点击查看摘要

Abstract:Multi-view 3D surface reconstruction using neural implicit representations has made notable progress by modeling the geometry and view-dependent radiance fields within a unified framework. However, their effectiveness in reconstructing objects with specular or complex surfaces is typically biased by the directional parameterization used in their view-dependent radiance network. Viewing direction and reflection direction are the two most commonly used directional parameterizations but have their own limitations. Typically, utilizing the viewing direction usually struggles to correctly decouple the geometry and appearance of objects with highly specular surfaces, while using the reflection direction tends to yield overly smooth reconstructions for concave or complex structures. In this paper, we analyze their failed cases in detail and propose a novel hybrid directional parameterization to address their limitations in a unified form. Extensive experiments demonstrate the proposed hybrid directional parameterization consistently delivered satisfactory results in reconstructing objects with a wide variety of materials, geometry and appearance, whereas using other directional parameterizations faces challenges in reconstructing certain objects. Moreover, the proposed hybrid directional parameterization is nearly parameter-free and can be effortlessly applied in any existing neural surface reconstruction method.
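
For readers comparing the two parameterizations, the reflection direction is obtained by reflecting the direction toward the camera about the surface normal, w_r = 2(w_o · n)n - w_o. A tiny sketch of that common convention (conventions differ slightly across papers, and this is not code from the paper):

```python
import numpy as np

def reflect_view_direction(view_dir, normal):
    """Reflection-direction parameterization: reflect the (unit) direction from
    the surface point toward the camera about the (unit) surface normal,
    w_r = 2 (w_o . n) n - w_o."""
    return 2.0 * np.dot(view_dir, normal) * normal - view_dir

w_o = np.array([0.0, 0.0, 1.0])                 # direction toward the camera
n = np.array([0.0, 1.0, 1.0]) / np.sqrt(2.0)    # surface normal
print(reflect_view_direction(w_o, n))           # -> [0. 1. 0.]
```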

[CV-61] Enhanced Pix2Pix GAN for Visual Defect Removal in UAV-Captured Images

链接: https://arxiv.org/abs/2409.06889
作者: Volodymyr Rizun
关键词-EN: removes visual defects, paper presents, presents a neural, neural network, effectively removes visual
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Prepared for IEEE APUAVD 2024 conference

点击查看摘要

Abstract:This paper presents a neural network that effectively removes visual defects from UAV-captured images. It features an enhanced Pix2Pix GAN, specifically engineered to address visual defects in UAV imagery. The method incorporates advanced modifications to the Pix2Pix architecture, targeting prevalent issues such as mode collapse. The suggested method facilitates significant improvements in the quality of defect-affected UAV images, yielding cleaner and more precise visual results. The effectiveness of the proposed approach is demonstrated through evaluation on a custom dataset of aerial photographs, highlighting its capability to refine and restore UAV imagery effectively.

[CV-62] AssistTaxi: A Comprehensive Dataset for Taxiway Analysis and Autonomous Operations

链接: https://arxiv.org/abs/2409.06856
作者: Parth Ganeriwala,Siddhartha Bhattacharyya,Sean Gunther,Brian Kish,Mohammed Abdul Hafeez Khan,Ankur Dhadoti,Natasha Neogi
关键词-EN: high-quality datasets play, availability of high-quality, play a crucial, crucial role, role in advancing
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The availability of high-quality datasets plays a crucial role in advancing research and development, especially for safety-critical and autonomous systems. In this paper, we present AssistTaxi, a comprehensive novel dataset which is a collection of images for runway and taxiway analysis. The dataset comprises more than 300,000 frames of diverse and carefully collected data, gathered from Melbourne (MLB) and Grant-Valkaria (X59) general aviation airports. The importance of AssistTaxi lies in its potential to advance autonomous operations, enabling researchers and developers to train and evaluate algorithms for efficient and safe taxiing. Researchers can utilize AssistTaxi to benchmark their algorithms, assess performance, and explore novel approaches for runway and taxiway analysis. Additionally, the dataset serves as a valuable resource for validating and enhancing existing algorithms, facilitating innovation in autonomous operations for aviation. We also propose an initial approach to label the dataset using a contour-based detection and line extraction technique.

[CV-63] ExIQA: Explainable Image Quality Assessment Using Distortion Attributes

链接: https://arxiv.org/abs/2409.06853
作者: Sepehr Kazemi Ranjbar,Emad Fatemizadeh
关键词-EN: Image Quality Assessment, Blind Image Quality, Quality Assessment, Blind Image, aims to develop
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Blind Image Quality Assessment (BIQA) aims to develop methods that estimate the quality scores of images in the absence of a reference image. In this paper, we approach BIQA from a distortion identification perspective, where our primary goal is to predict distortion types and strengths using Vision-Language Models (VLMs), such as CLIP, due to their extensive knowledge and generalizability. Based on these predicted distortions, we then estimate the quality score of the image. To achieve this, we propose an explainable approach for distortion identification based on attribute learning. Instead of prompting VLMs with the names of distortions, we prompt them with the attributes or effects of distortions and aggregate this information to infer the distortion strength. Additionally, we consider multiple distortions per image, making our method more scalable. To support this, we generate a dataset consisting of 100,000 images for efficient training. Finally, attribute probabilities are retrieved and fed into a regressor to predict the image quality score. The results show that our approach, besides its explainability and transparency, achieves state-of-the-art (SOTA) performance across multiple datasets in both PLCC and SRCC metrics. Moreover, the zero-shot results demonstrate the generalizability of the proposed approach.

[CV-64] LIME-M: Less Is More for Evaluation of MLLMs

链接: https://arxiv.org/abs/2409.06851
作者: Kang Zhu,Qianbo Zang,Shian Jia,Siwei Wu,Feiteng Fang,Yizhi Li,Shuyue Guo,Tianyu Zheng,Bo Li,Haoning Wu,Xingwei Qu,Jian Yang,Zachary Liu,Xiang Yue,J.H. Liu,Chenghua Lin,Min Yang,Shiwen Ni,Wenhao Huang,Ge Zhang
关键词-EN: Multimodal Large Language, Large Language Models, Large Language, visual question answering, remarkable success achieved
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:With the remarkable success achieved by Multimodal Large Language Models (MLLMs), numerous benchmarks have been designed to assess MLLMs’ ability to guide their development in image perception tasks (e.g., image captioning and visual question answering). However, the existence of numerous benchmarks results in a substantial computational burden when evaluating model performance across all of them. Moreover, these benchmarks contain many overly simple problems or challenging samples, which do not effectively differentiate the capabilities among various MLLMs. To address these challenges, we propose a pipeline to process the existing benchmarks, which consists of two modules: (1) Semi-Automated Screening Process and (2) Eliminating Answer Leakage. The Semi-Automated Screening Process filters out samples that cannot distinguish the model’s capabilities by synthesizing various MLLMs and manually evaluating them. The Eliminate Answer Leakage module filters samples whose answers can be inferred without images. Finally, we curate the LIME-M: Less Is More for Evaluation of Multimodal LLMs, a lightweight Multimodal benchmark that can more effectively evaluate the performance of different models. Our experiments demonstrate that: LIME-M can better distinguish the performance of different MLLMs with fewer samples (24% of the original) and reduced time (23% of the original); LIME-M eliminates answer leakage, focusing mainly on the information within images; The current automatic metric (i.e., CIDEr) is insufficient for evaluating MLLMs’ capabilities in captioning. Moreover, removing the caption task score when calculating the overall score provides a more accurate reflection of model performance differences. All our codes and data are released at this https URL.

[CV-65] Shadow Removal Refinement via Material-Consistent Shadow Edges

链接: https://arxiv.org/abs/2409.06848
作者: Shilin Hu,Hieu Le,ShahRukh Athar,Sagnik Das,Dimitris Samaras
关键词-EN: Shadow, shadow removal, shadow edges, exhibit sharp, luminance or contrast
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Shadow boundaries can be confused with material boundaries as both exhibit sharp changes in luminance or contrast within a scene. However, shadows do not modify the intrinsic color or texture of surfaces. Therefore, on both sides of shadow edges traversing regions with the same material, the original color and textures should be the same if the shadow is removed properly. These shadow/shadow-free pairs are very useful but hard-to-collect supervision signals. The crucial contribution of this paper is to learn how to identify those shadow edges that traverse material-consistent regions and how to use them as self-supervision for shadow removal refinement during test time. To achieve this, we fine-tune SAM, an image segmentation foundation model, to produce a shadow-invariant segmentation and then extract material-consistent shadow edges by comparing the SAM segmentation with the shadow mask. Utilizing these shadow edges, we introduce color and texture-consistency losses to enhance the shadow removal process. We demonstrate the effectiveness of our method in improving shadow removal results on more challenging, in-the-wild images, outperforming the state-of-the-art shadow removal methods. Additionally, we propose a new metric and an annotated dataset for evaluating the performance of shadow removal methods without the need for paired shadow/shadow-free data.

[CV-66] Face Mask Removal with Region-attentive Face Inpainting

链接: https://arxiv.org/abs/2409.06845
作者: Minmin Yang
关键词-EN: face masks, face, removing face masks, masks, face inpainting
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:During the COVID-19 pandemic, face masks have become ubiquitous in our lives. Face masks can cause some face recognition models to fail since they cover a significant portion of a face. In addition, removing face masks from captured images or videos can be desirable, e.g., for better social interaction and for image/video editing and enhancement purposes. Hence, we propose a generative face inpainting method to effectively recover/reconstruct the masked part of a face. Face inpainting is more challenging compared to traditional inpainting, since it requires high fidelity while maintaining the identity at the same time. Our proposed method includes a Multi-scale Channel-Spatial Attention Module (M-CSAM) to mitigate the spatial information loss and learn the inter- and intra-channel correlation. In addition, we introduce an approach enforcing the supervised signal to focus on masked regions instead of the whole image. We also synthesize our own Masked-Faces dataset from the CelebA dataset by incorporating five different types of face masks, including surgical mask, regular mask and scarves, which also cover the neck area. The experimental results show that our proposed method outperforms different baselines in terms of structural similarity index measure, peak signal-to-noise ratio and l1 loss, while also providing better outputs qualitatively. The code will be made publicly available on GitHub.

[CV-67] Few-Shot Learning: Expanding ID Cards Presentation Attack Detection to Unknown ID Countries

链接: https://arxiv.org/abs/2409.06842
作者: Alvaro S. Rocamora,Juan M. Espin,Juan E. Tapia
关键词-EN: remote verification system, detecting Presentation Attacks, Few-shot Learning, approach for detecting, Prototypical Networks
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper proposes a Few-shot Learning (FSL) approach for detecting Presentation Attacks on ID Cards deployed in a remote verification system and its extension to new countries. Our research analyses the performance of Prototypical Networks across documents from Spain and Chile as a baseline and measures how well the generalisation capabilities extend to new ID Card countries such as Argentina and Costa Rica, specifically targeting the challenge of screen display presentation attacks. By leveraging convolutional architectures and meta-learning principles embodied in Prototypical Networks, we have crafted a model that demonstrates high efficacy with Few-shot examples. This research reveals that competitive performance can be achieved with as few as five unique identities and with under 100 images per new country added. This offers new insight into generalising Presentation Attack Detection on ID cards to unknown attacks.
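
Prototypical Networks classify a query by its distance to class prototypes, each prototype being the mean embedding of that class's few support examples. A minimal episode-level sketch (the embedding dimension and episode sizes are illustrative, not the paper's configuration):

```python
import torch

def prototypical_predict(support_emb, support_labels, query_emb, n_classes):
    """Few-shot classification with Prototypical Networks: each class prototype
    is the mean embedding of its support examples, and queries are assigned by
    softmax over negative squared Euclidean distances to the prototypes."""
    prototypes = torch.stack([support_emb[support_labels == c].mean(dim=0)
                              for c in range(n_classes)])
    dists = torch.cdist(query_emb, prototypes) ** 2     # (n_query, n_classes)
    return torch.softmax(-dists, dim=1)

# 2-way 5-shot toy episode with 64-d embeddings (e.g. bona fide vs. screen attack)
support = torch.randn(10, 64)
labels = torch.tensor([0] * 5 + [1] * 5)
queries = torch.randn(4, 64)
print(prototypical_predict(support, labels, queries, n_classes=2))
```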

[CV-68] Cross-Modal Self-Supervised Learning with Effective Contrastive Units for LiDAR Point Clouds IROS2024

链接: https://arxiv.org/abs/2409.06827
作者: Mu Cai,Chenxu Luo,Yong Jae Lee,Xiaodong Yang
关键词-EN: point clouds, properly act, point, clouds, contrastive
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: IROS 2024

点击查看摘要

Abstract:3D perception in LiDAR point clouds is crucial for a self-driving vehicle to properly act in 3D environment. However, manually labeling point clouds is hard and costly. There has been a growing interest in self-supervised pre-training of 3D perception models. Following the success of contrastive learning in images, current methods mostly conduct contrastive pre-training on point clouds only. Yet an autonomous driving vehicle is typically supplied with multiple sensors including cameras and LiDAR. In this context, we systematically study single modality, cross-modality, and multi-modality for contrastive learning of point clouds, and show that cross-modality wins over other alternatives. In addition, considering the huge difference between the training sources in 2D images and 3D point clouds, it remains unclear how to design more effective contrastive units for LiDAR. We therefore propose the instance-aware and similarity-balanced contrastive units that are tailored for self-driving point clouds. Extensive experiments reveal that our approach achieves remarkable performance gains over various point cloud models across the downstream perception tasks of LiDAR based 3D object detection and 3D semantic segmentation on the four popular benchmarks including Waymo Open Dataset, nuScenes, SemanticKITTI and ONCE.
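
Cross-modal contrastive pre-training of this kind is commonly driven by a symmetric InfoNCE objective between matched LiDAR and image features. The sketch below shows that generic loss; it is not the paper's instance-aware or similarity-balanced contrastive units, only the standard starting point.

```python
import torch
import torch.nn.functional as F

def infonce_cross_modal(point_feats, image_feats, temperature=0.07):
    """Symmetric InfoNCE between matched point-cloud and image features:
    the i-th point feature should be most similar to the i-th image feature."""
    p = F.normalize(point_feats, dim=1)
    v = F.normalize(image_feats, dim=1)
    logits = p @ v.t() / temperature                  # (N, N) similarity matrix
    targets = torch.arange(p.size(0), device=p.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = infonce_cross_modal(torch.randn(256, 128), torch.randn(256, 128))
print(loss.item())
```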

[CV-69] Sam2Rad: A Segmentation Model for Medical Images with Learnable Prompts

链接: https://arxiv.org/abs/2409.06821
作者: Assefa Seyoum Wahd,Banafshe Felfeliyan,Yuyue Zhou,Shrimanti Ghosh,Adam McArthur,Jiechen Zhang,Jacob L. Jaremko,Abhilash Hareendranathan
关键词-EN: model require high-quality, Foundation models, require high-quality manual, model require, requires expertise
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Foundation models like the segment anything model require high-quality manual prompts for medical image segmentation, which is time-consuming and requires expertise. SAM and its variants often fail to segment structures in ultrasound (US) images due to domain shift. We propose Sam2Rad, a prompt learning approach to adapt SAM and its variants for US bone segmentation without human prompts. It introduces a prompt predictor network (PPN) with a cross-attention module to predict prompt embeddings from image encoder features. PPN outputs bounding box and mask prompts, and 256-dimensional embeddings for regions of interest. The framework allows optional manual prompting and can be trained end-to-end using parameter-efficient fine-tuning (PEFT). Sam2Rad was tested on 3 musculoskeletal US datasets: wrist (3822 images), rotator cuff (1605 images), and hip (4849 images). It improved performance across all datasets without manual prompts, increasing Dice scores by 2-7% for hip/wrist and up to 33% for shoulder data. Sam2Rad can be trained with as few as 10 labeled images and is compatible with any SAM architecture for automatic segmentation.

[CV-70] Bifurcation Identification for Ultrasound-driven Robotic Cannulation

链接: https://arxiv.org/abs/2409.06817
作者: Cecilia G. Morales,Dhruv Srikanth,Jack H. Good,Keith A. Dufendach,Artur Dubrawski
关键词-EN: critical care settings, precise intravascular access, care settings, rapid and precise, patients’ survival
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In trauma and critical care settings, rapid and precise intravascular access is key to patients’ survival. Our research aims at ensuring this access, even when skilled medical personnel are not readily available. Vessel bifurcations are anatomical landmarks that can guide the safe placement of catheters or needles during medical procedures. Although ultrasound is advantageous in navigating anatomical landmarks in emergency scenarios due to its portability and safety, to our knowledge no existing algorithm can autonomously extract vessel bifurcations using ultrasound images. This is primarily due to the limited availability of ground truth data, in particular, data from live subjects, needed for training and validating reliable models. Researchers often resort to using data from anatomical phantoms or simulations. We introduce BIFURC, Bifurcation Identification for Ultrasound-driven Robot Cannulation, a novel algorithm that identifies vessel bifurcations and provides optimal needle insertion sites for an autonomous robotic cannulation system. BIFURC integrates expert knowledge with deep learning techniques to efficiently detect vessel bifurcations within the femoral region and can be trained on a limited amount of in-vivo data. We evaluated our algorithm using a medical phantom as well as real-world experiments involving live pigs. In all cases, BIFURC consistently identified bifurcation points and needle insertion locations in alignment with those identified by expert clinicians.

[CV-71] Object Modeling from Underwater Forward-Scan Sonar Imagery with Sea-Surface Multipath

链接: https://arxiv.org/abs/2409.06815
作者: Yuhan Liu,Shahriar Negaharipour
关键词-EN: underwater object modeling, forward-scan sonar images, air-water interface, underwater object, object image formed
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Copyright 2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

点击查看摘要

Abstract:We propose an optimization technique for 3-D underwater object modeling from 2-D forward-scan sonar images at known poses. A key contribution, for objects imaged in the proximity of the sea surface, is to resolve the multipath artifacts due to the air-water interface. Here, the object image formed by the direct target backscatter is almost always corrupted by the ghost and sometimes by the mirror components (generated by the multipath propagation). Assuming a planar air-water interface, we model, localize, and discard the corrupted object region within each view, thus avoiding the distortion of recovered 3-D shape. Additionally, complementary visual cues from the boundary of the mirror component, distinct at suitable sonar poses, are employed to enhance the 3-D modeling accuracy. The optimization is implemented as iterative shape adjustment by displacing the vertices of triangular patches in the 3-D surface mesh model, in order to minimize the discrepancy between the data and synthesized views of the 3-D object model. To this end, we first determine 2-D motion fields that align the object regions in the data and synthesized views, then calculate the 3-D motion of triangular patch centers, and finally the model vertices. The 3-D model is initialized with the solution of an earlier space carving method applied to the same data. The same parameters are applied in various experiments with 2 real data sets, mixed real-synthetic data set, and computer-generated data guided by general findings from a real experiment, to explore the impact of non-flat air-water interface. The results confirm the generation of a refined 3-D model in about half-dozen iterations.

[CV-72] DetailCLIP: Detail-Oriented CLIP for Fine-Grained Tasks

链接: https://arxiv.org/abs/2409.06809
作者: Amin Karimi Monsefi,Kishore Prakash Sailaja,Ali Alilooee,Ser-Nam Lim,Rajiv Ramnath
关键词-EN: contrastive learning-based vision-language, Detail-Oriented CLIP, handling detail-oriented, address the limitations, limitations of contrastive
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In this paper, we introduce DetailCLIP: A Detail-Oriented CLIP to address the limitations of contrastive learning-based vision-language models, particularly CLIP, in handling detail-oriented and fine-grained tasks like segmentation. While CLIP and its variants excel in the global alignment of image and text representations, they often struggle to capture the fine-grained details necessary for precise segmentation. To overcome these challenges, we propose a novel framework that employs patch-level comparison of self-distillation and pixel-level reconstruction losses, enhanced with an attention-based token removal mechanism. This approach selectively retains semantically relevant tokens, enabling the model to focus on the image’s critical regions aligned with the specific functions of our model, including textual information processing, patch comparison, and image reconstruction, ensuring that the model learns high-level semantics and detailed visual features. Our experiments demonstrate that DetailCLIP surpasses existing CLIP-based and traditional self-supervised learning (SSL) models in segmentation accuracy and exhibits superior generalization across diverse datasets. DetailCLIP represents a significant advancement in vision-language modeling, offering a robust solution for tasks that demand high-level semantic understanding and detailed feature extraction. this https URL.
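A rough sketch of how a detail-oriented objective of this kind can be assembled: an attention-based top-k token selection, a patch-level self-distillation term on the kept tokens, and a pixel reconstruction term. The selection rule, loss forms, and weights are assumptions and not DetailCLIP's exact formulation.

```python
import torch
import torch.nn.functional as F

def detail_style_objective(student_patches, teacher_patches,
                           recon_pixels, target_pixels,
                           attn_scores, keep_ratio=0.5,
                           w_distill=1.0, w_recon=1.0):
    """Illustrative combination of a patch-level self-distillation term and a
    pixel reconstruction term, restricted to the most-attended tokens.
    Shapes, weights, and the top-k selection rule are assumptions."""
    k = max(1, int(keep_ratio * attn_scores.size(1)))
    keep = attn_scores.topk(k, dim=1).indices                      # (B, k)
    idx = keep.unsqueeze(-1).expand(-1, -1, student_patches.size(-1))
    s = torch.gather(student_patches, 1, idx)                      # kept student tokens
    t = torch.gather(teacher_patches, 1, idx)                      # kept teacher tokens
    distill = F.kl_div(F.log_softmax(s, dim=-1),
                       F.softmax(t.detach(), dim=-1), reduction='batchmean')
    recon = F.mse_loss(recon_pixels, target_pixels)
    return w_distill * distill + w_recon * recon

B, L, D = 2, 196, 512
loss = detail_style_objective(torch.randn(B, L, D), torch.randn(B, L, D),
                              torch.randn(B, 3, 224, 224), torch.randn(B, 3, 224, 224),
                              torch.rand(B, L))
```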

[CV-73] Human Motion Synthesis: A Diffusion Approach for Motion Stitching and In-Betweening

链接: https://arxiv.org/abs/2409.06791
作者: Michael Adewole,Oluwaseyi Giwa,Favour Nerrise,Martins Osifeko,Ajibola Oyedeji
关键词-EN: Human motion generation, important area, area of research, Human motion, Frechet Inception Distance
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: 12 pages, 5 figures, and 11 equations

点击查看摘要

Abstract:Human motion generation is an important area of research in many fields. In this work, we tackle the problem of motion stitching and in-betweening. Current methods either require manual efforts, or are incapable of handling longer sequences. To address these challenges, we propose a diffusion model with a transformer-based denoiser to generate realistic human motion. Our method demonstrated strong performance in generating in-betweening sequences, transforming a variable number of input poses into smooth and realistic motion sequences consisting of 75 frames at 15 fps, resulting in a total duration of 5 seconds. We present the performance evaluation of our method using quantitative metrics such as Frechet Inception Distance (FID), Diversity, and Multimodality, along with visual assessments of the generated outputs.

[CV-74] gsplat: An Open-Source Library for Gaussian Splatting

链接: https://arxiv.org/abs/2409.06765
作者: Vickie Ye,Ruilong Li,Justin Kerr,Matias Turkulainen,Brent Yi,Zhuoyang Pan,Otto Seiskari,Jianbo Ye,Jeffrey Hu,Matthew Tancik,Angjoo Kanazawa
关键词-EN: Gaussian Splatting methods, developing Gaussian Splatting, Gaussian Splatting, Gaussian Splatting models, Splatting methods
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 17 pages, 2 figures, JMLR MLOSS

点击查看摘要

Abstract:gsplat is an open-source library designed for training and developing Gaussian Splatting methods. It features a front-end with Python bindings compatible with the PyTorch library and a back-end with highly optimized CUDA kernels. gsplat offers numerous features that enhance the optimization of Gaussian Splatting models, which include optimization improvements for speed, memory, and convergence times. Experimental results demonstrate that gsplat achieves up to 10% less training time and 4x less memory than the original implementation. Utilized in several research projects, gsplat is actively maintained on GitHub. Source code is available at this https URL under Apache License 2.0. We welcome contributions from the open-source community.

[CV-75] Modeling Image Tone Dichotomy with the Power Function

链接: https://arxiv.org/abs/2409.06764
作者: Axel Martinez,Gustavo Olague,Emilio Hernandez
关键词-EN: illumination modeling based, image illumination modeling, power function, primary purpose, present the concept
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: 49 pages, 11 figures and 36 references

点击查看摘要

Abstract:The primary purpose of this paper is to present the concept of dichotomy in image illumination modeling based on the power function. In particular, we review several mathematical properties of the power function to identify the limitations and propose a new mathematical model capable of abstracting illumination dichotomy. The simplicity of the equation opens new avenues for classical and modern image analysis and processing. The article provides practical and illustrative image examples to explain how the new model manages dichotomy in image perception. The article shows dichotomy image space as a viable way to extract rich information from images despite poor contrast linked to tone, lightness, and color perception. Moreover, a comparison with state-of-the-art methods in image enhancement provides evidence of the method’s value.
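Since the model is built on the power function, a minimal NumPy illustration of power-law tone mapping helps make the dichotomy concrete: exponents below one brighten shadows while exponents above one compress highlights. The paper's actual model is more elaborate than this stand-in.

```python
import numpy as np

def power_tone(image, gamma):
    """Power-law (gamma) mapping on a [0, 1] image: gamma < 1 brightens dark
    regions, gamma > 1 darkens bright ones -- the two sides of the tone
    dichotomy the paper builds on (the exact model in the paper may differ)."""
    image = np.clip(image, 0.0, 1.0)
    return image ** gamma

img = np.random.rand(64, 64)          # stand-in for a normalized grayscale image
bright = power_tone(img, 0.5)         # expands shadow detail
dark = power_tone(img, 2.0)           # compresses highlights
```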

[CV-76] Feedback-based Modal Mutual Search for Attacking Vision-Language Pre-training Models

链接: https://arxiv.org/abs/2409.06726
作者: Renhua Ding,Xinze Zhang,Xiao Yang,Kun He
关键词-EN: achieved remarkable progress, attacking VLP models, vision-language pre-training, achieved remarkable, remarkable progress
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 9 pages

点击查看摘要

Abstract:Although vision-language pre-training (VLP) models have achieved remarkable progress on cross-modal tasks, they remain vulnerable to adversarial attacks. Using data augmentation and cross-modal interactions to generate transferable adversarial examples on surrogate models, transfer-based black-box attacks have become the mainstream methods in attacking VLP models, as they are more practical in real-world scenarios. However, their transferability may be limited due to the differences on feature representation across different models. To this end, we propose a new attack paradigm called Feedback-based Modal Mutual Search (FMMS). FMMS introduces a novel modal mutual loss (MML), aiming to push away the matched image-text pairs while randomly drawing mismatched pairs closer in feature space, guiding the update directions of the adversarial examples. Additionally, FMMS leverages the target model feedback to iteratively refine adversarial examples, driving them into the adversarial region. To our knowledge, this is the first work to exploit target model feedback to explore multi-modality adversarial boundaries. Extensive empirical evaluations on Flickr30K and MSCOCO datasets for image-text matching tasks show that FMMS significantly outperforms the state-of-the-art baselines.
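To make the modal mutual loss idea tangible, the hedged sketch below lowers the similarity of matched image-text pairs while raising the similarity to randomly drawn mismatched texts; the margin form is an assumption rather than FMMS's exact MML.

```python
import torch
import torch.nn.functional as F

def modal_mutual_loss(img_feats, txt_feats, margin=0.2):
    """Illustration of the idea behind a modal mutual loss for adversarial
    attacks: decrease the similarity of matched image-text pairs while
    increasing similarity to randomly drawn mismatched texts. The margin form
    is an assumption, not the exact MML used by FMMS."""
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)
    matched = (img * txt).sum(-1)                        # cosine sim of true pairs
    perm = torch.randperm(txt.size(0), device=txt.device)
    mismatched = (img * txt[perm]).sum(-1)               # sim of randomly drawn pairs
    # Attack objective: matched similarity should fall below mismatched by a margin
    return F.relu(matched - mismatched + margin).mean()

loss = modal_mutual_loss(torch.randn(8, 512), torch.randn(8, 512))
```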

[CV-77] Quantized neural network for complex hologram generation

链接: https://arxiv.org/abs/2409.06711
作者: Yutaka Endo,Minoru Oikawa,Timothy D. Wilkinson,Tomoyoshi Shimobaba,Tomoyoshi Ito
关键词-EN: augmented reality displays, reality displays, head-up displays, Computer-generated holography, promising technology
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注: 10 pages, 2 figures

点击查看摘要

Abstract:Computer-generated holography (CGH) is a promising technology for augmented reality displays, such as head-mounted or head-up displays. However, its high computational demand makes it impractical for implementation. Recent efforts to integrate neural networks into CGH have successfully accelerated computing speed, demonstrating the potential to overcome the trade-off between computational cost and image quality. Nevertheless, deploying neural network-based CGH algorithms on computationally limited embedded systems requires more efficient models with lower computational cost, memory footprint, and power consumption. In this study, we developed a lightweight model for complex hologram generation by introducing neural network quantization. Specifically, we built a model based on tensor holography and quantized it from 32-bit floating-point precision (FP32) to 8-bit integer precision (INT8). Our performance evaluation shows that the proposed INT8 model achieves hologram quality comparable to that of the FP32 model while reducing the model size by approximately 70% and increasing the speed fourfold. Additionally, we implemented the INT8 model on a system-on-module to demonstrate its deployability on embedded platforms and high power efficiency.
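As a small, hedged illustration of FP32-to-INT8 conversion, the snippet below applies PyTorch's post-training dynamic quantization to a stand-in multilayer perceptron; the actual tensor-holography model and the paper's quantization pipeline (which may be static rather than dynamic) differ.

```python
import torch
import torch.nn as nn

# Stand-in for a hologram-generation network; the real tensor-holography model differs.
model_fp32 = nn.Sequential(
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 2048),          # e.g. real + imaginary output channels
)

# Post-training dynamic quantization of the linear layers to INT8 weights.
model_int8 = torch.quantization.quantize_dynamic(
    model_fp32, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 1024)
with torch.no_grad():
    y = model_int8(x)
print(y.shape)  # torch.Size([1, 2048])
```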

[CV-78] McGrids: Monte Carlo-Driven Adaptive Grids for Iso-Surface Extraction

链接: https://arxiv.org/abs/2409.06710
作者: Daxuan Ren,Hezi Shi,Jianmin Zheng,Jianfei Cai
关键词-EN: vision and graphics, applications of computer, computer vision, Iso-surface extraction, Iso-surface
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注:

点击查看摘要

Abstract:Iso-surface extraction from an implicit field is a fundamental process in various applications of computer vision and graphics. When dealing with geometric shapes with complicated geometric details, many existing algorithms suffer from high computational costs and memory usage. This paper proposes McGrids, a novel approach to improve the efficiency of iso-surface extraction. The key idea is to construct adaptive grids for iso-surface extraction rather than using a simple uniform grid as prior art does. Specifically, we formulate the problem of constructing adaptive grids as a probability sampling problem, which is then solved by Monte Carlo process. We demonstrate McGrids’ capability with extensive experiments from both analytical SDFs computed from surface meshes and learned implicit fields from real multiview images. The experiment results show that our McGrids can significantly reduce the number of implicit field queries, resulting in significant memory reduction, while producing high-quality meshes with rich geometric details.

[CV-79] Gating Syn-to-Real Knowledge for Pedestrian Crossing Prediction in Safe Driving

链接: https://arxiv.org/abs/2409.06707
作者: Jie Bai,Jianwu Fang,Yisheng Lv,Chen Lv,Jianru Xue,Zhengguo Li
关键词-EN: driving scenes plays, Pedestrian Crossing Prediction, Pedestrian Crossing, Crossing Prediction, intelligent vehicles
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: under review by TITS

点击查看摘要

Abstract:Pedestrian Crossing Prediction (PCP) in driving scenes plays a critical role in ensuring the safe operation of intelligent vehicles. Due to the limited observations of pedestrian crossing behaviors in typical situations, recent studies have begun to leverage synthetic data with flexible variation to boost prediction performance, employing domain adaptation frameworks. However, different domain knowledge has distinct cross-domain distribution gaps, which necessitates suitable domain knowledge adaption ways for PCP tasks. In this work, we propose a Gated Syn-to-Real Knowledge transfer approach for PCP (Gated-S2R-PCP), which has two aims: 1) designing the suitable domain adaptation ways for different kinds of crossing-domain knowledge, and 2) transferring suitable knowledge for specific situations with gated knowledge fusion. Specifically, we design a framework that contains three domain adaption methods including style transfer, distribution approximation, and knowledge distillation for various information, such as visual, semantic, depth, location, etc. A Learnable Gated Unit (LGU) is employed to fuse suitable cross-domain knowledge to boost pedestrian crossing prediction. We construct a new synthetic benchmark S2R-PCP-3181 with 3181 sequences (489,740 frames) which contains the pedestrian locations, RGB frames, semantic images, and depth images. With the synthetic S2R-PCP-3181, we transfer the knowledge to two real challenging datasets of PIE and JAAD, and superior PCP performance is obtained to the state-of-the-art methods.
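The gated fusion idea can be sketched with a toy Learnable Gated Unit that predicts softmax weights over several cross-domain feature streams and sums them; the dimensions and gate design here are assumptions about the paper's LGU.

```python
import torch
import torch.nn as nn

class LearnableGatedUnit(nn.Module):
    """Toy gated fusion of several cross-domain feature streams: a softmax gate,
    predicted from the concatenated inputs, weights each stream before summing.
    Dimensions and the gate design are assumptions about the LGU."""
    def __init__(self, dim=256, num_streams=3):
        super().__init__()
        self.gate = nn.Linear(dim * num_streams, num_streams)

    def forward(self, streams):                  # list of (B, dim) tensors
        stacked = torch.stack(streams, dim=1)    # (B, S, dim)
        weights = torch.softmax(self.gate(torch.cat(streams, dim=-1)), dim=-1)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)   # (B, dim)

fused = LearnableGatedUnit()([torch.randn(4, 256) for _ in range(3)])
```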

[CV-80] HSR-KAN: Efficient Hyperspectral Image Super-Resolution via Kolmogorov-Arnold Networks

链接: https://arxiv.org/abs/2409.06705
作者: Baisong Li,Xingwang Wang,Haixiao Xu
关键词-EN: visual tasks due, Hyperspectral images, high-resolution hyperspectral images, great potential, visual tasks
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Hyperspectral images (HSIs) have great potential in various visual tasks due to their rich spectral information. However, obtaining high-resolution hyperspectral images remains challenging due to limitations of physical imaging. Inspired by Kolmogorov-Arnold Networks (KANs), we propose an efficient HSI super-resolution (HSI-SR) model to fuse a low-resolution HSI (LR-HSI) and a high-resolution multispectral image (HR-MSI), yielding a high-resolution HSI (HR-HSI). To achieve the effective integration of spatial information from HR-MSI, we design a fusion module based on KANs, called KAN-Fusion. Further inspired by the channel attention mechanism, we design a spectral channel attention module called KAN Channel Attention Block (KAN-CAB) for post-fusion feature extraction. As a channel attention module integrated with KANs, KAN-CAB not only enhances the fine-grained adjustment ability of deep networks, enabling networks to accurately simulate details of spectral sequences and spatial textures, but also effectively avoid Curse of Dimensionality (COD). Extensive experiments show that, compared to current state-of-the-art (SOTA) HSI-SR methods, proposed HSR-KAN achieves the best performance in terms of both qualitative and quantitative assessments. Our code is available at: this https URL.

[CV-81] Controllable retinal image synthesis using conditional StyleGAN and latent space manipulation for improved diagnosis and grading of diabetic retinopathy

链接: https://arxiv.org/abs/2409.07422
作者: Somayeh Pakdelmoez(1),Saba Omidikia(1),Seyyed Ali Seyyedsalehi(1),Seyyede Zohreh Seyyedsalehi(2) ((1) Department of Biomedical Engineering, Amirkabir University of Technology, Tehran, Iran, (2) Department of Biomedical Engineering, Faculty of Health, Tehran Medical Sciences, Islamic Azad University, Tehran, Iran)
关键词-EN: diabetes mellitus characterized, Diabetic retinopathy, retinal tissue, consequence of diabetes, diabetes mellitus
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 30 pages, 17 figures

点击查看摘要

Abstract:Diabetic retinopathy (DR) is a consequence of diabetes mellitus characterized by vascular damage within the retinal tissue. Timely detection is paramount to mitigate the risk of vision loss. However, training robust grading models is hindered by a shortage of annotated data, particularly for severe cases. This paper proposes a framework for controllably generating high-fidelity and diverse DR fundus images, thereby improving classifier performance in DR grading and detection. We achieve comprehensive control over DR severity and visual features (optic disc, vessel structure, lesion areas) within generated images solely through a conditional StyleGAN, eliminating the need for feature masks or auxiliary networks. Specifically, leveraging the SeFa algorithm to identify meaningful semantics within the latent space, we manipulate the DR images generated conditionally on grades, further enhancing the dataset diversity. Additionally, we propose a novel, effective SeFa-based data augmentation strategy, helping the classifier focus on discriminative regions while ignoring redundant features. Using this approach, a ResNet50 model trained for DR detection achieves 98.09% accuracy, 99.44% specificity, 99.45% precision, and an F1-score of 98.09%. Moreover, incorporating synthetic images generated by conditional StyleGAN into ResNet50 training for DR grading yields 83.33% accuracy, a quadratic kappa score of 87.64%, 95.67% specificity, and 72.24% precision. Extensive experiments conducted on the APTOS 2019 dataset demonstrate the exceptional realism of the generated images and the superior performance of our classifier compared to recent studies.
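For readers unfamiliar with SeFa-style latent manipulation, the sketch below recovers semantic directions as the top eigenvectors of A^T A for a randomly generated stand-in generator layer weight A, then shifts a latent code along one direction; the real pipeline operates on a trained conditional StyleGAN.

```python
import numpy as np

def sefa_directions(weight, num_directions=5):
    """Closed-form SeFa-style factorization: semantic directions are the top
    eigenvectors of A^T A, where A is the weight of the first layer acting on
    the latent code (here a random stand-in for a StyleGAN layer weight)."""
    ata = weight.T @ weight
    eigvals, eigvecs = np.linalg.eigh(ata)            # eigenvalues in ascending order
    return eigvecs[:, ::-1][:, :num_directions]       # top directions, (latent_dim, k)

latent_dim, out_dim = 512, 1024
A = np.random.randn(out_dim, latent_dim)              # stand-in layer weight
dirs = sefa_directions(A)

z = np.random.randn(1, latent_dim)                    # a latent code
edited = z + 3.0 * dirs[:, 0]                         # move along the first direction
```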

[CV-82] Efficient One-Step Diffusion Refinement for Snapshot Compressive Imaging

链接: https://arxiv.org/abs/2409.07417
作者: Yunzhen Wang,Haijin Zeng,Shaoguang Huang,Hongyu Chen,Hongyan Zhang
关键词-EN: Aperture Snapshot Spectral, Coded Aperture Snapshot, Snapshot Spectral Imaging, three-dimensional multispectral images, Coded Aperture
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Coded Aperture Snapshot Spectral Imaging (CASSI) is a crucial technique for capturing three-dimensional multispectral images (MSIs) through the complex inverse task of reconstructing these images from coded two-dimensional measurements. Current state-of-the-art methods, predominantly end-to-end, face limitations in reconstructing high-frequency details and often rely on constrained datasets like KAIST and CAVE, resulting in models with poor generalizability. In response to these challenges, this paper introduces a novel one-step Diffusion Probabilistic Model within a self-supervised adaptation framework for Snapshot Compressive Imaging (SCI). Our approach leverages a pretrained SCI reconstruction network to generate initial predictions from two-dimensional measurements. Subsequently, a one-step diffusion model produces high-frequency residuals to enhance these initial predictions. Additionally, acknowledging the high costs associated with collecting MSIs, we develop a self-supervised paradigm based on the Equivariant Imaging (EI) framework. Experimental results validate the superiority of our model compared to previous methods, showcasing its simplicity and adaptability to various end-to-end or unfolding techniques.

[CV-83] Quantifying Knee Cartilage Shape and Lesion: From Image to Metrics

链接: https://arxiv.org/abs/2409.07361
作者: Yongcheng Yao,Weitian Chen
关键词-EN: potential imaging biomarkers, imaging feature extraction, potential imaging, imaging biomarkers, knee articular cartilage
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: The paper will be in the conference proceedings of AMAI 2024. See the conference website: this https URL

点击查看摘要

Abstract:Imaging features of knee articular cartilage have been shown to be potential imaging biomarkers for knee osteoarthritis. Despite recent methodological advancements in image analysis techniques like image segmentation, registration, and domain-specific image computing algorithms, only a few works focus on building fully automated pipelines for imaging feature extraction. In this study, we developed a deep-learning-based medical image analysis application for knee cartilage morphometrics, CartiMorph Toolbox (CMT). We proposed a 2-stage joint template learning and registration network, CMT-reg. We trained the model using the OAI-ZIB dataset and assessed its performance in template-to-image registration. The CMT-reg demonstrated competitive results compared to other state-of-the-art models. We integrated the proposed model into an automated pipeline for the quantification of cartilage shape and lesion (full-thickness cartilage loss, specifically). The toolbox provides a comprehensive, user-friendly solution for medical image analysis and data visualization. The software and models are available at this https URL .

[CV-84] BLS-GAN: A Deep Layer Separation Framework for Eliminating Bone Overlap in Conventional Radiographs

链接: https://arxiv.org/abs/2409.07304
作者: Haolin Wang,Yafei Ou,Prasoon Ambalathankandy,Gen Ota,Pengyu Dai,Masayuki Ikebe,Kenji Suzuki,Tamotsu Kamishima
关键词-EN: bone layer separation, bone layer, bone layer images, conventional radiographs, layer separation
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Conventional radiography is the widely used imaging technology in diagnosing, monitoring, and prognosticating musculoskeletal (MSK) diseases because of its easy availability, versatility, and cost-effectiveness. In conventional radiographs, bone overlaps are prevalent, and can impede the accurate assessment of bone characteristics by radiologists or algorithms, posing significant challenges to conventional and computer-aided diagnoses. This work initiated the study of a challenging scenario - bone layer separation in conventional radiographs, in which separate overlapped bone regions enable the independent assessment of the bone characteristics of each bone layer and lay the groundwork for MSK disease diagnosis and its automation. This work proposed a Bone Layer Separation GAN (BLS-GAN) framework that can produce high-quality bone layer images with reasonable bone characteristics and texture. This framework introduced a reconstructor based on conventional radiography imaging principles, which achieved efficient reconstruction and mitigates the recurrent calculations and training instability issues caused by soft tissue in the overlapped regions. Additionally, pre-training with synthetic images was implemented to enhance the stability of both the training process and the results. The generated images passed the visual Turing test, and improved performance in downstream tasks. This work affirms the feasibility of extracting bone layer images from conventional radiographs, which holds promise for leveraging bone layer separation technology to facilitate more comprehensive analytical research in MSK diagnosis, monitoring, and prognosis. Code and dataset will be made available.

[CV-85] 3DGCQA: A Quality Assessment Database for 3D AI-Generated Contents

链接: https://arxiv.org/abs/2409.07236
作者: Yingjie Zhou,Zicheng Zhang,Farong Wen,Jun Jia,Yanwei Jiang,Xiaohong Liu,Xiongkuo Min,Guangtao Zhai
关键词-EN: accelerating design timelines, reducing production costs, quality assessment, offers advantages, design timelines
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Although 3D generated content (3DGC) offers advantages in reducing production costs and accelerating design timelines, its quality often falls short when compared to 3D professionally generated content. Common quality issues frequently affect 3DGC, highlighting the importance of timely and effective quality assessment. Such evaluations not only ensure a higher standard of 3DGCs for end-users but also provide critical insights for advancing generative technologies. To address existing gaps in this domain, this paper introduces a novel 3DGC quality assessment dataset, 3DGCQA, built using 7 representative Text-to-3D generation methods. During the dataset’s construction, 50 fixed prompts are utilized to generate contents across all methods, resulting in the creation of 313 textured meshes that constitute the 3DGCQA dataset. The visualization intuitively reveals the presence of 6 common distortion categories in the generated 3DGCs. To further explore the quality of the 3DGCs, subjective quality assessment is conducted by evaluators, whose ratings reveal significant variation in quality across different generation methods. Additionally, several objective quality assessment algorithms are tested on the 3DGCQA dataset. The results expose limitations in the performance of existing algorithms and underscore the need for developing more specialized quality assessment methods. To provide a valuable resource for future research and development in 3D content generation and quality assessment, the dataset has been open-sourced in this https URL.

[CV-86] AC-IND: Sparse CT reconstruction based on attenuation coefficient estimation and implicit neural distribution

链接: https://arxiv.org/abs/2409.07171
作者: Wangduo Xie,Richard Schoonhoven,Tristan van Leeuwen,Matthew B. Blaschko
关键词-EN: Computed tomography, industrial nondestructive testing, plays a crucial, crucial role, nondestructive testing
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 12 pages

点击查看摘要

Abstract:Computed tomography (CT) reconstruction plays a crucial role in industrial nondestructive testing and medical diagnosis. Sparse view CT reconstruction aims to reconstruct high-quality CT images while only using a small number of projections, which helps to improve the detection speed of industrial assembly lines and is also meaningful for reducing radiation in medical scenarios. Sparse CT reconstruction methods based on implicit neural representations (INRs) have recently shown promising performance, but still produce artifacts because of the difficulty of obtaining useful prior information. In this work, we incorporate a powerful prior: the total number of material categories of objects. To utilize the prior, we design AC-IND, a self-supervised method based on Attenuation Coefficient Estimation and Implicit Neural Distribution. Specifically, our method first transforms the traditional INR from scalar mapping to probability distribution mapping. Then we design a compact attenuation coefficient estimator initialized with values from a rough reconstruction and fast segmentation. Finally, our algorithm finishes the CT reconstruction by jointly optimizing the estimator and the generated distribution. Through experiments, we find that our method not only outperforms the comparative methods in sparse CT reconstruction but also can automatically generate semantic segmentation maps.

[CV-87] Deep Learning Techniques for Hand Vein Biometrics: A Comprehensive Review

链接: https://arxiv.org/abs/2409.07128
作者: Mustapha Hemis,Hamza Kheddar,Sami Bourouis,Nasir Saleem
关键词-EN: garnered significant attention, hand vein biometrics, hand vein, hand vein recognition, vein
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Biometric authentication has garnered significant attention as a secure and efficient method of identity verification. Among the various modalities, hand vein biometrics, including finger vein, palm vein, and dorsal hand vein recognition, offer unique advantages due to their high accuracy, low susceptibility to forgery, and non-intrusiveness. The vein patterns within the hand are highly complex and distinct for each individual, making them an ideal biometric identifier. Additionally, hand vein recognition is contactless, enhancing user convenience and hygiene compared to other modalities such as fingerprint or iris recognition. Furthermore, the veins are internally located, rendering them less susceptible to damage or alteration, thus enhancing the security and reliability of the biometric system. The combination of these factors makes hand vein biometrics a highly effective and secure method for identity verification. This review paper delves into the latest advancements in deep learning techniques applied to finger vein, palm vein, and dorsal hand vein recognition. It encompasses all essential fundamentals of hand vein biometrics, summarizes publicly available datasets, and discusses state-of-the-art metrics used for evaluating the three modes. Moreover, it provides a comprehensive overview of suggested approaches for finger, palm, dorsal, and multimodal vein techniques, offering insights into the best performance achieved, data augmentation techniques, and effective transfer learning methods, along with associated pretrained deep learning models. Additionally, the review addresses research challenges faced and outlines future directions and perspectives, encouraging researchers to enhance existing methods and propose innovative techniques.

[CV-88] Attention Down-Sampling Transformer Relative Ranking and Self-Consistency for Blind Image Quality Assessment ICIP

链接: https://arxiv.org/abs/2409.07115
作者: Mohammed Alsaafin,Musab Alsheikh,Saeed Anwar,Muhammad Usman
关键词-EN: image quality assessment, addresses estimating image, no-reference image quality, image quality, quality assessment
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted in International Conference on Image Processing (ICIP)

点击查看摘要

Abstract:The no-reference image quality assessment is a challenging domain that addresses estimating image quality without the original reference. We introduce an improved mechanism to extract local and non-local information from images via different transformer encoders and CNNs. The utilization of Transformer encoders aims to mitigate locality bias and generate a non-local representation by sequentially processing CNN features, which inherently capture local visual structures. Establishing a stronger connection between subjective and objective assessments is achieved through sorting within batches of images based on relative distance information. A self-consistency approach to self-supervision is presented, explicitly addressing the degradation of no-reference image quality assessment (NR-IQA) models under equivariant transformations. Our approach ensures model robustness by maintaining consistency between an image and its horizontally flipped equivalent. Through empirical evaluation of five popular image quality assessment datasets, the proposed model outperforms alternative algorithms in the context of no-reference image quality assessment datasets, especially on smaller datasets. Codes are available at this https URL.
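The self-consistency idea can be illustrated with a small regularizer that penalizes disagreement between the quality score of an image and of its horizontally flipped copy; the score head and loss weighting below are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def flip_consistency_loss(model, images):
    """Self-consistency regularizer: the quality predicted for an image and for
    its horizontally flipped copy should agree. The score head and weighting
    are assumptions about the paper's exact formulation."""
    scores = model(images)
    scores_flipped = model(torch.flip(images, dims=[-1]))   # flip the width axis
    return F.mse_loss(scores, scores_flipped)

# Toy usage with a stand-in scoring network
net = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 1))
loss = flip_consistency_loss(net, torch.randn(4, 3, 32, 32))
```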

[CV-89] Fast Medical Shape Reconstruction via Meta-learned Implicit Neural Representations

链接: https://arxiv.org/abs/2409.07100
作者: Gaia Romana De Paolis,Dimitrios Lenis,Johannes Novotny,Maria Wimmer,Astrid Berg,Theresa Neubauer,Philip Matthias Winter,David Major,Ariharasudhan Muthusami,Gerald Schröcker,Martin Mienkina,Katja Bühler
关键词-EN: Efficient and fast, anatomical structures plays, clinical practice, structures plays, plays a crucial
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Efficient and fast reconstruction of anatomical structures plays a crucial role in clinical practice. Minimizing retrieval and processing times not only potentially enhances swift response and decision-making in critical scenarios but also supports interactive surgical planning and navigation. Recent methods attempt to solve the medical shape reconstruction problem by utilizing implicit neural functions. However, their performance suffers in terms of generalization and computation time, a critical metric for real-time applications. To address these challenges, we propose to leverage meta-learning to improve the network parameters initialization, reducing inference time by an order of magnitude while maintaining high accuracy. We evaluate our approach on three public datasets covering different anatomical shapes and modalities, namely CT and MRI. Our experimental results show that our model can handle various input configurations, such as sparse slices with different orientations and spacings. Additionally, we demonstrate that our method exhibits strong transferable capabilities in generalizing to shape domains unobserved at training time.

[CV-90] Deep intra-operative illumination calibration of hyperspectral cameras MICCAI2024

链接: https://arxiv.org/abs/2409.07094
作者: Alexander Baumann,Leonardo Ayala,Alexander Studier-Fischer,Jan Sellner,Berkin Özdemir,Karl-Friedrich Kowalewski,Slobodan Ilic,Silvia Seidlitz,Lena Maier-Hein
关键词-EN: potential surgical applications, imaging modality, lighting conditions, Hyperspectral imaging, promising novel imaging
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Oral at MICCAI 2024

点击查看摘要

Abstract:Hyperspectral imaging (HSI) is emerging as a promising novel imaging modality with various potential surgical applications. Currently available cameras, however, suffer from poor integration into the clinical workflow because they require the lights to be switched off, or the camera to be manually recalibrated as soon as lighting conditions change. Given this critical bottleneck, the contribution of this paper is threefold: (1) We demonstrate that dynamically changing lighting conditions in the operating room dramatically affect the performance of HSI applications, namely physiological parameter estimation, and surgical scene segmentation. (2) We propose a novel learning-based approach to automatically recalibrating hyperspectral images during surgery and show that it is sufficiently accurate to replace the tedious process of white reference-based recalibration. (3) Based on a total of 742 HSI cubes from a phantom, porcine models, and rats we show that our recalibration method not only outperforms previously proposed methods, but also generalizes across species, lighting conditions, and image processing tasks. Due to its simple workflow integration as well as high accuracy, speed, and generalization capabilities, our method could evolve as a central component in clinical surgical HSI.

[CV-91] CWT-Net: Super-resolution of Histopathology Images Using a Cross-scale Wavelet-based Transformer

链接: https://arxiv.org/abs/2409.07092
作者: Feiyang Jia,Zhineng Chen,Ziying Song,Lin Liu,Caiyan Jia
关键词-EN: medical imaging, quality of low-resolution, widely applied, applied in medical, low-resolution images
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Super-resolution (SR) aims to enhance the quality of low-resolution images and has been widely applied in medical imaging. We found that the design principles of most existing methods are influenced by SR tasks based on real-world images and do not take into account the significance of the multi-level structure in pathological images, even if they can achieve respectable objective metric evaluations. In this work, we delve into two super-resolution working paradigms and propose a novel network called CWT-Net, which leverages cross-scale image wavelet transform and Transformer architecture. Our network consists of two branches: one dedicated to learning super-resolution and the other to high-frequency wavelet features. To generate high-resolution histopathology images, the Transformer module shares and fuses features from both branches at various stages. Notably, we have designed a specialized wavelet reconstruction module to effectively enhance the wavelet domain features and enable the network to operate in different modes, allowing for the introduction of additional relevant information from cross-scale images. Our experimental results demonstrate that our model significantly outperforms state-of-the-art methods in both performance and visualization evaluations and can substantially boost the accuracy of image diagnostic networks.

[CV-92] EVENet: Evidence-based Ensemble Learning for Uncertainty-aware Brain Parcellation Using Diffusion MRI

链接: https://arxiv.org/abs/2409.07020
作者: Chenjun Li,Dian Yang,Shun Yao,Shuyue Wang,Ye Wu,Le Zhang,Qiannuo Li,Kang Ik Kevin Cho,Johanna Seitz-Holland,Lipeng Ning,Jon Haitz Legarreta,Yogesh Rathi,Carl-Fredrik Westin,Lauren J. O’Donnell,Nir A. Sochen,Ofer Pasternak,Fan Zhang
关键词-EN: Evidence-based Ensemble Neural, Ensemble Neural Network, diffusion MRI, Neural Network, Ensemble Neural
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 15 pages, 5 figures

点击查看摘要

Abstract:In this study, we developed an Evidence-based Ensemble Neural Network, namely EVENet, for anatomical brain parcellation using diffusion MRI. The key innovation of EVENet is the design of an evidential deep learning framework to quantify predictive uncertainty at each voxel during a single inference. Using EVENet, we obtained accurate parcellation and uncertainty estimates across different datasets from healthy and clinical populations and with different imaging acquisitions. The overall network includes five parallel subnetworks, where each is dedicated to learning the FreeSurfer parcellation for a certain diffusion MRI parameter. An evidence-based ensemble methodology is then proposed to fuse the individual outputs. We perform experimental evaluations on large-scale datasets from multiple imaging sources, including high-quality diffusion MRI data from healthy adults and clinically diffusion MRI data from participants with various brain diseases (schizophrenia, bipolar disorder, attention-deficit/hyperactivity disorder, Parkinson’s disease, cerebral small vessel disease, and neurosurgical patients with brain tumors). Compared to several state-of-the-art methods, our experimental results demonstrate highly improved parcellation accuracy across the multiple testing datasets despite the differences in dMRI acquisition protocols and health conditions. Furthermore, thanks to the uncertainty estimation, our EVENet approach demonstrates a good ability to detect abnormal brain regions in patients with lesions, enhancing the interpretability and reliability of the segmentation results.
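A compact sketch of the evidential head underlying this kind of model: logits become non-negative evidence, evidence defines Dirichlet parameters, and both expected class probabilities and a per-voxel uncertainty fall out of a single forward pass. The fusion across EVENet's five subnetworks is not shown.

```python
import torch
import torch.nn.functional as F

def evidential_parcellation_outputs(logits):
    """Standard evidential deep-learning head for a C-class voxel classifier:
    non-negative evidence -> Dirichlet parameters -> expected probabilities and
    a per-voxel uncertainty in a single forward pass."""
    evidence = F.softplus(logits)            # (..., C) non-negative evidence
    alpha = evidence + 1.0                   # Dirichlet concentration parameters
    strength = alpha.sum(dim=-1, keepdim=True)
    probs = alpha / strength                 # expected class probabilities
    num_classes = logits.size(-1)
    uncertainty = num_classes / strength     # in (0, 1]; high when evidence is low
    return probs, uncertainty.squeeze(-1)

probs, u = evidential_parcellation_outputs(torch.randn(2, 96))   # e.g. 96 regions
```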

[CV-93] Towards Predicting Temporal Changes in a Patient's Chest X-ray Images based on Electronic Health Records

链接: https://arxiv.org/abs/2409.07012
作者: Daeun Kyung,Junu Kim,Tackeun Kim,Edward Choi
关键词-EN: Chest X-ray imaging, Chest X-ray, X-ray imaging, important diagnostic tool, assess patient conditions
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Chest X-ray imaging (CXR) is an important diagnostic tool used in hospitals to assess patient conditions and monitor changes over time. Generative models, specifically diffusion-based models, have shown promise in generating realistic synthetic X-rays. However, these models mainly focus on conditional generation using single-time-point data, i.e., typically CXRs taken at a specific time with their corresponding reports, limiting their clinical utility, particularly for capturing temporal changes. To address this limitation, we propose a novel framework, EHRXDiff, which predicts future CXR images by integrating previous CXRs with subsequent medical events, e.g., prescriptions, lab measures, etc. Our framework dynamically tracks and predicts disease progression based on a latent diffusion model, conditioned on the previous CXR image and a history of medical events. We comprehensively evaluate the performance of our framework across three key aspects, including clinical consistency, demographic consistency, and visual realism. We demonstrate that our framework generates high-quality, realistic future images that capture potential temporal changes, suggesting its potential for further development as a clinical simulation tool. This could offer valuable insights for patient monitoring and treatment planning in the medical field.

[CV-94] Performance Assessment of Feature Detection Methods for 2-D FS Sonar Imagery

链接: https://arxiv.org/abs/2409.07004
作者: Hitesh Kyatham,Shahriar Negahdaripour,Michael Xu,Xiaomin Lin,Miao Yu,Yiannis Aloimonos
关键词-EN: Underwater robot perception, scientific subsea exploration, Underwater robot, commercial operations, robot perception
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Underwater robot perception is crucial in scientific subsea exploration and commercial operations. The key challenges include non-uniform lighting and poor visibility in turbid environments. High-frequency forward-look sonar cameras address these issues by providing high-resolution imagery at a maximum range of tens of meters, despite complexities posed by a high degree of speckle noise and a lack of color and texture. In particular, robust feature detection is an essential initial step for automated object recognition, localization, navigation, and 3-D mapping. Various local feature detectors developed for RGB images are not well-suited for sonar data. To assess their performances, we evaluate a number of feature detectors using real sonar images from five different sonar devices. Performance metrics such as detection accuracy, false positives, and robustness to variations in target characteristics and sonar devices are applied to analyze the experimental results. The study provides deeper insight into the bottlenecks of feature detection for sonar data and informs the development of more effective methods.
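As a minimal, hedged example of the kind of comparison the study performs, the snippet below runs two OpenCV feature detectors on a stand-in image and counts keypoints; a real evaluation would use actual sonar frames and the paper's accuracy and robustness metrics.

```python
import cv2
import numpy as np

# Quick comparison of two local feature detectors on a (stand-in) sonar frame;
# real evaluations would use actual sonar images and the paper's metrics.
sonar_like = (np.random.rand(256, 256) * 255).astype(np.uint8)

orb = cv2.ORB_create(nfeatures=500)
sift = cv2.SIFT_create()

kp_orb, des_orb = orb.detectAndCompute(sonar_like, None)
kp_sift, des_sift = sift.detectAndCompute(sonar_like, None)

print(f"ORB keypoints: {len(kp_orb)}, SIFT keypoints: {len(kp_sift)}")
```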

[CV-95] RICAU-Net: Residual-block Inspired Coordinate Attention U-Net for Segmentation of Small and Sparse Calcium Lesions in Cardiac CT

链接: https://arxiv.org/abs/2409.06993
作者: Doyoung Park,Jinsoo Kim,Qi Chang,Shuang Leng,Liang Zhong,Lohendran Baskaran
关键词-EN: main coronary arteries, coronary artery disease, vessel-specific Agatston score, Agatston score, coronary heart disease
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 18 pages, 4 figures, 3 tables

点击查看摘要

Abstract:The Agatston score, which is the sum of the calcification in the four main coronary arteries, has been widely used in the diagnosis of coronary artery disease (CAD). However, many studies have emphasized the importance of the vessel-specific Agatston score, as calcification in a specific vessel is significantly correlated with the occurrence of coronary heart disease (CHD). In this paper, we propose the Residual-block Inspired Coordinate Attention U-Net (RICAU-Net), which incorporates coordinate attention in two distinct manners and a customized combo loss function for lesion-specific coronary artery calcium (CAC) segmentation. This approach aims to tackle the high class-imbalance issue associated with small and sparse lesions, particularly for CAC in the left main coronary artery (LM) which is generally small and the scarcest in the dataset due to its anatomical structure. The proposed method was compared with six different methods using Dice score, precision, and recall. Our approach achieved the highest per-lesion Dice scores for all four lesions, especially for CAC in LM compared to other methods. The ablation studies demonstrated the significance of positional information from the coordinate attention and the customized loss function in segmenting small and sparse lesions with a high class-imbalance problem.
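The combo-loss idea for small, sparse lesions can be sketched as a weighted sum of Dice and positively weighted cross-entropy on a binary lesion mask; the weights and form below are generic assumptions, not the paper's customized loss.

```python
import torch
import torch.nn.functional as F

def combo_loss(logits, targets, alpha=0.5, smooth=1.0):
    """Generic Dice + weighted cross-entropy combination for a binary lesion
    mask; the paper's customized combo loss and class weights differ."""
    probs = torch.sigmoid(logits)
    intersection = (probs * targets).sum()
    dice = (2 * intersection + smooth) / (probs.sum() + targets.sum() + smooth)
    # Up-weight the rare positive (lesion) pixels in the cross-entropy term
    pos_weight = torch.tensor([20.0], device=logits.device)
    bce = F.binary_cross_entropy_with_logits(logits, targets, pos_weight=pos_weight)
    return alpha * bce + (1 - alpha) * (1 - dice)

loss = combo_loss(torch.randn(1, 1, 64, 64),
                  torch.randint(0, 2, (1, 1, 64, 64)).float())
```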

[CV-96] Ordinal Learning: Longitudinal Attention Alignment Model for Predicting Time to Future Breast Cancer Events from Mammograms

链接: https://arxiv.org/abs/2409.06887
作者: Xin Wang,Tao Tan,Yuan Gao,Eric Marcus,Luyi Han,Antonio Portaluri,Tianyu Zhang,Chunyao Lu,Xinglong Liang,Regina Beets-Tan,Jonas Teuwen,Ritse Mann
关键词-EN: Precision breast cancer, Precision breast, developing individualized screening, crucial for developing, developing individualized
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Precision breast cancer (BC) risk assessment is crucial for developing individualized screening and prevention. Despite the promising potential of recent mammogram (MG) based deep learning models in predicting BC risk, they mostly overlook the ‘time-to-future-event’ ordering among patients and exhibit limited explorations into how they track history changes in breast tissue, thereby limiting their clinical application. In this work, we propose a novel method, named OA-BreaCR, to precisely model the ordinal relationship of the time to and between BC events while incorporating longitudinal breast tissue changes in a more explainable manner. We validate our method on public EMBED and inhouse datasets, comparing with existing BC risk prediction and time prediction methods. Our ordinal learning method OA-BreaCR outperforms existing methods in both BC risk and time-to-future-event prediction tasks. Additionally, ordinal heatmap visualizations show the model’s attention over time. Our findings underscore the importance of interpretable and precise risk assessment for enhancing BC screening and prevention efforts. The code will be accessible to the public.

[CV-97] Automated Quantification of White Blood Cells in Light Microscopic Images of Injured Skeletal Muscle

链接: https://arxiv.org/abs/2409.06722
作者: Yang Jiao,Hananeh Derakhshan,Barbara St. Pierre Schneider,Emma Regentova,Mei Yang
关键词-EN: White blood cells, White blood, diverse cell types, cell types observed, types observed
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 2 tables, 7 figures, 8 pages

点击查看摘要

Abstract:White blood cells (WBCs) are the most diverse cell types observed in the healing process of injured skeletal muscles. In the course of healing, WBCs exhibit dynamic cellular responses and undergo multiple protein expression changes. The progress of healing can be analyzed by quantifying the number of WBCs or the amount of specific proteins in light microscopic images obtained at different time points after injury. In this paper, we propose an automated quantification and analysis framework for WBCs using light microscopic images of uninjured and injured muscles. The proposed framework is based on the Localized Iterative (LI) Otsu’s threshold method with muscle edge detection and region of interest extraction. Compared with the threshold methods used in ImageJ, the LI Otsu’s threshold method has higher resistance to background areas and achieves better accuracy. Results on CD68-positive cells are presented to demonstrate the effectiveness of the proposed framework.
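A rough stand-in for tile-wise Otsu thresholding using scikit-image is shown below; the paper's Localized Iterative Otsu's method additionally includes iterative refinement, muscle edge detection, and ROI extraction, which are omitted here.

```python
import numpy as np
from skimage.filters import threshold_otsu

def localized_otsu(image, tile=64):
    """Tile-wise Otsu thresholding as a rough stand-in for the paper's
    Localized Iterative Otsu's method (the iterative refinement, muscle edge
    detection, and ROI extraction steps are omitted)."""
    mask = np.zeros_like(image, dtype=bool)
    h, w = image.shape
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            patch = image[y:y + tile, x:x + tile]
            if patch.std() > 0:                      # skip flat background tiles
                mask[y:y + tile, x:x + tile] = patch > threshold_otsu(patch)
    return mask

cells = localized_otsu(np.random.rand(256, 256))     # stand-in microscopy image
```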

[CV-98] Detailed delineation of the fetal brain in diffusion MRI via multi-task learning

链接: https://arxiv.org/abs/2409.06716
作者: Davood Karimi,Camilo Calixto,Haykel Snoussi,Maria Camila Cortes-Albornoz,Clemente Velasco-Annis,Caitlin Rollins,Camilo Jaimes,Ali Gholipour,Simon K. Warfield
关键词-EN: Diffusion-weighted MRI, MRI is increasingly, fetal brain in-utero, fetal brain, study the normal
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

Abstract:Diffusion-weighted MRI is increasingly used to study the normal and abnormal development of fetal brain in-utero. Recent studies have shown that dMRI can offer invaluable insights into the neurodevelopmental processes in the fetal stage. However, because of the low data quality and rapid brain development, reliable analysis of fetal dMRI data requires dedicated computational methods that are currently unavailable. The lack of automated methods for fast, accurate, and reproducible data analysis has seriously limited our ability to tap the potential of fetal brain dMRI for medical and scientific applications. In this work, we developed and validated a unified computational framework to (1) segment the brain tissue into white matter, cortical/subcortical gray matter, and cerebrospinal fluid, (2) segment 31 distinct white matter tracts, and (3) parcellate the brain’s cortex and delineate the deep gray nuclei and white matter structures into 96 anatomically meaningful regions. We utilized a set of manual, semi-automatic, and automatic approaches to annotate 97 fetal brains. Using these labels, we developed and validated a multi-task deep learning method to perform the three computations. Our evaluations show that the new method can accurately carry out all three tasks, achieving a mean Dice similarity coefficient of 0.865 on tissue segmentation, 0.825 on white matter tract segmentation, and 0.819 on parcellation. The proposed method can greatly advance the field of fetal neuroimaging as it can lead to substantial improvements in fetal brain tractography, tract-specific analysis, and structural connectivity assessment.

[CV-99] FCDM: Sparse-view Sinogram Inpainting with Frequency Domain Convolution Enhanced Diffusion Models

链接: https://arxiv.org/abs/2409.06714
作者: Jiaze E,Srutarshi Banerjee,Tekin Bicer,Guannan Wang,Bin Ren
关键词-EN: Reducing the radiation, computed tomography, radiation dose, dose in computed, Reducing
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Reducing the radiation dose in computed tomography (CT) is crucial, but it often results in sparse-view CT, where the number of available projections is significantly reduced. This reduction in projection data makes it challenging to accurately reconstruct high-quality CT images. In this condition, a sinogram, which is a collection of these projections, becomes incomplete. Sinogram inpainting then becomes essential because it enables accurate image reconstruction with limited projections. Existing models performing well on conventional RGB images for inpainting mostly fail in the case of sinograms. Further, these models usually do not make full use of unique properties, e.g., frequency features and absorption characteristics in the sinogram, and cannot handle large-area masks and complex real-world projections well. To address these limitations, we propose a novel model called the Frequency Convolution Diffusion Model (FCDM). It employs frequency domain convolutions to extract frequency information from various angles and capture the intricate relationships between these angles, which is essential for high-quality CT reconstruction. We also design a specific loss function based on the unique properties of a sinogram to maintain the consistency in physical properties, which allows the model to learn more effectively even in larger mask areas. We compare FCDM, using both simulations and real data, with nine inpainting models, among which two are designed for sinograms and seven for RGB images. The results indicate that our model significantly improves the quality of the inpainted sinograms both visually and quantitatively, with an SSIM of more than 0.95 and a PSNR of more than 30, achieving up to a 33% improvement in SSIM and a 29% improvement in PSNR compared to the baseline.
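To illustrate what a frequency-domain convolution can look like, the toy block below FFTs a sinogram-shaped tensor, applies 1x1 convolutions to the real and imaginary parts, and inverts the transform; it only conveys the idea and is not FCDM's actual architecture.

```python
import torch
import torch.nn as nn

class FrequencyConvBlock(nn.Module):
    """Toy frequency-domain block: FFT the sinogram, apply 1x1 convolutions to
    the real/imaginary parts, and transform back. Only meant to illustrate the
    idea of convolving in the frequency domain, not FCDM's actual design."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(2 * channels, 2 * channels, kernel_size=1)

    def forward(self, x):                                   # (B, C, H, W)
        spec = torch.fft.rfft2(x, norm="ortho")             # complex (B, C, H, W//2+1)
        feats = torch.cat([spec.real, spec.imag], dim=1)
        feats = self.conv(feats)
        real, imag = feats.chunk(2, dim=1)
        spec = torch.complex(real, imag)
        return torch.fft.irfft2(spec, s=x.shape[-2:], norm="ortho")

out = FrequencyConvBlock(1)(torch.randn(2, 1, 180, 256))    # angles x detector bins
```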

机器学习

[LG-0] Introducing Perturb-ability Score (PS) to Enhance Robustness Against Evasion Adversarial Attacks on ML-NIDS

链接: https://arxiv.org/abs/2409.07448
作者: Mohamed elShehaby,Ashraf Matrawy
关键词-EN: identify Network Intrusion, Intrusion Detection Systems, Network Intrusion Detection, Perturb-ability Score, Network Intrusion
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper proposes a novel Perturb-ability Score (PS) that can be used to identify Network Intrusion Detection Systems (NIDS) features that can be easily manipulated by attackers in the problem-space. We demonstrate that using PS to select only non-perturb-able features for ML-based NIDS maintains detection performance while enhancing robustness against adversarial attacks.

[LG-1] Adaptive Adapter Routing for Long-Tailed Class-Incremental Learning

链接: https://arxiv.org/abs/2409.07446
作者: Zhi-Hong Qi,Da-Wei Zhou,Yiran Yao,Han-Jia Ye,De-Chuan Zhan
关键词-EN: e-commerce platform reviews, ever-evolving world, platform reviews, e-commerce platform, long-tailed distribution
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to Machine Learning Journal. Code is available at: this https URL

点击查看摘要

Abstract:In our ever-evolving world, new data exhibits a long-tailed distribution, such as e-commerce platform reviews. This necessitates continually learning the model on imbalanced data without forgetting, addressing the challenge of long-tailed class-incremental learning (LTCIL). Existing methods often rely on retraining linear classifiers with former data, which is impractical in real-world settings. In this paper, we harness the potent representation capabilities of pre-trained models and introduce AdaPtive Adapter RouTing (APART) as an exemplar-free solution for LTCIL. To counteract forgetting, we train inserted adapters with frozen pre-trained weights for deeper adaptation and maintain a pool of adapters for selection during sequential model updates. Additionally, we present an auxiliary adapter pool designed for effective generalization, especially on minority classes. Adaptive instance routing across these pools captures crucial correlations, facilitating a comprehensive representation of all classes. Consequently, APART tackles the imbalance problem as well as catastrophic forgetting in a unified framework. Extensive benchmark experiments validate the effectiveness of APART. Code is available at: this https URL

[LG-2] Synthetic continued pretraining

链接: https://arxiv.org/abs/2409.07431
作者: Zitong Yang,Neil Band,Shuangping Li,Emmanuel Candès,Tatsunori Hashimoto
关键词-EN: unstructured internet text, enabled language models, unstructured internet, synthetic continued pretraining, acquire a significant
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Pretraining on large-scale, unstructured internet text has enabled language models to acquire a significant amount of world knowledge. However, this knowledge acquisition is data-inefficient – to learn a given fact, models must be trained on hundreds to thousands of diverse representations of it. This poses a challenge when adapting a pretrained model to a small corpus of domain-specific documents, where each fact may appear rarely or only once. We propose to bridge this gap with synthetic continued pretraining: using the small domain-specific corpus to synthesize a large corpus more amenable to learning, and then performing continued pretraining on the synthesized corpus. We instantiate this proposal with EntiGraph, a synthetic data augmentation algorithm that extracts salient entities from the source documents and then generates diverse text by drawing connections between the sampled entities. Synthetic continued pretraining using EntiGraph enables a language model to answer questions and follow generic instructions related to the source documents without access to them. If instead, the source documents are available at inference time, we show that the knowledge acquired through our approach compounds with retrieval-augmented generation. To better understand these results, we build a simple mathematical model of EntiGraph, and show how synthetic data augmentation can “rearrange” knowledge to enable more data-efficient learning.
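To make the augmentation idea concrete, here is a minimal sketch of an EntiGraph-style loop: extract salient entities, sample entity pairs, and prompt a language model to write text connecting them. The `extract_entities` heuristic and the `llm_generate` helper are hypothetical placeholders, not the paper's prompts or models.

```python
# Minimal sketch of an EntiGraph-style synthetic-corpus loop (illustrative only).
import itertools
import random

def extract_entities(document: str) -> list[str]:
    # Placeholder extractor: in practice an LLM or NER model would be used.
    return sorted({tok.strip(".,") for tok in document.split() if tok[:1].isupper()})

def llm_generate(prompt: str) -> str:
    # Hypothetical LLM call; replace with an actual model client.
    return f"[synthetic text for prompt: {prompt[:60]}...]"

def entigraph_style_corpus(documents: list[str], pairs_per_doc: int = 3) -> list[str]:
    synthetic = []
    for doc in documents:
        entities = extract_entities(doc)
        pairs = list(itertools.combinations(entities, 2))
        random.shuffle(pairs)
        for e1, e2 in pairs[:pairs_per_doc]:
            prompt = (f"Based on the following source text, explain the relation "
                      f"between '{e1}' and '{e2}':\n{doc}")
            synthetic.append(llm_generate(prompt))
    return synthetic

corpus = entigraph_style_corpus(["Ada Lovelace collaborated with Charles Babbage in London."])
print(len(corpus), corpus[0][:80])
```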

[LG-3] Towards Fairer Health Recommendations: finding informative unbiased samples via Word Sense Disambiguation RECSYS2024

链接: https://arxiv.org/abs/2409.07424
作者: Gavin Butts,Pegah Emdad,Jethro Lee,Shannon Song,Chiman Salavati,Willmar Sosa Diaz,Shiri Dori-Hacohen,Fabricio Murai
关键词-EN: produce biased predictions, growing concerns, concerns around high-stake, biased predictions, produce biased
类目: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注: Accepted for long presentation at the FAcctRec @ Recsys 2024

点击查看摘要

Abstract:There have been growing concerns around high-stake applications that rely on models trained with biased data, which consequently produce biased predictions, often harming the most vulnerable. In particular, biased medical data could cause health-related applications and recommender systems to create outputs that jeopardize patient care and widen disparities in health outcomes. A recent framework titled Fairness via AI posits that, instead of attempting to correct model biases, researchers must focus on their root causes by using AI to debias data. Inspired by this framework, we tackle bias detection in medical curricula using NLP models, including LLMs, and evaluate them on a gold standard dataset containing 4,105 excerpts annotated by medical experts for bias from a large corpus. We build on previous work by coauthors which augments the set of negative samples with non-annotated text containing social identifier terms. However, some of these terms, especially those related to race and ethnicity, can carry different meanings (e.g., “white matter of spinal cord”). To address this issue, we propose the use of Word Sense Disambiguation models to refine dataset quality by removing irrelevant sentences. We then evaluate fine-tuned variations of BERT models as well as GPT models with zero- and few-shot prompting. We found LLMs, considered SOTA on many NLP tasks, unsuitable for bias detection, while fine-tuned BERT models generally perform well across all evaluated metrics.

[LG-4] Hierarchical Reinforcement Learning for Temporal Abstraction of Listwise Recommendation

链接: https://arxiv.org/abs/2409.07416
作者: Luo Ji,Gao Liu,Mingyang Yin,Hongxia Yang,Jingren Zhou
关键词-EN: short-term interest shifts, Modern listwise recommendation, listwise recommendation systems, Modern listwise, long-term user perceptions
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 18 pages, 4 figures

点击查看摘要

Abstract:Modern listwise recommendation systems need to consider both long-term user perceptions and short-term interest shifts. Reinforcement learning can be applied to recommendation to study such a problem, but it is also subject to a large search space, sparse user feedback and long interactive latency. Motivated by recent progress in hierarchical reinforcement learning, we propose a novel framework called mccHRL to provide different levels of temporal abstraction on listwise recommendation. Within the hierarchical framework, the high-level agent studies the evolution of user perception, while the low-level agent produces the item selection policy by modeling the process as a sequential decision-making problem. We argue that such a framework has a well-defined decomposition of the extra-session context and the intra-session context, which are encoded by the high-level and low-level agents, respectively. To verify this argument, we implement both a simulator-based environment and an industrial dataset-based experiment. Results show significant performance improvements of our method compared with several well-known baselines. Data and codes have been made public.

[LG-5] SoK: Security and Privacy Risks of Medical AI

链接: https://arxiv.org/abs/2409.07415
作者: Yuanhaur Chang,Han Liu,Evin Jaff,Chenyang Lu,Ning Zhang
关键词-EN: powered by artificial, machine learning, products and services, era where software, artificial intelligence
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The integration of technology and healthcare has ushered in a new era where software systems, powered by artificial intelligence and machine learning, have become essential components of medical products and services. While these advancements hold great promise for enhancing patient care and healthcare delivery efficiency, they also expose sensitive medical data and system integrity to potential cyberattacks. This paper explores the security and privacy threats posed by AI/ML applications in healthcare. Through a thorough examination of existing research across a range of medical domains, we have identified significant gaps in understanding the adversarial attacks targeting medical AI systems. By outlining specific adversarial threat models for medical settings and identifying vulnerable application domains, we lay the groundwork for future research that investigates the security and resilience of AI-driven medical systems. Through our analysis of different threat models and feasibility studies on adversarial attacks in different medical domains, we provide compelling insights into the pressing need for cybersecurity research in the rapidly evolving field of AI healthcare technology.

[LG-6] Manifold Learning via Foliations and Knowledge Transfer

链接: https://arxiv.org/abs/2409.07412
作者: E. Tron,E. Fioresi
关键词-EN: high dimensional spaces, Understanding how real, machine learning, distributed in high, high dimensional
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Understanding how real data is distributed in high dimensional spaces is the key to many tasks in machine learning. We want to provide a natural geometric structure on the space of data employing a deep ReLU neural network trained as a classifier. Through the data information matrix (DIM), a variation of the Fisher information matrix, the model will discern a singular foliation structure on the space of data. We show that the singular points of such foliation are contained in a measure zero set, and that a local regular foliation exists almost everywhere. Experiments show that the data is correlated with leaves of such foliation. Moreover we show the potential of our approach for knowledge transfer by analyzing the spectrum of the DIM to measure distances between datasets.

[LG-7] What to align in multimodal contrastive learning?

链接: https://arxiv.org/abs/2409.07402
作者: Benoit Dufumier,Javiera Castillo-Navarro,Devis Tuia,Jean-Philippe Thiran
关键词-EN: Humans perceive, multisensory integration, adapt their behavior, perceive the world, world through multisensory
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注: 22 pages

点击查看摘要

Abstract:Humans perceive the world through multisensory integration, blending the information of different modalities to adapt their behavior. Contrastive learning offers an appealing solution for multimodal self-supervised learning. Indeed, by considering each modality as a different view of the same entity, it learns to align features of different modalities in a shared representation space. However, this approach is intrinsically limited as it only learns shared or redundant information between modalities, while multimodal interactions can arise in other ways. In this work, we introduce CoMM, a Contrastive MultiModal learning strategy that enables the communication between modalities in a single multimodal space. Instead of imposing cross- or intra- modality constraints, we propose to align multimodal representations by maximizing the mutual information between augmented versions of these multimodal features. Our theoretical analysis shows that shared, synergistic and unique terms of information naturally emerge from this formulation, allowing us to estimate multimodal interactions beyond redundancy. We test CoMM both in a controlled and in a series of real-world settings: in the former, we demonstrate that CoMM effectively captures redundant, unique and synergistic information between modalities. In the latter, CoMM learns complex multimodal interactions and achieves state-of-the-art results on the six multimodal benchmarks.
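The objective described, maximizing mutual information between augmented versions of fused multimodal features, is commonly estimated with an InfoNCE-style loss. The numpy schematic below shows that estimator under simple concatenation fusion and additive-noise augmentation; these are assumptions of this sketch, not CoMM's actual encoders.

```python
# InfoNCE-style estimator between two augmented views of fused multimodal features.
import numpy as np

rng = np.random.default_rng(0)
batch, dim, temperature = 8, 16, 0.2

def fuse(image_feat, text_feat):
    return np.concatenate([image_feat, text_feat], axis=-1)   # simple concatenation fusion

def augment(z):
    return z + 0.05 * rng.normal(size=z.shape)                # stand-in feature augmentation

img, txt = rng.normal(size=(batch, dim)), rng.normal(size=(batch, dim))
z1, z2 = augment(fuse(img, txt)), augment(fuse(img, txt))
z1 /= np.linalg.norm(z1, axis=1, keepdims=True)
z2 /= np.linalg.norm(z2, axis=1, keepdims=True)

logits = z1 @ z2.T / temperature                              # similarity of every pair of views
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
info_nce = -np.mean(np.diag(log_probs))                       # positives sit on the diagonal
print("InfoNCE loss:", round(float(info_nce), 3))
```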

[LG-8] Convergence of continuous-time stochastic gradient descent with applications to linear deep neural networks

链接: https://arxiv.org/abs/2409.07401
作者: Gabor Lugosi,Eulalia Nualart
关键词-EN: stochastic gradient descent, gradient descent process, learning problems, gradient descent, study a continuous-time
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We study a continuous-time approximation of the stochastic gradient descent process for minimizing the expected loss in learning problems. The main results establish general sufficient conditions for the convergence, extending the results of Chatterjee (2022) established for (nonstochastic) gradient descent. We show how the main result can be applied to the case of overparametrized linear neural network training.
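A toy example of the continuous-time view (not taken from the paper): simulate dθ_t = -∇L(θ_t) dt + σ dW_t with Euler-Maruyama for a quadratic loss and compare against the noiseless gradient-flow iterate.

```python
# Euler-Maruyama simulation of a continuous-time SGD approximation for
# L(theta) = 0.5 * theta^T A theta, next to deterministic gradient flow.
import numpy as np

rng = np.random.default_rng(0)
A = np.diag([1.0, 0.1])              # quadratic loss with condition number 10
theta_sde = np.array([2.0, 2.0])
theta_gd = theta_sde.copy()
dt, sigma, steps = 1e-2, 0.05, 2000

for _ in range(steps):
    grad = A @ theta_sde
    noise = sigma * np.sqrt(dt) * rng.standard_normal(2)
    theta_sde = theta_sde - dt * grad + noise      # stochastic (Euler-Maruyama) step
    theta_gd = theta_gd - dt * (A @ theta_gd)      # deterministic gradient-flow step

print("SDE iterate:", theta_sde, "| deterministic iterate:", theta_gd)
```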

[LG-9] Revisiting Static Feature-Based Android Malware Detection

链接: https://arxiv.org/abs/2409.07397
作者: Md Tanvirul Alam,Dipkamal Bhusal,Nidhi Rastogi
关键词-EN: driven significant advancements, Android malware detection, Android malware, significant advancements, increasing reliance
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The increasing reliance on machine learning (ML) in computer security, particularly for malware classification, has driven significant advancements. However, the replicability and reproducibility of these results are often overlooked, leading to challenges in verifying research findings. This paper highlights critical pitfalls that undermine the validity of ML research in Android malware detection, focusing on dataset and methodological issues. We comprehensively analyze Android malware detection using two datasets and assess offline and continual learning settings with six widely used ML models. Our study reveals that when properly tuned, simpler baseline methods can often outperform more complex models. To address reproducibility challenges, we propose solutions for improving datasets and methodological practices, enabling fairer model comparisons. Additionally, we open-source our code to facilitate malware analysis, making it extensible for new models and datasets. Our paper aims to support future research in Android malware detection and other security domains, enhancing the reliability and reproducibility of published results.

[LG-10] A Scalable Algorithm for Active Learning

链接: https://arxiv.org/abs/2409.07392
作者: Youguang Chen,Zheyu Wen,George Biros
关键词-EN: recently proposed deterministic, proposed deterministic active, deterministic active learning, logistic regression, recently proposed
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: To appear at SC’24. Link: this https URL

点击查看摘要

Abstract:FIRAL is a recently proposed deterministic active learning algorithm for multiclass classification using logistic regression. It was shown to outperform the state-of-the-art in terms of accuracy and robustness and comes with theoretical performance guarantees. However, its scalability suffers when dealing with datasets featuring a large number of points $n$, dimensions $d$, and classes $c$, due to its $\mathcal{O}(c^2 d^2 + n c^2 d)$ storage and $\mathcal{O}(c^3(n d^2 + b d^3 + b n))$ computational complexity, where $b$ is the number of points to select in active learning. To address these challenges, we propose an approximate algorithm with storage requirements reduced to $\mathcal{O}(n(d+c) + c d^2)$ and a computational complexity of $\mathcal{O}(b n c d^2)$. Additionally, we present a parallel implementation on GPUs. We demonstrate the accuracy and scalability of our approach using MNIST, CIFAR-10, Caltech101, and ImageNet. The accuracy tests reveal no deterioration in accuracy compared to FIRAL. We report strong and weak scaling tests on up to 12 GPUs, for a three-million-point synthetic dataset.
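To see why the reduction matters, plugging illustrative (not paper-reported) sizes into the two storage expressions gives:

```python
# Illustrative arithmetic only: n points, d dimensions, c classes picked by hand.
n, d, c = 1_200_000, 2048, 1000   # hypothetical ImageNet-scale feature set

firal_storage = c**2 * d**2 + n * c**2 * d          # O(c^2 d^2 + n c^2 d)
approx_storage = n * (d + c) + c * d**2             # O(n(d + c) + c d^2)

bytes_per_float = 8
print(f"FIRAL  : {firal_storage * bytes_per_float / 1e12:.1f} TB of float64")
print(f"Approx.: {approx_storage * bytes_per_float / 1e9:.1f} GB of float64")
```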

[LG-11] D-CAPTCHA: A Study of Resilience of Deepfake CAPTCHA under Transferable Imperceptible Adversarial Attack

链接: https://arxiv.org/abs/2409.07390
作者: Hong-Hanh Nguyen-Le,Van-Tuan Tran,Dinh-Thuc Nguyen,Nhien-An Le-Khac
关键词-EN: audio synthesis models, synthesis models, voice conversion, advancements in generative, enabled the improvement
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注: 14 pages

点击查看摘要

Abstract:The advancements in generative AI have enabled the improvement of audio synthesis models, including text-to-speech and voice conversion. This raises concerns about its potential misuse in social manipulation and political interference, as synthetic speech has become indistinguishable from natural human speech. Several speech-generation programs are utilized for malicious purposes, especially impersonating individuals through phone calls. Therefore, detecting fake audio is crucial to maintain social security and safeguard the integrity of information. Recent research has proposed a D-CAPTCHA system based on the challenge-response protocol to differentiate fake phone calls from real ones. In this work, we study the resilience of this system and introduce a more robust version, D-CAPTCHA++, to defend against fake calls. Specifically, we first expose the vulnerability of the D-CAPTCHA system under transferable imperceptible adversarial attack. Secondly, we mitigate such vulnerability by improving the robustness of the system by using adversarial training in D-CAPTCHA deepfake detectors and task classifiers.

[LG-12] A Contrastive Symmetric Forward-Forward Algorithm (SFFA) for Continual Learning Tasks

链接: https://arxiv.org/abs/2409.07387
作者: Erik B. Terres-Escudero,Javier Del Ser,Pablo Garcia Bringas
关键词-EN: yielding competitive performance, recently gained momentum, conventional back-propagation algorithm, so-called Forward-Forward Algorithm, yielding competitive
类目: Machine Learning (cs.LG)
*备注: Accepted at 3rd Conference on Lifelong Learning Agents (CoLLAs), 2024

点击查看摘要

Abstract:The so-called Forward-Forward Algorithm (FFA) has recently gained momentum as an alternative to the conventional back-propagation algorithm for neural network learning, yielding competitive performance across various modeling tasks. By replacing the backward pass of gradient back-propagation with two contrastive forward passes, the FFA avoids several shortcomings undergone by its predecessor (e.g., vanishing/exploding gradient) by enabling layer-wise training heuristics. In classification tasks, this contrastive method has been proven to effectively create a latent sparse representation of the input data, ultimately favoring discriminability. However, FFA exhibits an inherent asymmetric gradient behavior due to an imbalanced loss function between positive and negative data, adversely impacting on the model’s generalization capabilities and leading to an accuracy degradation. To address this issue, this work proposes the Symmetric Forward-Forward Algorithm (SFFA), a novel modification of the original FFA which partitions each layer into positive and negative neurons. This allows the local fitness function to be defined as the ratio between the activation of positive neurons and the overall layer activity, resulting in a symmetric loss landscape during the training phase. To evaluate the enhanced convergence of our method, we conduct several experiments using multiple image classification benchmarks, comparing the accuracy of models trained with SFFA to those trained with its FFA counterpart. As a byproduct of this reformulation, we explore the advantages of using a layer-wise training algorithm for Continual Learning (CL) tasks. The specialization of neurons and the sparsity of their activations induced by layer-wise training algorithms enable efficient CL strategies that incorporate new knowledge (classes) into the neural network, while preventing catastrophic forgetting of previously…
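The symmetric fitness described above can be read as a simple ratio. The sketch below encodes that reading of the abstract, not the released SFFA code: split a layer's units into positive and negative halves and score a sample by the share of total activity in the positive half.

```python
# Schematic symmetric layer-wise fitness: fraction of activity in the "positive" half.
import numpy as np

def sffa_fitness(activations: np.ndarray) -> float:
    """activations: 1-D vector of non-negative (e.g. ReLU) unit activations."""
    half = activations.size // 2
    pos, neg = activations[:half], activations[half:]
    total = pos.sum() + neg.sum() + 1e-8
    return float(pos.sum() / total)   # in [0, 1]; symmetric in the roles of pos/neg

layer_out = np.maximum(np.random.default_rng(1).normal(size=128), 0.0)
print("fitness for a random sample:", round(sffa_fitness(layer_out), 3))
# Training would push this ratio towards 1 for positive samples and towards 0 for negative ones.
```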

[LG-13] FIRAL: An Active Learning Algorithm for Multinomial Logistic Regression NEURIPS2023

链接: https://arxiv.org/abs/2409.07379
作者: Youguang Chen,George Biros
关键词-EN: Fisher Information Ratio, pool-based active learning, investigate theory, Information Ratio, multinomial logistic regression
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at the 37th Conference on Neural Information Processing Systems (NeurIPS 2023)

点击查看摘要

Abstract:We investigate theory and algorithms for pool-based active learning for multiclass classification using multinomial logistic regression. Using finite sample analysis, we prove that the Fisher Information Ratio (FIR) lower and upper bounds the excess risk. Based on our theoretical analysis, we propose an active learning algorithm that employs regret minimization to minimize the FIR. To verify our derived excess risk bounds, we conduct experiments on synthetic datasets. Furthermore, we compare FIRAL with five other methods and find that our scheme outperforms them: it consistently produces the smallest classification error in the multiclass logistic regression setting, as demonstrated through experiments on MNIST, CIFAR-10, and 50-class ImageNet.

[LG-14] Federated Impression for Learning with Distributed Heterogeneous Data

链接: https://arxiv.org/abs/2409.07351
作者: Sana Ayromlou,Atrin Arya,Armin Saadat,Purang Abolmaesumi,Xiaoxiao Li
关键词-EN: Standard deep learning-based, real-world clinical applications, Standard deep, deep learning-based classification, learning-based classification approaches
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Standard deep learning-based classification approaches may not always be practical in real-world clinical applications, as they require a centralized collection of all samples. Federated learning (FL) provides a paradigm that can learn from distributed datasets across clients without requiring them to share data, which can help mitigate privacy and data ownership issues. In FL, sub-optimal convergence caused by data heterogeneity is common among data from different health centers due to the variety in data collection protocols and patient demographics across centers. Through experimentation in this study, we show that data heterogeneity leads to the phenomenon of catastrophic forgetting during local training. We propose FedImpres which alleviates catastrophic forgetting by restoring synthetic data that represents the global information as federated impression. To achieve this, we distill the global model resulting from each communication round. Subsequently, we use the synthetic data alongside the local data to enhance the generalization of local training. Extensive experiments show that the proposed method achieves state-of-the-art performance on both the BloodMNIST and Retina datasets, which contain label imbalance and domain shift, with an improvement in classification accuracy of up to 20%.

[LG-15] Online Decision MetaMorphFormer: A Causal Transformer-Based Reinforcement Learning Framework of Universal Embodied Intelligence

链接: https://arxiv.org/abs/2409.07341
作者: Luo Ji,Runji Lin
关键词-EN: motion control field, Interactive artificial intelligence, interesting topic, Interactive artificial, motion control
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: 12 pages, 6 figures

点击查看摘要

Abstract:Interactive artificial intelligence in the motion control field is an interesting topic, especially when universal knowledge is adaptive to multiple tasks and universal environments. Despite there being increasing efforts in the field of Reinforcement Learning (RL) with the aid of transformers, most of them might be limited by the offline training pipeline, which prohibits exploration and generalization abilities. To address this limitation, we propose the framework of Online Decision MetaMorphFormer (ODM) which aims to achieve self-awareness, environment recognition, and action planning through a unified model architecture. Motivated by cognitive and behavioral psychology, an ODM agent is able to learn from others, recognize the world, and practice itself based on its own experience. ODM can also be applied to any arbitrary agent with a multi-joint body, located in different environments, and trained with different types of tasks using large-scale pre-trained datasets. Through the use of pre-trained datasets, ODM can quickly warm up and learn the necessary knowledge to perform the desired task, while the target environment continues to reinforce the universal policy. Extensive online experiments as well as few-shot and zero-shot environmental tests are used to verify ODM’s performance and generalization ability. The results of our study contribute to the study of general artificial intelligence in embodied and cognitive fields. Code, results, and video examples can be found on the website this https URL.

[LG-16] A Framework for Predicting the Impact of Game Balance Changes through Meta Discovery

链接: https://arxiv.org/abs/2409.07340
作者: Akash Saravanan,Matthew Guzdial
关键词-EN: League of Legends, collection of knowledge, balance, Meta Discovery framework, Pokémon
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 11 pages, 1 figure, IEEE Transactions on Games

点击查看摘要

Abstract:A metagame is a collection of knowledge that goes beyond the rules of a game. In competitive, team-based games like Pokémon or League of Legends, it refers to the set of current dominant characters and/or strategies within the player base. Developer changes to the balance of the game can have drastic and unforeseen consequences on these sets of meta characters. A framework for predicting the impact of balance changes could aid developers in making more informed balance decisions. In this paper we present such a Meta Discovery framework, leveraging Reinforcement Learning for automated testing of balance changes. Our results demonstrate the ability to predict the outcome of balance changes in Pokémon Showdown, a collection of competitive Pokémon tiers, with high accuracy.

[LG-17] Learning to Compress Contexts for Efficient Knowledge-based Visual Question Answering

链接: https://arxiv.org/abs/2409.07331
作者: Weixi Weng,Jieming Zhu,Hao Zhang,Xiaojun Meng,Rui Zhang,Chun Yuan
关键词-EN: Large Language Models, Multimodal Large Language, Language Models, Large Language, demonstrated great zero-shot
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have demonstrated great zero-shot performance on visual question answering (VQA). However, when it comes to knowledge-based VQA (KB-VQA), MLLMs may lack human commonsense or specialized domain knowledge to answer such questions and require obtaining necessary information from external knowledge sources. Previous works like Retrieval-Augmented VQA-v2 (RAVQA-v2) focus on utilizing as much input information, such as image-based textual descriptions and retrieved knowledge, as possible to improve performance, but they all overlook the issue that as the number of input tokens increases, inference efficiency significantly decreases, which contradicts the demands of practical applications. To address this issue, we propose Retrieval-Augmented MLLM with Compressed Contexts (RACC). RACC learns to compress and aggregate retrieved contexts, from which it generates a compact modulation in the form of Key-Value (KV) cache. This modulation is then used to adapt the downstream frozen MLLM, thereby achieving effective and efficient inference. RACC achieves a state-of-the-art (SOTA) performance of 62.9% on OK-VQA. Moreover, it significantly reduces inference latency by 22.0%-59.7% compared to the prominent RAVQA-v2. Abundant experiments show RACC’s broad applicability. It is compatible with various off-the-shelf MLLMs and can also handle different knowledge sources including textual and multimodal documents.

[LG-18] Current Symmetry Group Equivariant Convolution Frameworks for Representation Learning

链接: https://arxiv.org/abs/2409.07327
作者: Ramzan Basheer,Deepak Mishra
关键词-EN: addressing real-world signals, Euclidean deep learning, Euclidean deep, complex topologies, inadequate for addressing
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 31 pages, 4 figures

点击查看摘要

Abstract:Euclidean deep learning is often inadequate for addressing real-world signals where the representation space is irregular and curved with complex topologies. Interpreting the geometric properties of such feature spaces has become paramount in obtaining robust and compact feature representations that remain unaffected by nontrivial geometric transformations, which vanilla CNNs cannot effectively handle. Recognizing rotation, translation, permutation, or scale symmetries can lead to equivariance properties in the learned representations. This has led to notable advancements in computer vision and machine learning tasks under the framework of geometric deep learning, as compared to their invariant counterparts. In this report, we emphasize the importance of symmetry group equivariant deep learning models and their realization of convolution-like operations on graphs, 3D shapes, and non-Euclidean spaces by leveraging group theory and symmetry. We categorize them as regular, steerable, and PDE-based convolutions and thoroughly examine the inherent symmetries of their input spaces and ensuing representations. We also outline the mathematical link between group convolutions or message aggregation operations and the concept of equivariance. The report also highlights various datasets, their application scopes, limitations, and insightful observations on future directions to serve as a valuable reference and stimulate further research in this emerging discipline.

[LG-19] Statistically Valid Information Bottleneck via Multiple Hypothesis Testing

链接: https://arxiv.org/abs/2409.07325
作者: Amirmohammad Farzaneh,Osvaldo Simeone
关键词-EN: widely studied framework, extracting compressed features, information bottleneck, downstream tasks, widely studied
类目: Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The information bottleneck (IB) problem is a widely studied framework in machine learning for extracting compressed features that are informative for downstream tasks. However, current approaches to solving the IB problem rely on a heuristic tuning of hyperparameters, offering no guarantees that the learned features satisfy information-theoretic constraints. In this work, we introduce a statistically valid solution to this problem, referred to as IB via multiple hypothesis testing (IB-MHT), which ensures that the learned features meet the IB constraints with high probability, regardless of the size of the available dataset. The proposed methodology builds on Pareto testing and learn-then-test (LTT), and it wraps around existing IB solvers to provide statistical guarantees on the IB constraints. We demonstrate the performance of IB-MHT on classical and deterministic IB formulations, validating the effectiveness of IB-MHT in outperforming conventional methods in terms of statistical robustness and reliability.

[LG-20] Efficient and Unbiased Sampling of Boltzmann Distributions via Consistency Models

链接: https://arxiv.org/abs/2409.07323
作者: Fengzhe Zhang,Jiajun He,Laurence I. Midgley,Javier Antorán,José Miguel Hernández-Lobato
关键词-EN: advancing Boltzmann Generators, Boltzmann Generators, shown promising potential, advancing Boltzmann, shown promising
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 10 pages, 4 figures

点击查看摘要

Abstract:Diffusion models have shown promising potential for advancing Boltzmann Generators. However, two critical challenges persist: (1) inherent errors in samples due to model imperfections, and (2) the requirement of hundreds of functional evaluations (NFEs) to achieve high-quality samples. While existing solutions like importance sampling and distillation address these issues separately, they are often incompatible, as most distillation models lack the necessary density information for importance sampling. This paper introduces a novel sampling method that effectively combines Consistency Models (CMs) with importance sampling. We evaluate our approach on both synthetic energy functions and equivariant n-body particle systems. Our method produces unbiased samples using only 6-25 NFEs while achieving a comparable Effective Sample Size (ESS) to Denoising Diffusion Probabilistic Models (DDPMs) that require approximately 100 NFEs.
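The importance-sampling and effective sample size (ESS) machinery referenced here is standard. The snippet below shows self-normalized importance weighting and the ESS diagnostic for a double-well Boltzmann target, with a plain Gaussian standing in for the consistency-model sampler.

```python
# Self-normalized importance sampling with the ESS diagnostic for a Boltzmann target.
import numpy as np

rng = np.random.default_rng(0)

def energy(x):                      # double-well potential U(x)
    return (x**2 - 1.0)**2

x = rng.normal(loc=0.0, scale=1.0, size=10_000)        # proposal samples q = N(0, 1)
log_q = -0.5 * x**2 - 0.5 * np.log(2 * np.pi)
log_p_unnorm = -energy(x)                              # unnormalized Boltzmann log-density
log_w = log_p_unnorm - log_q
w = np.exp(log_w - log_w.max())
w /= w.sum()                                           # self-normalized weights

ess = 1.0 / np.sum(w**2)                               # ESS = 1 / sum(w^2) for normalized w
estimate = np.sum(w * x**2)                            # reweighted estimate of E_p[x^2]
print(f"ESS = {ess:.0f} of {x.size}, E_p[x^2] ~= {estimate:.3f}")
```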

[LG-21] Three-Dimensional Multimodal Synchrotron Data for Machine Learning Applications

链接: https://arxiv.org/abs/2409.07322
作者: Calum Green,Sharif Ahmed,Shashidhara Marathe,Liam Perera,Alberto Leonardi,Killian Gmyrek,Daniele Dini,James Le Houx
关键词-EN: good quality training, quality training data, imaging modalities, zinc-doped Zeolite, Machine learning techniques
类目: Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注: 9 pages, 4 figures. Image Processing and Artificial Intelligence Conference, 2024

点击查看摘要

Abstract:Machine learning techniques are being increasingly applied in medical and physical sciences across a variety of imaging modalities; however, an important issue when developing these tools is the availability of good quality training data. Here we present a unique, multimodal synchrotron dataset of a bespoke zinc-doped Zeolite 13X sample that can be used to develop advanced deep learning and data fusion pipelines. Multi-resolution micro X-ray computed tomography was performed on a zinc-doped Zeolite 13X fragment to characterise its pores and features, before spatially resolved X-ray diffraction computed tomography was carried out to characterise the homogeneous distribution of sodium and zinc phases. Zinc absorption was controlled to create a simple, spatially isolated, two-phase material. Both raw and processed data is available as a series of Zenodo entries. Altogether we present a spatially resolved, three-dimensional, multimodal, multi-resolution dataset that can be used for the development of machine learning techniques. Such techniques include development of super-resolution, multimodal data fusion, and 3D reconstruction algorithm development.

[LG-22] Optimizing Neural Network Performance and Interpretability with Diophantine Equation Encoding

链接: https://arxiv.org/abs/2409.07310
作者: Ronald Katende
关键词-EN: improve model interpretability, Diophantine equations, neural network parameters, decoding neural network, architectures to improve
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:This paper explores the integration of Diophantine equations into neural network (NN) architectures to improve model interpretability, stability, and efficiency. By encoding and decoding neural network parameters as integer solutions to Diophantine equations, we introduce a novel approach that enhances both the precision and robustness of deep learning models. Our method integrates a custom loss function that enforces Diophantine constraints during training, leading to better generalization, reduced error bounds, and enhanced resilience against adversarial attacks. We demonstrate the efficacy of this approach through several tasks, including image classification and natural language processing, where improvements in accuracy, convergence, and robustness are observed. This study offers a new perspective on combining mathematical theory and machine learning to create more interpretable and efficient models.

[LG-23] Non-Invasive Glucose Prediction System Enhanced by Mixed Linear Models and Meta-Forests for Domain Generalization

链接: https://arxiv.org/abs/2409.07308
作者: Yuyang Sun,Panagiotis Kosmas
关键词-EN: Mixed Linear Model, spectroscopy and millimeter-wave, Mixed Linear, integrates Near-Infrared, NIR
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
*备注:

点击查看摘要

Abstract:In this study, we present a non-invasive glucose prediction system that integrates Near-Infrared (NIR) spectroscopy and millimeter-wave (mm-wave) sensing. We employ a Mixed Linear Model (MixedLM) to analyze the association between mm-wave frequency S_21 parameters and blood glucose levels within a heterogeneous dataset. The MixedLM method considers inter-subject variability and integrates multiple predictors, offering a more comprehensive analysis than traditional correlation analysis. Additionally, we incorporate a Domain Generalization (DG) model, Meta-forests, to effectively handle domain variance in the dataset, enhancing the model’s adaptability to individual differences. Our results demonstrate promising accuracy in glucose prediction for unseen subjects, with a mean absolute error (MAE) of 17.47 mg/dL, a root mean square error (RMSE) of 31.83 mg/dL, and a mean absolute percentage error (MAPE) of 10.88%, highlighting its potential for clinical application. This study marks a significant step towards developing accurate, personalized, and non-invasive glucose monitoring systems, contributing to improved diabetes management.
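A mixed linear model of the kind described can be fit in a few lines with statsmodels. The column names (subject, s21_mag, glucose) and the synthetic data below are illustrative placeholders, not the study's schema.

```python
# Random-intercept mixed linear model on synthetic data, fit with statsmodels.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_subjects, n_obs = 10, 30
df = pd.DataFrame({
    "subject": np.repeat(np.arange(n_subjects), n_obs),
    "s21_mag": rng.normal(size=n_subjects * n_obs),
})
# Synthetic glucose with a shared slope and subject-specific intercepts.
subject_offset = np.repeat(rng.normal(scale=10.0, size=n_subjects), n_obs)
df["glucose"] = 100 + 5.0 * df["s21_mag"] + subject_offset + rng.normal(scale=3.0, size=len(df))

model = smf.mixedlm("glucose ~ s21_mag", df, groups=df["subject"])  # random intercept per subject
result = model.fit()
print(result.summary())
```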

[LG-24] A Unified Contrastive Loss for Self-Training

链接: https://arxiv.org/abs/2409.07292
作者: Aurelien Gauffre,Julien Horvat,Massih-Reza Amini
关键词-EN: exploiting abundant unlabeled, abundant unlabeled data, exploiting abundant, abundant unlabeled, loss function
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Self-training methods have proven to be effective in exploiting abundant unlabeled data in semi-supervised learning, particularly when labeled data is scarce. While many of these approaches rely on a cross-entropy loss function (CE), recent advances have shown that the supervised contrastive loss function (SupCon) can be more effective. Additionally, unsupervised contrastive learning approaches have also been shown to capture high quality data representations in the unsupervised setting. To benefit from these advantages in a semi-supervised setting, we propose a general framework to enhance self-training methods, which replaces all instances of CE losses with a unique contrastive loss. By using class prototypes, which are a set of class-wise trainable parameters, we recover the probability distributions of the CE setting and show a theoretical equivalence with it. Our framework, when applied to popular self-training methods, results in significant performance improvements across three different datasets with a limited number of labeled data. Additionally, we demonstrate further improvements in convergence speed, transfer ability, and hyperparameter stability. The code is available at this https URL.
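The prototype argument can be seen in a few lines: a softmax over feature-prototype similarities yields class posteriors exactly as cross-entropy logits would. The numpy schematic below reflects this reading of the abstract, not the authors' implementation.

```python
# Class prototypes acting as CE-style logits: softmax over feature-prototype similarities.
import numpy as np

rng = np.random.default_rng(0)
num_classes, dim, temperature = 5, 32, 0.1

features = rng.normal(size=(4, dim))                    # a batch of encoder outputs
prototypes = rng.normal(size=(num_classes, dim))        # trainable class prototypes

def l2_normalize(v, axis=-1):
    return v / np.linalg.norm(v, axis=axis, keepdims=True)

logits = l2_normalize(features) @ l2_normalize(prototypes).T / temperature
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)               # softmax: CE-style class posteriors

labels = np.array([0, 2, 1, 4])
nll = -np.log(probs[np.arange(len(labels)), labels]).mean()
print("class posteriors:\n", probs.round(3), "\nprototype-contrastive NLL:", round(float(nll), 3))
```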

[LG-25] Exploring User-level Gradient Inversion with a Diffusion Prior NEURIPS2023

链接: https://arxiv.org/abs/2409.07291
作者: Zhuohang Li,Andrew Lowy,Jing Liu,Toshiaki Koike-Akino,Bradley Malin,Kieran Parsons,Ye Wang
关键词-EN: explore user-level gradient, user-level gradient inversion, distributed learning, explore user-level, surface in distributed
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
*备注: Presented at the International Workshop on Federated Learning in the Age of Foundation Models in conjunction with NeurIPS 2023

点击查看摘要

Abstract:We explore user-level gradient inversion as a new attack surface in distributed learning. We first investigate existing attacks on their ability to make inferences about private information beyond training data reconstruction. Motivated by the low reconstruction quality of existing methods, we propose a novel gradient inversion attack that applies a denoising diffusion model as a strong image prior in order to enhance recovery in the large batch setting. Unlike traditional attacks, which aim to reconstruct individual samples and suffer at large batch and image sizes, our approach instead aims to recover a representative image that captures the sensitive shared semantic information corresponding to the underlying user. Our experiments with face images demonstrate the ability of our methods to recover realistic facial images along with private user attributes.

[LG-26] Using Generative Agents to Create Tip Sheets for Investigative Data Reporting

链接: https://arxiv.org/abs/2409.07286
作者: Joris Veerbeek,Nicholas Diakopoulos
关键词-EN: create tip sheets, investigative data reporting, paper introduces, data reporting, create tip
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Short paper to be presented at Computation + Journalism 2024

点击查看摘要

Abstract:This paper introduces a system using generative AI agents to create tip sheets for investigative data reporting. Our system employs three specialized agents–an analyst, a reporter, and an editor–to collaboratively generate and refine tips from datasets. We validate this approach using real-world investigative stories, demonstrating that our agent-based system generally generates more newsworthy and accurate insights compared to a baseline model without agents, although some variability was noted between different stories. Our findings highlight the potential of generative AI to provide leads for investigative data reporting.

[LG-27] TLD-READY: Traffic Light Detection – Relevance Estimation and Deployment Analysis

链接: https://arxiv.org/abs/2409.07284
作者: Nikolai Polley,Svetlana Pavlitska,Yacin Boualili,Patrick Rohrbeck,Paul Stiller,Ashok Kumar Bangaru,J. Marius Zöllner
关键词-EN: Effective traffic light, traffic light detection, Small Traffic Lights, Traffic Lights Dataset, Effective traffic
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Effective traffic light detection is a critical component of the perception stack in autonomous vehicles. This work introduces a novel deep-learning detection system while addressing the challenges of previous work. Utilizing a comprehensive dataset amalgamation, including the Bosch Small Traffic Lights Dataset, LISA, the DriveU Traffic Light Dataset, and a proprietary dataset from Karlsruhe, we ensure a robust evaluation across varied scenarios. Furthermore, we propose a relevance estimation system that innovatively uses directional arrow markings on the road, eliminating the need for prior map creation. On the DriveU dataset, this approach results in 96% accuracy in relevance estimation. Finally, a real-world evaluation is performed to evaluate the deployment and generalizing abilities of these models. For reproducibility and to facilitate further research, we provide the model weights and code: this https URL.

[LG-28] Tuning-Free Online Robust Principal Component Analysis through Implicit Regularization

链接: https://arxiv.org/abs/2409.07275
作者: Lakshmi Jayalal,Gokularam Muthukrishnan,Sheetal Kalyani
关键词-EN: Principal Component Analysis, Online Robust Principal, Robust Principal Component, standard Online Robust, Component Analysis
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The performance of the standard Online Robust Principal Component Analysis (OR-PCA) technique depends on the optimum tuning of the explicit regularizers, and this tuning is dataset-sensitive. We aim to remove the dependency on these tuning parameters by using implicit regularization. We propose to use the implicit regularization effect of various modified gradient descents to make OR-PCA tuning free. Our method incorporates three different versions of modified gradient descent that separately but naturally encourage sparsity and low-rank structures in the data. The proposed method performs comparably to or better than the tuned OR-PCA for both simulated and real-world datasets. Tuning-free OR-PCA is more scalable for large datasets since it does not require dataset-dependent parameter tuning.

[LG-29] RePlay: a Recommendation Framework for Experimentation and Production Use

链接: https://arxiv.org/abs/2409.07272
作者: Alexey Vasilev,Anna Volodkevich,Denis Kulandin,Tatiana Bysheva,Anton Klenitskiy
关键词-EN: systems significantly reduces, build and compare, significantly reduces, reduces the time, time to market
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注:

点击查看摘要

Abstract:Using a single tool to build and compare recommender systems significantly reduces the time to market for new models. In addition, the comparison results when using such tools look more consistent. This is why many different tools and libraries for researchers in the field of recommendations have recently appeared. Unfortunately, most of these frameworks are aimed primarily at researchers and require modification for use in production due to the inability to work on large datasets or an inappropriate architecture. In this demo, we present our open-source toolkit RePlay - a framework containing an end-to-end pipeline for building recommender systems, which is ready for production use. RePlay also allows you to use a suitable stack for the pipeline on each stage: Pandas, Polars, or Spark. This allows the library to scale computations and deploy to a cluster. Thus, RePlay allows data scientists to easily move from research mode to production mode using the same interfaces.

[LG-30] Multi-Type Preference Learning: Empowering Preference-Based Reinforcement Learning with Equal Preferences

链接: https://arxiv.org/abs/2409.07268
作者: Ziang Liu,Junjie Xu,Xingjiao Wu,Jing Yang,Liang He
关键词-EN: needing meticulously designed, Preference-Based reinforcement learning, designed reward functions, meticulously designed reward, Equal Preference Learning
类目: Machine Learning (cs.LG)
*备注: 7 pages, 6 figures

点击查看摘要

Abstract:Preference-Based reinforcement learning (PBRL) learns directly from the preferences of human teachers regarding agent behaviors without needing meticulously designed reward functions. However, existing PBRL methods often learn primarily from explicit preferences, neglecting the possibility that teachers may choose equal preferences. This neglect may hinder the understanding of the agent regarding the task perspective of the teacher, leading to the loss of important information. To address this issue, we introduce the Equal Preference Learning Task, which optimizes the neural network by promoting similar reward predictions when the behaviors of two agents are labeled as equal preferences. Building on this task, we propose a novel PBRL method, Multi-Type Preference Learning (MTPL), which allows simultaneous learning from equal preferences while leveraging existing methods for learning from explicit preferences. To validate our approach, we design experiments applying MTPL to four existing state-of-the-art baselines across ten locomotion and robotic manipulation tasks in the DeepMind Control Suite. The experimental results indicate that simultaneous learning from both equal and explicit preferences enables the PBRL method to more comprehensively understand the feedback from teachers, thereby enhancing feedback efficiency.
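One reading of the two loss terms, sketched below as plain functions rather than the released code: a standard Bradley-Terry cross-entropy for explicit preferences, plus a term that pulls the two predicted returns together when a pair is labeled as equally preferred.

```python
# Schematic preference losses: explicit (Bradley-Terry) and "equal preference" terms.
import numpy as np

def explicit_preference_loss(r_a: float, r_b: float, a_preferred: bool) -> float:
    # P(a > b) = sigmoid(r_a - r_b); standard preference-based RL reward-model loss.
    p_a = 1.0 / (1.0 + np.exp(-(r_a - r_b)))
    return -np.log(p_a if a_preferred else 1.0 - p_a)

def equal_preference_loss(r_a: float, r_b: float) -> float:
    # Encourage similar reward predictions for equally preferred behaviours.
    return (r_a - r_b) ** 2

# Example: predicted returns for two segment pairs, one explicit and one "equal" label.
print("explicit:", round(explicit_preference_loss(1.2, 0.4, a_preferred=True), 3))
print("equal   :", round(equal_preference_loss(1.2, 1.0), 3))
```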

[LG-31] TopoMap: A faster and more space efficient technique to compute projections with topological guarantees

链接: https://arxiv.org/abs/2409.07257
作者: Vitoria Guardieiro,Felipe Inagaki de Oliveira,Harish Doraiswamy,Luis Gustavo Nonato,Claudio Silva
关键词-EN: High-dimensional data, visualize effectively, difficult to visualize, data, visualizing high-dimensional data
类目: Graphics (cs.GR); Computational Geometry (cs.CG); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: This is the author’s version of the article that has been accepted for publication in IEEE Transactions on Visualization and Computer Graphics (TVCG)

点击查看摘要

Abstract:High-dimensional data, characterized by many features, can be difficult to visualize effectively. Dimensionality reduction techniques, such as PCA, UMAP, and t-SNE, address this challenge by projecting the data into a lower-dimensional space while preserving important relationships. TopoMap is another technique that excels at preserving the underlying structure of the data, leading to interpretable visualizations. In particular, TopoMap maps the high-dimensional data into a visual space, guaranteeing that the 0-dimensional persistence diagram of the Rips filtration of the visual space matches the one from the high-dimensional data. However, the original TopoMap algorithm can be slow and its layout can be too sparse for large and complex datasets. In this paper, we propose three improvements to TopoMap: 1) a more space-efficient layout, 2) a significantly faster implementation, and 3) a novel TreeMap-based representation that makes use of the topological hierarchy to aid the exploration of the projections. These advancements make TopoMap, now referred to as TopoMap++, a more powerful tool for visualizing high-dimensional data which we demonstrate through different use case scenarios.
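The invariant TopoMap preserves, the 0-dimensional persistence diagram of the Rips filtration, is fully determined by the minimum spanning tree of the pairwise distances (all 0-dimensional classes are born at 0 and die at MST edge lengths). The snippet below computes those merge heights with scipy as a small sanity check of that equivalence.

```python
# 0-dimensional Rips persistence via the minimum spanning tree of pairwise distances.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

rng = np.random.default_rng(0)
points = rng.normal(size=(20, 5))                # a small high-dimensional point cloud

dists = squareform(pdist(points))
mst = minimum_spanning_tree(dists).toarray()
merge_heights = np.sort(mst[mst > 0])            # deaths of 0-dim classes (births are all 0)
print("first five component merge heights:", merge_heights[:5].round(3))
```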

[LG-32] Alignment of Diffusion Models: Fundamentals, Challenges, and Future

链接: https://arxiv.org/abs/2409.07253
作者: Buhua Liu,Shitong Shao,Bao Li,Lichen Bai,Haoyi Xiong,James Kwok,Sumi Helal,Zeke Xie
关键词-EN: Diffusion models, Diffusion, generative modeling, models, leading paradigm
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: 35 pages, 5 figures, 3 tables

点击查看摘要

Abstract:Diffusion models have emerged as the leading paradigm in generative modeling, excelling in various applications. Despite their success, these models often misalign with human intentions, generating outputs that may not match text prompts or possess desired properties. Inspired by the success of alignment in tuning large language models, recent studies have investigated aligning diffusion models with human expectations and preferences. This work mainly reviews alignment of diffusion models, covering advancements in fundamentals of alignment, alignment techniques of diffusion models, preference benchmarks, and evaluation for diffusion models. Moreover, we discuss key perspectives on current challenges and promising future directions on solving the remaining challenges in alignment of diffusion models. To the best of our knowledge, our work is the first comprehensive review paper for researchers and engineers to comprehend, practice, and research alignment of diffusion models.

[LG-33] Riemannian Federated Learning via Averaging Gradient Stream

链接: https://arxiv.org/abs/2409.07223
作者: Zhenwei Huang,Wen Huang,Pratik Jawanpuria,Bamdev Mishra
关键词-EN: garnered significant attention, Riemannian Federated Averaging, Federated Averaging Gradient, distributed learning paradigm, privacy-preserving distributed learning
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:In recent years, federated learning has garnered significant attention as an efficient and privacy-preserving distributed learning paradigm. In the Euclidean setting, Federated Averaging (FedAvg) and its variants are a class of efficient algorithms for expected (empirical) risk minimization. This paper develops and analyzes a Riemannian Federated Averaging Gradient Stream (RFedAGS) algorithm, which is a generalization of FedAvg, to problems defined on a Riemannian manifold. Under standard assumptions, the convergence rate of RFedAGS with fixed step sizes is proven to be sublinear for an approximate stationary solution. If decaying step sizes are used, the global convergence is established. Furthermore, assuming that the objective obeys the Riemannian Polyak-Łojasiewicz property, the optimality gaps generated by RFedAGS with fixed step size decrease linearly up to a tiny upper bound; if decaying step sizes are used, the gaps vanish sublinearly. Numerical simulations conducted on synthetic and real-world data demonstrate the performance of the proposed RFedAGS.
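A toy round of Riemannian federated averaging on the unit sphere illustrates the manifold ingredients involved (tangent-space projection of the gradient and retraction by normalization). This is a generic sketch under those assumptions, not RFedAGS itself.

```python
# Toy Riemannian FedAvg on the unit sphere: clients minimize x^T A_i x locally,
# the server averages the returned points and retracts back onto the sphere.
import numpy as np

rng = np.random.default_rng(0)
dim, n_clients, local_steps, step = 5, 4, 10, 0.1
A_clients = [np.diag(rng.uniform(1, 3, size=dim)) for _ in range(n_clients)]  # local objectives

def retract(x):
    return x / np.linalg.norm(x)                # retraction = renormalization onto the sphere

def riemannian_grad(A, x):
    g = 2 * A @ x                               # Euclidean gradient of x^T A x
    return g - (g @ x) * x                      # projection onto the tangent space at x

x_global = retract(rng.normal(size=dim))
for _ in range(5):                              # communication rounds
    client_points = []
    for A in A_clients:
        x = x_global.copy()
        for _ in range(local_steps):
            x = retract(x - step * riemannian_grad(A, x))
        client_points.append(x)
    x_global = retract(np.mean(client_points, axis=0))   # server-side average + retraction

print("objective per client:", [round(float(x_global @ A @ x_global), 3) for A in A_clients])
```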

[LG-34] Online Graph Filtering Over Expanding Graphs

链接: https://arxiv.org/abs/2409.07204
作者: Bishwadeep Das,Elvin Isufi
关键词-EN: staple tool, tool for processing, multitude of downstream, Graph, Graph filters
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Graph filters are a staple tool for processing signals over graphs in a multitude of downstream tasks. However, they are commonly designed for graphs with a fixed number of nodes, despite real-world networks typically grow over time. This topological evolution is often known up to a stochastic model, thus, making conventional graph filters ill-equipped to withstand such topological changes, their uncertainty, as well as the dynamic nature of the incoming data. To tackle these issues, we propose an online graph filtering framework by relying on online learning principles. We design filters for scenarios where the topology is both known and unknown, including a learner adaptive to such evolution. We conduct a regret analysis to highlight the role played by the different components such as the online algorithm, the filter order, and the growing graph model. Numerical experiments with synthetic and real data corroborate the proposed approach for graph signal inference tasks and show a competitive performance w.r.t. baselines and state-of-the-art alternatives.
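A bare-bones version of the setting: filter a streaming graph signal with y = Σ_k h_k S^k x and update the taps h by online gradient descent. Graph growth is omitted for brevity, and the graph, signals, and step size below are illustrative.

```python
# Online estimation of graph filter taps h_k from streaming signals on a fixed graph.
import numpy as np

rng = np.random.default_rng(0)
n, K, lr, T = 30, 3, 0.05, 200

A = (rng.random((n, n)) < 0.1).astype(float)
A = np.triu(A, 1); A = A + A.T                          # symmetric random graph
S = A / max(1.0, np.abs(np.linalg.eigvalsh(A)).max())   # scaled graph shift operator

h_true = np.array([0.5, -0.3, 0.2, 0.1])                # unknown filter to be tracked
h_hat = np.zeros(K + 1)

for t in range(T):
    x = rng.normal(size=n)                              # incoming graph signal
    shifts = [x.copy()]
    for _ in range(K):
        shifts.append(S @ shifts[-1])
    Phi = np.stack(shifts, axis=1)                      # n x (K+1) matrix of shifted signals
    y = Phi @ h_true + 0.01 * rng.normal(size=n)        # observed filter output
    err = Phi @ h_hat - y
    h_hat -= lr * Phi.T @ err / n                       # online (stochastic) gradient step

print("estimated taps:", h_hat.round(3))
```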

[LG-35] Heterogeneity-Aware Coordination for Federated Learning via Stitching Pre-trained blocks

链接: https://arxiv.org/abs/2409.07202
作者: Shichen Zhan,Yebo Wu,Chunlin Tian,Yan Zhao,Li Li
关键词-EN: coordinates multiple devices, preserving data privacy, global model, Federated learning, coordinates multiple
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Federated learning (FL) coordinates multiple devices to collaboratively train a shared model while preserving data privacy. However, the large memory footprint and high energy consumption during the training process exclude low-end devices from contributing to the global model with their own data, which severely deteriorates the model performance in real-world scenarios. In this paper, we propose FedStitch, a hierarchical coordination framework for heterogeneous federated learning with pre-trained blocks. Unlike traditional approaches that train the global model from scratch, FedStitch composes the global model for a new task by stitching pre-trained blocks. Specifically, each participating client selects the most suitable block based on its local data from a candidate pool composed of blocks from pre-trained models. The server then aggregates the optimal block for stitching. This process iterates until a new stitched network is generated. Beyond the new training paradigm, FedStitch consists of the following three core components: 1) an RL-weighted aggregator, 2) a search space optimizer deployed on the server side, and 3) a local energy optimizer deployed on each participating client. The RL-weighted aggregator helps to select the right block in the non-IID scenario, while the search space optimizer continuously reduces the size of the candidate block pool during stitching. Meanwhile, the local energy optimizer is designed to minimize the energy consumption of each client while guaranteeing the overall training progress. The results demonstrate that compared to existing approaches, FedStitch improves the model accuracy by up to 20.93%. At the same time, it achieves up to 8.12% speedup, reduces the memory footprint by up to 79.5%, and achieves up to 89.41% energy savings during the learning procedure.

[LG-36] Applying Multi-Fidelity Bayesian Optimization in Chemistry: Open Challenges and Major Considerations

链接: https://arxiv.org/abs/2409.07190
作者: Edmund Judge,Mohammed Azzouzi,Austin M. Mroz,Antonio del Rio Chanona,Kim E. Jelfs
关键词-EN: Multi fidelity Bayesian, fidelity Bayesian optimization, Bayesian optimization, maxima cost effectively, desired maxima cost
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Chemical Physics (physics.chem-ph)
*备注:

点击查看摘要

Abstract:Multi-fidelity Bayesian optimization (MFBO) leverages experimental and/or computational data of varying quality and resource cost to optimize towards desired maxima cost-effectively. This approach is particularly attractive for chemical discovery due to MFBO’s ability to integrate diverse data sources. Here, we investigate the application of MFBO to accelerate the identification of promising molecules or materials. We specifically analyze the conditions under which lower-fidelity data can enhance performance compared to single-fidelity problem formulations. We address two key challenges: selecting the optimal acquisition function, and understanding the impact of cost and data-fidelity correlation. We then discuss how to assess the effectiveness of MFBO for chemical discovery.

[LG-37] A Perspective on AI-Guided Molecular Simulations in VR: Exploring Strategies for Imitation Learning in Hyperdimensional Molecular Systems ECAI24

链接: https://arxiv.org/abs/2409.07189
作者: Mohamed Dhouioui,Jonathan Barnoud,Rhoslyn Roebuck Williams,Harry J. Stroud,Phil Bates,David R. Glowacki
关键词-EN: crucial computational tool, engineer molecular structure, Molecular dynamics simulations, crucial computational, computational tool
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Biomolecules (q-bio.BM)
*备注: (Accepted for presentation at the First Workshop on “eXtended Reality & Intelligent Agents” (XRIA24) @ ECAI24, Santiago De Compostela (Spain), 20 October 2024)

点击查看摘要

Abstract:Molecular dynamics simulations are a crucial computational tool for researchers to understand and engineer molecular structure and function in areas such as drug discovery, protein engineering, and material design. Despite their utility, MD simulations are expensive, owing to the high dimensionality of molecular systems. Interactive molecular dynamics in virtual reality (iMD-VR) has recently been developed as a ‘human-in-the-loop’ strategy, which leverages high-performance computing to accelerate the researcher’s ability to solve the hyperdimensional sampling problem. By providing an immersive 3D environment that enables visualization and manipulation of real-time molecular motion, iMD-VR enables researchers and students to efficiently and intuitively explore and navigate these complex, high-dimensional systems. iMD-VR platforms offer a unique opportunity to quickly generate rich datasets that capture human experts’ spatial insight regarding molecular structure and function. This paper explores the possibility of employing user-generated iMD-VR datasets to train AI agents via imitation learning (IL). IL is an important technique in robotics that enables agents to mimic complex behaviors from expert demonstrations, thus circumventing the need for explicit programming or intricate reward design. We review the utilization of IL for manipulation tasks in robotics and discuss how iMD-VR recordings could be used to train IL models for solving specific molecular ‘tasks’. We then investigate how such approaches could be applied to the data captured from iMD-VR recordings. Finally, we outline the future research directions and potential challenges of using AI agents to augment human expertise to efficiently navigate conformational spaces, highlighting how this approach could provide valuable insight across domains such as materials science, protein engineering, and computer-aided drug design.

[LG-38] Recurrent Aggregators in Neural Algorithmic Reasoning

链接: https://arxiv.org/abs/2409.07154
作者: Kaijia Xu,Petar Veličković
关键词-EN: classical algorithmic computations, mimic classical algorithmic, Neural algorithmic, neural algorithmic reasoners, emerging field
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Neural algorithmic reasoning (NAR) is an emerging field that seeks to design neural networks that mimic classical algorithmic computations. Today, graph neural networks (GNNs) are widely used in neural algorithmic reasoners due to their message passing framework and permutation equivariance. In this extended abstract, we challenge this design choice, and replace the equivariant aggregation function with a recurrent neural network. While seemingly counter-intuitive, this approach has appropriate grounding when nodes have a natural ordering – and this is the case frequently in established reasoning benchmarks like CLRS-30. Indeed, our recurrent NAR (RNAR) model performs very strongly on such tasks, while handling many others gracefully. A notable achievement of RNAR is its decisive state-of-the-art result on the Heapsort and Quickselect tasks, both deemed as a significant challenge for contemporary neural algorithmic reasoners – especially the latter, where RNAR achieves a mean micro-F1 score of 87%.

[LG-39] Combined Optimization of Dynamics and Assimilation with End-to-End Learning on Sparse Observations

链接: https://arxiv.org/abs/2409.07137
作者: Vadim Zinchenko,David S. Greenberg
关键词-EN: Fitting nonlinear dynamical, Fitting nonlinear, nonlinear dynamical models, fundamentally challenging, sparse and noisy
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注: Submitted to Journal of Advances in Modeling Earth Systems (JAMES)

点击查看摘要

Abstract:Fitting nonlinear dynamical models to sparse and noisy observations is fundamentally challenging. Identifying dynamics requires data assimilation (DA) to estimate system states, but DA requires an accurate dynamical model. To break this deadlock, we present CODA, an end-to-end optimization scheme for jointly learning dynamics and DA directly from sparse and noisy observations. A neural network is trained to carry out data-accurate, efficient and parallel-in-time DA, while free parameters of the dynamical system are simultaneously optimized. We carry out end-to-end learning directly on observation data, introducing a novel learning objective that combines unrolled auto-regressive dynamics with the data- and self-consistency terms of weak-constraint 4Dvar DA. By taking into account interactions between new and existing simulation components over multiple time steps, CODA can recover initial conditions, fit unknown dynamical parameters and learn neural network-based PDE terms to match both available observations and self-consistency constraints. In addition to facilitating end-to-end learning of dynamics and providing fast, amortized, non-sequential DA, CODA provides greater robustness to model misspecification than classical DA approaches.

[LG-40] Unsupervised Novelty Detection Methods Benchmarking with Wavelet Decomposition

链接: https://arxiv.org/abs/2409.07135
作者: Ariel Priarone,Umberto Albertin,Carlo Cena,Mauro Martini,Marcello Chiaberge
关键词-EN: Novelty detection, engineering fields, critical task, Novelty, detection
类目: Machine Learning (cs.LG)
*备注: To be published in the 8th International Conference on System Reliability and Safety. Sicily, Italy - November 20-22, 2024. 15 pages, 7 figures, 4 tables

点击查看摘要

Abstract:Novelty detection is a critical task in various engineering fields. Numerous approaches to novelty detection rely on supervised or semi-supervised learning, which requires labelled datasets for training. However, acquiring labelled data, when feasible, can be expensive and time-consuming. For these reasons, unsupervised learning is a powerful alternative that allows performing novelty detection without needing labelled samples. In this study, numerous unsupervised machine learning algorithms for novelty detection are compared, highlighting their strengths and weaknesses in the context of vibration sensing. The proposed framework uses a continuous metric, unlike most traditional methods that merely flag anomalous samples without quantifying the degree of anomaly. Moreover, a new dataset is gathered from an actuator vibrating at specific frequencies to benchmark the algorithms and evaluate the framework. Novel conditions are introduced by altering the input wave signal. Our findings offer valuable insights into the adaptability and robustness of unsupervised learning techniques for real-world novelty detection applications.

[LG-41] LLM-based feature generation from text for interpretable machine learning

链接: https://arxiv.org/abs/2409.07132
作者: Vojtěch Balek,Lukáš Sýkora,Vilém Sklenák,Tomáš Kliegr
关键词-EN: Existing text representations, questionable feature-level interpretability, rule learning due, Existing text, feature-level interpretability
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Existing text representations such as embeddings and bag-of-words are not suitable for rule learning due to their high dimensionality and absent or questionable feature-level interpretability. This article explores whether large language models (LLMs) could address this by extracting a small number of interpretable features from text. We demonstrate this process on two datasets (CORD-19 and M17+) containing several thousand scientific articles from multiple disciplines and a target being a proxy for research impact. An evaluation based on testing for the statistically significant correlation with research impact has shown that LLama 2-generated features are semantically meaningful. We consequently used these generated features in text classification to predict the binary target variable representing the citation rate for the CORD-19 dataset and the ordinal 5-class target representing an expert-awarded grade in the M17+ dataset. Machine-learning models trained on the LLM-generated features provided similar predictive performance to the state-of-the-art embedding model SciBERT for scientific text. The LLM used only 62 features compared to 768 features in SciBERT embeddings, and these features were directly interpretable, corresponding to notions such as article methodological rigor, novelty, or grammatical correctness. As the final step, we extract a small number of well-interpretable action rules. Consistently competitive results obtained with the same LLM feature set across both thematically diverse datasets show that this approach generalizes across domains.

[LG-42] Reranking Laws for Language Generation: A Communication-Theoretic Perspective

链接: https://arxiv.org/abs/2409.07131
作者: António Farinhas,Haau-Sing Li,André F. T. Martins
关键词-EN: ensure large language, large language models, generate unacceptable answers, LLM generate multiple, ensure large
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Preprint

点击查看摘要

Abstract:To ensure large language models (LLMs) are used safely, one must reduce their propensity to hallucinate or to generate unacceptable answers. A simple and often used strategy is to first let the LLM generate multiple hypotheses and then employ a reranker to choose the best one. In this paper, we draw a parallel between this strategy and the use of redundancy to decrease the error rate in noisy communication channels. We conceptualize the generator as a sender transmitting multiple descriptions of a message through parallel noisy channels. The receiver decodes the message by ranking the (potentially corrupted) descriptions and selecting the one found to be most reliable. We provide conditions under which this protocol is asymptotically error-free (i.e., yields an acceptable answer almost surely) even in scenarios where the reranker is imperfect (governed by Mallows or Zipf-Mandelbrot models) and the channel distributions are statistically dependent. We use our framework to obtain reranking laws which we validate empirically on two real-world tasks using LLMs: text-to-code generation with DeepSeek-Coder 7B and machine translation of medical data with TowerInstruct 13B.
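
下面用一个示意性的蒙特卡洛草图直观展示“先生成 N 个候选、再用不完美的重排序器选优”为何能随 N 降低错误率。这里的噪声重排序器用简单的高斯噪声建模,而非论文中的 Mallows / Zipf-Mandelbrot 设定;所有参数均为虚构示例。

```python
# 示意性蒙特卡洛草图(假设的简化噪声模型),并非论文的理论设定或官方实现。
import numpy as np

rng = np.random.default_rng(42)

def selection_error(n_hypotheses, p_acceptable=0.3, reranker_noise=1.0, trials=20000):
    errors = 0
    for _ in range(trials):
        acceptable = rng.random(n_hypotheses) < p_acceptable      # true quality
        # Imperfect reranker: true quality plus Gaussian noise.
        scores = acceptable.astype(float) + rng.normal(0, reranker_noise, n_hypotheses)
        if not acceptable[np.argmax(scores)]:
            errors += 1
    return errors / trials

for n in (1, 2, 4, 8, 16, 32):
    print(f"N={n:2d}  error rate ≈ {selection_error(n):.3f}")
```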

[LG-43] Cross-Refine: Improving Natural Language Explanation Generation by Learning in Tandem

链接: https://arxiv.org/abs/2409.07123
作者: Qianli Wang,Tatiana Anikina,Nils Feldhus,Simon Ostermann,Sebastian Möller,Vera Schmitt
关键词-EN: large language model, Natural language explanations, Natural language, language model, large language
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 17 pages; under review

点击查看摘要

Abstract:Natural language explanations (NLEs) are vital for elucidating the reasoning behind large language model (LLM) decisions. Many techniques have been developed to generate NLEs using LLMs. However, like humans, LLMs might not always produce optimal NLEs on first attempt. Inspired by human learning processes, we introduce Cross-Refine, which employs role modeling by deploying two LLMs as generator and critic, respectively. The generator outputs a first NLE and then refines this initial explanation using feedback and suggestions provided by the critic. Cross-Refine does not require any supervised training data or additional training. We validate Cross-Refine across three NLP tasks using three state-of-the-art open-source LLMs through automatic and human evaluation. We select Self-Refine (Madaan et al., 2023) as the baseline, which only utilizes self-feedback to refine the explanations. Our findings from automatic evaluation and a user study indicate that Cross-Refine outperforms Self-Refine. Meanwhile, Cross-Refine can perform effectively with less powerful LLMs, whereas Self-Refine only yields strong results with ChatGPT. Additionally, we conduct an ablation study to assess the importance of feedback and suggestions. Both of them play an important role in refining explanations. We further evaluate Cross-Refine on a bilingual dataset in English and German.

[LG-44] A Continual and Incremental Learning Approach for TinyML On-device Training Using Dataset Distillation and Model Size Adaption

链接: https://arxiv.org/abs/2409.07114
作者: Marcus Rüb,Philipp Tuchel,Axel Sikora,Daniel Mueller-Gritschneder
关键词-EN: Tiny Machine learning, machine learning models, Tiny Machine, context of Tiny, Machine learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:A new algorithm for incremental learning in the context of Tiny Machine learning (TinyML) is presented, which is optimized for low-performance, energy-efficient embedded devices. TinyML is an emerging field that deploys machine learning models on resource-constrained devices such as microcontrollers, enabling intelligent applications like voice recognition, anomaly detection, predictive maintenance, and sensor data processing in environments where traditional machine learning models are not feasible. The algorithm solves the challenge of catastrophic forgetting through the use of knowledge distillation to create a small, distilled dataset. The novelty of the method is that the size of the model can be adjusted dynamically, so that the complexity of the model can be adapted to the requirements of the task. This offers a solution for incremental learning in resource-constrained environments, where both model size and computational efficiency are critical factors. Results show that the proposed algorithm offers a promising approach for TinyML incremental learning on embedded devices. The algorithm was tested on five datasets: CIFAR10, MNIST, CORE50, HAR, and Speech Commands. The findings indicated that, despite using only 43% of Floating Point Operations (FLOPs) compared to a larger fixed model, the algorithm experienced a negligible accuracy loss of just 1%. In addition, the presented method is memory-efficient. While state-of-the-art incremental learning is usually very memory intensive, the method requires only 1% of the original data set.
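
下面给出一个示意性的增量学习草图:用每类仅保留极少量代表样本的“迷你回放缓冲区”近似代替真正的数据集蒸馏,在顺序到来的任务上做增量更新以缓解灾难性遗忘。模型、数据与缓冲区大小均为虚构的玩具设定,仅展示流程形态,并非论文算法本身。

```python
# 示意性草图(虚构的玩具设定):用少量代表样本的回放缓冲区代替真正的数据集蒸馏。
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
classes = np.array([0, 1, 2, 3])
clf = SGDClassifier(random_state=0)

def make_task(class_ids, n=200):
    X = np.vstack([rng.normal(loc=c, scale=0.5, size=(n, 2)) for c in class_ids])
    y = np.repeat(class_ids, n)
    return X, y

replay_X, replay_y = np.empty((0, 2)), np.empty((0,), dtype=int)

for task_classes in ([0, 1], [2, 3]):            # two sequential tasks
    X, y = make_task(task_classes)
    # Mix new-task data with the tiny replay buffer before the incremental update.
    X_mix = np.vstack([X, replay_X]) if len(replay_X) else X
    y_mix = np.concatenate([y, replay_y]) if len(replay_y) else y
    for _ in range(20):
        clf.partial_fit(X_mix, y_mix, classes=classes)
    # Keep only a handful of samples per new class (stand-in for a distilled set).
    for c in task_classes:
        idx = rng.choice(np.where(y == c)[0], size=5, replace=False)
        replay_X = np.vstack([replay_X, X[idx]])
        replay_y = np.concatenate([replay_y, y[idx]])

X_test, y_test = make_task([0, 1, 2, 3], n=100)
print("accuracy on all classes:", clf.score(X_test, y_test))
```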

[LG-45] Advancing On-Device Neural Network Training with TinyPropv2: Dynamic Sparse and Efficient Backpropagation IJCNN

链接: https://arxiv.org/abs/2409.07109
作者: Marcus Rüb,Axel Sikora,Daniel Mueller-Gritschneder
关键词-EN: deep neural networks, low-power microcontroller units, innovative algorithm optimized, study introduces, neural networks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 2024 International Joint Conference on Neural Networks (IJCNN)

点击查看摘要

Abstract:This study introduces TinyPropv2, an innovative algorithm optimized for on-device learning in deep neural networks, specifically designed for low-power microcontroller units. TinyPropv2 refines sparse backpropagation by dynamically adjusting the level of sparsity, including the ability to selectively skip training steps. This feature significantly lowers computational effort without substantially compromising accuracy. Our comprehensive evaluation across diverse datasets (CIFAR10, CIFAR100, Flower, Food, Speech Command, MNIST, HAR, and DCASE2020) reveals that TinyPropv2 achieves near-parity with full training methods, with an average accuracy drop of only around 1 percent in most cases. Against full training, TinyPropv2’s accuracy drop is minimal: only 0.82 percent on CIFAR10 and 1.07 percent on CIFAR100. In terms of computational effort, TinyPropv2 shows a marked reduction, requiring as little as 10 percent of the computational effort needed for full training in some scenarios, and consistently outperforms other sparse training methodologies. These findings underscore TinyPropv2’s capacity to efficiently manage computational resources while maintaining high accuracy, positioning it as an advantageous solution for advanced embedded device applications in the IoT ecosystem.

[LG-46] TrialSynth: Generation of Synthetic Sequential Clinical Trial Data

链接: https://arxiv.org/abs/2409.07089
作者: Chufan Gao,Mandis Beigi,Afrah Shafquat,Jacob Aptekar,Jimeng Sun
关键词-EN: clinical trial data, efficiently bring life-saving, bring life-saving interventions, clinical trial, trial data
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Analyzing data from past clinical trials is part of the ongoing effort to optimize the design, implementation, and execution of new clinical trials and more efficiently bring life-saving interventions to market. While there have been recent advances in the generation of static context synthetic clinical trial data, due to both limited patient availability and constraints imposed by patient privacy needs, the generation of fine-grained synthetic time-sequential clinical trial data has been challenging. Given that patient trajectories over an entire clinical trial are of high importance for optimizing trial design and efforts to prevent harmful adverse events, there is a significant need for the generation of high-fidelity time-sequence clinical trial data. Here we introduce TrialSynth, a Variational Autoencoder (VAE) designed to address the specific challenges of generating synthetic time-sequence clinical trial data. Distinct from related clinical data VAE methods, the core of our method leverages Hawkes Processes (HP), which are particularly well-suited for modeling event-type and time gap prediction needed to capture the structure of sequential clinical trial data. Our experiments demonstrate that TrialSynth surpasses the performance of other comparable methods that can generate sequential clinical trial data, both in terms of fidelity and in enabling the generation of highly accurate event sequences across multiple real-world sequential event datasets with small patient source populations when using minimal external information. Notably, our empirical findings highlight that TrialSynth not only outperforms existing clinical sequence-generating methods but also produces data with superior utility while empirically preserving patient privacy.

[LG-47] Understanding Knowledge Drift in LLMs through Misinformation KDD2024

链接: https://arxiv.org/abs/2409.07085
作者: Alina Fastowski,Gjergji Kasneci
关键词-EN: Large Language Models, Large Language, revolutionized numerous applications, Language Models, digital ecosystem
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 13 pages, 3 figures. Accepted at DELTA workshop at KDD 2024

点击查看摘要

Abstract:Large Language Models (LLMs) have revolutionized numerous applications, making them an integral part of our digital ecosystem. However, their reliability becomes critical, especially when these models are exposed to misinformation. We primarily analyze the susceptibility of state-of-the-art LLMs to factual inaccuracies when they encounter false information in a QnA scenario, an issue that can lead to a phenomenon we refer to as knowledge drift, which significantly undermines the trustworthiness of these models. We evaluate the factuality and the uncertainty of the models’ responses relying on Entropy, Perplexity, and Token Probability metrics. Our experiments reveal that an LLM’s uncertainty can increase up to 56.6% when the question is answered incorrectly due to the exposure to false information. At the same time, repeated exposure to the same false information can decrease the model’s uncertainty again (-52.8% w.r.t. the answers on the untainted prompts), potentially manipulating the underlying model’s beliefs and introducing a drift from its original knowledge. These findings provide insights into LLMs’ robustness and vulnerability to adversarial inputs, paving the way for developing more reliable LLM applications across various domains. The code is available at this https URL.
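
下面给出一个示意性草图,演示如何由每步生成的 token 概率分布计算摘要中提到的三类不确定性度量(熵、困惑度、token 概率)。其中的概率矩阵为虚构示例,与论文的实验设定无关。

```python
# 示意性草图:由(虚构的)每步 token 概率分布计算熵、困惑度与答案 token 概率。
import numpy as np

def answer_uncertainty(step_probs, chosen_ids):
    """step_probs: (T, V) probability distribution per generated token;
    chosen_ids: (T,) ids of the tokens the model actually produced."""
    step_probs = np.asarray(step_probs)
    eps = 1e-12
    entropy = -np.sum(step_probs * np.log(step_probs + eps), axis=1).mean()
    token_p = step_probs[np.arange(len(chosen_ids)), chosen_ids]
    perplexity = float(np.exp(-np.mean(np.log(token_p + eps))))
    return {"entropy": float(entropy),
            "perplexity": perplexity,
            "mean_token_prob": float(token_p.mean())}

# Toy example: 3 generation steps over a 5-token vocabulary.
probs = np.array([[0.7, 0.1, 0.1, 0.05, 0.05],
                  [0.4, 0.3, 0.2, 0.05, 0.05],
                  [0.9, 0.025, 0.025, 0.025, 0.025]])
print(answer_uncertainty(probs, chosen_ids=[0, 1, 0]))
```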

[LG-48] Dynamic Error-Bounded Hierarchical Matrices in Neural Network Compression

链接: https://arxiv.org/abs/2409.07028
作者: John Mango,Ronald Katende
关键词-EN: Physics-Informed Neural Networks, integrates hierarchical matrix, Neural Tangent Kernel, Neural Networks, error-bounded H-matrix compression
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:This paper presents an innovative framework that integrates hierarchical matrix (H-matrix) compression techniques into the structure and training of Physics-Informed Neural Networks (PINNs). By leveraging the low-rank properties of matrix sub-blocks, the proposed dynamic, error-bounded H-matrix compression method significantly reduces computational complexity and storage requirements without compromising accuracy. This approach is rigorously compared to traditional compression techniques, such as Singular Value Decomposition (SVD), pruning, and quantization, demonstrating superior performance, particularly in maintaining the Neural Tangent Kernel (NTK) properties critical for the stability and convergence of neural networks. The findings reveal that H-matrix compression not only enhances training efficiency but also ensures the scalability and robustness of PINNs for complex, large-scale applications in physics-based modeling. This work offers a substantial contribution to the optimization of deep learning models, paving the way for more efficient and practical implementations of PINNs in real-world scenarios.
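
下面给出一个示意性草图,演示 H-矩阵方法的基本构件:对单个低秩矩阵子块做误差受控的截断 SVD 压缩。示例中的核函数与容差均为虚构设定,完整的分层划分与 PINN 训练不在此展示。

```python
# 示意性草图(虚构设定):单个矩阵子块的误差受控低秩压缩,H-矩阵方法的基本构件。
import numpy as np

def compress_block(block, tol=1e-3):
    """Keep the smallest rank whose spectral-norm error stays below tol * ||block||_2."""
    U, s, Vt = np.linalg.svd(block, full_matrices=False)
    threshold = tol * s[0]
    rank = max(1, int(np.sum(s > threshold)))
    return U[:, :rank] * s[:rank], Vt[:rank]      # factors of the low-rank approximation

rng = np.random.default_rng(0)
# A numerically low-rank block: a smooth kernel evaluated on two separated point clusters.
x, y = rng.random(200), 2.0 + rng.random(200)
block = 1.0 / np.abs(x[:, None] - y[None, :])
A, B = compress_block(block, tol=1e-6)
approx = A @ B
print("rank:", A.shape[1], "relative error:",
      np.linalg.norm(block - approx) / np.linalg.norm(block))
```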

[LG-49] CPSample: Classifier Protected Sampling for Guarding Training Data During Diffusion

链接: https://arxiv.org/abs/2409.07025
作者: Joshua Kazdan,Hao Sun,Jiaqi Han,Felix Petersen,Stefano Ermon
关键词-EN: training data, training, data, CPSample, Diffusion
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Diffusion models have a tendency to exactly replicate their training data, especially when trained on small datasets. Most prior work has sought to mitigate this problem by imposing differential privacy constraints or masking parts of the training data, resulting in a notable decrease in image quality. We present CPSample, a method that modifies the sampling process to prevent training data replication while preserving image quality. CPSample utilizes a classifier that is trained to overfit on random binary labels attached to the training data. CPSample then uses classifier guidance to steer the generation process away from the set of points that can be classified with high certainty, a set that includes the training data. CPSample achieves FID scores of 4.97 and 2.97 on CIFAR-10 and CelebA-64, respectively, without producing exact replicates of the training data. Unlike prior methods intended to guard the training images, CPSample only requires training a classifier rather than retraining a diffusion model, which is computationally cheaper. Moreover, our technique provides diffusion models with greater robustness against membership inference attacks, wherein an adversary attempts to discern which images were in the model’s training dataset. We show that CPSample behaves like a built-in rejection sampler, and we demonstrate its ability to prevent mode collapse in Stable Diffusion.
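
下面给出一个高度简化的示意性草图,展示“用一个在随机二值标签上过拟合的分类器做 classifier guidance,把采样推离其高置信度区域”这一思路的形态。其中 denoiser 与 protector 都是虚构的占位小网络,步骤也远比真实扩散采样简化,并非 CPSample 的官方实现。

```python
# 示意性草图(虚构的占位模块与极简步骤),并非 CPSample 官方实现。
import torch
import torch.nn as nn

torch.manual_seed(0)
denoiser = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 16))   # stand-in
protector = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))   # stand-in

def guarded_step(x, scale=0.5):
    x = x.detach().requires_grad_(True)
    # "Certainty" of the protection classifier: distance of its prediction from 0.5.
    certainty = (torch.sigmoid(protector(x)) - 0.5).abs().sum()
    grad = torch.autograd.grad(certainty, x)[0]
    with torch.no_grad():
        x_next = denoiser(x) - scale * grad      # steer away from high-certainty points
    return x_next

x = torch.randn(8, 16)
for _ in range(10):
    x = guarded_step(x)
print(x.shape)
```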

[LG-50] AdvLogo: Adversarial Patch Attack against Object Detectors based on Diffusion Models

链接: https://arxiv.org/abs/2409.07002
作者: Boming Miao,Chunxiao Li,Yao Zhu,Weixiang Sun,Zizhe Wang,Xiaoyi Wang,Chuanlong Xie
关键词-EN: demonstrated impressive performance, deep learning, rapid development, development of deep, demonstrated impressive
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:With the rapid development of deep learning, object detectors have demonstrated impressive performance; however, vulnerabilities still exist in certain scenarios. Current research exploring the vulnerabilities using adversarial patches often struggles to balance the trade-off between attack effectiveness and visual quality. To address this problem, we propose a novel framework of patch attack from a semantic perspective, which we refer to as AdvLogo. Based on the hypothesis that every semantic space contains an adversarial subspace where images can cause detectors to fail in recognizing objects, we leverage the semantic understanding of the diffusion denoising process and drive the process to adversarial subareas by perturbing the latent and unconditional embeddings at the last timestep. To mitigate the distribution shift that negatively affects image quality, we apply perturbation to the latent in the frequency domain with the Fourier Transform. Experimental results demonstrate that AdvLogo achieves strong attack performance while maintaining high visual quality.

[LG-51] Learning Personalized Scoping for Graph Neural Networks under Heterophily

链接: https://arxiv.org/abs/2409.06998
作者: Gangda Deng,Hongkuan Zhou,Rajgopal Kannan,Viktor Prasanna
关键词-EN: aggregating homophilous information, graph neural networks, Heterophilous graphs, dissimilar nodes tend, tend to connect
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

Abstract:Heterophilous graphs, where dissimilar nodes tend to connect, pose a challenge for graph neural networks (GNNs) as their superior performance typically comes from aggregating homophilous information. Increasing the GNN depth can expand the scope (i.e., receptive field), potentially finding homophily from the higher-order neighborhoods. However, uniformly expanding the scope results in subpar performance since real-world graphs often exhibit homophily disparity between nodes. An ideal way is personalized scopes, allowing nodes to have varying scope sizes. Existing methods typically add node-adaptive weights for each hop. Although expressive, they inevitably suffer from severe overfitting. To address this issue, we formalize personalized scoping as a separate scope classification problem that overcomes GNN overfitting in node classification. Specifically, we predict the optimal GNN depth for each node. Our theoretical and empirical analysis suggests that accurately predicting the depth can significantly enhance generalization. We further propose Adaptive Scope (AS), a lightweight MLP-based approach that only participates in GNN inference. AS encodes structural patterns and predicts the depth to select the best model for each node’s prediction. Experimental results show that AS is highly flexible with various GNN architectures across a wide range of datasets while significantly improving accuracy.

[LG-52] What is the Right Notion of Distance between Predict-then-Optimize Tasks?

链接: https://arxiv.org/abs/2409.06997
作者: Paula Rodriguez-Diaz,Lingkai Kong,Kai Wang,David Alvarez-Melis,Milind Tambe
关键词-EN: detecting data drift, Comparing datasets, learning paradigms, data drift, machine learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Comparing datasets is a fundamental task in machine learning, essential for various learning paradigms; from evaluating train and test datasets for model generalization to using dataset similarity for detecting data drift. While traditional notions of dataset distances offer principled measures of similarity, their utility has largely been assessed through prediction error minimization. However, in Predict-then-Optimize (PtO) frameworks, where predictions serve as inputs for downstream optimization tasks, model performance is measured through decision regret minimization rather than prediction error minimization. In this work, we (i) show that traditional dataset distances, which rely solely on feature and label dimensions, lack informativeness in the PtO context, and (ii) propose a new dataset distance that incorporates the impacts of downstream decisions. Our results show that this decision-aware dataset distance effectively captures adaptation success in PtO contexts, providing a PtO adaptation bound in terms of dataset distance. Empirically, we show that our proposed distance measure accurately predicts transferability across three different PtO tasks from the literature.

[LG-53] Enhancing Cross-domain Pre-Trained Decision Transformers with Adaptive Attention

链接: https://arxiv.org/abs/2409.06985
作者: Wenhao Zhao,Qiushui Xu,Linjie Xu,Lei Song,Jinyu Wang,Chunlai Zhou,Jiang Bian
关键词-EN: natural language text, offline reinforcement learning, offline reinforcement, cross-domain pre-training approach, decision transformers
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recently, the pre-training of decision transformers (DT) using a different domain, such as natural language text, has generated significant attention in offline reinforcement learning (Offline RL). Although this cross-domain pre-training approach achieves superior performance compared to training from scratch in environments requiring short-term planning ability, the mechanisms by which pre-training benefits the fine-tuning phase remain unclear. Furthermore, we point out that the cross-domain pre-training approach hinders the extraction of distant information in environments like PointMaze that require long-term planning ability, leading to performance that is much worse than training DT from scratch. This work first analyzes these issues and finds that the Markov Matrix, a component that exists in pre-trained attention heads, is the key to explaining the significant performance disparity of pre-trained models in different planning abilities. Inspired by our analysis, we propose a general method GPT-DTMA, which equips a pre-trained DT with Mixture of Attention (MoA), to enable adaptive learning and accommodate diverse attention requirements during fine-tuning. Extensive experiments demonstrate the effectiveness of GPT-DTMA: it achieves superior performance in short-term environments compared to baselines, and in long-term environments, it mitigates the negative impact caused by the Markov Matrix, achieving results comparable to those of DT trained from scratch.

[LG-54] Policy Filtration in RLHF to Fine-Tune LLM for Code Generation

链接: https://arxiv.org/abs/2409.06957
作者: Wei Shen,Chuheng Zhang
关键词-EN: large language models, Reinforcement learning, human feedback, key techniques, large language
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Reinforcement learning from human feedback (RLHF) is one of the key techniques that helps large language models (LLMs) to follow instructions and provide helpful and harmless responses. While direct policy optimization methods exist, state-of-the-art LLMs adopt RL-based methods (usually PPO) in RLHF to train the policy to generate good responses guided by a reward model learned from preference data. The main challenge of these methods is the inaccuracy of the intermediate reward model, especially in code generation tasks that require long and complex reasoning to score a response. We find that the reliability of the reward model varies across responses assigned with different rewards. This motivates us to filter the samples whose rewards may be unreliable to improve the signal-to-noise ratio during policy learning, resulting in Policy Filtration for Proximal Policy Optimization (PF-PPO). To choose a proper policy filtration strategy for a given reward model, the coefficient of determination (R^2) between rewards and actual scores on filtered samples serves as a good metric and helps us find several promising strategies. We provide extensive experiments to validate the effectiveness of PF-PPO in code generation tasks, and find that some variants of PF-PPO are highly effective and achieve new state-of-the-art performance across 7-billion-parameter models on HumanEval, MBPP, and a new and more challenging LeetCode Contest benchmark.
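
下面给出一个示意性草图,演示如何用“过滤后样本上奖励模型得分与真实得分之间的决定系数 R^2”来比较不同的策略过滤方案(与摘要中选择过滤策略的思路同向)。其中的奖励噪声模型与过滤比例均为虚构设定。

```python
# 示意性草图(虚构的噪声与过滤设定):用过滤后样本上的 R^2 比较过滤策略。
import numpy as np

rng = np.random.default_rng(0)
true_score = rng.random(5000)
# A reward model that is reliable at the extremes but noisy in the middle.
noise = rng.normal(0, 0.3, size=true_score.shape) * (1 - np.abs(true_score - 0.5) * 2)
reward = true_score + noise

def r_squared(x, y):
    resid = y - np.poly1d(np.polyfit(x, y, 1))(x)
    return 1 - resid.var() / y.var()

def filtered_r2(keep_mask):
    return r_squared(reward[keep_mask], true_score[keep_mask])

strategies = {
    "keep all":            np.ones_like(reward, dtype=bool),
    "keep top 20%":        reward >= np.quantile(reward, 0.8),
    "keep top+bottom 10%": (reward >= np.quantile(reward, 0.9)) | (reward <= np.quantile(reward, 0.1)),
}
for name, mask in strategies.items():
    print(f"{name:20s} R^2 = {filtered_r2(mask):.3f}")
```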

[LG-55] Privacy-Preserving Federated Learning with Consistency via Knowledge Distillation Using Conditional Generator

链接: https://arxiv.org/abs/2409.06955
作者: Kangyang Luo,Shuai Wang,Xiang Li,Yunshi Lan,Ming Gao,Jinlong Shu
关键词-EN: distributed learning framework, Federated Learning, shares model parameters, private data locally, distributed learning
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Federated Learning (FL) is gaining popularity as a distributed learning framework that only shares model parameters or gradient updates and keeps private data locally. However, FL is at risk of privacy leakage caused by privacy inference attacks. And most existing privacy-preserving mechanisms in FL conflict with achieving high performance and efficiency. Therefore, we propose FedMD-CG, a novel FL method with highly competitive performance and high-level privacy preservation, which decouples each client’s local model into a feature extractor and a classifier, and utilizes a conditional generator instead of the feature extractor to perform server-side model aggregation. To ensure the consistency of local generators and classifiers, FedMD-CG leverages knowledge distillation to train local models and generators at both the latent feature level and the logit level. Also, we construct additional classification losses and design new diversity losses to enhance client-side training. FedMD-CG is robust to data heterogeneity and does not require training extra discriminators (like cGAN). We conduct extensive experiments on various image classification tasks to validate the superiority of FedMD-CG.

[LG-56] Neural Algorithmic Reasoning with Multiple Correct Solutions

链接: https://arxiv.org/abs/2409.06953
作者: Zeno Kujawa,John Poole,Dobrik Georgiev,Danilo Numeroso,Pietro Liò
关键词-EN: Neural Algorithmic Reasoning, aims to optimize, Neural Algorithmic, NAR train neural, NAR
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Neural Algorithmic Reasoning (NAR) aims to optimize classical algorithms. However, canonical implementations of NAR train neural networks to return only a single solution, even when there are multiple correct solutions to a problem, such as single-source shortest paths. For some applications, it is desirable to recover more than one correct solution. To that end, we give the first method for NAR with multiple solutions. We demonstrate our method on two classical algorithms: Bellman-Ford (BF) and Depth-First Search (DFS), favouring deeper insight into two algorithms over a broader survey of algorithms. This method involves generating appropriate training data as well as sampling and validating solutions from model output. Each step of our method, which can serve as a framework for neural algorithmic reasoning beyond the tasks presented in this paper, might be of independent interest to the field and our results represent the first attempt at this task in the NAR literature.

[LG-57] Representation Tuning

链接: https://arxiv.org/abs/2409.06927
作者: Christopher M. Ackerman
关键词-EN: large language models, increasingly popular, large language, vectors, online control
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注: 9 pages, 6 figures, 6 tables

点击查看摘要

Abstract:Activation engineering is becoming increasingly popular as a means of online control of large language models (LLMs). In this work, I extend the idea of active steering with vectors that represent a behavioral direction of interest to tuning those vectors directly into the model, obviating the need for online control. First, I identify activation vectors related to honesty in an open-source LLM (Llama-2-13b-chat). Next, I demonstrate that model output can be made more or less honest by adding positive or negative multiples of these vectors to residual stream activations during generation. Then, I show that a similar effect can be achieved by fine-tuning the vectors directly into the model, by use of a dual loss function based on the cosine similarity of residual stream activations to the vectors combined with a standard token-based loss (“representation tuning”). Finally, I compare the generations in response to honesty-probing prompts from the resulting models to those from models fine-tuned with a token-based loss alone, and to those from the untuned model subjected to online steering. Overall, fine-tuning the vectors into the models using the cosine similarity plus token loss showed a stronger effect than online steering, and generalized better than using the standard loss, suggesting the potential utility of this approach as a safety measure. Code and data are available at this https URL; tuned models are available at this https URL representation-tuning-66da1e5ab41cd1b824687d9f.
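
下面给出一个示意性草图,展示摘要中“余弦相似度 + 标准 token 损失”的双重损失的一种可能写法:把选定层残差流激活与行为方向向量的余弦相似度作为附加项。张量形状、honesty_vec 等均为虚构占位,并非该工作的官方实现。

```python
# 示意性草图(虚构的张量形状与占位向量),展示 dual loss 的一种写法,并非官方实现。
import torch
import torch.nn.functional as F

def dual_loss(hidden_states, logits, target_ids, honesty_vec, alpha=1.0):
    """hidden_states: (B, T, D) residual-stream activations from a chosen layer;
    logits: (B, T, V); target_ids: (B, T); honesty_vec: (D,) behavioral direction."""
    token_loss = F.cross_entropy(logits.flatten(0, 1), target_ids.flatten())
    cos = F.cosine_similarity(hidden_states, honesty_vec.view(1, 1, -1), dim=-1)
    # Push activations toward the behavioral direction (maximize cosine similarity).
    return token_loss + alpha * (1.0 - cos.mean())

B, T, D, V = 2, 5, 8, 11
torch.manual_seed(0)
loss = dual_loss(torch.randn(B, T, D), torch.randn(B, T, V),
                 torch.randint(V, (B, T)), torch.randn(D))
print(loss.item())
```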

[LG-58] Applied Federated Model Personalisation in the Industrial Domain: A Comparative Study

链接: https://arxiv.org/abs/2409.06904
作者: Ilias Siniosoglou,Vasileios Argyriou,George Fragulis,Panagiotis Fouliras,Georgios Th. Papadopoulos,Anastasios Lytos,Panagiotis Sarigiannidis
关键词-EN: deploying complicated Machine, Machine and Deep, complicated Machine, Machine Learning, Deep Learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Performance (cs.PF)
*备注:

点击查看摘要

Abstract:The time-consuming nature of training and deploying complicated Machine and Deep Learning (DL) models for a variety of applications continues to pose significant challenges in the field of Machine Learning (ML). These challenges are particularly pronounced in the federated domain, where optimizing models for individual nodes poses significant difficulty. Many methods have been developed to tackle this problem, aiming to reduce training expenses and time while maintaining efficient optimisation. Three suggested strategies to tackle this challenge include Active Learning, Knowledge Distillation, and Local Memorization. These methods enable the adoption of smaller models that require fewer computational resources and allow for model personalization with local insights, thereby improving the effectiveness of current models. The present study delves into the fundamental principles of these three approaches and proposes an advanced Federated Learning System that utilises different Personalisation methods towards improving the accuracy of AI models and enhancing user experience in real-time NG-IoT applications, investigating the efficacy of these techniques in the local and federated domain. The results of the original and optimised models are then compared in both local and federated contexts using a comparison analysis. The post-analysis shows encouraging outcomes when it comes to optimising and personalising the models with the suggested techniques.

[LG-59] Semi-Supervised Reward Modeling via Iterative Self-Training

链接: https://arxiv.org/abs/2409.06903
作者: Yifei He,Haoxiang Wang,Ziyan Jiang,Alexandros Papangelis,Han Zhao
关键词-EN: Reinforcement Learning, Human Feedback, role in Reinforcement, Learning with Human, align pretrained large
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reward models (RM) capture the values and preferences of humans and play a central role in Reinforcement Learning with Human Feedback (RLHF) to align pretrained large language models (LLMs). Traditionally, training these models relies on extensive human-annotated preference data, which poses significant challenges in terms of scalability and cost. To overcome these limitations, we propose Semi-Supervised Reward Modeling (SSRM), an approach that enhances RM training using unlabeled data. Given an unlabeled dataset, SSRM involves three key iterative steps: pseudo-labeling unlabeled examples, selecting high-confidence examples through a confidence threshold, and supervised finetuning on the refined dataset. Across extensive experiments on various model configurations, we demonstrate that SSRM significantly improves reward models without incurring additional labeling costs. Notably, SSRM can achieve performance comparable to models trained entirely on labeled data of equivalent volumes. Overall, SSRM substantially reduces the dependency on large volumes of human-annotated data, thereby decreasing the overall cost and time involved in training effective reward models.
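
下面给出一个示意性草图,演示半监督自训练的三步迭代:伪标注、按置信度阈值筛选、再做监督训练。这里用 LogisticRegression 充当“奖励模型”的占位,数据为虚构的玩具设定,仅展示 SSRM 的流程形态。

```python
# 示意性草图(虚构的玩具设定):伪标注 -> 置信度筛选 -> 监督训练 的自训练循环。
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
def make_data(n):
    X = rng.normal(size=(n, 5))
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.5, n) > 0).astype(int)
    return X, y

X_lab, y_lab = make_data(100)          # small labeled preference set
X_unlab, _ = make_data(5000)           # large unlabeled pool
X_test, y_test = make_data(2000)

model = LogisticRegression().fit(X_lab, y_lab)
for it in range(3):
    proba = model.predict_proba(X_unlab)
    conf = proba.max(axis=1)
    keep = conf > 0.9                                  # confidence threshold
    pseudo_y = proba.argmax(axis=1)[keep]
    X_aug = np.vstack([X_lab, X_unlab[keep]])
    y_aug = np.concatenate([y_lab, pseudo_y])
    model = LogisticRegression().fit(X_aug, y_aug)     # supervised fine-tuning step
    print(f"iter {it}: kept {keep.sum()} pseudo-labels, "
          f"test acc = {model.score(X_test, y_test):.3f}")
```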

[LG-60] Mazed and Confused: A Dataset of Cybersickness, Working Memory, Mental Load, Physical Load, and Attention During a Real Walking Task in VR

链接: https://arxiv.org/abs/2409.06898
作者: Jyotirmay Nag Setu,Joshua M Le,Ripan Kumar Kundu,Barry Giesbrecht,Tobias Höllerer,Khaza Anuarul Hoque,Kevin Desai,John Quarles
关键词-EN: Virtual Reality, multiple complex cognitive, physical activities, quickly establishing, frequently required
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Virtual Reality (VR) is quickly establishing itself in various industries, including training, education, medicine, and entertainment, in which users are frequently required to carry out multiple complex cognitive and physical activities. However, the relationship between cognitive activities, physical activities, and familiar feelings of cybersickness is not well understood and thus can be unpredictable for developers. Researchers have previously provided labeled datasets for predicting cybersickness while users are stationary, but there have been few labeled datasets on cybersickness while users are physically walking. Thus, from 39 participants, we collected head orientation, head position, eye tracking, images, physiological readings from external sensors, and the self-reported cybersickness severity, physical load, and mental load in VR. Throughout the data collection, participants navigated mazes via real walking and performed tasks challenging their attention and working memory. To demonstrate the dataset’s utility, we conducted a case study of training classifiers in which we achieved 95% accuracy for cybersickness severity classification. The noteworthy performance of the straightforward classifiers makes this dataset ideal for future researchers to develop cybersickness detection and reduction models. To better understand the features that helped with classification, we performed SHAP(SHapley Additive exPlanations) analysis, highlighting the importance of eye tracking and physiological measures for cybersickness prediction while walking. This open dataset can allow future researchers to study the connection between cybersickness and cognitive loads and develop prediction models. This dataset will empower future VR developers to design efficient and effective Virtual Environments by improving cognitive load management and minimizing cybersickness.

[LG-61] Enhanced Pix2Pix GAN for Visual Defect Removal in UAV-Captured Images

链接: https://arxiv.org/abs/2409.06889
作者: Volodymyr Rizun
关键词-EN: removes visual defects, paper presents, presents a neural, neural network, effectively removes visual
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Prepared for IEEE APUAVD 2024 conference

点击查看摘要

Abstract:This paper presents a neural network that effectively removes visual defects from UAV-captured images. It features an enhanced Pix2Pix GAN, specifically engineered to address visual defects in UAV imagery. The method incorporates advanced modifications to the Pix2Pix architecture, targeting prevalent issues such as mode collapse. The suggested method facilitates significant improvements in the quality of defected UAV images, yielding cleaner and more precise visual results. The effectiveness of the proposed approach is demonstrated through evaluation on a custom dataset of aerial photographs, highlighting its capability to refine and restore UAV imagery effectively.

[LG-62] A Dataset for Evaluating LLM-based Evaluation Functions for Research Question Extraction Task

链接: https://arxiv.org/abs/2409.06883
作者: Yuya Fujisaki,Shiro Takagi,Hideki Asoh,Wataru Kumagai
关键词-EN: text summarization techniques, progress in text, task, text summarization, summarization techniques
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The progress in text summarization techniques has been remarkable. However, the task of accurately extracting and summarizing necessary information from highly specialized documents such as research papers has not been sufficiently investigated. We focus on the task of extracting research questions (RQ) from research papers and construct a new dataset consisting of machine learning papers, RQ extracted from these papers by GPT-4, and human evaluations of the extracted RQ from multiple perspectives. Using this dataset, we systematically compared recently proposed LLM-based evaluation functions for summarization, and found that none of the functions showed sufficiently high correlations with human evaluations. We expect our dataset to provide a foundation for further research on developing better evaluation functions tailored to the RQ extraction task, and to contribute to enhancing performance on the task. The dataset is available at this https URL.

[LG-63] The Competition Complexity of Prophet Inequalities with Correlations

链接: https://arxiv.org/abs/2409.06868
作者: Tomer Ezra,Tamar Garbuz
关键词-EN: prophet inequality problem, resource augmentation framework, rewards, additional rewards, number of additional
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
*备注:

点击查看摘要

Abstract:We initiate the study of the prophet inequality problem through the resource augmentation framework in scenarios when the values of the rewards are correlated. Our goal is to determine the number of additional rewards an online algorithm requires to approximate the maximum value of the original instance. While the independent reward case is well understood, we extend this research to account for correlations among rewards. Our results demonstrate that, unlike in the independent case, the required number of additional rewards for approximation depends on the number of original rewards, and that block-threshold algorithms, which are optimal in the independent case, may require an infinite number of additional rewards when correlations are present. We develop asymptotically optimal algorithms for the following three scenarios: (1) where rewards arrive in blocks corresponding to the different copies of the original instance; (2) where rewards across all copies are arbitrarily shuffled; and (3) where rewards arrive in blocks corresponding to the different copies of the original instance, and values within each block are pairwise independent rather than fully correlated.

[LG-64] Towards Understanding Human Emotional Fluctuations with Sparse Check-In Data

链接: https://arxiv.org/abs/2409.06863
作者: Sagar Paresh Shah,Ga Wu,Sean W. Kortschot,Samuel Daviau
关键词-EN: key challenge limiting, key challenge, challenge limiting, limiting the power, Data
类目: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:Data sparsity is a key challenge limiting the power of AI tools across various domains. The problem is especially pronounced in domains that require active user input rather than measurements derived from automated sensors. It is a critical barrier to harnessing the full potential of AI in domains requiring active user engagement, such as self-reported mood check-ins, where capturing a continuous picture of emotional states is essential. In this context, sparse data can hinder efforts to capture the nuances of individual emotional experiences such as causes, triggers, and contributing factors. Existing methods for addressing data scarcity often rely on heuristics or large established datasets, favoring deep learning models that lack adaptability to new domains. This paper proposes a novel probabilistic framework that integrates user-centric feedback-based learning, allowing for personalized predictions despite limited data. Achieving 60% accuracy in predicting user states among 64 options (chance of 1/64), this framework effectively mitigates data sparsity. It is versatile across various applications, bridging the gap between theoretical AI research and practical deployment.

[LG-65] Stratospheric aerosol source inversion: Noise variability and uncertainty quantification

链接: https://arxiv.org/abs/2409.06846
作者: J. Hart,I. Manickam,M. Gulian,L. Swiler,D. Bull,T. Ehrmann,H. Brown,B. Wagman,J. Watkins
关键词-EN: earth system model, earth system, months to years, Stratospheric aerosols play, Exascale Earth System
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:Stratospheric aerosols play an important role in the earth system and can affect the climate on timescales of months to years. However, estimating the characteristics of partially observed aerosol injections, such as those from volcanic eruptions, is fraught with uncertainties. This article presents a framework for stratospheric aerosol source inversion which accounts for background aerosol noise and earth system internal variability via a Bayesian approximation error approach. We leverage specially designed earth system model simulations using the Energy Exascale Earth System Model (E3SM). A comprehensive framework for data generation, data processing, dimension reduction, operator learning, and Bayesian inversion is presented where each component of the framework is designed to address particular challenges in stratospheric modeling on the global scale. We present numerical results using synthesized observational data to rigorously assess the ability of our approach to estimate aerosol sources and associate uncertainty with those estimates.

[LG-66] Atom dimension adaptation for infinite set dictionary learning

链接: https://arxiv.org/abs/2409.06831
作者: Andra Băltoiu,Denis C. Ilie-Ablachim,Bogdan Dumitrescu
关键词-EN: Recent work, shown benefits, dictionary learning, cone dictionary learning, Recent
类目: Machine Learning (cs.LG)
*备注: 16 pages, 3 figures

点击查看摘要

Abstract:Recent work on dictionary learning with set-atoms has shown benefits in anomaly detection. Instead of viewing an atom as a single vector, these methods allow building sparse representations with atoms taken from a set around a central vector; the set can be a cone or may have a probability distribution associated to it. We propose a method for adaptively adjusting the size of set-atoms in Gaussian and cone dictionary learning. The purpose of the algorithm is to match the atom sizes with their contribution in representing the signals. The proposed algorithm not only decreases the representation error, but also improves anomaly detection, for a class of anomalies called `dependency’. We obtain better detection performance than state-of-the-art methods.

[LG-67] Noisy Early Stopping for Noisy Labels

链接: https://arxiv.org/abs/2409.06830
作者: William Toner,Amos Storkey
关键词-EN: Early Stopping, neural network classifiers, implementing Early Stopping, labels significantly increases, Training neural network
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Training neural network classifiers on datasets contaminated with noisy labels significantly increases the risk of overfitting. Thus, effectively implementing Early Stopping in noisy label environments is crucial. Under ideal circumstances, Early Stopping utilises a validation set uncorrupted by label noise to effectively monitor generalisation during training. However, a noise-free validation dataset can be costly and challenging to obtain. This study establishes that, in many typical learning environments, a noise-free validation set is not necessary for effective Early Stopping. Instead, near-optimal results can be achieved by monitoring accuracy on a noisy dataset - drawn from the same distribution as the noisy training set. Referred to as 'Noisy Early Stopping' (NES), this method simplifies and reduces the cost of implementing Early Stopping. We provide theoretical insights into the conditions under which this method is effective and empirically demonstrate its robust performance across standard benchmarks using common loss functions.
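
下面给出一个示意性草图,演示“用与训练集同分布的带噪验证集监控准确率并据此早停”(即 NES 思路)的训练循环。模型、噪声比例与耐心值均为虚构的玩具设定。

```python
# 示意性草图(虚构的玩具设定):在带噪验证集上监控准确率的早停循环。
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 20))
y_clean = (X[:, :3].sum(axis=1) > 0).astype(int)
flip = rng.random(len(y_clean)) < 0.3                 # 30% symmetric label noise
y_noisy = np.where(flip, 1 - y_clean, y_clean)

X_tr, X_val, y_tr, y_val = train_test_split(X, y_noisy, test_size=0.2, random_state=0)

model = MLPClassifier(hidden_layer_sizes=(64,), max_iter=1, warm_start=True,
                      random_state=0)
best_acc, best_epoch, patience, waited = 0.0, 0, 5, 0
for epoch in range(100):
    model.fit(X_tr, y_tr)                             # one epoch per call (warm_start)
    acc = model.score(X_val, y_val)                   # noisy validation accuracy
    if acc > best_acc:
        best_acc, best_epoch, waited = acc, epoch, 0
    else:
        waited += 1
        if waited >= patience:
            break
print(f"stopped after epoch {epoch}, best noisy-val acc {best_acc:.3f} at epoch {best_epoch}")
```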

[LG-68] Bifurcation Identification for Ultrasound-driven Robotic Cannulation

链接: https://arxiv.org/abs/2409.06817
作者: Cecilia G. Morales,Dhruv Srikanth,Jack H. Good,Keith A. Dufendach,Artur Dubrawski
关键词-EN: critical care settings, precise intravascular access, care settings, rapid and precise, patients’ survival
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In trauma and critical care settings, rapid and precise intravascular access is key to patients’ survival. Our research aims at ensuring this access, even when skilled medical personnel are not readily available. Vessel bifurcations are anatomical landmarks that can guide the safe placement of catheters or needles during medical procedures. Although ultrasound is advantageous in navigating anatomical landmarks in emergency scenarios due to its portability and safety, to our knowledge no existing algorithm can autonomously extract vessel bifurcations using ultrasound images. This is primarily due to the limited availability of ground truth data, in particular, data from live subjects, needed for training and validating reliable models. Researchers often resort to using data from anatomical phantoms or simulations. We introduce BIFURC, Bifurcation Identification for Ultrasound-driven Robot Cannulation, a novel algorithm that identifies vessel bifurcations and provides optimal needle insertion sites for an autonomous robotic cannulation system. BIFURC integrates expert knowledge with deep learning techniques to efficiently detect vessel bifurcations within the femoral region and can be trained on a limited amount of in-vivo data. We evaluated our algorithm using a medical phantom as well as real-world experiments involving live pigs. In all cases, BIFURC consistently identified bifurcation points and needle insertion locations in alignment with those identified by expert clinicians.

[LG-69] Personalized Federated Learning Techniques: Empirical Analysis

链接: https://arxiv.org/abs/2409.06805
作者: Azal Ahmad Khan,Ahmad Faraz Khan,Haider Ali,Ali Anwar
关键词-EN: holds immense promise, Personalized Federated Learning, tailoring machine learning, preserving data privacy, Personalized Federated
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Personalized Federated Learning (pFL) holds immense promise for tailoring machine learning models to individual users while preserving data privacy. However, achieving optimal performance in pFL often requires a careful balancing act between memory overhead costs and model accuracy. This paper delves into the trade-offs inherent in pFL, offering valuable insights for selecting the right algorithms for diverse real-world scenarios. We empirically evaluate ten prominent pFL techniques across various datasets and data splits, uncovering significant differences in their performance. Our study reveals interesting insights into how pFL methods that utilize personalized (local) aggregation exhibit the fastest convergence due to their efficiency in communication and computation. Conversely, fine-tuning methods face limitations in handling data heterogeneity and potential adversarial attacks while multi-objective learning methods achieve higher accuracy at the cost of additional training and resource consumption. Our study emphasizes the critical role of communication efficiency in scaling pFL, demonstrating how it can significantly affect resource usage in real-world deployments.

[LG-70] Adaptive Meta-Domain Transfer Learning (AMDTL): A Novel Approach for Knowledge Transfer in AI

链接: https://arxiv.org/abs/2409.06800
作者: Michele Laurelli
关键词-EN: paper presents Adaptive, presents Adaptive Meta-Domain, Adaptive Meta-Domain Transfer, artificial intelligence models, Meta-Domain Transfer Learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper presents Adaptive Meta-Domain Transfer Learning (AMDTL), a novel methodology that combines principles of meta-learning with domain-specific adaptations to enhance the transferability of artificial intelligence models across diverse and unknown domains. AMDTL aims to address the main challenges of transfer learning, such as domain misalignment, negative transfer, and catastrophic forgetting, through a hybrid framework that emphasizes both generalization and contextual specialization. The framework integrates a meta-learner trained on a diverse distribution of tasks, adversarial training techniques for aligning domain feature distributions, and dynamic feature regulation mechanisms based on contextual domain embeddings. Experimental results on benchmark datasets demonstrate that AMDTL outperforms existing transfer learning methodologies in terms of accuracy, adaptation efficiency, and robustness. This research provides a solid theoretical and practical foundation for the application of AMDTL in various fields, opening new perspectives for the development of more adaptable and inclusive AI systems.

[LG-71] Adversarial Attacks to Multi-Modal Models

链接: https://arxiv.org/abs/2409.06793
作者: Zhihao Dou,Xin Hu,Haibo Yang,Zhuqing Liu,Minghong Fang
关键词-EN: gained significant attention, significant attention due, powerful capabilities, gained significant, significant attention
类目: Cryptography and Security (cs.CR); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: To appear in the ACM Workshop on Large AI Systems and Models with Privacy and Safety Analysis 2024 (LAMPS '24)

点击查看摘要

Abstract:Multi-modal models have gained significant attention due to their powerful capabilities. These models effectively align embeddings across diverse data modalities, showcasing superior performance in downstream tasks compared to their unimodal counterparts. A recent study showed that an attacker can manipulate an image or audio file by altering it in such a way that its embedding matches that of an attacker-chosen targeted input, thereby deceiving downstream models. However, this method often underperforms due to inherent disparities in data from different modalities. In this paper, we introduce CrossFire, an innovative approach to attacking multi-modal models. CrossFire begins by transforming the targeted input chosen by the attacker into a format that matches the modality of the original image or audio file. We then formulate our attack as an optimization problem, aiming to minimize the angular deviation between the embeddings of the transformed input and the modified image or audio file. Solving this problem determines the perturbations to be added to the original media. Our extensive experiments on six real-world benchmark datasets reveal that CrossFire can significantly manipulate downstream tasks, surpassing existing attacks. Additionally, we evaluate six defensive strategies against CrossFire, finding that current defenses are insufficient to counteract CrossFire.
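
A minimal, hedged sketch of the core embedding-matching idea described in this abstract: perturb an input so that a surrogate encoder's embedding aligns with an attacker-chosen target embedding by minimizing angular deviation. The encoder here is a toy linear layer standing in for a real VLP model, not the paper's actual pipeline.

```python
import torch

# Hedged sketch (not the paper's code): optimize a small perturbation so the
# surrogate encoder's embedding of (x + delta) aligns with a target embedding.
torch.manual_seed(0)
encoder = torch.nn.Linear(32, 8)        # stand-in surrogate encoder (assumption)
x = torch.randn(32)                     # original media, flattened
target_emb = torch.randn(8)             # embedding of the attacker-chosen target

delta = torch.zeros_like(x, requires_grad=True)
opt = torch.optim.Adam([delta], lr=0.05)
for _ in range(200):
    emb = encoder(x + delta)
    loss = 1.0 - torch.cosine_similarity(emb, target_emb, dim=0)  # angular-deviation proxy
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():
        delta.clamp_(-0.1, 0.1)         # keep the perturbation small
print("final angular-deviation proxy:", float(loss))
```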

[LG-72] Human Motion Synthesis: A Diffusion Approach for Motion Stitching and In-Betweening

链接: https://arxiv.org/abs/2409.06791
作者: Michael Adewole,Oluwaseyi Giwa,Favour Nerrise,Martins Osifeko,Ajibola Oyedeji
关键词-EN: Human motion generation, important area, area of research, Human motion, Frechet Inception Distance
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: 12 pages, 5 figures, and 11 equations

点击查看摘要

Abstract:Human motion generation is an important area of research in many fields. In this work, we tackle the problem of motion stitching and in-betweening. Current methods either require manual efforts, or are incapable of handling longer sequences. To address these challenges, we propose a diffusion model with a transformer-based denoiser to generate realistic human motion. Our method demonstrated strong performance in generating in-betweening sequences, transforming a variable number of input poses into smooth and realistic motion sequences consisting of 75 frames at 15 fps, resulting in a total duration of 5 seconds. We present the performance evaluation of our method using quantitative metrics such as Frechet Inception Distance (FID), Diversity, and Multimodality, along with visual assessments of the generated outputs.

[LG-73] Beyond designer's knowledge: Generating materials design hypotheses via large language models

链接: https://arxiv.org/abs/2409.06756
作者: Quanliang Liu,Maciej P. Polak,So Yeon Kim,MD Al Amin Shuvo,Hrishikesh Shridhar Deodhar,Jeongsoo Han,Dane Morgan,Hyunseok Oh
关键词-EN: process inherently limited, extract knowledge implications, expertise is required, inherently limited, relies on human-generated
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Materials design often relies on human-generated hypotheses, a process inherently limited by cognitive constraints such as knowledge gaps and limited ability to integrate and extract knowledge implications, particularly when multidisciplinary expertise is required. This work demonstrates that large language models (LLMs), coupled with prompt engineering, can effectively generate non-trivial materials hypotheses by integrating scientific principles from diverse sources without explicit design guidance by human experts. These include design ideas for high-entropy alloys with superior cryogenic properties and halide solid electrolytes with enhanced ionic conductivity and formability. These design ideas have been experimentally validated in high-impact publications in 2023 not available in the LLM training data, demonstrating the LLM’s ability to generate highly valuable and realizable innovative ideas not established in the literature. Our approach primarily leverages materials system charts encoding processing-structure-property relationships, enabling more effective data integration by condensing key information from numerous papers, and evaluation and categorization of numerous hypotheses for human cognition, both through the LLM. This LLM-driven approach opens the door to new avenues of artificial intelligence-driven materials discovery by accelerating design, democratizing innovation, and expanding capabilities beyond the designer’s direct knowledge.

[LG-74] Scaling Law Hypothesis for Multimodal Model

链接: https://arxiv.org/abs/2409.06754
作者: Qingyun Sun,Zhen Guo
关键词-EN: models processing text, scaling law hypothesis, processing text, embedding space, multimodal models processing
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We propose a scaling law hypothesis for multimodal models processing text, audio, images, and video within a shared token and embedding space. Our framework predicts model performance based on modality-specific compression and tokenization efficiency, extending established scaling laws from text-based decoder models to mixed-modality systems. We explore whether leveraging more training data in multiple modalities can reduce the size of the multimodal model, enabling efficient deployment on resource-constrained devices.

[LG-75] A tutorial on automatic differentiation with complex numbers

链接: https://arxiv.org/abs/2409.06752
作者: Nicholas Krämer
关键词-EN: mathbb, arithmetic beyond stating, shallow references, Wirtinger calculus, exists only minimal
类目: Mathematical Software (cs.MS); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:Automatic differentiation is everywhere, but there exists only minimal documentation of how it works in complex arithmetic beyond stating "derivatives in $\mathbb{C}^d$" $\cong$ "derivatives in $\mathbb{R}^{2d}$" and, at best, shallow references to Wirtinger calculus. Unfortunately, the equivalence $\mathbb{C}^d \cong \mathbb{R}^{2d}$ becomes insufficient as soon as we need to derive custom gradient rules, e.g., to avoid differentiating "through" expensive linear algebra functions or differential equation simulators. To combat such a lack of documentation, this article surveys forward- and reverse-mode automatic differentiation with complex numbers, covering topics such as Wirtinger derivatives, a modified chain rule, and different gradient conventions while explicitly avoiding holomorphicity and the Cauchy–Riemann equations (which would be far too restrictive). To be precise, we will derive, explain, and implement a complex version of Jacobian-vector and vector-Jacobian products almost entirely with linear algebra without relying on complex analysis or differential geometry. This tutorial is a call to action, for users and developers alike, to take complex values seriously when implementing custom gradient propagation rules – the manuscript explains how.
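
As a hedged illustration of the Wirtinger-derivative idea mentioned in the abstract (not the tutorial's own code), the sketch below computes a Jacobian-vector product for the non-holomorphic map f(z) = z * conj(z) and checks it against a finite difference; the choice of f and the test point are assumptions.

```python
import numpy as np

# Wirtinger-style JVP for f(z) = z * conj(z) = |z|^2, using
# df = (df/dz) * dz + (df/dzbar) * conj(dz), checked by finite differences.

def f(z):
    return z * np.conj(z)          # real-valued but non-holomorphic

def jvp_f(z, dz):
    # Wirtinger derivatives of f: df/dz = conj(z), df/dzbar = z
    return np.conj(z) * dz + z * np.conj(dz)

z, dz = 1.0 + 2.0j, 0.3 - 0.1j
eps = 1e-6
fd = (f(z + eps * dz) - f(z - eps * dz)) / (2 * eps)   # directional derivative along dz
print(jvp_f(z, dz), fd)            # both approximately 2 * Re(conj(z) * dz)
```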

[LG-76] The Weak Form Is Stronger Than You Think

链接: https://arxiv.org/abs/2409.06751
作者: Daniel A. Messenger,April Tran,Vanja Dukic,David M. Bortz
关键词-EN: widely-utilized mathematical tool, weak form, applied mathematics, tool in modern, Machine Learning
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Numerical Analysis (math.NA); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The weak form is a ubiquitous, well-studied, and widely-utilized mathematical tool in modern computational and applied mathematics. In this work we provide a survey of both the history and recent developments for several fields in which the weak form can play a critical role. In particular, we highlight several recent advances in weak form versions of equation learning, parameter estimation, and coarse graining, which offer surprising noise robustness, accuracy, and computational efficiency. We note that this manuscript is a companion piece to our October 2024 SIAM News article of the same name. Here we provide more detailed explanations of mathematical developments as well as a more complete list of references. Lastly, we note that the software with which to reproduce the results in this manuscript is also available on our group’s GitHub website this https URL.
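
A minimal sketch, under assumed toy settings, of why weak-form parameter estimation is noise-robust: testing the data against a compactly supported function and integrating by parts removes the need to differentiate noisy samples. The ODE, test function, and noise level below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Recover lambda in u' = -lambda*u from noisy samples using the weak form:
#   integral(phi' * u) dt = lambda * integral(phi * u) dt,
# where phi (and phi') vanish at both endpoints.

rng = np.random.default_rng(0)
lam_true = 1.5
t = np.linspace(0.0, 2.0, 401)
dt = t[1] - t[0]
u = np.exp(-lam_true * t) + 0.01 * rng.normal(size=t.size)   # noisy observations

phi = (t * (2.0 - t)) ** 2                    # compactly supported test function
dphi = 2.0 * (t * (2.0 - t)) * (2.0 - 2.0 * t)

lhs = np.sum(dphi * u) * dt                   # approximates integral(phi' * u)
rhs = np.sum(phi * u) * dt                    # approximates integral(phi * u)
print("estimated lambda:", lhs / rhs, "true:", lam_true)
```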

[LG-77] Can Agents Spontaneously Form a Society? Introducing a Novel Architecture for Generative Multi-Agents to Elicit Social Emergence

链接: https://arxiv.org/abs/2409.06750
作者: H. Zhang,J. Yin,M. Jiang,C. Su
关键词-EN: demonstrated impressive capabilities, social interactions, specific tasks, independent tasks, framework called LTRHA
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: 13 pages, 8 figures

点击查看摘要

Abstract:Generative agents have demonstrated impressive capabilities in specific tasks, but most of these frameworks focus on independent tasks and lack attention to social interactions. We introduce a generative agent architecture called ITCMA-S, which includes a basic framework for individual agents and a framework called LTRHA that supports social interactions among multi-agents. This architecture enables agents to identify and filter out behaviors that are detrimental to social interactions, guiding them to choose more favorable actions. We designed a sandbox environment to simulate the natural evolution of social relationships among multiple identity-less agents for experimental evaluation. The results showed that ITCMA-S performed well on multiple evaluation indicators, demonstrating its ability to actively explore the environment, recognize new agents, and acquire new information through continuous actions and dialogue. Observations show that as agents establish connections with each other, they spontaneously form cliques with internal hierarchies around a selected leader and organize collective activities.

[LG-78] EasyST: A Simple Framework for Spatio-Temporal Prediction CIKM’2024

链接: https://arxiv.org/abs/2409.06748
作者: Jiabin Tang,Wei Wei,Lianghao Xia,Chao Huang
关键词-EN: crucial research area, public safety, Graph Neural Networks, implications for transportation, environmental monitoring
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted by CIKM’2024, full paper

点击查看摘要

Abstract:Spatio-temporal prediction is a crucial research area in data-driven urban computing, with implications for transportation, public safety, and environmental monitoring. However, scalability and generalization challenges remain significant obstacles. Advanced models often rely on Graph Neural Networks to encode spatial and temporal correlations, but struggle with the increased complexity of large-scale datasets. The recursive GNN-based message passing schemes used in these models hinder their training and deployment in real-life urban sensing scenarios. Moreover, long-spanning large-scale spatio-temporal data introduce distribution shifts, necessitating improved generalization performance. To address these challenges, we propose a simple framework for spatio-temporal prediction - EasyST paradigm. It learns lightweight and robust Multi-Layer Perceptrons (MLPs) by effectively distilling knowledge from complex spatio-temporal GNNs. We ensure robust knowledge distillation by integrating the spatio-temporal information bottleneck with teacher-bounded regression loss, filtering out task-irrelevant noise and avoiding erroneous guidance. We further enhance the generalization ability of the student model by incorporating spatial and temporal prompts to provide downstream task contexts. Evaluation on three spatio-temporal datasets for urban computing tasks demonstrates that EasyST surpasses state-of-the-art approaches in terms of efficiency and accuracy. The implementation code is available at: this https URL.

[LG-79] Distributed Cooperative AI for Large-Scale Eigenvalue Computations Using Neural Networks

链接: https://arxiv.org/abs/2409.06746
作者: Ronald Katende
关键词-EN: distributed cooperative neural, neural network framework, paper presents, distributed cooperative, cooperative neural network
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:This paper presents a novel method for eigenvalue computation using a distributed cooperative neural network framework. Unlike traditional techniques that struggle with scalability in large systems, our decentralized algorithm enables multiple autonomous agents to collaboratively estimate the smallest eigenvalue of large matrices. Each agent uses a localized neural network model, refining its estimates through inter-agent communication. Our approach guarantees convergence to the true eigenvalue, even with communication failures or network disruptions. Theoretical analysis confirms the robustness and accuracy of the method, while empirical results demonstrate better performance than some traditional centralized algorithms.

[LG-80] Personalized Knowledge Tracing through Student Representation Reconstruction and Class Imbalance Mitigation

链接: https://arxiv.org/abs/2409.06745
作者: Zhiyu Chen,Wei Ji,Jing Xiao,Zitao Liu
关键词-EN: predicts students’ future, students’ future performance, enabling a precise, students’ future, analyzing their learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:Knowledge tracing is a technique that predicts students’ future performance by analyzing their learning process through historical interactions with intelligent educational platforms, enabling a precise evaluation of their knowledge mastery. Recent studies have achieved significant progress by leveraging powerful deep neural networks. These models construct complex input representations using questions, skills, and other auxiliary information but overlook individual student characteristics, which limits the capability for personalized assessment. Additionally, the available datasets in the field exhibit class imbalance issues. The models that simply predict all responses as correct without substantial effort can yield impressive accuracy. In this paper, we propose PKT, a novel approach for personalized knowledge tracing. PKT reconstructs representations from sequences of interactions with a tutoring platform to capture latent information about the students. Moreover, PKT incorporates focal loss to better prioritize minority classes, thereby achieving more balanced predictions. Extensive experimental results on four publicly available educational datasets demonstrate the advanced predictive performance of PKT in comparison with 16 state-of-the-art models. To ensure the reproducibility of our research, the code is publicly available at https://anonymous.4open.science/r/PKT.
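
A hedged sketch of a binary focal loss of the kind the abstract says PKT uses to emphasize the minority class; the gamma and alpha values and the toy predictions below are assumptions.

```python
import numpy as np

# Binary focal loss: confident, easy examples are down-weighted, which helps
# under class imbalance.

def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-8):
    """p: predicted probability of a correct response, y: 0/1 label."""
    p_t = np.where(y == 1, p, 1.0 - p)
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t + eps)

preds = np.array([0.9, 0.6, 0.2])
labels = np.array([1, 1, 0])
print(focal_loss(preds, labels))
```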

[LG-81] Data-efficient and Interpretable Inverse Materials Design using a Disentangled Variational Autoencoder

链接: https://arxiv.org/abs/2409.06740
作者: Cheng Zeng,Zulqarnain Khan,Nathan L. Post
关键词-EN: Inverse materials design, proven successful, successful in accelerating, materials, Inverse materials
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*备注:

点击查看摘要

Abstract:Inverse materials design has proven successful in accelerating novel material discovery. Many inverse materials design methods use unsupervised learning where a latent space is learned to offer a compact description of materials representations. A latent space learned this way is likely to be entangled, in terms of the target property and other properties of the materials. This makes the inverse design process ambiguous. Here, we present a semi-supervised learning approach based on a disentangled variational autoencoder to learn a probabilistic relationship between features, latent variables and target properties. This approach is data efficient because it combines all labelled and unlabelled data in a coherent manner, and it uses expert-informed prior distributions to improve model robustness even with limited labelled data. It is in essence interpretable, as the learnable target property is disentangled out of the other properties of the materials, and an extra layer of interpretability can be provided by a post-hoc analysis of the classification head of the model. We demonstrate this new approach on an experimental high-entropy alloy dataset with chemical compositions as input and single-phase formation as the single target property. While a single property is used in this work, the disentangled model can be extended to customized inverse design of materials with multiple target properties.

[LG-82] Urban context and delivery performance: Modelling service time for cargo bikes and vans across diverse urban environments

链接: https://arxiv.org/abs/2409.06730
作者: Maxwell Schrader,Navish Kumar,Esben Sørig,Soonmyeong Yoon,Akash Srivastava,Kai Xu,Maria Astefanoaei,Nicolas Collignon
关键词-EN: Light goods vehicles, Light Electric Vehicles, Light goods, Light Electric, service times
类目: Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注: 37 pages in submission to the Springer Journal of Urban Informatics. arXiv admin note: text overlap with arXiv:2007.06277 by other authors

点击查看摘要

Abstract:Light goods vehicles (LGV) used extensively in the last mile of delivery are one of the leading polluters in cities. Cargo-bike logistics and Light Electric Vehicles (LEVs) have been put forward as a high impact candidate for replacing LGVs. Studies have estimated over half of urban van deliveries being replaceable by cargo-bikes, due to their faster speeds, shorter parking times and more efficient routes across cities. However, the logistics sector suffers from a lack of publicly available data, particularly pertaining to cargo-bike deliveries, thus limiting the understanding of their potential benefits. Specifically, service time (which includes cruising for parking, and walking to destination) is a major, but often overlooked component of delivery time modelling. The aim of this study is to establish a framework for measuring the performance of delivery vehicles, with an initial focus on modelling service times of vans and cargo-bikes across diverse urban environments. We introduce two datasets that allow for in-depth analysis and modelling of service times of cargo bikes and use existing datasets to reason about differences in delivery performance across vehicle types. We introduce a modelling framework to predict the service times of deliveries based on urban context. We employ Uber’s H3 index to divide cities into hexagonal cells and aggregate OpenStreetMap tags for each cell, providing a detailed assessment of urban context. Leveraging this spatial grid, we use GeoVex to represent micro-regions as points in a continuous vector space, which then serve as input for predicting vehicle service times. We show that geospatial embeddings can effectively capture urban contexts and facilitate generalizations to new contexts and cities. Our methodology addresses the challenge of limited comparative data available for different vehicle types within the same urban settings.

[LG-83] Feedback-based Modal Mutual Search for Attacking Vision-Language Pre-training Models

链接: https://arxiv.org/abs/2409.06726
作者: Renhua Ding,Xinze Zhang,Xiao Yang,Kun He
关键词-EN: achieved remarkable progress, attacking VLP models, vision-language pre-training, achieved remarkable, remarkable progress
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 9 pages

点击查看摘要

Abstract:Although vision-language pre-training (VLP) models have achieved remarkable progress on cross-modal tasks, they remain vulnerable to adversarial attacks. Using data augmentation and cross-modal interactions to generate transferable adversarial examples on surrogate models, transfer-based black-box attacks have become the mainstream methods in attacking VLP models, as they are more practical in real-world scenarios. However, their transferability may be limited due to the differences on feature representation across different models. To this end, we propose a new attack paradigm called Feedback-based Modal Mutual Search (FMMS). FMMS introduces a novel modal mutual loss (MML), aiming to push away the matched image-text pairs while randomly drawing mismatched pairs closer in feature space, guiding the update directions of the adversarial examples. Additionally, FMMS leverages the target model feedback to iteratively refine adversarial examples, driving them into the adversarial region. To our knowledge, this is the first work to exploit target model feedback to explore multi-modality adversarial boundaries. Extensive empirical evaluations on Flickr30K and MSCOCO datasets for image-text matching tasks show that FMMS significantly outperforms the state-of-the-art baselines.

[LG-84] DefectTwin: When LLM Meets Digital Twin for Railway Defect Inspection

链接: https://arxiv.org/abs/2409.06725
作者: Rahatara Ferdousi,M. Anwar Hossain,Chunsheng Yang,Abdulmotaleb El Saddik
关键词-EN: Digital Twin, Large Language Models, replicates objects, real-time monitoring, predictive maintenance
类目: Computational Engineering, Finance, and Science (cs.CE); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: 12 pages, 10 figures, IEEE transaction on consumer electronics

点击查看摘要

Abstract:A Digital Twin (DT) replicates objects, processes, or systems for real-time monitoring, simulation, and predictive maintenance. Recent advancements like Large Language Models (LLMs) have revolutionized traditional AI systems and offer immense potential when combined with DT in industrial applications such as railway defect inspection. Traditionally, this inspection requires extensive defect samples to identify patterns, but limited samples can lead to overfitting and poor performance on unseen defects. Integrating pre-trained LLMs into DT addresses this challenge by reducing the need for vast sample data. We introduce DefectTwin, which employs a multimodal and multi-model (M^2) LLM-based AI pipeline to analyze both seen and unseen visual defects in railways. This application enables a railway agent to perform expert-level defect analysis using consumer electronics (e.g., tablets). A multimodal processor ensures responses are in a consumable format, while an instant user feedback mechanism (instaUF) enhances Quality-of-Experience (QoE). The proposed M^2 LLM outperforms existing models, achieving high precision (0.76-0.93) across multimodal inputs including text, images, and videos of pre-trained defects, and demonstrates superior zero-shot generalizability for unseen defects. We also evaluate the latency, token count, and usefulness of responses generated by DefectTwin on consumer devices. To our knowledge, DefectTwin is the first LLM-integrated DT designed for railway defect inspection.

[LG-85] Dual Adversarial Perturbators Generate rich Views for Recommendation

链接: https://arxiv.org/abs/2409.06719
作者: Lijun Zhang,Yuan Yao,Haibo Ye
关键词-EN: contrastive views, extensively studied, studied and leveraged, potent tool, GCL
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: 16 pages,6 figures and 5 tables

点击查看摘要

Abstract:Graph contrastive learning (GCL) has been extensively studied and leveraged as a potent tool in recommender systems. Most existing GCL-based recommenders generate contrastive views by altering the graph structure or introducing perturbations to embedding. While these methods effectively enhance learning from sparse data, they risk performance degradation or even training collapse when the differences between contrastive views become too pronounced. To mitigate this issue, we employ curriculum learning to incrementally increase the disparity between contrastive views, enabling the model to gain from more challenging scenarios. In this paper, we propose a dual-adversarial graph learning approach, AvoGCL, which emulates curriculum learning by progressively applying adversarial training to graph structures and embedding perturbations. Specifically, AvoGCL constructs contrastive views by reducing graph redundancy and generating adversarial perturbations in the embedding space, and achieves better results by gradually increasing the difficulty of contrastive views. Extensive experiments on three real-world datasets demonstrate that AvoGCL significantly outperforms the state-of-the-art competitors.

[LG-86] Unsupervised Representation Learning of Complex Time Series for Maneuverability State Identification in Smart Mobility

链接: https://arxiv.org/abs/2409.06718
作者: Thabang Lebese
关键词-EN: Multivariate Time Series, Multivariate Time, Time Series, physical dynamic phenomena, provide invaluable insights
类目: Machine Learning (cs.LG)
*备注: 12 pages, 5 figures, International Conference on Agents and Artificial Intelligence Doctoral Consortium (ICAART_DC)

点击查看摘要

Abstract:Multivariate Time Series (MTS) data capture temporal behaviors to provide invaluable insights into various physical dynamic phenomena. In smart mobility, MTS plays a crucial role in providing temporal dynamics of behaviors such as maneuver patterns, enabling early detection of anomalous behaviors while facilitating pro-activity in Prognostics and Health Management (PHM). In this work, we aim to address challenges associated with modeling MTS data collected from a vehicle using sensors. Our goal is to investigate the effectiveness of two distinct unsupervised representation learning approaches in identifying maneuvering states in smart mobility. Specifically, we focus on some bivariate accelerations extracted from 2.5 years of driving, where the dataset is non-stationary, long, noisy, and completely unlabeled, making manual labeling impractical. The approaches of interest are Temporal Neighborhood Coding for Maneuvering (TNC4Maneuvering) and Decoupled Local and Global Representation learner for Maneuvering (DLG4Maneuvering). The main advantage of these frameworks is that they capture transferable insights in a form of representations from the data that can be effectively applied in multiple subsequent tasks, such as time-series classification, clustering, and multi-linear regression, which are the quantitative measures and qualitative measures, including visualization of representations themselves and resulting reconstructed MTS, respectively. We compare their effectiveness, where possible, in order to gain insights into which approach is more effective in identifying maneuvering states in smart mobility.

[LG-87] Scalable Multivariate Fronthaul Quantization for Cell-Free Massive MIMO

链接: https://arxiv.org/abs/2409.06715
作者: Sangwoo Park,Ahmet Hasim Gokceoglu,Li Wang,Osvaldo Simeone
关键词-EN: cell-free massive MIMO, conventional approach, fronthaul links, massive MIMO system, massive MIMO
类目: Information Theory (cs.IT); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
*备注: submitted for a journal publication

点击查看摘要

Abstract:The conventional approach to the fronthaul design for cell-free massive MIMO system follows the compress-and-precode (CP) paradigm. Accordingly, encoded bits and precoding coefficients are shared by the distributed unit (DU) on the fronthaul links, and precoding takes place at the radio units (RUs). Previous theoretical work has shown that CP can be potentially improved by a significant margin by precode-and-compress (PC) methods, in which all baseband processing is carried out at the DU, which compresses the precoded signals for transmission on the fronthaul links. The theoretical performance gains of PC methods are particularly pronounced when the DU implements multivariate quantization (MQ), applying joint quantization across the signals for all the RUs. However, existing solutions for MQ are characterized by a computational complexity that grows exponentially with the sum-fronthaul capacity from the DU to all RUs. This work sets out to design scalable MQ strategies for PC-based cell-free massive MIMO systems. For the low-fronthaul capacity regime, we present alpha-parallel MQ (alpha-PMQ), whose complexity is exponential only in the fronthaul capacity towards an individual RU, while performing close to full MQ. alpha-PMQ tailors MQ to the topology of the network by allowing for parallel local quantization steps for RUs that do not interfere too much with each other. For the high-fronthaul capacity regime, we then introduce neural MQ, which replaces the exhaustive search in MQ with gradient-based updates for a neural-network-based decoder, attaining a complexity that grows linearly with the sum-fronthaul capacity. Numerical results demonstrate that the proposed scalable MQ strategies outperform CP for both the low and high-fronthaul capacity regimes at the cost of increased computational complexity at the DU (but not at the RUs).

[LG-88] Gating Syn-to-Real Knowledge for Pedestrian Crossing Prediction in Safe Driving

链接: https://arxiv.org/abs/2409.06707
作者: Jie Bai,Jianwu Fang,Yisheng Lv,Chen Lv,Jianru Xue,Zhengguo Li
关键词-EN: driving scenes plays, Pedestrian Crossing Prediction, Pedestrian Crossing, Crossing Prediction, intelligent vehicles
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: under review by TITS

点击查看摘要

Abstract:Pedestrian Crossing Prediction (PCP) in driving scenes plays a critical role in ensuring the safe operation of intelligent vehicles. Due to the limited observations of pedestrian crossing behaviors in typical situations, recent studies have begun to leverage synthetic data with flexible variation to boost prediction performance, employing domain adaptation frameworks. However, different domain knowledge has distinct cross-domain distribution gaps, which necessitates suitable domain knowledge adaption ways for PCP tasks. In this work, we propose a Gated Syn-to-Real Knowledge transfer approach for PCP (Gated-S2R-PCP), which has two aims: 1) designing the suitable domain adaptation ways for different kinds of crossing-domain knowledge, and 2) transferring suitable knowledge for specific situations with gated knowledge fusion. Specifically, we design a framework that contains three domain adaption methods including style transfer, distribution approximation, and knowledge distillation for various information, such as visual, semantic, depth, location, etc. A Learnable Gated Unit (LGU) is employed to fuse suitable cross-domain knowledge to boost pedestrian crossing prediction. We construct a new synthetic benchmark S2R-PCP-3181 with 3181 sequences (489,740 frames) which contains the pedestrian locations, RGB frames, semantic images, and depth images. With the synthetic S2R-PCP-3181, we transfer the knowledge to two challenging real datasets, PIE and JAAD, and superior PCP performance is obtained compared to state-of-the-art methods.

[LG-89] Discovering Long-Term Effects on Parameter Efficient Fine-tuning

链接: https://arxiv.org/abs/2409.06706
作者: Gaole Dai,Yiming Tang,Chunkai Fan,Qizhe Zhang,Zhi Zhang,Yulu Gan,Chengqing Zeng,Shanghang Zhang,Tiejun Huang
关键词-EN: Pre-trained Artificial Neural, specifically Biological Neural, Artificial Neural Networks, Biological Neural Networks, Artificial Neural
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Pre-trained Artificial Neural Networks (ANNs) exhibit robust pattern recognition capabilities and share extensive similarities with the human brain, specifically Biological Neural Networks (BNNs). We are particularly intrigued by these models’ ability to acquire new knowledge through fine-tuning. In this regard, Parameter-efficient Fine-tuning (PEFT) has gained widespread adoption as a substitute for full fine-tuning due to its cost reduction in training and mitigation of over-fitting risks by limiting the number of trainable parameters during adaptation. Since both ANNs and BNNs propagate information layer-by-layer, a common analogy can be drawn: weights in ANNs represent synapses in BNNs, while features (also known as latent variables or logits) in ANNs represent neurotransmitters released by neurons in BNNs. Mainstream PEFT methods aim to adjust feature or parameter values using only a limited number of trainable parameters (usually less than 1% of the total parameters), yet achieve surprisingly good results. Building upon this clue, we delve deeper into exploring the connections between feature adjustment and parameter adjustment, resulting in our proposed method Synapses and Neurons (SAN) that learns scaling matrices for features and propagates their effects towards posterior weight matrices. Our approach draws strong inspiration from well-known neuroscience phenomena - Long-term Potentiation (LTP) and Long-term Depression (LTD), which also reveal the relationship between synapse development and neurotransmitter release levels. We conducted extensive comparisons of PEFT on 26 datasets using attention-based networks as well as convolution-based networks, leading to significant improvements compared to other tuning methods (+8.5% over full fine-tuning, +7% over Visual Prompt Tuning, and +3.2% over LoRA). The code will be released.
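
A minimal sketch of the feature-scaling flavour of PEFT described above: freeze a backbone layer and train only a per-feature scaling vector. How SAN actually propagates these scales into posterior weight matrices is not reproduced here; the module and layer sizes are assumptions.

```python
import torch

# Frozen backbone + learnable per-feature scaling: a tiny PEFT-style module.
class ScaledLinear(torch.nn.Module):
    def __init__(self, linear: torch.nn.Linear):
        super().__init__()
        self.linear = linear
        for p in self.linear.parameters():
            p.requires_grad_(False)                    # frozen backbone weights
        self.scale = torch.nn.Parameter(torch.ones(linear.out_features))

    def forward(self, x):
        return self.linear(x) * self.scale             # learnable feature scaling

layer = ScaledLinear(torch.nn.Linear(16, 16))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"{trainable} trainable parameters out of {total}")
```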

[LG-90] Bridging Autoencoders and Dynamic Mode Decomposition for Reduced-order Modeling and Control of PDEs

链接: https://arxiv.org/abs/2409.06101
作者: Priyabrata Saha,Saibal Mukhopadhyay
关键词-EN: partial differential equations, necessitate dimensionality reduction, dimensionality reduction techniques, controlling complex spatiotemporal, complex spatiotemporal dynamical
类目: ystems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 8 pages, 5 figures. Accepted to IEEE Conference on Decision and Control (CDC 2024)

点击查看摘要

Abstract:Modeling and controlling complex spatiotemporal dynamical systems driven by partial differential equations (PDEs) often necessitate dimensionality reduction techniques to construct lower-order models for computational efficiency. This paper explores a deep autoencoding learning method for reduced-order modeling and control of dynamical systems governed by spatiotemporal PDEs. We first analytically show that an optimization objective for learning a linear autoencoding reduced-order model can be formulated to yield a solution closely resembling the result obtained through the dynamic mode decomposition with control algorithm. We then extend this linear autoencoding architecture to a deep autoencoding framework, enabling the development of a nonlinear reduced-order model. Furthermore, we leverage the learned reduced-order model to design controllers using stability-constrained deep neural networks. Numerical experiments are presented to validate the efficacy of our approach in both modeling and control using the example of a reaction-diffusion system.

[LG-91] Asymptotics of Stochastic Gradient Descent with Dropout Regularization in Linear Models

链接: https://arxiv.org/abs/2409.07434
作者: Jiaqi Li,Johannes Schmidt-Hieber,Wei Biao Wu
关键词-EN: stochastic gradient descent, SGD dropout iterates, step-size SGD dropout, gradient descent, linear regression
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: 77 pages, 5 figures, 4 tables

点击查看摘要

Abstract:This paper proposes an asymptotic theory for online inference of the stochastic gradient descent (SGD) iterates with dropout regularization in linear regression. Specifically, we establish the geometric-moment contraction (GMC) for constant step-size SGD dropout iterates to show the existence of a unique stationary distribution of the dropout recursive function. By the GMC property, we provide quenched central limit theorems (CLT) for the difference between dropout and $\ell^2$-regularized iterates, regardless of initialization. The CLT for the difference between the Ruppert-Polyak averaged SGD (ASGD) with dropout and $\ell^2$-regularized iterates is also presented. Based on these asymptotic normality results, we further introduce an online estimator for the long-run covariance matrix of ASGD dropout to facilitate inference in a recursive manner with efficiency in computational time and memory. The numerical experiments demonstrate that for sufficiently large samples, the proposed confidence intervals for ASGD with dropout nearly achieve the nominal coverage probability.
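
A toy simulation, under assumed settings, of the objects the abstract studies: constant step-size SGD with dropout regularization in linear regression, plus Ruppert-Polyak averaging. With isotropic features the dropout fixed point coincides with the true coefficients in this toy setup, so the averaged iterate should land close to them; none of this is the paper's code.

```python
import numpy as np

# Constant step-size SGD with feature dropout and a running Polyak average.
rng = np.random.default_rng(0)
d, n_steps, lr, p_keep = 5, 20000, 0.01, 0.8
beta_true = rng.normal(size=d)

beta = np.zeros(d)
avg = np.zeros(d)
for t in range(1, n_steps + 1):
    x = rng.normal(size=d)
    y = x @ beta_true + rng.normal()
    mask = rng.binomial(1, p_keep, size=d)     # dropout mask on the features
    resid = y - (x * mask) @ beta
    beta += lr * resid * (x * mask)            # SGD step on the dropped-out features
    avg += (beta - avg) / t                    # running Ruppert-Polyak average

print(np.round(avg, 2))
print(np.round(beta_true, 2))
```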

[LG-92] Training-Free Guidance for Discrete Diffusion Models for Molecular Generation

链接: https://arxiv.org/abs/2409.07359
作者: Thomas J. Kerby,Kevin R. Moon
关键词-EN: enable foundation diffusion, foundation diffusion models, explosion of interest, interest due, enable foundation
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph); Biomolecules (q-bio.BM)
*备注: 5 pages, 2 figures, and 2 tables

点击查看摘要

Abstract:Training-free guidance methods for continuous data have seen an explosion of interest due to the fact that they enable foundation diffusion models to be paired with interchangeable guidance models. Currently, equivalent guidance methods for discrete diffusion models are unknown. We present a framework for applying training-free guidance to discrete data and demonstrate its utility on molecular graph generation tasks using the discrete diffusion model architecture of DiGress. We pair this model with guidance functions that return the proportion of heavy atoms that are a specific atom type and the molecular weight of the heavy atoms and demonstrate our method’s ability to guide the data generation.
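
A hedged sketch of one way training-free guidance can act on discrete data: tilt a categorical denoising distribution by the exponential of a guidance score. The guidance function, tilt strength, and toy atom vocabulary below are illustrative assumptions, not DiGress's actual sampling code.

```python
import numpy as np

# Tilt per-token categorical probabilities by exp(strength * guidance score).
def guided_categorical(probs, candidates, guidance_fn, strength=2.0):
    scores = np.array([guidance_fn(c) for c in candidates])
    tilted = probs * np.exp(strength * scores)
    return tilted / tilted.sum()

atom_types = ["C", "N", "O"]
weights = {"C": 12.0, "N": 14.0, "O": 16.0}
base_probs = np.array([0.6, 0.3, 0.1])
guided = guided_categorical(base_probs, atom_types, lambda a: weights[a] / 16.0)
print(guided)    # probability mass shifts toward heavier atom types
```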

[LG-93] he Role of Explainable AI in Revolutionizing Human Health Monitoring

链接: https://arxiv.org/abs/2409.07347
作者: Abdullah Alharthi,Ahmed Alqurashi,Turki Alharbi,Mohammed Alammar,Nasser Aldosari,Houssem Bouchekara,Yusuf Shaaban,Mohammad Shoaib Shahriar,Abdulrahman Al Ayidh
关键词-EN: effective diagnostic tools, symptoms present significant, present significant obstacles, developing effective diagnostic, patient symptoms present
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The complex nature of disease mechanisms and the variability of patient symptoms present significant obstacles in developing effective diagnostic tools. Although machine learning has made considerable advances in medical diagnosis, its decision-making processes frequently lack transparency, which can jeopardize patient outcomes. This underscores the critical need for Explainable AI (XAI), which not only offers greater clarity but also has the potential to significantly improve patient care. In this literature review, we conduct a detailed analysis of XAI methods identified through searches across various databases, focusing on chronic conditions such as Parkinson’s, stroke, depression, cancer, heart disease, and Alzheimer’s disease. The literature search revealed the application of 9 trending XAI algorithms in the field of healthcare and highlighted the pros and cons of each of them. Thus, the article is concluded with a critical appraisal of the challenges and future research opportunities for XAI in human health monitoring.

[LG-94] ART: Artifact Removal Transformer for Reconstructing Noise-Free Multichannel Electroencephalographic Signals

链接: https://arxiv.org/abs/2409.07326
作者: Chun-Hsiang Chuang,Kong-Yi Chang,Chih-Sheng Huang,Anne-Mei Bessas
关键词-EN: significantly impacts neuroscientific, impacts neuroscientific analysis, brain-computer interface, Artifact removal, EEG
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Artifact removal in electroencephalography (EEG) is a longstanding challenge that significantly impacts neuroscientific analysis and brain-computer interface (BCI) performance. Tackling this problem demands advanced algorithms, extensive noisy-clean training data, and thorough evaluation strategies. This study presents the Artifact Removal Transformer (ART), an innovative EEG denoising model employing transformer architecture to adeptly capture the transient millisecond-scale dynamics characteristic of EEG signals. Our approach offers a holistic, end-to-end denoising solution for diverse artifact types in multichannel EEG data. We enhanced the generation of noisy-clean EEG data pairs using an independent component analysis, thus fortifying the training scenarios critical for effective supervised learning. We performed comprehensive validations using a wide range of open datasets from various BCI applications, employing metrics like mean squared error and signal-to-noise ratio, as well as sophisticated techniques such as source localization and EEG component classification. Our evaluations confirm that ART surpasses other deep-learning-based artifact removal methods, setting a new benchmark in EEG signal processing. This advancement not only boosts the accuracy and reliability of artifact removal but also promises to catalyze further innovations in the field, facilitating the study of brain dynamics in naturalistic environments.

[LG-95] BLS-GAN: A Deep Layer Separation Framework for Eliminating Bone Overlap in Conventional Radiographs

链接: https://arxiv.org/abs/2409.07304
作者: Haolin Wang,Yafei Ou,Prasoon Ambalathankandy,Gen Ota,Pengyu Dai,Masayuki Ikebe,Kenji Suzuki,Tamotsu Kamishima
关键词-EN: bone layer separation, bone layer, bone layer images, conventional radiographs, layer separation
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Conventional radiography is the widely used imaging technology in diagnosing, monitoring, and prognosticating musculoskeletal (MSK) diseases because of its easy availability, versatility, and cost-effectiveness. In conventional radiographs, bone overlaps are prevalent, and can impede the accurate assessment of bone characteristics by radiologists or algorithms, posing significant challenges to conventional and computer-aided diagnoses. This work initiated the study of a challenging scenario - bone layer separation in conventional radiographs, in which separate overlapped bone regions enable the independent assessment of the bone characteristics of each bone layer and lay the groundwork for MSK disease diagnosis and its automation. This work proposed a Bone Layer Separation GAN (BLS-GAN) framework that can produce high-quality bone layer images with reasonable bone characteristics and texture. This framework introduced a reconstructor based on conventional radiography imaging principles, which achieved efficient reconstruction and mitigates the recurrent calculations and training instability issues caused by soft tissue in the overlapped regions. Additionally, pre-training with synthetic images was implemented to enhance the stability of both the training process and the results. The generated images passed the visual Turing test, and improved performance in downstream tasks. This work affirms the feasibility of extracting bone layer images from conventional radiographs, which holds promise for leveraging bone layer separation technology to facilitate more comprehensive analytical research in MSK diagnosis, monitoring, and prognosis. Code and dataset will be made available.

[LG-96] Federated $\mathcal{X}$-armed Bandit with Flexible Personalisation

链接: https://arxiv.org/abs/2409.07251
作者: Ali Arabzadeh,James A. Grant,David S. Leslie
关键词-EN: armed bandit framework, highly heterogeneous environment, personalised federated learning, armed bandit, bandit framework
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper introduces a novel approach to personalised federated learning within the $\mathcal{X}$-armed bandit framework, addressing the challenge of optimising both local and global objectives in a highly heterogeneous environment. Our method employs a surrogate objective function that combines individual client preferences with aggregated global knowledge, allowing for a flexible trade-off between personalisation and collective learning. We propose a phase-based elimination algorithm that achieves sublinear regret with logarithmic communication overhead, making it well-suited for federated settings. Theoretical analysis and empirical evaluations demonstrate the effectiveness of our approach compared to existing methods. Potential applications of this work span various domains, including healthcare, smart home devices, and e-commerce, where balancing personalisation with global insights is crucial.
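
A hedged 1-D sketch of a surrogate objective that blends a client's local objective with an aggregated global one, illustrating the personalisation trade-off mentioned above; the mixing weight lam and the toy quadratic objectives are assumptions, not the paper's construction.

```python
import numpy as np

# Surrogate = (1 - lam) * local + lam * global; the maximiser moves between a
# client's own optimum and the population optimum as lam varies.
def local_obj(x, peak):
    return -(x - peak) ** 2                 # each client prefers its own peak

def global_obj(x, peaks):
    return np.mean([local_obj(x, p) for p in peaks], axis=0)

peaks = [0.2, 0.5, 0.9]
lam = 0.3                                   # 0: fully personalised, 1: fully global
xs = np.linspace(0.0, 1.0, 1001)
for peak in peaks:
    surrogate = (1 - lam) * local_obj(xs, peak) + lam * global_obj(xs, peaks)
    print(f"client peak {peak}: surrogate maximiser {xs[np.argmax(surrogate)]:.3f}")
```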

[LG-97] Is merging worth it? Securely evaluating the information gain for causal dataset acquisition

链接: https://arxiv.org/abs/2409.07215
作者: Jake Fawkes,Lucile Ter-Minassian,Desi Ivanova,Uri Shalit,Chris Holmes
关键词-EN: involves private information, Merging datasets, costly procedure, lengthy and costly, involves private
类目: Machine Learning (stat.ML); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Merging datasets across institutions is a lengthy and costly procedure, especially when it involves private information. Data hosts may therefore want to prospectively gauge which datasets are most beneficial to merge with, without revealing sensitive information. For causal estimation this is particularly challenging as the value of a merge will depend not only on the reduction in epistemic uncertainty but also the improvement in overlap. To address this challenge, we introduce the first cryptographically secure information-theoretic approach for quantifying the value of a merge in the context of heterogeneous treatment effect estimation. We do this by evaluating the Expected Information Gain (EIG) and utilising multi-party computation to ensure it can be securely computed without revealing any raw data. As we demonstrate, this can be used with differential privacy (DP) to ensure privacy requirements whilst preserving more accurate computation than naive DP alone. To the best of our knowledge, this work presents the first privacy-preserving method for dataset acquisition tailored to causal estimation. We demonstrate the effectiveness and reliability of our method on a range of simulated and realistic benchmarks. The code is available anonymously.

[LG-98] FuXi-2.0: Advancing machine learning weather forecasting model for practical applications

链接: https://arxiv.org/abs/2409.07188
作者: Xiaohui Zhong,Lei Chen,Xu Fan,Wenxu Qian,Jun Liu,Hao Li
关键词-EN: Machine learning, lower computational costs, numerical weather prediction, traditional numerical weather, increasingly valuable
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Machine learning (ML) models have become increasingly valuable in weather forecasting, providing forecasts that not only lower computational costs but often match or exceed the accuracy of traditional numerical weather prediction (NWP) models. Despite their potential, ML models typically suffer from limitations such as coarse temporal resolution, typically 6 hours, and a limited set of meteorological variables, limiting their practical applicability. To overcome these challenges, we introduce FuXi-2.0, an advanced ML model that delivers 1-hourly global weather forecasts and includes a comprehensive set of essential meteorological variables, thereby expanding its utility across various sectors like wind and solar energy, aviation, and marine shipping. Our study conducts comparative analyses between ML-based 1-hourly forecasts and those from the high-resolution forecast (HRES) of the European Centre for Medium-Range Weather Forecasts (ECMWF) for various practical scenarios. The results demonstrate that FuXi-2.0 consistently outperforms ECMWF HRES in forecasting key meteorological variables relevant to these sectors. In particular, FuXi-2.0 shows superior performance in wind power forecasting compared to ECMWF HRES, further validating its efficacy as a reliable tool for scenarios demanding precise weather forecasts. Additionally, FuXi-2.0 also integrates both atmospheric and oceanic components, representing a significant step forward in the development of coupled atmospheric-ocean models. Further comparative analyses reveal that FuXi-2.0 provides more accurate forecasts of tropical cyclone intensity than its predecessor, FuXi-1.0, suggesting that there are benefits of an atmosphere-ocean coupled model over atmosphere-only models.

[LG-99] Coupling Machine Learning Local Predictions with a Computational Fluid Dynamics Solver to Accelerate Transient Buoyant Plume Simulations

链接: https://arxiv.org/abs/2409.07175
作者: Clément Caron,Philippe Lauret,Alain Bastide
关键词-EN: Data-driven methods demonstrate, methods demonstrate considerable, demonstrate considerable potential, inherently expensive computational, computational fluid dynamics
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
*备注: Twelfth International Conference on Computational Fluid Dynamics (ICCFD12), Kobe, Japan, July 14-19, 2024. 18 pages, 8 figures

点击查看摘要

Abstract:Data-driven methods demonstrate considerable potential for accelerating the inherently expensive computational fluid dynamics (CFD) solvers. Nevertheless, pure machine-learning surrogate models face challenges in ensuring physical consistency and scaling up to address real-world problems. This study presents a versatile and scalable hybrid methodology, combining CFD and machine learning, to accelerate long-term incompressible fluid flow simulations without compromising accuracy. A neural network was trained offline using simulated data of various two-dimensional transient buoyant plume flows. The objective was to leverage local features to predict the temporal changes in the pressure field in comparable scenarios. Due to cell-level predictions, the methodology was successfully applied to diverse geometries without additional training. Pressure estimates were employed as initial values to accelerate the pressure-velocity coupling procedure. The results demonstrated an average improvement of 94% in the initial guess for solving the Poisson equation. The first pressure corrector acceleration reached a mean factor of 3, depending on the iterative solver employed. Our work reveals that machine learning estimates at the cell level can enhance the efficiency of CFD iterative linear solvers while maintaining accuracy. Although the scalability of the methodology to more complex cases has yet to be demonstrated, this study underscores the prospective value of domain-specific hybrid solvers for CFD.
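
A hedged toy illustration of the warm-start idea the abstract describes: a good initial guess for a Poisson-type pressure solve reduces iterative-solver work. The "prediction" here is simply the exact solution plus noise, standing in for a learned model, and the 1-D system is an assumption rather than the paper's setup.

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import cg

n = 200
A = diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n)).tocsr()   # 1D Poisson matrix
b = np.random.default_rng(0).normal(size=n)
x_exact = np.linalg.solve(A.toarray(), b)

def cg_iterations(x0):
    iters = 0
    def count(_xk):
        nonlocal iters
        iters += 1
    cg(A, b, x0=x0, callback=count)          # conjugate gradient with initial guess x0
    return iters

print("cold start iterations:", cg_iterations(np.zeros(n)))
print("warm start iterations:",
      cg_iterations(x_exact + 1e-3 * np.random.default_rng(1).normal(size=n)))
```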

[LG-100] Attention Down-Sampling Transformer, Relative Ranking and Self-Consistency for Blind Image Quality Assessment ICIP

链接: https://arxiv.org/abs/2409.07115
作者: Mohammed Alsaafin,Musab Alsheikh,Saeed Anwar,Muhammad Usman
关键词-EN: image quality assessment, addresses estimating image, no-reference image quality, image quality, quality assessment
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted in International Conference on Image Processing (ICIP)

点击查看摘要

Abstract:The no-reference image quality assessment is a challenging domain that addresses estimating image quality without the original reference. We introduce an improved mechanism to extract local and non-local information from images via different transformer encoders and CNNs. The utilization of Transformer encoders aims to mitigate locality bias and generate a non-local representation by sequentially processing CNN features, which inherently capture local visual structures. Establishing a stronger connection between subjective and objective assessments is achieved through sorting within batches of images based on relative distance information. A self-consistency approach to self-supervision is presented, explicitly addressing the degradation of no-reference image quality assessment (NR-IQA) models under equivariant transformations. Our approach ensures model robustness by maintaining consistency between an image and its horizontally flipped equivalent. Through empirical evaluation of five popular image quality assessment datasets, the proposed model outperforms alternative algorithms in the context of no-reference image quality assessment datasets, especially on smaller datasets. Codes are available at this https URL.
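
A hedged sketch of the self-consistency objective described above: penalise the difference between quality predictions for an image and its horizontally flipped version. The "quality model" is a tiny stand-in, not the paper's network.

```python
import torch

# Equivariance penalty: predictions for an image and its horizontal flip
# should agree for a robust NR-IQA model.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 1))
images = torch.rand(4, 3, 32, 32)
flipped = torch.flip(images, dims=[3])            # horizontal flip along width

q, q_flip = model(images), model(flipped)
consistency_loss = torch.mean((q - q_flip) ** 2)  # self-consistency term
print(float(consistency_loss))
```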

[LG-101] Deep intra-operative illumination calibration of hyperspectral cameras MICCAI2024

链接: https://arxiv.org/abs/2409.07094
作者: Alexander Baumann,Leonardo Ayala,Alexander Studier-Fischer,Jan Sellner,Berkin Özdemir,Karl-Friedrich Kowalewski,Slobodan Ilic,Silvia Seidlitz,Lena Maier-Hein
关键词-EN: potential surgical applications, imaging modality, lighting conditions, Hyperspectral imaging, promising novel imaging
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Oral at MICCAI 2024

点击查看摘要

Abstract:Hyperspectral imaging (HSI) is emerging as a promising novel imaging modality with various potential surgical applications. Currently available cameras, however, suffer from poor integration into the clinical workflow because they require the lights to be switched off, or the camera to be manually recalibrated as soon as lighting conditions change. Given this critical bottleneck, the contribution of this paper is threefold: (1) We demonstrate that dynamically changing lighting conditions in the operating room dramatically affect the performance of HSI applications, namely physiological parameter estimation, and surgical scene segmentation. (2) We propose a novel learning-based approach to automatically recalibrating hyperspectral images during surgery and show that it is sufficiently accurate to replace the tedious process of white reference-based recalibration. (3) Based on a total of 742 HSI cubes from a phantom, porcine models, and rats we show that our recalibration method not only outperforms previously proposed methods, but also generalizes across species, lighting conditions, and image processing tasks. Due to its simple workflow integration as well as high accuracy, speed, and generalization capabilities, our method could evolve as a central component in clinical surgical HSI.

[LG-102] From optimal score matching to optimal sampling

链接: https://arxiv.org/abs/2409.07032
作者: Zehao Dou,Subhodh Kotekal,Zhehao Xu,Harrison H. Zhou
关键词-EN: score-based diffusion models, score matching, impressive advances, high-fidelity image, advances in algorithmic
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 71 pages

点击查看摘要

Abstract:The recent, impressive advances in algorithmic generation of high-fidelity image, audio, and video are largely due to great successes in score-based diffusion models. A key implementing step is score matching, that is, the estimation of the score function of the forward diffusion process from training data. As shown in earlier literature, the total variation distance between the law of a sample generated from the trained diffusion model and the ground truth distribution can be controlled by the score matching risk. Despite the widespread use of score-based diffusion models, basic theoretical questions concerning exact optimal statistical rates for score estimation and its application to density estimation remain open. We establish the sharp minimax rate of score estimation for smooth, compactly supported densities. Formally, given $n$ i.i.d. samples from an unknown $\alpha$-Hölder density $f$ supported on $[-1, 1]$, we prove the minimax rate of estimating the score function of the diffused distribution $f * \mathcal{N}(0, t)$ with respect to the score matching loss is $\frac{1}{nt^2} \wedge \frac{1}{nt^{3/2}} \wedge \left(t^{\alpha-1} + n^{-2(\alpha-1)/(2\alpha+1)}\right)$ for all $\alpha > 0$ and $t \ge 0$. As a consequence, it is shown that the law $\hat{f}$ of a sample generated from the diffusion model achieves the sharp minimax rate $\mathbb{E}\big(d_{\mathrm{TV}}(\hat{f}, f)^2\big) \lesssim n^{-2\alpha/(2\alpha+1)}$ for all $\alpha > 0$ without any extraneous logarithmic terms which are prevalent in the literature, and without the need for early stopping which has been required for all existing procedures to the best of our knowledge.

[LG-103] A Practical Theory of Generalization in Selectivity Learning

链接: https://arxiv.org/abs/2409.07014
作者: Peizhi Wu,Haoshu Xu,Ryan Marcus,Zachary G. Ives
关键词-EN: promising estimation technique, machine learning, promising estimation, Approximately Correct, PAC learning framework
类目: Machine Learning (stat.ML); Databases (cs.DB); Machine Learning (cs.LG)
*备注: 14 pages

点击查看摘要

Abstract:Query-driven machine learning models have emerged as a promising estimation technique for query selectivities. Yet, surprisingly little is known about the efficacy of these techniques from a theoretical perspective, as there exist substantial gaps between practical solutions and state-of-the-art (SOTA) theory based on the Probably Approximately Correct (PAC) learning framework. In this paper, we aim to bridge the gaps between theory and practice. First, we demonstrate that selectivity predictors induced by signed measures are learnable, which relaxes the reliance on probability measures in SOTA theory. More importantly, beyond the PAC learning framework (which only allows us to characterize how the model behaves when both training and test workloads are drawn from the same distribution), we establish, under mild assumptions, that selectivity predictors from this class exhibit favorable out-of-distribution (OOD) generalization error bounds. These theoretical advances provide us with a better understanding of both the in-distribution and OOD generalization capabilities of query-driven selectivity learning, and facilitate the design of two general strategies to improve OOD generalization for existing query-driven selectivity models. We empirically verify that our techniques help query-driven selectivity models generalize significantly better to OOD queries both in terms of prediction accuracy and query latency performance, while maintaining their superior in-distribution generalization performance.

[LG-104] Toward Model-Agnostic Detection of New Physics Using Data-Driven Signal Regions

链接: https://arxiv.org/abs/2409.06960
作者: Soheun Yi,John Alison,Mikael Kuusela
关键词-EN: high-energy physics, signal events, crucial to select, events, Signal
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an); Applications (stat.AP)
*备注: 5 pages, 2 figures

点击查看摘要

Abstract:In the search for new particles in high-energy physics, it is crucial to select the Signal Region (SR) in such a way that it is enriched with signal events if they are present. While most existing search methods set the region relying on prior domain knowledge, it may be unavailable for a completely novel particle that falls outside the current scope of understanding. We address this issue by proposing a method built upon a model-agnostic but often realistic assumption about the localized topology of the signal events, in which they are concentrated in a certain area of the feature space. Considering the signal component as a localized high-frequency feature, our approach employs the notion of a low-pass filter. We define the SR as an area which is most affected when the observed events are smeared with additive random noise. We overcome challenges in density estimation in the high-dimensional feature space by learning the density ratio of events that potentially include a signal to the complementary observation of events that closely resemble the target events but are free of any signals. By applying our method to simulated $\mathrm{HH} \rightarrow 4b$ events, we demonstrate that the method can efficiently identify a data-driven SR in a high-dimensional feature space in which a high portion of signal events concentrate.
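A toy sketch of the low-pass-filter intuition, under the simplifying assumptions of a one-dimensional feature, histogram densities, and Gaussian smearing (the paper works with learned density ratios in high dimensions): the bins whose density drops most after smearing mark the candidate signal region.

```python
import numpy as np

rng = np.random.default_rng(1)
background = rng.uniform(-3, 3, size=20_000)            # smooth, featureless background
signal = rng.normal(0.8, 0.08, size=3_000)              # localized "high-frequency" bump
events = np.concatenate([background, signal])

bins = np.linspace(-3, 3, 61)
density, _ = np.histogram(events, bins=bins, density=True)
smeared, _ = np.histogram(events + rng.normal(0, 0.3, size=events.size),
                          bins=bins, density=True)

# Bins whose density drops the most after smearing are candidate signal-region bins.
drop = density - smeared
candidate_edges = np.sort(bins[:-1][np.argsort(drop)[-3:]])
print("candidate signal-region bin edges:", candidate_edges)
```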

[LG-105] k-MLE, k-Bregman, k-VARs: Theory, Convergence, Computation

链接: https://arxiv.org/abs/2409.06938
作者: Zuogong Yue,Victor Solo
关键词-EN: develop hard clustering, hard clustering based, prove convergence, develop hard, hard clustering
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We develop hard clustering based on likelihood rather than distance and prove convergence. We also provide simulations and real data examples.
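A minimal sketch of what likelihood-based hard clustering can look like with isotropic Gaussian components, i.e., assigning each point to the component under which it is most likely and refitting MLEs; the update rules and toy data are generic illustrations, not the paper's algorithm.

```python
import numpy as np

def k_mle(X, k=3, n_iter=50, seed=0):
    """Hard clustering by maximum likelihood under isotropic Gaussian components (toy sketch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    labels = rng.integers(k, size=n)
    for _ in range(n_iter):
        mus, vars_ = [], []
        for j in range(k):
            members = X[labels == j]
            if members.size == 0:                      # guard against an emptied cluster
                members = X[rng.integers(n, size=1)]
            mus.append(members.mean(axis=0))
            vars_.append(members.var(axis=0).mean() + 1e-6)
        mus, vars_ = np.array(mus), np.array(vars_)
        # Assign each point to the component under which its log-likelihood is highest.
        loglik = np.stack([
            -0.5 * np.sum((X - mus[j]) ** 2, axis=1) / vars_[j] - 0.5 * d * np.log(vars_[j])
            for j in range(k)
        ])
        new_labels = loglik.argmax(axis=0)
        if np.array_equal(new_labels, labels):         # converged: assignments stopped changing
            break
        labels = new_labels
    return labels

X = np.vstack([np.random.default_rng(2).normal(c, 0.3, size=(100, 2)) for c in (-2, 0, 2)])
print(np.bincount(k_mle(X)))                           # roughly 100 points per cluster
```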

[LG-106] Learning Deep Kernels for Non-Parametric Independence Testing

链接: https://arxiv.org/abs/2409.06890
作者: Nathaniel Xu,Feng Liu,Danica J. Sutherland
关键词-EN: Hilbert-Schmidt Independence Criterion, Independence Criterion, powerful tool, tool for nonparametric, nonparametric detection
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The Hilbert-Schmidt Independence Criterion (HSIC) is a powerful tool for nonparametric detection of dependence between random variables. It crucially depends, however, on the selection of reasonable kernels; commonly-used choices like the Gaussian kernel, or the kernel that yields the distance covariance, are sufficient only for amply sized samples from data distributions with relatively simple forms of dependence. We propose a scheme for selecting the kernels used in an HSIC-based independence test, based on maximizing an estimate of the asymptotic test power. We prove that maximizing this estimate indeed approximately maximizes the true power of the test, and demonstrate that our learned kernels can identify forms of structured dependence between random variables in various experiments.
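For reference, a small numpy sketch of the (biased) HSIC statistic with fixed Gaussian kernels and median-heuristic bandwidths; the paper's contribution is to learn these kernels by maximizing an estimate of test power, which the sketch does not do.

```python
import numpy as np

def gaussian_gram(x, bandwidth):
    d2 = (x[:, None] - x[None, :]) ** 2
    return np.exp(-d2 / (2 * bandwidth ** 2))

def hsic(x, y):
    """Biased HSIC estimate for 1-D variables with median-heuristic Gaussian kernels."""
    n = x.size
    def bw(v):
        d = np.abs(v[:, None] - v[None, :])
        return np.median(d[d > 0])
    K, L = gaussian_gram(x, bw(x)), gaussian_gram(y, bw(y))
    H = np.eye(n) - np.ones((n, n)) / n               # centering matrix
    return float(np.trace(K @ H @ L @ H) / n ** 2)

rng = np.random.default_rng(3)
x = rng.standard_normal(300)
print("dependent:  ", hsic(x, x ** 2 + 0.1 * rng.standard_normal(300)))
print("independent:", hsic(x, rng.standard_normal(300)))
```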

[LG-107] Joint trajectory and network inference via reference fitting

链接: https://arxiv.org/abs/2409.06879
作者: Stephen Y Zhang
关键词-EN: extremely challenging problem, experimental observables, task of reconstructing, reconstructing interactions, central yet extremely
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 14 pages, 6 figures

点击查看摘要

Abstract:Network inference, the task of reconstructing interactions in a complex system from experimental observables, is a central yet extremely challenging problem in systems biology. While much progress has been made in the last two decades, network inference remains an open problem. For systems observed at steady state, limited insights are available since temporal information is unavailable and thus causal information is lost. Two common avenues for gaining causal insights into system behaviour are to leverage temporal dynamics in the form of trajectories, and to apply interventions such as knock-out perturbations. We propose an approach for leveraging both dynamical and perturbational single cell data to jointly learn cellular trajectories and power network inference. Our approach is motivated by min-entropy estimation for stochastic dynamics and can infer directed and signed networks from time-stamped single cell snapshots.

[LG-108] ProteinBench: A Holistic Evaluation of Protein Foundation Models

链接: https://arxiv.org/abs/2409.06744
作者: Fei Ye,Zaixiang Zheng,Dongyu Xue,Yuning Shen,Lihao Wang,Yiming Ma,Yan Wang,Xinyou Wang,Xiangxin Zhou,Quanquan Gu
关键词-EN: Recent years, protein foundation models, significantly improving performance, generative tasks ranging, structure prediction
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注: 29 pages, 1 figure and 11 tables

点击查看摘要

Abstract:Recent years have witnessed a surge in the development of protein foundation models, significantly improving performance in protein prediction and generative tasks ranging from 3D structure prediction and protein design to conformational dynamics. However, the capabilities and limitations associated with these models remain poorly understood due to the absence of a unified evaluation framework. To fill this gap, we introduce ProteinBench, a holistic evaluation framework designed to enhance the transparency of protein foundation models. Our approach consists of three key components: (i) A taxonomic classification of tasks that broadly encompass the main challenges in the protein domain, based on the relationships between different protein modalities; (ii) A multi-metric evaluation approach that assesses performance across four key dimensions: quality, novelty, diversity, and robustness; and (iii) In-depth analyses from various user objectives, providing a holistic view of model performance. Our comprehensive evaluation of protein foundation models reveals several key findings that shed light on their current capabilities and limitations. To promote transparency and facilitate further research, we release the evaluation dataset, code, and a public leaderboard publicly for further analysis and a general modular toolkit. We intend for ProteinBench to be a living benchmark for establishing a standardized, in-depth evaluation framework for protein foundation models, driving their development and application while fostering collaboration within the field.

[LG-109] Evaluation of Tropical Cyclone Track and Intensity Forecasts from Artificial Intelligence Weather Prediction (AIWP) Models

链接: https://arxiv.org/abs/2409.06735
作者: Mark DeMaria,James L. Franklin,Galina Chirokova,Jacob Radford,Robert DeMaria,Kate D. Musgrave,Imme Ebert-Uphoff
关键词-EN: Artificial Intelligence Weather, data-driven Artificial Intelligence, Intelligence Weather Prediction, multiple data-driven Artificial, Artificial Intelligence
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注: 33 pages. 9 figures, 3 tables

点击查看摘要

Abstract:In just the past few years multiple data-driven Artificial Intelligence Weather Prediction (AIWP) models have been developed, with new versions appearing almost monthly. Given this rapid development, the applicability of these models to operational forecasting has yet to be adequately explored and documented. To assess their utility for operational tropical cyclone (TC) forecasting, the NHC verification procedure is used to evaluate seven-day track and intensity predictions for northern hemisphere TCs from May-November 2023. Four open-source AIWP models are considered (FourCastNetv1, FourCastNetv2-small, GraphCast-operational and Pangu-Weather). The AIWP track forecast errors and detection rates are comparable to those from the best-performing operational forecast models. However, the AIWP intensity forecast errors are larger than those of even the simplest intensity forecasts based on climatology and persistence. The AIWP models almost always reduce the TC intensity, especially within the first 24 h of the forecast, resulting in a substantial low bias. The contribution of the AIWP models to the NHC model consensus was also evaluated. The consensus track errors are reduced by up to 11% at the longer time periods. The five-day NHC official track forecasts have improved by about 2% per year since 2001, so this represents more than a five-year gain in accuracy. Despite substantial negative intensity biases, the AIWP models have a neutral impact on the intensity consensus. These results show that the current formulation of the AIWP models have promise for operational TC track forecasts, but improved bias corrections or model reformulations will be needed for accurate intensity forecasts.
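The basic verification quantities are simple to state: track error is the great-circle distance between a forecast position and the best-track position, and intensity bias is a signed wind-speed difference. A hedged sketch with hypothetical numbers:

```python
import numpy as np

def track_error_km(lat_f, lon_f, lat_o, lon_o, radius_km=6371.0):
    """Great-circle (haversine) distance in km between forecast and observed TC positions."""
    phi1, phi2 = np.radians(lat_f), np.radians(lat_o)
    dphi = np.radians(lat_o - lat_f)
    dlam = np.radians(lon_o - lon_f)
    a = np.sin(dphi / 2) ** 2 + np.cos(phi1) * np.cos(phi2) * np.sin(dlam / 2) ** 2
    return 2 * radius_km * np.arcsin(np.sqrt(a))

# Example with hypothetical forecast vs. observed fixes (not real verification data).
print(round(track_error_km(25.0, -75.0, 25.4, -74.3), 1), "km track error")
print("intensity bias:", 95 - 110, "kt")   # forecast minus observed max wind (hypothetical)
```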

[LG-110] STAA: Spatio-Temporal Alignment Attention for Short-Term Precipitation Forecasting

链接: https://arxiv.org/abs/2409.06732
作者: Min Chen,Hao Yang,Shaohan Li,Xiaolin Qin
关键词-EN: accurately predict short-term, predict short-term precipitation, disaster prevention, accurately predict, socioeconomic effects
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:There is a great need to accurately predict short-term precipitation, which has socioeconomic effects such as agriculture and disaster prevention. Recently, the forecasting models have employed multi-source data as the multi-modality input, thus improving the prediction accuracy. However, the prevailing methods usually suffer from the desynchronization of multi-source variables, the insufficient capability of capturing spatio-temporal dependency, and unsatisfactory performance in predicting extreme precipitation events. To fix these problems, we propose a short-term precipitation forecasting model based on spatio-temporal alignment attention, with SATA as the temporal alignment module and STAU as the spatio-temporal feature extractor to filter high-pass features from precipitation signals and capture multi-term temporal dependencies. Based on satellite and ERA5 data from the southwestern region of China, our model achieves improvements of 12.61% in terms of RMSE, in comparison with the state-of-the-art methods.
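The alignment step builds on standard attention. Below is a generic scaled dot-product attention sketch in which queries from the precipitation timeline attend over a desynchronized auxiliary sequence; the shapes, names, and random features are purely illustrative and are not the SATA/STAU modules themselves.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: align each query step with the source sequence."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over source time steps
    return weights @ V

rng = np.random.default_rng(13)
precip_queries = rng.standard_normal((12, 32))   # 12 target precipitation time steps
satellite_keys = rng.standard_normal((20, 32))   # 20 unaligned auxiliary frames
aligned = attention(precip_queries, satellite_keys, satellite_keys)
print(aligned.shape)                              # (12, 32): one aligned feature per target step
```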

[LG-111] Leveraging RNNs and LSTMs for Synchronization Analysis in the Indian Stock Market: A Threshold-Based Classification Approach

链接: https://arxiv.org/abs/2409.06728
作者: Sanjay Sathish,Charu C Sharma
关键词-EN: Cross Recurrence Plot, stock price synchronization, non-linear time-series analysis, utilize recurrence plots, research presents
类目: atistical Finance (q-fin.ST); Machine Learning (cs.LG)
*备注: 24 Pages, 7 Figures

点击查看摘要

Abstract:Our research presents a new approach for forecasting the synchronization of stock prices using machine learning and non-linear time-series analysis. To capture the complex non-linear relationships between stock prices, we utilize recurrence plots (RP) and cross-recurrence quantification analysis (CRQA). By transforming Cross Recurrence Plot (CRP) data into a time-series format, we enable the use of Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM) networks for predicting stock price synchronization through both regression and classification. We apply this methodology to a dataset of 20 highly capitalized stocks from the Indian market over a 21-year period. The findings reveal that our approach can predict stock price synchronization, with an accuracy of 0.98 and F1 score of 0.83 offering valuable insights for developing effective trading strategies and risk management tools.
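A minimal sketch of the recurrence-plot construction the pipeline starts from: threshold the pairwise distance matrix of one series (recurrence plot) or of two series (cross-recurrence plot) before CRQA features are passed to the RNN/LSTM models; the threshold rule and toy series are assumptions.

```python
import numpy as np

def recurrence_plot(x, y=None, eps=None):
    """Thresholded (cross-)recurrence plot for 1-D state series."""
    y = x if y is None else y
    d = np.abs(x[:, None] - y[None, :])          # pairwise distances between states
    eps = np.quantile(d, 0.1) if eps is None else eps
    return (d <= eps).astype(int)                # 1 where the two trajectories recur

rng = np.random.default_rng(4)
price_a = np.cumsum(rng.standard_normal(200))
price_b = price_a + rng.standard_normal(200)     # a roughly synchronized series
crp = recurrence_plot(price_a, price_b)
print("cross-recurrence rate:", crp.mean())      # higher values suggest stronger synchronization
```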

[LG-112] MLP, XGBoost, KAN, TDNN, and LSTM-GRU Hybrid RNN with Attention for SPX and NDX European Call Option Pricing

链接: https://arxiv.org/abs/2409.06724
作者: Boris Ter-Avanesov,Homayoon Beigi
关键词-EN: neural network architectures, artificial neural network, recursive neural network, time-delay neural network, neural network
类目: Computational Finance (q-fin.CP); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注: 78 pages, 39 figures

点击查看摘要

Abstract:We explore the performance of various artificial neural network architectures, including a multilayer perceptron (MLP), Kolmogorov-Arnold network (KAN), LSTM-GRU hybrid recursive neural network (RNN) models, and a time-delay neural network (TDNN) for pricing European call options. In this study, we attempt to leverage the ability of supervised learning methods, such as ANNs, KANs, and gradient-boosted decision trees, to approximate complex multivariate functions in order to calibrate option prices based on past market data. The motivation for using ANNs and KANs is the Universal Approximation Theorem and Kolmogorov-Arnold Representation Theorem, respectively. Specifically, we use S&P 500 (SPX) and NASDAQ 100 (NDX) index options traded during 2015-2023 with times to maturity ranging from 15 days to over 4 years (OptionMetrics IvyDB US dataset). Black & Scholes's (BS) PDE model's [Black 1973] performance in pricing the same options compared to real data is used as a benchmark. This model relies on strong assumptions, and it has been observed and discussed in the literature that real data does not match its predictions. Supervised learning methods are widely used as an alternative for calibrating option prices due to some of the limitations of this model. In our experiments, the BS model underperforms compared to all of the others. Also, the best TDNN model outperforms the best MLP model on all error metrics. We implement a simple self-attention mechanism to enhance the RNN models, significantly improving their performance. The best-performing model overall is the LSTM-GRU hybrid RNN model with attention. Also, the KAN model outperforms the TDNN and MLP models. We analyze the performance of all models by ticker, moneyness category, and over/under/correctly-priced percentage.
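For concreteness, the Black-Scholes benchmark for a European call can be written out directly; the inputs below are hypothetical SPX-like values, and the learned models in the paper are calibrated to market prices rather than to this closed form.

```python
from math import erf, exp, log, sqrt

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def bs_call(S, K, T, r, sigma):
    """Black-Scholes price of a European call: spot S, strike K, maturity T (years),
    risk-free rate r, volatility sigma."""
    d1 = (log(S / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return S * norm_cdf(d1) - K * exp(-r * T) * norm_cdf(d2)

# Hypothetical inputs: spot 4500, strike 4600, 90 days to maturity, 4% rate, 18% implied vol.
print(round(bs_call(4500, 4600, 90 / 365, 0.04, 0.18), 2))
```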

[LG-113] Automated Quantification of White Blood Cells in Light Microscopic Images of Injured Skeletal Muscle

链接: https://arxiv.org/abs/2409.06722
作者: Yang Jiao,Hananeh Derakhshan,Barbara St. Pierre Schneider,Emma Regentova,Mei Yang
关键词-EN: White blood cells, White blood, diverse cell types, cell types observed, types observed
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 2 tables, 7 figures, 8 pages

点击查看摘要

Abstract:White blood cells (WBCs) are the most diverse cell types observed in the healing process of injured skeletal muscles. In the course of healing, WBCs exhibit dynamic cellular response and undergo multiple protein expression changes. The progress of healing can be analyzed by quantifying the number of WBCs or the amount of specific proteins in light microscopic images obtained at different time points after injury. In this paper, we propose an automated quantifying and analysis framework to analyze WBCs using light microscopic images of uninjured and injured muscles. The proposed framework is based on the Localized Iterative Otsu’s threshold method with muscle edge detection and region of interest extraction. Compared with the threshold methods used in ImageJ, the LI Otsu’s threshold method has high resistance to background area and achieves better accuracy. The CD68-positive cell results are presented for demonstrating the effectiveness of the proposed work.
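The building block here is Otsu's threshold, which the Localized Iterative variant applies within detected muscle regions. Below is a plain global Otsu sketch on synthetic intensities (not the authors' code):

```python
import numpy as np

def otsu_threshold(image, n_bins=256):
    """Global Otsu threshold: maximize between-class variance over a histogram of intensities."""
    hist, edges = np.histogram(np.ravel(image), bins=n_bins)
    p = hist.astype(float) / hist.sum()
    centers = 0.5 * (edges[:-1] + edges[1:])
    best_t, best_var = centers[0], -1.0
    for i in range(1, n_bins):
        w0, w1 = p[:i].sum(), p[i:].sum()
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (p[:i] * centers[:i]).sum() / w0
        mu1 = (p[i:] * centers[i:]).sum() / w1
        between_var = w0 * w1 * (mu0 - mu1) ** 2   # between-class variance to maximize
        if between_var > best_var:
            best_var, best_t = between_var, centers[i]
    return best_t

# Synthetic intensities: dim background pixels plus a small bright "cell" population.
rng = np.random.default_rng(5)
img = np.concatenate([rng.normal(60, 10, 5000), rng.normal(180, 15, 500)])
print("threshold ~", round(float(otsu_threshold(img)), 1))
```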

[LG-114] Detailed delineation of the fetal brain in diffusion MRI via multi-task learning

链接: https://arxiv.org/abs/2409.06716
作者: Davood Karimi,Camilo Calixto,Haykel Snoussi,Maria Camila Cortes-Albornoz,Clemente Velasco-Annis,Caitlin Rollins,Camilo Jaimes,Ali Gholipour,Simon K. Warfield
关键词-EN: Diffusion-weighted MRI, MRI is increasingly, fetal brain in-utero, fetal brain, study the normal
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

Abstract:Diffusion-weighted MRI is increasingly used to study the normal and abnormal development of fetal brain in-utero. Recent studies have shown that dMRI can offer invaluable insights into the neurodevelopmental processes in the fetal stage. However, because of the low data quality and rapid brain development, reliable analysis of fetal dMRI data requires dedicated computational methods that are currently unavailable. The lack of automated methods for fast, accurate, and reproducible data analysis has seriously limited our ability to tap the potential of fetal brain dMRI for medical and scientific applications. In this work, we developed and validated a unified computational framework to (1) segment the brain tissue into white matter, cortical/subcortical gray matter, and cerebrospinal fluid, (2) segment 31 distinct white matter tracts, and (3) parcellate the brain’s cortex and delineate the deep gray nuclei and white matter structures into 96 anatomically meaningful regions. We utilized a set of manual, semi-automatic, and automatic approaches to annotate 97 fetal brains. Using these labels, we developed and validated a multi-task deep learning method to perform the three computations. Our evaluations show that the new method can accurately carry out all three tasks, achieving a mean Dice similarity coefficient of 0.865 on tissue segmentation, 0.825 on white matter tract segmentation, and 0.819 on parcellation. The proposed method can greatly advance the field of fetal neuroimaging as it can lead to substantial improvements in fetal brain tractography, tract-specific analysis, and structural connectivity assessment.
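The reported evaluation metric is the Dice similarity coefficient. A small sketch on random volumes follows (the paper computes it per tissue class, tract, and parcellation region):

```python
import numpy as np

def dice(pred, target, eps=1e-8):
    """Dice similarity coefficient between two binary label maps."""
    pred, target = pred.astype(bool), target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

rng = np.random.default_rng(7)
gt = rng.random((64, 64, 64)) > 0.7
noisy_pred = np.logical_xor(gt, rng.random(gt.shape) > 0.95)   # ground truth with ~5% flipped voxels
print("Dice:", round(float(dice(noisy_pred, gt)), 3))
```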

信息检索

[IR-0] Dot Product is All You Need: Bridging the Gap Between Item Recommendation and Link Prediction

链接: https://arxiv.org/abs/2409.07433
作者: Daniele Malitesta,Alberto Carlo Maria Mancino,Pasquale Minervini,Tommaso Di Noia
关键词-EN: item recommendation problem, item recommendation task, identifying missing links, link prediction models, link prediction problem
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Item recommendation (the task of predicting if a user may interact with new items from the catalogue in a recommendation system) and link prediction (the task of identifying missing links in a knowledge graph) have long been regarded as distinct problems. In this work, we show that the item recommendation problem can be seen as an instance of the link prediction problem, where entities in the graph represent users and items, and the task consists of predicting missing instances of the relation type interactsWith. In a preliminary attempt to demonstrate the assumption, we decide to test three popular factorisation-based link prediction models on the item recommendation task, showing that their predictive accuracy is competitive with ten state-of-the-art recommendation models. The purpose is to show how the former may be seamlessly and effectively applied to the recommendation task without any specific modification to their architectures. Finally, while beginning to unveil the key reasons behind the recommendation performance of the selected link prediction models, we explore different settings for their hyper-parameter values, paving the way for future directions.
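A minimal sketch of this framing: score a (user, interactsWith, item) triple with a factorisation-based link predictor. The DistMult-style scoring below reduces to a weighted dot product; the random embeddings and dimensions are illustrative assumptions, not the authors' trained models.

```python
import numpy as np

rng = np.random.default_rng(8)
dim, n_users, n_items = 32, 100, 500
user_emb = rng.standard_normal((n_users, dim))
item_emb = rng.standard_normal((n_items, dim))
rel_interacts_with = rng.standard_normal(dim)          # one relation type: interactsWith

def score(u, i):
    # DistMult: <user, relation, item> reduces to a relation-weighted dot product.
    return float(np.sum(user_emb[u] * rel_interacts_with * item_emb[i]))

def top_k(u, k=5):
    scores = (user_emb[u] * rel_interacts_with) @ item_emb.T
    return np.argsort(-scores)[:k]

print("score(user 0, item 7):", round(score(0, 7), 3))
print("top-5 items for user 0:", top_k(0))
```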

[IR-1] Hierarchical Reinforcement Learning for Temporal Abstraction of Listwise Recommendation

链接: https://arxiv.org/abs/2409.07416
作者: Luo Ji,Gao Liu,Mingyang Yin,Hongxia Yang,Jingren Zhou
关键词-EN: short-term interest shifts, Modern listwise recommendation, listwise recommendation systems, Modern listwise, long-term user perceptions
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 18 pages, 4 figures

点击查看摘要

Abstract:Modern listwise recommendation systems need to consider both long-term user perceptions and short-term interest shifts. Reinforcement learning can be applied to recommendation to study such a problem but is also subject to a large search space, sparse user feedback and long interactive latency. Motivated by recent progress in hierarchical reinforcement learning, we propose a novel framework called mccHRL to provide different levels of temporal abstraction on listwise recommendation. Within the hierarchical framework, the high-level agent studies the evolution of user perception, while the low-level agent produces the item selection policy by modeling the process as a sequential decision-making problem. We argue that such a framework has a well-defined decomposition of the inter-session context and the intra-session context, which are encoded by the high-level and low-level agents, respectively. To verify this argument, we implement both a simulator-based environment and an industrial dataset-based experiment. The results show significant performance improvements from our method compared with several well-known baselines. Data and codes have been made public.

[IR-2] Enhancing Sequential Music Recommendation with Negative Feedback-informed Contrastive Learning

链接: https://arxiv.org/abs/2409.07367
作者: Pavan Seshadri,Shahrzad Shashaani,Peter Knees
关键词-EN: Modern music streaming, music streaming services, Modern music, streaming services, services are heavily
类目: Information Retrieval (cs.IR)
*备注: To-appear at 18th ACM Conference on Recommendation Systems

点击查看摘要

Abstract:Modern music streaming services are heavily based on recommendation engines to serve content to users. Sequential recommendation – continuously providing new items within a single session in a contextually coherent manner – has been an emerging topic in current literature. User feedback – a positive or negative response to the item presented – is used to drive content recommendations by learning user preferences. We extend this idea to session-based recommendation to provide context-coherent music recommendations by modelling negative user feedback, i.e., skips, in the loss function. We propose a sequence-aware contrastive sub-task to structure item embeddings in session-based music recommendation, such that true next-positive items (ignoring skipped items) are structured closer in the session embedding space, while skipped tracks are structured farther away from all items in the session. This directly affects item rankings using a K-nearest-neighbors search for next-item recommendations, while also promoting the rank of the true next item. Experiments incorporating this task into SoTA methods for sequential item recommendation show consistent performance gains in terms of next-item hit rate, item ranking, and skip down-ranking on three music recommendation datasets, strongly benefiting from the increasing presence of user feedback.
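A hedged sketch of such a skip-aware contrastive term: an InfoNCE-style loss that pulls the true next-positive item toward the session embedding while pushing skipped tracks away. Function names, temperature, and random embeddings are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def skip_aware_contrastive_loss(session_emb, next_pos_emb, skipped_embs, temperature=0.1):
    """InfoNCE-style loss: next-positive as the positive, skipped tracks as negatives."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    pos = np.exp(cos(session_emb, next_pos_emb) / temperature)
    negs = np.sum([np.exp(cos(session_emb, s) / temperature) for s in skipped_embs])
    return float(-np.log(pos / (pos + negs)))

rng = np.random.default_rng(9)
session = rng.standard_normal(64)
next_track = session + 0.1 * rng.standard_normal(64)        # coherent next-positive item
skips = rng.standard_normal((5, 64))                         # skipped (negative) tracks
print("loss:", skip_aware_contrastive_loss(session, next_track, skips))
```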

[IR-3] STORE: Streamlining Semantic Tokenization and Generative Recommendation with A Single LLM

链接: https://arxiv.org/abs/2409.07276
作者: Qijiong Liu,Jieming Zhu,Lu Fan,Zhou Zhao,Xiao-Ming Wu
关键词-EN: unique item identifiers, Traditional recommendation models, item content information, Traditional recommendation, rely on unique
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Traditional recommendation models often rely on unique item identifiers (IDs) to distinguish between items, which can hinder their ability to effectively leverage item content information and generalize to long-tail or cold-start items. Recently, semantic tokenization has been proposed as a promising solution that aims to tokenize each item’s semantic representation into a sequence of discrete tokens. In this way, it preserves the item’s semantics within these tokens and ensures that semantically similar items are represented by similar tokens. These semantic tokens have become fundamental in training generative recommendation models. However, existing generative recommendation methods typically involve multiple sub-models for embedding, quantization, and recommendation, leading to an overly complex system. In this paper, we propose to streamline the semantic tokenization and generative recommendation process with a unified framework, dubbed STORE, which leverages a single large language model (LLM) for both tasks. Specifically, we formulate semantic tokenization as a text-to-token task and generative recommendation as a token-to-token task, supplemented by a token-to-text reconstruction task and a text-to-token auxiliary task. All these tasks are framed in a generative manner and trained using a single LLM backbone. Extensive experiments have been conducted to validate the effectiveness of our STORE framework across various recommendation tasks and datasets. We will release the source code and configurations for reproducible research.

[IR-4] RePlay: a Recommendation Framework for Experimentation and Production Use

链接: https://arxiv.org/abs/2409.07272
作者: Alexey Vasilev,Anna Volodkevich,Denis Kulandin,Tatiana Bysheva,Anton Klenitskiy
关键词-EN: systems significantly reduces, build and compare, significantly reduces, reduces the time, time to market
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注:

点击查看摘要

Abstract:Using a single tool to build and compare recommender systems significantly reduces the time to market for new models. In addition, the comparison results when using such tools look more consistent. This is why many different tools and libraries for researchers in the field of recommendations have recently appeared. Unfortunately, most of these frameworks are aimed primarily at researchers and require modification for use in production due to the inability to work on large datasets or an inappropriate architecture. In this demo, we present our open-source toolkit RePlay - a framework containing an end-to-end pipeline for building recommender systems, which is ready for production use. RePlay also allows you to use a suitable stack for the pipeline on each stage: Pandas, Polars, or Spark. This allows the library to scale computations and deploy to a cluster. Thus, RePlay allows data scientists to easily move from research mode to production mode using the same interfaces.

[IR-5] Diff-VPS: Video Polyp Segmentation via a Multi-task Diffusion Network with Adversarial Temporal Reasoning

链接: https://arxiv.org/abs/2409.07238
作者: Yingling Lu,Yijun Yang,Zhaohu Xing,Qiong Wang,Lei Zhu
关键词-EN: Diffusion Probabilistic Models, recently attracted significant, attracted significant attention, computer vision due, Probabilistic Models
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Diffusion Probabilistic Models have recently attracted significant attention in the community of computer vision due to their outstanding performance. However, while a substantial amount of diffusion-based research has focused on generative tasks, no work introduces diffusion models to advance the results of polyp segmentation in videos, which is frequently challenged by polyps' high camouflage and redundant temporal cues. In this paper, we present a novel diffusion-based network for the video polyp segmentation task, dubbed Diff-VPS. We incorporate multi-task supervision into diffusion models to promote the discrimination of diffusion models on pixel-by-pixel segmentation. This integrates the contextual high-level information achieved by the joint classification and detection tasks. To explore the temporal dependency, a Temporal Reasoning Module (TRM) is devised via reasoning and reconstructing the target frame from the previous frames. We further equip TRM with a generative adversarial self-supervised strategy to produce more realistic frames and thus capture better dynamic cues. Extensive experiments are conducted on SUN-SEG, and the results indicate that our proposed Diff-VPS significantly achieves state-of-the-art performance. Code is available at this https URL.

[IR-6] Negative Sampling in Recommendation: A Survey and Future Directions

链接: https://arxiv.org/abs/2409.07237
作者: Haokai Ma,Ruobing Xie,Lei Meng,Fuli Feng,Xiaoyu Du,Xingwu Sun,Zhanhui Kang,Xiangxu Meng
关键词-EN: Recommender systems aim, capture users’ personalized, users’ personalized preferences, Recommender systems, making them pivotal
类目: Information Retrieval (cs.IR)
*备注: 38 pages, 9 figures; Under review

点击查看摘要

Abstract:Recommender systems aim to capture users’ personalized preferences from the vast amount of user behaviors, making them pivotal in the era of information explosion. However, the presence of dynamic preferences, the “information cocoons”, and the inherent feedback loops in recommendation make users interact with a limited number of items. Conventional recommendation algorithms typically focus on the positive historical behaviors, while neglecting the essential role of negative feedback in user interest understanding. As a promising but easily overlooked area, negative sampling is proficient at revealing the genuine negative aspect inherent in user behaviors, emerging as an inescapable procedure in recommendation. In this survey, we first discuss the role of negative sampling in recommendation and thoroughly analyze challenges that consistently impede its progress. Then, we conduct an extensive literature review on the existing negative sampling strategies in recommendation and classify them into five categories according to their distinct techniques. Finally, we detail the insights of the tailored negative sampling strategies in diverse recommendation scenarios and outline an overview of the prospective research directions toward which the community may engage and benefit.
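Two of the simplest strategies in this design space, uniform and popularity-weighted sampling over non-interacted items, can be sketched in a few lines (the survey covers far richer model-based and hybrid families; the data below is synthetic):

```python
import numpy as np

rng = np.random.default_rng(10)
n_items = 1_000
popularity = rng.zipf(1.5, size=n_items).astype(float)   # synthetic long-tail item popularity
interacted = {3, 17, 42, 99}                              # items the user has already seen

def sample_negatives(n, strategy="uniform"):
    candidates = np.array([i for i in range(n_items) if i not in interacted])
    if strategy == "uniform":
        p = None                                          # every non-interacted item is equally likely
    else:                                                 # "popularity": harder negatives from popular items
        w = popularity[candidates]
        p = w / w.sum()
    return rng.choice(candidates, size=n, replace=False, p=p)

print("uniform:   ", sample_negatives(5))
print("popularity:", sample_negatives(5, strategy="popularity"))
```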

[IR-7] E-commerce Webpage Recommendation Scheme Base on Semantic Mining and Neural Networks

链接: https://arxiv.org/abs/2409.07033
作者: Wenchao Zhao,Xiaoyi Liu,Ruilin Xu,Lingxi Xiao,Muqing Li
关键词-EN: page recommendation technology, web page recommendation, web, web page, page recommendation
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注: arXiv admin note: text overlap with arXiv:2409.01137

点击查看摘要

Abstract:Web page recommendation technology based on web mining has been widely used on e-commerce websites. However, existing recommendation solutions often cannot meet the actual application needs of online shopping users. To address this problem, this paper proposes an e-commerce web page recommendation solution that combines semantic web mining and BP neural networks. First, the web logs of user searches are processed, and five features are extracted: content priority, time consumption priority, online shopping users’ explicit/implicit feedback on the website, recommendation semantics and input deviation amount. Then, these features are used as inputs to the BP neural network to classify and identify the priority of the final output web page. Finally, the web pages are sorted according to priority and recommended to users. This project uses book sales webpages as samples for experiments. The results show that this solution can quickly and accurately identify the webpages required by users.
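A minimal stand-in for the described setup, assuming the five features per candidate page are already extracted: a small feed-forward (BP-style) network scores page priority, and pages are ranked by the predicted probability. The labels and data below are synthetic; this is not the paper's trained model.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(11)
# Columns: content priority, time-consumption priority, explicit/implicit feedback,
# recommendation semantics, input deviation amount (all synthetic here).
X = rng.random((500, 5))
y = (X @ np.array([0.4, 0.2, 0.2, 0.15, 0.05]) > 0.5).astype(int)   # synthetic priority labels

clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0).fit(X, y)
candidate_pages = rng.random((10, 5))
priority = clf.predict_proba(candidate_pages)[:, 1]
print("recommendation order:", np.argsort(-priority))               # pages sorted by priority
```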

[IR-8] Interactive Counterfactual Exploration of Algorithmic Harms in Recommender Systems

链接: https://arxiv.org/abs/2409.06916
作者: Yongsu Ahn,Quinn K Wolter,Jonilyn Dick,Janet Dick,Yu-Ru Lin
关键词-EN: shaping user interactions, integral to digital, interactions and preferences, Recommender systems, digital experiences
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:Recommender systems have become integral to digital experiences, shaping user interactions and preferences across various platforms. Despite their widespread use, these systems often suffer from algorithmic biases that can lead to unfair and unsatisfactory user experiences. This study introduces an interactive tool designed to help users comprehend and explore the impacts of algorithmic harms in recommender systems. By leveraging visualizations, counterfactual explanations, and interactive modules, the tool allows users to investigate how biases such as miscalibration, stereotypes, and filter bubbles affect their recommendations. Informed by in-depth user interviews, this tool benefits both general users and researchers by increasing transparency and offering personalized impact assessments, ultimately fostering a better understanding of algorithmic biases and contributing to more equitable recommendation outcomes. This work provides valuable insights for future research and practical applications in mitigating bias and enhancing fairness in machine learning algorithms.

[IR-9] Adversarial Attacks to Multi-Modal Models

链接: https://arxiv.org/abs/2409.06793
作者: Zhihao Dou,Xin Hu,Haibo Yang,Zhuqing Liu,Minghong Fang
关键词-EN: gained significant attention, significant attention due, powerful capabilities, gained significant, significant attention
类目: Cryptography and Security (cs.CR); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: To appear in the ACM Workshop on Large AI Systems and Models with Privacy and Safety Analysis 2024 (LAMPS '24)

点击查看摘要

Abstract:Multi-modal models have gained significant attention due to their powerful capabilities. These models effectively align embeddings across diverse data modalities, showcasing superior performance in downstream tasks compared to their unimodal counterparts. A recent study showed that an attacker can manipulate an image or audio file by altering it in such a way that its embedding matches that of an attacker-chosen targeted input, thereby deceiving downstream models. However, this method often underperforms due to inherent disparities in data from different modalities. In this paper, we introduce CrossFire, an innovative approach to attack multi-modal models. CrossFire begins by transforming the targeted input chosen by the attacker into a format that matches the modality of the original image or audio file. We then formulate our attack as an optimization problem, aiming to minimize the angular deviation between the embeddings of the transformed input and the modified image or audio file. Solving this problem determines the perturbations to be added to the original media. Our extensive experiments on six real-world benchmark datasets reveal that CrossFire can significantly manipulate downstream tasks, surpassing existing attacks. Additionally, we evaluate six defensive strategies against CrossFire, finding that current defenses are insufficient to counteract CrossFire.
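The core objective can be sketched as minimizing the angular deviation (equivalently, maximizing cosine similarity) between two embeddings. The toy gradient-ascent loop below perturbs an embedding directly; a real attack of the kind described would optimize the raw image or audio through the encoder, so this is only an illustration of the objective.

```python
import numpy as np

rng = np.random.default_rng(12)
emb = rng.standard_normal(256); emb /= np.linalg.norm(emb)            # original media embedding
target = rng.standard_normal(256); target /= np.linalg.norm(target)   # attacker-chosen target embedding
delta = np.zeros_like(emb)                                            # perturbation to optimize

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

lr = 0.1
for _ in range(300):
    v = emb + delta
    # Gradient of cos(v, target) with respect to v, ascended to shrink the angular deviation.
    grad = target / (np.linalg.norm(v) * np.linalg.norm(target)) \
           - cos(v, target) * v / (np.linalg.norm(v) ** 2)
    delta += lr * grad

print("cosine before:", round(float(cos(emb, target)), 3),
      "| after:", round(float(cos(emb + delta, target)), 3))
```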

[IR-10] Dual Adversarial Perturbators Generate rich Views for Recommendation

链接: https://arxiv.org/abs/2409.06719
作者: Lijun Zhang,Yuan Yao,Haibo Ye
关键词-EN: contrastive views, extensively studied, studied and leveraged, potent tool, GCL
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: 16 pages,6 figures and 5 tables

点击查看摘要

Abstract:Graph contrastive learning (GCL) has been extensively studied and leveraged as a potent tool in recommender systems. Most existing GCL-based recommenders generate contrastive views by altering the graph structure or introducing perturbations to embeddings. While these methods effectively enhance learning from sparse data, they risk performance degradation or even training collapse when the differences between contrastive views become too pronounced. To mitigate this issue, we employ curriculum learning to incrementally increase the disparity between contrastive views, enabling the model to gain from more challenging scenarios. In this paper, we propose a dual-adversarial graph learning approach, AvoGCL, which emulates curriculum learning by progressively applying adversarial training to graph structures and embedding perturbations. Specifically, AvoGCL constructs contrastive views by reducing graph redundancy and generating adversarial perturbations in the embedding space, and achieves better results by gradually increasing the difficulty of contrastive views. Extensive experiments on three real-world datasets demonstrate that AvoGCL significantly outperforms state-of-the-art competitors.
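Only the curriculum idea is sketched below: build two contrastive graph views by edge dropout and make the views progressively more dissimilar as training proceeds. AvoGCL itself learns adversarial perturbators for the graph structure and embeddings rather than using random dropout, so the schedule and toy graph here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(14)
n_nodes = 6
adj = (rng.random((n_nodes, n_nodes)) < 0.4).astype(int)
adj = np.triu(adj, 1); adj = adj + adj.T                      # symmetric adjacency, no self-loops

def drop_edges(adjacency, drop_rate):
    """One contrastive view: randomly drop a fraction of edges, keeping the view symmetric."""
    mask = rng.random(adjacency.shape) >= drop_rate
    mask = np.triu(mask, 1); mask = mask | mask.T
    return adjacency * mask

for epoch, drop_rate in enumerate(np.linspace(0.1, 0.5, 5)):  # harder views as training proceeds
    view_a, view_b = drop_edges(adj, drop_rate), drop_edges(adj, drop_rate)
    print(f"epoch {epoch}: drop={drop_rate:.1f}, edges kept: "
          f"{view_a.sum() // 2} / {view_b.sum() // 2} (of {adj.sum() // 2})")
```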

附件下载

点击下载今日全部论文列表