This blog post presents the latest paper listing retrieved from arXiv.org on 2024-09-02. It is updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily digest by email, please leave your email address in the comments.

Note: Paper data is fetched from arXiv.org daily, with an automatic update at around 10:30 AM.

Friendly reminder: if you would like to receive the daily paper data by email, please leave your email address in the comments; emails are likewise sent automatically at around 10:30 AM.

Table of Contents

Overview (2024-09-02)

A total of 328 papers were updated today, including:

  • Natural Language Processing: 30 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 72 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 75 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 98 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] Bridging Episodes and Semantics: A Novel Framework for Long-Form Video Understanding ECCV’24

Link: https://arxiv.org/abs/2408.17443
Authors: Gueter Josmy Faure, Jia-Fong Yeh, Min-Hung Chen, Hung-Ting Su, Winston H. Hsu, Shang-Hong Lai
Keywords: reflects human cognition, accurately reflects human, extended short videos, treats long-form videos, human cognition
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted to the EVAL-FoMo Workshop at ECCV’24. Project page: this https URL


Abstract:While existing research often treats long-form videos as extended short videos, we propose a novel approach that more accurately reflects human cognition. This paper introduces BREASE: BRidging Episodes And SEmantics for Long-Form Video Understanding, a model that simulates episodic memory accumulation to capture action sequences and reinforces them with semantic knowledge dispersed throughout the video. Our work makes two key contributions: First, we develop an Episodic COmpressor (ECO) that efficiently aggregates crucial representations from micro to semi-macro levels. Second, we propose a Semantics reTRiever (SeTR) that enhances these aggregated representations with semantic information by focusing on the broader context, dramatically reducing feature dimensionality while preserving relevant macro-level information. Extensive experiments demonstrate that BREASE achieves state-of-the-art performance across multiple long video understanding benchmarks in both zero-shot and fully-supervised settings. The project page and code are at: this https URL.

[NLP-1] SYNTHEVAL: Hybrid Behavioral Testing of NLP Models with Synthetic CheckLists

Link: https://arxiv.org/abs/2408.17437
Authors: Raoyuan Zhao, Abdullatif Köksal, Yihong Liu, Leonie Weissweiler, Anna Korhonen, Hinrich Schütze
Keywords: NLP typically involves, Traditional benchmarking, held-out test sets, static held-out test, NLP models
Categories: Computation and Language (cs.CL)


Abstract:Traditional benchmarking in NLP typically involves using static held-out test sets. However, this approach often results in an overestimation of performance and lacks the ability to offer comprehensive, interpretable, and dynamic assessments of NLP models. Recently, works like DynaBench (Kiela et al., 2021) and CheckList (Ribeiro et al., 2020) have addressed these limitations through behavioral testing of NLP models with test types generated by a multistep human-annotated pipeline. Unfortunately, manually creating a variety of test types requires much human labor, often at prohibitive cost. In this work, we propose SYNTHEVAL, a hybrid behavioral testing framework that leverages large language models (LLMs) to generate a wide range of test types for a comprehensive evaluation of NLP models. SYNTHEVAL first generates sentences via LLMs using controlled generation, and then identifies challenging examples by comparing the predictions made by LLMs with task-specific NLP models. In the last stage, human experts investigate the challenging examples, manually design templates, and identify the types of failures the taskspecific models consistently exhibit. We apply SYNTHEVAL to two classification tasks, sentiment analysis and toxic language detection, and show that our framework is effective in identifying weaknesses of strong models on these tasks. We share our code in this https URL.
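SYNTHEVAL's middle stage, flagging generated sentences on which the LLM and the task-specific model disagree, can be sketched in a few lines. Here `llm_predict` and `task_model_predict` are hypothetical toy stand-ins for real model calls, not SYNTHEVAL's actual interface:

```python
# Sketch of disagreement-based filtering for a single-label classification
# task. The two predict functions below are invented toy rules standing in
# for an LLM and a task-specific NLP model.
def llm_predict(sentence: str) -> str:
    # Toy rule in place of an LLM prediction.
    return "positive" if "good" in sentence else "negative"

def task_model_predict(sentence: str) -> str:
    # Toy rule in place of a task-specific classifier.
    return "positive" if "great" in sentence or "good" in sentence else "negative"

def find_challenging(sentences):
    """Keep the generated sentences on which the two models disagree."""
    return [s for s in sentences if llm_predict(s) != task_model_predict(s)]

examples = ["a good movie", "a great movie", "a dull movie"]
print(find_challenging(examples))  # the models disagree only on "a great movie"
```

The surviving examples are then handed to human experts for template design, as the abstract describes.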

[NLP-2] CLOCR-C: Context Leveraging OCR Correction with Pre-trained Language Models

Link: https://arxiv.org/abs/2408.17428
Authors: Jonathan Bourne
Keywords: historical print media, print media archives, Optical Character Recognition, Context Leveraging OCR, Leveraging OCR Correction
Categories: Computation and Language (cs.CL); Digital Libraries (cs.DL)
Comments: 13 pages, 3 figures, currently under peer review


Abstract:The digitisation of historical print media archives is crucial for increasing accessibility to contemporary records. However, the process of Optical Character Recognition (OCR) used to convert physical records to digital text is prone to errors, particularly in the case of newspapers and periodicals due to their complex layouts. This paper introduces Context Leveraging OCR Correction (CLOCR-C), which utilises the infilling and context-adaptive abilities of transformer-based language models (LMs) to improve OCR quality. The study aims to determine if LMs can perform post-OCR correction, improve downstream NLP tasks, and the value of providing the socio-cultural context as part of the correction process. Experiments were conducted using seven LMs on three datasets: the 19th Century Serials Edition (NCSE) and two datasets from the Overproof collection. The results demonstrate that some LMs can significantly reduce error rates, with the top-performing model achieving over a 60% reduction in character error rate on the NCSE dataset. The OCR improvements extend to downstream tasks, such as Named Entity Recognition, with increased Cosine Named Entity Similarity. Furthermore, the study shows that providing socio-cultural context in the prompts improves performance, while misleading prompts lower performance. In addition to the findings, this study releases a dataset of 91 transcribed articles from the NCSE, containing a total of 40 thousand words, to support further research in this area. The findings suggest that CLOCR-C is a promising approach for enhancing the quality of existing digital archives by leveraging the socio-cultural information embedded in the LMs and the text requiring correction.
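The headline metric, character error rate (CER), is edit distance normalized by reference length. A minimal sketch, with invented example strings rather than NCSE data:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edits needed per reference character."""
    return levenshtein(reference, hypothesis) / len(reference)

reference = "The quick brown fox"
raw_ocr = "Tbe quick brovvn fox"      # invented OCR noise
print(cer(reference, raw_ocr))        # > 0 before correction
print(cer(reference, reference))      # 0.0 for a perfect correction
```

A "60% reduction in character error rate" means the post-correction CER is at most 40% of the raw OCR's CER on the same references.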

[NLP-3] NDP: Next Distribution Prediction as a More Broad Target

Link: https://arxiv.org/abs/2408.17377
Authors: Junhao Ruan, Abudukeyumu Abudula, Xinyu Liu, Bei Li, Yinqiao Li, Chenglong Wang, Yuchun Fan, Yuan Ge, Tong Xiao, Jingbo Zhu
Keywords: Large language models, demonstrated powerful capabilities, Large language, existing NTP paradigm, trained on next-token
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 8 pages, 5 figures


Abstract:Large language models (LLMs) trained on next-token prediction (NTP) paradigm have demonstrated powerful capabilities. However, the existing NTP paradigm contains several limitations, particularly related to planned task complications and error propagation during inference. In our work, we extend the critique of NTP, highlighting its limitation also due to training with a narrow objective: the prediction of a sub-optimal one-hot distribution. To support this critique, we conducted a pre-experiment treating the output distribution from powerful LLMs as efficient world data compression. By evaluating the similarity between the n -gram distribution and the one-hot distribution with LLMs, we observed that the n -gram distributions align more closely with the output distribution of LLMs. Based on this insight, we introduce Next Distribution Prediction (NDP), which uses n -gram distributions to replace the one-hot targets, enhancing learning without extra online training time. We conducted experiments across translation, general task, language transfer, and medical domain adaptation. Compared to NTP, NDP can achieve up to +2.97 COMET improvement in translation tasks, +0.61 average improvement in general tasks, and incredible +10.75 average improvement in the medical domain. This demonstrates the concrete benefits of addressing the target narrowing problem, pointing to a new direction for future work on improving NTP.
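The core idea, replacing a one-hot next-token target with an empirical n-gram distribution, can be illustrated with a bigram model over a toy corpus. This is a schematic sketch of the target construction only, not the paper's training pipeline:

```python
from collections import Counter, defaultdict

def bigram_next_distribution(tokens):
    """Empirical next-token distribution conditioned on the previous token."""
    counts = defaultdict(Counter)
    for prev, cur in zip(tokens, tokens[1:]):
        counts[prev][cur] += 1
    return {ctx: {tok: n / sum(c.values()) for tok, n in c.items()}
            for ctx, c in counts.items()}

corpus = "the cat sat on the mat the cat ran".split()
dist = bigram_next_distribution(corpus)
# After "the": "cat" twice, "mat" once -> a soft target rather than a
# one-hot vector, which is the kind of broader target NDP trains against.
print(dist["the"])
```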

[NLP-4] Assessing Generative Language Models in Classification Tasks: Performance and Self-Evaluation Capabilities in the Environmental and Climate Change Domain

Link: https://arxiv.org/abs/2408.17362
Authors: Francesca Grasso, Stefano Locci
Keywords: Small Language Model, Large Language Models, Small Language, Language Models, Large Language
Categories: Computation and Language (cs.CL)
Comments: 11 pages, to be published in NLDB 2024


Abstract:This paper examines the performance of two Large Language Models (LLMs), GPT3.5 and Llama2 and one Small Language Model (SLM) Gemma, across three different classification tasks within the climate change (CC) and environmental domain. Employing BERT-based models as a baseline, we compare their efficacy against these transformer-based models. Additionally, we assess the models’ self-evaluation capabilities by analyzing the calibration of verbalized confidence scores in these text classification tasks. Our findings reveal that while BERT-based models generally outperform both the LLMs and SLM, the performance of the large generative models is still noteworthy. Furthermore, our calibration analysis reveals that although Gemma is well-calibrated in initial tasks, it thereafter produces inconsistent results; Llama is reasonably calibrated, and GPT consistently exhibits strong calibration. Through this research, we aim to contribute to the ongoing discussion on the utility and effectiveness of generative LMs in addressing some of the planet’s most urgent issues, highlighting their strengths and limitations in the context of ecology and CC.
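Calibration of verbalized confidence scores is typically measured with expected calibration error (ECE): bin predictions by confidence and compare each bin's average confidence to its accuracy. A small sketch with invented numbers (the paper's exact calibration analysis may differ):

```python
def expected_calibration_error(confidences, correct, n_bins=5):
    """ECE: |accuracy - avg confidence| averaged over equal-width bins,
    weighted by how many predictions fall in each bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(o for _, o in b) / len(b)
        ece += len(b) / total * abs(accuracy - avg_conf)
    return ece

# Toy verbalized confidences vs. whether each prediction was right.
confs = [0.9, 0.8, 0.95, 0.6, 0.55]
right = [1, 1, 1, 0, 1]
print(expected_calibration_error(confs, right))  # lower is better calibrated
```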

[NLP-5] Impact of ChatGPT on the writing style of condensed matter physicists

Link: https://arxiv.org/abs/2408.17325
Authors: Shaojun Xu, Xiaohui Ye, Mengqi Zhang, Pei Wang
Keywords: condensed matter papers, approach to estimate, papers on arXiv, estimate the impact, condensed matter
Categories: Computation and Language (cs.CL); Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech)
Comments: 9 pages, 1 figure, 7 tables


Abstract:We apply a state-of-the-art difference-in-differences approach to estimate the impact of ChatGPT’s release on the writing style of condensed matter papers on arXiv. Our analysis reveals a statistically significant improvement in the English quality of abstracts written by non-native English speakers. Importantly, this improvement remains robust even after accounting for other potential factors, confirming that it can be attributed to the release of ChatGPT. This indicates widespread adoption of the tool. Following the release of ChatGPT, there is a significant increase in the use of unique words, while the frequency of rare words decreases. Across language families, the changes in writing style are significant for authors from the Latin and Ural-Altaic groups, but not for those from the Germanic or other Indo-European groups.
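The difference-in-differences estimator compares the pre/post change in a treated group (here, abstracts by non-native English speakers, who plausibly adopted ChatGPT) against the change in a control group, netting out trends common to both. A toy sketch with made-up quality scores, not the paper's data:

```python
def did_estimate(treat_pre, treat_post, ctrl_pre, ctrl_post):
    """Difference-in-differences: the treated group's change minus the
    control group's change over the same period."""
    def mean(xs):
        return sum(xs) / len(xs)
    return (mean(treat_post) - mean(treat_pre)) - (mean(ctrl_post) - mean(ctrl_pre))

# Invented English-quality scores before/after the ChatGPT release.
non_native_pre, non_native_post = [60, 62, 61], [70, 72, 71]  # treated group
native_pre, native_post = [80, 81, 79], [82, 83, 81]          # control group
print(did_estimate(non_native_pre, non_native_post, native_pre, native_post))
```

The control group absorbs any quality drift unrelated to the treatment, which is what lets the residual change be attributed to ChatGPT.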

[NLP-6] Modularity in Transformers: Investigating Neuron Separability & Specialization

Link: https://arxiv.org/abs/2408.17324
Authors: Nicholas Pochinkov, Thomas Jones, Mohammed Rashidur Rahman
Keywords: workings remains limited, internal workings remains, remains limited, increasingly prevalent, workings remains
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 11 pages, 6 figures


Abstract:Transformer models are increasingly prevalent in various applications, yet our understanding of their internal workings remains limited. This paper investigates the modularity and task specialization of neurons within transformer architectures, focusing on both vision (ViT) and language (Mistral 7B) models. Using a combination of selective pruning and MoEfication clustering techniques, we analyze the overlap and specialization of neurons across different tasks and data subsets. Our findings reveal evidence of task-specific neuron clusters, with varying degrees of overlap between related tasks. We observe that neuron importance patterns persist to some extent even in randomly initialized models, suggesting an inherent structure that training refines. Additionally, we find that neuron clusters identified through MoEfication correspond more strongly to task-specific neurons in earlier and later layers of the models. This work contributes to a more nuanced understanding of transformer internals and offers insights into potential avenues for improving model interpretability and efficiency.
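Overlap between task-specific neuron sets can be quantified with, for example, the Jaccard index over each task's top-k most important neurons. A sketch with invented importance scores, not the paper's pruning-derived importances:

```python
def top_k_neurons(importances, k):
    """Indices of the k highest-importance neurons."""
    return set(sorted(range(len(importances)), key=lambda i: -importances[i])[:k])

def jaccard(a, b):
    """Overlap of two sets: |intersection| / |union|."""
    return len(a & b) / len(a | b)

# Toy per-neuron importance scores for two tasks (illustrative values).
task_a = [0.9, 0.1, 0.8, 0.05, 0.7, 0.02]
task_b = [0.85, 0.6, 0.75, 0.01, 0.1, 0.03]
overlap = jaccard(top_k_neurons(task_a, 3), top_k_neurons(task_b, 3))
print(overlap)  # partial overlap between the two tasks' neuron clusters
```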

[NLP-7] Investigating Neuron Ablation in Attention Heads: The Case for Peak Activation Centering

Link: https://arxiv.org/abs/2408.17322
Authors: Nicholas Pochinkov, Ben Pasero, Skylar Shibayama
Keywords: rapidly throughout society, growing rapidly, ablation, Abstract, models
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 2 figures, XAI World Conference 2024 Late-Breaking Work


Abstract:The use of transformer-based models is growing rapidly throughout society. With this growth, it is important to understand how they work, and in particular, how the attention mechanisms represent concepts. Though there are many interpretability methods, many look at models through their neuronal activations, which are poorly understood. We describe different lenses through which to view neuron activations, and investigate the effectiveness in language models and vision transformers through various methods of neural ablation: zero ablation, mean ablation, activation resampling, and a novel approach we term ‘peak ablation’. Through experimental analysis, we find that in different regimes and models, each method can offer the lowest degradation of model performance compared to other methods, with resampling usually causing the most significant performance deterioration. We make our code available at this https URL.
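Two of the ablation lenses compared above, zero ablation and mean ablation, differ only in the value substituted for the targeted neuron. A minimal sketch on a toy samples-by-neurons activation matrix (plain lists standing in for real tensors; resampling and the paper's "peak ablation" are not shown):

```python
def zero_ablate(activations, neuron):
    """Set one neuron's activation to zero in every sample."""
    return [[0.0 if j == neuron else a for j, a in enumerate(row)]
            for row in activations]

def mean_ablate(activations, neuron):
    """Replace one neuron's activation with its mean over the dataset."""
    mean = sum(row[neuron] for row in activations) / len(activations)
    return [[mean if j == neuron else a for j, a in enumerate(row)]
            for row in activations]

acts = [[1.0, 2.0], [3.0, 4.0]]   # toy (samples x neurons) activations
print(zero_ablate(acts, 1))       # neuron 1 zeroed everywhere
print(mean_ablate(acts, 1))       # neuron 1 replaced by its mean, 3.0
```

Comparing downstream performance after each substitution is what lets the paper rank the ablation methods by how little degradation they cause.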

[NLP-8] Bridging Domain Knowledge and Process Discovery Using Large Language Models

Link: https://arxiv.org/abs/2408.17316
Authors: Ali Norouzifar, Humam Kourani, Marcus Dees, Wil van der Aalst
Keywords: Discovering good process, Discovering good, process analysis tasks, process, analysis tasks
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted at the AI4BPM 2024 workshop, to be published in its proceedings


Abstract:Discovering good process models is essential for different process analysis tasks such as conformance checking and process improvements. Automated process discovery methods often overlook valuable domain knowledge. This knowledge, including insights from domain experts and detailed process documentation, remains largely untapped during process discovery. This paper leverages Large Language Models (LLMs) to integrate such knowledge directly into process discovery. We use rules derived from LLMs to guide model construction, ensuring alignment with both domain knowledge and actual process executions. By integrating LLMs, we create a bridge between process knowledge expressed in natural language and the discovery of robust process models, advancing process discovery methodologies significantly. To showcase the usability of our framework, we conducted a case study with the UWV employee insurance agency, demonstrating its practical benefits and effectiveness.

[NLP-9] Towards Tailored Recovery of Lexical Diversity in Literary Machine Translation

Link: https://arxiv.org/abs/2408.17308
Authors: Esther Ploeger, Huiyuan Lai, Rik van Noord, Antonio Toral
Keywords: lexical diversity, lexically poorer, lexical, diversity, translation
Categories: Computation and Language (cs.CL)
Comments: Accepted to EAMT 2024


Abstract:Machine translations are found to be lexically poorer than human translations. The loss of lexical diversity through MT poses an issue in the automatic translation of literature, where it matters not only what is written, but also how it is written. Current methods for increasing lexical diversity in MT are rigid. Yet, as we demonstrate, the degree of lexical diversity can vary considerably across different novels. Thus, rather than aiming for the rigid increase of lexical diversity, we reframe the task as recovering what is lost in the machine translation process. We propose a novel approach that consists of reranking translation candidates with a classifier that distinguishes between original and translated text. We evaluate our approach on 31 English-to-Dutch book translations, and find that, for certain books, our approach retrieves lexical diversity scores that are close to human translation.
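A simple proxy for the lexical diversity being recovered is the type-token ratio, unique words over total words; published evaluations typically use more robust measures (e.g. MTLD), but the intuition is the same. The example sentences below are invented:

```python
def type_token_ratio(text: str) -> float:
    """Lexical diversity as unique words / total words (a crude proxy)."""
    words = text.lower().split()
    return len(set(words)) / len(words)

human = "the weary traveller trudged along the dusty winding road"
machine = "the tired traveller walked along the dusty road the road"  # invented MT output
# MT repeats words more, so its ratio is lower -- the gap this work recovers.
print(type_token_ratio(human) > type_token_ratio(machine))
```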

[NLP-10] Flexible and Effective Mixing of Large Language Models into a Mixture of Domain Experts

Link: https://arxiv.org/abs/2408.17280
Authors: Rhui Dih Lee, Laura Wynter, Raghu Kiran Ganti
Keywords: creating low-cost, trained models, Abstract, MOE, toolkit
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)


Abstract:We present a toolkit for creating low-cost Mixture-of-Domain-Experts (MOE) from trained models. The toolkit can be used for creating a mixture from models or from adapters. We perform extensive tests and offer guidance on defining the architecture of the resulting MOE using the toolkit. A public repository is available.

[NLP-11] Improving Extraction of Clinical Event Contextual Properties from Electronic Health Records: A Comparative Study

Link: https://arxiv.org/abs/2408.17181
Authors: Shubham Agarwal, Thomas Searle, Mart Ratas, Anthony Shek, James Teo, Richard Dobson
Keywords: Electronic Health Records, Electronic Health, Health Records, significant portion stored, unstructured text format
Categories: Computation and Language (cs.CL)


Abstract:Electronic Health Records are large repositories of valuable clinical data, with a significant portion stored in unstructured text format. This textual data includes clinical events (e.g., disorders, symptoms, findings, medications and procedures) in context that if extracted accurately at scale can unlock valuable downstream applications such as disease prediction. Using an existing Named Entity Recognition and Linking methodology, MedCAT, these identified concepts need to be further classified (contextualised) for their relevance to the patient, and their temporal and negated status for example, to be useful downstream. This study performs a comparative analysis of various natural language models for medical text classification. Extensive experimentation reveals the effectiveness of transformer-based language models, particularly BERT. When combined with class imbalance mitigation techniques, BERT outperforms Bi-LSTM models by up to 28% and the baseline BERT model by up to 16% for recall of the minority classes. The method has been implemented as part of CogStack/MedCAT framework and made available to the community for further research.
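A common class-imbalance mitigation of the kind combined with BERT here is inverse-frequency class weighting, which scales the loss so minority-class errors cost more. A sketch with illustrative label names, not MedCAT's actual contextual classes:

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class inversely to its frequency; the weights average
    to 1 across classes, so minority classes get weights above 1."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: total / (len(counts) * n) for cls, n in counts.items()}

# Toy imbalanced labels: "negated" is the minority class.
labels = ["negated"] * 2 + ["affirmed"] * 8
print(inverse_frequency_weights(labels))
```

Frameworks typically pass such weights to the loss function (e.g. a weighted cross-entropy), which is what boosts minority-class recall as reported above.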

[NLP-12] MaFeRw: Query Rewriting with Multi-Aspect Feedbacks for Retrieval-Augmented Large Language Models

Link: https://arxiv.org/abs/2408.17072
Authors: Yujing Wang, Hainan Zhang, Liang Pang, Hongwei Zheng, Zhiming Zheng
Keywords: involves spoken ellipses, real-world RAG system, describe user information, query rewriting, necessitating query rewriting
Categories: Computation and Language (cs.CL)


Abstract:In a real-world RAG system, the current query often involves spoken ellipses and ambiguous references from dialogue contexts, necessitating query rewriting to better describe user’s information needs. However, traditional context-based rewriting has minimal enhancement on downstream generation tasks due to the lengthy process from query rewriting to response generation. Some researchers try to utilize reinforcement learning with generation feedback to assist the rewriter, but these sparse rewards provide little guidance in most cases, leading to unstable training and generation results. We find that user’s needs are also reflected in the gold document, retrieved documents and ground truth. Therefore, by feeding back these multi-aspect dense rewards to query rewriting, more stable and satisfactory responses can be achieved. In this paper, we propose a novel query rewriting method MaFeRw, which improves RAG performance by integrating multi-aspect feedback from both the retrieval process and generated results. Specifically, we first use manual data to train a T5 model for the rewriter initialization. Next, we design three metrics as reinforcement learning feedback: the similarity between the rewritten query and the gold document, the ranking metrics, and ROUGE between the generation and the ground truth. Inspired by RLAIF, we train three kinds of reward models for the above metrics to achieve more efficient training. Finally, we combine the scores of these reward models as feedback, and use PPO algorithm to explore the optimal query rewriting strategy. Experimental results on two conversational RAG datasets demonstrate that MaFeRw achieves superior generation metrics and more stable training compared to baselines.

[NLP-13] Novel-WD: Exploring acquisition of Novel World Knowledge in LLMs Using Prefix-Tuning

Link: https://arxiv.org/abs/2408.17070
Authors: Maxime Méloux, Christophe Cerisara
Keywords: pre-trained large language, large language models, pre-trained large, crucial but challenging, Teaching
Categories: Computation and Language (cs.CL)


Abstract:Teaching new information to pre-trained large language models (PLM) is a crucial but challenging task. Model adaptation techniques, such as fine-tuning and parameter-efficient training have been shown to store new facts at a slow rate; continual learning is an option but is costly and prone to catastrophic forgetting. This work studies and quantifies how PLM may learn and remember new world knowledge facts that do not occur in their pre-training corpus, which only contains world knowledge up to a certain date. To that purpose, we first propose Novel-WD, a new dataset consisting of sentences containing novel facts extracted from recent Wikidata updates, along with two evaluation tasks in the form of causal language modeling and multiple choice questions (MCQ). We make this dataset freely available to the community, and release a procedure to later build new versions of similar datasets with up-to-date information. We also explore the use of prefix-tuning for novel information learning, and analyze how much information can be stored within a given prefix. We show that a single fact can reliably be encoded within a single prefix, and that the prefix capacity increases with its length and with the base model size.

[NLP-14] From Text to Emotion: Unveiling the Emotion Annotation Capabilities of LLMs INTERSPEECH2024

Link: https://arxiv.org/abs/2408.17026
Authors: Minxue Niu (1), Mimansa Jaiswal (2), Emily Mower Provost (1) ((1) University of Michigan, (2) Independent Researcher)
Keywords: Large Language Models, human annotated data, Training emotion recognition, emotion recognition models, annotated data
Categories: Computation and Language (cs.CL)
Comments: To be published in Interspeech 2024


Abstract:Training emotion recognition models has relied heavily on human annotated data, which present diversity, quality, and cost challenges. In this paper, we explore the potential of Large Language Models (LLMs), specifically GPT4, in automating or assisting emotion annotation. We compare GPT4 with supervised models and or humans in three aspects: agreement with human annotations, alignment with human perception, and impact on model training. We find that common metrics that use aggregated human annotations as ground truth can underestimate the performance, of GPT-4 and our human evaluation experiment reveals a consistent preference for GPT-4 annotations over humans across multiple datasets and evaluators. Further, we investigate the impact of using GPT-4 as an annotation filtering process to improve model training. Together, our findings highlight the great potential of LLMs in emotion annotation tasks and underscore the need for refined evaluation methodologies.
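Agreement between GPT-4 and human annotators is commonly measured with a chance-corrected statistic such as Cohen's kappa; the paper's exact metrics may differ, and the labels below are invented:

```python
def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two annotators over the same items."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    labels = set(rater_a) | set(rater_b)
    # Expected agreement if both raters labeled independently at their
    # observed label frequencies.
    expected = sum((rater_a.count(l) / n) * (rater_b.count(l) / n) for l in labels)
    return (observed - expected) / (1 - expected)

gpt4_labels = ["joy", "anger", "joy", "sad", "joy"]     # toy annotations
human_labels = ["joy", "anger", "sad", "sad", "joy"]
print(cohens_kappa(gpt4_labels, human_labels))  # 1.0 = perfect, 0 = chance level
```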

[NLP-15] InkubaLM: A small language model for low-resource African languages

Link: https://arxiv.org/abs/2408.17024
Authors: Atnafu Lambebo Tonja, Bonaventure F. P. Dossou, Jessica Ojo, Jenalea Rajab, Fadel Thior, Eric Peter Wairagala, Aremu Anuoluwapo, Pelonomi Moiloa, Jade Abbott, Vukosi Marivate, Benjamin Rosman
Keywords: amidst significant computing, African context, High-resource language models, High-resource language, locally relevant
Categories: Computation and Language (cs.CL)


Abstract:High-resource language models often fall short in the African context, where there is a critical need for models that are efficient, accessible, and locally relevant, even amidst significant computing and data constraints. This paper introduces InkubaLM, a small language model with 0.4 billion parameters, which achieves performance comparable to models with significantly larger parameter counts and more extensive training data on tasks such as machine translation, question-answering, AfriMMLU, and the AfriXnli task. Notably, InkubaLM outperforms many larger models in sentiment analysis and demonstrates remarkable consistency across multiple languages. This work represents a pivotal advancement in challenging the conventional paradigm that effective language models must rely on substantial resources. Our model and datasets are publicly available \footnote\urlthis https URL to encourage research and development on low-resource languages.

[NLP-16] Dynamic Self-Consistency: Leveraging Reasoning Paths for Efficient LLM Sampling
[NLP-16] 动态自一致性:利用推理路径实现高效LLM采样

链接: https://arxiv.org/abs/2408.17017
作者: Guangya Wan,Yuqi Wu,Jie Chen,Sheng Li
关键词-EN: Large Language Models, Language Models, Large Language, hallucinations in Large, frequent solution
关键词-ZH: 大型语言模型,语言模型,大型语言,幻觉,频繁解决方案
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Self-Consistency (SC) is a widely used method to mitigate hallucinations in Large Language Models (LLMs) by sampling the LLM multiple times and outputting the most frequent solution. Despite its benefits, SC results in significant computational costs proportional to the number of samples generated. Previous early-stopping approaches, such as Early Stopping Self Consistency and Adaptive Consistency, have aimed to reduce these costs by considering output consistency, but they do not analyze the quality of the reasoning paths (RPs) themselves. To address this issue, we propose Reasoning-Aware Self-Consistency (RASC), an innovative early-stopping framework that dynamically adjusts the number of sample generations by considering both the output answer and the RPs from Chain of Thought (CoT) prompting. RASC assigns confidence scores sequentially to the generated samples, stops when certain criteria are met, and then employs weighted majority voting to optimize sample usage and enhance answer reliability. We comprehensively test RASC with multiple LLMs across varied QA datasets. RASC outperforms existing methods, significantly reducing sample usage by an average of 80% while maintaining or improving accuracy by up to 5% compared to the original SC.
摘要:自一致性(Self-Consistency, SC)是一种广泛用于缓解大型语言模型(LLM)幻觉的方法,它通过多次采样LLM并输出最频繁的解来实现。尽管SC有其好处,但它会带来与生成样本数量成正比的显著计算成本。以往的提前停止方法,如Early Stopping Self Consistency和Adaptive Consistency,旨在通过考虑输出一致性来降低这些成本,但它们并不分析推理路径(RP)本身的质量。为解决这一问题,我们提出了推理感知自一致性(Reasoning-Aware Self-Consistency, RASC),这是一个创新的提前停止框架,通过同时考虑输出答案和思维链(CoT)提示产生的推理路径,动态调整样本生成数量。RASC按顺序为生成的样本分配置信度分数,在满足一定条件时停止,然后采用加权多数投票来优化样本使用并提高答案可靠性。我们在不同的QA数据集上使用多个LLM对RASC进行了全面测试。与原始SC相比,RASC优于现有方法,在保持准确率或将其提升至多5%的同时,平均减少了80%的样本使用量。
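上述"先采样、给推理路径打置信分、累计达到阈值即提前停止、最后加权多数投票"的流程,可以用一个极简的 Python 示意来说明(sample_fn、score_fn 的接口以及各阈值取值均为假设,并非论文的实际实现):

```python
from collections import Counter

def self_consistency_early_stop(sample_fn, score_fn,
                                max_samples=40, min_samples=3,
                                confidence_threshold=2.0):
    """对 RASC 思想的极简示意:逐个采样答案,为推理路径打置信分,
    最优答案的累计权重达到阈值即提前停止,最后按加权多数投票给出答案。
    sample_fn 返回 (答案, 推理路径),score_fn 为路径打分,接口均为假设。"""
    answers, weights = [], []
    best = None
    for _ in range(max_samples):
        answer, reasoning_path = sample_fn()
        answers.append(answer)
        weights.append(score_fn(reasoning_path))
        # 加权计票:同一答案的置信分数累加
        tally = Counter()
        for a, w in zip(answers, weights):
            tally[a] += w
        best, best_weight = tally.most_common(1)[0]
        # 提前停止:已达最少采样数,且最优答案的累计置信度足够高
        if len(answers) >= min_samples and best_weight >= confidence_threshold:
            break
    return best, len(answers)
```

例如,对一个始终返回同一答案、路径置信分为 1.0 的采样器,上述函数在满足最少采样数后即停止,而不会用满 40 次采样预算,这正是相对朴素 SC 节省样本的来源。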

[NLP-17] Tool-Assisted Agent on SQL Inspection and Refinement in Real-World Scenarios
[NLP-17] 在现实场景中进行SQL检查和细化的工具辅助代理

链接: https://arxiv.org/abs/2408.16991
作者: Zhongyuan Wang,Richong Zhang,Zhijie Nie,Jaein Kim
关键词-EN: large language models, leverage large language, database management system, methods leverage large, language models
关键词-ZH: 大型语言模型、利用大型语言、数据库管理系统、利用大型语言模型的方法
类目: Computation and Language (cs.CL)
备注: work in progress

点击查看摘要

Abstract:Recent Text-to-SQL methods leverage large language models (LLMs) by incorporating feedback from the database management system. While these methods effectively address execution errors in SQL queries, they struggle with database mismatches – errors that do not trigger execution exceptions. Database mismatches include issues such as condition mismatches and stricter constraint mismatches, both of which are more prevalent in real-world scenarios. To address these challenges, we propose a tool-assisted agent framework for SQL inspection and refinement, equipping the LLM-based agent with two specialized tools: a retriever and a detector, designed to diagnose and correct SQL queries with database mismatches. These tools enhance the capability of LLMs to handle real-world queries more effectively. We also introduce Spider-Mismatch, a new dataset specifically constructed to reflect the condition mismatch problems encountered in real-world scenarios. Experimental results demonstrate that our method achieves the highest performance on the averaged results of the Spider and Spider-Realistic datasets in few-shot settings, and it significantly outperforms baseline methods on the more realistic dataset, Spider-Mismatch.
摘要:最近的Text-to-SQL方法通过结合数据库管理系统的反馈来利用大型语言模型(LLM)。虽然这些方法能有效解决SQL查询中的执行错误,但它们难以应对数据库不匹配问题,即那些不会触发执行异常的错误。数据库不匹配包括条件不匹配和更严格的约束不匹配等问题,二者在现实场景中都更为常见。为应对这些挑战,我们提出了一个用于SQL检查和细化的工具辅助代理框架,为基于LLM的代理配备两个专用工具:检索器和检测器,用于诊断和纠正存在数据库不匹配的SQL查询。这些工具增强了LLM更有效处理现实查询的能力。我们还引入了Spider-Mismatch,这是一个专门为反映现实场景中条件不匹配问题而构建的新数据集。实验结果表明,在少样本设置下,我们的方法在Spider和Spider-Realistic数据集的平均结果上取得了最高性能,并在更贴近现实的数据集Spider-Mismatch上显著优于基线方法。
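摘要中所说的"条件不匹配"(例如 WHERE 条件的字面量在数据库中并不存在,查询不报错却悄悄返回空结果),可以用下面的朴素检查来示意(db 的数据结构与函数接口均为假设,与论文中的检索器/检测器实现无对应关系):

```python
def detect_condition_mismatch(db, table, column, literal):
    """朴素示意:检查 SQL WHERE 条件中的字面量是否出现在目标列的取值中。
    这类"条件不匹配"不会触发执行异常,查询只会悄悄返回空结果。
    db 在此用嵌套的 dict/list 模拟真实数据库,纯属假设接口。"""
    values = {row[column] for row in db.get(table, [])}
    return literal not in values  # True 表示疑似条件不匹配
```

例如,当查询写成 WHERE city = 'NYC' 而该列中只存有 'NY' 时,上述检查会提示不匹配,从而提示代理改写条件,而不是等待执行报错。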

[NLP-18] MemLong: Memory-Augmented Retrieval for Long Text Modeling
[NLP-18] MemLong:用于长文本建模的内存增强检索

链接: https://arxiv.org/abs/2408.16967
作者: Weijie Liu,Zecheng Tang,Juntao Li,Kehai Chen,Min Zhang
关键词-EN: yielded remarkable success, Large Language Models, Recent advancements, advancements in Large, Large Language
关键词-ZH: 取得了显着的成功,大型语言模型,最近的进步,大型语言的进步
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advancements in Large Language Models (LLMs) have yielded remarkable success across diverse fields. However, handling long contexts remains a significant challenge for LLMs due to the quadratic time and space complexity of attention mechanisms and the growing memory consumption of the key-value cache during generation. This work introduces MemLong: Memory-Augmented Retrieval for Long Text Generation, a method designed to enhance the capabilities of long-context language modeling by utilizing an external retriever for historical information retrieval. MemLong combines a non-differentiable "ret-mem" module with a partially trainable decoder-only language model and introduces a fine-grained, controllable retrieval attention mechanism that leverages semantic-level relevant chunks. Comprehensive evaluations on multiple long-context language modeling benchmarks demonstrate that MemLong consistently outperforms other state-of-the-art LLMs. More importantly, MemLong can extend the context length on a single 3090 GPU from 4k up to 80k. Our code is available at this https URL.
摘要:大型语言模型(LLM)的最新进展在各个领域取得了显著成功。然而,由于注意力机制在时间和空间上的二次复杂度,以及生成过程中键值缓存不断增长的内存消耗,处理长上下文对LLM来说仍是重大挑战。本文介绍了MemLong(Memory-Augmented Retrieval for Long Text Generation),一种通过外部检索器检索历史信息来增强长上下文语言建模能力的方法。MemLong将不可微的"ret-mem"模块与部分可训练的仅解码器语言模型相结合,并引入了一种利用语义级相关块的细粒度、可控检索注意力机制。在多个长上下文语言建模基准上的综合评估表明,MemLong的性能始终优于其他最先进的LLM。更重要的是,MemLong可以将单个3090 GPU上的上下文长度从4k扩展到80k。我们的代码可从this https URL获得。
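其中"用外部检索器取回历史信息块"这一步,可以用如下基于余弦相似度的朴素检索来示意(memory 的结构与向量化方式均为假设,仅用于说明思路,并非论文中 ret-mem 模块的实现):

```python
import math

def retrieve_top_k(query_vec, memory, k=2):
    """按余弦相似度从外部"记忆"中取回最相关的历史块。
    memory 为 (向量, 文本块) 的列表;向量如何得到不在示意范围内。"""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(x * x for x in b))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
    ranked = sorted(memory, key=lambda item: cosine(query_vec, item[0]),
                    reverse=True)
    return [chunk for _, chunk in ranked[:k]]
```

取回的块随后可以作为额外上下文拼入模型输入;论文中真正的机制是在注意力层面融合这些块,此处仅演示检索本身。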

[NLP-19] UserSumBench: A Benchmark Framework for Evaluating User Summarization Approaches
[NLP-19] UserSumBench:评估用户摘要方法的基准框架

链接: https://arxiv.org/abs/2408.16966
作者: Chao Wang,Neo Wu,Lin Ning,Luyang Liu,Jun Xie,Shawn O’Banion,Bradley Green
关键词-EN: Large language models, shown remarkable capabilities, user activity data, Large language, raw user activity
关键词-ZH: 大型语言模型,表现出非凡的功能,用户活动数据,大型语言,原始用户活动
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have shown remarkable capabilities in generating user summaries from a long list of raw user activity data. These summaries capture essential user information such as preferences and interests, and therefore are invaluable for LLM-based personalization applications, such as explainable recommender systems. However, the development of new summarization techniques is hindered by the lack of ground-truth labels, the inherent subjectivity of user summaries, and human evaluation which is often costly and time-consuming. To address these challenges, we introduce UserSumBench, a benchmark framework designed to facilitate iterative development of LLM-based summarization approaches. This framework offers two key components: (1) A reference-free summary quality metric. We show that this metric is effective and aligned with human preferences across three diverse datasets (MovieLens, Yelp and Amazon Review). (2) A novel robust summarization method that leverages time-hierarchical summarizer and self-critique verifier to produce high-quality summaries while eliminating hallucination. This method serves as a strong baseline for further innovation in summarization techniques.
摘要:大型语言模型(LLM)在从一长串原始用户活动数据中生成用户摘要方面表现出了非凡的能力。这些摘要捕获了偏好和兴趣等基本用户信息,因此对可解释推荐系统等基于LLM的个性化应用极具价值。然而,新摘要技术的发展受到以下因素的阻碍:缺乏真实标签、用户摘要固有的主观性,以及通常代价高昂且耗时的人工评估。为应对这些挑战,我们引入了UserSumBench,一个旨在促进基于LLM的摘要方法迭代开发的基准测试框架。该框架提供两个关键组件:(1)无参考的摘要质量指标。我们表明,该指标有效,且在三个不同数据集(MovieLens、Yelp和Amazon Review)上与人类偏好一致。(2)一种新颖且稳健的摘要方法,利用时间分层摘要器和自我批评验证器,在消除幻觉的同时生成高质量摘要。该方法为摘要技术的进一步创新提供了强有力的基线。

[NLP-20] A longitudinal sentiment analysis of Sinophobia during COVID-19 using large language models
[NLP-20] 使用大型语言模型对COVID-19期间恐华症的纵向情绪分析

链接: https://arxiv.org/abs/2408.16942
作者: Chen Wang,Rohitash Chandra
关键词-EN: exacerbated xenophobia, leading to widespread, widespread discrimination, discrimination against individuals, Sinophobic sentiments
关键词-ZH: 仇外心理加剧,导致广泛、广泛的歧视、对个人的歧视、仇华情绪
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The COVID-19 pandemic has exacerbated xenophobia, particularly Sinophobia, leading to widespread discrimination against individuals of Chinese descent. Large language models (LLMs) are pre-trained deep learning models used for natural language processing (NLP) tasks. The ability of LLMs to understand and generate human-like text makes them particularly useful for analysing social media data to detect and evaluate sentiments. We present a sentiment analysis framework utilising LLMs for longitudinal sentiment analysis of the Sinophobic sentiments expressed in X (Twitter) during the COVID-19 pandemic. The results show a significant correlation between the spikes in Sinophobic tweets, Sinophobic sentiments and surges in COVID-19 cases, revealing that the evolution of the pandemic influenced public sentiment and the prevalence of Sinophobic discourse. Furthermore, the sentiment analysis revealed a predominant presence of negative sentiments, such as annoyance and denial, which underscores the impact of political narratives and misinformation shaping public opinion. The lack of empathetic sentiment which was present in previous studies related to COVID-19 highlights the way the political narratives in media viewed the pandemic and how it blamed the Chinese community. Our study highlights the importance of transparent communication in mitigating xenophobic sentiments during global crises.
摘要:新冠肺炎疫情加剧了仇外心理,尤其是恐华症,导致对华裔个人的普遍歧视。大型语言模型(LLM)是用于自然语言处理(NLP)任务的预训练深度学习模型。LLM理解和生成类人文本的能力使其在分析社交媒体数据以检测和评估情绪方面特别有用。我们提出了一个利用LLM的情绪分析框架,对新冠肺炎大流行期间X(Twitter)上表达的恐华情绪进行纵向情绪分析。结果显示,恐华推文的激增、恐华情绪与新冠肺炎病例的激增之间存在显著相关性,揭示了疫情的演变影响了公众情绪和恐华话语的流行。此外,情绪分析显示,恼怒和否认等负面情绪占主导地位,这突显了政治叙事和错误信息对舆论的塑造作用。与此前新冠肺炎相关研究不同,同理心情绪的缺失突显了媒体上的政治叙事如何看待疫情并将其归咎于华人社区。我们的研究强调了透明沟通在缓解全球危机期间仇外情绪方面的重要性。

[NLP-21] Plausible-Parrots @ MSP2023: Enhancing Semantic Plausibility Modeling using Entity and Event Knowledge
[NLP-21] Plausible-Parrots @ MSP 2023:使用实体和事件知识增强语义合理性建模

链接: https://arxiv.org/abs/2408.16937
作者: Chong Shen,Chenyue Zhou
关键词-EN: large language model, injecting external knowledge, identify semantic plausibility, external knowledge base, external knowledge
关键词-ZH: 大语言模型,注入外部知识,识别语义可信度,外部知识库,外部知识
类目: Computation and Language (cs.CL)
备注: 10 pages, 5 figures, 5 tables

点击查看摘要

Abstract:In this work, we investigate the effectiveness of injecting external knowledge to a large language model (LLM) to identify semantic plausibility of simple events. Specifically, we enhance the LLM with fine-grained entity types, event types and their definitions extracted from an external knowledge base. These knowledge are injected into our system via designed templates. We also augment the data to balance the label distribution and adapt the task setting to real world scenarios in which event mentions are expressed as natural language sentences. The experimental results show the effectiveness of the injected knowledge on modeling semantic plausibility of events. An error analysis further emphasizes the importance of identifying non-trivial entity and event types.
摘要:在这项工作中,我们研究了将外部知识注入大型语言模型(LLM)以识别简单事件的语义可信度的有效性。具体来说,我们通过从外部知识库中提取的细粒度实体类型、事件类型及其定义来增强LLM。这些知识通过设计的模板注入到我们的系统中。我们还增强数据以平衡标签分布并使任务设置适应现实世界场景,其中事件提及被表示为自然语言句子。实验结果表明了注入的知识对事件语义可信度建模的有效性。错误分析进一步强调了识别非平凡实体和事件类型的重要性。
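摘要中提到的"通过设计的模板将外部知识注入系统",其思路可以用一个简单的提示模板来示意(模板措辞与字段均为假设,并非论文使用的模板):

```python
def build_plausibility_prompt(event, entity_types, event_type, definition):
    """示意如何把外部知识库中的细粒度实体类型、事件类型及其定义
    通过模板注入到提示中,再交给 LLM 判断语义合理性。"""
    return (
        f"事件: {event}\n"
        f"涉及实体类型: {', '.join(entity_types)}\n"
        f"事件类型: {event_type}({definition})\n"
        "问题: 该事件在语义上是否合理?请回答 是 或 否。"
    )
```

这样,模型在判断时不仅看到事件文本本身,还看到了来自知识库的类型与定义信息,对应摘要中"注入的知识有助于语义合理性建模"的设定。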

[NLP-22] Event Extraction for Portuguese: A QA-driven Approach using ACE-2005
[NLP-22] 葡萄牙语事件提取:使用ACE-2005的QA驱动方法

链接: https://arxiv.org/abs/2408.16932
作者: Luís Filipe Cunha,Ricardo Campos,Alípio Jorge
关键词-EN: Information Retrieval task, Information Retrieval, Retrieval task, Portuguese, commonly consists
关键词-ZH: 信息检索任务,信息检索,检索任务,葡萄牙语,通常包括
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Event extraction is an Information Retrieval task that commonly consists of identifying the central word for the event (trigger) and the event’s arguments. This task has been extensively studied for English but lags behind for Portuguese, partly due to the lack of task-specific annotated corpora. This paper proposes a framework in which two separated BERT-based models were fine-tuned to identify and classify events in Portuguese documents. We decompose this task into two sub-tasks. Firstly, we use a token classification model to detect event triggers. To extract event arguments, we train a Question Answering model that queries the triggers about their corresponding event argument roles. Given the lack of event annotated corpora in Portuguese, we translated the original version of the ACE-2005 dataset (a reference in the field) into Portuguese, producing a new corpus for Portuguese event extraction. To accomplish this, we developed an automatic translation pipeline. Our framework obtains F1 marks of 64.4 for trigger classification and 46.7 for argument classification setting, thus a new state-of-the-art reference for these tasks in Portuguese.
摘要:事件提取是一项信息检索任务,通常包括识别事件的中心词(触发器)和事件的参数。这项任务在英语中得到了广泛的研究,但在葡萄牙语中却落后了,部分原因是缺乏针对任务的注释语料库。本文提出了一个框架,其中两个独立的基于ERT的模型被微调来识别和分类葡萄牙语文档中的事件。我们将该任务分解为两个子任务。首先,我们使用令牌分类模型来检测事件触发器。为了提取事件参数,我们训练了一个问答模型,该模型查询触发器对应的事件参数角色。由于缺乏葡萄牙语的事件注释语料库,我们将ACE-2005数据集的原始版本(该领域的参考资料)翻译成葡萄牙语,生成了一个新的葡萄牙语事件提取语料库。为了实现这一点,我们开发了一条自动翻译管道。我们的框架在触发器分类方面获得了64.4分的F1分数,在参数分类设置方面获得了46.7分,从而为这些任务提供了一个新的葡萄牙语最先进的参考。

[NLP-23] ACE-2005-PT: Corpus for Event Extraction in Portuguese
[NLP-23] ACE-2005-PT:葡萄牙语事件提取数据库

链接: https://arxiv.org/abs/2408.16928
作者: Luís Filipe Cunha,Purificação Silvano,Ricardo Campos,Alípio Jorge
关键词-EN: NLP task, commonly involves identifying, task that commonly, commonly involves, Event extraction
关键词-ZH: NLP任务,通常涉及识别,通常涉及的任务,事件提取
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Event extraction is an NLP task that commonly involves identifying the central word (trigger) for an event and its associated arguments in text. ACE-2005 is widely recognised as the standard corpus in this field. While other corpora, like PropBank, primarily focus on annotating predicate-argument structure, ACE-2005 provides comprehensive information about the overall event structure and semantics. However, its limited language coverage restricts its usability. This paper introduces ACE-2005-PT, a corpus created by translating ACE-2005 into Portuguese, with European and Brazilian variants. To speed up the process of obtaining ACE-2005-PT, we rely on automatic translators. This, however, poses some challenges related to automatically identifying the correct alignments between multi-word annotations in the original text and in the corresponding translated sentence. To achieve this, we developed an alignment pipeline that incorporates several alignment techniques: lemmatization, fuzzy matching, synonym matching, multiple translations and a BERT-based word aligner. To measure the alignment effectiveness, a subset of annotations from the ACE-2005-PT corpus was manually aligned by a linguist expert. This subset was then compared against our pipeline results which achieved exact and relaxed match scores of 70.55% and 87.55% respectively. As a result, we successfully generated a Portuguese version of the ACE-2005 corpus, which has been accepted for publication by LDC.
摘要:事件抽取是一项自然语言处理任务,通常涉及识别文本中事件的中心词(触发器)及其相关参数。ACE-2005被广泛认为是该领域的标准语料库。虽然其他语料库,如PropBank,主要专注于注释谓词参数结构,但ACE-2005提供了有关整体事件结构和语义的全面信息。然而,其有限的语言覆盖范围限制了它的可用性。本文介绍了ACE-2005-PT语料库,该语料库是将ACE-2005翻译成葡萄牙语建立的,带有欧洲和巴西的变体。为了加快获得ACE-2005-PT的进程,我们依赖自动翻译器。然而,这带来了一些与自动识别原文和相应翻译句子中的多个单词注释之间的正确对齐相关的挑战。为了实现这一点,我们开发了一种结合了几种对齐技术的对齐管道:词汇化、模糊匹配、同义词匹配、多翻译和基于BERT的单词对齐器。为了衡量对齐的有效性,一位语言学家专家手动对ACE-2005-PT语料库中的注释子集进行了对齐。然后将该子集与我们的流水线结果进行比较,获得了70.55和87.55的精确匹配分数和轻松匹配分数。因此,我们成功地生成了葡萄牙语版本的ACE-2005语料库,该语料库已被最不发达国家接受出版。

[NLP-24] Exploring Multiple Strategies to Improve Multilingual Coreference Resolution in CorefUD
[NLP-24] 探索多种策略以提高CorefUD中的多语言共指解析度

链接: https://arxiv.org/abs/2408.16893
作者: Ondřej Pražák,Miloslav Konopík
关键词-EN: natural language processing, identifying expressions, expressions in text, text that refer, critical component
关键词-ZH: 自然语言处理、识别表达、文本中的表达、引用的文本、关键组件
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Coreference resolution, the task of identifying expressions in text that refer to the same entity, is a critical component in various natural language processing (NLP) applications. This paper presents our end-to-end neural coreference resolution system, utilizing the CorefUD 1.1 dataset, which spans 17 datasets across 12 languages. We first establish strong baseline models, including monolingual and cross-lingual variations, and then propose several extensions to enhance performance across diverse linguistic contexts. These extensions include cross-lingual training, incorporation of syntactic information, a Span2Head model for optimized headword prediction, and advanced singleton modeling. We also experiment with headword span representation and long-documents modeling through overlapping segments. The proposed extensions, particularly the heads-only approach, singleton modeling, and long document prediction significantly improve performance across most datasets. We also perform zero-shot cross-lingual experiments, highlighting the potential and limitations of cross-lingual transfer in coreference resolution. Our findings contribute to the development of robust and scalable coreference systems for multilingual coreference resolution. Finally, we evaluate our model on the CorefUD 1.1 test set and surpass the best model of a comparable size from the CRAC 2023 shared task by a large margin. Our model is available on GitHub: this https URL
摘要:共指消解旨在识别文本中指称同一实体的表达,是各类自然语言处理(NLP)应用中的关键组件。本文介绍了我们的端到端神经共指消解系统,使用CorefUD 1.1数据集,该数据集涵盖12种语言的17个数据集。我们首先建立了强大的基线模型,包括单语和跨语言变体,然后提出了若干扩展,以提升在不同语言背景下的性能。这些扩展包括跨语言训练、句法信息的融入、用于优化中心词预测的Span2Head模型,以及改进的单指称(singleton)建模。我们还实验了中心词片段表示以及通过重叠片段进行的长文档建模。所提出的扩展,特别是仅中心词方法、单指称建模和长文档预测,显著提升了大多数数据集上的性能。我们还进行了零样本跨语言实验,揭示了跨语言迁移在共指消解中的潜力与局限。我们的发现有助于开发稳健且可扩展的多语言共指消解系统。最后,我们在CorefUD 1.1测试集上评估了我们的模型,并以较大优势超过了CRAC 2023共享任务中规模相当的最佳模型。我们的模型可在GitHub上获取:this https URL

[NLP-25] LLaVA-Chef: A Multi-modal Generative Model for Food Recipes
[NLP-25] LLaVA-Chef:食品食谱的多模式生成模型

链接: https://arxiv.org/abs/2408.16889
作者: Fnu Mohbat,Mohammed J. Zaki
关键词-EN: rapidly evolving landscape, Natural Language Processing, online recipe sharing, globalized context, rapidly evolving
关键词-ZH: 快速发展的景观、自然语言处理、在线食谱共享、全球化背景、快速发展
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In the rapidly evolving landscape of online recipe sharing within a globalized context, there has been a notable surge in research towards comprehending and generating food recipes. Recent advancements in large language models (LLMs) like GPT-2 and LLaVA have paved the way for Natural Language Processing (NLP) approaches to delve deeper into various facets of food-related tasks, encompassing ingredient recognition and comprehensive recipe generation. Despite impressive performance and multi-modal adaptability of LLMs, domain-specific training remains paramount for their effective application. This work evaluates existing LLMs for recipe generation and proposes LLaVA-Chef, a novel model trained on a curated dataset of diverse recipe prompts in a multi-stage approach. First, we refine the mapping of visual food image embeddings to the language space. Second, we adapt LLaVA to the food domain by fine-tuning it on relevant recipe data. Third, we utilize diverse prompts to enhance the model’s recipe comprehension. Finally, we improve the linguistic quality of generated recipes by penalizing the model with a custom loss function. LLaVA-Chef demonstrates impressive improvements over pretrained LLMs and prior works. A detailed qualitative analysis reveals that LLaVA-Chef generates more detailed recipes with precise ingredient mentions, compared to existing approaches.
摘要:在全球化背景下,在线食谱分享迅速发展,对理解和生成食物食谱的研究也显著增多。GPT-2和LLaVA等大型语言模型(LLM)的最新进展,为自然语言处理(NLP)方法深入研究食品相关任务的各个方面铺平了道路,包括食材识别和完整食谱生成。尽管LLM性能出色且具备多模态适应性,特定领域的训练对其有效应用仍然至关重要。这项工作评估了现有LLM在食谱生成上的表现,并提出了LLaVA-Chef,一个以多阶段方法在精选的多样化食谱提示数据集上训练的新模型。首先,我们改进了视觉食物图像嵌入到语言空间的映射。其次,我们通过在相关食谱数据上微调,使LLaVA适应食品领域。第三,我们利用多样化的提示来增强模型对食谱的理解。最后,我们用自定义损失函数对模型进行惩罚,以提高生成食谱的语言质量。LLaVA-Chef相比预训练LLM和先前工作展现了令人印象深刻的改进。详细的定性分析表明,与现有方法相比,LLaVA-Chef生成的食谱更详细,食材提及也更准确。

[NLP-26] Modeling offensive content detection for TikTok
[NLP-26] 为TikTok建模攻击性内容检测

链接: https://arxiv.org/abs/2408.16857
作者: Kasper Cools,Gideon Mailette de Buy Wenniger,Clara Maathuis
关键词-EN: transformed interpersonal communication, media transformed interpersonal, information consumption processes, social media platforms, social media transformed
关键词-ZH: 转变的人际沟通、媒体转变的人际沟通、信息消费过程、社交媒体平台、社交媒体转变
类目: Computation and Language (cs.CL)
备注: Accepted as a conference paper at DPSH 2024, 8 pages

点击查看摘要

Abstract:The advent of social media transformed interpersonal communication and information consumption processes. This digital landscape accommodates user intentions, also resulting in an increase of offensive language and harmful behavior. Concurrently, social media platforms collect vast datasets comprising user-generated content and behavioral information. These datasets are instrumental for platforms deploying machine learning and data-driven strategies, facilitating customer insights and countermeasures against social manipulation mechanisms like disinformation and offensive content. Nevertheless, the availability of such datasets, along with the application of various machine learning techniques, to researchers and practitioners, for specific social media platforms regarding particular events, is limited. In particular for TikTok, which offers unique tools for personalized content creation and sharing, the existing body of knowledge would benefit from having diverse comprehensive datasets and associated data analytics solutions on offensive content. While efforts from social media platforms, research, and practitioner communities are seen on this behalf, such content continues to proliferate. This translates to an essential need to make datasets publicly available and build corresponding intelligent solutions. On this behalf, this research undertakes the collection and analysis of TikTok data containing offensive content, building a series of machine learning and deep learning models for offensive content detection. This is done aiming at answering the following research question: “How to develop a series of computational models to detect offensive content on TikTok?”. To this end, a Data Science methodological approach is considered, 120,423 TikTok comments are collected, and with a balanced, binary classification approach, an F1 score of 0.863 is obtained.
摘要:社交媒体的出现改变了人际交流和信息消费过程。这一数字环境在迎合用户意图的同时,也导致了攻击性语言和有害行为的增加。与此同时,社交媒体平台收集了包含用户生成内容和行为信息的大量数据集。这些数据集对部署机器学习和数据驱动策略的平台非常有用,有助于洞察客户,并对抗虚假信息和攻击性内容等社交操纵机制。然而,对研究人员和从业者而言,针对特定事件、特定社交媒体平台的此类数据集及各种机器学习技术的应用仍然有限。尤其是为个性化内容创作与分享提供独特工具的TikTok,现有知识体系将受益于针对攻击性内容的多样化综合数据集及相应的数据分析解决方案。尽管社交媒体平台、研究和从业者社区已在这方面做出努力,此类内容仍在持续扩散。这意味着亟需公开数据集并构建相应的智能解决方案。为此,本研究收集并分析了含有攻击性内容的TikTok数据,构建了一系列用于攻击性内容检测的机器学习和深度学习模型,旨在回答以下研究问题:"如何开发一系列计算模型来检测TikTok上的攻击性内容?"。为此,采用了数据科学方法论,收集了120,423条TikTok评论,并在平衡的二分类设置下取得了0.863的F1分数。
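文中报告的 F1 分数是二分类精确率与召回率的调和平均,其计算可示意如下(测试数据为虚构示例,与论文的模型和结果无关):

```python
def f1_score_binary(y_true, y_pred):
    """二分类 F1 的直接计算:精确率与召回率的调和平均。
    约定正类标签为 1(此处即"攻击性内容")。"""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

在类别平衡的设置下(如本文所采用),F1 与准确率往往接近;类别失衡时 F1 对正类的检出质量更敏感,这也是攻击性内容检测中常用它的原因。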

[NLP-27] See or Guess: Counterfactually Regularized Image Captioning ACM-MM2024
[NLP-27] 看还是猜:反事实规则化的图像字幕

链接: https://arxiv.org/abs/2408.16809
作者: Qian Cao,Xu Chen,Ruihua Song,Xiting Wang,Xinting Huang,Yuchen Ren
关键词-EN: generates natural language, natural language descriptions, vision-language research, language descriptions, visual information
关键词-ZH: 生成自然语言、自然语言描述、视觉语言研究、语言描述、视觉信息
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM)
备注: Accepted by ACM MM 2024

点击查看摘要

Abstract:Image captioning, which generates natural language descriptions of the visual information in an image, is a crucial task in vision-language research. Previous models have typically addressed this task by aligning the generative capabilities of machines with human intelligence through statistical fitting of existing datasets. While effective for normal images, they may struggle to accurately describe those where certain parts of the image are obscured or edited, unlike humans who excel in such cases. These weaknesses they exhibit, including hallucinations and limited interpretability, often hinder performance in scenarios with shifted association patterns. In this paper, we present a generic image captioning framework that employs causal inference to make existing models more capable of interventional tasks, and counterfactually explainable. Our approach includes two variants leveraging either total effect or natural direct effect. Integrating them into the training process enables models to handle counterfactual scenarios, increasing their generalizability. Extensive experiments on various datasets show that our method effectively reduces hallucinations and improves the model’s faithfulness to images, demonstrating high portability across both small-scale and large-scale image-to-text models. The code is available at this https URL.
摘要:图像字幕是视觉语言研究中的一项重要任务,它对图像中的视觉信息进行自然语言描述。以前的模型通常通过对现有数据集进行统计拟合,使机器的生成能力与人类的智能保持一致,从而解决这一任务。虽然对正常图像有效,但它们可能很难准确描述图像的某些部分被遮挡或编辑的情况,而不像人类在这种情况下表现出色。他们表现出的这些弱点,包括幻觉和有限的可解释性,往往会阻碍在联想模式发生变化的情况下的表现。在这篇文章中,我们提出了一个通用的图像字幕框架,它使用因果推理来使现有的模型更有能力执行介入性任务,并具有反事实解释能力。我们的方法包括两个变种,利用总效应或自然直接效应。将它们整合到培训过程中,使模型能够处理反事实场景,增加了它们的概括性。在不同数据集上的大量实验表明,我们的方法有效地减少了幻觉,提高了模型对图像的忠实性,在小规模和大规模图像到文本模型中都表现出了高度的可移植性。代码可在此HTTPS URL上找到。

[NLP-28] Inductive Learning of Logical Theories with LLMs: A Complexity-graded Analysis
[NLP-28] 利用LLM进行逻辑理论的归纳学习:复杂性分级分析

链接: https://arxiv.org/abs/2408.16779
作者: João Pedro Gandarela,Danilo S. Carvalho,André Freitas
关键词-EN: Large Language Models, limitations of Large, Large Language, Natural Language Processing, work presents
关键词-ZH: 大型语言模型、大型语言的局限性、大型语言、自然语言处理、工作介绍
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注:

点击查看摘要

Abstract:This work presents a novel systematic methodology to analyse the capabilities and limitations of Large Language Models (LLMs) with feedback from a formal inference engine, on logic theory induction. The analysis is complexity-graded w.r.t. rule dependency structure, allowing quantification of specific inference challenges on LLM performance. Integrating LLMs with formal methods is a promising frontier in the Natural Language Processing field, as an important avenue for improving model inference control and explainability. In particular, inductive learning over complex sets of facts and rules, poses unique challenges for current autoregressive models, as they lack explicit symbolic grounding. While they can be complemented by formal systems, the properties delivered by LLMs regarding inductive learning, are not well understood and quantified. Empirical results indicate that the largest LLMs can achieve competitive results against a SOTA Inductive Logic Programming (ILP) system baseline, but also that tracking long predicate relationship chains is a more difficult obstacle than theory complexity for the LLMs.
摘要:本文提出了一种新颖的系统方法,借助形式推理引擎的反馈,分析大型语言模型(LLM)在逻辑理论归纳方面的能力与局限。该分析按规则依赖结构的复杂度分级,从而能够量化特定推理挑战对LLM性能的影响。将LLM与形式化方法相结合是自然语言处理领域一个很有前途的前沿方向,也是提高模型推理控制和可解释性的重要途径。特别是,在复杂的事实和规则集合上进行归纳学习,对当前的自回归模型提出了独特的挑战,因为它们缺乏明确的符号基础。虽然可以用形式系统加以补充,但LLM在归纳学习方面表现出的特性尚未得到充分理解和量化。实验结果表明,最大的LLM能够取得与SOTA归纳逻辑编程(ILP)系统基线相竞争的结果,但对LLM而言,跟踪较长的谓词关系链是比理论复杂度更大的障碍。

[NLP-29] Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model
[NLP-29] 编解码器确实很重要:探索音频语言模型编解码器的语义缺陷

链接: https://arxiv.org/abs/2408.17175
作者: Zhen Ye,Peiwen Sun,Jiahe Lei,Hongzhan Lin,Xu Tan,Zheqi Dai,Qiuqiang Kong,Jianyi Chen,Jiahao Pan,Qifeng Liu,Yike Guo,Wei Xue
关键词-EN: Large Language Models, Recent advancements, capabilities of Large, Large Language, Language Models
关键词-ZH: 大型语言模型、最新进展、大型、大型语言、语言模型的能力
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
备注:

点击查看摘要

Abstract:Recent advancements in audio generation have been significantly propelled by the capabilities of Large Language Models (LLMs). The existing research on audio LLM has primarily focused on enhancing the architecture and scale of audio language models, as well as leveraging larger datasets, and generally, acoustic codecs, such as EnCodec, are used for audio tokenization. However, these codecs were originally designed for audio compression, which may lead to suboptimal performance in the context of audio LLM. Our research aims to address the shortcomings of current audio LLM codecs, particularly their challenges in maintaining semantic integrity in generated audio. For instance, existing methods like VALL-E, which condition acoustic token generation on text transcriptions, often suffer from content inaccuracies and elevated word error rates (WER) due to semantic misinterpretations of acoustic tokens, resulting in word skipping and errors. To overcome these issues, we propose a straightforward yet effective approach called X-Codec. X-Codec incorporates semantic features from a pre-trained semantic encoder before the Residual Vector Quantization (RVQ) stage and introduces a semantic reconstruction loss after RVQ. By enhancing the semantic ability of the codec, X-Codec significantly reduces WER in speech synthesis tasks and extends these benefits to non-speech applications, including music and sound generation. Our experiments in text-to-speech, music continuation, and text-to-sound tasks demonstrate that integrating semantic information substantially improves the overall performance of language models in audio generation. Our code and demo are available (Demo: this https URL Code: this https URL)
摘要:大型语言模型(LLM)的能力极大地推动了音频生成的最新进展。现有的音频LLM研究主要集中在增强音频语言模型的架构和规模,以及利用更大的数据集;通常使用EnCodec等声学编解码器进行音频标记化。然而,这些编解码器最初是为音频压缩设计的,在音频LLM的场景下可能导致性能欠佳。我们的研究旨在解决当前音频LLM编解码器的不足,特别是它们在保持生成音频语义完整性方面的挑战。例如,VALL-E等以文本转录为条件生成声学标记的现有方法,常因对声学标记的语义误解而出现内容不准确和单词错误率(WER)升高,导致跳词和错误。为克服这些问题,我们提出了一种简单而有效的方法,称为X-Codec。X-Codec在残差矢量量化(RVQ)阶段之前融入了来自预训练语义编码器的语义特征,并在RVQ之后引入了语义重构损失。通过增强编解码器的语义能力,X-Codec显著降低了语音合成任务中的WER,并将这些优势扩展到音乐和声音生成等非语音应用。我们在文本到语音、音乐续写和文本到声音任务中的实验表明,整合语义信息显著提高了语言模型在音频生成中的整体性能。我们的代码和演示可用(演示:this https URL;代码:this https URL)

人工智能

[AI-0] Bridging Episodes and Semantics: A Novel Framework for Long-Form Video Understanding ECCV’24

链接: https://arxiv.org/abs/2408.17443
作者: Gueter Josmy Faure,Jia-Fong Yeh,Min-Hung Chen,Hung-Ting Su,Winston H. Hsu,Shang-Hong Lai
关键词-EN: reflects human cognition, accurately reflects human, extended short videos, treats long-form videos, human cognition
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: Accepted to the EVAL-FoMo Workshop at ECCV’24. Project page: this https URL

点击查看摘要

Abstract:While existing research often treats long-form videos as extended short videos, we propose a novel approach that more accurately reflects human cognition. This paper introduces BREASE: BRidging Episodes And SEmantics for Long-Form Video Understanding, a model that simulates episodic memory accumulation to capture action sequences and reinforces them with semantic knowledge dispersed throughout the video. Our work makes two key contributions: First, we develop an Episodic COmpressor (ECO) that efficiently aggregates crucial representations from micro to semi-macro levels. Second, we propose a Semantics reTRiever (SeTR) that enhances these aggregated representations with semantic information by focusing on the broader context, dramatically reducing feature dimensionality while preserving relevant macro-level information. Extensive experiments demonstrate that BREASE achieves state-of-the-art performance across multiple long video understanding benchmarks in both zero-shot and fully-supervised settings. The project page and code are at: this https URL.

[AI-1] Open-vocabulary Temporal Action Localization using VLMs DATE

链接: https://arxiv.org/abs/2408.17422
作者: Naoki Wake,Atsushi Kanehira,Kazuhiro Sasabuchi,Jun Takamatsu,Katsushi Ikeuchi
关键词-EN: action localization aims, localization aims, aims to find, find timings, Video action localization
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: 7 pages, 5 figures, 4 tables. Last updated on August 30th, 2024

Abstract:Video action localization aims to find timings of a specific action from a long video. Although existing learning-based approaches have been successful, those require annotating videos that come with a considerable labor cost. This paper proposes a learning-free, open-vocabulary approach based on emerging vision-language models (VLM). The challenge stems from the fact that VLMs are neither designed to process long videos nor tailored for finding actions. We overcome these problems by extending an iterative visual prompting technique. Specifically, we sample video frames into a concatenated image with frame index labels, making a VLM guess a frame that is considered to be closest to the start/end of the action. Iterating this process by narrowing a sampling time window results in finding a specific frame of start and end of an action. We demonstrate that this sampling technique yields reasonable results, illustrating a practical extension of VLMs for understanding videos.
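The iterative narrowing described in this abstract can be sketched in a few lines of plain Python. The oracle below stands in for the VLM's frame choice, and the window-update rule is an illustrative assumption rather than the authors' exact procedure:

```python
def localize_boundary(vlm_pick, total_frames, n_samples=8, iters=5):
    """Iteratively narrow a sampling window around an action boundary.

    vlm_pick(frame_indices) stands in for the VLM: it returns the index
    (into frame_indices) of the frame it judges closest to the boundary.
    """
    lo, hi = 0, total_frames - 1
    for _ in range(iters):
        step = max((hi - lo) // (n_samples - 1), 1)
        samples = list(range(lo, hi + 1, step))[:n_samples]
        best = samples[vlm_pick(samples)]
        # shrink the window to one sampling step around the chosen frame
        lo, hi = max(lo, best - step), min(hi, best + step)
    return (lo + hi) // 2

# toy oracle: pretend the true action start is frame 2345
true_start = 2345
oracle = lambda idxs: min(range(len(idxs)), key=lambda i: abs(idxs[i] - true_start))
est = localize_boundary(oracle, total_frames=10_000)
```

In the real pipeline the oracle call would be a VLM prompt over a concatenated, index-labeled frame grid; here a synthetic ground-truth frame lets the narrowing loop be checked end to end.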

[AI-2] Getting Inspiration for Feature Elicitation: App Store- vs. LLM-based Approach

链接: https://arxiv.org/abs/2408.17404
作者: Jialiang Wei,Anne-Lise Courbis,Thomas Lambolais,Binbin Xu,Pierre Louis Bernard,Gérard Dray,Walid Maalej
关键词-EN: inspired requirements elicitation, past decade, inspired requirements, app store, requirements elicitation
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: To Appear In Proceedings of 39th IEEE/ACM International Conference on Automated Software Engineering (ASE 2024)

Abstract:Over the past decade, app store (AppStore)-inspired requirements elicitation has proven to be highly beneficial. Developers often explore competitors’ apps to gather inspiration for new features. With the advance of Generative AI, recent studies have demonstrated the potential of large language model (LLM)-inspired requirements elicitation. LLMs can assist in this process by providing inspiration for new feature ideas. While both approaches are gaining popularity in practice, there is a lack of insight into their differences. We report on a comparative study between AppStore- and LLM-based approaches for refining features into sub-features. By manually analyzing 1,200 sub-features recommended from both approaches, we identified their benefits, challenges, and key differences. While both approaches recommend highly relevant sub-features with clear descriptions, LLMs seem more powerful particularly concerning novel unseen app scopes. Moreover, some recommended features are imaginary with unclear feasibility, which suggests the importance of a human-analyst in the elicitation loop.

[AI-3] Exploring the Effect of Explanation Content and Format on User Comprehension and Trust

链接: https://arxiv.org/abs/2408.17401
作者: Antonio Rago,Bence Palfi,Purin Sukpanichnant,Hannibal Nabli,Kavyesh Vivek,Olga Kostopoulou,James Kinross,Francesca Toni
关键词-EN: explanations, SHAP explanations, recent years, introduced for explaining, SHAP
类目: Artificial Intelligence (cs.AI)
备注: 18 pages

Abstract:In recent years, various methods have been introduced for explaining the outputs of “black-box” AI models. However, it is not well understood whether users actually comprehend and trust these explanations. In this paper, we focus on explanations for a regression tool for assessing cancer risk and examine the effect of the explanations’ content and format on the user-centric metrics of comprehension and trust. Regarding content, we experiment with two explanation methods: the popular SHAP, based on game-theoretic notions and thus potentially complex for everyday users to comprehend, and occlusion-1, based on feature occlusion which may be more comprehensible. Regarding format, we present SHAP explanations as charts (SC), as is conventional, and occlusion-1 explanations as charts (OC) as well as text (OT), to which their simpler nature also lends itself. The experiments amount to user studies questioning participants, with two different levels of expertise (the general population and those with some medical training), on their subjective and objective comprehension of and trust in explanations for the outputs of the regression tool. In both studies we found a clear preference in terms of subjective comprehension and trust for occlusion-1 over SHAP explanations in general, when comparing based on content. However, direct comparisons of explanations when controlling for format only revealed evidence for OT over SC explanations in most cases, suggesting that the dominance of occlusion-1 over SHAP explanations may be driven by a preference for text over charts as explanations. Finally, we found no evidence of a difference between the explanation types in terms of objective comprehension. Thus overall, the choice of the content and format of explanations needs careful attention, since in some contexts format, rather than content, may play the critical role in improving user experience.

[AI-4] MoRe Fine-Tuning with 10x Fewer Parameters

链接: https://arxiv.org/abs/2408.17383
作者: Wenxuan Tan,Nicholas Roberts,Tzu-Heng Huang,Jitian Zhao,John Cooper,Samuel Guo,Chengyu Duan,Frederic Sala
关键词-EN: easily specialize large, specialize large pretrained, large pretrained models, Monarch Rectangular Fine-tuning, unlocked the potential
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

Abstract:Parameter-efficient fine-tuning (PEFT) techniques have unlocked the potential to cheaply and easily specialize large pretrained models. However, the most prominent approaches, like low-rank adapters (LoRA), depend on heuristics or rules-of-thumb for their architectural choices – potentially limiting their performance for new models and architectures. This limitation suggests that techniques from neural architecture search could be used to obtain optimal adapter architectures, but these are often expensive and difficult to implement. We address this challenge with Monarch Rectangular Fine-tuning (MoRe), a simple framework to search over adapter architectures that relies on the Monarch matrix class. Theoretically, we show that MoRe is more expressive than LoRA. Empirically, our approach is more parameter-efficient and performant than state-of-the-art PEFTs on a range of tasks and models, with as few as 5% of LoRA’s parameters.

[AI-5] Traffic expertise meets residual RL: Knowledge-informed model-based residual reinforcement learning for CAV trajectory control

链接: https://arxiv.org/abs/2408.17380
作者: Zihao Sheng,Zilin Huang,Sikai Chen
关键词-EN: exhibit higher sample, virtual environment model, environment model, higher sample efficiency, sample efficiency
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

Abstract:Model-based reinforcement learning (RL) is anticipated to exhibit higher sample efficiency compared to model-free RL by utilizing a virtual environment model. However, it is challenging to obtain sufficiently accurate representations of the environmental dynamics due to uncertainties in complex systems and environments. An inaccurate environment model may degrade the sample efficiency and performance of model-based RL. Furthermore, while model-based RL can improve sample efficiency, it often still requires substantial training time to learn from scratch, potentially limiting its advantages over model-free approaches. To address these challenges, this paper introduces a knowledge-informed model-based residual reinforcement learning framework aimed at enhancing learning efficiency by infusing established expert knowledge into the learning process and avoiding the issue of beginning from zero. Our approach integrates traffic expert knowledge into a virtual environment model, employing the Intelligent Driver Model (IDM) for basic dynamics and neural networks for residual dynamics, thus ensuring adaptability to complex scenarios. We propose a novel strategy that combines traditional control methods with residual RL, facilitating efficient learning and policy optimization without the need to learn from scratch. The proposed approach is applied to CAV trajectory control tasks for the dissipation of stop-and-go waves in mixed traffic flow. Experimental results demonstrate that our proposed approach enables the CAV agent to achieve superior performance in trajectory control compared to the baseline agents in terms of sample efficiency, traffic flow smoothness and traffic mobility. The source code and supplementary materials are available at this https URL.

[AI-6] EMPOWER: Embodied Multi-role Open-vocabulary Planning with Online Grounding and Execution IROS2024

链接: https://arxiv.org/abs/2408.17379
作者: Francesco Argenziano,Michele Brienza,Vincenzo Suriani,Daniele Nardi,Domenico D. Bloisi
关键词-EN: settings presents significant, presents significant challenges, real-life settings presents, Task planning, settings presents
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Accepted at IROS 2024

Abstract:Task planning for robots in real-life settings presents significant challenges. These challenges stem from three primary issues: the difficulty in identifying grounded sequences of steps to achieve a goal; the lack of a standardized mapping between high-level actions and low-level commands; and the challenge of maintaining low computational overhead given the limited resources of robotic hardware. We introduce EMPOWER, a framework designed for open-vocabulary online grounding and planning for embodied agents aimed at addressing these issues. By leveraging efficient pre-trained foundation models and a multi-role mechanism, EMPOWER demonstrates notable improvements in grounded planning and execution. Quantitative results highlight the effectiveness of our approach, achieving an average success rate of 0.73 across six different real-life scenarios using a TIAGo robot.

[AI-7] NDP: Next Distribution Prediction as a More Broad Target

链接: https://arxiv.org/abs/2408.17377
作者: Junhao Ruan,Abudukeyumu Abudula,Xinyu Liu,Bei Li,Yinqiao Li,Chenglong Wang,Yuchun Fan,Yuan Ge,Tong Xiao,Jingbo Zhu
关键词-EN: Large language models, demonstrated powerful capabilities, Large language, existing NTP paradigm, trained on next-token
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 8 pages, 5 figures

Abstract:Large language models (LLMs) trained on next-token prediction (NTP) paradigm have demonstrated powerful capabilities. However, the existing NTP paradigm contains several limitations, particularly related to planned task complications and error propagation during inference. In our work, we extend the critique of NTP, highlighting its limitation also due to training with a narrow objective: the prediction of a sub-optimal one-hot distribution. To support this critique, we conducted a pre-experiment treating the output distribution from powerful LLMs as efficient world data compression. By evaluating the similarity between the n -gram distribution and the one-hot distribution with LLMs, we observed that the n -gram distributions align more closely with the output distribution of LLMs. Based on this insight, we introduce Next Distribution Prediction (NDP), which uses n -gram distributions to replace the one-hot targets, enhancing learning without extra online training time. We conducted experiments across translation, general task, language transfer, and medical domain adaptation. Compared to NTP, NDP can achieve up to +2.97 COMET improvement in translation tasks, +0.61 average improvement in general tasks, and incredible +10.75 average improvement in the medical domain. This demonstrates the concrete benefits of addressing the target narrowing problem, pointing to a new direction for future work on improving NTP.
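The replacement of one-hot targets with n-gram distributions can be illustrated on a toy corpus. The function below is an illustrative sketch, not the paper's implementation: it builds the empirical next-token distribution for each (n-1)-token context:

```python
from collections import Counter, defaultdict

def ngram_soft_targets(tokens, n=2):
    """Empirical next-token distribution for every (n-1)-token context,
    usable as a soft training target in place of a one-hot vector."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        counts[context][tokens[i + n - 1]] += 1
    # normalize each context's counts into a probability distribution
    return {ctx: {tok: c / sum(ctr.values()) for tok, c in ctr.items()}
            for ctx, ctr in counts.items()}

corpus = "the cat sat on the mat the cat ran".split()
targets = ngram_soft_targets(corpus, n=2)
# after "the", the corpus continues with "cat" twice and "mat" once
```

A cross-entropy loss against `targets[ctx]` instead of a one-hot vector would then spread probability mass over every continuation observed in the data.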

[AI-8] Leveraging Graph Neural Networks to Forecast Electricity Consumption KDD2024 ECML

链接: https://arxiv.org/abs/2408.17366
作者: Eloi Campagne,Yvenn Amara-Ouali,Yannig Goude,Argyris Kalogeratos
关键词-EN: renewable energy sources, Accurate electricity demand, decentralized network paradigm, paradigm introduce greater, introduce greater complexity
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 17 pages, ECML PKDD 2024 Workshop paper

Abstract:Accurate electricity demand forecasting is essential for several reasons, especially as the integration of renewable energy sources and the transition to a decentralized network paradigm introduce greater complexity and uncertainty. The proposed methodology leverages graph-based representations to effectively capture the spatial distribution and relational intricacies inherent in this decentralized network structure. This research work offers a novel approach that extends beyond the conventional Generalized Additive Model framework by considering models like Graph Convolutional Networks or Graph SAGE. These graph-based models enable the incorporation of various levels of interconnectedness and information sharing among nodes, where each node corresponds to the combined load (i.e. consumption) of a subset of consumers (e.g. the regions of a country). More specifically, we introduce a range of methods for inferring graphs tailored to consumption forecasting, along with a framework for evaluating the developed models in terms of both performance and explainability. We conduct experiments on electricity forecasting, in both a synthetic and a real framework considering the French mainland regions, and the performance and merits of our approach are discussed.
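The node-level information sharing behind models like GraphSAGE reduces to "combine your own features with an aggregate of your neighbors'". A scalar toy version follows (illustrative only; real layers apply learned weight matrices to feature vectors such as regional load histories):

```python
def sage_layer(features, neighbors, w_self, w_neigh):
    """One GraphSAGE-style layer on scalar node features: each node mixes
    its own value with the mean of its neighbors' values, then applies ReLU."""
    out = {}
    for node, x in features.items():
        nbrs = neighbors.get(node, [])
        agg = sum(features[n] for n in nbrs) / len(nbrs) if nbrs else 0.0
        out[node] = max(0.0, w_self * x + w_neigh * agg)
    return out

# toy "grid": two regions whose consumption signals inform each other
h = sage_layer({"A": 1.0, "B": 3.0}, {"A": ["B"], "B": ["A"]},
               w_self=1.0, w_neigh=0.5)
```

Stacking such layers is what lets a node's forecast incorporate information from progressively more distant regions of the inferred graph.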

[AI-9] C-RADAR: A Centralized Deep Learning System for Intrusion Detection in Software Defined Networks

链接: https://arxiv.org/abs/2408.17356
作者: Osama Mustafa,Khizer Ali,Talha Naqash
关键词-EN: Software Defined Networks, Software Defined, popularity of Software, simplify network management, Defined Networks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

Abstract:The popularity of Software Defined Networks (SDNs) has grown in recent years, mainly because of their ability to simplify network management and improve network flexibility. However, this also makes them vulnerable to various types of cyber attacks. SDNs work on a centralized control plane which makes them more prone to network attacks. Research has demonstrated that deep learning (DL) methods can be successful in identifying intrusions in conventional networks, but their application in SDNs is still an open research area. In this research, we propose the use of DL techniques for intrusion detection in SDNs. We measure the effectiveness of our method by experimentation on a dataset of network traffic and comparing it to existing techniques. Our results show that the DL-based approach outperforms traditional methods in terms of detection accuracy and computational efficiency. The deep learning architecture that has been used in this research is a Long Short-Term Memory network with self-attention (LSTM-Attn), which achieves an F1-score of 0.9721. Furthermore, this technique can be trained to detect new attack patterns and improve the overall security of SDNs.

[AI-10] Bidirectional Decoding: Improving Action Chunking via Closed-Loop Resampling

链接: https://arxiv.org/abs/2408.17355
作者: Yuejiang Liu,Jubayer Ibn Hamid,Annie Xie,Yoonho Lee,Maximilian Du,Chelsea Finn
关键词-EN: Predicting and executing, human demonstrations, robot learning, learning from human, Predicting
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Project website: this https URL

Abstract:Predicting and executing a sequence of actions without intermediate replanning, known as action chunking, is increasingly used in robot learning from human demonstrations. However, its effects on learned policies remain puzzling: some studies highlight its importance for achieving strong performance, while others observe detrimental effects. In this paper, we first dissect the role of action chunking by analyzing the divergence between the learner and the demonstrator. We find that longer action chunks enable a policy to better capture temporal dependencies by taking into account more past states and actions within the chunk. However, this advantage comes at the cost of exacerbating errors in stochastic environments due to fewer observations of recent states. To address this, we propose Bidirectional Decoding (BID), a test-time inference algorithm that bridges action chunking with closed-loop operations. BID samples multiple predictions at each time step and searches for the optimal one based on two criteria: (i) backward coherence, which favors samples aligned with previous decisions, (ii) forward contrast, which favors samples close to outputs of a stronger policy and distant from those of a weaker policy. By coupling decisions within and across action chunks, BID enhances temporal consistency over extended sequences while enabling adaptive replanning in stochastic environments. Experimental results show that BID substantially outperforms conventional closed-loop operations of two state-of-the-art generative policies across seven simulation benchmarks and two real-world tasks.
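The two BID criteria can be mocked up directly from the abstract. Everything below is a toy paraphrase with 1-D "action chunks" and hand-written policy scorers, not the authors' implementation:

```python
def bid_select(candidates, prev_action, strong_score, weak_score, weight=1.0):
    """Pick the candidate chunk maximizing backward coherence (agreement with
    the previous decision) plus forward contrast (looks like the stronger
    policy's output and unlike the weaker policy's)."""
    def score(chunk):
        backward = -abs(chunk[0] - prev_action)            # coherence with the past
        forward = strong_score(chunk) - weak_score(chunk)  # contrast between policies
        return backward + weight * forward
    return max(candidates, key=score)

# toy scorers standing in for the stronger/weaker policy pair
strong = lambda c: -abs(c[0] - 0.0)  # the stronger policy prefers actions near 0
weak = lambda c: -abs(c[0] - 1.0)    # the weaker policy prefers actions near 1
chosen = bid_select([[0.0, 0.1], [1.0, 1.1]], prev_action=0.2,
                    strong_score=strong, weak_score=weak)
```

Running this selection at every control step is what couples decisions within and across chunks while still allowing closed-loop replanning.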

[AI-11] Forget to Flourish: Leveraging Machine-Unlearning on Pretrained Language Models for Privacy Leakage

链接: https://arxiv.org/abs/2408.17354
作者: Md Rafi Ur Rashid,Jing Liu,Toshiaki Koike-Akino,Shagufta Mehnaz,Ye Wang
关键词-EN: exposing sensitive information, downstream applications poses, applications poses significant, potentially exposing sensitive, poses significant privacy
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

Abstract:Fine-tuning large language models on private data for downstream applications poses significant privacy risks in potentially exposing sensitive information. Several popular community platforms now offer convenient distribution of a large variety of pre-trained models, allowing anyone to publish without rigorous verification. This scenario creates a privacy threat, as pre-trained models can be intentionally crafted to compromise the privacy of fine-tuning datasets. In this study, we introduce a novel poisoning technique that uses model-unlearning as an attack tool. This approach manipulates a pre-trained language model to increase the leakage of private data during the fine-tuning process. Our method enhances both membership inference and data extraction attacks while preserving model utility. Experimental results across different models, datasets, and fine-tuning setups demonstrate that our attacks significantly surpass baseline performance. This work serves as a cautionary note for users who download pre-trained models from unverified sources, highlighting the potential risks involved.

[AI-12] AASIST3: KAN-Enhanced AASIST Speech Deepfake Detection using SSL Features and Additional Regularization for the ASVspoof 2024 Challenge INTERSPEECH

链接: https://arxiv.org/abs/2408.17352
作者: Kirill Borodin,Vasiliy Kudryavtsev,Dmitrii Korzh,Alexey Efimenko,Grach Mkrtchian,Mikhail Gorodnichev,Oleg Y. Rogov
关键词-EN: Automatic Speaker Verification, identify speakers based, exclusive access control, Automatic Speaker, Speaker Verification
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: 8 pages, 2 figures, 2 tables. Accepted paper at the ASVspoof 2024 (the 25th Interspeech Conference)

Abstract:Automatic Speaker Verification (ASV) systems, which identify speakers based on their voice characteristics, have numerous applications, such as user authentication in financial transactions, exclusive access control in smart devices, and forensic fraud detection. However, the advancement of deep learning algorithms has enabled the generation of synthetic audio through Text-to-Speech (TTS) and Voice Conversion (VC) systems, exposing ASV systems to potential vulnerabilities. To counteract this, we propose a novel architecture named AASIST3. By enhancing the existing AASIST framework with Kolmogorov-Arnold networks, additional layers, encoders, and pre-emphasis techniques, AASIST3 achieves a more than twofold improvement in performance. It demonstrates minDCF results of 0.5357 in the closed condition and 0.1414 in the open condition, significantly enhancing the detection of synthetic voices and improving ASV security.

[AI-13] rerankers: A Lightweight Python Library to Unify Ranking Methods

链接: https://arxiv.org/abs/2408.17344
作者: Benjamin Clavié
关键词-EN: paper presents rerankers, paper presents, Python library, Abstract, Python
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

Abstract:This paper presents rerankers, a Python library which provides an easy-to-use interface to the most commonly used re-ranking approaches. Re-ranking is an integral component of many retrieval pipelines; however, there exist numerous approaches to it, relying on different implementation methods. rerankers unifies these methods into a single user-friendly interface, allowing practitioners and researchers alike to explore different methods while only changing a single line of Python code. Moreover, rerankers ensures that its implementations are done with the fewest dependencies possible, and re-uses the original implementation whenever possible, guaranteeing that our simplified interface results in no performance degradation compared to more complex ones. The full source code and list of supported models are updated regularly and available at this https URL.
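The design idea, one interface in front of interchangeable ranking back-ends, can be shown with a toy class. The scorers below are deliberately trivial stand-ins; consult the library itself for its actual model-backed API:

```python
class ToyReranker:
    """Toy unified re-ranking interface in the spirit of the library:
    swapping methods changes one constructor argument, not the call sites.
    (Both scorers are illustrative, not the library's real implementations.)"""
    def __init__(self, method):
        self._score = {
            "overlap": lambda q, d: len(set(q.split()) & set(d.split())),
            "length": lambda q, d: -abs(len(d) - len(q)),
        }[method]

    def rank(self, query, docs):
        # highest-scoring document first
        return sorted(docs, key=lambda d: self._score(query, d), reverse=True)

docs = ["red dog runs", "a blue cat sat"]
by_overlap = ToyReranker("overlap").rank("blue cat", docs)
by_length = ToyReranker("length").rank("blue cat", docs)
```

The point of the pattern is visible in the last two lines: switching ranking method touches a single constructor argument while every `rank` call site stays unchanged.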

[AI-14] Modularity in Transformers: Investigating Neuron Separability &amp; Specialization

链接: https://arxiv.org/abs/2408.17324
作者: Nicholas Pochinkov,Thomas Jones,Mohammed Rashidur Rahman
关键词-EN: workings remains limited, internal workings remains, remains limited, increasingly prevalent, workings remains
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 11 pages, 6 figures

Abstract:Transformer models are increasingly prevalent in various applications, yet our understanding of their internal workings remains limited. This paper investigates the modularity and task specialization of neurons within transformer architectures, focusing on both vision (ViT) and language (Mistral 7B) models. Using a combination of selective pruning and MoEfication clustering techniques, we analyze the overlap and specialization of neurons across different tasks and data subsets. Our findings reveal evidence of task-specific neuron clusters, with varying degrees of overlap between related tasks. We observe that neuron importance patterns persist to some extent even in randomly initialized models, suggesting an inherent structure that training refines. Additionally, we find that neuron clusters identified through MoEfication correspond more strongly to task-specific neurons in earlier and later layers of the models. This work contributes to a more nuanced understanding of transformer internals and offers insights into potential avenues for improving model interpretability and efficiency.
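The overlap analysis the abstract mentions boils down to comparing sets of task-important neurons; a Jaccard score is one standard way to quantify that (the metric choice here is our illustrative assumption, not necessarily the paper's):

```python
def task_overlap(neurons_a, neurons_b):
    """Jaccard overlap between the important-neuron sets of two tasks:
    |A ∩ B| / |A ∪ B|, in [0, 1]."""
    a, b = set(neurons_a), set(neurons_b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

# two tasks sharing 2 of 6 distinct important neurons
overlap = task_overlap([1, 2, 3, 4], [3, 4, 5, 6])
```

High overlap between related tasks and low overlap between unrelated ones is exactly the signature of task-specific neuron clusters.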

[AI-15] Investigating Neuron Ablation in Attention Heads: The Case for Peak Activation Centering

链接: https://arxiv.org/abs/2408.17322
作者: Nicholas Pochinkov,Ben Pasero,Skylar Shibayama
关键词-EN: rapidly throughout society, growing rapidly, ablation, Abstract, models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 2 figures, XAI World Conference 2024 Late-Breaking Work

Abstract:The use of transformer-based models is growing rapidly throughout society. With this growth, it is important to understand how they work, and in particular, how the attention mechanisms represent concepts. Though there are many interpretability methods, many look at models through their neuronal activations, which are poorly understood. We describe different lenses through which to view neuron activations, and investigate the effectiveness in language models and vision transformers through various methods of neural ablation: zero ablation, mean ablation, activation resampling, and a novel approach we term ‘peak ablation’. Through experimental analysis, we find that in different regimes and models, each method can offer the lowest degradation of model performance compared to other methods, with resampling usually causing the most significant performance deterioration. We make our code available at this https URL.
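Three of the four ablation variants are simple enough to state exactly; the paper's "peak ablation" depends on locating the mode of the activation distribution, so it is omitted from this sketch:

```python
import random

def ablate(activations, method, rng=random):
    """Replace a neuron's activations according to the named ablation scheme.
    'zero' and 'mean' follow their usual definitions; 'resample' draws
    replacements from the empirical activation distribution."""
    if method == "zero":
        return [0.0] * len(activations)
    if method == "mean":
        m = sum(activations) / len(activations)
        return [m] * len(activations)
    if method == "resample":
        return [rng.choice(activations) for _ in activations]
    raise ValueError(f"unknown ablation method: {method}")
```

Comparing downstream task performance after each replacement is what lets the paper rank the methods by how gently they degrade the model.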

[AI-16] Bridging Domain Knowledge and Process Discovery Using Large Language Models

链接: https://arxiv.org/abs/2408.17316
作者: Ali Norouzifar,Humam Kourani,Marcus Dees,Wil van der Aalst
关键词-EN: Discovering good process, Discovering good, process analysis tasks, process, analysis tasks
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: This paper is accepted at the AI4BPM 2024 workshop and to be published in their proceedings

Abstract:Discovering good process models is essential for different process analysis tasks such as conformance checking and process improvements. Automated process discovery methods often overlook valuable domain knowledge. This knowledge, including insights from domain experts and detailed process documentation, remains largely untapped during process discovery. This paper leverages Large Language Models (LLMs) to integrate such knowledge directly into process discovery. We use rules derived from LLMs to guide model construction, ensuring alignment with both domain knowledge and actual process executions. By integrating LLMs, we create a bridge between process knowledge expressed in natural language and the discovery of robust process models, advancing process discovery methodologies significantly. To showcase the usability of our framework, we conducted a case study with the UWV employee insurance agency, demonstrating its practical benefits and effectiveness.

[AI-17] Fair Best Arm Identification with Fixed Confidence

链接: https://arxiv.org/abs/2408.17313
作者: Alessio Russo,Filippo Vannella
关键词-EN: Arm Identification, fair BAI, sample complexity, sample complexity lower, Unlike traditional BAI
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

Abstract:In this work, we present a novel framework for Best Arm Identification (BAI) under fairness constraints, a setting that we refer to as F-BAI (fair BAI). Unlike traditional BAI, which solely focuses on identifying the optimal arm with minimal sample complexity, F-BAI also includes a set of fairness constraints. These constraints impose a lower limit on the selection rate of each arm and can be either model-agnostic or model-dependent. For this setting, we establish an instance-specific sample complexity lower bound and analyze the price of fairness, quantifying how fairness impacts sample complexity. Based on the sample complexity lower bound, we propose F-TaS, an algorithm provably matching the sample complexity lower bound, while ensuring that the fairness constraints are satisfied. Numerical results, conducted using both a synthetic model and a practical wireless scheduling application, show the efficiency of F-TaS in minimizing the sample complexity while achieving low fairness violations.

[AI-18] Hybridizing Base-Line 2D-CNN Model with Cat Swarm Optimization for Enhanced Advanced Persistent Threat Detection

链接: https://arxiv.org/abs/2408.17307
作者: Ali M. Bakhiet,Salah A. Aly
关键词-EN: detecting Advanced Persistent, Advanced Persistent Threats, Advanced Persistent, formidable challenge due, Convolutional Neural Networks
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
备注: 6 pages, 5 figures

Abstract:In the realm of cyber-security, detecting Advanced Persistent Threats (APTs) remains a formidable challenge due to their stealthy and sophisticated nature. This research paper presents an innovative approach that leverages Convolutional Neural Networks (CNNs) with a 2D baseline model, enhanced by the cutting-edge Cat Swarm Optimization (CSO) algorithm, to significantly improve APT detection accuracy. By seamlessly integrating the 2D-CNN baseline model with CSO, we unlock the potential for unprecedented accuracy and efficiency in APT detection. The results unveil an impressive accuracy score of 98.4%, marking a significant enhancement in APT detection across various attack stages, illuminating a path forward in combating these relentless and sophisticated threats.

[AI-19] Stationary Policies are Optimal in Risk-averse Total-reward MDPs with EVaR

链接: https://arxiv.org/abs/2408.17286
作者: Xihong Su,Marek Petrik,Julien Grand-Clément
关键词-EN: Optimizing risk-averse objectives, admit direct dynamic, Entropic Risk Measure, complex history-dependent policies, direct dynamic programming
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

Abstract:Optimizing risk-averse objectives in discounted MDPs is challenging because most models do not admit direct dynamic programming equations and require complex history-dependent policies. In this paper, we show that the risk-averse total reward criterion, under the Entropic Risk Measure (ERM) and Entropic Value at Risk (EVaR) risk measures, can be optimized by a stationary policy, making it simple to analyze, interpret, and deploy. We propose exponential value iteration, policy iteration, and linear programming to compute optimal policies. In comparison with prior work, our results only require the relatively mild condition of transient MDPs and allow for both positive and negative rewards. Our results indicate that the total reward criterion may be preferable to the discounted criterion in a broad range of risk-averse reinforcement learning domains.
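The ERM objective has a closed empirical form, ERM_beta(X) = (1/beta) * log E[exp(beta * X)]; for rewards, beta < 0 is the risk-averse direction (beta -> 0 recovers the mean, beta -> -inf the worst case), and EVaR is obtained by optimizing ERM over beta. A numerically stabilized sketch (sign conventions vary across papers, so treat this as illustrative):

```python
import math

def erm(samples, beta):
    """Empirical entropic risk measure: (1/beta) * log(mean(exp(beta * x))).
    Uses log-sum-exp shifting to avoid overflow for large |beta * x|."""
    shift = max(beta * x for x in samples)
    return (shift + math.log(sum(math.exp(beta * x - shift) for x in samples)
                             / len(samples))) / beta

rewards = [0.0, 10.0]
nearly_neutral = erm(rewards, -1e-6)  # close to the plain mean of 5.0
averse = erm(rewards, -1.0)           # pulled toward the worst outcome, 0.0
```

Sweeping `beta` and taking the best ERM value at a given confidence level is, roughly, how EVaR is computed from this primitive.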

[AI-20] Flexible and Effective Mixing of Large Language Models into a Mixture of Domain Experts

链接: https://arxiv.org/abs/2408.17280
作者: Rhui Dih Lee,Laura Wynter,Raghu Kiran Ganti
关键词-EN: creating low-cost, trained models, Abstract, MOE, toolkit
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

Abstract:We present a toolkit for creating low-cost Mixture-of-Domain-Experts (MOE) from trained models. The toolkit can be used for creating a mixture from models or from adapters. We perform extensive tests and offer guidance on defining the architecture of the resulting MOE using the toolkit. A public repository is available.

[AI-21] UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios

链接: https://arxiv.org/abs/2408.17267
作者: Baichuan Zhou,Haote Yang,Dairong Chen,Junyan Ye,Tianyi Bai,Jinhua Yu,Songyang Zhang,Dahua Lin,Conghui He,Weijia Li
关键词-EN: Large Multimodal Models, Multimodal Models, Large Multimodal, Recent evaluations, benchmarks specifically focusing
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 9 pages, 6 figures

Abstract:Recent evaluations of Large Multimodal Models (LMMs) have explored their capabilities in various domains, with only few benchmarks specifically focusing on urban environments. Moreover, existing urban benchmarks have been limited to evaluating LMMs with basic region-level urban tasks under singular views, leading to incomplete evaluations of LMMs’ abilities in urban environments. To address these issues, we present UrBench, a comprehensive benchmark designed for evaluating LMMs in complex multi-view urban scenarios. UrBench contains 11.6K meticulously curated questions at both region-level and role-level that cover 4 task dimensions: Geo-Localization, Scene Reasoning, Scene Understanding, and Object Understanding, totaling 14 task types. In constructing UrBench, we utilize data from existing datasets and additionally collect data from 11 cities, creating new annotations using a cross-view detection-matching method. With these images and annotations, we then integrate LMM-based, rule-based, and human-based methods to construct large-scale high-quality questions. Our evaluations on 21 LMMs show that current LMMs struggle in the urban environments in several aspects. Even the best performing GPT-4o lags behind humans in most tasks, ranging from simple tasks such as counting to complex tasks such as orientation, localization and object attribute recognition, with an average performance gap of 17.4%. Our benchmark also reveals that LMMs exhibit inconsistent behaviors with different urban views, especially with respect to understanding cross-view relations. UrBench datasets and benchmark results will be publicly available at this https URL.

[AI-22] VisionTS: Visual Masked Autoencoders Are Free-Lunch Zero-Shot Time Series Forecasters

链接: https://arxiv.org/abs/2408.17253
作者: Mouxiang Chen,Lefei Shen,Zhuo Li,Xiaoyun Joy Wang,Jianling Sun,Chenghao Liu
关键词-EN: TSF foundation models, TSF foundation, develop TSF foundation, Foundation models, TSF
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 26 pages, 11 figures

点击查看摘要

Abstract:Foundation models have emerged as a promising approach in time series forecasting (TSF). Existing approaches either fine-tune large language models (LLMs) or build large-scale time-series datasets to develop TSF foundation models. However, these methods face challenges due to the severe cross-domain gap or in-domain heterogeneity. In this paper, we explore a new road to building a TSF foundation model from rich and high-quality natural images, based on the intrinsic similarities between images and time series. To bridge the gap between the two domains, we reformulate the TSF task as an image reconstruction task, which is further processed by a visual masked autoencoder (MAE) self-supervised pre-trained on the ImageNet dataset. Surprisingly, without further adaptation in the time-series domain, the proposed VisionTS could achieve superior zero-shot forecasting performance compared to existing TSF foundation models. With minimal fine-tuning, VisionTS could further improve the forecasting and achieve state-of-the-art performance in most cases. These findings suggest that visual models could be a free lunch for TSF and highlight the potential for future cross-domain research between computer vision and TSF. Our code is publicly available at this https URL.
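The key reformulation in this abstract, turning forecasting into masked image reconstruction, can be sketched in a few lines. The folding scheme below (one period per row, min-max normalization from the observed part, zeroed future cells) is an illustrative assumption, not VisionTS's actual preprocessing:

```python
import numpy as np

def series_to_masked_image(history, horizon, period):
    """Fold a 1-D series into a 2-D grid and mask the future cells,
    so that forecasting becomes image inpainting for a visual MAE."""
    total = len(history) + horizon
    assert total % period == 0, "pad the series so the timeline folds evenly"
    grid = np.zeros(total)
    grid[: len(history)] = history
    image = grid.reshape(-1, period)        # each row holds one period
    mask = np.ones(total, dtype=bool)
    mask[: len(history)] = False            # True marks cells to reconstruct
    mask = mask.reshape(-1, period)
    # normalize using only observed values (avoid peeking at the future)
    lo, hi = history.min(), history.max()
    image = (image - lo) / (hi - lo + 1e-8)
    image[mask] = 0.0                       # masked "future" pixels
    return image, mask
```

A pre-trained MAE would then reconstruct the masked columns, and the reconstructed pixels would be un-normalized back into forecast values.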

[AI-23] Abstracted Gaussian Prototypes for One-Shot Concept Learning

链接: https://arxiv.org/abs/2408.17251
作者: Chelsea Zou,Kenneth J. Kurtz
关键词-EN: encode higher-level representations, Gaussian Mixture Model, cluster-based generative image, generative image segmentation, image segmentation framework
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We introduce a cluster-based generative image segmentation framework to encode higher-level representations of visual concepts based on one-shot learning inspired by the Omniglot Challenge. The inferred parameters of each component of a Gaussian Mixture Model (GMM) represent a distinct topological subpart of a visual concept. Sampling new data from these parameters generates augmented subparts to build a more robust prototype for each concept, i.e., the Abstracted Gaussian Prototype (AGP). This framework addresses one-shot classification tasks using a cognitively-inspired similarity metric and addresses one-shot generative tasks through a novel AGP-VAE pipeline employing variational autoencoders (VAEs) to generate new class variants. Results from human judges reveal that the generative pipeline produces novel examples and classes of visual concepts that are broadly indistinguishable from those made by humans. The proposed framework leads to impressive but not state-of-the-art classification accuracy; thus, the contribution is two-fold: 1) the system is uniquely low in theoretical and computational complexity and operates in a completely standalone manner, while existing approaches draw heavily on pre-training or knowledge engineering; and 2) in contrast with competing neural network models, the AGP approach addresses the importance of breadth of task capability emphasized in the Omniglot challenge (i.e., successful performance on generative tasks). These two points are critical as we advance toward an understanding of how learning/reasoning systems can produce viable, robust, and flexible concepts based on literally nothing more than a single example.
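The prototype-building step the abstract describes, fit a Gaussian per subpart and sample augmented variants, can be sketched as follows. The real framework infers the clustering itself with a GMM; in this hedged sketch the subpart labels are assumed to be given:

```python
import numpy as np

rng = np.random.default_rng(0)

def abstracted_gaussian_prototype(points, labels):
    """Fit one Gaussian (mean, covariance) per labeled cluster of stroke
    points, standing in for the GMM components in the abstract."""
    proto = {}
    for k in np.unique(labels):
        pts = points[labels == k]
        mean = pts.mean(axis=0)
        cov = np.cov(pts.T) + 1e-6 * np.eye(pts.shape[1])  # regularize
        proto[int(k)] = (mean, cov)
    return proto

def sample_augmented_subparts(proto, n_per_part=5):
    """Sample new points from each component to build augmented subparts
    for a more robust concept prototype."""
    return {k: rng.multivariate_normal(m, c, size=n_per_part)
            for k, (m, c) in proto.items()}
```

Sampling many such augmented subparts and recombining them is what makes the prototype more robust than the single observed example.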

[AI-24] AI-Driven Intrusion Detection Systems (IDS) on the ROAD dataset: A Comparative Analysis for automotive Controller Area Network (CAN)

链接: https://arxiv.org/abs/2408.17235
作者: Lorenzo Guerra,Linhan Xu,Pavlo Mozharovskyi,Paolo Bellavista,Thomas Chapuis,Guillaume Duc,Van-Tam Nguyen
关键词-EN: revolutionized automotive technology, Controller Area Network, automotive technology, enhancing safety, driving experience
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The integration of digital devices in modern vehicles has revolutionized automotive technology, enhancing safety and the overall driving experience. The Controller Area Network (CAN) bus is a central system for managing in-vehicle communication between the electronic control units (ECUs). However, the CAN protocol poses security challenges due to inherent vulnerabilities, lacking encryption and authentication, which, combined with an expanding attack surface, necessitates robust security measures. In response to this challenge, numerous Intrusion Detection Systems (IDS) have been developed and deployed. Nonetheless, an open, comprehensive, and realistic dataset to test the effectiveness of such IDSs remains absent in the existing literature. This paper addresses this gap by considering the latest ROAD dataset, containing stealthy and sophisticated injections. The methodology involves dataset labelling and the implementation of both state-of-the-art deep learning models and traditional machine learning models to show the discrepancy in performance between the datasets most commonly used in the literature and the ROAD dataset, a more realistic alternative.

[AI-25] A methodological framework for Resilience as a Service (RaaS) in multimodal urban transportation networks

链接: https://arxiv.org/abs/2408.17233
作者: Sara Jaber(Univ. Gustave Eiffel, COSYS, GRETTIA, Paris, France and VEDECOM, mobiLAB, Department of new solutions of mobility services and shared energy, Versailles, France),Mostafa Ameli(Univ. Gustave Eiffel, COSYS, GRETTIA, Paris, France),S. M. Hassan Mahdavi(VEDECOM, mobiLAB, Department of new solutions of mobility services and shared energy, Versailles, France),Neila Bhouri(Univ. Gustave Eiffel, COSYS, GRETTIA, Paris, France)
关键词-EN: Public transportation systems, commuter traffic, public transport disruptions, unexpected service disruptions, minimize adverse effects
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Public transportation systems are experiencing an increase in commuter traffic. This increase underscores the need for resilience strategies to manage unexpected service disruptions, ensuring rapid and effective responses that minimize adverse effects on stakeholders and enhance the system’s ability to maintain essential functions and recover quickly. This study aims to explore the management of public transport disruptions through resilience as a service (RaaS) strategies, developing an optimization model to effectively allocate resources and minimize the cost for operators and passengers. The proposed model includes multiple transportation options, such as buses, taxis, and automated vans, and evaluates them as bridging alternatives to rail-disrupted services based on factors such as their availability, capacity, speed, and proximity to the disrupted station. This ensures that the most suitable vehicles are deployed to maintain service continuity. Applied to a case study in the Ile de France region, Paris and suburbs, complemented by a microscopic simulation, the model is compared to existing solutions such as bus bridging and reserve fleets. The results highlight the model’s performance in minimizing costs and enhancing stakeholder satisfaction, optimizing transport management during disruptions.

[AI-26] Towards Symbolic XAI – Explanation Through Human Understandable Logical Relationships Between Features

链接: https://arxiv.org/abs/2408.17198
作者: Thomas Schnake,Farnoush Rezaei Jafari,Jonas Lederer,Ping Xiong,Shinichi Nakajima,Stefan Gugler,Grégoire Montavon,Klaus-Robert Müller
关键词-EN: Explainable Artificial Intelligence, Explainable Artificial, Artificial Intelligence, approaches typically offer, heatmaps highlighting single
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Explainable Artificial Intelligence (XAI) plays a crucial role in fostering transparency and trust in AI systems, where traditional XAI approaches typically offer one level of abstraction for explanations, often in the form of heatmaps highlighting single or multiple input features. However, we ask whether abstract reasoning or problem-solving strategies of a model may also be relevant, as these align more closely with how humans approach solutions to problems. We propose a framework, called Symbolic XAI, that attributes relevance to symbolic queries expressing logical relationships between input features, thereby capturing the abstract reasoning behind a model’s predictions. The methodology is built upon a simple yet general multi-order decomposition of model predictions. This decomposition can be specified using higher-order propagation-based relevance methods, such as GNN-LRP, or perturbation-based explanation methods commonly used in XAI. The effectiveness of our framework is demonstrated in the domains of natural language processing (NLP), vision, and quantum chemistry (QC), where abstract symbolic domain knowledge is abundant and of significant interest to users. The Symbolic XAI framework provides an understanding of the model’s decision-making process that is both flexible for customization by the user and human-readable through logical formulas.
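The idea of attributing relevance to a symbolic query over feature combinations (rather than a single-feature heatmap) can be illustrated with the simplest perturbation-based variant. This is a deliberately minimal stand-in for the paper's multi-order decomposition, which also supports GNN-LRP-style propagation:

```python
def query_relevance(f, x, baseline, features):
    """Perturbation-style relevance of a symbolic query: how much does
    the prediction change when exactly the queried features are replaced
    by baseline values?"""
    x_pert = list(x)
    for i in features:
        x_pert[i] = baseline[i]
    return f(x) - f(x_pert)
```

For a model that fires only when features 0 AND 1 are both active (e.g. `f(v) = v[0] * v[1]`), the joint query over {0, 1} receives the full relevance while an irrelevant feature receives none, which is exactly the kind of logical-relationship explanation the abstract targets.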

[AI-27] Reasoning with maximal consistent signatures

链接: https://arxiv.org/abs/2408.17190
作者: Matthias Thimm,Jandson Santos Ribeiro Santos
关键词-EN: Lang and Marquis, maximal consistent, maximal consistent subsignatures, consistent subsignatures, specific instance
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We analyse a specific instance of the general approach of reasoning based on forgetting by Lang and Marquis. More precisely, we discuss an approach for reasoning with inconsistent information using maximal consistent subsignatures, where a maximal consistent subsignature is a maximal set of propositions such that forgetting the remaining propositions restores consistency. We analyse maximal consistent subsignatures and the corresponding minimal inconsistent subsignatures in-depth and show, among others, that the hitting set duality applies for them as well. We further analyse inference relations based on maximal consistent subsignatures wrt. rationality postulates from non-monotonic reasoning and computational complexity. We also consider the relationship of our approach with inconsistency measurement and paraconsistent reasoning.
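The central notion, a maximal set of propositions such that forgetting the rest restores consistency, can be made concrete with a brute-force sketch over formulas encoded as Python predicates. Formula-wise variable forgetting is `f[v/True] OR f[v/False]`; everything else is exhaustive search, so this is only viable for toy signatures:

```python
from itertools import combinations, product

def forget(formula, var):
    """Variable forgetting applied to a single formula."""
    return lambda a: formula({**a, var: True}) or formula({**a, var: False})

def consistent(formulas, sig_vars):
    """A knowledge base is consistent iff some assignment satisfies all formulas."""
    for values in product([True, False], repeat=len(sig_vars)):
        a = dict(zip(sig_vars, values))
        if all(f(a) for f in formulas):
            return True
    return False

def maximal_consistent_subsignatures(formulas, signature):
    """Return all maximal subsignatures whose complement, when forgotten
    formula-wise, restores consistency (exponential brute force)."""
    sig_vars = sorted(signature)
    for size in range(len(signature), -1, -1):
        results = []
        for keep in combinations(sig_vars, size):
            kb = list(formulas)
            for v in set(signature) - set(keep):
                kb = [forget(f, v) for f in kb]
            if consistent(kb, sig_vars):
                results.append(set(keep))
        if results:
            return results
    return []
```

For the classic example K = {p, ¬p, q}, forgetting p turns the first two formulas into tautologies, so {q} is the unique maximal consistent subsignature, and {p} is the corresponding minimal inconsistent one.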

[AI-28] “Benefit Game: Alien Seaweed Swarms” – Real-time Gamification of Digital Seaweed Ecology

链接: https://arxiv.org/abs/2408.17186
作者: Dan-Lu Fei,Zi-Wei Wu,Kang Zhang
关键词-EN: Alien Seaweed Swarms, combines artificial life, artificial life art, Alien Seaweed, Seaweed Swarms
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注: Paper accepted at ISEA 24, The 29th International Symposium on Electronic Art, Brisbane, Australia, 21-29 June 2024

点击查看摘要

Abstract:“Benefit Game: Alien Seaweed Swarms” combines artificial life art and interactive game with installation to explore the impact of human activity on fragile seaweed ecosystems. The project aims to promote ecological consciousness by creating a balance in digital seaweed ecologies. Inspired by the real species “Laminaria saccharina”, the author employs Procedural Content Generation via Machine Learning technology to generate variations of virtual seaweeds and symbiotic fungi. The audience can explore the consequences of human activities through gameplay and observe the ecosystem’s feedback on the benefits and risks of seaweed aquaculture. This Benefit Game offers dynamic and real-time responsive artificial seaweed ecosystems for an interactive experience that enhances ecological consciousness.

[AI-29] Causal Reasoning in Software Quality Assurance: A Systematic Review

链接: https://arxiv.org/abs/2408.17183
作者: Luca Giamattei,Antonio Guerriero,Roberto Pietrantuono,Stefano Russo
关键词-EN: Software Quality Assurance, software products work, Quality Assurance, Causal Reasoning, quality software systems
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: Preprint, Information and Software Technology

点击查看摘要

Abstract:Context: Software Quality Assurance (SQA) is a fundamental part of software engineering to ensure stakeholders that software products work as expected after release in operation. Machine Learning (ML) has proven to be able to boost SQA activities and contribute to the development of quality software systems. In this context, Causal Reasoning is gaining increasing interest as a methodology to solve some of the current ML limitations. It aims to go beyond a purely data-driven approach by exploiting the use of causality for more effective SQA strategies. Objective: Provide a broad and detailed overview of the use of causal reasoning for SQA activities, in order to support researchers to access this research field, identifying room for application, main challenges and research opportunities. Methods: A systematic literature review of causal reasoning in the SQA research area. Scientific papers have been searched, classified, and analyzed according to established guidelines for software engineering secondary studies. Results: Results highlight the primary areas within SQA where causal reasoning has been applied, the predominant methodologies used, and the level of maturity of the proposed solutions. Fault localization is the activity where causal reasoning is more exploited, especially in the web services/microservices domain, but other tasks like testing are rapidly gaining popularity. Both causal inference and causal discovery are exploited, with the Pearl's graphical formulation of causality being preferred, likely due to its intuitiveness. Tools to favour their application are appearing at a fast pace, most of them after 2021. Conclusions: The findings show that causal reasoning is a valuable means for SQA tasks with respect to multiple quality attributes, especially during V&V, evolution and maintenance to ensure reliability, while it is not yet fully exploited for phases like …

[AI-30] Identifying and Clustering Counter Relationships of Team Compositions in PvP Games for Efficient Balance Analysis

链接: https://arxiv.org/abs/2408.17180
作者: Chiu-Chou Lin,Yu-Wei Shih,Kuei-Ting Kuo,Yu-Cheng Chen,Chien-Hua Chen,Wei-Chen Chiu,I-Chen Wu
关键词-EN: balance, win, game settings, game, Abstract
类目: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Information Retrieval (cs.IR); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注: TMLR 09/2024 this https URL

点击查看摘要

Abstract:How can balance be quantified in game settings? This question is crucial for game designers, especially in player-versus-player (PvP) games, where analyzing the strength relations among predefined team compositions, such as hero combinations in multiplayer online battle arena (MOBA) games or decks in card games, is essential for enhancing gameplay and achieving balance. We have developed two advanced measures that extend beyond the simplistic win rate to quantify balance in zero-sum competitive scenarios. These measures are derived from win value estimations, which employ strength rating approximations via the Bradley-Terry model and counter relationship approximations via vector quantization, significantly reducing the computational complexity associated with traditional win value estimations. Throughout the learning process of these models, we identify useful categories of compositions and pinpoint their counter relationships, aligning with the experiences of human players without requiring specific game knowledge. Our methodology hinges on a simple technique to enhance codebook utilization in discrete representation with a deterministic vector quantization process for an extremely small state space. Our framework has been validated in popular online games, including Age of Empires II, Hearthstone, Brawl Stars, and League of Legends. The accuracy of the observed strength relations in these games is comparable to traditional pairwise win value predictions, while also offering a more manageable complexity for analysis. Ultimately, our findings contribute to a deeper understanding of PvP game dynamics and present a methodology that significantly improves game balance evaluation and design.
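The "strength rating approximation via the Bradley-Terry model" half of the abstract can be sketched with the classic MM (Zermelo) iteration over a pairwise win-count matrix. The vector-quantized counter-relationship model is not reproduced here:

```python
def bradley_terry(wins, n_iter=200):
    """Estimate Bradley-Terry strengths p from wins[i][j] = number of
    times composition i beat composition j, via the MM iteration
    p_i <- W_i / sum_j n_ij / (p_i + p_j), then normalize."""
    n = len(wins)
    p = [1.0] * n
    for _ in range(n_iter):
        new_p = []
        for i in range(n):
            w_i = sum(wins[i])                     # total wins of i
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            new_p.append(w_i / denom if denom else p[i])
        s = sum(new_p)
        p = [x * n / s for x in new_p]             # keep the scale fixed
    return p
```

Under the Bradley-Terry model, the predicted probability that i beats j is then `p[i] / (p[i] + p[j])`, which is what "win value estimation" refines.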

[AI-31] Deep Feature Embedding for Tabular Data ICONIP2024

链接: https://arxiv.org/abs/2408.17162
作者: Yuqian Wu,Hengyi Luo,Raymond S. T. Lee
关键词-EN: capture complex relationships, Tabular data learning, Tabular data, relationships and engineering, extensive applications
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 15 pages, 2figures, accepted to ICONIP 2024, Paper ID: 1399

点击查看摘要

Abstract:Tabular data learning has extensive applications in deep learning but its existing embedding techniques are limited in numerical and categorical features such as the inability to capture complex relationships and engineering. This paper proposes a novel deep embedding framework that leverages lightweight deep neural networks to generate effective feature embeddings for tabular data in machine learning research. For numerical features, a two-step feature expansion and deep transformation technique is used to capture copious semantic information. For categorical features, a unique identification vector for each entity is retrieved from a compact lookup table with a parameterized deep embedding function to unify the embedding dimensions, and transformed into an embedding vector using a deep neural network. Experiments are conducted on real-world datasets for performance evaluation.
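The two embedding paths the abstract describes can be sketched with plain numpy. The expansion basis, layer shapes, and the single (untrained) linear transform are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(1)

class ToyTabularEmbedder:
    """Numerical path: expand a scalar into a small basis, then apply a
    deep transform to a uniform output dimension. Categorical path: a
    compact lookup table followed by the same kind of transform."""

    def __init__(self, n_categories, dim=8):
        self.table = rng.normal(size=(n_categories, 4))  # compact lookup table
        self.w_cat = rng.normal(size=(4, dim))
        self.w_num = rng.normal(size=(3, dim))

    def embed_numeric(self, x):
        # step 1: feature expansion of the raw scalar
        expanded = np.array([x, np.sin(x), np.log1p(abs(x))])
        # step 2: deep transformation (one layer here) to a fixed dim
        return np.tanh(expanded @ self.w_num)

    def embed_categorical(self, idx):
        return np.tanh(self.table[idx] @ self.w_cat)
```

Both paths land in the same embedding dimension, which is what lets downstream layers treat numerical and categorical columns uniformly.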

[AI-32] Look, Compare, Decide: Alleviating Hallucination in Large Vision-Language Models via Multi-View Multi-Path Reasoning

链接: https://arxiv.org/abs/2408.17150
作者: Xiaoye Qu,Jiashuo Sun,Wei Wei,Yu Cheng
关键词-EN: Large Vision-Language Models, multi-modal context comprehension, Large Vision-Language, Vision-Language Models, demonstrated impressive capabilities
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 13 pages, 7 tables, 7 figures

点击查看摘要

Abstract:Recently, Large Vision-Language Models (LVLMs) have demonstrated impressive capabilities in multi-modal context comprehension. However, they still suffer from hallucination problems referring to generating inconsistent outputs with the image content. To mitigate hallucinations, previous studies mainly focus on retraining LVLMs with custom datasets. Although effective, they inherently come with additional computational costs. In this paper, we propose a training-free framework, MVP, that aims to reduce hallucinations by making the most of the innate capabilities of the LVLMs via Multi-View Multi-Path Reasoning. Specifically, we first devise a multi-view information-seeking strategy to thoroughly perceive the comprehensive information in the image, which enriches the general global information captured by the original vision encoder in LVLMs. Furthermore, during the answer decoding, we observe that the occurrence of hallucinations has a strong correlation with the certainty of the answer tokens. Thus, we propose multi-path reasoning for each information view to quantify and aggregate the certainty scores for each potential answer among multiple decoding paths and finally decide the output answer. By fully grasping the information in the image and carefully considering the certainty of the potential answers when decoding, our MVP can effectively reduce hallucinations in LVLMs. The extensive experiments verify that our proposed MVP significantly mitigates the hallucination problem across four well-known LVLMs. The source code is available at: this https URL.
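The final answer-selection step, aggregate certainty scores across views and decoding paths, then pick the most certain answer, reduces to a few lines once the per-path scores exist. In the real MVP the certainty comes from token probabilities inside the LVLM; here the scores are assumed given:

```python
def aggregate_paths(path_scores):
    """Sum per-answer certainty scores over all (view, decoding-path)
    combinations and return the winning answer plus the totals."""
    totals = {}
    for scores in path_scores:            # one dict per view/path
        for answer, certainty in scores.items():
            totals[answer] = totals.get(answer, 0.0) + certainty
    return max(totals, key=totals.get), totals
```

An answer that is confidently produced on most paths beats one that wins a single path, which is the mechanism the abstract links to reduced hallucination.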

[AI-33] Towards Hyper-parameter-free Federated Learning

链接: https://arxiv.org/abs/2408.17145
作者: Geetika,Drishya Uniyal,Bapi Chatterjee
关键词-EN: adaptive synchronization techniques, vanilla federated averaging, scaled global model, global model updates, adaptive synchronization
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
*备注: 28 pages, 3 figures

点击查看摘要

Abstract:The adaptive synchronization techniques in federated learning (FL) for scaled global model updates show superior performance over the vanilla federated averaging (FedAvg) scheme. However, existing methods employ additional tunable hyperparameters on the server to determine the scaling factor. A contrasting approach is automated scaling analogous to tuning-free step-size schemes in stochastic gradient descent (SGD) methods, which offer competitive convergence rates and exhibit good empirical performance. In this work, we introduce two algorithms for automated scaling of global model updates. In our first algorithm, we establish that a descent-ensuring step-size regime at the clients ensures descent for the server objective. We show that such a scheme enables linear convergence for strongly convex federated objectives. Our second algorithm shows that the average of objective values of sampled clients is a practical and effective substitute for the objective function value at the server required for computing the scaling factor, whose computation is otherwise not permitted. Our extensive empirical results show that the proposed methods perform at par or better than the popular federated learning algorithms for both convex and non-convex problems. Our work takes a step towards designing hyper-parameter-free federated learning.
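The abstract's second idea, substituting the server objective with the average objective of the sampled clients when computing a tuning-free scaling factor, can be illustrated on a toy 1-D problem. The concrete Polyak-style scaling rule and the clip at 2.0 below are stand-in assumptions, not the paper's algorithm:

```python
def train(clients, x0=10.0, rounds=15, lr=0.1, local_steps=5):
    """Federated run where client i minimizes 0.5*(x - c_i)^2. The server
    scales the averaged client update by a factor computed from the
    average client objective (the stand-in for the server objective)."""
    x = x0
    for _ in range(rounds):
        deltas, f_vals = [], []
        for c in clients:
            xi = x
            for _ in range(local_steps):
                xi -= lr * (xi - c)            # local gradient steps
            deltas.append(xi - x)
            f_vals.append(0.5 * (x - c) ** 2)  # client objective at server model
        delta = sum(deltas) / len(deltas)      # FedAvg pseudo-gradient
        f_avg = sum(f_vals) / len(f_vals)      # substitute for server objective
        gamma = min(2.0, f_avg / (delta * delta + 1e-12))
        x += gamma * delta                     # scaled global update
    return x
```

With clients at 0, 2, and 4, the global optimum is x = 2, and the scaled updates reach it without any server-side step-size tuning.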

[AI-34] Leveraging Digital Twin Technologies for Public Space Protection and Vulnerability Assessment

链接: https://arxiv.org/abs/2408.17136
作者: Artemis Stefanidou,Jorgen Cani,Thomas Papadopoulos,Panagiotis Radoglou-Grammatikis,Panagiotis Sarigiannidis,Iraklis Varlamis,Georgios Th. Papadopoulos
关键词-EN: locations easily accessible, increasingly important issue, public spaces, recent years, locations easily
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Over the recent years, the protection of the so-called "soft-targets", i.e. locations easily accessible by the general public with relatively low, though, security measures, has emerged as a rather challenging and increasingly important issue. The complexity and seriousness of this security threat now grow exponentially, due to the emergence of new advanced technologies (e.g. Artificial Intelligence (AI), Autonomous Vehicles (AVs), 3D printing, etc.); especially when it comes to large-scale, popular and diverse public spaces. In this paper, a novel Digital Twin-as-a-Security-Service (DTaaSS) architecture is introduced for holistically and significantly enhancing the protection of public spaces (e.g. metro stations, leisure sites, urban squares, etc.). The proposed framework combines a Digital Twin (DT) conceptualization with additional cutting-edge technologies, including Internet of Things (IoT), cloud computing, Big Data analytics and AI. In particular, DTaaSS comprises a holistic, real-time, large-scale, comprehensive and data-driven security solution for the efficient/robust protection of public spaces, supporting: a) data collection and analytics, b) area monitoring/control and proactive threat detection, c) incident/attack prediction, and d) quantitative and data-driven vulnerability assessment. Overall, the designed architecture exhibits increased potential in handling complex, hybrid and combined threats over large, critical and popular soft-targets. The applicability and robustness of DTaaSS is discussed in detail against representative and diverse real-world application scenarios, including complex attacks on: a) a metro station, b) a leisure site, and c) a cathedral square.

[AI-35] VQ4DiT: Efficient Post-Training Vector Quantization for Diffusion Transformers

链接: https://arxiv.org/abs/2408.17131
作者: Juncan Deng,Shuaiting Li,Zeyu Wang,Hong Gu,Kedong Xu,Kejie Huang
关键词-EN: Diffusion Transformers Models, Diffusion Transformers, demonstrating exceptional capabilities, demonstrating exceptional, Transformers Models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 11 pages, 6 figures

点击查看摘要

Abstract:The Diffusion Transformers Models (DiTs) have transitioned the network architecture from traditional UNets to transformers, demonstrating exceptional capabilities in image generation. Although DiTs have been widely applied to high-definition video generation tasks, their large parameter size hinders inference on edge devices. Vector quantization (VQ) can decompose model weight into a codebook and assignments, allowing extreme weight quantization and significantly reducing memory usage. In this paper, we propose VQ4DiT, a fast post-training vector quantization method for DiTs. We found that traditional VQ methods calibrate only the codebook without calibrating the assignments. This leads to weight sub-vectors being incorrectly assigned to the same assignment, providing inconsistent gradients to the codebook and resulting in a suboptimal result. To address this challenge, VQ4DiT calculates the candidate assignment set for each weight sub-vector based on Euclidean distance and reconstructs the sub-vector based on the weighted average. Then, using the zero-data and block-wise calibration method, the optimal assignment from the set is efficiently selected while calibrating the codebook. VQ4DiT quantizes a DiT XL/2 model on a single NVIDIA A100 GPU within 20 minutes to 5 hours depending on the different quantization settings. Experiments show that VQ4DiT establishes a new state-of-the-art in model size and performance trade-offs, quantizing weights to 2-bit precision while retaining acceptable image generation quality.
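The two mechanisms the abstract highlights, a candidate assignment set of the k nearest codewords per weight sub-vector, and reconstruction as a weighted average over that set, can be sketched with numpy. The inverse-distance weights below are an illustrative stand-in for VQ4DiT's calibrated weights:

```python
import numpy as np

rng = np.random.default_rng(2)

def candidate_vq(weights, codebook, k=3):
    """For each weight sub-vector, build the candidate assignment set of
    its k nearest codewords (Euclidean distance) and reconstruct the
    sub-vector as a weighted average of those candidates."""
    # pairwise distances: (n_vectors, n_codewords)
    d = np.linalg.norm(weights[:, None, :] - codebook[None, :, :], axis=-1)
    candidates = np.argsort(d, axis=1)[:, :k]       # per-vector candidate set
    cd = np.take_along_axis(d, candidates, axis=1)
    w = 1.0 / (cd + 1e-8)
    w /= w.sum(axis=1, keepdims=True)               # normalized weights
    recon = np.einsum('nk,nkd->nd', w, codebook[candidates])
    return candidates, recon
```

In the full method, the assignment within each candidate set and the codebook itself are then jointly calibrated (zero-data, block-wise), which is what fixes the inconsistent-gradient problem the abstract describes.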

[AI-36] Controllable Edge-Type-Specific Interpretation in Multi-Relational Graph Neural Networks for Drug Response Prediction

链接: https://arxiv.org/abs/2408.17129
作者: Xiaodi Li,Jianfeng Gui,Qian Gao,Haoyuan Shi,Zhenyu Yue
关键词-EN: Graph Neural Networks, Neural Networks, critical decision-making areas, Graph Neural, demand interpretable predictions
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Graph Neural Networks have been widely applied in critical decision-making areas that demand interpretable predictions, leading to the flourishing development of interpretability algorithms. However, current graph interpretability algorithms tend to emphasize generality and often overlook biological significance, thereby limiting their applicability in predicting cancer drug responses. In this paper, we propose a novel post-hoc interpretability algorithm for cancer drug response prediction, CETExplainer, which incorporates a controllable edge-type-specific weighting mechanism. It considers the mutual information between subgraphs and predictions, proposing a structural scoring approach to provide fine-grained, biologically meaningful explanations for predictive models. We also introduce a method for constructing ground truth based on real-world datasets to quantitatively evaluate the proposed interpretability algorithm. Empirical analysis on the real-world dataset demonstrates that CETExplainer achieves superior stability and improves explanation quality compared to leading algorithms, thereby offering a robust and insightful tool for cancer drug prediction.

[AI-37] Exploring User Acceptance Of Portable Intelligent Personal Assistants: A Hybrid Approach Using PLS-SEM And fsQCA

链接: https://arxiv.org/abs/2408.17119
作者: Gustave Florentin Nkoulou Mvondo,Ben Niu
关键词-EN: intelligent personal assistant, newly developed portable, developed portable intelligent, portable intelligent personal, redefine user interaction
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: 36,

点击查看摘要

Abstract:This research explores the factors driving user acceptance of Rabbit R1, a newly developed portable intelligent personal assistant (PIPA) that aims to redefine user interaction and control. The study extends the technology acceptance model (TAM) by incorporating artificial intelligence-specific factors (conversational intelligence, task intelligence, and perceived naturalness), user interface design factors (simplicity in information design and visual aesthetics), and user acceptance and loyalty. Using a purposive sampling method, we gathered data from 824 users in the US and analyzed the sample through partial least squares structural equation modeling (PLS-SEM) and fuzzy set qualitative comparative analysis (fsQCA). The findings reveal that all hypothesized relationships, including both direct and indirect effects, are supported. Additionally, fsQCA supports the PLS-SEM findings and identifies three configurations leading to high and low user acceptance. This research enriches the literature and provides valuable insights for system designers and marketers of PIPAs, guiding strategic decisions to foster widespread adoption and long-term engagement.

[AI-38] Understanding the User: An Intent-Based Ranking Dataset

链接: https://arxiv.org/abs/2408.17103
作者: Abhijit Anand,Jurek Leonhardt,V Venktesh,Avishek Anand
关键词-EN: retrieval systems continue, information retrieval systems, continue to evolve, retrieval systems, systems continue
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:As information retrieval systems continue to evolve, accurate evaluation and benchmarking of these systems become pivotal. Web search datasets, such as MS MARCO, primarily provide short keyword queries without accompanying intent or descriptions, posing a challenge in comprehending the underlying information need. This paper proposes an approach to augmenting such datasets to annotate informative query descriptions, with a focus on two prominent benchmark datasets: TREC-DL-21 and TREC-DL-22. Our methodology involves utilizing state-of-the-art LLMs to analyze and comprehend the implicit intent within individual queries from benchmark datasets. By extracting key semantic elements, we construct detailed and contextually rich descriptions for these queries. To validate the generated query descriptions, we employ crowdsourcing as a reliable means of obtaining diverse human perspectives on the accuracy and informativeness of the descriptions. This information can be used as an evaluation set for tasks such as ranking, query rewriting, or others.

[AI-39] Strategic Arms with Side Communication Prevail Over Low-Regret MAB Algorithms

链接: https://arxiv.org/abs/2408.17101
作者: Ahmed Ben Yahmed(CREST, ENSAE Paris),Clément Calauzènes,Vianney Perchet(CREST, ENSAE Paris)
关键词-EN: multi-armed bandit setting, strategic multi-armed bandit, possess perfect information, arms possess perfect, bandit setting
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In the strategic multi-armed bandit setting, when arms possess perfect information about the player’s behavior, they can establish an equilibrium where: 1. they retain almost all of their value, 2. they leave the player with a substantial (linear) regret. This study illustrates that, even if complete information is not publicly available to all arms but is shared among them, it is possible to achieve a similar equilibrium. The primary challenge lies in designing a communication protocol that incentivizes the arms to communicate truthfully.
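The phenomenon in the abstract can be simulated in a few lines: if every arm (each worth 1 per pull) passes only a tiny eps to the player, a low-regret learner like UCB1 cannot distinguish the arms, the player's reward stays linear in eps, and the arms retain almost all value. This toy run illustrates the equilibrium outcome only, not the paper's side-communication protocol:

```python
import math

def ucb1(T, arm_reward, K=3):
    """Run UCB1 for T rounds against K arms whose payouts are controlled
    by arm_reward(arm, t). Returns the player's total reward."""
    counts, sums, total = [0] * K, [0.0] * K, 0.0
    for t in range(1, T + 1):
        if t <= K:
            a = t - 1                              # pull each arm once
        else:
            a = max(range(K), key=lambda i: sums[i] / counts[i]
                    + math.sqrt(2 * math.log(t) / counts[i]))
        r = arm_reward(a, t)
        counts[a] += 1
        sums[a] += r
        total += r
    return total

# Collusive strategy: every arm pays out the same tiny eps.
eps = 0.05
player_total = ucb1(2000, lambda a, t: eps)
```

The player ends with T·eps = 100 out of a possible ~2000, so the arms keep a (1 − eps) fraction of the value while the player's regret against the arms' true worth is linear.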

[AI-40] FissionVAE: Federated Non-IID Image Generation with Latent Space and Decoder Decomposition

链接: https://arxiv.org/abs/2408.17090
作者: Chen Hu,Jingjing Deng,Xianghua Xie,Xiaoke Ma
关键词-EN: Generative Adversarial Networks, enables decentralized clients, machine learning paradigm, paradigm that enables, enables decentralized
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Federated learning is a machine learning paradigm that enables decentralized clients to collaboratively learn a shared model while keeping all the training data local. While considerable research has focused on federated image generation, particularly Generative Adversarial Networks, Variational Autoencoders have received less attention. In this paper, we address the challenges of non-IID (not independently and identically distributed) data environments featuring multiple groups of images of different types. Specifically, heterogeneous data distributions can lead to difficulties in maintaining a consistent latent space and can also result in local generators with disparate texture features being blended during aggregation. We introduce a novel approach, FissionVAE, which decomposes the latent space and constructs decoder branches tailored to individual client groups. This method allows for customized learning that aligns with the unique data distributions of each group. Additionally, we investigate the incorporation of hierarchical VAE architectures and demonstrate the use of heterogeneous decoder architectures within our model. We also explore strategies for setting the latent prior distributions to enhance the decomposition process. To evaluate our approach, we assemble two composite datasets: the first combines MNIST and FashionMNIST; the second comprises RGB datasets of cartoon and human faces, wild animals, marine vessels, and remote sensing images of Earth. Our experiments demonstrate that FissionVAE greatly improves generation quality on these datasets compared to baseline federated VAE models.
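The core idea of the decomposed decoder can be sketched with plain functions: a shared transform on the latent code, followed by a branch chosen per client group. All names and functions below are toy stand-ins for neural networks, not the paper's actual architecture.

```python
def fission_decode(z, group, shared, branches):
    """Decode a latent code with a shared transform followed by a
    group-specific decoder branch, sketching FissionVAE's decomposed
    decoder (all functions here are toy stand-ins for neural nets)."""
    hidden = shared(z)
    return branches[group](hidden)

shared = lambda z: [v + 1.0 for v in z]           # shared latent transform
branches = {
    "mnist":   lambda h: [v * 0.5 for v in h],    # branch for digit clients
    "fashion": lambda h: [v * 2.0 for v in h],    # branch for fashion clients
}
out = fission_decode([0.0, 1.0], "fashion", shared, branches)
```

Routing each group's data through its own branch is what prevents generators with disparate texture features from being blended during aggregation.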

[AI-41] Instant Adversarial Purification with Adversarial Consistency Distillation

链接: https://arxiv.org/abs/2408.17064
作者: Chun Tong Lei,Hon Ming Yam,Zhongliang Guo,Chun Pong Lau
关键词-EN: including image classification, Neural Function Evaluation, widespread applications, Neural networks, remarkable performance
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Neural networks, despite their remarkable performance in widespread applications, including image classification, are also known to be vulnerable to subtle adversarial noise. Although some diffusion-based purification methods have been proposed, for example, DiffPure, those methods are time-consuming. In this paper, we propose One Step Control Purification (OSCP), a diffusion-based purification model that can purify the adversarial image in one Neural Function Evaluation (NFE) in diffusion models. We use Latent Consistency Model (LCM) and ControlNet for our one-step purification. OSCP is computationally friendly and time efficient compared to other diffusion-based purification methods; we achieve a defense success rate of 74.19% on ImageNet, requiring only 0.1s for each purification. Moreover, there is a fundamental incongruence between consistency distillation and adversarial perturbation. To address this ontological dissonance, we propose Gaussian Adversarial Noise Distillation (GAND), a novel consistency distillation framework that facilitates a more nuanced reconciliation of the latent space dynamics, effectively bridging the natural and adversarial manifolds. Our experiments show that GAND does not need full fine-tuning (FFT); parameter-efficient fine-tuning (PEFT), e.g., LoRA, is sufficient.

[AI-42] A Survey of the Self Supervised Learning Mechanisms for Vision Transformers

链接: https://arxiv.org/abs/2408.17059
作者: Asifullah Khan,Anabia Sohail,Mustansar Fiaz,Mehdi Hassan,Tariq Habib Afridi,Sibghat Ullah Marwat,Farzeen Munir,Safdar Ali,Hannan Naseem,Muhammad Zaigham Zaheer,Kamran Ali,Tangina Sultana,Ziaurrehman Tanoli,Naeem Akhter
关键词-EN: require high volume, attain sufficiently good, models require high, sufficiently good results, require high
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 34 Pages, 5 Figures, 7 Tables

点击查看摘要

Abstract:Deep supervised learning models require a high volume of labeled data to attain sufficiently good results, but gathering and annotating such large datasets is costly and laborious. Recently, the application of self-supervised learning (SSL) to vision tasks has gained significant attention. The intuition behind SSL is to exploit the relationships within the data as a form of self-supervision, which can be applied flexibly. In the current big-data era, most data is unlabeled, and the success of SSL thus depends on finding ways to exploit this vast amount of unlabeled data. It is therefore preferable for deep learning algorithms to reduce reliance on human supervision and instead learn from the inherent relationships within the data. With the advent of ViTs, which have achieved remarkable results in computer vision, it is crucial to explore and understand the various SSL mechanisms employed for training these models, particularly in scenarios where little labeled data is available. In this survey we develop a comprehensive taxonomy that systematically classifies SSL techniques by their representations and the pre-training tasks being applied. Additionally, we discuss the motivations behind SSL, review popular pre-training tasks, and highlight the challenges and advancements in this field. Furthermore, we present a comparative analysis of different SSL methods, evaluate their strengths and limitations, and identify potential avenues for future research.

[AI-43] Dynamic Self-Consistency: Leveraging Reasoning Paths for Efficient LLM Sampling

链接: https://arxiv.org/abs/2408.17017
作者: Guangya Wan,Yuqi Wu,Jie Chen,Sheng Li
关键词-EN: Large Language Models, Language Models, Large Language, hallucinations in Large, frequent solution
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Self-Consistency (SC) is a widely used method to mitigate hallucinations in Large Language Models (LLMs) by sampling the LLM multiple times and outputting the most frequent solution. Despite its benefits, SC results in significant computational costs proportional to the number of samples generated. Previous early-stopping approaches, such as Early Stopping Self Consistency and Adaptive Consistency, have aimed to reduce these costs by considering output consistency, but they do not analyze the quality of the reasoning paths (RPs) themselves. To address this issue, we propose Reasoning-Aware Self-Consistency (RASC), an innovative early-stopping framework that dynamically adjusts the number of sample generations by considering both the output answer and the RPs from Chain of Thought (CoT) prompting. RASC assigns confidence scores sequentially to the generated samples, stops when certain criteria are met, and then employs weighted majority voting to optimize sample usage and enhance answer reliability. We comprehensively test RASC with multiple LLMs across varied QA datasets. RASC outperforms existing methods, significantly reducing sample usage by an average of 80% while maintaining or improving accuracy by up to 5% compared to the original SC.
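The early-stopping idea behind such methods can be sketched as sequential majority voting with a stopping rule. Below, a simple agreement count stands in for RASC's confidence scores over reasoning paths; the sampler and thresholds are illustrative, not the paper's.

```python
from collections import Counter

def early_stop_self_consistency(sample_fn, max_samples=10, agreement=3):
    """Sample answers one at a time; stop once the leading answer has
    `agreement` votes. A simplified stand-in for RASC's sequential,
    confidence-based stopping criterion."""
    votes = Counter()
    for i in range(max_samples):
        answer = sample_fn(i)      # one LLM sample reduced to a final answer
        votes[answer] += 1
        leader, count = votes.most_common(1)[0]
        if count >= agreement:
            return leader, i + 1   # answer plus number of samples actually used
    return votes.most_common(1)[0][0], max_samples

# Toy sampler: a noisy model that answers "42" most of the time.
fake_samples = ["42", "41", "42", "42", "7", "42"]
answer, used = early_stop_self_consistency(lambda i: fake_samples[i])
```

Here the loop stops after four samples instead of spending the full budget, which is the source of the sample savings reported by early-stopping variants of SC.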

[AI-44] Improving Time Series Classification with Representation Soft Label Smoothing

链接: https://arxiv.org/abs/2408.17010
作者: Hengyi Ma,Weitong Chen
关键词-EN: time series classification, deep neural network, neural network based, Previous research, network based models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 14 pages,6 figures

点击查看摘要

Abstract:Previous research has indicated that deep neural network based models for time series classification (TSC) tasks are prone to overfitting. This issue can be mitigated by employing strategies that prevent the model from becoming overly confident in its predictions, such as label smoothing and confidence penalty. Building upon the concept of label smoothing, we propose a novel approach to generate more reliable soft labels, which we refer to as representation soft label smoothing. We apply label smoothing, confidence penalty, and our representation soft label smoothing method to several TSC models and compare their performance with a baseline method that uses only hard labels for training. Our results demonstrate that the use of these enhancement techniques yields competitive results compared to the baseline method. Importantly, our method demonstrates strong performance across models with varying structures and complexities.
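As background, the standard label smoothing the paper builds on moves a small probability mass eps from the true class to a uniform distribution over all classes (the paper's representation-based soft labels go further, but are not reproduced here):

```python
def smooth_labels(true_class, num_classes, eps=0.1):
    """Standard label smoothing: move eps probability mass from the
    one-hot true class to a uniform distribution over all classes."""
    uniform = eps / num_classes
    labels = [uniform] * num_classes
    labels[true_class] += 1.0 - eps
    return labels

soft = smooth_labels(true_class=2, num_classes=4, eps=0.1)
```

Each off-class receives eps/K of the mass, so the target is never a hard 0/1 vector, which discourages overconfident predictions.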

[AI-45] Safety Layers of Aligned Large Language Models : The Key to LLM Security

链接: https://arxiv.org/abs/2408.17003
作者: Shen Li,Liuyi Yao,Lan Zhang,Yaliang Li
关键词-EN: answer malicious questions, highly secure, capable of recognizing, safety layers, recognizing and refusing
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Aligned LLMs are highly secure, capable of recognizing and refusing to answer malicious questions. However, the role of internal parameters in maintaining this security is not well understood; moreover, these models are vulnerable to security degradation when fine-tuned with non-malicious backdoor data or normal data. To address these challenges, our work uncovers the mechanism behind security in aligned LLMs at the parameter level, identifying a small set of contiguous layers in the middle of the model that are crucial for distinguishing malicious queries from normal ones, referred to as “safety layers.” We first confirm the existence of these safety layers by analyzing variations in input vectors within the model’s internal layers. Additionally, we leverage the over-rejection phenomenon and parameter scaling analysis to precisely locate the safety layers. Building on this understanding, we propose a novel fine-tuning approach, Safely Partial-Parameter Fine-Tuning (SPPFT), that fixes the gradient of the safety layers during fine-tuning to address the security degradation. Our experiments demonstrate that this approach significantly preserves model security while maintaining performance and reducing computational resources compared to full fine-tuning.
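Fixing the gradient of a layer subset amounts to skipping the parameter update for those layers. A minimal sketch, with dictionaries standing in for model parameters and hypothetical layer names (the real safety-layer indices come from the paper's analysis, not from this code):

```python
def apply_sppft_update(params, grads, safety_layers, lr=0.01):
    """One gradient step that freezes the designated safety layers,
    mirroring SPPFT's idea of fixing their gradients during fine-tuning."""
    updated = {}
    for name, value in params.items():
        if name in safety_layers:
            updated[name] = value              # frozen: no update applied
        else:
            updated[name] = value - lr * grads[name]
    return updated

params = {"layer_10": 1.0, "layer_11": 1.0, "layer_12": 1.0}
grads  = {"layer_10": 0.5, "layer_11": 0.5, "layer_12": 0.5}
new_params = apply_sppft_update(params, grads, safety_layers={"layer_11"})
```

In a real framework the same effect is typically obtained by disabling gradient tracking on those layers, so the frozen parameters also cost nothing in the backward pass.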

[AI-46] Beyond Preferences in AI Alignment

链接: https://arxiv.org/abs/2408.16984
作者: Tan Zhi-Xuan,Micah Carroll,Matija Franklin,Hal Ashton
关键词-EN: dominant practice, maximizing the satisfaction, behave safely, alignment assumes, preferences
类目: Artificial Intelligence (cs.AI)
*备注: 26 pages (excl. references), 5 figures

点击查看摘要

Abstract:The dominant practice of AI alignment assumes (1) that preferences are an adequate representation of human values, (2) that human rationality can be understood in terms of maximizing the satisfaction of preferences, and (3) that AI systems should be aligned with the preferences of one or more humans to ensure that they behave safely and in accordance with our values. Whether implicitly followed or explicitly endorsed, these commitments constitute what we term a preferentist approach to AI alignment. In this paper, we characterize and challenge the preferentist approach, describing conceptual and technical alternatives that are ripe for further research. We first survey the limits of rational choice theory as a descriptive model, explaining how preferences fail to capture the thick semantic content of human values, and how utility representations neglect the possible incommensurability of those values. We then critique the normativity of expected utility theory (EUT) for humans and AI, drawing upon arguments showing how rational agents need not comply with EUT, while highlighting how EUT is silent on which preferences are normatively acceptable. Finally, we argue that these limitations motivate a reframing of the targets of AI alignment: Instead of alignment with the preferences of a human user, developer, or humanity-writ-large, AI systems should be aligned with normative standards appropriate to their social roles, such as the role of a general-purpose assistant. Furthermore, these standards should be negotiated and agreed upon by all relevant stakeholders. On this alternative conception of alignment, a multiplicity of AI systems will be able to serve diverse ends, aligned with normative standards that promote mutual benefit and limit harm despite our plural and divergent values.

[AI-47] Training Ultra Long Context Language Model with Fully Pipelined Distributed Transformer

链接: https://arxiv.org/abs/2408.16978
作者: Jinghan Yao,Sam Ade Jacobs,Masahiro Tanaka,Olatunji Ruwase,Aamir Shafi,Hari Subramoni,Dhabaleswar K. Panda
关键词-EN: Large Language Models, natural language processing, Large Language, long context capabilities, natural language
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) with long context capabilities are integral to complex tasks in natural language processing and computational biology, such as text generation and protein sequence analysis. However, training LLMs directly on extremely long contexts demands considerable GPU resources and increased memory, leading to higher costs and greater complexity. Alternative approaches that introduce long context capabilities via downstream finetuning or adaptations impose significant design limitations. In this paper, we propose Fully Pipelined Distributed Transformer (FPDT) for efficiently training long-context LLMs with extreme hardware efficiency. For GPT and Llama models, we achieve a 16x increase in sequence length that can be trained on the same hardware compared to current state-of-the-art solutions. With our dedicated sequence chunk pipeline design, we can now train 8B LLM with 2 million sequence length on only 4 GPUs, while also maintaining over 55% of MFU. Our proposed FPDT is agnostic to existing training techniques and is proven to work efficiently across different LLM models.

[AI-48] MemLong: Memory-Augmented Retrieval for Long Text Modeling

链接: https://arxiv.org/abs/2408.16967
作者: Weijie Liu,Zecheng Tang,Juntao Li,Kehai Chen,Min Zhang
关键词-EN: yielded remarkable success, Large Language Models, Recent advancements, advancements in Large, Large Language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent advancements in Large Language Models (LLMs) have yielded remarkable success across diverse fields. However, handling long contexts remains a significant challenge for LLMs due to the quadratic time and space complexity of attention mechanisms and the growing memory consumption of the key-value cache during generation. This work introduces MemLong: Memory-Augmented Retrieval for Long Text Generation, a method designed to enhance the capabilities of long-context language modeling by utilizing an external retriever for historical information retrieval. MemLong combines a non-differentiable “ret-mem” module with a partially trainable decoder-only language model and introduces a fine-grained, controllable retrieval attention mechanism that leverages semantic-level relevant chunks. Comprehensive evaluations on multiple long-context language modeling benchmarks demonstrate that MemLong consistently outperforms other state-of-the-art LLMs. More importantly, MemLong can extend the context length on a single 3090 GPU from 4k up to 80k. Our code is available at this https URL
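The external-retriever step can be illustrated with cosine similarity over stored chunk embeddings. This is a toy stand-in: MemLong's actual retriever, embedding model, and attention integration are not shown, and the vectors below are made up.

```python
import math

def cosine(a, b):
    """Cosine similarity between two (assumed nonzero) vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, memory, k=1):
    """Return the k stored chunks most similar to the query vector --
    a toy stand-in for retrieving semantically relevant historical
    chunks to extend the effective context."""
    ranked = sorted(memory, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]

memory = [
    ("chapter on attention", [1.0, 0.0, 0.1]),
    ("chapter on retrieval", [0.0, 1.0, 0.2]),
]
top = retrieve([0.1, 0.9, 0.0], memory, k=1)
```

Because the memory lives outside the model's attention window, its size (and hence the effective context) can grow far beyond what the key-value cache alone would allow.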

[AI-49] UserSumBench: A Benchmark Framework for Evaluating User Summarization Approaches

链接: https://arxiv.org/abs/2408.16966
作者: Chao Wang,Neo Wu,Lin Ning,Luyang Liu,Jun Xie,Shawn O’Banion,Bradley Green
关键词-EN: Large language models, shown remarkable capabilities, user activity data, Large language, raw user activity
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) have shown remarkable capabilities in generating user summaries from a long list of raw user activity data. These summaries capture essential user information such as preferences and interests, and therefore are invaluable for LLM-based personalization applications, such as explainable recommender systems. However, the development of new summarization techniques is hindered by the lack of ground-truth labels, the inherent subjectivity of user summaries, and human evaluation which is often costly and time-consuming. To address these challenges, we introduce UserSumBench, a benchmark framework designed to facilitate iterative development of LLM-based summarization approaches. This framework offers two key components: (1) A reference-free summary quality metric. We show that this metric is effective and aligned with human preferences across three diverse datasets (MovieLens, Yelp and Amazon Review). (2) A novel robust summarization method that leverages time-hierarchical summarizer and self-critique verifier to produce high-quality summaries while eliminating hallucination. This method serves as a strong baseline for further innovation in summarization techniques.

[AI-50] The Future of Open Human Feedback

链接: https://arxiv.org/abs/2408.16961
作者: Shachar Don-Yehiya,Ben Burtenshaw,Ramon Fernandez Astudillo,Cailean Osborne,Mimansa Jaiswal,Tzu-Sheng Kuo,Wenting Zhao,Idan Shenfeld,Andi Peng,Mikhail Yurochkin,Atoosa Kasirzadeh,Yangsibo Huang,Tatsunori Hashimoto,Yacine Jernite,Daniel Vila-Suero,Omri Abend,Jennifer Ding,Sara Hooker,Hannah Rose Kirk,Leshem Choshen
关键词-EN: language language models, language language, improve their capabilities, safe behaviors, Human feedback
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Human feedback on conversations with large language models (LLMs) is central to how these systems learn about the world, improve their capabilities, and are steered toward desirable and safe behaviors. However, this feedback is mostly collected by frontier AI labs and kept behind closed doors. In this work, we bring together interdisciplinary experts to assess the opportunities and challenges to realizing an open ecosystem of human feedback for AI. We first look for successful practices in peer production, open source, and citizen science communities. We then characterize the main challenges for open human feedback. For each, we survey current approaches and offer recommendations. We end by envisioning the components needed to underpin a sustainable and open human feedback ecosystem. In the center of this ecosystem are mutually beneficial feedback loops, between users and specialized models, incentivizing a diverse stakeholders community of model trainers and feedback providers to support a general open feedback pool.

[AI-51] Discovery of False Data Injection Schemes on Frequency Controllers with Reinforcement Learning

链接: https://arxiv.org/abs/2408.16958
作者: Romesh Prasad,Malik Hassanaly,Xiangyu Zhang,Abhijeet Sahu
关键词-EN: distributed energy resources, inverter-based distributed energy, integrating renewable energy, energy resources, play a crucial
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:While inverter-based distributed energy resources (DERs) play a crucial role in integrating renewable energy into the power system, they concurrently diminish the grid’s system inertia, elevating the risk of frequency instabilities. Furthermore, smart inverters, interfaced via communication networks, pose a potential vulnerability to cyber threats if not diligently managed. To proactively fortify the power grid against sophisticated cyber attacks, we propose to employ reinforcement learning (RL) to identify potential threats and system vulnerabilities. This study concentrates on analyzing adversarial strategies for false data injection, specifically targeting smart inverters involved in primary frequency control. Our findings demonstrate that an RL agent can adeptly discern optimal false data injection methods to manipulate inverter settings, potentially causing catastrophic consequences.

[AI-52] Transient Fault Tolerant Semantic Segmentation for Autonomous Driving ECCV2024

链接: https://arxiv.org/abs/2408.16952
作者: Leonardo Iurada,Niccolò Cavagnero,Fernando Fernandes Dos Santos,Giuseppe Averta,Paolo Rech,Tatiana Tommasi
关键词-EN: Deep learning models, Deep learning, autonomous vehicle perception, vehicle perception, reliability is challenged
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted ECCV 2024 UnCV Workshop - this https URL

点击查看摘要

Abstract:Deep learning models are crucial for autonomous vehicle perception, but their reliability is challenged by algorithmic limitations and hardware faults. We address the latter by examining fault-tolerance in semantic segmentation models. Using established hardware fault models, we evaluate existing hardening techniques both in terms of accuracy and uncertainty and introduce ReLUMax, a novel simple activation function designed to enhance resilience against transient faults. ReLUMax integrates seamlessly into existing architectures without time overhead. Our experiments demonstrate that ReLUMax effectively improves robustness, preserving performance and boosting prediction confidence, thus contributing to the development of reliable autonomous driving systems.
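Conceptually, ReLUMax behaves like a ReLU whose output is clamped at an upper bound, so a transient fault that flips a high-order bit cannot inject an arbitrarily large activation. The cap value below is an illustrative assumption, not the paper's setting.

```python
def relumax(x, cap=6.0):
    """ReLU with an upper clamp: negative inputs go to zero as usual,
    and abnormally large values (e.g. from a hardware bit flip) are
    clipped instead of propagating downstream. The cap of 6.0 is an
    illustrative assumption, not the value from the paper."""
    return min(max(x, 0.0), cap)

# Normal inputs pass through unchanged; a fault-sized value is capped.
outputs = [relumax(v) for v in (-3.0, 2.5, 1.0e9)]
```

Because the clamp is a single `min`, it adds no time overhead and can replace ReLU in an existing architecture without retraining the surrounding layers.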

[AI-53] Different Victims Same Layout: Email Visual Similarity Detection for Enhanced Email Protection CCS2024

链接: https://arxiv.org/abs/2408.16945
作者: Sachin Shukla,Omid Mirzaei
关键词-EN: machine learning, rule-based detection systems, effective spam detection, detection, email
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: To be published in the proceedings of the ACM Conference on Computer and Communications Security (ACM CCS 2024)

点击查看摘要

Abstract:In the pursuit of an effective spam detection system, the focus has often been on identifying known spam patterns either through rule-based detection systems or machine learning (ML) solutions. However, both systems are susceptible to evasion techniques and zero-day attacks that can be achieved at low cost. Therefore, an email that bypassed the defense system once can do it again in the following days, even though rules are updated or the ML models are retrained. The recurrence of failures to detect emails that exhibit layout similarities to previously undetected spam is concerning for customers and can erode their trust in a company. Our observations show that threat actors reuse email kits extensively and can bypass detection with little effort, for example, by making changes to the content of emails. In this work, we propose an email visual similarity detection approach, named Pisco, to improve the detection capabilities of an email threat defense system. We apply our proof of concept to some real-world samples received from different sources. Our results show that email kits are being reused extensively and visually similar emails are sent to our customers at various time intervals. Therefore, this method could be very helpful in situations where detection features that rely on contextual information and keywords are bypassed, an occurrence our observations show happens frequently.
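One simple way to compare email layouts is to reduce each rendered email to a fixed-length visual hash and measure bit agreement. The features Pisco actually uses are not specified here; the hashes below are made-up bit strings for illustration.

```python
def hamming_similarity(h1, h2):
    """Fraction of matching bits between two equal-length visual
    hashes -- a toy proxy for the layout-similarity signal a visual
    detection approach could use to catch reused email kits."""
    assert len(h1) == len(h2), "hashes must be the same length"
    matches = sum(a == b for a, b in zip(h1, h2))
    return matches / len(h1)

# Two near-identical layouts (e.g. the same kit with minor text edits)
# differ in only one bit of their 16-bit hashes.
sim = hamming_similarity("1011001110100101", "1011001110100001")
```

A high similarity to a previously flagged email is a signal that the same kit was reused, even when keyword- and content-based features have been altered to evade detection.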

[AI-54] A longitudinal sentiment analysis of Sinophobia during COVID-19 using large language models

链接: https://arxiv.org/abs/2408.16942
作者: Chen Wang,Rohitash Chandra
关键词-EN: exacerbated xenophobia, leading to widespread, widespread discrimination, discrimination against individuals, Sinophobic sentiments
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The COVID-19 pandemic has exacerbated xenophobia, particularly Sinophobia, leading to widespread discrimination against individuals of Chinese descent. Large language models (LLMs) are pre-trained deep learning models used for natural language processing (NLP) tasks. The ability of LLMs to understand and generate human-like text makes them particularly useful for analysing social media data to detect and evaluate sentiments. We present a sentiment analysis framework utilising LLMs for longitudinal sentiment analysis of the Sinophobic sentiments expressed in X (Twitter) during the COVID-19 pandemic. The results show a significant correlation between the spikes in Sinophobic tweets, Sinophobic sentiments and surges in COVID-19 cases, revealing that the evolution of the pandemic influenced public sentiment and the prevalence of Sinophobic discourse. Furthermore, the sentiment analysis revealed a predominant presence of negative sentiments, such as annoyance and denial, which underscores the impact of political narratives and misinformation shaping public opinion. The lack of empathetic sentiment which was present in previous studies related to COVID-19 highlights the way the political narratives in media viewed the pandemic and how it blamed the Chinese community. Our study highlights the importance of transparent communication in mitigating xenophobic sentiments during global crises.

[AI-55] Event Extraction for Portuguese: A QA-driven Approach using ACE-2005

链接: https://arxiv.org/abs/2408.16932
作者: Luís Filipe Cunha,Ricardo Campos,Alípio Jorge
关键词-EN: Information Retrieval task, Information Retrieval, Retrieval task, Portuguese, commonly consists
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Event extraction is an Information Retrieval task that commonly consists of identifying the central word for the event (trigger) and the event’s arguments. This task has been extensively studied for English but lags behind for Portuguese, partly due to the lack of task-specific annotated corpora. This paper proposes a framework in which two separate BERT-based models were fine-tuned to identify and classify events in Portuguese documents. We decompose this task into two sub-tasks. Firstly, we use a token classification model to detect event triggers. To extract event arguments, we train a Question Answering model that queries the triggers about their corresponding event argument roles. Given the lack of event-annotated corpora in Portuguese, we translated the original version of the ACE-2005 dataset (a reference in the field) into Portuguese, producing a new corpus for Portuguese event extraction. To accomplish this, we developed an automatic translation pipeline. Our framework obtains F1 scores of 64.4 for trigger classification and 46.7 for argument classification, thus setting a new state-of-the-art reference for these tasks in Portuguese.
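The QA step turns a detected trigger plus an argument role into a natural-language question for the QA model. The templates below are hypothetical; the paper's actual question wording is not reproduced here.

```python
def build_argument_question(trigger, role):
    """Turn a detected trigger and an argument role into a question
    for a QA model, sketching the trigger-querying setup. The exact
    question templates are illustrative assumptions."""
    templates = {
        "agent": "Who performed the event '{t}'?",
        "place": "Where did the event '{t}' happen?",
        "time":  "When did the event '{t}' happen?",
    }
    return templates[role].format(t=trigger)

q = build_argument_question("attack", "place")
```

The QA model's answer span in the source sentence is then taken as the argument filling that role, so argument extraction reduces to extractive question answering.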

[AI-56] ACE-2005-PT: Corpus for Event Extraction in Portuguese

链接: https://arxiv.org/abs/2408.16928
作者: Luís Filipe Cunha,Purificação Silvano,Ricardo Campos,Alípio Jorge
关键词-EN: NLP task, commonly involves identifying, task that commonly, commonly involves, Event extraction
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Event extraction is an NLP task that commonly involves identifying the central word (trigger) for an event and its associated arguments in text. ACE-2005 is widely recognised as the standard corpus in this field. While other corpora, like PropBank, primarily focus on annotating predicate-argument structure, ACE-2005 provides comprehensive information about the overall event structure and semantics. However, its limited language coverage restricts its usability. This paper introduces ACE-2005-PT, a corpus created by translating ACE-2005 into Portuguese, with European and Brazilian variants. To speed up the process of obtaining ACE-2005-PT, we rely on automatic translators. This, however, poses some challenges related to automatically identifying the correct alignments between multi-word annotations in the original text and in the corresponding translated sentence. To achieve this, we developed an alignment pipeline that incorporates several alignment techniques: lemmatization, fuzzy matching, synonym matching, multiple translations and a BERT-based word aligner. To measure the alignment effectiveness, a subset of annotations from the ACE-2005-PT corpus was manually aligned by a linguist expert. This subset was then compared against our pipeline results which achieved exact and relaxed match scores of 70.55% and 87.55% respectively. As a result, we successfully generated a Portuguese version of the ACE-2005 corpus, which has been accepted for publication by LDC.
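Fuzzy matching, one of the alignment techniques the pipeline combines, can be sketched with the standard library's `difflib`: scan candidate word spans in the translated sentence and keep the one most similar to the translated annotation. The pipeline's other stages (lemmatization, synonym matching, the BERT-based aligner) are not shown, and the example sentence is made up.

```python
from difflib import SequenceMatcher

def fuzzy_align(annotation, sentence, max_len=4):
    """Find the word span in `sentence` most similar to `annotation`,
    illustrating the fuzzy-matching stage of a multi-word annotation
    alignment pipeline."""
    words = sentence.split()
    best_span, best_score = None, 0.0
    for i in range(len(words)):
        for j in range(i + 1, min(i + 1 + max_len, len(words) + 1)):
            span = " ".join(words[i:j])
            score = SequenceMatcher(None, annotation.lower(),
                                    span.lower()).ratio()
            if score > best_score:
                best_span, best_score = span, score
    return best_span, best_score

# The translated annotation uses an accented form the machine
# translation of the sentence lacks; fuzzy matching still aligns them.
span, score = fuzzy_align("ataque aéreo", "um ataque aereo destruiu a ponte")
```

Scoring spans rather than exact strings is what lets the pipeline survive the small surface divergences (accents, inflection) that machine translation introduces between the annotation and the sentence.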

[AI-57] Analyzing Inference Privacy Risks Through Gradients in Machine Learning

链接: https://arxiv.org/abs/2408.16913
作者: Zhuohang Li,Andrew Lowy,Jing Liu,Toshiaki Koike-Akino,Kieran Parsons,Bradley Malin,Ye Wang
关键词-EN: potentially sensitive user, shared gradients computed, models are iteratively, iteratively updated, updated with shared
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:In distributed learning settings, models are iteratively updated with shared gradients computed from potentially sensitive user data. While previous work has studied various privacy risks of sharing gradients, our paper aims to provide a systematic approach to analyze private information leakage from gradients. We present a unified game-based framework that encompasses a broad range of attacks including attribute, property, distributional, and user disclosures. We investigate how different uncertainties of the adversary affect their inferential power via extensive experiments on five datasets across various data modalities. Our results demonstrate the inefficacy of solely relying on data aggregation to achieve privacy against inference attacks in distributed learning. We further evaluate five types of defenses, namely, gradient pruning, signed gradient descent, adversarial perturbations, variational information bottleneck, and differential privacy, under both static and adaptive adversary settings. We provide an information-theoretic view for analyzing the effectiveness of these defenses against inference from gradients. Finally, we introduce a method for auditing attribute inference privacy, improving the empirical estimation of worst-case privacy through crafting adversarial canary records.
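Of the defenses evaluated, gradient pruning is the simplest to sketch: zero out the smallest-magnitude gradient entries before sharing, keeping only a top fraction. The single-vector form below is a simplification of how pruning is applied to real model gradients.

```python
def prune_gradients(grads, keep_ratio=0.5):
    """Zero out the smallest-magnitude gradient entries, keeping the
    top `keep_ratio` fraction -- a simplified, single-vector sketch of
    the gradient pruning defense. Ties at the threshold are all kept."""
    k = max(1, int(len(grads) * keep_ratio))
    threshold = sorted((abs(g) for g in grads), reverse=True)[k - 1]
    return [g if abs(g) >= threshold else 0.0 for g in grads]

pruned = prune_gradients([0.9, -0.05, 0.4, 0.01], keep_ratio=0.5)
```

The intuition is that dropping low-magnitude entries removes some of the fine-grained signal an adversary could invert, at the cost of a noisier shared update.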

[AI-58] GSTAM: Efficient Graph Distillation with Structural Attention-Matching ECCV

链接: https://arxiv.org/abs/2408.16871
作者: Arash Rasti-Meymandi,Ahmad Sajedi,Zhaopan Xu,Konstantinos N. Plataniotis
关键词-EN: reducing large graph, large graph datasets, solution for reducing, reducing large, Graph
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted at ECCV-DD 2024

点击查看摘要

Abstract:Graph distillation has emerged as a solution for reducing large graph datasets to smaller, more manageable, and informative ones. Existing methods primarily target node classification, involve computationally intensive processes, and fail to capture the true distribution of the full graph dataset. To address these issues, we introduce Graph Distillation with Structural Attention Matching (GSTAM), a novel method for condensing graph classification datasets. GSTAM leverages the attention maps of GNNs to distill structural information from the original dataset into synthetic graphs. The structural attention-matching mechanism exploits the areas of the input graph that GNNs prioritize for classification, effectively distilling such information into the synthetic graphs and improving overall distillation performance. Comprehensive experiments demonstrate GSTAM’s superiority over existing methods, achieving 0.45% to 6.5% better performance in extreme condensation ratios, highlighting its potential use in advancing distillation for graph classification tasks (Code available at this https URL).

[AI-59] Physics-Informed Neural Networks and Extensions

Link: https://arxiv.org/abs/2408.16806
Authors: Maziar Raissi, Paris Perdikaris, Nazanin Ahmadi, George Em Karniadakis
Keywords (EN): Physics-Informed Neural Networks, method Physics-Informed Neural, scientific machine learning, recent practical extensions, governing differential equations
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Frontiers of Science Awards 2024

Abstract:In this paper, we review the new method Physics-Informed Neural Networks (PINNs) that has become the main pillar in scientific machine learning, we present recent practical extensions, and provide a specific example in data-driven discovery of governing differential equations.
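The core PINN recipe, adding the governing equation's residual to the training loss alongside a data/boundary term, can be shown on a toy ODE with a polynomial ansatz. This is an illustrative sketch under stated assumptions (ODE du/dt = -u, u(0) = 1, polynomial model), not code from the paper.

```python
def pinn_loss(weights, ts):
    """Composite PINN loss for the toy ODE du/dt = -u with u(0) = 1,
    where u is modeled as a polynomial u(t) = sum_i w_i * t**i.
    The physics residual is evaluated at collocation points ts."""
    def u(t):
        return sum(w * t**i for i, w in enumerate(weights))

    def du_dt(t):  # exact derivative of the polynomial ansatz
        return sum(i * w * t**(i - 1) for i, w in enumerate(weights) if i > 0)

    residual = sum((du_dt(t) + u(t)) ** 2 for t in ts) / len(ts)  # physics term
    boundary = (u(0.0) - 1.0) ** 2                                # initial-condition term
    return residual + boundary

ts = [i / 10 for i in range(11)]
good = [1.0, -1.0, 0.5, -1/6, 1/24]  # truncated Taylor series of exp(-t)
print(pinn_loss(good, ts))   # small: the ansatz nearly satisfies the ODE
print(pinn_loss([0.0], ts))  # 1.0: u = 0 satisfies the ODE but not u(0) = 1
```

In a real PINN, the polynomial is replaced by a neural network and the derivative comes from automatic differentiation, but the two-term loss structure is the same.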

[AI-60] HLogformer: A Hierarchical Transformer for Representing Log Data

Link: https://arxiv.org/abs/2408.16803
Authors: Zhichao Hou, Mina Ghashami, Mikhail Kuznetsov, MohamadAli Torkamani
Keywords (EN): gained widespread acclaim, data remains underexplored, handling diverse data, log, log data
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Transformers have gained widespread acclaim for their versatility in handling diverse data structures, yet their application to log data remains underexplored. Log data, characterized by its hierarchical, dictionary-like structure, poses unique challenges when processed using conventional transformer models. Traditional methods often rely on manually crafted templates for parsing logs, a process that is labor-intensive and lacks generalizability. Additionally, the linear treatment of log sequences by standard transformers neglects the rich, nested relationships within log entries, leading to suboptimal representations and excessive memory usage. To address these issues, we introduce HLogformer, a novel hierarchical transformer framework specifically designed for log data. HLogformer leverages the hierarchical structure of log entries to significantly reduce memory costs and enhance representation learning. Unlike traditional models that treat log data as flat sequences, our framework processes log entries in a manner that respects their inherent hierarchical organization. This approach ensures comprehensive encoding of both fine-grained details and broader contextual relationships. Our contributions are threefold: First, HLogformer is the first framework to design a dynamic hierarchical transformer tailored for dictionary-like log data. Second, it dramatically reduces memory costs associated with processing extensive log sequences. Third, comprehensive experiments demonstrate that HLogformer more effectively encodes hierarchical contextual information, proving to be highly effective for downstream tasks such as synthetic anomaly detection and product recommendation.

[AI-61] Generative AI in Ship Design

Link: https://arxiv.org/abs/2408.16798
Authors: Sahil Thakur, Navneet V Saxena, Prof Sitikantha Roy
Keywords (EN): heavily influenced, accounts for approximately, total cost, Gaussian Mixture Model, model architecture
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:The process of ship design is intricate, heavily influenced by the hull form which accounts for approximately 70% of the total cost. Traditional methods rely on human-driven iterative processes based on naval architecture principles and engineering analysis. In contrast, generative AI presents a novel approach, utilizing computational algorithms rooted in machine learning and artificial intelligence to optimize ship hull design. This report outlines the systematic creation of a generative AI for this purpose, involving steps such as dataset collection, model architecture selection, training, and validation. Utilizing the “SHIP-D” dataset, consisting of 30,000 hull forms, the report adopts the Gaussian Mixture Model (GMM) as the generative model architecture. GMMs offer a statistical framework to analyze data distribution, crucial for generating innovative ship designs efficiently. Overall, this approach holds promise in revolutionizing ship design by exploring a broader design space and integrating multidisciplinary optimization objectives effectively.
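Once a GMM is fitted to hull-form parameter vectors, generating candidate designs reduces to sampling from the mixture: pick a component by weight, then draw from its Gaussian. The sketch below uses made-up components and parameters (length, beam, draft), not the SHIP-D statistics.

```python
import random

# Illustrative GMM over 3-D hull parameters (length, beam, draft).
# Each component is (weight, means, stddevs); values are invented.
gmm = [
    (0.6, [120.0, 20.0, 8.0], [10.0, 2.0, 0.5]),  # cargo-like hulls
    (0.4, [60.0, 9.0, 3.5], [5.0, 1.0, 0.3]),     # smaller craft
]

def sample_design(rng):
    """Draw one hull-parameter vector from the mixture."""
    r, acc = rng.random(), 0.0
    for weight, means, stds in gmm:
        acc += weight
        if r <= acc:
            return [rng.gauss(m, s) for m, s in zip(means, stds)]
    # numerical safety net: fall back to the last component
    return [rng.gauss(m, s) for m, s in zip(gmm[-1][1], gmm[-1][2])]

rng = random.Random(0)
designs = [sample_design(rng) for _ in range(5)]
for d in designs:
    print([round(x, 1) for x in d])
```

In practice the mixture would be fitted with EM (e.g. scikit-learn's `GaussianMixture`) on the 30,000 SHIP-D hull forms rather than hand-set as here.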

[AI-62] EvoAl2048 GECCO’24

Link: https://arxiv.org/abs/2408.16780
Authors: Bernhard J. Berger (1 and 2), Christina Plump (3), Rolf Drechsler (4 and 3) ((1) University of Rostock, Software Engineering Chair Rostock, Germany, (2) Hamburg University of Technology, Institute of Embedded Systems, Germany, (3) DFKI - Cyber-Physical Systems Bremen, Germany, (4) University of Bremen, Departments of Mathematics and Computer Science)
Keywords (EN): enter safety-critical products, solutions enter safety-critical, increasingly important, safety-critical products, enter safety-critical
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments: 2 pages, GECCO’24 competition entry

Abstract:As AI solutions enter safety-critical products, the explainability and interpretability of solutions generated by AI products become increasingly important. In the long term, such explanations are the key to gaining users’ acceptance of AI-based systems’ decisions. We report on applying a model-driven-based optimisation to search for an interpretable and explainable policy that solves the game 2048. This paper describes a solution to the GECCO’24 Interpretable Control Competition using the open-source software EvoAl. We aimed to develop an approach for creating interpretable policies that are easy to adapt to new ideas.

[AI-63] Inductive Learning of Logical Theories with LLMs: A Complexity-graded Analysis

Link: https://arxiv.org/abs/2408.16779
Authors: João Pedro Gandarela, Danilo S. Carvalho, André Freitas
Keywords (EN): Large Language Models, limitations of Large, Large Language, Natural Language Processing, work presents
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)

Abstract:This work presents a novel systematic methodology to analyse the capabilities and limitations of Large Language Models (LLMs) with feedback from a formal inference engine, on logic theory induction. The analysis is complexity-graded w.r.t. rule dependency structure, allowing quantification of specific inference challenges on LLM performance. Integrating LLMs with formal methods is a promising frontier in the Natural Language Processing field, as an important avenue for improving model inference control and explainability. In particular, inductive learning over complex sets of facts and rules, poses unique challenges for current autoregressive models, as they lack explicit symbolic grounding. While they can be complemented by formal systems, the properties delivered by LLMs regarding inductive learning, are not well understood and quantified. Empirical results indicate that the largest LLMs can achieve competitive results against a SOTA Inductive Logic Programming (ILP) system baseline, but also that tracking long predicate relationship chains is a more difficult obstacle than theory complexity for the LLMs.

[AI-64] Online Behavior Modification for Expressive User Control of RL-Trained Robots

Link: https://arxiv.org/abs/2408.16776
Authors: Isaac Sheidlower, Mavis Murdock, Emma Bethel, Reuben M. Aronson, Elaine Schaertl Short
Keywords (EN): Reinforcement Learning, effective method, Reinforcement, Learning, learn tasks
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: This work was published and presented at HRI 2024

Abstract:Reinforcement Learning (RL) is an effective method for robots to learn tasks. However, in typical RL, end-users have little to no control over how the robot does the task after the robot has been deployed. To address this, we introduce the idea of online behavior modification, a paradigm in which users have control over behavior features of a robot in real time as it autonomously completes a task using an RL-trained policy. To show the value of this user-centered formulation for human-robot interaction, we present a behavior diversity based algorithm, Adjustable Control Of RL Dynamics (ACORD), and demonstrate its applicability to online behavior modification in simulation and a user study. In the study (n=23) users adjust the style of paintings as a robot traces a shape autonomously. We compare ACORD to RL and Shared Autonomy (SA), and show ACORD affords user-preferred levels of control and expression, comparable to SA, but with the potential for autonomous execution and robustness of RL.

[AI-65] An Effective Information Theoretic Framework for Channel Pruning

Link: https://arxiv.org/abs/2408.16772
Authors: Yihao Chen, Zefang Wang
Keywords (EN): neural networks, Channel pruning, accelerating and compressing, compressing convolutional neural, information
Subjects: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Channel pruning is a promising method for accelerating and compressing convolutional neural networks. However, current pruning algorithms leave two problems unsolved: how to assign layer-wise pruning ratios properly, and how to discard the least important channels with a convincing criterion. In this paper, we present a novel channel pruning approach via information theory and the interpretability of neural networks. Specifically, we regard information entropy as the expected amount of information for convolutional layers. In addition, if we view a matrix as a system of linear equations, a higher-rank matrix indicates that more solutions to it exist, which implies more uncertainty. From the point of view of information theory, the rank can therefore also describe the amount of information. In a neural network, considering the rank and entropy as two information indicators of convolutional layers, we propose a fusion function to reach a compromise between them, where the fusion result is defined as "information concentration". When pre-defining layer-wise pruning ratios, we employ the information concentration as a reference instead of heuristic and engineering tuning to provide a more interpretable solution. Moreover, we leverage Shapley values, a potent tool in the interpretability of neural networks, to evaluate the channel contributions and discard the least important channels for model compression while maintaining its performance. Extensive experiments demonstrate the effectiveness and promising performance of our method. For example, our method improves the accuracy by 0.21% when reducing 45.5% FLOPs and removing 40.3% parameters for ResNet-56 on CIFAR-10. Moreover, our method loses only 0.43%/0.11% in Top-1/Top-5 accuracy while reducing 41.6% FLOPs and removing 35.0% parameters for ResNet-50 on ImageNet.
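The fusion of entropy and rank into an "information concentration" score might look roughly like the following; the specific fusion function (a weighted sum here), the weighting, and the toy feature matrices are assumptions for illustration, not the paper's definitions.

```python
import math

def entropy(values):
    """Shannon entropy of normalized absolute activation magnitudes."""
    total = sum(abs(v) for v in values)
    probs = [abs(v) / total for v in values if v != 0]
    return -sum(p * math.log(p) for p in probs)

def matrix_rank(m, eps=1e-9):
    """Rank via Gaussian elimination (pure Python, for small matrices)."""
    m = [row[:] for row in m]  # work on a copy
    rank, rows, cols = 0, len(m), len(m[0])
    for col in range(cols):
        pivot = next((r for r in range(rank, rows) if abs(m[r][col]) > eps), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rows):
            if r != rank and abs(m[r][col]) > eps:
                f = m[r][col] / m[rank][col]
                m[r] = [a - f * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

def information_concentration(feature_matrix, alpha=0.5):
    """Toy fusion of the two indicators: weighted sum of entropy and rank."""
    flat = [v for row in feature_matrix for v in row]
    return alpha * entropy(flat) + (1 - alpha) * matrix_rank(feature_matrix)

full = [[1.0, 0.0], [0.0, 1.0]]        # rank 2
degenerate = [[1.0, 1.0], [1.0, 1.0]]  # rank 1, redundant channels
print(information_concentration(full) > information_concentration(degenerate))  # True
```

Layers with higher scores would then keep proportionally more channels when pre-defining layer-wise pruning ratios.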

[AI-66] Advancing Multi-talker ASR Performance with Large Language Models

Link: https://arxiv.org/abs/2408.17431
Authors: Mohan Shi, Zengrui Jin, Yaoxun Xu, Yong Xu, Shi-Xiong Zhang, Kun Wei, Yiwen Shao, Chunlei Zhang, Dong Yu
Keywords (EN): Recognizing overlapping speech, automatic speech recognition, Recognizing overlapping, multi-talker ASR, address multi-talker ASR
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
Comments: 8 pages, accepted by IEEE SLT 2024

Abstract:Recognizing overlapping speech from multiple speakers in conversational scenarios is one of the most challenging problem for automatic speech recognition (ASR). Serialized output training (SOT) is a classic method to address multi-talker ASR, with the idea of concatenating transcriptions from multiple speakers according to the emission times of their speech for training. However, SOT-style transcriptions, derived from concatenating multiple related utterances in a conversation, depend significantly on modeling long contexts. Therefore, compared to traditional methods that primarily emphasize encoder performance in attention-based encoder-decoder (AED) architectures, a novel approach utilizing large language models (LLMs) that leverages the capabilities of pre-trained decoders may be better suited for such complex and challenging scenarios. In this paper, we propose an LLM-based SOT approach for multi-talker ASR, leveraging pre-trained speech encoder and LLM, fine-tuning them on multi-talker dataset using appropriate strategies. Experimental results demonstrate that our approach surpasses traditional AED-based methods on the simulated dataset LibriMix and achieves state-of-the-art performance on the evaluation set of the real-world dataset AMI, outperforming the AED model trained with 1000 times more supervised data in previous works.
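The SOT label construction described above (concatenating transcriptions by emission time) can be sketched in a few lines; the speaker-change token name `<sc>` and the tuple format are illustrative assumptions, not the paper's exact implementation.

```python
SC = "<sc>"  # speaker-change token separating utterances (name is illustrative)

def serialize(utterances):
    """Build an SOT-style training target.

    utterances: list of (start_time, speaker, text) tuples; they are
    sorted by emission (start) time and joined with the change token.
    """
    ordered = sorted(utterances, key=lambda u: u[0])
    return f" {SC} ".join(text for _, _, text in ordered)

convo = [
    (3.2, "B", "not yet"),
    (0.5, "A", "did you see the draft"),
    (5.0, "A", "i will send it again"),
]
print(serialize(convo))
# did you see the draft <sc> not yet <sc> i will send it again
```

The resulting single token stream is what makes long-context modeling matter, which is why the paper pairs it with a pre-trained LLM decoder.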

[AI-67] Accelerating the discovery of steady-states of planetary interior dynamics with machine learning

Link: https://arxiv.org/abs/2408.17298
Authors: Siddhant Agarwal, Nicola Tosi, Christian Hüttig, David S. Greenberg, Ali Can Bekar
Keywords (EN): deriving scaling laws, dynamical flow properties, Simulating mantle convection, computationally expensive steady-state, Simulating mantle
Subjects: Fluid Dynamics (physics.flu-dyn); Earth and Planetary Astrophysics (astro-ph.EP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Simulating mantle convection often requires reaching a computationally expensive steady-state, crucial for deriving scaling laws for thermal and dynamical flow properties and benchmarking numerical solutions. The strong temperature dependence of the rheology of mantle rocks causes viscosity variations of several orders of magnitude, leading to a slow-evolving stagnant lid where heat conduction dominates, overlying a rapidly-evolving and strongly convecting region. Time-stepping methods, while effective for fluids with constant viscosity, are hindered by the Courant criterion, which restricts the time step based on the system’s maximum velocity and grid size. Consequently, achieving steady-state requires a large number of time steps due to the disparate time scales governing the stagnant and convecting regions. We present a concept for accelerating mantle convection simulations using machine learning. We generate a dataset of 128 two-dimensional simulations with mixed basal and internal heating, and pressure- and temperature-dependent viscosity. We train a feedforward neural network on 97 simulations to predict steady-state temperature profiles. These can then be used to initialize numerical time stepping methods for different simulation parameters. Compared to typical initializations, the number of time steps required to reach steady-state is reduced by a median factor of 3.75. The benefit of this method lies in requiring very few simulations to train on, providing a solution with no prediction error as we initialize a numerical method, and posing minimal computational overhead at inference time. We demonstrate the effectiveness of our approach and discuss the potential implications for accelerated simulations for advancing mantle convection research. 

[AI-68] Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model

Link: https://arxiv.org/abs/2408.17175
Authors: Zhen Ye, Peiwen Sun, Jiahe Lei, Hongzhan Lin, Xu Tan, Zheqi Dai, Qiuqiang Kong, Jianyi Chen, Jiahao Pan, Qifeng Liu, Yike Guo, Wei Xue
Keywords (EN): Large Language Models, Recent advancements, capabilities of Large, Large Language, Language Models
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)

Abstract:Recent advancements in audio generation have been significantly propelled by the capabilities of Large Language Models (LLMs). The existing research on audio LLM has primarily focused on enhancing the architecture and scale of audio language models, as well as leveraging larger datasets, and generally, acoustic codecs, such as EnCodec, are used for audio tokenization. However, these codecs were originally designed for audio compression, which may lead to suboptimal performance in the context of audio LLM. Our research aims to address the shortcomings of current audio LLM codecs, particularly their challenges in maintaining semantic integrity in generated audio. For instance, existing methods like VALL-E, which condition acoustic token generation on text transcriptions, often suffer from content inaccuracies and elevated word error rates (WER) due to semantic misinterpretations of acoustic tokens, resulting in word skipping and errors. To overcome these issues, we propose a straightforward yet effective approach called X-Codec. X-Codec incorporates semantic features from a pre-trained semantic encoder before the Residual Vector Quantization (RVQ) stage and introduces a semantic reconstruction loss after RVQ. By enhancing the semantic ability of the codec, X-Codec significantly reduces WER in speech synthesis tasks and extends these benefits to non-speech applications, including music and sound generation. Our experiments in text-to-speech, music continuation, and text-to-sound tasks demonstrate that integrating semantic information substantially improves the overall performance of language models in audio generation. Our code and demo are available (Demo: this https URL Code: this https URL)
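Residual Vector Quantization (RVQ), the stage X-Codec builds its semantic loss around, quantizes successive residuals against a sequence of codebooks. Below is a minimal, self-contained sketch with tiny made-up codebooks, not the paper's codec.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest (squared Euclidean) to vec."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(c, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def rvq_encode(codebooks, vec):
    """Encode vec through a cascade of quantizers.

    Each stage quantizes the residual left by the previous stage, so the
    reconstruction error shrinks as stages are added. Returns the code
    indices and the final squared error."""
    residual, codes = list(vec), []
    for cb in codebooks:
        idx = nearest(cb, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, cb[idx])]
    err = sum(r * r for r in residual)
    return codes, err

codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],      # coarse stage
    [[0.0, 0.0], [0.25, -0.25]],   # refinement stage
]
codes, err = rvq_encode(codebooks, [1.2, 0.8])
print(codes, err)  # the two-stage error is small
```

X-Codec's contribution sits around this cascade: semantic features are injected before the RVQ stage and a semantic reconstruction loss is added after it.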

[AI-69] Disease Classification and Impact of Pretrained Deep Convolution Neural Networks on Diverse Medical Imaging Datasets across Imaging Modalities

Link: https://arxiv.org/abs/2408.17011
Authors: Jutika Borah, Kumaresh Sarmah, Hidam Kumarjit Singh
Keywords (EN): Chest X-rays, optical coherence tomography, coherence tomography serve, medical imaging, optical coherence
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 15 pages, 3 figures, 4 tables

Abstract:Imaging techniques such as chest X-rays, whole slide images, and optical coherence tomography serve as the initial screening and detection tools for a wide variety of medical pulmonary and ophthalmic conditions, respectively. This paper investigates the intricacies of using pretrained deep convolutional neural networks with transfer learning across diverse medical imaging datasets with varying modalities for binary and multiclass classification. We conducted a comprehensive performance analysis with ten network architectures and model families, each with pretraining and random initialization. Our findings showed that using pretrained models as fixed feature extractors yields poor performance irrespective of the dataset. In contrast, histopathology microscopy whole slide images achieved better performance. We also found that deeper and more complex architectures did not necessarily result in the best performance. This observation implies that improvements on ImageNet do not run parallel to gains on medical imaging tasks. Within a medical domain, the performance of network architectures varies within model families as datasets shift. This indicates that the performance of models within a specific modality may not be conclusive for another modality within the same domain. This study provides a deeper understanding of the applications of deep learning techniques in medical imaging and highlights the impact of pretrained networks across different medical imaging datasets under five different experimental settings.

[AI-70] Technical Report of HelixFold3 for Biomolecular Structure Prediction

Link: https://arxiv.org/abs/2408.16975
Authors: Lihang Liu, Shanzhuo Zhang, Yang Xue, Xianbin Ye, Kunrui Zhu, Yuxin Li, Yang Liu, Xiaonan Zhang, Xiaomin Fang
Keywords (EN): matching experimental methods, transformed protein structure, experimental methods, AlphaFold series, series has transformed
Subjects: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:The AlphaFold series has transformed protein structure prediction with remarkable accuracy, often matching experimental methods. AlphaFold2, AlphaFold-Multimer, and the latest AlphaFold3 represent significant strides in predicting single protein chains, protein complexes, and biomolecular structures. While AlphaFold2 and AlphaFold-Multimer are open-sourced, facilitating rapid and reliable predictions, AlphaFold3 remains partially accessible through a limited online server and has not been open-sourced, restricting further development. To address these challenges, the PaddleHelix team is developing HelixFold3, aiming to replicate AlphaFold3’s capabilities. Using insights from previous models and extensive datasets, HelixFold3 achieves an accuracy comparable to AlphaFold3 in predicting the structures of conventional ligands, nucleic acids, and proteins. The initial release of HelixFold3 is available as open source on GitHub for academic research, promising to advance biomolecular research and accelerate discoveries. We also provide online service at PaddleHelix website at this https URL.

[AI-71] Uncertainty-aware segmentation for rainfall prediction post processing KDD’24

Link: https://arxiv.org/abs/2408.16792
Authors: Simone Monaco, Luca Monaco, Daniele Apiletti
Keywords (EN): water resource allocation, Accurate precipitation forecasts, Accurate precipitation, agricultural planning, flood management
Subjects: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Paper accepted at the 3rd Workshop on Uncertainty Reasoning and Quantification in Decision Making at ACM SIGKDD’24 (August 26, 2024, Barcelona)

Abstract:Accurate precipitation forecasts are crucial for applications such as flood management, agricultural planning, water resource allocation, and weather warnings. Despite advances in numerical weather prediction (NWP) models, they still exhibit significant biases and uncertainties, especially at high spatial and temporal resolutions. To address these limitations, we explore uncertainty-aware deep learning models for post-processing daily cumulative quantitative precipitation forecasts to obtain forecast uncertainties that lead to a better trade-off between accuracy and reliability. Our study compares different state-of-the-art models, and we propose a variant of the well-known SDE-Net, called SDE U-Net, tailored to segmentation problems like ours. We evaluate its performance for both typical and intense precipitation events. Our results show that all deep learning models significantly outperform the average baseline NWP solution, with our implementation of the SDE U-Net showing the best trade-off between accuracy and reliability. Integrating these models, which account for uncertainty, into operational forecasting systems can improve decision-making and preparedness for weather-related events.

Computer Vision

[CV-0] Bridging Episodes and Semantics: A Novel Framework for Long-Form Video Understanding ECCV’24

Link: https://arxiv.org/abs/2408.17443
Authors: Gueter Josmy Faure, Jia-Fong Yeh, Min-Hung Chen, Hung-Ting Su, Winston H. Hsu, Shang-Hong Lai
Keywords (EN): reflects human cognition, accurately reflects human, extended short videos, treats long-form videos, human cognition
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted to the EVAL-FoMo Workshop at ECCV’24. Project page: this https URL

Abstract:While existing research often treats long-form videos as extended short videos, we propose a novel approach that more accurately reflects human cognition. This paper introduces BREASE: BRidging Episodes And SEmantics for Long-Form Video Understanding, a model that simulates episodic memory accumulation to capture action sequences and reinforces them with semantic knowledge dispersed throughout the video. Our work makes two key contributions: First, we develop an Episodic COmpressor (ECO) that efficiently aggregates crucial representations from micro to semi-macro levels. Second, we propose a Semantics reTRiever (SeTR) that enhances these aggregated representations with semantic information by focusing on the broader context, dramatically reducing feature dimensionality while preserving relevant macro-level information. Extensive experiments demonstrate that BREASE achieves state-of-the-art performance across multiple long video understanding benchmarks in both zero-shot and fully-supervised settings. The project page and code are at: this https URL.

[CV-1] DARES: Depth Anything in Robotic Endoscopic Surgery with Self-supervised Vector-LoRA of the Foundation Model

Link: https://arxiv.org/abs/2408.17433
Authors: Mona Sheikh Zeinoddin, Chiara Lena, Jiongqi Qu, Luca Carlini, Mattia Magro, Seunghoi Kim, Elena De Momi, Sophia Bano, Matthew Grech-Sollars, Evangelos Mazomenos, Daniel C. Alexander, Danail Stoyanov, Matthew J. Clarkson, Mobarakol Islam
Keywords (EN): Robotic-assisted surgery, reconstruction and visualization, relies on accurate, accurate depth estimation, Robotic Endoscopic Surgery
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages

Abstract:Robotic-assisted surgery (RAS) relies on accurate depth estimation for 3D reconstruction and visualization. While foundation models like Depth Anything Models (DAM) show promise, directly applying them to surgery often yields suboptimal results. Fully fine-tuning on limited surgical data can cause overfitting and catastrophic forgetting, compromising model robustness and generalization. Although Low-Rank Adaptation (LoRA) addresses some adaptation issues, its uniform parameter distribution neglects the inherent feature hierarchy, where earlier layers, learning more general features, require more parameters than later ones. To tackle this issue, we introduce Depth Anything in Robotic Endoscopic Surgery (DARES), a novel approach that employs a new adaptation technique, Vector Low-Rank Adaptation (Vector-LoRA) on the DAM V2 to perform self-supervised monocular depth estimation in RAS scenes. To enhance learning efficiency, we introduce Vector-LoRA by integrating more parameters in earlier layers and gradually decreasing parameters in later layers. We also design a reprojection loss based on the multi-scale SSIM error to enhance depth perception by better tailoring the foundation model to the specific requirements of the surgical environment. The proposed method is validated on the SCARED dataset and demonstrates superior performance over recent state-of-the-art self-supervised monocular depth estimation techniques, achieving an improvement of 13.3% in the absolute relative error metric. The code and pre-trained weights are available at this https URL.
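The core idea of Vector-LoRA, giving earlier layers (which learn more general features) larger adaptation ranks that decay with depth, can be illustrated with a toy rank schedule. The linear decay, rank bounds, and layer dimensions below are hypothetical assumptions, not the paper's settings.

```python
def vector_lora_ranks(num_layers, r_max, r_min):
    """Per-layer LoRA ranks decaying linearly from r_max (layer 0) to r_min."""
    if num_layers == 1:
        return [r_max]
    step = (r_max - r_min) / (num_layers - 1)
    return [round(r_max - i * step) for i in range(num_layers)]

def lora_params(rank, d_in, d_out):
    """A LoRA adapter adds two low-rank matrices: (d_in x r) and (r x d_out)."""
    return rank * (d_in + d_out)

ranks = vector_lora_ranks(num_layers=12, r_max=16, r_min=4)
print(ranks)  # monotonically non-increasing: more capacity in early layers
total = sum(lora_params(r, 768, 768) for r in ranks)
print(total)  # total extra parameters for a 12-layer, 768-dim backbone
```

Compared to uniform-rank LoRA with the same budget, this vector of ranks shifts trainable capacity toward the early, general-feature layers, which is the allocation the abstract argues for.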

[CV-2] CinePreGen: Camera Controllable Video Previsualization via Engine-powered Diffusion

Link: https://arxiv.org/abs/2408.17424
Authors: Yiran Chen, Anyi Rao, Xuekun Jiang, Shishi Xiao, Ruiqing Ma, Zeyu Wang, Hui Xiong, Bo Dai
Keywords (EN): generative AI models, techniques to enhance, enhance video previsualization, SORA, camera
Subjects: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

Abstract:With advancements in video generative AI models (e.g., SORA), creators are increasingly using these techniques to enhance video previsualization. However, they face challenges with incomplete and mismatched AI workflows. Existing methods mainly rely on text descriptions and struggle with camera placement, a key component of previsualization. To address these issues, we introduce CinePreGen, a visual previsualization system enhanced with engine-powered diffusion. It features a novel camera and storyboard interface that offers dynamic control, from global to local camera adjustments. This is combined with a user-friendly AI rendering workflow, which aims to achieve consistent results through multi-masked IP-Adapter and engine simulation guidelines. In our comprehensive evaluation study, we demonstrate that our system reduces development viscosity (i.e., the complexity and challenges in the development process), meets users’ needs for extensive control and iteration in the design process, and outperforms other AI video production workflows in cinematic camera movement, as shown by our experiments and a within-subjects user study. With its intuitive camera controls and realistic rendering of camera motion, CinePreGen shows great potential for improving video production for both individual creators and industry professionals.

[CV-3] Open-vocabulary Temporal Action Localization using VLMs DATE

Link: https://arxiv.org/abs/2408.17422
Authors: Naoki Wake, Atsushi Kanehira, Kazuhiro Sasabuchi, Jun Takamatsu, Katsushi Ikeuchi
Keywords (EN): action localization aims, localization aims, aims to find, find timings, Video action localization
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 7 pages, 5 figures, 4 tables. Last updated on August 30th, 2024

Abstract:Video action localization aims to find timings of a specific action from a long video. Although existing learning-based approaches have been successful, those require annotating videos that come with a considerable labor cost. This paper proposes a learning-free, open-vocabulary approach based on emerging vision-language models (VLM). The challenge stems from the fact that VLMs are neither designed to process long videos nor tailored for finding actions. We overcome these problems by extending an iterative visual prompting technique. Specifically, we sample video frames into a concatenated image with frame index labels, making a VLM guess a frame that is considered to be closest to the start/end of the action. Iterating this process by narrowing a sampling time window results in finding a specific frame of start and end of an action. We demonstrate that this sampling technique yields reasonable results, illustrating a practical extension of VLMs for understanding videos.
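The iterative narrowing loop the abstract describes can be sketched with the VLM replaced by a mock oracle: each round, frames are sampled from the current window, the "VLM" picks the frame closest to the action start, and the window shrinks around that pick. Function names and sampling parameters are illustrative.

```python
def mock_vlm_pick(frame_times, true_start):
    """Stand-in for the VLM: index of the sampled frame whose timestamp
    is closest to the true action start (in reality the VLM guesses this
    from a concatenated image of index-labeled frames)."""
    return min(range(len(frame_times)), key=lambda i: abs(frame_times[i] - true_start))

def localize_start(lo, hi, true_start, num_frames=8, rounds=6):
    """Iteratively narrow [lo, hi] around the VLM's pick."""
    for _ in range(rounds):
        step = (hi - lo) / (num_frames - 1)
        frames = [lo + i * step for i in range(num_frames)]
        pick = mock_vlm_pick(frames, true_start)
        # keep only the neighborhood of the chosen frame
        lo = frames[max(pick - 1, 0)]
        hi = frames[min(pick + 1, num_frames - 1)]
    return (lo + hi) / 2

est = localize_start(0.0, 120.0, true_start=47.3)
print(round(est, 2))  # converges close to 47.3
```

Because the window shrinks geometrically each round, a handful of VLM queries suffices to localize a timestamp in a long video, which is what makes the approach learning-free yet practical.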

[CV-4] How Knowledge Distillation Mitigates the Synthetic Gap in Fair Face Recognition ECCV2024

Link: https://arxiv.org/abs/2408.17399
Authors: Pedro C. Neto, Ivona Colakovic, Sašo Karakatič, Ana F. Sequeira
Keywords (EN): Leveraging the capabilities, Knowledge Distillation, devise a strategy, strategy to fight, fight the recent
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ECCV 2024 Workshops

Abstract:Leveraging the capabilities of Knowledge Distillation (KD) strategies, we devise a strategy to fight the recent retraction of face recognition datasets. Given a pretrained Teacher model trained on a real dataset, we show that carefully utilising synthetic datasets, or a mix between real and synthetic datasets to distil knowledge from this teacher to smaller students can yield surprising results. In this sense, we trained 33 different models with and without KD, on different datasets, with different architectures and losses. And our findings are consistent, using KD leads to performance gains across all ethnicities and decreased bias. In addition, it helps to mitigate the performance gap between real and synthetic datasets. This approach addresses the limitations of synthetic data training, improving both the accuracy and fairness of face recognition models.
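The KD objective the paper builds on trains the student to match the teacher's softened class distribution (KL divergence) alongside the usual hard-label loss. The temperature and weighting below are generic defaults, not the paper's configuration.

```python
import math

def softmax(logits, temperature=1.0):
    exps = [math.exp(l / temperature) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kd_loss(student_logits, teacher_logits, label, temperature=4.0, alpha=0.7):
    """Standard KD loss: alpha * T^2 * KL(teacher || student) + (1 - alpha) * CE."""
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    kl = sum(p * math.log(p / q) for p, q in zip(t, s))  # distillation term
    ce = -math.log(softmax(student_logits)[label])       # hard-label term
    return alpha * (temperature ** 2) * kl + (1 - alpha) * ce

teacher = [4.0, 1.0, -1.0]
aligned = [3.8, 1.1, -0.9]      # student close to the teacher
misaligned = [-1.0, 4.0, 1.0]   # student contradicting the teacher
print(kd_loss(aligned, teacher, label=0) < kd_loss(misaligned, teacher, label=0))  # True
```

The teacher here would be the model pretrained on the (now retracted) real dataset, while the student trains on synthetic or mixed data, so the distillation term is what transfers knowledge of the real distribution.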

[CV-5] Look Learn and Leverage (L3): Mitigating Visual-Domain Shift and Discovering Intrinsic Relations via Symbolic Alignment

Link: https://arxiv.org/abs/2408.17363
Authors: Hanchen Xie, Jiageng Zhu, Mahyar Khayatkhoei, Jiazhi Li, Wael AbdAlmageed
Keywords (EN): Disentangled Representation Learning, Causal Representation Learning, Visual Question Answering, Modern deep learning, Disentangled Representation
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 9 figures, 6 tables

Abstract:Modern deep learning models have demonstrated outstanding performance on discovering the underlying mechanisms when both visual appearance and intrinsic relations (e.g., causal structure) data are sufficient, such as Disentangled Representation Learning (DRL), Causal Representation Learning (CRL) and Visual Question Answering (VQA) methods. However, the generalization ability of these models is challenged when the visual domain shifts and the relations data is absent during finetuning. To address this challenge, we propose a novel learning framework, Look, Learn and Leverage (L^3), which decomposes the learning process into three distinct phases and systematically utilizes the class-agnostic segmentation masks as the common symbolic space to align visual domains. Thus, a relations discovery model can be trained on the source domain, and when the visual domain shifts and the intrinsic relations are absent, the pretrained relations discovery model can be directly reused while maintaining satisfactory performance. Extensive performance evaluations are conducted on three different tasks: DRL, CRL and VQA, showing outstanding results on all three tasks, which reveals the advantages of L^3.

[CV-6] LSMS: Language-guided Scale-aware MedSegmentor for Medical Image Referring Segmentation

链接: https://arxiv.org/abs/2408.17347
作者: Shuyi Ouyang,Jinyang Zhang,Xiangye Lin,Xilai Wang,Qingqing Chen,Yen-Wei Chen,Lanfen Lin
关键词-EN: medical image segmentation, Conventional medical image, medical image, Medical Image Referring, Image Referring Segmentation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 14 pages, 5 figures

点击查看摘要

Abstract:Conventional medical image segmentation methods have been found inadequate in facilitating physicians with the identification of specific lesions for diagnosis and treatment. Given the utility of text as an instructional format, we introduce a novel task termed Medical Image Referring Segmentation (MIRS), which requires segmenting specified lesions in images based on the given language expressions. Due to the varying object scales in medical images, MIRS demands robust vision-language modeling and comprehensive multi-scale interaction for precise localization and segmentation under linguistic guidance. However, existing medical image segmentation methods fall short in meeting these demands, resulting in insufficient segmentation accuracy. In response, we propose an approach named Language-guided Scale-aware MedSegmentor (LSMS), incorporating two appealing designs: (1) a Scale-aware Vision-Language Attention module that leverages diverse convolutional kernels to acquire rich visual knowledge and interact closely with linguistic features, thereby enhancing lesion localization capability; (2) a Full-Scale Decoder that globally models multi-modal features across various scales, capturing complementary information between scales to accurately outline lesion boundaries. Addressing the lack of suitable datasets for MIRS, we constructed a vision-language medical dataset called Reference Hepatic Lesion Segmentation (RefHL-Seg). This dataset comprises 2,283 abdominal CT slices from 231 cases, with corresponding textual annotations and segmentation masks for various liver lesions in images. We validated the performance of LSMS for MIRS and conventional medical image segmentation tasks across various datasets. Our LSMS consistently outperforms existing methods on all datasets with lower computational costs. The code and datasets will be released.

[CV-7] Enhancing Underwater Imaging with 4-D Light Fields: Dataset and Method

链接: https://arxiv.org/abs/2408.17339
作者: Yuji Lin,Xianqiang Lyu,Junhui Hou,Qian Zhao,Deyu Meng
关键词-EN: light fields, light absorption, underwater imaging plagued, enhance underwater imaging, underwater imaging
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注: 14 pages, 14 figures

点击查看摘要

Abstract:In this paper, we delve into the realm of 4-D light fields (LFs) to enhance underwater imaging plagued by light absorption, scattering, and other challenges. Contrasting with conventional 2-D RGB imaging, 4-D LF imaging excels in capturing scenes from multiple perspectives, thereby indirectly embedding geometric information. This intrinsic property is anticipated to effectively address the challenges associated with underwater imaging. By leveraging both explicit and implicit depth cues present in 4-D LF images, we propose a progressive, mutually reinforcing framework for underwater 4-D LF image enhancement and depth estimation. Specifically, our framework explicitly utilizes estimated depth information alongside implicit depth-related dynamic convolutional kernels to modulate output features. The entire framework decomposes this complex task, iteratively optimizing the enhanced image and depth information to progressively achieve optimal enhancement results. More importantly, we construct the first 4-D LF-based underwater image dataset for quantitative evaluation and supervised training of learning-based methods, comprising 75 underwater scenes and 3675 high-resolution 2K pairs. To craft vibrant and varied underwater scenes, we build underwater environments with various objects and adopt several types of degradation. Through extensive experimentation, we showcase the potential and superiority of 4-D LF-based underwater imaging vis-a-vis traditional 2-D RGB-based approaches. Moreover, our method effectively corrects color bias and achieves state-of-the-art performance. The dataset and code will be publicly available at this https URL.

[CV-8] Evaluating Reliability in Medical DNNs: A Critical Analysis of Feature and Confidence-Based OOD Detection MICCAI2023

链接: https://arxiv.org/abs/2408.17337
作者: Harry Anthony,Konstantinos Kamnitsas
关键词-EN: deep neural networks, OOD, prevent erroneous predictions, medical image analysis, image analysis requires
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted for the Uncertainty for Safe Utilization of Machine Learning in Medical Imaging (UNSURE 2024) workshop at MICCAI 2023

点击查看摘要

Abstract:Reliable use of deep neural networks (DNNs) for medical image analysis requires methods to identify inputs that differ significantly from the training data, called out-of-distribution (OOD), to prevent erroneous predictions. OOD detection methods can be categorised as either confidence-based (using the model’s output layer for OOD detection) or feature-based (not using the output layer). We created two new OOD benchmarks by dividing the D7P (dermatology) and BreastMNIST (ultrasound) datasets into subsets which either contain or don’t contain an artefact (rulers or annotations respectively). Models were trained with artefact-free images, and images with the artefacts were used as OOD test sets. For each OOD image, we created a counterfactual by manually removing the artefact via image processing, to assess the artefact’s impact on the model’s predictions. We show that OOD artefacts can boost a model’s softmax confidence in its predictions, due to correlations in training data among other factors. This contradicts the common assumption that OOD artefacts should lead to more uncertain outputs, an assumption on which most confidence-based methods rely. We use this to explain why feature-based methods (e.g. Mahalanobis score) typically have greater OOD detection performance than confidence-based methods (e.g. MCP). However, we also show that feature-based methods typically perform worse at distinguishing between inputs that lead to correct and incorrect predictions (for both OOD and ID data). Following from these insights, we argue that a combination of feature-based and confidence-based methods should be used within DNN pipelines to mitigate their respective weaknesses. The project’s code and OOD benchmarks are available at: this https URL.
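To make the two families the abstract contrasts concrete, here is a minimal numpy sketch of both scores: MCP (confidence-based, read off the softmax output) and a Mahalanobis score (feature-based, distance of a feature vector to the training feature distribution). This is an illustrative sketch, not the paper's implementation; the original Mahalanobis detector uses per-class Gaussians with a tied covariance, which is collapsed into a single Gaussian here.

```python
import numpy as np

def mcp_score(logits):
    """Maximum class probability: confidence-based OOD score
    (higher = judged more in-distribution)."""
    z = logits - logits.max(axis=1, keepdims=True)  # stable softmax
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return p.max(axis=1)

def mahalanobis_score(feats, train_feats):
    """Feature-based OOD score: negative squared Mahalanobis distance
    to the training feature distribution (higher = more in-distribution)."""
    mu = train_feats.mean(axis=0)
    cov = np.cov(train_feats, rowvar=False) + 1e-6 * np.eye(train_feats.shape[1])
    inv = np.linalg.inv(cov)
    d = feats - mu
    return -np.einsum('ij,jk,ik->i', d, inv, d)
```

The paper's point is that an OOD artefact can inflate `mcp_score` while still moving the feature vector far from the training distribution, which only the feature-based score detects.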

[CV-9] Investigating Neuron Ablation in Attention Heads: The Case for Peak Activation Centering

链接: https://arxiv.org/abs/2408.17322
作者: Nicholas Pochinkov,Ben Pasero,Skylar Shibayama
关键词-EN: rapidly throughout society, growing rapidly, ablation, Abstract, models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注: 9 pages, 2 figures, XAI World Conference 2024 Late-Breaking Work

点击查看摘要

Abstract:The use of transformer-based models is growing rapidly throughout society. With this growth, it is important to understand how they work, and in particular, how the attention mechanisms represent concepts. Though there are many interpretability methods, many look at models through their neuronal activations, which are poorly understood. We describe different lenses through which to view neuron activations, and investigate the effectiveness in language models and vision transformers through various methods of neural ablation: zero ablation, mean ablation, activation resampling, and a novel approach we term ‘peak ablation’. Through experimental analysis, we find that in different regimes and models, each method can offer the lowest degradation of model performance compared to other methods, with resampling usually causing the most significant performance deterioration. We make our code available at this https URL.
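The four ablation variants compared in the abstract can be illustrated on a (batch, neurons) activation matrix. A hedged sketch, not the authors' code: in particular, the paper's "peak ablation" is approximated here by the midpoint of the most populated histogram bin, and the exact estimator may differ.

```python
import numpy as np

def ablate(acts, neuron, method, dataset_acts=None):
    """Ablate one neuron across a batch of activations (batch, n_neurons).

    'zero': set the neuron to 0; 'mean': set it to its dataset mean;
    'resample': replace it with activations drawn from other inputs;
    'peak': set it to the peak (mode) of its activation distribution,
    approximated by the densest histogram bin.
    """
    out = acts.copy()
    col = dataset_acts[:, neuron] if dataset_acts is not None else acts[:, neuron]
    if method == 'zero':
        out[:, neuron] = 0.0
    elif method == 'mean':
        out[:, neuron] = col.mean()
    elif method == 'resample':
        rng = np.random.default_rng(0)
        out[:, neuron] = rng.choice(col, size=acts.shape[0])
    elif method == 'peak':
        hist, edges = np.histogram(col, bins=32)
        b = int(hist.argmax())
        out[:, neuron] = 0.5 * (edges[b] + edges[b + 1])
    return out
```

Running a model on `ablate(...)`-modified activations and measuring the performance drop is the degradation comparison the paper performs across regimes.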

[CV-10] Structuring a Training Strategy to Robustify Perception Models with Realistic Image Augmentations

链接: https://arxiv.org/abs/2408.17311
作者: Ahmed Hammam,Bharathwaj Krishnaswami Sreedhar,Nura Kawa,Tim Patzelt,Oliver De Candido
关键词-EN: Operational Design Domains, challenging Operational Design, Design Domains, Operational Design, Advancing Machine Learning
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Advancing Machine Learning (ML)-based perception models for autonomous systems necessitates addressing weak spots within the models, particularly in challenging Operational Design Domains (ODDs). These are environmental operating conditions of an autonomous vehicle which can contain difficult conditions, e.g., lens flare at night or objects reflected in a wet street. This report introduces a novel methodology for training with augmentations to enhance model robustness and performance in such conditions. The proposed approach leverages customized physics-based augmentation functions, to generate realistic training data that simulates diverse ODD scenarios. We present a comprehensive framework that includes identifying weak spots in ML models, selecting suitable augmentations, and devising effective training strategies. The methodology integrates hyperparameter optimization and latent space optimization to fine-tune augmentation parameters, ensuring they maximally improve the ML models’ performance. Experimental results demonstrate improvements in model performance, as measured by commonly used metrics such as mean Average Precision (mAP) and mean Intersection over Union (mIoU) on open-source object detection and semantic segmentation models and datasets. Our findings emphasize that optimal training strategies are model- and data-specific and highlight the benefits of integrating augmentations into the training pipeline. By incorporating augmentations, we observe enhanced robustness of ML-based perception models, making them more resilient to edge cases encountered in real-world ODDs. This work underlines the importance of customized augmentations and offers an effective solution for improving the safety and reliability of autonomous driving functions. 

[CV-11] BOP-D: Revisiting 6D Pose Estimation Benchmark for Better Evaluation under Visual Ambiguities

链接: https://arxiv.org/abs/2408.17297
作者: Boris Meden,Asma Brazi,Steve Bourgeois,Fabrice Mayran de Chamisso,Vincent Lepetit
关键词-EN: global object symmetries, visual ambiguities, related to global, pose estimation methods, object symmetries
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Currently, 6D pose estimation methods are benchmarked on datasets that consider, for their ground truth annotations, visual ambiguities as only related to global object symmetries. However, as previously observed [26], visual ambiguities can also happen depending on the viewpoint or the presence of occluding objects, when disambiguating parts become hidden. The visual ambiguities are therefore actually different across images. We thus first propose an automatic method to re-annotate those datasets with a 6D pose distribution specific to each image, taking into account the visibility of the object surface in the image to correctly determine the visual ambiguities. Given this improved ground truth, we re-evaluate the state-of-the-art methods and show that this greatly modifies their ranking. Our annotations also allow us to benchmark recent methods able to estimate a pose distribution on real images for the first time. We will make our annotations for the T-LESS dataset and our code publicly available.

[CV-12] DCUDF2: Improving Efficiency and Accuracy in Extracting Zero Level Sets from Unsigned Distance Fields

链接: https://arxiv.org/abs/2408.17284
作者: Xuhui Chen,Fugang Yu,Fei Hou,Wencheng Wang,Zhebin Zhang,Ying He
关键词-EN: Unsigned distance fields, poses significant challenges, fields poses significant, Unsigned distance, fine geometric details
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Unsigned distance fields (UDFs) allow for the representation of models with complex topologies, but extracting accurate zero level sets from these fields poses significant challenges, particularly in preserving topological accuracy and capturing fine geometric details. To overcome these issues, we introduce DCUDF2, an enhancement over DCUDF–the current state-of-the-art method–for extracting zero level sets from UDFs. Our approach utilizes an accuracy-aware loss function, enhanced with self-adaptive weights, to improve geometric quality significantly. We also propose a topology correction strategy that reduces the dependence on hyper-parameters, increasing the robustness of our method. Furthermore, we develop new operations leveraging self-adaptive weights to boost runtime efficiency. Extensive experiments on surface extraction across diverse datasets demonstrate that DCUDF2 outperforms DCUDF and existing methods in both geometric fidelity and topological accuracy. We will make the source code publicly available.

[CV-13] UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios

链接: https://arxiv.org/abs/2408.17267
作者: Baichuan Zhou,Haote Yang,Dairong Chen,Junyan Ye,Tianyi Bai,Jinhua Yu,Songyang Zhang,Dahua Lin,Conghui He,Weijia Li
关键词-EN: Large Multimodal Models, Multimodal Models, Large Multimodal, Recent evaluations, benchmarks specifically focusing
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 9 pages, 6 figures

点击查看摘要

Abstract:Recent evaluations of Large Multimodal Models (LMMs) have explored their capabilities in various domains, with only few benchmarks specifically focusing on urban environments. Moreover, existing urban benchmarks have been limited to evaluating LMMs with basic region-level urban tasks under singular views, leading to incomplete evaluations of LMMs’ abilities in urban environments. To address these issues, we present UrBench, a comprehensive benchmark designed for evaluating LMMs in complex multi-view urban scenarios. UrBench contains 11.6K meticulously curated questions at both region-level and role-level that cover 4 task dimensions: Geo-Localization, Scene Reasoning, Scene Understanding, and Object Understanding, totaling 14 task types. In constructing UrBench, we utilize data from existing datasets and additionally collect data from 11 cities, creating new annotations using a cross-view detection-matching method. With these images and annotations, we then integrate LMM-based, rule-based, and human-based methods to construct large-scale high-quality questions. Our evaluations on 21 LMMs show that current LMMs struggle in the urban environments in several aspects. Even the best performing GPT-4o lags behind humans in most tasks, ranging from simple tasks such as counting to complex tasks such as orientation, localization and object attribute recognition, with an average performance gap of 17.4%. Our benchmark also reveals that LMMs exhibit inconsistent behaviors with different urban views, especially with respect to understanding cross-view relations. UrBench datasets and benchmark results will be publicly available at this https URL.

[CV-14] VisionTS: Visual Masked Autoencoders Are Free-Lunch Zero-Shot Time Series Forecasters

链接: https://arxiv.org/abs/2408.17253
作者: Mouxiang Chen,Lefei Shen,Zhuo Li,Xiaoyun Joy Wang,Jianling Sun,Chenghao Liu
关键词-EN: TSF foundation models, TSF foundation, develop TSF foundation, Foundation models, TSF
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 26 pages, 11 figures

点击查看摘要

Abstract:Foundation models have emerged as a promising approach in time series forecasting (TSF). Existing approaches either fine-tune large language models (LLMs) or build large-scale time-series datasets to develop TSF foundation models. However, these methods face challenges due to the severe cross-domain gap or in-domain heterogeneity. In this paper, we explore a new road to building a TSF foundation model from rich and high-quality natural images, based on the intrinsic similarities between images and time series. To bridge the gap between the two domains, we reformulate the TSF task as an image reconstruction task, which is further processed by a visual masked autoencoder (MAE) self-supervised pre-trained on the ImageNet dataset. Surprisingly, without further adaptation in the time-series domain, the proposed VisionTS could achieve superior zero-shot forecasting performance compared to existing TSF foundation models. With minimal fine-tuning, VisionTS could further improve the forecasting and achieve state-of-the-art performance in most cases. These findings suggest that visual models could be a free lunch for TSF and highlight the potential for future cross-domain research between computer vision and TSF. Our code is publicly available at this https URL.
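The core reformulation, folding a 1-D series into a 2-D "image" whose masked right-hand columns the MAE reconstructs as the forecast, can be sketched as follows. The period-based folding and column masking below are simplifying assumptions for illustration; the actual VisionTS preprocessing and patch layout may differ.

```python
import numpy as np

def series_to_image(series, period):
    """Fold a 1-D series into a 2-D (period x n_cycles) grayscale 'image',
    normalised to [0, 1], so an image model can process it. Each column is
    one cycle; each row is one phase within the period."""
    n = (len(series) // period) * period
    img = series[:n].reshape(-1, period).T
    lo, hi = img.min(), img.max()
    return (img - lo) / (hi - lo + 1e-8)

def mask_future(img, n_future_cols):
    """Zero out (mask) the right-most columns; a masked autoencoder would
    reconstruct these columns, yielding the forecast."""
    masked = img.copy()
    masked[:, -n_future_cols:] = 0.0
    return masked
```

Feeding `mask_future(series_to_image(x, p), k)` to an ImageNet-pretrained MAE and reading back the reconstructed columns is the zero-shot forecasting recipe the abstract describes, with no time-series training at all.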

[CV-15] Abstracted Gaussian Prototypes for One-Shot Concept Learning

链接: https://arxiv.org/abs/2408.17251
作者: Chelsea Zou,Kenneth J. Kurtz
关键词-EN: encode higher-level representations, Gaussian Mixture Model, cluster-based generative image, generative image segmentation, image segmentation framework
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We introduce a cluster-based generative image segmentation framework to encode higher-level representations of visual concepts based on one-shot learning inspired by the Omniglot Challenge. The inferred parameters of each component of a Gaussian Mixture Model (GMM) represent a distinct topological subpart of a visual concept. Sampling new data from these parameters generates augmented subparts to build a more robust prototype for each concept, i.e., the Abstracted Gaussian Prototype (AGP). This framework addresses one-shot classification tasks using a cognitively-inspired similarity metric and addresses one-shot generative tasks through a novel AGP-VAE pipeline employing variational autoencoders (VAEs) to generate new class variants. Results from human judges reveal that the generative pipeline produces novel examples and classes of visual concepts that are broadly indistinguishable from those made by humans. The proposed framework leads to impressive but not state-of-the-art classification accuracy; thus, the contribution is two-fold: 1) the system is uniquely low in theoretical and computational complexity and operates in a completely standalone manner, while existing approaches draw heavily on pre-training or knowledge engineering; and 2) in contrast with competing neural network models, the AGP approach addresses the importance of breadth of task capability emphasized in the Omniglot challenge (i.e., successful performance on generative tasks). These two points are critical as we advance toward an understanding of how learning/reasoning systems can produce viable, robust, and flexible concepts based on literally nothing more than a single example.
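The idea of fitting mixture components to a single example and sampling augmented subparts can be sketched on 2-D stroke coordinates. This is a crude, hypothetical stand-in for the authors' pipeline: component parameters come from k-means assignments rather than full EM, and each Gaussian component is then sampled to generate augmented subpart points.

```python
import numpy as np

def fit_components(points, k=3, iters=10, seed=0):
    """Fit k Gaussian 'subpart' components to 2-D points: k-means
    assignments, then per-cluster mean/covariance (EM is hedged away)."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        d = ((points[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = points[labels == j].mean(0)
    comps = []
    for j in range(k):
        pts = points[labels == j]
        if len(pts) < 2:  # skip degenerate clusters
            continue
        comps.append((pts.mean(0), np.cov(pts, rowvar=False) + 1e-6 * np.eye(2)))
    return comps

def sample_prototype(comps, n_per=20, seed=1):
    """Sample augmented subpart points from each Gaussian component,
    forming a more robust 'abstracted' prototype of the concept."""
    rng = np.random.default_rng(seed)
    return np.vstack([rng.multivariate_normal(mu, cov, n_per) for mu, cov in comps])
```

The sampled point cloud plays the role of the augmented prototype against which new one-shot examples are compared.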

[CV-16] CondSeg: Ellipse Estimation of Pupil and Iris via Conditioned Segmentation

链接: https://arxiv.org/abs/2408.17231
作者: Zhuang Jia,Jiangfan Deng,Liying Chi,Xiang Long,Daniel K. Du
关键词-EN: pupil, iris, full pupil, Parsing of eye, gaze estimation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Parsing of eye components (i.e. pupil, iris and sclera) is fundamental for eye tracking and gaze estimation in AR/VR products. Mainstream approaches tackle this problem as a multi-class segmentation task, providing only the visible part of the pupil/iris, while other methods regress elliptical parameters using human-annotated full pupil/iris parameters. In this paper, we consider two priors: the projected full pupil/iris circle can be modelled with an ellipse (ellipse prior), and the visibility of the pupil/iris is controlled by the openness of the eye region (condition prior). We design a novel method, CondSeg, to estimate elliptical parameters of the pupil/iris directly from segmentation labels, without explicitly annotating full ellipses, and use an eye-region mask to control the visibility of the estimated pupil/iris ellipses. A conditioned segmentation loss is used to optimize the parameters by transforming parameterized ellipses into pixel-wise soft masks in a differentiable way. Our method is tested on public datasets (OpenEDS-2019/-2020) and shows competitive results on segmentation metrics, while simultaneously providing accurate elliptical parameters for further eye-tracking applications.
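The key differentiable step, turning ellipse parameters into a pixel-wise soft mask, can be sketched directly: a sigmoid of the signed ellipse equation gives a mask that varies smoothly with the parameters. Numpy sketch for illustration only; the paper presumably implements this in an autodiff framework, and the sharpness parameter `tau` is an assumption.

```python
import numpy as np

def soft_ellipse_mask(h, w, cx, cy, a, b, theta, tau=2.0):
    """Render ellipse parameters (center cx,cy; semi-axes a,b; rotation
    theta) as an (h, w) soft mask via a sigmoid of the signed ellipse
    equation, keeping the mapping differentiable w.r.t. the parameters."""
    ys, xs = np.mgrid[0:h, 0:w]
    dx, dy = xs - cx, ys - cy
    # Rotate pixel coordinates into the ellipse frame.
    u = dx * np.cos(theta) + dy * np.sin(theta)
    v = -dx * np.sin(theta) + dy * np.cos(theta)
    r = (u / a) ** 2 + (v / b) ** 2  # < 1 inside, > 1 outside
    return 1.0 / (1.0 + np.exp(tau * (r - 1.0)))
```

A segmentation loss between this soft mask (gated by the eye-region mask) and the pixel labels can then back-propagate into the ellipse parameters themselves.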

[CV-17] OG-Mapping: Octree-based Structured 3D Gaussians for Online Dense Mapping

链接: https://arxiv.org/abs/2408.17223
作者: Meng Wang,Junyi Wang,Changqun Xia,Chen Wang,Yue Qi
关键词-EN: recently demonstrated promising, demonstrated promising advancements, Gaussian splatting, recently demonstrated, demonstrated promising
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:3D Gaussian splatting (3DGS) has recently demonstrated promising advancements in RGB-D online dense mapping. Nevertheless, existing methods excessively rely on per-pixel depth cues to perform map densification, which leads to significant redundancy and increased sensitivity to depth noise. Additionally, explicitly storing 3D Gaussian parameters of room-scale scene poses a significant storage challenge. In this paper, we introduce OG-Mapping, which leverages the robust scene structural representation capability of sparse octrees, combined with structured 3D Gaussian representations, to achieve efficient and robust online dense mapping. Moreover, OG-Mapping employs an anchor-based progressive map refinement strategy to recover the scene structures at multiple levels of detail. Instead of maintaining a small number of active keyframes with a fixed keyframe window as previous approaches do, a dynamic keyframe window is employed to allow OG-Mapping to better tackle false local minima and forgetting issues. Experimental results demonstrate that OG-Mapping delivers more robust and superior realism mapping results than existing Gaussian-based RGB-D online mapping methods with a compact model, and no additional post-processing is required.

[CV-18] How Could Generative AI Support Compliance with the EU AI Act? A Review for Safe Automated Driving Perception

链接: https://arxiv.org/abs/2408.17222
作者: Mert Keser,Youssef Shoeb,Alois Knoll
关键词-EN: Deep Neural Networks, Deep Neural, Neural Networks, substantially enhancing, interpret the environment
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Deep Neural Networks (DNNs) have become central for the perception functions of autonomous vehicles, substantially enhancing their ability to understand and interpret the environment. However, these systems exhibit inherent limitations such as brittleness, opacity, and unpredictable behavior in out-of-distribution scenarios. The European Union (EU) Artificial Intelligence (AI) Act, as a pioneering legislative framework, aims to address these challenges by establishing stringent norms and standards for AI systems, including those used in autonomous driving (AD), which are categorized as high-risk AI. In this work, we explore how the newly available generative AI models can potentially support addressing upcoming regulatory requirements in AD perception, particularly with respect to safety. This short review paper summarizes the requirements arising from the EU AI Act regarding DNN-based perception systems and systematically categorizes existing generative AI applications in AD. While generative AI models show promise in addressing some of the EU AI Act's requirements, such as transparency and robustness, this review examines their potential benefits and discusses how developers could leverage these methods to enhance compliance with the Act. The paper also highlights areas where further research is needed to ensure reliable and safe integration of these technologies.

[CV-19] NanoMVG: USV-Centric Low-Power Multi-Task Visual Grounding based on Prompt-Guided Camera and 4D mmWave Radar

链接: https://arxiv.org/abs/2408.17207
作者: Runwei Guan,Jianan Liu,Liye Jia,Haocheng Zhao,Shanliang Yao,Xiaohui Zhu,Ka Lok Man,Eng Gee Lim,Jeremy Smith,Yutao Yue
关键词-EN: Unmanned Surface Vehicles, Surface Vehicles, Unmanned Surface, autonomous driving systems, terrestrial autonomous driving
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: 8 pages, 6 figures

点击查看摘要

Abstract:Recently, visual grounding and multi-sensor setups have been incorporated into perception systems for terrestrial autonomous driving and Unmanned Surface Vehicles (USVs), yet the high complexity of modern learning-based visual grounding models using multiple sensors prevents such models from being deployed on USVs in real life. To this end, we design a low-power multi-task model named NanoMVG for waterway embodied perception, guiding both a camera and a 4D millimeter-wave radar to locate specific object(s) through natural language. NanoMVG can perform both box-level and mask-level visual grounding tasks simultaneously. Compared to other visual grounding models, NanoMVG achieves highly competitive performance on the WaterVG dataset, particularly in harsh environments, and boasts ultra-low power consumption for long endurance.

[CV-20] Covariance-corrected Whitening Alleviates Network Degeneration on Imbalanced Classification

链接: https://arxiv.org/abs/2408.17197
作者: Zhiwei Zhang
关键词-EN: deep recognition models, critical issue, issue in image, image classification, classification that significantly
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 20 pages, 10 figures, 10 tables. arXiv admin note: text overlap with arXiv:2112.05958

点击查看摘要

Abstract:Class imbalance is a critical issue in image classification that significantly affects the performance of deep recognition models. In this work, we first identify a network degeneration dilemma that hinders the model learning by introducing a high linear dependence among the features inputted into the classifier. To overcome this challenge, we propose a novel framework called Whitening-Net to mitigate the degenerate solutions, in which ZCA whitening is integrated before the linear classifier to normalize and decorrelate the batch samples. However, in scenarios with extreme class imbalance, the batch covariance statistic exhibits significant fluctuations, impeding the convergence of the whitening operation. Therefore, we propose two covariance-corrected modules, the Group-based Relatively Balanced Batch Sampler (GRBS) and the Batch Embedded Training (BET), to get more accurate and stable batch covariance, thereby reinforcing the capability of whitening. Our modules can be trained end-to-end without incurring substantial computational costs. Comprehensive empirical evaluations conducted on benchmark datasets, including CIFAR-LT-10/100, ImageNet-LT, and iNaturalist-LT, validate the effectiveness of our proposed approaches.
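ZCA whitening itself is compact: eigendecompose the batch covariance, rescale each direction, and rotate back so the features are decorrelated while staying close to their original coordinates. A minimal sketch of the whitening step alone, without the paper's GRBS/BET covariance corrections:

```python
import numpy as np

def zca_whiten(x, eps=1e-5):
    """ZCA-whiten a batch of features (batch, dim): zero-mean the batch,
    then apply W = V diag(1/sqrt(vals+eps)) V^T so the batch covariance
    becomes ~identity; unlike PCA whitening, ZCA stays in the original
    coordinate system."""
    xc = x - x.mean(axis=0, keepdims=True)
    cov = xc.T @ xc / (len(x) - 1)
    vals, vecs = np.linalg.eigh(cov)
    w = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T
    return xc @ w
```

The fluctuation problem the paper targets is visible here: with extreme imbalance, the per-batch `cov` estimate is noisy, so `w` varies between batches, which is what the covariance-corrected sampling and embedded training are meant to stabilise.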

[CV-21] Hybrid Classification-Regression Adaptive Loss for Dense Object Detection

链接: https://arxiv.org/abs/2408.17182
作者: Yanquan Huang,Liu Wei Zhen,Yun Hao,Mengyuan Zhang,Qingyao Wu,Zikun Deng,Xueming Liu,Hong Deng
关键词-EN: object detection detectors, detection detectors, enhancing model performance, object detection, ability to simultaneously
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:For object detection detectors, enhancing model performance hinges on the ability to simultaneously consider inconsistencies across tasks and focus on difficult-to-train samples. Achieving this necessitates incorporating information from both the classification and regression tasks. However, prior work tends to either emphasize difficult-to-train samples within their respective tasks or simply compute classification scores with IoU, often leading to suboptimal model performance. In this paper, we propose a Hybrid Classification-Regression Adaptive Loss, termed HCRAL. Specifically, we introduce the Residual of Classification and IoU (RCI) module for cross-task supervision, addressing task inconsistencies, and the Conditioning Factor (CF) to focus on difficult-to-train samples within each task. Furthermore, we introduce a new strategy named Expanded Adaptive Training Sample Selection (EATSS) to provide additional samples that exhibit classification and regression inconsistencies. To validate the effectiveness of the proposed method, we conduct extensive experiments on COCO test-dev. Experimental evaluations demonstrate the superiority of our approach. Additionally, we designed experiments by separately combining the classification and regression loss with regular loss functions in popular one-stage models, demonstrating improved performance.

[CV-22] EMHI: A Multimodal Egocentric Human Motion Dataset with HMD and Body-Worn IMUs

链接: https://arxiv.org/abs/2408.17168
作者: Zhen Fan,Peng Dai,Zhuo Su,Xu Gao,Zheng Lv,Jiarui Zhang,Tianyuan Du,Guidong Wang,Yang Zhang
关键词-EN: Inertial Measurement Unit, sparse Inertial Measurement, human pose estimation, egocentric HPE, Measurement Unit
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Egocentric human pose estimation (HPE) using wearable sensors is essential for VR/AR applications. Most methods rely solely on either egocentric-view images or sparse Inertial Measurement Unit (IMU) signals, leading to inaccuracies due to self-occlusion in images or the sparseness and drift of inertial sensors. Most importantly, the lack of real-world datasets containing both modalities is a major obstacle to progress in this field. To overcome the barrier, we propose EMHI, a multimodal Egocentric human Motion dataset with Head-Mounted Display (HMD) and body-worn IMUs, with all data collected under a real VR product suite. Specifically, EMHI provides synchronized stereo images from downward-sloping cameras on the headset and IMU data from body-worn sensors, along with pose annotations in SMPL format. This dataset consists of 885 sequences captured by 58 subjects performing 39 actions, totaling about 28.5 hours of recording. We evaluate the annotations by comparing them with optical marker-based SMPL fitting results. To substantiate the reliability of our dataset, we introduce MEPoser, a new baseline method for multimodal egocentric HPE, which employs a multimodal fusion encoder, temporal feature encoder, and MLP-based regression heads. The experiments on EMHI show that MEPoser outperforms existing single-modal methods and demonstrates the value of our dataset in solving the problem of egocentric HPE. We believe the release of EMHI and the method could advance the research of egocentric HPE and expedite the practical implementation of this technology in VR/AR products.

[CV-23] Self-supervised Anomaly Detection Pretraining Enhances Long-tail ECG Diagnosis

Link: https://arxiv.org/abs/2408.17154
Authors: Aofan Jiang,Chaoqin Huang,Qing Cao,Yuchen Xu,Zi Zeng,Kang Chen,Ya Zhang,Yanfeng Wang
Keywords: critical cardiac anomalies, cardiac anomalies due, diagnostic systems struggle, Current computer-aided ECG, computer-aided ECG diagnostic
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: arXiv admin note: text overlap with arXiv:2404.04935

Abstract:Current computer-aided ECG diagnostic systems struggle with the underdetection of rare but critical cardiac anomalies due to the imbalanced nature of ECG datasets. This study introduces a novel approach using self-supervised anomaly detection pretraining to address this limitation. The anomaly detection model is specifically designed to detect and localize subtle deviations from normal cardiac patterns, capturing the nuanced details essential for accurate ECG interpretation. Validated on an extensive dataset of over one million ECG records from clinical practice, characterized by a long-tail distribution across 116 distinct categories, the anomaly detection-pretrained ECG diagnostic model has demonstrated a significant improvement in overall accuracy. Notably, our approach yielded a 94.7% AUROC, 92.2% sensitivity, and 92.5% specificity for rare ECG types, significantly outperforming traditional methods and narrowing the performance gap with common ECG types. The integration of anomaly detection pretraining into ECG analysis represents a substantial contribution to the field, addressing the long-standing challenge of long-tail data distributions in clinical diagnostics. Furthermore, prospective validation in real-world clinical settings revealed that our AI-driven approach enhances diagnostic efficiency, precision, and completeness by 32%, 6.7%, and 11.8% respectively, when compared to standard practices. This advancement marks a pivotal step forward in the integration of AI within clinical cardiology, with particularly profound implications for emergency care, where rapid and accurate ECG interpretation is crucial. The contributions of this study not only push the boundaries of current ECG diagnostic capabilities but also lay the groundwork for more reliable and accessible cardiovascular care.

[CV-24] Look Compare Decide: Alleviating Hallucination in Large Vision-Language Models via Multi-View Multi-Path Reasoning

Link: https://arxiv.org/abs/2408.17150
Authors: Xiaoye Qu,Jiashuo Sun,Wei Wei,Yu Cheng
Keywords: Large Vision-Language Models, multi-modal context comprehension, Large Vision-Language, Vision-Language Models, demonstrated impressive capabilities
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 13 pages, 7 tables, 7 figures

Abstract:Recently, Large Vision-Language Models (LVLMs) have demonstrated impressive capabilities in multi-modal context comprehension. However, they still suffer from hallucination problems, i.e., generating outputs inconsistent with the image content. To mitigate hallucinations, previous studies mainly focus on retraining LVLMs with custom datasets. Although effective, they inherently come with additional computational costs. In this paper, we propose a training-free framework, MVP, that aims to reduce hallucinations by making the most of the innate capabilities of the LVLMs via Multi-View Multi-Path Reasoning. Specifically, we first devise a multi-view information-seeking strategy to thoroughly perceive the comprehensive information in the image, which enriches the general global information captured by the original vision encoder in LVLMs. Furthermore, during the answer decoding, we observe that the occurrence of hallucinations has a strong correlation with the certainty of the answer tokens. Thus, we propose multi-path reasoning for each information view to quantify and aggregate the certainty scores for each potential answer among multiple decoding paths and finally decide the output answer. By fully grasping the information in the image and carefully considering the certainty of the potential answers when decoding, our MVP can effectively reduce hallucinations in LVLMs. The extensive experiments verify that our proposed MVP significantly mitigates the hallucination problem across four well-known LVLMs. The source code is available at: this https URL.
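The multi-path decision step described above can be illustrated with a toy sketch. Everything here is hypothetical (the function name `aggregate_answer` and the certainty values are not from the paper); it only shows the idea of pooling per-path certainty scores for each candidate answer and picking the answer with the highest total:

```python
# Toy certainty aggregation across decoding paths: sum each candidate
# answer's certainty over all paths, then return the best-scoring answer.
from collections import defaultdict

def aggregate_answer(paths):
    """paths: list of (answer, certainty) pairs, one per decoding path."""
    totals = defaultdict(float)
    for answer, certainty in paths:
        totals[answer] += certainty
    return max(totals, key=totals.get)

# Three decoding paths agree on "cat"; one confident path says "dog".
paths = [("cat", 0.9), ("cat", 0.8), ("dog", 0.95), ("cat", 0.7)]
print(aggregate_answer(paths))  # -> cat
```

A single high-certainty outlier path ("dog") is outvoted by repeated agreement, which is the intuition behind aggregating instead of trusting one decoding path.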

[CV-25] GMM-IKRS: Gaussian Mixture Models for Interpretable Keypoint Refinement and Scoring ECCV2024

Link: https://arxiv.org/abs/2408.17149
Authors: Emanuele Santellani,Martin Zach,Christian Sormann,Mattia Rossi,Andreas Kuhn,Friedrich Fraundorfer
Keywords: computer vision applications, vision applications, computer vision, keypoints, Gaussian Mixture Model
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ECCV 2024

Abstract:The extraction of keypoints in images is at the basis of many computer vision applications, from localization to 3D reconstruction. Keypoints come with a score permitting to rank them according to their quality. While learned keypoints often exhibit better properties than handcrafted ones, their scores are not easily interpretable, making it virtually impossible to compare the quality of individual keypoints across methods. We propose a framework that can refine, and at the same time characterize with an interpretable score, the keypoints extracted by any method. Our approach leverages a modified robust Gaussian Mixture Model fit designed to both reject non-robust keypoints and refine the remaining ones. Our score comprises two components: one relates to the probability of extracting the same keypoint in an image captured from another viewpoint, the other relates to the localization accuracy of the keypoint. These two interpretable components permit a comparison of individual keypoints extracted across different methods. Through extensive experiments we demonstrate that, when applied to popular keypoint detectors, our framework consistently improves the repeatability of keypoints as well as their performance in homography and two/multiple-view pose recovery tasks.
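The core intuition — score a keypoint by how tightly its repeated observations cluster — can be sketched with a simple isotropic-Gaussian fit. This is a toy illustration, not the paper's method (which fits a robust Gaussian Mixture Model with rejection); the function name `keypoint_score` is hypothetical:

```python
import math

def keypoint_score(observations):
    """Score a keypoint from repeated (x, y) observations of it:
    tighter clustering -> higher localization score in (0, 1]."""
    n = len(observations)
    mx = sum(x for x, _ in observations) / n
    my = sum(y for _, y in observations) / n
    # Mean squared distance to the cluster centre (isotropic Gaussian fit).
    var = sum((x - mx) ** 2 + (y - my) ** 2 for x, y in observations) / n
    return math.exp(-var)

tight = [(10.0, 5.0), (10.1, 5.1), (9.9, 4.9)]   # well-localized keypoint
loose = [(10.0, 5.0), (13.0, 8.0), (7.0, 2.0)]   # unstable keypoint
print(keypoint_score(tight) > keypoint_score(loose))  # -> True
```

Because the score is derived from a probabilistic spread rather than a detector-specific confidence, such scores are directly comparable across keypoints from different detectors, which is the interpretability argument of the abstract.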

[CV-26] RenDetNet: Weakly-supervised Shadow Detection with Shadow Caster Verification ECCV2024

Link: https://arxiv.org/abs/2408.17143
Authors: Nikolina Kubiak,Elliot Wortman,Armin Mustafa,Graeme Phillipson,Stephen Jolly,Simon Hadfield
Keywords: differentiate dark image, dark image areas, Existing shadow detection, Existing shadow, struggle to differentiate
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: AIM @ ECCV 2024 / code available at this https URL

Abstract:Existing shadow detection models struggle to differentiate dark image areas from shadows. In this paper, we tackle this issue by verifying that all detected shadows are real, i.e. they have paired shadow casters. We perform this step in a physically-accurate manner by differentiably re-rendering the scene and observing the changes stemming from carving out estimated shadow casters. Thanks to this approach, the RenDetNet proposed in this paper is the first learning-based shadow detection model whose supervisory signals can be computed in a self-supervised manner. The developed system compares favourably against recent models trained on our data. As part of this publication, we release our code on github.

[CV-27] Temporal and Interactive Modeling for Efficient Human-Human Motion Generation

Link: https://arxiv.org/abs/2408.17135
Authors: Yabiao Wang,Shuo Wang,Jiangning Zhang,Ke Fan,Jiafu Wu,Zhengkai Jiang,Yong Liu
Keywords: essential for understanding, understanding humans, humans as social, Human-human motion generation, Causal Interactive Injection
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Homepage: this https URL

Abstract:Human-human motion generation is essential for understanding humans as social beings. Although several transformer-based methods have been proposed, they typically model each individual separately and overlook the causal relationships in temporal motion sequences. Furthermore, the attention mechanism in transformers exhibits quadratic computational complexity, significantly reducing their efficiency when processing long sequences. In this paper, we introduce TIM (Temporal and Interactive Modeling), an efficient and effective approach that presents the pioneering human-human motion generation model utilizing RWKV. Specifically, we first propose Causal Interactive Injection to leverage the temporal properties of motion sequences and avoid non-causal and cumbersome modeling. Then we present Role-Evolving Mixing to adjust to the ever-evolving roles throughout the interaction. Finally, to generate smoother and more rational motion, we design Localized Pattern Amplification to capture short-term motion patterns. Extensive experiments on InterHuman demonstrate that our method achieves superior performance. Notably, TIM has achieved state-of-the-art results using only 32% of InterGen’s trainable parameters. Code will be available soon. Homepage: this https URL

[CV-28] VQ4DiT: Efficient Post-Training Vector Quantization for Diffusion Transformers

Link: https://arxiv.org/abs/2408.17131
Authors: Juncan Deng,Shuaiting Li,Zeyu Wang,Hong Gu,Kedong Xu,Kejie Huang
Keywords: Diffusion Transformers Models, Diffusion Transformers, demonstrating exceptional capabilities, demonstrating exceptional, Transformers Models
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 11 pages, 6 figures

Abstract:The Diffusion Transformers Models (DiTs) have transitioned the network architecture from traditional UNets to transformers, demonstrating exceptional capabilities in image generation. Although DiTs have been widely applied to high-definition video generation tasks, their large parameter size hinders inference on edge devices. Vector quantization (VQ) can decompose model weight into a codebook and assignments, allowing extreme weight quantization and significantly reducing memory usage. In this paper, we propose VQ4DiT, a fast post-training vector quantization method for DiTs. We found that traditional VQ methods calibrate only the codebook without calibrating the assignments. This leads to weight sub-vectors being incorrectly assigned to the same assignment, providing inconsistent gradients to the codebook and resulting in a suboptimal result. To address this challenge, VQ4DiT calculates the candidate assignment set for each weight sub-vector based on Euclidean distance and reconstructs the sub-vector based on the weighted average. Then, using the zero-data and block-wise calibration method, the optimal assignment from the set is efficiently selected while calibrating the codebook. VQ4DiT quantizes a DiT XL/2 model on a single NVIDIA A100 GPU within 20 minutes to 5 hours depending on the different quantization settings. Experiments show that VQ4DiT establishes a new state-of-the-art in model size and performance trade-offs, quantizing weights to 2-bit precision while retaining acceptable image generation quality.
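The codebook/assignment decomposition at the heart of vector quantization can be sketched in a few lines of pure Python. This is an illustrative toy (the function names `assign` and `quantize`, the tiny codebook, and the sub-vector length are assumptions, and the real VQ4DiT additionally calibrates assignments zero-shot and block-wise):

```python
# Toy weight vector quantization: split a flat weight list into
# sub-vectors, assign each to its nearest codebook entry by squared
# Euclidean distance, and reconstruct weights from codebook + assignments.
def assign(subvec, codebook):
    return min(range(len(codebook)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(subvec, codebook[i])))

def quantize(weights, codebook, d=2):
    subvecs = [weights[i:i + d] for i in range(0, len(weights), d)]
    assignments = [assign(v, codebook) for v in subvecs]
    reconstructed = [x for idx in assignments for x in codebook[idx]]
    return assignments, reconstructed

codebook = [[0.0, 0.0], [1.0, 1.0]]
weights = [0.1, -0.1, 0.9, 1.2]
assignments, recon = quantize(weights, codebook)
print(assignments)  # -> [0, 1]
```

Storage drops because only the small codebook plus integer assignments are kept; the abstract's point is that calibrating the codebook alone, while assignments stay wrong, yields inconsistent gradients.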

[CV-29] Multi-centric AI Model for Unruptured Intracranial Aneurysm Detection and Volumetric Segmentation in 3D TOF-MRI

Link: https://arxiv.org/abs/2408.17115
Authors: Ashraya K. Indrakanti,Jakob Wasserthal,Martin Segeroth,Shan Yang,Victor Schulze-Zachau,Joshy Cyriac,Michael Bach,Marios Psychogios,Matthias A. Mutke
Keywords: unruptured intracranial aneurysms, aneurysm-like differential diagnoses, intracranial aneurysms, unruptured intracranial, compare models trained
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 5 figures, 3 tables, 2 supplementary tables

Abstract:Purpose: To develop an open-source nnU-Net-based AI model for combined detection and segmentation of unruptured intracranial aneurysms (UICA) in 3D TOF-MRI, and compare models trained on datasets with aneurysm-like differential diagnoses. Methods: This retrospective study (2020-2023) included 385 anonymized 3D TOF-MRI images from 364 patients (mean age 59 years, 60% female) at multiple centers plus 113 subjects from the ADAM challenge. Images featured untreated or possible UICAs and differential diagnoses. Four distinct training datasets were created, and the nnU-Net framework was used for model development. Performance was assessed on a separate test set using sensitivity and False Positive (FP)/case rate for detection, and DICE score and NSD (Normalized Surface Distance) with a 0.5mm threshold for segmentation. Statistical analysis included chi-square, Mann-Whitney-U, and Kruskal-Wallis tests, with significance set at p < 0.05. Results: Models achieved overall sensitivity between 82% and 85% and a FP/case rate of 0.20 to 0.31, with no significant differences (p = 0.90 and p = 0.16). The primary model showed 85% sensitivity and 0.23 FP/case rate, outperforming the ADAM-challenge winner (61%) and a nnU-Net trained on ADAM data (51%) in sensitivity (p < 0.05). It achieved a mean DICE score of 0.73 and an NSD of 0.84 for correctly detected UICA. Conclusions: Our open-source, nnU-Net-based AI model (available at https://doi.org/10.5281/zenodo.13386859) demonstrates high sensitivity, low false positive rates, and consistent segmentation accuracy for UICA detection and segmentation in 3D TOF-MRI, suggesting its potential to improve clinical diagnosis and for monitoring of UICA.
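For reference, the DICE score used above measures volumetric overlap between a predicted and a ground-truth segmentation. A minimal set-based sketch (not from the paper; voxel coordinates as tuples are an illustrative representation):

```python
def dice(pred, truth):
    """Dice overlap between two sets of segmented voxel coordinates:
    2*|intersection| / (|pred| + |truth|), in [0, 1]."""
    pred, truth = set(pred), set(truth)
    if not pred and not truth:
        return 1.0  # two empty masks agree perfectly by convention
    return 2 * len(pred & truth) / (len(pred) + len(truth))

pred  = {(0, 0), (0, 1), (1, 0)}
truth = {(0, 0), (0, 1), (1, 1)}
print(dice(pred, truth))  # -> 0.6666666666666666
```

A DICE of 0.73, as reported for correctly detected aneurysms, thus means roughly three quarters overlap between prediction and annotation.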

[CV-30] Sparse Uncertainty-Informed Sampling from Federated Streaming Data

Link: https://arxiv.org/abs/2408.17108
Authors: Manuel Röder,Frank-Michael Schleif
Keywords: federated client systems, computationally efficient approach, local model adaptation, numerically robust, computationally efficient
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint, 6 pages, 3 figures, Accepted for ESANN 2024

Abstract:We present a numerically robust, computationally efficient approach for non-I.I.D. data stream sampling in federated client systems, where resources are limited and labeled data for local model adaptation is sparse and expensive. The proposed method identifies relevant stream observations to optimize the underlying client model, given a local labeling budget, and performs instantaneous labeling decisions without relying on any memory buffering strategies. Our experiments show enhanced training batch diversity and an improved numerical robustness of the proposal compared to existing strategies over large-scale data streams, making our approach an effective and convenient solution in FL environments.
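The instantaneous, budget-constrained labeling decision described above can be sketched as an uncertainty threshold test per stream element. This is a hypothetical toy (the function names, the entropy-based uncertainty measure, and the threshold value are all assumptions, not the paper's exact criterion):

```python
import math

def entropy(probs):
    """Predictive entropy of a class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_labeling(stream, budget, threshold):
    """Label an observation instantly when predictive entropy exceeds the
    threshold and the local labeling budget is not yet exhausted.
    No buffering: each decision is made as the observation arrives."""
    chosen = []
    for idx, probs in enumerate(stream):
        if len(chosen) < budget and entropy(probs) > threshold:
            chosen.append(idx)
    return chosen

# Confident predictions are skipped; the first uncertain one uses the budget.
stream = [[0.98, 0.02], [0.55, 0.45], [0.50, 0.50], [0.99, 0.01]]
print(select_for_labeling(stream, budget=1, threshold=0.6))  # -> [1]
```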

[CV-31] UTrack: Multi-Object Tracking with Uncertain Detections ECCV2024

Link: https://arxiv.org/abs/2408.17098
Authors: Edgardo Solano-Carrillo,Felix Sattler,Antje Alex,Alexander Klein,Bruno Pereira Costa,Angel Bueno Rodriguez,Jannis Stoppe
Keywords: associating tracks, multi-object tracking, mainstream in multi-object, object detector, predictions
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for the ECCV 2024 Workshop on Uncertainty Quantification for Computer Vision

Abstract:The tracking-by-detection paradigm is the mainstream in multi-object tracking, associating tracks to the predictions of an object detector. Although exhibiting uncertainty through a confidence score, these predictions do not capture the entire variability of the inference process. For safety and security critical applications like autonomous driving, surveillance, etc., knowing this predictive uncertainty is essential though. Therefore, we introduce, for the first time, a fast way to obtain the empirical predictive distribution during object detection and incorporate that knowledge in multi-object tracking. Our mechanism can easily be integrated into state-of-the-art trackers, enabling them to fully exploit the uncertainty in the detections. Additionally, novel association methods are introduced that leverage the proposed mechanism. We demonstrate the effectiveness of our contribution on a variety of benchmarks, such as MOT17, MOT20, DanceTrack, and KITTI.
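One way uncertainty can enter track-detection association is by measuring distances in units of each detection's positional spread, so uncertain detections gate more loosely. The sketch below is a loosely Mahalanobis-flavored toy and not the paper's association method; the scalar std per detection and the function name `associate` are illustrative assumptions:

```python
def associate(track, detections):
    """Return the index of the detection best matching the track prediction.
    Each detection is ((x, y), std), where std summarizes the empirical
    predictive spread of the detector for that box centre."""
    def normalized_dist(det):
        (x, y), std = det
        tx, ty = track
        return ((x - tx) ** 2 + (y - ty) ** 2) ** 0.5 / std
    return min(range(len(detections)), key=lambda i: normalized_dist(detections[i]))

track = (10.0, 10.0)
# The farther detection is very uncertain (std 5.0), the nearer one is
# sharp (std 0.2): a plain nearest-centre rule would pick index 1, but the
# uncertainty-normalized distance prefers index 0.
detections = [((13.0, 10.0), 5.0), ((11.0, 10.0), 0.2)]
print(associate(track, detections))  # -> 0
```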

[CV-32] RISSOLE: Parameter-efficient Diffusion Models via Block-wise Generation and Retrieval-Guidance

Link: https://arxiv.org/abs/2408.17095
Authors: Avideep Mukherjee,Soumya Banerjee,Vinay P. Namboodiri,Piyush Rai
Keywords: Diffusion-based models demonstrate, impressive generation capabilities, Diffusion-based models, generation, Diffusion-based
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Diffusion-based models demonstrate impressive generation capabilities. However, they also have a massive number of parameters, resulting in enormous model sizes, thus making them unsuitable for deployment on resource-constrained devices. Block-wise generation can be a promising alternative for designing compact-sized (parameter-efficient) deep generative models since the model can generate one block at a time instead of generating the whole image at once. However, block-wise generation is also considerably challenging because ensuring coherence across generated blocks can be non-trivial. To this end, we design a retrieval-augmented generation (RAG) approach and leverage the corresponding blocks of the images retrieved by the RAG module to condition the training and generation stages of a block-wise denoising diffusion model. Our conditioning schemes ensure coherence across the different blocks during training and, consequently, during generation. While we showcase our approach using the latent diffusion model (LDM) as the base model, it can be used with other variants of denoising diffusion models. We validate the solution of the coherence problem through the proposed approach by reporting substantive experiments to demonstrate our approach’s effectiveness in compact model size and excellent generation quality.

[CV-33] FissionVAE: Federated Non-IID Image Generation with Latent Space and Decoder Decomposition

Link: https://arxiv.org/abs/2408.17090
Authors: Chen Hu,Jingjing Deng,Xianghua Xie,Xiaoke Ma
Keywords: Generative Adversarial Networks, enables decentralized clients, machine learning paradigm, paradigm that enables, enables decentralized
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Federated learning is a machine learning paradigm that enables decentralized clients to collaboratively learn a shared model while keeping all the training data local. While considerable research has focused on federated image generation, particularly Generative Adversarial Networks, Variational Autoencoders have received less attention. In this paper, we address the challenges of non-IID (not independently and identically distributed) data environments featuring multiple groups of images of different types. Specifically, heterogeneous data distributions can lead to difficulties in maintaining a consistent latent space and can also result in local generators with disparate texture features being blended during aggregation. We introduce a novel approach, FissionVAE, which decomposes the latent space and constructs decoder branches tailored to individual client groups. This method allows for customized learning that aligns with the unique data distributions of each group. Additionally, we investigate the incorporation of hierarchical VAE architectures and demonstrate the use of heterogeneous decoder architectures within our model. We also explore strategies for setting the latent prior distributions to enhance the decomposition process. To evaluate our approach, we assemble two composite datasets: the first combines MNIST and FashionMNIST; the second comprises RGB datasets of cartoon and human faces, wild animals, marine vessels, and remote sensing images of Earth. Our experiments demonstrate that FissionVAE greatly improves generation quality on these datasets compared to baseline federated VAE models.

[CV-34] Focus-Consistent Multi-Level Aggregation for Compositional Zero-Shot Learning

Link: https://arxiv.org/abs/2408.17083
Authors: Fengyuan Dai,Siteng Huang,Min Zhang,Biao Gong,Donglin Wang
Keywords: compositional zero-shot learning, recent compositional zero-shot, optimal classification branches, recent compositional, zero-shot learning
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Compositional Zero-Shot Learning

Abstract:To transfer knowledge from seen attribute-object compositions to recognize unseen ones, recent compositional zero-shot learning (CZSL) methods mainly discuss the optimal classification branches to identify the elements, leading to the popularity of employing a three-branch architecture. However, these methods mix up the underlying relationship among the branches, in the aspect of consistency and diversity. Specifically, consistently providing the highest-level features for all three branches increases the difficulty in distinguishing classes that are superficially similar. Furthermore, a single branch may focus on suboptimal regions when spatial messages are not shared between the personalized branches. Recognizing these issues and endeavoring to address them, we propose a novel method called Focus-Consistent Multi-Level Aggregation (FOMA). Our method incorporates a Multi-Level Feature Aggregation (MFA) module to generate personalized features for each branch based on the image content. Additionally, a Focus-Consistent Constraint encourages a consistent focus on the informative regions, thereby implicitly exchanging spatial information between all branches. Extensive experiments on three benchmark datasets (UT-Zappos, C-GQA, and Clothing16K) demonstrate that our FOMA outperforms SOTA.

[CV-35] Stochastic Layer-Wise Shuffle: A Good Practice to Improve Vision Mamba Training

Link: https://arxiv.org/abs/2408.17081
Authors: Zizheng Huang,Haoxing Chen,Jiaqi Li,Jun Lan,Huijia Zhu,Weiqiang Wang,Limin Wang
Keywords: Recent Vision Mamba, processing higher resolution, higher resolution images, Recent Vision, Vision Transformers
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Recent Vision Mamba models not only have much lower complexity for processing higher resolution images and longer videos but also the competitive performance with Vision Transformers (ViTs). However, they are stuck into overfitting and thus only present up to base size (about 80M). It is still unclear how vanilla Vision Mamba (Vim) can be efficiently scaled up to larger sizes, which is essential for further exploitation. In this paper, we propose a stochastic layer-wise shuffle regularization, which empowers successfully scaling non-hierarchical Vision Mamba to a large size (about 300M) in a supervised setting. Specifically, our base and large-scale ShuffleMamba models can outperform the supervised ViTs of similar size by 0.8% and 1.0% classification accuracy on ImageNet1k, respectively, without auxiliary data. When evaluated on the ADE20K semantic segmentation and COCO detection tasks, our ShuffleMamba models also show significant improvements. Without bells and whistles, the stochastic layer-wise shuffle has the following highlights: (1) Plug and play: it does not change model architectures and will be omitted in inference. (2) Simple but effective: it can improve the overfitting in Vim training and only introduce random token permutation operations. (3) Intuitive: the token sequences in deeper layers are more likely to be shuffled as they are expected to be more semantic and less sensitive to patch positions. Code and models will be available at this https URL.
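The depth-dependent shuffle described in highlight (3) can be sketched as a per-layer random token permutation whose probability grows with layer index. The linear schedule and the `p_max` cap below are illustrative assumptions, not the paper's exact schedule, and the shuffle is applied only during training:

```python
import random

def shuffle_prob(layer, num_layers, p_max=0.5):
    """Deeper layers are shuffled with higher probability (toy linear ramp)."""
    return p_max * (layer + 1) / num_layers

def maybe_shuffle(tokens, layer, num_layers, rng):
    """Training-time regularization: randomly permute the token sequence
    with a layer-dependent probability; identity (omitted) at inference."""
    tokens = list(tokens)
    if rng.random() < shuffle_prob(layer, num_layers):
        rng.shuffle(tokens)
    return tokens

rng = random.Random(0)
tokens = list(range(8))
out = maybe_shuffle(tokens, layer=23, num_layers=24, rng=rng)
print(sorted(out) == tokens)  # -> True (output is always a permutation)
```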

[CV-36] Generalizing Deepfake Video Detection with Plug-and-Play: Video-Level Blending and Spatiotemporal Adapter Tuning

Link: https://arxiv.org/abs/2408.17065
Authors: Zhiyuan Yan,Yandan Zhao,Shen Chen,Xinghe Fu,Taiping Yao,Shouhong Ding,Li Yuan
Keywords: enhance model generalization, current deepfake video, key challenges hinder, deepfake video detection, complex and diverse
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Three key challenges hinder the development of current deepfake video detection: (1) Temporal features can be complex and diverse: how can we identify general temporal artifacts to enhance model generalization? (2) Spatiotemporal models often lean heavily on one type of artifact and ignore the other: how can we ensure balanced learning from both? (3) Videos are naturally resource-intensive: how can we tackle efficiency without compromising accuracy? This paper attempts to tackle the three challenges jointly. First, inspired by the notable generality of using image-level blending data for image forgery detection, we investigate whether and how video-level blending can be effective in video. We then perform a thorough analysis and identify a previously underexplored temporal forgery artifact: Facial Feature Drift (FFD), which commonly exists across different forgeries. To reproduce FFD, we then propose a novel Video-level Blending data (VB), where VB is implemented by blending the original image and its warped version frame-by-frame, serving as a hard negative sample to mine more general artifacts. Second, we carefully design a lightweight Spatiotemporal Adapter (StA) to equip a pretrained image model (both ViTs and CNNs) with the ability to capture both spatial and temporal features jointly and efficiently. StA is designed with two-stream 3D-Conv with varying kernel sizes, allowing it to process spatial and temporal features separately. Extensive experiments validate the effectiveness of the proposed methods; and show our approach can generalize well to previously unseen forgery videos, even the just-released (in 2024) SoTAs. We release our code and pretrained weights at this https URL.
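The frame-by-frame blending used to build the Video-level Blending data can be sketched as a pixel-wise convex combination of each frame with its warped version. The `alpha` value and the list-of-lists frame representation below are illustrative simplifications:

```python
def blend_frames(orig, warped, alpha=0.5):
    """Pixel-wise blend of one frame with its warped counterpart."""
    return [[alpha * o + (1 - alpha) * w for o, w in zip(ro, rw)]
            for ro, rw in zip(orig, warped)]

def blend_video(video, warped_video, alpha=0.5):
    """Apply the blend frame-by-frame to produce a hard negative sample."""
    return [blend_frames(f, g, alpha) for f, g in zip(video, warped_video)]

frame  = [[0.0, 1.0], [1.0, 0.0]]
warped = [[1.0, 1.0], [0.0, 0.0]]
print(blend_frames(frame, warped))  # -> [[0.5, 1.0], [0.5, 0.0]]
```

Because the blend differs only subtly from the original video, it mimics the drift-like temporal artifact and forces the detector to learn general temporal cues rather than forgery-specific ones.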

[CV-37] Instant Adversarial Purification with Adversarial Consistency Distillation

Link: https://arxiv.org/abs/2408.17064
Authors: Chun Tong Lei,Hon Ming Yam,Zhongliang Guo,Chun Pong Lau
Keywords: including image classification, Neural Function Evaluation, widespread applications, Neural networks, remarkable performance
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Neural networks, despite their remarkable performance in widespread applications, including image classification, are also known to be vulnerable to subtle adversarial noise. Although some diffusion-based purification methods have been proposed, for example, DiffPure, those methods are time-consuming. In this paper, we propose One Step Control Purification (OSCP), a diffusion-based purification model that can purify the adversarial image in one Neural Function Evaluation (NFE) in diffusion models. We use Latent Consistency Model (LCM) and ControlNet for our one-step purification. OSCP is computationally friendly and time efficient compared to other diffusion-based purification methods; we achieve a defense success rate of 74.19% on ImageNet, only requiring 0.1s for each purification. Moreover, there is a fundamental incongruence between consistency distillation and adversarial perturbation. To address this ontological dissonance, we propose Gaussian Adversarial Noise Distillation (GAND), a novel consistency distillation framework that facilitates a more nuanced reconciliation of the latent space dynamics, effectively bridging the natural and adversarial manifolds. Our experiments show that GAND does not need full fine-tuning (FFT); parameter-efficient fine-tuning (PEFT), e.g., LoRA, is sufficient.

[CV-38] VoteMix: Plug-and-Play Token Reduction for Efficient Vision Transformer

Link: https://arxiv.org/abs/2408.17062
Authors: Shuai Peng,Di Fu,Baole Wei,Yong Cao,Liangcai Gao,Zhi Tang
Keywords: Vision Transformers, success of Vision, substantial computational cost, remarkable success, hindered by substantial
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Despite the remarkable success of Vision Transformers (ViTs) in various visual tasks, they are often hindered by substantial computational cost. In this work, we introduce VoteMix (VoMix), a plug-and-play and parameter-free token reduction method, which can be readily applied to off-the-shelf ViT models without any training. VoMix tackles the computational redundancy of ViTs by identifying tokens with high homogeneity through a layer-wise token similarity voting mechanism. Subsequently, the selected tokens are mixed into the retained set, thereby preserving visual information. Experiments demonstrate VoMix significantly improves the speed-accuracy tradeoff of ViTs on both images and videos. Without any training, VoMix achieves a 2× increase in throughput of existing ViT-H on ImageNet-1K and a 2.4× increase in throughput of existing ViT-L on the Kinetics-400 video dataset, with a mere 0.3% drop in top-1 accuracy.
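The mix-instead-of-prune idea can be sketched as merging the most mutually similar pair of tokens into their average, so the sequence shrinks without discarding information. This toy differs from VoMix proper (which uses a layer-wise similarity voting mechanism); the function names and the cosine-similarity criterion are illustrative assumptions:

```python
def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

def mix_most_similar(tokens):
    """Merge the most similar token pair into its element-wise average,
    reducing the sequence length by one while keeping the information."""
    n = len(tokens)
    i, j = max(((i, j) for i in range(n) for j in range(i + 1, n)),
               key=lambda ij: cosine(tokens[ij[0]], tokens[ij[1]]))
    merged = [(a + b) / 2 for a, b in zip(tokens[i], tokens[j])]
    return [t for k, t in enumerate(tokens) if k not in (i, j)] + [merged]

# Two near-duplicate tokens get merged; the distinct one survives intact.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(len(mix_most_similar(tokens)))  # -> 2
```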

[CV-39] Efficient Image Restoration through Low-Rank Adaptation and Stable Diffusion XL

Link: https://arxiv.org/abs/2408.17060
Authors: Haiyang Zhao
Keywords: Stable Diffusion, enhanced image restoration, fine-tune SDXL models, low-rank adaptive, image restoration model
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages

Abstract:In this study, we propose an enhanced image restoration model, SUPIR, based on the integration of two low-rank adaptive (LoRA) modules with the Stable Diffusion XL (SDXL) framework. Our method leverages the advantages of LoRA to fine-tune SDXL models, thereby significantly improving image restoration quality and efficiency. We collect 2600 high-quality real-world images, each with detailed descriptive text, for training the model. The proposed method is evaluated on standard benchmarks and achieves excellent performance, demonstrated by higher peak signal-to-noise ratio (PSNR), lower learned perceptual image patch similarity (LPIPS), and higher structural similarity index measurement (SSIM) scores. These results underscore the effectiveness of combining LoRA with SDXL for advanced image restoration tasks, highlighting the potential of our approach in generating high-fidelity restored images.
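For reference, the PSNR metric used above is computed from the mean squared error between the restored image and the original. A minimal sketch over flat pixel lists (not from the paper; real images would be 2D/3D arrays):

```python
import math

def psnr(orig, restored, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists:
    10 * log10(max_val^2 / MSE); higher means closer to the original."""
    mse = sum((a - b) ** 2 for a, b in zip(orig, restored)) / len(orig)
    if mse == 0:
        return float("inf")  # identical images
    return 10 * math.log10(max_val ** 2 / mse)

orig = [52, 55, 61, 59]
noisy = [54, 55, 60, 58]
print(round(psnr(orig, noisy), 2))  # high PSNR = small restoration error
```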

[CV-40] A Survey of the Self Supervised Learning Mechanisms for Vision Transformers

Link: https://arxiv.org/abs/2408.17059
Authors: Asifullah Khan,Anabia Sohail,Mustansar Fiaz,Mehdi Hassan,Tariq Habib Afridi,Sibghat Ullah Marwat,Farzeen Munir,Safdar Ali,Hannan Naseem,Muhammad Zaigham Zaheer,Kamran Ali,Tangina Sultana,Ziaurrehman Tanoli,Naeem Akhter
Keywords: require high volume, attain sufficiently good, models require high, sufficiently good results, require high
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 34 pages, 5 figures, 7 tables

Abstract:Deep supervised learning models require a high volume of labeled data to attain sufficiently good results. However, the practice of gathering and annotating such big data is costly and laborious. Recently, the application of self-supervised learning (SSL) in vision tasks has gained significant attention. The intuition behind SSL is to exploit the synchronous relationships within the data as a form of self-supervision, which can be versatile. In the current big data era, most of the data is unlabeled, and the success of SSL thus relies on finding ways to exploit this vast amount of unlabeled data. It is thus better for deep learning algorithms to reduce reliance on human supervision and instead focus on self-supervision based on the inherent relationships within the data. With the advent of ViTs, which have achieved remarkable results in computer vision, it is crucial to explore and understand the various SSL mechanisms employed for training these models, specifically in scenarios where less labeled data is available. In this survey we thus develop a comprehensive taxonomy that systematically classifies SSL techniques based upon their representations and pre-training tasks. Additionally, we discuss the motivations behind SSL, review popular pre-training tasks, and highlight the challenges and advancements in this field. Furthermore, we present a comparative analysis of different SSL methods, evaluate their strengths and limitations, and identify potential avenues for future research.

[CV-41] LAR-IQA: A Lightweight Accurate and Robust No-Reference Image Quality Assessment Model

链接: https://arxiv.org/abs/2408.17057
作者: Nasim Jamshidi Avanaki,Abhijay Ghildiyal,Nabajeet Barman,Saman Zadtootaghaj
关键词-EN: deep learning techniques, Image Quality Assessment, Recent advancements, learning techniques demonstrate, techniques demonstrate high
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Recent advancements in the field of No-Reference Image Quality Assessment (NR-IQA) using deep learning techniques demonstrate high performance across multiple open-source datasets. However, such models are typically very large and complex, making them less suitable for real-world deployment, especially on resource- and battery-constrained mobile devices. To address this limitation, we propose a compact, lightweight NR-IQA model that achieves state-of-the-art (SOTA) performance on the ECCV AIM UHD-IQA challenge validation and test datasets while also being nearly 5.7 times faster than the fastest SOTA model. Our model features a dual-branch architecture, with each branch separately trained on synthetically and authentically distorted images, which enhances the model’s generalizability across different distortion types. To improve robustness under diverse real-world visual conditions, we additionally incorporate multiple color spaces during the training process. We also demonstrate the higher accuracy of recently proposed Kolmogorov-Arnold Networks (KANs) for final quality regression as compared to the conventional Multi-Layer Perceptrons (MLPs). Our evaluation considering various open-source datasets highlights the practical, high-accuracy, and robust performance of our proposed lightweight model. Code: this https URL.

[CV-42] BTMuda: A Bi-level Multi-source unsupervised domain adaptation framework for breast cancer diagnosis

链接: https://arxiv.org/abs/2408.17054
作者: Yuxiang Yang,Xinyi Zeng,Pinxian Zeng,Binyu Yan,Xi Wu,Jiliu Zhou,Yan Wang
关键词-EN: Deep learning, mortality rates, learning has revolutionized, revolutionized the early, early detection
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Deep learning has revolutionized the early detection of breast cancer, resulting in a significant decrease in mortality rates. However, difficulties in obtaining annotations and huge variations in distribution between training sets and real scenes have limited their clinical applications. To address these limitations, unsupervised domain adaptation (UDA) methods have been used to transfer knowledge from one labeled source domain to the unlabeled target domain, yet these approaches suffer from severe domain shift issues and often ignore the potential benefits of leveraging multiple relevant sources in practical applications. To address these limitations, in this work, we construct a Three-Branch Mixed extractor and propose a Bi-level Multi-source unsupervised domain adaptation method called BTMuda for breast cancer diagnosis. Our method addresses the problems of domain shift by dividing domain shift issues into two levels: intra-domain and inter-domain. To reduce the intra-domain shift, we jointly train a CNN and a Transformer as two paths of a domain mixed feature extractor to obtain robust representations rich in both low-level local and high-level global information. As for the inter-domain shift, we redesign the Transformer delicately to a three-branch architecture with cross-attention and distillation, which learns domain-invariant representations from multiple domains. Besides, we introduce two alignment modules - one for feature alignment and one for classifier alignment - to improve the alignment process. Extensive experiments conducted on three public mammographic datasets demonstrate that our BTMuda outperforms state-of-the-art methods.

[CV-43] Can We Leave Deepfake Data Behind in Training Deepfake Detector?

链接: https://arxiv.org/abs/2408.17052
作者: Jikang Cheng,Zhiyuan Yan,Ying Zhang,Yuhao Luo,Zhongyuan Wang,Chen Li
关键词-EN: deepfake, real-world scenarios, blendfake, applications in real-world, data
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The generalization ability of deepfake detectors is vital for their applications in real-world scenarios. One effective solution to enhance this ability is to train the models with manually-blended data, which we termed “blendfake”, encouraging models to learn generic forgery artifacts like blending boundary. Interestingly, current SoTA methods utilize blendfake without incorporating any deepfake data in their training process. This is likely because previous empirical observations suggest that vanilla hybrid training (VHT), which combines deepfake and blendfake data, results in inferior performance to methods using only blendfake data (so-called “1+1<2”). Therefore, a critical question arises: Can we leave deepfake behind and rely solely on blendfake data to train an effective deepfake detector? Intuitively, as deepfakes also contain additional informative forgery clues (e.g., deep generative artifacts), excluding all deepfake data in training deepfake detectors seems counter-intuitive. In this paper, we rethink the role of blendfake in detecting deepfakes and formulate the process from “real to blendfake to deepfake” to be a progressive transition. Specifically, blendfake and deepfake can be explicitly delineated as the oriented pivot anchors between “real-to-fake” transitions. The accumulation of forgery information should be oriented and progressively increasing during this transition process. To this end, we propose an Oriented Progressive Regularizor (OPR) to establish the constraints that compel the distribution of anchors to be discretely arranged. Furthermore, we introduce feature bridging to facilitate the smooth transition between adjacent anchors. Extensive experiments confirm that our design allows leveraging forgery information from both blendfake and deepfake effectively and comprehensively.

[CV-44] Text-to-Image Generation Via Energy-Based CLIP

链接: https://arxiv.org/abs/2408.17046
作者: Roy Ganz,Michael Elad
关键词-EN: significant research attention, Joint Energy Models, drawing significant research, Joint Energy, high-resolution datasets
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Joint Energy Models (JEMs), while drawing significant research attention, have not been successfully scaled to real-world, high-resolution datasets. We present EB-CLIP, a novel approach extending JEMs to the multimodal vision-language domain using CLIP, integrating both generative and discriminative objectives. For the generative objective, we introduce an image-text joint-energy function based on cosine similarity in the CLIP space, training CLIP to assign low energy to real image-caption pairs and high energy otherwise. For the discriminative objective, we employ contrastive adversarial loss, extending the adversarial training objective to the multimodal domain. EB-CLIP not only generates realistic images from text but also achieves competitive results on the compositionality benchmark, outperforming leading methods with fewer parameters. Additionally, we demonstrate the superior guidance capability of EB-CLIP by enhancing CLIP-based generative frameworks and converting unconditional diffusion models to text-based ones. Lastly, we show that EB-CLIP can serve as a more robust evaluation metric for text-to-image generative tasks than CLIP.
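The image-text joint-energy function described above can be sketched minimally. Assuming (as the abstract does not spell out exact scaling) that the energy is simply the negative cosine similarity of the two CLIP embeddings, real pairs get low energy and mismatched pairs high energy:

```python
import numpy as np

def joint_energy(img_emb, txt_emb):
    """Energy of an image-caption pair as negative cosine similarity.

    Low energy marks a compatible (real) pair, high energy a mismatch,
    mirroring the training target described in the abstract. Any extra
    temperature/scaling used by EB-CLIP is omitted; this is illustrative.
    """
    img = img_emb / np.linalg.norm(img_emb)
    txt = txt_emb / np.linalg.norm(txt_emb)
    return -float(np.dot(img, txt))

rng = np.random.default_rng(0)
caption = rng.normal(size=64)
matched_energy = joint_energy(caption, caption)          # perfectly aligned pair
random_energy = joint_energy(caption, rng.normal(size=64))
```

Training then pushes `matched_energy` down for real pairs and `random_energy` up for mismatched ones.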

[CV-45] CP-VoteNet: Contrastive Prototypical VoteNet for Few-Shot Point Cloud Object Detection

链接: https://arxiv.org/abs/2408.17036
作者: Xuejing Li,Weijia Zhang,Chao Ma
关键词-EN: Few-shot point cloud, Few-shot point, object detection, localise objects, aims to identify
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by PRCV 2024

点击查看摘要

Abstract:Few-shot point cloud 3D object detection (FS3D) aims to identify and localise objects of novel classes from point clouds, using knowledge learnt from annotated base classes and novel classes with very few annotations. Thus far, this challenging task has been approached using prototype learning, but the performance remains far from satisfactory. We find that in existing methods, the prototypes are only loosely constrained and lack fine-grained awareness of the semantic and geometrical correlation embedded within the point cloud space. To mitigate these issues, we propose to leverage the inherent contrastive relationship within the semantic and geometrical subspaces to learn more refined and generalisable prototypical representations. To this end, we first introduce contrastive semantics mining, which enables the network to extract discriminative categorical features by constructing positive and negative pairs within training batches. Meanwhile, since point features representing local patterns can be clustered into geometric components, we further propose to impose contrastive relationship at the primitive level. Through refined primitive geometric structures, the transferability of feature encoding from base to novel classes is significantly enhanced. The above designs and insights lead to our novel Contrastive Prototypical VoteNet (CP-VoteNet). Extensive experiments on two FS3D benchmarks FS-ScanNet and FS-SUNRGBD demonstrate that CP-VoteNet surpasses current state-of-the-art methods by considerable margins across different FS3D settings. Further ablation studies corroborate the rationale and effectiveness of our designs.

[CV-46] ConDense: Consistent 2D/3D Pre-training for Dense and Sparse Features from Multi-View Images ECCV2024

链接: https://arxiv.org/abs/2408.17027
作者: Xiaoshuai Zhang,Zhicheng Wang,Howard Zhou,Soham Ghosh,Danushen Gnanapragasam,Varun Jampani,Hao Su,Leonidas Guibas
关键词-EN: large-scale multi-view datasets, pre-training utilizing existing, utilizing existing pre-trained, networks and large-scale, multi-view datasets
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ECCV 2024

点击查看摘要

Abstract:To advance the state of the art in the creation of 3D foundation models, this paper introduces the ConDense framework for 3D pre-training utilizing existing pre-trained 2D networks and large-scale multi-view datasets. We propose a novel 2D-3D joint training scheme to extract co-embedded 2D and 3D features in an end-to-end pipeline, where 2D-3D feature consistency is enforced through a volume rendering NeRF-like ray marching process. Using dense per pixel features we are able to 1) directly distill the learned priors from 2D models to 3D models and create useful 3D backbones, 2) extract more consistent and less noisy 2D features, 3) formulate a consistent embedding space where 2D, 3D, and other modalities of data (e.g., natural language prompts) can be jointly queried. Furthermore, besides dense features, ConDense can be trained to extract sparse features (e.g., key points), also with 2D-3D consistency – condensing 3D NeRF representations into compact sets of decorated key points. We demonstrate that our pre-trained model provides good initialization for various 3D tasks including 3D classification and segmentation, outperforming other 3D pre-training methods by a significant margin. It also enables, by exploiting our sparse features, additional useful downstream tasks, such as matching 2D images to 3D scenes, detecting duplicate 3D scenes, and querying a repository of 3D scenes through natural language – all quite efficiently and without any per-scene fine-tuning.

[CV-47] Retrieval-Augmented Natural Language Reasoning for Explainable Visual Question Answering ICIP

链接: https://arxiv.org/abs/2408.17006
作者: Su Hyeon Lim,Minkuk Kim,Hyeon Bae Kim,Seong Tae Kim
关键词-EN: Visual Question Answering, Visual Question, Question Answering, Retrieval-augmented natural language, task is challenging
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ICIP Workshop 2024

点击查看摘要

Abstract:The Visual Question Answering with Natural Language Explanation (VQA-NLE) task is challenging due to its high demand for reasoning-based inference. Recent VQA-NLE studies focus on enhancing model networks to amplify the model’s reasoning capability, but this approach is resource-consuming and unstable. In this work, we introduce a new VQA-NLE model, ReRe (Retrieval-augmented natural language Reasoning), which leverages retrieval information from memory to aid in generating accurate answers and persuasive explanations without relying on complex networks and extra datasets. ReRe is an encoder-decoder architecture model using a pre-trained CLIP vision encoder and a pre-trained GPT-2 language model as a decoder. Cross-attention layers are added in the GPT-2 for processing retrieval features. ReRe outperforms previous methods in VQA accuracy and explanation score, and shows improvement in NLE by producing more persuasive and reliable explanations.
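The cross-attention layers added to GPT-2 can be sketched as single-head attention in which queries come from the decoder's token stream and keys/values come from the retrieval features. This is a minimal stand-in (the real layers are multi-head with learned Q/K/V projections):

```python
import numpy as np

def cross_attention(queries, retrieved):
    """Single-head cross-attention: language tokens attend to retrieval features.

    queries:   (n_q, d) decoder hidden states
    retrieved: (n_kv, d) retrieval features
    Learned projection matrices are omitted for brevity.
    """
    d = queries.shape[-1]
    scores = queries @ retrieved.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over kv tokens
    return weights @ retrieved

rng = np.random.default_rng(1)
q = rng.normal(size=(4, 8))       # 4 decoder tokens
kv = rng.normal(size=(3, 8))      # 3 retrieved feature vectors
out = cross_attention(q, kv)
```

With a single retrieved vector, every query token simply receives that vector, which is a quick sanity check on the softmax.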

[CV-48] Efficient Camera Exposure Control for Visual Odometry via Deep Reinforcement Learning

链接: https://arxiv.org/abs/2408.17005
作者: Shuyang Zhang,Jinhao He,Yilong Zhu,Jin Wu,Jie Yuan
关键词-EN: degraded image quality, stability of visual, undermined by degraded, image quality, degraded image
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注: 8 pages, 7 figures

点击查看摘要

Abstract:The stability of visual odometry (VO) systems is undermined by degraded image quality, especially in environments with significant illumination changes. This study employs a deep reinforcement learning (DRL) framework to train agents for exposure control, aiming to enhance imaging performance in challenging conditions. A lightweight image simulator is developed to facilitate the training process, enabling the diversification of image exposure and sequence trajectory. This setup enables completely offline training, eliminating the need for direct interaction with camera hardware and real environments. Different levels of reward functions are crafted to enhance the VO systems, equipping the DRL agents with varying intelligence. Extensive experiments have shown that our exposure control agents achieve superior efficiency, with an average inference duration of 1.58 ms per frame on a CPU, and respond more quickly than traditional feedback control schemes. By choosing an appropriate reward function, agents acquire an intelligent understanding of motion trends and anticipate future illumination changes. This predictive capability allows VO systems to deliver more stable and precise odometry results. The codes and datasets are available at this https URL.
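The abstract does not specify the reward functions. One plausible low-level reward for exposure control scores how much of the frame stays within a well-exposed intensity band, penalizing under- and over-exposure alike; the band limits below are purely illustrative assumptions, not the paper's values:

```python
import numpy as np

def exposure_reward(image, low=0.1, high=0.9):
    """Fraction of pixels inside a well-exposed band of normalized
    intensities in [0, 1]. A hypothetical reward shape for a DRL
    exposure-control agent; band limits are assumed, not from the paper.
    """
    img = np.asarray(image, dtype=float)
    ok = (img >= low) & (img <= high)
    return float(ok.mean())

dark = np.full((4, 4), 0.02)       # underexposed frame
good = np.full((4, 4), 0.5)        # well-exposed frame
```

Higher-level rewards (the "varying intelligence" the paper mentions) would add VO-specific terms such as feature trackability on top of a base term like this.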

[CV-49] AdaptVision: Dynamic Input Scaling in MLLMs for Versatile Scene Understanding

链接: https://arxiv.org/abs/2408.16986
作者: Yonghui Wang,Wengang Zhou,Hao Feng,Houqiang Li
关键词-EN: Multimodal Large Language, Large Language Models, enhance MLLMs’ comprehension, Large Language, Multimodal Large
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Over the past few years, the advancement of Multimodal Large Language Models (MLLMs) has captured the wide interest of researchers, leading to numerous innovations to enhance MLLMs’ comprehension. In this paper, we present AdaptVision, a multimodal large language model specifically designed to dynamically process input images at varying resolutions. We hypothesize that the requisite number of visual tokens for the model is contingent upon both the resolution and content of the input image. Generally, natural images with a lower information density can be effectively interpreted by the model using fewer visual tokens at reduced resolutions. In contrast, images containing textual content, such as documents with rich text, necessitate a higher number of visual tokens for accurate text interpretation due to their higher information density. Building on this insight, we devise a dynamic image partitioning module that adjusts the number of visual tokens according to the size and aspect ratio of images. This method mitigates distortion effects that arise from resizing images to a uniform resolution and dynamically optimizes the visual tokens input to the LLMs. Our model is capable of processing images with resolutions up to 1008×1008. Extensive experiments across various datasets demonstrate that our method achieves impressive performance in handling vision-language tasks in both natural and text-related scenes. The source code and dataset are now publicly available at this https URL.
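A hypothetical version of the dynamic partitioning rule: derive tile counts (and hence visual-token budget) from the clamped resolution and aspect ratio. The 336 px tile size and the clamping policy are assumptions for illustration, not the paper's exact values:

```python
import math

def plan_tiles(width, height, tile=336, max_side=1008):
    """Decide how many image tiles (hence visual tokens) to use.

    The image is first clamped to the model's maximum supported side
    length, then split into ceil(side / tile) tiles per axis, so larger
    or more elongated images receive more visual tokens.
    """
    w, h = min(width, max_side), min(height, max_side)
    cols = max(1, math.ceil(w / tile))
    rows = max(1, math.ceil(h / tile))
    return rows, cols, rows * cols

small = plan_tiles(320, 240)     # low-detail natural image -> single tile
doc = plan_tiles(1008, 1008)     # dense document at max resolution -> 3x3 tiles
```

A wide panorama such as `plan_tiles(2000, 400)` is clamped to 1008 px on its long side and still gets a 2x3 grid, preserving aspect ratio instead of squashing it into one square.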

[CV-50] 2DGH: 2D Gaussian-Hermite Splatting for High-quality Rendering and Better Geometry Reconstruction

链接: https://arxiv.org/abs/2408.16982
作者: Ruihan Yu,Tianyu Huang,Jingwang Ling,Feng Xu
关键词-EN: Gaussian Splatting, Gaussian splatting methods, Gaussian Splatting kernels, current Gaussian splatting, geometry reconstruction simultaneously
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注:

点击查看摘要

Abstract:2D Gaussian Splatting has recently emerged as a significant method in 3D reconstruction, enabling novel view synthesis and geometry reconstruction simultaneously. While the well-known Gaussian kernel is broadly used, its lack of anisotropy and deformation ability leads to dim and vague edges at object silhouettes, limiting the reconstruction quality of current Gaussian splatting methods. To enhance the representation power, we draw inspiration from quantum physics and propose to use the Gaussian-Hermite kernel as the new primitive in Gaussian splatting. The new kernel takes a unified mathematical form and extends the Gaussian function, which serves as the zero-rank term in the updated formulation. Our experiments demonstrate the extraordinary performance of Gaussian-Hermite kernel in both geometry reconstruction and novel-view synthesis tasks. The proposed kernel outperforms traditional Gaussian Splatting kernels, showcasing its potential for high-quality 3D reconstruction and rendering.
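In one dimension the Gaussian-Hermite family can be written as H_n(x)·exp(−x²/2), with the rank-0 member reducing to the plain Gaussian, which matches the abstract's "zero-rank term". This is only a 1D sketch of the kernel's extra shape freedom; the paper's 2D splatting form with learned covariances is more involved:

```python
import numpy as np

def hermite(n, x):
    """Physicists' Hermite polynomial H_n via the recurrence
    H_{k+1}(x) = 2x H_k(x) - 2k H_{k-1}(x)."""
    h_prev, h = np.ones_like(x), 2.0 * x
    if n == 0:
        return h_prev
    for k in range(1, n):
        h_prev, h = h, 2.0 * x * h - 2.0 * k * h_prev
    return h

def gaussian_hermite(n, x):
    """Rank-n 1D Gaussian-Hermite kernel; n = 0 is the ordinary Gaussian."""
    return hermite(n, x) * np.exp(-x ** 2 / 2.0)

x = np.linspace(-3.0, 3.0, 7)
rank0 = gaussian_hermite(0, x)   # identical to exp(-x^2 / 2)
rank2 = gaussian_hermite(2, x)   # sign-changing profile, sharper lobes
```

Higher ranks add oscillating lobes to the smooth Gaussian bump, which is the extra deformation ability the abstract credits for sharper silhouettes.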

[CV-51] Cross Fusion RGB-T Tracking with Bi-directional Adapter

链接: https://arxiv.org/abs/2408.16979
作者: Zhirong Zeng,Xiaotao Liu,Meng Sun,Hongyu Wang,Jing Liu
关键词-EN: achieved remarkable results, achieved remarkable, remarkable results, temporal information, cross spatio-temporal
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Many state-of-the-art RGB-T trackers have achieved remarkable results through modality fusion. However, these trackers often either overlook temporal information or fail to fully utilize it, resulting in an ineffective balance between multi-modal and temporal information. To address this issue, we propose a novel Cross Fusion RGB-T Tracking architecture (CFBT) that ensures the full participation of multiple modalities in tracking while dynamically fusing temporal information. The effectiveness of CFBT relies on three newly designed cross spatio-temporal information fusion modules: Cross Spatio-Temporal Augmentation Fusion (CSTAF), Cross Spatio-Temporal Complementarity Fusion (CSTCF), and Dual-Stream Spatio-Temporal Adapter (DSTA). CSTAF employs a cross-attention mechanism to enhance the feature representation of the template comprehensively. CSTCF utilizes complementary information between different branches to enhance target features and suppress background features. DSTA adopts the adapter concept to adaptively fuse complementary information from multiple branches within the transformer layer, using the RGB modality as a medium. These ingenious fusions of multiple perspectives introduce less than 0.3% of the total model parameters, yet they enable an efficient balance between multi-modal and temporal information. Extensive experiments on three popular RGB-T tracking benchmarks demonstrate that our method achieves new state-of-the-art performance.

[CV-52] Synthetic Lunar Terrain: A Multimodal Open Dataset for Training and Evaluating Neuromorphic Vision Algorithms

链接: https://arxiv.org/abs/2408.16971
作者: Marcus Märtens,Kevin Farries,John Culton,Tat-Jun Chin
关键词-EN: Synthetic Lunar Terrain, featuring synthetic craters, high-contrast lighting setup, open dataset collected, analogue test site
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 7 pages, 5 figures, to be published at the International Symposium on Artificial Intelligence, Robotics and Automation in Space (i-SAIRAS), 2024

点击查看摘要

Abstract:Synthetic Lunar Terrain (SLT) is an open dataset collected from an analogue test site for lunar missions, featuring synthetic craters in a high-contrast lighting setup. It includes several side-by-side captures from event-based and conventional RGB cameras, supplemented with a high-resolution 3D laser scan for depth estimation. The event stream recorded from the neuromorphic vision sensor of the event-based camera is of particular interest, as this emerging technology provides several unique advantages, such as high data rates, low energy consumption and resilience towards scenes of high dynamic range. SLT provides a solid foundation to analyse the limits of RGB cameras and the potential advantages or synergies of utilizing neuromorphic vision, with the goal of enabling and improving lunar-specific applications like rover navigation or landing in cratered environments.

[CV-53] Contrastive Learning with Synthetic Positives

链接: https://arxiv.org/abs/2408.16965
作者: Dewen Zeng,Yawen Wu,Xinrong Hu,Xiaowei Xu,Yiyu Shi
关键词-EN: efficient self-supervised learning, nearest neighbor algorithm, nearest neighbor, techniques by utilizing, Contrastive learning
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 8 pages, conference

点击查看摘要

Abstract:Contrastive learning with the nearest neighbor has proved to be one of the most efficient self-supervised learning (SSL) techniques by utilizing the similarity of multiple instances within the same class. However, its efficacy is constrained as the nearest neighbor algorithm primarily identifies “easy” positive pairs, where the representations are already closely located in the embedding space. In this paper, we introduce a novel approach called Contrastive Learning with Synthetic Positives (CLSP) that utilizes synthetic images, generated by an unconditional diffusion model, as the additional positives to help the model learn from diverse positives. Through feature interpolation in the diffusion model sampling process, we generate images with distinct backgrounds yet similar semantic content to the anchor image. These images are considered “hard” positives for the anchor image, and when included as supplementary positives in the contrastive loss, they contribute to a performance improvement of over 2% and 1% in linear evaluation compared to the previous NNCLR and All4One methods across multiple benchmark datasets such as CIFAR10, achieving state-of-the-art performance. On transfer learning benchmarks, CLSP outperforms existing SSL frameworks on 6 out of 8 downstream datasets. We believe CLSP establishes a valuable baseline for future SSL studies incorporating synthetic data in the training process.
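The feature-interpolation step can be pictured as blending the anchor's representation with another latent during diffusion sampling. Spherical interpolation (slerp) between unit-normalized vectors is one common choice and is used here purely as an illustrative stand-in for the paper's exact procedure:

```python
import numpy as np

def slerp(a, b, t):
    """Spherical interpolation between two feature vectors (unit-normalized).

    t = 0 returns normalized a, t = 1 returns normalized b; intermediate
    t values stay on the unit sphere, yielding features that mix the
    anchor's semantics with another latent (a "hard" positive candidate).
    """
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    omega = np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))
    if omega < 1e-8:                       # nearly parallel: fall back to a
        return a
    return (np.sin((1 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)

rng = np.random.default_rng(2)
anchor, other = rng.normal(size=16), rng.normal(size=16)
mixed = slerp(anchor, other, 0.3)        # mostly anchor, partly the other latent
```

Decoding such an interpolated feature through the diffusion model would give an image semantically close to the anchor but visually distinct, i.e. the kind of synthetic positive the abstract describes.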

[CV-54] Causal Representation-Based Domain Generalization on Gaze Estimation

链接: https://arxiv.org/abs/2408.16964
作者: Younghan Kim,Kangryun Moon,Yongjun Park,Yonggyu Kim
关键词-EN: significantly enhanced gaze, gaze estimation accuracy, enhanced gaze estimation, gaze estimation, availability of extensive
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The availability of extensive datasets containing gaze information for each subject has significantly enhanced gaze estimation accuracy. However, the discrepancy between domains severely affects a model’s performance explicitly trained for a particular domain. In this paper, we propose the Causal Representation-Based Domain Generalization on Gaze Estimation (CauGE) framework designed based on the general principle of causal mechanisms, which is consistent with the domain difference. We employ an adversarial training manner and an additional penalizing term to extract domain-invariant features. After extracting features, we position the attention layer to make features sufficient for inferring the actual gaze. By leveraging these modules, CauGE ensures that the neural networks learn from representations that meet the causal mechanisms’ general principles. By this, CauGE generalizes across domains by extracting domain-invariant features, and spurious correlations cannot influence the model. Our method achieves state-of-the-art performance in the domain generalization on gaze estimation benchmark.

[CV-55] HiTSR: A Hierarchical Transformer for Reference-based Super-Resolution

链接: https://arxiv.org/abs/2408.16959
作者: Masoomeh Aslahishahri,Jordan Ubbens,Ian Stavness
关键词-EN: learning matching correspondences, high-resolution reference images, hierarchical transformer model, propose HiTSR, low-resolution input images
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: arXiv admin note: text overlap with arXiv:2307.08837

点击查看摘要

Abstract:In this paper, we propose HiTSR, a hierarchical transformer model for reference-based image super-resolution, which enhances low-resolution input images by learning matching correspondences from high-resolution reference images. Diverging from existing multi-network, multi-stage approaches, we streamline the architecture and training pipeline by incorporating the double attention block from GAN literature. Processing two visual streams independently, we fuse self-attention and cross-attention blocks through a gating attention strategy. The model integrates a squeeze-and-excitation module to capture global context from the input images, facilitating long-range spatial interactions within window-based attention blocks. Long skip connections between shallow and deep layers further enhance information flow. Our model demonstrates superior performance across three datasets including SUN80, Urban100, and Manga109. Specifically, on the SUN80 dataset, our model achieves PSNR/SSIM values of 30.24/0.821. These results underscore the effectiveness of attention mechanisms in reference-based image super-resolution. The transformer-based model attains state-of-the-art results without the need for purpose-built subnetworks, knowledge distillation, or multi-stage training, emphasizing the potency of attention in meeting reference-based image super-resolution requirements.

[CV-56] Transient Fault Tolerant Semantic Segmentation for Autonomous Driving ECCV2024

链接: https://arxiv.org/abs/2408.16952
作者: Leonardo Iurada,Niccolò Cavagnero,Fernando Fernandes Dos Santos,Giuseppe Averta,Paolo Rech,Tatiana Tommasi
关键词-EN: Deep learning models, Deep learning, autonomous vehicle perception, vehicle perception, reliability is challenged
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted ECCV 2024 UnCV Workshop - this https URL

点击查看摘要

Abstract:Deep learning models are crucial for autonomous vehicle perception, but their reliability is challenged by algorithmic limitations and hardware faults. We address the latter by examining fault-tolerance in semantic segmentation models. Using established hardware fault models, we evaluate existing hardening techniques both in terms of accuracy and uncertainty and introduce ReLUMax, a novel simple activation function designed to enhance resilience against transient faults. ReLUMax integrates seamlessly into existing architectures without time overhead. Our experiments demonstrate that ReLUMax effectively improves robustness, preserving performance and boosting prediction confidence, thus contributing to the development of reliable autonomous driving systems.
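The abstract does not give ReLUMax's exact formulation. One plausible reading, a ReLU whose output is clipped to a bounded ceiling so a transient bit-flip cannot inject an arbitrarily large activation, can be sketched as follows (the ceiling value, and how the real ReLUMax chooses it, are assumptions):

```python
import numpy as np

def relumax(x, ceiling=6.0):
    """Clipped ReLU: activations are confined to [0, ceiling].

    A single flipped high-order bit in a preceding layer then saturates
    at `ceiling` instead of propagating a huge value downstream. The
    specific ceiling here is illustrative, not the paper's choice.
    """
    return np.minimum(np.maximum(x, 0.0), ceiling)

acts = np.array([-3.0, 0.5, 4.0, 1e12])   # 1e12 mimics a corrupted activation
safe = relumax(acts)
```

Like a plain ReLU, this adds no runtime overhead beyond an extra elementwise minimum, which is consistent with the abstract's "without time overhead" claim.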

[CV-57] VLM-KD: Knowledge Distillation from VLM for Long-Tail Visual Recognition

链接: https://arxiv.org/abs/2408.16930
作者: Zaiwei Zhang,Gregory P. Meyer,Zhichao Lu,Ashish Shrivastava,Avinash Ravichandran,Eric M. Wolff
关键词-EN: typically involves transferring, smaller student model, well-trained teacher model, involves transferring knowledge, distillation typically involves
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:For visual recognition, knowledge distillation typically involves transferring knowledge from a large, well-trained teacher model to a smaller student model. In this paper, we introduce an effective method to distill knowledge from an off-the-shelf vision-language model (VLM), demonstrating that it provides novel supervision in addition to those from a conventional vision-only teacher model. Our key technical contribution is the development of a framework that generates novel text supervision and distills free-form text into a vision encoder. We showcase the effectiveness of our approach, termed VLM-KD, across various benchmark datasets, showing that it surpasses several state-of-the-art long-tail visual classifiers. To our knowledge, this work is the first to utilize knowledge distillation with text supervision generated by an off-the-shelf VLM and apply it to vanilla randomly initialized vision encoders.
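Distilling free-form text into a vision encoder could, under the simplest reading, be a cosine-alignment loss between the student's image features and the VLM's text embeddings for the same samples. The function name and the exact loss form are assumptions, not the paper's specification:

```python
import numpy as np

def text_distillation_loss(student_feats, teacher_text_feats):
    """Mean cosine distance between student image features and the
    VLM-generated text embeddings they are paired with (row-wise).

    0 means perfect alignment; 2 means maximally opposed directions.
    """
    s = student_feats / np.linalg.norm(student_feats, axis=1, keepdims=True)
    t = teacher_text_feats / np.linalg.norm(teacher_text_feats, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(s * t, axis=1)))

rng = np.random.default_rng(3)
feats = rng.normal(size=(8, 32))
aligned_loss = text_distillation_loss(feats, feats)      # perfectly aligned
opposed_loss = text_distillation_loss(feats, -feats)     # maximally misaligned
```

This term would be added alongside the usual classification loss, giving the vision encoder the "novel text supervision" the abstract highlights.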

[CV-58] Enhancing Autism Spectrum Disorder Early Detection with the Parent-Child Dyads Block-Play Protocol and an Attention-enhanced GCN-xLSTM Hybrid Deep Learning Framework

链接: https://arxiv.org/abs/2408.16924
作者: Xiang Li,Lizhou Fan,Hanbo Wu,Kunping Chen,Xiaoxiao Yu,Chao Che,Zhifeng Cai,Xiuhong Niu,Aihua Cao,Xin Ma
关键词-EN: Autism Spectrum Disorder, growing neurodevelopmental disorder, Autism Spectrum, Spectrum Disorder, rapidly growing neurodevelopmental
类目: Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET)
*备注: 18 pages, 8 figures, and 4 tables

点击查看摘要

Abstract:Autism Spectrum Disorder (ASD) is a rapidly growing neurodevelopmental disorder. Performing a timely intervention is crucial for the growth of young children with ASD, but traditional clinical screening methods lack objectivity. This study introduces an innovative approach to early detection of ASD. The contributions are threefold. First, this work proposes a novel Parent-Child Dyads Block-Play (PCB) protocol, grounded in kinesiological and neuroscientific research, to identify behavioral patterns distinguishing ASD from typically developing (TD) toddlers. Second, we have compiled a substantial video dataset, featuring 40 ASD and 89 TD toddlers engaged in block play with parents. This dataset exceeds previous efforts on both the scale of participants and the length of individual sessions. Third, our approach to action analysis in videos employs a hybrid deep learning framework, integrating a two-stream graph convolution network with attention-enhanced xLSTM (2sGCN-AxLSTM). This framework is adept at capturing dynamic interactions between toddlers and parents by extracting spatial features correlated with upper body and head movements and focusing on global contextual information of action sequences over time. By learning these global features with spatio-temporal correlations, our 2sGCN-AxLSTM effectively analyzes dynamic human behavior patterns and demonstrates an unprecedented accuracy of 89.6% in early detection of ASD. Our approach shows strong potential for enhancing early ASD diagnosis by accurately analyzing parent-child interactions, providing a critical tool to support timely and informed clinical decision-making.

[CV-59] Ig3D: Integrating 3D Face Representations in Facial Expression Inference ECCV

链接: https://arxiv.org/abs/2408.16907
作者: Lu Dong,Xiao Wang,Srirangaraj Setlur,Venu Govindaraju,Ifeoma Nwogu
关键词-EN: advances in animation, virtual reality, geometry from single, single images, images has allowed
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by ECCVW 2024

点击查看摘要

Abstract:Reconstructing 3D faces with facial geometry from single images has allowed for major advances in animation, generative models, and virtual reality. However, this ability to represent faces with their 3D features is not as fully explored by the facial expression inference (FEI) community. This study therefore aims to investigate the impacts of integrating such 3D representations into the FEI task, specifically for facial expression classification and face-based valence-arousal (VA) estimation. To accomplish this, we first assess the performance of two 3D face representations (both based on the 3D morphable model, FLAME) for the FEI tasks. We further explore two fusion architectures, intermediate fusion and late fusion, for integrating the 3D face representations with existing 2D inference frameworks. To evaluate our proposed architecture, we extract the corresponding 3D representations and perform extensive tests on the AffectNet and RAF-DB datasets. Our experimental results demonstrate that our proposed method outperforms the state-of-the-art AffectNet VA estimation and RAF-DB classification tasks. Moreover, our method can act as a complement to other existing methods to boost performance in many emotion inference tasks.
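Of the two fusion architectures explored, late fusion is the simpler to picture: each branch (2D and 3D) produces its own prediction, and the per-branch class logits are combined afterwards. A hedged sketch, with the weighting scheme as an assumption:

```python
import numpy as np

def late_fuse(logits_2d, logits_3d, alpha=0.5):
    """Late fusion: weighted combination of per-branch class logits,
    applied after each branch has made its own prediction.
    alpha weights the 2D branch; 0.5 is an illustrative default.
    """
    return alpha * np.asarray(logits_2d) + (1.0 - alpha) * np.asarray(logits_3d)

fused = late_fuse([2.0, 0.0, -1.0], [0.0, 1.0, -1.0])
pred = int(np.argmax(fused))   # class chosen after fusion
```

Intermediate fusion would instead concatenate or mix the branches' features before the classification head, trading modularity for richer cross-modal interaction.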

[CV-60] Tex-ViT: A Generalizable Robust Texture-based dual-branch cross-attention deepfake detector

链接: https://arxiv.org/abs/2408.16892
作者: Deepak Dagar,Dinesh Kumar Vishwakarma
关键词-EN: realistic facial modification, produce highly realistic, highly realistic facial, facial modification, prevailing method
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deepfakes, which employ GANs to produce highly realistic facial modifications, are widely regarded as the prevailing method of facial manipulation. Traditional CNNs have been able to identify bogus media, but they struggle to perform well across different datasets and are vulnerable to adversarial attacks due to their lack of robustness. Vision transformers have demonstrated potential in the realm of image classification problems, but they require sufficient training data. Motivated by these limitations, this publication introduces Tex-ViT (Texture-Vision Transformer), which enhances CNN features by combining a ResNet with a vision transformer. The model combines traditional ResNet features with a texture module that operates in parallel on sections of ResNet before each down-sampling operation. The texture module then serves as input to the dual branch of the cross-attention vision transformer. It specifically focuses on improving the global texture module, which extracts feature map correlations. Empirical analysis reveals that fake images exhibit smooth textures that do not remain consistent over long distances under manipulation. Experiments were performed on different categories of FF++, such as DF, f2f, FS, and NT, together with other types of GAN datasets in cross-domain scenarios. Furthermore, experiments were also conducted on the FF++, DFDCPreview, and Celeb-DF datasets under several post-processing conditions, such as blurring, compression, and noise. The model surpassed the most advanced models in terms of generalization, achieving 98% accuracy in cross-domain scenarios. This demonstrates its ability to learn the shared distinguishing textural characteristics of manipulated samples. These experiments provide evidence that the proposed model can be applied to various situations and is resistant to many post-processing procedures.
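The dual-branch design hinges on cross-attention: queries from one branch (e.g. the texture stream) attend over keys and values produced by the other. A minimal, dependency-free sketch of that attention pattern is below; the toy dimensions and branch roles are illustrative, not the paper's actual module:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention where queries come from one branch
    and keys/values from the other branch (the cross-attention pattern)."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out
```

In the two-branch setting each branch forms its own queries and attends to the other's keys and values, so texture cues can modulate the RGB stream and vice versa.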

[CV-61] Revising Multimodal VAEs with Diffusion Decoders

链接: https://arxiv.org/abs/2408.16883
作者: Daniel Wesego,Amirmohammad Rooshenas
关键词-EN: generating high-quality outputs, high-quality outputs, struggle with generating, generating high-quality, challenge that extends
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Multimodal VAEs often struggle with generating high-quality outputs, a challenge that extends beyond the inherent limitations of the VAE framework. The core issue lies in the restricted joint representation of the latent space, particularly when complex modalities like images are involved. Feedforward decoders, commonly used for these intricate modalities, inadvertently constrain the joint latent space, leading to a degradation in the quality of the other modalities as well. Although recent studies have shown improvement by introducing modality-specific representations, the issue remains significant. In this work, we demonstrate that incorporating a flexible diffusion decoder specifically for the image modality not only enhances the generation quality of the images but also positively impacts the performance of the other modalities that rely on feedforward decoders. This approach addresses the limitations imposed by conventional joint representations and opens up new possibilities for improving multimodal generation tasks using the multimodal VAE framework. Our model provides state-of-the-art results compared to other multimodal VAEs on different datasets, with higher coherence and superior quality in the generated modalities.

[CV-62] FineFACE: Fair Facial Attribute Classification Leveraging Fine-grained Features

链接: https://arxiv.org/abs/2408.16881
作者: Ayesha Manzoor,Ajita Rattani
关键词-EN: Published research highlights, darker skin tones, Published research, automated facial attribute, attribute classification algorithms
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Published research highlights the presence of demographic bias in automated facial attribute classification algorithms, particularly impacting women and individuals with darker skin tones. Existing bias mitigation techniques typically require demographic annotations and often incur a trade-off between fairness and accuracy, i.e., Pareto inefficiency. Facial attributes, whether common ones like gender or others such as “chubby” or “high cheekbones”, exhibit high interclass similarity and intraclass variation across demographics, leading to unequal accuracy. Differentiating them therefore requires fine-grained analysis of local and subtle cues. This paper proposes a novel approach to fair facial attribute classification by framing it as a fine-grained classification problem. Our approach effectively integrates both low-level local features (like edges and color) and high-level semantic features (like shapes and structures) through cross-layer mutual attention learning. Here, shallow to deep CNN layers function as experts, offering category predictions and attention regions. An exhaustive evaluation on facial attribute annotated datasets demonstrates that our FineFACE model improves accuracy by 1.32% to 1.74% and fairness by 67% to 83.6% over SOTA bias mitigation techniques. Importantly, our approach obtains a Pareto-efficient balance between accuracy and fairness between demographic groups. In addition, our approach does not require demographic annotations and is applicable to diverse downstream classification tasks. To facilitate reproducibility, the code and dataset information is available at this https URL.

[CV-63] MSLIQA: Enhancing Learning Representations for Image Quality Assessment through Multi-Scale Learning

链接: https://arxiv.org/abs/2408.16879
作者: Nasim Jamshidi Avanaki,Abhijay Ghildiyal,Nabajeet Barman,Saman Zadtootaghaj
关键词-EN: Image Quality Assessment, No-Reference Image Quality, Quality Assessment, challenging task due, large annotated datasets
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:No-Reference Image Quality Assessment (NR-IQA) remains a challenging task due to the diversity of distortions and the lack of large annotated datasets. Many studies have attempted to tackle these challenges by developing more accurate NR-IQA models, often employing complex and computationally expensive networks, or by bridging the domain gap between various distortions to enhance performance on test datasets. In our work, we improve the performance of a generic lightweight NR-IQA model by introducing a novel augmentation strategy that boosts its performance by almost 28%. This augmentation strategy enables the network to better discriminate between different distortions in various parts of the image by zooming in and out. Additionally, the inclusion of test-time augmentation further enhances performance, making our lightweight network’s results comparable to the current state-of-the-art models, simply through the use of augmentations.
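The zoom-in/zoom-out idea can be sketched as a crop-then-resize augmentation. The following is a minimal, dependency-free illustration on an image stored as a 2D list; the zoom fraction and nearest-neighbour resize are illustrative choices, not the paper's exact pipeline:

```python
def center_crop(img, frac):
    """Keep the central frac-by-frac region of a 2D image."""
    h, w = len(img), len(img[0])
    ch, cw = max(1, int(h * frac)), max(1, int(w * frac))
    top, left = (h - ch) // 2, (w - cw) // 2
    return [row[left:left + cw] for row in img[top:top + ch]]

def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize back to the target resolution."""
    h, w = len(img), len(img[0])
    return [[img[i * h // out_h][j * w // out_w] for j in range(out_w)]
            for i in range(out_h)]

def zoom_augment(img, frac):
    """'Zoom in': crop the central region, then resize back to the input size."""
    return resize_nearest(center_crop(img, frac), len(img), len(img[0]))
```

Applying such zooms at several scales exposes the network to the same distortion at different spatial extents, which is the intuition behind the multi-scale strategy.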

[CV-64] GameIR: A Large-Scale Synthesized Ground-Truth Dataset for Image Restoration over Gaming Content

链接: https://arxiv.org/abs/2408.16866
作者: Lebin Zhou,Kun Han,Nam Ling,Wei Wang,Wei Jiang
关键词-EN: NVIDIA DLSS, products like NVIDIA, cloud gaming products, commercial cloud gaming, gaming content
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Image restoration methods like super-resolution and image synthesis have been successfully used in commercial cloud gaming products like NVIDIA’s DLSS. However, restoration over gaming content is not well studied by the general public. The discrepancy is mainly caused by the lack of ground-truth gaming training data that match the test cases. Due to the unique characteristics of gaming content, the common approach of generating pseudo training data by degrading the original HR images results in inferior restoration performance. In this work, we develop GameIR, a large-scale high-quality computer-synthesized ground-truth dataset to fill this gap, targeting two different applications. The first is super-resolution with deferred rendering, to support the gaming solution of rendering and transferring LR images only and restoring HR images on the client side. We provide 19,200 LR-HR paired ground-truth frames coming from 640 videos rendered at 720p and 1440p for this task. The second is novel view synthesis (NVS), to support the multiview gaming solution of rendering and transferring part of the multiview frames and generating the remaining frames on the client side. This task has 57,600 HR frames from 960 videos of 160 scenes with 6 camera views. In addition to the RGB frames, the GBuffers during the deferred rendering stage are also provided, which can be used to help restoration. Furthermore, we evaluate several SOTA super-resolution algorithms and NeRF-based NVS algorithms over our dataset, which demonstrates the effectiveness of our ground-truth GameIR data in improving restoration performance for gaming content. Also, we test the method of incorporating the GBuffers as additional input information to help super-resolution and NVS. We release our dataset and models to the general public to facilitate research on restoration methods over gaming content.

[CV-65] Enabling Local Editing in Diffusion Models by Joint and Individual Component Analysis

链接: https://arxiv.org/abs/2408.16845
作者: Theodoros Kouzelis,Manos Plitsis,Mihalis A. Nikolaou,Yannis Panagakis
关键词-EN: Generative Adversarial Networks, Diffusion Models, Generative Adversarial, advances in Diffusion, competitor to Generative
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Code available here: this https URL

点击查看摘要

Abstract:Recent advances in Diffusion Models (DMs) have led to significant progress in visual synthesis and editing tasks, establishing them as a strong competitor to Generative Adversarial Networks (GANs). However, the latent space of DMs is not as well understood as that of GANs. Recent research has focused on unsupervised semantic discovery in the latent space of DMs by leveraging the bottleneck layer of the denoising network, which has been shown to exhibit properties of a semantic latent space. However, these approaches are limited to discovering global attributes. In this paper, we address the challenge of local image manipulation in DMs and introduce an unsupervised method to factorize the latent semantics learned by the denoising network of pre-trained DMs. Given an arbitrary image and defined regions of interest, we utilize the Jacobian of the denoising network to establish a relation between the regions of interest and their corresponding subspaces in the latent space. Furthermore, we disentangle the joint and individual components of these subspaces to identify latent directions that enable local image manipulation. Once discovered, these directions can be applied to different images to produce semantically consistent edits, making our method suitable for practical applications. Experimental results on various datasets demonstrate that our method can produce semantic edits that are more localized and have better fidelity compared to the state-of-the-art.
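The core computation, relating a region of interest to latent directions through the Jacobian of the denoising network, can be illustrated on a toy stand-in. Below, a finite-difference Jacobian of a hypothetical function f (its outputs playing the role of pixels inside the region) is paired with power iteration on J^T J to recover the latent direction that most affects that region; a real implementation would use automatic differentiation on the actual denoiser, so everything here is a sketch:

```python
import math

def jacobian_fd(f, z, eps=1e-5):
    """Finite-difference Jacobian of f at z; rows index outputs (pixels in
    the region of interest), columns index latent dimensions."""
    f0 = f(z)
    J = [[] for _ in f0]
    for j in range(len(z)):
        zp = list(z)
        zp[j] += eps
        fp = f(zp)
        for i in range(len(f0)):
            J[i].append((fp[i] - f0[i]) / eps)
    return J

def dominant_latent_direction(J, iters=200):
    """Power iteration on J^T J: the right-singular direction along which
    the latent most strongly moves the selected outputs."""
    n = len(J[0])
    v = [1.0] * n
    for _ in range(iters):
        Jv = [sum(Ji[j] * v[j] for j in range(n)) for Ji in J]
        v = [sum(J[i][j] * Jv[i] for i in range(len(J))) for j in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return v
```

With a toy map whose first output depends strongly on the first latent coordinate, the recovered direction aligns with that coordinate, mirroring how region-specific Jacobians single out editing directions.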

[CV-66] Fluent and Accurate Image Captioning with a Self-Trained Reward Model ICPR2024

链接: https://arxiv.org/abs/2408.16827
作者: Nicholas Moratelli,Marcella Cornia,Lorenzo Baraldi,Rita Cucchiara
关键词-EN: sequence level, promoting caption quality, image captioning models, Fine-tuning image captioning, hand-crafted rewards
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ICPR 2024

点击查看摘要

Abstract:Fine-tuning image captioning models with hand-crafted rewards like the CIDEr metric has been a classical strategy for promoting caption quality at the sequence level. This approach, however, is known to limit descriptiveness and semantic richness and tends to drive the model towards the style of ground-truth sentences, thus losing detail and specificity. On the contrary, recent attempts to employ image-text models like CLIP as reward have led to grammatically incorrect and repetitive captions. In this paper, we propose Self-Cap, a captioning approach that relies on a learnable reward model based on self-generated negatives that can discriminate captions based on their consistency with the image. Specifically, our discriminator is a fine-tuned contrastive image-text model trained to promote caption correctness while avoiding the aberrations that typically happen when training with a CLIP-based reward. To this end, our discriminator directly incorporates negative samples from a frozen captioner, which significantly improves the quality and richness of the generated captions but also reduces the fine-tuning time in comparison to using the CIDEr score as the sole metric for optimization. Experimental results demonstrate the effectiveness of our training strategy on both standard and zero-shot image captioning datasets.

[CV-67] See or Guess: Counterfactually Regularized Image Captioning ACM-MM2024

链接: https://arxiv.org/abs/2408.16809
作者: Qian Cao,Xu Chen,Ruihua Song,Xiting Wang,Xinting Huang,Yuchen Ren
关键词-EN: generates natural language, natural language descriptions, vision-language research, language descriptions, visual information
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM)
*备注: Accepted by ACM MM 2024

点击查看摘要

Abstract:Image captioning, which generates natural language descriptions of the visual information in an image, is a crucial task in vision-language research. Previous models have typically addressed this task by aligning the generative capabilities of machines with human intelligence through statistical fitting of existing datasets. While effective for normal images, they may struggle to accurately describe those where certain parts of the image are obscured or edited, unlike humans who excel in such cases. These weaknesses, including hallucinations and limited interpretability, often hinder performance in scenarios with shifted association patterns. In this paper, we present a generic image captioning framework that employs causal inference to make existing models more capable of interventional tasks, and counterfactually explainable. Our approach includes two variants leveraging either total effect or natural direct effect. Integrating them into the training process enables models to handle counterfactual scenarios, increasing their generalizability. Extensive experiments on various datasets show that our method effectively reduces hallucinations and improves the model’s faithfulness to images, demonstrating high portability across both small-scale and large-scale image-to-text models. The code is available at this https URL.

[CV-68] STEREO: Towards Adversarially Robust Concept Erasing from Text-to-Image Generation Models

链接: https://arxiv.org/abs/2408.16807
作者: Koushik Srivatsan,Fahad Shamshad,Muzammal Naseer,Karthik Nandakumar
关键词-EN: generating harmful content, proliferation of large-scale, harmful content, rapid proliferation, led to concerns
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project Page: this https URL

点击查看摘要

Abstract:The rapid proliferation of large-scale text-to-image generation (T2IG) models has led to concerns about their potential misuse in generating harmful content. Though many methods have been proposed for erasing undesired concepts from T2IG models, they only provide a false sense of security, as recent works demonstrate that concept-erased models (CEMs) can be easily deceived to generate the erased concept through adversarial attacks. The problem of adversarially robust concept erasing without significant degradation to model utility (ability to generate benign concepts) remains an unresolved challenge, especially in the white-box setting where the adversary has access to the CEM. To address this gap, we propose an approach called STEREO that involves two distinct stages. The first stage searches thoroughly enough for strong and diverse adversarial prompts that can regenerate an erased concept from a CEM, by leveraging robust optimization principles from adversarial training. In the second stage, robustly erase once, we introduce an anchor-concept-based compositional objective to robustly erase the target concept in one go, while attempting to minimize the degradation of model utility. By benchmarking the proposed STEREO approach against four state-of-the-art concept erasure methods under three adversarial attacks, we demonstrate its ability to achieve a better robustness vs. utility trade-off. Our code and models are available at this https URL.

[CV-69] Generative AI Enables Medical Image Segmentation in Ultra Low-Data Regimes

链接: https://arxiv.org/abs/2408.17421
作者: Li Zhang,Basu Jindal,Ahmed Alaa,Robert Weinreb,David Wilson,Eran Segal,James Zou,Pengtao Xie
关键词-EN: Semantic segmentation, treatment planning, deep learning, pivotal in applications, diagnosis and treatment
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Semantic segmentation of medical images is pivotal in applications like disease diagnosis and treatment planning. While deep learning has excelled in automating this task, a major hurdle is the need for numerous annotated segmentation masks, which are resource-intensive to produce due to the required expertise and time. This scenario often leads to ultra low-data regimes, where annotated images are extremely limited, posing significant challenges for the generalization of conventional deep learning methods on test images. To address this, we introduce a generative deep learning framework, which uniquely generates high-quality paired segmentation masks and medical images, serving as auxiliary data for training robust models in data-scarce environments. Unlike traditional generative models that treat data generation and segmentation model training as separate processes, our method employs multi-level optimization for end-to-end data generation. This approach allows segmentation performance to directly influence the data generation process, ensuring that the generated data is specifically tailored to enhance the performance of the segmentation model. Our method demonstrated strong generalization performance across 9 diverse medical image segmentation tasks and on 16 datasets, in ultra-low data regimes, spanning various diseases, organs, and imaging modalities. When applied to various segmentation models, it achieved performance improvements of 10-20% (absolute), in both same-domain and out-of-domain scenarios. Notably, it requires 8 to 20 times less training data than existing methods to achieve comparable results. This advancement significantly improves the feasibility and cost-effectiveness of applying deep learning in medical imaging, particularly in scenarios with limited data availability.

[CV-70] A nonlinear elasticity model in computer vision

链接: https://arxiv.org/abs/2408.17237
作者: John M. Ball,Christopher L. Horner
关键词-EN: vector-valued intensity maps, bounded open subsets, nonlinear elasticity model, elasticity model previously, model previously introduced
类目: Analysis of PDEs (math.AP); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The purpose of this paper is to analyze a nonlinear elasticity model previously introduced by the authors for comparing two images, regarded as bounded open subsets of \R^n together with associated vector-valued intensity maps. Optimal transformations between the images are sought as minimisers of an integral functional among orientation-preserving homeomorphisms. The existence of minimisers is proved under natural coercivity and polyconvexity conditions, assuming only that the intensity functions are bounded measurable. Variants of the existence theorem are also proved, first under the constraint that finite sets of landmark points in the two images are mapped one to the other, and second when one image is to be compared to an unknown part of another. The question is studied as to whether for images related by a linear mapping the unique minimizer is given by that linear mapping. For a natural class of functional integrands an example is given guaranteeing that this property holds for pairs of images in which the second is a scaling of the first by a constant factor. However for the property to hold for arbitrary pairs of linearly related images it is shown that the integrand has to depend on the gradient of the transformation as a convex function of its determinant alone. This suggests a new model in which the integrand depends also on second derivatives of the transformation, and an example is given for which both existence of minimizers is assured and the above property holds for all pairs of linearly related images.

[CV-71] Approximately Invertible Neural Network for Learned Image Compression

链接: https://arxiv.org/abs/2408.17073
作者: Yanbo Gao,Meng Fu,Shuai Li,Chong Lv,Xun Cai,Hui Yuan,Mao Ye
关键词-EN: attracted considerable interests, Learned image compression, image compression, synthesis transform, image
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Learned image compression has attracted considerable interest in recent years. It typically comprises an analysis transform, a synthesis transform, quantization and an entropy coding model. The analysis transform and synthesis transform are used to encode an image to a latent feature and decode the quantized feature to reconstruct the image, and can be regarded as coupled transforms. However, the analysis transform and synthesis transform are designed independently in the existing methods, making them unreliable in high-quality image compression. Inspired by the invertible neural networks in generative modeling, invertible modules are used to construct the coupled analysis and synthesis transforms. Considering that the noise introduced by feature quantization invalidates the invertible process, this paper proposes an Approximately Invertible Neural Network (A-INN) framework for learned image compression. It formulates the rate-distortion optimization in lossy image compression when using an INN with quantization, which differs from using INNs for generative modelling. Generally speaking, A-INN can be used as the theoretical foundation for any INN based lossy compression method. Based on this formulation, A-INN with a progressive denoising module (PDM) is developed to effectively reduce the quantization noise in the decoding. Moreover, a Cascaded Feature Recovery Module (CFRM) is designed to learn high-dimensional feature recovery from low-dimensional ones to further reduce the noise in feature channel compression. In addition, a Frequency-enhanced Decomposition and Synthesis Module (FDSM) is developed by explicitly enhancing the high-frequency components in an image to address the loss of high-frequency information inherent in neural network based image compression. Extensive experiments demonstrate that the proposed A-INN outperforms the existing learned image compression methods.
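Invertible modules of the kind the coupled analysis/synthesis transforms are built from can be illustrated with the simplest case, an additive coupling layer (a generic NICE-style coupling, not the paper's exact block). Note that once the output is quantized, inversion is no longer exact, which is precisely the gap the denoising modules target:

```python
def coupling_forward(x, f):
    """Additive coupling: pass the first half through unchanged and shift
    the second half by an arbitrary function f of the first half."""
    h = len(x) // 2
    x1, x2 = x[:h], x[h:]
    shift = f(x1)
    return x1 + [a + s for a, s in zip(x2, shift)]

def coupling_inverse(y, f):
    """Exact inverse: recompute the shift from the untouched half and subtract it."""
    h = len(y) // 2
    y1, y2 = y[:h], y[h:]
    shift = f(y1)
    return y1 + [a - s for a, s in zip(y2, shift)]
```

Because inversion never needs the inverse of f, the shift function can be an arbitrary deep network; stacking such layers while alternating which half is left untouched mixes all dimensions.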

[CV-72] Disease Classification and Impact of Pretrained Deep Convolution Neural Networks on Diverse Medical Imaging Datasets across Imaging Modalities

链接: https://arxiv.org/abs/2408.17011
作者: Jutika Borah,Kumaresh Sarmah,Hidam Kumarjit Singh
关键词-EN: Chest X-rays, optical coherence tomography, coherence tomography serve, medical imaging, optical coherence
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 15 pages, 3 figures, 4 tables

点击查看摘要

Abstract:Imaging techniques such as Chest X-rays, whole slide images, and optical coherence tomography serve as the initial screening and detection for a wide variety of medical pulmonary and ophthalmic conditions respectively. This paper investigates the intricacies of using pretrained deep convolutional neural networks with transfer learning across diverse medical imaging datasets with varying modalities for binary and multiclass classification. We conducted a comprehensive performance analysis with ten network architectures and model families, each with pretraining and random initialization. Our findings showed that the use of pretrained models as fixed feature extractors yields poor performance irrespective of the datasets. In contrast, histopathology microscopy whole slide images showed better performance. It is also found that deeper and more complex architectures did not necessarily result in the best performance. This observation implies that improvements on ImageNet do not carry over directly to medical imaging tasks. Within a medical domain, the performance of the network architectures varies within model families with shifts in datasets. This indicates that the performance of models within a specific modality may not be conclusive for another modality within the same domain. This study provides a deeper understanding of the applications of deep learning techniques in medical imaging and highlights the impact of pretrained networks across different medical imaging datasets under five different experimental settings.

[CV-73] LV-UNet: A Lightweight and Vanilla Model for Medical Image Segmentation

链接: https://arxiv.org/abs/2408.16886
作者: Juntao Jiang,Mengmeng Wang,Huizhong Tian,Lingbo Cheng,Yong Liu
关键词-EN: medical image segmentation, mobile medical devices, practical applications call, optimization challenges, computer vision
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Despite the progress made by large models in computer vision, optimization challenges, the complexity of transformer models, computational limitations, and the requirements of practical applications call for simpler designs in model architecture for medical image segmentation, especially in mobile medical devices that require lightweight, deployable models with real-time performance. However, some current lightweight models exhibit poor robustness across different datasets, which hinders their broader adoption. This paper proposes a lightweight and vanilla model called LV-UNet, which effectively utilizes pre-trained MobileNetv3-Large models and introduces fusible modules. It can be trained using an improved deep training strategy and switched to deployment mode during inference, reducing both parameter count and computational load. Experiments are conducted on the ISIC 2016, BUSI, CVC-ClinicDB, CVC-ColonDB, and Kvasir-SEG datasets, achieving better performance compared to state-of-the-art and classic models.

[CV-74] Comparative Analysis of Transfer Learning Models for Breast Cancer Classification

链接: https://arxiv.org/abs/2408.16859
作者: Sania Eskandari,Ali Eslamian,Qiang Cheng
关键词-EN: Invasive Ductal Carcinoma, classification of histopathological, early and precise, precise detection, Ductal Carcinoma
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The classification of histopathological images is crucial for the early and precise detection of breast cancer. This study investigates the efficiency of deep learning models in distinguishing between Invasive Ductal Carcinoma (IDC) and non-IDC in histopathology slides. We conducted a thorough comparative examination of eight sophisticated models: ResNet-50, DenseNet-121, ResNeXt-50, Vision Transformer (ViT), GoogLeNet (Inception v3), EfficientNet, MobileNet, and SqueezeNet. This analysis was carried out using a large dataset of 277,524 image patches. Our research makes a substantial contribution to the field by offering a comprehensive assessment of the performance of each model. We particularly highlight the exceptional efficacy of attention-based mechanisms in the ViT model, which achieved a remarkable validation accuracy of 93%, surpassing conventional convolutional networks. This study highlights the promise of advanced machine learning approaches in clinical settings, offering improved precision as well as efficiency in breast cancer diagnosis.

机器学习

[LG-0] Fairness-Aware Estimation of Graphical Models

链接: https://arxiv.org/abs/2408.17396
作者: Zhuoping Zhou,Davoud Ataee Tarzanagh,Bojian Hou,Qi Long,Li Shen
关键词-EN: Ising models, paper examines, examines the issue, graphical models, Covariance
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 32 Pages, 9 Figures

点击查看摘要

Abstract:This paper examines the issue of fairness in the estimation of graphical models (GMs), particularly Gaussian, Covariance, and Ising models. These models play a vital role in understanding complex relationships in high-dimensional data. However, standard GMs can result in biased outcomes, especially when the underlying data involves sensitive characteristics or protected groups. To address this, we introduce a comprehensive framework designed to reduce bias in the estimation of GMs related to protected attributes. Our approach involves the integration of the pairwise graph disparity error and a tailored loss function into a nonsmooth multi-objective optimization problem, striving to achieve fairness across different sensitive groups while maintaining the effectiveness of the GMs. Experimental evaluations on synthetic and real-world datasets demonstrate that our framework effectively mitigates bias without undermining GMs’ performance.

[LG-1] Continual learning with the neural tangent ensemble

链接: https://arxiv.org/abs/2408.17394
作者: Ari S. Benjamin,Christian Pehle,Kyle Daruwalla
关键词-EN: natural strategy, fixed functions, neural, ensemble, Bayesian ensemble
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:A natural strategy for continual learning is to weigh a Bayesian ensemble of fixed functions. This suggests that if a (single) neural network could be interpreted as an ensemble, one could design effective algorithms that learn without forgetting. To realize this possibility, we observe that a neural network classifier with N parameters can be interpreted as a weighted ensemble of N classifiers, and that in the lazy regime limit these classifiers are fixed throughout learning. We term these classifiers the neural tangent experts and show they output valid probability distributions over the labels. We then derive the likelihood and posterior probability of each expert given past data. Surprisingly, we learn that the posterior updates for these experts are equivalent to a scaled and projected form of stochastic gradient descent (SGD) over the network weights. Away from the lazy regime, networks can be seen as ensembles of adaptive experts which improve over time. These results offer a new interpretation of neural networks as Bayesian ensembles of experts, providing a principled framework for understanding and mitigating catastrophic forgetting in continual learning settings.
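The ensemble view follows from a first-order (lazy-regime) Taylor expansion of the network output around its initialization θ₀; a sketch of the identity the abstract describes, with notation chosen here for illustration:

```latex
f(x;\theta) \;\approx\; f(x;\theta_0) \;+\; \sum_{i=1}^{N}
\underbrace{(\theta_i - \theta_{0,i})}_{\text{ensemble weight}}\,
\underbrace{\frac{\partial f(x;\theta_0)}{\partial \theta_i}}_{\text{fixed expert } g_i(x)}
```

In the lazy regime the experts g_i never change during training, so learning only reweights a fixed ensemble of N functions; this is what makes a Bayesian posterior over experts, and its claimed equivalence to scaled, projected SGD on the weights, possible.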

[LG-2] LASSO-MOGAT: A Multi-Omics Graph Attention Framework for Cancer Classification

链接: https://arxiv.org/abs/2408.17384
作者: Fadi Alharbi,Aleksandar Vakanski,Murtada K. Elbashir,Mohanad Mohammed
关键词-EN: gene expression patterns, underpinning cancer development, enhancing our understanding, development and progression, application of machine
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The application of machine learning methods to analyze changes in gene expression patterns has recently emerged as a powerful approach in cancer research, enhancing our understanding of the molecular mechanisms underpinning cancer development and progression. Combining gene expression data with other types of omics data has been reported by numerous works to improve cancer classification outcomes. Despite these advances, effectively integrating high-dimensional multi-omics data and capturing the complex relationships across different biological layers remains challenging. This paper introduces LASSO-MOGAT (LASSO-Multi-Omics Gated ATtention), a novel graph-based deep learning framework that integrates messenger RNA, microRNA, and DNA methylation data to classify 31 cancer types. Utilizing differential expression analysis with LIMMA and LASSO regression for feature selection, and leveraging Graph Attention Networks (GATs) to incorporate protein-protein interaction (PPI) networks, LASSO-MOGAT effectively captures intricate relationships within multi-omics data. Experimental validation using five-fold cross-validation demonstrates the method’s precision, reliability, and capacity for providing comprehensive insights into cancer molecular mechanisms. The computation of attention coefficients for the edges in the graph by the proposed graph-attention architecture based on protein-protein interactions proved beneficial for identifying synergies in multi-omics data for cancer classification.

[LG-3] MoRe Fine-Tuning with 10x Fewer Parameters

链接: https://arxiv.org/abs/2408.17383
作者: Wenxuan Tan,Nicholas Roberts,Tzu-Heng Huang,Jitian Zhao,John Cooper,Samuel Guo,Chengyu Duan,Frederic Sala
关键词-EN: easily specialize large, specialize large pretrained, large pretrained models, Monarch Rectangular Fine-tuning, unlocked the potential
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Parameter-efficient fine-tuning (PEFT) techniques have unlocked the potential to cheaply and easily specialize large pretrained models. However, the most prominent approaches, like low-rank adapters (LoRA), depend on heuristics or rules-of-thumb for their architectural choices – potentially limiting their performance for new models and architectures. This limitation suggests that techniques from neural architecture search could be used to obtain optimal adapter architectures, but these are often expensive and difficult to implement. We address this challenge with Monarch Rectangular Fine-tuning (MoRe), a simple framework to search over adapter architectures that relies on the Monarch matrix class. Theoretically, we show that MoRe is more expressive than LoRA. Empirically, our approach is more parameter-efficient and performant than state-of-the-art PEFTs on a range of tasks and models, with as few as 5% of LoRA’s parameters.
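
The parameter savings from Monarch-class matrices can be seen in a small numpy sketch. The block-diagonal / permutation / block-diagonal product below is one common Monarch parameterization, assumed here for illustration; MoRe's exact adapter layout may differ:

```python
import numpy as np

# Illustrative Monarch-style factorization: two block-diagonal factors joined
# by a stride permutation (an assumed form, not the authors' exact code).
def block_diag(blocks):
    b, m, n = blocks.shape
    out = np.zeros((b * m, b * n))
    for i in range(b):
        out[i * m:(i + 1) * m, i * n:(i + 1) * n] = blocks[i]
    return out

d, b = 16, 4                         # dimension d split into b blocks
rng = np.random.default_rng(0)
L = block_diag(rng.standard_normal((b, d // b, d // b)))
R = block_diag(rng.standard_normal((b, d // b, d // b)))
P = np.eye(d)[np.arange(d).reshape(b, d // b).T.ravel()]  # stride permutation
M = L @ P @ R                        # Monarch-style d x d matrix

dense_params = d * d                 # 256 for a dense layer
monarch_params = 2 * b * (d // b) ** 2  # 128 for the structured factors
```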

[LG-4] Traffic expertise meets residual RL: Knowledge-informed model-based residual reinforcement learning for CAV trajectory control

链接: https://arxiv.org/abs/2408.17380
作者: Zihao Sheng,Zilin Huang,Sikai Chen
关键词-EN: exhibit higher sample, virtual environment model, environment model, higher sample efficiency, sample efficiency
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Model-based reinforcement learning (RL) is anticipated to exhibit higher sample efficiency compared to model-free RL by utilizing a virtual environment model. However, it is challenging to obtain sufficiently accurate representations of the environmental dynamics due to uncertainties in complex systems and environments. An inaccurate environment model may degrade the sample efficiency and performance of model-based RL. Furthermore, while model-based RL can improve sample efficiency, it often still requires substantial training time to learn from scratch, potentially limiting its advantages over model-free approaches. To address these challenges, this paper introduces a knowledge-informed model-based residual reinforcement learning framework aimed at enhancing learning efficiency by infusing established expert knowledge into the learning process and avoiding the issue of beginning from zero. Our approach integrates traffic expert knowledge into a virtual environment model, employing the Intelligent Driver Model (IDM) for basic dynamics and neural networks for residual dynamics, thus ensuring adaptability to complex scenarios. We propose a novel strategy that combines traditional control methods with residual RL, facilitating efficient learning and policy optimization without the need to learn from scratch. The proposed approach is applied to CAV trajectory control tasks for the dissipation of stop-and-go waves in mixed traffic flow. Experimental results demonstrate that our proposed approach enables the CAV agent to achieve superior performance in trajectory control compared to the baseline agents in terms of sample efficiency, traffic flow smoothness and traffic mobility. The source code and supplementary materials are available at this https URL.
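
The IDM component used for the base dynamics follows the standard intelligent-driver formula; a minimal sketch, with generic default parameters rather than the paper's calibration (in the paper, a neural network then models the residual on top of this):

```python
import math

# Standard IDM acceleration: the known "base" dynamics in the framework above.
# v: ego speed, gap: bumper-to-bumper distance, dv: closing speed to the leader.
def idm_acceleration(v, gap, dv, v0=30.0, T=1.5, a_max=1.0, b=1.5,
                     s0=2.0, delta=4.0):
    s_star = s0 + v * T + v * dv / (2 * math.sqrt(a_max * b))  # desired gap
    return a_max * (1 - (v / v0) ** delta - (s_star / gap) ** 2)

a_free = idm_acceleration(v=30.0, gap=1e6, dv=0.0)   # at v0 with a huge gap
a_brake = idm_acceleration(v=10.0, gap=5.0, dv=2.0)  # closing in on a leader
```

At the desired speed with free road, the acceleration vanishes; with a short, closing gap the model brakes hard.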

[LG-5] Exploring the Impact of Environmental Pollutants on Multiple Sclerosis Progression

链接: https://arxiv.org/abs/2408.17376
作者: Elena Marinello,Erica Tavazzi,Enrico Longato,Pietro Bosoni,Arianna Dagliati,Mahin Vazifehdan,Riccardo Bellazzi,Isotta Trescato,Alessandro Guazzo,Martina Vettoretti,Eleonora Tavazzi,Lara Ahmad,Roberto Bergamaschi,Paola Cavalla,Umberto Manera,Adriano Chio,Barbara Di Camillo
关键词-EN: Multiple Sclerosis, inflammatory neurological disorder, neurological disorder characterised, symptom exacerbation, chronic autoimmune
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multiple Sclerosis (MS) is a chronic autoimmune and inflammatory neurological disorder characterised by episodes of symptom exacerbation, known as relapses. In this study, we investigate the role of environmental factors in relapse occurrence among MS patients, using data from the H2020 BRAINTEASER project. We employed predictive models, including Random Forest (RF) and Logistic Regression (LR), with varying sets of input features to predict the occurrence of relapses based on clinical and pollutant data collected over a week. The RF yielded the best result, with an AUC-ROC score of 0.713. Environmental variables, such as precipitation, NO2, PM2.5, humidity, and temperature, were found to be relevant to the prediction.
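
The headline metric, AUC-ROC, reduces to the probability that a randomly chosen positive is scored above a randomly chosen negative; a stdlib-only sketch of that rank (Mann-Whitney) formulation, not the study's pipeline:

```python
# AUC-ROC via the rank formulation: the fraction of positive/negative pairs
# ranked correctly, with ties counted as half (illustrative helper).
def auc_roc(labels, scores):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc_roc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```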

[LG-6] Leveraging Graph Neural Networks to Forecast Electricity Consumption KDD2024 ECML

链接: https://arxiv.org/abs/2408.17366
作者: Eloi Campagne,Yvenn Amara-Ouali,Yannig Goude,Argyris Kalogeratos
关键词-EN: renewable energy sources, Accurate electricity demand, decentralized network paradigm, paradigm introduce greater, introduce greater complexity
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 17 pages, ECML PKDD 2024 Workshop paper

点击查看摘要

Abstract:Accurate electricity demand forecasting is essential for several reasons, especially as the integration of renewable energy sources and the transition to a decentralized network paradigm introduce greater complexity and uncertainty. The proposed methodology leverages graph-based representations to effectively capture the spatial distribution and relational intricacies inherent in this decentralized network structure. This research work offers a novel approach that extends beyond the conventional Generalized Additive Model framework by considering models like Graph Convolutional Networks or Graph SAGE. These graph-based models enable the incorporation of various levels of interconnectedness and information sharing among nodes, where each node corresponds to the combined load (i.e. consumption) of a subset of consumers (e.g. the regions of a country). More specifically, we introduce a range of methods for inferring graphs tailored to consumption forecasting, along with a framework for evaluating the developed models in terms of both performance and explainability. We conduct experiments on electricity forecasting, in both a synthetic and a real framework considering the French mainland regions, and the performance and merits of our approach are discussed.
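
The graph models mentioned (GCN, GraphSAGE) share a message-passing primitive; one symmetric-normalized graph-convolution step over a toy region graph makes the node-level information sharing concrete (a generic GCN sketch, not the paper's architecture):

```python
import numpy as np

# One symmetric-normalized GCN propagation step over a toy region graph,
# where node features are regional loads (generic sketch, not the paper's model).
def gcn_layer(A, H, W):
    A_hat = A + np.eye(A.shape[0])                   # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    return np.maximum(0.0, d_inv_sqrt @ A_hat @ d_inv_sqrt @ H @ W)  # ReLU

A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])  # 3 regions in a line
H = np.array([[1.0], [2.0], [3.0]])                       # current loads
out = gcn_layer(A, H, np.ones((1, 1)))                    # smoothed load features
```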

[LG-7] Hold Me Tight: Stable Encoder-Decoder Design for Speech Enhancement INTERSPEECH2024

链接: https://arxiv.org/abs/2408.17358
作者: Daniel Haider,Felix Perfler,Vincent Lostanlen,Martin Ehler,Peter Balazs
关键词-EN: Convolutional layers, frontend to encode, encode audio signals, Convolutional, encode audio
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: Accepted at INTERSPEECH 2024

点击查看摘要

Abstract:Convolutional layers with 1-D filters are often used as frontend to encode audio signals. Unlike fixed time-frequency representations, they can adapt to the local characteristics of input data. However, 1-D filters on raw audio are hard to train and often suffer from instabilities. In this paper, we address these problems with hybrid solutions, i.e., combining theory-driven and data-driven approaches. First, we preprocess the audio signals via an auditory filterbank, guaranteeing good frequency localization for the learned encoder. Second, we use results from frame theory to define an unsupervised learning objective that encourages energy conservation and perfect reconstruction. Third, we adapt mixed compressed spectral norms as learning objectives to the encoder coefficients. Using these solutions in a low-complexity encoder-mask-decoder model significantly improves the perceptual evaluation of speech quality (PESQ) in speech enhancement.
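
The frame-theoretic objectives, energy conservation and perfect reconstruction, can be checked numerically on a known Parseval frame; the unitary DFT basis below stands in for the learned encoder (an illustration of the properties, not the paper's filterbank):

```python
import numpy as np

# A Parseval frame (here the unitary DFT basis) conserves energy and permits
# perfect reconstruction: the two properties the unsupervised loss encourages.
N = 16
F = np.fft.fft(np.eye(N)) / np.sqrt(N)    # rows act as orthonormal "filters"
x = np.random.default_rng(0).standard_normal(N)

coeffs = F @ x                            # "encoder": analysis coefficients
x_rec = (F.conj().T @ coeffs).real        # "decoder": synthesis
energy_in = np.sum(x ** 2)
energy_out = np.sum(np.abs(coeffs) ** 2)  # equals energy_in for a Parseval frame
```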

[LG-8] C-RADAR: A Centralized Deep Learning System for Intrusion Detection in Software Defined Networks

链接: https://arxiv.org/abs/2408.17356
作者: Osama Mustafa,Khizer Ali,Talha Naqash
关键词-EN: Software Defined Networks, Software Defined, popularity of Software, simplify network management, Defined Networks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:The popularity of Software Defined Networks (SDNs) has grown in recent years, mainly because of their ability to simplify network management and improve network flexibility. However, this also makes them vulnerable to various types of cyber attacks. SDNs work on a centralized control plane which makes them more prone to network attacks. Research has demonstrated that deep learning (DL) methods can be successful in identifying intrusions in conventional networks, but their application in SDNs is still an open research area. In this research, we propose the use of DL techniques for intrusion detection in SDNs. We measure the effectiveness of our method by experimentation on a dataset of network traffic and comparing it to existing techniques. Our results show that the DL-based approach outperforms traditional methods in terms of detection accuracy and computational efficiency. The deep learning architecture that has been used in this research is a Long Short Term Memory Network and Self-Attention based architecture, i.e., LSTM-Attn, which achieves an F1-score of 0.9721. Furthermore, this technique can be trained to detect new attack patterns and improve the overall security of SDNs.

[LG-9] Bidirectional Decoding: Improving Action Chunking via Closed-Loop Resampling

链接: https://arxiv.org/abs/2408.17355
作者: Yuejiang Liu,Jubayer Ibn Hamid,Annie Xie,Yoonho Lee,Maximilian Du,Chelsea Finn
关键词-EN: Predicting and executing, human demonstrations, robot learning, learning from human, Predicting
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Project website: this https URL

点击查看摘要

Abstract:Predicting and executing a sequence of actions without intermediate replanning, known as action chunking, is increasingly used in robot learning from human demonstrations. However, its effects on learned policies remain puzzling: some studies highlight its importance for achieving strong performance, while others observe detrimental effects. In this paper, we first dissect the role of action chunking by analyzing the divergence between the learner and the demonstrator. We find that longer action chunks enable a policy to better capture temporal dependencies by taking into account more past states and actions within the chunk. However, this advantage comes at the cost of exacerbating errors in stochastic environments due to fewer observations of recent states. To address this, we propose Bidirectional Decoding (BID), a test-time inference algorithm that bridges action chunking with closed-loop operations. BID samples multiple predictions at each time step and searches for the optimal one based on two criteria: (i) backward coherence, which favors samples aligned with previous decisions, (ii) forward contrast, which favors samples close to outputs of a stronger policy and distant from those of a weaker policy. By coupling decisions within and across action chunks, BID enhances temporal consistency over extended sequences while enabling adaptive replanning in stochastic environments. Experimental results show that BID substantially outperforms conventional closed-loop operations of two state-of-the-art generative policies across seven simulation benchmarks and two real-world tasks.
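
The two selection criteria can be written down directly. The sketch below scores scalar action candidates with a weighted sum of backward coherence and forward contrast; the criterion names follow the abstract, but the absolute-distance metric and the weights are assumptions, not BID's implementation:

```python
import numpy as np

# BID-style candidate scoring on scalar actions: backward coherence rewards
# agreement with the previous decision; forward contrast rewards being close
# to a strong policy's output and far from a weak one's (simplified sketch).
def select_action(candidates, prev_action, strong_ref, weak_ref,
                  w_back=1.0, w_fwd=1.0):
    scores = [w_back * -abs(a - prev_action)
              + w_fwd * (abs(a - weak_ref) - abs(a - strong_ref))
              for a in candidates]
    return candidates[int(np.argmax(scores))]

rng = np.random.default_rng(0)
cands = rng.normal(0.5, 0.3, size=8)     # sampled starts of action chunks
best = select_action(cands, prev_action=0.5, strong_ref=0.6, weak_ref=-0.2)
```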

[LG-10] Forget to Flourish: Leveraging Machine-Unlearning on Pretrained Language Models for Privacy Leakage

链接: https://arxiv.org/abs/2408.17354
作者: Md Rafi Ur Rashid,Jing Liu,Toshiaki Koike-Akino,Shagufta Mehnaz,Ye Wang
关键词-EN: exposing sensitive information, downstream applications poses, applications poses significant, potentially exposing sensitive, poses significant privacy
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Fine-tuning large language models on private data for downstream applications poses significant privacy risks in potentially exposing sensitive information. Several popular community platforms now offer convenient distribution of a large variety of pre-trained models, allowing anyone to publish without rigorous verification. This scenario creates a privacy threat, as pre-trained models can be intentionally crafted to compromise the privacy of fine-tuning datasets. In this study, we introduce a novel poisoning technique that uses model-unlearning as an attack tool. This approach manipulates a pre-trained language model to increase the leakage of private data during the fine-tuning process. Our method enhances both membership inference and data extraction attacks while preserving model utility. Experimental results across different models, datasets, and fine-tuning setups demonstrate that our attacks significantly surpass baseline performance. This work serves as a cautionary note for users who download pre-trained models from unverified sources, highlighting the potential risks involved.

[LG-11] Evaluating Reliability in Medical DNNs: A Critical Analysis of Feature and Confidence-Based OOD Detection MICCAI2024

链接: https://arxiv.org/abs/2408.17337
作者: Harry Anthony,Konstantinos Kamnitsas
关键词-EN: deep neural networks, OOD, prevent erroneous predictions, medical image analysis, image analysis requires
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted for the Uncertainty for Safe Utilization of Machine Learning in Medical Imaging (UNSURE 2024) workshop at MICCAI 2024

点击查看摘要

Abstract:Reliable use of deep neural networks (DNNs) for medical image analysis requires methods to identify inputs that differ significantly from the training data, called out-of-distribution (OOD), to prevent erroneous predictions. OOD detection methods can be categorised as either confidence-based (using the model’s output layer for OOD detection) or feature-based (not using the output layer). We created two new OOD benchmarks by dividing the D7P (dermatology) and BreastMNIST (ultrasound) datasets into subsets which either contain or don’t contain an artefact (rulers or annotations respectively). Models were trained with artefact-free images, and images with the artefacts were used as OOD test sets. For each OOD image, we created a counterfactual by manually removing the artefact via image processing, to assess the artefact’s impact on the model’s predictions. We show that OOD artefacts can boost a model’s softmax confidence in its predictions, due to correlations in training data among other factors. This contradicts the common assumption that OOD artefacts should lead to more uncertain outputs, an assumption on which most confidence-based methods rely. We use this to explain why feature-based methods (e.g. Mahalanobis score) typically have greater OOD detection performance than confidence-based methods (e.g. MCP). However, we also show that feature-based methods typically perform worse at distinguishing between inputs that lead to correct and incorrect predictions (for both OOD and ID data). Following from these insights, we argue that a combination of feature-based and confidence-based methods should be used within DNN pipelines to mitigate their respective weaknesses. The project’s code and OOD benchmarks are available at: this https URL.
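
The two score families under comparison are easy to state: maximum class probability (confidence-based) versus the Mahalanobis distance to the training feature distribution (feature-based). A sketch on synthetic features, not the paper's benchmark code:

```python
import numpy as np

# Confidence-based score: maximum softmax probability (MCP) of the logits.
def mcp(logits):
    e = np.exp(logits - logits.max())
    return float((e / e.sum()).max())

# Feature-based score: Mahalanobis distance to the training feature cloud
# (larger distance => more OOD-like); one shared Gaussian for simplicity.
def mahalanobis_score(feat, train_feats):
    mu = train_feats.mean(axis=0)
    cov = np.cov(train_feats, rowvar=False) + 1e-6 * np.eye(feat.size)
    diff = feat - mu
    return float(diff @ np.linalg.inv(cov) @ diff)

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(500, 4))   # in-distribution features
in_dist = rng.normal(0.0, 1.0, size=4)
ood = rng.normal(6.0, 1.0, size=4)            # shifted, OOD-like features
```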

[LG-12] Modularity in Transformers: Investigating Neuron Separability & Specialization

链接: https://arxiv.org/abs/2408.17324
作者: Nicholas Pochinkov,Thomas Jones,Mohammed Rashidur Rahman
关键词-EN: workings remains limited, internal workings remains, remains limited, increasingly prevalent, workings remains
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: 11 pages, 6 figures

点击查看摘要

Abstract:Transformer models are increasingly prevalent in various applications, yet our understanding of their internal workings remains limited. This paper investigates the modularity and task specialization of neurons within transformer architectures, focusing on both vision (ViT) and language (Mistral 7B) models. Using a combination of selective pruning and MoEfication clustering techniques, we analyze the overlap and specialization of neurons across different tasks and data subsets. Our findings reveal evidence of task-specific neuron clusters, with varying degrees of overlap between related tasks. We observe that neuron importance patterns persist to some extent even in randomly initialized models, suggesting an inherent structure that training refines. Additionally, we find that neuron clusters identified through MoEfication correspond more strongly to task-specific neurons in earlier and later layers of the models. This work contributes to a more nuanced understanding of transformer internals and offers insights into potential avenues for improving model interpretability and efficiency.

[LG-13] Investigating Neuron Ablation in Attention Heads: The Case for Peak Activation Centering

链接: https://arxiv.org/abs/2408.17322
作者: Nicholas Pochinkov,Ben Pasero,Skylar Shibayama
关键词-EN: rapidly throughout society, growing rapidly, ablation, Abstract, models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注: 9 pages, 2 figures, XAI World Conference 2024 Late-Breaking Work

点击查看摘要

Abstract:The use of transformer-based models is growing rapidly throughout society. With this growth, it is important to understand how they work, and in particular, how the attention mechanisms represent concepts. Though there are many interpretability methods, many look at models through their neuronal activations, which are poorly understood. We describe different lenses through which to view neuron activations, and investigate the effectiveness in language models and vision transformers through various methods of neural ablation: zero ablation, mean ablation, activation resampling, and a novel approach we term ‘peak ablation’. Through experimental analysis, we find that in different regimes and models, each method can offer the lowest degradation of model performance compared to other methods, with resampling usually causing the most significant performance deterioration. We make our code available at this https URL.
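
The four ablation variants named above can be sketched on one neuron's activations across a batch. Note the caveat: "peak ablation" is rendered here as replacing activations with the mode of their histogram, which is our reading of the term, an assumption rather than the paper's definition:

```python
import numpy as np

# Ablation variants applied to one neuron's batch of activations.
# "peak" replaces activations with the histogram mode (assumed interpretation).
def ablate(acts, method, rng=None):
    if method == "zero":
        return np.zeros_like(acts)
    if method == "mean":
        return np.full_like(acts, acts.mean())
    if method == "resample":
        if rng is None:
            rng = np.random.default_rng(0)
        return rng.permutation(acts)          # shuffle activations across batch
    if method == "peak":
        hist, edges = np.histogram(acts, bins=20)
        i = int(hist.argmax())
        return np.full_like(acts, 0.5 * (edges[i] + edges[i + 1]))
    raise ValueError(f"unknown method: {method}")

acts = np.random.default_rng(1).normal(2.0, 1.0, size=1000)
```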

[LG-14] Fair Best Arm Identification with Fixed Confidence

链接: https://arxiv.org/abs/2408.17313
作者: Alessio Russo,Filippo Vannella
关键词-EN: Arm Identification, fair BAI, sample complexity, sample complexity lower, Unlike traditional BAI
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In this work, we present a novel framework for Best Arm Identification (BAI) under fairness constraints, a setting that we refer to as F-BAI (fair BAI). Unlike traditional BAI, which solely focuses on identifying the optimal arm with minimal sample complexity, F-BAI also includes a set of fairness constraints. These constraints impose a lower limit on the selection rate of each arm and can be either model-agnostic or model-dependent. For this setting, we establish an instance-specific sample complexity lower bound and analyze the price of fairness, quantifying how fairness impacts sample complexity. Based on the sample complexity lower bound, we propose F-TaS, an algorithm provably matching the sample complexity lower bound, while ensuring that the fairness constraints are satisfied. Numerical results, conducted using both a synthetic model and a practical wireless scheduling application, show the efficiency of F-TaS in minimizing the sample complexity while achieving low fairness violations.
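
A fairness floor on selection rates can be made concrete with a very simple rule: pull any arm whose empirical rate falls below the floor, otherwise exploit the empirically best arm. This is a naive stand-in for F-TaS, shown only to illustrate the constraint, not the algorithm:

```python
import numpy as np

# Naive fairness-floor sampler (illustrative stand-in for F-TaS): an arm whose
# empirical selection rate drops below w_min is pulled next; otherwise the
# empirically best arm is exploited.
def fair_pull(counts, means, t, w_min):
    rates = counts / max(t, 1)
    starved = np.where(rates < w_min)[0]
    if t > 0 and len(starved) > 0:
        return int(starved[np.argmin(rates[starved])])
    return int(np.argmax(means))

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.8])        # Bernoulli arms
counts, sums = np.zeros(3), np.zeros(3)
for t in range(3000):
    a = fair_pull(counts, sums / np.maximum(counts, 1), t, w_min=0.1)
    sums[a] += rng.binomial(1, true_means[a])
    counts[a] += 1
rates = counts / 3000                         # every arm stays near its floor
```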

[LG-15] Structuring a Training Strategy to Robustify Perception Models with Realistic Image Augmentations

链接: https://arxiv.org/abs/2408.17311
作者: Ahmed Hammam,Bharathwaj Krishnaswami Sreedhar,Nura Kawa,Tim Patzelt,Oliver De Candido
关键词-EN: Operational Design Domains, challenging Operational Design, Design Domains, Operational Design, Advancing Machine Learning
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Advancing Machine Learning (ML)-based perception models for autonomous systems necessitates addressing weak spots within the models, particularly in challenging Operational Design Domains (ODDs). These are environmental operating conditions of an autonomous vehicle which can contain difficult conditions, e.g., lens flare at night or objects reflected in a wet street. This report introduces a novel methodology for training with augmentations to enhance model robustness and performance in such conditions. The proposed approach leverages customized physics-based augmentation functions, to generate realistic training data that simulates diverse ODD scenarios. We present a comprehensive framework that includes identifying weak spots in ML models, selecting suitable augmentations, and devising effective training strategies. The methodology integrates hyperparameter optimization and latent space optimization to fine-tune augmentation parameters, ensuring they maximally improve the ML models’ performance. Experimental results demonstrate improvements in model performance, as measured by commonly used metrics such as mean Average Precision (mAP) and mean Intersection over Union (mIoU) on open-source object detection and semantic segmentation models and datasets. Our findings emphasize that optimal training strategies are model- and data-specific and highlight the benefits of integrating augmentations into the training pipeline. By incorporating augmentations, we observe enhanced robustness of ML-based perception models, making them more resilient to edge cases encountered in real-world ODDs. This work underlines the importance of customized augmentations and offers an effective solution for improving the safety and reliability of autonomous driving functions. 

[LG-16] Hybridizing Base-Line 2D-CNN Model with Cat Swarm Optimization for Enhanced Advanced Persistent Threat Detection

链接: https://arxiv.org/abs/2408.17307
作者: Ali M. Bakhiet,Salah A. Aly
关键词-EN: detecting Advanced Persistent, Advanced Persistent Threats, Advanced Persistent, formidable challenge due, Convolutional Neural Networks
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注: 6 pages, 5 figures

点击查看摘要

Abstract:In the realm of cyber-security, detecting Advanced Persistent Threats (APTs) remains a formidable challenge due to their stealthy and sophisticated nature. This research paper presents an innovative approach that leverages Convolutional Neural Networks (CNNs) with a 2D baseline model, enhanced by the cutting-edge Cat Swarm Optimization (CSO) algorithm, to significantly improve APT detection accuracy. By seamlessly integrating the 2D-CNN baseline model with CSO, we unlock the potential for unprecedented accuracy and efficiency in APT detection. The results unveil an impressive accuracy score of 98.4%, marking a significant enhancement in APT detection across various attack stages, illuminating a path forward in combating these relentless and sophisticated threats.

[LG-17] Stationary Policies are Optimal in Risk-averse Total-reward MDPs with EVaR

链接: https://arxiv.org/abs/2408.17286
作者: Xihong Su,Marek Petrik,Julien Grand-Clément
关键词-EN: Optimizing risk-averse objectives, admit direct dynamic, Entropic Risk Measure, complex history-dependent policies, direct dynamic programming
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Optimizing risk-averse objectives in discounted MDPs is challenging because most models do not admit direct dynamic programming equations and require complex history-dependent policies. In this paper, we show that the risk-averse total reward criterion, under the Entropic Risk Measure (ERM) and Entropic Value at Risk (EVaR) risk measures, can be optimized by a stationary policy, making it simple to analyze, interpret, and deploy. We propose exponential value iteration, policy iteration, and linear programming to compute optimal policies. In comparison with prior work, our results only require the relatively mild condition of transient MDPs and allow for both positive and negative rewards. Our results indicate that the total reward criterion may be preferable to the discounted criterion in a broad range of risk-averse reinforcement learning domains.
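
The entropic risk measure at the core of the criterion has a one-line definition, ERM_beta(X) = (1/beta) log E[exp(beta X)]; a stdlib sketch showing its risk-averse behavior for beta < 0 (an illustration of the measure itself, not the paper's value iteration):

```python
import math

# ERM_beta(X) = (1/beta) * log E[exp(beta * X)]: risk-averse for beta < 0,
# and it recovers the plain expectation in the limit beta -> 0.
def erm(outcomes, probs, beta):
    mgf = sum(p * math.exp(beta * x) for p, x in zip(probs, outcomes))
    return math.log(mgf) / beta

outcomes, probs = [0.0, 10.0], [0.5, 0.5]        # a risky 50/50 payoff
risk_averse = erm(outcomes, probs, beta=-1.0)    # well below the mean of 5
near_mean = erm(outcomes, probs, beta=-1e-6)     # approaches the mean
```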

[LG-18] Image-Perfect Imperfections: Safety Bias and Authenticity in the Shadow of Text-To-Image Model Evolution

链接: https://arxiv.org/abs/2408.17285
作者: Yixin Wu,Yun Shen,Michael Backes,Yang Zhang
关键词-EN: Stable Diffusion, undergo iterative updates, undergo iterative, improve image quality, updates
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: To Appear in the ACM Conference on Computer and Communications Security, October 14-18, 2024

点击查看摘要

Abstract:Text-to-image models, such as Stable Diffusion (SD), undergo iterative updates to improve image quality and address concerns such as safety. Improvements in image quality are straightforward to assess. However, how model updates resolve existing concerns and whether they raise new questions remain unexplored. This study takes an initial step in investigating the evolution of text-to-image models from the perspectives of safety, bias, and authenticity. Our findings, centered on Stable Diffusion, indicate that model updates paint a mixed picture. While updates progressively reduce the generation of unsafe images, the bias issue, particularly in gender, intensifies. We also find that negative stereotypes either persist within the same Non-White race group or shift towards other Non-White race groups through SD updates, yet with minimal association of these traits with the White race group. Additionally, our evaluation reveals a new concern stemming from SD updates: State-of-the-art fake image detectors, initially trained for earlier SD versions, struggle to identify fake images generated by updated versions. We show that fine-tuning these detectors on fake images generated by updated versions achieves at least 96.6% accuracy across various SD versions, addressing this issue. Our insights highlight the importance of continued efforts to mitigate biases and vulnerabilities in evolving text-to-image models.

[LG-19] The Transferability of Downsampling Sparse Graph Convolutional Networks

链接: https://arxiv.org/abs/2408.17274
作者: Qinji Shu,Hang Sheng,Hui Feng,Bo Hu
关键词-EN: random graph model, large-scale sparse graph, sparse random graph, propose a large-scale, downsampling method
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:In this paper, we propose a large-scale sparse graph downsampling method based on a sparse random graph model, which allows for the adjustment of different sparsity levels. We combine sparsity and topological similarity: the sparse graph model reduces the node connection probability as the graph size increases, while the downsampling method preserves a specific topological connection pattern during this change. Based on the downsampling method, we derive a theoretical transferability bound for downsampling sparse graph convolutional networks (GCNs), showing that higher sampling rates, greater average degree expectations, and smaller initial graph sizes lead to better downsampling transferability performance.
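
The setting can be mimicked in a few lines: an Erdős-Rényi graph with connection probability c/n, so average degree stays bounded as n grows, downsampled to the induced subgraph on a random node subset. A generic illustration of the model class, not the paper's construction:

```python
import numpy as np

# Sparse random graph with p = c/n, downsampled by inducing on a random
# 50% node subset (generic illustration of the sparse-downsampling setting).
rng = np.random.default_rng(0)
n, c = 400, 5.0
upper = np.triu((rng.random((n, n)) < c / n).astype(float), 1)
A = upper + upper.T                              # symmetric, no self-loops

keep = rng.choice(n, size=n // 2, replace=False)
A_sub = A[np.ix_(keep, keep)]                    # induced subgraph

avg_deg = A.sum() / n                            # close to c
avg_deg_sub = A_sub.sum() / len(keep)            # thinned by the sampling
```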

[LG-20] Joint Estimation and Prediction of City-wide Delivery Demand: A Large Language Model Empowered Graph-based Learning Approach

链接: https://arxiv.org/abs/2408.17258
作者: Tong Nie,Junlin He,Yuewen Mei,Guoyang Qin,Guilong Li,Jian Sun,Wei Ma
关键词-EN: intensified delivery operations, boosting the volume, proliferation of e-commerce, e-commerce and urbanization, volume and complexity
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The proliferation of e-commerce and urbanization has significantly intensified delivery operations in urban areas, boosting the volume and complexity of delivery demand. Data-driven predictive methods, especially those utilizing machine learning techniques, have emerged to handle these complexities in urban delivery demand management problems. One particularly pressing problem that has not yet been sufficiently studied is the joint estimation and prediction of city-wide delivery demand. To this end, we formulate this problem as a graph-based spatiotemporal learning task. First, a message-passing neural network model is formalized to capture the interaction between demand patterns of associated regions. Second, by exploiting recent advances in large language models, we extract general geospatial knowledge encodings from the unstructured locational data and integrate them into the demand predictor. Last, to encourage the cross-city transferability of the model, an inductive training scheme is developed in an end-to-end routine. Extensive empirical results on two real-world delivery datasets, including eight cities in China and the US, demonstrate that our model significantly outperforms state-of-the-art baselines in these challenging tasks.

[LG-21] Self-supervised learning for crystal property prediction via denoising ICML2024

链接: https://arxiv.org/abs/2408.17255
作者: Alexander New,Nam Q. Le,Michael J. Pekala,Christopher D. Stiles
关键词-EN: Accurate prediction, targeted discovery, crucial for targeted, Accurate, self-supervised learning
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*备注: Published at ICML 2024 AI4Science: this https URL

点击查看摘要

Abstract:Accurate prediction of the properties of crystalline materials is crucial for targeted discovery, and this prediction is increasingly done with data-driven models. However, for many properties of interest, the number of materials for which a specific property has been determined is much smaller than the number of known materials. To overcome this disparity, we propose a novel self-supervised learning (SSL) strategy for material property prediction. Our approach, crystal denoising self-supervised learning (CDSSL), pretrains predictive models (e.g., graph networks) with a pretext task based on recovering valid material structures when given perturbed versions of these structures. We demonstrate that CDSSL models outperform models trained without SSL, across material types, properties, and dataset sizes.

[LG-22] Categorical data clustering: 25 years beyond K-modes

链接: https://arxiv.org/abs/2408.17244
作者: Tai Dinh,Wong Hauchi,Philippe Fournier-Viger,Daniil Lisik,Minh-Quyet Ha,Hieu-Chi Dam,Van-Nam Huynh
关键词-EN: offering profound implications, categorical data, categorical data clustering, offering profound, spectrum of applications
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The clustering of categorical data is a common and important task in computer science, offering profound implications across a spectrum of applications. Unlike purely numerical datasets, categorical data often lack inherent ordering as in nominal data, or have varying levels of order as in ordinal data, thus requiring specialized methodologies for efficient organization and analysis. This review provides a comprehensive synthesis of categorical data clustering in the past twenty-five years, starting from the introduction of K-modes. It elucidates the pivotal role of categorical data clustering in diverse fields such as health sciences, natural sciences, social sciences, education, engineering and economics. Practical comparisons are conducted for algorithms having public implementations, highlighting distinguishing clustering methodologies and revealing the performance of recent algorithms on several benchmark categorical datasets. Finally, challenges and opportunities in the field are discussed.
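The K-modes algorithm that this review takes as its starting point replaces K-means' Euclidean distance with a simple mismatch count and its numeric centroids with per-attribute modes. A minimal illustrative sketch on toy data (not any surveyed implementation):

```python
from collections import Counter
import random

def mismatch(a, b):
    # Hamming-style dissimilarity: number of attribute disagreements
    return sum(x != y for x, y in zip(a, b))

def k_modes(data, k, iters=20, seed=0):
    rng = random.Random(seed)
    modes = rng.sample(data, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for rec in data:                     # assign to nearest mode
            j = min(range(k), key=lambda i: mismatch(rec, modes[i]))
            clusters[j].append(rec)
        new_modes = []
        for j, cl in enumerate(clusters):
            if not cl:                       # keep old mode if cluster empty
                new_modes.append(modes[j])
                continue
            # per-attribute mode replaces the numeric centroid of K-means
            new_modes.append(tuple(Counter(col).most_common(1)[0][0]
                                   for col in zip(*cl)))
        if new_modes == modes:               # converged
            break
        modes = new_modes
    return modes, clusters

records = [("red", "small"), ("red", "small"), ("red", "large"),
           ("blue", "huge"), ("blue", "huge"), ("blue", "large")]
modes, clusters = k_modes(records, k=2)
```

The algorithms surveyed in the review largely differ in how they refine this dissimilarity measure and mode-update step.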

[LG-23] AI-Driven Intrusion Detection Systems (IDS) on the ROAD dataset: A Comparative Analysis for automotive Controller Area Network (CAN)

链接: https://arxiv.org/abs/2408.17235
作者: Lorenzo Guerra,Linhan Xu,Pavlo Mozharovskyi,Paolo Bellavista,Thomas Chapuis,Guillaume Duc,Van-Tam Nguyen
关键词-EN: revolutionized automotive technology, Controller Area Network, automotive technology, enhancing safety, driving experience
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The integration of digital devices in modern vehicles has revolutionized automotive technology, enhancing safety and the overall driving experience. The Controller Area Network (CAN) bus is a central system for managing in-vehicle communication between the electronic control units (ECUs). However, the CAN protocol poses security challenges due to inherent vulnerabilities, lacking encryption and authentication, which, combined with an expanding attack surface, necessitates robust security measures. In response to this challenge, numerous Intrusion Detection Systems (IDS) have been developed and deployed. Nonetheless, an open, comprehensive, and realistic dataset to test the effectiveness of such IDSs remains absent in the existing literature. This paper addresses this gap by considering the latest ROAD dataset, containing stealthy and sophisticated injections. The methodology involves dataset labelling and the implementation of both state-of-the-art deep learning models and traditional machine learning models to show the discrepancy in performance between the datasets most commonly used in the literature and the ROAD dataset, a more realistic alternative.

[LG-24] Geometry of Lightning Self-Attention: Identifiability and Dimension

链接: https://arxiv.org/abs/2408.17221
作者: Nathan W. Henry,Giovanni Luca Marchetti,Kathlén Kohn
关键词-EN: function spaces defined, theoretically analyze, spaces defined, self-attention networks, analyze their geometry
类目: Machine Learning (cs.LG); Algebraic Geometry (math.AG)
*备注:

点击查看摘要

Abstract:We consider function spaces defined by self-attention networks without normalization, and theoretically analyze their geometry. Since these networks are polynomial, we rely on tools from algebraic geometry. In particular, we study the identifiability of deep attention by providing a description of the generic fibers of the parametrization for an arbitrary number of layers and, as a consequence, compute the dimension of the function space. Additionally, for a single-layer model, we characterize the singular and boundary points. Finally, we formulate a conjectural extension of our results to normalized self-attention networks, prove it for a single layer, and numerically verify it in the deep case.
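The normalization-free ("lightning") self-attention maps studied here are polynomial in both inputs and parameters, which is what opens the door to algebraic-geometry tools. A single-layer sketch, assuming the usual X W_Q W_K^T X^T X W_V form with the softmax removed:

```python
import numpy as np

def lightning_self_attention(X, Wq, Wk, Wv):
    """Self-attention without softmax or normalization: every output entry
    is a polynomial in the entries of X, Wq, Wk and Wv."""
    scores = X @ Wq @ Wk.T @ X.T      # (n, n) raw attention scores
    return scores @ (X @ Wv)          # (n, d_v) outputs

rng = np.random.default_rng(0)
n, d, dk, dv = 4, 3, 2, 3
X = rng.normal(size=(n, d))
Wq = rng.normal(size=(d, dk))
Wk = rng.normal(size=(d, dk))
Wv = rng.normal(size=(d, dv))
out = lightning_self_attention(X, Wq, Wk, Wv)

# The map is cubic in the input: scaling X by t scales the output by t**3,
# a polynomial identity that softmax attention would destroy.
assert np.allclose(lightning_self_attention(2 * X, Wq, Wk, Wv), 8 * out)
```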

[LG-25] Democratizing AI in Africa: FL for Low-Resource Edge Devices

链接: https://arxiv.org/abs/2408.17216
作者: Jorge Fabila,Víctor M. Campello,Carlos Martín-Isla,Johnes Obungoloch,Kinyera Leo,Amodoi Ronald,Karim Lekadir
关键词-EN: advanced medical technologies, Africa faces significant, healthcare delivery due, Africa faces, faces significant challenges
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Africa faces significant challenges in healthcare delivery due to limited infrastructure and access to advanced medical technologies. This study explores the use of federated learning to overcome these barriers, focusing on perinatal health. We trained a fetal plane classifier using perinatal data from five African countries: Algeria, Ghana, Egypt, Malawi, and Uganda, along with data from Spanish hospitals. To account for the lack of computational resources in the analysis, we considered a heterogeneous set of devices, including a Raspberry Pi and several laptops, for model training. We demonstrate comparative performance between a centralized and a federated model, despite the compute limitations, and a significant improvement in model generalizability when compared to models trained only locally. These results show the potential for a future large-scale implementation of a federated learning platform to bridge the accessibility gap and improve model generalizability with very modest hardware requirements.
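The federated baseline compared against here is, at its core, standard FedAvg aggregation: each client trains locally and the server averages parameters weighted by client dataset size. A minimal server-side sketch (the fetal-plane classifier itself is not reproduced; the client parameters below are toy arrays):

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Server step of FedAvg: dataset-size-weighted average of the
    parameter vectors returned by the clients."""
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

# three clients with unequal data volumes (e.g. different hospitals)
clients = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
sizes = [100, 300, 600]
global_w = fedavg(clients, sizes)   # -> array([0.7, 0.9])
```

No raw data leaves a client; only these parameter vectors are exchanged, which is what makes the approach attractive in low-resource settings.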

[LG-26] Towards Symbolic XAI – Explanation Through Human Understandable Logical Relationships Between Features

链接: https://arxiv.org/abs/2408.17198
作者: Thomas Schnake,Farnoush Rezaei Jafari,Jonas Lederer,Ping Xiong,Shinichi Nakajima,Stefan Gugler,Grégoire Montavon,Klaus-Robert Müller
关键词-EN: Explainable Artificial Intelligence, Explainable Artificial, Artificial Intelligence, approaches typically offer, heatmaps highlighting single
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Explainable Artificial Intelligence (XAI) plays a crucial role in fostering transparency and trust in AI systems, where traditional XAI approaches typically offer one level of abstraction for explanations, often in the form of heatmaps highlighting single or multiple input features. However, we ask whether abstract reasoning or problem-solving strategies of a model may also be relevant, as these align more closely with how humans approach solutions to problems. We propose a framework, called Symbolic XAI, that attributes relevance to symbolic queries expressing logical relationships between input features, thereby capturing the abstract reasoning behind a model’s predictions. The methodology is built upon a simple yet general multi-order decomposition of model predictions. This decomposition can be specified using higher-order propagation-based relevance methods, such as GNN-LRP, or perturbation-based explanation methods commonly used in XAI. The effectiveness of our framework is demonstrated in the domains of natural language processing (NLP), vision, and quantum chemistry (QC), where abstract symbolic domain knowledge is abundant and of significant interest to users. The Symbolic XAI framework provides an understanding of the model’s decision-making process that is both flexible for customization by the user and human-readable through logical formulas.

[LG-27] Short-term Wind Speed Forecasting for Power Integration in Smart Grids based on Hybrid LSSVM-SVMD Method

链接: https://arxiv.org/abs/2408.17185
作者: Ephrem Admasu Yekun,Alem H. Fitwi,Selvi Karpaga Subramanian,Anubhav Kumar,Teshome Goa Tella
关键词-EN: renewable energy resources, exploited renewable energy, widely exploited renewable, wind speed forecasting, wind speed
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Owing to its minimal pollution and efficient energy use, wind energy has become one of the most widely exploited renewable energy resources. The successful integration of wind power into the grid system is contingent upon accurate wind speed forecasting models. However, the task of wind speed forecasting is challenging due to the inherent intermittent characteristics of wind speed. In this paper, a hybrid machine learning approach is developed for predicting short-term wind speed. First, the wind data was decomposed into modal components using Successive Variational Mode Decomposition (SVMD). Then, each sub-signal was fitted with a Least Squares Support Vector Machines (LSSVM) model, with its hyperparameters optimized by a novel variant of Quantum-behaved Particle Swarm Optimization (QPSO), QPSO with elitist breeding (EBQPSO). Second, the residuals accounting for the differences between the original wind series and the aggregate of the SVMD modes were modeled using a long short-term memory (LSTM) network. The overall predicted values were then computed as the aggregate of the LSSVM and LSTM models. Finally, the performance of the proposed model was compared against state-of-the-art benchmark models for forecasting wind speed using two separate data sets collected from a local wind farm. Empirical results show significant improvement in performance by the proposed method, achieving a 1.21% to 32.76% reduction in root mean square error (RMSE) and a 2.05% to 40.75% reduction in mean absolute error (MAE) compared to the benchmark methods. The entire code implementation of this work is freely available on GitHub.

[LG-28] Identifying and Clustering Counter Relationships of Team Compositions in PvP Games for Efficient Balance Analysis

链接: https://arxiv.org/abs/2408.17180
作者: Chiu-Chou Lin,Yu-Wei Shih,Kuei-Ting Kuo,Yu-Cheng Chen,Chien-Hua Chen,Wei-Chen Chiu,I-Chen Wu
关键词-EN: balance, win, game settings, game, Abstract
类目: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Information Retrieval (cs.IR); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注: TMLR 09/2024 this https URL

点击查看摘要

Abstract:How can balance be quantified in game settings? This question is crucial for game designers, especially in player-versus-player (PvP) games, where analyzing the strength relations among predefined team compositions-such as hero combinations in multiplayer online battle arena (MOBA) games or decks in card games-is essential for enhancing gameplay and achieving balance. We have developed two advanced measures that extend beyond the simplistic win rate to quantify balance in zero-sum competitive scenarios. These measures are derived from win value estimations, which employ strength rating approximations via the Bradley-Terry model and counter relationship approximations via vector quantization, significantly reducing the computational complexity associated with traditional win value estimations. Throughout the learning process of these models, we identify useful categories of compositions and pinpoint their counter relationships, aligning with the experiences of human players without requiring specific game knowledge. Our methodology hinges on a simple technique to enhance codebook utilization in discrete representation with a deterministic vector quantization process for an extremely small state space. Our framework has been validated in popular online games, including Age of Empires II, Hearthstone, Brawl Stars, and League of Legends. The accuracy of the observed strength relations in these games is comparable to traditional pairwise win value predictions, while also offering a more manageable complexity for analysis. Ultimately, our findings contribute to a deeper understanding of PvP game dynamics and present a methodology that significantly improves game balance evaluation and design.
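The strength ratings rest on the Bradley-Terry model, where P(i beats j) = s_i / (s_i + s_j) and strengths are fitted by an iterative minorization-maximization (MM) update. A sketch on a toy win-count matrix (the paper's vector-quantized counter-relationship step is omitted):

```python
import numpy as np

def bradley_terry(wins, iters=200):
    """MM-algorithm fit of Bradley-Terry strengths.
    wins[i, j] = number of times composition i beat composition j."""
    n = wins.shape[0]
    s = np.ones(n)
    for _ in range(iters):
        for i in range(n):
            num = wins[i].sum()                      # total wins of i
            den = sum((wins[i, j] + wins[j, i]) / (s[i] + s[j])
                      for j in range(n) if j != i)   # weighted match counts
            s[i] = num / den if den > 0 else s[i]
        s /= s.sum()        # fix the scale: strengths are only relative
    return s

# A beats B 8/10, B beats C 8/10 -> expected strength order A > B > C
wins = np.array([[0, 8, 0],
                 [2, 0, 8],
                 [0, 2, 0]], dtype=float)
s = bradley_terry(wins)
```

Note that A and C never play each other, yet the model still orders them, which is exactly why such ratings are useful for sparse matchup data.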

[LG-29] SafeTail: Efficient Tail Latency Optimization in Edge Service Scheduling via Computational Redundancy Management

链接: https://arxiv.org/abs/2408.17171
作者: Jyoti Shokhanda,Utkarsh Pal,Aman Kumar,Soumi Chattopadhyay,Arani Bhattacharya
关键词-EN: efficiently managing computational, delivering high-performance, efficiently managing, crucial for delivering, latency
类目: Machine Learning (cs.LG)
*备注: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

点击查看摘要

Abstract:Optimizing tail latency while efficiently managing computational resources is crucial for delivering high-performance, latency-sensitive services in edge computing. Emerging applications, such as augmented reality, require low-latency computing services with high reliability on user devices, which often have limited computational capabilities. Consequently, these devices depend on nearby edge servers for processing. However, inherent uncertainties in network and computation latencies, stemming from variability in wireless networks and fluctuating server loads, make on-time service delivery challenging. Existing approaches often focus on optimizing median latency but fall short of addressing the specific challenges of tail latency in edge environments, particularly under uncertain network and computational conditions. Although some methods do address tail latency, they typically rely on fixed or excessive redundancy and lack adaptability to dynamic network conditions, often being designed for cloud environments rather than the unique demands of edge computing. In this paper, we introduce SafeTail, a framework that meets both median and tail response time targets, with tail latency defined as latency beyond the 90th percentile threshold. SafeTail addresses this challenge by selectively replicating services across multiple edge servers to meet target latencies. SafeTail employs a reward-based deep learning framework to learn optimal placement strategies, balancing the need to achieve target latencies with minimizing additional resource usage. Through trace-driven simulations, SafeTail demonstrated near-optimal performance and outperformed most baseline strategies across three diverse services.
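Why replication tames tail latency: the effective latency becomes the minimum over redundant executions, so a rare slow replica no longer dominates the 90th percentile while the median barely moves. A toy simulation of that effect (a fixed duplication policy and made-up latency distribution, not SafeTail's learned one):

```python
import random

def percentile(xs, q):
    xs = sorted(xs)
    return xs[min(int(q * len(xs)), len(xs) - 1)]

def simulate(replicas, n=10_000, seed=0):
    """Effective latency = min over `replicas` independent runs,
    each drawn from a heavy-tailed toy distribution (milliseconds)."""
    rng = random.Random(seed)
    def one_run():
        base = rng.uniform(10, 20)                        # normal service time
        spike = rng.uniform(100, 200) if rng.random() < 0.2 else 0.0
        return base + spike
    return [min(one_run() for _ in range(replicas)) for _ in range(n)]

p90_single = percentile(simulate(1), 0.90)
p90_dup = percentile(simulate(2), 0.90)
# With one copy, a 20%-probability spike lands inside the 90th percentile.
# With two copies, BOTH replicas must stall (prob 0.04), so the p90 collapses.
```

SafeTail's contribution is deciding *when* this redundancy is worth its resource cost, rather than always duplicating.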

[LG-30] Efficient Testable Learning of General Halfspaces with Adversarial Label Noise COLT’24

链接: https://arxiv.org/abs/2408.17165
作者: Ilias Diakonikolas,Daniel M. Kane,Sihan Liu,Nikos Zarifis
关键词-EN: adversarial label noise, Gaussian distribution, testable learning, testable learning framework, reduce testable learning
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
*备注: Presented to COLT’24

点击查看摘要

Abstract:We study the task of testable learning of general – not necessarily homogeneous – halfspaces with adversarial label noise with respect to the Gaussian distribution. In the testable learning framework, the goal is to develop a tester-learner such that if the data passes the tester, then one can trust the output of the robust learner on the data. Our main result is the first polynomial-time tester-learner for general halfspaces that achieves dimension-independent misclassification error. At the heart of our approach is a new methodology to reduce testable learning of general halfspaces to testable learning of nearly homogeneous halfspaces, which may be of broader interest.

[LG-31] The Iterative Optimal Brain Surgeon: Faster Sparse Recovery by Leveraging Second-Order Information

链接: https://arxiv.org/abs/2408.17163
作者: Diyuan Wu,Ionut-Vlad Modoranu,Mher Safaryan,Denis Kuznedelev,Dan Alistarh
关键词-EN: Optimal Brain Surgeon, sparse recovery algorithms, classical Optimal Brain, focus on imposing, memory costs
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The rising footprint of machine learning has led to a focus on imposing model sparsity as a means of reducing computational and memory costs. For deep neural networks (DNNs), the state-of-the-art accuracy-vs-sparsity trade-off is achieved by heuristics inspired by the classical Optimal Brain Surgeon (OBS) framework (LeCun et al., 1990; Hassibi et al., 1992, 1993), which leverages loss curvature information to make better pruning decisions. Yet, these results still lack a solid theoretical understanding, and it is unclear whether they can be improved by leveraging connections to the wealth of work on sparse recovery algorithms. In this paper, we draw new connections between these two areas and present new sparse recovery algorithms inspired by the OBS framework that come with theoretical guarantees under reasonable assumptions and have strong practical performance. Specifically, our work starts from the observation that we can leverage curvature information in OBS-like fashion upon the projection step of classic iterative sparse recovery algorithms such as IHT. We show for the first time that this leads to improved convergence bounds under standard assumptions. Furthermore, we present extensions of this approach to the practical task of obtaining accurate sparse DNNs, and validate it experimentally at scale for Transformer-based models on vision and language tasks.
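The classic IHT algorithm whose projection step the paper augments with curvature information alternates a gradient step on the least-squares objective with hard thresholding to the k largest-magnitude entries. A plain, curvature-free sketch on a synthetic sparse recovery problem:

```python
import numpy as np

def iht(A, y, k, step=None, iters=300):
    """Iterative Hard Thresholding: gradient step on ||Ax - y||^2,
    then keep only the k largest-magnitude entries of the iterate."""
    m, n = A.shape
    if step is None:
        step = 1.0 / np.linalg.norm(A, 2) ** 2   # conservative step size
    x = np.zeros(n)
    for _ in range(iters):
        x = x + step * A.T @ (y - A @ x)          # gradient step
        keep = np.argsort(np.abs(x))[-k:]         # projection onto k-sparse set
        mask = np.zeros(n, dtype=bool)
        mask[keep] = True
        x[~mask] = 0.0
    return x

rng = np.random.default_rng(0)
m, n, k = 50, 100, 3
A = rng.normal(size=(m, n)) / np.sqrt(m)          # well-conditioned sensing matrix
x_true = np.zeros(n)
x_true[[5, 40, 77]] = [3.0, -2.0, 1.5]            # the hidden sparse signal
x_hat = iht(A, A @ x_true, k)
```

The paper's idea, roughly, is that the thresholding step above is "blind" to the loss geometry; OBS-style curvature weighting chooses which entries to keep more intelligently.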

[LG-32] Deep Feature Embedding for Tabular Data ICONIP2024

链接: https://arxiv.org/abs/2408.17162
作者: Yuqian Wu,Hengyi Luo,Raymond S. T. Lee
关键词-EN: capture complex relationships, Tabular data learning, Tabular data, relationships and engineering, extensive applications
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 15 pages, 2figures, accepted to ICONIP 2024, Paper ID: 1399

点击查看摘要

Abstract:Tabular data learning has extensive applications in deep learning, but existing embedding techniques for numerical and categorical features are limited, for example in their inability to capture complex relationships or reduce manual feature engineering. This paper proposes a novel deep embedding framework that leverages lightweight deep neural networks to generate effective feature embeddings for tabular data in machine learning research. For numerical features, a two-step feature expansion and deep transformation technique is used to capture copious semantic information. For categorical features, a unique identification vector for each entity is obtained from a compact lookup table, passed through a parameterized deep embedding function to unify the embedding dimensions, and transformed into an embedding vector using a deep neural network. Experiments are conducted on real-world datasets for performance evaluation.

[LG-33] Investigating Privacy Leakage in Dimensionality Reduction Methods via Reconstruction Attack

链接: https://arxiv.org/abs/2408.17151
作者: Chayadon Lumbut,Donlapark Ponnoprat
关键词-EN: study investigates privacy, investigates privacy leakage, dimensionality reduction methods, study investigates, investigates privacy
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This study investigates privacy leakage in dimensionality reduction methods through a novel machine learning-based reconstruction attack. Employing an informed adversary threat model, we develop a neural network capable of reconstructing high-dimensional data from low-dimensional embeddings. We evaluate six popular dimensionality reduction techniques: PCA, sparse random projection (SRP), multidimensional scaling (MDS), Isomap, t-SNE, and UMAP. Using both MNIST and NIH Chest X-ray datasets, we perform a qualitative analysis to identify key factors affecting reconstruction quality. Furthermore, we assess the effectiveness of an additive noise mechanism in mitigating these reconstruction attacks.
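For linear methods such as PCA, the threat is easy to see: an adversary who knows (or learns) the projection can map released embeddings straight back to approximations of the private inputs. A linear sketch of that attack on synthetic data (the paper trains a neural-network reconstructor instead, which also covers the nonlinear methods):

```python
import numpy as np

rng = np.random.default_rng(0)
# "private" data: 200 samples in 10-D whose variance lives in 3 directions
X = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 10)) \
    + 0.01 * rng.normal(size=(200, 10))

mu = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
W = Vt[:3]                      # top-3 principal directions

Z = (X - mu) @ W.T              # the released low-dimensional embeddings
X_rec = Z @ W + mu              # adversary's reconstruction of the inputs

rel_err = np.linalg.norm(X_rec - X) / np.linalg.norm(X)
```

When the embedding dimension captures most of the variance, the relative reconstruction error is tiny, which is exactly the leakage the additive-noise mechanism in the paper is meant to blunt.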

[LG-34] The Many Faces of Optimal Weak-to-Strong Learning

链接: https://arxiv.org/abs/2408.17148
作者: Mikael Møller Høgsgaard,Kasper Green Larsen,Markus Engelund Mathiasen
关键词-EN: extremely successful idea, multiple low accuracy, combine multiple low, low accuracy classifiers, accurate voting classifier
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
*备注:

点击查看摘要

Abstract:Boosting is an extremely successful idea, allowing one to combine multiple low accuracy classifiers into a much more accurate voting classifier. In this work, we present a new and surprisingly simple Boosting algorithm that obtains a provably optimal sample complexity. Sample optimal Boosting algorithms have only recently been developed, and our new algorithm has the fastest runtime among all such algorithms and is the simplest to describe: Partition your training data into 5 disjoint pieces of equal size, run AdaBoost on each, and combine the resulting classifiers via a majority vote. In addition to this theoretical contribution, we also perform the first empirical comparison of the proposed sample optimal Boosting algorithms. Our pilot empirical study suggests that our new algorithm might outperform previous algorithms on large data sets.
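The algorithm really is as short as described: five disjoint splits, AdaBoost on each, majority vote. A from-scratch sketch with decision stumps standing in for the weak learner (an assumption for illustration; the recipe works with any base learner), tested on a toy linearly separable problem:

```python
import numpy as np

def train_stump(X, y, w):
    """Best decision stump (feature, threshold, sign) under sample weights w."""
    best = (np.inf, 0, 0.0, 1)
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            for sign in (1, -1):
                pred = np.where(X[:, f] <= t, sign, -sign)
                err = w[pred != y].sum()
                if err < best[0]:
                    best = (err, f, t, sign)
    return best

def adaboost(X, y, rounds=20):
    """Textbook AdaBoost over decision stumps; labels y must be +/-1."""
    w = np.full(len(y), 1.0 / len(y))
    ensemble = []
    for _ in range(rounds):
        err, f, t, sign = train_stump(X, y, w)
        err = max(err, 1e-10)
        if err >= 0.5:
            break
        alpha = 0.5 * np.log((1 - err) / err)
        pred = np.where(X[:, f] <= t, sign, -sign)
        w *= np.exp(-alpha * y * pred)          # reweight misclassified points
        w /= w.sum()
        ensemble.append((alpha, f, t, sign))
    return ensemble

def predict(ensemble, X):
    score = sum(a * np.where(X[:, f] <= t, s, -s) for a, f, t, s in ensemble)
    return np.sign(score)

def majority_of_five(X, y, seed=0):
    """The paper's recipe: split the data into 5 disjoint equal parts,
    run AdaBoost on each, and majority-vote the 5 resulting classifiers."""
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(len(y)), 5)
    voters = [adaboost(X[p], y[p]) for p in parts]
    def vote(Xq):
        return np.where(sum(predict(e, Xq) for e in voters) >= 0, 1, -1)
    return vote

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)      # toy separable labels
clf = majority_of_five(X, y)
acc = (clf(X) == y).mean()
```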

[LG-35] Towards Hyper-parameter-free Federated Learning

链接: https://arxiv.org/abs/2408.17145
作者: Geetika,Drishya Uniyal,Bapi Chatterjee
关键词-EN: adaptive synchronization techniques, vanilla federated averaging, scaled global model, global model updates, adaptive synchronization
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
*备注: 28 pages, 3 figures

点击查看摘要

Abstract:The adaptive synchronization techniques in federated learning (FL) for scaled global model updates show superior performance over the vanilla federated averaging (FedAvg) scheme. However, existing methods employ additional tunable hyperparameters on the server to determine the scaling factor. A contrasting approach is automated scaling analogous to tuning-free step-size schemes in stochastic gradient descent (SGD) methods, which offer competitive convergence rates and exhibit good empirical performance. In this work, we introduce two algorithms for automated scaling of global model updates. In our first algorithm, we establish that a descent-ensuring step-size regime at the clients ensures descent for the server objective. We show that such a scheme enables linear convergence for strongly convex federated objectives. Our second algorithm shows that the average of objective values of sampled clients is a practical and effective substitute for the objective function value at the server required for computing the scaling factor, whose computation is otherwise not permitted. Our extensive empirical results show that the proposed methods perform on par with or better than popular federated learning algorithms for both convex and non-convex problems. Our work takes a step towards designing hyper-parameter-free federated learning.

[LG-36] Flow Matching for Optimal Reaction Coordinates of Biomolecular System

链接: https://arxiv.org/abs/2408.17139
作者: Mingyuan Zhang,Zhicheng Zhang,Yong Wang,Hao Wu
关键词-EN: optimal reaction coordinates, present Flow Matching, Reaction Coordinates, identify optimal reaction, Flow Matching
类目: Machine Learning (cs.LG); Biological Physics (physics.bio-ph)
*备注:

点击查看摘要

Abstract:We present Flow Matching for Reaction Coordinates (FMRC), a novel deep learning algorithm designed to identify optimal reaction coordinates (RC) in biomolecular reversible dynamics. FMRC is based on the mathematical principles of lumpability and decomposability, which we reformulate into a conditional probability framework for efficient data-driven optimization using deep generative models. While FMRC does not explicitly learn the well-established transfer operator or its eigenfunctions, it can effectively encode the dynamics of leading eigenfunctions of the system transfer operator into its low-dimensional RC space. We further quantitatively compare its performance with several state-of-the-art algorithms by evaluating the quality of Markov State Models (MSM) constructed in their respective RC spaces, demonstrating the superiority of FMRC in three increasingly complex biomolecular systems. Finally, we discuss its potential in downstream applications such as enhanced sampling methods and MSM construction.

[LG-37] Controllable Edge-Type-Specific Interpretation in Multi-Relational Graph Neural Networks for Drug Response Prediction

链接: https://arxiv.org/abs/2408.17129
作者: Xiaodi Li,Jianfeng Gui,Qian Gao,Haoyuan Shi,Zhenyu Yue
关键词-EN: Graph Neural Networks, Neural Networks, critical decision-making areas, Graph Neural, demand interpretable predictions
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Graph Neural Networks have been widely applied in critical decision-making areas that demand interpretable predictions, leading to the flourishing development of interpretability algorithms. However, current graph interpretability algorithms tend to emphasize generality and often overlook biological significance, thereby limiting their applicability in predicting cancer drug responses. In this paper, we propose a novel post-hoc interpretability algorithm for cancer drug response prediction, CETExplainer, which incorporates a controllable edge-type-specific weighting mechanism. It considers the mutual information between subgraphs and predictions, proposing a structural scoring approach to provide fine-grained, biologically meaningful explanations for predictive models. We also introduce a method for constructing ground truth based on real-world datasets to quantitatively evaluate the proposed interpretability algorithm. Empirical analysis on the real-world dataset demonstrates that CETExplainer achieves superior stability and improves explanation quality compared to leading algorithms, thereby offering a robust and insightful tool for cancer drug prediction.

[LG-38] Efficient Estimation of Unique Components in Independent Component Analysis by Matrix Representation

链接: https://arxiv.org/abs/2408.17118
作者: Yoshitatsu Matsuda,Kazunori Yamaguchi
关键词-EN: Independent component analysis, Independent component, principal component analysis, component analysis, feature extraction
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Independent component analysis (ICA) is a widely used method in various applications of signal processing and feature extraction. It extends principal component analysis (PCA) and can extract important and complicated components with small variances. One of the major problems of ICA is that the uniqueness of the solution is not guaranteed, unlike PCA. That is because there are many local optima in optimizing the objective function of ICA. It has been shown previously that the unique global optimum of ICA can be estimated from many random initializations by handcrafted thread computation. In this paper, the unique estimation of ICA is highly accelerated by reformulating the algorithm in matrix representation and reducing redundant calculations. Experimental results on artificial datasets and EEG data verified the efficiency of the proposed method.

[LG-39] Sparse Uncertainty-Informed Sampling from Federated Streaming Data

链接: https://arxiv.org/abs/2408.17108
作者: Manuel Röder,Frank-Michael Schleif
关键词-EN: federated client systems, computationally efficient approach, local model adaptation, numerically robust, computationally efficient
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: Preprint, 6 pages, 3 figures, Accepted for ESANN 2024

点击查看摘要

Abstract:We present a numerically robust, computationally efficient approach for non-I.I.D. data stream sampling in federated client systems, where resources are limited and labeled data for local model adaptation is sparse and expensive. The proposed method identifies relevant stream observations to optimize the underlying client model, given a local labeling budget, and performs instantaneous labeling decisions without relying on any memory buffering strategies. Our experiments show enhanced training batch diversity and an improved numerical robustness of the proposal compared to existing strategies over large-scale data streams, making our approach an effective and convenient solution in FL environments.
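A generic margin-based version of such instantaneous, budget-constrained labeling decisions can be sketched as follows (the paper's exact uncertainty criterion, model, and thresholds are assumptions here; only the buffer-free, one-pass structure is the point):

```python
import math
import random

def stream_sampler(predict_proba, stream, budget, threshold=0.2):
    """Greedy margin-based sampling over a stream: request a label only when
    the model's top-two class probabilities are close, until the labeling
    budget is exhausted. Each decision is instantaneous; nothing is buffered."""
    selected = []
    for x in stream:
        if budget == 0:
            break
        probs = predict_proba(x)
        top2 = sorted(probs, reverse=True)[:2]
        if top2[0] - top2[1] < threshold:   # model is unsure -> worth a label
            selected.append(x)
            budget -= 1
    return selected

def predict_proba(x):
    # toy binary model: confident away from the decision boundary at 0
    p = 1.0 / (1.0 + math.exp(-4.0 * x))
    return [p, 1.0 - p]

rng = random.Random(0)
stream = [rng.uniform(-2, 2) for _ in range(1000)]
picked = stream_sampler(predict_proba, stream, budget=30)
```

All selected observations cluster near the decision boundary, which is what drives the batch-diversity and robustness gains the abstract reports for federated clients with sparse labels.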

[LG-40] RISSOLE: Parameter-efficient Diffusion Models via Block-wise Generation and Retrieval-Guidance

链接: https://arxiv.org/abs/2408.17095
作者: Avideep Mukherjee,Soumya Banerjee,Vinay P. Namboodiri,Piyush Rai
关键词-EN: Diffusion-based models demonstrate, impressive generation capabilities, Diffusion-based models, generation, Diffusion-based
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Diffusion-based models demonstrate impressive generation capabilities. However, they also have a massive number of parameters, resulting in enormous model sizes, thus making them unsuitable for deployment on resource-constrained devices. Block-wise generation can be a promising alternative for designing compact-sized (parameter-efficient) deep generative models since the model can generate one block at a time instead of generating the whole image at once. However, block-wise generation is also considerably challenging because ensuring coherence across generated blocks can be non-trivial. To this end, we design a retrieval-augmented generation (RAG) approach and leverage the corresponding blocks of the images retrieved by the RAG module to condition the training and generation stages of a block-wise denoising diffusion model. Our conditioning schemes ensure coherence across the different blocks during training and, consequently, during generation. While we showcase our approach using the latent diffusion model (LDM) as the base model, it can be used with other variants of denoising diffusion models. We validate the solution of the coherence problem through the proposed approach by reporting substantive experiments to demonstrate our approach’s effectiveness in compact model size and excellent generation quality.

[LG-41] FissionVAE: Federated Non-IID Image Generation with Latent Space and Decoder Decomposition

链接: https://arxiv.org/abs/2408.17090
作者: Chen Hu,Jingjing Deng,Xianghua Xie,Xiaoke Ma
关键词-EN: Generative Adversarial Networks, enables decentralized clients, machine learning paradigm, paradigm that enables, enables decentralized
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Federated learning is a machine learning paradigm that enables decentralized clients to collaboratively learn a shared model while keeping all the training data local. While considerable research has focused on federated image generation, particularly Generative Adversarial Networks, Variational Autoencoders have received less attention. In this paper, we address the challenges of non-IID (not independently and identically distributed) data environments featuring multiple groups of images of different types. Specifically, heterogeneous data distributions can lead to difficulties in maintaining a consistent latent space and can also result in local generators with disparate texture features being blended during aggregation. We introduce a novel approach, FissionVAE, which decomposes the latent space and constructs decoder branches tailored to individual client groups. This method allows for customized learning that aligns with the unique data distributions of each group. Additionally, we investigate the incorporation of hierarchical VAE architectures and demonstrate the use of heterogeneous decoder architectures within our model. We also explore strategies for setting the latent prior distributions to enhance the decomposition process. To evaluate our approach, we assemble two composite datasets: the first combines MNIST and FashionMNIST; the second comprises RGB datasets of cartoon and human faces, wild animals, marine vessels, and remote sensing images of Earth. Our experiments demonstrate that FissionVAE greatly improves generation quality on these datasets compared to baseline federated VAE models.

[LG-42] Instant Adversarial Purification with Adversarial Consistency Distillation

链接: https://arxiv.org/abs/2408.17064
作者: Chun Tong Lei,Hon Ming Yam,Zhongliang Guo,Chun Pong Lau
关键词-EN: including image classification, Neural Function Evaluation, widespread applications, Neural networks, remarkable performance
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Neural networks, despite their remarkable performance in widespread applications, including image classification, are also known to be vulnerable to subtle adversarial noise. Although some diffusion-based purification methods have been proposed, for example, DiffPure, those methods are time-consuming. In this paper, we propose One Step Control Purification (OSCP), a diffusion-based purification model that can purify the adversarial image in one Neural Function Evaluation (NFE) in diffusion models. We use Latent Consistency Model (LCM) and ControlNet for our one-step purification. OSCP is computationally friendly and time efficient compared to other diffusion-based purification methods; we achieve a defense success rate of 74.19% on ImageNet, only requiring 0.1s for each purification. Moreover, there is a fundamental incongruence between consistency distillation and adversarial perturbation. To address this ontological dissonance, we propose Gaussian Adversarial Noise Distillation (GAND), a novel consistency distillation framework that facilitates a more nuanced reconciliation of the latent space dynamics, effectively bridging the natural and adversarial manifolds. Our experiments show that the GAND does not need a Full Fine Tune (FFT); PEFT, e.g., LoRA is sufficient.

[LG-43] A Survey of the Self Supervised Learning Mechanisms for Vision Transformers

链接: https://arxiv.org/abs/2408.17059
作者: Asifullah Khan,Anabia Sohail,Mustansar Fiaz,Mehdi Hassan,Tariq Habib Afridi,Sibghat Ullah Marwat,Farzeen Munir,Safdar Ali,Hannan Naseem,Muhammad Zaigham Zaheer,Kamran Ali,Tangina Sultana,Ziaurrehman Tanoli,Naeem Akhter
关键词-EN: require high volume, attain sufficiently good, models require high, sufficiently good results, require high
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 34 Pages, 5 Figures, 7 Tables

点击查看摘要

Abstract:Deep supervised learning models require a high volume of labeled data to attain sufficiently good results. However, the practice of gathering and annotating such big data is costly and laborious. Recently, the application of self supervised learning (SSL) in vision tasks has gained significant attention. The intuition behind SSL is to exploit the synchronous relationships within the data as a form of self-supervision, which can be versatile. In the current big data era, most of the data is unlabeled, and the success of SSL thus relies on finding ways to leverage this vast amount of available unlabeled data. Thus it is better for deep learning algorithms to reduce reliance on human supervision and instead focus on self-supervision based on the inherent relationships within the data. With the advent of ViTs, which have achieved remarkable results in computer vision, it is crucial to explore and understand the various SSL mechanisms employed for training these models, specifically in scenarios where less labeled data is available. In this survey, we thus develop a comprehensive taxonomy that systematically classifies SSL techniques based on their representations and the pre-training tasks applied. Additionally, we discuss the motivations behind SSL, review popular pre-training tasks, and highlight the challenges and advancements in this field. Furthermore, we present a comparative analysis of different SSL methods, evaluate their strengths and limitations, and identify potential avenues for future research.

[LG-44] Estimating Conditional Average Treatment Effects via Sufficient Representation Learning

链接: https://arxiv.org/abs/2408.17053
作者: Pengfei Shi,Wei Zhong,Xinyu Zhang,Ningtao Wang,Xing Fu,Weiqiang Wang,Yin Jin
关键词-EN: conditional average treatment, average treatment effects, conditional average, important in causal, causal inference
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Estimating the conditional average treatment effects (CATE) is very important in causal inference and has a wide range of applications across many fields. In the estimation process of CATE, the unconfoundedness assumption is typically required to ensure the identifiability of the regression problems. When estimating CATE using high-dimensional data, there have been many variable selection methods and neural network approaches based on representation learning, but these methods do not provide a way to verify whether the subset of variables after dimensionality reduction or the learned representations still satisfy the unconfoundedness assumption during the estimation process, which can lead to ineffective estimates of the treatment effects. Additionally, these methods typically use data from only the treatment or control group when estimating the regression functions for each group. This paper proposes a novel neural network approach named CrossNet to learn a sufficient representation for the features, based on which we then estimate the CATE; "cross" indicates that, in estimating the regression functions, we use data from each group itself as well as cross-utilized data from the other group. Numerical simulations and empirical results demonstrate that our method outperforms the competitive approaches.

[LG-45] Error-controlled non-additive interaction discovery in machine learning models

链接: https://arxiv.org/abs/2408.17016
作者: Winston Chen,Yifan Jiang,William Stafford Noble,Yang Young Lu
关键词-EN: Machine learning, detecting complex patterns, black box, nature limits, limits their interpretability
类目: Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Machine learning (ML) models are powerful tools for detecting complex patterns within data, yet their “black box” nature limits their interpretability, hindering their use in critical domains like healthcare and finance. To address this challenge, interpretable ML methods have been developed to explain how features influence model predictions. However, these methods often focus on univariate feature importance, overlooking the complex interactions between features that ML models are capable of capturing. Recognizing this limitation, recent efforts have aimed to extend these methods to discover feature interactions, but existing approaches struggle with robustness and error control, especially under data perturbations. In this study, we introduce Diamond, a novel method for trustworthy feature interaction discovery. Diamond uniquely integrates the model-X knockoffs framework to control the false discovery rate (FDR), ensuring that the proportion of falsely discovered interactions remains low. We further address the challenges of using off-the-shelf interaction importance measures by proposing a calibration procedure that refines these measures to maintain the desired FDR. Diamond’s applicability spans a wide range of ML models, including deep neural networks, tree-based models, and factorization-based models. Our empirical evaluations on both simulated and real datasets across various biomedical studies demonstrate Diamond’s utility in enabling more reliable data-driven scientific discoveries. This method represents a significant step forward in the deployment of ML models for scientific innovation and hypothesis generation.
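Diamond的FDR控制建立在model-X knockoffs框架之上。下面用标准的knockoff筛选阈值(经典的FDR+规则)做一个最小示意,假设每个特征(或交互)的knockoff统计量 W 已经算好;这只是knockoffs机制的通用草图,并非论文中经过校准的Diamond流程,其中的统计量数值均为虚构:

```python
import numpy as np

def knockoff_threshold(W, q=0.2):
    # W[j] > 0: the real feature beat its knockoff; W[j] < 0: the knockoff won.
    # Pick the smallest t whose estimated false discovery proportion is <= q.
    candidates = np.sort(np.abs(W[W != 0]))
    for t in candidates:
        fdp_hat = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            return t
    return np.inf

# Toy statistics: four strong real signals, three knockoff-sized ones.
W = np.array([3.0, 2.5, 2.0, 1.5, -0.5, 0.4, -0.3])
t = knockoff_threshold(W, q=0.4)
selected = np.where(W >= t)[0]
```

负的 W 充当"阴性对照",分子中的 +1 保证了有限样本下的FDR控制。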

[LG-46] Improving Time Series Classification with Representation Soft Label Smoothing

链接: https://arxiv.org/abs/2408.17010
作者: Hengyi Ma,Weitong Chen
关键词-EN: time series classification, deep neural network, neural network based, Previous research, network based models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 14 pages,6 figures

点击查看摘要

Abstract:Previous research has indicated that deep neural network based models for time series classification (TSC) tasks are prone to overfitting. This issue can be mitigated by employing strategies that prevent the model from becoming overly confident in its predictions, such as label smoothing and confidence penalty. Building upon the concept of label smoothing, we propose a novel approach to generate more reliable soft labels, which we refer to as representation soft label smoothing. We apply label smoothing, confidence penalty, and our method, representation soft label smoothing, to several TSC models and compare their performance with a baseline method that uses only hard labels for training. Our results demonstrate that the use of these enhancement techniques yields competitive results compared to the baseline method. Importantly, our method demonstrates strong performance across models with varying structures and complexities.
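作为背景,该工作所基于的经典label smoothing几行代码即可写出。注意这只是标准技术的草图:论文提出的representation soft label smoothing是从学到的表示生成软标签,而不是下面这种与均匀分布的混合:

```python
import numpy as np

def smooth_labels(hard_labels, num_classes, epsilon=0.1):
    # Mix the one-hot targets with a uniform distribution so the model
    # is never pushed toward fully confident (overfit-prone) predictions.
    one_hot = np.eye(num_classes)[hard_labels]
    return (1.0 - epsilon) * one_hot + epsilon / num_classes

# Two samples, three classes: targets stay valid distributions.
soft = smooth_labels(np.array([0, 2]), num_classes=3, epsilon=0.1)
```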

[LG-47] Evaluation of Table Representations to Answer Questions from Tables in Documents : A Case Study using 3GPP Specifications

链接: https://arxiv.org/abs/2408.17008
作者: Sujoy Roychowdhury,Sumit Soman,HG Ranjani,Avantika Sharma,Neeraj Gunda,Sai Krishna Bala
关键词-EN: important aspect, ability to extract, Generation Partnership Project, question answering, document corpora
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: 10 pages, 4 figures, 2 tables

点击查看摘要

Abstract:With the ubiquitous use of document corpora for question answering, one important aspect which is especially relevant for technical documents is the ability to extract information from tables which are interspersed with text. The major challenge in this is that unlike free-flow text or an isolated set of tables, the representation of a table in terms of what constitutes a relevant chunk is not obvious. We conduct a series of experiments examining various representations of tabular data interspersed with text to understand the relative benefits of different representations. We choose a corpus of 3rd Generation Partnership Project (3GPP) documents since they are heavily interspersed with tables. We create an expert-curated dataset of question-answer pairs to evaluate our approach. We conclude that row-level representations, with the corresponding table header information included in every cell, improve retrieval performance, thus leveraging the structural information present in the tabular data.
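摘要中胜出的表示方式(按行切分、且每个单元格都重复列头信息)很容易示意。下面的参数名是虚构的占位符,并非取自真实的3GPP表格:

```python
def row_level_chunks(header, rows):
    # One retrieval chunk per table row; each cell carries its column
    # header so the chunk stays self-describing after chunking.
    return ["; ".join(f"{h}: {v}" for h, v in zip(header, row)) for row in rows]

header = ["Parameter", "Value", "Unit"]
rows = [["timerT300", "400", "ms"],
        ["maxRetxThreshold", "4", "retransmissions"]]
chunks = row_level_chunks(header, rows)
```

这样即使检索器只取回单独一行,行内的每个值仍带着它所属的列语义。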

[LG-48] A Tighter Convergence Proof of Reverse Experience Replay

链接: https://arxiv.org/abs/2408.16999
作者: Nan Jiang,Jinzhao Li,Yexiang Xue
关键词-EN: experience replay method, classic experience replay, Reverse Experience Replay, Experience Replay, replay method
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: This paper is accepted at RLC 2024

点击查看摘要

Abstract:In reinforcement learning, Reverse Experience Replay (RER) is a recently proposed algorithm that attains better sample complexity than the classic experience replay method. RER requires the learning algorithm to update the parameters through consecutive state-action-reward tuples in reverse order. However, the most recent theoretical analysis only holds for a minimal learning rate and short consecutive steps, which converges more slowly than large-learning-rate algorithms without RER. In view of this theoretical and empirical gap, we provide a tighter analysis that mitigates the limitation on the learning rate and the length of consecutive steps. Furthermore, we show theoretically that RER converges with a larger learning rate and a longer sequence.
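RER的核心机制(按时间逆序回放轨迹,使奖励信息在一次扫描内向后传播)可以用表格型Q-learning做一个玩具示意;这只是机制演示,并非论文所分析的具体算法,其中的链式MDP与超参数均为虚构:

```python
import numpy as np

def rer_q_update(Q, trajectory, alpha=0.5, gamma=0.9):
    # Replay stored (state, action, reward, next_state) tuples in
    # reverse temporal order so reward information flows backward
    # through the whole trajectory in a single sweep.
    for s, a, r, s_next in reversed(trajectory):
        target = r + gamma * Q[s_next].max()
        Q[s, a] += alpha * (target - Q[s, a])
    return Q

# 3-state chain where only the final transition is rewarded.
Q = np.zeros((3, 1))
traj = [(0, 0, 0.0, 1), (1, 0, 0.0, 2), (2, 0, 1.0, 2)]
Q = rer_q_update(Q, traj)
```

若按正序回放,同样一次扫描后起始状态的Q值仍为0;逆序回放让末端奖励在单次扫描内传到了状态0。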

[LG-49] A Scalable k-Medoids Clustering via Whale Optimization Algorithm

链接: https://arxiv.org/abs/2408.16993
作者: Huang Chenan,Narumasa Tsutsumida
关键词-EN: uncovering hidden patterns, insights from vast, Partitioning Around Medoids, critical tool, tool for uncovering
类目: Machine Learning (cs.LG); Performance (cs.PF)
*备注: 11 pages, 2 figures

点击查看摘要

Abstract:Unsupervised clustering has emerged as a critical tool for uncovering hidden patterns and insights from vast, unlabeled datasets. However, traditional methods like Partitioning Around Medoids (PAM) struggle with scalability due to their quadratic computational complexity. To address this limitation, we introduce WOA-kMedoids, a novel unsupervised clustering method that incorporates the Whale Optimization Algorithm (WOA), a nature-inspired metaheuristic inspired by the hunting strategies of humpback whales. By optimizing centroid selection, WOA-kMedoids reduces the computational complexity of the k-medoids algorithm from quadratic to near-linear with respect to the number of observations. This improvement in efficiency enables WOA-kMedoids to be scalable to large datasets while maintaining high clustering accuracy. We evaluated the performance of WOA-kMedoids on 25 diverse time series datasets from the UCR archive. Our empirical results demonstrate that WOA-kMedoids maintains clustering accuracy similar to PAM. While WOA-kMedoids exhibited slightly higher runtime than PAM on small datasets (less than 300 observations), it outperformed PAM in computational efficiency on larger datasets. The scalability of WOA-kMedoids, combined with its consistently high accuracy, positions it as a promising and practical choice for unsupervised clustering in big data applications. WOA-kMedoids has implications for efficient knowledge discovery in massive, unlabeled datasets across various domains.
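WOA所搜索的k-medoids目标函数可以直接写出来。下面在一个虚构的双簇玩具数据上用穷举扮演鲸鱼优化器的角色,其代价恰好是WOA-kMedoids要避免的二次复杂度,仅作目标函数的示意:

```python
import numpy as np

def kmedoids_cost(X, medoid_idx):
    # Distance of every point to its nearest medoid; the summed cost is
    # the objective that medoid-search methods (PAM, WOA) minimize.
    d = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=-1)
    return d.min(axis=1).sum(), d.argmin(axis=1)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (10, 2)),   # cluster around the origin
               rng.normal(5.0, 0.1, (10, 2))])  # cluster around (5, 5)

# Exhaustive search over medoid pairs -- the part a metaheuristic replaces.
cost, (i, j) = min((kmedoids_cost(X, [a, b])[0], (a, b))
                   for a in range(len(X)) for b in range(a + 1, len(X)))
_, labels = kmedoids_cost(X, [i, j])
```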

[LG-50] From Model Explanation to Data Misinterpretation: Uncovering the Pitfalls of Post Hoc Explainers in Business Research

链接: https://arxiv.org/abs/2408.16987
作者: Ronilo Ragodos,Tong Wang,Lu Feng,Yu (Jeffrey) Hu
关键词-EN: Machine learning models, Machine learning, learning models, post hoc, post hoc explainers
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Machine learning models have been increasingly used in business research. However, most state-of-the-art machine learning models, such as deep neural networks and XGBoost, are black boxes in nature. Therefore, post hoc explainers that provide explanations for machine learning models by, for example, estimating numerical importance of the input features, have been gaining wide usage. Despite the intended use of post hoc explainers being explaining machine learning models, we found a growing trend in business research where post hoc explanations are used to draw inferences about the data. In this work, we investigate the validity of such use. Specifically, we investigate with extensive experiments whether the explanations obtained by the two most popular post hoc explainers, SHAP and LIME, provide correct information about the true marginal effects of X on Y in the data, which we call data-alignment. We then identify what factors influence the alignment of explanations. Finally, we propose a set of mitigation strategies to improve the data-alignment of explanations and demonstrate their effectiveness with real-world data in an econometric context. In spite of this effort, we nevertheless conclude that it is often not appropriate to infer data insights from post hoc explanations. We articulate appropriate alternative uses, the most important of which is to facilitate the proposition and subsequent empirical investigation of hypotheses. The ultimate goal of this paper is to caution business researchers against translating post hoc explanations of machine learning models into potentially false insights and understanding of data.

[LG-51] The Sample-Communication Complexity Trade-off in Federated Q-Learning

链接: https://arxiv.org/abs/2408.16981
作者: Sudeep Salgia,Yuejie Chi
关键词-EN: unknown infinite-horizon Markov, infinite-horizon Markov decision, Markov decision process, federated Q-learning algorithm, optimal Q-function
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We consider the problem of federated Q-learning, where M agents aim to collaboratively learn the optimal Q-function of an unknown infinite-horizon Markov decision process with finite state and action spaces. We investigate the trade-off between sample and communication complexities for the widely used class of intermittent communication algorithms. We first establish the converse result, where it is shown that a federated Q-learning algorithm that offers any speedup with respect to the number of agents in the per-agent sample complexity needs to incur a communication cost of at least an order of \frac{1}{1-\gamma} up to logarithmic factors, where \gamma is the discount factor. We also propose a new algorithm, called Fed-DVR-Q, which is the first federated Q-learning algorithm to simultaneously achieve order-optimal sample and communication complexities. Thus, together these results provide a complete characterization of the sample-communication complexity trade-off in federated Q-learning.
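论文分析的"间歇通信"模板可以在一个平凡的单状态MDP上示意:每个智能体在两次同步之间做 R 次本地Q更新,因此通信代价是 T/R 轮。下面是机制的玩具模型(MDP、参数与随机种子均为虚构),并非Fed-DVR-Q算法本身:

```python
import numpy as np

def federated_q(M=4, T=100, R=10, alpha=0.1, gamma=0.5, p=0.9, seed=0):
    # One-state, one-action MDP with Bernoulli(p) rewards, so the
    # optimal value is p / (1 - gamma). Each of M agents runs R local
    # Q-updates between synchronizations (parameter averaging).
    rng = np.random.default_rng(seed)
    q = np.zeros(M)
    rounds = 0
    for t in range(T):
        r = (rng.random(M) < p).astype(float)  # per-agent reward samples
        q += alpha * (r + gamma * q - q)
        if (t + 1) % R == 0:                   # intermittent communication
            q[:] = q.mean()
            rounds += 1
    return q.mean(), rounds

est, rounds = federated_q()
```

增大 R 可减少通信轮数,但也推迟了智能体间的信息共享,这正是论文刻画的样本-通信权衡。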

[LG-52] Training Ultra Long Context Language Model with Fully Pipelined Distributed Transformer

链接: https://arxiv.org/abs/2408.16978
作者: Jinghan Yao,Sam Ade Jacobs,Masahiro Tanaka,Olatunji Ruwase,Aamir Shafi,Hari Subramoni,Dhabaleswar K. Panda
关键词-EN: Large Language Models, natural language processing, Large Language, long context capabilities, natural language
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) with long context capabilities are integral to complex tasks in natural language processing and computational biology, such as text generation and protein sequence analysis. However, training LLMs directly on extremely long contexts demands considerable GPU resources and increased memory, leading to higher costs and greater complexity. Alternative approaches that introduce long context capabilities via downstream finetuning or adaptations impose significant design limitations. In this paper, we propose Fully Pipelined Distributed Transformer (FPDT) for efficiently training long-context LLMs with extreme hardware efficiency. For GPT and Llama models, we achieve a 16x increase in sequence length that can be trained on the same hardware compared to current state-of-the-art solutions. With our dedicated sequence chunk pipeline design, we can now train 8B LLM with 2 million sequence length on only 4 GPUs, while also maintaining over 55% of MFU. Our proposed FPDT is agnostic to existing training techniques and is proven to work efficiently across different LLM models.

[LG-53] Point Neuron Learning: A New Physics-Informed Neural Network Architecture

链接: https://arxiv.org/abs/2408.16969
作者: Hanwen Bi,Thushara D. Abhayapala
关键词-EN: numerous research domains, advanced numerous research, large training data, training data requirements, inconsistent model performance
类目: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
*备注: under the review process of EURASIP Journal on Audio, Speech, and Music Processing

点击查看摘要

Abstract:Machine learning and neural networks have advanced numerous research domains, but challenges such as large training data requirements and inconsistent model performance hinder their application in certain scientific problems. To overcome these challenges, researchers have investigated integrating physics principles into machine learning models, mainly through: (i) physics-guided loss functions, generally termed as physics-informed neural networks, and (ii) physics-guided architectural design. While both approaches have demonstrated success across multiple scientific disciplines, they have limitations including being trapped in a local minimum, poor interpretability, and restricted generalizability. This paper proposes a new physics-informed neural network (PINN) architecture that combines the strengths of both approaches by embedding the fundamental solution of the wave equation into the network architecture, enabling the learned model to strictly satisfy the wave equation. The proposed point neuron learning method can model an arbitrary sound field based on microphone observations without any dataset. Compared to other PINN methods, our approach directly processes complex numbers and offers better interpretability and generalizability. We evaluate the versatility of the proposed architecture on a sound field reconstruction problem in a reverberant environment. Results indicate that the point neuron method outperforms two competing methods and can efficiently handle noisy environments with sparse microphone observations.
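被嵌入网络的基本解即Helmholtz方程的自由场Green函数 G(r) = e^{ikr} / (4πr)。下面是由这类"点神经元"叠加出声场的最小数值示意,其中声源位置与权重是任意给定的示例值,并非学习所得:

```python
import numpy as np

def point_neuron_field(mic_pos, src_pos, weights, k=2 * np.pi):
    # Field at each microphone: a weighted sum of wave-equation
    # fundamental solutions e^{ikr}/(4*pi*r), one per point source.
    r = np.linalg.norm(mic_pos[:, None, :] - src_pos[None, :, :], axis=-1)
    G = np.exp(1j * k * r) / (4 * np.pi * r)
    return G @ weights

mics = np.array([[1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
src = np.array([[0.0, 0.0, 0.0]])
field = point_neuron_field(mics, src, np.array([1.0 + 0.0j]))
```

每一项都精确满足波动方程,因此无论权重如何训练,模型输出始终是物理上合法的声场。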

[LG-54] UserSumBench: A Benchmark Framework for Evaluating User Summarization Approaches

链接: https://arxiv.org/abs/2408.16966
作者: Chao Wang,Neo Wu,Lin Ning,Luyang Liu,Jun Xie,Shawn O’Banion,Bradley Green
关键词-EN: Large language models, shown remarkable capabilities, user activity data, Large language, raw user activity
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) have shown remarkable capabilities in generating user summaries from a long list of raw user activity data. These summaries capture essential user information such as preferences and interests, and therefore are invaluable for LLM-based personalization applications, such as explainable recommender systems. However, the development of new summarization techniques is hindered by the lack of ground-truth labels, the inherent subjectivity of user summaries, and human evaluation, which is often costly and time-consuming. To address these challenges, we introduce UserSumBench, a benchmark framework designed to facilitate iterative development of LLM-based summarization approaches. This framework offers two key components: (1) A reference-free summary quality metric. We show that this metric is effective and aligned with human preferences across three diverse datasets (MovieLens, Yelp and Amazon Review). (2) A novel robust summarization method that leverages time-hierarchical summarizer and self-critique verifier to produce high-quality summaries while eliminating hallucination. This method serves as a strong baseline for further innovation in summarization techniques.

[LG-55] Discovery of False Data Injection Schemes on Frequency Controllers with Reinforcement Learning

链接: https://arxiv.org/abs/2408.16958
作者: Romesh Prasad,Malik Hassanaly,Xiangyu Zhang,Abhijeet Sahu
关键词-EN: distributed energy resources, inverter-based distributed energy, integrating renewable energy, energy resources, play a crucial
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:While inverter-based distributed energy resources (DERs) play a crucial role in integrating renewable energy into the power system, they concurrently diminish the grid’s system inertia, elevating the risk of frequency instabilities. Furthermore, smart inverters, interfaced via communication networks, pose a potential vulnerability to cyber threats if not diligently managed. To proactively fortify the power grid against sophisticated cyber attacks, we propose to employ reinforcement learning (RL) to identify potential threats and system vulnerabilities. This study concentrates on analyzing adversarial strategies for false data injection, specifically targeting smart inverters involved in primary frequency control. Our findings demonstrate that an RL agent can adeptly discern optimal false data injection methods to manipulate inverter settings, potentially causing catastrophic consequences.

[LG-56] An Empirical Study of Scaling Laws for Transfer

链接: https://arxiv.org/abs/2408.16947
作者: Matthew Barnett
关键词-EN: limited empirical study, transformer models, present a limited, limited empirical, empirical study
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present a limited empirical study of scaling laws for transfer learning in transformer models. More specifically, we examine a scaling law that incorporates a “transfer gap” term, indicating the effectiveness of pre-training on one distribution when optimizing for downstream performance on another distribution. When the transfer gap is low, pre-training is a cost-effective strategy for improving downstream performance. Conversely, when the gap is high, collecting high-quality fine-tuning data becomes relatively more cost effective. Fitting the scaling law to experiments from diverse datasets reveals significant variations in the transfer gap across distributions. In theory, the scaling law can inform optimal data allocation strategies and highlights how the scarcity of downstream data can bottleneck performance. Our findings contribute to a principled way to measure transfer learning efficiency and understand how data availability affects capabilities.
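论文所述权衡的一种直观看法:预训练数据遵循与微调数据相同的幂律,但无法把下游损失压到一个加性的transfer gap之下。下面的函数形式与常数均为虚构的演示,论文拟合的是其自己的参数化形式:

```python
def downstream_loss(n_ft, n_pt, gap, A=5.0, alpha=0.3, E=0.5):
    # Fine-tuning data follows a plain power law; pre-training data
    # follows the same law but is floored by the additive transfer gap.
    direct = E + A * (n_ft + 1) ** -alpha
    transferred = E + gap + A * (n_pt + 1) ** -alpha
    return min(direct, transferred)

low_gap = downstream_loss(n_ft=10, n_pt=10_000, gap=0.05)   # similar distributions
high_gap = downstream_loss(n_ft=10, n_pt=10_000, gap=2.0)   # distant distributions
more_ft = downstream_loss(n_ft=10_000, n_pt=0, gap=2.0)     # buy fine-tuning data instead
```

即便在这个玩具模型里也能复现摘要的结论:gap小时海量预训练数据有效,gap大时同样预算买下游微调数据更划算。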

[LG-57] Different Victims Same Layout: Email Visual Similarity Detection for Enhanced Email Protection CCS2024

链接: https://arxiv.org/abs/2408.16945
作者: Sachin Shukla,Omid Mirzaei
关键词-EN: machine learning, rule-based detection systems, effective spam detection, detection, email
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: To be published in the proceedings of the ACM Conference on Computer and Communications Security (ACM CCS 2024)

点击查看摘要

Abstract:In the pursuit of an effective spam detection system, the focus has often been on identifying known spam patterns either through rule-based detection systems or machine learning (ML) solutions. However, both systems are susceptible to evasion techniques and zero-day attacks that can be achieved at low cost. Therefore, an email that bypassed the defense system once can do it again in the following days, even though rules are updated or the ML models are retrained. The recurrence of failures to detect emails that exhibit layout similarities to previously undetected spam is concerning for customers and can erode their trust in a company. Our observations show that threat actors reuse email kits extensively and can bypass detection with little effort, for example, by making changes to the content of emails. In this work, we propose an email visual similarity detection approach, named Pisco, to improve the detection capabilities of an email threat defense system. We apply our proof of concept to some real-world samples received from different sources. Our results show that email kits are being reused extensively and visually similar emails are sent to our customers at various time intervals. Therefore, this method could be very helpful in situations where detection features that rely on contextual information and keywords are bypassed, an occurrence our observations show happens frequently.

[LG-58] FlowRetrieval: Flow-Guided Data Retrieval for Few-Shot Imitation Learning

链接: https://arxiv.org/abs/2408.16944
作者: Li-Heng Lin,Yuchen Cui,Amber Xie,Tianyu Hua,Dorsa Sadigh
关键词-EN: Few-shot imitation learning, imitation learning relies, task-specific demonstrations, demonstrations to efficiently, efficiently adapt
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Few-shot imitation learning relies on only a small number of task-specific demonstrations to efficiently adapt a policy to a given downstream task. Retrieval-based methods come with a promise of retrieving relevant past experiences to augment this target data when learning policies. However, existing data retrieval methods fall under two extremes: they either rely on the existence of exact behaviors with visually similar scenes in the prior data, which is impractical to assume; or they retrieve based on semantic similarity of high-level language descriptions of the task, which might not be that informative about the shared low-level behaviors or motions across tasks, often a more important factor for retrieving relevant data for policy learning. In this work, we investigate how we can leverage motion similarity in the vast amount of cross-task data to improve few-shot imitation learning of the target task. Our key insight is that motion-similar data carries rich information about the effects of actions and object interactions that can be leveraged during few-shot adaptation. We propose FlowRetrieval, an approach that leverages optical flow representations for both extracting similar motions to target tasks from prior data, and for guiding learning of a policy that can maximally benefit from such data. Our results show FlowRetrieval significantly outperforms prior methods across simulated and real-world domains, achieving on average 27% higher success rate than the best retrieval-based prior method. In the Pen-in-Cup task with a real Franka Emika robot, FlowRetrieval achieves 3.7x the performance of the baseline imitation learning technique that learns from all prior and target data. Website: this https URL
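检索这一步可以示意为运动特征空间里的最近邻搜索。下面二维的"光流特征"是虚构的占位向量,真实方法中它们是学到的光流隐表示:

```python
import numpy as np

def retrieve_by_motion(target_feat, prior_feats, k=2):
    # Rank prior trajectories by cosine similarity of their motion
    # features to the target task, then keep the top-k for co-training.
    t = target_feat / np.linalg.norm(target_feat)
    p = prior_feats / np.linalg.norm(prior_feats, axis=1, keepdims=True)
    sims = p @ t
    return np.argsort(-sims)[:k], sims

prior = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])  # fake flow features
idx, sims = retrieve_by_motion(np.array([1.0, 0.05]), prior, k=2)
```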

[LG-59] Theoretical Insights into Overparameterized Models in Multi-Task and Replay-Based Continual Learning

链接: https://arxiv.org/abs/2408.16939
作者: Mohammadamin Banayeeanzade,Mahdi Soltanolkotabi,Mohammad Rostami
关键词-EN: machine learning paradigm, multiple related tasks, Multi-task learning, MTL, paradigm that aims
类目: Machine Learning (cs.LG)
*备注: 41 pages, 21 figures

点击查看摘要

Abstract:Multi-task learning (MTL) is a machine learning paradigm that aims to improve the generalization performance of a model on multiple related tasks by training it simultaneously on those tasks. Unlike MTL, where the model has instant access to the training data of all tasks, continual learning (CL) involves adapting to new sequentially arriving tasks over time without forgetting the previously acquired knowledge. Despite the wide practical adoption of CL and MTL and extensive literature on both areas, there remains a gap in the theoretical understanding of these methods when used with overparameterized models such as deep neural networks. This paper studies overparameterized linear models as a proxy for more complex models. We develop theoretical results describing the effect of various system parameters on the model’s performance in an MTL setup. Specifically, we study the impact of model size, dataset size, and task similarity on the generalization error and knowledge transfer. Additionally, we present theoretical results to characterize the performance of replay-based CL models. Our results reveal the impact of buffer size and model capacity on the forgetting rate in a CL setup and help shed light on some of the state-of-the-art CL methods. Finally, through extensive empirical evaluations, we demonstrate that our theoretical findings are also applicable to deep neural networks, offering valuable guidance for designing MTL and CL models in practice.

[LG-60] Analyzing Inference Privacy Risks Through Gradients in Machine Learning

链接: https://arxiv.org/abs/2408.16913
作者: Zhuohang Li,Andrew Lowy,Jing Liu,Toshiaki Koike-Akino,Kieran Parsons,Bradley Malin,Ye Wang
关键词-EN: potentially sensitive user, shared gradients computed, models are iteratively, iteratively updated, updated with shared
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:In distributed learning settings, models are iteratively updated with shared gradients computed from potentially sensitive user data. While previous work has studied various privacy risks of sharing gradients, our paper aims to provide a systematic approach to analyze private information leakage from gradients. We present a unified game-based framework that encompasses a broad range of attacks including attribute, property, distributional, and user disclosures. We investigate how different uncertainties of the adversary affect their inferential power via extensive experiments on five datasets across various data modalities. Our results demonstrate the inefficacy of solely relying on data aggregation to achieve privacy against inference attacks in distributed learning. We further evaluate five types of defenses, namely, gradient pruning, signed gradient descent, adversarial perturbations, variational information bottleneck, and differential privacy, under both static and adaptive adversary settings. We provide an information-theoretic view for analyzing the effectiveness of these defenses against inference from gradients. Finally, we introduce a method for auditing attribute inference privacy, improving the empirical estimation of worst-case privacy through crafting adversarial canary records.
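文中评估的五种防御里有两种可以直接写出。下面是gradient pruning与signed gradient descent在单个共享梯度向量上的草图,梯度数值与保留比例均为示例:

```python
import numpy as np

def prune_gradient(g, keep_frac=0.1):
    # Gradient pruning: share only the largest-magnitude entries,
    # zeroing the rest before the gradient leaves the client.
    k = max(1, int(keep_frac * g.size))
    thresh = np.sort(np.abs(g).ravel())[-k]
    return np.where(np.abs(g) >= thresh, g, 0.0)

def sign_gradient(g):
    # Signed gradient descent: share only the sign of each entry.
    return np.sign(g)

g = np.array([0.05, -2.0, 0.3, 0.001, 1.0])
pruned = prune_gradient(g, keep_frac=0.4)
signed = sign_gradient(g)
```

两者都通过压缩共享梯度携带的信息来限制推断攻击面,代价是更新精度的损失。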

[LG-61] DLFormer: Enhancing Explainability in Multivariate Time Series Forecasting using Distributed Lag Embedding

链接: https://arxiv.org/abs/2408.16896
作者: Younghwi Kim,Dohee Kim,Sunghyun Sim
关键词-EN: time series, Abstract, time, series, variables
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Most real-world variables are multivariate time series influenced by past values and explanatory factors. Consequently, predicting these time series data using artificial intelligence is ongoing. In particular, in fields such as healthcare and finance, where reliability is crucial, having understandable explanations for predictions is essential. However, achieving a balance between high prediction accuracy and intuitive explainability has proven challenging. Although attention-based models can capture the temporal dependencies in time series prediction, they have limitations in representing the individual influence of each variable and the magnitude of that influence. To address this issue, this study introduced DLFormer, an attention-based architecture integrated with distributed lag embedding, to temporally embed individual variables and capture their temporal influence. Through validation against various real-world datasets, DLFormer showcased superior performance improvements compared to existing attention-based high-performance models. Furthermore, comparing the relationships between variables enhanced the reliability of explainability.
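The core idea behind a distributed lag embedding is to represent each variable by a window of its own past values, so a model can weight each lag separately instead of collapsing history into one summary value. A minimal sketch of building such lagged inputs (the details of DLFormer's actual embedding are in the paper; this only shows the lag-window construction):

```python
def lag_embed(series, n_lags):
    """Build a distributed-lag design matrix from a univariate series.

    Row t contains the window [x_t, x_{t+1}, ..., x_{t+n_lags-1}],
    giving a downstream model one input per lag of the variable.
    """
    return [series[t:t + n_lags] for t in range(len(series) - n_lags + 1)]

rows = lag_embed([1, 2, 3, 4, 5], 3)
print(rows)  # [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
```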

[LG-62] Exploring Multiple Strategies to Improve Multilingual Coreference Resolution in CorefUD

链接: https://arxiv.org/abs/2408.16893
作者: Ondřej Pražák,Miloslav Konopík
关键词-EN: natural language processing, identifying expressions, expressions in text, text that refer, critical component
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Coreference resolution, the task of identifying expressions in text that refer to the same entity, is a critical component in various natural language processing (NLP) applications. This paper presents our end-to-end neural coreference resolution system, utilizing the CorefUD 1.1 dataset, which spans 17 datasets across 12 languages. We first establish strong baseline models, including monolingual and cross-lingual variations, and then propose several extensions to enhance performance across diverse linguistic contexts. These extensions include cross-lingual training, incorporation of syntactic information, a Span2Head model for optimized headword prediction, and advanced singleton modeling. We also experiment with headword span representation and long-documents modeling through overlapping segments. The proposed extensions, particularly the heads-only approach, singleton modeling, and long document prediction significantly improve performance across most datasets. We also perform zero-shot cross-lingual experiments, highlighting the potential and limitations of cross-lingual transfer in coreference resolution. Our findings contribute to the development of robust and scalable coreference systems for multilingual coreference resolution. Finally, we evaluate our model on the CorefUD 1.1 test set and surpass the best model from the CRAC 2023 shared task of a comparable size by a large margin. Our model is available on GitHub: this https URL

[LG-63] Tex-ViT: A Generalizable Robust Texture-based dual-branch cross-attention deepfake detector

链接: https://arxiv.org/abs/2408.16892
作者: Deepak Dagar,Dinesh Kumar Vishwakarma
关键词-EN: realistic facial modification, produce highly realistic, highly realistic facial, facial modification, prevailing method
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deepfakes, which employ GANs to produce highly realistic facial modification, are widely regarded as the prevailing method. Traditional CNNs have been able to identify bogus media, but they struggle to perform well on different datasets and are vulnerable to adversarial attacks due to their lack of robustness. Vision transformers have demonstrated potential in the realm of image classification problems, but they require enough training data. Motivated by these limitations, this publication introduces Tex-ViT (Texture-Vision Transformer), which enhances CNN features by combining ResNet with a vision transformer. The model combines traditional ResNet features with a texture module that operates in parallel on sections of ResNet before each down-sampling operation. The texture module then serves as an input to the dual branch of the cross-attention vision transformer. It specifically focuses on improving the global texture module, which extracts feature map correlation. Empirical analysis reveals that fake images exhibit smooth textures that do not remain consistent over long distances in manipulations. Experiments were performed on different categories of FF++, such as DF, f2f, FS, and NT, together with other types of GAN datasets in cross-domain scenarios. Furthermore, experiments were also conducted on the FF++, DFDCPreview, and Celeb-DF datasets under several post-processing conditions, such as blurring, compression, and noise. The model surpassed the most advanced models in terms of generalization, achieving a 98% accuracy in cross-domain scenarios. This demonstrates its ability to learn the shared distinguishing textural characteristics in the manipulated samples. These experiments provide evidence that the proposed model is capable of being applied to various situations and is resistant to many post-processing procedures.

[LG-64] Robotic warehousing operations: a learn-then-optimize approach to large-scale neighborhood search

链接: https://arxiv.org/abs/2408.16890
作者: Cynthia Barnhart,Alexandre Jacquillat,Alexandria Schmid
关键词-EN: technologies requires dedicated, robotics technologies requires, requires dedicated optimization, manage large fleets, autonomous agents
类目: Robotics (cs.RO); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:The rapid deployment of robotics technologies requires dedicated optimization algorithms to manage large fleets of autonomous agents. This paper supports robotic parts-to-picker operations in warehousing by optimizing order-workstation assignments, item-pod assignments and the schedule of order fulfillment at workstations. The model maximizes throughput, while managing human workload at the workstations and congestion in the facility. We solve it via large-scale neighborhood search, with a novel learn-then-optimize approach to subproblem generation. The algorithm relies on an offline machine learning procedure to predict objective improvements based on subproblem features, and an online optimization model to generate a new subproblem at each iteration. In collaboration with Amazon Robotics, we show that our model and algorithm generate much stronger solutions for practical problems than state-of-the-art approaches. In particular, our solution enhances the utilization of robotic fleets by coordinating robotic tasks for human operators to pick multiple items at once, and by coordinating robotic routes to avoid congestion in the facility.

[LG-65] LLaVA-Chef: A Multi-modal Generative Model for Food Recipes

链接: https://arxiv.org/abs/2408.16889
作者: Fnu Mohbat,Mohammed J. Zaki
关键词-EN: rapidly evolving landscape, Natural Language Processing, online recipe sharing, globalized context, rapidly evolving
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In the rapidly evolving landscape of online recipe sharing within a globalized context, there has been a notable surge in research towards comprehending and generating food recipes. Recent advancements in large language models (LLMs) like GPT-2 and LLaVA have paved the way for Natural Language Processing (NLP) approaches to delve deeper into various facets of food-related tasks, encompassing ingredient recognition and comprehensive recipe generation. Despite impressive performance and multi-modal adaptability of LLMs, domain-specific training remains paramount for their effective application. This work evaluates existing LLMs for recipe generation and proposes LLaVA-Chef, a novel model trained on a curated dataset of diverse recipe prompts in a multi-stage approach. First, we refine the mapping of visual food image embeddings to the language space. Second, we adapt LLaVA to the food domain by fine-tuning it on relevant recipe data. Third, we utilize diverse prompts to enhance the model’s recipe comprehension. Finally, we improve the linguistic quality of generated recipes by penalizing the model with a custom loss function. LLaVA-Chef demonstrates impressive improvements over pretrained LLMs and prior works. A detailed qualitative analysis reveals that LLaVA-Chef generates more detailed recipes with precise ingredient mentions, compared to existing approaches.

[LG-66] Revising Multimodal VAEs with Diffusion Decoders

链接: https://arxiv.org/abs/2408.16883
作者: Daniel Wesego,Amirmohammad Rooshenas
关键词-EN: generating high-quality outputs, high-quality outputs, struggle with generating, generating high-quality, challenge that extends
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Multimodal VAEs often struggle with generating high-quality outputs, a challenge that extends beyond the inherent limitations of the VAE framework. The core issue lies in the restricted joint representation of the latent space, particularly when complex modalities like images are involved. Feedforward decoders, commonly used for these intricate modalities, inadvertently constrain the joint latent space, leading to a degradation in the quality of the other modalities as well. Although recent studies have shown improvement by introducing modality-specific representations, the issue remains significant. In this work, we demonstrate that incorporating a flexible diffusion decoder specifically for the image modality not only enhances the generation quality of the images but also positively impacts the performance of the other modalities that rely on feedforward decoders. This approach addresses the limitations imposed by conventional joint representations and opens up new possibilities for improving multimodal generation tasks using the multimodal VAE framework. Our model provides state-of-the-art results compared to other multimodal VAEs in different datasets with higher coherence and superior quality in the generated modalities.

[LG-67] Longitudinal Modularity, a Modularity for Link Streams

链接: https://arxiv.org/abs/2408.16877
作者: Victor Brabant,Yasaman Asgari,Pierre Borgnat,Angela Bonifati,Remy Cazabet
关键词-EN: model real-life phenomena, model real-life, link streams, real-life phenomena, streams
类目: Social and Information Networks (cs.SI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Temporal networks are commonly used to model real-life phenomena. When these phenomena represent interactions and are captured at a fine-grained temporal resolution, they are modeled as link streams. Community detection is an essential network analysis task. Although many methods exist for static networks, and some methods have been developed for temporal networks represented as sequences of snapshots, few works can handle link streams. This article introduces the first adaptation of the well-known Modularity quality function to link streams. Unlike existing methods, it is independent of the time scale of analysis. After introducing the quality function, and its relation to existing static and dynamic definitions of Modularity, we show experimentally its relevance for dynamic community evaluation.
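For reference, the static Newman-Girvan Modularity that the paper adapts to link streams scores a partition by comparing the fraction of intra-community edges against what a degree-preserving null model would predict. A pure-Python sketch for a small static graph (the longitudinal version in the paper additionally accounts for time, which this sketch does not):

```python
from collections import defaultdict

def modularity(edges, communities):
    """Newman-Girvan modularity of a static undirected graph.

    edges: list of (u, v) pairs; communities: dict node -> label.
    Q = sum over communities c of (e_c / m - (d_c / 2m)^2), where e_c
    is the number of intra-community edges and d_c the total degree.
    """
    m = len(edges)
    degree = defaultdict(int)
    q = 0.0
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if communities[u] == communities[v]:
            q += 1.0 / m  # observed intra-community edge fraction
    # Subtract the expected fraction under the configuration model
    deg_sum = defaultdict(float)
    for node, k in degree.items():
        deg_sum[communities[node]] += k
    for s in deg_sum.values():
        q -= (s / (2.0 * m)) ** 2
    return q

# Two triangles joined by one bridge edge, one community per triangle
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
comms = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
print(round(modularity(edges, comms), 4))  # 0.3571
```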

[LG-68] Learning Multi-agent Multi-machine Tending by Mobile Robots

链接: https://arxiv.org/abs/2408.16875
作者: Abdalwhab Abdalwhab,Giovanni Beltrame,Samira Ebrahimi Kahou,David St-Onge
关键词-EN: growing worker shortage, worker shortage challenge, manufacturing industry, address the growing, growing worker
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 7 pages, 4 figures

点击查看摘要

Abstract:Robotics can help address the growing worker shortage challenge of the manufacturing industry. As such, machine tending is a task collaborative robots can tackle that can also highly boost productivity. Nevertheless, existing robotics systems deployed in that sector rely on a fixed single-arm setup, whereas mobile robots can provide more flexibility and scalability. In this work, we introduce a multi-agent multi-machine tending learning framework by mobile robots based on Multi-agent Reinforcement Learning (MARL) techniques with the design of a suitable observation and reward. Moreover, an attention-based encoding mechanism is developed and integrated into Multi-agent Proximal Policy Optimization (MAPPO) algorithm to boost its performance for machine tending scenarios. Our model (AB-MAPPO) outperformed MAPPO in this new challenging scenario in terms of task success, safety, and resources utilization. Furthermore, we provided an extensive ablation study to support our various design decisions.

[LG-69] GSTAM: Efficient Graph Distillation with Structural Attention-Matching ECCV

链接: https://arxiv.org/abs/2408.16871
作者: Arash Rasti-Meymandi,Ahmad Sajedi,Zhaopan Xu,Konstantinos N. Plataniotis
关键词-EN: reducing large graph, large graph datasets, solution for reducing, reducing large, Graph
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted at ECCV-DD 2024

点击查看摘要

Abstract:Graph distillation has emerged as a solution for reducing large graph datasets to smaller, more manageable, and informative ones. Existing methods primarily target node classification, involve computationally intensive processes, and fail to capture the true distribution of the full graph dataset. To address these issues, we introduce Graph Distillation with Structural Attention Matching (GSTAM), a novel method for condensing graph classification datasets. GSTAM leverages the attention maps of GNNs to distill structural information from the original dataset into synthetic graphs. The structural attention-matching mechanism exploits the areas of the input graph that GNNs prioritize for classification, effectively distilling such information into the synthetic graphs and improving overall distillation performance. Comprehensive experiments demonstrate GSTAM’s superiority over existing methods, achieving 0.45% to 6.5% better performance in extreme condensation ratios, highlighting its potential use in advancing distillation for graph classification tasks (Code available at this https URL).

[LG-70] The Star Geometry of Critic-Based Regularizer Learning

链接: https://arxiv.org/abs/2408.16852
作者: Oscar Leong,Eliza O’Reilly,Yong Sheng Soh
关键词-EN: impressive empirical performance, modern data-driven approaches, data-driven approaches parameterizing, showcasing impressive empirical, networks showcasing impressive
类目: Machine Learning (cs.LG); Metric Geometry (math.MG); Optimization and Control (math.OC); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Variational regularization is a classical technique to solve statistical inference tasks and inverse problems, with modern data-driven approaches parameterizing regularizers via deep neural networks showcasing impressive empirical performance. Recent works along these lines learn task-dependent regularizers. This is done by integrating information about the measurements and ground-truth data in an unsupervised, critic-based loss function, where the regularizer attributes low values to likely data and high values to unlikely data. However, there is little theory about the structure of regularizers learned via this process and how it relates to the two data distributions. To make progress on this challenge, we initiate a study of optimizing critic-based loss functions to learn regularizers over a particular family of regularizers: gauges (or Minkowski functionals) of star-shaped bodies. This family contains regularizers that are commonly employed in practice and shares properties with regularizers parameterized by deep neural networks. We specifically investigate critic-based losses derived from variational representations of statistical distances between probability measures. By leveraging tools from star geometry and dual Brunn-Minkowski theory, we illustrate how these losses can be interpreted as dual mixed volumes that depend on the data distribution. This allows us to derive exact expressions for the optimal regularizer in certain cases. Finally, we identify which neural network architectures give rise to such star body gauges and when do such regularizers have favorable properties for optimization. More broadly, this work highlights how the tools of star geometry can aid in understanding the geometry of unsupervised regularizer learning.
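The family of regularizers studied here consists of gauges of star-shaped bodies. The gauge (Minkowski functional) of a star body $K$ is the standard construction (this definition is textbook material, not specific to the paper):

```latex
\gamma_K(x) \;=\; \inf\{\, t > 0 \;:\; x \in tK \,\}.
```

When $K$ is the Euclidean unit ball this recovers $\|x\|_2$, and in general $\gamma_K$ is positively homogeneous, $\gamma_K(\lambda x) = \lambda\,\gamma_K(x)$ for $\lambda \ge 0$, which is the key structural property the paper's dual Brunn-Minkowski analysis exploits.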

[LG-71] Machine Learning-Based Research on the Adaptability of Adolescents to Online Education

链接: https://arxiv.org/abs/2408.16849
作者: Mingwei Wang,Sitong Liu
关键词-EN: adolescent online learning, online learning adaptability, Chinese Adolescent Online, Adolescent Online Education, online learning
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:With the rapid advancement of internet technology, the adaptability of adolescents to online learning has emerged as a focal point of interest within the educational sphere. However, the academic community’s efforts to develop predictive models for adolescent online learning adaptability require further refinement and expansion. Utilizing data from the “Chinese Adolescent Online Education Survey” spanning the years 2014 to 2016, this study implements five machine learning algorithms - logistic regression, K-nearest neighbors, random forest, XGBoost, and CatBoost - to analyze the factors influencing adolescent online learning adaptability and to determine the model best suited for prediction. The research reveals that the duration of courses, the financial status of the family, and age are the primary factors affecting students’ adaptability in online learning environments. Additionally, age significantly impacts students’ adaptive capacities. Among the predictive models, the random forest, XGBoost, and CatBoost algorithms demonstrate superior forecasting capabilities, with the random forest model being particularly adept at capturing the characteristics of students’ adaptability.

[LG-72] Enabling Local Editing in Diffusion Models by Joint and Individual Component Analysis

链接: https://arxiv.org/abs/2408.16845
作者: Theodoros Kouzelis,Manos Plitsis,Mihalis A. Nikolaou,Yannis Panagakis
关键词-EN: Generative Adversarial Networks, Diffusion Models, Generative Adversarial, advances in Diffusion, competitor to Generative
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Code available here: this https URL

点击查看摘要

Abstract:Recent advances in Diffusion Models (DMs) have led to significant progress in visual synthesis and editing tasks, establishing them as a strong competitor to Generative Adversarial Networks (GANs). However, the latent space of DMs is not as well understood as that of GANs. Recent research has focused on unsupervised semantic discovery in the latent space of DMs by leveraging the bottleneck layer of the denoising network, which has been shown to exhibit properties of a semantic latent space. However, these approaches are limited to discovering global attributes. In this paper we address, the challenge of local image manipulation in DMs and introduce an unsupervised method to factorize the latent semantics learned by the denoising network of pre-trained DMs. Given an arbitrary image and defined regions of interest, we utilize the Jacobian of the denoising network to establish a relation between the regions of interest and their corresponding subspaces in the latent space. Furthermore, we disentangle the joint and individual components of these subspaces to identify latent directions that enable local image manipulation. Once discovered, these directions can be applied to different images to produce semantically consistent edits, making our method suitable for practical applications. Experimental results on various datasets demonstrate that our method can produce semantic edits that are more localized and have better fidelity compared to the state-of-the-art.

[LG-73] AdapShare: An RL-Based Dynamic Spectrum Sharing Solution for O-RAN

链接: https://arxiv.org/abs/2408.16842
作者: Sneihil Gopal,David Griffith,Richard A. Rouil,Chunmei Liu
关键词-EN: Open Radio Access, RAN Intelligent Controller, Radio Access Network, ML-capable RAN Intelligent, Intelligent Controller
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注: arXiv admin note: text overlap with arXiv:2404.09110

点击查看摘要

Abstract:The Open Radio Access Network (O-RAN) initiative, characterized by open interfaces and AI/ML-capable RAN Intelligent Controller (RIC), facilitates effective spectrum sharing among RANs. In this context, we introduce AdapShare, an ORAN-compatible solution leveraging Reinforcement Learning (RL) for intent-based spectrum management, with the primary goal of minimizing resource surpluses or deficits in RANs. By employing RL agents, AdapShare intelligently learns network demand patterns and uses them to allocate resources. We demonstrate the efficacy of AdapShare in the spectrum sharing scenario between LTE and NR networks, incorporating real-world LTE resource usage data and synthetic NR usage data to demonstrate its practical use. We use the average surplus or deficit and fairness index to measure the system’s performance in various scenarios. AdapShare outperforms a quasi-static resource allocation scheme based on long-term network demand statistics, particularly when available resources are scarce or exceed the aggregate demand from the networks. Lastly, we present a high-level O-RAN compatible architecture using RL agents, which demonstrates the seamless integration of AdapShare into real-world deployment scenarios.

[LG-74] Physics-Informed Neural Networks and Extensions

链接: https://arxiv.org/abs/2408.16806
作者: Maziar Raissi,Paris Perdikaris,Nazanin Ahmadi,George Em Karniadakis
关键词-EN: Physics-Informed Neural Networks, method Physics-Informed Neural, scientific machine learning, recent practical extensions, governing differential equations
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Frontiers of Science Awards 2024

点击查看摘要

Abstract:In this paper, we review the new method Physics-Informed Neural Networks (PINNs) that has become the main pillar in scientific machine learning, we present recent practical extensions, and provide a specific example in data-driven discovery of governing differential equations.
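The core of a PINN is a composite loss that penalizes both mismatch with observed data and the residual of the governing PDE at collocation points. In the standard formulation (for a PDE of the form $u_t + \mathcal{N}[u] = 0$):

```latex
\mathcal{L}(\theta)
\;=\;
\underbrace{\frac{1}{N_u}\sum_{i=1}^{N_u}\bigl|u_\theta(t_u^i, x_u^i) - u^i\bigr|^2}_{\text{data / boundary loss}}
\;+\;
\underbrace{\frac{1}{N_f}\sum_{j=1}^{N_f}\bigl|f_\theta(t_f^j, x_f^j)\bigr|^2}_{\text{PDE residual loss}},
\qquad
f_\theta \;:=\; \partial_t u_\theta + \mathcal{N}[u_\theta],
```

where $u_\theta$ is the neural network and the residual $f_\theta$ is evaluated with automatic differentiation. The extensions reviewed in the paper build on this basic objective.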

[LG-75] HLogformer: A Hierarchical Transformer for Representing Log Data

链接: https://arxiv.org/abs/2408.16803
作者: Zhichao Hou,Mina Ghashami,Mikhail Kuznetsov,MohamadAli Torkamani
关键词-EN: gained widespread acclaim, data remains underexplored, handling diverse data, log, log data
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Transformers have gained widespread acclaim for their versatility in handling diverse data structures, yet their application to log data remains underexplored. Log data, characterized by its hierarchical, dictionary-like structure, poses unique challenges when processed using conventional transformer models. Traditional methods often rely on manually crafted templates for parsing logs, a process that is labor-intensive and lacks generalizability. Additionally, the linear treatment of log sequences by standard transformers neglects the rich, nested relationships within log entries, leading to suboptimal representations and excessive memory usage. To address these issues, we introduce HLogformer, a novel hierarchical transformer framework specifically designed for log data. HLogformer leverages the hierarchical structure of log entries to significantly reduce memory costs and enhance representation learning. Unlike traditional models that treat log data as flat sequences, our framework processes log entries in a manner that respects their inherent hierarchical organization. This approach ensures comprehensive encoding of both fine-grained details and broader contextual relationships. Our contributions are threefold: First, HLogformer is the first framework to design a dynamic hierarchical transformer tailored for dictionary-like log data. Second, it dramatically reduces memory costs associated with processing extensive log sequences. Third, comprehensive experiments demonstrate that HLogformer more effectively encodes hierarchical contextual information, proving to be highly effective for downstream tasks such as synthetic anomaly detection and product recommendation.

[LG-76] Generative AI in Ship Design

链接: https://arxiv.org/abs/2408.16798
作者: Sahil Thakur,Navneet V Saxena,Prof Sitikantha Roy
关键词-EN: heavily influenced, accounts for approximately, total cost, Gaussian Mixture Model, model architecture
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The process of ship design is intricate, heavily influenced by the hull form which accounts for approximately 70% of the total cost. Traditional methods rely on human-driven iterative processes based on naval architecture principles and engineering analysis. In contrast, generative AI presents a novel approach, utilizing computational algorithms rooted in machine learning and artificial intelligence to optimize ship hull design. This report outlines the systematic creation of a generative AI for this purpose, involving steps such as dataset collection, model architecture selection, training, and validation. Utilizing the “SHIP-D” dataset, consisting of 30,000 hull forms, the report adopts the Gaussian Mixture Model (GMM) as the generative model architecture. GMMs offer a statistical framework to analyze data distribution, crucial for generating innovative ship designs efficiently. Overall, this approach holds promise in revolutionizing ship design by exploring a broader design space and integrating multidisciplinary optimization objectives effectively.
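A GMM generates a new design by first drawing a mixture component according to its weights and then sampling from that component's Gaussian. A minimal 1-D pure-Python sketch of this two-step sampling (the actual "SHIP-D" hull parameterization is high-dimensional, so this only illustrates the mechanism):

```python
import random

def sample_gmm(weights, means, stds, n, seed=0):
    """Draw n samples from a 1-D Gaussian Mixture Model.

    Each draw first picks a component index according to `weights`,
    then samples from that component's normal distribution.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        k = rng.choices(range(len(weights)), weights=weights)[0]
        samples.append(rng.gauss(means[k], stds[k]))
    return samples

# Equal-weight mixture of N(0, 1) and N(10, 1): the sample mean sits near 5
xs = sample_gmm([0.5, 0.5], [0.0, 10.0], [1.0, 1.0], 10_000)
print(sum(xs) / len(xs))
```

In the ship-design setting, each sampled vector would parameterize a candidate hull form rather than a scalar.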

[LG-77] Advance Real-time Detection of Traffic Incidents in Highways using Vehicle Trajectory Data

链接: https://arxiv.org/abs/2408.16773
作者: Sudipta Roy,Samiul Hasan
关键词-EN: secondary crashes, significant number, traffic incidents, traffic, Random Forest
类目: Computers and Society (cs.CY); Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
*备注: 19 Pages, 4 Tables, 10 Figures

点击查看摘要

Abstract:A significant number of traffic crashes are secondary crashes that occur because of an earlier incident on the road. Thus, early detection of traffic incidents is crucial for road users from safety perspectives with a potential to reduce the risk of secondary crashes. The wide availability of GPS devices now-a-days gives an opportunity of tracking and recording vehicle trajectories. The objective of this study is to use vehicle trajectory data for advance real-time detection of traffic incidents on highways using machine learning-based algorithms. The study uses three days of unevenly sequenced vehicle trajectory data and traffic incident data on I-10, one of the most crash-prone highways in Louisiana. Vehicle trajectories are converted to trajectories based on virtual detector locations to maintain spatial uniformity as well as to generate historical traffic data for machine learning algorithms. Trips matched with traffic incidents on the way are separated and along with other trips with similar spatial attributes are used to build a database for modeling. Multiple machine learning algorithms such as Logistic Regression, Random Forest, Extreme Gradient Boost, and Artificial Neural Network models are used to detect a trajectory that is likely to face an incident in the downstream road section. Results suggest that the Random Forest model achieves the best performance for predicting an incident with reasonable recall value and discrimination capability.

[LG-78] An Effective Information Theoretic Framework for Channel Pruning

链接: https://arxiv.org/abs/2408.16772
作者: Yihao Chen,Zefang Wang
关键词-EN: neural networks, Channel pruning, accelerating and compressing, compressing convolutional neural, information
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Channel pruning is a promising method for accelerating and compressing convolutional neural networks. However, current pruning algorithms still remain unsolved problems that how to assign layer-wise pruning ratios properly and discard the least important channels with a convincing criterion. In this paper, we present a novel channel pruning approach via information theory and interpretability of neural networks. Specifically, we regard information entropy as the expected amount of information for convolutional layers. In addition, if we suppose a matrix as a system of linear equations, a higher-rank matrix represents there exist more solutions to it, which indicates more uncertainty. From the point of view of information theory, the rank can also describe the amount of information. In a neural network, considering the rank and entropy as two information indicators of convolutional layers, we propose a fusion function to reach a compromise of them, where the fusion results are defined as “information concentration”. When pre-defining layer-wise pruning ratios, we employ the information concentration as a reference instead of heuristic and engineering tuning to provide a more interpretable solution. Moreover, we leverage Shapley values, which are a potent tool in the interpretability of neural networks, to evaluate the channel contributions and discard the least important channels for model compression while maintaining its performance. Extensive experiments demonstrate the effectiveness and promising performance of our method. For example, our method improves the accuracy by 0.21% when reducing 45.5% FLOPs and removing 40.3% parameters for ResNet-56 on CIFAR-10. Moreover, our method obtains loss in Top-1/Top-5 accuracies of 0.43%/0.11% by reducing 41.6% FLOPs and removing 35.0% parameters for ResNet-50 on ImageNet.
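The two information indicators the paper fuses, entropy and rank, can each be computed with a few lines of pure Python; the sketch below shows them in isolation (the fusion function that combines them into "information concentration" is paper-specific and is not reproduced here):

```python
import math

def shannon_entropy(probs):
    """Entropy (in bits) of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def matrix_rank(rows, tol=1e-9):
    """Rank of a matrix via Gaussian elimination on a copy."""
    a = [list(map(float, r)) for r in rows]
    rank = 0
    for col in range(len(a[0])):
        # Find a pivot row at or below the current rank position
        pivot = next((r for r in range(rank, len(a)) if abs(a[r][col]) > tol), None)
        if pivot is None:
            continue
        a[rank], a[pivot] = a[pivot], a[rank]
        for r in range(len(a)):
            if r != rank and abs(a[r][col]) > tol:
                f = a[r][col] / a[rank][col]
                a[r] = [x - f * y for x, y in zip(a[r], a[rank])]
        rank += 1
    return rank

print(shannon_entropy([0.25] * 4))    # 2.0 (uniform over 4 bins is maximal)
print(matrix_rank([[1, 2], [2, 4]]))  # 1 (second row is 2x the first)
```

In the paper these quantities are computed per convolutional layer from its feature maps, then fused to set the layer-wise pruning ratios.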

[LG-79] SelectTTS: Synthesizing Anyones Voice via Discrete Unit-Based Frame Selection

链接: https://arxiv.org/abs/2408.17432
作者: Ismail Rasim Ulgen,Shreeram Suresh Chandra,Junchen Lu,Berrak Sisman
关键词-EN: Synthesizing the voices, multi-speaker TTS, multi-speaker TTS models, persisting challenge, multi-speaker TTS frameworks
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
*备注: Submitted to IEEE Signal Processing Letters

点击查看摘要

Abstract:Synthesizing the voices of unseen speakers is a persisting challenge in multi-speaker text-to-speech (TTS). Most multi-speaker TTS models rely on modeling speaker characteristics through speaker conditioning during training. Modeling unseen speaker attributes through this approach has necessitated an increase in model complexity, which makes it challenging to reproduce results and improve upon them. We design a simple alternative to this. We propose SelectTTS, a novel method to select the appropriate frames from the target speaker and decode using frame-level self-supervised learning (SSL) features. We show that this approach can effectively capture speaker characteristics for unseen speakers, and achieves comparable results to other multi-speaker TTS frameworks in both objective and subjective metrics. With SelectTTS, we show that frame selection from the target speaker’s speech is a direct way to achieve generalization in unseen speakers with low model complexity. We achieve better speaker similarity performance than SOTA baselines XTTS-v2 and VALL-E with over an 8x reduction in model parameters and a 270x reduction in training data.
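The frame-selection idea can be illustrated as nearest-neighbor lookup: for each predicted feature frame, pick the most similar frame from the target speaker's recorded speech. A minimal cosine-similarity sketch (the vectors here stand in for frame-level SSL features; the actual selection criterion used by SelectTTS may differ):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def select_frames(predicted, target_frames):
    """For each predicted frame, pick the index of the most similar
    target-speaker frame; the chosen frames are then decoded to speech."""
    return [max(range(len(target_frames)),
                key=lambda i: cosine(p, target_frames[i]))
            for p in predicted]

target = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pred = [[0.9, 0.1], [0.4, 0.5]]
print(select_frames(pred, target))  # [0, 2]
```

Because synthesis reuses real frames from the target speaker, speaker identity is carried by the selected frames themselves rather than by a learned speaker embedding.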

[LG-80] Bayesian Optimization for Non-Convex Two-Stage Stochastic Optimization Problems

链接: https://arxiv.org/abs/2408.17387
作者: Jack M. Buckingham,Ivo Couckuyt,Juergen Branke
关键词-EN: black-box optimization problems, Bayesian optimization, black-box optimization, apply Bayesian optimization, programming concerns optimization
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Bayesian optimization is a sample-efficient method for solving expensive, black-box optimization problems. Stochastic programming concerns optimization under uncertainty where, typically, average performance is the quantity of interest. In the first stage of a two-stage problem, here-and-now decisions must be made in the face of this uncertainty, while in the second stage, wait-and-see decisions are made after the uncertainty has been resolved. Many methods in stochastic programming assume that the objective is cheap to evaluate and linear or convex. In this work, we apply Bayesian optimization to solve non-convex, two-stage stochastic programs which are expensive to evaluate. We formulate a knowledge-gradient-based acquisition function to jointly optimize the first- and second-stage variables, establish a guarantee of asymptotic consistency and provide a computationally efficient approximation. We demonstrate comparable empirical results to an alternative we formulate which alternates its focus between the two variable types, and superior empirical results over the standard, naive, two-step benchmark. We show that differences in the dimension and length scales between the variable types can lead to inefficiencies of the two-step algorithm, while the joint and alternating acquisition functions perform well in all problems tested. Experiments are conducted on both synthetic and real-world examples.
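
The two-stage structure can be made concrete with a minimal Monte Carlo sketch; the quadratic inner cost, its closed-form recourse, and the grid search (standing in for the knowledge-gradient acquisition loop) are all illustrative assumptions, not the paper's formulation.

```python
import numpy as np

rng = np.random.default_rng(2)

def second_stage(x, w):
    """Wait-and-see stage: the inner cost (x + y - w)^2 + y^2 has the
    closed-form minimizer y* = (w - x) / 2 once w is revealed."""
    y = (w - x) / 2.0
    return (x + y - w) ** 2 + y ** 2

def objective(x, n_samples=2000):
    """Here-and-now objective: Monte Carlo average of second-stage cost."""
    w = rng.normal(loc=1.0, scale=0.5, size=n_samples)
    return second_stage(x, w).mean()

# A coarse grid search stands in for the knowledge-gradient acquisition
# loop; the true objective here is E[(w - x)^2] / 2, minimized at x = E[w] = 1.
grid = np.linspace(-2.0, 3.0, 101)
values = np.array([objective(x) for x in grid])
x_best = grid[values.argmin()]
```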

[LG-81] Estimation of Cardiac and Non-cardiac Diagnosis from Electrocardiogram Features ALT

链接: https://arxiv.org/abs/2408.17329
作者: Juan Miguel Lopez Alcaraz,Nils Strodthoff
关键词-EN: effective patient care, Ensuring timely, timely and accurate, paramount for effective, ECG
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 4 pages, source code under this https URL

点击查看摘要

Abstract:Introduction: Ensuring timely and accurate diagnosis of medical conditions is paramount for effective patient care. Electrocardiogram (ECG) signals are fundamental for evaluating a patient’s cardiac health and are readily available. Despite this, little attention has been given to the remarkable potential of ECG data in detecting non-cardiac conditions. Methods: In our study, we used publicly available datasets (MIMIC-IV-ECG-ICD and ECG-VIEW II) to investigate the feasibility of inferring general diagnostic conditions from ECG features. To this end, we trained a tree-based model (XGBoost) based on ECG features and basic demographic features to estimate a wide range of diagnoses, encompassing both cardiac and non-cardiac conditions. Results: Our results demonstrate the reliability of estimating 23 cardiac as well as 21 non-cardiac conditions above 0.7 AUROC in a statistically significant manner across a wide range of physiological categories. Our findings underscore the predictive potential of ECG data in identifying well-known cardiac conditions. However, even more striking, this research represents a pioneering effort in systematically expanding the scope of ECG-based diagnosis to conditions not traditionally associated with the cardiac system.
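
The 0.7 AUROC threshold used above is the standard rank statistic, which can be computed directly; the labels and scores below are synthetic stand-ins for the paper's ECG-derived features and diagnoses.

```python
import numpy as np

def auroc(labels, scores):
    """Rank-based AUROC: probability that a random positive outscores a
    random negative, counting ties as one half."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (pos.size * neg.size)

rng = np.random.default_rng(3)
y = rng.integers(0, 2, size=500)          # toy diagnosis labels
scores = y + rng.normal(size=500)         # informative but noisy model output
value = auroc(y, scores)                  # well above the 0.5 chance level
```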

[LG-82] Accelerating the discovery of steady-states of planetary interior dynamics with machine learning

链接: https://arxiv.org/abs/2408.17298
作者: Siddhant Agarwal,Nicola Tosi,Christian Hüttig,David S. Greenberg,Ali Can Bekar
关键词-EN: deriving scaling laws, dynamical flow properties, Simulating mantle convection, computationally expensive steady-state, Simulating mantle
类目: Fluid Dynamics (physics.flu-dyn); Earth and Planetary Astrophysics (astro-ph.EP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Simulating mantle convection often requires reaching a computationally expensive steady-state, crucial for deriving scaling laws for thermal and dynamical flow properties and benchmarking numerical solutions. The strong temperature dependence of the rheology of mantle rocks causes viscosity variations of several orders of magnitude, leading to a slow-evolving stagnant lid where heat conduction dominates, overlying a rapidly-evolving and strongly convecting region. Time-stepping methods, while effective for fluids with constant viscosity, are hindered by the Courant criterion, which restricts the time step based on the system’s maximum velocity and grid size. Consequently, achieving steady-state requires a large number of time steps due to the disparate time scales governing the stagnant and convecting regions. We present a concept for accelerating mantle convection simulations using machine learning. We generate a dataset of 128 two-dimensional simulations with mixed basal and internal heating, and pressure- and temperature-dependent viscosity. We train a feedforward neural network on 97 simulations to predict steady-state temperature profiles. These can then be used to initialize numerical time stepping methods for different simulation parameters. Compared to typical initializations, the number of time steps required to reach steady-state is reduced by a median factor of 3.75. The benefit of this method lies in requiring very few simulations to train on, providing a solution with no prediction error as we initialize a numerical method, and posing minimal computational overhead at inference time. We demonstrate the effectiveness of our approach and discuss the potential implications for accelerated simulations for advancing mantle convection research.
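
The payoff of a good initial guess can be illustrated on a much simpler problem; the 1-D heat equation below is a toy stand-in for the mantle-convection solver, and the "predicted" warm-start profile is simulated rather than produced by a neural network.

```python
import numpy as np

def steps_to_steady(T0, tol=1e-6, max_steps=200_000):
    """Explicit 1-D heat-conduction time stepping until updates stall;
    a toy stand-in for a mantle-convection solver."""
    T = T0.copy()
    T[0], T[-1] = 1.0, 0.0                     # fixed boundary temperatures
    for n in range(1, max_steps + 1):
        T_new = T.copy()
        T_new[1:-1] = T[1:-1] + 0.4 * (T[2:] - 2 * T[1:-1] + T[:-2])
        if np.abs(T_new - T).max() < tol:
            return n
        T = T_new
    return max_steps

N = 64
steady = np.linspace(1.0, 0.0, N)              # exact steady state is linear
cold_start = np.zeros(N)
# A small perturbation of the steady state plays the role of the
# network-predicted profile used to initialize the time stepper.
warm_start = steady + 0.001 * np.sin(np.linspace(0.0, np.pi, N))

n_cold = steps_to_steady(cold_start)
n_warm = steps_to_steady(warm_start)
```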

[LG-83] Minimax and Communication-Efficient Distributed Best Subset Selection with Oracle Property

链接: https://arxiv.org/abs/2408.17276
作者: Jingguo Lan,Hongmei Lin,Xueqin Wang
关键词-EN: statistical inference methods, distributed statistical inference, single-machine systems, explosion of large-scale, large-scale data
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The explosion of large-scale data in fields such as finance, e-commerce, and social media has outstripped the processing capabilities of single-machine systems, driving the need for distributed statistical inference methods. Traditional approaches to distributed inference often struggle with achieving true sparsity in high-dimensional datasets and involve high computational costs. We propose a novel, two-stage, distributed best subset selection algorithm to address these issues. Our approach starts by efficiently estimating the active set while adhering to the \ell_0 norm-constrained surrogate likelihood function, effectively reducing dimensionality and isolating key variables. A refined estimation within the active set follows, ensuring sparse estimates and matching the minimax \ell_2 error bound. We introduce a new splicing technique for adaptive parameter selection to tackle subproblems under \ell_0 constraints and a Generalized Information Criterion (GIC). Our theoretical and numerical studies show that the proposed algorithm correctly finds the true sparsity pattern, has the oracle property, and greatly lowers communication costs. This is a big step forward in distributed sparse estimation.
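
A heavily simplified two-stage sketch conveys the screen-then-refine structure; the marginal-correlation screen and plain least-squares refit below are stand-ins for the paper's splicing technique and GIC-based selection.

```python
import numpy as np

def best_subset_two_stage(X, y, k):
    """Toy two-stage sketch: screen an active set of size k by marginal
    correlation, then refit least squares on it (l0-style sparsity)."""
    active = np.argsort(np.abs(X.T @ y))[-k:]   # stage 1: candidate active set
    sol, *_ = np.linalg.lstsq(X[:, active], y, rcond=None)
    beta = np.zeros(X.shape[1])
    beta[active] = sol                          # stage 2: refined sparse estimate
    return beta, np.sort(active)

rng = np.random.default_rng(4)
n, p = 500, 50
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[[2, 17, 40]] = [3.0, -2.0, 1.5]
y = X @ beta_true + 0.1 * rng.normal(size=n)

beta_hat, support = best_subset_two_stage(X, y, k=3)
```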

[LG-84] Equation identification for fluid flows via physics-informed neural networks ICML2024

链接: https://arxiv.org/abs/2408.17271
作者: Alexander New,Marisel Villafañe-Delgado,Charles Shugert
关键词-EN: Scientific machine learning, Scientific machine, physics-informed neural networks, machine learning, neural networks
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
*备注: Published at ICML 2024 AI4Science: this https URL

点击查看摘要

Abstract:Scientific machine learning (SciML) methods such as physics-informed neural networks (PINNs) are used to estimate parameters of interest from governing equations and small quantities of data. However, there has been little work in assessing how well PINNs perform for inverse problems across wide ranges of governing equations across the mathematical sciences. We present a new and challenging benchmark problem for inverse PINNs based on a parametric sweep of the 2D Burgers’ equation with rotational flow. We show that a novel strategy that alternates between first- and second-order optimization proves superior to typical first-order strategies for estimating parameters. In addition, we propose a novel data-driven method to characterize PINN effectiveness in the inverse setting. PINNs’ physics-informed regularization enables them to leverage small quantities of data more efficiently than the data-driven baseline. However, both PINNs and the baseline can fail to recover parameters for highly inviscid flows, motivating the need for further development of PINN methods.

[LG-85] Learning and Verifying Maximal Taylor-Neural Lyapunov functions

链接: https://arxiv.org/abs/2408.17246
作者: Matthieu Barreau,Nicola Bastianello
关键词-EN: termed Taylor-neural Lyapunov, Taylor-neural Lyapunov functions, approximate Lyapunov functions, termed Taylor-neural, Taylor-neural Lyapunov
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:We introduce a novel neural network architecture, termed Taylor-neural Lyapunov functions, designed to approximate Lyapunov functions with formal certification. This architecture innovatively encodes local approximations and extends them globally by leveraging neural networks to approximate the residuals. Our method recasts the problem of estimating the largest region of attraction - specifically for maximal Lyapunov functions - into a learning problem, ensuring convergence around the origin through robust control theory. Physics-informed machine learning techniques further refine the estimation of the largest region of attraction. Remarkably, this method is versatile, operating effectively even without simulated data points. We validate the efficacy of our approach by providing numerical certificates of convergence across multiple examples. Our proposed methodology not only competes closely with state-of-the-art approaches, such as sum-of-squares and LyZNet, but also achieves comparable results even in the absence of simulated data. This work represents a significant advancement in control theory, with broad potential applications in the design of stable control systems and beyond.
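
The conditions being certified can be shown on a hand-solved example; the quadratic candidate below is a stand-in for a learned Taylor-neural candidate, and the sampling check is only illustrative of what a formal certificate bounds over a whole region.

```python
import numpy as np

A = np.array([[0.0, 1.0],
              [-2.0, -3.0]])                   # stable linear system dx/dt = Ax

# Hand-solved quadratic candidate V(x) = x^T P x with A^T P + P A = -I.
P = np.array([[1.25, 0.25],
              [0.25, 0.25]])

def V(x):
    return x @ P @ x

def V_dot(x):                                  # dV/dt along trajectories
    return x @ (A.T @ P + P @ A) @ x

# Sample-based check of the Lyapunov conditions away from the origin; a
# formal certificate would bound these over an entire region instead.
rng = np.random.default_rng(5)
samples = rng.normal(size=(1000, 2))
certified = all(V(x) > 0 and V_dot(x) < 0 for x in samples)
```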

[LG-86] Using Quantum Solved Deep Boltzmann Machines to Increase the Data Efficiency of RL Agents

链接: https://arxiv.org/abs/2408.17240
作者: Daniel Kent,Clement O’Rourke,Jake Southall,Kirsty Duncan,Adrian Bedford
关键词-EN: require large quantities, Reinforcement Learning, Deep Boltzmann Machines, train effectively, Learning
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep Learning algorithms, such as those used in Reinforcement Learning, often require large quantities of data to train effectively. In most cases, the availability of data is not a significant issue. However, for some contexts, such as in autonomous cyber defence, we require data efficient methods. Recently, Quantum Machine Learning and Boltzmann Machines have been proposed as solutions to this challenge. In this work we build upon the pre-existing work to extend the use of Deep Boltzmann Machines to the cutting edge algorithm Proximal Policy Optimisation in a Reinforcement Learning cyber defence environment. We show that this approach, when solved using a D-WAVE quantum annealer, can lead to a two-fold increase in data efficiency. We therefore expect it to be used by the machine learning and quantum communities who are hoping to capitalise on data-efficient Reinforcement Learning methods.

[LG-87] Learning Multi-Target TDOA Features for Sound Event Localization and Detection

链接: https://arxiv.org/abs/2408.17166
作者: Axel Berg,Johanna Engman,Jens Gulin,Karl Åström,Magnus Oskarsson
关键词-EN: microphone array rely, microphone array, array rely, rely on spatial, spatial cues
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
*备注: DCASE 2024

点击查看摘要

Abstract:Sound event localization and detection (SELD) systems using audio recordings from a microphone array rely on spatial cues for determining the location of sound events. As a consequence, the localization performance of such systems is to a large extent determined by the quality of the audio features that are used as inputs to the system. We propose a new feature, based on neural generalized cross-correlations with phase-transform (NGCC-PHAT), that learns audio representations suitable for localization. Using permutation invariant training for the time-difference of arrival (TDOA) estimation problem enables NGCC-PHAT to learn TDOA features for multiple overlapping sound events. These features can be used as a drop-in replacement for GCC-PHAT inputs to a SELD-network. We test our method on the STARSS23 dataset and demonstrate improved localization performance compared to using standard GCC-PHAT or SALSA-Lite input features.
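
The learned NGCC-PHAT feature builds on the classical GCC-PHAT estimator, which can be sketched directly; the synthetic 7-sample delay below stands in for real microphone-pair recordings.

```python
import numpy as np

def gcc_phat(sig, ref, fs=1):
    """Classical GCC-PHAT TDOA estimate: whiten the cross-spectrum,
    inverse-transform, and pick the lag of the correlation peak."""
    n = sig.size + ref.size
    S = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    S /= np.abs(S) + 1e-12                     # phase transform weighting
    cc = np.fft.irfft(S, n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (cc.argmax() - max_shift) / fs

rng = np.random.default_rng(6)
x = rng.normal(size=1024)                      # signal at the reference mic
y = np.roll(x, 7)                              # second mic hears it 7 samples late
tdoa = gcc_phat(y, x)
```

NGCC-PHAT replaces the fixed PHAT weighting with learned, permutation-invariant features so that multiple overlapping sources each get a usable TDOA estimate.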

[LG-88] Disease Classification and Impact of Pretrained Deep Convolution Neural Networks on Diverse Medical Imaging Datasets across Imaging Modalities

链接: https://arxiv.org/abs/2408.17011
作者: Jutika Borah,Kumaresh Sarmah,Hidam Kumarjit Singh
关键词-EN: Chest X-rays, optical coherence tomography, coherence tomography serve, medical imaging, optical coherence
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 15 pages, 3 figures, 4 tables

点击查看摘要

Abstract:Imaging techniques such as Chest X-rays, whole slide images, and optical coherence tomography serve as the initial screening and detection for a wide variety of medical pulmonary and ophthalmic conditions respectively. This paper investigates the intricacies of using pretrained deep convolutional neural networks with transfer learning across diverse medical imaging datasets with varying modalities for binary and multiclass classification. We conducted a comprehensive performance analysis with ten network architectures and model families each with pretraining and random initialization. Our findings showed that the use of pretrained models as fixed feature extractors yields poor performance irrespective of the datasets. In contrast, histopathology microscopy whole slide images achieve better performance. It is also found that deeper and more complex architectures did not necessarily result in the best performance. This observation implies that improvements on ImageNet do not translate directly to medical imaging tasks. Within a medical domain, the performance of the network architectures varies within model families with shifts in datasets. This indicates that the performance of models within a specific modality may not be conclusive for another modality within the same domain. This study provides a deeper understanding of the applications of deep learning techniques in medical imaging and highlights the impact of pretrained networks across different medical imaging datasets under five different experimental settings.

[LG-89] Technical Report of HelixFold3 for Biomolecular Structure Prediction

链接: https://arxiv.org/abs/2408.16975
作者: Lihang Liu,Shanzhuo Zhang,Yang Xue,Xianbin Ye,Kunrui Zhu,Yuxin Li,Yang Liu,Xiaonan Zhang,Xiaomin Fang
关键词-EN: matching experimental methods, transformed protein structure, experimental methods, AlphaFold series, series has transformed
类目: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The AlphaFold series has transformed protein structure prediction with remarkable accuracy, often matching experimental methods. AlphaFold2, AlphaFold-Multimer, and the latest AlphaFold3 represent significant strides in predicting single protein chains, protein complexes, and biomolecular structures. While AlphaFold2 and AlphaFold-Multimer are open-sourced, facilitating rapid and reliable predictions, AlphaFold3 remains partially accessible through a limited online server and has not been open-sourced, restricting further development. To address these challenges, the PaddleHelix team is developing HelixFold3, aiming to replicate AlphaFold3’s capabilities. Using insights from previous models and extensive datasets, HelixFold3 achieves an accuracy comparable to AlphaFold3 in predicting the structures of conventional ligands, nucleic acids, and proteins. The initial release of HelixFold3 is available as open source on GitHub for academic research, promising to advance biomolecular research and accelerate discoveries. We also provide online service at PaddleHelix website at this https URL.

[LG-90] Efficient Transonic Aeroelastic Model Reduction Using Optimized Sparse Multi-Input Polynomial Functionals

链接: https://arxiv.org/abs/2408.16941
作者: Michael Candon,Maciej Balajewicz,Arturo Delgado-Gutierrez,Pier Marzocca,Earl H. Dowell
关键词-EN: artificial intelligence algorithms, practical aeroelastic applications, Nonlinear aeroelastic reduced-order, aeroelastic reduced-order models, nonlinear aeroelastic model
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
*备注: 24 pages, preprint, under review

点击查看摘要

Abstract:Nonlinear aeroelastic reduced-order models (ROMs) based on machine learning or artificial intelligence algorithms can be complex and computationally demanding to train, meaning that for practical aeroelastic applications, the conservative nature of linearization is often favored. Therefore, there is a requirement for novel nonlinear aeroelastic model reduction approaches that are accurate, simple and, most importantly, efficient to generate. This paper proposes a novel formulation for the identification of a compact multi-input Volterra series, where Orthogonal Matching Pursuit is used to obtain a set of optimally sparse nonlinear multi-input ROM coefficients from unsteady aerodynamic training data. The framework is exemplified using the Benchmark Supercritical Wing, considering; forced response, flutter and limit cycle oscillation. The simple and efficient Optimal Sparsity Multi-Input ROM (OSM-ROM) framework performs with high accuracy compared to the full-order aeroelastic model, requiring only a fraction of the tens-of-thousands of possible multi-input terms to be identified and allowing a 96% reduction in the number of training samples.
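
Orthogonal Matching Pursuit itself is a standard algorithm and can be sketched in a few lines; the random dictionary and the sparse "Volterra" coefficients below are synthetic stand-ins for the unsteady aerodynamic training data.

```python
import numpy as np

def omp(D, y, n_nonzero):
    """Orthogonal Matching Pursuit: greedily add the column most
    correlated with the residual, then refit on the chosen support."""
    residual = y.copy()
    support = []
    for _ in range(n_nonzero):
        support.append(int(np.abs(D.T @ residual).argmax()))
        sol, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ sol
    coef = np.zeros(D.shape[1])
    coef[support] = sol
    return coef

rng = np.random.default_rng(7)
D = rng.normal(size=(100, 400))
D /= np.linalg.norm(D, axis=0)                 # unit-norm dictionary columns
true_coef = np.zeros(400)
true_coef[[10, 250, 399]] = [2.0, -1.5, 1.0]   # sparse nonlinear coefficients
coef = omp(D, D @ true_coef, n_nonzero=3)
```

The OSM-ROM framework applies this kind of greedy sparse selection to multi-input Volterra terms, keeping only the few that the training data actually support.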

[LG-91] AI-driven Reverse Engineering of QML Models

链接: https://arxiv.org/abs/2408.16929
作者: Archisman Ghosh,Swaroop Ghosh
关键词-EN: Noisy Intermediate-Scale Quantum, Quantum machine learning, rapidly emerging area, capabilities of Noisy, Noisy Intermediate-Scale
类目: Quantum Physics (quant-ph); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注: 7 pages, 4 figures

点击查看摘要

Abstract:Quantum machine learning (QML) is a rapidly emerging area of research, driven by the capabilities of Noisy Intermediate-Scale Quantum (NISQ) devices. With the progress in the research of QML models, there is a rise in third-party quantum cloud services to cater to the increasing demand for resources. New security concerns surface, specifically regarding the protection of intellectual property (IP) from untrustworthy service providers. One of the most pressing risks is the potential for reverse engineering (RE) by malicious actors who may steal proprietary quantum IPs such as trained parameters and QML architecture, modify them to remove additional watermarks or signatures and re-transpile them for other quantum hardware. Prior work presents a brute force approach to RE the QML parameters which takes exponential time overhead. In this paper, we introduce an autoencoder-based approach to extract the parameters from transpiled QML models deployed on untrusted third-party vendors. We experiment on multi-qubit classifiers and note that they can be reverse-engineered under restricted conditions with a mean error of order 10^-1. The amount of time taken to prepare the dataset and train the model to reverse engineer the QML circuit being of the order 10^3 seconds (which is 10^2x better than the previously reported value for 4-layered 4-qubit classifiers) makes the threat of RE highly potent, underscoring the need for continued development of effective defenses.

[LG-92] Coverage Analysis of Multi-Environment Q-Learning Algorithms for Wireless Network Optimization

链接: https://arxiv.org/abs/2408.16882
作者: Talha Bozkus,Urbashi Mitra
关键词-EN: unknown system dynamics, Q-learning algorithms, Q-learning, optimize wireless networks, multi-environment hybrid Q-learning
类目: ignal Processing (eess.SP); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注:

点击查看摘要

Abstract:Q-learning is widely used to optimize wireless networks with unknown system dynamics. Recent advancements include ensemble multi-environment hybrid Q-learning algorithms, which utilize multiple Q-learning algorithms across structurally related but distinct Markovian environments and outperform existing Q-learning algorithms in terms of accuracy and complexity in large-scale wireless networks. We herein conduct a comprehensive coverage analysis to ensure optimal data coverage conditions for these algorithms. Initially, we establish upper bounds on the expectation and variance of different coverage coefficients. Leveraging these bounds, we present an algorithm for efficient initialization of these algorithms. We test our algorithm on two distinct real-world wireless networks. Numerical simulations show that our algorithm can achieve 50% less policy error and 40% less runtime complexity than state-of-the-art reinforcement learning algorithms. Furthermore, our algorithm exhibits robustness to changes in network settings and parameters. We also numerically validate our theoretical results.
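
The tabular Q-learning core that the ensemble algorithms build on can be shown on a toy MDP; the five-state chain below is a hypothetical stand-in for the paper's wireless-network environments, and exploration frequency here is the simplest notion of the "coverage" the analysis concerns.

```python
import numpy as np

# Tabular Q-learning on a five-state chain: action 1 moves right, action
# 0 moves left (bounded at state 0), and only reaching state 4 pays 1.
rng = np.random.default_rng(8)
n_states, n_actions = 5, 2
gamma, alpha, epsilon = 0.9, 0.1, 0.3
Q = np.zeros((n_states, n_actions))

for episode in range(500):
    s = 0
    while s != n_states - 1:
        if rng.random() < epsilon:
            a = int(rng.integers(n_actions))   # explore
        else:
            a = int(Q[s].argmax())             # exploit
        s_next = max(s - 1, 0) if a == 0 else s + 1
        r = 1.0 if s_next == n_states - 1 else 0.0
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

greedy_policy = Q.argmax(axis=1)               # should learn "always go right"
```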

[LG-93] Characterization of point-source transient events with a rolling-shutter compressed sensing system

链接: https://arxiv.org/abs/2408.16868
作者: Frank Qiu,Joshua Michalenko,Lilian K. Casias,Cameron J. Radosevich,Jon Slater,Eric A. Shields
关键词-EN: Point-source transient events, Point-source transient, extremely small, extremely fast, pose several challenges
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Signal Processing (eess.SP); Optics (physics.optics); Applications (stat.AP)
*备注: 20 pages, 11 figures

点击查看摘要

Abstract:Point-source transient events (PSTEs) - optical events that are both extremely fast and extremely small - pose several challenges to an imaging system. Due to their speed, accurately characterizing such events often requires detectors with very high frame rates. Due to their size, accurately detecting such events requires maintaining coverage over an extended field-of-view, often through the use of imaging focal plane arrays (FPA) with a global shutter readout. Traditional imaging systems that meet these requirements are costly in terms of price, size, weight, power consumption, and data bandwidth, and there is a need for cheaper solutions with adequate temporal and spatial coverage. To address these issues, we develop a novel compressed sensing algorithm adapted to the rolling shutter readout of an imaging system. This approach enables reconstruction of a PSTE signature at the sampling rate of the rolling shutter, offering a 1-2 order of magnitude temporal speedup and a proportional reduction in data bandwidth. We present empirical results demonstrating accurate recovery of PSTEs using measurements that are spatially undersampled by a factor of 25, and our simulations show that, relative to other compressed sensing algorithms, our algorithm is both faster and yields higher quality reconstructions. We also present theoretical results characterizing our algorithm and corroborating simulations. The potential impact of our work includes the development of much faster, cheaper sensor solutions for PSTE detection and characterization.

[LG-94] Probabilistic Decomposed Linear Dynamical Systems for Robust Discovery of Latent Neural Dynamics

链接: https://arxiv.org/abs/2408.16862
作者: Yenho Chen,Noga Mudrik,Kyle A. Johnsen,Sankaraleengam Alagapan,Adam S. Charles,Christopher J. Rozell
关键词-EN: Time-varying linear state-space, Time-varying linear, mathematically interpretable representations, obtaining mathematically interpretable, linear state-space models
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Time-varying linear state-space models are powerful tools for obtaining mathematically interpretable representations of neural signals. For example, switching and decomposed models describe complex systems using latent variables that evolve according to simple locally linear dynamics. However, existing methods for latent variable estimation are not robust to dynamical noise and system nonlinearity due to noise-sensitive inference procedures and limited model formulations. This can lead to inconsistent results on signals with similar dynamics, limiting the model’s ability to provide scientific insight. In this work, we address these limitations and propose a probabilistic approach to latent variable estimation in decomposed models that improves robustness against dynamical noise. Additionally, we introduce an extended latent dynamics model to improve robustness against system nonlinearities. We evaluate our approach on several synthetic dynamical systems, including an empirically-derived brain-computer interface experiment, and demonstrate more accurate latent variable inference in nonlinear systems with diverse noise conditions. Furthermore, we apply our method to a real-world clinical neurophysiology dataset, illustrating the ability to identify interpretable and coherent structure where previous models cannot.

[LG-95] Maven: A Multimodal Foundation Model for Supernova Science

链接: https://arxiv.org/abs/2408.16829
作者: Gemma Zhang,Thomas Helfer,Alexander T. Gagliano,Siddharth Mishra-Sharma,V. Ashley Villar
关键词-EN: high-quality observations, lower-quality observations, setting in astronomy, larger amounts, small number
类目: High Energy Astrophysical Phenomena (astro-ph.HE); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注: code: this https URL data: this https URL

点击查看摘要

Abstract:A common setting in astronomy is the availability of a small number of high-quality observations, and larger amounts of either lower-quality observations or synthetic data from simplified models. Time-domain astrophysics is a canonical example of this imbalance, with the number of supernovae observed photometrically outpacing the number observed spectroscopically by multiple orders of magnitude. At the same time, no data-driven models exist to understand these photometric and spectroscopic observables in a common context. Contrastive learning objectives, which have grown in popularity for aligning distinct data modalities in a shared embedding space, provide a potential solution to extract information from these modalities. We present Maven, the first foundation model for supernova science. To construct Maven, we first pre-train our model to align photometry and spectroscopy from 0.5M synthetic supernovae using a contrastive objective. We then fine-tune the model on 4,702 observed supernovae from the Zwicky Transient Facility. Maven reaches state-of-the-art performance on both classification and redshift estimation, despite the embeddings not being explicitly optimized for these tasks. Through ablation studies, we show that pre-training with synthetic data improves overall performance. In the upcoming era of the Vera C. Rubin Observatory, Maven serves as a Rosetta Stone for leveraging large, unlabeled and multimodal time-domain datasets.
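
The contrastive alignment objective can be sketched with a standard InfoNCE loss; the random embeddings and temperature below are illustrative assumptions, not Maven's actual encoders or hyperparameters.

```python
import numpy as np

def info_nce(photo_emb, spec_emb, temperature=0.1):
    """InfoNCE loss: matched photometry/spectroscopy pairs (the diagonal)
    are pulled together, mismatched pairs pushed apart."""
    a = photo_emb / np.linalg.norm(photo_emb, axis=1, keepdims=True)
    b = spec_emb / np.linalg.norm(spec_emb, axis=1, keepdims=True)
    logits = (a @ b.T) / temperature             # pairwise cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(9)
spec = rng.normal(size=(32, 32))                 # toy spectroscopy embeddings
aligned_loss = info_nce(spec + 0.01 * rng.normal(size=(32, 32)), spec)
mismatched_loss = info_nce(rng.normal(size=(32, 32)), spec)
# Well-aligned modalities give a much lower loss than unrelated ones.
```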

[LG-96] CNN Based Detection of Cardiovascular Diseases from ECG Images

链接: https://arxiv.org/abs/2408.16800
作者: Irem Sayin,Rana Gursoy,Buse Cicek,Yunus Emre Mert,Fatih Ozturk,Taha Emre Pamukcu,Ceylin Deniz Sevimli,Huseyin Uvet
关键词-EN: Convolutional Neural Network, Neural Network, Convolutional Neural, develops a Convolutional, detecting myocardial infarction
类目: Medical Physics (physics.med-ph); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注: 4 pages

点击查看摘要

Abstract:This study develops a Convolutional Neural Network (CNN) model for detecting myocardial infarction (MI) from Electrocardiogram (ECG) images. The model, built using the InceptionV3 architecture and optimized through transfer learning, was trained using ECG data obtained from the Ch. Pervaiz Elahi Institute of Cardiology in Pakistan. The dataset includes ECG images representing four different cardiac conditions: myocardial infarction, abnormal heartbeat, history of myocardial infarction, and normal heart activity. The developed model successfully detects MI and other cardiovascular conditions with an accuracy of 93.27%. This study demonstrates that deep learning-based models can provide significant support to clinicians in the early detection and prevention of heart attacks.

[LG-97] Uncertainty-aware segmentation for rainfall prediction post processing KDD’24

链接: https://arxiv.org/abs/2408.16792
作者: Simone Monaco,Luca Monaco,Daniele Apiletti
关键词-EN: water resource allocation, Accurate precipitation forecasts, Accurate precipitation, agricultural planning, flood management
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Paper accepted at the 3rd Workshop on Uncertainty Reasoning and Quantification in Decision Making at ACM SIGKDD’24 (August 26, 2024, Barcelona)

点击查看摘要

Abstract:Accurate precipitation forecasts are crucial for applications such as flood management, agricultural planning, water resource allocation, and weather warnings. Despite advances in numerical weather prediction (NWP) models, they still exhibit significant biases and uncertainties, especially at high spatial and temporal resolutions. To address these limitations, we explore uncertainty-aware deep learning models for post-processing daily cumulative quantitative precipitation forecasts to obtain forecast uncertainties that lead to a better trade-off between accuracy and reliability. Our study compares different state-of-the-art models, and we propose a variant of the well-known SDE-Net, called SDE U-Net, tailored to segmentation problems like ours. We evaluate its performance for both typical and intense precipitation events. Our results show that all deep learning models significantly outperform the average baseline NWP solution, with our implementation of the SDE U-Net showing the best trade-off between accuracy and reliability. Integrating these models, which account for uncertainty, into operational forecasting systems can improve decision-making and preparedness for weather-related events.

Information Retrieval

[IR-0] rerankers: A Lightweight Python Library to Unify Ranking Methods

Link: https://arxiv.org/abs/2408.17344
Authors: Benjamin Clavié
Keywords-EN: paper presents rerankers, paper presents, Python library, Abstract, Python
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:This paper presents rerankers, a Python library which provides an easy-to-use interface to the most commonly used re-ranking approaches. Re-ranking is an integral component of many retrieval pipelines; however, there exist numerous approaches to it, relying on different implementation methods. rerankers unifies these methods into a single user-friendly interface, allowing practitioners and researchers alike to explore different methods while only changing a single line of Python code. Moreover, rerankers ensures that its implementations are done with the fewest dependencies possible, and re-uses the original implementation whenever possible, guaranteeing that our simplified interface results in no performance degradation compared to more complex ones. The full source code and list of supported models are updated regularly and available at this https URL.
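The "single user-friendly interface" idea can be illustrated with a toy facade: several scoring backends hidden behind one class, so switching methods changes a single line. This is only a sketch of the design pattern; the class and function names here are hypothetical, not the actual rerankers API.

```python
# Toy illustration of a unified re-ranking interface: multiple scoring
# backends behind one facade. Names are invented for this sketch.

def lexical_overlap_score(query, doc):
    """Toy relevance score: fraction of query terms present in the doc."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / max(len(q_terms), 1)

def length_prior_score(query, doc):
    """Another toy backend: prefers shorter documents."""
    return 1.0 / (1.0 + len(doc.split()))

BACKENDS = {
    "lexical": lexical_overlap_score,
    "length-prior": length_prior_score,
}

class Reranker:
    """Facade that dispatches to whichever backend was requested."""
    def __init__(self, method):
        self.score = BACKENDS[method]

    def rank(self, query, docs):
        """Return docs sorted by descending relevance score."""
        return sorted(docs, key=lambda d: self.score(query, d), reverse=True)

docs = ["cats purr softly", "dogs bark", "cats and dogs live together"]
ranker = Reranker("lexical")  # swapping the re-ranking method is one line
print(ranker.rank("cats dogs", docs)[0])
```

Swapping `"lexical"` for `"length-prior"` is the one-line change; everything downstream of `rank()` stays untouched.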

[IR-1] Not All Videos Become Outdated: Short-Video Recommendation by Learning to Deconfound Release Interval Bias

Link: https://arxiv.org/abs/2408.17332
Authors: Lulu Dong,Guoxiu He,Aixin Sun
Keywords-EN: Short-video recommender systems, recently released videos, recommender systems, biased preference, preference to recently
Subjects: Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:Short-video recommender systems often exhibit a biased preference toward recently released videos. However, not all videos become outdated; certain classic videos can still attract users’ attention. Such bias along the temporal dimension can be further aggravated by the matching model between users and videos, because the model learns from preexisting interactions. From real data, we observe that different videos have varying sensitivities to recency in attracting users’ attention. Our analysis, based on a causal graph modeling short-video recommendation, suggests that the release interval serves as a confounder, establishing a backdoor path between users and videos. To address this confounding effect, we propose a model-agnostic causal architecture called Learning to Deconfound the Release Interval Bias (LDRI). LDRI enables joint learning of the matching model and the video recency sensitivity perceptron. In the inference stage, we apply a backdoor adjustment, effectively blocking the backdoor path by intervening on each video. Extensive experiments on two benchmarks demonstrate that LDRI consistently outperforms backbone models and exhibits superior performance against state-of-the-art models. Additional comprehensive analyses confirm the deconfounding capability of LDRI.
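The backdoor adjustment the abstract applies can be sketched numerically: with the release interval t as a confounder, the interventional score weights the stratum-conditional preference by the marginal P(t) instead of the biased P(t | video). All probabilities below are invented for illustration and do not come from the paper.

```python
# Minimal sketch of a backdoor adjustment over release-interval strata.
# All numbers are made up for illustration.

# P(t): marginal distribution of release-interval strata
p_t = {"recent": 0.5, "classic": 0.5}

# P(click = 1 | video, t): preference estimates per stratum
p_click = {
    ("classic_video", "recent"): 0.2,
    ("classic_video", "classic"): 0.6,
}

# Observed (confounded) exposure: this video is mostly shown as "classic"
p_t_given_video = {"recent": 0.1, "classic": 0.9}

def naive_score(video):
    """Biased estimate: weights strata by P(t | video)."""
    return sum(p_click[(video, t)] * w for t, w in p_t_given_video.items())

def backdoor_score(video):
    """Deconfounded estimate: weights strata by the marginal P(t)."""
    return sum(p_click[(video, t)] * w for t, w in p_t.items())

print(round(naive_score("classic_video"), 3))     # confounded score
print(round(backdoor_score("classic_video"), 3))  # score under do(video)
```

The gap between the two scores is exactly the release-interval bias that the intervention removes.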

[IR-2] Metadata practices for simulation workflows

Link: https://arxiv.org/abs/2408.17309
Authors: Jose Villamar,Matthias Kelbling,Heather L. More,Michael Denker,Tom Tetzlaff,Johanna Senk,Stephan Thober
Keywords-EN: Computer simulations, generation in science, essential pillar, pillar of knowledge, knowledge generation
Subjects: Information Retrieval (cs.IR)
Comments: 19 pages, 5 figures

Click to view abstract

Abstract:Computer simulations are an essential pillar of knowledge generation in science. Understanding, reproducing, and exploring the results of simulations relies on tracking and organizing metadata describing numerical experiments. However, the models used to understand real-world systems, and the computational machinery required to simulate them, are typically complex, and produce large amounts of heterogeneous metadata. Here, we present general practices for acquiring and handling metadata that are agnostic to software and hardware, and highly flexible for the user. These consist of two steps: 1) recording and storing raw metadata, and 2) selecting and structuring metadata. As a proof of concept, we develop the Archivist, a Python tool to help with the second step, and use it to apply our practices to distinct high-performance computing use cases from neuroscience and hydrology. Our practices and the Archivist can readily be applied to existing workflows without the need for substantial restructuring. They support sustainable numerical workflows, facilitating reproducibility and data reuse in generic simulation-based research.
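The two-step practice described above can be sketched as: (1) record raw, heterogeneous metadata exactly as the environment emits it, then (2) select and structure the subset needed for a given analysis. The keys and schema below are hypothetical, not the Archivist's actual format.

```python
# Sketch of a two-step metadata workflow: record raw, then select/structure.
# Field names and values are invented for illustration.
import json

# Step 1: raw metadata dump, kept as-is
raw = {
    "cmdline": "simulate --dt 0.1 --seed 42",
    "hostname": "node017",
    "env": {"OMP_NUM_THREADS": "8", "SHELL": "/bin/bash"},
    "git_commit": "a1b2c3d",
}

# Step 2: select and structure the fields relevant to reproducibility
SELECTED = ["cmdline", "git_commit", "hostname"]
structured = {k: raw[k] for k in SELECTED if k in raw}

record = json.dumps(structured, sort_keys=True)
print(record)
```

Keeping step 1 lossless means the selection in step 2 can be revised later without re-running the simulation.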

[IR-3] Efficient Multi-task Prompt Tuning for Recommendation

Link: https://arxiv.org/abs/2408.17214
Authors: Ting Bai,Le Huang,Yue Yu,Cheng Yang,Cheng Hou,Zhe Zhao,Chuan Shi
Keywords-EN: multi-task learning, multi-task, tasks, business scenarios, expansion of business
Subjects: Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:With the expansion of business scenarios, real recommender systems are facing challenges in dealing with the constantly emerging new tasks in multi-task learning frameworks. In this paper, we attempt to improve the generalization ability of multi-task recommendations when dealing with new tasks. We find that joint training will enhance the performance of the new task but always negatively impact existing tasks in most multi-task learning methods. Besides, such a re-training mechanism with new tasks increases the training costs, limiting the generalization ability of multi-task recommendation models. Based on this consideration, we aim to design a suitable sharing mechanism among different tasks while maintaining joint optimization efficiency in new task learning. A novel two-stage prompt-tuning MTL framework (MPT-Rec) is proposed to address task irrelevance and training efficiency problems in multi-task recommender systems. Specifically, we disentangle the task-specific and task-sharing information in the multi-task pre-training stage, then use task-aware prompts to transfer knowledge from other tasks to the new task effectively. By freezing parameters in the pre-training tasks, MPT-Rec avoids the negative impacts that may be brought by the new task and greatly reduces the training costs. Extensive experiments on three real-world datasets show the effectiveness of our proposed multi-task learning framework. MPT-Rec achieves the best performance compared to the SOTA multi-task learning methods. Moreover, it maintains comparable model performance while vastly improving training efficiency in new task learning (i.e., updating as few as 10% of the parameters required by full training).
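The training-efficiency claim rests on a simple idea: the pre-trained multi-task parameters stay frozen, and only a small set of task-aware prompt parameters is updated for the new task. A back-of-the-envelope sketch, with parameter counts invented purely to show the arithmetic:

```python
# Sketch of the frozen-backbone / trainable-prompt split.
# Parameter counts are made up; the point is the small trainable fraction.

backbone = {"shared_tower": 900_000, "task_heads": 60_000}  # frozen
prompts = {"new_task_prompt": 40_000}                        # trainable

frozen = sum(backbone.values())
trainable = sum(prompts.values())
fraction = trainable / (frozen + trainable)

print(f"trainable fraction: {fraction:.2%}")
```

With these illustrative sizes, gradient updates touch only a few percent of the model, which is where the training-cost savings come from.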

[IR-4] Identifying and Clustering Counter Relationships of Team Compositions in PvP Games for Efficient Balance Analysis

Link: https://arxiv.org/abs/2408.17180
Authors: Chiu-Chou Lin,Yu-Wei Shih,Kuei-Ting Kuo,Yu-Cheng Chen,Chien-Hua Chen,Wei-Chen Chiu,I-Chen Wu
Keywords-EN: balance, win, game settings, game, Abstract
Subjects: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Information Retrieval (cs.IR); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: TMLR 09/2024 this https URL

Click to view abstract

Abstract:How can balance be quantified in game settings? This question is crucial for game designers, especially in player-versus-player (PvP) games, where analyzing the strength relations among predefined team compositions, such as hero combinations in multiplayer online battle arena (MOBA) games or decks in card games, is essential for enhancing gameplay and achieving balance. We have developed two advanced measures that extend beyond the simplistic win rate to quantify balance in zero-sum competitive scenarios. These measures are derived from win value estimations, which employ strength rating approximations via the Bradley-Terry model and counter relationship approximations via vector quantization, significantly reducing the computational complexity associated with traditional win value estimations. Throughout the learning process of these models, we identify useful categories of compositions and pinpoint their counter relationships, aligning with the experiences of human players without requiring specific game knowledge. Our methodology hinges on a simple technique to enhance codebook utilization in discrete representation with a deterministic vector quantization process for an extremely small state space. Our framework has been validated in popular online games, including Age of Empires II, Hearthstone, Brawl Stars, and League of Legends. The accuracy of the observed strength relations in these games is comparable to traditional pairwise win value predictions, while also offering a more manageable complexity for analysis. Ultimately, our findings contribute to a deeper understanding of PvP game dynamics and present a methodology that significantly improves game balance evaluation and design.
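The Bradley-Terry strength ratings the abstract builds on can be sketched with the standard minorization-maximization (MM) update: given pairwise win counts, iterate p_i = W_i / Σ_j n_ij / (p_i + p_j), where W_i is composition i's total wins and n_ij the games played between i and j. The win counts and composition names below are invented.

```python
# Minimal Bradley-Terry strength estimation via the standard MM update.
# Data is invented; "A", "B", "C" stand in for team compositions.

# wins[a][b]: number of times composition a beat composition b
wins = {
    "A": {"B": 8, "C": 9},
    "B": {"A": 2, "C": 7},
    "C": {"A": 1, "B": 3},
}
comps = list(wins)
p = {c: 1.0 for c in comps}  # initial strengths

for _ in range(200):  # MM iterations until convergence
    new_p = {}
    for i in comps:
        w_i = sum(wins[i].values())  # total wins of i
        denom = sum(
            (wins[i][j] + wins[j][i]) / (p[i] + p[j])  # n_ij / (p_i + p_j)
            for j in comps if j != i
        )
        new_p[i] = w_i / denom
    total = sum(new_p.values())
    p = {c: v / total for c, v in new_p.items()}  # normalize strengths

ranking = sorted(comps, key=p.get, reverse=True)
print(ranking)
```

This recovers only the strength-rating half of the method; the counter-relationship half, via vector quantization of win values, is a separate component not sketched here.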

[IR-5] Understanding the User: An Intent-Based Ranking Dataset

Link: https://arxiv.org/abs/2408.17103
Authors: Abhijit Anand,Jurek Leonhardt,V Venktesh,Avishek Anand
Keywords-EN: retrieval systems continue, information retrieval systems, continue to evolve, retrieval systems, systems continue
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:As information retrieval systems continue to evolve, accurate evaluation and benchmarking of these systems become pivotal. Web search datasets, such as MS MARCO, primarily provide short keyword queries without accompanying intent or descriptions, posing a challenge in comprehending the underlying information need. This paper proposes an approach to augmenting such datasets to annotate informative query descriptions, with a focus on two prominent benchmark datasets: TREC-DL-21 and TREC-DL-22. Our methodology involves utilizing state-of-the-art LLMs to analyze and comprehend the implicit intent within individual queries from benchmark datasets. By extracting key semantic elements, we construct detailed and contextually rich descriptions for these queries. To validate the generated query descriptions, we employ crowdsourcing as a reliable means of obtaining diverse human perspectives on the accuracy and informativeness of the descriptions. This information can be used as an evaluation set for tasks such as ranking, query rewriting, or others.

[IR-6] Evaluation of Table Representations to Answer Questions from Tables in Documents : A Case Study using 3GPP Specifications

Link: https://arxiv.org/abs/2408.17008
Authors: Sujoy Roychowdhury,Sumit Soman,HG Ranjani,Avantika Sharma,Neeraj Gunda,Sai Krishna Bala
Keywords-EN: important aspect, ability to extract, Generation Partnership Project, question answering, document corpora
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 10 pages, 4 figures, 2 tables

Click to view abstract

Abstract:With the ubiquitous use of document corpora for question answering, one important aspect which is especially relevant for technical documents is the ability to extract information from tables which are interspersed with text. The major challenge in this is that unlike free-flow text or isolated sets of tables, the representation of a table in terms of what constitutes a relevant chunk is not obvious. We conduct a series of experiments examining various representations of tabular data interspersed with text to understand the relative benefits of different representations. We choose a corpus of 3rd Generation Partnership Project (3GPP) documents since they are heavily interspersed with tables. We create an expert-curated dataset of question-answer pairs to evaluate our approach. We conclude that row-level representations with corresponding table header information included in every cell improve the performance of the retrieval, thus leveraging the structural information present in the tabular data.
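The winning representation can be sketched concretely: each table row becomes one retrieval chunk, with the column header repeated alongside every cell value so the chunk is self-describing. The table content below is invented, not taken from 3GPP specifications.

```python
# Sketch of a row-level table representation with headers in every cell.
# Header and rows are invented example data.

header = ["Parameter", "Value", "Unit"]
rows = [
    ["maxBandwidth", "100", "MHz"],
    ["subcarrierSpacing", "30", "kHz"],
]

def row_chunks(header, rows):
    """One text chunk per row; every cell is paired with its column header."""
    chunks = []
    for row in rows:
        cells = [f"{h}: {v}" for h, v in zip(header, row)]
        chunks.append("; ".join(cells))
    return chunks

for chunk in row_chunks(header, rows):
    print(chunk)
```

Because each chunk carries its own header context, a retriever can match a query like "maximum bandwidth in MHz" against a single row without needing the rest of the table.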

[IR-7] A Prototype Model of Zero-Trust Architecture Blockchain with EigenTrust-Based Practical Byzantine Fault Tolerance Protocol to Manage Decentralized Clinical Trials

Link: https://arxiv.org/abs/2408.16885
Authors: Ashok Kumar Peepliwall,Hari Mohan Pandey,Surya Prakash,Anand A Mahajan,Sudhinder Singh Chowhan,Vinesh Kumar,Rahul Sharma
Keywords-EN: enable virtual care, facilitate seamless communication, improve data accessibility, clinical trial data, pandemic necessitated
Subjects: Cryptography and Security (cs.CR); Emerging Technologies (cs.ET); Information Retrieval (cs.IR)
Comments: NA

Click to view abstract

Abstract:The COVID-19 pandemic necessitated the emergence of decentralized clinical trials (DCTs) to improve patient retention, accelerate trials, improve data accessibility, enable virtual care, and facilitate seamless communication through integrated systems. However, integrating systems in DCTs exposes clinical data to potential security threats, making it susceptible to theft at any stage, with a high risk of protocol deviations and monitoring issues. To mitigate these challenges, blockchain technology serves as a secure framework, acting as a decentralized ledger and creating an immutable environment by establishing a zero-trust architecture, where data are deemed untrusted until verified. In combination with Internet of Things (IoT)-enabled wearable devices, blockchain secures the transfer of clinical trial data on private blockchains during DCT automation and operations. This paper proposes a prototype model of the Zero-Trust Architecture Blockchain (z-TAB) to integrate patient-generated clinical trial data during DCT operation management. The EigenTrust-based Practical Byzantine Fault Tolerance (T-PBFT) algorithm has been incorporated as a consensus protocol, leveraging Hyperledger Fabric. Furthermore, the Internet of Things (IoT) has been integrated to streamline data processing among stakeholders within the blockchain platforms. A rigorous evaluation has been conducted to assess the quality of the system.
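The EigenTrust computation that underlies the T-PBFT consensus can be sketched in a few lines: global trust is the stationary vector of the normalized local-trust matrix, obtainable by power iteration. The local trust values below are invented; this shows only the trust-aggregation step, not the full consensus protocol.

```python
# Sketch of EigenTrust global-trust aggregation via power iteration.
# local[i][j]: node i's normalized trust in node j (each row sums to 1).
# Values are invented for illustration.

local = [
    [0.0, 0.7, 0.3],
    [0.5, 0.0, 0.5],
    [0.6, 0.4, 0.0],
]
n = len(local)
t = [1.0 / n] * n  # start from uniform trust

for _ in range(100):  # power iteration: t <- C^T t
    t = [sum(local[i][j] * t[i] for i in range(n)) for j in range(n)]
    s = sum(t)
    t = [x / s for x in t]  # renormalize to a probability vector

most_trusted = max(range(n), key=lambda j: t[j])
print(most_trusted, [round(x, 3) for x in t])
```

In a T-PBFT-style setting, a ranking like this would be used to pick the highest-trust nodes as the consensus group; that selection logic is outside this sketch.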

[IR-8] Longitudinal Modularity, a Modularity for Link Streams

Link: https://arxiv.org/abs/2408.16877
Authors: Victor Brabant,Yasaman Asgari,Pierre Borgnat,Angela Bonifati,Remy Cazabet
Keywords-EN: model real-life phenomena, model real-life, link streams, real-life phenomena, streams
Subjects: Social and Information Networks (cs.SI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Temporal networks are commonly used to model real-life phenomena. When these phenomena represent interactions and are captured at a fine-grained temporal resolution, they are modeled as link streams. Community detection is an essential network analysis task. Although many methods exist for static networks, and some methods have been developed for temporal networks represented as sequences of snapshots, few works can handle link streams. This article introduces the first adaptation of the well-known Modularity quality function to link streams. Unlike existing methods, it is independent of the time scale of analysis. After introducing the quality function, and its relation to existing static and dynamic definitions of Modularity, we show experimentally its relevance for dynamic community evaluation.
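As background for the quantity being adapted, standard Newman modularity on a static graph is Q = Σ_c (e_c/m − (d_c/2m)²), where e_c is the number of intra-community edges, d_c the total degree of community c, and m the edge count. A sketch on an invented toy graph (two triangles joined by one bridge edge):

```python
# Sketch of static Newman modularity; the link-stream adaptation in the
# paper generalizes this quantity. The toy graph is invented.

edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
community = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}

def modularity(edges, community):
    m = len(edges)
    intra = {}   # e_c: intra-community edge counts
    degree = {}  # d_c: total degree per community
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            intra[cu] = intra.get(cu, 0) + 1
    return sum(
        intra.get(c, 0) / m - (degree[c] / (2 * m)) ** 2
        for c in degree
    )

print(round(modularity(edges, community), 4))
```

The positive value reflects that the two triangles have more internal edges than a degree-matched random graph would; the paper's contribution is making an analogous score well-defined on link streams, independent of the analysis time scale.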

Attachment Download

Click to download today's full paper list