This post presents the latest paper list retrieved from arXiv.org on 2024-08-14. It is updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. If you would like to receive the list by email on a schedule, please leave your email address in the comments.

Note: paper data is fetched from arXiv.org every day and updated automatically around 10:30 each morning.

Tip: if you would like to receive the daily paper data by email, please leave your email address in the comments; emails are likewise sent automatically around 10:30 each day.

Contents

Overview (2024-08-14)

A total of 343 papers were updated today, including:

  • Natural Language Processing: 54 papers (Computation and Language, cs.CL)
  • Artificial Intelligence: 85 papers (Artificial Intelligence, cs.AI)
  • Computer Vision: 89 papers (Computer Vision and Pattern Recognition, cs.CV)
  • Machine Learning: 82 papers (Machine Learning, cs.LG)

Natural Language Processing

[NLP-0] Fingerspelling within Sign Language Translation

Link: https://arxiv.org/abs/2408.07065
Authors: Garrett Tanzer
Keywords: language processing due, sign language processing, Fingerspelling poses challenges, American Sign Language, sign language
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Fingerspelling poses challenges for sign language processing due to its high-frequency motion and use for open-vocabulary terms. While prior work has studied fingerspelling recognition, there has been little attention to evaluating how well sign language translation models understand fingerspelling in the context of entire sentences – and improving this capability. We manually annotate instances of fingerspelling within FLEURS-ASL and use them to evaluate the effect of two simple measures to improve fingerspelling recognition within American Sign Language to English translation: 1) use a model family (ByT5) with character- rather than subword-level tokenization, and 2) mix fingerspelling recognition data into the translation training mixture. We find that 1) substantially improves understanding of fingerspelling (and therefore translation quality overall), but the effect of 2) is mixed.

[NLP-1] Diversity Empowers Intelligence: Integrating Expertise of Software Engineering Agents

Link: https://arxiv.org/abs/2408.07060
Authors: Kexun Zhang,Weiran Yao,Zuxin Liu,Yihao Feng,Zhiwei Liu,Rithesh Murthy,Tian Lan,Lei Li,Renze Lou,Jiacheng Xu,Bo Pang,Yingbo Zhou,Shelby Heinecke,Silvio Savarese,Huan Wang,Caiming Xiong
Keywords: Large language model, solving real-world software, shown great potential, language model, SWE-Bench Lite
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Large language model (LLM) agents have shown great potential in solving real-world software engineering (SWE) problems. The most advanced open-source SWE agent can resolve over 27% of real GitHub issues in SWE-Bench Lite. However, these sophisticated agent frameworks exhibit varying strengths, excelling in certain tasks while underperforming in others. To fully harness the diversity of these agents, we propose DEI (Diversity Empowered Intelligence), a framework that leverages their unique expertise. DEI functions as a meta-module atop existing SWE agent frameworks, managing agent collectives for enhanced problem-solving. Experimental results show that a DEI-guided committee of agents is able to surpass the best individual agent’s performance by a large margin. For instance, a group of open-source SWE agents, with a maximum individual resolve rate of 27.3% on SWE-Bench Lite, can achieve a 34.3% resolve rate with DEI, making a 25% improvement and beating most closed-source solutions. Our best-performing group excels with a 55% resolve rate, securing the highest ranking on SWE-Bench Lite. Our findings contribute to the growing body of research on collaborative AI systems and their potential to solve complex software engineering challenges.
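The committee idea in the abstract above can be pictured in a few lines. The agents and scorers below are toy stand-ins invented for illustration (not the paper's framework); they only show how a meta-module might pick among candidate patches proposed by diverse agents:

```python
# Hypothetical sketch of a DEI-style meta-module: several SWE agents each
# propose a candidate patch for an issue, and a scoring committee picks one.

def dei_select(issue, agents, scorers):
    """Collect one candidate per agent, then return the best by mean committee score."""
    candidates = [agent(issue) for agent in agents]

    def mean_score(patch):
        return sum(score(issue, patch) for score in scorers) / len(scorers)

    return max(candidates, key=mean_score)

# Toy stand-ins: agents return fixed patches, the scorer prefers one of them.
agents = [lambda issue: "patch-A", lambda issue: "patch-B"]
scorers = [lambda issue, patch: 1.0 if patch == "patch-B" else 0.2]
print(dei_select("fix crash in parser", agents, scorers))  # patch-B
```

The point of the sketch is only the structure: candidate generation is decoupled from selection, so any mix of agents can sit below the committee.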

[NLP-2] A Survey on Model MoErging: Recycling and Routing Among Specialized Experts for Collaborative Learning

Link: https://arxiv.org/abs/2408.07057
Authors: Prateek Yadav,Colin Raffel,Mohammed Muqeeth,Lucas Caccia,Haokun Liu,Tianlong Chen,Mohit Bansal,Leshem Choshen,Alessandro Sordoni
Keywords: performant pre-trained models, fine-tuned expert models, MoErging methods, domain or task, MoErging
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 26 pages

Abstract:The availability of performant pre-trained models has led to a proliferation of fine-tuned expert models that are specialized to a particular domain or task. Model MoErging methods aim to recycle expert models to create an aggregate system with improved performance or generalization. A key component of MoErging methods is the creation of a router that decides which expert model(s) to use for a particular input or application. The promise, effectiveness, and large design space of MoErging has spurred the development of many new methods over the past few years. This rapid pace of development has made it challenging to compare different MoErging methods, which are rarely compared to one another and are often validated in different experimental setups. To remedy such gaps, we present a comprehensive survey of MoErging methods that includes a novel taxonomy for cataloging key design choices and clarifying suitable applications for each method. Apart from surveying MoErging research, we inventory software tools and applications that make use of MoErging. We additionally discuss related fields of study such as model merging, multitask learning, and mixture-of-experts models. Taken as a whole, our survey provides a unified overview of existing MoErging methods and creates a solid foundation for future work in this burgeoning field.
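A minimal way to picture the routing component the survey centers on: a router maps each input to one of several fine-tuned experts. Real MoErging routers are learned; the keyword rule and the "experts" below are illustrative stand-ins, not any surveyed method:

```python
# Toy sketch of expert routing: pick which fine-tuned model handles an input.

EXPERTS = {
    "code": lambda x: f"[code expert] {x}",
    "medical": lambda x: f"[medical expert] {x}",
    "general": lambda x: f"[general expert] {x}",
}

KEYWORDS = {"code": ("bug", "function", "compile"),
            "medical": ("dose", "patient", "symptom")}

def route(query):
    """Pick the expert whose keywords appear in the query, else a generalist."""
    q = query.lower()
    for expert, kws in KEYWORDS.items():
        if any(kw in q for kw in kws):
            return expert
    return "general"

query = "Why does this function raise a TypeError?"
print(EXPERTS[route(query)](query))  # routed to the code expert
```

Swapping the keyword rule for a trained classifier (or a soft mixture over experts) is exactly the design space the survey's taxonomy catalogs.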

[NLP-3] LongWriter: Unleashing 10000 Word Generation from Long Context LLMs

Link: https://arxiv.org/abs/2408.07055
Authors: Yushi Bai,Jiajie Zhang,Xin Lv,Linzhi Zheng,Siqi Zhu,Lei Hou,Yuxiao Dong,Jie Tang,Juanzi Li
Keywords: Current long context, Current long, context large language, large language models, large language
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Current long context large language models (LLMs) can process inputs up to 100,000 tokens, yet struggle to generate outputs exceeding even a modest length of 2,000 words. Through controlled experiments, we find that the model’s effective generation length is inherently bounded by the samples it has seen during supervised fine-tuning (SFT). In other words, their output limitation is due to the scarcity of long-output examples in existing SFT datasets. To address this, we introduce AgentWrite, an agent-based pipeline that decomposes ultra-long generation tasks into subtasks, enabling off-the-shelf LLMs to generate coherent outputs exceeding 20,000 words. Leveraging AgentWrite, we construct LongWriter-6k, a dataset containing 6,000 SFT data with output lengths ranging from 2k to 32k words. By incorporating this dataset into model training, we successfully scale the output length of existing models to over 10,000 words while maintaining output quality. We also develop LongBench-Write, a comprehensive benchmark for evaluating ultra-long generation capabilities. Our 9B parameter model, further improved through DPO, achieves state-of-the-art performance on this benchmark, surpassing even much larger proprietary models. In general, our work demonstrates that existing long context LLMs already possess the potential for a larger output window – all you need is data with extended output during model alignment to unlock this capability. Our code and models are at: this https URL.
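The divide-and-conquer pipeline described above can be sketched as plan-then-generate. `call_llm`, the planning prompt, and the section sizes below are placeholders invented for illustration; the paper's actual prompts and planner differ:

```python
# Sketch of an AgentWrite-style pipeline: plan the piece as section-level
# subtasks, generate each section separately, then concatenate.

def call_llm(prompt):
    # Placeholder for any off-the-shelf LLM call.
    return f"<generated text for: {prompt}>"

def agent_write(topic, n_sections=4, words_per_section=5000):
    # Planning step: break one ultra-long task into bounded subtasks.
    plan = [f"Write section {i + 1} of '{topic}' (~{words_per_section} words)"
            for i in range(n_sections)]
    # Generation step: each subtask fits comfortably in the output window.
    sections = [call_llm(step) for step in plan]
    return "\n\n".join(sections)

print(agent_write("a survey of long-context LLMs", n_sections=2))
```

The key property is that no single generation call ever needs to exceed the model's effective output length, only the concatenation does.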

[NLP-4] The News Comment Gap and Algorithmic Agenda Setting in Online Forums

Link: https://arxiv.org/abs/2408.07052
Authors: Flora Böwing,Patrick Gildersleve
Keywords: stories valued, ranking algorithms, Ranking Utility Metric, ranking, comment ranking algorithms
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Social and Information Networks (cs.SI); Physics and Society (physics.soc-ph)
Comments:

Abstract:The disparity between news stories valued by journalists and those preferred by readers, known as the “News Gap”, is well-documented. However, the difference in expectations regarding news related user-generated content is less studied. Comment sections, hosted by news websites, are popular venues for reader engagement, yet still subject to editorial decisions. It is thus important to understand journalist vs reader comment preferences and how these are served by various comment ranking algorithms that represent discussions differently. We analyse 1.2 million comments from Austrian newspaper Der Standard to understand the “News Comment Gap” and the effects of different ranking algorithms. We find that journalists prefer positive, timely, complex, direct responses, while readers favour comments similar to article content from elite authors. We introduce the versatile Feature-Oriented Ranking Utility Metric (FORUM) to assess the impact of different ranking algorithms and find dramatic differences in how they prioritise the display of comments by sentiment, topical relevance, lexical diversity, and readability. Journalists can exert substantial influence over the discourse through both curatorial and algorithmic means. Understanding these choices’ implications is vital in fostering engaging and civil discussions while aligning with journalistic objectives, especially given the increasing legal scrutiny and societal importance of online discourse.
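One way to picture a feature-oriented comparison in the spirit of FORUM: measure how an algorithm's top-k comments shift a feature's average relative to the whole comment pool. The feature, numbers, and vote-based ranker below are invented for illustration and are not the paper's metric:

```python
# Sketch: how much does a ranking algorithm's top-k differ from the pool
# on a given feature (sentiment here)?

def top_k_shift(comments, ranker, feature, k=3):
    """Difference between the top-k mean and the pool mean of one feature."""
    ranked = sorted(comments, key=ranker, reverse=True)[:k]
    pool_mean = sum(feature(c) for c in comments) / len(comments)
    topk_mean = sum(feature(c) for c in ranked) / k
    return topk_mean - pool_mean

comments = [{"sentiment": s, "votes": v}
            for s, v in [(0.9, 2), (-0.4, 30), (0.1, 12), (0.5, 1)]]
by_votes = lambda c: c["votes"]
shift = top_k_shift(comments, by_votes, feature=lambda c: c["sentiment"])
print(round(shift, 3))  # negative: vote-ranking surfaces less positive comments
```

Repeating this over several features (topical relevance, lexical diversity, readability) and several rankers gives exactly the kind of per-algorithm profile the abstract describes.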

[NLP-5] TableGuard – Securing Structured & Unstructured Data

Link: https://arxiv.org/abs/2408.07045
Authors: Anantha Sharma,Ajinkya Deshmukh
Keywords: data, critical challenge, increasing demand, TableGuard, obfuscation
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 7 pages, 3 tables, 1 figure

Abstract:With the increasing demand for data sharing across platforms and organizations, ensuring the privacy and security of sensitive information has become a critical challenge. This paper introduces “TableGuard”, an innovative approach to data obfuscation tailored for relational databases. Building on the principles and techniques developed in prior work on context-sensitive obfuscation, TableGuard applies these methods to ensure that API calls return only obfuscated data, thereby safeguarding privacy when sharing data with third parties. TableGuard leverages advanced context-sensitive obfuscation techniques to replace sensitive data elements with contextually appropriate alternatives. By maintaining the relational integrity and coherence of the data, our approach mitigates the risks of cognitive dissonance and data leakage. We demonstrate the implementation of TableGuard using a BERT based transformer model, which identifies and obfuscates sensitive entities within relational tables. Our evaluation shows that TableGuard effectively balances privacy protection with data utility, minimizing information loss while ensuring that the obfuscated data remains functionally useful for downstream applications. The results highlight the importance of domain-specific obfuscation strategies and the role of context length in preserving data integrity. The implications of this research are significant for organizations that need to share data securely with external parties. TableGuard offers a robust framework for implementing privacy-preserving data sharing mechanisms, thereby contributing to the broader field of data privacy and security.
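The core obfuscation step can be pictured as replacing sensitive cells with type-consistent surrogates so the row stays coherent. Entity detection below is a hard-coded column lookup, standing in for the paper's BERT-based recognizer; the column names and surrogates are invented for illustration:

```python
# Minimal sketch of context-sensitive cell obfuscation for a relational row.

SENSITIVE = {"name": "PERSON", "salary": "NUMBER"}   # column -> entity type
SURROGATES = {"PERSON": "Jane Roe", "NUMBER": "0"}   # type -> safe replacement

def obfuscate_row(row):
    """Replace sensitive cells with type-appropriate surrogates; keep the rest."""
    return {col: SURROGATES[SENSITIVE[col]] if col in SENSITIVE else val
            for col, val in row.items()}

row = {"name": "Alice Smith", "dept": "Radiology", "salary": "82000"}
print(obfuscate_row(row))  # dept survives; name and salary get surrogates
```

Because replacements are keyed by entity type rather than blanked out, downstream consumers still see a structurally valid table, which is the "contextually appropriate alternatives" property the abstract emphasizes.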

[NLP-6] Generative AI for automatic topic labelling

Link: https://arxiv.org/abs/2408.07003
Authors: Diego Kozlowski,Carolina Pradier,Pierre Benz
Keywords: large scale interpretation, Topic Modeling, prominent tool, large scale, scale interpretation
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 10 pages, 1 figure

Abstract:Topic Modeling has become a prominent tool for the study of scientific fields, as they allow for a large scale interpretation of research trends. Nevertheless, the output of these models is structured as a list of keywords which requires a manual interpretation for the labelling. This paper proposes to assess the reliability of three LLMs, namely flan, GPT-4o, and GPT-4 mini for topic labelling. Drawing on previous research leveraging BERTopic, we generate topics from a dataset of all the scientific articles (n=34,797) authored by all biology professors in Switzerland (n=465) between 2008 and 2020, as recorded in the Web of Science database. We assess the output of the three models both quantitatively and qualitatively and find that, first, both GPT models are capable of accurately and precisely label topics from the models’ output keywords. Second, 3-word labels are preferable to grasp the complexity of research topics.

[NLP-7] The advantages of context specific language models: the case of the Erasmian Language Model

Link: https://arxiv.org/abs/2408.06931
Authors: João Gonçalves,Nick Jelicic,Michele Murgia,Evert Stamhuis
Keywords: training data fed, improve language model, Erasmus University Rotterdam, current trend, trend to improve
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 12 pages, 3 figures, 1 table

Abstract:The current trend to improve language model performance seems to be based on scaling up with the number of parameters (e.g. the state of the art GPT4 model has approximately 1.7 trillion parameters) or the amount of training data fed into the model. However this comes at significant costs in terms of computational resources and energy costs that compromise the sustainability of AI solutions, as well as risk relating to privacy and misuse. In this paper we present the Erasmian Language Model (ELM), a small, context-specific, 900 million parameter model, pre-trained and fine-tuned by and for Erasmus University Rotterdam. We show how the model performs adequately in a classroom context for essay writing, and how it achieves superior performance in subjects that are part of its context. This has implications for a wide range of institutions and organizations, showing that context specific language models may be a viable alternative for resource constrained, privacy sensitive use cases.

[NLP-8] Diagnosis extraction from unstructured Dutch echocardiogram reports using span- and document-level characteristic classification

Link: https://arxiv.org/abs/2408.06930
Authors: Bauke Arends,Melle Vessies,Dirk van Osch,Arco Teske,Pim van der Harst,René van Es,Bram van Es
Keywords: Clinical machine learning, driven clinical decision, clinical decision support, machine learning research, clinically accurate labels
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 28 pages, 5 figures

Abstract:Clinical machine learning research and AI driven clinical decision support models rely on clinically accurate labels. Manually extracting these labels with the help of clinical specialists is often time-consuming and expensive. This study tests the feasibility of automatic span- and document-level diagnosis extraction from unstructured Dutch echocardiogram reports. We included 115,692 unstructured echocardiogram reports from the UMCU, a large university hospital in the Netherlands. A randomly selected subset was manually annotated for the occurrence and severity of eleven commonly described cardiac characteristics. We developed and tested several automatic labelling techniques at both span and document levels, using weighted and macro F1-score, precision, and recall for performance evaluation. We compared the performance of span labelling against document labelling methods, which included both direct document classifiers and indirect document classifiers that rely on span classification results. The SpanCategorizer and this http URL models outperformed all other span and document classifiers, respectively. The weighted F1-score varied between characteristics, ranging from 0.60 to 0.93 in SpanCategorizer and 0.96 to 0.98 in this http URL. Direct document classification was superior to indirect document classification using span classifiers. SetFit achieved competitive document classification performance using only 10% of the training data. Utilizing a reduced label set yielded near-perfect document classification results. We recommend using our published SpanCategorizer and this http URL models for span- and document-level diagnosis extraction from Dutch echocardiography reports. For settings with limited training data, SetFit may be a promising alternative for document classification.

[NLP-9] Evaluating Cultural Adaptability of a Large Language Model via Simulation of Synthetic Personas

Link: https://arxiv.org/abs/2408.06929
Authors: Louis Kwok,Michal Bravansky,Lewis D. Griffin
Keywords: multicultural environments hinges, understand users’ diverse, success of Large, diverse cultural backgrounds, Large Language Models
Subjects: Computation and Language (cs.CL)
Comments: 18 pages, 8 figures, Published as a conference paper at COLM 2024

Abstract:The success of Large Language Models (LLMs) in multicultural environments hinges on their ability to understand users’ diverse cultural backgrounds. We measure this capability by having an LLM simulate human profiles representing various nationalities within the scope of a questionnaire-style psychological experiment. Specifically, we employ GPT-3.5 to reproduce reactions to persuasive news articles of 7,286 participants from 15 countries; comparing the results with a dataset of real participants sharing the same demographic traits. Our analysis shows that specifying a person’s country of residence improves GPT-3.5’s alignment with their responses. In contrast, using native language prompting introduces shifts that significantly reduce overall alignment, with some languages particularly impairing performance. These findings suggest that while direct nationality information enhances the model’s cultural adaptability, native language cues do not reliably improve simulation fidelity and can detract from the model’s effectiveness.

[NLP-10] Re-TASK: Revisiting LLM Tasks from Capability Skill and Knowledge Perspectives

Link: https://arxiv.org/abs/2408.06904
Authors: Zhihu Wang,Shiwan Zhao,Yu Wang,Heyuan Huang,Jiaxin Shi,Sitao Xie,Zhixing Wang,Yubo Zhang,Hongyan Li,Junchi Yan
Keywords: large language models, continue to scale, Knowledge Space Theory, solving domain-specific tasks, large language
Subjects: Computation and Language (cs.CL)
Comments: Work in Progress

Abstract:As large language models (LLMs) continue to scale, their enhanced performance often proves insufficient for solving domain-specific tasks. Systematically analyzing their failures and effectively enhancing their performance remain significant challenges. This paper introduces the Re-TASK framework, a novel theoretical model that Revisits LLM Tasks from cApability, Skill, Knowledge perspectives, guided by the principles of Bloom’s Taxonomy and Knowledge Space Theory. The Re-TASK framework provides a systematic methodology to deepen our understanding, evaluation, and enhancement of LLMs for domain-specific tasks. It explores the interplay among an LLM’s capabilities, the knowledge it processes, and the skills it applies, elucidating how these elements are interconnected and impact task performance. Our application of the Re-TASK framework reveals that many failures in domain-specific tasks can be attributed to insufficient knowledge or inadequate skill adaptation. With this insight, we propose structured strategies for enhancing LLMs through targeted knowledge injection and skill adaptation. Specifically, we identify key capability items associated with tasks and employ a deliberately designed prompting strategy to enhance task performance, thereby reducing the need for extensive fine-tuning. Alternatively, we fine-tune the LLM using capability-specific instructions, further validating the efficacy of our framework. Experimental results confirm the framework’s effectiveness, demonstrating substantial improvements in both the performance and applicability of LLMs.

[NLP-11] Leveraging Language Models for Emotion and Behavior Analysis in Education

Link: https://arxiv.org/abs/2408.06874
Authors: Kaito Tanaka,Benjamin Tan,Brian Wong
Keywords: enhancing learning outcomes, personalizing educational experiences, crucial for enhancing, enhancing learning, learning outcomes
Subjects: Computation and Language (cs.CL)
Comments: 8 pages

Abstract:The analysis of students’ emotions and behaviors is crucial for enhancing learning outcomes and personalizing educational experiences. Traditional methods often rely on intrusive visual and physiological data collection, posing privacy concerns and scalability issues. This paper proposes a novel method leveraging large language models (LLMs) and prompt engineering to analyze textual data from students. Our approach utilizes tailored prompts to guide LLMs in detecting emotional and engagement states, providing a non-intrusive and scalable solution. We conducted experiments using Qwen, ChatGPT, Claude2, and GPT-4, comparing our method against baseline models and chain-of-thought (CoT) prompting. Results demonstrate that our method significantly outperforms the baselines in both accuracy and contextual understanding. This study highlights the potential of LLMs combined with prompt engineering to offer practical and effective tools for educational emotion and behavior analysis.

[NLP-12] LoRA2 : Multi-Scale Low-Rank Approximations for Fine-Tuning Large Language Models

Link: https://arxiv.org/abs/2408.06854
Authors: Jia-Chen Zhang,Yu-Jie Xiong,He-Xi Qiu,Dong-Hai Zhu,Chun-Ming Xia
Keywords: Fine-tuning large language, high parameter efficiency, large language models, large language, Fine-tuning large
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Fine-tuning large language models (LLMs) with high parameter efficiency for downstream tasks has become a new paradigm. Low-Rank Adaptation (LoRA) significantly reduces the number of trainable parameters for fine-tuning. Although it has demonstrated commendable performance, updating parameters within a single scale may not be the optimal choice for complex downstream tasks. In this paper, we extend LoRA to multiple scales, dubbed LoRA^2. We first combine orthogonal projection theory to train a set of LoRAs in two mutually orthogonal planes. Then, we improve the importance score algorithm, which reduces parameter sensitivity score calculations by approximately 98.5%. By pruning singular values with lower importance scores, we further enhance adaptability to various downstream tasks. Extensive experiments are conducted on two widely used pre-trained models to validate the effectiveness of LoRA^2. Results show that it significantly reduces the number of trainable parameters to just 0.72% compared to full fine-tuning, while still delivering highly impressive performance. Even when the parameters are further reduced to 0.17M, it still achieves comparable results to the baseline with 8 times more parameters. Our code is available here: https://anonymous.4open.science/r/LoRA-2-5B4C
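The multi-scale low-rank idea above can be sketched directly: two low-rank updates B1@A1 + B2@A2 whose A factors lie in orthogonal subspaces, instead of a single full-rank weight update. The tiny matrices and dimensions below are illustrative, and the orthogonality is hard-coded here rather than trained as in the paper:

```python
# Plain-Python sketch of a two-plane low-rank weight update (LoRA^2-style).

def matmul(X, Y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)]
            for row in X]

def madd(X, Y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

# Two rank-1 pairs on a 4x4 weight; A1 and A2 span orthogonal directions.
A1 = [[1.0, 0.0, 0.0, 0.0]]
A2 = [[0.0, 1.0, 0.0, 0.0]]
B1 = [[0.5], [0.0], [0.0], [0.0]]
B2 = [[0.0], [0.25], [0.0], [0.0]]

delta_W = madd(matmul(B1, A1), matmul(B2, A2))  # combined low-rank update

# Parameter count at realistic sizes: two planes of rank r on a d x d weight
# train 2 * 2 * d * r values instead of d * d for full fine-tuning.
d, r = 4096, 8
print(2 * 2 * d * r, "trainable vs", d * d, "full")
```

At d=4096, r=8 this is 131,072 trainable values against roughly 16.8M for the full matrix, which is the kind of reduction the abstract's 0.72% figure reflects.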

[NLP-13] Causal Agent based on Large Language Model

Link: https://arxiv.org/abs/2408.06849
Authors: Kairong Han,Kun Kuang,Ziyu Zhao,Junjian Ye,Fei Wu
Keywords: Large language models, causal, Causal Agent, achieved significant success, Large language
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) have achieved significant success across various domains. However, the inherent complexity of causal problems and causal theory poses challenges in accurately describing them in natural language, making it difficult for LLMs to comprehend and use them effectively. Causal methods are not easily conveyed through natural language, which hinders LLMs’ ability to apply them accurately. Additionally, causal datasets are typically tabular, while LLMs excel in handling natural language data, creating a structural mismatch that impedes effective reasoning with tabular data. This lack of causal reasoning capability limits the development of LLMs. To address these challenges, we have equipped the LLM with causal tools within an agent framework, named the Causal Agent, enabling it to tackle causal problems. The causal agent comprises tools, memory, and reasoning modules. In the tools module, the causal agent applies causal methods to align tabular data with natural language. In the reasoning module, the causal agent employs the ReAct framework to perform reasoning through multiple iterations with the tools. In the memory module, the causal agent maintains a dictionary instance where the keys are unique names and the values are causal graphs. To verify the causal ability of the causal agent, we established a benchmark consisting of four levels of causal problems: variable level, edge level, causal graph level, and causal effect level. We generated a test dataset of 1.3K using ChatGPT-3.5 for these four levels of issues and tested the causal agent on the datasets. Our methodology demonstrates remarkable efficacy on the four-level causal problems, with accuracy rates all above 80%. For further insights and implementation details, our code is accessible via the GitHub repository this https URL.
摘要:大型语言模型(LLM)在各个领域都取得了巨大的成功。然而,因果问题和因果理论固有的复杂性给用自然语言准确描述它们带来了挑战,这使得LLM难以有效地理解和使用它们。因果方法不容易通过自然语言传达,这阻碍了LLM准确应用它们的能力。此外,因果数据集通常是表格式的,而LLM擅长处理自然语言数据,这造成了结构不匹配,阻碍了对表格数据的有效推理。这种因果推理能力的缺乏限制了LLM的发展。为了应对这些挑战,我们在一个名为因果代理(Causal Agent)的代理框架内为LLM配备了因果工具,使其能够处理因果问题。因果代理包括工具、记忆和推理模块。在工具模块中,因果代理应用因果方法来使表格数据与自然语言保持一致。在推理模块中,因果代理使用ReAct框架,借助工具通过多次迭代来执行推理。在记忆模块中,因果代理维护一个字典实例,其中键是唯一的名称,值是因果图。为了验证因果代理的因果能力,我们建立了一个由四个层次的因果问题组成的基准:变量层次、边层次、因果图层次和因果效应层次。我们使用ChatGPT-3.5为这四个层次的问题生成了1.3K的测试数据集,并在数据集上测试了因果代理。我们的方法在四个层次的因果问题上表现出显著的效果,准确率均在80%以上。有关进一步的见解和实现细节,我们的代码可通过GitHub仓库的此HTTPS URL访问。
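下面用一段极简的Python示意上文所述的记忆模块:以字典保存"唯一名称 → 因果图"的映射,并支持边层次的查询。类名、方法名与图的边表表示均为便于说明的假设,并非论文的实际实现。

```python
# Minimal sketch of the agent's memory module as described in the abstract:
# a dictionary mapping unique names to causal graphs, here stored as lists
# of directed edges. Names are illustrative, not from the paper's code.

class CausalMemory:
    def __init__(self):
        self._graphs = {}  # name -> causal graph (list of directed edges)

    def store(self, name, edges):
        self._graphs[name] = list(edges)

    def recall(self, name):
        # Returns None when no graph was stored under this name.
        return self._graphs.get(name)

    def parents(self, name, node):
        # Edge-level query: direct causes of `node` in a stored graph.
        return [u for u, v in self._graphs.get(name, []) if v == node]

memory = CausalMemory()
memory.store("smoking_study", [("smoking", "tar"), ("tar", "cancer")])
causes = memory.parents("smoking_study", "cancer")
```

在ReAct式的循环中,推理模块的每次迭代都可以先`recall`已有的因果图,再决定调用哪个因果工具。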

[NLP-14] MAQA: Evaluating Uncertainty Quantification in LLMs Regarding Data Uncertainty
[NLP-14] MAQA:评估LLM中关于数据不确定性的不确定性量化

链接: https://arxiv.org/abs/2408.06816
作者: Yongjin Yang,Haneul Yoo,Hwaran Lee
关键词-EN: uncertainty quantification, uncertainty, uncertainty quantification methods, large language models, data uncertainty
关键词-ZH: 不确定性量化,不确定性,不确定性量化方法,大型语言模型,数据不确定性
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Although large language models (LLMs) are capable of performing various tasks, they still suffer from producing plausible but incorrect responses. To improve the reliability of LLMs, recent research has focused on uncertainty quantification to predict whether a response is correct or not. However, most uncertainty quantification methods have been evaluated on questions requiring a single clear answer, ignoring the existence of data uncertainty that arises from irreducible randomness. Instead, these methods only consider model uncertainty, which arises from a lack of knowledge. In this paper, we investigate previous uncertainty quantification methods under the presence of data uncertainty. Our contributions are two-fold: 1) proposing a new Multi-Answer Question Answering dataset, MAQA, consisting of world knowledge, mathematical reasoning, and commonsense reasoning tasks to evaluate uncertainty quantification regarding data uncertainty, and 2) assessing 5 uncertainty quantification methods of diverse white- and black-box LLMs. Our findings show that entropy and consistency-based methods estimate the model uncertainty well even under data uncertainty, while other methods for white- and black-box LLMs struggle depending on the tasks. Additionally, methods designed for white-box LLMs suffer from overconfidence in reasoning tasks compared to simple knowledge queries. We believe our observations will pave the way for future work on uncertainty quantification in realistic setting.
摘要:尽管大型语言模型(LLM)能够执行各种任务,但它们仍然会产生貌似合理但不正确的回答。为了提高LLM的可靠性,最近的研究集中在用不确定性量化来预测回答是否正确。然而,大多数不确定性量化方法都是在需要单一明确答案的问题上进行评估的,忽略了由不可约随机性引起的数据不确定性的存在。这些方法只考虑由知识缺乏引起的模型不确定性。在本文中,我们研究了在存在数据不确定性的情况下以往的不确定性量化方法。我们的贡献有两个方面:1)提出了一个新的多答案问答数据集MAQA,它由世界知识、数学推理和常识推理任务组成,用于评估关于数据不确定性的不确定性量化;2)评估多种白盒和黑盒LLM的5种不确定性量化方法。我们的发现表明,即使在数据不确定的情况下,基于熵和一致性的方法也能很好地估计模型不确定性,而针对白盒和黑盒LLM的其他方法的表现则因任务而异。此外,与简单的知识查询相比,为白盒LLM设计的方法在推理任务中存在过度自信的问题。我们相信,我们的观察结果将为未来在现实环境中进行不确定性量化的工作铺平道路。

[NLP-15] Layerwise Recurrent Router for Mixture-of-Experts
[NLP-15] 用于混合专家的分层循环路由器

链接: https://arxiv.org/abs/2408.06793
作者: Zihan Qiu,Zeyu Huang,Shuang Cheng,Yizhi Zhou,Zili Wang,Ivan Titov,Jie Fu
关键词-EN: efficient computational strategies, computational strategies, scaling of large, revolutionized their capabilities, matched with efficient
关键词-ZH: 高效的计算策略、计算策略、大规模扩展、彻底改变了他们的能力,与高效的相匹配
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The scaling of large language models (LLMs) has revolutionized their capabilities in various tasks, yet this growth must be matched with efficient computational strategies. The Mixture-of-Experts (MoE) architecture stands out for its ability to scale model size without significantly increasing training costs. Despite their advantages, current MoE models often display parameter inefficiency. For instance, a pre-trained MoE-based LLM with 52 billion parameters might perform comparably to a standard model with 6.7 billion parameters. Being a crucial part of MoE, current routers in different layers independently assign tokens without leveraging historical routing information, potentially leading to suboptimal token-expert combinations and the parameter inefficiency problem. To alleviate this issue, we introduce the Layerwise Recurrent Router for Mixture-of-Experts (RMoE). RMoE leverages a Gated Recurrent Unit (GRU) to establish dependencies between routing decisions across consecutive layers. Such layerwise recurrence can be efficiently parallelly computed for input tokens and introduces negotiable costs. Our extensive empirical evaluations demonstrate that RMoE-based language models consistently outperform a spectrum of baseline models. Furthermore, RMoE integrates a novel computation stage orthogonal to existing methods, allowing seamless compatibility with other MoE architectures. Our analyses attribute RMoE’s gains to its effective cross-layer information sharing, which also improves expert selection and diversity. Our code is at this https URL
摘要:大型语言模型(LLM)的扩展使其在各种任务中的能力发生了革命性的变化,但这种增长必须与高效的计算策略相匹配。专家混合(MoE)架构因其能在不显著增加训练成本的情况下扩展模型规模而脱颖而出。尽管具有这些优势,当前的MoE模型往往表现出参数利用的低效。例如,一个具有520亿参数的预训练MoE LLM,其性能可能仅与一个具有67亿参数的标准模型相当。作为MoE的关键组成部分,当前不同层的路由器在不利用历史路由信息的情况下独立分配词元,这可能导致次优的词元-专家组合以及参数低效问题。为了缓解这一问题,我们引入了专家混合的逐层循环路由器(RMoE)。RMoE利用门控循环单元(GRU)来建立连续层之间路由决策的依赖关系。这种逐层递归可以针对输入词元高效地并行计算,且引入的成本可以接受。我们广泛的实证评估表明,基于RMoE的语言模型始终优于一系列基线模型。此外,RMoE集成了一个与现有方法正交的新型计算阶段,可以与其他MoE架构无缝兼容。我们的分析将RMoE的收益归因于其有效的跨层信息共享,这也改善了专家选择和多样性。我们的代码位于以下的HTTPS URL
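为直观说明逐层循环路由的思路,下面给出一个纯Python的简化示意:一个GRU单元在层与层之间传递路由状态,每一层的专家概率由该循环状态计算得到。其中的维度、名称与初始化方式均为演示用的假设,并非RMoE的实际实现。

```python
# Sketch of a layerwise recurrent router in the spirit of RMoE: a GRU cell
# carries routing state across layers, and each layer's expert probabilities
# are computed from that recurrent state. Sizes are illustrative.
import math
import random

random.seed(0)

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

class GRUCell:
    """Minimal GRU cell: h' = (1-z)*h + z*tanh(Wc [x; r*h])."""
    def __init__(self, dim):
        self.Wz = rand_matrix(dim, 2 * dim)
        self.Wr = rand_matrix(dim, 2 * dim)
        self.Wc = rand_matrix(dim, 2 * dim)

    def step(self, x, h):
        xh = x + h  # list concatenation = vector concat [x; h]
        z = sigmoid(matvec(self.Wz, xh))
        r = sigmoid(matvec(self.Wr, xh))
        c = [math.tanh(v) for v in
             matvec(self.Wc, x + [ri * hi for ri, hi in zip(r, h)])]
        return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, c)]

class RecurrentRouter:
    def __init__(self, dim, n_experts, n_layers):
        self.gru = GRUCell(dim)
        # One projection from routing state to expert logits per layer.
        self.proj = [rand_matrix(n_experts, dim) for _ in range(n_layers)]

    def route(self, token_states):
        """token_states: one hidden vector per layer for a single token."""
        h = [0.0] * len(token_states[0])
        decisions = []
        for layer, x in enumerate(token_states):
            h = self.gru.step(x, h)  # carry routing info across layers
            decisions.append(softmax(matvec(self.proj[layer], h)))
        return decisions

router = RecurrentRouter(dim=4, n_experts=3, n_layers=2)
states = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(2)]
decisions = router.route(states)
```

与逐层独立的标准路由器相比,这里第二层的专家分布显式依赖于第一层的路由状态,这正是摘要所述"历史路由信息"的含义。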

[NLP-16] Unlock the Power of Frozen LLMs in Knowledge Graph Completion
[NLP-16] 验证冻结LLM在知识图完成中的力量

链接: https://arxiv.org/abs/2408.06787
作者: Bo Xue,Yi Xu,Yunchong Song,Yiming Pang,Yuyang Ren,Jiaxin Ding,Luoyi Fu,Xinbing Wang
关键词-EN: knowledge graph completion, methods rely solely, Large Language Models, graph completion, Classical knowledge graph
关键词-ZH: 知识图完成,方法仅依赖,大型语言模型,图完成,经典知识图
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Classical knowledge graph completion (KGC) methods rely solely on structural information, struggling with the inherent sparsity of knowledge graphs (KGs). Large Language Models (LLMs) learn extensive knowledge from large corpora with powerful context modeling, which is ideal for mitigating the limitations of previous methods. Directly fine-tuning LLMs offers great capability but comes at the cost of huge time and memory consumption, while utilizing frozen LLMs yields suboptimal results. In this work, we aim to leverage LLMs for KGC effectively and efficiently. We capture the context-aware hidden states of knowledge triples by employing prompts to stimulate the intermediate layers of LLMs. We then train a data-efficient classifier on these hidden states to harness the inherent capabilities of frozen LLMs in KGC. We also generate entity descriptions with subgraph sampling on KGs, reducing the ambiguity of triplets and enriching the knowledge representation. Extensive experiments on standard benchmarks showcase the efficiency and effectiveness of our approach. We outperform classical KGC methods on most datasets and match the performance of fine-tuned LLMs. Additionally, compared to fine-tuned LLMs, we boost GPU memory efficiency by 188× and speed up training+inference by 13.48×.
摘要:经典的知识图补全(KGC)方法仅依赖结构信息,难以应对知识图(KG)固有的稀疏性。大型语言模型(LLM)通过强大的上下文建模从大型语料库中学习广泛的知识,是缓解以往方法局限性的理想选择。直接微调LLM能力很强,但以巨大的时间和内存消耗为代价,而直接利用冻结的LLM则会产生次优的结果。在这项工作中,我们的目标是有效且高效地利用LLM完成KGC任务。我们通过使用提示来激发LLM的中间层,从而捕捉知识三元组的上下文感知隐藏状态。然后,我们在这些隐藏状态上训练一个数据高效的分类器,以利用冻结LLM在KGC中的固有能力。我们还在KG上用子图采样生成实体描述,减少了三元组的歧义,丰富了知识表示。在标准基准上的大量实验表明了我们方法的效率和有效性。我们在大多数数据集上的性能优于经典的KGC方法,并且与微调的LLM性能相当。此外,与微调LLM相比,我们将GPU内存效率提高了188倍,并将训练+推理速度提高了13.48倍。
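上述"冻结LLM + 轻量分类器"的思路可用如下简化示意说明:用确定性的伪嵌入代替真实的LLM中间层隐藏状态,并在其上训练一个逻辑回归分类头来判断三元组真假。函数名与数据均为演示用的假设,并非论文代码。

```python
# Sketch of the recipe described above: freeze the LLM, read out hidden
# states for a prompted (head, relation, tail) triple, and train a
# lightweight classifier on them. The "hidden states" here are deterministic
# toy vectors standing in for frozen-LLM activations.
import math
import random

DIM = 8

def frozen_llm_hidden_state(triple):
    # Stand-in for prompting a frozen LLM and reading an intermediate
    # layer's activation; in the actual method this comes from the LLM.
    seed = sum(ord(c) for part in triple for c in part)
    rnd = random.Random(seed)
    return [rnd.uniform(-1.0, 1.0) for _ in range(DIM)]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_classifier(data, epochs=300, lr=0.5):
    # Data-efficient logistic-regression head over the frozen features.
    w, b = [0.0] * DIM, 0.0
    for _ in range(epochs):
        for feats, label in data:
            p = sigmoid(sum(wi * f for wi, f in zip(w, feats)) + b)
            g = p - label
            w = [wi - lr * g * f for wi, f in zip(w, feats)]
            b -= lr * g
    return w, b

def predict(w, b, feats):
    return 1 if sigmoid(sum(wi * f for wi, f in zip(w, feats)) + b) > 0.5 else 0

triples = [(("Paris", "capital_of", "France"), 1),
           (("Paris", "capital_of", "Spain"), 0),
           (("Berlin", "capital_of", "Germany"), 1),
           (("Berlin", "capital_of", "Italy"), 0)]
data = [(frozen_llm_hidden_state(t), y) for t, y in triples]
w, b = train_classifier(data)
accuracy = sum(predict(w, b, feats) == y for feats, y in data) / len(data)
```

关键点在于:LLM本身的参数完全不更新,只有这个小分类头被训练,这正是摘要所述显存与速度优势的来源。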

[NLP-17] Fast-and-Frugal Text-Graph Transformers are Effective Link Predictors
[NLP-17] 快速而节俭的文本图转换器是有效的链接预测器

链接: https://arxiv.org/abs/2408.06778
作者: Andrei C. Coman,Christos Theodoropoulos,Marie-Francine Moens,James Henderson
关键词-EN: Link prediction models, enabling fully inductive, fully inductive learning, Link prediction, incorporating textual descriptions
关键词-ZH: 链接预测模型,支持完全归纳、完全归纳学习、链接预测,结合文本描述
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Link prediction models can benefit from incorporating textual descriptions of entities and relations, enabling fully inductive learning and flexibility in dynamic graphs. We address the challenge of also capturing rich structured information about the local neighbourhood of entities and their relations, by introducing a Transformer-based approach that effectively integrates textual descriptions with graph structure, reducing the reliance on resource-intensive text encoders. Our experiments on three challenging datasets show that our Fast-and-Frugal Text-Graph (FnF-TG) Transformers achieve superior performance compared to the previous state-of-the-art methods, while maintaining efficiency and scalability.
摘要:链接预测模型可以受益于融入实体和关系的文本描述,从而在动态图中实现完全归纳学习和灵活性。我们通过引入一种基于Transformer的方法来应对同时捕获实体局部邻域及其关系的丰富结构化信息这一挑战,该方法有效地将文本描述与图结构集成,从而减少对资源密集型文本编码器的依赖。我们在三个具有挑战性的数据集上的实验表明,与之前最先进的方法相比,我们的快速节俭文本图(FnF-TG)Transformer实现了更优的性能,同时保持了效率和可扩展性。

[NLP-18] Sumotosima: A Framework and Dataset for Classifying and Summarizing Otoscopic Images
[NLP-18] Sumotosima:用于分类和总结耳镜图像的框架和数据集

链接: https://arxiv.org/abs/2408.06755
作者: Eram Anwarul Khan,Anas Anwarul Haq Khan
关键词-EN: diagnostic procedure, procedure to examine, canal and eardrum, ear canal, ear drum perforations
关键词-ZH: 诊断程序、检查程序、耳道和耳膜、耳道、耳膜穿孔
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Work in Progress

点击查看摘要

Abstract:Otoscopy is a diagnostic procedure to examine the ear canal and eardrum using an otoscope. It identifies conditions like infections, foreign bodies, ear drum perforations and ear abnormalities. We propose a novel resource efficient deep learning and transformer based framework, Sumotosima (Summarizer for otoscopic images), an end-to-end pipeline for classification followed by summarization. Our framework works on combination of triplet and cross-entropy losses. Additionally, we use Knowledge Enhanced Multimodal BART whose input is fused textual and image embedding. The objective is to provide summaries that are well-suited for patients, ensuring clarity and efficiency in understanding otoscopic images. Given the lack of existing datasets, we have curated our own OCASD (Otoscopic Classification And Summary Dataset), which includes 500 images with 5 unique categories annotated with their class and summaries by Otolaryngologists. Sumotosima achieved a result of 98.03%, which is 7.00%, 3.10%, 3.01% higher than K-Nearest Neighbors, Random Forest and Support Vector Machines, respectively, in classification tasks. For summarization, Sumotosima outperformed GPT-4o and LLaVA by 88.53% and 107.57% in ROUGE scores, respectively. We have made our code and dataset publicly available at this https URL
摘要:耳镜检查是一种使用耳镜检查耳道和鼓膜的诊断程序,可以识别感染、异物、鼓膜穿孔和耳部异常等情况。我们提出了一种新颖的、资源高效的、基于深度学习和Transformer的框架Sumotosima(耳镜图像摘要器),这是一条先分类后摘要的端到端流水线。我们的框架基于三元组损失和交叉熵损失的组合。此外,我们使用知识增强的多模态BART,其输入为融合的文本和图像嵌入。其目的是提供非常适合患者的摘要,确保理解耳镜图像的清晰性和效率。鉴于现有数据集的缺乏,我们构建了自己的OCASD(耳镜分类和摘要数据集),其中包括500张图像、5个独特类别,并由耳鼻咽喉科医生标注了类别和摘要。在分类任务中,Sumotosima取得了98.03%的结果,分别比K近邻、随机森林和支持向量机高7.00%、3.10%和3.01%。在摘要任务中,Sumotosima的ROUGE得分分别比GPT-4o和LLaVA高出88.53%和107.57%。我们已通过此HTTPS URL公开了我们的代码和数据集
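文中提到的"三元组损失与交叉熵损失的组合"可以用如下纯Python示意。其中的margin、权重alpha与toy向量均为演示用的假设,具体取值并非论文设定。

```python
# Sketch of the combined objective described above: a triplet loss pulls
# same-class embeddings together while a cross-entropy term trains the
# classifier head. The margin and weighting alpha are illustrative.
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    # Zero once the negative is farther than the positive by >= margin.
    return max(0.0, euclidean(anchor, positive)
               - euclidean(anchor, negative) + margin)

def cross_entropy(logits, label):
    # Numerically stable: CE = logsumexp(logits) - logits[label].
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[label]

def combined_loss(anchor, positive, negative, logits, label, alpha=0.5):
    return (alpha * triplet_loss(anchor, positive, negative)
            + (1 - alpha) * cross_entropy(logits, label))

anchor, pos, neg = [0.0, 0.0], [0.1, 0.0], [2.0, 2.0]
loss = combined_loss(anchor, pos, neg, logits=[2.0, 0.1, 0.1], label=0)
```

这里三元组项已满足margin(取值为0),因此总损失只剩交叉熵项的加权部分。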

[NLP-19] Multilingual Models for Check-Worthy Social Media Posts Detection
[NLP-19] 用于检查有价值的社交媒体帖子检测的多语言模型

链接: https://arxiv.org/abs/2408.06737
作者: Sebastian Kula,Michal Gregor
关键词-EN: transformer-based NLP models, transformer-based NLP, social media posts, NLP models, presents an extensive
关键词-ZH: 基于转换器的NLP模型、基于转换器的NLP、社交媒体帖子、NLP模型,呈现了广泛的
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This work presents an extensive study of transformer-based NLP models for detection of social media posts that contain verifiable factual claims and harmful claims. The study covers various activities, including dataset collection, dataset pre-processing, architecture selection, setup of settings, model training (fine-tuning), model testing, and implementation. The study includes a comprehensive analysis of different models, with a special focus on multilingual models where the same model is capable of processing social media posts in both English and in low-resource languages such as Arabic, Bulgarian, Dutch, Polish, Czech, Slovak. The results obtained from the study were validated against state-of-the-art models, and the comparison demonstrated the robustness of the proposed models. The novelty of this work lies in the development of multi-label multilingual classification models that can simultaneously detect harmful posts and posts that contain verifiable factual claims in an efficient way.
摘要:本文对用于检测包含可验证事实声明和有害声明的社交媒体帖子的基于Transformer的NLP模型进行了广泛研究。研究涵盖了各种活动,包括数据集收集、数据集预处理、架构选择、参数设置、模型训练(微调)、模型测试和实施。研究包括对不同模型的全面分析,特别侧重于多语言模型,即同一模型能够处理英语以及阿拉伯语、保加利亚语、荷兰语、波兰语、捷克语、斯洛伐克语等低资源语言的社交媒体帖子。研究结果与最先进的模型进行了对比验证,比较结果证明了所提出模型的稳健性。这项工作的新颖性在于开发了多标签多语言分类模型,可以高效地同时检测有害帖子和包含可验证事实声明的帖子。

[NLP-20] Exploring the anatomy of articulation rate in spontaneous English speech: relationships between utterance length effects and social factors INTERSPEECH2024
[NLP-20] 探索自发英语言语清晰率的解剖:话语长度效应与社会因素之间的关系

链接: https://arxiv.org/abs/2408.06732
作者: James Tanner,Morgan Sonderegger,Jane Stuart-Smith,Tyler Kendall,Jeff Mielke,Robin Dodsworth,Erik Thomas
关键词-EN: Speech rate, Speech, rate, shown to vary, utterance length
关键词-ZH: 语音速率、语音、速率、显示变化、发声长度
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Proceedings of Interspeech 2024. 5 pages, 4 figures

点击查看摘要

Abstract:Speech rate has been shown to vary across social categories such as gender, age, and dialect, while also being conditioned by properties of speech planning. The effect of utterance length, where speech rate is faster and less variable for longer utterances, has also been shown to reduce the role of social factors once it has been accounted for, leaving unclear the relationship between social factors and speech production in conditioning speech rate. Through modelling of speech rate across 13 English speech corpora, it is found that utterance length has the largest effect on speech rate, though this effect itself varies little across corpora and speakers. While age and gender also modulate speech rate, their effects are much smaller in magnitude. These findings suggest utterance length effects may be conditioned by articulatory and perceptual constraints, and that social influences on speech rate should be interpreted in the broader context of how speech rate variation is structured.
摘要:语速已被证明因性别、年龄和方言等社会类别而异,同时也受到言语规划特性的制约。话语长度效应(话语越长,语速越快且变化越小)也被证明在被纳入考虑后会削弱社会因素的作用,这使得社会因素与言语产生在制约语速方面的关系尚不明确。通过对13个英语语音语料库的语速建模,我们发现话语长度对语速的影响最大,且这种效应本身在语料库和说话人之间差异很小。虽然年龄和性别也会调节语速,但它们的影响在量级上要小得多。这些发现表明,话语长度效应可能受发音和知觉限制的制约,而社会因素对语速的影响应在语速变异如何构成的更广泛背景下加以解释。

[NLP-21] Large language models can consistently generate high-quality content for election disinformation operations
[NLP-21] 大型语言模型可以为选举虚假信息操作一致生成高质量的内容

链接: https://arxiv.org/abs/2408.06731
作者: Angus R. Williams,Liam Burke-Moore,Ryan Sze-Yin Chan,Florence E. Enock,Federico Nanni,Tvesha Sippy,Yi-Ling Chung,Evelina Gabasova,Kobi Hackenburg,Jonathan Bright
关键词-EN: Advances in large, election disinformation operation, generating compelling election, election disinformation, compelling election disinformation
关键词-ZH: 大规模选举虚假信息行动的进展,产生引人注目的选举、选举虚假信息、引人注目的选举虚假信息
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Advances in large language models have raised concerns about their potential use in generating compelling election disinformation at scale. This study presents a two-part investigation into the capabilities of LLMs to automate stages of an election disinformation operation. First, we introduce DisElect, a novel evaluation dataset designed to measure LLM compliance with instructions to generate content for an election disinformation operation in localised UK context, containing 2,200 malicious prompts and 50 benign prompts. Using DisElect, we test 13 LLMs and find that most models broadly comply with these requests; we also find that the few models which refuse malicious prompts also refuse benign election-related prompts, and are more likely to refuse to generate content from a right-wing perspective. Secondly, we conduct a series of experiments (N=2,340) to assess the “humanness” of LLMs: the extent to which disinformation operation content generated by an LLM is able to pass as human-written. Our experiments suggest that almost all LLMs tested released since 2022 produce election disinformation operation content indiscernible by human evaluators over 50% of the time. Notably, we observe that multiple models achieve above-human levels of humanness. Taken together, these findings suggest that current LLMs can be used to generate high-quality content for election disinformation operations, even in hyperlocalised scenarios, at far lower costs than traditional methods, and offer researchers and policymakers an empirical benchmark for the measurement and evaluation of these capabilities in current and future models.
摘要:大型语言模型的进步引发了人们对其被用于大规模生成令人信服的选举虚假信息的担忧。本研究分两部分考察了LLM自动化选举虚假信息操作各个阶段的能力。首先,我们介绍了DisElect,这是一个新颖的评估数据集,旨在衡量LLM在英国本地化语境下为选举虚假信息操作生成内容的指令遵从程度,包含2200个恶意提示和50个良性提示。使用DisElect,我们测试了13个LLM,发现大多数模型基本上会遵从这些请求;我们还发现,少数拒绝恶意提示的模型也会拒绝与选举相关的良性提示,并且更有可能拒绝从右翼视角生成内容。其次,我们进行了一系列实验(N=2,340)来评估LLM的"人性化"程度:即LLM生成的虚假信息操作内容能在多大程度上被当作人类撰写的内容。我们的实验表明,2022年以来发布的几乎所有被测LLM生成的选举虚假信息操作内容,在超过50%的情况下无法被人类评估者识别。值得注意的是,我们观察到多个模型达到了高于人类的人性化水平。综上所述,这些发现表明,即使在高度本地化的场景中,当前的LLM也能以远低于传统方法的成本为选举虚假信息操作生成高质量内容,并为研究人员和政策制定者提供了在当前和未来模型中衡量和评估这些能力的经验基准。

[NLP-22] Enhancing Visual Dialog State Tracking through Iterative Object-Entity Alignment in Multi-Round Conversations
[NLP-22] 通过多轮对话中的迭代对象-实体对齐增强视觉对话状态跟踪

链接: https://arxiv.org/abs/2408.06725
作者: Wei Pang,Ruixue Duan,Jinfu Yang,Ning Li
关键词-EN: Visual Dialog, dialog history, image-related questions based, multi-round dialog history, Multi-round Dialogue State
关键词-ZH: 视觉对话、对话历史、基于图像的问题、多轮对话历史、多轮对话状态
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: This article has been accepted in CAAI Transactions on Intelligence Technology! Article ID: CIT2_12370, Article DOI: https://doi.org/10.1049/cit2.12370

点击查看摘要

Abstract:Visual Dialog (VD) is a task where an agent answers a series of image-related questions based on a multi-round dialog history. However, previous VD methods often treat the entire dialog history as a simple text input, disregarding the inherent conversational information flows at the round level. In this paper, we introduce Multi-round Dialogue State Tracking model (MDST), a framework that addresses this limitation by leveraging the dialogue state learned from dialog history to answer questions. MDST captures each round of dialog history, constructing internal dialogue state representations defined as 2-tuples of vision-language representations. These representations effectively ground the current question, enabling the generation of accurate answers. Experimental results on the VisDial v1.0 dataset demonstrate that MDST achieves a new state-of-the-art performance in generative setting. Furthermore, through a series of human studies, we validate the effectiveness of MDST in generating long, consistent, and human-like answers while consistently answering a series of questions correctly.
摘要:视觉对话(VD)是一种由代理根据多轮对话历史回答一系列与图像相关的问题的任务。然而,以往的VD方法通常将整个对话历史视为简单的文本输入,而忽略了轮次层面上固有的对话信息流。在本文中,我们介绍了多轮对话状态跟踪模型(MDST),该框架通过利用从对话历史中学习的对话状态来回答问题,从而解决这一局限。MDST捕获每一轮对话历史,构建定义为视觉-语言表示二元组的内部对话状态表示。这些表示有效地为当前问题提供依据(grounding),从而能够生成准确的答案。在VisDial v1.0数据集上的实验结果表明,MDST在生成式设置下达到了新的最先进性能。此外,通过一系列人类研究,我们验证了MDST在生成长的、一致的、类人的答案方面的有效性,同时能持续正确回答一系列问题。

[NLP-23] Latin Treebanks in Review: An Evaluation of Morphological Tagging Across Time
[NLP-23] 拉丁树库评论:跨时间形态标记的评估

链接: https://arxiv.org/abs/2408.06675
作者: Marisa Hudspeth,Brendan O’Connor,Laure Thompson
关键词-EN: long written tradition, Latin long written, written tradition, variety of cultures, long written
关键词-ZH: 长期书面传统,拉丁语长期书面,书面传统,多种文化,长期书面
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Existing Latin treebanks draw from Latin’s long written tradition, spanning 17 centuries and a variety of cultures. Recent efforts have begun to harmonize these treebanks’ annotations to better train and evaluate morphological taggers. However, the heterogeneity of these treebanks must be carefully considered to build effective and reliable data. In this work, we review existing Latin treebanks to identify the texts they draw from, identify their overlap, and document their coverage across time and genre. We additionally design automated conversions of their morphological feature annotations into the conventions of standard Latin grammar. From this, we build new time-period data splits that draw from the existing treebanks which we use to perform a broad cross-time analysis for POS and morphological feature tagging. We find that BERT-based taggers outperform existing taggers while also being more robust to cross-domain shifts.
摘要:现有的拉丁语树库借鉴了拉丁语跨越17个世纪和多种文化的悠久书面传统。近来已有工作开始协调这些树库的标注,以更好地训练和评估形态标注器。然而,要构建有效且可靠的数据,必须仔细考虑这些树库的异质性。在这项工作中,我们回顾了现有的拉丁语树库,识别它们所取材的文本及其重叠之处,并记录它们在时间和体裁上的覆盖范围。我们还设计了将其形态特征标注自动转换为标准拉丁语法惯例的方法。在此基础上,我们从现有树库构建了新的按时间段划分的数据切分,用于对词性和形态特征标注进行广泛的跨时代分析。我们发现,基于BERT的标注器优于现有标注器,同时对跨领域迁移也更稳健。

[NLP-24] Pragmatic inference of scalar implicature by LLMs
[NLP-24] LLM对量含义的修辞推断

链接: https://arxiv.org/abs/2408.06673
作者: Ye-eun Cho,Seong mook Kim
关键词-EN: Large Language Models, investigates how Large, Large Language, study investigates, pragmatic implicature
关键词-ZH: 大型语言模型,调查大型、大型语言研究如何调查,务实含义
类目: Computation and Language (cs.CL)
备注: This research was presented at the Association for Computational Linguistics conference, held on August 11-16

点击查看摘要

Abstract:This study investigates how Large Language Models (LLMs), particularly BERT (Devlin et al., 2019) and GPT-2 (Radford et al., 2019), engage in pragmatic inference of scalar implicature, such as some. Two sets of experiments were conducted using cosine similarity and next sentence/token prediction as experimental methods. The results in experiment 1 showed that, both models interpret some as pragmatic implicature not all in the absence of context, aligning with human language processing. In experiment 2, in which Question Under Discussion (QUD) was presented as a contextual cue, BERT showed consistent performance regardless of types of QUDs, while GPT-2 encountered processing difficulties since a certain type of QUD required pragmatic inference for implicature. The findings revealed that, in terms of theoretical approaches, BERT inherently incorporates pragmatic implicature not all within the term some, adhering to Default model (Levinson, 2000). In contrast, GPT-2 seems to encounter processing difficulties in inferring pragmatic implicature within context, consistent with Context-driven model (Sperber and Wilson, 2002).
摘要:本研究考察了大型语言模型(LLM),特别是BERT(Devlin et al., 2019)和GPT-2(Radford et al., 2019),如何对some等等级含义(scalar implicature)进行语用推理。我们以余弦相似度和下一句/下一词预测作为实验方法,进行了两组实验。实验一的结果表明,在缺乏语境的情况下,两个模型都将some语用性地解释为not all的含义,这与人类的语言加工一致。在实验二中,讨论中问题(QUD)被作为语境线索呈现,无论QUD属于哪种类型,BERT都表现出一致的表现,而GPT-2则遇到了加工困难,因为某类QUD需要对含义进行语用推理。研究结果表明,就理论方法而言,BERT内在地将语用含义not all纳入some一词之中,符合默认模型(Levinson, 2000);相反,GPT-2在语境中推断语用含义时似乎会遇到加工困难,这与语境驱动模型(Sperber and Wilson, 2002)一致。
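文中基于余弦相似度的探测思路可抽象为如下示意:比较some句子嵌入与"not all"及"all"两种释义嵌入的余弦相似度,取更相似者作为模型偏好的解读。下面的向量只是代替真实模型嵌入的toy数据,数值为演示用的假设。

```python
# Abstract sketch of the cosine-similarity probe: whichever paraphrase
# embedding is closer to the "some" sentence indicates the model's
# preferred reading. Vectors are toy stand-ins for model embeddings.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

emb_some    = [0.9, 0.2, 0.1]   # "Some students passed."
emb_not_all = [0.8, 0.3, 0.1]   # "Not all students passed."
emb_all     = [0.1, 0.9, 0.4]   # "All students passed."

pragmatic = cosine(emb_some, emb_not_all)  # pragmatic reading
literal   = cosine(emb_some, emb_all)      # literal reading
reading = "pragmatic (not all)" if pragmatic > literal else "literal (all)"
```

实验二的下一句/下一词预测探测与此类似,只是把嵌入相似度换成模型对候选续写的概率比较。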

[NLP-25] Amuro Char: Analyzing the Relationship between Pre-Training and Fine-Tuning of Large Language Models
[NLP-25] Amuro Char:分析大型语言模型的预训练和微调之间的关系

链接: https://arxiv.org/abs/2408.06663
作者: Kaiser Sun,Mark Dredze
关键词-EN: large text corpus, large language models, language models leads, large language, large text
关键词-ZH: 大型文本库、大型语言模型、语言模型引导、大型语言、大型文本
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The development of large language models leads to the formation of a pre-train-then-align paradigm, in which the model is typically pre-trained on a large text corpus and undergoes a tuning stage to align the model with human preference or downstream tasks. In this work, we investigate the relationship between pre-training and fine-tuning by fine-tuning multiple intermediate pre-trained model checkpoints. Our results on 18 datasets suggest that i) continual pre-training improves the model in a latent way that unveils after fine-tuning; ii) with extra fine-tuning, the datasets that the model does not demonstrate capability gain much more than those that the model performs well during the pre-training stage; iii) although model benefits significantly through supervised fine-tuning, it may forget previously known domain knowledge and the tasks that are not seen during fine-tuning; iv) the model resembles high sensitivity to evaluation prompts after supervised fine-tuning, but this sensitivity can be alleviated by more pre-training.
摘要:大型语言模型的发展形成了一种先预训练后对齐的范式,即模型通常先在大型文本语料库上进行预训练,再经历一个调优阶段,使模型与人类偏好或下游任务对齐。在这项工作中,我们通过微调多个中间预训练模型检查点来研究预训练与微调之间的关系。我们在18个数据集上的结果表明:i)持续的预训练以一种在微调后才显现的潜在方式改进了模型;ii)通过额外的微调,模型在预训练阶段未表现出能力的数据集上获得的收益,远多于其在预训练阶段已表现良好的数据集;iii)尽管监督微调让模型显著受益,但它可能遗忘先前已知的领域知识以及微调中未见过的任务;iv)模型在监督微调后表现出对评估提示的高度敏感性,但这种敏感性可以通过更多的预训练来缓解。

[NLP-26] EditScribe: Non-Visual Image Editing with Natural Language Verification Loops
[NLP-26] Editor Scribe:使用自然语言验证循环进行非视觉图像编辑

链接: https://arxiv.org/abs/2408.06632
作者: Ruei-Che Chang,Yuxuan Liu,Lotus Zhang,Anhong Guo
关键词-EN: requires precise visual, precise visual evaluation, Image editing, iterative process, process that requires
关键词-ZH: 需要精确的视觉、精确的视觉评估、图像编辑、迭代过程、需要的过程
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: ASSETS 2024

点击查看摘要

Abstract:Image editing is an iterative process that requires precise visual evaluation and manipulation for the output to match the editing intent. However, current image editing tools do not provide accessible interaction nor sufficient feedback for blind and low vision individuals to achieve this level of control. To address this, we developed EditScribe, a prototype system that makes image editing accessible using natural language verification loops powered by large multimodal models. Using EditScribe, the user first comprehends the image content through initial general and object descriptions, then specifies edit actions using open-ended natural language prompts. EditScribe performs the image edit, and provides four types of verification feedback for the user to verify the performed edit, including a summary of visual changes, AI judgement, and updated general and object descriptions. The user can ask follow-up questions to clarify and probe into the edits or verification feedback, before performing another edit. In a study with ten blind or low-vision users, we found that EditScribe supported participants to perform and verify image edit actions non-visually. We observed different prompting strategies from participants, and their perceptions on the various types of verification feedback. Finally, we discuss the implications of leveraging natural language verification loops to make visual authoring non-visually accessible.
摘要:图像编辑是一个迭代的过程,需要对输出进行精确的视觉评估和处理,以符合编辑意图。然而,目前的图像编辑工具没有为盲人和低视力者提供可访问的交互或足够的反馈来实现这种级别的控制。为了解决这个问题,我们开发了EditScribe,这是一个原型系统,可以使用大型多模式模型支持的自然语言验证循环来访问图像编辑。使用EditScribe,用户首先通过初始的一般描述和对象描述来理解图像内容,然后使用开放式自然语言提示来指定编辑操作。EditScribe执行图像编辑,并为用户提供四种类型的验证反馈以验证所执行的编辑,包括视觉更改摘要、AI判断以及更新的常规和对象描述。在执行另一次编辑之前,用户可以询问后续问题以澄清和探讨编辑或验证反馈。在一项对10名盲人或低视力用户的研究中,我们发现EditScribe支持参与者执行和验证非视觉的图像编辑操作。我们观察了参与者不同的提示策略,以及他们对不同类型的验证反馈的看法。最后,我们讨论了利用自然语言验证循环使可视化创作以非可视方式访问的含义。

[NLP-27] IFShip: A Large Vision-Language Model for Interpretable Fine-grained Ship Classification via Domain Knowledge-Enhanced Instruction Tuning
[NLP-27] IFShip:一个大型视觉语言模型,用于通过领域知识增强型指令调优进行可解释细粒度船舶分类

链接: https://arxiv.org/abs/2408.06631
作者: Mingning Guo,Mengwei Wu,Yuxiang Shen,Haifeng Li,Chao Tao
关键词-EN: remote sensing fine-grained, sensing fine-grained ship, fine-grained ship classification, prevailing paradigm, paradigm for remote
关键词-ZH: 遥感细颗粒,传感细颗粒船舶,细颗粒船舶分类,流行范式,远程范式
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:End-to-end interpretation is currently the prevailing paradigm for remote sensing fine-grained ship classification (RS-FGSC) task. However, its inference process is uninterpretable, leading to criticism as a black box model. To address this issue, we propose a large vision-language model (LVLM) named IFShip for interpretable fine-grained ship classification. Unlike traditional methods, IFShip excels in interpretability by accurately conveying the reasoning process of FGSC in natural language. Specifically, we first design a domain knowledge-enhanced Chain-of-Thought (COT) prompt generation mechanism. This mechanism is used to semi-automatically construct a task-specific instruction-following dataset named TITANIC-FGS, which emulates human-like logical decision-making. We then train the IFShip model using task instructions tuned with the TITANIC-FGS dataset. Building on IFShip, we develop an FGSC visual chatbot that redefines the FGSC problem as a step-by-step reasoning task and conveys the reasoning process in natural language. Experimental results reveal that the proposed method surpasses state-of-the-art FGSC algorithms in both classification interpretability and accuracy. Moreover, compared to LVLMs like LLaVA and MiniGPT-4, our approach demonstrates superior expertise in the FGSC task. It provides an accurate chain of reasoning when fine-grained ship types are recognizable to the human eye and offers interpretable explanations when they are not.
摘要:端到端解译是当前遥感细粒度船舶分类(RS-FGSC)任务的主流范式。然而,其推理过程不可解释,因而被诟病为黑箱模型。针对这一问题,我们提出了一种用于可解释细粒度船舶分类的大型视觉语言模型(LVLM)IFShip。与传统方法不同,IFShip通过用自然语言准确地传达FGSC的推理过程,在可解释性方面表现出色。具体地说,我们首先设计了一种领域知识增强的思维链(CoT)提示生成机制。该机制被用来半自动地构造一个名为TITANIC-FGS的特定于任务的指令跟随数据集,以模拟类人的逻辑决策。然后,我们使用基于TITANIC-FGS数据集调优的任务指令来训练IFShip模型。在IFShip的基础上,我们开发了一个FGSC可视化聊天机器人,它将FGSC问题重新定义为逐步推理任务,并用自然语言表达推理过程。实验结果表明,该方法在分类可解释性和准确率方面均优于现有最先进的FGSC算法。此外,与LLaVA和MiniGPT-4等LVLM相比,我们的方法在FGSC任务中展示了更出色的专业能力:当细粒度船舶类型可被肉眼识别时,它提供准确的推理链;当无法识别时,它给出可解释的解释。

[NLP-28] WorldScribe: Towards Context-Aware Live Visual Descriptions
[NLP-28] WorldScribe:迈向上下文感知实时视觉描述

链接: https://arxiv.org/abs/2408.06627
作者: Ruei-Che Chang,Yuxuan Liu,Anhong Guo
关键词-EN: aid blind people, visual descriptions, live visual descriptions, Automated live visual, autonomy and independence
关键词-ZH: 帮助盲人、视觉描述、实时视觉描述、自动化实时视觉、自主性和独立性
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: UIST 2024

点击查看摘要

Abstract:Automated live visual descriptions can aid blind people in understanding their surroundings with autonomy and independence. However, providing descriptions that are rich, contextual, and just-in-time has been a long-standing challenge in accessibility. In this work, we develop WorldScribe, a system that generates automated live real-world visual descriptions that are customizable and adaptive to users’ contexts: (i) WorldScribe’s descriptions are tailored to users’ intents and prioritized based on semantic relevance. (ii) WorldScribe is adaptive to visual contexts, e.g., providing consecutively succinct descriptions for dynamic scenes, while presenting longer and detailed ones for stable settings. (iii) WorldScribe is adaptive to sound contexts, e.g., increasing volume in noisy environments, or pausing when conversations start. Powered by a suite of vision, language, and sound recognition models, WorldScribe introduces a description generation pipeline that balances the tradeoffs between their richness and latency to support real-time use. The design of WorldScribe is informed by prior work on providing visual descriptions and a formative study with blind participants. Our user study and subsequent pipeline evaluation show that WorldScribe can provide real-time and fairly accurate visual descriptions to facilitate environment understanding that is adaptive and customized to users’ contexts. Finally, we discuss the implications and further steps toward making live visual descriptions more context-aware and humanized.
摘要:自动化的实时视觉描述可以帮助盲人自主、独立地理解周围的环境。然而,提供丰富、上下文和及时的描述一直是可访问性方面的一个长期挑战。在这项工作中,我们开发了WorldScribe,这是一个自动生成实时真实世界视觉描述的系统,这些描述可以根据用户的上下文进行定制和自适应:(I)WorldScribe的描述是根据用户的意图定制的,并根据语义相关性进行优先排序。(Ii)WorldScribe适应视觉环境,例如,为动态场景提供连续简洁的描述,而为稳定的设置提供更长和详细的描述。(Iii)WorldScribe自适应声音环境,例如在嘈杂环境中增加音量,或在对话开始时暂停。在一套视觉、语言和声音识别模型的支持下,WorldScribe引入了一条描述生成管道,该管道平衡了描述的丰富性和延迟之间的权衡,以支持实时使用。WorldScribe的设计灵感来自于提供视觉描述的先前工作和对盲人参与者的形成性研究。我们的用户研究和后续的管道评估表明,WorldScribe可以提供实时和相当准确的视觉描述,以促进对环境的理解,并根据用户的上下文进行自适应和定制。最后,我们讨论了它的含义和进一步的步骤,以使现场视觉描述更具上下文感知和人性化。

[NLP-29] Towards Robust and Cost-Efficient Knowledge Unlearning for Large Language Models
[NLP-29] 面向大型语言模型的稳健且经济高效的知识去学习

链接: https://arxiv.org/abs/2408.06621
作者: Sungmin Cha,Sungjun Cho,Dasol Hwang,Moontae Lee
关键词-EN: demonstrated strong reasoning, Large Language Models, massive textual corpora, Large Language, demonstrated strong
关键词-ZH: 表现出强大的推理,大型语言模型,大量文本库,大型语言,表现出强大的
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Preprint

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated strong reasoning and memorization capabilities via pretraining on massive textual corpora. However, training LLMs on human-written text entails significant risk of privacy and copyright violations, which demands an efficient machine unlearning framework to remove knowledge of sensitive data without retraining the model from scratch. While Gradient Ascent (GA) is widely used for unlearning by reducing the likelihood of generating unwanted information, the unboundedness of increasing the cross-entropy loss causes not only unstable optimization, but also catastrophic forgetting of knowledge that needs to be retained. We also discover its joint application under low-rank adaptation results in significantly suboptimal computational cost vs. generative performance trade-offs. In light of this limitation, we propose two novel techniques for robust and cost-efficient unlearning on LLMs. We first design an Inverted Hinge loss that suppresses unwanted tokens by increasing the probability of the next most likely token, thereby retaining fluency and structure in language generation. We also propose to initialize low-rank adapter weights based on Fisher-weighted low-rank approximation, which induces faster unlearning and better knowledge retention by allowing model updates to be focused on parameters that are important in generating textual data we wish to remove.
摘要:通过对海量文本语料库的预训练,大型语言模型(LLM)表现出了很强的推理和记忆能力。然而,在人类书写的文本上训练LLM会带来很大的隐私和侵犯版权的风险,这需要一个高效的机器遗忘框架来移除敏感数据相关的知识,而不需要从头开始重新训练模型。虽然梯度上升(GA)通过降低生成不想要的信息的可能性而被广泛用于遗忘,但交叉熵损失增加的无界性不仅会导致不稳定的优化,还会导致对需要保留的知识的灾难性遗忘。我们还发现,将其与低秩适应联合应用,会导致计算成本与生成性能之间显著次优的权衡。针对这一局限性,我们提出了两种用于LLM的稳健且低成本的去学习新技术。我们首先设计了一种反向铰链损失(Inverted Hinge loss),通过增加下一个最有可能的令牌的概率来抑制不想要的令牌,从而保持语言生成的流畅性和结构。我们还提出了基于Fisher加权低秩近似的低秩适配器权重初始化,通过让模型更新专注于对生成我们希望移除的文本数据重要的参数,实现更快的遗忘和更好的知识保持。
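下面用一个极简的数值示意来说明摘要中"反向铰链损失"的思想(假设性写法,并非论文官方实现):相比梯度上升无界地压低目标令牌的对数似然,这种损失在"次优令牌的概率超过目标令牌一个间隔"之后即归零,因而是有界的。

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def inverted_hinge_loss(logits, targets):
    """示意版反向铰链损失:压低目标(待遗忘)令牌、抬高次优令牌,且有界。"""
    probs = softmax(logits)                     # (batch, vocab)
    idx = np.arange(len(targets))
    p_target = probs[idx, targets]              # 待遗忘令牌的概率
    masked = probs.copy()
    masked[idx, targets] = 0.0                  # 屏蔽目标令牌
    p_runner_up = masked.max(axis=-1)           # 最可能的非目标令牌的概率
    return np.maximum(0.0, 1.0 + p_target - p_runner_up).mean()
```

当次优令牌的概率远高于目标令牌时,损失趋近于0,不会像对交叉熵做梯度上升那样发散。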

[NLP-30] Generalized knowledge-enhanced framework for biomedical entity and relation extraction
[NLP-30] 生物医学实体和关系提取的广义知识增强框架

链接: https://arxiv.org/abs/2408.06618
作者: Minh Nguyen,Phuong Le
关键词-EN: recent years, increasing number, relation extraction, entity and relation, biomedical entity
关键词-ZH: 近年来,数量不断增加,关系提取,实体与关系,生物医学实体
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In recent years, there has been an increasing number of frameworks developed for biomedical entity and relation extraction. This research effort aims to address the accelerating growth in biomedical publications and the intricate nature of biomedical texts, which are written for mainly domain experts. To handle these challenges, we develop a novel framework that utilizes external knowledge to construct a task-independent and reusable background knowledge graph for biomedical entity and relation extraction. The design of our model is inspired by how humans learn domain-specific topics. In particular, humans often first acquire the most basic and common knowledge regarding a field to build the foundational knowledge and then use that as a basis for extending to various specialized topics. Our framework employs such common-knowledge-sharing mechanism to build a general neural-network knowledge graph that is learning transferable to different domain-specific biomedical texts effectively. Experimental evaluations demonstrate that our model, equipped with this generalized and cross-transferable knowledge base, achieves competitive performance benchmarks, including BioRelEx for binding interaction detection and ADE for Adverse Drug Effect identification.
摘要:近年来,用于生物医学实体和关系抽取的框架越来越多。这项研究旨在解决生物医学出版物的加速增长和生物医学文本的错综复杂的性质,这些文本主要是为领域专家撰写的。为了应对这些挑战,我们开发了一个新的框架,利用外部知识来构建一个与任务无关的、可重用的背景知识图,用于生物医学实体和关系的提取。我们模型的设计灵感来自于人类学习特定领域主题的方式。特别是,人类往往首先获得关于一个领域的最基本和最常见的知识,以建立基础知识,然后将其作为扩展到各种专业主题的基础。我们的框架利用这种共同的知识共享机制来构建一个通用的神经网络知识图,该知识图可以有效地学习转移到不同领域特定的生物医学文本。实验评估表明,我们的模型配备了这个通用的、可交叉转移的知识库,达到了具有竞争力的性能基准,包括用于结合相互作用检测的BioRelEx和用于不良药物效应识别的ADE。

[NLP-31] CROME: Cross-Modal Adapters for Efficient Multimodal LLM
[NLP-31] CROME:用于高效多模式LLM的跨模式适配器

链接: https://arxiv.org/abs/2408.06610
作者: Sayna Ebrahimi,Sercan O. Arik,Tejas Nama,Tomas Pfister
关键词-EN: Large Language Models, remarkable image-language capabilities, Multimodal Large Language, Large Language, demonstrate remarkable image-language
关键词-ZH: 大型语言模型,非凡的图像语言能力,多模式大型语言,大型语言,展示了非凡的图像语言
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) demonstrate remarkable image-language capabilities, but their widespread use faces challenges in cost-effective training and adaptation. Existing approaches often necessitate expensive language model retraining and limited adaptability. Additionally, the current focus on zero-shot performance improvements offers insufficient guidance for task-specific tuning. We propose CROME, an efficient vision-language instruction tuning framework. It features a novel gated cross-modal adapter that effectively combines visual and textual representations prior to input into a frozen LLM. This lightweight adapter, trained with minimal parameters, enables efficient cross-modal understanding. Notably, CROME demonstrates superior zero-shot performance on standard visual question answering and instruction-following benchmarks. Moreover, it yields fine-tuning with exceptional parameter efficiency, competing with task-specific specialist state-of-the-art methods. CROME demonstrates the potential of pre-LM alignment for building scalable, adaptable, and parameter-efficient multimodal models.
摘要:多模态大语言模型(MLLMs)展现出卓越的图像-语言能力,但其广泛应用在低成本训练与适应方面面临挑战。现有方法往往需要昂贵的语言模型再训练,且适应性有限。此外,当前对零样本性能提升的关注,不足以指导面向特定任务的调优。我们提出了一个高效的视觉-语言指令调优框架CROME。其特点是一个新颖的门控跨模态适配器,在输入到冻结的LLM之前有效地融合视觉和文本表示。该轻量级适配器只需训练极少的参数,即可实现高效的跨模态理解。值得注意的是,CROME在标准的视觉问答和指令遵循基准上展示了卓越的零样本性能。此外,它能以极高的参数效率进行微调,可与针对特定任务的最先进专用方法相竞争。CROME展示了LLM前对齐(pre-LM alignment)在构建可扩展、可适应且参数高效的多模态模型方面的潜力。
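摘要所述的"门控跨模态适配器"可以用如下极简示意来理解(假设性实现,维度与参数均为演示用):门控向量逐维决定传给冻结LLM的信号中视觉与文本各占多少。

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                                        # 演示用的共享隐藏维度

W_v = rng.standard_normal((d, d)) * 0.1       # 视觉特征投影(假设性参数)
W_t = rng.standard_normal((d, d)) * 0.1       # 文本特征投影
W_g = rng.standard_normal((2 * d, d)) * 0.1   # 由拼接输入计算门控

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(v, t):
    """门控逐维融合视觉与文本表示,输出交给冻结的LLM(示意)。"""
    g = sigmoid(np.concatenate([v, t], axis=-1) @ W_g)  # 每维取值 (0, 1)
    return g * (v @ W_v) + (1.0 - g) * (t @ W_t)

fused = gated_fusion(rng.standard_normal(d), rng.standard_normal(d))
```

训练时只需更新适配器的少量参数,LLM本体保持冻结,这正是这类方法参数高效的来源。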

[NLP-32] A Perspective on Large Language Models Intelligent Machines and Knowledge Acquisition
[NLP-32] 大型语言模型的透视智能机器和知识获取

链接: https://arxiv.org/abs/2408.06598
作者: Vladimir Cherkassky,Eng Hock Lee
关键词-EN: Large Language Models, Large Language, Language Models, text documents, remarkable ability
关键词-ZH: 大型语言模型,大型语言,语言模型,文本文档,出色的能力
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are known for their remarkable ability to generate synthesized ‘knowledge’, such as text documents, music, images, etc. However, there is a huge gap between LLM’s and human capabilities for understanding abstract concepts and reasoning. We discuss these issues in a larger philosophical context of human knowledge acquisition and the Turing test. In addition, we illustrate the limitations of LLMs by analyzing GPT-4 responses to questions ranging from science and math to common sense reasoning. These examples show that GPT-4 can often imitate human reasoning, even though it lacks understanding. However, LLM responses are synthesized from a large LLM model trained on all available data. In contrast, human understanding is based on a small number of abstract concepts. Based on this distinction, we discuss the impact of LLMs on acquisition of human knowledge and education.
摘要:大型语言模型(LLM)以其生成合成“知识”(例如文本文档、音乐、图像等)的出色能力而闻名。然而,LLM与人类理解抽象概念和推理的能力之间存在巨大差距。我们在人类知识获取和图灵测试的更大哲学背景下讨论这些问题。此外,我们还通过分析GPT-4对从科学、数学到常识推理等问题的回答来说明LLM的局限性。这些例子表明GPT-4经常可以模仿人类推理,尽管它缺乏理解。然而,LLM响应是根据所有可用数据训练的大型LLM模型合成的。相比之下,人类的理解是基于少数抽象概念的。基于这一区别,我们讨论了LLM对人类知识获取与教育的影响。

[NLP-33] Biomedical Event Extraction via Structure-aware Generation
[NLP-33] 通过结构感知生成的生物医学事件提取

链接: https://arxiv.org/abs/2408.06583
作者: Haohan Yuan,Siu Cheung Hui,Haopeng Zhang
关键词-EN: involves modeling complex, Biomedical Event Extraction, biomedical text data, modeling complex relationships, Event Extraction
关键词-ZH: 涉及复杂建模、生物医学事件提取、生物医学文本数据、复杂关系建模、事件提取
类目: Computation and Language (cs.CL)
备注: 8 pages, 4 figures, 6 tables

点击查看摘要

Abstract:Biomedical Event Extraction (BEE) is a critical task that involves modeling complex relationships between fine-grained entities in biomedical text data. However, most existing BEE models rely on classification methods that neglect the label semantics and argument dependency structure within the data. To address these limitations, we propose GenBEE, a generative model enhanced with a structure-aware prefix for biomedical event extraction. GenBEE constructs event prompts that leverage knowledge distilled from large language models (LLMs), thereby incorporating both label semantics and argument dependency relationships. Additionally, GenBEE introduces a structural prefix learning module that generates structure-aware prefixes with structural prompts, enriching the generation process with structural features. Extensive experiments on three benchmark datasets demonstrate the effectiveness of GenBEE and it achieves state-of-the-art performance on the MLEE and GE11 datasets. Furthermore, our analysis shows that the structural prefixes effectively bridge the gap between structural prompts and the representation space of generative models, enabling better integration of event structural information.
摘要:生物医学事件抽取(BEE)是一项关键任务,涉及对生物医学文本数据中细粒度实体之间的复杂关系建模。然而,现有的大多数BEE模型依赖于分类方法,忽略了数据中的标签语义和论元依赖结构。为了解决这些局限性,我们提出了GenBEE,一个增强了结构感知前缀的生成式模型,用于生物医学事件提取。GenBEE构建的事件提示利用了从大型语言模型(LLM)蒸馏的知识,从而同时结合标签语义和论元依赖关系。此外,GenBEE引入了结构前缀学习模块,该模块通过结构提示生成结构感知前缀,以结构特征丰富生成过程。在三个基准数据集上的大量实验证明了GenBEE的有效性,并在MLEE和GE11数据集上取得了最先进的性能。此外,我们的分析表明,结构前缀有效地弥合了结构提示和生成模型表示空间之间的差距,使事件结构信息能够得到更好的整合。

[NLP-34] OpenEP: Open-Ended Future Event Prediction
[NLP-34] OpenEP:开放式未来事件预测

链接: https://arxiv.org/abs/2408.06578
作者: Yong Guan,Hao Peng,Xiaozhi Wang,Lei Hou,Juanzi Li
关键词-EN: Future event prediction, early risk identification, enables early risk, event prediction, Future event
关键词-ZH: 未来事件预测、早期风险识别、实现早期风险、事件预测、未来事件
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Future event prediction (FEP) is a long-standing and crucial task in the world, as understanding the evolution of events enables early risk identification, informed decision-making, and strategic planning. Existing work typically treats event prediction as classification tasks and confines the outcomes of future events to a fixed scope, such as yes/no questions, candidate set, and taxonomy, which is difficult to include all possible outcomes of future events. In this paper, we introduce OpenEP (an Open-Ended Future Event Prediction task), which generates flexible and diverse predictions aligned with real-world scenarios. This is mainly reflected in two aspects: firstly, the predictive questions are diverse, covering different stages of event development and perspectives; secondly, the outcomes are flexible, without constraints on scope or format. To facilitate the study of this task, we construct OpenEPBench, an open-ended future event prediction dataset. For question construction, we pose questions from seven perspectives, including location, time, event development, event outcome, event impact, event response, and other, to facilitate an in-depth analysis and understanding of the comprehensive evolution of events. For outcome construction, we collect free-form text containing the outcomes as ground truth to provide semantically complete and detail-enriched outcomes. Furthermore, we propose StkFEP, a stakeholder-enhanced future event prediction framework, that incorporates event characteristics for open-ended settings. Our method extracts stakeholders involved in events to extend questions to gather diverse information. We also collect historically events that are relevant and similar to the question to reveal potential evolutionary patterns. Experiment results indicate that accurately predicting future events in open-ended settings is challenging for existing LLMs.
摘要:未来事件预测(FEP)是世界上一项长期而关键的任务,因为了解事件的演变有助于及早识别风险、做出明智的决策和战略规划。现有的工作通常将事件预测视为分类任务,并将未来事件的结果限制在固定的范围内,例如是/否问题、候选集和分类,这很难包括未来事件的所有可能结果。在本文中,我们介绍了OpenEP(一个开放式的未来事件预测任务),它可以生成与真实世界场景相一致的灵活和多样化的预测。这主要体现在两个方面:第一,预测性问题是多样的,涵盖了不同的事件发展阶段和视角;第二,结果是灵活的,不受范围和形式的限制。为了方便这项任务的研究,我们构建了一个开放式的未来事件预测数据集OpenEPBench。在问题构建方面,我们从地点、时间、事件发展、事件结果、事件影响、事件响应等七个角度提出问题,以便于深入分析和了解事件的全面演变。对于结果构建,我们收集包含作为基本事实的结果的自由格式文本,以提供语义完整和细节丰富的结果。此外,我们提出了StkFEP,一个利益相关者增强的未来事件预测框架,它结合了开放式环境下的事件特征。我们的方法提取参与事件的利益相关者,以扩展问题以收集不同的信息。我们还收集与问题相关和相似的历史事件,以揭示潜在的进化模式。实验结果表明,在开放式环境下准确预测未来事件对于现有的LLM是具有挑战性的。

[NLP-35] CTISum: A New Benchmark Dataset For Cyber Threat Intelligence Summarization
[NLP-35] CTISum:网络威胁情报总结的新基准数据集

链接: https://arxiv.org/abs/2408.06576
作者: Wei Peng,Junmei Ding,Wei Wang,Lei Cui,Wei Cai,Zhiyu Hao,Xiaochun Yun
关键词-EN: Cyber Threat Intelligence, raw intelligence data, Cyber Threat, Threat Intelligence, raw intelligence
关键词-ZH: 网络威胁情报,原始情报数据,网络威胁,威胁情报,原始情报
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Cyber Threat Intelligence (CTI) summarization task requires the system to generate concise and accurate highlights from raw intelligence data, which plays an important role in providing decision-makers with crucial information to quickly detect and respond to cyber threats in the cybersecurity domain. However, efficient techniques for summarizing CTI reports, including facts, analytical insights, attack processes, etc., have largely been unexplored, primarily due to the lack of available dataset. To this end, we present CTISum, a new benchmark for CTI summarization task. Considering the importance of attack process, a novel fine-grained subtask of attack process summarization is proposed to enable defenders to assess risk, identify security gaps, vulnerabilities, and so on. Specifically, we first design a multi-stage annotation pipeline to gather and annotate the CTI data, and then benchmark the CTISum with a collection of extractive and abstractive summarization methods. Experimental results show that current state-of-the-art models exhibit limitations when applied to CTISum, underscoring the fact that automatically producing concise summaries of CTI reports remains an open research challenge.
摘要:网络威胁情报(CTI)摘要任务要求系统从原始情报数据中生成简洁准确的亮点,这在为决策者提供关键信息以快速检测和应对网络安全领域的网络威胁方面发挥着重要作用。然而,总结CTI报告的有效技术,包括事实、分析见解、攻击过程等,在很大程度上还没有被探索,主要是因为缺乏可用的数据集。为此,我们提出了一种新的CTI摘要基准–CTISum。考虑到攻击过程的重要性,提出了一种新的细粒度攻击过程摘要子任务,使防御者能够评估风险,识别安全漏洞和漏洞等。具体地说,我们首先设计了一个多阶段的标注管道来收集和标注CTI数据,然后用一系列提取和抽象的摘要方法对CTISum进行基准测试。实验结果表明,当前最先进的模型在应用于CTISum时显示出局限性,这突显了这样一个事实,即自动生成CTI报告的简明摘要仍然是一个开放的研究挑战。

[NLP-36] SparkRA: A Retrieval-Augmented Knowledge Service System Based on Spark Large Language Model
[NLP-36] SparkRA:基于Spark大型语言模型的检索增强知识服务系统

链接: https://arxiv.org/abs/2408.06574
作者: Dayong Wu,Jiaqi Li,Baoxin Wang,Honghong Zhao,Siyuan Xue,Yanjie Yang,Zhijun Chang,Rui Zhang,Li Qian,Bo Wang,Shijin Wang,Zhixiong Zhang,Guoping Hu
关键词-EN: Large language models, iFLYTEK Spark LLM, Spark Research Assistant, shown remarkable achievements
关键词-ZH: 大型语言模型、iFLYTEK Spark LLM、Spark Research Assistant、取得了显著的成就
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have shown remarkable achievements across various language tasks. To enhance the performance of LLMs in scientific literature services, we developed the scientific literature LLM (SciLit-LLM) through pre-training and supervised fine-tuning on scientific literature, building upon the iFLYTEK Spark LLM. Furthermore, we present a knowledge service system Spark Research Assistant (SparkRA) based on our SciLit-LLM. SparkRA is accessible online and provides three primary functions: literature investigation, paper reading, and academic writing. As of July 30, 2024, SparkRA has garnered over 50,000 registered users, with a total usage count exceeding 1.3 million.
摘要:大型语言模型(LLM)在各类语言任务中取得了显著的成就。为了提升LLM在科学文献服务中的性能,我们以iFLYTEK Spark LLM为基础,通过在科学文献上进行预训练和监督微调,开发了科学文献大模型(SciLit-LLM)。此外,我们还推出了一个基于SciLit-LLM的知识服务系统Spark Research Assistant(SparkRA)。SparkRA可在线访问,提供三个主要功能:文献调查、论文阅读和学术写作。截至2024年7月30日,SparkRA已拥有超过50,000名注册用户,总使用次数超过130万。

[NLP-37] Social Debiasing for Fair Multi-modal LLMs
[NLP-37] 公平多模式法学硕士的社会去偏见

链接: https://arxiv.org/abs/2408.06569
作者: Harry Cheng,Yangyang Guo,Qingpei Guo,Ming Yang,Tian Gan,Liqiang Nie
关键词-EN: Multi-modal Large Language, Large Language Models, Multi-modal Large, Large Language, offering powerful vision-language
关键词-ZH: 多模式大型语言,大型语言模型,多模式大型,大型语言,提供强大的视觉语言
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-modal Large Language Models (MLLMs) have advanced significantly, offering powerful vision-language understanding capabilities. However, these models often inherit severe social biases from their training datasets, leading to unfair predictions based on attributes like race and gender. This paper addresses the issue of social biases in MLLMs by i) Introducing a comprehensive Counterfactual dataset with Multiple Social Concepts (CMSC), which provides a more diverse and extensive training set compared to existing datasets. ii) Proposing an Anti-Stereotype Debiasing strategy (ASD). Our method works by revisiting the MLLM training process, rescaling the autoregressive loss function, and improving data sampling methods to counteract biases. Through extensive experiments on various MLLMs, our CMSC dataset and ASD method demonstrate a significant reduction in social biases while maintaining the models’ original performance.
摘要:多模式大型语言模型(MLLM)取得了显着进步,提供了强大的视觉语言理解能力。然而,这些模型经常从其训练数据集中继承严重的社会偏见,导致基于种族和性别等属性的不公平预测。本文通过i)引入具有多个社会概念(CMSC)的全面反事实数据集来解决MLLM中的社会偏见问题,该数据集提供了与现有数据集相比更多样化、更广泛的训练集。ii)提出反刻板印象去偏见策略(ASD)。我们的方法通过重新审视MLLM训练过程、重新调整自回归损失函数以及改进数据采样方法以抵消偏差来工作。通过对各种MLLM的广泛实验,我们的CMSC数据集和ASD方法证明了社会偏见的显着减少,同时保持了模型的原始性能。

[NLP-38] AquilaMoE: Efficient Training for MoE Models with Scale-Up and Scale-Out Strategies
[NLP-38] AquilaMoE:通过横向扩展和横向扩展策略对MoE模型进行有效培训

链接: https://arxiv.org/abs/2408.06567
作者: Bo-Wen Zhang,Liangdong Wang,Ye Yuan,Jijie Li,Shuhao Gu,Mengdi Zhao,Xinya Wu,Guang Liu,Chengwei Wu,Hanyu Zhao,Li Du,Yiming Ju,Quanyue Ma,Yulong Ao,Yingli Zhao,Songhe Zhu,Zhou Cao,Dong Liang,Yonghua Lin,Ming Zhang,Shunfei Wang,Yanxin Zhou,Min Ye,Xuekai Chen,Xinyang Yu,Xiangjun Huang,Jian Yang
关键词-EN: recent years, gradually increased, grown exponentially, rapid application, application of large
关键词-ZH: 近年来,逐渐增加,呈指数级增长,应用迅速,应用量大
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In recent years, with the rapid application of large language models across various fields, the scale of these models has gradually increased, and the resources required for their pre-training have grown exponentially. Training an LLM from scratch will cost a lot of computation resources while scaling up from a smaller model is a more efficient approach and has thus attracted significant attention. In this paper, we present AquilaMoE, a cutting-edge bilingual 816B Mixture of Experts (MoE) language model that has 8 experts with 16 billion parameters each and is developed using an innovative training methodology called EfficientScale. This approach optimizes performance while minimizing data requirements through a two-stage process. The first stage, termed Scale-Up, initializes the larger model with weights from a pre-trained smaller model, enabling substantial knowledge transfer and continuous pretraining with significantly less data. The second stage, Scale-Out, uses a pre-trained dense model to initialize the MoE experts, further enhancing knowledge transfer and performance. Extensive validation experiments on 1.8B and 7B models compared various initialization schemes, achieving models that maintain and reduce loss during continuous pretraining. Utilizing the optimal scheme, we successfully trained a 16B model and subsequently the 816B AquilaMoE model, demonstrating significant improvements in performance and training efficiency.
摘要:近年来,随着大型语言模型在各个领域的快速应用,这些模型的规模逐渐扩大,其预训练所需的资源也呈指数级增长。从头开始训练LLM将花费大量的计算资源,而从较小的模型向上扩展是一种更高效的方法,因此引起了极大的关注。在本文中,我们提出了AquilaMoE,这是一种尖端的双语816B混合专家(MoE)语言模型,它有8个专家,每个专家有160亿个参数,并使用一种名为EfficientScale的创新训练方法开发。该方法通过两阶段流程在优化性能的同时最大限度地减少数据需求。第一阶段称为扩容(Scale-Up),用预训练的较小模型的权重初始化较大的模型,从而能够用显著更少的数据实现大量的知识迁移和持续预训练。第二阶段为扩展(Scale-Out),使用预训练的稠密模型初始化MoE专家,进一步增强知识迁移和性能。我们在1.8B和7B模型上进行了广泛的验证实验,比较了各种初始化方案,得到了在持续预训练期间保持并降低损失的模型。利用最优方案,我们成功地训练了16B模型,随后训练了816B的AquilaMoE模型,展示了性能和训练效率的显著提升。
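摘要中"Scale-Up"阶段(用预训练小模型的权重初始化大模型)的一种最简单的示意做法如下(假设性代码;论文实际比较了多种初始化方案,平铺加缩放只是其中一种直观选择):

```python
import numpy as np

def scale_up(w_small, d_large):
    """用预训练小模型的权重矩阵平铺初始化更大的矩阵(示意)。"""
    d_small = w_small.shape[0]
    reps = -(-d_large // d_small)                 # 向上取整所需的平铺次数
    w_large = np.tile(w_small, (reps, reps))[:d_large, :d_large]
    # 按维度比例缩放,使激活量级大致保持可比
    return w_large * (d_small / d_large)

w_small = np.random.default_rng(1).standard_normal((4, 4))
w_large = scale_up(w_small, 6)
```

这样得到的大模型保留了小模型已学到的结构,可用远少于从头训练的数据继续预训练。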

[NLP-39] Introducing the NewsPaLM MBR and QE Dataset: LLM-Generated High-Quality Parallel Data Outperforms Traditional Web-Crawled Data
[NLP-39] 介绍NewsPaLM BR和QE数据集:LLM生成的高质量并行数据优于传统的Web Crawed数据

链接: https://arxiv.org/abs/2408.06537
作者: Mara Finkelstein,David Vilar,Markus Freitag
关键词-EN: neural machine translation, Recent research, machine translation, training, research in neural
关键词-ZH: 神经机器翻译,最近的研究,机器翻译,训练,神经研究
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent research in neural machine translation (NMT) has shown that training on high-quality machine-generated data can outperform training on human-generated data. This work accompanies the first-ever release of a LLM-generated, MBR-decoded and QE-reranked dataset with both sentence-level and multi-sentence examples. We perform extensive experiments to demonstrate the quality of our dataset in terms of its downstream impact on NMT model performance. We find that training from scratch on our (machine-generated) dataset outperforms training on the (web-crawled) WMT’23 training dataset (which is 300 times larger), and also outperforms training on the top-quality subset of the WMT’23 training dataset. We also find that performing self-distillation by finetuning the LLM which generated this dataset outperforms the LLM’s strong few-shot baseline. These findings corroborate the quality of our dataset, and demonstrate the value of high-quality machine-generated data in improving performance of NMT models.
摘要:最近神经机器翻译(NMT)的研究表明,在高质量机器生成数据上的训练可以优于在人类生成数据上的训练。这项工作伴随着首个LLM生成、MBR解码并经QE重排序的数据集的发布,其中既有句子级别的样例,也有多句样例。我们进行了大量实验,从其对NMT模型性能的下游影响来证明我们数据集的质量。我们发现,在我们(机器生成的)数据集上从头开始训练,优于在(网络爬取的)WMT'23训练数据集(规模大300倍)上的训练,也优于在WMT'23训练数据集的最高质量子集上的训练。我们还发现,通过微调生成该数据集的LLM来执行自蒸馏,性能优于该LLM强大的少样本(few-shot)基线。这些发现证实了我们数据集的质量,并证明了高质量机器生成数据在提高NMT模型性能方面的价值。
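摘要中的MBR(最小贝叶斯风险)解码可以用如下示意代码理解(假设性实现;论文中的效用函数应为机器翻译质量指标,这里用任意 utility 回调代替):

```python
def mbr_decode(candidates, utility):
    """在候选译文中选出对其余候选平均效用最高的一个(MBR解码示意)。"""
    def expected_utility(c):
        others = [o for o in candidates if o is not c]
        return sum(utility(c, o) for o in others) / max(len(others), 1)
    return max(candidates, key=expected_utility)

# 演示用效用函数:两句共享词数(真实场景应换成chrF、BLEURT等指标)
overlap = lambda a, b: len(set(a.split()) & set(b.split()))
best = mbr_decode(["a b c", "a b d", "a e f", "x y z"], overlap)
```

直觉是:与多数候选都相似的译文更可能是高质量译文,因此被选为最终输出。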

[NLP-40] Chain-of-Strategy Planning with LLMs: Aligning the Generation of Psychotherapy Dialogue with Strategy in Motivational Interviewing
[NLP-40] LLM的战略链规划:将心理治疗对话的一代与动机面试中的策略保持一致

链接: https://arxiv.org/abs/2408.06527
作者: Xin Sun,Xiao Tang,Abdallah El Ali,Zhuying Li,Xiaoyu Shen,Pengjie Ren,Jan de Wit,Jiahuan Pei,Jos A.Bosch
关键词-EN: large language models, Motivational Interviewing, Recent advancements, language models, advancements in large
关键词-ZH: 大型语言模型、动机面试、最新进展、语言模型、大型进步
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advancements in large language models (LLMs) have shown promise in generating psychotherapeutic dialogues, especially in Motivational Interviewing (MI). However, how to employ strategies, a set of motivational interviewing (MI) skills, to generate therapeutic-adherent conversations with explainability is underexplored. We propose an approach called strategy-aware dialogue generation with Chain-of-Strategy (CoS) planning, which first predicts MI strategies as reasoning and utilizes these strategies to guide the subsequent dialogue generation. It brings the potential for controllable and explainable generation in psychotherapy by aligning the generated MI dialogues with therapeutic strategies. Extensive experiments including automatic and human evaluations are conducted to validate the effectiveness of the MI strategy. Our findings demonstrate the potential of LLMs in producing strategically aligned dialogues and suggest directions for practical applications in psychotherapeutic settings.
摘要:大型语言模型(LLM)的最新进展在产生心理治疗对话方面显示出了希望,特别是在动机访谈(MI)方面。然而,如何运用策略,即一套动机访谈(MI)技能,来产生具有可解释性的治疗依从性对话,还没有得到充分的探索。我们提出了一种基于策略链(CoS)规划的策略感知对话生成方法,该方法首先将MI策略预测为推理,然后利用这些策略来指导后续的对话生成。它通过将产生的MI对话与治疗策略相结合,为心理治疗中可控和可解释的产生带来了潜力。进行了大量的实验,包括自动和人工评估,以验证MI策略的有效性。我们的发现证明了LLMS在产生战略一致的对话方面的潜力,并为心理治疗环境中的实际应用提供了方向。

[NLP-41] Hierarchical in-Context Reinforcement Learning with Hindsight Modular Reflections for Planning
[NLP-41] 具有事后诸葛亮模块反思的分层上下文强化学习用于规划

链接: https://arxiv.org/abs/2408.06520
作者: Chuanneng Sun,Songjun Huang,Dario Pompili
关键词-EN: Large Language Models, Large Language, Language Models, demonstrated remarkable abilities, Hierarchical Reinforcement Learning
关键词-ZH: 大型语言模型,大型语言,语言模型,表现出非凡的能力,分层强化学习
类目: Robotics (cs.RO); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated remarkable abilities in various language tasks, making them promising candidates for decision-making in robotics. Inspired by Hierarchical Reinforcement Learning (HRL), we propose Hierarchical in-Context Reinforcement Learning (HCRL), a novel framework that decomposes complex tasks into sub-tasks using an LLM-based high-level policy, in which a complex task is decomposed into sub-tasks by a high-level policy on-the-fly. The sub-tasks, defined by goals, are assigned to the low-level policy to complete. Once the LLM agent determines that the goal is finished, a new goal will be proposed. To improve the agent’s performance in multi-episode execution, we propose Hindsight Modular Reflection (HMR), where, instead of reflecting on the full trajectory, we replace the task objective with intermediate goals and let the agent reflect on shorter trajectories to improve reflection efficiency. We evaluate the decision-making ability of the proposed HCRL in three benchmark environments–ALFWorld, Webshop, and HotpotQA. Results show that HCRL can achieve 9%, 42%, and 10% performance improvement in 5 episodes of execution over strong in-context learning baselines.
摘要:大型语言模型(LLM)在各种语言任务中表现出了卓越的能力,使其成为机器人决策的有力候选。受分层强化学习(HRL)的启发,我们提出了一种新框架:分层上下文强化学习(HCRL),它使用基于LLM的高层策略将复杂任务动态分解为子任务。由目标定义的子任务被分配给低层策略来完成。一旦LLM智能体判断当前目标已完成,就会提出新的目标。为了提高智能体在多回合执行中的性能,我们提出了后见模块化反思(Hindsight Modular Reflection, HMR):不再对完整轨迹进行反思,而是用中间目标代替任务目标,让智能体在较短的轨迹上进行反思,以提高反思效率。我们在ALFWorld、Webshop和HotpotQA三个基准环境中评估了所提出的HCRL的决策能力。结果表明,相对于较强的上下文学习基线,HCRL在5个回合的执行中可分别获得9%、42%和10%的性能提升。
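摘要描述的"高层策略提目标、低层策略执行、HMR只对短轨迹反思"的控制流程,可以用如下极简骨架示意(函数签名均为假设;实际中 high_level、low_level、reflect 会由LLM提示实现):

```python
def hcrl_episode(high_level, low_level, reflect, env, max_goals=5):
    """HCRL控制循环骨架(示意):逐个子目标执行,并做目标级反思。"""
    lessons = []
    for _ in range(max_goals):
        goal = high_level(env.observe(), lessons)    # 高层策略动态提出子目标
        if goal is None:                             # 高层判断整个任务已完成
            break
        trajectory = []
        while not env.goal_done(goal):
            action = low_level(env.observe(), goal)  # 低层策略完成子目标
            trajectory.append((goal, action, env.step(action)))
        # HMR:只反思该子目标对应的短轨迹,而非完整轨迹
        lessons.append(reflect(goal, trajectory))
    return lessons
```

反思结果(lessons)会回馈给高层策略,用于在后续回合中改进子目标的选择。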

[NLP-42] Does Liking Yellow Imply Driving a School Bus? Semantic Leakage in Language Models
[NLP-42] 喜欢黄色意味着驾驶校车吗?语言模型中的语义泄漏

链接: https://arxiv.org/abs/2408.06518
作者: Hila Gonen,Terra Blevins,Alisa Liu,Luke Zettlemoyer,Noah A. Smith
关键词-EN: remain poorly understood, models remain poorly, wide adoption, poorly understood, semantic leakage
关键词-ZH: 理解仍然不足,模型仍然不足,广泛采用,理解不足,语义泄露
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Despite their wide adoption, the biases and unintended behaviors of language models remain poorly understood. In this paper, we identify and characterize a phenomenon never discussed before, which we call semantic leakage, where models leak irrelevant information from the prompt into the generation in unexpected ways. We propose an evaluation setting to detect semantic leakage both by humans and automatically, curate a diverse test suite for diagnosing this behavior, and measure significant semantic leakage in 13 flagship models. We also show that models exhibit semantic leakage in languages besides English and across different settings and generation scenarios. This discovery highlights yet another type of bias in language models that affects their generation patterns and behavior.
摘要:尽管语言模型被广泛采用,但人们对其偏见和非预期行为仍知之甚少。在本文中,我们识别并描述了一种以前从未讨论过的现象,我们称之为语义泄漏,即模型以意想不到的方式将提示中不相关的信息泄漏到生成内容中。我们提出了一个同时支持人工和自动检测语义泄漏的评估设置,构建了用于诊断该行为的多样化测试套件,并在13个旗舰模型中测量到了显著的语义泄漏。我们还表明,模型在英语以外的语言中以及在不同的设置和生成场景中都表现出语义泄漏。这一发现凸显了语言模型中影响其生成模式和行为的又一类偏见。

[NLP-43] Cross-Lingual Conversational Speech Summarization with Large Language Models
[NLP-43] 使用大型语言模型的跨语言对话语音总结

链接: https://arxiv.org/abs/2408.06484
作者: Max Nelson,Shannon Wotherspoon,Francis Keith,William Hartmann,Matthew Snover
关键词-EN: Cross-lingual conversational speech, Cross-lingual conversational, important problem, dearth of resources, conversational speech
关键词-ZH: 跨语言对话演讲,跨语言对话,重要问题,资源匮乏,对话演讲
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Cross-lingual conversational speech summarization is an important problem, but suffers from a dearth of resources. While transcriptions exist for a number of languages, translated conversational speech is rare and datasets containing summaries are non-existent. We build upon the existing Fisher and Callhome Spanish-English Speech Translation corpus by supplementing the translations with summaries. The summaries are generated using GPT-4 from the reference translations and are treated as ground truth. The task is to generate similar summaries in the presence of transcription and translation errors. We build a baseline cascade-based system using open-source speech recognition and machine translation models. We test a range of LLMs for summarization and analyze the impact of transcription and translation errors. Adapting the Mistral-7B model for this task performs significantly better than off-the-shelf models and matches the performance of GPT-4.
摘要:跨语言对话语音摘要是一个重要问题,但资源匮乏。虽然多种语言都有转录文本,但经过翻译的对话语音很少见,而包含摘要的数据集则根本不存在。我们在现有的Fisher和Callhome西班牙语-英语语音翻译语料库的基础上,为其译文补充了摘要。摘要使用GPT-4从参考译文生成,并被视为基准真值(ground truth)。任务是在存在转录和翻译错误的情况下生成类似的摘要。我们使用开源语音识别和机器翻译模型构建了一个基于级联的基线系统。我们测试了一系列LLM的摘要能力,并分析了转录和翻译错误的影响。针对该任务微调的Mistral-7B模型明显优于现成模型,性能可与GPT-4相当。

[NLP-44] TOGGL: Transcribing Overlapping Speech with Staggered Labeling
[NLP-44] TOGGL:用交错标注转录重叠语音

链接: https://arxiv.org/abs/2408.06474
作者: Chak-Fai Li,William Hartmann,Matthew Snover
关键词-EN: typically requires separating, overlapping speakers typically, Transcribing the speech, streams and recognizing, speakers typically requires
关键词-ZH: 通常需要分离、重叠说话者,转录语音、流媒体和识别,说话者通常需要
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 5 pages

点击查看摘要

Abstract:Transcribing the speech of multiple overlapping speakers typically requires separating the audio into multiple streams and recognizing each one independently. More recent work jointly separates and transcribes, but requires a separate decoding component for each speaker. We propose the TOGGL model to simultaneously transcribe the speech of multiple speakers. The TOGGL model uses special output tokens to attribute the speech to each speaker with only a single decoder. Our approach generalizes beyond two speakers, even when trained only on two-speaker data. We demonstrate superior performance compared to competing approaches on a conversational speech dataset. Our approach also improves performance on single-speaker audio.
摘要:转录多个重叠说话者的语音通常需要将音频分成多个流并独立识别每个流。最近的工作将分离和转录联合进行,但每个说话者仍需要单独的解码组件。我们提出TOGGL模型来同时转录多个说话者的语音。TOGGL模型仅使用单个解码器,通过特殊的输出令牌将语音归属到各个说话者。我们的方法可以推广到两个说话者之外,即使仅在双说话者数据上训练也是如此。在一个对话语音数据集上,我们的方法优于其他竞争方法。我们的方法还提高了单说话者音频上的性能。
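摘要中"单个解码器加特殊输出令牌归属说话者"的思路,可用如下后处理示意说明(令牌名称为假设,并非论文的真实词表):模型输出一条混合令牌流,按说话者令牌切分即可得到各说话者的转录。

```python
def split_by_speaker(tokens):
    """按说话者令牌(<spk1>、<spk2>……)切分单解码器的输出流(示意)。"""
    transcripts, current = {}, None
    for tok in tokens:
        if tok.startswith("<spk") and tok.endswith(">"):
            current = tok                      # 切换当前说话者
            transcripts.setdefault(current, [])
        elif current is not None:
            transcripts[current].append(tok)
    return {spk: " ".join(words) for spk, words in transcripts.items()}

stream = ["<spk1>", "hello", "<spk2>", "hi", "there", "<spk1>", "yes"]
result = split_by_speaker(stream)
```

由于说话者身份内嵌在同一条输出序列中,解码器数量不随说话者数量增长,这也是该方法能推广到两个以上说话者的原因。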

[NLP-45] Towards Autonomous Agents: Adaptive-planning Reasoning and Acting in Language Models
[NLP-45] 迈向自主智能体:语言模型中的自适应规划、推理与行动

链接: https://arxiv.org/abs/2408.06458
作者: Yen-Che Hsiao,Abhishek Dutta
关键词-EN: in-context learning algorithm, building autonomous decision-making, decision-making language agents, autonomous decision-making language, in-context learning
关键词-ZH: 上下文学习算法,构建自主决策,决策语言代理,自主决策语言,上下文学习
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We propose a novel in-context learning algorithm for building autonomous decision-making language agents. The language agent continuously attempts to solve the same task by self-correcting each time the task fails. Our selected language agent demonstrates the ability to solve tasks in a text-based game environment. Our results show that the gemma-2-9b-it language model, using our proposed method, can successfully complete two of six tasks that failed in the first attempt. This highlights the effectiveness of our approach in enhancing the problem-solving capabilities of a single language model through self-correction, paving the way for more advanced autonomous agents. The code is publicly available at this https URL.
摘要:我们提出了一种新颖的上下文学习算法,用于构建自主决策语言代理。每次任务失败时,语言代理都会通过自我纠正来不断尝试解决同一任务。我们选择的语言代理展示了在基于文本的游戏环境中解决任务的能力。我们的结果表明,使用我们提出的方法,gemma-2-9b-it语言模型可以成功完成第一次尝试失败的六个任务中的两个。这凸显了我们的方法在通过自我纠正增强单一语言模型解决问题的能力方面的有效性,为更先进的自主代理铺平了道路。该代码可在此https URL上公开获取。
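The self-correction loop described here can be sketched generically. The `attempt` callable below stands in for the gemma-2-9b-it prompt/parse cycle, which is abstracted away; this illustrates the control flow only, not the paper's implementation:

```python
# Schematic of an in-context self-correction loop: on each failure the
# agent's feedback is accumulated and conditioned on in the next try.

def solve_with_self_correction(attempt, max_tries=3):
    """Retry a task, feeding each failure's feedback into the next try."""
    feedback = []
    for _ in range(max_tries):
        success, message = attempt(feedback)
        if success:
            return True, feedback
        feedback.append(message)  # the agent conditions on its own errors
    return False, feedback

# Stand-in agent: succeeds once it has seen two pieces of feedback.
agent = lambda fb: (len(fb) >= 2, f"failed attempt {len(fb) + 1}")
ok, history = solve_with_self_correction(agent)
print(ok, history)  # True ['failed attempt 1', 'failed attempt 2']
```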

[NLP-46] Evaluating Language Models for Efficient Code Generation
[NLP-46] 评估语言模型以实现高效代码生成

链接: https://arxiv.org/abs/2408.06450
作者: Jiawei Liu,Songrun Xie,Junhao Wang,Yuxiang Wei,Yifeng Ding,Lingming Zhang
关键词-EN: Large Language Models, evaluate Large Language, introduce Differential Performance, Large Language, Differential Performance Evaluation
关键词-ZH: 大型语言模型,评估大型语言,引入差异性能,大型语言,差异性能评估
类目: Software Engineering (cs.SE); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We introduce Differential Performance Evaluation (DPE), a framework designed to reliably evaluate Large Language Models (LLMs) for efficient code generation. Traditional coding benchmarks often fail to provide reliable insights into code efficiency, due to their reliance on simplistic test inputs and the absence of effective compound metrics. DPE addresses these issues by focusing on efficiency-demanding programming tasks and establishing an insightful compound metric for performance evaluation. DPE operates in two phases: To curate efficiency datasets, it selects efficiency-demanding tasks from existing coding benchmarks and generates computationally expensive inputs to stress the efficiency of LLM solutions. To assess the code efficiency, DPE profiles the new solution and compares it globally against a set of reference solutions that exhibit distinct efficiency levels, where the matched level defines its efficiency score. As a proof of concept, we use DPE to create EvalPerf, a benchmark with 121 performance-challenging coding tasks. Our comprehensive evaluation draws interesting findings on the efficiency impact of model sizes, instruction tuning, and prompting. For example, while the scaling law fails to account for code efficiency, general instruction tuning benefits both code correctness and efficiency. We also evaluate the evaluation by examining the effectiveness of DPE, showing that EvalPerf is reliable and convenient to use even across platforms.
摘要:我们介绍了差异性能评估(DPE),这是一个旨在可靠地评估大型语言模型(LLM)高效代码生成能力的框架。传统的编码基准通常无法提供对代码效率的可靠见解,因为它们依赖于简单的测试输入,并且缺乏有效的复合度量。DPE通过专注于效率要求高的编程任务并为性能评估建立有洞察力的复合指标来解决这些问题。DPE分两个阶段运行:为了精选效率数据集,它从现有的编码基准中选择效率要求高的任务,并生成计算成本高昂的输入来对LLM解决方案的效率施加压力。为了评估代码效率,DPE对新的解决方案进行性能分析,并将其与一组表现出不同效率级别的参考解决方案进行全局比较,其中匹配的级别定义了其效率分数。作为概念验证,我们使用DPE创建了EvalPerf,这是一个包含121个具有性能挑战性的编码任务的基准测试。我们的综合评估得出了关于模型大小、指令调优和提示的效率影响的有趣发现。例如,虽然缩放定律无法解释代码效率,但通用指令调优对代码的正确性和效率都有好处。我们还考察了DPE本身的有效性,以检验这一评估方法,结果表明即使跨平台使用,EvalPerf也是可靠且方便的。
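The "matched level defines its efficiency score" step can be sketched as follows; the exact matching rule EvalPerf uses is not spelled out in the abstract, so treat this as an assumed, simplified version:

```python
# Sketch of DPE-style level matching (the exact matching rule is an
# assumption here): a candidate's efficiency score is the highest
# reference level whose runtime budget it meets.

def efficiency_score(candidate_runtime: float, reference_runtimes: list) -> int:
    """reference_runtimes[i] is the runtime of the level-i reference,
    sorted from slowest (level 0) to fastest. Returns the matched level."""
    score = -1  # worse than every reference
    for level, ref_runtime in enumerate(reference_runtimes):
        if candidate_runtime <= ref_runtime:
            score = level
    return score

refs = [10.0, 4.0, 1.5]             # three reference solutions, level 0..2
print(efficiency_score(3.0, refs))  # meets the level-1 budget but not level 2 -> 1
```

A score of -1 here marks a candidate slower than every reference; higher levels mean the candidate keeps up with progressively faster reference solutions.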

[NLP-47] Evaluating Language Models on Entity Disambiguation in Tables
[NLP-47] 表中实体歧义消除的语言模型评估

链接: https://arxiv.org/abs/2408.06423
作者: Federico Belotti,Fabio Dadda,Marco Cremaschi,Roberto Avogadro,Riccardo Pozzi,Matteo Palmonari
关键词-EN: Semantic Table Interpretation, containers of information, crucial containers, Table Interpretation, Large Language Models
关键词-ZH: 语义表解释、信息容器、关键容器、表解释、大型语言模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Tables are crucial containers of information, but understanding their meaning may be challenging. Indeed, recently, there has been a focus on Semantic Table Interpretation (STI), i.e., the task that involves the semantic annotation of tabular data to disambiguate their meaning. Over the years, there has been a surge in interest in data-driven approaches based on deep learning that have increasingly been combined with heuristic-based approaches. In the last period, the advent of Large Language Models (LLMs) has led to a new category of approaches for table annotation. The interest in this research field, characterised by multiple challenges, has led to a proliferation of approaches employing different techniques. However, these approaches have not been consistently evaluated on a common ground, making evaluation and comparison difficult. This work proposes an extensive evaluation of four state-of-the-art (SOTA) approaches - Alligator (formerly s-elBat), Dagobah, TURL, and TableLlama; the first two belong to the family of heuristic-based algorithms, while the others are respectively encoder-only and decoder-only LLMs. The primary objective is to measure the ability of these approaches to solve the entity disambiguation task, with the ultimate aim of charting new research paths in the field.
摘要:表格是信息的重要容器,但理解它们的含义可能具有挑战性。事实上,最近人们开始关注语义表解释(STI),即对表格数据进行语义注释以消除其含义歧义的任务。多年来,人们对基于深度学习的数据驱动方法的兴趣激增,这类方法越来越多地与基于启发式的方法相结合。近期,大型语言模型(LLM)的出现催生了一类新的表格注释方法。对这一充满多重挑战的研究领域的兴趣,导致了采用不同技术的方法的激增。然而,这些方法并没有在统一的基准上得到一致的评估,这使得评估和比较变得困难。这项工作对四种最先进(SOTA)的方法进行了广泛评估:Alligator(前身为s-elBat)、Dagobah、TURL和TableLlama;前两种属于基于启发式的算法家族,后两种分别是仅编码器和仅解码器的LLM。主要目标是衡量这些方法解决实体消歧任务的能力,最终目的是为该领域开辟新的研究路径。

[NLP-48] ViC: Virtual Compiler Is All You Need For Assembly Code Search
[NLP-48] ViC:虚拟编译器就是您搜索汇编代码所需的全部内容

链接: https://arxiv.org/abs/2408.06385
作者: Zeyu Gao,Hao Wang,Yuanda Wang,Chao Zhang
关键词-EN: vast binary programs, quickly identify specific, identify specific functions, Assembly code, Assembly code search
关键词-ZH: 庞大的二进制程序,快速识别特定,识别特定功能,汇编代码,汇编代码搜索
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Assembly code search is vital for reducing the burden on reverse engineers, allowing them to quickly identify specific functions using natural language within vast binary programs. Despite its significance, this critical task is impeded by the complexities involved in building high-quality datasets. This paper explores training a Large Language Model (LLM) to emulate a general compiler. By leveraging Ubuntu packages to compile a dataset of 20 billion tokens, we further continue pre-train CodeLlama as a Virtual Compiler (ViC), capable of compiling any source code of any language to assembly code. This approach allows for virtual compilation across a wide range of programming languages without the need for a real compiler, preserving semantic equivalency and expanding the possibilities for assembly code dataset construction. Furthermore, we use ViC to construct a sufficiently large dataset for assembly code search. Employing this extensive dataset, we achieve a substantial improvement in assembly code search performance, with our model surpassing the leading baseline by 26%.
摘要:汇编代码搜索对于减轻逆向工程师的负担至关重要,它使逆向工程师能够在庞大的二进制程序中使用自然语言快速识别特定的功能。尽管意义重大,但构建高质量数据集所涉及的复杂性阻碍了这项关键任务的完成。本文探讨了训练大型语言模型(LLM)来模拟通用编译器的问题。通过利用Ubuntu包编译200亿令牌的数据集,我们进一步将CodeLlama预训练为虚拟编译器(ViC),能够将任何语言的任何源代码编译为汇编代码。这种方法允许跨多种编程语言进行虚拟编译,而不需要真正的编译器,从而保持了语义等价性,并扩展了构建汇编代码数据集的可能性。此外,我们使用ViC来构建一个足够大的数据集来进行汇编代码搜索。使用这个广泛的数据集,我们在汇编代码搜索性能上实现了实质性的改进,我们的模型超过了领先的基线26%。

[NLP-49] Deep Learning based Key Information Extraction from Business Documents: Systematic Literature Review
[NLP-49] 基于深度学习的业务文档关键信息提取:系统文献综述

链接: https://arxiv.org/abs/2408.06345
作者: Alexander Rombach,Peter Fettke
关键词-EN: Extracting key information, Key Information Extraction, Extracting key, key information, process automation
关键词-ZH: 提取关键信息,关键信息提取,提取关键,关键信息,过程自动化
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 52 pages, 7 figures, 9 tables; Submitted to ACM Computing Surveys

点击查看摘要

Abstract:Extracting key information from documents represents a large portion of business workloads and therefore offers a high potential for efficiency improvements and process automation. With recent advances in deep learning, a plethora of deep learning-based approaches for Key Information Extraction have been proposed under the umbrella term Document Understanding that enable the processing of complex business documents. The goal of this systematic literature review is an in-depth analysis of existing approaches in this domain and the identification of opportunities for further research. To this end, 96 approaches published between 2017 and 2023 are analyzed in this study.
摘要:从文档中提取关键信息代表了业务工作负载的很大一部分,因此为效率改进和流程自动化提供了很大的潜力。随着深度学习的最新进展,人们在“文档理解”这个统称下提出了大量基于深度学习的关键信息提取方法,这些方法能够处理复杂的业务文档。这项系统性文献综述的目标是深入分析该领域的现有方法并确定进一步研究的机会。为此,本研究分析了2017年至2023年间发表的96种方法。

[NLP-50] Accuracy and Political Bias of News Source Credibility Ratings by Large Language Models
[NLP-50] 大型语言模型对新闻来源可信度评级的准确性和政治偏见

链接: https://arxiv.org/abs/2304.00228
作者: Kai-Cheng Yang,Filippo Menczer
关键词-EN: Search engines increasingly, generate direct answers, engines increasingly leverage, increasingly leverage large, leverage large language
关键词-ZH: 搜索引擎越来越多地产生直接的答案,引擎越来越多地利用,越来越多地利用大型语言
类目: Computation and Language (cs.CL); Computers and Society (cs.CY); Information Retrieval (cs.IR)
备注: 11 pages, 8 figures

点击查看摘要

Abstract:Search engines increasingly leverage large language models (LLMs) to generate direct answers, and AI chatbots now access the Internet for fresh data. As information curators for billions of users, LLMs must assess the accuracy and reliability of different sources. This paper audits eight widely used LLMs from three major providers – OpenAI, Google, and Meta – to evaluate their ability to discern credible and high-quality information sources from low-credibility ones. We find that while LLMs can rate most tested news outlets, larger models more frequently refuse to provide ratings due to insufficient information, whereas smaller models are more prone to hallucination in their ratings. For sources where ratings are provided, LLMs exhibit a high level of agreement among themselves (average Spearman’s \rho = 0.81 ), but their ratings align only moderately with human expert evaluations (average \rho = 0.59 ). Analyzing news sources with different political leanings in the US, we observe a liberal bias in credibility ratings yielded by all LLMs in default configurations. Additionally, assigning partisan identities to LLMs consistently results in strong politically congruent bias in the ratings. These findings have important implications for the use of LLMs in curating news and political information.
摘要:搜索引擎越来越多地利用大型语言模型(LLM)来生成直接答案,人工智能聊天机器人现在可以访问互联网获取最新的数据。作为数十亿用户的信息管理者,LLM必须评估不同来源的准确性和可靠性。本文审计了来自三大提供商(OpenAI、Google和Meta)的八个广泛使用的LLM,以评估它们区分可信且高质量信息源与低可信度信息源的能力。我们发现,尽管LLM可以对大多数经过测试的新闻机构进行评级,但较大的模型更经常由于信息不足而拒绝提供评级,而较小的模型更容易在评级中产生幻觉。对于提供评级的来源,LLM之间表现出很高的一致性(平均Spearman's ρ = 0.81),但它们的评级仅与人类专家的评估适度一致(平均 ρ = 0.59)。通过分析美国不同政治倾向的新闻来源,我们观察到,所有处于默认配置的LLM在可信度评级方面都存在自由主义倾向。此外,将党派身份分配给LLM始终会导致评级中强烈的政治一致性偏见。这些发现对于使用LLM来策划新闻和政治信息具有重要的意义。
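The agreement figures above are Spearman rank correlations. For readers unfamiliar with the statistic, a minimal no-ties implementation follows; the rating values are invented for illustration and are not the paper's data:

```python
# Spearman's rho for two equal-length rating sequences without ties,
# using the classic 1 - 6*sum(d^2)/(n(n^2-1)) formula on rank differences.

def spearman_rho(xs, ys):
    n = len(xs)
    rank = lambda v: sorted(range(n), key=lambda i: v[i])
    rx, ry = [0] * n, [0] * n
    for r, i in enumerate(rank(xs)):
        rx[i] = r
    for r, i in enumerate(rank(ys)):
        ry[i] = r
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

llm_scores = [0.9, 0.4, 0.7, 0.2]    # hypothetical credibility ratings
human_scores = [0.8, 0.6, 0.5, 0.1]  # hypothetical expert ratings
print(spearman_rho(llm_scores, human_scores))  # 0.8
```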

[NLP-51] Harnessing Earnings Reports for Stock Predictions: A QLoRA-Enhanced LLM Approach
[NLP-51] 利用收益报告进行股票预测:QLoRA增强的LLM方法

链接: https://arxiv.org/abs/2408.06634
作者: Haowei Ni,Shuchen Meng,Xupeng Chen,Ziqing Zhao,Andi Chen,Panfeng Li,Shiyao Zhang,Qifu Yin,Yuanqing Wang,Yuxi Chan
关键词-EN: Accurate stock market, Accurate stock, stock market predictions, crucial for investors, earnings reports
关键词-ZH: 准确的股市,准确的股票,股市预测,对投资者至关重要,收益报告
类目: Computational Finance (q-fin.CP); Computation and Language (cs.CL); Machine Learning (cs.LG); Statistical Finance (q-fin.ST)
备注: Accepted by 2024 6th International Conference on Data-driven Optimization of Complex Systems

点击查看摘要

Abstract:Accurate stock market predictions following earnings reports are crucial for investors. Traditional methods, particularly classical machine learning models, struggle with these predictions because they cannot effectively process and interpret extensive textual data contained in earnings reports and often overlook nuances that influence market movements. This paper introduces an advanced approach by employing Large Language Models (LLMs) instruction fine-tuned with a novel combination of instruction-based techniques and quantized low-rank adaptation (QLoRA) compression. Our methodology integrates ‘base factors’, such as financial metric growth and earnings transcripts, with ‘external factors’, including recent market indices performances and analyst grades, to create a rich, supervised dataset. This comprehensive dataset enables our models to achieve superior predictive performance in terms of accuracy, weighted F1, and Matthews correlation coefficient (MCC), especially evident in the comparison with benchmarks such as GPT-4. We specifically highlight the efficacy of the llama-3-8b-Instruct-4bit model, which showcases significant improvements over baseline models. The paper also discusses the potential of expanding the output capabilities to include a ‘Hold’ option and extending the prediction horizon, aiming to accommodate various investment styles and time frames. This study not only demonstrates the power of integrating cutting-edge AI with fine-tuned financial data but also paves the way for future research in enhancing AI-driven financial analysis tools.
摘要:收益报告发布后,准确的股市预测对投资者来说至关重要。传统的方法,尤其是经典的机器学习模型,难以处理这些预测,因为它们无法有效地处理和解释收益报告中包含的大量文本数据,而且往往忽视了影响市场走势的细微差别。本文介绍了一种先进的方法,它采用大语言模型(LLM)指令,并结合基于指令的技术和量化低秩自适应(QLoRA)压缩的新组合进行了微调。我们的方法将"基本因素"(如财务指标增长和盈利成绩单)与"外部因素"(包括最近的市场指数表现和分析师评级)整合在一起,以创建一个丰富的、受监督的数据集。这一全面的数据集使我们的模型在准确性、加权F1和马修斯相关系数(MCC)方面实现了卓越的预测性能,特别是在与GPT-4等基准测试的比较中表现得尤为明显。我们特别强调了llama-3-8b-Instruct-4bit模型的有效性,它展示了与基准模型相比的显著改进。本文还讨论了扩大输出能力以包括"持有"选项和延长预测期限的潜力,旨在适应不同的投资风格和时间框架。这项研究不仅展示了将尖端人工智能与微调金融数据相结合的力量,还为未来在增强人工智能驱动的金融分析工具方面的研究铺平了道路。

[NLP-52] Lyrics Transcription for Humans: A Readability-Aware Benchmark ALT
[NLP-52] 歌词人类转录:可读性基准

链接: https://arxiv.org/abs/2408.06370
作者: Ondřej Cífka,Hendrik Schreiber,Luke Miner,Fabian-Robert Stöter
关键词-EN: convey contextual information, human consumption involves, capturing word sequences, accurately capturing word, contextual information
关键词-ZH: 传达上下文信息,人类消费涉及,捕获单词序列,准确捕获单词,上下文信息
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
备注: ISMIR 2024 camera-ready. 6 pages + references + supplementary material. Website this https URL Data this https URL Code this https URL . arXiv admin note: text overlap with arXiv:2311.13987

点击查看摘要

Abstract:Writing down lyrics for human consumption involves not only accurately capturing word sequences, but also incorporating punctuation and formatting for clarity and to convey contextual information. This includes song structure, emotional emphasis, and contrast between lead and background vocals. While automatic lyrics transcription (ALT) systems have advanced beyond producing unstructured strings of words and are able to draw on wider context, ALT benchmarks have not kept pace and continue to focus exclusively on words. To address this gap, we introduce Jam-ALT, a comprehensive lyrics transcription benchmark. The benchmark features a complete revision of the JamendoLyrics dataset, in adherence to industry standards for lyrics transcription and formatting, along with evaluation metrics designed to capture and assess the lyric-specific nuances, laying the foundation for improving the readability of lyrics. We apply the benchmark to recent transcription systems and present additional error analysis, as well as an experimental comparison with a classical music dataset.
摘要:写下歌词以供人类阅读,不仅需要准确地捕捉单词序列,还需要结合标点符号和格式以确保清晰,并传达上下文信息。这包括歌曲结构、情感强调,以及主唱和背景人声之间的对比。虽然自动歌词转录(ALT)系统已经超越了只产生无结构单词字符串的阶段,并能够利用更广泛的上下文,但ALT基准没有跟上步伐,仍然只关注单词。为了弥补这一差距,我们引入了Jam-ALT,一个全面的歌词转录基准。该基准遵循歌词转录和格式的行业标准,对JamendoLyrics数据集进行了全面修订,并采用了旨在捕获和评估歌词特定细微差别的评估指标,为提高歌词的可读性奠定了基础。我们将该基准应用于最近的转录系统,给出了额外的误差分析,以及与古典音乐数据集的实验比较。

[NLP-53] Large Language Model Agent in Financial Trading: A Survey
[NLP-53] 金融交易中的大型语言模型代理:调查

链接: https://arxiv.org/abs/2408.06361
作者: Han Ding,Yinheng Li,Junhao Wang,Hang Chen
关键词-EN: highly competitive task, combination of strategy, psychological fortitude, task that requires, requires a combination
关键词-ZH: 竞争激烈的任务,策略组合,心理坚韧,需要的任务,需要的组合
类目: Trading and Market Microstructure (q-fin.TR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Trading is a highly competitive task that requires a combination of strategy, knowledge, and psychological fortitude. With the recent success of large language models(LLMs), it is appealing to apply the emerging intelligence of LLM agents in this competitive arena and understanding if they can outperform professional traders. In this survey, we provide a comprehensive review of the current research on using LLMs as agents in financial trading. We summarize the common architecture used in the agent, the data inputs, and the performance of LLM trading agents in backtesting as well as the challenges presented in these research. This survey aims to provide insights into the current state of LLM-based financial trading agents and outline future research directions in this field.
摘要:交易是一项竞争激烈的任务,需要策略、知识和心理毅力的结合。随着大型语言模型(LLM)最近的成功,在这个竞争激烈的领域应用LLM代理的新兴智能并了解他们是否能超越专业交易员,这很有吸引力。在本调查中,我们对当前关于使用LLM作为金融交易代理的研究进行了全面回顾。我们总结了代理中使用的常见架构、数据输入和LLM交易代理在回测中的性能以及这些研究中提出的挑战。这项调查旨在深入了解法学硕士金融交易代理的现状,并概述该领域未来的研究方向。

人工智能

[AI-0] Diversity Empowers Intelligence: Integrating Expertise of Software Engineering Agents

链接: https://arxiv.org/abs/2408.07060
作者: Kexun Zhang,Weiran Yao,Zuxin Liu,Yihao Feng,Zhiwei Liu,Rithesh Murthy,Tian Lan,Lei Li,Renze Lou,Jiacheng Xu,Bo Pang,Yingbo Zhou,Shelby Heinecke,Silvio Savarese,Huan Wang,Caiming Xiong
关键词-EN: Large language model, solving real-world software, shown great potential, language model, SWE-Bench Lite
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language model (LLM) agents have shown great potential in solving real-world software engineering (SWE) problems. The most advanced open-source SWE agent can resolve over 27% of real GitHub issues in SWE-Bench Lite. However, these sophisticated agent frameworks exhibit varying strengths, excelling in certain tasks while underperforming in others. To fully harness the diversity of these agents, we propose DEI (Diversity Empowered Intelligence), a framework that leverages their unique expertise. DEI functions as a meta-module atop existing SWE agent frameworks, managing agent collectives for enhanced problem-solving. Experimental results show that a DEI-guided committee of agents is able to surpass the best individual agent’s performance by a large margin. For instance, a group of open-source SWE agents, with a maximum individual resolve rate of 27.3% on SWE-Bench Lite, can achieve a 34.3% resolve rate with DEI, making a 25% improvement and beating most closed-source solutions. Our best-performing group excels with a 55% resolve rate, securing the highest ranking on SWE-Bench Lite. Our findings contribute to the growing body of research on collaborative AI systems and their potential to solve complex software engineering challenges.

[AI-1] Model Counting in the Wild KR2024

链接: https://arxiv.org/abs/2408.07059
作者: Arijit Shaw,Kuldeep S. Meel
关键词-EN: neural network verification, network reliability, neural network, network verification, Model counting
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
*备注: Full version of conference paper accepted at KR 2024

点击查看摘要

Abstract:Model counting is a fundamental problem in automated reasoning with applications in probabilistic inference, network reliability, neural network verification, and more. Although model counting is computationally intractable from a theoretical perspective due to its #P-completeness, the past decade has seen significant progress in developing state-of-the-art model counters to address scalability challenges. In this work, we conduct a rigorous assessment of the scalability of model counters in the wild. To this end, we surveyed 11 application domains and collected an aggregate of 2262 benchmarks from these domains. We then evaluated six state-of-the-art model counters on these instances to assess scalability and runtime performance. Our empirical evaluation demonstrates that the performance of model counters varies significantly across different application domains, underscoring the need for careful selection by the end user. Additionally, we investigated the behavior of different counters with respect to two parameters suggested by the model counting community, finding only a weak correlation. Our analysis highlights the challenges and opportunities for portfolio-based approaches in model counting.

[AI-2] A Survey on Model MoErging: Recycling and Routing Among Specialized Experts for Collaborative Learning

链接: https://arxiv.org/abs/2408.07057
作者: Prateek Yadav,Colin Raffel,Mohammed Muqeeth,Lucas Caccia,Haokun Liu,Tianlong Chen,Mohit Bansal,Leshem Choshen,Alessandro Sordoni
关键词-EN: performant pre-trained models, fine-tuned expert models, MoErging methods, domain or task, MoErging
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: 26 pages

点击查看摘要

Abstract:The availability of performant pre-trained models has led to a proliferation of fine-tuned expert models that are specialized to a particular domain or task. Model MoErging methods aim to recycle expert models to create an aggregate system with improved performance or generalization. A key component of MoErging methods is the creation of a router that decides which expert model(s) to use for a particular input or application. The promise, effectiveness, and large design space of MoErging has spurred the development of many new methods over the past few years. This rapid pace of development has made it challenging to compare different MoErging methods, which are rarely compared to one another and are often validated in different experimental setups. To remedy such gaps, we present a comprehensive survey of MoErging methods that includes a novel taxonomy for cataloging key design choices and clarifying suitable applications for each method. Apart from surveying MoErging research, we inventory software tools and applications that make use of MoErging. We additionally discuss related fields of study such as model merging, multitask learning, and mixture-of-experts models. Taken as a whole, our survey provides a unified overview of existing MoErging methods and creates a solid foundation for future work in this burgeoning field.

[AI-3] The News Comment Gap and Algorithmic Agenda Setting in Online Forums

链接: https://arxiv.org/abs/2408.07052
作者: Flora Böwing,Patrick Gildersleve
关键词-EN: stories valued, ranking algorithms, Ranking Utility Metric, ranking, comment ranking algorithms
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Social and Information Networks (cs.SI); Physics and Society (physics.soc-ph)
*备注:

点击查看摘要

Abstract:The disparity between news stories valued by journalists and those preferred by readers, known as the “News Gap”, is well-documented. However, the difference in expectations regarding news related user-generated content is less studied. Comment sections, hosted by news websites, are popular venues for reader engagement, yet still subject to editorial decisions. It is thus important to understand journalist vs reader comment preferences and how these are served by various comment ranking algorithms that represent discussions differently. We analyse 1.2 million comments from Austrian newspaper Der Standard to understand the “News Comment Gap” and the effects of different ranking algorithms. We find that journalists prefer positive, timely, complex, direct responses, while readers favour comments similar to article content from elite authors. We introduce the versatile Feature-Oriented Ranking Utility Metric (FORUM) to assess the impact of different ranking algorithms and find dramatic differences in how they prioritise the display of comments by sentiment, topical relevance, lexical diversity, and readability. Journalists can exert substantial influence over the discourse through both curatorial and algorithmic means. Understanding these choices’ implications is vital in fostering engaging and civil discussions while aligning with journalistic objectives, especially given the increasing legal scrutiny and societal importance of online discourse.
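How a ranking algorithm's feature weighting changes which comments surface, the behaviour FORUM is designed to quantify, can be illustrated with a toy weighted-utility ranker. The features and weights below are invented for the sketch and are not the paper's metric:

```python
# Toy illustration: two weightings over the same comments produce
# opposite orderings, which is the kind of divergence a feature-oriented
# ranking utility metric would measure.

def rank_comments(comments, weights):
    """Sort comment feature-dicts by a weighted utility, best first."""
    utility = lambda c: sum(weights[f] * c[f] for f in weights)
    return sorted(comments, key=utility, reverse=True)

comments = [
    {"id": "a", "sentiment": 0.9, "relevance": 0.2, "readability": 0.5},
    {"id": "b", "sentiment": 0.1, "relevance": 0.9, "readability": 0.7},
]
by_sentiment = rank_comments(comments, {"sentiment": 1.0, "relevance": 0.0, "readability": 0.0})
by_relevance = rank_comments(comments, {"sentiment": 0.0, "relevance": 1.0, "readability": 0.0})
print([c["id"] for c in by_sentiment], [c["id"] for c in by_relevance])
# ['a', 'b'] ['b', 'a']
```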

[AI-4] KAN You See It? KANs and Sentinel for Effective and Explainable Crop Field Segmentation ECCV2024

链接: https://arxiv.org/abs/2408.07040
作者: Daniele Rege Cambrin,Eleonora Poeta,Eliana Pastor,Tania Cerquitelli,Elena Baralis,Paolo Garza
关键词-EN: enhancing agricultural productivity, promoting sustainable practices, monitoring crop health, agricultural productivity, sustainable practices
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted at ECCV 2024 CVPPA Workshop

点击查看摘要

Abstract:Segmentation of crop fields is essential for enhancing agricultural productivity, monitoring crop health, and promoting sustainable practices. Deep learning models adopted for this task must ensure accurate and reliable predictions to avoid economic losses and environmental impact. The newly proposed Kolmogorov-Arnold networks (KANs) offer promising advancements in the performance of neural networks. This paper analyzes the integration of KAN layers into the U-Net architecture (U-KAN) to segment crop fields using Sentinel-2 and Sentinel-1 satellite images and provides an analysis of the performance and explainability of these networks. Our findings indicate a 2% improvement in IoU compared to the traditional full-convolutional U-Net model in fewer GFLOPs. Furthermore, gradient-based explanation techniques show that U-KAN predictions are highly plausible and that the network has a very high ability to focus on the boundaries of cultivated areas rather than on the areas themselves. The per-channel relevance analysis also reveals that some channels are irrelevant to this task.

[AI-5] PathInsight: Instruction Tuning of Multimodal Datasets and Models for Intelligence Assisted Diagnosis in Histopathology

链接: https://arxiv.org/abs/2408.07037
作者: Xiaomin Wu,Rui Xu,Pengchen Wei,Wenkang Qin,Peixiang Huang,Ziheng Li,Lin Luo
关键词-EN: Pathological diagnosis remains, identifying tumors, diagnosis remains, remains the definitive, definitive standard
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 10 pages, 2 figures

点击查看摘要

Abstract:Pathological diagnosis remains the definitive standard for identifying tumors. The rise of multimodal large models has simplified the process of integrating image analysis with textual descriptions. Despite this advancement, the substantial costs associated with training and deploying these complex multimodal models, together with a scarcity of high-quality training datasets, create a significant divide between cutting-edge technology and its application in the clinical setting. We had meticulously compiled a dataset of approximately 45,000 cases, covering over 6 different tasks, including the classification of organ tissues, generating pathology report descriptions, and addressing pathology-related questions and answers. We have fine-tuned multimodal large models, specifically LLaVA, Qwen-VL, InternLM, with this dataset to enhance instruction-based performance. We conducted a qualitative assessment of the capabilities of the base model and the fine-tuned model in performing image captioning and classification tasks on the specific dataset. The evaluation results demonstrate that the fine-tuned model exhibits proficiency in addressing typical pathological questions. We hope that by making both our models and datasets publicly available, they can be valuable to the medical and research communities.

[AI-6] Defining and Measuring Disentanglement for non-Independent Factors of Variation

链接: https://arxiv.org/abs/2408.07016
作者: Antonio Almudévar,Alfonso Ortega,Luis Vicente,Antonio Miguel,Eduardo Lleida
关键词-EN: factors of variation, Representation learning, factors, variation, discover and extract
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Representation learning is an approach that allows to discover and extract the factors of variation from the data. Intuitively, a representation is said to be disentangled if it separates the different factors of variation in a way that is understandable to humans. Definitions of disentanglement and metrics to measure it usually assume that the factors of variation are independent of each other. However, this is generally false in the real world, which limits the use of these definitions and metrics to very specific and unrealistic scenarios. In this paper we give a definition of disentanglement based on information theory that is also valid when the factors of variation are not independent. Furthermore, we relate this definition to the Information Bottleneck Method. Finally, we propose a method to measure the degree of disentanglement from the given definition that works when the factors of variation are not independent. We show through different experiments that the method proposed in this paper correctly measures disentanglement with non-independent factors of variation, while other methods fail in this scenario.
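The definition above is information-theoretic; as a minimal sketch, the dependence between a (discretised) latent dimension and a factor of variation can be estimated from paired samples. This is a generic plug-in mutual-information estimator, not the paper's proposed disentanglement metric itself:

```python
# Plug-in estimate of I(X;Y) in bits from paired discrete samples.
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

# A latent code that copies a binary factor carries 1 bit about it;
# an unrelated code carries none.
factor = [0, 0, 1, 1]
print(mutual_information(factor, factor))        # 1.0
print(mutual_information(factor, [0, 1, 0, 1]))  # 0.0
```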

[AI-7] Casper: Prompt Sanitization for Protecting User Privacy in Web-Based Large Language Models

链接: https://arxiv.org/abs/2408.07004
作者: Chun Jie Chong,Chenxi Hou,Zhihao Yao,Seyed Mohammadjavad Seyed Talebi
关键词-EN: Web-based Large Language, Large Language Model, Web-based Large, Language Model, Large Language
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Web-based Large Language Model (LLM) services have been widely adopted and have become an integral part of our Internet experience. Third-party plugins enhance the functionalities of LLM by enabling access to real-world data and services. However, the privacy consequences associated with these services and their third-party plugins are not well understood. Sensitive prompt data are stored, processed, and shared by cloud-based LLM providers and third-party plugins. In this paper, we propose Casper, a prompt sanitization technique that aims to protect user privacy by detecting and removing sensitive information from user inputs before sending them to LLM services. Casper runs entirely on the user’s device as a browser extension and does not require any changes to the online LLM services. At the core of Casper is a three-layered sanitization mechanism consisting of a rule-based filter, a Machine Learning (ML)-based named entity recognizer, and a browser-based local LLM topic identifier. We evaluate Casper on a dataset of 4000 synthesized prompts and show that it can effectively filter out Personal Identifiable Information (PII) and privacy-sensitive topics with high accuracy, at 98.5% and 89.9%, respectively.
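A sketch of the first, rule-based layer of such a pipeline follows. The extension's actual rules are not given in the abstract; the two patterns below are illustrative only, and a real deployment would chain the ML-based NER and local-LLM topic layers after it:

```python
# Illustrative rule-based sanitization layer: replace rule-matched PII
# with placeholder tokens before the prompt leaves the device.
import re

RULES = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def sanitize(prompt: str) -> str:
    for label, pattern in RULES.items():
        prompt = pattern.sub(f"[{label}]", prompt)
    return prompt

print(sanitize("Contact jane.doe@example.com, SSN 123-45-6789."))
# Contact [EMAIL], SSN [SSN].
```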

[AI-8] Generative AI for automatic topic labelling

链接: https://arxiv.org/abs/2408.07003
作者: Diego Kozlowski,Carolina Pradier,Pierre Benz
关键词-EN: large scale interpretation, Topic Modeling, prominent tool, large scale, scale interpretation
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 10 pages, 1 figure

点击查看摘要

Abstract:Topic Modeling has become a prominent tool for the study of scientific fields, as they allow for a large scale interpretation of research trends. Nevertheless, the output of these models is structured as a list of keywords which requires a manual interpretation for the labelling. This paper proposes to assess the reliability of three LLMs, namely flan, GPT-4o, and GPT-4 mini for topic labelling. Drawing on previous research leveraging BERTopic, we generate topics from a dataset of all the scientific articles (n=34,797) authored by all biology professors in Switzerland (n=465) between 2008 and 2020, as recorded in the Web of Science database. We assess the output of the three models both quantitatively and qualitatively and find that, first, both GPT models are capable of accurately and precisely labelling topics from the models’ output keywords. Second, 3-word labels are preferable to grasp the complexity of research topics.
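The labelling setup can be illustrated with a minimal prompt builder that turns a topic's keyword list (e.g. BERTopic output) into a request for a short label. The wording and the `build_label_prompt` helper are hypothetical, not the paper's actual prompt:

```python
def build_label_prompt(keywords, n_words=3):
    """Build a hypothetical labelling prompt from a topic's top
    keywords; the 3-word default follows the paper's finding that
    short labels work best."""
    return (
        f"The following keywords describe one research topic: "
        f"{', '.join(keywords)}. "
        f"Summarise the topic as a label of at most {n_words} words."
    )

print(build_label_prompt(["genome", "sequencing", "variant", "mutation"]))
```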

[AI-9] LLMs can Schedule

链接: https://arxiv.org/abs/2408.06993
作者: Henrik Abgaryan,Ararat Harutyunyan,Tristan Cazenave
关键词-EN: optimizing production processes, shop scheduling problem, Large Language Models, remains a significant, production processes
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The job shop scheduling problem (JSSP) remains a significant hurdle in optimizing production processes. This challenge involves efficiently allocating jobs to a limited number of machines while minimizing factors like total processing time or job delays. While recent advancements in artificial intelligence have yielded promising solutions, such as reinforcement learning and graph neural networks, this paper explores the potential of Large Language Models (LLMs) for JSSP. We introduce the very first supervised 120k dataset specifically designed to train LLMs for JSSP. Surprisingly, our findings demonstrate that LLM-based scheduling can achieve performance comparable to other neural approaches. Furthermore, we propose a sampling method that enhances the effectiveness of LLMs in tackling JSSP.
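For readers unfamiliar with JSSP, a tiny instance and a greedy dispatch rule (unrelated to the paper's LLM approach) show what a schedule and its makespan look like. Each job is a sequence of (machine, duration) operations that must run in order:

```python
def greedy_makespan(jobs):
    """Dispatch operations round-robin over jobs: each operation starts
    as soon as both its job and its machine are free. Returns the
    resulting makespan (not necessarily optimal)."""
    machine_free = {}
    job_free = [0] * len(jobs)
    ops = [list(j) for j in jobs]
    while any(ops):
        for i, j in enumerate(ops):
            if not j:
                continue
            m, d = j.pop(0)
            start = max(job_free[i], machine_free.get(m, 0))
            job_free[i] = machine_free[m] = start + d
    return max(job_free)

# Two jobs, two machines: J0 = M0(3) -> M1(2), J1 = M1(2) -> M0(4)
print(greedy_makespan([[(0, 3), (1, 2)], [(1, 2), (0, 4)]]))  # → 7
```

An LLM-based scheduler, as explored in the paper, would instead emit the dispatch decisions itself, with the same feasibility constraints applying.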

[AI-10] SpectralGaussians: Semantic spectral 3D Gaussian splatting for multi-spectral scene representation visualization and analysis

链接: https://arxiv.org/abs/2408.06975
作者: Saptarshi Neil Sinha,Holger Graf,Michael Weinmann
关键词-EN: registered multi-view spectrum, semantically meaningful splats, Gaussian Splatting, cross-spectral rendering framework, rendering framework based
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
*备注:

点击查看摘要

Abstract:We propose a novel cross-spectral rendering framework based on 3D Gaussian Splatting (3DGS) that generates realistic and semantically meaningful splats from registered multi-view spectrum and segmentation maps. This extension enhances the representation of scenes with multiple spectra, providing insights into the underlying materials and segmentation. We introduce an improved physically-based rendering approach for Gaussian splats, estimating reflectance and lights per spectra, thereby enhancing accuracy and realism. In a comprehensive quantitative and qualitative evaluation, we demonstrate the superior performance of our approach with respect to other recent learning-based spectral scene representation approaches (i.e., XNeRF and SpectralNeRF) as well as other non-spectral state-of-the-art learning-based approaches. Our work also demonstrates the potential of spectral scene understanding for precise scene editing techniques like style transfer, inpainting, and removal. Thereby, our contributions address challenges in multi-spectral scene representation, rendering, and editing, offering new possibilities for diverse applications.

[AI-11] Neural Speech and Audio Coding

链接: https://arxiv.org/abs/2408.06954
作者: Minje Kim,Jan Skoglund
关键词-EN: speech and audio, explores the integration, purely data-driven approaches, data-driven approaches, audio coding systems
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
*备注: Accepted for publication in IEEE Signal Processing Magazine

点击查看摘要

Abstract:This paper explores the integration of model-based and data-driven approaches within the realm of neural speech and audio coding systems. It highlights the challenges posed by the subjective evaluation processes of speech and audio codecs and discusses the limitations of purely data-driven approaches, which often require inefficiently large architectures to match the performance of model-based methods. The study presents hybrid systems as a viable solution, offering significant improvements to the performance of conventional codecs through meticulously chosen design enhancements. Specifically, it introduces a neural network-based signal enhancer designed to post-process existing codecs’ output, along with the autoencoder-based end-to-end models and LPCNet–hybrid systems that combine linear predictive coding (LPC) with neural networks. Furthermore, the paper delves into predictive models operating within custom feature spaces (TF-Codec) or predefined transform domains (MDCTNet) and examines the use of psychoacoustically calibrated loss functions to train end-to-end neural audio codecs. Through these investigations, the paper demonstrates the potential of hybrid systems to advance the field of speech and audio coding by bridging the gap between traditional model-based approaches and modern data-driven techniques.

[AI-12] Heavy-Ball Momentum Accelerated Actor-Critic With Function Approximation

链接: https://arxiv.org/abs/2408.06945
作者: Yanjie Dong,Haijun Zhang,Gang Wang,Shisheng Cui,Xiping Hu
关键词-EN: stochastic policy gradient, analyzing convergence rate, convergence rate, replace the Monte-Carlo, Monte-Carlo rollouts
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:By using a parametric value function to replace the Monte-Carlo rollouts for value estimation, actor-critic (AC) algorithms can reduce the variance of the stochastic policy gradient and thereby improve the convergence rate. While existing works mainly focus on analyzing the convergence rate of AC algorithms under Markovian noise, the impact of momentum on AC algorithms remains largely unexplored. In this work, we first propose a heavy-ball momentum based advantage actor-critic (HB-A2C) algorithm by integrating the heavy-ball momentum into the critic recursion that is parameterized by a linear function. When the sample trajectory follows a Markov decision process, we quantitatively certify the acceleration capability of the proposed HB-A2C algorithm. Our theoretical results demonstrate that the proposed HB-A2C finds an ε-approximate stationary point within O(ε^-2) iterations for reinforcement learning tasks with Markovian noise. Moreover, we also reveal the dependence of learning rates on the length of the sample trajectory. By carefully selecting the momentum factor of the critic recursion, the proposed HB-A2C can balance the errors introduced by the initialization and the stochastic approximation.
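The heavy-ball recursion at the core of this family of methods adds a momentum term proportional to the previous parameter displacement: w ← w − α·∇f(w) + β·(w − w_prev). A generic sketch on a scalar quadratic, with illustrative step sizes rather than the paper's tuned values:

```python
def heavy_ball_step(w, w_prev, grad, lr=0.1, beta=0.5):
    """One heavy-ball update: a gradient step plus a momentum term
    proportional to the previous displacement (w - w_prev)."""
    return w - lr * grad + beta * (w - w_prev)

# Minimise f(w) = (w - 2)^2, whose gradient is 2 * (w - 2).
w_prev, w = 0.0, 0.0
for _ in range(50):
    grad = 2.0 * (w - 2.0)
    w, w_prev = heavy_ball_step(w, w_prev, grad), w
print(w)  # close to the minimiser 2.0
```

In HB-A2C the same recursion is applied to the critic's linear parameters, with stochastic gradients estimated from the sampled trajectory.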

[AI-13] The advantages of context specific language models: the case of the Erasmian Language Model

链接: https://arxiv.org/abs/2408.06931
作者: João Gonçalves,Nick Jelicic,Michele Murgia,Evert Stamhuis
关键词-EN: training data fed, improve language model, Erasmus University Rotterdam, current trend, trend to improve
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 12 pages, 3 figures, 1 table

点击查看摘要

Abstract:The current trend of improving language model performance seems to be based on scaling up the number of parameters (e.g. the state-of-the-art GPT-4 model has approximately 1.7 trillion parameters) or the amount of training data fed into the model. However this comes at significant costs in terms of computational resources and energy that compromise the sustainability of AI solutions, as well as risks relating to privacy and misuse. In this paper we present the Erasmian Language Model (ELM), a small, context-specific, 900-million-parameter model, pre-trained and fine-tuned by and for Erasmus University Rotterdam. We show how the model performs adequately in a classroom context for essay writing, and how it achieves superior performance in subjects that are part of its context. This has implications for a wide range of institutions and organizations, showing that context-specific language models may be a viable alternative for resource-constrained, privacy-sensitive use cases.

[AI-14] Diagnosis extraction from unstructured Dutch echocardiogram reports using span- and document-level characteristic classification

链接: https://arxiv.org/abs/2408.06930
作者: Bauke Arends,Melle Vessies,Dirk van Osch,Arco Teske,Pim van der Harst,René van Es,Bram van Es
关键词-EN: Clinical machine learning, driven clinical decision, clinical decision support, machine learning research, clinically accurate labels
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 28 pages, 5 figures

点击查看摘要

Abstract:Clinical machine learning research and AI-driven clinical decision support models rely on clinically accurate labels. Manually extracting these labels with the help of clinical specialists is often time-consuming and expensive. This study tests the feasibility of automatic span- and document-level diagnosis extraction from unstructured Dutch echocardiogram reports. We included 115,692 unstructured echocardiogram reports from the UMCU, a large university hospital in the Netherlands. A randomly selected subset was manually annotated for the occurrence and severity of eleven commonly described cardiac characteristics. We developed and tested several automatic labelling techniques at both span and document levels, using weighted and macro F1-score, precision, and recall for performance evaluation. We compared the performance of span labelling against document labelling methods, which included both direct document classifiers and indirect document classifiers that rely on span classification results. The SpanCategorizer and this http URL models outperformed all other span and document classifiers, respectively. The weighted F1-score varied between characteristics, ranging from 0.60 to 0.93 in SpanCategorizer and 0.96 to 0.98 in this http URL. Direct document classification was superior to indirect document classification using span classifiers. SetFit achieved competitive document classification performance using only 10% of the training data. Utilizing a reduced label set yielded near-perfect document classification results. We recommend using our published SpanCategorizer and this http URL models for span- and document-level diagnosis extraction from Dutch echocardiography reports. For settings with limited training data, SetFit may be a promising alternative for document classification.
Comments: 28 pages, 5 figures. Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI). MSC classes: 68T50, 68P20. ACM classes: I.2.7; J.3; H.3.3. Cite as: arXiv:2408.06930 [cs.CL], https://doi.org/10.48550/arXiv.2408.06930. Submitted: Tue, 13 Aug 2024 14:33:32 UTC by Bram van Es.

[AI-15] Temporal Variability and Multi-Viewed Self-Supervised Representations to Tackle the ASVspoof5 Deepfake Challenge

链接: https://arxiv.org/abs/2408.06922
作者: Yuankun Xie,Xiaopeng Wang,Zhiyong Wang,Ruibo Fu,Zhengqi Wen,Haonan Cheng,Long Ye
关键词-EN: audio security challenges, largest global audio, global audio security, security challenges, largest global
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:ASVspoof5, the fifth edition of the ASVspoof series, is one of the largest global audio security challenges. It aims to advance the development of countermeasures (CM) to discriminate bonafide and spoofed speech utterances. In this paper, we focus on addressing the problem of open-domain audio deepfake detection, which corresponds directly to the ASVspoof5 Track1 open condition. At first, we comprehensively investigate various CM on ASVspoof5, including data expansion, data augmentation, and self-supervised learning (SSL) features. Due to the high-frequency gaps characteristic of the ASVspoof5 dataset, we introduce Frequency Mask, a data augmentation method that masks specific frequency bands to improve CM robustness. Combining various scales of temporal information with multiple SSL features, our experiments achieved a minDCF of 0.0158 and an EER of 0.55% on the ASVspoof 5 Track 1 evaluation progress set.
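Frequency masking of the kind described can be sketched as a SpecAugment-style augmentation that zeroes a random band of spectrogram bins; the band-selection details below are assumptions, not the paper's exact procedure:

```python
import numpy as np

def frequency_mask(spec, max_width=10, rng=None):
    """Zero one randomly placed band of frequency bins in a
    (n_freq_bins, n_frames) spectrogram. Generic SpecAugment-style
    sketch; the bands targeted for ASVspoof5 are not reproduced."""
    if rng is None:
        rng = np.random.default_rng(0)
    n_freq = spec.shape[0]
    width = int(rng.integers(1, max_width + 1))
    start = int(rng.integers(0, n_freq - width + 1))
    out = spec.copy()
    out[start:start + width, :] = 0.0
    return out

spec = np.ones((64, 100))
masked = frequency_mask(spec)
print(int(spec.sum() - masked.sum()))  # energy removed: 100 * band width
```

Training the countermeasure on such masked inputs discourages it from relying on any single frequency band.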

[AI-16] Multi-Agent Continuous Control with Generative Flow Networks

链接: https://arxiv.org/abs/2408.06920
作者: Shuang Luo,Yinchuan Li,Shunyu Liu,Xu Zhang,Yunfeng Shao,Chao Wu
关键词-EN: generate diverse trajectories, exploratory control tasks, Generative Flow Networks, diverse trajectories, generative Continuous Flow
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:Generative Flow Networks (GFlowNets) aim to generate diverse trajectories from a distribution in which the final states of the trajectories are proportional to the reward, serving as a powerful alternative to reinforcement learning for exploratory control tasks. However, the individual-flow matching constraint in GFlowNets limits their applications for multi-agent systems, especially continuous joint-control problems. In this paper, we propose a novel Multi-Agent generative Continuous Flow Networks (MACFN) method to enable multiple agents to perform cooperative exploration for various compositional continuous objects. Technically, MACFN trains decentralized individual-flow-based policies in a centralized global-flow-based matching fashion. During centralized training, MACFN introduces a continuous flow decomposition network to deduce the flow contributions of each agent in the presence of only global rewards. Then agents can deliver actions solely based on their assigned local flow in a decentralized way, forming a joint policy distribution proportional to the rewards. To guarantee the expressiveness of continuous flow decomposition, we theoretically derive a consistency condition on the decomposition network. Experimental results demonstrate that the proposed method yields results superior to the state-of-the-art counterparts and better exploration capability. Our code is available at this https URL.

[AI-17] Entendre, a Social Bot Detection Tool for Niche, Fringe, and Extreme Social Media

链接: https://arxiv.org/abs/2408.06900
作者: Pranav Venkatesh,Kami Vinton,Dhiraj Murthy,Kellen Sharp,Akaash Kolluri
关键词-EN: media-are exploiting vulnerabilities, manipulate public perception, social media-are exploiting, disseminate disinformation, media-are exploiting
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Social and Information Networks (cs.SI)
*备注: 6 pages

点击查看摘要

Abstract:Social bots, automated accounts that generate and spread content on social media, are exploiting vulnerabilities in these platforms to manipulate public perception and disseminate disinformation. This has prompted the development of public bot detection services; however, most of these services focus primarily on Twitter, leaving niche platforms vulnerable. Fringe social media platforms such as Parler, Gab, and Gettr often have minimal moderation, which facilitates the spread of hate speech and misinformation. To address this gap, we introduce Entendre, an open-access, scalable, and platform-agnostic bot detection framework. Entendre can process a labeled dataset from any social platform to produce a tailored bot detection model using a random forest classification approach, ensuring robust social bot detection. We exploit the idea that most social platforms share a generic template, where users can post content, approve content, and provide a bio (common data features). By emphasizing general data features over platform-specific ones, Entendre offers rapid extensibility at the expense of some accuracy. To demonstrate Entendre’s effectiveness, we used it to explore the presence of bots among accounts posting racist content on the now-defunct right-wing platform Parler. We examined 233,000 posts from 38,379 unique users and found that 1,916 unique users (4.99%) exhibited bot-like behavior. Visualization techniques further revealed that these bots significantly impacted the network, amplifying influential rhetoric and hashtags (e.g., #qanon, #trump, #antilgbt). These preliminary findings underscore the need for tools like Entendre to monitor and assess bot activity across diverse platforms.
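The "generic template" idea, extracting the same features (posts, approvals, bio) from any platform, can be sketched as below. The field names and feature set are illustrative assumptions, not Entendre's actual schema; vectors like these would feed the random-forest classifier the paper describes:

```python
def generic_features(user):
    """Platform-agnostic features from a user record with 'posts'
    (list of str), 'bio' (str) and 'approvals' (int). Field names
    and features are illustrative, not Entendre's actual schema."""
    posts = user.get("posts", [])
    n = len(posts)
    return {
        "post_count": n,
        "mean_post_len": sum(len(p) for p in posts) / n if n else 0.0,
        "duplicate_ratio": 1.0 - len(set(posts)) / n if n else 0.0,
        "bio_len": len(user.get("bio", "")),
        "approvals": user.get("approvals", 0),
    }

# A repetitive account: many identical posts, empty bio.
print(generic_features({"posts": ["#qanon"] * 5, "bio": "", "approvals": 0}))
```

Because none of these features depend on platform-specific APIs, the same extractor can be pointed at Parler, Gab, or Gettr exports.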

[AI-18] Automatic Feature Recognition and Dimensional Attributes Extraction From CAD Models for Hybrid Additive-Subtractive Manufacturing

链接: https://arxiv.org/abs/2408.06891
作者: Muhammad Tayyab Khan,Wenhe Feng,Lequn Chen,Ye Han Ng,Nicholas Yew Jin Tan,Seung Ki Moon
关键词-EN: facilitating seamless transitions, Computer-Aided Design, Computer-Aided Process Planning, manufacturing process planning, digital designs
类目: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 10 pages, 12 figures. This paper has been accepted for presentation at the ASME IDETC-CIE 2024 conference

点击查看摘要

Abstract:The integration of Computer-Aided Design (CAD), Computer-Aided Process Planning (CAPP), and Computer-Aided Manufacturing (CAM) plays a crucial role in modern manufacturing, facilitating seamless transitions from digital designs to physical products. However, a significant challenge within this integration is the Automatic Feature Recognition (AFR) of CAD models, especially in the context of hybrid manufacturing that combines subtractive and additive manufacturing processes. Traditional AFR methods, focused mainly on the identification of subtractive (machined) features including holes, fillets, chamfers, pockets, and slots, fail to recognize features pertinent to additive manufacturing. Furthermore, the traditional methods fall short in accurately extracting geometric dimensions and orientations, which are also key factors for effective manufacturing process planning. This paper presents a novel approach for creating a synthetic CAD dataset that encompasses features relevant to both additive and subtractive machining through Python Open Cascade. The Hierarchical Graph Convolutional Neural Network (HGCNN) model is implemented to accurately identify the composite additive-subtractive features within the synthetic CAD dataset. The key novelty and contribution of the proposed methodology lie in its ability to recognize a wide range of manufacturing features and to precisely extract their dimensions, orientations, and stock sizes. The proposed model demonstrates remarkable feature recognition accuracy exceeding 97% and a dimension extraction accuracy of 100% for identified features. Therefore, the proposed methodology enhances the integration of CAD, CAPP, and CAM within hybrid manufacturing by providing precise feature recognition and dimension extraction. It facilitates improved manufacturing process planning, by enabling more informed decision-making.

[AI-19] BMFT: Achieving Fairness via Bias-based Weight Masking Fine-tuning MICCAI2024

链接: https://arxiv.org/abs/2408.06890
作者: Yuyang Xue,Junyu Yan,Raman Dutt,Fasih Haider,Jingshuai Liu,Steven McDonagh,Sotirios A. Tsaftaris
关键词-EN: robust group fairness, group fairness properties, ethically sensitive domains, Developing models, properties is paramount
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted by MICCAI 2024 FAIMI Workshop Oral

点击查看摘要

Abstract:Developing models with robust group fairness properties is paramount, particularly in ethically sensitive domains such as medical diagnosis. Recent approaches to achieving fairness in machine learning require a substantial amount of training data and depend on model retraining, which may not be practical in real-world scenarios. To mitigate these challenges, we propose Bias-based Weight Masking Fine-Tuning (BMFT), a novel post-processing method that enhances the fairness of a trained model in significantly fewer epochs without requiring access to the original training data. BMFT produces a mask over model parameters, which efficiently identifies the weights contributing the most towards biased predictions. Furthermore, we propose a two-step debiasing strategy, wherein the feature extractor undergoes initial fine-tuning on the identified bias-influenced weights, succeeded by a fine-tuning phase on a reinitialised classification layer to uphold discriminative performance. Extensive experiments across four dermatological datasets and two sensitive attributes demonstrate that BMFT outperforms existing state-of-the-art (SOTA) techniques in both diagnostic accuracy and fairness metrics. Our findings underscore the efficacy and robustness of BMFT in advancing fairness across various out-of-distribution (OOD) settings. Our code is available at: this https URL
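The bias-based mask can be illustrated as selecting the top fraction of weights by the magnitude of a bias-loss gradient, so that only those weights are fine-tuned in the first BMFT step. The gradient source and threshold rule here are assumptions, not BMFT's exact procedure:

```python
import numpy as np

def bias_weight_mask(bias_grad, top_frac=0.1):
    """Boolean mask over the fraction of weights whose bias-loss
    gradient magnitude is largest; illustrative selection rule."""
    flat = np.abs(bias_grad).ravel()
    k = max(1, int(top_frac * flat.size))
    thresh = np.partition(flat, -k)[-k]  # k-th largest magnitude
    return np.abs(bias_grad) >= thresh

grad = np.array([[0.1, -2.0], [0.05, 0.3]])
print(bias_weight_mask(grad, top_frac=0.25))
```

In the two-step strategy, a mask like this gates which feature-extractor weights receive updates before the classification layer is reinitialised and fine-tuned.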

[AI-20] Decision-Focused Learning to Predict Action Costs for Planning

链接: https://arxiv.org/abs/2408.06876
作者: Jayanta Mandi,Marco Foschini,Daniel Holler,Sylvie Thiebaux,Jorg Hoffmann,Tias Guns
关键词-EN: automated planning applications, action costs, automated planning, planning, DFL
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:In many automated planning applications, action costs can be hard to specify. An example is the time needed to travel through a certain road segment, which depends on many factors, such as the current weather conditions. A natural way to address this issue is to learn to predict these parameters based on input features (e.g., weather forecasts) and use the predicted action costs in automated planning afterward. Decision-Focused Learning (DFL) has been successful in learning to predict the parameters of combinatorial optimization problems in a way that optimizes solution quality rather than prediction quality. This approach yields better results than treating prediction and optimization as separate tasks. In this paper, we investigate for the first time the challenges of implementing DFL for automated planning in order to learn to predict the action costs. There are two main challenges to overcome: (1) planning systems are called during gradient descent learning, to solve planning problems with negative action costs, which are not supported in planning. We propose novel methods for gradient computation to avoid this issue. (2) DFL requires repeated planner calls during training, which can limit the scalability of the method. We experiment with different methods approximating the optimal plan as well as an easy-to-implement caching mechanism to speed up the learning process. As the first work that addresses DFL for automated planning, we demonstrate that the proposed gradient computation consistently yields significantly better plans than predictions aimed at minimizing prediction error; and that caching can temper the computation requirements.

[AI-21] Advancing Interactive Explainable AI via Belief Change Theory KR2024

链接: https://arxiv.org/abs/2408.06875
作者: Antonio Rago,Maria Vanina Martinez
关键词-EN: humans’ daily lives, interactive XAI, methods are needed, daily lives, greater levels
类目: Artificial Intelligence (cs.AI)
*备注: 9 pages. To be published at KR 2024

点击查看摘要

Abstract:As AI models become ever more complex and intertwined in humans’ daily lives, greater levels of interactivity of explainable AI (XAI) methods are needed. In this paper, we propose the use of belief change theory as a formal foundation for operators that model the incorporation of new information, i.e. user feedback in interactive XAI, to logical representations of data-driven classifiers. We argue that this type of formalisation provides a framework and a methodology to develop interactive explanations in a principled manner, providing warranted behaviour and favouring transparency and accountability of such interactions. Concretely, we first define a novel, logic-based formalism to represent explanatory information shared between humans and machines. We then consider real world scenarios for interactive XAI, with different prioritisations of new and existing knowledge, where our formalism may be instantiated. Finally, we analyse a core set of belief change postulates, discussing their suitability for our real world settings and pointing to particular challenges that may require the relaxation or reinterpretation of some of the theoretical assumptions underlying existing operators.

[AI-22] Generative AI Tools in Academic Research: Applications and Implications for Qualitative and Quantitative Research Methodologies

链接: https://arxiv.org/abs/2408.06872
作者: Mike Perkins(1),Jasper Roe(2) ((1) British University Vietnam, (2) James Cook University Singapore)
关键词-EN: Generative Artificial Intelligence, impact of Generative, Generative Artificial, Artificial Intelligence, examines the impact
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This study examines the impact of Generative Artificial Intelligence (GenAI) on academic research, focusing on its application to qualitative and quantitative data analysis. As GenAI tools evolve rapidly, they offer new possibilities for enhancing research productivity and democratising complex analytical processes. However, their integration into academic practice raises significant questions regarding research integrity and security, authorship, and the changing nature of scholarly work. Through an examination of current capabilities and potential future applications, this study provides insights into how researchers may utilise GenAI tools responsibly and ethically. We present case studies that demonstrate the application of GenAI in various research methodologies, discuss the challenges of replicability and consistency in AI-assisted research, and consider the ethical implications of increased AI integration in academia. This study explores both qualitative and quantitative applications of GenAI, highlighting tools for transcription, coding, thematic analysis, visual analytics, and statistical analysis. By addressing these issues, we aim to contribute to the ongoing discourse on the role of AI in shaping the future of academic research and provide guidance for researchers exploring the rapidly evolving landscape of AI-assisted research tools and research.

[AI-23] Causal Agent based on Large Language Model

链接: https://arxiv.org/abs/2408.06849
作者: Kairong Han,Kun Kuang,Ziyu Zhao,Junjian Ye,Fei Wu
关键词-EN: Large language models, causal, Causal Agent, achieved significant success, Large language
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) have achieved significant success across various domains. However, the inherent complexity of causal problems and causal theory poses challenges in accurately describing them in natural language, making it difficult for LLMs to comprehend and use them effectively. Causal methods are not easily conveyed through natural language, which hinders LLMs’ ability to apply them accurately. Additionally, causal datasets are typically tabular, while LLMs excel in handling natural language data, creating a structural mismatch that impedes effective reasoning with tabular data. This lack of causal reasoning capability limits the development of LLMs. To address these challenges, we have equipped the LLM with causal tools within an agent framework, named the Causal Agent, enabling it to tackle causal problems. The causal agent comprises tools, memory, and reasoning modules. In the tools module, the causal agent applies causal methods to align tabular data with natural language. In the reasoning module, the causal agent employs the ReAct framework to perform reasoning through multiple iterations with the tools. In the memory module, the causal agent maintains a dictionary instance where the keys are unique names and the values are causal graphs. To verify the causal ability of the causal agent, we established a benchmark consisting of four levels of causal problems: variable level, edge level, causal graph level, and causal effect level. We generated a test dataset of 1.3K using ChatGPT-3.5 for these four levels of issues and tested the causal agent on the datasets. Our methodology demonstrates remarkable efficacy on the four-level causal problems, with accuracy rates all above 80%. For further insights and implementation details, our code is accessible via the GitHub repository this https URL.

[AI-24] AI Research is not Magic, it has to be Reproducible and Responsible: Challenges in the AI field from the Perspective of its PhD Students

链接: https://arxiv.org/abs/2408.06847
作者: Andrea Hrckova,Jennifer Renoux,Rafael Tolosana Calasanz,Daniela Chuda,Martin Tamajka,Jakub Simko
关键词-EN: European countries, faced by European, European AI students, goal of uncovering, doctoral candidates
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: 8 pages, 4 figures, 1 appendix (interview questions)

点击查看摘要

Abstract:With the goal of uncovering the challenges faced by European AI students during their research endeavors, we surveyed 28 AI doctoral candidates from 13 European countries. The outcomes underscore challenges in three key areas: (1) the findability and quality of AI resources such as datasets, models, and experiments; (2) the difficulties in replicating the experiments in AI papers; (3) and the lack of trustworthiness and interdisciplinarity. From our findings, it appears that although early stage AI researchers generally tend to share their AI resources, they lack motivation or knowledge to engage more in dataset and code preparation and curation, and ethical assessments, and are not used to cooperating with well-versed experts in application domains. Furthermore, we examine existing practices in data governance and reproducibility both in computer science and in artificial intelligence. For instance, only a minority of venues actively promote reproducibility initiatives such as reproducibility evaluations. Critically, there is a need for immediate adoption of responsible and reproducible AI research practices, crucial for society at large, and essential for the AI research community in particular. This paper proposes a combination of social and technical recommendations to overcome the identified challenges. Socially, we propose the general adoption of reproducibility initiatives in AI conferences and journals, as well as improved interdisciplinary collaboration, especially in data governance practices. On the technical front, we call for enhanced tools to better support versioning control of datasets and code, and a computing infrastructure that facilitates the sharing and discovery of AI resources, as well as the sharing, execution, and verification of experiments.

[AI-25] Efficient Search for Customized Activation Functions with Gradient Descent

链接: https://arxiv.org/abs/2408.06820
作者: Lukas Strack,Mahmoud Safari,Frank Hutter
关键词-EN: activation functions work, activation functions, functions work, functions, deep learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 10 pages, 1 figure, excluding references and appendix

点击查看摘要

Abstract:Different activation functions work best for different deep learning models. To exploit this, we leverage recent advancements in gradient-based search techniques for neural architectures to efficiently identify high-performing activation functions for a given application. We propose a fine-grained search cell that combines basic mathematical operations to model activation functions, allowing for the exploration of novel activations. Our approach enables the identification of specialized activations, leading to improved performance in every model we tried, from image classification to language models. Moreover, the identified activations exhibit strong transferability to larger models of the same type, as well as new datasets. Importantly, our automated process for creating customized activation functions is orders of magnitude more efficient than previous approaches. It can easily be applied on top of arbitrary deep learning pipelines and thus offers a promising practical avenue for enhancing deep learning architectures.
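The search idea described in the abstract, combining basic mathematical operations into a cell and selecting among them by gradient descent, can be caricatured in a few lines. The following is a toy NumPy sketch, not the authors' code: the candidate operation set, the softmax-weighted mixture, and the finite-difference gradients are all choices made here for illustration.

```python
import numpy as np

# Toy sketch: a "search cell" mixes basic unary operations with
# softmax-weighted architecture parameters `alpha`; gradient descent on
# `alpha` selects the activation that best fits a target (here: ReLU).
OPS = [np.tanh, np.sin,
       lambda x: 1.0 / (1.0 + np.exp(-x)),   # sigmoid
       lambda x: np.maximum(x, 0.0)]          # ReLU

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def mixed_activation(x, alpha):
    w = softmax(alpha)
    return sum(wi * op(x) for wi, op in zip(w, OPS))

rng = np.random.default_rng(0)
x = rng.normal(size=256)
target = np.maximum(x, 0.0)        # pretend ReLU is the ideal activation
alpha = np.zeros(len(OPS))

def loss(a):
    return np.mean((mixed_activation(x, a) - target) ** 2)

initial = loss(alpha)
for _ in range(300):
    # finite-difference gradient keeps the sketch short; real differentiable
    # search would backpropagate through the mixture instead
    grad = np.empty_like(alpha)
    for i in range(len(alpha)):
        bumped = alpha.copy()
        bumped[i] += 1e-4
        grad[i] = (loss(bumped) - loss(alpha)) / 1e-4
    alpha -= 1.0 * grad

best_op = int(np.argmax(alpha))    # discretize: highest-weighted op wins
```

Taking the argmax of the architecture weights at the end mirrors how differentiable architecture search discretizes a final operation.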

[AI-26] Personalized Dynamic Difficulty Adjustment – Imitation Learning Meets Reinforcement Learning

链接: https://arxiv.org/abs/2408.06818
作者: Ronja Fuchs,Robin Gieseke,Alexander Dockhorn
关键词-EN: create interesting gaming, interesting gaming experiences, Balancing game difficulty, game difficulty, Balancing game
类目: Artificial Intelligence (cs.AI)
*备注: 2 pages, the code to our demo can be found here: this https URL

点击查看摘要

Abstract:Balancing game difficulty in video games is a key task to create interesting gaming experiences for players. Mismatching the game difficulty and a player’s skill or commitment results in frustration or boredom on the player’s side, and hence reduces time spent playing the game. In this work, we explore balancing game difficulty using machine learning-based agents to challenge players based on their current behavior. This is achieved by a combination of two agents, in which one learns to imitate the player, while the second is trained to beat the first. In our demo, we investigate the proposed framework for personalized dynamic difficulty adjustment of AI agents in the context of the fighting game AI competition.

[AI-27] MAQA: Evaluating Uncertainty Quantification in LLMs Regarding Data Uncertainty

链接: https://arxiv.org/abs/2408.06816
作者: Yongjin Yang,Haneul Yoo,Hwaran Lee
关键词-EN: uncertainty quantification, uncertainty, uncertainty quantification methods, large language models, data uncertainty
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Although large language models (LLMs) are capable of performing various tasks, they still suffer from producing plausible but incorrect responses. To improve the reliability of LLMs, recent research has focused on uncertainty quantification to predict whether a response is correct or not. However, most uncertainty quantification methods have been evaluated on questions requiring a single clear answer, ignoring the existence of data uncertainty that arises from irreducible randomness. Instead, these methods only consider model uncertainty, which arises from a lack of knowledge. In this paper, we investigate previous uncertainty quantification methods under the presence of data uncertainty. Our contributions are two-fold: 1) proposing a new Multi-Answer Question Answering dataset, MAQA, consisting of world knowledge, mathematical reasoning, and commonsense reasoning tasks to evaluate uncertainty quantification regarding data uncertainty, and 2) assessing 5 uncertainty quantification methods of diverse white- and black-box LLMs. Our findings show that entropy and consistency-based methods estimate the model uncertainty well even under data uncertainty, while other methods for white- and black-box LLMs struggle depending on the tasks. Additionally, methods designed for white-box LLMs suffer from overconfidence in reasoning tasks compared to simple knowledge queries. We believe our observations will pave the way for future work on uncertainty quantification in realistic setting.
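The entropy- and consistency-based black-box methods the abstract finds most robust reduce to sampling several answers from the model and measuring their disagreement. A minimal sketch with hypothetical sampled answers:

```python
from collections import Counter
import math

def semantic_entropy(samples):
    """Entropy over (normalized) sampled answers: a common black-box
    uncertainty signal. High entropy means the model is inconsistent."""
    counts = Counter(s.strip().lower() for s in samples)
    n = sum(counts.values())
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# hypothetical answer samples for the same question
consistent = ["Paris", "paris", "Paris", "Paris"]
inconsistent = ["Paris", "Lyon", "Nice", "Marseille"]
```

Under data uncertainty a question may legitimately have several answers, which is exactly when a disagreement signal like this stops being a reliable correctness predictor, the situation MAQA is built to probe.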

[AI-28] Unmasking the Uniqueness: A Glimpse into Age-Invariant Face Recognition of Indigenous African Faces

链接: https://arxiv.org/abs/2408.06806
作者: Fakunle Ajewole,Joseph Damilola Akinyemi,Khadijat Tope Ladoja,Olufade Falade Williams Onifade
关键词-EN: compared to Africa, received considerable research, considerable research efforts, AIFR research efforts, indigenous African faces
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Keywords: Age-Invariant Face Recognition, CACD, FAGE_v2, VGGFace

点击查看摘要

Abstract:The task of recognizing the age-separated faces of an individual, Age-Invariant Face Recognition (AIFR), has received considerable research efforts in Europe, America, and Asia, compared to Africa. Thus, AIFR research efforts have often under-represented/misrepresented the African ethnicity with non-indigenous Africans. This work developed an AIFR system for indigenous African faces to reduce the misrepresentation of African ethnicity in facial image analysis research. We adopted a pre-trained deep learning model (VGGFace) for AIFR on a dataset of 5,000 indigenous African faces (FAGE_v2) collected for this study. FAGE_v2 was curated via Internet image searches of 500 individuals evenly distributed across 10 African countries. VGGFace was trained on FAGE_v2 to obtain the best accuracy of 81.80%. We also performed experiments on an African-American subset of the CACD dataset and obtained the best accuracy of 91.5%. The results show a significant difference in the recognition accuracies of indigenous versus non-indigenous Africans.

[AI-29] Deep Learning for Speaker Identification: Architectural Insights from AB-1 Corpus Analysis and Performance Evaluation ALT

链接: https://arxiv.org/abs/2408.06804
作者: Matthias Bartolo
关键词-EN: outweighs text-based interactions, fundamental human input, human input outweighs, input outweighs text-based, Frequency Cepstral Coefficients
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注: Resultant work from Assignment, Department of AI, University of Malta. Code available at: this https URL

点击查看摘要

Abstract:In the fields of security systems, forensic investigations, and personalized services, the importance of speech as a fundamental human input outweighs text-based interactions. This research delves deeply into the complex field of Speaker Identification (SID), examining its essential components and emphasising Mel Spectrogram and Mel Frequency Cepstral Coefficients (MFCC) for feature extraction. Moreover, this study examines six slightly distinct model architectures, using extensive analysis to evaluate their performance, with hyperparameter tuning applied to the best-performing model. This work performs a linguistic analysis to verify accent and gender accuracy, in addition to a bias evaluation within the AB-1 Corpus dataset.

[AI-30] Integrating Saliency Ranking and Reinforcement Learning for Enhanced Object Detection ALT

链接: https://arxiv.org/abs/2408.06803
作者: Matthias Bartolo,Dylan Seychell,Josef Bajada
关键词-EN: based visual attention, combine reinforcement learning, visual attention methods, saliency ranking techniques, based visual
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Resultant work from Dissertation, Department of AI, University of Malta. Code available at: this https URL

点击查看摘要

Abstract:With the ever-growing variety of object detection approaches, this study explores a series of experiments that combine reinforcement learning (RL)-based visual attention methods with saliency ranking techniques to investigate transparent and sustainable solutions. By integrating saliency ranking for initial bounding box prediction and subsequently applying RL techniques to refine these predictions through a finite set of actions over multiple time steps, this study aims to enhance RL object detection accuracy. Presented as a series of experiments, this research investigates the use of various image feature extraction methods and explores diverse Deep Q-Network (DQN) architectural variations for deep reinforcement learning-based localisation agent training. Additionally, we focus on optimising the detection pipeline at every step by prioritising lightweight and faster models, while also incorporating the capability to classify detected objects, a feature absent in previous RL approaches. We show that by evaluating the performance of these trained agents using the Pascal VOC 2007 dataset, faster and more optimised models were developed. Notably, the best mean Average Precision (mAP) achieved in this study was 51.4, surpassing benchmarks set by RL-based single object detectors in the literature.
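The "finite set of actions over multiple time steps" that refines a bounding box can be sketched as follows. The action names and the greedy IoU oracle standing in for a trained DQN are illustrative assumptions, not the paper's implementation:

```python
# Discrete refinement actions applied to a box (x1, y1, x2, y2), plus a
# greedy stand-in for the learned Q-function that picks the IoU-maximising
# action at each step.
ACTIONS = ("left", "right", "up", "down", "wider", "narrower", "taller", "shorter")

def apply_action(box, action, step=0.1):
    x1, y1, x2, y2 = box
    dx, dy = step * (x2 - x1), step * (y2 - y1)
    moves = {
        "left":     (-dx, 0.0, -dx, 0.0),
        "right":    ( dx, 0.0,  dx, 0.0),
        "up":       (0.0, -dy, 0.0, -dy),
        "down":     (0.0,  dy, 0.0,  dy),
        "wider":    (-dx, 0.0,  dx, 0.0),
        "narrower": ( dx, 0.0, -dx, 0.0),
        "taller":   (0.0, -dy, 0.0,  dy),
        "shorter":  (0.0,  dy, 0.0, -dy),
    }
    d = moves[action]
    return (x1 + d[0], y1 + d[1], x2 + d[2], y2 + d[3])

def iou(a, b):
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def refine(box, target, steps=20):
    # greedy oracle in place of a trained DQN, purely for illustration
    for _ in range(steps):
        box = max((apply_action(box, a) for a in ACTIONS),
                  key=lambda b: iou(b, target))
    return box

start, target = (0.0, 0.0, 1.0, 1.0), (0.2, 0.2, 0.9, 0.9)
refined = refine(start, target)
```

In the actual RL setting, the agent of course does not see the target box; it learns Q-values from IoU-based rewards during training, and here saliency ranking would supply the starting box.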

[AI-31] Robust Deep Reinforcement Learning for Inverter-based Volt-Var Control in Partially Observable Distribution Networks

链接: https://arxiv.org/abs/2408.06776
作者: Qiong Liu,Ye Guo,Tong Xu
关键词-EN: Inverter-based volt-var control, Inverter-based volt-var, robust DRL approach, volt-var control, control is studied
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Inverter-based volt-var control is studied in this paper. One key issue in DRL-based approaches is the limited measurement deployment in active distribution networks, which leads to problems of a partially observable state and unknown reward. To address those problems, this paper proposes a robust DRL approach with a conservative critic and a surrogate reward. The conservative critic utilizes the quantile regression technology to estimate conservative state-action value function based on the partially observable state, which helps to train a robust policy; the surrogate rewards of power loss and voltage violation are designed that can be calculated from the limited measurements. The proposed approach optimizes the power loss of the whole network and the voltage profile of buses with measurable voltages while indirectly improving the voltage profile of other buses. Extensive simulations verify the effectiveness of the robust DRL approach in different limited measurement conditions, even when only the active power injection of the root bus and less than 10% of bus voltages are measurable.
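The conservative critic's use of quantile regression can be illustrated in one dimension: minimizing the pinball loss for a low quantile tau yields a deliberately pessimistic estimate of noisy returns. A toy sketch (the data, learning rate, and step count are made up):

```python
import numpy as np

# Quantile (pinball) regression in 1D: learn the tau-quantile of noisy
# "returns" instead of their mean; a small tau gives a conservative estimate.
rng = np.random.default_rng(0)
returns = rng.normal(loc=1.0, scale=1.0, size=5000)   # synthetic returns
tau = 0.1                                             # target 10th percentile

q = 0.0
for _ in range(2000):
    # subgradient of the pinball loss E[max(tau*d, (tau-1)*d)], d = r - q
    grad = np.mean(np.where(returns - q > 0, -tau, 1.0 - tau))
    q -= 0.05 * grad

conservative_value = q    # sits well below the mean return
```

A critic trained this way systematically under-estimates state-action values, which is the robustness mechanism the paper relies on under partial observability.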

[AI-32] Exploring Domain Shift on Radar-Based 3D Object Detection Amidst Diverse Environmental Conditions ITSC

链接: https://arxiv.org/abs/2408.06772
作者: Miao Zhang,Sherif Abdulatif,Benedikt Loesch,Marco Altmann,Marius Schwarz,Bin Yang
关键词-EN: autonomous driving systems, perception using multimodal, rapid evolution, evolution of deep, deep learning
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 6 pages, 5 figures, 3 tables, accepted in IEEE International Conference on Intelligent Transportation Systems (ITSC) 2024

点击查看摘要

Abstract:The rapid evolution of deep learning and its integration with autonomous driving systems have led to substantial advancements in 3D perception using multimodal sensors. Notably, radar sensors show greater robustness compared to cameras and lidar under adverse weather and varying illumination conditions. This study delves into the often-overlooked yet crucial issue of domain shift in 4D radar-based object detection, examining how varying environmental conditions, such as different weather patterns and road types, impact 3D object detection performance. Our findings highlight distinct domain shifts across various weather scenarios, revealing unique dataset sensitivities that underscore the critical role of radar point cloud generation. Additionally, we demonstrate that transitioning between different road types, especially from highways to urban settings, introduces notable domain shifts, emphasizing the necessity for diverse data collection across varied road environments. To the best of our knowledge, this is the first comprehensive analysis of domain shift effects on 4D radar-based object detection. We believe this empirical study contributes to understanding the complex nature of domain shifts in radar data and suggests paths forward for data collection strategy in the face of environmental variability.

[AI-33] Cross-View Geolocalization and Disaster Mapping with Street-View and VHR Satellite Imagery: A Case Study of Hurricane IAN

链接: https://arxiv.org/abs/2408.06761
作者: Hao Li,Fabian Deuser,Wenping Yina,Xuanshu Luo,Paul Walther,Gengchen Mai,Wei Huang,Martin Werner
关键词-EN: Nature disasters play, human-urban infrastructure interactions, shaping human-urban infrastructure, Nature disasters, play a key
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Natural disasters play a key role in shaping human-urban infrastructure interactions. Effective and efficient response to natural disasters is essential for building resilience and a sustainable urban environment. Two types of information are usually the most necessary and difficult to gather in disaster response. The first is disaster damage perception, which shows how badly people think that urban infrastructure has been damaged. The second is geolocation awareness, i.e., how people's whereabouts are made available. In this paper, we propose a novel disaster mapping framework, namely CVDisaster, aiming at simultaneously addressing geolocalization and damage perception estimation using cross-view Street-View Imagery (SVI) and Very High-Resolution satellite imagery. CVDisaster consists of two cross-view models, where CVDisaster-Geoloc refers to a cross-view geolocalization model based on a contrastive learning objective with a Siamese ConvNeXt image encoder, and CVDisaster-Est is a cross-view classification model based on a Couple Global Context Vision Transformer (CGCViT). Taking Hurricane IAN as a case study, we evaluate the CVDisaster framework by creating a novel cross-view dataset (CVIAN) and conducting extensive experiments. As a result, we show that CVDisaster can achieve highly competitive performance (over 80% for geolocalization and 75% for damage perception estimation) with even limited fine-tuning efforts, which largely motivates future cross-view models and applications within a broader GeoAI research community. The data and code are publicly available at: this https URL.
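At inference time, the retrieval step behind a contrastive cross-view geolocalization model reduces to matching street-view embeddings against satellite-tile embeddings by cosine similarity. A toy sketch with synthetic embeddings (not the CVDisaster code):

```python
import numpy as np

def cosine_retrieval(street_emb, sat_emb):
    """Match each street-view embedding to the most similar satellite tile
    by cosine similarity: the retrieval step of cross-view geolocalization."""
    a = street_emb / np.linalg.norm(street_emb, axis=1, keepdims=True)
    b = sat_emb / np.linalg.norm(sat_emb, axis=1, keepdims=True)
    sims = a @ b.T                       # (n_street, n_tiles) similarity matrix
    return sims.argmax(axis=1)

rng = np.random.default_rng(0)
sat = rng.normal(size=(100, 32))         # embeddings for 100 satellite tiles
# street views as noisy versions of tiles 3, 7 and 42 (stand-in for a
# contrastively trained encoder mapping matching pairs close together)
street = sat[[3, 7, 42]] + 0.05 * rng.normal(size=(3, 32))
```

The contrastive objective mentioned in the abstract is what makes matching pairs land near each other in this shared embedding space; the argmax lookup itself is the easy part.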

[AI-34] Evaluating Research Quality with Large Language Models: An Analysis of ChatGPT's Effectiveness with Different Settings and Inputs

链接: https://arxiv.org/abs/2408.06752
作者: Mike Thelwall
关键词-EN: Large Language Models, academic journal articles, appointments and promotion, academic journal, time consuming
类目: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Evaluating the quality of academic journal articles is a time consuming but critical task for national research evaluation exercises, appointments and promotion. It is therefore important to investigate whether Large Language Models (LLMs) can play a role in this process. This article assesses which ChatGPT inputs (full text without tables, figures and references; title and abstract; title only) produce better quality score estimates, and the extent to which scores are affected by ChatGPT models and system prompts. The results show that the optimal input is the article title and abstract, with average ChatGPT scores based on these (30 iterations on a dataset of 51 papers) correlating at 0.67 with human scores, the highest ever reported. ChatGPT 4o is slightly better than 3.5-turbo (0.66), and 4o-mini (0.66). The results suggest that article full texts might confuse LLM research quality evaluations, even though complex system instructions for the task are more effective than simple ones. Thus, whilst abstracts contain insufficient information for a thorough assessment of rigour, they may contain strong pointers about originality and significance. Finally, linear regression can be used to convert the model scores into the human scale scores, which is 31% more accurate than guessing.
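The final score-conversion step in the abstract is ordinary least squares from model scores to human-scale scores. A minimal sketch with made-up numbers (the real study fits ChatGPT scores against human assessor scores):

```python
import numpy as np

# Hypothetical data: average LLM quality scores and the human-scale scores
# they should map onto. Here the ground truth is an exact linear relation
# so the fit is recoverable; real data would only be correlated (r ~ 0.67).
model_scores = np.array([2.1, 3.4, 2.8, 4.0, 3.1, 2.5])
human_scores = 0.8 * model_scores + 0.5

# ordinary least squares: solve for slope and intercept
A = np.column_stack([model_scores, np.ones_like(model_scores)])
slope, intercept = np.linalg.lstsq(A, human_scores, rcond=None)[0]
predicted = slope * model_scores + intercept
```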

[AI-35] DiffLoRA: Generating Personalized Low-Rank Adaptation Weights with Diffusion

链接: https://arxiv.org/abs/2408.06740
作者: Yujia Wu,Yiming Shi,Jiwei Wei,Chengwei Sun,Yuyang Zhou,Yang Yang,Heng Tao Shen
关键词-EN: gained significant attention, specific identities conditioned, generate high-fidelity portraits, generation has gained, user-defined prompts
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 9 pages,8 figures

点击查看摘要

Abstract:Personalized text-to-image generation has gained significant attention for its capability to generate high-fidelity portraits of specific identities conditioned on user-defined prompts. Existing methods typically involve test-time fine-tuning or instead incorporating an additional pre-trained branch. However, these approaches struggle to simultaneously address the demands of efficiency, identity fidelity, and preserving the model’s original generative capabilities. In this paper, we propose DiffLoRA, a novel approach that leverages diffusion models as a hypernetwork to predict personalized low-rank adaptation (LoRA) weights based on the reference images. By integrating these LoRA weights into the text-to-image model, DiffLoRA achieves personalization during inference without further training. Additionally, we propose an identity-oriented LoRA weight construction pipeline to facilitate the training of DiffLoRA. By utilizing the dataset produced by this pipeline, our DiffLoRA consistently generates high-performance and accurate LoRA weights. Extensive evaluations demonstrate the effectiveness of our method, achieving both time efficiency and maintaining identity fidelity throughout the personalization process.

[AI-36] Speculations on Uncertainty and Humane Algorithms

链接: https://arxiv.org/abs/2408.06736
作者: Nicholas Gray
关键词-EN: appreciation and utilisation, play a key, key role, role in helping, helping to solve
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The appreciation and utilisation of risk and uncertainty can play a key role in helping to solve some of the many ethical issues that are posed by AI. Understanding the uncertainties can allow algorithms to make better decisions by providing interrogatable avenues to check the correctness of outputs. Allowing algorithms to deal with variability and ambiguity in their inputs means they do not need to force people into uncomfortable classifications. Provenance enables algorithms to know what they know, preventing possible harms. Additionally, uncertainty about provenance highlights the trustworthiness of algorithms. It is essential to compute with what we know rather than make assumptions that may be unjustified or untenable. This paper provides a perspective on the importance of risk and uncertainty in the development of ethical AI, especially in high-risk scenarios. It argues that the handling of uncertainty, especially epistemic uncertainty, is critical to ensuring that algorithms do not cause harm, are trustworthy, and make decisions that are humane.

[AI-37] Large language models can consistently generate high-quality content for election disinformation operations

链接: https://arxiv.org/abs/2408.06731
作者: Angus R. Williams,Liam Burke-Moore,Ryan Sze-Yin Chan,Florence E. Enock,Federico Nanni,Tvesha Sippy,Yi-Ling Chung,Evelina Gabasova,Kobi Hackenburg,Jonathan Bright
关键词-EN: Advances in large, election disinformation operation, generating compelling election, election disinformation, compelling election disinformation
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Advances in large language models have raised concerns about their potential use in generating compelling election disinformation at scale. This study presents a two-part investigation into the capabilities of LLMs to automate stages of an election disinformation operation. First, we introduce DisElect, a novel evaluation dataset designed to measure LLM compliance with instructions to generate content for an election disinformation operation in localised UK context, containing 2,200 malicious prompts and 50 benign prompts. Using DisElect, we test 13 LLMs and find that most models broadly comply with these requests; we also find that the few models which refuse malicious prompts also refuse benign election-related prompts, and are more likely to refuse to generate content from a right-wing perspective. Secondly, we conduct a series of experiments (N=2,340) to assess the “humanness” of LLMs: the extent to which disinformation operation content generated by an LLM is able to pass as human-written. Our experiments suggest that almost all LLMs tested released since 2022 produce election disinformation operation content indiscernible by human evaluators over 50% of the time. Notably, we observe that multiple models achieve above-human levels of humanness. Taken together, these findings suggest that current LLMs can be used to generate high-quality content for election disinformation operations, even in hyperlocalised scenarios, at far lower costs than traditional methods, and offer researchers and policymakers an empirical benchmark for the measurement and evaluation of these capabilities in current and future models.

[AI-38] Enhancing Visual Dialog State Tracking through Iterative Object-Entity Alignment in Multi-Round Conversations

链接: https://arxiv.org/abs/2408.06725
作者: Wei Pang,Ruixue Duan,Jinfu Yang,Ning Li
关键词-EN: Visual Dialog, dialog history, image-related questions based, multi-round dialog history, Multi-round Dialogue State
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注: This article has been accepted in CAAI Transactions on Intelligence Technology! Article ID: CIT2_12370, Article DOI: https://doi.org/10.1049/cit2.12370

点击查看摘要

Abstract:Visual Dialog (VD) is a task where an agent answers a series of image-related questions based on a multi-round dialog history. However, previous VD methods often treat the entire dialog history as a simple text input, disregarding the inherent conversational information flows at the round level. In this paper, we introduce the Multi-round Dialogue State Tracking model (MDST), a framework that addresses this limitation by leveraging the dialogue state learned from dialog history to answer questions. MDST captures each round of dialog history, constructing internal dialogue state representations defined as 2-tuples of vision-language representations. These representations effectively ground the current question, enabling the generation of accurate answers. Experimental results on the VisDial v1.0 dataset demonstrate that MDST achieves a new state-of-the-art performance in the generative setting. Furthermore, through a series of human studies, we validate the effectiveness of MDST in generating long, consistent, and human-like answers while consistently answering a series of questions correctly.

[AI-39] Adaptive Data Quality Scoring Operations Framework using Drift-Aware Mechanism for Industrial Applications

链接: https://arxiv.org/abs/2408.06724
作者: Firas Bayram,Bestoun S. Ahmed,Erik Hallin
关键词-EN: data-driven artificial intelligence, incoming data streams, data quality scoring, artificial intelligence, trustworthy decision-making
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
*备注: 17 pages

点击查看摘要

Abstract:Within data-driven artificial intelligence (AI) systems for industrial applications, ensuring the reliability of the incoming data streams is an integral part of trustworthy decision-making. An approach to assess data validity is data quality scoring, which assigns a score to each data point or stream based on various quality dimensions. However, certain dimensions exhibit dynamic qualities, which require adaptation on the basis of the system’s current conditions. Existing methods often overlook this aspect, making them inefficient in dynamic production environments. In this paper, we introduce the Adaptive Data Quality Scoring Operations Framework, a novel framework developed to address the challenges posed by dynamic quality dimensions in industrial data streams. The framework introduces an innovative approach by integrating a dynamic change detector mechanism that actively monitors and adapts to changes in data quality, ensuring the relevance of quality scores. We evaluate the proposed framework performance in a real-world industrial use case. The experimental results reveal high predictive performance and efficient processing time, highlighting its effectiveness in practical quality-driven AI applications.
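The drift-aware scoring idea can be caricatured in a few lines: keep a rolling reference window, score each new point by its deviation from the window, and rebase the window when a change is detected instead of permanently penalizing the stream. Everything here (window size, the 3-sigma rule, the score formula) is an assumption for illustration, not the paper's framework:

```python
from collections import deque
import statistics

class DriftAwareScorer:
    """Toy adaptive quality score: a rolling window tracks the stream's
    recent distribution; a point far outside it scores low, and a detected
    drift resets the reference window so later scores stay relevant."""

    def __init__(self, window=50, k=3.0):
        self.ref = deque(maxlen=window)   # reference window of recent values
        self.k = k                        # drift threshold in sigmas

    def score(self, value):
        if len(self.ref) < 10:            # warm-up: not enough reference data
            self.ref.append(value)
            return 1.0
        mu = statistics.fmean(self.ref)
        sd = statistics.pstdev(self.ref) or 1e-9
        z = abs(value - mu) / sd
        if z > self.k:                    # drift detected: rebase the window
            self.ref.clear()
        self.ref.append(value)
        return max(0.0, 1.0 - z / self.k)
```

Production-grade detectors (e.g. ADWIN or Page-Hinkley) replace the naive z-test, but the adaptation loop, monitor, detect, rebase, is the same shape.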

[AI-40] Computation-friendly Graph Neural Network Design by Accumulating Knowledge on Large Language Models

链接: https://arxiv.org/abs/2408.06717
作者: Jialiang Wang,Shimin Di,Hanmo Liu,Zhili Wang,Jiachuan Wang,Lei Chen,Xiaofang Zhou
关键词-EN: Graph Neural Networks, Neural Networks, shown remarkable success, Graph Neural, Networks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs), like other neural networks, have shown remarkable success but are hampered by the complexity of their architecture designs, which heavily depend on specific data and tasks. Traditionally, designing proper architectures involves trial and error, which requires intensive manual effort to optimize various components. To reduce human workload, researchers try to develop automated algorithms to design GNNs. However, both experts and automated algorithms suffer from two major issues in designing GNNs: 1) the substantial computational resources expended in repeatedly trying candidate GNN architectures until a feasible design is achieved, and 2) the intricate and prolonged processes required for humans or algorithms to accumulate knowledge of the interrelationship between graphs, GNNs, and performance. To further enhance the automation of GNN architecture design, we propose a computation-friendly way to empower Large Language Models (LLMs) with specialized knowledge in designing GNNs, thereby drastically shortening the computational overhead and development cycle of designing GNN architectures. Our framework begins by establishing a knowledge retrieval pipeline that comprehends the intercorrelations between graphs, GNNs, and performance. This pipeline converts past model design experiences into structured knowledge for LLM reference, allowing it to quickly suggest initial model proposals. Subsequently, we introduce a knowledge-driven search strategy that emulates the exploration-exploitation process of human experts, enabling quick refinement of initial proposals within a promising scope. Extensive experiments demonstrate that our framework can efficiently deliver promising (e.g., Top-5.77%) initial model proposals for unseen datasets within seconds and without any prior training and achieve outstanding search performance in a few iterations. 
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2408.06717 [cs.LG] (or arXiv:2408.06717v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2408.06717

[AI-41] Variational Learning of Gaussian Process Latent Variable Models through Stochastic Gradient Annealed Importance Sampling

链接: https://arxiv.org/abs/2408.06710
作者: Jian Xu,Shian Du,Junmei Yang,Qianli Ma,Delu Zeng
关键词-EN: Gaussian Process Latent, Latent Variable Models, Process Latent Variable, Gaussian Process, Process Latent
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Gaussian Process Latent Variable Models (GPLVMs) have become increasingly popular for unsupervised tasks such as dimensionality reduction and missing data recovery due to their flexibility and non-linear nature. An importance-weighted version of the Bayesian GPLVMs has been proposed to obtain a tighter variational bound. However, this version of the approach is primarily limited to analyzing simple data structures, as the generation of an effective proposal distribution can become quite challenging in high-dimensional spaces or with complex data sets. In this work, we propose an Annealed Importance Sampling (AIS) approach to address these issues. By transforming the posterior into a sequence of intermediate distributions using annealing, we combine the strengths of Sequential Monte Carlo samplers and VI to explore a wider range of posterior distributions and gradually approach the target distribution. We further propose an efficient algorithm by reparameterizing all variables in the evidence lower bound (ELBO). Experimental results on both toy and image datasets demonstrate that our method outperforms state-of-the-art methods in terms of tighter variational bounds, higher log-likelihoods, and more robust convergence.
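Annealed Importance Sampling itself is easy to sketch in one dimension, outside the paper's GPLVM setting: anneal from a tractable proposal to an unnormalized target through geometric intermediate distributions, accumulating importance weights along the way. The target, schedule, and Metropolis kernel below are all toy choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_p0(x):                        # proposal: standard normal (normalized)
    return -0.5 * x**2 - 0.5 * np.log(2.0 * np.pi)

def log_p1(x):                        # unnormalized target: N(2, 0.5^2) kernel
    return -0.5 * ((x - 2.0) / 0.5) ** 2

betas = np.linspace(0.0, 1.0, 51)     # annealing schedule beta_0=0 ... beta_T=1
n = 4000
x = rng.normal(size=n)                # exact samples from p0
logw = np.zeros(n)

for b0, b1 in zip(betas[:-1], betas[1:]):
    # incremental AIS weight: (beta_t - beta_{t-1}) * log(p1/p0) at current x
    logw += (b1 - b0) * (log_p1(x) - log_p0(x))
    # one Metropolis step targeting the intermediate p0^(1-b1) * p1^b1
    log_pi = lambda z, b=b1: (1.0 - b) * log_p0(z) + b * log_p1(z)
    prop = x + rng.normal(scale=0.5, size=n)
    accept = np.log(rng.uniform(size=n)) < log_pi(prop) - log_pi(x)
    x = np.where(accept, prop, x)

Z_est = np.exp(logw).mean()           # estimates the target's normalizer
```

The true normalizer here is sqrt(2*pi)*0.5 by the Gaussian integral, so the estimate can be checked directly; in the paper the same weighting machinery tightens the variational bound of the GPLVM instead.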

[AI-42] Information Geometry and Beta Link for Optimizing Sparse Variational Student-t Processes

链接: https://arxiv.org/abs/2408.06699
作者: Jian Xu,Delu Zeng,John Paisley
关键词-EN: Student-t Processes, variational Student-t Processes, enhance computational efficiency, sparse variational Student-t, termed sparse variational
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recently, a sparse version of Student-t Processes, termed sparse variational Student-t Processes, has been proposed to enhance computational efficiency and flexibility for real-world datasets using stochastic gradient descent. However, traditional gradient descent methods like Adam may not fully exploit the parameter space geometry, potentially leading to slower convergence and suboptimal performance. To mitigate these issues, we adopt natural gradient methods from information geometry for variational parameter optimization of Student-t Processes. This approach leverages the curvature and structure of the parameter space, utilizing tools such as the Fisher information matrix which is linked to the Beta function in our model. This method provides robust mathematical support for the natural gradient algorithm when using Student’s t-distribution as the variational distribution. Additionally, we present a mini-batch algorithm for efficiently computing natural gradients. Experimental results across four benchmark datasets demonstrate that our method consistently accelerates convergence speed.

[AI-43] SlotLifter: Slot-guided Feature Lifting for Learning Object-centric Radiance Fields ECCV2024

链接: https://arxiv.org/abs/2408.06697
作者: Yu Liu,Baoxiong Jia,Yixin Chen,Siyuan Huang
关键词-EN: underpins human-level generalization, distill object-centric abstractions, intricate visual scenes, visual scenes underpins, scenes underpins human-level
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: Accepted by ECCV 2024. Project website: this https URL

点击查看摘要

Abstract:The ability to distill object-centric abstractions from intricate visual scenes underpins human-level generalization. Despite the significant progress in object-centric learning methods, learning object-centric representations in the 3D physical world remains a crucial challenge. In this work, we propose SlotLifter, a novel object-centric radiance model addressing scene reconstruction and decomposition jointly via slot-guided feature lifting. Such a design unites object-centric learning representations and image-based rendering methods, offering state-of-the-art performance in scene decomposition and novel-view synthesis on four challenging synthetic and four complex real-world datasets, outperforming existing 3D object-centric learning methods by a large margin. Through extensive ablative studies, we showcase the efficacy of designs in SlotLifter, revealing key insights for potential future directions.

[AI-44] DC3DO: Diffusion Classifier for 3D Objects

链接: https://arxiv.org/abs/2408.06693
作者: Nursena Koprucu,Meher Shashwat Nigam,Shicheng Xu(Luke),Biruk Abere,Gabriele Dominici,Andrew Rodriguez,Sharvaree Vadgam,Berfin Inal,Alberto Tono
关键词-EN: Geoffrey Hinton emphasis, Inspired by Geoffrey, Geoffrey Hinton, Hinton emphasis, learn to generate
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computational Geometry (cs.CG)
*备注:

点击查看摘要

Abstract:Inspired by Geoffrey Hinton's emphasis on generative modeling ("To recognize shapes, first learn to generate them"), we explore the use of 3D diffusion models for object classification. Leveraging the density estimates from these models, our approach, the Diffusion Classifier for 3D Objects (DC3DO), enables zero-shot classification of 3D shapes without additional training. On average, our method achieves a 12.5 percent improvement compared to its multiview counterparts, demonstrating superior multimodal reasoning over discriminative approaches. DC3DO employs a class-conditional diffusion model trained on ShapeNet, and we run inferences on point clouds of chairs and cars. This work highlights the potential of generative models in 3D object classification.
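The diffusion-classifier idea reduces to comparing class-conditional denoising losses via Bayes' rule; a minimal sketch, assuming the per-class losses have already been computed (no actual 3D diffusion model here):

```python
import math

def diffusion_classify(losses_per_class):
    """Zero-shot classification from a class-conditional diffusion model.

    losses_per_class[c] is the average denoising (ELBO) loss of the
    input under class condition c; a lower loss indicates a higher
    class-conditional density estimate. With a uniform prior, Bayes'
    rule gives p(c|x) proportional to exp(-loss_c).
    (Schematic of the diffusion-classifier idea; DC3DO's model
    operates on ShapeNet point clouds.)
    """
    logits = [-l for l in losses_per_class]
    m = max(logits)                      # stabilize the softmax
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return probs.index(max(probs)), probs
```

For example, losses `[2.0, 0.5, 3.0]` select class 1, the class under which the input is easiest to denoise.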

[AI-45] Masked Image Modeling: A Survey

链接: https://arxiv.org/abs/2408.06687
作者: Vlad Hondru,Florinel Alin Croitoru,Shervin Minaee,Radu Tudor Ionescu,Nicu Sebe
关键词-EN: powerful self-supervised learning, self-supervised learning technique, survey recent studies, computer vision, approach that emerged
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this work, we survey recent studies on masked image modeling (MIM), an approach that emerged as a powerful self-supervised learning technique in computer vision. The MIM task involves masking some information, e.g. pixels, patches, or even latent representations, and training a model, usually an autoencoder, to predict the missing information by using the context available in the visible part of the input. We identify and formalize two categories of approaches on how to implement MIM as a pretext task, one based on reconstruction and one based on contrastive learning. Then, we construct a taxonomy and review the most prominent papers in recent years. We complement the manually constructed taxonomy with a dendrogram obtained by applying a hierarchical clustering algorithm. We further identify relevant clusters via manually inspecting the resulting dendrogram. Our review also includes datasets that are commonly used in MIM research. We aggregate the performance results of various masked image modeling methods on the most popular datasets, to facilitate the comparison of competing methods. Finally, we identify research gaps and propose several interesting directions of future work.
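The reconstruction-based pretext task starts from a random patch mask; a generic sketch of the masking step (a hypothetical helper, not a specific method from the survey):

```python
import random

def mask_patches(patches, mask_ratio=0.75, seed=0):
    """Random patch masking as used in reconstruction-based MIM.

    Returns the indices of visible patches and of masked patches; an
    autoencoder is then trained to predict the masked patches from the
    visible ones. mask_ratio=0.75 mirrors the common choice of hiding
    most of the input so the context signal is non-trivial.
    """
    rng = random.Random(seed)
    idx = list(range(len(patches)))
    rng.shuffle(idx)
    n_masked = int(len(patches) * mask_ratio)
    return sorted(idx[n_masked:]), sorted(idx[:n_masked])
```

With 16 patches and a 0.75 ratio, 4 patches stay visible and 12 become reconstruction targets.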

[AI-46] Leveraging Priors via Diffusion Bridge for Time Series Generation

链接: https://arxiv.org/abs/2408.06672
作者: Jinseong Park,Seungyun Lee,Woojin Jeong,Yujin Choi,Jaewook Lee
关键词-EN: hypothesis test techniques, Time series generation, Time series, series generation, series
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Time series generation is widely used in real-world applications such as simulation, data augmentation, and hypothesis test techniques. Recently, diffusion models have emerged as the de facto approach for time series generation, emphasizing diverse synthesis scenarios based on historical or correlated time series data streams. Since time series have unique characteristics, such as fixed time order and data scaling, standard Gaussian prior might be ill-suited for general time series generation. In this paper, we exploit the usage of diverse prior distributions for synthesis. Then, we propose TimeBridge, a framework that enables flexible synthesis by leveraging diffusion bridges to learn the transport between chosen prior and data distributions. Our model covers a wide range of scenarios in time series diffusion models, which leverages (i) data- and time-dependent priors for unconditional synthesis, and (ii) data-scale preserving synthesis with a constraint as a prior for conditional generation. Experimentally, our model achieves state-of-the-art performance in both unconditional and conditional time series generation tasks.
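A diffusion bridge pinned at a prior sample and a data sample can be sketched with a Brownian bridge (a generic bridge form under stated assumptions; the exact TimeBridge parameterization may differ):

```python
import random

def brownian_bridge_sample(x0, x1, t, sigma=1.0, rng=None):
    """Sample x_t on a Brownian bridge pinned at x0 (prior) and x1 (data).

    mean = (1-t)*x0 + t*x1 and var = sigma^2 * t * (1-t) for t in [0, 1],
    so the path starts exactly at x0 and ends exactly at x1. A bridge of
    this form lets a diffusion model transport an arbitrary (e.g.
    data-dependent) prior to the data distribution instead of a
    standard Gaussian.
    """
    rng = rng or random.Random(0)
    mean = (1 - t) * x0 + t * x1
    std = sigma * (t * (1 - t)) ** 0.5
    return mean + std * rng.gauss(0.0, 1.0)
```

At t=0 the sample is the prior point and at t=1 it is the data point, which is what makes a non-Gaussian prior usable for time series whose scale must be preserved.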

[AI-47] RW-NSGCN: A Robust Approach to Structural Attacks via Negative Sampling

链接: https://arxiv.org/abs/2408.06665
作者: Shuqi He,Jun Zhuang,Ding Wang,Jun Song
关键词-EN: Graph Neural Networks, predicting user interests, Graph Neural, Graph Convolutional Network, Sampling Graph Convolutional
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Node classification using Graph Neural Networks (GNNs) has been widely applied in various practical scenarios, such as predicting user interests and detecting communities in social networks. However, recent studies have shown that graph-structured networks often contain potential noise and attacks, in the form of topological perturbations and weight disturbances, which can lead to decreased classification performance in GNNs. To improve the robustness of the model, we propose a novel method: Random Walk Negative Sampling Graph Convolutional Network (RW-NSGCN). Specifically, RW-NSGCN integrates the Random Walk with Restart (RWR) and PageRank (PGR) algorithms for negative sampling and employs a Determinantal Point Process (DPP)-based GCN for convolution operations. RWR leverages both global and local information to manage noise and local variations, while PGR assesses node importance to stabilize the topological structure. The DPP-based GCN ensures diversity among negative samples and aggregates their features to produce robust node embeddings, thereby improving classification performance. Experimental results demonstrate that the RW-NSGCN model effectively addresses network topology attacks and weight instability, increasing the accuracy of anomaly detection and overall stability. In terms of classification accuracy, RW-NSGCN significantly outperforms existing methods, showing greater resilience across various scenarios and effectively mitigating the impact of such vulnerabilities.
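The RWR component can be sketched as a power iteration over an adjacency list (only RWR itself is shown, not the negative-sampling pipeline built on top of its scores):

```python
def random_walk_with_restart(adj, seed_node, restart_p=0.15, iters=50):
    """Power-iteration Random Walk with Restart (RWR) on an adjacency list.

    At each step, (1 - restart_p) of the probability mass follows the
    graph edges and restart_p teleports back to the seed node, so the
    stationary scores mix global structure with locality around the
    seed, which is what lets RW-NSGCN manage noise and local variation.
    """
    n = len(adj)
    p = [0.0] * n
    p[seed_node] = 1.0
    for _ in range(iters):
        nxt = [0.0] * n
        for u, neighbors in enumerate(adj):
            if not neighbors:
                nxt[u] += (1 - restart_p) * p[u]  # dangling node keeps its mass
                continue
            share = (1 - restart_p) * p[u] / len(neighbors)
            for v in neighbors:
                nxt[v] += share
        nxt[seed_node] += restart_p
        p = nxt
    return p
```

On a small chain graph, the seed node ends up with the highest score and mass decays with distance from it.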

[AI-48] Amuro Char: Analyzing the Relationship between Pre-Training and Fine-Tuning of Large Language Models

链接: https://arxiv.org/abs/2408.06663
作者: Kaiser Sun,Mark Dredze
关键词-EN: large text corpus, large language models, language models leads, large language, large text
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The development of large language models leads to the formation of a pre-train-then-align paradigm, in which the model is typically pre-trained on a large text corpus and undergoes a tuning stage to align the model with human preference or downstream tasks. In this work, we investigate the relationship between pre-training and fine-tuning by fine-tuning multiple intermediate pre-trained model checkpoints. Our results on 18 datasets suggest that i) continual pre-training improves the model in a latent way that unveils after fine-tuning; ii) with extra fine-tuning, the datasets on which the model does not demonstrate capability during pre-training gain much more than those on which the model already performs well; iii) although the model benefits significantly through supervised fine-tuning, it may forget previously known domain knowledge and the tasks that are not seen during fine-tuning; iv) the model exhibits high sensitivity to evaluation prompts after supervised fine-tuning, but this sensitivity can be alleviated by more pre-training.

[AI-49] Hierarchical Structured Neural Network for Retrieval

链接: https://arxiv.org/abs/2408.06653
作者: Kaushik Rangadurai,Siyang Yuan,Minhui Huang,Yiqun Liu,Golnaz Ghasemiesfeh,Yunchen Pu,Xinfeng Xie,Xingfeng He,Fangzhou Xu,Andrew Cui,Vidhoon Viswanathan,Yan Dong,Liang Xiong,Lin Yang,Liang Wang,Jiyan Yang,Chonglin Sun
关键词-EN: Embedding Based Retrieval, Embedding Based, Nearest Neighbor Search, Approximate Nearest Neighbor, learn embeddings
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注: 9 pages

点击查看摘要

Abstract:Embedding Based Retrieval (EBR) is a crucial component of the retrieval stage in (Ads) Recommendation System that utilizes Two Tower or Siamese Networks to learn embeddings for both users and items (ads). It then employs an Approximate Nearest Neighbor Search (ANN) to efficiently retrieve the most relevant ads for a specific user. Despite the recent rise to popularity in the industry, these systems have a couple of limitations. Firstly, the Two Tower model architecture uses a single dot-product interaction which, despite its efficiency, fails to capture the data distribution in practice. Secondly, the centroid representation and cluster assignment, which are components of ANN, occur after the training process has been completed. As a result, they do not take into account the optimization criteria used for the retrieval model. In this paper, we present Hierarchical Structured Neural Network (HSNN), a deployed jointly optimized hierarchical clustering and neural network model that can take advantage of sophisticated interactions and model architectures that are more common in the ranking stages while maintaining a sub-linear inference cost. We achieve 6.5% improvement in offline evaluation and also demonstrate 1.22% online gains through A/B experiments. HSNN has been successfully deployed into the Ads Recommendation system and is currently handling a major portion of the traffic. The paper shares our experience in developing this system, dealing with challenges like freshness, volatility, cold start recommendations, cluster collapse and lessons deploying the model in a large scale retrieval production system.
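The baseline single dot-product interaction that HSNN generalizes looks like this (toy embeddings, not a trained model; production systems replace the exhaustive scan with an ANN index):

```python
def retrieve_top_k(user_emb, ad_embs, k=2):
    """Two-tower retrieval scored by a single dot product.

    Each ad is scored as <user, ad> and the indices of the top-k ads
    are returned. This is exactly the interaction whose limited
    expressiveness motivates HSNN's richer, jointly optimized
    hierarchical model.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    scored = sorted(enumerate(ad_embs),
                    key=lambda pair: dot(user_emb, pair[1]),
                    reverse=True)
    return [i for i, _ in scored[:k]]
```

A user vector aligned with one ad's direction will always rank that ad first, regardless of any higher-order feature interactions, which is the distribution-capture limitation the abstract points to.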

[AI-50] EditScribe: Non-Visual Image Editing with Natural Language Verification Loops

链接: https://arxiv.org/abs/2408.06632
作者: Ruei-Che Chang,Yuxuan Liu,Lotus Zhang,Anhong Guo
关键词-EN: requires precise visual, precise visual evaluation, Image editing, iterative process, process that requires
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: ASSETS 2024

点击查看摘要

Abstract:Image editing is an iterative process that requires precise visual evaluation and manipulation for the output to match the editing intent. However, current image editing tools do not provide accessible interaction nor sufficient feedback for blind and low vision individuals to achieve this level of control. To address this, we developed EditScribe, a prototype system that makes image editing accessible using natural language verification loops powered by large multimodal models. Using EditScribe, the user first comprehends the image content through initial general and object descriptions, then specifies edit actions using open-ended natural language prompts. EditScribe performs the image edit, and provides four types of verification feedback for the user to verify the performed edit, including a summary of visual changes, AI judgement, and updated general and object descriptions. The user can ask follow-up questions to clarify and probe into the edits or verification feedback, before performing another edit. In a study with ten blind or low-vision users, we found that EditScribe supported participants to perform and verify image edit actions non-visually. We observed different prompting strategies from participants, and their perceptions on the various types of verification feedback. Finally, we discuss the implications of leveraging natural language verification loops to make visual authoring non-visually accessible.

[AI-51] WorldScribe: Towards Context-Aware Live Visual Descriptions

链接: https://arxiv.org/abs/2408.06627
作者: Ruei-Che Chang,Yuxuan Liu,Anhong Guo
关键词-EN: aid blind people, visual descriptions, live visual descriptions, Automated live visual, autonomy and independence
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: UIST 2024

点击查看摘要

Abstract:Automated live visual descriptions can aid blind people in understanding their surroundings with autonomy and independence. However, providing descriptions that are rich, contextual, and just-in-time has been a long-standing challenge in accessibility. In this work, we develop WorldScribe, a system that generates automated live real-world visual descriptions that are customizable and adaptive to users’ contexts: (i) WorldScribe’s descriptions are tailored to users’ intents and prioritized based on semantic relevance. (ii) WorldScribe is adaptive to visual contexts, e.g., providing consecutively succinct descriptions for dynamic scenes, while presenting longer and detailed ones for stable settings. (iii) WorldScribe is adaptive to sound contexts, e.g., increasing volume in noisy environments, or pausing when conversations start. Powered by a suite of vision, language, and sound recognition models, WorldScribe introduces a description generation pipeline that balances the tradeoffs between their richness and latency to support real-time use. The design of WorldScribe is informed by prior work on providing visual descriptions and a formative study with blind participants. Our user study and subsequent pipeline evaluation show that WorldScribe can provide real-time and fairly accurate visual descriptions to facilitate environment understanding that is adaptive and customized to users’ contexts. Finally, we discuss the implications and further steps toward making live visual descriptions more context-aware and humanized.

[AI-52] Generalized knowledge-enhanced framework for biomedical entity and relation extraction

链接: https://arxiv.org/abs/2408.06618
作者: Minh Nguyen,Phuong Le
关键词-EN: recent years, increasing number, relation extraction, entity and relation, biomedical entity
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In recent years, there has been an increasing number of frameworks developed for biomedical entity and relation extraction. This research effort aims to address the accelerating growth in biomedical publications and the intricate nature of biomedical texts, which are written mainly for domain experts. To handle these challenges, we develop a novel framework that utilizes external knowledge to construct a task-independent and reusable background knowledge graph for biomedical entity and relation extraction. The design of our model is inspired by how humans learn domain-specific topics. In particular, humans often first acquire the most basic and common knowledge regarding a field to build the foundational knowledge and then use that as a basis for extending to various specialized topics. Our framework employs such a common-knowledge-sharing mechanism to build a general neural-network knowledge graph whose learned knowledge transfers effectively to different domain-specific biomedical texts. Experimental evaluations demonstrate that our model, equipped with this generalized and cross-transferable knowledge base, achieves competitive performance benchmarks, including BioRelEx for binding interaction detection and ADE for Adverse Drug Effect identification.

[AI-53] Simple but Effective Compound Geometric Operations for Temporal Knowledge Graph Completion

链接: https://arxiv.org/abs/2408.06603
作者: Rui Ying,Mengting Hu,Jianfeng Wu,Yalan Xie,Xiaoyi Liu,Zhunheng Wang,Ming Jiang,Hang Gao,Linlin Zhang,Renhong Cheng
关键词-EN: temporal knowledge graphs, Temporal knowledge, graph completion aims, knowledge graph completion, knowledge graphs
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Temporal knowledge graph completion aims to infer the missing facts in temporal knowledge graphs. Current approaches usually embed factual knowledge into continuous vector space and apply geometric operations to learn potential patterns in temporal knowledge graphs. However, these methods only adopt a single operation, which may have limitations in capturing the complex temporal dynamics present in temporal knowledge graphs. Therefore, we propose a simple but effective method, i.e. TCompoundE, which is specially designed with two geometric operations, including time-specific and relation-specific operations. We provide mathematical proofs to demonstrate the ability of TCompoundE to encode various relation patterns. Experimental results show that our proposed model significantly outperforms existing temporal knowledge graph embedding models. Our code is available at this https URL.
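One way to compose a time-specific and a relation-specific geometric operation is to scale the head embedding by a time vector and translate it by a relation vector before comparing it to the tail; this is an illustrative composition of the two operation types named in the abstract, and the exact operators in TCompoundE may differ:

```python
def tcompound_score(head, rel_translation, time_scale, tail):
    """Score a (head, relation, tail, time) quadruple with compound ops.

    The head embedding is first scaled element-wise by a time-specific
    vector, then translated by a relation-specific vector, and the
    result is compared to the tail by negative L1 distance, so a
    perfect match scores 0 and worse matches score lower.
    (Hypothetical score function for illustration.)
    """
    transformed = [h * s + t for h, s, t in zip(head, time_scale, rel_translation)]
    return -sum(abs(a - b) for a, b in zip(transformed, tail))
```

Chaining a scaling with a translation is strictly more expressive than either operation alone, which is the intuition behind using compound rather than single geometric operations.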

[AI-54] Super-intelligence or Superstition? Exploring Psychological Factors Underlying Unwarranted Belief in AI Predictions

链接: https://arxiv.org/abs/2408.06602
作者: Eunhae Lee,Pat Pataranutaporn,Judith Amores,Pattie Maes
关键词-EN: study investigates psychological, investigates psychological factors, psychological factors influencing, factors influencing belief, personal behavior
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This study investigates psychological factors influencing belief in AI predictions about personal behavior, comparing it to belief in astrology and personality-based predictions. Through an experiment with 238 participants, we examined how cognitive style, paranormal beliefs, AI attitudes, personality traits, and other factors affect perceived validity, reliability, usefulness, and personalization of predictions from different sources. Our findings reveal that belief in AI predictions is positively correlated with belief in predictions based on astrology and personality psychology. Notably, paranormal beliefs and positive AI attitudes significantly increased perceived validity, reliability, usefulness, and personalization of AI predictions. Conscientiousness was negatively correlated with belief in predictions across all sources, and interest in the prediction topic increased believability across predictions. Surprisingly, cognitive style did not significantly influence belief in predictions. These results highlight the “rational superstition” phenomenon in AI, where belief is driven more by mental heuristics and intuition than critical evaluation. We discuss implications for designing AI systems and communication strategies that foster appropriate trust and skepticism. This research contributes to our understanding of the psychology of human-AI interaction and offers insights for the design and deployment of AI systems.

[AI-55] A Perspective on Large Language Models Intelligent Machines and Knowledge Acquisition

链接: https://arxiv.org/abs/2408.06598
作者: Vladimir Cherkassky,Eng Hock Lee
关键词-EN: Large Language Models, Large Language, Language Models, text documents, remarkable ability
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are known for their remarkable ability to generate synthesized ‘knowledge’, such as text documents, music, images, etc. However, there is a huge gap between LLMs’ and human capabilities for understanding abstract concepts and reasoning. We discuss these issues in a larger philosophical context of human knowledge acquisition and the Turing test. In addition, we illustrate the limitations of LLMs by analyzing GPT-4 responses to questions ranging from science and math to common sense reasoning. These examples show that GPT-4 can often imitate human reasoning, even though it lacks understanding. However, LLM responses are synthesized from a large LLM model trained on all available data. In contrast, human understanding is based on a small number of abstract concepts. Based on this distinction, we discuss the impact of LLMs on acquisition of human knowledge and education.

[AI-56] Social Debiasing for Fair Multi-modal LLMs

链接: https://arxiv.org/abs/2408.06569
作者: Harry Cheng,Yangyang Guo,Qingpei Guo,Ming Yang,Tian Gan,Liqiang Nie
关键词-EN: Multi-modal Large Language, Large Language Models, Multi-modal Large, Large Language, offering powerful vision-language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Multi-modal Large Language Models (MLLMs) have advanced significantly, offering powerful vision-language understanding capabilities. However, these models often inherit severe social biases from their training datasets, leading to unfair predictions based on attributes like race and gender. This paper addresses the issue of social biases in MLLMs by i) Introducing a comprehensive Counterfactual dataset with Multiple Social Concepts (CMSC), which provides a more diverse and extensive training set compared to existing datasets. ii) Proposing an Anti-Stereotype Debiasing strategy (ASD). Our method works by revisiting the MLLM training process, rescaling the autoregressive loss function, and improving data sampling methods to counteract biases. Through extensive experiments on various MLLMs, our CMSC dataset and ASD method demonstrate a significant reduction in social biases while maintaining the models’ original performance.

[AI-57] AquilaMoE: Efficient Training for MoE Models with Scale-Up and Scale-Out Strategies

链接: https://arxiv.org/abs/2408.06567
作者: Bo-Wen Zhang,Liangdong Wang,Ye Yuan,Jijie Li,Shuhao Gu,Mengdi Zhao,Xinya Wu,Guang Liu,Chengwei Wu,Hanyu Zhao,Li Du,Yiming Ju,Quanyue Ma,Yulong Ao,Yingli Zhao,Songhe Zhu,Zhou Cao,Dong Liang,Yonghua Lin,Ming Zhang,Shunfei Wang,Yanxin Zhou,Min Ye,Xuekai Chen,Xinyang Yu,Xiangjun Huang,Jian Yang
关键词-EN: recent years, gradually increased, grown exponentially, rapid application, application of large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In recent years, with the rapid application of large language models across various fields, the scale of these models has gradually increased, and the resources required for their pre-training have grown exponentially. Training an LLM from scratch costs substantial computational resources, while scaling up from a smaller model is a more efficient approach and has thus attracted significant attention. In this paper, we present AquilaMoE, a cutting-edge bilingual 8*16B Mixture of Experts (MoE) language model that has 8 experts with 16 billion parameters each and is developed using an innovative training methodology called EfficientScale. This approach optimizes performance while minimizing data requirements through a two-stage process. The first stage, termed Scale-Up, initializes the larger model with weights from a pre-trained smaller model, enabling substantial knowledge transfer and continuous pretraining with significantly less data. The second stage, Scale-Out, uses a pre-trained dense model to initialize the MoE experts, further enhancing knowledge transfer and performance. Extensive validation experiments on 1.8B and 7B models compared various initialization schemes, achieving models that maintain and reduce loss during continuous pretraining. Utilizing the optimal scheme, we successfully trained a 16B model and subsequently the 8*16B AquilaMoE model, demonstrating significant improvements in performance and training efficiency.
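The Scale-Out stage, initializing MoE experts from a dense model, can be sketched as copying the dense FFN weights into each expert with optional symmetry-breaking noise (a schematic of the idea only; the actual procedure operates on full transformer checkpoints and also covers the Scale-Up stage):

```python
import random

def scale_out_init(dense_ffn_weights, num_experts=8, noise=0.0, seed=0):
    """Initialize MoE experts from a pre-trained dense FFN (Scale-Out).

    Each expert starts as a copy of the dense weights, optionally
    perturbed with small Gaussian noise so that the experts can
    specialize during continued pretraining instead of staying
    identical.
    """
    rng = random.Random(seed)
    return [[w + rng.gauss(0.0, noise) for w in dense_ffn_weights]
            for _ in range(num_experts)]
```

With `noise=0.0` all experts start exactly equal to the dense model, so the MoE initially computes the same function as its dense parent, which is what makes the knowledge transfer lossless at initialization.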

[AI-58] HDRGS: High Dynamic Range Gaussian Splatting

链接: https://arxiv.org/abs/2408.06543
作者: Jiahao Wu,Lu Xiao,Chao Wang,Rui Peng,Kaiqiang Xiong,Ronggang Wang
关键词-EN: witnessed substantial advancements, neural radiance field, high dynamic range, radiance field, years have witnessed
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent years have witnessed substantial advancements in the field of 3D reconstruction from 2D images, particularly following the introduction of the neural radiance field (NeRF) technique. However, reconstructing a 3D high dynamic range (HDR) radiance field, which aligns more closely with real-world conditions, from 2D multi-exposure low dynamic range (LDR) images continues to pose significant challenges. Approaches to this issue fall into two categories: grid-based and implicit-based. Implicit methods, using multi-layer perceptrons (MLP), face inefficiencies, limited solvability, and overfitting risks. Conversely, grid-based methods require significant memory and struggle with image quality and long training times. In this paper, we introduce Gaussian Splatting, a recent high-quality, real-time 3D reconstruction technique, into this domain. We further develop the High Dynamic Range Gaussian Splatting (HDR-GS) method, designed to address the aforementioned challenges. This method enhances color dimensionality by including luminance and uses an asymmetric grid for tone-mapping, swiftly and precisely converting pixel irradiance to color. Our approach improves HDR scene recovery accuracy and integrates a novel coarse-to-fine strategy to speed up model convergence, enhancing robustness against sparse viewpoints and exposure extremes, and preventing local optima. Extensive testing confirms that our method surpasses current state-of-the-art techniques in both synthetic and real-world scenarios. Code will be released at this https URL.

[AI-59] Value of Information and Reward Specification in Active Inference and POMDPs

链接: https://arxiv.org/abs/2408.06542
作者: Ran Wei
关键词-EN: recently gained popularity, gained popularity due, Expected free energy, Expected free, epistemic component
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Expected free energy (EFE) is a central quantity in active inference which has recently gained popularity due to its intuitive decomposition of the expected value of control into a pragmatic and an epistemic component. While numerous conjectures have been made to justify EFE as a decision making objective function, the most widely accepted is still its intuitiveness and resemblance to variational free energy in approximate Bayesian inference. In this work, we take a bottom-up approach and ask: taking EFE as given, what’s the resulting agent’s optimality gap compared with a reward-driven reinforcement learning (RL) agent, which is well understood? By casting EFE under a particular class of belief MDP and using analysis tools from RL theory, we show that EFE approximates the Bayes optimal RL policy via information value. We discuss the implications for objective specification of active inference agents.
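The standard EFE decomposition into a pragmatic and an epistemic term can be written out for a small discrete-state model (a sketch of the textbook decomposition, not of the paper's belief-MDP analysis):

```python
import math

def expected_free_energy(q_outcomes, preferred, posterior_given_o, prior_states):
    """EFE of a policy as pragmatic value minus epistemic value.

    EFE = -E_q[log p_pref(o)]              (pragmatic: surprise under
                                            preferred outcomes)
        - E_q[KL(q(s|o) || q(s))]          (epistemic: expected
                                            information gain)
    A lower EFE is better: the agent trades off reaching preferred
    outcomes against resolving uncertainty about hidden states.
    """
    def kl(p, q):
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    pragmatic = -sum(q_o * math.log(preferred[o])
                     for o, q_o in enumerate(q_outcomes) if q_o > 0)
    epistemic = sum(q_o * kl(posterior_given_o[o], prior_states)
                    for o, q_o in enumerate(q_outcomes))
    return pragmatic - epistemic
```

When observing an outcome fully resolves the hidden state, the epistemic term offsets the pragmatic one, which is exactly the "information value" through which the paper relates EFE to the Bayes optimal RL policy.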

[AI-60] Chain-of-Strategy Planning with LLMs: Aligning the Generation of Psychotherapy Dialogue with Strategy in Motivational Interviewing

链接: https://arxiv.org/abs/2408.06527
作者: Xin Sun,Xiao Tang,Abdallah El Ali,Zhuying Li,Xiaoyu Shen,Pengjie Ren,Jan de Wit,Jiahuan Pei,Jos A. Bosch
关键词-EN: large language models, Motivational Interviewing, Recent advancements, language models, advancements in large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent advancements in large language models (LLMs) have shown promise in generating psychotherapeutic dialogues, especially in Motivational Interviewing (MI). However, how to employ strategies, a set of MI skills, to generate therapeutic-adherent conversations with explainability is underexplored. We propose an approach called strategy-aware dialogue generation with Chain-of-Strategy (CoS) planning, which first predicts MI strategies as reasoning and utilizes these strategies to guide the subsequent dialogue generation. It brings the potential for controllable and explainable generation in psychotherapy by aligning the generated MI dialogues with therapeutic strategies. Extensive experiments including automatic and human evaluations are conducted to validate the effectiveness of the MI strategy. Our findings demonstrate the potential of LLMs in producing strategically aligned dialogues and suggest directions for practical applications in psychotherapeutic settings.

[AI-61] Learned Ranking Function: From Short-term Behavior Predictions to Long-term User Satisfaction RECSYS24

链接: https://arxiv.org/abs/2408.06512
作者: Yi Wu,Daryl Chang,Jennifer She,Zhe Zhao,Li Wei,Lukasz Heldt
关键词-EN: Learned Ranking Function, Learned Ranking, short-term user-item behavior, user-item behavior predictions, Ranking Function
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注: RecSys 24

点击查看摘要

Abstract:We present the Learned Ranking Function (LRF), a system that takes short-term user-item behavior predictions as input and outputs a slate of recommendations that directly optimizes for long-term user satisfaction. Most previous work is based on optimizing the hyperparameters of a heuristic function. We propose to model the problem directly as a slate optimization problem with the objective of maximizing long-term user satisfaction. We also develop a novel constraint optimization algorithm that stabilizes objective trade-offs for multi-objective optimization. We evaluate our approach with live experiments and describe its deployment on YouTube.

[AI-62] Fooling SHAP with Output Shuffling Attacks

链接: https://arxiv.org/abs/2408.06509
作者: Jun Yuan,Aritra Dasgupta
关键词-EN: discover feature attributions, SHAP, Explainable, discover feature, XAI methods
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Explainable AI (XAI) methods such as SHAP can help discover feature attributions in black-box models. If the method reveals a significant attribution from a “protected feature” (e.g., gender, race) on the model output, the model is considered unfair. However, adversarial attacks can subvert the detection of XAI methods. Previous approaches to constructing such an adversarial model require access to underlying data distribution, which may not be possible in many practical scenarios. We relax this constraint and propose a novel family of attacks, called shuffling attacks, that are data-agnostic. The proposed attack strategies can adapt any trained machine learning model to fool Shapley value-based explanations. We prove that Shapley values cannot detect shuffling attacks. However, algorithms that estimate Shapley values, such as linear SHAP and SHAP, can detect these attacks with varying degrees of effectiveness. We demonstrate the efficacy of the attack strategies by comparing the performance of linear SHAP and SHAP using real-world datasets.
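For context, exact Shapley values for a tiny model can be computed by enumerating feature permutations; the shuffling attacks in the paper perturb the model's outputs in a way that leaves these permutation averages for the protected feature unchanged (only the plain Shapley computation is shown, not the attack):

```python
import itertools

def shapley_values(model, baseline, x):
    """Exact Shapley values for a small model via permutation enumeration.

    Feature i's attribution is its average marginal contribution when
    added, in every possible order, to a coalition of already-revealed
    features; absent features are held at the baseline. Exponential
    cost, so practical only for a handful of features.
    """
    n = len(x)
    phi = [0.0] * n
    perms = list(itertools.permutations(range(n)))
    for perm in perms:
        current = list(baseline)
        prev = model(current)
        for i in perm:
            current[i] = x[i]      # reveal feature i
            val = model(current)
            phi[i] += val - prev   # marginal contribution of i
            prev = val
    return [p / len(perms) for p in phi]
```

For an additive model like `sum`, each feature's Shapley value is simply its own contribution, which makes additive models a convenient sanity check for Shapley estimators.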

[AI-63] Benchmarking tree species classification from proximally-sensed laser scanning data: introducing the FOR-species20K dataset

链接: https://arxiv.org/abs/2408.06507
作者: Stefano Puliti,Emily R. Lines,Jana Müllerová,Julian Frey,Zoe Schindler,Adrian Straker,Matthew J. Allen,Lukas Winiwarter,Nataliia Rehush,Hristina Hristova,Brent Murray,Kim Calders,Louise Terryn,Nicholas Coops,Bernhard Höfle,Samuli Junttila,Martin Krůček,Grzegorz Krok,Kamil Král,Shaun R. Levick,Linda Luck,Azim Missarov,Martin Mokroš,Harry J. F. Owen,Krzysztof Stereńczak,Timo P. Pitkänen,Nicola Puletti,Ninni Saarinen,Chris Hopkinson,Chiara Torresan,Enrico Tomelleri,Hannah Weiser,Rasmus Astrup
关键词-EN: offers significant potential, Proximally-sensed laser scanning, automatically identifying tree, scanning offers significant, Proximally-sensed laser
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Proximally-sensed laser scanning offers significant potential for automated forest data capture, but challenges remain in automatically identifying tree species without additional ground data. Deep learning (DL) shows promise for automation, yet progress is slowed by the lack of large, diverse, openly available labeled datasets of single tree point clouds. This has impacted the robustness of DL models and the ability to establish best practices for species classification. To overcome these challenges, the FOR-species20K benchmark dataset was created, comprising over 20,000 tree point clouds from 33 species, captured using terrestrial (TLS), mobile (MLS), and drone laser scanning (ULS) across various European forests, with some data from other regions. This dataset enables the benchmarking of DL models for tree species classification, including both point cloud-based (PointNet++, MinkNet, MLP-Mixer, DGCNNs) and multi-view image-based methods (SimpleView, DetailView, YOLOv5). 2D image-based models generally performed better (average OA = 0.77) than 3D point cloud-based models (average OA = 0.72), with consistent results across different scanning platforms and sensors. The top model, DetailView, was particularly robust, handling data imbalances well and generalizing effectively across tree sizes. The FOR-species20K dataset, available at this https URL, is a key resource for developing and benchmarking DL models for tree species classification using laser scanning data, providing a foundation for future advancements in the field.

[AI-64] Decentralized Cooperation in Heterogeneous Multi-Agent Reinforcement Learning via Graph Neural Network-Based Intrinsic Motivation

链接: https://arxiv.org/abs/2408.06503
作者: Jahir Sadik Monon,Deeparghya Dutta Barua,Md. Mosaddek Khan
关键词-EN: Multi-agent Reinforcement Learning, Multi-agent Reinforcement, Reinforcement Learning, control tasks, key framework
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: 8 pages, 4 figures

点击查看摘要

Abstract:Multi-agent Reinforcement Learning (MARL) is emerging as a key framework for various sequential decision-making and control tasks. Unlike their single-agent counterparts, multi-agent systems necessitate successful cooperation among the agents. The deployment of these systems in real-world scenarios often requires decentralized training, a diverse set of agents, and learning from infrequent environmental reward signals. These challenges become more pronounced under partial observability and the lack of prior knowledge about agent heterogeneity. While notable studies use intrinsic motivation (IM) to address reward sparsity or cooperation in decentralized settings, those dealing with heterogeneity typically assume centralized training, parameter sharing, and agent indexing. To overcome these limitations, we propose the CoHet algorithm, which utilizes a novel Graph Neural Network (GNN) based intrinsic motivation to facilitate the learning of heterogeneous agent policies in decentralized settings, under the challenges of partial observability and reward sparsity. Evaluation of CoHet in the Multi-agent Particle Environment (MPE) and Vectorized Multi-Agent Simulator (VMAS) benchmarks demonstrates superior performance compared to the state-of-the-art in a range of cooperative multi-agent scenarios. Our research is supplemented by an analysis of the impact of the agent dynamics model on the intrinsic motivation module, insights into the performance of different CoHet variants, and its robustness to an increasing number of heterogeneous agents.
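The abstract does not spell out CoHet's GNN-based intrinsic reward, but a common intrinsic-motivation pattern in this line of work is to reward an agent in proportion to its own dynamics model's prediction error, so that novel transitions earn extra learning signal. A minimal sketch of that general pattern (the linear toy model and all names here are illustrative assumptions, not the paper's GNN architecture):

```python
# Hedged sketch: intrinsic motivation as dynamics-model prediction error.
# CoHet's actual GNN-based formulation is not given in the abstract;
# predict_next_state and intrinsic_reward are illustrative stand-ins.
import math

def predict_next_state(state, action, weight=0.9):
    # Toy "learned" dynamics model: a damped linear guess of the next state.
    return [weight * s + 0.1 * action for s in state]

def intrinsic_reward(state, action, observed_next, scale=1.0):
    # Reward equals the model's prediction error: large error = novel transition.
    predicted = predict_next_state(state, action)
    error = math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed_next)))
    return scale * error

r = intrinsic_reward([1.0, 2.0], action=0.5, observed_next=[1.0, 1.9])
```

In a decentralized setting each agent would maintain its own dynamics model, which is what makes this style of reward usable without centralized training.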

[AI-65] Cross-Lingual Conversational Speech Summarization with Large Language Models

链接: https://arxiv.org/abs/2408.06484
作者: Max Nelson,Shannon Wotherspoon,Francis Keith,William Hartmann,Matthew Snover
关键词-EN: Cross-lingual conversational speech, Cross-lingual conversational, important problem, dearth of resources, conversational speech
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Cross-lingual conversational speech summarization is an important problem, but suffers from a dearth of resources. While transcriptions exist for a number of languages, translated conversational speech is rare and datasets containing summaries are non-existent. We build upon the existing Fisher and Callhome Spanish-English Speech Translation corpus by supplementing the translations with summaries. The summaries are generated using GPT-4 from the reference translations and are treated as ground truth. The task is to generate similar summaries in the presence of transcription and translation errors. We build a baseline cascade-based system using open-source speech recognition and machine translation models. We test a range of LLMs for summarization and analyze the impact of transcription and translation errors. Adapting the Mistral-7B model for this task performs significantly better than off-the-shelf models and matches the performance of GPT-4.
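The cascade the abstract describes chains three stages: speech recognition, machine translation, and summarization. A minimal sketch of that pipeline shape, with all three stage functions as toy stubs (the paper uses open-source ASR/MT models and an adapted Mistral-7B, none of which appear here):

```python
# Hedged sketch of the ASR -> MT -> summarization cascade. Every function
# body below is a stand-in stub; only the pipeline structure matches the
# system described in the abstract.

def transcribe(audio):
    # Stub ASR: pretend the "audio" is already a Spanish transcript.
    return audio

def translate(spanish_text):
    # Stub MT with a toy phrase table; a real system runs an MT model here.
    table = {"hola": "hello", "adios": "goodbye"}
    return " ".join(table.get(w, w) for w in spanish_text.split())

def summarize(english_text, max_words=5):
    # Stub summarizer (truncation); the paper adapts an LLM for this stage.
    return " ".join(english_text.split()[:max_words])

def cascade(audio):
    return summarize(translate(transcribe(audio)))

print(cascade("hola amigos adios"))  # → hello amigos goodbye
```

Because errors propagate stage to stage, the paper's analysis of transcription and translation errors amounts to perturbing the intermediate strings in exactly this chain.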

[AI-66] Towards Autonomous Agents: Adaptive-planning Reasoning and Acting in Language Models

链接: https://arxiv.org/abs/2408.06458
作者: Yen-Che Hsiao,Abhishek Dutta
关键词-EN: in-context learning algorithm, building autonomous decision-making, decision-making language agents, autonomous decision-making language, in-context learning
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:We propose a novel in-context learning algorithm for building autonomous decision-making language agents. The language agent continuously attempts to solve the same task by self-correcting each time the task fails. Our selected language agent demonstrates the ability to solve tasks in a text-based game environment. Our results show that the gemma-2-9b-it language model, using our proposed method, can successfully complete two of six tasks that failed in the first attempt. This highlights the effectiveness of our approach in enhancing the problem-solving capabilities of a single language model through self-correction, paving the way for more advanced autonomous agents. The code is publicly available at this https URL.
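The loop the abstract describes retries the same task, feeding each failure back into the model's context so it can self-correct. A minimal sketch of that control flow (`solve_attempt` stands in for an LLM call plus environment feedback, which are not specified beyond the abstract):

```python
# Hedged sketch of the retry-with-self-correction loop. The solver here is a
# toy function; the paper uses gemma-2-9b-it inside a text-based game.

def run_agent(task, solve_attempt, max_attempts=6):
    history = []  # accumulated (attempt, feedback) pairs fed back in-context
    for trial in range(1, max_attempts + 1):
        answer, success = solve_attempt(task, history)
        if success:
            return answer, trial
        history.append((answer, "task failed, try a different approach"))
    return None, max_attempts

# Toy solver that only succeeds once it has seen two failures in context.
def toy_solver(task, history):
    if len(history) >= 2:
        return "solved:" + task, True
    return "wrong guess", False

answer, attempts = run_agent("open the door", toy_solver)
```

The key design choice is that no weights change between attempts; all adaptation happens through the growing in-context `history`.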

[AI-67] Multi-View Neural Differential Equations for Continuous-Time Stream Data in Long-Term Traffic Forecasting

链接: https://arxiv.org/abs/2408.06445
作者: Zibo Liu,Zhe Jiang,Shigang Chen
关键词-EN: Neural Differential Equations, Differential Equations, Neural Differential, flow forecasting plays, decisions in advance
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Long-term traffic flow forecasting plays a crucial role in intelligent transportation as it allows traffic managers to adjust their decisions in advance. However, the problem is challenging due to spatio-temporal correlations and complex dynamic patterns in continuous-time stream data. Neural Differential Equations (NDEs) are among the state-of-the-art methods for learning continuous-time traffic dynamics. However, the traditional NDE models face issues in long-term traffic forecasting due to failures in capturing delayed traffic patterns, dynamic edge (location-to-location correlation) patterns, and abrupt trend patterns. To fill this gap, we propose a new NDE architecture called Multi-View Neural Differential Equations. Our model captures current states, delayed states, and trends in different state variables (views) by learning latent multiple representations within Neural Differential Equations. Extensive experiments conducted on several real-world traffic datasets demonstrate that our proposed method outperforms the state-of-the-art and achieves superior prediction accuracy for long-term forecasting and robustness with noisy or missing inputs.
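The core NDE idea the abstract builds on is evolving a state by integrating a learned derivative function through continuous time. A minimal sketch with a fixed toy derivative and plain Euler steps (a real NDE learns `f` as a neural network and uses an adaptive solver; the multi-view delayed-state machinery of the paper is not shown):

```python
# Hedged sketch of continuous-time state evolution behind Neural Differential
# Equations: x(t1) = x(t0) + integral of f(t, x) dt, here via Euler steps.

def f(t, x):
    # Stand-in for the learned derivative dx/dt; here simple exponential decay.
    return -0.5 * x

def euler_integrate(x0, t0, t1, steps=1000):
    x, t = x0, t0
    dt = (t1 - t0) / steps
    for _ in range(steps):
        x = x + dt * f(t, x)
        t += dt
    return x

x_end = euler_integrate(1.0, 0.0, 2.0)  # analytic answer: exp(-1) ≈ 0.3679
```

The paper's contribution sits in what `f` looks at: current states, delayed states, and trend views rather than just `x(t)`.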

[AI-68] Evaluating Language Models on Entity Disambiguation in Tables

链接: https://arxiv.org/abs/2408.06423
作者: Federico Belotti,Fabio Dadda,Marco Cremaschi,Roberto Avogadro,Riccardo Pozzi,Matteo Palmonari
关键词-EN: Semantic Table Interpretation, containers of information, crucial containers, Table Interpretation, Large Language Models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Tables are crucial containers of information, but understanding their meaning may be challenging. Indeed, recently, there has been a focus on Semantic Table Interpretation (STI), i.e., the task that involves the semantic annotation of tabular data to disambiguate their meaning. Over the years, there has been a surge in interest in data-driven approaches based on deep learning that have increasingly been combined with heuristic-based approaches. In the last period, the advent of Large Language Models (LLMs) has led to a new category of approaches for table annotation. The interest in this research field, characterised by multiple challenges, has led to a proliferation of approaches employing different techniques. However, these approaches have not been consistently evaluated on a common ground, making evaluation and comparison difficult. This work proposes an extensive evaluation of four state-of-the-art (SOTA) approaches - Alligator (formerly s-elBat), Dagobah, TURL, and TableLlama; the first two belong to the family of heuristic-based algorithms, while the others are respectively encoder-only and decoder-only LLMs. The primary objective is to measure the ability of these approaches to solve the entity disambiguation task, with the ultimate aim of charting new research paths in the field.

[AI-69] Distributed Stackelberg Strategies in State-based Potential Games for Autonomous Decentralized Learning Manufacturing Systems

链接: https://arxiv.org/abs/2408.06397
作者: Steve Yuwono,Dorothea Schwung,Andreas Schwung
关键词-EN: Distributed Stackelberg Strategies, Stackelberg Strategies, autonomously optimizing decentralized, optimizing decentralized manufacturing, decentralized manufacturing systems
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注: This pre-print was submitted to IEEE Transactions on Systems, Man, and Cybernetics: Systems on July 31, 2024

点击查看摘要

Abstract:This article describes a novel game structure for autonomously optimizing decentralized manufacturing systems with multi-objective optimization challenges, namely Distributed Stackelberg Strategies in State-Based Potential Games (DS2-SbPG). DS2-SbPG integrates potential games and Stackelberg games, which improves the cooperative trade-off capabilities of potential games and the multi-objective optimization handling by Stackelberg games. Notably, all training procedures remain conducted in a fully distributed manner. DS2-SbPG offers a promising solution to finding optimal trade-offs between objectives by eliminating the complexities of setting up combined objective optimization functions for individual players in self-learning domains, particularly in real-world industrial settings with diverse and numerous objectives between the sub-systems. We further prove that DS2-SbPG constitutes a dynamic potential game that results in corresponding convergence guarantees. Experimental validation conducted on a laboratory-scale testbed highlights the efficacy of DS2-SbPG and its two variants, DS2-SbPG for single-leader-follower and Stack DS2-SbPG for multi-leader-follower. The results show significant reductions in power consumption and improvements in overall performance, which signals the potential of DS2-SbPG in real-world applications.
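The Stackelberg structure the abstract leans on can be seen on a tiny two-player game: the leader commits first, anticipating that the follower will best-respond. A minimal enumeration sketch (the payoff numbers are illustrative, and the paper's state-based potential-game setting is far richer than this one-shot bimatrix example):

```python
# Hedged sketch: Stackelberg equilibrium of a toy two-player game by
# enumerating leader commitments and assuming follower best response.

# payoffs[leader_action][follower_action] = (leader_payoff, follower_payoff)
payoffs = {
    "A": {"x": (3, 1), "y": (1, 2)},
    "B": {"x": (2, 3), "y": (4, 0)},
}

def follower_best_response(leader_action):
    row = payoffs[leader_action]
    return max(row, key=lambda a: row[a][1])  # follower maximizes own payoff

def stackelberg_leader_choice():
    # Leader anticipates the follower's reaction to each possible commitment.
    return max(payoffs, key=lambda la: payoffs[la][follower_best_response(la)][0])

leader = stackelberg_leader_choice()
follower = follower_best_response(leader)
```

Here the leader forgoes the row containing its best raw payoff because the follower's best response would not land on it; that anticipation step is what distinguishes Stackelberg from simultaneous play.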

[AI-70] ViC: Virtual Compiler Is All You Need For Assembly Code Search

链接: https://arxiv.org/abs/2408.06385
作者: Zeyu Gao,Hao Wang,Yuanda Wang,Chao Zhang
关键词-EN: vast binary programs, quickly identify specific, identify specific functions, Assembly code, Assembly code search
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Assembly code search is vital for reducing the burden on reverse engineers, allowing them to quickly identify specific functions using natural language within vast binary programs. Despite its significance, this critical task is impeded by the complexities involved in building high-quality datasets. This paper explores training a Large Language Model (LLM) to emulate a general compiler. By leveraging Ubuntu packages to compile a dataset of 20 billion tokens, we further continue pre-train CodeLlama as a Virtual Compiler (ViC), capable of compiling any source code of any language to assembly code. This approach allows for virtual compilation across a wide range of programming languages without the need for a real compiler, preserving semantic equivalency and expanding the possibilities for assembly code dataset construction. Furthermore, we use ViC to construct a sufficiently large dataset for assembly code search. Employing this extensive dataset, we achieve a substantial improvement in assembly code search performance, with our model surpassing the leading baseline by 26%.

[AI-71] Algorithm Research of ELMo Word Embedding and Deep Learning Multimodal Transformer in Image Description

链接: https://arxiv.org/abs/2408.06357
作者: Xiaohan Cheng,Taiyuan Mei,Yun Zi,Qi Wang,Zijun Gao,Haowei Yang
关键词-EN: sample learning methods, sample learning, data deficiency, sample learning algorithms, effective method
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Zero sample learning is an effective method for data deficiency. The existing embedded zero sample learning methods only use the known classes to construct the embedded space, so there is an overfitting of the known classes in the testing process. This project uses category semantic similarity measures to classify multiple tags. This enables it to incorporate unknown classes that have the same meaning as currently known classes into the vector space when it is built. At the same time, most of the existing zero sample learning algorithms directly use the depth features of medical images as input, and the feature extraction process does not consider semantic information. This project intends to take ELMo-MCT as the main task and obtain multiple visual features related to the original image through self-attention mechanism. In this paper, a large number of experiments are carried out on three zero-shot learning reference datasets, and the best harmonic average accuracy is obtained compared with the most advanced algorithms.
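The zero-shot mechanism the abstract describes, assigning an input to the semantically nearest class even when that class was never seen in training, can be sketched with cosine similarity in a shared embedding space. All vectors below are toy 3-d examples, not the paper's learned embeddings:

```python
# Hedged sketch of zero-shot classification by semantic similarity: an image
# embedding is assigned the class (seen or unseen) whose semantic vector is
# closest by cosine similarity in a shared space.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

class_embeddings = {
    "tumor": [1.0, 0.2, 0.0],   # seen during training
    "cyst": [0.1, 1.0, 0.3],    # seen during training
    "lesion": [0.9, 0.4, 0.1],  # unseen class placed in the same space
}

def zero_shot_classify(image_embedding):
    return max(class_embeddings,
               key=lambda c: cosine(image_embedding, class_embeddings[c]))

pred = zero_shot_classify([0.95, 0.35, 0.05])  # closest to the unseen class
```

Because unseen classes live in the same vector space as seen ones, the classifier can return them at test time, which is exactly what mitigates the overfitting-to-known-classes problem the abstract raises.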

[AI-72] Using Large Language Models to Compare Explainable Models for Smart Home Human Activity Recognition ISWC2024

链接: https://arxiv.org/abs/2408.06352
作者: Michele Fiori,Gabriele Civitarese,Claudio Bettini
关键词-EN: Recognizing daily activities, smart environments enables, Recognizing daily, healthcare applications, smart environments
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: Accepted for publication at UbiComp / ISWC 2024’s XAIforU workshop

点击查看摘要

Abstract:Recognizing daily activities with unobtrusive sensors in smart environments enables various healthcare applications. Monitoring how subjects perform activities at home and their changes over time can reveal early symptoms of health issues, such as cognitive decline. Most approaches in this field use deep learning models, which are often seen as black boxes mapping sensor data to activities. However, non-expert users like clinicians need to trust and understand these models’ outputs. Thus, eXplainable AI (XAI) methods for Human Activity Recognition have emerged to provide intuitive natural language explanations from these models. Different XAI methods generate different explanations, and their effectiveness is typically evaluated through user surveys, that are often challenging in terms of costs and fairness. This paper proposes an automatic evaluation method using Large Language Models (LLMs) to identify, in a pool of candidates, the best XAI approach for non-expert users. Our preliminary results suggest that LLM evaluation aligns with user surveys.

[AI-73] Closing the Affective Loop via Experience-Driven Reinforcement Learning Designers

链接: https://arxiv.org/abs/2408.06346
作者: Matthew Barthet,Diogo Branco,Roberto Gallotta,Ahmed Khalifa,Georgios N. Yannakakis
关键词-EN: Autonomously tailoring content, affect-aware human-computer interaction, Autonomously tailoring, predetermined affective patterns, interaction at large
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: 9 pages, 4 figures, 1 table

点击查看摘要

Abstract:Autonomously tailoring content to a set of predetermined affective patterns has long been considered the holy grail of affect-aware human-computer interaction at large. The experience-driven procedural content generation framework realises this vision by searching for content that elicits a certain experience pattern to a user. In this paper, we propose a novel reinforcement learning (RL) framework for generating affect-tailored content, and we test it in the domain of racing games. Specifically, the experience-driven RL (EDRL) framework is given a target arousal trace, and it then generates a racetrack that elicits the desired affective responses for a particular type of player. EDRL leverages a reward function that assesses the affective pattern of any generated racetrack from a corpus of arousal traces. Our findings suggest that EDRL can accurately generate affect-driven racing game levels according to a designer’s style and outperforms search-based methods for personalised content generation. The method is not only directly applicable to game content generation tasks but also employable broadly to any domain that uses content for affective adaptation.
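The reward EDRL optimizes scores a generated racetrack by how closely the arousal trace it elicits matches the designer's target trace. A minimal sketch of one such reward (the traces and the `1 - mean-absolute-error` form are illustrative assumptions; the paper assesses patterns against a corpus of arousal traces):

```python
# Hedged sketch of an affect-matching reward: compare a target arousal trace
# with the trace a generated level actually elicits.

def affect_reward(target_trace, elicited_trace):
    assert len(target_trace) == len(elicited_trace)
    mae = sum(abs(t - e) for t, e in zip(target_trace, elicited_trace)) / len(target_trace)
    return 1.0 - mae  # 1.0 = perfect match of the desired affective pattern

target = [0.2, 0.5, 0.9, 0.6]       # desired arousal over the race (toy values)
elicited = [0.25, 0.45, 0.85, 0.6]  # arousal a generated track produced
r = affect_reward(target, elicited)
```

An RL generator then treats this scalar as its return, closing the affective loop the title refers to.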

[AI-74] Heterogeneous Space Fusion and Dual-Dimension Attention: A New Paradigm for Speech Enhancement

链接: https://arxiv.org/abs/2408.06911
作者: Tao Zheng,Liejun Wang,Yinfeng Yu
关键词-EN: demonstrated impressive performance, remains ample opportunity, speech enhancement research, addressing speech tasks, speech tasks
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
*备注: Accepted for publication by IEEE International Conference on Systems, Man, and Cybernetics 2024

点击查看摘要

Abstract:Self-supervised learning has demonstrated impressive performance in speech tasks, yet there remains ample opportunity for advancement in the realm of speech enhancement research. In addressing speech tasks, confining the attention mechanism solely to the temporal dimension poses limitations in effectively focusing on critical speech features. Considering the aforementioned issues, our study introduces a novel speech enhancement framework, HFSDA, which skillfully integrates heterogeneous spatial features and incorporates a dual-dimension attention mechanism to significantly enhance speech clarity and quality in noisy environments. By leveraging self-supervised learning embeddings in tandem with Short-Time Fourier Transform (STFT) spectrogram features, our model excels at capturing both high-level semantic information and detailed spectral data, enabling a more thorough analysis and refinement of speech signals. Furthermore, we employ the innovative Omni-dimensional Dynamic Convolution (ODConv) technology within the spectrogram input branch, enabling enhanced extraction and integration of crucial information across multiple dimensions. Additionally, we refine the Conformer model by enhancing its feature extraction capabilities not only in the temporal dimension but also across the spectral domain. Extensive experiments on the VCTK-DEMAND dataset show that HFSDA is comparable to existing state-of-the-art models, confirming the validity of our approach.

[AI-75] VNet: A GAN-based Multi-Tier Discriminator Network for Speech Synthesis Vocoders

链接: https://arxiv.org/abs/2408.06906
作者: Yubing Cao,Yongming Li,Liejun Wang,Yinfeng Yu
关键词-EN: Generative Adversarial Networks, introduction of Generative, Generative Adversarial, full-band spectral information, remarkable achievements
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
*备注: Accepted for publication by IEEE International Conference on Systems, Man, and Cybernetics 2024

点击查看摘要

Abstract:Since the introduction of Generative Adversarial Networks (GANs) in speech synthesis, remarkable achievements have been attained. In a thorough exploration of vocoders, it has been discovered that audio waveforms can be generated at speeds exceeding real-time while maintaining high fidelity, achieved through the utilization of GAN-based models. Typically, the inputs to the vocoder consist of band-limited spectral information, which inevitably sacrifices high-frequency details. To address this, we adopt the full-band Mel spectrogram information as input, aiming to provide the vocoder with the most comprehensive information possible. However, previous studies have revealed that the use of full-band spectral information as input can result in the issue of over-smoothing, compromising the naturalness of the synthesized speech. To tackle this challenge, we propose VNet, a GAN-based neural vocoder network that incorporates full-band spectral information and introduces a Multi-Tier Discriminator (MTD) comprising multiple sub-discriminators to generate high-resolution signals. Additionally, we introduce an asymptotically constrained method that modifies the adversarial loss of the generator and discriminator, enhancing the stability of the training process. Through rigorous experiments, we demonstrate that the VNet model is capable of generating high-fidelity speech and significantly improving the performance of the vocoder.

[AI-76] BSS-CFFMA: Cross-Domain Feature Fusion and Multi-Attention Speech Enhancement Network based on Self-Supervised Embedding

链接: https://arxiv.org/abs/2408.06851
作者: Alimjan Mattursun,Liejun Wang,Yinfeng Yu
关键词-EN: multiple downstream tasks, represents has achieved, Speech self-supervised learning, multiple downstream, speech enhancement
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
*备注: Accepted for publication by IEEE International Conference on Systems, Man, and Cybernetics 2024

点击查看摘要

Abstract:Speech self-supervised learning (SSL) has achieved state-of-the-art (SOTA) performance in multiple downstream tasks. However, its application in speech enhancement (SE) tasks remains immature, offering opportunities for improvement. In this study, we introduce a novel cross-domain feature fusion and multi-attention speech enhancement network, termed BSS-CFFMA, which leverages self-supervised embeddings. BSS-CFFMA comprises a multi-scale cross-domain feature fusion (MSCFF) block and a residual hybrid multi-attention (RHMA) block. The MSCFF block effectively integrates cross-domain features, facilitating the extraction of rich acoustic information. The RHMA block, serving as the primary enhancement module, utilizes three distinct attention modules to capture diverse attention representations and estimate high-quality speech signals. We evaluate the performance of the BSS-CFFMA model through comparative and ablation studies on the VoiceBank-DEMAND dataset, achieving SOTA results. Furthermore, we select three types of data from the WHAMR! dataset, a collection specifically designed for speech enhancement tasks, to assess the capabilities of BSS-CFFMA in tasks such as denoising only, dereverberation only, and simultaneous denoising and dereverberation. This study marks the first attempt to explore the effectiveness of self-supervised embedding-based speech enhancement methods in complex tasks encompassing dereverberation and simultaneous denoising and dereverberation. The demo implementation of BSS-CFFMA is available online at this https URL.

[AI-77] Stunned by Sleeping Beauty: How Prince Probability updates his forecast upon their fateful encounter

链接: https://arxiv.org/abs/2408.06797
作者: Laurens Walleghem
关键词-EN: Sleeping Beauty, Sleeping Beauty problem, Beauty, Elga discussion, Sleeping
类目: Probability (math.PR); Artificial Intelligence (cs.AI); Physics and Society (physics.soc-ph)
*备注: 12 pages, 1 figure, all comments welcome!

点击查看摘要

Abstract:The Sleeping Beauty problem is a puzzle in probability theory that has gained much attention since Elga’s discussion of it [Elga, Adam, Analysis 60 (2), p.143-147 (2000)]. Sleeping Beauty is put asleep, and a coin is tossed. If the outcome of the coin toss is Tails, Sleeping Beauty is woken up on Monday, put asleep again and woken up again on Tuesday (with no recollection of having woken up on Monday). If the outcome is Heads, Sleeping Beauty is woken up on Monday only. Each time Sleeping Beauty is woken up, she is asked what her belief is that the outcome was Heads. What should Sleeping Beauty reply? In literature arguments have been given for both 1/3 and 1/2 as the correct answer. In this short note we argue using simple Bayesian probability theory why 1/3 is the right answer, and not 1/2. Briefly, when Sleeping Beauty awakens, her being awake is nontrivial extra information that leads her to update her beliefs about Heads to 1/3. We strengthen our claim by considering an additional observer, Prince Probability, who may or may not meet Sleeping Beauty. If he meets Sleeping Beauty while she is awake, he lowers his credence in Heads to 1/3. We also briefly consider the credence in Heads of a Sleeping Beauty who knows that she is dreaming (and thus asleep).
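The abstract's thirder argument can be checked by a direct computation: each coin face has prior 1/2, but Tails produces twice as many awakenings, so conditioning on "I am awake now" shifts the credence in Heads to 1/3. A minimal enumeration of that update:

```python
# Minimal computation behind the thirder answer defended in the abstract:
# weight each coin face by how many awakenings it produces.
from fractions import Fraction

def credence_heads():
    p = Fraction(1, 2)                 # prior on each coin face
    heads_awakenings, tails_awakenings = 1, 2  # Mon only vs. Mon and Tue
    # Conditioning on "I am awake": weight each face by how often it
    # produces the current experience of being woken up.
    heads_mass = p * heads_awakenings
    total_mass = p * heads_awakenings + p * tails_awakenings
    return heads_mass / total_mass

print(credence_heads())  # → 1/3
```

The halfer position amounts to refusing this reweighting; the abstract's point is that being awake is genuine evidence, which is what the per-awakening weighting encodes.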

[AI-78] Dynamic Exclusion of Low-Fidelity Data in Bayesian Optimization for Autonomous Beamline Alignment

链接: https://arxiv.org/abs/2408.06540
作者: Megha R. Narayanan,Thomas W. Morris
关键词-EN: synchrotron light sources, dynamic optical components, National Synchrotron Light, Brookhaven National Laboratory, synchrotron light
类目: Accelerator Physics (physics.acc-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 12 pages, 6 figure sets

点击查看摘要

Abstract:Aligning beamlines at synchrotron light sources is a high-dimensional, expensive-to-sample optimization problem, as beams are focused using a series of dynamic optical components. Bayesian Optimization is an efficient machine learning approach to finding global optima of beam quality, but the model can easily be impaired by faulty data points caused by the beam going off the edge of the sensor or by background noise. This study, conducted at the National Synchrotron Light Source II (NSLS-II) facility at Brookhaven National Laboratory (BNL), is an investigation of methods to identify untrustworthy readings of beam quality and discourage the optimization model from seeking out points likely to yield low-fidelity beams. The approaches explored include dynamic pruning using loss analysis of size and position models and a lengthscale-based genetic algorithm to determine which points to include in the model for optimal fit. Each method successfully classified high and low fidelity points. This research advances BNL’s mission to tackle our nation’s energy challenges by providing scientists at all beamlines with access to higher quality beams, and faster convergence to these optima for their experiments.
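The gatekeeping idea in the abstract, classifying beam readings as trustworthy before they enter the optimization model, can be sketched as a simple fidelity filter. The bounds, field names, and thresholds below are illustrative assumptions; the paper's actual pruning uses loss analysis of learned size/position models and a genetic algorithm rather than fixed cutoffs:

```python
# Hedged sketch: reject readings likely caused by the beam leaving the sensor
# or by background noise, so they never reach the Bayesian optimizer.

SENSOR_X = (0.0, 100.0)   # sensor edges in mm (assumed)
NOISE_FLOOR = 5.0         # minimum credible intensity (assumed)

def is_high_fidelity(reading):
    x, intensity = reading["x"], reading["intensity"]
    on_sensor = SENSOR_X[0] + 1.0 <= x <= SENSOR_X[1] - 1.0  # margin from edge
    above_noise = intensity >= NOISE_FLOOR
    return on_sensor and above_noise

readings = [
    {"x": 50.0, "intensity": 40.0},   # good beam
    {"x": 99.9, "intensity": 35.0},   # beam at the sensor edge -> untrusted
    {"x": 45.0, "intensity": 1.2},    # background noise -> untrusted
]
kept = [r for r in readings if is_high_fidelity(r)]
```

Feeding only `kept` to the surrogate model is what keeps the optimizer from chasing points that yield low-fidelity beams.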

[AI-79] PhaGO: Protein function annotation for bacteriophages by integrating the genomic context

链接: https://arxiv.org/abs/2408.06402
作者: Jiaojiao Guan,Yongxin Ji,Cheng Peng,Wei Zou,Xubo Tang,Jiayu Shang,Yanni Sun
关键词-EN: Bacteriophages are viruses, target bacteria, playing a crucial, microbial ecology, viruses that target
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 17 pages,6 figures

点击查看摘要

Abstract:Bacteriophages are viruses that target bacteria, playing a crucial role in microbial ecology. Phage proteins are important in understanding phage biology, such as virus infection, replication, and evolution. Although a large number of new phages have been identified via metagenomic sequencing, many of them have limited protein function annotation. Accurate function annotation of phage proteins presents several challenges, including their inherent diversity and the scarcity of annotated ones. Existing tools have yet to fully leverage the unique properties of phages in annotating protein functions. In this work, we propose a new protein function annotation tool for phages by leveraging the modular genomic structure of phage genomes. By employing embeddings from the latest protein foundation models and Transformer to capture contextual information between proteins in phage genomes, PhaGO surpasses state-of-the-art methods in annotating diverged proteins and proteins with uncommon functions by 6.78% and 13.05% improvement, respectively. PhaGO can annotate proteins lacking homology search results, which is critical for characterizing the rapidly accumulating phage genomes. We demonstrate the utility of PhaGO by identifying 688 potential holins in phages, which exhibit high structural conservation with known holins. The results show the potential of PhaGO to extend our understanding of newly discovered phages.

[AI-80] Design Proteins Using Large Language Models : Enhancements and Comparative Analyses ACL2024

链接: https://arxiv.org/abs/2408.06396
作者: Kamyar Zeinalipour,Neda Jamshidi,Monica Bianchini,Marco Maggini,Marco Gori
关键词-EN: natural language processing, demonstrated substantial capabilities, conventional natural language, protein sequences, language processing
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: This paper has been accepted for presentation at Language and Molecules ACL 2024

点击查看摘要

Abstract:Pre-trained LLMs have demonstrated substantial capabilities across a range of conventional natural language processing (NLP) tasks, such as summarization and entity recognition. In this paper, we explore the application of LLMs in the generation of high-quality protein sequences. Specifically, we adopt a suite of pre-trained LLMs, including Mistral-7B1, Llama-2-7B2, Llama-3-8B3, and gemma-7B4, to produce valid protein sequences. All of these models are publicly available.5 Unlike previous work in this field, our approach utilizes a relatively small dataset comprising 42,000 distinct human protein sequences. We retrain these models to process protein-related data, ensuring the generation of biologically feasible protein structures. Our findings demonstrate that even with limited data, the adapted models exhibit efficiency comparable to established protein-focused models such as ProGen varieties, ProtGPT2, and ProLLaMA, which were trained on millions of protein sequences. To validate and quantify the performance of our models, we conduct comparative analyses employing standard metrics such as pLDDT, RMSD, TM-score, and REU. Furthermore, we commit to making the trained versions of all four models publicly available, fostering greater transparency and collaboration in the field of computational biology.
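Of the validation metrics the abstract names, RMSD is the simplest to state: the root of the mean squared distance between matched atom coordinates of two structures. A minimal sketch (no superposition step is shown; real use first optimally aligns the two structures):

```python
# Hedged sketch of the RMSD metric over paired 3-d coordinates.
import math

def rmsd(coords_a, coords_b):
    assert len(coords_a) == len(coords_b)
    sq = sum(
        (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
        for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b)
    )
    return math.sqrt(sq / len(coords_a))

a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
b = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
print(rmsd(a, b))  # ≈ 0.707
```

Lower RMSD against a reference structure indicates a generated sequence folds closer to a known-plausible geometry, which is how such metrics quantify "biologically feasible" in this context.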

[AI-81] Autoregressive Enzyme Function Prediction with Multi-scale Multi-modality Fusion

链接: https://arxiv.org/abs/2408.06391
作者: Dingyi Rong,Wenzhuo Zheng,Bozitao Zhong,Zhouhan Lin,Liang Hong,Ning Liu
关键词-EN: elucidating biological mechanisms, crucial for elucidating, elucidating biological, biological mechanisms, mechanisms and driving
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate prediction of enzyme function is crucial for elucidating biological mechanisms and driving innovation across various sectors. Existing deep learning methods tend to rely solely on either sequence data or structural data and predict the EC number as a whole, neglecting the intrinsic hierarchical structure of EC numbers. To address these limitations, we introduce MAPred, a novel multi-modality and multi-scale model designed to autoregressively predict the EC number of proteins. MAPred integrates both the primary amino acid sequence and the 3D tokens of proteins, employing a dual-pathway approach to capture comprehensive protein characteristics and essential local functional sites. Additionally, MAPred utilizes an autoregressive prediction network to sequentially predict the digits of the EC number, leveraging the hierarchical organization of EC classifications. Evaluations on benchmark datasets, including New-392, Price, and New-815, demonstrate that our method outperforms existing models, marking a significant advance in the reliability and granularity of protein function prediction within bioinformatics.
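The autoregressive structure the abstract emphasizes, predicting each level of the EC hierarchy conditioned on the levels already chosen, can be sketched as a greedy walk down a conditional score table. The scores below are toy stand-ins for MAPred's learned network:

```python
# Hedged sketch of hierarchy-aware autoregressive EC-number prediction:
# each digit is chosen conditioned on the prefix of digits already emitted.

ec_tree = {
    (): {"3": 0.7, "2": 0.3},               # class
    ("3",): {"4": 0.8, "1": 0.2},           # subclass given class 3
    ("3", "4"): {"21": 0.9, "11": 0.1},     # sub-subclass given 3.4
    ("3", "4", "21"): {"4": 0.6, "1": 0.4}, # serial number given 3.4.21
}

def predict_ec():
    prefix = ()
    while prefix in ec_tree:
        scores = ec_tree[prefix]           # conditional distribution
        prefix = prefix + (max(scores, key=scores.get),)
    return ".".join(prefix)

print(predict_ec())  # → 3.4.21.4
```

Predicting the EC number as a whole would need one classifier over thousands of joint labels; conditioning digit by digit exploits the hierarchy, which is the limitation of prior methods the abstract calls out.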

[AI-82] Assessment of Cell Nuclei AI Foundation Models in Kidney Pathology

链接: https://arxiv.org/abs/2408.06381
作者: Junlin Guo,Siqi Lu,Can Cui,Ruining Deng,Tianyuan Yao,Zhewen Tao,Yizhe Lin,Marilyn Lionts,Quan Liu,Juming Xiong,Catie Chang,Mitchell Wilkes,Mengmeng Yin,Haichun Yang,Yuankai Huo
关键词-EN: kidney pathology, digital kidney pathology, crucial task, task in digital, Cell nuclei instance
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Cell nuclei instance segmentation is a crucial task in digital kidney pathology. Traditional automatic segmentation methods often lack generalizability when applied to unseen datasets. Recently, the success of foundation models (FMs) has provided a more generalizable solution, potentially enabling the segmentation of any cell type. In this study, we perform a large-scale evaluation of three widely used state-of-the-art (SOTA) cell nuclei foundation models (Cellpose, StarDist, and CellViT). Specifically, we created a highly diverse evaluation dataset consisting of 2,542 kidney whole slide images (WSIs) collected from both human and rodent sources, encompassing various tissue types, sizes, and staining methods. To our knowledge, this is the largest-scale evaluation of its kind to date. Our quantitative analysis of the prediction distribution reveals a persistent performance gap in kidney pathology. Among the evaluated models, CellViT demonstrated superior performance in segmenting nuclei in kidney pathology. However, none of the foundation models are perfect; a performance gap remains in general nuclei segmentation for kidney pathology.

[AI-83] Masked Graph Autoencoders with Contrastive Augmentation for Spatially Resolved Transcriptomics Data

Link: https://arxiv.org/abs/2408.06377
Authors: Donghai Fang,Fangfang Zhu,Dongting Xie,Wenwen Min
Keywords: Spatial Resolved Transcriptomics, Resolved Transcriptomics, comprehensively measure gene, measure gene transcription, Spatial Resolved
Subjects: Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:With the rapid advancement of Spatial Resolved Transcriptomics (SRT) technology, it is now possible to comprehensively measure gene transcription while preserving the spatial context of tissues. Spatial domain identification and gene denoising are key objectives in SRT data analysis. We propose a Contrastively Augmented Masked Graph Autoencoder (STMGAC) to learn low-dimensional latent representations for domain identification. In the latent space, persistent signals for representations are obtained through self-distillation to guide self-supervised matching. At the same time, positive and negative anchor pairs are constructed using triplet learning to augment the discriminative ability. We evaluated the performance of STMGAC on five datasets, achieving results superior to those of existing baseline methods. All code and public datasets used in this paper are available at this https URL and this https URL.
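The triplet objective used above to augment discriminative ability can be written down in a few lines. This is a generic sketch, not STMGAC's code: in the paper the anchors come from the graph autoencoder's latent space, while here they are plain vectors.

```python
import numpy as np

# Triplet loss: pull the anchor toward the positive embedding and push it
# away from the negative one by at least `margin`.
def triplet_loss(anchor, positive, negative, margin=1.0):
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])   # close to the anchor
n = np.array([3.0, 0.0])   # far from the anchor
print(triplet_loss(a, p, n))  # 0.0 — the margin constraint is already met
```

When the positive is already much closer than the negative, the hinge is inactive; swapping `p` and `n` yields a positive loss that drives the embeddings apart.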

[AI-84] An Adaptive CSI Feedback Model Based on BiLSTM for Massive MIMO-OFDM Systems

Link: https://arxiv.org/abs/2408.06359
Authors: Hongrui Shen,Long Zhao,Kan Zheng,Yuhua Cao,Pingzhi Fan
Keywords: input CSI lengths, CSI feedback, channel state information, frequency division multiplexing, massive multiple-input multiple-output
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 13 pages, 14 figures, 3 tables

Click to view abstract

Abstract:Deep learning (DL)-based channel state information (CSI) feedback has the potential to improve the recovery accuracy and reduce the feedback overhead in massive multiple-input multiple-output orthogonal frequency division multiplexing (MIMO-OFDM) systems. However, the length of input CSI and the number of feedback bits should be adjustable in different scenarios, which cannot be efficiently achieved by the existing CSI feedback models. Therefore, an adaptive bidirectional long short-term memory network (ABLNet) for CSI feedback is first designed to process various input CSI lengths, where the number of feedback bits is in proportion to the CSI length. Then, to realize a more flexible feedback bit number, a feedback bit control unit (FBCU) module is proposed to control the output length of feedback bits. Building on this, a target feedback performance can be adaptively achieved by a designed bit number adjusting (BNA) algorithm. Furthermore, a novel separate training approach is devised to solve the model protection problem that arises when the UE and gNB are from different manufacturers. Experiments demonstrate that the proposed ABLNet with FBCU can adapt to different input CSI lengths and feedback bit numbers; the CSI feedback performance can be stabilized by the BNA algorithm; and the proposed separate training approach can maintain the feedback performance and reduce the complexity of the feedback model.
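The bit-control idea above can be caricatured as follows. This is only a toy illustration of the interface, not the FBCU network: the full-length bit vector here is random, whereas in the paper it is the encoder's learned output.

```python
import numpy as np

# Toy sketch of feedback-bit control: produce a bit vector whose full
# length is proportional to the CSI length, then truncate it to a target
# feedback budget, as the FBCU module does for the network's output.
def control_feedback_bits(csi_len, bits_per_coeff=4, target_bits=None):
    full = np.random.default_rng(0).integers(0, 2, csi_len * bits_per_coeff)
    if target_bits is not None:
        full = full[:target_bits]
    return full

bits = control_feedback_bits(csi_len=32, target_bits=96)
print(len(bits))  # 96
```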

Computer Vision

[CV-0] Fingerspelling within Sign Language Translation

Link: https://arxiv.org/abs/2408.07065
Authors: Garrett Tanzer
Keywords: language processing due, sign language processing, Fingerspelling poses challenges, American Sign Language, sign language
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Fingerspelling poses challenges for sign language processing due to its high-frequency motion and use for open-vocabulary terms. While prior work has studied fingerspelling recognition, there has been little attention to evaluating how well sign language translation models understand fingerspelling in the context of entire sentences – and improving this capability. We manually annotate instances of fingerspelling within FLEURS-ASL and use them to evaluate the effect of two simple measures to improve fingerspelling recognition within American Sign Language to English translation: 1) use a model family (ByT5) with character- rather than subword-level tokenization, and 2) mix fingerspelling recognition data into the translation training mixture. We find that 1) substantially improves understanding of fingerspelling (and therefore translation quality overall), but the effect of 2) is mixed.
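The tokenization contrast behind measure 1) is easy to show concretely. ByT5 operates on raw UTF-8 bytes, so a fingerspelled name decomposes into per-character units, while a subword tokenizer may fuse letters into opaque chunks. The subword function below uses a toy greedy longest-match over a made-up vocabulary, purely for illustration.

```python
# Byte-level tokenization (ByT5-style): one token per UTF-8 byte.
def byte_tokens(text):
    return list(text.encode("utf-8"))

# Toy subword tokenizer: greedy longest-match over a tiny vocabulary.
def toy_subword_tokens(text, vocab):
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

print(byte_tokens("KAT"))                 # [75, 65, 84] — one unit per letter
print(toy_subword_tokens("KAT", {"KA"}))  # ['KA', 'T'] — letters fused
```

Character-level units let the model map each fingerspelled handshape to exactly one output token, which is the intuition for why measure 1) helps.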

[CV-1] PSM: Learning Probabilistic Embeddings for Multi-scale Zero-Shot Soundscape Mapping ACM-MM2024

Link: https://arxiv.org/abs/2408.07050
Authors: Subash Khanal,Eric Xing,Srikumar Sastry,Aayush Dhakal,Zhexiao Xiong,Adeel Ahmad,Nathan Jacobs
Keywords: acoustic environment, environment a person, person perceives, Abstract, soundscape
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
Comments: Accepted at ACM MM 2024

Click to view abstract

Abstract:A soundscape is defined by the acoustic environment a person perceives at a location. In this work, we propose a framework for mapping soundscapes across the Earth. Since soundscapes involve sound distributions that span varying spatial scales, we represent locations with multi-scale satellite imagery and learn a joint representation among this imagery, audio, and text. To capture the inherent uncertainty in the soundscape of a location, we design the representation space to be probabilistic. We also fuse ubiquitous metadata (including geolocation, time, and data source) to enable learning of spatially and temporally dynamic representations of soundscapes. We demonstrate the utility of our framework by creating large-scale soundscape maps integrating both audio and text with temporal control. To facilitate future research on this task, we also introduce a large-scale dataset, GeoSound, containing over 300k geotagged audio samples paired with both low- and high-resolution satellite imagery. We demonstrate that our method outperforms the existing state-of-the-art on both GeoSound and the existing SoundingEarth dataset. Our dataset and code is available at this https URL.
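The probabilistic representation space mentioned above can be sketched minimally: each location is embedded as a Gaussian (mean, log-variance) rather than a point, so uncertainty about the soundscape is explicit. Function and variable names here are illustrative assumptions, not the paper's API.

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw samples from a diagonal-Gaussian embedding; a confident location
# has small variance, an ambiguous one a large variance.
def sample_embedding(mean, log_var, n_samples=5):
    std = np.exp(0.5 * log_var)
    return mean + std * rng.standard_normal((n_samples, mean.shape[0]))

mean = np.zeros(4)
log_var = np.full(4, -2.0)       # small variance -> confident embedding
samples = sample_embedding(mean, log_var)
print(samples.shape)             # (5, 4)
```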

[CV-2] KAN You See It? KANs and Sentinel for Effective and Explainable Crop Field Segmentation ECCV2024

Link: https://arxiv.org/abs/2408.07040
Authors: Daniele Rege Cambrin,Eleonora Poeta,Eliana Pastor,Tania Cerquitelli,Elena Baralis,Paolo Garza
Keywords: enhancing agricultural productivity, promoting sustainable practices, monitoring crop health, agricultural productivity, sustainable practices
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at ECCV 2024 CVPPA Workshop

Click to view abstract

Abstract:Segmentation of crop fields is essential for enhancing agricultural productivity, monitoring crop health, and promoting sustainable practices. Deep learning models adopted for this task must ensure accurate and reliable predictions to avoid economic losses and environmental impact. The newly proposed Kolmogorov-Arnold networks (KANs) offer promising advancements in the performance of neural networks. This paper analyzes the integration of KAN layers into the U-Net architecture (U-KAN) to segment crop fields using Sentinel-2 and Sentinel-1 satellite images and provides an analysis of the performance and explainability of these networks. Our findings indicate a 2% improvement in IoU compared to the traditional full-convolutional U-Net model in fewer GFLOPs. Furthermore, gradient-based explanation techniques show that U-KAN predictions are highly plausible and that the network has a very high ability to focus on the boundaries of cultivated areas rather than on the areas themselves. The per-channel relevance analysis also reveals that some channels are irrelevant to this task.

[CV-3] PathInsight: Instruction Tuning of Multimodal Datasets and Models for Intelligence Assisted Diagnosis in Histopathology

Link: https://arxiv.org/abs/2408.07037
Authors: Xiaomin Wu,Rui Xu,Pengchen Wei,Wenkang Qin,Peixiang Huang,Ziheng Li,Lin Luo
Keywords: Pathological diagnosis remains, identifying tumors, diagnosis remains, remains the definitive, definitive standard
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages, 2 figures

Click to view abstract

Abstract:Pathological diagnosis remains the definitive standard for identifying tumors. The rise of multimodal large models has simplified the process of integrating image analysis with textual descriptions. Despite this advancement, the substantial costs associated with training and deploying these complex multimodal models, together with a scarcity of high-quality training datasets, create a significant divide between cutting-edge technology and its application in the clinical setting. We meticulously compiled a dataset of approximately 45,000 cases, covering over 6 different tasks, including the classification of organ tissues, generating pathology report descriptions, and addressing pathology-related questions and answers. We fine-tuned multimodal large models, specifically LLaVA, Qwen-VL, InternLM, with this dataset to enhance instruction-based performance. We conducted a qualitative assessment of the capabilities of the base model and the fine-tuned model in performing image captioning and classification tasks on the specific dataset. The evaluation results demonstrate that the fine-tuned model exhibits proficiency in addressing typical pathological questions. We hope that by making both our models and datasets publicly available, they can be valuable to the medical and research communities.

[CV-4] Efficient Human-Object-Interaction (EHOI) Detection via Interaction Label Coding and Conditional Decision

Link: https://arxiv.org/abs/2408.07018
Authors: Tsung-Shan Yang,Yun-Cheng Wang,Chengwei Wei,Suya You,C.-C. Jay Kuo
Keywords: image understanding, fundamental task, task in image, HOI methods provide, Average Precision
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Human-Object Interaction (HOI) detection is a fundamental task in image understanding. While deep-learning-based HOI methods provide high performance in terms of mean Average Precision (mAP), they are computationally expensive and opaque in training and inference processes. An Efficient HOI (EHOI) detector is proposed in this work to strike a good balance between detection performance, inference complexity, and mathematical transparency. EHOI is a two-stage method. In the first stage, it leverages a frozen object detector to localize the objects and extract various features as intermediate outputs. In the second stage, the first-stage outputs predict the interaction type using the XGBoost classifier. Our contributions include the application of error correction codes (ECCs) to encode rare interaction cases, which reduces the model size and the complexity of the XGBoost classifier in the second stage. Additionally, we provide a mathematical formulation of the relabeling and decision-making process. Apart from the architecture, we present qualitative results to explain the functionalities of the feedforward modules. Experimental results demonstrate the advantages of ECC-coded interaction labels and the excellent balance of detection performance and complexity of the proposed EHOI method.
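The ECC-coded interaction labels described above can be illustrated with a tiny codebook: each class gets a binary codeword, and a possibly noisy predicted bit vector is decoded to the nearest codeword by Hamming distance. The codebook below is an illustrative assumption (minimum distance 3, so any single flipped bit is corrected), not the paper's actual code.

```python
import numpy as np

# Hypothetical 5-bit codewords for four interaction classes.
CODEBOOK = {
    "hold":  np.array([0, 0, 0, 0, 0]),
    "ride":  np.array([1, 1, 1, 0, 0]),
    "throw": np.array([0, 0, 1, 1, 1]),
    "catch": np.array([1, 1, 0, 1, 1]),
}

# Decode a predicted bit vector to the class with the nearest codeword.
def decode(bits):
    bits = np.asarray(bits)
    return min(CODEBOOK, key=lambda c: int(np.sum(CODEBOOK[c] != bits)))

print(decode([1, 0, 0, 0, 0]))  # 'hold' — one flipped bit is corrected
```

Encoding rare classes this way shrinks the label space the second-stage classifier must model, which is the paper's stated motivation for using ECCs.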

[CV-5] Imagen 3

Link: https://arxiv.org/abs/2408.07009
Authors: Imagen-Team-Google:Jason Baldridge,Jakob Bauer,Mukul Bhutani,Nicole Brichtova,Andrew Bunner,Kelvin Chan,Yichang Chen,Sander Dieleman,Yuqing Du,Zach Eaton-Rosen,Hongliang Fei,Nando de Freitas,Yilin Gao,Evgeny Gladchenko,Sergio Gómez Colmenarejo,Mandy Guo,Alex Haig,Will Hawkins,Hexiang Hu,Huilian Huang,Tobenna Peter Igwe,Christos Kaplanis,Siavash Khodadadeh,Yelin Kim,Ksenia Konyushkova,Karol Langner,Eric Lau,Shixin Luo,Soňa Mokrá,Henna Nandwani,Yasumasa Onoe,Aäron van den Oord,Zarana Parekh,Jordi Pont-Tuset,Hang Qi,Rui Qian,Deepak Ramachandran,Poorva Rane,Abdullah Rashwan,Ali Razavi,Robert Riachi,Hansa Srinivasan,Srivatsan Srinivasan,Robin Strudel,Benigno Uria,Oliver Wang,Su Wang,Austin Waters,Chris Wolff,Auriel Wright,Zhisheng Xiao,Hao Xiong,Keyang Xu,Marc van Zee,Junlin Zhang,Katie Zhang,Wenlei Zhou,Konrad Zolna,Ola Aboubakar,Canfer Akbulut,Oscar Akerlund,Isabela Albuquerque,Nina Anderson,Marco Andreetto,Lora Aroyo,Ben Bariach,David Barker,Sherry Ben,Dana Berman,Courtney Biles,Irina Blok,Pankil Botadra,Jenny Brennan,Karla Brown,John Buckley,Rudy Bunel,Elie Bursztein,Christina Butterfield,Ben Caine,Viral Carpenter,Norman Casagrande,Ming-Wei Chang,Solomon Chang,Shamik Chaudhuri,Tony Chen,John Choi,Dmitry Churbanau,Nathan Clement,Matan Cohen,Forrester Cole,Mikhail Dektiarev,Vincent Du,Praneet Dutta,Tom Eccles,Ndidi Elue,Ashley Feden,Shlomi Fruchter,Frankie Garcia,Roopal Garg
Keywords: generates high quality, high quality images, latent diffusion model, text prompts, latent diffusion
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:We introduce Imagen 3, a latent diffusion model that generates high quality images from text prompts. We describe our quality and responsibility evaluations. Imagen 3 is preferred over other state-of-the-art (SOTA) models at the time of evaluation. In addition, we discuss issues around safety and representation, as well as methods we used to minimize the potential harm of our models.

[CV-6] Low-Bitwidth Floating Point Quantization for Efficient High-Quality Diffusion Models

Link: https://arxiv.org/abs/2408.06995
Authors: Cheng Chen,Christina Giannoula,Andreas Moshovos
Keywords: denoising random Gaussian, random Gaussian noise, deep neural networks, iteratively denoising random, random Gaussian
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Diffusion models are emerging models that generate images by iteratively denoising random Gaussian noise using deep neural networks. These models typically exhibit high computational and memory demands, necessitating effective post-training quantization for high-performance inference. Recent works propose low-bitwidth (e.g., 8-bit or 4-bit) quantization for diffusion models; however, 4-bit integer quantization typically results in low-quality images. We observe that on several widely used hardware platforms, there is little or no difference in compute capability between floating-point and integer arithmetic operations of the same bitwidth (e.g., 8-bit or 4-bit). Therefore, we propose an effective floating-point quantization method for diffusion models that provides better image quality compared to integer quantization methods. We employ a floating-point quantization method that was effective for other processing tasks, specifically computer vision and natural language tasks, and tailor it for diffusion models by integrating weight rounding learning during the mapping of the full-precision values to the quantized values in the quantization process. We comprehensively study integer and floating-point quantization methods in state-of-the-art diffusion models. Our floating-point quantization method not only generates higher-quality images than integer quantization methods, but also shows no noticeable degradation compared to full-precision models (32-bit floating-point) when both weights and activations are quantized to 8-bit floating-point values, while showing minimal degradation with 4-bit weights and 8-bit activations.
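The core mechanics of low-bitwidth floating-point quantization can be sketched directly: enumerate the value grid representable by a tiny float format, then round each weight to the nearest grid point. This is a simplified nearest-neighbor sketch with an assumed toy format (1 sign, 2 exponent, 1 mantissa bits, no subnormals); the paper additionally learns the rounding.

```python
import numpy as np

# Build the set of magnitudes (1 + m/2^M) * 2^(e - bias) for a tiny
# float format, plus zero and the negated values.
def fp_grid(exp_bits=2, man_bits=1):
    grid = {0.0}
    for e in range(2 ** exp_bits):
        for m in range(2 ** man_bits):
            mag = (1 + m / 2 ** man_bits) * 2.0 ** (e - 1)  # bias of 1
            grid.update({mag, -mag})
    return np.array(sorted(grid))

# Round every entry of x to its nearest representable grid value.
def quantize(x, grid):
    x = np.asarray(x, dtype=float)
    idx = np.abs(x[..., None] - grid).argmin(axis=-1)
    return grid[idx]

grid = fp_grid()
print(quantize([0.26, 1.4, -3.3], grid))  # 0.5, 1.5, -3.0
```

Note the non-uniform spacing of the grid (dense near zero, coarse at large magnitudes), which is exactly what gives low-bit floats an edge over integer grids for long-tailed weight distributions.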

[CV-7] SpectralGaussians: Semantic spectral 3D Gaussian splatting for multi-spectral scene representation visualization and analysis

Link: https://arxiv.org/abs/2408.06975
Authors: Saptarshi Neil Sinha,Holger Graf,Michael Weinmann
Keywords: registered multi-view spectrum, semantically meaningful splats, Gaussian Splatting, cross-spectral rendering framework, rendering framework based
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments:

Click to view abstract

Abstract:We propose a novel cross-spectral rendering framework based on 3D Gaussian Splatting (3DGS) that generates realistic and semantically meaningful splats from registered multi-view spectrum and segmentation maps. This extension enhances the representation of scenes with multiple spectra, providing insights into the underlying materials and segmentation. We introduce an improved physically-based rendering approach for Gaussian splats, estimating reflectance and lights per spectra, thereby enhancing accuracy and realism. In a comprehensive quantitative and qualitative evaluation, we demonstrate the superior performance of our approach with respect to other recent learning-based spectral scene representation approaches (i.e., XNeRF and SpectralNeRF) as well as other non-spectral state-of-the-art learning-based approaches. Our work also demonstrates the potential of spectral scene understanding for precise scene editing techniques like style transfer, inpainting, and removal. Thereby, our contributions address challenges in multi-spectral scene representation, rendering, and editing, offering new possibilities for diverse applications.

[CV-8] Prompt-Based Segmentation at Multiple Resolutions and Lighting Conditions using Segment Anything Model 2

Link: https://arxiv.org/abs/2408.06970
Authors: Osher Rafaeli,Tal Svoray,Ariel Nahlieli
Keywords: RGB aerial imagery, conventional convolutional network, segmenting solar panels, RGB aerial, lighting conditions
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:This paper provides insight into the effectiveness of zero-shot, prompt-based, Segment Anything Model (SAM), and its updated version, SAM 2, and the non-promptable, conventional convolutional network (CNN), in segmenting solar panels, in RGB aerial imagery, across lighting conditions, spatial resolutions, and prompt strategies. SAM 2 demonstrates improvements over SAM, particularly in sub-optimal lighting conditions when prompted by points. Both SAMs, prompted by user-box, outperformed CNN, in all scenarios. Additionally, YOLOv9 prompting outperformed user points prompting. In high-resolution imagery, both in optimal and sub-optimal lighting conditions, Eff-UNet outperformed both SAM models prompted by YOLOv9 boxes, positioning Eff-UNet as the appropriate model for automatic segmentation in high-resolution data. In low-resolution data, user box prompts were found crucial to achieve a reasonable performance. This paper provides details on strengths and limitations of each model and outlines robustness of user prompted image segmentation models in inconsistent resolution and lighting conditions of remotely sensed data.

[CV-9] Breaking Class Barriers: Efficient Dataset Distillation via Inter-Class Feature Compensator

Link: https://arxiv.org/abs/2408.06927
Authors: Xin Zhang,Jiawei Du,Ping Liu,Joey Tianyi Zhou
Keywords: condense informative features, Inter-class Feature Compensator, Universal Feature Compensator, aiming to condense, condense informative
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Dataset distillation has emerged as a technique aiming to condense informative features from large, natural datasets into a compact and synthetic form. While recent advancements have refined this technique, its performance is bottlenecked by the prevailing class-specific synthesis paradigm. Under this paradigm, synthetic data is optimized exclusively for a pre-assigned one-hot label, creating an implicit class barrier in feature condensation. This leads to inefficient utilization of the distillation budget and oversight of inter-class feature distributions, which ultimately limits the effectiveness and efficiency, as demonstrated in our analysis. To overcome these constraints, this paper presents the Inter-class Feature Compensator (INFER), an innovative distillation approach that transcends the class-specific data-label framework widely utilized in current dataset distillation methods. Specifically, INFER leverages a Universal Feature Compensator (UFC) to enhance feature integration across classes, enabling the generation of multiple additional synthetic instances from a single UFC input. This significantly improves the efficiency of the distillation budget. Moreover, INFER enriches inter-class interactions during the distillation, thereby enhancing the effectiveness and generalizability of the distilled data. By allowing for the linear interpolation of labels similar to those in the original dataset, INFER meticulously optimizes the synthetic data and dramatically reduces the size of soft labels in the synthetic dataset to almost zero, establishing a new benchmark for efficiency and effectiveness in dataset distillation.
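The linear label interpolation mentioned above is the same mechanism as mixup-style soft labels: a synthetic sample built from two sources gets a correspondingly mixed label instead of a single one-hot class. The sketch below is a generic illustration with made-up variable names, not INFER's implementation.

```python
import numpy as np

# Mix two (sample, one-hot label) pairs with weight `lam`; the result is a
# synthetic sample with a soft label that still sums to 1.
def interpolate(x1, y1, x2, y2, lam=0.7):
    x = lam * x1 + (1 - lam) * x2
    y = lam * y1 + (1 - lam) * y2
    return x, y

x1, y1 = np.ones(4), np.array([1.0, 0.0])   # class-0 sample
x2, y2 = np.zeros(4), np.array([0.0, 1.0])  # class-1 sample
x, y = interpolate(x1, y1, x2, y2)
print(y)  # [0.7 0.3]
```

Because the mixed label is determined by the single scalar `lam`, it need not be stored per class, which is the intuition behind INFER shrinking soft-label storage to almost zero.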

[CV-10] SceneGPT: A Language Model for 3D Scene Understanding

Link: https://arxiv.org/abs/2408.06926
Authors: Shivam Chandhok
Keywords: large-scale training regimes, Building models, understand and reason, difficult owing, lack of data
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: UBC Report

Click to view abstract

Abstract:Building models that can understand and reason about 3D scenes is difficult owing to the lack of data sources for 3D supervised training and large-scale training regimes. In this work we ask - How can the knowledge in a pre-trained language model be leveraged for 3D scene understanding without any 3D pre-training. The aim of this work is to establish whether pre-trained LLMs possess priors/knowledge required for reasoning in 3D space and how can we prompt them such that they can be used for general purpose spatial reasoning and object understanding in 3D. To this end, we present SceneGPT, an LLM based scene understanding system which can perform 3D spatial reasoning without training or explicit 3D supervision. The key components of our framework are - 1) a 3D scene graph, that serves as scene representation, encoding the objects in the scene and their spatial relationships 2) a pre-trained LLM that can be adapted with in context learning for 3D spatial reasoning. We evaluate our framework qualitatively on object and scene understanding tasks including object semantics, physical properties and affordances (object-level) and spatial understanding (scene-level).
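The key idea above, a 3D scene graph serialized into text an LLM can reason over in-context, can be sketched concretely. The JSON schema below is our assumption for illustration, not the paper's exact representation.

```python
import json

# A toy 3D scene graph: objects with positions, plus spatial relations.
scene_graph = {
    "objects": [
        {"id": "chair_1", "category": "chair", "position": [1.0, 0.0, 2.0]},
        {"id": "table_1", "category": "table", "position": [1.2, 0.0, 2.5]},
    ],
    "relations": [["chair_1", "next_to", "table_1"]],
}

# Serialize the graph into an in-context prompt for a pre-trained LLM.
prompt = (
    "You are given a 3D scene graph as JSON:\n"
    + json.dumps(scene_graph, indent=2)
    + "\nQuestion: Which object is next to the table?"
)
print(prompt.splitlines()[0])  # the instruction line of the prompt
```

No 3D training is involved: all spatial knowledge the model needs is carried by the serialized graph plus whatever priors the LLM already holds.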

[CV-11] Divide and Conquer: Improving Multi-Camera 3D Perception with 2D Semantic-Depth Priors and Input-Dependent Queries

Link: https://arxiv.org/abs/2408.06901
Authors: Qi Song,Qingyong Hu,Chi Zhang,Yongquan Chen,Rui Huang
Keywords: significant attention recently, drawn significant attention, multi-camera images, attention recently, drawn significant
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by TIP 2024

Click to view abstract

Abstract:3D perception tasks, such as 3D object detection and Bird’s-Eye-View (BEV) segmentation using multi-camera images, have drawn significant attention recently. Despite the fact that accurately estimating both semantic and 3D scene layouts are crucial for this task, existing techniques often neglect the synergistic effects of semantic and depth cues, leading to the occurrence of classification and position estimation errors. Additionally, the input-independent nature of initial queries also limits the learning capacity of Transformer-based models. To tackle these challenges, we propose an input-aware Transformer framework that leverages Semantics and Depth as priors (named SDTR). Our approach involves the use of an S-D Encoder that explicitly models semantic and depth priors, thereby disentangling the learning process of object categorization and position estimation. Moreover, we introduce a Prior-guided Query Builder that incorporates the semantic prior into the initial queries of the Transformer, resulting in more effective input-aware queries. Extensive experiments on the nuScenes and Lyft benchmarks demonstrate the state-of-the-art performance of our method in both 3D object detection and BEV segmentation tasks.

[CV-12] EE3P3D: Event-based Estimation of Periodic Phenomena Frequency using 3D Correlation

Link: https://arxiv.org/abs/2408.06899
Authors: Jakub Kolář,Radim Špetlík,Jiří Matas
Keywords: high temporal resolution, device asynchronously reporting, asynchronously reporting brightness, independently operating pixels, temporal resolution
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 paper pages + 11 supplementary pages, 15 figures, 4 tables

Click to view abstract

Abstract:We present a novel method for measuring the frequency of periodic phenomena, e.g., rotation, flicker and vibration, by an event camera, a device asynchronously reporting brightness changes at independently operating pixels with high temporal resolution. The approach assumes that for a periodic phenomenon, a highly similar set of events is generated within a specific spatio-temporal window at a time difference corresponding to the phenomenon’s period. The sets of similar events are detected by 3D spatio-temporal correlation in the event stream space. The proposed method, EE3P3D, is evaluated on a dataset of 12 sequences of periodic phenomena, i.e. flashing light and vibration, and periodic motion, e.g., rotation, ranging from 3.2 Hz to 2 kHz (equivalent to 192 - 120 000 RPM). EE3P3D significantly outperforms published methods on this dataset, achieving a mean relative error of 0.1%.
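A simplified 1D analogue of the idea above: events generated by a periodic phenomenon repeat after one period, so correlating the event-count signal with shifted copies of itself peaks at the period. EE3P3D does this with full 3D spatio-temporal windows of the event stream; the sketch below bins events into a 1D signal, which is an assumption for illustration only.

```python
import numpy as np

# Estimate the period of a binned event signal via autocorrelation:
# the lag (>= min_lag) with the highest correlation is the period estimate.
def estimate_period(signal, min_lag=1):
    signal = signal - signal.mean()
    n = len(signal)
    corr = np.array([np.dot(signal[:n - k], signal[k:]) for k in range(n // 2)])
    return min_lag + int(np.argmax(corr[min_lag:]))

t = np.arange(1000)
events = (np.sin(2 * np.pi * t / 50) > 0.9).astype(float)  # bursts, period 50
print(estimate_period(events))  # 50
```

Given the bin width in seconds, the estimated lag converts directly to a frequency in Hz, the quantity EE3P3D reports for rotation, flicker, and vibration.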

[CV-13] Automatic Feature Recognition and Dimensional Attributes Extraction From CAD Models for Hybrid Additive-Subtractive Manufacturing

Link: https://arxiv.org/abs/2408.06891
Authors: Muhammad Tayyab Khan,Wenhe Feng,Lequn Chen,Ye Han Ng,Nicholas Yew Jin Tan,Seung Ki Moon
Keywords: facilitating seamless transitions, Computer-Aided Design, Computer-Aided Process Planning, manufacturing process planning, digital designs
Subjects: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 10 pages, 12 figures. This paper has been accepted for presentation at the ASME IDETC-CIE 2024 conference

Click to view abstract

Abstract:The integration of Computer-Aided Design (CAD), Computer-Aided Process Planning (CAPP), and Computer-Aided Manufacturing (CAM) plays a crucial role in modern manufacturing, facilitating seamless transitions from digital designs to physical products. However, a significant challenge within this integration is the Automatic Feature Recognition (AFR) of CAD models, especially in the context of hybrid manufacturing that combines subtractive and additive manufacturing processes. Traditional AFR methods, focused mainly on the identification of subtractive (machined) features including holes, fillets, chamfers, pockets, and slots, fail to recognize features pertinent to additive manufacturing. Furthermore, the traditional methods fall short in accurately extracting geometric dimensions and orientations, which are also key factors for effective manufacturing process planning. This paper presents a novel approach for creating a synthetic CAD dataset that encompasses features relevant to both additive and subtractive machining through Python Open Cascade. The Hierarchical Graph Convolutional Neural Network (HGCNN) model is implemented to accurately identify the composite additive-subtractive features within the synthetic CAD dataset. The key novelty and contribution of the proposed methodology lie in its ability to recognize a wide range of manufacturing features, and precisely extracting their dimensions, orientations, and stock sizes. The proposed model demonstrates remarkable feature recognition accuracy exceeding 97% and a dimension extraction accuracy of 100% for identified features. Therefore, the proposed methodology enhances the integration of CAD, CAPP, and CAM within hybrid manufacturing by providing precise feature recognition and dimension extraction. It facilitates improved manufacturing process planning, by enabling more informed decision-making.

[CV-14] PBIR-NIE: Glossy Object Capture under Non-Distant Lighting

Link: https://arxiv.org/abs/2408.06878
Authors: Guangyan Cai,Fujun Luan,Miloš Hašan,Kai Zhang,Sai Bi,Zexiang Xu,Iliyan Georgiev,Shuang Zhao
Keywords: multi-view input images, Glossy objects present, present a significant, significant challenge, multi-view input
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments:

Click to view abstract

Abstract:Glossy objects present a significant challenge for 3D reconstruction from multi-view input images under natural lighting. In this paper, we introduce PBIR-NIE, an inverse rendering framework designed to holistically capture the geometry, material attributes, and surrounding illumination of such objects. We propose a novel parallax-aware non-distant environment map as a lightweight and efficient lighting representation, accurately modeling the near-field background of the scene, which is commonly encountered in real-world capture setups. This feature allows our framework to accommodate complex parallax effects beyond the capabilities of standard infinite-distance environment maps. Our method optimizes an underlying signed distance field (SDF) through physics-based differentiable rendering, seamlessly connecting surface gradients between a triangle mesh and the SDF via neural implicit evolution (NIE). To address the intricacies of highly glossy BRDFs in differentiable rendering, we integrate the antithetic sampling algorithm to mitigate variance in the Monte Carlo gradient estimator. Consequently, our framework exhibits robust capabilities in handling glossy object reconstruction, showcasing superior quality in geometry, relighting, and material estimation.
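The antithetic sampling trick mentioned above is a classic variance-reduction technique for Monte Carlo estimators: evaluate the integrand at a sample `u` and at its "mirror" `1 - u`, and average the pair. The sketch below applies it to a simple 1D integral rather than the paper's gradient estimator.

```python
import numpy as np

rng = np.random.default_rng(0)

# Plain Monte Carlo estimate of the integral of f over [0, 1].
def mc_plain(f, n):
    u = rng.random(n)
    return f(u).mean()

# Antithetic variant: each uniform draw is paired with its mirror 1 - u,
# so negatively correlated pairs cancel much of the sampling noise.
def mc_antithetic(f, n):
    u = rng.random(n // 2)
    return 0.5 * (f(u) + f(1.0 - u)).mean()

f = lambda u: np.exp(u)            # true integral over [0, 1] is e - 1
est = mc_antithetic(f, 10_000)
print(abs(est - (np.e - 1)) < 1e-2)  # True: estimate is close to e - 1
```

Antithetic pairing helps whenever the integrand is monotone in the sample, which makes `f(u)` and `f(1 - u)` negatively correlated; the same reasoning motivates its use inside differentiable rendering of highly glossy BRDFs.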

[CV-15] A Comprehensive Survey on Synthetic Infrared Image synthesis

Link: https://arxiv.org/abs/2408.06868
Authors: Avinash Upadhyay,Manoj sharma,Prerna Mukherjee,Amit Singhal,Brejesh Lall
Keywords: important computer vision, computer vision problem, target generation, scene and target, remote sensing
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Submitted to the journal Infrared Physics & Technology

Click to view abstract

Abstract:Synthetic infrared (IR) scene and target generation is an important computer vision problem as it allows the generation of realistic IR images and targets for training and testing of various applications, such as remote sensing, surveillance, and target recognition. It also helps reduce the cost and risk associated with collecting real-world IR data. This survey paper aims to provide a comprehensive overview of the conventional mathematical modelling-based methods and deep learning-based methods used for generating synthetic IR scenes and targets. The paper discusses the importance of synthetic IR scene and target generation and briefly covers the mathematics of blackbody and grey body radiations, as well as IR image-capturing methods. The potential use cases of synthetic IR scenes and target generation are also described, highlighting the significance of these techniques in various fields. Additionally, the paper explores possible new ways of developing new techniques to enhance the efficiency and effectiveness of synthetic IR scenes and target generation while highlighting the need for further research to advance this field.
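The blackbody radiation mathematics the survey reviews reduces to Planck's law for spectral radiance, which is the starting point of physics-based IR scene synthesis. A minimal sketch:

```python
import math

H = 6.62607015e-34   # Planck constant, J*s
C = 2.99792458e8     # speed of light, m/s
KB = 1.380649e-23    # Boltzmann constant, J/K

# Planck's law: spectral radiance B(lambda, T) in W * sr^-1 * m^-3.
def planck_radiance(wavelength_m, temp_k):
    a = 2.0 * H * C ** 2 / wavelength_m ** 5
    b = math.expm1(H * C / (wavelength_m * KB * temp_k))
    return a / b

# A 300 K body (room temperature) radiates far more strongly near 10 um
# (the LWIR band) than near 1 um, which is why thermal cameras image LWIR.
print(planck_radiance(10e-6, 300.0) > planck_radiance(1e-6, 300.0))  # True
```

Grey-body emitters are then modeled by scaling this radiance with a wavelength-dependent emissivity below 1.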

[CV-16] Dynamic and Compressive Adaptation of Transformers From Images to Videos

Link: https://arxiv.org/abs/2408.06840
Authors: Guozhen Zhang,Jingyu Liu,Shengming Cao,Xiaotong Zhao,Kevin Zhao,Kai Ma,Limin Wang
Keywords: pre-trained Vision Transformers, Vision Transformers, pre-trained Vision, success of pre-trained, image-text matching
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Recently, the remarkable success of pre-trained Vision Transformers (ViTs) from image-text matching has sparked an interest in image-to-video adaptation. However, most current approaches retain the full forward pass for each frame, leading to a high computation overhead for processing entire videos. In this paper, we present InTI, a novel approach for compressive image-to-video adaptation using dynamic Inter-frame Token Interpolation. InTI aims to softly preserve the informative tokens without disrupting their coherent spatiotemporal structure. Specifically, each token pair at identical positions within neighbor frames is linearly aggregated into a new token, where the aggregation weights are generated by a multi-scale context-aware network. In this way, the information of neighbor frames can be adaptively compressed in a point-by-point manner, thereby effectively reducing the number of processed frames by half each time. Importantly, InTI can be seamlessly integrated with existing adaptation methods, achieving strong performance without extra-complex design. On Kinetics-400, InTI reaches a top-1 accuracy of 87.1 with a remarkable 37.5% reduction in GFLOPs compared to naive adaptation. When combined with additional temporal modules, InTI achieves a top-1 accuracy of 87.6 with a 37% reduction in GFLOPs. Similar conclusions have been verified in other common datasets.
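The inter-frame token interpolation described above can be sketched as follows: tokens at identical positions in two neighbor frames are linearly merged into one token, halving the number of processed frames. InTI learns the aggregation weights with a multi-scale context-aware network; the fixed scalar weight below is a stand-in for it.

```python
import numpy as np

# Merge each pair of neighbor frames into one by a per-position linear
# combination. tokens: (frames, num_tokens, dim), with an even frame count.
def interpolate_frames(tokens, w=0.5):
    f = tokens.shape[0]
    return w * tokens[0:f:2] + (1.0 - w) * tokens[1:f:2]

tokens = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)
merged = interpolate_frames(tokens)
print(merged.shape)  # (1, 3, 4) — frame count halved, token layout intact
```

Because the merge is pointwise over positions, the coherent spatiotemporal structure of the token grid is preserved, which is what lets the compressed sequence drop into the unmodified ViT backbone.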

[CV-17] GLGait: A Global-Local Temporal Receptive Field Network for Gait Recognition in the Wild ACM-MM2024

链接: https://arxiv.org/abs/2408.06834
作者: Guozhen Peng,Yunhong Wang,Yuwei Zhao,Shaoxiong Zhang,Annan Li
关键词-EN: temporal receptive field, temporal receptive, receptive field, attracted increasing attention, global temporal receptive
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by ACM MM2024

点击查看摘要

Abstract:Gait recognition has attracted increasing attention from academia and industry as a technology for recognizing humans at a distance in non-intrusive ways, without requiring cooperation. Although advanced methods have achieved impressive success in lab scenarios, most of them perform poorly in the wild. Recently, some Convolution Neural Networks (ConvNets) based methods have been proposed to address the issue of gait recognition in the wild. However, the temporal receptive field obtained by convolution operations is limited for long gait sequences. If convolution blocks are directly replaced with visual transformer blocks, the model may not enhance the local temporal receptive field, which is important for covering a complete gait cycle. To address this issue, we design a Global-Local Temporal Receptive Field Network (GLGait). GLGait employs a Global-Local Temporal Module (GLTM) to establish a global-local temporal receptive field, which mainly consists of a Pseudo Global Temporal Self-Attention (PGTA) and a temporal convolution operation. Specifically, PGTA is used to obtain a pseudo global temporal receptive field with less memory and computation complexity compared with a multi-head self-attention (MHSA). The temporal convolution operation is used to enhance the local temporal receptive field. Besides, it can also aggregate the pseudo global temporal receptive field into a true holistic temporal receptive field. Furthermore, we also propose a Center-Augmented Triplet Loss (CTL) in GLGait to reduce the intra-class distance and expand the positive samples in the training stage. Extensive experiments show that our method obtains state-of-the-art results on in-the-wild datasets, i.e., Gait3D and GREW. The code is available at this https URL.

[CV-18] FlatFusion: Delving into Details of Sparse Transformer-based Camera-LiDAR Fusion for Autonomous Driving

链接: https://arxiv.org/abs/2408.06832
作者: Yutao Zhu,Xiaosong Jia,Xinyu Yang,Junchi Yan
关键词-EN: diverse sensor modalities, autonomous driving scenarios, camera and LiDAR, sensor modalities, constitutes a prevalent
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The integration of data from diverse sensor modalities (e.g., camera and LiDAR) constitutes a prevalent methodology within the ambit of autonomous driving scenarios. Recent advancements in efficient point cloud transformers have underscored the efficacy of integrating information in sparse formats. When it comes to fusion, since image patches are dense in pixel space with ambiguous depth, it necessitates additional design considerations for effective fusion. In this paper, we conduct a comprehensive exploration of design choices for Transformer-based sparse camera-LiDAR fusion. This investigation encompasses strategies for image-to-3D and LiDAR-to-2D mapping, attention neighbor grouping, single modal tokenizer, and micro-structure of Transformer. By amalgamating the most effective principles uncovered through our investigation, we introduce FlatFusion, a carefully designed framework for sparse camera-LiDAR fusion. Notably, FlatFusion significantly outperforms state-of-the-art sparse Transformer-based methods, including UniTR, CMT, and SparseFusion, achieving 73.7 NDS on the nuScenes validation set at 10.1 FPS with PyTorch.

[CV-19] Photometric Inverse Rendering: Shading Cues Modeling and Surface Reflectance Regularization

链接: https://arxiv.org/abs/2408.06828
作者: Jingzhi Bao,Guanying Chen,Shuguang Cui
关键词-EN: inverse rendering, paper addresses, photometric images, neural inverse rendering, inverse
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project page: https://jzbao03.site/projects/PIR/

点击查看摘要

Abstract:This paper addresses the problem of inverse rendering from photometric images. Existing approaches for this problem suffer from the effects of self-shadows, inter-reflections, and lack of constraints on the surface reflectance, leading to inaccurate decomposition of reflectance and illumination due to the ill-posed nature of inverse rendering. In this work, we propose a new method for neural inverse rendering. Our method jointly optimizes the light source position to account for the self-shadows in images, and computes indirect illumination using a differentiable rendering layer and an importance sampling strategy. To enhance surface reflectance decomposition, we introduce a new regularization by distilling DINO features to foster accurate and consistent material decomposition. Extensive experiments on synthetic and real datasets demonstrate that our method outperforms the state-of-the-art methods in reflectance decomposition.

[CV-20] Membership Inference Attack Against Masked Image Modeling

链接: https://arxiv.org/abs/2408.06825
作者: Zheng Li,Xinlei He,Ning Yu,Yang Zhang
关键词-EN: Masked Image Modeling, achieved significant success, Masked Image, Image Modeling, MIM
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Masked Image Modeling (MIM) has achieved significant success in the realm of self-supervised learning (SSL) for visual recognition. The image encoder pre-trained through MIM, involving the masking and subsequent reconstruction of input images, attains state-of-the-art performance in various downstream vision tasks. However, most existing works focus on improving the performance of MIM. In this work, we take a different angle by studying the pre-training data privacy of MIM. Specifically, we propose the first membership inference attack against image encoders pre-trained by MIM, which aims to determine whether an image is part of the MIM pre-training dataset. The key design is to simulate the pre-training paradigm of MIM, i.e., image masking and subsequent reconstruction, and then obtain reconstruction errors. These reconstruction errors can serve as membership signals for achieving attack goals, as the encoder is more capable of reconstructing the input image in its training set with lower errors. Extensive evaluations are conducted on three model architectures and three benchmark datasets. Empirical results show that our attack outperforms baseline methods. Additionally, we undertake intricate ablation studies to analyze multiple factors that could influence the performance of the attack.
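The attack's key signal, a reconstruction error obtained from a simulated mask-and-reconstruct pass, can be sketched in a few lines. This is a toy illustration, not the paper's code: the `reconstruct` callables and the decision threshold are hypothetical stand-ins for a trained MIM encoder-decoder and a calibrated boundary.

```python
import numpy as np

def mim_reconstruction_error(image, reconstruct, mask_ratio=0.6, seed=0):
    """Mask random pixels, reconstruct, and return the MSE against
    the original image -- the membership signal used by the attack."""
    rng = np.random.default_rng(seed)
    mask = rng.random(image.shape) < mask_ratio
    masked = np.where(mask, 0.0, image)
    return float(np.mean((reconstruct(masked) - image) ** 2))

def is_member(image, reconstruct, threshold):
    """Members of the pre-training set tend to have lower error."""
    return mim_reconstruction_error(image, reconstruct) < threshold

img = np.random.rand(16, 16)
perfect = lambda masked: img    # hypothetical well-trained decoder
naive = lambda masked: masked   # recovers nothing that was masked
err_low = mim_reconstruction_error(img, perfect)   # exactly 0.0
err_high = mim_reconstruction_error(img, naive)    # strictly positive
```

The gap between `err_low` and `err_high` is what the real attack thresholds on, after simulating MIM masking on candidate images.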

[CV-21] Structure-preserving Planar Simplification for Indoor Environments

链接: https://arxiv.org/abs/2408.06814
作者: Bishwash Khanal,Sanjay Rijal,Manish Awale,Vaghawan Ojha
关键词-EN: structure-preserving planar simplification, scene point clouds, indoor scene point, point cloud, point cloud undergoes
类目: Computer Vision and Pattern Recognition (cs.CV); Computational Geometry (cs.CG)
*备注:

点击查看摘要

Abstract:This paper presents a novel approach for structure-preserving planar simplification of indoor scene point clouds for both simulated and real-world environments. Initially, the scene point cloud undergoes preprocessing steps, including noise reduction and Manhattan world alignment, to ensure robustness and coherence in subsequent analyses. We segment each captured scene into structured (walls-ceiling-floor) and non-structured (indoor objects) scenes. Leveraging a RANSAC algorithm, we extract primitive planes from the input point cloud, facilitating the segmentation and simplification of the structured scene. The best-fitting wall meshes are then generated from the primitives, followed by adjacent mesh merging with the vertex-translation algorithm which preserves the mesh layout. To accurately represent ceilings and floors, we employ the mesh clipping algorithm which clips the ceiling and floor meshes with respect to wall normals. In the case of indoor scenes, we apply a surface reconstruction technique to enhance the fidelity. This paper focuses on the intricate steps of the proposed scene simplification methodology, addressing complex scenarios such as multi-story and slanted walls and ceilings. We also conduct qualitative and quantitative performance comparisons against popular surface reconstruction, shape approximation, and floorplan generation approaches.

[CV-22] Oracle Bone Script Similiar Character Screening Approach Based on Simsiam Contrastive Learning and Supervised Learning

链接: https://arxiv.org/abs/2408.06811
作者: Xinying Weng,Yifan Li,Shuaidong Hao,Jialiang Hou
关键词-EN: comprehensive evaluation method, self-supervised and RepVGG, RepVGG supervised learning, project proposes, fuzzy comprehensive evaluation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This project proposes a new method that uses the fuzzy comprehensive evaluation method to integrate ResNet-50 self-supervised and RepVGG supervised learning. The source image dataset HWOBC oracle is taken as input, the target image is selected, and finally the most similar image is output in turn without any manual intervention. The same feature encoding method is not used for images of different modalities. Before model training, the image data is preprocessed, and the images are enhanced by random rotation processing, a self-square graph equalization theory algorithm, and gamma transform, which effectively enhances the learning of key features. Finally, the fuzzy comprehensive evaluation method is used to combine the results of supervised training and unsupervised training, which can better solve the “most similar” problem that is difficult to quantify. At present, there are many unknown oracle-bone inscriptions waiting to be deciphered. Relating them to known glyphs can provide new ideas for decipherment.

[CV-23] Unmasking the Uniqueness: A Glimpse into Age-Invariant Face Recognition of Indigenous African Faces

链接: https://arxiv.org/abs/2408.06806
作者: Fakunle Ajewole,Joseph Damilola Akinyemi,Khadijat Tope Ladoja,Olufade Falade Williams Onifade
关键词-EN: compared to Africa, received considerable research, considerable research efforts, AIFR research efforts, indigenous African faces
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Keywords: Age-Invariant Face Recognition, CACD, FAGE_v2, VGGFace

点击查看摘要

Abstract:The task of recognizing the age-separated faces of an individual, Age-Invariant Face Recognition (AIFR), has received considerable research efforts in Europe, America, and Asia, compared to Africa. Thus, AIFR research efforts have often under-represented/misrepresented the African ethnicity with non-indigenous Africans. This work developed an AIFR system for indigenous African faces to reduce the misrepresentation of African ethnicity in facial image analysis research. We adopted a pre-trained deep learning model (VGGFace) for AIFR on a dataset of 5,000 indigenous African faces (FAGE_v2) collected for this study. FAGE_v2 was curated via Internet image searches of 500 individuals evenly distributed across 10 African countries. VGGFace was trained on FAGE_v2 to obtain the best accuracy of 81.80%. We also performed experiments on an African-American subset of the CACD dataset and obtained the best accuracy of 91.5%. The results show a significant difference in the recognition accuracies of indigenous versus non-indigenous Africans.

[CV-24] Integrating Saliency Ranking and Reinforcement Learning for Enhanced Object Detection ALT

链接: https://arxiv.org/abs/2408.06803
作者: Matthias Bartolo,Dylan Seychell,Josef Bajada
关键词-EN: based visual attention, combine reinforcement learning, visual attention methods, saliency ranking techniques, based visual
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Resultant work from Dissertation, Department of AI, University of Malta. Code available at: this https URL

点击查看摘要

Abstract:With the ever-growing variety of object detection approaches, this study explores a series of experiments that combine reinforcement learning (RL)-based visual attention methods with saliency ranking techniques to investigate transparent and sustainable solutions. By integrating saliency ranking for initial bounding box prediction and subsequently applying RL techniques to refine these predictions through a finite set of actions over multiple time steps, this study aims to enhance RL object detection accuracy. Presented as a series of experiments, this research investigates the use of various image feature extraction methods and explores diverse Deep Q-Network (DQN) architectural variations for deep reinforcement learning-based localisation agent training. Additionally, we focus on optimising the detection pipeline at every step by prioritising lightweight and faster models, while also incorporating the capability to classify detected objects, a feature absent in previous RL approaches. We show that by evaluating the performance of these trained agents using the Pascal VOC 2007 dataset, faster and more optimised models were developed. Notably, the best mean Average Precision (mAP) achieved in this study was 51.4, surpassing benchmarks set by RL-based single object detectors in the literature.

[CV-25] Token Compensator: Altering Inference Cost of Vision Transformer without Re-Tuning ECCV2024

链接: https://arxiv.org/abs/2408.06798
作者: Shibo Jie,Yehui Tang,Jianyuan Guo,Zhi-Hong Deng,Kai Han,Yunhe Wang
关键词-EN: Vision Transformers, pruning inattentive tokens, merging similar tokens, Token compression expedites, compression degrees
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to ECCV2024

点击查看摘要

Abstract:Token compression expedites the training and inference of Vision Transformers (ViTs) by reducing the number of the redundant tokens, e.g., pruning inattentive tokens or merging similar tokens. However, when applied to downstream tasks, these approaches suffer from significant performance drop when the compression degrees are mismatched between training and inference stages, which limits the application of token compression on off-the-shelf trained models. In this paper, we propose a model arithmetic framework to decouple the compression degrees between the two stages. In advance, we additionally perform a fast parameter-efficient self-distillation stage on the pre-trained models to obtain a small plugin, called Token Compensator (ToCom), which describes the gap between models across different compression degrees. During inference, ToCom can be directly inserted into any downstream off-the-shelf models with any mismatched training and inference compression degrees to acquire universal performance improvements without further training. Experiments on over 20 downstream tasks demonstrate the effectiveness of our framework. On CIFAR100, fine-grained visual classification, and VTAB-1k, ToCom can yield up to a maximum improvement of 2.3%, 1.5%, and 2.0% in the average performance of DeiT-B, respectively. Code: this https URL

[CV-26] Visual Neural Decoding via Improved Visual-EEG Semantic Consistency

链接: https://arxiv.org/abs/2408.06788
作者: Hongzhou Chen,Lianghua He,Yihang Liu,Longzhen Yang
关键词-EN: human brain activity, interpreting original visual, original visual experiences, EEG visual decoding, brain activity
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:Visual neural decoding refers to the process of extracting and interpreting original visual experiences from human brain activity. Recent advances in metric learning-based EEG visual decoding methods have delivered promising results and demonstrated the feasibility of decoding novel visual categories from brain activity. However, methods that directly map EEG features to the CLIP embedding space may introduce mapping bias and cause semantic inconsistency among features, thereby degrading alignment and impairing decoding performance. To further explore the semantic consistency between visual and neural signals, in this work we construct a joint semantic space and propose a Visual-EEG Semantic Decouple Framework that explicitly extracts the semantic-related features of these two modalities to facilitate optimal alignment. Specifically, a cross-modal information decoupling module is introduced to guide the extraction of semantic-related information from the modalities. Then, by quantifying the mutual information between visual image and EEG features, we observe a strong positive correlation between the decoding performance and the magnitude of mutual information. Furthermore, inspired by the mechanisms of visual object understanding from neuroscience, we propose an intra-class geometric consistency approach during the alignment process. This strategy maps visual samples within the same class to consistent neural patterns, which further enhances the robustness and the performance of EEG visual decoding. Experiments on a large Image-EEG dataset show that our method achieves state-of-the-art results in zero-shot neural decoding tasks.

[CV-27] Do Vision-Language Foundational models show Robust Visual Perception?

链接: https://arxiv.org/abs/2408.06781
作者: Shivam Chandhok,Pranav Tandon
关键词-EN: perform visual understanding, Recent advances, enabled development, perform visual, visual understanding
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: UBC Report

点击查看摘要

Abstract:Recent advances in vision-language foundational models have enabled development of systems that can perform visual understanding and reasoning tasks. However, it is unclear if these models are robust to distribution shifts, and how their performance and generalization capabilities vary under changes in data distribution. In this project we strive to answer the question “Are vision-language foundational models robust to distribution shifts like human perception?” Specifically, we consider a diverse range of vision-language models and compare how the performance of these systems is affected by corruption-based distribution shifts (such as motion blur, fog, snow, Gaussian noise) commonly found in practical real-world scenarios. We analyse the generalization capabilities qualitatively and quantitatively on the zero-shot image classification task under the aforementioned distribution shifts. Our code will be available at this https URL

[CV-28] ED4: Explicit Data-level Debiasing for Deepfake Detection

链接: https://arxiv.org/abs/2408.06779
作者: Jikang Cheng,Ying Zhang,Qin Zou,Zhiyuan Yan,Chao Liang,Zhongyuan Wang,Chen Li
关键词-EN: Learning intrinsic bias, Learning intrinsic, considered the main, main reason, Spatial Consistency Module
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Learning intrinsic bias from limited data has been considered the main reason for the failure of deepfake detection with generalizability. Apart from the discovered content and specific-forgery bias, we reveal a novel spatial bias, where detectors inertly anticipate observing structural forgery clues appearing at the image center, which can also lead to the poor generalization of existing methods. We present ED^4, a simple and effective strategy, to address the aforementioned biases explicitly at the data level in a unified framework rather than implicit disentanglement via network design. In particular, we develop ClockMix to produce facial structure preserved mixtures with arbitrary samples, which allows the detector to learn from an exponentially extended data distribution with much more diverse identities, backgrounds, local manipulation traces, and the co-occurrence of multiple forgery artifacts. We further propose the Adversarial Spatial Consistency Module (AdvSCM) to prevent extracting features with spatial bias, which adversarially generates spatial-inconsistent images and constrains their extracted feature to be consistent. As a model-agnostic debiasing strategy, ED^4 is plug-and-play: it can be integrated with various deepfake detectors to obtain significant benefits. We conduct extensive experiments to demonstrate its effectiveness and superiority over existing deepfake detection approaches.

[CV-29] Exploring Domain Shift on Radar-Based 3D Object Detection Amidst Diverse Environmental Conditions ITSC

链接: https://arxiv.org/abs/2408.06772
作者: Miao Zhang,Sherif Abdulatif,Benedikt Loesch,Marco Altmann,Marius Schwarz,Bin Yang
关键词-EN: autonomous driving systems, perception using multimodal, rapid evolution, evolution of deep, deep learning
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 6 pages, 5 figures, 3 tables, accepted in IEEE International Conference on Intelligent Transportation Systems (ITSC) 2024

点击查看摘要

Abstract:The rapid evolution of deep learning and its integration with autonomous driving systems have led to substantial advancements in 3D perception using multimodal sensors. Notably, radar sensors show greater robustness compared to cameras and lidar under adverse weather and varying illumination conditions. This study delves into the often-overlooked yet crucial issue of domain shift in 4D radar-based object detection, examining how varying environmental conditions, such as different weather patterns and road types, impact 3D object detection performance. Our findings highlight distinct domain shifts across various weather scenarios, revealing unique dataset sensitivities that underscore the critical role of radar point cloud generation. Additionally, we demonstrate that transitioning between different road types, especially from highways to urban settings, introduces notable domain shifts, emphasizing the necessity for diverse data collection across varied road environments. To the best of our knowledge, this is the first comprehensive analysis of domain shift effects on 4D radar-based object detection. We believe this empirical study contributes to understanding the complex nature of domain shifts in radar data and suggests paths forward for data collection strategy in the face of environmental variability.

[CV-30] Cross-View Geolocalization and Disaster Mapping with Street-View and VHR Satellite Imagery: A Case Study of Hurricane IAN

链接: https://arxiv.org/abs/2408.06761
作者: Hao Li,Fabian Deuser,Wenping Yina,Xuanshu Luo,Paul Walther,Gengchen Mai,Wei Huang,Martin Werner
关键词-EN: Nature disasters play, human-urban infrastructure interactions, shaping human-urban infrastructure, Nature disasters, play a key
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Natural disasters play a key role in shaping human-urban infrastructure interactions. Effective and efficient response to natural disasters is essential for building resilience and a sustainable urban environment. Two types of information are usually the most necessary and difficult to gather in disaster response. The first is disaster damage perception, which shows how badly people think that urban infrastructure has been damaged. The second is geolocation awareness, i.e., how people's whereabouts are made available. In this paper, we propose a novel disaster mapping framework, namely CVDisaster, aiming at simultaneously addressing geolocalization and damage perception estimation using cross-view Street-View Imagery (SVI) and Very High-Resolution satellite imagery. CVDisaster consists of two cross-view models, where CVDisaster-Geoloc refers to a cross-view geolocalization model based on a contrastive learning objective with a Siamese ConvNeXt image encoder, and CVDisaster-Est is a cross-view classification model based on a Couple Global Context Vision Transformer (CGCViT). Taking Hurricane IAN as a case study, we evaluate the CVDisaster framework by creating a novel cross-view dataset (CVIAN) and conducting extensive experiments. As a result, we show that CVDisaster can achieve highly competitive performance (over 80% for geolocalization and 75% for damage perception estimation) with even limited fine-tuning efforts, which largely motivates future cross-view models and applications within a broader GeoAI research community. The data and code are publicly available at: this https URL.

[CV-31] Sumotosima: A Framework and Dataset for Classifying and Summarizing Otoscopic Images

链接: https://arxiv.org/abs/2408.06755
作者: Eram Anwarul Khan,Anas Anwarul Haq Khan
关键词-EN: diagnostic procedure, procedure to examine, canal and eardrum, ear canal, ear drum perforations
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
*备注: Work in Progress

点击查看摘要

Abstract:Otoscopy is a diagnostic procedure to examine the ear canal and eardrum using an otoscope. It identifies conditions like infections, foreign bodies, ear drum perforations and ear abnormalities. We propose a novel resource efficient deep learning and transformer based framework, Sumotosima (Summarizer for otoscopic images), an end-to-end pipeline for classification followed by summarization. Our framework works on combination of triplet and cross-entropy losses. Additionally, we use Knowledge Enhanced Multimodal BART whose input is fused textual and image embedding. The objective is to provide summaries that are well-suited for patients, ensuring clarity and efficiency in understanding otoscopic images. Given the lack of existing datasets, we have curated our own OCASD (Otoscopic Classification And Summary Dataset), which includes 500 images with 5 unique categories annotated with their class and summaries by Otolaryngologists. Sumotosima achieved a result of 98.03%, which is 7.00%, 3.10%, 3.01% higher than K-Nearest Neighbors, Random Forest and Support Vector Machines, respectively, in classification tasks. For summarization, Sumotosima outperformed GPT-4o and LLaVA by 88.53% and 107.57% in ROUGE scores, respectively. We have made our code and dataset publicly available at this https URL
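The abstract's combination of triplet and cross-entropy losses can be sketched as a weighted sum. This is a generic illustration, not the paper's code; the balancing weight `alpha` is a hypothetical choice:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, pos, neg, margin=1.0):
    """Pull the anchor toward the positive, push it from the negative."""
    return max(euclidean(anchor, pos) - euclidean(anchor, neg) + margin, 0.0)

def cross_entropy(probs, label):
    """Negative log-likelihood of the true class."""
    return -math.log(probs[label])

def combined_loss(anchor, pos, neg, probs, label, alpha=0.5):
    # alpha balances metric learning vs. classification (illustrative value)
    return alpha * triplet_loss(anchor, pos, neg) + \
        (1.0 - alpha) * cross_entropy(probs, label)
```

The triplet term shapes the embedding space while the cross-entropy term drives the 5-way category prediction; training on both is what the framework's classification stage relies on.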

[CV-32] Detecting Audio-Visual Deepfakes with Fine-Grained Inconsistencies BMVC2024

链接: https://arxiv.org/abs/2408.06753
作者: Marcella Astrid,Enjie Ghorbel,Djamila Aouada
关键词-EN: audio-visual deepfake detection, visual data, Existing methods, detection mainly focus, focus on high-level
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注: Accepted in BMVC 2024

点击查看摘要

Abstract:Existing methods on audio-visual deepfake detection mainly focus on high-level features for modeling inconsistencies between audio and visual data. As a result, these approaches usually overlook finer audio-visual artifacts, which are inherent to deepfakes. Herein, we propose the introduction of fine-grained mechanisms for detecting subtle artifacts in both spatial and temporal domains. First, we introduce a local audio-visual model capable of capturing small spatial regions that are prone to inconsistencies with audio. For that purpose, a fine-grained mechanism based on a spatially-local distance coupled with an attention module is adopted. Second, we introduce a temporally-local pseudo-fake augmentation to include samples incorporating subtle temporal inconsistencies in our training set. Experiments on the DFDC and the FakeAVCeleb datasets demonstrate the superiority of the proposed method in terms of generalization as compared to the state-of-the-art under both in-dataset and cross-dataset settings.

[CV-33] ReCLIP: Learn to Rectify the Bias of CLIP for Unsupervised Semantic Segmentation CVPR24

链接: https://arxiv.org/abs/2408.06747
作者: Jingyun Wang,Guoliang Kang
关键词-EN: Recent works utilize, works utilize CLIP, semantic segmentation task, Recent works, unsupervised semantic segmentation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Extended version of our CVPR 24 paper

点击查看摘要

Abstract:Recent works utilize CLIP to perform the challenging unsupervised semantic segmentation task where only images without annotations are available. However, we observe that when adopting CLIP to such a pixel-level understanding task, unexpected bias (including class-preference bias and space-preference bias) occurs. Previous works don’t explicitly model the bias, which largely constrains the segmentation performance. In this paper, we propose to explicitly model and rectify the bias existing in CLIP to facilitate the unsupervised semantic segmentation task. Specifically, we design a learnable ‘‘Reference’’ prompt to encode class-preference bias and a projection of the positional embedding in vision transformer to encode space-preference bias respectively. To avoid interference, two kinds of biases are firstly independently encoded into the Reference feature and the positional feature. Via a matrix multiplication between two features, a bias logit map is generated to explicitly represent two kinds of biases. Then we rectify the logits of CLIP via a simple element-wise subtraction. To make the rectified results smoother and more contextual, we design a mask decoder which takes the feature of CLIP and rectified logits as input and outputs a rectified segmentation mask with the help of Gumbel-Softmax operation. To make the bias modeling and rectification process meaningful and effective, a contrastive loss based on masked visual features and the text features of different classes is imposed. To further improve the segmentation, we distill the knowledge from the rectified CLIP to the advanced segmentation architecture via minimizing our designed mask-guided, feature-guided and text-guided loss terms. Extensive experiments on various benchmarks demonstrate that ReCLIP++ performs favorably against previous SOTAs. The implementation is available at: this https URL.
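The rectification step described above, a bias logit map built by a matrix multiplication of the two bias encodings and then subtracted element-wise from CLIP's logits, can be sketched with toy shapes. The dimensions below are illustrative assumptions, not values from the paper:

```python
import numpy as np

def rectify_logits(clip_logits, pos_feat, ref_feat):
    """Subtract an explicit bias logit map from raw CLIP logits.

    clip_logits: (P, C) per-position class logits from CLIP
    pos_feat:    (P, D) space-preference (positional) encoding
    ref_feat:    (C, D) class-preference ("Reference") encoding
    """
    bias_logits = pos_feat @ ref_feat.T   # (P, C) bias logit map
    return clip_logits - bias_logits      # element-wise rectification

P, C, D = 6, 3, 4                         # positions, classes, feature dim
logits = np.random.rand(P, C)
pos = np.random.rand(P, D)
ref = np.random.rand(C, D)
rectified = rectify_logits(logits, pos, ref)
```

In the actual method both encodings are learned, and the rectified logits are further smoothed by a mask decoder; the sketch only shows the subtraction itself.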

[CV-34] Long-Tailed Out-of-Distribution Detection: Prioritizing Attention to Tail

链接: https://arxiv.org/abs/2408.06742
作者: Yina He,Lei Peng,Yongcun Zhang,Juanjuan Weng,Zhiming Luo,Shaozi Li
关键词-EN: assume balanced in-distribution, typically assume balanced, real-world data follow, OOD data, methods typically assume
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Current out-of-distribution (OOD) detection methods typically assume balanced in-distribution (ID) data, while most real-world data follow a long-tailed distribution. Previous approaches to long-tailed OOD detection often involve balancing the ID data by reducing the semantics of head classes. However, this reduction can severely affect the classification accuracy of ID data. The main challenge of this task lies in the severe lack of features for tail classes, leading to confusion with OOD data. To tackle this issue, we introduce a novel Prioritizing Attention to Tail (PATT) method using augmentation instead of reduction. Our main intuition involves using a mixture of von Mises-Fisher (vMF) distributions to model the ID data and a temperature scaling module to boost the confidence of ID data. This enables us to generate infinite contrastive pairs, implicitly enhancing the semantics of ID classes while promoting differentiation between ID and OOD data. To further strengthen the detection of OOD data without compromising the classification performance of ID data, we propose feature calibration during the inference phase. By extracting an attention weight from the training set that prioritizes the tail classes and reduces the confidence in OOD data, we improve the OOD detection capability. Extensive experiments verified that our method outperforms the current state-of-the-art methods on various benchmarks.
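The temperature scaling module used to boost the confidence of ID data can be illustrated with a standard temperature-scaled softmax; the temperature value below is hypothetical, not taken from the paper:

```python
import numpy as np

def scaled_softmax(logits, T=1.0):
    """Softmax with temperature; T < 1 sharpens the distribution,
    raising the confidence of the predicted class."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()                 # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 1.0, 0.5]
sharp = scaled_softmax(logits, T=0.5)   # higher peak probability
soft = scaled_softmax(logits, T=1.0)
```

Sharpening ID confidence in this way widens the score gap against OOD inputs, which is the role the abstract assigns to the module.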

[CV-35] Improving Synthetic Image Detection Towards Generalization: An Image Transformation Perspective

Link: https://arxiv.org/abs/2408.06741
Authors: Ouxiang Li, Jiayin Cai, Yanbin Hao, Xiaolong Jiang, Yao Hu, Fuli Feng
Keywords: photo-realistic image synthesis, facilitating photo-realistic image, models facilitating photo-realistic, social platforms, artifact features
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:With recent generative models facilitating photo-realistic image synthesis, the proliferation of synthetic images has also engendered certain negative impacts on social platforms, thereby raising an urgent imperative to develop effective detectors. Current synthetic image detection (SID) pipelines are primarily dedicated to crafting universal artifact features, accompanied by an oversight of the SID training paradigm. In this paper, we re-examine the SID problem and identify two prevalent biases in current training paradigms, i.e., weakened artifact features and overfitted artifact features. Meanwhile, we discover that the imaging mechanism of synthetic images contributes to heightened local correlations among pixels, suggesting that detectors should be equipped with local awareness. In this light, we propose SAFE, a lightweight and effective detector with three simple image transformations. Firstly, for weakened artifact features, we substitute the down-sampling operator with the crop operator in image pre-processing to help circumvent artifact distortion. Secondly, for overfitted artifact features, we include ColorJitter and RandomRotation as additional data augmentations, to help alleviate irrelevant biases from color discrepancies and semantic differences in limited training samples. Thirdly, for local awareness, we propose a patch-based random masking strategy tailored for SID, forcing the detector to focus on local regions at training. Comparative experiments are conducted on an open-world dataset, comprising synthetic images generated by 26 distinct generative models. Our pipeline achieves a new state-of-the-art performance, with remarkable improvements of 4.5% in accuracy and 2.9% in average precision against existing methods.
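The patch-based random masking strategy can be sketched as follows; the patch size, mask ratio, and list-of-lists image representation are illustrative assumptions, not the paper's implementation:

```python
import random

def random_patch_mask(image, patch=4, mask_ratio=0.25, fill=0):
    """Zero out a random subset of non-overlapping patch x patch regions,
    forcing a downstream detector to rely on the remaining local regions."""
    h, w = len(image), len(image[0])
    corners = [(r, c) for r in range(0, h, patch) for c in range(0, w, patch)]
    for r, c in random.sample(corners, int(len(corners) * mask_ratio)):
        for i in range(r, min(r + patch, h)):
            for j in range(c, min(c + patch, w)):
                image[i][j] = fill
    return image

img = [[1] * 16 for _ in range(16)]       # toy 16x16 single-channel image
masked = random_patch_mask(img, patch=4)  # masks 4 of the 16 patches in place
```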

[CV-36] DiffLoRA: Generating Personalized Low-Rank Adaptation Weights with Diffusion

Link: https://arxiv.org/abs/2408.06740
Authors: Yujia Wu, Yiming Shi, Jiwei Wei, Chengwei Sun, Yuyang Zhou, Yang Yang, Heng Tao Shen
Keywords: gained significant attention, specific identities conditioned, generate high-fidelity portraits, generation has gained, user-defined prompts
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 9 pages, 8 figures

Abstract:Personalized text-to-image generation has gained significant attention for its capability to generate high-fidelity portraits of specific identities conditioned on user-defined prompts. Existing methods typically involve test-time fine-tuning or instead incorporating an additional pre-trained branch. However, these approaches struggle to simultaneously address the demands of efficiency, identity fidelity, and preserving the model’s original generative capabilities. In this paper, we propose DiffLoRA, a novel approach that leverages diffusion models as a hypernetwork to predict personalized low-rank adaptation (LoRA) weights based on the reference images. By integrating these LoRA weights into the text-to-image model, DiffLoRA achieves personalization during inference without further training. Additionally, we propose an identity-oriented LoRA weight construction pipeline to facilitate the training of DiffLoRA. By utilizing the dataset produced by this pipeline, our DiffLoRA consistently generates high-performance and accurate LoRA weights. Extensive evaluations demonstrate the effectiveness of our method, achieving both time efficiency and maintaining identity fidelity throughout the personalization process.
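Folding low-rank weights into a base layer at inference follows the standard LoRA identity W' = W + α·B·A; the tiny matrices below are hypothetical toy values:

```python
def matmul(A, B):
    """Plain list-of-lists matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def merge_lora(W, B, A, alpha=1.0):
    """Fold a low-rank update into the base weights: W' = W + alpha * (B @ A),
    so inference afterwards needs no extra branch or further training."""
    delta = matmul(B, A)
    return [[w + alpha * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]  # 2x2 base weight
B = [[1.0], [2.0]]            # 2x1 factor
A = [[0.5, 0.5]]              # 1x2 factor (rank-1 update)
merged = merge_lora(W, B, A)  # [[1.5, 0.5], [1.0, 2.0]]
```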

[CV-37] Enhancing Visual Dialog State Tracking through Iterative Object-Entity Alignment in Multi-Round Conversations

Link: https://arxiv.org/abs/2408.06725
Authors: Wei Pang, Ruixue Duan, Jinfu Yang, Ning Li
Keywords: Visual Dialog, dialog history, image-related questions based, multi-round dialog history, Multi-round Dialogue State
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in CAAI Transactions on Intelligence Technology. Article ID: CIT2_12370, Article DOI: https://doi.org/10.1049/cit2.12370

Abstract:Visual Dialog (VD) is a task where an agent answers a series of image-related questions based on a multi-round dialog history. However, previous VD methods often treat the entire dialog history as a simple text input, disregarding the inherent conversational information flows at the round level. In this paper, we introduce Multi-round Dialogue State Tracking model (MDST), a framework that addresses this limitation by leveraging the dialogue state learned from dialog history to answer questions. MDST captures each round of dialog history, constructing internal dialogue state representations defined as 2-tuples of vision-language representations. These representations effectively ground the current question, enabling the generation of accurate answers. Experimental results on the VisDial v1.0 dataset demonstrate that MDST achieves a new state-of-the-art performance in generative setting. Furthermore, through a series of human studies, we validate the effectiveness of MDST in generating long, consistent, and human-like answers while consistently answering a series of questions correctly.

[CV-38] Response Wide Shut: Surprising Observations in Basic Vision Language Model Capabilities

Link: https://arxiv.org/abs/2408.06721
Authors: Shivam Chandhok, Wan-Cyuan Fan, Leonid Sigal
Keywords: computer vision problems, general purpose tools, complex computer vision, vision problems, emerged as general
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under Submission

Abstract:Vision-Language Models (VLMs) have emerged as general purpose tools for addressing a variety of complex computer vision problems. Such models have been shown to be highly capable, but, at the same time, also lacking some basic visual understanding skills. In this paper, we set out to understand the limitations of SoTA VLMs on fundamental visual tasks: object classification, understanding spatial arrangement, and ability to delineate individual object instances (through counting), by constructing a series of tests that probe which components of design, specifically, may be lacking. Importantly, we go significantly beyond the current benchmarks, which simply measure the final performance of VLMs, by also comparing and contrasting it to the performance of probes trained directly on features obtained from the visual encoder (image embeddings), as well as the intermediate vision-language projection used to bridge the image-encoder and LLM-decoder output in many SoTA models (e.g., LLaVA, BLIP, InstructBLIP). In doing so, we uncover nascent shortcomings in VLMs' responses and make a number of important observations which could help train and develop more effective VLM models in future.

[CV-39] Multimodal Analysis of White Blood Cell Differentiation in Acute Myeloid Leukemia Patients using a beta-Variational Autoencoder MICCAI2024

Link: https://arxiv.org/abs/2408.06720
Authors: Gizem Mert, Ario Sadafi, Raheleh Salehi, Nassir Navab, Carsten Marr
Keywords: Biomedical imaging, imaging and RNA, RNA sequencing, white blood cell, blood cell diseases
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments: Accepted for publication at MICCAI 2024 workshop on AI for Imaging Genomics Learning (AIIG)

Abstract:Biomedical imaging and RNA sequencing with single-cell resolution improve our understanding of white blood cell diseases like leukemia. By combining morphological and transcriptomic data, we can gain insights into cellular functions and trajectories involved in blood cell differentiation. However, existing methodologies struggle with integrating morphological and transcriptomic data, leaving a significant research gap in comprehensively understanding the dynamics of cell differentiation. Here, we introduce an unsupervised method that explores and reconstructs these two modalities and uncovers the relationship between different subtypes of white blood cells from human peripheral blood smears in terms of morphology and their corresponding transcriptome. Our method is based on a beta-variational autoencoder (β-VAE) with a customized loss function, incorporating an R-CNN architecture to distinguish single cells from background and to minimize any interference from artifacts. This implementation of β-VAE shows good reconstruction capability along with continuous latent embeddings, while maintaining clear differentiation between single-cell classes. Our novel approach is especially helpful to uncover the correlation of two latent features in complex biological processes such as formation of granules in the cell (granulopoiesis) with gene expression patterns. It thus provides a unique tool to improve the understanding of white blood cell maturation for biomedicine and diagnostics.
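The β-VAE objective is the usual reconstruction term plus a β-weighted KL divergence to a standard-normal prior; the numbers below are toy inputs, not the paper's customized loss:

```python
import math

def beta_vae_loss(x, x_hat, mu, log_var, beta=4.0):
    """Mean squared reconstruction error plus beta * KL(q(z|x) || N(0, I))
    for a diagonal-Gaussian posterior with parameters (mu, log_var)."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    kl = -0.5 * sum(1.0 + lv - m ** 2 - math.exp(lv) for m, lv in zip(mu, log_var))
    return recon + beta * kl

# A posterior that already matches the prior contributes zero KL,
# so only the small reconstruction error remains.
loss = beta_vae_loss(x=[1.0, 0.0], x_hat=[0.9, 0.1], mu=[0.0, 0.0], log_var=[0.0, 0.0])
```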

[CV-40] Towards Cross-Domain Single Blood Cell Image Classification via Large-Scale LoRA-based Segment Anything Model

Link: https://arxiv.org/abs/2408.06716
Authors: Yongcheng Li, Lingcong Cai, Ying Lu, Yupeng Zhang, Jingyan Jiang, Genan Dai, Bowen Zhang, Jingzhou Cao, Xiangzhong Zhang, Xiaomao Fan
Keywords: blood cell, Accurate classification, medical conditions, blood cells plays, plays a vital
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Accurate classification of blood cells plays a vital role in hematological analysis as it aids physicians in diagnosing various medical conditions. In this study, we present a novel approach for classifying blood cell images known as BC-SAM. BC-SAM leverages the large-scale foundation model of Segment Anything Model (SAM) and incorporates a fine-tuning technique using LoRA, allowing it to extract general image embeddings from blood cell images. To enhance the applicability of BC-SAM across different blood cell image datasets, we introduce an unsupervised cross-domain autoencoder that focuses on learning intrinsic features while suppressing artifacts in the images. To assess the performance of BC-SAM, we employ four widely used machine learning classifiers (Random Forest, Support Vector Machine, Artificial Neural Network, and XGBoost) to construct blood cell classification models and compare them against existing state-of-the-art methods. Experimental results conducted on two publicly available blood cell datasets (Matek-19 and Acevedo-20) demonstrate that our proposed BC-SAM achieves a new state-of-the-art result, surpassing the baseline methods with a significant improvement. The source code of this paper is available at this https URL.
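The last stage, fitting classical classifiers on frozen image embeddings, can be mimicked with a minimal nearest-centroid stand-in (the paper uses Random Forest, SVM, ANN, and XGBoost; the 2-D embeddings and class names here are invented):

```python
def fit_centroids(embeddings, labels):
    """Average the frozen embeddings of each class into one centroid."""
    sums, counts = {}, {}
    for emb, y in zip(embeddings, labels):
        acc = sums.setdefault(y, [0.0] * len(emb))
        for i, v in enumerate(emb):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict(centroids, emb):
    """Assign the class with the nearest centroid (squared Euclidean)."""
    sq = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda y: sq(centroids[y], emb))

embs = [[0.0, 0.1], [0.1, 0.0], [1.0, 0.9], [0.9, 1.0]]
labs = ["lymphocyte", "lymphocyte", "neutrophil", "neutrophil"]
cents = fit_centroids(embs, labs)
```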

[CV-41] Review Learning: Advancing All-in-One Ultra-High-Definition Image Restoration Training Method

Link: https://arxiv.org/abs/2408.06709
Authors: Xin Su, Zhuoran Zheng, Chen Wu
Keywords: UHD image restoration, image restoration, image restoration tasks, image restoration models, increasingly important
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:All-in-one image restoration tasks are becoming increasingly important, especially for ultra-high-definition (UHD) images. Existing all-in-one UHD image restoration methods usually boost the model's performance by introducing prompts or customized dynamic networks for different degradation types. This may be friendly for the inference stage, but in the training stage, since the model encounters multiple degraded images of different quality in an epoch, these cluttered learning objectives can be information pollution for the model. To address this problem, we propose a new training paradigm for general image restoration models, named Review Learning, which enables image restoration models to be capable enough to handle multiple types of degradation without prior knowledge and prompts. This approach begins with sequential training of an image restoration model on several degraded datasets, combined with a review mechanism that enhances the image restoration model's memory for several previous classes of degraded datasets. In addition, we design a lightweight all-purpose image restoration network that can efficiently reason about degraded images at 4K (3840×2160) resolution on a single consumer-grade GPU.

[CV-42] MAIR++: Improving Multi-view Attention Inverse Rendering with Implicit Lighting Representation

Link: https://arxiv.org/abs/2408.06707
Authors: JunYong Choi, SeokYeong Lee, Haesol Park, Seung-Won Jung, Ig-Jae Kim, Junghyun Cho
Keywords: inverse rendering, scene-level inverse rendering, Attention Inverse Rendering, scene-level inverse, inverse
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:In this paper, we propose a scene-level inverse rendering framework that uses multi-view images to decompose the scene into geometry, SVBRDF, and 3D spatially-varying lighting. While multi-view images have been widely used for object-level inverse rendering, scene-level inverse rendering has primarily been studied using single-view images due to the lack of a dataset containing high dynamic range multi-view images with ground-truth geometry, material, and spatially-varying lighting. To improve the quality of scene-level inverse rendering, a novel framework called Multi-view Attention Inverse Rendering (MAIR) was recently introduced. MAIR performs scene-level multi-view inverse rendering by expanding the OpenRooms dataset, designing efficient pipelines to handle multi-view images, and splitting spatially-varying lighting. Although MAIR showed impressive results, its lighting representation is fixed to spherical Gaussians, which limits its ability to render images realistically. Consequently, MAIR cannot be directly used in applications such as material editing. Moreover, its multi-view aggregation networks have difficulties extracting rich features because they only focus on the mean and variance between multi-view features. In this paper, we propose its extended version, called MAIR++. MAIR++ addresses the aforementioned limitations by introducing an implicit lighting representation that accurately captures the lighting conditions of an image while facilitating realistic rendering. Furthermore, we design a directional attention-based multi-view aggregation network to infer more intricate relationships between views. Experimental results show that MAIR++ not only achieves better performance than MAIR and single-view-based methods, but also displays robust performance on unseen real-world scenes.

[CV-43] SlotLifter: Slot-guided Feature Lifting for Learning Object-centric Radiance Fields ECCV2024

Link: https://arxiv.org/abs/2408.06697
Authors: Yu Liu, Baoxiong Jia, Yixin Chen, Siyuan Huang
Keywords: underpins human-level generalization, distill object-centric abstractions, intricate visual scenes, visual scenes underpins, scenes underpins human-level
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Accepted by ECCV 2024. Project website: this https URL

Abstract:The ability to distill object-centric abstractions from intricate visual scenes underpins human-level generalization. Despite the significant progress in object-centric learning methods, learning object-centric representations in the 3D physical world remains a crucial challenge. In this work, we propose SlotLifter, a novel object-centric radiance model addressing scene reconstruction and decomposition jointly via slot-guided feature lifting. Such a design unites object-centric learning representations and image-based rendering methods, offering state-of-the-art performance in scene decomposition and novel-view synthesis on four challenging synthetic and four complex real-world datasets, outperforming existing 3D object-centric learning methods by a large margin. Through extensive ablative studies, we showcase the efficacy of designs in SlotLifter, revealing key insights for potential future directions.

[CV-44] DC3DO: Diffusion Classifier for 3D Objects

Link: https://arxiv.org/abs/2408.06693
Authors: Nursena Koprucu, Meher Shashwat Nigam, Shicheng Xu (Luke), Biruk Abere, Gabriele Dominici, Andrew Rodriguez, Sharvaree Vadgam, Berfin Inal, Alberto Tono
Keywords: Geoffrey Hinton emphasis, Inspired by Geoffrey, Geoffrey Hinton, Hinton emphasis, learn to generate
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computational Geometry (cs.CG)

Abstract:Inspired by Geoffrey Hinton's emphasis on generative modeling ("To recognize shapes, first learn to generate them"), we explore the use of 3D diffusion models for object classification. Leveraging the density estimates from these models, our approach, the Diffusion Classifier for 3D Objects (DC3DO), enables zero-shot classification of 3D shapes without additional training. On average, our method achieves a 12.5 percent improvement compared to its multiview counterparts, demonstrating superior multimodal reasoning over discriminative approaches. DC3DO employs a class-conditional diffusion model trained on ShapeNet, and we run inferences on point clouds of chairs and cars. This work highlights the potential of generative models in 3D object classification.
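A diffusion classifier reduces to an argmin over per-class density scores (in practice, each conditional model's denoising error); the error functions below are invented stand-ins, not DC3DO's models:

```python
def diffusion_classify(x, class_error):
    """Zero-shot classification: return the class whose conditional diffusion
    model explains x best, i.e. yields the lowest denoising error."""
    return min(class_error, key=lambda c: class_error[c](x))

# Hypothetical per-class denoising-error estimators over a 1-D toy "shape code".
class_error = {
    "chair": lambda x: abs(x - 1.0),
    "car": lambda x: abs(x - 5.0),
}
label = diffusion_classify(1.2, class_error)  # closest to the "chair" model
```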

[CV-45] Masked Image Modeling: A Survey

Link: https://arxiv.org/abs/2408.06687
Authors: Vlad Hondru, Florinel Alin Croitoru, Shervin Minaee, Radu Tudor Ionescu, Nicu Sebe
Keywords: powerful self-supervised learning, self-supervised learning technique, survey recent studies, computer vision, approach that emerged
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:In this work, we survey recent studies on masked image modeling (MIM), an approach that emerged as a powerful self-supervised learning technique in computer vision. The MIM task involves masking some information, e.g. pixels, patches, or even latent representations, and training a model, usually an autoencoder, to predict the missing information by using the context available in the visible part of the input. We identify and formalize two categories of approaches on how to implement MIM as a pretext task, one based on reconstruction and one based on contrastive learning. Then, we construct a taxonomy and review the most prominent papers in recent years. We complement the manually constructed taxonomy with a dendrogram obtained by applying a hierarchical clustering algorithm. We further identify relevant clusters via manually inspecting the resulting dendrogram. Our review also includes datasets that are commonly used in MIM research. We aggregate the performance results of various masked image modeling methods on the most popular datasets, to facilitate the comparison of competing methods. Finally, we identify research gaps and propose several interesting directions of future work.
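The reconstruction-based MIM pretext task in its simplest form: hide a fraction of tokens and score the model only on the hidden positions. The mean-filling "model" below is a deliberately trivial placeholder for an autoencoder:

```python
import random

def mim_loss(tokens, model, mask_ratio=0.4):
    """Mask a random subset of tokens, let the model fill them from the
    visible context, and average squared error over masked positions only."""
    masked = set(random.sample(range(len(tokens)), int(len(tokens) * mask_ratio)))
    visible = [t if i not in masked else None for i, t in enumerate(tokens)]
    preds = model(visible)
    return sum((preds[i] - tokens[i]) ** 2 for i in masked) / max(len(masked), 1)

def mean_filler(visible):
    """Trivial baseline: replace every masked slot with the visible mean."""
    seen = [t for t in visible if t is not None]
    mean = sum(seen) / len(seen)
    return [mean if t is None else t for t in visible]

loss = mim_loss([1.0, 2.0, 3.0, 4.0, 5.0], mean_filler)
```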

[CV-46] Bi-directional Contextual Attention for 3D Dense Captioning ECCV2024

Link: https://arxiv.org/abs/2408.06662
Authors: Minjung Kim, Hyung Suk Lim, Soonyoung Lee, Bumsoo Kim, Gunhee Kim
Keywords: global scene, Bi-directional Contextual Attention, object, dense captioning, scene
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ECCV 2024 (Oral)

Abstract:3D dense captioning is a task involving the localization of objects and the generation of descriptions for each object in a 3D scene. Recent approaches have attempted to incorporate contextual information by modeling relationships with object pairs or aggregating the nearest neighbor features of an object. However, the contextual information constructed in these scenarios is limited in two aspects: first, objects have multiple positional relationships that exist across the entire global scene, not only near the object itself. Second, it faces contradicting objectives: localization and attribute descriptions are generated better with tight localization, while descriptions involving global positional relations are generated better with contextualized features of the global scene. To overcome this challenge, we introduce BiCA, a transformer encoder-decoder pipeline that engages in 3D dense captioning for each object with Bi-directional Contextual Attention. Leveraging parallelly decoded instance queries for objects and context queries for non-object contexts, BiCA generates object-aware contexts, where the contexts relevant to each object are summarized, and context-aware objects, where the objects relevant to the summarized object-aware contexts are aggregated. This extension relieves previous methods from the contradicting objectives, enhancing localization performance and enabling the aggregation of contextual features throughout the global scene, thus improving caption generation performance simultaneously. Extensive experiments on two of the most widely-used 3D dense captioning datasets demonstrate that our proposed method achieves a significant improvement over prior methods.

[CV-47] Hybrid SD: Edge-Cloud Collaborative Inference for Stable Diffusion Models

Link: https://arxiv.org/abs/2408.06646
Authors: Chenqian Yan, Songwei Liu, Hongjian Liu, Xurui Peng, Xiaojian Wang, Fangming Chen, Lean Fu, Xing Mei
Keywords: shown remarkable proficiency, Stable Diffusion Models, Stable Diffusion, shown remarkable, remarkable proficiency
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Stable Diffusion Models (SDMs) have shown remarkable proficiency in image synthesis. However, their broad application is impeded by their large model sizes and intensive computational requirements, which typically require expensive cloud servers for deployment. On the flip side, while there are many compact models tailored for edge devices that can reduce these demands, they often compromise on semantic integrity and visual quality when compared to full-sized SDMs. To bridge this gap, we introduce Hybrid SD, an innovative, training-free SDMs inference framework designed for edge-cloud collaborative inference. Hybrid SD distributes the early steps of the diffusion process to the large models deployed on cloud servers, enhancing semantic planning. Furthermore, small efficient models deployed on edge devices can be integrated for refining visual details in the later stages. Acknowledging the diversity of edge devices with differing computational and storage capacities, we employ structural pruning to the SDMs U-Net and train a lightweight VAE. Empirical evaluations demonstrate that our compressed models achieve state-of-the-art parameter efficiency (225.8M) on edge devices with competitive image quality. Additionally, Hybrid SD reduces the cloud cost by 66% with edge-cloud collaborative inference.
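The edge-cloud split amounts to routing the first fraction of denoising steps to the large model and the rest to the small one; the scalar "denoisers" and the 40% split below are toy assumptions, not the paper's schedule:

```python
def hybrid_denoise(x, steps, cloud_step, edge_step, cloud_fraction=0.4):
    """Run the early (semantic-planning) denoising steps with the cloud model
    and the remaining (detail-refining) steps with the edge model."""
    k = int(steps * cloud_fraction)
    for t in range(steps, steps - k, -1):   # first k steps on the cloud
        x = cloud_step(x, t)
    for t in range(steps - k, 0, -1):       # remaining steps on the edge
        x = edge_step(x, t)
    return x

cloud = lambda x, t: x * 0.5   # stand-in for the large U-Net's update
edge = lambda x, t: x * 0.8    # stand-in for the pruned edge U-Net
out = hybrid_denoise(10.0, steps=5, cloud_step=cloud, edge_step=edge)
```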

[CV-48] COD: Learning Conditional Invariant Representation for Domain Adaptation Regression ECCV2024

Link: https://arxiv.org/abs/2408.06638
Authors: Hao-Ran Yang, Chuan-Xian Ren, You-Wei Luo
Keywords: unlabeled target domain, Domain Adaptation Regression, Aiming to generalize, complex practical learning, source domain
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ECCV 2024 (oral)

Abstract:Aiming to generalize the label knowledge from a source domain with continuous outputs to an unlabeled target domain, Domain Adaptation Regression (DAR) is developed for complex practical learning problems. However, due to the continuity problem in regression, existing conditional distribution alignment theory and methods with discrete prior, which are proven to be effective in classification settings, are no longer applicable. In this work, focusing on the feasibility problems in DAR, we establish the sufficiency theory for the regression model, which shows the generalization error can be sufficiently dominated by the cross-domain conditional discrepancy. Further, to characterize conditional discrepancy with continuous conditioning variable, a novel Conditional Operator Discrepancy (COD) is proposed, which admits the metric property on conditional distributions via the kernel embedding theory. Finally, to minimize the discrepancy, a COD-based conditional invariant representation learning model is proposed, and the reformulation is derived to show that reasonable modifications on moment statistics can further improve the discriminability of the adaptation model. Extensive experiments on standard DAR datasets verify the validity of theoretical results and the superiority over SOTA DAR methods.
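COD itself is a conditional-operator discrepancy; a plain squared MMD under an RBF kernel previews only the kernel-embedding idea it builds on (the samples and bandwidth below are arbitrary):

```python
import math

def rbf(a, b, sigma=1.0):
    """Gaussian (RBF) kernel on scalars."""
    return math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2))

def mmd2(xs, ys, sigma=1.0):
    """Squared maximum mean discrepancy between two samples: the distance
    between their kernel mean embeddings in the RBF RKHS."""
    kxx = sum(rbf(a, b, sigma) for a in xs for b in xs) / len(xs) ** 2
    kyy = sum(rbf(a, b, sigma) for a in ys for b in ys) / len(ys) ** 2
    kxy = sum(rbf(a, b, sigma) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2.0 * kxy

near = mmd2([0.0, 0.1], [0.05, 0.15])  # overlapping samples: small discrepancy
far = mmd2([0.0, 0.1], [3.0, 3.1])     # shifted samples: large discrepancy
```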

[CV-49] Unified-IoU: For High-Quality Object Detection

Link: https://arxiv.org/abs/2408.06636
Authors: Xiangjie Luo, Zhihao Cai, Bo Shao, Yingxun Wang
Keywords: Ground Truth box, Object detection, current prediction box, computer vision, prediction box
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Object detection is an important part of the field of computer vision, and the effect of object detection is directly determined by the regression accuracy of the prediction box. As the key to model training, IoU (Intersection over Union) greatly shows the difference between the current prediction box and the Ground Truth box. Subsequent researchers have continuously added more considerations to IoU, such as center distance, aspect ratio, and so on. However, there is an upper limit to just refining the geometric differences, and there is a potential connection between the new consideration index and the IoU itself; the direct addition or subtraction between the two may lead to the problem of “over-consideration”. Based on this, we propose a new IoU loss function, called Unified-IoU (UIoU), which is more concerned with the weight assignment between different quality prediction boxes. Specifically, the loss function dynamically shifts the model’s attention from low-quality prediction boxes to high-quality prediction boxes in a novel way to enhance the model’s detection performance on high-precision or intensive datasets and achieve a balance in training speed. Our proposed method achieves better performance on multiple datasets, especially at a high IoU threshold; UIoU has a more significant improvement effect compared with other improved IoU losses. Our code is publicly available at: this https URL.
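Plain IoU, plus one possible quality weighting on top of it; the weight = IoU^γ scheme below is a hypothetical illustration of shifting attention toward high-quality boxes, not the paper's UIoU formula:

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def weighted_iou_loss(pred, gt, gamma=2.0):
    """Down-weight low-quality boxes so gradients focus on high-IoU ones."""
    v = iou(pred, gt)
    return (v ** gamma) * (1.0 - v)

v = iou((0, 0, 2, 2), (1, 1, 3, 3))  # intersection 1, union 7
```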

[CV-50] IDRetracor: Towards Visual Forensics Against Malicious Face Swapping

Link: https://arxiv.org/abs/2408.06635
Authors: Jikang Cheng, Jiaxin Ai, Zhen Han, Chao Liang, Qin Zou, Zhongyuan Wang, Qian Wang
Keywords: personal identity security, poses significant social, significant social risks, swapping technique based, methods poses significant
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:The face swapping technique based on deepfake methods poses significant social risks to personal identity security. While numerous deepfake detection methods have been proposed as countermeasures against malicious face swapping, they can only output binary labels (Fake/Real) for distinguishing fake content without reliable and traceable evidence. To achieve visual forensics and target face attribution, we propose a novel task named face retracing, which considers retracing the original target face from the given fake one via inverse mapping. Toward this goal, we propose an IDRetracor that can retrace arbitrary original target identities from fake faces generated by multiple face swapping methods. Specifically, we first adopt a mapping resolver to perceive the possible solution space of the original target face for the inverse mappings. Then, we propose mapping-aware convolutions to retrace the original target face from the fake one. Such convolutions contain multiple kernels that can be combined under the control of the mapping resolver to tackle different face swapping mappings dynamically. Extensive experiments demonstrate that the IDRetracor exhibits promising retracing performance from both quantitative and qualitative perspectives.

[CV-51] A lightweight YOLOv5-FFM model for occlusion pedestrian detection

Link: https://arxiv.org/abs/2408.06633
Authors: Xiangjie Luo, Bo Shao, Zhihao Cai, Yingxun Wang
Keywords: autonomous driving technology, pedestrian detection, development of autonomous, autonomous driving, driving technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:The development of autonomous driving technology must be inseparable from pedestrian detection. Because of the fast speed of the vehicle, the accuracy and real-time performance of the pedestrian detection algorithm are very important. YOLO, as an efficient and simple one-stage target detection method, is often used for pedestrian detection in various environments. However, this series of detectors face some challenges, such as excessive computation and undesirable detection rate when facing occluded pedestrians. In this paper, we propose an improved lightweight YOLOv5 model to deal with these problems. This model can achieve better pedestrian detection accuracy with fewer floating-point operations (FLOPs), especially for occluded targets. In order to achieve the above goals, we made improvements based on the YOLOv5 model framework and introduced Ghost module and SE block. Furthermore, we designed a local feature fusion module (FFM) to deal with occlusion in pedestrian detection. To verify the validity of our method, two datasets, Citypersons and CUHK Occlusion, were selected for the experiment. The experimental results show that, compared with the original yolov5s model, the average precision (AP) of our method is significantly improved, while the number of parameters is reduced by 27.9% and FLOPs are reduced by 19.0%.
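The SE block added to the YOLOv5 backbone follows the standard squeeze-excite pattern; the toy feature maps and bottleneck weights below are made-up values, not trained parameters:

```python
import math

def se_block(feature_maps, w1, w2):
    """Squeeze-and-Excitation: squeeze each channel to its global average,
    pass it through a two-layer bottleneck, then rescale the channels."""
    # Squeeze: global average pooling per channel.
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature_maps]
    # Excitation: FC -> ReLU -> FC -> sigmoid.
    hidden = [max(0.0, sum(w * s for w, s in zip(col, squeezed))) for col in w1]
    scales = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(col, hidden)))) for col in w2]
    # Rescale: multiply each channel by its learned gate.
    return [[[v * s for v in row] for row in ch] for ch, s in zip(feature_maps, scales)]

# Two 2x2 channels, bottleneck of size 1 (hypothetical weights).
fmaps = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]
w1 = [[0.5, 0.5]]        # squeeze(2) -> hidden(1)
w2 = [[1.0], [-1.0]]     # hidden(1) -> per-channel gates(2)
gated = se_block(fmaps, w1, w2)
```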

[CV-52] Fast Information Streaming Handler (FisH): A Unified Seismic Neural Network for Single Station Real-Time Earthquake Early Warning

Link: https://arxiv.org/abs/2408.06629
Authors: Tianning Zhang, Feng Liu, Yuming Yuan, Rui Su, Wanli Ouyang, Lei Bai
Keywords: Existing EEW approaches, Existing EEW, Information Streaming Handler, Fast Information Streaming, treat phase picking
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Existing EEW approaches often treat phase picking, location estimation, and magnitude estimation as separate tasks, lacking a unified framework. Additionally, most deep learning models in seismology rely on full three-component waveforms and are not suitable for real-time streaming data. To address these limitations, we propose a novel unified seismic neural network called Fast Information Streaming Handler (FisH). FisH is designed to process real-time streaming seismic data and generate simultaneous results for phase picking, location estimation, and magnitude estimation in an end-to-end fashion. By integrating these tasks within a single model, FisH simplifies the overall process and leverages the nonlinear relationships between tasks for improved performance. The FisH model utilizes RetNet as its backbone, enabling parallel processing during training and recurrent handling during inference. This capability makes FisH suitable for real-time applications, reducing latency in EEW systems. Extensive experiments conducted on the STEAD benchmark dataset provide strong validation for the effectiveness of our proposed FisH model. The results demonstrate that FisH achieves impressive performance across multiple seismic event detection and characterization tasks. Specifically, it achieves an F1 score of 0.99/0.96. Also, FisH demonstrates precise earthquake location estimation, with location error of only 6.0km, a distance error of 2.6km, and a back-azimuth error of 19°. The model also exhibits accurate earthquake magnitude estimation, with a magnitude error of just 0.14. Additionally, FisH is capable of generating real-time estimations, providing location and magnitude estimations with a location error of 8.06km and a magnitude error of 0.18 within a mere 3 seconds after the P-wave arrives.

[CV-53] DePatch: Towards Robust Adversarial Patch for Evading Person Detectors in the Real World

Link: https://arxiv.org/abs/2408.06625
Authors: Jikang Cheng, Ying Zhang, Zhongyuan Wang, Zou Qin, Chen Li
Keywords: deep neural networks, deceiving deep neural, craft deployable patterns, Recent years, physical adversarial attacks
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Recent years have seen an increasing interest in physical adversarial attacks, which aim to craft deployable patterns for deceiving deep neural networks, especially for person detectors. However, the adversarial patterns of existing patch-based attacks heavily suffer from the self-coupling issue, where a degradation, caused by physical transformations, in any small patch segment can result in a complete adversarial dysfunction, leading to poor robustness in the complex real world. Upon this observation, we introduce the Decoupled adversarial Patch (DePatch) attack to address the self-coupling issue of adversarial patches. Specifically, we divide the adversarial patch into block-wise segments, and reduce the inter-dependency among these segments through randomly erasing out some segments during the optimization. We further introduce a border shifting operation and a progressive decoupling strategy to improve the overall attack capabilities. Extensive experiments demonstrate the superior performance of our method over other physical adversarial attacks, especially in the real world.
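The decoupling step described above can be sketched concretely: divide the patch into block-wise segments and randomly zero some of them at each optimization step, so no single segment carries the whole attack. This is a hypothetical simplification (2D grayscale patch, uniform erase probability), not the authors' code.

```python
import random

def erase_segments(patch, block, drop_prob, rng):
    """Zero out randomly chosen block x block segments of a 2D patch."""
    h, w = len(patch), len(patch[0])
    out = [row[:] for row in patch]  # leave the input patch untouched
    for by in range(0, h, block):
        for bx in range(0, w, block):
            if rng.random() < drop_prob:
                for y in range(by, min(by + block, h)):
                    for x in range(bx, min(bx + block, w)):
                        out[y][x] = 0.0
    return out

# One optimization step would compute the adversarial loss on `masked`
# rather than `patch`, reducing inter-segment dependency.
patch = [[1.0] * 4 for _ in range(4)]
masked = erase_segments(patch, block=2, drop_prob=0.5, rng=random.Random(0))
```

The paper's border shifting and progressive decoupling schedule would sit on top of this basic erase step.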

[CV-54] ActPrompt: In-Domain Feature Adaptation via Action Cues for Video Temporal Grounding

链接: https://arxiv.org/abs/2408.06622
作者: Yubin Wang,Xinyang Jiang,De Cheng,Dongsheng Li,Cairong Zhao
关键词-EN: emerging topic aiming, identify specific clips, Video temporal grounding, pre-trained video models, emerging topic
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 9 pages, 5 figures

点击查看摘要

Abstract:Video temporal grounding is an emerging topic aiming to identify specific clips within videos. In addition to pre-trained video models, contemporary methods utilize pre-trained vision-language models (VLM) to capture detailed characteristics of diverse scenes and objects from video frames. However, as pre-trained on images, VLM may struggle to distinguish action-sensitive patterns from static objects, making it necessary to adapt them to specific data domains for effective feature representation over temporal grounding. We address two primary challenges to achieve this goal. Specifically, to mitigate high adaptation costs, we propose an efficient preliminary in-domain fine-tuning paradigm for feature adaptation, where downstream-adaptive features are learned through several pretext tasks. Furthermore, to integrate action-sensitive information into VLM, we introduce Action-Cue-Injected Temporal Prompt Learning (ActPrompt), which injects action cues into the image encoder of VLM for better discovering action-sensitive patterns. Extensive experiments demonstrate that ActPrompt is an off-the-shelf training framework that can be effectively applied to various SOTA methods, resulting in notable improvements. The complete code used in this study is provided in the supplementary materials.

[CV-55] ViMo: Generating Motions from Casual Videos

链接: https://arxiv.org/abs/2408.06614
作者: Liangdong Qiu,Chengxing Yu,Yanran Li,Zhao Wang,Haibin Huang,Chongyang Ma,Di Zhang,Pengfei Wan,Xiaoguang Han
关键词-EN: intricate camera movements, innate ability, ability to imagine, imagine multiple, multiple possible actions
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:Although humans have the innate ability to imagine multiple possible actions from videos, it remains an extraordinary challenge for computers due to intricate camera movements and montages. Most existing motion generation methods predominantly rely on manually collected motion datasets, usually tediously sourced from motion capture (Mocap) systems or multi-view cameras, unavoidably resulting in a limited size that severely undermines their generalizability. Inspired by recent advances in diffusion models, we probe a simple and effective way to capture motions from videos and propose a novel Video-to-Motion-Generation framework (ViMo) that leverages the immense trove of untapped video content to produce abundant and diverse 3D human motions. Distinct from prior work, our videos can be more casual, including complicated camera movements and occlusions. Striking experimental results demonstrate that the proposed model can generate natural motions even for videos with rapid movements, varying perspectives, or frequent occlusions. We also show this work enables three important downstream applications, such as generating dancing motions according to arbitrary music and source video style. Extensive experimental results prove that our model offers an effective and scalable way to generate diverse and realistic motions. Code and demos will be public soon.


[CV-56] CROME: Cross-Modal Adapters for Efficient Multimodal LLM

链接: https://arxiv.org/abs/2408.06610
作者: Sayna Ebrahimi,Sercan O. Arik,Tejas Nama,Tomas Pfister
关键词-EN: Large Language Models, remarkable image-language capabilities, Multimodal Large Language, Large Language, demonstrate remarkable image-language
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) demonstrate remarkable image-language capabilities, but their widespread use faces challenges in cost-effective training and adaptation. Existing approaches often necessitate expensive language model retraining and limited adaptability. Additionally, the current focus on zero-shot performance improvements offers insufficient guidance for task-specific tuning. We propose CROME, an efficient vision-language instruction tuning framework. It features a novel gated cross-modal adapter that effectively combines visual and textual representations prior to input into a frozen LLM. This lightweight adapter, trained with minimal parameters, enables efficient cross-modal understanding. Notably, CROME demonstrates superior zero-shot performance on standard visual question answering and instruction-following benchmarks. Moreover, it yields fine-tuning with exceptional parameter efficiency, competing with task-specific specialist state-of-the-art methods. CROME demonstrates the potential of pre-LM alignment for building scalable, adaptable, and parameter-efficient multimodal models.
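The gated cross-modal adapter described above can be illustrated with a minimal per-dimension gate: a learned sigmoid decides how much visual versus textual signal enters the frozen LLM. This is a hedged simplification of the idea (the gate parameterization and dimensions are made up, not CROME's actual architecture).

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fuse(visual, textual, gate_weights, gate_bias):
    """fused[i] = g*v[i] + (1-g)*t[i], with g = sigmoid(w[i]*(v[i]+t[i]) + b[i]).

    Only gate_weights and gate_bias would be trained; the LLM that
    consumes `fused` stays frozen, which is what keeps tuning cheap.
    """
    fused = []
    for v, t, w, b in zip(visual, textual, gate_weights, gate_bias):
        g = sigmoid(w * (v + t) + b)
        fused.append(g * v + (1.0 - g) * t)
    return fused

out = gated_fuse([1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0])
# With zero gate parameters g = 0.5, so each output is the plain average.
```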

[CV-57] MV-DETR: Multi-modality indoor object detection by Multi-View DEtecton TRansformers

链接: https://arxiv.org/abs/2408.06604
作者: Zichao Dong,Yilin Zhang,Xufeng Huang,Hang Ji,Zhan Shi,Xin Zhan,Junbo Chen
关键词-EN: efficient transformer based, transformer based detection, based detection method, MV-DETR pipeline, efficient transformer
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We introduce MV-DETR, a novel transformer-based detection pipeline that is both effective and efficient. Given input RGBD data, we observe that very strong pretrained weights exist for RGB data, while pretraining for depth-related data is far less mature. First and foremost, we argue that geometry and texture cues are both of vital importance, yet can be encoded separately. Secondly, we find that visual texture features are relatively hard to extract compared with geometry features in 3D space. Unfortunately, a single RGBD dataset with thousands of samples is not enough to train a discriminative filter for visual texture feature extraction. Last but certainly not least, we designed a lightweight VG module consisting of a visual texture encoder, a geometry encoder, and a VG connector. Compared with previous state-of-the-art works such as V-DETR, gains from the pretrained visual encoder are evident. Extensive experiments on the ScanNetV2 dataset show the effectiveness of our method. It is worth mentioning that our method achieves 78% AP, setting a new state of the art on the ScanNetV2 benchmark.

[CV-58] GeoFormer: Learning Point Cloud Completion with Tri-Plane Integrated Transformer

链接: https://arxiv.org/abs/2408.06596
作者: Jinpeng Yu,Binbin Huang,Yuxuan Zhang,Huaxia Li,Xu Tang,Shenghua Gao
关键词-EN: cloud completion aims, Point cloud completion, recover accurate global, preserve fine-grained local, completion aims
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: accepted by the 32nd ACM International Conference on Multimedia (MM’24)

点击查看摘要

Abstract:Point cloud completion aims to recover accurate global geometry and preserve fine-grained local details from partial point clouds. Conventional methods typically predict unseen points directly from 3D point cloud coordinates or use self-projected multi-view depth maps to ease this task. However, these gray-scale depth maps cannot reach multi-view consistency, consequently restricting the performance. In this paper, we introduce a GeoFormer that simultaneously enhances the global geometric structure of the points and improves the local details. Specifically, we design a CCM Feature Enhanced Point Generator to integrate image features from multi-view consistent canonical coordinate maps (CCMs) and align them with pure point features, thereby enhancing the global geometry feature. Additionally, we employ the Multi-scale Geometry-aware Upsampler module to progressively enhance local details. This is achieved through cross attention between the multi-scale features extracted from the partial input and the features derived from previously estimated points. Extensive experiments on the PCN, ShapeNet-55/34, and KITTI benchmarks demonstrate that our GeoFormer outperforms recent methods, achieving the state-of-the-art performance. Our code is available at this https URL.

[CV-59] ActiveNeRF: Learning Accurate 3D Geometry by Active Pattern Projection

链接: https://arxiv.org/abs/2408.06592
作者: Jianyu Tao,Changping Hu,Edward Yang,Jing Xu,Rui Chen
关键词-EN: achieved incredible success, achieved incredible, incredible success, geometry reconstruction, active pattern
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 18 pages, 10 figures

点击查看摘要

Abstract:NeRFs have achieved incredible success in novel view synthesis. However, the accuracy of the implicit geometry is unsatisfactory because the passive static environmental illumination has low spatial frequency and cannot provide enough information for accurate geometry reconstruction. In this work, we propose ActiveNeRF, a 3D geometry reconstruction framework, which improves the geometry quality of NeRF by actively projecting patterns of high spatial frequency onto the scene using a projector which has a constant relative pose to the camera. We design a learnable active pattern rendering pipeline which jointly learns the scene geometry and the active pattern. We find that, by adding the active pattern and imposing its consistency across different views, our proposed method outperforms state-of-the-art geometry reconstruction methods qualitatively and quantitatively in both simulation and real experiments. Code is available at this https URL

[CV-60] HDRGS: High Dynamic Range Gaussian Splatting

链接: https://arxiv.org/abs/2408.06543
作者: Jiahao Wu,Lu Xiao,Chao Wang,Rui Peng,Kaiqiang Xiong,Ronggang Wang
关键词-EN: witnessed substantial advancements, neural radiance field, high dynamic range, radiance field, years have witnessed
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent years have witnessed substantial advancements in the field of 3D reconstruction from 2D images, particularly following the introduction of the neural radiance field (NeRF) technique. However, reconstructing a 3D high dynamic range (HDR) radiance field, which aligns more closely with real-world conditions, from 2D multi-exposure low dynamic range (LDR) images continues to pose significant challenges. Approaches to this issue fall into two categories: grid-based and implicit-based. Implicit methods, using multi-layer perceptrons (MLP), face inefficiencies, limited solvability, and overfitting risks. Conversely, grid-based methods require significant memory and struggle with image quality and long training times. In this paper, we introduce Gaussian Splatting-a recent, high-quality, real-time 3D reconstruction technique-into this domain. We further develop the High Dynamic Range Gaussian Splatting (HDR-GS) method, designed to address the aforementioned challenges. This method enhances color dimensionality by including luminance and uses an asymmetric grid for tone-mapping, swiftly and precisely converting pixel irradiance to color. Our approach improves HDR scene recovery accuracy and integrates a novel coarse-to-fine strategy to speed up model convergence, enhancing robustness against sparse viewpoints and exposure extremes, and preventing local optima. Extensive testing confirms that our method surpasses current state-of-the-art techniques in both synthetic and real-world scenarios. Code will be released at this https URL
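The abstract's tone-mapping step converts unbounded pixel irradiance to displayable LDR color. As a hedged stand-in for the paper's learned asymmetric grid, the classic global Reinhard operator makes the idea concrete:

```python
def reinhard_tonemap(irradiance):
    """Map unbounded HDR irradiance values into [0, 1).

    Classic Reinhard global operator x / (1 + x); the paper replaces
    this fixed curve with a learned, asymmetric tone-mapping grid.
    """
    return [x / (1.0 + x) for x in irradiance]

ldr = reinhard_tonemap([0.0, 1.0, 3.0])  # -> [0.0, 0.5, 0.75]
```

Any monotone compressive curve works in principle; learning the curve lets the model adapt it to the scene's exposure distribution.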

[CV-61] Benchmarking tree species classification from proximally-sensed laser scanning data: introducing the FOR-species20K dataset

链接: https://arxiv.org/abs/2408.06507
作者: Stefano Puliti,Emily R. Lines,Jana Müllerová,Julian Frey,Zoe Schindler,Adrian Straker,Matthew J. Allen,Lukas Winiwarter,Nataliia Rehush,Hristina Hristova,Brent Murray,Kim Calders,Louise Terryn,Nicholas Coops,Bernhard Höfle,Samuli Junttila,Martin Krůček,Grzegorz Krok,Kamil Král,Shaun R. Levick,Linda Luck,Azim Missarov,Martin Mokroš,Harry J. F. Owen,Krzysztof Stereńczak,Timo P. Pitkänen,Nicola Puletti,Ninni Saarinen,Chris Hopkinson,Chiara Torresan,Enrico Tomelleri,Hannah Weiser,Rasmus Astrup
关键词-EN: offers significant potential, Proximally-sensed laser scanning, automatically identifying tree, scanning offers significant, Proximally-sensed laser
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Proximally-sensed laser scanning offers significant potential for automated forest data capture, but challenges remain in automatically identifying tree species without additional ground data. Deep learning (DL) shows promise for automation, yet progress is slowed by the lack of large, diverse, openly available labeled datasets of single tree point clouds. This has impacted the robustness of DL models and the ability to establish best practices for species classification. To overcome these challenges, the FOR-species20K benchmark dataset was created, comprising over 20,000 tree point clouds from 33 species, captured using terrestrial (TLS), mobile (MLS), and drone laser scanning (ULS) across various European forests, with some data from other regions. This dataset enables the benchmarking of DL models for tree species classification, including both point cloud-based (PointNet++, MinkNet, MLP-Mixer, DGCNNs) and multi-view image-based methods (SimpleView, DetailView, YOLOv5). 2D image-based models generally performed better (average OA = 0.77) than 3D point cloud-based models (average OA = 0.72), with consistent results across different scanning platforms and sensors. The top model, DetailView, was particularly robust, handling data imbalances well and generalizing effectively across tree sizes. The FOR-species20K dataset, available at this https URL, is a key resource for developing and benchmarking DL models for tree species classification using laser scanning data, providing a foundation for future advancements in the field.

[CV-62] Prompt Recovery for Image Generation Models: A Comparative Study of Discrete Optimizers

链接: https://arxiv.org/abs/2408.06502
作者: Joshua Nathaniel Williams,Avi Schwarzschild,J. Zico Kolter
关键词-EN: Recovering natural language, natural language prompts, image generation models, Recovering natural, difficult discrete optimization
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 9 Pages, 4 Figures

点击查看摘要

Abstract:Recovering natural language prompts for image generation models, solely based on the generated images is a difficult discrete optimization problem. In this work, we present the first head-to-head comparison of recent discrete optimization techniques for the problem of prompt inversion. We evaluate Greedy Coordinate Gradients (GCG), PEZ , Random Search, AutoDAN and BLIP2’s image captioner across various evaluation metrics related to the quality of inverted prompts and the quality of the images generated by the inverted prompts. We find that focusing on the CLIP similarity between the inverted prompts and the ground truth image acts as a poor proxy for the similarity between ground truth image and the image generated by the inverted prompts. While the discrete optimizers effectively minimize their objectives, simply using responses from a well-trained captioner often leads to generated images that more closely resemble those produced by the original prompts.
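The comparison above scores candidate prompts with embedding similarity. A minimal cosine-similarity helper, the standard metric behind CLIP scores, shows the quantity being optimized; the vectors here are made-up toys, not real CLIP features.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors: in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

sim = cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0])  # parallel -> 1.0
```

The paper's finding is that maximizing this score between an inverted prompt's embedding and the target image's embedding is a poor proxy for how similar the regenerated image actually looks.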

[CV-63] What Color Scheme is More Effective in Assisting Readers to Locate Information in a Color-Coded Article?

链接: https://arxiv.org/abs/2408.06494
作者: Ho Yin Ng,Zeyu He,Ting-Hao ‘Kenneth’ Huang
关键词-EN: human cognitive activities, aiding human cognitive, Large Language Models, cluster information types, assigning specific colors
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Color coding, a technique assigning specific colors to cluster information types, has proven advantages in aiding human cognitive activities, especially reading and comprehension. The rise of Large Language Models (LLMs) has streamlined document coding, enabling simple automatic text labeling with various schemes. This has the potential to make color-coding more accessible and benefit more users. However, the impact of color choice on information seeking is understudied. We conducted a user study assessing various color schemes’ effectiveness in LLM-coded text documents, standardizing contrast ratios to approximately 5.55:1 across schemes. Participants performed timed information-seeking tasks in color-coded scholarly abstracts. Results showed non-analogous and yellow-inclusive color schemes improved performance, with the latter also being more preferred by participants. These findings can inform better color scheme choices for text annotation. As LLMs advance document coding, we advocate for more research focusing on the “color” aspect of color-coding techniques.
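The study standardized contrast ratios to about 5.55:1 across schemes. The ratio itself is the standard WCAG 2.x quantity; the computation below follows that specification (the example colors are arbitrary).

```python
def _channel(c):
    """Linearize one sRGB channel (0-255) per the WCAG 2.x definition."""
    c /= 255.0
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

def relative_luminance(rgb):
    r, g, b = (_channel(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(fg, bg):
    """WCAG contrast ratio (L1 + 0.05) / (L2 + 0.05), from 1:1 to 21:1."""
    l1, l2 = sorted((relative_luminance(fg), relative_luminance(bg)), reverse=True)
    return (l1 + 0.05) / (l2 + 0.05)

ratio = contrast_ratio((255, 255, 255), (0, 0, 0))  # white on black -> 21.0
```

Holding this ratio fixed across the tested schemes is what isolates hue choice (e.g. yellow-inclusive vs. analogous palettes) from mere legibility differences.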

[CV-64] Generalization Enhancement Strategies to Enable Cross-year Cropland Mapping with Convolutional Neural Networks Trained Using Historical Samples

链接: https://arxiv.org/abs/2408.06467
作者: Sam Khallaghi,Rahebe Abedi,Hanan Abou Ali,Mary Dziedzorm Asipunu,Ismail Alatise,Nguyen Ha,Boka Luo,Cat Mai,Lei Song,Amos Wussah,Sitian Xiong,Qi Zhang,Lyndon D. Estes
关键词-EN: high-resolution satellite imagery, mapping agricultural fields, deep learning, geometrically irregular, accuracy of mapping
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The accuracy of mapping agricultural fields across large areas is steadily improving with high-resolution satellite imagery and deep learning (DL) models, even in regions where fields are small and geometrically irregular. However, developing effective DL models often requires large, expensive label datasets, typically available only for specific years or locations. This limits the ability to create annual maps essential for agricultural monitoring, as domain shifts occur between years and regions due to changes in farming practices and environmental conditions. The challenge is to design a model flexible enough to account for these shifts without needing yearly labels. While domain adaptation techniques or semi-supervised training are common solutions, we explored enhancing the model’s generalization power. Our results indicate that a holistic approach is essential, combining methods to improve generalization. Specifically, using an area-based loss function, such as Tversky-focal loss (TFL), significantly improved predictions across multiple years. The use of different augmentation techniques helped to encode different types of invariance, particularly photometric augmentations encoded invariance to brightness changes, though they increased false positives. The combination of photometric augmentation, TFL loss, and MC-dropout produced the best results, although dropout alone led to more false negatives in subsequent year predictions. Additionally, the choice of input normalization had a significant impact, with the best results obtained when statistics were calculated either locally or across the entire dataset over all bands (lab and gab). We developed a workflow that enabled a U-Net model to generate effective multi-year crop maps over large areas. Our code, available at: this https URL, will be regularly updated with improvements.
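The area-based Tversky-focal loss (TFL) credited above combines the Tversky index TI = TP / (TP + a*FP + b*FN) with a focal exponent, (1 - TI)^gamma. A minimal soft (per-pixel probability) version is sketched below; the a, b, gamma values are common defaults for illustration, not necessarily the paper's settings.

```python
def tversky_focal_loss(pred, target, alpha=0.7, beta=0.3, gamma=4.0 / 3.0):
    """Soft Tversky index over flattened per-pixel probabilities,
    raised to a focal power so easy examples contribute less."""
    tp = sum(p * t for p, t in zip(pred, target))
    fp = sum(p * (1 - t) for p, t in zip(pred, target))
    fn = sum((1 - p) * t for p, t in zip(pred, target))
    eps = 1e-7  # avoids division by zero on empty masks
    ti = (tp + eps) / (tp + alpha * fp + beta * fn + eps)
    return (1.0 - ti) ** gamma

perfect = tversky_focal_loss([1.0, 0.0, 1.0], [1.0, 0.0, 1.0])  # -> 0.0
bad = tversky_focal_loss([0.0, 1.0, 0.0], [1.0, 0.0, 1.0])      # near 1.0
```

Weighting FP and FN separately (alpha vs. beta) is what makes the loss area-based and class-imbalance-aware, which the abstract credits for the cross-year gains.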

[CV-65] Advanced Vision Transformers and Open-Set Learning for Robust Mosquito Classification: A Novel Approach to Entomological Studies

链接: https://arxiv.org/abs/2408.06457
作者: Ahmed Akib Jawad Karim,Muhammad Zawad Mahmud,Riasat Khan
关键词-EN: Mosquito-related diseases pose, global public health, Mosquito-related diseases, accurate mosquito classification, mosquito classification
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 23 pages, 15 figures

点击查看摘要

Abstract:Mosquito-related diseases pose a significant threat to global public health, necessitating efficient and accurate mosquito classification for effective surveillance and control. This work presents an innovative approach to mosquito classification by leveraging state-of-the-art vision transformers and open-set learning techniques. A novel framework has been introduced that integrates Transformer-based deep learning models with comprehensive data augmentation and preprocessing methods, enabling robust and precise identification of ten mosquito species. The Swin Transformer model achieves the best performance for traditional closed-set learning with 99.80% accuracy and 0.998 F1 score. The lightweight MobileViT technique attains an almost similar accuracy of 98.90% with significantly reduced parameters and model complexities. Next, the applied deep learning models’ adaptability and generalizability in a static environment have been enhanced by using new classes of data samples during the inference stage that have not been included in the training set. The proposed framework’s ability to handle unseen classes like insects similar to mosquitoes, even humans, through open-set learning further enhances its practical applicability employing the OpenMax technique and Weibull distribution. The traditional CNN model, Xception, outperforms the latest transformer with higher accuracy and F1 score for open-set learning. The study’s findings highlight the transformative potential of advanced deep-learning architectures in entomology, providing a strong groundwork for future research and development in mosquito surveillance and vector control. The implications of this work extend beyond mosquito classification, offering valuable insights for broader ecological and environmental monitoring applications.
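The OpenMax-with-Weibull idea the abstract mentions can be sketched as follows: a Weibull CDF fitted to the largest class-mean distances of training data estimates how "extreme" a new sample's distance is, and that tail probability moves logit mass to a synthetic "unknown" class. The shape/scale values and the single-class simplification below are made up for illustration.

```python
import math

def weibull_cdf(x, shape, scale):
    """P(distance <= x) under a Weibull tail model."""
    return 1.0 - math.exp(-((x / scale) ** shape))

def openmax_score(logit, distance, shape=2.0, scale=1.0):
    """Shrink a class logit by the Weibull tail probability; the
    shaved-off mass becomes the 'unknown' logit."""
    w = weibull_cdf(distance, shape, scale)
    return logit * (1.0 - w), logit * w  # (known logit, unknown logit)

near = openmax_score(5.0, distance=0.1)  # close to class mean: stays known
far = openmax_score(5.0, distance=3.0)   # extreme distance: mostly unknown
```

This is how a closed-set classifier can reject mosquito-like insects, or even humans, that were never in the training set.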

[CV-66] S-SAM: SVD-based Fine-Tuning of Segment Anything Model for Medical Image Segmentation MICCAI2024

链接: https://arxiv.org/abs/2408.06447
作者: Jay N. Paranjape,Shameema Sikder,S. Swaroop Vedula,Vishal M. Patel
关键词-EN: modality or dataset, traditionally approached, fine-tuning the entire, entire model, Medical image segmentation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted in MICCAI 2024

点击查看摘要

Abstract:Medical image segmentation has been traditionally approached by training or fine-tuning the entire model to cater to any new modality or dataset. However, this approach often requires tuning a large number of parameters during training. With the introduction of the Segment Anything Model (SAM) for prompted segmentation of natural images, many efforts have been made towards adapting it efficiently for medical imaging, thus reducing the training time and resources. However, these methods still require expert annotations for every image in the form of point prompts or bounding box prompts during training and inference, making it tedious to employ them in practice. In this paper, we propose an adaptation technique, called S-SAM, that only trains parameters equal to 0.4% of SAM’s parameters and at the same time uses simply the label names as prompts for producing precise masks. This not only makes tuning SAM more efficient than the existing adaptation methods but also removes the burden of providing expert prompts. We call this modified version S-SAM and evaluate it on five different modalities including endoscopic images, x-ray, ultrasound, CT, and histology images. Our experiments show that S-SAM outperforms state-of-the-art methods as well as existing SAM adaptation methods while tuning a significantly less number of parameters. We release the code for S-SAM at this https URL.

[CV-67] HAT: History-Augmented Anchor Transformer for Online Temporal Action Localization ECCV2024

链接: https://arxiv.org/abs/2408.06437
作者: Sakib Reza,Yuexi Zhang,Mohsen Moghaddam,Octavia Camps
关键词-EN: Online video understanding, Online Temporal Action, Online video, individual frames, Online Temporal
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to ECCV 2024

点击查看摘要

Abstract:Online video understanding often relies on individual frames, leading to frame-by-frame predictions. Recent advancements such as Online Temporal Action Localization (OnTAL), extend this approach to instance-level predictions. However, existing methods mainly focus on short-term context, neglecting historical information. To address this, we introduce the History-Augmented Anchor Transformer (HAT) Framework for OnTAL. By integrating historical context, our framework enhances the synergy between long-term and short-term information, improving the quality of anchor features crucial for classification and localization. We evaluate our model on both procedural egocentric (PREGO) datasets (EGTEA and EPIC) and standard non-PREGO OnTAL datasets (THUMOS and MUSES). Results show that our model outperforms state-of-the-art approaches significantly on PREGO datasets and achieves comparable or slightly superior performance on non-PREGO datasets, underscoring the importance of leveraging long-term history, especially in procedural and egocentric action scenarios. Code is available at: this https URL

[CV-68] Wavelet based inpainting detection

链接: https://arxiv.org/abs/2408.06429
作者: Barglazan Adrian-Alin,Brad Remus Ovidiu
关键词-EN: manipulating digital images, manipulating digital, alarmingly easy, image editing tools, Hierarchical Feature segmentation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:With the advancement of image editing tools, manipulating digital images has become alarmingly easy. Inpainting, which is used to remove objects or fill in parts of an image, serves as a powerful tool for both image restoration and forgery. This paper introduces a novel approach for detecting image inpainting forgeries by combining the DT-CWT with hierarchical feature segmentation and noise inconsistency analysis. The DT-CWT offers several advantages for this task, including inherent shift-invariance, which makes it robust to minor manipulations during the inpainting process, and directional selectivity, which helps capture subtle artifacts introduced by inpainting in specific frequency bands and orientations. By first applying color image segmentation and then analyzing, for each segment, the noise inconsistency obtained via the DT-CWT, we can identify patterns indicative of inpainting forgeries. The proposed method is evaluated on a benchmark dataset created for this purpose and is compared with existing forgery detection techniques. Our approach demonstrates superior results compared with the state of the art in detecting inpainted images.
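The per-segment noise-inconsistency check can be sketched in miniature. Illustrative simplification only: a first-difference high-pass stands in for the paper's DT-CWT subbands, and segments whose residual variance deviates strongly from the image-wide median are flagged as possible inpainting (inpainted regions tend to be unnaturally smooth or noisy relative to the camera's sensor noise).

```python
def residual_variance(segment):
    """Variance of a crude high-pass residual (first differences)."""
    diffs = [b - a for a, b in zip(segment, segment[1:])]
    mean = sum(diffs) / len(diffs)
    return sum((d - mean) ** 2 for d in diffs) / len(diffs)

def flag_inconsistent(segments, factor=4.0):
    """Flag segments whose noise level is an outlier vs. the median."""
    variances = [residual_variance(s) for s in segments]
    med = sorted(variances)[len(variances) // 2]
    return [v > factor * med or v < med / factor for v in variances]

# Toy 1D "segments": the perfectly flat one looks like inpainted content.
segments = [[10, 11, 10, 12], [10, 10, 10, 10], [10, 11, 9, 12]]
flags = flag_inconsistent(segments)
```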

[CV-69] Synthetic Photography Detection: A Visual Guidance for Identifying Synthetic Images Created by AI

链接: https://arxiv.org/abs/2408.06398
作者: Melanie Mathys,Marco Willi,Raphael Meier
关键词-EN: Artificial Intelligence, incredibly powerful, powerful in generating, generating synthetic images, Artificial
类目: Computers and Society (cs.CY); Computer Vision and Pattern Recognition (cs.CV)
*备注: 27 pages, 25 figures

点击查看摘要

Abstract:Artificial Intelligence (AI) tools have become incredibly powerful in generating synthetic images. Of particular concern are generated images that resemble photographs as they aspire to represent real world events. Synthetic photographs may be used maliciously by a broad range of threat actors, from scammers to nation-state actors, to deceive, defraud, and mislead people. Mitigating this threat usually involves answering a basic analytic question: Is the photograph real or synthetic? To address this, we have examined the capabilities of recent generative diffusion models and have focused on their flaws: visible artifacts in generated images which reveal their synthetic origin to the trained eye. We categorize these artifacts, provide examples, discuss the challenges in detecting them, suggest practical applications of our work, and outline future research directions.

[CV-70] Dilated Convolution with Learnable Spacings

链接: https://arxiv.org/abs/2408.06383
作者: Ismail Khalfaoui-Hassani
关键词-EN: Learnable Spacings, DCLS method, evaluates the Dilated, Dilated Convolution, DCLS
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注: PhD Thesis

点击查看摘要

Abstract:This thesis presents and evaluates the Dilated Convolution with Learnable Spacings (DCLS) method. Through various supervised learning experiments in the fields of computer vision, audio, and speech processing, the DCLS method proves to outperform both standard and advanced convolution techniques. The research is organized into several steps, starting with an analysis of the literature and existing convolution techniques that preceded the development of the DCLS method. We were particularly interested in the methods that are closely related to our own and that remain essential to capture the nuances and uniqueness of our approach. The cornerstone of our study is the introduction and application of the DCLS method to convolutional neural networks (CNNs), as well as to hybrid architectures that rely on both convolutional and visual attention approaches. DCLS is shown to be particularly effective in tasks such as classification, semantic segmentation, and object detection. Initially using bilinear interpolation, the study also explores other interpolation methods, finding that Gaussian interpolation slightly improves performance. The DCLS method is further applied to spiking neural networks (SNNs) to enable synaptic delay learning within a neural network that could eventually be transferred to so-called neuromorphic chips. The results show that the DCLS method stands out as a new state-of-the-art technique in SNN audio classification for certain benchmark tasks in this field. These tasks involve datasets with a high temporal component. In addition, we show that DCLS can significantly improve the accuracy of artificial neural networks for the multi-label audio classification task. We conclude with a discussion of the chosen experimental setup, its limitations, the limitations of our method, and our results.
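The core DCLS mechanism can be shown in one dimension (a hypothetical simplification of the thesis method): kernel elements live at learnable real-valued positions, and linear interpolation, the 1D analogue of the bilinear case discussed above, spreads each weight over the two nearest integer taps, keeping the dense kernel differentiable with respect to the positions.

```python
def dcls_kernel(weights, positions, size):
    """Build a dense 1D kernel of length `size` from weights placed at
    real-valued positions via linear interpolation."""
    dense = [0.0] * size
    for w, p in zip(weights, positions):
        left = int(p)          # assumes 0 <= p <= size - 1
        frac = p - left
        dense[left] += w * (1.0 - frac)
        if frac > 0.0:
            dense[left + 1] += w * frac
    return dense

# A weight at position 0.25 splits 0.75/0.25 over taps 0 and 1;
# a weight at an integer position lands on a single tap.
kernel = dcls_kernel(weights=[1.0, 2.0], positions=[0.25, 3.0], size=5)
```

Because `dense` is a smooth function of `positions`, gradient descent can move the taps themselves, which is what distinguishes DCLS from a fixed-dilation convolution; the thesis also explores Gaussian interpolation in place of the linear kernel above.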

[CV-71] FedRobo: Federated Learning Driven Autonomous Inter Robots Communication For Optimal Chemical Sprays

链接: https://arxiv.org/abs/2408.06382
作者: Jannatul Ferdaus,Sameera Pisupati,Mahedi Hasan,Sathwick Paladugu
关键词-EN: centralized data collection, Learning enables robots, Federated Learning enables, Federated Learning, federated learning algorithm
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Robotics (cs.RO)
*备注: This research article is going to be submitted to a best-fit conference. We are looking for a conference

点击查看摘要

Abstract:Federated Learning enables robots to learn from each other’s experiences without relying on centralized data collection. Each robot independently maintains a model of crop conditions and chemical spray effectiveness, which is periodically shared with other robots in the fleet. A communication protocol is designed to optimize chemical spray applications by facilitating the exchange of information about crop conditions, weather, and other critical factors. The federated learning algorithm leverages this shared data to continuously refine the chemical spray strategy, reducing waste and improving crop yields. This approach has the potential to revolutionize the agriculture industry by offering a scalable and efficient solution for crop protection. However, significant challenges remain, including the development of a secure and robust communication protocol, the design of a federated learning algorithm that effectively integrates data from multiple sources, and ensuring the safety and reliability of autonomous robots. The proposed cluster-based federated learning approach also effectively reduces the computational load on the global server and minimizes communication overhead among clients.
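The abstract does not spell out the aggregation algorithm, but the cluster-based scheme it describes can be sketched with standard federated averaging plus one intra-cluster aggregation level; all names here are hypothetical:

```python
import numpy as np

def fed_avg(client_weights, client_sizes):
    """Weighted average of client model parameters (standard FedAvg).

    client_weights: list of dicts {param_name: np.ndarray}
    client_sizes:   list of local dataset sizes, used as weights
    """
    total = float(sum(client_sizes))
    merged = {}
    for name in client_weights[0]:
        merged[name] = sum(
            (n / total) * w[name] for w, n in zip(client_weights, client_sizes)
        )
    return merged

def cluster_then_average(clients, cluster_ids):
    """Cluster-based variant: average within each cluster first, then average
    the cluster models, reducing load on the global server."""
    clusters = {}
    for (w, n), cid in zip(clients, cluster_ids):
        clusters.setdefault(cid, []).append((w, n))
    cluster_models, cluster_sizes = [], []
    for members in clusters.values():
        ws, ns = zip(*members)
        cluster_models.append(fed_avg(list(ws), list(ns)))
        cluster_sizes.append(sum(ns))
    return fed_avg(cluster_models, cluster_sizes)
```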

[CV-72] Using deep learning to enhance electronic service quality: Application to real estate websites

链接: https://arxiv.org/abs/2408.06364
作者: Samaa Elnagar
关键词-EN: service quality dimensions, Electronic service quality, quality dimensions, service quality, Damage Level
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Electronic service quality (E-SQ) is a strategic metric for successful e-services. Among the service quality dimensions, tangibility is overlooked. However, by incorporating visuals or tangible tools, the intangible nature of e-services can be balanced. Thanks to advancements in deep learning for computer vision, tangible visual features can now be leveraged to enhance the browsing and searching experience of electronic services. Users usually have specific search criteria to meet, but most services will not offer flexible search filters. This research emphasizes the importance of integrating visual and descriptive features to improve the tangibility and efficiency of e-services. A prime example of an electronic service that can benefit from this is real estate websites. Searching for real estate properties that match user preferences is usually demanding and lacks visual filters, such as the Damage Level to the property. The research introduces a novel visual descriptive feature, the Damage Level, which utilizes a deep learning network known as Mask-RCNN to estimate damage in real estate images. Additionally, a model is developed to incorporate the Damage Level as a tangible feature in electronic real estate services, with the aim of enhancing the tangible customer experience.

[CV-73] Modality-Balanced Learning for Multimedia Recommendation

链接: https://arxiv.org/abs/2408.06360
作者: Jinghao Zhang,Guofan Liu,Qiang Liu,Shu Wu,Liang Wang
关键词-EN: filtering framework effectively, traditional collaborative filtering, collaborative filtering framework, incorporate multimodal content, multimodal content information
类目: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
*备注: ACM Multimedia 2024 (Oral)

点击查看摘要

Abstract:Many recommender models have been proposed to investigate how to incorporate multimodal content information into traditional collaborative filtering framework effectively. The use of multimodal information is expected to provide more comprehensive information and lead to superior performance. However, the integration of multiple modalities often encounters the modal imbalance problem: since the information in different modalities is unbalanced, optimizing the same objective across all modalities leads to the under-optimization problem of the weak modalities with a slower convergence rate or lower performance. Even worse, we find that in multimodal recommendation models, all modalities suffer from the problem of insufficient optimization. To address these issues, we propose a Counterfactual Knowledge Distillation method that could solve the imbalance problem and make the best use of all modalities. Through modality-specific knowledge distillation, it could guide the multimodal model to learn modality-specific knowledge from uni-modal teachers. We also design a novel generic-and-specific distillation loss to guide the multimodal student to learn wider-and-deeper knowledge from teachers. Additionally, to adaptively recalibrate the focus of the multimodal model towards weaker modalities during training, we estimate the causal effect of each modality on the training objective using counterfactual inference techniques, through which we could determine the weak modalities, quantify the imbalance degree and re-weight the distillation loss accordingly. Our method could serve as a plug-and-play module for both late-fusion and early-fusion backbones. Extensive experiments on six backbones show that our proposed method can improve the performance by a large margin. The source code will be released at this https URL
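The abstract does not give the re-weighting formula, but one plausible instantiation of the idea (labeled as our assumption, not the paper's method) is to give larger distillation weight to modalities whose counterfactual removal changes the objective least, i.e. the weak modalities:

```python
import numpy as np

def distillation_weights(causal_effects):
    """Hypothetical inverse-effect re-weighting: modalities with a small
    estimated causal effect on the objective are 'weak' and receive a larger
    distillation weight, normalized so the mean weight stays 1."""
    e = np.asarray(list(causal_effects.values()), dtype=float)
    w = e.max() / np.maximum(e, 1e-12)   # small effect -> large weight
    w = w / w.mean()                     # keep the overall loss scale
    return dict(zip(causal_effects, w))
```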

[CV-74] Algorithm Research of ELMo Word Embedding and Deep Learning Multimodal Transformer in Image Description

链接: https://arxiv.org/abs/2408.06357
作者: Xiaohan Cheng,Taiyuan Mei,Yun Zi,Qi Wang,Zijun Gao,Haowei Yang
关键词-EN: sample learning methods, sample learning, data deficiency, sample learning algorithms, effective method
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Zero-shot learning is an effective method for addressing data deficiency. Existing embedded zero-shot learning methods use only the known classes to construct the embedding space, so the known classes are overfit during testing. This project uses category semantic similarity measures for multi-label classification, which allows unknown classes that are semantically close to the currently known classes to be incorporated into the vector space when it is built. At the same time, most existing zero-shot learning algorithms directly use the deep features of medical images as input, and the feature extraction process does not consider semantic information. This project takes ELMo-MCT as the main task and obtains multiple visual features related to the original image through a self-attention mechanism. In this paper, extensive experiments are carried out on three zero-shot learning benchmark datasets, and the best harmonic mean accuracy is obtained in comparison with state-of-the-art algorithms.
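The semantic-similarity classification idea (scoring a visual embedding against class semantic vectors that may include unseen classes) can be sketched as follows; the projection of images and classes into a shared space is assumed to have happened upstream:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two 1-D vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def zero_shot_classify(visual_embedding, class_semantics):
    """Assign the class whose semantic vector is most similar to the visual
    embedding. `class_semantics` maps class name -> semantic vector and may
    include classes never seen during training."""
    scores = {c: cosine_sim(visual_embedding, v)
              for c, v in class_semantics.items()}
    return max(scores, key=scores.get), scores
```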

[CV-75] Enhancing Ecological Monitoring with Multi-Objective Optimization: A Novel Dataset and Methodology for Segmentation Algorithms

链接: https://arxiv.org/abs/2408.06356
作者: Sophia J. Abraham,Jin Huang,Brandon RichardWebster,Michael Milford,Jonathan D. Hauenstein,Walter Scheirer
关键词-EN: high-resolution aerial images, Bega Valley, South Wales, aerial images capturing, images capturing indigenous
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We introduce a unique semantic segmentation dataset of 6,096 high-resolution aerial images capturing indigenous and invasive grass species in Bega Valley, New South Wales, Australia, designed to address the underrepresented domain of ecological data in the computer vision community. This dataset presents a challenging task due to the overlap and distribution of grass species, which is critical for advancing models in ecological and agronomical applications. Our study features a homotopy-based multi-objective fine-tuning approach that balances segmentation accuracy and contextual consistency, applicable to various models. By integrating DiceCELoss for pixel-wise classification and a smoothness loss for spatial coherence, this method evolves during training to enhance robustness against noisy data. Performance baselines are established through a case study on the Segment Anything Model (SAM), demonstrating its effectiveness. Our annotation methodology, emphasizing pen size, zoom control, and memory management, ensures high-quality dataset creation. The dataset and code will be made publicly available, aiming to drive research in computer vision, machine learning, and ecological studies, advancing environmental monitoring and sustainable development.
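The combined objective, DiceCELoss for pixel-wise accuracy plus a smoothness term for spatial coherence, can be sketched in NumPy. The linear homotopy schedule below is our assumption, since the abstract only says the balance evolves during training:

```python
import numpy as np

def dice_ce_loss(pred, target, eps=1e-7):
    """Binary Dice + cross-entropy; pred in (0, 1), target in {0, 1}."""
    pred = np.clip(pred, eps, 1 - eps)
    dice = 1 - (2 * (pred * target).sum() + eps) / (pred.sum() + target.sum() + eps)
    ce = -np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred))
    return dice + ce

def smoothness_loss(pred):
    """Penalize squared differences between adjacent pixels (TV-like)."""
    dh = np.diff(pred, axis=0) ** 2
    dw = np.diff(pred, axis=1) ** 2
    return dh.mean() + dw.mean()

def homotopy_loss(pred, target, t):
    """Homotopy between the two objectives; t moves from 0 to 1 in training."""
    return (1 - t) * dice_ce_loss(pred, target) + t * smoothness_loss(pred)
```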

[CV-76] Automated Romberg Test: Leveraging a CNN and Centre of Mass Analysis for Sensory Ataxia Diagnosis

链接: https://arxiv.org/abs/2408.06354
作者: Reilly Haskins,Richard Green
关键词-EN: automated Romberg Test, Romberg Test, facto medical procedure, automated Romberg, diagnose sensory ataxia
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper proposes a novel method to diagnose sensory ataxia via an automated Romberg Test, the current de facto medical procedure used to diagnose this condition. It utilizes a convolutional neural network to predict joint locations, used for the calculation of various bio-mechanical markers such as the center of mass of the subject and various joint angles. This information is used in combination with data filtering techniques such as Kalman filters and center of mass analysis, which helped make accurate inferences about the relative weight distribution in the lateral and anterior-posterior axes and provide an objective, mathematically based diagnosis of this condition. In order to evaluate the performance of this method, testing was performed using dual weight scales and pre-annotated diagnosis videos taken from medical settings. These two methods quantified the true weight distribution upon the ground surface, providing a ground truth and a real-world estimate of accuracy for the proposed method. A mean absolute error of 0.2912 percent was found for the calculated relative weight distribution difference, and an accuracy of 83.33 percent was achieved on diagnosis.
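The centre-of-mass step can be sketched as a mass-weighted average of body-segment positions, followed by reading off the lateral weight split from where the centre of mass falls between the feet. The segment mass fractions below are illustrative placeholders, not the paper's values:

```python
# Approximate per-segment mass fractions (illustrative assumptions only).
SEGMENT_MASS = {"head": 0.081, "trunk": 0.497, "thigh": 0.2,
                "shank": 0.122, "foot": 0.1}

def centre_of_mass(segment_xy):
    """segment_xy: dict segment -> (x, y) midpoint of that body segment.
    Returns the mass-weighted centre of mass in image coordinates."""
    total = sum(SEGMENT_MASS[s] for s in segment_xy)
    x = sum(SEGMENT_MASS[s] * p[0] for s, p in segment_xy.items()) / total
    y = sum(SEGMENT_MASS[s] * p[1] for s, p in segment_xy.items()) / total
    return x, y

def lateral_weight_split(com_x, left_foot_x, right_foot_x):
    """Relative weight on each foot from the CoM position between the feet."""
    right = (com_x - left_foot_x) / (right_foot_x - left_foot_x)
    return 1 - right, right
```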

[CV-77] Automated Schizophrenia Detection from Handwriting Samples via Transfer Learning Convolutional Neural Networks

链接: https://arxiv.org/abs/2408.06347
作者: Rafael Castro,Ishaan Patel,Tarun Patanjali,Priya Iyer
关键词-EN: impairs daily life, globally prevalent psychiatric, prevalent psychiatric disorder, severely impairs daily, daily life
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 5 pages, 8 figures

点击查看摘要

Abstract:Schizophrenia is a globally prevalent psychiatric disorder that severely impairs daily life. Schizophrenia is caused by dopamine imbalances in the fronto-striatal pathways of the brain, which influences fine motor control in the cerebellum. This leads to abnormalities in handwriting. The goal of this study was to develop an accurate, objective, and accessible computational method to be able to distinguish schizophrenic handwriting samples from non-schizophrenic handwriting samples. To achieve this, data from Crespo et al. (2019) was used, which contains images of handwriting samples from schizophrenic and non-schizophrenic patients. The data was preprocessed and augmented to produce a more robust model that can recognize different types of handwriting. The data was used to train several different convolutional neural networks, and the model with the base architecture of InceptionV3 performed the best, differentiating between the two types of image with a 92% accuracy rate. To make this model accessible, a secure website was developed for medical professionals to use for their patients. Such a result suggests that handwriting analysis through computational models holds promise as a non-invasive and objective method for clinicians to diagnose and monitor schizophrenia.

[CV-78] Event-Stream Super Resolution using Sigma-Delta Neural Network ECCV ECCV2024

链接: https://arxiv.org/abs/2408.06968
作者: Waseem Shariff,Joe Lemley,Peter Corcoran
关键词-EN: time-event pixels based, study introduces, approach to enhance, enhance the spatial-temporal, time-event pixels
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: ECCV: The 18th European Conference on Computer Vision ECCV 2024 NeVi Workshop

点击查看摘要

Abstract:This study introduces a novel approach to enhance the spatial-temporal resolution of time-event pixels based on luminance changes captured by event cameras. These cameras present unique challenges due to their low resolution and the sparse, asynchronous nature of the data they collect. Current event super-resolution algorithms are not fully optimized for the distinct data structure produced by event cameras, resulting in inefficiencies in capturing the full dynamism and detail of visual scenes with improved computational complexity. To bridge this gap, our research proposes a method that integrates binary spikes with Sigma Delta Neural Networks (SDNNs), leveraging spatiotemporal constraint learning mechanism designed to simultaneously learn the spatial and temporal distributions of the event stream. The proposed network is evaluated using widely recognized benchmark datasets, including N-MNIST, CIFAR10-DVS, ASL-DVS, and Event-NFS. A comprehensive evaluation framework is employed, assessing both the accuracy, through root mean square error (RMSE), and the computational efficiency of our model. The findings demonstrate significant improvements over existing state-of-the-art methods, specifically, the proposed method outperforms state-of-the-art performance in computational efficiency, achieving a 17.04-fold improvement in event sparsity and a 32.28-fold increase in synaptic operation efficiency over traditional artificial neural networks, alongside a two-fold better performance over spiking neural networks.
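The sigma-delta principle the network exploits, emitting activity only when the accumulated change since the last emission crosses a threshold, can be shown on a 1-D signal (a simplified sketch, not the paper's spiking architecture):

```python
import numpy as np

def sigma_delta_encode(signal, threshold=0.5):
    """Emit an integer spike count only when the change since the last
    emission exceeds the threshold; silence otherwise (sparse events)."""
    spikes = np.zeros_like(signal, dtype=int)
    last_sent = 0.0
    for t, x in enumerate(signal):
        n = int((x - last_sent) / threshold)   # truncates toward zero
        if n != 0:
            spikes[t] = n
            last_sent += n * threshold
    return spikes

def sigma_delta_decode(spikes, threshold=0.5):
    """The receiver integrates the spikes to reconstruct the signal."""
    return np.cumsum(spikes) * threshold
```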

[CV-79] Enhancing Diabetic Retinopathy Diagnosis: A Lightweight CNN Architecture for Efficient Exudate Detection in Retinal Fundus Images

链接: https://arxiv.org/abs/2408.06784
作者: Mujadded Al Rabbani Alif
关键词-EN: Retinal fundus imaging, early disease onset, Retinal fundus, plays an essential, essential role
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Retinal fundus imaging plays an essential role in diagnosing various stages of diabetic retinopathy, where exudates are critical markers of early disease onset. Prompt detection of these exudates is pivotal for enabling optometrists to arrest or significantly decelerate the disease progression. This paper introduces a novel, lightweight convolutional neural network architecture tailored for automated exudate detection, designed to identify these markers efficiently and accurately. To address the challenge of limited training data, we have incorporated domain-specific data augmentations to enhance the model’s generalizability. Furthermore, we applied a suite of regularization techniques within our custom architecture to boost diagnostic accuracy while optimizing computational efficiency. Remarkably, this streamlined model contains only 4.73 million parameters, a reduction of nearly 60% compared to the standard ResNet-18 model, which has 11.69 million parameters. Despite its reduced complexity, our model achieves an impressive F1 score of 90%, demonstrating its efficacy in the early detection of diabetic retinopathy through fundus imaging.

[CV-80] How to Best Combine Demosaicing and Denoising?

链接: https://arxiv.org/abs/2408.06684
作者: Yu Guo,Qiyu Jin,Jean-Michel Morel,Gabriele Facciolo
关键词-EN: play a critical, critical role, images, denoising, demosaicing
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: This paper was accepted by Inverse Problems and Imaging on October, 2023

点击查看摘要

Abstract:Image demosaicing and denoising play a critical role in the raw imaging pipeline. These processes have often been treated as independent, without considering their interactions. Indeed, most classic denoising methods handle noisy RGB images, not raw images. Conversely, most demosaicing methods address the demosaicing of noise-free images. The real problem is to jointly denoise and demosaic noisy raw images. But the question of how to proceed is still not yet clarified. In this paper, we carry out extensive experiments and a mathematical analysis to tackle this problem with low complexity algorithms. Indeed, both problems have been only addressed jointly by end-to-end heavyweight convolutional neural networks (CNNs), which are currently incompatible with low power portable imaging devices and remain by nature domain (or device) dependent. Our study leads us to conclude that, with moderate noise, demosaicing should be applied first, followed by denoising. This requires a simple adaptation of classic denoising algorithms to demosaiced noise, which we justify and specify. Although our main conclusion is "demosaic first, then denoise", we also discover that for high noise there is a moderate PSNR gain from a more complex strategy: partial CFA denoising followed by demosaicing, and a second denoising on the RGB image. These surprising results are obtained by a black-box optimization of the pipeline, which could be applied to any other pipeline. We validate our results on simulated and real noisy CFA images obtained from several benchmarks.
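The recommended "demosaic first, then denoise" ordering can be sketched with a crude normalized-average demosaicer and a box-filter denoiser standing in for the classic algorithms the paper adapts; the RGGB layout and both filters are our simplifying assumptions:

```python
import numpy as np

def conv2(img, k):
    """Zero-padded 2-D correlation with a small kernel (sketch-quality)."""
    ph, pw = k.shape[0] // 2, k.shape[1] // 2
    p = np.pad(img.astype(float), ((ph, ph), (pw, pw)))
    out = np.zeros(img.shape, dtype=float)
    for i in range(k.shape[0]):
        for j in range(k.shape[1]):
            out += k[i, j] * p[i:i + img.shape[0], j:j + img.shape[1]]
    return out

def bayer_masks(h, w):
    """Sampling masks for an assumed RGGB Bayer pattern."""
    r = np.zeros((h, w), bool); r[0::2, 0::2] = True
    b = np.zeros((h, w), bool); b[1::2, 1::2] = True
    return r, ~(r | b), b

def demosaic_bilinear(raw):
    """Fill each colour plane by normalized 3x3 neighbourhood averaging,
    a crude stand-in for proper bilinear demosaicing."""
    h, w = raw.shape
    box = np.ones((3, 3))
    out = np.zeros((h, w, 3))
    for c, mask in enumerate(bayer_masks(h, w)):
        num = conv2(np.where(mask, raw, 0.0), box)
        den = conv2(mask.astype(float), box)
        out[..., c] = num / np.maximum(den, 1e-12)
    return out

def denoise_box(rgb):
    """Placeholder denoiser: per-channel 3x3 box filter."""
    box = np.ones((3, 3)) / 9.0
    return np.stack([conv2(rgb[..., c], box) for c in range(3)], axis=-1)

def demosaic_then_denoise(raw):
    """The pipeline ordering the paper recommends under moderate noise."""
    return denoise_box(demosaic_bilinear(raw))
```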

[CV-81] Coherence Awareness in Diffractive Neural Networks

链接: https://arxiv.org/abs/2408.06681
作者: Matan Kleiner,Lior Michaeli,Tomer Michaeli
关键词-EN: intensive computational processing, hold great promise, requiring intensive computational, networks hold great, applications requiring intensive
类目: Optics (physics.optics); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Diffractive neural networks hold great promise for applications requiring intensive computational processing. Considerable attention has focused on diffractive networks for either spatially coherent or spatially incoherent illumination. Here we illustrate that, as opposed to imaging systems, in diffractive networks the degree of spatial coherence has a dramatic effect. In particular, we show that when the spatial coherence length on the object is comparable to the minimal feature size preserved by the optical system, neither the incoherent nor the coherent extremes serve as acceptable approximations. Importantly, this situation is inherent to many settings involving active illumination, including reflected light microscopy, autonomous vehicles and smartphones. Following this observation, we propose a general framework for training diffractive networks for any specified degree of spatial and temporal coherence, supporting all types of linear and nonlinear layers. Using our method, we numerically optimize networks for image classification, and thoroughly investigate their performance dependence on the illumination coherence properties. We further introduce the concept of coherence-blind networks, which have enhanced resilience to changes in illumination conditions. Our findings serve as a steppingstone toward adopting all-optical neural networks in real-world applications, leveraging nothing but natural light.

[CV-82] Specialized Change Detection using Segment Anything

链接: https://arxiv.org/abs/2408.06644
作者: Tahir Ahmad,Sudipan Saha
关键词-EN: Earth observation, task in Earth, Change detection, fundamental task, change detection methods
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Change detection (CD) is a fundamental task in Earth observation. While most change detection methods detect all changes, there is a growing need for specialized methods targeting specific changes relevant to particular applications while discarding the other changes. For instance, urban management might prioritize detecting the disappearance of buildings due to natural disasters or other reasons. Furthermore, while most supervised change detection methods require large-scale training datasets, in many applications only one or two training examples might be available instead of large datasets. Addressing such needs, we propose a focused CD approach using the Segment Anything Model (SAM), a versatile vision foundation model. Our method leverages a binary mask of the object of interest in pre-change images to detect their disappearance in post-change images. By using SAM’s robust segmentation capabilities, we create prompts from the pre-change mask, use those prompts to segment the post-change image, and identify missing objects. This unsupervised approach demonstrated for building disappearance detection, is adaptable to various domains requiring specialized CD. Our contributions include defining a novel CD problem, proposing a method using SAM, and demonstrating its effectiveness. The proposed method also has benefits related to privacy preservation.
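Deriving prompts from the pre-change mask and testing for disappearance can be sketched without the SAM model itself; the actual segmentation call is assumed to sit between the two functions, and the single-object centroid prompt is a simplification:

```python
import numpy as np

def prompts_from_mask(mask):
    """Derive SAM-style prompts from a pre-change binary mask: the object
    centroid as a point prompt and the tight bounding box as a box prompt."""
    ys, xs = np.nonzero(mask)
    if len(ys) == 0:
        return None
    point = (float(xs.mean()), float(ys.mean()))    # (x, y)
    box = (xs.min(), ys.min(), xs.max(), ys.max())  # (x0, y0, x1, y1)
    return {"point": point, "box": box}

def object_disappeared(pre_mask, post_mask, iou_thresh=0.1):
    """Flag disappearance when the post-change segmentation (e.g. SAM's
    output for the prompts above) no longer overlaps the pre-change object."""
    inter = np.logical_and(pre_mask, post_mask).sum()
    union = np.logical_or(pre_mask, post_mask).sum()
    iou = inter / union if union else 0.0
    return iou < iou_thresh
```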

[CV-83] Attention Based Feature Fusion Network for Monkeypox Skin Lesion Detection

链接: https://arxiv.org/abs/2408.06640
作者: Niloy Kumar Kundu,Mainul Karim,Sarah Kobir,Dewan Md. Farid
关键词-EN: public health concerns, health concerns due, recent monkeypox outbreak, raised significant public, significant public health
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 6 pages with 6 figures

点击查看摘要

Abstract:The recent monkeypox outbreak has raised significant public health concerns due to its rapid spread across multiple countries. Monkeypox can be difficult to distinguish from chickenpox and measles in the early stages because the symptoms of all three diseases are similar. Modern deep learning algorithms can be used to identify diseases, including COVID-19, by analyzing images of the affected areas. In this study, we introduce a lightweight model that merges two pre-trained architectures, EfficientNetV2B3 and ResNet151V2, to classify human monkeypox disease. We have also incorporated the squeeze-and-excitation attention network module to focus on the important parts of the feature maps for classifying the monkeypox images. This attention module provides channel and spatial attention to highlight significant areas within feature maps. We evaluated the effectiveness of our model by extensively testing it on a publicly available Monkeypox Skin Lesions Dataset using a four-fold cross-validation approach. The evaluation metrics of our model were compared with those of existing methods. Our model achieves a mean validation accuracy of 96.52%, with precision, recall, and F1-score values of 96.58%, 96.52%, and 96.51%, respectively.
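The squeeze-and-excitation module the authors incorporate follows a standard recipe: global average pooling, a two-layer bottleneck, and per-channel sigmoid gating. A minimal NumPy sketch (weights are passed in, not learned here):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def squeeze_excite(feat, w1, w2):
    """Squeeze-and-excitation on an (H, W, C) feature map.
    w1: (C, C//r) reduction weights; w2: (C//r, C) expansion weights."""
    z = feat.mean(axis=(0, 1))                # squeeze: global avg pool -> (C,)
    s = sigmoid(np.maximum(z @ w1, 0) @ w2)   # excitation: FC-ReLU-FC-sigmoid
    return feat * s                           # reweight channels
```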

[CV-84] Deep Inertia L_p Half-Quadratic Splitting Unrolling Network for Sparse View CT Reconstruction

链接: https://arxiv.org/abs/2408.06600
作者: Yu Guo,Caiying Wu,Yaxin Li,Qiyu Jin,Tieyong Zeng
关键词-EN: ill-posed inverse problem, Sparse view computed, challenging ill-posed inverse, necessitating effective regularization, effective regularization techniques
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: This paper was accepted by IEEE Signal Processing Letters on July 28, 2024

点击查看摘要

Abstract:Sparse view computed tomography (CT) reconstruction poses a challenging ill-posed inverse problem, necessitating effective regularization techniques. In this letter, we employ L_p-norm (0 < p < 1) regularization to induce sparsity and introduce inertial steps, leading to the development of the inertial L_p-norm half-quadratic splitting algorithm. We rigorously prove the convergence of this algorithm. Furthermore, we leverage deep learning to initialize the conjugate gradient method, resulting in a deep unrolling network with theoretical guarantees. Our extensive numerical experiments demonstrate that our proposed algorithm surpasses existing methods, particularly excelling in fewer scanned views and complex noise conditions.
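A toy version of the inertial half-quadratic splitting iteration can illustrate the structure: a denoising objective stands in for the CT data term, and the proximal step is solved by brute-force search (both are our simplifications; the paper couples x to projection data and warm-starts conjugate gradient with a network):

```python
import numpy as np

def prox_lp(v, tau, p, grid=1001):
    """Proximal map of tau*|z|^p by 1-D search; no closed form exists for
    general 0 < p < 1, so a grid search suffices for a sketch."""
    out = np.empty_like(v, dtype=float)
    for i, vi in np.ndenumerate(v):
        zs = np.linspace(-abs(vi), abs(vi), grid)
        obj = 0.5 * (zs - vi) ** 2 + tau * np.abs(zs) ** p
        out[i] = zs[np.argmin(obj)]
    return out

def inertial_lp_hqs(y, lam=0.1, beta=1.0, p=0.5, alpha=0.3, iters=30):
    """Inertial half-quadratic splitting for the toy problem
       min_x 0.5 * ||x - y||^2 + lam * ||x||_p^p."""
    x = y.astype(float).copy()
    x_prev = x.copy()
    for _ in range(iters):
        x_bar = x + alpha * (x - x_prev)   # inertial extrapolation
        z = prox_lp(x_bar, lam / beta, p)  # sparsity (z-) subproblem
        x_prev = x
        x = (y + beta * z) / (1.0 + beta)  # quadratic (x-) subproblem
    return x
```

As expected of a sparsity-inducing penalty, large entries of `y` survive nearly unchanged while small entries are shrunk toward zero.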

[CV-85] InfLocNet: Enhanced Lung Infection Localization and Disease Detection from Chest X-Ray Images Using Lightweight Deep Learning

链接: https://arxiv.org/abs/2408.06459
作者: Md. Asiful Islam Miah,Shourin Paul,Sunanda Das,M. M. A. Hashem
关键词-EN: deep learning techniques, deep learning, recent years, deep learning based, revolutionized the diagnosis
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In recent years, the integration of deep learning techniques into medical imaging has revolutionized the diagnosis and treatment of lung diseases, particularly in the context of COVID-19 and pneumonia. This paper presents a novel, lightweight deep learning-based segmentation-classification network designed to enhance the detection and localization of lung infections using chest X-ray images. By leveraging the power of transfer learning with pre-trained VGG-16 weights, our model achieves robust performance even with limited training data. The architecture incorporates refined skip connections within the UNet++ framework, reducing semantic gaps and improving precision in segmentation tasks. Additionally, a classification module is integrated at the end of the encoder block, enabling simultaneous classification and segmentation. This dual functionality enhances the model’s versatility, providing comprehensive diagnostic insights while optimizing computational efficiency. Experimental results demonstrate that our proposed lightweight network outperforms existing methods in terms of accuracy and computational requirements, making it a viable solution for real-time and resource-constrained medical imaging applications. Furthermore, the streamlined design facilitates easier hyperparameter tuning and deployment on edge devices. This work underscores the potential of advanced deep learning architectures in improving clinical outcomes through precise and efficient medical image analysis. Our model achieved remarkable results with an Intersection over Union (IoU) of 93.59% and a Dice Similarity Coefficient (DSC) of 97.61% in lung area segmentation, and an IoU of 97.67% and a DSC of 87.61% for infection region localization. Additionally, it demonstrated high accuracy of 93.86% and sensitivity of 89.55% in detecting chest diseases, highlighting its efficacy and reliability.
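The IoU and Dice similarity coefficient reported above are standard overlap metrics computed from binary masks; note that for the same pair of masks Dice is always at least as large as IoU, since D = 2I/(1+I):

```python
import numpy as np

def iou(pred, target):
    """Intersection over Union of two binary masks."""
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return inter / union if union else 1.0

def dice(pred, target):
    """Dice similarity coefficient of two binary masks."""
    inter = np.logical_and(pred, target).sum()
    total = pred.sum() + target.sum()
    return 2 * inter / total if total else 1.0
```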

[CV-86] From Diagnostic CT to DTI Tractography labels: Using Deep Learning for Corticospinal Tract Injury Assessment and Outcome Prediction in Intracerebral Haemorrhage MICCAI

链接: https://arxiv.org/abs/2408.06403
作者: Olivia N Murray,Hamied Haroon,Paul Ryu,Hiren Patel,George Harston,Marieke Wermer,Wilmar Jolink,Daniel Hanley,Catharina Klijn,Ulrike Hammerbeck,Adrian Parry-Jones,Timothy Cootes
关键词-EN: good motor recovery, corticospinal tract, CST, key to good, good motor
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
*备注: Accepted to Miccai Switch Workshop

点击查看摘要

Abstract:The preservation of the corticospinal tract (CST) is key to good motor recovery after stroke. The gold standard method of assessing the CST with imaging is diffusion tensor tractography. However, this is not available for most intracerebral haemorrhage (ICH) patients. Non-contrast CT scans are routinely available in most ICH diagnostic pipelines, but delineating white matter from a CT scan is challenging. We utilise nnU-Net, trained on paired diagnostic CT scans and high-directional diffusion tractography maps, to segment the CST from diagnostic CT scans alone, and we show our model reproduces diffusion based tractography maps of the CST with a Dice similarity coefficient of 57%. Surgical haematoma evacuation is sometimes performed after ICH, but published clinical trials to date show that whilst surgery reduces mortality, there is no evidence of improved functional recovery. Restricting surgery to patients with an intact CST may reveal a subset of patients for whom haematoma evacuation improves functional outcome. We investigated the clinical utility of our model in the MISTIE III clinical trial dataset. We found that our model’s CST integrity measure significantly predicted outcome after ICH in the acute and chronic time frames, therefore providing a prognostic marker for patients to whom advanced diffusion tensor imaging is unavailable. This will allow for future probing of subgroups who may benefit from surgery.

[CV-87] Assessment of Cell Nuclei AI Foundation Models in Kidney Pathology

链接: https://arxiv.org/abs/2408.06381
作者: Junlin Guo,Siqi Lu,Can Cui,Ruining Deng,Tianyuan Yao,Zhewen Tao,Yizhe Lin,Marilyn Lionts,Quan Liu,Juming Xiong,Catie Chang,Mitchell Wilkes,Mengmeng Yin,Haichun Yang,Yuankai Huo
关键词-EN: kidney pathology, digital kidney pathology, crucial task, task in digital, Cell nuclei instance
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Cell nuclei instance segmentation is a crucial task in digital kidney pathology. Traditional automatic segmentation methods often lack generalizability when applied to unseen datasets. Recently, the success of foundation models (FMs) has provided a more generalizable solution, potentially enabling the segmentation of any cell type. In this study, we perform a large-scale evaluation of three widely used state-of-the-art (SOTA) cell nuclei foundation models (Cellpose, StarDist, and CellViT). Specifically, we created a highly diverse evaluation dataset consisting of 2,542 kidney whole slide images (WSIs) collected from both human and rodent sources, encompassing various tissue types, sizes, and staining methods. To our knowledge, this is the largest-scale evaluation of its kind to date. Our quantitative analysis of the prediction distribution reveals a persistent performance gap in kidney pathology. Among the evaluated models, CellViT demonstrated superior performance in segmenting nuclei in kidney pathology. However, none of the foundation models are perfect; a performance gap remains in general nuclei segmentation for kidney pathology.

[CV-88] How good nnU-Net for Segmenting Cardiac MRI: A Comprehensive Evaluation

链接: https://arxiv.org/abs/2408.06358
作者: Malitha Gunawardhana,Fangqiang Xu,Jichao Zhao
关键词-EN: essential for detailed, heart structures, cardiovascular diseases, detailed analysis, analysis of heart
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Cardiac segmentation is a critical task in medical imaging, essential for detailed analysis of heart structures, which is crucial for diagnosing and treating various cardiovascular diseases. With the advent of deep learning, automated segmentation techniques have demonstrated remarkable progress, achieving high accuracy and efficiency compared to traditional manual methods. Among these techniques, the nnU-Net framework stands out as a robust and versatile tool for medical image segmentation. In this study, we evaluate the performance of nnU-Net in segmenting cardiac magnetic resonance images (MRIs). Utilizing five cardiac segmentation datasets, we employ various nnU-Net configurations, including 2D, 3D full resolution, 3D low resolution, 3D cascade, and ensemble models. Our study benchmarks the capabilities of these configurations and examines the necessity of developing new models for specific cardiac segmentation tasks.

机器学习

[LG-0] Diversity Empowers Intelligence: Integrating Expertise of Software Engineering Agents

链接: https://arxiv.org/abs/2408.07060
作者: Kexun Zhang,Weiran Yao,Zuxin Liu,Yihao Feng,Zhiwei Liu,Rithesh Murthy,Tian Lan,Lei Li,Renze Lou,Jiacheng Xu,Bo Pang,Yingbo Zhou,Shelby Heinecke,Silvio Savarese,Huan Wang,Caiming Xiong
关键词-EN: Large language model, solving real-world software, shown great potential, language model, SWE-Bench Lite
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language model (LLM) agents have shown great potential in solving real-world software engineering (SWE) problems. The most advanced open-source SWE agent can resolve over 27% of real GitHub issues in SWE-Bench Lite. However, these sophisticated agent frameworks exhibit varying strengths, excelling in certain tasks while underperforming in others. To fully harness the diversity of these agents, we propose DEI (Diversity Empowered Intelligence), a framework that leverages their unique expertise. DEI functions as a meta-module atop existing SWE agent frameworks, managing agent collectives for enhanced problem-solving. Experimental results show that a DEI-guided committee of agents is able to surpass the best individual agent’s performance by a large margin. For instance, a group of open-source SWE agents, with a maximum individual resolve rate of 27.3% on SWE-Bench Lite, can achieve a 34.3% resolve rate with DEI, making a 25% improvement and beating most closed-source solutions. Our best-performing group excels with a 55% resolve rate, securing the highest ranking on SWE-Bench Lite. Our findings contribute to the growing body of research on collaborative AI systems and their potential to solve complex software engineering challenges.

[LG-1] A Survey on Model MoErging: Recycling and Routing Among Specialized Experts for Collaborative Learning

链接: https://arxiv.org/abs/2408.07057
作者: Prateek Yadav,Colin Raffel,Mohammed Muqeeth,Lucas Caccia,Haokun Liu,Tianlong Chen,Mohit Bansal,Leshem Choshen,Alessandro Sordoni
关键词-EN: performant pre-trained models, fine-tuned expert models, MoErging methods, domain or task, MoErging
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: 26 pages

点击查看摘要

Abstract:The availability of performant pre-trained models has led to a proliferation of fine-tuned expert models that are specialized to a particular domain or task. Model MoErging methods aim to recycle expert models to create an aggregate system with improved performance or generalization. A key component of MoErging methods is the creation of a router that decides which expert model(s) to use for a particular input or application. The promise, effectiveness, and large design space of MoErging has spurred the development of many new methods over the past few years. This rapid pace of development has made it challenging to compare different MoErging methods, which are rarely compared to one another and are often validated in different experimental setups. To remedy such gaps, we present a comprehensive survey of MoErging methods that includes a novel taxonomy for cataloging key design choices and clarifying suitable applications for each method. Apart from surveying MoErging research, we inventory software tools and applications that make use of MoErging. We additionally discuss related fields of study such as model merging, multitask learning, and mixture-of-experts models. Taken as a whole, our survey provides a unified overview of existing MoErging methods and creates a solid foundation for future work in this burgeoning field.
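The router the survey identifies as the key MoErging component can be sketched minimally. Everything below is illustrative (not any specific surveyed method): experts are stub callables, and routing is nearest-centroid over hypothetical task embeddings.

```python
# Toy MoErging-style router: pick the fine-tuned expert whose task centroid
# is most similar to the query embedding. All names/vectors are illustrative.
import numpy as np

EXPERTS = {
    "math": (np.array([1.0, 0.0]), lambda x: f"math-expert({x})"),
    "code": (np.array([0.0, 1.0]), lambda x: f"code-expert({x})"),
}

def route(query_emb, query):
    # Nearest-centroid routing: send the input to the most similar expert.
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    name = max(EXPERTS, key=lambda k: cos(EXPERTS[k][0], query_emb))
    return EXPERTS[name][1](query)

out = route(np.array([0.9, 0.1]), "integrate x^2")
# → "math-expert(integrate x^2)"
```

Real MoErging methods differ mainly in how this routing function is learned (per-token, per-example, per-task) and whether several experts are merged rather than one selected.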

[LG-2] LongWriter: Unleashing 10000 Word Generation from Long Context LLMs

链接: https://arxiv.org/abs/2408.07055
作者: Yushi Bai,Jiajie Zhang,Xin Lv,Linzhi Zheng,Siqi Zhu,Lei Hou,Yuxiao Dong,Jie Tang,Juanzi Li
关键词-EN: Current long context, Current long, context large language, large language models, large language
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Current long context large language models (LLMs) can process inputs up to 100,000 tokens, yet struggle to generate outputs exceeding even a modest length of 2,000 words. Through controlled experiments, we find that the model’s effective generation length is inherently bounded by the samples it has seen during supervised fine-tuning (SFT). In other words, their output limitation is due to the scarcity of long-output examples in existing SFT datasets. To address this, we introduce AgentWrite, an agent-based pipeline that decomposes ultra-long generation tasks into subtasks, enabling off-the-shelf LLMs to generate coherent outputs exceeding 20,000 words. Leveraging AgentWrite, we construct LongWriter-6k, a dataset containing 6,000 SFT data with output lengths ranging from 2k to 32k words. By incorporating this dataset into model training, we successfully scale the output length of existing models to over 10,000 words while maintaining output quality. We also develop LongBench-Write, a comprehensive benchmark for evaluating ultra-long generation capabilities. Our 9B parameter model, further improved through DPO, achieves state-of-the-art performance on this benchmark, surpassing even much larger proprietary models. In general, our work demonstrates that existing long context LLMs already possess the potential for a larger output window – all you need is data with extended output during model alignment to unlock this capability. Our code models are at: this https URL.
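The plan-then-write decomposition described in the abstract can be sketched as a two-step pipeline. `call_llm` below is a hypothetical stand-in for any off-the-shelf LLM API; it returns canned text here so the flow is runnable.

```python
# Illustrative AgentWrite-style pipeline: first ask for an outline, then
# generate each section separately and concatenate. Not the paper's exact
# prompts; `call_llm` is a stub for a real LLM client.

def call_llm(prompt: str) -> str:
    # Hypothetical LLM call; replace with a real API client.
    if prompt.startswith("PLAN"):
        return "1. Introduction\n2. Methods\n3. Results\n4. Conclusion"
    return f"[~500 words on: {prompt}]"

def agent_write(instruction: str) -> str:
    # Step 1: decompose the ultra-long task into subtasks via an outline.
    plan = call_llm(f"PLAN an outline for: {instruction}")
    sections = [line.split(". ", 1)[1] for line in plan.splitlines()]
    # Step 2: write each section conditioned on the outline, then join.
    drafts = [call_llm(f"Write the '{s}' section of: {instruction}")
              for s in sections]
    return "\n\n".join(drafts)

article = agent_write("a survey of long-context LLMs")
```

Because each subtask stays well within the model's comfortable output length, the concatenated result can exceed what a single generation call would produce.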

[LG-3] TableGuard – Securing Structured & Unstructured Data

链接: https://arxiv.org/abs/2408.07045
作者: Anantha Sharma,Ajinkya Deshmukh
关键词-EN: data, critical challenge, increasing demand, TableGuard, obfuscation
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: 7 pages, 3 tables, 1 figure

点击查看摘要

Abstract:With the increasing demand for data sharing across platforms and organizations, ensuring the privacy and security of sensitive information has become a critical challenge. This paper introduces “TableGuard”. An innovative approach to data obfuscation tailored for relational databases. Building on the principles and techniques developed in prior work on context-sensitive obfuscation, TableGuard applies these methods to ensure that API calls return only obfuscated data, thereby safeguarding privacy when sharing data with third parties. TableGuard leverages advanced context-sensitive obfuscation techniques to replace sensitive data elements with contextually appropriate alternatives. By maintaining the relational integrity and coherence of the data, our approach mitigates the risks of cognitive dissonance and data leakage. We demonstrate the implementation of TableGuard using a BERT based transformer model, which identifies and obfuscates sensitive entities within relational tables. Our evaluation shows that TableGuard effectively balances privacy protection with data utility, minimizing information loss while ensuring that the obfuscated data remains functionally useful for downstream applications. The results highlight the importance of domain-specific obfuscation strategies and the role of context length in preserving data integrity. The implications of this research are significant for organizations that need to share data securely with external parties. TableGuard offers a robust framework for implementing privacy-preserving data sharing mechanisms, thereby contributing to the broader field of data privacy and security.
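The detect-then-replace flow described above can be sketched as a toy. The paper uses a BERT-based model to find sensitive entities; below a hypothetical rule-based detector stands in so the pipeline runs end to end.

```python
# Toy sketch of TableGuard-style context-sensitive obfuscation: detect
# sensitive columns and replace their values with typed placeholders so the
# obfuscated row stays coherent. The detector and placeholder scheme are
# illustrative stand-ins, not the paper's BERT-based pipeline.

SENSITIVE = {"ssn", "email", "name"}

def detect_sensitive(column: str) -> bool:
    # Stand-in for an NER/classifier that flags sensitive fields.
    return column.lower() in SENSITIVE

def obfuscate_row(row: dict) -> dict:
    # Replace sensitive values; keep non-sensitive values intact so the
    # relational structure remains useful downstream.
    return {col: (f"<{col.upper()}_REDACTED>" if detect_sensitive(col) else val)
            for col, val in row.items()}

row = {"name": "Ada Lovelace", "dept": "R&D", "email": "ada@example.com"}
safe = obfuscate_row(row)
# safe == {"name": "<NAME_REDACTED>", "dept": "R&D", "email": "<EMAIL_REDACTED>"}
```

An API layer would apply `obfuscate_row` to every record before returning results to a third party, which is the deployment pattern the abstract describes.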

[LG-4] Source Separation of Multi-source Raw Music using a Residual Quantized Variational Autoencoder

链接: https://arxiv.org/abs/2408.07020
作者: Leonardo Berti
关键词-EN: variational autoencoder architecture, residual quantized variational, quantized variational autoencoder, codec model based, neural audio codec
类目: Sound (cs.SD); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
*备注: 9 pages

点击查看摘要

Abstract:I developed a neural audio codec model based on the residual quantized variational autoencoder architecture. I train the model on the Slakh2100 dataset, a standard dataset for musical source separation, composed of multi-track audio. The model can separate audio sources, achieving almost SoTA results with much less computing power. The code is publicly available at this http URL

[LG-5] Defining and Measuring Disentanglement for non-Independent Factors of Variation

链接: https://arxiv.org/abs/2408.07016
作者: Antonio Almudévar,Alfonso Ortega,Luis Vicente,Antonio Miguel,Eduardo Lleida
关键词-EN: factors of variation, Representation learning, factors, variation, discover and extract
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Representation learning is an approach that allows us to discover and extract the factors of variation from the data. Intuitively, a representation is said to be disentangled if it separates the different factors of variation in a way that is understandable to humans. Definitions of disentanglement and metrics to measure it usually assume that the factors of variation are independent of each other. However, this is generally false in the real world, which limits the use of these definitions and metrics to very specific and unrealistic scenarios. In this paper we give a definition of disentanglement based on information theory that is also valid when the factors of variation are not independent. Furthermore, we relate this definition to the Information Bottleneck Method. Finally, we propose a method to measure the degree of disentanglement from the given definition that works when the factors of variation are not independent. We show through different experiments that the method proposed in this paper correctly measures disentanglement with non-independent factors of variation, while other methods fail in this scenario.

[LG-6] Faster Private Minimum Spanning Trees

链接: https://arxiv.org/abs/2408.06997
作者: Rasmus Pagh,Lukas Retschmeier
关键词-EN: Motivated by applications, zero-concentrated differential privacy, edge-weight differential privacy, differential privacy constraints, MST
类目: Data Structures and Algorithms (cs.DS); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Motivated by applications in clustering and synthetic data generation, we consider the problem of releasing a minimum spanning tree (MST) under edge-weight differential privacy constraints where a graph topology G=(V,E) with n vertices and m edges is public, the weight matrix \vecW\in \mathbbR^n \times n is private, and we wish to release an approximate MST under \rho -zero-concentrated differential privacy. Weight matrices are considered neighboring if they differ by at most \Delta_\infty in each entry, i.e., we consider an \ell_\infty neighboring relationship. Existing private MST algorithms either add noise to each entry in \vecW and estimate the MST by post-processing or add noise to weights in-place during the execution of a specific MST algorithm. Using the post-processing approach with an efficient MST algorithm takes O(n^2) time on dense graphs but results in an additive error on the weight of the MST of magnitude O(n^2\log n) . In-place algorithms give asymptotically better utility, but the running time of existing in-place algorithms is O(n^3) for dense graphs. Our main result is a new differentially private MST algorithm that matches the utility of existing in-place methods while running in time O(m + n^3/2\log n) for fixed privacy parameter \rho . The technical core of our algorithm is an efficient sublinear time simulation of Report-Noisy-Max that works by discretizing all edge weights to a multiple of \Delta_\infty and forming groups of edges with identical weights. Specifically, we present a data structure that allows us to sample a noisy minimum weight edge among at most O(n^2) cut edges in O(\sqrtn \log n) time. Experimental evaluations support our claims that our algorithm significantly improves previous algorithms either in utility or running time.

[LG-7] Blessing of Dimensionality for Approximating Sobolev Classes on Manifolds

链接: https://arxiv.org/abs/2408.06996
作者: Hong Ye Tan,Subhadip Mukherjee,Junqi Tang,Carola-Bibiane Schönlieb
关键词-EN: natural high-dimensional data, high-dimensional data, manifold, natural high-dimensional, manifold hypothesis
类目: Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:The manifold hypothesis says that natural high-dimensional data is actually supported on or around a low-dimensional manifold. Recent success of statistical and learning-based methods empirically supports this hypothesis, due to outperforming classical statistical intuition in very high dimensions. A natural step for analysis is thus to assume the manifold hypothesis and derive bounds that are independent of any embedding space. Theoretical implications in this direction have recently been explored in terms of generalization of ReLU networks and convergence of Langevin methods. We complement existing results by providing theoretical statistical complexity results, which directly relate to generalization properties. In particular, we demonstrate that the statistical complexity required to approximate a class of bounded Sobolev functions on a compact manifold is bounded from below, and moreover that this bound is dependent only on the intrinsic properties of the manifold. These provide complementary bounds for existing approximation results for ReLU networks on manifolds, which give upper bounds on generalization capacity.

[LG-8] IRS-Assisted Lossy Communications Under Correlated Rayleigh Fading: Outage Probability Analysis and Optimization

链接: https://arxiv.org/abs/2408.06969
作者: Guanchang Li,Wensheng Lin,Lixin Li,Yixuan He,Fucheng Yang,Zhu Han
关键词-EN: correlated Rayleigh fading, intelligent reflecting surface, assisted lossy communication, lossy communication system, Rayleigh fading
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper focuses on an intelligent reflecting surface (IRS)-assisted lossy communication system with correlated Rayleigh fading. We analyze the correlated channel model and derive the outage probability of the system. Then, we design a deep reinforcement learning (DRL) method to optimize the phase shift of the IRS, in order to maximize the received signal power. Moreover, this paper presents results of the simulations conducted to evaluate the performance of the DRL-based method. The simulation results indicate that the outage probability of the considered system increases significantly with more correlated channel coefficients. Moreover, the performance gap between DRL and the theoretical limit increases with higher transmit power and/or larger distortion requirement.

[LG-9] DyG-Mamba: Continuous State Space Modeling on Dynamic Graphs

链接: https://arxiv.org/abs/2408.06966
作者: Dongyuan Li,Shiyin Tan,Ying Zhang,Ming Jin,Shirui Pan,Manabu Okumura,Renhe Jiang
关键词-EN: enabling accurate social, accurate social recommendation, uncover evolutionary laws, Dynamic graph learning, graph learning aims
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Dynamic graph learning aims to uncover evolutionary laws in real-world systems, enabling accurate social recommendation (link prediction) or early detection of cancer cells (classification). Inspired by the success of state space models, e.g., Mamba, for efficiently capturing long-term dependencies in language modeling, we propose DyG-Mamba, a new continuous state space model (SSM) for dynamic graph learning. Specifically, we first found that using inputs as control signals for SSM is not suitable for continuous-time dynamic network data with irregular sampling intervals, resulting in models being insensitive to time information and lacking generalization properties. Drawing inspiration from the Ebbinghaus forgetting curve, which suggests that memory of past events is strongly correlated with time intervals rather than specific details of the events themselves, we directly utilize irregular time spans as control signals for SSM to achieve significant robustness and generalization. Through exhaustive experiments on 12 datasets for dynamic link prediction and dynamic node classification tasks, we found that DyG-Mamba achieves state-of-the-art performance on most of the datasets, while also demonstrating significantly improved computation and memory efficiency.
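The core mechanism the abstract describes, using irregular time spans rather than event content as the SSM control signal, reduces in the scalar case to a time-controlled exponential decay. The sketch below is a deliberately tiny toy (one scalar state, fixed decay rate), not DyG-Mamba itself.

```python
# Scalar toy of a time-controlled state-space recurrence: with decay rate a,
# a zero-order-hold discretization gives h_k = exp(-a*dt_k)*h_{k-1} + b*x_k,
# so memory of past events fades with elapsed time, echoing the Ebbinghaus
# forgetting curve the paper cites. Parameters here are illustrative.
import math

def ssm_scan(xs, dts, a=0.5, b=1.0):
    h, states = 0.0, []
    for x, dt in zip(xs, dts):
        h = math.exp(-a * dt) * h + b * x  # older events decay with the gap
        states.append(h)
    return states

# Identical event sequences, but a longer gap before the second event:
close = ssm_scan([1.0, 0.0], dts=[1.0, 1.0])   # short gap: memory retained
far = ssm_scan([1.0, 0.0], dts=[1.0, 10.0])    # long gap: memory mostly gone
```

In the full model, `a` and `b` become learned (matrix-valued) parameters, but the insensitivity-to-content, sensitivity-to-time behavior shown here is the property the authors argue yields robustness on irregularly sampled dynamic graphs.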

[LG-10] Measuring User Understanding in Dialogue-based XAI Systems ECAI2024

链接: https://arxiv.org/abs/2408.06960
作者: Dimitry Mindlin,Amelie Sophie Robrecht,Michael Morasch,Philipp Cimiano
关键词-EN: eXplainable Artificial Intelligence, Artificial Intelligence, reflect users’ explanation, eXplainable Artificial, XAI
类目: Machine Learning (cs.LG)
*备注: Accepted at the ECAI 2024 main conference - final version and code coming soon. 8 pages, 5 figures

点击查看摘要

Abstract:The field of eXplainable Artificial Intelligence (XAI) is increasingly recognizing the need to personalize and/or interactively adapt the explanation to better reflect users’ explanation needs. While dialogue-based approaches to XAI have been proposed recently, the state-of-the-art in XAI is still characterized by what we call one-shot, non-personalized and one-way explanations. In contrast, dialogue-based systems that can adapt explanations through interaction with a user promise to be superior to GUI-based or dashboard explanations as they offer a more intuitive way of requesting information. In general, while interactive XAI systems are often evaluated in terms of user satisfaction, there are limited studies that assess users’ objective model understanding. This is in particular the case for dialogue-based XAI approaches. In this paper, we close this gap by carrying out controlled experiments within a dialogue framework in which we measure the understanding of users in three phases by asking them to simulate the predictions of the model they are learning about. By this, we can quantify the level of (improved) understanding w.r.t. how the model works, comparing the state prior to and after the interaction. We further analyze the data to reveal patterns of how the interaction differs between groups with high vs. low understanding gain. Overall, our work thus contributes to our understanding of the effectiveness of XAI approaches.

[LG-11] AuToMATo: A Parameter-Free Persistence-Based Clustering Algorithm

链接: https://arxiv.org/abs/2408.06958
作者: Marius Huber,Sara Kalisnik,Patrick Schnider
关键词-EN: clustering algorithm based, persistent homology, parameter-free clustering algorithm, clustering algorithm, parameter-free clustering
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We present AuToMATo, a novel parameter-free clustering algorithm based on persistent homology. AuToMATo combines the existing ToMATo clustering algorithm with a bootstrapping procedure in order to separate significant peaks of an estimated density function from non-significant ones. We perform a thorough comparison of AuToMATo against many other state-of-the-art clustering algorithms. We find not only that AuToMATo compares favorably against other parameter-free clustering algorithms, but also that in many instances it significantly outperforms even the best selection of parameters for other algorithms. AuToMATo is motivated by applications in topological data analysis, in particular the Mapper algorithm, where it is desirable to work with a parameter-free clustering algorithm. Indeed, we provide evidence that AuToMATo performs well when used with Mapper. Finally, we provide an open-source implementation of AuToMATo in Python that is fully compatible with the standard scikit-learn architecture.

[LG-12] Heavy-Ball Momentum Accelerated Actor-Critic With Function Approximation

链接: https://arxiv.org/abs/2408.06945
作者: Yanjie Dong,Haijun Zhang,Gang Wang,Shisheng Cui,Xiping Hu
关键词-EN: stochastic policy gradient, analyzing convergence rate, convergence rate, replace the Monte-Carlo, Monte-Carlo rollouts
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:By using a parametric value function to replace the Monte-Carlo rollouts for value estimation, actor-critic (AC) algorithms can reduce the variance of the stochastic policy gradient and thereby improve the convergence rate. While existing works mainly focus on analyzing the convergence rate of AC algorithms under Markovian noise, the impacts of momentum on AC algorithms remain largely unexplored. In this work, we first propose a heavy-ball momentum based advantage actor-critic (HB-A2C) algorithm by integrating the heavy-ball momentum into the critic recursion that is parameterized by a linear function. When the sample trajectory follows a Markov decision process, we quantitatively certify the acceleration capability of the proposed HB-A2C algorithm. Our theoretical results demonstrate that the proposed HB-A2C finds an $\epsilon$-approximate stationary point with $O(\epsilon^{-2})$ iterations for reinforcement learning tasks with Markovian noise. Moreover, we also reveal the dependence of learning rates on the length of the sample trajectory. By carefully selecting the momentum factor of the critic recursion, the proposed HB-A2C can balance the errors introduced by the initialization and the stochastic approximation.
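The critic-side heavy-ball update can be illustrated on a one-state Markov reward process with a constant feature, where the TD fixed point is $v^* = r/(1-\gamma) = 10$. The step sizes and toy MRP below are illustrative, not the paper's analyzed setting.

```python
# Toy heavy-ball TD critic in the spirit of HB-A2C: the linear critic's
# update keeps a beta-weighted copy of its previous displacement
# (the heavy-ball momentum term). Single recurring state, phi(s) = 1.

def hb_td_critic(rewards, gamma=0.9, alpha=0.1, beta=0.5):
    w, w_prev = 0.0, 0.0
    for r in rewards:
        td_err = r + gamma * w - w                        # TD error
        w_new = w + alpha * td_err + beta * (w - w_prev)  # heavy-ball step
        w_prev, w = w, w_new
    return w

v = hb_td_critic([1.0] * 200)
# Fixed point of v = r + gamma*v with r = 1, gamma = 0.9 is v = 10.
```

Setting `beta=0.0` recovers plain TD(0); the momentum term is what the paper analyzes for acceleration under Markovian noise.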

[LG-13] Towards Holistic Disease Risk Prediction using Small Language Models ICML

链接: https://arxiv.org/abs/2408.06943
作者: Liv Björkdahl,Oskar Pauli,Johan Östman,Chiara Ceccobello,Sara Lundell,Magnus Kjellberg
关键词-EN: healthcare domain arise, continuous measurements, x-ray images, clinical notes, domain arise
类目: Machine Learning (cs.LG)
*备注: 6 pages, submitted to ICMLA

点击查看摘要

Abstract:Data in the healthcare domain arise from a variety of sources and modalities, such as x-ray images, continuous measurements, and clinical notes. Medical practitioners integrate these diverse data types daily to make informed and accurate decisions. With recent advancements in language models capable of handling multimodal data, it is a logical progression to apply these models to the healthcare sector. In this work, we introduce a framework that connects small language models to multiple data sources, aiming to predict the risk of various diseases simultaneously. Our experiments encompass 12 different tasks within a multitask learning setup. Although our approach does not surpass state-of-the-art methods specialized for single tasks, it demonstrates competitive performance and underscores the potential of small language models for multimodal reasoning in healthcare.

[LG-14] Breaking Class Barriers: Efficient Dataset Distillation via Inter-Class Feature Compensator

链接: https://arxiv.org/abs/2408.06927
作者: Xin Zhang,Jiawei Du,Ping Liu,Joey Tianyi Zhou
关键词-EN: condense informative features, Inter-class Feature Compensator, Universal Feature Compensator, aiming to condense, condense informative
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Dataset distillation has emerged as a technique aiming to condense informative features from large, natural datasets into a compact and synthetic form. While recent advancements have refined this technique, its performance is bottlenecked by the prevailing class-specific synthesis paradigm. Under this paradigm, synthetic data is optimized exclusively for a pre-assigned one-hot label, creating an implicit class barrier in feature condensation. This leads to inefficient utilization of the distillation budget and oversight of inter-class feature distributions, which ultimately limits the effectiveness and efficiency, as demonstrated in our analysis. To overcome these constraints, this paper presents the Inter-class Feature Compensator (INFER), an innovative distillation approach that transcends the class-specific data-label framework widely utilized in current dataset distillation methods. Specifically, INFER leverages a Universal Feature Compensator (UFC) to enhance feature integration across classes, enabling the generation of multiple additional synthetic instances from a single UFC input. This significantly improves the efficiency of the distillation budget. Moreover, INFER enriches inter-class interactions during the distillation, thereby enhancing the effectiveness and generalizability of the distilled data. By allowing for the linear interpolation of labels similar to those in the original dataset, INFER meticulously optimizes the synthetic data and dramatically reduces the size of soft labels in the synthetic dataset to almost zero, establishing a new benchmark for efficiency and effectiveness in dataset distillation.
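The "one compensator, many synthetic instances" idea can be rendered as a loose numpy toy. This is a shape-level illustration only: the real UFC is optimized during distillation, whereas here it is just a random tensor, and the label smoothing is an arbitrary stand-in for INFER's interpolated soft labels.

```python
# Toy rendering of the Universal Feature Compensator idea: one shared
# compensator is added to distilled instances of *different* classes, so a
# single UFC input yields extra synthetic examples across the class barrier.
# Shapes, values, and the soft-label scheme are illustrative.
import numpy as np

rng = np.random.default_rng(0)
per_class = {c: rng.normal(size=(4, 8)) for c in range(3)}  # distilled data
ufc = rng.normal(scale=0.1, size=(8,))                      # shared compensator

synthetic, soft_labels = [], []
for c, imgs in per_class.items():
    for img in imgs:
        synthetic.append(img + ufc)          # same UFC reused for every class
        onehot = np.eye(3)[c]
        soft_labels.append(0.9 * onehot + 0.1 / 3)  # lightly smoothed label
synthetic = np.stack(synthetic)
```

The point of the construction is budget efficiency: the compensator is stored once but augments every class's instances, instead of each synthetic example being tied to a single one-hot label.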

[LG-15] Heterogeneity: An Open Challenge for Federated On-board Machine Learning

链接: https://arxiv.org/abs/2408.06903
作者: Maria Hartmann,Grégoire Danoy,Pascal Bouvry
关键词-EN: distributed mission configurations, individualised monolithic satellites, Federated Learning, mission configurations, multiple small satellites
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: Accepted to the ESA SPAICE conference 2024

点击查看摘要

Abstract:The design of satellite missions is currently undergoing a paradigm shift from the historical approach of individualised monolithic satellites towards distributed mission configurations, consisting of multiple small satellites. With a rapidly growing number of such satellites now deployed in orbit, each collecting large amounts of data, interest in on-board orbital edge computing is rising. Federated Learning is a promising distributed computing approach in this context, allowing multiple satellites to collaborate efficiently in training on-board machine learning models. Though recent works on the use of Federated Learning in orbital edge computing have focused largely on homogeneous satellite constellations, Federated Learning could also be employed to allow heterogeneous satellites to form ad-hoc collaborations, e.g. in the case of communications satellites operated by different providers. Such an application presents additional challenges to the Federated Learning paradigm, arising largely from the heterogeneity of such a system. In this position paper, we offer a systematic review of these challenges in the context of the cross-provider use case, giving a brief overview of the state-of-the-art for each, and providing an entry point for deeper exploration of each issue.

[LG-16] Automatic Feature Recognition and Dimensional Attributes Extraction From CAD Models for Hybrid Additive-Subtractive Manufacturing

链接: https://arxiv.org/abs/2408.06891
作者: Muhammad Tayyab Khan,Wenhe Feng,Lequn Chen,Ye Han Ng,Nicholas Yew Jin Tan,Seung Ki Moon
关键词-EN: facilitating seamless transitions, Computer-Aided Design, Computer-Aided Process Planning, manufacturing process planning, digital designs
类目: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 10 pages, 12 figures. This paper has been accepted for presentation at the ASME IDETC-CIE 2024 conference

点击查看摘要

Abstract:The integration of Computer-Aided Design (CAD), Computer-Aided Process Planning (CAPP), and Computer-Aided Manufacturing (CAM) plays a crucial role in modern manufacturing, facilitating seamless transitions from digital designs to physical products. However, a significant challenge within this integration is the Automatic Feature Recognition (AFR) of CAD models, especially in the context of hybrid manufacturing that combines subtractive and additive manufacturing processes. Traditional AFR methods, focused mainly on the identification of subtractive (machined) features including holes, fillets, chamfers, pockets, and slots, fail to recognize features pertinent to additive manufacturing. Furthermore, the traditional methods fall short in accurately extracting geometric dimensions and orientations, which are also key factors for effective manufacturing process planning. This paper presents a novel approach for creating a synthetic CAD dataset that encompasses features relevant to both additive and subtractive machining through Python Open Cascade. A Hierarchical Graph Convolutional Neural Network (HGCNN) model is implemented to accurately identify the composite additive-subtractive features within the synthetic CAD dataset. The key novelty and contribution of the proposed methodology lie in its ability to recognize a wide range of manufacturing features and to precisely extract their dimensions, orientations, and stock sizes. The proposed model demonstrates remarkable feature recognition accuracy exceeding 97% and a dimension extraction accuracy of 100% for identified features. Therefore, the proposed methodology enhances the integration of CAD, CAPP, and CAM within hybrid manufacturing by providing precise feature recognition and dimension extraction, and it facilitates improved manufacturing process planning by enabling more informed decision-making.

[LG-17] BMFT: Achieving Fairness via Bias-based Weight Masking Fine-tuning MICCAI2024

链接: https://arxiv.org/abs/2408.06890
作者: Yuyang Xue,Junyu Yan,Raman Dutt,Fasih Haider,Jingshuai Liu,Steven McDonagh,Sotirios A. Tsaftaris
关键词-EN: robust group fairness, group fairness properties, ethically sensitive domains, Developing models, properties is paramount
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted by MICCAI 2024 FAIMI Workshop Oral

点击查看摘要

Abstract:Developing models with robust group fairness properties is paramount, particularly in ethically sensitive domains such as medical diagnosis. Recent approaches to achieving fairness in machine learning require a substantial amount of training data and depend on model retraining, which may not be practical in real-world scenarios. To mitigate these challenges, we propose Bias-based Weight Masking Fine-Tuning (BMFT), a novel post-processing method that enhances the fairness of a trained model in significantly fewer epochs without requiring access to the original training data. BMFT produces a mask over model parameters, which efficiently identifies the weights contributing the most towards biased predictions. Furthermore, we propose a two-step debiasing strategy, wherein the feature extractor undergoes initial fine-tuning on the identified bias-influenced weights, succeeded by a fine-tuning phase on a reinitialised classification layer to uphold discriminative performance. Extensive experiments across four dermatological datasets and two sensitive attributes demonstrate that BMFT outperforms existing state-of-the-art (SOTA) techniques in both diagnostic accuracy and fairness metrics. Our findings underscore the efficacy and robustness of BMFT in advancing fairness across various out-of-distribution (OOD) settings. Our code is available at: this https URL
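The mask-construction step at the heart of BMFT can be sketched with a numpy toy: score each weight by the gradient of a bias loss, keep only the top-k most bias-implicated weights trainable, and freeze the rest. The bias-loss gradients and k below are hypothetical inputs, not BMFT's actual scoring procedure.

```python
# Toy bias-based weight mask: rank weights by the magnitude of a
# (hypothetical) bias-loss gradient and fine-tune only the top-k of them.
import numpy as np

def bias_mask(weights, bias_grads, k):
    # Indices of the k largest |d(bias loss)/dw|, as a boolean mask.
    idx = np.argsort(np.abs(bias_grads).ravel())[::-1][:k]
    mask = np.zeros(weights.size, dtype=bool)
    mask[idx] = True
    return mask.reshape(weights.shape)

w = np.array([[0.2, -1.0], [0.5, 0.05]])
g = np.array([[0.01, 2.0], [0.3, 0.002]])  # hypothetical bias-loss gradients
mask = bias_mask(w, g, k=2)

# Only the two most bias-implicated weights receive updates:
w_updated = w - 0.1 * g * mask
```

In the full method this masked fine-tuning is the first of two steps, followed by fine-tuning a reinitialized classification layer to restore discriminative performance.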

[LG-18] Optimal Bound for PCA with Outliers using Higher-Degree Voronoi Diagrams

链接: https://arxiv.org/abs/2408.06867
作者: Sajjad Hashemian,Mohammad Saeed Arvenaghi,Ebrahim Ardeshir-Larijani
关键词-EN: Principal Component Analysis, Component Analysis, Principal Component, Analysis, higher-degree Voronoi diagrams
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, we introduce new algorithms for Principal Component Analysis (PCA) with outliers. Utilizing techniques from computational geometry, specifically higher-degree Voronoi diagrams, we navigate to the optimal subspace for PCA even in the presence of outliers. This approach achieves an optimal solution with a time complexity of $n^{d+\mathcal{O}(1)} \cdot \text{poly}(n,d)$. Additionally, we present a randomized algorithm with a complexity of $2^{\mathcal{O}(r(d-r))} \times \text{poly}(n, d)$. This algorithm samples subspaces characterized in terms of a Grassmannian manifold. By employing such a sampling method, we ensure a high likelihood of capturing the optimal subspace, with success probability $(1 - \delta)^T$, where $\delta$ represents the probability that a sampled subspace does not contain the optimal solution, and $T$ is the number of subspaces sampled, proportional to $2^{r(d-r)}$. Our use of higher-degree Voronoi diagrams and Grassmannian-based sampling offers a clearer conceptual pathway and practical advantages, particularly in handling large datasets or higher-dimensional settings.
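A Monte-Carlo caricature of the randomized idea: repeatedly sample an r-dimensional subspace, score it by the reconstruction error of the n−k best-fitting points (treating the k worst as the presumed outliers), and keep the best. The QR of a Gaussian matrix is a standard way to draw a uniformly random orthonormal basis; the paper's structured Grassmannian sampling and guarantees are not reproduced here.

```python
# Naive random-subspace search for PCA with k outliers: score each sampled
# r-dim subspace by the residuals of the n-k best-fitting points.
import numpy as np

def pca_with_outliers(X, r, k, trials=200, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    best_err, best_U = np.inf, None
    for _ in range(trials):
        U, _ = np.linalg.qr(rng.normal(size=(d, r)))   # random r-dim basis
        resid = np.linalg.norm(X - (X @ U) @ U.T, axis=1) ** 2
        err = np.sort(resid)[: n - k].sum()            # drop the k worst points
        if err < best_err:
            best_err, best_U = err, U
    return best_U, best_err

rng = np.random.default_rng(1)
X = np.outer(rng.normal(size=50), [1.0, 0.0, 0.0])  # rank-1 inliers on x-axis
X[:5] += rng.normal(scale=10.0, size=(5, 3))        # 5 gross outliers
U, err = pca_with_outliers(X, r=1, k=5)
```

Ordinary PCA on `X` would be pulled toward the outliers; scoring on the trimmed residual sum lets a well-aligned subspace win despite them, which is the robustness property the exact algorithms certify.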

[LG-19] Efficient Search for Customized Activation Functions with Gradient Descent

链接: https://arxiv.org/abs/2408.06820
作者: Lukas Strack,Mahmoud Safari,Frank Hutter
关键词-EN: activation functions work, activation functions, functions work, functions, deep learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 10 pages, 1 figure, excluding references and appendix

点击查看摘要

Abstract:Different activation functions work best for different deep learning models. To exploit this, we leverage recent advancements in gradient-based search techniques for neural architectures to efficiently identify high-performing activation functions for a given application. We propose a fine-grained search cell that combines basic mathematical operations to model activation functions, allowing for the exploration of novel activations. Our approach enables the identification of specialized activations, leading to improved performance in every model we tried, from image classification to language models. Moreover, the identified activations exhibit strong transferability to larger models of the same type, as well as new datasets. Importantly, our automated process for creating customized activation functions is orders of magnitude more efficient than previous approaches. It can easily be applied on top of arbitrary deep learning pipelines and thus offers a promising practical avenue for enhancing deep learning architectures.

[LG-20] Enhancing Multiview Synergy: Robust Learning by Exploiting the Wave Loss Function with Consensus and Complementarity Principles

链接: https://arxiv.org/abs/2408.06819
作者: A. Quadir,Mushir Akhtar,M. Tanveer
关键词-EN: multiple data perspectives, view-consistency and view-discrepancy, advancing domain, perspectives to enhance, leveraging multiple data
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multiview learning (MvL) is an advancing domain in machine learning, leveraging multiple data perspectives to enhance model performance through view-consistency and view-discrepancy. Despite numerous successful multiview-based SVM models, existing frameworks predominantly focus on the consensus principle, often overlooking the complementarity principle. Furthermore, they exhibit limited robustness against noisy, error-prone, and view-inconsistent samples, prevalent in multiview datasets. To tackle the aforementioned limitations, this paper introduces Wave-MvSVM, a novel multiview support vector machine framework leveraging the wave loss (W-loss) function, specifically designed to harness both consensus and complementarity principles. Unlike traditional approaches that often overlook the complementary information among different views, the proposed Wave-MvSVM ensures a more comprehensive and resilient learning process by integrating both principles effectively. The W-loss function, characterized by its smoothness, asymmetry, and bounded nature, is particularly effective in mitigating the adverse effects of noisy and outlier data, thereby enhancing model stability. Theoretically, the W-loss function also exhibits a crucial classification-calibrated property, further boosting its effectiveness. Wave-MvSVM employs a between-view co-regularization term to enforce view consistency and utilizes an adaptive combination weight strategy to maximize the discriminative power of each view. The optimization problem is efficiently solved using a combination of gradient descent (GD) and the alternating direction method of multipliers (ADMM), ensuring reliable convergence to optimal solutions. Theoretical analyses, grounded in Rademacher complexity, validate the generalization capabilities of the Wave-MvSVM model. Extensive empirical evaluations across diverse datasets demonstrate the superior performance of Wave-MvSVM in comparison to existing benchmark models.

[LG-21] On a Scale-Invariant Approach to Bundle Recommendations in Candy Crush Saga

链接: https://arxiv.org/abs/2408.06799
作者: Styliani Katsarou,Francesca Carminati,Martin Dlask,Marta Braojos,Lavena Patra,Richard Perkins,Carlos Garcia Ling,Maria Paskevich
关键词-EN: increasing content relevancy, mobile game scenario, content relevancy, player preferences, preferences is crucial
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A good understanding of player preferences is crucial for increasing content relevancy, especially in mobile games. This paper illustrates the use of attentive models for producing item recommendations in a mobile game scenario. The methodology comprises a combination of supervised and unsupervised approaches to create user-level recommendations while introducing a novel scale-invariant approach to the prediction. The methodology is subsequently applied to a bundle recommendation in Candy Crush Saga. The strategy of deployment, maintenance, and monitoring of ML models that are scaled up to serve millions of users is presented, along with the best practices and design patterns adopted to minimize technical debt typical of ML systems. The recommendation approach is evaluated both offline and online, with a focus on understanding the increase in engagement, click- and take rates, novelty effects, recommendation diversity, and the impact of degenerate feedback loops. We have demonstrated that the recommendation enhances user engagement by 30% concerning click rate and by more than 40% concerning take rate. In addition, we empirically quantify the diminishing effects of recommendation accuracy on user engagement.

[LG-22] Exploring Domain Shift on Radar-Based 3D Object Detection Amidst Diverse Environmental Conditions ITSC

链接: https://arxiv.org/abs/2408.06772
作者: Miao Zhang,Sherif Abdulatif,Benedikt Loesch,Marco Altmann,Marius Schwarz,Bin Yang
关键词-EN: autonomous driving systems, perception using multimodal, rapid evolution, evolution of deep, deep learning
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 6 pages, 5 figures, 3 tables, accepted in IEEE International Conference on Intelligent Transportation Systems (ITSC) 2024

点击查看摘要

Abstract:The rapid evolution of deep learning and its integration with autonomous driving systems have led to substantial advancements in 3D perception using multimodal sensors. Notably, radar sensors show greater robustness compared to cameras and lidar under adverse weather and varying illumination conditions. This study delves into the often-overlooked yet crucial issue of domain shift in 4D radar-based object detection, examining how varying environmental conditions, such as different weather patterns and road types, impact 3D object detection performance. Our findings highlight distinct domain shifts across various weather scenarios, revealing unique dataset sensitivities that underscore the critical role of radar point cloud generation. Additionally, we demonstrate that transitioning between different road types, especially from highways to urban settings, introduces notable domain shifts, emphasizing the necessity for diverse data collection across varied road environments. To the best of our knowledge, this is the first comprehensive analysis of domain shift effects on 4D radar-based object detection. We believe this empirical study contributes to understanding the complex nature of domain shifts in radar data and suggests paths forward for data collection strategy in the face of environmental variability.

[LG-23] Robust Black-box Testing of Deep Neural Networks using Co-Domain Coverage

链接: https://arxiv.org/abs/2408.06766
作者: Aishwarya Gupta,Indranil Saha,Piyush Rai
关键词-EN: machine learning models, Rigorous testing, trustworthy deployments, machine learning, DNN
类目: Machine Learning (cs.LG)
*备注: 20 pages (including references), 4 figures, 7 tables

点击查看摘要

Abstract:Rigorous testing of machine learning models is necessary for trustworthy deployments. We present a novel black-box approach for generating test-suites for robust testing of deep neural networks (DNNs). Most existing methods create test inputs based on maximizing some “coverage” criterion/metric such as a fraction of neurons activated by the test inputs. Such approaches, however, can only analyze each neuron’s behavior or each layer’s output in isolation and are unable to capture their collective effect on the DNN’s output, resulting in test suites that often do not capture the various failure modes of the DNN adequately. These approaches also require white-box access, i.e., access to the DNN’s internals (node activations). We present a novel black-box coverage criterion called Co-Domain Coverage (CDC), which is defined as a function of the model’s output and thus takes into account its end-to-end behavior. Subsequently, we develop a new fuzz testing procedure named CoDoFuzz, which uses CDC to guide the fuzzing process to generate a test suite for a DNN. We extensively compare the test suite generated by CoDoFuzz with those generated using several state-of-the-art coverage-based fuzz testing methods for the DNNs trained on six publicly available datasets. Experimental results establish the efficiency and efficacy of CoDoFuzz in generating the largest number of misclassified inputs and the inputs for which the model lacks confidence in its decision.
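One way to picture an output-space ("co-domain") coverage criterion, assuming it is computed over the model's softmax outputs (our simplification, not the paper's exact CDC definition): quantize each predicted probability vector into a grid cell and count distinct cells, keeping a fuzzing candidate only when it reaches a new cell.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def codomain_coverage(logits, n_bins=5):
    """Black-box coverage: quantize each softmax vector into a grid cell
    of the output simplex and count the distinct cells reached."""
    probs = softmax(np.asarray(logits, dtype=float))
    cells = {tuple(np.minimum((probs[i] * n_bins).astype(int), n_bins - 1))
             for i in range(len(probs))}
    return len(cells)

def fuzz_step(candidates_logits, covered_cells, n_bins=5):
    """Keep a candidate input only if its output lands in a new cell --
    the guidance signal of a coverage-driven fuzzer."""
    kept = []
    for z in candidates_logits:
        p = softmax(z[None])[0]
        cell = tuple(np.minimum((p * n_bins).astype(int), n_bins - 1))
        if cell not in covered_cells:
            covered_cells.add(cell)
            kept.append(z)
    return kept
```

Because the criterion only needs the model's outputs, it requires no access to node activations, in contrast with neuron-coverage metrics.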

[LG-24] Class-aware and Augmentation-free Contrastive Learning from Label Proportion

链接: https://arxiv.org/abs/2408.06743
作者: Jialiang Wang,Ning Zhang,Shimin Di,Ruidong Wang,Lei Chen
关键词-EN: weakly supervised learning, supervised learning scenario, label proportion matching, Label Proportion, weakly supervised
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Learning from Label Proportion (LLP) is a weakly supervised learning scenario in which training data is organized into predefined bags of instances, disclosing only the class label proportions per bag. This paradigm is essential for user modeling and personalization, where user privacy is paramount, offering insights into user preferences without revealing individual data. LLP faces a unique difficulty: the misalignment between bag-level supervision and the objective of instance-level prediction, primarily due to the inherent ambiguity in label proportion matching. Previous studies have demonstrated deep representation learning can generate auxiliary signals to promote the supervision level in the image domain. However, applying these techniques to tabular data presents significant challenges: 1) they rely heavily on label-invariant augmentation to establish multi-view, which is not feasible with the heterogeneous nature of tabular datasets, and 2) tabular datasets often lack sufficient semantics for perfect class distinction, making them prone to suboptimality caused by the inherent ambiguity of label proportion matching. To address these challenges, we propose an augmentation-free contrastive framework TabLLP-BDC that introduces class-aware supervision (explicitly aware of class differences) at the instance level. Our solution features a two-stage Bag Difference Contrastive (BDC) learning mechanism that establishes robust class-aware instance-level supervision by disassembling the nuance between bag label proportions, without relying on augmentations. Concurrently, our model presents a pioneering multi-task pretraining pipeline tailored for tabular-based LLP, capturing intrinsic tabular feature correlations in alignment with label proportion distribution. Extensive experiments demonstrate that TabLLP-BDC achieves state-of-the-art performance for LLP in the tabular domain. 

[LG-25] Multimodal Analysis of White Blood Cell Differentiation in Acute Myeloid Leukemia Patients using a beta-Variational Autoencoder MICCAI2024

链接: https://arxiv.org/abs/2408.06720
作者: Gizem Mert,Ario Sadafi,Raheleh Salehi,Nassir Navab,Carsten Marr
关键词-EN: Biomedical imaging, imaging and RNA, RNA sequencing, white blood cell, blood cell diseases
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: Accepted for publication at MICCAI 2024 workshop on AI for Imaging Genomics Learning (AIIG)

点击查看摘要

Abstract:Biomedical imaging and RNA sequencing with single-cell resolution improves our understanding of white blood cell diseases like leukemia. By combining morphological and transcriptomic data, we can gain insights into cellular functions and trajectories involved in blood cell differentiation. However, existing methodologies struggle with integrating morphological and transcriptomic data, leaving a significant research gap in comprehensively understanding the dynamics of cell differentiation. Here, we introduce an unsupervised method that explores and reconstructs these two modalities and uncovers the relationship between different subtypes of white blood cells from human peripheral blood smears in terms of morphology and their corresponding transcriptome. Our method is based on a beta-variational autoencoder (β-VAE) with a customized loss function, incorporating an R-CNN architecture to distinguish single-cell from background and to minimize any interference from artifacts. This implementation of the β-VAE shows good reconstruction capability along with continuous latent embeddings, while maintaining clear differentiation between single-cell classes. Our novel approach is especially helpful to uncover the correlation of two latent features in complex biological processes such as formation of granules in the cell (granulopoiesis) with gene expression patterns. It thus provides a unique tool to improve the understanding of white blood cell maturation for biomedicine and diagnostics.

[LG-26] Computation-friendly Graph Neural Network Design by Accumulating Knowledge on Large Language Models

链接: https://arxiv.org/abs/2408.06717
作者: Jialiang Wang,Shimin Di,Hanmo Liu,Zhili Wang,Jiachuan Wang,Lei Chen,Xiaofang Zhou
关键词-EN: Graph Neural Networks, Neural Networks, shown remarkable success, Graph Neural, Networks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs), like other neural networks, have shown remarkable success but are hampered by the complexity of their architecture designs, which heavily depend on specific data and tasks. Traditionally, designing proper architectures involves trial and error, which requires intensive manual effort to optimize various components. To reduce human workload, researchers try to develop automated algorithms to design GNNs. However, both experts and automated algorithms suffer from two major issues in designing GNNs: 1) the substantial computational resources expended in repeatedly trying candidate GNN architectures until a feasible design is achieved, and 2) the intricate and prolonged processes required for humans or algorithms to accumulate knowledge of the interrelationship between graphs, GNNs, and performance. To further enhance the automation of GNN architecture design, we propose a computation-friendly way to empower Large Language Models (LLMs) with specialized knowledge in designing GNNs, thereby drastically shortening the computational overhead and development cycle of designing GNN architectures. Our framework begins by establishing a knowledge retrieval pipeline that comprehends the intercorrelations between graphs, GNNs, and performance. This pipeline converts past model design experiences into structured knowledge for LLM reference, allowing it to quickly suggest initial model proposals. Subsequently, we introduce a knowledge-driven search strategy that emulates the exploration-exploitation process of human experts, enabling quick refinement of initial proposals within a promising scope. Extensive experiments demonstrate that our framework can efficiently deliver promising (e.g., Top-5.77%) initial model proposals for unseen datasets within seconds and without any prior training and achieve outstanding search performance in a few iterations. 

[LG-27] Variational Learning of Gaussian Process Latent Variable Models through Stochastic Gradient Annealed Importance Sampling

链接: https://arxiv.org/abs/2408.06710
作者: Jian Xu,Shian Du,Junmei Yang,Qianli Ma,Delu Zeng
关键词-EN: Gaussian Process Latent, Latent Variable Models, Process Latent Variable, Gaussian Process, Process Latent
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Gaussian Process Latent Variable Models (GPLVMs) have become increasingly popular for unsupervised tasks such as dimensionality reduction and missing data recovery due to their flexibility and non-linear nature. An importance-weighted version of the Bayesian GPLVMs has been proposed to obtain a tighter variational bound. However, this version of the approach is primarily limited to analyzing simple data structures, as the generation of an effective proposal distribution can become quite challenging in high-dimensional spaces or with complex data sets. In this work, we propose an Annealed Importance Sampling (AIS) approach to address these issues. By transforming the posterior into a sequence of intermediate distributions using annealing, we combine the strengths of Sequential Monte Carlo samplers and VI to explore a wider range of posterior distributions and gradually approach the target distribution. We further propose an efficient algorithm by reparameterizing all variables in the evidence lower bound (ELBO). Experimental results on both toy and image datasets demonstrate that our method outperforms state-of-the-art methods in terms of tighter variational bounds, higher log-likelihoods, and more robust convergence.
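The AIS backbone of the method can be shown on a toy 1D example where every intermediate distribution along the geometric path is Gaussian, so an exact (invariant) transition kernel is available. This sketch omits the GPLVM itself and only demonstrates the annealing estimator.

```python
import numpy as np

def log_f0(x):
    # Normalized prior: standard normal density.
    return -0.5 * x**2 - 0.5 * np.log(2 * np.pi)

def log_f1(x):
    # Unnormalized target: exp(-(x-3)^2 / 2); true Z = sqrt(2*pi).
    return -0.5 * (x - 3.0)**2

def ais_estimate(n_samples=4000, n_steps=100, seed=0):
    """Annealed importance sampling along the geometric path
    f_beta = f0^(1-beta) * f1^beta. Here each intermediate is Gaussian
    with mean 3*beta and unit variance, so we can sample it exactly."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(0.0, 1.0, n_steps + 1)
    x = rng.standard_normal(n_samples)            # draws from f0
    log_w = np.zeros(n_samples)
    for b_prev, b in zip(betas[:-1], betas[1:]):
        # AIS weight increment: log f_b(x) - log f_{b_prev}(x).
        log_w += (b - b_prev) * (log_f1(x) - log_f0(x))
        x = 3.0 * b + rng.standard_normal(n_samples)  # exact kernel for f_b
    return np.exp(log_w).mean()                   # unbiased estimate of Z
```

The mean importance weight is an unbiased estimator of the target's normalizing constant; a finer annealing schedule trades computation for lower variance, which is the lever the paper pulls inside the GPLVM bound.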

[LG-28] DiffSG: A Generative Solver for Network Optimization with Diffusion Model

链接: https://arxiv.org/abs/2408.06701
作者: Ruihuai Liang,Bo Yang,Zhiwen Yu,Bin Guo,Xuelin Cao,Mérouane Debbah,H. Vincent Poor,Chau Yuen
关键词-EN: Diffusion generative models, Diffusion generative, generative models, Diffusion, network optimization
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注: 8 pages, 5 figures

点击查看摘要

Abstract:Diffusion generative models, famous for their performance in image generation, are popular in various cross-domain applications. However, their use in the communication community has been mostly limited to auxiliary tasks like data modeling and feature extraction. These models hold greater promise for fundamental problems in network optimization compared to traditional machine learning methods. Discriminative deep learning often falls short due to its single-step input-output mapping and lack of global awareness of the solution space, especially given the complexity of network optimization’s objective functions. In contrast, diffusion generative models can consider a broader range of solutions and exhibit stronger generalization by learning parameters that describe the distribution of the underlying solution space, with higher probabilities assigned to better solutions. We propose a new framework Diffusion Model-based Solution Generation (DiffSG), which leverages the intrinsic distribution learning capabilities of diffusion generative models to learn high-quality solution distributions based on given inputs. The optimal solution within this distribution is highly probable, allowing it to be effectively reached through repeated sampling. We validate the performance of DiffSG on several typical network optimization problems, including mixed-integer non-linear programming, convex optimization, and hierarchical non-convex optimization. Our results show that DiffSG outperforms existing baselines. In summary, we demonstrate the potential of diffusion generative models in tackling complex network optimization problems and outline a promising path for their broader application in the communication community.

[LG-29] Information Geometry and Beta Link for Optimizing Sparse Variational Student-t Processes

链接: https://arxiv.org/abs/2408.06699
作者: Jian Xu,Delu Zeng,John Paisley
关键词-EN: Student-t Processes, variational Student-t Processes, enhance computational efficiency, sparse variational Student-t, termed sparse variational
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recently, a sparse version of Student-t Processes, termed sparse variational Student-t Processes, has been proposed to enhance computational efficiency and flexibility for real-world datasets using stochastic gradient descent. However, traditional gradient descent methods like Adam may not fully exploit the parameter space geometry, potentially leading to slower convergence and suboptimal performance. To mitigate these issues, we adopt natural gradient methods from information geometry for variational parameter optimization of Student-t Processes. This approach leverages the curvature and structure of the parameter space, utilizing tools such as the Fisher information matrix which is linked to the Beta function in our model. This method provides robust mathematical support for the natural gradient algorithm when using Student’s t-distribution as the variational distribution. Additionally, we present a mini-batch algorithm for efficiently computing natural gradients. Experimental results across four benchmark datasets demonstrate that our method consistently accelerates convergence speed.
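The core update, preconditioning the gradient by the inverse Fisher information, can be illustrated on a 1D Gaussian whose Fisher matrix is known in closed form. This is a generic natural-gradient sketch, not the paper's Student-t / Beta-function derivation.

```python
import numpy as np

def fisher(sigma):
    """Fisher information of N(mu, sigma^2) w.r.t. (mu, sigma)."""
    return np.diag([1.0 / sigma**2, 2.0 / sigma**2])

def grad_loglik(data, mu, sigma):
    """Gradient of the average log-likelihood w.r.t. (mu, sigma)."""
    d = data - mu
    g_mu = d.mean() / sigma**2
    g_sigma = (d**2).mean() / sigma**3 - 1.0 / sigma
    return np.array([g_mu, g_sigma])

def fit(data, natural=True, lr=0.1, steps=200):
    """Gradient ascent on the log-likelihood; with natural=True the
    gradient is preconditioned by the inverse Fisher matrix, which
    rescales steps to the local geometry of the parameter space."""
    mu, sigma = 0.0, 1.0
    for _ in range(steps):
        g = grad_loglik(data, mu, sigma)
        if natural:
            g = np.linalg.solve(fisher(sigma), g)
        mu += lr * g[0]
        sigma = max(sigma + lr * g[1], 1e-3)
    return mu, sigma
```

The same loop with `natural=False` takes much smaller effective steps when sigma is large, which is the convergence gap natural-gradient methods are designed to close.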

[LG-30] SlotLifter: Slot-guided Feature Lifting for Learning Object-centric Radiance Fields ECCV2024

链接: https://arxiv.org/abs/2408.06697
作者: Yu Liu,Baoxiong Jia,Yixin Chen,Siyuan Huang
关键词-EN: underpins human-level generalization, distill object-centric abstractions, intricate visual scenes, visual scenes underpins, scenes underpins human-level
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: Accepted by ECCV 2024. Project website: this https URL

点击查看摘要

Abstract:The ability to distill object-centric abstractions from intricate visual scenes underpins human-level generalization. Despite the significant progress in object-centric learning methods, learning object-centric representations in the 3D physical world remains a crucial challenge. In this work, we propose SlotLifter, a novel object-centric radiance model addressing scene reconstruction and decomposition jointly via slot-guided feature lifting. Such a design unites object-centric learning representations and image-based rendering methods, offering state-of-the-art performance in scene decomposition and novel-view synthesis on four challenging synthetic and four complex real-world datasets, outperforming existing 3D object-centric learning methods by a large margin. Through extensive ablative studies, we showcase the efficacy of designs in SlotLifter, revealing key insights for potential future directions.

[LG-31] Masked Image Modeling: A Survey

链接: https://arxiv.org/abs/2408.06687
作者: Vlad Hondru,Florinel Alin Croitoru,Shervin Minaee,Radu Tudor Ionescu,Nicu Sebe
关键词-EN: powerful self-supervised learning, self-supervised learning technique, survey recent studies, computer vision, approach that emerged
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this work, we survey recent studies on masked image modeling (MIM), an approach that emerged as a powerful self-supervised learning technique in computer vision. The MIM task involves masking some information, e.g. pixels, patches, or even latent representations, and training a model, usually an autoencoder, to predict the missing information by using the context available in the visible part of the input. We identify and formalize two categories of approaches on how to implement MIM as a pretext task, one based on reconstruction and one based on contrastive learning. Then, we construct a taxonomy and review the most prominent papers in recent years. We complement the manually constructed taxonomy with a dendrogram obtained by applying a hierarchical clustering algorithm. We further identify relevant clusters via manually inspecting the resulting dendrogram. Our review also includes datasets that are commonly used in MIM research. We aggregate the performance results of various masked image modeling methods on the most popular datasets, to facilitate the comparison of competing methods. Finally, we identify research gaps and propose several interesting directions of future work.
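The pretext task itself is simple to state in code: hide a fixed ratio of patches and ask the model to reconstruct them from the visible remainder. A minimal patch-masking sketch (patch size and masking ratio are illustrative choices):

```python
import numpy as np

def mask_patches(image, patch=4, ratio=0.6, seed=0):
    """Split an (H, W) image into patch x patch blocks, zero out a random
    `ratio` of them, and return the masked image plus the boolean mask
    (True = hidden, i.e. what the model must reconstruct)."""
    rng = np.random.default_rng(seed)
    h, w = image.shape
    gh, gw = h // patch, w // patch
    n_mask = int(gh * gw * ratio)
    flat = rng.permutation(gh * gw)[:n_mask]
    mask = np.zeros((gh, gw), dtype=bool)
    mask[np.unravel_index(flat, (gh, gw))] = True
    out = image.copy()
    for i, j in zip(*np.nonzero(mask)):
        out[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch] = 0.0
    return out, mask
```

Reconstruction-based MIM trains the model to predict the pixels under `mask`; contrastive variants instead compare representations of differently masked views of the same image.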

[LG-32] Case-based Explainability for Random Forest: Prototypes Critics Counter-factuals and Semi-factuals

链接: https://arxiv.org/abs/2408.06679
作者: Gregory Yampolsky,Dhruv Desai,Mingshu Li,Stefano Pasquali,Dhagash Mehta
关键词-EN: Explainable Artificial Intelligence, Artificial Intelligence, black-box machine learning, regulated industrial applications, industrial applications due
类目: Machine Learning (cs.LG); Statistical Finance (q-fin.ST); Machine Learning (stat.ML)
*备注: 8 pages, 2 figures, 5 tables

点击查看摘要

Abstract:The explainability of black-box machine learning algorithms, commonly known as Explainable Artificial Intelligence (XAI), has become crucial for financial and other regulated industrial applications due to regulatory requirements and the need for transparency in business practices. Among the various paradigms of XAI, Explainable Case-Based Reasoning (XCBR) stands out as a pragmatic approach that elucidates the output of a model by referencing actual examples from the data used to train or test the model. Despite its potential, XCBR has been relatively underexplored for many algorithms such as tree-based models until recently. We start by observing that most XCBR methods are defined based on the distance metric learned by the algorithm. By utilizing a recently proposed technique to extract the distance metric learned by Random Forests (RFs), which is both geometry- and accuracy-preserving, we investigate various XCBR methods. These methods amount to identify special points from the training datasets, such as prototypes, critics, counter-factuals, and semi-factuals, to explain the predictions for a given query of the RF. We evaluate these special points using various evaluation metrics to assess their explanatory power and effectiveness.
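Given the leaf assignments of a trained forest (e.g. the output of scikit-learn's `RandomForestClassifier.apply`), a proximity metric and a prototype query can be sketched as below. This uses the classical proximity definition, the fraction of trees in which two samples share a leaf; the paper relies on a refined geometry- and accuracy-preserving metric.

```python
import numpy as np

def proximity_matrix(leaves):
    """leaves: (n_samples, n_trees) leaf indices per tree.
    Proximity = fraction of trees in which two samples share a leaf."""
    n = len(leaves)
    P = np.zeros((n, n))
    for i in range(n):
        P[i] = (leaves == leaves[i]).mean(axis=1)
    return P

def prototypes(leaves, y):
    """For each class, the sample with the highest average proximity to
    the other members of its class: a case-based 'most typical' example."""
    P = proximity_matrix(leaves)
    protos = {}
    for c in np.unique(y):
        idx = np.nonzero(y == c)[0]
        sub = P[np.ix_(idx, idx)]
        avg = (sub.sum(axis=1) - 1.0) / max(len(idx) - 1, 1)  # drop self
        protos[int(c)] = int(idx[np.argmax(avg)])
    return protos
```

Critics, counter-factuals, and semi-factuals are found with the same distance: e.g. a counter-factual is the nearest training sample (under the proximity-induced distance) with a different predicted label.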

[LG-33] Leveraging Priors via Diffusion Bridge for Time Series Generation

链接: https://arxiv.org/abs/2408.06672
作者: Jinseong Park,Seungyun Lee,Woojin Jeong,Yujin Choi,Jaewook Lee
关键词-EN: hypothesis test techniques, Time series generation, Time series, series generation, series
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Time series generation is widely used in real-world applications such as simulation, data augmentation, and hypothesis test techniques. Recently, diffusion models have emerged as the de facto approach for time series generation, emphasizing diverse synthesis scenarios based on historical or correlated time series data streams. Since time series have unique characteristics, such as fixed time order and data scaling, standard Gaussian prior might be ill-suited for general time series generation. In this paper, we exploit the usage of diverse prior distributions for synthesis. Then, we propose TimeBridge, a framework that enables flexible synthesis by leveraging diffusion bridges to learn the transport between chosen prior and data distributions. Our model covers a wide range of scenarios in time series diffusion models, which leverages (i) data- and time-dependent priors for unconditional synthesis, and (ii) data-scale preserving synthesis with a constraint as a prior for conditional generation. Experimentally, our model achieves state-of-the-art performance in both unconditional and conditional time series generation tasks.
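The "bridge" ingredient can be illustrated with the simplest case: a Brownian bridge pinned between a prior draw and a data point, whose one-step transition is available in closed form. The paper's learned diffusion bridges generalize this; the sketch below only shows the pinned-endpoint mechanics.

```python
import numpy as np

def brownian_bridge(x0, x1, n_steps, sigma=1.0, seed=0):
    """Sample a discrete Brownian bridge from prior draw x0 (t=0) to data
    point x1 (t=1); marginally x_t ~ N((1-t) x0 + t x1, sigma^2 t (1-t))."""
    rng = np.random.default_rng(seed)
    ts = np.linspace(0.0, 1.0, n_steps + 1)
    x = np.empty(n_steps + 1)
    x[0] = x0
    for k in range(n_steps):
        t, t_next = ts[k], ts[k + 1]
        # One-step transition of the bridge conditioned on hitting x1 at t=1.
        mean = x[k] + (x1 - x[k]) * (t_next - t) / (1.0 - t)
        var = sigma**2 * (t_next - t) * (1.0 - t_next) / (1.0 - t)
        x[k + 1] = mean + np.sqrt(max(var, 0.0)) * rng.standard_normal()
    return x
```

The endpoint constraint is what lets a bridge-based generator transport an arbitrary, data-dependent prior onto the data distribution rather than starting from a standard Gaussian.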

[LG-34] RW-NSGCN: A Robust Approach to Structural Attacks via Negative Sampling

链接: https://arxiv.org/abs/2408.06665
作者: Shuqi He,Jun Zhuang,Ding Wang,Jun Song
关键词-EN: Graph Neural Networks, predicting user interests, Graph Neural, Graph Convolutional Network, Sampling Graph Convolutional
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Node classification using Graph Neural Networks (GNNs) has been widely applied in various practical scenarios, such as predicting user interests and detecting communities in social networks. However, recent studies have shown that graph-structured networks often contain potential noise and attacks, in the form of topological perturbations and weight disturbances, which can lead to decreased classification performance in GNNs. To improve the robustness of the model, we propose a novel method: Random Walk Negative Sampling Graph Convolutional Network (RW-NSGCN). Specifically, RW-NSGCN integrates the Random Walk with Restart (RWR) and PageRank (PGR) algorithms for negative sampling and employs a Determinantal Point Process (DPP)-based GCN for convolution operations. RWR leverages both global and local information to manage noise and local variations, while PGR assesses node importance to stabilize the topological structure. The DPP-based GCN ensures diversity among negative samples and aggregates their features to produce robust node embeddings, thereby improving classification performance. Experimental results demonstrate that the RW-NSGCN model effectively addresses network topology attacks and weight instability, increasing the accuracy of anomaly detection and overall stability. In terms of classification accuracy, RW-NSGCN significantly outperforms existing methods, showing greater resilience across various scenarios and effectively mitigating the impact of such vulnerabilities.
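The RWR component is standard and easy to sketch: iterate r ← c·e + (1−c)·Pᵀr until convergence, where e is the indicator of the seed node and P the row-normalized transition matrix. The graph and restart probability below are illustrative.

```python
import numpy as np

def random_walk_with_restart(adj, seed_node, restart=0.15, tol=1e-10):
    """RWR scores via power iteration: r = restart * e + (1-restart) * P^T r.
    High-scoring nodes are 'close' to the seed under the random walk, which
    is the signal RW-NSGCN uses to pick informative negative samples."""
    A = np.asarray(adj, dtype=float)
    P = A / A.sum(axis=1, keepdims=True)      # row-stochastic transitions
    r = np.full(len(A), 1.0 / len(A))
    e = np.zeros(len(A))
    e[seed_node] = 1.0
    while True:
        r_next = restart * e + (1.0 - restart) * P.T @ r
        if np.abs(r_next - r).sum() < tol:
            return r_next
        r = r_next
```

With restart probability c, the iteration is a contraction with factor (1−c), so convergence is geometric regardless of the graph.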

[LG-35] COD: Learning Conditional Invariant Representation for Domain Adaptation Regression ECCV2024

链接: https://arxiv.org/abs/2408.06638
作者: Hao-Ran Yang,Chuan-Xian Ren,You-Wei Luo
关键词-EN: unlabeled target domain, Domain Adaptation Regression, Aiming to generalize, complex practical learning, source domain
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to ECCV 2024 (oral)

点击查看摘要

Abstract:Aiming to generalize the label knowledge from a source domain with continuous outputs to an unlabeled target domain, Domain Adaptation Regression (DAR) is developed for complex practical learning problems. However, due to the continuity problem in regression, existing conditional distribution alignment theory and methods with discrete prior, which are proven to be effective in classification settings, are no longer applicable. In this work, focusing on the feasibility problems in DAR, we establish the sufficiency theory for the regression model, which shows the generalization error can be sufficiently dominated by the cross-domain conditional discrepancy. Further, to characterize conditional discrepancy with continuous conditioning variable, a novel Conditional Operator Discrepancy (COD) is proposed, which admits the metric property on conditional distributions via the kernel embedding theory. Finally, to minimize the discrepancy, a COD-based conditional invariant representation learning model is proposed, and the reformulation is derived to show that reasonable modifications on moment statistics can further improve the discriminability of the adaptation model. Extensive experiments on standard DAR datasets verify the validity of theoretical results and the superiority over SOTA DAR methods.

[LG-36] Towards Robust and Cost-Efficient Knowledge Unlearning for Large Language Models

链接: https://arxiv.org/abs/2408.06621
作者: Sungmin Cha,Sungjun Cho,Dasol Hwang,Moontae Lee
关键词-EN: demonstrated strong reasoning, Large Language Models, massive textual corpora, Large Language, demonstrated strong
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注: Preprint

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated strong reasoning and memorization capabilities via pretraining on massive textual corpora. However, training LLMs on human-written text entails significant risk of privacy and copyright violations, which demands an efficient machine unlearning framework to remove knowledge of sensitive data without retraining the model from scratch. While Gradient Ascent (GA) is widely used for unlearning by reducing the likelihood of generating unwanted information, the unboundedness of increasing the cross-entropy loss causes not only unstable optimization, but also catastrophic forgetting of knowledge that needs to be retained. We also discover that its joint application under low-rank adaptation results in significantly suboptimal computational cost vs. generative performance trade-offs. In light of this limitation, we propose two novel techniques for robust and cost-efficient unlearning on LLMs. We first design an Inverted Hinge loss that suppresses unwanted tokens by increasing the probability of the next most likely token, thereby retaining fluency and structure in language generation. We also propose to initialize low-rank adapter weights based on Fisher-weighted low-rank approximation, which induces faster unlearning and better knowledge retention by allowing model updates to be focused on parameters that are important in generating textual data we wish to remove.
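The Inverted Hinge idea can be made concrete with a small sketch. This is an illustrative formulation inferred from the abstract (penalize the unlearning target until the next most likely token overtakes it), not necessarily the paper's exact loss; the vocabulary and logits below are toy stand-ins.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def inverted_hinge_loss(logits, unwanted_idx):
    """Hinge-style penalty: push the most likely *other* token's probability
    above that of the token we want the model to unlearn."""
    probs = softmax(logits)
    p_unwanted = probs[unwanted_idx]
    p_runner_up = max(p for i, p in enumerate(probs) if i != unwanted_idx)
    return max(0.0, 1.0 - p_runner_up + p_unwanted)

# the unwanted token dominates -> large loss ...
high = inverted_hinge_loss([5.0, 1.0, 0.5], unwanted_idx=0)
# ... another token has taken over -> loss near zero
low = inverted_hinge_loss([1.0, 5.0, 0.5], unwanted_idx=0)
```

In a real setup this penalty would be computed on model logits and backpropagated; here we only evaluate it to show the intended behavior.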

[LG-37] Unveiling the Flaws: A Critical Analysis of Initialization Effect on Time Series Anomaly Detection

链接: https://arxiv.org/abs/2408.06620
作者: Alex Koran,Hadi Hojjati,Narges Armanfard
关键词-EN: Deep learning, gained significant attention, past decade, learning for time-series, Deep
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep learning for time-series anomaly detection (TSAD) has gained significant attention over the past decade. Despite the reported improvements in several papers, the practical application of these models remains limited. Recent studies have cast doubt on these models, attributing their results to flawed evaluation techniques. However, the impact of initialization has largely been overlooked. This paper provides a critical analysis of the initialization effects on TSAD model performance. Our extensive experiments reveal that TSAD models are highly sensitive to hyperparameters such as window size, seed number, and normalization. This sensitivity often leads to significant variability in performance, which can be exploited to artificially inflate the reported efficacy of these models. We demonstrate that even minor changes in initialization parameters can result in performance variations that overshadow the claimed improvements from novel model architectures. Our findings highlight the need for rigorous evaluation protocols and transparent reporting of preprocessing steps to ensure the reliability and fairness of anomaly detection methods. This paper calls for a more cautious interpretation of TSAD advancements and encourages the development of more robust and transparent evaluation practices to advance the field and its practical applications.
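To make the window-size and normalization hyperparameters discussed above concrete, here is a minimal trailing-window z-score detector (our own illustration, not a model from the paper):

```python
import math
import statistics

def zscore_anomalies(series, window, threshold=3.0):
    """Flag points deviating from the trailing-window mean by more than
    `threshold` trailing standard deviations."""
    flags = []
    for i in range(window, len(series)):
        hist = series[i - window:i]
        mu = statistics.fmean(hist)
        sd = statistics.pstdev(hist) or 1e-9  # guard against constant windows
        flags.append(abs(series[i] - mu) / sd > threshold)
    return flags

series = [math.sin(i / 3.0) for i in range(60)]
series[40] += 5.0  # inject a single anomaly
flags_w10 = zscore_anomalies(series, window=10)
flags_w30 = zscore_anomalies(series, window=30)
```

Even in this tiny baseline, the window size changes which points are scored at all and how each score is normalized, which is exactly the kind of initialization sensitivity the paper measures at scale.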

[LG-38] Generalized knowledge-enhanced framework for biomedical entity and relation extraction

链接: https://arxiv.org/abs/2408.06618
作者: Minh Nguyen,Phuong Le
关键词-EN: recent years, increasing number, relation extraction, entity and relation, biomedical entity
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In recent years, there has been an increasing number of frameworks developed for biomedical entity and relation extraction. This research effort aims to address the accelerating growth in biomedical publications and the intricate nature of biomedical texts, which are written mainly for domain experts. To handle these challenges, we develop a novel framework that utilizes external knowledge to construct a task-independent and reusable background knowledge graph for biomedical entity and relation extraction. The design of our model is inspired by how humans learn domain-specific topics. In particular, humans often first acquire the most basic and common knowledge regarding a field to build the foundational knowledge and then use that as a basis for extending to various specialized topics. Our framework employs such a common-knowledge-sharing mechanism to build a general neural-network knowledge graph whose learning transfers effectively to different domain-specific biomedical texts. Experimental evaluations demonstrate that our model, equipped with this generalized and cross-transferable knowledge base, achieves competitive performance benchmarks, including BioRelEx for binding interaction detection and ADE for Adverse Drug Effect identification.

[LG-39] CROME: Cross-Modal Adapters for Efficient Multimodal LLM

链接: https://arxiv.org/abs/2408.06610
作者: Sayna Ebrahimi,Sercan O. Arik,Tejas Nama,Tomas Pfister
关键词-EN: Large Language Models, remarkable image-language capabilities, Multimodal Large Language, Large Language, demonstrate remarkable image-language
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) demonstrate remarkable image-language capabilities, but their widespread use faces challenges in cost-effective training and adaptation. Existing approaches often necessitate expensive language model retraining and limited adaptability. Additionally, the current focus on zero-shot performance improvements offers insufficient guidance for task-specific tuning. We propose CROME, an efficient vision-language instruction tuning framework. It features a novel gated cross-modal adapter that effectively combines visual and textual representations prior to input into a frozen LLM. This lightweight adapter, trained with minimal parameters, enables efficient cross-modal understanding. Notably, CROME demonstrates superior zero-shot performance on standard visual question answering and instruction-following benchmarks. Moreover, it yields fine-tuning with exceptional parameter efficiency, competing with task-specific specialist state-of-the-art methods. CROME demonstrates the potential of pre-LM alignment for building scalable, adaptable, and parameter-efficient multimodal models.
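The gated cross-modal adapter can be sketched as an elementwise gate over aligned visual and textual features. The gating form and dimensions below are assumptions for illustration, since the paper's exact architecture is not given in the abstract:

```python
import math
import random

random.seed(0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gated_fusion(visual, textual, w_gate, b_gate):
    """Per-dimension gate computed from both modalities decides how much of
    each to pass through before a frozen LLM would see the result."""
    fused = []
    for i, (v, t) in enumerate(zip(visual, textual)):
        g = sigmoid(w_gate[i][0] * v + w_gate[i][1] * t + b_gate[i])
        fused.append(g * v + (1.0 - g) * t)
    return fused

visual = [0.5, -1.0, 2.0]
textual = [1.5, 0.0, -0.5]
w_gate = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(3)]
b_gate = [0.0, 0.0, 0.0]
fused = gated_fusion(visual, textual, w_gate, b_gate)
```

Because the gate lies strictly in (0, 1), each fused dimension is a convex combination of the two modalities, so training only the small gate parameters suffices while the backbone LLM stays frozen.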

[LG-40] Prioritizing Modalities: Flexible Importance Scheduling in Federated Multimodal Learning

链接: https://arxiv.org/abs/2408.06549
作者: Jieming Bian,Lei Wang,Jie Xu
关键词-EN: ensuring user privacy, distributed machine learning, Federated Learning, collaboratively train models, ensuring user
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: Submitted to IEEE TMC, under review

点击查看摘要

Abstract:Federated Learning (FL) is a distributed machine learning approach that enables devices to collaboratively train models without sharing their local data, ensuring user privacy and scalability. However, applying FL to real-world data presents challenges, particularly as most existing FL research focuses on unimodal data. Multimodal Federated Learning (MFL) has emerged to address these challenges, leveraging modality-specific encoder models to process diverse datasets. Current MFL methods often uniformly allocate computational frequencies across all modalities, which is inefficient for IoT devices with limited resources. In this paper, we propose FlexMod, a novel approach to enhance computational efficiency in MFL by adaptively allocating training resources for each modality encoder based on their importance and training requirements. We employ prototype learning to assess the quality of modality encoders, use Shapley values to quantify the importance of each modality, and adopt the Deep Deterministic Policy Gradient (DDPG) method from deep reinforcement learning to optimize the allocation of training resources. Our method prioritizes critical modalities, optimizing model performance and resource utilization. Experimental results on three real-world datasets demonstrate that our proposed method significantly improves the performance of MFL models.

[LG-41] Value of Information and Reward Specification in Active Inference and POMDPs

链接: https://arxiv.org/abs/2408.06542
作者: Ran Wei
关键词-EN: recently gained popularity, gained popularity due, Expected free energy, Expected free, epistemic component
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Expected free energy (EFE) is a central quantity in active inference which has recently gained popularity due to its intuitive decomposition of the expected value of control into a pragmatic and an epistemic component. While numerous conjectures have been made to justify EFE as a decision making objective function, the most widely accepted is still its intuitiveness and resemblance to variational free energy in approximate Bayesian inference. In this work, we take a bottom-up approach and ask: taking EFE as given, what’s the resulting agent’s optimality gap compared with a reward-driven reinforcement learning (RL) agent, which is well understood? By casting EFE under a particular class of belief MDP and using analysis tools from RL theory, we show that EFE approximates the Bayes optimal RL policy via information value. We discuss the implications for objective specification of active inference agents.
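The pragmatic/epistemic decomposition mentioned above can be computed directly in a discrete-state setting. This is the textbook active-inference form, shown only to fix notation; it is not the paper's derivation:

```python
import math

def expected_free_energy(q_s, p_o_given_s, log_pref):
    """Discrete-state EFE = -(epistemic value) - (pragmatic value)."""
    n_s, n_o = len(q_s), len(p_o_given_s[0])
    # predictive distribution over observations, q(o)
    q_o = [sum(q_s[s] * p_o_given_s[s][o] for s in range(n_s)) for o in range(n_o)]
    epistemic = 0.0
    for o in range(n_o):
        if q_o[o] == 0.0:
            continue
        # posterior over states given o (Bayes rule), then KL to the prior
        post = [q_s[s] * p_o_given_s[s][o] / q_o[o] for s in range(n_s)]
        kl = sum(p * math.log(p / q_s[s]) for s, p in enumerate(post) if p > 0.0)
        epistemic += q_o[o] * kl
    pragmatic = sum(q_o[o] * log_pref[o] for o in range(n_o))
    return -epistemic - pragmatic, epistemic, pragmatic

q_s = [0.5, 0.5]                    # belief over two hidden states
uniform_pref = [math.log(0.5)] * 2  # no preferred observation
# an uninformative likelihood yields zero epistemic value ...
efe_flat, ep_flat, _ = expected_free_energy(q_s, [[0.5, 0.5], [0.5, 0.5]], uniform_pref)
# ... an informative one is preferred (lower EFE) purely for its information gain
efe_info, ep_info, _ = expected_free_energy(q_s, [[0.9, 0.1], [0.1, 0.9]], uniform_pref)
```

With preferences held uniform, the EFE difference between the two likelihoods is pure information value, which is exactly the quantity the paper relates to the optimality gap of an EFE-driven agent.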

[LG-42] A Comparison of Imitation Learning Algorithms for Bimanual Manipulation

链接: https://arxiv.org/abs/2408.06536
作者: Michael Drolet,Simon Stepputtis,Siva Kailas,Ajinkya Jain,Jan Peters,Stefan Schaal,Heni Ben Amor
关键词-EN: Amidst the wide, high-precision industry-inspired environments, data efficiency, wide popularity, well-studied in high-precision
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Amidst the wide popularity of imitation learning algorithms in robotics, their properties regarding hyperparameter sensitivity, ease of training, data efficiency, and performance have not been well-studied in high-precision industry-inspired environments. In this work, we demonstrate the limitations and benefits of prominent imitation learning approaches and analyze their capabilities regarding these properties. We evaluate each algorithm on a complex bimanual manipulation task involving an over-constrained dynamics system in a setting involving multiple contacts between the manipulated object and the environment. While we find that imitation learning is well suited to solve such complex tasks, not all algorithms are equal in terms of handling environmental and hyperparameter perturbations, training requirements, performance, and ease of use. We investigate the empirical influence of these key characteristics by employing a carefully designed experimental procedure and learning environment. Paper website: this https URL

[LG-43] Operator Learning Using Random Features: A Tool for Scientific Computing

链接: https://arxiv.org/abs/2408.06526
作者: Nicholas H. Nelsen,Andrew M. Stuart
关键词-EN: Supervised operator learning, operator learning centers, operator learning, input-output pairs, function-valued random features
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
*备注: 36 pages, 1 table, 9 figures. SIGEST version of SIAM J. Sci. Comput. Vol. 43 No. 5 (2021) pp. A3212-A3243, hence text overlap with arXiv:2005.10224

点击查看摘要

Abstract:Supervised operator learning centers on the use of training data, in the form of input-output pairs, to estimate maps between infinite-dimensional spaces. It is emerging as a powerful tool to complement traditional scientific computing, which may often be framed in terms of operators mapping between spaces of functions. Building on the classical random features methodology for scalar regression, this paper introduces the function-valued random features method. This leads to a supervised operator learning architecture that is practical for nonlinear problems yet is structured enough to facilitate efficient training through the optimization of a convex, quadratic cost. Due to the quadratic structure, the trained model is equipped with convergence guarantees and error and complexity bounds, properties that are not readily available for most other operator learning architectures. At its core, the proposed approach builds a linear combination of random operators. This turns out to be a low-rank approximation of an operator-valued kernel ridge regression algorithm, and hence the method also has strong connections to Gaussian process regression. The paper designs function-valued random features that are tailored to the structure of two nonlinear operator learning benchmark problems arising from parametric partial differential equations. Numerical results demonstrate the scalability, discretization invariance, and transferability of the function-valued random features method.
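The classical scalar-output random features recipe that the paper generalizes to operators can be sketched end to end: draw random Fourier features, then minimize the convex, quadratic ridge objective (here with plain gradient descent; the target function and all hyperparameters are illustrative choices of ours):

```python
import math
import random

random.seed(1)

M = 50  # number of random features
w = [random.gauss(0.0, 2.0) for _ in range(M)]
b = [random.uniform(0.0, 2.0 * math.pi) for _ in range(M)]

def features(x):
    # classical random Fourier features for scalar inputs
    return [math.cos(w[j] * x + b[j]) for j in range(M)]

xs = [i / 20.0 for i in range(-40, 41)]
ys = [math.sin(2.0 * x) for x in xs]
Phi = [features(x) for x in xs]

alpha = [0.0] * M  # linear coefficients: f(x) = sum_j alpha_j phi_j(x)
lam, lr, n = 1e-4, 0.01, len(xs)

def mse():
    return sum((sum(a * f for a, f in zip(alpha, phi)) - y) ** 2
               for phi, y in zip(Phi, ys)) / n

before = mse()
for _ in range(500):
    # gradient of the convex ridge objective (1/n)||Phi a - y||^2 + lam ||a||^2
    grad = [2.0 * lam * a for a in alpha]
    for phi, y in zip(Phi, ys):
        err = sum(a * f for a, f in zip(alpha, phi)) - y
        for j in range(M):
            grad[j] += 2.0 * err * phi[j] / n
    alpha = [a - lr * g for a, g in zip(alpha, grad)]
after = mse()
```

The paper's function-valued variant replaces the scalar features with random operators, but the overall shape is the same: a linear model in random features trained through a convex quadratic cost, which is what yields its convergence and error guarantees.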

[LG-44] Learned Ranking Function: From Short-term Behavior Predictions to Long-term User Satisfaction RECSYS24

链接: https://arxiv.org/abs/2408.06512
作者: Yi Wu,Daryl Chang,Jennifer She,Zhe Zhao,Li Wei,Lukasz Heldt
关键词-EN: Learned Ranking Function, Learned Ranking, short-term user-item behavior, user-item behavior predictions, Ranking Function
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注: RecSys 24

点击查看摘要

Abstract:We present the Learned Ranking Function (LRF), a system that takes short-term user-item behavior predictions as input and outputs a slate of recommendations that directly optimizes for long-term user satisfaction. Most previous work is based on optimizing the hyperparameters of a heuristic function. We propose to model the problem directly as a slate optimization problem with the objective of maximizing long-term user satisfaction. We also develop a novel constraint optimization algorithm that stabilizes objective trade-offs for multi-objective optimization. We evaluate our approach with live experiments and describe its deployment on YouTube.

[LG-45] Fooling SHAP with Output Shuffling Attacks

链接: https://arxiv.org/abs/2408.06509
作者: Jun Yuan,Aritra Dasgupta
关键词-EN: discover feature attributions, SHAP, Explainable, discover feature, XAI methods
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Explainable AI (XAI) methods such as SHAP can help discover feature attributions in black-box models. If the method reveals a significant attribution from a “protected feature” (e.g., gender, race) on the model output, the model is considered unfair. However, adversarial attacks can subvert the detection of XAI methods. Previous approaches to constructing such an adversarial model require access to underlying data distribution, which may not be possible in many practical scenarios. We relax this constraint and propose a novel family of attacks, called shuffling attacks, that are data-agnostic. The proposed attack strategies can adapt any trained machine learning model to fool Shapley value-based explanations. We prove that Shapley values cannot detect shuffling attacks. However, algorithms that estimate Shapley values, such as linear SHAP and SHAP, can detect these attacks with varying degrees of effectiveness. We demonstrate the efficacy of the attack strategies by comparing the performance of linear SHAP and SHAP using real-world datasets.

[LG-46] Prompt Recovery for Image Generation Models: A Comparative Study of Discrete Optimizers

链接: https://arxiv.org/abs/2408.06502
作者: Joshua Nathaniel Williams,Avi Schwarzschild,J. Zico Kolter
关键词-EN: Recovering natural language, natural language prompts, image generation models, Recovering natural, difficult discrete optimization
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 9 Pages, 4 Figures

点击查看摘要

Abstract:Recovering natural language prompts for image generation models, solely based on the generated images is a difficult discrete optimization problem. In this work, we present the first head-to-head comparison of recent discrete optimization techniques for the problem of prompt inversion. We evaluate Greedy Coordinate Gradients (GCG), PEZ, Random Search, AutoDAN and BLIP2’s image captioner across various evaluation metrics related to the quality of inverted prompts and the quality of the images generated by the inverted prompts. We find that focusing on the CLIP similarity between the inverted prompts and the ground truth image acts as a poor proxy for the similarity between ground truth image and the image generated by the inverted prompts. While the discrete optimizers effectively minimize their objectives, simply using responses from a well-trained captioner often leads to generated images that more closely resemble those produced by the original prompts.

[LG-47] Music2Latent: Consistency Autoencoders for Latent Audio Compression

链接: https://arxiv.org/abs/2408.06500
作者: Marco Pasini,Stefan Lattner,George Fazekas
关键词-EN: Music Information Retrieval, Information Retrieval, Music Information, modeling and Music, Efficient audio representations
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: Accepted to ISMIR 2024

点击查看摘要

Abstract:Efficient audio representations in a compressed continuous latent space are critical for generative audio modeling and Music Information Retrieval (MIR) tasks. However, some existing audio autoencoders have limitations, such as multi-stage training procedures, slow iterative sampling, or low reconstruction quality. We introduce Music2Latent, an audio autoencoder that overcomes these limitations by leveraging consistency models. Music2Latent encodes samples into a compressed continuous latent space in a single end-to-end training process while enabling high-fidelity single-step reconstruction. Key innovations include conditioning the consistency model on upsampled encoder outputs at all levels through cross connections, using frequency-wise self-attention to capture long-range frequency dependencies, and employing frequency-wise learned scaling to handle varying value distributions across frequencies at different noise levels. We demonstrate that Music2Latent outperforms existing continuous audio autoencoders in sound quality and reconstruction accuracy while achieving competitive performance on downstream MIR tasks using its latent representations. To our knowledge, this represents the first successful attempt at training an end-to-end consistency autoencoder model.

[LG-48] Implicit Neural Representation For Accurate CFD Flow Field Prediction

链接: https://arxiv.org/abs/2408.06486
作者: Laurent de Vito,Nils Pinnau,Simone Dey
关键词-EN: real industrial applications, industrial applications remain, deep learning frameworks, flow field prediction, deep learning
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
*备注: ECCOMAS CONGRESS 2024, 9th European Congress on Computational Methods in Applied Sciences and Engineering

点击查看摘要

Abstract:Despite the plethora of deep learning frameworks for flow field prediction, most of them deal with flow fields on regular domains, and although the best ones can cope with irregular domains, they mostly rely on graph networks, so that real industrial applications remain currently elusive. We present a deep learning framework for 3D flow field prediction applied to blades of aircraft engine turbines and compressors. Crucially, we view any 3D field as a function from coordinates that is modeled by a neural network we call the backbone-net. It inherits the property of coordinate-based MLPs, namely the discretization-agnostic representation of flow fields in domains of arbitrary topology at infinite resolution. First, we demonstrate the performance of the backbone-net solo in regressing 3D steady simulations of single blade rows in various flow regimes: it can accurately render important flow characteristics such as boundary layers, wakes and shock waves. Second, we introduce a hyper-net that maps the surface mesh of a blade to the parameters of the backbone-net. By doing so, the flow solution can be directly predicted from the blade geometry, irrespective of its parameterization. Together, backbone-net and hyper-net form a highly-accurate memory-efficient data-driven proxy to CFD solvers with good generalization on unseen geometries.

[LG-49] Kernel Sum of Squares for Data Adapted Kernel Learning of Dynamical Systems from Data: A global optimization approach

链接: https://arxiv.org/abs/2408.06465
作者: Daniel Lengyel,Panos Parpas,Boumediene Hamzi,Houman Owhadi
关键词-EN: Sum of Squares, dynamical systems, Kernel Sum, paper examines, examines the application
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:This paper examines the application of the Kernel Sum of Squares (KSOS) method for enhancing kernel learning from data, particularly in the context of dynamical systems. Traditional kernel-based methods, despite their theoretical soundness and numerical efficiency, frequently struggle with selecting optimal base kernels and parameter tuning, especially with gradient-based methods prone to local optima. KSOS mitigates these issues by leveraging a global optimization framework with kernel-based surrogate functions, thereby achieving more reliable and precise learning of dynamical systems. Through comprehensive numerical experiments on the Logistic Map, Henon Map, and Lorenz System, KSOS is shown to consistently outperform gradient descent in minimizing the relative-\rho metric and improving kernel accuracy. These results highlight KSOS’s effectiveness in predicting the behavior of chaotic dynamical systems, demonstrating its capability to adapt kernels to underlying dynamics and enhance the robustness and predictive power of kernel-based approaches, making it a valuable asset for time series analysis in various scientific fields.
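KSOS itself builds sum-of-squares surrogates for global kernel optimization. As a much simpler stand-in, the snippet below illustrates why kernel-parameter choice matters when learning logistic-map dynamics, using a crude global grid search over Gaussian bandwidths for a Nadaraya-Watson one-step predictor (all modeling choices here are ours, not the paper's):

```python
import math

def logistic_map(x, r=3.9):
    return r * x * (1.0 - x)

# one-step dynamics data: pairs (x_t, x_{t+1}) from a chaotic orbit
orbit, x = [], 0.2
for _ in range(121):
    orbit.append(x)
    x = logistic_map(x)
train = list(zip(orbit[:80], orbit[1:81]))
val = list(zip(orbit[80:120], orbit[81:121]))

def nw_predict(x, data, h):
    # Nadaraya-Watson estimate with a Gaussian kernel of bandwidth h
    ws = [math.exp(-((x - xi) ** 2) / (2.0 * h * h)) for xi, _ in data]
    s = sum(ws)
    return sum(wi * yi for wi, (_, yi) in zip(ws, data)) / s

def val_error(h):
    return sum((nw_predict(xi, train, h) - yi) ** 2 for xi, yi in val) / len(val)

# crude global search over bandwidths, standing in for KSOS's global optimizer
errs = {h: val_error(h) for h in (0.02, 0.1, 0.5)}
best_h = min(errs, key=errs.get)
```

A badly oversmoothed kernel (h = 0.5) washes out the map's curvature, while a well-chosen bandwidth tracks the dynamics closely; KSOS's contribution is finding such kernel parameters globally rather than by local gradient descent.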

[LG-50] Wireless Channel Aware Data Augmentation Methods for Deep Leaning-Based Indoor Localization

链接: https://arxiv.org/abs/2408.06452
作者: Omer Gokalp Serbetci,Daoud Burghal,Andreas F. Molisch
关键词-EN: unlike outdoor localization, Indoor localization, unlike outdoor, lacks a universal, robust solution
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注: 13 pages, 14 figures

点击查看摘要

Abstract:Indoor localization is a challenging problem that - unlike outdoor localization - lacks a universal and robust solution. Machine Learning (ML), particularly Deep Learning (DL), methods have been investigated as a promising approach. Although such methods bring remarkable localization accuracy, they heavily depend on the training data collected from the environment. The data collection is usually a laborious and time-consuming task, but Data Augmentation (DA) can be used to alleviate this issue. In this paper, different from previously used DA, we propose methods that utilize the domain knowledge about wireless propagation channels and devices. The methods exploit the typical hardware component drift in the transceivers and/or the statistical behavior of the channel, in combination with the measured Power Delay Profile (PDP). We comprehensively evaluate the proposed methods to demonstrate their effectiveness. This investigation mainly focuses on how factors such as the number of measurements, the augmentation proportion, and the environment of interest impact the effectiveness of the different DA methods. We show that in the low-data regime (few actual measurements available), localization accuracy increases up to 50%, matching non-augmented results in the high-data regime. In addition, the proposed methods may outperform the measurement-only high-data performance by up to 33% using only 1/4 of the amount of measured data. We also exhibit the effect of different training data distribution and quality on the effectiveness of DA. Finally, we demonstrate the power of the proposed methods when employed along with Transfer Learning (TL) to address the data scarcity in target and/or source environments.
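A channel-aware augmentation in the spirit described above (a random transceiver gain drift plus per-tap measurement noise applied to a measured PDP) might look like the following; the drift model and magnitudes are illustrative assumptions, not the paper's exact methods:

```python
import random

random.seed(7)

def augment_pdp(pdp, gain_drift_db=1.0, noise_std=0.01):
    """Apply one random transceiver gain drift (in dB) to the whole profile,
    then add small per-tap measurement noise; power taps stay non-negative."""
    drift = 10.0 ** (random.gauss(0.0, gain_drift_db) / 10.0)
    return [max(0.0, p * drift + random.gauss(0.0, noise_std)) for p in pdp]

measured = [1.0, 0.6, 0.25, 0.1, 0.02]  # a toy power delay profile
augmented = [augment_pdp(measured) for _ in range(4)]
```

Each synthetic profile mimics a plausible re-measurement of the same location, which is how a handful of real measurements can be expanded into a training set.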

[LG-51] Evaluating Language Models for Efficient Code Generation

链接: https://arxiv.org/abs/2408.06450
作者: Jiawei Liu,Songrun Xie,Junhao Wang,Yuxiang Wei,Yifeng Ding,Lingming Zhang
关键词-EN: Large Language Models, evaluate Large Language, introduce Differential Performance, Large Language, Differential Performance Evaluation
类目: Software Engineering (cs.SE); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce Differential Performance Evaluation (DPE), a framework designed to reliably evaluate Large Language Models (LLMs) for efficient code generation. Traditional coding benchmarks often fail to provide reliable insights into code efficiency, due to their reliance on simplistic test inputs and the absence of effective compound metrics. DPE addresses these issues by focusing on efficiency-demanding programming tasks and establishing an insightful compound metric for performance evaluation. DPE operates in two phases: To curate efficiency datasets, it selects efficiency-demanding tasks from existing coding benchmarks and generates computationally expensive inputs to stress the efficiency of LLM solutions. To assess the code efficiency, DPE profiles the new solution and compares it globally against a set of reference solutions that exhibit distinct efficiency levels, where the matched level defines its efficiency score. As a proof of concept, we use DPE to create EvalPerf, a benchmark with 121 performance-challenging coding tasks. Our comprehensive evaluation draws interesting findings on the efficiency impact of model sizes, instruction tuning, and prompting. For example, while the scaling law fails to account for code efficiency, general instruction tuning benefits both code correctness and efficiency. We also evaluate the evaluation by examining the effectiveness of DPE, showing that EvalPerf is reliable and convenient to use even across platforms.
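The compound-metric idea (match a solution's measured performance against reference solutions profiled at distinct efficiency levels) can be sketched as follows; this scoring rule is a simplified reading of the abstract, not EvalPerf's exact metric:

```python
def efficiency_score(solution_time, reference_times):
    """Score = the highest efficiency level (1..k) whose reference runtime the
    new solution still beats or matches; 0 if it is slower than every reference."""
    levels = sorted(reference_times, reverse=True)  # slowest reference first
    score = 0
    for level, t in enumerate(levels, start=1):
        if solution_time <= t:
            score = level
    return score

refs = [10.0, 5.0, 1.0]  # profiled runtimes of reference solutions (seconds)
```

For example, a 6-second solution only beats the naive reference (score 1), while a 0.5-second solution matches the most efficient level (score 3); grading against a ladder of references is what makes the metric robust to absolute runtime differences between machines.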

[LG-52] Multi-View Neural Differential Equations for Continuous-Time Stream Data in Long-Term Traffic Forecasting

链接: https://arxiv.org/abs/2408.06445
作者: Zibo Liu,Zhe Jiang,Shigang Chen
关键词-EN: Neural Differential Equations, Differential Equations, Neural Differential, flow forecasting plays, decisions in advance
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Long-term traffic flow forecasting plays a crucial role in intelligent transportation as it allows traffic managers to adjust their decisions in advance. However, the problem is challenging due to spatio-temporal correlations and complex dynamic patterns in continuous-time stream data. Neural Differential Equations (NDEs) are among the state-of-the-art methods for learning continuous-time traffic dynamics. However, the traditional NDE models face issues in long-term traffic forecasting due to failures in capturing delayed traffic patterns, dynamic edge (location-to-location correlation) patterns, and abrupt trend patterns. To fill this gap, we propose a new NDE architecture called Multi-View Neural Differential Equations. Our model captures current states, delayed states, and trends in different state variables (views) by learning latent multiple representations within Neural Differential Equations. Extensive experiments conducted on several real-world traffic datasets demonstrate that our proposed method outperforms the state-of-the-art and achieves superior prediction accuracy for long-term forecasting and robustness with noisy or missing inputs.
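The delayed-state dynamics the model targets can be illustrated with a forward-Euler rollout of a toy delay differential equation (Hutchinson's delayed logistic equation, chosen by us as a stand-in for traffic-like dynamics; the real model learns such dynamics with neural networks):

```python
def integrate_delayed(f, x0, history, dt, steps, delay_steps):
    """Forward-Euler rollout of dx/dt = f(x(t), x(t - tau)): dynamics that
    depend on a delayed state as well as the current one."""
    traj = list(history) + [x0]
    for _ in range(steps):
        x = traj[-1]
        x_delayed = traj[-1 - delay_steps]
        traj.append(x + dt * f(x, x_delayed))
    return traj

# growth throttled by the state half a time-unit ago (delay = 5 steps * dt)
f = lambda x, x_del: x * (1.0 - x_del)
traj = integrate_delayed(f, 0.1, [0.1] * 5, dt=0.1, steps=100, delay_steps=5)
```

A model that only sees the current state cannot reproduce the damped oscillations this delay induces, which is the failure mode of traditional NDEs that the multi-view architecture addresses.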

[LG-53] Distributed Stackelberg Strategies in State-based Potential Games for Autonomous Decentralized Learning Manufacturing Systems

链接: https://arxiv.org/abs/2408.06397
作者: Steve Yuwono,Dorothea Schwung,Andreas Schwung
关键词-EN: Distributed Stackelberg Strategies, Stackelberg Strategies, autonomously optimizing decentralized, optimizing decentralized manufacturing, decentralized manufacturing systems
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注: This pre-print was submitted to IEEE Transactions on Systems, Man, and Cybernetics: Systems on July 31, 2024

点击查看摘要

Abstract:This article describes a novel game structure for autonomously optimizing decentralized manufacturing systems with multi-objective optimization challenges, namely Distributed Stackelberg Strategies in State-Based Potential Games (DS2-SbPG). DS2-SbPG integrates potential games and Stackelberg games, which improves the cooperative trade-off capabilities of potential games and the multi-objective optimization handling by Stackelberg games. Notably, all training procedures remain conducted in a fully distributed manner. DS2-SbPG offers a promising solution to finding optimal trade-offs between objectives by eliminating the complexities of setting up combined objective optimization functions for individual players in self-learning domains, particularly in real-world industrial settings with diverse and numerous objectives between the sub-systems. We further prove that DS2-SbPG constitutes a dynamic potential game that results in corresponding convergence guarantees. Experimental validation conducted on a laboratory-scale testbed highlights the efficacy of DS2-SbPG and its two variants, DS2-SbPG for single-leader-follower and Stack DS2-SbPG for multi-leader-follower settings. The results show significant reductions in power consumption and improvements in overall performance, which signals the potential of DS2-SbPG in real-world applications.
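The convergence guarantee rests on the defining property of potential games: a unilateral best response can only increase the shared potential, so round-robin best responses settle in an equilibrium. A toy two-player example (the potential function is ours, purely for illustration):

```python
def best_response_dynamics(potential, actions, n_players, rounds=20):
    """Round-robin unilateral best responses; each response weakly increases
    the shared potential, so the profile settles in a (local) equilibrium."""
    profile = [actions[0]] * n_players
    for _ in range(rounds):
        for p in range(n_players):
            profile[p] = max(actions,
                             key=lambda a: potential(profile[:p] + [a] + profile[p + 1:]))
    return profile

acts = [0, 1, 2, 3]
# a toy exact-potential objective shared by two "machines"
phi = lambda a: -(a[0] - 2) ** 2 - (a[1] - 1) ** 2 - (a[0] - a[1]) ** 2
eq = best_response_dynamics(phi, acts, 2)
```

At the returned profile no player can strictly improve the potential by deviating alone; the Stackelberg layer in DS2-SbPG additionally structures *who responds to whom*, which this flat sketch does not capture.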

[LG-54] Fast John Ellipsoid Computation with Differential Privacy Optimization

链接: https://arxiv.org/abs/2408.06395
作者: Jiuxiang Gu,Xiaoyu Li,Yingyu Liang,Zhenmei Shi,Zhao Song,Junwei Yu
关键词-EN: John ellipsoid computation, John ellipsoid, fast John ellipsoid, largest volume ellipsoid, volume ellipsoid contained
类目: Data Structures and Algorithms (cs.DS); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Determining the John ellipsoid - the largest volume ellipsoid contained within a convex polytope - is a fundamental problem with applications in machine learning, optimization, and data analytics. Recent work has developed fast algorithms for approximating the John ellipsoid using sketching and leverage score sampling techniques. However, these algorithms do not provide privacy guarantees for sensitive input data. In this paper, we present the first differentially private algorithm for fast John ellipsoid computation. Our method integrates noise perturbation with sketching and leverage score sampling to achieve both efficiency and privacy. We prove that (1) our algorithm provides (\epsilon, \delta)-differential privacy, and the privacy guarantee holds for neighboring datasets that are \epsilon_0-close, allowing flexibility in the privacy definition; (2) our algorithm still converges to a (1+\xi)-approximation of the optimal John ellipsoid in O(\xi^{-2}(\log(n/\delta_0) + (L\epsilon_0)^{-2})) iterations, where n is the number of data points, L is the Lipschitz constant, \delta_0 is the failure probability, and \epsilon_0 is the closeness of neighboring input datasets. Our theoretical analysis demonstrates the algorithm’s convergence and privacy properties, providing a robust approach for balancing utility and privacy in John ellipsoid computation. This is the first differentially private algorithm for fast John ellipsoid computation, opening avenues for future research in privacy-preserving optimization techniques.
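The non-private starting point here is the classical fixed-point iteration on row weights, w_i ← w_i · a_iᵀ(AᵀWA)⁻¹a_i, for the John ellipsoid of a symmetric polytope {x : |a_iᵀx| ≤ 1}; the paper's contribution is layering calibrated noise and sketching on top of such an iteration. A 2-D sketch of the non-private iteration (our own simplification for illustration):

```python
def john_ellipsoid_weights(rows, iters=100):
    """Fixed-point iteration w_i <- w_i * a_i^T (A^T W A)^{-1} a_i for the
    John ellipsoid of {x : |a_i^T x| <= 1}, hard-coded for the 2-D case.
    After each step the weights are leverage scores, so they sum to d = 2."""
    w = [1.0] * len(rows)
    for _ in range(iters):
        # M = A^T W A  (2x2)
        m00 = sum(wi * a[0] * a[0] for wi, a in zip(w, rows))
        m01 = sum(wi * a[0] * a[1] for wi, a in zip(w, rows))
        m11 = sum(wi * a[1] * a[1] for wi, a in zip(w, rows))
        det = m00 * m11 - m01 * m01
        inv = ((m11 / det, -m01 / det), (-m01 / det, m00 / det))
        w = [wi * (a[0] * (inv[0][0] * a[0] + inv[0][1] * a[1])
                   + a[1] * (inv[1][0] * a[0] + inv[1][1] * a[1]))
             for wi, a in zip(w, rows)]
    return w

rows = [(1.0, 0.0), (0.0, 1.0), (0.7, 0.7)]
w = john_ellipsoid_weights(rows)
```

The sum-to-dimension invariant makes the iteration easy to check; a differentially private variant along the paper's lines would perturb the per-iteration leverage-score computation with noise while preserving this convergence behavior.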

[LG-55] Approximate ADCs for In-Memory Computing

链接: https://arxiv.org/abs/2408.06390
作者: Arkapravo Ghosh,Hemkar Reddy Sadana,Mukut Debnath,Panthadip Maji,Shubham Negi,Sumeet Gupta,Mrigank Sharad,Kaushik Roy
关键词-EN: accelerators leverage energy-efficient, matrix vector multiplication, highly parallel matrix, parallel matrix vector, deep learning
类目: Emerging Technologies (cs.ET); Hardware Architecture (cs.AR); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:In memory computing (IMC) architectures for deep learning (DL) accelerators leverage energy-efficient and highly parallel matrix vector multiplication (MVM) operations, implemented directly in memory arrays. Such IMC designs have been explored based on CMOS as well as emerging non-volatile memory (NVM) technologies like RRAM. IMC architectures generally involve a large number of cores consisting of memory arrays, storing the trained weights of the DL model. Peripheral units like DACs and ADCs are also used for applying inputs and reading out the output values. Recently reported designs reveal that the ADCs required for reading out the MVM results consume more than 85% of the total compute power and also dominate the area, thereby eschewing the benefits of the IMC scheme. Mitigation of imperfections in the ADCs, namely non-linearity and variations, incurs significant design overheads due to dedicated calibration units. In this work we present peripheral aware design of IMC cores to mitigate such overheads. It involves incorporating the non-idealities of ADCs in the training of the DL models, along with that of the memory units. The proposed approach applies equally well to both current mode and charge mode MVM operations demonstrated in recent years, and can significantly simplify the design of mixed-signal IMC units.

[LG-56] Dilated Convolution with Learnable Spacings

链接: https://arxiv.org/abs/2408.06383
作者: Ismail Khalfaoui-Hassani
关键词-EN: Learnable Spacings, DCLS method, evaluates the Dilated, Dilated Convolution, DCLS
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注: PhD Thesis

点击查看摘要

Abstract:This thesis presents and evaluates the Dilated Convolution with Learnable Spacings (DCLS) method. Through various supervised learning experiments in the fields of computer vision, audio, and speech processing, the DCLS method proves to outperform both standard and advanced convolution techniques. The research is organized into several steps, starting with an analysis of the literature and existing convolution techniques that preceded the development of the DCLS method. We were particularly interested in the methods that are closely related to our own and that remain essential to capture the nuances and uniqueness of our approach. The cornerstone of our study is the introduction and application of the DCLS method to convolutional neural networks (CNNs), as well as to hybrid architectures that rely on both convolutional and visual attention approaches. DCLS is shown to be particularly effective in tasks such as classification, semantic segmentation, and object detection. Initially using bilinear interpolation, the study also explores other interpolation methods, finding that Gaussian interpolation slightly improves performance. The DCLS method is further applied to spiking neural networks (SNNs) to enable synaptic delay learning within a neural network that could eventually be transferred to so-called neuromorphic chips. The results show that the DCLS method stands out as a new state-of-the-art technique in SNN audio classification for certain benchmark tasks in this field. These tasks involve datasets with a high temporal component. In addition, we show that DCLS can significantly improve the accuracy of artificial neural networks for the multi-label audio classification task. We conclude with a discussion of the chosen experimental setup, its limitations, the limitations of our method, and our results.

[LG-57] FedRobo: Federated Learning Driven Autonomous Inter Robots Communication For Optimal Chemical Sprays

链接: https://arxiv.org/abs/2408.06382
作者: Jannatul Ferdaus,Sameera Pisupati,Mahedi Hasan,Sathwick Paladugu
关键词-EN: centralized data collection, Learning enables robots, Federated Learning enables, Federated Learning, federated learning algorithm
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Robotics (cs.RO)
*备注: This research article is going to be submitted to a best-fit conference. We are looking for a conference

点击查看摘要

Abstract:Federated Learning enables robots to learn from each other’s experiences without relying on centralized data collection. Each robot independently maintains a model of crop conditions and chemical spray effectiveness, which is periodically shared with other robots in the fleet. A communication protocol is designed to optimize chemical spray applications by facilitating the exchange of information about crop conditions, weather, and other critical factors. The federated learning algorithm leverages this shared data to continuously refine the chemical spray strategy, reducing waste and improving crop yields. This approach has the potential to revolutionize the agriculture industry by offering a scalable and efficient solution for crop protection. However, significant challenges remain, including the development of a secure and robust communication protocol, the design of a federated learning algorithm that effectively integrates data from multiple sources, and ensuring the safety and reliability of autonomous robots. The proposed cluster-based federated learning approach also effectively reduces the computational load on the global server and minimizes communication overhead among clients.
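The federated averaging pattern described above can be illustrated in a few lines. The sketch below uses a hypothetical setup (each simulated robot fits a shared linear model from synthetic local measurements); the paper's cluster-based protocol and crop/weather features are not modeled.

```python
import numpy as np

rng = np.random.default_rng(0)

def local_update(w, X, y, lr=0.1, steps=10):
    """A few local gradient steps on one robot's data (linear model)."""
    for _ in range(steps):
        w = w - lr * 2 * X.T @ (X @ w - y) / len(y)
    return w

# Hypothetical data: local features -> observed spray effectiveness.
w_true = np.array([1.5, -2.0, 0.5])
clients = []
for _ in range(5):                       # five robots in the fleet
    X = rng.normal(size=(40, 3))
    clients.append((X, X @ w_true + 0.05 * rng.normal(size=40)))

w_global = np.zeros(3)
for _ in range(20):                      # communication rounds
    # each robot refines the shared model locally; the server averages
    local_models = [local_update(w_global.copy(), X, y) for X, y in clients]
    w_global = np.mean(local_models, axis=0)

err = np.linalg.norm(w_global - w_true)
```

Only model parameters cross the network, never raw field data, which is the privacy and bandwidth argument for the federated design.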

[LG-58] Using deep learning to enhance electronic service quality: Application to real estate websites

链接: https://arxiv.org/abs/2408.06364
作者: Samaa Elnagar
关键词-EN: service quality dimensions, Electronic service quality, quality dimensions, service quality, Damage Level
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Electronic service quality (E-SQ) is a strategic metric for successful e-services. Among the service quality dimensions, tangibility is overlooked. However, by incorporating visuals or tangible tools, the intangible nature of e-services can be balanced. Thanks to advancements in Deep Learning for computer vision, tangible visual features can now be leveraged to enhance the browsing and searching experience of electronic services. Users usually have specific search criteria to meet, but most services will not offer flexible search filters. This research emphasizes the importance of integrating visual and descriptive features to improve the tangibility and efficiency of e-services. A prime example of an electronic service that can benefit from this is real estate websites. Searching for real estate properties that match user preferences is usually demanding and lacks visual filters, such as the Damage Level to the property. The research introduces a novel visual descriptive feature, the Damage Level, which utilizes a deep learning network known as Mask-RCNN to estimate damage in real estate images. Additionally, a model is developed to incorporate the Damage Level as a tangible feature in electronic real estate services, with the aim of enhancing the tangible customer experience.

[LG-59] Predicting cognitive load in immersive driving scenarios with a hybrid CNN-RNN model

链接: https://arxiv.org/abs/2408.06350
作者: Mehshan Ahmed Khan,Houshyar Asadi,Mohammad Reza Chalak Qazani,Adetokunbo Arogbonlo,Saeid Nahavandi,Chee Peng Lim
关键词-EN: cognitive load, primary task performance, cognitive, sec-ondary tasks reduces, tasks reduces primary
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: 17 pages

点击查看摘要

Abstract:One debatable issue in traffic safety research is that cognitive load from secondary tasks reduces primary task performance, such as driving. Although physiological signals have been extensively used in driving-related research to assess cognitive load, only a few studies have specifically focused on high cognitive load scenarios. Most existing studies tend to examine moderate or low levels of cognitive load. In this study, we adopted an auditory version of the n-back task of three levels as a cognitively loading secondary task while driving in a driving simulator. During the simultaneous execution of driving and the n-back task, we recorded fNIRS, eye-tracking, and driving behavior data to predict cognitive load at three different levels. To the best of our knowledge, this combination of data sources has never been used before. Unlike most previous studies that utilize binary classification of cognitive load and driving in conditions without traffic, our study involved three levels of cognitive load, with drivers operating in normal traffic conditions under low visibility, specifically during nighttime and rainy weather. We proposed a hybrid neural network combining a 1D Convolutional Neural Network and a Recurrent Neural Network to predict cognitive load. Our experimental results demonstrate that the proposed model, with fewer parameters, increases accuracy from 99.82% to 99.99% using physiological data, and from 87.26% to 92.02% using driving behavior data alone. This significant improvement highlights the effectiveness of our hybrid neural network in accurately predicting cognitive load during driving under challenging conditions.
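A forward-pass sketch of such a hybrid 1D CNN + RNN pipeline looks like this in plain NumPy; the shapes, channel counts, and random weights are hypothetical, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, kernels):
    """Valid 1D convolution + ReLU: x is (T, C_in), kernels (K, C_in, C_out)."""
    K = kernels.shape[0]
    T = x.shape[0] - K + 1
    out = np.stack([np.einsum("kc,kco->o", x[t:t + K], kernels)
                    for t in range(T)])
    return np.maximum(out, 0.0)

def rnn_last_state(x, Wx, Wh):
    """Simple tanh RNN over the feature sequence; returns the last state."""
    h = np.zeros(Wh.shape[0])
    for t in range(x.shape[0]):
        h = np.tanh(x[t] @ Wx + h @ Wh)
    return h

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical shapes: 200 time steps of 8 fused physiological channels,
# 3 cognitive-load classes (low / medium / high).
x = rng.normal(size=(200, 8))
kernels = 0.1 * rng.normal(size=(5, 8, 16))   # CNN: local temporal patterns
Wx = 0.1 * rng.normal(size=(16, 32))          # RNN: longer-range dynamics
Wh = 0.1 * rng.normal(size=(32, 32))
Wout = 0.1 * rng.normal(size=(32, 3))

probs = softmax(rnn_last_state(conv1d(x, kernels), Wx, Wh) @ Wout)
```

The division of labor is the point: the convolution extracts short-range temporal features, and the recurrent layer aggregates them over the whole trial before classification.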

[LG-60] Deep Learning based Key Information Extraction from Business Documents: Systematic Literature Review

链接: https://arxiv.org/abs/2408.06345
作者: Alexander Rombach,Peter Fettke
关键词-EN: Extracting key information, Key Information Extraction, Extracting key, key information, process automation
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 52 pages, 7 figures, 9 tables; Submitted to ACM Computing Surveys

点击查看摘要

Abstract:Extracting key information from documents represents a large portion of business workloads and therefore offers a high potential for efficiency improvements and process automation. With recent advances in deep learning, a plethora of deep learning-based approaches for Key Information Extraction have been proposed under the umbrella term Document Understanding that enable the processing of complex business documents. The goal of this systematic literature review is an in-depth analysis of existing approaches in this domain and the identification of opportunities for further research. To this end, 96 approaches published between 2017 and 2023 are analyzed in this study.

[LG-61] Approaches for enhancing extrapolability in process-based and data-driven models in hydrology

链接: https://arxiv.org/abs/2408.07071
作者: Haiyang Shi
关键词-EN: water cycle variables, modern hydrological research, predicting key water, key water cycle, soil moisture
类目: Geophysics (physics.geo-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The application of process-based and data-driven hydrological models is crucial in modern hydrological research, especially for predicting key water cycle variables such as runoff, evapotranspiration (ET), and soil moisture. These models provide a scientific basis for water resource management, flood forecasting, and ecological protection. Process-based models simulate the physical mechanisms of watershed hydrological processes, while data-driven models leverage large datasets and advanced machine learning algorithms. This paper reviewed and compared methods for assessing and enhancing the extrapolability of both model types, discussing their prospects and limitations. Key strategies include the use of leave-one-out cross-validation and similarity-based methods to evaluate model performance in ungauged regions. Deep learning, transfer learning, and domain adaptation techniques are also promising in their potential to improve model predictions in data-sparse and extreme conditions. Interdisciplinary collaboration and continuous algorithmic advancements are also important to strengthen the global applicability and reliability of hydrological models.
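The leave-one-out evaluation strategy mentioned above can be made concrete: hold out one basin at a time, train on the rest, and score the held-out ("ungauged") basin. The example below uses synthetic data and a linear model purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic per-basin data: three predictors -> a runoff-like target.
n_basins, n_obs = 8, 30
beta = np.array([0.8, -0.3, 0.5])
X = [rng.normal(size=(n_obs, 3)) for _ in range(n_basins)]
Y = [x @ beta + 0.1 * rng.normal(size=n_obs) for x in X]

# Leave-one-basin-out: train on all other basins, test on the held-out
# ("ungauged") one - a simple proxy for spatial extrapolability.
errors = []
for held_out in range(n_basins):
    Xtr = np.vstack([X[i] for i in range(n_basins) if i != held_out])
    ytr = np.concatenate([Y[i] for i in range(n_basins) if i != held_out])
    b = np.linalg.lstsq(Xtr, ytr, rcond=None)[0]
    resid = X[held_out] @ b - Y[held_out]
    errors.append(np.sqrt(np.mean(resid ** 2)))
```

In real hydrological settings the held-out error is typically much worse than in-sample error, which is precisely the extrapolability gap the review discusses.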

[LG-62] Event-Stream Super Resolution using Sigma-Delta Neural Network ECCV2024

链接: https://arxiv.org/abs/2408.06968
作者: Waseem Shariff,Joe Lemley,Peter Corcoran
关键词-EN: time-event pixels based, study introduces, approach to enhance, enhance the spatial-temporal, time-event pixels
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: ECCV: The 18th European Conference on Computer Vision ECCV 2024 NeVi Workshop

点击查看摘要

Abstract:This study introduces a novel approach to enhance the spatial-temporal resolution of time-event pixels based on luminance changes captured by event cameras. These cameras present unique challenges due to their low resolution and the sparse, asynchronous nature of the data they collect. Current event super-resolution algorithms are not fully optimized for the distinct data structure produced by event cameras, resulting in inefficiencies in capturing the full dynamism and detail of visual scenes with improved computational complexity. To bridge this gap, our research proposes a method that integrates binary spikes with Sigma Delta Neural Networks (SDNNs), leveraging spatiotemporal constraint learning mechanism designed to simultaneously learn the spatial and temporal distributions of the event stream. The proposed network is evaluated using widely recognized benchmark datasets, including N-MNIST, CIFAR10-DVS, ASL-DVS, and Event-NFS. A comprehensive evaluation framework is employed, assessing both the accuracy, through root mean square error (RMSE), and the computational efficiency of our model. The findings demonstrate significant improvements over existing state-of-the-art methods, specifically, the proposed method outperforms state-of-the-art performance in computational efficiency, achieving a 17.04-fold improvement in event sparsity and a 32.28-fold increase in synaptic operation efficiency over traditional artificial neural networks, alongside a two-fold better performance over spiking neural networks.
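The sparsity mechanism behind sigma-delta coding is easy to sketch: a unit transmits a quantized update only when the accumulated change since its last transmission exceeds a threshold. This toy encoder/decoder illustrates the idea; it is not the paper's SDNN.

```python
import numpy as np

def sigma_delta_encode(x, threshold=0.1):
    """Emit a quantized update only when the accumulated change since the
    last transmission exceeds the threshold; returns sparse (t, step) events."""
    sent, events = 0.0, []
    for t, v in enumerate(x):
        residual = v - sent
        if abs(residual) >= threshold:
            step = threshold * np.round(residual / threshold)
            sent += step
            events.append((t, step))
    return events

def sigma_delta_decode(events, length):
    """Integrate the sparse updates back into a dense signal."""
    y, level, idx = np.zeros(length), 0.0, 0
    for t in range(length):
        while idx < len(events) and events[idx][0] == t:
            level += events[idx][1]
            idx += 1
        y[t] = level
    return y

x = np.sin(np.linspace(0, 2 * np.pi, 400))
events = sigma_delta_encode(x, threshold=0.1)
recon = sigma_delta_decode(events, len(x))
density = len(events) / len(x)           # fraction of steps that fire
err = np.max(np.abs(recon - x))          # bounded by the threshold
```

Only a small fraction of time steps generate traffic, while the reconstruction error stays bounded by the quantization threshold; this is the event-sparsity property the paper exploits for efficiency.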

[LG-63] Stabilizer bootstrapping: A recipe for efficient agnostic tomography and magic estimation

链接: https://arxiv.org/abs/2408.06967
作者: Sitan Chen,Weiyuan Gong,Qi Ye,Zhihan Zhang
关键词-EN: tau, epsilon, runs in time
类目: Quantum Physics (quant-ph); Computational Complexity (cs.CC); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注: 68 pages

点击查看摘要

Abstract:We study the task of agnostic tomography: given copies of an unknown n-qubit state \rho which has fidelity \tau with some state in a given class C, find a state which has fidelity \ge \tau - \epsilon with \rho. We give a new framework, stabilizer bootstrapping, for designing computationally efficient protocols for this task, and use this to get new agnostic tomography protocols for the following classes: Stabilizer states: We give a protocol that runs in time \mathrm{poly}(n,1/\epsilon)\cdot (1/\tau)^{O(\log(1/\tau))}, answering an open question posed by Grewal, Iyer, Kretschmer, Liang [40] and Anshu and Arunachalam [6]. Previous protocols ran in time \mathrm{exp}(\Theta(n)) or required \tau > \cos^2(\pi/8). States with stabilizer dimension n - t: We give a protocol that runs in time n^3\cdot(2^t/\tau)^{O(\log(1/\epsilon))}, extending recent work on learning quantum states prepared by circuits with few non-Clifford gates, which only applied in the realizable setting where \tau = 1 [30, 37, 46, 61]. Discrete product states: If C = K^{\otimes n} for some \mu-separated discrete set K of single-qubit states, we give a protocol that runs in time (n/\mu)^{O((1 + \log(1/\tau))/\mu)}/\epsilon^2. This strictly generalizes a prior guarantee which applied to stabilizer product states [39]. For stabilizer product states, we give a further improved protocol that runs in time (n^2/\epsilon^2)\cdot (1/\tau)^{O(\log(1/\tau))}. As a corollary, we give the first protocol for estimating stabilizer fidelity, a standard measure of magic for quantum states, to error \epsilon in n^3 \cdot \mathrm{quasipoly}(1/\epsilon) time.

[LG-64] PRESENT: Zero-Shot Text-to-Prosody Control

链接: https://arxiv.org/abs/2408.06827
作者: Perry Lam,Huayun Zhang,Nancy F. Chen,Berrak Sisman,Dorien Herremans
关键词-EN: additional style embeddings, extracting additional style, speech synthesis entail, synthesis entail extracting, entail extracting additional
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Current strategies for achieving fine-grained prosody control in speech synthesis entail extracting additional style embeddings or adopting more complex architectures. To enable zero-shot application of pretrained text-to-speech (TTS) models, we present PRESENT (PRosody Editing without Style Embeddings or New Training), which exploits explicit prosody prediction in FastSpeech2-based models by modifying the inference process directly. We apply our text-to-prosody framework to zero-shot language transfer using a JETS model exclusively trained on English LJSpeech data. We obtain character error rates (CER) of 12.8%, 18.7% and 5.9% for German, Hungarian and Spanish respectively, beating the previous state-of-the-art CER by over 2x for all three languages. Furthermore, we allow subphoneme-level control, a first in this field. To evaluate its effectiveness, we show that PRESENT can improve the prosody of questions, and use it to generate Mandarin, a tonal language where vowel pitch varies at subphoneme level. We attain 25.3% hanzi CER and 13.0% pinyin CER with the JETS model. All our code and audio samples are available online.

[LG-65] Enhancing Diabetic Retinopathy Diagnosis: A Lightweight CNN Architecture for Efficient Exudate Detection in Retinal Fundus Images

链接: https://arxiv.org/abs/2408.06784
作者: Mujadded Al Rabbani Alif
关键词-EN: Retinal fundus imaging, early disease onset, Retinal fundus, plays an essential, essential role
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Retinal fundus imaging plays an essential role in diagnosing various stages of diabetic retinopathy, where exudates are critical markers of early disease onset. Prompt detection of these exudates is pivotal for enabling optometrists to arrest or significantly decelerate the disease progression. This paper introduces a novel, lightweight convolutional neural network architecture tailored for automated exudate detection, designed to identify these markers efficiently and accurately. To address the challenge of limited training data, we have incorporated domain-specific data augmentations to enhance the model’s generalizability. Furthermore, we applied a suite of regularization techniques within our custom architecture to boost diagnostic accuracy while optimizing computational efficiency. Remarkably, this streamlined model contains only 4.73 million parameters, a reduction of nearly 60% compared to the standard ResNet-18 model, which has 11.69 million parameters. Despite its reduced complexity, our model achieves an impressive F1 score of 90%, demonstrating its efficacy in the early detection of diabetic retinopathy through fundus imaging.

[LG-66] Coherence Awareness in Diffractive Neural Networks

链接: https://arxiv.org/abs/2408.06681
作者: Matan Kleiner,Lior Michaeli,Tomer Michaeli
关键词-EN: intensive computational processing, hold great promise, requiring intensive computational, networks hold great, applications requiring intensive
类目: Optics (physics.optics); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Diffractive neural networks hold great promise for applications requiring intensive computational processing. Considerable attention has focused on diffractive networks for either spatially coherent or spatially incoherent illumination. Here we illustrate that, as opposed to imaging systems, in diffractive networks the degree of spatial coherence has a dramatic effect. In particular, we show that when the spatial coherence length on the object is comparable to the minimal feature size preserved by the optical system, neither the incoherent nor the coherent extremes serve as acceptable approximations. Importantly, this situation is inherent to many settings involving active illumination, including reflected light microscopy, autonomous vehicles and smartphones. Following this observation, we propose a general framework for training diffractive networks for any specified degree of spatial and temporal coherence, supporting all types of linear and nonlinear layers. Using our method, we numerically optimize networks for image classification, and thoroughly investigate their performance dependence on the illumination coherence properties. We further introduce the concept of coherence-blind networks, which have enhanced resilience to changes in illumination conditions. Our findings serve as a steppingstone toward adopting all-optical neural networks in real-world applications, leveraging nothing but natural light.

[LG-67] Harnessing Earnings Reports for Stock Predictions: A QLoRA-Enhanced LLM Approach

链接: https://arxiv.org/abs/2408.06634
作者: Haowei Ni,Shuchen Meng,Xupeng Chen,Ziqing Zhao,Andi Chen,Panfeng Li,Shiyao Zhang,Qifu Yin,Yuanqing Wang,Yuxi Chan
关键词-EN: Accurate stock market, Accurate stock, stock market predictions, crucial for investors, earnings reports
类目: Computational Finance (q-fin.CP); Computation and Language (cs.CL); Machine Learning (cs.LG); Statistical Finance (q-fin.ST)
*备注: Accepted by 2024 6th International Conference on Data-driven Optimization of Complex Systems

点击查看摘要

Abstract:Accurate stock market predictions following earnings reports are crucial for investors. Traditional methods, particularly classical machine learning models, struggle with these predictions because they cannot effectively process and interpret extensive textual data contained in earnings reports and often overlook nuances that influence market movements. This paper introduces an advanced approach by employing Large Language Models (LLMs) instruction fine-tuned with a novel combination of instruction-based techniques and quantized low-rank adaptation (QLoRA) compression. Our methodology integrates ‘base factors’, such as financial metric growth and earnings transcripts, with ‘external factors’, including recent market indices performances and analyst grades, to create a rich, supervised dataset. This comprehensive dataset enables our models to achieve superior predictive performance in terms of accuracy, weighted F1, and Matthews correlation coefficient (MCC), especially evident in the comparison with benchmarks such as GPT-4. We specifically highlight the efficacy of the llama-3-8b-Instruct-4bit model, which showcases significant improvements over baseline models. The paper also discusses the potential of expanding the output capabilities to include a ‘Hold’ option and extending the prediction horizon, aiming to accommodate various investment styles and time frames. This study not only demonstrates the power of integrating cutting-edge AI with fine-tuned financial data but also paves the way for future research in enhancing AI-driven financial analysis tools.
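The core of the QLoRA idea - a frozen low-bit base weight plus a full-precision low-rank correction - can be illustrated numerically. The sketch below substitutes naive per-tensor 4-bit rounding for QLoRA's block-wise NF4 scheme, and an SVD truncation stands in for the best correction a trained adapter could provide.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_4bit(W):
    """Naive per-tensor round-to-nearest 4-bit quantization (a stand-in
    for QLoRA's block-wise NF4 scheme)."""
    scale = np.abs(W).max() / 7.0
    return np.clip(np.round(W / scale), -8, 7) * scale

# Frozen base weight, stored in 4 bits; the adapter stays full precision.
W = rng.normal(size=(64, 64))
Wq = quantize_4bit(W)
residual = W - Wq

# A rank-8 SVD truncation of the quantization residual stands in for what
# a trained low-rank adapter A @ B could at best recover.
U, s, Vt = np.linalg.svd(residual)
r = 8
A, B = U[:, :r] * s[:r], Vt[:r]

err_quant = np.linalg.norm(W - Wq)           # 4-bit alone
err_lora = np.linalg.norm(W - (Wq + A @ B))  # 4-bit + low-rank correction
```

The adapter holds only 2 x 64 x 8 full-precision numbers versus 64 x 64 for the base weight, which is why the scheme makes fine-tuning large models affordable.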

[LG-68] Variance-Reduced Cascade Q-learning: Algorithms and Sample Complexity

链接: https://arxiv.org/abs/2408.06544
作者: Mohammad Boveiri,Peyman Mohajerin Esfahani
关键词-EN: discounted Markov decision, Markov decision processes, discounted Markov, Markov decision, Variance-Reduced Cascade Q-learning
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:We study the problem of estimating the optimal Q-function of \gamma -discounted Markov decision processes (MDPs) under the synchronous setting, where independent samples for all state-action pairs are drawn from a generative model at each iteration. We introduce and analyze a novel model-free algorithm called Variance-Reduced Cascade Q-learning (VRCQ). VRCQ comprises two key building blocks: (i) the established direct variance reduction technique and (ii) our proposed variance reduction scheme, Cascade Q-learning. By leveraging these techniques, VRCQ provides superior guarantees in the \ell_\infty -norm compared with the existing model-free stochastic approximation-type algorithms. Specifically, we demonstrate that VRCQ is minimax optimal. Additionally, when the action set is a singleton (so that the Q-learning problem reduces to policy evaluation), it achieves non-asymptotic instance optimality while requiring the minimum number of samples theoretically possible. Our theoretical results and their practical implications are supported by numerical experiments.
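For context, the synchronous generative-model setting can be simulated on a toy MDP: at each iteration one draws a fresh batch of next-state samples for every (s, a) and applies an empirical Bellman update. This is the kind of plain baseline VRCQ improves upon; VRCQ's cascade and variance-reduction steps are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 3-state, 2-action MDP with a known model to check against.
nS, nA, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a] = next-state dist.
R = rng.uniform(0, 1, size=(nS, nA))

# Ground-truth optimal Q via exact value iteration on the known model.
Q_star = np.zeros((nS, nA))
for _ in range(500):
    Q_star = R + gamma * P @ Q_star.max(axis=1)

# Sample-based synchronous Q-iteration: every (s, a) gets an independent
# batch from the generative model each round.
Q = np.zeros((nS, nA))
batch = 1000
for _ in range(150):
    Q_new = np.empty((nS, nA))
    for s in range(nS):
        for a in range(nA):
            nxt = rng.choice(nS, size=batch, p=P[s, a])
            Q_new[s, a] = R[s, a] + gamma * Q[nxt].max(axis=1).mean()
    Q = Q_new

err = np.max(np.abs(Q - Q_star))
```

The \ell_\infty error of such a baseline is limited by the per-iteration sampling noise; variance-reduction schemes like VRCQ aim to reach the same accuracy with far fewer samples.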

[LG-69] Dynamic Exclusion of Low-Fidelity Data in Bayesian Optimization for Autonomous Beamline Alignment

链接: https://arxiv.org/abs/2408.06540
作者: Megha R. Narayanan,Thomas W. Morris
关键词-EN: synchrotron light sources, dynamic optical components, National Synchrotron Light, Brookhaven National Laboratory, synchrotron light
类目: Accelerator Physics (physics.acc-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 12 pages, 6 figure sets

点击查看摘要

Abstract:Aligning beamlines at synchrotron light sources is a high-dimensional, expensive-to-sample optimization problem, as beams are focused using a series of dynamic optical components. Bayesian Optimization is an efficient machine learning approach to finding global optima of beam quality, but the model can easily be impaired by faulty data points caused by the beam going off the edge of the sensor or by background noise. This study, conducted at the National Synchrotron Light Source II (NSLS-II) facility at Brookhaven National Laboratory (BNL), is an investigation of methods to identify untrustworthy readings of beam quality and discourage the optimization model from seeking out points likely to yield low-fidelity beams. The approaches explored include dynamic pruning using loss analysis of size and position models and a lengthscale-based genetic algorithm to determine which points to include in the model for optimal fit. Each method successfully classified high and low fidelity points. This research advances BNL’s mission to tackle our nation’s energy challenges by providing scientists at all beamlines with access to higher quality beams, and faster convergence to these optima for their experiments.
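The loss-analysis pruning idea can be sketched on synthetic 1-D data: fit a simple model, flag points whose residuals are anomalously large, and refit on the trusted subset. The beam-quality curve and outlier mechanism below are hypothetical, not NSLS-II data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D beam-quality scan: a smooth quadratic optimum plus
# noise, with a few corrupted low-fidelity readings (e.g. the beam
# clipping the sensor edge).
x = np.linspace(-1, 1, 40)
y = -(x - 0.2) ** 2 + 0.02 * rng.normal(size=40)
bad = [5, 17, 33]
y[bad] = -2.0 + 0.1 * rng.normal(size=3)

# Loss-analysis pruning: fit, flag points with anomalously large
# residuals, then refit on the trusted subset only.
coef = np.polyfit(x, y, deg=2)
residuals = np.abs(y - np.polyval(coef, x))
trusted = residuals < 3 * np.median(residuals)
coef_clean = np.polyfit(x[trusted], y[trusted], deg=2)
x_best = -coef_clean[1] / (2 * coef_clean[0])   # vertex of the clean fit
```

Without the pruning step the corrupted readings drag the surrogate model down and the optimizer is drawn toward regions that yield low-fidelity beams.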

[LG-70] The NP-hardness of the Gromov-Wasserstein distance

链接: https://arxiv.org/abs/2408.06525
作者: Natalia Kravtsova
关键词-EN: property frequently mentioned, note addresses, addresses the property, property frequently, frequently mentioned
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This note addresses the property frequently mentioned in the literature that the Gromov-Wasserstein (GW) distance is NP-hard. We provide the details on the non-convex nature of the GW optimization problem that imply NP-hardness of the GW distance between finite spaces for any instance of an input data. We further illustrate the non-convexity of the problem with several explicit examples.
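The non-convexity is easy to exhibit on a tiny instance: for two "equilateral" 3-point spaces, two distinct permutation couplings both achieve objective zero, yet their midpoint does not, which rules out convexity of the GW objective over couplings.

```python
import numpy as np

def gw_objective(C1, C2, T):
    """Discrete GW objective: sum_{ijkl} (C1[i,k] - C2[j,l])^2 T[i,j] T[k,l]."""
    L = (C1[:, None, :, None] - C2[None, :, None, :]) ** 2
    return np.einsum("ijkl,ij,kl->", L, T, T)

# Two isometric "equilateral" 3-point spaces: all pairwise distances 1.
C = np.ones((3, 3)) - np.eye(3)
P1 = np.eye(3) / 3.0                        # identity matching
P2 = np.roll(np.eye(3), 1, axis=1) / 3.0    # cyclic matching
mid = (P1 + P2) / 2.0                       # their convex combination

f1, f2 = gw_objective(C, C, P1), gw_objective(C, C, P2)
fmid = gw_objective(C, C, mid)
# f1 = f2 = 0, but fmid > 0: a convex function could never satisfy
# f(mid) > (f1 + f2) / 2.
```

The coupling polytope is convex, but the quadratic objective is not, so local solvers can land in distinct minima; the note's NP-hardness argument formalizes the consequences of this structure.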

[LG-71] From Graphs to Qubits: A Critical Review of Quantum Graph Neural Networks

链接: https://arxiv.org/abs/2408.06524
作者: Andrea Ceschini,Francesco Mauro,Francesca De Falco,Alessandro Sebastianelli,Alessio Verdone,Antonello Rosato,Bertrand Le Saux,Massimo Panella,Paolo Gamba,Silvia L. Ullo
关键词-EN: Graph Neural Networks, Quantum Graph Neural, Neural Networks, Graph Neural, complex relational structures
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 21 pages, 9 figures, 2 tables. arXiv admin note: text overlap with arXiv:1909.12264 by other authors

点击查看摘要

Abstract:Quantum Graph Neural Networks (QGNNs) represent a novel fusion of quantum computing and Graph Neural Networks (GNNs), aimed at overcoming the computational and scalability challenges inherent in classical GNNs that are powerful tools for analyzing data with complex relational structures but suffer from limitations such as high computational complexity and over-smoothing in large-scale applications. Quantum computing, leveraging principles like superposition and entanglement, offers a pathway to enhanced computational capabilities. This paper critically reviews the state-of-the-art in QGNNs, exploring various architectures. We discuss their applications across diverse fields such as high-energy physics, molecular chemistry, finance and earth sciences, highlighting the potential for quantum advantage. Additionally, we address the significant challenges faced by QGNNs, including noise, decoherence, and scalability issues, proposing potential strategies to mitigate these problems. This comprehensive review aims to provide a foundational understanding of QGNNs, fostering further research and development in this promising interdisciplinary field.

[LG-72] Bayesian Learning in a Nonlinear Multiscale State-Space Model

链接: https://arxiv.org/abs/2408.06425
作者: Nayely Vélez-Cruz,Manfred D. Laubichler
关键词-EN: temporal scales influence, interactions in complex, development and heredity, heredity serving, complex systems
类目: Signal Processing (eess.SP); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The ubiquity of multiscale interactions in complex systems is well-recognized, with development and heredity serving as a prime example of how processes at different temporal scales influence one another. This work introduces a novel multiscale state-space model to explore the dynamic interplay between systems interacting across different time scales, with feedback between each scale. We propose a Bayesian learning framework to estimate unknown states by learning the unknown process noise covariances within this multiscale model. We develop a Particle Gibbs with Ancestor Sampling (PGAS) algorithm for inference and demonstrate through simulations the efficacy of our approach.
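
The paper's PGAS sampler is not reproduced here, but it builds on sequential Monte Carlo; the sketch below implements only a bootstrap particle filter on a standard nonlinear benchmark model (x_t = 0.5 x_{t-1} + 25 x_{t-1}/(1 + x_{t-1}^2) + 8 cos(1.2 t) + noise, y_t = x_t^2/20 + noise) to illustrate the inference substrate. All parameters and names are illustrative, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(3)

def f(x, t):
    """Nonlinear state transition of the classic benchmark model."""
    return 0.5 * x + 25 * x / (1 + x ** 2) + 8 * np.cos(1.2 * t)

def simulate(T, q=1.0, r=1.0):
    """Generate a trajectory and noisy observations y_t = x_t^2/20 + noise."""
    x = np.zeros(T)
    y = np.zeros(T)
    for t in range(1, T):
        x[t] = f(x[t - 1], t) + np.sqrt(q) * rng.standard_normal()
        y[t] = x[t] ** 2 / 20 + np.sqrt(r) * rng.standard_normal()
    return x, y

def bootstrap_pf(y, n_particles=500, q=1.0, r=1.0):
    """Bootstrap particle filter: propagate, weight by likelihood, resample."""
    T = len(y)
    means = np.zeros(T)
    p = np.zeros(n_particles)
    for t in range(1, T):
        p = f(p, t) + np.sqrt(q) * rng.standard_normal(n_particles)
        logw = -0.5 * (y[t] - p ** 2 / 20) ** 2 / r
        w = np.exp(logw - logw.max())
        w /= w.sum()
        means[t] = w @ p                       # posterior-mean estimate
        p = p[rng.choice(n_particles, size=n_particles, p=w)]
    return means

x, y = simulate(60)
xhat = bootstrap_pf(y)
print(np.mean((x - xhat) ** 2))
```

PGAS augments such a filter with a retained reference trajectory and ancestor sampling inside a Gibbs sweep, which is what enables joint learning of the process noise covariances in the paper's multiscale model.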

[LG-73] Neural Networks as Spin Models: From Glass to Hidden Order Through Training

链接: https://arxiv.org/abs/2408.06421
作者: Richard Barney,Michael Winer,Victor Galitski
关键词-EN: mapped to Ising, Ising spins, mechanical spin model, neural network, spin-spin couplings
类目: Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Adaptation and Self-Organizing Systems (nlin.AO)
*备注: 18 pages, 9 figures

点击查看摘要

Abstract:We explore a one-to-one correspondence between a neural network (NN) and a statistical mechanical spin model where neurons are mapped to Ising spins and weights to spin-spin couplings. The process of training an NN produces a family of spin Hamiltonians parameterized by training time. We study the magnetic phases and the melting transition temperature as training progresses. First, we prove analytically that the common initial state before training–an NN with independent random weights–maps to a layered version of the classical Sherrington-Kirkpatrick spin glass exhibiting a replica symmetry breaking. The spin-glass-to-paramagnet transition temperature is calculated. Further, we use the Thouless-Anderson-Palmer (TAP) equations–a theoretical technique to analyze the landscape of energy minima of random systems–to determine the evolution of the magnetic phases on two types of NNs (one with continuous and one with binarized activations) trained on the MNIST dataset. The two NN types give rise to similar results, showing a quick destruction of the spin glass and the appearance of a phase with a hidden order, whose melting transition temperature T_c grows as a power law in training time. We also discuss the properties of the spectrum of the spin system’s bond matrix in the context of rich vs. lazy learning. We suggest that this statistical mechanical view of NNs provides a useful unifying perspective on the training process, which can be viewed as selecting and strengthening a symmetry-broken state associated with the training task.
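
As a toy illustration of the neuron-to-spin mapping described above (a minimal sketch with made-up sizes, not the paper's construction): the weights of one layer can be embedded as inter-layer couplings of a layered Ising model, whose energy is then E(s) = -(1/2) s^T J s.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-layer network: 4 input neurons feeding 3 hidden neurons.
n_in, n_hid = 4, 3
W = rng.standard_normal((n_hid, n_in))  # weights: hidden <- input

# Embed W into a symmetric coupling matrix over all 7 "spins";
# couplings exist only between the two layers (a layered spin model).
n = n_in + n_hid
J = np.zeros((n, n))
J[n_in:, :n_in] = W
J[:n_in, n_in:] = W.T  # symmetrize

def energy(s, J):
    """Ising energy E(s) = -1/2 s^T J s for spins s in {-1, +1}^n."""
    return -0.5 * s @ J @ s

# Energy of the all-up configuration.
s = np.ones(n)
e = energy(s, J)
print(e)
```

Training the network would update W and hence the family of Hamiltonians J; the paper studies how the magnetic phases of that family evolve with training time.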

[LG-74] PhaGO: Protein function annotation for bacteriophages by integrating the genomic context

链接: https://arxiv.org/abs/2408.06402
作者: Jiaojiao Guan,Yongxin Ji,Cheng Peng,Wei Zou,Xubo Tang,Jiayu Shang,Yanni Sun
关键词-EN: Bacteriophages are viruses, target bacteria, playing a crucial, microbial ecology, viruses that target
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 17 pages,6 figures

点击查看摘要

Abstract:Bacteriophages are viruses that target bacteria, playing a crucial role in microbial ecology. Phage proteins are important in understanding phage biology, such as virus infection, replication, and evolution. Although a large number of new phages have been identified via metagenomic sequencing, many of them have limited protein function annotation. Accurate function annotation of phage proteins presents several challenges, including their inherent diversity and the scarcity of annotated ones. Existing tools have yet to fully leverage the unique properties of phages in annotating protein functions. In this work, we propose a new protein function annotation tool for phages by leveraging the modular genomic structure of phage genomes. By employing embeddings from the latest protein foundation models and Transformer to capture contextual information between proteins in phage genomes, PhaGO surpasses state-of-the-art methods in annotating diverged proteins and proteins with uncommon functions by 6.78% and 13.05% improvement, respectively. PhaGO can annotate proteins lacking homology search results, which is critical for characterizing the rapidly accumulating phage genomes. We demonstrate the utility of PhaGO by identifying 688 potential holins in phages, which exhibit high structural conservation with known holins. The results show the potential of PhaGO to extend our understanding of newly discovered phages.

[LG-75] High-dimensional optimization for multi-spiked tensor PCA

链接: https://arxiv.org/abs/2408.06401
作者: Gérard Ben Arous,Cédric Gerbelot,Vanessa Piccolo
关键词-EN: local optimization algorithms, multi-spiked tensor model, stochastic gradient descent, tensor PCA problem, online stochastic gradient
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:We study the dynamics of two local optimization algorithms, online stochastic gradient descent (SGD) and gradient flow, within the framework of the multi-spiked tensor model in the high-dimensional regime. This multi-index model arises from the tensor principal component analysis (PCA) problem, which aims to infer r unknown, orthogonal signal vectors within the N-dimensional unit sphere through maximum likelihood estimation from noisy observations of an order-p tensor. We determine the number of samples and the conditions on the signal-to-noise ratios (SNRs) required to efficiently recover the unknown spikes from natural initializations. Specifically, we distinguish between three types of recovery: exact recovery of each spike, recovery of a permutation of all spikes, and recovery of the correct subspace spanned by the signal vectors. We show that with online SGD, it is possible to recover all spikes provided a number of samples scaling as N^{p-2}, aligning with the computational threshold identified in the rank-one tensor PCA problem [Ben Arous, Gheissari, Jagannath 2020, 2021]. For gradient flow, we show that the algorithmic threshold to efficiently recover the first spike is also of order N^{p-2}. However, recovering the subsequent directions requires the number of samples to scale as N^{p-1}. Our results are obtained through a detailed analysis of a low-dimensional system that describes the evolution of the correlations between the estimators and the spikes. In particular, the hidden vectors are recovered one by one according to a sequential elimination phenomenon: as one correlation exceeds a critical threshold, all correlations sharing a row or column index decrease and become negligible, allowing the subsequent correlation to grow and become macroscopic. The sequence in which correlations become macroscopic depends on their initial values and on the associated SNRs.
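
The online SGD dynamics studied in the paper can be illustrated, for the simplest rank-one (r = 1), order-3 case, by the sketch below: each step draws a fresh noisy tensor, ascends the gradient of the objective <Y, x⊗x⊗x>, and retracts to the unit sphere. The dimension, SNR, and step size are arbitrary toy values, not those of the paper's analysis.

```python
import numpy as np

rng = np.random.default_rng(1)
N, lam = 30, 10.0            # dimension and signal-to-noise ratio (toy values)
steps, lr = 1000, 0.5 / 30   # online iterations and step size

v = rng.standard_normal(N)
v /= np.linalg.norm(v)       # hidden unit spike

x = rng.standard_normal(N)
x /= np.linalg.norm(x)       # random initialization on the sphere

for _ in range(steps):
    # Fresh sample each step (online SGD): Y = lam * v^{o3} + noise.
    noise = rng.standard_normal((N, N, N)) / np.sqrt(N)
    Y = lam * np.einsum('i,j,k->ijk', v, v, v) + noise
    grad = 3 * np.einsum('ijk,j,k->i', Y, x, x)   # gradient of <Y, x o x o x>
    x = x + lr * grad
    x /= np.linalg.norm(x)    # retract the iterate to the unit sphere

overlap = abs(x @ v)          # correlation with the hidden spike
print(overlap)
```

In the multi-spiked setting the paper tracks the full matrix of such correlations and shows they become macroscopic one by one via the sequential elimination phenomenon.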

[LG-76] MetMamba: Regional Weather Forecasting with Spatial-Temporal Mamba Model

链接: https://arxiv.org/abs/2408.06400
作者: Haoyu Qin,Yungang Chen,Qianchuan Jiang,Pengchao Sun,Xiancai Ye,Chao Lin
关键词-EN: based Weather Prediction, art numerical weather, Weather Prediction, numerical weather forecasts, Deep Learning based
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep Learning based Weather Prediction (DLWP) models have been improving rapidly over the last few years, surpassing state-of-the-art numerical weather forecasts by significant margins. While much of the optimization effort is focused on training curriculum to extend forecast range in the global context, two aspects remain less explored: limited area modeling and better backbones for weather forecasting. We show in this paper that MetMamba, a DLWP model built on a state-of-the-art state-space model, Mamba, offers notable performance gains and unique advantages over other popular backbones using traditional attention mechanisms and neural operators. We also demonstrate the feasibility of deep learning based limited area modeling via coupled training with a global host model.

[LG-77] Design Proteins Using Large Language Models: Enhancements and Comparative Analyses ACL2024

链接: https://arxiv.org/abs/2408.06396
作者: Kamyar Zeinalipour,Neda Jamshidi,Monica Bianchini,Marco Maggini,Marco Gori
关键词-EN: natural language processing, demonstrated substantial capabilities, conventional natural language, protein sequences, language processing
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: This paper has been accepted for presentation at Language and Molecules ACL 2024

点击查看摘要

Abstract:Pre-trained LLMs have demonstrated substantial capabilities across a range of conventional natural language processing (NLP) tasks, such as summarization and entity recognition. In this paper, we explore the application of LLMs in the generation of high-quality protein sequences. Specifically, we adopt a suite of pre-trained LLMs, including Mistral-7B, Llama-2-7B, Llama-3-8B, and gemma-7B, to produce valid protein sequences. All of these models are publicly available. Unlike previous work in this field, our approach utilizes a relatively small dataset comprising 42,000 distinct human protein sequences. We retrain these models to process protein-related data, ensuring the generation of biologically feasible protein structures. Our findings demonstrate that even with limited data, the adapted models exhibit efficiency comparable to established protein-focused models such as ProGen varieties, ProtGPT2, and ProLLaMA, which were trained on millions of protein sequences. To validate and quantify the performance of our models, we conduct comparative analyses employing standard metrics such as pLDDT, RMSD, TM-score, and REU. Furthermore, we commit to making the trained versions of all four models publicly available, fostering greater transparency and collaboration in the field of computational biology.

[LG-78] Autoregressive Enzyme Function Prediction with Multi-scale Multi-modality Fusion

链接: https://arxiv.org/abs/2408.06391
作者: Dingyi Rong,Wenzhuo Zheng,Bozitao Zhong,Zhouhan Lin,Liang Hong,Ning Liu
关键词-EN: elucidating biological mechanisms, crucial for elucidating, elucidating biological, biological mechanisms, mechanisms and driving
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate prediction of enzyme function is crucial for elucidating biological mechanisms and driving innovation across various sectors. Existing deep learning methods tend to rely solely on either sequence data or structural data and predict the EC number as a whole, neglecting the intrinsic hierarchical structure of EC numbers. To address these limitations, we introduce MAPred, a novel multi-modality and multi-scale model designed to autoregressively predict the EC number of proteins. MAPred integrates both the primary amino acid sequence and the 3D tokens of proteins, employing a dual-pathway approach to capture comprehensive protein characteristics and essential local functional sites. Additionally, MAPred utilizes an autoregressive prediction network to sequentially predict the digits of the EC number, leveraging the hierarchical organization of EC classifications. Evaluations on benchmark datasets, including New-392, Price, and New-815, demonstrate that our method outperforms existing models, marking a significant advance in the reliability and granularity of protein function prediction within bioinformatics.

[LG-79] Masked Graph Autoencoders with Contrastive Augmentation for Spatially Resolved Transcriptomics Data

链接: https://arxiv.org/abs/2408.06377
作者: Donghai Fang,Fangfang Zhu,Dongting Xie,Wenwen Min
关键词-EN: Spatial Resolved Transcriptomics, Resolved Transcriptomics, comprehensively measure gene, measure gene transcription, Spatial Resolved
类目: Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:With the rapid advancement of Spatial Resolved Transcriptomics (SRT) technology, it is now possible to comprehensively measure gene transcription while preserving the spatial context of tissues. Spatial domain identification and gene denoising are key objectives in SRT data analysis. We propose a Contrastively Augmented Masked Graph Autoencoder (STMGAC) to learn low-dimensional latent representations for domain identification. In the latent space, persistent signals for representations are obtained through self-distillation to guide self-supervised matching. At the same time, positive and negative anchor pairs are constructed using triplet learning to augment the discriminative ability. We evaluated the performance of STMGAC on five datasets, achieving results superior to those of existing baseline methods. All code and public datasets used in this paper are available at this https URL and this https URL.

[LG-80] Lyrics Transcription for Humans: A Readability-Aware Benchmark ALT

链接: https://arxiv.org/abs/2408.06370
作者: Ondřej Cífka,Hendrik Schreiber,Luke Miner,Fabian-Robert Stöter
关键词-EN: convey contextual information, human consumption involves, capturing word sequences, accurately capturing word, contextual information
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
*备注: ISMIR 2024 camera-ready. 6 pages + references + supplementary material. Website this https URL Data this https URL Code this https URL . arXiv admin note: text overlap with arXiv:2311.13987

点击查看摘要

Abstract:Writing down lyrics for human consumption involves not only accurately capturing word sequences, but also incorporating punctuation and formatting for clarity and to convey contextual information. This includes song structure, emotional emphasis, and contrast between lead and background vocals. While automatic lyrics transcription (ALT) systems have advanced beyond producing unstructured strings of words and are able to draw on wider context, ALT benchmarks have not kept pace and continue to focus exclusively on words. To address this gap, we introduce Jam-ALT, a comprehensive lyrics transcription benchmark. The benchmark features a complete revision of the JamendoLyrics dataset, in adherence to industry standards for lyrics transcription and formatting, along with evaluation metrics designed to capture and assess the lyric-specific nuances, laying the foundation for improving the readability of lyrics. We apply the benchmark to recent transcription systems and present additional error analysis, as well as an experimental comparison with a classical music dataset.

[LG-81] An Adaptive CSI Feedback Model Based on BiLSTM for Massive MIMO-OFDM Systems

链接: https://arxiv.org/abs/2408.06359
作者: Hongrui Shen,Long Zhao,Kan Zheng,Yuhua Cao,Pingzhi Fan
关键词-EN: input CSI lengths, CSI feedback, channel state information, frequency division multiplexing, massive multiple-input multiple-output
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 13 pages, 14 figures, 3 tables

点击查看摘要

Abstract:Deep learning (DL)-based channel state information (CSI) feedback has the potential to improve the recovery accuracy and reduce the feedback overhead in massive multiple-input multiple-output orthogonal frequency division multiplexing (MIMO-OFDM) systems. However, the length of input CSI and the number of feedback bits should be adjustable in different scenarios, which cannot be efficiently achieved by the existing CSI feedback models. Therefore, an adaptive bidirectional long short-term memory network (ABLNet) for CSI feedback is first designed to process various input CSI lengths, where the number of feedback bits is in proportion to the CSI length. Then, to realize a more flexible feedback bit number, a feedback bit control unit (FBCU) module is proposed to control the output length of feedback bits. Based on this, a target feedback performance can be adaptively achieved by a designed bit number adjusting (BNA) algorithm. Furthermore, a novel separate training approach is devised to solve the model protection problem that arises when the UE and gNB are from different manufacturers. Experiments demonstrate that the proposed ABLNet with FBCU can accommodate different input CSI lengths and feedback bit numbers; the CSI feedback performance can be stabilized by the BNA algorithm; and the proposed separate training approach can maintain the feedback performance and reduce the complexity of the feedback model.

信息检索

[IR-0] TableGuard – Securing Structured & Unstructured Data

链接: https://arxiv.org/abs/2408.07045
作者: Anantha Sharma,Ajinkya Deshmukh
关键词-EN: data, critical challenge, increasing demand, TableGuard, obfuscation
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: 7 pages, 3 tables, 1 figure

点击查看摘要

Abstract:With the increasing demand for data sharing across platforms and organizations, ensuring the privacy and security of sensitive information has become a critical challenge. This paper introduces “TableGuard”, an innovative approach to data obfuscation tailored for relational databases. Building on the principles and techniques developed in prior work on context-sensitive obfuscation, TableGuard applies these methods to ensure that API calls return only obfuscated data, thereby safeguarding privacy when sharing data with third parties. TableGuard leverages advanced context-sensitive obfuscation techniques to replace sensitive data elements with contextually appropriate alternatives. By maintaining the relational integrity and coherence of the data, our approach mitigates the risks of cognitive dissonance and data leakage. We demonstrate the implementation of TableGuard using a BERT based transformer model, which identifies and obfuscates sensitive entities within relational tables. Our evaluation shows that TableGuard effectively balances privacy protection with data utility, minimizing information loss while ensuring that the obfuscated data remains functionally useful for downstream applications. The results highlight the importance of domain-specific obfuscation strategies and the role of context length in preserving data integrity. The implications of this research are significant for organizations that need to share data securely with external parties. TableGuard offers a robust framework for implementing privacy-preserving data sharing mechanisms, thereby contributing to the broader field of data privacy and security.
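
TableGuard's BERT-based entity detection is not reproduced here; the sketch below illustrates only the consistency property the abstract emphasizes: replacing sensitive values with deterministic surrogates, so that equal values map to equal surrogates and relational joins across rows survive obfuscation. The column names, salting scheme, and surrogate format are all hypothetical.

```python
import hashlib

def pseudonym(value: str, prefix: str, salt: str = "demo-salt") -> str:
    """Deterministic surrogate: equal inputs yield equal outputs,
    so foreign-key relationships across rows survive obfuscation."""
    digest = hashlib.sha256((salt + value).encode()).hexdigest()[:8]
    return f"{prefix}_{digest}"

def obfuscate_rows(rows, sensitive_columns):
    """Replace sensitive columns with surrogates; leave the rest intact."""
    out = []
    for row in rows:
        masked = dict(row)
        for col in sensitive_columns:
            masked[col] = pseudonym(str(row[col]), col.upper())
        out.append(masked)
    return out

rows = [
    {"name": "Alice", "email": "a@x.com", "balance": 120},
    {"name": "Bob",   "email": "b@x.com", "balance": 80},
    {"name": "Alice", "email": "a@x.com", "balance": 45},
]
masked = obfuscate_rows(rows, sensitive_columns=["name", "email"])
print(masked[0]["name"], masked[2]["name"])
```

Note that both "Alice" rows receive the same surrogate, preserving relational integrity, while non-sensitive columns such as the balance pass through unchanged.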

[IR-1] OpenResearcher: Unleashing AI for Accelerated Scientific Research

链接: https://arxiv.org/abs/2408.06941
作者: Yuxiang Zheng,Shichao Sun,Lin Qiu,Dongyu Ru,Cheng Jiayang,Xuefeng Li,Jifan Lin,Binjie Wang,Yun Luo,Renjie Pan,Yang Xu,Qingkai Min,Zizhao Zhang,Yiwen Wang,Wenjie Li,Pengfei Liu
关键词-EN: imposes significant challenges, literature imposes significant, leverages Artificial Intelligence, Large Language Models, rapid growth
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:The rapid growth of scientific literature imposes significant challenges for researchers endeavoring to stay updated with the latest advancements in their fields and delve into new areas. We introduce OpenResearcher, an innovative platform that leverages Artificial Intelligence (AI) techniques to accelerate the research process by answering diverse questions from researchers. OpenResearcher is built based on Retrieval-Augmented Generation (RAG) to integrate Large Language Models (LLMs) with up-to-date, domain-specific knowledge. Moreover, we develop various tools for OpenResearcher to understand researchers’ queries, search from the scientific literature, filter retrieved information, provide accurate and comprehensive answers, and self-refine these answers. OpenResearcher can flexibly use these tools to balance efficiency and effectiveness. As a result, OpenResearcher enables researchers to save time and increase their potential to discover new insights and drive scientific breakthroughs. Demo, video, and code are available at: this https URL.

[IR-2] Diffusion Model for Slate Recommendation

链接: https://arxiv.org/abs/2408.06883
作者: Federico Tomasi,Francesco Fabbri,Mounia Lalmas,Zhenwen Dai
关键词-EN: technique commonly, streaming platforms, sites to present, present multiple items, Slate recommendation
类目: Information Retrieval (cs.IR); Machine Learning (stat.ML)
*备注: 9 pages, 5 figures, 3 tables

点击查看摘要

Abstract:Slate recommendation is a technique commonly used on streaming platforms and e-commerce sites to present multiple items together. A significant challenge with slate recommendation is managing the complex combinatorial choice space. Traditional methods often simplify this problem by assuming users engage with only one item at a time. However, this simplification does not reflect the reality, as users often interact with multiple items simultaneously. In this paper, we address the general slate recommendation problem, which accounts for simultaneous engagement with multiple items. We propose a generative approach using Diffusion Models, leveraging their ability to learn structures in high-dimensional data. Our model generates high-quality slates that maximize user satisfaction by overcoming the challenges of the combinatorial choice space. Furthermore, our approach enhances the diversity of recommendations. Extensive offline evaluations on applications such as music playlist generation and e-commerce bundle recommendations show that our model outperforms state-of-the-art baselines in both relevance and diversity.

[IR-3] Reformulating Conversational Recommender Systems as Tri-Phase Offline Policy Learning CIKM2024

链接: https://arxiv.org/abs/2408.06809
作者: Gangyi Zhang,Chongming Gao,Hang Pan,Runzhe Teng,Ruizhe Li
关键词-EN: Existing Conversational Recommender, Conversational Recommender Systems, Existing Conversational, Learning-based Conversational Recommender, predominantly utilize user
类目: Information Retrieval (cs.IR)
*备注: Accepted at CIKM 2024

点击查看摘要

Abstract:Existing Conversational Recommender Systems (CRS) predominantly utilize user simulators for training and evaluating recommendation policies. These simulators often oversimplify the complexity of user interactions by focusing solely on static item attributes, neglecting the rich, evolving preferences that characterize real-world user behavior. This limitation frequently leads to models that perform well in simulated environments but falter in actual deployment. Addressing these challenges, this paper introduces the Tri-Phase Offline Policy Learning-based Conversational Recommender System (TPCRS), which significantly reduces dependency on real-time interactions and mitigates overfitting issues prevalent in traditional approaches. TPCRS integrates a model-based offline learning strategy with a controllable user simulation that dynamically aligns with both personalized and evolving user preferences. Through comprehensive experiments, TPCRS demonstrates enhanced robustness, adaptability, and accuracy in recommendations, outperforming traditional CRS models in diverse user scenarios. This approach not only provides a more realistic evaluation environment but also facilitates a deeper understanding of user behavior dynamics, thereby refining the recommendation process.

[IR-4] Hierarchical Structured Neural Network for Retrieval

链接: https://arxiv.org/abs/2408.06653
作者: Kaushik Rangadurai,Siyang Yuan,Minhui Huang,Yiqun Liu,Golnaz Ghasemiesfeh,Yunchen Pu,Xinfeng Xie,Xingfeng He,Fangzhou Xu,Andrew Cui,Vidhoon Viswanathan,Yan Dong,Liang Xiong,Lin Yang,Liang Wang,Jiyan Yang,Chonglin Sun
关键词-EN: Embedding Based Retrieval, Embedding Based, Nearest Neighbor Search, Approximate Nearest Neighbor, learn embeddings
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注: 9 pages

点击查看摘要

Abstract:Embedding Based Retrieval (EBR) is a crucial component of the retrieval stage in (Ads) Recommendation System that utilizes Two Tower or Siamese Networks to learn embeddings for both users and items (ads). It then employs an Approximate Nearest Neighbor Search (ANN) to efficiently retrieve the most relevant ads for a specific user. Despite their recent rise to popularity in the industry, such systems have a couple of limitations. Firstly, the Two Tower model architecture uses a single dot-product interaction which, despite its efficiency, fails to capture the data distribution in practice. Secondly, the centroid representation and cluster assignment, which are components of ANN, occur after the training process has been completed. As a result, they do not take into account the optimization criteria used for the retrieval model. In this paper, we present Hierarchical Structured Neural Network (HSNN), a deployed jointly optimized hierarchical clustering and neural network model that can take advantage of sophisticated interactions and model architectures that are more common in the ranking stages while maintaining a sub-linear inference cost. We achieve 6.5% improvement in offline evaluation and also demonstrate 1.22% online gains through A/B experiments. HSNN has been successfully deployed into the Ads Recommendation system and is currently handling a major portion of the traffic. The paper shares our experience in developing this system, dealing with challenges like freshness, volatility, cold start recommendations, cluster collapse, and lessons learned deploying the model in a large scale retrieval production system.
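
The two-tower EBR setup that HSNN improves upon can be sketched as follows. The "towers" here are stand-in random projections rather than trained networks, and brute-force search stands in for a real ANN index, so every name and shape below is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 16          # embedding dimension
n_ads = 100     # size of the toy ad corpus

# Stand-ins for the trained user and ad towers: in a real system these
# are neural networks; here they are fixed random projections.
W_user = rng.standard_normal((d, 8))
W_ad = rng.standard_normal((d, 8))

def user_tower(features):
    v = W_user @ features
    return v / np.linalg.norm(v)

def ad_tower(features):
    v = W_ad @ features
    return v / np.linalg.norm(v)

# Precompute the ad index (a production system would build an ANN index).
ad_features = rng.standard_normal((n_ads, 8))
ad_index = np.stack([ad_tower(f) for f in ad_features])

def retrieve(user_features, k=5):
    """Top-k ads by dot-product score (brute force in place of ANN)."""
    u = user_tower(user_features)
    scores = ad_index @ u
    return np.argsort(-scores)[:k]

uf = rng.standard_normal(8)
top = retrieve(uf, k=5)
print(top)
```

HSNN's contribution, per the abstract, is to replace the single dot product with richer interactions while jointly learning the clustering used at serving time, keeping inference sub-linear.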

[IR-5] BMX: Entropy-weighted Similarity and Semantic-enhanced Lexical Search

链接: https://arxiv.org/abs/2408.06643
作者: Xianming Li,Julius Lipp,Aamir Shakir,Rui Huang,Jing Li
关键词-EN: large language models, remains crucial, language models, lexical search algorithm, rise of pre-trained
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:BM25, a widely-used lexical search algorithm, remains crucial in information retrieval despite the rise of pre-trained and large language models (PLMs/LLMs). However, it neglects query-document similarity and lacks semantic understanding, limiting its performance. We revisit BM25 and introduce BMX, a novel extension of BM25 incorporating entropy-weighted similarity and semantic enhancement techniques. Extensive experiments demonstrate that BMX consistently outperforms traditional BM25 and surpasses PLM/LLM-based dense retrieval in long-context and real-world retrieval benchmarks. This study bridges the gap between classical lexical search and modern semantic approaches, offering a promising direction for future information retrieval research. The reference implementation of BMX can be found in Baguetter, which was created in the context of this work. The code can be found here: this https URL.
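
BMX's entropy-weighted extension is only described at a high level in the abstract, so the sketch below implements just the classical BM25 baseline it builds on, with the conventional k1 and b defaults; the tokenization and corpus are toy examples.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Classical BM25: score each doc (a list of tokens) against the query."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter()                       # document frequency per term
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)                  # term frequency within this doc
        s = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            s += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = [
    "the cat sat on the mat".split(),
    "dogs chase cats in the park".split(),
    "quantum tensor networks".split(),
]
print(bm25_scores("cat mat".split(), docs))
```

BMX, as the abstract describes it, augments this exact-match scoring with entropy-weighted query-document similarity and semantic signals, which plain BM25 lacks.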

[IR-6] Generalized knowledge-enhanced framework for biomedical entity and relation extraction

链接: https://arxiv.org/abs/2408.06618
作者: Minh Nguyen,Phuong Le
关键词-EN: recent years, increasing number, relation extraction, entity and relation, biomedical entity
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In recent years, there has been an increasing number of frameworks developed for biomedical entity and relation extraction. This research effort aims to address the accelerating growth in biomedical publications and the intricate nature of biomedical texts, which are written mainly for domain experts. To handle these challenges, we develop a novel framework that utilizes external knowledge to construct a task-independent and reusable background knowledge graph for biomedical entity and relation extraction. The design of our model is inspired by how humans learn domain-specific topics. In particular, humans often first acquire the most basic and common knowledge regarding a field to build the foundational knowledge and then use that as a basis for extending to various specialized topics. Our framework employs such a common-knowledge-sharing mechanism to build a general neural-network knowledge graph whose learning is effectively transferable to different domain-specific biomedical texts. Experimental evaluations demonstrate that our model, equipped with this generalized and cross-transferable knowledge base, achieves competitive performance on benchmarks, including BioRelEx for binding interaction detection and ADE for Adverse Drug Effect identification.

[IR-7] Prompt Tuning as User Inherent Profile Inference Machine

链接: https://arxiv.org/abs/2408.06577
作者: Yusheng Lu,Zhaocheng Du,Xiangyang Li,Xiangyu Zhao,Weiwen Liu,Yichao Wang,Huifeng Guo,Ruiming Tang,Zhenhua Dong,Yongrui Duan
关键词-EN: Large Language Models, Large Language, Language Models, superior reasoning capabilities, exhibited significant promise
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have exhibited significant promise in recommender systems by empowering user profiles with their extensive world knowledge and superior reasoning capabilities. However, LLMs face challenges like unstable instruction compliance, modality gaps, and high inference latency, leading to textual noise and limiting their effectiveness in recommender systems. To address these challenges, we propose UserIP-Tuning, which uses prompt-tuning to infer user profiles. It integrates the causal relationship between user profiles and behavior sequences into LLMs’ prompts and employs expectation maximization to infer the embedded latent profile, minimizing textual noise by fixing the prompt template. Furthermore, a profile quantization codebook bridges the modality gap by categorizing profile embeddings into collaborative IDs, which are pre-stored for online deployment. This improves time efficiency and reduces memory usage. Experiments on four public datasets show that UserIP-Tuning outperforms state-of-the-art recommendation algorithms. Additional tests and case studies confirm its effectiveness, robustness, and transferability.

[IR-8] Learned Ranking Function: From Short-term Behavior Predictions to Long-term User Satisfaction RECSYS24

链接: https://arxiv.org/abs/2408.06512
作者: Yi Wu,Daryl Chang,Jennifer She,Zhe Zhao,Li Wei,Lukasz Heldt
关键词-EN: Learned Ranking Function, Learned Ranking, short-term user-item behavior, user-item behavior predictions, Ranking Function
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注: RecSys 24

点击查看摘要

Abstract:We present the Learned Ranking Function (LRF), a system that takes short-term user-item behavior predictions as input and outputs a slate of recommendations that directly optimizes for long-term user satisfaction. Most previous work is based on optimizing the hyperparameters of a heuristic function. We propose to model the problem directly as a slate optimization problem with the objective of maximizing long-term user satisfaction. We also develop a novel constraint optimization algorithm that stabilizes objective trade-offs for multi-objective optimization. We evaluate our approach with live experiments and describe its deployment on YouTube.

[IR-9] Modality-Balanced Learning for Multimedia Recommendation

Link: https://arxiv.org/abs/2408.06360
Authors: Jinghao Zhang, Guofan Liu, Qiang Liu, Shu Wu, Liang Wang
Keywords (EN): filtering framework effectively, traditional collaborative filtering, collaborative filtering framework, incorporate multimodal content, multimodal content information
Categories: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
Comments: ACM Multimedia 2024 (Oral)

Abstract:Many recommender models have been proposed to investigate how to effectively incorporate multimodal content information into the traditional collaborative filtering framework. The use of multimodal information is expected to provide more comprehensive information and lead to superior performance. However, the integration of multiple modalities often encounters the modal imbalance problem: since the information in different modalities is unbalanced, optimizing the same objective across all modalities leads to under-optimization of the weak modalities, with a slower convergence rate or lower performance. Even worse, we find that in multimodal recommendation models, all modalities suffer from insufficient optimization. To address these issues, we propose a Counterfactual Knowledge Distillation method that solves the imbalance problem and makes the best use of all modalities. Through modality-specific knowledge distillation, it guides the multimodal model to learn modality-specific knowledge from uni-modal teachers. We also design a novel generic-and-specific distillation loss to guide the multimodal student to learn broader and deeper knowledge from teachers. Additionally, to adaptively recalibrate the focus of the multimodal model towards weaker modalities during training, we estimate the causal effect of each modality on the training objective using counterfactual inference techniques, through which we can determine the weak modalities, quantify the imbalance degree, and re-weight the distillation loss accordingly. Our method can serve as a plug-and-play module for both late-fusion and early-fusion backbones. Extensive experiments on six backbones show that our proposed method improves performance by a large margin. The source code will be released at this https URL.
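The re-weighting step described at the end of the abstract — giving weaker modalities a larger share of the distillation loss — can be sketched as inverse-effect weighting. The paper's actual counterfactual estimator is not specified in the abstract; everything below (function name, inverse-proportional scheme, toy numbers) is an illustrative assumption:

```python
def reweight_distillation_losses(losses, causal_effects):
    """Weight per-modality distillation losses inversely to each
    modality's estimated causal effect on the training objective,
    so weaker (under-optimized) modalities get more weight.
    Weights are normalized to sum to 1."""
    inv = [1.0 / max(e, 1e-8) for e in causal_effects]  # guard against zero effect
    total = sum(inv)
    weights = [v / total for v in inv]
    combined = sum(w * l for w, l in zip(weights, losses))
    return combined, weights

# Hypothetical per-modality distillation losses and causal-effect estimates.
losses = [0.9, 0.4]   # image, text
effects = [0.2, 0.8]  # text dominates training; image is the weak modality
total_loss, weights = reweight_distillation_losses(losses, effects)
print([round(w, 2) for w in weights])  # → [0.8, 0.2]: image gets the larger weight
```

The key design point mirrored here is that the weighting is derived from each modality's measured contribution to the objective, not set as a fixed hyperparameter.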

[IR-10] Deep Learning based Key Information Extraction from Business Documents: Systematic Literature Review

Link: https://arxiv.org/abs/2408.06345
Authors: Alexander Rombach, Peter Fettke
Keywords (EN): Extracting key information, Key Information Extraction, Extracting key, key information, process automation
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 52 pages, 7 figures, 9 tables; Submitted to ACM Computing Surveys

Abstract:Extracting key information from documents represents a large portion of business workloads and therefore offers high potential for efficiency improvements and process automation. With recent advances in deep learning, a plethora of deep learning-based approaches for Key Information Extraction have been proposed under the umbrella term Document Understanding, enabling the processing of complex business documents. The goal of this systematic literature review is an in-depth analysis of existing approaches in this domain and the identification of opportunities for further research. To this end, 96 approaches published between 2017 and 2023 are analyzed in this study.

[IR-11] Accuracy and Political Bias of News Source Credibility Ratings by Large Language Models

Link: https://arxiv.org/abs/2304.00228
Authors: Kai-Cheng Yang, Filippo Menczer
Keywords (EN): Search engines increasingly, generate direct answers, engines increasingly leverage, increasingly leverage large, leverage large language
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Information Retrieval (cs.IR)
Comments: 11 pages, 8 figures

Abstract:Search engines increasingly leverage large language models (LLMs) to generate direct answers, and AI chatbots now access the Internet for fresh data. As information curators for billions of users, LLMs must assess the accuracy and reliability of different sources. This paper audits eight widely used LLMs from three major providers – OpenAI, Google, and Meta – to evaluate their ability to discern credible and high-quality information sources from low-credibility ones. We find that while LLMs can rate most tested news outlets, larger models more frequently refuse to provide ratings due to insufficient information, whereas smaller models are more prone to hallucination in their ratings. For sources where ratings are provided, LLMs exhibit a high level of agreement among themselves (average Spearman's ρ = 0.81), but their ratings align only moderately with human expert evaluations (average ρ = 0.59). Analyzing news sources with different political leanings in the US, we observe a liberal bias in credibility ratings yielded by all LLMs in default configurations. Additionally, assigning partisan identities to LLMs consistently results in strong politically congruent bias in the ratings. These findings have important implications for the use of LLMs in curating news and political information.
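The inter-rater agreement metric used in this abstract, Spearman's ρ, is the Pearson correlation of the ranks of two rating lists. A minimal implementation (assuming no tied ratings, and with made-up credibility scores that are not from the paper):

```python
def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks.
    Assumes no ties, as in simple ordinal rating comparisons."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, i in enumerate(order):
            r[i] = rank
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

# Two hypothetical models rating five news sources on a 0-1 credibility scale.
model_a = [0.9, 0.7, 0.4, 0.8, 0.2]
model_b = [0.85, 0.6, 0.5, 0.9, 0.1]
print(round(spearman_rho(model_a, model_b), 2))  # → 0.9
```

Because it compares ranks rather than raw values, ρ measures whether two raters order the sources the same way even if their rating scales differ, which is why it suits the model-vs-model and model-vs-expert comparisons reported above.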
