本篇博文主要展示 2024-08-09 从 arXiv.org 论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR五个大方向区分。若需要邮件定时接收,请在评论区留下你的邮箱地址。

说明:每日论文数据从Arxiv.org获取,每天早上10:30左右定时自动更新。

友情提示: 如果您需要通过邮箱接收每日论文数据,请在评论处留下你的邮箱地址,同样每天10:30左右定时自动发送邮件。

目录

概览 (2024-08-09)

今日共更新340篇论文,其中:

  • 自然语言处理55篇(Computation and Language (cs.CL))
  • 人工智能76篇(Artificial Intelligence (cs.AI))
  • 计算机视觉71篇(Computer Vision and Pattern Recognition (cs.CV))
  • 机器学习92篇(Machine Learning (cs.LG))

自然语言处理

[NLP-0] Arctic-TILT. Business Document Understanding at Sub-Billion Scale
[NLP-0] Arctic-TILT:亚十亿参数规模的商业文档理解

链接: https://arxiv.org/abs/2408.04632
作者: Łukasz Borchmann,Michał Pietruszka,Wojciech Jaśkowski,Dawid Jurkiewicz,Piotr Halama,Paweł Józiak,Łukasz Garncarek,Paweł Liskowski,Karolina Szyndler,Andrzej Gretkowski,Julita Ołtusek,Gabriela Nowakowska,Artur Zawłocki,Łukasz Duhr,Paweł Dyda,Michał Turski
关键词-EN: workloads employing LLMs, employing LLMs involves, LLMs involves answering, involves answering questions, answering questions grounded
关键词-ZN: 使用LLM的工作负载,使用LLM涉及,LLM涉及回答,涉及回答问题,回答有依据的问题
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The vast portion of workloads employing LLMs involves answering questions grounded on PDF or scan content. We introduce the Arctic-TILT achieving accuracy on par with models 1000× its size on these use cases. It can be fine-tuned and deployed on a single 24GB GPU, lowering operational costs while processing Visually Rich Documents with up to 400k tokens. The model establishes state-of-the-art results on seven diverse Document Understanding benchmarks, as well as provides reliable confidence scores and quick inference, which are essential for processing files in large-scale or time-sensitive enterprise environments.
摘要:使用LLM的绝大部分工作负载涉及回答基于PDF或扫描件内容的问题。我们推出了Arctic-TILT,在这些用例上实现了与其1000倍规模的模型相当的准确性。它可以在单个24GB GPU上进行微调和部署,从而降低运营成本,同时处理多达40万个词元的视觉丰富文档(Visually Rich Documents)。该模型在七个不同的文档理解基准上取得了最先进的结果,并提供可靠的置信度分数和快速推理,这对于在大规模或时间敏感的企业环境中处理文件至关重要。

[NLP-1] LogogramNLP: Comparing Visual and Textual Representations of Ancient Logographic Writing Systems for NLP
[NLP-1] LogogramNLP:面向NLP比较古代语标文字系统的视觉与文本表示

链接: https://arxiv.org/abs/2408.04628
作者: Danlu Chen,Freda Shi,Aditi Agarwal,Jacobo Myerston,Taylor Berg-Kirkpatrick
关键词-EN: Standard natural language, Standard natural, ancient logographic languages, discrete tokens, operate on symbolic
关键词-ZN: 标准自然语言、标准自然、古老的徽标语言、离散符号、以符号为基础运作
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Standard natural language processing (NLP) pipelines operate on symbolic representations of language, which typically consist of sequences of discrete tokens. However, creating an analogous representation for ancient logographic writing systems is an extremely labor intensive process that requires expert knowledge. At present, a large portion of logographic data persists in a purely visual form due to the absence of transcription – this issue poses a bottleneck for researchers seeking to apply NLP toolkits to study ancient logographic languages: most of the relevant data are images of writing. This paper investigates whether direct processing of visual representations of language offers a potential solution. We introduce LogogramNLP, the first benchmark enabling NLP analysis of ancient logographic languages, featuring both transcribed and visual datasets for four writing systems along with annotations for tasks like classification, translation, and parsing. Our experiments compare systems that employ recent visual and text encoding strategies as backbones. The results demonstrate that visual representations outperform textual representations for some investigated tasks, suggesting that visual processing pipelines may unlock a large amount of cultural heritage data of logographic languages for NLP-based analyses.
Journal reference: ACL 2024, long paper
摘要:标准自然语言处理(NLP)管道在语言的符号表示上运行,这种表示通常由离散词元序列组成。然而,为古代语标文字系统创建类似的表示是一个需要专业知识的高度劳动密集型过程。目前,由于缺乏转写,很大一部分语标文字数据仍以纯视觉形式存在,这一问题给希望应用NLP工具包研究古代语标语言的研究人员带来了瓶颈:大多数相关数据都是文字的图像。本文探讨了直接处理语言的视觉表示是否是一种潜在的解决方案。我们介绍了LogogramNLP,这是第一个支持对古代语标语言进行NLP分析的基准,包含四种文字系统的转写数据集和视觉数据集,以及分类、翻译和句法分析等任务的标注。我们的实验比较了采用最新视觉和文本编码策略作为主干的系统。结果表明,在部分被考察的任务中,视觉表示优于文本表示,这意味着视觉处理管道有望为基于NLP的分析解锁大量语标语言的文化遗产数据。期刊参考:ACL 2024,长文

[NLP-2] Transformer Explainer: Interactive Learning of Text-Generative Models IEEE-VIS2024
[NLP-2] Transformer Explainer:文本生成模型的交互式学习

链接: https://arxiv.org/abs/2408.04619
作者: Aeree Cho,Grace C. Kim,Alexander Karpekov,Alec Helbling,Zijie J. Wang,Seongmin Lee,Benjamin Hoover,Duen Horng Chau
关键词-EN: revolutionized machine learning, workings remain opaque, present Transformer Explainer, machine learning, revolutionized machine
关键词-ZN: 革命性的机器学习,工作仍然不透明,当前Transformer解释器,机器学习,革命性的机器
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: To be presented at IEEE VIS 2024

点击查看摘要

Abstract:Transformers have revolutionized machine learning, yet their inner workings remain opaque to many. We present Transformer Explainer, an interactive visualization tool designed for non-experts to learn about Transformers through the GPT-2 model. Our tool helps users understand complex Transformer concepts by integrating a model overview and enabling smooth transitions across abstraction levels of mathematical operations and model structures. It runs a live GPT-2 instance locally in the user’s browser, empowering users to experiment with their own input and observe in real-time how the internal components and parameters of the Transformer work together to predict the next tokens. Our tool requires no installation or special hardware, broadening the public’s education access to modern generative AI techniques. Our open-sourced tool is available at this https URL. A video demo is available at this https URL.
摘要:Transformer彻底改变了机器学习,但其内部运作对许多人来说仍不透明。我们介绍了Transformer Explainer,这是一个为非专业人士设计的交互式可视化工具,可通过GPT-2模型学习Transformer。我们的工具通过集成模型概览并支持在数学运算与模型结构的不同抽象层级之间平滑切换,帮助用户理解复杂的Transformer概念。它在用户浏览器中本地运行一个实时GPT-2实例,使用户能够用自己的输入进行实验,并实时观察Transformer的内部组件和参数如何协同预测下一个词元。我们的工具无需安装或特殊硬件,扩大了公众接受现代生成式AI技术教育的渠道。我们的开源工具可在此 https URL 获取。视频演示可在此 https URL 获取。

[NLP-3] Better Alignment with Instruction Back-and-Forth Translation
[NLP-3] 通过指令来回翻译实现更好的对齐

链接: https://arxiv.org/abs/2408.04614
作者: Thao Nguyen,Jeffrey Li,Sewoong Oh,Ludwig Schmidt,Jason Weston,Luke Zettlemoyer,Xian Li
关键词-EN: large language models, aligning large language, construct high-quality synthetic, high-quality synthetic data, synthetic data grounded
关键词-ZN: 大型语言模型,对齐大型语言,构建高质量合成,高质量合成数据,合成数据接地
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We propose a new method, instruction back-and-forth translation, to construct high-quality synthetic data grounded in world knowledge for aligning large language models (LLMs). Given documents from a web corpus, we generate and curate synthetic instructions using the backtranslation approach proposed by Li et al.(2023a), and rewrite the responses to improve their quality further based on the initial documents. Fine-tuning with the resulting (backtranslated instruction, rewritten response) pairs yields higher win rates on AlpacaEval than using other common instruction datasets such as Humpback, ShareGPT, Open Orca, Alpaca-GPT4 and Self-instruct. We also demonstrate that rewriting the responses with an LLM outperforms direct distillation, and the two generated text distributions exhibit significant distinction in embedding space. Further analysis shows that our backtranslated instructions are of higher quality than other sources of synthetic instructions, while our responses are more diverse and complex than those obtained from distillation. Overall we find that instruction back-and-forth translation combines the best of both worlds – making use of the information diversity and quantity found on the web, while ensuring the quality of the responses which is necessary for effective alignment.
摘要:我们提出了一种新方法,即指令来回翻译(instruction back-and-forth translation),用于构建基于世界知识的高质量合成数据,以对齐大型语言模型(LLM)。给定网络语料库中的文档,我们使用Li等人(2023a)提出的回译方法生成并筛选合成指令,并基于原始文档重写响应以进一步提高其质量。与使用Humpback、ShareGPT、Open Orca、Alpaca-GPT4和Self-instruct等其他常见指令数据集相比,使用所得的(回译指令、重写响应)对进行微调可在AlpacaEval上获得更高的胜率。我们还证明,使用LLM重写响应优于直接蒸馏,且两种生成文本的分布在嵌入空间中表现出显著差异。进一步的分析表明,我们的回译指令比其他来源的合成指令质量更高,而我们的响应比蒸馏得到的响应更加多样和复杂。总体而言,我们发现指令来回翻译兼得两者之长:既利用了网络上信息的多样性和数量,又确保了有效对齐所必需的响应质量。

[NLP-4] Code-switching in text and speech reveals information-theoretic audience design
[NLP-4] 文本和言语中的代码转换揭示了信息论受众设计

链接: https://arxiv.org/abs/2408.04596
作者: Debasmita Bhattacharya,Marten van Schijndel
关键词-EN: primary language, high primary language, secondary language, language, modeling to investigate
关键词-ZN: 主要语言,高主要语言,第二语言,语言,用建模来研究
类目: Computation and Language (cs.CL)
备注: Submitted to Journal of Memory and Language on 7 June 2024

点击查看摘要

Abstract:In this work, we use language modeling to investigate the factors that influence code-switching. Code-switching occurs when a speaker alternates between one language variety (the primary language) and another (the secondary language), and is widely observed in multilingual contexts. Recent work has shown that code-switching is often correlated with areas of high information load in the primary language, but it is unclear whether high primary language load only makes the secondary language relatively easier to produce at code-switching points (speaker-driven code-switching), or whether code-switching is additionally used by speakers to signal the need for greater attention on the part of listeners (audience-driven code-switching). In this paper, we use bilingual Chinese-English online forum posts and transcripts of spontaneous Chinese-English speech to replicate prior findings that high primary language (Chinese) information load is correlated with switches to the secondary language (English). We then demonstrate that the information load of the English productions is even higher than that of meaning equivalent Chinese alternatives, and these are therefore not easier to produce, providing evidence of audience-driven influences in code-switching at the level of the communication channel, not just at the sociolinguistic level, in both writing and speech.
摘要:在这项工作中,我们使用语言模型来研究影响语码转换的因素。当说话人在一种语言变体(主要语言)和另一种语言变体(第二语言)之间转换时,就会发生语码转换,这在多语言环境中广泛存在。最近的研究表明,语码转换往往与母语中信息负荷高的区域相关,但尚不清楚的是,高的母语负荷是否只是使第二语言在语码切换点更容易产生(说话者驱动的语码转换),还是说话者额外使用语码转换来表明听话者需要更多的关注(听众驱动的语码转换)。在本文中,我们使用汉英双语在线论坛帖子和自发性汉英语音的文本来复制先前的发现,即高的母语(汉语)信息负荷与向第二语言(英语)的转换相关。然后,我们证明了英语作品的信息量甚至比意义对等的汉语作品更高,因此这些都不容易产生,这为受众驱动的语码转换在交际渠道层面上的影响提供了证据,而不仅仅是在社会语言层面上,无论是在写作还是在言语方面。
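摘要中的"信息负荷"通常以语言模型下的逐词元惊异度(surprisal)来量化。下面给出一个极简示意(假设性的二元语法概率,数值纯属示例,并非论文实际使用的语言模型):

```python
import math

# 玩具二元语法概率 P(w_t | w_{t-1});数值纯属示例,
# 并非论文中实际语言模型的估计值。
BIGRAM_P = {
    ("we", "went"): 0.20,
    ("went", "to"): 0.50,
    ("to", "the"): 0.40,
    ("the", "store"): 0.05,
}

def surprisal(prev_word: str, word: str) -> float:
    """词元在上下文中的信息负荷(比特):-log2 P(word | prev)。"""
    p = BIGRAM_P.get((prev_word, word), 0.001)  # 未见过的二元组回退到一个小概率
    return -math.log2(p)

sentence = ["we", "went", "to", "the", "store"]
loads = [surprisal(a, b) for a, b in zip(sentence, sentence[1:])]
# 越罕见的接续携带的信息越多:"store" 接在 "the" 后(P=0.05)
# 的惊异度高于 "to" 接在 "went" 后(P=0.5)。
```

在语码转换研究中,即可在此类逐词元负荷序列上检验切换点附近的主要语言负荷是否系统性偏高。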

[NLP-5] Towards Resilient and Efficient LLMs: A Comparative Study of Efficiency, Performance and Adversarial Robustness
[NLP-5] 迈向韧性且高效的LLM:效率、性能与对抗稳健性的比较研究

链接: https://arxiv.org/abs/2408.04585
作者: Xiaojing Fan,Chunliang Tao
关键词-EN: Large Language Models, Large Language, Gated Linear Attention, Language Models, computational cost
关键词-ZN: 大型语言模型、大型语言、门控线性注意力、语言模型、计算成本
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:With the increasing demand for practical applications of Large Language Models (LLMs), many attention-efficient models have been developed to balance performance and computational cost. However, the adversarial robustness of these models remains under-explored. In this work, we design a framework to investigate the trade-off between efficiency, performance, and adversarial robustness of LLMs by comparing three prominent models with varying levels of complexity and efficiency – Transformer++, Gated Linear Attention (GLA) Transformer, and MatMul-Free LM – utilizing the GLUE and AdvGLUE datasets. The AdvGLUE dataset extends the GLUE dataset with adversarial samples designed to challenge model robustness. Our results show that while the GLA Transformer and MatMul-Free LM achieve slightly lower accuracy on GLUE tasks, they demonstrate higher efficiency and either superior or comparative robustness on AdvGLUE tasks compared to Transformer++ across different attack levels. These findings highlight the potential of simplified architectures to achieve a compelling balance between efficiency, performance, and adversarial robustness, offering valuable insights for applications where resource constraints and resilience to adversarial attacks are critical.
摘要:随着大型语言模型(LLM)实际应用需求的日益增长,人们开发了许多注意力高效的模型来平衡性能与计算成本。然而,这些模型的对抗稳健性仍未得到充分研究。在这项工作中,我们设计了一个框架,利用GLUE和AdvGLUE数据集,比较三种复杂度和效率各异的代表性模型(Transformer++、门控线性注意力(GLA)Transformer和MatMul-Free LM),以研究LLM在效率、性能和对抗稳健性之间的权衡。AdvGLUE数据集在GLUE数据集的基础上加入了旨在挑战模型稳健性的对抗样本。我们的结果表明,虽然GLA Transformer和MatMul-Free LM在GLUE任务上的准确率略低,但在不同攻击强度下,它们在AdvGLUE任务上表现出更高的效率,并且稳健性优于或相当于Transformer++。这些发现突显了简化架构在效率、性能和对抗稳健性之间取得令人信服的平衡的潜力,为资源受限且需抵御对抗攻击的应用提供了宝贵的见解。

[NLP-6] SCENE: Evaluating Explainable AI Techniques Using Soft Counterfactuals
[NLP-6] SCENE:使用软反事实评估可解释AI技术

链接: https://arxiv.org/abs/2408.04575
作者: Haoran Zheng,Utku Pamuksuz
关键词-EN: Explainable Artificial Intelligence, Explainable Artificial, Artificial Intelligence, natural language processing, Natural language Explainability
关键词-ZN: 可解释人工智能,可解释人工,人工智能,自然语言处理,自然语言可解释性
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 10 pages, 5 tables

点击查看摘要

Abstract:Explainable Artificial Intelligence (XAI) is essential for enhancing the transparency and accountability of AI models, especially in natural language processing (NLP) tasks. This paper introduces SCENE (Soft Counterfactual Evaluation for Natural language Explainability), a novel evaluation method that leverages large language models (LLMs) to generate Soft Counterfactual explanations in a zero-shot manner. By focusing on token-based substitutions, SCENE creates contextually appropriate and semantically meaningful Soft Counterfactuals without extensive fine-tuning. SCENE adopts Validitysoft and Csoft metrics to evaluate the effectiveness of model-agnostic XAI methods in text classification tasks. Applied to CNN, RNN, and BERT architectures, SCENE provides valuable insights into the strengths and limitations of various XAI techniques.
摘要:可解释人工智能(XAI)对于增强AI模型的透明度和可问责性至关重要,尤其是在自然语言处理(NLP)任务中。本文介绍了SCENE(面向自然语言可解释性的软反事实评估),这是一种新颖的评估方法,利用大型语言模型(LLM)以零样本方式生成软反事实解释。通过专注于基于词元的替换,SCENE无需大量微调即可创建上下文合适且语义有意义的软反事实。SCENE采用Validitysoft和Csoft指标来评估模型无关的XAI方法在文本分类任务中的有效性。应用于CNN、RNN和BERT架构,SCENE为各种XAI技术的优势和局限性提供了宝贵的见解。
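摘要所述"基于词元的替换"的核心思想可用如下草图说明:将某个词元替换为一个候选替代词,并测量分类器得分的变化。其中的分类器和函数名均为本文的示意性假设;SCENE 实际使用 LLM 生成替换并作用于真实 NLP 模型:

```python
def toy_sentiment(tokens):
    """虚构分类器:正面词典词元所占比例。"""
    positive = {"great", "good", "excellent"}
    return sum(t in positive for t in tokens) / len(tokens)

def counterfactual_impact(tokens, index, substitute):
    """将 tokens[index] 替换为 substitute 后分类得分的变化量。"""
    original = toy_sentiment(tokens)
    edited = tokens[:index] + [substitute] + tokens[index + 1:]
    return toy_sentiment(edited) - original

sent = ["the", "movie", "was", "great"]
delta = counterfactual_impact(sent, 3, "bland")
# 替换承载情感的词元会使得分下降,表明该词元对预测很重要;
# 替换无关词元则几乎不改变得分。
```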

[NLP-7] Learning Fine-Grained Grounded Citations for Attributed Large Language Models ACL2024
[NLP-7] 为带归因的大型语言模型学习细粒度有据引用

链接: https://arxiv.org/abs/2408.04568
作者: Lei Huang,Xiaocheng Feng,Weitao Ma,Yuxuan Gu,Weihong Zhong,Xiachong Feng,Weijiang Yu,Weihua Peng,Duyu Tang,Dandan Tu,Bing Qin
关键词-EN: large language models, information-seeking tasks, large language, impressive performance, performance on information-seeking
关键词-ZN: 大型语言模型、信息查找任务、大型语言、令人印象深刻的性能、信息查找性能
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by ACL 2024 Findings

点击查看摘要

Abstract:Despite the impressive performance on information-seeking tasks, large language models (LLMs) still struggle with hallucinations. Attributed LLMs, which augment generated text with in-line citations, have shown potential in mitigating hallucinations and improving verifiability. However, current approaches suffer from suboptimal citation quality due to their reliance on in-context learning. Furthermore, the practice of citing only coarse document identifiers makes it challenging for users to perform fine-grained verification. In this work, we introduce FRONT, a training framework designed to teach LLMs to generate Fine-Grained Grounded Citations. By grounding model outputs in fine-grained supporting quotes, these quotes guide the generation of grounded and consistent responses, not only improving citation quality but also facilitating fine-grained verification. Experiments on the ALCE benchmark demonstrate the efficacy of FRONT in generating superior grounded responses and highly supportive citations. With LLaMA-2-7B, the framework significantly outperforms all the baselines, achieving an average of 14.21% improvement in citation quality across all datasets, even surpassing ChatGPT.
摘要:尽管在信息检索任务上表现出色,大型语言模型(LLM)仍然受幻觉困扰。带归因的LLM通过在生成文本中加入内联引用,在缓解幻觉和提高可验证性方面显示出潜力。然而,由于依赖上下文学习,现有方法的引用质量欠佳。此外,只引用粗粒度文档标识符的做法使用户难以进行细粒度验证。在这项工作中,我们提出了FRONT,一个旨在教LLM生成细粒度有据引用的训练框架。通过将模型输出锚定在细粒度的支持性引文上,这些引文引导模型生成有据且一致的回应,不仅提高了引用质量,还便于细粒度验证。在ALCE基准上的实验证明了FRONT在生成更优的有据回应和高支持度引用方面的有效性。使用LLaMA-2-7B时,该框架显著优于所有基线,在所有数据集上引用质量平均提升14.21%,甚至超过了ChatGPT。

[NLP-8] Conversational Prompt Engineering
[NLP-8] 对话提示工程

链接: https://arxiv.org/abs/2408.04560
作者: Liat Ein-Dor,Orith Toledo-Ronen,Artem Spector,Shai Gretz,Lena Dankin,Alon Halfon,Yoav Katz,Noam Slonim
关键词-EN: Conversational Prompt Engineering, humans communicate, prompt engineering, prompt, Prompts
关键词-ZN: 对话提示工程,人类沟通,提示工程,提示,提示
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Prompts are how humans communicate with LLMs. Informative prompts are essential for guiding LLMs to produce the desired output. However, prompt engineering is often tedious and time-consuming, requiring significant expertise, limiting its widespread use. We propose Conversational Prompt Engineering (CPE), a user-friendly tool that helps users create personalized prompts for their specific tasks. CPE uses a chat model to briefly interact with users, helping them articulate their output preferences and integrating these into the prompt. The process includes two main stages: first, the model uses user-provided unlabeled data to generate data-driven questions and utilize user responses to shape the initial instruction. Then, the model shares the outputs generated by the instruction and uses user feedback to further refine the instruction and the outputs. The final result is a few-shot prompt, where the outputs approved by the user serve as few-shot examples. A user study on summarization tasks demonstrates the value of CPE in creating personalized, high-performing prompts. The results suggest that the zero-shot prompt obtained is comparable to its - much longer - few-shot counterpart, indicating significant savings in scenarios involving repetitive tasks with large text volumes.
摘要:提示是人类与LLM交流的方式。信息充分的提示对于引导LLM产生所需输出至关重要。然而,提示工程往往繁琐耗时,需要大量专业知识,限制了其广泛应用。我们提出了对话式提示工程(CPE),这是一个用户友好的工具,帮助用户为其特定任务创建个性化提示。CPE使用聊天模型与用户进行简短交互,帮助他们阐明输出偏好,并将其整合到提示中。该过程包括两个主要阶段:首先,模型使用用户提供的未标注数据生成数据驱动的问题,并利用用户的回答来塑造初始指令;然后,模型展示该指令生成的输出,并根据用户反馈进一步完善指令和输出。最终结果是一个少样本提示,其中经用户认可的输出充当少样本示例。一项关于摘要任务的用户研究证明了CPE在创建个性化、高性能提示方面的价值。结果表明,所得的零样本提示与长得多的少样本提示效果相当,这意味着在涉及大文本量重复任务的场景中可显著节省成本。

[NLP-9] Bias-Aware Low-Rank Adaptation: Mitigating Catastrophic Inheritance of Large Language Models
[NLP-9] 偏差感知低秩适应:缓解大型语言模型的灾难性继承

链接: https://arxiv.org/abs/2408.04556
作者: Yupeng Chang,Yi Chang,Yuan Wu
关键词-EN: exhibited remarkable proficiency, Large language models, Large language, exhibited remarkable, remarkable proficiency
关键词-ZN: 表现出非凡的熟练程度,大型语言模型,大型语言,表现出非凡的熟练程度
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Work in progress

点击查看摘要

Abstract:Large language models (LLMs) have exhibited remarkable proficiency across a diverse array of natural language processing (NLP) tasks. However, adapting LLMs to downstream applications typically necessitates computationally intensive and memory-demanding fine-tuning procedures. To mitigate these burdens, parameter-efficient fine-tuning (PEFT) techniques have emerged as a promising approach to tailor LLMs with minimal computational overhead. While PEFT methods offer substantial advantages, they do not fully address the pervasive issue of bias propagation from pre-training data. In this work, we introduce Bias-Aware Low-Rank Adaptation (BA-LoRA), a novel PEFT method designed to counteract bias inheritance. BA-LoRA incorporates three distinct regularization terms: (1) consistency regularizer, (2) diversity regularizer, and (3) singular vector decomposition regularizer. These regularizers collectively aim to improve the generative models’ consistency, diversity, and generalization capabilities during the fine-tuning process. Through extensive experiments on a variety of natural language understanding (NLU) and natural language generation (NLG) tasks, employing prominent LLMs such as LLaMA, Mistral, and Gemma, we demonstrate that BA-LoRA surpasses the performance of LoRA and its state-of-the-art variants. Moreover, our method effectively mitigates the deleterious effects of pre-training bias, leading to more reliable and robust model outputs. The code is available at this https URL.
摘要:大型语言模型(LLM)在一系列自然语言处理(NLP)任务中表现出非凡的能力。然而,使LLM适配下游应用通常需要计算密集且内存消耗大的微调过程。为了减轻这些负担,参数高效微调(PEFT)技术应运而生,成为一种以最小计算开销定制LLM的有前途的方法。虽然PEFT方法具有显著优势,但它们并不能完全解决预训练数据中偏差传播这一普遍问题。在这项工作中,我们介绍了偏差感知低秩适应(BA-LoRA),一种旨在抵消偏差继承的新型PEFT方法。BA-LoRA包含三个不同的正则化项:(1)一致性正则化项,(2)多样性正则化项,(3)奇异向量分解正则化项。这些正则化项共同致力于提高生成模型在微调过程中的一致性、多样性和泛化能力。通过在各种自然语言理解(NLU)和自然语言生成(NLG)任务上使用LLaMA、Mistral和Gemma等知名LLM进行的大量实验,我们证明BA-LoRA的性能超过了LoRA及其最先进的变体。此外,我们的方法有效缓解了预训练偏差的有害影响,使模型输出更可靠、更稳健。代码可在此 https URL 获取。
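摘要给出了三个正则化项的名称,但未给出其具体形式。微调目标的一个示意性写法如下(权重 λ 与记号 R 均为本文的假设,并非论文原文):

```latex
\mathcal{L}_{\text{BA-LoRA}}
  = \mathcal{L}_{\text{task}}
  + \lambda_{1}\,\mathcal{R}_{\text{consistency}}
  + \lambda_{2}\,\mathcal{R}_{\text{diversity}}
  + \lambda_{3}\,\mathcal{R}_{\text{SVD}}
```

即在任务损失之外,按一定权重叠加一致性、多样性和奇异向量分解三个正则化项。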

[NLP-10] Molye: A Corpus-based Approach to Language Contact in Colonial France
[NLP-10] Molyé:基于语料库研究殖民时期法国语言接触的方法

链接: https://arxiv.org/abs/2408.04554
作者: Rasul Dent,Juliette Janès,Thibault Clérice,Pedro Ortiz Suarez,Benoît Sagot
关键词-EN: considered genetic descendants, descendants of European, French-based Creole languages, early modern period, intense debate
关键词-ZN: 被认为是遗传后裔,欧洲、以法语为基础的克里奥尔语的后裔,现代早期,激烈的争论
类目: Computation and Language (cs.CL)
备注: 8 main pages and 3 pages of references

点击查看摘要

Abstract:Whether or not several Creole languages which developed during the early modern period can be considered genetic descendants of European languages has been the subject of intense debate. This is in large part due to the absence of evidence of intermediate forms. This work introduces a new open corpus, the Molyé corpus, which combines stereotypical representations of three kinds of language variation in Europe with early attestations of French-based Creole languages across a period of 400 years. It is intended to facilitate future research on the continuity between contact situations in Europe and Creolophone (former) colonies.
摘要:近代早期发展起来的几种克里奥尔语是否可以被视为欧洲语言的谱系后裔,一直是激烈争论的话题。这在很大程度上是由于缺乏中间形态的证据。这项工作引入了一个新的开放语料库,即Molyé语料库,它将欧洲三类语言变异的程式化书面呈现与400年间基于法语的克里奥尔语的早期书面证据结合在一起。其目的是促进未来对欧洲接触情形与克里奥尔语(前)殖民地之间连续性的研究。

[NLP-11] MemeMind at ArAIEval Shared Task: Spotting Persuasive Spans in Arabic Text with Persuasion Techniques Identification
[NLP-11] ArAIEval的MemeMind共享任务:使用说服技术识别在阿拉伯文本中发现说服性Span

链接: https://arxiv.org/abs/2408.04540
作者: Md Rafiul Biswas,Zubair Shah,Wajdi Zaghouani
关键词-EN: detecting propagandistic spans, Arabic text, paper focuses, focuses on detecting, detecting propagandistic
关键词-ZN: 检测宣传范围,阿拉伯语文本,论文重点,重点检测,检测宣传
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper focuses on detecting propagandistic spans and persuasion techniques in Arabic text from tweets and news paragraphs. Each entry in the dataset contains a text sample and corresponding labels that indicate the start and end positions of propaganda techniques within the text. Tokens falling within a labeled span were assigned “B” (Begin) or “I” (Inside), “O”, corresponding to the specific propaganda technique. Using attention masks, we created uniform lengths for each span and assigned BIO tags to each token based on the provided labels. Then, we used AraBERT-base pre-trained model for Arabic text tokenization and embeddings with a token classification layer to identify propaganda techniques. Our training process involves a two-phase fine-tuning approach. First, we train only the classification layer for a few epochs, followed by full model fine-tuning, updating all parameters. This methodology allows the model to adapt to the specific characteristics of the propaganda detection task while leveraging the knowledge captured by the pre-trained AraBERT model. Our approach achieved an F1 score of 0.2774, securing the 3rd position in the leaderboard of Task 1.
摘要:本文的重点是从推文和新闻段落的阿拉伯语文本中检测宣传跨度和说服技巧。数据集中的每个条目都包含一个文本样本和相应的标签,这些标签标明宣传技巧在文本中的起止位置。落在标注跨度内的词元根据具体宣传技巧被标记为“B”(开始)或“I”(内部),其余标记为“O”。利用注意力掩码,我们为每个跨度创建了统一长度,并根据提供的标签为每个词元分配BIO标签。然后,我们使用AraBERT-base预训练模型进行阿拉伯语文本的分词和嵌入,并加上一个词元分类层来识别宣传技巧。我们的训练过程采用两阶段微调方法:首先只训练分类层几个轮次,然后对整个模型进行微调,更新所有参数。这一方法使模型能够适应宣传检测任务的具体特征,同时利用预训练AraBERT模型所捕获的知识。我们的方法取得了0.2774的F1分数,在任务1的排行榜上位列第三。
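上述由跨度标注到逐词元 BIO 标签的转换,可以用如下草图示意(分词方式与跨度格式为简化假设,论文实际使用 AraBERT 的子词分词器):

```python
def bio_tags(tokens, spans):
    """spans:由 (起始词元, 结束词元(不含), 宣传技巧) 组成的列表。"""
    tags = ["O"] * len(tokens)  # 默认:不在任何标注跨度内
    for start, end, technique in spans:
        tags[start] = f"B-{technique}"          # 跨度首词元
        for i in range(start + 1, end):
            tags[i] = f"I-{technique}"          # 跨度内部词元
    return tags

tokens = ["the", "enemy", "will", "destroy", "everything", "soon"]
spans = [(1, 5, "Loaded_Language")]
print(bio_tags(tokens, spans))
```

其中示例句子与 "Loaded_Language" 标签仅用于演示标签对齐方式。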

[NLP-12] Compromesso! Italian Many-Shot Jailbreaks Undermine the Safety of Large Language Models ACL2024
[NLP-12] Compromesso!意大利语多样本越狱破坏大型语言模型的安全性

链接: https://arxiv.org/abs/2408.04522
作者: Fabio Pernisi,Dirk Hovy,Paul Röttger
关键词-EN: diverse linguistic communities, users adopt large, adopt large language, diverse linguistic, linguistic communities
关键词-ZN: 多元化的语言社区,用户采用大的、采用大的语言,多元化的语言,语言社区
类目: Computation and Language (cs.CL)
备注: Accepted at ACL 2024 (Student Research Workshop)

点击查看摘要

Abstract:As diverse linguistic communities and users adopt large language models (LLMs), assessing their safety across languages becomes critical. Despite ongoing efforts to make LLMs safe, they can still be made to behave unsafely with jailbreaking, a technique in which models are prompted to act outside their operational guidelines. Research on LLM safety and jailbreaking, however, has so far mostly focused on English, limiting our understanding of LLM safety in other languages. We contribute towards closing this gap by investigating the effectiveness of many-shot jailbreaking, where models are prompted with unsafe demonstrations to induce unsafe behaviour, in Italian. To enable our analysis, we create a new dataset of unsafe Italian question-answer pairs. With this dataset, we identify clear safety vulnerabilities in four families of open-weight LLMs. We find that the models exhibit unsafe behaviors even when prompted with few unsafe demonstrations, and – more alarmingly – that this tendency rapidly escalates with more demonstrations.
摘要:随着不同的语言社区和用户采用大型语言模型(LLM),评估其跨语言的安全性变得至关重要。尽管人们不断努力使LLM更安全,但仍然可以通过越狱使其做出不安全的行为,越狱是一种促使模型做出超出其运行准则的行为的技术。然而,迄今为止关于LLM安全和越狱的研究主要集中在英语上,限制了我们对LLM在其他语言中安全性的理解。我们通过研究多样本越狱(即用不安全的演示示例提示模型以诱导不安全行为)在意大利语中的有效性,为缩小这一差距做出了贡献。为了支持我们的分析,我们创建了一个新的不安全意大利语问答对数据集。利用该数据集,我们在四个开放权重LLM系列中发现了明显的安全漏洞。我们发现,即使只用少量不安全演示示例提示,模型也会表现出不安全行为,而且更令人担忧的是,这种倾向会随着演示示例的增多而迅速升级。

[NLP-13] Can LLMs Beat Humans in Debating? A Dynamic Multi-agent Framework for Competitive Debate
[NLP-13] LLM能否在辩论中击败人类?一个面向竞争性辩论的动态多智能体框架

链接: https://arxiv.org/abs/2408.04472
作者: Yiqun Zhang,Xiaocui Yang,Shi Feng,Daling Wang,Yifei Zhang,Kaisong Song
关键词-EN: complex computational argumentation, computational argumentation task, Large Language Models, comprehensive and complex, complex computational
关键词-ZN: 复杂计算论证,计算论证任务,大型语言模型,全面而复杂,复杂计算
类目: Computation and Language (cs.CL)
备注: 9 pages, 3 figures

点击查看摘要

Abstract:Competitive debate is a comprehensive and complex computational argumentation task. Large Language Models (LLMs) encounter hallucinations and lack competitiveness in this task. To address these challenges, we introduce Agent for Debate (Agent4Debate), a dynamic, multi-agent framework based on LLMs designed to enhance their capabilities in competitive debate. Drawing inspiration from human behavior in debate preparation and execution, Agent4Debate employs a collaborative architecture where four specialized agents (Searcher, Analyzer, Writer, and Reviewer) dynamically interact and cooperate. These agents work throughout the debate process, covering multiple stages from initial research and argument formulation to rebuttal and summary. To comprehensively evaluate framework performance, we construct the Chinese Debate Arena, comprising 66 carefully selected Chinese debate motions. We recruite ten experienced human debaters and collect records of 200 debates involving Agent4Debate, baseline models, and humans. The evaluation employs the Debatrix automatic scoring system and professional human reviewers based on the established Debatrix-Elo and Human-Elo ranking. Experimental results indicate that the state-of-the-art Agent4Debate exhibits capabilities comparable to those of humans. Furthermore, ablation studies demonstrate the effectiveness of each component in the agent structure.
摘要:竞争性辩论是一项全面而复杂的计算论辩任务。大型语言模型(LLM)在这项任务中会出现幻觉且缺乏竞争力。为了应对这些挑战,我们引入了辩论智能体(Agent4Debate),这是一个基于LLM的动态多智能体框架,旨在增强LLM在竞争性辩论中的能力。Agent4Debate从人类在辩论准备和执行中的行为中汲取灵感,采用协作架构,由四个专门的智能体(搜索者、分析者、写作者和审查者)动态交互与合作。这些智能体贯穿整个辩论过程,涵盖从初步调研和论点构建到反驳和总结的多个阶段。为了全面评估框架性能,我们构建了中文辩论竞技场,包括精心挑选的66个中文辩题。我们招募了10名经验丰富的人类辩手,并收集了涉及Agent4Debate、基线模型和人类的200场辩论的记录。评估采用Debatrix自动评分系统和专业人类评审,基于既定的Debatrix-Elo和Human-Elo排名进行。实验结果表明,最先进的Agent4Debate表现出与人类相当的能力。此外,消融研究证明了智能体结构中每个组件的有效性。

[NLP-14] Crowd Intelligence for Early Misinformation Prediction on Social Media
[NLP-14] 社交媒体上早期错误信息预测的群体情报

链接: https://arxiv.org/abs/2408.04463
作者: Megha Sundriyal,Harshit Choudhary,Tanmoy Chakraborty,Md Shad Akhtar
关键词-EN: promoting dangerous behavior, influencing public opinion, Misinformation spreads rapidly, social media, causing serious damage
关键词-ZN: 宣扬危险行为,影响舆论,错误信息迅速传播,社交媒体,造成严重损害
类目: Computation and Language (cs.CL)
备注: This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:Misinformation spreads rapidly on social media, causing serious damage by influencing public opinion, promoting dangerous behavior, or eroding trust in reliable sources. It spreads too fast for traditional fact-checking, stressing the need for predictive methods. We introduce CROWDSHIELD, a crowd intelligence-based method for early misinformation prediction. We hypothesize that the crowd’s reactions to misinformation reveal its accuracy. Furthermore, we hinge upon exaggerated assertions/claims and replies with particular positions/stances on the source post within a conversation thread. We employ Q-learning to capture the two dimensions – stances and claims. We utilize deep Q-learning due to its proficiency in navigating complex decision spaces and effectively learning network properties. Additionally, we use a transformer-based encoder to develop a comprehensive understanding of both content and context. This multifaceted approach helps ensure the model pays attention to user interaction and stays anchored in the communication’s content. We propose MIST, a manually annotated misinformation detection Twitter corpus comprising nearly 200 conversation threads with more than 14K replies. In experiments, CROWDSHIELD outperformed ten baseline systems, achieving an improvement of ~4% macro-F1 score. We conduct an ablation study and error analysis to validate our proposed model’s performance. The source code and dataset are available at this https URL.
摘要:错误信息在社交媒体上迅速传播,通过影响舆论、助长危险行为或侵蚀对可靠信息来源的信任而造成严重损害。其传播速度之快使传统的事实核查难以应对,凸显了预测性方法的必要性。我们介绍了CROWDSHIELD,一种基于群体智能的早期错误信息预测方法。我们假设人群对错误信息的反应能够揭示其准确性。此外,我们依据对话线程中针对源帖子的夸张断言/主张以及持特定立场的回复。我们使用Q学习来捕捉这两个维度:立场和主张。我们采用深度Q学习,因为它擅长在复杂决策空间中导航并有效学习网络属性。此外,我们使用基于Transformer的编码器来全面理解内容和上下文。这种多方面的方法有助于确保模型既关注用户交互,又立足于交流内容。我们提出了MIST,一个人工标注的错误信息检测推特语料库,由近200个对话线程和超过1.4万条回复组成。在实验中,CROWDSHIELD的表现优于十个基线系统,宏F1分数提升约4%。我们进行了消融研究和误差分析以验证所提模型的性能。源代码和数据集可在此 https URL 获取。
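摘要中提到的Q学习可用表格式更新规则作一个最小示意(CROWDSHIELD 实际使用深度Q学习并作用于立场/主张信号;此处的状态名、动作名与奖励设定均为本文的假设):

```python
from collections import defaultdict

def q_update(Q, state, action, reward, next_state, actions,
             alpha=0.5, gamma=0.9):
    """一次贝尔曼更新:Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))。"""
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

Q = defaultdict(float)          # 未见过的 (状态, 动作) 默认 Q 值为 0
actions = ["support", "deny"]
# 假设:当源帖为错误信息时,否认该帖的回复获得正奖励
q_update(Q, "claim_unverified", "deny", reward=1.0,
         next_state="claim_challenged", actions=actions)
```

深度Q学习用神经网络近似 Q 函数以替代此处的查表,但更新目标(贝尔曼方程)是相同的。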

[NLP-15] AcrosticSleuth: Probabilistic Identification and Ranking of Acrostics in Multilingual Corpora
[NLP-15] AcrosticSleuth:多语言语料库中离合体的概率识别与排序

链接: https://arxiv.org/abs/2408.04427
作者: Aleksandr Fedchin,Isabel Cooperman,Pramit Chaudhuri,Joseph P. Dexter
关键词-EN: form meaningful words, paragraphs form meaningful, writers have hidden, words or phrases, hidden messages
关键词-ZN: 形成有意义的词语,形成有意义的段落,作家有隐藏的词语或短语,隐藏的信息
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:For centuries, writers have hidden messages in their texts as acrostics, where initial letters of consecutive lines or paragraphs form meaningful words or phrases. Scholars searching for acrostics manually can only focus on a few authors at a time and often favor qualitative arguments when discussing intentionality. We aim to put the study of acrostics on firmer statistical footing by presenting AcrosticSleuth, a first-of-its-kind tool that automatically identifies acrostics and ranks them by the probability that the sequence of characters does not occur by chance (and therefore may have been inserted intentionally). Acrostics are rare, so we formalize the problem as a binary classification task in the presence of extreme class imbalance. To evaluate AcrosticSleuth, we present the Acrostic Identification Dataset (AcrostID), a collection of acrostics from the WikiSource online database. Despite the class imbalance, AcrosticSleuth achieves F1 scores of 0.39, 0.59, and 0.66 on French, English, and Russian subdomains of WikiSource, respectively. We further demonstrate that AcrosticSleuth can identify previously unknown high-profile instances of wordplay, such as the acrostic spelling ARSPOETICA ("art of poetry") by Italian Humanist Albertino Mussato and English philosopher Thomas Hobbes' signature in the opening paragraphs of The Elements of Law.
摘要:几个世纪以来,作家们以离合体(acrostic)的形式在文本中隐藏信息,即由连续行或段落的首字母组成有意义的单词或短语。人工搜索离合体的学者一次只能关注少数作者,且在讨论作者意图时往往依赖定性论证。我们的目标是通过推出AcrosticSleuth,使离合体研究建立在更坚实的统计基础上。这是首个能够自动识别离合体,并按字符序列并非偶然出现(因此可能是有意插入)的概率对其排序的工具。离合体很少见,因此我们将该问题形式化为存在极端类别不平衡的二分类任务。为了评估AcrosticSleuth,我们提供了离合体识别数据集(AcrostID),这是来自WikiSource在线数据库的离合体集合。尽管存在类别不平衡,AcrosticSleuth在WikiSource的法语、英语和俄语子域上的F1得分分别为0.39、0.59和0.66。我们进一步证明,AcrosticSleuth能够识别此前不为人知的著名文字游戏实例,例如意大利人文主义者阿尔贝蒂诺·穆萨托的离合体拼写ARSPOETICA("诗歌的艺术"),以及英国哲学家托马斯·霍布斯在《法律原理》(The Elements of Law)开头几段中的签名。

[NLP-16] Recognizing Emotion Regulation Strategies from Human Behavior with Large Language Models
[NLP-16] 用大型语言模型从人类行为中识别情绪调节策略

链接: https://arxiv.org/abs/2408.04420
作者: Philipp Müller,Alexander Heimerl,Sayed Muddashir Hossain,Lea Siegel,Jan Alexandersson,Patrick Gebhard,Elisabeth André,Tanja Schneeberger
关键词-EN: Human emotions, emotion regulation, expressed directly, social display rules, emotion
关键词-ZN: 人类情感,情感调节,直接表达,社会表现规则,情感
类目: Computation and Language (cs.CL)
备注: Accepted to ACII’24

点击查看摘要

Abstract:Human emotions are often not expressed directly, but regulated according to internal processes and social display rules. For affective computing systems, an understanding of how users regulate their emotions can be highly useful, for example to provide feedback in job interview training, or in psychotherapeutic scenarios. However, at present no method to automatically classify different emotion regulation strategies in a cross-user scenario exists. At the same time, recent studies showed that instruction-tuned Large Language Models (LLMs) can reach impressive performance across a variety of affect recognition tasks such as categorical emotion recognition or sentiment analysis. While these results are promising, it remains unclear to what extent the representational power of LLMs can be utilized in the more subtle task of classifying users' internal emotion regulation strategy. To close this gap, we make use of the recently introduced DEEP corpus for modeling the social display of the emotion shame, where each point in time is annotated with one of seven different emotion regulation classes. We fine-tune Llama2-7B as well as the recently introduced Gemma model using Low-rank Optimization on prompts generated from different sources of information on the DEEP corpus. These include verbal and nonverbal behavior, person factors, as well as the results of an in-depth interview after the interaction. Our results show that a fine-tuned Llama2-7B LLM is able to classify the utilized emotion regulation strategy with high accuracy (0.84) without needing access to data from post-interaction interviews. This represents a significant improvement over previous approaches based on Bayesian Networks and highlights the importance of modeling verbal behavior in emotion regulation.
摘要:人的情感往往不是直接表达出来的,而是根据内在过程和社会表现规则来调节的。对于情感计算系统来说,理解用户如何调节情绪可能非常有用,例如在求职面试培训或心理治疗场景中提供反馈。然而,目前尚不存在在跨用户场景下对不同情绪调节策略进行自动分类的方法。与此同时,最近的研究表明,经过指令微调的大型语言模型(LLM)在类别情绪识别或情感分析等各种情感识别任务中都能取得令人印象深刻的性能。尽管这些结果令人鼓舞,但LLM的表征能力在分类用户内部情绪调节策略这一更微妙的任务中能够发挥到何种程度仍不清楚。为了缩小这一差距,我们利用最近发布的DEEP语料库来建模羞愧情绪的社交表现,其中每个时间点都被标注为七个不同情绪调节类别之一。我们基于DEEP语料库中不同信息源生成的提示,使用低秩优化对Llama2-7B以及最近发布的Gemma模型进行了微调。这些信息源包括言语和非言语行为、个人因素,以及交互后深度访谈的结果。结果表明,微调后的Llama2-7B LLM能够以较高的准确率(0.84)对所使用的情绪调节策略进行分类,而无需访问交互后访谈的数据。这比以往基于贝叶斯网络的方法有了显著改进,并凸显了对言语行为建模在情绪调节识别中的重要性。

[NLP-17] Enhancing Robustness of Retrieval-Augmented Language Models with In-Context Learning
[NLP-17] 通过上下文内学习增强检索增强语言模型的鲁棒性

链接: https://arxiv.org/abs/2408.04414
作者: Seong-Il Park,Seung-Woo Choi,Na-Hyun Kim,Jay-Yoon Lee
关键词-EN: Retrieval-Augmented Language Models, leveraging external knowledge, significantly improved performance, Retrieval-Augmented Language, open-domain question answering
关键词-ZN: 检索增强语言模型,利用外部知识,显着提高性能,检索增强语言,开放领域问答
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 10 pages, 2 figures

点击查看摘要

Abstract:Retrieval-Augmented Language Models (RALMs) have significantly improved performance in open-domain question answering (QA) by leveraging external knowledge. However, RALMs still struggle with unanswerable queries, where the retrieved contexts do not contain the correct answer, and with conflicting information, where different sources provide contradictory answers due to imperfect retrieval. This study introduces an in-context learning-based approach to enhance the reasoning capabilities of RALMs, making them more robust in imperfect retrieval scenarios. Our method incorporates Machine Reading Comprehension (MRC) demonstrations, referred to as cases, to boost the model’s capabilities to identify unanswerabilities and conflicts among the retrieved contexts. Experiments on two open-domain QA datasets show that our approach increases accuracy in identifying unanswerable and conflicting scenarios without requiring additional fine-tuning. This work demonstrates that in-context learning can effectively enhance the robustness of RALMs in open-domain QA tasks.
摘要:检索增强语言模型(RALM)通过利用外部知识显著提高了开放领域问答(QA)的性能。然而,RALM仍然难以处理无法回答的查询(检索到的上下文不包含正确答案)以及信息冲突(由于检索不完善,不同来源给出相互矛盾的答案)。本研究引入了一种基于上下文学习的方法来增强RALM的推理能力,使其在不完美的检索场景中更具鲁棒性。我们的方法结合了机器阅读理解(MRC)演示(称为案例,cases),以增强模型识别检索上下文中不可回答性和冲突的能力。在两个开放领域QA数据集上的实验表明,我们的方法在无需额外微调的情况下提高了识别无法回答和冲突场景的准确性。这项工作表明,上下文学习可以有效增强RALM在开放领域QA任务中的鲁棒性。
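摘要中"以MRC演示(案例)增强模型识别不可回答与冲突上下文能力"的做法,可以用一个简单的提示拼接示意。下面的案例措辞与提示格式均为示意性假设,并非论文的实际模板:

```python
# Hypothetical demonstration "cases" teaching the model to flag unanswerable
# and conflicting contexts; the exact wording/format is an assumption.
CASES = [
    {"context": "Paris is the capital of France.",
     "question": "What is the capital of Germany?",
     "answer": "unanswerable: the context does not mention Germany."},
    {"context": "Source A says the bridge opened in 1932. Source B says 1936.",
     "question": "When did the bridge open?",
     "answer": "conflict: the sources disagree (1932 vs. 1936)."},
]

def build_prompt(contexts, question):
    """Prepend MRC demonstration cases before the retrieved contexts."""
    parts = []
    for i, case in enumerate(CASES, 1):
        parts.append(f"Case {i}:\nContext: {case['context']}\n"
                     f"Question: {case['question']}\nAnswer: {case['answer']}")
    parts.append("Now answer:\nContext: " + " ".join(contexts)
                 + f"\nQuestion: {question}\nAnswer:")
    return "\n\n".join(parts)

prompt = build_prompt(["Mount Fuji is 3776 m tall."], "How tall is Mount Fuji?")
print(prompt.count("Case"))  # 2
```

拼接后的提示直接交给RALM生成答案;案例的作用是让模型在检索不完善时显式输出"无法回答"或"冲突",而非强行作答。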

[NLP-18] Exploring Reasoning Biases in Large Language Models Through Syllogism: Insights from the NeuBAROCO Dataset ACL2024
[NLP-18] 通过三段论(Syllogism)探索大型语言模型中的推理偏差:来自NeuBAROCO数据集的见解

链接: https://arxiv.org/abs/2408.04403
作者: Kentaro Ozeki,Risako Ando,Takanobu Morishita,Hirohiko Abe,Koji Mineshima,Mitsuhiro Okada
关键词-EN: accurately current large, reasoning, paper explores, explores the question, accurately current
关键词-ZN: 准确当前大,推理,论文探索,探索问题,准确当前
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: To appear in Findings of the Association for Computational Linguistics: ACL 2024

点击查看摘要

Abstract:This paper explores the question of how accurately current large language models can perform logical reasoning in natural language, with an emphasis on whether these models exhibit reasoning biases similar to humans. Specifically, our study focuses on syllogistic reasoning, a form of deductive reasoning extensively studied in cognitive science as a natural form of human reasoning. We present a syllogism dataset called NeuBAROCO, which consists of syllogistic reasoning problems in English and Japanese. This dataset was originally designed for psychological experiments to assess human reasoning capabilities using various forms of syllogisms. Our experiments with leading large language models indicate that these models exhibit reasoning biases similar to humans, along with other error tendencies. Notably, there is significant room for improvement in reasoning problems where the relationship between premises and hypotheses is neither entailment nor contradiction. We also present experimental results and in-depth analysis using a new Chain-of-Thought prompting method, which asks LLMs to translate syllogisms into abstract logical expressions and then explain their reasoning process. Our analysis using this method suggests that the primary limitations of LLMs lie in the reasoning process itself rather than the interpretation of syllogisms.
摘要:本文探讨了当前大型语言模型在自然语言中进行逻辑推理的精确度问题,重点是这些模型是否表现出与人类相似的推理偏差。具体地说,我们的研究重点是三段论推理,这是认知科学中广泛研究的一种演绎推理形式,是人类推理的一种自然形式。我们给出了一个名为NeuBAROCO的三段论数据集,它包括英语和日语的三段论推理问题。这个数据集最初是为心理学实验设计的,目的是使用各种形式的三段论来评估人类的推理能力。我们用领先的大型语言模型进行的实验表明,这些模型表现出类似于人类的推理偏见,以及其他错误倾向。值得注意的是,在前提和假设之间既不是蕴涵也不是矛盾的情况下,推理问题有很大的改进空间。我们还给出了实验结果,并使用了一种新的思维链提示方法进行了深入的分析,该方法要求LLMS将三段论转换为抽象的逻辑表达式,然后解释它们的推理过程。我们使用这种方法进行的分析表明,LLMS的主要局限性在于推理过程本身,而不是对三段论的解释。

[NLP-19] Automated Educational Question Generation at Different Bloom's Skill Levels using Large Language Models: Strategies and Evaluation
[NLP-19] 使用大型语言模型在不同Bloom技能水平下自动生成教育问题:策略和评估

链接: https://arxiv.org/abs/2408.04394
作者: Nicy Scaria,Suma Dharani Chenna,Deepak Subramani
关键词-EN: Developing questions, pedagogically sound, promote learning, challenging and time-consuming, time-consuming task
关键词-ZN: 开发问题,教学合理,促进学习,具有挑战性且耗时,耗时的任务
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Developing questions that are pedagogically sound, relevant, and promote learning is a challenging and time-consuming task for educators. Modern-day large language models (LLMs) generate high-quality content across multiple domains, potentially helping educators to develop high-quality questions. Automated educational question generation (AEQG) is important in scaling online education catering to a diverse student population. Past attempts at AEQG have shown limited abilities to generate questions at higher cognitive levels. In this study, we examine the ability of five state-of-the-art LLMs of different sizes to generate diverse and high-quality questions of different cognitive levels, as defined by Bloom's taxonomy. We use advanced prompting techniques with varying complexity for AEQG. We conducted expert and LLM-based evaluations to assess the linguistic and pedagogical relevance and quality of the questions. Our findings suggest that LLMs can generate relevant and high-quality educational questions of different cognitive levels when prompted with adequate information, although there is a significant variance in the performance of the five LLMs considered. We also show that automated evaluation is not on par with human evaluation.
摘要:对于教育工作者来说,设计符合教学原理、切题且能促进学习的问题是一项具有挑战性且耗时的任务。现代大型语言模型(LLM)能在多个领域生成高质量内容,有望帮助教育工作者开发高质量的问题。自动教育问题生成(AEQG)对于扩展面向多样化学生群体的在线教育非常重要。以往的AEQG尝试在生成更高认知层级的问题方面能力有限。在这项研究中,我们考察了五个不同规模的最先进LLM按照Bloom分类法生成不同认知层级的多样化高质量问题的能力。我们为AEQG使用了复杂程度不同的高级提示技术。我们进行了专家评估和基于LLM的评估,以衡量问题的语言与教学相关性及质量。结果表明,在提供充分信息的提示下,LLM能够生成不同认知层级的相关且高质量的教育问题,尽管所考察的五个LLM的表现存在显著差异。我们还表明,自动评估尚无法与人工评估相提并论。

[NLP-20] Open-domain Implicit Format Control for Large Language Model Generation
[NLP-20] 用于大型语言模型生成的开放域隐式格式控制

链接: https://arxiv.org/abs/2408.04392
作者: Yiqun Yao,Wenjia Ma,Xuezhi Fang,Xin Jiang,Xiang Li,Xuying Meng,Peng Han,Jing Li,Aixin Sun,Yequan Wang
关键词-EN: large language models, language models, generated by large, large language, critical functionality
关键词-ZN: 大型语言模型,语言模型,由大型语言生成,关键功能
类目: Computation and Language (cs.CL)
备注: 6 pages

点击查看摘要

Abstract:Controlling the format of outputs generated by large language models (LLMs) is a critical functionality in various applications. Current methods typically employ constrained decoding with rule-based automata or fine-tuning with manually crafted format instructions, both of which struggle with open-domain format requirements. To address this limitation, we introduce a novel framework for controlled generation in LLMs, leveraging user-provided, one-shot QA pairs. This study investigates LLMs’ capabilities to follow open-domain, one-shot constraints and replicate the format of the example answers. We observe that this is a non-trivial problem for current LLMs. We also develop a dataset collection methodology for supervised fine-tuning that enhances the open-domain format control of LLMs without degrading output quality, as well as a benchmark on which we evaluate both the helpfulness and format correctness of LLM outputs. The resulting datasets, named OIFC-SFT, along with the related code, will be made publicly available at this https URL.
摘要:在各种应用中,控制大型语言模型(LLM)生成输出的格式是一项关键功能。目前的方法通常使用基于规则的自动机进行约束解码,或使用人工编写的格式指令进行微调,这两种方法都难以满足开放域的格式要求。为了解决这一限制,我们引入了一种利用用户提供的单样本(one-shot)QA对在LLM中进行受控生成的新框架。本研究考察了LLM遵循开放域单样本约束并复现示例答案格式的能力。我们观察到,对于当前的LLM来说,这并非一个简单的问题。我们还开发了一种用于监督微调的数据集收集方法,在不降低输出质量的情况下增强LLM的开放域格式控制能力,并构建了一个基准来评估LLM输出的有用性和格式正确性。所得数据集OIFC-SFT及相关代码将在此https URL上公开提供。

[NLP-21] Overview of the NLPCC 2024 Shared Task on Chinese Metaphor Generation
[NLP-21] NLPCC 2024中国隐喻生成共享任务概述

链接: https://arxiv.org/abs/2408.04378
作者: Xingwei Qu,Ge Zhang,Siwei Wu,Yizhi Li,Chenghua Lin
关键词-EN: Natural Language Processing, CCF Conference, Conference on Natural, Natural Language, Language Processing
关键词-ZN: 自然语言处理,CCF会议,自然会议,自然语言,语言处理
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper presents the results of the shared task on Chinese metaphor generation, hosted at the 13th CCF Conference on Natural Language Processing and Chinese Computing (NLPCC 2024). The goal of this shared task is to generate Chinese metaphors using machine learning techniques and effectively identifying basic components of metaphorical sentences. It is divided into two subtasks: 1) Metaphor Generation, which involves creating a metaphor from a provided tuple consisting of TENOR, GROUND, and VEHICLE. The goal here is to synthesize a metaphor that connects the subject (i.e. TENOR) with the object (i.e. VEHICLE), guided by the concept of the GROUND. 2) Metaphor Components Identification, which extracts the most fitting TENORs, GROUNDs, and VEHICLEs from a metaphorical sentence. This component requires the identification of the most fitting metaphor elements that correspond to the specified grounds. In addition to overall results, we report on the setup and insights from the metaphor generation shared task, which attracted a total of 4 participating teams across both subtasks.
摘要:本文介绍了在第13届CCF自然语言处理与中文计算会议(NLPCC 2024)上举办的汉语隐喻生成共享任务的成果。该共享任务的目标是使用机器学习技术生成汉语隐喻,并有效识别隐喻句子的基本成分。它分为两个子任务:1)隐喻生成,即从由本体(TENOR)、喻底(GROUND)和喻体(VEHICLE)组成的元组中创建隐喻。其目标是在喻底概念的指导下,合成一个连接主体(即本体)与客体(即喻体)的隐喻。2)隐喻成分识别,即从隐喻句子中提取最贴切的本体、喻底和喻体。该子任务需要识别与特定喻底相对应的最贴切的隐喻成分。除总体结果外,我们还报告了隐喻生成共享任务的设置与洞见,该任务在两个子任务中共吸引了4支参赛队伍。

[NLP-22] Analyzing Consumer Reviews for Understanding Drivers of Hotels Ratings: An Indian Perspective
[NLP-22] 分析消费者评论以了解酒店评级的驱动因素:印度的视角

链接: https://arxiv.org/abs/2408.04369
作者: Subhasis Dasgupta,Soumya Roy,Jaydip Sen
关键词-EN: social media platforms, digital media, media platforms, digital footprint, social media
关键词-ZN: 社交媒体平台、数字媒体、媒体平台、数字足迹、社交媒体
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: This is the pre-print of the paper that was accepted for oral presentation and publication in the proceedings of IEEE ICCCNT 2024 which was organized as IIT Mandi, India from June 24 to 28, 2024. The paper is 5 pages long and it contains 4 figures and 6 tables. The is not the final version of the paper

点击查看摘要

Abstract:In the internet era, almost every business entity is trying to have its digital footprint in digital media and other social media platforms. For these entities, word of mouse is also very important. Particularly, this is quite crucial for the hospitality sector dealing with hotels, restaurants etc. Consumers do read other consumers reviews before making final decisions. This is where it becomes very important to understand which aspects are affecting most in the minds of the consumers while giving their ratings. The current study focuses on the consumer reviews of Indian hotels to extract aspects important for final ratings. The study involves gathering data using web scraping methods, analyzing the texts using Latent Dirichlet Allocation for topic extraction and sentiment analysis for aspect-specific sentiment mapping. Finally, it incorporates Random Forest to understand the importance of the aspects in predicting the final rating of a user.
摘要:在互联网时代,几乎每个商业实体都试图在数字媒体和其他社交媒体平台上建立数字足迹。对这些实体而言,网络口碑("word of mouse")也非常重要。这对于酒店、餐馆等酒店服务业尤其关键。消费者在做出最终决定之前确实会阅读其他消费者的评论。因此,理解消费者在打分时心目中哪些方面影响最大就变得非常重要。本研究聚焦于消费者对印度酒店的评论,以提取对最终评分重要的方面。研究包括使用网络抓取方法收集数据,使用潜在狄利克雷分配(LDA)进行主题提取,并通过情感分析进行特定方面的情感映射。最后,研究结合随机森林来理解各方面在预测用户最终评分中的重要性。

[NLP-23] Enhancing Journalism with AI: A Study of Contextualized Image Captioning for News Articles using LLMs and LMMs
[NLP-23] 用人工智能增强新闻业:使用LLM和LMM为新闻文章生成上下文化图像字幕的研究

链接: https://arxiv.org/abs/2408.04331
作者: Aliki Anagnostopoulou,Thiago Gouvea,Daniel Sonntag
关键词-EN: Large language models, large multimodal models, Large language, large multimodal, economic sectors
关键词-ZN: 大型语言模型,大型多模式模型,大型语言,大型多模式,经济部门
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Large language models (LLMs) and large multimodal models (LMMs) have significantly impacted the AI community, industry, and various economic sectors. In journalism, integrating AI poses unique challenges and opportunities, particularly in enhancing the quality and efficiency of news reporting. This study explores how LLMs and LMMs can assist journalistic practice by generating contextualised captions for images accompanying news articles. We conducted experiments using the GoodNews dataset to evaluate the ability of LMMs (BLIP-2, GPT-4v, or LLaVA) to incorporate one of two types of context: entire news articles, or extracted named entities. In addition, we compared their performance to a two-stage pipeline composed of a captioning model (BLIP-2, OFA, or ViT-GPT2) with post-hoc contextualisation with LLMs (GPT-4 or LLaMA). We assess a diversity of models, and we find that while the choice of contextualisation model is a significant factor for the two-stage pipelines, this is not the case in the LMMs, where smaller, open-source models perform well compared to proprietary, GPT-powered ones. Additionally, we found that controlling the amount of provided context enhances performance. These results highlight the limitations of a fully automated approach and underscore the necessity for an interactive, human-in-the-loop strategy.
摘要:大型语言模型(LLM)和大型多模态模型(LMM)对人工智能社区、产业界和各经济部门产生了重大影响。在新闻业,整合人工智能带来了独特的挑战和机遇,特别是在提高新闻报道的质量和效率方面。这项研究探索了LLM和LMM如何通过为新闻文章配图生成上下文化的图像字幕来辅助新闻实践。我们使用GoodNews数据集进行了实验,以评估LMM(BLIP-2、GPT-4v或LLaVA)整合两种上下文之一的能力:完整的新闻文章,或提取出的命名实体。此外,我们将它们的性能与一个两阶段流水线进行了比较,该流水线由字幕模型(BLIP-2、OFA或ViT-GPT2)与LLM(GPT-4或LLaMA)的事后上下文化组成。我们评估了多种模型,发现虽然上下文化模型的选择对两阶段流水线是一个重要因素,但对LMM而言并非如此:与专有的GPT驱动模型相比,较小的开源模型表现良好。此外,我们发现控制所提供上下文的数量可以提高性能。这些结果凸显了完全自动化方法的局限性,并强调了交互式人机协同(human-in-the-loop)策略的必要性。

[NLP-24] Trans-Tokenization and Cross-lingual Vocabulary Transfers: Language Adaptation of LLMs for Low-Resource NLP
[NLP-24] Trans-Tokenization与跨语言词汇迁移:面向低资源NLP的LLM语言适配

链接: https://arxiv.org/abs/2408.04303
作者: François Remy,Pieter Delobelle,Hayastan Avetisyan,Alfiya Khabibullina,Miryam de Lhoneux,Thomas Demeester
关键词-EN: mid-resource languages continues, low and mid-resource, difficulty in sourcing, language, languages
关键词-ZN: 中等资源语言继续存在,低资源和中等资源,采购困难,语言,语言
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted at COLM 2024

点击查看摘要

Abstract:The development of monolingual language models for low and mid-resource languages continues to be hindered by the difficulty in sourcing high-quality training data. In this study, we present a novel cross-lingual vocabulary transfer strategy, trans-tokenization, designed to tackle this challenge and enable more efficient language adaptation. Our approach focuses on adapting a high-resource monolingual LLM to an unseen target language by initializing the token embeddings of the target language using a weighted average of semantically similar token embeddings from the source language. For this, we leverage a translation resource covering both the source and target languages. We validate our method with the Tweeties, a series of trans-tokenized LLMs, and demonstrate their competitive performance on various downstream tasks across a small but diverse set of languages. Additionally, we introduce Hydra LLMs, models with multiple swappable language modeling heads and embedding tables, which further extend the capabilities of our trans-tokenization strategy. By designing a Hydra LLM based on the multilingual model TowerInstruct, we developed a state-of-the-art machine translation model for Tatar, in a zero-shot manner, completely bypassing the need for high-quality parallel data. This breakthrough is particularly significant for low-resource languages like Tatar, where high-quality parallel data is hard to come by. By lowering the data and time requirements for training high-quality models, our trans-tokenization strategy allows for the development of LLMs for a wider range of languages, especially those with limited resources. We hope that our work will inspire further research and collaboration in the field of cross-lingual vocabulary transfer and contribute to the empowerment of languages on a global scale.
摘要:由于难以获取高质量的训练数据,中低资源语言单语模型的发展持续受阻。在这项研究中,我们提出了一种新的跨语言词汇迁移策略,即跨词元化(trans-tokenization),旨在应对这一挑战并实现更高效的语言适配。我们的方法侧重于通过使用源语言中语义相似词元嵌入的加权平均来初始化目标语言的词元嵌入,从而使高资源单语LLM适配到未见过的目标语言。为此,我们利用了同时覆盖源语言和目标语言的翻译资源。我们用Tweeties(一系列经过跨词元化的LLM)验证了我们的方法,并展示了它们在一组规模虽小但多样的语言的各种下游任务上具有竞争力的性能。此外,我们引入了Hydra LLM,即带有多个可切换语言建模头和嵌入表的模型,进一步扩展了跨词元化策略的能力。通过基于多语言模型TowerInstruct设计Hydra LLM,我们以零样本方式开发了最先进的鞑靼语机器翻译模型,完全绕过了对高质量平行数据的需求。这一突破对于鞑靼语等难以获得高质量平行数据的低资源语言尤其重要。通过降低训练高质量模型的数据和时间要求,我们的跨词元化策略使得为更广泛的语言(尤其是资源有限的语言)开发LLM成为可能。我们希望这项工作能够激励跨语言词汇迁移领域的进一步研究与合作,并为全球范围内语言的赋能做出贡献。
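摘要所述"用源语言中语义相似词元嵌入的加权平均初始化目标语言词元嵌入"这一核心步骤,可以用如下极简示意说明(词表、对齐权重均为虚构的玩具数据,仅演示加权平均初始化本身):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy source-language embedding table (vocab of 4 tokens, dim 8).
src_vocab = {"cat": 0, "dog": 1, "house": 2, "tree": 3}
src_emb = rng.normal(size=(4, 8))

# Hypothetical translation resource: each target token maps to weighted
# source tokens (weights e.g. from alignment counts, normalized below).
alignments = {
    "kot": [("cat", 0.9), ("dog", 0.1)],
    "dom": [("house", 1.0)],
}

def init_target_embedding(target_token):
    """Initialize a target token embedding as the weighted average of its
    aligned source-token embeddings (the core trans-tokenization step)."""
    pairs = alignments[target_token]
    weights = np.array([w for _, w in pairs])
    vectors = np.stack([src_emb[src_vocab[t]] for t, _ in pairs])
    return weights @ vectors / weights.sum()

kot = init_target_embedding("kot")
dom = init_target_embedding("dom")
print(np.allclose(dom, src_emb[src_vocab["house"]]))  # True
```

初始化后的目标嵌入表替换原模型的嵌入层,再在目标语言数据上继续训练;一对一对齐的词元("dom")直接继承源嵌入,多对一对齐的词元("kot")落在其对齐源词元的加权重心上。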

[NLP-25] Are Social Sentiments Inherent in LLMs? An Empirical Study on Extraction of Inter-demographic Sentiments
[NLP-25] 社会情感是法学硕士固有的吗?人口间情感提取的实证研究

链接: https://arxiv.org/abs/2408.04293
作者: Kunitomo Tanaka,Ryohei Sasano,Koichi Takeda
关键词-EN: acquire unconscious human, unconscious human knowledge, Large language models, social common sense, knowledge and feelings
关键词-ZN: 获取无意识的人类、无意识的人类知识、大型语言模型、社会常识、知识和感情
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are supposed to acquire unconscious human knowledge and feelings, such as social common sense and biases, by training models from large amounts of text. However, it is not clear how much the sentiments of specific social groups can be captured in various LLMs. In this study, we focus on social groups defined in terms of nationality, religion, and race/ethnicity, and validate the extent to which sentiments between social groups can be captured in and extracted from LLMs. Specifically, we input questions regarding sentiments from one group to another into LLMs, apply sentiment analysis to the responses, and compare the results with social surveys. The validation results using five representative LLMs showed higher correlations with relatively small p-values for nationalities and religions, whose number of data points were relatively large. This result indicates that the LLM responses including the inter-group sentiments align well with actual social survey results.
摘要:大型语言模型(LLM)被认为能够通过在大量文本上训练而习得人类无意识的知识和情感,如社会常识和偏见。然而,目前尚不清楚各种LLM能在多大程度上捕捉特定社会群体的情感。在这项研究中,我们关注按国籍、宗教和种族/民族定义的社会群体,并验证了社会群体之间的情感能够在多大程度上被LLM捕捉并从中提取。具体而言,我们将关于一个群体对另一个群体情感的问题输入LLM,对回答进行情感分析,并将结果与社会调查进行比较。使用五个具有代表性的LLM的验证结果表明,对于数据点相对较多的国籍和宗教,相关性较高且p值相对较小。这一结果表明,包含群体间情感的LLM回答与实际社会调查结果吻合良好。

[NLP-26] EMTeC: A Corpus of Eye Movements on Machine-Generated Texts
[NLP-26] EMTeC:机器生成文本眼动语料库

链接: https://arxiv.org/abs/2408.04289
作者: Lena Sophia Bolliger,Patrick Haller,Isabelle Caroline Rose Cretton,David Robert Reich,Tannon Kew,Lena Ann Jäger
关键词-EN: native English speakers, English speakers reading, native English, English speakers, eye movement data
关键词-ZN: 以英语为母语的人、以英语为母语的人阅读、以英语为母语的人、眼动数据
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The Eye Movements on Machine-Generated Texts Corpus (EMTeC) is a naturalistic eye-movements-while-reading corpus of 107 native English speakers reading machine-generated texts. The texts are generated by three large language models using five different decoding strategies, and they fall into six different text type categories. EMTeC entails the eye movement data at all stages of pre-processing, i.e., the raw coordinate data sampled at 2000 Hz, the fixation sequences, and the reading measures. It further provides both the original and a corrected version of the fixation sequences, accounting for vertical calibration drift. Moreover, the corpus includes the language models’ internals that underlie the generation of the stimulus texts: the transition scores, the attention scores, and the hidden states. The stimuli are annotated for a range of linguistic features both at text and at word level. We anticipate EMTeC to be utilized for a variety of use cases such as, but not restricted to, the investigation of reading behavior on machine-generated text and the impact of different decoding strategies; reading behavior on different text types; the development of new pre-processing, data filtering, and drift correction algorithms; the cognitive interpretability and enhancement of language models; and the assessment of the predictive power of surprisal and entropy for human reading times. The data at all stages of pre-processing, the model internals, and the code to reproduce the stimulus generation, data pre-processing and analyses can be accessed via this https URL.
摘要:机器生成文本眼动语料库(EMTeC)是一个自然阅读眼动语料库,记录了107名英语母语者阅读机器生成文本时的眼动。这些文本由三个大型语言模型使用五种不同的解码策略生成,分属六种不同的文本类型。EMTeC提供了预处理各阶段的眼动数据,即以2000 Hz采样的原始坐标数据、注视序列和阅读测量指标。它还提供了注视序列的原始版本和校正版本,以消除垂直校准漂移的影响。此外,语料库还包括生成刺激文本所依据的语言模型内部信息:转移分数、注意力分数和隐藏状态。这些刺激文本在文本和单词层面都标注了一系列语言特征。我们预计EMTeC可用于多种用例,例如但不限于:研究机器生成文本上的阅读行为及不同解码策略的影响;研究不同文本类型上的阅读行为;开发新的预处理、数据过滤和漂移校正算法;语言模型的认知可解释性与改进;以及评估surprisal和熵对人类阅读时间的预测能力。预处理各阶段的数据、模型内部信息以及用于复现刺激生成、数据预处理和分析的代码均可通过此https URL访问。

[NLP-27] LLM-DetectAIve: a Tool for Fine-Grained Machine-Generated Text Detection
[NLP-27] LLM-DetectAIve:细粒度机器生成文本检测工具

链接: https://arxiv.org/abs/2408.04284
作者: Mervat Abassy,Kareem Elozeiri,Alexander Aziz,Minh Ngoc Ta,Raj Vardhan Tomar,Bimarsha Adhikari,Saad El Dine Ahmed,Yuxia Wang,Osama Mohammed Afzal,Zhuohan Xie,Jonibek Mansurov,Ekaterina Artemova,Vladislav Mikhailov,Rui Xing,Jiahui Geng,Hasan Iqbal,Zain Muhammad Mujahid,Tarek Mahmoud,Akim Tsvigun,Alham Fikri Aji,Artem Shelmanov,Nizar Habash,Iryna Gurevych,Preslav Nakov
关键词-EN: large language models, language models, widespread accessibility, accessibility of large, large language
关键词-ZN: 大型语言模型、语言模型、广泛的可访问性、大型语言的可访问性
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The widespread accessibility of large language models (LLMs) to the general public has significantly amplified the dissemination of machine-generated texts (MGTs). Advancements in prompt manipulation have exacerbated the difficulty in discerning the origin of a text (human-authored vs machine-generated). This raises concerns regarding the potential misuse of MGTs, particularly within educational and academic domains. In this paper, we present LLM-DetectAIve – a system designed for fine-grained MGT detection. It is able to classify texts into four categories: human-written, machine-generated, machine-written machine-humanized, and human-written machine-polished. Contrary to previous MGT detectors that perform binary classification, introducing two additional categories in LLM-DetectAIve offers insights into the varying degrees of LLM intervention during the text creation. This might be useful in some domains like education, where any LLM intervention is usually prohibited. Experiments show that LLM-DetectAIve can effectively identify the authorship of textual content, proving its usefulness in enhancing integrity in education, academia, and other domains. LLM-DetectAIve is publicly accessible at this https URL. The video describing our system is available at this https URL.
摘要:大型语言模型(LLM)向公众的广泛普及,极大地加速了机器生成文本(MGT)的传播。提示操纵技术的进步加剧了辨别文本来源(人工创作还是机器生成)的难度。这引发了人们对MGT被滥用的担忧,特别是在教育和学术领域。本文介绍了LLM-DetectAIve,一个专为细粒度MGT检测设计的系统。它能够将文本分为四类:人类撰写、机器生成、机器撰写且机器人性化、以及人类撰写且机器润色。与以往执行二分类的MGT检测器不同,LLM-DetectAIve引入的两个额外类别提供了对文本创作过程中LLM介入程度的洞察。这在教育等通常禁止任何LLM介入的领域可能很有用。实验表明,LLM-DetectAIve能够有效识别文本内容的作者属性,证明了其在提升教育、学术及其他领域诚信方面的价值。LLM-DetectAIve可通过此https URL公开访问。介绍我们系统的视频可通过此https URL获取。

[NLP-28] LaDiMo: Layer-wise Distillation Inspired MoEfier
[NLP-28] LaDiMo:受逐层蒸馏启发的MoE化工具(MoEfier)

链接: https://arxiv.org/abs/2408.04278
作者: Sungyoon Kim,Youngjun Kim,Kihyo Moon,Minsung Jang
关键词-EN: natural language processing, revolutionized natural language, large language models, language processing, large language
关键词-ZN: 自然语言处理、革命性的自然语言、大型语言模型、语言处理、大型语言
类目: Computation and Language (cs.CL)
备注: 21 pages, 10 figures

点击查看摘要

Abstract:The advent of large language models has revolutionized natural language processing, but their increasing complexity has led to substantial training costs, resource demands, and environmental impacts. In response, sparse Mixture-of-Experts (MoE) models have emerged as a promising alternative to dense models. Since training MoE models from scratch can be prohibitively expensive, recent studies have explored leveraging knowledge from pre-trained non-MoE models. However, existing approaches have limitations, such as requiring significant hardware resources and data. We propose a novel algorithm, LaDiMo, which efficiently converts a Transformer-based non-MoE model into a MoE model with minimal additional training cost. LaDiMo consists of two stages: layer-wise expert construction and routing policy decision. By harnessing the concept of Knowledge Distillation, we compress the model and rapidly recover its performance. Furthermore, we develop an adaptive router that optimizes inference efficiency by profiling the distribution of routing weights and determining a layer-wise policy that balances accuracy and latency. We demonstrate the effectiveness of our method by converting the LLaMA2-7B model to a MoE model using only 100K tokens, reducing activated parameters by over 20% while keeping accuracy. Our approach offers a flexible and efficient solution for building and deploying MoE models.
摘要:大型语言模型的出现使自然语言处理发生了革命性的变化,但其日益增长的复杂性也带来了高昂的训练成本、资源需求和环境影响。作为回应,稀疏专家混合(MoE)模型已成为稠密模型的一种有前途的替代方案。由于从头训练MoE模型的成本可能高得令人望而却步,最近的研究探索了利用预训练非MoE模型的知识。然而,现有方法存在局限性,例如需要大量的硬件资源和数据。我们提出了一种新算法LaDiMo,它能以最小的额外训练代价将基于Transformer的非MoE模型高效地转换为MoE模型。LaDiMo由两个阶段组成:逐层专家构建和路由策略决策。利用知识蒸馏的思想,我们对模型进行压缩并快速恢复其性能。此外,我们开发了一个自适应路由器,通过分析路由权重的分布并确定兼顾精度与时延的逐层策略来优化推理效率。我们通过仅使用10万个词元将LLaMA2-7B模型转换为MoE模型,在保持精度的同时将激活参数减少了20%以上,证明了该方法的有效性。我们的方法为构建和部署MoE模型提供了灵活高效的解决方案。
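作为背景,MoE层的top-k路由前向过程可以用如下通用示意说明(这只是一个一般性的top-k MoE前向草图,专家结构和路由方式均为假设,并非LaDiMo的具体路由策略):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def moe_layer(x, router_w, expert_ws, k=2):
    """Route each token to its top-k experts and mix their outputs by the
    renormalized router probabilities (generic top-k MoE forward pass)."""
    logits = x @ router_w                      # (tokens, n_experts)
    probs = softmax(logits)
    topk = np.argsort(probs, axis=-1)[:, -k:]  # indices of the k best experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = topk[t]
        w = probs[t, sel] / probs[t, sel].sum()  # renormalize over selected
        for weight, e in zip(w, sel):
            out[t] += weight * np.tanh(x[t] @ expert_ws[e])  # toy expert FFN
    return out

rng = np.random.default_rng(1)
d, n_experts, tokens = 16, 4, 3
x = rng.normal(size=(tokens, d))
router_w = rng.normal(size=(d, n_experts))
expert_ws = rng.normal(size=(n_experts, d, d)) / np.sqrt(d)
y = moe_layer(x, router_w, expert_ws)
print(y.shape)  # (3, 16)
```

稀疏性正来自于每个词元只激活k个专家:总参数量随专家数增长,但每次前向的激活参数量基本不变,这也是摘要中"激活参数减少20%以上"的含义所在。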

[NLP-29] Analysis of Argument Structure Constructions in the Large Language Model BERT
[NLP-29] 大型语言模型BERT中的论元结构构式分析

链接: https://arxiv.org/abs/2408.04270
作者: Pegah Ramezani,Achim Schilling,Patrick Krauss
关键词-EN: represents Argument Structure, Argument Structure Constructions, Argument Structure, represents Argument, extending previous LSTM
关键词-ZN: 代表参数结构,参数结构构造,参数结构,代表参数,扩展之前的LSTM
类目: Computation and Language (cs.CL)
备注: arXiv admin note: text overlap with arXiv:2408.03062

点击查看摘要

Abstract:This study investigates how BERT processes and represents Argument Structure Constructions (ASCs), extending previous LSTM analyses. Using a dataset of 2000 sentences across four ASC types (transitive, ditransitive, caused-motion, resultative), we analyzed BERT’s token embeddings across 12 layers. Visualizations with MDS and t-SNE and clustering quantified by Generalized Discrimination Value (GDV) were used. Feedforward classifiers (probes) predicted construction categories from embeddings. CLS token embeddings clustered best in layers 2-4, decreased in intermediate layers, and slightly increased in final layers. DET and SUBJ embeddings showed consistent clustering in intermediate layers, VERB embeddings increased in clustering from layer 1 to 12, and OBJ embeddings peaked in layer 10. Probe accuracies indicated low construction information in layer 1, with over 90 percent accuracy from layer 2 onward, revealing latent construction information beyond GDV clustering. Fisher Discriminant Ratio (FDR) analysis of attention weights showed OBJ tokens were crucial for differentiating ASCs, followed by VERB and DET tokens. SUBJ, CLS, and SEP tokens had insignificant FDR scores. This study highlights BERT’s layered processing of linguistic constructions and its differences from LSTMs. Future research will compare these findings with neuroimaging data to understand the neural correlates of ASC processing. This research underscores neural language models’ potential to mirror linguistic processing in the human brain, offering insights into the computational and neural mechanisms underlying language understanding.
摘要:本研究扩展了此前的LSTM分析,考察了BERT如何处理和表征论元结构构式(ASC)。我们使用涵盖四种构式类型(及物、双及物、致使移动、结果)的2000个句子的数据集,分析了BERT在12个层中的标记嵌入。采用MDS和t-SNE进行可视化,并用广义判别值(GDV)量化聚类程度。前馈分类器(探针)根据嵌入预测构式类别。CLS标记嵌入在第2-4层聚类最好,在中间层下降,在最后几层略有回升。DET和SUBJ嵌入在中间层表现出一致的聚类,VERB嵌入的聚类程度从第1层到第12层递增,OBJ嵌入在第10层达到峰值。探针准确率表明第1层的构式信息较少,从第2层起准确率超过90%,揭示了GDV聚类之外的潜在构式信息。对注意力权重的Fisher判别比(FDR)分析表明,OBJ标记对区分ASC至关重要,其次是VERB和DET标记。SUBJ、CLS和SEP标记的FDR得分不显著。本研究凸显了BERT对语言构式的分层处理及其与LSTM的差异。未来的研究将把这些发现与神经成像数据进行比较,以了解ASC处理的神经关联。这项研究强调了神经语言模型反映人脑语言处理的潜力,为语言理解背后的计算和神经机制提供了见解。
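论文中的"探针"思路(用简单分类器检验嵌入中是否包含构式信息)可以用一个极简示例来说明。下面的草图使用合成的玩具"嵌入"和一个最近质心探针,比论文中的前馈分类器更简单,仅用于演示探针范式本身;数据和类别名均为假设。

```python
import random

def centroid(vectors):
    """一组等长向量的均值向量。"""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def nearest_centroid_probe(train, test):
    """train/test: 构式标签 -> 嵌入列表。
    对每个标签拟合一个质心,并返回测试集准确率。"""
    centroids = {label: centroid(vecs) for label, vecs in train.items()}
    correct = total = 0
    for label, vecs in test.items():
        for v in vecs:
            pred = min(centroids, key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(v, centroids[c])))
            correct += pred == label
            total += 1
    return correct / total

# 玩具"某一层的嵌入":两个分隔良好的构式簇(合成数据)。
random.seed(0)
make = lambda mu: [[mu + random.gauss(0, 0.1) for _ in range(8)] for _ in range(20)]
train = {"transitive": make(0.0), "ditransitive": make(1.0)}
test = {"transitive": make(0.0), "ditransitive": make(1.0)}
print(nearest_centroid_probe(train, test))  # 可分簇 -> 准确率 1.0
```

若簇在某一层可分,探针准确率接近1;论文正是据此衡量各层蕴含的构式信息量。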

[NLP-30] EfficientRAG: Efficient Retriever for Multi-Hop Question Answering
[NLP-30] EfficientRAG:用于多跳问题解答的高效检索器

链接: https://arxiv.org/abs/2408.04259
作者: Ziyuan Zhuang,Zhiyang Zhang,Sitao Cheng,Fangkai Yang,Jia Liu,Shujian Huang,Qingwei Lin,Saravan Rajmohan,Dongmei Zhang,Qi Zhang
关键词-EN: Retrieval-augmented generation, addressing complex questions, methods encounter difficulties, encounter difficulties, difficulties when addressing
关键词-ZN: 检索增强生成,解决复杂问题,方法遇到困难,遇到困难,解决时遇到困难
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 20 pages, 4 figures

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) methods encounter difficulties when addressing complex questions like multi-hop queries. While iterative retrieval methods improve performance by gathering additional information, current approaches often rely on multiple calls of large language models (LLMs). In this paper, we introduce EfficientRAG, an efficient retriever for multi-hop question answering. EfficientRAG iteratively generates new queries without the need for LLM calls at each iteration and filters out irrelevant information. Experimental results demonstrate that EfficientRAG surpasses existing RAG methods on three open-domain multi-hop question-answering datasets.
摘要:检索增强生成(RAG)方法在解决多跳查询等复杂问题时遇到困难。虽然迭代检索方法通过收集额外信息来提高性能,但当前的方法通常依赖于大型语言模型(LLM)的多次调用。在本文中,我们介绍了EfficientRAG,这是一种用于多跳问答的高效检索器。EfficientRAG迭代生成新查询,而无需每次迭代时进行LLM调用,并过滤掉不相关的信息。实验结果表明,EfficientRAG在三个开放域多跳问答数据集中优于现有的RAG方法。
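"每一跳迭代检索而不调用LLM"的思路可以用一个极简草图说明。下面的示例用关键词重叠代替真实检索器,用"把新检索到的文档词项并入查询"代替论文中训练出的查询生成与过滤模块;语料与函数名均为演示用的假设,并非EfficientRAG的实际实现。

```python
def retrieve(query_terms, corpus, seen):
    """返回与查询至少共享一个词、且尚未检索过的文档 id。"""
    hits = set()
    for doc_id, text in corpus.items():
        if doc_id not in seen and query_terms & set(text.lower().split()):
            hits.add(doc_id)
    return hits

def iterative_retrieve(question, corpus, max_hops=3):
    """多跳检索:每一跳用新检索到的文档词项扩展查询,而非调用LLM。"""
    query_terms = set(question.lower().split())
    seen = set()
    for _ in range(max_hops):
        hits = retrieve(query_terms, corpus, seen)
        if not hits:
            break
        seen |= hits
        for doc_id in hits:  # 用检索到的证据扩展查询
            query_terms |= set(corpus[doc_id].lower().split())
    return seen

corpus = {
    "d1": "inception was directed by nolan",
    "d2": "nolan was born in london",
    "d3": "paris is the capital of france",
}
docs = iterative_retrieve("who directed inception", corpus)
print(sorted(docs))  # 第二跳经由 "nolan" 检索到 d2
```

两跳后即可覆盖回答多跳问题所需的证据链(d1→d2),而无关文档d3被排除在外。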

[NLP-31] Explicating the Implicit: Argument Detection Beyond Sentence Boundaries ACL2024
[NLP-31] 阐明隐含:超越句子边界的论点检测

链接: https://arxiv.org/abs/2408.04246
作者: Paul Roit,Aviv Slobodkin,Eran Hirsch,Arie Cattan,Ayal Klein,Valentina Pyatkin,Ido Dagan
关键词-EN: Detecting semantic arguments, Detecting semantic, conventionally modeled, predicate word, Detecting
关键词-ZN: 检测语义参数,检测语义,常规建模,代词,检测
类目: Computation and Language (cs.CL)
备注: 9 pages, ACL 2024

点击查看摘要

Abstract:Detecting semantic arguments of a predicate word has been conventionally modeled as a sentence-level task. The typical reader, however, perfectly interprets predicate-argument relations in a much wider context than just the sentence where the predicate was evoked. In this work, we reformulate the problem of argument detection through textual entailment to capture semantic relations across sentence boundaries. We propose a method that tests whether some semantic relation can be inferred from a full passage by first encoding it into a simple and standalone proposition and then testing for entailment against the passage. Our method does not require direct supervision, which is generally absent due to dataset scarcity, but instead builds on existing NLI and sentence-level SRL resources. Such a method can potentially explicate pragmatically understood relations into a set of explicit sentences. We demonstrate it on a recent document-level benchmark, outperforming some supervised methods and contemporary language models.
摘要:检测谓词的语义论元通常被建模为句子级任务。然而,典型的读者是在远比谓词所在句子更宽的上下文中解释谓词-论元关系的。在这项工作中,我们通过文本蕴涵重新表述论元检测问题,以捕获跨越句子边界的语义关系。我们提出一种方法来测试某个语义关系能否从完整段落中推断出来:先将其编码为一个简单、独立的命题,再针对该段落进行蕴涵测试。我们的方法不需要直接监督(这种监督由于数据集稀缺通常缺失),而是建立在现有的NLI和句子级SRL资源之上。这样的方法有望把语用上可理解的关系显式化为一组明确的句子。我们在最近的一个文档级基准上进行了验证,性能优于一些有监督方法和当代语言模型。

[NLP-32] Learning to Rewrite: Generalized LLM-Generated Text Detection
[NLP-32] 学习重写:广义LLM生成的文本检测

链接: https://arxiv.org/abs/2408.04237
作者: Wei Hao,Ran Li,Weiliang Zhao,Junfeng Yang,Chengzhi Mao
关键词-EN: Large language models, Large language, create non-factual content, language models, spread disinformation
关键词-ZN: 大型语言模型,大型语言,创建非事实内容,语言模型,传播虚假信息
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) can be abused at scale to create non-factual content and spread disinformation. Detecting LLM-generated content is essential to mitigate these risks, but current classifiers often fail to generalize in open-world contexts. Prior work shows that LLMs tend to rewrite LLM-generated content less frequently, which can be used for detection and naturally generalizes to unforeseen data. However, we find that the rewriting edit distance between human and LLM content can be indistinguishable across domains, leading to detection failures. We propose training an LLM to rewrite input text, producing minimal edits for LLM-generated content and more edits for human-written text, deriving a distinguishable and generalizable edit distance difference across different domains. Experiments on text from 21 independent domains and three popular LLMs (e.g., GPT-4o, Gemini, and Llama-3) show that our classifier outperforms the state-of-the-art zero-shot classifier by up to 20.6% on AUROC score and the rewriting classifier by 9.2% on F1 score. Our work suggests that LLM can effectively detect machine-generated text if they are trained properly.
摘要:大型语言模型(LLM)可能被大规模滥用,用以制造不实内容并传播虚假信息。检测LLM生成的内容对于降低这些风险至关重要,但当前的分类器往往无法在开放世界环境中泛化。先前的工作表明,LLM倾向于较少重写LLM生成的内容,这一特性可用于检测,并能自然地泛化到未见过的数据。然而,我们发现人类与LLM内容之间的重写编辑距离在不同领域可能难以区分,从而导致检测失败。我们建议训练一个LLM来重写输入文本,使其对LLM生成的内容产生最少的编辑,而对人类撰写的文本产生更多的编辑,从而在不同领域得出可区分、可泛化的编辑距离差异。在21个独立领域的文本和三个流行LLM(如GPT-4o、Gemini和Llama-3)上的实验表明,我们的分类器在AUROC上比最先进的零样本分类器高出20.6%,在F1上比重写分类器高出9.2%。我们的工作表明,只要训练得当,LLM可以有效检测机器生成的文本。
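"重写编辑距离小则判为机器生成"这一核心信号可以用一个玩具示例说明。下面用词级Levenshtein距离加一个固定阈值做演示;论文实际使用的是训练后的重写LLM和学习得到的判别边界,这里的阈值与示例句均为假设。

```python
def levenshtein(a, b):
    """两个词序列之间的编辑距离(动态规划,逐行滚动)。"""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, 1):
        cur = [i]
        for j, tok_b in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                       # 删除
                           cur[j - 1] + 1,                    # 插入
                           prev[j - 1] + (tok_a != tok_b)))   # 替换
        prev = cur
    return prev[-1]

def classify(original, rewrite, threshold=3):
    """重写改动很小 -> 判为LLM生成;改动大 -> 判为人类撰写。"""
    dist = levenshtein(original.split(), rewrite.split())
    return "llm" if dist < threshold else "human"

print(classify("the cat sat on the mat", "the cat sat on the mat"))       # llm
print(classify("the cat sat on the mat", "a small cat rested on a rug"))  # human
```

真实系统中,"rewrite"由训练后的重写模型产生,它被训练为对LLM文本几乎不作改动、对人类文本大幅改写,从而拉大两类之间的距离差。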

[NLP-33] Evaluating Language Model Math Reasoning via Grounding in Educational Curricula
[NLP-33] 通过教育课程基础评估语言模型数学推理

链接: https://arxiv.org/abs/2408.04226
作者: Li Lucy,Tal August,Rose E. Wang,Luca Soldaini,Courtney Allison,Kyle Lo
关键词-EN: evaluating language models’, mathematical abilities, language models’, work presents, angle for evaluating
关键词-ZN: 评估语言模型、数学能力、语言模型、作品呈现、评估角度
类目: Computation and Language (cs.CL)
备注: 30 pages, 23 figures

点击查看摘要

Abstract:Our work presents a novel angle for evaluating language models’ (LMs) mathematical abilities, by investigating whether they can discern skills and concepts enabled by math content. We contribute two datasets: one consisting of 385 fine-grained descriptions of K-12 math skills and concepts, or standards, from Achieve the Core (ATC), and another of 9.9K problems labeled with these standards (MathFish). Working with experienced teachers, we find that LMs struggle to tag and verify standards linked to problems, and instead predict labels that are close to ground truth, but differ in subtle ways. We also show that LMs often generate problems that do not fully align with standards described in prompts. Finally, we categorize problems in GSM8k using math standards, allowing us to better understand why some problems are more difficult to solve for models than others.
摘要:我们的工作通过考察语言模型(LM)能否识别数学内容所涉及的技能和概念,为评估其数学能力提供了一个新角度。我们贡献了两个数据集:一个由来自Achieve the Core(ATC)的385条K-12数学技能与概念(即标准)的细粒度描述组成,另一个由用这些标准标注的9.9K个问题组成(MathFish)。与经验丰富的教师合作,我们发现LM很难对问题所关联的标准进行标注和验证,而是预测出接近真实标签但在细微之处有所不同的标签。我们还表明,LM经常生成与提示中描述的标准不完全一致的问题。最后,我们使用数学标准对GSM8k中的问题进行分类,使我们能够更好地理解为什么某些问题对模型而言比其他问题更难解决。

[NLP-34] Diffusion Guided Language Modeling ACL
[NLP-34] 扩散引导语言建模

链接: https://arxiv.org/abs/2408.04220
作者: Justin Lovelace,Varsha Kishore,Yiwei Chen,Kilian Q. Weinberger
关键词-EN: demonstrate remarkable proficiency, Current language models, Current language, models demonstrate remarkable, language models demonstrate
关键词-ZN: 表现出出色的熟练程度,当前语言模型,当前语言,模型表现出出色,语言模型表现出
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: ACL Findings 2024

点击查看摘要

Abstract:Current language models demonstrate remarkable proficiency in text generation. However, for many applications it is desirable to control attributes, such as sentiment, or toxicity, of the generated language – ideally tailored towards each specific use case and target audience. For auto-regressive language models, existing guidance methods are prone to decoding errors that cascade during generation and degrade performance. In contrast, text diffusion models can easily be guided with, for example, a simple linear sentiment classifier – however they do suffer from significantly higher perplexity than auto-regressive alternatives. In this paper we use a guided diffusion model to produce a latent proposal that steers an auto-regressive language model to generate text with desired properties. Our model inherits the unmatched fluency of the auto-regressive approach and the plug-and-play flexibility of diffusion. We show that it outperforms previous plug-and-play guidance methods across a wide range of benchmark data sets. Further, controlling a new attribute in our framework is reduced to training a single logistic regression classifier.
摘要:当前的语言模型在文本生成方面表现出惊人的熟练程度。然而,许多应用需要控制所生成语言的属性(例如情感或毒性),最好能针对每个具体用例和目标受众进行定制。对于自回归语言模型,现有的引导方法容易在生成过程中产生级联的解码错误,从而降低性能。相比之下,文本扩散模型可以很容易地用例如一个简单的线性情感分类器来引导,但其困惑度明显高于自回归方案。在本文中,我们使用一个引导扩散模型来产生潜在提议,引导自回归语言模型生成具有期望属性的文本。我们的模型继承了自回归方法无与伦比的流畅性和扩散模型的即插即用灵活性。我们表明,在广泛的基准数据集上,它优于以前的即插即用引导方法。此外,在我们的框架中控制一个新属性被简化为训练单个逻辑回归分类器。

[NLP-35] Simplifying Translations for Children: Iterative Simplification Considering Age of Acquisition with LLMs ACL2024
[NLP-35] 面向儿童的翻译简化:利用LLM、考虑习得年龄的迭代简化

链接: https://arxiv.org/abs/2408.04217
作者: Masashi Oshika,Makoto Morishita,Tsutomu Hirao,Ryohei Sasano,Koichi Takeda
关键词-EN: neural machine translation, recent years, neural machine, everyday life, current NMT lacks
关键词-ZN: 神经机器翻译,近年来,神经机器,日常生活,当前NMT缺乏
类目: Computation and Language (cs.CL)
备注: Findings of ACL 2024

点击查看摘要

Abstract:In recent years, neural machine translation (NMT) has been widely used in everyday life. However, the current NMT lacks a mechanism to adjust the difficulty level of translations to match the user’s language level. Additionally, due to the bias in the training data for NMT, translations of simple source sentences are often produced with complex words. In particular, this could pose a problem for children, who may not be able to understand the meaning of the translations correctly. In this study, we propose a method that replaces words with high Age of Acquisitions (AoA) in translations with simpler words to match the translations to the user’s level. We achieve this by using large language models (LLMs), providing a triple of a source sentence, a translation, and a target word to be replaced. We create a benchmark dataset using back-translation on Simple English Wikipedia. The experimental results obtained from the dataset show that our method effectively replaces high-AoA words with lower-AoA words and, moreover, can iteratively replace most of the high-AoA words while still maintaining high BLEU and COMET scores.
摘要:近年来,神经机器翻译(NMT)在日常生活中得到了广泛应用。然而,当前的NMT缺乏一种机制来调整翻译的难度以匹配用户的语言水平。此外,由于NMT训练数据的偏差,简单源句的译文往往包含复杂的词汇。这尤其可能给儿童带来问题,他们可能无法正确理解译文的含义。在这项研究中,我们提出了一种方法,将译文中习得年龄(AoA)较高的词替换为更简单的词,以使译文符合用户的水平。我们通过使用大型语言模型(LLM)来实现这一点,向其提供由源句、译文和待替换目标词构成的三元组。我们在Simple English Wikipedia上使用回译创建了一个基准数据集。在该数据集上的实验结果表明,我们的方法能够有效地用低AoA词替换高AoA词,并且能够在保持较高BLEU和COMET分数的同时迭代替换大部分高AoA词。
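"迭代替换高AoA词"的流程可以用一个极简草图说明。下面的AoA表和替换词表都是虚构的玩具数据(真实工作使用已发表的AoA规范表,并由LLM给出替换建议),仅用于演示迭代替换直到所有词都低于阈值的过程。

```python
# 玩具习得年龄(AoA)表;真实工作使用已发表的AoA规范数据。
AOA = {"purchase": 10.2, "buy": 4.1, "reside": 11.0, "live": 3.8, "car": 3.5}
SIMPLER = {"purchase": "buy", "reside": "live"}  # 假设的LLM替换建议

def simplify(sentence, max_aoa=8.0):
    """迭代地把习得年龄高于 max_aoa 的词替换为更简单的词。"""
    words = sentence.split()
    changed = True
    while changed:          # 反复扫描,直到没有可替换的高AoA词
        changed = False
        for i, w in enumerate(words):
            if AOA.get(w, 0.0) > max_aoa and w in SIMPLER:
                words[i] = SIMPLER[w]
                changed = True
    return " ".join(words)

print(simplify("they purchase a car and reside here"))
# -> they buy a car and live here
```

迭代(而非一次性)替换的意义在于:某个替换词本身也可能仍高于阈值,需要继续简化,直到整句符合目标水平。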

[NLP-36] Attention Mechanism and Context Modeling System for Text Mining Machine Translation
[NLP-36] 文本挖掘机器翻译的注意力机制和上下文建模系统

链接: https://arxiv.org/abs/2408.04216
作者: Shi Bo,Yuwei Zhang,Junming Huang,Sitong Liu,Zexi Chen,Zizheng Li
关键词-EN: contextual apprehension capabilities, architectural schema anchored, K-means categorization algorithm, Transformer paradigm, paper advances
关键词-ZN: 上下文理解能力、架构模式锚定、K均值分类算法、Transformer范式、论文进展
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper advances a novel architectural schema anchored upon the Transformer paradigm and innovatively amalgamates the K-means categorization algorithm to augment the contextual apprehension capabilities of the schema. The transformer model performs well in machine translation tasks due to its parallel computing power and multi-head attention mechanism. However, it may encounter contextual ambiguity or ignore local features when dealing with highly complex language structures. To circumvent this constraint, this exposition incorporates the K-Means algorithm, which is used to stratify the lexis and idioms of the input textual matter, thereby facilitating superior identification and preservation of the local structure and contextual intelligence of the language. The advantage of this combination is that K-Means can automatically discover the topic or concept regions in the text, which may be directly related to translation quality. Consequently, the schema contrived herein enlists K-Means as a preparatory phase antecedent to the Transformer and recalibrates the multi-head attention weights to assist in the discrimination of lexis and idioms bearing analogous semantics or functionalities. This ensures the schema accords heightened regard to the contextual intelligence embodied by these clusters during the training phase, rather than merely focusing on locational intelligence.
摘要:本文提出了一种锚定于Transformer范式的新型架构模式,并创新性地融合了K-means分类算法,以增强该模式的上下文理解能力。Transformer模型凭借其并行计算能力和多头注意力机制,在机器翻译任务中表现良好。然而,在处理高度复杂的语言结构时,它可能遇到上下文歧义或忽略局部特征。为了规避这一限制,本文引入K-means算法,对输入文本的词汇和习语进行分层,从而有助于更好地识别和保留语言的局部结构和上下文智能。这种组合的优势在于,K-means能够自动发现文本中的主题或概念区域,而这些区域可能与翻译质量直接相关。因此,本文设计的模式将K-means作为Transformer之前的准备阶段,并重新校准多头注意力权重,以帮助区分具有相似语义或功能的词汇和习语。这确保了该模式在训练阶段高度重视这些聚类所体现的上下文智能,而不仅仅是关注位置信息。
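作为预处理阶段的K-means本身很简单。下面是一个一维标量上的朴素K-means实现(真实系统作用于高维词向量,这里的一维数据仅为演示"自动发现主题区域"这一点;数据为虚构):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """一维数据上的朴素 k-means;返回排序后的质心。"""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # 随机选初始质心
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                       # 分配:归入最近质心
            idx = min(range(k), key=lambda c: abs(p - centroids[c]))
            clusters[idx].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]   # 更新:取簇均值
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# 一个嵌入维度上两个明显的"主题区域"。
points = [0.1, 0.2, 0.0, 9.9, 10.1, 10.0]
print(kmeans(points, 2))  # 质心约为 0.1 和 10.0
```

在论文的设定中,这样得到的簇被用来调整多头注意力的权重,使语义或功能相近的词汇落入同一"概念区域"。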

[NLP-37] MMREC: LLM Based Multi-Modal Recommender System
[NLP-37] MMREC:基于LLM的多模式推荐系统

链接: https://arxiv.org/abs/2408.04211
作者: Jiahao Tian,Jinman Zhao,Zhenkai Wang,Zhicheng Ding
关键词-EN: growing rapidly due, content generated daily, recommender systems, generated daily, effective recommender systems
关键词-ZN: 由于每天生成的内容、推荐系统、每天生成的、有效的推荐系统,快速增长
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:The importance of recommender systems is growing rapidly due to the exponential increase in the volume of content generated daily. This surge in content presents unique challenges for designing effective recommender systems. Key among these challenges is the need to effectively leverage the vast amounts of natural language data and images that represent user preferences. This paper presents a novel approach to enhancing recommender systems by leveraging Large Language Models (LLMs) and deep learning techniques. The proposed framework aims to improve the accuracy and relevance of recommendations by incorporating multi-modal information processing and by the use of unified latent space representation. The study explores the potential of LLMs to better understand and utilize natural language data in recommendation contexts, addressing the limitations of previous methods. The framework efficiently extracts and integrates text and image information through LLMs, unifying diverse modalities in a latent space to simplify the learning process for the ranking model. Experimental results demonstrate the enhanced discriminative power of the model when utilizing multi-modal information. This research contributes to the evolving field of recommender systems by showcasing the potential of LLMs and multi-modal data integration to create more personalized and contextually relevant recommendations.
摘要:由于每天产生的内容数量呈指数级增长,推荐系统的重要性正在迅速增长。内容的激增给设计有效的推荐系统带来了独特的挑战。这些挑战中的关键是需要有效地利用代表用户偏好的海量自然语言数据和图像。提出了一种利用大语言模型和深度学习技术来增强推荐系统的新方法。拟议的框架旨在通过结合多模式信息处理和使用统一的潜在空间表示来提高建议的准确性和相关性。这项研究探索了LLMS在推荐上下文中更好地理解和利用自然语言数据的潜力,解决了以前方法的局限性。该框架通过LLMS有效地提取和集成文本和图像信息,统一了潜在空间中的各种模式,简化了排序模型的学习过程。实验结果表明,该模型在利用多模式信息时具有较强的识别力。这项研究通过展示LLMS和多模式数据集成的潜力来创建更个性化和上下文相关的推荐,从而为不断发展的推荐系统领域做出贡献。

[NLP-38] wav2graph: A Framework for Supervised Learning Knowledge Graph from Speech
[NLP-38] wav2graph:从语音进行有监督知识图谱学习的框架

链接: https://arxiv.org/abs/2408.04174
作者: Khai Le-Duc,Quy-Anh Dang,Tan-Hanh Pham,Truong-Son Hy
关键词-EN: large language models, enhance the performance, providing structured, reasoning and context-awareness, performance of large
关键词-ZN: 大型语言模型,增强性能,提供结构化、推理和上下文感知、大型性能
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Preprint, 32 pages

点击查看摘要

Abstract:Knowledge graphs (KGs) enhance the performance of large language models (LLMs) and search engines by providing structured, interconnected data that improves reasoning and context-awareness. However, KGs only focus on text data, thereby neglecting other modalities such as speech. In this work, we introduce wav2graph, the first framework for supervised learning knowledge graph from speech data. Our pipeline are straightforward: (1) constructing a KG based on transcribed spoken utterances and a named entity database, (2) converting KG into embedding vectors, and (3) training graph neural networks (GNNs) for node classification and link prediction tasks. Through extensive experiments conducted in inductive and transductive learning contexts using state-of-the-art GNN models, we provide baseline results and error analysis for node classification and link prediction tasks on human transcripts and automatic speech recognition (ASR) transcripts, including evaluations using both encoder-based and decoder-based node embeddings, as well as monolingual and multilingual acoustic pre-trained models. All related code, data, and models are published online.
摘要:知识图谱(KG)通过提供结构化、相互关联的数据来改进推理和上下文感知,从而提升大型语言模型(LLM)和搜索引擎的性能。然而,KG只关注文本数据,从而忽略了语音等其他模态。在这项工作中,我们介绍了wav2graph,这是首个从语音数据进行有监督知识图谱学习的框架。我们的流程很简单:(1)基于转录的口语话语和命名实体数据库构建KG,(2)将KG转换为嵌入向量,(3)训练图神经网络(GNN)完成节点分类和链接预测任务。通过使用最先进的GNN模型在归纳式和直推式学习环境中进行的大量实验,我们为人工转录文本和自动语音识别(ASR)转录文本上的节点分类和链接预测任务提供了基线结果和误差分析,包括使用基于编码器和基于解码器的节点嵌入,以及单语和多语声学预训练模型的评估。所有相关代码、数据和模型均已在线发布。

[NLP-39] mbrs: A Library for Minimum Bayes Risk Decoding
[NLP-39] mbrs:最小Bayes风险解码库

链接: https://arxiv.org/abs/2408.04167
作者: Hiroyuki Deguchi,Yusuke Sakai,Hidetaka Kamigaito,Taro Watanabe
关键词-EN: Minimum Bayes risk, Minimum Bayes, text generation tasks, outperforms conventional maximum, selecting high-quality outputs
关键词-ZN: 最小Bayes风险、最小Bayes、文本生成任务、优于传统最大值、选择高质量输出
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Minimum Bayes risk (MBR) decoding is a decision rule for text generation tasks that outperforms conventional maximum a posteriori (MAP) decoding using beam search by selecting high-quality outputs based on a utility function rather than those with high probability. Typically, it finds the most suitable hypothesis from the set of hypotheses under the sampled pseudo-references. mbrs is a library of MBR decoding, which can flexibly combine various metrics, alternative expectation estimations, and algorithmic variants. It is designed with a focus on speed measurement and calling count of code blocks, transparency, reproducibility, and extensibility, which are essential for researchers and developers. We published our mbrs as an MIT-licensed open-source project, and the code is available on GitHub: this https URL
摘要:最小贝叶斯风险(MBR)解码是文本生成任务的一种决策规则,它基于效用函数而非高概率来选择高质量输出,从而优于传统的基于束搜索的最大后验(MAP)解码。通常,它从采样的伪参考下的假设集合中找到最合适的假设。mbrs是一个MBR解码库,可以灵活组合各种度量指标、不同的期望估计方法和算法变体。它的设计重点是代码块的速度测量与调用计数、透明度、可复现性和可扩展性,这些对研究人员和开发人员都至关重要。我们将mbrs作为MIT许可的开源项目发布,代码可在GitHub上获取:this https URL
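MBR解码的核心规则——"选期望效用最高的假设,而非概率最高的假设"——可以用几行代码说明。下面用一个玩具的unigram F1作为效用函数,并按常见做法把假设集合本身当作伪参考池;这只是对决策规则的示意,并非mbrs库的API。

```python
def unigram_f1(hyp, ref):
    """玩具效用函数:两个字符串之间的unigram重叠F1。"""
    h, r = set(hyp.split()), set(ref.split())
    overlap = len(h & r)
    if overlap == 0:
        return 0.0
    p, rec = overlap / len(h), overlap / len(r)
    return 2 * p * rec / (p + rec)

def mbr_decode(hypotheses, utility=unigram_f1):
    """返回对伪参考池(此处即假设集合自身)期望效用最高的假设。"""
    def expected_utility(h):
        return sum(utility(h, r) for r in hypotheses if r is not h) \
               / (len(hypotheses) - 1)
    return max(hypotheses, key=expected_utility)

hyps = ["the cat sat", "the cat sat down", "the cat"]
print(mbr_decode(hyps))  # -> the cat sat(与其余假设平均重叠最高)
```

这解释了MBR相对MAP的直觉:被选中的输出是与整个样本分布"一致性"最高的那个,而不是单条概率最高的那个;mbrs在此之上提供可替换的效用度量与期望估计方法。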

[NLP-40] Semantics or spelling? Probing contextual word embeddings with orthographic noise
[NLP-40] 语义还是拼写?用正字法噪声探测上下文词嵌入

链接: https://arxiv.org/abs/2408.04162
作者: Jacob A. Matthews,John R. Starr,Marten van Schijndel
关键词-EN: Pretrained language model, Pretrained language, contextual word embeddings, language model, frequently employed
关键词-ZN: 预训练语言模型,预训练语言,上下文词嵌入,语言模型,经常使用
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Pretrained language model (PLM) hidden states are frequently employed as contextual word embeddings (CWE): high-dimensional representations that encode semantic information given linguistic context. Across many areas of computational linguistics research, similarity between CWEs is interpreted as semantic similarity. However, it remains unclear exactly what information is encoded in PLM hidden states. We investigate this practice by probing PLM representations using minimal orthographic noise. We expect that if CWEs primarily encode semantic information, a single character swap in the input word will not drastically affect the resulting representation, given sufficient linguistic context. Surprisingly, we find that CWEs generated by popular PLMs are highly sensitive to noise in input data, and that this sensitivity is related to subword tokenization: the fewer tokens used to represent a word at input, the more sensitive its corresponding CWE. This suggests that CWEs capture information unrelated to word-level meaning and can be manipulated through trivial modifications of input data. We conclude that these PLM-derived CWEs may not be reliable semantic proxies, and that caution is warranted when interpreting representational similarity
摘要:预训练语言模型(PLM)的隐状态经常被用作上下文词嵌入(CWE):在给定语言上下文的情况下编码语义信息的高维表示。在计算语言学研究的许多领域中,CWE之间的相似性被解释为语义相似性。然而,PLM隐状态中究竟编码了哪些信息仍不清楚。我们通过使用最小的正字法噪声来探测PLM表示,以考察这一做法。我们预计,如果CWE主要编码语义信息,在给定足够语言上下文的情况下,输入单词中的单个字符交换不会对所得表示产生重大影响。令人惊讶的是,我们发现流行PLM生成的CWE对输入数据中的噪声高度敏感,并且这种敏感性与子词切分有关:在输入端用来表示一个词的token越少,其对应的CWE就越敏感。这表明CWE捕获了与词级含义无关的信息,并且可以通过对输入数据的微小修改加以操纵。我们的结论是,这些由PLM派生的CWE可能不是可靠的语义代理,在解释表示相似性时需要谨慎。
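论文所说的"最小正字法噪声"(单个字符交换)本身很容易实现。下面是一个相邻字符交换的草图,展示如何在保留词形大部分信息的前提下制造这类扰动;具体的噪声采样方式是这里的假设,未必与论文完全一致。

```python
import random

def swap_one_char(word, rng):
    """最小正字法噪声:交换一对相邻字符。"""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)   # 随机选一个相邻位置
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

rng = random.Random(42)
noisy = swap_one_char("language", rng)
print(noisy)  # 与 "language" 恰好相差一次相邻交换
```

实验的做法是把加噪前后的词放进同一句子,各自取出CWE并比较相似度:若嵌入主要编码语义,这种微小扰动在充分上下文下不应显著改变表示。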

[NLP-41] UNLEARN Efficient Removal of Knowledge in Large Language Models
[NLP-41] UNLEARN:大型语言模型中知识的高效删除

链接: https://arxiv.org/abs/2408.04140
作者: Tyler Lizzo,Larry Heck
关键词-EN: dynamically forgetting specific, large language models, forgetting specific knowledge, private or proprietary, dynamically forgetting
关键词-ZN: 动态忘记特定的大型语言模型,忘记特定的知识,私人的或专有的,动态忘记
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 11 pages, 2 Figures

点击查看摘要

Abstract:Given the prevalence of large language models (LLMs) and the prohibitive cost of training these models from scratch, dynamically forgetting specific knowledge e.g., private or proprietary, without retraining the model has become an important capability. This paper proposes a novel method to achieve this objective called UNLEARN. The approach builds upon subspace methods to identify and specifically target the removal of knowledge without adversely affecting other knowledge in the LLM. Results demonstrate 96% of targeted knowledge can be forgotten while maintaining performance on other knowledge within 2.5% of the original model, significantly outperforming the discriminatory abilities of the previous state-of-the-art. A dual method called LEARN is also proposed for targeted knowledge addition. Results show LEARN can match the fine-tuning accuracy of Low-Rank Adaptation (LoRA) without adversely affecting similar tasks.
摘要:鉴于大型语言模型(LLM)的普及以及从头训练这些模型的高昂成本,在不重新训练模型的情况下动态遗忘特定知识(例如私有或专有知识)已成为一项重要能力。本文提出了一种实现这一目标的新方法,称为UNLEARN。该方法基于子空间方法来识别并定向移除知识,而不会对LLM中的其他知识产生不利影响。结果表明,96%的目标知识可以被遗忘,同时将其他知识上的性能保持在原始模型的2.5%以内,显著优于此前最先进技术的区分能力。文中还提出了一种名为LEARN的对偶方法,用于定向知识添加。结果表明,LEARN可以达到与低秩适配(LoRA)微调相当的准确率,而不会对类似任务产生不利影响。
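摘要只提到UNLEARN基于"子空间方法",未给出细节。作为一般性示意,下面展示子空间遗忘的常见基本操作:把向量在某个"知识方向"上的分量正交投影掉,而保留子空间之外的分量。方向和数值均为虚构,仅说明投影本身。

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_out(v, basis):
    """从 v 中移除其在(标准正交)基所张子空间内的分量,
    子空间之外的方向保持不变。"""
    out = list(v)
    for b in basis:
        coeff = dot(out, b)
        out = [x - coeff * bx for x, bx in zip(out, b)]
    return out

# 假设的待遗忘"知识方向"(已单位化)。
forget_dir = [1.0, 0.0, 0.0]
vec = [3.0, 2.0, 1.0]
cleaned = project_out(vec, [forget_dir])
print(cleaned)  # -> [0.0, 2.0, 1.0]:第一维分量被移除,其余不变
```

这正是"移除目标知识而不干扰其他知识"在线性代数层面的直觉:遗忘只作用于识别出的子空间,与之正交的表示不受影响。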

[NLP-42] Enhancing Healthcare through Large Language Models : A Study on Medical Question Answering
[NLP-42] 通过大型语言模型加强医疗保健:医学问题解答研究

链接: https://arxiv.org/abs/2408.04138
作者: Haoran Yu,Chang Yu,Zihan Wang,Dongxian Zou,Hao Qin
关键词-EN: Large Language Models, Large Language, shown significant promise, application of Large, Language Models
关键词-ZN: 大型语言模型,大型语言,显示出巨大的前景,大型语言模型的应用
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: received by IEEE ICPICS

点击查看摘要

Abstract:In recent years, the application of Large Language Models (LLMs) in healthcare has shown significant promise in improving the accessibility and dissemination of medical knowledge. This paper presents a detailed study of various LLMs trained on the MedQuAD medical question-answering dataset, with a focus on identifying the most effective model for providing accurate medical information. Among the models tested, the Sentence-t5 combined with Mistral 7B demonstrated superior performance, achieving a precision score of 0.762. This model’s enhanced capabilities are attributed to its advanced pretraining techniques, robust architecture, and effective prompt construction methodologies. By leveraging these strengths, the Sentence-t5 + Mistral 7B model excels in understanding and generating precise medical answers. Our findings highlight the potential of integrating sophisticated LLMs in medical contexts to facilitate efficient and accurate medical knowledge retrieval, thus significantly enhancing patient education and support.
摘要:近年来,大语言模型(LLM)在医疗保健领域的应用在改善医学知识的可及性和传播方面展现出巨大前景。本文详细研究了在MedQuAD医疗问答数据集上训练的各种LLM,重点是识别提供准确医疗信息的最有效模型。在测试的模型中,Sentence-t5与Mistral 7B的组合表现出优越的性能,精确率达到0.762。该模型的增强能力归功于其先进的预训练技术、稳健的架构和有效的提示构建方法。凭借这些优势,Sentence-t5 + Mistral 7B模型在理解和生成精确的医学答案方面表现出色。我们的发现突显了在医疗场景中集成复杂LLM以实现高效、准确的医学知识检索的潜力,从而显著增强患者教育和支持。

[NLP-43] Incorporating Spatial Awareness in Data-Driven Gesture Generation for Virtual Agents
[NLP-43] 虚拟代理数据驱动手势生成中增强空间感知

链接: https://arxiv.org/abs/2408.04127
作者: Anna Deichler,Simon Alexanderson,Jonas Beskow
关键词-EN: agents’ non-verbal behaviors, enhancing human-agent communication, integrating spatial context, virtual agents’ non-verbal, non-verbal behaviors
关键词-ZN: 代理人的非言语行为,增强人与代理人的沟通,整合空间上下文,虚拟代理人的非言语、非言语行为
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Graphics (cs.GR)
备注:

点击查看摘要

Abstract:This paper focuses on enhancing human-agent communication by integrating spatial context into virtual agents’ non-verbal behaviors, specifically gestures. Recent advances in co-speech gesture generation have primarily utilized data-driven methods, which create natural motion but limit the scope of gestures to those performed in a void. Our work aims to extend these methods by enabling generative models to incorporate scene information into speech-driven gesture synthesis. We introduce a novel synthetic gesture dataset tailored for this purpose. This development represents a critical step toward creating embodied conversational agents that interact more naturally with their environment and users.
摘要:本文重点关注通过将空间上下文集成到虚拟代理的非言语行为(特别是手势)中来增强人机通信。协同语音手势生成方面的最新进展主要利用数据驱动的方法,这些方法创建自然运动,但将手势的范围限制在虚空中执行的手势。我们的工作旨在通过使生成模型能够将场景信息纳入语音驱动的手势合成来扩展这些方法。我们引入了一种为此目的量身定制的新型合成手势数据集。这一开发代表了创建与环境和用户更自然地交互的嵌入式对话代理的关键一步。

[NLP-44] Can Rule-Based Insights Enhance LLMs for Radiology Report Classification? Introducing the RadPrompt Methodology ACL2024
[NLP-44] 基于规则的洞察能否增强放射学报告分类的LLM?引入RadPrompt方法

链接: https://arxiv.org/abs/2408.04121
作者: Panagiotis Fytas,Anna Breger,Ian Selby,Simon Baker,Shahab Shahipasand,Anna Korhonen
关键词-EN: Developing imaging models, Developing imaging, chest X-rays, imaging models capable, capable of detecting
关键词-ZN: 开发成像模型、开发成像、胸部X光检查、能够检测的成像模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at BioNLP, ACL 2024

点击查看摘要

Abstract:Developing imaging models capable of detecting pathologies from chest X-rays can be cost and time-prohibitive for large datasets as it requires supervision to attain state-of-the-art performance. Instead, labels extracted from radiology reports may serve as distant supervision since these are routinely generated as part of clinical practice. Despite their widespread use, current rule-based methods for label extraction rely on extensive rule sets that are limited in their robustness to syntactic variability. To alleviate these limitations, we introduce RadPert, a rule-based system that integrates an uncertainty-aware information schema with a streamlined set of rules, enhancing performance. Additionally, we have developed RadPrompt, a multi-turn prompting strategy that leverages RadPert to bolster the zero-shot predictive capabilities of large language models, achieving a statistically significant improvement in weighted average F1 score over GPT-4 Turbo. Most notably, RadPrompt surpasses both its underlying models, showcasing the synergistic potential of LLMs with rule-based models. We have evaluated our methods on two English Corpora: the MIMIC-CXR gold-standard test set and a gold-standard dataset collected from the Cambridge University Hospitals.
摘要:开发能够从胸部X光中检测病变的成像模型,对于大型数据集而言可能在成本和时间上难以承受,因为获得最先进的性能需要监督信号。取而代之,从放射学报告中提取的标签可以作为远程监督,因为这些报告是临床实践中例行生成的。尽管被广泛使用,当前基于规则的标签提取方法依赖于庞大的规则集,其对句法变化的稳健性有限。为了缓解这些限制,我们引入了RadPert,这是一个基于规则的系统,它将感知不确定性的信息模式与一套精简的规则相结合,从而提升性能。此外,我们还开发了RadPrompt,这是一种多轮提示策略,利用RadPert来增强大型语言模型的零样本预测能力,在加权平均F1分数上相对GPT-4 Turbo取得了统计显著的提升。最值得注意的是,RadPrompt超越了其两个底层模型,展示了LLM与基于规则的模型的协同潜力。我们在两个英语语料库上评估了我们的方法:MIMIC-CXR金标准测试集和从剑桥大学医院收集的金标准数据集。
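"基于规则、带不确定性标签的报告标注"可以用一个极简草图说明。下面的规则集、否定词表和三分类标签(positive/negative/uncertain)都是演示用的假设,远比RadPert实际的规则与信息模式简单。

```python
import re

FINDINGS = ["effusion", "pneumothorax", "consolidation"]
NEGATIONS = re.compile(r"\b(no|without|absent|negative for)\b")

def extract_labels(report):
    """句子级规则提取:若所在句子含否定词则判为阴性,
    否则判为阳性;未被提及的病变标为 'uncertain'。"""
    labels = {f: "uncertain" for f in FINDINGS}
    for sentence in re.split(r"[.;]", report.lower()):
        negated = bool(NEGATIONS.search(sentence))
        for f in FINDINGS:
            if f in sentence:
                labels[f] = "negative" if negated else "positive"
    return labels

report = "There is a small pleural effusion. No pneumothorax."
print(extract_labels(report))
# -> {'effusion': 'positive', 'pneumothorax': 'negative', 'consolidation': 'uncertain'}
```

这也说明了摘要中"规则集对句法变化稳健性有限"的含义:稍微换一种否定表述("free of …"、双重否定等)就需要新增规则,而RadPrompt的思路是把这类规则输出作为提示信息喂给LLM来弥补。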

[NLP-45] Zero-shot Factual Consistency Evaluation Across Domains
[NLP-45] 跨领域零镜头事实一致性评估

链接: https://arxiv.org/abs/2408.04114
作者: Raunak Agarwal
关键词-EN: text generation systems, Natural Language Inference, Factual Consistency Evaluation, factual consistency, generation systems
关键词-ZN: 文本生成系统、自然语言推理、事实一致性评估、事实一致性、生成系统
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This work addresses the challenge of factual consistency in text generation systems. We unify the tasks of Natural Language Inference, Summarization Evaluation, Factuality Verification and Factual Consistency Evaluation to train models capable of evaluating the factual consistency of source-target pairs across diverse domains. We rigorously evaluate these against eight baselines on a comprehensive benchmark suite comprising 22 datasets that span various tasks, domains, and document lengths. Results demonstrate that our method achieves state-of-the-art performance on this heterogeneous benchmark while addressing efficiency concerns and attaining cross-domain generalization.
摘要:这项工作解决了文本生成系统中事实一致性的挑战。我们统一了自然语言推理、总结评估、事实验证和事实一致性评估的任务,以训练能够评估不同领域源目标对的事实一致性的模型。我们在一个全面的基准套件上根据八个基线严格评估这些基线,该套件由涵盖各种任务、域和文档长度的22个数据集组成。结果表明,我们的方法在这个异类基准测试上实现了最先进的性能,同时解决了效率问题并实现了跨域通用化。

[NLP-46] Patchview: LLM-Powered Worldbuilding with Generative Dust and Magnet Visualization
[NLP-46] Patchview:LLM支持的Worldbuilding,具有生成灰尘和磁铁可视化

链接: https://arxiv.org/abs/2408.04112
作者: John Joon Young Chung,Max Kreminski
关键词-EN: Large language models, Large language, writers build story, writers build, generating world elements
关键词-ZN: 大型语言模型,大型语言,作家构建故事,作家构建,生成世界元素
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted to UIST2024

点击查看摘要

Abstract:Large language models (LLMs) can help writers build story worlds by generating world elements, such as factions, characters, and locations. However, making sense of many generated elements can be overwhelming. Moreover, if the user wants to precisely control aspects of generated elements that are difficult to specify verbally, prompting alone may be insufficient. We introduce Patchview, a customizable LLM-powered system that visually aids worldbuilding by allowing users to interact with story concepts and elements through the physical metaphor of magnets and dust. Elements in Patchview are visually dragged closer to concepts with high relevance, facilitating sensemaking. The user can also steer the generation with verbally elusive concepts by indicating the desired position of the element between concepts. When the user disagrees with the LLM’s visualization and generation, they can correct those by repositioning the element. These corrections can be used to align the LLM’s future behaviors to the user’s perception. With a user study, we show that Patchview supports the sensemaking of world elements and steering of element generation, facilitating exploration during the worldbuilding process. Patchview provides insights on how customizable visual representation can help sensemake, steer, and align generative AI model behaviors with the user’s intentions.
摘要:大型语言模型(LLM)可以通过生成派系、人物和地点等世界元素来帮助作家构建故事世界。然而,理解大量生成的元素可能令人不堪重负。此外,如果用户想要精确控制所生成元素中难以用语言指定的方面,仅靠提示可能是不够的。我们推出了Patchview,这是一个可定制的、由LLM驱动的系统,允许用户通过磁铁和尘埃的物理隐喻与故事概念和元素交互,从而在视觉上辅助世界构建。Patchview中的元素会在视觉上被拖向与其高度相关的概念,有助于意义理解(sensemaking)。用户还可以通过指示元素在概念之间的期望位置,来引导涉及难以言明概念的生成。当用户不同意LLM的可视化和生成结果时,可以通过重新定位元素来加以纠正。这些纠正可以用来使LLM的未来行为与用户的感知保持一致。通过用户研究,我们表明Patchview支持对世界元素的意义理解和对元素生成的引导,促进了世界构建过程中的探索。Patchview就可定制的视觉表示如何帮助理解、引导生成式AI模型行为并使其与用户意图保持一致提供了见解。
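The magnet-and-dust layout described above can be sketched as relevance-weighted placement. This toy uses cosine similarity over made-up embeddings as a stand-in for Patchview's LLM-judged relevance; all names and values below are illustrative, not from the paper:

```python
import math

def relevance(element_vec, concept_vec):
    """Cosine similarity as a stand-in for LLM-judged relevance."""
    dot = sum(a * b for a, b in zip(element_vec, concept_vec))
    norm_e = math.sqrt(sum(a * a for a in element_vec))
    norm_c = math.sqrt(sum(b * b for b in concept_vec))
    return dot / (norm_e * norm_c)

def place_element(element_vec, concepts):
    """Drag an element toward concept 'magnets' in proportion to relevance.

    concepts: list of ((x, y), embedding) pairs for each on-canvas concept.
    Returns the relevance-weighted average of the concept positions, so the
    element lands visually closer to the concepts it is most related to.
    """
    weights = [max(relevance(element_vec, emb), 0.0) for _, emb in concepts]
    total = sum(weights) or 1.0
    x = sum(w * pos[0] for w, (pos, _) in zip(weights, concepts)) / total
    y = sum(w * pos[1] for w, (pos, _) in zip(weights, concepts)) / total
    return (x, y)

# Two concept "magnets": one at the far left, one at the far right.
concepts = [((0.0, 0.0), [1.0, 0.0]), ((10.0, 0.0), [0.0, 1.0])]
war_faction = [0.8, 0.2]  # a story element mostly related to the first concept
print(place_element(war_faction, concepts))  # lands closer to the left-hand concept
```

Repositioning an element by hand would then amount to overriding the computed weights for that element, which is the correction signal the system can learn from.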

[NLP-47] Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters
[NLP-47] 树注意力:在GPU集群上进行长上下文注意力的拓扑感知解码

链接: https://arxiv.org/abs/2408.04093
作者: Vasudev Shyam,Jonathan Pilault,Emily Shepperd,Quentin Anthony,Beren Millidge
关键词-EN: modern transformer architectures, significant computational bottleneck, core mathematical operation, computational bottleneck due, core mathematical
关键词-ZN: 现代Transformer架构、重大计算瓶颈、核心数学运算、计算瓶颈、核心数学
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Self-attention is the core mathematical operation of modern transformer architectures and is also a significant computational bottleneck due to its quadratic complexity in the sequence length. In this work, we derive the scalar energy function whose gradient computes the self-attention block, thus elucidating the theoretical underpinnings of self-attention, providing a Bayesian interpretation of the operation and linking it closely with energy-based models such as Hopfield Networks. Moreover, due to this formulation, we discover that we can use efficient and optimized automatic-differentiation techniques to derive a highly efficient Tree Attention algorithm to compute the gradient of the energy and hence self-attention. Our formulation reveals that the reduction across the sequence axis can be efficiently computed in parallel through a tree reduction. Our algorithm, for parallelizing attention computation across multiple GPUs, enables cross-device decoding to be performed asymptotically faster (up to 8x faster) than alternative approaches such as Ring Attention, while also requiring significantly less communication volume and incurring 2x less peak memory. Our code is publicly available here: this https URL
摘要:自注意力是现代Transformer架构的核心数学运算,由于其复杂度随序列长度二次增长,它也是一个重要的计算瓶颈。在这项工作中,我们推导出一个标量能量函数,其梯度即为自注意力块的计算,从而阐明了自注意力的理论基础,为该运算提供了贝叶斯解释,并将其与Hopfield网络等基于能量的模型紧密联系起来。此外,基于这一表述,我们发现可以使用高效且经过优化的自动微分技术,推导出一种高效的树注意力(Tree Attention)算法来计算能量的梯度,进而计算自注意力。我们的表述表明,沿序列轴的归约可以通过树形归约高效地并行计算。我们的算法用于在多个GPU上并行化注意力计算,使跨设备解码的执行速度比环形注意力(Ring Attention)等替代方法渐近更快(最高快8倍),同时所需通信量显著更少,峰值内存降低2倍。我们的代码公开于:this https URL
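The key algebraic fact behind the tree reduction is that softmax attention's running (max, sum-of-exponentials, weighted-value) triple can be merged associatively across chunks, so partial results from different devices combine in any tree order. The following pure-Python toy illustrates that idea under my own naming; it is a numerical sketch, not the paper's GPU implementation:

```python
import math

def chunk_attend(q, keys, values):
    """Partial attention over one chunk: (max_score, sum_exp, weighted_values)."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    s = sum(math.exp(x - m) for x in scores)
    v = [sum(math.exp(x - m) * vec[d] for x, vec in zip(scores, values))
         for d in range(len(values[0]))]
    return m, s, v

def combine(a, b):
    """Associative merge of two partial results: the tree-reduction operator."""
    (ma, sa, va), (mb, sb, vb) = a, b
    m = max(ma, mb)
    wa, wb = math.exp(ma - m), math.exp(mb - m)
    return m, wa * sa + wb * sb, [wa * x + wb * y for x, y in zip(va, vb)]

def tree_attention(q, keys, values, chunk=2):
    """Attention for one query, reduced pairwise as if chunks lived on devices."""
    parts = [chunk_attend(q, keys[i:i + chunk], values[i:i + chunk])
             for i in range(0, len(keys), chunk)]
    while len(parts) > 1:
        parts = [combine(parts[i], parts[i + 1]) if i + 1 < len(parts) else parts[i]
                 for i in range(0, len(parts), 2)]
    m, s, v = parts[0]
    return [x / s for x in v]

def full_attention(q, keys, values):
    """Reference: direct softmax attention over all keys at once."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    ws = [math.exp(x - m) for x in scores]
    s = sum(ws)
    return [sum(w * vec[d] for w, vec in zip(ws, values)) / s
            for d in range(len(values[0]))]
```

Because `combine` is associative, the reduction tree can be shaped to match the cluster topology, which is where the communication savings come from.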

[NLP-48] Human Speech Perception in Noise: Can Large Language Models Paraphrase to Improve It? ACL2024
[NLP-48] 噪声中的人类言语感知:大型语言模型能否通过改写来改善它?

链接: https://arxiv.org/abs/2408.04029
作者: Anupama Chingacham,Miaoran Zhang,Vera Demberg,Dietrich Klakow
关键词-EN: Large Language Models, Large Language, Language Models, transferring style attributes, human speech perception
关键词-ZN: 大型语言模型,大型语言,语言模型,传递风格属性,人类言语感知
类目: Computation and Language (cs.CL)
备注: Accepted at HuCLLM @ ACL 2024

点击查看摘要

Abstract:Large Language Models (LLMs) can generate text by transferring style attributes like formality resulting in formal or informal text. However, instructing LLMs to generate text that when spoken, is more intelligible in an acoustically difficult environment, is an under-explored topic. We conduct the first study to evaluate LLMs on a novel task of generating acoustically intelligible paraphrases for better human speech perception in noise. Our experiments in English demonstrated that with standard prompting, LLMs struggle to control the non-textual attribute, i.e., acoustic intelligibility, while efficiently capturing the desired textual attributes like semantic equivalence. To remedy this issue, we propose a simple prompting approach, prompt-and-select, which generates paraphrases by decoupling the desired textual and non-textual attributes in the text generation pipeline. Our approach resulted in a 40% relative improvement in human speech perception, by paraphrasing utterances that are highly distorted in a listening condition with babble noise at a signal-to-noise ratio (SNR) -5 dB. This study reveals the limitation of LLMs in capturing non-textual attributes, and our proposed method showcases the potential of using LLMs for better human speech perception in noise.
摘要:大型语言模型(LLM)可以通过迁移形式程度等风格属性来生成正式或非正式的文本。然而,指导LLM生成在声学困难环境中说出时更易被听懂的文本,是一个尚未被充分探索的课题。我们进行了首次研究,评估LLM在一项新任务上的表现:生成声学上更易听懂的改写,以改善人类在噪声中的言语感知。我们的英语实验表明,在标准提示下,LLM难以控制非文本属性(即声学可懂度),但能有效捕捉语义等价等所需的文本属性。为了解决这一问题,我们提出了一种简单的提示方法,即提示并选择(prompt-and-select),它通过在文本生成流程中解耦所需的文本属性与非文本属性来生成改写。通过改写在信噪比(SNR)为-5 dB的嘈杂人声(babble)噪声听音条件下高度失真的话语,我们的方法使人类言语感知相对提升了40%。这项研究揭示了LLM在捕捉非文本属性方面的局限性,而我们提出的方法展示了利用LLM改善人类在噪声中言语感知的潜力。
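The prompt-and-select idea is simple enough to sketch: one step generates meaning-preserving candidates, a separate step selects on the non-textual attribute. In this sketch the LLM call is replaced by hard-coded candidates and the acoustic-intelligibility score by an arbitrary word-length heuristic; both are placeholders, not the paper's actual components:

```python
def generate_paraphrases(utterance):
    """Placeholder for an LLM prompted to produce meaning-preserving
    paraphrases; a real system would call the model here."""
    canned = {
        "the weather will be bad": [
            "the weather will be bad",
            "expect poor weather today",
            "it is going to rain hard",
        ],
    }
    return canned.get(utterance, [utterance])

def intelligibility_proxy(sentence):
    """Placeholder score for intelligibility in noise: mean word length,
    purely for illustration (the paper measures intelligibility with
    human listeners, not a heuristic)."""
    words = sentence.split()
    return sum(len(w) for w in words) / len(words)

def prompt_and_select(utterance):
    """Decouple the attributes: generation handles semantics, a separate
    selection step handles the non-textual attribute."""
    candidates = generate_paraphrases(utterance)
    return max(candidates, key=intelligibility_proxy)

print(prompt_and_select("the weather will be bad"))
```

The point of the decoupling is that the generator never needs to optimise the acoustic criterion directly; it only has to supply semantically faithful variety for the selector to rank.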

[NLP-49] Improving Large Language Model (LLM) fidelity through context-aware grounding: A systematic approach to reliability and veracity
[NLP-49] 通过上下文感知基础提高大型语言模型(LLM)保真度:可靠性和准确性的系统方法

链接: https://arxiv.org/abs/2408.04023
作者: Wrick Talukdar,Anjanava Biswas
关键词-EN: Large Language Models, natural language processing, ensuring their robustness, Large Language, critical challenge
关键词-ZN: 大型语言模型、自然语言处理、确保其稳健性、大型语言、关键挑战
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 14 pages

点击查看摘要

Abstract:As Large Language Models (LLMs) become increasingly sophisticated and ubiquitous in natural language processing (NLP) applications, ensuring their robustness, trustworthiness, and alignment with human values has become a critical challenge. This paper presents a novel framework for contextual grounding in textual models, with a particular emphasis on the Context Representation stage. Our approach aims to enhance the reliability and ethical alignment of these models through a comprehensive, context-aware methodology. By explicitly capturing and representing relevant situational, cultural, and ethical contexts in a machine-readable format, we lay the foundation for anchoring a model’s behavior within these contexts. Our approach leverages techniques from knowledge representation and reasoning, such as ontologies, semantic web technologies, and logic-based formalisms. We evaluate our framework on real-world textual datasets, demonstrating its effectiveness in improving model performance, fairness, and alignment with human expectations, while maintaining high accuracy. Furthermore, we discuss the other key components of the framework, including context-aware encoding, context-aware learning, interpretability and explainability, and continuous monitoring and adaptation. This research contributes to the growing body of work on responsible AI, offering a practical approach to developing more reliable, trustworthy, and ethically-aligned language models. Our findings have significant implications for the deployment of LLMs in sensitive domains such as healthcare, legal systems, and social services, where contextual understanding is paramount.
摘要:随着大型语言模型(LLM)在自然语言处理(NLP)应用中变得日益复杂和普遍,确保其稳健性、可信性以及与人类价值观的一致性已成为一项关键挑战。本文提出了一种用于文本模型上下文落地(contextual grounding)的新框架,并特别强调上下文表示阶段。我们的方法旨在通过一种全面的、上下文感知的方法增强这些模型的可靠性和伦理一致性。通过以机器可读的格式显式地捕获和表示相关的情境、文化和伦理上下文,我们为将模型的行为锚定在这些上下文中奠定了基础。我们的方法利用了知识表示与推理领域的技术,如本体、语义网技术和基于逻辑的形式化方法。我们在真实世界的文本数据集上评估了该框架,证明了其在提高模型性能、公平性以及与人类期望的一致性方面的有效性,同时保持了高准确率。此外,我们还讨论了该框架的其他关键组成部分,包括上下文感知编码、上下文感知学习、可解释性与可说明性(interpretability and explainability),以及持续监控与适应。这项研究为日益增多的负责任AI研究做出了贡献,为开发更可靠、更可信、更符合伦理的语言模型提供了一种实用方法。我们的发现对于在医疗、法律系统和社会服务等上下文理解至关重要的敏感领域部署LLM具有重要意义。

[NLP-50] Image-to-LaTeX Converter for Mathematical Formulas and Text
[NLP-50] 数学公式和文本的图像到LaTeX转换器

链接: https://arxiv.org/abs/2408.04015
作者: Daniil Gurgurov,Aleksey Morshnev
关键词-EN: vision encoder-decoder model, Swin Transformer encoder, train a vision, vision encoder-decoder, generate LaTeX code
关键词-ZN: 视觉编码器-解码器模型,Swin Transformer编码器,训练视觉,视觉编码器-解码器,生成LaTeX代码
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: 4 pages

点击查看摘要

Abstract:In this project, we train a vision encoder-decoder model to generate LaTeX code from images of mathematical formulas and text. Utilizing a diverse collection of image-to-LaTeX data, we build two models: a base model with a Swin Transformer encoder and a GPT-2 decoder, trained on machine-generated images, and a fine-tuned version enhanced with Low-Rank Adaptation (LoRA) trained on handwritten formulas. We then compare the BLEU performance of our specialized model on a handwritten test set with other similar models, such as Pix2Text, TexTeller, and Sumen. Through this project, we contribute open-source models for converting images to LaTeX and provide from-scratch code for building these models with distributed training and GPU optimizations.
摘要:在这个项目中,我们训练一个视觉编码器-解码器模型,从数学公式和文本的图像生成LaTeX代码。利用多样化的图像到LaTeX数据集合,我们构建了两个模型:一个是基础模型,采用Swin Transformer编码器和GPT-2解码器,在机器生成的图像上训练;另一个是经过微调的版本,通过在手写公式上训练的低秩适应(LoRA)加以增强。随后,我们将我们的专用模型在手写测试集上的BLEU表现与Pix2Text、TexTeller和Sumen等其他类似模型进行了比较。通过这个项目,我们贡献了用于将图像转换为LaTeX的开源模型,并提供了从零开始构建这些模型的代码,支持分布式训练和GPU优化。

[NLP-51] Impacts of Anthropomorphizing Large Language Models in Learning Environments
[NLP-51] 学习环境中大型语言模型拟人化的影响

链接: https://arxiv.org/abs/2408.03945
作者: Kristina Schaaff,Marc-André Heidelmann
关键词-EN: Large Language Models, Large Language, Language Models, learning environments, support teaching-be
关键词-ZN: 大型语言模型,大型语言,语言模型,学习环境,支持教学
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: Presented at Affective Computing Pre-Conference at ISRE 2024

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly being used in learning environments to support teaching-be it as learning companions or as tutors. With our contribution, we aim to discuss the implications of the anthropomorphization of LLMs in learning environments on educational theory to build a foundation for more effective learning outcomes and understand their emotional impact on learners. According to the media equation, people tend to respond to media in the same way as they would respond to another person. A study conducted by the Georgia Institute of Technology showed that chatbots can be successfully implemented in learning environments. In this study, learners in selected online courses were unable to distinguish the chatbot from a “real” teacher. As LLM-based chatbots such as OpenAI’s GPT series are increasingly used in educational tools, it is important to understand how the attribution processes to LLM-based chatbots in terms of anthropomorphization affect learners’ emotions.
摘要:大型语言模型(LLM)在学习环境中越来越多地被用于支持教学,无论是作为学习伙伴还是作为辅导者。通过我们的工作,我们旨在讨论学习环境中LLM的拟人化对教育理论的影响,从而为更有效的学习成果奠定基础,并理解其对学习者的情感影响。根据媒体等同理论(media equation),人们对媒体的反应方式往往与他们对另一个人的反应方式相同。佐治亚理工学院进行的一项研究表明,聊天机器人可以在学习环境中成功应用。在该研究中,选定在线课程中的学习者无法将聊天机器人与"真实"教师区分开来。随着OpenAI的GPT系列等基于LLM的聊天机器人越来越多地被用于教育工具,了解对基于LLM的聊天机器人的拟人化归因过程如何影响学习者的情绪非常重要。

[NLP-52] Articulatory Configurations across Genders and Periods in French Radio and TV archives INTERSPEECH2024
[NLP-52] 法国广播电视档案中不同性别与时期的发音配置

链接: https://arxiv.org/abs/2408.04519
作者: Benjamin Elie,David Doukhan,Rémi Uro,Lucas Ondel-Yang,Albert Rilliard,Simon Devauchelle
关键词-EN: paper studies, inversion from acoustic, French media archives, articulatory configurations, Maeda articulatory model
关键词-ZN: 论文研究、声学倒置、法国媒体档案、发音配置、前田发音模型
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Computers and Society (cs.CY); Sound (cs.SD)
备注: accepted to InterSpeech 2024, Kos Island, Greece keywords : acoustic to articulatory inversion, diachrony, gender, French, media

点击查看摘要

Abstract:This paper studies changes in articulatory configurations across genders and periods using an inversion from acoustic to articulatory parameters. From a diachronic corpus based on French media archives spanning 60 years from 1955 to 2015, automatic transcription and forced alignment allowed extracting the central frame of each vowel. More than one million frames were obtained from over a thousand speakers across gender and age categories. Their formants were used from these vocalic frames to fit the parameters of Maeda’s articulatory model. Evaluations of the quality of these processes are provided. We focus here on two parameters of Maeda’s model linked to total vocal tract length: the relative position of the larynx (higher for females) and the lips protrusion (more protruded for males). Implications for voice quality across genders are discussed. The effect across periods seems gender independent; thus, the assertion that females lowered their pitch with time is not supported.
摘要:本文利用从声学参数到发音参数的反演,研究不同性别和时期发音配置的变化。基于法国媒体档案、跨越1955年至2015年60年的历时语料库,通过自动转写和强制对齐提取了每个元音的中心帧。我们从跨性别和年龄类别的一千多名说话人那里获得了超过一百万帧,并利用这些元音帧的共振峰来拟合Maeda发音模型的参数,文中对这些处理过程的质量进行了评估。我们在此关注Maeda模型中与声道总长度相关的两个参数:喉部的相对位置(女性更高)和嘴唇的前突程度(男性更突出)。文中讨论了其对不同性别嗓音音质的影响。跨时期的效应似乎与性别无关;因此,"女性的音高随时间降低"这一论断没有得到支持。

[NLP-53] Simulating Articulatory Trajectories with Phonological Feature Interpolation INTERSPEECH2024
[NLP-53] 利用音系特征插值模拟发音轨迹

链接: https://arxiv.org/abs/2408.04363
作者: Angelo Ortiz Tandazo,Thomas Schatz,Thomas Hueber,Emmanuel Dupoux
关键词-EN: involving perception-production loops, complete computational model, speech learning involving, learning involving perception-production, perception-production loops
关键词-ZN: 涉及感知产生循环、完整计算模型、涉及语音学习、涉及感知产生的学习、感知产生循环
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
备注: accepted at Interspeech 2024

点击查看摘要

Abstract:As a first step towards a complete computational model of speech learning involving perception-production loops, we investigate the forward mapping between pseudo-motor commands and articulatory trajectories. Two phonological feature sets, based respectively on generative and articulatory phonology, are used to encode a phonetic target sequence. Different interpolation techniques are compared to generate smooth trajectories in these feature spaces, with a potential optimisation of the target value and timing to capture co-articulation effects. We report the Pearson correlation between a linear projection of the generated trajectories and articulatory data derived from a multi-speaker dataset of electromagnetic articulography (EMA) recordings. A correlation of 0.67 is obtained with an extended feature set based on generative phonology and a linear interpolation technique. We discuss the implications of our results for our understanding of the dynamics of biological motion.
摘要:作为建立包含感知-产出循环的完整言语学习计算模型的第一步,我们研究了伪运动指令与发音轨迹之间的前向映射。我们使用分别基于生成音系学和发音音系学的两个音系特征集来编码语音目标序列,并比较了不同插值技术在这些特征空间中生成平滑轨迹的效果,同时可对目标值和时机进行优化以捕捉协同发音效应。我们报告了所生成轨迹的线性投影与来自多说话人电磁发音仪(EMA)记录数据集的发音数据之间的Pearson相关性。使用基于生成音系学的扩展特征集和线性插值技术,获得了0.67的相关性。我们讨论了这些结果对理解生物运动动力学的意义。
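The interpolation step itself can be sketched directly: each phonological target is a feature vector with a time stamp, and the trajectory is sampled between consecutive targets. Linear interpolation is the simplest of the techniques the paper compares; the target-timing optimisation is omitted here, and all names are illustrative:

```python
def interpolate_trajectory(targets, times, step=0.01):
    """Sample a piecewise-linear trajectory through phonological feature targets.

    targets: one feature vector per phonetic target
    times:   the time (seconds) at which each target is reached
    Returns a list of (time, feature_vector) samples.
    """
    samples = []
    t = times[0]
    while t <= times[-1] + 1e-9:
        # locate the segment containing t (clamped at the ends)
        i = 0
        while i < len(times) - 2 and t > times[i + 1]:
            i += 1
        span = times[i + 1] - times[i]
        a = min(max((t - times[i]) / span, 0.0), 1.0)
        vec = [(1 - a) * x + a * y for x, y in zip(targets[i], targets[i + 1])]
        samples.append((t, vec))
        t += step
    return samples

# Three targets in a 2-dimensional feature space, 100 ms apart.
traj = interpolate_trajectory([[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]],
                              [0.0, 0.1, 0.2], step=0.05)
for t, vec in traj:
    print(round(t, 2), [round(v, 2) for v in vec])
```

A smoother alternative (e.g. cochlear-style splines or co-articulation-aware timing) would replace the per-segment linear blend while keeping the same sampling loop.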

[NLP-54] HydraFormer: One Encoder For All Subsampling Rates ICME2024
[NLP-54] HydraFormer:适用于所有子采样率的一个编码器

链接: https://arxiv.org/abs/2408.04325
作者: Yaoxun Xu,Xingchen Song,Zhiyong Wu,Di Wu,Zhendong Peng,Binbin Zhang
关键词-EN: tackling diverse scenarios, diverse scenarios, essential for tackling, tackling diverse, subsampling
关键词-ZN: 应对不同的场景,不同的场景,对于应对、应对不同的、二次抽样至关重要
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
备注: accepted by ICME 2024

点击查看摘要

Abstract:In automatic speech recognition, subsampling is essential for tackling diverse scenarios. However, the inadequacy of a single subsampling rate to address various real-world situations often necessitates training and deploying multiple models, consequently increasing associated costs. To address this issue, we propose HydraFormer, comprising HydraSub, a Conformer-based encoder, and a BiTransformer-based decoder. HydraSub encompasses multiple branches, each representing a distinct subsampling rate, allowing for the flexible selection of any branch during inference based on the specific use case. HydraFormer can efficiently manage different subsampling rates, significantly reducing training and deployment expenses. Experiments on AISHELL-1 and LibriSpeech datasets reveal that HydraFormer effectively adapts to various subsampling rates and languages while maintaining high recognition performance. Additionally, HydraFormer showcases exceptional stability, sustaining consistent performance under various initialization conditions, and exhibits robust transferability by learning from pretrained single subsampling rate automatic speech recognition models (model code and scripts: this https URL).
摘要:在自动语音识别中,子采样对于应对多样化场景至关重要。然而,单一子采样率往往不足以应对各种现实情况,常常需要训练和部署多个模型,从而增加相关成本。为了解决这一问题,我们提出了HydraFormer,它由基于Conformer的编码器HydraSub和基于BiTransformer的解码器组成。HydraSub包含多个分支,每个分支代表不同的子采样率,允许在推理时根据具体用例灵活选择任意分支。HydraFormer可以高效地管理不同的子采样率,显著降低训练和部署费用。在AISHELL-1和LibriSpeech数据集上的实验表明,HydraFormer在保持较高识别性能的同时,有效地适应了不同的子采样率和语言。此外,HydraFormer表现出卓越的稳定性,在各种初始化条件下保持一致的性能,并通过从预训练的单一子采样率自动语音识别模型中学习展现出强大的可迁移性(模型代码和脚本:this https URL)。
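The one-encoder, many-rates idea can be sketched as a frontend holding one subsampling branch per rate in front of a shared encoder. Here the branch is plain frame-dropping and the "encoder" a per-frame mean, both stand-ins for HydraFormer's strided convolutions and Conformer blocks; the class and method names are mine:

```python
class MultiRateFrontend:
    """One model, several subsampling rates: pick a branch at inference time."""

    def __init__(self, rates=(4, 6, 8)):
        self.rates = set(rates)

    def subsample(self, frames, rate):
        """Stand-in for a strided-convolution branch: keep every `rate`-th frame."""
        if rate not in self.rates:
            raise ValueError(f"no branch for subsampling rate {rate}")
        return frames[::rate]

    def encode(self, frames, rate):
        """Shared 'encoder' stub (a per-frame mean) consuming any branch's output."""
        return [sum(frame) / len(frame) for frame in self.subsample(frames, rate)]

frontend = MultiRateFrontend()
frames = [[float(i), float(i)] for i in range(24)]  # 24 fake feature frames
print(len(frontend.subsample(frames, 4)))  # 6 frames survive at rate 4
print(frontend.encode(frames, 8))          # [0.0, 8.0, 16.0]
```

Because every branch feeds the same downstream encoder, only the lightweight branches are duplicated, which is why a single deployment can cover all rates.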

人工智能

[AI-0] Puppet-Master: Scaling Interactive Video Generation as a Motion Prior for Part-Level Dynamics

链接: https://arxiv.org/abs/2408.04631
作者: Ruining Li,Chuanxia Zheng,Christian Rupprecht,Andrea Vedaldi
关键词-EN: interactive video generative, video generative model, part-level dynamics, part-level motion, realistic part-level motion
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Project page: this https URL

点击查看摘要

Abstract:We present Puppet-Master, an interactive video generative model that can serve as a motion prior for part-level dynamics. At test time, given a single image and a sparse set of motion trajectories (i.e., drags), Puppet-Master can synthesize a video depicting realistic part-level motion faithful to the given drag interactions. This is achieved by fine-tuning a large-scale pre-trained video diffusion model, for which we propose a new conditioning architecture to inject the dragging control effectively. More importantly, we introduce the all-to-first attention mechanism, a drop-in replacement for the widely adopted spatial attention modules, which significantly improves generation quality by addressing the appearance and background issues in existing models. Unlike other motion-conditioned video generators that are trained on in-the-wild videos and mostly move an entire object, Puppet-Master is learned from Objaverse-Animation-HQ, a new dataset of curated part-level motion clips. We propose a strategy to automatically filter out sub-optimal animations and augment the synthetic renderings with meaningful motion trajectories. Puppet-Master generalizes well to real images across various categories and outperforms existing methods in a zero-shot manner on a real-world benchmark. See our project page for more results: this http URL.

[AI-1] LogogramNLP: Comparing Visual and Textual Representations of Ancient Logographic Writing Systems for NLP

链接: https://arxiv.org/abs/2408.04628
作者: Danlu Chen,Freda Shi,Aditi Agarwal,Jacobo Myerston,Taylor Berg-Kirkpatrick
关键词-EN: Standard natural language, Standard natural, ancient logographic languages, discrete tokens, operate on symbolic
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Standard natural language processing (NLP) pipelines operate on symbolic representations of language, which typically consist of sequences of discrete tokens. However, creating an analogous representation for ancient logographic writing systems is an extremely labor intensive process that requires expert knowledge. At present, a large portion of logographic data persists in a purely visual form due to the absence of transcription – this issue poses a bottleneck for researchers seeking to apply NLP toolkits to study ancient logographic languages: most of the relevant data are images of writing. This paper investigates whether direct processing of visual representations of language offers a potential solution. We introduce LogogramNLP, the first benchmark enabling NLP analysis of ancient logographic languages, featuring both transcribed and visual datasets for four writing systems along with annotations for tasks like classification, translation, and parsing. Our experiments compare systems that employ recent visual and text encoding strategies as backbones. The results demonstrate that visual representations outperform textual representations for some investigated tasks, suggesting that visual processing pipelines may unlock a large amount of cultural heritage data of logographic languages for NLP-based analyses. Journal reference: ACL 2024, long paper.

[AI-2] Transformer Explainer: Interactive Learning of Text-Generative Models IEEE-VIS2024

链接: https://arxiv.org/abs/2408.04619
作者: Aeree Cho,Grace C. Kim,Alexander Karpekov,Alec Helbling,Zijie J. Wang,Seongmin Lee,Benjamin Hoover,Duen Horng Chau
关键词-EN: revolutionized machine learning, workings remain opaque, present Transformer Explainer, machine learning, revolutionized machine
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
*备注: To be presented at IEEE VIS 2024

点击查看摘要

Abstract:Transformers have revolutionized machine learning, yet their inner workings remain opaque to many. We present Transformer Explainer, an interactive visualization tool designed for non-experts to learn about Transformers through the GPT-2 model. Our tool helps users understand complex Transformer concepts by integrating a model overview and enabling smooth transitions across abstraction levels of mathematical operations and model structures. It runs a live GPT-2 instance locally in the user’s browser, empowering users to experiment with their own input and observe in real-time how the internal components and parameters of the Transformer work together to predict the next tokens. Our tool requires no installation or special hardware, broadening the public’s education access to modern generative AI techniques. Our open-sourced tool is available at this https URL. A video demo is available at this https URL.

[AI-3] Better Alignment with Instruction Back-and-Forth Translation

链接: https://arxiv.org/abs/2408.04614
作者: Thao Nguyen,Jeffrey Li,Sewoong Oh,Ludwig Schmidt,Jason Weston,Luke Zettlemoyer,Xian Li
关键词-EN: large language models, aligning large language, construct high-quality synthetic, high-quality synthetic data, synthetic data grounded
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose a new method, instruction back-and-forth translation, to construct high-quality synthetic data grounded in world knowledge for aligning large language models (LLMs). Given documents from a web corpus, we generate and curate synthetic instructions using the backtranslation approach proposed by Li et al.(2023a), and rewrite the responses to improve their quality further based on the initial documents. Fine-tuning with the resulting (backtranslated instruction, rewritten response) pairs yields higher win rates on AlpacaEval than using other common instruction datasets such as Humpback, ShareGPT, Open Orca, Alpaca-GPT4 and Self-instruct. We also demonstrate that rewriting the responses with an LLM outperforms direct distillation, and the two generated text distributions exhibit significant distinction in embedding space. Further analysis shows that our backtranslated instructions are of higher quality than other sources of synthetic instructions, while our responses are more diverse and complex than those obtained from distillation. Overall we find that instruction back-and-forth translation combines the best of both worlds – making use of the information diversity and quantity found on the web, while ensuring the quality of the responses which is necessary for effective alignment.
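The recipe reduces to two model calls per document: backtranslate an instruction, then rewrite the document into a response grounded in it. A minimal sketch with the LLM abstracted as a callable; the prompt wording is my own, not the paper's:

```python
def back_and_forth_pair(document, llm):
    """Build one (instruction, response) training pair from a web document.

    llm: any callable mapping a prompt string to a completion string.
    Step 1 backtranslates an instruction; step 2 rewrites the document
    into a higher-quality response grounded in that instruction.
    """
    instruction = llm(
        "Write the instruction that the following document best answers:\n"
        + document
    )
    response = llm(
        "Rewrite the document below into a clear, direct answer to the "
        "instruction '" + instruction + "':\n" + document
    )
    return instruction, response

# A fake "LLM" good enough to show the data flow.
def fake_llm(prompt):
    return "INSTRUCTION" if prompt.startswith("Write the instruction") else "RESPONSE"

print(back_and_forth_pair("some web document", fake_llm))  # ('INSTRUCTION', 'RESPONSE')
```

The rewriting step is what distinguishes this from plain backtranslation: the response is anchored to the original document rather than distilled from the model alone.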

[AI-4] Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models

链接: https://arxiv.org/abs/2408.04594
作者: Qirui Jiao,Daoyuan Chen,Yilun Huang,Yaliang Li,Ying Shen
关键词-EN: High-performance Multimodal Large, Large Language Models, Multimodal Large Language, Large Language, High-performance Multimodal
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 14 pages, 9 figures, 7 tables

点击查看摘要

Abstract:High-performance Multimodal Large Language Models (MLLMs) rely heavily on data quality. This study introduces a novel dataset named Img-Diff, designed to enhance fine-grained image recognition in MLLMs by leveraging insights from contrastive learning and image difference captioning. By analyzing object differences between similar images, we challenge models to identify both matching and distinct components. We utilize the Stable-Diffusion-XL model and advanced image editing techniques to create pairs of similar images that highlight object replacements. Our methodology includes a Difference Area Generator for identifying object differences, followed by a Difference Captions Generator for detailed difference descriptions. The result is a relatively small but high-quality dataset of “object replacement” samples. We use the proposed dataset to fine-tune state-of-the-art (SOTA) MLLMs such as MGM-7B, yielding comprehensive improvements of performance scores over SOTA models trained with larger-scale datasets, in numerous image difference and Visual Question Answering tasks. For instance, our trained models notably surpass the SOTA models GPT-4V and Gemini on the MMVP benchmark. Besides, we investigate alternative methods for generating image difference data through “object removal” and conduct thorough evaluation to confirm the dataset’s diversity, quality, and robustness, presenting several insights on the synthesis of such contrastive datasets. To encourage further research and advance the field of multimodal data synthesis and enhancement of MLLMs’ fundamental capabilities for image understanding, we release our codes and dataset at this https URL.
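The geometric core of a difference-area generator is easy to sketch: compare paired images and box the region that changed. Img-Diff's actual generator works on model-edited images with learned components; this toy operates on plain grayscale grids and is only an illustration of the idea:

```python
def difference_bbox(img_a, img_b, threshold=10):
    """Bounding box (top, left, bottom, right) of pixels that differ by more
    than `threshold`, or None when the images match. Both images are 2-D
    lists of grayscale intensities with identical shape."""
    coords = [
        (r, c)
        for r, row in enumerate(img_a)
        for c, value in enumerate(row)
        if abs(value - img_b[r][c]) > threshold
    ]
    if not coords:
        return None
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    return (min(rows), min(cols), max(rows), max(cols))

base = [[0] * 5 for _ in range(5)]
edited = [row[:] for row in base]
edited[1][2] = 200  # an "object replacement" toggles a patch of pixels
edited[2][3] = 200
print(difference_bbox(base, edited))  # (1, 2, 2, 3)
```

In the paper's pipeline a caption generator would then describe what sits inside that box in each image, yielding the contrastive "difference caption" training pairs.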

[AI-5] HiLo: A Learning Framework for Generalized Category Discovery Robust to Domain Shifts

链接: https://arxiv.org/abs/2408.04591
作者: Hongjun Wang,Sagar Vaze,Kai Han
关键词-EN: Generalized Category Discovery, Generalized Category, partially labelled dataset, unlabelled instances, Generalized
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 39 pages, 9 figures, 26 tables

点击查看摘要

Abstract:Generalized Category Discovery (GCD) is a challenging task in which, given a partially labelled dataset, models must categorize all unlabelled instances, regardless of whether they come from labelled categories or from new ones. In this paper, we challenge a remaining assumption in this task: that all images share the same domain. Specifically, we introduce a new task and method to handle GCD when the unlabelled data also contains images from different domains than the labelled set. Our proposed 'HiLo' networks extract High-level semantic and Low-level domain features, before minimizing the mutual information between the representations. Our intuition is that the clusterings based on domain information and semantic information should be independent. We further extend our method with a specialized domain augmentation tailored for the GCD task, as well as a curriculum learning approach. Finally, we construct a benchmark from corrupted fine-grained datasets as well as a large-scale evaluation on DomainNet with real-world domain shifts, reimplementing a number of GCD baselines in this setting. We demonstrate that HiLo outperforms SoTA category discovery models by a large margin on all evaluations.

[AI-6] Sampling for View Synthesis: From Local Light Field Fusion to Neural Radiance Fields and Beyond

链接: https://arxiv.org/abs/2408.04586
作者: Ravi Ramamoorthi
关键词-EN: complex real-world scenes, graphics and vision, virtual reality, immersive experiences, view synthesis
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Article written for Frontiers of Science Award, International Congress on Basic Science, 2024

点击查看摘要

Abstract:Capturing and rendering novel views of complex real-world scenes is a long-standing problem in computer graphics and vision, with applications in augmented and virtual reality, immersive experiences and 3D photography. The advent of deep learning has enabled revolutionary advances in this area, classically known as image-based rendering. However, previous approaches require intractably dense view sampling or provide little or no guidance for how users should sample views of a scene to reliably render high-quality novel views. Local light field fusion proposes an algorithm for practical view synthesis from an irregular grid of sampled views that first expands each sampled view into a local light field via a multiplane image scene representation, then renders novel views by blending adjacent local light fields. Crucially, we extend traditional plenoptic sampling theory to derive a bound that specifies precisely how densely users should sample views of a given scene when using our algorithm. We achieve the perceptual quality of Nyquist rate view sampling while using up to 4000x fewer views. Subsequent developments have led to new scene representations for deep learning with view synthesis, notably neural radiance fields, but the problem of sparse view synthesis from a small number of images has only grown in importance. We reprise some of the recent results on sparse and even single image view synthesis, while posing the question of whether prescriptive sampling guidelines are feasible for the new generation of image-based rendering algorithms.

[AI-7] Unveiling the Power of Sparse Neural Networks for Feature Selection

链接: https://arxiv.org/abs/2408.04583
作者: Zahra Atashgahi,Tennison Liu,Mykola Pechenizkiy,Raymond Veldhuis,Decebal Constantin Mocanu,Mihaela van der Schaar
关键词-EN: feature selection, efficient feature selection, Sparse Neural Networks, Sparse Neural, emerged as powerful
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Sparse Neural Networks (SNNs) have emerged as powerful tools for efficient feature selection. Leveraging the dynamic sparse training (DST) algorithms within SNNs has demonstrated promising feature selection capabilities while drastically reducing computational overheads. Despite these advancements, several critical aspects remain insufficiently explored for feature selection. Questions persist regarding the choice of the DST algorithm for network training, the choice of metric for ranking features/neurons, and the comparative performance of these methods across diverse datasets when compared to dense networks. This paper addresses these gaps by presenting a comprehensive systematic analysis of feature selection with sparse neural networks. Moreover, we introduce a novel metric considering sparse neural network characteristics, which is designed to quantify feature importance within the context of SNNs. Our findings show that feature selection with SNNs trained with DST algorithms can achieve, on average, more than 50% memory and 55% FLOPs reduction compared to the dense networks, while outperforming them in terms of the quality of the selected features. Our code and the supplementary material are available on GitHub (this https URL).
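A sparse input layer makes one natural importance signal almost free: sum the magnitudes of each input feature's surviving connections. The sketch below shows that generic neuron-strength heuristic in the spirit of the setting described; it is not the specific novel metric the paper proposes:

```python
def feature_importance(connections, n_features):
    """Score each input feature of a sparse layer by the summed magnitude of
    its surviving connections. connections: (input_idx, hidden_idx, weight)
    triples, i.e. only the links kept by dynamic sparse training."""
    scores = [0.0] * n_features
    for input_idx, _hidden_idx, weight in connections:
        scores[input_idx] += abs(weight)
    return scores

def select_top_k(connections, n_features, k):
    """Pick the k highest-scoring input features."""
    scores = feature_importance(connections, n_features)
    return sorted(range(n_features), key=lambda i: -scores[i])[:k]

# Feature 1 was pruned away entirely during sparse training, so it scores zero.
sparse_links = [(0, 0, 0.5), (0, 1, -1.5), (2, 0, 0.2), (3, 2, 0.9)]
print(feature_importance(sparse_links, 4))  # [2.0, 0.0, 0.2, 0.9]
print(select_top_k(sparse_links, 4, 2))     # [0, 3]
```

The efficiency argument is visible even here: the score touches only the stored sparse links, never a dense weight matrix.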

[AI-8] SCENE: Evaluating Explainable AI Techniques Using Soft Counterfactuals

链接: https://arxiv.org/abs/2408.04575
作者: Haoran Zheng,Utku Pamuksuz
关键词-EN: Explainable Artificial Intelligence, Explainable Artificial, Artificial Intelligence, natural language processing, Natural language Explainability
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: 10 pages, 5 tables

点击查看摘要

Abstract:Explainable Artificial Intelligence (XAI) is essential for enhancing the transparency and accountability of AI models, especially in natural language processing (NLP) tasks. This paper introduces SCENE (Soft Counterfactual Evaluation for Natural language Explainability), a novel evaluation method that leverages large language models (LLMs) to generate Soft Counterfactual explanations in a zero-shot manner. By focusing on token-based substitutions, SCENE creates contextually appropriate and semantically meaningful Soft Counterfactuals without extensive fine-tuning. SCENE adopts Validity_soft and C_soft metrics to evaluate the effectiveness of model-agnostic XAI methods in text classification tasks. Applied to CNN, RNN, and BERT architectures, SCENE provides valuable insights into the strengths and limitations of various XAI techniques.
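Token-based substitution is straightforward to demonstrate. SCENE obtains contextually plausible substitutions from an LLM in a zero-shot manner; here they come from a fixed dictionary and the classifier is a keyword rule, so everything below is illustrative scaffolding rather than the paper's method:

```python
def soft_counterfactuals(tokens, substitutions, classify):
    """Return single-token variants whose predicted label flips.

    tokens: the input as a token list
    substitutions: token -> list of contextually plausible replacements
    classify: callable mapping a token list to a label
    """
    base_label = classify(tokens)
    flips = []
    for i, token in enumerate(tokens):
        for candidate in substitutions.get(token, []):
            variant = tokens[:i] + [candidate] + tokens[i + 1:]
            if classify(variant) != base_label:
                flips.append(" ".join(variant))
    return flips

def toy_classifier(tokens):
    """Keyword-rule stand-in for the CNN/RNN/BERT model under evaluation."""
    return "positive" if {"great", "good"} & set(tokens) else "negative"

subs = {"great": ["good", "terrible"]}
result = soft_counterfactuals(["the", "movie", "was", "great"], subs, toy_classifier)
print(result)  # ['the movie was terrible']
```

An XAI method's attributions can then be scored by how well the tokens it highlights coincide with the substitutions that actually flip the prediction.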

[AI-9] Learning Fine-Grained Grounded Citations for Attributed Large Language Models ACL2024

链接: https://arxiv.org/abs/2408.04568
作者: Lei Huang,Xiaocheng Feng,Weitao Ma,Yuxuan Gu,Weihong Zhong,Xiachong Feng,Weijiang Yu,Weihua Peng,Duyu Tang,Dandan Tu,Bing Qin
关键词-EN: large language models, information-seeking tasks, large language, impressive performance, performance on information-seeking
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Accepted by ACL 2024 Findings

点击查看摘要

Abstract:Despite the impressive performance on information-seeking tasks, large language models (LLMs) still struggle with hallucinations. Attributed LLMs, which augment generated text with in-line citations, have shown potential in mitigating hallucinations and improving verifiability. However, current approaches suffer from suboptimal citation quality due to their reliance on in-context learning. Furthermore, the practice of citing only coarse document identifiers makes it challenging for users to perform fine-grained verification. In this work, we introduce FRONT, a training framework designed to teach LLMs to generate Fine-Grained Grounded Citations. By grounding model outputs in fine-grained supporting quotes, these quotes guide the generation of grounded and consistent responses, not only improving citation quality but also facilitating fine-grained verification. Experiments on the ALCE benchmark demonstrate the efficacy of FRONT in generating superior grounded responses and highly supportive citations. With LLaMA-2-7B, the framework significantly outperforms all the baselines, achieving an average of 14.21% improvement in citation quality across all datasets, even surpassing ChatGPT.

[AI-10] Reasoning about Study Regulations in Answer Set Programming

链接: https://arxiv.org/abs/2408.04528
作者: Susana Hahn,Cedric Martens,Amade Nemes,Henry Otunuya,Javier Romero,Torsten Schaub,Sebastian Schellhorn
关键词-EN: ranging from administrators, interested in automating, automating reasoning, University of Potsdam, study plans
类目: Artificial Intelligence (cs.AI)
*备注: To appear in Theory and Practise of Logic Programming

点击查看摘要

Abstract:We are interested in automating reasoning with and about study regulations, catering to various stakeholders, ranging from administrators, over faculty, to students at different stages. Our work builds on an extensive analysis of various study programs at the University of Potsdam. The conceptualization of the underlying principles provides us with a formal account of study regulations. In particular, the formalization reveals the properties of admissible study plans. With these at hand, we propose an encoding of study regulations in Answer Set Programming that produces corresponding study plans. Finally, we show how this approach can be extended to a generic user interface for exploring study plans.

[AI-11] Towards Synergistic Deep Learning Models for Volumetric Cirrhotic Liver Segmentation in MRIs

链接: https://arxiv.org/abs/2408.04491
作者: Vandan Gorade,Onkar Susladkar,Gorkem Durak,Elif Keles,Ertugrul Aktas,Timurhan Cebeci,Alpay Medetalibeyoglu,Daniela Ladner,Debesh Jha,Ulas Bagci
关键词-EN: requires precise segmentation, effective disease monitoring, global mortality, requires precise, treatment planning
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Liver cirrhosis, a leading cause of global mortality, requires precise segmentation of ROIs for effective disease monitoring and treatment planning. Existing segmentation models often fail to capture complex feature interactions and generalize across diverse datasets. To address these limitations, we propose a novel synergistic theory that leverages complementary latent spaces for enhanced feature interaction modeling. Our proposed architecture, nnSynergyNet3D integrates continuous and discrete latent spaces for 3D volumes and features auto-configured training. This approach captures both fine-grained and coarse features, enabling effective modeling of intricate feature interactions. We empirically validated nnSynergyNet3D on a private dataset of 628 high-resolution T1 abdominal MRI scans from 339 patients. Our model outperformed the baseline nnUNet3D by approximately 2%. Additionally, zero-shot testing on healthy liver CT scans from the public LiTS dataset demonstrated superior cross-modal generalization capabilities. These results highlight the potential of synergistic latent space models to improve segmentation accuracy and robustness, thereby enhancing clinical workflows by ensuring consistency across CT and MRI modalities.

[AI-12] SegXAL: Explainable Active Learning for Semantic Segmentation in Driving Scene Scenarios ICPR

链接: https://arxiv.org/abs/2408.04482
作者: Sriram Mandalika,Athira Nambiar
关键词-EN: achieve high-end performance, utilize huge amounts, models utilize huge, high-end performance, huge amounts
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 17 pages, 7 figures. To appear in the proceedings of the 27th International Conference on Pattern Recognition (ICPR), 01-05 December, 2024, Kolkata, India

点击查看摘要

Abstract:Most of the sophisticated AI models utilize huge amounts of annotated data and heavy training to achieve high-end performance. However, there are certain challenges that hinder the deployment of AI models in “in-the-wild” scenarios, i.e., inefficient use of unlabeled data, lack of incorporation of human expertise, and lack of interpretation of the results. To mitigate these challenges, we propose a novel Explainable Active Learning (XAL) model, XAL-based semantic segmentation model “SegXAL”, that can (i) effectively utilize the unlabeled data, (ii) facilitate the “Human-in-the-loop” paradigm, and (iii) augment the model decisions in an interpretable way. In particular, we investigate the application of the SegXAL model for semantic segmentation in driving scene scenarios. The SegXAL model proposes the image regions that require labeling assistance from Oracle by dint of explainable AI (XAI) and uncertainty measures in a weakly-supervised manner. Specifically, we propose a novel Proximity-aware Explainable-AI (PAE) module and Entropy-based Uncertainty (EBU) module to get an Explainable Error Mask, which enables the machine teachers/human experts to provide intuitive reasoning behind the results and to solicit feedback to the AI system via an active learning strategy. Such a mechanism bridges the semantic gap between man and machine through collaborative intelligence, where humans and AI actively enhance each other’s complementary strengths. A novel high-confidence sample selection technique based on the DICE similarity coefficient is also presented within the SegXAL framework. Extensive quantitative and qualitative analyses are carried out on the benchmark Cityscapes dataset. Results show that our proposed SegXAL outperforms other state-of-the-art models.
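
Two of the building blocks named in the abstract, entropy-based uncertainty and the DICE similarity coefficient, can be sketched in a few lines. This is an illustrative reimplementation under simplifying assumptions (flat binary masks, per-pixel class probabilities), not the SegXAL code.

```python
import math

def entropy(probs):
    """Predictive entropy of a per-pixel class distribution:
    high entropy = ambiguous pixel, a candidate for Oracle labeling."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def dice(mask_a, mask_b):
    """DICE coefficient of two binary masks: 2|A ∩ B| / (|A| + |B|)."""
    inter = sum(a and b for a, b in zip(mask_a, mask_b))
    total = sum(mask_a) + sum(mask_b)
    return 2.0 * inter / total if total else 1.0

# A confident pixel has lower entropy than an ambiguous one.
print(entropy([0.95, 0.05]) < entropy([0.5, 0.5]))  # → True
# Partially overlapping masks score between 0 and 1.
print(dice([1, 1, 0, 0], [1, 0, 0, 0]))  # → 0.666...
```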

[AI-13] RiskAwareBench: Towards Evaluating Physical Risk Awareness for High-level Planning of LLM-based Embodied Agents

链接: https://arxiv.org/abs/2408.04449
作者: Zihao Zhu,Bingzhe Wu,Zhengyou Zhang,Baoyuan Wu
关键词-EN: large language models, complex natural language, robotics significantly enhances, executing complex natural, LLM-based embodied agents
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The integration of large language models (LLMs) into robotics significantly enhances the capabilities of embodied agents in understanding and executing complex natural language instructions. However, the unmitigated deployment of LLM-based embodied systems in real-world environments may pose potential physical risks, such as property damage and personal injury. Existing security benchmarks for LLMs overlook risk awareness for LLM-based embodied agents. To address this gap, we propose RiskAwareBench, an automated framework designed to assess physical risk awareness in LLM-based embodied agents. RiskAwareBench consists of four modules: safety tips generation, risky scene generation, plan generation, and evaluation, enabling comprehensive risk assessment with minimal manual intervention. Utilizing this framework, we compile the PhysicalRisk dataset, encompassing diverse scenarios with associated safety tips, observations, and instructions. Extensive experiments reveal that most LLMs exhibit insufficient physical risk awareness, and baseline risk mitigation strategies yield limited enhancement, which emphasizes the urgency and importance of improving risk awareness in LLM-based embodied agents in the future.

[AI-14] FedAD-Bench: A Unified Benchmark for Federated Unsupervised Anomaly Detection in Tabular Data

链接: https://arxiv.org/abs/2408.04442
作者: Ahmed Anwar,Brian Moser,Dayananda Herurkar,Federico Raue,Vinit Hegiste,Tatjana Legler,Andreas Dengel
关键词-EN: leverage decentralized data, anomaly detection, preserving privacy, promising approach, approach to leverage
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 8 pages, 1 figure

点击查看摘要

Abstract:The emergence of federated learning (FL) presents a promising approach to leverage decentralized data while preserving privacy. Furthermore, the combination of FL and anomaly detection is particularly compelling because it allows for detecting rare and critical anomalies (usually also rare in locally gathered data) in sensitive data from multiple sources, such as cybersecurity and healthcare. However, benchmarking the performance of anomaly detection methods in FL environments remains an underexplored area. This paper introduces FedAD-Bench, a unified benchmark for evaluating unsupervised anomaly detection algorithms within the context of FL. We systematically analyze and compare the performance of recent deep learning anomaly detection models under federated settings, which were typically assessed solely in centralized settings. FedAD-Bench encompasses diverse datasets and metrics to provide a holistic evaluation. Through extensive experiments, we identify key challenges such as model aggregation inefficiencies and metric unreliability. We present insights into FL’s regularization effects, revealing scenarios in which it outperforms centralized approaches due to its inherent ability to mitigate overfitting. Our work aims to establish a standardized benchmark to guide future research and development in federated anomaly detection, promoting reproducibility and fair comparison across studies.
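
The federated setting that FedAD-Bench evaluates in builds on a server-side aggregation step. The sketch below shows the classic FedAvg weighted average of client parameters, as a minimal illustration only; the benchmark's own pipeline (models, metrics, datasets) is far more elaborate.

```python
# Sketch of federated averaging (FedAvg): the server averages flat client
# parameter vectors weighted by local dataset size. Illustration only.

def fedavg(client_weights, client_sizes):
    """Weighted average of flat parameter vectors, one per client."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * s for w, s in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]

# Client 1 holds 3x the data of client 0, so its parameters dominate.
print(fedavg([[1.0, 2.0], [3.0, 4.0]], [1, 3]))  # → [2.5, 3.5]
```

The aggregation inefficiencies the abstract mentions arise when this simple averaging interacts badly with, e.g., anomaly-detection thresholds learned per client.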

[AI-15] Enhancing Robustness of Retrieval-Augmented Language Models with In-Context Learning

链接: https://arxiv.org/abs/2408.04414
作者: Seong-Il Park,Seung-Woo Choi,Na-Hyun Kim,Jay-Yoon Lee
关键词-EN: Retrieval-Augmented Language Models, leveraging external knowledge, significantly improved performance, Retrieval-Augmented Language, open-domain question answering
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 10 pages, 2 figures

点击查看摘要

Abstract:Retrieval-Augmented Language Models (RALMs) have significantly improved performance in open-domain question answering (QA) by leveraging external knowledge. However, RALMs still struggle with unanswerable queries, where the retrieved contexts do not contain the correct answer, and with conflicting information, where different sources provide contradictory answers due to imperfect retrieval. This study introduces an in-context learning-based approach to enhance the reasoning capabilities of RALMs, making them more robust in imperfect retrieval scenarios. Our method incorporates Machine Reading Comprehension (MRC) demonstrations, referred to as cases, to boost the model’s capabilities to identify unanswerabilities and conflicts among the retrieved contexts. Experiments on two open-domain QA datasets show that our approach increases accuracy in identifying unanswerable and conflicting scenarios without requiring additional fine-tuning. This work demonstrates that in-context learning can effectively enhance the robustness of RALMs in open-domain QA tasks.

[AI-16] Probabilistic energy forecasting through quantile regression in reproducing kernel Hilbert spaces

链接: https://arxiv.org/abs/2408.04405
作者: Luca Pernigo,Rohan Sen,Davide Baroli
关键词-EN: Accurate energy demand, Representative Concentration Pathways, resilient energy development, Accurate energy, crucial for sustainable
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注: 12 pages, {Owner/Author | ACM} {2024}. This is the author’s version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record will published in this https URL

点击查看摘要

Abstract:Accurate energy demand forecasting is crucial for sustainable and resilient energy development. To meet the Net Zero Representative Concentration Pathways (RCP) 4.5 scenario in the DACH countries, increased renewable energy production, energy storage, and reduced commercial building consumption are needed. This scenario’s success depends on hydroelectric capacity and climatic factors. Informed decisions require quantifying uncertainty in forecasts. This study explores a non-parametric method based on reproducing kernel Hilbert spaces (RKHS), known as kernel quantile regression, for energy prediction. Our experiments demonstrate its reliability and sharpness, and we benchmark it against state-of-the-art methods in load and price forecasting for the DACH region. We offer our implementation in conjunction with additional scripts to ensure the reproducibility of our research.
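
Quantile regression, kernelized or not, minimizes the pinball loss. The sketch below shows that loss in isolation (the RKHS machinery of the paper is omitted) to make the asymmetry concrete.

```python
# Sketch of the pinball (quantile) loss underlying quantile regression.

def pinball_loss(y_true, y_pred, tau):
    """Asymmetric loss: under-prediction is weighted tau,
    over-prediction (1 - tau)."""
    diff = y_true - y_pred
    return tau * diff if diff >= 0 else (tau - 1) * diff

# For tau = 0.9 (a high quantile), under-predicting costs far more,
# which pushes the fitted curve toward the upper tail of the demand.
print(pinball_loss(100, 90, 0.9))   # under-prediction → 9.0
print(pinball_loss(100, 110, 0.9))  # over-prediction  → ≈ 1.0
```

Fitting one model per quantile (e.g. tau = 0.1, 0.5, 0.9) yields the prediction intervals that quantify forecast uncertainty.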

[AI-17] Exploring Reasoning Biases in Large Language Models Through Syllogism: Insights from the NeuBAROCO Dataset ACL2024

链接: https://arxiv.org/abs/2408.04403
作者: Kentaro Ozeki,Risako Ando,Takanobu Morishita,Hirohiko Abe,Koji Mineshima,Mitsuhiro Okada
关键词-EN: accurately current large, reasoning, paper explores, explores the question, accurately current
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: To appear in Findings of the Association for Computational Linguistics: ACL 2024

点击查看摘要

Abstract:This paper explores the question of how accurately current large language models can perform logical reasoning in natural language, with an emphasis on whether these models exhibit reasoning biases similar to humans. Specifically, our study focuses on syllogistic reasoning, a form of deductive reasoning extensively studied in cognitive science as a natural form of human reasoning. We present a syllogism dataset called NeuBAROCO, which consists of syllogistic reasoning problems in English and Japanese. This dataset was originally designed for psychological experiments to assess human reasoning capabilities using various forms of syllogisms. Our experiments with leading large language models indicate that these models exhibit reasoning biases similar to humans, along with other error tendencies. Notably, there is significant room for improvement in reasoning problems where the relationship between premises and hypotheses is neither entailment nor contradiction. We also present experimental results and in-depth analysis using a new Chain-of-Thought prompting method, which asks LLMs to translate syllogisms into abstract logical expressions and then explain their reasoning process. Our analysis using this method suggests that the primary limitations of LLMs lie in the reasoning process itself rather than the interpretation of syllogisms.

[AI-18] DIVE: Subgraph Disagreement for Graph Out-of-Distribution Generalization

链接: https://arxiv.org/abs/2408.04400
作者: Xin Sun,Liang Wang,Qiang Liu,Shu Wu,Zilei Wang,Liang Wang
关键词-EN: field rapidly advancing, target data distributions, Stochastic Gradient Descent, graph machine learning, paper addresses
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper addresses the challenge of out-of-distribution (OOD) generalization in graph machine learning, a field rapidly advancing yet grappling with the discrepancy between source and target data distributions. Traditional graph learning algorithms, based on the assumption of uniform distribution between training and test data, falter in real-world scenarios where this assumption fails, resulting in suboptimal performance. A principal factor contributing to this suboptimal performance is the inherent simplicity bias of neural networks trained through Stochastic Gradient Descent (SGD), which prefer simpler features over more complex yet equally or more predictive ones. This bias leads to a reliance on spurious correlations, adversely affecting OOD performance in various tasks such as image recognition, natural language understanding, and graph classification. Current methodologies, including subgraph-mixup and information bottleneck approaches, have achieved partial success but struggle to overcome simplicity bias, often reinforcing spurious correlations. To tackle this, we propose DIVE, training a collection of models to focus on all label-predictive subgraphs by encouraging the models to foster divergence on the subgraph mask, which circumvents the limitation of a model solely focusing on the subgraph corresponding to simple structural patterns. Specifically, we employ a regularizer to punish overlap in extracted subgraphs across models, thereby encouraging different models to concentrate on distinct structural patterns. Model selection for robust OOD performance is achieved through validation accuracy. Tested across four datasets from the GOOD benchmark and one dataset from the DrugOOD benchmark, our approach demonstrates significant improvement over existing methods, effectively addressing the simplicity bias and enhancing generalization in graph machine learning.
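
The overlap-penalty idea can be made concrete with a toy regularizer: given soft subgraph masks from several models (one weight in [0, 1] per edge), penalize edges that many models select at once. This is an illustrative formulation, not the paper's exact regularizer.

```python
# Sketch of an overlap penalty over per-model edge masks, encouraging
# models to diverge onto distinct structural patterns.

def overlap_penalty(masks):
    """Sum over model pairs of the inner product of their edge masks."""
    penalty = 0.0
    for i in range(len(masks)):
        for j in range(i + 1, len(masks)):
            penalty += sum(a * b for a, b in zip(masks[i], masks[j]))
    return penalty

disjoint = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # models attend to different edges
identical = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0]]  # models collapse onto one subgraph
print(overlap_penalty(disjoint))   # → 0.0
print(overlap_penalty(identical))  # → 2.0
```

Adding such a term to each model's training loss makes the "collapsed" configuration costly, pushing the ensemble to cover simple and complex predictive subgraphs alike.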

[AI-19] Automated Educational Question Generation at Different Bloom's Skill Levels using Large Language Models: Strategies and Evaluation

链接: https://arxiv.org/abs/2408.04394
作者: Nicy Scaria,Suma Dharani Chenna,Deepak Subramani
关键词-EN: Developing questions, pedagogically sound, promote learning, challenging and time-consuming, time-consuming task
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Developing questions that are pedagogically sound, relevant, and promote learning is a challenging and time-consuming task for educators. Modern-day large language models (LLMs) generate high-quality content across multiple domains, potentially helping educators to develop high-quality questions. Automated educational question generation (AEQG) is important in scaling online education catering to a diverse student population. Past attempts at AEQG have shown limited abilities to generate questions at higher cognitive levels. In this study, we examine the ability of five state-of-the-art LLMs of different sizes to generate diverse and high-quality questions of different cognitive levels, as defined by Bloom’s taxonomy. We use advanced prompting techniques with varying complexity for AEQG. We conducted expert and LLM-based evaluations to assess the linguistic and pedagogical relevance and quality of the questions. Our findings suggest that LLMs can generate relevant and high-quality educational questions of different cognitive levels when prompted with adequate information, although there is a significant variance in the performance of the five LLMs considered. We also show that automated evaluation is not on par with human evaluation.

[AI-20] MM-Forecast: A Multimodal Approach to Temporal Event Forecasting with Large Language Models

链接: https://arxiv.org/abs/2408.04388
作者: Haoxuan Li,Zhengmao Yang,Yunshan Ma,Yi Bin,Yang Yang,Tat-Seng Chua
关键词-EN: large language models, language models, temporal event forecasting, temporal event, large language
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:We study an emerging and intriguing problem of multimodal temporal event forecasting with large language models. Compared to using text or graph modalities, the investigation of utilizing images for temporal event forecasting has not been fully explored, especially in the era of large language models (LLMs). To bridge this gap, we are particularly interested in two key questions: 1) why images will help in temporal event forecasting, and 2) how to integrate images into the LLM-based forecasting framework. To answer these research questions, we propose to identify two essential functions that images play in the scenario of temporal event forecasting, i.e., highlighting and complementary. Then, we develop a novel framework, named MM-Forecast. It employs an Image Function Identification module to recognize these functions as verbal descriptions using multimodal large language models (MLLMs), and subsequently incorporates these function descriptions into LLM-based forecasting models. To evaluate our approach, we construct a new multimodal dataset, MidEast-TE-mm, by extending an existing event dataset MidEast-TE-mini with images. Empirical studies demonstrate that our MM-Forecast can correctly identify the image functions, and furthermore, incorporating these verbal function descriptions significantly improves the forecasting performance. The dataset, code, and prompts are available at this https URL.

[AI-21] Non-maximizing policies that fulfill multi-criterion aspirations in expectation

链接: https://arxiv.org/abs/2408.04385
作者: Simon Dima,Simon Fischer,Jobst Heitzig,Joss Oliver
关键词-EN: scalar reward function, sequential decision making, expected total reward, single reward function, reward function
类目: Artificial Intelligence (cs.AI); Theoretical Economics (econ.TH); Optimization and Control (math.OC)
*备注: 16 pages main text + 4 pages supplement. Accepted for Algorithmic Decision Theory 2024

点击查看摘要

Abstract:In dynamic programming and reinforcement learning, the policy for the sequential decision making of an agent in a stochastic environment is usually determined by expressing the goal as a scalar reward function and seeking a policy that maximizes the expected total reward. However, many goals that humans care about naturally concern multiple aspects of the world, and it may not be obvious how to condense those into a single reward function. Furthermore, maximization suffers from specification gaming, where the obtained policy achieves a high expected total reward in an unintended way, often taking extreme or nonsensical actions. Here we consider finite acyclic Markov Decision Processes with multiple distinct evaluation metrics, which do not necessarily represent quantities that the user wants to be maximized. We assume the task of the agent is to ensure that the vector of expected totals of the evaluation metrics falls into some given convex set, called the aspiration set. Our algorithm guarantees that this task is fulfilled by using simplices to approximate feasibility sets and propagate aspirations forward while ensuring they remain feasible. It has complexity linear in the number of possible state-action-successor triples and polynomial in the number of evaluation metrics. Moreover, the explicitly non-maximizing nature of the chosen policy and goals yields additional degrees of freedom, which can be used to apply heuristic safety criteria to the choice of actions. We discuss several such safety criteria that aim to steer the agent towards more conservative behavior.
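
The aspiration-set task can be made concrete with a minimal membership check. The sketch below simplifies the convex aspiration set to an axis-aligned box, one interval per evaluation metric; the paper works with general convex sets approximated by simplices, so treat this as illustration only.

```python
# Sketch: is the vector of expected metric totals inside the aspiration set?
# Here the set is a box of per-metric intervals, a crude special case.

def in_aspiration_box(expected_totals, bounds):
    """bounds: list of (low, high) intervals, one per evaluation metric."""
    return all(lo <= v <= hi for v, (lo, hi) in zip(expected_totals, bounds))

# Two metrics, e.g. energy used and tasks completed. The policy should land
# the vector of expected totals inside the box rather than maximize either.
print(in_aspiration_box([3.2, 7.5], [(3.0, 4.0), (7.0, 9.0)]))  # → True
print(in_aspiration_box([5.0, 7.5], [(3.0, 4.0), (7.0, 9.0)]))  # → False
```

Any policy whose expected totals stay in the box fulfills the task, and that slack is exactly the freedom the paper spends on extra safety criteria.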

[AI-22] Judgment2vec: Apply Graph Analytics to Searching and Recommendation of Similar Judgments

链接: https://arxiv.org/abs/2408.04382
作者: Hsuan-Lei Shao
关键词-EN: previous courts efficiently, legal professionals rely, court practice, courts efficiently, previous courts
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注: 5 pages, 7 figures, 2 tables

点击查看摘要

Abstract:In court practice, legal professionals rely on their training to provide opinions that resolve cases, one of the most crucial aspects being the ability to identify similar judgments from previous courts efficiently. However, finding a similar case is challenging and often depends on experience, legal domain knowledge, and extensive labor hours, making veteran lawyers or judges indispensable. This research aims to automate the analysis of judgment text similarity. We utilized a judgment dataset labeled as the “golden standard” by experts, which includes human-verified features that can be converted into an “expert similarity score.” We then constructed a knowledge graph based on “case-article” relationships, ranking each case using natural language processing to derive a “Node2vec similarity score.” By evaluating these two similarity scores, we identified their discrepancies and relationships. The results can significantly reduce the labor hours required for legal searches and recommendations, with potential applications extending to various fields of information retrieval.
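
Comparing embedding-based similarity scores typically reduces to a vector similarity such as cosine. The sketch below computes it for Node2vec-style case vectors; the embeddings are made up for illustration, since real ones come from the trained graph model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

case_a = [0.9, 0.1, 0.0]
case_b = [0.8, 0.2, 0.1]   # similar judgment
case_c = [0.0, 0.1, 0.9]   # unrelated judgment
print(cosine(case_a, case_b) > cosine(case_a, case_c))  # → True
```

The research question then becomes how well such embedding similarity scores track the expert-labeled "golden standard" scores.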

[AI-23] Anomaly Prediction: A Novel Approach with Explicit Delay and Horizon

链接: https://arxiv.org/abs/2408.04377
作者: Jiang You,Arben Cela,René Natowicz,Jacob Ouanounou,Patrick Siarry
关键词-EN: Detecting anomalies, critical challenge, time series data, Detecting, series data
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Detecting anomalies in time series data is a critical challenge across various domains. Traditional methods typically focus on identifying anomalies in immediate subsequent steps, often underestimating the significance of temporal dynamics such as delay time and horizons of anomalies, which generally require extensive post-analysis. This paper introduces a novel approach for time series anomaly prediction, incorporating temporal information directly into the prediction results. We propose a new dataset specifically designed to evaluate this approach and conduct comprehensive experiments using several state-of-the-art methods. Results demonstrate the efficacy of our approach in providing timely and accurate anomaly predictions, setting a new benchmark for future research in this field.
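
One way to read "explicit delay and horizon" is as a scoring window: a prediction made at time t counts as a hit if a true anomaly occurs within [t + delay, t + delay + horizon]. The sketch below implements that reading as an illustration; the paper's exact evaluation protocol may differ.

```python
# Illustrative scoring of anomaly predictions with explicit delay/horizon.

def hits(pred_times, anomaly_times, delay, horizon):
    """Return the prediction times whose window covers a true anomaly."""
    anomalies = set(anomaly_times)
    return [
        t for t in pred_times
        if any(t + delay <= a <= t + delay + horizon for a in anomalies)
    ]

# Anomaly at t=15; a prediction at t=10 with delay 3 and horizon 4 covers
# the window [13, 17] and scores a hit; a prediction at t=2 does not.
print(hits([2, 10], [15], delay=3, horizon=4))  # → [10]
```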

[AI-24] Towards Explainable Network Intrusion Detection using Large Language Models

链接: https://arxiv.org/abs/2408.04342
作者: Paul R. B. Houssel,Priyanka Singh,Siamak Layeghy,Marius Portmann
关键词-EN: Large Language Models, language processing tasks, natural language processing, revolutionised natural language, Large Language
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have revolutionised natural language processing tasks, particularly as chat agents. However, their applicability to threat detection problems remains unclear. This paper examines the feasibility of employing LLMs as a Network Intrusion Detection System (NIDS), despite their high computational requirements, primarily for the sake of explainability. Furthermore, considerable resources have been invested in developing LLMs, and they may offer utility for NIDS. Current state-of-the-art NIDS rely on artificial benchmarking datasets, resulting in skewed performance when applied to real-world networking environments. Therefore, we compare the GPT-4 and LLama3 models against traditional architectures and transformer-based models to assess their ability to detect malicious NetFlows without depending on artificially skewed datasets, but solely on their vast pre-trained acquired knowledge. Our results reveal that, although LLMs struggle with precise attack detection, they hold significant potential for a path towards explainable NIDS. Our preliminary exploration shows that LLMs are unfit for the detection of Malicious NetFlows. Most promisingly, however, these exhibit significant potential as complementary agents in NIDS, particularly in providing explanations and aiding in threat response when integrated with Retrieval Augmented Generation (RAG) and function calling capabilities.

[AI-25] KnowPC: Knowledge-Driven Programmatic Reinforcement Learning for Zero-shot Coordination

链接: https://arxiv.org/abs/2408.04336
作者: Yin Gu,Qi Liu,Zhi Li,Kai Zhang
关键词-EN: training environments, remains a major, popular ZSC solution, ZSC solution paradigm, handle unseen partners
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Zero-shot coordination (ZSC) remains a major challenge in the cooperative AI field, which aims to train an agent to cooperate with an unseen partner in training environments or even novel environments. In recent years, a popular ZSC solution paradigm has been deep reinforcement learning (DRL) combined with advanced self-play or population-based methods to enhance the neural policy’s ability to handle unseen partners. Despite some success, these approaches usually rely on black-box neural networks as the policy function. However, neural networks typically lack interpretability and logic, making the learned policies difficult for partners (e.g., humans) to understand and limiting their generalization ability. These shortcomings hinder the application of reinforcement learning methods in diverse cooperative scenarios. We suggest representing the agent’s policy with an interpretable program. Unlike neural networks, programs contain stable logic, but they are non-differentiable and difficult to optimize. To automatically learn such programs, we introduce Knowledge-driven Programmatic reinforcement learning for zero-shot Coordination (KnowPC). We first define a foundational Domain-Specific Language (DSL), including program structures, conditional primitives, and action primitives. A significant challenge is the vast program search space, making it difficult to find high-performing programs efficiently. To address this, KnowPC integrates an extractor and a reasoner. The extractor discovers environmental transition knowledge from multi-agent interaction trajectories, while the reasoner deduces the preconditions of each action primitive based on the transition knowledge.
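
The flavor of a programmatic policy, as opposed to a black-box network, can be shown with a toy rule list. This is nothing like KnowPC's full DSL, just an illustration of program-as-policy, with made-up conditions and actions.

```python
# Toy programmatic policy: ordered (condition, action) rules that a human
# partner can read directly, unlike neural network weights.

def run_policy(rules, state):
    """Return the action of the first rule whose condition holds."""
    for cond, action in rules:
        if cond(state):
            return action
    return "wait"

policy = [
    (lambda s: s["partner_busy"], "assist"),
    (lambda s: s["task_ready"], "execute"),
]
print(run_policy(policy, {"partner_busy": False, "task_ready": True}))  # → execute
```

Searching the space of such programs is what makes the problem hard, which is where KnowPC's extractor and reasoner come in.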

[AI-26] Learning with Digital Agents: An Analysis based on the Activity Theory

链接: https://arxiv.org/abs/2408.04304
作者: Mateusz Dolata,Dzmitry Katsiuba,Natalie Wellnhammer,Gerhard Schwabe
关键词-EN: general-purpose technology, Digital agents, considered a general-purpose, Digital, agents
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: Authors manuscript accepted for publication in Journal of Management Information Systems

点击查看摘要

Abstract:Digital agents are considered a general-purpose technology. They spread quickly in private and organizational contexts, including education. Yet, research lacks a conceptual framing to describe interaction with such agents in a holistic manner. While focusing on the interaction with a pedagogical agent, i.e., a digital agent capable of natural-language interaction with a learner, we propose a model of learning activity based on activity theory. We use this model and a review of prior research on digital agents in education to analyze how various characteristics of the activity, including features of a pedagogical agent or learner, influence learning outcomes. The analysis leads to identification of IS research directions and guidance for developers of pedagogical agents and digital agents in general. We conclude by extending the activity theory-based model beyond the context of education and show how it helps designers and researchers ask the right questions when creating a digital agent.

[AI-27] Tackling Noisy Clients in Federated Learning with End-to-end Label Correction CIKM’24

链接: https://arxiv.org/abs/2408.04301
作者: Xuefeng Jiang,Sheng Sun,Jia Li,Jingjing Xue,Runhan Li,Zhiyuan Wu,Gang Xu,Yuwei Wang,Min Liu
关键词-EN: diverse privacy-sensitive applications, sensitive private information, achieved wide successes, label noise, wide successes
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: To appear in ACM CIKM’24 full research paper track

点击查看摘要

Abstract:Recently, federated learning (FL) has achieved wide successes for diverse privacy-sensitive applications without sacrificing the sensitive private information of clients. However, the data quality of client datasets can not be guaranteed since corresponding annotations of different clients often contain complex label noise of varying degrees, which inevitably causes the performance degradation. Intuitively, the performance degradation is dominated by clients with higher noise rates since their trained models contain more misinformation from data, thus it is necessary to devise an effective optimization scheme to mitigate the negative impacts of these noisy clients. In this work, we propose a two-stage framework FedELC to tackle this complicated label noise issue. The first stage aims to guide the detection of noisy clients with higher label noise, while the second stage aims to correct the labels of noisy clients’ data via an end-to-end label correction framework which is achieved by learning possible ground-truth labels of noisy clients’ datasets via back propagation. We implement sixteen related methods and evaluate five datasets with three types of complicated label noise scenarios for a comprehensive comparison. Extensive experimental results demonstrate our proposed framework achieves superior performance than its counterparts for different scenarios. Additionally, we effectively improve the data quality of detected noisy clients’ local datasets with our label correction framework. The code is available at this https URL.
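
The end-to-end correction idea of learning possible ground-truth labels via back propagation can be sketched in miniature: treat a noisy example’s label as learnable logits and descend the cross-entropy against a frozen model prediction. This is a simplified stand-in for the paper’s framework, with the gradient derived by hand so the sketch stays dependency-free:

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def correct_label(model_probs, steps=200, lr=0.5):
    """Learn a soft label q = softmax(z) minimizing -sum_c q_c * log p_c
    against a frozen model prediction p.
    Hand-derived gradient: dL/dz_k = q_k * (a_k - L), where a_c = -log p_c."""
    a = [-math.log(p) for p in model_probs]
    z = [0.0] * len(model_probs)
    for _ in range(steps):
        q = softmax(z)
        loss = sum(qc * ac for qc, ac in zip(q, a))
        z = [zc - lr * qc * (ac - loss) for zc, qc, ac in zip(z, q, a)]
    return softmax(z)

# A noisy client labeled this example as class 0, but the frozen model
# assigns most probability to class 2: the learned soft label follows suit.
q = correct_label([0.1, 0.2, 0.7])
print(max(range(3), key=lambda c: q[c]))  # -> 2
```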

[AI-28] Assigning Credit with Partial Reward Decoupling in Multi-Agent Proximal Policy Optimization

链接: https://arxiv.org/abs/2408.04295
作者: Aditya Kapoor,Benjamin Freed,Howie Choset,Jeff Schneider
关键词-EN: proximal policy optimization, Multi-agent proximal policy, challenging multi-agent reinforcement, multi-agent reinforcement learning, policy optimization
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 20 pages, 5 figures, 12 tables, Reinforcement Learning Journal and Reinforcement Learning Conference 2024

点击查看摘要

Abstract:Multi-agent proximal policy optimization (MAPPO) has recently demonstrated state-of-the-art performance on challenging multi-agent reinforcement learning tasks. However, MAPPO still struggles with the credit assignment problem, wherein the sheer difficulty in ascribing credit to individual agents’ actions scales poorly with team size. In this paper, we propose a multi-agent reinforcement learning algorithm that adapts recent developments in credit assignment to improve upon MAPPO. Our approach leverages partial reward decoupling (PRD), which uses a learned attention mechanism to estimate which of a particular agent’s teammates are relevant to its learning updates. We use this estimate to dynamically decompose large groups of agents into smaller, more manageable subgroups. We empirically demonstrate that our approach, PRD-MAPPO, decouples agents from teammates that do not influence their expected future reward, thereby streamlining credit assignment. We additionally show that PRD-MAPPO yields significantly higher data efficiency and asymptotic performance compared to both MAPPO and other state-of-the-art methods across several multi-agent tasks, including StarCraft II. Finally, we propose a version of PRD-MAPPO that is applicable to shared reward settings, where PRD was previously not applicable, and empirically show that this also leads to performance improvements over MAPPO.
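
One way to picture the decomposition step (illustrative only; the paper learns the attention weights, whereas here they are given): threshold an agent-relevance matrix and split the team into connected components:

```python
# attn[i][j] = estimated relevance of agent j to agent i's learning updates.
# Thresholding gives an undirected relevance graph; connected components
# become the smaller, more manageable subgroups.
def decompose(attn, tau=0.5):
    n = len(attn)
    adj = [[attn[i][j] >= tau or attn[j][i] >= tau for j in range(n)]
           for i in range(n)]
    seen, groups = set(), []
    for s in range(n):
        if s in seen:
            continue
        stack, comp = [s], []
        while stack:
            i = stack.pop()
            if i in seen:
                continue
            seen.add(i)
            comp.append(i)
            stack.extend(j for j in range(n) if adj[i][j] and j not in seen)
        groups.append(sorted(comp))
    return groups

attn = [[1.0, 0.9, 0.1, 0.0],
        [0.8, 1.0, 0.0, 0.1],
        [0.2, 0.0, 1.0, 0.7],
        [0.0, 0.1, 0.6, 1.0]]
print(decompose(attn))  # -> [[0, 1], [2, 3]]
```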

[AI-29] AI-Driven Chatbot for Intrusion Detection in Edge Networks: Enhancing Cybersecurity with Ethical User Consent

链接: https://arxiv.org/abs/2408.04281
作者: Mugheez Asif,Abdul Manan,Abdul Moiz ur Rehman,Mamoona Naveed Asghar,Muhammad Umair
关键词-EN: streamlining customer service, providing personal assistance, automating routine tasks, offering health advice, contemporary digital landscape
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In today’s digital landscape, chatbots have become indispensable tools across various sectors, streamlining customer service, providing personal assistance, automating routine tasks, and offering health advice. However, their potential remains underexplored in the realm of network security, particularly for intrusion detection. To bridge this gap, we propose a chatbot architecture specifically designed to enhance security within edge networks for intrusion detection. Leveraging advanced machine learning algorithms, this chatbot will monitor network traffic to identify and mitigate potential intrusions. By securing the network environment using an edge network managed by a Raspberry Pi module and ensuring ethical user consent, promoting transparency and trust, this innovative solution aims to safeguard sensitive data and maintain a secure workplace, thereby addressing the growing need for robust network security measures in the digital age.

[AI-30] Unveiling Hidden Visual Information: A Reconstruction Attack Against Adversarial Visual Information Hiding

链接: https://arxiv.org/abs/2408.04261
作者: Jonggyu Jang,Hyeonsu Lyu,Seongjin Hwang,Hyun Jong Yang
关键词-EN: executing data reconstruction, AVIH encryption method, AVIH, AVIH method, data reconstruction
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注: 12 pages

点击查看摘要

Abstract:This paper investigates the security vulnerabilities of adversarial-example-based image encryption by executing data reconstruction (DR) attacks on encrypted images. A representative image encryption method is the adversarial visual information hiding (AVIH), which uses type-I adversarial example training to protect gallery datasets used in image recognition tasks. In the AVIH method, the type-I adversarial example approach creates images that appear completely different but are still recognized by machines as the original ones. Additionally, the AVIH method can restore encrypted images to their original forms using a predefined private key generative model. For the best security, assigning a unique key to each image is recommended; however, storage limitations may necessitate some images sharing the same key model. This raises a crucial security question for AVIH: How many images can safely share the same key model without being compromised by a DR attack? To address this question, we introduce a dual-strategy DR attack against the AVIH encryption method by incorporating (1) generative-adversarial loss and (2) augmented identity loss, which prevent DR from overfitting – an issue akin to that in machine learning. Our numerical results validate this approach through image recognition and re-identification benchmarks, demonstrating that our strategy can significantly enhance the quality of reconstructed images, thereby requiring fewer key-sharing encrypted images. Our source code to reproduce our results will be available soon.

[AI-31] EfficientRAG: Efficient Retriever for Multi-Hop Question Answering

链接: https://arxiv.org/abs/2408.04259
作者: Ziyuan Zhuang,Zhiyang Zhang,Sitao Cheng,Fangkai Yang,Jia Liu,Shujian Huang,Qingwei Lin,Saravan Rajmohan,Dongmei Zhang,Qi Zhang
关键词-EN: Retrieval-augmented generation, addressing complex questions, methods encounter difficulties, encounter difficulties, difficulties when addressing
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 20 pages, 4 figures

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) methods encounter difficulties when addressing complex questions like multi-hop queries. While iterative retrieval methods improve performance by gathering additional information, current approaches often rely on multiple calls of large language models (LLMs). In this paper, we introduce EfficientRAG, an efficient retriever for multi-hop question answering. EfficientRAG iteratively generates new queries without the need for LLM calls at each iteration and filters out irrelevant information. Experimental results demonstrate that EfficientRAG surpasses existing RAG methods on three open-domain multi-hop question-answering datasets.

[AI-32] Scalable Transformer for High Dimensional Multivariate Time Series Forecasting

链接: https://arxiv.org/abs/2408.04245
作者: Xin Zhou,Weiqing Wang,Wray Buntine,Shilin Qu,Abishek Sriramulu,Weicong Tan,Christoph Bergmeir
关键词-EN: demonstrated significant success, recently demonstrated significant, Multivariate Time Series, Deep models, Channel-dependent models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Deep models for Multivariate Time Series (MTS) forecasting have recently demonstrated significant success. Channel-dependent models capture complex dependencies that channel-independent models cannot capture. However, the number of channels in real-world applications outpaces the capabilities of existing channel-dependent models, and contrary to common expectations, some models underperform the channel-independent models in handling high-dimensional data, which raises questions about the performance of channel-dependent models. To address this, our study first investigates the reasons behind the suboptimal performance of these channel-dependent models on high-dimensional MTS data. Our analysis reveals that two primary issues lie in the introduced noise from unrelated series that increases the difficulty of capturing the crucial inter-channel dependencies, and challenges in training strategies due to high-dimensional data. To address these issues, we propose STHD, the Scalable Transformer for High-Dimensional Multivariate Time Series Forecasting. STHD has three components: a) Relation Matrix Sparsity that limits the noise introduced and alleviates the memory issue; b) ReIndex applied as a training strategy to enable a more flexible batch size setting and increase the diversity of training data; and c) Transformer that handles 2-D inputs and captures channel dependencies. These components jointly enable STHD to manage the high-dimensional MTS while maintaining computational feasibility. Furthermore, experimental results show STHD’s considerable improvement on three high-dimensional datasets: Crime-Chicago, Wiki-People, and Traffic. The source code and dataset are publicly available at this https URL.
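
Component a) can be pictured as top-k sparsification of a channel-relation matrix (a generic sketch, not the paper’s implementation): each channel keeps only its k most related channels, zeroing out the rest to limit noise from unrelated series.

```python
# Keep the top-k entries of each row of a channel-relation matrix.
def sparsify(rel, k):
    n = len(rel)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        top = sorted(range(n), key=lambda j: rel[i][j], reverse=True)[:k]
        for j in top:
            out[i][j] = rel[i][j]
    return out

rel = [[1.0, 0.8, 0.1],
       [0.8, 1.0, 0.2],
       [0.1, 0.2, 1.0]]
print(sparsify(rel, 2))
```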

[AI-33] The Ungrounded Alignment Problem

链接: https://arxiv.org/abs/2408.04242
作者: Marc Pickett,Aakash Kumar Nain,Joseph Modayil,Llion Jones
关键词-EN: Modern machine learning, demonstrated substantial abilities, Modern machine, ignore human-provided knowledge, machine learning systems
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注: 7 pages, plus references and appendix

点击查看摘要

Abstract:Modern machine learning systems have demonstrated substantial abilities with methods that either embrace or ignore human-provided knowledge, but combining benefits of both styles remains a challenge. One particular challenge involves designing learning systems that exhibit built-in responses to specific abstract stimulus patterns, yet are still plastic enough to be agnostic about the modality and exact form of their inputs. In this paper, we investigate what we call The Ungrounded Alignment Problem, which asks How can we build in predefined knowledge in a system where we don’t know how a given stimulus will be grounded? This paper examines a simplified version of the general problem, where an unsupervised learner is presented with a sequence of images for the characters in a text corpus, and this learner is later evaluated on its ability to recognize specific (possibly rare) sequential patterns. Importantly, the learner is given no labels during learning or evaluation, but must map images from an unknown font or permutation to its correct class label. That is, at no point is our learner given labeled images, where an image vector is explicitly associated with a class label. Despite ample work in unsupervised and self-supervised loss functions, all current methods require a labeled fine-tuning phase to map the learned representations to correct classes. Finding this mapping in the absence of labels may seem a fool’s errand, but our main result resolves this seeming paradox. We show that leveraging only letter bigram frequencies is sufficient for an unsupervised learner both to reliably associate images to class labels and to reliably identify trigger words in the sequence of inputs. More generally, this method suggests an approach for encoding specific desired innate behaviour in modality-agnostic models.
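
The bigram-frequency idea can be demonstrated on a toy alphabet: a brute-force search over label permutations recovers the hidden mapping purely from bigram statistics, with no labeled pairs. This simplifies the paper’s setting, where the observed symbols are images rather than integer ids:

```python
import itertools
from collections import Counter

def bigrams(seq):
    return Counter(zip(seq, seq[1:]))

corpus = "abcaabbabcabcaacbabcabca"            # reference text, letters known
alphabet = sorted(set(corpus))                 # ['a', 'b', 'c']

true_map = {"a": 2, "b": 0, "c": 1}            # hidden permutation
encoded = [true_map[ch] for ch in corpus]      # what the learner observes

ref, obs = bigrams(corpus), bigrams(encoded)

def score(assign):
    # overlap between corpus bigram counts and decoded bigram counts
    inv = {sid: ch for ch, sid in assign.items()}
    decoded = Counter({(inv[i], inv[j]): c for (i, j), c in obs.items()})
    return sum(min(decoded[k], ref[k]) for k in ref)

best = max((dict(zip(alphabet, perm))
            for perm in itertools.permutations(range(3))), key=score)
print(best == true_map)  # -> True
```

Because the letter frequencies in the corpus are all distinct, only the true mapping reproduces the full bigram count table, so the search is unambiguous here.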

[AI-34] Cluster-Wide Task Slowdown Detection in Cloud System KDD2024

链接: https://arxiv.org/abs/2408.04236
作者: Feiyi Chen,Yingying Zhang,Lunting Fan,Yuxuan Liang,Guansong Pang,Qingsong Wen,Shuiguang Deng
关键词-EN: substantial liquidated damages, bring substantial liquidated, Slow task detection, Slow task, liquidated damages
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: This paper has been accepted by KDD2024

点击查看摘要

Abstract:Slow task detection is a critical problem in cloud operation and maintenance since it is highly related to user experience and can bring substantial liquidated damages. Most anomaly detection methods detect it from a single-task aspect. However, considering millions of concurrent tasks in large-scale cloud computing clusters, it becomes impractical and inefficient. Moreover, single-task slowdowns are very common and do not necessarily indicate a malfunction of a cluster due to its violent fluctuation nature in a virtual environment. Thus, we shift our attention to cluster-wide task slowdowns by utilizing the duration time distribution of tasks across a cluster, so that the computation complexity is not relevant to the number of tasks. The task duration time distribution often exhibits compound periodicity and local exceptional fluctuations over time. Though transformer-based methods are one of the most powerful methods to capture these time series normal variation patterns, we empirically find and theoretically explain the flaw of the standard attention mechanism in reconstructing subperiods with low amplitude when dealing with compound periodicity. To tackle these challenges, we propose SORN (i.e., Skimming Off subperiods in descending amplitude order and Reconstructing Non-slowing fluctuation), which consists of a Skimming Attention mechanism to reconstruct the compound periodicity and a Neural Optimal Transport module to distinguish cluster-wide slowdowns from other exceptional fluctuations. Furthermore, since anomalies in the training set are inevitable in a practical scenario, we propose a picky loss function, which adaptively assigns higher weights to reliable time slots in the training set. Extensive experiments demonstrate that SORN outperforms state-of-the-art methods on multiple real-world industrial datasets. 

[AI-35] Probabilistic Circuits for Cumulative Distribution Functions

链接: https://arxiv.org/abs/2408.04229
作者: Oliver Broadrick,William Cao,Benjie Wang,Martin Trapp,Guy Van den Broeck
关键词-EN: supports efficient probabilistic, efficient probabilistic inference, sufficient structural properties, multivariate probability distribution, probabilistic inference
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:A probabilistic circuit (PC) succinctly expresses a function that represents a multivariate probability distribution and, given sufficient structural properties of the circuit, supports efficient probabilistic inference. Typically a PC computes the probability mass (or density) function (PMF or PDF) of the distribution. We consider PCs instead computing the cumulative distribution function (CDF). We show that for distributions over binary random variables these representations (PMF and CDF) are essentially equivalent, in the sense that one can be transformed to the other in polynomial time. We then show how a similar equivalence holds for distributions over finite discrete variables using a modification of the standard encoding with binary variables that aligns with the CDF semantics. Finally we show that for continuous variables, smooth, decomposable PCs computing PDFs and CDFs can be efficiently transformed to each other by modifying only the leaves of the circuit.
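
The binary-variable equivalence can be checked exhaustively on a toy distribution (brute force over states, not a circuit): the CDF is a downward-closed sum of the PMF, and the PMF is recovered from the CDF by inclusion-exclusion over the coordinates set to 1:

```python
from itertools import product, combinations

pmf = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}

def cdf(x):
    # F(x) = sum of P(y) over all y <= x componentwise
    return sum(p for y, p in pmf.items()
               if all(yi <= xi for yi, xi in zip(y, x)))

def pmf_from_cdf(x):
    # P(x) = sum over subsets S of the 1-coordinates of (-1)^|S| F(x with S zeroed)
    ones = [i for i, xi in enumerate(x) if xi == 1]
    total = 0.0
    for k in range(len(ones) + 1):
        for S in combinations(ones, k):
            y = tuple(0 if i in S else xi for i, xi in enumerate(x))
            total += (-1) ** k * cdf(y)
    return total

print(all(abs(pmf_from_cdf(x) - pmf[x]) < 1e-9
          for x in product((0, 1), repeat=2)))  # -> True
```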

[AI-36] VideoQA in the Era of LLMs: An Empirical Study

链接: https://arxiv.org/abs/2408.04223
作者: Junbin Xiao,Nanxin Huang,Hangyu Qin,Dongyang Li,Yicong Li,Fengbin Zhu,Zhulin Tao,Jianxing Yu,Liang Lin,Tat-Seng Chua,Angela Yao
关键词-EN: Video Large Language, Large Language Models, Large Language, Video Large, Video Question Answering
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Preprint. Under Review

点击查看摘要

Abstract:Video Large Language Models (Video-LLMs) are flourishing and have advanced many video-language tasks. As a golden testbed, Video Question Answering (VideoQA) plays a pivotal role in Video-LLM development. This work conducts a timely and comprehensive study of Video-LLMs’ behavior in VideoQA, aiming to elucidate their success and failure modes, and provide insights towards more human-like video understanding and question answering. Our analyses demonstrate that Video-LLMs excel in VideoQA; they can correlate contextual cues and generate plausible responses to questions about varied video contents. However, models falter in handling video temporality, both in reasoning about temporal content ordering and grounding QA-relevant temporal moments. Moreover, the models behave unintuitively - they are unresponsive to adversarial video perturbations while being sensitive to simple variations of candidate answers and questions. Also, they do not necessarily generalize better. The findings demonstrate Video-LLMs’ QA capability under standard conditions yet highlight their severe deficiency in robustness and interpretability, suggesting the urgent need for rationales in Video-LLM development.

[AI-37] Connective Viewpoints of Signal-to-Noise Diffusion Models

链接: https://arxiv.org/abs/2408.04221
作者: Khanh Doan,Long Tung Vuong,Tuan Nguyen,Anh Tuan Bui,Quyen Tran,Thanh-Toan Do,Dinh Phung,Trung Le
关键词-EN: complex data interpolation, Diffusion models, audio generation, Diffusion, diffusion models constitute
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Diffusion models (DM) have become fundamental components of generative models, excelling across various domains such as image creation, audio generation, and complex data interpolation. Signal-to-Noise diffusion models constitute a diverse family covering most state-of-the-art diffusion models. While there have been several attempts to study Signal-to-Noise (S2N) diffusion models from various perspectives, there remains a need for a comprehensive study connecting different viewpoints and exploring new perspectives. In this study, we offer a comprehensive perspective on noise schedulers, examining their role through the lens of the signal-to-noise ratio (SNR) and its connections to information theory. Building upon this framework, we have developed a generalized backward equation to enhance the performance of the inference process.
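
To make the SNR lens concrete, here is the signal-to-noise ratio of a variance-preserving forward process x_t = α_t·x_0 + σ_t·ε under a cosine schedule (one common schedule choice, not specific to this paper):

```python
import math

# Variance-preserving schedule: alpha_t^2 + sigma_t^2 = 1.
def alpha_sigma(t):
    a = math.cos(0.5 * math.pi * t)
    return a, math.sqrt(1.0 - a * a)

def snr(t):
    # SNR(t) = alpha_t^2 / sigma_t^2
    a, s = alpha_sigma(t)
    return (a * a) / (s * s)

# SNR decreases monotonically from (near) infinity at t=0 toward 0 at t=1,
# crossing 1 at t=0.5 for this schedule.
for t in (0.1, 0.5, 0.9):
    print(t, round(snr(t), 4))
```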

[AI-38] Attention Mechanism and Context Modeling System for Text Mining Machine Translation

链接: https://arxiv.org/abs/2408.04216
作者: Shi Bo,Yuwei Zhang,Junming Huang,Sitong Liu,Zexi Chen,Zizheng Li
关键词-EN: contextual apprehension capabilities, architectural schema anchored, K-means categorization algorithm, Transformer paradigm, paper advances
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper advances a novel architectural schema anchored upon the Transformer paradigm and innovatively amalgamates the K-means categorization algorithm to augment the contextual apprehension capabilities of the schema. The transformer model performs well in machine translation tasks due to its parallel computing power and multi-head attention mechanism. However, it may encounter contextual ambiguity or ignore local features when dealing with highly complex language structures. To circumvent this constraint, this exposition incorporates the K-Means algorithm, which is used to stratify the lexis and idioms of the input textual matter, thereby facilitating superior identification and preservation of the local structure and contextual intelligence of the language. The advantage of this combination is that K-Means can automatically discover the topic or concept regions in the text, which may be directly related to translation quality. Consequently, the schema contrived herein enlists K-Means as a preparatory phase antecedent to the Transformer and recalibrates the multi-head attention weights to assist in the discrimination of lexis and idioms bearing analogous semantics or functionalities. This ensures the schema accords heightened regard to the contextual intelligence embodied by these clusters during the training phase, rather than merely focusing on locational intelligence.
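
The preparatory K-Means phase can be sketched as plain Lloyd iterations over token embeddings (toy 2-D data below; a real system would cluster learned embeddings before recalibrating the attention weights):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # assignment step: each point joins its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        # update step: recompute each center as its cluster mean
        centers = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return centers, clusters

points = [(0.0, 0.1), (0.1, 0.0), (5.0, 5.1), (5.1, 4.9)]
centers, clusters = kmeans(points, 2)
print(sorted(len(c) for c in clusters))  # -> [2, 2]
```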

[AI-39] MMRole: A Comprehensive Framework for Developing and Evaluating Multimodal Role-Playing Agents

链接: https://arxiv.org/abs/2408.04203
作者: Yanqi Dai,Huanran Hu,Lei Wang,Shengjie Jin,Xu Chen,Zhiwu Lu
关键词-EN: facilitate sociological research, garnered increasing attention, Multimodal Role-Playing Agents, Role-Playing Agents, sociological research
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recently, Role-Playing Agents (RPAs) have garnered increasing attention for their potential to deliver emotional value and facilitate sociological research. However, existing studies are primarily confined to the textual modality, unable to simulate humans’ multimodal perceptual capabilities. To bridge this gap, we introduce the concept of Multimodal Role-Playing Agents (MRPAs), and propose a comprehensive framework, MMRole, for their development and evaluation, which comprises a personalized multimodal dataset and a robust evaluation method. Specifically, we construct a large-scale, high-quality dataset, MMRole-Data, consisting of 85 characters, 11K images, and 14K single or multi-turn dialogues. Additionally, we present a robust evaluation method, MMRole-Eval, encompassing eight metrics across three dimensions, where a reward model is trained to score MRPAs with the constructed ground-truth data for comparison. Moreover, we develop the first specialized MRPA, MMRole-Agent. Extensive evaluation results demonstrate the improved performance of MMRole-Agent and highlight the primary challenges in developing MRPAs, emphasizing the need for enhanced multimodal understanding and role-playing consistency. The data, code, and models will be available at this https URL.

[AI-40] Pairwise Judgment Formulation for Semantic Embedding Model in Web Search

链接: https://arxiv.org/abs/2408.04197
作者: Mengze Hong,Chen Jason Zhang
关键词-EN: network-based Siamese architecture, neural network-based Siamese, Semantic Embedding Model, natural language processing, Siamese architecture
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Databases (cs.DB)
*备注:

点击查看摘要

Abstract:Semantic Embedding Model (SEM), a neural network-based Siamese architecture, is gaining momentum in information retrieval and natural language processing. In order to train SEM in a supervised fashion for Web search, the search engine query log is typically utilized to automatically formulate pairwise judgments as training data. Despite the growing application of semantic embeddings in the search engine industry, little work has been done on formulating effective pairwise judgments for training SEM. In this paper, we make the first in-depth investigation of a wide range of strategies for generating pairwise judgments for SEM. An interesting (perhaps surprising) discovery reveals that the conventional pairwise judgment formulation strategy widely used in the field of pairwise Learning-to-Rank (LTR) is not necessarily effective for training SEM. Through a large-scale empirical study based on query logs and click-through activities from a major commercial search engine, we demonstrate the effective strategies for SEM and highlight the advantages of a hybrid heuristic (i.e., Clicked > Non-Clicked) in comparison to the atomic heuristics (e.g., Clicked > Skipped) in LTR. We conclude with best practices for training SEM and offer promising insights for future research.
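
The clicked-vs-non-clicked pairing heuristic is easy to state in code; here is how one session of a query log would be turned into (query, preferred, dispreferred) training triples (the session fields are invented toy data):

```python
from itertools import product

session = {
    "query": "semantic embedding",
    "results": ["d1", "d2", "d3", "d4"],   # documents shown for the query
    "clicked": {"d2", "d4"},               # documents the user clicked
}

def pairwise_judgments(session):
    # pair every clicked result with every non-clicked result of the session
    pos = [d for d in session["results"] if d in session["clicked"]]
    neg = [d for d in session["results"] if d not in session["clicked"]]
    return [(session["query"], p, n) for p, n in product(pos, neg)]

print(pairwise_judgments(session))  # 4 (query, preferred, dispreferred) triples
```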

[AI-41] Uncertainty-Aware Crime Prediction With Spatial Temporal Multivariate Graph Neural Networks

链接: https://arxiv.org/abs/2408.04193
作者: Zepu Wang,Xiaobo Ma,Huajie Yang,Weimin Lvu,Peng Sun,Sharath Chandra Guntuku
关键词-EN: stabilizing society today, society today, Zero-Inflated Negative Binomial, Negative Binomial Graph, critical component
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Crime forecasting is a critical component of urban analysis and essential for stabilizing society today. Unlike other time series forecasting problems, crime incidents are sparse, particularly in small regions and within specific time periods. Traditional spatial-temporal deep learning models often struggle with this sparsity, as they typically cannot effectively handle the non-Gaussian nature of crime data, which is characterized by numerous zeros and over-dispersed patterns. To address these challenges, we introduce a novel approach termed Spatial Temporal Multivariate Zero-Inflated Negative Binomial Graph Neural Networks (STMGNN-ZINB). This framework leverages diffusion and convolution networks to analyze spatial, temporal, and multivariate correlations, enabling the parameterization of probabilistic distributions of crime incidents. By incorporating a Zero-Inflated Negative Binomial model, STMGNN-ZINB effectively manages the sparse nature of crime data, enhancing prediction accuracy and the precision of confidence intervals. Our evaluation on real-world datasets confirms that STMGNN-ZINB outperforms existing models, providing a more reliable tool for predicting and understanding crime dynamics.
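
The distribution at the heart of the model is easy to write down; below is a minimal sketch of the zero-inflated negative binomial PMF (parameter convention assumed here: pi is the extra-zero probability, and the NB part has mean r·p/(1−p); the paper parameterizes such distributions with a GNN):

```python
import math

def nb_pmf(k, r, p):
    # NB(k; r, p) = C(k+r-1, k) * (1-p)^r * p^k, computed in log space
    log_coef = math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
    return math.exp(log_coef + r * math.log(1.0 - p) + k * math.log(p))

def zinb_pmf(k, pi, r, p):
    # a zero can come from the inflation component or from the NB itself
    nb = nb_pmf(k, r, p)
    return pi + (1.0 - pi) * nb if k == 0 else (1.0 - pi) * nb

# Half the mass is inflated onto zero; probabilities still sum to ~1.
probs = [zinb_pmf(k, pi=0.5, r=2.0, p=0.3) for k in range(50)]
print(round(sum(probs), 6))
```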

[AI-42] Listwise Reward Estimation for Offline Preference-based Reinforcement Learning ICML2024

链接: https://arxiv.org/abs/2408.04190
作者: Heewoong Choi,Sangwon Jung,Hongjoon Ahn,Taesup Moon
关键词-EN: designing precise reward, Reinforcement Learning, precise reward functions, reward functions remains, designing precise
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 21 pages, ICML 2024

点击查看摘要

Abstract:In Reinforcement Learning (RL), designing precise reward functions remains to be a challenge, particularly when aligning with human intent. Preference-based RL (PbRL) was introduced to address this problem by learning reward models from human feedback. However, existing PbRL methods have limitations as they often overlook the second-order preference that indicates the relative strength of preference. In this paper, we propose Listwise Reward Estimation (LiRE), a novel approach for offline PbRL that leverages second-order preference information by constructing a Ranked List of Trajectories (RLT), which can be efficiently built by using the same ternary feedback type as traditional methods. To validate the effectiveness of LiRE, we propose a new offline PbRL dataset that objectively reflects the effect of the estimated rewards. Our extensive experiments on the dataset demonstrate the superiority of LiRE, i.e., outperforming state-of-the-art baselines even with modest feedback budgets and enjoying robustness with respect to the number of feedbacks and feedback noise. Our code is available at this https URL
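
The RLT construction can be sketched as a binary-search insertion driven by ternary comparisons (here the "human" label is simulated from scalar returns; real similarity judgments are not perfectly transitive, which this toy ignores):

```python
def ternary(a, b, eps=0.5):
    # stand-in for ternary human feedback: worse (-1), similar (0), better (1)
    if abs(a - b) <= eps:
        return 0
    return -1 if a < b else 1

def insert(rlt, traj):
    # rlt is a list of groups ordered by preference; similar items share a group
    lo, hi = 0, len(rlt)
    while lo < hi:
        mid = (lo + hi) // 2
        c = ternary(traj, rlt[mid][0])
        if c == 0:
            rlt[mid].append(traj)
            return
        if c < 0:
            hi = mid
        else:
            lo = mid + 1
    rlt.insert(lo, [traj])

rlt = []
for ret in [3.0, 1.0, 5.0, 3.2, 0.9]:
    insert(rlt, ret)
print(rlt)  # -> [[1.0, 0.9], [3.0, 3.2], [5.0]]
```

Each insertion needs only O(log n) ternary queries, which is how a ranked list carrying second-order preference information can be built from the same feedback type as pairwise methods.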

[AI-43] EdgeShield: A Universal and Efficient Edge Computing Framework for Robust AI

链接: https://arxiv.org/abs/2408.04181
作者: Duo Zhong,Bojing Li,Xiang Chen,Chenchen Liu
关键词-EN: Artificial Intelligence, innovative security measures, systems has created, security measures, increasing prevalence
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The increasing prevalence of adversarial attacks on Artificial Intelligence (AI) systems has created a need for innovative security measures. However, the current methods of defending against these attacks often come with a high computing cost and require back-end processing, making real-time defense challenging. Fortunately, there have been remarkable advancements in edge-computing, which make it easier to deploy neural networks on edge devices. Building upon these advancements, we propose an edge framework design to enable universal and efficient detection of adversarial attacks. This framework incorporates an attention-based adversarial detection methodology and a lightweight detection network formation, making it suitable for a wide range of neural networks and can be deployed on edge devices. To assess the effectiveness of our proposed framework, we conducted evaluations on five neural networks. The results indicate an impressive 97.43% F-score can be achieved, demonstrating the framework’s proficiency in detecting adversarial attacks. Moreover, our proposed framework also exhibits significantly reduced computing complexity and cost in comparison to previous detection methods. This aspect is particularly beneficial as it ensures that the defense mechanism can be efficiently implemented in real-time on-edge devices.

[AI-44] wav2graph: A Framework for Supervised Learning Knowledge Graph from Speech

链接: https://arxiv.org/abs/2408.04174
作者: Khai Le-Duc,Quy-Anh Dang,Tan-Hanh Pham,Truong-Son Hy
关键词-EN: large language models, enhance the performance, providing structured, reasoning and context-awareness, performance of large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注: Preprint, 32 pages

点击查看摘要

Abstract:Knowledge graphs (KGs) enhance the performance of large language models (LLMs) and search engines by providing structured, interconnected data that improves reasoning and context-awareness. However, KGs only focus on text data, thereby neglecting other modalities such as speech. In this work, we introduce wav2graph, the first framework for supervised learning knowledge graph from speech data. Our pipeline is straightforward: (1) constructing a KG based on transcribed spoken utterances and a named entity database, (2) converting the KG into embedding vectors, and (3) training graph neural networks (GNNs) for node classification and link prediction tasks. Through extensive experiments conducted in inductive and transductive learning contexts using state-of-the-art GNN models, we provide baseline results and error analysis for node classification and link prediction tasks on human transcripts and automatic speech recognition (ASR) transcripts, including evaluations using both encoder-based and decoder-based node embeddings, as well as monolingual and multilingual acoustic pre-trained models. All related code, data, and models are published online.
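
Step (1) of the pipeline can be sketched as simple string matching between transcripts and an entity database (toy data; the entity names, the "mentions" relation, and the utterance node ids are invented for illustration):

```python
# Build KG edges by matching a named entity database against transcripts.
entity_db = {"hanoi", "vietnam", "mekong"}

transcripts = [
    "the mekong river reaches hanoi",      # hypothetical ASR outputs
    "hanoi is the capital of vietnam",
]

edges = []
for i, utt in enumerate(transcripts):
    for tok in utt.split():
        if tok in entity_db:
            edges.append((f"utt_{i}", "mentions", tok))

print(edges)
```

The resulting (head, relation, tail) triples are what steps (2) and (3) would embed and feed to a GNN.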

[AI-45] Perceive Reflect and Plan: Designing LLM Agent for Goal-Directed City Navigation without Instructions

Link: https://arxiv.org/abs/2408.04168
Authors: Qingbin Zeng, Qinglong Yang, Shunan Dong, Heming Du, Liang Zheng, Fengli Xu, Yong Li
Keywords (EN): road network connections, including recognizing landmarks, goal location, including recognizing, network connections
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper considers a scenario in city navigation: an AI agent is provided with language descriptions of the goal location with respect to some well-known landmarks; by only observing the scene around, including recognizing landmarks and road network connections, the agent has to make decisions to navigate to the goal location without instructions. This problem is very challenging, because it requires the agent to establish its own position and acquire a spatial representation of a complex urban environment, where landmarks are often invisible. In the absence of navigation instructions, such abilities are vital for the agent to make high-quality decisions in long-range city navigation. With the emergent reasoning ability of large language models (LLMs), a tempting baseline is to prompt LLMs to “react” on each observation and make decisions accordingly. However, this baseline performs very poorly: the agent often repeatedly visits the same locations and makes short-sighted, inconsistent decisions. To address these issues, this paper introduces a novel agentic workflow featured by its abilities to perceive, reflect and plan. Specifically, we find LLaVA-7B can be fine-tuned to perceive the direction and distance of landmarks with sufficient accuracy for city navigation. Moreover, reflection is achieved through a memory mechanism, where past experiences are stored and can be retrieved with current perception for effective decision argumentation. Planning uses reflection results to produce long-term plans, which can avoid short-sighted decisions in long-range navigation. We show the designed workflow significantly improves the navigation ability of the LLM agent compared with the state-of-the-art baselines.

[AI-46] The Data Addition Dilemma ALT

Link: https://arxiv.org/abs/2408.04154
Authors: Judy Hanwen Shen, Inioluwa Deborah Raji, Irene Y. Chen
Keywords (EN): healthcare tasks, standard datasets, fundamentally dissimilar, machine learning, learning for healthcare
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: Machine Learning For Health Care 2024 (MLHC)

Abstract:In many machine learning for healthcare tasks, standard datasets are constructed by amassing data across many, often fundamentally dissimilar, sources. But when does adding more data help, and when does it hinder progress on desired model outcomes in real-world settings? We identify this situation as the *Data Addition Dilemma*, demonstrating that adding training data in this multi-source scaling context can at times result in reduced overall accuracy, uncertain fairness outcomes, and reduced worst-subgroup performance. We find that this possibly arises from an empirically observed trade-off between model performance improvements due to data scaling and model deterioration from distribution shift. We thus establish baseline strategies for navigating this dilemma, introducing distribution shift heuristics to guide decision-making on which data sources to add in data scaling, in order to yield the expected model performance improvements. We conclude with a discussion of the required considerations for data collection and suggestions for studying data composition and scale in the age of increasingly larger models.
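The abstract does not spell out the shift heuristics, so the sketch below illustrates the general idea with an invented rule: accept a candidate data source only when a crude divergence proxy (mean absolute gap between per-feature means) stays small relative to the relative sample gain. Both the proxy and the threshold are illustrative, not the paper's actual heuristics.

```python
def mean(xs):
    return sum(xs) / len(xs)

def shift_proxy(base, candidate):
    """Crude divergence proxy: mean absolute gap between feature means."""
    gaps = [abs(mean(b) - mean(c))
            for b, c in zip(zip(*base), zip(*candidate))]
    return mean(gaps)

def should_add(base, candidate, tol=0.5):
    gain = len(candidate) / len(base)        # relative sample gain
    return shift_proxy(base, candidate) <= tol * gain

base = [(0.0, 1.0), (0.2, 0.8), (0.1, 1.1)]
similar_source = [(0.1, 0.9), (0.0, 1.0)]    # small shift: accept
shifted_source = [(5.0, -3.0), (4.8, -2.9)]  # large shift: reject
```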

[AI-47] UNLEARN Efficient Removal of Knowledge in Large Language Models

Link: https://arxiv.org/abs/2408.04140
Authors: Tyler Lizzo, Larry Heck
Keywords (EN): dynamically forgetting specific, large language models, forgetting specific knowledge, private or proprietary, dynamically forgetting
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 11 pages, 2 Figures

Abstract:Given the prevalence of large language models (LLMs) and the prohibitive cost of training these models from scratch, dynamically forgetting specific knowledge e.g., private or proprietary, without retraining the model has become an important capability. This paper proposes a novel method to achieve this objective called UNLEARN. The approach builds upon subspace methods to identify and specifically target the removal of knowledge without adversely affecting other knowledge in the LLM. Results demonstrate 96% of targeted knowledge can be forgotten while maintaining performance on other knowledge within 2.5% of the original model, significantly outperforming the discriminatory abilities of the previous state-of-the-art. A dual method called LEARN is also proposed for targeted knowledge addition. Results show LEARN can match the fine-tuning accuracy of Low-Rank Adaptation (LoRA) without adversely affecting similar tasks.
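The subspace idea in the abstract can be illustrated with a minimal, hypothetical projection: remove from a weight vector its component along a direction assumed to encode the targeted knowledge, leaving the orthogonal components untouched. UNLEARN's actual subspace identification is more sophisticated; the direction here is simply given.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_out(w, d):
    """Remove from w its component along direction d (orthogonal projection)."""
    scale = dot(w, d) / dot(d, d)
    return [wi - scale * di for wi, di in zip(w, d)]

w = [3.0, 1.0, 2.0]
d = [1.0, 0.0, 0.0]     # hypothetical direction encoding the fact to forget
w_clean = project_out(w, d)
```

After the projection, `w_clean` has no component along `d` (the "forgotten" direction) while the remaining coordinates, standing in for unrelated knowledge, are unchanged.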

[AI-48] Enhancing Healthcare through Large Language Models : A Study on Medical Question Answering

Link: https://arxiv.org/abs/2408.04138
Authors: Haoran Yu, Chang Yu, Zihan Wang, Dongxian Zou, Hao Qin
Keywords (EN): Large Language Models, Large Language, shown significant promise, application of Large, Language Models
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: received by IEEE ICPICS

Abstract:In recent years, the application of Large Language Models (LLMs) in healthcare has shown significant promise in improving the accessibility and dissemination of medical knowledge. This paper presents a detailed study of various LLMs trained on the MedQuAD medical question-answering dataset, with a focus on identifying the most effective model for providing accurate medical information. Among the models tested, the Sentence-t5 combined with Mistral 7B demonstrated superior performance, achieving a precision score of 0.762. This model’s enhanced capabilities are attributed to its advanced pretraining techniques, robust architecture, and effective prompt construction methodologies. By leveraging these strengths, the Sentence-t5 + Mistral 7B model excels in understanding and generating precise medical answers. Our findings highlight the potential of integrating sophisticated LLMs in medical contexts to facilitate efficient and accurate medical knowledge retrieval, thus significantly enhancing patient education and support.

[AI-49] Can Rule-Based Insights Enhance LLMs for Radiology Report Classification? Introducing the RadPrompt Methodology ACL2024

Link: https://arxiv.org/abs/2408.04121
Authors: Panagiotis Fytas, Anna Breger, Ian Selby, Simon Baker, Shahab Shahipasand, Anna Korhonen
Keywords (EN): Developing imaging models, Developing imaging, chest X-rays, imaging models capable, capable of detecting
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at BioNLP, ACL 2024

Abstract:Developing imaging models capable of detecting pathologies from chest X-rays can be cost and time-prohibitive for large datasets as it requires supervision to attain state-of-the-art performance. Instead, labels extracted from radiology reports may serve as distant supervision since these are routinely generated as part of clinical practice. Despite their widespread use, current rule-based methods for label extraction rely on extensive rule sets that are limited in their robustness to syntactic variability. To alleviate these limitations, we introduce RadPert, a rule-based system that integrates an uncertainty-aware information schema with a streamlined set of rules, enhancing performance. Additionally, we have developed RadPrompt, a multi-turn prompting strategy that leverages RadPert to bolster the zero-shot predictive capabilities of large language models, achieving a statistically significant improvement in weighted average F1 score over GPT-4 Turbo. Most notably, RadPrompt surpasses both its underlying models, showcasing the synergistic potential of LLMs with rule-based models. We have evaluated our methods on two English Corpora: the MIMIC-CXR gold-standard test set and a gold-standard dataset collected from the Cambridge University Hospitals.

[AI-50] Patchview: LLM-Powered Worldbuilding with Generative Dust and Magnet Visualization

Link: https://arxiv.org/abs/2408.04112
Authors: John Joon Young Chung, Max Kreminski
Keywords (EN): Large language models, Large language, writers build story, writers build, generating world elements
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted to UIST2024

Abstract:Large language models (LLMs) can help writers build story worlds by generating world elements, such as factions, characters, and locations. However, making sense of many generated elements can be overwhelming. Moreover, if the user wants to precisely control aspects of generated elements that are difficult to specify verbally, prompting alone may be insufficient. We introduce Patchview, a customizable LLM-powered system that visually aids worldbuilding by allowing users to interact with story concepts and elements through the physical metaphor of magnets and dust. Elements in Patchview are visually dragged closer to concepts with high relevance, facilitating sensemaking. The user can also steer the generation with verbally elusive concepts by indicating the desired position of the element between concepts. When the user disagrees with the LLM’s visualization and generation, they can correct those by repositioning the element. These corrections can be used to align the LLM’s future behaviors to the user’s perception. With a user study, we show that Patchview supports the sensemaking of world elements and steering of element generation, facilitating exploration during the worldbuilding process. Patchview provides insights on how customizable visual representation can help sensemake, steer, and align generative AI model behaviors with the user’s intentions.
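The magnet-and-dust metaphor can be approximated as a relevance-weighted layout: each story element is pulled toward concept "magnets" in proportion to its relevance scores. The geometry below is purely illustrative of that interaction model, not Patchview's implementation.

```python
def place(relevance, magnets):
    """2D position = relevance-weighted mean of concept magnet positions."""
    total = sum(relevance.values())
    x = sum(r * magnets[c][0] for c, r in relevance.items()) / total
    y = sum(r * magnets[c][1] for c, r in relevance.items()) / total
    return (x, y)

magnets = {"war": (0.0, 0.0), "magic": (10.0, 0.0)}
pos = place({"war": 3.0, "magic": 1.0}, magnets)   # pulled 3x harder toward "war"
```

Dragging an element to a new position between magnets would invert this mapping, steering generation toward the implied relevance mix.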

[AI-51] Hardware-Assisted Virtualization of Neural Processing Units for Cloud Platforms MICRO’24

Link: https://arxiv.org/abs/2408.04104
Authors: Yuqi Xue, Yiqi Liu, Lifeng Nai, Jian Huang
Keywords (EN): Cloud platforms today, deploying hardware accelerators, neural processing units, NPU, powering machine learning
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Operating Systems (cs.OS)
Comments: Accepted to MICRO’24

Abstract:Cloud platforms today have been deploying hardware accelerators like neural processing units (NPUs) for powering machine learning (ML) inference services. To maximize the resource utilization while ensuring reasonable quality of service, a natural approach is to virtualize NPUs for efficient resource sharing for multi-tenant ML services. However, virtualizing NPUs for modern cloud platforms is not easy. This is not only due to the lack of system abstraction support for NPU hardware, but also due to the lack of architectural and ISA support for enabling fine-grained dynamic operator scheduling for virtualized NPUs. We present TCloud, a holistic NPU virtualization framework. We investigate virtualization techniques for NPUs across the entire software and hardware stack. TCloud consists of (1) a flexible NPU abstraction called vNPU, which enables fine-grained virtualization of the heterogeneous compute units in a physical NPU (pNPU); (2) a vNPU resource allocator that enables pay-as-you-go computing model and flexible vNPU-to-pNPU mappings for improved resource utilization and cost-effectiveness; (3) an ISA extension of modern NPU architecture for facilitating fine-grained tensor operator scheduling for multiple vNPUs. We implement TCloud based on a production-level NPU simulator. Our experiments show that TCloud improves the throughput of ML inference services by up to 1.4 \times and reduces the tail latency by up to 4.6 \times , while improving the NPU utilization by 1.2 \times on average, compared to state-of-the-art NPU sharing approaches.

[AI-52] ArtVLM: Attribute Recognition Through Vision-Based Prefix Language Modeling ECCV2024

Link: https://arxiv.org/abs/2408.04102
Authors: William Y. Zhu, Keren Ye, Junjie Ke, Jiahui Yu, Leonidas Guibas, Peyman Milanfar, Feng Yang
Keywords (EN): computer vision applications, Recognizing and disentangling, disentangling visual attributes, Visual Genome Attribute, visual attribute recognition
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at ECCV 2024

Abstract:Recognizing and disentangling visual attributes from objects is a foundation to many computer vision applications. While large vision language representations like CLIP had largely resolved the task of zero-shot object recognition, zero-shot visual attribute recognition remains a challenge because CLIP’s contrastively-learned vision-language representation cannot effectively capture object-attribute dependencies. In this paper, we target this weakness and propose a sentence generation-based retrieval formulation for attribute recognition that is novel in 1) explicitly modeling a to-be-measured and retrieved object-attribute relation as a conditional probability graph, which converts the recognition problem into a dependency-sensitive language-modeling problem, and 2) applying a large pretrained Vision-Language Model (VLM) on this reformulation and naturally distilling its knowledge of image-object-attribute relations to use towards attribute recognition. Specifically, for each attribute to be recognized on an image, we measure the visual-conditioned probability of generating a short sentence encoding the attribute’s relation to objects on the image. Unlike contrastive retrieval, which measures likelihood by globally aligning elements of the sentence to the image, generative retrieval is sensitive to the order and dependency of objects and attributes in the sentence. We demonstrate through experiments that generative retrieval consistently outperforms contrastive retrieval on two visual reasoning datasets, Visual Attribute in the Wild (VAW), and our newly-proposed Visual Genome Attribute Ranking (VGARank).

[AI-53] AEye: A Visualization Tool for Image Datasets IEEE-VIS2024

Link: https://arxiv.org/abs/2408.04072
Authors: Florian Grötschla, Luca A. Lanzendörfer, Marco Calzavara, Roger Wattenhofer
Keywords (EN): alongside architectural considerations, biases alongside architectural, influencing model capabilities, significantly influencing model, machine learning models
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at IEEE VIS 2024

Abstract:Image datasets serve as the foundation for machine learning models in computer vision, significantly influencing model capabilities, performance, and biases alongside architectural considerations. Therefore, understanding the composition and distribution of these datasets has become increasingly crucial. To address the need for intuitive exploration of these datasets, we propose AEye, an extensible and scalable visualization tool tailored to image datasets. AEye utilizes a contrastively trained model to embed images into semantically meaningful high-dimensional representations, facilitating data clustering and organization. To visualize the high-dimensional representations, we project them onto a two-dimensional plane and arrange images in layers so users can seamlessly navigate and explore them interactively. AEye facilitates semantic search functionalities for both text and image queries, enabling users to search for content. We open-source the codebase for AEye, and provide a simple configuration to add datasets.
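A toy version of the projection-and-layering idea: project high-dimensional embeddings onto a 2D plane and bucket points into zoom "layers" so that more images surface as the user zooms in. The fixed projection axes and the distance-based layering rule stand in for AEye's contrastively trained embeddings and actual layout logic.

```python
def project_2d(embedding, axis_x, axis_y):
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    return (dot(embedding, axis_x), dot(embedding, axis_y))

def assign_layers(points, per_layer=2):
    """Closer-to-origin points surface at shallower zoom layers."""
    order = sorted(range(len(points)),
                   key=lambda i: points[i][0] ** 2 + points[i][1] ** 2)
    return {idx: rank // per_layer for rank, idx in enumerate(order)}

embs = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [3.0, 3.0, 0.0], [0.5, 0.5, 0.0]]
pts = [project_2d(e, [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]) for e in embs]
layers = assign_layers(pts)
```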

[AI-54] Digital Avatars: Framework Development and Their Evaluation IJCAI2024

Link: https://arxiv.org/abs/2408.04068
Authors: Timothy Rupprecht, Sung-En Chang, Yushu Wu, Lei Lu, Enfu Nan, Chih-hsiang Li, Caiyue Lai, Zhimin Li, Zhijun Hu, Yumei He, David Kaeli, Yanzhi Wang
Keywords (EN): driven digital avatars, present Crowd Vote, prompting strategy, Crowd Vote, driven digital
Subjects: Artificial Intelligence (cs.AI)
Comments: This work was presented during the IJCAI 2024 conference proceedings for demonstrations

Abstract:We present a novel prompting strategy for artificial intelligence driven digital avatars. To better quantify how our prompting strategy affects anthropomorphic features like humor, authenticity, and favorability we present Crowd Vote - an adaptation of Crowd Score that allows for judges to elect a large language model (LLM) candidate over competitors answering the same or similar prompts. To visualize the responses of our LLM, and the effectiveness of our prompting strategy we propose an end-to-end framework for creating high-fidelity artificial intelligence (AI) driven digital avatars. This pipeline effectively captures an individual’s essence for interaction and our streaming algorithm delivers a high-quality digital avatar with real-time audio-video streaming from server to mobile device. Both our visualization tool, and our Crowd Vote metrics demonstrate our AI driven digital avatars have state-of-the-art humor, authenticity, and favorability outperforming all competitors and baselines. In the case of our Donald Trump and Joe Biden avatars, their authenticity and favorability are rated higher than even their real-world equivalents.

[AI-55] PowerPM: Foundation Model for Power Systems

Link: https://arxiv.org/abs/2408.04057
Authors: Shihao Tu, Yupeng Zhang, Jing Zhang, Yang Yang
Keywords (EN): including demand-side management, ETS data, ETS, electricity time series, consumer behavior analysis
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 23 pages, 5 figures, 8 tables

Abstract:The emergence of abundant electricity time series (ETS) data provides ample opportunities for various applications in the power systems, including demand-side management, grid stability, and consumer behavior analysis. Deep learning models have advanced ETS modeling by effectively capturing sequence dependence. Nevertheless, learning a generic representation of ETS data for various applications remains challenging due to the inherently complex hierarchical structure of ETS data. Moreover, ETS data exhibits intricate temporal dependencies and is susceptible to the influence of exogenous variables. Furthermore, different instances exhibit diverse electricity consumption behavior. In this paper, we propose a foundation model PowerPM to model ETS data, providing a large-scale, off-the-shelf model for power systems. PowerPM consists of a temporal encoder and a hierarchical encoder. The temporal encoder captures temporal dependencies in ETS data while accounting for exogenous variables. The hierarchical encoder models the correlations across hierarchy levels. Furthermore, PowerPM leverages a novel self-supervised pretraining framework consisting of masked ETS modeling and dual-view contrastive learning, which enables PowerPM to capture temporal dependencies within ETS windows and to be aware of discrepancies across ETS windows, providing two different perspectives to learn a generic representation. Our experiments involve five real-world scenario datasets, comprising private and public data. Through pre-training on massive ETS data, PowerPM achieves SOTA performance on diverse downstream tasks within the private dataset. Impressively, when transferred to the public datasets, PowerPM maintains its superiority, showcasing its remarkable generalization ability across various tasks and domains. Moreover, ablation studies and few-shot experiments provide additional evidence of the effectiveness of our model.
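The masked-ETS pretraining objective can be sketched with a trivial mean-imputation "model" standing in for PowerPM's temporal and hierarchical encoders: hide a span of a load series, impute it, and score the reconstruction. All names and the loss here are illustrative.

```python
def mask_span(series, start, length):
    masked = list(series)
    for i in range(start, start + length):
        masked[i] = None                      # hidden targets to reconstruct
    return masked

def impute_mean(masked):
    """Trivial 'model': fill hidden points with the observed mean."""
    observed = [x for x in masked if x is not None]
    fill = sum(observed) / len(observed)
    return [fill if x is None else x for x in masked]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
recon = impute_mean(mask_span(series, 2, 2))
loss = mse(recon, series)
```

Pretraining would minimize this reconstruction loss over many masked windows, alongside the contrastive objective across windows.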

[AI-56] NAVINACT: Combining Navigation and Imitation Learning for Bootstrapping Reinforcement Learning

Link: https://arxiv.org/abs/2408.04054
Authors: Amisha Bhaskar, Zahiruddin Mahammad, Sachin R Jadhav, Pratap Tokekar
Keywords (EN): shown remarkable progress, remains limited due, robotic tasks remains, tasks remains limited, shown remarkable
Subjects: Artificial Intelligence (cs.AI)
Comments: 16 pages, 10 figures

Abstract:Reinforcement Learning (RL) has shown remarkable progress in simulation environments, yet its application to real-world robotic tasks remains limited due to challenges in exploration and generalisation. To address these issues, we introduce NAVINACT, a framework that chooses when the robot should use classical motion planning-based navigation and when it should learn a policy. To further improve the efficiency in exploration, we use imitation data to bootstrap the exploration. NAVINACT dynamically switches between two modes of operation: navigating to a waypoint using classical techniques when away from the objects and reinforcement learning for fine-grained manipulation control when about to interact with objects. NAVINACT consists of a multi-head architecture composed of ModeNet for mode classification, NavNet for waypoint prediction, and InteractNet for precise manipulation. By combining the strengths of RL and Imitation Learning (IL), NAVINACT improves sample efficiency and mitigates distribution shift, ensuring robust task execution. We evaluate our approach across multiple challenging simulation environments and real-world tasks, demonstrating superior performance in terms of adaptability, efficiency, and generalization compared to existing methods. In both simulated and real-world settings, NAVINACT demonstrates robust performance. In simulations, NAVINACT surpasses baseline methods by 10-15% in training success rates at 30k samples and by 30-40% during evaluation phases. In real-world scenarios, it demonstrates a 30-40% higher success rate on simpler tasks compared to baselines and uniquely succeeds in complex, two-stage manipulation tasks. Datasets and supplementary materials can be found on our website: this https URL. 
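The navigate/manipulate switching can be sketched as a distance-threshold rule: classical waypoint navigation while far from the object, a learned policy for fine control up close. Distances are in centimetres; the threshold and step sizes are invented stand-ins for ModeNet/NavNet/InteractNet.

```python
def select_mode(dist_cm, switch_radius_cm=50):
    """Classical navigation far away, learned manipulation up close."""
    return "navigate" if dist_cm > switch_radius_cm else "rl_manipulate"

def step(dist_cm):
    mode = select_mode(dist_cm)
    if mode == "navigate":
        return mode, max(dist_cm - 30, 0)    # coarse waypoint progress
    return mode, max(dist_cm - 5, 0)         # fine manipulation progress

trace, d = [], 120
while d > 0:
    mode, d = step(d)
    trace.append(mode)
```

The episode starts with a few large planner steps and finishes with many small policy-controlled steps, mirroring the two operating modes.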

[AI-57] Learning Rate-Free Reinforcement Learning: A Case for Model Selection with Non-Stationary Objectives

Link: https://arxiv.org/abs/2408.04046
Authors: Aida Afshar, Aldo Pacchiano
Keywords (EN): learning rate, learning, reinforcement learning, model selection, Rate-Free Reinforcement Learning
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: RLC 2024 Workshop on Failure Modes of Sequential Decision-Making in Practice

Abstract:The performance of reinforcement learning (RL) algorithms is sensitive to the choice of hyperparameters, with the learning rate being particularly influential. RL algorithms fail to reach convergence or demand an extensive number of samples when the learning rate is not optimally set. In this work, we show that model selection can help to improve the failure modes of RL that are due to suboptimal choices of learning rate. We present a model selection framework for Learning Rate-Free Reinforcement Learning that employs model selection methods to select the optimal learning rate on the fly. This approach of adaptive learning rate tuning neither depends on the underlying RL algorithm nor the optimizer and solely uses the reward feedback to select the learning rate; hence, the framework can input any RL algorithm and produce a learning rate-free version of it. We conduct experiments for policy optimization methods and evaluate various model selection strategies within our framework. Our results indicate that data-driven model selection algorithms are better alternatives to standard bandit algorithms when the optimal choice of hyperparameter is time-dependent and non-stationary.
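A toy version of reward-feedback learning-rate selection: treat each candidate rate as a bandit arm, estimate its mean return, and commit to the best. The deterministic reward function below is invented for illustration; the paper's model-selection methods additionally handle noise and non-stationary objectives.

```python
def reward_for(lr):
    """Stand-in for 'train briefly with lr, observe the return'."""
    return 1.0 - 50 * abs(lr - 0.01)         # invented: peaks at lr = 0.01

def select_learning_rate(candidates, pulls=3):
    means = {}
    for lr in candidates:                    # pull every arm a few times
        rewards = [reward_for(lr) for _ in range(pulls)]
        means[lr] = sum(rewards) / len(rewards)
    return max(means, key=means.get)

best = select_learning_rate([0.1, 0.01, 0.001])
```

Because only the reward signal is consulted, the same wrapper could sit on top of any RL algorithm and optimizer, which is the point of the learning-rate-free framing.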

[AI-58] Multimodal Gender Fairness in Depression Prediction: Insights on Data from the USA & China

Link: https://arxiv.org/abs/2408.04026
Authors: Joseph Cameron, Jiaee Cheong, Micol Spitale, Hatice Gunes
Keywords (EN): Social agents, agents and robots, wellbeing settings, robots typically rely, individual mental wellbeing
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 9 Pages, 7 Tables. To be published and indexed in the IEEE Xplore Digital Library under the ACII 2024 Workshop Proceedings

Abstract:Social agents and robots are increasingly being used in wellbeing settings. However, a key challenge is that these agents and robots typically rely on machine learning (ML) algorithms to detect and analyse an individual’s mental wellbeing. The problem of bias and fairness in ML algorithms is becoming an increasingly greater source of concern. In concurrence, existing literature has also indicated that mental health conditions can manifest differently across genders and cultures. We hypothesise that the representation of features (acoustic, textual, and visual) and their inter-modal relations would vary among subjects from different cultures and genders, thus impacting the performance and fairness of various ML models. We present the very first evaluation of multimodal gender fairness in depression manifestation by undertaking a study on two different datasets from the USA and China. We undertake thorough statistical and ML experimentation and repeat the experiments for several different algorithms to ensure that the results are not algorithm-dependent. Our findings indicate that though there are differences between both datasets, it is not conclusive whether this is due to the difference in depression manifestation as hypothesised or other external factors such as differences in data collection methodology. Our findings further motivate a call for a more consistent and culturally aware data collection process in order to address the problem of ML bias in depression detection and to promote the development of fairer agents and robots for wellbeing.

[AI-59] Improving Large Language Model (LLM) fidelity through context-aware grounding: A systematic approach to reliability and veracity

Link: https://arxiv.org/abs/2408.04023
Authors: Wrick Talukdar, Anjanava Biswas
Keywords (EN): Large Language Models, natural language processing, ensuring their robustness, Large Language, critical challenge
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 14 pages

Abstract:As Large Language Models (LLMs) become increasingly sophisticated and ubiquitous in natural language processing (NLP) applications, ensuring their robustness, trustworthiness, and alignment with human values has become a critical challenge. This paper presents a novel framework for contextual grounding in textual models, with a particular emphasis on the Context Representation stage. Our approach aims to enhance the reliability and ethical alignment of these models through a comprehensive, context-aware methodology. By explicitly capturing and representing relevant situational, cultural, and ethical contexts in a machine-readable format, we lay the foundation for anchoring a model’s behavior within these contexts. Our approach leverages techniques from knowledge representation and reasoning, such as ontologies, semantic web technologies, and logic-based formalisms. We evaluate our framework on real-world textual datasets, demonstrating its effectiveness in improving model performance, fairness, and alignment with human expectations, while maintaining high accuracy. Furthermore, we discuss the other key components of the framework, including context-aware encoding, context-aware learning, interpretability and explainability, and continuous monitoring and adaptation. This research contributes to the growing body of work on responsible AI, offering a practical approach to developing more reliable, trustworthy, and ethically-aligned language models. Our findings have significant implications for the deployment of LLMs in sensitive domains such as healthcare, legal systems, and social services, where contextual understanding is paramount.

[AI-60] Learning from Noisy Labels for Long-tailed Data via Optimal Transport

Link: https://arxiv.org/abs/2408.03977
Authors: Mengting Li, Chuang Zhu
Keywords (EN): deep learning models, Noisy labels, deep learning, significantly impair, long-tailed
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Noisy labels, which are common in real-world datasets, can significantly impair the training of deep learning models. However, recent adversarial noise-combating methods overlook the long-tailed distribution of real data, which can significantly harm the effect of denoising strategies. Meanwhile, the mismanagement of noisy labels further compromises the model’s ability to handle long-tailed data. To tackle this issue, we propose a novel approach to manage data characterized by both long-tailed distributions and noisy labels. First, we introduce a loss-distance cross-selection module, which integrates class predictions and feature distributions to filter clean samples, effectively addressing uncertainties introduced by noisy labels and long-tailed distributions. Subsequently, we employ optimal transport strategies to generate pseudo-labels for the noise set in a semi-supervised training manner, enhancing pseudo-label quality while mitigating the effects of sample scarcity caused by the long-tailed distribution. We conduct experiments on both synthetic and real-world datasets, and the comprehensive experimental results demonstrate that our method surpasses current state-of-the-art methods. Our code will be available in the future.
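The optimal-transport pseudo-labeling step can be sketched with plain Sinkhorn iterations that push predicted class marginals toward a chosen (e.g. tail-corrected) prior. The scores, prior, and iteration count below are invented; the paper's formulation is more elaborate.

```python
import math

def sinkhorn_pseudo_labels(scores, class_prior, iters=50):
    """Alternate row/column normalization of exp(scores)."""
    P = [[math.exp(s) for s in row] for row in scores]
    n = len(P)
    for _ in range(iters):
        for i in range(n):                   # each sample emits 1 unit of mass
            z = sum(P[i])
            P[i] = [p / z for p in P[i]]
        for j in range(len(class_prior)):    # columns match the target prior
            z = sum(P[i][j] for i in range(n))
            for i in range(n):
                P[i][j] *= n * class_prior[j] / z
    return P

scores = [[2.0, 0.1], [1.5, 0.2], [0.1, 2.5]]
P = sinkhorn_pseudo_labels(scores, class_prior=[0.5, 0.5])
```

Taking the row-wise argmax of `P` yields pseudo-labels whose class frequencies respect the prior, counteracting the head-class bias a long-tailed noise set would otherwise induce.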

[AI-61] Enhancing Output Diversity Improves Conjugate Gradient-based Adversarial Attacks ICPR

Link: https://arxiv.org/abs/2408.03972
Authors: Keiichiro Yamamura, Issa Oe, Hiroki Ishikura, Katsuki Fujisawa
Keywords (EN): Deep neural networks, Deep neural, generate adversarial, consecutive search points, Auto Conjugate Gradient
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: ICPRAI2024

Abstract:Deep neural networks are vulnerable to adversarial examples, and adversarial attacks that generate adversarial examples have been studied in this context. Existing studies imply that increasing the diversity of model outputs contributes to improving the attack performance. This study focuses on the Auto Conjugate Gradient (ACG) attack, which is inspired by the conjugate gradient method and has a high diversification performance. We hypothesized that increasing the distance between two consecutive search points would enhance the output diversity. To test our hypothesis, we propose Rescaling-ACG (ReACG), which automatically modifies the two components that significantly affect the distance between two consecutive search points, including the search direction and step size. ReACG showed higher attack performance than that of ACG, and is particularly effective for ImageNet models with several classification classes. Experimental results show that the distance between two consecutive search points enhances the output diversity and may help develop new potent attacks. The code is available at this https URL
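The conjugate-gradient flavour of ACG, plus a ReACG-style rescaling of the search direction, can be sketched on a toy quadratic "loss" to be maximized. This is not a real adversarial objective, and the Fletcher-Reeves coefficient and rescale factor are illustrative choices, not the papers' exact update rules.

```python
def f(x):
    # toy objective to maximize (stands in for the attack loss)
    return -(x[0] ** 2 + 10 * x[1] ** 2)

def grad(x):
    return [-2 * x[0], -20 * x[1]]

def cg_attack(x, steps=5, eta=0.04, rescale=1.5):
    g_prev, d = None, None
    for _ in range(steps):
        g = grad(x)
        if d is None:
            d = g                            # first step: plain gradient
        else:
            # Fletcher-Reeves conjugate coefficient
            beta = sum(gi * gi for gi in g) / sum(gi * gi for gi in g_prev)
            # rescaling enlarges the gap between consecutive search points
            d = [rescale * (gi + beta * di) for gi, di in zip(g, d)]
        x = [xi + eta * di for xi, di in zip(x, d)]
        g_prev = g
    return x

x0 = [1.0, 1.0]
x_adv = cg_attack(x0)
```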

[AI-62] Telecom Foundation Models: Applications, Challenges and Future Trends

Link: https://arxiv.org/abs/2408.03964
Authors: Tahar Zanouda, Meysam Masoudi, Fitsum Gaim Gebre, Mischa Dohler
Keywords (EN): increasingly complex, Telecom, Telecom networks, diversified deployment scenarios, telecom network ecosystem
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Telecom networks are becoming increasingly complex, with diversified deployment scenarios, multi-standards, and multi-vendor support. The intricate nature of the telecom network ecosystem presents challenges to effectively manage, operate, and optimize networks. To address these hurdles, Artificial Intelligence (AI) has been widely adopted to solve different tasks in telecom networks. However, these conventional AI models are often designed for specific tasks, rely on extensive and costly-to-collect labeled data that require specialized telecom expertise for development and maintenance. The AI models usually fail to generalize and support diverse deployment scenarios and applications. In contrast, Foundation Models (FMs) show effective generalization capabilities in various domains in language, vision, and decision-making tasks. FMs can be trained on multiple data modalities generated from the telecom ecosystem and leverage specialized domain knowledge. Moreover, FMs can be fine-tuned to solve numerous specialized tasks with minimal task-specific labeled data and, in some instances, are able to leverage context to solve previously unseen problems. At the dawn of 6G, this paper investigates the potential opportunities of using FMs to shape the future of telecom technologies and standards. In particular, the paper outlines a conceptual process for developing Telecom FMs (TFMs) and discusses emerging opportunities for orchestrating specialized TFMs for network configuration, operation, and maintenance. Finally, the paper discusses the limitations and challenges of developing and deploying TFMs.

[AI-63] A self-adaptive system of systems architecture to enable its ad-hoc scalability: Unmanned Vehicle Fleet – Mission Control Center Case study

Link: https://arxiv.org/abs/2408.03963
Authors: Ahmed R. Sadik(Honda Research Institute Europe, Offenbach am Main, Germany),Bram Bolder(Honda Research Institute Europe, Offenbach am Main, Germany),Pero Subasic(Honda Research Institute USA, CA, United States)
Keywords: comprises Constituent Systems, provide unique capabilities, comprises Constituent, Constituent Systems, Unmanned Vehicle Fleet
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Software Engineering (cs.SE); Systems and Control (eess.SY)
Comments: 2023 7th International Conference on Intelligent Systems, Metaheuristics & Swarm Intelligence (ISMSI 2023)

Abstract:A System of Systems (SoS) comprises Constituent Systems (CSs) that interact to provide unique capabilities beyond any single CS. A key challenge in SoS is ad-hoc scalability, meaning the system size changes during operation by adding or removing CSs. This research focuses on an Unmanned Vehicle Fleet (UVF) as a practical SoS example, addressing uncertainties like mission changes, range extensions, and UV failures. The proposed solution involves a self-adaptive system that dynamically adjusts UVF architecture, allowing the Mission Control Center (MCC) to scale UVF size automatically based on performance criteria or manually by operator decision. A multi-agent environment and rule management engine were implemented to simulate and verify this approach.
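The scale-up/scale-down decision described above can be sketched as a simple threshold rule. All names and threshold values here are illustrative assumptions for the MCC's automatic mode, not the paper's implementation:

```python
# Hypothetical sketch of rule-driven ad-hoc scaling: the MCC compares a
# fleet-level performance metric against thresholds and decides whether
# to add or remove constituent unmanned vehicles (UVs).

def scaling_decision(coverage_ratio: float, healthy_uvs: int,
                     min_uvs: int = 2, low: float = 0.7, high: float = 0.95) -> str:
    """Return 'add', 'remove', or 'hold' for the fleet size."""
    if healthy_uvs < min_uvs or coverage_ratio < low:
        return "add"       # degraded or under-performing fleet: scale up
    if coverage_ratio > high and healthy_uvs > min_uvs:
        return "remove"    # over-provisioned: scale down to save resources
    return "hold"

print(scaling_decision(0.6, 3))   # low coverage -> "add"
print(scaling_decision(0.99, 5))  # over-provisioned -> "remove"
```

In the paper's terms, the same rule engine could also be bypassed by a manual operator decision; this sketch only covers the automatic path.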

[AI-64] EcoFollower: An Environment-Friendly Car Following Model Considering Fuel Consumption

Link: https://arxiv.org/abs/2408.03950
Authors: Hui Zhong,Xianda Chen,PakHin Tiu,Hongliang Lu,Meixin Zhu
Keywords: alleviate energy shortages, environmental impacts caused, study introduces EcoFollower, Intelligent Driver Model, well-established Intelligent Driver
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:To alleviate energy shortages and environmental impacts caused by transportation, this study introduces EcoFollower, a novel eco-car-following model developed using reinforcement learning (RL) to optimize fuel consumption in car-following scenarios. Employing the NGSIM datasets, the performance of EcoFollower was assessed in comparison with the well-established Intelligent Driver Model (IDM). The findings demonstrate that EcoFollower excels in simulating realistic driving behaviors, maintaining smooth vehicle operations, and closely matching the ground truth metrics of time-to-collision (TTC), headway, and comfort. Notably, the model achieved a significant reduction in fuel consumption, lowering it by 10.42% compared to actual driving scenarios. These results underscore the capability of RL-based models like EcoFollower to enhance autonomous vehicle algorithms, promoting safer and more energy-efficient driving strategies.
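For reference, the Intelligent Driver Model (IDM) baseline that EcoFollower is compared against follows a standard closed-form acceleration rule. A minimal sketch with typical textbook parameter values (not the values used in the paper):

```python
import math

# Standard IDM car-following acceleration:
#   a = a_max * (1 - (v/v0)^delta - (s*/s)^2)
#   s* = s0 + v*T + v*dv / (2*sqrt(a_max*b))
def idm_acceleration(v, dv, s, v0=33.3, T=1.5, a_max=1.0, b=2.0, s0=2.0, delta=4):
    """v: ego speed [m/s]; dv: approach rate v_ego - v_lead [m/s]; s: gap [m]."""
    s_star = s0 + v * T + v * dv / (2 * math.sqrt(a_max * b))
    return a_max * (1 - (v / v0) ** delta - (s_star / s) ** 2)

# Closing in fast on a short gap -> strong braking (negative acceleration)
print(idm_acceleration(v=20.0, dv=5.0, s=15.0))
# Slow and far behind the leader -> gentle positive acceleration
print(idm_acceleration(v=10.0, dv=0.0, s=100.0))
```

EcoFollower replaces this hand-crafted rule with an RL policy trained on NGSIM trajectories and a fuel-aware reward.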

[AI-65] A Survey of AI Reliance

Link: https://arxiv.org/abs/2408.03948
Authors: Sven Eckhardt,Niklas Kühl,Mateusz Dolata,Gerhard Schwabe
Keywords: Artificial intelligence, modern technology, indispensable component, component of modern, reliance
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Abstract:Artificial intelligence (AI) systems have become an indispensable component of modern technology. However, research on human behavioral responses is lagging behind, i.e., the research into human reliance on AI advice (AI reliance). Current shortcomings in the literature include the unclear influences on AI reliance, lack of external validity, conflicting approaches to measuring reliance, and disregard for a change in reliance over time. Promising avenues for future research include reliance on generative AI output and reliance in multi-user situations. In conclusion, we present a morphological box that serves as a guide for research on AI reliance.

[AI-66] Prompting for products: Investigating design space exploration strategies for text-to-image generative models

Link: https://arxiv.org/abs/2408.03946
Authors: Leah Chong,I-Ping Lo,Jude Rayan,Steven Dow,Faez Ahmed,Ioanna Lykourentzou
Keywords: rapidly generating images, enabling efficient design, product design, rapidly generating, design
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 12 pages, 7 figures

Abstract:Text-to-image models are enabling efficient design space exploration, rapidly generating images from text prompts. However, many generative AI tools are imperfect for product design applications as they are not built for the goals and requirements of product design. The unclear link between text input and image output further complicates their application. This work empirically investigates design space exploration strategies that can successfully yield product images that are feasible, novel, and aesthetic, which are three common goals in product design. Specifically, user actions within the global and local editing modes, including their time spent, prompt length, mono vs. multi-criteria prompts, and goal orientation of prompts, are analyzed. Key findings reveal the pivotal role of mono vs. multi-criteria and goal orientation of prompts in achieving specific design goals over time and prompt length. The study recommends prioritizing the use of multi-criteria prompts for feasibility and novelty during global editing, while favoring mono-criteria prompts for aesthetics during local editing. Overall, this paper underscores the nuanced relationship between the AI-driven text-to-image models and their effectiveness in product design, urging designers to carefully structure prompts during different editing modes to better meet the unique demands of product design.

[AI-67] Impacts of Anthropomorphizing Large Language Models in Learning Environments

Link: https://arxiv.org/abs/2408.03945
Authors: Kristina Schaaff,Marc-André Heidelmann
Keywords: Large Language Models, Large Language, Language Models, learning environments, support teaching-be
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments: Presented at Affective Computing Pre-Conference at ISRE 2024

Abstract:Large Language Models (LLMs) are increasingly being used in learning environments to support teaching, be it as learning companions or as tutors. With our contribution, we aim to discuss the implications of the anthropomorphization of LLMs in learning environments on educational theory to build a foundation for more effective learning outcomes and understand their emotional impact on learners. According to the media equation, people tend to respond to media in the same way as they would respond to another person. A study conducted by the Georgia Institute of Technology showed that chatbots can be successfully implemented in learning environments. In this study, learners in selected online courses were unable to distinguish the chatbot from a “real” teacher. As LLM-based chatbots such as OpenAI’s GPT series are increasingly used in educational tools, it is important to understand how the attribution processes to LLM-based chatbots in terms of anthropomorphization affect learners’ emotions.

[AI-68] Building Machines that Learn and Think with People

Link: https://arxiv.org/abs/2408.03943
Authors: Katherine M. Collins,Ilia Sucholutsky,Umang Bhatt,Kartik Chandra,Lionel Wong,Mina Lee,Cedegao E. Zhang,Tan Zhi-Xuan,Mark Ho,Vikash Mansinghka,Adrian Weller,Joshua B. Tenenbaum,Thomas L. Griffiths
Keywords: thought, partners, machine intelligence, systems, thought partners
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:What do we want from machine intelligence? We envision machines that are not just tools for thought, but partners in thought: reasonable, insightful, knowledgeable, reliable, and trustworthy systems that think with us. Current artificial intelligence (AI) systems satisfy some of these criteria, some of the time. In this Perspective, we show how the science of collaborative cognition can be put to work to engineer systems that really can be called "thought partners," systems built to meet our expectations and complement our limitations. We lay out several modes of collaborative thought in which humans and AI thought partners can engage and propose desiderata for human-compatible thought partnerships. Drawing on motifs from computational cognitive science, we motivate an alternative scaling path for the design of thought partners and ecosystems around their use through a Bayesian lens, whereby the partners we construct actively build and reason over models of the human and world.

[AI-69] HOAA: Hybrid Overestimating Approximate Adder for Enhanced Performance Processing Engine

Link: https://arxiv.org/abs/2408.00806
Authors: Omkar Kokane,Prabhat Sati,Mukul Lokhande,Santosh Kumar Vishvakarma
Keywords: Hybrid Overestimating Approximate, Overestimating Approximate Adder, Hybrid Overestimating, Approximate Adder designed, presents the Hybrid
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This paper presents the Hybrid Overestimating Approximate Adder designed to enhance the performance in processing engines, specifically focused on edge AI applications. A novel Plus One Adder design is proposed as an incremental adder in the RCA chain, incorporating a Full Adder with an excess 1 alongside inputs A, B, and Cin. The design approximates outputs to 2-bit values to reduce hardware complexity and improve resource efficiency. The Plus One Adder is integrated into a dynamically reconfigurable HOAA, allowing runtime interchangeability between accurate and approximate overestimation modes. The proposed design is demonstrated for multiple applications, such as Two's complement subtraction and Rounding to even, and the Configurable Activation function, which are critical components of the Processing engine. Our approach shows 21 percent improvement in area efficiency and 33 percent reduction in power consumption, compared to state-of-the-art designs with minimal accuracy loss. Thus, the proposed HOAA could be a promising solution for resource-constrained environments, offering ideal trade-offs between hardware efficiency vs computational accuracy.
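The "excess 1" idea is easiest to see in software: two's-complement subtraction over n bits is a + (~b) + 1, so an adder with a built-in +1 absorbs the extra increment that a plain ripple-carry chain would otherwise need a separate step for. A plain-arithmetic sketch, not the proposed hardware design:

```python
# Two's-complement subtraction expressed as one addition with an
# "excess 1" carry-in, the operation a Plus One Adder stage folds in.

def twos_complement_sub(a: int, b: int, n: int = 8) -> int:
    mask = (1 << n) - 1
    # invert b, then add with the extra carry-in of 1 in a single addition
    return (a + ((~b) & mask) + 1) & mask

print(twos_complement_sub(200, 55))  # 145
print(twos_complement_sub(5, 9))     # 252, i.e. -4 mod 256
```

The HOAA itself additionally approximates low-order output bits; that approximation is omitted here.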

[AI-70] A Comparative Visual Analytics Framework for Evaluating Evolutionary Processes in Multi-objective Optimization IEEE-VIS2023

Link: https://arxiv.org/abs/2308.05640
Authors: Yansong Huang,Zherui Zhang,Ao Jiao,Yuxin Ma,Ran Cheng
Keywords: solving multi-criteria decision-making, multi-criteria decision-making problems, EMO algorithms, effective in solving, solving multi-criteria
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Accepted by IEEE VIS 2023 (will appear in IEEE TVCG)

Abstract:Evolutionary multi-objective optimization (EMO) algorithms have been demonstrated to be effective in solving multi-criteria decision-making problems. In real-world applications, analysts often employ several algorithms concurrently and compare their solution sets to gain insight into the characteristics of different algorithms and explore a broader range of feasible solutions. However, EMO algorithms are typically treated as black boxes, leading to difficulties in performing detailed analysis and comparisons between the internal evolutionary processes. Inspired by the successful application of visual analytics tools in explainable AI, we argue that interactive visualization can significantly enhance the comparative analysis between multiple EMO algorithms. In this paper, we present a visual analytics framework that enables the exploration and comparison of evolutionary processes in EMO algorithms. Guided by a literature review and expert interviews, the proposed framework addresses various analytical tasks and establishes a multi-faceted visualization design to support the comparative analysis of intermediate generations in the evolution as well as solution sets. We demonstrate the effectiveness of our framework through case studies on benchmarking and real-world multi-objective optimization problems to elucidate how analysts can leverage our framework to inspect and compare diverse algorithms.

[AI-71] Inference with the Upper Confidence Bound Algorithm

Link: https://arxiv.org/abs/2408.04595
Authors: Koulik Khamaru,Cun-Hui Zhang
Keywords: Upper Confidence Bound, Confidence Bound, Upper Confidence, multiarmed bandit problems, downstream inferential tasks
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY); Statistics Theory (math.ST)
Comments: 17 pages, 1 figure

Abstract:In this paper, we discuss the asymptotic behavior of the Upper Confidence Bound (UCB) algorithm in the context of multiarmed bandit problems and discuss its implication in downstream inferential tasks. While inferential tasks become challenging when data is collected in a sequential manner, we argue that this problem can be alleviated when the sequential algorithm at hand satisfies a certain stability property. This notion of stability is motivated by the seminal work of Lai and Wei (1982). Our first main result shows that such a stability property is always satisfied for the UCB algorithm, and as a result the sample means for each arm are asymptotically normal. Next, we examine the stability properties of the UCB algorithm when the number of arms K is allowed to grow with the number of arm pulls T. We show that in such a case the arms are stable when \frac{\log K}{\log T} \rightarrow 0, and the number of near-optimal arms is large.
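For readers unfamiliar with UCB, the index it maximizes at each round has the standard UCB1 form, mean reward plus an exploration bonus sqrt(2 ln t / pulls). A minimal sketch (the bonus constant varies across the literature and is not necessarily the paper's variant):

```python
import math

# UCB1 index: empirical mean plus an exploration bonus that shrinks
# as an arm accumulates pulls.
def ucb_index(mean: float, pulls: int, t: int) -> float:
    return mean + math.sqrt(2 * math.log(t) / pulls)

def select_arm(means, pulls, t):
    """Pick the arm with the largest UCB index (unpulled arms first)."""
    for i, n in enumerate(pulls):
        if n == 0:
            return i
    scores = [ucb_index(m, n, t) for m, n in zip(means, pulls)]
    return max(range(len(scores)), key=scores.__getitem__)

# A rarely pulled arm gets a large bonus and is selected despite its
# lower empirical mean:
print(select_arm(means=[0.9, 0.5], pulls=[1000, 2], t=1002))  # 1
```

The paper's stability result concerns the sample means this rule produces, showing they remain asymptotically normal despite the adaptive data collection.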

[AI-72] Synchronous Multi-modal Semantic Communication System with Packet-level Coding

Link: https://arxiv.org/abs/2408.04535
Authors: Yun Tian,Jingkai Ying,Zhijin Qin,Ye Jin,Xiaoming Tao
Keywords: joint semantic-channel coding, forward error correction, semantic-channel coding design, physical layer channels, packet-level forward error
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI)
Comments: 12 pages, 9 figures

Abstract:Although the semantic communication with joint semantic-channel coding design has shown promising performance in transmitting data of different modalities over physical layer channels, the synchronization and packet-level forward error correction of multimodal semantics have not been well studied. Due to the independent design of semantic encoders, synchronizing multimodal features in both the semantic and time domains is a challenging problem. In this paper, we take the facial video and speech transmission as an example and propose a Synchronous Multimodal Semantic Communication System (SyncSC) with Packet-Level Coding. To achieve semantic and time synchronization, 3D Morphable Model (3DMM) coefficients and text are transmitted as semantics, and we propose a semantic codec that achieves similar quality of reconstruction and synchronization with lower bandwidth, compared to traditional methods. To protect semantic packets under the erasure channel, we propose a packet-level Forward Error Correction (FEC) method, called PacSC, that maintains a certain visual quality performance even at high packet loss rates. Particularly, for text packets, a text packet loss concealment module, called TextPC, based on Bidirectional Encoder Representations from Transformers (BERT) is proposed, which significantly improves the performance of traditional FEC methods. The simulation results show that our proposed SyncSC reduces transmission overhead and achieves high-quality synchronous transmission of video and speech over the packet loss network.
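Packet-level erasure protection can be illustrated with the simplest possible code: a single XOR parity packet that recovers any one lost packet. This is a generic sketch of the idea, not the proposed PacSC scheme:

```python
# XOR parity over equal-length packets: on an erasure channel, the
# receiver knows WHICH packet is missing and rebuilds it by XOR-ing
# the parity with every surviving packet.

def make_parity(packets):
    parity = bytearray(len(packets[0]))
    for p in packets:
        for i, byte in enumerate(p):
            parity[i] ^= byte
    return bytes(parity)

def recover(received, parity):
    """Rebuild the single missing packet from survivors + parity."""
    missing = bytearray(parity)
    for p in received.values():
        for i, byte in enumerate(p):
            missing[i] ^= byte
    return bytes(missing)

pkts = [b"3DMM", b"text", b"sync"]
par = make_parity(pkts)
# packet 1 is lost on the erasure channel
print(recover({0: pkts[0], 2: pkts[2]}, par))  # b'text'
```

Real FEC schemes (and PacSC) use stronger codes that tolerate multiple losses; TextPC goes further by concealing unrecoverable text packets with a BERT-based language model.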

[AI-73] Statistical Framework for Clustering MU-MIMO Wireless via Second Order Statistics

Link: https://arxiv.org/abs/2408.04484
Authors: Roberto Pereira,Xavier Mestre
Keywords: positive definite matrices, channel covariance matrices, sample covariance matrices, covariance matrices, Riemannian manifold
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
Comments:

Abstract:This work explores the clustering of wireless users by examining the distances between their channel covariance matrices, which reside on the Riemannian manifold of positive definite matrices. Specifically, we consider an estimator of the Log-Euclidean distance between multiple sample covariance matrices (SCMs) that is consistent when the number of samples and the observation size grow unbounded at the same rate. Within the context of multi-user MIMO (MU-MIMO) wireless communication systems, we develop a statistical framework that allows accurate predictions of the clustering algorithm’s performance under realistic conditions. Specifically, we present a central limit theorem that establishes the asymptotic Gaussianity of the consistent estimator of the log-Euclidean distance computed over two sample covariance matrices.
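The log-Euclidean distance in question is d(A, B) = ||log A - log B||_F, with log the matrix logarithm. For diagonal SPD matrices the matrix logarithm reduces to an elementwise log, which allows a dependency-free sketch (the general case needs an eigendecomposition):

```python
import math

# Log-Euclidean distance between SPD matrices, specialized to the
# diagonal case where log(diag(a_1,...,a_n)) = diag(log a_1,...,log a_n).
def log_euclidean_diag(a, b):
    return math.sqrt(sum((math.log(x) - math.log(y)) ** 2 for x, y in zip(a, b)))

print(log_euclidean_diag([1.0, 4.0], [1.0, 4.0]))        # 0.0
print(log_euclidean_diag([1.0, 1.0], [math.e, math.e]))  # ~1.414 (sqrt(2))
```

The paper's contribution is not this distance itself but a consistent estimator of it from sample covariance matrices, together with a central limit theorem for that estimator.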

[AI-74] Optimal Layout-Aware CNOT Circuit Synthesis with Qubit Permutation

Link: https://arxiv.org/abs/2408.04349
Authors: Irfansha Shaik,Jaco van de Pol
Keywords: CNOT optimization plays, Quantum Circuits, CNOT, plays a significant, significant role
Subjects: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
Comments: 9 pages, 12 tables

Abstract:CNOT optimization plays a significant role in noise reduction for Quantum Circuits. Several heuristic and exact approaches exist for CNOT optimization. In this paper, we investigate more complicated variations of optimal synthesis by allowing qubit permutations and handling layout restrictions. We encode such problems into Planning, SAT, and QBF. We provide optimization for both CNOT gate count and circuit depth. For experimental evaluation, we consider standard T-gate optimized benchmarks and optimize CNOT sub-circuits. We show that allowing qubit permutations can further reduce up to 56% in CNOT count and 46% in circuit depth. In the case of optimally mapped circuits under layout restrictions, we observe a reduction up to 17% CNOT count and 19% CNOT depth.
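A CNOT circuit implements an invertible linear map over GF(2), so a baseline (non-optimal) synthesis procedure is Gaussian elimination, where each row operation corresponds to one CNOT gate; the exact methods in the paper search for circuits with fewer gates or less depth than such a procedure yields. A sketch of the baseline:

```python
# Synthesize a CNOT gate list for y = M x over GF(2) by Gaussian
# elimination. Each row op "row_t ^= row_c" is one CNOT(control=c,
# target=t); the recorded ops reversed give the execution order.

def synth_cnots(M):
    n = len(M)
    A = [row[:] for row in M]
    ops = []

    def rowxor(c, t):
        A[t] = [a ^ b for a, b in zip(A[t], A[c])]
        ops.append((c, t))

    for j in range(n):
        if A[j][j] == 0:  # bring a pivot into row j
            k = next(i for i in range(j + 1, n) if A[i][j])
            rowxor(k, j)
        for i in range(n):
            if i != j and A[i][j]:
                rowxor(j, i)  # clear column j everywhere else
    return ops[::-1]

def apply_cnots(gates, bits):
    bits = list(bits)
    for c, t in gates:
        bits[t] ^= bits[c]
    return bits

M = [[1, 1, 0],
     [0, 1, 1],
     [0, 0, 1]]
gates = synth_cnots(M)
print(gates)
```

Asymptotically better counts are known (e.g. the Patel-Markov-Hayes algorithm), and the paper's SAT/QBF encodings additionally handle qubit permutations and layout restrictions, which this sketch ignores.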

[AI-75] Machine Learning-Based Reward-Driven Tuning of Scanning Probe Microscopy: Towards Fully Automated Microscopy

Link: https://arxiv.org/abs/2408.04055
Authors: Yu Liu,Roger Proksch,Jason Bemis,Utkarsh Pratiush,Astita Dubey,Mahshid Ahmadi,Reece Emery,Philip D. Rack,Yu-Chen Liu,Jan-Chi Yang,Sergei V. Kalinin
Keywords: intermittent contact mode, scanning probe microscopy, tapping mode, intermittent contact, tapping mode imaging
Subjects: Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 20 pages, 6 figures

Abstract:Since the dawn of scanning probe microscopy (SPM), tapping or intermittent contact mode has been one of the most widely used imaging modes. Manual optimization of tapping mode not only takes a lot of instrument and operator time, but also often leads to frequent probe and sample damage, poor image quality and reproducibility issues for new types of samples or inexperienced users. Despite wide use, optimization of tapping mode imaging is an extremely hard problem, ill-suited to either classical control methods or machine learning. Here we introduce a reward-driven workflow to automate the optimization of SPM in the tapping mode. The reward function is defined based on multiple channels with physical and empirical knowledge of good scans encoded, representing a sample-agnostic measure of image quality and imitating the decision-making logic employed by human operators. This automated workflow gives optimal scanning parameters for different probes and samples and gives high-quality SPM images consistently in the attractive mode. This study broadens the application and accessibility of SPM and opens the door for fully automated SPM.

Computer Vision

[CV-0] LiDAR-Event Stereo Fusion with Hallucinations ECCV2024

Link: https://arxiv.org/abs/2408.04633
Authors: Luca Bartolomei,Matteo Poggi,Andrea Conti,Stefano Mattoccia
Keywords: problem extremely challenging, correspondence problem extremely, untextured regions, Event stereo matching, presence of large
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: ECCV 2024. Code: this https URL - Project Page: this https URL

Abstract:Event stereo matching is an emerging technique to estimate depth from neuromorphic cameras; however, events are unlikely to trigger in the absence of motion or the presence of large, untextured regions, making the correspondence problem extremely challenging. Purposely, we propose integrating a stereo event camera with a fixed-frequency active sensor – e.g., a LiDAR – collecting sparse depth measurements, overcoming the aforementioned limitations. Such depth hints are used by hallucinating – i.e., inserting fictitious events – the stacks or raw input streams, compensating for the lack of information in the absence of brightness changes. Our techniques are general, can be adapted to any structured representation to stack events and outperform state-of-the-art fusion methods applied to event-based stereo.

[CV-1] Arctic-TILT. Business Document Understanding at Sub-Billion Scale

Link: https://arxiv.org/abs/2408.04632
Authors: Łukasz Borchmann,Michał Pietruszka,Wojciech Jaśkowski,Dawid Jurkiewicz,Piotr Halama,Paweł Józiak,Łukasz Garncarek,Paweł Liskowski,Karolina Szyndler,Andrzej Gretkowski,Julita Ołtusek,Gabriela Nowakowska,Artur Zawłocki,Łukasz Duhr,Paweł Dyda,Michał Turski
Keywords: workloads employing LLMs, employing LLMs involves, LLMs involves answering, involves answering questions, answering questions grounded
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The vast portion of workloads employing LLMs involves answering questions grounded on PDF or scan content. We introduce the Arctic-TILT achieving accuracy on par with models 1000 \times its size on these use cases. It can be fine-tuned and deployed on a single 24GB GPU, lowering operational costs while processing Visually Rich Documents with up to 400k tokens. The model establishes state-of-the-art results on seven diverse Document Understanding benchmarks, as well as provides reliable confidence scores and quick inference, which are essential for processing files in large-scale or time-sensitive enterprise environments.

[CV-2] Puppet-Master: Scaling Interactive Video Generation as a Motion Prior for Part-Level Dynamics

Link: https://arxiv.org/abs/2408.04631
Authors: Ruining Li,Chuanxia Zheng,Christian Rupprecht,Andrea Vedaldi
Keywords: interactive video generative, video generative model, part-level dynamics, part-level motion, realistic part-level motion
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project page: this https URL

Abstract:We present Puppet-Master, an interactive video generative model that can serve as a motion prior for part-level dynamics. At test time, given a single image and a sparse set of motion trajectories (i.e., drags), Puppet-Master can synthesize a video depicting realistic part-level motion faithful to the given drag interactions. This is achieved by fine-tuning a large-scale pre-trained video diffusion model, for which we propose a new conditioning architecture to inject the dragging control effectively. More importantly, we introduce the all-to-first attention mechanism, a drop-in replacement for the widely adopted spatial attention modules, which significantly improves generation quality by addressing the appearance and background issues in existing models. Unlike other motion-conditioned video generators that are trained on in-the-wild videos and mostly move an entire object, Puppet-Master is learned from Objaverse-Animation-HQ, a new dataset of curated part-level motion clips. We propose a strategy to automatically filter out sub-optimal animations and augment the synthetic renderings with meaningful motion trajectories. Puppet-Master generalizes well to real images across various categories and outperforms existing methods in a zero-shot manner on a real-world benchmark. See our project page for more results: this http URL.

[CV-3] LogogramNLP: Comparing Visual and Textual Representations of Ancient Logographic Writing Systems for NLP

Link: https://arxiv.org/abs/2408.04628
Authors: Danlu Chen,Freda Shi,Aditi Agarwal,Jacobo Myerston,Taylor Berg-Kirkpatrick
Keywords: Standard natural language, Standard natural, ancient logographic languages, discrete tokens, operate on symbolic
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Standard natural language processing (NLP) pipelines operate on symbolic representations of language, which typically consist of sequences of discrete tokens. However, creating an analogous representation for ancient logographic writing systems is an extremely labor intensive process that requires expert knowledge. At present, a large portion of logographic data persists in a purely visual form due to the absence of transcription – this issue poses a bottleneck for researchers seeking to apply NLP toolkits to study ancient logographic languages: most of the relevant data are images of writing. This paper investigates whether direct processing of visual representations of language offers a potential solution. We introduce LogogramNLP, the first benchmark enabling NLP analysis of ancient logographic languages, featuring both transcribed and visual datasets for four writing systems along with annotations for tasks like classification, translation, and parsing. Our experiments compare systems that employ recent visual and text encoding strategies as backbones. The results demonstrate that visual representations outperform textual representations for some investigated tasks, suggesting that visual processing pipelines may unlock a large amount of cultural heritage data of logographic languages for NLP-based analyses.
Journal reference: ACL 2024, long paper

[CV-4] Enhanced Prototypical Part Network (EPPNet) For Explainable Image Classification Via Prototypes ICIP

Link: https://arxiv.org/abs/2408.04606
Authors: Bhushan Atote,Victor Sanchez
Keywords: Explainable Artificial Intelligence, Explainable Artificial, Artificial Intelligence, Deep Neural Networks, Prototypical Part Network
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at the International Conference on Image Processing (ICIP), IEEE (2024); we will update the new version after it is published through IEEE

Abstract:Explainable Artificial Intelligence (xAI) has the potential to enhance the transparency and trust of AI-based systems. Although accurate predictions can be made using Deep Neural Networks (DNNs), the process used to arrive at such predictions is usually hard to explain. In terms of perceptibly human-friendly representations, such as word phrases in text or super-pixels in images, prototype-based explanations can justify a model’s decision. In this work, we introduce a DNN architecture for image classification, the Enhanced Prototypical Part Network (EPPNet), which achieves strong performance while discovering relevant prototypes that can be used to explain the classification results. This is achieved by introducing a novel cluster loss that helps to discover more relevant human-understandable prototypes. We also introduce a faithfulness score to evaluate the explainability of the results based on the discovered prototypes. Our score not only accounts for the relevance of the learned prototypes but also the performance of a model. Our evaluations on the CUB-200-2011 dataset show that the EPPNet outperforms state-of-the-art xAI-based methods, in terms of both classification accuracy and explainability.

[CV-5] Fall Detection for Industrial Setups Using YOLOv8 Variants

Link: https://arxiv.org/abs/2408.04605
Authors: Gracile Astlin Pereira
Keywords: proposed augmentation pipeline, increase dataset variance, detection system utilizing, improve detection accuracy, industrial fall detection
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This paper presents the development of an industrial fall detection system utilizing YOLOv8 variants, enhanced by our proposed augmentation pipeline to increase dataset variance and improve detection accuracy. Among the models evaluated, the YOLOv8m model, consisting of 25.9 million parameters and 79.1 GFLOPs, demonstrated a respectable balance between computational efficiency and detection performance, achieving a mean Average Precision (mAP) of 0.971 at 50% Intersection over Union (IoU) across both “Fall Detected” and “Human in Motion” categories. Although the YOLOv8l and YOLOv8x models presented higher precision and recall, particularly in fall detection, their higher computational demands and model size make them less suitable for resource-constrained environments.
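The mAP@50 figure above counts a detection as correct when its predicted box overlaps the ground truth with IoU >= 0.5. The IoU computation itself is standard:

```python
# Intersection over Union for axis-aligned boxes (x1, y1, x2, y2).
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

print(iou((0, 0, 10, 10), (0, 0, 10, 10)))  # 1.0 (perfect overlap)
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 50/150 = 0.333...
```

mAP@50 then averages precision over recall levels and over the two classes ("Fall Detected", "Human in Motion") at this single threshold.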

[CV-6] Towards High-resolution 3D Anomaly Detection via Group-Level Feature Contrastive Learning

Link: https://arxiv.org/abs/2408.04604
Authors: Hongze Zhu,Guoyang Xie,Chengbin Hou,Tao Dai,Can Gao,Jinbao Wang,Linlin Shen
Keywords: high-end equipment manufacturing, High-resolution point clouds, High-resolution point, anomaly detection, plays a critical
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: ACMMM24, 12 pages, 5 figures

Abstract:High-resolution point clouds (HRPCD) anomaly detection (AD) plays a critical role in precision machining and high-end equipment manufacturing. Despite considerable 3D-AD methods that have been proposed recently, they still cannot meet the requirements of the HRPCD-AD task. There are several challenges: i) It is difficult to directly capture HRPCD information due to large amounts of points at the sample level; ii) The advanced transformer-based methods usually obtain anisotropic features, leading to degradation of the representation; iii) The proportion of abnormal areas is very small, which makes it difficult to characterize. To address these challenges, we propose a novel group-level feature-based network, called Group3AD, which has a significantly efficient representation ability. First, we design an Intercluster Uniformity Network (IUN) to present the mapping of different groups in the feature space as several clusters, and obtain a more uniform distribution between clusters representing different parts of the point clouds in the feature space. Then, an Intracluster Alignment Network (IAN) is designed to encourage groups within the cluster to be distributed tightly in the feature space. In addition, we propose an Adaptive Group-Center Selection (AGCS) based on geometric information to improve the pixel density of potential anomalous regions during inference. The experimental results verify the effectiveness of our proposed Group3AD, which surpasses Reg3D-AD by a margin of 5% in terms of object-level AUROC on Real3D-AD. We provide the code and supplementary information on our website: this https URL.

[CV-7] Improving Network Interpretability via Explanation Consistency Evaluation

Link: https://arxiv.org/abs/2408.04600
Authors: Hefeng Wu,Hao Jiang,Keze Wang,Ziyi Tang,Xianghuan He,Liang Lin
Keywords: achieved remarkable performance, deep neural networks, transparency in prediction, neural networks, achieved remarkable
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: To appear in IEEE Transactions on Multimedia

Abstract:While deep neural networks have achieved remarkable performance, they tend to lack transparency in prediction. The pursuit of greater interpretability in neural networks often results in a degradation of their original performance. Some works strive to improve both interpretability and performance, but they primarily depend on meticulously imposed conditions. In this paper, we propose a simple yet effective framework that acquires more explainable activation heatmaps and simultaneously increases the model performance, without the need for any extra supervision. Specifically, our concise framework introduces a new metric, i.e., explanation consistency, to reweight the training samples adaptively in model learning. The explanation consistency metric is utilized to measure the similarity between the model’s visual explanations of the original samples and those of semantic-preserved adversarial samples, whose background regions are perturbed by using image adversarial attack techniques. Our framework then promotes the model learning by paying closer attention to those training samples with a high difference in explanations (i.e., low explanation consistency), for which the current model cannot provide robust interpretations. Comprehensive experimental results on various benchmarks demonstrate the superiority of our framework in multiple aspects, including higher recognition accuracy, greater data debiasing capability, stronger network robustness, and more precise localization ability on both regular networks and interpretable networks. We also provide extensive ablation studies and qualitative analyses to unveil the detailed contribution of each component.
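The reweighting idea can be sketched as follows; the choice of cosine similarity as the consistency measure and the specific weighting function are illustrative assumptions, not the paper's exact formulation:

```python
import math

# Sketch: explanation consistency = similarity between the heatmap of an
# original sample and that of its background-perturbed counterpart;
# low-consistency samples receive larger training weights.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def sample_weight(heat_orig, heat_adv):
    consistency = cosine(heat_orig, heat_adv)  # in [-1, 1]
    return 1.0 + (1.0 - consistency)           # inconsistent samples count more

h = [0.9, 0.1, 0.0, 0.2]
print(sample_weight(h, h))                     # ~1.0: consistent, no upweighting
print(sample_weight(h, [0.0, 0.9, 0.2, 0.1]))  # > 1: heatmap shifted, upweighted
```

These weights would then scale the per-sample loss during training, focusing the model on samples whose explanations are not yet robust.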

[CV-8] Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models

Link: https://arxiv.org/abs/2408.04594
Authors: Qirui Jiao,Daoyuan Chen,Yilun Huang,Yaliang Li,Ying Shen
Keywords-EN: High-performance Multimodal Large, Large Language Models, Multimodal Large Language, Large Language, High-performance Multimodal
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*Comments: 14 pages, 9 figures, 7 tables

Click to view abstract

Abstract:High-performance Multimodal Large Language Models (MLLMs) rely heavily on data quality. This study introduces a novel dataset named Img-Diff, designed to enhance fine-grained image recognition in MLLMs by leveraging insights from contrastive learning and image difference captioning. By analyzing object differences between similar images, we challenge models to identify both matching and distinct components. We utilize the Stable-Diffusion-XL model and advanced image editing techniques to create pairs of similar images that highlight object replacements. Our methodology includes a Difference Area Generator for identifying object differences, followed by a Difference Captions Generator for detailed difference descriptions. The result is a relatively small but high-quality dataset of “object replacement” samples. We use the proposed dataset to fine-tune state-of-the-art (SOTA) MLLMs such as MGM-7B, yielding comprehensive improvements in performance scores over SOTA models trained with larger-scale datasets, in numerous image difference and Visual Question Answering tasks. For instance, our trained models notably surpass the SOTA models GPT-4V and Gemini on the MMVP benchmark. Besides, we investigate alternative methods for generating image difference data through “object removal” and conduct thorough evaluation to confirm the dataset’s diversity, quality, and robustness, presenting several insights on the synthesis of such contrastive datasets. To encourage further research and advance the field of multimodal data synthesis and enhancement of MLLMs’ fundamental capabilities for image understanding, we release our codes and dataset at this https URL.
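
The "difference caption" idea can be illustrated with a toy sketch: given object annotations for an original and an edited image, identify the matching and replaced objects and verbalize the change. This is a hypothetical illustration, not the authors' released pipeline; the function name and output format are invented.

```python
def difference_caption(objects_a, objects_b):
    """Compare two object sets (original vs. edited image) and describe
    what was replaced, removed, added, or kept unchanged."""
    kept = sorted(set(objects_a) & set(objects_b))
    removed = sorted(set(objects_a) - set(objects_b))
    added = sorted(set(objects_b) - set(objects_a))
    parts = []
    if removed and added:
        parts.append(f"replaced {', '.join(removed)} with {', '.join(added)}")
    elif removed:
        parts.append(f"removed {', '.join(removed)}")
    elif added:
        parts.append(f"added {', '.join(added)}")
    if kept:
        parts.append(f"unchanged: {', '.join(kept)}")
    return "; ".join(parts) if parts else "no difference"

caption = difference_caption({"dog", "tree", "car"}, {"cat", "tree", "car"})
# -> "replaced dog with cat; unchanged: car, tree"
```

In the paper the replacement pairs come from Stable-Diffusion-XL edits and the captions from a dedicated generator; the sketch only conveys the matching/distinct decomposition the models are trained to recognize.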

[CV-9] SAM 2 in Robotic Surgery: An Empirical Evaluation for Robustness and Generalization in Surgical Video Segmentation

Link: https://arxiv.org/abs/2408.04593
Authors: Jieming Yu,An Wang,Wenzhen Dong,Mengya Xu,Mobarakol Islam,Jie Wang,Long Bai,Hongliang Ren
Keywords-EN: Segment Anything Model, demonstrated remarkable foundational, remarkable foundational competence, recent Segment, achieving superior results
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
*Comments: Empirical study. Previous work “SAM Meets Robotic Surgery” is accessible at: arXiv:2308.07156

Click to view abstract

Abstract:The recent Segment Anything Model (SAM) 2 has demonstrated remarkable foundational competence in semantic segmentation, with its memory mechanism and mask decoder further addressing challenges in video tracking and object occlusion, thereby achieving superior results in interactive segmentation for both images and videos. Building upon our previous empirical studies, we further explore the zero-shot segmentation performance of SAM 2 in robot-assisted surgery based on prompts, alongside its robustness against real-world corruption. For static images, we employ two forms of prompts: 1-point and bounding box, while for video sequences, the 1-point prompt is applied to the initial frame. Through extensive experimentation on the MICCAI EndoVis 2017 and EndoVis 2018 benchmarks, SAM 2, when utilizing bounding box prompts, outperforms state-of-the-art (SOTA) methods in comparative evaluations. The results with point prompts also exhibit a substantial enhancement over SAM’s capabilities, nearing or even surpassing existing unprompted SOTA methodologies. Besides, SAM 2 demonstrates improved inference speed and less performance degradation against various image corruption. Although slightly unsatisfactory results remain in specific edges or regions, SAM 2’s robust adaptability to 1-point prompts underscores its potential for downstream surgical tasks with limited prompt requirements.
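
In zero-shot evaluations of this kind, the two prompt forms (1-point and bounding box) are commonly derived from the ground-truth mask. The helper below is an illustrative sketch of that derivation only; it is not tied to the SAM 2 API, and the function name is hypothetical.

```python
def mask_to_prompts(mask):
    """Derive a 1-point prompt and a bounding-box prompt (x0, y0, x1, y1)
    from a binary mask given as nested lists. The point is a rough
    centroid; for concave objects it may fall outside the mask."""
    coords = [(x, y) for y, row in enumerate(mask)
                     for x, v in enumerate(row) if v]
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    box = (min(xs), min(ys), max(xs), max(ys))
    point = (sum(xs) // len(xs), sum(ys) // len(ys))
    return point, box

mask = [[0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0]]
point, box = mask_to_prompts(mask)  # point (1, 1), box (1, 1, 2, 2)
```

For video, the paper applies only the point prompt, and only on the initial frame, relying on SAM 2's memory mechanism to propagate the mask.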

[CV-10] HiLo: A Learning Framework for Generalized Category Discovery Robust to Domain Shifts

Link: https://arxiv.org/abs/2408.04591
Authors: Hongjun Wang,Sagar Vaze,Kai Han
Keywords-EN: Generalized Category Discovery, Generalized Category, partially labelled dataset, unlabelled instances, Generalized
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*Comments: 39 pages, 9 figures, 26 tables

Click to view abstract

Abstract:Generalized Category Discovery (GCD) is a challenging task in which, given a partially labelled dataset, models must categorize all unlabelled instances, regardless of whether they come from labelled categories or from new ones. In this paper, we challenge a remaining assumption in this task: that all images share the same domain. Specifically, we introduce a new task and method to handle GCD when the unlabelled data also contains images from different domains to the labelled set. Our proposed 'HiLo' networks extract High-level semantic and Low-level domain features, before minimizing the mutual information between the representations. Our intuition is that the clusterings based on domain information and semantic information should be independent. We further extend our method with a specialized domain augmentation tailored for the GCD task, as well as a curriculum learning approach. Finally, we construct a benchmark from corrupted fine-grained datasets as well as a large-scale evaluation on DomainNet with real-world domain shifts, reimplementing a number of GCD baselines in this setting. We demonstrate that HiLo outperforms SoTA category discovery models by a large margin on all evaluations.
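
The independence intuition can be made concrete with a simple correlation penalty between a high-level and a low-level feature channel: driving it toward zero discourages (linear) dependence. This is an illustrative stand-in for the paper's mutual-information term, not its actual implementation.

```python
import math

def correlation_penalty(feat_hi, feat_lo):
    """Absolute Pearson correlation between two feature channels;
    minimizing this encourages the semantic (hi) and domain (lo)
    representations to be linearly independent."""
    n = len(feat_hi)
    mh = sum(feat_hi) / n
    ml = sum(feat_lo) / n
    cov = sum((a - mh) * (b - ml) for a, b in zip(feat_hi, feat_lo))
    var_h = sum((a - mh) ** 2 for a in feat_hi)
    var_l = sum((b - ml) ** 2 for b in feat_lo)
    return abs(cov) / math.sqrt(var_h * var_l)
```

Mutual information also captures nonlinear dependence, which plain correlation misses; the sketch only conveys why "independent clusterings" translates into a minimizable training objective.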

[CV-11] Sampling for View Synthesis: From Local Light Field Fusion to Neural Radiance Fields and Beyond

Link: https://arxiv.org/abs/2408.04586
Authors: Ravi Ramamoorthi
Keywords-EN: complex real-world scenes, graphics and vision, virtual reality, immersive experiences, view synthesis
Subjects: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments: Article written for Frontiers of Science Award, International Congress on Basic Science, 2024

Click to view abstract

Abstract:Capturing and rendering novel views of complex real-world scenes is a long-standing problem in computer graphics and vision, with applications in augmented and virtual reality, immersive experiences and 3D photography. The advent of deep learning has enabled revolutionary advances in this area, classically known as image-based rendering. However, previous approaches require intractably dense view sampling or provide little or no guidance for how users should sample views of a scene to reliably render high-quality novel views. Local light field fusion proposes an algorithm for practical view synthesis from an irregular grid of sampled views that first expands each sampled view into a local light field via a multiplane image scene representation, then renders novel views by blending adjacent local light fields. Crucially, we extend traditional plenoptic sampling theory to derive a bound that specifies precisely how densely users should sample views of a given scene when using our algorithm. We achieve the perceptual quality of Nyquist rate view sampling while using up to 4000x fewer views. Subsequent developments have led to new scene representations for deep learning with view synthesis, notably neural radiance fields, but the problem of sparse view synthesis from a small number of images has only grown in importance. We reprise some of the recent results on sparse and even single image view synthesis, while posing the question of whether prescriptive sampling guidelines are feasible for the new generation of image-based rendering algorithms.

[CV-12] SAM2-Adapter: Evaluating & Adapting Segment Anything 2 in Downstream Tasks: Camouflage, Shadow, Medical Image Segmentation, and More

Link: https://arxiv.org/abs/2408.04579
Authors: Tianrun Chen,Ankang Lu,Lanyun Zhu,Chaotao Ding,Chunan Yu,Deyi Ji,Zejian Li,Lingyun Sun,Papa Mao,Ying Zang
Keywords-EN: achieving notable success, SAM encountered limitations, image segmentation scenarios, advent of large, significantly transformed
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments: arXiv admin note: text overlap with arXiv:2304.09148

Click to view abstract

Abstract:The advent of large models, also known as foundation models, has significantly transformed the AI research landscape, with models like Segment Anything (SAM) achieving notable success in diverse image segmentation scenarios. Despite its advancements, SAM encountered limitations in handling some complex low-level segmentation tasks like camouflaged object and medical imaging. In response, in 2023, we introduced SAM-Adapter, which demonstrated improved performance on these challenging tasks. Now, with the release of Segment Anything 2 (SAM2), a successor with enhanced architecture and a larger training corpus, we reassess these challenges. This paper introduces SAM2-Adapter, the first adapter designed to overcome the persistent limitations observed in SAM2 and achieve new state-of-the-art (SOTA) results in specific downstream tasks including medical image segmentation, camouflaged (concealed) object detection, and shadow detection. SAM2-Adapter builds on the SAM-Adapter’s strengths, offering enhanced generalizability and composability for diverse applications. We present extensive experimental results demonstrating SAM2-Adapter’s effectiveness. We show the potential and encourage the research community to leverage the SAM2 model with our SAM2-Adapter for achieving superior segmentation outcomes. Code, pre-trained models, and data processing protocols are available at this http URL
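
Adapter-style tuning of a foundation model typically inserts small trainable bottleneck modules into an otherwise frozen backbone. The sketch below shows one residual bottleneck adapter in plain Python; the shapes and function name are hypothetical and this is not the SAM2-Adapter code.

```python
def bottleneck_adapter(h, w_down, w_up):
    """Residual bottleneck adapter: h + W_up · relu(W_down · h).
    Only w_down and w_up are trained; the backbone producing the
    hidden vector h stays frozen."""
    # down-project into a low-dimensional bottleneck, then apply ReLU
    z = [max(0.0, sum(w * x for w, x in zip(row, h))) for row in w_down]
    # up-project back to the hidden size
    delta = [sum(w * v for w, v in zip(row, z)) for row in w_up]
    # residual connection keeps the frozen backbone's features intact
    return [x + d for x, d in zip(h, delta)]
```

The residual form means the adapter can start near identity and learn only the task-specific correction, which is why such modules can be retargeted from SAM to SAM 2 with modest retraining.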

[CV-13] Sketch2Scene: Automatic Generation of Interactive 3D Game Scenes from User's Casual Sketches

Link: https://arxiv.org/abs/2408.04567
Authors: Yongzhi Xu,Yonhon Ng,Yifu Wang,Inkyu Sa,Yunfei Duan,Yang Li,Pan Ji,Hongdong Li
Keywords-EN: computer graphics applications, including video gaming, graphics applications, virtual and augmented, augmented reality
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*Comments: Project Page: this https URL

Click to view abstract

Abstract:3D Content Generation is at the heart of many computer graphics applications, including video gaming, film-making, virtual and augmented reality, etc. This paper proposes a novel deep-learning based approach for automatically generating interactive and playable 3D game scenes, all from the user’s casual prompts such as a hand-drawn sketch. Sketch-based input offers a natural, and convenient way to convey the user’s design intention in the content creation process. To circumvent the data-deficient challenge in learning (i.e. the lack of large training data of 3D scenes), our method leverages a pre-trained 2D denoising diffusion model to generate a 2D image of the scene as the conceptual guidance. In this process, we adopt the isometric projection mode to factor out unknown camera poses while obtaining the scene layout. From the generated isometric image, we use a pre-trained image understanding method to segment the image into meaningful parts, such as off-ground objects, trees, and buildings, and extract the 2D scene layout. These segments and layouts are subsequently fed into a procedural content generation (PCG) engine, such as a 3D video game engine like Unity or Unreal, to create the 3D scene. The resulting 3D scene can be seamlessly integrated into a game development environment and is readily playable. Extensive tests demonstrate that our method can efficiently generate high-quality and interactive 3D game scenes with layouts that closely follow the user’s intention.

[CV-14] Depth Any Canopy: Leveraging Depth Foundation Models for Canopy Height Estimation ECCV2024

Link: https://arxiv.org/abs/2408.04523
Authors: Daniele Rege Cambrin,Isaac Corley,Paolo Garza
Keywords-EN: Estimating global tree, climate change applications, Estimating global, global tree canopy, canopy height
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments: Accepted at ECCV 2024 CV4E Workshop

Click to view abstract

Abstract:Estimating global tree canopy height is crucial for forest conservation and climate change applications. However, capturing high-resolution ground truth canopy height using LiDAR is expensive and not available globally. An efficient alternative is to train a canopy height estimator to operate on single-view remotely sensed imagery. The primary obstacle to this approach is that these methods require significant training data to generalize well globally and across uncommon edge cases. Recent monocular depth estimation foundation models have shown strong zero-shot performance even for complex scenes. In this paper we leverage the representations learned by these models to transfer to the remote sensing domain for measuring canopy height. Our findings suggest that our proposed Depth Any Canopy, the result of fine-tuning the Depth Anything v2 model for canopy height estimation, provides a performant and efficient solution, surpassing the current state-of-the-art with superior or comparable performance using only a fraction of the computational resources and parameters. Furthermore, our approach requires less than $1.30 in compute and results in an estimated carbon footprint of 0.14 kgCO2. Code, experimental results, and model checkpoints are openly available at this https URL.
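
Monocular depth foundation models predict relative depth, so transferring them to a metric quantity like canopy height generally involves at least an affine calibration against sparse reference measurements. A closed-form least-squares fit of scale and shift is sketched below; this is an illustrative step, not the paper's fine-tuning procedure.

```python
def fit_scale_shift(pred, ref):
    """Closed-form least squares for scale s and shift t such that
    ref ≈ s * pred + t, e.g. aligning relative model predictions
    with sparse LiDAR-derived canopy heights."""
    n = len(pred)
    mp = sum(pred) / n
    mr = sum(ref) / n
    cov = sum((p - mp) * (r - mr) for p, r in zip(pred, ref))
    var = sum((p - mp) ** 2 for p in pred)
    s = cov / var
    t = mr - s * mp
    return s, t
```

Fine-tuning, as done for Depth Any Canopy, lets the network learn the mapping directly, but the affine fit is a useful baseline and sanity check when only a handful of reference heights are available.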

[CV-15] Saliency Detection in Educational Videos: Analyzing the Performance of Current Models, Identifying Limitations and Advancement Directions

Link: https://arxiv.org/abs/2408.04515
Authors: Evelyn Navarrete,Ralph Ewerth,Anett Hoppe
Keywords-EN: learner pays attention, related support systems, educational videos, learner pays, pays attention
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Click to view abstract

Abstract:Identifying the regions of a learning resource that a learner pays attention to is crucial for assessing the material’s impact and improving its design and related support systems. Saliency detection in videos addresses the automatic recognition of attention-drawing regions in single frames. In educational settings, the recognition of pertinent regions in a video’s visual stream can enhance content accessibility and information retrieval tasks such as video segmentation, navigation, and summarization. Such advancements can pave the way for the development of advanced AI-assisted technologies that support learning with greater efficacy. However, this task becomes particularly challenging for educational videos due to the combination of unique characteristics such as text, voice, illustrations, animations, and more. To the best of our knowledge, there is currently no study that evaluates saliency detection approaches in educational videos. In this paper, we address this gap by evaluating four state-of-the-art saliency detection approaches for educational videos. We reproduce the original studies and explore the replication capabilities for general-purpose (non-educational) datasets. Then, we investigate the generalization capabilities of the models and evaluate their performance on educational videos. We conduct a comprehensive analysis to identify common failure scenarios and possible areas of improvement. Our experimental results show that educational videos remain a challenging context for generic video saliency detection models.

[CV-16] Towards Synergistic Deep Learning Models for Volumetric Cirrhotic Liver Segmentation in MRIs

Link: https://arxiv.org/abs/2408.04491
Authors: Vandan Gorade,Onkar Susladkar,Gorkem Durak,Elif Keles,Ertugrul Aktas,Timurhan Cebeci,Alpay Medetalibeyoglu,Daniela Ladner,Debesh Jha,Ulas Bagci
Keywords-EN: requires precise segmentation, effective disease monitoring, global mortality, requires precise, treatment planning
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Liver cirrhosis, a leading cause of global mortality, requires precise segmentation of ROIs for effective disease monitoring and treatment planning. Existing segmentation models often fail to capture complex feature interactions and generalize across diverse datasets. To address these limitations, we propose a novel synergistic theory that leverages complementary latent spaces for enhanced feature interaction modeling. Our proposed architecture, nnSynergyNet3D integrates continuous and discrete latent spaces for 3D volumes and features auto-configured training. This approach captures both fine-grained and coarse features, enabling effective modeling of intricate feature interactions. We empirically validated nnSynergyNet3D on a private dataset of 628 high-resolution T1 abdominal MRI scans from 339 patients. Our model outperformed the baseline nnUNet3D by approximately 2%. Additionally, zero-shot testing on healthy liver CT scans from the public LiTS dataset demonstrated superior cross-modal generalization capabilities. These results highlight the potential of synergistic latent space models to improve segmentation accuracy and robustness, thereby enhancing clinical workflows by ensuring consistency across CT and MRI modalities.
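
The discrete half of a continuous-plus-discrete latent design is typically a vector-quantization step: each continuous latent vector is snapped to its nearest codebook entry. A minimal sketch follows (hypothetical codebook; not the nnSynergyNet3D implementation).

```python
def quantize(z, codebook):
    """Return the index and vector of the codebook entry nearest to the
    continuous latent z (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    idx = min(range(len(codebook)), key=lambda i: sq_dist(z, codebook[i]))
    return idx, codebook[idx]

idx, code = quantize([0.9, 1.2], [[0.0, 0.0], [1.0, 1.0]])  # nearest entry 1
```

The intuition behind combining both spaces is that the continuous path preserves fine-grained detail while the quantized path captures coarse, reusable structure; the sketch shows only the quantization lookup itself.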

[CV-17] SegXAL: Explainable Active Learning for Semantic Segmentation in Driving Scene Scenarios ICPR

Link: https://arxiv.org/abs/2408.04482
Authors: Sriram Mandalika,Athira Nambiar
Keywords-EN: achieve high-end performance, utilize huge amounts, models utilize huge, high-end performance, huge amounts
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
*Comments: 17 pages, 7 figures. To appear in the proceedings of the 27th International Conference on Pattern Recognition (ICPR), 01-05 December, 2024, Kolkata, India

Click to view abstract

Abstract:Most of the sophisticated AI models utilize huge amounts of annotated data and heavy training to achieve high-end performance. However, there are certain challenges that hinder the deployment of AI models “in-the-wild” scenarios, i.e., inefficient use of unlabeled data, lack of incorporation of human expertise, and lack of interpretation of the results. To mitigate these challenges, we propose a novel Explainable Active Learning (XAL) model, XAL-based semantic segmentation model “SegXAL”, that can (i) effectively utilize the unlabeled data, (ii) facilitate the “Human-in-the-loop” paradigm, and (iii) augment the model decisions in an interpretable way. In particular, we investigate the application of the SegXAL model for semantic segmentation in driving scene scenarios. The SegXAL model proposes the image regions that require labeling assistance from Oracle by dint of explainable AI (XAI) and uncertainty measures in a weakly-supervised manner. Specifically, we propose a novel Proximity-aware Explainable-AI (PAE) module and Entropy-based Uncertainty (EBU) module to get an Explainable Error Mask, which enables the machine teachers/human experts to provide intuitive reasoning behind the results and to solicit feedback to the AI system via an active learning strategy. Such a mechanism bridges the semantic gap between man and machine through collaborative intelligence, where humans and AI actively enhance each other’s complementary strengths. A novel high-confidence sample selection technique based on the DICE similarity coefficient is also presented within the SegXAL framework. Extensive quantitative and qualitative analyses are carried out in the benchmarking Cityscape dataset. Results show the outperformance of our proposed SegXAL against other state-of-the-art models.
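
Two quantities named above are standard and easy to state concretely: per-pixel predictive entropy (the signal behind the EBU module) and the DICE similarity coefficient used for high-confidence sample selection. The following is a generic sketch of both metrics, not the authors' code.

```python
import math

def pixel_entropy(probs):
    """Shannon entropy of a per-pixel class distribution
    (higher value = more uncertain prediction)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def dice(mask_a, mask_b):
    """DICE similarity coefficient between two flattened binary masks:
    2|A ∩ B| / (|A| + |B|)."""
    inter = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    return 2 * inter / (sum(mask_a) + sum(mask_b))
```

In an active-learning loop like SegXAL's, high-entropy regions are routed to the human oracle for labeling, while predictions whose DICE agreement with a reference is high can be accepted as pseudo-labels without annotation cost.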

[CV-18] LumiGauss: High-Fidelity Outdoor Relighting with 2D Gaussian Splatting

Link: https://arxiv.org/abs/2408.04474
Authors: Joanna Kaleta,Kacper Kania,Tomasz Trzcinski,Marek Kowalski
Keywords-EN: unconstrained photo collections, notoriously challenging, Decoupling lighting, geometry using unconstrained, unconstrained photo
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments: Includes video files in src

Click to view abstract

Abstract:Decoupling lighting from geometry using unconstrained photo collections is notoriously challenging. Solving it would benefit many users, as creating complex 3D assets takes days of manual labor. Many previous works have attempted to address this issue, often at the expense of output fidelity, which questions the practicality of such methods. We introduce LumiGauss, a technique that tackles 3D reconstruction of scenes and environmental lighting through 2D Gaussian Splatting. Our approach yields high-quality scene reconstructions and enables realistic lighting synthesis under novel environment maps. We also propose a method for enhancing the quality of shadows, common in outdoor scenes, by exploiting spherical harmonics properties. Our approach facilitates seamless integration with game engines and enables the use of fast precomputed radiance transfer. We validate our method on the NeRF-OSR dataset, demonstrating superior performance over baseline methods. Moreover, LumiGauss can synthesize realistic images when applying novel environment maps.

[CV-19] What could go wrong? Discovering and describing failure modes in computer vision

Link: https://arxiv.org/abs/2408.04471
Authors: Gabriela Csurka,Tyler L. Hayes,Diane Larlus,Riccardo Volpi
Keywords-EN: Deep learning models, Deep learning, Deep, learning models, Abstract
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Click to view abstract

Abstract:Deep learning models are effective, yet brittle. Even carefully trained, their behavior tends to be hard to predict when confronted with out-of-distribution samples. In this work, our goal is to propose a simple yet effective solution to predict and describe via natural language potential failure modes of computer vision models. Given a pretrained model and a set of samples, our aim is to find sentences that accurately describe the visual conditions in which the model underperforms. In order to study this important topic and foster future research on it, we formalize the problem of Language-Based Error Explainability (LBEE) and propose a set of metrics to evaluate and compare different methods for this task. We propose solutions that operate in a joint vision-and-language embedding space, and can characterize through language descriptions model failures caused, e.g., by objects unseen during training or adverse visual conditions. We experiment with different tasks, such as classification under the presence of dataset bias and semantic segmentation in unseen environments, and show that the proposed methodology isolates nontrivial sentences associated with specific error causes. We hope our work will help practitioners better understand the behavior of models, increasing their overall safety and interpretability.
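
A minimal version of the sentence-selection step could rank candidate failure descriptions by cosine similarity to the centroid of the misclassified samples' embeddings in the joint vision-and-language space. The toy vectors and function names below are hypothetical; the paper's actual metrics and selection procedure differ in detail.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rank_failure_sentences(sentence_embs, hard_sample_embs):
    """Return sentence indices sorted by similarity to the centroid of
    the embeddings of samples the model got wrong."""
    n = len(hard_sample_embs)
    dim = len(hard_sample_embs[0])
    centroid = [sum(e[i] for e in hard_sample_embs) / n for i in range(dim)]
    return sorted(range(len(sentence_embs)),
                  key=lambda i: cosine(sentence_embs[i], centroid),
                  reverse=True)
```

The top-ranked sentences then serve as natural-language hypotheses for the visual conditions (e.g. "at night", "partially occluded") under which the model underperforms.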

[CV-20] Deep Learning for identifying systolic complexes in SCG traces: a cross-dataset analysis

Link: https://arxiv.org/abs/2408.04439
Authors: Michele Craighero,Sarah Solbiati,Federica Mozzini,Enrico Caiani,Giacomo Boracchi
Keywords-EN: traditional ECG, cardiac activity, promising alternative, systolic complex, deep learning solution
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:The seismocardiographic signal is a promising alternative to the traditional ECG in the analysis of cardiac activity. In particular, the systolic complex is known to be the most informative part of the seismocardiogram, thus requiring further analysis. State-of-the-art solutions to detect the systolic complex are based on Deep Learning models, which have been proven effective in pioneering studies. However, these solutions have only been tested in a controlled scenario considering only clean signals acquired from users maintained still in supine position. On top of that, all these studies consider data coming from a single dataset, ignoring the benefits and challenges related to a cross-dataset scenario. In this work, a cross-dataset experimental analysis was performed considering also data from a real-world scenario. Our findings prove the effectiveness of a deep learning solution, while showing the importance of a personalization step to contrast the domain shift, namely a change in data distribution between training and testing data. Finally, we demonstrate the benefits of a multi-channel approach, leveraging the information extracted from both accelerometer and gyroscope data.

[CV-21] A Review of 3D Reconstruction Techniques for Deformable Tissues in Robotic Surgery MICCAI2024

Link: https://arxiv.org/abs/2408.04426
Authors: Mengya Xu,Ziqi Guo,An Wang,Long Bai,Hongliang Ren
Keywords-EN: minimally invasive surgery, robotic minimally invasive, monocular endoscopic video, endoscopic video holds, video holds immense
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*Comments: To appear in MICCAI 2024 EARTH Workshop. Code availability: this https URL

Click to view abstract

Abstract:As a crucial and intricate task in robotic minimally invasive surgery, reconstructing surgical scenes using stereo or monocular endoscopic video holds immense potential for clinical applications. NeRF-based techniques have recently garnered attention for the ability to reconstruct scenes implicitly. On the other hand, Gaussian splatting-based 3D-GS represents scenes explicitly using 3D Gaussians and projects them onto a 2D plane as a replacement for the complex volume rendering in NeRF. However, these methods face challenges regarding surgical scene reconstruction, such as slow inference, dynamic scenes, and surgical tool occlusion. This work explores and reviews state-of-the-art (SOTA) approaches, discussing their innovations and implementation principles. Furthermore, we replicate the models and conduct testing and evaluation on two datasets. The test results demonstrate that with advancements in these techniques, achieving real-time, high-quality reconstructions becomes feasible.

[CV-22] Clutter Classification Using Deep Learning in Multiple Stages

Link: https://arxiv.org/abs/2408.04407
Authors: Ryan Dempsey,Jonathan Ethier
Keywords-EN: local environment, wireless communications, communications is highly, highly dependent, Path loss
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*Comments: SoutheastCon 2024

Click to view abstract

Abstract:Path loss prediction for wireless communications is highly dependent on the local environment. Propagation models including clutter information have been shown to significantly increase model accuracy. This paper explores the application of deep learning to satellite imagery to identify environmental clutter types automatically. Recognizing these clutter types has numerous uses, but our main application is to use clutter information to enhance propagation prediction models. Knowing the type of obstruction (tree, building, and further classifications) can improve the prediction accuracy of key propagation metrics such as path loss.

[CV-23] MultiViPerFrOG: A Globally Optimized Multi-Viewpoint Perception Framework for Camera Motion and Tissue Deformation

Link: https://arxiv.org/abs/2408.04367
Authors: Guido Caccianiga,Julian Nubert,Cesar Cadena,Marco Hutter,Katherine J. Kuchenbecker
Keywords-EN: relevant to surgery, moving depth camera, information captured, highly relevant, deformable environment
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Click to view abstract

Abstract:Reconstructing the 3D shape of a deformable environment from the information captured by a moving depth camera is highly relevant to surgery. The underlying challenge is the fact that simultaneously estimating camera motion and tissue deformation in a fully deformable scene is an ill-posed problem, especially from a single arbitrarily moving viewpoint. Current solutions are often organ-specific and lack the robustness required to handle large deformations. Here we propose a multi-viewpoint global optimization framework that can flexibly integrate the output of low-level perception modules (data association, depth, and relative scene flow) with kinematic and scene-modeling priors to jointly estimate multiple camera motions and absolute scene flow. We use simulated noisy data to show three practical examples that successfully constrain the convergence to a unique solution. Overall, our method shows robustness to combined noisy input measures and can process hundreds of points in a few milliseconds. MultiViPerFrOG builds a generalized learning-free scaffolding for spatio-temporal encoding that can unlock advanced surgical scene representations and will facilitate the development of the computer-assisted-surgery technologies of the future.

[CV-24] Detecting Car Speed using Object Detection and Depth Estimation: A Deep Learning Framework

Link: https://arxiv.org/abs/2408.04360
Authors: Subhasis Dasgupta,Arshi Naaz,Jayeeta Choudhury,Nancy Lahiri
Keywords-EN: fatal accidents, accidents are attributed, Road accidents, Radar based guns, accidents
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*Comments: This is the pre-print of the paper which was accepted for oral presentation and publication in the proceedings of IEEE CONIT 2024, organized at Pune from June 21 to 23, 2024. The paper is 6 pages long and it contains 11 figures and 1 table. This is not the final version of the paper

Click to view abstract

Abstract:Road accidents are common in almost every part of the world, and the majority of fatal accidents are attributed to vehicles exceeding the speed limit. Over-speeding is usually policed with checkpoints at various points along the road, but not all traffic police are equipped with existing speed-estimating devices such as LIDAR- or Radar-based guns. This project addresses vehicle speed estimation with handheld devices, such as network-connected mobile phones or wearable cameras, using deep learning frameworks.
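
The underlying estimate reduces to displacement over time: with per-frame distance to the detected vehicle (from a monocular depth model) and frame timestamps, average speed follows directly. The numbers and function below are a hypothetical toy sketch, not the project's pipeline.

```python
def estimate_speed_kmh(depths_m, times_s):
    """Average speed along the camera axis, from per-frame depth
    estimates (metres) and frame timestamps (seconds)."""
    dz = abs(depths_m[-1] - depths_m[0])  # metres travelled toward the camera
    dt = times_s[-1] - times_s[0]         # seconds elapsed
    return dz / dt * 3.6                  # m/s -> km/h

# A car closing from 30 m to 20 m over one second averages 10 m/s = 36 km/h.
speed = estimate_speed_kmh([30.0, 25.0, 20.0], [0.0, 0.5, 1.0])
```

A real system must also handle motion not aligned with the optical axis, depth-scale calibration, and detection/tracking noise, which is where the object-detection and depth-estimation networks come in.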

[CV-25] AggSS: An Aggregated Self-Supervised Approach for Class-Incremental Learning BMVC2024

Link: https://arxiv.org/abs/2408.04347
Authors: Jayateja Kalla,Soma Biswas
Keywords-EN: specifically image rotations, paper investigates, investigates the impact, impact of self-supervised, learning
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments: Accepted in BMVC 2024

Click to view abstract

Abstract:This paper investigates the impact of self-supervised learning, specifically image rotations, on various class-incremental learning paradigms. Here, each image with a predefined rotation is considered as a new class for training. At inference, all image rotation predictions are aggregated for the final prediction, a strategy we term Aggregated Self-Supervision (AggSS). We observe a shift in the deep neural network’s attention towards intrinsic object features as it learns through AggSS strategy. This learning approach significantly enhances class-incremental learning by promoting robust feature learning. AggSS serves as a plug-and-play module that can be seamlessly incorporated into any class-incremental learning framework, leveraging its powerful feature learning capabilities to enhance performance across various class-incremental learning approaches. Extensive experiments conducted on standard incremental learning datasets CIFAR-100 and ImageNet-Subset demonstrate the significant role of AggSS in improving performance within these paradigms.
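
At inference time, the aggregation described above reduces to summing per-class scores across the rotated views and taking the argmax. A minimal sketch with hypothetical probabilities (not the authors' code):

```python
def aggss_predict(view_probs):
    """view_probs: one per-class probability list per rotated view
    (e.g. 0°, 90°, 180°, 270°). Sum the scores class-wise across views
    and return the index of the winning class."""
    num_classes = len(view_probs[0])
    summed = [sum(view[c] for view in view_probs) for c in range(num_classes)]
    return max(range(num_classes), key=summed.__getitem__)

# The first view alone would vote class 0, but the aggregate prefers class 1.
pred = aggss_predict([[0.6, 0.4], [0.2, 0.8], [0.3, 0.7], [0.1, 0.9]])
```

During training, each (image, rotation) pair is treated as its own expanded class; the sketch covers only the test-time voting step that collapses those expanded predictions back to the original label space.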

[CV-26] Enhancing Journalism with AI: A Study of Contextualized Image Captioning for News Articles using LLMs and LMMs

Link: https://arxiv.org/abs/2408.04331
Authors: Aliki Anagnostopoulou,Thiago Gouvea,Daniel Sonntag
Keywords-EN: Large language models, large multimodal models, Large language, large multimodal, economic sectors
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Click to view abstract

Abstract:Large language models (LLMs) and large multimodal models (LMMs) have significantly impacted the AI community, industry, and various economic sectors. In journalism, integrating AI poses unique challenges and opportunities, particularly in enhancing the quality and efficiency of news reporting. This study explores how LLMs and LMMs can assist journalistic practice by generating contextualised captions for images accompanying news articles. We conducted experiments using the GoodNews dataset to evaluate the ability of LMMs (BLIP-2, GPT-4v, or LLaVA) to incorporate one of two types of context: entire news articles, or extracted named entities. In addition, we compared their performance to a two-stage pipeline composed of a captioning model (BLIP-2, OFA, or ViT-GPT2) with post-hoc contextualisation with LLMs (GPT-4 or LLaMA). We assess a diversity of models, and we find that while the choice of contextualisation model is a significant factor for the two-stage pipelines, this is not the case in the LMMs, where smaller, open-source models perform well compared to proprietary, GPT-powered ones. Additionally, we found that controlling the amount of provided context enhances performance. These results highlight the limitations of a fully automated approach and underscore the necessity for an interactive, human-in-the-loop strategy.

[CV-27] Multi-Scale and Detail-Enhanced Segment Anything Model for Salient Object Detection ACM-MM2024

链接: https://arxiv.org/abs/2408.04326
作者: Shixuan Gao,Pingping Zhang,Tianyu Yan,Huchuan Lu
关键词-EN: Salient Object Detection, Convolutional Neural Networks, Salient Object, Object Detection, aims to identify
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*备注: This work is accepted by ACM MM2024

点击查看摘要

Abstract:Salient Object Detection (SOD) aims to identify and segment the most prominent objects in images. Advanced SOD methods often utilize various Convolutional Neural Networks (CNN) or Transformers for deep feature extraction. However, these methods still deliver low performance and poor generalization in complex cases. Recently, Segment Anything Model (SAM) has been proposed as a visual fundamental model, which gives strong segmentation and generalization capabilities. Nonetheless, SAM requires accurate prompts of target objects, which are unavailable in SOD. Additionally, SAM lacks the utilization of multi-scale and multi-level information, as well as the incorporation of fine-grained details. To address these shortcomings, we propose a Multi-scale and Detail-enhanced SAM (MDSAM) for SOD. Specifically, we first introduce a Lightweight Multi-Scale Adapter (LMSA), which allows SAM to learn multi-scale information with very few trainable parameters. Then, we propose a Multi-Level Fusion Module (MLFM) to comprehensively utilize the multi-level information from the SAM’s encoder. Finally, we propose a Detail Enhancement Module (DEM) to incorporate SAM with fine-grained details. Experimental results demonstrate the superior performance of our model on multiple SOD datasets and its strong generalization on other segmentation tasks. The source code is released at this https URL.

[CV-28] Respiratory Subtraction for Pulmonary Microwave Ablation Evaluation

链接: https://arxiv.org/abs/2408.04299
作者: Wan Li,Xinyun Zhong,Wei Li,Song Zhang,Moheng Rong,Yan Xi,Peng Yuan,Zechen Wang,Xiaolei Jiang,Rongxi Yi,Hui Tang,Yang Chen,Chaohui Tong,Zhan Wu,Feng Wang
关键词-EN: global cancer mortality, minimally invasive interventions, necessitating minimally invasive, cancer mortality, global cancer
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Currently, lung cancer is a leading cause of global cancer mortality, often necessitating minimally invasive interventions. Microwave ablation (MWA) is extensively utilized for both primary and secondary lung tumors. Although numerous clinical guidelines and standards for MWA have been established, the clinical evaluation of ablation surgery remains challenging and requires long-term patient follow-up for confirmation. In this paper, we propose a method termed respiratory subtraction to evaluate lung tumor ablation therapy performance based on pre- and post-operative image guidance. Initially, preoperative images undergo coarse rigid registration to their corresponding postoperative positions, followed by further non-rigid registration. Subsequently, subtraction images are generated by subtracting the registered preoperative images from the postoperative ones. Furthermore, to enhance the clinical assessment of MWA treatment performance, we devise a quantitative analysis metric to evaluate ablation efficacy by comparing differences between tumor areas and treatment areas. To the best of our knowledge, this is the first work in the field to facilitate the assessment of MWA surgery performance on pulmonary tumors. Extensive experiments involving 35 clinical cases further validate the efficacy of the respiratory subtraction method. The experimental results confirm the effectiveness of the respiratory subtraction method and the proposed quantitative evaluation metric in assessing lung tumor treatment.
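The subtraction step described above (registered pre-op image subtracted from the post-op image) can be sketched in a few lines. The coverage ratio below is a hypothetical stand-in for the paper's quantitative metric, whose exact definition is not given in the abstract:

```python
def subtraction_and_coverage(pre, post, tumor_mask, ablation_mask):
    """Subtract a (registered) pre-op image from the post-op image and
    compute a simple tumor-coverage ratio.

    All arguments are 2D lists of equal shape; masks hold 0/1 values.
    Coverage = tumor pixels inside the treated area / all tumor pixels.
    """
    rows, cols = len(pre), len(pre[0])
    # pixel-wise subtraction image
    diff = [[post[i][j] - pre[i][j] for j in range(cols)] for i in range(rows)]
    tumor = covered = 0
    for i in range(rows):
        for j in range(cols):
            if tumor_mask[i][j]:
                tumor += 1
                covered += ablation_mask[i][j]
    return diff, (covered / tumor if tumor else 0.0)
```

In practice the registration step (rigid then non-rigid) must run first, so that `pre` and `post` are spatially aligned before subtraction.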

[CV-29] Dual-branch PolSAR Image Classification Based on GraphMAE and Local Feature Extraction

链接: https://arxiv.org/abs/2408.04294
作者: Yuchen Wang,Ziyi Guo,Haixia Bi,Danfeng Hong,Chen Xu
关键词-EN: synthetic aperture radar, polarimetric synthetic aperture, aperture radar, time-consuming process, synthetic aperture
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The annotation of polarimetric synthetic aperture radar (PolSAR) images is a labor-intensive and time-consuming process. Therefore, classifying PolSAR images with limited labels is a challenging task in the remote sensing domain. In recent years, self-supervised learning approaches have proven effective in PolSAR image classification with sparse labels. However, we observe a lack of research on generative self-supervised learning in the studied task. Motivated by this, we propose a dual-branch classification model based on generative self-supervised learning in this paper. The first branch is a superpixel-branch, which learns superpixel-level polarimetric representations using a generative self-supervised graph masked autoencoder. To acquire finer classification results, a convolutional neural network-based pixel-branch is further incorporated to learn pixel-level features. Classification with fused dual-branch features is finally performed to obtain the predictions. Experimental results on the benchmark Flevoland dataset demonstrate that our approach yields promising classification results.

[CV-30] Evaluating Modern Approaches in 3D Scene Reconstruction: NeRF vs Gaussian-Based Methods

链接: https://arxiv.org/abs/2408.04268
作者: Yiming Zhou,Zixuan Zeng,Andi Chen,Xiaofan Zhou,Haowei Ni,Shiyao Zhang,Panfeng Li,Liangxi Liu,Mengyao Zheng,Xupeng Chen
关键词-EN: Neural Radiance Fields, traditional Simultaneous Localization, Radiance Fields, Neural Radiance, Simultaneous Localization
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by 2024 6th International Conference on Data-driven Optimization of Complex Systems

点击查看摘要

Abstract:Exploring the capabilities of Neural Radiance Fields (NeRF) and Gaussian-based methods in the context of 3D scene reconstruction, this study contrasts these modern approaches with traditional Simultaneous Localization and Mapping (SLAM) systems. Utilizing datasets such as Replica and ScanNet, we assess performance based on tracking accuracy, mapping fidelity, and view synthesis. Findings reveal that NeRF excels in view synthesis, offering unique capabilities in generating new perspectives from existing data, albeit at slower processing speeds. Conversely, Gaussian-based methods provide rapid processing and significant expressiveness but lack comprehensive scene completion. Enhanced by global optimization and loop closure techniques, newer methods like NICE-SLAM and SplaTAM not only surpass older frameworks such as ORB-SLAM2 in terms of robustness but also demonstrate superior performance in dynamic and complex environments. This comparative analysis bridges theoretical research with practical implications, shedding light on future developments in robust 3D scene reconstruction across various real-world applications.

[CV-31] CoBooM: Codebook Guided Bootstrapping for Medical Image Representation Learning MICCAI2024

链接: https://arxiv.org/abs/2408.04262
作者: Azad Singh,Deepak Mishra
关键词-EN: harnessing unannotated data, medical image analysis, unannotated data, promising paradigm, analysis by harnessing
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted in MICCAI 2024

点击查看摘要

Abstract:Self-supervised learning (SSL) has emerged as a promising paradigm for medical image analysis by harnessing unannotated data. Despite their potential, the existing SSL approaches overlook the high anatomical similarity inherent in medical images. This makes it challenging for SSL methods to capture diverse semantic content in medical images consistently. This work introduces a novel and generalized solution that implicitly exploits anatomical similarities by integrating codebooks in SSL. The codebook serves as a concise and informative dictionary of visual patterns, which not only aids in capturing nuanced anatomical details but also facilitates the creation of robust and generalized feature representations. In this context, we propose CoBooM, a novel framework for self-supervised medical image learning by integrating continuous and discrete representations. The continuous component ensures the preservation of fine-grained details, while the discrete aspect facilitates coarse-grained feature extraction through the structured embedding space. To understand the effectiveness of CoBooM, we conduct a comprehensive evaluation of various medical datasets encompassing chest X-rays and fundus images. The experimental results reveal a significant performance gain in classification and segmentation tasks.

[CV-32] Unveiling Hidden Visual Information: A Reconstruction Attack Against Adversarial Visual Information Hiding

链接: https://arxiv.org/abs/2408.04261
作者: Jonggyu Jang,Hyeonsu Lyu,Seongjin Hwang,Hyun Jong Yang
关键词-EN: executing data reconstruction, AVIH encryption method, AVIH, AVIH method, data reconstruction
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注: 12 pages

点击查看摘要

Abstract:This paper investigates the security vulnerabilities of adversarial-example-based image encryption by executing data reconstruction (DR) attacks on encrypted images. A representative image encryption method is the adversarial visual information hiding (AVIH), which uses type-I adversarial example training to protect gallery datasets used in image recognition tasks. In the AVIH method, the type-I adversarial example approach creates images that appear completely different but are still recognized by machines as the original ones. Additionally, the AVIH method can restore encrypted images to their original forms using a predefined private key generative model. For the best security, assigning a unique key to each image is recommended; however, storage limitations may necessitate some images sharing the same key model. This raises a crucial security question for AVIH: How many images can safely share the same key model without being compromised by a DR attack? To address this question, we introduce a dual-strategy DR attack against the AVIH encryption method by incorporating (1) generative-adversarial loss and (2) augmented identity loss, which prevent DR from overfitting – an issue akin to that in machine learning. Our numerical results validate this approach through image recognition and re-identification benchmarks, demonstrating that our strategy can significantly enhance the quality of reconstructed images, thereby requiring fewer key-sharing encrypted images. Our source code to reproduce our results will be available soon.

[CV-33] UHNet: An Ultra-Lightweight and High-Speed Edge Detection Network

链接: https://arxiv.org/abs/2408.04258
作者: Fuzhang Li,Chuan Lin
关键词-EN: support lesion identification, Convolutional Neural Networks, enabling precise extraction, Vision Transformer architectures, Edge detection
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Edge detection is crucial in medical image processing, enabling precise extraction of structural information to support lesion identification and image analysis. Traditional edge detection models typically rely on complex Convolutional Neural Networks and Vision Transformer architectures. Due to their numerous parameters and high computational demands, these models are limited in their application on resource-constrained devices. This paper presents an ultra-lightweight edge detection model (UHNet), characterized by its minimal parameter count, rapid computation speed, negligible pre-training costs, and commendable performance. UHNet boasts impressive performance metrics with 42.3k parameters, 166 FPS, and 0.79G FLOPs. By employing an innovative feature extraction module and an optimized residual connection method, UHNet significantly reduces model complexity and computational requirements. Additionally, a lightweight feature fusion strategy is explored, enhancing detection accuracy. Experimental results on the BSDS500, NYUD, and BIPED datasets validate that UHNet achieves remarkable edge detection performance while maintaining high efficiency. This work not only provides new insights into the design of lightweight edge detection models but also demonstrates the potential and application prospects of the UHNet model in engineering applications such as medical image processing. The codes are available at this https URL

[CV-34] InstantStyleGaussian: Efficient Art Style Transfer with 3D Gaussian Splatting

链接: https://arxiv.org/abs/2408.04249
作者: Xin-Yi Yu,Jun-Xin Yu,Li-Bo Zhou,Yan Wei,Lin-Lin Ou
关键词-EN: Gaussian Splatting, present InstantStyleGaussian, Gaussian, Splatting, scene representation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We present InstantStyleGaussian, an innovative 3D style transfer method based on the 3D Gaussian Splatting (3DGS) scene representation. By inputting a target style image, it quickly generates new 3D GS scenes. Our approach operates on pre-reconstructed GS scenes, combining diffusion models with an improved iterative dataset update strategy. It utilizes diffusion models to generate target style images, adds these new images to the training dataset, and uses this dataset to iteratively update and optimize the GS scenes. Extensive experimental results demonstrate that our method ensures high-quality stylized scenes while offering significant advantages in style transfer speed and consistency.

[CV-35] MU-MAE: Multimodal Masked Autoencoders-Based One-Shot Learning

链接: https://arxiv.org/abs/2408.04243
作者: Rex Liu,Xin Liu
关键词-EN: human activity recognition, leveraging multimodal sensors, activity recognition, exponential growth, growth of multimedia
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*备注: IEEE MIPR 2024

点击查看摘要

Abstract:With the exponential growth of multimedia data, leveraging multimodal sensors presents a promising approach for improving accuracy in human activity recognition. Nevertheless, accurately identifying these activities using both video data and wearable sensor data presents challenges due to the labor-intensive data annotation, and reliance on external pretrained models or additional data. To address these challenges, we introduce Multimodal Masked Autoencoders-Based One-Shot Learning (Mu-MAE). Mu-MAE integrates a multimodal masked autoencoder with a synchronized masking strategy tailored for wearable sensors. This masking strategy compels the networks to capture more meaningful spatiotemporal features, which enables effective self-supervised pretraining without the need for external data. Furthermore, Mu-MAE leverages the representation extracted from multimodal masked autoencoders as prior information input to a cross-attention multimodal fusion layer. This fusion layer emphasizes spatiotemporal features requiring attention across different modalities while highlighting differences from other classes, aiding in the classification of various classes in metric-based one-shot learning. Comprehensive evaluations on MMAct one-shot classification show that Mu-MAE outperforms all the evaluated approaches, achieving up to an 80.17% accuracy for five-way one-shot multimodal classification, without the use of additional data.

[CV-36] LLDif: Diffusion Models for Low-light Emotion Recognition ICPR2024

链接: https://arxiv.org/abs/2408.04235
作者: Zhifeng Wang,Kaihao Zhang,Ramesh Sankaranarayana
关键词-EN: diffusion-based facial expression, facial expression recognition, paper introduces LLDif, framework tailored, paper introduces
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by ICPR2024

点击查看摘要

Abstract:This paper introduces LLDif, a novel diffusion-based facial expression recognition (FER) framework tailored for extremely low-light (LL) environments. Images captured under such conditions often suffer from low brightness and significantly reduced contrast, presenting challenges to conventional methods. These challenges include poor image quality that can significantly reduce the accuracy of emotion recognition. LLDif addresses these issues with a novel two-stage training process that combines a Label-aware CLIP (LA-CLIP), an embedding prior network (PNET), and a transformer-based network adept at handling the noise of low-light images. The first stage involves LA-CLIP generating a joint embedding prior distribution (EPD) to guide the LLformer in label recovery. In the second stage, the diffusion model (DM) refines the EPD inference, utilising the compactness of EPD for precise predictions. Experimental evaluations on various LL-FER datasets have shown that LLDif achieves competitive performance, underscoring its potential to enhance FER applications in challenging lighting conditions.

[CV-37] Cross-View Meets Diffusion: Aerial Image Synthesis with Geometry and Text Guidance

链接: https://arxiv.org/abs/2408.04224
作者: Ahmad Arrabi,Xiaohan Zhang,Waqas Sultan,Chen Chen,Safwan Wshah
关键词-EN: Aerial imagery analysis, aerial images, research fields, Aerial, images
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Aerial imagery analysis is critical for many research fields. However, obtaining frequent high-quality aerial images is not always feasible due to the high effort and cost involved. One solution is to use the Ground-to-Aerial (G2A) technique to synthesize aerial images from easily collectible ground images. However, G2A is rarely studied because of its challenges, including but not limited to the drastic view changes, occlusion, and range of visibility. In this paper, we present a novel Geometric Preserving Ground-to-Aerial (G2A) image synthesis (GPG2A) model that can generate realistic aerial images from ground images. GPG2A consists of two stages. The first stage predicts the Bird’s Eye View (BEV) segmentation (referred to as the BEV layout map) from the ground image. The second stage synthesizes the aerial image from the predicted BEV layout map and text descriptions of the ground image. To train our model, we present a new multi-modal cross-view dataset, namely VIGORv2, which is built upon VIGOR with newly collected aerial images, maps, and text descriptions. Our extensive experiments illustrate that GPG2A synthesizes better geometry-preserved aerial images than existing models. We also present two applications, data augmentation for cross-view geo-localization and sketch-based region search, to further verify the effectiveness of our GPG2A. The code and data will be publicly available.

[CV-38] VideoQA in the Era of LLMs: An Empirical Study

链接: https://arxiv.org/abs/2408.04223
作者: Junbin Xiao,Nanxin Huang,Hangyu Qin,Dongyang Li,Yicong Li,Fengbin Zhu,Zhulin Tao,Jianxing Yu,Liang Lin,Tat-Seng Chua,Angela Yao
关键词-EN: Video Large Language, Large Language Models, Large Language, Video Large, Video Question Answering
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Preprint. Under Review

点击查看摘要

Abstract:Video Large Language Models (Video-LLMs) are flourishing and have advanced many video-language tasks. As a golden testbed, Video Question Answering (VideoQA) plays a pivotal role in Video-LLM development. This work conducts a timely and comprehensive study of Video-LLMs’ behavior in VideoQA, aiming to elucidate their success and failure modes, and provide insights towards more human-like video understanding and question answering. Our analyses demonstrate that Video-LLMs excel in VideoQA; they can correlate contextual cues and generate plausible responses to questions about varied video contents. However, models falter in handling video temporality, both in reasoning about temporal content ordering and grounding QA-relevant temporal moments. Moreover, the models behave unintuitively - they are unresponsive to adversarial video perturbations while being sensitive to simple variations of candidate answers and questions. Also, they do not necessarily generalize better. The findings demonstrate Video-LLMs’ QA capability under standard conditions yet highlight their severe deficiency in robustness and interpretability, suggesting the urgent need for rationales in Video-LLM development.

[CV-39] Connective Viewpoints of Signal-to-Noise Diffusion Models

链接: https://arxiv.org/abs/2408.04221
作者: Khanh Doan,Long Tung Vuong,Tuan Nguyen,Anh Tuan Bui,Quyen Tran,Thanh-Toan Do,Dinh Phung,Trung Le
关键词-EN: complex data interpolation, Diffusion models, audio generation, Diffusion, diffusion models constitute
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Diffusion models (DM) have become fundamental components of generative models, excelling across various domains such as image creation, audio generation, and complex data interpolation. Signal-to-Noise diffusion models constitute a diverse family covering most state-of-the-art diffusion models. While there have been several attempts to study Signal-to-Noise (S2N) diffusion models from various perspectives, there remains a need for a comprehensive study connecting different viewpoints and exploring new perspectives. In this study, we offer a comprehensive perspective on noise schedulers, examining their role through the lens of the signal-to-noise ratio (SNR) and its connections to information theory. Building upon this framework, we have developed a generalized backward equation to enhance the performance of the inference process.
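The signal-to-noise viewpoint discussed above can be made concrete by computing the SNR of a diffusion step from the schedule's (alpha_t, sigma_t) pair. The cosine schedule below is a common textbook choice used here only as an assumption for illustration, not necessarily one studied in the paper:

```python
import math

def cosine_alpha_sigma(t):
    """A common variance-preserving noise schedule (assumed form):
    alpha_t = cos(pi*t/2), sigma_t = sin(pi*t/2),
    so that alpha_t**2 + sigma_t**2 == 1 for t in [0, 1]."""
    return math.cos(math.pi * t / 2), math.sin(math.pi * t / 2)

def snr(t):
    """Signal-to-noise ratio SNR(t) = alpha_t^2 / sigma_t^2."""
    a, s = cosine_alpha_sigma(t)
    return (a * a) / (s * s)
```

Under any such schedule the SNR decreases monotonically in t, which is the property that lets SNR (or its logarithm) serve as a schedule-independent "time" variable when comparing diffusion models.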

[CV-40] Medical Graph RAG: Towards Safe Medical Large Language Model via Graph Retrieval-Augmented Generation

链接: https://arxiv.org/abs/2408.04187
作者: Junde Wu,Jiayuan Zhu,Yunli Qi
关键词-EN: Large Language Model, enhancing Large Language, Large Language, framework specifically designed, graph-based Retrieval-Augmented Generation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We introduce a novel graph-based Retrieval-Augmented Generation (RAG) framework specifically designed for the medical domain, called MedGraphRAG, aimed at enhancing Large Language Model (LLM) capabilities and generating evidence-based results, thereby improving safety and reliability when handling private medical data. Our comprehensive pipeline begins with a hybrid static-semantic approach to document chunking, significantly improving context capture over traditional methods. Extracted entities are used to create a three-tier hierarchical graph structure, linking entities to foundational medical knowledge sourced from medical papers and dictionaries. These entities are then interconnected to form meta-graphs, which are merged based on semantic similarities to develop a comprehensive global graph. This structure supports precise information retrieval and response generation. The retrieval process employs a U-retrieve method to balance global awareness and indexing efficiency of the LLM. Our approach is validated through a comprehensive ablation study comparing various methods for document chunking, graph construction, and information retrieval. The results not only demonstrate that our hierarchical graph construction method consistently outperforms state-of-the-art models on multiple medical Q&A benchmarks, but also confirms that the responses generated include source documentation, significantly enhancing the reliability of medical LLMs in practical applications. Code will be at: this https URL

[CV-41] pyBregMan: A Python library for Bregman Manifolds

链接: https://arxiv.org/abs/2408.04175
作者: Frank Nielsen,Alexander Soen
关键词-EN: dually flat space, Bregman manifolds, Bregman, convex Bregman generators, dually flat
类目: Machine Learning (cs.LG); Computational Geometry (cs.CG); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT)
*备注: 28 pages

点击查看摘要

Abstract:A Bregman manifold is a synonym for a dually flat space in information geometry which admits as a canonical divergence a Bregman divergence. Bregman manifolds are induced by smooth strictly convex functions like the cumulant or partition functions of regular exponential families, the negative entropy of mixture families, or the characteristic functions of regular cones just to list a few such convex Bregman generators. We describe the design of pyBregMan, a library which implements generic operations on Bregman manifolds and instantiate several common Bregman manifolds used in information sciences. At the core of the library is the notion of Legendre-Fenchel duality inducing a canonical pair of dual potential functions and dual Bregman divergences. The library also implements the Fisher-Rao manifolds of categorical/multinomial distributions and multivariate normal distributions. To demonstrate the use of the pyBregMan kernel manipulating those Bregman and Fisher-Rao manifolds, the library also provides several core algorithms for various applications in statistics, machine learning, information fusion, and so on.
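The canonical divergence mentioned above has a compact generic form. The sketch below is not pyBregMan's actual API, just a minimal illustration of a Bregman divergence computed from a convex generator and its gradient:

```python
def bregman_divergence(F, grad_F, p, q):
    """Bregman divergence D_F(p : q) = F(p) - F(q) - <grad F(q), p - q>
    induced by a strictly convex generator F on R^d."""
    inner = sum(g * (pi - qi) for g, pi, qi in zip(grad_F(q), p, q))
    return F(p) - F(q) - inner

# With F(x) = sum(x_i^2), the divergence reduces to the squared
# Euclidean distance ||p - q||^2 -- the classic sanity check.
squared_norm = lambda x: sum(v * v for v in x)
grad_squared_norm = lambda x: [2.0 * v for v in x]
```

Other generators recover other familiar divergences; for example, the negative-entropy generator on the probability simplex yields the Kullback-Leibler divergence.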

[CV-42] MultiColor: Image Colorization by Learning from Multiple Color Spaces

链接: https://arxiv.org/abs/2408.04172
作者: Xiangcheng Du,Zhao Zhou,Yanlong Wang,Zhuoyao Wang,Yingbin Zheng,Cheng Jin
关键词-EN: shown impressive performance, Deep networks, color, color spaces, image restoration tasks
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:Deep networks have shown impressive performance in the image restoration tasks, such as image colorization. However, we find that previous approaches rely on the digital representation from single color model with a specific mapping function, a.k.a., color space, during the colorization pipeline. In this paper, we first investigate the modeling of different color spaces, and find each of them exhibiting distinctive characteristics with unique distribution of colors. The complementarity among multiple color spaces leads to benefits for the image colorization task. We present MultiColor, a new learning-based approach to automatically colorize grayscale images that combines clues from multiple color spaces. Specifically, we employ a set of dedicated colorization modules for individual color space. Within each module, a transformer decoder is first employed to refine color query embeddings and then a color mapper produces color channel prediction using the embeddings and semantic features. With these predicted color channels representing various color spaces, a complementary network is designed to exploit the complementarity and generate pleasing and reasonable colorized images. We conduct extensive experiments on real-world datasets, and the results demonstrate superior performance over the state-of-the-arts.

[CV-43] Rotation center identification based on geometric relationships for rotary motion deblurring

链接: https://arxiv.org/abs/2408.04171
作者: Jinhui Qin,Yong Ma,Jun Huang,Fan Fan,You Du
关键词-EN: rotary motion deblurring, rotary motion blurred, Non-blind rotary motion, rotary motion, latent clear image
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Non-blind rotary motion deblurring (RMD) aims to recover the latent clear image from a rotary motion blurred (RMB) image. The rotation center is a crucial input parameter in non-blind RMD methods. Existing methods directly estimate the rotation center from the RMB image; however, they always suffer from significant errors, and the performance of RMD is limited. For assembled imaging systems, the position of the rotation center remains fixed. Leveraging this prior knowledge, we propose a geometric-based method for rotation center identification and analyze its error range. Furthermore, we construct an RMB imaging system. The experiment demonstrates that our method achieves less than 1-pixel error along a single axis (x-axis or y-axis). We utilize the constructed imaging system to capture real RMB images, and experimental results show that our method can help existing RMD approaches yield better RMD images.

[CV-44] M2EF-NNs: Multimodal Multi-instance Evidence Fusion Neural Networks for Cancer Survival Prediction

链接: https://arxiv.org/abs/2408.04170
作者: Hui Luo,Jiashuang Huang,Hengrong Ju,Tianyi Zhou,Weiping Ding
关键词-EN: formulating treatment plans, assisting clinical doctors, cancer survival prediction, Accurate cancer survival, survival prediction
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Accurate cancer survival prediction is crucial for assisting clinical doctors in formulating treatment plans. Multimodal data, including histopathological images and genomic data, offer complementary and comprehensive information that can greatly enhance the accuracy of this task. However, the current methods, despite yielding promising results, suffer from two notable limitations: they do not effectively utilize global context and disregard modal uncertainty. In this study, we put forward a neural network model called M2EF-NNs, which leverages multimodal and multi-instance evidence fusion techniques for accurate cancer survival prediction. Specifically, to capture global information in the images, we use a pre-trained Vision Transformer (ViT) model to obtain patch feature embeddings of histopathological images. Then, we introduce a multimodal attention module that uses genomic embeddings as queries and learns the co-attention mapping between genomic and histopathological images to achieve an early interaction fusion of multimodal information and better capture their correlations. Subsequently, we are the first to apply the Dempster-Shafer evidence theory (DST) to cancer survival prediction. We parameterize the distribution of class probabilities using the processed multimodal features and introduce subjective logic to estimate the uncertainty associated with different modalities. By combining with the Dempster-Shafer theory, we can dynamically adjust the weights of class probabilities after multimodal fusion to achieve trusted survival prediction. Finally, experimental validation on the TCGA datasets confirms the significant improvements achieved by our proposed method in cancer survival prediction and enhances the reliability of the model.
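As a rough illustration of the Dempster-Shafer fusion step the abstract invokes, the sketch below combines two mass functions whose focal sets are the singleton classes plus the whole frame `'Theta'` (which carries the uncertainty mass, as in subjective logic). The representation and names are assumptions for illustration, not the paper's exact parameterization:

```python
def dempster_combine(m1, m2, classes):
    """Dempster's rule of combination for two mass functions over
    singleton classes plus the full frame 'Theta' (uncertainty)."""
    # Conflict: mass assigned to incompatible singleton pairs.
    conflict = sum(m1.get(a, 0.0) * m2.get(b, 0.0)
                   for a in classes for b in classes if a != b)
    norm = 1.0 - conflict  # renormalization constant (1 - K)
    fused = {c: (m1.get(c, 0.0) * m2.get(c, 0.0)
                 + m1.get(c, 0.0) * m2['Theta']
                 + m1['Theta'] * m2.get(c, 0.0)) / norm
             for c in classes}
    fused['Theta'] = m1['Theta'] * m2['Theta'] / norm
    return fused
```

Because the fused `'Theta'` mass shrinks when the two sources agree, the rule naturally down-weights uncertain modalities after fusion, which is the behavior the paper exploits for trusted prediction.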

[CV-45] Decorrelating Structure via Adapters Makes Ensemble Learning Practical for Semi-supervised Learning

链接: https://arxiv.org/abs/2408.04150
作者: Jiaqi Wu,Junbiao Pang,Qingming Huang
关键词-EN: low training efficiency, traditional ensemble learning, learning methods exhibit, labeled data, ensemble learning methods
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In computer vision, traditional ensemble learning methods exhibit either low training efficiency or limited performance when enhancing the reliability of deep neural networks. In this paper, we propose a lightweight, loss-function-free, and architecture-agnostic ensemble learning method, the Decorrelating Structure via Adapters (DSA), for various visual tasks. Concretely, the proposed DSA leverages structure-diverse adapters to decorrelate multiple prediction heads without any tailored regularization or loss. This allows DSA to be easily extended to architecture-agnostic networks for a range of computer vision tasks. Importantly, the theoretical analysis shows that the proposed DSA has lower bias and variance than the single-head baseline adopted by most state-of-the-art approaches. Consequently, DSA makes deep networks reliable and robust against various real-world challenges, e.g., data corruption and label noise. Extensive experiments combining the proposed method with FreeMatch achieved accuracy improvements of 5.35% on the CIFAR-10 dataset with 40 labeled data and 0.71% on the CIFAR-100 dataset with 400 labeled data. Besides, combining the proposed method with DualPose achieved improvements in the Percentage of Correct Keypoints (PCK) of 2.08% on the Sniffing dataset with 100 data (30 labeled), 5.2% on the FLIC dataset with 100 data (50 labeled), and 2.35% on the LSP dataset with 200 data (100 labeled).
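The variance claim above can be checked numerically with a toy simulation. This is purely illustrative; the head count and noise model are assumptions, not the DSA architecture. Averaging K heads whose errors are decorrelated shrinks prediction variance roughly by 1/K relative to a single head.

```python
import numpy as np

rng = np.random.default_rng(0)
true_logit, n_heads, n_trials = 2.0, 8, 5000

# One head: signal plus unit-variance noise.
single = true_logit + rng.normal(0.0, 1.0, size=n_trials)
# Eight decorrelated heads, averaged -- the idealized decorrelated setting.
ensemble = true_logit + rng.normal(0.0, 1.0, size=(n_trials, n_heads)).mean(axis=1)

var_single, var_ensemble = single.var(), ensemble.var()
# With independent errors, var_ensemble is roughly var_single / n_heads.
```

Correlated heads would blunt this 1/K shrinkage, which is why the decorrelation property matters more than the raw number of heads.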

[CV-46] ComKD-CLIP: Comprehensive Knowledge Distillation for Contrastive Language-Image Pre-traning Model

链接: https://arxiv.org/abs/2408.04145
作者: Yifan Chen,Xiaozhen Qiao,Zhe Sun,Xuelong Li
关键词-EN: Contrastive Language-Image Pre-training, contrastive learning techniques, integrating semantic information, Language-Image Pre-training, Model
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: first submit

点击查看摘要

Abstract:Contrastive Language-Image Pre-training (CLIP) excels in integrating semantic information between images and text through contrastive learning techniques. It has achieved remarkable performance in various multimodal tasks. However, the deployment of large CLIP models is hindered in resource-limited environments, while smaller models frequently fall short of meeting the performance benchmarks necessary for practical applications. In this paper, we propose a novel approach, coined ComKD-CLIP: Comprehensive Knowledge Distillation for Contrastive Language-Image Pre-traning Model, which aims to comprehensively distill the knowledge from a large teacher CLIP model into a smaller student model, ensuring comparable performance with significantly reduced parameters. ComKD-CLIP is composed of two key mechanisms: Image Feature Alignment (IFAlign) and Educational Attention (EduAttention). IFAlign makes the image features extracted by the student model closely match those extracted by the teacher model, enabling the student to learn the teacher's knowledge of extracting image features. EduAttention explores the cross-relationships between text features extracted by the teacher model and image features extracted by the student model, enabling the student model to learn how the teacher model integrates text-image features. In addition, ComKD-CLIP can refine the knowledge distilled from IFAlign and EduAttention by leveraging the results of text-image feature fusion by the teacher model, ensuring the student model accurately absorbs the teacher model's knowledge. Extensive experiments conducted on 11 datasets have demonstrated the superiority of the proposed method.
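A minimal sketch of the feature-alignment idea behind IFAlign. The linear projection and the dimensions are assumptions (the abstract does not specify the module's parametrization): the student's features are mapped into the teacher's space and penalized by mean squared error.

```python
import numpy as np

def align_loss(student_feats, teacher_feats, proj):
    """MSE between linearly projected student features and teacher
    features -- a stand-in for IFAlign-style image-feature alignment."""
    aligned = student_feats @ proj            # (N, d_s) @ (d_s, d_t)
    return float(((aligned - teacher_feats) ** 2).mean())

rng = np.random.default_rng(1)
teacher = rng.normal(size=(4, 8))             # teacher image features
proj = rng.normal(size=(16, 8))               # student-to-teacher projection
perfect_student = teacher @ np.linalg.pinv(proj)   # aligns exactly
random_student = rng.normal(size=(4, 16))
```

The loss is near zero for the perfectly aligned student and much larger for a random one, which is the gradient signal a distillation objective of this kind provides.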

[CV-47] Integrated Dynamic Phenological Feature for Remote Sensing Image Land Cover Change Detection

链接: https://arxiv.org/abs/2408.04144
作者: Yi Liu,Chenhao Sun,Hao Ye,Xiangying Liu,Weilong Ju
关键词-EN: analyzing land surface, Remote sensing image, land surface changes, image change detection, Remote sensing
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Remote sensing image change detection (CD) is essential for analyzing land surface changes over time, with a significant challenge being the differentiation of actual changes from complex scenes while filtering out pseudo-changes. A primary contributor to this challenge is the intra-class dynamic changes due to phenological characteristics in natural areas. To overcome this, we introduce the InPhea model, which integrates phenological features into a remote sensing image CD framework. The model features a detector with a differential attention module for improved feature representation of change information, coupled with high-resolution feature extraction and spatial pyramid blocks to enhance performance. Additionally, a constrainer with four constraint modules and a multi-stage contrastive learning approach is employed to aid in the model’s understanding of phenological characteristics. Experiments on the HRSCD, SECD, and PSCD-Wuhan datasets reveal that InPhea outperforms other models, confirming its effectiveness in addressing phenological pseudo-changes and its overall model superiority.

[CV-48] Can Rule-Based Insights Enhance LLMs for Radiology Report Classification? Introducing the RadPrompt Methodology ACL2024

链接: https://arxiv.org/abs/2408.04121
作者: Panagiotis Fytas,Anna Breger,Ian Selby,Simon Baker,Shahab Shahipasand,Anna Korhonen
关键词-EN: Developing imaging models, Developing imaging, chest X-rays, imaging models capable, capable of detecting
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at BioNLP, ACL 2024

点击查看摘要

Abstract:Developing imaging models capable of detecting pathologies from chest X-rays can be cost and time-prohibitive for large datasets as it requires supervision to attain state-of-the-art performance. Instead, labels extracted from radiology reports may serve as distant supervision since these are routinely generated as part of clinical practice. Despite their widespread use, current rule-based methods for label extraction rely on extensive rule sets that are limited in their robustness to syntactic variability. To alleviate these limitations, we introduce RadPert, a rule-based system that integrates an uncertainty-aware information schema with a streamlined set of rules, enhancing performance. Additionally, we have developed RadPrompt, a multi-turn prompting strategy that leverages RadPert to bolster the zero-shot predictive capabilities of large language models, achieving a statistically significant improvement in weighted average F1 score over GPT-4 Turbo. Most notably, RadPrompt surpasses both its underlying models, showcasing the synergistic potential of LLMs with rule-based models. We have evaluated our methods on two English Corpora: the MIMIC-CXR gold-standard test set and a gold-standard dataset collected from the Cambridge University Hospitals.
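To make the rule-based side concrete, here is a toy uncertainty-aware extractor in the spirit of, but far simpler than, RadPert. The cue phrases and the four-way label scheme are illustrative assumptions, not the paper's rule set.

```python
import re

def extract_label(report, finding):
    """Toy rule-based label extraction: map a radiology report to
    positive / negative / uncertain / absent for one finding."""
    sentences = [s for s in re.split(r"[.;]", report.lower()) if finding in s]
    if not sentences:
        return "absent"
    for s in sentences:
        # Negation cues take priority over uncertainty cues.
        if re.search(r"\bno\b|\bwithout\b", s):
            return "negative"
        if re.search(r"possible|cannot exclude|may represent", s):
            return "uncertain"
    return "positive"
```

Real systems like RadPert need far richer schemas precisely because such flat cue lists are brittle to syntactic variability, which is the limitation the abstract calls out.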

[CV-49] PaveCap: The First Multimodal Framework for Comprehensive Pavement Condition Assessment with Dense Captioning and PCI Estimation

链接: https://arxiv.org/abs/2408.04110
作者: Blessing Agyei Kyem,Eugene Kofi Okrah Denteh,Joshua Kofi Asamoah,Armstrong Aboah
关键词-EN: PCI Estimation Network, Dense Captioning Network, Pavement Condition Index, PCI Estimation, Dense Captioning
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This research introduces the first multimodal approach for pavement condition assessment, providing both quantitative Pavement Condition Index (PCI) predictions and qualitative descriptions. We introduce PaveCap, a novel framework for automated pavement condition assessment. The framework consists of two main parts: a Single-Shot PCI Estimation Network and a Dense Captioning Network. The PCI Estimation Network uses YOLOv8 for object detection, the Segment Anything Model (SAM) for zero-shot segmentation, and a four-layer convolutional neural network to predict PCI. The Dense Captioning Network uses a YOLOv8 backbone, a Transformer encoder-decoder architecture, and a convolutional feed-forward module to generate detailed descriptions of pavement conditions. To train and evaluate these networks, we developed a pavement dataset with bounding box annotations, textual annotations, and PCI values. The results of our PCI Estimation Network showed a strong positive correlation (0.70) between predicted and actual PCIs, demonstrating its effectiveness in automating condition assessment. Also, the Dense Captioning Network produced accurate pavement condition descriptions, evidenced by high BLEU (0.7445), GLEU (0.5893), and METEOR (0.7252) scores. Additionally, the dense captioning model handled complex scenarios well, even correcting some errors in the ground truth data. The framework developed here can greatly improve infrastructure management and decision-making in pavement maintenance.

[CV-50] Decoding Visual Sentiment of Political Imagery

链接: https://arxiv.org/abs/2408.04103
作者: Olga Gasparyan,Elena Sirotkina
关键词-EN: viewers systematically disagree, visual sentiment, viewers systematically, systematically disagree, define visual sentiment
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:How can we define visual sentiment when viewers systematically disagree on their perspectives? This study introduces a novel approach to visual sentiment analysis by integrating attitudinal differences into visual sentiment classification. Recognizing that societal divides, such as partisan differences, heavily influence sentiment labeling, we developed a dataset that reflects these divides. We then trained a deep learning multi-task multi-class model to predict visual sentiment from different ideological viewpoints. Applied to immigration-related images, our approach captures perspectives from both Democrats and Republicans. By incorporating diverse perspectives into the labeling and model training process, our strategy addresses the limitation of label ambiguity and demonstrates improved accuracy in visual sentiment predictions. Overall, our study advocates for a paradigm shift in decoding visual sentiment toward creating classifiers that more accurately reflect the sentiments generated by humans.

[CV-51] ArtVLM: Attribute Recognition Through Vision-Based Prefix Language Modeling ECCV2024

链接: https://arxiv.org/abs/2408.04102
作者: William Y. Zhu,Keren Ye,Junjie Ke,Jiahui Yu,Leonidas Guibas,Peyman Milanfar,Feng Yang
关键词-EN: computer vision applications, Recognizing and disentangling, disentangling visual attributes, Visual Genome Attribute, visual attribute recognition
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted at ECCV 2024

点击查看摘要

Abstract:Recognizing and disentangling visual attributes from objects is a foundation to many computer vision applications. While large vision language representations like CLIP had largely resolved the task of zero-shot object recognition, zero-shot visual attribute recognition remains a challenge because CLIP’s contrastively-learned vision-language representation cannot effectively capture object-attribute dependencies. In this paper, we target this weakness and propose a sentence generation-based retrieval formulation for attribute recognition that is novel in 1) explicitly modeling a to-be-measured and retrieved object-attribute relation as a conditional probability graph, which converts the recognition problem into a dependency-sensitive language-modeling problem, and 2) applying a large pretrained Vision-Language Model (VLM) on this reformulation and naturally distilling its knowledge of image-object-attribute relations to use towards attribute recognition. Specifically, for each attribute to be recognized on an image, we measure the visual-conditioned probability of generating a short sentence encoding the attribute’s relation to objects on the image. Unlike contrastive retrieval, which measures likelihood by globally aligning elements of the sentence to the image, generative retrieval is sensitive to the order and dependency of objects and attributes in the sentence. We demonstrate through experiments that generative retrieval consistently outperforms contrastive retrieval on two visual reasoning datasets, Visual Attribute in the Wild (VAW), and our newly-proposed Visual Genome Attribute Ranking (VGARank).

[CV-52] PushPull-Net: Inhibition-driven ResNet robust to image corruptions ICPR2024

链接: https://arxiv.org/abs/2408.04077
作者: Guru Swaroop Bennabhaktula,Enrique Alegre,Nicola Strisciuglio,George Azzopardi
关键词-EN: primary visual cortex, anti-phase inhibition phenomenon, inhibition phenomenon observed, visual cortex, phenomenon observed
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at ICPR 2024, code available at this https URL

点击查看摘要

Abstract:We introduce a novel computational unit, termed PushPull-Conv, in the first layer of a ResNet architecture, inspired by the anti-phase inhibition phenomenon observed in the primary visual cortex. This unit redefines the traditional convolutional layer by implementing a pair of complementary filters: a trainable push kernel and its counterpart, the pull kernel. The push kernel (analogous to traditional convolution) learns to respond to specific stimuli, while the pull kernel reacts to the same stimuli but of opposite contrast. This configuration enhances stimulus selectivity and effectively inhibits response in regions lacking preferred stimuli. This effect is attributed to the push and pull kernels, which produce responses of comparable magnitude in such regions, thereby neutralizing each other. The incorporation of the PushPull-Conv into ResNets significantly increases their robustness to image corruption. Our experiments with benchmark corruption datasets show that the PushPull-Conv can be combined with other data augmentation techniques to further improve model robustness. We set a new robustness benchmark on ResNet50 achieving an mCE of 49.95 % on ImageNet-C when combining PRIME augmentation with PushPull inhibition.
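A 1D numerical sketch of the push-pull idea. The pull kernel here is simply a smoothed, sign-inverted copy of the push kernel; the kernel sizes and smoothing are assumptions, not the paper's exact construction. The key property is that inhibition can only suppress the rectified push response, never amplify it.

```python
import numpy as np

def pushpull_response(signal, push_kernel, alpha=1.0):
    """Rectified push response minus alpha times the rectified response
    of a widened, inverted (pull) kernel -- push-pull inhibition in 1D."""
    pull_kernel = -np.convolve(push_kernel, np.ones(3) / 3, mode="same")
    push = np.maximum(np.convolve(signal, push_kernel, mode="same"), 0)
    pull = np.maximum(np.convolve(signal, pull_kernel, mode="same"), 0)
    return push - alpha * pull

edge = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])   # falling step edge
kernel = np.array([-1.0, 0.0, 1.0])                # simple edge detector
response = pushpull_response(edge, kernel)
push_only = np.maximum(np.convolve(edge, kernel, mode="same"), 0)
```

Where the opposite-contrast pattern appears, the pull branch fires and drives the response below the plain rectified convolution, which is the inhibition effect the unit is built around.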

[CV-53] AEye: A Visualization Tool for Image Datasets IEEE-VIS2024

链接: https://arxiv.org/abs/2408.04072
作者: Florian Grötschla,Luca A. Lanzendörfer,Marco Calzavara,Roger Wattenhofer
关键词-EN: alongside architectural considerations, biases alongside architectural, influencing model capabilities, significantly influencing model, machine learning models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted at IEEE VIS 2024

点击查看摘要

Abstract:Image datasets serve as the foundation for machine learning models in computer vision, significantly influencing model capabilities, performance, and biases alongside architectural considerations. Therefore, understanding the composition and distribution of these datasets has become increasingly crucial. To address the need for intuitive exploration of these datasets, we propose AEye, an extensible and scalable visualization tool tailored to image datasets. AEye utilizes a contrastively trained model to embed images into semantically meaningful high-dimensional representations, facilitating data clustering and organization. To visualize the high-dimensional representations, we project them onto a two-dimensional plane and arrange images in layers so users can seamlessly navigate and explore them interactively. AEye facilitates semantic search functionalities for both text and image queries, enabling users to search for content. We open-source the codebase for AEye, and provide a simple configuration to add datasets.
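The projection step can be sketched with plain PCA via SVD. AEye's actual projection method is not specified in the abstract, so treat this as one reasonable choice rather than the tool's implementation.

```python
import numpy as np

def project_2d(embeddings):
    """Project N x D embeddings onto their top-2 principal components."""
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are the principal directions; keep the first two.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

rng = np.random.default_rng(2)
# Synthetic "embeddings" with variance concentrated in two directions.
data = rng.normal(size=(200, 16)) * np.r_[10.0, 10.0, [0.1] * 14]
coords = project_2d(data)
```

When most of the variance lives in a low-dimensional subspace, as contrastively trained embeddings often arrange, the 2D layout preserves nearly all of it.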

[CV-54] Task-oriented Sequential Grounding in 3D Scenes

链接: https://arxiv.org/abs/2408.04034
作者: Zhuofan Zhang,Ziyu Zhu,Pengxiang Li,Tengyu Liu,Xiaojian Ma,Yixin Chen,Baoxiong Jia,Siyuan Huang,Qing Li
关键词-EN: embodied artificial intelligence, Grounding natural language, language in physical, environments is essential, artificial intelligence
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: website: this https URL

点击查看摘要

Abstract:Grounding natural language in physical 3D environments is essential for the advancement of embodied artificial intelligence. Current datasets and models for 3D visual grounding predominantly focus on identifying and localizing objects from static, object-centric descriptions. These approaches do not adequately address the dynamic and sequential nature of task-oriented grounding necessary for practical applications. In this work, we propose a new task: Task-oriented Sequential Grounding in 3D scenes, wherein an agent must follow detailed step-by-step instructions to complete daily activities by locating a sequence of target objects in indoor scenes. To facilitate this task, we introduce SG3D, a large-scale dataset containing 22,346 tasks with 112,236 steps across 4,895 real-world 3D scenes. The dataset is constructed using a combination of RGB-D scans from various 3D scene datasets and an automated task generation pipeline, followed by human verification for quality assurance. We adapted three state-of-the-art 3D visual grounding models to the sequential grounding task and evaluated their performance on SG3D. Our results reveal that while these models perform well on traditional benchmarks, they face significant challenges with task-oriented sequential grounding, underscoring the need for further research in this area.

[CV-55] Image-to-LaTeX Converter for Mathematical Formulas and Text

链接: https://arxiv.org/abs/2408.04015
作者: Daniil Gurgurov,Aleksey Morshnev
关键词-EN: vision encoder-decoder model, Swin Transformer encoder, train a vision, vision encoder-decoder, generate LaTeX code
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注: 4 pages

点击查看摘要

Abstract:In this project, we train a vision encoder-decoder model to generate LaTeX code from images of mathematical formulas and text. Utilizing a diverse collection of image-to-LaTeX data, we build two models: a base model with a Swin Transformer encoder and a GPT-2 decoder, trained on machine-generated images, and a fine-tuned version enhanced with Low-Rank Adaptation (LoRA) trained on handwritten formulas. We then compare the BLEU performance of our specialized model on a handwritten test set with other similar models, such as Pix2Text, TexTeller, and Sumen. Through this project, we contribute open-source models for converting images to LaTeX and provide from-scratch code for building these models with distributed training and GPU optimizations.

[CV-56] HiRISE: High-Resolution Image Scaling for Edge ML via In-Sensor Compression and Selective ROI

链接: https://arxiv.org/abs/2408.03956
作者: Brendan Reidy,Sepehr Tabrizchi,Mohamadreza Mohammadi,Shaahin Angizi,Arman Roohi,Ramtin Zand
关键词-EN: machine learning, powered by machine, researchers have directed, directed their focus, IoT devices powered
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages, 8 figures

点击查看摘要

Abstract:With the rise of tiny IoT devices powered by machine learning (ML), many researchers have directed their focus toward compressing models to fit on tiny edge devices. Recent works have achieved remarkable success in compressing ML models for object detection and image classification on microcontrollers with small memory, e.g., 512kB SRAM. However, there remain many challenges prohibiting the deployment of ML systems that require high-resolution images. Due to fundamental limits in memory capacity for tiny IoT devices, it may be physically impossible to store large images without external hardware. To this end, we propose a high-resolution image scaling system for edge ML, called HiRISE, which is equipped with selective region-of-interest (ROI) capability leveraging analog in-sensor image scaling. Our methodology not only significantly reduces the peak memory requirements, but also achieves up to 17.7x reduction in data transfer and energy consumption.
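The data-reduction arithmetic can be sketched as follows. The frame size, scale factor, and ROI size are made-up numbers, not HiRISE's reported configuration: transmit a coarsely subsampled full frame plus one full-resolution ROI crop instead of the whole high-resolution image.

```python
import numpy as np

def hirise_style_readout(frame, roi, scale=4):
    """Return (coarse full frame, full-res ROI crop) for a frame and
    roi = (row, col, height, width)."""
    r, c, h, w = roi
    coarse = frame[::scale, ::scale]   # stand-in for in-sensor scaling
    crop = frame[r:r + h, c:c + w]     # only the ROI keeps full resolution
    return coarse, crop

frame = np.zeros((256, 256), dtype=np.uint8)
coarse, crop = hirise_style_readout(frame, roi=(64, 64, 32, 32))
reduction = frame.size / (coarse.size + crop.size)
```

For these illustrative numbers the readout shrinks by roughly 13x, which is the same mechanism behind the paper's reported reductions in data transfer and peak memory.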

[CV-57] Histopathology image embedding based on foundation models features aggregation for patient treatment response prediction MICCAI2024

链接: https://arxiv.org/abs/2408.03954
作者: Bilel Guetarni,Feryal Windal,Halim Benhabiles,Mahfoud Chaibi,Romain Dubois,Emmanuelle Leteurtre,Dominique Collard
关键词-EN: high interest, predicting Diffuse Large, foundation models, Lymphoma patients treatment, patients treatment response
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at MICCAI 2024 workshop MOVI

点击查看摘要

Abstract:Predicting the response of a patient to a cancer treatment is of high interest. Nonetheless, this task is still challenging from a medical point of view due to the complexity of the interaction between the patient's organism and the considered treatment. Recent works on foundation models pre-trained with self-supervised learning on large-scale unlabeled histopathology datasets have opened a new direction towards the development of new methods for cancer diagnosis related tasks. In this article, we propose a novel methodology for predicting Diffuse Large B-Cell Lymphoma patients' treatment response from Whole Slide Images. Our method exploits several foundation models as feature extractors to obtain a local representation of the image corresponding to a small region of the tissue; then, a global representation of the image is obtained by aggregating these local representations using attention-based Multiple Instance Learning. Our experimental study, conducted on a dataset of 152 patients, shows the promising results of our methodology, notably by highlighting the advantage of using foundation models compared to conventional ImageNet pre-training. Moreover, the obtained results clearly demonstrate the potential of foundation models for characterizing histopathology images and generating more suitable semantic representations for this task.
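The aggregation step can be sketched as attention-based MIL pooling in the style of Ilse et al. The weight shapes below are assumptions; the paper's exact attention parametrization is not given in the abstract.

```python
import numpy as np

def attention_mil_pool(patch_feats, v, w):
    """Score each patch embedding, softmax over patches, and return the
    attention-weighted sum as the slide-level representation."""
    scores = np.tanh(patch_feats @ v) @ w       # one scalar per patch
    scores -= scores.max()                      # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ patch_feats, weights

rng = np.random.default_rng(3)
patches = rng.normal(size=(12, 32))   # 12 tissue patches, 32-dim features
v = rng.normal(size=(32, 8))
w = rng.normal(size=8)
slide_embedding, weights = attention_mil_pool(patches, v, w)
```

Because the weights form a distribution over patches, the model can concentrate the slide embedding on the few diagnostically relevant regions while ignoring background tissue.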

[CV-58] Taxonomy Driven Fast Adversarial Training AAAI

链接: https://arxiv.org/abs/2408.03944
作者: Kun Tong,Chengze Jiang,Jie Gui,Yuan Cao
关键词-EN: effective defense method, effective defense, Adversarial, Adversarial training, neural networks
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: This paper is accepted by AAAI

点击查看摘要

Abstract:Adversarial training (AT) is an effective defense method against gradient-based attacks to enhance the robustness of neural networks. Among them, single-step AT has emerged as a hotspot topic due to its simplicity and efficiency, requiring only one gradient propagation in generating adversarial examples. Nonetheless, the problem of catastrophic overfitting (CO) that causes training collapse remains poorly understood, and there exists a gap between the robust accuracy achieved through single- and multi-step AT. In this paper, we present a surprising finding that the taxonomy of adversarial examples reveals the truth of CO. Based on this conclusion, we propose taxonomy driven fast adversarial training (TDAT), which jointly optimizes the learning objective, loss function, and initialization method, and can thereby be regarded as a new paradigm of single-step AT. Compared with other fast AT methods, TDAT can boost the robustness of neural networks, alleviate the influence of misclassified examples, and prevent CO during the training process while requiring almost no additional computational and memory resources. Our method achieves robust accuracy improvements of 1.59%, 1.62%, 0.71%, and 1.26% on the CIFAR-10, CIFAR-100, Tiny ImageNet, and ImageNet-100 datasets against the projected gradient descent (PGD10) attack with a perturbation budget of 8/255. Furthermore, our proposed method also achieves state-of-the-art robust accuracy against other attacks. Code is available at this https URL.
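Single-step adversarial example generation, which this line of work builds on, reduces to one FGSM-style update. This is a generic sketch, not TDAT's joint optimization of objective, loss, and initialization.

```python
import numpy as np

def fgsm_step(x, loss_grad, eps=8 / 255):
    """One-step adversarial example: move eps along the sign of the
    loss gradient, then clip back to the valid pixel range [0, 1]."""
    x_adv = x + eps * np.sign(loss_grad)
    return np.clip(x_adv, 0.0, 1.0)

x = np.linspace(0.0, 1.0, 11)              # a row of "pixel" values
grad = np.array([1.0, -1.0] * 5 + [1.0])   # hypothetical loss gradient
x_adv = fgsm_step(x, grad)
```

The single gradient propagation is what makes this cheap, and also what makes catastrophic overfitting possible: the defense only ever sees this one family of perturbations during training.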

[CV-59] HOAA: Hybrid Overestimating Approximate Adder for Enhanced Performance Processing Engine

链接: https://arxiv.org/abs/2408.00806
作者: Omkar Kokane,Prabhat Sati,Mukul Lokhande,Santosh Kumar Vishvakarma
关键词-EN: Hybrid Overestimating Approximate, Overestimating Approximate Adder, Hybrid Overestimating, Approximate Adder designed, presents the Hybrid
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper presents the Hybrid Overestimating Approximate Adder, designed to enhance performance in processing engines, specifically focused on edge AI applications. A novel Plus One Adder design is proposed as an incremental adder in the RCA chain, incorporating a Full Adder with an excess 1 alongside inputs A, B, and Cin. The design approximates outputs to 2-bit values to reduce hardware complexity and improve resource efficiency. The Plus One Adder is integrated into a dynamically reconfigurable HOAA, allowing runtime interchangeability between accurate and approximate overestimation modes. The proposed design is demonstrated for multiple applications, such as two's complement subtraction, rounding to even, and the configurable activation function, which are critical components of the processing engine. Our approach shows a 21 percent improvement in area efficiency and a 33 percent reduction in power consumption compared to state-of-the-art designs, with minimal accuracy loss. Thus, the proposed HOAA could be a promising solution for resource-constrained environments, offering an ideal trade-off between hardware efficiency and computational accuracy.
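To illustrate the overestimation trade-off, here is a generic overestimating scheme, not the HOAA/Plus One circuit itself: injecting a carry of 1 into an exact high-bit addition and hard-wiring the k low bits to 1 guarantees the error is always non-negative and tightly bounded.

```python
def overestimate_add(a, b, k=2):
    """Generic overestimating approximate adder: exact high bits plus an
    injected carry (the "plus one"), low k bits forced to 1."""
    low_ones = (1 << k) - 1
    high = ((a >> k) + (b >> k) + 1) << k   # exact high add, +1 carry-in
    return high | low_ones
```

For unsigned operands the result always exceeds the exact sum by between 1 and 2^(k+1) - 1, so errors never undershoot. One-sided error of this kind is what makes overestimating modes attractive for rounding and activation hardware, where a known error direction is easier to compensate than a symmetric one.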

[CV-60] Quantifying the Impact of Population Shift Across Age and Sex for Abdominal Organ Segmentation MICCAI2024

链接: https://arxiv.org/abs/2408.04610
作者: Kate Čevora,Ben Glocker,Wenjia Bai
关键词-EN: Deep learning-based medical, Deep learning-based, learning-based medical image, clinical practice, tremendous progress
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: This paper has been accepted for publication by the MICCAI 2024 Fairness of AI in Medical Imaging (FAIMI) Workshop

点击查看摘要

Abstract:Deep learning-based medical image segmentation has seen tremendous progress over the last decade, but there is still relatively little transfer into clinical practice. One of the main barriers is the challenge of domain generalisation, which requires segmentation models to maintain high performance across a wide distribution of image data. This challenge is amplified by the many factors that contribute to the diverse appearance of medical images, such as acquisition conditions and patient characteristics. The impact of shifting patient characteristics such as age and sex on segmentation performance remains relatively under-studied, especially for abdominal organs, despite that this is crucial for ensuring the fairness of the segmentation model. We perform the first study to determine the impact of population shift with respect to age and sex on abdominal CT image segmentation, by leveraging two large public datasets, and introduce a novel metric to quantify the impact. We find that population shift is a challenge similar in magnitude to cross-dataset shift for abdominal organ segmentation, and that the effect is asymmetric and dataset-dependent. We conclude that dataset diversity in terms of known patient characteristics is not necessarily equivalent to dataset diversity in terms of image features. This implies that simple population matching to ensure good generalisation and fairness may be insufficient, and we recommend that fairness research should be directed towards better understanding and quantifying medical image dataset diversity in terms of performance-relevant characteristics such as organ morphology.
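The kind of comparison the study performs can be sketched with a naive subgroup-gap measure. The paper introduces its own metric, which this is not; Dice overlap and the best-minus-worst gap below are just the standard building blocks.

```python
import numpy as np

def dice(pred, gt):
    """Dice overlap between two binary segmentation masks."""
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum())

def subgroup_gap(scores, groups):
    """Best-minus-worst mean score across subgroups (e.g. sex, age bin)."""
    means = {g: np.mean([s for s, gr in zip(scores, groups) if gr == g])
             for g in set(groups)}
    return max(means.values()) - min(means.values())

mask = np.array([1, 1, 0, 0], dtype=bool)
scores = [dice(mask, mask), 0.9, 0.7, 0.7]   # per-patient Dice scores
groups = ["F", "F", "M", "M"]                 # patient subgroup labels
gap = subgroup_gap(scores, groups)
```

A large gap flags exactly the asymmetric, dataset-dependent population-shift effect the abstract describes: equal aggregate accuracy can hide systematically worse performance for one subgroup.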

[CV-61] Deep Transfer Learning for Kidney Cancer Diagnosis

链接: https://arxiv.org/abs/2408.04318
作者: Yassine Habchi,Hamza Kheddar,Yassine Himeur,Abdelkrim Boukabou,Shadi Atalla,Wathiq Mansoor,Hussain Al-Ahmad
关键词-EN: including lifestyle choices, global societies stem, incurable diseases prevalent, social factors, including lifestyle
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 32 pages, 8 figures and 8 tables

点击查看摘要

Abstract:Many incurable diseases prevalent across global societies stem from various influences, including lifestyle choices, economic conditions, social factors, and genetics. Research predominantly focuses on these diseases due to their widespread nature, aiming to decrease mortality, enhance treatment options, and improve healthcare standards. Among these, kidney disease stands out as a particularly severe condition affecting men and women worldwide. Nonetheless, there is a pressing need for continued research into innovative, early diagnostic methods to develop more effective treatments for such diseases. Recently, automatic diagnosis of Kidney Cancer has become an important challenge especially when using deep learning (DL) due to the importance of training medical datasets, which in most cases are difficult and expensive to obtain. Furthermore, in most cases, algorithms require data from the same domain and a powerful computer with efficient storage capacity. To overcome this issue, a new type of learning known as transfer learning (TL) has been proposed that can produce impressive results based on other different pre-trained data. This paper presents, to the best of the authors’ knowledge, the first comprehensive survey of DL-based TL frameworks for kidney cancer diagnosis. This is a strong contribution to help researchers understand the current challenges and perspectives of this topic. Hence, the main limitations and advantages of each framework are identified and detailed critical analyses are provided. Looking ahead, the article identifies promising directions for future research. Moving on, the discussion is concluded by reflecting on the pivotal role of TL in the development of precision medicine and its effects on clinical practice and research in oncology.

[CV-62] An Explainable Non-local Network for COVID-19 Diagnosis

链接: https://arxiv.org/abs/2408.04300
作者: Jingfu Yang,Peng Huang,Jing Hu,Shu Hu,Siwei Lyu,Xin Wang,Jun Guo,Xi Wu
关键词-EN: attention non-local network, attention module, attention, attention non-local, medical images
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:CNNs have achieved excellent results in the automatic classification of medical images. In this study, we propose a novel deep residual 3D attention non-local network (NL-RAN) to classify CT images including COVID-19, common pneumonia, and normal cases, enabling rapid and explainable COVID-19 diagnosis. We built a deep residual 3D attention non-local network that can be trained end-to-end. The network embeds a non-local module to capture global information, while a 3D attention module focuses on the details of the lesion, so that it can directly analyze 3D lung CT scans and output classification results. The output of the attention module can be used as a heat map to increase the interpretability of the model. 4079 3D CT scans were included in this study. Each scan had a unique label (novel coronavirus pneumonia, common pneumonia, or normal). The CT scan cohort was randomly split into a training set of 3263 scans, a validation set of 408 scans, and a testing set of 408 scans. We compare against existing mainstream classification methods, such as CovNet, CBAM, and ResNet, and compare the visualization results with methods such as CAM. Model performance was evaluated using the Area Under the ROC Curve (AUC), precision, and F1-score. NL-RAN achieved an AUC of 0.9903, a precision of 0.9473, and an F1-score of 0.9462, surpassing all compared classification methods. The heat map output by the attention module is also clearer than that output by CAM. Our experimental results indicate that the proposed method performs significantly better than existing methods. In addition, the first attention module outputs a heat map containing detailed outline information, increasing the interpretability of the model. Our experiments also indicate that inference with our model is fast and can provide real-time assistance with diagnosis.
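The core non-local operation such a network embeds can be sketched as dot-product self-attention over flattened positions with a residual connection. This is a generic non-local block in the sense of Wang et al., not NL-RAN's exact 3D module.

```python
import numpy as np

def nonlocal_block(x):
    """x: (N, C) features at N positions. Every position aggregates
    information from all others via softmax dot-product affinities."""
    attn = x @ x.T                             # pairwise affinities
    attn -= attn.max(axis=1, keepdims=True)    # numerical stability
    attn = np.exp(attn)
    attn /= attn.sum(axis=1, keepdims=True)    # rows sum to 1
    return x + attn @ x                        # residual connection

x = np.ones((4, 3))        # identical features at every position
y = nonlocal_block(x)
```

With identical features everywhere the attention is uniform and the block simply doubles the input through the residual path; with varied features each position is pulled toward globally similar positions, which is how the module captures long-range context that local 3D convolutions miss.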

[CV-63] Efficient and Accurate Pneumonia Detection Using a Novel Multi-Scale Transformer Approach

链接: https://arxiv.org/abs/2408.04290
作者: Alireza Saber,Pouria Parhami,Alimihammad Siahkarzadeh,Amirreza Fateh
关键词-EN: severe respiratory disease, significant diagnostic challenges, poses significant diagnostic, chest X-rays, respiratory disease
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Pneumonia, a severe respiratory disease, poses significant diagnostic challenges, especially in underdeveloped regions. Traditional diagnostic methods, such as chest X-rays, suffer from variability in interpretation among radiologists, necessitating reliable automated tools. In this study, we propose a novel approach combining deep learning and transformer-based attention mechanisms to enhance pneumonia detection from chest X-rays. Our method begins with lung segmentation using a TransUNet model that integrates our specialized transformer module, which has fewer parameters compared to common transformers while maintaining performance. This model is trained on the “Chest Xray Masks and Labels” dataset and then applied to the Kermany and Cohen datasets to isolate lung regions, enhancing subsequent classification tasks. For classification, we employ pre-trained ResNet models (ResNet-50 and ResNet-101) to extract multi-scale feature maps, processed through our modified transformer module. By employing our specialized transformer, we attain superior results with significantly fewer parameters compared to common transformer models. Our approach achieves high accuracy rates of 92.79% on the Kermany dataset and 95.11% on the Cohen dataset, ensuring robust and efficient performance suitable for resource-constrained environments. this https URL

[CV-64] SG-JND: Semantic-Guided Just Noticeable Distortion Predictor For Image Compression ICIP2024

链接: https://arxiv.org/abs/2408.04273
作者: Linhan Cao,Wei Sun,Xiongkuo Min,Jun Jia,Zicheng Zhang,Zijian Chen,Yucheng Zhu,Lizhou Liu,Qiubo Chen,Jing Chen,Guangtao Zhai
关键词-EN: human visual system, transmission bit rate, image compression algorithms, JND, JND prediction
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by ICIP 2024

点击查看摘要

Abstract:Just noticeable distortion (JND), representing the threshold of distortion in an image that is minimally perceptible to the human visual system (HVS), is crucial for image compression algorithms to achieve a trade-off between transmission bit rate and image quality. However, traditional JND prediction methods only rely on pixel-level or sub-band level features, lacking the ability to capture the impact of image content on JND. To bridge this gap, we propose a Semantic-Guided JND (SG-JND) network to leverage semantic information for JND prediction. In particular, SG-JND consists of three essential modules: the image preprocessing module extracts semantic-level patches from images, the feature extraction module extracts multi-layer features by utilizing cross-scale attention layers, and the JND prediction module regresses the extracted features into the final JND value. Experimental results show that SG-JND achieves state-of-the-art performance on two publicly available JND datasets, demonstrating the effectiveness of SG-JND and highlighting the significance of incorporating semantic information in JND assessment.
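For context on the pixel-level baselines the abstract contrasts with: classical JND models start from a luminance-adaptation threshold, often the piecewise Chou-Li-style curve below (the constants are the textbook ones, not values from this paper):

```python
import numpy as np

def luminance_jnd(bg):
    """Classic luminance-masking JND threshold for a background level bg in [0, 255].

    Dark regions tolerate more distortion than mid-grey ones, and thresholds
    rise again slowly for bright backgrounds (Weber-like behaviour).
    """
    bg = np.asarray(bg, dtype=float)
    dark = 17.0 * (1.0 - np.sqrt(bg / 127.0)) + 3.0  # branch for bg <= 127
    bright = (3.0 / 128.0) * (bg - 127.0) + 3.0      # branch for bg > 127
    return np.where(bg <= 127, dark, bright)

thresholds = luminance_jnd([0, 127, 255])  # highest in the dark, lowest at mid-grey
```

SG-JND's point is precisely that such formulas see only local pixel statistics, whereas human sensitivity also depends on what the image content is.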

[CV-65] Physical prior guided cooperative learning framework for joint turbulence degradation estimation and infrared video restoration

链接: https://arxiv.org/abs/2408.04227
作者: Ziran Zhang,Yuhang Tang,Zhigang Wang,Yueting Chen,Bin Zhao
关键词-EN: Prior Guided Cooperative, Guided Cooperative Learning, Physical Prior Guided, turbulence strength, turbulence strength estimation
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 21

点击查看摘要

Abstract:Infrared imaging and turbulence strength measurements are in widespread demand in many fields. This paper introduces a Physical Prior Guided Cooperative Learning (P2GCL) framework to jointly enhance atmospheric turbulence strength estimation and infrared image restoration. P2GCL involves a cyclic collaboration between two models: TMNet measures turbulence strength and outputs the refractive index structure constant (Cn2) as a physical prior, while TRNet conducts infrared image sequence restoration based on Cn2 and feeds the restored images back to TMNet to boost measurement accuracy. A novel Cn2-guided frequency loss function and a physical constraint loss are introduced to align the training process with physical theories. Experiments demonstrate P2GCL achieves the best performance for both turbulence strength estimation (improving Cn2 MAE by 0.0156, enhancing R2 by 0.1065) and image restoration (enhancing PSNR by 0.2775 dB), validating the significant impact of physical prior guided cooperative learning.
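The Cn2 gains above are reported as MAE and R2; for reference, these metrics are computed as follows (standard definitions, not code from the paper):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error between predictions and targets."""
    return np.mean(np.abs(np.asarray(y_true, float) - np.asarray(y_pred, float)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (1.0 is a perfect fit)."""
    y_true = np.asarray(y_true, float)
    y_pred = np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0, 4.0])
perfect = r2(y, y)          # exact predictions give R2 = 1.0
err = mae(y, y + 0.5)       # a constant 0.5 offset gives MAE = 0.5
```

So "improving Cn2 MAE by 0.0156" means the average absolute Cn2 estimation error dropped by that amount, and "enhancing R2 by 0.1065" means the estimator explains correspondingly more of the variance in the true turbulence strength.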

[CV-66] Is SAM 2 Better than SAM in Medical Image Segmentation?

链接: https://arxiv.org/abs/2408.04212
作者: Sourya Sengupta,Satrajit Chakrabarty,Ravi Soni
关键词-EN: demonstrated impressive performance, SAM, Segment Anything Model, Model, recently released Segment
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The Segment Anything Model (SAM) demonstrated impressive performance in zero-shot promptable segmentation on natural images. The recently released Segment Anything Model 2 (SAM 2) claims better performance than SAM on images while extending the model’s capabilities to video segmentation. It is important to evaluate the new model’s zero-shot promptable segmentation ability on medical images. In this work, we performed extensive studies with multiple datasets from different imaging modalities to compare the performance of SAM and SAM 2. We used two point-prompt strategies: (i) a single positive prompt near the centroid of the target structure and (ii) additional positive prompts placed randomly within the target structure. The evaluation included 21 unique organ-modality combinations, including abdominal structures, cardiac structures, and fetal head images acquired from publicly available MRI, CT, and ultrasound datasets. The preliminary results, based on 2D images, indicate that while SAM 2 may perform slightly better in a few cases, it does not in general surpass SAM for medical image segmentation. Especially when contrast is low, as in CT and ultrasound images, SAM 2 performs worse than SAM. For MRI images, SAM 2 performs on par with or better than SAM. Like SAM, SAM 2 also suffers from over-segmentation, especially when the boundaries of the organ to be segmented are fuzzy.
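The abstract does not name its scoring metric, but zero-shot segmentation comparisons like this are conventionally reported with the Dice coefficient between the predicted and ground-truth masks; a minimal sketch:

```python
import numpy as np

def dice_score(pred, gt, eps=1e-8):
    """Dice similarity between two binary masks (1 = foreground).

    2 * |pred AND gt| / (|pred| + |gt|); eps guards the empty-mask case."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    inter = np.logical_and(pred, gt).sum()
    return (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)

# Toy 4x4 masks: the prediction covers 2 of the 3 ground-truth pixels.
gt = np.zeros((4, 4), dtype=int); gt[1, 1:4] = 1
pred = np.zeros((4, 4), dtype=int); pred[1, 1:3] = 1
score = dice_score(pred, gt)   # 2*2 / (2+3) = 0.8
```

Over-segmentation of the kind described (the mask spilling past fuzzy organ boundaries) inflates `pred.sum()` without adding intersection, which directly drags the Dice score down.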

[CV-67] Efficient Single Image Super-Resolution with Entropy Attention and Receptive Field Augmentation ACM-MM2024

链接: https://arxiv.org/abs/2408.04158
作者: Xiaole Zhao,Linze Li,Chengxing Xie,Xiaoming Zhang,Ting Jiang,Wenjie Lin,Shuaicheng Liu,Tianrui Li
关键词-EN: lightweight SISR tasks, single image super-resolution, lightweight SISR, SISR tasks, Transformer-based deep models
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to ACM MM 2024

点击查看摘要

Abstract:Transformer-based deep models for single image super-resolution (SISR) have greatly improved the performance of lightweight SISR tasks in recent years. However, they often suffer from heavy computational burden and slow inference due to the complex calculation of multi-head self-attention (MSA), seriously hindering their practical application and deployment. In this work, we present an efficient SR model to mitigate the dilemma between model efficiency and SR performance, which is dubbed Entropy Attention and Receptive Field Augmentation network (EARFA), and composed of a novel entropy attention (EA) and a shifting large kernel attention (SLKA). From the perspective of information theory, EA increases the entropy of intermediate features conditioned on a Gaussian distribution, providing more informative input for subsequent reasoning. On the other hand, SLKA extends the receptive field of SR models with the assistance of channel shifting, which also favors to boost the diversity of hierarchical features. Since the implementation of EA and SLKA does not involve complex computations (such as extensive matrix multiplications), the proposed method can achieve faster nonlinear inference than Transformer-based SR models while maintaining better SR performance. Extensive experiments show that the proposed model can significantly reduce the delay of model inference while achieving the SR performance comparable with other advanced models.
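Two ingredients from the abstract can be sketched cheaply: the differential entropy of a feature channel under the paper's Gaussian assumption, and the channel shifting that SLKA uses to mix information across channels. Both snippets are illustrative stand-ins, not the authors' implementation:

```python
import numpy as np

def gaussian_entropy(channel):
    """Differential entropy of a channel modelled as Gaussian:
    H = 0.5 * log(2 * pi * e * var). Higher-variance channels carry more information."""
    var = np.var(channel)
    return 0.5 * np.log(2.0 * np.pi * np.e * var)

def shift_channels(x, offset=1):
    """Cyclically shift the channel axis of a (C, H, W) feature map, so a
    following spatial convolution effectively sees neighbouring channels too."""
    return np.roll(x, shift=offset, axis=0)

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 5, 5))
h_low = gaussian_entropy(0.1 * feat[0])   # low-variance channel
h_high = gaussian_entropy(feat[0])        # same channel at 10x the scale
shifted = shift_channels(feat)            # channel 0 now sits at index 1
```

Neither operation involves the large matrix multiplications of multi-head self-attention, which is the source of the claimed inference speedup.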

[CV-68] The Quest for Early Detection of Retinal Disease: 3D CycleGAN-based Translation of Optical Coherence Tomography into Confocal Microscopy

链接: https://arxiv.org/abs/2408.04091
作者: Xin Tian,Nantheera Anantrasirichai,Lindsay Nicholson,Alin Achim
关键词-EN: Optical coherence tomography, offering distinct advantages, vivo confocal microscopy, Optical coherence, confocal microscopy
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 30 pages, 11 figures, 5 tables

点击查看摘要

Abstract:Optical coherence tomography (OCT) and confocal microscopy are pivotal in retinal imaging, offering distinct advantages and limitations. In vivo OCT offers rapid, non-invasive imaging but can suffer from clarity issues and motion artifacts, while ex vivo confocal microscopy, providing high-resolution, cellular-detailed color images, is invasive and raises ethical concerns. To bridge the benefits of both modalities, we propose a novel framework based on unsupervised 3D CycleGAN for translating unpaired in vivo OCT to ex vivo confocal microscopy images. This marks the first attempt to exploit the inherent 3D information of OCT and translate it into the rich, detailed color domain of confocal microscopy. We also introduce a unique dataset, OCT2Confocal, comprising mouse OCT and confocal retinal images, facilitating the development of and establishing a benchmark for cross-modal image translation research. Our model has been evaluated both quantitatively and qualitatively, achieving Fréchet Inception Distance (FID) scores of 0.766 and Kernel Inception Distance (KID) scores as low as 0.153, and leading subjective Mean Opinion Scores (MOS). Our model demonstrated superior image fidelity and quality with limited data over existing methods. Our approach effectively synthesizes color information from 3D confocal images, closely approximating target outcomes and suggesting enhanced potential for diagnostic and monitoring applications in ophthalmology.
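The FID scores above measure the Fréchet distance between Gaussian fits of two sets of image embeddings. The full metric needs a matrix square root of the covariance product; assuming diagonal covariances (a simplification for illustration) it collapses to a few NumPy lines:

```python
import numpy as np

def fid_diagonal(x, y):
    """Fréchet distance between Gaussian fits of two embedding sets,
    assuming diagonal covariances (the full FID uses a matrix square root).

    x, y: arrays of shape (n_samples, dim)."""
    mu1, mu2 = x.mean(axis=0), y.mean(axis=0)
    v1, v2 = x.var(axis=0), y.var(axis=0)
    mean_term = np.sum((mu1 - mu2) ** 2)
    cov_term = np.sum((np.sqrt(v1) - np.sqrt(v2)) ** 2)  # Tr-term for diagonals
    return mean_term + cov_term

rng = np.random.default_rng(0)
a = rng.standard_normal((2000, 8))
b = rng.standard_normal((2000, 8)) + 1.0   # distribution shifted by 1 per dim
same = fid_diagonal(a, a)                  # identical sets -> exactly 0
shifted = fid_diagonal(a, b)               # roughly dim * 1.0^2 = 8
```

Lower is better for both FID and KID, which is why the paper's 0.766 FID and 0.153 KID indicate translated images whose embedding statistics sit close to real confocal microscopy.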

[CV-69] Multi-scale structural complexity as a quantitative measure of visual complexity

链接: https://arxiv.org/abs/2408.04076
作者: Anna Kravchenko,Andrey A. Bagrov,Mikhail I. Katsnelson,Veronica Dudarev
关键词-EN: quantify formally, complexity, MSSC, defines structural complexity, concept of visual
类目: Physics and Society (physics.soc-ph); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
*备注: 16 pages, 11 figures, 2 tables

点击查看摘要

Abstract:While intuitive for humans, the concept of visual complexity is hard to define and quantify formally. We suggest adopting the multi-scale structural complexity (MSSC) measure, an approach that defines structural complexity of an object as the amount of dissimilarities between distinct scales in its hierarchical organization. In this work, we apply MSSC to the case of visual stimuli, using an open dataset of images with subjective complexity scores obtained from human participants (SAVOIAS). We demonstrate that MSSC correlates with subjective complexity on par with other computational complexity measures, while being more intuitive by definition, consistent across categories of images, and easier to compute. We discuss objective and subjective elements inherently present in human perception of complexity and the domains where the two are more likely to diverge. We show how the multi-scale nature of MSSC allows further investigation of complexity as it is perceived by humans.
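MSSC's central idea, comparing an image against coarse-grained versions of itself across scales, can be sketched as below. This is a simplification using mean-squared dissimilarity between consecutive scales; the paper's exact overlap-based definition differs:

```python
import numpy as np

def coarse_grain(img, factor=2):
    """Block-average an image by `factor` per axis, then upsample back with
    nearest-neighbour so consecutive scales can be compared pixel to pixel."""
    h, w = img.shape
    blocks = img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))
    return np.kron(blocks, np.ones((factor, factor)))

def structural_complexity(img, n_scales=3):
    """Sum of dissimilarities between consecutive coarse-grainings."""
    total, current = 0.0, img.astype(float)
    for _ in range(n_scales):
        coarser = coarse_grain(current)
        total += np.mean((current - coarser) ** 2)
        current = coarser
    return total

rng = np.random.default_rng(0)
noisy = rng.standard_normal((16, 16))   # detail at every scale -> high score
flat = np.ones((16, 16))                # identical at all scales -> zero score
c_noisy = structural_complexity(noisy)
c_flat = structural_complexity(flat)
```

An image that looks the same at every scale (flat, or purely self-similar) accumulates no dissimilarity, which matches the intuition that it is structurally simple.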

[CV-70] Do Sharpness-based Optimizers Improve Generalization in Medical Image Analysis?

链接: https://arxiv.org/abs/2408.04065
作者: Mohamed Hassan,Aleksander Vakanski,Min Xian
关键词-EN: Effective clinical deployment, healthcare demands high, ensure accurate diagnosis, Effective clinical, deep learning models
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Effective clinical deployment of deep learning models in healthcare demands high generalization performance to ensure accurate diagnosis and treatment planning. In recent years, significant research has focused on improving the generalization of deep learning models by regularizing the sharpness of the loss landscape. Among the optimization approaches that explicitly minimize sharpness, Sharpness-Aware Minimization (SAM) has shown potential in enhancing generalization performance on general domain image datasets. This success has led to the development of several advanced sharpness-based algorithms aimed at addressing the limitations of SAM, such as Adaptive SAM, surrogate-Gap SAM, Weighted SAM, and Curvature Regularized SAM. These sharpness-based optimizers have shown improvements in model generalization compared to conventional stochastic gradient descent optimizers and their variants on general domain image datasets, but they have not been thoroughly evaluated on medical images. This work provides a review of recent sharpness-based methods for improving the generalization of deep learning networks and evaluates these methods’ performance on medical breast ultrasound images. Our findings indicate that the initial SAM method successfully enhances the generalization of various deep learning models. While Adaptive SAM improves the generalization of convolutional neural networks, it fails to do so for vision transformers. Other sharpness-based optimizers, however, do not demonstrate consistent results. The results reveal that, contrary to findings in the non-medical domain, SAM is the only recommended sharpness-based optimizer that consistently improves generalization in medical image analysis, and further research is necessary to refine the variants of SAM to enhance generalization performance in this field.
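The core SAM update the review is built around is a two-step gradient rule: first perturb the weights toward higher loss within an l2 ball of radius rho, then descend using the gradient taken at that perturbed point. A toy sketch on a quadratic loss (illustrative; real SAM wraps a base optimizer such as SGD with momentum):

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One Sharpness-Aware Minimization step.

    1) epsilon = rho * g / ||g||       (ascend to the worst nearby point)
    2) w <- w - lr * grad(w + epsilon) (descend with the sharpness-aware gradient)
    """
    g = grad_fn(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    g_sharp = grad_fn(w + eps)
    return w - lr * g_sharp

loss = lambda w: 0.5 * np.sum(w ** 2)   # toy quadratic objective
grad = lambda w: w                      # its exact gradient
w = np.array([3.0, -4.0])
for _ in range(50):
    w = sam_step(w, grad)
final_loss = loss(w)
```

Adaptive SAM, Weighted SAM, and the other variants reviewed above mainly change how epsilon is scaled per parameter, while keeping this perturb-then-descend structure.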

机器学习

[LG-0] Transformer Explainer: Interactive Learning of Text-Generative Models IEEE-VIS2024

链接: https://arxiv.org/abs/2408.04619
作者: Aeree Cho,Grace C. Kim,Alexander Karpekov,Alec Helbling,Zijie J. Wang,Seongmin Lee,Benjamin Hoover,Duen Horng Chau
关键词-EN: revolutionized machine learning, workings remain opaque, present Transformer Explainer, machine learning, revolutionized machine
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
*备注: To be presented at IEEE VIS 2024

点击查看摘要

Abstract:Transformers have revolutionized machine learning, yet their inner workings remain opaque to many. We present Transformer Explainer, an interactive visualization tool designed for non-experts to learn about Transformers through the GPT-2 model. Our tool helps users understand complex Transformer concepts by integrating a model overview and enabling smooth transitions across abstraction levels of mathematical operations and model structures. It runs a live GPT-2 instance locally in the user’s browser, empowering users to experiment with their own input and observe in real-time how the internal components and parameters of the Transformer work together to predict the next tokens. Our tool requires no installation or special hardware, broadening the public’s education access to modern generative AI techniques. Our open-sourced tool is available at this https URL. A video demo is available at this https URL.
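The live next-token view described above ultimately turns GPT-2's final logits into a probability distribution, with a temperature knob controlling how peaked it is. A minimal sketch of that last step (illustrative, not the tool's source code):

```python
import numpy as np

def next_token_probs(logits, temperature=1.0):
    """Softmax over vocabulary logits; lower temperature sharpens the distribution."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()                 # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = [2.0, 1.0, 0.1]         # toy 3-token vocabulary
p_warm = next_token_probs(logits, temperature=1.0)
p_cold = next_token_probs(logits, temperature=0.1)  # nearly all mass on token 0
```

Dragging the temperature slider in such a tool is exactly this rescaling: the ranking of tokens never changes, only how much probability the top token absorbs.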

[LG-1] Better Alignment with Instruction Back-and-Forth Translation

链接: https://arxiv.org/abs/2408.04614
作者: Thao Nguyen,Jeffrey Li,Sewoong Oh,Ludwig Schmidt,Jason Weston,Luke Zettlemoyer,Xian Li
关键词-EN: large language models, aligning large language, construct high-quality synthetic, high-quality synthetic data, synthetic data grounded
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose a new method, instruction back-and-forth translation, to construct high-quality synthetic data grounded in world knowledge for aligning large language models (LLMs). Given documents from a web corpus, we generate and curate synthetic instructions using the backtranslation approach proposed by Li et al.(2023a), and rewrite the responses to improve their quality further based on the initial documents. Fine-tuning with the resulting (backtranslated instruction, rewritten response) pairs yields higher win rates on AlpacaEval than using other common instruction datasets such as Humpback, ShareGPT, Open Orca, Alpaca-GPT4 and Self-instruct. We also demonstrate that rewriting the responses with an LLM outperforms direct distillation, and the two generated text distributions exhibit significant distinction in embedding space. Further analysis shows that our backtranslated instructions are of higher quality than other sources of synthetic instructions, while our responses are more diverse and complex than those obtained from distillation. Overall we find that instruction back-and-forth translation combines the best of both worlds – making use of the information diversity and quantity found on the web, while ensuring the quality of the responses which is necessary for effective alignment.

[LG-2] Learn To Learn More Precisely

链接: https://arxiv.org/abs/2408.04590
作者: Runxi Cheng,Yongxian Wei,Xianglong He,Wanyun Zhu,Songsong Huang,Fei Richard Yu,Fei Ma,Chun Yuan
关键词-EN: learn precise target, fast adaptation, extensively applied, learning and fast, precise target knowledge
类目: Machine Learning (cs.LG)
*备注: 10pages,4 figures, meta learning

点击查看摘要

Abstract:Meta-learning has been extensively applied in the domains of few-shot learning and fast adaptation, achieving remarkable performance. While Meta-learning methods like Model-Agnostic Meta-Learning (MAML) and its variants provide a good set of initial parameters for the model, the model still tends to learn shortcut features, which leads to poor generalization. In this paper, we propose the formal conception of “learn to learn more precisely”, which aims to make the model learn precise target knowledge from data and reduce the effect of noisy knowledge, such as background and noise. To achieve this target, we proposed a simple and effective meta-learning framework named Meta Self-Distillation(MSD) to maximize the consistency of learned knowledge, enhancing the models’ ability to learn precise target knowledge. In the inner loop, MSD uses different augmented views of the same support data to update the model respectively. Then in the outer loop, MSD utilizes the same query data to optimize the consistency of learned knowledge, enhancing the model’s ability to learn more precisely. Our experiment demonstrates that MSD exhibits remarkable performance in few-shot classification tasks in both standard and augmented scenarios, effectively boosting the accuracy and consistency of knowledge learned by the model.

[LG-3] Sampling for View Synthesis: From Local Light Field Fusion to Neural Radiance Fields and Beyond

链接: https://arxiv.org/abs/2408.04586
作者: Ravi Ramamoorthi
关键词-EN: complex real-world scenes, graphics and vision, virtual reality, immersive experiences, view synthesis
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Article written for Frontiers of Science Award, International Congress on Basic Science, 2024

点击查看摘要

Abstract:Capturing and rendering novel views of complex real-world scenes is a long-standing problem in computer graphics and vision, with applications in augmented and virtual reality, immersive experiences and 3D photography. The advent of deep learning has enabled revolutionary advances in this area, classically known as image-based rendering. However, previous approaches require intractably dense view sampling or provide little or no guidance for how users should sample views of a scene to reliably render high-quality novel views. Local light field fusion proposes an algorithm for practical view synthesis from an irregular grid of sampled views that first expands each sampled view into a local light field via a multiplane image scene representation, then renders novel views by blending adjacent local light fields. Crucially, we extend traditional plenoptic sampling theory to derive a bound that specifies precisely how densely users should sample views of a given scene when using our algorithm. We achieve the perceptual quality of Nyquist rate view sampling while using up to 4000x fewer views. Subsequent developments have led to new scene representations for deep learning with view synthesis, notably neural radiance fields, but the problem of sparse view synthesis from a small number of images has only grown in importance. We reprise some of the recent results on sparse and even single image view synthesis, while posing the question of whether prescriptive sampling guidelines are feasible for the new generation of image-based rendering algorithms.

[LG-4] Unveiling the Power of Sparse Neural Networks for Feature Selection

链接: https://arxiv.org/abs/2408.04583
作者: Zahra Atashgahi,Tennison Liu,Mykola Pechenizkiy,Raymond Veldhuis,Decebal Constantin Mocanu,Mihaela van der Schaar
关键词-EN: feature selection, efficient feature selection, Sparse Neural Networks, Sparse Neural, emerged as powerful
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Sparse Neural Networks (SNNs) have emerged as powerful tools for efficient feature selection. Leveraging the dynamic sparse training (DST) algorithms within SNNs has demonstrated promising feature selection capabilities while drastically reducing computational overheads. Despite these advancements, several critical aspects remain insufficiently explored for feature selection. Questions persist regarding the choice of the DST algorithm for network training, the choice of metric for ranking features/neurons, and the comparative performance of these methods across diverse datasets when compared to dense networks. This paper addresses these gaps by presenting a comprehensive systematic analysis of feature selection with sparse neural networks. Moreover, we introduce a novel metric considering sparse neural network characteristics, which is designed to quantify feature importance within the context of SNNs. Our findings show that feature selection with SNNs trained with DST algorithms can achieve, on average, more than 50% memory and 55% FLOPs reduction compared to the dense networks, while outperforming them in terms of the quality of the selected features. Our code and the supplementary material are available on GitHub (\urlthis https URL).
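A natural neuron-importance baseline for the setting described, ranking input features of a sparse layer, is the total absolute weight on each feature's surviving connections. The paper's own metric additionally accounts for sparse-network characteristics; this sketch shows only the common magnitude-based baseline:

```python
import numpy as np

def feature_importance(weights, mask):
    """Score each input feature of a sparse layer by the total absolute
    weight of its remaining (unpruned) outgoing connections.

    weights, mask: (n_features, n_hidden); mask is 0/1 after dynamic sparse training."""
    return np.abs(weights * mask).sum(axis=1)

rng = np.random.default_rng(0)
w = rng.standard_normal((5, 8))
mask = np.ones_like(w)
mask[3] = 0.0                          # feature 3's connections all pruned away
scores = feature_importance(w, mask)
ranking = np.argsort(scores)[::-1]     # feature indices, most important first
```

The intuition behind DST-based selection is visible here: features whose connections survive the prune-and-regrow cycles accumulate weight mass, while uninformative features end up fully disconnected and score zero.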

[LG-5] Mathematical Programming For Adaptive Experiments

链接: https://arxiv.org/abs/2408.04570
作者: Ethan Che,Daniel R. Jiang,Hongseok Namkoong,Jimmy Wang
关键词-EN: improve statistical power, standard algorithms overlook, algorithms overlook important, significantly improve statistical, overlook important practical
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Adaptive experimentation can significantly improve statistical power, but standard algorithms overlook important practical issues including batched and delayed feedback, personalization, non-stationarity, multiple objectives, and constraints. To address these issues, the current algorithm design paradigm crafts tailored methods for each problem instance. Since it is infeasible to devise novel algorithms for every real-world instance, practitioners often have to resort to suboptimal approximations that do not address all of their challenges. Moving away from developing bespoke algorithms for each setting, we present a mathematical programming view of adaptive experimentation that can flexibly incorporate a wide range of objectives, constraints, and statistical procedures. By formulating a dynamic program in the batched limit, our modeling framework enables the use of scalable optimization methods (e.g., SGD and auto-differentiation) to solve for treatment allocations. We evaluate our framework on benchmarks modeled after practical challenges such as non-stationarity, personalization, multi-objectives, and constraints. Unlike bespoke algorithms such as modified variants of Thompson sampling, our mathematical programming approach provides remarkably robust performance across instances.
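As a concrete anchor for the adaptive-experimentation setting, the Thompson-sampling style baseline the abstract compares against fits in a few lines for Bernoulli arms with Beta posteriors (a generic illustration, not the paper's benchmark code):

```python
import numpy as np

def thompson_bernoulli(true_rates, horizon=2000, seed=0):
    """Beta-Bernoulli Thompson sampling: sample a rate from each arm's
    posterior, pull the argmax, and update that arm's Beta(successes+1, failures+1)."""
    rng = np.random.default_rng(seed)
    k = len(true_rates)
    wins = np.ones(k)                  # Beta alpha parameters (uniform prior)
    losses = np.ones(k)                # Beta beta parameters
    pulls = np.zeros(k, dtype=int)
    for _ in range(horizon):
        arm = int(np.argmax(rng.beta(wins, losses)))   # posterior sampling
        reward = rng.random() < true_rates[arm]        # simulated Bernoulli outcome
        wins[arm] += reward
        losses[arm] += 1 - reward
        pulls[arm] += 1
    return pulls

pulls = thompson_bernoulli([0.3, 0.5, 0.7])  # allocation drifts toward the best arm
```

This one-observation-at-a-time loop is precisely what breaks under the batched and delayed feedback the abstract highlights, which motivates solving for whole treatment allocations with mathematical programming instead.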

[LG-6] Activation thresholds and expressiveness of polynomial neural networks

链接: https://arxiv.org/abs/2408.04569
作者: Bella Finkel,Jose Israel Rodriguez,Chenxi Wu,Thomas Yahl
关键词-EN: Polynomial neural networks, Polynomial neural, theoretical machine learning, machine learning, range of applications
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Algebraic Geometry (math.AG); Machine Learning (stat.ML)
*备注: 13 pages

点击查看摘要

Abstract:Polynomial neural networks have been implemented in a range of applications and present an advantageous framework for theoretical machine learning. A polynomial neural network of fixed architecture and activation degree gives an algebraic map from the network’s weights to a set of polynomials. The image of this map is the space of functions representable by the network. Its Zariski closure is an affine variety known as a neurovariety. The dimension of a polynomial neural network’s neurovariety provides a measure of its expressivity. In this work, we introduce the notion of the activation threshold of a network architecture which expresses when the dimension of a neurovariety achieves its theoretical maximum. In addition, we prove expressiveness results for polynomial neural networks with equi-width architectures.
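The weights-to-polynomial map mentioned above is very concrete: a one-hidden-layer network with scalar input and activation sigma(t) = t^2 computes a degree-2 polynomial whose coefficients are themselves polynomial in the weights. A small sketch (this tiny architecture is chosen for illustration only):

```python
import numpy as np

def network_to_poly(w, b, c):
    """Map the weights of f(x) = sum_i c_i * (w_i * x + b_i)^2 to the
    coefficients (a2, a1, a0) of the polynomial a2*x^2 + a1*x + a0."""
    w, b, c = map(np.asarray, (w, b, c))
    a2 = np.sum(c * w ** 2)        # coefficient of x^2
    a1 = np.sum(2 * c * w * b)     # coefficient of x
    a0 = np.sum(c * b ** 2)        # constant term
    return a2, a1, a0

w, b, c = [1.0, 2.0], [0.5, -1.0], [1.0, 0.5]
a2, a1, a0 = network_to_poly(w, b, c)

# The algebraic map agrees with direct evaluation of the network.
x = 1.7
net = sum(ci * (wi * x + bi) ** 2 for wi, bi, ci in zip(w, b, c))
poly = a2 * x ** 2 + a1 * x + a0
```

The image of `network_to_poly` over all weight choices is the function space whose Zariski closure is the neurovariety, and its dimension (here at most 3, for the three coefficients) is the expressivity measure the abstract discusses.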

[LG-7] Bias-Aware Low-Rank Adaptation: Mitigating Catastrophic Inheritance of Large Language Models

链接: https://arxiv.org/abs/2408.04556
作者: Yupeng Chang,Yi Chang,Yuan Wu
关键词-EN: exhibited remarkable proficiency, Large language models, Large language, exhibited remarkable, remarkable proficiency
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Work in progress

点击查看摘要

Abstract:Large language models (LLMs) have exhibited remarkable proficiency across a diverse array of natural language processing (NLP) tasks. However, adapting LLMs to downstream applications typically necessitates computationally intensive and memory-demanding fine-tuning procedures. To mitigate these burdens, parameter-efficient fine-tuning (PEFT) techniques have emerged as a promising approach to tailor LLMs with minimal computational overhead. While PEFT methods offer substantial advantages, they do not fully address the pervasive issue of bias propagation from pre-training data. In this work, we introduce Bias-Aware Low-Rank Adaptation (BA-LoRA), a novel PEFT method designed to counteract bias inheritance. BA-LoRA incorporates three distinct regularization terms: (1) consistency regularizer, (2) diversity regularizer, and (3) singular vector decomposition regularizer. These regularizers collectively aim to improve the generative models’ consistency, diversity, and generalization capabilities during the fine-tuning process. Through extensive experiments on a variety of natural language understanding (NLU) and natural language generation (NLG) tasks, employing prominent LLMs such as LLaMA, Mistral, and Gemma, we demonstrate that BA-LoRA surpasses the performance of LoRA and its state-of-the-art variants. Moreover, our method effectively mitigates the deleterious effects of pre-training bias, leading to more reliable and robust model outputs. The code is available at this https URL.
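For readers unfamiliar with the LoRA mechanism that BA-LoRA builds on: the frozen weight matrix W is adapted through a low-rank product B·A with far fewer trainable parameters. A minimal NumPy sketch of the forward pass (the BA-LoRA regularizers themselves act on the training loss and are not shown):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=1.0):
    """LoRA forward pass: y = x @ (W + alpha * B @ A).T.

    W: frozen (d_out, d_in) weights; only A (r, d_in) and B (d_out, r)
    are trained, with rank r << min(d_out, d_in)."""
    return x @ (W + alpha * (B @ A)).T

d_in, d_out, r = 16, 12, 2
rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))
A = rng.standard_normal((r, d_in)) * 0.01
B = np.zeros((d_out, r))             # standard init: the update starts at zero
x = rng.standard_normal((3, d_in))

y0 = lora_forward(x, W, A, B)        # identical to the frozen model at init
full_params = d_out * d_in           # 192 parameters in a full-rank update
lora_params = r * (d_in + d_out)     # only 56 trainable parameters instead
```

BA-LoRA keeps this parameterization and adds three regularization terms (consistency, diversity, and singular-vector decomposition) on top of the fine-tuning objective to curb inherited bias.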

[LG-8] How Transformers Utilize Multi-Head Attention in In-Context Learning? A Case Study on Sparse Linear Regression

链接: https://arxiv.org/abs/2408.04532
作者: Xingwu Chen,Lei Zhao,Difan Zou
关键词-EN: remain poorly understood, underlying mechanisms remain, mechanisms remain poorly, real-world tasks, poorly understood
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Despite the remarkable success of transformer-based models in various real-world tasks, their underlying mechanisms remain poorly understood. Recent studies have suggested that transformers can implement gradient descent as an in-context learner for linear regression problems and have developed various theoretical analyses accordingly. However, these works mostly focus on the expressive power of transformers by designing specific parameter constructions, lacking a comprehensive understanding of their inherent working mechanisms post-training. In this study, we consider a sparse linear regression problem and investigate how a trained multi-head transformer performs in-context learning. We experimentally discover that the utilization of multi-heads exhibits different patterns across layers: multiple heads are utilized and essential in the first layer, while usually only a single head is sufficient for subsequent layers. We provide a theoretical explanation for this observation: the first layer preprocesses the context data, and the following layers execute simple optimization steps based on the preprocessed context. Moreover, we demonstrate that such a preprocess-then-optimize algorithm can significantly outperform naive gradient descent and ridge regression algorithms. Further experimental results support our explanations. Our findings offer insights into the benefits of multi-head attention and contribute to understanding the more intricate mechanisms hidden within trained transformers.
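The object of study above, a multi-head attention forward pass, can be written compactly in NumPy (random projections stand in for trained weights; illustrative only):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, n_heads, rng):
    """X: (n_tokens, d_model). Splits d_model across heads, runs scaled
    dot-product attention per head, concatenates, and projects back."""
    n, d = X.shape
    dh = d // n_heads
    heads = []
    for _ in range(n_heads):
        Wq, Wk, Wv = (rng.standard_normal((d, dh)) for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        attn = softmax(Q @ K.T / np.sqrt(dh))  # (n, n); each row sums to 1
        heads.append(attn @ V)                 # (n, dh) per-head output
    Wo = rng.standard_normal((d, d))
    return np.concatenate(heads, axis=1) @ Wo  # (n, d)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))
out = multi_head_attention(X, n_heads=4, rng=rng)
```

The paper's finding maps onto this structure directly: in the first layer all the per-head `attn @ V` terms carry useful, distinct preprocessing of the context, while in later layers a single head's term suffices for the optimization-like steps.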

[LG-9] AExGym: Benchmarks and Environments for Adaptive Experimentation

链接: https://arxiv.org/abs/2408.04531
作者: Jimmy Wang,Ethan Che,Daniel R. Jiang,Hongseok Namkoong
关键词-EN: Innovations across science, randomized trials, science and industry, industry are evaluated, evaluated using randomized
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Innovations across science and industry are evaluated using randomized trials (a.k.a. A/B tests). While simple and robust, such static designs are inefficient or infeasible for testing many hypotheses. Adaptive designs can greatly improve statistical power in theory, but they have seen limited adoption due to their fragility in practice. We present a benchmark for adaptive experimentation based on real-world datasets, highlighting prominent practical challenges to operationalizing adaptivity: non-stationarity, batched/delayed feedback, multiple outcomes and objectives, and external validity. Our benchmark aims to spur methodological development that puts practical performance (e.g., robustness) as a central concern, rather than mathematical guarantees on contrived instances. We release an open source library, AExGym, which is designed with modularity and extensibility in mind to allow experimentation practitioners to develop custom environments and algorithms.

[LG-10] Advancing Molecular Machine (Learned) Representations with Stereoelectronics-Infused Molecular Graphs

链接: https://arxiv.org/abs/2408.04520
作者: Daniil A. Boiko,Thiago Reschützegger,Benjamin Sanchez-Lengeling,Samuel M. Blau,Gabe Gomes
关键词-EN: physical world, molecular machine learning, foundational element, Molecular, molecular machine
类目: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
*备注: 23 pages, 6 figures

点击查看摘要

Abstract:Molecular representation is a foundational element in our understanding of the physical world. Its importance ranges from the fundamentals of chemical reactions to the design of new therapies and materials. Previous molecular machine learning models have employed strings, fingerprints, global features, and simple molecular graphs that are inherently information-sparse representations. However, as the complexity of prediction tasks increases, the molecular representation needs to encode higher fidelity information. This work introduces a novel approach to infusing quantum-chemical-rich information into molecular graphs via stereoelectronic effects. We show that the explicit addition of stereoelectronic interactions significantly improves the performance of molecular machine learning models. Furthermore, stereoelectronics-infused representations can be learned and deployed with a tailored double graph neural network workflow, enabling its application to any downstream molecular machine learning task. Finally, we show that the learned representations allow for facile stereoelectronic evaluation of previously intractable systems, such as entire proteins, opening new avenues of molecular design.

[LG-11] Knowledge-Aided Semantic Communication Leveraging Probabilistic Graphical Modeling

链接: https://arxiv.org/abs/2408.04499
作者: Haowen Wan,Qianqian Yang,Jiancheng Tang,Zhiguo shi
关键词-EN: probabilistic graphical model, communication approach based, semantic communication approach, graphical model, probabilistic graphical
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, we propose a semantic communication approach based on probabilistic graphical model (PGM). The proposed approach involves constructing a PGM from a training dataset, which is then shared as common knowledge between the transmitter and receiver. We evaluate the importance of various semantic features and present a PGM-based compression algorithm designed to eliminate predictable portions of semantic information. Furthermore, we introduce a technique to reconstruct the discarded semantic information at the receiver end, generating approximate results based on the PGM. Simulation results indicate a significant improvement in transmission efficiency over existing methods, while maintaining the quality of the transmitted images.

[LG-12] Model-Based Transfer Learning for Contextual Reinforcement Learning

链接: https://arxiv.org/abs/2408.04498
作者: Jung-Hoon Cho,Vindula Jayawardana,Sirui Li,Cathy Wu
关键词-EN: Deep reinforcement learning, complex decision making, Deep reinforcement, decision making, complex decision
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep reinforcement learning is a powerful approach to complex decision making. However, one issue that limits its practical application is its brittleness, sometimes failing to train in the presence of small changes in the environment. This work is motivated by the empirical observation that directly applying an already trained model to a related task often works remarkably well, also called zero-shot transfer. We take this practical trick one step further to consider how to systematically select good tasks to train, maximizing overall performance across a range of tasks. Given the high cost of training, it is critical to choose a small set of training tasks. The key idea behind our approach is to explicitly model the performance loss (generalization gap) incurred by transferring a trained model. We hence introduce Model-Based Transfer Learning (MBTL) for solving contextual RL problems. In this work, we model the performance loss as a simple linear function of task context similarity. Furthermore, we leverage Bayesian optimization techniques to efficiently model and estimate the unknown training performance of the task space. We theoretically show that the method exhibits regret that is sublinear in the number of training tasks and discuss conditions to further tighten regret bounds. We experimentally validate our methods using urban traffic and standard control benchmarks. Despite the conceptual simplicity, the experimental results suggest that MBTL can achieve greater performance than strong baselines, including exhaustive training on all tasks, multi-task training, and random selection of training tasks. This work lays the foundations for investigating explicit modeling of generalization, thereby enabling principled yet effective methods for contextual RL.
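A toy sketch of the core idea, under the paper's stated assumption that the generalization gap is a linear function of task-context similarity. The 1-D contexts and the greedy selection below are illustrative stand-ins for the actual Bayesian-optimization machinery:

```python
def mbtl_greedy(contexts, slope=1.0, budget=2):
    """Greedily pick training tasks assuming transferring a model trained on
    context c_s to target c_t loses `slope * |c_s - c_t|` performance
    (hypothetical linear generalization-gap model)."""
    chosen = []

    def total_perf(train_set):
        # Each target task uses the best available source model (zero-shot transfer).
        return sum(max(1.0 - slope * abs(cs - ct) for cs in train_set)
                   for ct in contexts)

    candidates = list(contexts)
    for _ in range(budget):
        best = max(candidates, key=lambda c: total_perf(chosen + [c]))
        chosen.append(best)
        candidates.remove(best)
    return chosen

# Two clusters of task contexts; a budget of 2 should cover both.
picked = mbtl_greedy([0.0, 0.1, 0.2, 0.9, 1.0], budget=2)
```

With two clearly separated context clusters, the greedy rule selects one training task per cluster rather than two near-duplicates.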

[LG-13] SegXAL: Explainable Active Learning for Semantic Segmentation in Driving Scene Scenarios ICPR

链接: https://arxiv.org/abs/2408.04482
作者: Sriram Mandalika,Athira Nambiar
关键词-EN: achieve high-end performance, utilize huge amounts, models utilize huge, high-end performance, huge amounts
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 17 pages, 7 figures. To appear in the proceedings of the 27th International Conference on Pattern Recognition (ICPR), 01-05 December, 2024, Kolkata, India

点击查看摘要

Abstract:Most of the sophisticated AI models utilize huge amounts of annotated data and heavy training to achieve high-end performance. However, certain challenges hinder the deployment of AI models in “in-the-wild” scenarios, i.e., inefficient use of unlabeled data, lack of incorporation of human expertise, and lack of interpretation of the results. To mitigate these challenges, we propose a novel Explainable Active Learning (XAL) model, the XAL-based semantic segmentation model “SegXAL”, that can (i) effectively utilize the unlabeled data, (ii) facilitate the “Human-in-the-loop” paradigm, and (iii) augment the model decisions in an interpretable way. In particular, we investigate the application of the SegXAL model for semantic segmentation in driving scene scenarios. The SegXAL model proposes the image regions that require labeling assistance from the Oracle by dint of explainable AI (XAI) and uncertainty measures in a weakly-supervised manner. Specifically, we propose a novel Proximity-aware Explainable-AI (PAE) module and an Entropy-based Uncertainty (EBU) module to obtain an Explainable Error Mask, which enables machine teachers/human experts to provide intuitive reasoning behind the results and to give feedback to the AI system via an active learning strategy. Such a mechanism bridges the semantic gap between man and machine through collaborative intelligence, where humans and AI actively enhance each other’s complementary strengths. A novel high-confidence sample selection technique based on the DICE similarity coefficient is also presented within the SegXAL framework. Extensive quantitative and qualitative analyses are carried out on the benchmark Cityscapes dataset. Results show the outperformance of our proposed SegXAL against other state-of-the-art models.
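Two ingredients named above, entropy-based uncertainty and DICE-based sample selection, can be sketched generically; the PAE/EBU modules themselves are not public code, so these helpers are illustrative:

```python
import math

def entropy(probs):
    """Predictive entropy of a class distribution: high values flag
    uncertain pixels/regions that would be routed to the human oracle."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def dice(mask_a, mask_b):
    """DICE similarity coefficient between two binary masks
    (2*intersection / sum of sizes)."""
    inter = sum(a & b for a, b in zip(mask_a, mask_b))
    return 2 * inter / (sum(mask_a) + sum(mask_b))

confident = entropy([0.98, 0.01, 0.01])   # peaked distribution, low entropy
uncertain = entropy([0.34, 0.33, 0.33])   # near-uniform, high entropy
overlap = dice([1, 1, 0, 0], [1, 0, 0, 0])
```

An active-learning loop would rank regions by entropy and keep only samples whose masks agree with the current model above a DICE threshold.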

[LG-14] NFDI4Health workflow and service for synthetic data generation, assessment and risk management

链接: https://arxiv.org/abs/2408.04478
作者: Sobhan Moazemi,Tim Adams,Hwei Geok NG,Lisa Kühnel,Julian Schneider,Anatol-Fiete Näher,Juliane Fluck,Holger Fröhlich
关键词-EN: developing Artificial Intelligence, Artificial Intelligence, Individual health data, developing Artificial, sharing real patient
类目: Machine Learning (cs.LG)
*备注: 9 pages, 4 figures, accepted for publication in the proceedings of the 69th Annual Conference of the Society for Medical Informatics, Biometry and Epidemiology (GMDS)

点击查看摘要

Abstract:Individual health data is crucial for scientific advancements, particularly in developing Artificial Intelligence (AI); however, sharing real patient information is often restricted due to privacy concerns. A promising solution to this challenge is synthetic data generation. This technique creates entirely new datasets that mimic the statistical properties of real data, while preserving confidential patient information. In this paper, we present the workflow and different services developed in the context of Germany’s National Data Infrastructure project NFDI4Health. First, two state-of-the-art AI tools (namely, VAMBN and MultiNODEs) for generating synthetic health data are outlined. Further, we introduce SYNDAT (a public web-based tool) which allows users to visualize and assess the quality and risk of synthetic data provided by desired generative models. Additionally, the utility of the proposed methods and the web-based tool is showcased using data from Alzheimer’s Disease Neuroimaging Initiative (ADNI) and the Center for Cancer Registry Data of the Robert Koch Institute (RKI).

[LG-15] Random Walk Diffusion for Efficient Large-Scale Graph Generation

链接: https://arxiv.org/abs/2408.04461
作者: Tobias Bernecker,Ghalia Rehawi,Francesco Paolo Casale,Janine Knauer-Arloth,Annalisa Marsico
关键词-EN: data distribution similar, Graph generation addresses, addresses the problem, problem of generating, data distribution
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

Abstract:Graph generation addresses the problem of generating new graphs that have a data distribution similar to real-world graphs. While previous diffusion-based graph generation methods have shown promising results, they often struggle to scale to large graphs. In this work, we propose ARROW-Diff (AutoRegressive RandOm Walk Diffusion), a novel random walk-based diffusion approach for efficient large-scale graph generation. Our method encompasses two components in an iterative process of random walk sampling and graph pruning. We demonstrate that ARROW-Diff can scale to large graphs efficiently, surpassing other baseline methods in terms of both generation time and multiple graph statistics, reflecting the high quality of the generated graphs.
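A hedged sketch of the random-walk component: sampling walks from an adjacency list only touches local neighborhoods, which is what lets walk-based generation avoid materializing a dense adjacency matrix. The diffusion generator and graph-pruning steps are omitted, and the names below are illustrative:

```python
import random

def random_walks(adj, num_walks=3, walk_len=5, seed=0):
    """Sample uniform random walks from a graph given as an adjacency
    list {node: [neighbors]}."""
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        node = rng.choice(list(adj))      # random start node
        walk = [node]
        for _ in range(walk_len - 1):
            if not adj[node]:             # dead end
                break
            node = rng.choice(adj[node])  # step to a uniform random neighbor
            walk.append(node)
        walks.append(walk)
    return walks

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
walks = random_walks(adj)
```

Every consecutive pair in each sampled walk is an edge of the input graph, so a generator trained on walks only ever sees valid local structure.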

[LG-16] An experimental comparative study of backpropagation and alternatives for training binary neural networks for image classification

链接: https://arxiv.org/abs/2408.04460
作者: Ben Crulis,Barthelemy Serres,Cyril de Runz,Gilles Venturini
关键词-EN: Current artificial neural, floating point numbers, artificial neural networks, neural networks, Binary neural networks
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Current artificial neural networks are trained with parameters encoded as floating point numbers that occupy lots of memory space at inference time. Due to the increase in the size of deep learning models, it is becoming very difficult to consider training and using artificial neural networks on edge devices. Binary neural networks promise to reduce the size of deep neural network models, as well as to increase inference speed while decreasing energy consumption. Thus, they may allow the deployment of more powerful models on edge devices. However, binary neural networks are still proven to be difficult to train using the backpropagation-based gradient descent scheme. This paper extends the work of Crulis et al. (2023), which proposed adapting to binary neural networks two promising alternatives to backpropagation originally designed for continuous neural networks, and experimented with them on simple image classification datasets. This paper proposes new experiments on the ImageNette dataset, compares three different model architectures for image classification, and adds two additional alternatives to backpropagation.
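For context, the backpropagation baseline that the alternatives compete with typically relies on the straight-through estimator (STE): the forward pass binarizes a latent real-valued weight, and the gradient is passed through to the latent weight inside a clipping region. A one-weight sketch (illustrative, not the paper's code):

```python
def binarize(w):
    """Forward pass uses only the sign of the latent real-valued weight."""
    return 1.0 if w >= 0 else -1.0

def ste_update(w, grad, lr=0.1, clip=1.0):
    """Straight-through estimator step: the gradient w.r.t. the binary
    weight is applied directly to the latent weight, but only while the
    latent weight lies inside the clipping region."""
    if abs(w) <= clip:
        w = w - lr * grad
    return max(-clip, min(clip, w))

w = 0.2
# A persistent positive gradient pushes the latent weight negative
# until the binary sign eventually flips.
for _ in range(5):
    w = ste_update(w, grad=1.0)
b = binarize(w)
```

The latent weight moves smoothly (0.2 → -0.3 here) even though the network only ever sees the binary value, which is exactly the trick that makes gradient descent applicable and also the source of the training difficulties the paper studies.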

[LG-17] FedAD-Bench: A Unified Benchmark for Federated Unsupervised Anomaly Detection in Tabular Data

链接: https://arxiv.org/abs/2408.04442
作者: Ahmed Anwar,Brian Moser,Dayananda Herurkar,Federico Raue,Vinit Hegiste,Tatjana Legler,Andreas Dengel
关键词-EN: leverage decentralized data, anomaly detection, preserving privacy, promising approach, approach to leverage
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 8 pages, 1 figure

点击查看摘要

Abstract:The emergence of federated learning (FL) presents a promising approach to leverage decentralized data while preserving privacy. Furthermore, the combination of FL and anomaly detection is particularly compelling because it allows for detecting rare and critical anomalies (usually also rare in locally gathered data) in sensitive data from multiple sources, such as cybersecurity and healthcare. However, benchmarking the performance of anomaly detection methods in FL environments remains an underexplored area. This paper introduces FedAD-Bench, a unified benchmark for evaluating unsupervised anomaly detection algorithms within the context of FL. We systematically analyze and compare the performance of recent deep learning anomaly detection models under federated settings, which were typically assessed solely in centralized settings. FedAD-Bench encompasses diverse datasets and metrics to provide a holistic evaluation. Through extensive experiments, we identify key challenges such as model aggregation inefficiencies and metric unreliability. We present insights into FL’s regularization effects, revealing scenarios in which it outperforms centralized approaches due to its inherent ability to mitigate overfitting. Our work aims to establish a standardized benchmark to guide future research and development in federated anomaly detection, promoting reproducibility and fair comparison across studies.
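The aggregation step whose inefficiencies the benchmark examines is, in the standard FL setting, weighted parameter averaging (FedAvg). A minimal sketch with flat parameter vectors:

```python
def fedavg(client_weights, client_sizes):
    """FedAvg: average client parameter vectors weighted by each
    client's local dataset size."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
            for i in range(dim)]

# Client 2 holds 3x the data of client 1, so its parameters dominate.
agg = fedavg([[1.0, 2.0], [3.0, 4.0]], [1, 3])
```

In a federated anomaly-detection run, this aggregation happens each round over the clients' locally trained detector parameters; the benchmark's point is that such averaging can be inefficient or unreliable for some detector families.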

[LG-18] Deep Learning for identifying systolic complexes in SCG traces: a cross-dataset analysis

链接: https://arxiv.org/abs/2408.04439
作者: Michele Craighero,Sarah Solbiati,Federica Mozzini,Enrico Caiani,Giacomo Boracchi
关键词-EN: traditional ECG, cardiac activity, promising alternative, systolic complex, deep learning solution
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The seismocardiographic signal is a promising alternative to the traditional ECG in the analysis of cardiac activity. In particular, the systolic complex is known to be the most informative part of the seismocardiogram, thus requiring further analysis. State-of-the-art solutions to detect the systolic complex are based on Deep Learning models, which have been proven effective in pioneering studies. However, these solutions have only been tested in a controlled scenario considering only clean signals acquired from users kept still in a supine position. On top of that, all these studies consider data coming from a single dataset, ignoring the benefits and challenges related to a cross-dataset scenario. In this work, a cross-dataset experimental analysis was performed, considering also data from a real-world scenario. Our findings prove the effectiveness of a deep learning solution, while showing the importance of a personalization step to counter the domain shift, namely a change in data distribution between training and testing data. Finally, we demonstrate the benefits of a multi-channel approach, leveraging the information extracted from both accelerometer and gyroscope data.

[LG-19] Detection of Animal Movement from Weather Radar using Self-Supervised Learning

链接: https://arxiv.org/abs/2408.04424
作者: Mubin Ul Haque,Joel Janek Dabrowski,Rebecca M. Rogers,Hazel Parry
关键词-EN: radar involves thresholding, ecosystem.The conventional approach, weather radar involves, migration patterns, aids in management
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Detecting flying animals (e.g., birds, bats, and insects) using weather radar helps gain insights into animal movement and migration patterns, aids in management efforts (such as biosecurity) and enhances our understanding of the ecosystem. The conventional approach to detecting animals in weather radar involves thresholding: defining and applying thresholds for the radar variables, based on expert opinion. More recently, Deep Learning approaches have been shown to provide improved performance in detection. However, obtaining sufficient labelled weather radar data for flying animals to build learning-based models is time-consuming and labor-intensive. To address the challenge of data labelling, we propose a self-supervised learning method for detecting animal movement. In our proposed method, we pre-train our model on a large dataset with noisy labels produced by a threshold approach. The key advantage is that the pre-training dataset size is limited only by the number of radar images available. We then fine-tune the model on a small human-labelled dataset. Our experiments on Australian weather radar data for waterbird segmentation show that the proposed method outperforms the current state-of-the-art approach by 43.53% in the Dice coefficient statistic.
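The pre-training labels come from an expert threshold rule, so their quality can be gauged against a small human-labelled set before fine-tuning. A toy sketch with made-up reflectivity values and a hypothetical threshold:

```python
def threshold_labels(reflectivity, thresh=20.0):
    """Noisy pseudo-labels from a simple expert threshold rule; in the
    paper's setup such labels annotate the large pre-training set.
    (Values and threshold here are purely illustrative.)"""
    return [1 if v > thresh else 0 for v in reflectivity]

noisy = threshold_labels([5.0, 25.0, 18.0, 40.0])
human = [0, 1, 1, 1]  # hypothetical human labels; the borderline cell disagrees
noise_rate = sum(a != b for a, b in zip(noisy, human)) / len(human)
```

The appeal of the scheme is that `noisy` can be produced for every available radar image at zero labelling cost, while the small human-labelled set corrects the borderline cases during fine-tuning.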

[LG-20] Deeploy: Enabling Energy-Efficient Deployment of Small Language Models On Heterogeneous Microcontrollers

链接: https://arxiv.org/abs/2408.04413
作者: Moritz Scherer,Luka Macan,Victor Jung,Philip Wiese,Luca Bompani,Alessio Burrello,Francesco Conti,Luca Benini
关键词-EN: Embodied Foundation Models, Small Language Models, notably Small Language, Foundation Models, Language Models
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*备注: Accepted for publication at ESWEEK - CASES 2024

点击查看摘要

Abstract:With the rise of Embodied Foundation Models (EFMs), most notably Small Language Models (SLMs), adapting Transformers for edge applications has become a very active field of research. However, achieving end-to-end deployment of SLMs on microcontroller (MCU)-class chips without high-bandwidth off-chip main memory access is still an open challenge. In this paper, we demonstrate high-efficiency end-to-end SLM deployment on a multicore RISC-V (RV32) MCU augmented with ML instruction extensions and a hardware neural processing unit (NPU). To automate the exploration of the constrained, multi-dimensional memory vs. computation tradeoffs involved in aggressive SLM deployment on heterogeneous (multicore+NPU) resources, we introduce Deeploy, a novel Deep Neural Network (DNN) compiler, which generates highly-optimized C code requiring minimal runtime support. We demonstrate that Deeploy generates end-to-end code for executing SLMs, fully exploiting the RV32 cores’ instruction extensions and the NPU: We achieve leading-edge energy efficiency and throughput of 490 µJ per token at 340 tokens per second for an SLM trained on the TinyStories dataset, running for the first time on an MCU-class device without external memory.

[LG-21] Clutter Classification Using Deep Learning in Multiple Stages

链接: https://arxiv.org/abs/2408.04407
作者: Ryan Dempsey,Jonathan Ethier
关键词-EN: local environment, wireless communications, communications is highly, highly dependent, Path loss
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注: SoutheastCon 2024

点击查看摘要

Abstract:Path loss prediction for wireless communications is highly dependent on the local environment. Propagation models including clutter information have been shown to significantly increase model accuracy. This paper explores the application of deep learning to satellite imagery to identify environmental clutter types automatically. Recognizing these clutter types has numerous uses, but our main application is to use clutter information to enhance propagation prediction models. Knowing the type of obstruction (tree, building, and further classifications) can improve the prediction accuracy of key propagation metrics such as path loss.

[LG-22] Probabilistic energy forecasting through quantile regression in reproducing kernel Hilbert spaces

链接: https://arxiv.org/abs/2408.04405
作者: Luca Pernigo,Rohan Sen,Davide Baroli
关键词-EN: Accurate energy demand, Representative Concentration Pathways, resilient energy development, Accurate energy, crucial for sustainable
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注: 12 pages, © Owner/Author | ACM 2024. This is the author’s version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record will be published at this https URL

点击查看摘要

Abstract:Accurate energy demand forecasting is crucial for sustainable and resilient energy development. To meet the Net Zero Representative Concentration Pathways (RCP) 4.5 scenario in the DACH countries, increased renewable energy production, energy storage, and reduced commercial building consumption are needed. This scenario’s success depends on hydroelectric capacity and climatic factors. Informed decisions require quantifying uncertainty in forecasts. This study explores a non-parametric method based on reproducing kernel Hilbert spaces (RKHS), known as kernel quantile regression, for energy prediction. Our experiments demonstrate its reliability and sharpness, and we benchmark it against state-of-the-art methods in load and price forecasting for the DACH region. We offer our implementation in conjunction with additional scripts to ensure the reproducibility of our research.
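Quantile regression, kernelized or not, minimizes the pinball (quantile) loss. A brute-force 1-D sketch shows that minimizing the average pinball loss at tau = 0.5 recovers the sample median; the kernel machinery is omitted:

```python
def pinball_loss(y_true, y_pred, tau):
    """Pinball loss: under-prediction is weighted by tau,
    over-prediction by (1 - tau)."""
    e = y_true - y_pred
    return tau * e if e >= 0 else (tau - 1) * e

def quantile_fit_1d(ys, tau, grid):
    """Brute-force 1-D fit: the grid value minimizing average pinball
    loss approximates the tau-quantile of the sample."""
    return min(grid, key=lambda q: sum(pinball_loss(y, q, tau) for y in ys))

ys = [1.0, 2.0, 3.0, 4.0, 5.0]
med = quantile_fit_1d(ys, 0.5, [i * 0.5 for i in range(11)])
```

Fitting the same loss at tau = 0.1 and tau = 0.9 yields lower and upper prediction bounds, which is how the method produces the uncertainty intervals used for energy forecasts.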

[LG-23] DIVE: Subgraph Disagreement for Graph Out-of-Distribution Generalization

链接: https://arxiv.org/abs/2408.04400
作者: Xin Sun,Liang Wang,Qiang Liu,Shu Wu,Zilei Wang,Liang Wang
关键词-EN: field rapidly advancing, target data distributions, Stochastic Gradient Descent, graph machine learning, paper addresses
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper addresses the challenge of out-of-distribution (OOD) generalization in graph machine learning, a field rapidly advancing yet grappling with the discrepancy between source and target data distributions. Traditional graph learning algorithms, based on the assumption of uniform distribution between training and test data, falter in real-world scenarios where this assumption fails, resulting in suboptimal performance. A principal factor contributing to this suboptimal performance is the inherent simplicity bias of neural networks trained through Stochastic Gradient Descent (SGD), which prefer simpler features over more complex yet equally or more predictive ones. This bias leads to a reliance on spurious correlations, adversely affecting OOD performance in various tasks such as image recognition, natural language understanding, and graph classification. Current methodologies, including subgraph-mixup and information bottleneck approaches, have achieved partial success but struggle to overcome simplicity bias, often reinforcing spurious correlations. To tackle this, we propose DIVE, training a collection of models to focus on all label-predictive subgraphs by encouraging the models to foster divergence on the subgraph mask, which circumvents the limitation of a model solely focusing on the subgraph corresponding to simple structural patterns. Specifically, we employ a regularizer to penalize overlap in extracted subgraphs across models, thereby encouraging different models to concentrate on distinct structural patterns. Model selection for robust OOD performance is achieved through validation accuracy. Tested across four datasets from the GOOD benchmark and one dataset from the DrugOOD benchmark, our approach demonstrates significant improvement over existing methods, effectively addressing the simplicity bias and enhancing generalization in graph machine learning.
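The diversity regularizer can be sketched as an average pairwise overlap (Jaccard-style here) between the subgraph masks selected by different models; the exact form used in DIVE may differ, so treat this as an illustrative stand-in:

```python
def overlap_penalty(masks):
    """Average pairwise overlap of binary subgraph masks across models.
    Minimizing this term pushes models toward distinct structural patterns."""
    total, pairs = 0.0, 0
    for i in range(len(masks)):
        for j in range(i + 1, len(masks)):
            inter = sum(a * b for a, b in zip(masks[i], masks[j]))
            union = sum(max(a, b) for a, b in zip(masks[i], masks[j])) or 1
            total += inter / union
            pairs += 1
    return total / pairs

diverse = overlap_penalty([[1, 1, 0, 0], [0, 0, 1, 1]])    # disjoint masks
collapsed = overlap_penalty([[1, 1, 0, 0], [1, 1, 0, 0]])  # identical masks
```

Adding this penalty to each model's task loss is what prevents the whole ensemble from collapsing onto the single simplest (and possibly spurious) subgraph.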

[LG-24] Evaluating the Impact of Pulse Oximetry Bias in Machine Learning under Counterfactual Thinking MICCAI

链接: https://arxiv.org/abs/2408.04396
作者: Inês Martins,João Matos,Tiago Gonçalves,Leo A. Celi,A. Ian Wong,Jaime S. Cardoso
关键词-EN: Algorithmic bias, healthcare mirrors existing, pulse oximetry, Algorithmic, pulse
类目: Machine Learning (cs.LG)
*备注: 10 pages; accepted at MICCAI’s Third Workshop on Applications of Medical AI (2024)

点击查看摘要

Abstract:Algorithmic bias in healthcare mirrors existing data biases. However, the factors driving unfairness are not always known. Medical devices capture significant amounts of data but are prone to errors; for instance, pulse oximeters overestimate the arterial oxygen saturation of darker-skinned individuals, leading to worse outcomes. The impact of this bias in machine learning (ML) models remains unclear. This study addresses the technical challenges of quantifying the impact of medical device bias in downstream ML. Our experiments compare a “perfect world”, without pulse oximetry bias, using SaO2 (blood-gas), to the “actual world”, with biased measurements, using SpO2 (pulse oximetry). Under this counterfactual design, two models are trained with identical data, features, and settings, except for the method of measuring oxygen saturation: models using SaO2 are a “control” and models using SpO2 a “treatment”. The blood-gas oximetry linked dataset was a suitable test-bed, containing 163,396 nearly-simultaneous SpO2 - SaO2 paired measurements, aligned with a wide array of clinical features and outcomes. We studied three classification tasks: in-hospital mortality, respiratory SOFA score in the next 24 hours, and SOFA score increase by two points. Models using SaO2 instead of SpO2 generally showed better performance. Patients with overestimation of O2 by pulse oximetry of 3% had significant decreases in mortality prediction recall, from 0.63 to 0.59, P < 0.001. This mirrors clinical processes where biased pulse oximetry readings provide clinicians with false reassurance of patients’ oxygen levels. A similar degradation happened in ML models, with pulse oximetry biases leading to more false negatives in predicting adverse outcomes.

[LG-25] Deep Generative Models in Robotics: A Survey on Learning from Multimodal Demonstrations

链接: https://arxiv.org/abs/2408.04380
作者: Julen Urain,Ajay Mandlekar,Yilun Du,Mahi Shafiullah,Danfei Xu,Katerina Fragkiadaki,Georgia Chalvatzaki,Jan Peters
关键词-EN: deep generative models, Inverse Reinforcement Learning, generative models, deep generative, learn robot behavior
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 20 pages, 11 figures, submitted to TRO

点击查看摘要

Abstract:Learning from Demonstrations, the field that proposes to learn robot behavior models from data, is gaining popularity with the emergence of deep generative models. Although the problem has been studied for years under names such as Imitation Learning, Behavioral Cloning, or Inverse Reinforcement Learning, classical methods have relied on models that don’t capture complex data distributions well or don’t scale well to large numbers of demonstrations. In recent years, the robot learning community has shown increasing interest in using deep generative models to capture the complexity of large datasets. In this survey, we aim to provide a unified and comprehensive review of the last year’s progress in the use of deep generative models in robotics. We present the different types of models that the community has explored, such as energy-based models, diffusion models, action value maps, or generative adversarial networks. We also present the different types of applications in which deep generative models have been used, from grasp generation to trajectory generation or cost learning. One of the most important elements of generative models is the generalization out of distributions. In our survey, we review the different decisions the community has made to improve the generalization of the learned models. Finally, we highlight the research challenges and propose a number of future directions for learning deep generative models in robotics.

[LG-26] Anomaly Prediction: A Novel Approach with Explicit Delay and Horizon

链接: https://arxiv.org/abs/2408.04377
作者: Jiang You,Arben Cela,René Natowicz,Jacob Ouanounou,Patrick Siarry
关键词-EN: Detecting anomalies, critical challenge, time series data, Detecting, series data
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Detecting anomalies in time series data is a critical challenge across various domains. Traditional methods typically focus on identifying anomalies in immediate subsequent steps, often underestimating the significance of temporal dynamics such as delay time and horizons of anomalies, which generally require extensive post-analysis. This paper introduces a novel approach for time series anomaly prediction, incorporating temporal information directly into the prediction results. We propose a new dataset specifically designed to evaluate this approach and conduct comprehensive experiments using several state-of-the-art methods. Results demonstrate the efficacy of our approach in providing timely and accurate anomaly predictions, setting a new benchmark for future research in this field.
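One illustrative reading of making delay and horizon explicit: score a prediction as correct only if it falls inside a window before the anomaly, rather than on the immediately following step. The window semantics below are an assumption for illustration, not the paper's exact metric:

```python
def hit_within_window(pred_times, true_time, delay=0, horizon=5):
    """A prediction counts as a hit if it lands in
    [true_time - horizon, true_time - delay]: early enough to be useful
    (at least `delay` steps of lead time) but not stale (at most `horizon`
    steps ahead of the anomaly)."""
    lo, hi = true_time - horizon, true_time - delay
    return any(lo <= t <= hi for t in pred_times)

early_enough = hit_within_window([7], true_time=10, delay=1, horizon=5)
too_late = hit_within_window([10], true_time=10, delay=1, horizon=5)
```

Under this scoring, a prediction at t=7 for an anomaly at t=10 is a useful hit, while a "prediction" at t=10 itself gives no lead time and is rejected, which is precisely the distinction plain next-step detection misses.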

[LG-27] Deep Reinforcement Learning for the Design of Metamaterial Mechanisms with Functional Compliance Control

链接: https://arxiv.org/abs/2408.04376
作者: Yejun Choi,Yeoneung Kim,Keun Park
关键词-EN: designed flexible members, specially designed flexible, micro-architectured compliant structures, Metamaterial mechanisms, flexible members
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Metamaterial mechanisms are micro-architectured compliant structures that operate through the elastic deformation of specially designed flexible members. This study develops an efficient design methodology for compliant mechanisms using deep reinforcement learning (RL). For this purpose, design domains are digitized into finite cells with various hinge connections, and finite element analyses (FEAs) are conducted to evaluate the deformation behaviors of the compliance mechanism with different cell combinations. The FEA data are learned through the RL method to obtain optimal compliant mechanisms for desired functional requirements. The RL algorithm is applied to the design of a compliant door-latch mechanism, exploring the effect of human guidance and tiling direction. The optimal result is achieved with minimal human guidance and inward tiling, resulting in a threefold increase in the predefined reward compared to human-designed mechanisms. The proposed approach is extended to the design of a soft gripper mechanism, where the effect of hinge connections is additionally considered. The optimal design under hinge penalization reveals remarkably enhanced compliance, and its performance is validated by experimental tests using an additively manufactured gripper. These findings demonstrate that RL-optimized designs outperform those developed with human insight, providing an efficient design methodology for cell-based compliant mechanisms in practical applications.

[LG-28] Analyzing Consumer Reviews for Understanding Drivers of Hotels Ratings: An Indian Perspective

链接: https://arxiv.org/abs/2408.04369
作者: Subhasis Dasgupta,Soumya Roy,Jaydip Sen
关键词-EN: social media platforms, digital media, media platforms, digital footprint, social media
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: This is the pre-print of the paper that was accepted for oral presentation and publication in the proceedings of IEEE ICCCNT 2024 which was organized as IIT Mandi, India from June 24 to 28, 2024. The paper is 5 pages long and it contains 4 figures and 6 tables. The is not the final version of the paper

点击查看摘要

Abstract:In the internet era, almost every business entity tries to establish a digital footprint in digital media and other social media platforms. For these entities, word of mouse is also very important. This is particularly crucial for the hospitality sector dealing with hotels, restaurants, etc. Consumers do read other consumers’ reviews before making final decisions. It therefore becomes very important to understand which aspects matter most to consumers when they give their ratings. The current study focuses on consumer reviews of Indian hotels to extract the aspects important for final ratings. The study involves gathering data using web scraping methods, analyzing the texts using Latent Dirichlet Allocation for topic extraction and sentiment analysis for aspect-specific sentiment mapping. Finally, it incorporates a Random Forest model to understand the importance of the aspects in predicting the final rating of a user.
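The aspect-to-sentiment mapping can be caricatured with a keyword lexicon; the study itself uses LDA topics and a proper sentiment model, so the cue lists and scoring below are deliberately simplified, hypothetical stand-ins:

```python
# Hypothetical aspect cues and sentiment lexicon (illustrative only).
ASPECTS = {
    "cleanliness": ["clean", "dirty", "hygiene"],
    "staff": ["staff", "service", "rude", "helpful"],
    "food": ["breakfast", "food", "restaurant"],
}
POSITIVE = {"clean", "helpful", "great", "good"}
NEGATIVE = {"dirty", "rude", "bad"}

def aspect_sentiment(review):
    """Attach a crude sentiment score (positive minus negative word
    count) to every aspect whose cue words appear in the review."""
    words = review.lower().split()
    scores = {}
    for aspect, cues in ASPECTS.items():
        if any(c in words for c in cues):
            scores[aspect] = (sum(w in POSITIVE for w in words)
                              - sum(w in NEGATIVE for w in words))
    return scores

scores = aspect_sentiment("Rooms were clean but the staff was rude")
```

In the study's pipeline, such per-aspect scores (derived from LDA topics rather than fixed cues) become features for a Random Forest, whose feature importances reveal which aspects drive the final rating.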

[LG-29] Detecting Car Speed using Object Detection and Depth Estimation: A Deep Learning Framework

链接: https://arxiv.org/abs/2408.04360
作者: Subhasis Dasgupta,Arshi Naaz,Jayeeta Choudhury,Nancy Lahiri
关键词-EN: fatal accidents, accidents are attributed, Road accidents, Radar based guns, accidents
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: This is the pre-print of the paper which was accepted for oral presentation and publication in the proceedings of IEEE CONIT 2024, organized at Pune from June 21 to 23, 2024. The paper is 6 pages long and it contains 11 figures and 1 table. This is not the final version of the paper

点击查看摘要

Abstract:Road accidents are quite common in almost every part of the world, and the majority of fatal accidents are attributed to over-speeding of vehicles. Over-speeding is usually controlled using checkpoints at various parts of the road, but not all traffic police have devices to check speed, such as existing LIDAR-based or Radar-based speed guns. The current project addresses vehicle speed estimation with handheld devices, such as mobile phones or wearable cameras with a network connection, using deep learning frameworks.
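If the depth estimator returns the distance to a detected car at two timestamps, speed along the camera axis follows directly. The helper below assumes exactly that simplification (a real pipeline must also handle lateral motion, tracking, and detector/depth noise):

```python
def speed_kmh(depth_m_t0, depth_m_t1, dt_s):
    """Speed of a tracked car from two monocular depth estimates:
    distance change along the camera axis over elapsed time,
    converted from m/s to km/h."""
    return abs(depth_m_t1 - depth_m_t0) / dt_s * 3.6

# Hypothetical readings: the car closed 10 m in half a second.
v = speed_kmh(30.0, 20.0, 0.5)
```

This is the arithmetic core of the framework: the object detector supplies which pixels belong to the car, the depth network supplies `depth_m_t0` and `depth_m_t1`, and the frame rate supplies `dt_s`.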

[LG-30] Self-Supervised Contrastive Graph Clustering Network via Structural Information Fusion

链接: https://arxiv.org/abs/2408.04339
作者: Xiaoyang Ji,Yuchen Zhou,Haofu Yang,Shiyue Xu,Jiahao Li
关键词-EN: involves partitioning, distinct clusters, partitioning the nodes, Graph clustering, classical task
类目: Machine Learning (cs.LG); Emerging Technologies (cs.ET)
*备注: 6 pages, 3 figures

点击查看摘要

Abstract:Graph clustering, a classical task in graph learning, involves partitioning the nodes of a graph into distinct clusters. This task has applications in various real-world scenarios, such as anomaly detection, social network analysis, and community discovery. Current graph clustering methods commonly rely on module pre-training to obtain a reliable prior distribution for the model, which is then used as the optimization objective. However, these methods often overlook deeper supervised signals, leading to sub-optimal reliability of the prior distribution. To address this issue, we propose a novel deep graph clustering method called CGCN. Our approach introduces contrastive signals and deep structural information into the pre-training process. Specifically, CGCN utilizes a contrastive learning mechanism to foster information interoperability among multiple modules and allows the model to adaptively adjust the degree of information aggregation for different order structures. Our CGCN method has been experimentally validated on multiple real-world graph datasets, showcasing its ability to boost the dependability of prior clustering distributions acquired through pre-training. As a result, we observed notable enhancements in the performance of the model.

[LG-31] Federated Cubic Regularized Newton Learning with Sparsification-amplified Differential Privacy

链接: https://arxiv.org/abs/2408.04315
作者: Wei Huo,Changxin Liu,Kemi Ding,Karl Henrik Johansson,Ling Shi
关键词-EN: Cubic Regularized Newton, federated learning framework, Differentially Private Federated, Private Federated Cubic, Federated Cubic Regularized
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:This paper investigates the use of the cubic-regularized Newton method within a federated learning framework while addressing two major concerns that commonly arise in federated learning: privacy leakage and communication bottleneck. We introduce a federated learning algorithm called Differentially Private Federated Cubic Regularized Newton (DP-FCRN). By leveraging second-order techniques, our algorithm achieves lower iteration complexity compared to first-order methods. We also incorporate noise perturbation during local computations to ensure privacy. Furthermore, we employ sparsification in uplink transmission, which not only reduces the communication costs but also amplifies the privacy guarantee. Specifically, this approach reduces the necessary noise intensity without compromising privacy protection. We analyze the convergence properties of our algorithm and establish the privacy guarantee. Finally, we validate the effectiveness of the proposed algorithm through experiments on a benchmark dataset.
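The uplink trick the abstract describes, sparsify first, then add noise, can be shown in isolation: keeping only the top-k coordinates shrinks the vector's sensitivity, so a smaller noise scale suffices for the same privacy level. This is an illustrative sketch of the general mechanism, not the paper's exact DP-FCRN algorithm or its noise calibration.

```python
import numpy as np

def private_sparse_update(update, k, noise_std, rng):
    """Top-k sparsification followed by the Gaussian mechanism."""
    # Keep only the k largest-magnitude coordinates.
    idx = np.argsort(np.abs(update))[-k:]
    sparse = np.zeros_like(update)
    sparse[idx] = update[idx]
    # Gaussian noise on the (lower-sensitivity) sparse vector.
    sparse[idx] += rng.normal(0.0, noise_std, size=k)
    return sparse

rng = np.random.default_rng(0)
local_update = rng.normal(size=100)
private = private_sparse_update(local_update, k=10, noise_std=0.1, rng=rng)
```

Only 10 of the 100 coordinates are transmitted, which is also where the communication savings come from.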

[LG-32] Constructing Adversarial Examples for Vertical Federated Learning: Optimal Client Corruption through Multi-Armed Bandit ICLR2024

链接: https://arxiv.org/abs/2408.04310
作者: Duanyi Yao,Songze Li,Ye Xue,Jin Liu
关键词-EN: Vertical federated learning, found numerous applications, participating client holds, Vertical federated, federated learning
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: Published on ICLR2024

点击查看摘要

Abstract:Vertical federated learning (VFL), where each participating client holds a subset of data features, has found numerous applications in finance, healthcare, and IoT systems. However, adversarial attacks, particularly through the injection of adversarial examples (AEs), pose serious challenges to the security of VFL models. In this paper, we investigate such vulnerabilities through developing a novel attack to disrupt the VFL inference process, under a practical scenario where the adversary is able to adaptively corrupt a subset of clients. We formulate the problem of finding optimal attack strategies as an online optimization problem, which is decomposed into an inner problem of adversarial example generation (AEG) and an outer problem of corruption pattern selection (CPS). Specifically, we establish the equivalence between the formulated CPS problem and a multi-armed bandit (MAB) problem, and propose the Thompson sampling with Empirical maximum reward (E-TS) algorithm for the adversary to efficiently identify the optimal subset of clients for corruption. The key idea of E-TS is to introduce an estimation of the expected maximum reward for each arm, which helps to specify a small set of competitive arms, on which the exploration for the optimal arm is performed. This significantly reduces the exploration space, which otherwise can quickly become prohibitively large as the number of clients increases. We analytically characterize the regret bound of E-TS, and empirically demonstrate its capability of efficiently revealing the optimal corruption pattern with the highest attack success rate, under various datasets of popular VFL tasks.
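For context, the Thompson-sampling backbone of E-TS looks like the standard Bernoulli bandit below (Beta posteriors, sample-then-play). The paper's contribution, an empirical estimate of each arm's maximum reward used to restrict exploration to a small "competitive" subset, is omitted from this minimal sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
true_rates = np.array([0.2, 0.5, 0.8])   # hypothetical per-arm success rates
alpha = np.ones(3)                        # Beta posterior parameters
beta = np.ones(3)

for _ in range(2000):
    sampled = rng.beta(alpha, beta)       # sample a plausible rate per arm
    arm = int(np.argmax(sampled))         # play the most promising arm
    reward = rng.random() < true_rates[arm]
    alpha[arm] += reward                  # posterior update
    beta[arm] += 1 - reward
```

After enough rounds the pull counts concentrate on the best arm; E-TS speeds this up when the number of arms (client subsets) is large.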

[LG-33] TheGlueNote: Learned Representations for Robust and Flexible Note Alignment

链接: https://arxiv.org/abs/2408.04309
作者: Silvan David Peter,Gerhard Widmer
关键词-EN: symbolically encoded piece, Dynamic Time Warping, Hidden Markov Models, matching individual notes, encoded piece
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: to be published in Proceedings of the 25th International Society for Music Information Retrieval Conference (ISMIR), 2024

点击查看摘要

Abstract:Note alignment refers to the task of matching individual notes of two versions of the same symbolically encoded piece. Methods addressing this task commonly rely on sequence alignment algorithms such as Hidden Markov Models or Dynamic Time Warping (DTW) applied directly to note or onset sequences. While successful in many cases, such methods struggle with large mismatches between the versions. In this work, we learn note-wise representations from data augmented with various complex mismatch cases, e.g. repeats, skips, block insertions, and long trills. At the heart of our approach lies a transformer encoder network - TheGlueNote - which predicts pairwise note similarities for two 512 note subsequences. We postprocess the predicted similarities using flavors of weightedDTW and pitch-separated onsetDTW to retrieve note matches for two sequences of arbitrary length. Our approach performs on par with the state of the art in terms of note alignment accuracy, is considerably more robust to version mismatches, and works directly on any pair of MIDI files.
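For readers unfamiliar with DTW, the post-processing step mentioned above can be illustrated with a plain, unweighted DTW on onset sequences. This is a textbook sketch, not TheGlueNote's actual weightedDTW or pitch-separated onsetDTW variants, which operate on the network's predicted note similarities.

```python
import numpy as np

def dtw_path(a, b):
    """Align two 1-D sequences; return matched index pairs."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack from the end to recover the optimal warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

# Two "performances" of the same piece, one with an extra onset.
ref = [0.0, 1.0, 2.0, 3.0, 4.0]
perf = [0.0, 1.1, 2.0, 2.1, 3.0, 4.2]
path = dtw_path(ref, perf)
```

The path is monotonic and covers both sequences end to end; the paper's variants replace the absolute-difference cost with learned note similarities.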

[LG-34] Partial Experts Checkpoint: Efficient Fault Tolerance for Sparse Mixture-of-Experts Model Training

链接: https://arxiv.org/abs/2408.04307
作者: Weilin Cai,Le Qin,Jiayi Huang
关键词-EN: learning systems intensifies, large language models, language models continue, deep learning systems, distributed deep learning
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:As large language models continue to scale up, the imperative for fault tolerance in distributed deep learning systems intensifies, becoming a focal area of AI infrastructure research. Checkpoint has emerged as the predominant fault tolerance strategy, with extensive studies dedicated to optimizing its efficiency. However, the advent of the sparse Mixture-of-Experts (MoE) model presents new challenges for traditional checkpoint techniques due to the substantial increase in model size, despite comparable computational demands to dense models. Breaking new ground in the realm of efficient fault tolerance for MoE model training, we introduce a novel Partial Experts Checkpoint (PEC) mechanism alongside a corresponding PEC fault-tolerant system. Our approach strategically checkpoints a selected subset of experts, thereby significantly reducing the checkpoint size for MoE models to a level comparable with that of dense models. The empirical analysis on our 8-expert GPT-MoE model demonstrates that the proposed PEC approach facilitates a substantial 54.2% decrease in the size of non-redundant checkpoint (no data-parallel duplication), without compromising the final model quality. Moreover, our PEC fault-tolerant system achieves a 76.9% reduction in checkpoint workload per data-parallel distributed rank, thereby correspondingly diminishing the checkpointing time and facilitating complete overlap with the training process.
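The core idea, save the small non-expert weights every time but only a subset of the experts per checkpoint, can be sketched in plain Python. This is a hypothetical illustration: the rotation policy below is a stand-in for the paper's strategic expert selection, and real systems would serialize tensors rather than strings.

```python
def partial_checkpoint(step, dense_params, experts, experts_per_ckpt):
    """Save dense weights plus a rotating subset of the experts."""
    n = len(experts)
    start = (step * experts_per_ckpt) % n
    chosen = [(start + i) % n for i in range(experts_per_ckpt)]
    return {
        "step": step,
        "dense": dense_params,                       # always saved
        "experts": {i: experts[i] for i in chosen},  # partial save
    }

# Toy 8-expert model, mirroring the paper's GPT-MoE setting.
experts = {i: f"weights_of_expert_{i}" for i in range(8)}
ckpt = partial_checkpoint(step=3, dense_params="dense_weights",
                          experts=experts, experts_per_ckpt=2)
```

Over consecutive checkpoints the rotation covers all experts, so any expert can be restored from a recent checkpoint while each individual checkpoint stays close to dense-model size.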

[LG-35] Trans-Tokenization and Cross-lingual Vocabulary Transfers: Language Adaptation of LLMs for Low-Resource NLP

链接: https://arxiv.org/abs/2408.04303
作者: François Remy,Pieter Delobelle,Hayastan Avetisyan,Alfiya Khabibullina,Miryam de Lhoneux,Thomas Demeester
关键词-EN: mid-resource languages continues, low and mid-resource, difficulty in sourcing, language, languages
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Accepted at COLM 2024

点击查看摘要

Abstract:The development of monolingual language models for low and mid-resource languages continues to be hindered by the difficulty in sourcing high-quality training data. In this study, we present a novel cross-lingual vocabulary transfer strategy, trans-tokenization, designed to tackle this challenge and enable more efficient language adaptation. Our approach focuses on adapting a high-resource monolingual LLM to an unseen target language by initializing the token embeddings of the target language using a weighted average of semantically similar token embeddings from the source language. For this, we leverage a translation resource covering both the source and target languages. We validate our method with the Tweeties, a series of trans-tokenized LLMs, and demonstrate their competitive performance on various downstream tasks across a small but diverse set of languages. Additionally, we introduce Hydra LLMs, models with multiple swappable language modeling heads and embedding tables, which further extend the capabilities of our trans-tokenization strategy. By designing a Hydra LLM based on the multilingual model TowerInstruct, we developed a state-of-the-art machine translation model for Tatar, in a zero-shot manner, completely bypassing the need for high-quality parallel data. This breakthrough is particularly significant for low-resource languages like Tatar, where high-quality parallel data is hard to come by. By lowering the data and time requirements for training high-quality models, our trans-tokenization strategy allows for the development of LLMs for a wider range of languages, especially those with limited resources. We hope that our work will inspire further research and collaboration in the field of cross-lingual vocabulary transfer and contribute to the empowerment of languages on a global scale.
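The initialization step at the heart of trans-tokenization, each target-language token embedding starts as a weighted average of semantically similar source embeddings, reduces to a few lines of numpy. The vocabulary sizes, dimensions, and alignment weights below are toy values, not the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)
src_emb = rng.normal(size=(5, 8))   # 5 source-language tokens, dim 8

# Translation-derived weights: target token -> {source token: weight}.
align = {0: {1: 0.7, 3: 0.3},
         1: {0: 1.0},
         2: {2: 0.5, 4: 0.5}}

# Initialize each target embedding as the normalized weighted average
# of its aligned source embeddings.
tgt_emb = np.zeros((3, 8))
for t, sources in align.items():
    total = sum(sources.values())
    for s, w in sources.items():
        tgt_emb[t] += (w / total) * src_emb[s]
```

The transformer body is reused as-is; only the embedding table (and language-modeling head, in the Hydra variant) is swapped for the target language.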

[LG-36] Tackling Noisy Clients in Federated Learning with End-to-end Label Correction CIKM’24

链接: https://arxiv.org/abs/2408.04301
作者: Xuefeng Jiang,Sheng Sun,Jia Li,Jingjing Xue,Runhan Li,Zhiyuan Wu,Gang Xu,Yuwei Wang,Min Liu
关键词-EN: diverse privacy-sensitive applications, sensitive private information, achieved wide successes, label noise, wide successes
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: To appear in ACM CIKM’24 full research paper track

点击查看摘要

Abstract:Recently, federated learning (FL) has achieved wide successes for diverse privacy-sensitive applications without sacrificing the sensitive private information of clients. However, the data quality of client datasets can not be guaranteed since corresponding annotations of different clients often contain complex label noise of varying degrees, which inevitably causes the performance degradation. Intuitively, the performance degradation is dominated by clients with higher noise rates since their trained models contain more misinformation from data, thus it is necessary to devise an effective optimization scheme to mitigate the negative impacts of these noisy clients. In this work, we propose a two-stage framework FedELC to tackle this complicated label noise issue. The first stage aims to guide the detection of noisy clients with higher label noise, while the second stage aims to correct the labels of noisy clients’ data via an end-to-end label correction framework which is achieved by learning possible ground-truth labels of noisy clients’ datasets via back propagation. We implement sixteen related methods and evaluate five datasets with three types of complicated label noise scenarios for a comprehensive comparison. Extensive experimental results demonstrate our proposed framework achieves superior performance than its counterparts for different scenarios. Additionally, we effectively improve the data quality of detected noisy clients’ local datasets with our label correction framework. The code is available at this https URL.

[LG-37] Assigning Credit with Partial Reward Decoupling in Multi-Agent Proximal Policy Optimization

链接: https://arxiv.org/abs/2408.04295
作者: Aditya Kapoor,Benjamin Freed,Howie Choset,Jeff Schneider
关键词-EN: proximal policy optimization, Multi-agent proximal policy, challenging multi-agent reinforcement, multi-agent reinforcement learning, policy optimization
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 20 pages, 5 figures, 12 tables, Reinforcement Learning Journal and Reinforcement Learning Conference 2024

点击查看摘要

Abstract:Multi-agent proximal policy optimization (MAPPO) has recently demonstrated state-of-the-art performance on challenging multi-agent reinforcement learning tasks. However, MAPPO still struggles with the credit assignment problem, wherein the sheer difficulty in ascribing credit to individual agents’ actions scales poorly with team size. In this paper, we propose a multi-agent reinforcement learning algorithm that adapts recent developments in credit assignment to improve upon MAPPO. Our approach leverages partial reward decoupling (PRD), which uses a learned attention mechanism to estimate which of a particular agent’s teammates are relevant to its learning updates. We use this estimate to dynamically decompose large groups of agents into smaller, more manageable subgroups. We empirically demonstrate that our approach, PRD-MAPPO, decouples agents from teammates that do not influence their expected future reward, thereby streamlining credit assignment. We additionally show that PRD-MAPPO yields significantly higher data efficiency and asymptotic performance compared to both MAPPO and other state-of-the-art methods across several multi-agent tasks, including StarCraft II. Finally, we propose a version of PRD-MAPPO that is applicable to *shared reward* settings, where PRD was previously not applicable, and empirically show that this also leads to performance improvements over MAPPO.

[LG-38] Dual-branch PolSAR Image Classification Based on GraphMAE and Local Feature Extraction

链接: https://arxiv.org/abs/2408.04294
作者: Yuchen Wang,Ziyi Guo,Haixia Bi,Danfeng Hong,Chen Xu
关键词-EN: synthetic aperture radar, polarimetric synthetic aperture, aperture radar, time-consuming process, synthetic aperture
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The annotation of polarimetric synthetic aperture radar (PolSAR) images is a labor-intensive and time-consuming process. Therefore, classifying PolSAR images with limited labels is a challenging task in the remote sensing domain. In recent years, self-supervised learning approaches have proven effective in PolSAR image classification with sparse labels. However, we observe a lack of research on generative self-supervised learning in the studied task. Motivated by this, we propose a dual-branch classification model based on generative self-supervised learning in this paper. The first branch is a superpixel-branch, which learns superpixel-level polarimetric representations using a generative self-supervised graph masked autoencoder. To acquire finer classification results, a convolutional neural network-based pixel-branch is further incorporated to learn pixel-level features. Classification with fused dual-branch features is finally performed to obtain the predictions. Experimental results on the benchmark Flevoland dataset demonstrate that our approach yields promising classification results.

[LG-39] Stability Analysis of Equivariant Convolutional Representations Through The Lens of Equivariant Multi-layered CKNs

链接: https://arxiv.org/abs/2408.04277
作者: Soutrik Roy Chowdhury
关键词-EN: kernel Hilbert spaces, reproducing kernel Hilbert, convolutional kernel networks, theoretically analyse group, Hilbert spaces
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper we construct and theoretically analyse group equivariant convolutional kernel networks (CKNs), which are useful in understanding the geometry of (equivariant) CNNs through the lens of reproducing kernel Hilbert spaces (RKHSs). We then study the stability of such equiv-CKNs under the action of diffeomorphisms and draw a connection with equiv-CNNs, where the goal is to analyse the geometry of the inductive biases of equiv-CNNs through the lens of RKHSs. Traditional deep learning architectures, including CNNs, trained with sophisticated optimization algorithms are vulnerable to perturbations, including ‘adversarial examples’. Understanding the RKHS norm of such models through CKNs is useful in designing appropriate architectures and robust equivariant representation learning models.

[LG-40] Early Risk Assessment Model for ICA Timing Strategy in Unstable Angina Patients Using Multi-Modal Machine Learning

链接: https://arxiv.org/abs/2408.04276
作者: Candi Zheng,Kun Liu,Yang Wang,Shiyi Chen,Hongli Li
关键词-EN: Invasive coronary arteriography, diagnosing cardiovascular diseases, Invasive coronary, coronary arteriography, cardiovascular diseases
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Background: Invasive coronary arteriography (ICA) is recognized as the gold standard for diagnosing cardiovascular diseases, including unstable angina (UA). The challenge lies in determining the optimal timing for ICA in UA patients, balancing the need for revascularization in high-risk patients against the potential complications in low-risk ones. Unlike myocardial infarction, UA does not have specific indicators like ST-segment deviation or cardiac enzymes, making risk assessment complex. Objectives: Our study aims to enhance the early risk assessment for UA patients by utilizing machine learning algorithms. These algorithms can potentially identify patients who would benefit most from ICA by analyzing less specific yet related indicators that are challenging for human physicians to interpret. Methods: We collected data from 640 UA patients at Shanghai General Hospital, including medical history and electrocardiograms (ECG). Machine learning algorithms were trained using multi-modal demographic characteristics including clinical risk factors, symptoms, biomarker levels, and ECG features extracted by pre-trained neural networks. The goal was to stratify patients based on their revascularization risk. Additionally, we translated our models into applicable and explainable look-up tables through discretization for practical clinical use. Results: The study achieved an Area Under the Curve (AUC) of 0.719 ± 0.065 in risk stratification, significantly surpassing the widely adopted GRACE score’s AUC of 0.579 ± 0.044. Conclusions: The results suggest that machine learning can provide superior risk stratification for UA patients. This improved stratification could help in balancing the risks, costs, and complications associated with ICA, indicating a potential shift in clinical assessment practices for unstable angina.

[LG-41] Generating Fine-Grained Causality in Climate Time Series Data for Forecasting and Anomaly Detection ICML2024

链接: https://arxiv.org/abs/2408.04254
作者: Dongqi Fu,Yada Zhu,Hanghang Tong,Kommy Weldemariam,Onkar Bhardwaj,Jingrui He
关键词-EN: TBN Granger Causality, Granger Causality, Neural Granger Causality, TBN Granger, time series data
类目: Machine Learning (cs.LG)
*备注: ICML 2024 AI for Science Workshop

点击查看摘要

Abstract:Understanding the causal interaction of time series variables can contribute to time series data analysis for many real-world applications, such as climate forecasting and extreme weather alerts. However, causal relationships are difficult to fully observe in real-world complex settings, such as spatial-temporal data from deployed sensor networks. Therefore, to capture fine-grained causal relations among spatial-temporal variables for a more accurate and reliable time series analysis, we first design a conceptual fine-grained causal model named TBN Granger Causality, which adds time-respecting Bayesian Networks to the previous time-lagged Neural Granger Causality to offset the instantaneous effects. Second, we propose an end-to-end deep generative model called TacSas, which discovers TBN Granger Causality in a generative manner to help forecast time series data and detect possible anomalies during the forecast. For evaluations, besides the causality discovery benchmark Lorenz-96, we also test TacSas on the climate benchmark ERA5 for climate forecasting and the extreme weather benchmark of NOAA for extreme weather alerts.

[LG-42] Cooperative Multi-Agent Deep Reinforcement Learning in Content Ranking Optimization

链接: https://arxiv.org/abs/2408.04251
作者: Zhou Qin,Kai Yuan,Pratik Lahiri,Wenyang Liu
关键词-EN: customers’ shopping missions, fulfill customers’ shopping, typical e-commerce setting, Content Ranking Optimization, mechanisms are employed
类目: Machine Learning (cs.LG)
*备注: 14 pages

点击查看摘要

Abstract:In a typical e-commerce setting, Content Ranking Optimization (CRO) mechanisms are employed to surface content on the search page to fulfill customers’ shopping missions. CRO commonly utilizes models such as contextual deep bandits model to independently rank content at different positions, e.g., one optimizer dedicated to organic search results and another to sponsored results. However, this regional optimization approach does not necessarily translate to whole page optimization, e.g., maximizing revenue at the top of the page may inadvertently diminish the revenue of lower positions. In this paper, we propose a reinforcement learning based method for whole page ranking to jointly optimize across all positions by: 1) shifting from position level optimization to whole page level optimization to achieve an overall optimized ranking; 2) applying reinforcement learning to optimize for the cumulative rewards instead of the instant reward. We formulate page level CRO as a cooperative Multi-agent Markov Decision Process, and address it with the novel Multi-Agent Deep Deterministic Policy Gradient (MADDPG) model. MADDPG supports a flexible and scalable joint optimization framework by adopting a “centralized training and decentralized execution” approach. Extensive experiments demonstrate that MADDPG scales to a 2.5 billion action space in the public Mujoco environment, and outperforms the deep bandits modeling by 25.7% on the offline CRO data set from a leading e-commerce company. We foresee that this novel multi-agent optimization is applicable to similar joint optimization problems in the field of information retrieval.

[LG-43] Scalable Transformer for High Dimensional Multivariate Time Series Forecasting

链接: https://arxiv.org/abs/2408.04245
作者: Xin Zhou,Weiqing Wang,Wray Buntine,Shilin Qu,Abishek Sriramulu,Weicong Tan,Christoph Bergmeir
关键词-EN: demonstrated significant success, recently demonstrated significant, Multivariate Time Series, Deep models, Channel-dependent models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Deep models for Multivariate Time Series (MTS) forecasting have recently demonstrated significant success. Channel-dependent models capture complex dependencies that channel-independent models cannot capture. However, the number of channels in real-world applications outpaces the capabilities of existing channel-dependent models, and contrary to common expectations, some models underperform the channel-independent models in handling high-dimensional data, which raises questions about the performance of channel-dependent models. To address this, our study first investigates the reasons behind the suboptimal performance of these channel-dependent models on high-dimensional MTS data. Our analysis reveals that two primary issues lie in the introduced noise from unrelated series that increases the difficulty of capturing the crucial inter-channel dependencies, and challenges in training strategies due to high-dimensional data. To address these issues, we propose STHD, the Scalable Transformer for High-Dimensional Multivariate Time Series Forecasting. STHD has three components: a) Relation Matrix Sparsity that limits the noise introduced and alleviates the memory issue; b) ReIndex applied as a training strategy to enable a more flexible batch size setting and increase the diversity of training data; and c) Transformer that handles 2-D inputs and captures channel dependencies. These components jointly enable STHD to manage the high-dimensional MTS while maintaining computational feasibility. Furthermore, experimental results show STHD’s considerable improvement on three high-dimensional datasets: Crime-Chicago, Wiki-People, and Traffic. The source code and dataset are publicly available this https URL.

[LG-44] The Ungrounded Alignment Problem

链接: https://arxiv.org/abs/2408.04242
作者: Marc Pickett,Aakash Kumar Nain,Joseph Modayil,Llion Jones
关键词-EN: Modern machine learning, demonstrated substantial abilities, Modern machine, ignore human-provided knowledge, machine learning systems
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注: 7 pages, plus references and appendix

点击查看摘要

Abstract:Modern machine learning systems have demonstrated substantial abilities with methods that either embrace or ignore human-provided knowledge, but combining benefits of both styles remains a challenge. One particular challenge involves designing learning systems that exhibit built-in responses to specific abstract stimulus patterns, yet are still plastic enough to be agnostic about the modality and exact form of their inputs. In this paper, we investigate what we call The Ungrounded Alignment Problem, which asks How can we build in predefined knowledge in a system where we don’t know how a given stimulus will be grounded? This paper examines a simplified version of the general problem, where an unsupervised learner is presented with a sequence of images for the characters in a text corpus, and this learner is later evaluated on its ability to recognize specific (possibly rare) sequential patterns. Importantly, the learner is given no labels during learning or evaluation, but must map images from an unknown font or permutation to its correct class label. That is, at no point is our learner given labeled images, where an image vector is explicitly associated with a class label. Despite ample work in unsupervised and self-supervised loss functions, all current methods require a labeled fine-tuning phase to map the learned representations to correct classes. Finding this mapping in the absence of labels may seem a fool’s errand, but our main result resolves this seeming paradox. We show that leveraging only letter bigram frequencies is sufficient for an unsupervised learner both to reliably associate images to class labels and to reliably identify trigger words in the sequence of inputs. More generally, this method suggests an approach for encoding specific desired innate behaviour in modality-agnostic models.
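The "bigram frequencies suffice" claim has a classical flavor: it is frequency analysis on a substitution cipher. The toy below brute-forces the relabelling whose bigram statistics match a known corpus, something only feasible for a tiny alphabet; the paper instead learns the mapping for images of characters under unknown fonts, without labels.

```python
import numpy as np
from itertools import permutations

def bigram_matrix(seq, alphabet):
    """Normalized bigram transition counts over a fixed alphabet."""
    idx = {s: i for i, s in enumerate(alphabet)}
    M = np.zeros((len(alphabet), len(alphabet)))
    for a, b in zip(seq, seq[1:]):
        M[idx[a], idx[b]] += 1
    return M / M.sum()

corpus = "aababcabcd" * 50        # toy corpus with asymmetric bigram stats
symbols = sorted(set(corpus))     # ['a', 'b', 'c', 'd']

# An unknown "font": the same text rendered under a secret relabelling.
secret = {"a": "c", "b": "d", "c": "a", "d": "b"}
observed = "".join(secret[ch] for ch in corpus)

ref = bigram_matrix(corpus, symbols)
obs = bigram_matrix(observed, symbols)

# Search for the relabelling whose bigram statistics match the corpus.
best, best_err = None, float("inf")
for p in permutations(range(len(symbols))):
    err = np.abs(obs[np.ix_(p, p)] - ref).sum()
    if err < best_err:
        best, best_err = p, err

# Map each observed symbol back to its true identity.
recovered = {symbols[best[i]]: symbols[i] for i in range(len(symbols))}
```

With no labels at any point, the observed symbols are identified purely from how often they follow one another, which is the intuition behind the paper's result.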

[LG-45] Cluster-Wide Task Slowdown Detection in Cloud System KDD2024

链接: https://arxiv.org/abs/2408.04236
作者: Feiyi Chen,Yingying Zhang,Lunting Fan,Yuxuan Liang,Guansong Pang,Qingsong Wen,Shuiguang Deng
关键词-EN: substantial liquidated damages, bring substantial liquidated, Slow task detection, Slow task, liquidated damages
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: This paper has been accepted by KDD2024

点击查看摘要

Abstract:Slow task detection is a critical problem in cloud operation and maintenance since it is highly related to user experience and can bring substantial liquidated damages. Most anomaly detection methods detect it from a single-task aspect. However, considering millions of concurrent tasks in large-scale cloud computing clusters, it becomes impractical and inefficient. Moreover, single-task slowdowns are very common and do not necessarily indicate a malfunction of a cluster due to its violent fluctuation nature in a virtual environment. Thus, we shift our attention to cluster-wide task slowdowns by utilizing the duration time distribution of tasks across a cluster, so that the computation complexity is not relevant to the number of tasks. The task duration time distribution often exhibits compound periodicity and local exceptional fluctuations over time. Though transformer-based methods are one of the most powerful methods to capture these time series normal variation patterns, we empirically find and theoretically explain the flaw of the standard attention mechanism in reconstructing subperiods with low amplitude when dealing with compound periodicity. To tackle these challenges, we propose SORN (i.e., Skimming Off subperiods in descending amplitude order and Reconstructing Non-slowing fluctuation), which consists of a Skimming Attention mechanism to reconstruct the compound periodicity and a Neural Optimal Transport module to distinguish cluster-wide slowdowns from other exceptional fluctuations. Furthermore, since anomalies in the training set are inevitable in a practical scenario, we propose a picky loss function, which adaptively assigns higher weights to reliable time slots in the training set. Extensive experiments demonstrate that SORN outperforms state-of-the-art methods on multiple real-world industrial datasets. 

[LG-46] Enhanced Traffic Flow Prediction with Multi-Segment Fusion Tensor Graph Convolutional Networks

链接: https://arxiv.org/abs/2408.04232
作者: Wei Zhang,Peng Tang
关键词-EN: traffic Flow Prediction, Accurate traffic Flow, intelligent transportation systems, holds significant importance, Flow Prediction
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Accurate traffic Flow Prediction can assist in traffic management, route planning, and congestion mitigation, which holds significant importance in enhancing the efficiency and reliability of intelligent transportation systems (ITS). However, existing traffic flow prediction models suffer from limitations in capturing the complex spatial-temporal dependencies within traffic networks. In order to address this issue, this study proposes a multi-segment fusion tensor graph convolutional network (MS-FTGCN) for traffic flow prediction with the following three-fold ideas: a) building a unified spatial-temporal graph convolutional framework based on Tensor M-product, which capture the spatial-temporal patterns simultaneously; b) incorporating hourly, daily, and weekly components to model multi temporal properties of traffic flows, respectively; c) fusing the outputs of the three components by attention mechanism to obtain the final traffic flow prediction results. The results of experiments conducted on two traffic flow datasets demonstrate that the proposed MS-FTGCN outperforms the state-of-the-art models.

[LG-47] Probabilistic Circuits for Cumulative Distribution Functions

链接: https://arxiv.org/abs/2408.04229
作者: Oliver Broadrick,William Cao,Benjie Wang,Martin Trapp,Guy Van den Broeck
关键词-EN: supports efficient probabilistic, efficient probabilistic inference, sufficient structural properties, multivariate probability distribution, probabilistic inference
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:A probabilistic circuit (PC) succinctly expresses a function that represents a multivariate probability distribution and, given sufficient structural properties of the circuit, supports efficient probabilistic inference. Typically a PC computes the probability mass (or density) function (PMF or PDF) of the distribution. We instead consider PCs that compute the cumulative distribution function (CDF). We show that for distributions over binary random variables these representations (PMF and CDF) are essentially equivalent, in the sense that one can be transformed to the other in polynomial time. We then show how a similar equivalence holds for distributions over finite discrete variables using a modification of the standard encoding with binary variables that aligns with the CDF semantics. Finally we show that for continuous variables, smooth, decomposable PCs computing PDFs and CDFs can be efficiently transformed to each other by modifying only the leaves of the circuit.
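For binary variables the claimed PMF/CDF equivalence is easy to verify directly: the CDF is a componentwise-dominance sum over the PMF, and the PMF is recovered from the CDF by inclusion-exclusion over the coordinates equal to 1. A minimal check on a toy distribution (this illustrates only the distributional identity, not the paper's circuit transformation):

```python
from itertools import product

# Joint PMF over two binary variables (arbitrary toy distribution).
pmf = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}

def cdf(x, pmf):
    """CDF F(x) = P(X <= x), componentwise."""
    return sum(p for y, p in pmf.items()
               if all(yi <= xi for yi, xi in zip(y, x)))

def pmf_from_cdf(x, cdf_fn):
    """Recover p(x) by inclusion-exclusion over the coordinates with x_i = 1:
    p(x) = sum over subsets S of those coordinates of (-1)^|S| F(x - 1_S)."""
    ones = [i for i, xi in enumerate(x) if xi == 1]
    total = 0.0
    for bits in product([0, 1], repeat=len(ones)):
        y = list(x)
        for i, b in zip(ones, bits):
            y[i] = x[i] - b          # flip a subset of the 1-coordinates to 0
        total += (-1) ** sum(bits) * cdf_fn(tuple(y))
    return total

recovered = {x: pmf_from_cdf(x, lambda z: cdf(z, pmf))
             for x in product([0, 1], repeat=2)}
```

For example, p(1,1) = F(1,1) - F(0,1) - F(1,0) + F(0,0) = 1.0 - 0.6 - 0.5 + 0.1 = 0.4.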

[LG-48] Connective Viewpoints of Signal-to-Noise Diffusion Models

链接: https://arxiv.org/abs/2408.04221
作者: Khanh Doan,Long Tung Vuong,Tuan Nguyen,Anh Tuan Bui,Quyen Tran,Thanh-Toan Do,Dinh Phung,Trung Le
关键词-EN: complex data interpolation, Diffusion models, audio generation, Diffusion, diffusion models constitute
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Diffusion models (DM) have become fundamental components of generative models, excelling across various domains such as image creation, audio generation, and complex data interpolation. Signal-to-Noise diffusion models constitute a diverse family covering most state-of-the-art diffusion models. While there have been several attempts to study Signal-to-Noise (S2N) diffusion models from various perspectives, there remains a need for a comprehensive study connecting different viewpoints and exploring new perspectives. In this study, we offer a comprehensive perspective on noise schedulers, examining their role through the lens of the signal-to-noise ratio (SNR) and its connections to information theory. Building upon this framework, we have developed a generalized backward equation to enhance the performance of the inference process.
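Concretely, for the widely used cosine noise schedule the signal-to-noise ratio is SNR(t) = \bar\alpha(t) / (1 - \bar\alpha(t)), which decays monotonically from high-signal to high-noise as t runs from 0 to 1. A minimal sketch of that quantity (a simplified schedule for illustration, not the paper's generalized backward equation):

```python
import math

def alpha_bar(t):
    """Cosine signal-retention schedule on t in (0, 1)."""
    return math.cos(t * math.pi / 2) ** 2

def snr(t):
    """Signal-to-noise ratio SNR(t) = alpha_bar(t) / (1 - alpha_bar(t))."""
    a = alpha_bar(t)
    return a / (1.0 - a)

ts = [i / 10 for i in range(1, 10)]   # avoid the singular endpoints 0 and 1
snrs = [snr(t) for t in ts]
```

Monotone decay of the SNR is exactly what makes it a useful common lens on otherwise different noise schedulers.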

[LG-49] Diffusion Guided Language Modeling ACL

链接: https://arxiv.org/abs/2408.04220
作者: Justin Lovelace,Varsha Kishore,Yiwei Chen,Kilian Q. Weinberger
关键词-EN: demonstrate remarkable proficiency, Current language models, Current language, models demonstrate remarkable, language models demonstrate
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: ACL Findings 2024

点击查看摘要

Abstract:Current language models demonstrate remarkable proficiency in text generation. However, for many applications it is desirable to control attributes of the generated language, such as sentiment or toxicity – ideally tailored towards each specific use case and target audience. For auto-regressive language models, existing guidance methods are prone to decoding errors that cascade during generation and degrade performance. In contrast, text diffusion models can easily be guided with, for example, a simple linear sentiment classifier – however they do suffer from significantly higher perplexity than auto-regressive alternatives. In this paper we use a guided diffusion model to produce a latent proposal that steers an auto-regressive language model to generate text with desired properties. Our model inherits the unmatched fluency of the auto-regressive approach and the plug-and-play flexibility of diffusion. We show that it outperforms previous plug-and-play guidance methods across a wide range of benchmark data sets. Further, controlling a new attribute in our framework is reduced to training a single logistic regression classifier.

[LG-50] DC Algorithm for Estimation of Sparse Gaussian Graphical Models

链接: https://arxiv.org/abs/2408.04206
作者: Tomokaze Shiratori,Yuichi Takano
关键词-EN: numerous observed variables, Gaussian graphical models, ell, interpretable and quantifiable, Gaussian graphical
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Sparse estimation for Gaussian graphical models is a crucial technique for making the relationships among numerous observed variables more interpretable and quantifiable. Various methods have been proposed, including graphical lasso, which utilizes the \ell_1 norm as a regularization term, as well as methods employing non-convex regularization terms. However, most of these methods approximate the \ell_0 norm with convex functions. To estimate more accurate solutions, it is desirable to treat the \ell_0 norm directly as a regularization term. In this study, we formulate the sparse estimation problem for Gaussian graphical models using the \ell_0 norm and propose a method to solve this problem using the Difference of Convex functions Algorithm (DCA). Specifically, we convert the \ell_0 norm constraint into an equivalent largest-K norm constraint, reformulate the constrained problem into a penalized form, and solve it using the DCA. Furthermore, we design an algorithm that performs this computation efficiently by leveraging graphical lasso. Experimental results with synthetic data show that our method yields results that are equivalent to or better than existing methods. Comparisons of model learning through cross-validation confirm that our method is particularly advantageous in selecting true edges.
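The key reformulation rests on an identity: the \ell_1 norm minus the largest-K norm (the sum of the K largest magnitudes) equals the total mass of the remaining entries, which is zero exactly when x has at most K nonzeros. A tiny stdlib sketch of that DC penalty (illustration of the identity only, not the full DCA for graphical models):

```python
def largest_k_norm(x, k):
    """Sum of the k largest absolute entries (a convex function of x)."""
    return sum(sorted((abs(v) for v in x), reverse=True)[:k])

def l0_penalty(x, k):
    """DC penalty ||x||_1 - |||x|||_K: zero iff x has at most k nonzeros,
    written as the difference of two convex functions."""
    return sum(abs(v) for v in x) - largest_k_norm(x, k)
```

DCA then iterates by linearizing the concave part, -|||x|||_K, around the current iterate and solving the resulting convex subproblem (which is where graphical lasso enters in the paper).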

[LG-51] Uncertainty-Aware Crime Prediction With Spatial Temporal Multivariate Graph Neural Networks

链接: https://arxiv.org/abs/2408.04193
作者: Zepu Wang,Xiaobo Ma,Huajie Yang,Weimin Lvu,Peng Sun,Sharath Chandra Guntuku
关键词-EN: stabilizing society today, society today, Zero-Inflated Negative Binomial, Negative Binomial Graph, critical component
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Crime forecasting is a critical component of urban analysis and essential for stabilizing society today. Unlike other time series forecasting problems, crime incidents are sparse, particularly in small regions and within specific time periods. Traditional spatial-temporal deep learning models often struggle with this sparsity, as they typically cannot effectively handle the non-Gaussian nature of crime data, which is characterized by numerous zeros and over-dispersed patterns. To address these challenges, we introduce a novel approach termed Spatial Temporal Multivariate Zero-Inflated Negative Binomial Graph Neural Networks (STMGNN-ZINB). This framework leverages diffusion and convolution networks to analyze spatial, temporal, and multivariate correlations, enabling the parameterization of probabilistic distributions of crime incidents. By incorporating a Zero-Inflated Negative Binomial model, STMGNN-ZINB effectively manages the sparse nature of crime data, enhancing prediction accuracy and the precision of confidence intervals. Our evaluation on real-world datasets confirms that STMGNN-ZINB outperforms existing models, providing a more reliable tool for predicting and understanding crime dynamics.
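The zero-inflated negative binomial at the core of STMGNN-ZINB mixes a point mass at zero (for the many zero-crime cells) with a negative binomial count distribution (for over-dispersion). A minimal stdlib sketch of its PMF (parameter names are ours, not the paper's):

```python
import math

def nb_logpmf(k, r, p):
    """Negative binomial: number of failures k before the r-th success,
    success probability p."""
    return (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
            + r * math.log(p) + k * math.log(1 - p))

def zinb_pmf(k, pi, r, p):
    """Zero-inflated NB: extra point mass pi at zero, (1 - pi) of NB mass."""
    nb = math.exp(nb_logpmf(k, r, p))
    return pi + (1 - pi) * nb if k == 0 else (1 - pi) * nb

# The mixture is a proper distribution: its mass sums to 1.
total = sum(zinb_pmf(k, 0.3, 2.0, 0.4) for k in range(500))
```

In STMGNN-ZINB the graph network predicts the parameters (here pi, r, p) per region and time slot, which is what yields calibrated confidence intervals for sparse counts.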

[LG-52] Listwise Reward Estimation for Offline Preference-based Reinforcement Learning ICML2024

链接: https://arxiv.org/abs/2408.04190
作者: Heewoong Choi,Sangwon Jung,Hongjoon Ahn,Taesup Moon
关键词-EN: designing precise reward, Reinforcement Learning, precise reward functions, reward functions remains, designing precise
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 21 pages, ICML 2024

点击查看摘要

Abstract:In Reinforcement Learning (RL), designing precise reward functions remains a challenge, particularly when aligning with human intent. Preference-based RL (PbRL) was introduced to address this problem by learning reward models from human feedback. However, existing PbRL methods have limitations as they often overlook the second-order preference that indicates the relative strength of preference. In this paper, we propose Listwise Reward Estimation (LiRE), a novel approach for offline PbRL that leverages second-order preference information by constructing a Ranked List of Trajectories (RLT), which can be efficiently built by using the same ternary feedback type as traditional methods. To validate the effectiveness of LiRE, we propose a new offline PbRL dataset that objectively reflects the effect of the estimated rewards. Our extensive experiments on the dataset demonstrate the superiority of LiRE, i.e., outperforming state-of-the-art baselines even with modest feedback budgets and enjoying robustness with respect to the amount of feedback and feedback noise. Our code is available at this https URL
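The construction of a ranked list from ternary (prefer A / prefer B / tie) feedback can be sketched as insertion into an ordered list of tie-groups, comparing each new trajectory against one representative per group. This is only an illustration of building such a list, with the feedback simulated by scalar "true returns"; it is not LiRE's reward estimator:

```python
def ternary(a, b):
    """Simulated ternary feedback: 1 (a preferred), -1 (b preferred), 0 (tie)."""
    return (a > b) - (a < b)

def build_ranked_list(items, feedback):
    """Insert items into an ascending list of tie-groups using only
    ternary comparisons against one representative per group."""
    groups = []                              # tie-groups, ascending preference
    for t in items:
        for j, g in enumerate(groups):
            c = feedback(t, g[0])
            if c == 0:                       # tie: join the group
                g.append(t)
                break
            if c < 0:                        # first group that t loses to
                groups.insert(j, [t])
                break
        else:                                # t beats every existing group
            groups.append([t])
    return groups

rlt = build_ranked_list([3, 1, 2, 2, 5], ternary)
```

The resulting list orders whole groups of equally preferred trajectories, which is the second-order information a single pairwise label cannot express.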

[LG-53] pyBregMan: A Python library for Bregman Manifolds

链接: https://arxiv.org/abs/2408.04175
作者: Frank Nielsen,Alexander Soen
关键词-EN: dually flat space, Bregman manifolds, Bregman, convex Bregman generators, dually flat
类目: Machine Learning (cs.LG); Computational Geometry (cs.CG); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT)
*备注: 28 pages

点击查看摘要

Abstract:A Bregman manifold is a synonym for a dually flat space in information geometry which admits as a canonical divergence a Bregman divergence. Bregman manifolds are induced by smooth strictly convex functions like the cumulant or partition functions of regular exponential families, the negative entropy of mixture families, or the characteristic functions of regular cones just to list a few such convex Bregman generators. We describe the design of pyBregMan, a library which implements generic operations on Bregman manifolds and instantiate several common Bregman manifolds used in information sciences. At the core of the library is the notion of Legendre-Fenchel duality inducing a canonical pair of dual potential functions and dual Bregman divergences. The library also implements the Fisher-Rao manifolds of categorical/multinomial distributions and multivariate normal distributions. To demonstrate the use of the pyBregMan kernel manipulating those Bregman and Fisher-Rao manifolds, the library also provides several core algorithms for various applications in statistics, machine learning, information fusion, and so on.
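The canonical object the library is built around, the Bregman divergence of a convex generator F, is one line once F and its gradient are given: B_F(x, y) = F(x) - F(y) - <∇F(y), x - y>. A hedged stdlib sketch with two classic generators (pyBregMan's actual API will differ; see its documentation):

```python
import math

def bregman(F, gradF, x, y):
    """Bregman divergence B_F(x, y) = F(x) - F(y) - <gradF(y), x - y>."""
    return F(x) - F(y) - sum(g * (a - b) for g, a, b in zip(gradF(y), x, y))

# Generator 1: half squared norm -> half the squared Euclidean distance.
F_sq = lambda v: 0.5 * sum(vi * vi for vi in v)
g_sq = lambda v: list(v)

# Generator 2: negative Shannon entropy -> extended KL divergence.
F_ent = lambda v: sum(vi * math.log(vi) for vi in v)
g_ent = lambda v: [math.log(vi) + 1.0 for vi in v]

x, y = (0.2, 0.8), (0.5, 0.5)
d_sq = bregman(F_sq, g_sq, x, y)
d_kl = bregman(F_ent, g_ent, x, y)
```

For normalized vectors the negative-entropy generator recovers the ordinary KL divergence, illustrating how one generic definition instantiates many familiar divergences.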

[LG-54] wav2graph: A Framework for Supervised Learning Knowledge Graph from Speech

链接: https://arxiv.org/abs/2408.04174
作者: Khai Le-Duc,Quy-Anh Dang,Tan-Hanh Pham,Truong-Son Hy
关键词-EN: large language models, enhance the performance, providing structured, reasoning and context-awareness, performance of large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注: Preprint, 32 pages

点击查看摘要

Abstract:Knowledge graphs (KGs) enhance the performance of large language models (LLMs) and search engines by providing structured, interconnected data that improves reasoning and context-awareness. However, KGs only focus on text data, thereby neglecting other modalities such as speech. In this work, we introduce wav2graph, the first framework for supervised learning knowledge graph from speech data. Our pipeline are straightforward: (1) constructing a KG based on transcribed spoken utterances and a named entity database, (2) converting KG into embedding vectors, and (3) training graph neural networks (GNNs) for node classification and link prediction tasks. Through extensive experiments conducted in inductive and transductive learning contexts using state-of-the-art GNN models, we provide baseline results and error analysis for node classification and link prediction tasks on human transcripts and automatic speech recognition (ASR) transcripts, including evaluations using both encoder-based and decoder-based node embeddings, as well as monolingual and multilingual acoustic pre-trained models. All related code, data, and models are published online.

[LG-55] The Data Addition Dilemma

链接: https://arxiv.org/abs/2408.04154
作者: Judy Hanwen Shen,Inioluwa Deborah Raji,Irene Y. Chen
关键词-EN: healthcare tasks, standard datasets, fundamentally dissimilar, machine learning, learning for healthcare
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: Machine Learning For Health Care 2024 (MLHC)

点击查看摘要

Abstract:In many machine learning for healthcare tasks, standard datasets are constructed by amassing data across many, often fundamentally dissimilar, sources. But when does adding more data help, and when does it hinder progress on desired model outcomes in real-world settings? We identify this situation as the \textit{Data Addition Dilemma}, demonstrating that adding training data in this multi-source scaling context can at times result in reduced overall accuracy, uncertain fairness outcomes, and reduced worst-subgroup performance. We find that this possibly arises from an empirically observed trade-off between model performance improvements due to data scaling and model deterioration from distribution shift. We thus establish baseline strategies for navigating this dilemma, introducing distribution shift heuristics to guide decision-making on which data sources to add in data scaling, in order to yield the expected model performance improvements. We conclude with a discussion of the required considerations for data collection and suggestions for studying data composition and scale in the age of increasingly larger models.

[LG-56] Heterogeneous Graph Sequence Neural Networks for Dynamic Traffic Assignment

链接: https://arxiv.org/abs/2408.04131
作者: Tong Liu,Hadi Meidani
关键词-EN: intelligent transportation systems, provide critical insights, prediction provide critical, Traffic, traffic flows
类目: Machine Learning (cs.LG)
*备注: 9 pages, 5 figures

点击查看摘要

Abstract:Traffic assignment and traffic flow prediction provide critical insights for urban planning, traffic management, and the development of intelligent transportation systems. An efficient model for calculating traffic flows over the entire transportation network could provide a more detailed and realistic understanding of traffic dynamics. However, existing traffic prediction approaches, such as those utilizing graph neural networks, are typically limited to locations where sensors are deployed and cannot predict traffic flows beyond sensor locations. To alleviate this limitation, inspired by the fundamental relationship between link flows and origin-destination (OD) travel demands, we propose the Heterogeneous Spatio-Temporal Graph Sequence Network (HSTGSN). HSTGSN exploits the dependency between origin and destination nodes, even when it is long-range, and learns implicit vehicle route choices under different origin-destination demands. This model is based on a heterogeneous graph which consists of road links, OD links (virtual links connecting origins and destinations) and a spatio-temporal graph encoder-decoder that captures the spatio-temporal relationship between OD demands and flow distribution. We show how the graph encoder-decoder is able to recover the incomplete information in the OD demand by using node embeddings from the graph decoder to predict the temporal changes in flow distribution. Using extensive experimental studies on real-world networks with complete/incomplete OD demands, we demonstrate that our method can not only capture the implicit spatio-temporal relationship between link traffic flows and OD demands but also achieve accurate prediction performance and generalization capability.

[LG-57] Out-of-Core Dimensionality Reduction for Large Data via Out-of-Sample Extensions

链接: https://arxiv.org/abs/2408.04129
作者: Luca Reichmann,David Hägele,Daniel Weiskopf
关键词-EN: data sets, high-dimensional data sets, data, large data, large data visualization
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Dimensionality reduction (DR) is a well-established approach for the visualization of high-dimensional data sets. While DR methods are often applied to typical DR benchmark data sets in the literature, they might suffer from high runtime complexity and memory requirements, making them unsuitable for large data visualization especially in environments outside of high-performance computing. To perform DR on large data sets, we propose the use of out-of-sample extensions. Such extensions allow inserting new data into existing projections, which we leverage to iteratively project data into a reference projection that consists only of a small manageable subset. This process makes it possible to perform DR out-of-core on large data, which would otherwise not be possible due to memory and runtime limitations. For metric multidimensional scaling (MDS), we contribute an implementation with out-of-sample projection capability since typical software libraries do not support it. We provide an evaluation of the projection quality of five common DR algorithms (MDS, PCA, t-SNE, UMAP, and autoencoders) using quality metrics from the literature and analyze the trade-off between the size of the reference set and projection quality. The runtime behavior of the algorithms is also quantified with respect to reference set size, out-of-sample batch size, and dimensionality of the data sets. Furthermore, we compare the out-of-sample approach to other recently introduced DR methods, such as PaCMAP and TriMAP, which claim to handle larger data sets than traditional approaches. To showcase the usefulness of DR on this large scale, we contribute a use case where we analyze ensembles of streamlines amounting to one billion projected instances.
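The basic out-of-sample step, placing one new point into an existing projection by minimizing its stress against a small reference set, can be sketched with plain gradient descent. This is a toy version for a single 2-D point (real implementations are considerably more careful about initialization and step sizes):

```python
import math

def insert_out_of_sample(ref_pts, dists, iters=5000, lr=0.05):
    """Place one new point into an existing 2-D MDS layout by gradient
    descent on its stress sum((|y - y_i| - d_i)^2) w.r.t. the references."""
    # Initialize at the centroid of the reference projection.
    y = [sum(p[0] for p in ref_pts) / len(ref_pts),
         sum(p[1] for p in ref_pts) / len(ref_pts)]
    for _ in range(iters):
        gx = gy = 0.0
        for (px, py), d in zip(ref_pts, dists):
            dx, dy = y[0] - px, y[1] - py
            cur = math.hypot(dx, dy) or 1e-12     # avoid divide-by-zero
            coef = 2.0 * (cur - d) / cur          # gradient of (cur - d)^2
            gx += coef * dx
            gy += coef * dy
        y[0] -= lr * gx
        y[1] -= lr * gy
    return y

refs = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]
true_pt = (1.2, 0.7)
dists = [math.hypot(true_pt[0] - px, true_pt[1] - py) for px, py in refs]
placed = insert_out_of_sample(refs, dists)
```

With exact distances to three non-collinear references the stress has a unique zero, so the descent recovers the point; iterating this insertion over a large dataset is what makes the out-of-core pipeline possible.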

[LG-58] Exploring RAG-based Vulnerability Augmentation with LLMs

链接: https://arxiv.org/abs/2408.04125
作者: Seyed Shayan Daneshvar,Yu Nong,Xu Yang,Shaowei Wang,Haipeng Cai
关键词-EN: Detecting vulnerabilities, maintaining the integrity, software systems, security of software, Detecting
类目: oftware Engineering (cs.SE); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 13 pages, 6 figures, 5 tables, 3 prompt templates, 1 algorithm

点击查看摘要

Abstract:Detecting vulnerabilities is a crucial task for maintaining the integrity, availability, and security of software systems. Utilizing DL-based models for vulnerability detection has become commonplace in recent years. However, such deep learning-based vulnerability detectors (DLVD) suffer from a shortage of sizable datasets to train effectively. Data augmentation can potentially alleviate the shortage of data, but augmenting vulnerable code is challenging and requires designing a generative solution that maintains vulnerability. Hence, the work on generating vulnerable code samples has been limited and previous works have only focused on generating samples that contain single statements or specific types of vulnerabilities. Lately, large language models (LLMs) are being used for solving various code generation and comprehension tasks and have shown inspiring results, especially when fused with retrieval augmented generation (RAG). In this study, we explore three different strategies for augmenting both single- and multi-statement vulnerabilities with LLMs, namely Mutation, Injection, and Extension. We conducted an extensive evaluation of our proposed approach on three vulnerability datasets and three DLVD models, using two LLMs. Our results show that our injection-based clustering-enhanced RAG method beats the baseline setting (NoAug), Vulgen, and VGX (two SOTA methods), and Random Oversampling (ROS) by 30.80%, 27.48%, 27.93%, and 15.41% in f1-score with 5K generated vulnerable samples on average, and 53.84%, 54.10%, 69.90%, and 40.93% with 15K generated vulnerable samples. Our approach demonstrates its feasibility for large-scale data augmentation by generating 1K samples at as cheap as US$1.88.

[LG-59] Overcoming Brittleness in Pareto-Optimal Learning-Augmented Algorithms

链接: https://arxiv.org/abs/2408.04122
作者: Spyros Angelopoulos,Christoph Dürr,Alex Elenter,Yanni Lefki
关键词-EN: gained considerable prominence, Pareto-optimal algorithms, recent years, gained considerable, considerable prominence
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
*备注:

点击查看摘要

Abstract:The study of online algorithms with machine-learned predictions has gained considerable prominence in recent years. One of the common objectives in the design and analysis of such algorithms is to attain (Pareto) optimal tradeoffs between the consistency of the algorithm, i.e., its performance assuming perfect predictions, and its robustness, i.e., the performance of the algorithm under adversarial predictions. In this work, we demonstrate that this optimization criterion can be extremely brittle, in that the performance of Pareto-optimal algorithms may degrade dramatically even in the presence of imperceptible prediction error. To remedy this drawback, we propose a new framework in which the smoothness in the performance of the algorithm is enforced by means of a user-specified profile. This allows us to regulate the performance of the algorithm as a function of the prediction error, while simultaneously maintaining the analytical notion of consistency/robustness tradeoffs, adapted to the profile setting. We apply this new approach to a well-studied online problem, namely the one-way trading problem. For this problem, we further address another limitation of the state-of-the-art Pareto-optimal algorithms, namely the fact that they are tailored to worst-case, and extremely pessimistic inputs. We propose a new Pareto-optimal algorithm that leverages any deviation from the worst-case input to its benefit, and introduce a new metric that allows us to compare any two Pareto-optimal algorithms via a dominance relation.
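For context on the kind of guarantee being refined here: the textbook reservation-price policy for the closely related one-max search problem (accept the first price at or above sqrt(mM) when prices are known to lie in [m, M]) already achieves the classic sqrt(M/m) competitive ratio. A short sketch of that baseline, not the paper's Pareto-optimal or profile-based algorithms:

```python
import math

def reservation_price_search(prices, m, M):
    """One-max search with prices in [m, M]: accept the first price at or
    above sqrt(m * M); if none arrives, take the last price.
    Worst-case competitive ratio: sqrt(M / m)."""
    threshold = math.sqrt(m * M)
    for p in prices:
        if p >= threshold:
            return p
    return prices[-1]

m, M = 1.0, 16.0                      # competitive-ratio bound sqrt(16) = 4
sequences = [[1, 2, 3, 16], [5, 3, 1], [3.99, 1.0], [1, 1, 1]]
ratios = [max(s) / reservation_price_search(s, m, M) for s in sequences]
```

The bound follows by cases: if some price reaches the threshold, the gain is at least sqrt(mM) against a maximum of at most M; otherwise the maximum stays below sqrt(mM) while the last price is at least m.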

[LG-60] Combining Neural Architecture Search and Automatic Code Optimization: A Survey

链接: https://arxiv.org/abs/2408.04116
作者: Inas Bachiri,Hadjer Benmeziane,Smail Niar,Riyadh Baghdadi,Hamza Ouarnoughi,Abdelkrime Aries
关键词-EN: Deep Learning models, Deep Learning, experienced exponential growth, Learning models, experienced exponential
类目: Machine Learning (cs.LG); Programming Languages (cs.PL)
*备注: version 0, 13 pages, 4 figures

点击查看摘要

Abstract:Deep Learning models have experienced exponential growth in complexity and resource demands in recent years. Accelerating these models for efficient execution on resource-constrained devices has become more crucial than ever. Two notable techniques employed to achieve this goal are Hardware-aware Neural Architecture Search (HW-NAS) and Automatic Code Optimization (ACO). HW-NAS automatically designs accurate yet hardware-friendly neural networks, while ACO involves searching for the best compiler optimizations to apply on neural networks for efficient mapping and inference on the target hardware. This survey explores recent works that combine these two techniques within a single framework. We present the fundamental principles of both domains and demonstrate their sub-optimality when performed independently. We then investigate their integration into a joint optimization process that we call Hardware Aware-Neural Architecture and Compiler Optimizations co-Search (NACOS).

[LG-61] Zero-shot Factual Consistency Evaluation Across Domains

链接: https://arxiv.org/abs/2408.04114
作者: Raunak Agarwal
关键词-EN: text generation systems, Natural Language Inference, Factual Consistency Evaluation, factual consistency, generation systems
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This work addresses the challenge of factual consistency in text generation systems. We unify the tasks of Natural Language Inference, Summarization Evaluation, Factuality Verification and Factual Consistency Evaluation to train models capable of evaluating the factual consistency of source-target pairs across diverse domains. We rigorously evaluate these against eight baselines on a comprehensive benchmark suite comprising 22 datasets that span various tasks, domains, and document lengths. Results demonstrate that our method achieves state-of-the-art performance on this heterogeneous benchmark while addressing efficiency concerns and attaining cross-domain generalization.

[LG-62] UpLIF: An Updatable Self-Tuning Learned Index Framework

链接: https://arxiv.org/abs/2408.04113
作者: Alireza Heidari,Amirhossein Ahmadi,Wei Zhang
关键词-EN: estimate keys’ positions, key search efficiency, index size reduction, significant challenge inherent, learned index modeling
类目: Databases (cs.DB); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 20 pages, ACM IDEAS 2024

点击查看摘要

Abstract:The emergence of learned indexes has caused a paradigm shift in our perception of indexing by considering indexes as predictive models that estimate keys’ positions within a data set, resulting in notable improvements in key search efficiency and index size reduction; however, a significant challenge inherent in learned index modeling is its constrained support for update operations, necessitated by the requirement for a fixed distribution of records. Previous studies have proposed various approaches to address this issue with the drawback of high overhead due to multiple model retraining. In this paper, we present UpLIF, an adaptive self-tuning learned index that adjusts the model to accommodate incoming updates, predicts the distribution of updates for performance improvement, and optimizes its index structure using reinforcement learning. We also introduce the concept of balanced model adjustment, which determines the model’s inherent properties (i.e. bias and variance), enabling the integration of these factors into the existing index model without the need for retraining with new data. Our comprehensive experiments show that the system surpasses state-of-the-art indexing solutions (both traditional and ML-based), achieving an increase in throughput of up to 3.12 times with 1000 times less memory usage.
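The baseline idea behind any learned index, predicting a key's position with a model and correcting within the model's maximum observed error, fits in a few lines. This is a minimal static sketch of that idea, not UpLIF's updatable, self-tuning structure:

```python
def fit_learned_index(keys):
    """Fit position ~ a * key + b over sorted keys by least squares and
    record the maximum prediction error as the search bound."""
    n = len(keys)
    mean_k = sum(keys) / n
    mean_p = (n - 1) / 2
    var = sum((k - mean_k) ** 2 for k in keys)
    a = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(keys)) / var
    b = mean_p - a * mean_k
    err = max(abs(i - (a * k + b)) for i, k in enumerate(keys))
    return a, b, int(err) + 1

def lookup(keys, model, key):
    """Predict a position, then search only within the error bound."""
    a, b, err = model
    guess = int(a * key + b)
    lo, hi = max(0, guess - err), min(len(keys), guess + err + 1)
    for i in range(lo, hi):
        if keys[i] == key:
            return i
    return -1

keys = [i * i for i in range(1, 60)]       # sorted, deliberately non-linear
model = fit_learned_index(keys)
```

Updates break this picture because they shift the key distribution and invalidate the recorded error bound, which is precisely the problem UpLIF's balanced model adjustment targets.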

[LG-63] Zero-Delay QKV Compression for Mitigating KV Cache and Network Bottlenecks in LLM Inference

链接: https://arxiv.org/abs/2408.04107
作者: Zeyu Zhang,Haiying Shen
关键词-EN: key-value cache, pose a challenge, KVC, time, JCT
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:In large language models, memory constraints in the key-value cache (KVC) pose a challenge during inference, especially with long prompts. In this work, we observed that compressing KV values is more effective than compressing the model regarding accuracy and job completion time (JCT). However, quantizing KV values and dropping less-important tokens incur significant runtime computation overhead, delaying JCT. These methods also cannot reduce computation time or high network communication time overhead in sequence-parallelism (SP) frameworks for long prompts. To tackle these issues, based on our insightful observations from experimental analysis, we propose ZeroC, a Zero-delay QKV Compression system that eliminates time overhead and even reduces computation and communication time of the model operations. ZeroC innovatively embeds compression and decompression operations within model operations and adaptively determines compression ratios at a hybrid layer-token level. Further, it enables a communication-efficient SP inference framework. Trace-driven experiments demonstrate that ZeroC achieves up to 80% lower average JCT, 35% lower average perplexity, and 2.8x higher throughput with the same latency compared to state-of-the-art compression methods. ZeroC also reduces the average JCT of current LLM serving systems by up to 91% with the constraint of 0.1 perplexity increase. We open-sourced the code.
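A common building block behind KV-value compression is simple symmetric int8 quantization of the cached tensors; ZeroC's actual scheme is adaptive and fused into the model operations, so treat this purely as background on the quantize/dequantize round trip and its bounded error:

```python
def quantize_int8(values):
    """Symmetric per-tensor int8 quantization: map floats to [-127, 127]."""
    scale = max(abs(v) for v in values) / 127.0 or 1.0   # guard all-zero input
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Reconstruct approximate floats; error per entry is at most scale / 2."""
    return [qi * scale for qi in q]

vals = [0.5, -1.25, 3.0, 0.0]
q, scale = quantize_int8(vals)
recon = dequantize(q, scale)
```

Storing 1 byte instead of 2 or 4 per value is where the memory saving comes from; the engineering challenge the paper addresses is doing this (and the reverse) without adding latency on the inference critical path.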

[LG-64] Hardware-Assisted Virtualization of Neural Processing Units for Cloud Platforms MICRO’24

链接: https://arxiv.org/abs/2408.04104
作者: Yuqi Xue,Yiqi Liu,Lifeng Nai,Jian Huang
关键词-EN: Cloud platforms today, deploying hardware accelerators, neural processing units, NPU, powering machine learning
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Operating Systems (cs.OS)
*备注: Accepted to MICRO’24

点击查看摘要

Abstract:Cloud platforms today have been deploying hardware accelerators like neural processing units (NPUs) for powering machine learning (ML) inference services. To maximize the resource utilization while ensuring reasonable quality of service, a natural approach is to virtualize NPUs for efficient resource sharing for multi-tenant ML services. However, virtualizing NPUs for modern cloud platforms is not easy. This is not only due to the lack of system abstraction support for NPU hardware, but also due to the lack of architectural and ISA support for enabling fine-grained dynamic operator scheduling for virtualized NPUs. We present TCloud, a holistic NPU virtualization framework. We investigate virtualization techniques for NPUs across the entire software and hardware stack. TCloud consists of (1) a flexible NPU abstraction called vNPU, which enables fine-grained virtualization of the heterogeneous compute units in a physical NPU (pNPU); (2) a vNPU resource allocator that enables pay-as-you-go computing model and flexible vNPU-to-pNPU mappings for improved resource utilization and cost-effectiveness; (3) an ISA extension of modern NPU architecture for facilitating fine-grained tensor operator scheduling for multiple vNPUs. We implement TCloud based on a production-level NPU simulator. Our experiments show that TCloud improves the throughput of ML inference services by up to 1.4 \times and reduces the tail latency by up to 4.6 \times , while improving the NPU utilization by 1.2 \times on average, compared to state-of-the-art NPU sharing approaches.

[LG-65] Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters

链接: https://arxiv.org/abs/2408.04093
作者: Vasudev Shyam,Jonathan Pilault,Emily Shepperd,Quentin Anthony,Beren Millidge
关键词-EN: modern transformer architectures, significant computational bottleneck, core mathematical operation, computational bottleneck due, core mathematical
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Self-attention is the core mathematical operation of modern transformer architectures and is also a significant computational bottleneck due to its quadratic complexity in the sequence length. In this work, we derive the scalar energy function whose gradient computes the self-attention block, thus elucidating the theoretical underpinnings of self-attention, providing a Bayesian interpretation of the operation and linking it closely with energy-based models such as Hopfield Networks. Moreover, due to this formulation, we discover that we can use efficient and optimized automatic-differentiation techniques to derive a highly efficient Tree Attention algorithm to compute the gradient of the energy and hence self-attention. Our formulation reveals that the reduction across the sequence axis can be efficiently computed in parallel through a tree reduction. Our algorithm, for parallelizing attention computation across multiple GPUs, enables cross-device decoding to be performed asymptotically faster (up to 8x faster) than alternative approaches such as Ring Attention, while also requiring significantly less communication volume and incurring 2x less peak memory. Our code is publicly available at this https URL
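The tree reduction at the heart of this approach relies on logsumexp (the energy's normalizer) being computable from associative (running-max, scaled-sum) pairs, so per-device chunks can be combined pairwise. A single-process stdlib sketch of that reduction primitive (the distributed, multi-GPU version is of course far more involved):

```python
import math

def lse_leaf(xs):
    """Summarize one chunk as a (running max, scaled sum) pair."""
    m = max(xs)
    return m, sum(math.exp(x - m) for x in xs)

def lse_combine(a, b):
    """Associative combine step: the tree-reduction primitive."""
    (ma, sa), (mb, sb) = a, b
    m = max(ma, mb)
    return m, sa * math.exp(ma - m) + sb * math.exp(mb - m)

def tree_logsumexp(xs, chunk=4):
    """logsumexp of xs via a pairwise tree reduction over chunk summaries."""
    parts = [lse_leaf(xs[i:i + chunk]) for i in range(0, len(xs), chunk)]
    while len(parts) > 1:
        parts = [lse_combine(parts[i], parts[i + 1]) if i + 1 < len(parts)
                 else parts[i] for i in range(0, len(parts), 2)]
    m, s = parts[0]
    return m + math.log(s)

xs = [0.1 * i for i in range(17)]
direct = math.log(sum(math.exp(v) for v in xs))
treed = tree_logsumexp(xs)
```

Because the combine step is associative, the reduction can be arranged as a balanced tree across devices, which is what yields the logarithmic-depth communication pattern.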

[LG-66] PowerPM: Foundation Model for Power Systems

链接: https://arxiv.org/abs/2408.04057
作者: Shihao Tu,Yupeng Zhang,Jing Zhang,Yang Yang
关键词-EN: including demand-side management, ETS data, ETS, electricity time series, consumer behavior analysis
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 23 pages, 5 figures, 8 tables

点击查看摘要

Abstract:The emergence of abundant electricity time series (ETS) data provides ample opportunities for various applications in the power systems, including demand-side management, grid stability, and consumer behavior analysis. Deep learning models have advanced ETS modeling by effectively capturing sequence dependence. Nevertheless, learning a generic representation of ETS data for various applications remains challenging due to the inherently complex hierarchical structure of ETS data. Moreover, ETS data exhibits intricate temporal dependencies and is susceptible to the influence of exogenous variables. Furthermore, different instances exhibit diverse electricity consumption behavior. In this paper, we propose a foundation model PowerPM to model ETS data, providing a large-scale, off-the-shelf model for power systems. PowerPM consists of a temporal encoder and a hierarchical encoder. The temporal encoder captures temporal dependencies in ETS data while considering exogenous variables. The hierarchical encoder models the correlations between hierarchical levels. Furthermore, PowerPM leverages a novel self-supervised pretraining framework consisting of masked ETS modeling and dual-view contrastive learning, which enables PowerPM to capture temporal dependencies within ETS windows and to be aware of discrepancies across ETS windows, providing two different perspectives to learn generic representations. Our experiments involve five real-world scenario datasets, comprising private and public data. Through pre-training on massive ETS data, PowerPM achieves SOTA performance on diverse downstream tasks within the private dataset. Impressively, when transferred to the public datasets, PowerPM maintains its superiority, showcasing its remarkable generalization ability across various tasks and domains. Moreover, ablation studies and few-shot experiments provide additional evidence of the effectiveness of our model.

[LG-67] Deep Generative Models for Subgraph Prediction ECAI2024

链接: https://arxiv.org/abs/2408.04053
作者: Erfaneh Mahmoudzadeh,Parmis Naddaf,Kiarash Zahirnia,Oliver Schulte
关键词-EN: Graph Neural Networks, social network analysis, Neural Networks, complex relational data, social network
类目: Machine Learning (cs.LG)
*备注: accepted at ECAI 2024

点击查看摘要

Abstract:Graph Neural Networks (GNNs) are important across different domains, such as social network analysis and recommendation systems, due to their ability to model complex relational data. This paper introduces subgraph queries as a new task for deep graph learning. Unlike traditional graph prediction tasks that focus on individual components like link prediction or node classification, subgraph queries jointly predict the components of a target subgraph based on evidence that is represented by an observed subgraph. For instance, a subgraph query can predict a set of target links and/or node labels. To answer subgraph queries, we utilize a probabilistic deep Graph Generative Model. Specifically, we inductively train a Variational Graph Auto-Encoder (VGAE) model, augmented to represent a joint distribution over links, node features and labels. Bayesian optimization is used to tune a weighting for the relative importance of links, node features and labels in a specific domain. We describe a deterministic and a sampling-based inference method for estimating subgraph probabilities from the VGAE generative graph distribution, without retraining, in zero-shot fashion. For evaluation, we apply the inference methods on a range of subgraph queries on six benchmark datasets. We find that inference from a joint model achieves superior predictive performance, surpassing independent prediction baselines with improvements in AUC scores ranging from 0.06 to 0.2 points, depending on the dataset.

[LG-68] Learning Rate-Free Reinforcement Learning: A Case for Model Selection with Non-Stationary Objectives

链接: https://arxiv.org/abs/2408.04046
作者: Aida Afshar,Aldo Pacchiano
关键词-EN: learning rate, learning, reinforcement learning, model selection, Rate-Free Reinforcement Learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: RLC 2024 Workshop on Failure Modes of Sequential Decision-Making in Practice

点击查看摘要

Abstract:The performance of reinforcement learning (RL) algorithms is sensitive to the choice of hyperparameters, with the learning rate being particularly influential. RL algorithms fail to reach convergence or demand an extensive number of samples when the learning rate is not optimally set. In this work, we show that model selection can help to improve the failure modes of RL that are due to suboptimal choices of learning rate. We present a model selection framework for Learning Rate-Free Reinforcement Learning that employs model selection methods to select the optimal learning rate on the fly. This approach of adaptive learning rate tuning neither depends on the underlying RL algorithm nor the optimizer and solely uses the reward feedback to select the learning rate; hence, the framework can input any RL algorithm and produce a learning rate-free version of it. We conduct experiments for policy optimization methods and evaluate various model selection strategies within our framework. Our results indicate that data-driven model selection algorithms are better alternatives to standard bandit algorithms when the optimal choice of hyperparameter is time-dependent and non-stationary.
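The idea of selecting the learning rate on the fly from reward feedback alone can be illustrated with a toy sketch: each candidate learning rate is a bandit arm, and a UCB index decides which rate to use at every optimization step. The objective, noise level, and candidate grid below are our own illustrative choices, not the paper's setup:

```python
import math, random

def lr_free_descent(lrs, steps=300, x0=10.0, seed=0):
    """Gradient descent with the learning rate chosen on the fly by a UCB bandit.

    Each candidate rate in `lrs` is an arm; the reward for pulling an arm is the
    observed decrease in loss after one step taken with that rate."""
    rng = random.Random(seed)
    f = lambda z: (z - 3.0) ** 2           # toy objective, optimum at z = 3
    grad = lambda z: 2.0 * (z - 3.0)
    counts, means = [0] * len(lrs), [0.0] * len(lrs)
    x = x0
    for t in range(1, steps + 1):
        if 0 in counts:                    # pull every arm once first
            a = counts.index(0)
        else:                              # UCB index over candidate rates
            a = max(range(len(lrs)),
                    key=lambda i: means[i] + math.sqrt(2 * math.log(t) / counts[i]))
        before = f(x)
        x_new = x - lrs[a] * (grad(x) + rng.gauss(0.0, 0.1))  # noisy gradient step
        reward = before - f(x_new)         # reward feedback only, as in the framework
        counts[a] += 1
        means[a] += (reward - means[a]) / counts[a]
        x = x_new
    return x, counts
```

The selection depends only on the reward stream, not on the optimizer's internals, which is the spirit of the framework: the same wrapper could sit on top of any RL algorithm.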

[LG-69] Multimodal Gender Fairness in Depression Prediction: Insights on Data from the USA China

链接: https://arxiv.org/abs/2408.04026
作者: Joseph Cameron,Jiaee Cheong,Micol Spitale,Hatice Gunes
关键词-EN: Social agents, agents and robots, wellbeing settings, robots typically rely, individual mental wellbeing
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: 9 Pages, 7 Tables. To be published and indexed in the IEEE Xplore Digital Library under the ACII 2024 Workshop Proceedings

点击查看摘要

Abstract:Social agents and robots are increasingly being used in wellbeing settings. However, a key challenge is that these agents and robots typically rely on machine learning (ML) algorithms to detect and analyse an individual’s mental wellbeing. The problem of bias and fairness in ML algorithms is becoming an increasingly significant source of concern. Concurrently, existing literature has also indicated that mental health conditions can manifest differently across genders and cultures. We hypothesise that the representation of features (acoustic, textual, and visual) and their inter-modal relations would vary among subjects from different cultures and genders, thus impacting the performance and fairness of various ML models. We present the very first evaluation of multimodal gender fairness in depression manifestation by undertaking a study on two different datasets from the USA and China. We undertake thorough statistical and ML experimentation and repeat the experiments for several different algorithms to ensure that the results are not algorithm-dependent. Our findings indicate that though there are differences between both datasets, it is not conclusive whether this is due to the difference in depression manifestation as hypothesised or other external factors such as differences in data collection methodology. Our findings further motivate a call for a more consistent and culturally aware data collection process in order to address the problem of ML bias in depression detection and to promote the development of fairer agents and robots for wellbeing.

[LG-70] Learning from Noisy Labels for Long-tailed Data via Optimal Transport

链接: https://arxiv.org/abs/2408.03977
作者: Mengting Li,Chuang Zhu
关键词-EN: deep learning models, Noisy labels, deep learning, significantly impair, long-tailed
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Noisy labels, which are common in real-world datasets, can significantly impair the training of deep learning models. However, recent adversarial noise-combating methods overlook the long-tailed distribution of real data, which can significantly harm the effect of denoising strategies. Meanwhile, the mismanagement of noisy labels further compromises the model’s ability to handle long-tailed data. To tackle this issue, we propose a novel approach to manage data characterized by both long-tailed distributions and noisy labels. First, we introduce a loss-distance cross-selection module, which integrates class predictions and feature distributions to filter clean samples, effectively addressing uncertainties introduced by noisy labels and long-tailed distributions. Subsequently, we employ optimal transport strategies to generate pseudo-labels for the noise set in a semi-supervised training manner, enhancing pseudo-label quality while mitigating the effects of sample scarcity caused by the long-tailed distribution. We conduct experiments on both synthetic and real-world datasets, and the comprehensive experimental results demonstrate that our method surpasses current state-of-the-art methods. Our code will be available in the future.
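As rough intuition for the clean-sample filtering step, here is a much-simplified stand-in: the classic per-class small-loss criterion, which keeps a quota of low-loss samples in every class so that tail classes are not filtered away wholesale. The paper's actual module additionally cross-checks feature distances; the threshold and names below are ours:

```python
def select_clean(losses, labels, keep_frac=0.7):
    """Per-class small-loss selection: keep the keep_frac lowest-loss samples
    in each class as the presumed-clean set. Returns sorted sample indices."""
    by_class = {}
    for i, c in enumerate(labels):
        by_class.setdefault(c, []).append(i)
    clean = []
    for c, idxs in by_class.items():
        idxs.sort(key=lambda i: losses[i])            # low loss first
        clean.extend(idxs[:max(1, int(keep_frac * len(idxs)))])
    return sorted(clean)
```

Selecting per class rather than globally matters under a long-tailed distribution: a global loss threshold would tend to discard tail-class samples, whose losses are typically higher.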

[LG-71] Enhancing Output Diversity Improves Conjugate Gradient-based Adversarial Attacks ICPR

链接: https://arxiv.org/abs/2408.03972
作者: Keiichiro Yamamura,Issa Oe,Hiroki Ishikura,Katsuki Fujisawa
关键词-EN: Deep neural networks, Deep neural, generate adversarial, consecutive search points, Auto Conjugate Gradient
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: ICPRAI2024

点击查看摘要

Abstract:Deep neural networks are vulnerable to adversarial examples, and adversarial attacks that generate adversarial examples have been studied in this context. Existing studies imply that increasing the diversity of model outputs contributes to improving the attack performance. This study focuses on the Auto Conjugate Gradient (ACG) attack, which is inspired by the conjugate gradient method and has a high diversification performance. We hypothesized that increasing the distance between two consecutive search points would enhance the output diversity. To test our hypothesis, we propose Rescaling-ACG (ReACG), which automatically modifies the two components that significantly affect the distance between two consecutive search points, including the search direction and step size. ReACG showed higher attack performance than that of ACG, and is particularly effective for ImageNet models with several classification classes. Experimental results show that the distance between two consecutive search points enhances the output diversity and may help develop new potent attacks. The code is available at this https URL

[LG-72] Telecom Foundation Models: Applications, Challenges and Future Trends

链接: https://arxiv.org/abs/2408.03964
作者: Tahar Zanouda,Meysam Masoudi,Fitsum Gaim Gebre,Mischa Dohler
关键词-EN: increasingly complex, Telecom, Telecom networks, diversified deployment scenarios, telecom network ecosystem
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Telecom networks are becoming increasingly complex, with diversified deployment scenarios, multi-standards, and multi-vendor support. The intricate nature of the telecom network ecosystem presents challenges to effectively manage, operate, and optimize networks. To address these hurdles, Artificial Intelligence (AI) has been widely adopted to solve different tasks in telecom networks. However, these conventional AI models are often designed for specific tasks, rely on extensive and costly-to-collect labeled data that require specialized telecom expertise for development and maintenance. The AI models usually fail to generalize and support diverse deployment scenarios and applications. In contrast, Foundation Models (FMs) show effective generalization capabilities in various domains in language, vision, and decision-making tasks. FMs can be trained on multiple data modalities generated from the telecom ecosystem and leverage specialized domain knowledge. Moreover, FMs can be fine-tuned to solve numerous specialized tasks with minimal task-specific labeled data and, in some instances, are able to leverage context to solve previously unseen problems. At the dawn of 6G, this paper investigates the potential opportunities of using FMs to shape the future of telecom technologies and standards. In particular, the paper outlines a conceptual process for developing Telecom FMs (TFMs) and discusses emerging opportunities for orchestrating specialized TFMs for network configuration, operation, and maintenance. Finally, the paper discusses the limitations and challenges of developing and deploying TFMs.

[LG-73] Optimizing Emotion Recognition with Wearable Sensor Data: Unveiling Patterns in Body Movements and Heart Rate through Random Forest Hyperparameter Tuning

链接: https://arxiv.org/abs/2408.03958
作者: Zikri Kholifah Nur,Rifki Wijaya,Gia Septiana Wulandari
关键词-EN: discern individual emotions, individual emotions based, smartwatch sensor data, heart rate monitoring, Random Forest model
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: 12 pages. Accepted by Jurnal Media Informatika Budidarma (Open Access)

点击查看摘要

Abstract:This research delves into the utilization of smartwatch sensor data and heart rate monitoring to discern individual emotions based on body movement and heart rate. Emotions play a pivotal role in human life, influencing mental well-being, quality of life, and even physical and physiological responses. The data were sourced from prior research by Juan C. Quiroz, PhD. The study enlisted 50 participants who donned smartwatches and heart rate monitors while completing a 250-meter walk. Emotions were induced through both audio-visual and audio stimuli, with participants’ emotional states evaluated using the PANAS questionnaire. The study scrutinized three scenarios: viewing a movie before walking, listening to music before walking, and listening to music while walking. Personal baselines were established using DummyClassifier with the ‘most_frequent’ strategy from the sklearn library, and various models, including Linear Regression and Random Forest, were employed to gauge the impacts of these activities. Notably, a novel approach was undertaken by incorporating hyperparameter tuning to the Random Forest model using RandomizedSearchCV. The outcomes showcased substantial enhancements with hyperparameter tuning in the Random Forest model, yielding mean accuracies of 86.63% for happy vs. sad and 76.33% for happy vs. neutral vs. sad.
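The tuning step the study describes (random search over hyperparameters, each candidate scored by cross-validation, as scikit-learn's RandomizedSearchCV does) can be sketched without any libraries. The toy 1-D k-NN model and parameter grid below are illustrative stand-ins, not the study's Random Forest setup:

```python
import random

def knn_predict(train, labels, x, k):
    """Majority vote among the k nearest 1-D training points."""
    idx = sorted(range(len(train)), key=lambda i: abs(train[i] - x))[:k]
    votes = [labels[i] for i in idx]
    return max(set(votes), key=votes.count)

def cv_accuracy(xs, ys, k, folds=3):
    """Mean accuracy of k-NN over strided cross-validation folds."""
    n = len(xs)
    acc = 0.0
    for f in range(folds):
        test_idx = set(range(f, n, folds))
        tr_x = [xs[i] for i in range(n) if i not in test_idx]
        tr_y = [ys[i] for i in range(n) if i not in test_idx]
        hits = sum(knn_predict(tr_x, tr_y, xs[i], k) == ys[i] for i in test_idx)
        acc += hits / len(test_idx)
    return acc / folds

def random_search(xs, ys, param_space, n_iter=10, seed=0):
    """Randomized hyperparameter search: sample candidates, keep the best CV score."""
    rng = random.Random(seed)
    best_k, best_acc = None, -1.0
    for _ in range(n_iter):
        k = rng.choice(param_space)
        a = cv_accuracy(xs, ys, k)
        if a > best_acc:
            best_k, best_acc = k, a
    return best_k, best_acc
```

Random search trades the exhaustive grid of GridSearchCV for a fixed sampling budget, which is what makes tuning a Random Forest's many hyperparameters tractable in practice.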

[LG-74] GNN-Based Joint Channel and Power Allocation in Heterogeneous Wireless Networks

链接: https://arxiv.org/abs/2408.03957
作者: Lili Chen,Jingge Zhu,Jamie Evans
关键词-EN: maximal data rates, ensuring minimal interference, efficient energy utilisation, graph neural network, minimal interference
类目: Networking and Internet Architecture (cs.NI); Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:The optimal allocation of channels and power resources plays a crucial role in ensuring minimal interference, maximal data rates, and efficient energy utilisation. As a successful approach for tackling resource management problems in wireless networks, Graph Neural Networks (GNNs) have attracted a lot of attention. This article proposes a GNN-based algorithm to address the joint resource allocation problem in heterogeneous wireless networks. Concretely, we model the heterogeneous wireless network as a heterogeneous graph and then propose a graph neural network structure intending to allocate the available channels and transmit power to maximise the network throughput. Our proposed joint channel and power allocation graph neural network (JCPGNN) comprises a shared message computation layer and two task-specific layers, with a dedicated focus on channel and power allocation tasks, respectively. Comprehensive experiments demonstrate that the proposed algorithm achieves satisfactory performance but with higher computational efficiency compared to traditional optimisation algorithms.

[LG-75] Left-Right Swapping and Upper-Lower Limb Pairing for Robust Multi-Wearable Workout Activity Detection

链接: https://arxiv.org/abs/2408.03947
作者: Jonas Van Der Donckt,Jeroen Van Der Donckt,Sofie Van Hoecke
关键词-EN: Signal Sleuths team, HASCA WEAR challenge, HASCA WEAR, Signal Sleuths, Sleuths team
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: Accepted at HASCA workshop - WEAR challenge (UbiComp 2024)

点击查看摘要

Abstract:This work presents the solution of the Signal Sleuths team for the 2024 HASCA WEAR challenge. The challenge focuses on detecting 18 workout activities (and the null class) using accelerometer data from 4 wearables - one worn on each limb. Data analysis revealed inconsistencies in wearable orientation within and across participants, leading to exploring novel multi-wearable data augmentation techniques. We investigate three models using a fixed feature set: (i) “raw”: using all data as is, (ii) “left-right swapping”: augmenting data by swapping left and right limb pairs, and (iii) “upper-lower limb pairing”: stacking data by using upper-lower limb pair combinations (2 wearables). Our experiments utilize traditional machine learning with multi-window feature extraction and temporal smoothing. Using 3-fold cross-validation, the raw model achieves a macro F1-score of 90.01%, whereas left-right swapping and upper-lower limb pairing improve the scores to 91.30% and 91.87% respectively.
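The left-right swapping augmentation can be sketched as a simple channel swap that doubles the dataset. The dictionary keys below are hypothetical; the actual wearable channel layout is defined by the challenge data:

```python
def left_right_swap(sample):
    """Mirror one multi-wearable sample by swapping left/right limb channels.
    Key names are illustrative assumptions, not the challenge's schema."""
    swapped = dict(sample)  # shallow copy; label and other fields carried over
    swapped["left_arm"], swapped["right_arm"] = sample["right_arm"], sample["left_arm"]
    swapped["left_leg"], swapped["right_leg"] = sample["right_leg"], sample["left_leg"]
    return swapped

def augment(dataset):
    """Double the dataset: originals plus their mirrored counterparts."""
    return dataset + [left_right_swap(s) for s in dataset]
```

The activity label is unchanged by the swap, so the augmentation encodes the assumption that workouts look the same when performed mirror-imaged, which also hedges against inconsistent wearable orientation across participants.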

[LG-76] Taxonomy Driven Fast Adversarial Training AAAI

链接: https://arxiv.org/abs/2408.03944
作者: Kun Tong,Chengze Jiang,Jie Gui,Yuan Cao
关键词-EN: effective defense method, effective defense, Adversarial, Adversarial training, neural networks
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: This paper is accepted by AAAI

点击查看摘要

Abstract:Adversarial training (AT) is an effective defense method against gradient-based attacks to enhance the robustness of neural networks. Among them, single-step AT has emerged as a hotspot topic due to its simplicity and efficiency, requiring only one gradient propagation in generating adversarial examples. Nonetheless, the problem of catastrophic overfitting (CO) that causes training collapse remains poorly understood, and there exists a gap between the robust accuracy achieved through single- and multi-step AT. In this paper, we present a surprising finding that the taxonomy of adversarial examples reveals the truth of CO. Based on this conclusion, we propose taxonomy driven fast adversarial training (TDAT) which jointly optimizes learning objective, loss function, and initialization method, thereby can be regarded as a new paradigm of single-step AT. Compared with other fast AT methods, TDAT can boost the robustness of neural networks, alleviate the influence of misclassified examples, and prevent CO during the training process while requiring almost no additional computational and memory resources. Our method achieves robust accuracy improvements of 1.59%, 1.62%, 0.71%, and 1.26% on the CIFAR-10, CIFAR-100, Tiny ImageNet, and ImageNet-100 datasets when defending against the projected gradient descent PGD10 attack with perturbation budget 8/255. Furthermore, our proposed method also achieves state-of-the-art robust accuracy against other attacks. Code is available at this https URL.

[LG-77] Building Machines that Learn and Think with People

链接: https://arxiv.org/abs/2408.03943
作者: Katherine M. Collins,Ilia Sucholutsky,Umang Bhatt,Kartik Chandra,Lionel Wong,Mina Lee,Cedegao E. Zhang,Tan Zhi-Xuan,Mark Ho,Vikash Mansinghka,Adrian Weller,Joshua B. Tenenbaum,Thomas L. Griffiths
关键词-EN: thought, partners, machine intelligence, systems, thought partners
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:What do we want from machine intelligence? We envision machines that are not just tools for thought, but partners in thought: reasonable, insightful, knowledgeable, reliable, and trustworthy systems that think with us. Current artificial intelligence (AI) systems satisfy some of these criteria, some of the time. In this Perspective, we show how the science of collaborative cognition can be put to work to engineer systems that really can be called "thought partners," systems built to meet our expectations and complement our limitations. We lay out several modes of collaborative thought in which humans and AI thought partners can engage and propose desiderata for human-compatible thought partnerships. Drawing on motifs from computational cognitive science, we motivate an alternative scaling path for the design of thought partners and ecosystems around their use through a Bayesian lens, whereby the partners we construct actively build and reason over models of the human and world.

[LG-78] Risk and cross validation in ridge regression with correlated samples

链接: https://arxiv.org/abs/2408.04607
作者: Alexander Atanasov,Jacob A. Zavatone-Veth,Cengiz Pehlevan
关键词-EN: existing theories assume, high-dimensional ridge regression, ridge regression, Recent years, substantial advances
类目: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG)
*备注: 44 pages, 18 figures

点击查看摘要

Abstract:Recent years have seen substantial advances in our understanding of high-dimensional ridge regression, but existing theories assume that training examples are independent. By leveraging recent techniques from random matrix theory and free probability, we provide sharp asymptotics for the in- and out-of-sample risks of ridge regression when the data points have arbitrary correlations. We demonstrate that in this setting, the generalized cross validation estimator (GCV) fails to correctly predict the out-of-sample risk. However, in the case where the noise residuals have the same correlations as the data points, one can modify the GCV to yield an efficiently-computable unbiased estimator that concentrates in the high-dimensional limit, which we dub CorrGCV. We further extend our asymptotic analysis to the case where the test point has nontrivial correlations with the training set, a setting often encountered in time series forecasting. Assuming knowledge of the correlation structure of the time series, this again yields an extension of the GCV estimator, and sharply characterizes the degree to which such test points yield an overly optimistic prediction of long-time risk. We validate the predictions of our theory across a variety of high dimensional data.

[LG-79] Inference with the Upper Confidence Bound Algorithm

链接: https://arxiv.org/abs/2408.04595
作者: Koulik Khamaru,Cun-Hui Zhang
关键词-EN: Upper Confidence Bound, Confidence Bound, Upper Confidence, multiarmed bandit problems, downstream inferential tasks
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY); Statistics Theory (math.ST)
*备注: 17 pages, 1 figure

点击查看摘要

Abstract:In this paper, we discuss the asymptotic behavior of the Upper Confidence Bound (UCB) algorithm in the context of multiarmed bandit problems and discuss its implication in downstream inferential tasks. While inferential tasks become challenging when data is collected in a sequential manner, we argue that this problem can be alleviated when the sequential algorithm at hand satisfies certain stability property. This notion of stability is motivated from the seminal work of Lai and Wei (1982). Our first main result shows that such a stability property is always satisfied for the UCB algorithm, and as a result the sample means for each arm are asymptotically normal. Next, we examine the stability properties of the UCB algorithm when the number of arms K is allowed to grow with the number of arm pulls T . We show that in such a case the arms are stable when \frac{\log K}{\log T} \rightarrow 0 , and the number of near-optimal arms is large.
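For reference, the UCB algorithm the paper analyzes is short to state in code; a standard UCB1 sketch for Bernoulli arms (our toy simulation, not the paper's setup):

```python
import math, random

def ucb(means, horizon, seed=0):
    """UCB1 on Bernoulli arms with success probabilities `means`; returns pull counts."""
    rng = random.Random(seed)
    k = len(means)
    counts, est = [0] * k, [0.0] * k
    for t in range(1, horizon + 1):
        if t <= k:                         # initialization: pull each arm once
            a = t - 1
        else:                              # pull the arm with the largest UCB index
            a = max(range(k),
                    key=lambda i: est[i] + math.sqrt(2 * math.log(t) / counts[i]))
        reward = 1.0 if rng.random() < means[a] else 0.0
        counts[a] += 1
        est[a] += (reward - est[a]) / counts[a]   # running sample mean per arm
    return counts
```

The adaptive choice of which arm to pull is exactly what makes downstream inference on the per-arm sample means delicate; the paper's stability result is what licenses treating those means as asymptotically normal anyway.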

[LG-80] Quantum Machine Learning: Performance and Security Implications in Real-World Applications

链接: https://arxiv.org/abs/2408.04543
作者: Zhengping Jay Luo,Tyler Stewart,Mourya Narasareddygari,Rui Duan,Shangqing Zhao
关键词-EN: garnered significant attention, Quantum computing, garnered significant, significant attention, attention in recent
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Quantum computing has garnered significant attention in recent years from both academia and industry due to its potential to achieve a “quantum advantage” over classical computers. The advent of quantum computing introduces new challenges for security and privacy. This poster explores the performance and security implications of quantum computing through a case study of machine learning in a real-world application. We compare the performance of quantum machine learning (QML) algorithms to their classical counterparts using the Alzheimer’s disease dataset. Our results indicate that QML algorithms show promising potential while they still have not surpassed classical algorithms in terms of learning capability and convergence difficulty, and running quantum algorithms through simulations on classical computers requires substantial memory space and CPU time. Our study also indicates that QML algorithms have inherited vulnerabilities from classical machine learning algorithms while also introducing new attack vectors.

[LG-81] Hybrid Reinforcement Learning Breaks Sample Size Barriers in Linear MDPs

链接: https://arxiv.org/abs/2408.04526
作者: Kevin Tan,Wei Fan,Yuting Wei
关键词-EN: Hybrid Reinforcement Learning, Reinforcement Learning, significant recent interest, garnered significant recent, Hybrid Reinforcement
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Hybrid Reinforcement Learning (RL), where an agent learns from both an offline dataset and online explorations in an unknown environment, has garnered significant recent interest. A crucial question posed by Xie et al. (2022) is whether hybrid RL can improve upon the existing lower bounds established in purely offline and purely online RL without relying on the single-policy concentrability assumption. While Li et al. (2023) provided an affirmative answer to this question in the tabular PAC RL case, the question remains unsettled for both the regret-minimizing RL case and the non-tabular case. In this work, building upon recent advancements in offline RL and reward-agnostic exploration, we develop computationally efficient algorithms for both PAC and regret-minimizing RL with linear function approximation, without single-policy concentrability. We demonstrate that these algorithms achieve sharper error or regret bounds that are no worse than, and can improve on, the optimal sample complexity in offline RL (the first algorithm, for PAC RL) and online RL (the second algorithm, for regret-minimizing RL) in linear Markov decision processes (MDPs), regardless of the quality of the behavior policy. To our knowledge, this work establishes the tightest theoretical guarantees currently available for hybrid RL in linear MDPs. Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG) Cite as: arXiv:2408.04526 [stat.ML] (or arXiv:2408.04526v1 [stat.ML] for this version) https://doi.org/10.48550/arXiv.2408.04526

[LG-82] Finite sample learning of moving targets

链接: https://arxiv.org/abs/2408.04406
作者: Nikolaus Vertovec,Kostas Margellos,Maria Prandini
关键词-EN: seek to learn, target, moving target, Abstract, results extend randomized
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: 13 pages, 7 figures

点击查看摘要

Abstract:We consider a moving target that we seek to learn from samples. Our results extend randomized techniques developed in control and optimization for a constant target to the case where the target is changing. We derive a novel bound on the number of samples that are required to construct a probably approximately correct (PAC) estimate of the target. Furthermore, when the moving target is a convex polytope, we provide a constructive method of generating the PAC estimate using a mixed integer linear program (MILP). The proposed method is demonstrated on an application to autonomous emergency braking.

[LG-83] Robustness investigation of quality measures for the assessment of machine learning models

链接: https://arxiv.org/abs/2408.04391
作者: Thomas Most,Lars Gräning,Sebastian Wolff
关键词-EN: machine learning models, machine learning, quality measures, paper the accuracy, accuracy and robustness
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: under review, submitted to the journal Engineering Modelling, Analysis Simulation (EMAS)

点击查看摘要

Abstract:In this paper the accuracy and robustness of quality measures for the assessment of machine learning models are investigated. The prediction quality of a machine learning model is evaluated model-independent based on a cross-validation approach, where the approximation error is estimated for unknown data. The presented measures quantify the amount of explained variation in the model prediction. The reliability of these measures is assessed by means of several numerical examples, where an additional data set for the verification of the estimated prediction error is available. Furthermore, the confidence bounds of the presented quality measures are estimated and local quality measures are derived from the prediction residuals obtained by the cross-validation approach.
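The kind of measure the paper studies, the cross-validated fraction of explained variation (one minus PRESS over the total sum of squares), can be sketched model-independently; the simple linear fit below is an illustrative plug-in, not the paper's metamodels:

```python
def cv_explained_variation(xs, ys, fit, predict, folds=5):
    """Cross-validated fraction of explained variation: 1 - PRESS / SS_tot.
    Equals 1 for a perfect predictor; can be negative for models worse than the mean."""
    n = len(xs)
    mean_y = sum(ys) / n
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    press = 0.0                                  # predictive residual sum of squares
    for f in range(folds):
        test_idx = set(range(f, n, folds))       # strided folds
        tr = [(xs[i], ys[i]) for i in range(n) if i not in test_idx]
        model = fit(tr)                          # trained only on the other folds
        press += sum((ys[i] - predict(model, xs[i])) ** 2 for i in test_idx)
    return 1.0 - press / ss_tot

def fit_linear(pairs):
    """Ordinary least squares for y = a*x + b on (x, y) pairs."""
    n = len(pairs)
    sx = sum(x for x, _ in pairs); sy = sum(y for _, y in pairs)
    sxx = sum(x * x for x, _ in pairs); sxy = sum(x * y for x, y in pairs)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return a, (sy - a * sx) / n

def predict_linear(model, x):
    a, b = model
    return a * x + b
```

Because the prediction error is always estimated on data the model never saw, the measure penalizes overfitting, which is precisely why it is used for model-independent quality assessment.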

[LG-84] Deep Transfer Learning for Kidney Cancer Diagnosis

链接: https://arxiv.org/abs/2408.04318
作者: Yassine Habchi,Hamza Kheddar,Yassine Himeur,Abdelkrim Boukabou,Shadi Atalla,Wathiq Mansoor,Hussain Al-Ahmad
关键词-EN: including lifestyle choices, global societies stem, incurable diseases prevalent, social factors, including lifestyle
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 32 pages, 8 figures and 8 tables

点击查看摘要

Abstract:Many incurable diseases prevalent across global societies stem from various influences, including lifestyle choices, economic conditions, social factors, and genetics. Research predominantly focuses on these diseases due to their widespread nature, aiming to decrease mortality, enhance treatment options, and improve healthcare standards. Among these, kidney disease stands out as a particularly severe condition affecting men and women worldwide. Nonetheless, there is a pressing need for continued research into innovative, early diagnostic methods to develop more effective treatments for such diseases. Recently, automatic diagnosis of kidney cancer has become an important challenge, especially when using deep learning (DL), due to the reliance on medical training datasets, which in most cases are difficult and expensive to obtain. Furthermore, in most cases, algorithms require data from the same domain and a powerful computer with efficient storage capacity. To overcome this issue, a new type of learning known as transfer learning (TL) has been proposed that can produce impressive results based on other different pre-trained data. This paper presents, to the best of the authors’ knowledge, the first comprehensive survey of DL-based TL frameworks for kidney cancer diagnosis. This is a strong contribution to help researchers understand the current challenges and perspectives of this topic. Hence, the main limitations and advantages of each framework are identified and detailed critical analyses are provided. Looking ahead, the article identifies promising directions for future research. The discussion concludes by reflecting on the pivotal role of TL in the development of precision medicine and its effects on clinical practice and research in oncology.

[LG-85] Better Locally Private Sparse Estimation Given Multiple Samples Per User

链接: https://arxiv.org/abs/2408.04313
作者: Yuheng Ma,Ke Jia,Hanfang Yang
关键词-EN: Previous studies yielded, studies yielded discouraging, Previous studies, yielded discouraging results, sparsity assumption
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Previous studies yielded discouraging results for item-level locally differentially private linear regression under an s^*-sparsity assumption, where the minimax rate for nm samples is \mathcal{O}(s^* d / (nm\varepsilon^2)). This can be challenging for high-dimensional data, where the dimension d is extremely large. In this work, we investigate user-level locally differentially private sparse linear regression. We show that with n users each contributing m samples, the linear dependency on the dimension d can be eliminated, yielding an error upper bound of \mathcal{O}(s^{*2} / (nm\varepsilon^2)). We propose a framework that first selects candidate variables and then conducts estimation in the narrowed low-dimensional space, which is extendable to general sparse estimation problems with tight error bounds. Experiments on both synthetic and real datasets demonstrate the superiority of the proposed methods. Both the theoretical and empirical results suggest that, with the same number of samples, locally private sparse estimation is better conducted when multiple samples per user are available.
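As an illustrative sketch of the user-level setting described above (not the paper's algorithm, which additionally performs variable selection for the sparse regression case), the snippet below has each user release a Laplace-privatized average of their own m samples; the server then averages the noisy reports. The clipping range and noise calibration are assumptions made for the example.

```python
import math
import random

def user_level_ldp_mean(user_samples, epsilon, clip=1.0, seed=0):
    """User-level local DP mean estimation (sketch): each user averages
    their own m samples, clips the average to [-clip, clip] to bound
    sensitivity, and releases it with Laplace noise; the server averages
    the noisy reports."""
    rng = random.Random(seed)
    scale = 2.0 * clip / epsilon          # Laplace scale for a range of width 2*clip
    reports = []
    for samples in user_samples:
        avg = sum(samples) / len(samples)
        avg = max(-clip, min(clip, avg))
        u = rng.random() - 0.5            # inverse-CDF Laplace sampling
        noise = scale * math.copysign(math.log(1.0 - 2.0 * abs(u)), u)
        reports.append(avg + noise)
    return sum(reports) / len(reports)

# 2000 users, 20 samples each, true mean 0.3 (synthetic illustration)
data_rng = random.Random(1)
users = [[data_rng.gauss(0.3, 0.1) for _ in range(20)] for _ in range(2000)]
est = user_level_ldp_mean(users, epsilon=1.0)
```

Averaging within each user first shrinks the per-user variance, which is exactly what the multiple-samples-per-user regime exploits; every individual report is still heavily noised.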

[LG-86] Efficient and Accurate Pneumonia Detection Using a Novel Multi-Scale Transformer Approach

链接: https://arxiv.org/abs/2408.04290
作者: Alireza Saber,Pouria Parhami,Alimihammad Siahkarzadeh,Amirreza Fateh
关键词-EN: severe respiratory disease, significant diagnostic challenges, poses significant diagnostic, chest X-rays, respiratory disease
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Pneumonia, a severe respiratory disease, poses significant diagnostic challenges, especially in underdeveloped regions. Traditional diagnostic methods, such as chest X-rays, suffer from variability in interpretation among radiologists, necessitating reliable automated tools. In this study, we propose a novel approach combining deep learning and transformer-based attention mechanisms to enhance pneumonia detection from chest X-rays. Our method begins with lung segmentation using a TransUNet model that integrates our specialized transformer module, which has fewer parameters compared to common transformers while maintaining performance. This model is trained on the “Chest Xray Masks and Labels” dataset and then applied to the Kermany and Cohen datasets to isolate lung regions, enhancing subsequent classification tasks. For classification, we employ pre-trained ResNet models (ResNet-50 and ResNet-101) to extract multi-scale feature maps, processed through our modified transformer module. By employing our specialized transformer, we attain superior results with significantly fewer parameters compared to common transformer models. Our approach achieves high accuracy rates of 92.79% on the Kermany dataset and 95.11% on the Cohen dataset, ensuring robust and efficient performance suitable for resource-constrained environments. Code is available at this https URL.

[LG-87] Prompt-Assisted Semantic Interference Cancellation on Moderate Interference Channels

链接: https://arxiv.org/abs/2408.04283
作者: Zian Meng,Qiang Li,Ashish Pandharipande,Xiaohu Ge
关键词-EN: management strategies degrades, interference management, conventional interference management, interference management strategies, interference
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 5 pages, 5 figures

点击查看摘要

Abstract:The performance of conventional interference management strategies degrades when interference power is comparable to signal power. We consider a new perspective on interference management using semantic communication. Specifically, a multi-user semantic communication system is considered on moderate interference channels (ICs), for which a novel framework of deep learning-based prompt-assisted semantic interference cancellation (DeepPASIC) is proposed. Each transmitted signal is partitioned into common and private parts. The common parts of different users are transmitted simultaneously in a shared medium, resulting in superposition. The private part, on the other hand, serves as a prompt to assist in canceling the interference suffered by the common part at the semantic level. Simulation results demonstrate that the proposed DeepPASIC outperforms conventional interference management strategies under moderate interference conditions.

[LG-88] An Upper Confidence Bound Approach to Estimating the Maximum Mean

链接: https://arxiv.org/abs/2408.04179
作者: Zhang Kun,Liu Guangwu,Shi Wen
关键词-EN: Estimating the maximum, finds a variety, maximum mean finds, maximum, Estimating
类目: Statistics Theory (math.ST); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Estimating the maximum mean finds a variety of applications in practice. In this paper, we study estimation of the maximum mean using an upper confidence bound (UCB) approach where the sampling budget is adaptively allocated to one of the systems. We study in depth the existing grand average (GA) estimator, and propose a new largest-size average (LSA) estimator. Specifically, we establish statistical guarantees, including strong consistency, asymptotic mean squared errors, and central limit theorems (CLTs) for both estimators, which are new to the literature. We show that LSA is preferable over GA, as the bias of the former decays at a rate much faster than that of the latter when sample size increases. By using the CLTs, we further construct asymptotically valid confidence intervals for the maximum mean, and propose a single hypothesis test for a multiple comparison problem with application to clinical trials. Statistical efficiency of the resulting point and interval estimates and the proposed single hypothesis test is demonstrated via numerical examples.
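A minimal sketch of the UCB-driven budget allocation and the largest-size average (LSA) estimator discussed above; the exploration constant and warm-up scheme are illustrative assumptions, not the paper's exact specification.

```python
import math
import random

def ucb_max_mean(arms, budget, c=2.0, seed=0):
    """Estimate the maximum mean over several stochastic systems (arms).

    Sampling is allocated adaptively with an upper confidence bound rule;
    the final estimate is the largest-size average (LSA): the sample mean
    of the arm that received the most budget."""
    rng = random.Random(seed)
    k = len(arms)
    counts, sums = [0] * k, [0.0] * k
    for i in range(k):                       # one warm-up sample per arm
        sums[i] += arms[i](rng)
        counts[i] += 1
    for t in range(k, budget):
        ucb = [sums[i] / counts[i] + math.sqrt(c * math.log(t) / counts[i])
               for i in range(k)]
        j = max(range(k), key=ucb.__getitem__)
        sums[j] += arms[j](rng)
        counts[j] += 1
    j = max(range(k), key=counts.__getitem__)   # largest-size arm
    return sums[j] / counts[j]

# Two Gaussian systems with true means 0.0 and 1.0; the maximum mean is 1.0.
est = ucb_max_mean([lambda r: r.gauss(0.0, 1.0),
                    lambda r: r.gauss(1.0, 1.0)], budget=2000)
```

Because UCB concentrates the budget on the best arm, the largest-size arm's average is built from almost the whole budget, which is why its bias decays faster than the grand average's.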

[LG-89] Do Sharpness-based Optimizers Improve Generalization in Medical Image Analysis?

链接: https://arxiv.org/abs/2408.04065
作者: Mohamed Hassan,Aleksander Vakanski,Min Xian
关键词-EN: Effective clinical deployment, healthcare demands high, ensure accurate diagnosis, Effective clinical, deep learning models
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Effective clinical deployment of deep learning models in healthcare demands high generalization performance to ensure accurate diagnosis and treatment planning. In recent years, significant research has focused on improving the generalization of deep learning models by regularizing the sharpness of the loss landscape. Among the optimization approaches that explicitly minimize sharpness, Sharpness-Aware Minimization (SAM) has shown potential in enhancing generalization performance on general domain image datasets. This success has led to the development of several advanced sharpness-based algorithms aimed at addressing the limitations of SAM, such as Adaptive SAM, surrogate-Gap SAM, Weighted SAM, and Curvature Regularized SAM. These sharpness-based optimizers have shown improvements in model generalization compared to conventional stochastic gradient descent optimizers and their variants on general domain image datasets, but they have not been thoroughly evaluated on medical images. This work provides a review of recent sharpness-based methods for improving the generalization of deep learning networks and evaluates their performance on medical breast ultrasound images. Our findings indicate that the initial SAM method successfully enhances the generalization of various deep learning models. While Adaptive SAM improves generalization of convolutional neural networks, it fails to do so for vision transformers. Other sharpness-based optimizers, however, do not demonstrate consistent results. The results reveal that, contrary to findings in the non-medical domain, SAM is the only recommended sharpness-based optimizer that consistently improves generalization in medical image analysis, and further research is necessary to refine the variants of SAM to enhance generalization performance in this field.
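The core SAM update that the study above evaluates can be sketched in a few lines: ascend to an approximate worst-case point within a ρ-ball around the current weights, then apply the gradient taken there. This is a generic sketch of SAM on a toy quadratic, not any of the medical-imaging pipelines from the paper.

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One Sharpness-Aware Minimization (SAM) step, in sketch form:
    1) ascend to the approximate worst-case neighbor w + rho * g/||g||,
    2) evaluate the gradient there,
    3) apply that gradient at the original weights."""
    g = grad_fn(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)   # scaled ascent direction
    g_sharp = grad_fn(w + eps)                    # gradient at perturbed point
    return w - lr * g_sharp

# Toy problem: minimize f(w) = ||w||^2, whose gradient is 2w.
w = np.array([3.0, -2.0])
for _ in range(200):
    w = sam_step(w, lambda v: 2.0 * v)
```

On this toy quadratic SAM behaves like gradient descent with a look-ahead and settles into a small neighborhood of the flat minimum at the origin.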

[LG-90] Machine Learning-Based Reward-Driven Tuning of Scanning Probe Microscopy: Towards Fully Automated Microscopy

链接: https://arxiv.org/abs/2408.04055
作者: Yu Liu,Roger Proksch,Jason Bemis,Utkarsh Pratiush,Astita Dubey,Mahshid Ahmadi,Reece Emery,Philip D. Rack,Yu-Chen Liu,Jan-Chi Yang,Sergei V. Kalinin
关键词-EN: intermittent contact mode, scanning probe microscopy, tapping mode, intermittent contact, tapping mode imaging
类目: Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 20 pages, 6 figures

点击查看摘要

Abstract:Since the dawn of scanning probe microscopy (SPM), tapping or intermittent contact mode has been one of the most widely used imaging modes. Manual optimization of tapping mode not only takes a lot of instrument and operator time, but also often leads to frequent probe and sample damage, poor image quality and reproducibility issues for new types of samples or inexperienced users. Despite wide use, optimization of tapping mode imaging is an extremely hard problem, ill-suited to either classical control methods or machine learning. Here we introduce a reward-driven workflow to automate the optimization of SPM in the tapping mode. The reward function is defined based on multiple channels with physical and empirical knowledge of good scans encoded, representing a sample-agnostic measure of image quality and imitating the decision-making logic employed by human operators. This automated workflow gives optimal scanning parameters for different probes and samples and gives high-quality SPM images consistently in the attractive mode. This study broadens the application and accessibility of SPM and opens the door for fully automated SPM.

[LG-91] Scaling Law of Sim2Real Transfer Learning in Expanding Computational Materials Databases for Real-World Predictions

链接: https://arxiv.org/abs/2408.04042
作者: Shunya Minami,Yoshihiro Hayashi,Stephen Wu,Kenji Fukumizu,Hiroki Sugisawa,Masashi Ishii,Isao Kuwajima,Kazuya Shiratori,Ryo Yoshida
关键词-EN: molecular dynamics simulations, extensive physical property, limited experimental materials, high-throughput computational experiments, dynamics simulations
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注: 22 pages, 6 figures

点击查看摘要

Abstract:To address the challenge of limited experimental materials data, extensive physical property databases are being developed based on high-throughput computational experiments, such as molecular dynamics simulations. Previous studies have shown that fine-tuning a predictor pretrained on a computational database to a real system can result in models with outstanding generalization capabilities compared to learning from scratch. This study demonstrates the scaling law of simulation-to-real (Sim2Real) transfer learning for several machine learning tasks in materials science. Case studies of three prediction tasks for polymers and inorganic materials reveal that the prediction error on real systems decreases according to a power-law as the size of the computational data increases. Observing the scaling behavior offers various insights for database development, such as determining the sample size necessary to achieve a desired performance, identifying equivalent sample sizes for physical and computational experiments, and guiding the design of data production protocols for downstream real-world tasks.
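The power-law scaling described above can be checked with an ordinary log-log fit: if error ≈ a·N^(−b), then log(error) is linear in log(N). The data-set sizes and coefficients below are invented for illustration, not the paper's measurements.

```python
import math

# Synthetic Sim2Real scaling data: prediction error on the real system
# versus computational-database size N, following error = a * N^(-b).
sizes = [100, 300, 1000, 3000, 10000]
errors = [0.50 * n ** -0.35 for n in sizes]

# Least-squares fit in log-log space: log(err) = log(a) - b * log(N)
xs = [math.log(n) for n in sizes]
ys = [math.log(e) for e in errors]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
b = -sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
a = math.exp(my + b * mx)
```

With the exponent b in hand, one can extrapolate how much more computational data is needed to reach a target error, which is the kind of database-development insight the paper draws from the scaling behavior.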

信息检索

[IR-0] Pairing Clustered Inverted Indexes with kNN Graphs for Fast Approximate Retrieval over Learned Sparse Representations

链接: https://arxiv.org/abs/2408.04443
作者: Sebastian Bruch,Franco Maria Nardini,Cosimo Rulli,Rossano Venturini
关键词-EN: Learned sparse representations, sparse representations form, Learned sparse, sparse representations, representations form
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Learned sparse representations form an effective and interpretable class of embeddings for text retrieval. While exact top-k retrieval over such embeddings faces efficiency challenges, a recent algorithm called Seismic has enabled remarkably fast, highly-accurate approximate retrieval. Seismic statically prunes inverted lists, organizes each list into geometrically-cohesive blocks, and augments each block with a summary vector. At query time, each inverted list associated with a query term is traversed one block at a time in an arbitrary order, with the inner product between the query and summaries determining if a block must be evaluated. When a block is deemed promising, its documents are fully evaluated with a forward index. Seismic is one to two orders of magnitude faster than state-of-the-art inverted index-based solutions and significantly outperforms the winning graph-based submissions to the BigANN 2023 Challenge. In this work, we speed up Seismic further by introducing two innovations to its query processing subroutine. First, we traverse blocks in order of importance, rather than arbitrarily. Second, we take the list of documents retrieved by Seismic and expand it to include the neighbors of each document using an offline k-regular nearest neighbor graph; the expanded list is then ranked to produce the final top-k set. Experiments on two public datasets show that our extension, named SeismicWave, can reach almost-exact accuracy levels and is up to 2.2x faster than Seismic.
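A toy sketch of the two innovations above: blocks are visited in decreasing order of the query-summary inner product rather than arbitrarily, and the retrieved candidates are expanded through an offline kNN graph before the final ranking. The data structures are simplified stand-ins for the actual Seismic inverted index.

```python
import numpy as np

def retrieve(query, blocks, knn, k=2, probe=2):
    """SeismicWave-style retrieval, heavily simplified.

    `blocks`: list of (summary_vector, {doc_id: doc_vector}) pairs.
    1) visit blocks in decreasing order of <query, summary>,
    2) fully score the documents in the top `probe` blocks,
    3) expand candidates with their kNN-graph neighbors,
    4) rank the expanded set and return the top-k doc ids."""
    order = sorted(range(len(blocks)), key=lambda i: -np.dot(query, blocks[i][0]))
    docs = {d: v for _, block in blocks for d, v in block.items()}
    scores = {}
    for i in order[:probe]:
        for d, v in blocks[i][1].items():
            scores[d] = float(np.dot(query, v))
    for d in list(scores):                     # kNN-graph expansion
        for nb in knn.get(d, []):
            if nb not in scores:
                scores[nb] = float(np.dot(query, docs[nb]))
    return sorted(scores, key=scores.get, reverse=True)[:k]

q = np.array([1.0, 0.0])
blocks = [
    (np.array([0.9, 0.1]), {"a": np.array([1.0, 0.0]), "b": np.array([0.8, 0.2])}),
    (np.array([0.1, 0.9]), {"c": np.array([0.0, 1.0]), "d": np.array([0.95, 0.0])}),
]
top = retrieve(q, blocks, knn={"a": ["d"]}, k=2, probe=1)
```

Note how document "d" is recovered by the graph expansion even though its block was never probed; that is precisely the accuracy the expansion step buys back after aggressive pruning.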

[IR-1] Enhanced Semantic Graph Based Approach With Sentiment Analysis For User Interest Retrieval From Social Sites

链接: https://arxiv.org/abs/2408.04395
作者: Usama Ahmed Jamal
关键词-EN: ideas and thoughts, user, users, Blogs, networking sites serve
类目: Software Engineering (cs.SE); Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
*备注: This research was conducted as part of Master Thesis in Computer Science by the first author at HITEC University Taxila

点击查看摘要

Abstract:Blogs and social networking sites serve as a platform for users to express their interests, ideas and thoughts. Targeted marketing uses recommendation systems to suggest services and products to users or clients, and therefore relies on extracting keywords and main topics from user-generated texts. Most conventional methods identify personal interests solely on the basis of surveys and rating systems. The proposed research differs in that it aims at using user-generated text as a source medium for identifying and analyzing personal interests as a knowledge-base area of users. The proposed semantic graph based approach identifies the preferences of clients and users by analyzing their own texts, such as tweets. Keywords are extracted from the text users generate on social networking sites; this is made possible by algorithms that automatically extract keywords from the available user content and rank them by frequency and degree. Furthermore, the semantic graph based model assists in providing useful suggestions by extracting the interests of users from their social media content. In this approach the graph comprises nodes and edges, where nodes represent the keywords extracted by the algorithm and edges show the semantic connections between them. The method does not require internet-related user activities such as surveys or ratings to gather user-interest information.
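The frequency-and-degree ranking over a keyword co-occurrence graph can be sketched as follows; the scoring formula (term frequency plus node degree) is a simple stand-in assumption, not the paper's exact weighting, and no stop-word filtering is shown.

```python
from collections import Counter
from itertools import combinations

def interest_graph(posts, top_n=5):
    """Rank a user's keywords by frequency + graph degree (sketch).

    Nodes are words; edges connect words that co-occur in the same post;
    score(w) = number of posts containing w + number of graph neighbors."""
    freq = Counter()
    edges = set()
    for post in posts:
        words = sorted(set(post.lower().split()))
        freq.update(words)                       # presence count per post
        edges.update(combinations(words, 2))     # co-occurrence edges
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    score = {w: freq[w] + degree[w] for w in freq}
    return sorted(score, key=lambda w: (-score[w], w))[:top_n]

posts = ["deep learning vision", "deep learning text", "vision transformers"]
top = interest_graph(posts, top_n=3)
```

The top-ranked words are the user's recurring, well-connected topics; in the paper's setting these would feed the recommendation step.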

[IR-2] MM-Forecast: A Multimodal Approach to Temporal Event Forecasting with Large Language Models

链接: https://arxiv.org/abs/2408.04388
作者: Haoxuan Li,Zhengmao Yang,Yunshan Ma,Yi Bin,Yang Yang,Tat-Seng Chua
关键词-EN: large language models, language models, temporal event forecasting, temporal event, large language
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:We study an emerging and intriguing problem of multimodal temporal event forecasting with large language models. Compared to using text or graph modalities, the investigation of utilizing images for temporal event forecasting has not been fully explored, especially in the era of large language models (LLMs). To bridge this gap, we are particularly interested in two key questions of: 1) why images will help in temporal event forecasting, and 2) how to integrate images into the LLM-based forecasting framework. To answer these research questions, we propose to identify two essential functions that images play in the scenario of temporal event forecasting, i.e., highlighting and complementary. Then, we develop a novel framework, named MM-Forecast. It employs an Image Function Identification module to recognize these functions as verbal descriptions using multimodal large language models (MLLMs), and subsequently incorporates these function descriptions into LLM-based forecasting models. To evaluate our approach, we construct a new multimodal dataset, MidEast-TE-mm, by extending an existing event dataset MidEast-TE-mini with images. Empirical studies demonstrate that our MM-Forecast can correctly identify the image functions, and further more, incorporating these verbal function descriptions significantly improves the forecasting performance. The dataset, code, and prompts are available at this https URL.

[IR-3] Judgment2vec: Apply Graph Analytics to Searching and Recommendation of Similar Judgments

链接: https://arxiv.org/abs/2408.04382
作者: Hsuan-Lei Shao
关键词-EN: previous courts efficiently, legal professionals rely, court practice, courts efficiently, previous courts
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注: 5 pages, 7 figures, 2 tables

点击查看摘要

Abstract:In court practice, legal professionals rely on their training to provide opinions that resolve cases, one of the most crucial aspects being the ability to identify similar judgments from previous courts efficiently. However, finding a similar case is challenging and often depends on experience, legal domain knowledge, and extensive labor hours, making veteran lawyers or judges indispensable. This research aims to automate the analysis of judgment text similarity. We utilized a judgment dataset labeled as the “golden standard” by experts, which includes human-verified features that can be converted into an “expert similarity score.” We then constructed a knowledge graph based on “case-article” relationships, ranking each case using natural language processing to derive a “Node2vec similarity score.” By evaluating these two similarity scores, we identified their discrepancies and relationships. The results can significantly reduce the labor hours required for legal searches and recommendations, with potential applications extending to various fields of information retrieval.

[IR-4] Understanding and Modeling Job Marketplace with Pretrained Language Models CIKM’24

链接: https://arxiv.org/abs/2408.04381
作者: Yaochen Zhu,Liang Wu,Binchi Zhang,Song Wang,Qi Guo,Liangjie Hong,Luke Simon,Jundong Li
关键词-EN: Job marketplace, job textual features, heterogeneous graph composed, Job, composed of interactions
类目: Information Retrieval (cs.IR)
*备注: accepted by CIKM’24 applied research track

点击查看摘要

Abstract:Job marketplace is a heterogeneous graph composed of interactions among members (job-seekers), companies, and jobs. Understanding and modeling job marketplace can benefit both job seekers and employers, ultimately contributing to the greater good of the society. However, existing graph neural network (GNN)-based methods have shallow understandings of the associated textual features and heterogeneous relations. To address the above challenges, we propose PLM4Job, a job marketplace foundation model that tightly couples pretrained language models (PLM) with job market graph, aiming to fully utilize the pretrained knowledge and reasoning ability to model member/job textual features as well as various member-job relations simultaneously. In the pretraining phase, we propose a heterogeneous ego-graph-based prompting strategy to model and aggregate member/job textual features based on the topological structure around the target member/job node, where entity type embeddings and graph positional embeddings are introduced accordingly to model different entities and their heterogeneous relations. Meanwhile, a proximity-aware attention alignment strategy is designed to dynamically adjust the attention of the PLM on ego-graph node tokens in the prompt, such that the attention can be better aligned with job marketplace semantics. Extensive experiments at LinkedIn demonstrate the effectiveness of PLM4Job.

[IR-5] Mitigating Exposure Bias in Online Learning to Rank Recommendation: A Novel Reward Model for Cascading Bandits

链接: https://arxiv.org/abs/2408.04332
作者: Masoud Mansoury,Bamshad Mobasher,Herke van Hoof
关键词-EN: Exposure bias, recommendation, Linear Cascading Bandits, bias, Exposure
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Exposure bias is a well-known issue in recommender systems where items and suppliers are not equally represented in the recommendation results. This bias becomes particularly problematic over time as a few items are repeatedly over-represented in recommendation lists, leading to a feedback loop that further amplifies this bias. Although extensive research has addressed this issue in model-based or neighborhood-based recommendation algorithms, less attention has been paid to online recommendation models, such as those based on top-K contextual bandits, where recommendation models are dynamically updated with ongoing user feedback. In this paper, we study exposure bias in a class of well-known contextual bandit algorithms known as Linear Cascading Bandits. We analyze these algorithms in their ability to handle exposure bias and provide a fair representation of items in the recommendation results. Our analysis reveals that these algorithms fail to mitigate exposure bias in the long run during the course of ongoing user interactions. We propose an Exposure-Aware reward model that updates the model parameters based on two factors: 1) implicit user feedback and 2) the position of the item in the recommendation list. The proposed model mitigates exposure bias by controlling the utility assigned to the items based on their exposure in the recommendation list. Our experiments with two real-world datasets show that our proposed reward model improves the exposure fairness of the linear cascading bandits over time while maintaining the recommendation accuracy. It also outperforms the current baselines. Finally, we prove a high probability upper regret bound for our proposed model, providing theoretical guarantees for its performance.
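The two factors of the proposed reward model, implicit feedback and list position, combined with an exposure discount, can be illustrated with a hypothetical functional form. The exact equation is in the paper; the shape below is an assumption for demonstration only.

```python
def exposure_aware_reward(clicked, position, exposure_count, beta=0.1):
    """Hypothetical exposure-aware reward (sketch, not the paper's equation):
    implicit click feedback, discounted by list position (rank 0 = top)
    and by how often the item has already been exposed."""
    position_discount = 1.0 / (1 + position)
    exposure_penalty = 1.0 / (1 + beta * exposure_count)
    return (1.0 if clicked else 0.0) * position_discount * exposure_penalty
```

A click on a fresh top-ranked item earns the full reward, while the same click on an item already shown fifty times earns far less, which is the mechanism that keeps any single item from dominating the feedback loop.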

[IR-6] Scalable Transformer for High Dimensional Multivariate Time Series Forecasting

链接: https://arxiv.org/abs/2408.04245
作者: Xin Zhou,Weiqing Wang,Wray Buntine,Shilin Qu,Abishek Sriramulu,Weicong Tan,Christoph Bergmeir
关键词-EN: demonstrated significant success, recently demonstrated significant, Multivariate Time Series, Deep models, Channel-dependent models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Deep models for Multivariate Time Series (MTS) forecasting have recently demonstrated significant success. Channel-dependent models capture complex dependencies that channel-independent models cannot capture. However, the number of channels in real-world applications outpaces the capabilities of existing channel-dependent models, and contrary to common expectations, some models underperform the channel-independent models in handling high-dimensional data, which raises questions about the performance of channel-dependent models. To address this, our study first investigates the reasons behind the suboptimal performance of these channel-dependent models on high-dimensional MTS data. Our analysis reveals that two primary issues lie in the introduced noise from unrelated series that increases the difficulty of capturing the crucial inter-channel dependencies, and challenges in training strategies due to high-dimensional data. To address these issues, we propose STHD, the Scalable Transformer for High-Dimensional Multivariate Time Series Forecasting. STHD has three components: a) Relation Matrix Sparsity that limits the noise introduced and alleviates the memory issue; b) ReIndex applied as a training strategy to enable a more flexible batch size setting and increase the diversity of training data; and c) Transformer that handles 2-D inputs and captures channel dependencies. These components jointly enable STHD to manage the high-dimensional MTS while maintaining computational feasibility. Furthermore, experimental results show STHD’s considerable improvement on three high-dimensional datasets: Crime-Chicago, Wiki-People, and Traffic. The source code and dataset are publicly available at this https URL.

[IR-7] Enhanced Traffic Flow Prediction with Multi-Segment Fusion Tensor Graph Convolutional Networks

链接: https://arxiv.org/abs/2408.04232
作者: Wei Zhang,Peng Tang
关键词-EN: traffic Flow Prediction, Accurate traffic Flow, intelligent transportation systems, holds significant importance, Flow Prediction
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Accurate traffic flow prediction can assist in traffic management, route planning, and congestion mitigation, which holds significant importance in enhancing the efficiency and reliability of intelligent transportation systems (ITS). However, existing traffic flow prediction models suffer from limitations in capturing the complex spatial-temporal dependencies within traffic networks. In order to address this issue, this study proposes a multi-segment fusion tensor graph convolutional network (MS-FTGCN) for traffic flow prediction with the following three-fold ideas: a) building a unified spatial-temporal graph convolutional framework based on the Tensor M-product, which captures spatial-temporal patterns simultaneously; b) incorporating hourly, daily, and weekly components to model the multiple temporal properties of traffic flows; c) fusing the outputs of the three components by an attention mechanism to obtain the final traffic flow prediction results. The results of experiments conducted on two traffic flow datasets demonstrate that the proposed MS-FTGCN outperforms the state-of-the-art models.
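Step (c), fusing the hourly, daily, and weekly component outputs by attention, reduces to a softmax-weighted sum. In the real model the attention scores would be learned; here they are fixed inputs purely for illustration.

```python
import numpy as np

def attention_fuse(outputs, scores):
    """Fuse component predictions with softmax attention weights (sketch).
    `outputs`: list of arrays (e.g. hourly/daily/weekly predictions);
    `scores`: attention logits (learned in MS-FTGCN, fixed here)."""
    w = np.exp(scores - np.max(scores))   # numerically stable softmax
    w = w / w.sum()
    return sum(wi * out for wi, out in zip(w, outputs))

hourly, daily, weekly = np.array([10.0]), np.array([12.0]), np.array([14.0])
fused = attention_fuse([hourly, daily, weekly], scores=np.array([1.0, 0.0, 0.0]))
```

Because the first score is largest, the fused prediction is pulled toward the hourly component while still blending in the daily and weekly signals.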

[IR-8] MMREC: LLM Based Multi-Modal Recommender System

链接: https://arxiv.org/abs/2408.04211
作者: Jiahao Tian,Jinman Zhao,Zhenkai Wang,Zhicheng Ding
关键词-EN: growing rapidly due, content generated daily, recommender systems, generated daily, effective recommender systems
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:The importance of recommender systems is growing rapidly due to the exponential increase in the volume of content generated daily. This surge in content presents unique challenges for designing effective recommender systems. Key among these challenges is the need to effectively leverage the vast amounts of natural language data and images that represent user preferences. This paper presents a novel approach to enhancing recommender systems by leveraging Large Language Models (LLMs) and deep learning techniques. The proposed framework aims to improve the accuracy and relevance of recommendations by incorporating multi-modal information processing and by the use of unified latent space representation. The study explores the potential of LLMs to better understand and utilize natural language data in recommendation contexts, addressing the limitations of previous methods. The framework efficiently extracts and integrates text and image information through LLMs, unifying diverse modalities in a latent space to simplify the learning process for the ranking model. Experimental results demonstrate the enhanced discriminative power of the model when utilizing multi-modal information. This research contributes to the evolving field of recommender systems by showcasing the potential of LLMs and multi-modal data integration to create more personalized and contextually relevant recommendations.

[IR-9] Pairwise Judgment Formulation for Semantic Embedding Model in Web Search

链接: https://arxiv.org/abs/2408.04197
作者: Mengze Hong,Chen Jason Zhang
关键词-EN: network-based Siamese architecture, neural network-based Siamese, Semantic Embedding Model, natural language processing, Siamese architecture
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Databases (cs.DB)
*备注:

点击查看摘要

Abstract:Semantic Embedding Model (SEM), a neural network-based Siamese architecture, is gaining momentum in information retrieval and natural language processing. In order to train SEM in a supervised fashion for Web search, the search engine query log is typically utilized to automatically formulate pairwise judgments as training data. Despite the growing application of semantic embeddings in the search engine industry, little work has been done on formulating effective pairwise judgments for training SEM. In this paper, we make the first in-depth investigation of a wide range of strategies for generating pairwise judgments for SEM. An interesting (perhaps surprising) discovery reveals that the conventional pairwise judgment formulation strategy widely used in the field of pairwise Learning-to-Rank (LTR) is not necessarily effective for training SEM. Through a large-scale empirical study based on query logs and click-through activities from a major commercial search engine, we demonstrate the effective strategies for SEM and highlight the advantages of a hybrid heuristic (i.e., Clicked > Non-Clicked) in comparison to the atomic heuristics (e.g., Clicked > Skipped) in LTR. We conclude with best practices for training SEM and offer promising insights for future research.
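The Clicked-versus-Non-Clicked heuristic discussed above can be sketched as a pair generator over logged query sessions. The session format (query, result list, clicked set) is an assumption made for the illustration.

```python
def pairwise_judgments(sessions):
    """Formulate pairwise training judgments from a click log (sketch):
    within each session, every clicked document is preferred over every
    non-clicked document, yielding (query, preferred, other) triples."""
    pairs = []
    for query, results, clicked in sessions:
        for pos in results:
            if pos in clicked:
                for neg in results:
                    if neg not in clicked:
                        pairs.append((query, pos, neg))
    return pairs

log = [("cheap flights", ["d1", "d2", "d3"], {"d2"})]
pairs = pairwise_judgments(log)
```

Variants of the heuristic differ only in which documents count as negatives (e.g. all non-clicked versus only skipped ones above the click), which is exactly the design axis the paper studies.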

[IR-10] wav2graph: A Framework for Supervised Learning Knowledge Graph from Speech

链接: https://arxiv.org/abs/2408.04174
作者: Khai Le-Duc,Quy-Anh Dang,Tan-Hanh Pham,Truong-Son Hy
关键词-EN: large language models, enhance the performance, providing structured, reasoning and context-awareness, performance of large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注: Preprint, 32 pages

点击查看摘要

Abstract:Knowledge graphs (KGs) enhance the performance of large language models (LLMs) and search engines by providing structured, interconnected data that improves reasoning and context-awareness. However, KGs only focus on text data, thereby neglecting other modalities such as speech. In this work, we introduce wav2graph, the first framework for supervised learning knowledge graph from speech data. Our pipeline is straightforward: (1) constructing a KG based on transcribed spoken utterances and a named entity database, (2) converting the KG into embedding vectors, and (3) training graph neural networks (GNNs) for node classification and link prediction tasks. Through extensive experiments conducted in inductive and transductive learning contexts using state-of-the-art GNN models, we provide baseline results and error analysis for node classification and link prediction tasks on human transcripts and automatic speech recognition (ASR) transcripts, including evaluations using both encoder-based and decoder-based node embeddings, as well as monolingual and multilingual acoustic pre-trained models. All related code, data, and models are published online.
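Step (1) of the pipeline, constructing a KG from transcribed utterances and a named-entity database, can be sketched with simple substring matching; the real system's entity handling is more involved, and the example data is invented.

```python
def build_kg(transcripts, entity_db):
    """Build a toy knowledge graph from speech transcripts (sketch):
    nodes are utterance ids and named entities; an edge links an
    utterance to every database entity mentioned in it."""
    nodes, edges = set(), set()
    for utt_id, text in transcripts:
        nodes.add(utt_id)
        lowered = text.lower()
        for entity in entity_db:
            if entity.lower() in lowered:
                nodes.add(entity)
                edges.add((utt_id, entity))
    return nodes, edges

transcripts = [("u1", "Alice flew to Paris"), ("u2", "Bob met Alice")]
nodes, edges = build_kg(transcripts, entity_db=["Alice", "Bob", "Paris"])
```

The resulting node and edge sets are what steps (2) and (3) would consume: nodes get embedded, and a GNN is trained on the utterance-entity links.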

附件下载

点击下载今日全部论文列表